How to do it...
We will be installing a number of packages with pip. These packages are installed into a Python environment. There often can be version conflicts with other packages, so a good practice for following along with the recipes in the book will be to create a new virtual Python environment where the packages we will use will be ensured to work properly.
Virtual Python environments are managed with the virtualenv tool. This can be installed with the following command:
~ $ pip install virtualenv
Collecting virtualenv
Using cached virtualenv-15.1.0-py2.py3-none-any.whl
Installing collected packages: virtualenv
Successfully installed virtualenv-15.1.0
Now we can use virtualenv. But before that let's briefly look at pip. This command installs Python packages from PyPI, a package repository with literally 10's of thousands of packages. We just saw using the install subcommand to pip, which ensures a package is installed. We can also see all currently installed packages with pip list:
~ $ pip list
alabaster (0.7.9)
amqp (1.4.9)
anaconda-client (1.6.0)
anaconda-navigator (1.5.3)
anaconda-project (0.4.1)
aniso8601 (1.3.0)
I've truncated to the first few lines as there are quite a few. For me there are 222 packages installed.
Packages can also be uninstalled using pip uninstall followed by the package name. I'll leave it to you to give it a try.
Now back to virtualenv. Using virtualenv is very simple. Let's use it to create an environment and install the code from github. Let's walk through the steps:
- Create a directory to represent the project and enter the directory.
~ $ mkdir pywscb
~ $ cd pywscb
- Initialize a virtual environment folder named env:
pywscb $ virtualenv env
Using base prefix '/Users/michaelheydt/anaconda'
New python executable in /Users/michaelheydt/pywscb/env/bin/python
copying /Users/michaelheydt/anaconda/bin/python => /Users/michaelheydt/pywscb/env/bin/python
copying /Users/michaelheydt/anaconda/bin/../lib/libpython3.6m.dylib => /Users/michaelheydt/pywscb/env/lib/libpython3.6m.dylib
Installing setuptools, pip, wheel...done.
- This creates an env folder. Let's take a look at what was installed.
pywscb $ ls -la env
total 8
drwxr-xr-x 6 michaelheydt staff 204 Jan 18 15:38 .
drwxr-xr-x 3 michaelheydt staff 102 Jan 18 15:35 ..
drwxr-xr-x 16 michaelheydt staff 544 Jan 18 15:38 bin
drwxr-xr-x 3 michaelheydt staff 102 Jan 18 15:35 include
drwxr-xr-x 4 michaelheydt staff 136 Jan 18 15:38 lib
-rw-r--r-- 1 michaelheydt staff 60 Jan 18 15:38 pip-selfcheck.json
- New we activate the virtual environment. This command uses the content in the env folder to configure Python. After this all python activities are relative to this virtual environment.
pywscb $ source env/bin/activate
(env) pywscb $
- We can check that python is indeed using this virtual environment with the following command:
(env) pywscb $ which python
/Users/michaelheydt/pywscb/env/bin/python
With our virtual environment created, let's clone the books sample code and take a look at its structure.
(env) pywscb $ git clone https://github.com/PacktBooks/PythonWebScrapingCookbook.git
Cloning into 'PythonWebScrapingCookbook'...
remote: Counting objects: 420, done.
remote: Compressing objects: 100% (316/316), done.
remote: Total 420 (delta 164), reused 344 (delta 88), pack-reused 0
Receiving objects: 100% (420/420), 1.15 MiB | 250.00 KiB/s, done.
Resolving deltas: 100% (164/164), done.
Checking connectivity... done.
This created a PythonWebScrapingCookbook directory.
(env) pywscb $ ls -l
total 0
drwxr-xr-x 9 michaelheydt staff 306 Jan 18 16:21 PythonWebScrapingCookbook
drwxr-xr-x 6 michaelheydt staff 204 Jan 18 15:38 env
Let's change into it and examine the content.
(env) PythonWebScrapingCookbook $ ls -l
total 0
drwxr-xr-x 15 michaelheydt staff 510 Jan 18 16:21 py
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 www
There are two directories. Most the the Python code is is the py directory. www contains some web content that we will use from time-to-time using a local web server. Let's look at the contents of the py directory:
(env) py $ ls -l
total 0
drwxr-xr-x 9 michaelheydt staff 306 Jan 18 16:21 01
drwxr-xr-x 25 michaelheydt staff 850 Jan 18 16:21 03
drwxr-xr-x 21 michaelheydt staff 714 Jan 18 16:21 04
drwxr-xr-x 10 michaelheydt staff 340 Jan 18 16:21 05
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 06
drwxr-xr-x 25 michaelheydt staff 850 Jan 18 16:21 07
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 08
drwxr-xr-x 7 michaelheydt staff 238 Jan 18 16:21 09
drwxr-xr-x 7 michaelheydt staff 238 Jan 18 16:21 10
drwxr-xr-x 9 michaelheydt staff 306 Jan 18 16:21 11
drwxr-xr-x 8 michaelheydt staff 272 Jan 18 16:21 modules
Code for each chapter is in the numbered folder matching the chapter (there is no code for chapter 2 as it is all interactive Python).
Note that there is a modules folder. Some of the recipes throughout the book use code in those modules. Make sure that your Python path points to this folder. On Mac and Linux you can sets this in your .bash_profile file (and environments variables dialog on Windows):
export PYTHONPATH="/users/michaelheydt/dropbox/packt/books/pywebscrcookbook/code/py/modules"
export PYTHONPATH
The contents in each folder generally follows a numbering scheme matching the sequence of the recipe in the chapter. The following is the contents of the chapter 6 folder:
(env) py $ ls -la 06
total 96
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 .
drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:26 ..
-rw-r--r-- 1 michaelheydt staff 902 Jan 18 16:21 01_scrapy_retry.py
-rw-r--r-- 1 michaelheydt staff 656 Jan 18 16:21 02_scrapy_redirects.py
-rw-r--r-- 1 michaelheydt staff 1129 Jan 18 16:21 03_scrapy_pagination.py
-rw-r--r-- 1 michaelheydt staff 488 Jan 18 16:21 04_press_and_wait.py
-rw-r--r-- 1 michaelheydt staff 580 Jan 18 16:21 05_allowed_domains.py
-rw-r--r-- 1 michaelheydt staff 826 Jan 18 16:21 06_scrapy_continuous.py
-rw-r--r-- 1 michaelheydt staff 704 Jan 18 16:21 07_scrape_continuous_twitter.py
-rw-r--r-- 1 michaelheydt staff 1409 Jan 18 16:21 08_limit_depth.py
-rw-r--r-- 1 michaelheydt staff 526 Jan 18 16:21 09_limit_length.py
-rw-r--r-- 1 michaelheydt staff 1537 Jan 18 16:21 10_forms_auth.py
-rw-r--r-- 1 michaelheydt staff 597 Jan 18 16:21 11_file_cache.py
-rw-r--r-- 1 michaelheydt staff 1279 Jan 18 16:21 12_parse_differently_based_on_rules.py
In the recipes I'll state that we'll be using the script in <chapter directory>/<recipe filename>.
Now just the be complete, if you want to get out of the Python virtual environment, you can exit using the following command:
(env) py $ deactivate
py $
And checking which python we can see it has switched back:
py $ which python
/Users/michaelheydt/anaconda/bin/python
Now let's move onto doing some scraping.