Python Automation Cookbook

Speeding up web scraping

Most of the time spent downloading information from web pages is spent waiting. A request travels from our computer to the remote server, which processes it, and until the response is composed and comes back to our computer, there is not much we can do about it.

During the execution of the recipes in this book, you'll notice there's a wait involved in requests calls, normally of around one or two seconds. But computers can do other things while waiting, including making more requests at the same time. In this recipe, we will see how to download a list of pages in parallel and wait until they are all ready. We will use an intentionally slow server to show why it's worth getting this right.

Getting ready

We'll get code to crawl and search for keywords, making use of the futures capabilities of Python 3 to download multiple pages at the same time.

A future is an object that represents the promise of a value. You receive the future object immediately, while the code runs in the background; only when you specifically request its .result() does the code wait until the value is available.

If the result is already available at that point, there is no wait at all, which makes the whole process faster. Think of the operation as putting a load in the washing machine while doing other tasks. There's a chance that the laundry will be done by the time we finish the rest of our chores.

To generate a future, you need a background engine, called an executor. Once it is created, you submit a function and its parameters to it and get back a future. Retrieving the result can be delayed as long as necessary, which allows us to generate several futures in a row, let them run in parallel, and then wait until they are all finished. This is the alternative to creating one, waiting until it finishes, creating another, and so on.

There are several ways to create an executor; in this recipe, we'll use ThreadPoolExecutor, which uses threads.
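
Before applying this to the crawler, here is a minimal standalone sketch of the submit/result cycle; the slow_double function and its two-second sleep are made up purely to stand in for a slow network call:

import time
from concurrent.futures import ThreadPoolExecutor

def slow_double(number):
    # Stand-in for a slow operation, such as a network request
    time.sleep(2)
    return number * 2

with ThreadPoolExecutor(max_workers=3) as executor:
    # submit() returns immediately with a future for each call
    futures = [executor.submit(slow_double, number) for number in (1, 2, 3)]
    # .result() blocks only until that particular future is done
    print([future.result() for future in futures])
    # Prints [2, 4, 6] after roughly two seconds, not six

Because all three calls run at the same time in the pool, the total time is close to that of a single call.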

We'll use a prepared example, available at the following GitHub repo: https://github.com/PacktPublishing/Python-Automation-Cookbook-Second-Edition/tree/master/Chapter03/test_site. Download the whole site and run the included script:

$ python simple_delay_server.py -d 2

This serves the site at the URL http://localhost:8000. You can see it in a browser. It's a simple blog with three entries. Most of it is uninteresting, but we added a couple of paragraphs that contain the keyword python. The parameter -d 2 makes the server intentionally slow, simulating a bad connection.
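
As a quick sanity check (not part of the recipe), you can time a single request from the Python interpreter while the server is running; the elapsed time should roughly reflect the artificial delay set with -d 2:

import time
import requests

start = time.monotonic()
response = requests.get('http://localhost:8000/')
print(response.status_code, f'{time.monotonic() - start:.1f}s')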

How to do it...

  1. Write the following script, speed_up_step1.py. The full code is available on GitHub in the Chapter03 directory (https://github.com/PacktPublishing/Python-Automation-Cookbook-Second-Edition/blob/master/Chapter03/speed_up_step1.py). Only the most relevant parts are shown here. It is based on crawling_web_step1.py:
    ...
    def process_link(source_link, text):
        ...
        return source_link, get_links(parsed_source, page)
    ...
    def main(base_url, to_search, workers):
        checked_links = set()
        to_check = [base_url]
        max_checks = 10
        with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
            while to_check:
                futures = [executor.submit(process_link, url, to_search)
                           for url in to_check]
                to_check = []
                for data in concurrent.futures.as_completed(futures):
                    link, new_links = data.result()
                    checked_links.add(link)
                    for link in new_links:
                        if link not in checked_links and link not in to_check:
                            to_check.append(link)
                    max_checks -= 1
                    if not max_checks:
                        return
    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        ...
        parser.add_argument('-w', type=int, help='Number of workers',
                            default=4)
        args = parser.parse_args()
        main(args.u, args.p, args.w)
    
  2. Notice the differences in the main function compared to crawling_web_step1.py. There's also an extra parameter (the number of concurrent workers), and the process_link function now returns the source link.
  3. Run the crawling_web_step1.py script to get a time baseline. Notice that the output has been removed here for clarity:
    $ time python crawling_web_step1.py http://localhost:8000/
    ... REMOVED OUTPUT
    real 0m12.221s
    user 0m0.160s
    sys 0m0.034s
    
  4. Run the new script with one worker, which will make it slower than the original one:
    $ time python speed_up_step1.py -w 1
    ... REMOVED OUTPUT
    real 0m16.403s
    user 0m0.181s
    sys 0m0.068s
    
  5. Increase the number of workers:
    $ time python speed_up_step1.py -w 2
    ... REMOVED OUTPUT
    real 0m10.353s
    user 0m0.199s
    sys 0m0.068s
    
  6. Adding more workers decreases the time:
    $ time python speed_up_step1.py -w 5
    ... REMOVED OUTPUT
    real 0m6.234s
    user 0m0.171s
    sys 0m0.040s
    

How it works...

The concurrent requests are driven by the main function. Notice that the rest of the code is basically untouched (other than returning the source link in the process_link function).

This kind of change is actually quite common when adapting code for concurrency. Concurrent tasks need to return all the relevant data, as they cannot rely on results arriving in any particular order.

This is the relevant part of the code that handles the concurrent engine:

with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
    while to_check:
        futures = [executor.submit(process_link, url, to_search)
                   for url in to_check]
        to_check = []
        for data in concurrent.futures.as_completed(futures):
            link, new_links = data.result()
            checked_links.add(link)
            for link in new_links:
                if link not in checked_links and link not in to_check:
                    to_check.append(link)
            max_checks -= 1
            if not max_checks:
                return

The with context creates a pool of workers, specifying how many of them to use. Inside it, a list of futures is created, one for each URL to retrieve. The .as_completed() function returns the futures as they finish, and then there's some work to do to obtain the newly found links and check whether they need to be added for retrieval or not. This process is similar to the one presented in the Crawling the web recipe.

The process starts again until enough links have been retrieved or there are no links left to retrieve. Note that the links are retrieved in batches: the first time, the base link is processed and all its links are retrieved. In the second iteration, all those links will be requested. Once they have all been downloaded, a new batch will be processed.

When dealing with concurrent requests, keep in mind that their order can change between two executions. If a request takes a little more or a little less time, that can affect the order in which the information is retrieved. Because we stop after checking 10 pages, it also means that those 10 pages may not be the same from one run to the next.
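
To see this effect in isolation, here is a small standalone sketch (not part of the recipe's code) that simulates variable latency; as_completed yields futures in completion order, not submission order, which is also why process_link has to return the source link alongside its results:

import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # Simulate variable network latency
    time.sleep(random.uniform(0.1, 0.5))
    # Return the URL with the data, as completion order is unpredictable
    return url, f'content of {url}'

urls = ['page1', 'page2', 'page3']
with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(fetch, url) for url in urls]
    for future in as_completed(futures):
        url, content = future.result()
        print(url, '->', content)

Running this several times will typically print the pages in a different order each time.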

There's more...

The full futures documentation in Python can be found here: https://docs.python.org/3/library/concurrent.futures.html.

As you can see in steps 4 and 5 in the How to do it… section, properly determining the number of workers can require some testing. Some numbers can make the process slower, due to the overhead of managing the extra workers. Do not be afraid to experiment!
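
One simple way to experiment is to time the same batch of work with different pool sizes. This is only a sketch: slow_task and its half-second sleep are made up to stand in for a real request:

import time
from concurrent.futures import ThreadPoolExecutor

def slow_task(number):
    # Stand-in for a slow request; the half-second sleep is arbitrary
    time.sleep(0.5)
    return number

for workers in (1, 2, 5, 10):
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # Consume the iterator so all tasks finish before we stop the clock
        list(executor.map(slow_task, range(10)))
    print(f'{workers} workers: {time.monotonic() - start:.2f}s')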

In the Python world, there are other ways to make concurrent HTTP requests. There's a library built on top of requests that exposes them through futures, called requests-futures. It can be found here: https://github.com/ross/requests-futures.
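
As a rough sketch of how it can be used (assuming the package is installed with pip install requests-futures and our slow test server is still running; the repeated URL is just a placeholder):

from requests_futures.sessions import FuturesSession

session = FuturesSession(max_workers=5)
# Each .get() returns a future immediately; requests run in background threads
urls = ['http://localhost:8000/'] * 3  # Placeholder: reuse our slow test server
futures = [session.get(url) for url in urls]
for future in futures:
    response = future.result()  # Blocks until that particular response is ready
    print(response.url, response.status_code)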

Another alternative is to use asynchronous programming. This way of working has recently received a lot of attention, as it can be very efficient in situations that involve many concurrent calls, but the resulting coding style is different from the traditional one and requires some time to get used to. Python includes the asyncio module to work in this way, and there's a good library called aiohttp to work with HTTP requests. You can find more information about aiohttp here: https://aiohttp.readthedocs.io/en/stable/client_quickstart.html.
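
To give an idea of the style, here is a minimal sketch of the aiohttp client pattern (assuming aiohttp is installed; the repeated URL again just points at our test server for illustration):

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return url, await response.text()

async def download_all(urls):
    async with aiohttp.ClientSession() as session:
        # gather runs all the coroutines concurrently in a single thread
        return await asyncio.gather(*(fetch(session, url) for url in urls))

urls = ['http://localhost:8000/'] * 3  # Placeholder: reuse our slow test server
for url, text in asyncio.run(download_all(urls)):
    print(url, len(text))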

A good introduction to asynchronous programming can be found in this article: https://djangostars.com/blog/asynchronous-programming-in-python-asyncio/.

See also

  • The Crawling the web recipe, earlier in this chapter, for the less efficient alternative to this recipe.
  • The Downloading web pages recipe, earlier in this chapter, to learn the basics of requesting web pages.