Web Scraping with Python: An Overview

Disclaimer: Scraping content from websites is often forbidden by terms of service and can violate copyright law. This article doesn't condone in any way activities that might contravene regulations or laws.

One of the most compelling aspects of programming is automating dull tasks. Grabbing data from the web is a repetitive job and a prime candidate for scripting. These are some notes from my own dabbling in web scraping over the years.

Components

In my view, web scraping can be broadly divided into two components: Crawling and Parsing. Crawling is programmatically traversing and exploring content using hypermedia. Parsing is extracting information from content you've found. Naturally, these components work quite closely together. In fact, you'll often be parsing pages as you go in order to find the next hyperlinks to feed to your crawler!
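
To make this concrete, below is a minimal sketch of that loop using only the standard library. The start URL and the crawl cap are stand-ins of my own, and a real crawler would also need some politeness (rate limiting, respecting robots.txt) layered on top.

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Parsing: collect the href of every anchor tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.extend(value for name, value in attrs if name == 'href' and value)


# Crawling: visit pages, parse them, and feed the links found back into the queue.
to_visit = ['https://example.com']    # placeholder start URL
seen = set()

while to_visit and len(seen) < 20:    # cap the sketch so it doesn't wander off across the web
    url = to_visit.pop()
    if url in seen:
        continue
    seen.add(url)
    page = urlopen(url).read().decode('utf-8', errors='replace')
    collector = LinkCollector()
    collector.feed(page)
    to_visit.extend(urljoin(url, link) for link in collector.links)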

Tools

There are many options out there for scraping in Python, ranging from building your own crawler from components such as Requests and Beautiful Soup, to fully fledged, open-source crawlers such as Scrapy. The advantage that something like Scrapy gives you is that many useful features have already been developed and battle-tested for you, such as:

  • Rate limiting
  • Parallelism
  • HTML parsing
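
To give a feel for the framework, here is a minimal Scrapy spider. It is only a sketch for recent versions of Scrapy, and the site and selectors (quotes.toscrape.com, a public scraping sandbox) are stand-ins of my own rather than anything from this article.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']    # placeholder sandbox site

    def parse(self, response):
        # Parsing: extract some data from the current page.
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}

        # Crawling: follow the pagination link and parse the next page too.
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Saved as, say, quotes_spider.py, it could be run with scrapy runspider quotes_spider.py -o quotes.json.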

On the other hand, it's satisfying to develop your own solution from more basic building blocks, and a tool like Scrapy may even be overkill sometimes. For instance, below is a quick and dirty script that I wrote to simply download a series of PDFs for which I could derive the URLs.

import time
from urllib.request import urlretrieve
from urllib.error import URLError


base_url = 'http://<url-removed>.co.uk/pdfs/<generic-document-name>{}.pdf'
file_name = '<generic-document-name>{0:{fill}{align}2}.pdf'

start = time.time()    # note when we started so we can report the total run time


for i in range(1, 83):    # I knew in advance how many documents there were, so I could hardcode it
    print('Downloading file {}'.format(i))
    try:
        urlretrieve(base_url.format(i), file_name.format(i, fill=0, align='>'))
    except URLError:
        print("Unable to locate " + file_name.format(i, fill=0, align='>'))
print("Done! Took {:.2f} seconds".format())

Not the prettiest, but it worked!

The above was a simple case, in which I was simply downloading a series of files. As noted above, sometimes you'll need to retrieve the web page and parse the information in it, or even find the next link to scrape within it. Beautiful Soup is the tool you'll want to use for this. It allows you to navigate the HTML tree and search for types of tag, or tags with certain attributes, with ease.
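
As a rough sketch of what that looks like (the URL and the tag names here are placeholders, not taken from a real project):

import requests
from bs4 import BeautifulSoup


response = requests.get('https://example.com')    # placeholder URL
soup = BeautifulSoup(response.text, 'html.parser')

# Search by type of tag...
for heading in soup.find_all('h1'):
    print(heading.get_text(strip=True))

# ...or by tag attributes, e.g. every link on the page,
# which could be fed straight back to your crawler.
for link in soup.find_all('a', href=True):
    print(link['href'])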

Another thing you might want to consider is parallelism. You can use threading, a pipeline (such as Luigi) or a message queue (such as RabbitMQ) to divide and conquer. A great example of a scraper that uses parallelism was published by the Architecture of Open Source Applications project: written by A. Jesse Jiryu Davis and Guido van Rossum, it is a crawler built with coroutines that dives into how to schedule asynchronous work with Python.
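
As a rough illustration of the threading option, the download loop from the earlier script could be spread across a small thread pool using only the standard library. This is just a sketch that reuses the same placeholder URL and file name patterns, and a real scraper should still rate limit itself.

from concurrent.futures import ThreadPoolExecutor
from urllib.error import URLError
from urllib.request import urlretrieve


base_url = 'http://<url-removed>.co.uk/pdfs/<generic-document-name>{}.pdf'
file_name = '<generic-document-name>{0:{fill}{align}2}.pdf'


def download(i):
    try:
        urlretrieve(base_url.format(i), file_name.format(i, fill=0, align='>'))
    except URLError:
        print("Unable to locate " + file_name.format(i, fill=0, align='>'))


# Divide and conquer: fetch a handful of documents at a time instead of one by one.
with ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(download, range(1, 83))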

Something Else

Sometimes, the above approaches won't work, as the info you want to grab will be loaded in your browser by JavaScript. Since Requests and urllib simply grab the page source, you might need some way to invoke the JavaScript that generates the content that you really want. This is where Selenium comes in.

Selenium is a browser automation tool that lets you control your browser directly, and it offers a different way to approach web scraping.

Set Up

As you'll actually be using your browser, it'll be useful to create a profile specifically for Selenium.

On the command line, with Firefox closed, run the following. For the purpose of this article I'm assuming a Linux setup.

$ firefox --ProfileManager

Then create a new profile. This profile can usually be found under $HOME/.mozilla/firefox.

Next, you need a piece of software called geckodriver to actually use Selenium to control Firefox. You can find the latest version at https://github.com/mozilla/geckodriver/releases/. At the time of writing, that is v0.21.0.

Again on the command line, download it, unpack the compressed file and place it on your path.

$ wget -nv https://github.com/mozilla/geckodriver/releases/download/v0.21.0/geckodriver-v0.21.0-linux64.tar.gz
$ tar -xzf geckodriver-v0.21.0-linux64.tar.gz
$ rm geckodriver-v0.21.0-linux64.tar.gz
$ ls -1    # You should see geckodriver in the output.
geckodriver

$HOME/.local/bin is a good place for this type of software. First make sure it exists and then put the software there.

$ mkdir -p ~/.local/bin
$ mv geckodriver ~/.local/bin

Check it's on your path:

$ geckodriver --version
geckodriver 0.21.0

The source code of this program is available from
testing/geckodriver in https://hg.mozilla.org/mozilla-central.

This program is subject to the terms of the Mozilla Public License 2.0.
You can obtain a copy of the license at https://mozilla.org/MPL/2.0/.

If you get a command not found error, you'll need to add the following line to your .bashrc file and restart your bash session:

export PATH=$PATH:$HOME/.local/bin

Now geckodriver is on your path and can be used by Selenium!

Using Selenium

Here is a snippet that launches Firefox with the profile created earlier and loads a page:

from selenium import webdriver


# Point Selenium at the dedicated profile created earlier.
path_to_profile = '<path-to-home-dir>/.mozilla/firefox/<string-of-characters>.gecko_profile/'
fp = webdriver.FirefoxProfile(path_to_profile)
driver = webdriver.Firefox(fp)    # launches Firefox via geckodriver

driver.get('https://example.com')
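
Once the page has loaded and its JavaScript has run, you can keep working with Selenium directly or hand the rendered source over to Beautiful Soup. A small continuation of the snippet above, as a sketch:

from bs4 import BeautifulSoup

# driver.page_source is the rendered DOM, i.e. after JavaScript has run,
# rather than the raw source that Requests or urllib would have fetched.
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.title.get_text())

driver.quit()    # close the browser when you're finished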

The Selenium docs give a good overview of more advanced usage.