Robust web scraping or web API-based data collection

There are thousands of articles on the web about web scraping and accessing web APIs. Most of them show you how to extract information from specific elements on a web page or how to communicate with a specific API in order to collect data. For smaller data collection projects, this knowledge may be sufficient, but large-scale data collection that must run reliably over days or even weeks raises additional problems, most of which concern the robustness of the data collection process. I will try to tackle some of these problems in this post. I will use examples in Python, but the basic concepts can easily be translated to R or other programming languages.

Robustness

By robust data collection via web scraping and web APIs, I mean that the data collection process satisfies at least the following requirements:

  1. It runs largely without manual intervention and handles upcoming issues such as loss of internet connection, server failures, rate limit exceedance etc. by itself;
  2. if it breaks (or we interrupt it manually), little or no already collected data is lost;
  3. if we need to restart or continue the data collection process, already collected data is not fetched again;
  4. it reports and records failures or unexpected situations (e.g. failed connection to server, unexpected HTML structure or API response).

You may add more requirements such as live (e-mail) notifications about failures or automatic recovery from fatal errors. This may very well be necessary if you plan a very long-running data collection, e.g. real-time data collection from Twitter’s streaming API over several months. For this post, however, I will focus on the mentioned requirements, since this already improves any “quick and dirty” single-shot web scraper considerably with little effort and fits many data collection projects in the social sciences well.

A baseline web scraping script

As an example, I will consider the web scraping script that I used in the previous blog post on COVID-19 related topic models to fetch articles from the Spiegel Online (SPON) news archive. The news archive contains a list of all articles published on a given day on the SPON website. Each day’s archive can be viewed with a URL pattern https://www.spiegel.de/nachrichtenarchiv/artikel-<DD>.<MM>.<YYYY>.html, where <DD>, <MM> and <YYYY> are day, month and year respectively. Please note that the URL and/or the HTML structure on SPON may change in the future, which would require adapting our web scraper. However, the techniques for robust data collection that I’ll present here are independent of the actual website or web API.

The baseline Python script is a stripped-down version of the original web scraper that only collects news headlines and URLs from the archive for a given time frame. This baseline script is our starting point and is available as gist. The script does what it should do: for a given time frame defined by START_DATE and END_DATE, the news archive HTML is fetched for each day and the respective elements of interest are extracted and stored in the variable archive_rows. We use the standard Python toolkit for web scraping – the requests package to retrieve the HTML and BeautifulSoup4 to parse it. archive_rows is a dictionary that maps each day’s date to a list of extracted article headlines and URLs that were published on that day. This data is finally stored in a JSON file. A condensed version of the web scraping loop looks like this (excluded parts are marked with “# [...]”):

archive_rows = defaultdict(list)
duration = END_DATE - START_DATE   # timedelta
# loop through the days in the specified timespan
for day in range(duration.days + 1):
    fetch_date = START_DATE + timedelta(days=day)
    fetch_date_str = fetch_date.date().isoformat()
    archive_url = ARCHIVE_URL_FORMAT \
                     .format(fetch_date.day,
                             fetch_date.month, 
                             fetch_date.year)
    # [...]
    resp = requests.get(archive_url)
    if resp.ok:  # status OK -> parse page
        soup = BeautifulSoup(resp.content, 'html.parser')
        container = soup.find_all(
            'section', attrs={'data-area': 'article-teaser-list'}
        )
        headlines_container = container[0].select('article')
        # iterate through article teasers
        for hcont in headlines_container:  
            # [...]
            # add extracted data for article at this date
            archive_rows[fetch_date_str].append({
                'archive_headline': headline,
                'url': url,
                'pub_date': fetch_date_str,
            })

However, this script doesn’t meet the requirements above, and this would come back to bite us if we ran it for longer time frames than the seven days specified in the example. There are too many things that can go wrong during the web scraping process and you can’t control all of them: your internet connection may fail, the server may be down or you may be blocked from the server; you may get an unexpected response from the server or an unexpected HTML structure; your script may have a bug that only occurs under very specific circumstances, etc. The longer you need to run the script, the higher the probability that you run into one of these problems.

Using a cache file for intermediate results

No matter what happens during data collection, we should never lose already collected data. This is especially important when you can’t repeat the data collection (e.g. when collecting real-time tweets). But even when you can repeat the data collection (as in our example) you should avoid it since it may be very time-consuming and costly (e.g. API usage fees).

There’s a simple solution to our problem which is called caching: As you collect your data, you store intermediate results to a cache file. When your script breaks, you still have the data collected so far. When you start the script again, the intermediate results are loaded from the cache file and we will make sure not to fetch the already collected data again.

I’ve implemented this in a second version of the SPON web scraper script. Let’s go through the changes. We will need the os and pickle modules:

import os
import pickle

os is needed later to check whether a cache file has already been written, and pickle is used to quickly store and load (almost) any Python object. An equivalent in R would be saveRDS()/readRDS().

We also define the file name of our cache file:

CACHEFILE = 'cache.pickle'

Now we define two simple functions that manage our cache:

def load_cache(init_with):
    if os.path.exists(CACHEFILE):
        print('loading existing data from %s' % CACHEFILE)
        with open(CACHEFILE, 'rb') as f:
            return pickle.load(f)
    else:
        print('initializing with empty dataset')
        return init_with


def store_cache(data):
    with open(CACHEFILE, 'wb') as f:
        pickle.dump(data, f)

The first function, load_cache(), checks whether the cache file exists. If so, its data is loaded via pickle.load() and returned. If there’s no cache file, the init_with data is simply returned. This allows us to easily replace the initialization of archive_rows as follows:

archive_rows = load_cache(init_with=defaultdict(list))

So archive_rows will be either populated with an empty defaultdict as before or, when the cache file exists, with the already fetched data loaded from the cache file.

Next we should make sure not to retrieve already scraped data again, so we add a line that checks whether fetch_date_str already exists in archive_rows and if so, continues with the next day:

for day in range(duration.days + 1):
    fetch_date = START_DATE + timedelta(days=day)
    fetch_date_str = fetch_date.date().isoformat()
    archive_url = ARCHIVE_URL_FORMAT \
                     .format(fetch_date.day,
                             fetch_date.month, 
                             fetch_date.year)

    # check if data already exists
    if fetch_date_str in archive_rows:
        print('> already fetched this date – skipping')
        continue

    resp = requests.get(archive_url)
    # [...]

Finally, we should periodically store our intermediate results to the cache file. The question is how often we should do that. We have two nested loops in the scraper: the first iterates over days and the second iterates over the article headlines posted on the given day in the daily archive page. It only makes sense to store intermediate results at the end of the outer loop, i.e. after all headlines of an archive page are processed, because only then have we collected all data for a given day and can safely skip this day when we start the script again.

for day in range(duration.days + 1):
    # [...]
    if resp.ok:  # status OK -> parse page
        # [...]
        # iterate through article teasers
        for hcont in headlines_container: 
            # [...]

    # when all headlines of a day were processed,
    # store the intermediate results
    store_cache(archive_rows)

The complete second version of the SPON scraper is available as gist.

With these simple modifications we make sure that data, once collected, is never lost when the script is interrupted, and that when we restart the script, it simply continues where it left off. Furthermore, we can easily extend the time frame of the data collection and will only collect data for days that we didn’t cover before. However, you should also be aware of the negative implications of this kind of caching: the data in the cache is never updated as long as you don’t delete the whole cache file. So if you later notice that you had an error in parsing the HTML and you fix that error, the fix will only affect newly collected data unless you delete the cache.
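If that happens, you don’t necessarily have to delete the whole cache and re-fetch everything. Since the cache maps dates to lists of records, you could also selectively remove the affected entries so that only those days are fetched again on the next run. A minimal sketch of such a helper (the function name and the example dates are hypothetical, not part of the original script; it reuses load_cache() and store_cache() from above):

from collections import defaultdict

def invalidate_cache_dates(dates_to_refetch):
    # load the current cache, drop the given days and write the cache back;
    # the next run of the scraper will then fetch these days again
    data = load_cache(init_with=defaultdict(list))
    for d in dates_to_refetch:
        data.pop(d, None)   # remove the entry if it exists
    store_cache(data)

# example: re-fetch two days after fixing a parsing bug
invalidate_cache_dates(['2020-11-03', '2020-11-04'])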

Timeouts, exceptions and retries

As already said initially, many things can go wrong during web scraping or web API communication. This is mainly because you’re dealing with things that are often beyond your control and may fail anytime: Your internet connection, the target server’s internet connection, the target server itself, etc. This means your code must be resilient towards possible failures and a common way to implement that resilience is to

  1. specify expectations;
  2. handle exceptions gracefully when they occur;
  3. retry on failures that may be temporary.

The first point refers to expectations that we have on how “the other side”, i.e. the server, responds to our requests. This could mean the HTTP status codes that we accept, the structure of the JSON or the HTML in the response or the maximum time it can take until we get a response.

The second point means that we should capture possible exceptions in our code and handle them accordingly. For example, we may expect that an HTTP request can fail, and if it does, our script shouldn’t crash (as it would for an unhandled exception). Instead, we could, for example, print an error message and continue with the next request – it all depends on the use case and the severity of the exception.

The third point simply means that we shouldn’t give up on the first try if we suspect that a temporary issue, such as a broken internet connection, caused a failure. Instead, we should have a retry strategy, i.e. we define how often and at what intervals we retry the same request. We also define when our patience ends and we give up trying.
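Before we look at the convenient tooling that requests and urllib3 provide for this, here is a minimal sketch of the general pattern as a hypothetical fetch_with_retries() helper (names and parameter values are only illustrative). It combines all three points: an expectation (a timeout and a status check), graceful exception handling, and a fixed number of retries with a waiting period in between:

import time
import requests

def fetch_with_retries(url, max_attempts=3, wait_seconds=5, timeout=15):
    # try to fetch `url` up to `max_attempts` times, waiting `wait_seconds`
    # between attempts; return the response or None if all attempts failed
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.get(url, timeout=timeout)   # expectation: answer within `timeout` seconds
            if resp.ok:                                 # expectation: HTTP status is OK
                return resp
            print(f'attempt {attempt}: unexpected status {resp.status_code}')
        except IOError as exc:                          # handle connection problems gracefully
            print(f'attempt {attempt}: IO error: {exc}')
        if attempt < max_attempts:
            time.sleep(wait_seconds)                    # wait before trying again
    return None                                         # give up

In practice you rarely need to write this yourself, as we’ll see next.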

In our current web scraping script, we already accounted for the HTTP status code and the HTML structure that we expect by checking whether the response code was “OK” and by skipping HTML sections that don’t meet our expectations. However, we didn’t set a timeout, i.e. the maximum time it may take until we get a response. This means our script may hang indefinitely when the connection to the server is lost. This can happen quite quickly: if you disconnect your internet connection immediately after a request was made, your script may wait forever for an answer, even after you re-establish the connection.

With requests it’s quite straightforward to define a timeout: simply pass a timeout=... parameter to the requests.get() function. A retry strategy is also quickly implemented: we create an HTTPAdapter with a retry strategy, then create a Session and register the HTTPAdapter instance for all https:// requests:

from requests.adapters import HTTPAdapter

retryadapter = HTTPAdapter(max_retries=3)
httpsess = requests.Session()
httpsess.mount('https://', retryadapter)

In the above example, the retry strategy is very simple. We try to perform the request up to three times without waiting time in between. This is not ideal, since usually you should allow some time to re-establish a lost connection. This can be done by passing a Retry object from urllib3 with a backoff factor. This calculates the delay in seconds between the attempts as follows: backoff * (2 ^ (attempt - 1)). For a backoff factor of one, this results in the exponentially growing sequence 1, 2, 4, 8, 16, … We can set such a strategy for up to three retries as follows:

from urllib3 import Retry

retryadapter = HTTPAdapter(
    max_retries=Retry(total=3, backoff_factor=1)
)

You can define very fine-grained retry strategies with the Retry class, for example controlling for which HTTP status codes a retry is attempted, such as “rate limit exceeded” responses (HTTP code 429), and how the waiting time between attempts is determined. See Hodovic 2020 for a more complex example.
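As a sketch of what a more fine-grained configuration could look like (the parameter values are only an example and some parameter names differ between urllib3 versions):

from requests.adapters import HTTPAdapter
from urllib3 import Retry

# retry on connection problems and additionally on selected HTTP status codes;
# for 429/503 responses, urllib3 can honor the server's Retry-After header
retryadapter = HTTPAdapter(max_retries=Retry(
    total=5,                                     # at most five retries overall
    backoff_factor=1,                            # exponential backoff between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # also retry on these status codes
    respect_retry_after_header=True              # wait as long as a Retry-After header demands
))

As before, such an adapter would be registered with the session via httpsess.mount('https://', retryadapter).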

The only thing left now is to actually use our custom Session instance with a timeout and the retry strategy in our GET request, and to handle any exceptions that may occur during that request:

for day in range(duration.days + 1):
    # [...]

    try:
        resp = httpsess.get(archive_url, timeout=15)
    except IOError as exc:
        print(f'> got IO error: {exc}')
        continue

    if resp.ok:
        # [...]

I chose a timeout of 15 seconds here. If a communication error with the server persists even after the specified three retries, an IOError exception is raised, which we capture and log before letting the script continue with the next day.

In order to check if this really works we could interrupt our internet connection while the script is running or simply set an unrealistically low timeout for a certain day:

resp = httpsess.get(archive_url, timeout=0.001 if day == 3 else 15)

You will notice how the script halts for some seconds on the specified day – this is the retry and backoff delay at work. The output for this day is then:

> got IO error: HTTPSConnectionPool(host='www.spiegel.de', port=443):
  Max retries exceeded with url: /nachrichtenarchiv/artikel-04.11.2020.html
  (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fce15d90d30>, 'Connection to www.spiegel.de timed out. (connect timeout=0.001)'))

Please note that no data is stored to the cache for that day, so if we run the script again, it will try to fetch the missing data.

In R, timeouts may be specified with the httr or curl package (both are replacements for base::url() that you can also use in conjunction with rvest). You can handle exceptions with tryCatch(). I haven’t found a convenient way for a retry mechanism in R similar to urllib3’s Retry class.

Again, the full third version of the script is available as gist.

Avoiding corrupted cache files

This updated version of the script already works fine and is reliable, at least as long as you work with comparatively small amounts of data. The larger your data, the longer it takes to store your cache file to disk. It may very well take several seconds if you have several hundred MB to store, even with a fast serialization method like the pickle module. The problem is that if your script is interrupted while writing the cache file to disk (for example because you stop it manually or the OS kills it because it takes up too much memory), it may produce a corrupted cache file. This would mean all your data collected so far is lost.

We can easily circumvent this: first, we update our store_cache() function to “rotate” the cache files, i.e. to rename the cache file from the previous call (if it exists) to cache.pickle~ (note the “~”). This way, even when something goes wrong while writing the current cache file, the old one is still available as a backup. The updated function looks like this:

def store_cache(data, rotate_files=True):
    if rotate_files and os.path.exists(CACHEFILE):
        os.rename(CACHEFILE, CACHEFILE + '~')

    with open(CACHEFILE, 'wb') as f:
        pickle.dump(data, f)

The second step is to capture OS signals that are intended to kill the script, so that they can’t interrupt a cache-writing process. We can do that in Python with the signal module. A simple solution is to set up a signal handler function that sets a global variable abort_script like this:

import signal

# [...]

abort_script = False
def handle_abort(signum, frame):
    global abort_script
    print('received signal %d – aborting script...' % signum)
    abort_script = True

for signame in ('SIGINT', 'SIGHUP', 'SIGTERM'):
    sig = getattr(signal, signame, None)
    if sig is not None:
        signal.signal(sig, handle_abort)

We then only allow the web scraping loop to be interrupted at the very beginning of an iteration:

for day in range(duration.days + 1):
    if abort_script:    # if interrupted by OS, break loop
        break
    # [...]

This makes sure that we finish collecting the current day’s articles and storing that data to the cache. After the loop, we finally abort the script via exit():

if abort_script:
    print('aborted.')
    exit(1)

With these modifications we make sure that our cached data won’t get corrupted when the script is interrupted at an inopportune moment. You can try it out and interrupt the script while it is running (e.g. by pressing Ctrl-C when running in the terminal). The output would then look like the following (the stray ^C comes from the Ctrl-C input):

day 1: 2020-11-01 from https://www.spiegel.de/nachrichtenarchiv/artikel-01.11.2020.html
^Creceived signal 2 – aborting script...
aborted.

To implement rotating cache files in R, you can use file.rename(). Handling OS signals on the other hand is quite cumbersome in R.

Again, the fourth version of the script is available as gist.

Optimizing storage size and speed for large cache files

I already mentioned that writing large cache files may take some time, and this can cause your script to run very slowly. One pragmatic solution is to store the cache only on every ith iteration, e.g. only every second or every fifth day in our example. This can be implemented with a modulo operation. This pragmatic solution comes at a cost, though: if the script is interrupted, we may lose the intermediate data collected since the last cache write.

CACHERATE = 2

# [...]

for day in range(duration.days + 1):
    # [...]
    if (day+1) % CACHERATE == 0 or day == duration.days:
        store_cache(archive_rows)

Here we only write the cache to disk on every CACHERATE-th iteration (in our case, when day is 1, 3, 5, …) or when the current iteration is the last one.

Another issue might be that the cache file becomes too large. In that case, you can compress the cache file, for example by using the zipfile module. I added this as another improvement to the fifth version of the SPON scraper script.
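A minimal sketch of how a compressed cache could be written and read with zipfile (the function and file names are only illustrative; the actual gist may differ in detail):

import pickle
import zipfile

def store_cache_compressed(data, cachefile='cache.zip'):
    # serialize the data with pickle and store it as a single
    # deflate-compressed member inside a ZIP archive
    with zipfile.ZipFile(cachefile, 'w', compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr('cache.pickle', pickle.dumps(data))

def load_cache_compressed(cachefile='cache.zip'):
    with zipfile.ZipFile(cachefile, 'r') as zf:
        return pickle.loads(zf.read('cache.pickle'))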

In R, the modulo operator is %% and the CACHERATE logic above can be translated easily. Compression can be implemented with the zip() function. However, if you use saveRDS(), the cache file will already be compressed by default.

Further optimizations

Further improvements are possible. For example, you could write several partial cache files (e.g. one for each month of data in our case) and load them only when necessary. This greatly improves memory usage and loading time. Furthermore, you could implement parallel processing to speed up writing and loading of multiple cache files, but this requires quite some effort.
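A minimal sketch of such partial caching with one cache file per month (the file naming scheme and helper names are made up for illustration):

import os
import pickle
from collections import defaultdict

def month_cachefile(fetch_date):
    # hypothetical naming scheme: one cache file per month, e.g. "cache-2020-11.pickle"
    return 'cache-%04d-%02d.pickle' % (fetch_date.year, fetch_date.month)

def load_month_cache(fetch_date):
    fname = month_cachefile(fetch_date)
    if os.path.exists(fname):
        with open(fname, 'rb') as f:
            return pickle.load(f)
    return defaultdict(list)

def store_month_cache(fetch_date, data):
    with open(month_cachefile(fetch_date), 'wb') as f:
        pickle.dump(data, f)

In the scraping loop you would then load and store only the cache for the month of the current fetch_date, e.g. whenever the month changes.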

I also didn’t write much on point 4 from our initial “robustness checklist” – error logging. Logging in general is crucial for web scraping or web API based data collection scripts, since only a comprehensive log allows you to understand what went wrong when you encounter a misbehaving script or bad data. The present example script only uses print() statements for basic logging, which is sufficient for smaller projects. For larger projects I recommend looking at Python’s built-in logging facility. For R, you can use the packages logging or futile.logger. You should also explicitly record errors in your output data instead of just skipping over them, and later check your collected data for systematic data collection errors.
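A minimal Python logging setup that writes to both the terminal and a log file could look like this (the log file name and format are only examples):

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s: %(message)s',
    handlers=[
        logging.FileHandler('scraper.log'),   # keep a permanent record on disk
        logging.StreamHandler()               # also print to the terminal
    ]
)

# used instead of print() inside the scraping loop:
logging.info('fetching %s', archive_url)
logging.warning('unexpected HTML structure on %s', archive_url)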
