Web scraping, i.e. automated data mining from websites, usually involves fetching a web page's HTML document, parsing it, extracting the required information, and optionally following links within this document to other web pages in order to repeat the process. This approach is sufficient for many websites that display information statically, i.e. do not respond to user interaction dynamically by means of JavaScript. In these cases, web scraping can be implemented with Python packages such as requests and BeautifulSoup. Even interactive elements such as forms can be emulated by observing the HTTP POST and GET data that is sent to the server whenever a form is submitted. However, this approach has its limits. Sometimes it is necessary to automate a whole browser in order to scrape JavaScript-heavy websites, as will be shown with a short example in this post.
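For reference, a minimal sketch of this classic approach could look like the following; the URL and the CSS selectors are made up for illustration and would have to be adapted to the actual target page:
import requests
from bs4 import BeautifulSoup

# fetch the HTML document of a (hypothetical) static page
resp = requests.get('https://example.com/articles')
resp.raise_for_status()

# parse the document and extract the required information
soup = BeautifulSoup(resp.text, 'html.parser')
titles = [h.get_text(strip=True) for h in soup.select('h2.article-title')]

# optionally follow links within this document to repeat the process
next_links = [a['href'] for a in soup.select('a.next-page') if a.get('href')]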
Nowadays, many websites contain a lot of interactive JavaScript elements and data is updated on the page asynchronously as the user interacts with it. This means that “classic” web scraping, as described before, is often hard or sometimes impossible to implement on these pages. One solution can be to monitor the asynchronous HTTP traffic using the developer tools in your web browser as can be seen in the next figure:
However, it is often very cumbersome to find out how the asynchronous requests work on a specific website, e.g. which URLs or which HTTP methods to use and how to structure the request. An alternative solution is to automate a whole browser so that each interaction with a website is scripted. The most popular software package for browser automation is Selenium, which provides a scripting interface for many browsers in many programming languages. We will stick to the Python API for Selenium and use the ChromeDriver to automate a Chrome or Chromium browser.
Scraping auto-suggestions of Google searches
A classic example of asynchronous server communication is the search suggestions that you get while entering a search query into Google search. Our goal is to automatically extract these suggestions by using an automated Chrome browser. I recommend using an interactive Python shell such as IPython and running the code snippets step by step in order to see how the Python commands interact with the browser.
First, we need to import the webdriver module from the selenium package and start a browser session with the ChromeDriver:
import time # we will need this later
from selenium import webdriver
browser = webdriver.Chrome()
This will open a browser window that displays a notice on top, saying that this browser is automated:
It says “automated test software” because Selenium is usually used to automate testing of websites.
The next step is to open the Google website:
browser.get('https://google.com')
Now that we're on the website, we can fill out the search form. We do so by selecting the element that we want to interact with (the search input element) and then "typing in" our search query. Of course we don't actually type anything because we're lazy and this whole thing is about automation, right? Instead, we define which keys to send to the input element:
search_input = browser.find_element_by_css_selector('#lst-ib')
search_input.send_keys('how to start') # let's search for this phrase
A ghostwriting browser! The first line selects the search input element of the website using a CSS selector. For a given website, you will need to find a way to specify the element that you want to interact with. One way is to right-click on an element in your browser and select "Inspect", which will open your browser's developer tools and show the nested structure of the HTML document, highlighting the current element and the "path" through this HTML structure. This can give a hint about the right CSS selector. With the search input it is quite easy, because it has a unique element ID, "lst-ib". However, you should note that whenever the HTML code of the website changes, your CSS selector might not work anymore.
Whenever a search query is entered, the search suggestions appear beneath it. We can now extract these suggestions by first selecting the respective HTML elements and then iterating through them in order to fetch each element's label text. However, you should be aware that when this script is run, all the commands happen in the blink of an eye. Since the suggestions are loaded asynchronously, and since there is always some latency in the asynchronous communication with the server, two problems can appear:
- the script might already try to extract the search suggestions although they have not appeared on the website yet
- the search suggestions might have already appeared on the website, but they are not up to date yet
The latter happens because the search query is evaluated while the automated browser "types". So it could be that we get search suggestions for the partial query "how to" because the suggestions for the full phrase "how to start" have not arrived yet!
To circumvent this, we can implement a loop that tries to find the search suggestions and, if they have not arrived yet or if the results are not valid (i.e. do not begin with the full phrase), waits for half a second before the next try. In order not to end up in an endless loop, we limit the maximum number of tries to, say, five. If there are no valid suggestions after five tries (i.e. 2.5 seconds of waiting), we end up with no suggestions (and should report this in some way for later fault diagnostics). In Python this can be implemented like this:
suggestions = []
n_tries = 1
while n_tries <= 5:
    # find the elements that contain the search suggestions
    # again this cryptic CSS selector was found using the "inspect" tool
    autosuggest_elems = browser.find_elements_by_css_selector('#sbtc .sbsb_b li .sbqs_c')
    # make sure that we got results and that the first result starts with
    # our search phrase
    if not autosuggest_elems or \
            not autosuggest_elems[0].text.strip().startswith('how to start'):
        # no valid results (yet)
        print('(%d) no valid suggestions yet, will wait...' % n_tries)
        n_tries += 1       # do not forget to increment the number of tries
        time.sleep(0.5)    # wait for half a second for the next try
    else:
        # we got results
        for e in autosuggest_elems:
            # use the "text" attribute of the HTML element and strip unnecessary whitespace
            suggestions.append(e.text.strip())
        break   # break out of the while loop, because we got our results!
print(suggestions)
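As a side note, Selenium also ships an explicit-wait helper, WebDriverWait, which can replace such a hand-written retry loop. A rough sketch of the same idea, reusing the selector and the five-second budget from above (whether this works against the current Google markup would still need testing):
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

def valid_suggestions(drv):
    # return the suggestion elements only once they exist and match the full phrase
    elems = drv.find_elements(By.CSS_SELECTOR, '#sbtc .sbsb_b li .sbqs_c')
    if elems and elems[0].text.strip().startswith('how to start'):
        return elems
    return False

try:
    # poll every 0.5 seconds for up to 5 seconds
    elems = WebDriverWait(browser, 5, poll_frequency=0.5).until(valid_suggestions)
    suggestions = [e.text.strip() for e in elems]
except TimeoutException:
    suggestions = []   # report this in some way for later fault diagnostics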
At the end of the script we should not forget to close the browser window:
browser.close()
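If anything in the scraping code raises an exception along the way, the browser window would stay open. One way to make sure the browser is always shut down is a try/finally block around the scraping logic; a minimal sketch (note that quit() ends the whole driver session, whereas close() only closes the current window):
browser = webdriver.Chrome()
try:
    browser.get('https://google.com')
    # ... search input, retry loop and suggestion extraction from above ...
finally:
    browser.quit()   # always end the browser session, even if an error occurred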
This is basically it! For me, I get the following results when running the full script:
['how to start a blog',
'how to start a conversation',
'how to startup beuth',
'how to start a business',
'how to start a cover letter',
'how to start a comment',
'how to start a presentation',
'how to start a summary',
'how to start an essay',
'how to start a speech']
We can see that the search results from Google are at least localized, if not personalized, since the third result "how to startup beuth" refers to a University of Applied Sciences in Berlin. That's not a surprise: Google holds a patent on personalizing search results, and concerns about personalized search results have been raised in the press for years.
Caveats
This could be the starting point for further automation, e.g. using different search phrases or different browsers at different (faked) locations with different browsing histories in order to find out how this affects the personalized search results. However, you should note that there are some caveats:
First of all, web scraping can raise legal issues, ranging from copyright problems to computer fraud and abuse. I recommend reading this article or the legal issues section of the Wikipedia article on web scraping. You should always check a website's Terms of Service and respect its robots.txt before you scrape it. You should also avoid flooding the website's server with requests, i.e. limit the number of requests you send within a certain time frame.
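Python's standard library already helps with the last two points; a minimal sketch with a made-up host and a fixed delay (the appropriate delay depends on the site and its terms):
import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')   # hypothetical host for illustration
rp.read()

for url in ['https://example.com/page1', 'https://example.com/page2']:
    if not rp.can_fetch('my-scraper-bot', url):   # check the rules for our user agent
        continue
    # ... fetch and process the page here ...
    time.sleep(2)   # simple rate limiting: pause between requests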
From a technical point of view, automating a whole browser should be your last resort, for when every other option for getting the data you want fails. This is because it can be quite cumbersome to work around all the problems that come with asynchronous data retrieval. It can be very time consuming to build a "fail-safe" scraper with this method, because so many things can go wrong on a website. And then again, all the work could be for nothing from one day to the next, if your target website suddenly gets a big facelift and its whole HTML code and user interaction change.