There are thousands of articles on the web about web scraping and accessing web APIs. Most of them show you how to extract information from specific elements on a web page or how to communicate with a specific API in order to collect data. For smaller data collection projects, this knowledge may be sufficient, but large scale data collection which must run reliably over days or even weeks brings up additional problems that mainly focus on the robustness of the data collection process. I will try to tackle some of these problems in this post. I will use examples in Python, but the basic concepts can easily be translated to R or other programming languages.
Read More →
Recent Comments