In recent weeks I’ve collaborated on the online book APIs for social scientists and added two chapters: one about the genderize.io API and one about the GitHub API. The book seeks to give an overview of web and cloud services and their APIs that might be useful for social scientists, covering a wide range from text translation to accessing social media APIs, complete with code examples in R. By harnessing the GitHub workflow model, the book itself is also a nice example of fruitful collaboration via work organization methods that were originally developed in the open source software community.
While working on the two chapters and playing around with the APIs, I once again noticed the double-edged nature of using web APIs in research: they can greatly improve research or even enable research that was not possible before, but at the same time, data collected from these APIs can introduce bias, and their use may cause issues with research transparency and replicability. I noted some of these issues in the respective book chapters and I’ve written about them before,[1] but the two APIs that I covered for the book provide some very practical examples of the main issues when working with web APIs, and I wanted to point them out in this blog post.
The first issue I’d like to address is the lack of transparency that often accompanies the use of web or cloud services. Even basic information, such as the source or amount of data that backs a service, is often missing. Such information is crucial, however, as it helps to spot possible issues with bias and accuracy. For example, there’s no information about the types of text that are used to train the language models behind Google’s Translate API. When I researched the genderize.io API, I didn’t find anything about the methodology it uses for gender prediction, so the best guess is that it simply calculates the proportion of female and male uses of a name listed in a large database. At least there’s a page that lists the number of entries for each name. But how do these names end up in this database? All the documentation says is “[o]ur data is collected from all over the web”, which is quite a broad indication of source, I’d say. These name-gender pairs could be scraped from social media profiles, but there’s no way to assess how likely users were to disclose their true gender in these profiles.
This leads me to the next problem, which I’d call blind trust or over-confidence in the results that come from these magical black boxes. For this, I have a concrete example that uses the R package DemografixeR, which utilizes the genderize.io API via the genderize function to predict the likely gender for a given name. For this example, we use “Sasha”:
> library(DemografixeR)
> genderize('sasha')
[1] "male"
The genderize function simply gives a binary result male/female for a name that is submitted to the genderize.io API.[2] Simply trusting this result would be a mistake. In this particular example we can also see the effect of a bad default value for a function argument: if simplify = FALSE is not explicitly specified, this function will reduce its output to the categorization and omit very important information, which we can see next:
> genderize('sasha', simplify = FALSE)
   name   type gender probability count
1 sasha gender   male        0.51 13219
With a probability of only 0.51, categorizing a likely gender for “Sasha” is like flipping an (almost) fair coin.[3] The least you should do is to define a probability threshold for accepting a gender categorization. In this regard, the genderize.io API is actually much more transparent than many other cloud services, since it at least provides the respective information about prediction accuracy and sample size (though hiding this information by default remains a bad choice in the DemografixeR package). All predictions come with uncertainty, including those from cloud services for text or image classification. Whenever possible, you should incorporate this uncertainty into your models to prevent overly confident results. At the very least, you should set minimum thresholds for accuracy and sample size.
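As a minimal sketch of what such a filter could look like, the following snippet accepts a categorization only if it passes thresholds for probability and sample size. The threshold values 0.9 and 100 are arbitrary placeholders for illustration, not recommendations:

library(DemografixeR)

res <- genderize(c('sasha', 'sascha'), simplify = FALSE)

# accept a categorization only if both the reported probability and the
# sample size behind it exceed the (assumed) thresholds; otherwise use NA
min_prob <- 0.9     # placeholder threshold
min_count <- 100    # placeholder threshold
res$gender_accepted <- ifelse(res$probability >= min_prob & res$count >= min_count,
                              res$gender, NA)
res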
You should also always explore all options that an API provides, because they may allow you to adapt it better to your use-case and get more precise results. For the genderize.io API, for example, you can localize your results by providing the country_id parameter. When we localize the results for Germany and use the German variant of the name, “Sascha”, we can see that the prediction accuracy is much higher:
> genderize('sascha', country_id = 'DE', simplify = FALSE)
    name   type gender probability count country_id
  sascha gender   male        0.99 22408         DE
The next issue I’d like to focus on is bias that may unknowingly slip into your analyses when working with cloud services or APIs. This issue is well known, and there’s a lot of research about bias in the ML models that power cloud services for translation, NLP or classification. But I’d like to focus on cases where you may get biased results even when no complicated ML model is involved in querying a cloud service or API. I will at first focus on the genderize.io API again. Due to the already mentioned opacity in terms of how the data for the API is collected, we can’t say much about bias at that level. But I noticed a possible source of bias that I only found through experimenting with the API: names with non-Latin letters seem to be strongly underrepresented in the database. We can see this clearly when comparing these names with their Latinized versions:
> genderize(c('gül', 'gul', 'jürgen', 'jurgen', 'andré', 'andre', 'gökçe', 'gokce', 'jörg', 'jorg', 'rené', 'rene'), simplify = FALSE)
    name   type gender probability count
     gül gender female        0.89    36
     gul gender female        0.88  4963
  jürgen gender   male        0.99   727
  jurgen gender   male        0.99  3966
   andré gender   male        1.00     3
   andre gender   male        0.95 64369
   gökçe gender   male        0.80     5
   gokce gender female        0.81   416
    jörg gender   male        0.99   628
    jorg gender   male        0.98   641
    rené gender   male        1.00     4
    rene gender   male        0.91 35497
The documentation says nothing about this issue, and it’s easy to imagine how you can introduce bias: if you, for example, automatically dismissed results with low sample sizes, you would disproportionately discard names with non-Latin letters.
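A minimal sketch to illustrate the point, reusing the names from above; the minimum count of 100 is again an arbitrary assumption:

library(DemografixeR)

names <- c('gül', 'gul', 'jürgen', 'jurgen', 'andré', 'andre',
           'gökçe', 'gokce', 'jörg', 'jorg', 'rené', 'rene')
res <- genderize(names, simplify = FALSE)

# names that a naive minimum sample size filter would drop -- with the
# results shown above, almost exclusively the variants with non-Latin letters
res$name[res$count < 100]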
Another possible source of bias when using cloud services comes from the fact that many APIs don’t provide proper random sampling for collecting data. Instead, they provide heavily biased samples by employing some sort of ranking for their results. I will show this with some code taken from the GitHub API chapter of the APIs for social scientists book. The GitHub API provides an interface for searching for users, which allows you to obtain user profile data. For example, the following query searches for R users in Berlin:
> library(jsonlite)
> search_results <- fromJSON('https://api.github.com/search/users?q=language:r+location:berlin')
> str(search_results, list.len = 3, vec.len = 3)
List of 3
 $ total_count       : int 559
 $ incomplete_results: logi FALSE
 $ items             :'data.frame': 30 obs. of 19 variables:
  ..$ login  : chr [1:30] "IndrajeetPatil" "RobertTLange" "christopher...
  ..$ id     : int [1:30] 11330453 20374662 1285805 9379282 1569647 63...
  ..$ node_id: chr [1:30] "MDQ6VXNlcjExMzMwNDUz" "MDQ6VXNlcjIwMzc0NjYy...
  .. [list output truncated]
We get 559 users that match these criteria, but each query only returns one page of 30 results, and we would need to query another 18 pages to obtain all user information. Collecting data takes time, so you may be tempted to fetch only the first few pages of results, but this may introduce bias: as the documentation says, the results are sorted by “best match” and “relevance”, without further specifying what either of these means.[4] So you will get the most “popular” (by number of followers, repositories, etc.) R users in Berlin instead of a random sample.
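If you do want the complete result set, a minimal sketch of paginating through the search results could look as follows. It assumes unauthenticated access (which has strict rate limits, hence the generous pauses) and the default page size of 30; it is only meant to illustrate the idea, not taken from the book chapter:

library(jsonlite)

base_url <- 'https://api.github.com/search/users?q=language:r+location:berlin&per_page=30'

# the first request tells us how many results (and hence pages) there are
first_page <- fromJSON(paste0(base_url, '&page=1'))
n_pages <- ceiling(first_page$total_count / 30)

# fetch every page, pausing between requests to respect the rate limit
pages <- lapply(1:n_pages, function(p) {
  Sys.sleep(10)
  fromJSON(paste0(base_url, '&page=', p))$items
})
users <- do.call(rbind, pages)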
A similar problem arises when using, for example, the Google Places API: all results from this API are ranked by some unknown Google algorithm. You can provide a search location and a radius around this location as a “hint” about where to concentrate the spatial search, but there’s no guarantee that the search results will actually lie inside this radius, and especially no guarantee that you get all places that match your search criteria. In fact, you will very likely get results outside your spatial search focus if they’re considered more “popular” by Google’s search algorithm.
So whenever you use an API that involves some sort of searching and ranking, you should be aware that this can introduce bias into your data. These APIs were never developed for collecting research data. By incrementally narrowing search queries and employing different sorts of ranking (which, for example, the GitHub API allows), you may manage to collect the full data for a certain search query, e.g. “all GitHub users in Berlin”, but this is not easy to implement and data collection will take a long time. If an API provides random access to result pages, you may mitigate the negative effect of ranking by collecting data only from randomly sampled result pages.
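A minimal sketch of that last idea, again using the GitHub user search from above; the sample of five pages and the seed are arbitrary assumptions:

library(jsonlite)

base_url <- 'https://api.github.com/search/users?q=language:r+location:berlin&per_page=30'
n_pages <- ceiling(fromJSON(paste0(base_url, '&page=1'))$total_count / 30)

# draw a random subset of result pages instead of only the top-ranked ones
set.seed(20220101)    # arbitrary seed for a reproducible page sample
sampled_pages <- sample(n_pages, size = 5)

pages <- lapply(sampled_pages, function(p) {
  Sys.sleep(10)    # pause between requests to respect the rate limit
  fromJSON(paste0(base_url, '&page=', p))$items
})
users_sample <- do.call(rbind, pages)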
Besides the problems mentioned above, there are further issues with using cloud services and APIs in research, such as hindered replicability, negative data protection implications and increasing dependency on Big Tech. Researchers should keep these issues in mind and should at least report them transparently in their work. When unfamiliar with an API, I encourage you to experiment with it extensively, since documentation and background information for many APIs is still scarce. I also recommend following these rules for robust data collection.
Cloud service and API providers should be aware that their services may be used in research contexts. They should provide concise documentation and, especially, information about their data collection process and possible biases in the data, both of which should be evaluated by external stakeholders. They should provide uncertainty measures and sample sizes in API responses so that researchers can incorporate them into their models. Some sort of version control for API access should be considered for better replicability, and if this is not possible, changes to the service should at least be reported in a version history.
Acknowledgments
Many thanks to Camille Landesvatter and Paul C. Bauer for reading the draft and providing feedback!
Footnotes
[1] See this article in WZB Mitteilungen (only in German), written together with Jonas Wiedner.
[2] A binary categorization of gender is of course a gross oversimplification, but that is not the main point of this post.
[3] Of course you may be aware that this name is used as a unisex name in many cultures, but imagine you try to categorize a large dataset with thousands of names, many of which you’ve never heard before.
[4] The documentation also notes that you can sort by other attributes, but some kind of ranking is always employed, always leaving room for possible bias.