When natural scientists grapple with big data, they generally receive raw information from instruments designed to deliver the material they want in a format they have prepared for. Social and behavioral scientists, in contrast, are often tapping data flows designed by other people – entrepreneurs and governments are key examples – and for different purposes. Think of Facebook, which in its earliest incarnation wasn’t about gathering information at all, merely facilitating connections among Harvard students.
A new white paper from SAGE Publishing, titled Who Is Doing Computational Social Science? Trends in Big Data Research, asked social and behavioral researchers what data sources they drew on and what tools they used to tap those sources. Among respondents already active in computational social science, by far the most common data source in their most recent work was administrative data – data generated by government departments on subjects as diverse as health, education and income. Some 55 percent of respondents reported having used it in their most recent research involving big data.
The next largest source, cited by 29 percent of respondents, was social media data, such as Facebook or Twitter. (Multiple answers were possible.) The third most commonly cited was commercial or proprietary data, at 23 percent. Giving an idea of the scope of what can constitute ‘big data,’ the fourth most common response was photographs, video or audio sources.
Not everyone who completed the survey that informs the white paper has conducted big data research. The survey team initially reached out to more than a half million social science contacts around the world, and 9,412 fully completed the survey. A third of those self-reported that they had recently conducted research using big data.
Here at MethodSpace, we’re unpacking those findings in three posts. The first, available here, looked at who is doing computational social science. This post examines what is being used for computational research, and the last post will discuss the challenges the survey respondents identified. The white paper itself was authored by Katie Metzler, publisher for SAGE Research Methods; David A. Kim, in the Department of Emergency Medicine at Stanford University; Nick Allum, a professor of sociology and research methodology at the University of Essex; and Angella Denman of the University of Essex.
Among that third of respondents who have already conducted computational social science, the tools they use are a prime subject. For example, since big data is by definition ‘big,’ a distributed computing infrastructure is often necessary. Among those who had used such systems, the most commonly used was Hadoop, followed by MapReduce, Hadoop’s processing model, and Spark, a related framework in the Hadoop ecosystem. The authors of the white paper, however, wrote that respondents may have been confused by what counted as “other distributed computing.”
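For readers unfamiliar with how these tools divide up work, the MapReduce idea can be sketched on a single machine in plain Python. This is purely an author’s illustration of the paradigm – a map phase that emits key–value pairs and a reduce phase that aggregates them – not code from the white paper or from Hadoop itself:

```python
from collections import defaultdict

def map_phase(documents):
    # Map step: emit a (word, 1) pair for every word in every document.
    # In a real cluster, this runs in parallel across many machines.
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/reduce step: group pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data research", "big data tools"]
word_counts = reduce_phase(map_phase(docs))
print(word_counts)  # {'big': 2, 'data': 2, 'research': 1, 'tools': 1}
```

Frameworks like Hadoop and Spark add what this toy version omits: splitting the input across a cluster, moving intermediate pairs between machines, and recovering from node failures.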
“Although 579 researchers answered with software that is used for big data research, 1248 respondents used traditional software (SPSS and STATA) for their research. While SPSS and STATA have both been enhanced to handle larger data sets, there is also a possibility that respondents who answered naming a traditional software package were either not working with very large data sets or were working with smaller subsets of a large data set, which is common among researchers in the social sciences engaging with social media data.”
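The “smaller subsets of a large data set” approach the authors mention doesn’t require distributed tools at all: a researcher can stream a large file once and keep a uniform random sample small enough for SPSS or Stata. Here is a minimal sketch using reservoir sampling; the in-memory toy file and the `sample_rows` helper are the author’s illustration, not anything from the survey:

```python
import csv
import io
import random

def sample_rows(csv_file, k, seed=0):
    # Reservoir sampling: stream the file once, keeping a uniform
    # random sample of k rows without loading everything into memory.
    rng = random.Random(seed)
    reader = csv.reader(csv_file)
    header = next(reader)
    sample = []
    for i, row in enumerate(reader):
        if i < k:
            sample.append(row)
        else:
            # Replace an existing sample row with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return header, sample

# Toy stand-in for a large file; in practice this would be
# something like open("posts.csv") on disk.
data = "id,text\n" + "\n".join(f"{i},post {i}" for i in range(1000))
header, sample = sample_rows(io.StringIO(data), k=5)
print(header, len(sample))  # ['id', 'text'] 5
```

The resulting subset can then be analyzed with conventional statistical software, which matches the pattern the white paper describes among social scientists working with social media data.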
The authors also asked active researchers whether they had shared their bespoke code, or any software they had developed, with other researchers. For a majority, the answer was no. Among those who had shared, the most common channel was email (19 percent of the 873 answering this question), followed by submitting supplementary material as part of the publishing process (12 percent). Only 56 respondents reported using GitHub.