The New Digital Divide: Big Data Rich and Big Data Poor

Categories: Big Data

“Data is the new oil.” Clive Humby, mathematician and architect of Tesco’s Clubcard, is credited with first saying this in 2006, and it has been repeated many times in the decade since. The comparison between data and oil rests on value being extracted through refinement, or, in the case of data, through analysis. Unlike oil, however, data is being created faster than it can be consumed or analysed. We’re awash with data. You may have heard it said that “90 percent of all the data in the world has been generated over the last two years.” Or, as Hal R. Varian, chief economist at Google, puts it: “A billion hours ago, modern homo sapiens emerged. A billion minutes ago, Christianity began. A billion seconds ago, the IBM PC was released. A billion Google searches ago … was this morning.”

This post originally appeared on the LSE Impact of Social Sciences blog.

The capacity to collect and analyse massive datasets has already transformed fields such as biology, astronomy, and physics, and for many, the ‘big data revolution’ promises to ask, and help us answer, fundamental questions about individuals and collectives. But who gets access to all this data we’re producing through our increasingly networked and digital lives, and for what purpose?

In 2012, danah boyd and Kate Crawford offered a provocation that limited access to big data was creating a new digital divide between “the Big Data rich and the Big Data poor.” Only companies, and the social scientists working within them, have access to really large social and transactional datasets. The broader scholarly community usually does not, either because companies refuse to release the data or because purchasing it is prohibitively expensive.

Recently, I conducted a survey of more than 9,000 social scientists to learn more about researchers engaged in big data research and the challenges they face, as well as the barriers to entry for those looking to do this kind of research in the future. Thirty-two percent of respondents currently engaged in big data research reported that getting access to commercial or proprietary data was a “big problem” for them:

Figure 1: Challenges facing big data researchers (n = 2,273)

But it isn’t only the question of who can access data that leads to divides. As boyd and Crawford point out, and our survey supports, there is also a skills gap holding social science back: the level of quantitative and programming skill required for big data research makes it a challenge for educators to introduce into traditional social science degree courses, as teaching faculty have little time or expertise to spare:

Figure 2: Challenges facing educators teaching big data (n = 1,212)

Why does it matter?

So who cares if academic social scientists can’t do big data, either because they can’t access the data or because they lack the skills to engage with it? Why not just leave companies like Twitter and Facebook to analyse social media data? Some have even gone so far as to argue that academics should not engage in research that can be done better by industry.

There are a couple of reasons why this is problematic. First, replication is the engine of science, and irreproducible research slows progress. If only researchers within companies can access and analyse big social datasets, “those without access can neither reproduce nor evaluate the methodological claims of those who have privileged access”.

Second, and arguably most importantly, the motivations of industry researchers and social scientists may differ in ways that really matter. Big data research conducted by companies is usually in service of a single overarching goal: to sell you more stuff. Social scientists with the right skills and access to the right data may use their research to contribute to the body of knowledge, with the aim of better understanding and improving social outcomes.

The questions boyd and Crawford pose at the start of their paper summarise this perfectly. They ask:

“Will large-scale search data help us create better tools, services, and public goods? Or will it usher in a new wave of privacy incursions and invasive marketing? Will data analytics help us understand online communities and political movements? Or will it be used to track protesters and suppress speech? Will it transform how we study human communication and culture, or narrow the palette of research options and alter what ‘research’ means?”

As of yet, the answers to these important questions are unclear.

Read more in the recent SAGE Publishing white paper revealing full results of the survey, “Who is Doing Computational Social Science? Trends in Big Data Research.”

One thought on “The New Digital Divide: Big Data Rich and Big Data Poor”

  1. Katie,
    Thanks for this interesting blog post on big data. As a research statistician in healthcare I’ve often wondered when my work will be affected by big data. I have not seen it yet, but the possibilities are lurking out there, somewhere. I think that getting access to big data in healthcare has been hampered by patient privacy (HIPAA) laws. Patient information is heavily guarded by IRBs, computer programs, and financial penalties for violating patient privacy. The largest dataset I’ve worked with was about 64,000 patient records. I don’t know if this qualifies as big data, but we had to jump through the usual hoops to get the patient records.
    Here’s my question about big data. When do you have enough cases to qualify as having your entire population? Inferential statistics is based on the assumption that we are working with a subset of a population. If big data gives us what we might call our entire population, are inferential statistics and unbiased estimates of standard deviations still necessary? So I suppose that another barrier to using big data is acquiring the appropriate data analysis techniques.
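The distinction the comment raises (describing a complete population versus estimating from a sample) can be illustrated numerically. The following is a minimal sketch with invented data: the population standard deviation divides by n, while the unbiased sample estimate applies Bessel's correction and divides by n − 1 (the `ddof` parameter in NumPy).

```python
import numpy as np

# Hypothetical "complete population" of patient ages (illustrative data only).
population = np.array([34, 45, 52, 61, 29, 48, 55, 40], dtype=float)

# If these records truly are the entire population, the descriptive
# (population) standard deviation divides by n: ddof=0.
sigma = population.std(ddof=0)

# If instead they are a sample drawn from a larger population, the unbiased
# variance estimate divides by n - 1 (Bessel's correction): ddof=1.
s = population.std(ddof=1)

# The sample estimate is always at least as large as the population value,
# and the gap shrinks as n grows -- which is why the distinction matters
# less for very large datasets.
print(sigma, s)
```

With millions of records the two values are nearly identical, so in practice the bigger question is the one the comment ends on: whether the records at hand really constitute the population of interest, or a convenience sample from it.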
