Discussions of 'big data' are everywhere. To some it heralds the end of the 'paradigm wars' and the dawn of a new golden age of robust computational social science. For others, 'big data' is far from being a panacea; it is merely another tool in the researcher's toolbox, and one that needs to be subjected to the same methodological questioning as any other method.
What is 'big data' anyway? Do you plan to use 'big data' in your research? What do you think it will offer you? Or is 'big data' no big deal?
First, I found Katie Metzler's post elsewhere on Methodspace about this topic to be quite interesting:
Next, one of my major concerns is that we end up limiting our questions to the kinds of things that the data can readily answer. I do admire the way that new sources of data can lead to new research questions, but we have to put that into context. In particular, the frustrations of working with any kind of "secondary data" are well known.
Along this line, one question that I often ask graduate students on qualifying exams is which of the following two alternatives they personally would prefer: 1) working on a very high quality existing dataset, even though some of the things you want to examine aren't available, or 2) developing your own smaller dataset, which won't be as high quality, but which will do a better job of addressing your particular research questions.
Really good points, David, and I agree with them. Some fundamental methodological basics still apply and it would be folly to forget them in our rush to use the biggest data set, because the biggest data set is not necessarily the best one for our research project.
We need to ask: will the dataset get at what I want to know? Will it do so in a way that is valid and representative of the thing that I am studying in the social state/context that I think is significant? Will the dataset give me the depth that my research question demands? Will it tell us anything beyond the 'bleedin' obvious'?
Andrew Gelman recently gave a talk with the bottom line that big data analysis faces the same problem as small data analysis (http://goo.gl/WXvtW). I fully agree (and also with the previous posts). I think big data is mainly a label for really huge datasets that were largely unavailable prior to the advent of the internet and computational social science. But big data is not worth much for sound causal inference if the design is not well crafted. If big data does not include a variable I need to control for in order to rule out the common cause problem, big data is not worth much because it is still only about an association. Moreover, with big data one is more likely to find significant effects that are substantively irrelevant.
Svend Brinkmann (2012) argues for the antithesis of big data - the philosophically and theoretically informed sociological and psychological analysis of everyday life experiences and objects - and effectively illustrates how richly meaningful insights can be gained from small data (I have just posted a review of this book for the Book Reviews group).
Agreed, but rich insights are not the goal of big data. Big data is about broad insights (meant as large n).
I think most (but not all) social scientists would agree broadly with Pat and Ingo. 'Rich insights' are not the goal of big data. But many big data advocates reside in the corporate world, where big data is being presented as a methodological miracle prescription, affording banks, big MNCs and marketing companies the richest of insights about their customers.
Corporate leaders never fail to be dazzled by big numbers and perhaps few have the patience or the inclination to see the value of Brinkmannesque approaches to research.
Probably, I should have said deep insights, not rich insights (meant to be many insights). Corporate managers might be interested in many insights about consumers, but not in why exactly a consumer behaves as he/she behaves. The same holds for voters, actually, because both the Obama and Romney campaigns use big data to target voters. As long as big data works and one finds patterns allowing one to increase turnout, it does not matter why this holds.
Good point, I agree - big data sets are still subject to the same fundamental limitations as small data sets; it's the design that matters, and how you control for your potential confounds. In relation to your final sentence: 'Moreover, with big data one is more likely to find significant effects that are substantively irrelevant' - I can see the relevance here if one is using null hypothesis significance testing, but would Bayesian statistics not avoid this? Second to that, even though a significant effect will always be found with a large enough sample size, so long as we have access to the effect size then we're always able to determine the utility of the finding...
For big data (and in general), we need to examine the effect size alongside the statistical significance. My point referred to the tendency in quantitative research to focus on the latter and ignore the former.
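To make the effect size point concrete, here is a minimal sketch in Python. It assumes two equal-sized groups with a known common standard deviation and uses a simple two-sample z-test; the function name and the numbers (a difference of 0.01 standard deviations) are made up for illustration and do not come from any dataset mentioned in this thread.

```python
import math

def two_sample_z_test(mean_diff, sd, n_per_group):
    """Return (z, two-sided p-value, Cohen's d) for two equal-sized groups.

    Assumes a known common standard deviation `sd` (illustrative only).
    """
    se = sd * math.sqrt(2.0 / n_per_group)   # standard error of the difference in means
    z = mean_diff / se
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided tail area of the standard normal
    d = mean_diff / sd                       # standardized effect size
    return z, p, d

# Same tiny effect at three (hypothetical) sample sizes
for n in (100, 10_000, 1_000_000):
    z, p, d = two_sample_z_test(mean_diff=0.01, sd=1.0, n_per_group=n)
    print(f"n per group={n:>9,}  d={d:.2f}  z={z:5.2f}  p={p:.3g}")
```

The effect size stays the same on every row while the p-value shrinks as n grows, which is why the two need to be read together.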
I do not know how big data plays out for Bayesianism, very interesting question. Does anyone have thoughts on that?
As one who has dabbled in the mysterious Bayesian arts, here is how I see it.
Bayesian analysis has two rough counterparts to the p-value, and they are worth keeping apart: Bayesian p-values and Bayes factors. In their simplest form, Bayesian p-values refer to the probability of getting a result equal to or more extreme than the observed one under a distribution generated from the posterior. So if an observed value falls into the <.05 tail of that distribution, we could reject the null hypothesis. A Bayes factor, by contrast, is the ratio of how well two competing hypotheses predict the data. For Bayesian p-values the logic is essentially the same as the frequentist approach, but the frequentist approach relies on theoretical distributions and the Bayes approach relies on posterior distributions created through, for example, Markov Chain Monte Carlo sampling.
I am currently evaluating whether Bayes' p-values are better at avoiding Type II errors, particularly with smaller data sets, than frequentist approaches. And I am evaluating whether they are better at avoiding Type I errors. I have read several Bayes sources and few mention this comparison because, as you probably know, Bayes takes on a different interpretation. However, Spiegelhalter et al. (2004) claim that "properly constructed Bayes factors can, for large sample sizes relative to the prior precision, support the null hypothesis when a classical analysis would lead to its rejection." This seems to suggest that Bayesian analysis may better guard against Type I errors when the sample size is very large.
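One way to see what that quotation is describing is a small sketch, under assumptions that are mine and not Spiegelhalter et al.'s: a normal mean with known standard deviation, a point null (mean = 0), and a zero-centred normal prior on the mean under the alternative. In that textbook setup the Bayes factor in favour of the null has a closed form that depends only on the z-statistic, the sample size and the prior spread. The function name and default values are hypothetical.

```python
import math

def bf01_point_null(z, n, prior_sd=1.0, sigma=1.0):
    """Bayes factor in favour of H0: mean = 0 against H1: mean ~ Normal(0, prior_sd^2).

    Assumes normal data with known standard deviation `sigma` (illustrative setup).
    Values above 1 favour the null; values below 1 favour the alternative.
    """
    ratio = n * prior_sd**2 / sigma**2       # sample size relative to prior precision
    return math.sqrt(1.0 + ratio) * math.exp(-0.5 * z**2 * ratio / (1.0 + ratio))

# Hold the classical result fixed at z = 1.96 (two-sided p of about .05)
for n in (10, 1_000, 1_000_000):
    print(f"n={n:>9,}  BF01={bf01_point_null(1.96, n):.3g}")
```

With the z-statistic held at the conventional rejection threshold, the Bayes factor moves from mildly favouring the alternative at small n to favouring the null at very large n. This behaviour is often discussed under the name Lindley's paradox, and it depends heavily on the prior chosen for the alternative.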
I have another thought on Big Data. We all know that inferential statistics are done because we are unable to study the entire population, so we draw samples. With Big Data do we reach a point where we have the entire population or an identical subset of the population? If so, inferential statistics are not needed, according to theory. For example, if you want to know whether group A differs from group B on some outcome variable, and you have the entire population for both groups, theoretically you just compare means and determine if the differences are meaningful. It is kind of a weird thought.
Haha I like that last comment.. you could potentially test this out if you identified a small, specific population.
In addition to your comments re Bayes, I think an important part of the argument (i.e. from Cohen etc) is the fact that NHST only tests the null hypothesis - whereas Bayes methods test the probabilities of both the null and the explanatory hypotheses.
Glad my first post was interesting David. See also my post on the Big Data conference at the OII!