Theory and tools in the age of big data

Dec 4

By Carolina Mattsson, PhD Candidate, Network Science, Lazer Lab, Network Science Institute, Northeastern University.

Back in February, I had the privilege of attending Social Science Foo Camp*, a flexible-schedule conference hosted in part by SAGE at Facebook HQ, where questions of progress in the age of Big Data were a major topic of discussion. What struck me most about these conversations is that using or advocating for Big Data is one thing, but making sense of it within an established discipline, in order to do science and scholarship, is quite another.

Computational social science is a tremendously dynamic field, and so is digital humanities. When done well, research that thoughtfully integrates Big Data with an existing discipline can bring with it a new vitality. I have been lucky to be exposed to superb examples since early in my graduate career. The NULab at Northeastern University introduced me to projects like Viral Texts and #HashtagActivism. At the Network Science Institute, I have had a window into insightful projects on Chinese Censorship and Fake News. There are researchers out there integrating Big Data with history and political science and communications and literature and sociology and more.

The best of this work makes sense of the data in the context of the discipline. I have come to conceptualize "making sense of Big Data" as having three essential components: a grounded understanding of the data itself, a solid grasp of relevant disciplinary theory, and appropriate computational tools that bring theory and data together. Research that glosses over any of these elements will miss the mark, no matter how big the data, how fancy the tools or how beautiful the theory.

Research that brings all three pieces to the table can be transformative, opening up new questions that invigorate a discipline's research agenda. However, doing such work can be hard—very hard. It is often unclear, even, what a successful project would look like. The challenging aspects of research are magnified when you can't take anything for granted: not the data, not the theory, and not even the tools.

Figure: A stylized equation of the three components of innovation in computational social science.

Knowing your data

Using any kind of social media data or administrative records or large digital corpora for scholarship requires an in-depth understanding of how that data got there in the first place and why. Now it may seem obvious that you won’t know what you’re looking at unless you know what you’re looking at, but that doesn’t make it any easier.

Data collection is difficult enough in the best of cases that there are often strong disciplinary norms surrounding it; you can find best practices in administering surveys, conducting interviews, annotating primary sources, observing chimpanzees, and even calibrating dual confocal microscopes. While a discipline’s best practices are rarely transferable to new sources of data, the norms underlying them exist for a reason. Throwing them away to use Big Data exposes a scholar to all the pitfalls that generations before them have worked to prevent. Knowing your data means adapting the norms of a discipline to the particulars of a large digital data set, and this requires real methodological work. Unfortunately, this is often difficult to convey. Many of those working in a discipline are accustomed to following a field’s best practices or contributing to refinements of existing techniques, but extending disciplinary norms to new kinds of data collection necessarily starts at a more basic level. This means that expectations for this kind of contribution can be unattainably high without considerable finesse in framing.

Integrating relevant theory

Big Data lends itself to descriptive work and atheoretical prediction, but this bumps up against the reality that stumbling upon something truly novel is quite rare in research. In most disciplines, scholars have been studying the relevant phenomena for decades, building up a deep descriptive lexicon and a nuanced understanding of causal processes. Assuming there would be something entirely missing from disciplinary consideration but easy to spot using Big Data betrays considerable hubris; disciplines are right to be skeptical of bold claims. It is better for everyone if we acknowledge that Big Data is most likely to pick up on known phenomena, but also that observing these phenomena using new and different data still advances the discipline. The issue is that making a compelling case for this approach can be, again, quite difficult. It requires articulating how the particulars of the data give us new traction on existing theory within the discipline. Even when you’re lucky and the connections are straightforward (such as social network data to social network theory), it takes additional work to spell those connections out. This tends to demand more than a passing familiarity with the existing literature. When done well, the research explicitly contributes to existing theory even when the substantive results would speak for themselves to a sympathetic audience. Preferably, of course, even attempts to integrate relevant theory that fall short of this ideal would be recognized as contributions. In reality, for many fields, the bar can be high.

Using appropriate computational tools

Knowing your data and integrating relevant theory can’t happen in practice without computational tools that can bring them together. To advance social network theory using social network data you need network analysis tools that help you measure relevant quantities. To study innovations in style using large corpora of literary works you need tools for distant reading. If you’re lucky then the appropriate tool has already been developed for a compatible purpose, but oftentimes the appropriate tool simply does not yet exist. The constraints of a new source of data under the guidance of relevant theory can immediately put a scholar outside of what’s possible with current off-the-shelf tools. Going forward then means either qualifying the scholarship accordingly or adapting the tools to suit. Both of these directions require a higher level of mastery than simply applying a set of tools ever would. Again, this makes interdisciplinary scholarship with new digital sources of data especially difficult both to do and to communicate.

Doing science and scholarship is never a simple endeavor to begin with, but making sense of Big Data in the context of an established discipline challenges a researcher on all three of these fronts at once. For those of us just starting out, that can be downright daunting. Trainees need to learn the best practices in data collection, the relevant theory, and the appropriate tools before they can master them well enough to produce publishable research. For computational social science this takes more time; it takes longer to learn, longer to master, and longer to publish when there is nothing you can take as given. It also takes a measure of confidence; there will always be those who have gotten further in the data or the theory or the tools. And it takes added stubbornness; trainees will face added challenges in learning, research, and publishing.

The tension between the promise and difficulty of computational social science is not always recognized by established disciplines or by proponents of interdisciplinary work. Highlighting the promise brings in many trainees, but we must also support our trainees against the added difficulties of doing research in this field if it is to flourish. All the uncertainty and self-doubt that come along with beginning a research career are amplified, and the institutions we build for training computational social scientists must incorporate this reality.

In my own experience, navigating the added uncertainty requires positive examples and supportive mentorship. Unless you know where you are going it can be very easy to get lost along the way. For me, being surrounded by successful interdisciplinary projects gave me something to refer back to in my own work. It also helped me pin down the coherent ideas that underlie computational social science. This is, effectively, what disciplines do. They establish a notion of what is successful research, which provides trainees with a sense of direction and a subtle confidence in the process. Crucially, my exposure to computational social science as a coherent discipline came before I ran headlong into all the challenges I’ve described.

Most important for maintaining my confidence and purpose through the added challenges have been candid and supportive mentors, both at the Network Science Institute and elsewhere. Kind strangers on Twitter have made a world of difference. Trainees should seek out mentors, of course, but the onus cannot fall only on students. Supportive and candid mentorship can and should be a part of institutional practice. For instance, the Society for Young Network Scientists runs a series of events that we call "Paper Unwinds" where established researchers tell us of the rejections, disciplinary faux pas, and frustrations behind their own successful projects. The Young Scholar Initiative of the Institute for New Economic Thinking puts on student-centric conferences with designated mentors who give supportive feedback to participants.

As I reach the end of my graduate program I realize that I'm not a success story, yet. I've done what I can to make sense of large-scale financial transaction data in the context of economics. There's no guarantee that payment providers will trust me with data in the future, that economists will make the extra effort to understand the implications of my work, or that the code I've written is as robust as I hope it is. But with the positive examples and candid, supportive mentorship I've found during my PhD, I still do trust the process. Here's hoping I'm right to!

*Social Science Foo Camp is organized by Facebook, O’Reilly Media, SAGE Publishing and the Alfred P. Sloan Foundation.

About

I am a PhD Candidate in Network Science, and in May I will become one of the first to graduate from this innovative interdisciplinary program at Northeastern University. For my dissertation, I have delved into the most minute details of mobile money payment systems and built up a broadly applicable way to analyze large-scale transaction data. Along the way I have found that the movement of money can be a fascinating perspective to take on Economics. Follow me @CarolinaMttssn.
