What do researchers need to know about using datasets?

By Janet Salmons, PhD, Research Community Manager, Sage Methodspace


Want to learn more about research with datasets? This curated collection of open-access articles can help you understand the defining characteristics of Big Data and develop the data literacy skills needed to work with large datasets and machine learning tools for managing Big Data sources.

Characteristics of Big Data

Kitchin, R., & McArdle, G. (2016). What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets. Big Data & Society. https://doi.org/10.1177/2053951716631130

Big Data has been variously defined in the literature. In the main, definitions suggest that Big Data possess a suite of key traits: volume, velocity and variety (the 3Vs), but also exhaustivity, resolution, indexicality, relationality, extensionality and scalability. However, these definitions lack ontological clarity, with the term acting as an amorphous, catch-all label for a wide selection of data. In this paper, we consider the question ‘what makes Big Data, Big Data?’, applying Kitchin’s taxonomy of seven Big Data traits to 26 datasets drawn from seven domains, each of which is considered in the literature to constitute Big Data. The results demonstrate that only a handful of datasets possess all seven traits, and some do not possess either volume and/or variety. Instead, there are multiple forms of Big Data. Our analysis reveals that the key definitional boundary markers are the traits of velocity and exhaustivity. We contend that Big Data as an analytical category needs to be unpacked, with the genus of Big Data further delineated and its various species identified. It is only through such ontological work that we will gain conceptual clarity about what constitutes Big Data, formulate how best to make sense of it, and identify how it might be best used to make sense of the world.

Lupton, D. (2018). How do data come to matter? Living and becoming with personal data. Big Data & Society. https://doi.org/10.1177/2053951718786314

Humans have become increasingly datafied with the use of digital technologies that generate information with and about their bodies and everyday lives. The onto-epistemological dimensions of human–data assemblages and their relationship to bodies and selves have yet to be thoroughly theorised. In this essay, I draw on key perspectives espoused in feminist materialism, vital materialism and the anthropology of material culture to examine the ways in which these assemblages operate as part of knowing, perceiving and sensing human bodies. I draw particularly on scholarship that employs organic metaphors and concepts of vitality, growth, making, articulation, composition and decomposition. I show how these metaphors and concepts relate to and build on each other, and how they can be applied to think through humans’ encounters with their digital data. I argue that these theoretical perspectives work to highlight the material and embodied dimensions of human–data assemblages as they grow and are enacted, articulated and incorporated into everyday lives.

Resnyansky, L. (2019). Conceptual frameworks for social and cultural Big Data analytics: Answering the epistemological challenge. Big Data & Society. https://doi.org/10.1177/2053951718823815

This paper aims to contribute to the development of tools to support an analysis of Big Data as manifestations of social processes and human behaviour. Such a task demands both an understanding of the epistemological challenge posed by the Big Data phenomenon and a critical assessment of the offers and promises coming from the area of Big Data analytics. This paper draws upon the critical social and data scientists’ view on Big Data as an epistemological challenge that stems not only from the sheer volume of digital data but, predominantly, from the proliferation of the narrow-technological and the positivist views on data. Adoption of the social-scientific epistemological stance presupposes that digital data was conceptualised as manifestations of the social. In order to answer the epistemological challenge, social scientists need to extend the repertoire of social scientific theories and conceptual frameworks that may inform the analysis of the social in the age of Big Data. However, an ‘epistemological revolution’ discourse on Big Data may hinder the integration of the social scientific knowledge into the Big Data analytics.

Stewart, R. (2021). Big data and Belmont: On the ethics and research implications of consumer-based datasets. Big Data & Society, 8(2). https://doi.org/10.1177/20539517211048183

Consumer-based datasets are the products of data brokerage firms that agglomerate millions of personal records on the adult US population. This big data commodity is purchased by both companies and individual clients for purposes such as marketing, risk prevention, and identity searches. The sheer magnitude and population coverage of available consumer-based datasets and the opacity of the business practices that create these datasets pose emergent ethical challenges within the computational social sciences that have begun to incorporate consumer-based datasets into empirical research. To directly engage with the core ethical debates around the use of consumer-based datasets within social science research, I first consider two case study applications of consumer-based dataset-based scholarship. I then focus on three primary ethical dilemmas within consumer-based datasets regarding human subject research, participant privacy, and informed consent in conversation with the principles of the seminal Belmont Report.

Data Literacy

Corrall, S. (2019). Repositioning data literacy as a mission-critical competence. In ACRL 2019: Recasting the Narrative, April 10-13, 2019, Cleveland, OH.

With data rapidly replacing information as the currency of research, business, government, and healthcare, is it time for librarians to make data literacy central to their professional mission, take on roles as interdisciplinary mediators, and lead the data literacy movement on campus? Join the data literacy debate and discuss what librarians can do to cut across the disciplinary and professional silos now threatening the development of lifewide data literacy. Investigate and critique diverse conceptions and pedagogies for data literacy, and experiment with the MAW theory of stakeholder saliency to identify individuals and groups to target in your data literacy initiatives.

Gray, J., Gerlitz, C., & Bounegru, L. (2018). Data infrastructure literacy. Big Data & Society. https://doi.org/10.1177/2053951718786316

A recent report from the UN makes the case for “global data literacy” in order to realise the opportunities afforded by the “data revolution”. Here and in many other contexts, data literacy is characterised in terms of a combination of numerical, statistical and technical capacities. In this article, we argue for an expansion of the concept to include not just competencies in reading and working with datasets but also the ability to account for, intervene around and participate in the wider socio-technical infrastructures through which data is created, stored and analysed – which we call “data infrastructure literacy”. We illustrate this notion with examples of “inventive data practice” from previous and ongoing research on open data, online platforms, data journalism and data activism. Drawing on these perspectives, we argue that data literacy initiatives might cultivate sensibilities not only for data science but also for data sociology, data politics as well as wider public engagement with digital data infrastructures. The proposed notion of data infrastructure literacy is intended to make space for collective inquiry, experimentation, imagination and intervention around data in educational programmes and beyond, including how data infrastructures can be challenged, contested, reshaped and repurposed to align with interests and publics other than those originally intended.

Koltay, T. (2017). Data literacy for researchers and data librarians. Journal of Librarianship and Information Science, 49(1), 3–14. https://doi.org/10.1177/0961000615616450

This paper describes data literacy and emphasizes its importance. Data literacy is vital for researchers who need to become data-literate science workers and also for (potential) data management professionals. Its important characteristic is a close connection and similarity to information literacy. To support this argument, a review of literature was undertaken on the importance of data and the data-intensive paradigm of scientific research, researchers’ expected and real behaviour, the nature of research data management, the possible roles of the academic library, data quality and data citation. Besides describing the nature of data literacy and enumerating the related skills, the application of phenomenographic approaches to data literacy and its relationship to the digital humanities have been identified as subjects for further investigation.

Nguyen, D. (2021). Mediatisation and datafication in the global COVID-19 pandemic: on the urgency of data literacy. Media International Australia, 178(1), 210–214. https://doi.org/10.1177/1329878X20947563

In the COVID-19 pandemic, societal discourses and social interaction are subject to rapid mediatisation and digitalisation, which accelerate datafication. This indicates urgency for increasing data literacy: individual abilities in understanding and critically assessing datafication and its social implications. Immediate challenges concern misconceptions about the crisis, data misuses, widening (social) divides and (new) data biases. Citizens need to be on guard in respect to the crisis’ impact on the next stages of the digital transformation.

Poirier, L. (2021). Reading datasets: Strategies for interpreting the politics of data signification. Big Data & Society, 8(2). https://doi.org/10.1177/20539517211029322

All datasets emerge from and are enmeshed in power-laden semiotic systems. While emerging data ethics curriculum is supporting data science students in identifying data biases and their consequences, critical attention to the cultural histories and vested interests animating data semantics is needed to elucidate the assumptions and political commitments on which data rest, along with the externalities they produce. In this article, I introduce three modes of reading that can be engaged when studying datasets—a denotative reading (extrapolating the literal meaning of values in a dataset), a connotative reading (tracing the socio-political provenance of data semantics), and a deconstructive reading (seeking what gets Othered through data semantics and structure). I then outline how I have taught students to engage these methods when analyzing three datasets in Data and Society—a course designed to cultivate student competency in politically aware data analysis and interpretation. I show how combined, the reading strategies prompt students to grapple with the double binds of perceiving contemporary problems through systems of representation that are always situated, incomplete, and inflected with diverse politics. While I introduce these methods in the context of teaching, I argue that the methods are integral to any data practice in the conclusion.

Machine Learning Tools

Denton, E., Hanna, A., Amironesei, R., Smart, A., & Nicole, H. (2021). On the genealogy of machine learning datasets: A critical history of ImageNet. Big Data & Society, 8(2). https://doi.org/10.1177/20539517211035955

In response to growing concerns of bias, discrimination, and unfairness perpetuated by algorithmic systems, the datasets used to train and evaluate machine learning models have come under increased scrutiny. Many of these examinations have focused on the contents of machine learning datasets, finding glaring underrepresentation of minoritized groups. In contrast, relatively little work has been done to examine the norms, values, and assumptions embedded in these datasets. In this work, we conceptualize machine learning datasets as a type of informational infrastructure, and motivate a genealogy as method in examining the histories and modes of constitution at play in their creation. We present a critical history of ImageNet as an exemplar, utilizing critical discourse analysis of major texts around ImageNet’s creation and impact. We find that assumptions around ImageNet and other large computer vision datasets more generally rely on three themes: the aggregation and accumulation of more data, the computational construction of meaning, and making certain types of data labor invisible. By tracing the discourses that surround this influential benchmark, we contribute to the ongoing development of the standards and norms around data development in machine learning and artificial intelligence research.

Fournier-Tombs, E., & MacKenzie, M. K. (2021). Big data and democratic speech: Predicting deliberative quality using machine learning techniques. Methodological Innovations. https://doi.org/10.1177/20597991211010416

This article explores techniques for using supervised machine learning to study discourse quality in large datasets. We explain and illustrate the computational techniques that we have developed to facilitate a large-scale study of deliberative quality in Canada’s three northern territories: Yukon, Northwest Territories, and Nunavut. This larger study involves conducting comparative analyses of hundreds of thousands of parliamentary speech acts since the creation of Nunavut 20 years ago. Without computational techniques, we would be unable to conduct such an ambitious and comprehensive analysis of deliberative quality. The purpose of this article is to demonstrate the machine learning techniques that we have developed with the hope that they might be used and improved by other communications scholars who are interested in conducting textual analyses using large datasets. Other possible applications of these techniques might include analyses of campaign speeches, party platforms, legislation, judicial rulings, online comments, newspaper articles, and television or radio commentaries.
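To make the general workflow concrete, here is a minimal, hypothetical sketch of supervised text classification in Python: a model is trained on a handful of hand-labelled speech acts and then applied to unlabelled text. The example texts, labels, and scikit-learn pipeline are illustrative assumptions only, not the authors' actual data or model.

```python
# Minimal, hypothetical sketch of supervised classification of speech acts.
# The texts, labels, and model choice are illustrative only, not the
# pipeline used by Fournier-Tombs and MacKenzie.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A few hand-coded speech acts (1 = higher deliberative quality, 0 = lower).
train_texts = [
    "I appreciate the member's point and would add evidence from the report.",
    "The honourable member is simply wrong and wasting our time.",
    "Let us consider how this bill affects communities in the North.",
    "That question does not deserve an answer.",
]
train_labels = [1, 0, 1, 0]

# TF-IDF features plus logistic regression as a simple baseline classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(train_texts, train_labels)

# Apply the trained classifier to unlabelled speech acts from the corpus.
new_texts = ["I thank the member for raising this important concern."]
print(model.predict(new_texts))          # predicted quality label
print(model.predict_proba(new_texts))    # class probabilities
```

In a full study, the hand-coded training set would be far larger and the predictions would be validated against human coders before scaling up to hundreds of thousands of speech acts.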

Gray, J. E., & Suzor, N. P. (2020). Playing with machines: Using machine learning to understand automated copyright enforcement at scale. Big Data & Society. https://doi.org/10.1177/2053951720919963

This article presents the results of methodological experimentation that utilises machine learning to investigate automated copyright enforcement on YouTube. Using a dataset of 76.7 million YouTube videos, we explore how digital and computational methods can be leveraged to better understand content moderation and copyright enforcement at a large scale. We used the BERT language model to train a machine learning classifier to identify videos in categories that reflect ongoing controversies in copyright takedowns. We use this to explore, in a granular way, how copyright is enforced on YouTube, using both statistical methods and qualitative analysis of our categorised dataset. We provide a large-scale systematic analysis of removal rates from Content ID’s automated detection system and the largely automated, text-search-based Digital Millennium Copyright Act notice and takedown system. These are complex systems that are often difficult to analyse, and YouTube only makes available data at high levels of abstraction. Our analysis provides a comparison of different types of automation in content moderation, and we show how these different systems play out across different categories of content. We hope that this work provides a methodological base for continued experimentation with the use of digital and computational methods to enable large-scale analysis of the operation of automated systems.
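As a rough illustration of the kind of BERT-based classification the authors describe, the sketch below applies a generic pretrained transformer classifier to video titles. The model name and the titles are stand-in assumptions; the authors' fine-tuned classifier and categories are not reproduced here.

```python
# Hypothetical sketch of transformer-based text classification with the
# Hugging Face transformers library; not the authors' actual classifier,
# categories, or data.
from transformers import pipeline

# A generic pretrained sentiment model stands in for a classifier
# fine-tuned to detect copyright-relevant video categories.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

titles = [
    "Full album upload - remastered 1972 live concert",
    "My acoustic cover of a popular song",
]
for title in titles:
    print(title, "->", classifier(title))
```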

Hansen, K. B. (2020). The virtue of simplicity: On machine learning models in algorithmic trading. Big Data & Society. https://doi.org/10.1177/2053951720926558

Machine learning models are becoming increasingly prevalent in algorithmic trading and investment management. The spread of machine learning in finance challenges existing practices of modelling and model use and creates a demand for practical solutions for how to manage the complexity pertaining to these techniques. Drawing on interviews with quants applying machine learning techniques to financial problems, the article examines how these people manage model complexity in the process of devising machine learning-powered trading algorithms. The analysis shows that machine learning quants use Ockham’s razor – things should not be multiplied without necessity – as a heuristic tool to prevent excess model complexity and secure a certain level of human control and interpretability in the modelling process. I argue that understanding the way quants handle the complexity of learning models is a key to grasping the transformation of the human’s role in contemporary data and model-driven finance. The study contributes to social studies of finance research on the human–model interplay by exploring it in the context of machine learning model use.

Jacobsen, B. N. (2023). Machine learning and the politics of synthetic data. Big Data & Society, 10(1). https://doi.org/10.1177/20539517221145372

Machine-learning algorithms have become deeply embedded in contemporary society. As such, ample attention has been paid to the contents, biases, and underlying assumptions of the training datasets that many algorithmic models are trained on. Yet, what happens when algorithms are trained on data that are not real, but instead data that are ‘synthetic’, not referring to real persons, objects, or events? Increasingly, synthetic data are being incorporated into the training of machine-learning algorithms for use in various societal domains. There is currently little understanding, however, of the role played by and the ethicopolitical implications of synthetic training data for machine-learning algorithms. In this article, I explore the politics of synthetic data through two central aspects: first, synthetic data promise to emerge as a rich source of exposure to variability for the algorithm. Second, the paper explores how synthetic data promise to place algorithms beyond the realm of risk. I propose that an analysis of these two areas will help us better understand the ways in which machine-learning algorithms are envisioned in the light of synthetic data, but also how synthetic training data actively reconfigure the conditions of possibility for machine learning in contemporary society.
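As a purely illustrative sketch of what training on "data that are not real" can look like in practice, the example below fits a model to artificially generated records using scikit-learn's built-in generator; this is an assumed stand-in for the domain-specific synthetic data pipelines the article discusses.

```python
# Minimal sketch: training and evaluating a model entirely on synthetic data.
# make_classification produces artificial records that refer to no real
# persons, objects, or events; it is a stand-in for the domain-specific
# synthetic data generators discussed in the article.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("Accuracy on held-out synthetic data:", model.score(X_test, y_test))
```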

Jaton, F. (2021). Assessing biases, relaxing moralism: On ground-truthing practices in machine learning design and application. Big Data & Society. https://doi.org/10.1177/20539517211013569

This theoretical paper considers the morality of machine learning algorithms and systems in the light of the biases that ground their correctness. It begins by presenting biases not as a priori negative entities but as contingent external referents—often gathered in benchmarked repositories called ground-truth datasets—that define what needs to be learned and allow for performance measures. I then argue that ground-truth datasets and their concomitant practices—that fundamentally involve establishing biases to enable learning procedures—can be described by their respective morality, here defined as the more or less accounted experience of hesitation when faced with what pragmatist philosopher William James called “genuine options”—that is, choices to be made in the heat of the moment that engage different possible futures. I then stress three constitutive dimensions of this pragmatist morality, as far as ground-truthing practices are concerned: (I) the definition of the problem to be solved (problematization), (II) the identification of the data to be collected and set up (databasing), and (III) the qualification of the targets to be learned (labeling). I finally suggest that this three-dimensional conceptual space can be used to map machine learning algorithmic projects in terms of the morality of their respective and constitutive ground-truthing practices. Such techno-moral graphs may, in turn, serve as equipment for greater governance of machine learning algorithms and systems.

Thylstrup, N. B., Hansen, K. B., Flyverbom, M., & Amoore, L. (2022). Politics of data reuse in machine learning systems: Theorizing reuse entanglements. Big Data & Society, 9(2). https://doi.org/10.1177/20539517221139785

Policy discussions and corporate strategies on machine learning are increasingly championing data reuse as a key element in digital transformations. These aspirations are often coupled with a focus on responsibility, ethics and transparency, as well as emergent forms of regulation that seek to set demands for corporate conduct and the protection of civic rights. Protective measures include methods of traceability and assessments of ‘good’ and ‘bad’ datasets and algorithms that are considered to be traceable, stable and contained. However, these ways of thinking about both technology and ethics obscure a fundamental issue, namely that machine learning systems entangle data, algorithms and more-than-human environments in ways that challenge a well-defined separation. This article investigates the fundamental fallacy of most data reuse strategies, as well as their regulation and mitigation strategies: that data can somehow be followed, contained and controlled in machine learning processes. Instead, the article argues that we need to understand the reuse of data as an inherently entangled phenomenon. To examine this tension between the discursive regimes and the realities of data reuse, we advance the notion of reuse entanglements as an analytical lens. The main contribution of the article is the conceptualization of reuse that places entanglements at its core and the articulation of its relevance using empirical illustrations. This is important, we argue, for our understanding of the nature of data and algorithms, for the practical uses of data and algorithms, and our attitudes regarding ethics, responsibility and regulation.

