What do researchers need to know about using datasets?

Categories: Big Data, Data Collection, Online Research, Other, Research, Research Design, Research Skills



In May we are focusing on Finding Data in Documents and Datasets. You will find the unfolding series through this link. Explore the whole 2021 series on stages of the research process: Finding the Question, Choosing Methodology and Methods, Designing an Ethical Study, and Collecting Data from & with Participants.


Want to learn more about research with datasets? This post includes a curated collection of open-access articles on the defining characteristics of Big Data, the data literacy skills needed to understand and work with large datasets, and machine learning tools for managing Big Data sources.

Characteristics of Big Data

Kitchin, R., & McArdle, G. (2016). What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets. Big Data & Society. https://doi.org/10.1177/2053951716631130

Big Data has been variously defined in the literature. In the main, definitions suggest that Big Data possess a suite of key traits: volume, velocity and variety (the 3Vs), but also exhaustivity, resolution, indexicality, relationality, extensionality and scalability. However, these definitions lack ontological clarity, with the term acting as an amorphous, catch-all label for a wide selection of data. In this paper, we consider the question ‘what makes Big Data, Big Data?’, applying Kitchin’s taxonomy of seven Big Data traits to 26 datasets drawn from seven domains, each of which is considered in the literature to constitute Big Data. The results demonstrate that only a handful of datasets possess all seven traits, and some do not possess either volume and/or variety. Instead, there are multiple forms of Big Data. Our analysis reveals that the key definitional boundary markers are the traits of velocity and exhaustivity. We contend that Big Data as an analytical category needs to be unpacked, with the genus of Big Data further delineated and its various species identified. It is only through such ontological work that we will gain conceptual clarity about what constitutes Big Data, formulate how best to make sense of it, and identify how it might be best used to make sense of the world.
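
The trait-by-dataset comparison at the core of this paper lends itself to a simple tabular representation. The sketch below is my own illustration with hypothetical sources and made-up trait assessments (it does not reproduce any of the 26 datasets from the study); it simply shows how a boolean trait matrix can be tabulated and queried for sources that carry all seven traits.

```python
# A minimal sketch with made-up values, not data from the paper: encode a few
# data sources as a boolean matrix over the seven traits and check which
# sources exhibit all of them.
import pandas as pd

traits = ["volume", "velocity", "variety", "exhaustivity",
          "resolution/indexicality", "relationality", "extensionality/scalability"]

# Hypothetical trait assessments for three illustrative sources.
matrix = pd.DataFrame(
    {
        "mobile phone records": [True,  True,  False, True,  True,  True,  True],
        "social media posts":   [True,  True,  True,  False, False, True,  True],
        "national census":      [True,  False, False, True,  True,  True,  False],
    },
    index=traits,
).T

matrix["all_seven"] = matrix.all(axis=1)
print(matrix)
```

In the paper's terms, sources that fail the `all_seven` test may still be Big Data of a different "species", with velocity and exhaustivity acting as the key boundary markers.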

Lupton, D. (2018). How do data come to matter? Living and becoming with personal data. Big Data & Society. https://doi.org/10.1177/2053951718786314

Humans have become increasingly datafied with the use of digital technologies that generate information with and about their bodies and everyday lives. The onto-epistemological dimensions of human–data assemblages and their relationship to bodies and selves have yet to be thoroughly theorised. In this essay, I draw on key perspectives espoused in feminist materialism, vital materialism and the anthropology of material culture to examine the ways in which these assemblages operate as part of knowing, perceiving and sensing human bodies. I draw particularly on scholarship that employs organic metaphors and concepts of vitality, growth, making, articulation, composition and decomposition. I show how these metaphors and concepts relate to and build on each other, and how they can be applied to think through humans’ encounters with their digital data. I argue that these theoretical perspectives work to highlight the material and embodied dimensions of human–data assemblages as they grow and are enacted, articulated and incorporated into everyday lives.

Resnyansky, L. (2019). Conceptual frameworks for social and cultural Big Data analytics: Answering the epistemological challenge. Big Data & Society. https://doi.org/10.1177/2053951718823815

This paper aims to contribute to the development of tools to support an analysis of Big Data as manifestations of social processes and human behaviour. Such a task demands both an understanding of the epistemological challenge posed by the Big Data phenomenon and a critical assessment of the offers and promises coming from the area of Big Data analytics. This paper draws upon the critical social and data scientists’ view on Big Data as an epistemological challenge that stems not only from the sheer volume of digital data but, predominantly, from the proliferation of the narrow-technological and the positivist views on data. Adoption of the social-scientific epistemological stance presupposes that digital data was conceptualised as manifestations of the social. In order to answer the epistemological challenge, social scientists need to extend the repertoire of social scientific theories and conceptual frameworks that may inform the analysis of the social in the age of Big Data. However, an ‘epistemological revolution’ discourse on Big Data may hinder the integration of the social scientific knowledge into the Big Data analytics.

Data Literacy

Corrall, S. (2019). Repositioning data literacy as a mission-critical competence. In ACRL 2019: Recasting the Narrative, April 10–13, 2019, Cleveland, OH.

With data rapidly replacing information as the currency of research, business, government, and healthcare, is it time for librarians to make data literacy central to their professional mission, take on roles as interdisciplinary mediators, and lead the data literacy movement on campus? Join the data literacy debate and discuss what librarians can do to cut across the disciplinary and professional silos now threatening the development of lifewide data literacy. Investigate and critique diverse conceptions and pedagogies for data literacy, and experiment with the MAW theory of stakeholder saliency to identify individuals and groups to target in your data literacy initiatives.

Gray, J., Gerlitz, C., & Bounegru, L. (2018). Data infrastructure literacy. Big Data & Society. https://doi.org/10.1177/2053951718786316

A recent report from the UN makes the case for “global data literacy” in order to realise the opportunities afforded by the “data revolution”. Here and in many other contexts, data literacy is characterised in terms of a combination of numerical, statistical and technical capacities. In this article, we argue for an expansion of the concept to include not just competencies in reading and working with datasets but also the ability to account for, intervene around and participate in the wider socio-technical infrastructures through which data is created, stored and analysed – which we call “data infrastructure literacy”. We illustrate this notion with examples of “inventive data practice” from previous and ongoing research on open data, online platforms, data journalism and data activism. Drawing on these perspectives, we argue that data literacy initiatives might cultivate sensibilities not only for data science but also for data sociology, data politics as well as wider public engagement with digital data infrastructures. The proposed notion of data infrastructure literacy is intended to make space for collective inquiry, experimentation, imagination and intervention around data in educational programmes and beyond, including how data infrastructures can be challenged, contested, reshaped and repurposed to align with interests and publics other than those originally intended.

Koltay, T. (2017). Data literacy for researchers and data librarians. Journal of Librarianship and Information Science, 49(1), 3–14. https://doi.org/10.1177/0961000615616450

This paper describes data literacy and emphasizes its importance. Data literacy is vital for researchers who need to become data-literate science workers and also for (potential) data management professionals. Its important characteristic is a close connection and similarity to information literacy. To support this argument, a review of the literature was undertaken on the importance of data and the data-intensive paradigm of scientific research, researchers’ expected and real behaviour, the nature of research data management, the possible roles of the academic library, data quality, and data citation. Besides describing the nature of data literacy and enumerating the related skills, the application of phenomenographic approaches to data literacy and its relationship to the digital humanities have been identified as subjects for further investigation.

Nguyen, D. (2021). Mediatisation and datafication in the global COVID-19 pandemic: on the urgency of data literacy. Media International Australia, 178(1), 210–214. https://doi.org/10.1177/1329878X20947563

In the COVID-19 pandemic, societal discourses and social interaction are subject to rapid mediatisation and digitalisation, which accelerate datafication. This indicates urgency for increasing data literacy: individual abilities in understanding and critically assessing datafication and its social implications. Immediate challenges concern misconceptions about the crisis, data misuses, widening (social) divides and (new) data biases. Citizens need to be on guard in respect to the crisis’ impact on the next stages of the digital transformation.

Machine Learning Tools

Fournier-Tombs, E., & MacKenzie, M. K. (2021). Big data and democratic speech: Predicting deliberative quality using machine learning techniques. Methodological Innovations. https://doi.org/10.1177/20597991211010416

This article explores techniques for using supervised machine learning to study discourse quality in large datasets. We explain and illustrate the computational techniques that we have developed to facilitate a large-scale study of deliberative quality in Canada’s three northern territories: Yukon, Northwest Territories, and Nunavut. This larger study involves conducting comparative analyses of hundreds of thousands of parliamentary speech acts since the creation of Nunavut 20 years ago. Without computational techniques, we would be unable to conduct such an ambitious and comprehensive analysis of deliberative quality. The purpose of this article is to demonstrate the machine learning techniques that we have developed with the hope that they might be used and improved by other communications scholars who are interested in conducting textual analyses using large datasets. Other possible applications of these techniques might include analyses of campaign speeches, party platforms, legislation, judicial rulings, online comments, newspaper articles, and television or radio commentaries.
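
As a rough illustration of this kind of workflow, the sketch below trains a classifier on a hand-coded sample of speeches and then applies it to a larger corpus. It is not the authors' pipeline: the file names, the columns text and quality_label, and the model choice (TF-IDF plus logistic regression rather than their specific techniques) are assumptions for demonstration only.

```python
# A minimal sketch, not the authors' code: learn deliberative-quality labels
# from a hypothetical hand-coded sample, then label the full corpus.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

coded = pd.read_csv("coded_speeches.csv")        # hypothetical: text, quality_label
X_train, X_test, y_train, y_test = train_test_split(
    coded["text"], coded["quality_label"], test_size=0.2, random_state=42)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=5),
                    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Check agreement with the held-out hand-coded speeches before scaling up.
print(classification_report(y_test, clf.predict(X_test)))

# Apply the validated classifier to the full corpus of speech acts.
corpus = pd.read_csv("all_speeches.csv")         # hypothetical full corpus
corpus["predicted_quality"] = clf.predict(corpus["text"])
```

The essential design choice, as in the article, is to hand-code only a manageable sample and let the trained model extend those judgments to hundreds of thousands of speech acts, with the held-out evaluation indicating how far that extension can be trusted.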

Gray, J. E., & Suzor, N. P. (2020). Playing with machines: Using machine learning to understand automated copyright enforcement at scale. Big Data & Society. https://doi.org/10.1177/2053951720919963

This article presents the results of methodological experimentation that utilises machine learning to investigate automated copyright enforcement on YouTube. Using a dataset of 76.7 million YouTube videos, we explore how digital and computational methods can be leveraged to better understand content moderation and copyright enforcement at a large scale. We used the BERT language model to train a machine learning classifier to identify videos in categories that reflect ongoing controversies in copyright takedowns. We use this to explore, in a granular way, how copyright is enforced on YouTube, using both statistical methods and qualitative analysis of our categorised dataset. We provide a large-scale systematic analysis of removal rates from Content ID’s automated detection system and the largely automated, text-search-based Digital Millennium Copyright Act notice and takedown system. These are complex systems that are often difficult to analyse, and YouTube only makes available data at high levels of abstraction. Our analysis provides a comparison of different types of automation in content moderation, and we show how these different systems play out across different categories of content. We hope that this work provides a methodological base for continued experimentation with the use of digital and computational methods to enable large-scale analysis of the operation of automated systems.
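
For readers who want a concrete starting point, the sketch below fine-tunes a BERT classifier on labelled video metadata using the Hugging Face libraries. It is a generic sketch rather than the authors' code: the file labelled_videos.csv and its columns title_description and category are hypothetical, and the training settings are illustrative defaults.

```python
# A minimal sketch (not the authors' code) of fine-tuning a BERT text
# classifier on a hypothetical labelled sample of video metadata.
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

df = pd.read_csv("labelled_videos.csv")          # hypothetical labelled sample
labels = sorted(df["category"].unique())
label2id = {name: i for i, name in enumerate(labels)}
df["label"] = df["category"].map(label2id)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Pad/truncate titles and descriptions to a fixed length for batching.
    return tokenizer(batch["title_description"], truncation=True,
                     padding="max_length", max_length=128)

ds = Dataset.from_pandas(df[["title_description", "label"]]).map(tokenize, batched=True)
ds = ds.train_test_split(test_size=0.2)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))

args = TrainingArguments(output_dir="bert-video-classifier",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)

trainer = Trainer(model=model, args=args,
                  train_dataset=ds["train"], eval_dataset=ds["test"])
trainer.train()
predictions = trainer.predict(ds["test"])        # evaluate on held-out videos
```

Once trained and validated, such a classifier can be applied to the remainder of a very large corpus, so that categories hand-labelled on a small sample can be analysed at the scale of millions of videos.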

Hansen, K. B. (2020). The virtue of simplicity: On machine learning models in algorithmic trading. Big Data & Society. https://doi.org/10.1177/2053951720926558

Machine learning models are becoming increasingly prevalent in algorithmic trading and investment management. The spread of machine learning in finance challenges existing practices of modelling and model use and creates a demand for practical solutions for how to manage the complexity pertaining to these techniques. Drawing on interviews with quants applying machine learning techniques to financial problems, the article examines how these people manage model complexity in the process of devising machine learning-powered trading algorithms. The analysis shows that machine learning quants use Ockham’s razor – things should not be multiplied without necessity – as a heuristic tool to prevent excess model complexity and secure a certain level of human control and interpretability in the modelling process. I argue that understanding the way quants handle the complexity of learning models is a key to grasping the transformation of the human’s role in contemporary data and model-driven finance. The study contributes to social studies of finance research on the human–model interplay by exploring it in the context of machine learning model use.
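
The parsimony heuristic described here can be made concrete as a model-selection rule: only accept extra complexity if it clearly buys out-of-sample performance. The sketch below is my own toy illustration (synthetic data, arbitrary threshold), not any quant's actual workflow.

```python
# A minimal sketch of an Ockham's-razor style model-selection rule:
# prefer the simpler model unless the complex one is clearly better.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a return-prediction dataset.
X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)

simple = Ridge(alpha=1.0)
complex_model = GradientBoostingRegressor(random_state=0)

score_simple = cross_val_score(simple, X, y, cv=5).mean()
score_complex = cross_val_score(complex_model, X, y, cv=5).mean()

# Arbitrary illustrative threshold: keep the simpler, more interpretable model
# unless the complex one improves cross-validated R^2 by a clear margin.
chosen = complex_model if score_complex > score_simple + 0.01 else simple
print(type(chosen).__name__, round(score_simple, 3), round(score_complex, 3))
```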

Jaton, F. (2021). Assessing biases, relaxing moralism: On ground-truthing practices in machine learning design and application. Big Data & Society. https://doi.org/10.1177/20539517211013569

This theoretical paper considers the morality of machine learning algorithms and systems in the light of the biases that ground their correctness. It begins by presenting biases not as a priori negative entities but as contingent external referents—often gathered in benchmarked repositories called ground-truth datasets—that define what needs to be learned and allow for performance measures. I then argue that ground-truth datasets and their concomitant practices—that fundamentally involve establishing biases to enable learning procedures—can be described by their respective morality, here defined as the more or less accounted experience of hesitation when faced with what pragmatist philosopher William James called “genuine options”—that is, choices to be made in the heat of the moment that engage different possible futures. I then stress three constitutive dimensions of this pragmatist morality, as far as ground-truthing practices are concerned: (I) the definition of the problem to be solved (problematization), (II) the identification of the data to be collected and set up (databasing), and (III) the qualification of the targets to be learned (labeling). I finally suggest that this three-dimensional conceptual space can be used to map machine learning algorithmic projects in terms of the morality of their respective and constitutive ground-truthing practices. Such techno-moral graphs may, in turn, serve as equipment for greater governance of machine learning algorithms and systems.
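
One way to put the paper's three-dimensional space to work is to document each dimension explicitly for a given project. The sketch below is my own illustration, not an instrument from the paper; the example project and its descriptions are hypothetical.

```python
# A minimal sketch: a small record type for documenting the three
# ground-truthing dimensions of an ML project, so projects can be
# compared or mapped against one another.
from dataclasses import dataclass

@dataclass
class GroundTruthingRecord:
    project: str
    problematization: str   # how the problem to be solved was defined
    databasing: str         # how the data were identified and assembled
    labeling: str           # how the targets to be learned were qualified

example = GroundTruthingRecord(
    project="toxic-comment detection (hypothetical)",
    problematization="treat 'toxicity' as a binary property of single comments",
    databasing="sample public forum posts flagged by moderators",
    labeling="crowdworkers label each comment toxic/non-toxic by majority vote",
)
print(example)
```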

