What to do about missing data?

By Dr. Stephen Gorard. Dr. Gorard, author of How to Make Sense of Statistics, was a Methodspace Mentor in Residence in 2021.



Social science datasets usually have missing cases, perhaps through non-response, and cases with missing values, perhaps through dropout. Full participation in any social science study is vanishingly rare, and researchers need to assess the impact of missing data. However, many research reports ignore the issue of missing data, consider only some aspects of it, or do not report how it was handled.

It is safest to assume that all datasets are incomplete. This means that cases are never truly randomly selected, or randomly allocated to groups, because a random sample that is incomplete is no longer random (I have written another post on this). Any form of analysis predicated on randomisation, such as significance testing, cannot be used where any data are missing.

All missing data has the potential to bias future research findings, because in any real-life dataset non-response is not randomly distributed. There are proven systematic differences between people who tend to take part in research and those who refuse – in terms of leisure, attitudes, education, income, social class, age and so on. Even among those who do respond, some questions will go unanswered or attract unintelligible responses. These forms of missing data have also been shown not to occur by chance, so they too can bias the results of any study. What can we do about it?

Prevent and track missing data

The first and safest approach is to prevent missing data as far as possible. Design your study to get as near to full response as you can, by making access and participation easy, brief and rewarding. Do not ask for data that is already available elsewhere. Make sure the research is important and interesting to participants, and not a waste of their time. Make your research instruments easy to read and complete, avoiding jargon and the long words and sentences that social scientists are prone to. Make the questions clear, non-threatening and unobtrusive. Make the first question fascinating. Ensure that any instruments are accessible to their full intended audience, including those with limited literacy or reduced visual acuity. Follow up missing responses, chase up non-responders, and treat every response with care and appreciation. Make clear that all data will be treated with respect, that respondents will be anonymised, and that data will be destroyed after use.

Record all missing data, the reasons for it being missing (if known), and the stage of the research at which it occurred. Report this clearly, perhaps in flowchart form. Also report what is known about cases with missing values. If some respondents to a survey do not answer an item about their income, report what you know about their education or occupation, for example. This can provide a caution for the substantive results about income.
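
As a minimal sketch of this kind of bookkeeping (assuming a pandas DataFrame named df with hypothetical columns such as "income" and "education"; the column names are illustrative, not from the post), you might tabulate missingness per variable and then profile what is known about the cases missing a particular item:

```python
# Sketch only: tabulate missingness and profile cases with missing values.
# The DataFrame and column names ("income", "education") are hypothetical.
import pandas as pd

def summarise_missingness(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values for each variable."""
    counts = df.isna().sum()
    return pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (100 * counts / len(df)).round(1),
    })

def profile_missing(df: pd.DataFrame, target: str, known: str) -> pd.DataFrame:
    """What do we know (e.g. education) about cases missing the target (e.g. income)?"""
    return pd.crosstab(df[target].isna().rename(f"{target} missing"),
                       df[known], normalize="index").round(2)
```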

Conduct sensitivity analyses, envisaging the least favourable substantive findings, by imagining unfavourable replacement values for the missing values in the study. One approach is to assess how many missing cases would have to be replaced with counterfactual data in order to invalidate the substantive result. This is a tough but fair test that soon makes clear how inadequate many published samples are. The calculation is quite easy. For example, with a result based on an “effect” size (the standardised difference between two means), multiply the effect size by the number of cases in the smaller of the two groups being compared. If the result is clearly larger than the number of missing cases, then your substantive result would not be overturned by inconvenient values for the missing data. This “number of counterfactuals needed to disturb a finding” (NNTD) is described further in my new book.
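
A minimal sketch of that check, with purely illustrative figures (the effect size, group size and missing count below are invented, not taken from any real study):

```python
# Sketch of the NNTD sensitivity check described above:
# the "effect" size times the size of the smaller group, compared with
# the number of missing cases. All figures are illustrative only.

def nntd(effect_size: float, n_smaller_group: int) -> float:
    """Number of counterfactual cases needed to disturb the finding."""
    return abs(effect_size) * n_smaller_group

effect_size = 0.4      # standardised difference between two group means
n_smaller_group = 150  # cases in the smaller of the two groups
n_missing = 30         # cases lost to non-response or dropout

counterfactuals_needed = nntd(effect_size, n_smaller_group)  # 60.0
robust = counterfactuals_needed > n_missing
print(f"NNTD = {counterfactuals_needed:.0f}; "
      f"{'robust to' if robust else 'could be disturbed by'} {n_missing} missing cases")
```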

What not to do

Any attempt to replace missing values through complex “imputation” or weighting will not help and will probably make the bias in your data worse. These approaches often rest on shaky assumptions, and all involve using the data you do have to compute the data you do not have. But the data you have is biased.

For example, imagine a large survey of adults with an overall response rate of 98% but in which the response rate from the small Traveller community was only 2%. Would it be reasonable to multiply the results obtained from a few Travellers by 49 to estimate what the majority of missing Travellers would have said? This would be very misleading if the 98% of Travellers who did not respond were different in some ways from the 2% who did.
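
A small simulation can make the danger concrete. Everything below is invented for illustration (the subgroup size, the outcome values and the response mechanism are assumptions, not data): because weighting changes how much each responder counts rather than what they said, it simply scales up whatever bias the responders carry.

```python
# Illustrative simulation (all numbers invented): weighting up a 2% response
# from a subgroup reproduces the responders' bias at full scale.
import numpy as np

rng = np.random.default_rng(42)

n_group = 1000                              # true size of the subgroup
true_values = rng.normal(50, 10, n_group)   # the answers everyone *would* have given

# Suppose, as an extreme case, that only the 2% with the highest values respond.
responders = np.sort(true_values)[-20:]

# Each responder is weighted to stand for 50 members of the subgroup.
weights = np.full(responders.size, 50)
weighted_estimate = np.average(responders, weights=weights)

# With equal weights this is just the responders' (biased) mean,
# so the weighted estimate misses the true subgroup mean badly.
print(f"true mean {true_values.mean():.1f}, weighted estimate {weighted_estimate:.1f}")
```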

Complete case analysis

Probably the simplest and safest way to preserve the full number of cases in your analysis is to respect missingness as a response. For categorical variables, add or retain a category of “not known”, and use this as a valid response in your analyses. For example, if you compare the happiness of employed and unemployed respondents, you might have a third row in your table for the happiness of respondents not recorded as either employed or unemployed. For real-number variables, you can create a new categorical variable representing missing or not, and then replace the missing values with the mean of the cases that do have values. This retains the cases with missing values in your analysis without changing the overall mean. It will, however, reduce the apparent variability (standard deviation) of the sample, because more cases will now have the same value. To address this, calculate “effect” sizes and the like using the standard deviation of only the complete cases.
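
A minimal sketch of this approach in pandas (the column names and grouping are hypothetical, and it assumes plain object-typed categorical columns):

```python
# Sketch only: retain cases with missing values via a "not known" category,
# a missingness indicator, and mean replacement, then compute an "effect"
# size using the standard deviation of the complete cases only.
import pandas as pd

def retain_missing(df: pd.DataFrame, cat_cols: list[str], num_cols: list[str]) -> pd.DataFrame:
    df = df.copy()
    for col in cat_cols:
        # Treat "not known" as a valid response category.
        df[col] = df[col].fillna("not known")
    for col in num_cols:
        # Flag which values were missing, then replace them with the mean
        # of the observed values (the overall mean is unchanged).
        df[f"{col}_missing"] = df[col].isna()
        df[col] = df[col].fillna(df[col].mean())
    return df

def effect_size(df: pd.DataFrame, value: str, group: str, a, b) -> float:
    """Standardised difference in means, using the SD of complete cases only."""
    sd_complete = df.loc[~df[f"{value}_missing"], value].std()
    diff = df.loc[df[group] == a, value].mean() - df.loc[df[group] == b, value].mean()
    return diff / sd_complete
```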

It is also better to insist that only a small fraction of values be replaced for any variable using these methods. If a variable has a very high proportion of genuinely missing values then it is better to treat the whole variable as non-viable and discard it.
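
One way to enforce that rule of thumb in practice (the 20% cut-off below is an arbitrary illustration, not a figure from the post):

```python
# Sketch only: keep a variable only if its fraction of missing values is
# below some chosen threshold; the 0.2 default is an arbitrary illustration.
def viable_columns(df, max_missing_fraction: float = 0.2) -> list[str]:
    frac = df.isna().mean()
    return [col for col in df.columns if frac[col] <= max_missing_fraction]
```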

Remember that all missing data creates the potential for bias, but these simple approaches help to illuminate the level of bias, and help us to be appropriately cautious about the strength of our results.

