What to do about missing data?

By Dr. Stephen Gorard. Dr. Gorard, author of How to Make Sense of Statistics, was a Methodspace Mentor in Residence in 2021.



Social science datasets usually have missing cases, perhaps through non-response, and cases with missing values, perhaps through dropout. Full participation in any social science study is vanishingly rare, and researchers need to assess the impact of missing data. However, many research reports ignore the issue of missing data, consider only some aspects of it, or do not report how it was handled.

It is safest to assume that all datasets are incomplete. This means that cases are never truly randomly selected, or randomly allocated to groups, because a random sample that is incomplete is no longer random (I have written another post on this). Any form of analysis predicated on randomisation, such as significance testing, cannot be used where any data are missing.

All missing data has the potential to bias future research findings, because in any real-life dataset non-response is not randomly distributed. There are proven systematic differences between people who tend to take part in research and those who refuse – in terms of leisure, attitudes, education, income, social class, age and so on. Even among those who do respond, some questions will go unanswered or attract unintelligible responses. These forms of missing data have also been shown not to occur by chance, so they too can bias the results of any study. What can we do about it?

Prevent and track missing data

The first and safest approach is to prevent missing data as far as possible. Design your study to get as near to full response as you can, by making access and participation easy, brief and rewarding. Do not ask for data that is already available elsewhere. Make sure the research is important and interesting to participants, and not a waste of their time. Make your research instruments easy to read and complete, avoiding jargon and the long words and sentences that social scientists are prone to. Make the questions clear, non-threatening and unobtrusive. Make the first question fascinating. Ensure that any instruments are accessible to their full intended audience, including those with limited literacy or reduced visual acuity. Follow up missing responses, chase up non-responders, and treat every response with care and appreciation. Make clear that all data will be treated with respect, that respondents will be anonymised, and that data will be destroyed after use.

Record all missing data, the reasons for it being missing (if known), and the stage of the research at which it occurred. Report this clearly, perhaps in flowchart form. Also report what is known about cases with missing values. If some respondents to a survey do not answer an item about their income, report what you know about their education or occupation, for example. This can provide a caution for the substantive results about income.
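
As a minimal sketch of this kind of bookkeeping (assuming a pandas DataFrame named df with hypothetical columns such as "income" and "education"; the column names are illustrative, not from the post), you might tabulate missingness per variable and then profile what is known about the cases missing a particular item:

```python
# Sketch only: tabulate missingness and profile cases with missing values.
# The DataFrame and column names ("income", "education") are hypothetical.
import pandas as pd

def summarise_missingness(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values for each variable."""
    counts = df.isna().sum()
    return pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (100 * counts / len(df)).round(1),
    })

def profile_missing(df: pd.DataFrame, target: str, known: str) -> pd.DataFrame:
    """What do we know (e.g. education) about cases missing the target (e.g. income)?"""
    return pd.crosstab(df[target].isna().rename(f"{target} missing"),
                       df[known], normalize="index").round(2)
```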

Conduct sensitivity analyses, envisaging the least favourable substantive findings, by imagining unfavourable replacement values for the missing values in the study. One approach is to assess how many missing cases would have to be replaced with counterfactual data in order to invalidate the substantive result. This is a tough but fair test that soon makes clear how inadequate many published samples are. The calculation is quite easy. For example, with a result based on an “effect” size (the standardised difference between two means), multiply the effect size by the number of cases in the smaller of the two groups being compared. If the result is clearly larger than the number of missing cases, then your substantive result would not be overturned by inconvenient values for the missing data. This “number of counterfactuals needed to disturb a finding” (NNTD) is described further in my new book.
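
A minimal sketch of that check, with purely illustrative figures (the effect size, group size and missing count below are invented, not taken from any real study):

```python
# Sketch of the NNTD sensitivity check described above:
# the "effect" size times the size of the smaller group, compared with
# the number of missing cases. All figures are illustrative only.

def nntd(effect_size: float, n_smaller_group: int) -> float:
    """Number of counterfactual cases needed to disturb the finding."""
    return abs(effect_size) * n_smaller_group

effect_size = 0.4      # standardised difference between two group means
n_smaller_group = 150  # cases in the smaller of the two groups
n_missing = 30         # cases lost to non-response or dropout

counterfactuals_needed = nntd(effect_size, n_smaller_group)  # 60.0
robust = counterfactuals_needed > n_missing
print(f"NNTD = {counterfactuals_needed:.0f}; "
      f"{'robust to' if robust else 'could be disturbed by'} {n_missing} missing cases")
```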

What not to do

Any attempt to replace missing values through complex “imputation” or weighting will not help and will probably make the bias in your data worse. These approaches often rest on shaky assumptions, and all involve using the data you do have to compute the data you do not have. But the data you have is biased.

For example, imagine a large survey of adults with an overall response rate of 98% but in which the response rate from the small Traveller community was only 2%. Would it be reasonable to multiply the results obtained from a few Travellers by 49 to estimate what the majority of missing Travellers would have said? This would be very misleading if the 98% of Travellers who did not respond were different in some ways from the 2% who did.
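
A small simulation can make the danger concrete. Everything below is invented for illustration (the subgroup size, the outcome values and the response mechanism are assumptions, not data): because weighting changes how much each responder counts rather than what they said, it simply scales up whatever bias the responders carry.

```python
# Illustrative simulation (all numbers invented): weighting up a 2% response
# from a subgroup reproduces the responders' bias at full scale.
import numpy as np

rng = np.random.default_rng(42)

n_group = 1000                              # true size of the subgroup
true_values = rng.normal(50, 10, n_group)   # the answers everyone *would* have given

# Suppose, as an extreme case, that only the 2% with the highest values respond.
responders = np.sort(true_values)[-20:]

# Each responder is weighted to stand for 50 members of the subgroup.
weights = np.full(responders.size, 50)
weighted_estimate = np.average(responders, weights=weights)

# With equal weights this is just the responders' (biased) mean,
# so the weighted estimate misses the true subgroup mean badly.
print(f"true mean {true_values.mean():.1f}, weighted estimate {weighted_estimate:.1f}")
```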

Complete case analysis

Probably the simplest and safest way to preserve the full number of cases in your analysis is to respect missingness as a response. For categorical variables, add or retain a category of “not known”, and use this as a valid response in your analyses. For example, if you compare the happiness of employed and unemployed respondents, you might have a third row in your table for the happiness of respondents not recorded as either employed or unemployed. For real-number variables, you can create a new categorical variable representing missing or not, and then replace the missing values with the mean of the cases that do have values. This retains the cases with missing values in your analysis without changing the overall mean. It will, however, reduce the apparent variability (standard deviation) of the sample, because more cases will now have the same value. To address this, calculate “effect” sizes and the like using the standard deviation of only the complete cases.
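
A minimal sketch of this approach in pandas (the column names and grouping are hypothetical, and it assumes plain object-typed categorical columns):

```python
# Sketch only: retain cases with missing values via a "not known" category,
# a missingness indicator, and mean replacement, then compute an "effect"
# size using the standard deviation of the complete cases only.
import pandas as pd

def retain_missing(df: pd.DataFrame, cat_cols: list[str], num_cols: list[str]) -> pd.DataFrame:
    df = df.copy()
    for col in cat_cols:
        # Treat "not known" as a valid response category.
        df[col] = df[col].fillna("not known")
    for col in num_cols:
        # Flag which values were missing, then replace them with the mean
        # of the observed values (the overall mean is unchanged).
        df[f"{col}_missing"] = df[col].isna()
        df[col] = df[col].fillna(df[col].mean())
    return df

def effect_size(df: pd.DataFrame, value: str, group: str, a, b) -> float:
    """Standardised difference in means, using the SD of complete cases only."""
    sd_complete = df.loc[~df[f"{value}_missing"], value].std()
    diff = df.loc[df[group] == a, value].mean() - df.loc[df[group] == b, value].mean()
    return diff / sd_complete
```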

It is also better to insist that only a small fraction of values be replaced for any variable using these methods. If a variable has a very high proportion of genuinely missing values then it is better to treat the whole variable as non-viable and discard it.
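
One way to enforce that rule of thumb in practice (the 20% cut-off below is an arbitrary illustration, not a figure from the post):

```python
# Sketch only: keep a variable only if its fraction of missing values is
# below some chosen threshold; the 0.2 default is an arbitrary illustration.
def viable_columns(df, max_missing_fraction: float = 0.2) -> list[str]:
    frac = df.isna().mean()
    return [col for col in df.columns if frac[col] <= max_missing_fraction]
```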

Remember that all missing data creates the potential for bias, but these simple approaches help to illuminate the level of bias, and help us to be appropriately cautious about the strength of our results.

