The Three Stages of Data Analysis: Cleaning your Data

Tips with Diana_banner2

The basics

The term “data cleaning,” the second stage of the data analysis process, is usually met with some confusion. I mentioned to a friend that the most recent SAGE Stats data update required a lot of cleaning, which was taking up a significant amount of time. She asked, “So what exactly is data cleaning?” An excellent question! Data cleaning or “scrubbing” consists of taking disorganized, messy data and transforming it into a format that enables easier analysis and visualizations. Depending on your formatting or metadata requirements and how big the data file is, it can take days to clean a file into submission.Since I began working on SAGE Stats, I’ve learned many Excel tricks that can be applied to any kind of data cleaning situation. To avoid information overload, I’ll stick to the tricks I’ve successfully used in the past two years.

The breakdown

Top 10 Tips on Cleaning Your DataRead the data documentation. This will tell you what each component of the data file represents and help you identify what data is most relevant to your research interests and what data you can avoid.Excel’s “Text-to-Columns” feature. Especially large data files are often stored in “csv” or “comma separated value” formats and can be imported into Excel using this handy feature.VLOOKUP formula. My holy grail of Excel formulas. Do you want to pull multiple values from a workbook into another workbook? VLOOKUP has your back.COUNTIF formula. Are you looking for duplicate values in a range or checking whether values in one workbook are present in another workbook? COUNTIF counts the number of times a value occurs in a range!LEFT and RIGHT formulas. These are very useful when you need to parse out specific characters from the beginning or end of a value. For instance if “092017” represents September 2017, but I only need the year, then I can use the RIGHT formula to collect the last four digits.TRIM formula. Frustrated by inexplicable extra spaces that follow the value you want? This formula “trims” those out for you.CONCATENATE formula = “&”. Concatenate is a fancy word for linking two values together – you can use the formula for this or insert an ampersand between the two cell references, e.g. =A1&B1.I don’t think Excel’s filters get enough credit. Are you looking for multiple misspellings of New York? The filters help you quickly identify and correct them.Nest your formulas. Find ways to combine formulas to reduce the number of steps you have to complete! For instance, do you need to look up values in Workbook 1 that are associated to a value’s last five characters in Workbook 2? Nest the RIGHT and VLOOKUP formulas to quickly get your answer.Work off a copy of the original data file. You don’t want to be in a situation where you have mistakenly deleted data values and then have to download the data file again. Keep the original version handy as a backup.

This is a lot of work. Why do I need to clean the data file at all?

Sometimes a data set is so simple that it requires no cleaning at all; however, that’s not usually the case. These days you will typically encounter a file with all data merged into one column, which you then have to unmerge or parse out by yourself. Then you find that you need to concatenate some values back together. And then you realize that some values occur multiple times and you want to find out how many times each one occurs in the file. All this when you only want a snippet of that information! 

Data cleaning is a necessary evil at times in order to get your data in shape for easier visualizations and more accurate information.

The best way to learn these tricks (and even more advanced tricks) is to dive in head first and try them out with a specific data set. In Excel’s case, doing is better than reading or listening. After all, no one starts out as an expert, and I am no exception! My tips above are suggestions and may not work with your specific needs, but they can be applied in almost every kind of data situation. If you use them often enough, then they practically become muscle memory.

Previous
Previous

A Twitter Discussion of Big Data in Social Science

Next
Next

The Three Stages of Data Analysis: Evaluating Raw Data