Collecting, analyzing, and reporting with data can be daunting. The person that SAGE Publishing — the parent of MethodSpace — turns to when it has questions is Diana Aleman – editor extraordinaire for SAGE Stats and U.S. Political Stats. And now she is bringing her trials, tribulations, and expertise with data to you in a monthly blog, Tips with Diana. Stay tuned for Diana’s experiences, tips, and tricks with finding, analyzing and visualizing data. View Diana’s blog HERE.
There may be more than a few data points to double-check as you review and clean a data file. These can include blank values, outlier data points, data label misspellings, and so on. Duplicate data points are probably one of the most difficult to spot unless you’re lucky. Duplicates are exactly what they sound like: exact copies of the same data point. For instance, if I am looking at a data set on the number of hamsters across the United States and I see that Wisconsin has two data points, both of which are 50,000 (totally fabricated!), then I can infer that the data set has mistakenly included two duplicate values for Wisconsin.
So why does this matter? It matters because duplicate data points may inadvertently lead to miscalculation or misunderstanding of the data. The appearance of duplicates does not necessarily mean the entire data set is completely wrong – only that the data set may require a closer eye and some additional clean-up work as do most data sets. Thankfully, Excel offers two handy features that simplify the identification and removal of duplicate data points from a file!
It usually takes finding one set of duplicate data points for me to determine that Conditional Formatting should be applied to identify if any additional duplicates are present in a data file. The Conditional Formatting feature programmatically identifies duplicates in an entire data set. Without this feature I would be forced to manually check each data point. That may not be a big deal for a data set with about 50 rows of data, but it can be an incredibly inefficient process for a data set that contains, say, over 50,000 rows of data.
Using Conditional Formatting:
- Select the entire data set. Actually, you don’t have to select the entire data set; you may want to identify duplicate values in a particular column or row. If you want to identify duplicates across the entire data set, then select the entire set.
- Navigate to the Home tab and select the Conditional Formatting button.
- In the Conditional Formatting menu, select Highlight Cells Rules.
- In the menu that pops up, select Duplicate Values.
- A window will appear detailing how Excel will highlight the duplicate values it identifies. The default setting is light red highlighting with red font, which works very well.
- Voilà. All duplicate values should now be highlighted in red!
After reviewing the highlighted duplicates, you can determine whether all the duplicates should be removed or not. To remove all duplicate values, you can use the Remove Duplicates feature to, well, remove the duplicates!
Using the Remove Duplicates feature:
- Select the data set that contains duplicates.
- Navigate to the Data tab in the tool bar.
- In the Data Tools section of the Data tab, select Remove Duplicates.
- One of two windows will appear:
- If you selected the entire data set, then an option will appear asking you to specify which columns you wish to delete duplicates from; if you want duplicates removed from the entire data set, then leave all the columns selected.
- If you selected a specific column, then a warning will appear to confirm that you want to limit removal to the column selected; if yes, then be sure to select “Continue with current selection”. If you decide to expand it to the entire data set, then choose “Expand the selection”.
All the duplicates should now be removed!