So – we have found the data and we have cleaned the data. Great! But, now what do we do with it? The third and final stage of the data analysis process really gets to what you needed to begin with – information and supporting evidence.
As I mentioned in my first post, raw data often does not make sense at face value, or at least does not provide enough context for a person to understand its significance. This requires the user to “summarize” that micro-information into straightforward intelligence. “Summarizing” data into statistics is much less about creating new information than it is about translating and contextualizing the data into meaningful information for everyone.
In the midst of the 2016 campaign debates about climate change, my editor came across statistics on U.S. electrical generation by state and recommended them as a great addition to SAGE Stats. I agreed and found the original data on the Energy Information Administration’s (EIA) website. The EIA regularly releases data on electricity generation by source across the U.S.; however, you’ll see below that electricity generation is measured in megawatt hours.
Not many people understand what a “megawatt hour” is – I certainly didn’t; I had to Google it! Is 1,000 megawatt hours a lot? Is it too little? How about one million?
Although I had no idea what a megawatt hour was, I understood it was measuring energy production across the U.S., which is valuable information for assessing which states are moving away from traditional energy sources like coal. But how could I translate megawatt hours into a statistic that everyone could understand? This required me to calculate statistics from the EIA data to neatly “summarize” the information it provided.
The breakdown: Summarizing data, an example.
In this scenario, I was specifically interested in electricity generation by source type and by U.S. state. The EIA provides this information as well as overall total electricity generation values. When facing raw values such as these, ask yourself, “What can I compare these values to in order to better understand their significance?” This is a great question to ask because your audience will understand information much more easily if it’s compared to other information. There are a number of statistics you can calculate to answer this question:
- Totals: summing values to get a big picture perspective is often handy.
- Percent of totals: excellent for comparing segmented data against overall totals.
- Amount change: a good option to compare how much values have changed.
- Percent change: a good way to compare the size by which values have changed.
- Averages: these include the mean and the median.
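For readers who like to see the arithmetic, here is a minimal Python sketch of the statistics listed above. The megawatt-hour values and the names `generation_2015` and `generation_2016` are made up for illustration; they are not EIA figures.

```python
from statistics import mean, median

# Hypothetical megawatt-hour values for one state in two years (not EIA data).
generation_2015 = {"coal": 500.0, "natural_gas": 300.0, "wind": 200.0}
generation_2016 = {"coal": 400.0, "natural_gas": 350.0, "wind": 250.0}

# Totals: sum the values for the big picture.
total_2016 = sum(generation_2016.values())

# Percent of total: each source's share of the 2016 total.
percent_of_total = {
    source: value * 100 / total_2016 for source, value in generation_2016.items()
}

# Amount change: how much each value moved between the two years.
amount_change = {
    source: generation_2016[source] - generation_2015[source]
    for source in generation_2016
}

# Percent change: the size of that move relative to the starting value.
percent_change = {
    source: amount_change[source] * 100 / generation_2015[source]
    for source in generation_2016
}

# Averages: the mean and the median of the 2016 values.
mean_2016 = mean(list(generation_2016.values()))
median_2016 = median(list(generation_2016.values()))
```

Note that every value here is in the same unit (megawatt hours), which is what makes these comparisons meaningful.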
Based on the EIA’s data, I decided that comparing electricity generated by source type to the overall total electricity generated would be much more meaningful than the raw values alone. So I calculated percentages for each source type against the total number of megawatt hours. That way I could gauge how much of each state’s total electricity was generated by coal, wind, natural gas, and so on. The results were much easier to understand and particularly enlightening!
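The percent-of-total calculation described here could be sketched roughly as follows. This is only an illustration under assumed inputs: the state names, the values, and the helper `percent_by_source` are hypothetical, and the real EIA files would need to be loaded and cleaned first.

```python
# Hypothetical megawatt-hour values by state and source (not real EIA data).
generation_mwh = {
    "State A": {"coal": 600.0, "natural_gas": 300.0, "wind": 100.0},
    "State B": {"coal": 100.0, "natural_gas": 150.0, "wind": 250.0},
}


def percent_by_source(sources):
    """Return each source's share of the state's total generation, in percent."""
    total = sum(sources.values())
    if total == 0:
        raise ValueError("total generation is zero; shares are undefined")
    return {source: value * 100 / total for source, value in sources.items()}


# One dictionary of percentage shares per state.
shares = {
    state: percent_by_source(sources) for state, sources in generation_mwh.items()
}

for state, state_shares in shares.items():
    for source, pct in state_shares.items():
        print(f"{state}: {source} {pct:.1f}%")
```

Because each state's shares add up to 100, a state that generates far fewer megawatt hours overall can still be compared directly with a much larger one.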
Once the data was mapped out, I saw that a large percentage of the Midwest’s electrical generation was due to wind energy – an interesting result considering that neighboring states have strongly adhered to coal and oil.
So I can just throw any raw data values together, right?
Yeah, that’s a big N-O. Use your best judgment when you calculate statistics. Any of the statistics in the section above should be calculated with values of the same unit of measure. So don’t go adding dollar values and percentages together because that makes no sense. Likewise, be careful of any missing or incorrect data values that can throw your calculations off (although the review of your data covered in my first post of this series should help you become aware of those!). If in doubt, ask for help from a trusted resource such as your instructor, librarian, or colleague.
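As a rough illustration of these two checks, the sketch below refuses to total values that mix units or contain gaps. The function name `total_same_unit` and the `(value, unit)` record layout are assumptions made for this example, not part of any EIA or SAGE Stats tooling.

```python
def total_same_unit(records):
    """Sum (value, unit) records, but only if they share one unit and none are missing."""
    units = {unit for _, unit in records}
    if len(units) != 1:
        raise ValueError(f"expected one unit of measure, found: {sorted(units)}")

    missing = [index for index, (value, _) in enumerate(records) if value is None]
    if missing:
        raise ValueError(f"missing values at positions: {missing}")

    return sum(value for value, _ in records), units.pop()


# Hypothetical records: all in megawatt hours, none missing.
total, unit = total_same_unit([(10.0, "MWh"), (5.0, "MWh")])
print(total, unit)
```

Raising an error is a deliberately strict choice; depending on the data, you might instead prefer to report the gaps and decide case by case how to handle them.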