Practical Tips for Getting Started with Harvesting and Analyzing Online Text

By Jeffrey Stanton

The first person who realized you could make excellent ice cream with a food processor and ingredients lying around the kitchen must have been so excited. That’s how excited I am about the analysis of online text for social scientists: the available text analysis tools are becoming easier and easier to use and all of the necessary “ingredients” are incredibly abundant across the internet. And like a fresh bowl of ice cream, the results can be very cool indeed.

Just as a recipe usually starts with ingredients, a text analysis project should begin with a consideration of what kind of text will be harvested and analyzed. The conversations and documents that people post online can illuminate important issues in the workplace, education, business, health care, sports, entertainment, politics, and many other areas of interest to researchers. Your search for suitable textual material should begin wherever the people you plan to study are posting text. A general-purpose social media site may be great for following the dynamics of political conversations, but if you’re interested in analyzing health care advice, you’ll have to look to more specialized sites. Before you start the work of extracting text directly from a site, however, see if a data set of suitable text (and possibly accompanying document-level variables) already exists - some enterprising data scientist may have done much of the work for you. Kaggle, GitHub, and Google Dataset Search are good places to start your sleuthing for existing text data sets.

Let’s say that you do need to harvest text material yourself from a website or a set of documents. You will need a capable tool to help you with the job. The two most common platforms for this kind of work are the programming languages R and Python. If you’ve not used either of these before, you are in luck, because the amount of available tutorial material has skyrocketed. There’s also a wonderful newer approach to scripting called a “Jupyter Notebook” that frames your coding efforts inside a regular web page. Anyone with a browser can create and run R and Python code as easily as writing a blog post. Incredibly, most of these tools are open source and free for all to use.
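To give a flavor of what harvesting looks like in practice, here is a minimal sketch in R using the rvest package. The URL and the CSS selector are hypothetical placeholders; you would replace them with details from the site you actually plan to study (and, of course, check that site's terms of service first).

    # A minimal web-harvesting sketch in R with the rvest package.
    # The URL and the CSS selector below are hypothetical placeholders.
    library(rvest)

    page <- read_html("https://example.com/forum/thread-123")

    # Extract the visible text of every element matching the selector
    posts <- page |>
      html_elements("div.post-body") |>
      html_text2()

    head(posts)

A handful of lines like these, run inside a Jupyter Notebook or RStudio, will return a simple character vector of posts that you can save and carry forward into the analysis stage.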

Finally, once you have obtained the text you want to analyze, you will need a strategy for systematically breaking the text into its component features. Loosely speaking, there are two major approaches to this process, known as natural language processing and text mining. Text mining is by far the easier one to get started with, and both R and Python have powerful tools to help with pre-processing and analyzing text in this mode. In R, for example, the “quanteda” (Quantitative Analysis of Textual Data) package is both powerful and easy to use. In Python, you will probably want to start with nltk (the Natural Language Toolkit). In both cases, the crucial first step is creating a so-called document-term matrix - a large data structure that, like whipped cream, is mostly empty space! This document-term matrix then serves as the essential ingredient in many subsequent kinds of text analysis.
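As an illustration of that first step, here is a minimal sketch of building a document-term matrix with quanteda (which calls it a document-feature matrix). The two tiny example documents are invented purely for demonstration.

    # A minimal document-term matrix sketch with the quanteda package.
    # The two example "documents" are invented placeholders.
    library(quanteda)

    texts <- c(doc1 = "Ice cream is easy to make at home.",
               doc2 = "Text mining is easy to start in R.")

    dtm <- texts |>
      corpus() |>
      tokens(remove_punct = TRUE) |>
      tokens_remove(stopwords("en")) |>
      dfm()

    dtm                # a sparse matrix: mostly empty space, like whipped cream
    topfeatures(dtm)   # the most frequent terms across the documents

Once you have the document-term matrix in hand, you can feed it into frequency analyses, topic models, classifiers, and many other recipes.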

Data Science for Business with R

Some people like to have a cookbook so that they can have the recipes right in front of them when they are working. Here’s where I will shamelessly promote the book I wrote with my friend and colleague Jeff Saltz: Data Science for Business with R. The first two chapters are available as a preview.

This beginner’s book has chapters on text mining along with all of the information you will need to get started with R programming. The book is sprinkled with easy code recipes to get you started. Whether you follow the book or any of the many available online tutorials to get started with text analysis, have fun and bon appetit!

