How to embrace text analysis as a computational social scientist

Guest blog by Alix Dumoulin and Regina Catipon

Social scientist experience

Access to big data has transformed many industries, from e-commerce to academia. In the social sciences, researchers have leveraged innovative text mining and analysis methods to widen the scale and scope of their work in political science, economics, psychology, and more.

New methods, greater quantitative training among social scientists, and access to relevant text data have enabled innovative research on questions previously limited to qualitative methods, such as estimating people’s political positions or detecting sentiment in terrorist speech.

Despite these methodological advances, challenges in collecting and cleaning corpora remain. Bhargav Srinivasa-Desikan, a computational social scientist at UChicago’s Knowledge Lab and author of Natural Language Processing and Computational Linguistics, explained to us why text cleaning should take high priority.


“The single most time-consuming process in my professional life is text cleaning.”

After he collected syllabi, research papers, and job listings, Srinivasa-Desikan applied a blanket text-cleaning script to all three corpora. He found that his topic model results were “decent” but not what he expected. Most importantly, he could not account for why the semantic space between syllabi and job listings was so large. He realized then that each corpus needed its own text-cleaning script.

When working with enormous data sets, the time spent cleaning can start to add up. He added, “just cleaning is taking me three days, that is, on 225 GB of textual data.”

Taking the time to clean the text early on, and to manually inspect that cleaned text, will save you headaches down the road.

“The kind of cleaning you do dramatically changes the kind of results you get.” 

To help social scientists and the larger data science community access quality text sources and tools, we created a repository of corpora and scripts, which is publicly available and open to contributions. Here are some of the key points we took away while putting together the repo.

Get to know political social science corpora

We collected over 60 political text sources based on their availability and common usage. These corpora include UK parliamentary speeches, US bills and amendments, press releases, and party manifestos from 50 different countries, as well as less commonly used corpora such as trade agreements, speeches, and the content of e-petitions or national consultations.

Many social scientists will call APIs or download bulk files through official websites or government open data initiatives. The file formats vary widely across corpora, from XML, PDF, and JSON to plain text and CSV. And while there are a number of text sources online, there is no master list of political science texts and corpora, so we created a centralized repository to help you get started in your text analysis process.
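As a small illustration of that collection step, here is a minimal sketch of downloading a bulk file in Python; the endpoint shown is hypothetical, so substitute the API or open data portal of the corpus you need.

import requests

# Hypothetical endpoint -- replace with the API or bulk-download URL of your corpus
URL = "https://example.gov/opendata/bills.json"

response = requests.get(URL, timeout=30)
response.raise_for_status()  # fail loudly if the download did not succeed

# Save the raw payload; parsing and cleaning come later
with open("bills.json", "w", encoding="utf-8") as f:
    f.write(response.text)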

Get started with text mining

Once corpora have been collected, you will then be faced with actually extracting and structuring information from the text. To do so, you will probably want to:

  1. Preview the structure

  2. Identify the location of wanted text
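For example, a quick way to preview the structure of an XML corpus is to load it and print the top-level elements; the file name below is hypothetical.

import xml.etree.ElementTree as ET

# Hypothetical file name -- use whichever corpus file you downloaded
tree = ET.parse("corpus_file.xml")
root = tree.getroot()

print(root.tag)                    # the document's top-level element
for child in list(root)[:5]:       # the first few children hint at where the wanted text lives
    print(" ", child.tag, child.attrib)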

After gaining an understanding of the characteristics of the text, you might have different approaches to extracting the body of the text. These options include:

  • Keeping all text

  • Excluding header and footer

  • Recognizing header

  • Extracting knowledge

  • Recognizing entities

  • Tagging or annotating the text

In this example, the proceedings from the Culture, Welsh Language and Communications Committee were extracted as text from an XML file using Python.

Code snippet of XML text parsing process by Alix Dumoulin
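A minimal sketch along those lines, assuming the proceedings file is saved as proceedings.xml and wraps each spoken paragraph in <p> elements (both the file name and the tag are assumptions about the schema):

import xml.etree.ElementTree as ET

# Hypothetical file name and structure -- adapt the tag names to the actual schema
tree = ET.parse("proceedings.xml")
root = tree.getroot()

# Keep each <p> element as a raw string, tags included, for later cleaning
english_text = [ET.tostring(p, encoding="unicode") for p in root.iter("p")]
print(english_text[:3])  # preview the first few paragraphs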

At this point, you can choose which tags and text to keep or filter out. Once you have mined the text, you can move on to cleaning it.

Confronting text preparation challenges

Any data scientist or computational social scientist will tell you that text cleaning can be a pain. But what is the difference between text cleaning and preprocessing, two terms that are often mentioned in conjunction with one another? The distinction lies in the extent to which they modify the text data.

Text cleaning, such as the removal of white spaces or stray symbols, first clears and formats the corpora for preprocessing. Preprocessing, like punctuation removal or tokenization, then modifies the content in preparation for analysis. More so than text cleaning, preprocessing transforms the textual data and, as such, can have a larger impact on analysis. For example, if you remove punctuation in preprocessing, you may not be able to generate sentence embeddings. Or, if you do not remove stopwords, you may get inaccurate classification results. In fact, according to a study by Turkish researchers in 2013, preprocessing can be as important as feature extraction, feature selection, and classification.

Here are common examples of text preparation you may come across:

  • Text cleaning - stripping spaces, removing metadata and junk characters, and reformatting numbers or HTML markup.

  • Preprocessing - stemming, lemmatization, tokenization, and stopword removal.

  • Special considerations - personal information (email addresses, phone numbers, etc.)
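To make those preprocessing steps concrete, here is a minimal sketch using NLTK; the tokenizer, stopword list, and stemmer shown are just one possible combination.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")        # tokenizer models
nltk.download("stopwords")    # stopword lists

text = "The meeting began at 13:29."

tokens = word_tokenize(text.lower())                  # tokenization
tokens = [t for t in tokens if t.isalpha()]           # drop punctuation and numbers
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]   # stopword removal

stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]             # stemming
print(stems)                                          # ['meet', 'began']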

Let’s say, for example, we take this line from the text extracted earlier:

'<p>The committee met by video-conference.</p>\n<p>The meeting began at 13:29.</p>',

There are paragraph <p> tags and line breaks (\n). There is also a declaration of the time, which might be useful information.

A remove_tags function could remove all of the HTML tags, and a simple string replace operation would take care of the line breaks.

import re

# Match anything that looks like an HTML/XML tag, e.g. <p> or </p>
TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

# Strip the tags, then remove the line breaks
english_text[1] = remove_tags(english_text[1])
english_text[1] = english_text[1].replace("\n", "")

A regular expression could also specify that any time the phrase “the meeting began...” appears, whatever follows should be captured as the time stamp of the meeting, as sketched below. While initially time-consuming, manual inspection is integral to the success of any text project, so it is recommended to check your text cleaning early and often.
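A minimal sketch of that pattern, assuming the English text always reads “The meeting began at HH:MM”:

import re

# Hypothetical pattern -- assumes the proceedings always phrase it as "The meeting began at HH:MM"
TIME_RE = re.compile(r'[Tt]he meeting began at (\d{1,2}:\d{2})')

match = TIME_RE.search(english_text[1])
meeting_time = match.group(1) if match else None
print(meeting_time)  # '13:29' for the example line above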

How to contribute

The repo currently focuses on political science texts from the EU and the United States, but so many more regions and countries can be added. Do you know of a tool that parses Thai political texts? Maybe you have seen a package that does NLP pre-processing on Hebrew language corpora? You can contribute to the repository with more corpora and scripts at any time. Find the steps for repo contributions here.

Get early access to our new tool: Texti

In addition to the repository, SAGE Ocean is currently developing new tools to support computational content analysis, and specifically looking to reduce the time you spend on cleaning so you can focus on the analysis. As the field of text analysis grows, so too do the needs of researchers. Find out about Texti and sign up to get early access.


About the authors of this guest blog

Regina Catipon

Regina is pursuing a master’s in computational social science at UChicago. She is interested in information propagation in online networks and tracing emerging narratives. She is a news junkie and Star Trek fan. Nowadays you can find her in Chicago and tweeting (infrequently) from @RKCAT.

Alix Dumoulin

Alix is completing the MSc Applied Social Data Science at the London School of Economics and is interested in public policy, political behaviour, applied machine learning, and data ethics. She is also a co-founder of ethi, a start-up that helps people control and benefit from their personal data. She overshares about it from @alix_dumoulin.
