My journey into text mining

Aug 5

Paul Schuler, Computational Social Science graduate student at Linköping University

My journey into text mining started when the institute of Digital Humanities (DH) at the University of Leipzig invited students from other disciplines to take part in their introductory course. I was enrolled in a sociology degree at the time, and this component of data science was not part of the classic curriculum; however, I could explore other departments through course electives and the DH course sounded like the perfect fit.

As a complete novice to data science, I stepped into the interdisciplinary domain of digitally driven scientists who used innovative methods to explore questions in the humanities. I learned about the structure of the internet; online crowd-sourced projects; web scraping with Beautiful Soup; GitHub; and Python.

I was fascinated and partly overwhelmed. Within a few months, I was not only reading sociological classics like Weber, Durkheim, or Merton and applying parametric statistics to survey data, but also experiencing the potential that large-scale, automated text analysis offers. Given the variety of options, I struggled to select a single topic and method for my final paper. With every class, my inspiration grew and my mind overflowed with ideas. I even started to keep a list of ideas to try out later. Ultimately, I decided to work on a project examining song lyrics.

As the tool of analysis, I chose R, a programming language I had learned in my other courses, working in RStudio. R is just as sophisticated as Python, although it was developed with statistical purposes in mind. I scraped the lyrics of top tracks from different genres, trying to show that pop music is lexically less rich than other genres. I was unsuccessful, but my excitement lasted. Though R is very powerful for statistical approaches and more, I plan to improve my Python skills, as the language seems to be more versatile, especially for analyzing text data.
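For readers wondering what "lexically less rich" could mean in practice, one common measure is the type-token ratio: the number of distinct words divided by the total number of words. The sketch below is a minimal illustration in Python, not the R analysis from the original project. The function names and the two sample texts are made up for this example, and the type-token ratio is only one of several possible richness measures.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Share of distinct words (types) among all words (tokens); 0.0 for empty text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Made-up placeholder lines, not real lyrics and not real results.
lyrics_by_genre = {
    "pop": "la la la baby baby la la la",
    "folk": "the river carries old songs down to the quiet sea",
}

for genre, text in lyrics_by_genre.items():
    print(genre, round(type_token_ratio(text), 2))
```

A higher ratio suggests a more varied vocabulary. The ratio tends to fall as texts get longer, so a fair comparison across genres would need samples of similar length or a length-corrected measure.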

The first projects that caught my eye were the somewhat silly ones. At first I wondered what purpose, other than a good laugh, was served by projects that searched for jokes in Victorian newspapers, sought to recreate a Socratic dialogue using neural nets, or clustered beer types based on reviews. However, I am now convinced that, first, fun and fascination are the most effective ways into ‘serious research’ and, second, language itself teaches us a lot about society, human thinking and interaction. Through these projects, I encountered more projects and articles on language that present important findings and highlight the various opportunities arising from computational approaches in the social sciences. I should note that the online coding community also has a good sense of humor, which is evident in the name of the programming language Python, taken from the comedy group Monty Python.

As promising as the use of computational power in text analysis is, the exciting rush into it is not without its concerns. Computers cannot yet fully replace human interpretation and analytical skills (see this and this example or more extensive discussions here), but they can be the medium that aggregates the data for us and enables new modes of analysis, as Joshua Gans and his co-authors argue in their book “Prediction Machines.”

Furthermore, because books and text data are being produced in ever greater volume, and faster than any human being could ever read them, new tools are necessary to sort and summarise this information and to help us in our analysis. I am certain that text mining methods are vital for examining shifts in the use of language that indicate changes in social norms and in the structure of society.

Today, I am a graduate student in Computational Social Science at Linköping University (Sweden), and my interests are lexical change over time and the connection between language and social interaction. However, there are many more fascinating projects I’d love to work on in the future, such as exploring similes in different text types, comparing languages based on the correlation of phonetic letters and literal ones, and examining the use of characterizing adjectives in media.

At present, I get the opportunity to apply sophisticated methods to problems concerning society, bringing together the two strands that fascinated me. I continue to work with R and am eager to explore all of the ideas inspired by my Digital Humanities elective. This is why I keep working to become more experienced in text mining. In testing the different methods that could work best for my own sociological research, I hope to find one that will hold a permanent place in my methodological toolbox.

About

Paul Schuler finished his bachelor’s degree in sociology at the University of Leipzig (Germany) in early 2019. After several internships in the public and private sectors, he moved to Norrköping, Sweden, in August 2019, where he is currently enrolled in the Computational Social Science graduate program. He also works as a student research assistant at the Institute of Analytical Sociology in the project “The Matthew Effect revisited: the social and cultural dynamics of awarding the Nobel Prize”.

Python, R, Text Mining, Digital Humanities, Text Analysis, Quantitative Methods, Quantitative Data Analysis, Quantitative Data Collection, sociology

Sage Research Methods Community