Text Mining for Historical Analysis

This methods book provides a practical introduction to the R programming language for text mining historical records. More than a code cookbook, it offers a critical perspective on how we handle our human history. It is the companion guide to The Dangerous Art of Text Mining by Jo Guldi.

About this Course

Computer-powered methods are changing the way that we access information about society. New methods help us to detect change over time, to identify influential figures, and to name turning points. What happens when we apply these tools to a million congressional debates or tweets?

I designed and collaborated on a series of Jupyter Notebooks for Jo Guldi’s class, “Text Mining as Historical Method,” to explore questions like these. The Notebooks for the class provide an introduction to the cutting-edge methodologies of textual analysis that are transforming the humanities. They are geared towards both computationalists and those with a background in the humanities (but not code), and are designed to teach the skills of analyzing texts as data for evidence of change over time.

For the full course materials, including the interactive Jupyter Notebooks used by the class, please see the digital history repository on GitHub.

Because this class encourages exploring discourses that have shaped our culture throughout time, students have access to a diverse range of data sets for their own research projects. These data sets include: Reddit posts from the Pushshift API from 2008 to 2015; the 19th-century Hansard Parliamentary debates of Great Britain; the Stanford Congressional Records; the Dallas, Texas and Houston, Texas City Council Minutes; literature from Project Gutenberg; metadata from the NovelTM Datasets for English-Language Fiction, 1700-2009; and corporate reports from EDGAR, the Electronic Data Gathering, Analysis, and Retrieval system.

The skills practiced in this course begin with basic word counts and visualization techniques and extend to the high-level application of machine learning modules for telling digital history, such as spaCy, scikit-learn, and gensim for word embeddings. By the end of the course, students are able to perform comparative analyses and observe how language changes over time.
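As a rough illustration of the starting point described above, the following minimal sketch counts words per decade and compares a word's relative frequency across periods. It is not taken from the course Notebooks: the toy corpus, the function names, and the ten-year grouping are all assumptions made for this example, and it uses only the Python standard library.

```python
import re
from collections import Counter

# Hypothetical toy corpus: (year, text) pairs invented for illustration only.
DOCUMENTS = [
    (1850, "The railway bill was debated. The railway must expand."),
    (1855, "Members discussed the railway and the harvest."),
    (1880, "The telegraph bill passed. The telegraph connects the empire."),
    (1885, "Debate on the telegraph and the railway continued."),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def counts_by_period(documents, period_length=10):
    """Group documents into fixed-length periods and count the words in each."""
    counts = {}
    for year, text in documents:
        period = year - year % period_length  # e.g. 1855 falls in the 1850 period
        counts.setdefault(period, Counter()).update(tokenize(text))
    return counts


def relative_frequency(counts, word):
    """Return the word's share of all tokens in each period, in period order."""
    return {
        period: counter[word] / sum(counter.values())
        for period, counter in sorted(counts.items())
    }


period_counts = counts_by_period(DOCUMENTS)
for word in ("railway", "telegraph"):
    print(word, relative_frequency(period_counts, word))
```

Dividing by the total number of tokens in each period, rather than comparing raw counts, keeps periods with more surviving text from dominating the comparison. A real project would likely replace the regular-expression tokenizer with a library such as spaCy.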