Digital Humanities: Text Analysis

This guide provides resources for faculty and students working on digital humanities research involving text analysis/text mining.

Cleaning your Text

When performing text mining, it is important to "clean up" your text. This means removing tags, URLs, and other content in a document that you do not want to include in your text analysis. See the examples below, which use the following article.

Most popular document terms after text-mining the article, before cleaning up the text.

Most popular document terms after text-mining the article, after cleaning up the text.

In the first analysis, the most popular terms included formatting elements from the page, such as "https," "," and extra occurrences of "Murray State." Terms like these can skew an analysis of the text. There are various methods to clean a document, many of which require programming skills or the use of a digital tool. One simple way, however, is to copy the text and paste it into a plain text editor (such as Notepad or TextEdit). Pasting into a plain text editor removes formatting and embedded markup that would otherwise be picked up by a program like Voyant. You would then save the document as a .txt file.
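For readers comfortable with a little programming, the same cleanup can be sketched in Python. This is a minimal illustration, not a complete cleaner: the function name clean_text and both patterns are assumptions made for this example, and the simple tag pattern will not handle every kind of markup (scripts, comments, or malformed HTML, for instance).

```python
import html
import re

# Hypothetical, deliberately simple patterns for this sketch.
TAG_RE = re.compile(r"<[^>]+>")                 # anything that looks like <tag ...>
URL_RE = re.compile(r"(?:https?://|www\.)\S+")  # bare web addresses


def clean_text(raw: str) -> str:
    """Remove HTML tags and URLs, then collapse runs of whitespace."""
    text = TAG_RE.sub(" ", raw)     # drop tags, including any URLs inside them
    text = html.unescape(text)      # turn entities such as &amp; back into &
    text = URL_RE.sub(" ", text)    # drop URLs left in the running text
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = ('<p>Read more at https://example.edu/news. '
              '<b>Murray State</b> libraries &amp; archives</p>')
    print(clean_text(sample))
```

The cleaned string can then be saved as a .txt file, for example with pathlib.Path("article.txt").write_text(cleaned, encoding="utf-8"), where article.txt is a placeholder file name.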

Building a Corpus

What is a Corpus?

A corpus is the collection of texts or documents you wish to analyze. It is best to save each text as its own file, rather than combining the text of every document into one file.
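As a rough illustration of why separate files are convenient, the sketch below reads a folder of .txt files into a Python dictionary with one entry per document. The function name load_corpus and the folder layout are assumptions for this example; tools such as Voyant handle this step for you when you upload multiple files.

```python
from pathlib import Path


def load_corpus(folder):
    """Return {file name without extension: text} for every .txt file in folder."""
    corpus = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # Assumes the files are saved as UTF-8 plain text.
        corpus[path.stem] = path.read_text(encoding="utf-8")
    return corpus
```

Because each document keeps its own key, you can analyze one text on its own or loop over all of them to analyze the full corpus.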

Things to Consider when your Corpus is a single book

When building your corpus, think about which parts of the whole you want to analyze. With a resource such as Voyant, you can apply text analysis tools to the full corpus and also select individual parts to analyze on their own. This is standard when the corpus consists of separate works, but also consider single works that you can break up into smaller parts.

Let's look at L. Frank Baum's The Wonderful Wizard of Oz. Below are (1) a plain text file of the entire book and (2) a ZIP folder containing a .txt file for each chapter.

By putting each chapter in its own file, you can analyze a specific chapter in the context of the full text.
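If you only have the book as one plain text file, the split into chapter files can be sketched in Python. This is a minimal example under a stated assumption: every chapter begins on a line such as "Chapter 1" or "Chapter IV". Your copy of the text may mark chapters differently, and a table of contents that lists chapters in the same form would also be matched, so check the output by hand. The names split_chapters and write_chapters and the chapter_01.txt naming scheme are hypothetical.

```python
import re
from pathlib import Path

# Assumption: a chapter starts on a line like "Chapter 1" or "Chapter IV".
CHAPTER_RE = re.compile(
    r"^chapter\s+(?:\d+|[ivxlc]+)\b.*$", re.IGNORECASE | re.MULTILINE
)


def split_chapters(book_text):
    """Return a list of chapter strings, each beginning with its heading line.

    Any text before the first chapter heading (title page, etc.) is dropped.
    """
    starts = [match.start() for match in CHAPTER_RE.finditer(book_text)]
    ends = starts[1:] + [len(book_text)]
    return [book_text[s:e].strip() for s, e in zip(starts, ends)]


def write_chapters(book_text, out_dir):
    """Write chapter_01.txt, chapter_02.txt, ... into out_dir; return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, chapter in enumerate(split_chapters(book_text), start=1):
        path = out / f"chapter_{number:02d}.txt"
        path.write_text(chapter + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

The zero-padded file names keep the chapters in reading order when the folder is sorted alphabetically, which is usually the order an analysis tool will display them in.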

Analysis sample for full text in one file:

Analysis sample for full text by individual chapters:

Notice that in the second sample, each chapter is selectable for analysis.