
Text-Mining
Photo by Eugenio Mazzone on Unsplash

Fundamentals of Text Mining

The term ‘text mining’, which can also be called ‘data mining’, refers to any process of analysis performed on a dataset to extract information from it. That definition is so general that it could mean something as simple as doing a string search (typing into a search box) in a library catalogue or in Google. Mining quantitative data or statistical information is standard practice in the social sciences where software packages for doing this work have a long history and vary in sophistication and complexity.

But data mining in the digital humanities usually involves performing some kind of extraction of information from a body of texts and/or their metadata in order to ask research questions that may or may not be quantitative. Supposing you want to compare the frequency of the word “she” and “he” in newspaper accounts of political speeches in the early 20th century before and after the 19th Amendment guaranteed women the right to vote in August 1920. Suppose you wanted to collocate these words with the phrases in which they were written and sort the results based on various factors—frequency, affective value, attribution and so on. This kind of text analysis is a subset of data mining. Quite a few tools have been developed to do analyses of unstructured texts, that is, texts in conventional formats. Text analysis programs use word counts, keyword density, frequency, and other methods to extract meaningful information. The question of what constitutes meaningful information is always up for discussion, and completely silly or meaningless results can be generated as readily from text analysis tools as they can from any other.

Johanna Drucker, Intro to Digital Humanities, 2013
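The comparison Drucker describes, counting “she” and “he” and looking at the words that surround them, can be sketched in a few lines of Python. This is only an illustrative sketch and not part of the workshop toolchain (the workshop itself uses Lexos and Voyant). The function names `tokenize`, `pronoun_counts`, and `collocates`, the window size, and the two sample passages are all made up for this example; they are not real newspaper text.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def pronoun_counts(text, targets=("she", "he")):
    """Count how often each target word occurs in the text."""
    counts = Counter(tokenize(text))
    return {word: counts[word] for word in targets}


def collocates(text, target, window=2):
    """Count the words found within `window` tokens of each occurrence of `target`."""
    tokens = tokenize(text)
    neighbours = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            start = max(0, i - window)
            # Words before and after the target, excluding the target itself.
            context = tokens[start:i] + tokens[i + 1:i + 1 + window]
            neighbours.update(context)
    return neighbours


# Hypothetical sample passages standing in for two sets of newspaper accounts.
before_1920 = "He spoke to the crowd. He said the vote was his. She listened."
after_1920 = "She spoke to the crowd. She said the vote was hers. He listened."

print(pronoun_counts(before_1920))
print(pronoun_counts(after_1920))
print(collocates(after_1920, "she").most_common(3))
```

Raw counts like these are only a starting point. As the passage above notes, deciding whether a difference in frequency is meaningful remains an interpretive question.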

Welcome to the ‘Fundamentals of Text Mining’ workshop! We’re glad you have joined us to explore text mining of humanities data. You’ll have the opportunity to choose from six datasets to prepare for analysis, then explore your chosen data using a suite of text mining tools.

Here is an overview of how to navigate this class.

Learning Objectives

By the end of this workshop, you will be able to:

  • Source datasets for text mining and analysis
  • Understand how to develop research questions that you can answer using text mining methodologies
  • Use Lexos for text cleaning
  • Use Voyant to analyze your dataset and begin developing answers to your research questions

Office Hours

Note the following opportunities to attend live office hours to meet us and ask questions:

  • Monday 15th March 1.30pm EST
  • Thursday 18th March 1.30pm EST
  • Thursday 25th March 1.30pm EST

    See Workshop details from ER&L for information on how to join these meetings.

Milestones

Each module ends with a milestone. Complete the activities in the module and make sure you have reached its milestone; you’ll then be prepared to move on to the next module’s work.

Proceed to the Data page to get started.