Workshop Preparation
This page contains the datasets for the text and data mining workshop. Each collection of documents comes with contextual information, including links to the original source it was derived from. Choose one dataset to work with in class. Click on the dataset title (e.g., Adult British Fiction) to download the dataset.
Choose a dataset to work with
- Adult British Fiction
Fiction from the 1880s. Sample corpora assembled from Project Gutenberg by students in Alan Liu’s English 197 course, Fall 2014, at UC Santa Barbara.
- Watergate Scandal
Dataset compiled for the Fundamentals of Text Mining workshop using the Gale Digital Scholar Lab. OCR text sourced from: The International Herald Tribune Digital Archive, The Daily Mail, The Telegraph, The Sunday Times, and the Times Digital Archive. October 2019.
- Inaugural Presidential Speeches
Dataset of the inaugural speeches of every US president from Washington in 1789 to Trump in 2017, compiled by Alan Liu on DH Resources for Project Building.
- Feeding America
The Feeding America: The Historic American Cookbook dataset contains transcribed and encoded text from 76 influential American cookbooks held by MSU Libraries Stephen O. Murray and Keelung Hong Special Collections. Features encoded within the text include, but are not limited to, recipes, types of recipes, cooking implements, and ingredients. The 76 texts were chosen, from among more than 7,000 cookbooks held by MSU Libraries, as representative of periods and themes in American cookbook history spanning the late 18th to early 20th century. Source: Feeding America: The Historic American Cookbook Dataset. East Lansing: Michigan State University Libraries Special Collections.
- Billboard Hits
A collection of songs from popular 20th-century artists, including The Beatles, Michael Jackson, Mariah Carey, and Madonna.
- 19th Century Sunday School Texts
The Sunday School Books in Nineteenth Century America dataset consists of 166 texts, including Sunday school books published between 1809 and 1887. The material reflects the emerging diversity of Protestant Christian denominations in the United States during that period. The texts also mark the appearance of a theologically inflected genre of juvenile literature, which was published by a variety of sectarian presses. More contextual information is available here. Source: East Lansing: Michigan State University Libraries Special Collections.
You should also download this stopword list to your local machine.
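If you would like to confirm that your downloads are intact before class, a short script can do it. The following is a minimal Python sketch, not part of the workshop materials. It assumes the dataset arrives as a zip archive of plain-text (.txt) files and that the stopword list is a text file with one word per line. The function names `unpack_dataset` and `load_stopwords` are hypothetical helpers introduced only for this illustration.

```python
import zipfile
from pathlib import Path


def unpack_dataset(zip_path, out_dir):
    """Extract a downloaded archive and return the .txt files found inside it.

    Assumption: the dataset is a zip archive containing plain-text files.
    """
    out = Path(out_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out)
    # Search subfolders too, since archives often wrap files in a directory.
    return sorted(out.rglob("*.txt"))


def load_stopwords(path):
    """Read a stopword list into a set of lowercase words.

    Assumption: the file holds one stopword per line; blank lines are skipped.
    """
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}
```

To use it, call `unpack_dataset` with the path to the archive you downloaded and a folder name of your choice, then call `load_stopwords` with the path to the stopword file. A non-empty list of files and a non-empty set of stopwords suggest both downloads are usable.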
Once you have chosen your dataset and downloaded it to your local machine along with the stopword list, you are ready to begin exploring the background to the field of text and data mining. Go to Module 1 to learn the basics of text mining.