
Introduction to Text Data Mining: Armstrong Browning Library's Victorian Collection: Home

This guide introduces the fundamentals of text data mining (TDM) and offers hands-on practice with Voyant-Tools, AntConc, and Python scripts built on the spaCy natural language processing library.

A hands-on text data mining (TDM) workshop

using Baylor's Armstrong Browning Library's Victorian Collection

 

Workshop Steps:

  1. Introduction to the Armstrong Browning Library's Victorian Collection
  2. Introduction to Text Data Mining
  3. Use a prepared Python script to enrich and preprocess Victorian Collection texts
  4. Use Voyant-Tools to visualize term frequencies and keyword extraction
  5. Use AntConc to delve deeper into term frequencies and keyword extractions

Workshop Materials

Director of the Liaison Program
Research & Engagement

Ellen Hampton Filgo
Contact:
Ellen_Filgo@baylor.edu
Jones 121
710-2968

Curator, Armstrong Browning Library

Laura French
Contact:
Armstrong Browning Library
710 Speight Avenue
Waco, Texas 76798-7152
254-710-4959

Workshop Procedures

 

 

The Armstrong Browning Library is home to the world's largest collection of Robert Browning and Elizabeth Barrett Browning research resources. Robert Browning (May 7, 1812 – December 12, 1889) is the British poet credited with creating and popularizing the dramatic monologue form of poetry. He was so popular that Browning Societies, dedicated to reading and discussing his work, began during his lifetime and continue to this day. He was married to Elizabeth Barrett Browning (March 6, 1806 – June 29, 1861), one of the foremost British poets of the 19th century.

A. J. Armstrong was a Robert Browning scholar and Chair of Baylor's English Department from 1912 to 1952. In 1918, Armstrong donated his personal library of books and periodicals by and about Robert Browning to the Baylor University Library. He continued to gather every item of interest connected with Robert Browning into Baylor's Browning Collection, to support both intensive and extensive study of the poet. When the collection outgrew its home in Carroll Library, Armstrong undertook fundraising to build a library specifically for Baylor's Browning Collection. Construction of the Armstrong Browning Library was completed in 1951.

The Victorian Collection includes more than 8,000 letters and manuscripts by or to Browning family members or other British and American figures, both prominent and lesser known. The Armstrong Browning Library acquired some of these items because of the author's or recipient's (intended audience's) connection to the Brownings. In many instances, a single Browning resource was included as part of a group of 19th-century items. The collection includes letters and manuscripts from many notable 19th-century authors such as Charles Dickens, William Wordsworth, Samuel Taylor Coleridge, Thomas Carlyle, John Henry Newman, George MacDonald, and John Ruskin. The collection also includes letters and manuscripts from political figures, religious leaders, scientists, artists, art collectors, and explorers. To increase awareness of the Victorian Collection, the Armstrong Browning Library has digitized more than 3,000 of the Victorian Collection's letters and manuscripts.
Victorian Collection in ABL
 

 

 

 

Download Victorian Collection Workshop Data Here

What is Metadata?

 

The Victorian Collection metadata contains Descriptive, Structural, and Administrative metadata.

The metadata also includes the full text, where digitized.

Simply, metadata is information about a dataset.

The victorian_table_raw.csv file contains the data extracted from the Baylor University Libraries Digital Collections.

 

Each row represents a document page.

Descriptive Fields
  • Title
  • First Line
  • Date
  • Author
  • Recipient
  • Location
  • Envelope Address
  • Physical Description
  • Format
  • Language
  • Notes
  • Books Mentioned
  • People Mentioned
  • Places Mentioned
  • * Transcript (full text of page)
Structural Fields
  • DI
  • ABLID
  • Physical Location
Administrative Fields
  • Custodian
  • Rights
  • Resource Type
How documents were transcribed: student workers manually transcribed the pages.
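
Because each row of the metadata file is a single page, a multi-page letter has to be reassembled before analysis. The following is a minimal sketch of that idea, not the workshop's prepared script. It assumes the CSV column headers match the field labels above (ABLID, Transcript) and that ABLID is shared by all pages of one document; both are assumptions, and the sample rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows; the real data lives in victorian_table_raw.csv.
# Column names are assumed to match the field labels listed above.
SAMPLE_CSV = """ABLID,Title,Author,Transcript
V-001,Letter one,Author A,Dear Sir
V-001,Letter one,Author A,Yours truly
V-002,Letter two,Author B,
"""

def join_pages(csv_file):
    """Concatenate page transcripts into one text per document, keyed by ABLID."""
    pages = defaultdict(list)
    for row in csv.DictReader(csv_file):
        text = (row.get("Transcript") or "").strip()
        if text:  # skip pages that have not been transcribed
            pages[row["ABLID"]].append(text)
    return {doc_id: "\n".join(parts) for doc_id, parts in pages.items()}

documents = join_pages(io.StringIO(SAMPLE_CSV))
print(documents)
```

To try this on the real file, pass `open("victorian_table_raw.csv", newline="", encoding="utf-8")` instead of the in-memory sample, adjusting the column names if they differ.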

Click to Launch PowerPoint

Seven Broad Text Data Mining Workflow Procedures

* Workshop focuses on highlighted items

  1. Identify Sources for Corpus
  2. Prepare for Reading and Parsing
  3. Enrich Corpus
  4. Preprocess Corpus
  5. Term Frequencies & Keyword Extraction
  6. Transformations
  7. Visualization & Analysis

This step is optional: Follow along or just watch

Text Data Mining Procedures Covered in this Section:

  • Prepare for Reading and Parsing
  • Enrich Corpus
  • Preprocess Corpus
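
To make the preprocessing step concrete, here is a small sketch using only the Python standard library. It is a stand-in for illustration and is not the workshop's spaCy-based script; the stop word list and the tokenizing rule are assumptions chosen to keep the example short.

```python
import re

# Small illustrative stop word list (an assumption); a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "my"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stop_words]

print(preprocess("The poem is in my desk, and the book is lost."))
```

A library such as spaCy adds steps this sketch omits, for example lemmatization and named-entity recognition.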

 

Python Script Using the Following Libraries:

Click the image below to launch Google Colaboratory

When this segment is completed, you will be able to:

  1. Identify the best uses of Voyant as a TDM tool
  2. Work within the various "skins" of Voyant and change them out as needed
  3. Edit the Voyant stop word list
  4. Understand the difference between a corpus and corpora and its importance to your research methodology

The Voyant home screen accepts uploads in a variety of languages:

Arabic, Bosnian, Croatian, Czech, English, French, Hebrew, Italian, Japanese, Portuguese, and Serbian

Auto-Detect is the default

 

and a variety of formats:

TXT, HTML, XML, PDF, RTF, MS Word, ZIP

The five default Voyant "skins" are:

Cirrus - word cloud

Reading - text being analyzed

Trends - top keywords visualized across 10 equal segments of text

Summary - key points

Contexts - keyword plus 5 words to either side 
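
The Contexts view can be approximated in a few lines of Python, which may help clarify what "keyword plus 5 words to either side" means. This is a rough sketch, not Voyant's implementation; the sample sentence is invented.

```python
import re

def contexts(text, keyword, window=5):
    """Return (left, keyword, right) tuples with up to `window` words per side."""
    words = re.findall(r"\w+", text.lower())
    hits = []
    for i, word in enumerate(words):
        if word == keyword:
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            hits.append((left, word, right))
    return hits

sample = "I have read your book with great pleasure and shall keep it"
for left, kw, right in contexts(sample, "book"):
    print(f"{left:>30} | {kw} | {right}")
```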

 

 

Available options appear when you mouse over the upper right of each skin

Visualization URL / Change Tool in this Skin / Options / Help

Available skin view / Available skin view / Current skin view / Help

Editing the Stop Word List:

Review the words showing in the word cloud to identify any you want to eliminate (you may repeat this several times during your analysis process)

Choose the Options button in the Cirrus skin

Check that language is either Auto-Detect or the language you are analyzing
Add your chosen stop words, one per line

Corpus:

One or several texts saved as a continuous document in a single file will be analyzed as one continuous document

Corpora:

Several texts saved individually in a single file will be analyzed as individual texts

The example to the right is for an analysis of Tom Sawyer, Huck Finn, and The Prince and the Pauper as corpora

Visualizations in Voyant

In this segment, you will learn to:

  1. Create and explain visualization formats available in Voyant
  2. Export visualizations for use in presentations, websites, etc.

 

Go to voyant-tools.org and upload the file 

victorian_transcribed_no_metadata

corpus view of file

Adding to the Stop Word List

  1. In the Cirrus skin, click on the Options button
  2. Click on the Edit List button next to Stopwords
  3. Let's add the following to the list, one per line: dear, mr, mrs, miss, dowden
  4. Click Save and then Confirm; the word cloud will recompose
  5. Note that some of the other skins are interactive and have changed as well: the Summary and the Trends skins
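
The effect of adding stop words can be sketched outside Voyant as a simple frequency count. This is an illustration only, not how Voyant works internally; the sample text is invented, and "the" is added to the list just to keep the output readable.

```python
import re
from collections import Counter

# The stop words added in the steps above.
EXTRA_STOP_WORDS = {"dear", "mr", "mrs", "miss", "dowden"}

def top_terms(text, stop_words, n=3):
    """Return the n most frequent words that are not in the stop word list."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stop_words).most_common(n)

sample = "Dear Mrs Dowden, the poem, the poem! Dear Mr Dowden, the letter."
print(top_terms(sample, EXTRA_STOP_WORDS | {"the"}))
```

Without the extra stop words, salutations such as "dear" and "dowden" would rank alongside the content words.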

 

 

Let's make a static version of the Cirrus word cloud:

  1. Click on the Export URL tool icon
  2. Choose "export a PNG image of this visualization"
  3. Follow the instructions on the Export PNG window to save the image or to capture it for embedding in a web page.

 

 

 

Changing Skins: Identifying Collocates

Collocates are word pairs that frequently occur together. To view them:

  1. In the Summary skin, click on the Tools option
  2. Scroll to Corpus Tools and choose Collocates
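
As a rough illustration of what a collocates table counts, the sketch below tallies the words that fall within a fixed window of a keyword. The window size and sample text are assumptions, and Voyant's own counting may differ in detail.

```python
import re
from collections import Counter

def collocates(text, keyword, window=2):
    """Count words appearing within `window` words of each keyword occurrence."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, word in enumerate(words):
        if word == keyword:
            neighbors = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
            counts.update(neighbors)
    return counts

sample = "my dear friend wrote to my dear sister"
print(collocates(sample, "dear", window=1).most_common())
```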

Examine Trends for Entertainment

  1. Use the search bar in the Trends skin to search for specific terms (review terms to see which you want to truncate): N.B.: all terms must be in lower case
    1. book
    2. theat*
    3. opera
    4. music*
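
The Trends idea (term counts across equal segments of the text, with `*` as a truncation wildcard) can be sketched as follows. This is an approximation for illustration, not Voyant's code; the sample text is invented, and five segments are used instead of ten only because the sample is so short.

```python
import fnmatch
import re

def trend(text, pattern, segments=10):
    """Count words matching a wildcard pattern in each of `segments` equal slices."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = [0] * segments
    for i, word in enumerate(words):
        if fnmatch.fnmatchcase(word, pattern):
            # Map word position i to its segment index.
            counts[i * segments // len(words)] += 1
    return counts

sample = "theatre opera music book theater musical book opera theatrical music"
print(trend(sample, "theat*", segments=5))
print(trend(sample, "music*", segments=5))
```

Note how `theat*` captures theatre, theater, and theatrical in one search, which is why reviewing terms for truncation is worthwhile.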

 

An Interactive Visualization of the Trends for Entertainment

To create an interactive visualization in Voyant:

  1. Click on Export URL 
  2. Choose Export View (Tools and Data)
  3. Choose HTML snippet and follow the instructions

Explore Laurence Anthony's Ant Tools and download AntConc

 

https://www.laurenceanthony.net/software.html

Phrase Concordances

Allows you to search your corpus for a word or phrase you are interested in and shows the patterns in which it appears.

 

Identify expressions of love of written works

  1. Search for love
  2. Search for love*
  3. Advanced search
    1. love* within 10 words of book(s), essay(s), poem(s)

Concordance Plots

 

Number of, and location of, search results within each document.
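
A concordance plot boils down to two numbers per document: how many hits there are and where they fall. The sketch below computes both, expressing each location as a fraction of the way through the document. The document names and texts are invented for illustration.

```python
import re

def hit_positions(documents, keyword):
    """Map each document name to (hit count, relative positions from 0 to 1)."""
    plot = {}
    for name, text in documents.items():
        words = re.findall(r"[a-z]+", text.lower())
        positions = [round(i / len(words), 2) for i, w in enumerate(words) if w == keyword]
        plot[name] = (len(positions), positions)
    return plot

docs = {"letter_a": "love and books and love again", "letter_b": "no mention here at all"}
print(hit_positions(docs, "love"))
```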

Clusters & N-Grams

 

N-Grams are sequences of a particular number (N) of consecutive words. No search terms accepted.

 

Clusters are based on search terms and show words that cluster around the search term.
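
The difference between the two tools can be sketched in a few lines, treating an n-gram as a run of n consecutive words and a cluster as an n-gram that contains the search term. This is a simplified illustration and not AntConc's implementation; the sample text is invented.

```python
import re
from collections import Counter

def ngrams(words, n):
    """All runs of n consecutive words (no search term needed)."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def clusters(words, term, n):
    """Only the n-grams that contain the search term."""
    return [g for g in ngrams(words, n) if term in g]

words = re.findall(r"[a-z]+", "my dear friend and my dear sister")
print(Counter(ngrams(words, 2)).most_common(1))
print(clusters(words, "friend", 2))
```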

Word Lists & Collocates

Set Word List Preferences Before Running Collocates

Lemma = Base form of a word, under which its inflections are grouped

Stopword = Word to exclude from text data mining

Collocate = Word that appears near the search word; collocates are ranked by their strength of association with the search word.

Keywords

 

Lists top keywords in each document using variations of TF-IDF (Term Frequency/Inverse Document Frequency)

  • Basically, which words in each document define that document as unique when compared to the words in the entire corpus.

Keyness = Keyword Strength
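
To show the intuition behind TF-IDF, here is a minimal sketch of the plain textbook formula: term frequency within a document multiplied by the log of (number of documents / number of documents containing the term). It is not AntConc's keyness calculation, which may use a different statistic; the two sample documents are invented.

```python
import math
import re
from collections import Counter

def tf_idf(documents):
    """Score each word per document: term frequency times log(N / document frequency)."""
    tokenized = {name: re.findall(r"[a-z]+", text.lower()) for name, text in documents.items()}
    doc_freq = Counter(w for words in tokenized.values() for w in set(words))
    n_docs = len(tokenized)
    scores = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        scores[name] = {
            w: (c / len(words)) * math.log(n_docs / doc_freq[w]) for w, c in counts.items()
        }
    return scores

docs = {"a": "poem poem letter", "b": "letter opera"}
scores = tf_idf(docs)
print(max(scores["a"], key=scores["a"].get))
```

A word that appears in every document, such as "letter" here, scores zero because it does nothing to distinguish one document from another.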

First, reset AntConc using File\Clear All Tools and Files.

Second, load the letters by Elizabeth Dickinson West Dowden.

  1. File\Clear All Files
  2. File\Open File(s) and select all files beginning with DowdenElizabethDickinsonWest.
  3. Start
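
If you are working in Python instead of AntConc, the same selection (all files whose names begin with DowdenElizabethDickinsonWest) can be sketched as below. The file names in the demonstration are hypothetical placeholders created in a temporary folder; the actual names in the workshop data may follow a different pattern after the prefix.

```python
import tempfile
from pathlib import Path

def files_with_prefix(folder, prefix):
    """Return sorted paths of files in `folder` whose names start with `prefix`."""
    return sorted(
        p for p in Path(folder).iterdir() if p.is_file() and p.name.startswith(prefix)
    )

# Demonstration with hypothetical file names in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "DowdenElizabethDickinsonWest_01.txt",
        "DowdenElizabethDickinsonWest_02.txt",
        "Other_01.txt",
    ):
        (Path(tmp) / name).write_text("placeholder", encoding="utf-8")
    selected = [p.name for p in files_with_prefix(tmp, "DowdenElizabethDickinsonWest")]

print(selected)
```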

 

University Libraries

One Bear Place #97148
Waco, TX 76798-7148

(254) 710-6702