Guides: Text Analysis I: Voyant: Home

Feedback

Please take this quick survey to let us know how well this workshop met your needs. Thank you!

Workshop Intro

Text Mining Introduction
Brief introduction to text analysis in a PowerPoint slide

Exercise Materials

American Red Cross Text-Book on Home Hygiene and Care of the Sick
The HTML version from Project Gutenberg
Corpora Articles
Corpora - several texts to be analyzed at once - for the hands on exercise

Reliable Sources of Texts for Analysis

HathiTrust Digital Library
A large-scale collaborative repository of digital content from research libraries including content digitized via Google Books and the Internet Archive digitization initiatives, as well as content digitized locally by libraries.

Sources: Mineable Text
This guide page lists sources of free text for data mining.

Social Media Sources

These are very large data files that are not easy to work with on a laptop. Ask for our help.

Internet Archive's Twitter Stream Grab
Big data (terabytes' worth), not suited to the typical laptop user. Please ask for help from Josh Been, BU Libraries' Digital Scholarship Librarian.

Text Analysis Using Voyant

Voyant is a web-based tool:

Voyant
Text mining tool for discovering patterns in texts

What Can Voyant Analyze?

Voyant can read the following file types: .txt, .html, .htm, .xml, MS Word .doc or .docx, .rtf, and .pdf
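Before uploading a batch of files, it can help to check which ones have an extension Voyant accepts. Here is a minimal Python sketch, assuming the list above is complete. The function name `is_voyant_readable` and the sample file names are hypothetical and are not part of Voyant.

```python
from pathlib import Path

# Extensions copied from the list above; Voyant itself may accept others.
VOYANT_EXTENSIONS = {".txt", ".html", ".htm", ".xml", ".doc", ".docx", ".rtf", ".pdf"}


def is_voyant_readable(filename: str) -> bool:
    """Return True if the file extension appears in the list above."""
    return Path(filename).suffix.lower() in VOYANT_EXTENSIONS


# Hypothetical file names, used only to show the check.
for name in ["hygiene.txt", "article.PDF", "notes.odt"]:
    print(name, is_voyant_readable(name))
```

The check looks only at the file name, so a scanned PDF with no text layer would still pass even though there is no text for Voyant to read.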

Voyant can analyze a single document or several documents together. A collection of texts analyzed as a unit is called a corpus (plural: corpora).

You can load text in one of three ways:

Cut and paste
Upload files from your computer
Open a pre-created corpus from your files

Open the American Red Cross Text-Book link under Exercise Materials, then open either the first HTML version or the plain text UTF-8 version. Select all of the text (Ctrl+A on a PC or Command+A on a Mac), copy it, then paste it into the box on the Voyant home page.

Click the REVEAL button and in a few minutes you'll see the results.

The Voyant result screen is divided into 5 segments or "skins," each displaying a different format of text analysis.

Mouse over the upper right corner of the Cirrus skin to reveal the task icons available in this skin.

The word cloud is the most obvious of the visualizations available in Voyant. By default it shows 55 words:

The number of terms included can be adjusted with the slider at the bottom of the skin
To adjust the specific terms in the word cloud, click the Terms function at the top of the screen and use the check boxes to select terms
Clicking the Links function reveals a relational word cloud of key terms
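A word cloud is a picture of term frequencies: the more often a word occurs, the larger it is drawn. The Python sketch below illustrates that counting step only. It is not Voyant's own code, the tokenizing rule is a simplifying assumption, and the function name `top_terms` and the sample sentence are hypothetical.

```python
import re
from collections import Counter


def top_terms(text: str, limit: int = 55) -> list[tuple[str, int]]:
    """Count lowercase word tokens and return the most frequent ones."""
    # Assumption: a "word" is any run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(limit)


# Hypothetical sample text; paste in the Red Cross text to try a real one.
sample = "Clean water, clean hands, and clean air keep the sick room safe."
print(top_terms(sample, limit=3))
```

Lowering `limit` has the same effect as moving the slider to show fewer terms.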

Take a moment to notice what happens in each of the other 4 skins as you click on individual terms in the word cloud.

Explore the Trends skin:

Default display shows the top 5 most frequently occurring words in the text
Displays the distribution of these words across 10 equal divisions of the single text
Display options allow for different chart formats
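To see what the Trends chart is measuring, the sketch below divides a text into equal slices and counts one term in each slice. It is an illustration in Python under simple assumptions (raw counts, a basic tokenizer), not Voyant's implementation. The function name `segment_frequencies` is hypothetical.

```python
import re


def segment_frequencies(text: str, term: str, segments: int = 10) -> list[int]:
    """Count occurrences of `term` in each of `segments` roughly equal slices of the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = [0] * segments
    if not tokens:
        return counts
    target = term.lower()
    for position, token in enumerate(tokens):
        if token == target:
            # Map the token position to a slice index from 0 to segments - 1.
            counts[position * segments // len(tokens)] += 1
    return counts


# Hypothetical sample: "child" fills the first half, "mother" the second half.
sample = "child " * 5 + "mother " * 5
print(segment_frequencies(sample, "child", segments=2))
```

Plotting each list as a line, one line per term, gives a chart of the same general shape as the Trends display.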

Customize Your Trends Skin:

Let's explore the relationship of several terms in this text. I'm interested in how the words "mother," "father," and "child" relate to one another throughout the text. Type each of these words into the search box at the bottom of the Trends skin to add them to the visualization.

Hover your mouse pointer over the ? to see a list of syntax variations you can use. Perhaps most useful is the option for proximity searching: ~5 locates words within 5 words of each other (or use another number appropriate to the situation).
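The idea behind a proximity search can be sketched in a few lines of Python. This example assumes "within 5 words" means the two word positions differ by at most 5; Voyant may count distance differently. The function name `within_distance` and the sample sentence are hypothetical.

```python
import re


def within_distance(text: str, first: str, second: str, distance: int = 5) -> list[tuple[int, int]]:
    """Return word-position pairs where `first` and `second` occur within `distance` words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    first_positions = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_positions = [i for i, t in enumerate(tokens) if t == second.lower()]
    return [
        (i, j)
        for i in first_positions
        for j in second_positions
        if 0 < abs(i - j) <= distance
    ]


# Hypothetical sample sentence.
sample = "The mother should bathe the child daily, and the father should help."
print(within_distance(sample, "mother", "child"))   # positions 1 and 5 are 4 words apart
print(within_distance(sample, "mother", "father"))  # positions 1 and 9 are too far apart
```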

The Context skin is where we will explore the other visualization tools that are part of Voyant.

Move your mouse into the title bar of the Context skin and click the Window icon that appears. You'll see a menu of options. Let's look at the Corpus Tools option. You're familiar with several of these - Cirrus, Terms, and Summary - as they are already visible in the standard Voyant display. We'll take a closer look at the Word Tree, Topics, and Scatterplot tools.

Lastly, we'll look at how Voyant processes and helps you analyze several texts as a unit - known as a corpus.

My example for this workshop uses several articles I've downloaded from my Zotero file which I am hypothetically using to write a literature or systematic review article. I want to see if Voyant will help me identify the key themes among the articles and the relationships of those themes across the selected literature.

Open the link Corpora Articles. This will take you to a Box folder named Medical Students Career Choice.
Right-click on the Box file tile and download the FILE, not the individual articles, to your desktop.
Choose the UPLOAD option on Voyant's home page, locate the file, select all four PDFs, and load them into Voyant.

Note how the Reader, Trends, and Summary skins change so that the separate documents are identified.
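When several documents are loaded together, each term is counted per document as well as across the whole set. The Python sketch below shows that per-document breakdown under simple assumptions; it is not how Voyant is built. The function name `term_counts_by_document`, the file names, and the sample sentences are all hypothetical.

```python
import re
from collections import Counter


def term_counts_by_document(documents: dict[str, str], term: str) -> dict[str, int]:
    """Count how often `term` appears in each named document of a corpus."""
    counts = {}
    for name, text in documents.items():
        tokens = re.findall(r"[a-z']+", text.lower())
        # A Counter returns 0 for a term that never occurs.
        counts[name] = Counter(tokens)[term.lower()]
    return counts


# Hypothetical stand-ins for the text of the downloaded articles.
corpus = {
    "article1.pdf": "Students weigh lifestyle and debt when choosing a specialty.",
    "article2.pdf": "Debt influences specialty choice; debt also shapes career plans.",
}
print(term_counts_by_document(corpus, "debt"))
```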

A variety of sources can be used to obtain text for analysis:

Full-text library databases (with results in one of the formats read by Voyant)
Texts from Project Gutenberg
Texts from HathiTrust (log in as a Baylor affiliate to gain full-text access in readable form)
Social media posts
Digital collections like The Proceedings of the Old Bailey - London's criminal court records from 1674-1913
UNC's Documenting the American South has four datasets of textual material available
JSTOR's Data for Research

Your first pass may reveal data you don't care about:

frequently cited items: journal names (whole or part), "pp", web addresses, authors
unusual characters (%��)
stopwords such as prepositions, articles, and other inconsequential parts of speech
content of footnotes, introduction, etc.

These are indications that your text is "dirty": it contains additional coding or older typefaces that Voyant either reads and adds to your results, or cannot read and replaces with a best approximation. You will want to clean these up before running the text through the analysis tool again.
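One way to clean a text before reloading it is a short script. The Python sketch below removes web addresses, characters that are not letters, and a few stopwords. It is a starting point under stated assumptions rather than a complete cleaner: the stopword list is deliberately tiny, the function name `clean_text` is hypothetical, and the letter rule also drops accented letters, so adjust it for texts that are not in English.

```python
import re

# A tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def clean_text(text: str) -> str:
    """Strip web addresses, non-letter characters, and stopwords from a text."""
    # Remove web addresses first, while their punctuation is still intact.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # Replace digits, symbols, and unreadable characters with spaces.
    text = re.sub(r"[^A-Za-z\s']", " ", text)
    words = [w for w in text.split() if w.lower() not in STOPWORDS]
    return " ".join(words)


# Hypothetical sample line with a web address, page numbers, and a stray symbol.
sample = "See https://example.org for the care of the sick, pp. 12-14 %"
print(clean_text(sample))
```

Items that recur in your own corpus, such as a journal name or "pp", can be added to `STOPWORDS` so they drop out as well.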


Workshop Description