Data & Digital Scholarship Tutorials

Workshop Description

Participants will be introduced to Voyant, a freely accessible text analysis tool, and through hands-on examples will learn its features for analyzing both single and multiple texts.

Workshop Intro

Reliable Sources of Texts for Analysis

Social Media Sources

These are BIG data files, so they are not easy to work with on a laptop. Ask for our help.

Text Analysis Using Voyant

Voyant is a web-based tool:


What Can Voyant Analyze?

Voyant can read the following file types: .txt, .html, .htm, .xml, MS Word .doc or .docx, .rtf, and .pdf

You can analyze a single document or multiple documents. The set of texts you load is called a corpus (plural: corpora).



You can load text one of three ways:

  • Cut and paste
  • Upload files from your computer
  • Open a pre-created corpus from your files

Open the American Red Cross Text-Book from the text box to the left, then open either the first HTML version or the plain text UTF-8 version. Select all of the text using Ctrl+A on a PC or Command+A on a Mac, copy it, then paste it into the box on the Voyant home page.

Click the REVEAL button and in a few minutes you'll see the results.

The Voyant result screen is divided into 5 segments or "skins," each displaying a different format of text analysis.

Mouse over the upper right corner of the Cirrus skin to reveal the task icons available in this skin.

The word cloud is the most obvious of the visualizations available in Voyant. The default minimum is 55 words:

  • the number of terms included can be adjusted with the slider at the bottom of the skin
  • to adjust the specific terms in the word cloud, click on the Terms function at the top of the screen and use the check boxes to select terms
  • clicking on the Links function will reveal a relational word cloud of key terms
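To make the idea behind a word cloud concrete, here is a minimal Python sketch of the counting step: tally each word and keep the most frequent ones. This is an illustration only, not Voyant's own code. The function name `top_terms` and the tiny stopword list are assumptions made for this example; Voyant applies its own, much longer stopword list.

```python
import re
from collections import Counter

# Hypothetical, very short stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def top_terms(text, limit=55):
    """Return the `limit` most frequent non-stopword terms with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(limit)

sample = "The child and the mother waited. The child slept, and the father read to the child."
print(top_terms(sample, limit=3))
```

In a word cloud, the count for each term determines how large the term is drawn, and the `limit` value plays the role of the slider at the bottom of the skin.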

Take a moment to notice what happens in each of the other 4 skins as you click on individual terms in the word cloud.

Explore the Trends skin:

  • Default display shows the top 5 most frequently occurring words in the text
  • Displays the distribution of these words across 10 equal divisions of the single text
  • Display options allow for different chart formats  
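The "10 equal divisions" idea can be sketched in a few lines of Python: split the text's words into equal slices and count a term in each slice. This is a rough illustration under stated assumptions, not Voyant's implementation. The function name `term_distribution` is hypothetical, and the sketch reports raw counts, which may differ from the values Voyant charts.

```python
import re

def term_distribution(text, term, segments=10):
    """Count how often `term` occurs in each of `segments` equal slices of the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = [0] * segments
    for position, word in enumerate(words):
        if word == term:
            # Map the word's position to a segment index from 0 to segments - 1.
            counts[position * segments // len(words)] += 1
    return counts

sample = "mother " * 5 + "child " * 15
print(term_distribution(sample, "mother", segments=4))  # [5, 0, 0, 0]
print(term_distribution(sample, "child", segments=4))   # [0, 5, 5, 5]
```

Each list entry corresponds to one point on a Trends line: a term that clusters at the start of a text produces high values in the first segments and zeros afterward.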

Customize Your Trends Skin:

Let's explore the relationship of several terms in this text. I'm interested in how the words "mother," "father," and "child" relate to one another throughout the text. Type each of these words into the search box at the bottom of the Trends skin to add them to the visualization.

Hover your mouse pointer over the help icon in the search box to see a list of syntax variations you can use. Perhaps most useful is the option for proximity searching: ~5 locates words within 5 words of each other (use whatever number is appropriate for the situation).
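To show what a proximity search looks for, here is a small Python sketch that finds places where two words occur within a given number of words of each other. It is an approximation for illustration: the function name `near` is hypothetical, and Voyant's exact rule for measuring distance may differ from the simple position difference used here.

```python
import re

def near(text, first, second, window=5):
    """Return (i, j) word positions where `first` and `second` are at most `window` words apart."""
    words = re.findall(r"[a-z']+", text.lower())
    first_positions = [i for i, w in enumerate(words) if w == first]
    second_positions = [j for j, w in enumerate(words) if w == second]
    return [(i, j) for i in first_positions for j in second_positions
            if abs(i - j) <= window]

sample = "The mother carried the child home while the father stayed behind."
print(near(sample, "mother", "child"))   # [(1, 4)]
print(near(sample, "mother", "father"))  # []
```

Here "mother" and "child" are 3 words apart and match, while "mother" and "father" are 7 words apart and fall outside the window of 5.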

The Context skin is where we will explore the other visualization tools that are part of Voyant. 

Move your mouse into the title bar on the Context skin and click on the Window icon that will appear. You'll see a menu of options. Let's look at the Corpus Tools option. You're familiar with several of these - Cirrus, Terms, and Summary - as they are already visible in the standard Voyant display. We'll take a closer look at the Word Tree, Topics, and the Scatterplot tools.

Lastly, we'll look at how Voyant processes and helps you analyze several texts as a unit - known as a corpus.

My example for this workshop uses several articles I've downloaded from my Zotero file, which I am hypothetically using to write a literature review or systematic review article. I want to see if Voyant will help me identify the key themes among the articles and the relationships of those themes across the selected literature.

  • Open the link Corpora Articles. This will take you to a Box folder named Medical Students Career Choice.
  • Right click on the Box file tile and download the FILE, not the individual articles, to your desktop
  • Choose the UPLOAD option on Voyant's home page, locate the file, select all four PDFs, and load them into Voyant

Note how the Reader, Trends, and Summary skins change to identify the separate documents.


Finding & Cleaning Textual Data

A variety of sources can be used to obtain text for analysis:

Your first pass may reveal data you don't care about:

  • frequently-cited items (journal names (whole or part), pp, web addresses, authors)
  • unusual characters (%����)
  • stopwords such as prepositions, articles, and inconsequential parts of speech 
  • content of footnotes, introduction, etc.

These are signs that your text is "dirty": it contains additional coding or older typefaces that Voyant either reads and adds to your results or cannot read and replaces with a best approximation. You will want to clean these up before running the text through the text analyzer tool again.
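As one possible starting point for cleanup, the Python sketch below strips web addresses, page references, and unreadable characters from a text before you load it into Voyant again. It is a minimal example under stated assumptions, not a complete cleaning routine: the function name `clean_text`, the patterns, and the sample sentence are all made up for illustration, and you would adjust them to the noise in your own files. Stopwords are not handled here.

```python
import re

def clean_text(raw):
    """Strip web addresses, page references, and unreadable characters from raw text."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", raw)           # web addresses
    text = re.sub(r"\bpp?\.\s*\d+(\s*[-–]\s*\d+)?", " ", text)  # page references such as "pp. 12-15"
    text = text.replace("\ufffd", " ")                          # Unicode replacement character
    text = re.sub(r"[^\w\s.,;:!?'\"()-]", " ", text)            # other stray symbols
    return re.sub(r"\s+", " ", text).strip()                    # collapse runs of whitespace

# Made-up sample containing the kinds of noise listed above.
sample = "Career choice\ufffd\ufffd among medical students pp. 12-15 see https://example.org/article %"
print(clean_text(sample))  # Career choice among medical students see
```

Save the cleaned text as a plain .txt file with UTF-8 encoding, then upload it to Voyant and compare the new results with your first pass.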

University Libraries

One Bear Place #97148
Waco, TX 76798-7148

(254) 710-6702