























An overview of Voyant Tools, an open-source, web-based application for performing text and data mining. Developed by Stéfan Sinclair and Geoffrey Rockwell, Voyant Tools offers lightweight text analytics such as word frequency lists, frequency distribution plots, and KWIC (Key Word in Context) analysis. The guide covers getting started with Voyant Tools, acceptable file formats, uploading texts, and using various text analysis tools like Cirrus and Summary.
Typology: Lecture notes
























Text Analysis 101: Voyant Tools

Introduction:
This guide provides an introduction to Voyant Tools, an open-source, web-based application for performing text and data mining. Developed by Stéfan Sinclair at McGill University and Geoffrey Rockwell at the University of Alberta, Voyant Tools was created to support scholarly reading and interpretation of texts. Designed with digital humanities scholars in mind, it provides lightweight text analytics such as word frequency lists, frequency distribution plots, and KWIC (Key Word in Context) analysis. To learn more about Voyant Tools, please visit the Voyant Tools repository on GitHub or the Voyant Tools Help Guide.

Getting Started:
Voyant Tools can be accessed online at https://voyant-tools.org. You can also install the Voyant Server as a stand-alone version on your computer. This has several potential advantages, including optimal performance, reliability, security, and privacy.

To run the Voyant Server on your computer, you must have Java installed first. Once Java is installed and set up, go to the latest releases page of Voyant Tools and click on the VoyantServer2_4-M28.zip file to download the files needed to set up the server. This is a large zip file of about 200MB because it includes large data models for language processing. The .zip archive must be decompressed before you can install it. For more detailed instructions on how to install the Voyant Server, see the Voyant Server GitHub repository. If you are unsure how to unzip the VoyantServer2_4-M28.zip file, see Microsoft’s instructions for PC users and Apple’s instructions for Mac users on how to decompress zip files.
Acceptable File Formats:
Voyant Tools is a web-based text reading and analysis environment. It lets you upload texts in a variety of formats, including TXT, HTML, XML, PDF, RTF, and MS Word documents. You can create your own collection of texts, or you can use one of the sample corpora available in Voyant Tools.

Uploading Texts:
There are four ways of uploading texts to Voyant Tools:
1. Open an existing corpus provided by Voyant Tools. Click on the “Open” button under the text box and select one of the sample corpora.
● Pride and Prejudice
2. Type or paste text into the main text box (this creates a corpus with one document).
3. Type or paste one or more URLs into the main text box (one URL per line).
● Reader: The Reader tool displays the text for reading. You can scroll down within the reader to fetch more content. You can also hover over a word to show its frequency in the document. Additionally, you can click on a word or search for it in the search box to see how often it appears in your corpus.
● Trends: The Trends tool (also known as the Term Frequencies Chart) provides distribution plots that represent the frequencies of terms across the texts in your corpus. Each series in the graph is colored according to the word it represents. A legend at the top of the graph shows which word is associated with each color. You can click on the words in the legend to toggle their visibility. Hovering over any point in the graph brings up a callout box with information about that point, including the word and its frequency.
● Summary: The Summary tool displays the number of documents in the corpus, the total number of words, and the number of unique words (distinct word forms, each counted once however often it occurs). The next part of the summary reports document length: it shows the longest and shortest documents in the corpus by number of words, with the word count in parentheses after each title. The next section lists the documents with the highest and the lowest vocabulary densities (the number of unique words in a document divided by its total number of words). Following this section is an approximation of the average number of words per sentence, showing both the highest and lowest values. Finally, the summary lists the five most frequent words in the corpus.
● Contexts: The Contexts (or Keywords in Context) tool shows each occurrence of a keyword with its surrounding text (the context). The table view shows the following columns:
○ Document: the document in which the keyword and its context occur
○ Left: contextual words to the left of the keyword (note that sorting by this column treats words in reverse order, right to left from the keyword)
○ Term: the keyword matching the default or user-provided term query
○ Right: contextual words to the right of the keyword

To select an alternative tool, hover over the gray bar at the top of the Trends or Cirrus window, next to the question mark symbol, until a menu of icons appears. If you select the window button, you will see a variety of other tools that perform different visualizations and text analyses. There are many tools to choose from. To learn more about these tools and their functionalities, visit the Voyant Tools help guide.

Step 3: Word Clouds and Stopwords

Cirrus Tool: Word Cloud
To begin, let's take a look at the Cirrus tool. In this word cloud, you will see the most frequent words in the Paradise Lost corpus. The more frequently a word occurs, the larger it appears in the cloud.
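The counting behind a word cloud can be illustrated with a few lines of Python. This is only a minimal sketch of the general idea, not Voyant's actual implementation: the `tokenize` and `top_words` helpers and the simple letters-and-apostrophes tokenizer are assumptions made for illustration, and the sample string is just the opening of Paradise Lost.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(text, n=5):
    """Return the n most frequent words with their counts, most frequent first."""
    return Counter(tokenize(text)).most_common(n)


# Illustrative sample text; a real corpus would be far longer.
sample = "Of Man's first disobedience, and the fruit of that forbidden tree"
print(top_words(sample, 3))
```

A word cloud then simply scales each word's font size according to its count, so the words at the top of this list would be drawn largest.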
Stopwords
Voyant Tools has a list of common stopwords that have been removed from your corpus, such as "an," "and," "or," and "but." If you would like to remove other stopwords, or words that don't add much meaning to your analysis, you can filter them out. For example, let's filter out the words "said," "went," and "came" from your word cloud. Hover over the gray bar at the top of the word cloud window until a menu of icons appears, then click on the blue options icon. In the Options pop-up window, you can also modify the tool's settings. Here is a list of the options available to you:
● Stopwords: you can define a set of stopwords to exclude (see the stopwords guide for more information)
● White List: you can define a set of allowed words (the opposite of a stopwords list); only terms in this list will be shown in Cirrus (note that the stopwords list is still active, so you may want to choose "None" from the stopwords menu to deactivate it)
● Categories: you can specify categories based on the frequencies of words
● Font Family: you can determine which font Cirrus uses; a set of web-safe fonts is provided. You can also specify a font installed on your computer, but it may not be available on other computers (in which case a default font is used)
● Palette: you can edit the colour palette

To filter out stopwords, click on the "Edit List" button to the right of the "Stopwords: Auto-detect" dropdown menu.
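The effect of adding words such as "said," "went," and "came" to a stopwords list can be sketched in Python. This is a hedged illustration rather than Voyant's own code: the small `STOPWORDS` set, the `filtered_counts` helper, and the sample sentence are all invented for this example, and Voyant's built-in lists are much longer.

```python
import re
from collections import Counter

# Hypothetical base stopword set for illustration only.
STOPWORDS = {"a", "an", "and", "or", "but", "the"}


def filtered_counts(text, extra_stopwords=()):
    """Count words, skipping the base stopwords plus any user-supplied extras."""
    blocked = STOPWORDS | {w.lower() for w in extra_stopwords}
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in blocked)


# Made-up sentence, not taken from any corpus.
text = "He said the serpent came and went, but the serpent said nothing."
counts = filtered_counts(text, extra_stopwords=["said", "went", "came"])
print(counts.most_common(3))
```

Because the extra words are removed before counting, they can never appear among the most frequent terms, which is why they disappear from the word cloud.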
The words are filtered out only in the word cloud. To apply the updated stopwords list to the rest of the tools in Voyant, click the Reset button in the Trends tool.

Step 4: Trends and Frequency of Words
The Trends tool, also known as the Term Frequencies Chart, shows a line graph of the most frequent words used in your corpus. Each series in the graph is colored according to the word it represents. A legend at the top of the graph shows which word is associated with each color. You can click on words in the legend to toggle their visibility. Hovering over any point in the graph brings up a callout box with information about the selected term and its frequency.
By default, the Trends tool shows the Relative Frequencies of words in your corpus. To view the absolute count for each document in your corpus, select Raw Frequencies instead. Click on the blue options icon in the gray menu bar. In the Options pop-up window, click on Raw Frequencies, then click Confirm. The Trends tool should update automatically with the absolute count for each of the top words in your corpus.
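The difference between the two settings can be sketched in Python. This example assumes that a relative frequency is the number of occurrences divided by the document's total word count; Voyant may scale the value differently. The helper functions and the two tiny documents are hypothetical.

```python
import re


def raw_frequency(term, text):
    """Absolute number of times term occurs in text (case-insensitive)."""
    words = re.findall(r"[a-z']+", text.lower())
    return words.count(term.lower())


def relative_frequency(term, text):
    """Occurrences of term divided by the total number of words in text."""
    words = re.findall(r"[a-z']+", text.lower())
    return words.count(term.lower()) / len(words) if words else 0.0


# Two made-up documents of different lengths.
documents = {
    "short": "eve spoke to adam",
    "long": "eve walked and eve sang while adam slept under the tree all day",
}
for name, text in documents.items():
    print(name, raw_frequency("eve", text), round(relative_frequency("eve", text), 3))
```

The longer document has the higher raw count for "eve" but the lower relative frequency, which is why relative frequencies are the fairer way to compare documents of different lengths.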
○ Columns: each item is its own column for each category
○ Line: line chart across categories
○ Stacked Bar: stacked bar chart (values are shown in columns)
○ Line & Stacked Bar (default): superimposed line and stacked bar chart

Let's change the display to Columns. Select the Display dropdown menu, then click on Columns. Your chart should automatically update to a bar chart.
To select a particular word in Trends, click on that word in the Cirrus word cloud. The Trends tool will update with the frequency of the selected word across your corpus. For example, click on the word “god” in the word cloud and then take a look at the Trends column chart. You will see how often the word “god” shows up.
As you can see from examining the above Trends chart, the word adam shows up 94 times and eve shows up 98 times. The search box supports several operators that allow you to run advanced search queries. Here are some examples of advanced keyword searches you can do:
● eve: match the exact term eve
● eve*: match terms that start with the prefix eve followed by a wildcard, as one term
● ^eve*: match terms that start with eve, as separate terms (ever, even, evening, etc.)
● *ve: match terms that end with the suffix ve, as one term
● ^*ve: match terms that end with the suffix ve, as separate terms (leave, live, serve, etc.)
● eve, adam: match each term separated by commas, as separate terms
● eve|adam: match terms separated by pipes, as a single term
● “eve and adam”: match as an exact phrase (word order matters)
● “eve and adam”~0: match as a phrase (word order doesn't matter, but 0 words in between)
● “adam and eve”~5: match adam near eve (within 5 words)

Step 6: Words in Context
Word frequencies can only take you so far with your analysis. Let's move on and examine some keywords in context.

The Contexts Tool
The Contexts (or Keywords in Context) tool shows each occurrence of a keyword with a bit of surrounding text (the context). The table view shows the following columns:
● Document: the document in which the keyword and its context occur
● Left: the contextual words to the left of the keyword (note that sorting by this column treats words in reverse order, right to left from the keyword)
● Term: the keyword matching the default or user-provided term query
● Right: the contextual words to the right of the keyword

By default, contexts are shown for the most frequent words in your corpus. If you would like to look up a different word, you can specify a term in the Contexts tool's search box. Let's examine the word "tree" in context. In the search box, type "tree" and press Enter. Each occurrence of the term will appear with its surrounding words (the context), organized by book.
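A keyword-in-context listing like the one above can be approximated in a few lines of Python. This is a minimal sketch under stated assumptions, not Voyant's implementation: the `keyword_in_context` function, its `window` parameter (the number of context words on each side), and the simple tokenizer are hypothetical, and the sample line is adapted from the opening of Paradise Lost.

```python
import re


def keyword_in_context(text, keyword, window=3):
    """Return a (left, term, right) tuple for each occurrence of keyword."""
    words = re.findall(r"[A-Za-z']+", text)
    rows = []
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            rows.append((left, word, right))
    return rows


sample = "the fruit of that forbidden tree whose mortal taste brought death"
for left, term, right in keyword_in_context(sample, "tree"):
    print(f"{left:>25} | {term} | {right}")
```

Each tuple corresponds to one row of the Contexts table, with the Left, Term, and Right columns; a Document column could be added by running the function once per document and recording each document's title.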