Digital Scholarship: Text Analysis

An introduction to the concepts and uses of text analysis

Text Analysis

Text analysis is a broad term for various software tools and methods used to read, analyze, explore and manipulate text.

Software can be used to examine a corpus of text too large to be read by a human being, or to find patterns and other phenomena in text that would be much more difficult, or impossible, for a human to identify alone.

Some common modes of text analysis are:

  • Word Frequency - Counts how often words occur, for example, graphing a word's frequency over the course of a text (see the sketch after this list)
  • Part of Speech Tagging - Allows investigation into the structure of language in various texts
  • Topic Modeling - Groups words that appear in similar patterns, which may then be characterized as themes of the text
  • Sentiment Analysis - Uses a weighted dictionary to grade the positive or negative emotions expressed in a text
  • Document Classification - Trains software on a specific type of document, financial statements for example, and then finds similar documents in a collection
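
As a quick illustration of word frequency counting, here is a minimal Python sketch using only the standard library; the sample sentence and variable names are placeholders, not taken from any of the tools below.

    from collections import Counter
    import re

    text = "The cat sat on the mat. The dog sat on the rug."  # placeholder text

    # Lowercase the text and pull out rough word tokens
    tokens = re.findall(r"[a-z']+", text.lower())

    # Count each token and show the five most common words
    counts = Counter(tokens)
    print(counts.most_common(5))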

Browser Based Tools

R and Python are the two most popular programming environments for text analysis, but there are also many browser-based tools that provide easy access to the basics.

Voyant Tools - https://voyant-tools.org/

Voyant allows you to type in multiple URLs, paste in full text, or upload your own files for analysis, and it provides many different modes of visualization.

Lexos - http://lexos.wheatoncollege.edu/upload

Similar to Voyant, Lexos allows you to analyze large sets of text from various types of sources and provides a variety of visualization tools.

Google Ngram Viewer - https://books.google.com/ngrams/

Charts the frequency of any word or phrase in a chosen corpus of books over time.  Good for quick inquiries.

NLTK

https://www.nltk.org/

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

https://www.nltk.org/book/

The NLTK book provides a comprehensive tutorial for the NLTK package.  It covers everything from importing text to using various complex analytical frameworks, with examples of working code throughout.
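
To give a flavor of the package, the short sketch below tokenizes a placeholder sentence, tags each token's part of speech, and scores its sentiment with NLTK's VADER lexicon.  It assumes the required NLTK data packages have been downloaded first (exact package names can vary slightly between NLTK versions).

    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    # One-time downloads of tokenizer, tagger, and sentiment data (assumed not yet installed)
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")
    nltk.download("vader_lexicon")

    sentence = "Text analysis helps researchers explore wonderfully large collections."  # placeholder

    # Split the sentence into word tokens, then tag each token's part of speech
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))

    # Score the sentence with the weighted VADER sentiment dictionary
    sia = SentimentIntensityAnalyzer()
    print(sia.polarity_scores(sentence))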

Constellate

Constellate is currently being used by CWRU on a trial basis.  Trial access will last until the end of April and may be extended.

https://constellate.org/

Constellate is the text analytics service from the not-for-profit ITHAKA - the same people who run JSTOR and Portico. It is a platform for teaching, learning, and performing text analysis using the world’s leading archival repositories of scholarly and primary source content.

Constellate has a browser-based lab environment, so you don't need to install anything on your PC, and comprehensive tutorials that introduce you to using Python for text mining inside user-friendly Jupyter notebooks.  The platform also allows you to explore large collections from JSTOR using the browser-based lab.

Resources for Texts

There are numerous resources for texts online.  Listed here are a few large and/or popular collections that are used for text mining.  If you have a unique text that you would like to digitize and use in text analysis, please contact the Digital Scholarship team at KSL at freemancenter@case.edu.

HathiTrust Digital Library - HathiTrust's collections include over 16 million volumes that span the history of printed text, primarily in English but also in over 400 other languages.

Internet Archive eBooks and Texts - Over 20 million freely downloadable books and texts

Project Gutenberg - A library of over 60,000 free e-books, most of which are older works in the public domain

University of Oxford Text Archive - Thousands of full-text literary and linguistic sources in more than 25 languages

corpus.byu.edu - A number of large corpora compiled by Prof. Mark Davies (Linguistics) of Brigham Young University

Caselaw Access Project - Harvard's downloadable database of 360 years of United States caselaw

Chronicling America: Historic American Newspapers - Access to information about historic U.S. newspapers and millions of digitized newspaper pages

Other Resources

SpaCy - https://spacy.io/

"Industrial Strength Natural Language Processing" for python focused on efficiency in order to enable large scale information extraction tasks.

Mallet - https://mimno.github.io/Mallet/topics.html

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
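
MALLET itself is run from the command line or from Java, but the general topic-modeling workflow it implements can be sketched in Python with the gensim library (used here as a stand-in for MALLET, not a wrapper for it); the toy documents and topic count below are purely illustrative.

    from gensim import corpora, models

    # Toy corpus: each document is already tokenized (illustrative only)
    documents = [
        ["budget", "revenue", "profit", "quarterly", "earnings"],
        ["election", "senate", "vote", "policy", "campaign"],
        ["profit", "market", "shares", "earnings", "investors"],
        ["vote", "policy", "debate", "campaign", "senate"],
    ]

    # Map each word to an integer id, then build bag-of-words vectors
    dictionary = corpora.Dictionary(documents)
    corpus = [dictionary.doc2bow(doc) for doc in documents]

    # Fit a two-topic LDA model and print the top words per topic
    lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
                          passes=10, random_state=42)
    for topic_id, words in lda.print_topics(num_words=5):
        print(topic_id, words)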