
Digital Scholarship: Text Analysis

An introduction to the concepts and uses of text analysis

Browser Based Tools

R and Python are the two most popular programming environments for text analysis.  There are also many browser-based tools that give you easy access to the basics of text analysis without writing code.

Voyant Tools - https://voyant-tools.org/

Voyant allows you to type in multiple URLs, paste in full text, or upload your own files for analysis.  Voyant provides many different modes of visualization.

Lexos - http://lexos.wheatoncollege.edu/upload

Lexos is similar to Voyant.  It allows you to analyze large sets of text from various types of sources and provides a variety of visualization tools.

Google Ngram Viewer - https://books.google.com/ngrams/

Charts the frequency of any word or phrase in a chosen corpus of books over time.  Good for quick inquiries.
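To make the idea of "frequency over time" concrete, here is a minimal sketch in Python that computes the share of all n-grams in each year that match a phrase.  It is an illustration only, not how Google Ngram Viewer is implemented.  The function names and the tiny toy_corpus are hypothetical and were invented for this example.

```python
import re

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def relative_frequency_by_year(corpus, phrase):
    """Map each year to (matches of phrase) / (all n-grams of the same length).

    corpus is a dict of {year: [text, text, ...]}.
    """
    target = phrase.lower()
    n = len(target.split())
    result = {}
    for year, texts in sorted(corpus.items()):
        total = 0
        hits = 0
        for text in texts:
            # Very simple tokenizer: lowercase runs of letters and apostrophes.
            tokens = re.findall(r"[a-z']+", text.lower())
            grams = ngrams(tokens, n)
            total += len(grams)
            hits += grams.count(target)
        result[year] = hits / total if total else 0.0
    return result

# Hypothetical toy corpus, invented for illustration only.
toy_corpus = {
    1900: ["the steam engine changed travel", "the engine was loud"],
    1950: ["the jet engine changed travel again"],
}

freq = relative_frequency_by_year(toy_corpus, "engine")
for year, share in freq.items():
    print(year, round(share, 3))
```

Dividing by the total number of n-grams in each year, rather than reporting raw counts, keeps years with more text from looking artificially important.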

Constellate

Constellate is currently being used by CWRU on a trial basis.  Trial access will last until the end of April and may be extended.

https://constellate.org/

Constellate is the text analytics service from the not-for-profit ITHAKA - the same people who run JSTOR and Portico. It is a platform for teaching, learning, and performing text analysis using the world’s leading archival repositories of scholarly and primary source content.

Constellate has a browser-based lab environment, so you don't need to install anything on your PC.  Its comprehensive tutorials introduce you to using Python for text mining inside user-friendly Jupyter notebooks.  The platform also allows you to explore large collections from JSTOR using the browser-based lab.

NLTK

https://www.nltk.org/

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

https://www.nltk.org/book/

The NLTK book provides a comprehensive tutorial for the NLTK package.  It covers everything from importing text to how to use various complex analytical frameworks.  Examples of working code are present throughout the book.
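As a small taste of what NLTK code looks like, the sketch below tokenizes a sentence, stems each word, and counts the stems.  It assumes NLTK is already installed (for example with "pip install nltk").  It uses only components that should not need any extra corpus downloads.  The sample sentence is made up for this example.

```python
from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Made-up sample text, for illustration only.
sample = "The texts contain tokens. Counting the tokens in the texts is a common first step."

# Split into word and punctuation tokens, then keep lowercase alphabetic words.
tokens = [t.lower() for t in wordpunct_tokenize(sample) if t.isalpha()]

# Reduce each word to its stem, so "texts" and "text" are counted together.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# FreqDist behaves like a counter keyed by stem.
counts = FreqDist(stems)
print(counts.most_common(3))
```

The NLTK book covers more capable tokenizers and taggers, some of which require downloading additional data with nltk.download().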

Other Resources

SpaCy - https://spacy.io/

"Industrial Strength Natural Language Processing" for Python.  SpaCy focuses on efficiency in order to support large-scale information extraction tasks.

MALLET - https://mimno.github.io/Mallet/topics.html

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.