Kelvin Smith Library
Text analysis is a broad term for various software tools and methods used to read, analyze, explore and manipulate text.
Software can be used to examine a corpus of text too large to be read by a human being, or to find patterns and other phenomena in text that would be much more difficult, or impossible, for a human to identify alone.
Some common modes of text analysis are:
R and Python are the two most popular programming environments for text analysis, but there are also many browser-based tools that provide easy access to the basics of text analysis.
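To give a feel for what "finding patterns in text" can look like in Python, here is a minimal sketch that counts word frequencies using only the standard library. The function name `word_frequencies` and the sample sentence are made up for illustration, and the tokenization rule (runs of letters and apostrophes) is a simplifying assumption; real projects usually rely on a dedicated library.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return a Counter mapping each lowercased word to how often it occurs."""
    # Assumption: a "word" is any run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Hypothetical sample text, used only for illustration.
sample = "The cat sat on the mat. The dog sat too."
freqs = word_frequencies(sample)

# Show the two most frequent words with their counts.
print(freqs.most_common(2))
```

The same idea scales from one sentence to thousands of documents, which is where software starts to see patterns a human reader could not.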
Voyant Tools - https://voyant-tools.org/
Voyant allows you to type in multiple URLs, paste in full text, or upload your own files for analysis. Voyant provides many different modes of visualization.
Similar to Voyant, this tool allows you to analyze large sets of text from various types of sources and provides a variety of visualization tools.
Google Ngram Viewer - https://books.google.com/ngrams/
Charts the frequency of any word or phrase in a chosen corpus of books over time. Good for quick inquiries.
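The idea behind charting a word's frequency over time can be sketched in a few lines of Python. This is not how the Ngram Viewer itself is implemented, and the tiny corpus below is invented purely for illustration. The function names are hypothetical as well.

```python
import re

# Hypothetical toy corpus: publication year -> list of texts from that year.
corpus_by_year = {
    1900: ["the telegraph office was busy", "a telegraph arrived"],
    1950: ["the telephone rang", "she sent a telegraph"],
    2000: ["the telephone rang again", "email replaced the telephone"],
}


def relative_frequency(word, texts):
    """Return the share of all tokens in `texts` that equal `word`."""
    tokens = [t for text in texts for t in re.findall(r"\w+", text.lower())]
    if not tokens:
        return 0.0
    return tokens.count(word.lower()) / len(tokens)


def frequency_over_time(word, corpus):
    """Map each year to the word's relative frequency, in chronological order."""
    return {
        year: relative_frequency(word, texts)
        for year, texts in sorted(corpus.items())
    }


trend = frequency_over_time("telegraph", corpus_by_year)
for year, share in trend.items():
    print(year, round(share, 3))
```

Dividing by the total number of tokens per year matters: it keeps years with more published text from dominating the chart simply because they contain more words.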
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
The NLTK book provides a comprehensive tutorial for the NLTK package. It covers everything from importing text to how to use various complex analytical frameworks. Examples of working code are present throughout the book.
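As a small taste of the tokenization and stemming mentioned above, the following sketch uses two NLTK classes that work without downloading any extra data: `RegexpTokenizer` and `PorterStemmer`. It assumes NLTK is already installed (for example with `pip install nltk`). The sample sentence is invented for illustration.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Hypothetical sample sentence, used only for illustration.
text = "Libraries are digitizing texts and running analyses on them."

# Split the text into word tokens; the pattern keeps runs of word characters
# and drops punctuation.
tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize(text.lower())

# Reduce each token to its stem so that related forms are grouped together.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

for token, stem in zip(tokens, stems):
    print(token, "->", stem)
```

Stems are not always dictionary words. They are truncated forms meant for grouping and counting, not for display.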
Constellate is currently being used by CWRU on a trial basis. Trial access will last until the end of April and may be extended.
Constellate is the text analytics service from the not-for-profit ITHAKA - the same people who run JSTOR and Portico. It is a platform for teaching, learning, and performing text analysis using the world’s leading archival repositories of scholarly and primary source content.
Constellate has a browser-based lab environment, so you don't need to install anything on your PC, and comprehensive tutorials that introduce you to using Python for text mining inside user-friendly Jupyter notebooks. The platform also allows you to explore large collections from JSTOR using the browser-based lab.
There are numerous resources for texts online. Listed here are a few large and/or popular collections that are used for text mining. If you have a unique text that you would like to digitize and use in text analysis, please contact the Digital Scholarship team at KSL. email@example.com
HathiTrust Digital Library - HathiTrust's collections include over 16 million volumes that span the history of printed text, primarily in English, but also in over 400 other languages.
Internet Archive eBooks and Texts - Over 20 million freely downloadable books and texts
Project Gutenberg - A library of over 60,000 free e-books, most of which are older works in the public domain
University of Oxford Text Archive - Thousands of full-text literary and linguistic sources in more than 25 languages
corpus.byu.edu - A number of large corpora compiled by Prof. Mark Davies (Linguistics) of Brigham Young University
Caselaw Access Project - Harvard's downloadable database of 360 years of United States caselaw
Chronicling America: Historic American Newspapers - access to information about historic U.S. newspapers and millions of digitized newspaper pages
SpaCy - https://spacy.io/
"Industrial Strength Natural Language Processing" for Python, focused on efficiency in order to enable large-scale information extraction tasks.
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.