Text Analysis

From DARC (Digital Archive Research Collective)

What is Text Analysis?

Also known as “text mining” or “distant reading,” text analysis searches for patterns in large quantities of text in order to make inferences about those texts. It uses computational processes (such as applications or algorithms) to clean, sort, classify, and evaluate textual data. Text analysis usually consists of three core activities. The first is structuring the input text, which might include cleaning errors and tagging parts of the text, such as parts of speech. The second is finding patterns in the text, which might involve analyzing parts of speech or the use of gendered pronouns, for example. The final step is evaluating and interpreting the results of the analysis.
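The three activities above can be sketched with Python's standard library alone. This is a minimal illustration, not any particular tool: the function name and the toy sentence are made up for the example.

```python
import re
from collections import Counter

def analyze(raw_text):
    """A minimal three-step text-analysis pipeline."""
    # Step 1: structure the input -- normalize case, strip punctuation,
    # and split the text into word tokens.
    tokens = re.findall(r"[a-z']+", raw_text.lower())

    # Step 2: find patterns -- here, simple word-frequency counts.
    counts = Counter(tokens)

    # Step 3: evaluate and interpret -- report the most common words,
    # a starting point for asking why those words dominate the text.
    return counts.most_common(5)

print(analyze("The quick brown fox jumps over the lazy dog. The dog sleeps."))
```

Real projects replace each step with something more sophisticated (a tokenizer from NLTK, a statistical model, a visualization), but the three-part shape stays the same.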

Often, text analysis will make use of machine learning, or automated pattern recognition, provided by computational tools.

Who would want to use Text Analysis?

Anyone who wants to “distant read” large amounts of text would find text analysis useful. Text analysis can handle massive quantities of text without actually having to read those texts. According to Ted Underwood, there are many things you can do with text analysis, from visualizing single texts (on a network diagram, for example) to modeling literary forms or social boundaries.

More concretely, you can use text analysis to find patterns, identify distinctive words, point out obscure and overlooked aspects of a text, or work with machine learning to deliver ever more sophisticated analyses (for example, [https://en.wikipedia.org/wiki/Topic_model Topic Modeling]).
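One simple way to identify distinctive words is to compare relative word frequencies between a target text and a reference text. The standard-library sketch below (the function name and the two toy texts are illustrative) ranks words by how much more frequent they are in the target, with add-one smoothing so unseen words do not cause division by zero:

```python
from collections import Counter

def distinctive_words(target, reference, top_n=3):
    """Rank words by how much more frequent they are in `target`
    than in `reference` (add-one smoothing avoids division by zero)."""
    t = Counter(target.lower().split())
    r = Counter(reference.lower().split())
    t_total, r_total = sum(t.values()), sum(r.values())
    ratio = {
        w: ((t[w] + 1) / (t_total + 1)) / ((r[w] + 1) / (r_total + 1))
        for w in t
    }
    return sorted(ratio, key=ratio.get, reverse=True)[:top_n]

print(distinctive_words(
    "whale whale ship sea whale harpoon",
    "love love heart love sea letter",
))
```

More rigorous versions of this comparison (log-likelihood ratios, TF-IDF) follow the same logic: a word is distinctive when its frequency in one corpus is out of proportion to its frequency elsewhere.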

How do I get access to Text Analysis?

Text analysis tools come with something of a learning curve, but most of them (particularly Python) are worth your while and can yield sophisticated results with a little familiarity.

Python is a multi-purpose programming language that can be used to clean, sort, and classify text. NLTK (Natural Language Toolkit) is a popular library for working with texts in Python.

MALLET, or MAchine Learning for LanguagE Toolkit, deploys machine learning methods to parse and process text, and can be used for topic modeling. More specifically, it is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, and information extraction.
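On the command line, a typical MALLET topic-modeling run has two stages: importing a directory of plain-text files into MALLET's internal format, then training the model. The input path, output filenames, and number of topics below are illustrative; run these from the MALLET installation directory.

```shell
# Stage 1: import a directory of .txt files into MALLET's format,
# keeping word order and removing common English stopwords.
bin/mallet import-dir --input path/to/texts --output texts.mallet \
    --keep-sequence --remove-stopwords

# Stage 2: train a topic model, writing the top words for each topic
# and the topic proportions for each document to text files.
bin/mallet train-topics --input texts.mallet --num-topics 20 \
    --output-topic-keys topic_keys.txt --output-doc-topics doc_topics.txt
```

The Programming Historian tutorial linked below walks through these same two stages in detail.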

Where can I find more help with using Text Analysis?

For Python help, see the GC Digital Fellows workshops on getting started with Python, and when you’re feeling ready, the blog post by Rachel Rakov in Tagging the Tower about using Python libraries to suit your specific needs.

For help with MALLET, see Programming Historian’s “Getting Started with Topic Modeling and MALLET.”

For an introduction to Machine Learning concepts, see Digital Fellow Hannah Aizenman’s workshop on Text Analysis and Machine Learning, using a dataset from the Titanic manifest.

For more theory about text analysis and visualization, check out these articles from DHQ about the history of distant reading, or ways of handling and capturing data.

What are some projects built with Text Analysis?

Digital Fellow Jeffrey Binder (English) writes in Tagging the Tower about a text analysis tool he developed with NLTK, called “Synonymizer,” which substitutes the nouns in a text with their synonyms.
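The actual Synonymizer is built with NLTK; a toy sketch of the same idea, with a small hand-made synonym dictionary standing in for a real lexical resource like WordNet, might look like this (all names and word pairs here are invented for the example):

```python
# Toy stand-in for a WordNet-backed synonym lookup.
SYNONYMS = {
    "dog": "hound",
    "house": "dwelling",
    "car": "automobile",
}

def synonymize(text):
    """Replace each word that has an entry in SYNONYMS with its synonym."""
    return " ".join(SYNONYMS.get(word, word) for word in text.split())

print(synonymize("the dog ran to the house"))
# -> "the hound ran to the dwelling"
```

A fuller version would first tag parts of speech (so only nouns are swapped, as in Binder's tool) and pull candidate synonyms from WordNet via NLTK rather than a fixed dictionary.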