Corpus Analysis with spaCy
Say you have a big collection of texts. Maybe you’ve gathered speeches from the French Revolution, compiled a bunch of Amazon product reviews, or unearthed a collection of diary entries written during the First World War. In any of these cases, computational analysis can be a good way to complement close reading of your corpus… but where should you start?
One possible way to begin is with spaCy, an industrial-strength library for Natural Language Processing (NLP) in Python. spaCy can process large corpora, generate linguistic annotations such as part-of-speech tags and named entities, and prepare texts for further machine classification. This lesson is a ‘spaCy 101’ of sorts, a primer for researchers who are new to spaCy and want to learn how it can be used for corpus analysis. It may also be useful for those who are curious about natural language processing tools in general, and how they can help us to answer humanities research questions.
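To give a first sense of what these annotations look like, here is a minimal sketch that processes a single sentence and prints its part-of-speech tags, lemmas, and named entities. It assumes the small English pipeline, `en_core_web_sm`, has been installed (`pip install spacy` followed by `python -m spacy download en_core_web_sm`); the example sentence is illustrative only.

```python
import spacy

# Load the small English pipeline (an assumption; any installed pipeline works)
nlp = spacy.load("en_core_web_sm")
doc = nlp("The Bastille was stormed in Paris on 14 July 1789.")

# Part-of-speech tag and lemma for each token
for token in doc:
    print(token.text, token.pos_, token.lemma_)

# Named entities recognized in the sentence
for ent in doc.ents:
    print(ent.text, ent.label_)
```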
Reviewed by:
- Maria Antoniak
- William Mattingly
Learning outcomes
After completing this lesson, you will be able to:
- Upload a corpus of texts to a platform for Python analysis (using Google Colaboratory)
- Use spaCy to enrich the corpus through tokenization, lemmatization, part-of-speech tagging, dependency parsing and chunking, and named entity recognition
- Conduct frequency analyses using part-of-speech tags and named entities (see the sketch after this list)
- Download an enriched dataset for use in future NLP analyses
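As a preview of the frequency analyses described above, the brief sketch below counts part-of-speech tags and entity labels across a handful of documents. The sample sentences and the `en_core_web_sm` pipeline are stand-ins for your own corpus and model.

```python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

# Illustrative texts only; in the lesson you would process your own corpus
texts = [
    "Robespierre spoke before the National Convention.",
    "The battery died after two weeks, so I returned it to Amazon.",
]
docs = list(nlp.pipe(texts))

# Count how often each part-of-speech tag and entity label appears
pos_counts = Counter(token.pos_ for doc in docs for token in doc)
entity_counts = Counter(ent.label_ for doc in docs for ent in doc.ents)

print(pos_counts.most_common(10))
print(entity_counts.most_common(10))
```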