Working with The Latin Library Corpus in CLTK, pt. 3

code, tutorial

In the previous two posts, I explained how to load either the whole Latin Library or individual files from the corpus. In today’s post, I’ll split the difference and show how to build a custom text from PlaintextCorpusReader output, in this case how to access Virgil’s Aeneid using this method. Unlike Catullus, whose omnia opera can be found in a single text file (catullus.txt) in the Latin Library, each book of the Aeneid has been placed in its own text file. Let’s look at how we can work with multiple files at once using PlaintextCorpusReader .

[This post assumes that you have already imported the Latin Library corpus as described in the earlier post and as always that you are running the latest version of CLTK on Python3. This tutorial was tested on v. 0.1.42.]

We can access the corpus and build a list of available files with the following commands:

from cltk.corpus.latin import latinlibrary
files = latinlibrary.fileids()

We can then use a list comprehension to figure out which files we need:

print([file for file in files if 'vergil' in file])
>>> ['vergil/aen1.txt', 'vergil/aen10.txt', 'vergil/aen11.txt', 'vergil/aen12.txt', 'vergil/aen2.txt', 'vergil/aen3.txt', 'vergil/aen4.txt', 'vergil/aen5.txt', 'vergil/aen6.txt', 'vergil/aen7.txt', 'vergil/aen8.txt', 'vergil/aen9.txt', 'vergil/ec1.txt', 'vergil/ec10.txt', 'vergil/ec2.txt', 'vergil/ec3.txt', 'vergil/ec4.txt', 'vergil/ec5.txt', 'vergil/ec6.txt', 'vergil/ec7.txt', 'vergil/ec8.txt', 'vergil/ec9.txt', 'vergil/geo1.txt', 'vergil/geo2.txt', 'vergil/geo3.txt', 'vergil/geo4.txt']

The file names for the Aeneid texts all follow the same pattern and we can use this to build a list of the twelve files we want for our subcorpus.

aeneid_files = [file for file in files if 'vergil/aen' in file]

>>> ['vergil/aen1.txt', 'vergil/aen10.txt', 'vergil/aen11.txt', 'vergil/aen12.txt', 'vergil/aen2.txt', 'vergil/aen3.txt', 'vergil/aen4.txt', 'vergil/aen5.txt', 'vergil/aen6.txt', 'vergil/aen7.txt', 'vergil/aen8.txt', 'vergil/aen9.txt']

Now that we have a list of files, we can loop through them and build our collection using passing a list to our raw, sents, and words methods instead of a string:

aeneid_raw = latinlibrary.raw(aeneid_files)
aeneid_sents = latinlibrary.sents(aeneid_files)
aeneid_words = latinlibrary.words(aeneid_files)

At this point, we have our raw materials and are free to explore. So, like we did with Lesbia in Catullus, we can do the same for, say, Aeneas in the Aeneid:

import re
aeneas = re.findall(r'\bAenea[e|n|s]?\b', aeneid_raw, re.IGNORECASE)

# i.e. Return a list of matches of single words made up of 
# the letters 'Aenea' followed by the letters e, m, n, s, or nothing, and ignoring case.

>>> 236

# Note that this regex misses 'Aeneaeque' at Aen. 11.289—it is 
# important to define our regexes carefully to make sure they return
# what we expect them to return!
# A fix... 

aeneas = re.findall(r'\bAenea[e|n|s]?(que)?\b', aeneid_raw, re.IGNORECASE)
>>> 237

Aeneas appears in the Aeneid 237 times. (This matches the result found, for example, in Wetmore’s concordance.)

We are now equipped to work with the entire Latin Library corpus as well as smaller sections that we define for ourselves. There is still work to do, however, before we can ask serious research questions of this material. In a series of upcoming posts, we’ll look at a number of important preprocessing tasks that can be used to transform our unexamined text into useful data.



One thought on “Working with The Latin Library Corpus in CLTK, pt. 3

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s