Working with The Latin Library Corpus in CLTK, pt. 2


In the previous post, I explained how to load the whole of The Latin Library as a plaintext corpus of sentences or words, or even as a “raw” string, which you can then process in Python and the Classical Language Toolkit. While it can be interesting to experiment with the corpus in its entirety, it is often more useful to focus on a specific author or work, which is what we’ll do in this post using the works of Catullus.

[This post assumes that you have already imported the Latin Library corpus as described in the earlier post and, as always, that you are running the latest version of CLTK on Python 3. This tutorial was tested on v. 0.1.42.]

We can access the corpus with the following command:

from cltk.corpus.latin import latinlibrary

The Latin Library corpus is organized as a collection of plaintext (.txt) files, some of which sit in the top-level directory (e.g. 12tables.txt) and some of which sit in an author- or collection-specific folder (e.g. abelard/dialogus.txt).


The PlaintextCorpusReader gives us a method for returning a list of all of the files available in the corpus, fileids():

print(latinlibrary.fileids())
>>> ['12tables.txt', '1644.txt', 'abbofloracensis.txt', 'abelard/dialogus.txt', 'abelard/epistola.txt', 'abelard/historia.txt', 'addison/barometri.txt', etc.

print(len(latinlibrary.fileids()))
>>> 2162
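
Since folder names are part of each file ID, one quick way to get a feel for this layout is to group the IDs by folder. Here is a minimal sketch of that idea. Note that sample_fileids is only a stand-in for latinlibrary.fileids() (the first few IDs from the output above), and group_by_folder is a helper made up for this illustration, not something CLTK provides.

```python
from collections import defaultdict

# Stand-in for latinlibrary.fileids(): the first few IDs from the output above.
sample_fileids = ['12tables.txt', '1644.txt', 'abbofloracensis.txt',
                  'abelard/dialogus.txt', 'abelard/epistola.txt',
                  'abelard/historia.txt', 'addison/barometri.txt']

def group_by_folder(fileids):
    """Hypothetical helper: map each folder to its files ('' = top level)."""
    groups = defaultdict(list)
    for fileid in fileids:
        # rpartition splits on the last '/'; top-level files get folder ''.
        folder, _, name = fileid.rpartition('/')
        groups[folder].append(name)
    return dict(groups)

folders = group_by_folder(sample_fileids)
print(folders['abelard'])  # ['dialogus.txt', 'epistola.txt', 'historia.txt']
print(len(folders['']))    # 3
```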

We have 2,162 different text files that we can explore via the Latin Library. Let’s say we want to work with only one—the poems of Catullus. We can inspect the list of files to find the file we want to work with. Here is one way, using a list comprehension:

files = latinlibrary.fileids()

print([file for file in files if 'catullus' in file])
>>> ['catullus.txt']

### Note that the list comprehension
### [file for file in files if 'catull' in file]
### returns the following list:
### ['catullus.txt', 'pascoli.catull.txt']
###
### You may need to experiment with variations
### on the names of authors and works to find
### the file you are looking for.
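
If you find yourself doing this kind of search often, it may be worth wrapping it in a small function that ignores case. The sketch below is one way to do it; find_files is a hypothetical helper of my own, and sample_fileids is a short hand-made stand-in for the real list returned by latinlibrary.fileids().

```python
# Stand-in for latinlibrary.fileids(); only a handful of IDs mentioned in this post.
sample_fileids = ['12tables.txt', 'abelard/dialogus.txt',
                  'catullus.txt', 'pascoli.catull.txt']

def find_files(fileids, fragment):
    """Hypothetical helper: return the file IDs containing `fragment`, ignoring case."""
    fragment = fragment.lower()
    return [fileid for fileid in fileids if fragment in fileid.lower()]

print(find_files(sample_fileids, 'Catullus'))  # ['catullus.txt']
print(find_files(sample_fileids, 'catull'))    # ['catullus.txt', 'pascoli.catull.txt']
```

With the real corpus you would pass latinlibrary.fileids() as the first argument instead of the sample list.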

We can now limit our corpus by passing an argument to the raw, sents, and words methods we already know from the previous post:

catullus_raw = latinlibrary.raw('catullus.txt')
catullus_sents = latinlibrary.sents('catullus.txt')
catullus_words = latinlibrary.words('catullus.txt')

From here, we can do anything we did with the entire corpus on a smaller scale. So, for example, if we want to see what the most common ‘words’ are in Catullus, we can use the following:

# Find the most commonly occurring 'words' (i.e., tokens) in Catullus:

from collections import Counter

catullus_wordlist = list(catullus_words)

c = Counter(catullus_wordlist)
print(c.most_common(25))
>>> [(',', 1598), ('.', 686), ('et', 193), ('est', 181), ('in', 159), ('ad', 142), (':', 141), ('*', 135), ('non', 135), ('?', 134), ('ut', 99), ('nec', 92), ('te', 90), ('quod', 84), ('cum', 82), ('sed', 80), ('quae', 74), ('tibi', 70), ('si', 69), ('mihi', 68), ('me', 68), ('o', 57), ('aut', 57), ('qui', 56), ('atque', 51)]
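
Punctuation and the '*' token crowd the top of that list. One possible refinement, sketched here on a tiny hand-tokenized sample (Catullus 85) rather than the real catullus_words, is to keep only alphabetic tokens and fold case before counting. The name sample_tokens is an assumption for illustration only; note also that str.isalpha() will drop any token containing digits or hyphens, which may or may not be what you want.

```python
from collections import Counter

# Hand-tokenized stand-in for catullus_words (Catullus 85), for illustration only.
sample_tokens = ['Odi', 'et', 'amo', '.', 'quare', 'id', 'faciam', ',',
                 'fortasse', 'requiris', '?', 'nescio', ',', 'sed', 'fieri',
                 'sentio', 'et', 'excrucior', '.']

# Keep purely alphabetic tokens and lowercase them before counting.
word_counts = Counter(token.lower() for token in sample_tokens if token.isalpha())

print(word_counts.most_common(1))  # [('et', 2)]
```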

Not particularly interesting stuff here, but we’re on our way. We now have the raw materials to do any number of word studies based on Catullus’ poetry. We could, for example, find the number of times Lesbia appears in this part of the corpus:

### There are many ways we could do this.
### Let's use regular expressions for this one...

import re
lesbia = re.findall(r'\bLesbi.+?\b', catullus_raw, re.IGNORECASE)
# i.e. return a list of matches of single words that begin with
# 'Lesbi' and continue up to the next word boundary, ignoring case.

print(len(lesbia))
>>> 31

# Note that this includes the "titles" to the poems that are included in the Latin Library, e.g. "II. fletus passeris Lesbiae".
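
A total count hides which forms of the name actually occur. As a rough sketch, we can tally the matches by form with a Counter. Two caveats: sample_raw below is a few lines typed in by hand as a stand-in for catullus_raw, and the pattern uses \w+ in place of the .+?\b used above, which should behave the same for ordinary words but is a substitution on my part.

```python
import re
from collections import Counter

# Stand-in for catullus_raw: a few lines typed in by hand, not the real file.
sample_raw = (
    "II. fletus passeris Lesbiae\n"
    "Vivamus, mea Lesbia, atque amemus\n"
    "Caeli, Lesbia nostra, Lesbia illa\n"
)

# 'Lesbi' at the start of a word, followed by one or more word characters.
matches = re.findall(r'\bLesbi\w+', sample_raw, re.IGNORECASE)

# Lowercase each match so that 'Lesbia' and 'lesbia' are counted together.
form_counts = Counter(match.lower() for match in matches)
print(form_counts.most_common())  # [('lesbia', 3), ('lesbiae', 1)]
```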

In the next post, we’ll figure out how to do the same thing with an author like Virgil, that is, an author whose corpus consists not of a single text file, as with Catullus, but of several.
