Making a Keyword-in-Context index with CLTK

code, tutorial

The “key word-in-context” (KWIC) index was an innovation of early information retrieval, the basic concepts of which were developed in the late 1950s by H.P. Luhn. The idea is to produce a list of all occurrences of a word, aligned so that the word is printed as a column in the center of the text with the corresponding context printed to the immediate left and right. This allows a user to scan quickly a large number of uses in a given text. For examples, David Packard’s 1968 A Concordance to Livy uses an alphabetical KWIC format. Here are the first entries for the preposition e in Packard’s concordance:

Screen Shot 2017-08-17 at 10.16.31 AM

Using the Classical Language Toolkit and the Natural Language Toolkit’s Text module, we can easily create KWICs for texts in the Latin Library.

[This post assumes that you have already imported the Latin Library corpus as described in an earlier post and as always that you are running the latest version of CLTK on Python3.6. This tutorial was tested on v. 0.1.56.]

First, we can import a text from the Latin Library—here, Cicero’s De amicitia—as a list of words:

In [1]: from cltk.corpus.latin import latinlibrary
In [2]: amicitia_words = latinlibrary.words('cicero/amic.txt')
In [3]: print(amicitia_words[117:188])
Out [3]: ['Q.', 'Mucius', 'augur', 'multa', 'narrare', 'de', 'C.', 'Laelio', 'socero', 'suo', 'memoriter', 'et', 'iucunde', 'solebat', 'nec', 'dubitare', 'illum', 'in', 'omni', 'sermone', 'appellare', 'sapientem', ';', 'ego', 'autem', 'a', 'patre', 'ita', 'eram', 'deductus', 'ad', 'Scaevolam', 'sumpta', 'virili', 'toga', ',', 'ut', ',', 'quoad', 'possem', 'et', 'liceret', ',', 'a', 'senis', 'latere', 'numquam', 'discederem', ';', 'itaque', 'multa', 'ab', 'eo', 'prudenter', 'disputata', ',', 'multa', 'etiam', 'breviter', 'et', 'commode', 'dicta', 'memoriae', 'mandabam', 'fieri', '-que', 'studebam', 'eius', 'prudentia', 'doctior', '.']

We can then convert this list of words to an NLTK Text:

In [4]: import nltk
In [5]: amicitia_text = nltk.Text(amicitia_words)
In [6]: print(type(amicitia_text))
Out [6]: nltk.text.Text

Now that we have an NLTK text, there are several methods available to us, including “concordance,” which generates a KWIC for us based on keywords that we provide. Here, for example, is the NLTK concordance for ‘amicus’:

In [7]: amicitia_text.concordance('amicus')
Out [7]: Displaying 5 of 5 matches:
tentiam . Quonam enim modo quisquam amicus esse poterit ei , cui se putabit in
 optare , ut quam saepissime peccet amicus , quo plures det sibi tamquam ansas
escendant . Quamquam Ennius recte : Amicus certus in re incerta cernitur , tam
m in amicitiam transferetur , verus amicus numquam reperietur ; est enim is qu
itas . [ 95 ] Secerni autem blandus amicus a vero et internosci tam potest adh

The KWICs generated by NLTK Text are case insensitive (see amicus and Amicus above) and sorted sequentially by location in the text. There’s not much customization available for the method, so this is pretty much what it does. [You can set parameters for context width and number of lines presented; e.g. amicitia_text.concordance('amicus', width=50, lines=3)] Admittedly, it is pretty basic—it does not even return an identification or location code to help the user move easily to the wider context and the only way we know that the fifth match is in Chapter 95 is because the chapter number happens to be included in the context. At the same time, it is another step towards combining existing resources and tools (here, NLTK Text and a CLTK corpus) to explore Latin literature from different angles.

In a future post, I will build a KWIC method from scratch that offers more flexibility, especially with respect to context scope and location identification.

Advertisements

Replicating Zipf

article

As usually formulated, Zipf’s law states that when given a natural-language corpus, the relationship between the frequency of words and their frequency rank is inversely proportional. The results of a recent post on word frequencies in Latin suggested that Zipf’s law would hold up for this language and I wanted to test it to be sure. I was working with Seneca’s Epistulae Morales when I came across an interesting bit of trivia in R.E. Wyllys’s article, “Empirical and Theoretical Bases of Zipf’s Law”:

In this next book, The Psycho-Biology of Language, published in 1935, Zipf called attention for the first time to the phenomenon that has come to bear his name. This book contained Zipf’s first diagram of the log(frequency)-v.-log(rank) relationship, a Zipf curve for his count of words in the Latin writings of Plautus.

Plautus now seemed much more fun to work with than Seneca. So I decided to write a script that would replicate Zipf’s original experiment from the texts of Plautus up using Python and available online texts.

Digging into Psycho-Biology—which has the incredible subtitle An Introduction to Dynamic Philology—I learned the following about Zipf’s method (pp. 24-25):

With all the words of four Plautine plays (AululariaMostellariaPseudolus, and Trinummus) selected for material, the average number of syllables in each frequency category was computed. …The average number of syllables of all words occurring once was 3.23, of those occurring twice, 2.92, etc.

Zipf combined his Plautine experiment with a study of morpheme length in colloquial Chinese and in the English of American newspapers. For all three, he concludes (p. 27) that “a statistical relationship has been established between high frequency, small variety, and shortness in length, a relationship which is presumably valid for language in general.”

So, Zipf’s experiment with the plays of Plautus involved not the distribution of words, but the distribution of the frequency of words containing a certain number of syllables. Not what I was expecting to work with, say, with the Senecan letters, but an interesting problem nevertheless and one no less tractable using Python.

Here is his chart of word-syllable frequency in Plautus:

Screen Shot 2017-04-09 at 6.48.07 AM.png

To replicate Zipf’s method, I have done the following: 1. I downloaded the texts of the four plays from Tesserae. These files were for the most part already preprocessed (e.g. character names preceding lines have already been removed, unlike in the Latin Library texts); 2. I computed the number of syllables in each word, estimating this by number of vowels per word; and 3. I created a frequency table using Pandas, grouping words by their frequency and averaging the number syllables. (All of the code for this post can be found at https://github.com/diyclassics/zipf/blob/master/zipf.ipynb.)

Here is what my chart based on the Tesserae texts of Plautus looks like:

occurrences words avg_syll
0 1 5461 3.274309
1 2 1199 2.937448
2 3 494 2.777328
3 4 301 2.714286
4 5 152 2.631579
5 6 137 2.598540
6 7 84 2.440476
7 8 73 2.438356
8 9 51 2.411765
etc.

The numbers are similar and so encouraging. The differences, I assume, come from two main sources. One, Zipf, from what I can determine, does not name which edition of Plautus he used for this study, or perhaps just as likely, which concordance or wordlist. I have 35,215 tokens to work with and Zipf has 33,094. Close, but not ideal. More encouraging are the counts for some of the top counts. Zipf has 5,429 words that appear once where I have 5,461, or a difference of 32 words. For words appearing twice, Zipf has 1,198. I have only one more. So, the total variation seems at least to be distributed throughout the list. Secondly, Zipf does not explain how he determined the number of syllables per word. I used vowel-counting to keep my own experiment rooted in the text and replicable. It is impossible for me to know if Zipf is consistent in syllabifying words or even correct (I’ll assume he was though!). Again, the numbers are more encouraging than not. For single-occurrence words, Zipf has an average of 3.23 syllables; I show 3.27. For words that appear twice, it is his average of 2.92 against mine of 2.94.

We are now ready to plot these numbers “upon double logarithmic graph-paper,” or the Matplotlib equivalent, the loglog function. Here is an comparison of what Zipf got and what I get:

Screen Shot 2017-04-08 at 5.39.51 AM.png

download.png

Again, encouragingly close. I will note that where Zipf has plotted “the orderliness of the distribution of words” (i.e., the downward sloping line) in a ab2 = k relationship, where a is the number of words for a given occurrence and b the number of occurrences. I plotted instead a line of best fit using Numpy and Matplotlib which seems very close. I will look at the relationship between these two ideas in a future post.

Zipf concludes his chapter with the following comment (pp. 47-48):

The high degree of orderliness of the distribution of words in the stream of speech points unmistakably to a tendency to maintain an equilibrium in the stream of speech between frequency on the one hand and what may tentatively be termed variety on the other.

The graphs above suggest as much. But Zipf’s conclusions are not the main point of this post. Rather, this post is meant to show that we have the texts and methods at hand to replicate past experiments that had to be done with analog methods and with great difficulty in tracing specific, yet critical aspects, of their methods. I can point you to exactly the texts and exactly the code I used to derive my plot. Coding is a series of decisions based on an input and resulting in an output. So is a good argument. If I can put myself somewhere in the middle with a computational humanities approach, I feel like I am making some progress.

Next up, a look at the distribution of Seneca’s vocabulary as originally scheduled.

Finding Palindromes in the Latin Library

article

A playful diversion for the morning: What is the longest palindrome in the Latin language? And secondarily, what are the most common? (Before we even check, it won’t be too much of a surprise that non takes the top spot. It is the only palindrome in the Top 10 Most Frequent Latin Words list.)

As with other experiments in this series, we will use the Latin Library as a corpus and let it be our lexical playground. In this post, I will post some comments about method and report results. The code itself, using the CLTK and the CLTK Latin Library corpus with Python3, is available in this notebook.

As far as method, this experiment is fairly straightforward. First, we import the Latin Library, preprocess it in the usual ways, tokenize the text, and remove tokens of less than 3 letters. Now that we have a list of tokens, we can look for palindromes. We can use Python’s text slice and negative step to create a test for palindromes. Something like this:

def is_palindrome(token):
    return token == token[::-1]

This function takes a token, makes a copy but reversing the order of the letters, and returns true if they match. At this point, we can filter our list of tokens using this test and report our results. So…

Drumroll, please—the most frequently occurring palindromes in the Latin language are:

non, 166078
esse, 49426
illi, 9922
ibi, 7155
ecce, 3662
tot, 3443
sumus, 2678
sis, 1526
usu, 1472
tenet, 1072

Second drumroll, please—the longest palindrome in the Latin language is Massinissam (11 letters!), the accusative form of Massinissa, the first king of Numidia. We find other proper names in the top spots for longest palindromes: Aballaba, a site long Hadrian’s Wall reported in the Notitia Dignitatum; Suillius, a 1st-cent. Roman politician; and the Senones, a Celtic tribe well known to us from Livy among others. The longest Latin palindrome that is not a proper name is the dative/ablative plural of the superlative for similis, namely simillimis (10 letters). Rounding out the top ten are: the accusative of sarabara, “wide trowsers,” namely sarabaras; the feminine genitive plural of muratus, “walled,” namely muratarum; the first-person plural imperfect subjunctive of sumere, that is sumeremus; the  dative/ablative of silvula, “a little wood”, namely silvulis (notice the u/v normalization though); and rotator, “one who turns a thing round in a circle, a whirler round,” as Lewis & Short define it.

Not much here other than a bit of Latin word trivia. But we see again that using a large corpus like The Latin Library with Python/CLTK, we can extract information about the language easily. This sort of casual experiment lays the foundation for similar work that could be used perhaps to look into questions of greater philological significance.

A closing note. Looking over the list of Latin palindromes, I think my favorite is probably mutatum, a word that means something has changed, but when reversed stays exactly the same.

 

 

Nuntii Latini: 2016 Year in Review

tutorial

Earlier this week, Radio Bremen announced that it would be discontinuing its Nuntii Latini Septimanales. As a weekly listener, I was disappointed by the news—luckily, the monthly broadcasts will continue. Where else can you read news stories about heros artis musicae mortuus, i.e. David Bowie, or Trump victor improvisus? Coincidentally, I learned about the fate of Septimanales while preparing a quick study of word usage in these weekly news broadcasts. So, as a tribute to the work of the Nuntii writers and as a follow up to the Latin Library word-frequency post from earlier this week, I present “Nuntii Latini: 2016 Year in Review”.

[A Jupyter Notebook with the code and complete lists of tokens and lemmas for this post is available here.]

A quick note about how I went about this work. To get the data, I collected a list of web pages from the “Archivum Septimanale” page and used the Python Requests package to get the html contents of each of the weekly posts. I then used Beautiful Soup to extract only the content of the three weekly stories that Radio Bremen publishes every week. Here is a sample of what I scraped from each page:

[['30.12.2016',
  'Impetus terroristicus Berolini factus',
  'Anis Amri, qui impetum terroristicum Berolini fecisse pro certo habetur, '
  'a custode publico prope urbem Mediolanum in fuga necatus est. In Tunisia, '
  'qua e civitate ille islamista ortus est, tres viri comprehensi sunt, in his '
  'nepos auctoris facinoris. Quos huic facinori implicatos esse suspicio est. '
  'Impetu media in urbe Berolino facto duodecim homines interfecti, '
  'quinquaginta tres graviter vulnerati erant.'],
 ['30.12.2016',
  'Plures Turci asylum petunt',
  'Numerus asylum petentium, qui e Turcia orti sunt, anno bis millesimo sexto '
  'decimo evidenter auctus est, ut a moderatoribus Germaniae nuntiatur. '
  'Circiter quattuor partes eorum sunt Cordueni. Post seditionem ad irritum '
  'redactam ii, qui Turciam regunt, magis magisque regimini adversantes '
  'opprimunt, imprimis Corduenos, qui in re publica versantur.'],
 ['30.12.2016',
  'Septimanales finiuntur',
  'A. d. XI Kal. Febr. anni bis millesimi decimi redactores nuntiorum '
  'Latinorum Radiophoniae Bremensis nuntios septimanales lingua Latina '
  'emittere coeperunt. Qui post septem fere annos hoc nuntio finiuntur. Nuntii '
  'autem singulorum mensium etiam in futurum emittentur ut solent. Cuncti '
  'nuntii septimanales in archivo repositi sunt ita, ut legi et audiri '
  'possint.']]

The stories were preprocessed following more or less the same process that I’ve used in earlier posts. One exception was that I need to tweak the CLTK Latin tokenizer. This tokenizer currently checks tokens against a list of high-frequency forms ending in ‘-ne‘ and ‘-n‘ to best predict when the enclitic –ne should be assigned its own token. Nuntii Latini unsurpisingly contains a number of words not on this list—mostly proper names ending in ‘-n‘, such as Clinton, Putin, Erdoğan, John and Bremen among others.

Here are some basic stats about the Nuntii Latini 2016:

Number of weekly nuntii: 46 (There was a break over the summer.)
Number of stories: 138
Number of tokens: 6546
Number of unique tokens: 3021
Lexical diversity: 46.15% (i.e. unique tokens / tokens)
Number of unique lemmas: 2033
Here are the top tokens:
Top 10 tokens in Nuntii Latini 2016:

       TOKEN       COUNT       Type-Tok %  RUNNING %   
    1. in          206         3.15%       3.15%       
    2. est         135         2.06%       5.21%       
    3. et          106         1.62%       6.83%       
    4. qui         70          1.07%       7.9%        
    5. ut          56          0.86%       8.75%       
    6. a           54          0.82%       9.58%       
    7. sunt        50          0.76%       10.34%      
    8. esse        42          0.64%       10.98%      
    9. quod        41          0.63%       11.61%      
   10. ad          40          0.61%       12.22%      

How does this compare with the top tokens from the Latin Library that I posted earlier in the week? Usual suspects overall. Curious that the Nuntii uses -que relatively infrequently and even et less than we would expect compared to a larger sample like the Latin Library. There seems to be a slight preference for a (#6) over ab (#27). [Similar pattern is e (#21) vs. ex (#25).] And three forms of the verb sum crack the Top 10—an interesting feature of the Nuntii Latini style.

The top lemmas are more interesting:

Top 10 lemmas in Nuntii Latini 2016:

       LEMMA       COUNT       TYPE-LEM %  RUNNING %   
    1. sum         323         4.93%       4.93%       
    2. qui         208         3.18%       8.11%       
    3. in          206         3.15%       11.26%      
    4. et          106         1.62%       12.88%      
    5. annus       91          1.39%       14.27%      
    6. ab          74          1.13%       15.4%       
    7. hic         64          0.98%       16.38%      
    8. ut          56          0.86%       17.23%      
    9. ille        51          0.78%       18.01%      
   10. homo        49          0.75%       18.76%

Based on the top tokens, it is no surprise to see sum take the top spot. At the same time, we should note that this is a good indicator of Nuntii Latini style. Of greater interest though, unlike the Latin Library lemma list, we see content words appearing with greater frequency. Annus is easily explained by the regular occurrence of dates in the news stories, especially formulas for the current year such as anno bis millesimo sexto decimo. Homo on the other hand tells us more about the content and style of the Nuntii. Simply put, the news stories concern the people of the world and in the abbreviated style of the Nuntii, homo and often homines is a useful and general way of referring to them, e.g. Franciscus papa…profugos ibi permanentes et homines ibi viventes salutavit from April 22.

Since I had the Top 10,000 Latin Library tokens at the ready, I thought it would be interesting to “subtract” these tokens from the Nuntii list to see what remains. This would give a (very) rough indication of which words represent the 2016 news cycle more than Latin usage in general. So, here are the top 25 tokens from the Nuntii Latini that do not appear in the Latin Library list:

Top 25 tokens in Nuntii Latini 2016 (not in the Latin Library 10000):

       LEMMA               COUNT       
    1. praesidens          19          
    2. turciae             17          
    3. ministrorum         14          
    4. americae            13          
    5. millesimo           13          
    6. moderatores         12          
    7. unitarum            12          
    8. electionibus        10          
    9. factio              9           
   10. merkel              8           
   11. factionis           8           
   12. imprimis            8           
   13. habitis             8           
   14. europaeae           8           
   15. millesimi           8           
   16. turcia              7           
   17. britanniae          7           
   18. cancellaria         7           
   19. angela              7           
   20. declarauit          7           
   21. recep               7           
   22. democrata           7           
   23. profugis            7           
   24. tayyip              7           
   25. suffragiorum        6

As I said above, this is a rough, inexact way of weighting the vocabulary. At the same time, it does give a good sense of the year in (Latin) world news. We see important regions in world politics (Europe, Turkey, America, Britain), major players (Angela Merkel, Recep Tayyip [Erdoğan]), and their titles (praesidens, minister, moderator). There are indicators of top news stories like the elections (electio, factio, suffragium, democrata) in the U.S and elsewhere as well as the refugee crisis (profugus). Now that I have this dataset, I’d like to use it to look for patterns in the texts more systematically, e.g. compute TF-IDF scores, topic model the stories, extract named entities, etc. Look for these posts in upcoming weeks.

10,000 Most Frequent ‘Words’ in the Latin Library

article

A few months ago, I posted a list of the 10,000 most frequent words in the PHI Classical Latin Texts. While I did include a notebook with the code for that experiment, I could not include the data because the PHI texts are not available for redistribution. So here is an updated post, based on a freely available corpus of Latin literature—and one that I have been using for my recent Disiecta Membra posts like this one and this one and this one—the Latin Library. (The timing is good, as the Latin Library has received some positive attention recently.) The code for this post is available as a Jupyter Notebook here.

The results, based on the 13,563,476 tokens in the Latin Library:

Top 10 tokens in Latin Library:

       TOKEN       COUNT       TYPE-TOK %  RUNNING %   
    1. et          446474      3.29%       3.29%       
    2. in          274387      2.02%       5.31%       
    3. est         174413      1.29%       6.6%        
    4. non         166083      1.22%       7.83%       
    5. -que        135281      1.0%        8.82%       
    6. ad          133596      0.98%       9.81%       
    7. ut          119504      0.88%       10.69%      
    8. cum         109996      0.81%       11.5%       
    9. quod        104315      0.77%       12.27%      
   10. si          95511       0.70%       12.97%

How does this compare with the previous test against the PHI run? Here are the frequency rankings from the PHI run, 1 through 10: et, in, -que, ne, est, non, ut, cum, si, and ad. So—basically, the same. The loss of ne from the top 10 is certainly a result of improvements to the CLTK tokenizer, specifically improvements in tokenizing the the enclitic -ne. Ne is now #41 with 26,825 appearances and -ne #30 with 36,644 appearances. The combined count would still not crack the Top 10, which suggests that there may have been a lot of words wrongly tokenized of the form, e.g. ‘homine’ as [‘homi’, ‘-ne’]. (I suspect that this still happens, but am confident that the frequency of this problem is declining. If you spot any “bad” tokenization involving words ending in ‘-ne‘ or ‘-n‘, please submit an issue.) With ne out of the Top 10, we see that quod has joined the list. It should come as little surprise that quod was #11 in the PHI frequency list.

Since the PHI post, significant advances have been made with the CLTK Latin lemmatizer. Recent tests show accuracies consistently over 90%. So, let’s put out a provisional list of top lemmas as well—

Top 10 lemmas in Latin Library:

       LEMMA       COUNT       TYPE-LEM %  RUNNING %   
    1. et          446474      3.29%       3.29%       
    2. sum         437415      3.22%       6.52%       
    3. qui         365280      2.69%       9.21%       
    4. in          274387      2.02%       11.23%      
    5. is          213677      1.58%       12.81%      
    6. non         166083      1.22%       14.03%      
    7. -que        144790      1.07%       15.1%       
    8. hic         140421      1.04%       16.14%      
    9. ad          133613      0.99%       17.12%      
   10. ut          119506      0.88%       18.0%

No real surprises here. Six from the Top 10 lemmas are indeclinable, whether conjunctions, prepositions, adverbs, or enclitic, and so remain from the top tokens list: etinnon-quead and ut. Forms of sum and qui can be found in the top tokens list as well, est and quod respectively. Hic rises to the top based on its large number of relatively high ranking forms, though it should be noted that its top ranking form is #23 (hoc), followed by #46 (haec), #71 (his), #91 (hic), and #172 (hanc) among others. Is also joins the top 10, though I have my concerns about this because of the relatively high frequency of overlapping forms with the verb eo (i.e. eoiseam, etc.). This result should be reviewed and tested further.

While I’m thinking about it, other concerns I have would be the counts for hic, i.e. with respect to the demonstrative and the adverb, as well as the slight fluctuations in the counts of indeclinables, e.g. ut (119,504 tokens vs. 119,506 lemmas), or the somewhat harder to explain jump in -que. So, we’ll consider this a work in progress. But one that is—at least for the Top 10—more or less in line with other studies (e.g. Diederich, which—with the exception of cum—has same words, if different order.)

 

Parlor Game, Revisited

article

In August, the Dickinson College Commentaries blog featured a post on common Latin words that are not found in Virgil’s Aeneid. Author Chris Francese refers to the post as a “diverting Latin parlor game” and in that spirit of diversion I’d like play along and push the game further.

The setup is as follows, to quote the post:

Take a very common Latin word (in the DCC Latin Core Vocabulary) that does not occur in Vergil’s Aeneid, and explain its absence. Why would Vergil avoid certain lemmata (dictionary head words) that are frequent in preserved Latin?

So, Virgil avoids words such as aegre, arbitrorauctoritas, beneficium, etc. and it is up to us to figure out why. An interesting question and by asking the question, Francese enters a fascinating conversation on Latin poetic diction which includes Bertil Axelson, Gordon Williams, Patricia Watson, and many others (myself included, I suppose). But my goal in this post is not so much to answer the “why?” posed in the quote above, but more to investigate the methods through which we can start the conversation.

The line in Francese’s post that got me thinking was this:

The Vergilian data comes from LASLA  (no automatic lemmatizers were used, all human inspection), as analyzed by Seth Levin.

It just so happened that when this post came out, I was completing a summer-long project building an “automatic lemmatizer” for Latin for the Classical Language Toolkit. So my first reaction to the post was to see how close I could get to the DCC Blog’s list using the new lemmatizer. The answer is pretty close.

[I have published a Jupyter Notebook with the code for these results here: https://github.com/diyclassics/dcc-lemma/blob/master/Parlor%20Game%2C%20Revisited.ipynb.]

There are 75 lemmas from the DCC Latin Core Vocabulary that do not appear in the Aeneid (DCC Missing). Using the Backoff Latin lemmatizer on the Latin Library text of the Aeneid (CLTK Missing), I returned a list of 119 lemmas. There are somewhere around 6100 unique lemmas in the Aeneid meaning that our results only differ by 0.7%.

The results from CLTK Missing show 69 out of 75 lemmas (92%) from the DCC list. The six lemmas that it missed are:

[‘eo’, ‘mundus’, ‘plerusque’, ‘reliquus’, ‘reuerto’, ‘solum’]

Some of these can be easily explained. Reliqui (from relinquo) was incorrectly lemmatized as reliquus—an error. Mundus was lemmatized correctly and so appears in the list of Aeneid lemmas, just not the one on DCC Missing, i.e. mundus (from mundus, a, um = ‘clean’). A related problem with both eo and solum—homonyms of both these words appear in the list of Aeneid lemmas. (See below on the issue of lemmatizing adverbs/adjectives, adjective/nouns, etc.)  Plerusque comes from parsing error in my preprocessing script, where I split the DCC list on whitespace. Since this word is listed as plērus- plēra- plērumqueplerus- made it into reference list, but not plerusque. (I could have fixed this, but I thought it was better in this informal setting to make clear the full range on small errors that can creep into a text processing “parlor game” like this.)  Lastly, is reverto wrong? The LASLA lemma is revertor which—true enough—does not appear on the DCC Core Vocabulary, but this is probably too fine a distinction. Lewis & Short, e.g., lists reverto and revertor as the headword.

This leaves 50 lemmas returned in CLTK Missing that are—compared to DCC Missing—false positives. The list is as follows:

[‘aduersus’, ‘alienus’, ‘aliquando’, ‘aliquis’, ‘aliter’, ‘alius’, ‘animal’, ‘antequam’, ‘barbarus’, ‘breuiter’, ‘certe’, ‘citus’, ‘ciuitas’, ‘coepi’, ‘consilium’, ‘diuersus’, ‘exsilium’, ‘factum’, ‘feliciter’, ‘fore’, ‘forte’, ‘illuc’, ‘ingenium’, ‘item’, ‘longe’, ‘male’, ‘mare’, ‘maritus’, ‘pauci’, ‘paulo’, ‘plerus’, ‘praeceptum’, ‘primum’, ‘prius’, ‘proelium’, ‘qua’, ‘quantum’, ‘quomodo’, ‘singuli’, ‘subito’, ‘tantum’, ‘tutus’, ‘ualidus’, ‘uarius’, ‘uere’, ‘uero’, ‘uictoria’, ‘ultimus’, ‘uolucer’, ‘uos’]

To be perfectly honest, you learn more about the lemmatizer than the Aeneid from this list and this is actually very useful data for uncovering places where the CLTK tools can be improved.

So, for example, there are a number of adverbs on this list (breuiter, certe, tantum, etc.). These are cases where the CLTK lemmatizer return the associated adjective (so breuiscertustantus). This is a matter of definition. That is, the CLTK result is more different than wrong. We can debate whether some adverbs deserve to be given their own lemma, but is still that—a debate. (Lewis & Short, e.g. has certe listed under certus, but a separate entry for breuiter.)

The DCC Blog post makes a similar point about nouns and adjectives:

At times there might be some lemmatization issues (for example barbarus came up in the initial list of excluded core words, since Vergil avoids the noun, though he uses the adjective twice. I deleted it from this version.

This explains why barbarus appears on CLTK Missing. Along the same line, factum has been lemmatized under facio. Again, not so much incorrect, but a matter of how we define our terms and set parameters for the lemmatizer. I have tried as much as possible to follow the practice of the Ancient Greek and Latin Dependency Treebank and the default Backoff lemmatizer uses the treebanks as the source of its default training data. This explains why uos appears in CLTK Missing—the AGLDT lemmatizes forms of uos as the second-person singular pronoun tu.

As I continue to test the lemmatizer, I will use these results to fine tune and improve the output, trying to explain each case and make decisions such as which adverbs need to be lemmatized as adverbs and so on. It would be great to hear comments, either on this post or in the CLTK Github issues, about where improvements need to be made.

There remains a final question. If the hand lemmatized data from LASLA produces more accurate results, why use the CLTK lemmatizer at all?

It is an expensive process—time/money/resources—to produce curated data. This data is available for Virgil, but may not be for another author. What if we wanted to play the same parlor game with Lucan? I don’t know whether lemmatized data is available for Lucan, but I was a trivial task for me to rerun this experiment (with minimal preprocessing changes) on the Bellum Ciuile. (I placed the list of DCC core words not appearing in Lucan at the bottom of this post.) And I could do it for any text in the Latin Library just as easily.

Automatic lemmatizers are not perfect, but they are often good and sometimes very good. More importantly, they are getting better and, in the case of the CLTK, they are being actively developed and developers like myself can work with researchers to make the tools as good as possible.

Lemmas from the DCC Latin Core Vocabulary not found in Lucan*
(* A first draft by an automatic lemmatizer)

accido
adhibeo
aduersus
aegre
alienus
aliquando
aliquis
aliter
alius
amicitia
antequam
arbitror
auctoritas
autem
beneficium
bos
breuiter
celebro
celeriter
centum
certe
ceterum
citus
ciuitas
coepi
cogito
comparo
compono
condicio
confiteor
consilium
consuetudo
conuiuium
deinde
desidero
dignitas
disciplina
diuersus
dormio
edico
egregius
epistula
existimo
exspecto
factum
familia
fere
filia
fore
forte
frumentum
gratia
hortor
illuc
imperator
impleo
impono
ingenium
initium
integer
interim
interrogo
intersum
ita
itaque
item
legatus
libido
longe
magnitudo
maiores
male
mare
maritus
memoria
mulier
multitudo
narro
nauis
necessitas
negotium
nemo
oportet
oratio
pauci
paulo
pecunia
pertineo
plerumque
plerus
poeta
postea
posterus
praeceptum
praesens
praesidium
praeterea
primum
princeps
principium
priuatus
prius
proelium
proficiscor
proprius
puella
qua
quantum
quattuor
quemadmodum
quomodo
ratio
sanctus
sapiens
sapientia
scientia
seruus
singuli
statim
studeo
subito
suscipio
tantum
tempestas
tutus
ualidus
uarius
uere
uero
uictoria
uinum
uitium
ultimus
uoluntas
uos
utrum

Backoff Latin Lemmatizer, pt. 1

tutorial

I spent this summer working on a new approach to Latin lemmatization. Following and building on the logic of the NLTK Sequential Backoff Tagger—a POS-tagger that tries to maximize the accuracy of part-of-speech tagging tasks by combining multiple taggers and making several passes over the input—the Backoff Lemmatizer takes a token and passes this to any one of a number of different lemmatizers. This lemmatizer either returns a match or passes the token, or backs off, to another lemmatizer. When no more lemmatizers are available, it returns None. This setup allows users to customize the order of the lemmatizer sequence to best fit their processing task.

In this series of blog posts, I will explain how to use the different lemmatizers available in the Backoff Latin Lemmatizer. Here I will introduce the two most basic lemmatizers: DefaultLemmatizer and IdentityLemmatizer. Both are simple and will produce results with poor accuracy. (In fact, the DefaultLemmatizer’s accuracy will pretty much always be 0%!) Yet both can be useful as the final backoff lemmatizer in the sequence.

The DefaultLemmatizer returns the same “lemma” for all tokens. You can either specify what you want the lemmatizer to return, or if you leave the parameter blank, None. Note that all of the lemmatizers take as their input a list of tokens.

> from cltk.lemmatize.latin.backoff import DefaultLemmatizer
> from cltk.tokenize.word import WordTokenizer

> lemmatizer = DefaultLemmatizer()
> tokenizer = WordTokenizer('latin')

> sent = "Quo usque tandem abutere, Catilina, patientia nostra?"

> # Tokenize the sentence
> tokens = tokenizer.tokenize(sent)

> lemmatizer.lemmatize(tokens)
[('Quo', None), ('usque', None), ('tandem', None), ('abutere', None), (',', None), ('Catilina', None), (',', None), ('patientia', None), ('nostra', None), ('?', None)]

As mentioned above, you can specify your own “lemma” instead of  None.

> lemmatizer = DefaultLemmatizer('UNK')
> lemmatizer.lemmatize(tokens)
[('Quo', 'UNK'), ('usque', 'UNK'), ('tandem', 'UNK'), ('abutere', 'UNK'), (',', 'UNK'), ('Catilina', 'UNK'), (',', 'UNK'), ('patientia', 'UNK'), ('nostra', 'UNK'), ('?', 'UNK')]

This is all somewhat unimpressive. But the DefaultLemmatizer is in fact quite useful. When placed as the last lemmatizer in a backoff chain, it allows you to identify easily (and with your own designation) which tokens did not return a match in any of the preceding lemmatizers. Note again that the accuracy of this lemmatizer is basically always 0%.

The IdentityLemmatizer has a similarly straightforward logic, returning the input token as the output lemma:

> from cltk.lemmatize.latin.backoff import IdentityLemmatizer
> from cltk.tokenize.word import WordTokenizer

> lemmatizer = IdentityLemmatizer()
> tokenizer = WordTokenizer('latin')

> sent = "Quo usque tandem abutere, Catilina, patientia nostra?"

> # Tokenize the sentence
> tokens = tokenizer.tokenize(sent)

> lemmatizer.lemmatize(tokens)
[('Quo', 'Quo'), ('usque', 'usque'), ('tandem', 'tandem'), ('abutere', 'abutere'), (',', ','), ('Catilina', 'Catilina'), (',', ','), ('patientia', 'patientia'), ('nostra', 'nostra'), ('?', '?')]

Like the DefaultLemmatizer, the IdentityLemmatizer is useful as the final lemmatizer in a backoff chain. The difference here is that it is a default meant to boost accuracy—there is likely to be some lemmas in any given input that find no match but are in fact already correct. In the example above, by simply using the IdentityLemmatizer on this list of tokens, we get an accuracy—including punctuation—of  80% (i.e. 8 out of 10). Not a bad start (and not all sentences will be so successful!) but one can imagine a case in which, say, “Catalina” is not found in training data and it would be better to take a chance on returning a match than not. Of course, if this is not the case, the DefaultLemmatizer is probably the better final backoff option.

In the next post, I will introduce a second approach to lemmatizing with the Backoff Latin Lemmatizer, namely working with a model, i.e. a dictionary of lemma matches.

Cleaning up a Latin Library text

tutorial

In a recent post, I made the false claim that there are 16,667,542 words in the Latin Library. This is false because the the PlaintextCorpusReader uses the CLTK Latin tokenizer to split up the corpus into “words,” or more precisely, tokens, and some of these tokens are not words. This list of tokens includes a lot of punctuation and a lot of numbers, which depending on our research question may not be all that useful. If I want to know the most frequent word in the Latin Library, I probably do not want #1 to be the 1.37 million commas. So, in this post, we will “clean up” our data and return more useful word count information. (Note that you will need to have at least version 0.1.43 of CLTK for this tutorial.)

Let’s start with something small, like the works of Catullus that we saw in this post.

from cltk.corpus.latin import latinlibrary
files = latinlibrary.fileids()
catullus_raw = latinlibrary.raw('catullus.txt')

For today’s post, we will work directly with the raw string for preprocessing. Here is what the first 1,000 characters of the string look like (with some vertical whitespace removed) :

print(catullus_raw[:1000])

>>> Catullus
>>> 
>>> C. VALERIVS CATVLLVS 
>>> 
>>> 1 2 2b 3 4 5 6 7 8 9 10 11 12 13 14 14b 15 16 17 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 58b 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 95b 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116
>>> 
>>> I. ad Cornelium 
>>> Cui dono lepidum novum libellum 
>>> arida modo pumice expolitum? 
>>> Corneli, tibi: namque tu solebas 
>>> meas esse aliquid putare nugas. 
>>> Iam tum, cum ausus es unus Italorum 
>>> omne aevum tribus explicare cartis... 
>>> Doctis, Iuppiter, et laboriosis! 
>>> Quare habe tibi quidquid hoc libelli— 
>>> qualecumque, quod, o patrona virgo, 
>>> plus uno maneat perenne saeclo!
>>> 
>>> II. fletus passeris Lesbiae 
>>> Passer, deliciae meae puellae, 
>>> quicum ludere, quem in sinu tenere, 
>>> cui primum digitum dare appetenti 
>>> et acris solet incitare

Our goal will be the preprocess this text so that we can best answer the question: “What are the 25 most frequent words in Catullus’ poetry?” Four steps we can take to improve the results are: 1. make the whole text lowercase, 2. remove punctuation, 3. remove numbers, and 4. remove English words that appear in our plaintext file. We will do the first three operations on the text string, then tokenize and do the fourth on the list of tokens.

catullus_edit = catullus_raw # Make a copy of the list

# 1. Make the whole text lowercase
# Use 'lower' string method

catullus_edit = catullus_edit.lower()

# 2. Remove punctuation
# Use 'translate'

from string import punctuation

translator = str.maketrans({key: " " for key in punctuation})
catullus_edit = catullus_edit.translate(translator)

# 3. Remove numbers
# Again, use 'translate'

translator = str.maketrans({key: " " for key in '0123456789'})
catullus_edit = catullus_edit.translate(translator)

# 4. Normalize u/v
# Use CLTK 'JVReplacer'

from cltk.stem.latin.j_v import JVReplacer
replacer = JVReplacer()

catullus_edit = replacer.replace(catullus_edit)

# 5. Remove English words that appear in our plaintext file
# Use 'replace'

remove_list = ['the', 'latin', 'library', 'classics', 'page']
remove_dict = {key: ' ' for key in remove_list}

for k, v in remove_dict.items():
    catullus_edit = catullus_edit.replace(k,v)

 

Here is what the preprocessed, i.e. ‘cleaned-up’, Catullus looks like:

print(catullus_edit[469:599])

>>> cui dono lepidum nouum libellum 
>>> arida modo pumice expolitum 
>>> corneli tibi namque tu solebas 
>>> meas esse aliquid putare nugas

We are now in a much better position to answer the question: “What are the 25 most frequent words in Catullus’ poetry?” As we did in a previous post, let’s use Counter to get the top words.

# Find the most commonly occurring 'words' (i.e., tokens) in Catullus:

# Tokenize catullus_edit
from cltk.tokenize.word import WordTokenizer
word_tokenizer = WordTokenizer('latin')

catullus_words = word_tokenizer.tokenize(catullus_edit)
# Count the most common words in the list catullus_words

from collections import Counter
catullus_wordlist = list(catullus_words)

c = Counter(catullus_wordlist)
print(c.most_common(25))

>>> [('et', 194), ('est', 188), ('-que', 175), ('in', 160), ('ad', 142), ('non', 141), ('te', 107), ('cum', 103), ('ut', 101), ('nec', 92), ('quod', 85), ('sed', 80), ('si', 76), ('quae', 74), ('tibi', 70), ('me', 69), ('mihi', 68), ('o', 63), ('qui', 60), ('aut', 58), ('quam', 53), ('atque', 51), ('nam', 49), ('esse', 48), ('hymenaee', 48)]

Etest, -queinadnon—we see a lot of the words we expect to see based on the entire corpus. But we also see some more interest results, such as the high proportion of second-person pronouns te (#7) and tibi (#15). I think it is also fair to say that Catullus’ poems are going to be Latin literature’s only text that returns hymenaee in the top 25.

In the next post, we will continue to think about preprocessing and how to extract the most meaningful information from our texts by looking at stop words.

Wrapping up Google Summer of Code

article

GSoC-logo-vertical-200Today marks the final day of Google Summer of Code. I have submitted the code for the Latin/Greek Backoff Lemmatizer and the beta version should work its way into the Classical Language Toolkit soon enough. Calling it a lemmatizer is perhaps a little misleading—it is in fact a series of lemmatizers that can be run consecutively, with each pass designed to suggest lemmas that earlier passes missed. The lemmatizers fall into three main categories: 1. lemmas determined from context based on tagged training data, 2. lemmas determined by rules, in this case mostly regex matching on word endings, and 3. lemmas determined by dictionary lookup, that is using a similar process to the one that already exists in the CLTK. By putting these three types of lemmatizers together,  I was consistently able to return > 90% accuracy on the development test sets. There will be several blog posts in the near future to document the features of each type of lemmatizer and report more thoroughly the test results. The main purpose of today’s post is simply to share the report I wrote to summarize my summer research project.

But before sharing the report, I wanted to comment briefly on what I see as the most exciting part of this lemmatizer project. I was happy to see accuracies consistently over 90% as I tested various iterations of the lemmatizer in recent weeks. That said, it is clear to me that the path to even higher accuracy and better performance is now wide open. By organizing the lemmatizer as a series of sub-lemmatizers that can be run in a backoff sequence, tweaks can be made to any part of the chain as well as in the order of the chain itself to produce higher quality results. With a lemmatizer based on dictionary lookups, there are not many options for optimization: find and fix key/value errors or make the dictionary larger. The problem with the first option is that it is finite—errors exist in the model but not enough to have that much of an effect on accuracy. Even more of a concern, the second option is infinite—as new texts are worked on (and hopefully, as new discoveries are made!) there will always be another token missed by the dictionary. Accordingly, a lemmatizer based on training data and rules—or better yet one based on training data, rules and lookups combined in a systematic and modular fashion like this  GSoC “Backoff Lemmatizer” project—is the preferred way forward.

Now the report. I wrote this over the weekend as a Gist to summarize my summer work for GSoC. The blog format makes it a bit easier to read, but you can find the original here.

Google Summer of Code 2016 Final Report

Here is a summary of the work I completed for the 2016 Google Summer of Code project “CLTK Latin/Greek Backoff Lemmatizer” for the Classical Language Toolkit (cltk.org). The code can be found at https://github.com/diyclassics/cltk/tree/lemmatize/cltk/lemmatize.

  • Wrote custom lemmatizers for Latin and Greek as subclasses of NLTK’s tag module, including:
    • Default lemmatization, i.e. same lemma returned for every token
    • Identity lemmatization, i.e. original token returned as lemma
    • Model lemmatization, i.e. lemma returned based on dictionary lookup
    • Context lemmatization, i.e. lemma returned based on proximal token/lemma tuples in training data
    • Context/POS lemmatization, i.e. same as above, but proximal tuples are inspected for POS information
    • Regex lemmatization, i.e. lemma returned through rules-based inspection of token endings
    • Principal parts lemmatization, i.e. same as above, but matched regexes are then subjected to dictionary lookup to determine lemma
  • Organized the custom lemmatizers into a backoff chain, increasing accuracy (compared to dictionary lookup alone by as much as 28.9%). Final accuracy tests on test corpus showed average of 90.82%.
    • An example backoff chain is included in the backoff.py file under the class LazyLatinLemmatizer.
  • Constructed models for language-specific lookup tasks, including:
    • Dictionaries of high-frequency, unambiguous lemmas
    • Regex patterns for high-accuracy lemma prediction
    • Constructed models to be used as training data for context-based lemmatization
  • Wrote tests for basic subclasses. Code for tests can be found here.
  • Tangential work for CLTK inspired by daily work on lemmatizer
    • Continued improvements to the CLTK Latin tokenizer. Lemmatization is performed on tokens, and it is clear that accuracy is affected by the quality of the tokens text pass as parameters to the lemmatizer.
    • Introduction of PlaintextCorpusReader-based corpus of Latin (using the Latin Library corpus) to encourage easier adoption of the CLTK. Initial blog posts on this feature are part of an ongoing series which will work through a Latin NLP task workflow and will soon treat lemmatization. These posts will document in detail features developed during this summer project.

Next steps

  • Test various combinations of backoff chains like the one used in LazyLatinLemmatizer to determine which returns data with the highest accuracy.
    • The most significant increases in accuracy appear to come from the ContextLemmatizer, which is based on training data. Two comments here:
    • Training data for the GSoC summer project was derived from Ancient Greek Dependency Treebank (v. 2.1). The Latin data consists of around 5,000 sentences. Experiments throughout the summer (and research by others) suggests that more training data will lead to improved results. This data will be “expensive” to produce, but I am sure it will lead to higher accuracy. There are other large, tagged sets available and testing will continue with those in upcoming months. The AGDT data also has some inconsistancies, e.g. various lemma tagging for punctuation. I would like to work with the Perseus team to bring this data increasing closer to being a “gold standard” dataset for applications such as this.
    • The NLTK ContextTagger uses look-behind ngrams to create context. The nature of Latin/Greek as a “free” word-order language suggests that it may be worthwhile to think about and write code for generating different contexts. Skipgram context is one idea that I will pursue in upcoming months.
    • More model/pattern information will only improve accuracy, i.e. more ‘endings’ patterns for the RegexLemmatizer, a more complete principal parts list for the PPLematizer. The original dictionary model—currently included at the end of the LazyLatinLemmatizer—could also be revised/augmented.
  • Continued testing of the lemmatizer with smaller, localized selections will help to isolate edge cases and exceptions. The RomanNumeralLemmatizer, e.g., was written to handle a type of token that as an edge case was lowering accuracy.
  • The combination context/POS lemmatizer is very basic at the moment, but has enormous potential for increasing the accuracy of a notoriously difficult lemmatization problem, i.e. ambiguous forms. The current version (inc. the corresponding training data) is only set to resolve one ambiguous case, namely ‘cum1’ (prep.) versus ‘cum2’ (conj.). Two comments:
    • More testing is needed to determine the accuracy (as well as the precision and recall) of this lemmatizer in distinguishing between the two forms of ‘cum1/2’. The current version only uses bigram POS data, but (see above) different contexts may yield better results as well.
    • More ambiguous cases should be introduced to the training data and tested like ‘cum1/2’. The use of Morpheus numbers in the AGDT data should assist with this.

This was an incredible project to work on following several years of philological/literary critical graduate work and as I finished up my PhD in classics at Fordham University. I improved my skills and/or learned a great deal about, but not limited to, object-oriented programming, unit testing, version control, and working with important open-source development architecture such as TravisCI, ZenHub, Codecov, etc.

Acknowledgments

I want to thank the following people: my mentors Kyle P. Johnson and James Tauber who have set an excellent example of what the future of philology will look like: open source/access and community-developed, while rooted in the highest standards of both software development and traditional scholarship; the rest of the CLTK development community; my team at the Institute of the Study of the Ancient World Library for supporting this work during my first months there; Matthew McGowan, my dissertation advisor, for supporting both my traditional and digital work throughout my time at Fordham; the Tufts/Perseus/Leipzig DH/Classics team—the roots of this project come from working with them at various workshops in recent years and they first made the case to me about what could be accomplished through humanties computing; Neil Coffee and the DCA; the NLTK development team; Google for supporting an open-source, digital humanities coding project with Summer of Code; and of course, the #DigiClass world of Twitter for proving to me that there is an enthusiastic audience out there who want to ‘break’ classical texts, study them, and put them back together in various ways to learn more about them—better lemmatization is a desideratum and my motivation comes from wanting to help the community fill this need.—PJB

Working with The Latin Library Corpus in CLTK, pt. 3

code, tutorial

In the previous two posts, I explained how to load either the whole Latin Library or individual files from the corpus. In today’s post, I’ll split the difference and show how to build a custom text from PlaintextCorpusReader output, in this case how to access Virgil’s Aeneid using this method. Unlike Catullus, whose omnia opera can be found in a single text file (catullus.txt) in the Latin Library, each book of the Aeneid has been placed in its own text file. Let’s look at how we can work with multiple files at once using PlaintextCorpusReader .

[This post assumes that you have already imported the Latin Library corpus as described in the earlier post and as always that you are running the latest version of CLTK on Python3. This tutorial was tested on v. 0.1.42.]

We can access the corpus and build a list of available files with the following commands:

from cltk.corpus.latin import latinlibrary
files = latinlibrary.fileids()

We can then use a list comprehension to figure out which files we need:

print([file for file in files if 'vergil' in file])
>>> ['vergil/aen1.txt', 'vergil/aen10.txt', 'vergil/aen11.txt', 'vergil/aen12.txt', 'vergil/aen2.txt', 'vergil/aen3.txt', 'vergil/aen4.txt', 'vergil/aen5.txt', 'vergil/aen6.txt', 'vergil/aen7.txt', 'vergil/aen8.txt', 'vergil/aen9.txt', 'vergil/ec1.txt', 'vergil/ec10.txt', 'vergil/ec2.txt', 'vergil/ec3.txt', 'vergil/ec4.txt', 'vergil/ec5.txt', 'vergil/ec6.txt', 'vergil/ec7.txt', 'vergil/ec8.txt', 'vergil/ec9.txt', 'vergil/geo1.txt', 'vergil/geo2.txt', 'vergil/geo3.txt', 'vergil/geo4.txt']

The file names for the Aeneid texts all follow the same pattern and we can use this to build a list of the twelve files we want for our subcorpus.

aeneid_files = [file for file in files if 'vergil/aen' in file]

print(aeneid_files)
>>> ['vergil/aen1.txt', 'vergil/aen10.txt', 'vergil/aen11.txt', 'vergil/aen12.txt', 'vergil/aen2.txt', 'vergil/aen3.txt', 'vergil/aen4.txt', 'vergil/aen5.txt', 'vergil/aen6.txt', 'vergil/aen7.txt', 'vergil/aen8.txt', 'vergil/aen9.txt']

Now that we have a list of files, we can loop through them and build our collection using passing a list to our raw, sents, and words methods instead of a string:

aeneid_raw = latinlibrary.raw(aeneid_files)
aeneid_sents = latinlibrary.sents(aeneid_files)
aeneid_words = latinlibrary.words(aeneid_files)

At this point, we have our raw materials and are free to explore. So, like we did with Lesbia in Catullus, we can do the same for, say, Aeneas in the Aeneid:

import re
aeneas = re.findall(r'\bAenea[e|n|s]?\b', aeneid_raw, re.IGNORECASE)

# i.e. Return a list of matches of single words made up of 
# the letters 'Aenea' followed by the letters e, m, n, s, or nothing, and ignoring case.

print(len(aeneas))
>>> 236

# Note that this regex misses 'Aeneaeque' at Aen. 11.289—it is 
# important to define our regexes carefully to make sure they return
# what we expect them to return!
#
# A fix... 

aeneas = re.findall(r'\bAenea[e|n|s]?(que)?\b', aeneid_raw, re.IGNORECASE)
print(len(aeneas))
>>> 237

Aeneas appears in the Aeneid 237 times. (This matches the result found, for example, in Wetmore’s concordance.)

We are now equipped to work with the entire Latin Library corpus as well as smaller sections that we define for ourselves. There is still work to do, however, before we can ask serious research questions of this material. In a series of upcoming posts, we’ll look at a number of important preprocessing tasks that can be used to transform our unexamined text into useful data.