Making a Keyword-in-Context index with CLTK

code, tutorial

The “keyword-in-context” (KWIC) index was an innovation of early information retrieval, the basic concepts of which were developed in the late 1950s by H.P. Luhn. The idea is to produce a list of all occurrences of a word, aligned so that the word is printed as a column in the center of the text with the corresponding context printed to the immediate left and right. This allows a user to scan a large number of uses in a given text quickly. For example, David Packard’s 1968 A Concordance to Livy uses an alphabetical KWIC format. Here are the first entries for the preposition e in Packard’s concordance:

[Image: the first entries for e in Packard’s A Concordance to Livy]

Using the Classical Language Toolkit and the Natural Language Toolkit’s Text module, we can easily create KWICs for texts in the Latin Library.

[This post assumes that you have already imported the Latin Library corpus as described in an earlier post and as always that you are running the latest version of CLTK on Python3.6. This tutorial was tested on v. 0.1.56.]

First, we can import a text from the Latin Library—here, Cicero’s De amicitia—as a list of words:

In [1]: from cltk.corpus.latin import latinlibrary
In [2]: amicitia_words = latinlibrary.words('cicero/amic.txt')
In [3]: print(amicitia_words[117:188])
Out [3]: ['Q.', 'Mucius', 'augur', 'multa', 'narrare', 'de', 'C.', 'Laelio', 'socero', 'suo', 'memoriter', 'et', 'iucunde', 'solebat', 'nec', 'dubitare', 'illum', 'in', 'omni', 'sermone', 'appellare', 'sapientem', ';', 'ego', 'autem', 'a', 'patre', 'ita', 'eram', 'deductus', 'ad', 'Scaevolam', 'sumpta', 'virili', 'toga', ',', 'ut', ',', 'quoad', 'possem', 'et', 'liceret', ',', 'a', 'senis', 'latere', 'numquam', 'discederem', ';', 'itaque', 'multa', 'ab', 'eo', 'prudenter', 'disputata', ',', 'multa', 'etiam', 'breviter', 'et', 'commode', 'dicta', 'memoriae', 'mandabam', 'fieri', '-que', 'studebam', 'eius', 'prudentia', 'doctior', '.']

We can then convert this list of words to an NLTK Text:

In [4]: import nltk
In [5]: amicitia_text = nltk.Text(amicitia_words)
In [6]: print(type(amicitia_text))
Out [6]: <class 'nltk.text.Text'>

Now that we have an NLTK text, there are several methods available to us, including “concordance,” which generates a KWIC for us based on keywords that we provide. Here, for example, is the NLTK concordance for ‘amicus’:

In [7]: amicitia_text.concordance('amicus')
Out [7]: Displaying 5 of 5 matches:
tentiam . Quonam enim modo quisquam amicus esse poterit ei , cui se putabit in
 optare , ut quam saepissime peccet amicus , quo plures det sibi tamquam ansas
escendant . Quamquam Ennius recte : Amicus certus in re incerta cernitur , tam
m in amicitiam transferetur , verus amicus numquam reperietur ; est enim is qu
itas . [ 95 ] Secerni autem blandus amicus a vero et internosci tam potest adh

The KWICs generated by NLTK Text are case insensitive (see amicus and Amicus above) and sorted sequentially by location in the text. There’s not much customization available for the method, so this is pretty much what it does. [You can set parameters for context width and number of lines presented; e.g. amicitia_text.concordance('amicus', width=50, lines=3)] Admittedly, it is pretty basic—it does not even return an identification or location code to help the user move easily to the wider context and the only way we know that the fifth match is in Chapter 95 is because the chapter number happens to be included in the context. At the same time, it is another step towards combining existing resources and tools (here, NLTK Text and a CLTK corpus) to explore Latin literature from different angles.
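To make the point about location codes concrete, here is a minimal sketch of a KWIC function that returns the token index of each match along with its context. It is plain Python rather than anything from NLTK or CLTK; the function name kwic, its window parameter, and the short hand-made token list are all assumptions for illustration.

```python
def kwic(tokens, keyword, window=5):
    """Return (index, left context, match, right context) for each
    case-insensitive occurrence of keyword in a list of tokens."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            hits.append((i, left, token, right))
    return hits

# A short hand-made token list standing in for amicitia_words
sample = ['Amicus', 'certus', 'in', 're', 'incerta', 'cernitur', ',',
          'verus', 'amicus', 'numquam', 'reperietur']

for index, left, match, right in kwic(sample, 'amicus', window=3):
    # Right-align the left context so the keyword forms a column
    print(f'{index:>6}  {left:>20} | {match} | {right}')
```

Because each hit carries its token index, a caller could slice the original list around that index to recover a wider context.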

In a future post, I will build a KWIC method from scratch that offers more flexibility, especially with respect to context scope and location identification.


Nuntii Latini: 2016 Year in Review


Earlier this week, Radio Bremen announced that it would be discontinuing its Nuntii Latini Septimanales. As a weekly listener, I was disappointed by the news—luckily, the monthly broadcasts will continue. Where else can you read news stories about heros artis musicae mortuus, i.e. David Bowie, or Trump victor improvisus? Coincidentally, I learned about the fate of Septimanales while preparing a quick study of word usage in these weekly news broadcasts. So, as a tribute to the work of the Nuntii writers and as a follow up to the Latin Library word-frequency post from earlier this week, I present “Nuntii Latini: 2016 Year in Review”.

[A Jupyter Notebook with the code and complete lists of tokens and lemmas for this post is available here.]

A quick note about how I went about this work. To get the data, I collected a list of web pages from the “Archivum Septimanale” page and used the Python Requests package to get the html contents of each of the weekly posts. I then used Beautiful Soup to extract only the content of the three weekly stories that Radio Bremen publishes every week. Here is a sample of what I scraped from each page:

  'Impetus terroristicus Berolini factus',
  'Anis Amri, qui impetum terroristicum Berolini fecisse pro certo habetur, '
  'a custode publico prope urbem Mediolanum in fuga necatus est. In Tunisia, '
  'qua e civitate ille islamista ortus est, tres viri comprehensi sunt, in his '
  'nepos auctoris facinoris. Quos huic facinori implicatos esse suspicio est. '
  'Impetu media in urbe Berolino facto duodecim homines interfecti, '
  'quinquaginta tres graviter vulnerati erant.'],
  'Plures Turci asylum petunt',
  'Numerus asylum petentium, qui e Turcia orti sunt, anno bis millesimo sexto '
  'decimo evidenter auctus est, ut a moderatoribus Germaniae nuntiatur. '
  'Circiter quattuor partes eorum sunt Cordueni. Post seditionem ad irritum '
  'redactam ii, qui Turciam regunt, magis magisque regimini adversantes '
  'opprimunt, imprimis Corduenos, qui in re publica versantur.'],
  'Septimanales finiuntur',
  'A. d. XI Kal. Febr. anni bis millesimi decimi redactores nuntiorum '
  'Latinorum Radiophoniae Bremensis nuntios septimanales lingua Latina '
  'emittere coeperunt. Qui post septem fere annos hoc nuntio finiuntur. Nuntii '
  'autem singulorum mensium etiam in futurum emittentur ut solent. Cuncti '
  'nuntii septimanales in archivo repositi sunt ita, ut legi et audiri '

The stories were preprocessed following more or less the same process that I’ve used in earlier posts. One exception was that I needed to tweak the CLTK Latin tokenizer. This tokenizer currently checks tokens against a list of high-frequency forms ending in ‘-ne’ and ‘-n’ to best predict when the enclitic -ne should be assigned its own token. Nuntii Latini unsurprisingly contains a number of words not on this list, mostly proper names ending in ‘-n’, such as Clinton, Putin, Erdoğan, John and Bremen among others.
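The effect of such an exception list can be illustrated with a toy splitter. This is only a sketch of the general idea and not the CLTK tokenizer’s actual implementation; the function split_enclitic_ne and both exception sets are hypothetical.

```python
def split_enclitic_ne(token, exceptions):
    """Toy rule: split a final -ne or -n off as the enclitic '-ne'
    unless the (lowercased) token is a known exception."""
    lowered = token.lower()
    if lowered in exceptions:
        return [token]
    if lowered.endswith('ne') and len(lowered) > 2:
        return [token[:-2], '-ne']
    if lowered.endswith('n') and len(lowered) > 1:
        return [token[:-1], '-ne']
    return [token]

# Hypothetical exception lists: common forms first, then names from the news
base_exceptions = {'non', 'in', 'tamen', 'bene', 'sine'}
news_exceptions = base_exceptions | {'clinton', 'putin', 'erdoğan', 'john', 'bremen'}

print(split_enclitic_ne('Putin', base_exceptions))    # name is wrongly split
print(split_enclitic_ne('Putin', news_exceptions))    # name is kept whole
print(split_enclitic_ne('videsne', news_exceptions))  # true enclitic is split
```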

Here are some basic stats about the Nuntii Latini 2016:

Number of weekly nuntii: 46 (There was a break over the summer.)
Number of stories: 138
Number of tokens: 6546
Number of unique tokens: 3021
Lexical diversity: 46.15% (i.e. unique tokens / tokens)
Number of unique lemmas: 2033

Here are the top tokens:

Top 10 tokens in Nuntii Latini 2016:

       TOKEN       COUNT       TYPE-TOK %  RUNNING %   
    1. in          206         3.15%       3.15%       
    2. est         135         2.06%       5.21%       
    3. et          106         1.62%       6.83%       
    4. qui         70          1.07%       7.9%        
    5. ut          56          0.86%       8.75%       
    6. a           54          0.82%       9.58%       
    7. sunt        50          0.76%       10.34%      
    8. esse        42          0.64%       10.98%      
    9. quod        41          0.63%       11.61%      
   10. ad          40          0.61%       12.22%      
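
A table like this one can be produced from any token list with Counter. The following is a minimal sketch, not the notebook code; frequency_table and the tiny sample token list are assumptions for illustration. The last line checks the lexical diversity figure against the counts reported above.

```python
from collections import Counter

def frequency_table(tokens, n=10):
    """Return rows of (rank, token, count, % of all tokens, running %)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    rows, running = [], 0.0
    for rank, (token, count) in enumerate(counts.most_common(n), start=1):
        share = count / total * 100
        running += share
        rows.append((rank, token, count, round(share, 2), round(running, 2)))
    return rows

# A tiny hand-made sample, not the Nuntii data
sample_tokens = 'in urbe est et in fuga est in'.split()

for rank, token, count, share, running in frequency_table(sample_tokens, n=3):
    print(f'{rank:>3}. {token:<10} {count:<6} {share:<8} {running}')

# Lexical diversity is unique tokens divided by all tokens
print(round(3021 / 6546 * 100, 2))
```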

How does this compare with the top tokens from the Latin Library that I posted earlier in the week? Usual suspects overall. It is curious that the Nuntii uses -que relatively infrequently and even et less than we would expect compared to a larger sample like the Latin Library. There seems to be a slight preference for a (#6) over ab (#27). [There is a similar pattern with e (#21) vs. ex (#25).] And three forms of the verb sum crack the Top 10, an interesting feature of the Nuntii Latini style.

The top lemmas are more interesting:

Top 10 lemmas in Nuntii Latini 2016:

       LEMMA       COUNT       TYPE-LEM %  RUNNING %   
    1. sum         323         4.93%       4.93%       
    2. qui         208         3.18%       8.11%       
    3. in          206         3.15%       11.26%      
    4. et          106         1.62%       12.88%      
    5. annus       91          1.39%       14.27%      
    6. ab          74          1.13%       15.4%       
    7. hic         64          0.98%       16.38%      
    8. ut          56          0.86%       17.23%      
    9. ille        51          0.78%       18.01%      
   10. homo        49          0.75%       18.76%

Based on the top tokens, it is no surprise to see sum take the top spot. At the same time, we should note that this is a good indicator of Nuntii Latini style. Of greater interest though, unlike the Latin Library lemma list, we see content words appearing with greater frequency. Annus is easily explained by the regular occurrence of dates in the news stories, especially formulas for the current year such as anno bis millesimo sexto decimo. Homo on the other hand tells us more about the content and style of the Nuntii. Simply put, the news stories concern the people of the world and in the abbreviated style of the Nuntii, homo and often homines is a useful and general way of referring to them, e.g. Franciscus papa…profugos ibi permanentes et homines ibi viventes salutavit from April 22.

Since I had the Top 10,000 Latin Library tokens at the ready, I thought it would be interesting to “subtract” these tokens from the Nuntii list to see what remains. This would give a (very) rough indication of which words represent the 2016 news cycle more than Latin usage in general. So, here are the top 25 tokens from the Nuntii Latini that do not appear in the Latin Library list:

Top 25 tokens in Nuntii Latini 2016 (not in the Latin Library 10000):

       TOKEN               COUNT       
    1. praesidens          19          
    2. turciae             17          
    3. ministrorum         14          
    4. americae            13          
    5. millesimo           13          
    6. moderatores         12          
    7. unitarum            12          
    8. electionibus        10          
    9. factio              9           
   10. merkel              8           
   11. factionis           8           
   12. imprimis            8           
   13. habitis             8           
   14. europaeae           8           
   15. millesimi           8           
   16. turcia              7           
   17. britanniae          7           
   18. cancellaria         7           
   19. angela              7           
   20. declarauit          7           
   21. recep               7           
   22. democrata           7           
   23. profugis            7           
   24. tayyip              7           
   25. suffragiorum        6
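
The “subtraction” itself is a simple filter. Here is a minimal sketch of one way to do it, assuming both lists are plain Python lists of strings; the function name and the two small sample lists are hypothetical and stand in for the Nuntii tokens and the Latin Library top 10,000.

```python
from collections import Counter

def tokens_not_in_reference(tokens, reference_vocab, n=25):
    """Count tokens (lowercased) and keep only those absent from a
    reference vocabulary, returning the n most frequent."""
    counts = Counter(token.lower() for token in tokens)
    reference = {token.lower() for token in reference_vocab}
    remaining = Counter({t: c for t, c in counts.items() if t not in reference})
    return remaining.most_common(n)

# Hand-made samples standing in for the real lists
nuntii_sample = ['praesidens', 'in', 'turciae', 'est', 'praesidens', 'et', 'merkel']
reference_sample = ['in', 'est', 'et', 'qui']

print(tokens_not_in_reference(nuntii_sample, reference_sample, n=3))
```

Using a set for the reference vocabulary keeps each membership check fast even with 10,000 entries.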

As I said above, this is a rough, inexact way of weighting the vocabulary. At the same time, it does give a good sense of the year in (Latin) world news. We see important regions in world politics (Europe, Turkey, America, Britain), major players (Angela Merkel, Recep Tayyip [Erdoğan]), and their titles (praesidens, minister, moderator). There are indicators of top news stories like the elections (electio, factio, suffragium, democrata) in the U.S. and elsewhere as well as the refugee crisis (profugus). Now that I have this dataset, I’d like to use it to look for patterns in the texts more systematically, e.g. compute TF-IDF scores, topic model the stories, extract named entities, etc. Look for these posts in upcoming weeks.

Backoff Latin Lemmatizer, pt. 1


I spent this summer working on a new approach to Latin lemmatization. Following and building on the logic of the NLTK Sequential Backoff Tagger—a POS-tagger that tries to maximize the accuracy of part-of-speech tagging tasks by combining multiple taggers and making several passes over the input—the Backoff Lemmatizer takes a token and passes this to any one of a number of different lemmatizers. This lemmatizer either returns a match or passes the token, or backs off, to another lemmatizer. When no more lemmatizers are available, it returns None. This setup allows users to customize the order of the lemmatizer sequence to best fit their processing task.

In this series of blog posts, I will explain how to use the different lemmatizers available in the Backoff Latin Lemmatizer. Here I will introduce the two most basic lemmatizers: DefaultLemmatizer and IdentityLemmatizer. Both are simple and will produce results with poor accuracy. (In fact, the DefaultLemmatizer’s accuracy will pretty much always be 0%!) Yet both can be useful as the final backoff lemmatizer in the sequence.
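
The backoff logic itself is easy to picture in plain Python. The sketch below is a toy illustration of the idea and not the CLTK implementation: each “lemmatizer” is just a function that returns a lemma or None, and the names dictionary_lemmatizer, identity_lemmatizer, backoff_lemmatize and toy_model are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Lemmatizer = Callable[[str], Optional[str]]

def dictionary_lemmatizer(model: Dict[str, str]) -> Lemmatizer:
    """Build a lemmatizer that looks tokens up in a dict, returning None on a miss."""
    return lambda token: model.get(token.lower())

def identity_lemmatizer(token: str) -> Optional[str]:
    """Always 'match' by returning the token itself."""
    return token

def backoff_lemmatize(tokens: List[str],
                      chain: List[Lemmatizer]) -> List[Tuple[str, Optional[str]]]:
    """Try each lemmatizer in order; back off to the next one on None."""
    results = []
    for token in tokens:
        lemma = None
        for lemmatizer in chain:
            lemma = lemmatizer(token)
            if lemma is not None:
                break
        results.append((token, lemma))
    return results

toy_model = {'abutere': 'abutor', 'nostra': 'noster'}  # hypothetical lookup table
tokens = ['tandem', 'abutere', 'patientia', 'nostra']

print(backoff_lemmatize(tokens, [dictionary_lemmatizer(toy_model)]))
print(backoff_lemmatize(tokens, [dictionary_lemmatizer(toy_model), identity_lemmatizer]))
```

With only the dictionary in the chain, unmatched tokens come back as None; adding the identity function as the last link fills those gaps with the token itself.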

The DefaultLemmatizer returns the same “lemma” for all tokens. You can specify what you want the lemmatizer to return; if you leave the parameter blank, it returns None. Note that all of the lemmatizers take as their input a list of tokens.

> from cltk.lemmatize.latin.backoff import DefaultLemmatizer
> from cltk.tokenize.word import WordTokenizer

> lemmatizer = DefaultLemmatizer()
> tokenizer = WordTokenizer('latin')

> sent = "Quo usque tandem abutere, Catilina, patientia nostra?"

> # Tokenize the sentence
> tokens = tokenizer.tokenize(sent)

> lemmatizer.lemmatize(tokens)
[('Quo', None), ('usque', None), ('tandem', None), ('abutere', None), (',', None), ('Catilina', None), (',', None), ('patientia', None), ('nostra', None), ('?', None)]

As mentioned above, you can specify your own “lemma” instead of None.

> lemmatizer = DefaultLemmatizer('UNK')
> lemmatizer.lemmatize(tokens)
[('Quo', 'UNK'), ('usque', 'UNK'), ('tandem', 'UNK'), ('abutere', 'UNK'), (',', 'UNK'), ('Catilina', 'UNK'), (',', 'UNK'), ('patientia', 'UNK'), ('nostra', 'UNK'), ('?', 'UNK')]

This is all somewhat unimpressive. But the DefaultLemmatizer is in fact quite useful. When placed as the last lemmatizer in a backoff chain, it allows you to identify easily (and with your own designation) which tokens did not return a match in any of the preceding lemmatizers. Note again that the accuracy of this lemmatizer is basically always 0%.
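
For instance, once a chain ends in a default such as 'UNK', the unmatched tokens can be pulled out with a list comprehension. This is a small sketch over a hand-written list in the same (token, lemma) shape as the output above; the lemmas in it are illustrative and not real lemmatizer output.

```python
# Illustrative (token, lemma) pairs, as if produced by a chain ending in 'UNK'
lemmatized = [('Quo', 'UNK'), ('usque', 'usque'),
              ('tandem', 'tandem'), ('abutere', 'UNK')]

# Tokens that no lemmatizer in the chain could match
unmatched = [token for token, lemma in lemmatized if lemma == 'UNK']

# Share of tokens that did receive a lemma
coverage = 1 - len(unmatched) / len(lemmatized)

print(unmatched)
print(coverage)
```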

The IdentityLemmatizer has a similarly straightforward logic, returning the input token as the output lemma:

> from cltk.lemmatize.latin.backoff import IdentityLemmatizer
> from cltk.tokenize.word import WordTokenizer

> lemmatizer = IdentityLemmatizer()
> tokenizer = WordTokenizer('latin')

> sent = "Quo usque tandem abutere, Catilina, patientia nostra?"

> # Tokenize the sentence
> tokens = tokenizer.tokenize(sent)

> lemmatizer.lemmatize(tokens)
[('Quo', 'Quo'), ('usque', 'usque'), ('tandem', 'tandem'), ('abutere', 'abutere'), (',', ','), ('Catilina', 'Catilina'), (',', ','), ('patientia', 'patientia'), ('nostra', 'nostra'), ('?', '?')]

Like the DefaultLemmatizer, the IdentityLemmatizer is useful as the final lemmatizer in a backoff chain. The difference here is that it is a default meant to boost accuracy: there are likely to be some tokens in any given input that find no match but are in fact already correct lemmas. In the example above, by simply using the IdentityLemmatizer on this list of tokens, we get an accuracy (including punctuation) of 80% (i.e. 8 out of 10). Not a bad start (and not all sentences will be so successful!) but one can imagine a case in which, say, “Catilina” is not found in training data and it would be better to take a chance on returning a match than not. Of course, if this is not the case, the DefaultLemmatizer is probably the better final backoff option.

In the next post, I will introduce a second approach to lemmatizing with the Backoff Latin Lemmatizer, namely working with a model, i.e. a dictionary of lemma matches.

Cleaning up a Latin Library text


In a recent post, I made the false claim that there are 16,667,542 words in the Latin Library. This is false because the PlaintextCorpusReader uses the CLTK Latin tokenizer to split up the corpus into “words,” or more precisely, tokens, and some of these tokens are not words. This list of tokens includes a lot of punctuation and a lot of numbers, which depending on our research question may not be all that useful. If I want to know the most frequent word in the Latin Library, I probably do not want #1 to be the 1.37 million commas. So, in this post, we will “clean up” our data and return more useful word count information. (Note that you will need to have at least version 0.1.43 of CLTK for this tutorial.)

Let’s start with something small, like the works of Catullus that we saw in this post.

from cltk.corpus.latin import latinlibrary
files = latinlibrary.fileids()
catullus_raw = latinlibrary.raw('catullus.txt')

For today’s post, we will work directly with the raw string for preprocessing. Here is what the first 1,000 characters of the string look like (with some vertical whitespace removed):


>>> Catullus
>>> 1 2 2b 3 4 5 6 7 8 9 10 11 12 13 14 14b 15 16 17 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 58b 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 95b 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116
>>> I. ad Cornelium 
>>> Cui dono lepidum novum libellum 
>>> arida modo pumice expolitum? 
>>> Corneli, tibi: namque tu solebas 
>>> meas esse aliquid putare nugas. 
>>> Iam tum, cum ausus es unus Italorum 
>>> omne aevum tribus explicare cartis... 
>>> Doctis, Iuppiter, et laboriosis! 
>>> Quare habe tibi quidquid hoc libelli— 
>>> qualecumque, quod, o patrona virgo, 
>>> plus uno maneat perenne saeclo!
>>> II. fletus passeris Lesbiae 
>>> Passer, deliciae meae puellae, 
>>> quicum ludere, quem in sinu tenere, 
>>> cui primum digitum dare appetenti 
>>> et acris solet incitare

Our goal will be to preprocess this text so that we can best answer the question: “What are the 25 most frequent words in Catullus’ poetry?” Five steps we can take to improve the results are: 1. make the whole text lowercase, 2. remove punctuation, 3. remove numbers, 4. normalize u/v, and 5. remove English words that appear in our plaintext file. We will do all five operations on the text string and then tokenize the result.

catullus_edit = catullus_raw # Work on a new name; strings are immutable, so catullus_raw is left untouched

# 1. Make the whole text lowercase
# Use 'lower' string method

catullus_edit = catullus_edit.lower()

# 2. Remove punctuation
# Use 'translate'

from string import punctuation

translator = str.maketrans({key: " " for key in punctuation})
catullus_edit = catullus_edit.translate(translator)

# 3. Remove numbers
# Again, use 'translate'

translator = str.maketrans({key: " " for key in '0123456789'})
catullus_edit = catullus_edit.translate(translator)

# 4. Normalize u/v
# Use CLTK 'JVReplacer'

from cltk.stem.latin.j_v import JVReplacer
replacer = JVReplacer()

catullus_edit = replacer.replace(catullus_edit)

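# 5. Remove English words that appear in our plaintext file
# Use a regular expression with word boundaries so that only whole words
# are removed (a plain 'replace' would also strip e.g. 'the' from 'theseus')

import re

remove_list = ['the', 'latin', 'library', 'classics', 'page']
remove_pattern = re.compile(r'\b(?:' + '|'.join(remove_list) + r')\b')

catullus_edit = remove_pattern.sub(' ', catullus_edit)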


Here is what the preprocessed, i.e. ‘cleaned-up’, Catullus looks like:


>>> cui dono lepidum nouum libellum 
>>> arida modo pumice expolitum 
>>> corneli tibi namque tu solebas 
>>> meas esse aliquid putare nugas

We are now in a much better position to answer the question: “What are the 25 most frequent words in Catullus’ poetry?” As we did in a previous post, let’s use Counter to get the top words.

# Find the most commonly occurring 'words' (i.e., tokens) in Catullus:

# Tokenize catullus_edit
from cltk.tokenize.word import WordTokenizer
word_tokenizer = WordTokenizer('latin')

catullus_words = word_tokenizer.tokenize(catullus_edit)
# Count the most common words in the list catullus_words

from collections import Counter
catullus_wordlist = list(catullus_words)

c = Counter(catullus_wordlist)
print(c.most_common(25))

>>> [('et', 194), ('est', 188), ('-que', 175), ('in', 160), ('ad', 142), ('non', 141), ('te', 107), ('cum', 103), ('ut', 101), ('nec', 92), ('quod', 85), ('sed', 80), ('si', 76), ('quae', 74), ('tibi', 70), ('me', 69), ('mihi', 68), ('o', 63), ('qui', 60), ('aut', 58), ('quam', 53), ('atque', 51), ('nam', 49), ('esse', 48), ('hymenaee', 48)]

Et, est, -que, in, ad, non: we see a lot of the words we expect to see based on the entire corpus. But we also see some more interesting results, such as the high proportion of second-person pronouns te (#7) and tibi (#15). I think it is also fair to say that Catullus’ poems are going to be Latin literature’s only text that returns hymenaee in the top 25.

In the next post, we will continue to think about preprocessing and how to extract the most meaningful information from our texts by looking at stop words.

Working with The Latin Library Corpus in CLTK, pt. 3

code, tutorial

In the previous two posts, I explained how to load either the whole Latin Library or individual files from the corpus. In today’s post, I’ll split the difference and show how to build a custom text from PlaintextCorpusReader output, in this case how to access Virgil’s Aeneid using this method. Unlike Catullus, whose omnia opera can be found in a single text file (catullus.txt) in the Latin Library, each book of the Aeneid has been placed in its own text file. Let’s look at how we can work with multiple files at once using PlaintextCorpusReader.

[This post assumes that you have already imported the Latin Library corpus as described in the earlier post and as always that you are running the latest version of CLTK on Python3. This tutorial was tested on v. 0.1.42.]

We can access the corpus and build a list of available files with the following commands:

from cltk.corpus.latin import latinlibrary
files = latinlibrary.fileids()

We can then use a list comprehension to figure out which files we need:

print([file for file in files if 'vergil' in file])
>>> ['vergil/aen1.txt', 'vergil/aen10.txt', 'vergil/aen11.txt', 'vergil/aen12.txt', 'vergil/aen2.txt', 'vergil/aen3.txt', 'vergil/aen4.txt', 'vergil/aen5.txt', 'vergil/aen6.txt', 'vergil/aen7.txt', 'vergil/aen8.txt', 'vergil/aen9.txt', 'vergil/ec1.txt', 'vergil/ec10.txt', 'vergil/ec2.txt', 'vergil/ec3.txt', 'vergil/ec4.txt', 'vergil/ec5.txt', 'vergil/ec6.txt', 'vergil/ec7.txt', 'vergil/ec8.txt', 'vergil/ec9.txt', 'vergil/geo1.txt', 'vergil/geo2.txt', 'vergil/geo3.txt', 'vergil/geo4.txt']

The file names for the Aeneid texts all follow the same pattern and we can use this to build a list of the twelve files we want for our subcorpus.

aeneid_files = [file for file in files if 'vergil/aen' in file]
print(aeneid_files)

>>> ['vergil/aen1.txt', 'vergil/aen10.txt', 'vergil/aen11.txt', 'vergil/aen12.txt', 'vergil/aen2.txt', 'vergil/aen3.txt', 'vergil/aen4.txt', 'vergil/aen5.txt', 'vergil/aen6.txt', 'vergil/aen7.txt', 'vergil/aen8.txt', 'vergil/aen9.txt']

Now that we have a list of files, we can build our collection by passing a list, instead of a string, to our raw, sents, and words methods:

aeneid_raw = latinlibrary.raw(aeneid_files)
aeneid_sents = latinlibrary.sents(aeneid_files)
aeneid_words = latinlibrary.words(aeneid_files)
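
One thing to notice is that the file list is in lexicographic order, so aen10.txt comes before aen2.txt. For counting words this makes no difference, but if the concatenated text should read from Book 1 to Book 12, the list can be reordered before it is passed along. A minimal sketch, assuming the reader concatenates files in the order given; book_number is a hypothetical helper and not part of CLTK or NLTK:

```python
import re

def book_number(fileid):
    """Extract the trailing number from a file id like 'vergil/aen10.txt'."""
    match = re.search(r'(\d+)\.txt$', fileid)
    return int(match.group(1)) if match else 0

# The file list as returned above (lexicographic order)
aeneid_files = ['vergil/aen1.txt', 'vergil/aen10.txt', 'vergil/aen11.txt',
                'vergil/aen12.txt', 'vergil/aen2.txt', 'vergil/aen3.txt',
                'vergil/aen4.txt', 'vergil/aen5.txt', 'vergil/aen6.txt',
                'vergil/aen7.txt', 'vergil/aen8.txt', 'vergil/aen9.txt']

# Sort numerically by book rather than alphabetically by file name
aeneid_files_ordered = sorted(aeneid_files, key=book_number)
print(aeneid_files_ordered)
```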

At this point, we have our raw materials and are free to explore. So, like we did with Lesbia in Catullus, we can do the same for, say, Aeneas in the Aeneid:

import re
aeneas = re.findall(r'\bAenea[ens]?\b', aeneid_raw, re.IGNORECASE)

# i.e. Return a list of matches of single words made up of 
# the letters 'Aenea' followed by the letter e, n, or s, or by nothing, ignoring case.
# (Inside a character class '|' is a literal pipe, so we write [ens], not [e|n|s].)

print(len(aeneas))
>>> 236

# Note that this regex misses 'Aeneaeque' at Aen. 11.289. It is 
# important to define our regexes carefully to make sure they return
# what we expect them to return!
# A fix, using a non-capturing group so that findall still returns whole matches:

aeneas = re.findall(r'\bAenea[ens]?(?:que)?\b', aeneid_raw, re.IGNORECASE)
print(len(aeneas))
>>> 237

Aeneas appears in the Aeneid 237 times. (This matches the result found, for example, in Wetmore’s concordance.)
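
One detail of re.findall is worth keeping in mind with patterns that allow an optional -que: if the optional part is written as a capturing group, findall returns the contents of the group rather than the whole match. The count stays the same, but the list no longer holds the matched words. A short demonstration on a hand-made sample string (not the Aeneid text):

```python
import re

sample = 'Aeneas Aenean Aeneae Aeneaeque aenea Aeneadae'

# Capturing group: findall returns what the group matched ('' or 'que')
capturing = re.findall(r'\bAenea[ens]?(que)?\b', sample, re.IGNORECASE)

# Non-capturing group: findall returns the whole match
non_capturing = re.findall(r'\bAenea[ens]?(?:que)?\b', sample, re.IGNORECASE)

print(capturing)
print(non_capturing)
print(len(capturing) == len(non_capturing))
```

Note that 'Aeneadae' is not matched by either pattern, since no word boundary follows 'Aenea' there.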

We are now equipped to work with the entire Latin Library corpus as well as smaller sections that we define for ourselves. There is still work to do, however, before we can ask serious research questions of this material. In a series of upcoming posts, we’ll look at a number of important preprocessing tasks that can be used to transform our unexamined text into useful data.


Working with The Latin Library Corpus in CLTK, pt. 2

code, tutorial

In the previous post, I explained how to load the whole Latin Library as a plaintext corpus of sentences or words or even as a “raw” string which you can then process in Python and the Classical Language Toolkit. While it can be interesting to experiment with the corpus in its entirety, it is often more useful to focus on a specific author or work, which is what we’ll do in this post using the works of Catullus.

[This post assumes that you have already imported the Latin Library corpus as described in the earlier post and as always that you are running the latest version of CLTK on Python3. This tutorial was tested on v. 0.1.42.]

We can access the corpus with the following command:

from cltk.corpus.latin import latinlibrary

The Latin Library corpus is organized as a collection of plaintext (.txt) files some of which can be found in the top-level directory (e.g. 12tables.txt) and some of which can be found in an author- or collection-specific folder (e.g. abelard/dialogus.txt).


The PlaintextCorpusReader gives us a method for returning a list of all of the files available in the corpus, fileids():

files = latinlibrary.fileids()
print(files)
>>> ['12tables.txt', '1644.txt', 'abbofloracensis.txt', 'abelard/dialogus.txt', 'abelard/epistola.txt', 'abelard/historia.txt', 'addison/barometri.txt', etc.

print(len(files))
>>> 2162

We have 2,162 different text files that we can explore via the Latin Library. Let’s say we want to work with only one—the poems of Catullus. We can inspect the list of files to find the file we want to work with. Here is one way, using a list comprehension:

files = latinlibrary.fileids()

print([file for file in files if 'catullus' in file])
>>> ['catullus.txt']

### Note that the list comprehension
### [file for file in files if 'catull' in file]
### returns the following list:
### ['catullus.txt', 'pascoli.catull.txt']
### You may need to experiment with variations
### on the names of authors and works to find
### the file you are looking for.

We can now limit our corpus by passing an argument to the raw, sents, and words methods we already know from the previous post:

catullus_raw = latinlibrary.raw('catullus.txt')
catullus_sents = latinlibrary.sents('catullus.txt')
catullus_words = latinlibrary.words('catullus.txt')

From here, we can do anything we did with the entire corpus on a smaller scale. So, for example, if we want to see what the most common ‘words’ are in Catullus, we can use the following:

# Find the most commonly occurring 'words' (i.e., tokens) in Catullus:

from collections import Counter

catullus_wordlist = list(catullus_words)

c = Counter(catullus_wordlist)
print(c.most_common(25))
>>> [(',', 1598), ('.', 686), ('et', 193), ('est', 181), ('in', 159), ('ad', 142), (':', 141), ('*', 135), ('non', 135), ('?', 134), ('ut', 99), ('nec', 92), ('te', 90), ('quod', 84), ('cum', 82), ('sed', 80), ('quae', 74), ('tibi', 70), ('si', 69), ('mihi', 68), ('me', 68), ('o', 57), ('aut', 57), ('qui', 56), ('atque', 51)]

Not particularly interesting stuff here, but we’re on our way. We now have the raw materials to do any number of word studies based on Catullus’ poetry. We could, for example, find the number of times Lesbia appears in this part of the corpus:

### There are many ways we could do this.
### Let's use regular expressions for this one...

import re
lesbia = re.findall(r'\bLesbi.+?\b', catullus_raw, re.IGNORECASE)
# i.e. Return a list of matches of single words made up of 
# the letters 'Lesbi' followed by any number of letters, ignoring case.

print(len(lesbia))
>>> 31

# Note that this includes the "titles" to the poems that are included in the Latin Library, e.g. "II. fletus passeris Lesbiae".

In the next post, we’ll figure out how to do the same thing with, say, an author like Virgil—that is, an author whose corpus consists, not of a single text file like Catullus, but rather several.

Working with the Latin Library Corpus in CLTK

code, tutorial

In an earlier post, I explained how to import the contents of The Latin Library as a plaintext corpus for you to use with the Classical Language Toolkit. In this post, I want to show you a quick and easy way to access this corpus (or parts of this corpus).

[This post assumes that you have already imported the Latin Library corpus as described in the earlier post and as always that you are running the latest version of CLTK on Python3. This tutorial was tested on v. 0.1.41. In addition, if you imported the Latin Library corpus in the past, I recommend that you delete and reimport the corpus as I have fixed the encoding of the plaintext files so that they are all UTF-8.]

With the corpus imported, you can access it with the following command:

from cltk.corpus.latin import latinlibrary

If we check the type, we see that our imported latinlibrary is an instance of the PlaintextCorpusReader of the Natural Language Toolkit:

print(type(latinlibrary))
>>> <class 'nltk.corpus.reader.plaintext.PlaintextCorpusReader'>

Now we have access to several useful PlaintextCorpusReader methods that we can use to explore the corpus. Let’s look at working with the Latin Library as raw data (i.e. a very long string), a list of sentences, and a list of words.

ll_raw = latinlibrary.raw()

print(type(ll_raw))
>>> <class 'str'>

print(len(ll_raw))
>>> 96167304

>>> Arma virumque cano, Troiae qui primus ab oris

The “raw” function returns the entire text of the corpus as a string. So with a few Python string operations, we can learn the size of the Latin Library (96,167,304 characters!) and we can do other things like print slices from the string.

PlaintextCorpusReader can also return our corpus as sentences or as words:

ll_sents = latinlibrary.sents()
ll_words = latinlibrary.words()

Both of these are returned as instances of the class ‘nltk.corpus.reader.util.ConcatenatedCorpusView’, and we can work with them either directly or indirectly. (Note that this is a very large corpus and some of the commands will take a long time to run; rest assured, I’ve marked them. In an upcoming post, I will discuss strategies both for iterating over these collections more efficiently and for avoiding having to wait for these results over and over again.)

# Get the total number of words (***slow***):
ll_wordcount = len(latinlibrary.words())
print(ll_wordcount)
>>> 16667761

# Print a slice from 'words' from the concatenated view:

# Return a complete list of words (***slow***):
ll_wordlist = list(latinlibrary.words())

# Print a slice from 'words' from the list:

# Check for list membership:
test_words = ['est', 'Caesar', 'lingua', 'language', 'Library', '101', 'CI']

for word in test_words:
    if word in ll_wordlist:
        print("'%s' is in the Latin Library" % word)
    else:
        print("'%s' is *NOT* in the Latin Library" % word)

>>> 'est' is in the Latin Library
>>> 'Caesar' is in the Latin Library
>>> 'lingua' is in the Latin Library
>>> 'language' is *NOT* in the Latin Library
>>> 'Library' is in the Latin Library
>>> '101' is in the Latin Library
>>> 'CI' is in the Latin Library

# Find the most commonly occurring words in the list:
from collections import Counter
c = Counter(ll_wordlist)
print(c.most_common(10))
>>> [(',', 1371826), ('.', 764528), ('et', 428067), ('in', 265304), ('est', 171439), (';', 167311), ('non', 156395), ('-que', 135667), (':', 131200), ('ad', 127820)]

There are 16,667,542 words in the Latin Library. Well, this is not strictly true. For one thing, the Latin word tokenizer isolates punctuation and numbers. In addition, it is worth pointing out that the plaintext Latin Library includes the English header and footer information from each page. (This explains why the word “Library” tests positive for membership.) So while we don’t really have 16+ million Latin words, what we do have is a large list of tokens from a large Latin corpus. And now that we have this large list, we can “clean it up” depending on what research questions we want to ask. So, even though it is slow to create a list from the ConcatenatedCorpusView, once we have that list, we can perform any list operation and do so much more quickly: remove punctuation, normalize case, remove stop words, etc. I will leave it to you to experiment with this kind of preprocessing on your own for now. (Although all of these steps will be covered in future posts.)
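
As a starting point for that experimentation, here is a minimal sketch of cleaning a token list with plain list operations. The function clean_tokens and the short sample list are assumptions for illustration; on the real corpus you would pass in ll_wordlist instead.

```python
from collections import Counter
from string import punctuation

def clean_tokens(tokens):
    """Lowercase tokens and drop those that are only punctuation
    or that contain a digit."""
    cleaned = []
    for token in tokens:
        token = token.lower()
        if all(char in punctuation for char in token):
            continue  # e.g. ',' or '.'
        if any(char.isdigit() for char in token):
            continue  # e.g. '101'
        cleaned.append(token)
    return cleaned

# A short hand-made sample in the shape of the tokenizer output
sample = ['Arma', 'virum', '-que', 'cano', ',', 'Troiae', '101', '.', 'et', 'Et']

cleaned = clean_tokens(sample)
print(cleaned)
print(Counter(cleaned).most_common(1))
```

Note that this leaves enclitic tokens like '-que' in place and does nothing about Roman numerals such as 'CI', which would need their own rule.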

Much of the time, we will not want to work with the entire corpus but rather with subsets of the corpus such as the plaintext files of a single author or work. Luckily, PlaintextCorpusReader allows us to load multi-file corpora by file. In the next post, we will look at loading and working with smaller selections of the Latin Library.

CLTK: Importing the Latin Library as a Corpus


Here is a quick tutorial to help users import the Latin Library as a corpus that they can use to explore the Latin language with the Classical Language Toolkit. [This tutorial assumes that you are running Python3 and the current version of the CLTK on Mac OS X (10.11). The documentation for Importing Corpora can be found here.]

Let’s begin by opening up a new session in Terminal and running Python. Type the following:

from cltk.corpus.utils.importer import CorpusImporter
corpus_importer = CorpusImporter('latin')

First, we start by importing the CLTK CorpusImporter. This is the general class used for importing any of the available CLTK corpora in any language. Next, we create an instance of the class that will specifically help us to import Latin materials. Note that CorpusImporter takes the language you want to work with as an argument, here ‘latin’.

You can get a list of the corpora for this language that are currently available by typing the following:

corpus_importer.list_corpora


At the time of writing, the following corpora are available:

['latin_text_perseus', 'latin_treebank_perseus', 'latin_text_lacus_curtius', 'latin_text_latin_library', 'phi5', 'phi7', 'latin_proper_names_cltk', 'latin_models_cltk', 'latin_pos_lemmata_cltk', 'latin_treebank_index_thomisticus', 'latin_lexica_perseus', 'latin_training_set_sentence_cltk', 'latin_word2vec_cltk', 'latin_text_antique_digiliblt', 'latin_text_corpus_grammaticorum_latinorum']

We want to import ‘latin_text_latin_library’. This corpus can be downloaded by passing the name of the corpus we want to download to the following CLTK function:

corpus_importer.import_corpus('latin_text_latin_library')


(When given a single argument, this function downloads the corpus from the CLTK GitHub repo [see here] if it is available. Note that corpora can also be loaded locally by providing the filepath to the corpus as a second argument. This is covered in the documentation.)

Assuming everything runs properly, you should now have a new folder in your user directory called cltk_data and inside that directory you should have the following path: /latin/text/latin_text_latin_library/. This is where your new local Latin Library corpus is located. If you explore this folder, you will find hundreds of text files from the Latin Library ready for you to work with. In an upcoming post, I will explain some strategies for working with this corpus in CLTK projects.