Working with The Latin Library Corpus in CLTK, pt. 2

code, tutorial

In the previous post, I explained how to load the whole The Latin Library as a plaintext corpus of sentences or words or even as a “raw” string which you can then process in Python and the  Classical Language Toolkit. While it can be interesting to experiment with the corpus in its entirety, it is often more useful to focus on a specific author or work, which is what we’ll do in this post using the works of Catullus.

[This post assumes that you have already imported the Latin Library corpus as described in the earlier post and as always that you are running the latest version of CLTK on Python3. This tutorial was tested on v. 0.1.42.]

We can access the corpus with the following command:

from cltk.corpus.latin import latinlibrary

The Latin Library corpus is organized as a collection of plaintext (.txt) files some of which can be found in the top-level directory (e.g. 12tables.txt) and some of which can be found in an author- or collection-specific folder (e.g. abelard/dialogus.txt).

Screen Shot 2016-08-15 at 10.56.15 AM

The PlaintextCorpusReader gives us a method for returning a list of all of the files available in the corpus, fileids():

>>> ['12tables.txt', '1644.txt', 'abbofloracensis.txt', 'abelard/dialogus.txt', 'abelard/epistola.txt', 'abelard/historia.txt', 'addison/barometri.txt', etc.

>>> 2162

We have 2,162 different text files that we can explore via the Latin Library. Let’s say we want to work with only one—the poems of Catullus. We can inspect the list of files to find the file we want to work with. Here is one way, using a list comprehension:

files = latinlibrary.fileids()

print([file for file in files if 'catullus' in file])
>>> ['catullus.txt']

### Note that the list comprehension
### [file for file in files if 'catull' in file]
### returns the following list:
### ['catullus.txt', 'pascoli.catull.txt']
### You may need to experiment with variations
### on the names of authors and works to find
### the file you are looking for.

We can now limit our corpus by passing an argument to the raw, sents, and words methods we already know from the previous post:

catullus_raw = latinlibrary.raw('catullus.txt')
catullus_sents = latinlibrary.sents('catullus.txt')
catullus_words = latinlibrary.words('catullus.txt')

From here, we can do anything we did with the entire corpus on a smaller scale. So, for example, if we want to see what the most common ‘words’ are in Catullus, we can use the following:

# Find the most commonly occurring 'words' (i.e., tokens) in Catullus:

from collections import Counter

catullus_wordlist = list(catullus_words)

c = Counter(catullus_wordlist)
>>> [(',', 1598), ('.', 686), ('et', 193), ('est', 181), ('in', 159), ('ad', 142), (':', 141), ('*', 135), ('non', 135), ('?', 134), ('ut', 99), ('nec', 92), ('te', 90), ('quod', 84), ('cum', 82), ('sed', 80), ('quae', 74), ('tibi', 70), ('si', 69), ('mihi', 68), ('me', 68), ('o', 57), ('aut', 57), ('qui', 56), ('atque', 51)]

Not particularly interesting stuff here, but were on our way. We now have the raw materials to do any number of word studies based on Catullus’ poetry. We could, for example, find the number of times Lesbia appears in this part of the corpus:

### There are many ways we could do this.
### Let's use regular expressions for this one...

import re
lesbia = re.findall(r'\bLesbi.+?\b', catullus_raw, re.IGNORECASE)
# i.e. Return a list of matches of single words made up of 
# the letters 'Lesbi' followed by any number of letters, ignoring case.

>>> 31

# Note that this includes the "titles" to the poems that are included in the Latin Library, e.g. "II. fletus passeris Lesbiae".

In the next post, we’ll figure out how to do the same thing with, say, an author like Virgil—that is, an author whose corpus consists, not of a single text file like Catullus, but rather several.


Working with the Latin Library Corpus in CLTK

code, tutorial

In an earlier post, I explained how to import the contents of The Latin Library as a plaintext corpus for you to use with the Classical Language Toolkit. In this post, I want to show you a quick and easy way to access this corpus (or parts of this corpus).

[This post assumes that you have already imported the Latin Library corpus as described in the earlier post and as always that you are running the latest version of CLTK on Python3. This tutorial was tested on v. 0.1.41. In addition, if you imported the Latin Library corpus in the past, I recommend that you delete and reimport the corpus as I have fixed the encoding of the plaintext files so that they are all UTF-8.]

With the corpus imported, you can access it with the following command:

from cltk.corpus.latin import latinlibrary

If we check the type, we see that our imported latinlibrary is an instance of the PlaintextCorpusReader of the Natural Language Toolkit:

>>> <class 'nltk.corpus.reader.plaintext.PlaintextCorpusReader'>

Now we have access to several useful PlaintextCorpus Reader functions that we can use to explore the corpus. Let’s look at working with the Latin Library as raw data (i.e. a very long string), a list of sentences, and a list of words.

ll_raw = latinlibrary.raw()

>>> <class 'str'>

>>> 96167304

>>> Arma virumque cano, Troiae qui primus ab oris

The “raw” function returns the entire text of the corpus as a string. So with a few Python string operations, we can learn the size of the Latin Library (96,167,304 characters!) and we can do other things like print slices from the string.

PlaintextCorpusReader can also return our corpus as sentences or as words:

ll_sents = latinlibrary.sents()
ll_words = latinlibrary.words()

Both of these are returned as instances of the class ‘nltk.corpus.reader.util.ConcatenatedCorpusView’, and we can work with them either directly or indirectly. (Note that this is a very large corpus and some of the commands—rest assured, I’ve marked them—will take a long time to run. In an upcoming post, I will both discuss strategies for iterating over these collections more efficiently as well as for avoiding having to wait for these results over and over again.)

# Get the total number of words (***slow***):
ll_wordcount = len(latinlibrary.words())
>>> 16667761

# Print a slice from 'words' from the concatenated view:

# Return a complete list of words (***slow***):
ll_wordlist = list(latinlibrary.words())

# Print a slice from 'words' from the list:

# Check for list membership:
test_words = ['est', 'Caesar', 'lingua', 'language', 'Library', '101', 'CI']

for word in test_words:
    if word in ll_wordlist:
        print('\'%s\' is in the Latin Library' %word)
        print('\'%s\' is *NOT* in the Latin Library' %word)

>>> 'est' is in the Latin Library
>>> 'Caesar' is in the Latin Library
>>> 'lingua' is in the Latin Library
>>> 'language' is *NOT* in the Latin Library
>>> 'Library' is in the Latin Library
>>> '101' is in the Latin Library
>>> 'CI' is in the Latin Library

# Find the most commonly occurring words in the list:
from collections import Counter
c = Counter(ll_wordlist)
>>> [(',', 1371826), ('.', 764528), ('et', 428067), ('in', 265304), ('est', 171439), (';', 167311), ('non', 156395), ('-que', 135667), (':', 131200), ('ad', 127820)]

There are 16,667,542 words in the Latin Library. Well, this is not strictly true—for one thing, the Latin word tokenizer isolates punctuation and numbers. In addition, it is worth pointing out that the plaintext Latin Library include the English header and footer information from each page. (This explains why the word “Library” tests positive for membership.) So while we don’t really have 16+ million Latin words, what we do have is a large list of tokens from a large Latin corpus. And now that we have this large list, we can “clean it up” depending on what research questions we want to ask. So, even though it is slow to create a list from the Concatenated CorpusView, once we have that list, we can perform any list operation and do so much more quickly. Remove punctuation, normalize case, remove stop words, etc. I will leave it to you to experiment with this kind of preprocessing on your own for now. (Although all of these steps will be covered in future posts.)

Much of the time, we will not want to work with the entire corpus but rather with subsets of the corpus such as the plaintext files of a single author or work. Luckily, PlaintextCorpusReader allows us to load multi-file corpora by file. In the next post, we will look at loading and working with smaller selections of the Latin Library.

Current State of the CLTK Latin Lemmatizer


Lemmatization is a core task in natural language processing that allows us to return the dictionary headword—also known as the lemma—for each token in a given string. The Classical Language Toolkit includes a lemmatizer for Latin and Greek and for my Google Summer of Code project I have been rewriting these tools to improve their accuracy. In this post, I want to 1. review the current state of the lemmatizer, specifically the Latin lemmatizer, 2. test some sample sentences to see where the lemmatizer performs well and where it does not, and 3. suggest where I think improvements could be made.

[This post uses Python3 and the current version of the CLTK.]

The current version of the lemmatizer uses a model that is kept in the CLTK_DATA directory. (More specifically, the model is a Python dictionary called LEMMATA that can be found in the ‘’ file in the ‘latin_models_cltk’ corpus.) So before we can lemmatize Latin texts we need to import this model/corpus. The import commands are given below, but if you want more details on loading CLTK corpora, see this post.

from cltk.corpus.utils.importer import CorpusImporter
corpus_importer = CorpusImporter('latin')

[Note that once this corpus is imported into CLTK_DATA, you will not need to repeat these steps to use the Latin lemmatized in the future.]

To use the lemmatizer, we import it as follows:

from cltk.stem.lemma import LemmaReplacer

LemmaReplacer takes a language argument, so we can create an instance of the Latin lemmatizer with the following command:

lemmatizer = LemmaReplacer('latin')

This lemmatized checks words against the LEMMATA dictionary that you installed above. That is, it checks the dictionary to see if a word is found as a key and returns the associated value. Here is the beginning of the lemma dictionary:

    '-nam' : 'nam', 
    '-namque' : 'nam', 
    '-sed' : 'sed', 
    'Aaron' : 'Aaron', 
    'Aaroni' : 'Aaron', 
    'Abante' : 'Abas', 
    'Abanteis' : 'Abanteus', 
    'Abantem' : 'Abas', 
    'Abantes' : 'Abas', etc...

If a word is not found in the dictionary, the lemmatizer returns the original word unchanged. Since Python dictionaries do not support duplicate keys, there is no resolution for ambiguous forms with the current lemmatizer. For example, this key-value pair {‘amor’ : ‘amo’} ensures that the word “amor” is always lemmatized as a verb and not a noun, even though the nominative singular form of ‘amor’ appears much more frequently than the first-person singular passive form of ‘amor’.

Let’s try some test sentences. Here is the first sentence from Cicero’s In Catilinam 1:

sentence = 'Quo usque tandem abutere, Catilina, patientia nostra?'
sentence = sentence.lower()

Note that I have also made the sentence lowercase as the current lemmatizer can raise errors due to case handling.

Now let’s pass this to the lemmatizer:

lemmas = lemmatizer.lemmatize(sentence)

>>> ['quis1', 'usque', 'tandem', 'abutor', ',', 'catilina', ',', 'patior', 'noster', '?']

The lemmatizer does a pretty good job. Punctuation included, its accuracy is 80% when compared with the lemmas found in Perseus Treebank Data. According to this dataset, the “quis1” should resolve to “quo”. (Though an argument could be made about whether this adverb is a form of ‘quis’ or its own word deriving from ‘quis’. The argument about whether ‘quousque’ should in fact be one word is also worth mentioning. Note that the number following ‘quis’ is a feature of the Morpheus parser to disambiguate identical forms.) “Patientia” is perhaps a clearer case. Though derived from the verb “patior”, the expected behavior of the lemmatizer is to resolve this word as the self-sufficient noun ‘patientia’. This is what we find in our comparative data from Perseus.

Another example, a longer sentence from the opening of Sallust’s Bellum Catilinae:

sentence = 'Omnis homines, qui sese student praestare ceteris animalibus, summa ope niti decet, ne vitam silentio transeant veluti pecora, quae natura prona atque ventri oboedientia finxit.'
sentence = sentence.lower()

lemmas = lemmatizer.lemmatize(sentence)

>>> ['omne', 'homo', ',', 'qui1', 'sui', 'studeo', 'praesto2', 'ceter', 'animalis', ',', 'summum', 'ops1', 'nitor1', 'decet', ',', 'neo1', 'vita', 'silentium', 'transeo', 'velut', 'pecus1', ',', 'qui1', 'natura', 'pronus', 'atque', 'venter', 'oboedio', 'fingo.']

Again, pretty good results overall—82.76%. But the errors reveal the shortcomings of the lemmatizer. “Omnis” is an extremely common word in Latin and it simply appears incorrectly in the lemma model. Ditto ‘summus’. Ditto ‘ceter’, though worse because this is not even a valid Latin form. ‘Animalibus’ suffers from the kind of ambiguity noted above with ‘amor’—the noun ‘animal’ is much more common that the adjective ‘animals’. The most significant error is lemmatizing ‘ne’—one of the most common words in the language—incorrectly as the extremely infrequent (if ever appearing) present active imperative of ‘neo’.

If this all sounds critical simply for the sake of being critical, that is not my intention. I have been working on new approaches to the problem of Latin lemmatization and have learned a great deal from the current CLTK lemmatizer. The work shown above is a solid start and there is significant room for improvement. I see it as a baseline: every percentage point above 80% or 82.76% accuracy is a step in the right direction. Next week, I will publish some new blog posts with ideas for new approaches to Latin lemmatization based not on dictionary matching, but on training data, regex matching, and attention to word order and context. While dictionary matching is still the most efficient way to resolve some lemmas (e.g. unambiguous, indeclinables like “ad”), it is through a combination of multiple approaches that we will be able to increase substantially the accuracy of this important tool in the CLTK.


CLTK: Importing the Latin Library as a Corpus


Here is quick tutorial to help users import the Latin Library as a corpus that they can use to explore the Latin language with the Classical Language Toolkit. [This tutorial assumes that you are running Python3 and the current version of the CLTK on Mac OS X (10.11). The documentation for Importing Corpora can be found here.]

Let’s begin by opening up a new session in Terminal and running Python. Type the following:

from cltk.corpus.utils.importer import CorpusImporter
corpus_importer = CorpusImporter('latin')

First, we start by importing the CLTK CorpusImporter. This is the general class used for importing any of the available CLTK corpora in any language. Next, we create an instance of the class that will specifically help us to import Latin materials. Note that CorpusImporter takes the language you want to work with as an argument, here ‘latin’.

You can get a list of the corpora for this language that are currently available by typing the following:


At the time of writing, the following corpora are available:

['latin_text_perseus', 'latin_treebank_perseus', 'latin_text_lacus_curtius', 'latin_text_latin_library', 'phi5', 'phi7', 'latin_proper_names_cltk', 'latin_models_cltk', 'latin_pos_lemmata_cltk', 'latin_treebank_index_thomisticus', 'latin_lexica_perseus', 'latin_training_set_sentence_cltk', 'latin_word2vec_cltk', 'latin_text_antique_digiliblt', 'latin_text_corpus_grammaticorum_latinorum']

We want to import  ‘latin_text_latin_library’. This corpus can be downloaded by passing the name of the corpus we want to download to the following CLTK function:


(When given a single argument, this function downloads the corpus from from the CLTK Github repo [see here] if it is available. Note that corpora can also be loaded locally by providing the filepath to the corpus as a second argument. This is covered in the documentation.)

Assuming everything runs properly, you should now have a new folder in your user directory called cltk_data and inside that directory you should have the following path: /latin/text/latin_text_latin_library/. This is where your new local Latin Library corpus is located. If you explore this folder, you will find hundreds of text files from the Latin Library ready for you to work with. In an upcoming post, I will explain some strategies for working with this corpus in CLTK projects.

10,000 Most Frequent ‘Words’ in the Latin Canon, revisited


Last year, the CLTK’s Kyle Johnson wrote a post on the “10,000 most frequent words in Greek and Latin canon”. Since that post was written, I updated the CLTK’s Latin tokenizer to better handle enclitics and other affixes. I thought it would be a good idea to revisit that post for two reasons: 1. to look at the most important changes introduced by the new tokenizer features, and 2. to discuss briefly what we can learn from the most frequent words as I continue to develop the new Latin lemmatizer for the CLTK.

Here is an iPython notebook with the code for generating the Latin list: I have followed Johnson’s workflow, i.e. tokenize the PHI corpus and create a frequency distribution list. (In a future post, I will run the same experiment on the Latin Library corpus using the built-in NLTK FreqDist function.)

Here are the results:

Top 10 tokens using the NLTK tokenizer:
et	197240
in	141628
est	99525
non	91073
ut	70782
cum	61861
si	60652
ad	59462
quod	53346
qui	46724
Top 10 tokens using the CLTK tokenizer:
et	197242
in	142130
que	110612
ne	103342
est	103254
non	91073
ut	71275
cum	65341
si	61776
ad	59475

The list gives a good indication of what the new tokenizer does:

  • The biggest change is that the (very common) enclitics -que and -ne take their place in the list of top Latin tokens.
  • The words et and non (words which do not combine with -que) are for the most part unaffected.
  • The words estin, and ut see their count go up because of enclitic handling in the Latin tokenizer, e.g. estne > est, ne; inque > in, que. While these tokens are the most obvious examples of this effect, it is the explanation for most of the changed counts on the top 10,000 list, e.g. amorque amor, que. (Ad is less clear. Adque may be a variant of atque; this should be looked into.)
  • The word cum also see its count go up, both because of enclitic handling and also because of the tokenization of forms like mecum as cumme.
  • The word si sees its count go up because the Latin tokenizer handles contractions if words like sodes (siaudes) and sultis (sivultis).

I was thinking about this list of top tokens as I worked on the Latin lemmatizer this week. These top 10 tokens represent 17.3% of all the tokens in the PHI corpus; related, the top 228 tokens represent 50% of the corpus. Making sure that these words are handled correctly then will have the largest overall effect on the accuracy of the Latin lemmatizer.

A few observations…

  • Many of the highest frequency words in the corpus are conjunctions, prepositions, adverbs and other indeclinable, unambiguous words. These should be lemmatized with dictionary matching.
  • Ambiguous tokens are the real challenge of the lemmatizer project and none is more important than cumCum alone makes up 1.1% of the corpus with both the conjunction (‘when’) and the preposition (‘with’) significantly represented. Compare this with est, which is an ambiguous form (i.e. est sum “to be” vs. est edo “to eat”), but with one occurring by far more frequently in the corpus. For this reason, cum will be a good place to start with testing a context-based lemmatizer, such as one that uses bigrams to resolve ambiguities. Quod and quam, also both in the top 20 tokens, can be added to this category.

In addition to high-frequency tokens, extremely rare tokens also present a significant challenge to lemmatization. Look for a post about hapax legomena in the Latin corpus later this week.

GSoC 2016: Lemmatizing Latin/Greek for CLTK


Google Summer of Code 2016 started this week. That means that my work on improving the Latin (and Greek) lemmatizer in the Classical Language Toolkit is now underway. For this summer project, I proposed to rewrite the CLTK lemmatizer using a backoff strategy—that is, using a series of different lemmatizers to increase accuracy. Backoff tagging is a common technique in part-of-speech tagging in NLP, but it should also help to resolve ambiguities, predict unknown words, and similar issues that can trip up a lemmatizer. The current CLTK lemmatizer uses dictionary matching, but lacks a systematic way to differentiate ambiguous forms. (Is that forma the nominative singular noun [ > forma, –ae] or forma the present imperative active verb [ > formo (1) ?) The specifics of my backoff strategy will be discussed here as the project develops, but for now I’ll say that it is a combination of training on context, regex matching, and, yes, dictionary matching for high frequency, indeclinable, and unambiguous words.

Screen Shot 2016-05-23 at 11.28.28 PM

First round of tests today with the default Latin lemmatizer.

As I mention in my GSoC proposal, having a lemmatizer with high accuracy is particularly important for NLP in highly inflected languages because: 1. words often have a dozen or more possible forms (and, as opposed to go in English, this is the norm and not only a characteristic of irregularly formed words), and 2. small corpus size in general often demands that counts for a given feature—like words—be based on the broadest measure possible. So, for example, if you want to study the idea of shapes in Ovid’s Metamorphoses, you would need to would want to look at the word forma. This “word” (token, really) appears 39 times in the poem. But what you really want to look at is not just forma, but formae (21), formam (18), formarum (0—yes, it’s zero, but you would still want to know), formis (1), and formas (6). And you wouldn’t want to miss tokens like formasque (Met. 2.78) or formaene (Met. 10.563)—there are 9 such instances. If you were going to, say, topic model the Metamorphoses, you would be much better off having the 94 examples of “forma” than the smaller numbers of its different forms.

“Ancient languages do not have complete BLARKs.” writes Barbara McGillivray  [2014: 19], referring to Krauwer’s idea [2003: 4] of the Basic LAnguage Resource Kit. A BLARK consists of the fundamental resources necessary for text analysis—corpora, lexicons, tokenizers, POS-taggers, etc. A lemmatizer is another basic tool. More and more, the CLTK is solving the BLARK problem for Latin, Greek, and other historical languages which have been referred to as “less-resourced” [see Piotrowski 2012: 85]. In order for these languages to participate in advances in text analysis and to take full advantage of digital resources for language processing, basic tools, like the lemmatizer, need to be available and need to work at accuracy rates high enough to stand up to the very high bar demanded in philological research. This is the goal for the summer.

Works cited:
Bird, S., E. Klein, and E. Loper. 2009. Natural Language Processing with Python. Cambridge, Ma.: O’Reilly. (Esp. Ch. 5 “Categorizing and Tagging Words”).
Krauwer, S. 2003. “The Basic Language Resource Kit (BLARK) as the First Milestone for the Language Resources Roadmap.” Proceedings of the 2003 International Workshop on Speech and Computer (SPECOM 2003) : 8-15.
McGillivray, B. 2014. Methods in Latin Computational Linguistics. Leiden: Brill.
Piotrowski, M. 2012. “Natural Language Processing for Historical Texts.” Synthesis Lectures on Human Language Technologies 5: 1-157.

More Tokenizing Latin Text


When I first started working on the CLTK Latin tokenizer, I wrote a blog post both explaining tokenizing in general and also showing some of the advantages of using a language-specific tokenizer. At that point, the most important feature of the CLTK Latin tokenizer was the ability to split tokens on the enclitic ‘-que’. In the meantime, I have added several more features, described below. Like the last post, the code below assumes the following requirements: Python 3.4, NLTK3, and the current version of CLTK.

Start by importing the Latin word tokenizer with the following code:

from cltk.tokenize.word import WordTokenizer
word_tokenizer = WordTokenizer('latin')

The following code demonstrates the current features of the tokenizer:

# -que
# V. Aen. 1.1
text = "Arma virumque cano, Troiae qui primus ab oris"

>>> ['Arma', 'que', 'virum', 'cano', ',', 'Troiae', 'qui', 'primus', 'ab', 'oris']

# -ne
# Cic. Orat. 1.226.1
text = "Potestne virtus, Crasse, servire istis auctoribus, quorum tu praecepta oratoris facultate complecteris?"

>>> ['ne', 'Potest', 'virtus', ',', 'Crasse', ',', 'servire', 'istis', 'auctoribus', ',', 'quorum', 'tu', 'praecepta', 'oratoris', 'facultate', 'complecteris', '?']

# -ve
# Catull. 14.4-5
text = "Nam quid feci ego quidve sum locutus, cur me tot male perderes poetis?"

>>> ['Nam', 'quid', 'feci', 'ego', 've', 'quid', 'sum', 'locutus', ',', 'cur', 'me', 'tot', 'male', 'perderes', 'poetis', '?']

# -'st' contractions
# Prop. 2.5.1-2

text = "Hoc verumst, tota te ferri, Cynthia, Roma, et non ignota vivere nequitia?"


>>> ['Hoc', 'verum', 'est', ',', 'tota', 'te', 'ferri', ',', 'Cynthia', ',', 'Roma', ',', 'et', 'non', 'ignota', 'vivere', 'nequitia', '?']

# Plaut. Capt. 937
text = "Quid opust verbis? lingua nullast qua negem quidquid roges."


>>> ['Quid', 'opus', 'est', 'verbis', '?', 'lingua', 'nulla', 'est', 'qua', 'negem', 'quidquid', 'roges.']

# 'nec' and 'neque'
# Cic. Phillip. 13.14

text = "Neque enim, quod quisque potest, id ei licet, nec, si non obstatur, propterea etiam permittitur."


>>> ['que', 'Ne', 'enim', ',', 'quod', 'quisque', 'potest', ',', 'id', 'ei', 'licet', ',', 'c', 'ne', ',', 'si', 'non', 'obstatur', ',', 'propterea', 'etiam', 'permittitur.']

# '-n' for '-ne'
# Plaut. Amph. 823

text = "Cenavin ego heri in navi in portu Persico?"


>>> ['Cenavi', 'ne', 'ego', 'heri', 'in', 'navi', 'in', 'portu', 'Persico', '?']

# Contractions with 'si'; also handles 'sultis',
# Plaut. Bacch. 837-38

text = "Dic sodes mihi, bellan videtur specie mulier?"


>>> ['Dic', 'si', 'audes', 'mihi', ',', 'bella', 'ne', 'videtur', 'specie', 'mulier', '?']

There are still improvements to be done, but this handles a high percentage of Latin tokenization tasks. If you have any ideas for more cases that need to be handled or if you see any errors, let me know.

Tokenizing Latin Text


One of the first tasks necessary in any text analysis projects is tokenization—we take our text as a whole and convert it to a list of smaller units, or tokens. When dealing with Latin—or at least digitized version of modern editions, like those found in the Perseus Digital Library, the Latin Library, etc.—paragraph- and sentence-level tokenization present little problem. Paragraphs are usually well marked and can be split by newlines (</n>). Sentences in modern Latin editions use the same punctuation set as English (i.e., ‘.’, ‘?’, and ‘!’), so most sentence-level tokenization can be done more or less successfully with the built-in tools found in the Natural Language Toolkit (NLTK), e.g. nltk.word_tokenize. But just as in English, Latin word tokenization offers small, specific issues that are not addressed by NLTK. The classic case in English is the negative contraction—how do we want to handle, for example, “didn’t”: [“didn’t”] or [“did”, “n’t”] or [“did”, “not”]?

There are four important cases in which Latin word tokenization demands special attention: the enclictics “-que”, “-ue/-ve”, and “-ne” and the postpositive use of “-cum” with the personal pronouns (e.g. nobiscum for *cum nobis). The Classical Language Toolkit now takes these cases into consideration when doing Latin word tokenization. Below is a brief how-to on using the CLTK to tokenize your Latin texts by word. [The tutorial assumes the following requirements: Python3, NLTK3, CLTK.]

Tokenizing Latin Text with CLTK

We could simply use Python to split our texts into a list of tokens. (And sometimes this will be enough!) So…

text = "Arma virumque cano, Troiae qui primus ab oris"

>>> ['Arma', 'virumque', 'cano,', 'Troiae', 'qui', 'primus', 'ab', 'oris']

A good start, but we’ve lost information, namely the comma between cano and Troiae. This might be ok, but let’s use NLTK’s tokenizer to hold on to the punctuation.

import nltk


>>> ['Arma', 'virumque', 'cano', ',', 'Troiae', 'qui', 'primus', 'ab', 'oris']

Using word_tokenize, we retain the punctuation. But otherwise we have more or less the same division of words.

But for someone working with Latin that second token is an issue. Do we really want virumque? Or are we looking for virum and the enclitic –que? In many cases, it will be there latter. Let’s use CLTK to handle this. (***UPDATED 3.19.16***)

from cltk.tokenize.word import WordTokenizer

word_tokenizer = WordTokenizer('latin')

>>> ['Arma', 'que', 'virum', 'cano', ',', 'Troiae', 'qui', 'primus', 'ab', 'oris']


Using the CLTK WordTokenizer for Latin we retain the punctuation and split the special case more usefully.