Making a Keyword-in-Context index with CLTK

code, tutorial

The “keyword-in-context” (KWIC) index was an innovation of early information retrieval, the basic concepts of which were developed in the late 1950s by H.P. Luhn. The idea is to produce a list of all occurrences of a word, aligned so that the word is printed as a column in the center of the text with the corresponding context printed to the immediate left and right. This allows a user to scan a large number of uses in a given text quickly. For example, David Packard’s 1968 A Concordance to Livy uses an alphabetical KWIC format. Here are the first entries for the preposition e in Packard’s concordance:

[Screenshot: the first entries for the preposition e in Packard’s Concordance to Livy]

Using the Classical Language Toolkit and the Natural Language Toolkit’s Text module, we can easily create KWICs for texts in the Latin Library.

[This post assumes that you have already imported the Latin Library corpus as described in an earlier post and as always that you are running the latest version of CLTK on Python3.6. This tutorial was tested on v. 0.1.56.]

First, we can import a text from the Latin Library—here, Cicero’s De amicitia—as a list of words:

In [1]: from cltk.corpus.latin import latinlibrary
In [2]: amicitia_words = latinlibrary.words('cicero/amic.txt')
In [3]: print(amicitia_words[117:188])
Out [3]: ['Q.', 'Mucius', 'augur', 'multa', 'narrare', 'de', 'C.', 'Laelio', 'socero', 'suo', 'memoriter', 'et', 'iucunde', 'solebat', 'nec', 'dubitare', 'illum', 'in', 'omni', 'sermone', 'appellare', 'sapientem', ';', 'ego', 'autem', 'a', 'patre', 'ita', 'eram', 'deductus', 'ad', 'Scaevolam', 'sumpta', 'virili', 'toga', ',', 'ut', ',', 'quoad', 'possem', 'et', 'liceret', ',', 'a', 'senis', 'latere', 'numquam', 'discederem', ';', 'itaque', 'multa', 'ab', 'eo', 'prudenter', 'disputata', ',', 'multa', 'etiam', 'breviter', 'et', 'commode', 'dicta', 'memoriae', 'mandabam', 'fieri', '-que', 'studebam', 'eius', 'prudentia', 'doctior', '.']

We can then convert this list of words to an NLTK Text:

In [4]: import nltk
In [5]: amicitia_text = nltk.Text(amicitia_words)
In [6]: print(type(amicitia_text))
Out [6]: <class 'nltk.text.Text'>

Now that we have an NLTK text, there are several methods available to us, including “concordance,” which generates a KWIC for us based on keywords that we provide. Here, for example, is the NLTK concordance for ‘amicus’:

In [7]: amicitia_text.concordance('amicus')
Out [7]: Displaying 5 of 5 matches:
tentiam . Quonam enim modo quisquam amicus esse poterit ei , cui se putabit in
 optare , ut quam saepissime peccet amicus , quo plures det sibi tamquam ansas
escendant . Quamquam Ennius recte : Amicus certus in re incerta cernitur , tam
m in amicitiam transferetur , verus amicus numquam reperietur ; est enim is qu
itas . [ 95 ] Secerni autem blandus amicus a vero et internosci tam potest adh

The KWICs generated by NLTK Text are case-insensitive (see amicus and Amicus above) and sorted sequentially by location in the text. There’s not much customization available for the method, so this is pretty much what it does. [You can set parameters for context width and number of lines presented, e.g. amicitia_text.concordance('amicus', width=50, lines=3).] Admittedly, it is pretty basic—it does not even return an identification or location code to help the user move easily to the wider context, and the only way we know that the fifth match is in Chapter 95 is that the chapter number happens to be included in the context. At the same time, it is another step towards combining existing resources and tools (here, NLTK Text and a CLTK corpus) to explore Latin literature from different angles.
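
To illustrate the two parameters mentioned above, the same call can be trimmed to a 50-character window and capped at three lines (a quick sketch, reusing the amicitia_text object defined earlier):

In [8]: amicitia_text.concordance('amicus', width=50, lines=3)

This prints only the first three of the five matches, each on a line roughly 50 characters wide.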

In a future post, I will build a KWIC method from scratch that offers more flexibility, especially with respect to context scope and location identification.


Working with The Latin Library Corpus in CLTK, pt. 3

code, tutorial

In the previous two posts, I explained how to load either the whole Latin Library or individual files from the corpus. In today’s post, I’ll split the difference and show how to build a custom text from PlaintextCorpusReader output, in this case Virgil’s Aeneid. Unlike Catullus, whose omnia opera can be found in a single text file (catullus.txt) in the Latin Library, each book of the Aeneid has been placed in its own text file. Let’s look at how we can work with multiple files at once using PlaintextCorpusReader.

[This post assumes that you have already imported the Latin Library corpus as described in the earlier post and as always that you are running the latest version of CLTK on Python3. This tutorial was tested on v. 0.1.42.]

We can access the corpus and build a list of available files with the following commands:

from cltk.corpus.latin import latinlibrary
files = latinlibrary.fileids()

We can then use a list comprehension to figure out which files we need:

print([file for file in files if 'vergil' in file])
>>> ['vergil/aen1.txt', 'vergil/aen10.txt', 'vergil/aen11.txt', 'vergil/aen12.txt', 'vergil/aen2.txt', 'vergil/aen3.txt', 'vergil/aen4.txt', 'vergil/aen5.txt', 'vergil/aen6.txt', 'vergil/aen7.txt', 'vergil/aen8.txt', 'vergil/aen9.txt', 'vergil/ec1.txt', 'vergil/ec10.txt', 'vergil/ec2.txt', 'vergil/ec3.txt', 'vergil/ec4.txt', 'vergil/ec5.txt', 'vergil/ec6.txt', 'vergil/ec7.txt', 'vergil/ec8.txt', 'vergil/ec9.txt', 'vergil/geo1.txt', 'vergil/geo2.txt', 'vergil/geo3.txt', 'vergil/geo4.txt']

The file names for the Aeneid texts all follow the same pattern and we can use this to build a list of the twelve files we want for our subcorpus.

aeneid_files = [file for file in files if 'vergil/aen' in file]

print(aeneid_files)
>>> ['vergil/aen1.txt', 'vergil/aen10.txt', 'vergil/aen11.txt', 'vergil/aen12.txt', 'vergil/aen2.txt', 'vergil/aen3.txt', 'vergil/aen4.txt', 'vergil/aen5.txt', 'vergil/aen6.txt', 'vergil/aen7.txt', 'vergil/aen8.txt', 'vergil/aen9.txt']

Now that we have a list of files, we can build our collection by passing this list to our raw, sents, and words methods instead of a single filename:

aeneid_raw = latinlibrary.raw(aeneid_files)
aeneid_sents = latinlibrary.sents(aeneid_files)
aeneid_words = latinlibrary.words(aeneid_files)

At this point, we have our raw materials and are free to explore. So, like we did with Lesbia in Catullus, we can do the same for, say, Aeneas in the Aeneid:

import re
aeneas = re.findall(r'\bAenea[ens]?\b', aeneid_raw, re.IGNORECASE)

# i.e. Return a list of matches of single words made up of 
# the letters 'Aenea' followed by 'e', 'n', or 's', or nothing, ignoring case.
# (Inside a character class the pipe is a literal character, so we write
# [ens] rather than [e|n|s].)

print(len(aeneas))
>>> 236

# Note that this regex misses 'Aeneaeque' at Aen. 11.289—it is 
# important to define our regexes carefully to make sure they return
# what we expect them to return!
#
# A fix... 

aeneas = re.findall(r'\bAenea[ens]?(?:que)?\b', aeneid_raw, re.IGNORECASE)
# The non-capturing group (?:que)? keeps findall returning whole words
# rather than just the contents of the 'que' group.
print(len(aeneas))
>>> 237

Aeneas appears in the Aeneid 237 times. (This matches the result found, for example, in Wetmore’s concordance.)

We are now equipped to work with the entire Latin Library corpus as well as smaller sections that we define for ourselves. There is still work to do, however, before we can ask serious research questions of this material. In a series of upcoming posts, we’ll look at a number of important preprocessing tasks that can be used to transform our unexamined text into useful data.

 

Working with The Latin Library Corpus in CLTK, pt. 2

code, tutorial

In the previous post, I explained how to load the whole of The Latin Library as a plaintext corpus of sentences or words or even as a “raw” string which you can then process in Python and the Classical Language Toolkit. While it can be interesting to experiment with the corpus in its entirety, it is often more useful to focus on a specific author or work, which is what we’ll do in this post using the works of Catullus.

[This post assumes that you have already imported the Latin Library corpus as described in the earlier post and as always that you are running the latest version of CLTK on Python3. This tutorial was tested on v. 0.1.42.]

We can access the corpus with the following command:

from cltk.corpus.latin import latinlibrary

The Latin Library corpus is organized as a collection of plaintext (.txt) files, some of which can be found in the top-level directory (e.g. 12tables.txt) and some of which can be found in an author- or collection-specific folder (e.g. abelard/dialogus.txt).

[Screenshot: file listing of the Latin Library corpus, showing top-level files and author folders]

The PlaintextCorpusReader gives us a method for returning a list of all of the files available in the corpus, fileids():

print(latinlibrary.fileids())
>>> ['12tables.txt', '1644.txt', 'abbofloracensis.txt', 'abelard/dialogus.txt', 'abelard/epistola.txt', 'abelard/historia.txt', 'addison/barometri.txt', etc.

print(len(latinlibrary.fileids()))
>>> 2162

We have 2,162 different text files that we can explore via the Latin Library. Let’s say we want to work with only one—the poems of Catullus. We can inspect the list of files to find the file we want to work with. Here is one way, using a list comprehension:

files = latinlibrary.fileids()

print([file for file in files if 'catullus' in file])
>>> ['catullus.txt']

### Note that the list comprehension
### [file for file in files if 'catull' in file]
### returns the following list:
### ['catullus.txt', 'pascoli.catull.txt']
###
### You may need to experiment with variations
### on the names of authors and works to find
### the file you are looking for.

We can now limit our corpus by passing an argument to the raw, sents, and words methods we already know from the previous post:

catullus_raw = latinlibrary.raw('catullus.txt')
catullus_sents = latinlibrary.sents('catullus.txt')
catullus_words = latinlibrary.words('catullus.txt')

From here, we can do anything we did with the entire corpus on a smaller scale. So, for example, if we want to see what the most common ‘words’ are in Catullus, we can use the following:

# Find the most commonly occurring 'words' (i.e., tokens) in Catullus:

from collections import Counter

catullus_wordlist = list(catullus_words)

c = Counter(catullus_wordlist)
print(c.most_common(25))
>>> [(',', 1598), ('.', 686), ('et', 193), ('est', 181), ('in', 159), ('ad', 142), (':', 141), ('*', 135), ('non', 135), ('?', 134), ('ut', 99), ('nec', 92), ('te', 90), ('quod', 84), ('cum', 82), ('sed', 80), ('quae', 74), ('tibi', 70), ('si', 69), ('mihi', 68), ('me', 68), ('o', 57), ('aut', 57), ('qui', 56), ('atque', 51)]

Not particularly interesting stuff here, but we’re on our way. We now have the raw materials to do any number of word studies based on Catullus’ poetry. We could, for example, find the number of times Lesbia appears in this part of the corpus:

### There are many ways we could do this.
### Let's use regular expressions for this one...

import re
lesbia = re.findall(r'\bLesbi.+?\b', catullus_raw, re.IGNORECASE)
# i.e. Return a list of matches of single words made up of 
# the letters 'Lesbi' followed by one or more characters up to the next word boundary, ignoring case.

print(len(lesbia))
>>> 31

# Note that this includes the "titles" to the poems that are included in the Latin Library, e.g. "II. fletus passeris Lesbiae".

In the next post, we’ll figure out how to do the same thing with, say, an author like Virgil—that is, an author whose corpus consists, not of a single text file like Catullus, but rather several.

Working with the Latin Library Corpus in CLTK

code, tutorial

In an earlier post, I explained how to import the contents of The Latin Library as a plaintext corpus for you to use with the Classical Language Toolkit. In this post, I want to show you a quick and easy way to access this corpus (or parts of this corpus).

[This post assumes that you have already imported the Latin Library corpus as described in the earlier post and as always that you are running the latest version of CLTK on Python3. This tutorial was tested on v. 0.1.41. In addition, if you imported the Latin Library corpus in the past, I recommend that you delete and reimport the corpus as I have fixed the encoding of the plaintext files so that they are all UTF-8.]

With the corpus imported, you can access it with the following command:

from cltk.corpus.latin import latinlibrary

If we check the type, we see that our imported latinlibrary is an instance of the PlaintextCorpusReader of the Natural Language Toolkit:

print(type(latinlibrary))
>>> <class 'nltk.corpus.reader.plaintext.PlaintextCorpusReader'>

Now we have access to several useful PlaintextCorpusReader functions that we can use to explore the corpus. Let’s look at working with the Latin Library as raw data (i.e. a very long string), a list of sentences, and a list of words.

ll_raw = latinlibrary.raw()

print(type(ll_raw))
>>> <class 'str'>

print(len(ll_raw))
>>> 96167304

print(ll_raw[91750273:91750318])
>>> Arma virumque cano, Troiae qui primus ab oris

The “raw” function returns the entire text of the corpus as a string. So with a few Python string operations, we can learn the size of the Latin Library (96,167,304 characters!) and we can do other things like print slices from the string.

PlaintextCorpusReader can also return our corpus as sentences or as words:

ll_sents = latinlibrary.sents()
ll_words = latinlibrary.words()

Both of these are returned as instances of the class ‘nltk.corpus.reader.util.ConcatenatedCorpusView’, and we can work with them either directly or indirectly. (Note that this is a very large corpus and some of the commands—rest assured, I’ve marked them—will take a long time to run. In an upcoming post, I will discuss both strategies for iterating over these collections more efficiently and ways to avoid having to wait for these results over and over again.)

# Get the total number of words (***slow***):
ll_wordcount = len(latinlibrary.words())
print(ll_wordcount)
>>> 16667761

# Print a slice from 'words' from the concatenated view:
print(latinlibrary.words()[:100])
>>> ['DUODECIM', 'TABULARUM', 'LEGES', 'DUODECIM', ...]

# Return a complete list of words (***slow***):
ll_wordlist = list(latinlibrary.words())

# Print a slice from 'words' from the list:
print(ll_wordlist[:10])
>>> ['DUODECIM', 'TABULARUM', 'LEGES', 'DUODECIM', 'TABULARUM', 'LEGES', 'TABULA', 'I', 'Si', 'in']

# Check for list membership:
test_words = ['est', 'Caesar', 'lingua', 'language', 'Library', '101', 'CI']

for word in test_words:
    if word in ll_wordlist:
        print('\'%s\' is in the Latin Library' %word)
    else:
        print('\'%s\' is *NOT* in the Latin Library' %word)

>>> 'est' is in the Latin Library
>>> 'Caesar' is in the Latin Library
>>> 'lingua' is in the Latin Library
>>> 'language' is *NOT* in the Latin Library
>>> 'Library' is in the Latin Library
>>> '101' is in the Latin Library
>>> 'CI' is in the Latin Library

# Find the most commonly occurring words in the list:
from collections import Counter
c = Counter(ll_wordlist)
print(c.most_common(10))
>>> [(',', 1371826), ('.', 764528), ('et', 428067), ('in', 265304), ('est', 171439), (';', 167311), ('non', 156395), ('-que', 135667), (':', 131200), ('ad', 127820)]

There are 16,667,761 words in the Latin Library. Well, this is not strictly true—for one thing, the Latin word tokenizer isolates punctuation and numbers. In addition, it is worth pointing out that the plaintext Latin Library includes the English header and footer information from each page. (This explains why the word “Library” tests positive for membership.) So while we don’t really have 16+ million Latin words, what we do have is a large list of tokens from a large Latin corpus. And now that we have this large list, we can “clean it up” depending on what research questions we want to ask. So, even though it is slow to create a list from the ConcatenatedCorpusView, once we have that list, we can perform any list operation and do so much more quickly: remove punctuation, normalize case, remove stop words, etc. I will leave it to you to experiment with this kind of preprocessing on your own for now. (Although all of these steps will be covered in future posts.)
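
As a small taste of that preprocessing, here is a minimal sketch (not a full pipeline) that lowercases the tokens in ll_wordlist and drops punctuation-only tokens before recounting the most common words:

import string

# Lowercase every token and drop tokens made up entirely of punctuation.
punctuation = set(string.punctuation)
ll_clean = [word.lower() for word in ll_wordlist
            if not all(char in punctuation for char in word)]

c_clean = Counter(ll_clean)
print(c_clean.most_common(10))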

Much of the time, we will not want to work with the entire corpus but rather with subsets of the corpus such as the plaintext files of a single author or work. Luckily, PlaintextCorpusReader allows us to load multi-file corpora by file. In the next post, we will look at loading and working with smaller selections of the Latin Library.

Current State of the CLTK Latin Lemmatizer

code

Lemmatization is a core task in natural language processing that allows us to return the dictionary headword—also known as the lemma—for each token in a given string. The Classical Language Toolkit includes a lemmatizer for Latin and Greek and for my Google Summer of Code project I have been rewriting these tools to improve their accuracy. In this post, I want to 1. review the current state of the lemmatizer, specifically the Latin lemmatizer, 2. test some sample sentences to see where the lemmatizer performs well and where it does not, and 3. suggest where I think improvements could be made.

[This post uses Python3 and the current version of the CLTK.]

The current version of the lemmatizer uses a model that is kept in the CLTK_DATA directory. (More specifically, the model is a Python dictionary called LEMMATA that can be found in the ‘latin_lemmata_cltk.py’ file in the ‘latin_models_cltk’ corpus.) So before we can lemmatize Latin texts we need to import this model/corpus. The import commands are given below, but if you want more details on loading CLTK corpora, see this post.

from cltk.corpus.utils.importer import CorpusImporter
corpus_importer = CorpusImporter('latin')
corpus_importer.import_corpus('latin_models_cltk')

[Note that once this corpus is imported into CLTK_DATA, you will not need to repeat these steps to use the Latin lemmatizer in the future.]

To use the lemmatizer, we import it as follows:

from cltk.stem.lemma import LemmaReplacer

LemmaReplacer takes a language argument, so we can create an instance of the Latin lemmatizer with the following command:

lemmatizer = LemmaReplacer('latin')

This lemmatizer checks words against the LEMMATA dictionary that you installed above. That is, it checks the dictionary to see if a word is found as a key and returns the associated value. Here is the beginning of the lemma dictionary:

LEMMATA = { 
    '-nam' : 'nam', 
    '-namque' : 'nam', 
    '-sed' : 'sed', 
    'Aaron' : 'Aaron', 
    'Aaroni' : 'Aaron', 
    'Abante' : 'Abas', 
    'Abanteis' : 'Abanteus', 
    'Abantem' : 'Abas', 
    'Abantes' : 'Abas', etc...

If a word is not found in the dictionary, the lemmatizer returns the original word unchanged. Since Python dictionaries do not support duplicate keys, there is no resolution for ambiguous forms with the current lemmatizer. For example, the key-value pair {‘amor’ : ‘amo’} ensures that the word “amor” is always lemmatized as a verb and not a noun, even though the nominative singular of the noun ‘amor’ appears much more frequently than ‘amor’ as the first-person singular present passive of ‘amo’.
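
The lookup logic itself is simple; here is a minimal sketch of the behavior just described, using a tiny stand-in dictionary rather than the full LEMMATA model:

# A tiny stand-in for the LEMMATA model, for illustration only.
lemmata = {'arma': 'arma', 'virum': 'vir', 'amor': 'amo'}

def lookup(token):
    # Return the mapped headword if the token is a key in the dictionary;
    # otherwise fall back to the token itself, unchanged.
    return lemmata.get(token, token)

print([lookup(token) for token in ['arma', 'virum', 'amor', 'cano']])

>>> ['arma', 'vir', 'amo', 'cano']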

Let’s try some test sentences. Here is the first sentence from Cicero’s In Catilinam 1:

sentence = 'Quo usque tandem abutere, Catilina, patientia nostra?'
sentence = sentence.lower()

Note that I have also made the sentence lowercase as the current lemmatizer can raise errors due to case handling.

Now let’s pass this to the lemmatizer:

lemmas = lemmatizer.lemmatize(sentence)
print(lemmas)

>>> ['quis1', 'usque', 'tandem', 'abutor', ',', 'catilina', ',', 'patior', 'noster', '?']

The lemmatizer does a pretty good job. Punctuation included, its accuracy is 80% when compared with the lemmas found in Perseus Treebank Data. According to this dataset, the “quis1” should resolve to “quo”. (Though an argument could be made about whether this adverb is a form of ‘quis’ or its own word deriving from ‘quis’. The argument about whether ‘quousque’ should in fact be one word is also worth mentioning. Note that the number following ‘quis’ is a feature of the Morpheus parser to disambiguate identical forms.) “Patientia” is perhaps a clearer case. Though derived from the verb “patior”, the expected behavior of the lemmatizer is to resolve this word as the self-sufficient noun ‘patientia’. This is what we find in our comparative data from Perseus.
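
The 80% figure is simply token-for-token agreement; here is a quick sketch of that calculation, with an illustrative gold list standing in for the Perseus lemmas:

# An illustrative gold standard for the sentence (standing in for the Perseus data).
gold = ['quo', 'usque', 'tandem', 'abutor', ',', 'catilina', ',', 'patientia', 'noster', '?']

matches = sum(1 for predicted, expected in zip(lemmas, gold) if predicted == expected)
print(matches / len(gold))

>>> 0.8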

Another example, a longer sentence from the opening of Sallust’s Bellum Catilinae:

sentence = 'Omnis homines, qui sese student praestare ceteris animalibus, summa ope niti decet, ne vitam silentio transeant veluti pecora, quae natura prona atque ventri oboedientia finxit.'
sentence = sentence.lower()

lemmas = lemmatizer.lemmatize(sentence)
print(lemmas)

>>> ['omne', 'homo', ',', 'qui1', 'sui', 'studeo', 'praesto2', 'ceter', 'animalis', ',', 'summum', 'ops1', 'nitor1', 'decet', ',', 'neo1', 'vita', 'silentium', 'transeo', 'velut', 'pecus1', ',', 'qui1', 'natura', 'pronus', 'atque', 'venter', 'oboedio', 'fingo.']

Again, pretty good results overall—82.76%. But the errors reveal the shortcomings of the lemmatizer. “Omnis” is an extremely common word in Latin and it simply appears incorrectly in the lemma model. Ditto ‘summus’. Ditto ‘ceter’, though worse because this is not even a valid Latin form. ‘Animalibus’ suffers from the kind of ambiguity noted above with ‘amor’—the noun ‘animal’ is much more common than the adjective ‘animalis’. The most significant error is lemmatizing ‘ne’—one of the most common words in the language—incorrectly as the extremely infrequent (if ever appearing) present active imperative of ‘neo’.

If this all sounds critical simply for the sake of being critical, that is not my intention. I have been working on new approaches to the problem of Latin lemmatization and have learned a great deal from the current CLTK lemmatizer. The work shown above is a solid start and there is significant room for improvement. I see it as a baseline: every percentage point above 80% or 82.76% accuracy is a step in the right direction. Next week, I will publish some new blog posts with ideas for new approaches to Latin lemmatization based not on dictionary matching, but on training data, regex matching, and attention to word order and context. While dictionary matching is still the most efficient way to resolve some lemmas (e.g. unambiguous, indeclinables like “ad”), it is through a combination of multiple approaches that we will be able to increase substantially the accuracy of this important tool in the CLTK.

 

10,000 Most Frequent ‘Words’ in the Latin Canon, revisited

code

Last year, the CLTK’s Kyle Johnson wrote a post on the “10,000 most frequent words in Greek and Latin canon”. Since that post was written, I updated the CLTK’s Latin tokenizer to better handle enclitics and other affixes. I thought it would be a good idea to revisit that post for two reasons: 1. to look at the most important changes introduced by the new tokenizer features, and 2. to discuss briefly what we can learn from the most frequent words as I continue to develop the new Latin lemmatizer for the CLTK.

Here is an iPython notebook with the code for generating the Latin list: https://github.com/diyclassics/lemmatizer/blob/master/notebooks/phi-10000.ipynb. I have followed Johnson’s workflow, i.e. tokenize the PHI corpus and create a frequency distribution list. (In a future post, I will run the same experiment on the Latin Library corpus using the built-in NLTK FreqDist function.)

Here are the results:

Top 10 tokens using the NLTK tokenizer:
et	197240
in	141628
est	99525
non	91073
ut	70782
cum	61861
si	60652
ad	59462
quod	53346
qui	46724
Top 10 tokens using the CLTK tokenizer:
et	197242
in	142130
que	110612
ne	103342
est	103254
non	91073
ut	71275
cum	65341
si	61776
ad	59475

The list gives a good indication of what the new tokenizer does:

  • The biggest change is that the (very common) enclitics -que and -ne take their place in the list of top Latin tokens.
  • The words et and non (words which do not combine with -que) are for the most part unaffected.
  • The words est, in, and ut see their counts go up because of enclitic handling in the Latin tokenizer, e.g. estne > est, ne; inque > in, que. While these tokens are the most obvious examples of this effect, it is the explanation for most of the changed counts on the top 10,000 list, e.g. amorque > amor, que. (Ad is less clear. Adque may be a variant of atque; this should be looked into.)
  • The word cum also sees its count go up, both because of enclitic handling and because of the tokenization of forms like mecum as cum, me.
  • The word si sees its count go up because the Latin tokenizer handles contractions in words like sodes (si audes) and sultis (si vultis).

I was thinking about this list of top tokens as I worked on the Latin lemmatizer this week. These top 10 tokens represent 17.3% of all the tokens in the PHI corpus; relatedly, the top 228 tokens represent 50% of the corpus. Making sure that these words are handled correctly, then, will have the largest overall effect on the accuracy of the Latin lemmatizer.
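
The same kind of coverage figure can be computed for the Latin Library corpus, which (unlike PHI) ships with CLTK; a minimal sketch using NLTK’s FreqDist (slow, since it iterates over the whole corpus):

from cltk.corpus.latin import latinlibrary
from nltk import FreqDist

# Build a frequency distribution over all Latin Library tokens (slow).
fd = FreqDist(latinlibrary.words())

# What fraction of the corpus do the ten most frequent tokens cover?
top10_count = sum(count for token, count in fd.most_common(10))
print(top10_count / fd.N())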

A few observations…

  • Many of the highest frequency words in the corpus are conjunctions, prepositions, adverbs and other indeclinable, unambiguous words. These should be lemmatized with dictionary matching.
  • Ambiguous tokens are the real challenge of the lemmatizer project and none is more important than cum. Cum alone makes up 1.1% of the corpus, with both the conjunction (‘when’) and the preposition (‘with’) significantly represented. Compare this with est, which is also an ambiguous form (i.e. est from sum “to be” vs. est from edo “to eat”), but with one of the two occurring far more frequently in the corpus. For this reason, cum will be a good place to start with testing a context-based lemmatizer, such as one that uses bigrams to resolve ambiguities. Quod and quam, also both in the top 20 tokens, can be added to this category.

In addition to high-frequency tokens, extremely rare tokens also present a significant challenge to lemmatization. Look for a post about hapax legomena in the Latin corpus later this week.

GSoC 2016: Lemmatizing Latin/Greek for CLTK

code

Google Summer of Code 2016 started this week. That means that my work on improving the Latin (and Greek) lemmatizer in the Classical Language Toolkit is now underway. For this summer project, I proposed to rewrite the CLTK lemmatizer using a backoff strategy—that is, using a series of different lemmatizers to increase accuracy. Backoff tagging is a common technique in part-of-speech tagging in NLP, but it should also help to resolve ambiguities, predict unknown words, and handle similar issues that can trip up a lemmatizer. The current CLTK lemmatizer uses dictionary matching, but lacks a systematic way to differentiate ambiguous forms. (Is that forma the nominative singular noun [ > forma, -ae] or forma the present active imperative verb [ > formo (1)]?) The specifics of my backoff strategy will be discussed here as the project develops, but for now I’ll say that it is a combination of training on context, regex matching, and, yes, dictionary matching for high-frequency, indeclinable, and unambiguous words.
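
To illustrate the backoff pattern itself (only the pattern, not the lemmatizer), here is a sketch using NLTK’s chained part-of-speech taggers with toy data: each tagger handles what it can and passes the rest to the next tagger in the chain.

import nltk

# Toy training data and a toy regex pattern, for illustration only.
train = [[('arma', 'NOUN'), ('virum', 'NOUN'), ('cano', 'VERB')]]

default = nltk.DefaultTagger('UNK')                                # last resort
regex = nltk.RegexpTagger([(r'.*que$', 'CONJ')], backoff=default)  # catch -que forms
unigram = nltk.UnigramTagger(train, backoff=regex)                 # lookup first

print(unigram.tag(['arma', 'virumque', 'cano', 'Troiae']))

>>> [('arma', 'NOUN'), ('virumque', 'CONJ'), ('cano', 'VERB'), ('Troiae', 'UNK')]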

[Screenshot: first round of tests today with the default Latin lemmatizer.]

As I mention in my GSoC proposal, having a lemmatizer with high accuracy is particularly important for NLP in highly inflected languages because: 1. words often have a dozen or more possible forms (and, as opposed to go in English, this is the norm and not only a characteristic of irregularly formed words), and 2. small corpus size in general often demands that counts for a given feature—like words—be based on the broadest measure possible. So, for example, if you want to study the idea of shapes in Ovid’s Metamorphoses, you would want to look at the word forma. This “word” (token, really) appears 39 times in the poem. But what you really want to look at is not just forma, but formae (21), formam (18), formarum (0—yes, it’s zero, but you would still want to know), formis (1), and formas (6). And you wouldn’t want to miss tokens like formasque (Met. 2.78) or formaene (Met. 10.563)—there are 9 such instances. If you were going to, say, topic model the Metamorphoses, you would be much better off having the 94 examples of “forma” than the smaller numbers of its different forms.
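
A rough sketch of how such a count could be gathered with the tools from the earlier posts, along the lines of the Lesbia and Aeneas regexes above; note that the 'ovid.met' filename pattern is an assumption and should be checked against latinlibrary.fileids():

from cltk.corpus.latin import latinlibrary
import re

# Assumed filename pattern for the Metamorphoses files; verify with fileids().
met_files = [file for file in latinlibrary.fileids() if 'ovid.met' in file]
met_raw = latinlibrary.raw(met_files)

# forma and its declined forms, with or without an attached enclitic.
forma_forms = re.findall(r'\bform(?:a|ae|am|arum|is|as)(?:que|ne|ve)?\b', met_raw, re.IGNORECASE)
print(len(forma_forms))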

“Ancient languages do not have complete BLARKs,” writes Barbara McGillivray [2014: 19], referring to Krauwer’s idea [2003: 4] of the Basic LAnguage Resource Kit. A BLARK consists of the fundamental resources necessary for text analysis—corpora, lexicons, tokenizers, POS-taggers, etc. A lemmatizer is another basic tool. More and more, the CLTK is solving the BLARK problem for Latin, Greek, and other historical languages, which have been referred to as “less-resourced” [see Piotrowski 2012: 85]. In order for these languages to participate in advances in text analysis and to take full advantage of digital resources for language processing, basic tools like the lemmatizer need to be available and need to work at accuracy rates high enough to stand up to the very high bar demanded in philological research. This is the goal for the summer.

Works cited:
Bird, S., E. Klein, and E. Loper. 2009. Natural Language Processing with Python. Cambridge, Ma.: O’Reilly. (Esp. Ch. 5 “Categorizing and Tagging Words”).
Krauwer, S. 2003. “The Basic Language Resource Kit (BLARK) as the First Milestone for the Language Resources Roadmap.” Proceedings of the 2003 International Workshop on Speech and Computer (SPECOM 2003): 8-15.
McGillivray, B. 2014. Methods in Latin Computational Linguistics. Leiden: Brill.
Piotrowski, M. 2012. “Natural Language Processing for Historical Texts.” Synthesis Lectures on Human Language Technologies 5: 1-157.

More Tokenizing Latin Text

code

When I first started working on the CLTK Latin tokenizer, I wrote a blog post both explaining tokenizing in general and showing some of the advantages of using a language-specific tokenizer. At that point, the most important feature of the CLTK Latin tokenizer was the ability to split tokens on the enclitic ‘-que’. In the meantime, I have added several more features, described below. As in the last post, the code below assumes the following requirements: Python 3.4, NLTK3, and the current version of CLTK.

Start by importing the Latin word tokenizer with the following code:

from cltk.tokenize.word import WordTokenizer
word_tokenizer = WordTokenizer('latin')

The following code demonstrates the current features of the tokenizer:

# -que
# V. Aen. 1.1
text = "Arma virumque cano, Troiae qui primus ab oris"
word_tokenizer.tokenize(text)

>>> ['Arma', 'que', 'virum', 'cano', ',', 'Troiae', 'qui', 'primus', 'ab', 'oris']


# -ne
# Cic. Orat. 1.226.1
text = "Potestne virtus, Crasse, servire istis auctoribus, quorum tu praecepta oratoris facultate complecteris?"
word_tokenizer.tokenize(text)

>>> ['ne', 'Potest', 'virtus', ',', 'Crasse', ',', 'servire', 'istis', 'auctoribus', ',', 'quorum', 'tu', 'praecepta', 'oratoris', 'facultate', 'complecteris', '?']


# -ve
# Catull. 14.4-5
text = "Nam quid feci ego quidve sum locutus, cur me tot male perderes poetis?"
word_tokenizer.tokenize(text)

>>> ['Nam', 'quid', 'feci', 'ego', 've', 'quid', 'sum', 'locutus', ',', 'cur', 'me', 'tot', 'male', 'perderes', 'poetis', '?']


# -'st' contractions
# Prop. 2.5.1-2

text = "Hoc verumst, tota te ferri, Cynthia, Roma, et non ignota vivere nequitia?"

word_tokenizer.tokenize(text)

>>> ['Hoc', 'verum', 'est', ',', 'tota', 'te', 'ferri', ',', 'Cynthia', ',', 'Roma', ',', 'et', 'non', 'ignota', 'vivere', 'nequitia', '?']

# Plaut. Capt. 937
text = "Quid opust verbis? lingua nullast qua negem quidquid roges."

word_tokenizer.tokenize(text)

>>> ['Quid', 'opus', 'est', 'verbis', '?', 'lingua', 'nulla', 'est', 'qua', 'negem', 'quidquid', 'roges.']


# 'nec' and 'neque'
# Cic. Phillip. 13.14

text = "Neque enim, quod quisque potest, id ei licet, nec, si non obstatur, propterea etiam permittitur."

word_tokenizer.tokenize(text)

>>> ['que', 'Ne', 'enim', ',', 'quod', 'quisque', 'potest', ',', 'id', 'ei', 'licet', ',', 'c', 'ne', ',', 'si', 'non', 'obstatur', ',', 'propterea', 'etiam', 'permittitur.']


# '-n' for '-ne'
# Plaut. Amph. 823

text = "Cenavin ego heri in navi in portu Persico?"

word_tokenizer.tokenize(text)

>>> ['Cenavi', 'ne', 'ego', 'heri', 'in', 'navi', 'in', 'portu', 'Persico', '?']


# Contractions with 'si'; also handles 'sultis',
# Plaut. Bacch. 837-38

text = "Dic sodes mihi, bellan videtur specie mulier?"

word_tokenizer.tokenize(text)

>>> ['Dic', 'si', 'audes', 'mihi', ',', 'bella', 'ne', 'videtur', 'specie', 'mulier', '?']


There are still improvements to be made, but this handles a high percentage of Latin tokenization tasks. If you have any ideas for more cases that need to be handled or if you see any errors, let me know.

Tokenizing Latin Text

code

One of the first tasks necessary in any text analysis project is tokenization—we take our text as a whole and convert it to a list of smaller units, or tokens. When dealing with Latin—or at least digitized versions of modern editions, like those found in the Perseus Digital Library, the Latin Library, etc.—paragraph- and sentence-level tokenization present little problem. Paragraphs are usually well marked and can be split by newlines (\n). Sentences in modern Latin editions use the same punctuation set as English (i.e., ‘.’, ‘?’, and ‘!’), so most sentence-level tokenization can be done more or less successfully with the built-in tools found in the Natural Language Toolkit (NLTK), e.g. nltk.sent_tokenize. But just as in English, Latin word tokenization offers small, specific issues that are not addressed by NLTK. The classic case in English is the negative contraction—how do we want to handle, for example, “didn’t”: [“didn’t”] or [“did”, “n’t”] or [“did”, “not”]?
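
(As an aside on the sentence-level step, here is a minimal sketch of NLTK’s built-in sentence tokenizer on a pair of Latin sentences; the punkt model may need to be downloaded first with nltk.download('punkt').)

import nltk

text = "Quo usque tandem abutere, Catilina, patientia nostra? Quam diu etiam furor iste tuus nos eludet?"
print(nltk.sent_tokenize(text))

>>> ['Quo usque tandem abutere, Catilina, patientia nostra?', 'Quam diu etiam furor iste tuus nos eludet?']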

There are four important cases in which Latin word tokenization demands special attention: the enclitics “-que”, “-ue/-ve”, and “-ne”, and the postpositive use of “-cum” with the personal pronouns (e.g. nobiscum for *cum nobis). The Classical Language Toolkit now takes these cases into consideration when doing Latin word tokenization. Below is a brief how-to on using the CLTK to tokenize your Latin texts by word. [The tutorial assumes the following requirements: Python3, NLTK3, CLTK.]

Tokenizing Latin Text with CLTK

We could simply use Python to split our texts into a list of tokens. (And sometimes this will be enough!) So…

text = "Arma virumque cano, Troiae qui primus ab oris"
text.split()

>>> ['Arma', 'virumque', 'cano,', 'Troiae', 'qui', 'primus', 'ab', 'oris']

A good start, but we’ve lost information, namely the comma between cano and Troiae. This might be ok, but let’s use NLTK’s tokenizer to hold on to the punctuation.

import nltk

nltk.word_tokenize(text)

>>> ['Arma', 'virumque', 'cano', ',', 'Troiae', 'qui', 'primus', 'ab', 'oris']

Using word_tokenize, we retain the punctuation. But otherwise we have more or less the same division of words.

But for someone working with Latin, that second token is an issue. Do we really want virumque? Or are we looking for virum and the enclitic –que? In many cases, it will be the latter. Let’s use CLTK to handle this. (***UPDATED 3.19.16***)

from cltk.tokenize.word import WordTokenizer

word_tokenizer = WordTokenizer('latin')
word_tokenizer.tokenize(text)

>>> ['Arma', 'que', 'virum', 'cano', ',', 'Troiae', 'qui', 'primus', 'ab', 'oris']

 

Using the CLTK WordTokenizer for Latin, we retain the punctuation and split the special case more usefully.