Google Summer of Code 2016 started this week. That means that my work on improving the Latin (and Greek) lemmatizer in the Classical Language Toolkit is now underway. For this summer project, I proposed to rewrite the CLTK lemmatizer using a backoff strategy, that is, running a series of different lemmatizers in sequence to increase accuracy. Backoff tagging is a common technique in part-of-speech tagging in NLP, but it should also help to resolve ambiguities, predict unknown words, and handle similar issues that can trip up a lemmatizer. The current CLTK lemmatizer uses dictionary matching, but it lacks a systematic way to differentiate ambiguous forms. (Is that forma the nominative singular noun [> forma, –ae] or forma the present imperative active verb [> formo (1)]?) The specifics of my backoff strategy will be discussed here as the project develops, but for now I’ll say that it is a combination of training on context, regex matching, and, yes, dictionary matching for high-frequency, indeclinable, and unambiguous words.

As I mention in my GSoC proposal, a lemmatizer with high accuracy is particularly important for NLP in highly inflected languages because: 1. words often have a dozen or more possible forms (and, unlike go in English, this is the norm rather than a trait only of irregularly formed words), and 2. small corpus size in general often demands that counts for a given feature, such as words, be based on the broadest measure possible.

So, for example, if you wanted to study the idea of shapes in Ovid’s Metamorphoses, you would want to look at the word forma. This “word” (token, really) appears 39 times in the poem. But what you really want to look at is not just forma, but also formae (21), formam (18), formarum (0; yes, it’s zero, but you would still want to know), formis (1), and formas (6). And you wouldn’t want to miss tokens like formasque (Met. 2.78) or formaene (Met. 10.563); there are 9 such instances.
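To make the backoff idea concrete, here is a minimal sketch in plain Python. The class names, the tiny sample dictionary, and the regex patterns are all invented for illustration; they are not the CLTK’s actual API. The point is only the chaining: each lemmatizer tries the token, and on failure defers to the next lemmatizer in the chain.

```python
import re

class SequentialBackoff:
    """Try this lemmatizer first; on failure, defer to the backoff."""
    def __init__(self, backoff=None):
        self.backoff = backoff

    def lemmatize(self, token):
        lemma = self.choose(token)
        if lemma is None and self.backoff is not None:
            return self.backoff.lemmatize(token)
        return lemma

class DictLemmatizer(SequentialBackoff):
    """Exact-match lookup for high-frequency, unambiguous forms."""
    def __init__(self, lemmas, backoff=None):
        super().__init__(backoff)
        self.lemmas = lemmas

    def choose(self, token):
        return self.lemmas.get(token)

class RegexLemmatizer(SequentialBackoff):
    """Guess a lemma from regular endings, e.g. -as -> -a."""
    def __init__(self, patterns, backoff=None):
        super().__init__(backoff)
        self.patterns = patterns

    def choose(self, token):
        for pattern, replacement in self.patterns:
            if re.search(pattern, token):
                return re.sub(pattern, replacement, token)
        return None

# Chain the lemmatizers: dictionary first, regex rules as fallback.
lemmatizer = DictLemmatizer(
    {"est": "sum", "et": "et"},
    backoff=RegexLemmatizer([(r"(arum|is|as|am|ae)$", "a")]),
)
print(lemmatizer.lemmatize("est"))     # sum (dictionary hit)
print(lemmatizer.lemmatize("formas"))  # forma (regex fallback)
```

Ordering matters in a chain like this: the high-precision dictionary fires first, and the riskier pattern-based guesser only sees tokens the dictionary could not resolve.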
If you were going to, say, topic model the Metamorphoses, you would be much better off having the 94 examples of “forma” than the smaller numbers of its different forms.
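The arithmetic above can be sketched in a few lines of Python. The counts are the ones quoted in this post; the aggregation itself is purely illustrative, showing how lemmatized counts collapse every inflected form onto a single lemma.

```python
from collections import Counter

# Token counts for the forms of forma in the Metamorphoses,
# as quoted above (illustrative, not recomputed from the text).
form_counts = {"forma": 39, "formae": 21, "formam": 18,
               "formarum": 0, "formis": 1, "formas": 6}
enclitic_instances = 9  # e.g. formasque, formaene

# A lemmatized count folds every form into one entry for the lemma.
lemma_counts = Counter()
for form, n in form_counts.items():
    lemma_counts["forma"] += n
lemma_counts["forma"] += enclitic_instances

print(lemma_counts["forma"])  # 94
```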
“Ancient languages do not have complete BLARKs,” writes Barbara McGillivray [2014: 19], referring to Krauwer’s idea [2003: 4] of the Basic LAnguage Resource Kit. A BLARK consists of the fundamental resources necessary for text analysis: corpora, lexicons, tokenizers, POS-taggers, and so on. A lemmatizer is another such basic tool. More and more, the CLTK is solving the BLARK problem for Latin, Greek, and other historical languages that have been described as “less-resourced” [see Piotrowski 2012: 85]. For these languages to participate in advances in text analysis and to take full advantage of digital resources for language processing, basic tools like the lemmatizer need to be available and need to work at accuracy rates high enough to meet the very high bar demanded in philological research. That is the goal for the summer.
Bird, S., E. Klein, and E. Loper. 2009. Natural Language Processing with Python. Cambridge, Ma.: O’Reilly. (Esp. Ch. 5 “Categorizing and Tagging Words”).
Krauwer, S. 2003. “The Basic Language Resource Kit (BLARK) as the First Milestone for the Language Resources Roadmap.” Proceedings of the 2003 International Workshop on Speech and Computer (SPECOM 2003): 8–15.
McGillivray, B. 2014. Methods in Latin Computational Linguistics. Leiden: Brill.
Piotrowski, M. 2012. “Natural Language Processing for Historical Texts.” Synthesis Lectures on Human Language Technologies 5: 1-157.