Current State of the CLTK Latin Lemmatizer


Lemmatization is a core task in natural language processing that allows us to return the dictionary headword—also known as the lemma—for each token in a given string. The Classical Language Toolkit includes lemmatizers for Latin and Greek, and for my Google Summer of Code project I have been rewriting these tools to improve their accuracy. In this post, I want to 1. review the current state of the lemmatizer, specifically the Latin lemmatizer, 2. test some sample sentences to see where the lemmatizer performs well and where it does not, and 3. suggest where I think improvements could be made.

[This post uses Python 3 and the current version of the CLTK.]

The current version of the lemmatizer uses a model that is kept in the CLTK_DATA directory. (More specifically, the model is a Python dictionary called LEMMATA that can be found in the ‘latin_lemmata_cltk.py’ file in the ‘latin_models_cltk’ corpus.) So before we can lemmatize Latin texts we need to import this model/corpus. The import commands are given below, but if you want more details on loading CLTK corpora, see this post.

from cltk.corpus.utils.importer import CorpusImporter
corpus_importer = CorpusImporter('latin')
corpus_importer.import_corpus('latin_models_cltk')

[Note that once this corpus is imported into CLTK_DATA, you will not need to repeat these steps to use the Latin lemmatizer in the future.]

To use the lemmatizer, we import it as follows:

from cltk.stem.lemma import LemmaReplacer

LemmaReplacer takes a language argument, so we can create an instance of the Latin lemmatizer with the following command:

lemmatizer = LemmaReplacer('latin')

This lemmatizer checks words against the LEMMATA dictionary that you installed above. That is, it checks the dictionary to see if a word is found as a key and, if so, returns the associated value. Here is the beginning of the lemma dictionary:

LEMMATA = { 
    '-nam' : 'nam', 
    '-namque' : 'nam', 
    '-sed' : 'sed', 
    'Aaron' : 'Aaron', 
    'Aaroni' : 'Aaron', 
    'Abante' : 'Abas', 
    'Abanteis' : 'Abanteus', 
    'Abantem' : 'Abas', 
    'Abantes' : 'Abas', etc...

If a word is not found in the dictionary, the lemmatizer returns the original word unchanged. Since Python dictionaries do not support duplicate keys, there is no resolution for ambiguous forms with the current lemmatizer. For example, the key-value pair {‘amor’ : ‘amo’} ensures that the word “amor” is always lemmatized as a verb and never as a noun, even though the nominative singular of the noun ‘amor’ appears much more frequently than the first-person singular present passive of ‘amo’.
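
To make that behavior concrete, here is a minimal sketch of the dictionary-lookup strategy (an illustration only, not the CLTK source; the sample entries are drawn from the examples in this post):

sample_lemmata = {
    'abutere': 'abutor',
    'patientia': 'patior',
    'amor': 'amo',  # ambiguous form, always resolved as the verb
}

def lookup_lemma(token, lemmata=sample_lemmata):
    # Return the mapped lemma if the token is a key; otherwise return the token unchanged.
    return lemmata.get(token, token)

print(lookup_lemma('amor'))  # 'amo'
print(lookup_lemma('arma'))  # 'arma' (not in the dictionary, returned as is)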

Let’s try some test sentences. Here is the first sentence from Cicero’s In Catilinam 1:

sentence = 'Quo usque tandem abutere, Catilina, patientia nostra?'
sentence = sentence.lower()

Note that I have also made the sentence lowercase as the current lemmatizer can raise errors due to case handling.

Now let’s pass this to the lemmatizer:

lemmas = lemmatizer.lemmatize(sentence)
print(lemmas)

>>> ['quis1', 'usque', 'tandem', 'abutor', ',', 'catilina', ',', 'patior', 'noster', '?']

The lemmatizer does a pretty good job. Punctuation included, its accuracy is 80% when compared with the lemmas found in Perseus Treebank Data. According to this dataset, the “quis1” should resolve to “quo”. (Though an argument could be made about whether this adverb is a form of ‘quis’ or its own word deriving from ‘quis’. The argument about whether ‘quousque’ should in fact be one word is also worth mentioning. Note that the number following ‘quis’ is a feature of the Morpheus parser to disambiguate identical forms.) “Patientia” is perhaps a clearer case. Though derived from the verb “patior”, the expected behavior of the lemmatizer is to resolve this word as the self-sufficient noun ‘patientia’. This is what we find in our comparative data from Perseus.
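
For reference, here is one way to arrive at that figure by comparing the lemmatizer output with a gold list. The gold lemmas below are my reconstruction of the Perseus readings for this sentence and should be treated as illustrative:

# Gold lemmas reconstructed for illustration (punctuation included, as above)
gold = ['quo', 'usque', 'tandem', 'abutor', ',', 'catilina', ',', 'patientia', 'noster', '?']
matches = sum(1 for predicted, expected in zip(lemmas, gold) if predicted == expected)
print(matches / len(gold))  # 0.8, i.e. 8 of 10 tokens agree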

Another example, a longer sentence from the opening of Sallust’s Bellum Catilinae:

sentence = 'Omnis homines, qui sese student praestare ceteris animalibus, summa ope niti decet, ne vitam silentio transeant veluti pecora, quae natura prona atque ventri oboedientia finxit.'
sentence = sentence.lower()

lemmas = lemmatizer.lemmatize(sentence)
print(lemmas)

>>> ['omne', 'homo', ',', 'qui1', 'sui', 'studeo', 'praesto2', 'ceter', 'animalis', ',', 'summum', 'ops1', 'nitor1', 'decet', ',', 'neo1', 'vita', 'silentium', 'transeo', 'velut', 'pecus1', ',', 'qui1', 'natura', 'pronus', 'atque', 'venter', 'oboedio', 'fingo.']

Again, pretty good results overall—82.76%. But the errors reveal the shortcomings of the lemmatizer. “Omnis” is an extremely common word in Latin and it simply appears incorrectly in the lemma model. Ditto ‘summus’. Ditto ‘ceter’, though worse because this is not even a valid Latin form. ‘Animalibus’ suffers from the kind of ambiguity noted above with ‘amor’—the noun ‘animal’ is much more common than the adjective ‘animalis’. The most significant error is lemmatizing ‘ne’—one of the most common words in the language—incorrectly as the extremely infrequent (if ever appearing) present active imperative of ‘neo’.

If this all sounds critical simply for the sake of being critical, that is not my intention. I have been working on new approaches to the problem of Latin lemmatization and have learned a great deal from the current CLTK lemmatizer. The work shown above is a solid start, and there is significant room for improvement. I see it as a baseline: every percentage point above 80% or 82.76% accuracy is a step in the right direction. Next week, I will publish some new blog posts with ideas for new approaches to Latin lemmatization based not on dictionary matching, but on training data, regex matching, and attention to word order and context. While dictionary matching is still the most efficient way to resolve some lemmas (e.g. unambiguous indeclinables like “ad”), it is through a combination of multiple approaches that we will be able to substantially increase the accuracy of this important tool in the CLTK.
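
As a preview of what combining approaches can look like, here is a minimal, self-contained sketch of the backoff idea: each lemmatizer handles the tokens it knows and hands everything else to the next lemmatizer in the chain. This is an illustration only and does not use the CLTK classes:

# Minimal sketch of a backoff chain (illustrative; not the CLTK implementation)
class DictLemmatizer:
    def __init__(self, model, backoff=None):
        self.model = model      # dict mapping token -> lemma
        self.backoff = backoff  # next lemmatizer to try, or None

    def lemmatize_token(self, token):
        if token in self.model:
            return self.model[token]
        if self.backoff is not None:
            return self.backoff.lemmatize_token(token)
        return token            # last resort: return the token unchanged

# Unambiguous indeclinables are resolved first; everything else falls through.
full_dict = DictLemmatizer({'patientia': 'patientia', 'abutere': 'abutor'})
chain = DictLemmatizer({'ad': 'ad', 'et': 'et'}, backoff=full_dict)
print([chain.lemmatize_token(t) for t in ['ad', 'abutere', 'ignotum']])
# ['ad', 'abutor', 'ignotum']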

 


13 thoughts on “Current State of the CLTK Latin Lemmatizer”

  1. Great post! This is super helpful and insightful. I found it via Googling to get the answer to why I couldn’t run LemmaReplacer out of the box, and found that I needed the import statement you give above. I subsequently read the rest of your post above with interest. I’ll look forward to tracking progress on all of this. I’m pretty green at lemmatization and computational text analysis more generally. I use Latin to study early modern history, especially legal texts and library catalogs. Has anyone considered building a crowdsourced, open-source custom dictionary and using the backoff method of lemmatization in CLTK to run that first, followed by the default dictionary?

    1. Thank you for the feedback and great to see people experimenting with CLTK. The way that the default backoff lemmatizer is currently set up, the default dictionary you mention is used as part of the backoff chain: the first lemmatizer uses a dictionary of high-frequency words; second, regex; third, training data; fourth, a customized (and experimental!) regex lemmatizer based on principal parts; fifth, the default dictionary; and last, a lemmatizer that simply returns the token as is, i.e. no change. The order here is based on some preliminary experiments that showed the highest accuracy, but further testing needs to be done. I also need to document the whole thing better, and comments like this are nice prods to do so. Thanks! PJB

    2. I should also mention that, while I did include BackoffLatinLemmatizer, a precomposed backoff chain that could be used with models already in the CLTK, the module is set up in such a way that you could chain together your own sub-lemmatizers using the models, training data, etc. that best work with your research questions, arrange them in any order you want, etc. Again, I need to document this better—if you find yourself wanting to experiment with custom backoff setups, let me know.

      1. Hi again. Thanks for your two replies. I wonder if I can get your help with immediate implementation actually. I read what you wrote, but the documentation at http://docs.cltk.org/en/latest/latin.html#lemmatization-backoff-method makes it seem like I should be able to use TrainLemmatizer and then follow with the normal lemmatizer (LemmaReplacer()). Here is a segment of my code:

        # make some stuff to use
        text2 = "pandect juris civil"
        tokens = text2.split()
        customDictionary = {'pandect': 'pandectae'}  # just a quick one-item dictionary
        lemmatizer = LemmaReplacer('latin')
        lemmatizer2 = TrainLemmatizer(customDictionary, backoff=lemmatizer)

        # do some stuff
        lemmatizedText = lemmatizer2.lemmatize(tokens)
        print(lemmatizedText)

        …But this gives me an error about backoff taggers. Do you think you can help me figure out what is wrong here? I can send you a full error message if that would help. Thanks!

      2. Thanks for the example—the two lemmatization modules in the CLTK right now are not compatible, which means that you cannot use LemmaReplacer as a backoff. That said, there is a workaround for this case.

        LemmaReplacer is functionally equivalent to TrainLemmatizer using the old lemma set, which is available here: https://github.com/cltk/latin_models_cltk/tree/master/lemmata/backoff

        If you have already installed the latin_models_cltk corpus to the usual location, then this code should work:

        import os
        from cltk.utils.file_operations import open_pickle

        # Build the path to the pickled lemma model inside the installed corpus
        rel_path = os.path.join('~/cltk_data/latin/model/latin_models_cltk/lemmata/backoff')
        path = os.path.expanduser(rel_path)
        file = 'latin_lemmata_cltk.pickle'
        old_model_path = os.path.join(path, file)

        # Load the old lemma dictionary
        LATIN_OLD_MODEL = open_pickle(old_model_path)

        # Chain your custom dictionary in front of the old model
        lemmatizer = TrainLemmatizer(model=LATIN_OLD_MODEL, backoff=None)
        lemmatizer2 = TrainLemmatizer(customDictionary, backoff=lemmatizer)

        Hope that helps. Let me know. PJB

  2. Hi Patrick,
    So in the intervening 3 days, I had written my own jury-rigged code that runs all tokens through a custom dictionary which is stored externally in a CSV file. The value of this, as I suppose, is that I want to be able to frequently edit the custom dictionary using a simple interface. I’ll upload what I have done to my GitHub account at https://github.com/ColinWilder. Give me an hour or so after completing this post to make the new GitHub repo and you can check it out if you like. After consulting the custom dictionary, I pass just those tokens which didn’t get a hit in the custom dictionary through the tokenizer and then LemmaReplacer. All of that basically works. BUT, I need to be able to identify the tokens that fail to find a match in either the custom dictionary or LemmaReplacer, since I want to feed these “fails” back to the custom dictionary for subsequent manual lemmatization (by me) – call it manual machine learning if you like. So I want the code to spit back a default fail lemma (e.g. ‘UNK’) when it fails to find a real lemma. PROBLEM: When I write this code, I get an error:
    default = DefaultLemmatizer('UNK')
    lemmatizer = LemmaReplacer('latin', backoff=default)
    text_lemmatized = lemmatizer.lemmatize(text_tokenized)
    print(text_lemmatized)
    Are the default lemmatizer and LemmaReplacer incompatible too?
    Thanks,
    Colin

  3. Right—LemmaReplacer is an older Latin lemmatizer and is incompatible with the backoff lemmatizers. But as noted above, you do not need to use LemmaReplacer if you create an instance of TrainLemmatizer based on the LATIN_OLD_MODEL. I’ll take a look at your repo—thanks for sharing.

  4. Hi again. This is what I’ve put together:
    sampleText = "Quo usque tandem abutere, Catilina, patientia nostra?"
    default = DefaultLemmatizer('UNK')
    lemmatizer = TrainLemmatizer(model=LATIN_OLD_MODEL, backoff=None)
    lemmatizer2 = TrainLemmatizer(lemmatizer, backoff=default)
    finishedText = lemmatizer2.lemmatize(sampleText)
    …but I get "AttributeError: 'TrainLemmatizer' object has no attribute 'keys'". Do you have any suggestions?

    1. Two things here—
      1. The lemmatizer works on a list of tokens, not a string. So, you should at least split sampleText (or, better, run a CLTK Latin word tokenizer on the string). You should also consider preprocessing the string to normalize case, remove punctuation, etc. (Though note that the word tokenizer will tokenize most punctuation for you.)
      2. TrainLemmatizer needs a model (as a Python dict) to be defined as the first parameter. You have put a lemmatizer object as the first parameter (which explains the error—dictionaries have keys, lemmatizer objects do not). You should update that line to something like

      lemmatizer2 = TrainLemmatizer(model={'patientia': 'patientia'}, backoff=lemmatizer)

      and compare the results of finishedText = lemmatizer.lemmatize(tokens) and finishedText = lemmatizer2.lemmatize(tokens).
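
      Putting the two points together, the whole chain might look something like this (just a sketch, reusing the LATIN_OLD_MODEL loaded in my earlier reply):

      tokens = sampleText.lower().split()  # or run the CLTK Latin word tokenizer instead
      default = DefaultLemmatizer('UNK')
      lemmatizer = TrainLemmatizer(model=LATIN_OLD_MODEL, backoff=default)
      lemmatizer2 = TrainLemmatizer(model={'patientia': 'patientia'}, backoff=lemmatizer)
      finishedText = lemmatizer2.lemmatize(tokens)
      print(finishedText)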

      Let me know if that helps, PJB
