I spent this summer working on a new approach to Latin lemmatization. Following and building on the logic of the NLTK Sequential Backoff Tagger (a POS tagger that tries to maximize the accuracy of part-of-speech tagging by combining multiple taggers and making several passes over the input), the Backoff Lemmatizer takes a token and passes it to one of a number of different lemmatizers. That lemmatizer either returns a match or passes the token on, or backs off, to another lemmatizer. When no more lemmatizers are available, it returns None. This setup allows users to customize the order of the lemmatizer sequence to best fit their processing task.
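To make the backoff logic concrete, here is a minimal sketch in plain Python. It is not the CLTK implementation: the class names (BackoffSketch, DictSketch) and the two toy dictionaries are hypothetical and exist only to show how a token moves down the chain until something matches or the chain runs out.

```python
from typing import Dict, List, Optional, Tuple


class BackoffSketch:
    """Hypothetical base class: try choose_lemma(), else defer to the backoff."""

    def __init__(self, backoff: Optional["BackoffSketch"] = None) -> None:
        self.backoff = backoff

    def choose_lemma(self, token: str) -> Optional[str]:
        raise NotImplementedError

    def lemmatize_one(self, token: str) -> Optional[str]:
        lemma = self.choose_lemma(token)
        # No match here: back off to the next lemmatizer, if there is one.
        if lemma is None and self.backoff is not None:
            return self.backoff.lemmatize_one(token)
        # Either a match, or None because the chain is exhausted.
        return lemma

    def lemmatize(self, tokens: List[str]) -> List[Tuple[str, Optional[str]]]:
        return [(token, self.lemmatize_one(token)) for token in tokens]


class DictSketch(BackoffSketch):
    """Hypothetical lemmatizer that looks tokens up in a dictionary."""

    def __init__(self, model: Dict[str, str],
                 backoff: Optional[BackoffSketch] = None) -> None:
        super().__init__(backoff)
        self.model = model

    def choose_lemma(self, token: str) -> Optional[str]:
        return self.model.get(token)


# Toy dictionaries, invented for illustration only.
second = DictSketch({"abutere": "abutor"})
first = DictSketch({"nostra": "noster"}, backoff=second)

result = first.lemmatize(["nostra", "abutere", "Catilina"])
print(result)
# [('nostra', 'noster'), ('abutere', 'abutor'), ('Catilina', None)]
```

Swapping the order of the two dictionaries, or appending another lemmatizer to the end, changes which one gets the first chance at each token. That is the customization the backoff setup is meant to allow.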
In this series of blog posts, I will explain how to use the different lemmatizers available in the Backoff Latin Lemmatizer. Here I will introduce the two most basic lemmatizers: DefaultLemmatizer and IdentityLemmatizer. Both are simple and will produce results with poor accuracy. (In fact, the DefaultLemmatizer’s accuracy will pretty much always be 0%!) Yet both can be useful as the final backoff lemmatizer in the sequence.
The DefaultLemmatizer returns the same “lemma” for all tokens. You can either specify what you want the lemmatizer to return or, if you leave the parameter blank, it will return None. Note that all of the lemmatizers take as their input a list of tokens.
> from cltk.lemmatize.latin.backoff import DefaultLemmatizer
> from cltk.tokenize.word import WordTokenizer
> lemmatizer = DefaultLemmatizer()
> tokenizer = WordTokenizer('latin')
> sent = "Quo usque tandem abutere, Catilina, patientia nostra?"
> # Tokenize the sentence
> tokens = tokenizer.tokenize(sent)
> lemmatizer.lemmatize(tokens)
[('Quo', None), ('usque', None), ('tandem', None), ('abutere', None), (',', None), ('Catilina', None), (',', None), ('patientia', None), ('nostra', None), ('?', None)]
As mentioned above, you can specify your own “lemma” instead of None:
> lemmatizer = DefaultLemmatizer('UNK')
> lemmatizer.lemmatize(tokens)
[('Quo', 'UNK'), ('usque', 'UNK'), ('tandem', 'UNK'), ('abutere', 'UNK'), (',', 'UNK'), ('Catilina', 'UNK'), (',', 'UNK'), ('patientia', 'UNK'), ('nostra', 'UNK'), ('?', 'UNK')]
This is all somewhat unimpressive. But the DefaultLemmatizer is in fact quite useful. When placed as the last lemmatizer in a backoff chain, it allows you to identify easily (and with your own designation) which tokens did not return a match in any of the preceding lemmatizers. Note again that the accuracy of this lemmatizer is basically always 0%.
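As a rough illustration of that use, the sketch below stands in for a chain whose last step hands back a fixed marker. It does not call CLTK. The toy_model dictionary and the lemmatize_with_default helper are hypothetical, and the point is only the final filtering step, which collects every token that fell through to the default.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the earlier lemmatizers in a chain.
toy_model: Dict[str, str] = {"patientia": "patientia", "nostra": "noster"}

# The designation the final, default step returns for unmatched tokens.
UNMATCHED = "UNK"


def lemmatize_with_default(tokens: List[str]) -> List[Tuple[str, str]]:
    """Look each token up; fall back to the fixed marker when nothing matches."""
    return [(token, toy_model.get(token, UNMATCHED)) for token in tokens]


tokens = ["Catilina", ",", "patientia", "nostra", "?"]
pairs = lemmatize_with_default(tokens)

# Collect the tokens that no earlier step could lemmatize.
unmatched = [token for token, lemma in pairs if lemma == UNMATCHED]
print(unmatched)
# ['Catilina', ',', '?']
```

A list like this is a quick way to see what the rest of the chain is missing, for example proper names or punctuation.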
The IdentityLemmatizer has a similarly straightforward logic, returning the input token as the output lemma:
> from cltk.lemmatize.latin.backoff import IdentityLemmatizer
> from cltk.tokenize.word import WordTokenizer
> lemmatizer = IdentityLemmatizer()
> tokenizer = WordTokenizer('latin')
> sent = "Quo usque tandem abutere, Catilina, patientia nostra?"
> # Tokenize the sentence
> tokens = tokenizer.tokenize(sent)
> lemmatizer.lemmatize(tokens)
[('Quo', 'Quo'), ('usque', 'usque'), ('tandem', 'tandem'), ('abutere', 'abutere'), (',', ','), ('Catilina', 'Catilina'), (',', ','), ('patientia', 'patientia'), ('nostra', 'nostra'), ('?', '?')]
Like the DefaultLemmatizer, the IdentityLemmatizer is useful as the final lemmatizer in a backoff chain. The difference here is that it is a default meant to boost accuracy: there are likely to be some tokens in any given input that find no match but are in fact already correct lemmas. In the example above, by simply using the IdentityLemmatizer on this list of tokens, we get an accuracy (including punctuation) of 80% (i.e. 8 out of 10). Not a bad start (and not all sentences will be so successful!), but one can imagine a case in which, say, “Catilina” is not found in the training data and it would be better to take a chance on returning a match than not. Of course, if this is not the case, the DefaultLemmatizer is probably the better final backoff option.
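For readers who want to see how such a score could be computed, here is a small sketch. The gold lemmas are my own assumption for illustration and are not taken from any evaluation data, and the comparison ignores case, which is also an assumption. The accuracy helper is hypothetical, not a CLTK function.

```python
from typing import List

tokens = ["Quo", "usque", "tandem", "abutere", ",",
          "Catilina", ",", "patientia", "nostra", "?"]

# ASSUMED gold lemmas, chosen for illustration only.
gold = ["quo", "usque", "tandem", "abutor", ",",
        "Catilina", ",", "patientia", "noster", "?"]


def accuracy(predicted: List[str], expected: List[str]) -> float:
    """Share of positions where the lemmas agree, ignoring case."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    hits = sum(p.lower() == e.lower() for p, e in zip(predicted, expected))
    return hits / len(expected)


# Identity lemmatization: each token is returned as its own lemma.
identity_lemmas = list(tokens)

score = accuracy(identity_lemmas, gold)
print(score)
# 0.8
```

Under these assumptions only “abutere” and “nostra” miss, which happens to give the 8 out of 10 quoted above. A different gold standard, or a case-sensitive comparison, would change the number.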
In the next post, I will introduce a second approach to lemmatizing with the Backoff Latin Lemmatizer, namely working with a model, i.e. a dictionary of lemma matches.