10,000 Most Frequent ‘Words’ in the Latin Library

article

A few months ago, I posted a list of the 10,000 most frequent words in the PHI Classical Latin Texts. While I did include a notebook with the code for that experiment, I could not include the data because the PHI texts are not available for redistribution. So here is an updated post, based on a freely available corpus of Latin literature—and one that I have been using for my recent Disiecta Membra posts like this one and this one and this one—the Latin Library. (The timing is good, as the Latin Library has received some positive attention recently.) The code for this post is available as a Jupyter Notebook here.

The results, based on the 13,563,476 tokens in the Latin Library:

Top 10 tokens in Latin Library:

       TOKEN       COUNT       TYPE-TOK %  RUNNING %   
    1. et          446474      3.29%       3.29%       
    2. in          274387      2.02%       5.31%       
    3. est         174413      1.29%       6.60%       
    4. non         166083      1.22%       7.83%       
    5. -que        135281      1.00%       8.82%       
    6. ad          133596      0.98%       9.81%       
    7. ut          119504      0.88%       10.69%      
    8. cum         109996      0.81%       11.50%      
    9. quod        104315      0.77%       12.27%      
   10. si          95511       0.70%       12.97%
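
A table like the one above can be produced with a few lines of standard-library Python. This is a minimal sketch rather than the code from the notebook: the function name `frequency_table` and the toy token list are assumptions made for illustration. The running percentage is computed from cumulative counts rather than by adding up rounded shares, which is one plausible reason the running column can differ by 0.01 from the sum of the rounded column.

```python
from collections import Counter


def frequency_table(tokens, n=10):
    """Return rows of (rank, token, count, share %, running %)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    rows = []
    running = 0
    for rank, (token, count) in enumerate(counts.most_common(n), start=1):
        running += count  # cumulative count, so rounding errors do not pile up
        rows.append((rank, token, count,
                     round(100 * count / total, 2),
                     round(100 * running / total, 2)))
    return rows


# Toy input standing in for the tokenized corpus (not real corpus data).
toy_tokens = "et in et est non et in ad".split()

for rank, token, count, share, running in frequency_table(toy_tokens, n=3):
    print("{:>5}. {:<10} {:<10} {:<10} {}".format(rank, token, count, share, running))
```

On a real corpus, `toy_tokens` would be replaced by the full list of tokens returned by the tokenizer.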

How does this compare with the previous test against the PHI run? Here are the frequency rankings from the PHI run, 1 through 10: et, in, -que, ne, est, non, ut, cum, si, and ad. So, basically the same. The loss of ne from the top 10 is certainly a result of improvements to the CLTK tokenizer, specifically improvements in tokenizing the enclitic -ne. Ne is now #41 with 26,825 appearances and -ne #30 with 36,644 appearances. The combined count (63,469) would still not crack the Top 10, which suggests that there may have been a lot of words wrongly tokenized, e.g. ‘homine’ as [‘homi’, ‘-ne’]. (I suspect that this still happens, but am confident that the frequency of this problem is declining. If you spot any “bad” tokenization involving words ending in ‘-ne’ or ‘-n’, please submit an issue.) With ne out of the Top 10, we see that quod has joined the list. It should come as little surprise that quod was #11 in the PHI frequency list.
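
To make the ‘homine’ problem concrete, here is a rough sketch of enclitic splitting with an exceptions list. It is not the CLTK tokenizer: the function `split_ne` and the tiny `NE_EXCEPTIONS` set are hypothetical, and a usable exceptions list would be far longer.

```python
def split_ne(token, exceptions):
    """Split a trailing '-ne' enclitic unless the token is a known exception."""
    lowered = token.lower()
    if lowered.endswith("ne") and len(lowered) > 2 and lowered not in exceptions:
        return [token[:-2], "-ne"]
    return [token]


# Hypothetical exceptions: words that merely end in 'ne' (far from complete).
NE_EXCEPTIONS = {"homine", "sine", "bene"}

print(split_ne("estne", NE_EXCEPTIONS))   # ['est', '-ne']
print(split_ne("homine", NE_EXCEPTIONS))  # ['homine']
print(split_ne("homine", set()))          # ['homi', '-ne'], the bad split
```

Every word missing from the exceptions list inflates the count of -ne, which is consistent with the drop in rank described above once the tokenizer improved.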

Since the PHI post, significant advances have been made with the CLTK Latin lemmatizer. Recent tests show accuracies consistently over 90%. So, let’s put out a provisional list of top lemmas as well—

Top 10 lemmas in Latin Library:

       LEMMA       COUNT       TYPE-LEM %  RUNNING %   
    1. et          446474      3.29%       3.29%       
    2. sum         437415      3.22%       6.52%       
    3. qui         365280      2.69%       9.21%       
    4. in          274387      2.02%       11.23%      
    5. is          213677      1.58%       12.81%      
    6. non         166083      1.22%       14.03%      
    7. -que        144790      1.07%       15.10%      
    8. hic         140421      1.04%       16.14%      
    9. ad          133613      0.99%       17.12%      
   10. ut          119506      0.88%       18.00%

No real surprises here. Six of the Top 10 lemmas are indeclinable, whether conjunctions, prepositions, adverbs, or enclitics, and so carry over from the top tokens list: et, in, non, -que, ad, and ut. Forms of sum and qui can be found in the top tokens list as well, est and quod respectively. Hic rises to the top based on its large number of relatively high ranking forms, though it should be noted that its top ranking form is #23 (hoc), followed by #46 (haec), #71 (his), #91 (hic), and #172 (hanc) among others. Is also joins the top 10, though I have my concerns about this because of the relatively high frequency of overlapping forms with the verb eo (i.e. eo, is, eam, etc.). This result should be reviewed and tested further.

While I’m thinking about it, other concerns I have would be the counts for hic, i.e. with respect to the demonstrative and the adverb, as well as the slight fluctuations in the counts of indeclinables, e.g. ut (119,504 tokens vs. 119,506 lemmas), or the somewhat harder to explain jump in -que (135,281 tokens vs. 144,790 lemmas). So, we’ll consider this a work in progress. But one that is, at least for the Top 10, more or less in line with other studies (e.g. Diederich, which, with the exception of cum, has the same words, if in a different order).
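
The way a lemma like hic climbs the list by pooling many mid-ranked forms can be sketched as follows. The counts and the token-to-lemma map here are made up for illustration; they are not figures from the Latin Library run.

```python
from collections import Counter

# Made-up token counts and a hand-written token -> lemma map (illustration only).
token_counts = Counter({"hoc": 50, "haec": 30, "his": 20, "hic": 15,
                        "est": 90, "sunt": 40})
token_to_lemma = {"hoc": "hic", "haec": "hic", "his": "hic", "hic": "hic",
                  "est": "sum", "sunt": "sum"}

lemma_counts = Counter()
for token, count in token_counts.items():
    # Tokens without an entry fall back to themselves as the lemma.
    lemma_counts[token_to_lemma.get(token, token)] += count

print(lemma_counts.most_common())  # [('sum', 130), ('hic', 115)]
```

No single form of hic outranks est in this toy data, yet the pooled lemma count comes close to that of sum. The same pooling is why overlapping forms (is vs. eo) can distort a lemma's total.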

 

Parlor Game, Revisited

article

In August, the Dickinson College Commentaries blog featured a post on common Latin words that are not found in Virgil’s Aeneid. Author Chris Francese refers to the post as a “diverting Latin parlor game” and in that spirit of diversion I’d like to play along and push the game further.

The setup is as follows, to quote the post:

Take a very common Latin word (in the DCC Latin Core Vocabulary) that does not occur in Vergil’s Aeneid, and explain its absence. Why would Vergil avoid certain lemmata (dictionary head words) that are frequent in preserved Latin?

So, Virgil avoids words such as aegre, arbitror, auctoritas, beneficium, etc. and it is up to us to figure out why. An interesting question and by asking the question, Francese enters a fascinating conversation on Latin poetic diction which includes Bertil Axelson, Gordon Williams, Patricia Watson, and many others (myself included, I suppose). But my goal in this post is not so much to answer the “why?” posed in the quote above, but more to investigate the methods through which we can start the conversation.

The line in Francese’s post that got me thinking was this:

The Vergilian data comes from LASLA  (no automatic lemmatizers were used, all human inspection), as analyzed by Seth Levin.

It just so happened that when this post came out, I was completing a summer-long project building an “automatic lemmatizer” for Latin for the Classical Language Toolkit. So my first reaction to the post was to see how close I could get to the DCC Blog’s list using the new lemmatizer. The answer is pretty close.

[I have published a Jupyter Notebook with the code for these results here: https://github.com/diyclassics/dcc-lemma/blob/master/Parlor%20Game%2C%20Revisited.ipynb.]

There are 75 lemmas from the DCC Latin Core Vocabulary that do not appear in the Aeneid (DCC Missing). Using the Backoff Latin lemmatizer on the Latin Library text of the Aeneid, I returned a list of 119 lemmas (CLTK Missing). There are somewhere around 6100 unique lemmas in the Aeneid, meaning that our results differ by only about 0.7% (44 lemmas out of roughly 6100).

The results from CLTK Missing show 69 out of 75 lemmas (92%) from the DCC list. The six lemmas that it missed are:

[‘eo’, ‘mundus’, ‘plerusque’, ‘reliquus’, ‘reuerto’, ‘solum’]

Some of these can be easily explained. Reliqui (from relinquo) was incorrectly lemmatized as reliquus, which is an error. Mundus was lemmatized correctly and so appears in the list of Aeneid lemmas, just not the one on DCC Missing, i.e. mundus (from mundus, a, um = ‘clean’). A related problem affects both eo and solum: homonyms of both these words appear in the list of Aeneid lemmas. (See below on the issue of lemmatizing adverbs/adjectives, adjectives/nouns, etc.) Plerusque comes from a parsing error in my preprocessing script, where I split the DCC list on whitespace. Since this word is listed as plērus- plēra- plērumque, plerus- made it into the reference list, but not plerusque. (I could have fixed this, but I thought it was better in this informal setting to make clear the full range of small errors that can creep into a text processing “parlor game” like this.) Lastly, is reverto wrong? The LASLA lemma is revertor which, true enough, does not appear on the DCC Core Vocabulary, but this is probably too fine a distinction. Lewis & Short, e.g., lists reverto and revertor as the headword.
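
The plerusque slip is easy to reproduce. The sketch below is not the preprocessing script from the notebook; the helper `strip_macrons` and the `OVERRIDES` table are hypothetical names, and the override is only one possible way to patch the problem.

```python
import unicodedata


def strip_macrons(text):
    """Remove combining marks (e.g. macrons) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


entry = "plērus- plēra- plērumque"  # headword as quoted in the paragraph above

# Splitting on whitespace and keeping the first item loses the real headword.
naive_headword = strip_macrons(entry).split()[0]
print(naive_headword)  # plerus-

# One hypothetical patch: a small table of hand-corrected headwords.
OVERRIDES = {"plerus-": "plerusque"}
fixed_headword = OVERRIDES.get(naive_headword, naive_headword)
print(fixed_headword)  # plerusque
```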

This leaves 50 lemmas returned in CLTK Missing that are—compared to DCC Missing—false positives. The list is as follows:

[‘aduersus’, ‘alienus’, ‘aliquando’, ‘aliquis’, ‘aliter’, ‘alius’, ‘animal’, ‘antequam’, ‘barbarus’, ‘breuiter’, ‘certe’, ‘citus’, ‘ciuitas’, ‘coepi’, ‘consilium’, ‘diuersus’, ‘exsilium’, ‘factum’, ‘feliciter’, ‘fore’, ‘forte’, ‘illuc’, ‘ingenium’, ‘item’, ‘longe’, ‘male’, ‘mare’, ‘maritus’, ‘pauci’, ‘paulo’, ‘plerus’, ‘praeceptum’, ‘primum’, ‘prius’, ‘proelium’, ‘qua’, ‘quantum’, ‘quomodo’, ‘singuli’, ‘subito’, ‘tantum’, ‘tutus’, ‘ualidus’, ‘uarius’, ‘uere’, ‘uero’, ‘uictoria’, ‘ultimus’, ‘uolucer’, ‘uos’]
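
The comparison between the two lists comes down to a few set operations. This is a sketch under stated assumptions: `compare_missing` is a name invented here, and the two short lists are small stand-ins built from words discussed in this post, not the full DCC Missing and CLTK Missing data.

```python
def compare_missing(reference_missing, candidate_missing):
    """Compare a reference 'missing' list with an automatically produced one."""
    reference = set(reference_missing)
    candidate = set(candidate_missing)
    shared = reference & candidate
    return {
        "shared": sorted(shared),
        "missed": sorted(reference - candidate),
        "false_positives": sorted(candidate - reference),
        "recall": len(shared) / len(reference),
    }


# Small stand-in lists (not the full data).
dcc_missing = ["aegre", "arbitror", "auctoritas", "beneficium", "eo", "mundus"]
cltk_missing = ["aegre", "arbitror", "auctoritas", "beneficium", "barbarus", "uos"]

report = compare_missing(dcc_missing, cltk_missing)
print(report["missed"])           # ['eo', 'mundus']
print(report["false_positives"])  # ['barbarus', 'uos']

# Arithmetic check on the figures reported in the post.
print(round(69 / 75 * 100), 119 - 69)  # 92 50
```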

To be perfectly honest, you learn more about the lemmatizer than the Aeneid from this list and this is actually very useful data for uncovering places where the CLTK tools can be improved.

So, for example, there are a number of adverbs on this list (breuiter, certe, tantum, etc.). These are cases where the CLTK lemmatizer returns the associated adjective (so breuis, certus, tantus). This is a matter of definition. That is, the CLTK result is more different than wrong. We can debate whether some adverbs deserve to be given their own lemma, but it is still just that: a debate. (Lewis & Short, e.g., has certe listed under certus, but a separate entry for breuiter.)

The DCC Blog post makes a similar point about nouns and adjectives:

At times there might be some lemmatization issues (for example barbarus came up in the initial list of excluded core words, since Vergil avoids the noun, though he uses the adjective twice. I deleted it from this version.

This explains why barbarus appears on CLTK Missing. Along the same line, factum has been lemmatized under facio. Again, not so much incorrect, but a matter of how we define our terms and set parameters for the lemmatizer. I have tried as much as possible to follow the practice of the Ancient Greek and Latin Dependency Treebank and the default Backoff lemmatizer uses the treebanks as the source of its default training data. This explains why uos appears in CLTK Missing—the AGLDT lemmatizes forms of uos as the second-person singular pronoun tu.

As I continue to test the lemmatizer, I will use these results to fine tune and improve the output, trying to explain each case and make decisions such as which adverbs need to be lemmatized as adverbs and so on. It would be great to hear comments, either on this post or in the CLTK Github issues, about where improvements need to be made.

There remains a final question. If the hand lemmatized data from LASLA produces more accurate results, why use the CLTK lemmatizer at all?

It is an expensive process (in time, money, and resources) to produce curated data. This data is available for Virgil, but may not be for another author. What if we wanted to play the same parlor game with Lucan? I don’t know whether lemmatized data is available for Lucan, but it was a trivial task for me to rerun this experiment (with minimal preprocessing changes) on the Bellum Ciuile. (I placed the list of DCC core words not appearing in Lucan at the bottom of this post.) And I could do it for any text in the Latin Library just as easily.

Automatic lemmatizers are not perfect, but they are often good and sometimes very good. More importantly, they are getting better and, in the case of the CLTK, they are being actively developed and developers like myself can work with researchers to make the tools as good as possible.

Lemmas from the DCC Latin Core Vocabulary not found in Lucan*
(* A first draft by an automatic lemmatizer)

accido
adhibeo
aduersus
aegre
alienus
aliquando
aliquis
aliter
alius
amicitia
antequam
arbitror
auctoritas
autem
beneficium
bos
breuiter
celebro
celeriter
centum
certe
ceterum
citus
ciuitas
coepi
cogito
comparo
compono
condicio
confiteor
consilium
consuetudo
conuiuium
deinde
desidero
dignitas
disciplina
diuersus
dormio
edico
egregius
epistula
existimo
exspecto
factum
familia
fere
filia
fore
forte
frumentum
gratia
hortor
illuc
imperator
impleo
impono
ingenium
initium
integer
interim
interrogo
intersum
ita
itaque
item
legatus
libido
longe
magnitudo
maiores
male
mare
maritus
memoria
mulier
multitudo
narro
nauis
necessitas
negotium
nemo
oportet
oratio
pauci
paulo
pecunia
pertineo
plerumque
plerus
poeta
postea
posterus
praeceptum
praesens
praesidium
praeterea
primum
princeps
principium
priuatus
prius
proelium
proficiscor
proprius
puella
qua
quantum
quattuor
quemadmodum
quomodo
ratio
sanctus
sapiens
sapientia
scientia
seruus
singuli
statim
studeo
subito
suscipio
tantum
tempestas
tutus
ualidus
uarius
uere
uero
uictoria
uinum
uitium
ultimus
uoluntas
uos
utrum

Backoff Latin Lemmatizer, pt. 1

tutorial

I spent this summer working on a new approach to Latin lemmatization. Following and building on the logic of the NLTK Sequential Backoff Tagger—a POS-tagger that tries to maximize the accuracy of part-of-speech tagging tasks by combining multiple taggers and making several passes over the input—the Backoff Lemmatizer takes a token and passes this to any one of a number of different lemmatizers. This lemmatizer either returns a match or passes the token, or backs off, to another lemmatizer. When no more lemmatizers are available, it returns None. This setup allows users to customize the order of the lemmatizer sequence to best fit their processing task.
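
The backoff idea can be sketched in plain Python without any CLTK or NLTK imports. The class names below (`ToyLemmatizer`, `ToyDictLemmatizer`, `ToyIdentityLemmatizer`) are invented for this illustration and are not the CLTK classes; the two-entry model is likewise made up.

```python
class ToyLemmatizer:
    """Base class: try to choose a lemma, otherwise defer to the backoff."""

    def __init__(self, backoff=None):
        self.backoff = backoff

    def choose_lemma(self, token):
        return None  # subclasses override this

    def lemmatize_token(self, token):
        lemma = self.choose_lemma(token)
        if lemma is None and self.backoff is not None:
            return self.backoff.lemmatize_token(token)  # back off
        return lemma

    def lemmatize(self, tokens):
        return [(token, self.lemmatize_token(token)) for token in tokens]


class ToyDictLemmatizer(ToyLemmatizer):
    """Look the token up in a dictionary model."""

    def __init__(self, model, backoff=None):
        super().__init__(backoff)
        self.model = model

    def choose_lemma(self, token):
        return self.model.get(token)


class ToyIdentityLemmatizer(ToyLemmatizer):
    """Return the token itself as the lemma."""

    def choose_lemma(self, token):
        return token


model = {"abutere": "abutor", "nostra": "noster"}  # made-up mini model
chain = ToyDictLemmatizer(model, backoff=ToyIdentityLemmatizer())
no_backoff = ToyDictLemmatizer(model)

print(chain.lemmatize(["tandem", "abutere", "nostra"]))
print(no_backoff.lemmatize(["tandem"]))  # [('tandem', None)]
```

Reordering the chain, or swapping the final lemmatizer, changes only how the objects are wired together, which is the customization point described above.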

In this series of blog posts, I will explain how to use the different lemmatizers available in the Backoff Latin Lemmatizer. Here I will introduce the two most basic lemmatizers: DefaultLemmatizer and IdentityLemmatizer. Both are simple and will produce results with poor accuracy. (In fact, the DefaultLemmatizer’s accuracy will pretty much always be 0%!) Yet both can be useful as the final backoff lemmatizer in the sequence.

The DefaultLemmatizer returns the same “lemma” for all tokens. You can either specify what you want the lemmatizer to return or, if you leave the parameter blank, it returns None. Note that all of the lemmatizers take as their input a list of tokens.

> from cltk.lemmatize.latin.backoff import DefaultLemmatizer
> from cltk.tokenize.word import WordTokenizer

> lemmatizer = DefaultLemmatizer()
> tokenizer = WordTokenizer('latin')

> sent = "Quo usque tandem abutere, Catilina, patientia nostra?"

> # Tokenize the sentence
> tokens = tokenizer.tokenize(sent)

> lemmatizer.lemmatize(tokens)
[('Quo', None), ('usque', None), ('tandem', None), ('abutere', None), (',', None), ('Catilina', None), (',', None), ('patientia', None), ('nostra', None), ('?', None)]

As mentioned above, you can specify your own “lemma” instead of  None.

> lemmatizer = DefaultLemmatizer('UNK')
> lemmatizer.lemmatize(tokens)
[('Quo', 'UNK'), ('usque', 'UNK'), ('tandem', 'UNK'), ('abutere', 'UNK'), (',', 'UNK'), ('Catilina', 'UNK'), (',', 'UNK'), ('patientia', 'UNK'), ('nostra', 'UNK'), ('?', 'UNK')]

This is all somewhat unimpressive. But the DefaultLemmatizer is in fact quite useful. When placed as the last lemmatizer in a backoff chain, it allows you to identify easily (and with your own designation) which tokens did not return a match in any of the preceding lemmatizers. Note again that the accuracy of this lemmatizer is basically always 0%.

The IdentityLemmatizer has a similarly straightforward logic, returning the input token as the output lemma:

> from cltk.lemmatize.latin.backoff import IdentityLemmatizer
> from cltk.tokenize.word import WordTokenizer

> lemmatizer = IdentityLemmatizer()
> tokenizer = WordTokenizer('latin')

> sent = "Quo usque tandem abutere, Catilina, patientia nostra?"

> # Tokenize the sentence
> tokens = tokenizer.tokenize(sent)

> lemmatizer.lemmatize(tokens)
[('Quo', 'Quo'), ('usque', 'usque'), ('tandem', 'tandem'), ('abutere', 'abutere'), (',', ','), ('Catilina', 'Catilina'), (',', ','), ('patientia', 'patientia'), ('nostra', 'nostra'), ('?', '?')]

Like the DefaultLemmatizer, the IdentityLemmatizer is useful as the final lemmatizer in a backoff chain. The difference here is that it is a default meant to boost accuracy: there are likely to be some tokens in any given input that find no match but are in fact already correct lemmas. In the example above, by simply using the IdentityLemmatizer on this list of tokens, we get an accuracy (including punctuation) of 80% (i.e. 8 out of 10). Not a bad start (and not all sentences will be so successful!) but one can imagine a case in which, say, “Catilina” is not found in training data and it would be better to take a chance on returning a match than not. Of course, if this is not the case, the DefaultLemmatizer is probably the better final backoff option.

In the next post, I will introduce a second approach to lemmatizing with the Backoff Latin Lemmatizer, namely working with a model, i.e. a dictionary of lemma matches.

Wrapping up Google Summer of Code

article

Today marks the final day of Google Summer of Code. I have submitted the code for the Latin/Greek Backoff Lemmatizer and the beta version should work its way into the Classical Language Toolkit soon enough. Calling it a lemmatizer is perhaps a little misleading: it is in fact a series of lemmatizers that can be run consecutively, with each pass designed to suggest lemmas that earlier passes missed. The lemmatizers fall into three main categories: 1. lemmas determined from context based on tagged training data, 2. lemmas determined by rules, in this case mostly regex matching on word endings, and 3. lemmas determined by dictionary lookup, that is using a similar process to the one that already exists in the CLTK. By putting these three types of lemmatizers together, I was consistently able to return > 90% accuracy on the development test sets. There will be several blog posts in the near future to document the features of each type of lemmatizer and report more thoroughly the test results. The main purpose of today’s post is simply to share the report I wrote to summarize my summer research project.

But before sharing the report, I wanted to comment briefly on what I see as the most exciting part of this lemmatizer project. I was happy to see accuracies consistently over 90% as I tested various iterations of the lemmatizer in recent weeks. That said, it is clear to me that the path to even higher accuracy and better performance is now wide open. By organizing the lemmatizer as a series of sub-lemmatizers that can be run in a backoff sequence, tweaks can be made to any part of the chain as well as in the order of the chain itself to produce higher quality results. With a lemmatizer based on dictionary lookups, there are not many options for optimization: find and fix key/value errors or make the dictionary larger. The problem with the first option is that it is finite—errors exist in the model but not enough to have that much of an effect on accuracy. Even more of a concern, the second option is infinite—as new texts are worked on (and hopefully, as new discoveries are made!) there will always be another token missed by the dictionary. Accordingly, a lemmatizer based on training data and rules—or better yet one based on training data, rules and lookups combined in a systematic and modular fashion like this  GSoC “Backoff Lemmatizer” project—is the preferred way forward.

Now the report. I wrote this over the weekend as a Gist to summarize my summer work for GSoC. The blog format makes it a bit easier to read, but you can find the original here.

Google Summer of Code 2016 Final Report

Here is a summary of the work I completed for the 2016 Google Summer of Code project “CLTK Latin/Greek Backoff Lemmatizer” for the Classical Language Toolkit (cltk.org). The code can be found at https://github.com/diyclassics/cltk/tree/lemmatize/cltk/lemmatize.

  • Wrote custom lemmatizers for Latin and Greek as subclasses of NLTK’s tag module, including:
    • Default lemmatization, i.e. same lemma returned for every token
    • Identity lemmatization, i.e. original token returned as lemma
    • Model lemmatization, i.e. lemma returned based on dictionary lookup
    • Context lemmatization, i.e. lemma returned based on proximal token/lemma tuples in training data
    • Context/POS lemmatization, i.e. same as above, but proximal tuples are inspected for POS information
    • Regex lemmatization, i.e. lemma returned through rules-based inspection of token endings
    • Principal parts lemmatization, i.e. same as above, but matched regexes are then subjected to dictionary lookup to determine lemma
  • Organized the custom lemmatizers into a backoff chain, increasing accuracy (compared to dictionary lookup alone) by as much as 28.9%. Final accuracy tests on the test corpus showed an average of 90.82%.
    • An example backoff chain is included in the backoff.py file under the class LazyLatinLemmatizer.
  • Constructed models for language-specific lookup tasks, including:
    • Dictionaries of high-frequency, unambiguous lemmas
    • Regex patterns for high-accuracy lemma prediction
    • Constructed models to be used as training data for context-based lemmatization
  • Wrote tests for basic subclasses. Code for tests can be found here.
  • Tangential work for CLTK inspired by daily work on lemmatizer
    • Continued improvements to the CLTK Latin tokenizer. Lemmatization is performed on tokens, and it is clear that accuracy is affected by the quality of the tokens passed as parameters to the lemmatizer.
    • Introduction of PlaintextCorpusReader-based corpus of Latin (using the Latin Library corpus) to encourage easier adoption of the CLTK. Initial blog posts on this feature are part of an ongoing series which will work through a Latin NLP task workflow and will soon treat lemmatization. These posts will document in detail features developed during this summer project.
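
As a rough illustration of the regex lemmatization item in the list above, the sketch below maps word endings to lemma endings. The two patterns and the function `regex_lemma` are hypothetical and deliberately simplistic; they are not the patterns shipped with the project.

```python
import re

# Hypothetical ending rules for illustration; real Latin needs far more care.
TOY_PATTERNS = [
    (re.compile(r"(\w+)ibus$"), r"\1is"),  # e.g. omnibus -> omnis
    (re.compile(r"(\w+)arum$"), r"\1a"),   # e.g. puellarum -> puella
]


def regex_lemma(token, patterns=TOY_PATTERNS):
    """Return a lemma from the first matching ending rule, else None."""
    for pattern, replacement in patterns:
        if pattern.search(token):
            return pattern.sub(replacement, token)
    return None  # no rule matched; a backoff lemmatizer would take over


print(regex_lemma("omnibus"))    # omnis
print(regex_lemma("puellarum"))  # puella
print(regex_lemma("et"))         # None
```

Returning None when nothing matches is what lets a rule-based step sit inside a backoff chain: the next lemmatizer in the sequence gets its turn.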

Next steps

  • Test various combinations of backoff chains like the one used in LazyLatinLemmatizer to determine which returns data with the highest accuracy.
    • The most significant increases in accuracy appear to come from the ContextLemmatizer, which is based on training data. Two comments here:
    • Training data for the GSoC summer project was derived from the Ancient Greek Dependency Treebank (v. 2.1). The Latin data consists of around 5,000 sentences. Experiments throughout the summer (and research by others) suggest that more training data will lead to improved results. This data will be “expensive” to produce, but I am sure it will lead to higher accuracy. There are other large, tagged sets available and testing will continue with those in upcoming months. The AGDT data also has some inconsistencies, e.g. various lemma tagging for punctuation. I would like to work with the Perseus team to bring this data increasingly closer to being a “gold standard” dataset for applications such as this.
    • The NLTK ContextTagger uses look-behind ngrams to create context. The nature of Latin/Greek as a “free” word-order language suggests that it may be worthwhile to think about and write code for generating different contexts. Skipgram context is one idea that I will pursue in upcoming months.
    • More model/pattern information will only improve accuracy, i.e. more ‘endings’ patterns for the RegexLemmatizer, a more complete principal parts list for the PPLemmatizer. The original dictionary model, currently included at the end of the LazyLatinLemmatizer, could also be revised/augmented.
  • Continued testing of the lemmatizer with smaller, localized selections will help to isolate edge cases and exceptions. The RomanNumeralLemmatizer, e.g., was written to handle a type of token that as an edge case was lowering accuracy.
  • The combination context/POS lemmatizer is very basic at the moment, but has enormous potential for increasing the accuracy of a notoriously difficult lemmatization problem, i.e. ambiguous forms. The current version (inc. the corresponding training data) is only set to resolve one ambiguous case, namely ‘cum1’ (prep.) versus ‘cum2’ (conj.). Two comments:
    • More testing is needed to determine the accuracy (as well as the precision and recall) of this lemmatizer in distinguishing between the two forms of ‘cum1/2’. The current version only uses bigram POS data, but (see above) different contexts may yield better results as well.
    • More ambiguous cases should be introduced to the training data and tested like ‘cum1/2’. The use of Morpheus numbers in the AGDT data should assist with this.
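
The difference between look-behind n-gram context and a skip-style context, mentioned in the list above, can be sketched as follows. Both functions are illustrative assumptions: they are not the NLTK ContextTagger internals, and `skip_contexts` is only one possible reading of what a skipgram context might look like.

```python
def lookbehind_contexts(tokens, n=2):
    """Pair each token with the (n - 1) tokens immediately before it."""
    contexts = []
    for i, token in enumerate(tokens):
        history = tuple(tokens[max(0, i - (n - 1)):i])
        contexts.append((history, token))
    return contexts


def skip_contexts(tokens, window=2):
    """Pair each token with every token up to `window` positions before it."""
    pairs = []
    for i, token in enumerate(tokens):
        for j in range(max(0, i - window), i):
            pairs.append((tokens[j], token))
    return pairs


tokens = ["summa", "ope", "niti"]
print(lookbehind_contexts(tokens))
# [((), 'summa'), (('summa',), 'ope'), (('ope',), 'niti')]
print(skip_contexts(tokens))
# [('summa', 'ope'), ('summa', 'niti'), ('ope', 'niti')]
```

The skip-style version also pairs ‘summa’ with ‘niti’, which a strict bigram look-behind never sees. That kind of non-adjacent pairing is what might help with freer word order.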

This was an incredible project to work on following several years of philological/literary critical graduate work and as I finished up my PhD in classics at Fordham University. I improved my skills and/or learned a great deal about, but not limited to, object-oriented programming, unit testing, version control, and working with important open-source development architecture such as TravisCI, ZenHub, Codecov, etc.

Acknowledgments

I want to thank the following people: my mentors Kyle P. Johnson and James Tauber who have set an excellent example of what the future of philology will look like: open source/access and community-developed, while rooted in the highest standards of both software development and traditional scholarship; the rest of the CLTK development community; my team at the Institute for the Study of the Ancient World Library for supporting this work during my first months there; Matthew McGowan, my dissertation advisor, for supporting both my traditional and digital work throughout my time at Fordham; the Tufts/Perseus/Leipzig DH/Classics team (the roots of this project come from working with them at various workshops in recent years and they first made the case to me about what could be accomplished through humanities computing); Neil Coffee and the DCA; the NLTK development team; Google for supporting an open-source, digital humanities coding project with Summer of Code; and of course, the #DigiClass world of Twitter for proving to me that there is an enthusiastic audience out there who want to ‘break’ classical texts, study them, and put them back together in various ways to learn more about them. Better lemmatization is a desideratum and my motivation comes from wanting to help the community fill this need. (PJB)

Current State of the CLTK Latin Lemmatizer

code

Lemmatization is a core task in natural language processing that allows us to return the dictionary headword—also known as the lemma—for each token in a given string. The Classical Language Toolkit includes a lemmatizer for Latin and Greek and for my Google Summer of Code project I have been rewriting these tools to improve their accuracy. In this post, I want to 1. review the current state of the lemmatizer, specifically the Latin lemmatizer, 2. test some sample sentences to see where the lemmatizer performs well and where it does not, and 3. suggest where I think improvements could be made.

[This post uses Python3 and the current version of the CLTK.]

The current version of the lemmatizer uses a model that is kept in the CLTK_DATA directory. (More specifically, the model is a Python dictionary called LEMMATA that can be found in the ‘latin_lemmata_cltk.py’ file in the ‘latin_models_cltk’ corpus.) So before we can lemmatize Latin texts we need to import this model/corpus. The import commands are given below, but if you want more details on loading CLTK corpora, see this post.

from cltk.corpus.utils.importer import CorpusImporter
corpus_importer = CorpusImporter('latin')
corpus_importer.import_corpus('latin_models_cltk')

[Note that once this corpus is imported into CLTK_DATA, you will not need to repeat these steps to use the Latin lemmatizer in the future.]

To use the lemmatizer, we import it as follows:

from cltk.stem.lemma import LemmaReplacer

LemmaReplacer takes a language argument, so we can create an instance of the Latin lemmatizer with the following command:

lemmatizer = LemmaReplacer('latin')

This lemmatizer checks words against the LEMMATA dictionary that you installed above. That is, it checks the dictionary to see if a word is found as a key and returns the associated value. Here is the beginning of the lemma dictionary:

LEMMATA = { 
    '-nam' : 'nam', 
    '-namque' : 'nam', 
    '-sed' : 'sed', 
    'Aaron' : 'Aaron', 
    'Aaroni' : 'Aaron', 
    'Abante' : 'Abas', 
    'Abanteis' : 'Abanteus', 
    'Abantem' : 'Abas', 
    'Abantes' : 'Abas', etc...

If a word is not found in the dictionary, the lemmatizer returns the original word unchanged. Since Python dictionaries do not support duplicate keys, there is no resolution for ambiguous forms with the current lemmatizer. For example, this key-value pair {‘amor’ : ‘amo’} ensures that the word “amor” is always lemmatized as a verb and not a noun, even though the nominative singular form of ‘amor’ appears much more frequently than the first-person singular passive form of ‘amor’.
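
The lookup behavior described here, including the fallback and the duplicate-key limitation, can be shown with a toy dictionary. `TOY_LEMMATA` and `lookup_lemma` are names made up for this sketch; the entries are taken from the excerpt and the ‘amor’ example above.

```python
# A few entries from the excerpt above plus the 'amor' example.
TOY_LEMMATA = {"Aaroni": "Aaron", "Abante": "Abas", "amor": "amo"}


def lookup_lemma(token, model=TOY_LEMMATA):
    """Return the model's lemma, or the token unchanged if it is not a key."""
    return model.get(token, token)


print(lookup_lemma("Abante"))  # Abas
print(lookup_lemma("tandem"))  # tandem (not in the model, returned as-is)
print(lookup_lemma("amor"))    # amo (the noun reading is unreachable)

# A dict literal with a repeated key silently keeps only the last value.
ambiguous = {"amor": "amor", "amor": "amo"}
print(ambiguous)  # {'amor': 'amo'}
```

Supporting both readings would need a different data structure (for example, a list of candidate lemmas per key) plus some way of choosing between them.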

Let’s try some test sentences. Here is the first sentence from Cicero’s In Catilinam 1:

sentence = 'Quo usque tandem abutere, Catilina, patientia nostra?'
sentence = sentence.lower()

Note that I have also made the sentence lowercase as the current lemmatizer can raise errors due to case handling.

Now let’s pass this to the lemmatizer:

lemmas = lemmatizer.lemmatize(sentence)
print(lemmas)

>>> ['quis1', 'usque', 'tandem', 'abutor', ',', 'catilina', ',', 'patior', 'noster', '?']

The lemmatizer does a pretty good job. Punctuation included, its accuracy is 80% when compared with the lemmas found in Perseus Treebank Data. According to this dataset, the “quis1” should resolve to “quo”. (Though an argument could be made about whether this adverb is a form of ‘quis’ or its own word deriving from ‘quis’. The argument about whether ‘quousque’ should in fact be one word is also worth mentioning. Note that the number following ‘quis’ is a feature of the Morpheus parser to disambiguate identical forms.) “Patientia” is perhaps a clearer case. Though derived from the verb “patior”, the expected behavior of the lemmatizer is to resolve this word as the self-sufficient noun ‘patientia’. This is what we find in our comparative data from Perseus.
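
The 80% figure is a token-by-token comparison against reference lemmas. In the sketch below, `predicted` is the lemmatizer output shown above. The `gold` list is an assumption: it changes only the two items discussed in this paragraph (‘quo’ and ‘patientia’) and takes the other eight to match, as the 80% figure implies.

```python
def accuracy(predicted, gold):
    """Share of positions where the predicted lemma equals the gold lemma."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


predicted = ['quis1', 'usque', 'tandem', 'abutor', ',', 'catilina', ',',
             'patior', 'noster', '?']
# Assumed reference lemmas (see the note above).
gold = ['quo', 'usque', 'tandem', 'abutor', ',', 'catilina', ',',
        'patientia', 'noster', '?']

print(accuracy(predicted, gold))  # 0.8
```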

Another example, a longer sentence from the opening of Sallust’s Bellum Catilinae:

sentence = 'Omnis homines, qui sese student praestare ceteris animalibus, summa ope niti decet, ne vitam silentio transeant veluti pecora, quae natura prona atque ventri oboedientia finxit.'
sentence = sentence.lower()

lemmas = lemmatizer.lemmatize(sentence)
print(lemmas)

>>> ['omne', 'homo', ',', 'qui1', 'sui', 'studeo', 'praesto2', 'ceter', 'animalis', ',', 'summum', 'ops1', 'nitor1', 'decet', ',', 'neo1', 'vita', 'silentium', 'transeo', 'velut', 'pecus1', ',', 'qui1', 'natura', 'pronus', 'atque', 'venter', 'oboedio', 'fingo.']

Again, pretty good results overall: 82.76% (24 out of 29). But the errors reveal the shortcomings of the lemmatizer. “Omnis” is an extremely common word in Latin and it simply appears incorrectly in the lemma model. Ditto ‘summus’. Ditto ‘ceter’, though worse because this is not even a valid Latin form. ‘Animalibus’ suffers from the kind of ambiguity noted above with ‘amor’: the noun ‘animal’ is much more common than the adjective ‘animalis’. The most significant error is lemmatizing ‘ne’, one of the most common words in the language, incorrectly as the extremely infrequent (if ever appearing) present active imperative of ‘neo’.

If this all sounds critical simply for the sake of being critical, that is not my intention. I have been working on new approaches to the problem of Latin lemmatization and have learned a great deal from the current CLTK lemmatizer. The work shown above is a solid start and there is significant room for improvement. I see it as a baseline: every percentage point above 80% or 82.76% accuracy is a step in the right direction. Next week, I will publish some new blog posts with ideas for new approaches to Latin lemmatization based not on dictionary matching, but on training data, regex matching, and attention to word order and context. While dictionary matching is still the most efficient way to resolve some lemmas (e.g. unambiguous, indeclinables like “ad”), it is through a combination of multiple approaches that we will be able to increase substantially the accuracy of this important tool in the CLTK.

 

10,000 Most Frequent ‘Words’ in the Latin Canon, revisited

code

Last year, the CLTK’s Kyle Johnson wrote a post on the “10,000 most frequent words in Greek and Latin canon”. Since that post was written, I updated the CLTK’s Latin tokenizer to better handle enclitics and other affixes. I thought it would be a good idea to revisit that post for two reasons: 1. to look at the most important changes introduced by the new tokenizer features, and 2. to discuss briefly what we can learn from the most frequent words as I continue to develop the new Latin lemmatizer for the CLTK.

Here is an iPython notebook with the code for generating the Latin list: https://github.com/diyclassics/lemmatizer/blob/master/notebooks/phi-10000.ipynb. I have followed Johnson’s workflow, i.e. tokenize the PHI corpus and create a frequency distribution list. (In a future post, I will run the same experiment on the Latin Library corpus using the built-in NLTK FreqDist function.)
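The workflow itself is short. As a minimal sketch, here it is on a toy text, using a plain regex split as a stand-in for the NLTK and CLTK tokenizers and collections.Counter in place of a frequency distribution (NLTK's FreqDist is a Counter subclass, so most_common works the same way). The sample text and variable names are illustrative only and are not taken from the notebook.

```python
import re
from collections import Counter

# Toy stand-in for the corpus (opening of the Aeneid, lowercased).
text = ('arma virumque cano troiae qui primus ab oris italiam fato profugus '
        'laviniaque venit litora multum ille et terris iactatus et alto')

# Stand-in tokenizer: split on word characters only.
tokens = re.findall(r'\w+', text.lower())

# Frequency distribution: token -> number of occurrences.
freq = Counter(tokens)

# Ranked list, most frequent first.
for rank, (token, count) in enumerate(freq.most_common(3), start=1):
    print(rank, token, count)
```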

Here are the results:

Top 10 tokens using the NLTK tokenizer:
et	197240
in	141628
est	99525
non	91073
ut	70782
cum	61861
si	60652
ad	59462
quod	53346
qui	46724

Top 10 tokens using the CLTK tokenizer:
et	197242
in	142130
que	110612
ne	103342
est	103254
non	91073
ut	71275
cum	65341
si	61776
ad	59475

The list gives a good indication of what the new tokenizer does:

  • The biggest change is that the (very common) enclitics -que and -ne take their place in the list of top Latin tokens.
  • The words et and non (words which do not combine with -que) are for the most part unaffected.
  • The words est, in, and ut see their counts go up because of enclitic handling in the Latin tokenizer, e.g. estne > est, ne; inque > in, que. While these tokens are the most obvious examples of this effect, it is the explanation for most of the changed counts on the top 10,000 list, e.g. amorque > amor, que. (Ad is less clear. Adque may be a variant of atque; this should be looked into.)
  • The word cum also sees its count go up, both because of enclitic handling and because of the tokenization of forms like mecum as cum, me.
  • The word si sees its count go up because the Latin tokenizer handles contractions in words like sodes (si audes) and sultis (si vultis).
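The behavior described in this list can be sketched in a few lines. This is a toy illustration, not the CLTK tokenizer: the exception list, the special-case table, and the function names are all made up for the example, and a real exception list would need to be far longer.

```python
ENCLITICS = ('que', 'ne')

# Words that merely end in these letters and must not be split (illustrative).
EXCEPTIONS = {'atque', 'neque', 'quoque', 'homine', 'nomine'}

# cum attached to a pronoun, and contractions with si (illustrative).
SPECIAL = {
    'mecum': ['cum', 'me'],
    'tecum': ['cum', 'te'],
    'sodes': ['si', 'audes'],
    'sultis': ['si', 'vultis'],
}


def split_token(token):
    """Split one word into its component tokens."""
    token = token.lower()
    if token in SPECIAL:
        return list(SPECIAL[token])
    if token not in EXCEPTIONS:
        for enclitic in ENCLITICS:
            # Require something before the enclitic, so bare 'ne' stays whole.
            if token.endswith(enclitic) and len(token) > len(enclitic):
                return [token[:-len(enclitic)], enclitic]
    return [token]


def tokenize(text):
    return [part for word in text.split() for part in split_token(word)]


print(tokenize('estne inque amorque mecum sodes atque'))
# ['est', 'ne', 'in', 'que', 'amor', 'que', 'cum', 'me', 'si', 'audes', 'atque']
```

Each split adds one to the count of the host word (est, in, cum, si) and one to the count of the enclitic, which is why both kinds of token move in the list.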

I was thinking about this list of top tokens as I worked on the Latin lemmatizer this week. These top 10 tokens represent 17.3% of all the tokens in the PHI corpus; relatedly, the top 228 tokens represent 50% of the corpus. Making sure that these words are handled correctly, then, will have the largest overall effect on the accuracy of the Latin lemmatizer.
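Coverage figures like these come straight from the frequency list. A minimal sketch follows, with toy counts rather than the PHI data; the function names are made up for the example.

```python
from collections import Counter
from itertools import accumulate


def coverage(freq, n):
    """Share of all tokens accounted for by the n most frequent types."""
    total = sum(freq.values())
    return sum(count for _, count in freq.most_common(n)) / total


def types_needed(freq, share):
    """Smallest number of top-ranked types whose counts reach `share`."""
    total = sum(freq.values())
    running = accumulate(count for _, count in freq.most_common())
    for rank, cumulative in enumerate(running, start=1):
        if cumulative / total >= share:
            return rank
    return len(freq)


# Toy counts (100 tokens in all), not corpus data.
toy = Counter({'et': 40, 'in': 25, 'est': 15, 'non': 10, 'arma': 5, 'cano': 5})

print(coverage(toy, 2))        # 0.65
print(types_needed(toy, 0.5))  # 2
```

Run against the full PHI frequency list, coverage(freq, 10) and types_needed(freq, 0.5) would correspond to the 17.3% and 228 figures above.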

A few observations…

  • Many of the highest frequency words in the corpus are conjunctions, prepositions, adverbs and other indeclinable, unambiguous words. These should be lemmatized with dictionary matching.
  • Ambiguous tokens are the real challenge of the lemmatizer project and none is more important than cum. Cum alone makes up 1.1% of the corpus with both the conjunction (‘when’) and the preposition (‘with’) significantly represented. Compare this with est, which is an ambiguous form (i.e. est < sum “to be” vs. est < edo “to eat”), but with one occurring by far more frequently in the corpus. For this reason, cum will be a good place to start with testing a context-based lemmatizer, such as one that uses bigrams to resolve ambiguities. Quod and quam, also both in the top 20 tokens, can be added to this category.
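To make the bigram idea concrete, here is a minimal sketch of a context-based lookup for cum. The training examples are hypothetical and hand-labelled for the illustration, and the lemma labels are invented. A real model would be trained on annotated text and is not something this post describes.

```python
from collections import Counter, defaultdict

# Hypothetical hand-labelled examples: (token, following token, lemma).
training = [
    ('cum', 'amicis', 'cum (prep.)'),
    ('cum', 'amicis', 'cum (prep.)'),
    ('cum', 'patre', 'cum (prep.)'),
    ('cum', 'caesar', 'cum (conj.)'),
    ('cum', 'venisset', 'cum (conj.)'),
]

bigram_counts = defaultdict(Counter)   # (token, next token) -> lemma counts
unigram_counts = defaultdict(Counter)  # token -> lemma counts
for token, following, lemma in training:
    bigram_counts[(token, following)][lemma] += 1
    unigram_counts[token][lemma] += 1


def lemmatize_in_context(token, following):
    """Use the bigram if it was seen; otherwise the token's most common lemma."""
    if (token, following) in bigram_counts:
        return bigram_counts[(token, following)].most_common(1)[0][0]
    if token in unigram_counts:
        return unigram_counts[token].most_common(1)[0][0]
    return token  # unknown token: return it unchanged


print(lemmatize_in_context('cum', 'caesar'))     # cum (conj.)
print(lemmatize_in_context('cum', 'militibus'))  # cum (prep.)
```

The second call shows the weakness of the approach on unseen bigrams: it falls back to whichever reading is more frequent overall, which is exactly the case where cum is least predictable.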

In addition to high-frequency tokens, extremely rare tokens also present a significant challenge to lemmatization. Look for a post about hapax legomena in the Latin corpus later this week.

GSoC 2016: Lemmatizing Latin/Greek for CLTK

code

Google Summer of Code 2016 started this week. That means that my work on improving the Latin (and Greek) lemmatizer in the Classical Language Toolkit is now underway. For this summer project, I proposed to rewrite the CLTK lemmatizer using a backoff strategy, that is, using a series of different lemmatizers to increase accuracy. Backoff tagging is a common technique in part-of-speech tagging in NLP, but it should also help to resolve ambiguities, predict unknown words, and handle similar issues that can trip up a lemmatizer. The current CLTK lemmatizer uses dictionary matching, but lacks a systematic way to differentiate ambiguous forms. (Is that forma the nominative singular noun [ > forma, -ae] or forma the present imperative active verb [ > formo (1)]?) The specifics of my backoff strategy will be discussed here as the project develops, but for now I’ll say that it is a combination of training on context, regex matching, and, yes, dictionary matching for high frequency, indeclinable, and unambiguous words.
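The general shape of a backoff chain can be sketched as follows. This is only an illustration of the idea under stated assumptions: the lookup table, the regex rules, and all function names are made up for the example, and the rules are deliberately crude (an -orum form could just as well belong to a neuter in -um).

```python
import re


def dictionary_lemmatizer(token):
    """Exact lookup for high-frequency, unambiguous words (toy table)."""
    lookup = {'et': 'et', 'ad': 'ad', 'in': 'in'}
    return lookup.get(token)  # None means "no answer, back off"


def regex_lemmatizer(token):
    """Guess a lemma from the ending (toy rules, deliberately crude)."""
    rules = [(r'(\w+)arum$', r'\1a'), (r'(\w+)orum$', r'\1us')]
    for pattern, replacement in rules:
        if re.match(pattern, token):
            return re.sub(pattern, replacement, token)
    return None


def identity_lemmatizer(token):
    """Last resort: return the token itself."""
    return token


CHAIN = [dictionary_lemmatizer, regex_lemmatizer, identity_lemmatizer]


def backoff_lemmatize(token):
    """Try each lemmatizer in order and keep the first answer."""
    for lemmatizer in CHAIN:
        lemma = lemmatizer(token)
        if lemma is not None:
            return lemma


print([backoff_lemmatize(t) for t in ['ad', 'formarum', 'transeant']])
# ['ad', 'forma', 'transeant']
```

The order of the chain is the design decision: the most reliable method goes first, and each later stage only sees tokens that the earlier ones could not resolve.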

First round of tests today with the default Latin lemmatizer.

As I mention in my GSoC proposal, having a lemmatizer with high accuracy is particularly important for NLP in highly inflected languages because: 1. words often have a dozen or more possible forms (and, as opposed to go in English, this is the norm and not only a characteristic of irregularly formed words), and 2. small corpus size in general often demands that counts for a given feature, like words, be based on the broadest measure possible. So, for example, if you want to study the idea of shapes in Ovid’s Metamorphoses, you would want to look at the word forma. This “word” (token, really) appears 39 times in the poem. But what you really want to look at is not just forma, but formae (21), formam (18), formarum (0; yes, it’s zero, but you would still want to know), formis (1), and formas (6). And you wouldn’t want to miss tokens like formasque (Met. 2.78) or formaene (Met. 10.563); there are 9 such instances. If you were going to, say, topic model the Metamorphoses, you would be much better off having the 94 examples of “forma” than the smaller numbers of its different forms.
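The arithmetic behind the 94, using only the counts quoted above (the variable names are illustrative):

```python
# Counts of each form of forma in the Metamorphoses, as given above.
form_counts = {
    'forma': 39,
    'formae': 21,
    'formam': 18,
    'formarum': 0,
    'formis': 1,
    'formas': 6,
}

# Forms with an enclitic attached, e.g. formasque, formaene.
enclitic_instances = 9

total = sum(form_counts.values()) + enclitic_instances
print(total)  # 94
```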

“Ancient languages do not have complete BLARKs,” writes Barbara McGillivray [2014: 19], referring to Krauwer’s idea [2003: 4] of the Basic LAnguage Resource Kit. A BLARK consists of the fundamental resources necessary for text analysis: corpora, lexicons, tokenizers, POS-taggers, etc. A lemmatizer is another basic tool. More and more, the CLTK is solving the BLARK problem for Latin, Greek, and other historical languages which have been referred to as “less-resourced” [see Piotrowski 2012: 85]. In order for these languages to participate in advances in text analysis and to take full advantage of digital resources for language processing, basic tools, like the lemmatizer, need to be available and need to work at accuracy rates high enough to stand up to the very high bar demanded in philological research. This is the goal for the summer.

Works cited:
Bird, S., E. Klein, and E. Loper. 2009. Natural Language Processing with Python. Cambridge, Ma.: O’Reilly. (Esp. Ch. 5 “Categorizing and Tagging Words”).
Krauwer, S. 2003. “The Basic Language Resource Kit (BLARK) as the First Milestone for the Language Resources Roadmap.” Proceedings of the 2003 International Workshop on Speech and Computer (SPECOM 2003): 8-15.
McGillivray, B. 2014. Methods in Latin Computational Linguistics. Leiden: Brill.
Piotrowski, M. 2012. “Natural Language Processing for Historical Texts.” Synthesis Lectures on Human Language Technologies 5: 1-157.