Last year, the CLTK’s Kyle Johnson wrote a post on the “10,000 most frequent words in Greek and Latin canon”. Since that post was written, I updated the CLTK’s Latin tokenizer to better handle enclitics and other affixes. I thought it would be a good idea to revisit that post for two reasons: 1. to look at the most important changes introduced by the new tokenizer features, and 2. to discuss briefly what we can learn from the most frequent words as I continue to develop the new Latin lemmatizer for the CLTK.
Here is an IPython notebook with the code for generating the Latin list: https://github.com/diyclassics/lemmatizer/blob/master/notebooks/phi-10000.ipynb. I have followed Johnson’s workflow, i.e. tokenizing the PHI corpus and creating a frequency distribution of the results. (In a future post, I will run the same experiment on the Latin Library corpus using the built-in NLTK FreqDist function.)
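For readers who do not want to open the notebook, the core of the workflow looks something like the sketch below. It is only a sketch: it assumes the CLTK Latin WordTokenizer interface as it existed when this was written and a hypothetical plaintext file containing the corpus text; the notebook has the real details.

```python
# Minimal sketch of the tokenize-and-count workflow. Assumes the CLTK Latin
# WordTokenizer interface current at the time of writing and a hypothetical
# plaintext file holding the corpus.
from collections import Counter
from cltk.tokenize.word import WordTokenizer

with open('phi_corpus.txt') as f:  # hypothetical path to corpus plaintext
    text = f.read().lower()

tokenizer = WordTokenizer('latin')
tokens = tokenizer.tokenize(text)

counts = Counter(tokens)           # NLTK's FreqDist works the same way here
for token, count in counts.most_common(10):
    print(token, count)
```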
Here are the results:
Top 10 tokens, NLTK tokenizer vs. CLTK tokenizer:

| Rank | NLTK token | Count | CLTK token | Count |
|------|------------|-------|------------|-------|
| 1 | et | 197240 | et | 197242 |
| 2 | in | 141628 | in | 142130 |
| 3 | est | 99525 | que | 110612 |
| 4 | non | 91073 | ne | 103342 |
| 5 | ut | 70782 | est | 103254 |
| 6 | cum | 61861 | non | 91073 |
| 7 | si | 60652 | ut | 71275 |
| 8 | ad | 59462 | cum | 65341 |
| 9 | quod | 53346 | si | 61776 |
| 10 | qui | 46724 | ad | 59475 |
This comparison gives a good indication of what the new tokenizer does:
- The biggest change is that the (very common) enclitics -que and -ne take their place in the list of top Latin tokens.
- The words et and non (words which do not combine with -que) are for the most part unaffected.
- The words est, in, and ut see their counts go up because of enclitic handling in the Latin tokenizer, e.g. estne > est, ne; inque > in, que. While these tokens are the most obvious examples of this effect, enclitic splitting also explains most of the changed counts in the top 10,000 list, e.g. amorque > amor, que. (Ad is less clear; adque may be a variant of atque and should be looked into.)
- The word cum also sees its count go up, both because of enclitic handling and because of the tokenization of forms like mecum as cum, me.
- The word si sees its count go up because the Latin tokenizer handles contractions in words like sodes (si, audes) and sultis (si, vultis). These splits are demonstrated in the short sketch after this list.
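To see these splits in action, here is a minimal sketch that runs the CLTK Latin tokenizer on a few of the forms mentioned above. The exact output format (e.g. whether enclitics are returned with a leading hyphen) may vary by CLTK version.

```python
# Minimal demonstration of the enclitic and contraction handling described
# above; exact output formatting may vary by CLTK version.
from cltk.tokenize.word import WordTokenizer

tokenizer = WordTokenizer('latin')

for form in ['estne', 'inque', 'amorque', 'mecum', 'sodes', 'sultis']:
    print(form, '->', tokenizer.tokenize(form))

# Expected splits, per the discussion above:
#   estne -> est, ne      inque -> in, que      amorque -> amor, que
#   mecum -> cum, me      sodes -> si, audes    sultis  -> si, vultis
```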
I was thinking about this list of top tokens as I worked on the Latin lemmatizer this week. These top 10 tokens represent 17.3% of all the tokens in the PHI corpus; relatedly, the top 228 tokens represent 50% of the corpus. Making sure that these words are handled correctly, then, will have the largest overall effect on the accuracy of the Latin lemmatizer.
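Coverage figures like these fall straight out of the frequency counts. A quick sketch, assuming the `counts` object built in the earlier snippet:

```python
# Quick sketch: how many of the most frequent tokens are needed to cover
# half of the corpus? Assumes a Counter `counts` like the one built above.
total = sum(counts.values())
running = 0
for rank, (token, count) in enumerate(counts.most_common(), start=1):
    running += count
    if running / total >= 0.5:
        print(f'Top {rank} tokens cover 50% of the corpus')
        break
```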
A few observations…
- Many of the highest frequency words in the corpus are conjunctions, prepositions, adverbs and other indeclinable, unambiguous words. These should be lemmatized with dictionary matching.
- Ambiguous tokens are the real challenge of the lemmatizer project, and none is more important than cum. Cum alone makes up 1.1% of the corpus, with both the conjunction (‘when’) and the preposition (‘with’) significantly represented. Compare this with est, which is also an ambiguous form (i.e. est > sum “to be” vs. est > edo “to eat”), but one whose senses are unevenly distributed: the form of sum occurs far more frequently in the corpus. For this reason, cum will be a good place to start with testing a context-based lemmatizer, such as one that uses bigrams to resolve ambiguities (a toy version of this idea is sketched after this list). Quod and quam, also both in the top 20 tokens, can be added to this category.
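As a rough illustration of the approach (and not the CLTK’s actual implementation), a dictionary lookup for unambiguous, indeclinable words can be backed off to a simple bigram rule for a form like cum. The lexicon and the context rule below are invented for the example.

```python
# Rough illustration only, not the CLTK lemmatizer: dictionary matching for
# unambiguous words, with a toy bigram rule to disambiguate 'cum'.
# All entries and rules below are invented for the example.
UNAMBIGUOUS = {'et': 'et', 'non': 'non', 'ad': 'ad', 'que': '-que', 'ne': '-ne'}

def lemmatize(tokens):
    lemmas = []
    for i, token in enumerate(tokens):
        if token in UNAMBIGUOUS:
            lemmas.append(UNAMBIGUOUS[token])
        elif token == 'cum':
            # Toy bigram rule: 'cum' followed by an ablative pronoun is read as
            # the preposition ('with'); otherwise as the conjunction ('when').
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ''
            if nxt in ('me', 'te', 'nobis', 'vobis', 'quo'):
                lemmas.append('cum (prep.)')
            else:
                lemmas.append('cum (conj.)')
        else:
            # Fall through; a real lemmatizer would back off to further models.
            lemmas.append(token)
    return lemmas

print(lemmatize(['cum', 'me', 'venit']))   # ['cum (prep.)', 'me', 'venit']
```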
In addition to high-frequency tokens, extremely rare tokens also present a significant challenge to lemmatization. Look for a post about hapax legomena in the Latin corpus later this week.