10,000 Most Frequent ‘Words’ in the Latin Canon, revisited

Last year, the CLTK’s Kyle Johnson wrote a post on the “10,000 most frequent words in Greek and Latin canon”. Since that post was written, I updated the CLTK’s Latin tokenizer to better handle enclitics and other affixes. I thought it would be a good idea to revisit that post for two reasons: 1. to look at the most important changes introduced by the new tokenizer features, and 2. to discuss briefly what we can learn from the most frequent words as I continue to develop the new Latin lemmatizer for the CLTK.

Here is an iPython notebook with the code for generating the Latin list: https://github.com/diyclassics/lemmatizer/blob/master/notebooks/phi-10000.ipynb. I have followed Johnson’s workflow, i.e. tokenize the PHI corpus and create a frequency distribution list. (In a future post, I will run the same experiment on the Latin Library corpus using the built-in NLTK FreqDist function.)
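
As a rough illustration of that workflow (not the notebook's actual code), here is a minimal sketch that tokenizes a string and builds a frequency list with the standard library. The regex tokenizer and the `frequency_list` helper are stand-ins I am assuming for the example; the real experiment runs the NLTK and CLTK tokenizers over the PHI texts.

```python
import re
from collections import Counter

def frequency_list(text, n=10):
    """Return the n most common lowercase word tokens in text."""
    # Stand-in tokenizer (an assumption): every run of ASCII letters is a token.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens).most_common(n)

# Toy input, not drawn from the PHI corpus.
sample = "arma virumque cano et in Italiam et in Latium venit et cetera"
top = frequency_list(sample, n=2)
print(top)
```

Swapping in a different tokenizer only changes the line that builds `tokens`; the counting step stays the same.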

Here are the results:

Top 10 tokens using the NLTK tokenizer:
et	197240
in	141628
est	99525
non	91073
ut	70782
cum	61861
si	60652
ad	59462
quod	53346
qui	46724

Top 10 tokens using the CLTK tokenizer:
et	197242
in	142130
que	110612
ne	103342
est	103254
non	91073
ut	71275
cum	65341
si	61776
ad	59475

The list gives a good indication of what the new tokenizer does:

  • The biggest change is that the (very common) enclitics -que and -ne take their place in the list of top Latin tokens.
  • The words et and non (words which do not combine with -que) are for the most part unaffected.
  • The words est, in, and ut see their count go up because of enclitic handling in the Latin tokenizer, e.g. estne > est, ne; inque > in, que. While these tokens are the most obvious examples of this effect, it is the explanation for most of the changed counts on the top 10,000 list, e.g. amorque > amor, que. (Ad is less clear. Adque may be a variant of atque; this should be looked into.)
  • The word cum also sees its count go up, both because of enclitic handling and because of the tokenization of forms like mecum as cum, me.
  • The word si sees its count go up because the Latin tokenizer handles contractions in words like sodes (si audes) and sultis (si vultis).
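
The effects listed above can be sketched with a toy enclitic splitter. This is only an illustration under stated assumptions, not the CLTK tokenizer: the `EXCEPTIONS` set and the `split_token` and `tokenize` helpers are hypothetical, and the special cases are limited to the forms mentioned in this post.

```python
ENCLITICS = ("que", "ne")

# Hypothetical exception list: words that end in these letters but carry no enclitic.
EXCEPTIONS = {"atque", "neque", "quoque", "denique", "sine", "bene"}

# Special forms taken from the examples in this post.
SPECIAL = {
    "mecum": ["cum", "me"],
    "sodes": ["si", "audes"],
    "sultis": ["si", "vultis"],
}

def split_token(token):
    """Split one lowercase token into its parts, or return it unchanged."""
    if token in SPECIAL:
        return list(SPECIAL[token])
    if token in EXCEPTIONS:
        return [token]
    for enclitic in ENCLITICS:
        # Require a non-empty stem so that bare "que" and "ne" are left alone.
        if token.endswith(enclitic) and len(token) > len(enclitic):
            return [token[: -len(enclitic)], enclitic]
    return [token]

def tokenize(text):
    """Whitespace-split the text, then apply enclitic splitting to each word."""
    parts = []
    for word in text.lower().split():
        parts.extend(split_token(word))
    return parts

print(tokenize("estne inque amorque mecum sodes atque"))
```

A bare suffix rule like this over-splits any word that happens to end in -ne or -que, which is why the exception list matters so much for the counts.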

I was thinking about this list of top tokens as I worked on the Latin lemmatizer this week. These top 10 tokens represent 17.3% of all the tokens in the PHI corpus; relatedly, the top 228 tokens represent 50% of the corpus. Making sure that these words are handled correctly will therefore have the largest overall effect on the accuracy of the Latin lemmatizer.
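
Coverage figures of this kind can be computed directly from a frequency distribution. The sketch below uses toy counts rather than the PHI numbers, and the `coverage` and `tokens_needed` helpers are names I am assuming for the example.

```python
from collections import Counter
from itertools import accumulate

def coverage(counts, n):
    """Share of all tokens accounted for by the n most frequent types."""
    total = sum(counts.values())
    return sum(count for _, count in counts.most_common(n)) / total

def tokens_needed(counts, share):
    """Smallest number of top-ranked types whose counts reach the given share."""
    total = sum(counts.values())
    ranked = sorted(counts.values(), reverse=True)
    for rank, running in enumerate(accumulate(ranked), start=1):
        if running >= share * total:
            return rank
    return len(ranked)

# Toy distribution, not the PHI counts.
toy = Counter({"et": 5, "in": 3, "que": 1, "arma": 1})
print(coverage(toy, 1))
print(tokens_needed(toy, 0.75))
```

Run against the full distribution, `coverage(counts, 10)` and `tokens_needed(counts, 0.5)` would correspond to the two figures quoted above.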

A few observations…

  • Many of the highest frequency words in the corpus are conjunctions, prepositions, adverbs and other indeclinable, unambiguous words. These should be lemmatized with dictionary matching.
  • Ambiguous tokens are the real challenge of the lemmatizer project and none is more important than cum. Cum alone makes up 1.1% of the corpus, with both the conjunction (‘when’) and the preposition (‘with’) significantly represented. Compare this with est, which is an ambiguous form (i.e. est < sum “to be” vs. est < edo “to eat”), but one whose first reading occurs far more frequently in the corpus. For this reason, cum will be a good place to start testing a context-based lemmatizer, such as one that uses bigrams to resolve ambiguities. Quod and quam, also both in the top 20 tokens, can be added to this category.
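
To make the bigram idea concrete, here is a minimal sketch that picks a reading of cum from the token that follows it. It is not the CLTK lemmatizer: the hand-labeled training sentences, the labels `cum-with` and `cum-when`, and the `train` and `disambiguate` helpers are all assumptions made up for the example.

```python
from collections import Counter, defaultdict

def train(tagged_sentences, target="cum"):
    """Count the labels seen for the target token, keyed by the next token."""
    model = defaultdict(Counter)
    for sentence in tagged_sentences:
        for (token, label), (next_token, _) in zip(sentence, sentence[1:]):
            if token == target:
                model[next_token][label] += 1
    return model

def disambiguate(model, next_token, default):
    """Return the most frequent label seen before next_token, else the default."""
    seen = model.get(next_token)
    return seen.most_common(1)[0][0] if seen else default

# Hypothetical hand-labeled data; a real model would need a tagged corpus.
training = [
    [("cum", "cum-with"), ("amicis", "amicus"), ("venit", "venio")],
    [("cum", "cum-when"), ("venisset", "venio"), ("dixit", "dico")],
    [("cum", "cum-with"), ("amicis", "amicus"), ("cenat", "ceno")],
]
model = train(training)
print(disambiguate(model, "amicis", default="cum-with"))
print(disambiguate(model, "venisset", default="cum-with"))
```

Keying on the exact following word is very sparse in practice, so a usable version would more plausibly back off to something like the part of speech or case of the next token.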

In addition to high-frequency tokens, extremely rare tokens also present a significant challenge to lemmatization. Look for a post about hapax legomena in the Latin corpus later this week.

8 thoughts on “10,000 Most Frequent ‘Words’ in the Latin Canon, revisited”

  1. Interesting stuff! I wonder if you have any thoughts about how this differs from Diederich’s top 300. In particular, -ne is an order of magnitude less common in his list than his #1. (That also seems a little more reasonable to me.) Maybe a bunch of third declension ablatives are being treated as nominative plus -ne?

    Also -que is his #1.

    1. I set up this test around Johnson’s original post using the PHI Latin texts—I am going to rerun this again using the Latin Library as the corpus and we’ll see what changes. Not only is it possible that there are biases in this “corpus”, it is pretty obvious—see e.g. “libro” as #59 and then look at Justinianus—http://latin.packhum.org/stats?q=%23libro%23. Still, good questions and thanks for the suggestion—I’ll go back to Diederich when I write the LL post.

  2. Patrick, Any chance of re-revisiting this at the end of your summer of code? I’m curious how the top 10 look now.

    1. Definitely—should get some better stats on ne/-ne for sure. I’ll have to think about how to approach cum-with/cum-when. As far as the tokenizer is concerned, they’re the same. But I also plan on publishing a ‘Top 10,000 Latin Lemmata’ post soon and the new version of the lemmatizer tries to disambiguate cum-with/cum-when. (That reminds me also to set up some tests for that!)
