10,000 Most Frequent ‘Words’ in the Latin Library


A few months ago, I posted a list of the 10,000 most frequent words in the PHI Classical Latin Texts. While I did include a notebook with the code for that experiment, I could not include the data because the PHI texts are not available for redistribution. So here is an updated post, based on a freely available corpus of Latin literature—and one that I have been using for my recent Disiecta Membra posts like this one and this one and this one—the Latin Library. (The timing is good, as the Latin Library has received some positive attention recently.) The code for this post is available as a Jupyter Notebook here.

The results, based on the 13,563,476 tokens in the Latin Library:

Top 10 tokens in Latin Library:

       TOKEN       COUNT       TYPE-TOK %  RUNNING %
    1. et          446474      3.29%       3.29%
    2. in          274387      2.02%       5.31%
    3. est         174413      1.29%       6.60%
    4. non         166083      1.22%       7.83%
    5. -que        135281      1.00%       8.82%
    6. ad          133596      0.98%       9.81%
    7. ut          119504      0.88%       10.69%
    8. cum         109996      0.81%       11.50%
    9. quod        104315      0.77%       12.27%
   10. si          95511       0.70%       12.97%
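
For readers who want to build a table of this shape themselves, here is a minimal sketch. It is not the notebook's code: the function name `frequency_table` and the toy input are assumptions for illustration, and it reads the TYPE-TOK % column as a token's count divided by the total number of tokens, which is consistent with the figures above (446,474 / 13,563,476 ≈ 3.29%).

```python
from collections import Counter


def frequency_table(tokens, n=10):
    """Return (rank, token, count, share %, running %) rows for the n most common tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    rows = []
    running = 0
    for rank, (token, count) in enumerate(counts.most_common(n), start=1):
        running += count  # cumulative count of all tokens ranked so far
        rows.append((rank, token, count, 100 * count / total, 100 * running / total))
    return rows


# Toy input standing in for a tokenized corpus (hypothetical data).
sample = "et in est et non et in".split()
rows = frequency_table(sample, n=3)

for rank, token, count, share, running in rows:
    print(f"{rank:>5}. {token:<12}{count:<12}{share:<12.2f}{running:.2f}")
```

On a real corpus, `tokens` would be the output of whichever tokenizer is in use; the counting itself does not depend on it.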

How does this compare with the previous PHI run? Here are the frequency rankings from that run, 1 through 10: et, in, -que, ne, est, non, ut, cum, si, and ad. So, basically the same. The loss of ne from the Top 10 is certainly a result of improvements to the CLTK tokenizer, specifically improvements in tokenizing the enclitic -ne. Ne is now #41 with 26,825 appearances and -ne #30 with 36,644 appearances. Even their combined count (63,469) would not crack the Top 10, where si holds the last spot with 95,511. This suggests that in the PHI run a lot of words were wrongly tokenized, e.g. ‘homine’ as [‘homi’, ‘-ne’]. (I suspect that this still happens, but am confident that the frequency of this problem is declining. If you spot any “bad” tokenization involving words ending in ‘-ne’ or ‘-n’, please submit an issue.) With ne out of the Top 10, we see that quod has joined the list. It should come as little surprise that quod was #11 in the PHI frequency list.

Since the PHI post, significant advances have been made with the CLTK Latin lemmatizer. Recent tests show accuracies consistently over 90%. So, let’s put out a provisional list of top lemmas as well—

Top 10 lemmas in Latin Library:

       LEMMA       COUNT       TYPE-LEM %  RUNNING %
    1. et          446474      3.29%       3.29%
    2. sum         437415      3.22%       6.52%
    3. qui         365280      2.69%       9.21%
    4. in          274387      2.02%       11.23%
    5. is          213677      1.58%       12.81%
    6. non         166083      1.22%       14.03%
    7. -que        144790      1.07%       15.10%
    8. hic         140421      1.04%       16.14%
    9. ad          133613      0.99%       17.12%
   10. ut          119506      0.88%       18.00%
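
To make the step from token counts to lemma counts concrete, here is a rough sketch under strong assumptions. The lookup table `LEMMA_OF`, the function `lemma_counts`, and the toy counts are all hypothetical, and a context-free lookup like this is not how the CLTK lemmatizer works: it cannot tell apart forms that belong to more than one lemma.

```python
from collections import Counter

# Hypothetical form-to-lemma lookup; a real lemmatizer would supply this.
LEMMA_OF = {
    "est": "sum",
    "sunt": "sum",
    "quod": "qui",
    "quae": "qui",
    "hoc": "hic",
    "haec": "hic",
}


def lemma_counts(token_counts, lemma_of):
    """Sum token counts under each token's lemma; unknown tokens count as their own lemma."""
    totals = Counter()
    for token, count in token_counts.items():
        totals[lemma_of.get(token, token)] += count
    return totals


# Toy token counts (hypothetical, not taken from the Latin Library).
toy = Counter({"et": 5, "est": 3, "sunt": 2, "quod": 2, "quae": 2, "hoc": 1})
by_lemma = lemma_counts(toy, LEMMA_OF)
print(by_lemma.most_common())
```

Indeclinable forms such as et pass through unchanged, while inflected forms pool under one headword, which is why lemmas like sum and qui climb past individual tokens.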

No real surprises here. Six of the Top 10 lemmas are indeclinable (conjunctions, prepositions, adverbs, or an enclitic) and so carry over from the top tokens list: et, in, non, -que, ad, and ut. Forms of sum and qui can be found in the top tokens list as well, est and quod respectively. Hic rises into the Top 10 on the strength of its large number of relatively high-ranking forms, though it should be noted that its top-ranking form is only #23 (hoc), followed by #46 (haec), #71 (his), #91 (hic), and #172 (hanc), among others. Is also joins the Top 10, though I have my concerns about this because of the relatively high frequency of forms that overlap with the verb eo (i.e. eo, is, eam, etc.). This result should be reviewed and tested further.

While I’m thinking about it, other concerns I have are the counts for hic (i.e. the split between the demonstrative and the adverb), the slight fluctuations in the counts of indeclinables, e.g. ut (119,504 tokens vs. 119,506 lemmas), and the somewhat harder to explain jump in -que (135,281 tokens vs. 144,790 lemmas). So, we’ll consider this a work in progress, but one that is, at least for the Top 10, more or less in line with other studies (e.g. Diederich, which, with the exception of cum, has the same words, if in a different order).
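
The size of these fluctuations can be read straight off the two tables above. A small sketch, with the counts copied from the tables and variable names of my own choosing:

```python
# Counts for the six indeclinables, copied from the token and lemma tables above.
token_counts = {"et": 446474, "in": 274387, "non": 166083,
                "-que": 135281, "ad": 133596, "ut": 119504}
lemma_totals = {"et": 446474, "in": 274387, "non": 166083,
                "-que": 144790, "ad": 133613, "ut": 119506}

# A positive difference means the lemmatizer assigned extra forms to this lemma.
diffs = {word: lemma_totals[word] - token_counts[word] for word in token_counts}

for word, diff in diffs.items():
    if diff:
        print(f"{word}: +{diff} ({100 * diff / token_counts[word]:.3f}% over the token count)")
```

The differences for ad (17) and ut (2) are tiny, while -que gains 9,509, so that is the one most worth tracing back through the lemmatizer output.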