Nuntii Latini: 2016 Year in Review

tutorial

Earlier this week, Radio Bremen announced that it would be discontinuing its Nuntii Latini Septimanales. As a weekly listener, I was disappointed by the news—luckily, the monthly broadcasts will continue. Where else can you read news stories about heros artis musicae mortuus, i.e. David Bowie, or Trump victor improvisus? Coincidentally, I learned about the fate of Septimanales while preparing a quick study of word usage in these weekly news broadcasts. So, as a tribute to the work of the Nuntii writers and as a follow up to the Latin Library word-frequency post from earlier this week, I present “Nuntii Latini: 2016 Year in Review”.

[A Jupyter Notebook with the code and complete lists of tokens and lemmas for this post is available here.]

A quick note about how I went about this work. To get the data, I collected a list of web pages from the “Archivum Septimanale” page and used the Python Requests package to get the html contents of each of the weekly posts. I then used Beautiful Soup to extract only the content of the three weekly stories that Radio Bremen publishes every week. Here is a sample of what I scraped from each page:

[['30.12.2016',
  'Impetus terroristicus Berolini factus',
  'Anis Amri, qui impetum terroristicum Berolini fecisse pro certo habetur, '
  'a custode publico prope urbem Mediolanum in fuga necatus est. In Tunisia, '
  'qua e civitate ille islamista ortus est, tres viri comprehensi sunt, in his '
  'nepos auctoris facinoris. Quos huic facinori implicatos esse suspicio est. '
  'Impetu media in urbe Berolino facto duodecim homines interfecti, '
  'quinquaginta tres graviter vulnerati erant.'],
 ['30.12.2016',
  'Plures Turci asylum petunt',
  'Numerus asylum petentium, qui e Turcia orti sunt, anno bis millesimo sexto '
  'decimo evidenter auctus est, ut a moderatoribus Germaniae nuntiatur. '
  'Circiter quattuor partes eorum sunt Cordueni. Post seditionem ad irritum '
  'redactam ii, qui Turciam regunt, magis magisque regimini adversantes '
  'opprimunt, imprimis Corduenos, qui in re publica versantur.'],
 ['30.12.2016',
  'Septimanales finiuntur',
  'A. d. XI Kal. Febr. anni bis millesimi decimi redactores nuntiorum '
  'Latinorum Radiophoniae Bremensis nuntios septimanales lingua Latina '
  'emittere coeperunt. Qui post septem fere annos hoc nuntio finiuntur. Nuntii '
  'autem singulorum mensium etiam in futurum emittentur ut solent. Cuncti '
  'nuntii septimanales in archivo repositi sunt ita, ut legi et audiri '
  'possint.']]

The stories were preprocessed following more or less the same process that I’ve used in earlier posts. One exception was that I need to tweak the CLTK Latin tokenizer. This tokenizer currently checks tokens against a list of high-frequency forms ending in ‘-ne‘ and ‘-n‘ to best predict when the enclitic –ne should be assigned its own token. Nuntii Latini unsurpisingly contains a number of words not on this list—mostly proper names ending in ‘-n‘, such as Clinton, Putin, Erdoğan, John and Bremen among others.

Here are some basic stats about the Nuntii Latini 2016:

Number of weekly nuntii: 46 (There was a break over the summer.)
Number of stories: 138
Number of tokens: 6546
Number of unique tokens: 3021
Lexical diversity: 46.15% (i.e. unique tokens / tokens)
Number of unique lemmas: 2033
Here are the top tokens:
Top 10 tokens in Nuntii Latini 2016:

       TOKEN       COUNT       Type-Tok %  RUNNING %   
    1. in          206         3.15%       3.15%       
    2. est         135         2.06%       5.21%       
    3. et          106         1.62%       6.83%       
    4. qui         70          1.07%       7.9%        
    5. ut          56          0.86%       8.75%       
    6. a           54          0.82%       9.58%       
    7. sunt        50          0.76%       10.34%      
    8. esse        42          0.64%       10.98%      
    9. quod        41          0.63%       11.61%      
   10. ad          40          0.61%       12.22%      

How does this compare with the top tokens from the Latin Library that I posted earlier in the week? Usual suspects overall. Curious that the Nuntii uses -que relatively infrequently and even et less than we would expect compared to a larger sample like the Latin Library. There seems to be a slight preference for a (#6) over ab (#27). [Similar pattern is e (#21) vs. ex (#25).] And three forms of the verb sum crack the Top 10—an interesting feature of the Nuntii Latini style.

The top lemmas are more interesting:

Top 10 lemmas in Nuntii Latini 2016:

       LEMMA       COUNT       TYPE-LEM %  RUNNING %   
    1. sum         323         4.93%       4.93%       
    2. qui         208         3.18%       8.11%       
    3. in          206         3.15%       11.26%      
    4. et          106         1.62%       12.88%      
    5. annus       91          1.39%       14.27%      
    6. ab          74          1.13%       15.4%       
    7. hic         64          0.98%       16.38%      
    8. ut          56          0.86%       17.23%      
    9. ille        51          0.78%       18.01%      
   10. homo        49          0.75%       18.76%

Based on the top tokens, it is no surprise to see sum take the top spot. At the same time, we should note that this is a good indicator of Nuntii Latini style. Of greater interest though, unlike the Latin Library lemma list, we see content words appearing with greater frequency. Annus is easily explained by the regular occurrence of dates in the news stories, especially formulas for the current year such as anno bis millesimo sexto decimo. Homo on the other hand tells us more about the content and style of the Nuntii. Simply put, the news stories concern the people of the world and in the abbreviated style of the Nuntii, homo and often homines is a useful and general way of referring to them, e.g. Franciscus papa…profugos ibi permanentes et homines ibi viventes salutavit from April 22.

Since I had the Top 10,000 Latin Library tokens at the ready, I thought it would be interesting to “subtract” these tokens from the Nuntii list to see what remains. This would give a (very) rough indication of which words represent the 2016 news cycle more than Latin usage in general. So, here are the top 25 tokens from the Nuntii Latini that do not appear in the Latin Library list:

Top 25 tokens in Nuntii Latini 2016 (not in the Latin Library 10000):

       LEMMA               COUNT       
    1. praesidens          19          
    2. turciae             17          
    3. ministrorum         14          
    4. americae            13          
    5. millesimo           13          
    6. moderatores         12          
    7. unitarum            12          
    8. electionibus        10          
    9. factio              9           
   10. merkel              8           
   11. factionis           8           
   12. imprimis            8           
   13. habitis             8           
   14. europaeae           8           
   15. millesimi           8           
   16. turcia              7           
   17. britanniae          7           
   18. cancellaria         7           
   19. angela              7           
   20. declarauit          7           
   21. recep               7           
   22. democrata           7           
   23. profugis            7           
   24. tayyip              7           
   25. suffragiorum        6

As I said above, this is a rough, inexact way of weighting the vocabulary. At the same time, it does give a good sense of the year in (Latin) world news. We see important regions in world politics (Europe, Turkey, America, Britain), major players (Angela Merkel, Recep Tayyip [Erdoğan]), and their titles (praesidens, minister, moderator). There are indicators of top news stories like the elections (electio, factio, suffragium, democrata) in the U.S and elsewhere as well as the refugee crisis (profugus). Now that I have this dataset, I’d like to use it to look for patterns in the texts more systematically, e.g. compute TF-IDF scores, topic model the stories, extract named entities, etc. Look for these posts in upcoming weeks.

Advertisements