In August, the Dickinson College Commentaries blog featured a post on common Latin words that are not found in Virgil’s Aeneid. Author Chris Francese refers to the post as a “diverting Latin parlor game” and in that spirit of diversion I’d like to play along and push the game further.
The setup is as follows, to quote the post:
Take a very common Latin word (in the DCC Latin Core Vocabulary) that does not occur in Vergil’s Aeneid, and explain its absence. Why would Vergil avoid certain lemmata (dictionary head words) that are frequent in preserved Latin?
So, Virgil avoids words such as aegre, arbitror, auctoritas, beneficium, etc., and it is up to us to figure out why. It is an interesting question, and by asking it Francese enters a fascinating conversation on Latin poetic diction that includes Bertil Axelson, Gordon Williams, Patricia Watson, and many others (myself included, I suppose). But my goal in this post is not so much to answer the “why?” posed in the quote above as to investigate the methods through which we can start the conversation.
The line in Francese’s post that got me thinking was this:
The Vergilian data comes from LASLA (no automatic lemmatizers were used, all human inspection), as analyzed by Seth Levin.
It just so happened that when this post came out, I was completing a summer-long project building an “automatic lemmatizer” for Latin for the Classical Language Toolkit. So my first reaction to the post was to see how close I could get to the DCC Blog’s list using the new lemmatizer. The answer is pretty close.
[I have published a Jupyter Notebook with the code for these results here: https://github.com/diyclassics/dcc-lemma/blob/master/Parlor%20Game%2C%20Revisited.ipynb.]
There are 75 lemmas from the DCC Latin Core Vocabulary that do not appear in the Aeneid (DCC Missing). Running the Backoff Latin lemmatizer on the Latin Library text of the Aeneid (CLTK Missing), I returned a list of 119 lemmas. Since there are somewhere around 6,100 unique lemmas in the Aeneid, the two results differ by only about 0.7% (44 lemmas out of roughly 6,100).
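The comparison itself is just set arithmetic. Here is a minimal sketch with tiny stand-in word lists (the linked notebook does the real work with the full DCC core vocabulary and the lemmatized text of the Aeneid):

```python
# Sketch of the set arithmetic behind "DCC Missing" / "CLTK Missing".
# The word lists here are toy stand-ins, not the real data.

dcc_core = {'arma', 'uir', 'arbitror', 'auctoritas', 'beneficium'}
aeneid_lemmas = {'arma', 'uir', 'cano', 'troia'}  # lemmas found in the text

# Core lemmas that never occur in the (toy) Aeneid
missing = sorted(dcc_core - aeneid_lemmas)
print(missing)  # ['arbitror', 'auctoritas', 'beneficium']
```

Everything else in the experiment (tokenization, lemmatization, normalization) is in service of building those two sets.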
The results from CLTK Missing show 69 out of 75 lemmas (92%) from the DCC list. The six lemmas that it missed are:
[‘eo’, ‘mundus’, ‘plerusque’, ‘reliquus’, ‘reuerto’, ‘solum’]
Some of these can be easily explained. Reliqui (from relinquo) was incorrectly lemmatized as reliquus, a plain error. Mundus was lemmatized correctly and so appears in the list of Aeneid lemmas, just not as the lemma on DCC Missing, i.e. mundus (from mundus, -a, -um, ‘clean’). A related problem affects both eo and solum: homonyms of both these words appear in the list of Aeneid lemmas. (See below on the issue of lemmatizing adverbs/adjectives, adjectives/nouns, etc.) Plerusque comes from a parsing error in my preprocessing script, where I split the DCC list on whitespace. Since this word is listed as plērus- plēra- plērumque, plerus- made it into the reference list, but plerusque did not. (I could have fixed this, but I thought it was better in this informal setting to make clear the full range of small errors that can creep into a text-processing “parlor game” like this.) Lastly, is reverto wrong? The LASLA lemma is revertor, which, true enough, does not appear on the DCC Core Vocabulary, but this is probably too fine a distinction. Lewis & Short, for example, lists reverto and revertor under a single headword.
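The plerusque slip is easy to reproduce. The snippet below is a toy reconstruction of the preprocessing (not my actual script): splitting the DCC entry on whitespace, then stripping macrons and trailing hyphens, yields the stem fragments but never the full lemma.

```python
import unicodedata

# Toy reconstruction of the preprocessing slip: splitting the DCC entry
# on whitespace keeps fragments like 'plērus-' but never produces the
# lemma 'plerusque' itself.

entry = 'plērus- plēra- plērumque'

def normalize(token):
    # Strip macrons and the trailing hyphen, a rough stand-in for the
    # kind of cleanup the preprocessing performed.
    decomposed = unicodedata.normalize('NFD', token)
    stripped = ''.join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.rstrip('-')

tokens = [normalize(t) for t in entry.split()]
print(tokens)  # ['plerus', 'plera', 'plerumque']
# 'plerus' lands in the reference list; 'plerusque' never does.
```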
This leaves 50 lemmas returned in CLTK Missing that are—compared to DCC Missing—false positives. The list is as follows:
[‘aduersus’, ‘alienus’, ‘aliquando’, ‘aliquis’, ‘aliter’, ‘alius’, ‘animal’, ‘antequam’, ‘barbarus’, ‘breuiter’, ‘certe’, ‘citus’, ‘ciuitas’, ‘coepi’, ‘consilium’, ‘diuersus’, ‘exsilium’, ‘factum’, ‘feliciter’, ‘fore’, ‘forte’, ‘illuc’, ‘ingenium’, ‘item’, ‘longe’, ‘male’, ‘mare’, ‘maritus’, ‘pauci’, ‘paulo’, ‘plerus’, ‘praeceptum’, ‘primum’, ‘prius’, ‘proelium’, ‘qua’, ‘quantum’, ‘quomodo’, ‘singuli’, ‘subito’, ‘tantum’, ‘tutus’, ‘ualidus’, ‘uarius’, ‘uere’, ‘uero’, ‘uictoria’, ‘ultimus’, ‘uolucer’, ‘uos’]
To be perfectly honest, this list teaches you more about the lemmatizer than about the Aeneid, but that is itself very useful: it uncovers places where the CLTK tools can be improved.
So, for example, there are a number of adverbs on this list (breuiter, certe, tantum, etc.). These are cases where the CLTK lemmatizer returns the associated adjective (so breuis, certus, tantus). This is a matter of definition. That is, the CLTK result is different rather than wrong. We can debate whether some adverbs deserve their own lemma, but that is still just that: a debate. (Lewis & Short, e.g., lists certe under certus, but gives breuiter a separate entry.)
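One way to make the two lists commensurable is to normalize over these definitional differences before comparing. A sketch, with a small illustrative mapping from adverb lemmas to the adjective lemmas the lemmatizer returns (the table is hand-made for this example, not a complete inventory):

```python
# Illustrative normalization: fold adverb lemmas into the adjective
# lemmas the lemmatizer actually returns, so the DCC list and the CLTK
# output can be compared on equal terms. The mapping is a small sample.

ADVERB_TO_ADJ = {
    'breuiter': 'breuis',
    'certe': 'certus',
    'tantum': 'tantus',
}

def normalize_lemma(lemma):
    # Unmapped lemmas pass through unchanged.
    return ADVERB_TO_ADJ.get(lemma, lemma)

dcc_sample = ['breuiter', 'certe', 'arma']
print([normalize_lemma(l) for l in dcc_sample])  # ['breuis', 'certus', 'arma']
```

Applying the same normalization to both lists before the set difference would remove these definitional false positives without hiding genuine absences.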
The DCC Blog post makes a similar point about nouns and adjectives:
At times there might be some lemmatization issues (for example, barbarus came up in the initial list of excluded core words, since Vergil avoids the noun, though he uses the adjective twice). I deleted it from this version.
This explains why barbarus appears on CLTK Missing. Along the same lines, factum has been lemmatized under facio. Again, this is not so much incorrect as a matter of how we define our terms and set parameters for the lemmatizer. I have tried as much as possible to follow the practice of the Ancient Greek and Latin Dependency Treebank (AGLDT), and the default Backoff lemmatizer uses the treebanks as the source of its default training data. This explains why uos appears in CLTK Missing: the AGLDT lemmatizes forms of the plural pronoun uos under the second-person singular pronoun tu.
As I continue to test the lemmatizer, I will use these results to fine-tune and improve the output, trying to explain each case and making decisions such as which adverbs need to be lemmatized as adverbs, and so on. It would be great to hear comments, either on this post or in the CLTK GitHub issues, about where improvements need to be made.
There remains a final question. If the hand-lemmatized data from LASLA produces more accurate results, why use the CLTK lemmatizer at all?
Producing curated data is an expensive process in time, money, and resources. Such data is available for Virgil, but may not be for another author. What if we wanted to play the same parlor game with Lucan? I don’t know whether lemmatized data is available for Lucan, but it was a trivial task for me to rerun this experiment (with minimal preprocessing changes) on the Bellum Ciuile. (I have placed the list of DCC core words not appearing in Lucan at the bottom of this post.) And I could do the same for any text in the Latin Library just as easily.
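In fact, the whole experiment reduces to one reusable function: given raw text, a core vocabulary, and a lemmatizer, return the core lemmas the text lacks. A sketch with a trivial stand-in lemmatizer (the real runs use the CLTK Backoff lemmatizer and Latin Library texts; the lookup table and regex tokenizer below are simplifications for illustration):

```python
import re

def missing_core_lemmas(text, core_vocab, lemmatize):
    """Return the core lemmas that never occur in `text`.

    `lemmatize` is any callable mapping a list of tokens to a list of
    lemmas; in the real experiment this is the CLTK Backoff lemmatizer.
    """
    tokens = re.findall(r'[a-z]+', text.lower())
    found = set(lemmatize(tokens))
    return sorted(set(core_vocab) - found)

# Trivial stand-in lemmatizer: a lookup table with identity fallback.
LOOKUP = {'arma': 'arma', 'uirumque': 'uir', 'cano': 'cano'}
toy_lemmatizer = lambda tokens: [LOOKUP.get(t, t) for t in tokens]

core = ['arbitror', 'uir', 'arma']
print(missing_core_lemmas('Arma uirumque cano', core, toy_lemmatizer))
# ['arbitror']
```

Swapping in a different text, author, or lemmatizer is then a one-line change, which is exactly why the automatic approach scales where hand curation does not.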
Automatic lemmatizers are not perfect, but they are often good and sometimes very good. More importantly, they are getting better and, in the case of the CLTK, they are being actively developed and developers like myself can work with researchers to make the tools as good as possible.
Lemmas from the DCC Latin Core Vocabulary not found in Lucan*
(* A first draft by an automatic lemmatizer)