In August, the Dickinson College Commentaries blog featured a post on common Latin words that are not found in Virgil’s Aeneid. Author Chris Francese refers to the post as a “diverting Latin parlor game” and in that spirit of diversion I’d like play along and push the game further.
The setup is as follows, to quote the post:
Take a very common Latin word (in the DCC Latin Core Vocabulary) that does not occur in Vergil’s Aeneid, and explain its absence. Why would Vergil avoid certain lemmata (dictionary head words) that are frequent in preserved Latin?
So, Virgil avoids words such as aegre, arbitror, auctoritas, beneficium, etc. and it is up to us to figure out why. An interesting question and by asking the question, Francese enters a fascinating conversation on Latin poetic diction which includes Bertil Axelson, Gordon Williams, Patricia Watson, and many others (myself included, I suppose). But my goal in this post is not so much to answer the “why?” posed in the quote above, but more to investigate the methods through which we can start the conversation.
The line in Francese’s post that got me thinking was this:
The Vergilian data comes from LASLA (no automatic lemmatizers were used, all human inspection), as analyzed by Seth Levin.
It just so happened that when this post came out, I was completing a summer-long project building an “automatic lemmatizer” for Latin for the Classical Language Toolkit. So my first reaction to the post was to see how close I could get to the DCC Blog’s list using the new lemmatizer. The answer is pretty close.
[I have published a Jupyter Notebook with the code for these results here: https://github.com/diyclassics/dcc-lemma/blob/master/Parlor%20Game%2C%20Revisited.ipynb.]
There are 75 lemmas from the DCC Latin Core Vocabulary that do not appear in the Aeneid (DCC Missing). Using the Backoff Latin lemmatizer on the Latin Library text of the Aeneid (CLTK Missing), I returned a list of 119 lemmas. There are somewhere around 6100 unique lemmas in the Aeneid meaning that our results only differ by 0.7%.
The results from CLTK Missing show 69 out of 75 lemmas (92%) from the DCC list. The six lemmas that it missed are:
[‘eo’, ‘mundus’, ‘plerusque’, ‘reliquus’, ‘reuerto’, ‘solum’]
Some of these can be easily explained. Reliqui (from relinquo) was incorrectly lemmatized as reliquus—an error. Mundus was lemmatized correctly and so appears in the list of Aeneid lemmas, just not the one on DCC Missing, i.e. mundus (from mundus, a, um = ‘clean’). A related problem with both eo and solum—homonyms of both these words appear in the list of Aeneid lemmas. (See below on the issue of lemmatizing adverbs/adjectives, adjective/nouns, etc.) Plerusque comes from parsing error in my preprocessing script, where I split the DCC list on whitespace. Since this word is listed as plērus- plēra- plērumque, plerus- made it into reference list, but not plerusque. (I could have fixed this, but I thought it was better in this informal setting to make clear the full range on small errors that can creep into a text processing “parlor game” like this.) Lastly, is reverto wrong? The LASLA lemma is revertor which—true enough—does not appear on the DCC Core Vocabulary, but this is probably too fine a distinction. Lewis & Short, e.g., lists reverto and revertor as the headword.
This leaves 50 lemmas returned in CLTK Missing that are—compared to DCC Missing—false positives. The list is as follows:
[‘aduersus’, ‘alienus’, ‘aliquando’, ‘aliquis’, ‘aliter’, ‘alius’, ‘animal’, ‘antequam’, ‘barbarus’, ‘breuiter’, ‘certe’, ‘citus’, ‘ciuitas’, ‘coepi’, ‘consilium’, ‘diuersus’, ‘exsilium’, ‘factum’, ‘feliciter’, ‘fore’, ‘forte’, ‘illuc’, ‘ingenium’, ‘item’, ‘longe’, ‘male’, ‘mare’, ‘maritus’, ‘pauci’, ‘paulo’, ‘plerus’, ‘praeceptum’, ‘primum’, ‘prius’, ‘proelium’, ‘qua’, ‘quantum’, ‘quomodo’, ‘singuli’, ‘subito’, ‘tantum’, ‘tutus’, ‘ualidus’, ‘uarius’, ‘uere’, ‘uero’, ‘uictoria’, ‘ultimus’, ‘uolucer’, ‘uos’]
To be perfectly honest, you learn more about the lemmatizer than the Aeneid from this list and this is actually very useful data for uncovering places where the CLTK tools can be improved.
So, for example, there are a number of adverbs on this list (breuiter, certe, tantum, etc.). These are cases where the CLTK lemmatizer return the associated adjective (so breuis, certus, tantus). This is a matter of definition. That is, the CLTK result is more different than wrong. We can debate whether some adverbs deserve to be given their own lemma, but is still that—a debate. (Lewis & Short, e.g. has certe listed under certus, but a separate entry for breuiter.)
The DCC Blog post makes a similar point about nouns and adjectives:
At times there might be some lemmatization issues (for example barbarus came up in the initial list of excluded core words, since Vergil avoids the noun, though he uses the adjective twice. I deleted it from this version.
This explains why barbarus appears on CLTK Missing. Along the same line, factum has been lemmatized under facio. Again, not so much incorrect, but a matter of how we define our terms and set parameters for the lemmatizer. I have tried as much as possible to follow the practice of the Ancient Greek and Latin Dependency Treebank and the default Backoff lemmatizer uses the treebanks as the source of its default training data. This explains why uos appears in CLTK Missing—the AGLDT lemmatizes forms of uos as the second-person singular pronoun tu.
As I continue to test the lemmatizer, I will use these results to fine tune and improve the output, trying to explain each case and make decisions such as which adverbs need to be lemmatized as adverbs and so on. It would be great to hear comments, either on this post or in the CLTK Github issues, about where improvements need to be made.
There remains a final question. If the hand lemmatized data from LASLA produces more accurate results, why use the CLTK lemmatizer at all?
It is an expensive process—time/money/resources—to produce curated data. This data is available for Virgil, but may not be for another author. What if we wanted to play the same parlor game with Lucan? I don’t know whether lemmatized data is available for Lucan, but I was a trivial task for me to rerun this experiment (with minimal preprocessing changes) on the Bellum Ciuile. (I placed the list of DCC core words not appearing in Lucan at the bottom of this post.) And I could do it for any text in the Latin Library just as easily.
Automatic lemmatizers are not perfect, but they are often good and sometimes very good. More importantly, they are getting better and, in the case of the CLTK, they are being actively developed and developers like myself can work with researchers to make the tools as good as possible.
Lemmas from the DCC Latin Core Vocabulary not found in Lucan*
(* A first draft by an automatic lemmatizer)
accido
adhibeo
aduersus
aegre
alienus
aliquando
aliquis
aliter
alius
amicitia
antequam
arbitror
auctoritas
autem
beneficium
bos
breuiter
celebro
celeriter
centum
certe
ceterum
citus
ciuitas
coepi
cogito
comparo
compono
condicio
confiteor
consilium
consuetudo
conuiuium
deinde
desidero
dignitas
disciplina
diuersus
dormio
edico
egregius
epistula
existimo
exspecto
factum
familia
fere
filia
fore
forte
frumentum
gratia
hortor
illuc
imperator
impleo
impono
ingenium
initium
integer
interim
interrogo
intersum
ita
itaque
item
legatus
libido
longe
magnitudo
maiores
male
mare
maritus
memoria
mulier
multitudo
narro
nauis
necessitas
negotium
nemo
oportet
oratio
pauci
paulo
pecunia
pertineo
plerumque
plerus
poeta
postea
posterus
praeceptum
praesens
praesidium
praeterea
primum
princeps
principium
priuatus
prius
proelium
proficiscor
proprius
puella
qua
quantum
quattuor
quemadmodum
quomodo
ratio
sanctus
sapiens
sapientia
scientia
seruus
singuli
statim
studeo
subito
suscipio
tantum
tempestas
tutus
ualidus
uarius
uere
uero
uictoria
uinum
uitium
ultimus
uoluntas
uos
utrum