More Tokenizing Latin Text


When I first started working on the CLTK Latin tokenizer, I wrote a blog post explaining tokenization in general and showing some of the advantages of using a language-specific tokenizer. At that point, the most important feature of the CLTK Latin tokenizer was its ability to split tokens on the enclitic ‘-que’. Since then, I have added several more features, described below. As in the last post, the code below assumes Python 3.4, NLTK 3, and the current version of CLTK.
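If you do not already have CLTK set up, it is available on PyPI; a quick note on setup (exact version pinning for Python 3.4 and NLTK 3 may vary with your environment):

pip install cltk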

Start by importing the Latin word tokenizer with the following code:

from cltk.tokenize.word import WordTokenizer
word_tokenizer = WordTokenizer('latin')

The following code demonstrates the current features of the tokenizer:

# -que
# V. Aen. 1.1
text = "Arma virumque cano, Troiae qui primus ab oris"
word_tokenizer.tokenize(text)

>>> ['Arma', 'que', 'virum', 'cano', ',', 'Troiae', 'qui', 'primus', 'ab', 'oris']


# -ne
# Cic. Orat. 1.226.1
text = "Potestne virtus, Crasse, servire istis auctoribus, quorum tu praecepta oratoris facultate complecteris?"
word_tokenizer.tokenize(text)

>>> ['ne', 'Potest', 'virtus', ',', 'Crasse', ',', 'servire', 'istis', 'auctoribus', ',', 'quorum', 'tu', 'praecepta', 'oratoris', 'facultate', 'complecteris', '?']


# -ve
# Catull. 14.4-5
text = "Nam quid feci ego quidve sum locutus, cur me tot male perderes poetis?"
word_tokenizer.tokenize(text)

>>> ['Nam', 'quid', 'feci', 'ego', 've', 'quid', 'sum', 'locutus', ',', 'cur', 'me', 'tot', 'male', 'perderes', 'poetis', '?']


# -'st' contractions
# Prop. 2.5.1-2

text = "Hoc verumst, tota te ferri, Cynthia, Roma, et non ignota vivere nequitia?"

word_tokenizer.tokenize(text)

>>> ['Hoc', 'verum', 'est', ',', 'tota', 'te', 'ferri', ',', 'Cynthia', ',', 'Roma', ',', 'et', 'non', 'ignota', 'vivere', 'nequitia', '?']

# Plaut. Capt. 937
text = "Quid opust verbis? lingua nullast qua negem quidquid roges."

word_tokenizer.tokenize(text)

>>> ['Quid', 'opus', 'est', 'verbis', '?', 'lingua', 'nulla', 'est', 'qua', 'negem', 'quidquid', 'roges.']


# 'nec' and 'neque'
# Cic. Phil. 13.14

text = "Neque enim, quod quisque potest, id ei licet, nec, si non obstatur, propterea etiam permittitur."

word_tokenizer.tokenize(text)

>>> ['que', 'Ne', 'enim', ',', 'quod', 'quisque', 'potest', ',', 'id', 'ei', 'licet', ',', 'c', 'ne', ',', 'si', 'non', 'obstatur', ',', 'propterea', 'etiam', 'permittitur.']


# '-n' for '-ne'
# Plaut. Amph. 823

text = "Cenavin ego heri in navi in portu Persico?"

word_tokenizer.tokenize(text)

>>> ['Cenavi', 'ne', 'ego', 'heri', 'in', 'navi', 'in', 'portu', 'Persico', '?']


# Contractions with 'si'; also handles 'sultis'
# Plaut. Bacch. 837-38

text = "Dic sodes mihi, bellan videtur specie mulier?"

word_tokenizer.tokenize(text)

>>> ['Dic', 'si', 'audes', 'mihi', ',', 'bella', 'ne', 'videtur', 'specie', 'mulier', '?']


There are still improvements to be made, but this already handles a high percentage of Latin tokenization cases. If you have ideas for more cases that should be handled, or if you spot any errors, let me know.
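As a quick recap, the same tokenizer instance can be reused across all of the sentences above; here is a minimal sketch that simply loops over them and prints the results, using only the tokenize method shown earlier:

from cltk.tokenize.word import WordTokenizer

word_tokenizer = WordTokenizer('latin')

# The example sentences from above
sentences = [
    "Arma virumque cano, Troiae qui primus ab oris",
    "Potestne virtus, Crasse, servire istis auctoribus, quorum tu praecepta oratoris facultate complecteris?",
    "Nam quid feci ego quidve sum locutus, cur me tot male perderes poetis?",
    "Hoc verumst, tota te ferri, Cynthia, Roma, et non ignota vivere nequitia?",
    "Quid opust verbis? lingua nullast qua negem quidquid roges.",
    "Neque enim, quod quisque potest, id ei licet, nec, si non obstatur, propterea etiam permittitur.",
    "Cenavin ego heri in navi in portu Persico?",
    "Dic sodes mihi, bellan videtur specie mulier?",
]

# Print the token list for each sentence
for sentence in sentences:
    print(word_tokenizer.tokenize(sentence))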
