Cleaning up a Latin Library text

In a recent post, I made the false claim that there are 16,667,542 words in the Latin Library. This is false because the PlaintextCorpusReader uses the CLTK Latin tokenizer to split up the corpus into “words,” or more precisely, tokens, and some of these tokens are not words. This list of tokens includes a lot of punctuation and a lot of numbers, which, depending on our research question, may not be all that useful. If I want to know the most frequent word in the Latin Library, I probably do not want #1 to be the 1.37 million commas. So, in this post, we will “clean up” our data and return more useful word count information. (Note that you will need to have at least version 0.1.43 of CLTK for this tutorial.)

Let’s start with something small, like the works of Catullus that we saw in this post.

from cltk.corpus.latin import latinlibrary
files = latinlibrary.fileids()
catullus_raw = latinlibrary.raw('catullus.txt')

For today’s post, we will work directly with the raw string for preprocessing. Here is what the first 1,000 characters of the string look like (with some vertical whitespace removed):

print(catullus_raw[:1000])

>>> Catullus
>>> 
>>> C. VALERIVS CATVLLVS 
>>> 
>>> 1 2 2b 3 4 5 6 7 8 9 10 11 12 13 14 14b 15 16 17 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 58b 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 95b 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116
>>> 
>>> I. ad Cornelium 
>>> Cui dono lepidum novum libellum 
>>> arida modo pumice expolitum? 
>>> Corneli, tibi: namque tu solebas 
>>> meas esse aliquid putare nugas. 
>>> Iam tum, cum ausus es unus Italorum 
>>> omne aevum tribus explicare cartis... 
>>> Doctis, Iuppiter, et laboriosis! 
>>> Quare habe tibi quidquid hoc libelli— 
>>> qualecumque, quod, o patrona virgo, 
>>> plus uno maneat perenne saeclo!
>>> 
>>> II. fletus passeris Lesbiae 
>>> Passer, deliciae meae puellae, 
>>> quicum ludere, quem in sinu tenere, 
>>> cui primum digitum dare appetenti 
>>> et acris solet incitare

Our goal will be to preprocess this text so that we can best answer the question: “What are the 25 most frequent words in Catullus’ poetry?” Five steps we can take to improve the results are: 1. make the whole text lowercase, 2. remove punctuation, 3. remove numbers, 4. normalize u/v spelling, and 5. remove English words that appear in our plaintext file. We will perform all five operations on the text string and then tokenize the result.

catullus_edit = catullus_raw # Work on a separate variable, leaving the raw string intact

# 1. Make the whole text lowercase
# Use 'lower' string method

catullus_edit = catullus_edit.lower()

# 2. Remove punctuation
# Use 'translate'

from string import punctuation

translator = str.maketrans({key: " " for key in punctuation})
catullus_edit = catullus_edit.translate(translator)

# 3. Remove numbers
# Again, use 'translate'

translator = str.maketrans({key: " " for key in '0123456789'})
catullus_edit = catullus_edit.translate(translator)
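To see what the `maketrans`/`translate` pair is doing, here is a small self-contained illustration (the sample line is invented for demonstration; it is not from the tutorial's corpus). Every punctuation character and digit is mapped to a space, and `translate` applies that mapping in one pass over the string:

```python
from string import punctuation

# A made-up sample line mixing digits, punctuation, and Latin text
sample = "1 2 3 Corneli, tibi: namque tu solebas."

# Map every punctuation character and every digit to a space
table = str.maketrans({ch: " " for ch in punctuation + "0123456789"})
cleaned = sample.translate(table)

print(cleaned)  # digits and punctuation are now spaces
```

Mapping to a space (rather than deleting outright) is deliberate: it prevents two tokens from fusing together when the punctuation between them disappears.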

# 4. Normalize u/v
# Use CLTK 'JVReplacer'

from cltk.stem.latin.j_v import JVReplacer
replacer = JVReplacer()

catullus_edit = replacer.replace(catullus_edit)
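If you do not have CLTK at hand, the effect of `JVReplacer` can be approximated with the standard library alone. This is an illustrative sketch, not CLTK's actual implementation: it simply maps v to u and j to i, reflecting the fact that classical Latin orthography does not distinguish these letter pairs:

```python
# Stdlib-only sketch approximating CLTK's JVReplacer (illustrative,
# not the library's real code): map v->u and j->i in both cases.
def normalize_jv(text: str) -> str:
    table = str.maketrans({"j": "i", "v": "u", "J": "I", "V": "U"})
    return text.translate(table)

print(normalize_jv("novum iudicium"))  # -> 'nouum iudicium'
```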

# 5. Remove English words that appear in our plaintext file
# Use 'replace'

remove_list = ['the', 'latin', 'library', 'classics', 'page']
remove_dict = {key: ' ' for key in remove_list}

for k, v in remove_dict.items():
    catullus_edit = catullus_edit.replace(k,v)
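One caveat with plain `str.replace`: it deletes the target wherever it appears as a substring, so a word like 'the' would also be cut out of the middle of a longer token. A word-boundary-safe alternative (a sketch, not part of the original workflow) uses a regex with `\b` markers so that only whole words are removed:

```python
import re

remove_list = ['the', 'latin', 'library', 'classics', 'page']

# \b anchors ensure we only match complete words, not substrings
pattern = re.compile(r'\b(?:' + '|'.join(remove_list) + r')\b')

sample = "the latin library catullus page"
print(pattern.sub(' ', sample))  # only 'catullus' survives as a word
```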

Here is what the preprocessed, i.e. ‘cleaned-up’, Catullus looks like:

print(catullus_edit[469:599])

>>> cui dono lepidum nouum libellum 
>>> arida modo pumice expolitum 
>>> corneli tibi namque tu solebas 
>>> meas esse aliquid putare nugas

We are now in a much better position to answer the question: “What are the 25 most frequent words in Catullus’ poetry?” As we did in a previous post, let’s use Counter to get the top words.

# Find the most commonly occurring 'words' (i.e., tokens) in Catullus:

# Tokenize catullus_edit
from cltk.tokenize.word import WordTokenizer
word_tokenizer = WordTokenizer('latin')

catullus_words = word_tokenizer.tokenize(catullus_edit)
# Count the most common words in the list catullus_words

from collections import Counter

c = Counter(catullus_words)
print(c.most_common(25))

>>> [('et', 194), ('est', 188), ('-que', 175), ('in', 160), ('ad', 142), ('non', 141), ('te', 107), ('cum', 103), ('ut', 101), ('nec', 92), ('quod', 85), ('sed', 80), ('si', 76), ('quae', 74), ('tibi', 70), ('me', 69), ('mihi', 68), ('o', 63), ('qui', 60), ('aut', 58), ('quam', 53), ('atque', 51), ('nam', 49), ('esse', 48), ('hymenaee', 48)]

Et, est, -que, in, ad, non: we see a lot of the words we expect to see based on the entire corpus. But we also see some more interesting results, such as the high proportion of second-person pronouns te (#7) and tibi (#15). I think it is also fair to say that Catullus’ poems are going to be Latin literature’s only text that returns hymenaee in the top 25.

In the next post, we will continue to think about preprocessing and how to extract the most meaningful information from our texts by looking at stop words.
