CLTK: Importing the Latin Library as a Corpus

Here is a quick tutorial to help users import the Latin Library as a corpus that they can use to explore the Latin language with the Classical Language Toolkit. [This tutorial assumes that you are running Python 3 and the current version of the CLTK on Mac OS X (10.11). The documentation for Importing Corpora is available in the CLTK docs.]

Let’s begin by opening up a new session in Terminal and running Python. Type the following:

from cltk.corpus.utils.importer import CorpusImporter
corpus_importer = CorpusImporter('latin')

First, we import the CLTK CorpusImporter. This is the general class used for importing any of the available CLTK corpora in any language. Next, we create an instance of the class that will specifically help us to import Latin materials. Note that CorpusImporter takes the language you want to work with as an argument, here 'latin'.

You can get a list of the corpora for this language that are currently available by typing the following:

corpus_importer.list_corpora

At the time of writing, the following corpora are available:

['latin_text_perseus', 'latin_treebank_perseus', 'latin_text_lacus_curtius', 'latin_text_latin_library', 'phi5', 'phi7', 'latin_proper_names_cltk', 'latin_models_cltk', 'latin_pos_lemmata_cltk', 'latin_treebank_index_thomisticus', 'latin_lexica_perseus', 'latin_training_set_sentence_cltk', 'latin_word2vec_cltk', 'latin_text_antique_digiliblt', 'latin_text_corpus_grammaticorum_latinorum']
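Since the list mixes plain-text corpora with treebanks, models and lexica, it can help to narrow it down before choosing what to import. Here is a minimal sketch in plain Python that keeps only the names beginning with 'latin_text_'. The helper name text_corpora is my own, and the prefix is just a naming pattern visible in the list above, not a rule the CLTK guarantees (phi5 and phi7, for example, do not follow it).

```python
# The list as printed above; in a live session you could use
# corpus_importer.list_corpora in place of this hard-coded copy.
available = [
    'latin_text_perseus', 'latin_treebank_perseus', 'latin_text_lacus_curtius',
    'latin_text_latin_library', 'phi5', 'phi7', 'latin_proper_names_cltk',
    'latin_models_cltk', 'latin_pos_lemmata_cltk',
    'latin_treebank_index_thomisticus', 'latin_lexica_perseus',
    'latin_training_set_sentence_cltk', 'latin_word2vec_cltk',
    'latin_text_antique_digiliblt', 'latin_text_corpus_grammaticorum_latinorum',
]


def text_corpora(names):
    """Keep only the names that start with the 'latin_text_' prefix.

    Hypothetical helper for illustration; the prefix is an assumption
    based on the names shown in the list above.
    """
    return [name for name in names if name.startswith('latin_text_')]


for name in text_corpora(available):
    print(name)
```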

We want to import 'latin_text_latin_library'. This corpus can be downloaded by passing its name to the following CLTK function:

corpus_importer.import_corpus('latin_text_latin_library')

(When given a single argument, this function downloads the corpus from the CLTK GitHub repo if it is available. Note that corpora can also be loaded locally by providing the filepath to the corpus as a second argument. This is covered in the documentation.)

Assuming everything runs properly, you should now have a new folder in your home directory called cltk_data, and inside that folder you should have the following path: /latin/text/latin_text_latin_library/. This is where your new local Latin Library corpus is located. If you explore this folder, you will find hundreds of text files from the Latin Library ready for you to work with. In an upcoming post, I will explain some strategies for working with this corpus in CLTK projects.
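If you would rather confirm the download from Python than from Finder, here is a minimal sketch that uses only the standard library. It assumes the default location described above and that the corpus files end in .txt; the function name count_text_files is my own and is not part of the CLTK.

```python
from pathlib import Path

# Default location described above; adjust this if your cltk_data
# folder lives somewhere else.
CORPUS_DIR = Path.home() / "cltk_data" / "latin" / "text" / "latin_text_latin_library"


def count_text_files(corpus_dir):
    """Return the number of .txt files under corpus_dir, including subfolders.

    Returns 0 if the folder does not exist.
    """
    corpus_dir = Path(corpus_dir)
    if not corpus_dir.is_dir():
        return 0
    return sum(1 for path in corpus_dir.rglob("*.txt") if path.is_file())


if __name__ == "__main__":
    total = count_text_files(CORPUS_DIR)
    if total:
        print(f"Found {total} text files in {CORPUS_DIR}")
    else:
        print(f"No text files found in {CORPUS_DIR}; the import may not have completed.")
```

A count of zero usually means either that the import did not finish or that cltk_data is not in your home directory.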
