Making a Keyword-in-Context index with CLTK


The “keyword-in-context” (KWIC) index was an innovation of early information retrieval, the basic concepts of which were developed in the late 1950s by H.P. Luhn. The idea is to produce a list of all occurrences of a word, aligned so that the word is printed as a column in the center of the text, with the corresponding context printed to its immediate left and right. This allows a user to scan a large number of uses in a given text quickly. For example, David Packard’s 1968 A Concordance to Livy uses an alphabetical KWIC format. Here are the first entries for the preposition e in Packard’s concordance:
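The alignment idea is simple enough to sketch in plain Python. What follows is only an illustration, not how Packard or NLTK build their concordances: the function name kwic_lines and its window parameter are made up for this example, context is counted in tokens rather than characters, and matching is exact.

```python
def kwic_lines(tokens, keyword, window=3):
    """Return one aligned line per occurrence of keyword in tokens.

    Hypothetical helper for illustration only: context is measured in
    tokens and matching is exact (case-sensitive).
    """
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    # Right-justify every left context to the same width so that the
    # keywords line up in a single column.
    width = max((len(left) for left, _, _ in hits), default=0)
    return [f"{left:>{width}} {kw} {right}" for left, kw, right in hits]


sample = "verus amicus numquam reperietur ; blandus amicus a vero".split()
for line in kwic_lines(sample, "amicus", window=2):
    print(line)
```

Padding the left context is what produces the keyword column; everything else is just slicing the token list around each match.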

[Figure: screenshot of the first entries for e in Packard’s concordance]

Using the Classical Language Toolkit and the Natural Language Toolkit’s Text module, we can easily create KWICs for texts in the Latin Library.

[This post assumes that you have already imported the Latin Library corpus as described in an earlier post and, as always, that you are running the latest version of CLTK on Python 3.6. This tutorial was tested on v. 0.1.56.]

First, we can import a text from the Latin Library—here, Cicero’s De amicitia—as a list of words:

In [1]: from cltk.corpus.latin import latinlibrary
In [2]: amicitia_words = latinlibrary.words('cicero/amic.txt')
In [3]: print(amicitia_words[117:188])
Out [3]: ['Q.', 'Mucius', 'augur', 'multa', 'narrare', 'de', 'C.', 'Laelio', 'socero', 'suo', 'memoriter', 'et', 'iucunde', 'solebat', 'nec', 'dubitare', 'illum', 'in', 'omni', 'sermone', 'appellare', 'sapientem', ';', 'ego', 'autem', 'a', 'patre', 'ita', 'eram', 'deductus', 'ad', 'Scaevolam', 'sumpta', 'virili', 'toga', ',', 'ut', ',', 'quoad', 'possem', 'et', 'liceret', ',', 'a', 'senis', 'latere', 'numquam', 'discederem', ';', 'itaque', 'multa', 'ab', 'eo', 'prudenter', 'disputata', ',', 'multa', 'etiam', 'breviter', 'et', 'commode', 'dicta', 'memoriae', 'mandabam', 'fieri', '-que', 'studebam', 'eius', 'prudentia', 'doctior', '.']

We can then convert this list of words to an NLTK Text:

In [4]: import nltk
In [5]: amicitia_text = nltk.Text(amicitia_words)
In [6]: print(type(amicitia_text))
Out [6]: <class 'nltk.text.Text'>

Now that we have an NLTK text, there are several methods available to us, including “concordance,” which generates a KWIC for us based on keywords that we provide. Here, for example, is the NLTK concordance for ‘amicus’:

In [7]: amicitia_text.concordance('amicus')
Out [7]: Displaying 5 of 5 matches:
tentiam . Quonam enim modo quisquam amicus esse poterit ei , cui se putabit in
 optare , ut quam saepissime peccet amicus , quo plures det sibi tamquam ansas
escendant . Quamquam Ennius recte : Amicus certus in re incerta cernitur , tam
m in amicitiam transferetur , verus amicus numquam reperietur ; est enim is qu
itas . [ 95 ] Secerni autem blandus amicus a vero et internosci tam potest adh

The KWICs generated by NLTK Text are case insensitive (see amicus and Amicus above) and sorted sequentially by location in the text. The method offers little customization: you can set the context width and the number of lines presented, e.g. amicitia_text.concordance('amicus', width=50, lines=3), but not much else. Admittedly, it is pretty basic. It does not return an identification or location code to help the user move easily to the wider context, and the only reason we know that the fifth match is in Chapter 95 is that the chapter number happens to be included in the context. At the same time, it is another step towards combining existing resources and tools (here, NLTK Text and a CLTK corpus) to explore Latin literature from different angles.
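To make the missing location information concrete, here is a rough sketch of what returning it could look like when working directly on a word list. It is not part of NLTK or CLTK: the name kwic_with_locations is hypothetical, and it assumes that section numbers are tokenized as three separate tokens ('[', '95', ']'), as the concordance output above suggests.

```python
def kwic_with_locations(tokens, keyword, window=4):
    """Yield (offset, section, left, match, right) for each hit.

    Hypothetical helper: matching ignores case, offset is the token
    index, and section is the most recent bracketed number seen before
    the hit (None if there has been none yet).
    """
    section = None
    target = keyword.lower()
    last = len(tokens) - 1
    for i, token in enumerate(tokens):
        # Assumption: a section marker appears as the tokens '[', 'N', ']'.
        is_marker = (
            token.isdecimal()
            and 0 < i < last
            and tokens[i - 1] == "["
            and tokens[i + 1] == "]"
        )
        if is_marker:
            section = int(token)
        elif token.lower() == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            yield i, section, left, token, right


sample = (
    "Amicus certus in re incerta cernitur "
    "[ 95 ] Secerni autem blandus amicus a vero"
).split()
for hit in kwic_with_locations(sample, "amicus"):
    print(hit)
```

Each hit now carries a token offset that can be used to slice the original word list for wider context, plus a section number wherever the text supplies one.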

In a future post, I will build a KWIC method from scratch that offers more flexibility, especially with respect to context scope and location identification.