Replicating Zipf


As usually formulated, Zipf’s law states that, in a natural-language corpus, the frequency of a word is inversely proportional to its frequency rank. The results of a recent post on word frequencies in Latin suggested that Zipf’s law would hold up for Latin, and I wanted to test it to be sure. I was working with Seneca’s Epistulae Morales when I came across an interesting bit of trivia in R.E. Wyllys’s article, “Empirical and Theoretical Bases of Zipf’s Law”:
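Put a little more concretely (my gloss, not Zipf’s wording): if f(r) is the frequency of the word of rank r, then f(r) ∝ 1/r, so the second most frequent word occurs roughly half as often as the most frequent, the third roughly a third as often, and so on.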

In this next book, The Psycho-Biology of Language, published in 1935, Zipf called attention for the first time to the phenomenon that has come to bear his name. This book contained Zipf’s first diagram of the log(frequency)-v.-log(rank) relationship, a Zipf curve for his count of words in the Latin writings of Plautus.

Plautus now seemed much more fun to work with than Seneca. So I decided to write a script that would replicate Zipf’s original experiment on the texts of Plautus using Python and available online texts.

Digging into Psycho-Biology—which has the incredible subtitle An Introduction to Dynamic Philology—I learned the following about Zipf’s method (pp. 24-25):

With all the words of four Plautine plays (Aulularia, Mostellaria, Pseudolus, and Trinummus) selected for material, the average number of syllables in each frequency category was computed. …The average number of syllables of all words occurring once was 3.23, of those occurring twice, 2.92, etc.

Zipf combined his Plautine experiment with a study of morpheme length in colloquial Chinese and in the English of American newspapers. For all three, he concludes (p. 27) that “a statistical relationship has been established between high frequency, small variety, and shortness in length, a relationship which is presumably valid for language in general.”

So, Zipf’s experiment with the plays of Plautus involved not the distribution of words themselves, but the relationship between word frequency and the number of syllables per word. That is not what I had expected to be working with, say, for the Senecan letters, but it is an interesting problem nevertheless, and one no less tractable in Python.

Here is his chart of word-syllable frequency in Plautus:

[Figure: Zipf’s chart of word-syllable frequency in Plautus, from The Psycho-Biology of Language]

To replicate Zipf’s method, I did the following (see the sketch after this list):

1. I downloaded the texts of the four plays from Tesserae. These files were for the most part already preprocessed (e.g. character names preceding lines have already been removed, unlike in the Latin Library texts).
2. I computed the number of syllables in each word, estimating this by the number of vowels per word.
3. I created a frequency table using Pandas, grouping words by their frequency and averaging the number of syllables.

(All of the code for this post can be found at https://github.com/diyclassics/zipf/blob/master/zipf.ipynb.)
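For readers who want a feel for the pipeline before clicking through to the notebook, here is a minimal sketch of those three steps. The file names, the tokenizer, and the vowel-counting rule are my own simplifying assumptions, not necessarily what the linked notebook does; in particular, counting vowels ignores diphthongs and consonantal i/u.

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical local copies of the four Tesserae texts.
FILES = ['aulularia.txt', 'mostellaria.txt', 'pseudolus.txt', 'trinummus.txt']

def tokenize(text):
    """Lowercase and keep runs of letters -- a rough stand-in for a real Latin tokenizer."""
    return re.findall(r'[a-z]+', text.lower())

def count_syllables(word):
    """Estimate syllable count by counting vowels (ignores diphthongs, consonantal i/u)."""
    return sum(1 for ch in word if ch in 'aeiouy')

tokens = []
for path in FILES:
    with open(path, encoding='utf-8') as f:
        tokens.extend(tokenize(f.read()))

freqs = Counter(tokens)  # word -> number of occurrences

df = pd.DataFrame({'word': list(freqs), 'occurrences': list(freqs.values())})
df['syllables'] = df['word'].apply(count_syllables)

# Group words by how often they occur, then count the words in each group
# and average their (estimated) syllable counts.
table = (df.groupby('occurrences')
           .agg(words=('word', 'size'), avg_syll=('syllables', 'mean'))
           .reset_index())

print(table.head(10))
```

Run against the Tesserae files, something along these lines should produce a table like the one below.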

Here is what my chart based on the Tesserae texts of Plautus looks like:

occurrences    words    avg_syll
          1     5461    3.274309
          2     1199    2.937448
          3      494    2.777328
          4      301    2.714286
          5      152    2.631579
          6      137    2.598540
          7       84    2.440476
          8       73    2.438356
          9       51    2.411765
etc.

The numbers are similar, which is encouraging. The differences, I assume, come from two main sources. First, Zipf, as far as I can determine, does not name which edition of Plautus he used for this study or, perhaps just as likely, which concordance or wordlist. I have 35,215 tokens to work with; Zipf has 33,094. Close, but not ideal. More encouraging are the counts near the top of the table. Zipf has 5,429 words that appear once where I have 5,461, a difference of only 32 words. For words appearing twice, Zipf has 1,198; I have only one more. So the total variation at least seems to be distributed throughout the list.

Second, Zipf does not explain how he determined the number of syllables per word. I used vowel-counting to keep my own experiment rooted in the text and replicable. It is impossible for me to know whether Zipf was consistent, or even correct, in syllabifying words (I’ll assume he was!). Again, the numbers are more encouraging than not: for single-occurrence words, Zipf has an average of 3.23 syllables where I show 3.27; for words that appear twice, it is his average of 2.92 against my 2.94.

We are now ready to plot these numbers “upon double logarithmic graph-paper,” or the Matplotlib equivalent, the loglog function. Here is a comparison of what Zipf got and what I get:

[Figure: Zipf’s original log-log plot of the Plautus data]

[Figure: my log-log plot of the Tesserae Plautus data, with a line of best fit]

Again, encouragingly close. I will note that Zipf plotted “the orderliness of the distribution of words” (i.e., the downward-sloping line) as an ab² = k relationship, where a is the number of words with a given number of occurrences and b the number of occurrences. I instead plotted a line of best fit using Numpy and Matplotlib, which comes out very close. I will look at the relationship between these two ideas in a future post.
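For completeness, here is a sketch of how such a plot and fit can be produced from the `table` DataFrame in the snippet above; it is my own sketch, not the exact code in the linked notebook. Note one connection between the two ideas already: if ab² = k held exactly, then log a = log k − 2 log b, so the fitted slope on log-log axes should come out near −2.

```python
import numpy as np
import matplotlib.pyplot as plt

b = table['occurrences'].values   # number of occurrences
a = table['words'].values         # number of words with that many occurrences

# Least-squares fit in log space: log10(a) = slope * log10(b) + intercept
slope, intercept = np.polyfit(np.log10(b), np.log10(a), 1)

plt.loglog(b, a, 'o', label='Plautus (Tesserae texts)')
plt.loglog(b, 10**intercept * b**slope, '-',
           label=f'best fit, slope ≈ {slope:.2f}')
plt.xlabel('Number of occurrences')
plt.ylabel('Number of words')
plt.legend()
plt.show()
```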

Zipf concludes his chapter with the following comment (pp. 47-48):

The high degree of orderliness of the distribution of words in the stream of speech points unmistakably to a tendency to maintain an equilibrium in the stream of speech between frequency on the one hand and what may tentatively be termed variety on the other.

The graphs above suggest as much. But Zipf’s conclusions are not the main point of this post. Rather, this post is meant to show that we now have the texts and methods at hand to replicate past experiments that had to be done by analog means, and whose specific, yet critical, methodological details can be hard to trace. I can point you to exactly the texts and exactly the code I used to derive my plot. Coding is a series of decisions based on an input and resulting in an output. So is a good argument. If I can put myself somewhere in the middle with a computational humanities approach, I feel like I am making some progress.

Next up, a look at the distribution of Seneca’s vocabulary as originally scheduled.
