This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Sunday, October 23, 2011

The Zipf and Zipf-Mandelbrot distributions

In my last few posts, I have been discussing some of the consequences of the slow decay rate of the tail of the Pareto type I distribution, along with some other, closely related notions, all in the context of continuously distributed data.  Today’s post considers the Zipf distribution for discrete data, which has come to be extremely popular as a model for phenomena like word frequencies, city sizes, or sales rank data, where the values of these quantities associated with randomly selected samples can vary by many orders of magnitude.

More specifically, the Zipf distribution is defined by the probability p_i of observing the ith element of an infinite sequence of objects in a single random draw from that sequence, where this probability is given by:

           p_i = A / i^a

Here, a is a number greater than 1 that determines the rate of the distribution’s tail decay, and A is a normalization constant, chosen so that these probabilities sum to 1.  Like the continuous-valued Pareto type I distribution, the Zipf distribution exhibits a “long tail,” meaning that its tail decays slowly enough that in a random sample of objects O_i drawn from a Zipf distribution, some very large values of the index i will be observed, particularly for relatively small values of the exponent a.  In one of the earliest and most common applications of the Zipf distribution, the objects represent words in a document and i represents their rank, ranging from most frequent (i = 1) to rare (large i).  In a more business-oriented application, the objects might be products for sale (e.g., books listed on Amazon), with the index i corresponding to their sales rank.  For a fairly extensive collection of references to many different applications of the Zipf distribution, the Zipf’s law website originally from Rockefeller University is an excellent source.
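To make this definition concrete, here is a minimal R sketch that computes these probabilities.  Since the normalization constant A cannot be obtained by a finite sum over the infinite sequence, the sketch truncates the sequence at an illustrative point M; the values of a and M here are assumptions for illustration, not values used elsewhere in this post:

a <- 1.5                 # tail decay exponent (illustrative)
M <- 1e6                 # truncation point for the infinite sum (assumption)
i <- 1:M
A <- 1 / sum(1 / i^a)    # normalization constant for the truncated sequence
p <- A / i^a             # Zipf probabilities p_i
head(p)                  # the largest probabilities, for i = 1, 2, ...
sum(p)                   # equals 1 by construction

Note that for exponents this close to 1, the truncated sum converges slowly, so the computed A only approximates the normalization constant of the infinite sequence.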

In Exploring Data in Engineering, the Sciences, and Medicine, I give a brief discussion of both the Zipf distribution and the closely related Zipf-Mandelbrot distribution discussed by Benoit Mandelbrot in his book The Fractal Geometry of Nature.  The probabilities defining this distribution may be parameterized in several ways, and the one given in Exploring Data is:

           p_i = A / (1 + B i)^a

where again a is an exponent that determines the rate at which the tail of the distribution decays, and B is a second parameter whose value is strictly positive but no greater than 1.  For both the Zipf and the Zipf-Mandelbrot distributions, the exponent a must be greater than 1 for the distribution to be well-defined, greater than 2 for the mean to be finite, and greater than 3 for the variance to be finite.
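The truncated-sum sketch shown above for the Zipf distribution adapts directly to this parameterization; again, the specific values of a, B, and the truncation point M are illustrative assumptions:

a <- 2.5                           # tail decay exponent (illustrative)
B <- 0.2                           # second parameter, 0 < B <= 1
M <- 1e6                           # truncation point (assumption)
i <- 1:M
A <- 1 / sum(1 / (1 + B * i)^a)    # normalization constant
p <- A / (1 + B * i)^a             # Zipf-Mandelbrot probabilities p_i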

So far, I have been unable to find an R package that supports the generation of random samples drawn from the Zipf distribution, but the package zipfR includes the command rlnre, which generates random samples drawn from the Zipf-Mandelbrot distribution.  As I noted, this distribution can be parameterized in several different ways and, as Murphy’s law would have it, the zipfR parameterization is not the same as the one presented above and discussed in Exploring Data.  Fortunately, the conversion between these parameters is simple.  The zipfR package defines the distribution in terms of a parameter alpha that must lie strictly between 0 and 1, and a second parameter B that I will call BzipfR to avoid confusion with the parameter B in the above definition.  These parameters are related by:

           alpha = 1/a    and    BzipfR = (a - 1) B

Since the a parameter (and thus the alpha parameter in the zipfR package) determines the tail decay rate of the distribution, it is of most interest here, and the rest of this post will focus on three examples: a = 1.5 (alpha = 2/3), for which both the distribution’s mean and variance are infinite; a = 2.5 (alpha = 2/5), for which the mean is finite but the variance is not; and a = 3.5 (alpha = 2/7), for which both the mean and variance are finite.  The value of the parameter B in the Exploring Data definition of the distribution will be fixed at 0.2 in all of these examples, corresponding to values of BzipfR = 0.1, 0.3, and 0.5 for the three examples considered here.
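A small helper function (my own, hypothetical, and not part of the zipfR package) makes this conversion explicit and confirms the parameter values just quoted:

# Convert the Exploring Data parameters (a, B) to the zipfR parameters
toZipfR <- function(a, B) {
  list(alpha = 1 / a, B = (a - 1) * B)
}
toZipfR(1.5, 0.2)   # Case 1: alpha = 2/3, BzipfR = 0.1
toZipfR(2.5, 0.2)   # Case 2: alpha = 2/5, BzipfR = 0.3
toZipfR(3.5, 0.2)   # Case 3: alpha = 2/7, BzipfR = 0.5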

To generate Zipf-Mandelbrot random samples, the zipfR package uses the procedure rlnre in conjunction with the procedure lnre (the abbreviation “lnre” stands for “large number of rare events” and it represents a class of data models that includes the Zipf-Mandelbrot distribution).  Specifically, to generate a random sample of size N = 100 for the first case considered here, the following R code is executed:

> library(zipfR)
> set.seed(101)
> ZM = lnre("zm", alpha = 2/3, B = 0.1)
> zmsample = rlnre(ZM, n = 100)

The first line loads the zipfR library (which must first be installed, of course, using the install.packages command), the second line initializes the random number generator seed so the results are repeatable (as with all R random number generators, this should be done first; set.seed(101) was used for all results presented here), the third line invokes the lnre command to set up the distribution with the desired parameters, and the last line invokes the rlnre command to generate 100 random samples from this distribution.  The sample returned by the rlnre command is a vector of 100 observations of the “factor” data type, although their labels are numeric (think of the factor value “1339” as meaning “1 sample of object number 1339”).  In the results I present here, I have converted these factor responses to numerical ones so I can interpret them as numerical ranks.  This conversion is a little subtle: simply converting from factor to numeric values via something like “zmnumeric = as.numeric(zmsample)” almost certainly doesn’t give you what you want, because it returns the internal integer codes of the factor (i.e., the position of each value in the factor’s sorted set of levels), not the numeric labels themselves: the label “1339” might become the number 1, the label “73” the number 2, and so on.  To get what you want (e.g., the labels “1339” and “73” assigned to the numbers 1339 and 73, respectively), you need to first convert the factors in zmsample into characters and then convert these characters into numeric values:

> zmnumeric = as.numeric(as.character(zmsample))
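A two-line illustration of the pitfall, using the hypothetical labels from the discussion above:

f <- factor(c("1339", "73"))
as.numeric(f)                 # returns 1 2: the internal level codes
as.numeric(as.character(f))   # returns 1339 73: the intended ranks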


The three plots below show random samples drawn from each of the three Zipf-Mandelbrot distributions considered here.  In all cases, the y-axis corresponds to the number of times the object labeled i was observed in a random sample of size N = 100 drawn from the distribution with the indicated exponent.  Since the range of these indices can be quite large for the slowly decaying members of the Zipf-Mandelbrot distribution family, the plots are drawn with logarithmic x-axes, and to facilitate comparisons, the x-axes have the same range in all three plots, as do the y-axes.  In all three plots, object i = 1 occurs most often (about a dozen times in the top plot, two dozen times in the middle plot, and three dozen times in the bottom plot), and objects with larger indices occur less frequently.  The major difference between these three examples lies in the largest indices of the objects seen in the samples: we never see an object with index greater than 50 in the bottom plot, and we see only two such objects in the middle plot, while more than a third of the objects in the top plot meet this condition, with the most extreme object having index i = 115,116.
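For readers who want to reproduce plots of this general form, the following sketch counts how many times each object index appears in a converted sample and plots these counts against the index on a logarithmic x-axis; this is a plausible reconstruction, not necessarily the exact plotting code behind the figures:

counts <- table(zmnumeric)                  # occurrences of each object index
plot(as.numeric(names(counts)), as.vector(counts), log = "x",
     xlab = "Object index i", ylab = "Number of occurrences")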



As in the case of the Pareto type I distributions I discussed in several previous posts (which may be regarded as the continuous analog of the Zipf distribution), the mean is generally not a useful characterization for the Zipf distribution.  This point is illustrated in the boxplot comparison presented below, which summarizes the means computed from 1000 statistically independent random samples drawn from each of the three distributions considered here, where the object labels have been converted to numerical values as described above.  Thus, the three boxplots on the left represent the means (note the logarithmic scale on the y-axis) of these index values i generated for each random sample.  The extreme variability seen for Case 1 (a = 1.5) reflects the fact that neither the mean nor the variance is finite for this case, and the consistent reduction in the range of variability for Cases 2 (a = 2.5, finite mean but infinite variance) and 3 (a = 3.5, finite mean and variance) reflects the “shortening tail” of this distribution with increasing exponent a.  As I discussed in my last post, a better characterization than the mean for distributions like this is the “95% tail length,” corresponding to the 95% sample quantile.  Boxplots summarizing these values for the three distributions considered here are shown to the right of the dashed vertical line in the plot below.  In each case, the range of variation is much less extreme for the 95% tail length than it is for the mean, supporting the idea that this is a better characterization for data described by Zipf-like discrete distributions.
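A sketch of this simulation for a single case follows, shown here for Case 1 using the ZM model object set up earlier; repeating it with the other two lnre models gives the remaining boxplots, and the variable names are my own, not from the post:

set.seed(101)
nSim <- 1000
means <- numeric(nSim)
tail95 <- numeric(nSim)
for (k in 1:nSim) {
  x <- as.numeric(as.character(rlnre(ZM, n = 100)))   # one sample of ranks
  means[k] <- mean(x)                                 # sample mean
  tail95[k] <- quantile(x, 0.95)                      # 95% tail length
}
boxplot(list(mean = means, tail95 = tail95), log = "y")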



Other alternatives to the (arithmetic) mean that I discussed in conjunction with the Pareto type I distribution were the sample median, the geometric mean, and the harmonic mean.  The plot below compares these four characterizations for 1000 random samples, each of size N = 100, drawn from the Zipf-Mandelbrot distribution with a = 3.5 (the third case), for which the mean is well-defined.  Even here, it is clear that the mean is considerably more variable than these other three alternatives.
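Since base R has no built-in geometric or harmonic mean functions, these characterizations can be computed directly from their definitions, as in this minimal sketch for a single Case 3 sample (the object name ZM3 is mine):

ZM3 <- lnre("zm", alpha = 2/7, B = 0.5)              # Case 3: a = 3.5
x <- as.numeric(as.character(rlnre(ZM3, n = 100)))   # one sample of ranks
mean(x)              # arithmetic mean
median(x)            # sample median
exp(mean(log(x)))    # geometric mean
1 / mean(1 / x)      # harmonic mean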



Finally, the plot below shows boxplot comparisons of these alternative characterizations – the median, the geometric mean, and the harmonic mean – for all three of the distributions considered here.  Not surprisingly, Case 1 (a = 1.5) exhibits the largest variability seen for all three characterizations, but the harmonic mean is much more consistent for this case than either the geometric mean or the median.  In fact, the same observation holds – although less dramatically – for Case 2 (a = 2.5), and the harmonic mean appears more consistent than the geometric mean for all three cases.  This observation is particularly interesting in view of the connection between the harmonic mean and the reciprocal transformation, which I will discuss in more detail next time.