This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Sunday, November 27, 2011

Cleaning time-series and other data streams

The need to analyze time-series or other forms of streaming data arises frequently in many different application areas.  Examples include economic time-series like stock prices, exchange rates, or unemployment figures; biomedical data sequences like electrocardiograms or electroencephalograms; and industrial process operating data sequences like temperatures, pressures, or concentrations.  As a specific example, the figure below shows four data sequences: the upper two plots represent hourly physical property measurements, one made at the inlet of a product storage tank (the left-hand plot) and the other made at the same time at the outlet of the tank (the right-hand plot).  The lower two plots in this figure show the results of applying the data cleaning filter outlierMAD from the R package pracma, discussed further below.  The two main points of this post are first, that isolated spikes like those seen in the upper two plots at hour 291 can badly distort the results of an otherwise reasonable time-series characterization, and second, that the simple moving window data cleaning filter described here is often very effective in removing these artifacts.

[Figure: hourly physical property measurements at the storage tank inlet (upper left) and outlet (upper right), with the corresponding outlierMAD-cleaned sequences shown below each one]

This example is discussed in more detail in Section 8.1.2 of my book Discrete-Time Dynamic Models, but the key observations here are the following.  First, the large spikes seen in both of the original data sequences were caused by the simultaneous, temporary loss of both measurements and the subsequent coding of these missing values as zero by the data collection system.  The practical question of interest was to determine how long, on average, the viscous, polymeric material being fed into and out of the product storage tank was spending there.  A standard method for addressing such questions is the use of cross-correlation analysis, where the expected result is a broad peak like the heavy dashed line in the plot shown below.  The location of this peak provides an estimate of the average time spent in the tank, which is approximately 21 hours in this case, as indicated in the plot.  This result was about what was expected, and it was obtained by applying standard cross-correlation analysis to the cleaned data sequences shown in the bottom two plots above.  The lighter solid curve in the plot below shows the results of applying exactly the same analysis, but to the original data sequences instead of the cleaned data sequences.  This dramatically different plot suggests that the material is spending very little time in the storage tank: accepted uncritically, this result would imply severe fouling of the tank, suggesting a need to shut the process down and clean out the tank, an expensive and labor-intensive proposition.  The main point of this example is that the difference in these two plots is entirely due to the extreme data anomalies present in the original time-series.  Additional examples of problems caused by time-series outliers are discussed in Section 4.3 of my book Mining Imperfect Data. 

[Figure: cross-correlation analysis of the cleaned data sequences (heavy dashed curve, with a broad peak near 21 hours) and of the original data sequences (lighter solid curve)]

One of the primary features of the analysis of time-series and other streaming data sequences is the need for local data characterizations.  This point is illustrated in the plot below, which shows the first 200 observations of the storage tank inlet data sequence discussed above.  All of these observations but one are represented as open circles in this plot, but the data point at k = 110 is shown as a solid circle, to emphasize how far it lies from its immediate neighbors in the data sequence.  It is important to note that this point is not anomalous with respect to the overall range of this data sequence – it is, for example, well within the normal range of variation seen for the points from about k = 150 to k = 200 – but it is clearly anomalous with respect to those points that immediately precede and follow it.  A general strategy for automatically detecting and removing such spikes from a data sequence like this one is to apply a moving window data cleaning filter which characterizes each data point with respect to a local neighborhood of prior and subsequent samples.  That is, for each data point k in the original data sequence, this type of filter forms a cleaned data estimate based on some number J of prior data values (i.e., points k-J through k-1 in the sequence) and, in the simplest implementations, the same number of subsequent data values (i.e., points k+1 through k+J in the sequence).

[Figure: the first 200 observations of the storage tank inlet data sequence, with the local outlier at k = 110 shown as a solid circle]

The specific data cleaning filter considered here is the Hampel filter, which applies the Hampel identifier discussed in Chapter 7 of Exploring Data in Engineering, the Sciences and Medicine to this moving data window.  If the kth data point is declared to be an outlier, it is replaced by the median value computed from this data window; otherwise, the data point is not modified.  The results of applying the Hampel filter with a window half-width of J = 5 (i.e., an 11-point moving window) to the above data sequence are shown in the plot below.  The effect is to modify three of the original data points – those at k = 43, 110, and 120 – and the original values of these modified points are shown as solid circles at the appropriate locations in this plot.  It is clear that the most pronounced effect of the Hampel filter is to remove the local outlier indicated in the above figure and replace it with a value that is much more representative of the other data points in the immediate vicinity.

[Figure: the Hampel filter result, with the original values of the three modified points (k = 43, 110, and 120) shown as solid circles]

As I noted above, the Hampel filter implementation used here is that available in the R package pracma as procedure outlierMAD.  I will discuss this R package in more detail in my next post, but for those seeking a more detailed discussion of the Hampel filter in the meantime, one is freely available on-line in the form of an EDN article I wrote in 2002, Scrub data with scale-invariant nonlinear digital filters.  Also, comparisons with alternatives like the standard median filter (generally too aggressive, introducing unwanted distortion into the “cleaned” data sequence) and the center-weighted median filter (sometimes quite effective) are presented in Section 4.2 of the book Mining Imperfect Data  mentioned above.
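
For readers who want to see the moving-window logic spelled out in code, here is a minimal base R sketch of the Hampel filter described above.  It is an illustration of the algorithm rather than the pracma implementation, and the function name, defaults, and endpoint handling are all my own choices:

hampelFilter <- function(x, J = 5, t0 = 3) {
  # For each point, compute the median and the MAD scale estimate from a
  # window of J points on either side; replace the point with the window
  # median if it lies more than t0 scale estimates away from that median.
  y <- x
  n <- length(x)
  for (k in (J + 1):(n - J)) {
    window <- x[(k - J):(k + J)]
    m <- median(window)
    S <- 1.4826 * median(abs(window - m))   # MAD scale estimate
    if (abs(x[k] - m) > t0 * S) y[k] <- m   # outlier: replace with median
  }
  y   # the first and last J points are returned unmodified
}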

Friday, November 11, 2011

Harmonic means, reciprocals, and ratios of random variables

In my last few posts, I have considered “long-tailed” distributions whose probability density decays much more slowly than standard distributions like the Gaussian.  For these slowly-decaying distributions, the harmonic mean often turns out to be a much better (i.e., less variable) characterization than the arithmetic mean, which is generally not even well-defined theoretically for these distributions.  Since the harmonic mean is defined as the reciprocal of the mean of the reciprocal values, it is intimately related to the reciprocal transformation.  The main point of this post is to show how profoundly the reciprocal transformation can alter the character of a distribution, for better or worse.   One way that reciprocal transformations sneak into analysis results is through attempts to characterize ratios of random numbers.  The key issue underlying all of these ideas is the question of when the denominator variable in either a reciprocal transformation or a ratio exhibits non-negligible probability in a finite neighborhood of zero.  I discuss transformations in Chapter 12 of Exploring Data in Engineering, the Sciences and Medicine, with a section (12.7) devoted to reciprocal transformations, showing what happens when we apply them to six different distributions: Gaussian, Laplace, Cauchy, beta, Pareto, and lognormal. 

In the general case, if a random variable x has the density p(x), the density g(y) of the reciprocal y = 1/x is given by:

            g(y) = p(1/y) / y^2

As I discuss in greater detail in Exploring Data, the consequence of this transformation is typically (though not always) to convert a well-behaved distribution into a very poorly behaved one.  As a specific example, the plot below shows the effect of the reciprocal transformation on a Gaussian random variable with mean 1 and standard deviation 2.  The most obvious characteristic of this transformed distribution is its strongly asymmetric, bimodal character, but another non-obvious consequence of the reciprocal transformation is that it takes a distribution that is completely characterized by its first two moments into a new distribution with Cauchy-like tails, for which none of the integer moments exist.

[Figure: density of the reciprocal transformation of a Gaussian random variable with mean 1 and standard deviation 2]
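
A plot like this is easy to reproduce.  The sketch below overlays the theoretical density g(y) = p(1/y) / y^2 on a histogram of simulated reciprocals; the sample size and the truncation to |y| < 5 for display purposes are my own choices:

set.seed(101)
x <- rnorm(100000, mean = 1, sd = 2)
y <- 1/x
hist(y[abs(y) < 5], breaks = 200, freq = FALSE,
     main = "Reciprocal of a Gaussian", xlab = "y = 1/x")
yy <- seq(-5, 5, length.out = 1000)   # grid avoiding y = 0 exactly
lines(yy, dnorm(1/yy, mean = 1, sd = 2)/yy^2, col = "red")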

The implications of the reciprocal transformation for many other distributions are equally non-obvious.  For example, both the badly-behaved Cauchy distribution (no moments exist) and the well-behaved lognormal distribution (all moments exist, but interestingly, do not completely characterize the distribution, as I have discussed in a previous post) are invariant under the reciprocal transformation.  Also, applying the reciprocal transformation to the long-tailed Pareto type I distribution (which exhibits few or no finite moments, depending on its tail decay rate) yields a beta distribution, all of whose moments are finite.  Finally, it is worth noting that the invariance of the Cauchy distribution under the reciprocal transformation lies at the heart of the following result, presented in the book Continuous Univariate Distributions by Johnson, Kotz, and Balakrishnan (Volume 1, 2nd edition, Wiley, 1994, page 319).  They note that if the density of x is positive, continuous, and differentiable at x = 0 – all true for the Gaussian case – the distribution of the harmonic mean of N samples approaches a Cauchy limit as N becomes infinitely large.

As noted above, the key issue responsible for the pathological behavior of the reciprocal transformation is the question of whether the original data distribution exhibits nonzero probability of taking on values within a neighborhood around zero.  In particular, note that if x can only assume values larger than some positive lower limit L, it follows that 1/x necessarily lies between 0 and 1/L, which is enough to guarantee that all moments of the transformed distribution exist.  For the Gaussian distribution, even if the mean is large enough and the standard deviation is small enough that the probability of observing values less than some limit L > 0 is negligible, the fact that this probability is not zero means that the moments of any reciprocally-transformed Gaussian distribution are not finite.  As a practical matter, however, reciprocal transformations and related characterizations – like harmonic means and ratios – do become better-behaved as the probability of observing values near zero becomes negligibly small.

To see this point, consider two reciprocally-transformed Gaussian examples.  The first is the one considered above: the reciprocal transformation of a Gaussian random variable with mean 1 and standard deviation 2.  In this case, the probability that x assumes values smaller than or equal to zero is non-negligible.  Specifically, this probability is simply the cumulative distribution function for the distribution evaluated at zero, easily computed in R as approximately 31%:

> pnorm(0,mean=1,sd=2)
[1] 0.3085375

In contrast, for a Gaussian random variable with mean 1 and standard deviation 0.1, the corresponding probability is negligibly small:

> pnorm(0,mean=1,sd=0.1)
[1] 7.619853e-24

If we consider the harmonic means of these two examples, we see that the first one is horribly behaved, as all of the results presented here would lead us to expect.  In fact, the qqPlot command in the car package in R allows us to compute quantile-quantile plots for the Student’s t-distribution with one degree of freedom, corresponding to the Cauchy distribution, yielding the plot shown below.  The Cauchy-like tail behavior expected from the results presented by Johnson, Kotz and Balakrishnan is seen clearly in this Cauchy Q-Q plot, constructed from 1000 harmonic means, each computed from statistically independent samples drawn from a Gaussian distribution with mean 1 and standard deviation 2.  The fact that almost all of the observations fall within the – very wide – 95% confidence interval around the reference line suggests that the Cauchy tail behavior is appropriate here.

[Figure: Cauchy Q-Q plot of 1000 harmonic means computed from Gaussian samples with mean 1 and standard deviation 2]
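
The plot above can be reconstructed along the following lines; the size N = 100 of each underlying Gaussian sample is my assumption here, since the qualitative behavior does not depend on any particular choice:

library(car)   # provides the qqPlot command
set.seed(101)
harmonicMean <- function(x) 1/mean(1/x)
hm <- replicate(1000, harmonicMean(rnorm(100, mean = 1, sd = 2)))
qqPlot(hm, distribution = "t", df = 1)   # t with 1 df = Cauchy reference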

To further confirm this point, compare the corresponding normal Q-Q plot for the same sequence of harmonic means, shown below.  There, the extreme non-Gaussian character of these harmonic means is readily apparent from the pronounced outliers evident in both the upper and lower tails.

[Figure: normal Q-Q plot of the same 1000 harmonic means]

In marked contrast, for the second example with the mean of 1 as before but the much smaller standard deviation of 0.1, the harmonic mean is much better behaved, as the normal Q-Q plot below illustrates.  Specifically, this plot is identical in construction to the one above, except it was computed from samples drawn from the second data distribution.  Here, most of the computed harmonic mean values fall within the 95% confidence limits around the Gaussian reference line, suggesting that it is not unreasonable in practice to regard these values as approximately normally distributed, in spite of the pathologies of the reciprocal transformation.

[Figure: normal Q-Q plot of 1000 harmonic means computed from Gaussian samples with mean 1 and standard deviation 0.1]

One reason the reciprocal transformation is important in practice – particularly in connection with the Gaussian distribution – is that the desire to characterize ratios of uncertain quantities does arise from time to time.  In particular, if we are interested in characterizing the ratio of two averages, the Central Limit Theorem would lead us to expect that, at least approximately, this ratio should behave like the ratio of two Gaussian random variables.  If these component averages are statistically independent, the expected value of the ratio can be re-written as the product of the expected value of the numerator average and the expected value of the reciprocal of the denominator average, leading us directly to the reciprocal Gaussian transformation discussed here.  In fact, if these two averages are both zero mean, it is a standard result that the ratio has a Cauchy distribution (this result is presented in the same discussion from Johnson, Kotz and Balakrishnan noted above).  As in the second harmonic mean example presented above, however, it turns out that if the mean and standard deviation of the denominator variable are such that the probability of a zero or negative denominator is negligible, the distribution of the ratio may be approximated reasonably well as Gaussian.  A very readable and detailed discussion of this fact is given in the paper by George Marsaglia in the May 2006 issue of the Journal of Statistical Software.
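
As a quick illustration of this last point, the following sketch simulates the ratio of two independent Gaussian variables whose denominator is very unlikely to be near zero; the parameter choices here are mine, picked so that the normal Q-Q plot comes out nearly linear:

set.seed(101)
num <- rnorm(10000, mean = 2, sd = 0.2)
den <- rnorm(10000, mean = 1, sd = 0.1)   # P(den <= 0) is negligible
ratio <- num/den
qqnorm(ratio)   # nearly linear: approximately Gaussian behavior
qqline(ratio)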

Finally, it is important to note that the “reciprocally-transformed Gaussian distribution” I have been discussing here is not the same as the inverse Gaussian distribution, to which Johnson, Kotz and Balakrishnan devote a 39-page chapter (Chapter 15).  That distribution takes only positive values and exhibits moments of all orders, both positive and negative, and as a consequence, it has the interesting characteristic that it remains well-behaved under reciprocal transformations, in marked contrast to the Gaussian case.


Sunday, October 23, 2011

The Zipf and Zipf-Mandelbrot distributions

In my last few posts, I have been discussing some of the consequences of the slow decay rate of the tail of the Pareto type I distribution, along with some other, closely related notions, all in the context of continuously distributed data.  Today’s post considers the Zipf distribution for discrete data, which has come to be extremely popular as a model for phenomena like word frequencies, city sizes, or sales rank data, where the values of these quantities associated with randomly selected samples can vary by many orders of magnitude.

More specifically, the Zipf distribution is defined by a probability p_i of observing the ith element of an infinite sequence of objects in a single random draw from that sequence, where the probability is given by:

            p_i = A / i^a

Here, a is a number greater than 1 that determines the rate of the distribution’s tail decay, and A is a normalization constant, chosen so that these probabilities sum to 1.  Like the continuous-valued Pareto type I distribution, the Zipf distribution exhibits a “long tail,” meaning that its tail decays slowly enough that in a random sample of objects O_i drawn from a Zipf distribution, some very large values of the index i will be observed, particularly for relatively small values of the exponent a.  In one of the earliest and most common applications of the Zipf distribution, the objects considered represent words in a document and i represents their rank, ranging from most frequent (for i = 1) to rare (for large i).  In a more business-oriented application, the objects might be products for sale (e.g., books listed on Amazon), with the index i corresponding to their sales rank.  For a fairly extensive collection of references to many different applications of the Zipf distribution, a website originally hosted at Rockefeller University is an excellent source.

In Exploring Data in Engineering, the Sciences, and Medicine, I give a brief discussion of both the Zipf distribution and the closely related Zipf-Mandelbrot distribution discussed by Benoit Mandelbrot in his book The Fractal Geometry of Nature.  The probabilities defining this distribution may be parameterized in several ways, and the one given in Exploring Data is:

            p_i = A / (1 + B i)^a

where again a is an exponent that determines the rate at which the tail of the distribution decays, and B is a second parameter with a value that is strictly positive but no greater than 1.  For both the Zipf distribution and the Zipf-Mandelbrot distribution, the exponent a must be greater than 1 for the distribution to be well-defined, greater than 2 for the mean to be finite, and greater than 3 for the variance to be finite.

So far, I have been unable to find an R package that supports the generation of random samples drawn from the Zipf distribution, but the package zipfR includes the command rlnre, which generates random samples drawn from the Zipf-Mandelbrot distribution.  As I noted, this distribution can be parameterized in several different ways and, as Murphy’s law would have it, the zipfR parameterization is not the same as the one presented above and discussed in Exploring Data.  Fortunately, the conversion between these parameters is simple.  The zipfR package defines the distribution in terms of a parameter alpha that must lie strictly between 0 and 1, and a second parameter B that I will call BzipfR to avoid confusion with the parameter B in the above definition.  These parameters are related by:

            alpha = 1/a    and    BzipfR = (a - 1) B

Since the a parameter (and thus the alpha parameter in the zipfR package) determines the tail decay rate of the distribution, it is of the most interest here, and the rest of this post will focus on three examples: a = 1.5 (alpha = 2/3), for which both the distribution’s mean and variance are infinite, a = 2.5 (alpha = 2/5), for which the mean is finite but the variance is not, and a = 3.5 (alpha = 2/7), for which both the mean and variance are finite.  The value of the parameter B in the Exploring Data definition of the distribution will be fixed at 0.2 in all of these examples, corresponding to values of BzipfR = 0.1, 0.3, and 0.5 for the three examples considered here.

To generate Zipf-Mandelbrot random samples, the zipfR package uses the procedure rlnre in conjunction with the procedure lnre (the abbreviation “lnre” stands for “large number of rare events” and it represents a class of data models that includes the Zipf-Mandelbrot distribution).  Specifically, to generate a random sample of size N = 100 for the first case considered here, the following R code is executed:

> library(zipfR)
> ZM = lnre("zm", alpha = 2/3, B = 0.1)
> zmsample = rlnre(ZM, n=100)

The first line loads the zipfR library (which must first be installed, of course, using the install.packages command), the second line invokes the lnre command to set up the distribution with the desired parameters, and the last line invokes the rlnre command to generate 100 random samples from this distribution.  (As with all R random number generators, the set.seed command should be used first to initialize the random number generator seed if you want to get repeatable results; for the results presented here, I used set.seed(101).)  The sample returned by the rlnre command is a vector of 100 observations, which have the “factor” data type, although their designations are numeric (think of the factor value “1339” as meaning “1 sample of object number 1339”).  In the results I present here, I have converted these factor responses to numerical ones so I can interpret them as numerical ranks.  This conversion is a little subtle: simply converting from factor to numeric values via something like “zmnumeric = as.numeric(zmsample)” almost certainly doesn’t give you what you want, since it will convert the first-occurring factor value (which has a numeric label, say “1339”) into the number 1, convert the second-occurring value (since this is a random sequence, this might be “73”) into the number 2, etc.  To get what you want (e.g., the labels “1339” and “73” assigned to the numbers 1339 and 73, respectively), you need to first convert the factors in zmsample into characters and then convert these characters into numeric values:

            zmnumeric = as.numeric(as.character(zmsample))


The three plots below show random samples drawn from each of the three Zipf-Mandelbrot distributions considered here.  In all cases, the y-axis corresponds to the number of times the object labeled i was observed in a random sample of size N = 100 drawn from the distribution with the indicated exponent.  Since the range of these indices can be quite large in the slowly-decaying members of the Zipf-Mandelbrot distribution family, the plots are drawn with logarithmic x-axes, and to facilitate comparisons, the x-axes have the same range in all three plots, as do the y-axes.  In all three plots, object i = 1 occurs most often – about a dozen times in the top plot, two dozen times in the middle plot, and three dozen times in the bottom plot – and those objects with larger indices occur less frequently.  The major difference between these three examples lies in the largest indices of the objects seen in the samples: we never see an object with index greater than 50 in the bottom plot, we see only two such objects in the middle plot, while more than a third of the objects in the top plot meet this condition, with the most extreme object having index i = 115,116. 

[Figure: random samples of size N = 100 from the three Zipf-Mandelbrot distributions (a = 1.5, 2.5, and 3.5), plotted as object frequencies against object index on logarithmic x-axes]
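
Given the numerical conversion described above, plots like these are easy to construct; a sketch for one sample:

counts <- table(zmnumeric)                 # frequency of each object index
idx <- as.numeric(names(counts))
plot(idx, as.vector(counts), log = "x",
     xlab = "Object index i", ylab = "Number of occurrences")
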
As in the case of the Pareto type I distributions I discussed in several previous posts – which may be regarded as the continuous analog of the Zipf distribution – the mean is generally not a useful characterization for the Zipf distribution.  This point is illustrated in the boxplot comparison presented below, which summarizes the means computed from 1000 statistically independent random samples drawn from each of the three distributions considered here, where the object labels have been converted to numerical values as described above.  Thus, the three boxplots on the left represent the means – note the logarithmic scale on the y-axis – of these index values i generated for each random sample.  The extreme variability seen for Case 1 (a = 1.5) reflects the fact that neither the mean nor the variance is finite for this case, and the consistent reduction in the range of variability for Cases 2 (a = 2.5, finite mean but infinite variance) and 3 (a = 3.5, finite mean and variance) reflects the “shortening tail” of this distribution with increasing exponent a.  As I discussed in my last post, a better characterization than the mean for distributions like this is the “95% tail length,” corresponding to the 95% sample quantile. Boxplots summarizing these values for the three distributions considered here are shown to the right of the dashed vertical line in the plot below.  In each case, the range of variation seen here is much less extreme for the 95% tail length than it is for the mean, supporting the idea that this is a better characterization for data described by Zipf-like discrete distributions.

[Figure: boxplots of the means (left of the dashed vertical line) and 95% tail lengths (right) for 1000 samples from each of the three distributions; logarithmic y-axis]
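
The simulations behind these boxplots can be sketched as follows for the first case (a = 1.5); the other two cases follow by changing the alpha and B parameters as described above:

set.seed(101)
ZM1 <- lnre("zm", alpha = 2/3, B = 0.1)
sim <- replicate(1000, {
  z <- as.numeric(as.character(rlnre(ZM1, n = 100)))
  c(mean = mean(z), tail95 = unname(quantile(z, 0.95)))
})
boxplot(t(sim), log = "y")   # columns: mean and 95% tail length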

Other alternatives to the (arithmetic) mean that I discussed in conjunction with the Pareto type I distribution were the sample median, the geometric mean, and the harmonic mean.  The plot below compares these four characterizations for 1000 random samples, each of size N = 100, drawn from the Zipf-Mandelbrot distribution with a = 3.5 (the third case), for which the mean is well-defined.  Even here, it is clear that the mean is considerably more variable than these other three alternatives.

[Figure: boxplots of the mean, median, geometric mean, and harmonic mean for 1000 samples from the a = 3.5 distribution]

Finally, the plot below shows boxplot comparisons of these alternative characterizations – the median, the geometric mean, and the harmonic mean – for all three of the distributions considered here.  Not surprisingly, Case 1 (a = 1.5) exhibits the largest variability seen for all three characterizations, but the harmonic mean is much more consistent for this case than either the geometric mean or the median.  In fact, the same observation holds – although less dramatically – for Case 2 (a = 2.5), and the harmonic mean appears more consistent than the geometric mean for all three cases.  This observation is particularly interesting in view of the connection between the harmonic mean and the reciprocal transformation, which I will discuss in more detail next time.

[Figure: boxplots of the median, geometric mean, and harmonic mean for all three distributions]

Wednesday, September 28, 2011

Is the “Long Tail” a Useless Concept?

In response to my last post, “The Long Tail of the Pareto Distribution,” Neil Gunther had the following comment:

“Unfortunately, you've fallen into the trap of using the ‘long tail’ misnomer. If you think about it, it can't possibly be the length of the tail that sets distributions like Pareto and Zipf apart; even the negative exponential and Gaussian have infinitely long tails.”

He goes on to say that it is the “width” or the “weight” of the tails that is important, and that a more appropriate characterization of these “Long Tails” would be “heavy-tailed” or “power-law” distributions.

Neil’s comment raises an important point: while the term “long tail” appears a lot in both the on-line and hard-copy literature, it is often somewhat ambiguously defined.  For example, in his book, The Long Tail, Chris Anderson offers the following description (page 10):

“In statistics, curves like that are called ‘long-tailed distributions’ because the tail of the curve is very long relative to the head.”

The difficulty with this description is that it says nothing about how to measure “tail length,” forcing us to adopt our own definitions.  It is clear from Neil’s comments that the definition he adopts for “tail length” is the width of the distribution’s support set.  Under this definition, the notion of a “long-tailed distribution” is of extremely limited utility: the situation is exactly as Neil describes it, with “long-tailed distributions” corresponding to any distribution with unbounded support, including both distributions like the Gaussian and the gamma distribution where the mean is a reasonable characterization, and those like the Cauchy and the Pareto distribution where the mean doesn’t even exist.

The situation is analogous to that of confidence intervals, which characterize the uncertainty inherited by any characterization computed from a collection of uncertain (i.e., random) data values.  As a specific example consider the mean: the sample mean is the arithmetic average of N observed data samples, and it is generally intended as an estimate of the population mean, defined as the first moment of the data distribution.  A q% confidence interval around the sample mean is an interval that contains the population mean with probability at least q%.  These intervals can be computed in various ways for different data characterizations, but the key point here is that they are widely used in practice, with the most popular choices being the 90%, 95% and 99% confidence intervals, which necessarily become wider as this percentage q increases.  (For a more detailed discussion of confidence intervals, refer to Chapter 9 of Exploring Data in Engineering, the Sciences, and Medicine.)  We can, in principle, construct 100% confidence intervals, but this leads us directly back to Neil’s objection: the 100% confidence interval for the mean is the entire support set of the distribution (e.g., for the Gaussian distribution, this 100% confidence interval is the whole real line, while for any gamma distribution, it is the set of all positive numbers).  These observations suggest the following notion of “tail length” that addresses Neil’s concern while retaining the essential idea of interest in the business literature: we can compare the “q% tail length” of different distributions for some q less than 100.

In particular, consider the case of J-shaped distributions, defined as those like the Pareto type I distribution whose density p(x) decays monotonically with increasing x, approaching zero as x goes to infinity.  The plot below shows two specific examples to illustrate the idea: the solid line corresponds to the (shifted) exponential distribution:

            p(x) = e^-(x-1)

for all x greater than or equal to 1 and zero otherwise, while the dotted line represents the Pareto type I distribution with location parameter k = 1 and shape parameter a = 0.5 discussed in my last post.  Initially, as x increases from 1, the exponential density is greater than the Pareto density, but for x larger than about 3.5, the opposite is true: the exponential distribution rapidly becomes much smaller, reflecting its much more rapid rate of tail decay.

[Figure: the shifted exponential density (solid line) and the Pareto type I density with k = 1 and a = 0.5 (dotted line)]
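
A sketch that reproduces this comparison:

x <- seq(1, 10, length.out = 400)
plot(x, exp(-(x - 1)), type = "l", ylab = "p(x)")   # shifted exponential, solid
lines(x, 0.5 * x^(-1.5), lty = 3)   # Pareto I density a*k^a/x^(a+1), k = 1, a = 0.5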

For these distributions, define the q% tail length to be the distance from the minimum possible value of x (the “head” of the distribution; here, x = 1) to the point in the tail where the cumulative probability reaches q% (i.e., the value x_q where x < x_q with probability q%).  In practical terms, the q% tail length tells us how far out we have to go in the tail to account for q% of the possible cases.  In R, this value is easy to compute using the quantile function included in most families of available distribution functions.  As a specific example, for the Pareto type I distribution, the function qparetoI in the VGAM package gives us the desired quantiles for the distribution with specified values of the parameters k (designated “scale” in the qparetoI call) and a (designated “shape” in the qparetoI call).  Thus, for the case k = 1 and a = 0.5 (i.e., the dotted curve in the above plot), the “90% tail length” is given by:

> qparetoI(p=0.9,scale=1,shape=0.5)
[1] 100
>

For comparison, the corresponding shifted exponential distribution has the 90% tail length given by:

> 1 + qexp(p = 0.9)
[1] 3.302585
>

(Note that here, I added 1 to the exponential quantile to account for the shift in its domain from “all positive numbers” – the domain for the standard exponential distribution – to the shifted domain “all numbers greater than 1”.)  Since these 90% tail lengths differ by a factor of 30, they provide a sound basis for declaring the Pareto type I distribution to be “longer tailed” than the exponential distribution.

These results also provide a useful basis for assessing the influence of the decay parameter a for the Pareto distribution.  As I noted last time, two of the examples I considered did not have finite means (a = 0.5 and 1.0), and none of the four had finite variances (i.e., also a = 1.5 and 2.0), rendering moment characterizations like the mean and standard deviation fundamentally useless.  Comparing the 90% tail lengths for these distributions, however, leads to the following results:

            a = 0.5: 90% tail length = 100.000
            a = 1.0: 90% tail length =  10.000
            a = 1.5: 90% tail length =   4.642
            a = 2.0: 90% tail length =   3.162

It is clear from these results that the shape parameter a has a dramatic effect on the 90% tail length (in fact, on the q% tail length for any q less than 100).  Further, note that the 90% tail length for the Pareto type I distribution with a = 2.0 is actually a little bit shorter than that for the exponential distribution.  If we move further out into the tail, however, this situation changes.  As a specific example, suppose we compare the 98% tail lengths. For the exponential distribution, this yields the value 4.912, while for the four Pareto shape parameters we have the following results:

            a = 0.5: 98% tail length = 2,500.000
            a = 1.0: 98% tail length =    50.000
            a = 1.5: 98% tail length =    13.572
            a = 2.0: 98% tail length =     7.071

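Both tables can be generated in one call each, assuming qparetoI recycles its shape argument as most R quantile functions do (if not, the call can be wrapped in sapply):

library(VGAM)
a <- c(0.5, 1.0, 1.5, 2.0)
qparetoI(p = 0.90, scale = 1, shape = a)   # 100.000 10.000 4.642 3.162
qparetoI(p = 0.98, scale = 1, shape = a)   # 2500.000 50.000 13.572 7.071
1 + qexp(p = 0.98)                         # 4.912, the shifted exponential value
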
This value (i.e., the 98% tail length) seems a particularly appropriate choice to include here since in his book, The Long Tail, Chris Anderson notes that his original presentations on the topic were entitled “The 98% Rule,” reflecting the fact that he was explicitly considering how far out you had to go into the tail of a distribution of goods (e.g., the books for sale by Amazon) to account for 98% of the sales.

Since this discussion originally began with the question, “when are averages useless?” it is appropriate to note that, in contrast to the much better-known average, the “q% tail length” considered here is well-defined for any proper distribution.   As the examples discussed here demonstrate, this characterization also provides a useful basis for quantifying the “Long Tail” behavior that is of increasing interest in business applications like Internet marketing.  Thus, if we adopt this measure for any q value less than 100%, the answer to the title question of this post is, “No: The Long Tail is a useful concept.”

The downside of this minor change is that – as the results shown here illustrate – the results obtained using the q% tail length depend on the value of q we choose.  In my next post, I will explore the computational issues associated with that choice.

Saturday, September 17, 2011

The Long Tail of the Pareto Distribution

In my last two posts, I have discussed cases where the mean is of little or no use as a data characterization.  One of the specific examples I discussed last time was the case of the Pareto type I distribution, for which the density is given by:

                        p(x) = a k^a / x^(a+1)

defined for all x > k, where k and a are numeric parameters that define the distribution.  In the example I discussed last time, I considered the case where a = 1.5, which exhibits a finite mean (specifically, the mean is 3 for this case), but an infinite variance.  As the results I presented last time demonstrated, the extreme data variability of this distribution renders the computed mean too variable to be useful.  Another reason this distribution is particularly interesting is that it exhibits essentially the same tail behavior as the discrete Zipf distribution; there, the probability that a discrete random variable x takes its ith value is:

                        p_i = A / i^c,

where A is a normalization constant and c is a parameter that determines how slowly the tail decays.  This distribution was originally proposed to characterize the frequency of words in long documents (the Zipf-Estoup law), it was investigated further by Zipf in the mid-twentieth century in a wide range of applications (e.g., the distributions of city sizes), and it has become the subject of considerable recent attention as a model for “long-tailed” business phenomena (for a non-technical introduction to some of these business phenomena, see the book by Chris Anderson, The Long Tail).  I will discuss the Zipf distribution further in a later post, but one of the reasons for discussing the Pareto type I distribution first is that since it is a continuous distribution, the math is easier, meaning that more characterization results are available for the Pareto distribution. 

[Figure: Pareto type I probability densities for k = 1 and a = 0.5, 1.0, 1.5, and 2.0]
The mean of the Pareto type I distribution is:

                        Mean = ak/(a-1),

provided a > 1, and the variance of the distribution is finite only if a > 2.  Plots of the probability density defined above for this distribution are shown above, for k = 1 in all cases, and with a taking the values 0.5, 1.0, 1.5, and 2.0.  (This is essentially the same plot as Figure 4.17 in Exploring Data in Engineering, the Sciences, and Medicine, where I give a brief description of the Pareto type I distribution.)  Note that all of the cases considered here are characterized by infinite variance, while the first two (a = 0.5 and 1.0) are also characterized by infinite means.  As the results presented below emphasize, the mean represents a very poor characterization in practice for data drawn from any of these distributions, but there are alternatives, including the familiar median that I have discussed previously, along with two others that are more specific to the Pareto type I distribution: the geometric mean and the harmonic mean. 

The plot below emphasizes the point made above about the extremely limited utility of the mean as a characterization of Pareto type I data, even in cases where it is theoretically well-defined.  Specifically, this plot compares the four characterizations I discuss here – the mean (more precisely known as the “arithmetic mean” to distinguish it from the other means considered here), the median, the geometric mean, and the harmonic mean – for 1000 statistically independent Pareto type I data sequences, each of length N = 400, with parameters k = 1 and a = 2.0.  For this example, the mean is well-defined (specifically, it is equal to 2), but compared with the other data characterizations, its variability is much greater, reflecting the more serious impact of this distribution’s infinite variance on the mean than on these other data characterizations.

[Figure: boxplots of the arithmetic mean, median, geometric mean, and harmonic mean for 1000 Pareto type I samples with k = 1 and a = 2.0]
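
This comparison can be reproduced along the following lines, using the rparetoI random generator from the VGAM package (the companion to the qparetoI quantile function used in my last post); the seed and plotting details are my own choices:

library(VGAM)
set.seed(101)
est <- replicate(1000, {
  x <- rparetoI(400, scale = 1, shape = 2)
  c(arithmetic = mean(x), median = median(x),
    geometric = exp(mean(log(x))), harmonic = 1/mean(1/x))
})
boxplot(t(est), log = "y")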

To give a more complete view of the extreme variability of the arithmetic mean, the boxplots below summarize the means computed from 1000 statistically independent samples drawn from each of the four Pareto type I distribution examples plotted above.  As before, each sample is of size N = 400 and the parameter k has the value 1, but here the computed arithmetic means are shown for the parameter values a = 0.5, 1.0, 1.5, and 2.0; note the log scale used here because the range of computed means is so large.  For the first two of these examples, the population mean does not exist, so it is not surprising that the computed values span such an enormous range, but even when the mean is well-defined, the influence of the infinite variance of these cases is clearly evident.  It may be argued that infinite variance is an extreme phenomenon, but it is worth emphasizing here that for the specific “long tail” distributions popular in many applications, the decay rate is sufficiently slow for the variance – and sometimes even the mean – to be infinite, as in these examples.

[Figure: boxplots of the computed arithmetic means for a = 0.5, 1.0, 1.5, and 2.0; logarithmic y-axis]

As I have noted several times in previous posts, the median is much better behaved than the mean, so much so that it is well-defined for any proper distribution.  One of the advantages of the Pareto type I distribution is that the form of the density function is simple enough that the median of the distribution can be computed explicitly from the distribution parameters.  This result is given in the fabulous book by Johnson, Kotz and Balakrishnan that I have mentioned previously, which devotes an entire chapter (Chapter 20) to the Pareto family of distributions.  Specifically, the median of the Pareto type I distribution with parameters k and a is given by:

                        Median = 2^(1/a) k

Thus, for the four examples considered here, the median values are 4.0 (for a = 0.5), 2.0 (for a = 1.0), 1.587 (for a = 1.5), and 1.414 (for a = 2.0).  Boxplot summaries for the same 1000 random samples considered above are shown in the plot below, which also includes horizontal dotted lines at these theoretical median values for the four distributions.  The fact that these lines correspond closely with the median lines in the boxplots gives an indication that the computed median is, on average, in good agreement with the correct values it is attempting to estimate.  As in the case of the arithmetic means, the variability of these estimates decreases monotonically as a increases, corresponding to the fact that the distribution becomes generally better-behaved as the a parameter increases.

[Figure: boxplots of the computed medians for the four cases, with dotted lines at the theoretical median values]

The geometric mean is an alternative characterization to the more familiar arithmetic mean, one that is well-defined for any sequence of positive numbers.  Specifically, the geometric mean of N positive numbers is defined as the Nth root of their product.  Equivalently, the geometric mean may be computed by exponentiating the arithmetic average of the log-transformed values.  In the case of the Pareto type I distribution, the utility of the geometric mean is closely related to the fact that the log transformation converts a Pareto-distributed random variable into an exponentially distributed one, a point that I will discuss further in a later post on data transformations.  (These transformations are the topic of Chapter 12 of Exploring Data, where I briefly discuss both the logarithmic transformation on which the geometric mean is based and the reciprocal transformation on which the harmonic mean is based, described next.)   The key point here is that the following simple expression is available for the geometric mean of the Pareto type I distribution (Johnson, Kotz, and Balakrishnan, page 577):

                        Geometric Mean = k exp(1/a)

For the four specific examples considered here, these geometric mean values are approximately 7.389 (for a = 0.5), 2.718 (for a = 1.0), 1.948 (for a = 1.5), and 1.649 (for a = 2.0).  The boxplots shown below summarize the range of variation seen in the computed geometric means for the same 1000 statistically independent samples considered above.  Again, the horizontal dotted lines indicate the correct values for each distribution, and it may be seen that the computed values are in good agreement, on average.  As before, the variability of these computed values decreases with increasing a values as the distribution becomes better-behaved.

[Figure: boxplots of the computed geometric means for the four cases, with dotted lines at the theoretical values]

The fourth characterization considered here is the harmonic mean, again appropriate to positive values, and defined as the reciprocal of the average of the reciprocal data values.  In the case of the geometric mean just discussed, the log transformation on which it is based is often useful in improving the distributional character of data values that span a wide range.  In the case of the Pareto type I distribution – and a number of others – the reciprocal transformation on which the harmonic mean is based also improves the behavior of the data distribution, but in general the opposite is more common.  In particular, reciprocal transformations often make the character of a data distribution much worse: applied to the extremely well-behaved standard uniform distribution, this transformation yields the Pareto type I distribution with a = 1, for which none of the integer moments exist; similarly, applied to the Gaussian distribution, the reciprocal transformation yields a result that has infinite variance and is bimodal.  (A little thought suggests that the reciprocal transformation is inappropriate for the Gaussian distribution because it is not strictly positive, but normality is a favorite working assumption, sometimes applied to the denominators of ratios, leading to a number of theoretical difficulties.  I will have more to say about that in a future post.)  For the case of the Pareto type I distribution, the reciprocal transformation converts it into the extremely well-behaved beta distribution, and the harmonic mean has the following simple expression:

            Harmonic mean = k (1 + 1/a)

For the four examples considered here, this expression yields harmonic mean values of 3 (for a = 0.5), 2 (for a = 1.0), 1.667 (for a = 1.5), and 1.5 (for a = 2.0).  Boxplot summaries of the computed harmonic means for the 1000 simulations of each case considered previously are shown below, again with dotted horizontal lines at the theoretical values for each case.  As with both the median and the geometric mean, it is clear from these plots that the computed values are correct on average, and their variability decreases with increasing values of the a parameter.

[Figure: boxplots of the computed harmonic means for the four cases, with dotted lines at the theoretical values]
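
The closed-form expressions used for the reference lines in these last three plots are easy to evaluate directly for k = 1:

a <- c(0.5, 1.0, 1.5, 2.0)
2^(1/a)    # medians:         4.000 2.000 1.587 1.414
exp(1/a)   # geometric means: 7.389 2.718 1.948 1.649
1 + 1/a    # harmonic means:  3.000 2.000 1.667 1.500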

The key point of this post has been to show that, while averages are not suitable characterizations for the “long tailed” phenomena attracting increasing interest in many different fields, useful alternatives do exist.  For the case of the Pareto type I distribution considered here, these alternatives include the popular median, along with the somewhat less well-known geometric and harmonic means.  In an upcoming post, I will examine the utility of these characterizations for the Zipf distribution.

Saturday, August 27, 2011

Some Additional Thoughts on Useless Averages

In my last post, I described three situations where the average of a sequence of numbers is not representative enough to be useful: in the presence of severe outliers, in the face of multimodal data distributions, and in the face of infinite-variance distributions.  The post generated three interesting comments that I want to respond to here.

First and foremost, I want to say thanks to all of you for giving me something to think about further, leading me in some interesting new directions.  First, chrisbeeleyimh had the following to say:

“I seem to have rather abandoned means and medians in favor of drawing the distribution all the time, which baffles my colleagues somewhat.”

Chris also maintains a collection of data examples where the mean is the same but the shape is very different.  In fact, one of the points I illustrate in Section 4.4.1 of Exploring Data in Engineering, the Sciences, and Medicine is that there are cases where not only the means but all of the moments (i.e., variance, skewness, kurtosis, etc.) are identical but the distributions are profoundly different.  A specific example is taken from the book Counterexamples in Probability, 2nd Edition by J.M. Stoyanov, who shows that if the lognormal density is multiplied by the following function:

            f(x) = 1 + A sin(2 pi ln x),

for any constant A between -1 and +1, the moments are unchanged.  The character of the distribution is changed profoundly, however, as the following plot illustrates (this plot is similar to Fig. 4.8 in Exploring Data, which shows the same two distributions, but for A = 0.5 instead of A = 0.9, as shown here).  To be sure, this behavior is pathological – distributions that have finite support, for example, are defined uniquely by their complete set of moments – but it does make the point that moment characterizations are not always complete, even if an infinite number of them are available.  Within well-behaved families of distributions (such as the one proposed by Karl Pearson in 1895), a complete characterization is possible on the basis of the first few moments, which is one reason for the historical popularity of the method of moments for fitting data to distributions.  It is important to recognize, however, that moments do have their limitations and that the first moment alone – i.e., the mean by itself – is almost never a complete characterization.  (I am forced to say “almost” here because if we impose certain very strong distributional assumptions – e.g., the Poisson or binomial distributions – the specific distribution considered may be fully characterized by its mean.  This begs the question, however, of whether this distributional assumption is adequate.  My experience has been that, no matter how firmly held the belief in a particular distribution is, exceptions do arise in practice … overdispersion, anyone?) 

[Figure: the lognormal density and Stoyanov’s moment-equivalent modification with A = 0.9]
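
The plot can be reconstructed directly from Stoyanov’s construction; the sketch below applies it to the standard lognormal density (that parameter choice is my assumption, but the construction works for any lognormal):

x <- seq(0.01, 5, length.out = 500)
f0 <- dlnorm(x)                              # lognormal density
f1 <- f0 * (1 + 0.9 * sin(2 * pi * log(x)))  # same moments, A = 0.9
plot(x, f0, type = "l", ylab = "density")
lines(x, f1, lty = 2)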

The plot below provides a further illustration of the inadequacy of the mean as a sole data characterization, comparing four different members of the family of beta distributions.  These distributions – in the standard form assumed here – describe variables whose values range from 0 to 1, and they are defined by two parameters, p and q, that determine the shape of the density function and all moments of the distribution.  The mean of the beta distribution is equal to p/(p+q), so if p = q – corresponding to the class of symmetric beta distributions – the mean is ½, regardless of the common value of these parameters.  The four plots below show the corresponding distributions when both parameters are equal to 0.5 (upper left, the arcsin distribution I discussed last time), 1.0 (upper right, the uniform distribution), 1.5 (lower left), and 8.0 (lower right). 

[Figure: symmetric beta densities with p = q = 0.5 (upper left), 1.0 (upper right), 1.5 (lower left), and 8.0 (lower right)]
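
These four densities can be drawn with the dbeta function (the grid endpoints are trimmed slightly because the density is unbounded at 0 and 1 when p = q = 0.5):

x <- seq(0.001, 0.999, length.out = 500)
par(mfrow = c(2, 2))
for (p in c(0.5, 1.0, 1.5, 8.0)) {
  plot(x, dbeta(x, p, p), type = "l", main = paste("p = q =", p))
}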

The second comment on my last post was from Efrique, who suggested the Student’s t-distribution with 2 degrees of freedom as a better infinite-variance example than the Cauchy example I used (corresponding to Student’s t-distribution with one degree of freedom), because the first moment doesn’t even exist for the Cauchy distribution (“there’s nothing to converge to”).  The figure below expands the boxplot comparison I presented last time, comparing the means, medians, and modes (from the modeest package) for both of these infinite-variance examples: the Cauchy distribution I discussed last time and the Student’s t-distribution with two degrees of freedom that Efrique suggested.  Here, the same characterization (mean, median, or mode) is summarized for both distributions in side-by-side boxplots to facilitate comparisons.  It is clear from these boxplots that the results for the median and the mode are essentially identical for these distributions, but the results for the mean differ dramatically (recall that these results are truncated for the Cauchy distribution: 13.6% of the 1000 computed means fell outside the +/- 5 range shown here, exhibiting values approaching +/- 1000).  This difference illustrates Efrique’s further point that the mean of the data values is a consistent estimator of the (well-defined) population mean of the Student’s t-distribution with 2 degrees of freedom, while it is not a consistent estimator for the Cauchy distribution.  Still, it is also clear from this plot that the mean is substantially more variable for the Student’s t-distribution with 2 degrees of freedom than either the median or the modeest mode estimate.

[Figure: side-by-side boxplots of means, medians, and modes for the Cauchy and Student’s t (2 df) samples]
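
A sketch of the underlying simulation for the Student’s t case; rather than depend on the modeest interface, it uses a simple shorth-style mode estimate, and the sample size N = 100 is my assumption:

halfSampleMode <- function(x) {
  # midpoint of the shortest interval containing half the sorted data
  x <- sort(x)
  n <- length(x); h <- ceiling(n/2)
  widths <- x[h:n] - x[1:(n - h + 1)]
  i <- which.min(widths)
  (x[i] + x[i + h - 1])/2
}
set.seed(101)
stats <- replicate(1000, {
  x <- rt(100, df = 2)
  c(mean = mean(x), median = median(x), mode = halfSampleMode(x))
})
boxplot(t(stats))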

Another example of an infinite-variance distribution where the mean is well-defined but highly variable is the Pareto type I distribution, discussed in Section 4.5.8 of Exploring Data.  My favorite reference on distributions is the two-volume set by Johnson, Kotz, and Balakrishnan (Continuous Univariate Distributions, Vol. 1 (Wiley Series in Probability and Statistics) and Continuous Univariate Distributions, Vol. 2 (Wiley Series in Probability and Statistics)), who devote an entire 55-page chapter (Chapter 20 in Volume 1) to the Pareto distribution, noting that it is named after Vilfredo Pareto, a mid-nineteenth- to early twentieth-century Swiss professor of economics, who proposed it as a description of the distribution of income over a population.  In fact, there are several different distributions named after Pareto; the type I distribution considered here exhibits power-law tail decay like the Student’s t-distributions, but it is a J-shaped distribution whose mode is equal to its minimum value.  More specifically, this distribution is defined by a location parameter that determines this minimum value and a shape parameter that determines how rapidly the tail decays for values larger than this minimum.  The example considered here takes this minimum value as 1 and the shape parameter as 1.5, giving a distribution with a finite mean but an infinite variance.  As in the above example, the boxplot summary shown below characterizes the mean, median, and mode for 1000 statistically independent random samples drawn from this distribution, each of size N = 100.  As before, it is clear from this plot that the mean is much more highly variable than either the median or the mode.

[Figure: boxplots of the mean, median, and mode for 1000 Pareto type I samples with k = 1 and a = 1.5, with horizontal lines at the theoretical mode (solid), median (dashed), and mean (dotted)]

In this case, however, we have the added complication that since this distribution is not symmetric, its mean, median and mode do not coincide.  In fact, the population mode is the minimum value (which is 1 here), corresponding to the solid line at the bottom of the plot.  The narrow range of the boxplot values around this correct value suggests that the modeest package is reliably estimating this mode value, but as I noted in my last post, this characterization is not useful here because it tells us nothing about the rate at which the density decays.  The theoretical median value can also be calculated easily for this distribution, and here it is approximately equal to 1.587, corresponding to the dashed horizontal line in the plot.  As with the mode, it is clear from the boxplot that the median estimated from the data is in generally excellent agreement with this value.  Finally, the mean value for this particular distribution is 3, corresponding to the dotted horizontal line in the plot.  Since this line lies fairly close to the upper quartile of the computed means (i.e., the top of the “box” in the boxplot), it follows that the estimated mean falls below the correct value almost 75% of the time, but it is also clear that when the mean is overestimated, the extent of this overestimation can be very large.  Motivated in part by the fact that the mean doesn’t always exist for the Pareto distribution, Johnson, Kotz and Balakrishnan note in their chapter on these distributions that alternative location measures have been considered, including both the geometric and harmonic means.  I will examine these ideas further in a future post.

Finally, klr mentioned my post on useless averages in his blog TimelyPortfolio, where he discusses alternatives to the moving average in characterizing financial time-series.  For the case he considers, klr compares a 10-month moving average, the corresponding moving median, and a number of the corresponding mode estimators from the modeest package.  This is a very interesting avenue of exploration for me since it is closely related to the median filter and other nonlinear digital filters that can be very useful in cleaning noisy time-series data.  I discuss a number of these ideas – including moving-window extensions of other data characterizations like skewness and kurtosis – in my book Mining Imperfect Data: Dealing with Contamination and Incomplete Records.

Again, thanks to all of you for your comments.  You have given me much to think about and investigate further, which is one of the joys of doing this blog.