This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Saturday, August 27, 2011

Some Additional Thoughts on Useless Averages

In my last post, I described three situations where the average of a sequence of numbers is not representative enough to be useful: in the presence of severe outliers, in the face of multimodal data distributions, and in the face of infinite-variance distributions.  The post generated three interesting comments that I want to respond to here.

First and foremost, I want to say thanks to all of you for giving me something to think about further and for leading me in some interesting new directions.  To begin, chrisbeeleyimh had the following to say:

           
“I seem to have rather abandoned means and medians in favor of drawing the distribution all the time, which baffles my colleagues somewhat.”

Chris also maintains a collection of data examples where the mean is the same but the shape is very different.  In fact, one of the points I illustrate in Section 4.4.1 of Exploring Data in Engineering, the Sciences, and Medicine is that there are cases where not only the means but all of the moments (i.e., variance, skewness, kurtosis, etc.) are identical but the distributions are profoundly different.  A specific example is taken from the book Counterexamples in Probability, 2nd Edition by J.M. Stoyanov, who shows that if the lognormal density is multiplied by the following function:

                       
f(x) = 1 + A sin(2 pi ln x),

for any constant A between -1 and +1, the moments are unchanged.  The character of the distribution is changed profoundly, however, as the following plot illustrates (this plot is similar to Fig. 4.8 in Exploring Data, which shows the same two distributions, but for A = 0.5 rather than the A = 0.9 used here).  To be sure, this behavior is pathological – distributions with finite support, for example, are defined uniquely by their complete set of moments – but it does make the point that moment characterizations are not always complete, even if an infinite number of them are available.  Within well-behaved families of distributions (such as the one proposed by Karl Pearson in 1895), a complete characterization is possible on the basis of the first few moments, which is one reason for the historical popularity of the method of moments for fitting data to distributions.  It is important to recognize, however, that moments do have their limitations and that the first moment alone – i.e., the mean by itself – is almost never a complete characterization.  (I am forced to say “almost” here because if we impose certain very strong distributional assumptions – e.g., the Poisson or binomial distributions – the specific distribution considered may be fully characterized by its mean.  This raises the question, however, of whether the distributional assumption is adequate.  My experience has been that, no matter how firmly held the belief in a particular distribution is, exceptions do arise in practice … overdispersion, anyone?) 
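For readers who want to see this effect for themselves, here is a minimal R sketch (my own construction, not the code used for the figure) overlaying the standard lognormal density with the modified density for A = 0.9:

            # standard lognormal density (solid) vs. the moment-preserving
            # modification obtained by multiplying it by 1 + A*sin(2*pi*ln(x)), A = 0.9 (dashed)
            A <- 0.9
            curve(dlnorm(x), from = 0.01, to = 5, n = 500, lwd = 2,
                  xlab = "x", ylab = "Density")
            curve(dlnorm(x) * (1 + A * sin(2 * pi * log(x))),
                  from = 0.01, to = 5, n = 500, lty = 2, lwd = 2, add = TRUE)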



The plot below provides a further illustration of the inadequacy of the mean as a sole data characterization, comparing four different members of the family of beta distributions.  These distributions – in the standard form assumed here – describe variables whose values range from 0 to 1, and they are defined by two parameters, p and q, that determine the shape of the density function and all moments of the distribution.  The mean of the beta distribution is equal to p/(p+q), so if p = q – corresponding to the class of symmetric beta distributions – the mean is ½, regardless of the common value of these parameters.  The four plots below show the corresponding distributions when both parameters are equal to 0.5 (upper left, the arcsin distribution I discussed last time), 1.0 (upper right, the uniform distribution), 1.5 (lower left), and 8.0 (lower right). 
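A plot array like this one is easy to generate in R; here is a minimal sketch (my own construction, not the code used for the figure):

            # symmetric beta densities: all four have mean 1/2
            par(mfrow = c(2, 2))
            for (p in c(0.5, 1.0, 1.5, 8.0)) {
              curve(dbeta(x, shape1 = p, shape2 = p), from = 0.001, to = 0.999,
                    n = 500, xlab = "x", ylab = "Density",
                    main = paste("p = q =", p))
            }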



The second comment on my last post was from Efrique, who suggested the Student’s t-distribution with 2 degrees of freedom as a better infinite-variance example than the Cauchy example I used (which corresponds to the Student’s t-distribution with one degree of freedom), because the first moment doesn’t even exist for the Cauchy distribution (“there’s nothing to converge to”).  The figure below expands the boxplot comparison I presented last time, comparing the means, medians, and modes (from the modeest package) for both of these infinite-variance examples: the Cauchy distribution I discussed last time and the Student’s t-distribution with two degrees of freedom that Efrique suggested.  Here, the same characterization (mean, median, or mode) is summarized for both distributions in side-by-side boxplots to facilitate comparisons.  It is clear from these boxplots that the results for the median and the mode are essentially identical for these distributions, but the results for the mean differ dramatically (recall that these results are truncated for the Cauchy distribution: 13.6% of the 1000 computed means fell outside the +/- 5 range shown here, exhibiting values approaching +/- 1000).  This difference illustrates Efrique’s further point that the mean of the data values is a consistent estimator of the (well-defined) population mean of the Student’s t-distribution with 2 degrees of freedom, while it is not a consistent estimator for the Cauchy distribution.  Still, it is also clear from this plot that the mean is substantially more variable for the Student’s t-distribution with 2 degrees of freedom than either the median or the modeest mode estimate.
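Here is a rough sketch of the kind of simulation behind these boxplots (my own reconstruction, not the exact code used to generate the figure); it assumes the mlv function from the modeest package with its default settings, whose return format has varied somewhat across package versions:

            library(modeest)
            set.seed(1)
            oneSample <- function(df) {
              x <- rt(100, df = df)                   # df = 1 gives the Cauchy distribution
              c(mean = mean(x), median = median(x),
                mode = as.numeric(mlv(x, method = "hrm")))   # older modeest: mlv(...)$M
            }
            cauchy <- t(replicate(1000, oneSample(df = 1)))
            tTwo   <- t(replicate(1000, oneSample(df = 2)))
            boxplot(cauchy[, "mean"], tTwo[, "mean"],
                    cauchy[, "median"], tTwo[, "median"],
                    cauchy[, "mode"], tTwo[, "mode"],
                    names = c("mean/C", "mean/t2", "median/C", "median/t2",
                              "mode/C", "mode/t2"),
                    ylim = c(-5, 5))                  # truncate to show the median/mode detail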



Another example of an infinite-variance distribution where the mean is well-defined but highly variable is the Pareto type I distribution, discussed in Section 4.5.8 of Exploring Data.  My favorite reference on distributions is the two-volume set by Johnson, Kotz, and Balakrishnan (Continuous Univariate Distributions, Vol. 1 (Wiley Series in Probability and Statistics) and Continuous Univariate Distributions, Vol. 2 (Wiley Series in Probability and Statistics)), who devote an entire 55-page chapter (Chapter 20 in Volume 1) to the Pareto distribution, noting that it is named after Vilfredo Pareto, a mid nineteenth- to early twentieth-century Swiss professor of economics, who proposed it as a description of the distribution of income over a population.  In fact, there are several different distributions named after Pareto; the type I distribution considered here exhibits power-law tail decay like the Student’s t-distributions, but it is a J-shaped distribution whose mode is equal to its minimum value.  More specifically, this distribution is defined by a location parameter that determines this minimum value and a shape parameter that determines how rapidly the tail decays for values larger than this minimum.  The example considered here takes this minimum value as 1 and the shape parameter as 1.5, giving a distribution with a finite mean but an infinite variance.  As in the above example, the boxplot summary shown below characterizes the mean, median, and mode for 1000 statistically independent random samples drawn from this distribution, each of size N = 100.  As before, it is clear from this plot that the mean is much more highly variable than either the median or the mode. 
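R has no built-in generator for this distribution, but type I Pareto samples are easy to draw by inverting the cumulative distribution function.  The sketch below (my own construction, with an arbitrary seed) shows the generator, the theoretical median and mean, and the simulation summarized in the boxplots that follow:

            # type I Pareto random samples via the inverse-CDF method
            rparetoI <- function(n, xmin = 1, alpha = 1.5) xmin / runif(n)^(1 / alpha)
            xmin <- 1; alpha <- 1.5
            xmin * 2^(1 / alpha)             # population median, approximately 1.587
            alpha * xmin / (alpha - 1)       # population mean = 3 (the variance is infinite)
            library(modeest)
            set.seed(2)
            results <- t(replicate(1000, {
              x <- rparetoI(100)
              c(mean = mean(x), median = median(x),
                mode = as.numeric(mlv(x, method = "hrm")))
            }))
            boxplot(as.data.frame(results))
            abline(h = c(xmin, xmin * 2^(1 / alpha), 3), lty = c(1, 2, 3))   # mode, median, mean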



In this case, however, we have the added complication that since this distribution is not symmetric, its mean, median, and mode do not coincide.  In fact, the population mode is the minimum value (which is 1 here), corresponding to the solid line at the bottom of the plot.  The narrow range of the boxplot values around this correct value suggests that the modeest package is reliably estimating the mode, but as I noted in my last post, this characterization is not useful here because it tells us nothing about the rate at which the density decays.  The theoretical median value can also be calculated easily for this distribution, and here it is approximately equal to 1.587, corresponding to the dashed horizontal line in the plot.  As with the mode, it is clear from the boxplot that the median estimated from the data is generally in excellent agreement with this value.  Finally, the mean value for this particular distribution is 3, corresponding to the dotted horizontal line in the plot.  Since this line lies fairly close to the upper quartile of the computed means (i.e., the top of the “box” in the boxplot), it follows that the estimated mean falls below the correct value almost 75% of the time, but it is also clear that when the mean is overestimated, the extent of this overestimation can be very large.  Motivated in part by the fact that the mean doesn’t always exist for the Pareto distribution, Johnson, Kotz, and Balakrishnan note in their chapter on these distributions that alternative location measures have been considered, including both the geometric and harmonic means.  I will examine these ideas further in a future post.

Finally, klr mentioned my post on useless averages in his blog TimelyPortfolio, where he discusses alternatives to the moving average in characterizing financial time-series.  For the case he considers, klr compares a 10-month moving average, the corresponding moving median, and a number of the corresponding mode estimators from the modeest package.  This is a very interesting avenue of exploration for me since it is closely related to the median filter and other nonlinear digital filters that can be very useful in cleaning noisy time-series data.  I discuss a number of these ideas – including moving-window extensions of other data characterizations like skewness and kurtosis – in my book Mining Imperfect Data: Dealing with Contamination and Incomplete Records.

Again, thanks to all of you for your comments.  You have given me much to think about and investigate further, which is one of the joys of doing this blog.



Saturday, August 20, 2011

When are averages useless?

Of all possible single-number characterizations of a data sequence, the average is probably the best known.  It is also easy to compute, and in favorable cases it provides a useful characterization of “the typical value” of a sequence of numbers.  It is not the only such “typical value,” however, nor is it always the most useful one: two other candidates – location estimators in statistical terminology – are the median and the mode, both of which are discussed in detail in Section 4.1.2 of Exploring Data in Engineering, the Sciences, and Medicine.  Like the average, these alternative location estimators are not always “fully representative,” but they do represent viable alternatives – at least sometimes – in cases where the average is sufficiently non-representative as to be effectively useless.  As the title of this post suggests, the focus here is on those cases where the mean doesn’t really tell us what we want to know about a data sequence, briefly examining why this happens and what we can do about it.



First, it is worth saying a few words about the two alternatives just mentioned: the median and the mode.  Of these, the mode is both the more difficult to estimate and the less broadly useful.  Essentially, “the mode” corresponds to “the location of the peak in the data distribution.”  One difficulty with this somewhat loose definition is that “the mode” is not always well-defined.  The above collection of plots shows three examples where the mode is not well-defined, and another where the mode is well-defined but not particularly useful.  The upper left plot shows the density of the uniform distribution on the range [1,2]: there, the density is constant over the entire range, so there is no single, well-defined “peak” or unique maximum to serve as a mode for this distribution.  The upper right plot shows a nonparametric density estimate for the Old Faithful geyser waiting time data that I have discussed in several of my recent posts (the R data object faithful).  Here, the difficulty is that there are not one but two modes, so “the mode” is not well-defined here, either: we must discuss “the modes.”  The same behavior is observed for the arcsin distribution, whose density is shown in the lower left plot in the above figure.  This density corresponds to the beta distribution with shape parameters both equal to ½, giving a bimodal distribution whose cumulative probability function can be written simply in terms of the arcsin function, motivating its name (see Section 4.5.1 of Exploring Data for a more complete discussion of both the beta distribution family and the special case of the arcsin distribution).  In this case, the two modes of the distribution occur at the extremes of the data, at x = 1 and x = 2. 

The second difficulty with the mode noted above is that it is sometimes well-defined but not particularly useful.  The case of the J-shaped exponential density shown in the lower right plot above illustrates this point: this distribution exhibits a single, well-defined peak at the minimum value x = 0.  Here, you don’t even have to look at the data to arrive at this result, which therefore tells you nothing about the data distribution: this density is described by a single parameter that determines how slowly or rapidly the distribution decays and the mode is independent of this parameter.  Despite these limitations, there are cases where the mode represents an extremely useful data characterization, even though it is much harder to estimate than the mean or the median.  Fortunately, there is a nice package available in R to address this problem: the modeest package provides 11 different mode estimation procedures.  I will illustrate one of these in the examples that follow – the half range mode estimator of Bickel – and I will give a more complete discussion of this package in a later post.
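As a quick illustration, here is how Bickel’s half range mode estimate of the Old Faithful waiting times can be computed (a sketch; the exact calling conventions have changed somewhat across versions of the modeest package):

            library(modeest)
            mlv(faithful$waiting, method = "hrm")   # half range mode estimate; older
                                                    # modeest versions return a list whose
                                                    # M component holds the estimate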

The median is a far better-known data characterization than the mode, and it is both much easier to estimate and much more broadly applicable.  In particular, unlike either the mean or the mode, the median is well-defined for any proper data distribution, a result demonstrated in Section 4.1.2 of Exploring Data.  Conceptually, computing the median only requires sorting the N data values from smallest to largest and then taking either the middle element from this sorted list (if N is odd), or averaging the middle two elements (if N is even). 

The mean is, of course, both the easiest of these characterizations to compute – simply add the N data values and divide by N – and unquestionably the best known.  There are, however, at least three situations where the mean can be so highly non-representative as to be useless:

1.      if severe outliers are present;
2.      if the distribution is multi-modal;
3.      if the distribution has infinite variance.
The rest of this post examines each of these cases in turn.

I have discussed the problem of outliers before, but they are an important enough problem in practice to bear repeating.  (I devote all of Chapter 7 to this topic in Exploring Data.)  The plot below shows the makeup flow rate dataset, available from the companion website for Exploring Data (the dataset is makeup.csv, available on the R programs and datasets page).  This dataset consists of 2,589 successive measurements of the flow rate of a fluid stream in an industrial manufacturing process.  The points in this plot show two distinct forms of behavior: those with values on the order of 400 represent measurements made during normal process operation, while those with values less than about 300 correspond to measurements made when the process is shut down (these values are approximately zero) or is in the process of being either shut down or started back up.  The three lines in this plot correspond to the mean (the solid line at approximately 315), the median (the dotted line at approximately 393), and the mode (the dashed line at approximately 403, estimated using the “hrm” method in the modeest package).  As I have noted previously, the mean in this case represents a useful line of demarcation between the normal operation data (those points above the mean, representing 77.6% of the data) and the shutdown segments (those points below the mean, representing 22.4% of the data).  In contrast, both the median and the specific mode estimator used here provide much better characterizations of the normal operating data. 
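For readers who want to reproduce a plot like this one, the following sketch shows the idea; note that I am guessing at the name of the flow-rate column in makeup.csv, so check names() on the data frame after reading the file:

            # Assumes makeup.csv has been downloaded from the companion website;
            # "makeupFlow" is a placeholder column name -- substitute the actual one
            library(modeest)
            makeupData <- read.csv("makeup.csv")
            flow <- makeupData$makeupFlow
            plot(flow, xlab = "Sample number", ylab = "Makeup flow rate")
            abline(h = mean(flow), lty = 1)                              # solid line: mean (~315)
            abline(h = median(flow), lty = 3)                            # dotted line: median (~393)
            abline(h = as.numeric(mlv(flow, method = "hrm")), lty = 2)   # dashed line: hrm mode (~403)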



The next plot below shows a nonparametric density estimate of the Old Faithful geyser waiting data I discussed in my last few posts.  The solid vertical line at 70.90 corresponds to the mean value computed from the complete dataset.  It has been said that a true compromise is an agreement that makes all parties equally unhappy, and this seems a reasonable description of the mean here: the value lies about mid-way between the two peaks in this distribution, centered at approximately 55 and 80; in fact, this value lies fairly close to the trough between the peaks in this density estimate.  (The situation is even worse for the arcsin density discussed above: there, the two modes occur at values of 1 and 2, while the mean falls equidistant from both at 1.5, arguably the “least representative” value in the whole data range.)  The median waiting time value is 76, corresponding to the dotted line just to the left of the main peak at about 80, and the mode (again, computed using the package modeest with the “hrm” method) corresponds to the dashed line at 83, just to the right of the main peak.  The basic difficulty here is that all of these location estimators are inherently inadequate since they are attempting to characterize “the representative value” of a data sequence that has “two representative values”: one representing the smaller peak at around 55 and the other representing the larger peak at around 80.  In this case, both the median and the mode do a better job of characterizing the larger of the two peaks in the distribution (but not a great job), although such a partial characterization is not always what we want.  This type of behavior is exactly what the mixture models I discussed in my last few posts are intended to describe.
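A sketch of the R commands behind a plot like this one (my reconstruction, not necessarily the exact code used for the figure):

            library(modeest)
            wait <- faithful$waiting
            plot(density(wait), main = "Old Faithful waiting times")
            abline(v = mean(wait), lty = 1)                              # mean, about 70.9
            abline(v = median(wait), lty = 3)                            # median, 76
            abline(v = as.numeric(mlv(wait, method = "hrm")), lty = 2)   # hrm mode, about 83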



To illustrate the third situation where the mean is essentially useless, consider the Cauchy distribution, corresponding to the Student’s t distribution with one degree of freedom.  This is probably the best known infinite-variance distribution there is, and it is often used as an extreme example because it causes a lot of estimation procedures to fail.  The plot below is a (truncated) boxplot comparison of the values of the mean, median, and mode computed from 1000 independently generated Cauchy random number sequences, each of length N = 100.  It is clear from these boxplots that the variability of the mean is much greater than that of either the median or the mode, the latter again estimated from the data using the half-range mode (hrm) method in the modeest package.  One of the consequences of working with infinite variance distributions is that the mean is no longer a consistent location estimator, meaning that the variance of the estimated mean does not approach zero in the limit of large sample sizes.  In fact, the Cauchy distribution is one of the examples I discuss in Chapter 6 of Exploring Data as a counterexample to the Central Limit Theorem: for most data distributions, the distribution of the mean approaches a Gaussian limit with a variance that decreases inversely with the sample size N, but for the Cauchy distribution, the distribution of the mean is exactly the same as that of the data itself.  In other words, for the Cauchy distribution, averaging a collection of N numbers does not reduce the variability at all.  This is exactly what we are seeing here, although the plot below doesn’t show how bad the situation really is: the smallest value of the mean in this sequence of 1000 estimates is -798.97 and the largest value is 928.85.  In order to see any detail at all in the distribution of the median and mode values, it was necessary to restrict the range of the boxplots shown here to lie between -5 and +5, which eliminated 13.6% of the computed mean values.  In contrast, the median is known to be a reasonably good location estimator for the Cauchy distribution (see Section 6.6.1 of Exploring Data for a further discussion of this point), and the results presented here suggest that Bickel’s half-range mode estimator is also a reasonable candidate.  The main point here is that the mean is a completely unreasonable estimator in situations like this one, an important point in view of the growing interest in data models like the infinite-variance Zipf distribution to describe “long-tailed” phenomena in business.
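For reference, here is a sketch of how such a simulation can be set up (my reconstruction rather than the exact code behind the figure; the percentages and extreme values quoted above will differ for a different random seed):

            library(modeest)
            set.seed(3)
            results <- t(replicate(1000, {
              x <- rcauchy(100)                 # Student's t with one degree of freedom
              c(mean = mean(x), median = median(x),
                mode = as.numeric(mlv(x, method = "hrm")))
            }))
            range(results[, "mean"])            # extreme mean values, far outside +/- 5
            mean(abs(results[, "mean"]) > 5)    # fraction of means hidden by the truncation
            boxplot(as.data.frame(results), ylim = c(-5, 5))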



I will have more to say about both the modeest package and Zipf distributions in upcoming posts.

Saturday, August 6, 2011

Fitting mixture distributions with the R package mixtools

My last two posts have been about mixture models, with examples to illustrate what they are and how they can be useful.  Further discussion and more examples can be found in Chapter 10 of Exploring Data in Engineering, the Sciences, and Medicine.  One important topic I haven’t covered is how to fit mixture models to datasets like the Old Faithful geyser data that I have discussed previously: a nonparametric density plot gives fairly compelling evidence for a bimodal distribution, but how do you estimate the parameters of a mixture model that describes these two modes?  For a finite Gaussian mixture distribution, one way is by trial and error, first estimating the centers of the peaks by eye in the density plot (these become the component means), and adjusting the standard deviations and mixing percentages to approximately match the peak widths and heights, respectively.  This post considers the more systematic alternative of estimating the mixture distribution parameters using the mixtools package in R.

The mixtools package is one of several available in R for fitting mixture distributions or for solving the closely related problem of model-based clustering.  Further, mixtools includes a variety of procedures for fitting mixture models of different types.  This post focuses on one of these – the normalmixEM procedure for fitting normal mixture densities – and applies it to two simple examples, starting with the Old Faithful dataset mentioned above.  A much more complete discussion of the mixtools package – including its application to the Old Faithful dataset – is given in the R package vignette, mixtools: An R Package for Analyzing Finite Mixture Models.



The above plot shows the results obtained using the normalmixEM procedure with its default parameter values, applied to the Old Faithful waiting time data.  Specifically, this plot was generated by the following sequence of R commands:

           
            library(mixtools)
            wait = faithful$waiting
            mixmdl = normalmixEM(wait)
            plot(mixmdl,which=2)
            lines(density(wait), lty=2, lwd=2)

Like many modeling tools in R, the normalmixEM procedure has associated plot and summary methods.  In this case, the plot method displays either the log likelihood associated with each iteration of the EM fitting algorithm (more about that below), or the component densities shown above, or both.  Specifying “which=1” displays only the log likelihood plot (this is the default), specifying “which = 2” displays only the density components/histogram plot shown here, and specifying “density = TRUE” without specifying the “which” parameter gives both plots.  Note that the two solid curves shown in the above plot correspond to the individual Gaussian density components in the mixture distribution, each scaled by the estimated probability of an observation being drawn from that component distribution.  The final line of R code above overlays the nonparametric density estimate generated by the density function with its default parameters, shown here as the heavy dashed line (obtained by specifying “lty = 2”).

Most of the procedures in the mixtools package are based on the iterative expectation maximization (EM) algorithm, discussed in Section 2 of the mixtools vignette and also in Chapter 16 of Exploring Data.  A detailed discussion of this algorithm is beyond the scope of this post – books have been devoted to the topic (see, for example, the book by McLachlan and Krishnan, The EM Algorithm and Extensions (Wiley Series in Probability and Statistics) ) – but the following two points are important to note here.  First, the EM algorithm is an iterative procedure, and the time required for it to reach convergence – if it converges at all – depends strongly on the problem to which it is applied.  The second key point is that because it is an iterative procedure, the EM algorithm requires starting values for the parameters, and algorithm performance can depend strongly on these initial values.  The normalmixEM procedure supports both user-supplied starting values and built-in estimation of starting values if none are supplied.  These built-in estimates are the default and, in favorable cases, they work quite well.  The Old Faithful waiting time data is a case in point – using the default starting values gives the following parameter estimates:

           
            > mixmdl[c("lambda","mu","sigma")]
$lambda
[1] 0.3608868 0.6391132
           
$mu
[1] 54.61489 80.09109
           
$sigma
[1] 5.871241 5.867718

The mixture density described by these parameters is given by:

            p(x) = lambda[1] n(x; mu[1], sigma[1]) + lambda[2] n(x; mu[2], sigma[2])

where n(x; mu, sigma) represents the Gaussian probability density function with mean mu and standard deviation sigma.
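This fitted density is easy to evaluate directly from the normalmixEM output; here is a small sketch using the mixmdl object fit above:

            # evaluate the fitted mixture density on a grid of x values
            dmix <- function(xgrid, fit) {
              rowSums(sapply(seq_along(fit$lambda), function(j)
                fit$lambda[j] * dnorm(xgrid, mean = fit$mu[j], sd = fit$sigma[j])))
            }
            curve(dmix(x, mixmdl), from = 40, to = 100, lwd = 2,
                  xlab = "Waiting time", ylab = "Mixture density")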

One reason the default starting values work well for the Old Faithful waiting time data is that if nothing is specified, the number of components (the parameter k) is set equal to 2.  Thus, if you are attempting to fit a mixture model with more than two components, this number should be specified, either by setting k to some other value and not specifying any starting estimates for the parameters lambda, mu, and sigma, or by specifying a vector with k components as starting values for at least one of these parameters.  (There are a number of useful options in calling the normalmixEM procedure: for example, specifying the initial sigma value as a scalar constant rather than a vector with k components forces the component variances to be equal.  I won’t attempt to give a detailed discussion of these options here; for that, type “help(normalmixEM)”.)

 
Another important point about the default starting values is that, aside from the number of components k, any unspecified initial parameter estimates are selected randomly by the normalmixEM procedure.  This means that, even in cases where the default starting values consistently work well – again, the Old Faithful waiting time dataset seems to be such a case – the number of iterations required to obtain the final result can vary significantly from one run to the next.  (Specifically, the normalmixEM procedure does not fix the seed for the random number generators used to compute these starting values, so repeated runs of the procedure with the same data will start from different initial parameter values and require different numbers of iterations to achieve convergence.  In the case of the Old Faithful waiting time data, I have seen anywhere between 16 and 59 iterations required, with the final results differing only very slightly, typically in the fifth or sixth decimal place.  If you want to use the same starting value on successive runs, this can be done by setting the random number seed via the set.seed command before you invoke the normalmixEM procedure.)



It is important to note that the default starting values do not always work well, even if the correct number of components is specified.  This point is illustrated nicely by the following example.  The plot above shows two curves: the solid line is the exact density for the three-component Gaussian mixture distribution described by the following parameters:

            mu = (2.00, 5.00, 7.00)
            sigma = (1.000, 1.000, 1.000)
            lambda = (0.200, 0.600, 0.200)
The dashed curve in the figure is the nonparametric density estimate generated from n = 500 observations drawn from this mixture distribution.  Note that the first two components of this mixture distribution are evident in both of these curves, from the density peaks at approximately 2 and 5.  The third component, however, is too close to the second to yield a clear peak in either curve, giving rise instead to a slightly asymmetric “shoulder” on the right side of the main peak.  The key point is that the components of this mixture distribution are difficult to distinguish from either of these density curves, and this hints at further difficulties to come.
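One way to generate such a sample and draw the two curves in R is sketched below (with an arbitrary random seed, so the exact sample differs from the one behind the figure):

            mu <- c(2, 5, 7); sigma <- c(1, 1, 1); lambda <- c(0.2, 0.6, 0.2)
            set.seed(100)                                # arbitrary seed for the data
            n <- 500
            comp <- sample(1:3, size = n, replace = TRUE, prob = lambda)
            xdat <- rnorm(n, mean = mu[comp], sd = sigma[comp])
            curve(lambda[1] * dnorm(x, mu[1], sigma[1]) +
                  lambda[2] * dnorm(x, mu[2], sigma[2]) +
                  lambda[3] * dnorm(x, mu[3], sigma[3]),
                  from = -2, to = 11, lwd = 2, xlab = "x", ylab = "Density")   # exact density
            lines(density(xdat), lty = 2, lwd = 2)       # nonparametric density estimate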

Applying the normalmixEM procedure to the 500-sample sequence used to generate the nonparametric density estimate shown above and specifying k = 3 gives results that are substantially more variable than the Old Faithful results discussed above.  In fact, to compare these results, it is necessary to be explicit about the values of the random seeds used to initialize the parameter estimation procedure.  Setting this random seed to 101 and specifying only k = 3 in the normalmixEM call yields the following parameter estimates after 78 iterations:

            mu = (1.77, 4.87, 5.44)
            sigma = (0.766, 0.115, 1.463)
            lambda = (0.168, 0.028, 0.803)
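Starting from the xdat sample generated above, the corresponding call looks like the sketch below; with a different simulated sample, the exact estimates and iteration count will of course differ:

            library(mixtools)
            set.seed(101)                      # the seed controls the random starting values
            fit3 <- normalmixEM(xdat, k = 3)
            fit3[c("lambda", "mu", "sigma")]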
Comparing these results with the correct parameter values listed above, it is clear that some of these estimation errors are quite large.  The figure shown below compares the mixture density constructed from these parameters (the heavy dashed curve) with the nonparametric density estimate computed from the data used to estimate them.  The prominent “spike” in this mixture density plot corresponds to the very small standard deviation estimated for the second component, and it provides a dramatic illustration of the relatively poor results obtained for this particular example.



Repeating this numerical experiment with different random seeds to obtain different random starting estimates, I found that the normalmixEM procedure failed to converge within 1000 iterations for seed values 102 and 103, but it converged after 393 iterations for seed value 104, yielding the following parameter estimates:

            mu = (1.79, 5.03, 5.46)
            sigma = (0.775, 0.352, 1.493)
            lambda = (0.169, 0.063, 0.768)

Arguably, the general behavior of these parameter estimates is quite similar to those obtained with the random seed value 101, but note that the estimated standard deviation for the second component differs by a factor of three, and the second component of lambda increases almost as much. 


Increasing the sample size from n = 500 to n = 2000 and repeating these experiments, I found that the normalmixEM procedure failed to converge within 1000 iterations for all four of the random seed values 101 through 104.  If, however, we specify the correct standard deviations (i.e., specify “sigma = c(1,1,1)” when we invoke normalmixEM) and we increase the maximum number of iterations to 3000 (i.e., specify “maxit = 3000”), the procedure does converge after 2417 iterations for the seed value 101, yielding the following parameter estimates:

            mu = (1.98, 4.98, 7.15)
            sigma = (1.012, 1.055, 0.929)
            lambda = (0.198, 0.641, 0.161)
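Continuing the simulation sketch above, this final fit looks something like the following (again, the exact estimates and iteration count depend on the particular simulated sample):

            set.seed(100)                      # arbitrary seed for the larger dataset
            n <- 2000
            comp <- sample(1:3, size = n, replace = TRUE, prob = lambda)
            xbig <- rnorm(n, mean = mu[comp], sd = sigma[comp])
            set.seed(101)                      # seed for the random starting values
            fitBig <- normalmixEM(xbig, k = 3, sigma = c(1, 1, 1), maxit = 3000)
            fitBig[c("lambda", "mu", "sigma")]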
While these parameters took a lot more effort to obtain, they are clearly much closer to the correct values, emphasizing the point that when we are fitting a model to data, our results generally improve as the amount of available data increases and as our starting estimates become more accurate.  This point is further illustrated by the plot shown below, analogous to the previous one, but constructed from the model fit to the longer data sequence and incorporating better initial parameter estimates.  Interestingly, when I re-ran the same procedure taking the correct means as starting parameter estimates instead of the correct standard deviations, it failed to converge within 3000 iterations.



Overall, I like what I have seen so far of the mixtools package, and I look forward to exploring its capabilities further.  It’s great to have a built-in procedure – i.e., one I didn’t have to write and debug myself – that does all of the things that this package does.  However, the three-component mixture results presented here do illustrate an important point: the behavior of iterative procedures like normalmixEM and others in the mixtools package can depend strongly on the starting values chosen to initialize the iteration process, and the extent of this dependence can vary greatly from one application to another.