This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

## Saturday, August 20, 2011

### When are averages useless?

Of all possible single-number characterizations of a data sequence, the average is probably the best known. It is also easy to compute and, in favorable cases, it provides a useful characterization of “the typical value” of a sequence of numbers. It is not the only such “typical value,” however, nor is it always the most useful one: two other candidates – location estimators in statistical terminology – are the median and the mode, both of which are discussed in detail in Section 4.1.2 of Exploring Data. Like the average, these alternative location estimators are not always “fully representative,” but they do represent viable alternatives – at least sometimes – in cases where the average is sufficiently non-representative as to be effectively useless. As the title of this post suggests, the focus here is on those cases where the mean doesn’t really tell us what we want to know about a data sequence, briefly examining why this happens and what we can do about it.

First, it is worth saying a few words about the two alternatives just mentioned: the median and the mode.  Of these, the mode is both the more difficult to estimate and the less broadly useful.  Essentially, “the mode” corresponds to “the location of the peak in the data distribution.”  One difficulty with this somewhat loose definition is that “the mode” is not always well-defined.  The above collection of plots shows three examples where the mode is not well-defined, and another where the mode is well-defined but not particularly useful.  The upper left plot shows the density of the uniform distribution on the range [1,2]: there, the density is constant over the entire range, so there is no single, well-defined “peak” or unique maximum to serve as a mode for this distribution.  The upper right plot shows a nonparametric density estimate for the Old Faithful geyser waiting time data that I have discussed in several of my recent posts (the R data object faithful).  Here, the difficulty is that there are not one but two modes, so “the mode” is not well-defined here, either: we must discuss “the modes.”  The same behavior is observed for the arcsin distribution, whose density is shown in the lower left plot in the above figure.  This density corresponds to the beta distribution with shape parameters both equal to ½, giving a bimodal distribution whose cumulative probability function can be written simply in terms of the arcsin function, motivating its name (see Section 4.5.1 of Exploring Data for a more complete discussion of both the beta distribution family and the special case of the arcsin distribution).  In this case, the two modes of the distribution occur at the extremes of the data, at x = 1 and x = 2.
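As a rough numerical illustration of the arcsin case, the Python sketch below (the post itself works in R) evaluates the standard arcsin density and its cumulative distribution on the unit interval. The function names are made up for this example, and the shift of one unit is an assumption chosen so that the endpoints match the mode locations x = 1 and x = 2 quoted above.

```python
import math

def arcsine_pdf(u):
    """Density of the Beta(1/2, 1/2) (arcsin) distribution for 0 < u < 1."""
    return 1.0 / (math.pi * math.sqrt(u * (1.0 - u)))

def arcsine_cdf(u):
    """Cumulative distribution on [0, 1]: (2/pi) * arcsin(sqrt(u))."""
    return (2.0 / math.pi) * math.asin(math.sqrt(u))

# Assumption for illustration: the plotted range is [1, 2], i.e. x = u + 1.
SHIFT = 1.0

for u in (0.01, 0.25, 0.50, 0.75, 0.99):
    print(f"x = {u + SHIFT:.2f}  density = {arcsine_pdf(u):7.3f}  "
          f"cdf = {arcsine_cdf(u):.3f}")
```

The density grows without bound toward both ends of the range and is smallest at the midpoint, which is why this distribution has two modes at the extremes and none in the middle.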

The second difficulty with the mode noted above is that it is sometimes well-defined but not particularly useful. The case of the J-shaped exponential density shown in the lower right plot above illustrates this point: this distribution exhibits a single, well-defined peak at the minimum value x = 0. Here, you don’t even have to look at the data to arrive at this result, which therefore tells you nothing about the data distribution: this density is described by a single parameter that determines how slowly or rapidly the distribution decays, and the mode is independent of this parameter. Despite these limitations, there are cases where the mode represents an extremely useful data characterization, even though it is much harder to estimate than the mean or the median. Fortunately, there is a nice package available in R to address this problem: the modeest package provides 11 different mode estimation procedures. I will illustrate one of these in the examples that follow – the half-range mode estimator of Bickel – and I will give a more complete discussion of this package in a later post.
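To give a feel for how a half-range style mode estimator works, here is a simplified Python sketch. It is not the modeest implementation and not necessarily Bickel's exact procedure: the function name `half_range_mode`, the `beta` parameter, and the leftmost-window tie-breaking rule are all assumptions made for illustration. The idea is to keep the window of width `beta` times the current range that holds the most points, and to repeat until only one or two points remain.

```python
def half_range_mode(values, beta=0.5):
    """Simplified half-range style mode estimate (illustrative sketch only)."""
    data = sorted(values)
    if not data:
        raise ValueError("need at least one value")
    while len(data) > 2:
        width = beta * (data[-1] - data[0])
        if width == 0:
            break  # all remaining values are identical
        # Find the window [data[i], data[i] + width] holding the most points.
        best_start, best_count = 0, 0
        j = 0
        for i in range(len(data)):
            j = max(j, i)
            while j < len(data) and data[j] <= data[i] + width:
                j += 1
            if j - i > best_count:  # ties go to the leftmost window
                best_start, best_count = i, j - i
        subset = data[best_start:best_start + best_count]
        if len(subset) == len(data):
            break  # no further narrowing is possible
        data = subset
    return sum(data) / len(data)

# Hypothetical data: a tight cluster near 10 plus a few stray values.
sample = [0, 5, 9.8, 9.9, 10.0, 10.1, 10.2, 20, 50]
print("mean:", sum(sample) / len(sample))
print("half-range mode sketch:", half_range_mode(sample))
```

On this made-up sample the mean is dragged toward the stray values, while the iterated windowing homes in on the cluster near 10.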

The median is a far better-known data characterization than the mode, and it is both much easier to estimate and much more broadly applicable.  In particular, unlike either the mean or the mode, the median is well-defined for any proper data distribution, a result demonstrated in Section 4.1.2 of Exploring Data.  Conceptually, computing the median only requires sorting the N data values from smallest to largest and then taking either the middle element from this sorted list (if N is odd), or averaging the middle two elements (if N is even).
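The sort-and-pick-the-middle recipe just described can be sketched in a few lines of Python (the function name is hypothetical; in R, the built-in median function does the same job):

```python
def median_by_sorting(values):
    """Median via sorting: the middle element, or the mean of the middle two."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sequence is undefined")
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]  # N odd: the single middle element
    return (ordered[middle - 1] + ordered[middle]) / 2  # N even: average the two

print(median_by_sorting([7, 1, 5]))        # 5
print(median_by_sorting([7, 1, 5, 100]))   # 6.0
```

Note that adding the extreme value 100 moves the median only from 5 to 6, whereas the mean of the second list is 28.25.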

The mean is, of course, both the easiest of these characterizations to compute – simply add the N data values and divide by N – and unquestionably the best known.  There are, however, at least three situations where the mean can be so highly non-representative as to be useless:

1. if severe outliers are present;
2. if the distribution is multi-modal;
3. if the distribution has infinite variance.

The rest of this post examines each of these cases in turn.

I have discussed the problem of outliers before, but it is important enough in practice that the discussion bears repeating. (I devote all of Chapter 7 to this topic in Exploring Data.) The plot below shows the makeup flow rate dataset, available from the companion website for Exploring Data (the dataset is makeup.csv, available on the R programs and datasets page). This dataset consists of 2,589 successive measurements of the flow rate of a fluid stream in an industrial manufacturing process. The points in this plot show two distinct forms of behavior: those with values on the order of 400 represent measurements made during normal process operation, while those with values less than about 300 correspond to measurements made when the process is shut down (these values are approximately zero) or is in the process of being either shut down or started back up. The three lines in this plot correspond to the mean (the solid line at approximately 315), the median (the dotted line at approximately 393), and the mode (the dashed line at approximately 403, estimated using the “hrm” method in the modeest package). As I have noted previously, the mean in this case represents a useful line of demarcation between the normal operation data (those points above the mean, representing 77.6% of the data) and the shutdown segments (those points below the mean, representing 22.4% of the data). In contrast, both the median and the specific mode estimator used here provide much better characterizations of the normal operating data.
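The same qualitative behavior is easy to reproduce with made-up numbers. The Python sketch below does not use the makeup.csv data: it builds a synthetic sequence in which 77.6% of the values sit near 400 and the rest are exactly zero, both of which are simplifying assumptions, and then compares the mean and the median.

```python
import statistics

# Synthetic stand-in for the flow data (NOT the real makeup.csv values):
# 776 "normal operation" readings cycling through 395..405, 224 zeros.
normal_operation = [395 + (i % 11) for i in range(776)]
shutdown = [0] * 224
flow = normal_operation + shutdown

mean_flow = statistics.mean(flow)
median_flow = statistics.median(flow)
fraction_above_mean = sum(v > mean_flow for v in flow) / len(flow)

print(f"mean   = {mean_flow:.1f}")
print(f"median = {median_flow:.1f}")
print(f"fraction of points above the mean = {fraction_above_mean:.3f}")
```

The mean lands in the empty gap between the two groups, so it separates them cleanly but describes neither, while the median stays inside the normal-operation cluster.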

The next plot below shows a nonparametric density estimate of the Old Faithful geyser waiting data I discussed in my last few posts. The solid vertical line at 70.90 corresponds to the mean value computed from the complete dataset. It has been said that a true compromise is an agreement that makes all parties equally unhappy, and this seems a reasonable description of the mean here: the value lies about midway between the two peaks in this distribution, centered at approximately 55 and 80; in fact, this value lies fairly close to the trough between the peaks in this density estimate. (The situation is even worse for the arcsin density discussed above: there, the two modes occur at values of 1 and 2, while the mean falls equidistant from both at 1.5, arguably the “least representative” value in the whole data range.) The median waiting time value is 76, corresponding to the dotted line just to the left of the main peak at about 80, and the mode (again, computed using the package modeest with the “hrm” method) corresponds to the dashed line at 83, just to the right of the main peak. The basic difficulty here is that all of these location estimators are inherently inadequate since they are attempting to characterize “the representative value” of a data sequence that has “two representative values”: one representing the smaller peak at around 55 and the other representing the larger peak at around 80. In this case, both the median and the mode do a better job of characterizing the larger of the two peaks in the distribution (but not a great job), although such a partial characterization is not always what we want. This type of behavior is exactly what the mixture models I discussed in my last few posts are intended to describe.
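A small simulation makes the "unhappy compromise" concrete. The Python sketch below does not use the faithful data: it draws from a hypothetical two-component Gaussian mixture with peaks at 55 and 80, where the 36%/64% weights, the common standard deviation of 5, and the seed are all assumptions chosen only to mimic the shape described above. It then counts how many points fall within 3 units of the mean and of each peak.

```python
import random
import statistics

rng = random.Random(2011)  # fixed seed so the sketch is repeatable
N = 2000

# Hypothetical bimodal "waiting times": 36% near 55, 64% near 80.
waits = [rng.gauss(55, 5) if rng.random() < 0.36 else rng.gauss(80, 5)
         for _ in range(N)]

def count_near(data, center, half_width=3.0):
    """Number of points within half_width of center."""
    return sum(abs(v - center) <= half_width for v in data)

mean_wait = statistics.mean(waits)
median_wait = statistics.median(waits)

print(f"mean = {mean_wait:.1f}, median = {median_wait:.1f}")
print("points within 3 of the mean:   ", count_near(waits, mean_wait))
print("points within 3 of peak at 55: ", count_near(waits, 55))
print("points within 3 of peak at 80: ", count_near(waits, 80))
```

Far fewer observations lie near the mean than near either peak, which is the sense in which the mean is the least typical of the candidate "typical values" here.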

To illustrate the third situation where the mean is essentially useless, consider the Cauchy distribution, corresponding to the Student’s t distribution with one degree of freedom. This is probably the best known infinite-variance distribution there is, and it is often used as an extreme example because it causes a lot of estimation procedures to fail. The plot below is a (truncated) boxplot comparison of the values of the mean, median, and mode computed from 1000 independently generated Cauchy random number sequences, each of length N = 100. It is clear from these boxplots that the variability of the mean is much greater than that of either the median or the mode, the latter again estimated from the data using the half-range mode (hrm) method in the modeest package. One consequence of working with a distribution as heavy-tailed as the Cauchy is that the mean is no longer a consistent location estimator, meaning that the variability of the estimated mean does not shrink toward zero in the limit of large sample sizes. In fact, the Cauchy distribution is one of the examples I discuss in Chapter 6 of Exploring Data as a counterexample to the Central Limit Theorem: for most data distributions, the distribution of the mean approaches a Gaussian limit with a variance that decreases inversely with the sample size N, but for the Cauchy distribution, the distribution of the mean is exactly the same as that of the data itself. In other words, for the Cauchy distribution, averaging a collection of N numbers does not reduce the variability at all.

This is exactly what we are seeing here, although the plot below doesn’t show how bad the situation really is: the smallest value of the mean in this sequence of 1000 estimates is -798.97 and the largest value is 928.85. In order to see any detail at all in the distribution of the median and mode values, it was necessary to restrict the range of the boxplots shown here to lie between -5 and +5, which eliminated 13.6% of the computed mean values. In contrast, the median is known to be a reasonably good location estimator for the Cauchy distribution (see Section 6.6.1 of Exploring Data for a further discussion of this point), and the results presented here suggest that Bickel’s half-range mode estimator is also a reasonable candidate. The main point here is that the mean is a completely unreasonable estimator in situations like this one, an important point in view of the growing interest in data models like the infinite-variance Zipf distribution to describe “long-tailed” phenomena in business.
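The experiment is straightforward to approximate. The Python sketch below is a stand-in for the original R simulation, not a reproduction of it: the seed and helper names are assumptions, the mode estimator is left out, and the numbers will differ from those quoted above. It generates 1000 Cauchy sequences of length 100 and compares the spread of the means and medians. Because the mean of N Cauchy values is itself standard Cauchy, the fraction of means outside [-5, 5] should be near 1 - (2/π) arctan(5), or roughly 12.6%, which is in line with the 13.6% observed in the post.

```python
import math
import random
import statistics

rng = random.Random(100)  # fixed seed so the sketch is repeatable

def cauchy_sample(n):
    """n standard Cauchy draws: tan(pi * (U - 0.5)) with U uniform on [0, 1)."""
    return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]

def iqr(values):
    """Interquartile range, a spread measure that exists even for Cauchy data."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

means, medians = [], []
for _ in range(1000):
    x = cauchy_sample(100)
    means.append(statistics.fmean(x))
    medians.append(statistics.median(x))

fraction_outside = sum(abs(m) > 5 for m in means) / len(means)
theoretical_outside = 1 - (2 / math.pi) * math.atan(5)

print(f"IQR of means:   {iqr(means):.3f}")
print(f"IQR of medians: {iqr(medians):.3f}")
print(f"range of means: {min(means):.2f} to {max(means):.2f}")
print(f"fraction of means outside [-5, 5]: {fraction_outside:.3f} "
      f"(theory: {theoretical_outside:.3f})")
```

The interquartile range is used instead of the standard deviation because the latter is dominated by a handful of enormous values whenever the underlying variance is infinite.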

I will have more to say about both the modeest package and Zipf distributions in upcoming posts.