This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Saturday, March 3, 2012

Gastwirth’s location estimator

The problem of outliers – data points that are substantially inconsistent with the majority of the other points in a dataset – arises frequently in the analysis of numerical data.  The practical importance of outliers lies in the fact that even a few of these points can badly distort the results of an otherwise reasonable data analysis.  This outlier-sensitivity problem is often particularly acute for classical data characterizations and analysis methods like means, standard deviations, and linear regression analysis.  As a consequence, a range of outlier-resistant methods have been developed for many different applications, and new methods continue to be developed.  For example, the R package robustbase that I have discussed in previous posts includes outlier-resistant methods for estimating location (i.e., outlier-resistant alternatives to the mean), estimating scale (outlier-resistant alternatives to the standard deviation), quantifying asymmetry (outlier-resistant alternatives to the skewness), and fitting regression models.  In Exploring Data in Engineering, the Sciences, and Medicine, I discuss a number of outlier-resistant methods for addressing some of these problems, including Gastwirth’s location estimator, an alternative to the mean that is the subject of this post.

The mean is the best-known location estimator, and it gives a useful assessment of the “typical” value of any numerical sequence that is reasonably symmetrically distributed and free of outliers.  The outlier-sensitivity of the mean is severe, however, which motivates the use of outlier-resistant alternatives like the median.  While the median is almost as well-known as the mean and extremely outlier-resistant, it can behave unexpectedly (i.e., “badly”) as a result of its non-smooth character.  This point is illustrated in Fig. 7.23 in Exploring Data, which is identical in character to the figure shown below (the figure here differs slightly because it uses a different seed to generate the random numbers on which it is based).  Specifically, this plot shows a sequence of 201 data points, constructed as follows: the first 100 points are normally distributed with mean 1 and standard deviation 0.1, the 101st point is equal to zero, and points 102 through 201 are normally distributed with mean -1 and standard deviation 0.1.  Small changes to this dataset – specifically, deleting a few points – can produce very large changes in the computed median.  In this example, the first 100 points lie between 0.768 and 1.185 and the last 100 points lie between -1.282 and -0.787; because the central data point lies between these two equal-sized groups, it defines the median, which is 0.  The mean is quite close to this value, at -0.004, but the situation changes dramatically if we omit either the first two or the last two points from the sequence.  Specifically, the median computed from points 1 through 199 is 0.768, while that computed from points 3 through 201 is -0.787.  In contrast, the mean values for these two modified sequences are 0.006 and -0.014.  Thus, although the median is much less sensitive than the mean to contamination from outliers, for this particular dataset it is extremely sensitive to the 1% change made in this example.
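
This example is easy to reconstruct in R; here is a minimal sketch (the seed is arbitrary, so the exact values will differ slightly from those quoted above):

    set.seed(123)   # arbitrary seed for illustration; the post uses a different one
    x <- c(rnorm(100, mean = 1, sd = 0.1),    # points 1-100: centered on +1
           0,                                 # point 101: the central "swing" value
           rnorm(100, mean = -1, sd = 0.1))   # points 102-201: centered on -1
    median(x)         # the central point: 0
    mean(x)           # also close to 0
    median(x[1:199])  # omit the last two points: jumps to the smallest of the +1 group
    median(x[3:201])  # omit the first two points: jumps to the largest of the -1 group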


The fact that the median is not “universally the best location estimator” provides a practical motivation for examining alternatives that are intermediate in behavior between the very smooth but very outlier-sensitive mean and the very outlier-insensitive but very non-smooth median.  Some of these alternatives were examined in detail in the book Robust Estimates of Location: Survey and Advances, by D.F. Andrews, P.J. Bickel, F.R. Hampel, P.J. Huber, W.H. Rogers, and J.W. Tukey, published by Princeton University Press in 1972 (according to the publisher's website, this book is out of print, but used copies are available through distributors like Amazon or Barnes and Noble).  The book summarizes the results of a year-long study of 68 different location estimators, including both the mean and the median.  The fundamental criteria for inclusion in this study were, first, that the estimators had to be computable from any given sequence of real numbers, and second, that they had to be both location- and scale-invariant.  Specifically, if a given data sequence {x_k} yields a result m, the scaled and shifted data sequence {A x_k + b} should yield the result A m + b, for any numbers A and b.  The study was co-authored by six statistical researchers with differing opinions and points of view, but two of the authors – D.F. Andrews and F.R. Hampel – included the Gastwirth estimator (described in detail below) in their lists of favorites.  For example, Hampel characterized this estimator as one of a small list of those that were “never bad at the distributions considered.”  Also, in contrast to many of the location estimators considered in the study, Gastwirth’s estimator does not require iterative computations, making it simpler to implement.

Specifically, Gastwirth’s location estimator is a weighted sum of three order statistics.  That is, to compute this estimator, we first sort the data sequence in ascending order.  Then, we take the values that are one-third of the way up this sequence (the 1/3 quantile), half way up the sequence (i.e., the median, or 1/2 quantile), and two-thirds of the way up the sequence (the 2/3 quantile).  Given these three values, we then form the weighted average, giving the central (median) value a weight of 40% and the two extreme values each a weight of 30%.  This is extremely easy to do in R, with the following code:

Gastwirth <- function(x, ...) {
  # Order statistics: the 1/3 quantile, the median, and the 2/3 quantile
  ordstats <- quantile(x, probs = c(1/3, 1/2, 2/3), ...)
  # Gastwirth's weights: 30% on each outer quantile, 40% on the median
  wts <- c(0.3, 0.4, 0.3)
  # Return the weighted sum of the three order statistics
  sum(wts * ordstats)
}

The key part of this code is the line that computes the required order statistics (i.e., the 1/3, 1/2, and 2/3 quantiles) using the built-in quantile function.  The first argument passed to this function is x, the vector of data values to be characterized, and the second argument (probs) defines the specific quantiles we wish to compute.  The ellipsis (...) in the Gastwirth procedure’s argument list is passed on to the quantile function; several optional parameters are possible (type “help(quantile)” in your R session for details), but one of the most useful is na.rm, a logical variable that specifies how missing data values are to be handled.  The default is FALSE, in which case the quantile function – and therefore the Gastwirth procedure – fails with an error if any values of x are missing; the alternative TRUE computes the Gastwirth estimator from the non-missing values, giving a numerical result.  The three-element vector wts holds the quantile weights that characterize the Gastwirth estimator, and the final sum statement computes the weighted sum.
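
For example, using the x vector constructed earlier:

    xna <- c(x, NA)                # the example data from above, plus one missing value
    # Gastwirth(xna)               # fails: quantile() does not accept NAs when na.rm = FALSE
    Gastwirth(xna, na.rm = TRUE)   # drops the NA and returns a numerical estimate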

For the data example considered above, the Gastwirth estimator yields the location estimate -0.001 for the complete dataset, 0.308 for points 1 to 199 (vs. 0.768 for the median), and -0.317 for points 3 to 201 (vs. -0.787 for the median).  Thus, while it does not perform nearly as well as the mean for this example, it performs substantially better than the median. 
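
These results can be reproduced with calls like the following (exact values depend on the seed used to generate x):

    Gastwirth(x)          # complete dataset: near 0
    Gastwirth(x[1:199])   # points 1 to 199
    Gastwirth(x[3:201])   # points 3 to 201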


For the infinite-variance Cauchy distribution that I have discussed in several previous posts, the Gastwirth estimator performs similarly to the median, yielding a useful estimate of the center of the data distribution, in contrast to the mean, which doesn’t actually exist for this distribution (that is, the first moment of the Cauchy distribution does not exist).  Because the distribution is symmetric about zero, however, the median is well-defined, as is the Gastwirth estimator, and both should be zero for this distribution.  The above figure shows the results of applying these three estimators – the mean, the median, and Gastwirth’s estimator – to 1,000 independent random samples drawn from the Cauchy distribution.  Specifically, this figure gives a boxplot summary of these results, truncated to the range from -3 to 3 to show the range of variation of the median and the Gastwirth estimator (without this restriction, the boxplot comparison would be fairly uninformative, since the mean values range from approximately -161 to 27,793, reflecting the fact that the mean is not a consistent location estimator for the Cauchy distribution).   To generate these results, the replicate function in R was used, followed by the apply function, as follows:

    RandomSampleFrame <- replicate(1000, rt(n = 200, df = 1))
    BoxPlotVector <- apply(RandomSampleFrame, MARGIN = 2, Gastwirth)

The replicate function creates a matrix (not a data frame) whose number of columns is specified by the first argument (here, 1000), with each column generated by the R expression that appears as the second argument.  In this case, this second argument is a call to rt, which generates a sequence of n statistically independent random numbers drawn from the Student’s t-distribution with the number of degrees of freedom specified by the df argument (here, this is 1, corresponding to the fact that the Cauchy distribution is the Student’s t-distribution with 1 degree of freedom).   Thus, RandomSampleFrame is a matrix with 200 rows and 1,000 columns, each column of which may be regarded as a Cauchy-distributed random sample.  The apply function applies the function specified in the third argument (here, the Gastwirth procedure listed above) to the columns (MARGIN=2 specifies columns; MARGIN=1 would specify rows) of the matrix specified in the first argument.  The result is BoxPlotVector, a vector of 1,000 Gastwirth estimates, one for each random sample generated by the replicate call above.
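
The boxplot comparison itself can then be generated along the following lines (a sketch, not necessarily the exact code used to produce the figure above):

    MeanVector <- apply(RandomSampleFrame, MARGIN = 2, mean)
    MedianVector <- apply(RandomSampleFrame, MARGIN = 2, median)
    boxplot(list(mean = MeanVector, median = MedianVector,
                 Gastwirth = BoxPlotVector),
            ylim = c(-3, 3))   # truncate the display to the range -3 to 3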


At the other extreme, in the limit of infinite degrees of freedom, the Student’s t-distribution approaches a Gaussian limit.  The figure above shows the same comparison as before, but for the Gaussian distribution instead of the Cauchy distribution.  Here, the mean is the best possible location estimator and it clearly performs best, but the point of this example is that Gastwirth’s location estimator also performs better than the median.  In particular, the interquartile distance (i.e., the width of the “box” in each boxplot) is 0.094 for the mean, 0.113 for the median, and 0.106 for Gastwirth’s estimator.
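
These Gaussian results can be generated in the same way, substituting rnorm for rt; the built-in IQR function gives the interquartile distances (a sketch; the exact values will vary with the random seed):

    GaussFrame <- replicate(1000, rnorm(n = 200))
    IQR(apply(GaussFrame, MARGIN = 2, mean))       # about 0.094 in the run described here
    IQR(apply(GaussFrame, MARGIN = 2, median))     # about 0.113
    IQR(apply(GaussFrame, MARGIN = 2, Gastwirth))  # about 0.106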


Another application area where very robust estimators like the median often perform poorly is that of bimodal distributions like the arc-sine distribution, whose density is plotted above.  This distribution is a symmetric beta distribution with both shape parameters equal to 0.5 (see Exploring Data, Sec. 4.5.1 for further discussion of this distribution).  Because it is symmetrically distributed on the interval from 0 to 1, the location parameter for this distribution is 0.5, and all three of the location estimators considered here yield values that are accurate on average, but with different levels of precision.  This point is shown in the figure below, which again provides boxplot comparisons of the mean, median, and Gastwirth location estimators for 1,000 random samples drawn from this distribution, each of length 200.  As in the Gaussian case considered above, the mean performs best here, with an interquartile distance of 0.035; the median performs worst, with an interquartile distance of 0.077; and Gastwirth’s estimator is intermediate, with an interquartile distance of 0.060.
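
The arc-sine computation follows the same pattern, drawing the samples with rbeta (again a sketch; exact values will vary from run to run):

    ArcSineFrame <- replicate(1000, rbeta(n = 200, shape1 = 0.5, shape2 = 0.5))
    IQR(apply(ArcSineFrame, MARGIN = 2, mean))       # about 0.035
    IQR(apply(ArcSineFrame, MARGIN = 2, median))     # about 0.077
    IQR(apply(ArcSineFrame, MARGIN = 2, Gastwirth))  # about 0.060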


The point of this post has been to illustrate a location estimator with properties that are intermediate between those of the much better-known mean and median.  In particular, the results presented here for the Cauchy distribution show that Gastwirth’s estimator is intermediate in outlier sensitivity between the disastrously sensitive mean and the maximally insensitive median.  Similarly, the first example demonstrated that Gastwirth’s estimator is also intermediate in smoothness between the maximally smooth mean and the discontinuous median: the sensitivity of Gastwirth’s estimator to data editing in “swing-vote” examples like the one presented here is still undesirably large, but much better than that of the median.  Finally, the results presented here for the Gaussian and arc-sine distributions show that Gastwirth’s estimator is better-behaved for these distributions than the median.  Because it is extremely easy to implement in R, Gastwirth’s estimator seems worth knowing about.