This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Saturday, April 21, 2012

David Olive’s median confidence interval

As I have discussed in a number of previous posts, the median represents a well-known and widely-used estimate of the “center” of a data sequence.  Relative to the better-known mean, the primary advantage of the median is its much reduced outlier sensitivity.  This post briefly describes a simple confidence interval for the median that is discussed in a paper by David Olive, available on-line via the following link:

As Olive notes in his paper and I further demonstrate in this post, an advantage of his confidence interval for the median is that it provides a simple, numerical way of identifying situations where the data values deserve a careful, graphical look.  In particular, he advocates comparing the traditional confidence interval for the mean with his confidence interval for the median: if these intervals are markedly different, it is worth investigating to understand why.  This strategy may be viewed as a particular instance of Collin Mallows’ “compute and compare” advice, discussed at the end of Chapter 7 of Exploring Data in Engineering, the Sciences, and Medicine.  The key idea here is that under “standard” working assumptions – i.e., distributional symmetry and approximate normality – the mean and the median should be approximately the same: if they are not, it probably means these working assumptions have been violated, due to outliers in the data, pronounced distributional asymmetry, or other less common phenomena like strongly multimodal data distributions or coarse quantization.  In the increasingly common case where we have a lot of numerical variables to consider, it may be undesirable or infeasible to examine them all graphically: numerical comparisons like the one described here may be automated and used to point us to subsets of variables that we really need to look at further.  In addition to describing this confidence interval estimator and illustrating it for three examples, this post also provides the R code to compute it. 



As a first example, the plot above shows the makeup flow rate dataset discussed in Exploring Data and available as the makeup dataset  (makeup.csv) from the book's companion website.  This plot shows 2,589 successive observations of the measured flow rate of a solvent recycle stream in an industrial manufacturing process.  In normal operation, this flow rate is just under 400 – in fact, the median flow rate is 393.86 – but this data record also includes measurements during time intervals when the process is either being shut down, is not running, or is being started back up, and during these periods the measured flow rates decrease toward zero, are approximately equal to zero, and increase from zero back to approximately 400, respectively.  Because of the presence of these anomalous segments in the data, the mean value is much smaller than the median: specifically, the mean is 315.46, actually serving as a practical dividing line between the normal operation segments (i.e., those data points that lie above the mean) and the shutdown segments (i.e., those data points that lie below the mean).  The dashed lines in this plot at 309.49 and 321.44 correspond to the classical 95% confidence interval for the mean, computed as described below.  In contrast, the dotted lines at 391.83 and 394.88 correspond to Olive’s 95% confidence interval for the median, also described below.  Before proceeding to a more detailed discussion of how these lines were determined, the three primary points to note from this figure are, first, that the two confidence intervals are very different (e.g., they do not overlap at all), second, that the mean confidence intervals are much wider than those for the median in this case, and third, that the median confidence interval lies well within the range of the normal operating data, while the mean confidence interval does not.  It is also worth noting that, if we simply remove the shutdown episodes from this dataset, the mean of this edited dataset is 397.7, a value that lies slightly above the upper 95% confidence interval for the median, but only slightly so (this and other data cleaning strategies for this dataset are discussed in some detail in Chapter 7 of Exploring Data).

Both the classical confidence interval for the mean and David Olive’s confidence interval for the median are based on the fact that these estimators are asymptotically normal: for a sufficiently large data sample, both the estimated mean and the estimated median approach the correct limits for the underlying data distribution, with a standard deviation that decreases inversely with the square root of the sample size.  Using this description directly would lead to confidence intervals based on the quantiles of the Gaussian distribution, but for small to moderate-sized samples, more accurate confidence intervals are obtained by replacing these Gaussian quantiles with those for the Student’s t-distribution with the appropriate number of degrees of freedom.  More specifically, for the mean, the confidence interval at a given level p is of the form:

            CI = (Mean – cp SE, Mean + cp SE),

where cp is the constant derived from the Gaussian or Student’s t-distribution, and SE is the standard error of the mean, equal to the usual standard deviation estimate divided by the square root of the number of data points.  (For a more detailed discussion of the math behind these results, refer to either Chapter 9 of Exploring Data or to David Olive’s paper, available through the link given above.)  For the median, Olive provides a simple estimator for the standard error, described further in the next paragraph.  First, however, it is worth saying a little about the difference between the Gaussian and Student’s t-distribution in these results.  Probably the most commonly used confidence intervals are the 95% intervals – these are the confidence intervals shown in the plot above for the makeup flow rate data – which represent the interval that should contain the true distribution mean with probability at least 95%.  In the Gaussian case, the constant cp for the 95% confidence interval is approximately 1.96, while for the Student’s t-distribution, this number depends on the degrees of freedom parameter.  In the case of the mean, the degrees of freedom is one less than the sample size, while for the median confidence intervals described below, this number is typically much smaller.  The difference between these distributions is that the cp parameter decreases from a very large value for few degrees of freedom – e.g., the 95% parameter value is 12.71 for a single degree of freedom – to the Gaussian value (e.g., 1.96 for the 95% case) in the limit of infinite degrees of freedom.  Thus, using Student’s t-distribution instead of the Gaussian distribution results in wider confidence intervals, wider by the ratio of the Student’s t value for cp to the Gaussian value.  The plot below shows this ratio for the 95% parameter cp as the degree of freedom parameter varies between 5 and 200, with the dashed line corresponding to the Gaussian limit when this ratio is equal to 1.



The general structure of Olive’s confidence interval for the median is exactly analogous to that for the mean given above:

            CI = (Median – cp SE, Median + cp SE)

The key result of Olive’s paper is a simple estimator for the standard error SE, based on order statistics (i.e., rank-ordered data values like the minimum, median, and maximum).  Instead of describing these results mathematically, I have included an R procedure that computes the median, Olive’s standard error, the corresponding confidence intervals, and the classical results for the mean (again, for the mathematical details, refer to Olive’s paper; for a more detailed discussion of order statistics, refer to Chapter 6 of Exploring Data).  Specifically, the following R procedure is called with a vector y of numerical data values, and the default level of the resulting confidence interval is 95%, although this level can be changed by specifying an alternative value of alpha (this is 1 minus the confidence level, so alpha is 0.05 for the 95% case, 0.01 for 99%, etc.).


DOliveCIproc <- function(y, alpha = 0.05){
  #
  #  This procedure implements David Olive's simple
  #  median confidence interval, along with the standard
  #  confidence interval for the mean, for comparison
  #
  #  First, compute the median
  #
  n = length(y)
  ysort = sort(y)
  nhalf = floor(n/2)
  if (2*nhalf < n){
    #  n odd
    med = ysort[nhalf + 1]
  }
  else{
    # n even
    med = (ysort[nhalf] + ysort[nhalf+1])/2
  }
  #
  #  Next, compute Olive’s standard error for the median
  #
  Ln = nhalf - ceiling(sqrt(n/4))
  Un = n - Ln
  SE = 0.5*(ysort[Un] - ysort[Ln+1])
  #
  #  Compute the confidence interval based on Student’s t-distribution
  #  The degrees of freedom parameter p is discussed in Olive’s paper
  #
  p = Un - Ln - 1
  t = qt(p = 1 - alpha/2, df = p)
  medLCI = med - t * SE
  medUCI = med + t * SE
  #
  #  Next, compute the mean and its classical confidence interval
  #
  mu = mean(y)
  SEmu = sd(y)/sqrt(n)
  tmu = qt(p = 1 - alpha/2, df = n-1)
  muLCI = mu - tmu * SEmu
  muUCI = mu + tmu * SEmu
  #
  #  Finally, return a data frame with all of the results computed here
  #
  OutFrame = data.frame(Median = med, LCI = medLCI, UCI = medUCI,
                        Mean = mu, MeanLCI = muLCI, MeanUCI = muUCI,
                        N = n, dof = p, tmedian = t, tmean = tmu,
                        SEmedian = SE, SEmean = SEmu)
  OutFrame
}

Briefly, this procedure performs the following computations.  The first portion of the code computes the median, defined as the middle element of the rank-ordered list of samples if the number of samples n is odd, and the average of the two middle samples if n is even.  Note that the even/odd character of n is determined by using the floor function in R: floor(n/2) is the largest integer that does not exceed n/2.  Thus, if n is odd, the floor function rounds n/2 down to its integer part, so the product 2 * floor(n/2) is less than n, while if n is even, floor(n/2) is exactly equal to n/2, so this product is equal to n.  In addition, both the floor function and its opposite function ceiling are needed to compute the value Ln used in computing Olive’s standard error for the median.  The cp values correspond to the parameters t and tmu that appear in this function, computed from the built-in R function qt (which returns quantiles of the t-distribution).  Note that for the median, the degrees of freedom supplied to this function is p, which tends to be much smaller than the degrees of freedom value n-1 for the mean confidence interval computed in the latter part of this function.

As a specific illustration of the results generated by this procedure, applying it to the makeup flow rate data sequence yields:

> DOliveCIproc(makeupflow)
    Median          LCI             UCI           Mean        MeanLCI    MeanUCI     N   dof         tmedian
1 393.3586   391.8338   394.8834   315.4609   309.4857   321.4361   2589  52   2.006647
     tmean      SEmedian     SEmean
1 1.960881   0.75987     3.047188
>

These results were used to construct the confidence interval lines in the makeup flow rate plot shown above.  In addition, note that these results also illustrate the point noted in the preceding discussion about the degrees of freedom used in constructing the Student’s t-based confidence intervals.  For the mean, the degrees of freedom is N-1, which is 2588 for this example, meaning that there is essentially no difference in this case between these confidence intervals and those based on the Gaussian limiting distribution.  In contrast, for the median, the degrees of freedom is only 52, giving a cp value that is about 2.5% larger than the corresponding Gaussian case; for the next example, the degrees of freedom is only 16, making this parameter about 8% larger than the Gaussian limit.



One of the points I discussed in my last post was the instability of the median relative to the mean, a point I illustrated with the plot shown above.  This is a simulation-based dataset consisting of three parts: the first 100 points are narrowly distributed around the value +1, the 101st point is exactly zero, and the last 100 points are narrowly distributed around the value -1.  As I noted last time, removing two points from either the first group or the last group can profoundly alter the median, while having very little effect on the mean.  The figure shown above includes, in addition to the data values, the 95% confidence intervals for both the mean (the dotted lines in the center of the plot) and the median (the heavy dashed lines at the top and bottom of the plot).  Here, the fact that the median confidence interval is enormously wider (by almost a factor of 13) than the mean confidence interval gives an indication of the instability of the median.  In fact, the data distribution in this example is strongly bimodal, corresponding to a case where order statistic-based estimators like the median and Olive’s standard error for it perform poorly, a point discussed in Chapter 7 of Exploring Data.



One of the other important cases where estimators based on order statistics can perform poorly is that of coarsely quantized data, such as temperatures recorded only to the nearest tenth of a degree.  The difficulty with these cases is that coarse quantization profoundly changes the nature of the data distribution.  Specifically, it is a standard result in statistics that the probability of any two samples drawn from a continuous distribution having exactly the same value is zero, but this is no longer true for discrete distributions (e.g., count data), and coarse quantization introduces an element of discreteness into the data distribution.  The above figure illustrates this point for a simple simulation-based example.  The upper left plot shows a random sample of size 200 drawn from a zero-mean, unit-variance Gaussian distribution, and the upper right plot shows the effects of quantizing this sample, rounding it to the nearest half-integer value.  The lower two plots are normal quantile-quantile plots generated by the R command qqPlot from the car package: in the lower left plot, almost all of the points fall within the 95% confidence interval around the normal reference line for this plot, while many of the points fall somewhat outside these confidence limits in the plot shown in the lower right.  The greatest difference, however, is in the “staircase” appearance of this lower right plot, reflecting the effects of the coarse quantization on this data sample: each “step” corresponds to a group of samples that have exactly the same value.

The influence of this quantization on Olive’s confidence interval for the median is profound: for the original Gaussian data sequence, the 95% confidence interval for the median is approximately (-0.222,0.124), compared with (-0.174,0.095) for the mean.  These results are consistent with our expectations: since the mean is the best possible location estimator for Gaussian data, it should give the narrower confidence interval, and it does.  For the quantized case, the 95% confidence interval for the mean is (-0.194, 0.079), fairly similar to that for the original data sequence, but the confidence interval for the median reduces to the single value zero.  This result represents an implosion of Olive’s standard error estimator for the median, exactly analogous to the behavior of the MADM scale estimate that I have discussed previously when a majority of the data values (i.e., more than 50% of them) are identical.  Here, the situation is more serious, since the MADM scale estimate does not implode for this example: the MADM scale for the original data sequence is 0.938, versus 0.741 for the quantized sequence.  The reason Olive’s standard error estimator is more prone to implosion in the face of coarse quantization is that it is based on a small subset of the original data sample.  In particular, the size of the subsample on which this estimator is based is p, the degrees of freedom for the t-distribution used in constructing the corresponding confidence interval, and this number is approximately the square root of the sample size.  Thus, for a sample of size 200 like the example considered here, MADM scale implosion requires just over half the sample to have the same value – 101 data points in this case – where Olive’s standard error estimator for the median can implode if 16 or more samples have the same value, and this is exactly what happens here: the median value is zero, and this value occurs 39 times in the quantized data sequence.

David Olive’s confidence interval for the median is easily computed and represents a useful adjunct to the median as a characterization of numerical variables.  As Olive advises, there is considerable advantage in computing and comparing both his median confidence interval and the corresponding standard confidence interval around the mean.  Although in the summary of his paper, Olive only mentions outliers as a potential cause of substantial differences between these two confidence intervals, this post has illustrated that disagreements can also arise from other causes, including light-tailed, bimodal, or coarsely quantized data, much like the situation with the MADM scale estimate versus the standard deviation.  In fact, as the last example discussed here illustrates, Olive’s standard error estimator for the median and the confidence intervals based on it can implode – exactly like the MADM scale estimate – in the face of coarsely quantized data.  In fact, the implosion problem for Olive’s median standard error estimator is potentially more severe, again as illustrated in the previous example.  Finally, it is worth noting that Olive’s paper also discusses confidence intervals for trimmed means.

1 comment: