This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.


Saturday, April 21, 2012

David Olive’s median confidence interval

As I have discussed in a number of previous posts, the median represents a well-known and widely used estimate of the “center” of a data sequence.  Relative to the better-known mean, its primary advantage is its much reduced outlier sensitivity.  This post briefly describes a simple confidence interval for the median, presented in a paper by David Olive that is available on-line.

As Olive notes in his paper and I further demonstrate in this post, an advantage of his confidence interval for the median is that it provides a simple, numerical way of identifying situations where the data values deserve a careful, graphical look.  In particular, he advocates comparing the traditional confidence interval for the mean with his confidence interval for the median: if these intervals are markedly different, it is worth investigating to understand why.  This strategy may be viewed as a particular instance of Colin Mallows’ “compute and compare” advice, discussed at the end of Chapter 7 of Exploring Data in Engineering, the Sciences, and Medicine.  The key idea here is that under “standard” working assumptions – i.e., distributional symmetry and approximate normality – the mean and the median should be approximately the same: if they are not, these working assumptions have probably been violated, due to outliers in the data, pronounced distributional asymmetry, or less common phenomena like strongly multimodal data distributions or coarse quantization.  In the increasingly common case where we have many numerical variables to consider, it may be undesirable or infeasible to examine them all graphically: numerical comparisons like the one described here can be automated and used to point us to the subsets of variables that we really need to look at further.  In addition to describing this confidence interval estimator and illustrating it with three examples, this post also provides the R code to compute it.

[Figure: the makeup flow rate data, with the classical 95% confidence interval for the mean (dashed lines) and Olive’s 95% confidence interval for the median (dotted lines)]
As a first example, the plot above shows the makeup flow rate dataset discussed in Exploring Data and available as the makeup dataset (makeup.csv) from the book's companion website.  This plot shows 2,589 successive observations of the measured flow rate of a solvent recycle stream in an industrial manufacturing process.  In normal operation, this flow rate is just under 400 – in fact, the median flow rate is 393.36 – but this data record also includes measurements during time intervals when the process is being shut down, is not running, or is being started back up; during these periods, the measured flow rates decrease toward zero, remain approximately zero, and increase from zero back to approximately 400, respectively.  Because of these anomalous segments in the data, the mean value is much smaller than the median: specifically, the mean is 315.46, which actually serves as a practical dividing line between the normal operation segments (i.e., those data points that lie above the mean) and the shutdown segments (i.e., those data points that lie below the mean).  The dashed lines in this plot at 309.49 and 321.44 correspond to the classical 95% confidence interval for the mean, computed as described below.  In contrast, the dotted lines at 391.83 and 394.88 correspond to Olive’s 95% confidence interval for the median, also described below.  Before proceeding to a more detailed discussion of how these lines were determined, note three primary points from this figure: first, the two confidence intervals are very different (they do not overlap at all); second, the confidence interval for the mean is much wider than that for the median in this case; and third, the median confidence interval lies well within the range of the normal operating data, while the mean confidence interval does not.  It is also worth noting that, if we simply remove the shutdown episodes from this dataset, the mean of the edited dataset is 397.7, a value that lies slightly above the upper 95% confidence limit for the median, but only slightly so (this and other data cleaning strategies for this dataset are discussed in some detail in Chapter 7 of Exploring Data).

Both the classical confidence interval for the mean and David Olive’s confidence interval for the median are based on the fact that these estimators are asymptotically normal: for a sufficiently large data sample, both the estimated mean and the estimated median approach the correct limits for the underlying data distribution, with a standard deviation that decreases inversely with the square root of the sample size.  Using this description directly would lead to confidence intervals based on the quantiles of the Gaussian distribution, but for small to moderate-sized samples, more accurate confidence intervals are obtained by replacing these Gaussian quantiles with those for the Student’s t-distribution with the appropriate number of degrees of freedom.  More specifically, for the mean, the confidence interval at a given level p is of the form:

            CI = (Mean – cp SE, Mean + cp SE),

where cp is the constant derived from the Gaussian or Student’s t-distribution, and SE is the standard error of the mean, equal to the usual standard deviation estimate divided by the square root of the number of data points.  (For a more detailed discussion of the math behind these results, refer to either Chapter 9 of Exploring Data or to David Olive’s paper, available through the link given above.)  For the median, Olive provides a simple estimator for the standard error, described further in the next paragraph.  First, however, it is worth saying a little about the difference between the Gaussian and Student’s t-distribution in these results.  Probably the most commonly used confidence intervals are the 95% intervals – these are the confidence intervals shown in the plot above for the makeup flow rate data – which represent the interval that should contain the true distribution mean with probability at least 95%.  In the Gaussian case, the constant cp for the 95% confidence interval is approximately 1.96, while for the Student’s t-distribution, this number depends on the degrees of freedom parameter.  In the case of the mean, the degrees of freedom is one less than the sample size, while for the median confidence intervals described below, this number is typically much smaller.  The difference between these distributions is that the cp parameter decreases from a very large value for few degrees of freedom – e.g., the 95% parameter value is 12.71 for a single degree of freedom – to the Gaussian value (e.g., 1.96 for the 95% case) in the limit of infinite degrees of freedom.  Thus, using Student’s t-distribution instead of the Gaussian distribution results in wider confidence intervals, wider by the ratio of the Student’s t value for cp to the Gaussian value.  The plot below shows this ratio for the 95% parameter cp as the degree of freedom parameter varies between 5 and 200, with the dashed line corresponding to the Gaussian limit when this ratio is equal to 1.
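
As a quick illustration, the following minimal R sketch reproduces this ratio curve; the code here is mine, written from the description above:

#  Ratio of the 95% Student's t parameter cp to its Gaussian limit,
#  for degrees of freedom from 5 to 200
dofGrid = seq(5, 200, 1)
ratio = qt(p = 0.975, df = dofGrid)/qnorm(p = 0.975)
plot(dofGrid, ratio, type = "l", xlab = "Degrees of freedom",
     ylab = "Ratio of t value to Gaussian value")
abline(h = 1, lty = 2)   # the Gaussian limit, where the ratio equals 1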

[Figure: ratio of the Student’s t 95% parameter cp to its Gaussian limit, for degrees of freedom between 5 and 200; the dashed line marks the limiting ratio of 1]
The general structure of Olive’s confidence interval for the median is exactly analogous to that for the mean given above:

            CI = (Median – cp SE, Median + cp SE).

The key result of Olive’s paper is a simple estimator for the standard error SE, based on order statistics (i.e., rank-ordered data values like the minimum, median, and maximum).  Instead of describing these results mathematically, I have included an R procedure that computes the median, Olive’s standard error, the corresponding confidence intervals, and the classical results for the mean (again, for the mathematical details, refer to Olive’s paper; for a more detailed discussion of order statistics, refer to Chapter 6 of Exploring Data).  Specifically, the following R procedure is called with a vector y of numerical data values, and the default level of the resulting confidence interval is 95%, although this level can be changed by specifying an alternative value of alpha (this is 1 minus the confidence level, so alpha is 0.05 for the 95% case, 0.01 for 99%, etc.).


DOliveCIproc <- function(y, alpha = 0.05){
  #
  #  This procedure implements David Olive's simple
  #  median confidence interval, along with the standard
  #  confidence interval for the mean, for comparison
  #
  #  First, compute the median
  #
  n = length(y)
  ysort = sort(y)
  nhalf = floor(n/2)
  if (2*nhalf < n){
    #  n odd
    med = ysort[nhalf + 1]
  } else {
    #  n even
    med = (ysort[nhalf] + ysort[nhalf+1])/2
  }
  #
  #  Next, compute Olive’s standard error for the median
  #
  Ln = nhalf - ceiling(sqrt(n/4))
  Un = n - Ln
  SE = 0.5*(ysort[Un] - ysort[Ln+1])
  #
  #  Compute the confidence interval based on Student’s t-distribution
  #  The degrees of freedom parameter p is discussed in Olive’s paper
  #
  p = Un - Ln - 1
  t = qt(p = 1 - alpha/2, df = p)
  medLCI = med - t * SE
  medUCI = med + t * SE
  #
  #  Next, compute the mean and its classical confidence interval
  #
  mu = mean(y)
  SEmu = sd(y)/sqrt(n)
  tmu = qt(p = 1 - alpha/2, df = n-1)
  muLCI = mu - tmu * SEmu
  muUCI = mu + tmu * SEmu
  #
  #  Finally, return a data frame with all of the results computed here
  #
  OutFrame = data.frame(Median = med, LCI = medLCI, UCI = medUCI,
                        Mean = mu, MeanLCI = muLCI, MeanUCI = muUCI,
                        N = n, dof = p, tmedian = t, tmean = tmu,
                        SEmedian = SE, SEmean = SEmu)
  OutFrame
}

Briefly, this procedure performs the following computations.  The first portion of the code computes the median, defined as the middle element of the rank-ordered list of samples if the number of samples n is odd, and the average of the two middle samples if n is even.  Note that the even/odd character of n is determined using the floor function in R: floor(n/2) is the largest integer that does not exceed n/2.  Thus, if n is odd, the floor function rounds n/2 down to its integer part, so the product 2 * floor(n/2) is less than n, while if n is even, floor(n/2) is exactly equal to n/2, so this product is equal to n.  In addition, both the floor function and its counterpart, ceiling, are needed to compute the value Ln used in computing Olive’s standard error for the median.  The cp values correspond to the parameters t and tmu that appear in this function, computed from the built-in R function qt (which returns quantiles of Student’s t-distribution).  Note that for the median, the degrees of freedom supplied to this function is p, which tends to be much smaller than the degrees of freedom value n-1 for the mean confidence interval computed in the latter part of this function.
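
To make these order statistic computations concrete, here is the arithmetic for the makeup flow rate example, with n = 2589; it yields the 52 degrees of freedom that appear in the output below:

n = 2589
nhalf = floor(n/2)                 # 1294
Ln = nhalf - ceiling(sqrt(n/4))    # 1294 - 26 = 1268
Un = n - Ln                        # 1321
p = Un - Ln - 1                    # 52 degrees of freedom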

As a specific illustration of the results generated by this procedure, applying it to the makeup flow rate data sequence yields:

> DOliveCIproc(makeupflow)
    Median      LCI      UCI     Mean  MeanLCI  MeanUCI    N dof  tmedian
1 393.3586 391.8338 394.8834 315.4609 309.4857 321.4361 2589  52 2.006647
     tmean SEmedian   SEmean
1 1.960881  0.75987 3.047188
>

These results were used to construct the confidence interval lines in the makeup flow rate plot shown above.  In addition, note that these results also illustrate the point noted in the preceding discussion about the degrees of freedom used in constructing the Student’s t-based confidence intervals.  For the mean, the degrees of freedom is N-1, which is 2588 for this example, meaning that there is essentially no difference in this case between these confidence intervals and those based on the Gaussian limiting distribution.  In contrast, for the median, the degrees of freedom is only 52, giving a cp value that is about 2.5% larger than the corresponding Gaussian case; for the next example, the degrees of freedom is only 16, making this parameter about 8% larger than the Gaussian limit.

[Figure: the three-part simulation dataset, with the 95% confidence interval for the mean (dotted lines, center) and Olive’s 95% confidence interval for the median (heavy dashed lines, top and bottom)]
One of the points I discussed in my last post was the instability of the median relative to the mean, a point I illustrated with the plot shown above.  This is a simulation-based dataset consisting of three parts: the first 100 points are narrowly distributed around the value +1, the 101st point is exactly zero, and the last 100 points are narrowly distributed around the value -1.  As I noted last time, removing two points from either the first group or the last group can profoundly alter the median, while having very little effect on the mean.  The figure shown above includes, in addition to the data values, the 95% confidence intervals for both the mean (the dotted lines in the center of the plot) and the median (the heavy dashed lines at the top and bottom of the plot).  Here, the fact that the median confidence interval is enormously wider (by almost a factor of 13) than the mean confidence interval gives an indication of the instability of the median.  In fact, the data distribution in this example is strongly bimodal, corresponding to a case where order statistic-based estimators like the median and Olive’s standard error for it perform poorly, a point discussed in Chapter 7 of Exploring Data.
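
Because the original post does not list the simulation details, the following sketch only generates a dataset with the same structure; the Gaussian standard deviation of 0.1 and the seed are my assumptions, chosen to make the two groups “narrowly distributed”:

#  Three-part simulation dataset: 100 points near +1, one exact zero,
#  and 100 points near -1 (sd = 0.1 and the seed are assumed values)
set.seed(33)
ySim = c(rnorm(100, mean = 1, sd = 0.1), 0, rnorm(100, mean = -1, sd = 0.1))
DOliveCIproc(ySim)   # the median CI comes out far wider than the mean CI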

[Figure: upper left: Gaussian random sample of size 200; upper right: the same sample quantized to the nearest half-integer; bottom row: the corresponding normal quantile-quantile plots from qqPlot]
One of the other important cases where estimators based on order statistics can perform poorly is that of coarsely quantized data, such as temperatures recorded only to the nearest tenth of a degree.  The difficulty with these cases is that coarse quantization profoundly changes the nature of the data distribution.  Specifically, it is a standard result in statistics that the probability of any two samples drawn from a continuous distribution having exactly the same value is zero, but this is no longer true for discrete distributions (e.g., count data), and coarse quantization introduces an element of discreteness into the data distribution.  The above figure illustrates this point for a simple simulation-based example.  The upper left plot shows a random sample of size 200 drawn from a zero-mean, unit-variance Gaussian distribution, and the upper right plot shows the effects of quantizing this sample, rounding it to the nearest half-integer value.  The lower two plots are normal quantile-quantile plots generated by the R command qqPlot from the car package: in the lower left plot, almost all of the points fall within the 95% confidence interval around the normal reference line for this plot, while many of the points fall somewhat outside these confidence limits in the plot shown in the lower right.  The greatest difference, however, is in the “staircase” appearance of this lower right plot, reflecting the effects of the coarse quantization on this data sample: each “step” corresponds to a group of samples that have exactly the same value.
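
The following sketch reconstructs this example; since the original seed is not given, the seed here is an assumption, and the numbers will not match the plots exactly:

#  Gaussian sample, its half-integer quantization, and QQ-plots
library(car)
set.seed(101)         # assumed seed
x = rnorm(200)        # zero-mean, unit-variance Gaussian sample
xq = round(2*x)/2     # round to the nearest half-integer
par(mfrow = c(2,2))
plot(x, main = "Original sample")
plot(xq, main = "Quantized sample")
qqPlot(x, main = "Normal QQ-plot, original")
qqPlot(xq, main = "Normal QQ-plot, quantized")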

The influence of this quantization on Olive’s confidence interval for the median is profound: for the original Gaussian data sequence, the 95% confidence interval for the median is approximately (-0.222, 0.124), compared with (-0.174, 0.095) for the mean.  These results are consistent with our expectations: since the mean is the best possible location estimator for Gaussian data, it should give the narrower confidence interval, and it does.  For the quantized case, the 95% confidence interval for the mean is (-0.194, 0.079), fairly similar to that for the original data sequence, but the confidence interval for the median collapses to the single value zero.  This result represents an implosion of Olive’s standard error estimator for the median, exactly analogous to the behavior of the MADM scale estimate that I have discussed previously, which implodes when a majority of the data values (i.e., more than 50% of them) are identical.  Here, the situation is more serious, since the MADM scale estimate does not implode for this example: the MADM scale for the original data sequence is 0.938, versus 0.741 for the quantized sequence.  The reason Olive’s standard error estimator is more prone to implosion in the face of coarse quantization is that it is based on a small subset of the original data sample.  In particular, the size of the subsample on which this estimator is based is p, the degrees of freedom for the t-distribution used in constructing the corresponding confidence interval, and this number is approximately the square root of the sample size.  Thus, for a sample of size 200 like the example considered here, MADM scale implosion requires just over half the sample to have the same value – 101 data points in this case – whereas Olive’s standard error estimator for the median can implode if 16 or more samples have the same value, and this is exactly what happens here: the median value is zero, and this value occurs 39 times in the quantized data sequence.
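
Using the reconstruction sketched above, these effects are easy to verify (the exact numbers depend on the assumed seed):

DOliveCIproc(x)    # median CI approximately (-0.222, 0.124) in the original example
DOliveCIproc(xq)   # SEmedian = 0: the median CI collapses to a single point
mad(x)             # approximately 0.938 in the original example
mad(xq)            # approximately 0.741: the MADM scale does not implode here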

David Olive’s confidence interval for the median is easily computed and represents a useful adjunct to the median as a characterization of numerical variables.  As Olive advises, there is considerable advantage in computing and comparing both his median confidence interval and the corresponding standard confidence interval around the mean.  Although in the summary of his paper Olive only mentions outliers as a potential cause of substantial differences between these two confidence intervals, this post has illustrated that disagreements can also arise from other causes, including light-tailed, bimodal, or coarsely quantized data, much like the situation with the MADM scale estimate versus the standard deviation.  Indeed, as the last example discussed here illustrates, Olive’s standard error estimator for the median and the confidence intervals based on it can implode – exactly like the MADM scale estimate – in the face of coarsely quantized data, and this implosion problem is potentially more severe than it is for the MADM scale.  Finally, it is worth noting that Olive’s paper also discusses confidence intervals for trimmed means.

Tuesday, April 12, 2011

Screening for predictive characteristics … and a mea culpa

In my last post, I considered the UCI mushroom dataset and characterized the variables included there using four different interestingness measures.  When I began drafting this post, my intention was to consider the question of how the different mushroom characteristics included in this dataset relate to each mushroom’s classification as edible or poisonous.  In fact, I do consider this problem here, but in the process of working out the example, I discovered a minor typographical error in Exploring Data in Engineering, the Sciences, and Medicine that has somewhat less minor consequences.  Specifically, in Eq. (9.67) on page 413, two square roots were omitted, making the result incorrect as stated.  (More specifically, the term in curly brackets that appears twice should have an exponent of ½, like the different term that appears twice in curly brackets in Eq. (9.66) just above it on the same page.)  The consequence of this omission is that the confidence intervals defined by Eq. (9.67) are too narrow; further, since this equation was used to implement the R procedure binomCI.proc available from the companion website, the results generated by this procedure are also incorrect.  I have brought these errors to Oxford’s attention and have asked them to replace the original R procedure with a corrected update, but if you have already downloaded this procedure, you need to be aware of the missing square root.  The rest of this post carries out my original plan – which was to show how binomial confidence intervals can be useful in screening categorical variables for their ability to predict a binary outcome like edibility. 

Recall that the UCI mushroom dataset gives 23 characteristics for each of 8,124 mushrooms, including a binary classification of each mushroom as “edible” or “poisonous.”  The question considered here is which – if any – of the 22 mushroom attributes included in the dataset is potentially useful in predicting edibility.  The basic idea is the following: each of these predictors is a categorical variable that can take on any one of a fixed set of possible values, so we can examine the groups of mushrooms defined by each of these values and estimate the probability that the mushrooms in the group are edible.  As a specific example, the mushroom characteristic CapSurf has four possible values: “f” (fibrous), “g” (grooves), “s” (smooth), or “y” (scaly).  In this case, we want to estimate the probability that mushrooms with CapSurf = f are edible, the probability that those with CapSurf = g are edible, and similarly for CapSurf = s and y.  The most common approach to this problem is to estimate these probabilities as the fractions of edible mushrooms in each group: Pedible = nedible/ngroup.  The difficulty with this number, taken by itself, is that it doesn’t tell us how much weight to give the final result: if we have one edible mushroom in a group of five, we get Pedible = 0.200, and we get the same result if we have 200 edible mushrooms in a group of 1,000.  We are likely to put more faith in the second result than in the first, however, because it has a lot more weight of evidence behind it.  For example, if we add a single edible mushroom to each group, our probability estimate for the first case increases from Pedible = 0.200 to Pedible = 0.333, while in the second case, the estimated probability only increases to Pedible = 0.201.  Even worse, if we remove one edible mushroom from the first group, Pedible drops to 0.000, while in the second case, it only drops to 0.199. 
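
A minimal sketch of the arithmetic just described:

#  Point estimates Pedible = nedible/ngroup
1/5;  200/1000    # both give 0.200
#  After adding one edible mushroom to each group
2/6;  201/1001    # 0.333 versus 0.201
#  After removing one edible mushroom from each group
0/4;  199/999     # 0.000 versus 0.199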

This is where statistics can come to our rescue: in addition to computing the point estimate Pedible of the probability that a mushroom is edible, we can also compute confidence intervals, which quantify the uncertainty in this result.  That is, a confidence interval is defined as a set of values that has at least some specified probability of containing the true but unknown value of Pedible.  A common choice is 95% confidence limits: the true value of Pedible lies between some lower limit P- and some upper limit P+ with probability at least 95%.  One of the key points of this post is that these intervals can be computed in more than one way, and the way that was widely adopted as “the standard method” for a long time has been found to be inadequate.  Fortunately, a simple alternative is available that gives much better results, at least if you implement it correctly.  The details follow, all based on the material presented in Section 9.7 of Exploring Data.

This standard method relies on the assumption of asymptotic normality: for a “sufficiently large” group (i.e., for “large enough” values of ngroup and possibly nedible), the estimator Pedible approaches a Gaussian limiting distribution with variance Pedible(1 – Pedible)/ngroup.  If we assume our sample is large enough for this to be a good approximation, we can rely on known results for the Gaussian distribution to construct our confidence intervals.  As a specific example, the 95% confidence interval is centered at Pedible, with upper and lower limits lying approximately 1.96 standard deviations above and below this value, where the standard deviation is just the square root of the variance given above.  The plot below shows the results obtained by applying this strategy to the groups of mushrooms defined by the four possible values of the CapSurf variable.  Specifically, the open circles in this plot correspond to the estimated probability Pedible that a mushroom from the group defined by each CapSurf value is edible.  The downward-pointing triangles represent the upper 95% confidence limits, and the upward-pointing triangles represent the lower 95% confidence limits.  The horizontal dotted line corresponds to the average fraction of edible mushrooms in the UCI dataset, giving us a frame of reference for assessing the edibility results for each individual CapSurf value.  That is, points lying well above this line represent groups of mushrooms that are more edible than average, while points lying well below it represent groups that are less edible than average.  The result obtained for level “g” clearly illustrates one difficulty with this approach: this group is extremely small, containing only four mushrooms, none of which are classified as edible.  Thus, not only is Pedible zero, its associated variance is also zero, giving us zero-width confidence intervals.  In words, this result suggests that mushrooms with grooved cap surfaces are never edible and that we are quite certain of this, despite the fact that the conclusion is based on only four mushrooms.  In contrast, we seem to be less certain about the probability that scaly (“y”) or smooth (“s”) mushrooms are edible, despite the fact that these results are based on groups of 3,244 and 2,556 mushrooms, respectively.
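
In code, the classical interval looks something like the following minimal sketch; the function name and arguments are mine, and this is not the binomCI.proc procedure from the companion website:

WaldBinomCI <- function(nEdible, nGroup, alpha = 0.05){
  #  Classical binomial confidence interval based on asymptotic normality
  Pedible = nEdible/nGroup
  z = qnorm(p = 1 - alpha/2)
  halfWidth = z * sqrt(Pedible * (1 - Pedible)/nGroup)
  data.frame(Pedible = Pedible, LCI = Pedible - halfWidth,
             UCI = Pedible + halfWidth)
}
#  The four grooved-cap mushrooms, none edible:
WaldBinomCI(0, 4)   # zero-width interval at zero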

[Figure: estimated probability of edibility with classical 95% confidence intervals, for the four CapSurf levels]
An alternative approach that gives more accurate confidence intervals and also overcomes this particular difficulty is one proposed by Brown, Cai, and DasGupta.  The details are given in Exploring Data (with the exception of the error noted at the beginning of this post, I believe they are correct), and they are somewhat messy, so I won’t repeat them here, but the basic ideas are, first, to add positive offsets to both nedible and ngroup in computing the probability that a mushroom is edible, and second, to modify the expression for the variance.  Both of these modifications depend explicitly on the confidence level considered (i.e., the offsets are different for 95% confidence intervals than they are for 99%  confidence intervals, as are the variance modifications), and they become negligible in the limit as both nedible and ngroup become very large.  To see the impact of these modifications, the plot below gives the modified 95% confidence intervals for the CapSurf data, in the same general format as before.  Comparing this plot with the one above, it is clear that the most dramatic difference is that for level “g,” the grooved cap mushrooms: whereas the asymptotic result suggested the mushrooms in this group were poisonous with absolute certainty, the very wide confidence intervals for this group in the plot below reflect the fact that this result is only based on four mushrooms and, while none of these four are edible, the confidence intervals extend from essentially zero probability of being edible to almost the average probability for the complete dataset.  Thus, while we can conclude from this plot that mushrooms with grooved cap surfaces appear less likely than average to be edible, the available evidence isn’t enough to make this argument too strongly.  In contrast, the results for the mushrooms with scaly cap surfaces (“y”) or smooth cap surfaces (“s”) are essentially identical to those presented above, consistent with the much larger groups on which these results are based.
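
The following sketch shows the general form of this modified interval as I understand it from the description above: it follows the offset-based construction recommended by Brown, Cai, and DasGupta, but exact agreement with Eqs. (9.66) and (9.67) in Exploring Data is an assumption on my part:

ModifiedBinomCI <- function(nEdible, nGroup, alpha = 0.05){
  #  Offset the counts and modify the variance, as described above;
  #  details assumed to match the Brown, Cai, and DasGupta construction
  z = qnorm(p = 1 - alpha/2)
  nMod = nGroup + z^2               # offset group size
  pMod = (nEdible + z^2/2)/nMod     # offset probability estimate
  halfWidth = z * sqrt(pMod * (1 - pMod)/nMod)
  data.frame(Pedible = pMod, LCI = pmax(pMod - halfWidth, 0),
             UCI = pmin(pMod + halfWidth, 1))
}
#  The grooved-cap group again: a wide interval instead of a degenerate one
ModifiedBinomCI(0, 4)   # roughly (0, 0.55)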

[Figure: modified 95% confidence intervals of Brown, Cai, and DasGupta for the four CapSurf levels]
Before leaving this example, it is worth showing how the results change in light of my typographical error in Eq. (9.67) of Exploring Data.  The square roots were omitted from terms that define the width of the confidence intervals, and these terms are numerically smaller than 1.  Since the square root of a number between 0 and 1 is larger than the number itself (but still smaller than 1), the effect of these omitted square roots is to make the resulting confidence intervals too narrow: we are using the value x rather than the larger value, the square root of x, to determine the width of the intervals.  This error causes our results to appear more precise than they really are.  The effect may be seen clearly in the plot below, which corresponds to the two plots discussed above, but with the confidence intervals based on the erroneous implementation of the estimator of Brown et al.
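
In terms of the sketch above, the error amounts to dropping the sqrt() call from the half-width computation:

#  Erroneous half-width: with the square root missing, any variance
#  term below 1 yields an interval that is too narrow
halfWidthBad = z * (pMod * (1 - pMod)/nMod)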

[Figure: the CapSurf confidence intervals computed with the erroneous (too-narrow) implementation]
Finally, the figure below shows four plots, each in the same format as those discussed above, corresponding to the Pedible estimates obtained by applying the method of Brown et al. to four different mushroom characteristics.  The upper left plot shows the results obtained for the mushroom characteristic “Odor,” which appears to be highly predictive of edibility.  Careful examination of these results reveals that, for the mushrooms in the UCI dataset, those with odors characterized as “a” (almond) or “l” (anise) are always edible; those with odors characterized as “c” (creosote), “f” (foul), “m” (musty), “p” (pungent), “s” (spicy), or “y” (fishy) are always poisonous; and those with no odor are more likely to be edible than not, but can still be poisonous.  In contrast, CapShape (upper right plot) appears much less predictive: some values seem to be strongly associated with edibility (“b” or “s”), while the levels “f” and “x” seem to convey no information at all – the likelihood that these mushrooms are edible is essentially the same as that of the complete collection, without regard to CapShape.  The lower left plot shows the corresponding results for StalkRoot, which suggest that levels “c,” “e,” and “r” are more likely than average to be edible, that level “b” conveys no information, and that mushrooms with missing StalkRoot values (the class “?”) are somewhat more likely to be poisonous.  This last result is somewhat distressing, raising the possibility that the missing values for this variable are not missing at random, but reflect some systematic mechanism (e.g., is the StalkRoot characterization somehow more difficult for poisonous mushrooms?).  Finally, the lower right plot shows the results for the binary characteristic GillSize: it appears that mushrooms with GillSize “n” (narrow) are much more likely to be poisonous than those with GillSize “b” (broad).  Because both the response (i.e., edibility) and the candidate predictor GillSize are binary in this case, an alternative – and arguably better – approach to characterizing their relationship is in terms of odds ratios, which I will take up in my next post.

[Figure: modified 95% confidence intervals for edibility, by level, for the Odor (upper left), CapShape (upper right), StalkRoot (lower left), and GillSize (lower right) characteristics]