This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Tuesday, April 12, 2011

Screening for predictive characteristics … and a mea culpa

In my last post, I considered the UCI mushroom dataset and characterized the variables included there using four different interestingness measures.  When I began drafting this post, my intention was to consider the question of how the different mushroom characteristics included in this dataset relate to each mushroom’s classification as edible or poisonous.  In fact, I do consider this problem here, but in the process of working out the example, I discovered a minor typographical error in Exploring Data in Engineering, the Sciences, and Medicine that has somewhat less minor consequences.  Specifically, in Eq. (9.67) on page 413, two square roots were omitted, making the result incorrect as stated.  (More specifically, the term in curly brackets that appears twice should have an exponent of ½, like the different term that appears twice in curly brackets in Eq. (9.66) just above it on the same page.)  The consequence of this omission is that the confidence intervals defined by Eq. (9.67) are too narrow; further, since this equation was used to implement the R procedure binomCI.proc available from the companion website, the results generated by this procedure are also incorrect.  I have brought these errors to Oxford’s attention and have asked them to replace the original R procedure with a corrected update, but if you have already downloaded this procedure, you need to be aware of the missing square root.  The rest of this post carries out my original plan – which was to show how binomial confidence intervals can be useful in screening categorical variables for their ability to predict a binary outcome like edibility. 

Recall that the UCI mushroom dataset gives 23 characteristics for each of 8,124 mushrooms, including a binary classification of each mushroom as “edible” or “poisonous.”  The question considered here is which – if any – of the 22 mushroom attributes included in the dataset is potentially useful in predicting edibility.  The basic idea is the following: each of these predictors is a categorical variable that can take on any one of a fixed set of possible values, so we can examine the groups of mushrooms defined by each of these values and estimate the probability that the mushrooms in the group are edible.  As a specific example, the mushroom characteristic CapSurf has four possible values: “f” (fibrous), “g” (grooves), “s” (smooth), or “y” (scaly).  In this case, we want to estimate the probability that mushrooms with CapSurf = f are edible, the probability that those with CapSurf = g are edible, and similarly for CapSurf = s and y.  The most common approach to this problem is to estimate these probabilities as the fractions of edible mushrooms in each group: Pedible = nedible/ngroup.  The difficulty with this number, taken by itself, is that it doesn’t tell us how much weight to give the final result: if we have one edible mushroom in a group of five, we get Pedible = 0.200, and we get the same result if we have 200 edible mushrooms in a group of 1,000.  We are likely to put more faith in the second result than in the first, however, because it has a lot more weight of evidence behind it.  For example, if we add a single edible mushroom to each group, our probability estimate for the first case increases from Pedible = 0.200 to Pedible = 0.333, while in the second case, the estimated probability only increases to Pedible = 0.201.  Even worse, if we remove one edible mushroom from the first group, Pedible drops to 0.000, while in the second case, it only drops to 0.199. 
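To make this concrete, the counts and fractions for the CapSurf groups are easy to compute in R.  The sketch below assumes the data have been read into a data frame called mushroom with a column EdibleOrPoison (coded "e" for edible and "p" for poisonous) and a column CapSurf; the column names are assumptions for illustration, although the logic does not depend on them.

    # Edible fraction for each CapSurf group; assumes a data frame 'mushroom'
    # with columns 'CapSurf' and 'EdibleOrPoison' (coded "e" = edible, "p" = poisonous)
    counts  <- table(mushroom$CapSurf, mushroom$EdibleOrPoison)  # levels x {e, p}
    nGroup  <- rowSums(counts)       # number of mushrooms in each CapSurf group
    nEdible <- counts[, "e"]         # number of edible mushrooms in each group
    pEdible <- nEdible / nGroup      # point estimates of the edibility probability
    round(pEdible, 3)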

This is where statistics can come to our rescue: in addition to computing the point estimate Pedible of the probability that a mushroom is edible, we can also compute confidence intervals, which quantify the uncertainty in this result.  That is, a confidence interval is defined as a set of values that has at least some specified probability of containing the true but unknown value of Pedible.  A common choice is 95% confidence limits: the true value of Pedible lies between some lower limit P- and some upper limit P+ with probability at least 95%.  One of the key points of this post is that these intervals can be computed in more than one way, and the way that was widely adopted as “the standard method” for a long time has been found to be inadequate.  Fortunately, a simple alternative is available that gives much better results, at least if you implement it correctly.  The details follow, all based on the material presented in Section 9.7 of Exploring Data.
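As a quick illustration of how much these intervals matter, R's built-in binom.test function returns a 95% confidence interval for a binomial proportion (it uses the exact Clopper-Pearson construction, a third alternative to the two methods discussed below).  Applied to the two cases described above, it gives a very wide interval for 1 edible mushroom out of 5 and a much narrower one for 200 out of 1,000, even though both point estimates are 0.200.

    # Both point estimates equal 0.200, but the 95% confidence intervals differ greatly
    binom.test(1, 5)$conf.int       # 1 edible mushroom out of 5: a very wide interval
    binom.test(200, 1000)$conf.int  # 200 edible out of 1,000: a much narrower interval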

This standard method relies on the assumption of asymptotic normality: for a “sufficiently large” group (i.e., for “large enough” values of ngroup and possibly nedible), the estimator Pedible approaches a Gaussian limiting distribution with variance Pedible(1 – Pedible)/ngroup.  If we assume our sample is large enough for this to be a good approximation, we can rely on known results for the Gaussian distribution to construct our confidence intervals.  As a specific example, the 95% confidence interval would be centered at Pedible with upper and lower limits lying approximately plus or minus 1.96 standard deviations from this value, where the standard deviation is just the square root of the variance given above.  The plot below shows the results obtained by applying this strategy to the groups of mushrooms defined by the four possible values of the CapSurf variable.  Specifically, the open circles in this plot correspond to the estimated probability Pedible that a mushroom from the group defined by each CapSurf value is edible.  The downward-pointing triangles represent the upper 95% confidence limit for this value, and the upward-pointing triangles represent the lower 95% confidence limit for this value.  The horizontal dotted line corresponds to the average fraction of edible mushrooms in the UCI dataset, giving us a frame of reference for assessing the edibility results for each individual CapSurf value.  That is, points lying well above this average line represent groups of mushrooms that are more edible than average, while points lying well below this average line represent groups of mushrooms that are less edible than average.  The result obtained for level “g” clearly illustrates one difficulty with this approach: this group is extremely small, containing only four mushrooms, none of which are classified as edible.  Thus, not only is Pedible zero, its associated variance is also zero, giving us zero-width confidence intervals.  In words, this result is suggesting that mushrooms with grooved cap surfaces are never edible and that we are quite certain of this, despite the fact that this conclusion is only based on four mushrooms.  In contrast, we seem to be less certain about the probability that scaly (“y”) or smooth (“s”) mushrooms are edible, despite the fact that these results are based on groups of 3,244 and 2,556 mushrooms, respectively.
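The computation behind these standard intervals takes only a few lines of R.  The sketch below builds on the nGroup and nEdible counts from the earlier sketch (so the same column-name assumptions apply); it is not a copy of the book's binomCI.proc.

    # Standard asymptotic (Wald) 95% confidence intervals for each CapSurf group
    z  <- qnorm(0.975)                              # approximately 1.96
    pEdible <- nEdible / nGroup                     # point estimates
    se <- sqrt(pEdible * (1 - pEdible) / nGroup)    # asymptotic standard deviation
    waldCI <- data.frame(level = names(nGroup), pEdible = pEdible,
                         lower = pEdible - z * se, upper = pEdible + z * se)
    waldCI    # note the zero-width interval for level "g" (0 edible out of 4)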



An alternative approach that gives more accurate confidence intervals and also overcomes this particular difficulty is one proposed by Brown, Cai, and DasGupta.  The details are given in Exploring Data (aside from the error noted at the beginning of this post, I believe they are correct), and they are somewhat messy, so I won’t repeat them here, but the basic ideas are, first, to add positive offsets to both nedible and ngroup in computing the probability that a mushroom is edible, and second, to modify the expression for the variance.  Both of these modifications depend explicitly on the confidence level considered (i.e., the offsets are different for 95% confidence intervals than they are for 99% confidence intervals, as are the variance modifications), and they become negligible in the limit as both nedible and ngroup become very large.  To see the impact of these modifications, the plot below gives the modified 95% confidence intervals for the CapSurf data, in the same general format as before.  Comparing this plot with the one above, it is clear that the most dramatic difference is for level “g,” the grooved cap mushrooms: whereas the asymptotic result suggested that the mushrooms in this group were poisonous with absolute certainty, the very wide confidence intervals for this group reflect the fact that the result is based on only four mushrooms.  While none of these four are edible, the confidence interval extends from essentially zero probability of being edible to almost the average probability for the complete dataset.  Thus, while we can conclude from this plot that mushrooms with grooved cap surfaces appear less likely than average to be edible, the available evidence isn’t enough to make this argument too strongly.  In contrast, the results for the mushrooms with scaly cap surfaces (“y”) or smooth cap surfaces (“s”) are essentially identical to those presented above, consistent with the much larger groups on which these results are based.
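For reference, the sketch below implements the Agresti-Coull form of the interval recommended by Brown, Cai, and DasGupta, in which the offset added to the edible count is z^2/2 and the offset added to the group size is z^2, where z is the Gaussian quantile for the chosen confidence level.  This matches the description above, but it is my own reading of the general construction rather than a copy of Eq. (9.67) or of binomCI.proc.

    # Modified 95% intervals (Agresti-Coull form of the Brown-Cai-DasGupta recommendation)
    z <- qnorm(0.975)
    nTilde <- nGroup + z^2                            # modified group sizes
    pTilde <- (nEdible + z^2/2) / nTilde              # modified point estimates
    seTilde <- sqrt(pTilde * (1 - pTilde) / nTilde)   # modified standard deviation
    modCI <- data.frame(level = names(nGroup), pTilde = pTilde,
                        lower = pmax(0, pTilde - z * seTilde),
                        upper = pmin(1, pTilde + z * seTilde))
    modCI    # level "g" now gets a wide interval rather than a zero-width one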



Before leaving this example, it is worth showing how the results are changed in light of my typographical error in Eq. (9.67) of Exploring Data.  The square roots were omitted from terms that define the width of the confidence intervals, and these terms are numerically smaller than 1.  For any x between 0 and 1, the square root of x is larger than x but still smaller than 1, so the effect of the omitted square roots is to make the resulting confidence intervals too narrow (i.e., the interval width is determined by the value x rather than the larger square root of x that should be used).  This error causes our results to appear more precise than they really are.  This effect may be seen clearly in the plot below, which is in the same general format as the two plots discussed above, but with the confidence intervals based on the erroneous implementation of the estimator of Brown et al.
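The size of this narrowing is easy to see numerically: the sketch below compares the correct half-width, which uses the square root of the variance term, with the value obtained when the square root is dropped.  It reuses z, pTilde, and nTilde from the previous sketch and is only meant to illustrate the general effect, not to reproduce the exact expression in Eq. (9.67).

    # Effect of the missing square root on the interval half-widths
    varTerm <- pTilde * (1 - pTilde) / nTilde
    correctHalfWidth <- z * sqrt(varTerm)   # what should be used
    narrowHalfWidth  <- z * varTerm         # what dropping the square root effectively gives
    round(cbind(correctHalfWidth, narrowHalfWidth), 4)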



Finally, the figure below shows four plots, each in the same format as those discussed above, corresponding to the Pedible estimates obtained by applying the method of Brown et al. to four different mushroom characteristics.  The upper left plot shows the results obtained for the mushroom characteristic “Odor,” which appears to be highly predictive of edibility. Careful examination of these results reveals that, for the mushrooms in the UCI dataset, those with odors characterized as “a” (almond) or “l” (anise) are always edible, those with odors characterized as “c” (creosote), “f” (foul), “m” (musty), “p” (pungent), “s” (spicy), or “y” (fishy) are always poisonous, and those with no odor are more likely to be edible than not, but they can still be poisonous.  In contrast, CapShape (upper right plot) appears much less predictive: some values seem to be strongly associated with edibility (“b” or “s”), while the levels “f” and “x” seem to convey no information at all: the likelihood that these mushrooms are edible is essentially the same as that of the complete collection, without regard to CapShape.  The lower left plot shows the corresponding results for StalkRoot, which suggest that levels “c,” “e,” and “r” are more likely to be edible than average, level “b” conveys no information, and mushrooms where StalkRoot values are missing are somewhat more likely to be poisonous (the class “?”).  This result is somewhat distressing, raising the possibility that the missing values for this variable are not missing at random, but that there may be some systematic mechanism at work (e.g., is the StalkRoot characterization somehow more difficult for poisonous mushrooms?).  Finally, the lower right plot shows the results for the binary characteristic GillSize: it appears that mushrooms with GillSize “n” (narrow) are much more likely to be poisonous than those with GillSize “b” (broad).  Because both the response (i.e., edibility) and the candidate predictor GillSize are binary in this case, an alternative – and arguably better – approach to characterizing their relationship is in terms of odds ratios, which I will take up in my next post.
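For completeness, the same construction can be wrapped in a small helper function and applied to any of the mushroom characteristics; the sketch below (again assuming the mushroom data frame and "e"/"p" coding used in the earlier sketches) gives the numbers behind plots like the four shown here.

    # Modified 95% intervals for every level of one categorical mushroom characteristic
    edibilityCI <- function(varName, data = mushroom, conf = 0.95) {
      z <- qnorm(1 - (1 - conf)/2)
      counts  <- table(data[[varName]], data$EdibleOrPoison)
      nGroup  <- rowSums(counts)
      nEdible <- counts[, "e"]
      nTilde  <- nGroup + z^2
      pTilde  <- (nEdible + z^2/2) / nTilde
      se      <- sqrt(pTilde * (1 - pTilde) / nTilde)
      data.frame(level = names(nGroup), pEdible = nEdible / nGroup,
                 lower = pmax(0, pTilde - z * se), upper = pmin(1, pTilde + z * se))
    }
    edibilityCI("Odor")    # similarly for "CapShape", "StalkRoot", and "GillSize"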

