This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Saturday, May 19, 2012

Interestingness comparisons

In three previous posts (April 3, 2011, April 12, 2011, and May 21, 2011), I have discussed interestingness measures, which characterize the distributional heterogeneity of categorical variables.  Four specific measures are discussed in Chapter 3 of Exploring Data in Engineering, the Sciences, and Medicine: the Bray measure, the Gini measure, the Shannon measure, and the Simpson measure.  All four of these measures vary from 0 to 1 in value, exhibiting their minimum value when all levels of the variable are equally represented, and exhibiting their maximum value when the variable is completely concentrated on a single one of its several possible levels.  Intermediate values correspond to variables that are more or less homogeneously distributed: more homogeneous for smaller values of the measure, and less homogeneous for larger values.  One of the points I noted in my first post on this topic was that the different measures exhibit different behavior for the intermediate cases, reflecting different inherent sensitivities to the various ways in which a variable can be “more homogeneous” or “less homogeneous.”  This post examines changes in interestingness measures as a potential exploratory analysis tool for selecting categorical predictors of some binary response.  In fact, I examined the same question from a different perspective in my April 12 post noted above.  The primary difference is that the characterization I considered there generates a separate graph for each variable, with the number of points on the graph corresponding to the number of levels of the variable.  Here, I examine a characterization that represents each variable as a single point on one graph, allowing us to consider all variables simultaneously.



As a reminder of how these measures behave, the figure above shows a plot of the normalized Gini measure versus the normalized Shannon measure for the 23 categorical variables included in the mushroom dataset from the UCI Machine Learning Repository.  As I have noted in several previous posts that have discussed this dataset, it gives observable characteristics for 8,124 mushrooms and classifies each one as either edible or poisonous (the binary variable EorP).  The above plot illustrates the systematic difference between the normalized Shannon and Gini interestingness measures: there, each point represents one of the 23 variables in the dataset, with the horizontal axis representing the Shannon measure computed for the variable and the vertical axis representing the corresponding Gini measure.  The plot shows that the Gini measure is consistently larger than the Shannon measure, since all points lie above the equality reference line in this plot except for the single point at the origin.  This point corresponds to the variable VeilType, which only exhibits a single value in this dataset, meaning that both the Gini and Shannon measures are inherently ill-defined; consequently, they are given the default value of zero here, consistent with the general interpretation of these measures: if a variable only assumes a single value, it seems reasonable to consider it “completely homogeneous.”
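To make the comparison above concrete, here is a minimal Python sketch of the two measures.  The formulas are assumptions on my part, since this post does not restate them: the Shannon measure is taken as one minus the entropy divided by its maximum value log(M), and the Gini measure as the sum of absolute pairwise differences in level fractions divided by 2(M - 1), where M is the number of levels.  The function names and the optional `levels` argument are hypothetical, and the single-level default of zero follows the VeilType convention described above.

```python
import math
from collections import Counter

def level_fractions(values, levels=None):
    """Fraction of (non-empty) `values` falling in each level.

    If `levels` is omitted, only the observed levels are used (an assumption);
    pass the full list of possible levels to include empty ones.
    """
    counts = Counter(values)
    if levels is None:
        levels = sorted(counts)
    n = len(values)
    return [counts[lev] / n for lev in levels]

def shannon_measure(values, levels=None):
    """Assumed normalization: 1 - entropy / log(M); 0 = uniform, 1 = concentrated."""
    p = level_fractions(values, levels)
    m = len(p)
    if m < 2:
        return 0.0  # single-level variable: default value of zero
    # Levels with zero fraction contribute nothing to the entropy.
    entropy = -sum(pi * math.log(pi) for pi in p if pi > 0)
    return 1.0 - entropy / math.log(m)

def gini_measure(values, levels=None):
    """Assumed normalization: sum_ij |p_i - p_j| / (2 (M - 1))."""
    p = level_fractions(values, levels)
    m = len(p)
    if m < 2:
        return 0.0  # single-level variable: default value of zero
    total = sum(abs(pi - pj) for pi in p for pj in p)
    return total / (2.0 * (m - 1))
```

For a binary variable split 3-to-1, `gini_measure(["a", "a", "a", "b"])` gives 0.5 while `shannon_measure` on the same data gives a smaller value, in line with the ordering seen in the plot.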

Because edible and poisonous mushrooms are fairly evenly represented in this dataset (51.8% edible versus 48.2% poisonous), it has been widely used as one of several benchmarks for evaluating classification algorithms.  In particular, given the other mushroom characteristics, the fundamental classification question is how well we can predict whether each mushroom is poisonous or edible.  In this post and a subsequent follow-up post, I consider a closely related question: can differences in a variable’s interestingness measure between the edible subset and the poisonous subset be used to help us select prediction covariates for these classification algorithms?  In this post, I present some preliminary evidence to suggest that this may be the case, while in a subsequent post, I will put the question to the test by seeing how well the covariates suggested by this analysis actually predict edibility.

The specific idea I examine here is the following: given an interestingness measure and a mushroom characteristic, compute this measure for the chosen characteristic, applied to the edible and poisonous mushrooms separately.  If these numbers are very different, this suggests that the distribution of levels is different for edible and poisonous mushrooms, further suggesting that this variable may be a useful predictor of edibility.  To turn this idea into a data analysis tool, it is necessary to define what we mean by “very different,” and this can be done in more than one way.  Here, I consider two possibilities.  The first is what I call the “normalized difference,” defined as the difference of the two interestingness measures divided by their sum.  Since both interestingness measures lie between 0 and 1, it is not difficult to show that this normalized difference lies between -1 and +1.  As a specific application of this idea, consider the plot below, which shows the normalized difference in the Gini measure between the poisonous mushrooms and the edible mushrooms (the normalized Gini shift) plotted against the corresponding difference for the Shannon measure (the normalized Shannon shift).  In addition, this plot shows an equality reference line, and the fact that the points consistently lie between this line and the horizontal axis shows that the normalized Gini shift is consistently smaller in magnitude than the normalized Shannon shift.  This suggests that the normalized Shannon measure may be more sensitive to distributional differences between edible and poisonous mushrooms.
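The normalized difference described above can be sketched in a few lines of Python.  The function name is hypothetical, and two details are assumptions rather than anything stated in this post: the sign convention (poisonous minus edible) and the choice to return zero when both measures are zero, where the ratio is otherwise undefined.

```python
def normalized_shift(measure_poisonous, measure_edible):
    """Difference of two interestingness measures divided by their sum.

    Both inputs are assumed to lie in [0, 1], so the result lies in [-1, +1].
    """
    total = measure_poisonous + measure_edible
    if total == 0:
        # Both subsets look perfectly homogeneous: treat as "no shift"
        # (an assumed convention, not one taken from the post).
        return 0.0
    return (measure_poisonous - measure_edible) / total
```

The bounds follow because the numerator can never exceed the denominator in magnitude when both measures are non-negative; the extreme values are reached only when one of the two measures is exactly zero.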



The next figure, below, shows a re-drawn version of the above plot, with the equality reference line removed and replaced by four other reference lines.  The vertical dashed lines correspond to the outlier detection limits obtained by the Hampel identifier with threshold value t = 2 (see Chapter 7 of Exploring Data for a detailed discussion of this procedure), computed from the normalized Shannon shift values, while the horizontal dashed lines represent the corresponding limits computed from the normalized Gini shift values.  Points falling outside these limits represent variables whose changes in both Gini measure and Shannon measure are “unusually large” according to the Hampel identifier criteria used here.  These points are represented as solid circles, while those not detected as “unusual” by the Hampel identifier are represented as open circles.  The idea proposed here – to be investigated in a future post – is that these outlying variables may be useful in predicting mushroom edibility.
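For readers who want to reproduce reference lines like these, the following Python sketch computes Hampel identifier limits in their usual form: the median plus or minus t times the MAD scale estimate, where the MAD is multiplied by 1.4826 to make it comparable to a standard deviation for Gaussian data.  The function names are hypothetical, and the sketch works on one list of shift values at a time (for example, the Shannon shifts for the vertical lines and the Gini shifts for the horizontal ones).

```python
from statistics import median

def hampel_limits(x, t=2.0):
    """Return (lower, upper) Hampel identifier limits for the values in x."""
    center = median(x)
    # MAD scale estimate: median absolute deviation from the median,
    # scaled by 1.4826 for consistency with the Gaussian standard deviation.
    mad_scale = 1.4826 * median([abs(v - center) for v in x])
    return center - t * mad_scale, center + t * mad_scale

def hampel_flags(x, t=2.0):
    """Return a list of booleans, True where a value falls outside the limits."""
    lower, upper = hampel_limits(x, t)
    return [v < lower or v > upper for v in x]
```

Note that if more than half of the values are identical, the MAD scale is zero and every value different from the median is flagged, so the results deserve a second look in that situation.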



More specifically, the solid circles in the above plot correspond to six mushroom characteristics.  The two points in the lower left corner of the plot – exhibiting almost the most negative normalized Shannon shift possible – correspond to GillSize and StalkShape, two binary variables.  As I discussed in a previous post (May 7, 2011) and I discuss further in Chapter 13 of Exploring Data, an extremely useful measure of association between two binary variables (e.g., between GillSize and edibility) is the odds ratio.  An examination of the odds ratios for these two variables suggests that both should be at least somewhat predictive of edibility: the odds ratio between GillSize and edibility is 0.056, suggesting a very strong association (specifically, a GillSize value of “n” for “narrow” is most commonly associated with poisonous mushrooms in the UCI mushroom dataset), while the odds ratio between StalkShape and edibility is less extreme at 1.511, but still different enough from the neutral value of 1 to be suggestive of a clear association between these variables (a StalkShape value of “t” is more strongly associated with edible mushrooms than the alternative value of “e”).  The solid circle in the upper right of this plot corresponds to the variable CapSurf, which has four levels and whose distributional homogeneity appears to change quite substantially, according to both the Gini and Shannon measures.  Because this variable has more than two levels, it is not possible to characterize its association in terms of its odds ratio relative to edibility.  Finally, the cluster of three points in the upper right, just barely above the upper horizontal dashed line, corresponds to the binary variables Bruises and GillSpace, and the six-level variable Pop.  Both of the binary variables in this cluster exhibit very large odds ratios with respect to edibility (9.97 and 13.55 for Bruises and GillSpace, respectively), again suggesting that these variables may be highly predictive of edibility.
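As a reminder of how an odds ratio is obtained from two binary variables, here is a minimal Python sketch based on the 2x2 table of counts.  The function name and arguments are hypothetical, and which level of each variable is treated as the reference is a choice the caller makes; swapping the reference level of one variable inverts the ratio, which is why values far below 1 (like 0.056) and far above 1 (like 13.55) both indicate strong association.

```python
def odds_ratio(x, y, x_level, y_level):
    """Odds ratio between two binary sequences x and y of equal length.

    n11 counts records with x == x_level and y == y_level, n00 counts
    records matching neither, and n10, n01 count the two mixed cases.
    """
    n11 = n10 = n01 = n00 = 0
    for xi, yi in zip(x, y):
        if xi == x_level and yi == y_level:
            n11 += 1
        elif xi == x_level:
            n10 += 1
        elif yi == y_level:
            n01 += 1
        else:
            n00 += 1
    if n10 * n01 == 0:
        # An empty off-diagonal cell makes the ratio undefined or infinite.
        raise ValueError("odds ratio undefined: empty off-diagonal cell")
    return (n11 * n00) / (n10 * n01)
```

A value of 1 corresponds to no association, which is the neutral value referred to above.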

The prevalence of binary variables in these results is noteworthy, and it reflects the fact that distributional shifts for binary variables can only occur in one way (i.e., the relative frequency of either fixed level can either increase or decrease).  Thus, large shifts in either interestingness measure should correspond to significant odds ratios with respect to the binary response variable, and this is seen to be the case here.  The situation is more complicated when a variable exhibits more than two levels, since the distribution of these levels can change in many ways between the two binary response values.  An important advantage of techniques like the interestingness shift analysis described here is that they are not restricted to binary characteristics, as odds ratio characterizations are.

The second approach I consider for measuring the shift in interestingness between edible and poisonous mushrooms is what I call the “marginal measure,” corresponding to the difference in either the Gini or the Shannon measure between poisonous and edible mushrooms, divided by the original measure for the complete dataset.  An important difference between the marginal measure and the normalized measure is that the marginal measure is not bounded to lie between -1 and +1, as is evident in the plot below.  This plot shows the marginal Gini shift against the marginal Shannon shift for the mushroom characteristics, in the same format as the plot above.  Here, only four points are flagged as outliers, corresponding to the four binary variables identified above from the normalized shift plot: Bruises (the point in the extreme upper right), GillSpace (the point just barely in the upper right quadrant), and GillSize and StalkShape (the two points in the extreme lower left).  However, if we lower the Hampel identifier threshold from t = 2 to t = 1.5, we again identify CapSurf and Pop as potentially influential variables.
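The marginal measure can be sketched in the same way as the normalized one.  As before, the function name is hypothetical, and the sign convention (poisonous minus edible) and the zero returned for a variable whose overall measure is zero are assumptions, not details given in this post.

```python
def marginal_shift(measure_poisonous, measure_edible, measure_all):
    """Difference of the subset measures divided by the full-dataset measure."""
    if measure_all == 0:
        # A variable that is perfectly homogeneous overall (or has a single
        # level, like VeilType) has no meaningful marginal shift; return 0
        # as an assumed convention.
        return 0.0
    return (measure_poisonous - measure_edible) / measure_all
```

Because the denominator is the measure for the complete dataset rather than the sum of the two subset measures, it can be much smaller than the numerator.  For example, subset measures of 0.75 and 0.25 with an overall measure of 0.125 give a marginal shift of 4, well outside the interval from -1 to +1.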



This last observation suggests an alternative interpretation approach that may be worth exploring.  Specifically, both of the previous plots give clear visual evidence of “cluster structure,” and the Hampel identifier does extract some or all of this structure from the plot, but only if we tune the threshold parameter carefully.  A possible alternative would be to apply cluster analysis procedures, and this will be the subject of one or more subsequent posts.  In particular, there are many different clustering algorithms that could be applied to this problem, and the results are likely to be quite different.  The key practical question is which ones – if any – lead to useful ways of grouping these mushroom characteristics.  Subsequent posts will examine this question further from several different perspectives.