This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Sunday, June 10, 2012

Classifying the UCI mushrooms

In my last post, I considered the shifts in two interestingness measures as possible tools for selecting variables in classification problems.  Specifically, I considered the Gini and Shannon interestingness measures applied to the 22 categorical mushroom characteristics from the UCI mushroom dataset.  The proposed variable selection strategy was to compare these values when computed from only edible mushrooms or only poisonous mushrooms.  The rationale was that variables whose interestingness measures changed a lot between these two subsets might be predictive of mushroom edibility.  In this post, I examine this question a little more systematically, primarily to illustrate the mechanics of setting up classification problems and evaluating their results.

More specifically, the classification problem I consider here is that of building and comparing models that predicts mushroom edibility, each one based on a different mushroom characteristic.  In practice, you would generally consider more than one characteristic as the basis for prediction, but here, I want to use standard classification tools to provide a basis for comparing the predictabilities of each of the potentially promising mushroom characteristics identified in my last post.  In doing this, I also want to highlight three aspects of classification problems: first, the utility of randomly splitting the available data into subsets before undertaking the analysis, second, the fact that we have many different options in building classifiers, and third, one approach to assessing classification results.

One of the extremely useful ideas emphasized in the machine learning literature is the utility of randomly partitioning our dataset into three parts: one used to fit whatever prediction model we are interested in building, another used to perform intermediate fit comparisons (e.g., compare the performance of models based on different predictor variables), and a third that is saved for a final performance assessment.  The reasoning behind this partitioning is that if we allow our prediction model to become too complex, we run the risk of overfitting, or predicting some of the random details in our dataset, resulting in a model that does not perform well on other, similar datasets.  This is an important practical problem that I illustrate with an extreme example in Chapter 1 of Exploring Data in Engineering, the Sciences, and Medicine.  There, a sequence of seven monotonically-decaying observations is fit to a sixth-degree polynomial that exactly predicts the original seven observations, but which exhibits horrible interpolation and extrapolation behavior.  The point here is that we need a practical means of protecting ourselves against building models that are too specific to the dataset at hand, and the partitioning strategy just described provides a simple way of doing this.  That is, once we partition the data, we can fit our prediction model to the first subset and then evaluate its performance with respect to the second subset: because these subsets were generated by randomly sampling the original dataset, their general character is the same, so a “good” prediction model built from the first subset should give “reasonable” predictions for the second subset.  The reason for saving out a third data subset – not used at all until the final evaluation of our model – is that model-building is typically an iterative procedure, so we are likely to cycle repeatedly between the first and second subsets.  For the final model evaluation, it is desirable to have a dataset available that hasn’t been used at all.

Generating this three-way split in R is fairly easy.  As with many tasks, this can be done in more than one way, but the following procedure is fairly straightforward and only makes use of procedures available in base R:

RandomThreeWay.proc <- function(df, probs = c(35,35,30), iseed = 101){
  n = nrow(df)
  u = runif(n)
  nprobs = probs/sum(probs)
  brks = c(0,cumsum(nprobs))
  Subgroup = cut(u, breaks=brks, labels=c("A","B","C"), include.lowest=TRUE)

This function is called with three parameters: the data frame that we wish to partition for our analysis, a vector of the relative sizes of our three partitions, and a seed for the random number generator.  In the implementation shown here, the vector of relative sizes is given the default values 35%/35%/30%, but any relative size partitioning can be specified.  The result returned by this procedure is the character vector Subgroup, which has the values “A”, “B”, or “C”, corresponding to the three desired partitions of the dataset.  The first line of this procedure sets the seed for the uniform random number generator used in the third line, and the second line specifies how many random numbers to generate (i.e., one for each data record in the data frame).  The basic idea here is to generate uniform random numbers on the interval [0,1] and then assign subgroups depending on whether this value falls into the interval between 0 and 0.35, 0.35 to 0.70, or 0.70 to 1.00.  The runif function generates the required random numbers, the cumsum function is used to generate the cumulative breakpoints from the normalized probabilities, and the cut function is used to group the uniform random numbers using these break points.

In the specific example considered here, I use logistic regression as my classifier, although many, many other classification procedures are available in R, including a wide range of decision tree-based models, random forest models, boosted tree models, na├»ve Bayes classifiers, and support vector machines, to name only a few.  (For a more complete list, refer to the CRAN task view on Machine Learning and Statistical Learning).  Here, I construct and compare six logistic regression models, each constructed to predict the probability that a mushroom is poisonous from one of the six mushroom characteristics identified in my previous post: GillSize, StalkShape, CapSurf, Bruises, GillSpace, and Pop.  In each case, I extract the records for subset “A” of the UCI mushroom dataset, as described above, and use the base R procedure glm to construct a logistic regression model.  Because the model evaluation procedure (somers2, described below) that I use here requires a binary response coded as 0 or 1, it is simplest to construct a data frame with this response explicitly, along with the prediction covariate of interest.  The following code does this for the first predictor (GillSize):

EorP = UCImushroom.frame$EorP
PoisonBinary = rep(0,length(EorP))
PoisonIndx = which(EorP = = "p")
PoisonBinary[PoisonIndx] = 1
FirstFrame = data.frame(PoisonBinary = PoisonBinary, Covar = UCImushroom.frame$GillSize)

In particular, this code constructs a two-column data frame that contains the binary response variable PoisonBinary that is equal to 1 whenever EorP is “p” and 0 whenever this variable is “e”, and the prediction covariate Covar, which is here “GillSize”.  Given this data frame, I then apply the following code to randomly partition this data frame into subsets A, B, and C, and I invoke the built-in glm procedure to fit a logistic regression model:

Subset = RandomThreeWay.proc(FirstFrame)
IndxA = which(Subset = = "A")
LogisticModel = glm(PoisonBinary ~ Covar, data = FirstFrame, subset = IndxA, family=binomial())

Note that here I have specified the model form using the R formula construction “PoisonBinary ~ Covar”, I have used the subset argument of the glm procedure to specify that I only want to fit the model to subset A, and I have specified “family = binomial()” to request a logistic regression model.  Once I have this model, I evaluate it using the concordance index C available from the somers2 function in the R package Hmisc.  This value corresponds to the area under the ROC curve and is a measure of agreement between the predictions of the logistic regression model and the actual binary response.  As discussed above, I want to do this evaluation for subset B to avoid an over-optimistic view of the model’s performance due to overfitting of subset A.  To do this, I need the model predictions from subset B, which I obtain with the built-in predict procedure:

IndxB = which(Subset = = "B")
PredPoisonProb = predict(LogisticModel, newdata = FirstFrame[IndxB,], type="response")
ObsPoisonBinary = FirstFrame$PoisonBinary[IndxB]

In addition, I have created the variable ObsPoissonBinary, the sequence of binary responses from subset B, which I will use in calling the somers2 function:

somers2(PredPoisonProb, ObsPoisonBinary)
           C                     Dxy                    n                         Missing
   0.7375031    0.4750063 2858.0000000    0.0000000

The results shown here include the concordance index C, an alternative (and fully equivalent) measure called Somers’ D (from which the procedure gets its name), the number of records in the dataset (here, in subset B), and the number of missing records (here, none).  The concordance index C is a number that varies between 0 and 1, with values between 0.5 and 1.0 meaning that the predictions are better than random guessing, and values less than 0.5 indicating performance so poor that it is actually worse than random guessing.  Here, the value of approximately 0.738 suggests that GillSize is a reasonable predictor of mushroom edibility, at least for mushrooms like those characterized in the UCI mushroom dataset. 

Repeating this process for all six of the mushroom characteristics identified as potentially predictive by the interestingness change analysis I discussed last time leads to the following results:

            Pop:                 C = 0.753        (6 levels)
            Bruises:            C = 0.740        (2 levels)
            GillSize:            C = 0.738        (2 levels)
            GillSpace:         C = 0.635        (2 levels)
            CapSurf:           C = 0.595        (4 levels)
            StalkShape:      C = 0.550        (2 levels)

These results leave open the questions of whether other mushroom characteristics, not identified on the basis of their interestingness shifts, are in fact more predictive of edibility, or how much better the predictions can be if we use more than one prediction variable.  I will examine those questions in subsequent posts, using the ideas outlined here.  For now, it is enough to note that one advantage of the approach described here, relative to that using odds ratios for selected covariates discussed last time, is that this approach can be used to assess the potential prediction power of categorical variables with arbitrary numbers of levels, while the odds ratio approach is limited to two-level predictors.