This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Saturday, April 13, 2013

Classification Tree Models

On March 26, I attended the Connecticut R Meetup in New Haven, which featured a talk by Illya Mowerman on decision trees in R.  I have gone to these Meetups before, and I have always found them interesting and informative.  Attendees range from those who are just starting to explore R to those who have multiple CRAN packages to their credit.  Each session is organized around a talk that focuses on some aspect of R, and both the talk and the discussion that follows are typically lively and useful.  More information about the Connecticut R Meetup can be found here, and information about R Meetups in other areas can be found with a Google search on “R Meetup” together with a location.


Mowerman’s talk focused on decision trees like the one shown in the figure above.  I give a somewhat more detailed discussion of this example below, but the basic idea is that the tree assigns every record in a dataset to a unique group, and a predicted response is generated for each group.  The basic decision tree models are either classification trees, appropriate to binary response variables, or regression tree models, appropriate to numeric response variables.  The figure above represents a classification tree model that predicts the probability that an automobile insurance policyholder will file a claim, based on a publicly available insurance dataset discussed further below.  Two advantages of classification tree models that Mowerman emphasized in his talk are, first, their simplicity of interpretation, and second, their ability to generate predictions from a mix of numerical and categorical covariates.  The above example illustrates both of these points – the decision tree is based on both categorical variables like veh_body (vehicle body type) and numerical variables like veh_value (the vehicle value in units of 10,000 Australian dollars). 

To interpret this tree, begin by reading from the top down, with the root node, numbered 1, which partitions the dataset into two subsets based on the variable agecat.  This variable is an integer-coded driver age group with six levels, ranging from 1 for the youngest drivers to 6 for the oldest drivers.  The root node splits the dataset into a younger driver subgroup (to the left, with agecat values 1 through 4) and an older driver subgroup (to the right, with agecat values 5 and 6).  Going to the right, node 11 splits the older driver group on the basis of vehicle value, with node 12 consisting of older drivers with veh_value less than or equal to 2.89, corresponding to vehicle values not more than 28,900 Australian dollars.  This subgroup contains 15,351 policy records, of which 5.3% file claims.  Similarly, node 13 corresponds to older drivers with vehicles valued at more than 28,900 Australian dollars; this is a smaller group (1,932 policy records) with a higher fraction filing claims (8.3%).  Going to the left, we partition the younger driver group first on vehicle body type (node 2), then possibly a second time on driver age (node 4), possibly further on vehicle value (node 6), and finally again on vehicle body type (node 7).  The key point is that every record in the dataset is ultimately assigned to one of the seven terminal nodes of this tree (the “leaves,” numbered 3, 5, 8, 9, 10, 12, and 13).  The numbers associated with these terminal nodes give the size of each group and the fraction of each group that files a claim, which may be viewed as an estimate of the conditional probability that a driver from that group will file a claim.

Classification trees can be fit to data using a number of different algorithms, several of which are implemented in various R packages.  Mowerman’s talk focused primarily on the rpart package included with the standard R distribution, whose rpart procedure is based on what is probably the best-known algorithm for fitting classification and regression trees (CART).  In addition, Mowerman also discussed the rpart.plot package, a very useful adjunct to rpart that provides a lot of flexibility in representing the resulting tree models graphically.  In particular, this package can be used to make much nicer plots than the one shown above; I haven't done that here largely because I have used a different tree-fitting procedure, for reasons discussed in the next paragraph.  Another classification tree package that Mowerman mentioned in his talk is C50, which implements the C5.0 algorithm popular in the machine learning community.  The primary focus of this post is the ctree procedure in the party package, which was used to fit the tree shown here.

The reason I have used the ctree procedure instead of the rpart procedure is that, for the dataset I consider here, the rpart procedure returns a trivial tree.  That is, when I attempt to fit a tree to the dataset using rpart with the response variable and covariates described below, the resulting “tree” assigns the entire dataset to a single node, declaring the overall fraction of positive responses in the dataset to be the common prediction for all records.  Applying the ctree procedure (the code is listed below) yields the nontrivial tree shown in the plot above.  The difference in these results arises because the rpart and ctree procedures use different tree-fitting algorithms.  Very likely, the reason rpart has such difficulty with this dataset is its high degree of class imbalance: the positive response (i.e., “policy filed one or more claims”) occurs in only 4,624 of 67,856 data records, representing 6.81% of the total.  This imbalance problem is known to make classification difficult, enough so that it has become the focus of a specialized technical literature.  For a rather technical survey of this topic, refer to the paper “The Class Imbalance Problem: A Systematic Study,” by Japkowicz and Stephen (Intelligent Data Analysis, volume 6, number 5, November, 2002).  (So far, I have not been able to find a free version of this paper, but a search on the title turns up a number of other useful, if generally more specialized, papers on the topic.)
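For readers who want to check this behavior for themselves, the commands below give one way to quantify the imbalance and to reproduce the trivial rpart fit just described.  This is a sketch rather than the exact code from Mowerman’s talk: it assumes the car.csv file described below has already been saved in the working directory, and the method = "class" argument is my addition to force a classification rather than regression fit for the 0/1 response clm:

> library(rpart)
> carFrame = read.csv("car.csv")
> table(carFrame$clm)
> round(100 * prop.table(table(carFrame$clm)), 2)
> RpartModel = rpart(clm ~ veh_value + veh_body + veh_age + gender + area + agecat, data = carFrame, method = "class")
> print(RpartModel)

Printing the rpart model object should show only the root node, with the overall claim fraction as its prediction, consistent with the trivial tree described above.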

To obtain the tree shown in the plot above, I used the following R commands:

> library(party)
> carFrame = read.csv("car.csv")
> Fmla = clm ~ veh_value + veh_body + veh_age + gender + area + agecat
> TreeModel = ctree(Fmla, data = carFrame)
> plot(TreeModel, type="simple")


The first line loads the party package to make the ctree procedure available for our use, and the second line reads the data file described below into the dataframe carFrame (note that this assumes the data file "car.csv" has been saved in R's current working directory, which can be displayed with the getwd() command).  The third line defines the formula that specifies the response as the binary variable clm (on the left side of "~") and the six other variables listed above, each separated by the "+" symbol, as potential predictors.  The fourth line invokes the ctree procedure to fit the model, and the last line plots the resulting tree.
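Once the model has been fit, the terminal node sizes and claim fractions shown in the plot can be recovered directly.  The commands below are one way to do this with the party package: the type = "node" option of the predict method returns the terminal node number for each record, and because clm is coded 0/1, the group means computed by tapply are the estimated claim probabilities discussed earlier:

> NodeID = predict(TreeModel, type = "node")
> table(NodeID)
> round(100 * tapply(carFrame$clm, NodeID, mean), 1)

As a check, node 12 should appear here with 15,351 records and a claim percentage of about 5.3, in agreement with the numbers shown in the plot above.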

The dataset I used here is car.csv, available from the website associated with the book Generalized Linear Models for Insurance Data, by Piet de Jong and Gillian Z. Heller.  As noted, this dataset contains 67,856 records, each characterizing an automobile insurance policy associated with one vehicle and one driver.  The dataset has 10 columns, each representing an observed value for a policy characteristic, including claim and loss information, vehicle characteristics, driver characteristics, and certain other variables (e.g., a categorical variable characterizing the type of region where the vehicle is driven).  The ctree model shown above was built to predict the binary response variable clm (where clm = 1 if one or more claims have been filed by the policyholder, and 0 otherwise), based on the following prediction variables:


- the numeric variable veh_value;
- veh_body, a categorical variable with 13 levels;
- veh_age, an integer-coded categorical variable with 4 levels;
- gender, a binary indicator of driver gender;
- area, a categorical variable with six levels;
- agecat, an integer-coded driver age variable with six levels.

The tree model shown above illustrates one of the points Mowerman made in his talk, that classification tree models can easily handle mixed covariate types: here, these covariates include one numeric variable (veh_value), one binary variable (gender), and four categorical variables.  In principle, tree models can be built using categorical variables with an arbitrary number of levels, but in practice procedures like ctree will fail if the number of levels becomes too large.
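One quick way to see this mix of covariate types, and to check how many distinct values each categorical covariate takes before fitting a tree, is something like the following (counting unique values rather than factor levels, so the check works whether or not the text columns were converted to factors when the file was read):

> sapply(carFrame[, c("veh_value", "veh_body", "veh_age", "gender", "area", "agecat")], class)
> sapply(carFrame[, c("veh_body", "veh_age", "gender", "area", "agecat")], function(x) length(unique(x)))

The second command should confirm the level counts listed above (e.g., 13 distinct values for veh_body and six for area).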

One of the tuning parameters in tree-fitting procedures like rpart and ctree is the minimum node size.  In his R Meetup talk, Mowerman showed that increasing this value from its default of 7 yielded simpler trees for the dataset he considered (the churn dataset from the C50 package).  Specifically, increasing the minimum node size parameter eliminated very small nodes from the tree, nodes too small to be of much practical use.  In my next post, I will show how a graphical tool for displaying binomial probability confidence limits can be used to help interpret classification tree results by explicitly displaying the prediction uncertainties.  The tool I use is GroupedBinomialPlot, one of those included in the ExploringData package that I am developing.
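Returning briefly to the minimum node size parameter: in the ctree procedure this is controlled through the minbucket argument of ctree_control, while rpart exposes the analogous minbucket and minsplit arguments through rpart.control.  The sketch below illustrates the ctree version; the value of 500 is an arbitrary choice for illustration, not a recommendation:

> TreeModel2 = ctree(Fmla, data = carFrame, controls = ctree_control(minbucket = 500))
> plot(TreeModel2, type = "simple")

Requiring every terminal node to contain at least minbucket records generally yields smaller, simpler trees, in line with the behavior Mowerman described for the churn dataset.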

Finally, I should say in response to a question about my last post that, sadly, I do not yet have a beta test version of the ExploringData package.