This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Sunday, March 9, 2014

A question of model uncertainty

It has been several months since my last post on classification tree models, because two things have been consuming all of my spare time.  The first is that I taught a night class for the University of Connecticut’s Graduate School of Business, introducing R to students with little or no prior exposure to either R or programming.  My hope is that the students learned something useful – I can say with certainty that I did – but preparing for the class and teaching it took a lot of time.  The other activity, that has taken essentially all of my time since the class ended, is the completion of a book on nonlinear digital filtering using Python, joint work with my colleague Moncef Gabbouj of the Tampere University of Technology in Tampere, Finland.  I will have more to say about both of these activities in the future, but for now I wanted to respond to a question raised about my last post.

Specifically, Professor Frank Harrell, the developer of the extremely useful Hmisc package, asked the following:

            How did you take into account model uncertainty?  The uncertainty resulting from data mining to find nodes and thresholds for continuous predictors has a massive impact on confidence intervals for estimates from recursive partitioning.

The short answer is that model uncertainty was not accounted for in the results I presented last time, primarily because – as Professor Harrell’s comments indicate – this is a complicated issue for tree-based models.  The primary objective of this post and the next few is to discuss this issue.

So first, what exactly is model uncertainty?  Any time we fit an empirical model to data, the results we obtain inherit some of the uncertainty present in the data.  For the specific example of linear regression models, the magnitude of this uncertainty is partially characterized by the standard errors included in the results returned by R’s summary() function.  This magnitude depends on both the uncertainty inherent in the data and the algorithm we use to fit the model.  Sometimes – and classification tree models are a case in point – this uncertainty is not restricted to variations in the values of a fixed set of parameters, but it can manifest itself in substantial structural variations.  That is, if we fit classification tree models to two similar but not identical datasets, the results may differ in the number of terminal nodes, the depths of these terminal nodes, the variables that determine the path to each one, and the values of these variables that determine the split at each intermediate node.  This is the issue Professor Harrell raised in his comments, and the primary point of this post is to present some simple examples to illustrate its nature and severity.

In addition, this post has two other objectives.  The first is to make amends for a very bad practice demonstrated in my last two posts.  Specifically, the classification tree models described there were fit to a relatively large dataset and then evaluated with respect to that same dataset.  This is bad practice because it can lead to overfitting, a problem that I will discuss in detail in my next post.  (For a simple example that illustrates this problem, see the discussion in Section 1.5.3 of Exploring Data in Engineering, the Sciences, and Medicine.)  In the machine learning community, this issue is typically addressed by splitting the original dataset randomly into three parts: a training subset (Tr) used for model-fitting, a validation subset (V) used for intermediate modeling decisions (e.g., which variables to include in the model), and a test subset (Te) used for final model evaluation.  This approach is described in Section 7.2 of The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, who suggest 50% training, 25% validation, and 25% test as a typical choice.

The other point of this post is to say something about the different roles of model uncertainty and data uncertainty in the practice of predictive modeling.  I will say a little more at the end, but whether we are considering business applications like predicting customer behavior or industrial process control applications to predict the influence of changes in control valve settings, the basic predictive modeling process consists of three steps: build a prediction model; fix (i.e., “finalize”) this model; and apply it to generate predictions from data not seen in the model-building process.  In these applications, model uncertainty plays an important role in the model development process, but once we have fixed the model, we have eliminated this uncertainty by fiat.  Uncertainty remains an important issue in these applications, but the source of this uncertainty is in the data from which the model generates its predictions and not in the model itself once we have fixed it.  Conversely, as George Box famously said, “all models are wrong, but some are useful,” and this point is crucial here: if the model uncertainty is great enough, it may be difficult or impossible to select a fixed model that is good enough to be useful in practice.

Returning to the topic of uncertainty in tree-based models, the above plot is a graphical representation of a classification tree model repeated from my previous two posts.  This model was fit using the ctree procedure in the R package party, taking all optional parameters at their default values.  As before, the dataset used to generate this model was the Australian vehicle insurance dataset car.csv, obtained from the website associated with the book Generalized Linear Models for Insurance Data, by Piet de Jong and Gillian Z. Heller.  This model – and all of the others considered in this post – was fit using the same formula as before:

            Fmla = clm ~ veh_value + veh_body + veh_age + gender + area + agecat

Each record in this dataset describes a single-vehicle, single-driver insurance policy, and clm is a binary response variable taking the value 1 if policy filed one or more claims during the observation period and 0 otherwise.  The other variables (on the right side of “~”) represent covariates that are either numeric (veh_value, the value of the vehicle) or categorical (all other variables, representing the vehicle body type, its age, the gender of the driver, the region where the vehicle is driven, and the driver’s age).

As I noted above, this model was fit to the entire dataset, a practice that is to be discouraged since it does not leave independent datasets of similar character for validation and testing.  To address this problem, I randomly partitioned the original dataset into a 50% training subset, a 25% validation subset, and a 25% test subset as suggested by Hastie, Tibshirani and Friedman.  The plot shown below represents the ctree model we obtain using exactly the same fitting procedure as before, but applied to the 50% random training subset instead of the complete dataset.  Comparing these plots reveals substantial differences in the overall structure of the trees we obtain, strictly as a function of the data used to fit the models.  In particular, while the original model has seven terminal nodes (i.e., the tree assigns every record to one of seven “leaves”), the model obtained from the training data subset has only four.  Also, note that the branches in the original tree model are determined by the three variables agecat, veh_body, and veh_value, while the branches in the model built from the training subset are determined by the two variables agecat and veh_value only.

These differences illustrate the point noted above about the strong dependence of classification tree model structure on the data used in model-building.  One could object that since the two datasets used here differ by a factor of two in size, the comparison isn’t exactly “apples-to-apples.”  To see that this is not really the issue, consider the following two cases, based on the idea of bootstrap resampling.  I won’t attempt a detailed discussion of the bootstrap approach here, but the basic idea is to assess the effects of data variability on a computational procedure by applying that procedure to multiple datasets, each obtained by sampling with replacement from a single source dataset.  (For a comprehensive discussion of the bootstrap and some of its many applications, refer to the book Bootstrap Methods and their Application by A.C. Davison and D.V. Hinkley.)  The essential motivation is that these datasets – called bootstrap resamples – all have the same essential statistical character as the original dataset.  Thus, by comparing the results obtained from different bootstrap resamples, we can assess the variability in results for which exact statistical characterizations are either unknown or impractical to compute.  Here, I use this idea to obtain datasets that should address the “apples-to-apples” concern raised above.  More specifically, I start with the training data subset used to generate the model described in the previous figure, and I use R’s built-in sample() function to sample the rows of this dataframe with replacement.  For an arbitrary dataframe DF, the code to do this is simple:

> set.seed(iseed) 
> BootstrapIndex = sample(seq(1,nrow(DF),1),size=nrow(DF),replace=TRUE 
> ResampleFrame = DF[BootstrapIndex,]

The only variable in this procedure is the seed for the random sampling function sample(), which I have denoted as iseed.  The extremely complicated figure below shows the ctree model obtained using the bootstrap resample generated from the training subset with iseed = 5.

Comparing this model with the previous one – both built from datasets of the same size, with the same general data characteristics – we see that the differences are even more dramatic than those between the original model (built from the complete dataset) and the second one (built from the training subset).  Specifically, while the training subset model has four terminal nodes, determined by two variables, the bootstrap subsample model uses all six of the variables included in the model formula, yielding a tree with 16 terminal nodes.  But wait – sampling with replacement generates a significant number of duplicated records (for large datasets, each bootstrap resample contains approximately 63.2% of the original data values, meaning that the other 36.8% of the resample values must be duplicates).  Could this be the reason the results are so different?  The following example shows that this is not the issue.

This plot shows the ctree model obtained from another bootstrap resample of the training data subset, obtained by specifying iseed = 6 instead of iseed = 5.  This second bootstrap resample tree is much simpler, with only 7 terminal nodes instead of 16, and the branches of the tree are based on only four of the prediction variables instead of all six (specifically, neither gender nor veh_body appear in this model).  While I don’t include all of the corresponding plots, I have also constructed and compared the ctree models obtained from the bootstrap resamples generated for all iseed values between 1 and 8, giving final models involving between four and six variables, with between 7 and 16 terminal nodes.  In all cases, the datasets used in building these models were exactly the same size and had the same statistical character.  The key point is that, as Professor Harrell noted in his comments, the structural variability of these classification tree models across similar datasets is substantial.  In fact, this variability of individual tree-based models was one of the key motivations for developing the random forest method, which achieves substantially reduced model uncertainty by averaging over many randomly generated trees.  Unfortunately, the price we pay for this improved model stability is a complete loss of interpretibility.  That is, looking at any one of the plots shown here, we can construct a simple description (e.g., node 12 in the above figure represents older drivers – agecat > 4 – with less expensive cars, and it has the lowest risk of any of the groups identified there).  While we may obtain less variable predictions by averaging over a large number of these trees, such simple intuitive explanations of the resulting model are no longer possible.

I noted earlier that predictive modeling applications typically involve a three-step strategy: fit the model, fix the model, and apply the model.  I also argued that once we fix the model, we have eliminated model uncertainty when we apply it to new data.  Unfortunately, if the inherent model uncertainty is large, as in the examples presented here, this greatly complicates the “fix the model” step.  That is, if small variations in our training data subset can cause large changes in the structure of our prediction model, it is likely that very different models will exhibit similar performance when applied to our validation data subset.  How, then, do we choose?  I will examine this issue further in my next post when I discuss overfitting and the training/validation/test split in more detail. 


  1. I am trying to use CART algorithms as against conventional statistics, DOE and hypothesis testing. I have got some results and advantages also for problem solving rather than using the data as training set for future prediction.

    Is my understanding correct that uncertainty is more important in cases where we use this training data rather than problem solving approach?


  2. It looks like you spend a lot of effort and time on your blog. I have bookmarked it and I am looking forward to reading new articles. Keep up the good work
    Android Training in Chennai

  3. I am expecting more interesting topics from you. And this was nice content and definitely it will be useful for many people.

    Hadoop Training in Chennai

  4. Wow it is really wonderful and awesome thus it is very much useful for me to understand many concepts and helped me a lot.thus these tips are really awesome and you had a wonderful products.

    Online Reputation Management


  5. What an awesome post, I just read it from start to end. Learned something new after a long time.

    SAP SD training in Chennai

  6. Your thinking toward the respective issue is awesome also the idea behind the blog is very interesting which would bring a new evolution in respective field. Thanks for sharing.

    SEO Company in India

  7. great tips thanks for your impressive enhancement of getting unique style of design a website. and it much more wonderful information to me. keep share many different ideas.
    IELTS coaching in chennai

  8. Truly a very good article on how to handle the future technology. After reading your post,thanks for taking the time to discuss this, I feel happy about and I love learning more about this topic
    Tooth Braces In Chennai

  9. Thank you for sharing such a nice and interesting blog with us. I have seen that all will say the same thing repeatedly. But in your blog, I had a chance to get some useful and unique information. I would like to suggest your blog in my dude circle.
    Jobs in Chennai
    Jobs in Bangalore
    Jobs in Delhi
    Jobs in Hyderabad
    Jobs in Kolkata
    Jobs in Mumbai
    Jobs in Noida
    Jobs in Pune

  10. Thanks for sharing this information and keep updating us. This content is quite informatics to me.
    Hadoop Training in Chennai | Hadoop Training Chennai | Big Data Training in Chennai

  11. Wow.. Thanks much for sharing.. My friend also recommended you so that i can have a helping hand to make my blog as effective as possible.
    Architectural Firms in Chennai
    Architects in Chennai

  12. This content is so informatics and it was motivating all the programmers and beginners to switch over the career into the Big Data Technology. This article is so impressed and keeps updating us regularly.
    Hadoop Training in Chennai | Hadoop Training Chennai | Big Data Training in Chennai

  13. Wonderful blog.. Thanks for sharing informative Post. Its very useful to me.

    Installment loans
    Payday loans
    Title loans

  14. Really Good blog post.provided a helpful information.keep updating...
    Digital marketing company in Chennai

  15. Excellent goods from you, man. I’ve understand your stuff previous to and you’re just too excellent. I actually like what you’ve acquired here, certainly like what you are stating and the way in which you say it. You make it enjoyable and you still take care of to keep it sensible. I can not wait to read far more from you. This is actually a tremendous site..
    Storage space rental singapore
    Cheapest storage space singapore
    Van delivery service singapore


  16. Its a wonderful post and very helpful, thanks for all this information. You are including better information regarding this topic in an effective way.Thank you so much

    Personal Installment Loans
    Payday Cash Advance loan
    Title Car loan
    Cash Advance Loan

  17. wonderful post and very helpful, thanks for all this information. You are including better information regarding this topic in an effective way.Thank you so much
    Web Design Company in Chennai

  18. i really like this blog.And i got more information's from this blog.thanks for sharing!!!!
    SEO Company in Chennai

  19. very informative blog and very useful for readers. making us to learn more from your blog.keep on updating your blog and thanks for sharing
    java training in chennai
    dotnet training in chennai
    cloud -computing training in chennai
    digital marketing training in chennai

  20. Thank you for sharing such a nice and interesting blog with us. But in your blog, I had a chance to get some useful and unique information. I would like to suggest your blog.
    Web development Company in Chennai

  21. this is very nice blog this studying course information very useful to everyone who have learning this information.

    Base SAS Training in chennai

  22. This is a nice article here with some useful tips for those who are not used-to comment that frequently.Thanks for this helpful information I agree with all points you have given to us.I will follow all of them.

    Self Employment Tax
    Tax Preparation Services
    Tax Accountant
    Tax Consultant
    Tax Advisor
    Online Tax services
    Tax Professional
    Online Tax Preparation"

  23. Thank you your is very useful for me .keep more updates.waiting for your new blog.
    Business Tax Return
    Cpa Tax Accountant
    Tax Return Services

  24. This information is impressive; I am inspired with your post writing style & how continuously you describe this topic.

    Eczema Treatment
    Psoriasis Oil
    Hyperpigmentation Treatment
    Herbal Tonic

  25. It's Really nice post.Thank you for sharing such a blog....!!
    Spoken english Training in Bangalore

  26. Just found your post by searching on the Google, I am Impressed and Learned Lot of new thing from your post. I am new to blogging and always try to learn new skill as I believe that blogging is the full time job for learning new things day by day.
    Eczema Treatment
    Psoriasis Oil
    Hyperpigmentation Treatment
    Herbal Tonic

  27. These ways are very simple and very much useful, as a beginner level these helped me a lot thanks fore sharing these kinds of useful and knowledgeable information.

    Hadoop Training in Chennai

  28. useful and knowledgeable information.
    thank u sharing!!

    This blog was very useful for me waiting for more blog.

    SAP FICO Training in Chennai

  29. A great content and very much useful to the visitors. Looking for more updates in future.

    Selenium Training in Chennai

  30. Its really an Excellent post. I just stumbled upon your blog and wanted to say that I have really enjoyed reading your blog. posts. I hope you post again soon.

    Selenium Training in Chennai

  31. is an online platform that allows people to find offers and the greatest deals in and around OMR. It is the directory of the business in and around omr. It helps internet users to surf the required deals and offers.

    Offers in Chennai

  32. It is really very excellent. Because of all given information was wonderful and it's very helpful for me. Thanks for sharing
    SAP Training in Chennai
    SAP ABAP Training in Chennai
    SAP FICO Training in Chennai
    SAP MM Training in Chennai

  33. Thank you for taking the time to provide us with your valuable information. We strive to provide our candidates with excellent care.As always, we appreciate your confidence and trust in us.
    Software Testing Training in Chennai
    SEO Training in Chennai
    Informatica Training in Chennai
    Digital Marketing Training in Chennai

  34. Thanks for sharing, Really it is an amazing article I had ever read. I hope it will help a lot for all. Thank you so much for this amazing posts and please keep update like this.


  35. These ways are very simple and very much useful, as a beginner level these helped me a lot thanks fore sharing these kinds of useful and knowledgeable information.
    Also Check out the :

  36. Usually I do not read post on blogs, but I would like to say that this write-up very forced me to try and do it! Your writing style has been surprised me. Great work admin..Keep update more blog..
    Economics Journal
    Journal Of Ecology
    Thesis by publication
    National Journal
    Journal Of Information Technology