This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Showing posts with label python. Show all posts
Showing posts with label python. Show all posts

Sunday, March 9, 2014

A question of model uncertainty

It has been several months since my last post on classification tree models, because two things have been consuming all of my spare time.  The first is that I taught a night class for the University of Connecticut’s Graduate School of Business, introducing R to students with little or no prior exposure to either R or programming.  My hope is that the students learned something useful – I can say with certainty that I did – but preparing for the class and teaching it took a lot of time.  The other activity, that has taken essentially all of my time since the class ended, is the completion of a book on nonlinear digital filtering using Python, joint work with my colleague Moncef Gabbouj of the Tampere University of Technology in Tampere, Finland.  I will have more to say about both of these activities in the future, but for now I wanted to respond to a question raised about my last post.

Specifically, Professor Frank Harrell, the developer of the extremely useful Hmisc package, asked the following:

            How did you take into account model uncertainty?  The uncertainty resulting from data mining to find nodes and thresholds for continuous predictors has a massive impact on confidence intervals for estimates from recursive partitioning.

The short answer is that model uncertainty was not accounted for in the results I presented last time, primarily because – as Professor Harrell’s comments indicate – this is a complicated issue for tree-based models.  The primary objective of this post and the next few is to discuss this issue.

So first, what exactly is model uncertainty?  Any time we fit an empirical model to data, the results we obtain inherit some of the uncertainty present in the data.  For the specific example of linear regression models, the magnitude of this uncertainty is partially characterized by the standard errors included in the results returned by R’s summary() function.  This magnitude depends on both the uncertainty inherent in the data and the algorithm we use to fit the model.  Sometimes – and classification tree models are a case in point – this uncertainty is not restricted to variations in the values of a fixed set of parameters, but it can manifest itself in substantial structural variations.  That is, if we fit classification tree models to two similar but not identical datasets, the results may differ in the number of terminal nodes, the depths of these terminal nodes, the variables that determine the path to each one, and the values of these variables that determine the split at each intermediate node.  This is the issue Professor Harrell raised in his comments, and the primary point of this post is to present some simple examples to illustrate its nature and severity.

In addition, this post has two other objectives.  The first is to make amends for a very bad practice demonstrated in my last two posts.  Specifically, the classification tree models described there were fit to a relatively large dataset and then evaluated with respect to that same dataset.  This is bad practice because it can lead to overfitting, a problem that I will discuss in detail in my next post.  (For a simple example that illustrates this problem, see the discussion in Section 1.5.3 of Exploring Data in Engineering, the Sciences, and Medicine.)  In the machine learning community, this issue is typically addressed by splitting the original dataset randomly into three parts: a training subset (Tr) used for model-fitting, a validation subset (V) used for intermediate modeling decisions (e.g., which variables to include in the model), and a test subset (Te) used for final model evaluation.  This approach is described in Section 7.2 of The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, who suggest 50% training, 25% validation, and 25% test as a typical choice.

The other point of this post is to say something about the different roles of model uncertainty and data uncertainty in the practice of predictive modeling.  I will say a little more at the end, but whether we are considering business applications like predicting customer behavior or industrial process control applications to predict the influence of changes in control valve settings, the basic predictive modeling process consists of three steps: build a prediction model; fix (i.e., “finalize”) this model; and apply it to generate predictions from data not seen in the model-building process.  In these applications, model uncertainty plays an important role in the model development process, but once we have fixed the model, we have eliminated this uncertainty by fiat.  Uncertainty remains an important issue in these applications, but the source of this uncertainty is in the data from which the model generates its predictions and not in the model itself once we have fixed it.  Conversely, as George Box famously said, “all models are wrong, but some are useful,” and this point is crucial here: if the model uncertainty is great enough, it may be difficult or impossible to select a fixed model that is good enough to be useful in practice.



Returning to the topic of uncertainty in tree-based models, the above plot is a graphical representation of a classification tree model repeated from my previous two posts.  This model was fit using the ctree procedure in the R package party, taking all optional parameters at their default values.  As before, the dataset used to generate this model was the Australian vehicle insurance dataset car.csv, obtained from the website associated with the book Generalized Linear Models for Insurance Data, by Piet de Jong and Gillian Z. Heller.  This model – and all of the others considered in this post – was fit using the same formula as before:

            Fmla = clm ~ veh_value + veh_body + veh_age + gender + area + agecat

Each record in this dataset describes a single-vehicle, single-driver insurance policy, and clm is a binary response variable taking the value 1 if policy filed one or more claims during the observation period and 0 otherwise.  The other variables (on the right side of “~”) represent covariates that are either numeric (veh_value, the value of the vehicle) or categorical (all other variables, representing the vehicle body type, its age, the gender of the driver, the region where the vehicle is driven, and the driver’s age).

As I noted above, this model was fit to the entire dataset, a practice that is to be discouraged since it does not leave independent datasets of similar character for validation and testing.  To address this problem, I randomly partitioned the original dataset into a 50% training subset, a 25% validation subset, and a 25% test subset as suggested by Hastie, Tibshirani and Friedman.  The plot shown below represents the ctree model we obtain using exactly the same fitting procedure as before, but applied to the 50% random training subset instead of the complete dataset.  Comparing these plots reveals substantial differences in the overall structure of the trees we obtain, strictly as a function of the data used to fit the models.  In particular, while the original model has seven terminal nodes (i.e., the tree assigns every record to one of seven “leaves”), the model obtained from the training data subset has only four.  Also, note that the branches in the original tree model are determined by the three variables agecat, veh_body, and veh_value, while the branches in the model built from the training subset are determined by the two variables agecat and veh_value only.



These differences illustrate the point noted above about the strong dependence of classification tree model structure on the data used in model-building.  One could object that since the two datasets used here differ by a factor of two in size, the comparison isn’t exactly “apples-to-apples.”  To see that this is not really the issue, consider the following two cases, based on the idea of bootstrap resampling.  I won’t attempt a detailed discussion of the bootstrap approach here, but the basic idea is to assess the effects of data variability on a computational procedure by applying that procedure to multiple datasets, each obtained by sampling with replacement from a single source dataset.  (For a comprehensive discussion of the bootstrap and some of its many applications, refer to the book Bootstrap Methods and their Application by A.C. Davison and D.V. Hinkley.)  The essential motivation is that these datasets – called bootstrap resamples – all have the same essential statistical character as the original dataset.  Thus, by comparing the results obtained from different bootstrap resamples, we can assess the variability in results for which exact statistical characterizations are either unknown or impractical to compute.  Here, I use this idea to obtain datasets that should address the “apples-to-apples” concern raised above.  More specifically, I start with the training data subset used to generate the model described in the previous figure, and I use R’s built-in sample() function to sample the rows of this dataframe with replacement.  For an arbitrary dataframe DF, the code to do this is simple:

> set.seed(iseed) 
> BootstrapIndex = sample(seq(1,nrow(DF),1),size=nrow(DF),replace=TRUE 
> ResampleFrame = DF[BootstrapIndex,]

The only variable in this procedure is the seed for the random sampling function sample(), which I have denoted as iseed.  The extremely complicated figure below shows the ctree model obtained using the bootstrap resample generated from the training subset with iseed = 5.





Comparing this model with the previous one – both built from datasets of the same size, with the same general data characteristics – we see that the differences are even more dramatic than those between the original model (built from the complete dataset) and the second one (built from the training subset).  Specifically, while the training subset model has four terminal nodes, determined by two variables, the bootstrap subsample model uses all six of the variables included in the model formula, yielding a tree with 16 terminal nodes.  But wait – sampling with replacement generates a significant number of duplicated records (for large datasets, each bootstrap resample contains approximately 63.2% of the original data values, meaning that the other 36.8% of the resample values must be duplicates).  Could this be the reason the results are so different?  The following example shows that this is not the issue.



This plot shows the ctree model obtained from another bootstrap resample of the training data subset, obtained by specifying iseed = 6 instead of iseed = 5.  This second bootstrap resample tree is much simpler, with only 7 terminal nodes instead of 16, and the branches of the tree are based on only four of the prediction variables instead of all six (specifically, neither gender nor veh_body appear in this model).  While I don’t include all of the corresponding plots, I have also constructed and compared the ctree models obtained from the bootstrap resamples generated for all iseed values between 1 and 8, giving final models involving between four and six variables, with between 7 and 16 terminal nodes.  In all cases, the datasets used in building these models were exactly the same size and had the same statistical character.  The key point is that, as Professor Harrell noted in his comments, the structural variability of these classification tree models across similar datasets is substantial.  In fact, this variability of individual tree-based models was one of the key motivations for developing the random forest method, which achieves substantially reduced model uncertainty by averaging over many randomly generated trees.  Unfortunately, the price we pay for this improved model stability is a complete loss of interpretibility.  That is, looking at any one of the plots shown here, we can construct a simple description (e.g., node 12 in the above figure represents older drivers – agecat > 4 – with less expensive cars, and it has the lowest risk of any of the groups identified there).  While we may obtain less variable predictions by averaging over a large number of these trees, such simple intuitive explanations of the resulting model are no longer possible.

I noted earlier that predictive modeling applications typically involve a three-step strategy: fit the model, fix the model, and apply the model.  I also argued that once we fix the model, we have eliminated model uncertainty when we apply it to new data.  Unfortunately, if the inherent model uncertainty is large, as in the examples presented here, this greatly complicates the “fix the model” step.  That is, if small variations in our training data subset can cause large changes in the structure of our prediction model, it is likely that very different models will exhibit similar performance when applied to our validation data subset.  How, then, do we choose?  I will examine this issue further in my next post when I discuss overfitting and the training/validation/test split in more detail. 


Saturday, December 15, 2012

Data Science, Data Analysis, R and Python

The October 2012 issue of Harvard Business Review prominently features the words “Getting Control of Big Data” on the cover, and the magazine includes these three related articles:

  1. “Big Data: The Management Revolution,” by Andrew McAfee and Erik Brynjolfsson, pages 61 – 68;
  2. “Data Scientist: The Sexiest Job of the 21st Century,” by Thomas H. Davenport and D.J. Patil, pages 70 – 76;
  3. “Making Advanced Analytics Work For You,” by Dominic Barton and David Court, pages 79 – 83.

All three provide food for thought; this post presents a brief summary of some of those thoughts.

One point made in the first article is that the “size” of a dataset – i.e., what constitutes “Big Data” – can be measured in at least three very different ways: volume, velocity, and variety.  All of these aspects of the Big Data characterization problem affect it, but differently:

·        For very large data volumes, one fundamental issue is the incomprehensibility of the raw data itself.  Even if you could display a data table with several million, billion, or trillion rows and hundreds or thousands of columns, making any sense of this display would be a hopeless task. 
·        For high velocity datasets – e.g., real-time, Internet-based data sources – the data volume is determined by the observation time: at a fixed rate, the longer you observe, the more you collect.  If you are attempting to generate a real-time characterization that keeps up with this input data rate, you face a fundamental trade-off between exploiting richer datasets acquired over longer observation periods, and the longer computation times required to process those datasets, making you less likely to keep up with the input data rate. 
·        For high-variety datasets, a key challenge lies in finding useful ways to combine very different data sources into something amenable to a common analysis (e.g., combining images, text, and numerical data into a single joint analysis framework).

One practical corollary to these observations is the need for a computer-based data reduction process or “data funnel” that matches the volume, velocity, and/or variety of the original data sources with the ultimate needs of the organization.  In large organizations, this data funnel generally involves a mix of different technologies and people.  While it is not a complete characterization, some of these differences are evident from the primary software platforms used in the different stages of this data funnel: languages like HTML for dealing with web-based data sources; typically, some variant of SQL for dealing with large databases; a package like R for complex quantitative analysis; and often something like Microsoft Word, Excel, or PowerPoint delivers the final results.  In addition, to help coordinate some of these tasks, there are likely to be scripts, either in an operating system like UNIX or in a platform-independent scripting language like perl or Python.

An important point omitted from all three articles is that there are at least two distinct application areas for Big Data:

1.      The class of “production applications,” which were discussed in these articles and illustrated with examples like the un-named U.S. airline described by McAfee and Brynjolfsson that adopted a vendor-supplied procedure to obtain better estimates of flight arrival times, improving their ability to schedule ground crews and saving several million dollars per year at each airport.  Similarly, the article by Barton and Court described a shipping company (again, un-named) that used real-time weather forecast data and shipping port status data, developing an automated system to improve the on-time performance of its fleet.  Examples like these describe automated systems put in place to continuously exploit a large but fixed data source. 
2.      The exploitation of Big Data for “one-off” analyses: a question is posed, and the data science team scrambles to find an answer.  This use is not represented by any of the examples described in these articles.  In fact, this second type of application overlaps a lot with the development process required to create a production application, although the end results are very different.  In particular, the end result of a one-off analysis is a single set of results, ultimately summarized to address the question originally posed.  In contrast, a production application requires continuing support and often has to meet challenging interface requirements between the IT systems that collect and preprocess the Big Data sources and those that are already in use by the end-users of the tool (e.g., a Hadoop cluster running in a UNIX environment versus periodic reports generated either automatically or on demand from a Microsoft Access database of summary information).

A key point of Davenport and Patil’s article is that data science involves more than just the analysis of data: it is also necessary to identify data sources, acquire what is needed from them, re-structure the results into a form amenable to analysis, clean them up, and in the end, present the analytical results in a useable form.  In fact, the subtitle of their article is “Meet the people who can coax treasure out of messy, unstructured data,” and this statement forms the core of the article’s working definition for the term “data scientist.” (The authors indicate that the term was coined in 2008 by D.J. Patil, who holds a position with that title at Greylock Partners.)  Also, two particularly interesting tidbits from this article were the authors’ suggestion that a good place to find data scientists is at R User Groups, and their description of R as “an open-source statistical tool favored by data scientists.”

Davenport and Patil emphasize the difference between structured and unstructured data, especially relevant to the R community since most of R’s procedures are designed to work with the structured data types discussed in Chapter 2 of Exploring Data in Engineering, the Sciences and Medicine: continuous, integer, nominal, ordinal, and binary.  More specifically, note that these variable types can all be included in dataframes, the data object type that is best supported by R’s vast and expanding collection of add-on packages.  Certainly, there is some support for other data types, and the level of this support is growing – the tm package and a variety of other related packages support the analysis of text data, the twitteR package provides support for analyzing Twitter tweets, and the scrapeR package supports web scraping – but the acquisition and reformatting of unstructured data sources is not R’s primary strength.  Yet it is a key component of data science, as Davenport and Patil emphasize:

“A quantitative analyst can be great at analyzing data but not at subduing a mass of unstructured data and getting it into a form in which it can be analyzed.  A data management expert might be great at generating and organizing data in structured form but not at turning unstructured data into structured data – and also not at actually analyzing the data.”


To better understand the distinction between the quantitative analyst and the data scientist implied by this quote, consider mathematician George Polya’s book, How To Solve It.  Originally published in 1945 and most recently re-issued in 2009, 24 years after the author’s death, this book is a very useful guide to solving math problems.  Polya’s basic approach consists of these four steps:

  1. Understand the problem;
  2. Formulate a plan for solving the problem;
  3. Carry out this plan;
  4. Check the results.

It is important to note what is not included in the scope of Polya’s four steps: Step 1 assumes a problem has been stated precisely, and Step 4 assumes the final result is well-defined, verifiable, and requires no further explanation.  While quantitative analysis problems are generally neither as precisely formulated as Polya’s method assumes, nor as clear in their ultimate objective, the class of “quantitative analyst” problems that Davenport and Patil assume in the previous quote correspond very roughly to problems of this type.  They begin with something like an R dataframe and a reasonably clear idea of what analytical results are desired; they end by summarizing the problem and presenting the results.  In contrast, the class of “data scientist” problems implied in Davenport and Patil’s quote comprises an expanded set of steps:

  1. Formulate the analytical problem: decide what kinds of questions could and should be asked in a way that is likely to yield useful, quantitative answers;
  2. Identify and evaluate potential data sources: what is available in-house, from the Internet, from vendors?  How complete are these data sources?  What would it cost to use them?  Are there significant constraints on how they can be used?  Are some of these data sources strongly incompatible?  If so, does it make sense to try to merge them approximately, or is it more reasonable to omit some of them?
  3. Acquire the data and transform it into a form that is useful for analysis; note that for sufficiently large data collections, part of this data will almost certainly be stored in some form of relational database, probably administered by others, and extracting what is needed for analysis will likely involve writing SQL queries against this database;
  4. Once the relevant collection of data has been acquired and prepared, examine the results carefully to make sure it meets analytical expectations: do the formats look right?  Are the ranges consistent with expectations?  Do the relationships seen between key variables seem to make sense?
  5. Do the analysis: by lumping all of the steps of data analysis into this simple statement, I am not attempting to minimize the effort involved, but rather emphasizing the other aspects of the Big Data analysis problem;
  6. After the analysis is complete, develop a concise summary of the results that clearly and succinctly states the motivating problem, highlights what has been assumed, what has been neglected and why, and gives the simplest useful summary of the data analysis results.  (Note that this will often involve several different summaries, with different levels of detail and/or emphases, intended for different audiences.)

Here, Steps 1 and 6 necessarily involve close interaction with the end users of the data analysis results, and they lie mostly outside the domain of R.  (Conversely, knowing what is available in R can be extremely useful in formulating analytical problems that are reasonable to solve, and the graphical procedures available in R can be extremely useful in putting together meaningful summaries of the results.)  The primary domain of R is Step 5: given a dataframe containing what are believed to be the relevant variables, we generate, validate, and refine the analytical results that will form the basis for the summary in Step 6.  Part of Step 4 also lies clearly within the domain of R: examining the data once it has been acquired to make sure it meets expectations.  In particular, once we have a dataset or a collection of datasets that can be converted easily into one or more R dataframes (e.g., csv files or possibly relational databases), a preliminary look at the data is greatly facilitated by the vast array of R procedures available for graphical characterizations (e.g., nonparametric density estimates, quantile-quantile plots, boxplots and variants like beanplots or bagplots, and much more); for constructing simple descriptive statistics (e.g., means, medians, and quantiles for numerical variables, tabulations of level counts for categorical variables, etc.); and for preliminary multivariate characterizations (e.g., scatter plots, classical and robust covariance ellipses, classical and robust principal component plots, etc.).   

The rest of this post discusses those parts of Steps 2, 3, and 4 above that fall outside the domain of R.  First, however, I have two observations.  My first observation is that because R is evolving fairly rapidly, some tasks which are “outside the domain of R” today may very well move “inside the domain of R” in the near future.  The packages twitteR and scrapeR, mentioned earlier, are cases in point, as are the continued improvements in packages that simplify the use of R with databases.  My second observation is that, just because something is possible within a particular software environment doesn’t make it a good idea.  A number of years ago, I attended a student talk given at an industry/university consortium.  The speaker set up and solved a simple linear program (i.e., he implemented the simplex algorithm to solve a simple linear optimization problem with linear constraints) using an industrial programmable controller.  At the time, programming those controllers was done via relay ladder logic, a diagrammatic approach used by electricians to configure complicated electrical wiring systems.  I left the talk impressed by the student’s skill, creativity and persistence, but I felt his efforts were extremely misguided.

Although it does not address every aspect of the “extra-R” components of Steps 2, 3, and 4 defined above – indeed, some of these aspects are so application-specific that no single book possibly could – Paul Murrell’s book Introduction to Data Technologies provides an excellent introduction to many of them.  (This book is also available as a free PDF file under creative commons.)   A point made in the book’s preface mirrors one in Davenport and Patil’s article:

“Data sets never pop into existence in a fully mature and reliable state; they must be cleaned and massaged into an appropriate form.  Just getting the data ready for analysis often represents a significant component of a research project.”

 Since Murrell is the developer of R’s grid graphics system that I have discussed in previous posts, it is no surprise that his book has an R-centric data analysis focus, but the book’s main emphasis is on the tasks of getting data from the outside world – specifically, from the Internet – into a dataframe suitable for analysis in R.  Murrell therefore gives detailed treatments of topics like HTML and Cascading Style Sheets (CSS) for working with Internet web pages; XML for storing and sharing data; and relational databases and their associated query language SQL for efficiently organizing data collections with complex structures.  Murrell states in his preface that these are things researchers – the target audience of the book – typically aren’t taught, but pick up in bits and pieces as they go along.  He adds:

            “A great deal of information on these topics already exists in books and on the internet; the value of this book is in collecting only the important subset of this information that is necessary to begin applying these technologies within a research setting.”

My one quibble with Murrell’s book is that he gives Python only a passing mention.  While I greatly prefer R to Python for data analysis, I have found Python to be more suitable than R for a variety of extra-analytical tasks, including preliminary explorations of the contents of weakly structured data sources, as well as certain important reformatting and preprocessing tasks.  Like R, Python is an open-source language, freely available for a wide variety of computing environments.  Also like R, Python has numerous add-on packages that support an enormous variety of computational tasks (over 25,000 at this writing).  In my day job in a SAS-centric environment, I commonly face tasks like the following: I need to create several nearly-identical SAS batch jobs, each to read a different SAS dataset that is selected on the basis of information contained in the file name; submit these jobs, each of which creates a CSV file; harvest and merge the resulting CSV files; run an R batch job to read this combined CSV file and perform computations on its contents.  I can do all of these things with a Python script, which also provides a detailed recipe of what I have done, so when I have to modify the procedure slightly and run it again six months later, I can quickly re-construct what I did before.  I have found Python to be better suited than R to tasks that involve a combination of automatically generating simple programs in another language, data file management, text processing, simple data manipulation, and batch job scheduling.

Despite my Python quibble, Murrell’s book represents an excellent first step toward filling the knowledge gap that Davenport and Patil note between quantitative analysts and data scientists; in fact, it is the only book I know addressing this gap.  If you are an R aficionado interested in positioning yourself for “the sexiest job of the 21st century,” Murrell’s book is an excellent place to start.