This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Saturday, April 13, 2013

Classification Tree Models

On March 26, I attended the Connecticut R Meetup in New Haven, which featured a talk by Illya Mowerman on decision trees in R.  I have gone to these Meetups before, and I have always found them to be interesting and informative.  Attendees range from those who are just starting to explore R to those who have multiple CRAN packages to their credit.  Each session is organized around a talk that focuses on some aspect of R and both the talks and the discussion that follow are typically lively and useful.  More information about the Connecticut R Meetup can be found here, and information about R Meetups in other areas can be found with a Google search on “R Meetup” with a location.


Mowerman’s talk focused on decision trees like the one shown in the figure above.  I give a somewhat more detailed discussion of this example below, but the basic idea is that the tree assigns every record in a dataset to a unique group, and a predicted response is generated for each group.  The basic decision tree models are either classification trees, appropriate to binary response variables, or regression tree models, appropriate to numeric response variables.  The figure above represents a classification tree model that predicts the probability that an automobile insurance policyholder will file a claim, based on a publicly available insurance dataset discussed further below.  Two advantages of classification tree models that Mowerman emphasized in his talk are, first, their simplicity of interpretation, and second, their ability to generate predictions from a mix of numerical and categorical covariates.  The above example illustrates both of these points – the decision tree is based on both categorical variables like veh_body (vehicle body type) and numerical variables like veh_value (the vehicle value in units of 10,000 Australian dollars). 

To interpret this tree, begin by reading from the top down, with the root node, numbered 1, which partitions the dataset into two subsets based on the variable agecat.  This variable is an integer-coded driver age group with six levels, ranging from 1 for the youngest drivers to 6 for the oldest drivers.  The root node splits the dataset into a younger driver subgroup (to the left, with agecat values 1 through 4) and an older driver subgroup (to the right, with agecat values 5 and 6).  Going to the right, node 11 splits the older driver group on the basis of vehicle value, with node 12 consisting of older drivers with veh_value less than or equal to 2.89, corresponding to vehicle values not more than 28,900 Australian dollars.  This subgroup contains 15,351 policy records, of which 5.3% file claims.  Similarly, node 13 corresponds to older drivers with vehicles valued more than 28,900 Australian dollars; this is a smaller group (1,932 policy records) with a higher fraction filing claims (8.3%).  Going to the left, we partition the younger driver group first on vehicle body type (node 2), then possibly a second time on driver age (node 4), possibly further on vehicle value (node 6) and finally again on vehicle body type (node 7).  The key point is that every record in the dataset is ultimately assigned to one of the seven terminal nodes of this tree (the “leaves,” numbered 3, 5, 8, 9, 10, 12, and 13).  The numbers associated with these nodes gives their size and the fraction of each group that files a claim, which may be viewed as an estimate of the conditional probability that a driver from each group will file a claim. 

Classification trees can be fit to data using a number of different algorithms, several of which are included in various R packages.  Mowerman’s talk focused primarily on the rpart package that is part of the standard R distribution and includes a procedure also named rpart, based on what is probably the best known algorithm for fitting classification and regression trees.  In addition, Mowerman also discussed the rpart.plot package, a very useful adjunct to rpart that provides a lot of flexibility in representing the resulting tree models graphically.  In particular, this package can be used to make much nicer plots than the one shown above; I haven't done that here largely because I have used a different tree fitting procedure, for reasons discussed in the next paragraph.  Another classification package that Mowerman mentioned in his talk is C50, which implements the C5.0 algorithm popular in the machine learning community.  The primary focus of this post is the ctree procedure in the party package, which was used to fit the tree shown here.

The reason I have used the ctree procedure instead of the rpart procedure is that for the dataset I consider here, the rpart procedure returns a trivial tree.  That is, when I attempt to fit a tree to the dataset using rpart with the response variable and covariates described below, the resulting “tree” assigns the entire dataset to a single node, declaring the overall fraction of positive responses in the dataset to be the common prediction for all records.  Applying the ctree procedure (the code is listed below) yields the nontrivial tree shown in the plot above.  The reason for the difference in these results is that the rpart and ctree procedures use different tree-fitting algorithms.  Very likely, the reason rpart has such difficulty with this dataset is its high degree of class imbalance: the positive response (i.e., “policy filed one or more claims”) occurs in only 4,264 of 67,856 data records, representing 6.81% of the total.  This imbalance problem is known to make classification difficult, enough so that it has become the focus of a specialized technical literature.  For a rather technical survey of this topic, refer to the paper “The Class Imbalance Problem: A Systematic Study,” by Japkowicz and Stephen (Intelligent Data Analysis, volume 6, number 5, November, 2002).  (So far, I have not been able to find a free version of this paper, but if you are interested in the topic, a search on this title turns up a number of other useful papers on the topic, although generally more specialized than this broad survey.)

To obtain the tree shown in the plot above, I used the following R commands:

> library(party)
> carFrame = read.csv("car.csv")
> Fmla = clm ~ veh_value + veh_body + veh_age + gender + area + agecat
> TreeModel = ctree(Fmla, data = carFrame)
> plot(TreeModel, type="simple")

 

The first line loads the party package to make the ctree procedure available for our use, and the second line reads the data file described below into the dataframe carFrame (note that this assumes the data file "car.csv" has been loaded into R's current working directory, which can be shown using the getwd() command).  The third line defines the formula that specifies the response as the binary variable clm (on the left side of "~") and the six other variables listed above as potential predictors, each separated by the "+" symbol.  The fourth line invokes the ctree procedure to fit the model and the last line displays the results.

The dataset I used here is car.csv, available from the website associated with the book Generalized Linear Models for Insurance Data, by Piet de Jong and Gillian Z. Heller.  As noted, this dataset contains 67,856 records, each characterizing an automobile insurance policy associated with one vehicle and one driver.  The dataset has 10 columns, each representing an observed value for a policy characteristic, including claim and loss information, vehicle characteristics, driver characteristics, and certain other variables (e.g., a categorical variable characterizing the type of region where the vehicle is driven).  The ctree model shown above was built to predict the binary response variable clm (where clm = 1 if one or more claims have been filed by the policyholder, and 0 otherwise), based on the following prediction variables:


-         the numeric variable veh_value;
-         veh_body, a categorical variable with 13 levels;
-         veh_age, an integer-coded categorical variable with 4 levels;
-         gender, a binary indicator of driver gender;
-         area, a categorical variable with six levels;
-         agecat, and integer-coded driver age variable.

The tree model shown above illustrates one of the points Mowerman made in his talk, that classification tree models can easily handle mixed covariate types: here, these covariates include one numeric variable (veh_value), one binary variable (gender), and four categorical variables.  In principle, tree models can be built using categorical variables with an arbitrary number of levels, but in practice procedures like ctree will fail if the number of levels becomes too large.

One of the tuning parameters in tree-fitting procedures like rpart and ctree is the minimum node size.  In his R Meetup talk, Mowerman showed that increasing this value from the default limit of 7 yielded simpler trees for the dataset he considered (the churn dataset from the C50 package).  Specifically, increasing the minimum node size parameter eliminated very small nodes from the tree, nodes whose practical utility was questionable due to their small size.  In my next post, I will show how a graphical tool for displaying binomial probability confidence limits can be used to help interpret classification tree results by explicitly displaying the prediction uncertainties.  The tool I use is GroupedBinomialPlot, one of those included in the ExploringData package that I am developing.

Finally, I should say in response to a question about my last post that, sadly, I do not yet have a beta test version of the ExploringData package.

Saturday, February 16, 2013

Finding outliers in numerical data

One of the topics emphasized in Exploring Data in Engineering, the Sciences and Medicine is the damage outliers can do to traditional data characterizations.  Consequently, one of the procedures to be included in the ExploringData package is FindOutliers, described in this post.  Given a vector of numeric values, this procedure supports four different methods for identifying possible outliers.

Before describing these methods, it is important to emphasize two points.  First, the detection of outliers in a sequence of numbers can be approached as a mathematical problem, but the interpretation of these data observations cannot.  That is, mathematical outlier detection procedures implement various rules for identifying points that appear to be anomalous with respect to the nominal behavior of the data, but they cannot explain why these points appear to be anomalous.  The second point is closely related to the first: one possible source of outliers in a data sequence is gross measurement errors or other data quality problems, but other sources of outliers are also possible so it is important to keep an open mind.  The terms “outlier” and “bad data” are not synonymous.  Chapter 7 of Exploring Data briefly describes two examples of outliers whose detection and interpretation led to a Nobel Prize and to a major new industrial product (Teflon, a registered trademark of the DuPont Company).

In the case of a single sequence of numbers, the typical approach to outlier detection is to first determine upper and lower limits on the nominal range of data variation, and then declare any point falling outside this range to be an outlier.  The FindOutliers procedure implements the following methods of computing the upper and lower limits of the nominal data range:

1.                  The ESD identifier, more commonly known as the “three-sigma edit rule,” well known but unreliable;
2.                  The Hampel identifier, a more reliable procedure based on the median and the MADM scale estimate;
3.                  The standard boxplot rule, based on the upper and lower quartiles of the data distribution;
4.                  An adjusted boxplot rule, based on the upper and lower quartiles, along with a robust skewness estimator called the medcouple.

The rest of this post briefly describes these four outlier detection rules and illustrates their application to two real data examples.

Without question, the most popular outlier detection rule is the ESD identifier (an abbreviation for “extreme Studentized deviation”), which declares any point more than t standard deviations from the mean to be an outlier, where the threshold value t is most commonly taken to be 3.  In other words, the nominal range used by this outlier detection procedure is the closed interval:

            [mean – t * SD, mean + t * SD]

where SD is the estimated standard deviation of the data sequence.  Motivation for the threshold choice t = 3 comes from the fact that for normally-distributed data, the probability of observing a value more than three standard deviations from the mean is only about 0.3%.  The problem with this outlier detection procedure is that both the mean and the standard deviation are themselves extremely sensitive to the presence of outliers in the data.  As a consequence, this procedure is likely to miss outliers that are present in the data.  In fact, it can be shown that for a contamination level greater than 10%, this rule fails completely, detecting no outliers at all, no matter how extreme they are (for details, see the discussion in Sec. 3.2.1 of Mining Imperfect Data).

The default option for the FindOutliers procedure is the Hampel identifier, which replaces the mean with the median and the standard deviation with the MAD (or MADM)  scale estimate.  The nominal data range for this outlier detection procedure is:

            [median – t * MAD, median + t * MAD]

As I have discussed in previous posts, the median and the MAD scale are much more resistant to the influence of outliers than the mean and standard deviation.  As a consequence, the Hampel identifier is generally more effective than the ESD identifier, although the Hampel identifier can be too aggressive, declaring too many points as outliers.  For detailed comparisons of the ESD and Hampel identifiers, refer to Sec. 7.5 of Exploring Data or Sec. 3.3 of Mining Imperfect Data.

The third method option for the FindOutliers procedure is the standard boxplot rule, based on the following nominal data range:

            [Q1 – c * IQD, Q3 + c * IQD]

where Q1 and Q3 represent the lower and upper quartiles, respectively, of the data distribution, and IQD = Q3 – Q1 is the interquartile distance, a measure of the spread of the data similar to the standard deviation.  The threshold parameter c is analogous to t in the first two outlier detection rules, and the value most commonly used in this outlier detection rule is c = 1.5.  This outlier detection rule is much less sensitive to the presence of outliers than the ESD identifier, but more sensitive than the Hampel identifier, and, like the Hampel identifier, it can be somewhat too aggressive, declaring nominal data observations to be outliers.  An advantage of the boxplot rule over these two alternatives is that, because it does not depend on an estimate of the “center” of the data (e.g., the mean in the ESD identifier or the median in the Hampel identifier), it is better suited to distributions that are moderately asymmetric.

The fourth method option is an extension of the standard boxplot rule, developed for data distributions that may be strongly asymmetric.  Basically, this procedure modifies the threshold parameter c by an amount that depends on the asymmetry of the distribution, modifying the upper threshold and the lower threshold differently.  Because the standard moment-based skewness estimator is extremely outlier-sensitive (for an illustration of this point, see the discussion in Sec. 7.1.1 of Exploring Data), it is necessary to use an outlier-resistant alternative to assess distributional asymmetry.  The asymmetry measure used here is the medcouple, a robust skewness measure available in the robustbase package in R and that I have discussed in a previous post (Boxplots and Beyond - Part II: Asymmetry ).   An important point about the medcouple is that it can be either positive or negative, depending on the direction of the distributional asymmetry; positive values arise more frequently in practice, but negative values can occur and the sign of the medcouple influences the definition of the asymmetric boxplot rule.  Specifically, for positive values of the medcouple MC, the adjusted boxplot rule’s nominal data range is:

            [Q1 – c * exp(a * MC) * IQD, Q3 + c * exp(b * MC) * IQD ]

while for negative medcouple values, the nominal data range is:

            [Q1 – c * exp(-b * MC) * IQD, Q3 + c * exp(-a * MC) * IQD ]

An important observation here is that for symmetric data distributions, MC should be zero, reducing the adjusted boxplot rule to the standard boxplot rule described above.  As in the standard boxplot rule, the threshold parameter is typically taken as c = 1.5, while the other two parameters are typically taken as a = -4 and b = 3.  In particular, these are the default values for the procedure adjboxStats in the robustbase package.



To illustrate how these outlier detection methods compare, the above pair of plots shows the results of applying all four of them to the makeup flow rate dataset discussed in Exploring Data (Sec. 7.1.2) in connection with the failure of the ESD identifier.  The points in these plots represent approximately 2,500 regularly sampled flow rate measurements from an industrial manufacturing process.  These measurements were taken over a long enough period of time to contain both periods of regular process operation – during which the measurements fluctuate around a value of approximately 400 – and periods when the process was shut down, was being shut down, or was being restarted, during which the measurements exhibit values near zero.  If we wish to characterize normal process operation, these shut down episodes represent outliers, and they correspond to about 20% of the data.  The left-hand plot shows the outlier detection limits for the ESD identifier (lighter, dashed lines) and the Hampel identifier (darker, dotted lines).  As discussed in Exploring Data, the ESD limits are wide enough that they do not detect any outliers in this data sequence, while the Hampel identifier nicely separates the data into normal operating data and outliers that correspond to the shut down episodes.  The right-hand plot shows the analogous results obtained with the standard boxplot method (lighter, dashed lines) and the adjusted boxplot method (darker, dotted lines).  Here, the standard boxplot rule gives results very similar to the Hampel identifier, again nicely separating the dataset into normal operating data and shut down episodes.  Unfortunately, the adjusted boxplot rule does not perform very well here, placing its lower nominal data limit in about the middle of the shut down data and its upper nominal data limit in about the middle of the normal operating data.  The likely cause of this behavior is that the relatively large fraction of lower tail outliers, which introduces a fairly strong negative skewness (the medcouple value for this example is -0.589).



The second example considered here is the industrial pressure data sequence shown in the above figure, in the same format as the previous figure.  This data sequence was discussed in Exploring Data (pp. 326-327) as a troublesome case because the two smallest values in this data sequence – near the right-hand end of the plots – appear to be downward outliers in a sequence with generally positive skewness (here, the medcouple value is 0.162).  As a consequence, neither the ESD identifier nor the Hampel identifier give fully satisfactory performance, in both cases declaring only one of these points as a downward outlier and arguably detecting too many upward outliers.  In fact, because the Hampel identifier is more aggressive here, it actually declares more upward outliers, making its performance worse for this example.  The right-hand plot in the above figure shows the outlier detection limits for the standard boxplot rule (lighter, dashed lines) and the adjusted boxplot rule (darker, dotted lines).  As in the previous example, the limits for the standard boxplot rule are almost the same as those for the Hampel identifier (the darker, dotted lines in the left-hand plot), but here the adjusted boxplot rule gives much better results, identifying both of the visually evident downward outliers and declaring far fewer points as upward outliers.

The primary point of this post has been to describe and demonstrate the outlier detection methods to be included in the FindOutliers procedure in the forthcoming ExploringData R package.  It should be clear from these results that, when it comes to outlier detection, “one size does not fit all” – method matters, and the choice of method requires a comparison of the results obtained by each one.  I have not included the code for the FindOutliers procedure here, but that will be the subject of my next post.

Saturday, December 15, 2012

Data Science, Data Analysis, R and Python

The October 2012 issue of Harvard Business Review prominently features the words “Getting Control of Big Data” on the cover, and the magazine includes these three related articles:

  1. “Big Data: The Management Revolution,” by Andrew McAfee and Erik Brynjolfsson, pages 61 – 68;
  2. “Data Scientist: The Sexiest Job of the 21st Century,” by Thomas H. Davenport and D.J. Patil, pages 70 – 76;
  3. “Making Advanced Analytics Work For You,” by Dominic Barton and David Court, pages 79 – 83.

All three provide food for thought; this post presents a brief summary of some of those thoughts.

One point made in the first article is that the “size” of a dataset – i.e., what constitutes “Big Data” – can be measured in at least three very different ways: volume, velocity, and variety.  All of these aspects of the Big Data characterization problem affect it, but differently:

·        For very large data volumes, one fundamental issue is the incomprehensibility of the raw data itself.  Even if you could display a data table with several million, billion, or trillion rows and hundreds or thousands of columns, making any sense of this display would be a hopeless task. 
·        For high velocity datasets – e.g., real-time, Internet-based data sources – the data volume is determined by the observation time: at a fixed rate, the longer you observe, the more you collect.  If you are attempting to generate a real-time characterization that keeps up with this input data rate, you face a fundamental trade-off between exploiting richer datasets acquired over longer observation periods, and the longer computation times required to process those datasets, making you less likely to keep up with the input data rate. 
·        For high-variety datasets, a key challenge lies in finding useful ways to combine very different data sources into something amenable to a common analysis (e.g., combining images, text, and numerical data into a single joint analysis framework).

One practical corollary to these observations is the need for a computer-based data reduction process or “data funnel” that matches the volume, velocity, and/or variety of the original data sources with the ultimate needs of the organization.  In large organizations, this data funnel generally involves a mix of different technologies and people.  While it is not a complete characterization, some of these differences are evident from the primary software platforms used in the different stages of this data funnel: languages like HTML for dealing with web-based data sources; typically, some variant of SQL for dealing with large databases; a package like R for complex quantitative analysis; and often something like Microsoft Word, Excel, or PowerPoint delivers the final results.  In addition, to help coordinate some of these tasks, there are likely to be scripts, either in an operating system like UNIX or in a platform-independent scripting language like perl or Python.

An important point omitted from all three articles is that there are at least two distinct application areas for Big Data:

1.      The class of “production applications,” which were discussed in these articles and illustrated with examples like the un-named U.S. airline described by McAfee and Brynjolfsson that adopted a vendor-supplied procedure to obtain better estimates of flight arrival times, improving their ability to schedule ground crews and saving several million dollars per year at each airport.  Similarly, the article by Barton and Court described a shipping company (again, un-named) that used real-time weather forecast data and shipping port status data, developing an automated system to improve the on-time performance of its fleet.  Examples like these describe automated systems put in place to continuously exploit a large but fixed data source. 
2.      The exploitation of Big Data for “one-off” analyses: a question is posed, and the data science team scrambles to find an answer.  This use is not represented by any of the examples described in these articles.  In fact, this second type of application overlaps a lot with the development process required to create a production application, although the end results are very different.  In particular, the end result of a one-off analysis is a single set of results, ultimately summarized to address the question originally posed.  In contrast, a production application requires continuing support and often has to meet challenging interface requirements between the IT systems that collect and preprocess the Big Data sources and those that are already in use by the end-users of the tool (e.g., a Hadoop cluster running in a UNIX environment versus periodic reports generated either automatically or on demand from a Microsoft Access database of summary information).

A key point of Davenport and Patil’s article is that data science involves more than just the analysis of data: it is also necessary to identify data sources, acquire what is needed from them, re-structure the results into a form amenable to analysis, clean them up, and in the end, present the analytical results in a useable form.  In fact, the subtitle of their article is “Meet the people who can coax treasure out of messy, unstructured data,” and this statement forms the core of the article’s working definition for the term “data scientist.” (The authors indicate that the term was coined in 2008 by D.J. Patil, who holds a position with that title at Greylock Partners.)  Also, two particularly interesting tidbits from this article were the authors’ suggestion that a good place to find data scientists is at R User Groups, and their description of R as “an open-source statistical tool favored by data scientists.”

Davenport and Patil emphasize the difference between structured and unstructured data, especially relevant to the R community since most of R’s procedures are designed to work with the structured data types discussed in Chapter 2 of Exploring Data in Engineering, the Sciences and Medicine: continuous, integer, nominal, ordinal, and binary.  More specifically, note that these variable types can all be included in dataframes, the data object type that is best supported by R’s vast and expanding collection of add-on packages.  Certainly, there is some support for other data types, and the level of this support is growing – the tm package and a variety of other related packages support the analysis of text data, the twitteR package provides support for analyzing Twitter tweets, and the scrapeR package supports web scraping – but the acquisition and reformatting of unstructured data sources is not R’s primary strength.  Yet it is a key component of data science, as Davenport and Patil emphasize:

“A quantitative analyst can be great at analyzing data but not at subduing a mass of unstructured data and getting it into a form in which it can be analyzed.  A data management expert might be great at generating and organizing data in structured form but not at turning unstructured data into structured data – and also not at actually analyzing the data.”


To better understand the distinction between the quantitative analyst and the data scientist implied by this quote, consider mathematician George Polya’s book, How To Solve It.  Originally published in 1945 and most recently re-issued in 2009, 24 years after the author’s death, this book is a very useful guide to solving math problems.  Polya’s basic approach consists of these four steps:

  1. Understand the problem;
  2. Formulate a plan for solving the problem;
  3. Carry out this plan;
  4. Check the results.

It is important to note what is not included in the scope of Polya’s four steps: Step 1 assumes a problem has been stated precisely, and Step 4 assumes the final result is well-defined, verifiable, and requires no further explanation.  While quantitative analysis problems are generally neither as precisely formulated as Polya’s method assumes, nor as clear in their ultimate objective, the class of “quantitative analyst” problems that Davenport and Patil assume in the previous quote correspond very roughly to problems of this type.  They begin with something like an R dataframe and a reasonably clear idea of what analytical results are desired; they end by summarizing the problem and presenting the results.  In contrast, the class of “data scientist” problems implied in Davenport and Patil’s quote comprises an expanded set of steps:

  1. Formulate the analytical problem: decide what kinds of questions could and should be asked in a way that is likely to yield useful, quantitative answers;
  2. Identify and evaluate potential data sources: what is available in-house, from the Internet, from vendors?  How complete are these data sources?  What would it cost to use them?  Are there significant constraints on how they can be used?  Are some of these data sources strongly incompatible?  If so, does it make sense to try to merge them approximately, or is it more reasonable to omit some of them?
  3. Acquire the data and transform it into a form that is useful for analysis; note that for sufficiently large data collections, part of this data will almost certainly be stored in some form of relational database, probably administered by others, and extracting what is needed for analysis will likely involve writing SQL queries against this database;
  4. Once the relevant collection of data has been acquired and prepared, examine the results carefully to make sure it meets analytical expectations: do the formats look right?  Are the ranges consistent with expectations?  Do the relationships seen between key variables seem to make sense?
  5. Do the analysis: by lumping all of the steps of data analysis into this simple statement, I am not attempting to minimize the effort involved, but rather emphasizing the other aspects of the Big Data analysis problem;
  6. After the analysis is complete, develop a concise summary of the results that clearly and succinctly states the motivating problem, highlights what has been assumed, what has been neglected and why, and gives the simplest useful summary of the data analysis results.  (Note that this will often involve several different summaries, with different levels of detail and/or emphases, intended for different audiences.)

Here, Steps 1 and 6 necessarily involve close interaction with the end users of the data analysis results, and they lie mostly outside the domain of R.  (Conversely, knowing what is available in R can be extremely useful in formulating analytical problems that are reasonable to solve, and the graphical procedures available in R can be extremely useful in putting together meaningful summaries of the results.)  The primary domain of R is Step 5: given a dataframe containing what are believed to be the relevant variables, we generate, validate, and refine the analytical results that will form the basis for the summary in Step 6.  Part of Step 4 also lies clearly within the domain of R: examining the data once it has been acquired to make sure it meets expectations.  In particular, once we have a dataset or a collection of datasets that can be converted easily into one or more R dataframes (e.g., csv files or possibly relational databases), a preliminary look at the data is greatly facilitated by the vast array of R procedures available for graphical characterizations (e.g., nonparametric density estimates, quantile-quantile plots, boxplots and variants like beanplots or bagplots, and much more); for constructing simple descriptive statistics (e.g., means, medians, and quantiles for numerical variables, tabulations of level counts for categorical variables, etc.); and for preliminary multivariate characterizations (e.g., scatter plots, classical and robust covariance ellipses, classical and robust principal component plots, etc.).   

The rest of this post discusses those parts of Steps 2, 3, and 4 above that fall outside the domain of R.  First, however, I have two observations.  My first observation is that because R is evolving fairly rapidly, some tasks which are “outside the domain of R” today may very well move “inside the domain of R” in the near future.  The packages twitteR and scrapeR, mentioned earlier, are cases in point, as are the continued improvements in packages that simplify the use of R with databases.  My second observation is that, just because something is possible within a particular software environment doesn’t make it a good idea.  A number of years ago, I attended a student talk given at an industry/university consortium.  The speaker set up and solved a simple linear program (i.e., he implemented the simplex algorithm to solve a simple linear optimization problem with linear constraints) using an industrial programmable controller.  At the time, programming those controllers was done via relay ladder logic, a diagrammatic approach used by electricians to configure complicated electrical wiring systems.  I left the talk impressed by the student’s skill, creativity and persistence, but I felt his efforts were extremely misguided.

Although it does not address every aspect of the “extra-R” components of Steps 2, 3, and 4 defined above – indeed, some of these aspects are so application-specific that no single book possibly could – Paul Murrell’s book Introduction to Data Technologies provides an excellent introduction to many of them.  (This book is also available as a free PDF file under creative commons.)   A point made in the book’s preface mirrors one in Davenport and Patil’s article:

“Data sets never pop into existence in a fully mature and reliable state; they must be cleaned and massaged into an appropriate form.  Just getting the data ready for analysis often represents a significant component of a research project.”

 Since Murrell is the developer of R’s grid graphics system that I have discussed in previous posts, it is no surprise that his book has an R-centric data analysis focus, but the book’s main emphasis is on the tasks of getting data from the outside world – specifically, from the Internet – into a dataframe suitable for analysis in R.  Murrell therefore gives detailed treatments of topics like HTML and Cascading Style Sheets (CSS) for working with Internet web pages; XML for storing and sharing data; and relational databases and their associated query language SQL for efficiently organizing data collections with complex structures.  Murrell states in his preface that these are things researchers – the target audience of the book – typically aren’t taught, but pick up in bits and pieces as they go along.  He adds:

            “A great deal of information on these topics already exists in books and on the internet; the value of this book is in collecting only the important subset of this information that is necessary to begin applying these technologies within a research setting.”

My one quibble with Murrell’s book is that he gives Python only a passing mention.  While I greatly prefer R to Python for data analysis, I have found Python to be more suitable than R for a variety of extra-analytical tasks, including preliminary explorations of the contents of weakly structured data sources, as well as certain important reformatting and preprocessing tasks.  Like R, Python is an open-source language, freely available for a wide variety of computing environments.  Also like R, Python has numerous add-on packages that support an enormous variety of computational tasks (over 25,000 at this writing).  In my day job in a SAS-centric environment, I commonly face tasks like the following: I need to create several nearly-identical SAS batch jobs, each to read a different SAS dataset that is selected on the basis of information contained in the file name; submit these jobs, each of which creates a CSV file; harvest and merge the resulting CSV files; run an R batch job to read this combined CSV file and perform computations on its contents.  I can do all of these things with a Python script, which also provides a detailed recipe of what I have done, so when I have to modify the procedure slightly and run it again six months later, I can quickly re-construct what I did before.  I have found Python to be better suited than R to tasks that involve a combination of automatically generating simple programs in another language, data file management, text processing, simple data manipulation, and batch job scheduling.

Despite my Python quibble, Murrell’s book represents an excellent first step toward filling the knowledge gap that Davenport and Patil note between quantitative analysts and data scientists; in fact, it is the only book I know addressing this gap.  If you are an R aficionado interested in positioning yourself for “the sexiest job of the 21st century,” Murrell’s book is an excellent place to start.

Saturday, October 27, 2012

Characterizing a new dataset

In my last post, I promised a further examination of the spacing measures I described there, and I still promise to do that, but I am changing the order of topics slightly.  So, instead of spacing measures, today’s post is about the DataframeSummary procedure to be included in the ExploringData package, which I also mentioned in my last post and promised to describe later.  My next post will be a special one on Big Data and Data Science, followed by another one about the DataframeSummary procedure (additional features of the procedure and the code used to implement it), after which I will come back to the spacing measures I discussed last time.

A task that arises frequently in exploratory data analysis is the initial characterization of a new dataset.  Ideally, everything we could want to know about a dataset should come from the accompanying metadata, but this is rarely the case.  As I discuss in Chapter 2 of Exploring Data in Engineering, the Sciences, and Medicine, metadata is the available “data about data” that (usually) accompanies a data source.  In practice, however, the available metadata is almost never as complete as we would like, and it is sometimes wrong in important respects.  This is particularly the case when numeric codes are used for missing data, without accompanying notes describing the coding.  An example, illustrating the consequent problem of disguised missing data is described in my paper The Problem of Disguised Missing Data.  (It should be noted that the original source of one of the problems described there – a comment in the UCI Machine Learning Repository header file for the Pima Indians diabetes dataset that there were no missing data records – has since been corrected.)

Once we have converted our data source into an R data frame (e.g., via the read.csv function for an external csv file), there are a number of useful tools to help us begin this characterization process.  Probably the most general is the str command, applicable to essentially any R object.  Applied to a dataframe, this command first tells us that the object is a dataframe, second, gives us the dimensions of the dataframe, and third, presents a brief summary of its contents, including the variable names, their type (specifically, the results of R’s class function), and the values of their first few observations.  As a specific example, if we apply this command to the rent dataset from the gamlss package, we obtain the following summary:

> str(rent)
'data.frame':   1969 obs. of  9 variables:
 $ R  : num  693 422 737 732 1295 ...
 $ Fl : num  50 54 70 50 55 59 46 94 93 65 ...
 $ A  : num  1972 1972 1972 1972 1893 ...
 $ Sp : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Sm : num  0 0 0 0 0 0 0 0 0 0 ...
 $ B  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ H  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
 $ L  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ loc: Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 2 2 ...
> 
This dataset summarizes a 1993 random sample of housing rental prices in Munich, including a number of important characteristics about each one (e.g., year of construction, floor space in square meters, etc.).  A more detailed description can be obtained via the command “help(rent)”.

The head command provides similar information to the str command, in slightly less detail (e.g., it doesn’t give us the variable types), but in a format that some will find more natural:

> head(rent)
       R Fl    A Sp Sm B H L loc
1  693.3 50 1972  0  0 0 0 0   2
2  422.0 54 1972  0  0 0 0 0   2
3  736.6 70 1972  0  0 0 0 0   2
4  732.2 50 1972  0  0 0 0 0   2
5 1295.1 55 1893  0  0 0 0 0   2
6 1195.9 59 1893  0  0 0 0 0   2
> 
 (An important difference between these representations is that str characterizes factor variables by their level number and not their level value: thus the first few observations of the factor B assume the first level of the factor, which is the value 0.  As a consequence, while it may appear that str is telling us that the first few records list the value 1 for the variable B while head is indicating a zero, this is not the case.  This is one reason that data analysts may prefer the head characterization.)

While the R data types for each variable can be useful to know – particularly in cases where it isn’t what we expect it to be, as when integers are coded as factors – this characterization doesn’t really tell us the whole story.  In particular, note that R has commands like “as.character” and “as.factor” that can easily convert numeric variables to character or factor data types.  Even beyond this, the range of inherent behaviors that numerically-coded data can exhibit cannot be fully described by a simple data type designation.  As a specific example, one of the variables in the rent dataframe is “A,” described in the metadata available from the help command as “year of construction.”  While this variable is coded as type “numeric,” in fact it takes integer values from 1890 to 1988, with some values in this range repeated many times and others absent.  This point is important, since analysis tools designed for continuous variables – especially outlier-resistant ones like medians and other rank-based methods – sometimes perform poorly in the face of data sequences with many repeated values (i.e., “ties,” which have zero probability for continuous data distributions).  In extreme cases, these techniques may fail completely, as in the case of the MADM scale estimate, discussed in Chapter 7 of Exploring Data.  This data characterization implodes if more than 50% of the data values are the same, returning the useless value zero in this case, independent of the values of all of the other data points.

These observations motivate the DataframeSummary procedure described here, to be included in the ExploringData package.  This function is called with the name of the dataframe to be characterized and an optional parameter Option, which can take any one of the following four values:

  1. “Brief” (the default value)
  2. “NumericOnly”
  3. “FactorOnly”
  4. “AllAsFactor”

In all cases, this function returns a summary dataframe with one row for each column in the dataframe to be characterized.  Like the str command, these results include the name of each variable and its type.  Under the default option “Brief,” this function also returns the following characteristics for each variable:

  • Levels = the number of distinct values the variable exhibits;
  • AvgFreq = the average number of records listing each value;
  • TopLevel = the most frequently occurring value;
  • TopFreq = the number of records listing this most frequent value;
  • TopPct = the percentage of records listing this most frequent value;
  • MissFreq = the number of missing or blank records;
  • MissPct = the percentage of missing or blank records.

For the rent dataframe, this function (under the default “Brief” option) gives the following summary:

> DataframeSummary(rent)
Variable Type Levels AvgFreq TopLevel TopFreq TopPct MissFreq MissPct
3        A    numeric      73   26.97         1957         551       27.98         0       0
6        B    factor           2  984.50           0          1925        97.77        0       0
2       Fl     numeric      91   21.64          60              71          3.61        0       0
7        H    factor            2  984.50          0          1580        80.24        0       0
8        L    factor            2  984.50          0           1808        91.82        0       0
9      loc    factor            3  656.33          2           1247        63.33        0       0
1        R    numeric   1762    1.12          900               7          0.36        0       0
5       Sm  numeric         2  984.50           0          1797         91.26        0       0
4       Sp   numeric         2  984.50           0          1419         72.07        0       0
> 

The variable names and types appear essentially as they do in the results obtained with the str function, and the numbers to the far left indicate the column numbers from the dataframe rent for each variable, since the variable names are listed alphabetically for convenience.  The “Levels” column of this summary dataframe gives the number of unique values for each variable, and it is clear that this can vary widely even within a given data type.  For example, the variable “R” (monthly rent in DM) exhibits 1,762 unique values in 1,969 data observations, so it is almost unique, while the variables “Sm” and “Sp” exhibit only two possible values, even though all three of these variables are of type “numeric.”  The AvgFreq column gives the average number of times each level should appear, assuming a uniform distribution over all possible values.  This number is included as a reference value for assessing the other frequencies (i.e., TopFreq for the most frequently occurring value and MissFreq for missing data values).  Thus, for the first variable, “A,” AvgFreq is 26.97, meaning that if all 73 possible values for this variable were equally represented, each one should occur about 27 times in the dataset.  The most frequently occurring level (TopLevel) is “1957,” which occurs 551 times, suggesting a highly nonuniform distribution of values for this variable.  In contrast, for the variable “R,” AvgFreq is 1.12, meaning that each value of this variable is almost unique.  The TopPct column gives the percentage of records in the dataset exhibiting the most frequent value for each record, which varies from 0.36% for the numeric variable “R” to 97.77% for the factor variable “B.”  It is interesting to note that this variable is of type “factor” but is coded as 0 or 1, while the variables “Sm” and “Sp” are also binary, coded as 0 or 1, but are of type “numeric.”  This illustrates the point noted above that the R data type is not always as informative as we might like it to be.  (This is not a criticism of R, but rather a caution about the fact that, in preparing data, we are free to choose many different representations, and the original logic behind the choice may not be obvious to all ultimate users of the data.)  In addition, comparing the available metadata for the variable “B” illustrates the point about metadata errors noted earlier: of the 1,969 data records, 1,925 have the value “0” (97.77%), while 44 have the value “1” (2.23%), but the information returned by the help command indicates exactly the opposite proportion of values: 1,925 should have the value “1” (indicating the presence of a bathroom), while 44 should have the value “0” (indicating the absence of a bathroom).  Since the interpretation of the variables that enter any analysis is important in explaining our final analytical results, it is useful to detect this type of mismatch between the data and the available metadata as early as possible.  Here, comparing the average rents for records with B = 1 (DM 424.95) against those with B = 0 (DM 820.72) suggests that the levels have been reversed relative to the metadata: the relatively few housing units without bathrooms are represented by B = 1, renting for less than the majority of those units, which have bathrooms and are represented by B = 0.  Finally, the last two columns of the above summary give the number of records with missing or blank values (MissFreq) and the corresponding percentage (MissPct); here, all records are complete so these numbers are zero.

In my next post on this topic, I will present results for the other three options of the DataframeSummary procedure, along with the code that implements it.  In all cases, the results include those generated by the “Brief” option just presented, but the difference between the other options lies first, in what additional characterizations are included, and second, in which subset of variables are included in the summary.  Specifically, for the rent dataframe, we obtain:

  • Under the “NumericOnly” option, a summary of the five numeric variables R, FL, A, Sp, and Sm results, giving characteristics that are appropriate to numeric data types, like the spacing measures described in my last post;
  • Under the “FactorOnly” option, a summary of the four factor variables B, H, L, and loc results, giving measures that are appropriate to categorical data types, like the normalized Shannon entropy measure discussed in several previous posts;
  • Under the “AllAsFactor” option, all variables in the dataframe are first converted to factors and then characterized using the same measures as in the “FactorOnly” option.

The advantage of the “AllAsFactor” option is that it characterizes all variables in the dataframe, but as I discussed in my last post, the characterization of numerical variables with measures like Shannon entropy is not always terribly useful.

Saturday, September 22, 2012

Spacing measures: heterogeneity in numerical distributions

Numerically-coded data sequences can exhibit a very wide range of distributional characteristics, including near-Gaussian (historically, the most popular working assumption), strongly asymmetric, light- or heavy-tailed, multi-modal, or discrete (e.g., count data).  In addition, numerically coded values can be effectively categorical, either ordered, or unordered.  A specific example that illustrates the range of distributional behavior often seen in a collection of numerical variables is the Boston housing dataframe (Boston) from the MASS package in R.  This dataframe includes 14 numerical variables that characterize 506 suburban housing tracts in the Boston area: 12 of these variables have class “numeric” and the remaining two have class “integer”.  The integer variable chas is in fact a binary flag, taking the value 1 if the tract bounds the Charles river and 0 otherwise, and the integer variable rad is described as “an index of accessibility to radial highways,’’ assuming one of nine values: the integers 1 through 8, and 24.  The other 12 variables assume anywhere between 26 unique values (for the zoning variable zn) to 504 unique values (for the per capita crime rate crim).  The figure below shows nonparametric density estimates for four of these variables: the per-capita crime rate (crim, upper left plot), the percentage of the population designated “lower status” by the researchers who provided the data (lstat, upper right plot), the average number of rooms per dwelling (rm, lower left plot), and the zoning variable (zn, lower right plot).  Comparing the appearances of these density estimates, considerable variability is evident: the distribution of crim is very asymmetric with an extremely heavy right tail, the distribution of lstat is also clearly asymmetric but far less so, while the distribution of rm appears to be almost Gaussian.  Finally, the distribution of zn appears to be tri-modal, mostly concentrated around zero, but with clear secondary peaks at around 20 and 80.



Each of these four plots also includes some additional information about the corresponding variable: three vertical reference lines at the mean (the solid line) and the mean offset by plus or minus three standard deviations (the dotted lines), and the value of the normalized Shannon entropy, listed in the title of each plot.  This normalized entropy value is discussed in detail in Chapter 3 of Exploring Data in Engineering, the Sciences, and Medicine and in two of my previous posts (April 3, 2011 and May 21, 2011), and it forms the basis for the spacing measure described below.  First, however, the reason for including the three vertical reference lines on the density plots is to illustrate that, while popular “Gaussian expectations” for data are approximately met for some numerical variables (the rm variable is a case in point here), often these expectations are violated so much that they are useless.  Specifically, note that under approximately Gaussian working assumptions, most of the observed values for the data sequence should fall between the two dotted reference lines, which should correspond approximately to the smallest and largest values seen in the dataset.  This description is reasonably accurate for the variable rm, and the upper limit appears fairly reasonable for the variable lstat, but the lower limit is substantially negative here, which is not reasonable for this variable since it is defined as a percentage.  These reference lines appear even more divergent from the general shapes of the distributions for the crim and zn data, where again, the lower reference lines are substantially negative, infeasible values for both of these variables.

The reason the reference values defined by these lines are not particularly representative is the extremely heterogeneous nature of the data distributions, particularly for the variables crim – where the distribution exhibits a very long right tail – and zn – where the distribution exhibits multiple modes.  For categorical variables, distributional heterogeneity can be assessed by measures like the normalized Shannon entropy, which varies between 0 and 1, taking the value zero when all levels of the variable are equally represented, and taking the value 1 when only one of several possible values are present.  This measure is easily computed and, while it is intended for use with categorical variables, the procedures used to compute it will return results for numerical variables as well.  These values are shown in the figure captions of each of the above four plots, and it is clear from these results that the Shannon measure does not give a reliable indication of distributional heterogeneity here.  In particular, note that the Shannon measure for the crim variable is zero to three decimal places, suggesting a very homogeneous distribution, while the variables lstat and rm – both arguably less heterogeneous than crim – exhibit slightly larger values of 0.006 and 0.007, respectively.  Further, the variable zn, whose density estimate resembles that of crim more than that of either of the other two variables, exhibits the much larger Shannon entropy value of 0.585.

The basic difficulty here is that all observations of a continuously distributed random variable should be unique.  The normalized Shannon entropy – along with the other heterogeneity measures discussed in Chapter 3 of Exploring Data – effectively treat variables as categorical, returning a value that is computed from the fractions of total observations assigned to each possible value for the variable.  Thus, for an ideal continuously-distributed variable, every possible value appears once and only once, so these fractions should be 1/N for each of the N distinct values observed for the variable.  This means that the normalized Shannon measure – along with all of the alternative measures just noted – should be identically zero for this case, regardless of whether the continuous distribution in question is Gaussian, Cauchy, Pareto, uniform, or anything else.  In fact, the crim variable considered here almost meets this ideal requirement: in 506 observations, crim exhibits 504 unique values, which is why its normalized Shannon entropy value is zero to three significant figures.  In marked contrast, the variable zn exhibits only 26 distinct values, meaning that each of these values occurs, on average, just over 19 times.  However, this average behavior is not representative of the data in this case, since the smallest possible value (0) occurs 372 times, while the largest possible value (100) occurs only once.  It is because of the discrete character of this distribution that the normalized Shannon entropy is much larger here, accurately reflecting the pronounced distributional heterogeneity of this variable.

Taken together, these observations suggest a simple extension of the normalized Shannon entropy that can give us a more adequate characterization of distributional differences for numerical variables.  Specifically, the idea is this: begin by dividing the total range of a numerical variable x into M equal intervals.  Then, count the number of observations that fall into each of these intervals and divide by the total number of observations N to obtain the fraction of observations falling into each group.  By doing this, we have effectively converted the original numerical variable into an M-level categorical variable, to which we can apply heterogeneity measures like the normalized Shannon entropy.  The four plots below illustrate this basic idea for the four Boston housing variables considered above.  Specifically, each plot shows the fraction of observations falling into each of 10 equally spaced intervals, spanning the range from the smallest observed value of the variable to the largest.



As a specific example, consider the results shown in the upper left plot for the variable crim, which varies from a minimum of 0.00632 to a maximum of 89.0.  Almost 87% of the observations fall into the smallest 10% of this range, from 0.00632 to 8.9, while the next two groups account for almost all of the remaining observations.  In fact, none of the other groups (4 through 10) account for more than 1% of the observations, and one of these groups – group 7 – is completely empty.  Computing the normalized Shannon entropy from this ten-level categorical variable yields 0.767, as indicated in the title of the upper left plot.  In contrast, the corresponding plot for the lstat variable, shown in the upper right, is much more uniform, with the first five groups exhibiting roughly the same fractional occupation.  As a consequence, the normalized Shannon entropy for this grouped variable is much smaller than that for the more heterogeneously distributed crim variable: 0.138 versus 0.767.  Because the distribution is more sharply peaked for the rm variable than for lstat, the occupation fractions for the grouped version of this variable (lower left plot) are less homogeneous, and the normalized Shannon entropy is correspondingly larger, at 0.272.  Finally, for the zn variable (lower right plot), the grouped distribution appears similar to that for the crim variable, and the normalized Shannon entropy values are also similar: 0.525 versus 0.767. 

The key point here is that, in contrast to the normalized Shannon entropy applied directly to the numerical variables in the Boston dataframe, grouping these values into 10 equally-spaced intervals and then computing the normalized Shannon entropy gives a number that seems to be more consistent with the distributional differences between these variables that can be seen clearly in their density plots.  Motivation for this numerical measure (i.e., why not just look at the density plots?) comes from the fact that we are sometimes faced with the task of characterizing a new dataset that we have not seen before.  While we can – and should – examine graphical representations of these variables, in cases where we have many such variables, it is desirable to have a few, easily computed numerical measures to use as screening tools, guiding us in deciding which variables to look at first, and which techniques to apply to them.  The spacing measure described here – i.e., the normalized Shannon entropy measure applied to a grouped version of the numerical variable – appears to be a potentially useful measure for this type of preliminary data characterization.  For this reason, I am including it – along with a few other numerical characterizations – in the DataFrameSummary procedure I am implementing as part of the ExploringData package, which I will describe in a later post.  Next time, however, I will explore two obvious extensions of the procedure described here: different choices of the heterogeneity measure, and different choices of the number of grouping levels.  In particular, as I have shown in previous posts on interestingness measures, the normalized Bray, Gini, and Simpson measures all behave somewhat differently than the Shannon measure considered here, raising the question of which one would be most effective in this application.  In addition, the choice of 10 grouping levels considered here was arbitrary, and it is by no means clear that this choice is the best one.  In my next post, I will explore how sensitive the Boston housing results are to changes in these two key design parameters.

Finally, it is worth saying something about how the grouping used here was implemented.  The R code listed below is the function I used to convert a numerical variable x into the grouped variable from which I computed the normalized Shannon entropy.  The three key components of this function are the classIntervals function from the R package classInt (which must be loaded before use; hence, the “library(classInt)” statement at the beginning of the function), and the cut and table functions from base R.  The classIntervals function generates a two-element list with components var, which contains the original observations, and brks, which contains the M+1 boundary values for the M groups to be generated.  Note that the style = “equal” argument is important here, since we want M equal-width groups.  The cut function then takes these results and converts them into an M-level categorical variable, assigning each original data value to the interval into which it falls.  The table function counts the number of times each of the M possible levels occurs for this categorical variable.  Dividing this vector by the sum of all entries then gives the fraction of observations falling into each group.  Plotting the results obtained from this function and reformatting the results slightly yields the four plots shown in the second figure above, and applying the shannon.proc procedure available from the OUP companion website for Exploring Data yields the Shannon entropy values listed in the figure titles.


UniformSpacingFunction <- function(x, nLvls = 10){
  #
  library(classInt)
  #
  xsum = classIntervals(x,n = nLvls, style="equal")
  xcut = cut(xsum$var, breaks = xsum$brks, include.lowest = TRUE)
  xtbl = table(xcut)
  pvec = xtbl/sum(xtbl)
  pvec
}