ExploringDataBlog

A question of model uncertainty

2014-03-09T11:13:00.000-07:00

It has been several months since my last post on classification tree models, because two things have been consuming all of my spare time. The first is that I taught a night class for the University of Connecticut’s Graduate School of Business, introducing R to students with little or no prior exposure to either R or programming. My hope is that the students learned something useful – I can say with certainty that I did – but preparing for the class and teaching it took a lot of time. The other activity, which has taken essentially all of my time since the class ended, is the completion of a book on nonlinear digital filtering using Python, joint work with my colleague Moncef Gabbouj of the Tampere University of Technology in Tampere, Finland. I will have more to say about both of these activities in the future, but for now I wanted to respond to a question raised about my last post.

Specifically, Professor Frank Harrell, the developer of the extremely useful Hmisc package, asked the following:

How did you take into account model uncertainty? The uncertainty resulting from data mining to find nodes and thresholds for continuous predictors has a massive impact on confidence intervals for estimates from recursive partitioning.

The short answer is that model uncertainty was not accounted for in the results I presented last time, primarily because – as Professor Harrell’s comments indicate – this is a complicated issue for tree-based models. The primary objective of this post and the next few is to discuss this issue.

So first, what exactly is model uncertainty? Any time we fit an empirical model to data, the results we obtain inherit some of the uncertainty present in the data. For the specific example of linear regression models, the magnitude of this uncertainty is partially characterized by the standard errors included in the results returned by R’s summary() function. This magnitude depends on both the uncertainty inherent in the data and the algorithm we use to fit the model. Sometimes – and classification tree models are a case in point – this uncertainty is not restricted to variations in the values of a fixed set of parameters, but it can manifest itself in substantial structural variations. That is, if we fit classification tree models to two similar but not identical datasets, the results may differ in the number of terminal nodes, the depths of these terminal nodes, the variables that determine the path to each one, and the values of these variables that determine the split at each intermediate node. This is the issue Professor Harrell raised in his comments, and the primary point of this post is to present some simple examples to illustrate its nature and severity.

In addition, this post has two other objectives. The first is to make amends for a very bad practice demonstrated in my last two posts. Specifically, the classification tree models described there were fit to a relatively large dataset and then evaluated with respect to that same dataset. This is bad practice because it can lead to overfitting, a problem that I will discuss in detail in my next post. (For a simple example that illustrates this problem, see the discussion in Section 1.5.3 of Exploring Data in Engineering, the Sciences, and Medicine.) In the machine learning community, this issue is typically addressed by splitting the original dataset randomly into three parts: a training subset (Tr) used for model-fitting, a validation subset (V) used for intermediate modeling decisions (e.g., which variables to include in the model), and a test subset (Te) used for final model evaluation. This approach is described in Section 7.2 of The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, who suggest 50% training, 25% validation, and 25% test as a typical choice.
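The three-way random split described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the function name SplitFrame, the seed value, and the toy dataframe are all hypothetical, and the 50/25/25 proportions are simply the typical choice quoted from Hastie, Tibshirani, and Friedman.

```r
# Hypothetical helper: randomly partition the rows of a dataframe into
# training (Tr), validation (V), and test (Te) subsets.
SplitFrame <- function(DF, iseed = 101, fracTr = 0.50, fracV = 0.25){
  set.seed(iseed)
  n <- nrow(DF)
  shuffled <- sample(n)                  # random permutation of 1..n
  nTr <- floor(fracTr * n)
  nV <- floor(fracV * n)
  idx <- list(Tr = shuffled[seq_len(nTr)],
              V = shuffled[nTr + seq_len(nV)],
              Te = shuffled[nTr + nV + seq_len(n - nTr - nV)])  # remainder
  lapply(idx, function(i) DF[i, , drop = FALSE])
}

# Toy stand-in for the insurance data (NOT the real car.csv)
set.seed(33)
toy <- data.frame(id = 1:1000, clm = rbinom(1000, 1, 0.07))
parts <- SplitFrame(toy)
sapply(parts, nrow)   # Tr = 500, V = 250, Te = 250
```

Because the three index sets come from a single permutation, every record lands in exactly one subset, so the validation and test data are never seen during model fitting.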

The other point of this post is to say something about the different roles of model uncertainty and data uncertainty in the practice of predictive modeling. I will say a little more at the end, but whether we are considering business applications like predicting customer behavior or industrial process control applications to predict the influence of changes in control valve settings, the basic predictive modeling process consists of three steps: build a prediction model; fix (i.e., “finalize”) this model; and apply it to generate predictions from data not seen in the model-building process. In these applications, model uncertainty plays an important role in the model development process, but once we have fixed the model, we have eliminated this uncertainty by fiat. Uncertainty remains an important issue in these applications, but the source of this uncertainty is in the data from which the model generates its predictions and not in the model itself once we have fixed it. Conversely, as George Box famously said, “all models are wrong, but some are useful,” and this point is crucial here: if the model uncertainty is great enough, it may be difficult or impossible to select a fixed model that is good enough to be useful in practice.

Returning to the topic of uncertainty in tree-based models, the above plot is a graphical representation of a classification tree model repeated from my previous two posts. This model was fit using the ctree procedure in the R package party, taking all optional parameters at their default values. As before, the dataset used to generate this model was the Australian vehicle insurance dataset car.csv, obtained from the website associated with the book Generalized Linear Models for Insurance Data, by Piet de Jong and Gillian Z. Heller. This model – and all of the others considered in this post – was fit using the same formula as before:

Fmla = clm ~ veh_value + veh_body + veh_age + gender + area + agecat

Each record in this dataset describes a single-vehicle, single-driver insurance policy, and clm is a binary response variable taking the value 1 if the policy filed one or more claims during the observation period and 0 otherwise. The other variables (on the right side of “~”) represent covariates that are either numeric (veh_value, the value of the vehicle) or categorical (all other variables, representing the vehicle body type, its age, the gender of the driver, the region where the vehicle is driven, and the driver’s age).

As I noted above, this model was fit to the entire dataset, a practice that is to be discouraged since it does not leave independent datasets of similar character for validation and testing. To address this problem, I randomly partitioned the original dataset into a 50% training subset, a 25% validation subset, and a 25% test subset as suggested by Hastie, Tibshirani and Friedman. The plot shown below represents the ctree model we obtain using exactly the same fitting procedure as before, but applied to the 50% random training subset instead of the complete dataset. Comparing these plots reveals substantial differences in the overall structure of the trees we obtain, strictly as a function of the data used to fit the models. In particular, while the original model has seven terminal nodes (i.e., the tree assigns every record to one of seven “leaves”), the model obtained from the training data subset has only four. Also, note that the branches in the original tree model are determined by the three variables agecat, veh_body, and veh_value, while the branches in the model built from the training subset are determined by the two variables agecat and veh_value only.

These differences illustrate the point noted above about the strong dependence of classification tree model structure on the data used in model-building. One could object that since the two datasets used here differ by a factor of two in size, the comparison isn’t exactly “apples-to-apples.” To see that this is not really the issue, consider the following two cases, based on the idea of bootstrap resampling. I won’t attempt a detailed discussion of the bootstrap approach here, but the basic idea is to assess the effects of data variability on a computational procedure by applying that procedure to multiple datasets, each obtained by sampling with replacement from a single source dataset. (For a comprehensive discussion of the bootstrap and some of its many applications, refer to the book Bootstrap Methods and their Application by A.C. Davison and D.V. Hinkley.) The essential motivation is that these datasets – called bootstrap resamples – all have the same essential statistical character as the original dataset. Thus, by comparing the results obtained from different bootstrap resamples, we can assess the variability in results for which exact statistical characterizations are either unknown or impractical to compute. Here, I use this idea to obtain datasets that should address the “apples-to-apples” concern raised above. More specifically, I start with the training data subset used to generate the model described in the previous figure, and I use R’s built-in sample() function to sample the rows of this dataframe with replacement. For an arbitrary dataframe DF, the code to do this is simple:

> set.seed(iseed)

> BootstrapIndex = sample(seq(1,nrow(DF),1),size=nrow(DF),replace=TRUE)

> ResampleFrame = DF[BootstrapIndex,]

The only variable in this procedure is the seed for the random sampling function sample(), which I have denoted as iseed. The extremely complicated figure below shows the ctree model obtained using the bootstrap resample generated from the training subset with iseed = 5.

Comparing this model with the previous one – both built from datasets of the same size, with the same general data characteristics – we see that the differences are even more dramatic than those between the original model (built from the complete dataset) and the second one (built from the training subset). Specifically, while the training subset model has four terminal nodes, determined by two variables, the bootstrap resample model uses all six of the variables included in the model formula, yielding a tree with 16 terminal nodes. But wait – sampling with replacement generates a significant number of duplicated records (for large datasets, only about 63.2% of the original records appear in each bootstrap resample, so the remaining 36.8% of the resample rows must be duplicates). Could this be the reason the results are so different? The following example shows that this is not the issue.

This plot shows the ctree model obtained from another bootstrap resample of the training data subset, obtained by specifying iseed = 6 instead of iseed = 5. This second bootstrap resample tree is much simpler, with only 7 terminal nodes instead of 16, and the branches of the tree are based on only four of the prediction variables instead of all six (specifically, neither gender nor veh_body appears in this model). While I don’t include all of the corresponding plots, I have also constructed and compared the ctree models obtained from the bootstrap resamples generated for all iseed values between 1 and 8, giving final models involving between four and six variables, with between 7 and 16 terminal nodes. In all cases, the datasets used in building these models were exactly the same size and had the same statistical character. The key point is that, as Professor Harrell noted in his comments, the structural variability of these classification tree models across similar datasets is substantial. In fact, this variability of individual tree-based models was one of the key motivations for developing the random forest method, which achieves substantially reduced model uncertainty by averaging over many randomly generated trees. Unfortunately, the price we pay for this improved model stability is a complete loss of interpretability. That is, looking at any one of the plots shown here, we can construct a simple description (e.g., node 12 in the above figure represents older drivers – agecat > 4 – with less expensive cars, and it has the lowest risk of any of the groups identified there). While we may obtain less variable predictions by averaging over a large number of these trees, such simple intuitive explanations of the resulting model are no longer possible.
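As a side check on the duplicate-record fraction quoted earlier for bootstrap resamples, the following base R sketch counts how many distinct rows appear in a resample for each seed from 1 to 8. The helper name UniqueFraction and the dataset size n are assumptions chosen for illustration; no insurance data is needed, since only the row indices matter.

```r
# Fraction of the original rows that appear at least once in a
# bootstrap resample of the same size (hypothetical helper).
UniqueFraction <- function(n, iseed){
  set.seed(iseed)
  BootstrapIndex <- sample(seq_len(n), size = n, replace = TRUE)
  length(unique(BootstrapIndex)) / n
}

n <- 30000   # illustrative size only, not the actual training subset size
fracs <- sapply(1:8, function(iseed) UniqueFraction(n, iseed))
round(range(fracs), 3)   # smallest and largest fraction over the 8 seeds
1 - exp(-1)              # large-sample limit, approximately 0.632
```

The limit arises because a given row is missed by all n draws with probability (1 - 1/n)^n, which approaches exp(-1) as n grows. The fraction is nearly the same for every seed, so duplication alone cannot explain trees that differ this much from one resample to the next.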

I noted earlier that predictive modeling applications typically involve a three-step strategy: fit the model, fix the model, and apply the model. I also argued that once we fix the model, we have eliminated model uncertainty when we apply it to new data. Unfortunately, if the inherent model uncertainty is large, as in the examples presented here, this greatly complicates the “fix the model” step. That is, if small variations in our training data subset can cause large changes in the structure of our prediction model, it is likely that very different models will exhibit similar performance when applied to our validation data subset. How, then, do we choose? I will examine this issue further in my next post when I discuss overfitting and the training/validation/test split in more detail.

Assessing the precision of classification tree model predictions

2013-08-06T19:14:00.000-07:00

My last post focused on the use of the ctree procedure in the R package party to build classification tree models. These models map each record in a dataset into one of M mutually exclusive groups, which are characterized by their average response. For responses coded as 0 or 1, this average may be regarded as an estimate of the probability that a record in the group exhibits a “positive response.” This interpretation leads to the idea discussed here, which is to replace this estimate with the size-corrected probability estimate I discussed in my previous post (Screening for predictive characteristics). Also, as discussed in that post, these estimates provide the basis for confidence intervals that quantify their precision, particularly for small groups.

In this post, the basis for these estimates is the R package PropCIs, which includes several procedures for estimating binomial probabilities and their confidence intervals, including an implementation of the method discussed in my previous post. Specifically, the procedure used here is addz2ci, discussed in Chapter 9 of Exploring Data in Engineering, the Sciences, and Medicine. As noted in both that discussion and in my previous post, this estimator is described in a paper by Brown, Cai and DasGupta in 2002, but the documentation for the PropCIs package cites an earlier paper by Agresti and Coull (“Approximate is better than exact for interval estimation of binomial proportions,” in The American Statistician, vol. 52, 1998, pp. 119-126). The essential idea is to modify the classical estimator, augmenting the counts of 0’s and 1’s in the data by z²/2, where z is the normal z-score associated with the significance level. As a specific example, z is approximately 1.96 for 95% confidence limits, so this modification adds approximately 2 to each count. In cases where both of these counts are large, this correction has negligible effect, so the size-corrected estimates and their corresponding confidence intervals are essentially identical with the classical results. In cases where either the sample is small or one of the possible responses is rare, these size-corrected results are much more reasonable than the classical results, which motivated their use both here and in my earlier post.

The above plot provides a simple illustration of the results that can be obtained using the addz2ci procedure, in a case where some groups are small enough for these size-corrections to matter. More specifically, this plot is based on the Australian vehicle insurance dataset that I discussed in my last post, and it characterizes the probability that a policy files a claim (i.e., that the variable clm has the value 1), for each of the 13 vehicle types included in the dataset. The heavy horizontal line segments in this plot represent the size-corrected claim probability estimates for each vehicle type, while the open triangles connected by dotted lines represent the upper and lower 95% confidence limits around these probability estimates, computed as described above. The solid horizontal line represents the overall claim probability for the dataset, to serve as a reference value for the individual subset results.

An important observation here is that although this dataset is reasonably large (there are a total of 67,856 records), the subgroups are quite heterogeneous in size, spanning the range from 27 records listing “RDSTR” as the vehicle type to 22,233 listing “SEDAN”. As a consequence, although the classical and size-adjusted claim probability estimates and their confidence intervals are essentially identical for the dataset overall, the extent of this agreement varies substantially across the different vehicle types. Taking the extremes, the results for the largest group (“SEDAN”) are, as with the dataset overall, almost identical: the classical estimate is 0.0665, while the size-adjusted estimate is 0.0664; the lower 95% confidence limit also differs by one in the fourth decimal place (classical 0.0631 versus size-corrected 0.0632), and the upper limit is identical to four decimal places, at 0.0697. In marked contrast, the classical and size-corrected estimates for the “RDSTR” group are 0.0741 versus 0.1271, the upper 95% confidence limits are 0.1729 versus 0.2447, and the lower confidence limits are -0.0247 versus 0.0096. Note that in this case, the lower classical confidence limit violates the requirement that probabilities must be non-negative, something that is not possible for the addz2ci confidence limits (specifically, negative values are less likely to arise, as in this example, and if they ever do arise, they are replaced with zero, the smallest feasible value for the lower confidence limit; similarly for upper confidence limits that exceed 1). As is often the case, the primary advantage of plotting these results is that it gives us a much more immediate indication of the relative precision of the probability estimates, particularly in cases like “RDSTR” where these confidence intervals are quite wide.
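The “RDSTR” comparison can be reproduced with a short base R sketch of the two estimators as they are described in this post. This is not the PropCIs source code: the function names ClassicalCI and AddZ2CI are hypothetical, and the claim count of 2 is an inference from the classical estimate (0.0741 times 27 records is almost exactly 2), not a number quoted from the dataset.

```r
# Classical estimate: p = x/n, with limits p -/+ z * sqrt(p(1-p)/n)
ClassicalCI <- function(x, n, conf.level = 0.95){
  z <- qnorm(1 - (1 - conf.level)/2)
  p <- x/n
  half <- z * sqrt(p * (1 - p)/n)
  c(estimate = p, lower = p - half, upper = p + half)
}

# Size-corrected estimate: add z^2/2 to the counts of 1's and 0's
# (so n grows by z^2), then clip the limits to the interval [0, 1]
AddZ2CI <- function(x, n, conf.level = 0.95){
  z <- qnorm(1 - (1 - conf.level)/2)
  nAdj <- n + z^2
  p <- (x + z^2/2)/nAdj
  half <- z * sqrt(p * (1 - p)/nAdj)
  c(estimate = p, lower = max(0, p - half), upper = min(1, p + half))
}

round(ClassicalCI(2, 27), 4)   # 0.0741, -0.0247, 0.1729
round(AddZ2CI(2, 27), 4)       # 0.1271,  0.0096, 0.2447
```

With 2 claims in 27 records, adding roughly 2 to each count nearly doubles the number of "claims," which is why the size-corrected estimate moves so far from the classical one for this group and hardly at all for “SEDAN”.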

The R code used to generate these results uses both the addz2ci procedure from the PropCIs package, and the summaryBy procedure from the doBy package. Specifically, the following function returns a dataframe with one row for each distinct value of the variable GroupingVar. The columns of this dataframe include this value, the total number of records listing this value, the number of these records for which the binary response variable BinVar is equal to 1, the lower confidence limit, the upper confidence limit, and the size-corrected estimate. The function is called with BinVar, GroupingVar, and the confidence level (the argument SigLevel), with a default of 95%. The first two lines of the function require the doBy and PropCIs packages. The third line constructs an internal dataframe, passed to the summaryBy function in the doBy package, which applies the length and sum functions to the subset of BinVar values defined by each level of GroupingVar, giving the total number of records and the total number of records with BinVar = 1. The main loop in this program applies the addz2ci function to these two numbers, for each value of GroupingVar, which returns a two-element list. The element $estimate gives the size-corrected probability estimate, and the element $conf.int is a vector of length 2 with the lower and upper confidence limits for this estimate. The rest of the program appends these values to the internal dataframe created by the summaryBy function, which is returned as the final result. The code listing follows:

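BinomialCIbyGroupFunction <- function(BinVar, GroupingVar, SigLevel = 0.95){
# Load the packages that provide summaryBy and addz2ci
require(doBy)
require(PropCIs)
# Count the records (length) and the positive responses (sum) in each group
IntFrame = data.frame(b = BinVar, g = as.factor(GroupingVar))
SumFrame = summaryBy(b ~ g, data = IntFrame, FUN=c(length,sum))
# Allocate storage for the estimates and their confidence limits
n = nrow(SumFrame)
EstVec = vector("numeric",n)
LowVec = vector("numeric",n)
UpVec = vector("numeric",n)
# Size-corrected estimate and confidence interval for each group
for (i in seq_len(n)){
    Rslt = addz2ci(x = SumFrame$b.sum[i],n = SumFrame$b.length[i],conf.level=SigLevel)
    EstVec[i] = Rslt$estimate
    CI = Rslt$conf.int
    LowVec[i] = CI[1]
    UpVec[i] = CI[2]
}
# Append the results to the summary dataframe and return it
SumFrame$LowerCI = LowVec
SumFrame$UpperCI = UpVec
SumFrame$Estimate = EstVec
return(SumFrame)
}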
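For readers without the doBy and PropCIs packages installed, the same kind of grouped summary can be sketched in base R. This is an alternative illustration, not the function above: tapply stands in for summaryBy, a vectorized formula (add z²/2 to each count, as described earlier) stands in for the loop over addz2ci, and the name GroupedAddZ2 and the toy data are hypothetical.

```r
# Base R sketch: per-group counts via tapply, then the size-corrected
# estimate and limits computed for all groups at once.
GroupedAddZ2 <- function(BinVar, GroupingVar, conf.level = 0.95){
  g <- as.factor(GroupingVar)
  n <- as.vector(tapply(BinVar, g, length))   # records per group
  x <- as.vector(tapply(BinVar, g, sum))      # positive responses per group
  z <- qnorm(1 - (1 - conf.level)/2)
  p <- (x + z^2/2)/(n + z^2)
  half <- z * sqrt(p * (1 - p)/(n + z^2))
  data.frame(g = levels(g), n = n, x = x,
             LowerCI = pmax(0, p - half),     # clip at the feasible limits
             UpperCI = pmin(1, p + half),
             Estimate = p)
}

# Toy data: one small group (2 of 27 positive) and one large group
toyGroup <- c(rep("A", 27), rep("B", 2000))
toyResp <- c(rep(c(1, 0), c(2, 25)), rep(c(1, 0), c(133, 1867)))
ToyResult <- GroupedAddZ2(toyResp, toyGroup)
ToyResult
```

The small group gets a much wider interval than the large one, which is the pattern the grouped plots in this post are designed to make visible.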

The binary response characterization tools just described can be applied to the results obtained from a classification tree model. Specifically, since a classification tree assigns every record to a unique terminal node, we can characterize the response across these nodes, treating the node numbers as the data groups, analogous to the vehicle body types in the previous example. As a specific illustration, the figure above gives a graphical representation of the ctree model considered in my previous post, built using the ctree command from the party package with the following formula:

Fmla = clm ~ veh_value + veh_body + veh_age + gender + area + agecat

Recall that this formula specifies we want a classification tree that predicts the binary claim indicator clm from the six variables on the right-hand side of the tilde, separated by “+” signs. Each of the terminal nodes in the resulting ctree model is characterized with a rectangular box in the above figure, giving the number of records in each group (n) and the average positive response (y), corresponding to the classical claim probability estimate. Note that the product ny corresponds to the total number of claims in each group, so these products and the group sizes together provide all of the information we need to compute the size-corrected claim probability estimates and their confidence limits for each terminal node. Alternatively, we can use the where method associated with the binary tree object that ctree returns to extract the terminal nodes associated with each observation. Then, we simply use the terminal node in place of vehicle body type in exactly the same analysis as before.
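As a small worked example of the product ny, the sketch below recovers approximate claim counts for two of the terminal nodes, using the group sizes and claim fractions quoted for this tree in the earlier post (node 12: 15,351 records, 5.3%; node 13: 1,932 records, 8.3%). The dataframe name NodeFrame is hypothetical, and because the plotted fractions are rounded, the recovered counts are approximate.

```r
# Node summaries as read from the tree plot:
# n = group size, y = claim fraction rounded to three decimals
NodeFrame <- data.frame(node = c(12, 13),
                        n = c(15351, 1932),
                        y = c(0.053, 0.083))

# Approximate number of claims per node (n * y, rounded to a whole count)
NodeFrame$claims <- round(NodeFrame$n * NodeFrame$y)
NodeFrame   # claims: 814 for node 12, 160 for node 13
```

These counts and group sizes are the x and n values that a binomial interval procedure needs, although working from the node assignment of each record avoids the rounding error introduced by reading y off the plot.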

The above figure shows these estimates, in the same format as the original plot of claim probability broken down by vehicle body type given earlier. Here, the range of confidence interval widths is much less extreme than before, but it is still clearly evident: the largest group (Node 10, with 23,315 records) exhibits the narrowest confidence interval, while the smallest groups (Node 9, with 1,361 records, and Node 13, with 1,932 records) exhibit the widest confidence intervals. Despite its small size, however, the smallest group does exhibit a significantly lower claim probability than any of the other groups defined by this classification tree model.

The primary point of this post has been to demonstrate that binomial confidence intervals can be used to help interpret and explain classification tree results, especially when displayed graphically as in the above figure. These displays provide a useful basis for comparing classification tree models obtained in different ways (e.g., by different algorithms like rpart and ctree, or by different tuning parameters for one specific algorithm). Comparisons of this sort will form the basis for my next post.

Classification Tree Models

2013-04-13T08:09:00.000-07:00

On March 26, I attended the Connecticut R Meetup in New Haven, which featured a talk by Illya Mowerman on decision trees in R. I have gone to these Meetups before, and I have always found them to be interesting and informative. Attendees range from those who are just starting to explore R to those who have multiple CRAN packages to their credit. Each session is organized around a talk that focuses on some aspect of R and both the talks and the discussion that follow are typically lively and useful. More information about the Connecticut R Meetup can be found here, and information about R Meetups in other areas can be found with a Google search on “R Meetup” with a location.

Mowerman’s talk focused on decision trees like the one shown in the figure above. I give a somewhat more detailed discussion of this example below, but the basic idea is that the tree assigns every record in a dataset to a unique group, and a predicted response is generated for each group. The basic decision tree models are either classification trees, appropriate to binary response variables, or regression tree models, appropriate to numeric response variables. The figure above represents a classification tree model that predicts the probability that an automobile insurance policyholder will file a claim, based on a publicly available insurance dataset discussed further below. Two advantages of classification tree models that Mowerman emphasized in his talk are, first, their simplicity of interpretation, and second, their ability to generate predictions from a mix of numerical and categorical covariates. The above example illustrates both of these points – the decision tree is based on both categorical variables like veh_body (vehicle body type) and numerical variables like veh_value (the vehicle value in units of 10,000 Australian dollars).

To interpret this tree, begin by reading from the top down, with the root node, numbered 1, which partitions the dataset into two subsets based on the variable agecat. This variable is an integer-coded driver age group with six levels, ranging from 1 for the youngest drivers to 6 for the oldest drivers. The root node splits the dataset into a younger driver subgroup (to the left, with agecat values 1 through 4) and an older driver subgroup (to the right, with agecat values 5 and 6). Going to the right, node 11 splits the older driver group on the basis of vehicle value, with node 12 consisting of older drivers with veh_value less than or equal to 2.89, corresponding to vehicle values not more than 28,900 Australian dollars. This subgroup contains 15,351 policy records, of which 5.3% file claims. Similarly, node 13 corresponds to older drivers with vehicles valued more than 28,900 Australian dollars; this is a smaller group (1,932 policy records) with a higher fraction filing claims (8.3%). Going to the left, we partition the younger driver group first on vehicle body type (node 2), then possibly a second time on driver age (node 4), possibly further on vehicle value (node 6) and finally again on vehicle body type (node 7). The key point is that every record in the dataset is ultimately assigned to one of the seven terminal nodes of this tree (the “leaves,” numbered 3, 5, 8, 9, 10, 12, and 13). The numbers associated with these nodes give their size and the fraction of each group that files a claim, which may be viewed as an estimate of the conditional probability that a driver from each group will file a claim.

Classification trees can be fit to data using a number of different algorithms, several of which are included in various R packages. Mowerman’s talk focused primarily on the rpart package that is part of the standard R distribution and includes a procedure also named rpart, based on what is probably the best known algorithm for fitting classification and regression trees. In addition, Mowerman also discussed the rpart.plot package, a very useful adjunct to rpart that provides a lot of flexibility in representing the resulting tree models graphically. In particular, this package can be used to make much nicer plots than the one shown above; I haven't done that here largely because I have used a different tree fitting procedure, for reasons discussed in the next paragraph. Another classification package that Mowerman mentioned in his talk is C50, which implements the C5.0 algorithm popular in the machine learning community. The primary focus of this post is the ctree procedure in the party package, which was used to fit the tree shown here.

The reason I have used the ctree procedure instead of the rpart procedure is that for the dataset I consider here, the rpart procedure returns a trivial tree. That is, when I attempt to fit a tree to the dataset using rpart with the response variable and covariates described below, the resulting “tree” assigns the entire dataset to a single node, declaring the overall fraction of positive responses in the dataset to be the common prediction for all records. Applying the ctree procedure (the code is listed below) yields the nontrivial tree shown in the plot above. The reason for the difference in these results is that the rpart and ctree procedures use different tree-fitting algorithms. Very likely, the reason rpart has such difficulty with this dataset is its high degree of class imbalance: the positive response (i.e., “policy filed one or more claims”) occurs in only 4,624 of 67,856 data records, representing 6.81% of the total. This imbalance problem is known to make classification difficult, enough so that it has become the focus of a specialized technical literature. For a rather technical survey of this topic, refer to the paper “The Class Imbalance Problem: A Systematic Study,” by Japkowicz and Stephen (Intelligent Data Analysis, volume 6, number 5, November, 2002). (So far, I have not been able to find a free version of this paper, but if you are interested in the topic, a search on this title turns up a number of other useful papers on the topic, although generally more specialized than this broad survey.)

To obtain the tree shown in the plot above, I used the following R commands:

> library(party)

> carFrame = read.csv("car.csv")

> Fmla = clm ~ veh_value + veh_body + veh_age + gender + area + agecat

> TreeModel = ctree(Fmla, data = carFrame)

> plot(TreeModel, type="simple")

The first line loads the party package to make the ctree procedure available for our use, and the second line reads the data file described below into the dataframe carFrame (note that this assumes the data file "car.csv" is located in R's current working directory, which can be shown using the getwd() command). The third line defines the formula that specifies the response as the binary variable clm (on the left side of "~") and the six other variables listed above as potential predictors, each separated by the "+" symbol. The fourth line invokes the ctree procedure to fit the model and the last line displays the results.

The dataset I used here is car.csv, available from the website associated with the book Generalized Linear Models for Insurance Data, by Piet de Jong and Gillian Z. Heller. As noted, this dataset contains 67,856 records, each characterizing an automobile insurance policy associated with one vehicle and one driver. The dataset has 10 columns, each representing an observed value for a policy characteristic, including claim and loss information, vehicle characteristics, driver characteristics, and certain other variables (e.g., a categorical variable characterizing the type of region where the vehicle is driven). The ctree model shown above was built to predict the binary response variable clm (where clm = 1 if one or more claims have been filed by the policyholder, and 0 otherwise), based on the following prediction variables:

-         the numeric variable veh_value;

-         veh_body, a categorical variable with 13 levels;

-         veh_age, an integer-coded categorical variable with 4 levels;

-         gender, a binary indicator of driver gender;

-         area, a categorical variable with six levels;

-         agecat, an integer-coded driver age variable.

The tree model shown above illustrates one of the points Mowerman made in his talk, that classification tree models can easily handle mixed covariate types: here, these covariates include one numeric variable (veh_value), one binary variable (gender), and four categorical variables. In principle, tree models can be built using categorical variables with an arbitrary number of levels, but in practice procedures like ctree will fail if the number of levels becomes too large.

One of the tuning parameters in tree-fitting procedures like rpart and ctree is the minimum node size. In his R Meetup talk, Mowerman showed that increasing this value from the default limit of 7 yielded simpler trees for the dataset he considered (the churn dataset from the C50 package). Specifically, increasing the minimum node size parameter eliminated very small nodes from the tree, nodes whose practical utility was questionable due to their small size. In my next post, I will show how a graphical tool for displaying binomial probability confidence limits can be used to help interpret classification tree results by explicitly displaying the prediction uncertainties. The tool I use is GroupedBinomialPlot, one of those included in the ExploringData package that I am developing.
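The effect of the minimum node size can be sketched with rpart, where this parameter is the minbucket argument of rpart.control (in the party package, ctree_control has an argument of the same name). The example below is only an illustration under stated assumptions: it uses the small kyphosis dataset that ships with rpart rather than the churn data from the talk, and the helper CountLeaves and the value 20 are hypothetical choices.

```r
library(rpart)   # recommended package included with standard R installations

# Number of terminal nodes in a fitted rpart tree (hypothetical helper)
CountLeaves <- function(fit) sum(fit$frame$var == "<leaf>")

# Same formula and data, fit with the default minimum node size
# and again with a larger one
DefaultFit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                    method = "class")
BigNodeFit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                    method = "class",
                    control = rpart.control(minbucket = 20))

c(default = CountLeaves(DefaultFit), minbucket20 = CountLeaves(BigNodeFit))

# Every terminal node in the second tree holds at least 20 records
LeafSizes <- BigNodeFit$frame$n[BigNodeFit$frame$var == "<leaf>"]
min(LeafSizes)
```

Raising the minimum node size forbids any split that would create a smaller group, so the very small terminal nodes whose practical utility is questionable cannot appear in the fitted tree.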

Finally, I should say in response to a question about my last post that, sadly, I do not yet have a beta test version of the ExploringData package.

Finding outliers in numerical data

2013-02-16T12:10:00.000-08:00

One of the topics emphasized in Exploring Data in Engineering, the Sciences, and Medicine is the damage outliers can do to traditional data characterizations. Consequently, one of the procedures to be included in the ExploringData package is FindOutliers, described in this post. Given a vector of numeric values, this procedure supports four different methods for identifying possible outliers.

Before describing these methods, it is important to emphasize two points. First, the detection of outliers in a sequence of numbers can be approached as a mathematical problem, but the interpretation of these data observations cannot. That is, mathematical outlier detection procedures implement various rules for identifying points that appear to be anomalous with respect to the nominal behavior of the data, but they cannot explain why these points appear to be anomalous. The second point is closely related to the first: one possible source of outliers in a data sequence is gross measurement errors or other data quality problems, but other sources of outliers are also possible so it is important to keep an open mind. The terms “outlier” and “bad data” are not synonymous. Chapter 7 of Exploring Data briefly describes two examples of outliers whose detection and interpretation led to a Nobel Prize and to a major new industrial product (Teflon, a registered trademark of the DuPont Company).

In the case of a single sequence of numbers, the typical approach to outlier detection is to first determine upper and lower limits on the nominal range of data variation, and then declare any point falling outside this range to be an outlier. The FindOutliers procedure implements the following methods of computing the upper and lower limits of the nominal data range:

1. The ESD identifier, more commonly known as the “three-sigma edit rule,” well known but unreliable;

2. The Hampel identifier, a more reliable procedure based on the median and the MADM scale estimate;

3. The standard boxplot rule, based on the upper and lower quartiles of the data distribution;

4. An adjusted boxplot rule, based on the upper and lower quartiles, along with a robust skewness estimator called the medcouple.

The rest of this post briefly describes these four outlier detection rules and illustrates their application to two real data examples.

Without question, the most popular outlier detection rule is the ESD identifier (an abbreviation for “extreme Studentized deviation”), which declares any point more than t standard deviations from the mean to be an outlier, where the threshold value t is most commonly taken to be 3. In other words, the nominal range used by this outlier detection procedure is the closed interval:

[mean – t * SD, mean + t * SD]

where SD is the estimated standard deviation of the data sequence. Motivation for the threshold choice t = 3 comes from the fact that for normally-distributed data, the probability of observing a value more than three standard deviations from the mean is only about 0.3%. The problem with this outlier detection procedure is that both the mean and the standard deviation are themselves extremely sensitive to the presence of outliers in the data. As a consequence, this procedure is likely to miss outliers that are present in the data. In fact, it can be shown that for a contamination level greater than 10%, this rule fails completely, detecting no outliers at all, no matter how extreme they are (for details, see the discussion in Sec. 3.2.1 of Mining Imperfect Data).
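The failure mode described above is easy to reproduce. The following is a minimal sketch, not the FindOutliers code: the function names are invented for illustration, and the data are simulated (values near 400 with a block of zeros) rather than taken from any real dataset.

```r
# Nominal data range for the ESD identifier: [mean - t * SD, mean + t * SD].
# Function names here are illustrative only.
esd_limits <- function(x, t = 3) {
  c(lower = mean(x) - t * sd(x), upper = mean(x) + t * sd(x))
}

# Count the points falling outside the ESD nominal range
count_esd_outliers <- function(x, t = 3) {
  lims <- esd_limits(x, t)
  sum(x < lims[["lower"]] | x > lims[["upper"]])
}

set.seed(1)
normal_part <- rnorm(95, mean = 400, sd = 5)  # simulated "normal operation" values

x5  <- c(normal_part, rep(0, 5))         # 5% contamination: five zeros
x20 <- c(normal_part[1:80], rep(0, 20))  # 20% contamination: twenty zeros

count_esd_outliers(x5)   # the zeros are flagged at low contamination
count_esd_outliers(x20)  # the inflated mean and SD hide every zero
```

At 20% contamination the zeros pull the mean down and inflate the standard deviation so much that the lower limit falls below zero, which is why nothing is flagged.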

The default option for the FindOutliers procedure is the Hampel identifier, which replaces the mean with the median and the standard deviation with the MAD (or MADM) scale estimate. The nominal data range for this outlier detection procedure is:

[median – t * MAD, median + t * MAD]

As I have discussed in previous posts, the median and the MAD scale are much more resistant to the influence of outliers than the mean and standard deviation. As a consequence, the Hampel identifier is generally more effective than the ESD identifier, although the Hampel identifier can be too aggressive, declaring too many points as outliers. For detailed comparisons of the ESD and Hampel identifiers, refer to Sec. 7.5 of Exploring Data or Sec. 3.3 of Mining Imperfect Data.
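A minimal sketch of the Hampel identifier follows, under two assumptions: the function name is invented for this example, and the scale estimate is base R's mad(), which by default multiplies the median absolute deviation by 1.4826 so that it estimates the standard deviation for Gaussian data. The test data are simulated: 80 values near 400 plus 20 zeros.

```r
# Nominal data range for the Hampel identifier: [median - t * MAD, median + t * MAD].
# mad() applies the default consistency constant 1.4826.
hampel_limits <- function(x, t = 3) {
  m <- median(x)
  s <- mad(x)
  c(lower = m - t * s, upper = m + t * s)
}

set.seed(1)
x <- c(rnorm(80, mean = 400, sd = 5), rep(0, 20))  # 20% of the values are zeros

lims <- hampel_limits(x)
flagged <- x < lims[["lower"]] | x > lims[["upper"]]

lims                 # limits stay close to the bulk of the data
sum(flagged)         # all 20 zeros are flagged, possibly with a few nominal points
```

Because the median and MAD are barely moved by the block of zeros, the limits remain tight around 400, which also illustrates why this rule can occasionally flag a few nominal points.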

The third method option for the FindOutliers procedure is the standard boxplot rule, based on the following nominal data range:

[Q1 – c * IQD, Q3 + c * IQD]

where Q1 and Q3 represent the lower and upper quartiles, respectively, of the data distribution, and IQD = Q3 – Q1 is the interquartile distance, a measure of the spread of the data similar to the standard deviation. The threshold parameter c is analogous to t in the first two outlier detection rules, and the value most commonly used in this outlier detection rule is c = 1.5. This outlier detection rule is much less sensitive to the presence of outliers than the ESD identifier, but more sensitive than the Hampel identifier, and, like the Hampel identifier, it can be somewhat too aggressive, declaring nominal data observations to be outliers. An advantage of the boxplot rule over these two alternatives is that, because it does not depend on an estimate of the “center” of the data (e.g., the mean in the ESD identifier or the median in the Hampel identifier), it is better suited to distributions that are moderately asymmetric.
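The boxplot rule can be sketched in a few lines. This is an illustration under stated assumptions: the function name is hypothetical, the threshold c from the text is called coef to avoid clashing with R's c() function, and the quartiles come from quantile() with its default definition, which can differ slightly from the hinges used by boxplot.stats().

```r
# Nominal data range for the standard boxplot rule: [Q1 - c * IQD, Q3 + c * IQD]
boxplot_limits <- function(x, coef = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqd <- q[2] - q[1]  # interquartile distance
  c(lower = q[1] - coef * iqd, upper = q[2] + coef * iqd)
}

x <- c(1:9, 50)  # small made-up sequence with one obvious upper outlier

lims <- boxplot_limits(x)
lims
x[x < lims[["lower"]] | x > lims[["upper"]]]  # points declared outliers
```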

The fourth method option is an extension of the standard boxplot rule, developed for data distributions that may be strongly asymmetric. Basically, this procedure modifies the threshold parameter c by an amount that depends on the asymmetry of the distribution, modifying the upper threshold and the lower threshold differently. Because the standard moment-based skewness estimator is extremely outlier-sensitive (for an illustration of this point, see the discussion in Sec. 7.1.1 of Exploring Data), it is necessary to use an outlier-resistant alternative to assess distributional asymmetry. The asymmetry measure used here is the medcouple, a robust skewness measure that is available in the robustbase package in R and that I have discussed in a previous post (Boxplots and Beyond - Part II: Asymmetry). An important point about the medcouple is that it can be either positive or negative, depending on the direction of the distributional asymmetry; positive values arise more frequently in practice, but negative values can occur and the sign of the medcouple influences the definition of the adjusted boxplot rule. Specifically, for positive values of the medcouple MC, the adjusted boxplot rule’s nominal data range is:

[Q1 – c * exp(a * MC) * IQD, Q3 + c * exp(b * MC) * IQD]

while for negative medcouple values, the nominal data range is:

[Q1 – c * exp(-b * MC) * IQD, Q3 + c * exp(-a * MC) * IQD]

An important observation here is that for symmetric data distributions, MC should be zero, reducing the adjusted boxplot rule to the standard boxplot rule described above. As in the standard boxplot rule, the threshold parameter is typically taken as c = 1.5, while the other two parameters are typically taken as a = -4 and b = 3. In particular, these are the default values for the procedure adjboxStats in the robustbase package.
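The two ranges above translate directly into code. In this sketch the function name is hypothetical and the medcouple is passed in as an argument so that the example needs only base R; in practice it could be computed with the mc() function from the robustbase package. The medcouple values used below are arbitrary illustrative numbers, not estimates for the toy data.

```r
# Adjusted boxplot limits, following the two nominal ranges given in the text.
# 'mc' is the medcouple, supplied by the caller; 'coef' plays the role of c.
adjusted_boxplot_limits <- function(x, mc, coef = 1.5, a = -4, b = 3) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqd <- q[2] - q[1]
  if (mc >= 0) {
    lower <- q[1] - coef * exp(a * mc) * iqd
    upper <- q[2] + coef * exp(b * mc) * iqd
  } else {
    lower <- q[1] - coef * exp(-b * mc) * iqd
    upper <- q[2] + coef * exp(-a * mc) * iqd
  }
  c(lower = lower, upper = upper)
}

x <- c(1:9, 50)  # made-up data

adjusted_boxplot_limits(x, mc = 0)     # reduces to the standard boxplot rule
adjusted_boxplot_limits(x, mc = 0.3)   # positive skew: wider upper limit, tighter lower limit
adjusted_boxplot_limits(x, mc = -0.3)  # negative skew: the adjustments are mirrored
```

With a = -4 and b = 3, a positive medcouple shrinks the lower fence by exp(-4 * MC) and stretches the upper fence by exp(3 * MC), so points in the long tail are less readily declared outliers.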

To illustrate how these outlier detection methods compare, the above pair of plots shows the results of applying all four of them to the makeup flow rate dataset discussed in Exploring Data (Sec. 7.1.2) in connection with the failure of the ESD identifier. The points in these plots represent approximately 2,500 regularly sampled flow rate measurements from an industrial manufacturing process. These measurements were taken over a long enough period of time to contain both periods of regular process operation – during which the measurements fluctuate around a value of approximately 400 – and periods when the process was shut down, was being shut down, or was being restarted, during which the measurements exhibit values near zero. If we wish to characterize normal process operation, these shut down episodes represent outliers, and they correspond to about 20% of the data. The left-hand plot shows the outlier detection limits for the ESD identifier (lighter, dashed lines) and the Hampel identifier (darker, dotted lines). As discussed in Exploring Data, the ESD limits are wide enough that they do not detect any outliers in this data sequence, while the Hampel identifier nicely separates the data into normal operating data and outliers that correspond to the shut down episodes. The right-hand plot shows the analogous results obtained with the standard boxplot method (lighter, dashed lines) and the adjusted boxplot method (darker, dotted lines). Here, the standard boxplot rule gives results very similar to the Hampel identifier, again nicely separating the dataset into normal operating data and shut down episodes. Unfortunately, the adjusted boxplot rule does not perform very well here, placing its lower nominal data limit in about the middle of the shut down data and its upper nominal data limit in about the middle of the normal operating data. 
The likely cause of this behavior is the relatively large fraction of lower-tail outliers, which introduces a fairly strong negative skewness (the medcouple value for this example is -0.589).

The second example considered here is the industrial pressure data sequence shown in the above figure, in the same format as the previous figure. This data sequence was discussed in Exploring Data (pp. 326-327) as a troublesome case because the two smallest values in this data sequence – near the right-hand end of the plots – appear to be downward outliers in a sequence with generally positive skewness (here, the medcouple value is 0.162). As a consequence, neither the ESD identifier nor the Hampel identifier gives fully satisfactory performance, in both cases declaring only one of these points as a downward outlier and arguably detecting too many upward outliers. In fact, because the Hampel identifier is more aggressive here, it actually declares more upward outliers, making its performance worse for this example. The right-hand plot in the above figure shows the outlier detection limits for the standard boxplot rule (lighter, dashed lines) and the adjusted boxplot rule (darker, dotted lines). As in the previous example, the limits for the standard boxplot rule are almost the same as those for the Hampel identifier (the darker, dotted lines in the left-hand plot), but here the adjusted boxplot rule gives much better results, identifying both of the visually evident downward outliers and declaring far fewer points as upward outliers.

The primary point of this post has been to describe and demonstrate the outlier detection methods to be included in the FindOutliers procedure in the forthcoming ExploringData R package. It should be clear from these results that, when it comes to outlier detection, “one size does not fit all” – method matters, and the choice of method requires a comparison of the results obtained by each one. I have not included the code for the FindOutliers procedure here, but that will be the subject of my next post.

Data Science, Data Analysis, R and Python

2012-12-15T13:32:00.000-08:00

The October 2012 issue of Harvard Business Review prominently features the words “Getting Control of Big Data” on the cover, and the magazine includes these three related articles:

“Big Data: The Management Revolution,” by Andrew McAfee and Erik Brynjolfsson, pages 61 – 68;

“Data Scientist: The Sexiest Job of the 21st Century,” by Thomas H. Davenport and D.J. Patil, pages 70 – 76;

“Making Advanced Analytics Work For You,” by Dominic Barton and David Court, pages 79 – 83.

All three provide food for thought; this post presents a brief summary of some of those thoughts.

One point made in the first article is that the “size” of a dataset – i.e., what constitutes “Big Data” – can be measured in at least three very different ways: volume, velocity, and variety. Each of these aspects complicates the Big Data characterization problem, but in a different way:

· For very large data volumes, one fundamental issue is the incomprehensibility of the raw data itself. Even if you could display a data table with several million, billion, or trillion rows and hundreds or thousands of columns, making any sense of this display would be a hopeless task.

· For high velocity datasets – e.g., real-time, Internet-based data sources – the data volume is determined by the observation time: at a fixed rate, the longer you observe, the more you collect. If you are attempting to generate a real-time characterization that keeps up with this input data rate, you face a fundamental trade-off between exploiting richer datasets acquired over longer observation periods, and the longer computation times required to process those datasets, making you less likely to keep up with the input data rate.

· For high-variety datasets, a key challenge lies in finding useful ways to combine very different data sources into something amenable to a common analysis (e.g., combining images, text, and numerical data into a single joint analysis framework).

One practical corollary to these observations is the need for a computer-based data reduction process or “data funnel” that matches the volume, velocity, and/or variety of the original data sources with the ultimate needs of the organization. In large organizations, this data funnel generally involves a mix of different technologies and people. While it is not a complete characterization, some of these differences are evident from the primary software platforms used in the different stages of this data funnel: languages like HTML for dealing with web-based data sources; typically, some variant of SQL for dealing with large databases; a package like R for complex quantitative analysis; and often something like Microsoft Word, Excel, or PowerPoint for delivering the final results. In addition, to help coordinate some of these tasks, there are likely to be scripts, written either for an operating system like UNIX or in a platform-independent scripting language like Perl or Python.

An important point omitted from all three articles is that there are at least two distinct application areas for Big Data:

1. The class of “production applications,” which were discussed in these articles and illustrated with examples like the un-named U.S. airline described by McAfee and Brynjolfsson that adopted a vendor-supplied procedure to obtain better estimates of flight arrival times, improving their ability to schedule ground crews and saving several million dollars per year at each airport. Similarly, the article by Barton and Court described a shipping company (again, un-named) that used real-time weather forecast data and shipping port status data, developing an automated system to improve the on-time performance of its fleet. Examples like these describe automated systems put in place to continuously exploit a large but fixed data source.

2. The exploitation of Big Data for “one-off” analyses: a question is posed, and the data science team scrambles to find an answer. This use is not represented by any of the examples described in these articles. In fact, this second type of application overlaps a lot with the development process required to create a production application, although the end results are very different. In particular, the end result of a one-off analysis is a single set of results, ultimately summarized to address the question originally posed. In contrast, a production application requires continuing support and often has to meet challenging interface requirements between the IT systems that collect and preprocess the Big Data sources and those that are already in use by the end-users of the tool (e.g., a Hadoop cluster running in a UNIX environment versus periodic reports generated either automatically or on demand from a Microsoft Access database of summary information).

A key point of Davenport and Patil’s article is that data science involves more than just the analysis of data: it is also necessary to identify data sources, acquire what is needed from them, re-structure the results into a form amenable to analysis, clean them up, and in the end, present the analytical results in a useable form. In fact, the subtitle of their article is “Meet the people who can coax treasure out of messy, unstructured data,” and this statement forms the core of the article’s working definition for the term “data scientist.” (The authors indicate that the term was coined in 2008 by D.J. Patil, who holds a position with that title at Greylock Partners.) Also, two particularly interesting tidbits from this article were the authors’ suggestion that a good place to find data scientists is at R User Groups, and their description of R as “an open-source statistical tool favored by data scientists.”

Davenport and Patil emphasize the difference between structured and unstructured data, especially relevant to the R community since most of R’s procedures are designed to work with the structured data types discussed in Chapter 2 of Exploring Data in Engineering, the Sciences, and Medicine: continuous, integer, nominal, ordinal, and binary. More specifically, note that these variable types can all be included in dataframes, the data object type that is best supported by R’s vast and expanding collection of add-on packages. Certainly, there is some support for other data types, and the level of this support is growing – the tm package and a variety of other related packages support the analysis of text data, the twitteR package provides support for analyzing Twitter tweets, and the scrapeR package supports web scraping – but the acquisition and reformatting of unstructured data sources is not R’s primary strength. Yet it is a key component of data science, as Davenport and Patil emphasize:

“A quantitative analyst can be great at analyzing data but not at subduing a mass of unstructured data and getting it into a form in which it can be analyzed. A data management expert might be great at generating and organizing data in structured form but not at turning unstructured data into structured data – and also not at actually analyzing the data.”

To better understand the distinction between the quantitative analyst and the data scientist implied by this quote, consider mathematician George Polya’s book, How To Solve It. Originally published in 1945 and most recently re-issued in 2009, 24 years after the author’s death, this book is a very useful guide to solving math problems. Polya’s basic approach consists of these four steps:

1. Understand the problem;

2. Formulate a plan for solving the problem;

3. Carry out this plan;

4. Check the results.

It is important to note what is not included in the scope of Polya’s four steps: Step 1 assumes a problem has been stated precisely, and Step 4 assumes the final result is well-defined, verifiable, and requires no further explanation. While quantitative analysis problems are generally neither as precisely formulated as Polya’s method assumes, nor as clear in their ultimate objective, the class of “quantitative analyst” problems that Davenport and Patil assume in the previous quote corresponds very roughly to problems of this type. They begin with something like an R dataframe and a reasonably clear idea of what analytical results are desired; they end by summarizing the problem and presenting the results. In contrast, the class of “data scientist” problems implied in Davenport and Patil’s quote comprises an expanded set of steps:

1. Formulate the analytical problem: decide what kinds of questions could and should be asked in a way that is likely to yield useful, quantitative answers;

2. Identify and evaluate potential data sources: what is available in-house, from the Internet, from vendors? How complete are these data sources? What would it cost to use them? Are there significant constraints on how they can be used? Are some of these data sources strongly incompatible? If so, does it make sense to try to merge them approximately, or is it more reasonable to omit some of them?

3. Acquire the data and transform it into a form that is useful for analysis; note that for sufficiently large data collections, part of this data will almost certainly be stored in some form of relational database, probably administered by others, and extracting what is needed for analysis will likely involve writing SQL queries against this database;

4. Once the relevant collection of data has been acquired and prepared, examine the results carefully to make sure they meet analytical expectations: do the formats look right? Are the ranges consistent with expectations? Do the relationships seen between key variables seem to make sense?

5. Do the analysis: by lumping all of the steps of data analysis into this simple statement, I am not attempting to minimize the effort involved, but rather emphasizing the other aspects of the Big Data analysis problem;

6. After the analysis is complete, develop a concise summary of the results that clearly and succinctly states the motivating problem, highlights what has been assumed, what has been neglected and why, and gives the simplest useful summary of the data analysis results. (Note that this will often involve several different summaries, with different levels of detail and/or emphases, intended for different audiences.)

Here, Steps 1 and 6 necessarily involve close interaction with the end users of the data analysis results, and they lie mostly outside the domain of R. (Conversely, knowing what is available in R can be extremely useful in formulating analytical problems that are reasonable to solve, and the graphical procedures available in R can be extremely useful in putting together meaningful summaries of the results.) The primary domain of R is Step 5: given a dataframe containing what are believed to be the relevant variables, we generate, validate, and refine the analytical results that will form the basis for the summary in Step 6. Part of Step 4 also lies clearly within the domain of R: examining the data once it has been acquired to make sure it meets expectations. In particular, once we have a dataset or a collection of datasets that can be converted easily into one or more R dataframes (e.g., csv files or possibly relational databases), a preliminary look at the data is greatly facilitated by the vast array of R procedures available for graphical characterizations (e.g., nonparametric density estimates, quantile-quantile plots, boxplots and variants like beanplots or bagplots, and much more); for constructing simple descriptive statistics (e.g., means, medians, and quantiles for numerical variables, tabulations of level counts for categorical variables, etc.); and for preliminary multivariate characterizations (e.g., scatter plots, classical and robust covariance ellipses, classical and robust principal component plots, etc.).
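To make the Step 4 checks a little more concrete, here is a minimal base R sketch of a first look at a newly acquired data frame. The data frame and its column names are made up purely for illustration; a real check would be applied to the data actually acquired in Step 3.

```r
# Made-up toy data frame standing in for a freshly acquired dataset
df <- data.frame(
  year  = c(1972, 1972, 1893, 1950, 1988),
  space = c(50, 54, 55, 70, 46),
  bath  = factor(c("0", "0", "1", "0", "0"))
)

# Do the formats look right?
var_types <- sapply(df, class)

# Are the ranges consistent with expectations? (numeric columns only)
num_ranges <- sapply(df[sapply(df, is.numeric)], range)

# How many records fall in each level of a categorical variable?
bath_counts <- table(df$bath)

# Does a relationship between two key variables look plausible?
year_space_cor <- cor(df$year, df$space, method = "spearman")

var_types
num_ranges
bath_counts
year_space_cor
```

Graphical checks such as boxplot(df$space) or plot(df$year, df$space) would normally accompany these numerical summaries.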

The rest of this post discusses those parts of Steps 2, 3, and 4 above that fall outside the domain of R. First, however, I have two observations. My first observation is that because R is evolving fairly rapidly, some tasks which are “outside the domain of R” today may very well move “inside the domain of R” in the near future. The packages twitteR and scrapeR, mentioned earlier, are cases in point, as are the continued improvements in packages that simplify the use of R with databases. My second observation is that the fact that something is possible within a particular software environment doesn’t make it a good idea. A number of years ago, I attended a student talk given at an industry/university consortium. The speaker set up and solved a simple linear program (i.e., he implemented the simplex algorithm to solve a simple linear optimization problem with linear constraints) using an industrial programmable controller. At the time, programming those controllers was done via relay ladder logic, a diagrammatic approach used by electricians to configure complicated electrical wiring systems. I left the talk impressed by the student’s skill, creativity and persistence, but I felt his efforts were extremely misguided.

Although it does not address every aspect of the “extra-R” components of Steps 2, 3, and 4 defined above – indeed, some of these aspects are so application-specific that no single book possibly could – Paul Murrell’s book Introduction to Data Technologies provides an excellent introduction to many of them. (This book is also available as a free PDF file under a Creative Commons license.) A point made in the book’s preface mirrors one in Davenport and Patil’s article:

“Data sets never pop into existence in a fully mature and reliable state; they must be cleaned and massaged into an appropriate form. Just getting the data ready for analysis often represents a significant component of a research project.”

Since Murrell is the developer of R’s grid graphics system that I have discussed in previous posts, it is no surprise that his book has an R-centric data analysis focus, but the book’s main emphasis is on the tasks of getting data from the outside world – specifically, from the Internet – into a dataframe suitable for analysis in R. Murrell therefore gives detailed treatments of topics like HTML and Cascading Style Sheets (CSS) for working with Internet web pages; XML for storing and sharing data; and relational databases and their associated query language SQL for efficiently organizing data collections with complex structures. Murrell states in his preface that these are things researchers – the target audience of the book – typically aren’t taught, but pick up in bits and pieces as they go along. He adds:

“A great deal of information on these topics already exists in books and on the internet; the value of this book is in collecting only the important subset of this information that is necessary to begin applying these technologies within a research setting.”

My one quibble with Murrell’s book is that he gives Python only a passing mention. While I greatly prefer R to Python for data analysis, I have found Python to be more suitable than R for a variety of extra-analytical tasks, including preliminary explorations of the contents of weakly structured data sources, as well as certain important reformatting and preprocessing tasks. Like R, Python is an open-source language, freely available for a wide variety of computing environments. Also like R, Python has numerous add-on packages that support an enormous variety of computational tasks (over 25,000 at this writing). In my day job in a SAS-centric environment, I commonly face tasks like the following: I need to create several nearly-identical SAS batch jobs, each to read a different SAS dataset that is selected on the basis of information contained in the file name; submit these jobs, each of which creates a CSV file; harvest and merge the resulting CSV files; run an R batch job to read this combined CSV file and perform computations on its contents. I can do all of these things with a Python script, which also provides a detailed recipe of what I have done, so when I have to modify the procedure slightly and run it again six months later, I can quickly re-construct what I did before. I have found Python to be better suited than R to tasks that involve a combination of automatically generating simple programs in another language, data file management, text processing, simple data manipulation, and batch job scheduling.

Despite my Python quibble, Murrell’s book represents an excellent first step toward filling the knowledge gap that Davenport and Patil note between quantitative analysts and data scientists; in fact, it is the only book I know of that addresses this gap. If you are an R aficionado interested in positioning yourself for “the sexiest job of the 21st century,” Murrell’s book is an excellent place to start.

Characterizing a new dataset

2012-10-27T12:30:00.000-07:00

In my last post, I promised a further examination of the spacing measures I described there, and I still promise to do that, but I am changing the order of topics slightly. So, instead of spacing measures, today’s post is about the DataframeSummary procedure to be included in the ExploringData package, which I also mentioned in my last post and promised to describe later. My next post will be a special one on Big Data and Data Science, followed by another one about the DataframeSummary procedure (additional features of the procedure and the code used to implement it), after which I will come back to the spacing measures I discussed last time.

A task that arises frequently in exploratory data analysis is the initial characterization of a new dataset. Ideally, everything we could want to know about a dataset should come from the accompanying metadata, but this is rarely the case. As I discuss in Chapter 2 of Exploring Data in Engineering, the Sciences, and Medicine, metadata is the available “data about data” that (usually) accompanies a data source. In practice, however, the available metadata is almost never as complete as we would like, and it is sometimes wrong in important respects. This is particularly the case when numeric codes are used for missing data, without accompanying notes describing the coding. An example illustrating the consequent problem of disguised missing data is described in my paper The Problem of Disguised Missing Data. (It should be noted that the original source of one of the problems described there – a comment in the UCI Machine Learning Repository header file for the Pima Indians diabetes dataset that there were no missing data records – has since been corrected.)

Once we have converted our data source into an R data frame (e.g., via the read.csv function for an external csv file), there are a number of useful tools to help us begin this characterization process. Probably the most general is the str command, applicable to essentially any R object. Applied to a dataframe, this command first tells us that the object is a dataframe, second, gives us the dimensions of the dataframe, and third, presents a brief summary of its contents, including the variable names, their type (specifically, the results of R’s class function), and the values of their first few observations. As a specific example, if we apply this command to the rent dataset from the gamlss package, we obtain the following summary:

> str(rent)

'data.frame': 1969 obs. of 9 variables:
 $ R  : num 693 422 737 732 1295 ...
 $ Fl : num 50 54 70 50 55 59 46 94 93 65 ...
 $ A  : num 1972 1972 1972 1972 1893 ...
 $ Sp : num 0 0 0 0 0 0 0 0 0 0 ...
 $ Sm : num 0 0 0 0 0 0 0 0 0 0 ...
 $ B  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ H  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
 $ L  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ loc: Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 2 2 ...

>

This dataset summarizes a 1993 random sample of housing rental prices in Munich, including a number of important characteristics about each one (e.g., year of construction, floor space in square meters, etc.). A more detailed description can be obtained via the command “help(rent)”.

The head command provides similar information to the str command, in slightly less detail (e.g., it doesn’t give us the variable types), but in a format that some will find more natural:

> head(rent)

       R Fl    A Sp Sm B H L loc
1  693.3 50 1972  0  0 0 0 0   2
2  422.0 54 1972  0  0 0 0 0   2
3  736.6 70 1972  0  0 0 0 0   2
4  732.2 50 1972  0  0 0 0 0   2
5 1295.1 55 1893  0  0 0 0 0   2
6 1195.9 59 1893  0  0 0 0 0   2

>

(An important difference between these representations is that str characterizes factor variables by their level number and not their level value: thus the first few observations of the factor B assume the first level of the factor, which is the value 0. As a consequence, while it may appear that str is telling us that the first few records list the value 1 for the variable B while head is indicating a zero, this is not the case. This is one reason that data analysts may prefer the head characterization.)
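The distinction between level numbers and level values is easy to see on a small made-up factor (the variable name B below simply echoes the rent example; the values are invented):

```r
# A factor whose level values are "0" and "1"
B <- factor(c(0, 0, 1, 0))

levels(B)          # the level values, stored as character strings
as.integer(B)      # the internal level numbers that str displays
as.character(B)    # the level values that head displays

# Recovering the original numbers requires going through the level values
as.numeric(as.character(B))
```

Calling as.numeric(B) directly would return the level numbers rather than the original values, which is a common source of confusion with integer-coded factors.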

While the R data types for each variable can be useful to know – particularly in cases where it isn’t what we expect it to be, as when integers are coded as factors – this characterization doesn’t really tell us the whole story. In particular, note that R has commands like “as.character” and “as.factor” that can easily convert numeric variables to character or factor data types. Even beyond this, the range of inherent behaviors that numerically-coded data can exhibit cannot be fully described by a simple data type designation. As a specific example, one of the variables in the rent dataframe is “A,” described in the metadata available from the help command as “year of construction.” While this variable is coded as type “numeric,” in fact it takes integer values from 1890 to 1988, with some values in this range repeated many times and others absent. This point is important, since analysis tools designed for continuous variables – especially outlier-resistant ones like medians and other rank-based methods – sometimes perform poorly in the face of data sequences with many repeated values (i.e., “ties,” which have zero probability for continuous data distributions). In extreme cases, these techniques may fail completely, as in the case of the MADM scale estimate, discussed in Chapter 7 of Exploring Data. This data characterization implodes if more than 50% of the data values are the same, returning the useless value zero in this case, independent of the values of all of the other data points.
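The MADM failure is easy to reproduce with simulated data. The sketch below uses R's built-in mad function (which applies the usual 1.4826 scaling) on a made-up sequence in which a single value accounts for more than half of the observations; the data are illustrative and are not taken from the rent dataframe.

```r
# 61 of the 100 values equal 5 (60 repeats plus the 5 in 1:40)
x <- c(rep(5, 60), 1:40)

madScale <- mad(x)   # median absolute deviation from the median, scaled
sdScale  <- sd(x)    # the standard deviation still reflects the spread

c(MADM = madScale, SD = sdScale)
```

Because more than half of the absolute deviations from the median are zero, their median is zero as well, no matter what the remaining 39 values are.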

These observations motivate the DataframeSummary procedure described here, to be included in the ExploringData package. This function is called with the name of the dataframe to be characterized and an optional parameter Option, which can take any one of the following four values:

“Brief” (the default value)
“NumericOnly”
“FactorOnly”
“AllAsFactor”

In all cases, this function returns a summary dataframe with one row for each column in the dataframe to be characterized. Like the str command, these results include the name of each variable and its type. Under the default option “Brief,” this function also returns the following characteristics for each variable:

Levels = the number of distinct values the variable exhibits;
AvgFreq = the average number of records listing each value;
TopLevel = the most frequently occurring value;
TopFreq = the number of records listing this most frequent value;
TopPct = the percentage of records listing this most frequent value;
MissFreq = the number of missing or blank records;
MissPct = the percentage of missing or blank records.

For the rent dataframe, this function (under the default “Brief” option) gives the following summary:

> DataframeSummary(rent)

  Variable    Type Levels AvgFreq TopLevel TopFreq TopPct MissFreq MissPct

3        A numeric     73   26.97     1957     551  27.98        0       0

6        B  factor      2  984.50        0    1925  97.77        0       0

2       Fl numeric     91   21.64       60      71   3.61        0       0

7        H  factor      2  984.50        0    1580  80.24        0       0

8        L  factor      2  984.50        0    1808  91.82        0       0

9      loc  factor      3  656.33        2    1247  63.33        0       0

1        R numeric   1762    1.12      900       7   0.36        0       0

5       Sm numeric      2  984.50        0    1797  91.26        0       0

4       Sp numeric      2  984.50        0    1419  72.07        0       0

>

The variable names and types appear essentially as they do in the results obtained with the str function, and the numbers to the far left indicate the column numbers from the dataframe rent for each variable, since the variable names are listed alphabetically for convenience. The “Levels” column of this summary dataframe gives the number of unique values for each variable, and it is clear that this can vary widely even within a given data type. For example, the variable “R” (monthly rent in DM) exhibits 1,762 unique values in 1,969 data observations, so nearly every observation has its own value, while the variables “Sm” and “Sp” exhibit only two possible values, even though all three of these variables are of type “numeric.” The AvgFreq column gives the average number of times each level should appear, assuming a uniform distribution over all possible values. This number is included as a reference value for assessing the other frequencies (i.e., TopFreq for the most frequently occurring value and MissFreq for missing data values). Thus, for the first variable, “A,” AvgFreq is 26.97, meaning that if all 73 possible values for this variable were equally represented, each one should occur about 27 times in the dataset. The most frequently occurring level (TopLevel) is “1957,” which occurs 551 times, suggesting a highly nonuniform distribution of values for this variable. In contrast, for the variable “R,” AvgFreq is 1.12, meaning that most values of this variable occur only once. The TopPct column gives the percentage of records in the dataset exhibiting the most frequent value for each variable, which varies from 0.36% for the numeric variable “R” to 97.77% for the factor variable “B.” It is interesting to note that the variable “B” is of type “factor” but is coded as 0 or 1, while the variables “Sm” and “Sp” are also binary, coded as 0 or 1, but are of type “numeric.” This illustrates the point noted above that the R data type is not always as informative as we might like it to be.
(This is not a criticism of R, but rather a caution about the fact that, in preparing data, we are free to choose many different representations, and the original logic behind the choice may not be obvious to all ultimate users of the data.) In addition, comparing the data for the variable “B” with the available metadata illustrates the point about metadata errors noted earlier: of the 1,969 data records, 1,925 have the value “0” (97.77%), while 44 have the value “1” (2.23%), but the information returned by the help command indicates exactly the opposite proportion of values: 1,925 should have the value “1” (indicating the presence of a bathroom), while 44 should have the value “0” (indicating the absence of a bathroom). Since the interpretation of the variables that enter any analysis is important in explaining our final analytical results, it is useful to detect this type of mismatch between the data and the available metadata as early as possible. Here, comparing the average rents for records with B = 1 (DM 424.95) against those with B = 0 (DM 820.72) suggests that the levels have been reversed relative to the metadata: the relatively few housing units without bathrooms are represented by B = 1, renting for less than the majority of those units, which have bathrooms and are represented by B = 0. Finally, the last two columns of the above summary give the number of records with missing or blank values (MissFreq) and the corresponding percentage (MissPct); here, all records are complete so these numbers are zero.
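To make the column definitions listed above concrete, here is a rough base-R sketch of how the “Brief” characteristics could be computed. It is not the DataframeSummary code from the ExploringData package: the function name BriefSketch is hypothetical, AvgFreq is assumed to be the number of records divided by the number of distinct values (which reproduces 1969/73 = 26.97 for the variable A), and the rows are left in column order rather than sorted alphabetically.

```r
# Hypothetical helper; assumes each column has at least one non-missing value
BriefSketch <- function(df) {
  rows <- lapply(names(df), function(nm) {
    x    <- df[[nm]]
    n    <- length(x)
    miss <- is.na(x) | as.character(x) == ""    # missing or blank records
    tbl  <- table(as.character(x[!miss]))       # counts per distinct value
    top  <- which.max(tbl)                      # first most frequent value
    data.frame(Variable = nm,
               Type     = class(x)[1],
               Levels   = length(tbl),
               AvgFreq  = round(n / length(tbl), 2),
               TopLevel = names(tbl)[top],
               TopFreq  = as.integer(tbl[top]),
               TopPct   = round(100 * tbl[[top]] / n, 2),
               MissFreq = sum(miss),
               MissPct  = round(100 * sum(miss) / n, 2),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Small made-up dataframe with one missing value
toy <- data.frame(a = c(1, 1, 2, NA), b = factor(c("x", "x", "x", "y")))
toySummary <- BriefSketch(toy)
toySummary
```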

In my next post on this topic, I will present results for the other three options of the DataframeSummary procedure, along with the code that implements it. In all cases, the results include those generated by the “Brief” option just presented; the other options differ first, in what additional characterizations are included, and second, in which subset of variables is included in the summary. Specifically, for the rent dataframe, we obtain:

Under the “NumericOnly” option, a summary of the five numeric variables R, Fl, A, Sp, and Sm results, giving characteristics that are appropriate to numeric data types, like the spacing measures described in my last post;
Under the “FactorOnly” option, a summary of the four factor variables B, H, L, and loc results, giving measures that are appropriate to categorical data types, like the normalized Shannon entropy measure discussed in several previous posts;
Under the “AllAsFactor” option, all variables in the dataframe are first converted to factors and then characterized using the same measures as in the “FactorOnly” option.

The advantage of the “AllAsFactor” option is that it characterizes all variables in the dataframe, but as I discussed in my last post, the characterization of numerical variables with measures like Shannon entropy is not always terribly useful.

Spacing measures: heterogeneity in numerical distributions

2012-09-22T18:54:00.000-07:00

Numerically-coded data sequences can exhibit a very wide range of distributional characteristics, including near-Gaussian (historically, the most popular working assumption), strongly asymmetric, light- or heavy-tailed, multi-modal, or discrete (e.g., count data). In addition, numerically coded values can be effectively categorical, either ordered or unordered. A specific example that illustrates the range of distributional behavior often seen in a collection of numerical variables is the Boston housing dataframe (Boston) from the MASS package in R. This dataframe includes 14 numerical variables that characterize 506 suburban housing tracts in the Boston area: 12 of these variables have class “numeric” and the remaining two have class “integer”. The integer variable chas is in fact a binary flag, taking the value 1 if the tract bounds the Charles River and 0 otherwise, and the integer variable rad is described as “an index of accessibility to radial highways,” assuming one of nine values: the integers 1 through 8, and 24. The other 12 variables assume anywhere from 26 unique values (for the zoning variable zn) to 504 unique values (for the per capita crime rate crim). The figure below shows nonparametric density estimates for four of these variables: the per-capita crime rate (crim, upper left plot), the percentage of the population designated “lower status” by the researchers who provided the data (lstat, upper right plot), the average number of rooms per dwelling (rm, lower left plot), and the zoning variable (zn, lower right plot). Comparing the appearances of these density estimates, considerable variability is evident: the distribution of crim is very asymmetric with an extremely heavy right tail, the distribution of lstat is also clearly asymmetric but far less so, while the distribution of rm appears to be almost Gaussian.
Finally, the distribution of zn appears to be tri-modal, mostly concentrated around zero, but with clear secondary peaks at around 20 and 80.

Each of these four plots also includes some additional information about the corresponding variable: three vertical reference lines at the mean (the solid line) and the mean offset by plus or minus three standard deviations (the dotted lines), and the value of the normalized Shannon entropy, listed in the title of each plot. This normalized entropy value is discussed in detail in Chapter 3 of Exploring Data in Engineering, the Sciences, and Medicine and in two of my previous posts (April 3, 2011 and May 21, 2011), and it forms the basis for the spacing measure described below. First, however, the reason for including the three vertical reference lines on the density plots is to illustrate that, while popular “Gaussian expectations” for data are approximately met for some numerical variables (the rm variable is a case in point here), often these expectations are violated so much that they are useless. Specifically, note that under approximately Gaussian working assumptions, most of the observed values for the data sequence should fall between the two dotted reference lines, which should correspond approximately to the smallest and largest values seen in the dataset. This description is reasonably accurate for the variable rm, and the upper limit appears fairly reasonable for the variable lstat, but the lower limit is substantially negative here, which is not reasonable for this variable since it is defined as a percentage. These reference lines appear even more divergent from the general shapes of the distributions for the crim and zn data, where again, the lower reference lines are substantially negative, infeasible values for both of these variables.
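The problem with three-standard-deviation reference lines is easy to reproduce with simulated data. The sketch below uses an exponentially distributed sample as a stand-in for a nonnegative, right-skewed variable like crim; the data and the variable names are illustrative and are not taken from the Boston dataframe.

```r
set.seed(1)
x <- rexp(500, rate = 0.25)    # simulated nonnegative, right-skewed data

# Reference lines at the mean and the mean plus or minus three standard deviations
refLines <- mean(x) + c(-3, 0, 3) * sd(x)

refLines     # the lower line is negative, an infeasible value here
range(x)     # the observed data never fall below zero

# Fraction of observations between the two outer reference lines
fracInside <- mean(x > refLines[1] & x < refLines[3])
```

For this distribution the mean and the standard deviation are roughly equal, so the lower reference line lands near minus twice the mean, far outside the range of the data.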

The reason the reference values defined by these lines are not particularly representative is the extremely heterogeneous nature of the data distributions, particularly for the variables crim – where the distribution exhibits a very long right tail – and zn – where the distribution exhibits multiple modes. For categorical variables, distributional heterogeneity can be assessed by measures like the normalized Shannon entropy, which varies between 0 and 1, taking the value zero when all levels of the variable are equally represented, and taking the value 1 when only one of several possible values is present. This measure is easily computed and, while it is intended for use with categorical variables, the procedures used to compute it will return results for numerical variables as well. These values are shown in the titles of each of the above four plots, and it is clear from these results that the Shannon measure does not give a reliable indication of distributional heterogeneity here. In particular, note that the Shannon measure for the crim variable is zero to three decimal places, suggesting a very homogeneous distribution, while the variables lstat and rm – both arguably less heterogeneous than crim – exhibit slightly larger values of 0.006 and 0.007, respectively. Further, the variable zn, whose density estimate resembles that of crim more than that of either of the other two variables, exhibits the much larger Shannon entropy value of 0.585.

The basic difficulty here is that all observations of a continuously distributed random variable should be unique. The normalized Shannon entropy – along with the other heterogeneity measures discussed in Chapter 3 of Exploring Data – effectively treats variables as categorical, returning a value that is computed from the fractions of total observations assigned to each possible value for the variable. Thus, for an ideal continuously-distributed variable, every observed value appears once and only once, so these fractions should be 1/N for each of the N distinct values observed for the variable. This means that the normalized Shannon measure – along with all of the alternative measures just noted – should be identically zero for this case, regardless of whether the continuous distribution in question is Gaussian, Cauchy, Pareto, uniform, or anything else. In fact, the crim variable considered here almost meets this ideal requirement: in 506 observations, crim exhibits 504 unique values, which is why its normalized Shannon entropy value is zero to three decimal places. In marked contrast, the variable zn exhibits only 26 distinct values, meaning that each of these values occurs, on average, just over 19 times. However, this average behavior is not representative of the data in this case, since the smallest possible value (0) occurs 372 times, while the largest possible value (100) occurs only once. It is because of the discrete character of this distribution that the normalized Shannon entropy is much larger here, accurately reflecting the pronounced distributional heterogeneity of this variable.
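A short sketch may help make this argument concrete. The function below is not the shannon.proc procedure from the companion website; it is a hypothetical stand-in that assumes the normalization 1 + sum(p * log(p)) / log(M) over M levels, chosen because it gives zero for equally represented levels and one when a single level holds all of the observations, as described above.

```r
# Hypothetical stand-in for a normalized Shannon measure (assumed normalization)
NormShannonSketch <- function(p) {
  M <- length(p)       # number of levels
  p <- p[p > 0]        # empty levels contribute nothing to the sum
  1 + sum(p * log(p)) / log(M)
}

uniformCase      <- NormShannonSketch(rep(0.25, 4))     # all levels equal
concentratedCase <- NormShannonSketch(c(1, 0, 0, 0))    # one level only

# Treating a numerical variable with all-unique values as categorical:
# every fraction is 1/N, so the measure is zero whatever the distribution
x <- c(0.1, 2.5, 3.7, 9.9, 42)
uniqueCase <- NormShannonSketch(as.numeric(table(x)) / length(x))
```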

Taken together, these observations suggest a simple extension of the normalized Shannon entropy that can give us a more adequate characterization of distributional differences for numerical variables. Specifically, the idea is this: begin by dividing the total range of a numerical variable x into M equal intervals. Then, count the number of observations that fall into each of these intervals and divide by the total number of observations N to obtain the fraction of observations falling into each group. By doing this, we have effectively converted the original numerical variable into an M-level categorical variable, to which we can apply heterogeneity measures like the normalized Shannon entropy. The four plots below illustrate this basic idea for the four Boston housing variables considered above. Specifically, each plot shows the fraction of observations falling into each of 10 equally spaced intervals, spanning the range from the smallest observed value of the variable to the largest.

As a specific example, consider the results shown in the upper left plot for the variable crim, which varies from a minimum of 0.00632 to a maximum of 89.0. Almost 87% of the observations fall into the smallest 10% of this range, from 0.00632 to 8.9, while the next two groups account for almost all of the remaining observations. In fact, none of the other groups (4 through 10) account for more than 1% of the observations, and one of these groups – group 7 – is completely empty. Computing the normalized Shannon entropy from this ten-level categorical variable yields 0.767, as indicated in the title of the upper left plot. In contrast, the corresponding plot for the lstat variable, shown in the upper right, is much more uniform, with the first five groups exhibiting roughly the same fractional occupation. As a consequence, the normalized Shannon entropy for this grouped variable is much smaller than that for the more heterogeneously distributed crim variable: 0.138 versus 0.767. Because the distribution is more sharply peaked for the rm variable than for lstat, the occupation fractions for the grouped version of this variable (lower left plot) are less homogeneous, and the normalized Shannon entropy is correspondingly larger, at 0.272. Finally, for the zn variable (lower right plot), the grouped distribution appears similar to that for the crim variable, and the normalized Shannon entropy values are also similar: 0.525 versus 0.767.

The key point here is that, in contrast to the normalized Shannon entropy applied directly to the numerical variables in the Boston dataframe, grouping these values into 10 equally-spaced intervals and then computing the normalized Shannon entropy gives a number that seems to be more consistent with the distributional differences between these variables that can be seen clearly in their density plots. Motivation for this numerical measure (i.e., why not just look at the density plots?) comes from the fact that we are sometimes faced with the task of characterizing a new dataset that we have not seen before. While we can – and should – examine graphical representations of these variables, in cases where we have many such variables, it is desirable to have a few, easily computed numerical measures to use as screening tools, guiding us in deciding which variables to look at first, and which techniques to apply to them. The spacing measure described here – i.e., the normalized Shannon entropy measure applied to a grouped version of the numerical variable – appears to be a potentially useful measure for this type of preliminary data characterization. For this reason, I am including it – along with a few other numerical characterizations – in the DataFrameSummary procedure I am implementing as part of the ExploringData package, which I will describe in a later post. Next time, however, I will explore two obvious extensions of the procedure described here: different choices of the heterogeneity measure, and different choices of the number of grouping levels. In particular, as I have shown in previous posts on interestingness measures, the normalized Bray, Gini, and Simpson measures all behave somewhat differently than the Shannon measure considered here, raising the question of which one would be most effective in this application. In addition, the choice of 10 grouping levels considered here was arbitrary, and it is by no means clear that this choice is the best one. 
In my next post, I will explore how sensitive the Boston housing results are to changes in these two key design parameters.

Finally, it is worth saying something about how the grouping used here was implemented. The R code listed below is the function I used to convert a numerical variable x into the grouped variable from which I computed the normalized Shannon entropy. The three key components of this function are the classIntervals function from the R package classInt (which must be loaded before use; hence, the “library(classInt)” statement at the beginning of the function), and the cut and table functions from base R. The classIntervals function generates a two-element list with components var, which contains the original observations, and brks, which contains the M+1 boundary values for the M groups to be generated. Note that the style = “equal” argument is important here, since we want M equal-width groups. The cut function then takes these results and converts them into an M-level categorical variable, assigning each original data value to the interval into which it falls. The table function counts the number of times each of the M possible levels occurs for this categorical variable. Dividing this vector by the sum of all entries then gives the fraction of observations falling into each group. Plotting the results obtained from this function and reformatting the results slightly yields the four plots shown in the second figure above, and applying the shannon.proc procedure available from the OUP companion website for Exploring Data yields the Shannon entropy values listed in the figure titles.

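UniformSpacingFunction <- function(x, nLvls = 10){

# Load classInt for the classIntervals function

library(classInt)

# Compute nLvls equal-width intervals spanning the range of x

xsum <- classIntervals(x, n = nLvls, style = "equal")

# Assign each observation to its interval (include.lowest keeps the minimum)

xcut <- cut(xsum$var, breaks = xsum$brks, include.lowest = TRUE)

# Count the observations in each interval and convert to fractions

xtbl <- table(xcut)

pvec <- xtbl/sum(xtbl)

pvec

}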
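For readers without the classInt package, the same grouping idea can be sketched in base R. The version below builds the equal-width breaks with seq, which is assumed to mirror what style = "equal" does, and then applies a normalized Shannon measure of the assumed form 1 + sum(p * log(p)) / log(M). The function name GroupedShannonSketch is hypothetical, and neither the Boston data nor shannon.proc is used here.

```r
# Hypothetical base-R sketch: group x into equal-width intervals, then
# compute a normalized Shannon measure from the group fractions
GroupedShannonSketch <- function(x, nLvls = 10) {
  brks <- seq(min(x), max(x), length.out = nLvls + 1)   # nLvls + 1 boundaries
  pvec <- table(cut(x, breaks = brks, include.lowest = TRUE)) / length(x)
  p    <- as.numeric(pvec[pvec > 0])                    # drop empty groups
  1 + sum(p * log(p)) / log(nLvls)
}

# Evenly spread values occupy all ten groups equally
evenSpread <- GroupedShannonSketch(1:100)

# Values piled up at one end of the range, with a single distant point
piledUp <- GroupedShannonSketch(c(rep(0, 99), 100))

c(evenSpread = evenSpread, piledUp = piledUp)
```

The first result is zero and the second is close to one, which is the qualitative contrast between a homogeneous and a highly heterogeneous grouped distribution.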

Implementing the CountSummary Procedure

2012-09-08T11:53:00.000-07:00

In my last post, I described and demonstrated the CountSummary procedure to be included in the ExploringData package that I am in the process of developing. This procedure generates a collection of graphical data summaries for a count data sequence, based on the distplot, Ord_plot, and Ord_estimate functions from the vcd package. The distplot function generates both the Poissonness plot and the negative-binomialness plot discussed in Chapters 8 and 9 of Exploring Data in Engineering, the Sciences and Medicine. These plots provide informal graphical assessments of the conformance of a count data sequence with the two most popular distribution models for count data, the Poisson distribution and the negative-binomial distribution. As promised, this post describes the R code needed to implement the CountSummary procedure, based on these functions from the vcd package.

The key to this implementation lies in the use of the grid package, a set of low-level graphics primitives included in base R. As I mentioned in my last post, the reason this was necessary – instead of using higher-level graphics packages like lattice or ggplot2 – was that the vcd package is based on grid graphics, making it incompatible with base graphics commands like those used to generate arrays of multiple plots. The grid package was developed by Paul Murrell, who provides a lot of extremely useful information about both R graphics in general and grid graphics in particular on his home page, including the article “Drawing Diagrams with R,” which provides a nicely focused introduction to grid graphics. The first example I present here is basically a composite of the first two examples presented in that article. Specifically, the code for this example is:

library(grid)

grid.newpage()

pushViewport(viewport(width = 0.8, height = 0.4))

grid.roundrect()

grid.text("This is text in a box")

popViewport()

The first line of this R code loads the grid package and the second tells this package to clear the plot window; failing to do this will cause this particular piece of code to overwrite whatever was there before, which usually isn’t what you want. The third line creates a viewport, into which the plot will be placed. In this particular example, we specify a width of 0.8, or 80% of the total plot window width, and a height of 0.4, corresponding to 40% of the total window height. The next two lines draw a rectangular box with rounded corners and put “This is text in a box” in the center of this box. The advantage of the grid package is that it provides us with simple graphics primitives to draw this kind of figure, without having to compute exact positions (e.g., in inches) for the different figure components. Commands like grid.text provide useful defaults (i.e., put the text in the center of the viewport), which can be overridden by specifying positional parameters in a variety of ways (e.g., left- or right-justified, offsets in inches or lines of text, etc.). The results obtained using these commands are shown in the figure below.
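As a small illustration of overriding these defaults, the sketch below places three text labels in the same rounded box: one centered by default, one left-justified with an offset in inches, and one right-justified with a vertical position given in lines of text. The labels and offsets are arbitrary choices for illustration, and the pdf(NULL) call is there only so that the code also runs without an open graphics window.

```r
library(grid)

pdf(NULL)   # off-screen device for non-interactive runs; omit at the console
grid.newpage()
pushViewport(viewport(width = 0.8, height = 0.4))
grid.roundrect()

# Default placement: centered in the viewport
grid.text("Centered by default")

# Left-justified, starting 0.25 inches from the left edge of the viewport
leftText <- grid.text("Left-justified", x = unit(0.25, "inches"), just = "left")

# Right-justified, two lines of text above the bottom edge
grid.text("Right-justified",
          x = unit(1, "npc") - unit(0.25, "inches"),
          y = unit(2, "lines"), just = "right")

popViewport()
invisible(dev.off())
```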

The code for the second example is a simple extension of the first one, essentially consisting of the added initial code required to create the desired two-by-two plot array, followed by four slightly modified copies of the above code. Specifically, this code is:

grid.newpage()

pushViewport(viewport(layout=grid.layout(nrow=2,ncol=2)))

pushViewport(viewport(layout.pos.row=1,layout.pos.col=1))

grid.roundrect(width = 0.8, height=0.4)

grid.text("Plot 1 goes here")

popViewport()

pushViewport(viewport(layout.pos.row=1,layout.pos.col=2))

grid.roundrect(width = 0.8, height=0.4)

grid.text("Plot 2 goes here")

popViewport()

pushViewport(viewport(layout.pos.row=2,layout.pos.col=1))

grid.roundrect(width = 0.8, height=0.4)

grid.text("Plot 3 goes here")

popViewport()

pushViewport(viewport(layout.pos.row=2,layout.pos.col=2))

grid.roundrect(width = 0.8, height=0.4)

grid.text("Plot 4 goes here")

popViewport()

Here, note that the first “pushViewport” command creates the two-by-two plot array we want, by specifying “layout = grid.layout(nrow=2,ncol=2)”. As in initializing a matrix in R, we can create an arbitrary two-dimensional array of grid graphics viewports – say m by n – by specifying “layout = grid.layout(nrow=m, ncol=n)”. Once we have done this, we can use whatever grid commands – or grid-compatible commands, such as those provided by the vcd package – we want, to create the individual elements in our array of plots. In this example, I have basically repeated the code from the first example to put text into rounded rectangular boxes in each position of the plot array. The two most important details are, first, that the “pushViewport” command at the beginning of each of these individual plot blocks specifies which of the four array elements the following plot will go in, and second, that the “popViewport()” command at the end of each block tells the grid package that we are finished with this element of the array. If we leave this command out, the next “pushViewport” command will not move to the desired plot element, but will simply overwrite the previous plot. Executing this code yields the plot shown below.
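For a general m-by-n array, the four hand-written blocks can be replaced by a pair of loops. The sketch below is one way to do this; the function name DrawPlaceholderGrid is hypothetical, and the pdf(NULL) call is included only so that the code also runs without an open graphics window.

```r
library(grid)

# Hypothetical helper: fill an m-by-n layout with labelled placeholder boxes
DrawPlaceholderGrid <- function(m, n) {
  grid.newpage()
  pushViewport(viewport(layout = grid.layout(nrow = m, ncol = n)))
  k <- 0
  for (i in seq_len(m)) {
    for (j in seq_len(n)) {
      k <- k + 1
      # Move into cell (i, j) of the layout
      pushViewport(viewport(layout.pos.row = i, layout.pos.col = j))
      grid.roundrect(width = 0.8, height = 0.4)
      grid.text(paste("Plot", k, "goes here"))
      popViewport()   # return to the layout viewport before the next cell
    }
  }
  popViewport()       # leave the layout viewport itself
  invisible(k)        # number of cells drawn
}

pdf(NULL)   # off-screen device for non-interactive runs; omit at the console
nDrawn <- DrawPlaceholderGrid(2, 3)
invisible(dev.off())
```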

The final example replaces the text in the above two-by-two example with the plots I want for the CountSummary procedure. Before presenting this code, it is important to say something about the structure of the resulting plot and the vcd commands used to generate the different plot elements. The first plot – in the upper left position of the array shown below – is an Ord plot, generated by the Ord_plot command, which does two things. The first is to generate the desired plot, but the second is to return estimates of the intercept and slope of one of the two reference lines in the plot. The first of these lines is fit to the points in the plot via ordinary least squares, while the second – the one whose parameters are returned – is fit via weighted least squares, to down-weight the widely scattered points seen in this plot that correspond to cases with very few observations. The intent of the Ord plot is to help us decide which of several alternative distributions – including both the Poisson and the negative-binomial – fits our count data sequence better. This guidance is based on the reference line parameters, and the Ord_estimate function in the vcd package transforms these parameter estimates into distributional recommendations and the distribution parameter values needed by the distplot function in the vcd package to generate either the Poissonness plot or the negative-binomialness plot for the count data sequence. Although these recommendations are sometimes useful, it is important to emphasize the caution given in the vcd package documentation:

“Be careful with the conclusions from Ord_estimate as it implements just some simple heuristics!”

In the CountSummary procedure, I use these results both to generate part of the text summary in the upper right element of the plot array, and to decide which type of plot to display in the lower right element of this array. Both this plot and the Poissonness reference plot in the lower left element of the display are created using the distplot command in the vcd package. I include the Poissonness reference plot because the Poisson distribution is the most commonly assumed distribution for count data – analogous in many ways to the Gaussian distribution so often assumed for continuous-valued data – and, by not specifying the single parameter for this distribution, I allow the function to determine it by fitting the data. In cases where the Ord plot heuristic recommends the Poissonness plot, it also provides this parameter, which I provide to the distplot function for the lower right plot. Thus, while both the lower right and lower left plots are Poissonness plots in this case, they are generally based on different distribution parameters. In the particular example shown here – constructed from the “number of times pregnant” variable in the Pima Indians diabetes dataset that I have discussed in several previous posts (available from the UCI Machine Learning Repository) – the Ord plot heuristic recommends the negative binomial distribution. Comparing the Poissonness and negative-binomialness plots in the bottom row of the above plot array, it does appear that the negative binomial distribution fits the data better.

Finally, before examining the code for the CountSummary procedure, it is worth noting that the vcd package’s implementation of the Ord_plot and Ord_estimate procedures can generate four different distributional recommendations: the Poisson and negative-binomial distributions discussed here, along with the binomial distribution and the much less well-known log-series distribution. The distplot procedure is flexible enough to generate plots for the first three of these distributions, but not the fourth, so in cases where the Ord plot heuristic recommends this last distribution, the CountSummary procedure displays the recommended distribution and parameter, but displays a warning message that no distribution plot is available for this case in the lower right plot position.
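To make the Ord plot idea concrete without relying on the vcd package, the following base-R sketch computes the quantity that the plot displays, k times the ratio of the number of observations equal to k to the number equal to k - 1, and fits both reference lines. The simulated Poisson data, the variable names, and the weights sqrt(n_k - 1) are assumptions made for illustration; this is not the vcd implementation.

```r
set.seed(42)
x <- rpois(20000, lambda = 3)    # simulated counts, for illustration only

k   <- 0:max(x)
n_k <- as.numeric(table(factor(x, levels = k)))   # observations equal to k

# Ord plot ordinate y_k = k * n_k / n_(k-1), for k >= 1 with both counts positive
idx <- which(k >= 1 & n_k > 0 & c(0, head(n_k, -1)) > 0)
kk  <- k[idx]
y   <- kk * n_k[idx] / n_k[idx - 1]

# Ordinary and weighted least squares reference lines
olsFit  <- lm(y ~ kk)
wlsFit  <- lm(y ~ kk, weights = sqrt(n_k[idx] - 1))   # down-weight sparse counts
ordCoef <- coef(wlsFit)
ordCoef
```

For Poisson data, the ratio k p(k)/p(k-1) equals lambda for every k, so the weighted line should have a slope near zero and an intercept near the value lambda = 3 used in the simulation; a clearly positive slope would instead point toward the negative binomial distribution, and a clearly negative one toward the binomial.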

The code for the CountSummary procedure looks like this:

CountSummary <- function(xCount,TitleString){

#

# Initial setup

#

library(vcd)

grid.newpage()

#

# Set up 2x2 array of plots

#

pushViewport(viewport(layout=grid.layout(nrow=2,ncol=2)))

#

# Generate the plots:

#

# 1 - upper left = Ord plot

#

pushViewport(viewport(layout.pos.row=1,layout.pos.col=1))

OrdLine = Ord_plot(xCount, newpage = FALSE, pop=FALSE, legend=FALSE)

OrdType = Ord_estimate(OrdLine)

popViewport()

#

# 2 - upper right = text summary

#

OrdTypeText = paste("Type = ",OrdType$type,sep=" ")

if (OrdType$type == "poisson"){

OrdPar = "Lambda = "

}

else if ((OrdType$type == "binomial")|(OrdType$type == "nbinomial")){

OrdPar = "Prob = "

}

else if (OrdType$type == "log-series"){

OrdPar = "Theta = "

}

else{

OrdPar = "Parameter = "

}

OrdEstText = paste(OrdPar,round(OrdType$estimate,digits=3), sep=" ")

TextSummary = paste("Ord plot heuristic results:",OrdTypeText,OrdEstText,sep="\n")

pushViewport(viewport(layout.pos.row=1,layout.pos.col=2))

grid.text(TitleString,y=2/3,gp=gpar(fontface="bold"))

grid.text(TextSummary,y=1/3)

popViewport()

# 3 - lower left = standard Poissonness plot

pushViewport(viewport(layout.pos.row=2,layout.pos.col=1))

distplot(xCount, type="poisson",newpage=FALSE, pop=FALSE, legend = FALSE)

popViewport()

# 4 - lower right = plot suggested by Ord results

pushViewport(viewport(layout.pos.row=2,layout.pos.col=2))

if (OrdType$type == "poisson"){

distplot(xCount, type="poisson",lambda=OrdType$estimate, newpage=FALSE, pop=FALSE, legend=FALSE)

}

else if (OrdType$type == "nbinomial"){

prob = OrdType$estimate

size = 1/prob - 1

distplot(xCount, type="nbinomial",size=size,newpage=FALSE, pop=FALSE, legend=FALSE)

}

else if (OrdType$type == "binomial"){

distplot(xCount, type="binomial", newpage=FALSE, pop=FALSE, legend=FALSE)

}

else{

Message = paste("No distribution plot","available","for this case",sep="\n")

grid.text(Message)

}

popViewport()

}

This procedure is a function called with two arguments: the sequence of count values, xCount, and TitleString, a text string that is displayed in the upper right text box in the plot array, along with the recommendations from the Ord plot heuristic. When called, the function first loads the vcd library to make the Ord_plot, Ord_estimate, and distplot functions available for use, and it executes the grid.newpage() command to clear the display. (Note that we don’t have to include “library(grid)” here to load the grid package, since loading the vcd package automatically does this.) As in the previous example, the first “pushViewport” command creates the two-by-two plot array, and this is again followed by four code segments, one to generate each of the four displays in this array. The first of these segments invokes the Ord_plot and Ord_estimate commands as discussed above, first to generate the upper left plot (a side-effect of the Ord_plot command) and second, to obtain the Ord plot heuristic recommendations, to be used in structuring the rest of the display. The second segment creates a text display as in the first example considered here, but the structure of this display depends on the Ord plot heuristic results (i.e., the names of the parameters for the four possible recommended distributions are different, and the logic in this code block matches the display text to this distribution). As noted in the preceding discussion, the third plot (lower left) is the Poissonness plot generated by the distplot function from the vcd package. In this case, the function is called specifying only ‘type = “poisson”’ without the optional distribution parameter lambda, which is obtained by fitting the data. The final element of this plot array, in the lower right, is also generated via a call to the distplot function, but here, the results from the Ord plot heuristic are used to specify both the type parameter and any optional or required shape parameters for the distribution. As with the displayed text, simple if-then-else logic is required here to match the plot generated with the Ord plot heuristic recommendations.

Finally, it is important to note that in all of the calls made to Ord_plot or distplot in the CountSummary procedure, the parameters newpage, pop, and legend, are all specified as FALSE. Specifying “newpage = FALSE” prevents these vcd plot commands from clearing the display page and erasing everything we have done so far. Similarly, specifying “pop = FALSE” allows us to continue working in the current plot window until we notify the grid graphics system that we are done with it by issuing our own “popViewport()” command. Specifying “legend = FALSE” tells Ord_plot and distplot not to write the default informational legend on each plot. This is important here because, given the relatively small size of the plots generated in this two-by-two array, including the default legends would obscure important details.
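The viewport mechanics described here can be tried without the vcd package. The following is a minimal sketch, using only the grid package, that fills each cell of a two-by-two layout with a labelled box. The helper names DrawCell and LayoutSketch are hypothetical and are not part of CountSummary; the explicit popViewport() in DrawCell plays the role that “pop = FALSE” followed by our own popViewport() plays for the vcd plot commands.

```r
library(grid)

# Hypothetical helper: draw a labelled box in one cell of the current layout
DrawCell <- function(row, col, label){
  pushViewport(viewport(layout.pos.row = row, layout.pos.col = col))
  grid.rect(gp = gpar(col = "grey50"))
  grid.text(label)
  popViewport()   # tell grid we are done with this cell
}

# Hypothetical helper: set up the 2x2 layout and fill its four cells
LayoutSketch <- function(){
  grid.newpage()
  pushViewport(viewport(layout = grid.layout(nrow = 2, ncol = 2)))
  DrawCell(1, 1, "1 - upper left")
  DrawCell(1, 2, "2 - upper right")
  DrawCell(2, 1, "3 - lower left")
  DrawCell(2, 2, "4 - lower right")
  popViewport()   # leave the layout viewport
  invisible(TRUE)
}

LayoutSketch()
```

Replacing the body of any one cell with a grid-based plot command that does not clear the page gives the same structure as the procedure above.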

Base versus grid graphics

2012-07-21T13:33:00.000-07:00

In a comment in response to my latest post, Robert Young took issue with my characterization of grid as an R graphics package. Perhaps grid is better described as a “graphics support package,” but my primary point – and the main point of this post – is that to generate the display you want, it is sometimes necessary to use commands from this package. In my case, the necessity to learn something about grid graphics came as the result of my attempt to implement the CountSummary procedure to be included in the ExploringData package that I am developing. CountSummary is a graphical summary procedure for count data, based on Poissonness plots, negative binomialness plots, and Ord plots, all discussed in Chapter 8 of Exploring Data in Engineering, the Sciences, and Medicine. My original idea was to implement these plots myself, but then I discovered that all three were already available in the vcd package. One of the great things about R is that you are encouraged to build on what already exists, so using the vcd implementations seemed like a no-brainer. Unfortunately, my first attempt at creating a two-by-two array of plots from the vcd package failed, and I didn’t understand why. The reason turned out to be that I was attempting to mix the base graphics command “par(mfrow=c(2,2))” that sets up a two-by-two array with various plotting commands from vcd, which are based on grid graphics. Because these two graphics systems don’t play well together, I didn’t get the results I wanted. In the end, however, by learning a little about the grid package and its commands, I was able to generate my two-by-two plot array without a great deal of difficulty. Since grid graphics isn’t even mentioned in my favorite R reference book (Michael Crawley’s The R Book), I wanted to say a little here about what the grid package is and why you might need to know something about it. To do this, I will describe the ideas that went into the development of the CountSummary procedure and conclude this post with an example that shows what the output looks like. Next time, I will give a detailed discussion of the R code that generated these results. (For those wanting a preliminary view of what the code looks like, load the vcd package with the library command and run “example(Ord_plot)” – in addition to generating the plots, this example displays the grid commands needed to construct the two-by-two array.)

Count variables – non-negative integer variables like the “number of times pregnant” (NPG) variable from the Pima Indians database described below – are often assumed to obey a Poisson distribution, in much the same way that continuous-valued variables are often assumed to obey a Gaussian (normal) distribution. Like this normality assumption for continuous variables, the Poisson assumption for count data is sometimes reasonable, but sometimes it isn’t. Normal quantile-quantile plots like those generated by the qqnorm command in base R or the qqPlot command from the car package are useful in informally assessing the reasonableness of the normality assumption for continuous data. Similarly, Poissonness plots are the corresponding graphical tool for informally evaluating the Poisson hypothesis for count data. The construction and interpretation of these plots is discussed in some detail in Chapters 8 and 9 of Exploring Data, but briefly, this plot constructs a variable called the Poissonness count metameter from the number of times each possible count value occurs in the data; if the data sequence conforms to the Poisson distribution, the points on this plot should fall approximately on a straight line. A simple R function that constructs Poissonness plots is available on the OUP companion website for the book, but an implementation that is both more conveniently available and more flexible is the distplot function in the vcd package, which also generates the negative binomialness plot discussed below.

The figure above is the Poissonness plot constructed using the distplot procedure from the vcd package for the NPG variable from the Pima Indians diabetes dataset mentioned above. I have discussed this dataset in previous posts and have used it as the basis for several examples in Exploring Data. It is available from the UCI Machine Learning Repository and it has been incorporated in various forms as an example dataset in a number of R packages, including a cleaned-up version in the MASS package (dataset Pima.tr). The full version considered here contains nine characteristics for 768 female members of the Pima Indian tribe, including their age, medical characteristics like diastolic blood pressure, and the number of times each woman has been pregnant. If this NPG count sequence obeyed the Poisson distribution, the points in the above plot would fall approximately on the reference line included there. The fact that these points do not conform well to this line – note, in particular, the departure at the lower left end of the plot where most of the counts occur – calls the Poisson working assumption into question.
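To make the construction concrete, here is a minimal base R sketch of the basic Poissonness count metameter, log(k! n_k / N), where n_k is the number of times the count k occurs and N is the total number of observations. For Poisson data this quantity is approximately linear in k, with slope log(lambda) and intercept -lambda. The function name PoissonnessMetameter and the simulated data are assumptions made for illustration; the distplot function in vcd may compute and display the metameter differently in detail.

```r
# Basic Poissonness count metameter: phi(k) = log(k! * n_k / N)
# (hypothetical helper, not the vcd implementation)
PoissonnessMetameter <- function(xCount){
  Freq <- table(xCount)
  k <- as.integer(names(Freq))     # observed count values
  nk <- as.numeric(Freq)           # number of times each value occurs
  data.frame(k = k, phi = lfactorial(k) + log(nk / sum(nk)))
}

set.seed(33)
xSim <- rpois(2000, lambda = 3)    # simulated stand-in data, not the NPG variable
Meta <- PoissonnessMetameter(xSim)
plot(Meta$k, Meta$phi, xlab = "Count k", ylab = "Poissonness metameter")
abline(lm(phi ~ k, data = Meta), lty = 2)   # unweighted reference line
```

Points that bend away from the reference line, especially where most of the counts occur, are the kind of departure described above for the NPG variable.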

A fundamental feature of the Poisson distribution is that it is defined by a single parameter that determines all distributional characteristics, including both the mean and the variance. In fact, a key characteristic of the Poisson distribution is that the variance is equal to the mean. This constraint is not satisfied by all count data sequences we encounter, however, and these deviations are important enough to receive special designations: integer sequences whose variance is larger than their mean are commonly called overdispersed, while those whose variance is smaller than their mean are commonly called underdispersed. In practice, overdispersion seems to occur more frequently, and a popular distributional alternative for overdispersed sequences is the negative binomial distribution. This distribution is defined by two parameters and it is capable of matching both the mean and variance of arbitrary overdispersed count data sequences. For a detailed discussion of this distribution, refer to Chapter 3 of Exploring Data.
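A quick numerical check of over- or underdispersion is the ratio of the sample variance to the sample mean. The sketch below also shows how a two-parameter negative binomial can match both moments: in R's mean/size parameterization the variance is mu + mu^2/size, so size = mu^2/(variance - mu). The function names DispersionRatio and MomentSize and the simulated data are assumptions for illustration only; this moment-matching estimate is not the method used by distplot or by the Ord plot heuristic.

```r
# Variance-to-mean ratio: about 1 for Poisson data, above 1 if overdispersed
DispersionRatio <- function(xCount) var(xCount) / mean(xCount)

# Moment-matching size parameter for a negative binomial distribution,
# using var = mu + mu^2/size; only defined for overdispersed data
MomentSize <- function(xCount){
  mu <- mean(xCount)
  v <- var(xCount)
  if (v <= mu) return(NA_real_)
  mu^2 / (v - mu)
}

set.seed(7)
xPois <- rpois(5000, lambda = 3)          # simulated Poisson counts
xNB <- rnbinom(5000, size = 2, mu = 3)    # simulated overdispersed counts
print(c(Poisson = DispersionRatio(xPois), NegBin = DispersionRatio(xNB)))
print(MomentSize(xNB))
```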

Like the Poisson distribution, it is possible to evaluate the reasonableness of the negative binomial distribution graphically, via the negative binomialness plot. Like the Poissonness plot, this plot is based on a quantity called the negative binomialness metameter, computed from the number of times each count value occurs, plotted against those count values. To construct this plot, it is necessary to specify a numerical value for the distribution’s second parameter (the size parameter in the distplot command, corresponding to the r parameter in the discussion of this distribution given in Chapter 8 of Exploring Data). This can be done in several different ways, including the specification of trial values, the approach taken in the negative binomialness plot procedure that is available from the OUP companion website. This option is also available with the distplot command from the vcd package: to obtain a negative binomialness plot, specify the type parameter as “nbinomial” and, if a fixed size parameter is desired, it is specified by giving a numerical value for the size parameter in the distplot function call. Alternatively, if this parameter is not specified, the distplot procedure will estimate it via the method of maximum likelihood, an extremely useful feature, although it is important to note that this estimation process can be time-consuming, especially for long data sequences. Finally, a third approach that can be adopted is to use the Ord plot described next to obtain an estimate of this parameter based on a simple heuristic. In addition, this heuristic suggests which of these two candidate distributions – the Poisson or the negative binomial – is more appropriate for the data sequence.

Like the Poissonness plot, the Ord plot computes a simple derived quantity from the original count data sequence – specifically, the frequency ratio, defined for each count value as that value multiplied by the ratio of the number of times it occurs to the number of times the next smaller count occurs – and plots this versus the counts. If the data sequence obeys the negative binomial distribution, these points should conform reasonably well to a line with positive slope, and this slope can be used to determine the size parameter for the distribution. Conversely, if the Poisson distribution is appropriate, the best fit reference line for the Ord plot should have zero slope. In addition, Ord plots can also be used to suggest two additional discrete distributions (specifically, the binomial distribution and the log-series distribution), and the vcd package provides dataset examples to illustrate all four of these cases.
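The frequency ratio itself is easy to compute in base R. The following is a minimal sketch with a hypothetical function name, OrdRatio, and simulated data. For Poisson data the ratio k n_k / n_(k-1) is roughly constant at lambda, while for negative binomial data (in R's size/prob parameterization) it is roughly linear in k with slope 1 - prob. The reference line here is an ordinary unweighted least squares fit, which is an assumption of this sketch and may differ from the fit used by Ord_plot in vcd.

```r
# Frequency ratio for the Ord plot: k * n_k / n_(k-1), computed only for
# count values k whose predecessor k - 1 also occurs in the data
OrdRatio <- function(xCount){
  Freq <- table(xCount)
  k <- as.integer(names(Freq))
  nk <- as.numeric(Freq)
  prev <- match(k - 1, k)          # position of k - 1, NA if not observed
  keep <- !is.na(prev)
  data.frame(k = k[keep], ratio = k[keep] * nk[keep] / nk[prev[keep]])
}

set.seed(11)
xSim <- rnbinom(3000, size = 3, prob = 0.4)   # simulated stand-in data
Ord <- OrdRatio(xSim)
plot(Ord$k, Ord$ratio, xlab = "Count k", ylab = "Frequency ratio")
abline(lm(ratio ~ k, data = Ord), lty = 2)    # unweighted reference line
```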

For my CountSummary procedure, I decided to construct a two-by-two array with the following four components. First, in the upper left, I used the Ord_plot command in vcd to generate an Ord plot. This command returns the intercept and slope parameters for the reference line in the plot, and the Ord_estimate command can then be used to convert these values into a type specification and an estimate of the distribution parameter needed to construct the appropriate discrete distribution plot. I will discuss these results in more detail in my next post, but for the case of the NPG count sequence considered here, the Ord plot results suggest the negative binomial distribution as the most appropriate choice, returning a parameter prob, from which the size parameter required to generate the negative binomialness plot may be generated (specifically, size = 1/prob – 1). The upper right quadrant of this display gives a text summary identifying the variable being characterized and listing the Ord plot recommendations and parameter estimate. Since the Poisson distribution is “the default” assumption for count data, the lower left plot shows a Poissonness plot for the data sequence, while the lower right plot is the “distribution-ness plot” for the distribution recommended by the Ord plot results. The results obtained by the CountSummary procedure for the NPG sequence are shown below. Next time, I will present the code used to generate this plot.

Graphical insights from the 2012 UseR! Meeting

2012-07-07T08:11:00.000-07:00

About this time last month, I attended the 2012 UseR! Meeting. Now an annual event, this series of conferences started in Europe in 2004 as an every-other-year gathering that now seems to alternate between the U.S. and Europe. This year’s meeting was held on the Vanderbilt University campus in Nashville, TN, and it was attended by about 500 R aficionados, ranging from beginners who have just learned about R to members of the original group of developers and the R Core Team that continues to maintain it. Many different topics were discussed, but one given particular emphasis was data visualization, which forms the primary focus of this post. For a more complete view of the range of topics discussed and who discussed them, the conference program is available as a PDF file that includes short abstracts of the talks.

All attendees were invited to present a Lightning Talk, and about 20 of us did. The format is essentially the technical equivalent of the 50-yard dash: before the talk, you provide the organizers exactly 15 slides, each of which is displayed for 20 seconds. The speaker’s challenge is first, to try to keep up with the slides, and second, to try to convey some useful information about each one. For my Lightning Talk, I described the ExploringData R package that I am in the process of developing, as a companion to both this blog and my book, Exploring Data in Engineering, the Sciences, and Medicine. The intent of the package is first, to make the R procedures and datasets from the OUP companion site for the book more readily accessible, and second, to provide some additional useful tools for exploratory data analysis, incorporating some of the extensions I have discussed in previous blog posts.

Originally, I had hoped to have the package complete by the time I gave my Lightning Talk, but in retrospect, it is just as well that the package is still in the development stage, because I picked up some extremely useful tips on what constitutes a good package at the meeting. As a specific example, Hadley Wickham, Professor of Statistics at Rice University and the developer of the ggplot2 package (more on this later), gave a standing-room-only talk on package development, featuring the devtools package, something he developed to make the R package development process easier. In addition, the CRC vendor display at the meeting gave me the opportunity to browse and purchase Paul Murrell’s book, R Graphics, which provides an extremely useful, detailed, and well-written treatment of the four different approaches to graphics in R that I will say a bit more about below.

Because I am still deciding what to include in the ExploringData package, one of the most valuable sessions for me was the invited talk by Di Cook, Professor of Statistics at Iowa State University, who emphasized the importance of meaningful graphical displays in understanding the contents of a dataset, particularly if it is new to us. One of her key points – illustrated with examples from some extremely standard R packages – was that the “examples” associated with datasets included in R packages often fail to include any such graphical visualization, and even for those that do, the displays are often too cryptic to be informative. While this point is obvious enough in retrospect, it is one that I – along with a lot of other people, evidently – had not thought about previously. As a consequence, I am now giving careful thought to the design of informative display examples for each of the datasets I will include in the ExploringData package.

As I mentioned above, there are (at least) four fundamental approaches to doing graphics in R. The one that most of us first encounter – the one we use by default every time we issue a “plot” command – is called base graphics, and it is included in base R to support a wide range of useful data visualization procedures, including scatter plots, boxplots, histograms, and a variety of other common displays. The other three approaches to graphics – grid graphics, lattice graphics, and ggplot2 – all offer more advanced features than what is typically available in base graphics, but they are, most unfortunately, incompatible in a number of ways with base graphics. I discovered this the hard way when I was preparing one of the procedures for the ExploringData package (the CountSummary procedure, which I will describe and demonstrate in my next post). Specifically, the vcd package includes implementations of Poissonness plots, negative binomialness plots, and Ord plots, all discussed in Exploring Data, and I wanted to take advantage of these implementations in building a simple graphical summary display for count data. In base graphics, to generate a two-by-two array of plots, you simply specify “par(mfrow=c(2,2))” and then generate each individual plot using standard plot commands. When I tried this with the plots generated by the vcd package, I didn’t get what I wanted – for the most part, it appeared that the “par(mfrow=c(2,2))” command was simply being ignored, and when it wasn’t, multiple plots were piled up on top of each other. It turns out that the vcd package uses grid graphics, which has a fundamentally different syntax: it’s more complicated, but in the end, it does provide a wider range of display options. Ultimately, I was able to generate the display I wanted, although this required some digging, since grid graphics aren’t really discussed much in my standard R reference books. For example, The R Book by Michael J. Crawley covers an extremely wide range of useful topics, but the only mentions of “grid” in the index refer to the generation of grid lines (e.g., the base graphics command “grid” generates grid lines on a base R plot, which is not based on grid graphics).

Often, grid graphics are mentioned in passing in introductory descriptions of trellis (lattice) graphics, since the lattice package is based on grid graphics. This package is discussed in The R Book, and I have used it occasionally because it does support things like violin plots that are not part of base graphics. To date, I haven’t used it much because I find the syntax much more complicated, but I plan to look further into it, since it does appear to have a lot more capability than base graphics do. Also, Murrell’s R Graphics book devotes a chapter to trellis graphics and the lattice package, which goes well beyond the treatments given in my other R references, and this provides me further motivation to learn more. The fourth approach to R graphics – Hadley Wickham’s ggplot2 package – was much discussed at the UseR! Meeting, appearing both in examples presented in various authors’ talks and as components for more complex and specialized graphics packages. I have not yet used ggplot2, but I intend to try it out, since it appears from some of the examples that this package can generate an extremely wide range of data visualizations, with simple types comparable to what is found in base graphics often available as defaults. Like the lattice package, ggplot2 is also based on grid graphics, making it, too, incompatible with base graphics. Again, the fact that Murrell’s book devotes a chapter to this package should also be quite helpful in learning when and how to make the best use of it.

This year’s UseR! Meeting was the second one I have attended – I also went to the 2010 meeting in Gaithersburg, MD, held at the National Institute of Standards and Technology (NIST). Both have been fabulous meetings, and I fully expect future meetings to be as good: next year’s UseR! meeting is scheduled to be held in Spain and I’m not sure I will be able to attend, but I would love to. In any case, if you can get there, I highly recommend it, based on my experiences so far.

Classifying the UCI mushrooms

2012-06-10T13:13:00.000-07:00

In my last post, I considered the shifts in two interestingness measures as possible tools for selecting variables in classification problems. Specifically, I considered the Gini and Shannon interestingness measures applied to the 22 categorical mushroom characteristics from the UCI mushroom dataset. The proposed variable selection strategy was to compare these values when computed from only edible mushrooms or only poisonous mushrooms. The rationale was that variables whose interestingness measures changed a lot between these two subsets might be predictive of mushroom edibility. In this post, I examine this question a little more systematically, primarily to illustrate the mechanics of setting up classification problems and evaluating their results.

More specifically, the classification problem I consider here is that of building and comparing models that predict mushroom edibility, each one based on a different mushroom characteristic. In practice, you would generally consider more than one characteristic as the basis for prediction, but here, I want to use standard classification tools to provide a basis for comparing the predictabilities of each of the potentially promising mushroom characteristics identified in my last post. In doing this, I also want to highlight three aspects of classification problems: first, the utility of randomly splitting the available data into subsets before undertaking the analysis, second, the fact that we have many different options in building classifiers, and third, one approach to assessing classification results.

One of the extremely useful ideas emphasized in the machine learning literature is the utility of randomly partitioning our dataset into three parts: one used to fit whatever prediction model we are interested in building, another used to perform intermediate fit comparisons (e.g., compare the performance of models based on different predictor variables), and a third that is saved for a final performance assessment. The reasoning behind this partitioning is that if we allow our prediction model to become too complex, we run the risk of overfitting, or predicting some of the random details in our dataset, resulting in a model that does not perform well on other, similar datasets. This is an important practical problem that I illustrate with an extreme example in Chapter 1 of Exploring Data in Engineering, the Sciences, and Medicine. There, a sequence of seven monotonically-decaying observations is fit to a sixth-degree polynomial that exactly predicts the original seven observations, but which exhibits horrible interpolation and extrapolation behavior. The point here is that we need a practical means of protecting ourselves against building models that are too specific to the dataset at hand, and the partitioning strategy just described provides a simple way of doing this. That is, once we partition the data, we can fit our prediction model to the first subset and then evaluate its performance with respect to the second subset: because these subsets were generated by randomly sampling the original dataset, their general character is the same, so a “good” prediction model built from the first subset should give “reasonable” predictions for the second subset. The reason for saving out a third data subset – not used at all until the final evaluation of our model – is that model-building is typically an iterative procedure, so we are likely to cycle repeatedly between the first and second subsets. 
For the final model evaluation, it is desirable to have a dataset available that hasn’t been used at all.

Generating this three-way split in R is fairly easy. As with many tasks, this can be done in more than one way, but the following procedure is fairly straightforward and only makes use of procedures available in base R:

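RandomThreeWay.proc <- function(df, probs = c(35,35,30), iseed = 101){
  # Fix the seed so the partition is reproducible
  set.seed(iseed)
  n = nrow(df)
  # One uniform random number per data record
  u = runif(n)
  # Convert relative sizes to cumulative breakpoints on [0,1]
  nprobs = probs/sum(probs)
  brks = c(0,cumsum(nprobs))
  Subgroup = cut(u, breaks=brks, labels=c("A","B","C"), include.lowest=TRUE)
  # Return the subgroup label for each record
  Subgroup
}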

This function is called with three parameters: the data frame that we wish to partition for our analysis, a vector of the relative sizes of our three partitions, and a seed for the random number generator. In the implementation shown here, the vector of relative sizes is given the default values 35%/35%/30%, but any relative size partitioning can be specified. The result returned by this procedure is the factor Subgroup, which takes the values “A”, “B”, or “C”, corresponding to the three desired partitions of the dataset. The first line of this procedure sets the seed for the uniform random number generator used in the third line, and the second line specifies how many random numbers to generate (i.e., one for each data record in the data frame). The basic idea here is to generate uniform random numbers on the interval [0,1] and then assign subgroups depending on whether this value falls into the interval between 0 and 0.35, 0.35 to 0.70, or 0.70 to 1.00. The runif function generates the required random numbers, the cumsum function is used to generate the cumulative breakpoints from the normalized probabilities, and the cut function is used to group the uniform random numbers using these break points.
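As a quick sanity check on this idea, the following sketch repeats the same runif/cumsum/cut steps outside the function and tabulates the resulting proportions. The sample size of 10000 is an arbitrary choice for illustration; with a sample this large, the observed fractions should be close to the requested 35%/35%/30% split, though not exactly equal to it.

```r
set.seed(101)
n <- 10000                                   # arbitrary illustrative size
probs <- c(35, 35, 30)                       # relative partition sizes
brks <- c(0, cumsum(probs / sum(probs)))     # breakpoints 0, 0.35, 0.70, 1
Subgroup <- cut(runif(n), breaks = brks, labels = c("A", "B", "C"),
                include.lowest = TRUE)

print(class(Subgroup))                       # cut() returns a factor
print(round(table(Subgroup) / n, 3))         # observed partition fractions
```

Because the assignment is random, the subset sizes vary from seed to seed; only their expected values match the requested proportions.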

In the specific example considered here, I use logistic regression as my classifier, although many, many other classification procedures are available in R, including a wide range of decision tree-based models, random forest models, boosted tree models, naïve Bayes classifiers, and support vector machines, to name only a few. (For a more complete list, refer to the CRAN task view on Machine Learning and Statistical Learning). Here, I construct and compare six logistic regression models, each constructed to predict the probability that a mushroom is poisonous from one of the six mushroom characteristics identified in my previous post: GillSize, StalkShape, CapSurf, Bruises, GillSpace, and Pop. In each case, I extract the records for subset “A” of the UCI mushroom dataset, as described above, and use the base R procedure glm to construct a logistic regression model. Because the model evaluation procedure (somers2, described below) that I use here requires a binary response coded as 0 or 1, it is simplest to construct a data frame with this response explicitly, along with the prediction covariate of interest. The following code does this for the first predictor (GillSize):

EorP = UCImushroom.frame$EorP
# Binary response: 1 for poisonous ("p"), 0 for edible ("e")
PoisonBinary = rep(0,length(EorP))
PoisonIndx = which(EorP == "p")
PoisonBinary[PoisonIndx] = 1
FirstFrame = data.frame(PoisonBinary = PoisonBinary, Covar = UCImushroom.frame$GillSize)

In particular, this code constructs a two-column data frame that contains the binary response variable PoisonBinary that is equal to 1 whenever EorP is “p” and 0 whenever this variable is “e”, and the prediction covariate Covar, which is here “GillSize”. Given this data frame, I then apply the following code to randomly partition this data frame into subsets A, B, and C, and I invoke the built-in glm procedure to fit a logistic regression model:

Subset = RandomThreeWay.proc(FirstFrame)
IndxA = which(Subset == "A")
LogisticModel = glm(PoisonBinary ~ Covar, data = FirstFrame, subset = IndxA, family=binomial())

Note that here I have specified the model form using the R formula construction “PoisonBinary ~ Covar”, I have used the subset argument of the glm procedure to specify that I only want to fit the model to subset A, and I have specified “family = binomial()” to request a logistic regression model. Once I have this model, I evaluate it using the concordance index C available from the somers2 function in the R package Hmisc. This value corresponds to the area under the ROC curve and is a measure of agreement between the predictions of the logistic regression model and the actual binary response. As discussed above, I want to do this evaluation for subset B to avoid an over-optimistic view of the model’s performance due to overfitting of subset A. To do this, I need the model predictions from subset B, which I obtain with the built-in predict procedure:

IndxB = which(Subset == "B")
PredPoisonProb = predict(LogisticModel, newdata = FirstFrame[IndxB,], type="response")
ObsPoisonBinary = FirstFrame$PoisonBinary[IndxB]

In addition, I have created the variable ObsPoisonBinary, the sequence of binary responses from subset B, which I will use in calling the somers2 function:

library(Hmisc)
somers2(PredPoisonProb, ObsPoisonBinary)
           C          Dxy            n      Missing
   0.7375031    0.4750063 2858.0000000    0.0000000

The results shown here include the concordance index C, an alternative (and fully equivalent) measure called Somers’ D (from which the procedure gets its name), the number of records in the dataset (here, in subset B), and the number of missing records (here, none). The concordance index C is a number that varies between 0 and 1, with values between 0.5 and 1.0 meaning that the predictions are better than random guessing, and values less than 0.5 indicating performance so poor that it is actually worse than random guessing. Here, the value of approximately 0.738 suggests that GillSize is a reasonable predictor of mushroom edibility, at least for mushrooms like those characterized in the UCI mushroom dataset.
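To see what the concordance index measures, it can be computed directly from its definition on a toy example: it is the fraction of (poisonous, edible) pairs in which the poisonous case receives the higher predicted probability, with ties counted as one half. The sketch below uses a hypothetical helper, PairwiseC, and made-up predictions; it is meant to illustrate the definition and the relation Dxy = 2(C - 0.5), not to replace somers2.

```r
# Concordance index from its definition: the fraction of (positive, negative)
# pairs in which the positive case gets the higher prediction; ties count 1/2
PairwiseC <- function(pred, y){
  Diff <- outer(pred[y == 1], pred[y == 0], "-")
  mean((Diff > 0) + 0.5 * (Diff == 0))
}

pred <- c(0.10, 0.40, 0.35, 0.80)   # toy predicted probabilities
y <- c(0, 0, 1, 1)                  # toy observed binary responses
CToy <- PairwiseC(pred, y)          # 3 of the 4 pairs are concordant
DxyToy <- 2 * (CToy - 0.5)          # Somers' D from C
print(c(C = CToy, Dxy = DxyToy))

# The same relation applied to the C value reported above for GillSize
print(2 * (0.7375031 - 0.5))
```

Applied to the reported C of 0.7375031, this relation reproduces the reported Dxy of 0.4750063 to within rounding, which is the sense in which the two measures are equivalent.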

Repeating this process for all six of the mushroom characteristics identified as potentially predictive by the interestingness change analysis I discussed last time leads to the following results:

            Pop:          C = 0.753    (6 levels)
            Bruises:      C = 0.740    (2 levels)
            GillSize:     C = 0.738    (2 levels)
            GillSpace:    C = 0.635    (2 levels)
            CapSurf:      C = 0.595    (4 levels)
            StalkShape:   C = 0.550    (2 levels)

These results leave open the questions of whether other mushroom characteristics, not identified on the basis of their interestingness shifts, are in fact more predictive of edibility, or how much better the predictions can be if we use more than one prediction variable. I will examine those questions in subsequent posts, using the ideas outlined here. For now, it is enough to note that one advantage of the approach described here, relative to that using odds ratios for selected covariates discussed last time, is that this approach can be used to assess the potential prediction power of categorical variables with arbitrary numbers of levels, while the odds ratio approach is limited to two-level predictors.

Interestingness comparisons

2012-05-19T13:06:00.000-07:00

In three previous posts (April 3, 2011, April 12, 2011, and May 21, 2011), I have discussed interestingness measures, which characterize the distributional heterogeneity of categorical variables. Four specific measures are discussed in Chapter 3 of Exploring Data in Engineering, the Sciences, and Medicine: the Bray measure, the Gini measure, the Shannon measure, and the Simpson measure. All four of these measures vary from 0 to 1 in value, exhibiting their minimum value when all levels of the variable are equally represented, and exhibiting their maximum value when the variable is completely concentrated on a single one of its several possible levels. Intermediate values correspond to variables that are more or less homogeneously distributed: more homogeneous for smaller values of the measure, and less homogeneous for larger values. One of the points I noted in my first post on this topic was that the different measures exhibit different behavior for the intermediate cases, reflecting different inherent sensitivities to the various ways in which a variable can be “more homogeneous” or “less homogeneous.” This post examines changes in interestingness measures as a potential exploratory analysis tool for selecting categorical predictors of some binary response. In fact, I examined the same question from a different perspective in my April 12 post noted above: the primary difference is that there, the characterization I considered generates a single graph for each variable, with the number of points on the graph corresponding to the number of levels of the variable. Here, I examine a characterization that represents each variable as a single point on the graph, allowing us to consider all variables simultaneously.
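
As a rough illustration of how two of these measures behave at their extremes, the sketch below implements normalized Shannon and Gini measures for a vector of level counts. The specific normalizations used here are assumptions (common forms that equal 0 for a uniform distribution and 1 for complete concentration); the definitions in Chapter 3 of Exploring Data may differ in detail, and the function names are hypothetical.

```r
# Assumed normalization: 1 + sum(p*log(p))/log(M), M = number of levels
ShannonMeasure <- function(counts){
    M = length(counts)
    if (M < 2) return(0)    # single-level variable: default value of zero
    p = counts/sum(counts)
    p = p[p > 0]            # treat 0*log(0) as 0
    1 + sum(p*log(p))/log(M)
}

# Assumed normalization: sum over all pairs |p_i - p_j|, divided by 2(M-1)
GiniMeasure <- function(counts){
    M = length(counts)
    if (M < 2) return(0)
    p = counts/sum(counts)
    sum(abs(outer(p, p, "-")))/(2*(M - 1))
}

# Toy count vectors, invented for illustration
uniform      = c(25, 25, 25, 25)
concentrated = c(100, 0, 0, 0)
skewed       = c(90, 10)
print(c(ShannonMeasure(uniform), GiniMeasure(uniform)))
print(c(ShannonMeasure(concentrated), GiniMeasure(concentrated)))
print(c(ShannonMeasure(skewed), GiniMeasure(skewed)))
```

Under these assumed definitions, both measures vanish for the uniform case and equal 1 for the concentrated case, while they disagree for the intermediate (skewed) case.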

As a reminder of how these measures behave, the figure above shows a plot of the normalized Gini measure versus the normalized Shannon measure for the 23 categorical variables included in the mushroom dataset from the UCI Machine Learning Repository. As I have noted in several previous posts that have discussed this dataset, it gives observable characteristics for 8,124 mushrooms and classifies each one as either edible or poisonous (the binary variable EorP). The above plot illustrates the systematic difference between the normalized Shannon and Gini interestingness measures: there, each point represents one of the 23 variables in the dataset, with the horizontal axis representing the Shannon measure computed for the variable and the vertical axis representing the corresponding Gini measure. The plot shows that the Gini measure is consistently larger than the Shannon measure, since all points lie above the equality reference line in this plot except for the single point at the origin. This point corresponds to the variable VeilType, which only exhibits a single value in this dataset, meaning that both the Gini and Shannon measures are inherently ill-defined; consequently, they are given the default value of zero here, consistent with the general interpretation of these measures: if a variable only assumes a single value, it seems reasonable to consider it “completely homogeneous.”

Because edible and poisonous mushrooms are fairly evenly represented in this dataset (51.8% edible versus 48.2% poisonous), it has been widely used as one of several benchmarks for evaluating classification algorithms. In particular, given the other mushroom characteristics, the fundamental classification question is how well can we predict whether each mushroom is poisonous or edible. In this post and a subsequent follow-up post, I consider a closely related question: can differences in a variable’s interestingness measure between the edible subset and the poisonous subset be used to help us select prediction covariates for these classification algorithms? In this post, I present some preliminary evidence to suggest that this may be the case, while in a subsequent post, I will put the question to the test by seeing how well the covariates suggested by this analysis actually predict edibility.

The specific idea I examine here is the following: given an interestingness measure and a mushroom characteristic, compute this measure for the chosen characteristic, applied to the edible and poisonous mushrooms separately. If these numbers are very different, this suggests that the distribution of levels is different for edible and poisonous mushrooms, further suggesting that this variable may be a useful predictor of edibility. To turn this idea into a data analysis tool, it is necessary to define what we mean by “very different,” and this can be done in more than one way. Here, I consider two possibilities. The first is what I call the “normalized difference,” defined as the difference of the two interestingness measures divided by their sum. Since both interestingness measures lie between 0 and 1, it is not difficult to show that this normalized difference lies between -1 and +1. As a specific application of this idea, consider the plot below, which shows the normalized difference in the Gini measure between the poisonous mushrooms and the edible mushrooms (the normalized Gini shift) plotted against the corresponding difference for the Shannon measure (the normalized Shannon shift). In addition, this plot shows an equality reference line, and the fact that the points consistently lie between this line and the horizontal axis shows that the normalized Gini shift is consistently smaller in magnitude than the normalized Shannon shift. This suggests that the normalized Shannon measure may be more sensitive to distributional differences between edible and poisonous mushrooms.
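
The normalized difference can be sketched in a few lines of R. The code below is a hedged illustration rather than the code used to produce the plots: the sign convention (poisonous minus edible), the convention of returning zero when both measures vanish, the Shannon normalization, and the toy data are all assumptions, and the function names are hypothetical.

```r
# Assumed Shannon normalization: 0 for uniform, 1 for fully concentrated
ShannonMeasure <- function(counts){
    M = length(counts)
    if (M < 2) return(0)
    p = counts/sum(counts)
    p = p[p > 0]
    1 + sum(p*log(p))/log(M)
}

# Normalized shift of a measure between the "p" and "e" subsets
NormalizedShift <- function(x, cls, measure){
    x = factor(x)           # keep the same level set in both subsets
    Ip = measure(table(x[cls == "p"]))
    Ie = measure(table(x[cls == "e"]))
    if (Ip + Ie == 0) return(0)   # assumed convention: no shift
    (Ip - Ie)/(Ip + Ie)
}

# Toy binary characteristic, invented for illustration:
# strongly concentrated among "p" records, nearly balanced among "e"
x   = c(rep("n", 45), rep("b", 5), rep("n", 20), rep("b", 30))
cls = c(rep("p", 50), rep("e", 50))
shift = NormalizedShift(x, cls, ShannonMeasure)
print(shift)
```

Since both measures are nonnegative, the magnitude of their difference can never exceed their sum, which is why the result is confined to the interval from -1 to +1.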

The next figure, below, shows a re-drawn version of the above plot, with the equality reference line removed and replaced by four other reference lines. The vertical dashed lines correspond to the outlier detection limits obtained by the Hampel identifier with threshold value t = 2 (see Chapter 7 of Exploring Data for a detailed discussion of this procedure), computed from the normalized Shannon shift values, while the horizontal dashed lines represent the corresponding limits computed from the normalized Gini shift values. Points falling outside these limits represent variables whose changes in both Gini measure and Shannon measure are “unusually large” according to the Hampel identifier criteria used here. These points are represented as solid circles, while those not detected as “unusual” by the Hampel identifier are represented as open circles. The idea proposed here – to be investigated in a future post – is that these outlying variables may be useful in predicting mushroom edibility.
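
For readers unfamiliar with the Hampel identifier, the following minimal sketch shows the basic computation: a point is flagged when it lies more than t MADM scale estimates from the median. It relies on the built-in R function mad, which returns the median absolute deviation scaled by 1.4826; the function name HampelOutliers and the shift values are invented for illustration and are not the values from the mushroom analysis.

```r
# Hypothetical helper: Hampel identifier with threshold t
HampelOutliers <- function(x, t = 2){
    center = median(x)
    scale = mad(x)          # 1.4826 * median(|x - median(x)|)
    lower = center - t*scale
    upper = center + t*scale
    list(lower = lower, upper = upper,
         outlier = (x < lower) | (x > upper))
}

# Invented shift values: most near zero, two far from the rest
shifts = c(-0.9, -0.05, 0.0, 0.02, 0.03, 0.05, 0.06, 0.08, 0.6)
result = HampelOutliers(shifts, t = 2)
print(result$lower)
print(result$upper)
print(which(result$outlier))
```

In the plot described above, this computation would be applied twice, once to the normalized Shannon shifts (giving the vertical limits) and once to the normalized Gini shifts (giving the horizontal limits).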

More specifically, the five solid circles in the above plot correspond to the following mushroom characteristics. The two points in the lower left corner of the plot – exhibiting almost the most negative normalized Shannon shift possible – correspond to GillSize and StalkShape, two binary variables. As I discussed in a previous post (May 7, 2011) and I discuss further in Chapter 13 of Exploring Data, an extremely useful measure of association between two binary variables (e.g., between GillSize and edibility) is the odds ratio. An examination of the odds ratios for these two variables suggests that both should be at least somewhat predictive of edibility: the odds ratio between GillSize and edibility is 0.056, suggesting a very strong association (specifically, a GillSize value of “n” for “narrow” is most commonly associated with poisonous mushrooms in the UCI mushroom dataset), while the odds ratio between StalkShape and edibility is less extreme at 1.511, but still different enough from the neutral value of 1 to be suggestive of a clear association between these variables (a StalkShape value of “t” is more strongly associated with edible mushrooms than the alternative value of “e”). The solid circle in the upper right of this plot corresponds to the variable CapSurf, which has four levels and whose distributional homogeneity appears to change quite substantially, according to both the Gini and Shannon measures. Because this variable has more than two levels, it is not possible to characterize its association in terms of its odds ratio relative to edibility. Finally, the cluster of three points in the upper right, just barely above the upper horizontal dashed line, correspond to the binary variables Bruises and GillSpace, and the six-level variable Pop. Both of these binary variables exhibit very large odds ratios with respect to edibility (9.97 and 13.55 for Bruises and GillSpace, respectively), again suggesting that these variables may be highly predictive of edibility.
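
As a reminder of how an odds ratio is obtained from two binary variables, here is a minimal sketch based on the 2x2 contingency table. The function name OddsRatio and the toy data are assumptions made for illustration; the counts below are invented and do not come from the mushroom dataset.

```r
# Hypothetical helper: odds ratio from the 2x2 table of two binary variables
OddsRatio <- function(x, y){
    tbl = unclass(table(x, y))*1    # convert counts to double
    stopifnot(all(dim(tbl) == c(2, 2)))
    (tbl[1, 1]*tbl[2, 2])/(tbl[1, 2]*tbl[2, 1])
}

# Invented example: characteristic x with levels "b" and "n",
# response y with levels "e" and "p"
x = rep(c("b", "n"), times = c(50, 50))
y = c(rep("e", 40), rep("p", 10), rep("e", 10), rep("p", 40))
print(OddsRatio(x, y))
print(OddsRatio(x, ifelse(y == "e", "p", "e")))
```

Note that the result depends on how the levels are ordered: exchanging the two levels of either variable replaces the odds ratio by its reciprocal, so values far below 1 and values far above 1 both indicate strong association.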

The prevalence of binary variables in these results is noteworthy, and it reflects the fact that distributional shifts for binary variables can only occur in one way (i.e., the relative frequency of either fixed level can either increase or decrease). Thus, large shifts in either interestingness measure should correspond to significant odds ratios with respect to the binary response variable, and this is seen to be the case here. The situation is more complicated when a variable exhibits more than two levels, since the distribution of these levels can change in many ways between the two binary response values. An important advantage of techniques like the interestingness shift analysis described here is that they are not restricted to binary characteristics, as odds ratio characterizations are.

The second approach I consider for measuring the shift in interestingness between edible and poisonous mushrooms is what I call the “marginal measure,” corresponding to the difference in either the Gini or the Shannon measure between poisonous and edible mushrooms, divided by the original measure for the complete dataset. An important difference between the marginal measure and the normalized measure is that the marginal measure is not bounded to lie between -1 and +1, as is evident in the plot below. This plot shows the marginal Gini shift against the marginal Shannon shift for the mushroom characteristics, in the same format as the plot above. Here, only four points are flagged as outliers, corresponding to the four binary variables identified above from the normalized shift plot: Bruises (the point in the extreme upper right), GillSpace (the point just barely in the upper right quadrant), and GillSize and StalkShape (the two points in the extreme lower left). However, if we lower the Hampel identifier threshold from t = 2 to t = 1.5, we again identify CapSurf and Pop as potentially influential variables.

This last observation suggests an alternative interpretation approach that may be worth exploring. Specifically, both of the two previous plots give clear visual evidence of “cluster structure,” and the Hampel identifier does extract some or all of this structure from the plot, but only if we apply a sufficiently judicious tuning to the threshold parameter. A possible alternative would be to apply cluster analysis procedures, and this will be the subject of one or more subsequent posts. In particular, there are many different clustering algorithms that could be applied to this problem, and the results are likely to be quite different. The key practical question is which ones – if any – lead to useful ways of grouping these mushroom characteristics. Subsequent posts will examine this question further from several different perspectives.

David Olive’s median confidence interval

2012-04-21T11:38:00.000-07:00

As I have discussed in a number of previous posts, the median represents a well-known and widely-used estimate of the “center” of a data sequence. Relative to the better-known mean, the primary advantage of the median is its much reduced outlier sensitivity. This post briefly describes a simple confidence interval for the median that is discussed in a paper by David Olive, available on-line via the following link:

http://www.math.siu.edu/olive/ppmedci.pdf

As Olive notes in his paper and I further demonstrate in this post, an advantage of his confidence interval for the median is that it provides a simple, numerical way of identifying situations where the data values deserve a careful, graphical look. In particular, he advocates comparing the traditional confidence interval for the mean with his confidence interval for the median: if these intervals are markedly different, it is worth investigating to understand why. This strategy may be viewed as a particular instance of Colin Mallows’ “compute and compare” advice, discussed at the end of Chapter 7 of Exploring Data in Engineering, the Sciences, and Medicine. The key idea here is that under “standard” working assumptions – i.e., distributional symmetry and approximate normality – the mean and the median should be approximately the same: if they are not, it probably means these working assumptions have been violated, due to outliers in the data, pronounced distributional asymmetry, or other less common phenomena like strongly multimodal data distributions or coarse quantization. In the increasingly common case where we have a lot of numerical variables to consider, it may be undesirable or infeasible to examine them all graphically: numerical comparisons like the one described here may be automated and used to point us to subsets of variables that we really need to look at further. In addition to describing this confidence interval estimator and illustrating it for three examples, this post also provides the R code to compute it.

As a first example, the plot above shows the makeup flow rate dataset discussed in Exploring Data and available as the makeup dataset (makeup.csv) from the book's companion website. This plot shows 2,589 successive observations of the measured flow rate of a solvent recycle stream in an industrial manufacturing process. In normal operation, this flow rate is just under 400 – in fact, the median flow rate is 393.36 – but this data record also includes measurements during time intervals when the process is either being shut down, is not running, or is being started back up, and during these periods the measured flow rates decrease toward zero, are approximately equal to zero, and increase from zero back to approximately 400, respectively. Because of the presence of these anomalous segments in the data, the mean value is much smaller than the median: specifically, the mean is 315.46, actually serving as a practical dividing line between the normal operation segments (i.e., those data points that lie above the mean) and the shutdown segments (i.e., those data points that lie below the mean). The dashed lines in this plot at 309.49 and 321.44 correspond to the classical 95% confidence interval for the mean, computed as described below. In contrast, the dotted lines at 391.83 and 394.88 correspond to Olive’s 95% confidence interval for the median, also described below. Before proceeding to a more detailed discussion of how these lines were determined, the three primary points to note from this figure are, first, that the two confidence intervals are very different (e.g., they do not overlap at all), second, that the mean confidence intervals are much wider than those for the median in this case, and third, that the median confidence interval lies well within the range of the normal operating data, while the mean confidence interval does not. It is also worth noting that, if we simply remove the shutdown episodes from this dataset, the mean of this edited dataset is 397.7, a value that lies slightly above the upper 95% confidence limit for the median, but only slightly so (this and other data cleaning strategies for this dataset are discussed in some detail in Chapter 7 of Exploring Data).

Both the classical confidence interval for the mean and David Olive’s confidence interval for the median are based on the fact that these estimators are asymptotically normal: for a sufficiently large data sample, both the estimated mean and the estimated median approach the correct limits for the underlying data distribution, with a standard deviation that decreases inversely with the square root of the sample size. Using this description directly would lead to confidence intervals based on the quantiles of the Gaussian distribution, but for small to moderate-sized samples, more accurate confidence intervals are obtained by replacing these Gaussian quantiles with those for the Student’s t-distribution with the appropriate number of degrees of freedom. More specifically, for the mean, the confidence interval at a given level p is of the form:

CI = (Mean – c_p SE, Mean + c_p SE),

where c_p is the constant derived from the Gaussian or Student’s t-distribution, and SE is the standard error of the mean, equal to the usual standard deviation estimate divided by the square root of the number of data points. (For a more detailed discussion of the math behind these results, refer to either Chapter 9 of Exploring Data or to David Olive’s paper, available through the link given above.) For the median, Olive provides a simple estimator for the standard error, described further in the next paragraph. First, however, it is worth saying a little about the difference between the Gaussian and Student’s t-distribution in these results. Probably the most commonly used confidence intervals are the 95% intervals – these are the confidence intervals shown in the plot above for the makeup flow rate data – which represent the interval that should contain the true distribution mean with probability at least 95%. In the Gaussian case, the constant c_p for the 95% confidence interval is approximately 1.96, while for the Student’s t-distribution, this number depends on the degrees of freedom parameter. In the case of the mean, the degrees of freedom is one less than the sample size, while for the median confidence intervals described below, this number is typically much smaller. The difference between these distributions is that the c_p parameter decreases from a very large value for few degrees of freedom – e.g., the 95% parameter value is 12.71 for a single degree of freedom – to the Gaussian value (e.g., 1.96 for the 95% case) in the limit of infinite degrees of freedom. Thus, using Student’s t-distribution instead of the Gaussian distribution results in wider confidence intervals, wider by the ratio of the Student’s t value for c_p to the Gaussian value. The plot below shows this ratio for the 95% parameter c_p as the degree of freedom parameter varies between 5 and 200, with the dashed line corresponding to the Gaussian limit when this ratio is equal to 1.
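
The dependence of c_p on the degrees of freedom is easy to tabulate with the built-in quantile functions qt and qnorm. The short sketch below does this for a handful of degrees of freedom values chosen here purely for illustration (they are not the grid used for the plot).

```r
# 95% two-sided constants: Student's t versus the Gaussian limit
dof = c(1, 5, 16, 52, 200, 2588)    # illustrative choices
tq = qt(p = 0.975, df = dof)
gq = qnorm(0.975)                   # approximately 1.96
ratio = tq/gq                       # widening factor relative to Gaussian
print(data.frame(dof = dof, c_p = round(tq, 3), ratio = round(ratio, 3)))
```

The ratio column shows how much wider the Student's t-based interval is than the Gaussian one, decreasing toward 1 as the degrees of freedom grow.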

The general structure of Olive’s confidence interval for the median is exactly analogous to that for the mean given above:

CI = (Median – c_p SE, Median + c_p SE)

The key result of Olive’s paper is a simple estimator for the standard error SE, based on order statistics (i.e., rank-ordered data values like the minimum, median, and maximum). Instead of describing these results mathematically, I have included an R procedure that computes the median, Olive’s standard error, the corresponding confidence intervals, and the classical results for the mean (again, for the mathematical details, refer to Olive’s paper; for a more detailed discussion of order statistics, refer to Chapter 6 of Exploring Data). Specifically, the following R procedure is called with a vector y of numerical data values, and the default level of the resulting confidence interval is 95%, although this level can be changed by specifying an alternative value of alpha (this is 1 minus the confidence level, so alpha is 0.05 for the 95% case, 0.01 for 99%, etc.).

DOliveCIproc <- function(y, alpha = 0.05){
#
# This procedure implements David Olive's simple
# median confidence interval, along with the standard
# confidence interval for the mean, for comparison
#
# First, compute the median
#
n = length(y)
ysort = sort(y)
nhalf = floor(n/2)
if (2*nhalf < n){
    # n odd
    med = ysort[nhalf + 1]
}
else{
    # n even
    med = (ysort[nhalf] + ysort[nhalf+1])/2
}
#
# Next, compute Olive’s standard error for the median
#
Ln = nhalf - ceiling(sqrt(n/4))
Un = n - Ln
SE = 0.5*(ysort[Un] - ysort[Ln+1])
#
# Compute the confidence interval based on Student’s t-distribution
# The degrees of freedom parameter p is discussed in Olive’s paper
#
p = Un - Ln - 1
t = qt(p = 1 - alpha/2, df = p)
medLCI = med - t * SE
medUCI = med + t * SE
#
# Next, compute the mean and its classical confidence interval
#
mu = mean(y)
SEmu = sd(y)/sqrt(n)
tmu = qt(p = 1 - alpha/2, df = n-1)
muLCI = mu - tmu * SEmu
muUCI = mu + tmu * SEmu
#
# Finally, return a data frame with all of the results computed here
#
OutFrame = data.frame(Median = med, LCI = medLCI, UCI = medUCI,
                        Mean = mu, MeanLCI = muLCI, MeanUCI = muUCI,
                        N = n, dof = p, tmedian = t, tmean = tmu,
                        SEmedian = SE, SEmean = SEmu)
OutFrame
}

Briefly, this procedure performs the following computations. The first portion of the code computes the median, defined as the middle element of the rank-ordered list of samples if the number of samples n is odd, and the average of the two middle samples if n is even. Note that the even/odd character of n is determined by using the floor function in R: floor(n/2) is the largest integer that does not exceed n/2. Thus, if n is odd, the floor function rounds n/2 down to its integer part, so the product 2 * floor(n/2) is less than n, while if n is even, floor(n/2) is exactly equal to n/2, so this product is equal to n. In addition, both the floor function and its opposite function ceiling are needed to compute the value Ln used in computing Olive’s standard error for the median. The c_p values correspond to the parameters t and tmu that appear in this function, computed from the built-in R function qt (which returns quantiles of the t-distribution). Note that for the median, the degrees of freedom supplied to this function is p, which tends to be much smaller than the degrees of freedom value n-1 for the mean confidence interval computed in the latter part of this function.

As a specific illustration of the results generated by this procedure, applying it to the makeup flow rate data sequence yields:

> DOliveCIproc(makeupflow)
    Median      LCI      UCI     Mean  MeanLCI  MeanUCI    N dof  tmedian
1 393.3586 391.8338 394.8834 315.4609 309.4857 321.4361 2589  52 2.006647
     tmean SEmedian   SEmean
1 1.960881  0.75987 3.047188

These results were used to construct the confidence interval lines in the makeup flow rate plot shown above. In addition, note that these results also illustrate the point noted in the preceding discussion about the degrees of freedom used in constructing the Student’s t-based confidence intervals. For the mean, the degrees of freedom is N-1, which is 2588 for this example, meaning that there is essentially no difference in this case between these confidence intervals and those based on the Gaussian limiting distribution. In contrast, for the median, the degrees of freedom is only 52, giving a c_p value that is about 2.5% larger than the corresponding Gaussian case; for the next example, the degrees of freedom is only 16, making this parameter about 8% larger than the Gaussian limit.
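
To see how the degrees of freedom for the median interval grows with the sample size, the following small sketch isolates that part of the procedure given above. The helper name OliveDof is hypothetical; the arithmetic is copied from DOliveCIproc.

```r
# Degrees of freedom p = Un - Ln - 1 used for Olive's median interval,
# as computed inside DOliveCIproc
OliveDof <- function(n){
    Ln = floor(n/2) - ceiling(sqrt(n/4))
    Un = n - Ln
    Un - Ln - 1
}

n = c(200, 201, 2589)
print(data.frame(n = n, dof = OliveDof(n), sqrtn = round(sqrt(n), 1)))
# dof is 15, 16, and 52 for these three sample sizes
```

In each case the degrees of freedom is close to the square root of the sample size, which is why it is so much smaller than the value n-1 used for the mean.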

One of the points I discussed in my last post was the instability of the median relative to the mean, a point I illustrated with the plot shown above. This is a simulation-based dataset consisting of three parts: the first 100 points are narrowly distributed around the value +1, the 101st point is exactly zero, and the last 100 points are narrowly distributed around the value -1. As I noted last time, removing two points from either the first group or the last group can profoundly alter the median, while having very little effect on the mean. The figure shown above includes, in addition to the data values, the 95% confidence intervals for both the mean (the dotted lines in the center of the plot) and the median (the heavy dashed lines at the top and bottom of the plot). Here, the fact that the median confidence interval is enormously wider (by almost a factor of 13) than the mean confidence interval gives an indication of the instability of the median. In fact, the data distribution in this example is strongly bimodal, corresponding to a case where order statistic-based estimators like the median and Olive’s standard error for it perform poorly, a point discussed in Chapter 7 of Exploring Data.

One of the other important cases where estimators based on order statistics can perform poorly is that of coarsely quantized data, such as temperatures recorded only to the nearest tenth of a degree. The difficulty with these cases is that coarse quantization profoundly changes the nature of the data distribution. Specifically, it is a standard result in statistics that the probability of any two samples drawn from a continuous distribution having exactly the same value is zero, but this is no longer true for discrete distributions (e.g., count data), and coarse quantization introduces an element of discreteness into the data distribution. The above figure illustrates this point for a simple simulation-based example. The upper left plot shows a random sample of size 200 drawn from a zero-mean, unit-variance Gaussian distribution, and the upper right plot shows the effects of quantizing this sample, rounding it to the nearest half-integer value. The lower two plots are normal quantile-quantile plots generated by the R command qqPlot from the car package: in the lower left plot, almost all of the points fall within the 95% confidence interval around the normal reference line for this plot, while many of the points fall somewhat outside these confidence limits in the plot shown in the lower right. The greatest difference, however, is in the “staircase” appearance of this lower right plot, reflecting the effects of the coarse quantization on this data sample: each “step” corresponds to a group of samples that have exactly the same value.

The influence of this quantization on Olive’s confidence interval for the median is profound: for the original Gaussian data sequence, the 95% confidence interval for the median is approximately (-0.222, 0.124), compared with (-0.174, 0.095) for the mean. These results are consistent with our expectations: since the mean is the best possible location estimator for Gaussian data, it should give the narrower confidence interval, and it does. For the quantized case, the 95% confidence interval for the mean is (-0.194, 0.079), fairly similar to that for the original data sequence, but the confidence interval for the median reduces to the single value zero. This result represents an implosion of Olive’s standard error estimator for the median, exactly analogous to the behavior of the MADM scale estimate that I have discussed previously when a majority of the data values (i.e., more than 50% of them) are identical. Here, the situation is more serious, since the MADM scale estimate does not implode for this example: the MADM scale for the original data sequence is 0.938, versus 0.741 for the quantized sequence. The reason Olive’s standard error estimator is more prone to implosion in the face of coarse quantization is that it is based on a small subset of the original data sample. In particular, the size of the subsample on which this estimator is based is p, the degrees of freedom for the t-distribution used in constructing the corresponding confidence interval, and this number is approximately the square root of the sample size. Thus, for a sample of size 200 like the example considered here, MADM scale implosion requires just over half the sample to have the same value – 101 data points in this case – whereas Olive’s standard error estimator for the median can implode if 16 or more samples have the same value, and this is exactly what happens here: the median value is zero, and this value occurs 39 times in the quantized data sequence.
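
The implosion is easy to reproduce in a small sketch. Note the assumptions: instead of the random sample used for the figure, the code below uses 200 evenly spaced Gaussian quantiles as a deterministic stand-in, so its numbers for the unquantized data will not match those quoted above; the helper name OliveSE is hypothetical, with its arithmetic copied from DOliveCIproc.

```r
# Olive's standard error for the median, as in DOliveCIproc
OliveSE <- function(y){
    n = length(y)
    ysort = sort(y)
    Ln = floor(n/2) - ceiling(sqrt(n/4))
    Un = n - Ln
    0.5*(ysort[Un] - ysort[Ln + 1])
}

# Deterministic stand-in for a Gaussian sample of size 200 (an assumption)
x = qnorm(ppoints(200))
xq = round(2*x)/2           # quantize to the nearest half-integer

print(c(SEraw = OliveSE(x), SEquantized = OliveSE(xq)))
print(c(MADMraw = mad(x), MADMquantized = mad(xq)))
print(sum(xq == 0))         # number of values quantized to zero
```

For this stand-in sequence, the order statistics that define Olive's standard error all fall in the block of values quantized to zero, so the estimate collapses to zero, while the MADM scale estimate remains positive because far fewer than half of the values are identical.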

David Olive’s confidence interval for the median is easily computed and represents a useful adjunct to the median as a characterization of numerical variables. As Olive advises, there is considerable advantage in computing and comparing both his median confidence interval and the corresponding standard confidence interval around the mean. Although in the summary of his paper, Olive only mentions outliers as a potential cause of substantial differences between these two confidence intervals, this post has illustrated that disagreements can also arise from other causes, including light-tailed, bimodal, or coarsely quantized data, much like the situation with the MADM scale estimate versus the standard deviation. In fact, as the last example discussed here illustrates, Olive’s standard error estimator for the median and the confidence intervals based on it can implode – exactly like the MADM scale estimate – in the face of coarsely quantized data. Moreover, the implosion problem for Olive’s median standard error estimator is potentially more severe, again as illustrated in the previous example. Finally, it is worth noting that Olive’s paper also discusses confidence intervals for trimmed means.

Gastwirth’s location estimator

2012-03-03T15:25:00.000-08:00

The problem of outliers – data points that are substantially inconsistent with the majority of the other points in a dataset – arises frequently in the analysis of numerical data. The practical importance of outliers lies in the fact that even a few of these points can badly distort the results of an otherwise reasonable data analysis. This outlier-sensitivity problem is often particularly acute for classical data characterizations and analysis methods like means, standard deviations, and linear regression analysis. As a consequence, a range of outlier-resistant methods have been developed for many different applications, and new methods continue to be developed. For example, the R package robustbase that I have discussed in previous posts includes outlier-resistant methods for estimating location (i.e., outlier-resistant alternatives to the mean), estimating scale (outlier-resistant alternatives to the standard deviation), quantifying asymmetry (outlier-resistant alternatives to the skewness), and fitting regression models. In Exploring Data in Engineering, the Sciences, and Medicine, I discuss a number of outlier-resistant methods for addressing some of these problems, including Gastwirth’s location estimator, an alternative to the mean that is the subject of this post.

The mean is the best-known location estimator, and it gives a useful assessment of the “typical” value of any numerical sequence that is reasonably symmetrically distributed and free of outliers. The outlier-sensitivity of the mean is severe, however, which motivates the use of outlier-resistant alternatives like the median. While the median is almost as well-known as the mean and extremely outlier-resistant, it can behave unexpectedly (i.e., “badly”) as a result of its non-smooth character. This point is illustrated in Fig. 7.23 in Exploring Data, identical in character to the figure shown below (this figure is slightly different because it uses a different seed to generate the random numbers on which it is based). Specifically, this plot shows a sequence of 201 data points, constructed as follows. The first 100 points are normally distributed with mean 1 and standard deviation 0.1, the 101st point is equal to zero, and points 102 through 201 are normally distributed with mean -1 and standard deviation 0.1. Small changes in this dataset in the specific form of deleting points can result in very large changes in the computed median. Specifically, in this example, the first 100 points lie between 0.768 and 1.185 and the last 100 points lie between -0.787 and -1.282; because the central data point lies between these two equal-sized groups, it defines the median, which is 0. The mean is quite close to this value, at -0.004, but the situation changes dramatically if we omit either the first two or the last two points from this data sequence. Specifically, the median value computed from points 1 through 199 is 0.768, while that computed from points 3 through 201 is -0.787. In contrast, the mean values for these two modified sequences are 0.006 and -0.014. Thus, although the median is much less sensitive than the mean to contamination from outliers, it is extremely sensitive to the 1% change made in this example for this particular dataset.
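
The construction just described can be reproduced with a few lines of R. This is a sketch under stated assumptions: the seed below is arbitrary (it is not the one used for the figure), so the specific numbers will differ from those quoted above, but the qualitative behavior should be the same.

```r
set.seed(101)   # arbitrary seed, chosen here for reproducibility only
x = c(rnorm(100, mean = 1, sd = 0.1), 0, rnorm(100, mean = -1, sd = 0.1))

# Medians for the full sequence and for the two shortened sequences
medians = c(all = median(x),
            first199 = median(x[1:199]),
            last199 = median(x[3:201]))

# Corresponding means
means = c(all = mean(x),
          first199 = mean(x[1:199]),
          last199 = mean(x[3:201]))

print(round(medians, 3))
print(round(means, 3))
```

Dropping the last two points makes the smallest value in the first group the new median, while dropping the first two makes the largest value in the last group the new median; the means barely move in either case.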

The fact that the median is not “universally the best location estimator” provides a practical motivation for examining alternatives that are intermediate in behavior between the very smooth but very outlier-sensitive mean and the very outlier-insensitive but very non-smooth median. Some of these alternatives were examined in detail in the book Robust Estimates of Location: Survey and Advances, by D.F. Andrews, P.J. Bickel, F.R. Hampel, P.J. Huber, W.H. Rogers, and J.W. Tukey, published by Princeton University Press in 1972 (according to the publisher's website, this book is out of print, but used copies are available through distributors like Amazon or Barnes and Noble). The book summarizes the results of a year-long study of 68 different location estimators, including both the mean and the median. The fundamental criteria for inclusion in this study were, first, that the estimators had to be computable from any given sequence of real numbers, and second, that they had to be both location and scale-invariant. Specifically, if a given data sequence {x_k} yielded a result m, the scaled and shifted data sequence {Ax_k + b} should yield the result Am+b, for any numbers A and b. The study was co-authored by six statistical researchers with differing opinions and points of view, but two of the authors – D.F. Andrews and F.R. Hampel – included the Gastwirth estimator (described in detail below) in their list of favorites. For example, Hampel characterized this estimator as one of a small list of those that were “never bad at the distributions considered.” Also, in contrast to many of the location estimators considered in the study, Gastwirth’s estimator does not require iterative computations, making it simpler to implement.

Specifically, Gastwirth’s location estimator is a weighted sum of three order statistics. That is, to compute this estimator, we first sort the data sequence in ascending order. Then, we take the values that are one-third of the way up this sequence (the 0.33 quantile), half way up the sequence (i.e., the median, or 0.50 quantile), and two-thirds of the way up the sequence (the 0.67 quantile). Given these three values, we then form the weighted average, giving the central (median) value a weight of 40% and the two extreme values each a weight of 30%. This is extremely easy to do in R, with the following code:

Gastwirth <- function(x,...){
#
ordstats = quantile(x, probs=c(1/3,1/2,2/3),...)
wts = c(0.3,0.4,0.3)
sum(wts*ordstats)
#
}

The key part of this code is the first line, which computes the required order statistics (i.e., the quantiles 1/3, 1/2, and 2/3) using the built-in quantile function. The first argument passed to this function is x, the vector of data values to be characterized, and the second argument (probs) defines the specific quantiles we wish to compute. The ellipsis in the Gastwirth procedure’s argument list is passed on to the quantile function; several parameters are possible (type “help(quantile)” in your R session for details), but one of the most useful is na.rm, a logical variable that specifies how missing data values are to be handled. The default is “FALSE”, in which case the quantile function, and therefore the Gastwirth procedure, stops with an error if any values of x are missing; the alternative “TRUE” computes the Gastwirth estimator from the non-missing values, giving a numerical result. The three-element vector wts defines the quantile weights that define the Gastwirth estimator, which the final sum statement computes.
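As a small usage sketch, the block below repeats the function so that it runs on its own, then exercises the na.rm argument and checks the location and scale requirement described earlier (the constants A and b are arbitrary illustrative choices, not values from the study).

```r
# Gastwirth's estimator, repeated here so the example is self-contained
Gastwirth <- function(x, ...) {
  ordstats <- quantile(x, probs = c(1/3, 1/2, 2/3), ...)
  sum(c(0.3, 0.4, 0.3) * ordstats)
}

g_clean <- Gastwirth(1:9)                        # no missing values
g_narm  <- Gastwirth(c(1:9, NA), na.rm = TRUE)   # NA dropped before computing

# Without na.rm = TRUE, quantile() refuses input that contains NA;
# tryCatch() captures the resulting error instead of halting the script.
g_na <- tryCatch(Gastwirth(c(1:9, NA)), error = function(e) e)

# Location/scale check: the estimate for A*x + b should equal A*m + b
A <- -2.5
b <- 7
g_scaled <- Gastwirth(A * (1:9) + b)
print(c(clean = g_clean, na_removed = g_narm,
        scaled = g_scaled, expected = A * g_clean + b))
```

For the sequence 1 through 9 the three quantiles are symmetric about 5, so the estimate is 5, and the scaled sequence gives -2.5 * 5 + 7 = -5.5.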

For the data example considered above, the Gastwirth estimator yields the location estimate -0.001 for the complete dataset, 0.308 for points 1 to 199 (vs. 0.768 for the median), and -0.317 for points 3 to 201 (vs. -0.787 for the median). Thus, while it does not perform nearly as well as the mean for this example, it performs substantially better than the median.

For the infinite-variance Cauchy distribution that I have discussed in several previous posts, the Gastwirth estimator performs similarly to the median, yielding a useful estimate of the center of the data distribution, in contrast to the mean, which doesn’t actually exist for this distribution (that is, the first moment does not exist for the Cauchy distribution). Still, the distribution is symmetric about zero, so the median is well-defined, as is the Gastwirth estimator, and both should be zero for this distribution. The above figure shows the results of applying these three estimators – the mean, the median, and Gastwirth’s estimator – to 1,000 independent random samples drawn from the Cauchy distribution. Specifically, this figure gives a boxplot summary of these results, truncated to the range from -3 to 3 to show the range of variation of the median and Gastwirth estimator (without this restriction, the boxplot comparison would be fairly non-informative, since the mean values range from approximately -161 to 27,793, reflecting the fact that the mean is not a consistent location estimator for the Cauchy distribution). To generate these results, the replicate function in R was used, followed by the apply function, as follows:

RandomSampleFrame = replicate(1000, rt(n=200,df=1))
BoxPlotVector = apply(RandomSampleFrame, MARGIN=2, Gastwirth)

The replicate function creates a matrix (not a data frame, despite the name used here) with the number of columns specified by the first argument (here, 1000), and each column generated by the R statement that appears as the second argument. In this case, this second argument is the command rt, which generates a sequence of n statistically independent random numbers drawn from the Student’s t-distribution with the number of degrees of freedom specified by the df argument (here, this is 1, corresponding to the fact that the Cauchy distribution is the Student’s t-distribution with 1 degree of freedom). Thus, RandomSampleFrame is a matrix with 200 rows and 1,000 columns, each of which may be regarded as a Cauchy-distributed random sample. The apply function applies the function specified in the third argument (here, the Gastwirth procedure listed above) to the columns (MARGIN=2 specifies columns; MARGIN=1 would specify rows) of the matrix specified in the first argument. The result is BoxPlotVector, a vector of 1,000 Gastwirth estimates, one for each random sample generated by the replicate function above.
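To make the mechanics concrete, here is a sketch that applies all three estimators to the same simulated samples. The seed and the variable names are my own assumptions, so the numbers will not match the figure exactly.

```r
# Gastwirth's estimator, repeated so the example is self-contained
Gastwirth <- function(x, ...) {
  ordstats <- quantile(x, probs = c(1/3, 1/2, 2/3), ...)
  sum(c(0.3, 0.4, 0.3) * ordstats)
}

set.seed(1)                                       # arbitrary seed
samples <- replicate(1000, rt(n = 200, df = 1))   # 200 x 1000 matrix
print(dim(samples))

# One estimate per column (i.e., per simulated Cauchy sample)
estimates <- data.frame(
  mean      = colMeans(samples),                  # same as apply(samples, 2, mean)
  median    = apply(samples, MARGIN = 2, median),
  gastwirth = apply(samples, MARGIN = 2, Gastwirth)
)

# Interquartile distance of each estimator across the 1,000 samples
spread <- sapply(estimates, IQR)
print(round(spread, 3))
```

Calling boxplot(estimates, ylim = c(-3, 3)) on this result should give a comparison of the same general form as the truncated boxplot described above.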

At the other extreme, in the limit of infinite degrees of freedom, the Student’s t-distribution approaches a Gaussian limit. The figure above shows the same comparison as before, except for the Gaussian distribution instead of the Cauchy distribution. Here, the mean is the best possible location estimator and it clearly performs the best, but the point of this example is that Gastwirth’s location estimator performs better than the median. In particular, the interquartile distance (i.e., the width of the “box” in each boxplot) for the mean is 0.094, it is 0.113 for the median, and it is 0.106 for Gastwirth’s estimator.

Another application area where very robust estimators like the median often perform poorly is that of bimodal distributions like the arc-sine distribution whose density is plotted above. This distribution is a symmetric beta distribution, with both shape parameters equal to 0.5 (see Exploring Data, Sec. 4.5.1 for further discussion of this distribution). Because it is symmetrically distributed on the interval from 0 to 1, the location parameter for this distribution is 0.5 and all three of the location estimators considered here yield values that are accurate on average, but with different levels of precision. This point is shown in the figure below, which again provides boxplot comparisons for 1,000 random samples drawn from this distribution, each of length 200, for the mean, median, and Gastwirth location estimators. As in the Gaussian case considered above, the mean performs best here, with an interquartile distance of 0.035, the median performs worst, with an interquartile distance of 0.077, and Gastwirth’s estimator is intermediate, with an interquartile distance of 0.060.
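The Gaussian and arc-sine comparisons can be sketched with one small helper. The function compare_estimators below is a hypothetical name of my own, and the seed is arbitrary, so the interquartile distances it reports should be close to, but not identical with, the values quoted above.

```r
# Gastwirth's estimator, repeated so the example is self-contained
Gastwirth <- function(x, ...) {
  ordstats <- quantile(x, probs = c(1/3, 1/2, 2/3), ...)
  sum(c(0.3, 0.4, 0.3) * ordstats)
}

# Hypothetical helper: rgen is any function that returns n random values
compare_estimators <- function(rgen, nrep = 1000, n = 200) {
  samples <- replicate(nrep, rgen(n))              # n x nrep matrix
  est <- cbind(mean      = colMeans(samples),
               median    = apply(samples, 2, median),
               gastwirth = apply(samples, 2, Gastwirth))
  # Average estimate and interquartile distance for each estimator
  rbind(center = colMeans(est), iqr = apply(est, 2, IQR))
}

set.seed(2)                                        # arbitrary seed
gauss   <- compare_estimators(rnorm)
arcsine <- compare_estimators(function(n) rbeta(n, shape1 = 0.5, shape2 = 0.5))

print(round(gauss, 3))
print(round(arcsine, 3))
```

The center row shows whether each estimator is accurate on average (0 for the Gaussian case, 0.5 for the arc-sine case), while the iqr row gives the precision comparison discussed in the text.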

The point of this post has been to illustrate a location estimator with properties that are intermediate between those of the much better-known mean and median. In particular, the results presented here for the Cauchy distribution show that Gastwirth’s estimator is intermediate in outlier sensitivity between the disastrously sensitive mean and the maximally insensitive median. Similarly, the first example demonstrated that Gastwirth’s estimator is also intermediate in smoothness between the maximally smooth mean and the discontinuous median: the sensitivity of Gastwirth’s estimator to data editing in “swing-vote” examples like the one presented here is still undesirably large, but much better than that of the median. Finally, the results presented here for the Gaussian and arc-sine distributions show that Gastwirth’s estimator is better-behaved for these distributions than the median. Because it is extremely easy to implement in R, Gastwirth’s estimator seems worth knowing about.

Measuring associations between non-numeric variables

2012-02-04T16:06:00.000-08:00

It is often useful to know how strongly or weakly two variables are associated: do they vary together or are they essentially unrelated? In the case of numerical variables, the best-known measure of association is the product-moment correlation coefficient introduced by Karl Pearson at the end of the nineteenth century. For variables that are ordered but not necessarily numeric (e.g., Likert scale responses with levels like “strongly agree,” “agree,” “neither agree nor disagree,” “disagree” and “strongly disagree”), association can be measured in terms of the Spearman rank correlation coefficient. Both of these measures are discussed in detail in Chapter 10 of Exploring Data in Engineering, the Sciences, and Medicine. For unordered categorical variables (e.g., country, state, county, tumor type, literary genre, etc.), neither of these measures is applicable, but applicable alternatives do exist. One of these is Goodman and Kruskal’s tau measure, discussed very briefly in Exploring Data (Chapter 10, page 492). The point of this post is to give a more detailed discussion of this association measure, illustrating some of its advantages, disadvantages, and peculiarities.

A more complete discussion of Goodman and Kruskal’s tau measure is given in Agresti’s book Categorical Data Analysis, on pages 68 and 69. It belongs to a family of categorical association measures of the general form:

a(x,y) = [V(y) - E{V(y|x)}]/V(y)

where V(y) is a measure of the overall (i.e., marginal) variability of y and E{V(y|x)} is the expected value of the conditional variability V(y|x) of y given a fixed value of x, where the expectation is taken over all possible values of x. These variability measures can be defined in different ways, leading to different association measures, including Goodman and Kruskal’s tau as a special case. Agresti’s book gives detailed expressions for several of these variability measures, including the one on which Goodman and Kruskal’s tau is based, and an alternative expression for the overall association measure a(x,y) is given in Eq. (10.178) on page 492 of Exploring Data. This association measure does not appear to be available in any current R package, but it is easily implemented as the following function:

GKtau <- function(x,y){
#
# First, compute the IxJ contingency table between x and y
#
Nij = table(x,y,useNA="ifany")
#
# Next, convert this table into a joint probability estimate
#
PIij = Nij/sum(Nij)
#
# Compute the marginal probability estimates
#
PIiPlus = apply(PIij,MARGIN=1,sum)
PIPlusj = apply(PIij,MARGIN=2,sum)
#
# Compute the marginal variation of y
#
Vy = 1 - sum(PIPlusj^2)
#
# Compute the expected conditional variation of y given x
#
InnerSum = apply(PIij^2,MARGIN=1,sum)
VyBarx = 1 - sum(InnerSum/PIiPlus)
#
# Compute and return Goodman and Kruskal's tau measure
#
tau = (Vy - VyBarx)/Vy
tau
}

An important feature of this procedure is that it allows missing values in either of the variables x or y, treating “missing” as an additional level. In practice, this is sometimes very important since missing values in one variable may be strongly associated with either missing values in another variable or specific non-missing levels of that variable.
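For reference, the variability measures that the GKtau function computes can be written out explicitly. The expressions below are my reading of the code above rather than a quotation from Agresti or from Exploring Data; the symbols correspond to the variables PIij, PIiPlus, and PIPlusj in the function.

```latex
% \pi_{ij}: estimated joint probability of (x = i, y = j)
% \pi_{i+} = \sum_j \pi_{ij}, \qquad \pi_{+j} = \sum_i \pi_{ij}
\begin{align*}
V(y) &= 1 - \sum_{j} \pi_{+j}^{2}, \\
E\{V(y|x)\} &= \sum_{i} \pi_{i+}
      \left[ 1 - \sum_{j} \left( \frac{\pi_{ij}}{\pi_{i+}} \right)^{2} \right]
   = 1 - \sum_{i} \sum_{j} \frac{\pi_{ij}^{2}}{\pi_{i+}}, \\
a(x,y) &= \frac{V(y) - E\{V(y|x)\}}{V(y)} .
\end{align*}
```

The second form of the expected conditional variation is the one evaluated in the code, where InnerSum holds the inner sum over j of the squared joint probabilities.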

An important characteristic of Goodman and Kruskal’s tau measure is its asymmetry: because the variables x and y enter this expression differently, the value of a(y,x) is not the same as the value of a(x,y), in general. This stands in marked contrast to either the product-moment correlation coefficient or the Spearman rank correlation coefficient, which are both symmetric, giving the same association between x and y as that between y and x. The fundamental reason for the asymmetry of the general class of measures defined above is that they quantify the extent to which the variable x is useful in predicting y, which may be very different than the extent to which the variable y is useful in predicting x. Specifically, if x and y are statistically independent, then E{V(y|x)} = V(y) – i.e., knowing x does not help at all in predicting y – and this implies that a(x,y) = 0. At the other extreme, if y is perfectly predictable from x, then E{V(y|x)} = 0, which implies that a(x,y) = 1. As the examples presented next demonstrate, it is possible that y is extremely predictable from x, but x is only slightly predictable from y.
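The two limiting cases, and the asymmetry between them, can be checked by hand on tiny tables. The sketch below uses a hypothetical helper, tau_from_table, that applies the same arithmetic as GKtau to an existing contingency table (rows for x, columns for y); it assumes no row of the table is entirely zero.

```r
# Hypothetical helper: Goodman and Kruskal's tau a(x,y) from a table whose
# rows index x and whose columns index y (assumes no all-zero rows)
tau_from_table <- function(N) {
  P    <- N / sum(N)                           # joint probability estimates
  Vy   <- 1 - sum(colSums(P)^2)                # marginal variation of y
  EVyx <- 1 - sum(rowSums(P^2) / rowSums(P))   # expected conditional variation
  (Vy - EVyx) / Vy
}

# Case 1: x and y independent (all four combinations equally likely)
indep <- tau_from_table(table(x = c(1, 1, 2, 2), y = c(1, 2, 1, 2)))

# Case 2: y is a function of x (the parity of x), but not the reverse
x <- 1:4
y <- x %% 2
fwd <- tau_from_table(table(x, y))   # a(x,y): y perfectly predictable from x
rev <- tau_from_table(table(y, x))   # a(y,x): x only partly predictable from y

print(c(independent = indep, forward = fwd, reverse = rev))
```

Here the independent case gives 0 and the forward association gives 1, while the reverse association is 1/3: knowing the parity of x narrows four equally likely values down to two, which reduces V(x) from 0.75 to 0.5.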

Specifically, consider the sequence of 400 random numbers, uniformly distributed between 0 and 1 generated by the following R code:

set.seed(123)

u = runif(400)

(Here, I have used the “set.seed” command to initialize the random number generator so repeated runs of this example will give exactly the same results.) The second sequence is obtained by quantizing the first, rounding the values of u to a single digit:

x = round(u,digits=1)

The plot below shows the effects of this coarse quantization: values of u vary continuously from 0 to 1, but values of x are restricted to 0.0, 0.1, 0.2, … , 1.0. Although this example is simulation-based, it is important to note that this type of grouping of variables is often encountered in practice (e.g., the use of age groups instead of ages in demographic characterizations, blood pressure characterizations like “normal,” “borderline hypertensive,” etc. in clinical data analysis, or the recording of industrial process temperatures to the nearest 0.1 degree, in part due to measurement accuracy considerations and in part due to memory limitations of early data collection systems).

In this particular case, because the variables x and u are both numeric, we could compute either the product-moment correlation coefficient or the Spearman rank correlation, obtaining the very large value of approximately 0.995 for either one, showing that these variables are strongly associated. We can also apply Goodman and Kruskal’s tau measure here, and the result is much more informative. Specifically, the value of a(u,x) is 1 in this case, correctly reflecting the fact that the grouped variable x is exactly computable from the original variable u. In contrast, the value of a(x,u) is approximately 0.025, suggesting – again correctly – that the original variable u cannot be well predicted from the grouped variable x.

To illustrate a case where the product-moment and rank correlation measures are not applicable at all, consider the following alphabetic re-coding of the variable x into an unordered categorical variable c:

letters = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K")

c = letters[10*x+1]

In this case, both of the Goodman and Kruskal tau measures, a(x,c) and a(c,x), are equal to 1, reflecting the fact that these two variables are effectively identical, related via the non-numeric transformation given above.
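The whole simulation example can be run end to end as follows. This is a sketch: it repeats a compact version of GKtau so that it is self-contained, and it substitutes the built-in constant LETTERS and the variable name code for the letters vector and the variable c used above, purely to avoid masking base R names.

```r
# Compact version of the GKtau function, repeated for self-containment
GKtau <- function(x, y) {
  N    <- table(x, y, useNA = "ifany")
  P    <- N / sum(N)
  Vy   <- 1 - sum(colSums(P)^2)
  EVyx <- 1 - sum(rowSums(P^2) / rowSums(P))
  (Vy - EVyx) / Vy
}

set.seed(123)
u <- runif(400)
x <- round(u, digits = 1)              # grouped version of u
code <- LETTERS[round(10 * x) + 1]     # "A" for 0.0 up to "K" for 1.0

# Correlations are only defined because u and x are both numeric
pearson  <- cor(u, x)
spearman <- cor(u, x, method = "spearman")

tau_ux <- GKtau(u, x)      # x from u: exactly computable
tau_xu <- GKtau(x, u)      # u from x: poorly predictable
tau_xc <- GKtau(x, code)   # x and code carry the same information
tau_cx <- GKtau(code, x)

print(round(c(pearson = pearson, spearman = spearman, tau_ux = tau_ux,
              tau_xu = tau_xu, tau_xc = tau_xc, tau_cx = tau_cx), 3))
```

Using round(10 * x) in the index is a defensive choice on my part: it guards against floating-point error in 10*x before the value is used as a subscript.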

Being able to detect relationships like these can be extremely useful in exploratory data analysis where such relationships may be unexpected, particularly in the early stages of characterizing a dataset whose metadata – i.e., detailed descriptions of the variables included in the dataset – is absent, incomplete, ambiguous, or suspect. As a real data illustration, consider the rent data frame from the R package gamlss.data, which has 1,969 rows, each corresponding to a rental property in Munich, and 9 columns, each giving a characteristic of that unit (e.g., the rent, floor space, year of construction, etc.). Three of these variables are Sp, a binary variable indicating whether the location is considered above average (1) or not (0), Sm, another binary variable indicating whether the location is considered below average (1) or not (0), and loc, a three-level variable combining the information in these other two, taking the values 1 (below average), 2 (average), or 3 (above average). The Goodman and Kruskal tau values between all possible pairs of these three variables are:

a(Sm,Sp) = a(Sp,Sm) = 0.037

a(Sm,loc) = 0.245 vs. a(loc,Sm) = 1

a(Sp,loc) = 0.701 vs. a(loc,Sp) = 1

The first of these results – the symmetry of Goodman and Kruskal’s tau for the variables Sm and Sp – is a consequence of the fact that this measure is symmetric for any pair of binary variables. In fact, the odds ratio that I have discussed in previous posts represents a much better way of characterizing the relationship between binary variables (here, the odds ratio between Sm and Sp is zero, reflecting the fact that a location cannot be both “above average” and “below average” at the same time). The real utility of the tau measure here is that the second and third lines above show that the variables Sm and Sp are both re-groupings of the finer-grained variable loc.
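The symmetry for binary variables can be checked numerically without the rent data. The sketch below uses randomly generated 2x2 tables, plus one made-up table of mutually exclusive flags in the spirit of Sm and Sp; the counts are invented for illustration and the helper name tau_from_table is hypothetical.

```r
# Hypothetical helper: tau a(x,y) from a table with rows x and columns y
tau_from_table <- function(N) {
  P    <- N / sum(N)
  Vy   <- 1 - sum(colSums(P)^2)
  EVyx <- 1 - sum(rowSums(P^2) / rowSums(P))
  (Vy - EVyx) / Vy
}

set.seed(7)   # arbitrary seed
# Largest difference between a(x,y) and a(y,x) over 200 random 2x2 tables
sym_gap <- replicate(200, {
  N <- matrix(sample(1:50, size = 4, replace = TRUE), nrow = 2)
  abs(tau_from_table(N) - tau_from_table(t(N)))
})
print(max(sym_gap))

# Invented counts for two mutually exclusive flags: rows are flag A = 0/1,
# columns are flag B = 0/1, and no unit has both flags set
N_excl <- matrix(c(50, 30, 20, 0), nrow = 2)
odds_ratio <- (N_excl[1, 1] * N_excl[2, 2]) / (N_excl[1, 2] * N_excl[2, 1])
print(c(tau = tau_from_table(N_excl), odds_ratio = odds_ratio))
```

The gap is zero up to rounding error, consistent with the fact that for a 2x2 table the measure reduces to a symmetric function of the four cell probabilities, and the empty "both flags" cell drives the odds ratio to zero.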

Finally, a more interesting exploratory application to this dataset is the following one. Computing Goodman and Kruskal’s tau measure between the location variable loc and all of the other variables in the dataset – beyond the cases of Sm and Sp just considered – generally yields small values for the associations in either direction. As a specific example, the association a(loc,Fl) is 0.001, suggesting that location is not a good predictor of the unit’s floor space in square meters, and although the reverse association a(Fl,loc) is larger (0.057), it is not large enough to suggest that the unit’s floor space is a particularly good predictor of its location quality. The same is true of most of the other variables in the dataset: they are neither well predicted by nor good predictors of location quality. The one glaring exception is the rent variable R: although the association a(loc,R) is only 0.001, the reverse association a(R,loc) is 0.907, a very large value suggesting that location quality is quite well predicted by the rent. The beanplot above shows what is happening here: because the variation in rents for all three location qualities is substantial, knowledge of the loc value is not sufficient to accurately predict the rent R, but these rent values do generally increase in going from below-average locations (loc = 1) to average locations (loc = 2) to above-average locations (loc = 3). For comparison, the beanplots below show why the association with floor space is so much weaker: both the mean floor space in each location quality group and the overall range of these values are quite comparable, implying that neither location quality can be well predicted from floor space nor vice versa.

The asymmetry of Goodman and Kruskal’s tau measure is disconcerting at first because it has no counterpart in better-known measures like the product-moment correlation coefficient between numerical variables, Spearman’s rank correlation coefficient between ordinal variables, or the odds ratio between binary variables. One of the points of this post has been to demonstrate how this unusual asymmetry can be useful in practice, distinguishing between the ability of one variable x to predict another variable y, and the reverse case.

Moving window filters and the pracma package

2012-01-14T11:06:00.000-08:00

In my last post, I discussed the Hampel filter, a useful moving window nonlinear data cleaning filter that is available in the R package pracma. In this post, I discuss this moving window filter in a little more detail, focusing on two important practical points: the choice of the filter’s local outlier detection threshold, and the question of how to initialize moving window filters. This second point is particularly important here because the pracma package initializes the Hampel filter in a particularly appropriate way, but doesn’t do such a good job of initializing the Savitzky-Golay filter, a linear smoothing filter that is popular in physics and chemistry. Fortunately, this second difficulty is easy to fix, as I demonstrate here.

Recall from my last post that the Hampel filter is a moving window implementation of the Hampel identifier, discussed in Chapter 7 of Exploring Data in Engineering, the Sciences, and Medicine. In particular, this procedure – implemented as outlierMAD in the pracma package – is a nonlinear data cleaning filter that looks for local outliers in a time-series or other streaming data sequence, replacing them with a more reasonable alternative value when it finds them. Specifically, this filter may be viewed as a more effective alternative to a “local three-sigma edit rule” that would replace any data point lying more than three standard deviations from the mean of its neighbors with that mean value. The difficulty with this simple strategy is that both the mean and especially the standard deviation are badly distorted by the presence of outliers in the data, causing this data cleaning procedure to often fail completely in practice. The Hampel filter instead uses the median of neighboring observations as a reference value, and the MAD scale estimator as an alternative measure of distance: that is, a data point is declared an outlier and replaced if it lies more than some number t of MAD scale estimates from the median of its neighbors; the replacement value used in this procedure is the median.

More specifically, for each observation in the original data sequence, the Hampel filter constructs a moving window that includes the K prior points, the data point of primary interest, and the K subsequent data points. The reference value used for the central data point is the median of these 2K+1 successive observations, and the MAD scale estimate is computed from these same observations to serve as a measure of the “natural local spread” of the data sequence. If the central data point lies more than t MAD scale estimate values from the median, it is replaced with the median; otherwise, it is left unchanged. To illustrate the performance of this filter, the top plot above shows the sequence of 1024 successive physical property measurements from an industrial manufacturing process that I also discussed in my last post. The bottom plot in this pair shows the results of applying the Hampel filter with a window half-width parameter K=5 and a threshold value of t = 3 to this data sequence. Comparing these two plots, it is clear that the Hampel filter has removed the glaring outlier – the value zero – at observation k = 291, yielding a cleaned data sequence that varies over a much narrower (and, at least in this case, much more reasonable) range of possible values. What is less obvious is that this filter has also replaced 18 other data points with their local median reference values.
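The decision made for a single window is simple enough to work through by hand. The sketch below uses an invented 11-point window (K = 5) whose central value is a zero, loosely mimicking the glaring outlier described above; the numbers are illustrative and are not taken from the physical property dataset.

```r
# Invented 11-point window (K = 5); the central point is the suspect value
window <- c(10.1, 9.9, 10.0, 10.2, 9.8, 0, 10.1, 9.9, 10.0, 10.2, 9.8)
K  <- 5
t0 <- 3                                    # threshold, in MAD scale units

center <- window[K + 1]                    # the data point of primary interest
x0 <- median(window)                       # local reference value
S0 <- 1.4826 * median(abs(window - x0))    # MAD scale estimate

is_outlier <- abs(center - x0) > t0 * S0
cleaned    <- if (is_outlier) x0 else center

print(c(median = x0, mad_scale = S0, threshold = t0 * S0, cleaned = cleaned))
```

Here the median is 10 and the MAD scale estimate is 1.4826 * 0.1, so the central value lies far more than three scale units from the median and is replaced by 10. The same scale estimate is returned by R's built-in mad(window), which uses the constant 1.4826 by default.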

The above plot shows the original data sequence, but on approximately the same range as the cleaned data sequence so that the glaring outlier at k = 291 no longer dominates the figure. The large solid circles represent the 18 additional points that the Hampel filter has declared to be outliers and replaced with their local median values. This plot was generated using the Hampel filter implemented in the outlierMAD command in the pracma package, which has the following syntax:

outlierMAD(x,k)

where x is the data sequence to be cleaned and k is the half-width that defines the moving data window on which the filter is based. Here, specifying k = 5 results in an 11-point moving data window. Unfortunately, the threshold parameter t is hard-coded as 3 in this pracma procedure, which has the following code:

outlierMAD <- function (x, k){
  n <- length(x)
  y <- x
  ind <- c()
  L <- 1.4826
  t0 <- 3
  for (i in (k + 1):(n - k)) {
    # local median and MAD scale estimate over the 2k+1 point window
    x0 <- median(x[(i - k):(i + k)])
    S0 <- L * median(abs(x[(i - k):(i + k)] - x0))
    if (abs(x[i] - x0) > t0 * S0) {
      y[i] <- x0
      ind <- c(ind, i)
    }
  }
  list(y = y, ind = ind)
}

Note that it is a simple matter to create your own version of this filter, specifying the threshold (here, the variable t0) to have a default value of 3, but allowing the user to modify it in the function call. Specifically, the code would be:

HampelFilter <- function (x, k,t0=3){
  n <- length(x)
  y <- x
  ind <- c()
  L <- 1.4826
  for (i in (k + 1):(n - k)) {
    # local median and MAD scale estimate over the 2k+1 point window
    x0 <- median(x[(i - k):(i + k)])
    S0 <- L * median(abs(x[(i - k):(i + k)] - x0))
    # replace the central point if it lies more than t0 scale units away
    if (abs(x[i] - x0) > t0 * S0) {
      y[i] <- x0
      ind <- c(ind, i)
    }
  }
  list(y = y, ind = ind)
}

The advantage of this modification is that it allows you to explore the influence of varying the threshold parameter. Note that increasing t0 makes the filter more forgiving, allowing more extreme local fluctuations to pass through the filter unmodified, while decreasing t0 makes the filter more aggressive, declaring more points to be local outliers and replacing them with the appropriate local median. In fact, this filter remains well-defined even for t0 = 0, where it reduces to the median filter, popular in nonlinear digital signal processing. John Tukey – the developer or co-developer of many useful things, including the fast Fourier transform (FFT) – introduced the median filter at a technical conference in 1974, and it has profoundly influenced subsequent developments in nonlinear digital filtering. It may be viewed as the most aggressive limit of the Hampel filter and, although it is quite effective in removing local outliers, it is often too aggressive in practice, introducing significant distortions into the original data sequence. This point may be seen in the plot below, which shows the results of applying the median filter (i.e., the HampelFilter procedure defined above with t0=0) to the physical property dataset. In particular, the heavy solid line in this plot shows the behavior of the first 250 points of the median filtered sequence, while the lighter dotted line shows the corresponding results for the Hampel filter with t0=3. Note the “clipped” or “blocky” appearance of the median filtered results, compared with the more irregular local variation seen in the Hampel filtered results. In many applications (e.g., fitting time-series models), the less aggressive Hampel filter gives much better overall results.
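The effect of the threshold can be explored on synthetic data. The sketch below uses a compact re-implementation (hampel_sketch is a hypothetical name, written so the block runs on its own) applied to an invented random-walk sequence with three artificial spikes; it also checks the claim that t0 = 0 reproduces a standard median filter, here taken to be stats::runmed with endrule = "keep", which leaves the first and last K points unchanged.

```r
# Hypothetical compact version of the Hampel filter with adjustable threshold
hampel_sketch <- function(x, K, t0 = 3) {
  n <- length(x)
  y <- x
  ind <- integer(0)
  for (i in (K + 1):(n - K)) {
    w  <- x[(i - K):(i + K)]
    x0 <- median(w)
    S0 <- mad(w)                   # 1.4826 * median(abs(w - median(w)))
    if (abs(x[i] - x0) > t0 * S0) {
      y[i] <- x0
      ind <- c(ind, i)
    }
  }
  list(y = y, ind = ind)
}

# Invented test sequence: a noisy random walk with three artificial spikes
set.seed(3)
x <- cumsum(rnorm(300, sd = 0.2)) + rnorm(300, sd = 0.1)
spikes <- c(50, 150, 250)
x[spikes] <- x[spikes] + 5

# Number of points replaced for increasingly forgiving thresholds
thresholds <- c(0, 3, 5)
n_replaced <- sapply(thresholds,
                     function(t0) length(hampel_sketch(x, K = 5, t0 = t0)$ind))
print(setNames(n_replaced, paste0("t0=", thresholds)))

# With t0 = 0 the output should coincide with an 11-point median filter
med <- hampel_sketch(x, K = 5, t0 = 0)$y
same_as_runmed <- isTRUE(all.equal(med, as.numeric(runmed(x, k = 11, endrule = "keep"))))
print(same_as_runmed)
```

Because the filter always tests the original values rather than previously cleaned ones, the set of replaced points can only shrink as t0 grows.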

The other main issue I wanted to discuss in this post is that of initializing moving window filters. The basic structure of these filters – whether they are nonlinear types like the Hampel and median filters discussed above, or linear types like the Savitzky-Golay filter discussed briefly below – is built on a moving data window that includes a central point of interest, prior observations and subsequent observations. For a symmetric window that includes K prior and K subsequent observations, this window is not well defined for the first K or the last K observations in the data sequence. These points must be given special treatment, and a very common approach in the digital signal processing community is to extend the original sequence by appending K additional copies of the first element to the beginning of the sequence and K additional copies of the last element to the end of the sequence. The pracma implementation of the Hampel filter procedure (outlierMAD) takes an alternative approach, one that is particularly appropriate for data cleaning filters. Specifically, procedure outlierMAD simply passes the first and last K observations unmodified from the original data sequence to the filter output. This would also seem to be a reasonable option for smoothing filters like the linear Savitzky-Golay filter discussed next.
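The two initialization strategies can be compared on a short invented sequence. In the sketch below, extend_ends and moving_median_padded are hypothetical helper names of my own; the padded version extends the sequence with K copies of its end values before filtering, while runmed with endrule = "keep" passes the first and last K points through unmodified.

```r
# Hypothetical helper: K copies of the first value in front, K of the last behind
extend_ends <- function(x, K) {
  c(rep(x[1], K), x, rep(x[length(x)], K))
}

# Moving median over the extended sequence; output has the same length as x
moving_median_padded <- function(x, K) {
  xp <- extend_ends(x, K)
  windows <- embed(xp, 2 * K + 1)   # one row per (2K+1)-point window
  apply(windows, 1, median)
}

x <- c(5, 1, 2, 3, 4, 100, 6, 7, 8, 9)   # invented sequence with one spike

y_pad  <- moving_median_padded(x, K = 2)
y_keep <- as.numeric(runmed(x, k = 5, endrule = "keep"))

print(rbind(original = x, padded = y_pad, keep_ends = y_keep))
```

The two outputs agree in the interior and differ only within K points of either end: the padded version modifies the second point (1 becomes 3), whereas the pass-through version leaves it alone.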

As noted, this linear smoothing filter is popular in chemistry and physics, and it is implemented in the pracma package as procedure savgol. For a more detailed discussion of this filter, refer to the treatment in the book Numerical Recipes, which the authors of the pracma package cite for further details (Section 14.8). Here, the key point is that this filter is a linear smoother, implemented as the convolution of the input sequence with an impulse response function (i.e., a smoothing kernel) that is constructed by the savgol procedure. The above two plots show the effects of applying this filter with a total window width of 11 points (i.e., the same half-width K = 5 used with the Hampel and median filters), first to the raw physical property data sequence (upper plot), and then to the sequence after it has been cleaned by the Hampel filter (lower plot). The large downward spike at k = 291 in the upper plot reflects the impact of the glaring outlier in the original data sequence, illustrating the practical importance of removing these artifacts from a data sequence before applying smoothing procedures like the Savitzky-Golay filter. Both the upper and lower plots exhibit similarly large spikes at the beginning and end of the data sequence, however, and these artifacts are due to the moving window problem noted above for the first K and the last K elements of the original data sequence. In particular, the filter implementation in the savgol procedure does not apply the sequence extension procedure discussed above, and this fact is responsible for these artifacts appearing at the beginning and end of the smoothed data sequence.

It is extremely easy to correct this problem, adopting the same philosophy the package uses for the outlierMAD procedure: simply retain the first and last K elements of the original sequence unmodified. The procedure SGwrapper listed below does this after the fact, calling the savgol procedure and then replacing the first and last K elements of the filtered sequence with the original sequence values:

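library(pracma)   # provides the savgol procedure

SGwrapper <- function(x,K,forder=4,dorder=0){
  n = length(x)
  # total window width required by savgol
  fl = 2*K+1
  y = savgol(x,fl,forder,dorder)
  if (dorder == 0){
    # smoothing: keep the first and last K original values
    y[1:K] = x[1:K]
    y[(n-K+1):n] = x[(n-K+1):n]
  }
  else{
    # derivative estimation: default the first and last K estimates to zero
    y[1:K] = 0
    y[(n-K+1):n] = 0
  }
  y
}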

Before showing the results obtained with this procedure, it is important to note two points. First, the moving window width parameter fl required for the savgol procedure corresponds to fl = 2K+1 for a half-width parameter K. The procedure SGwrapper instead requires K as its passing parameter, constructing fl from this value of K. Second, note that in addition to serving as a smoother, the Savitzky-Golay filter family can also be used to estimate derivatives (this is tricky since differentiation filters are incredible noise amplifiers, but I’ll talk more about that in another post). In the savgol procedure, this is accomplished by specifying the parameter dorder, which has a default value of zero (implying smoothing), but which can be set to 1 to estimate the first derivative of a sequence, 2 for the second derivative, etc. In these cases, replacing the first and last K elements of the filtered sequence with the original data sequence elements is not reasonable: in the absence of any other knowledge, a better default derivative estimate is zero, and the SGwrapper procedure listed above does this.

The four plots shown above illustrate the differences between the original savgol procedure (the left-hand plots) and those obtained with the SGwrapper procedure listed above (the right-hand plots). In all cases, the data sequence used to generate these plots was the physical property data sequence cleaned using the Hampel filter with t0 = 3. The upper left plot repeats the lower of the two previous plots, corresponding to the savgol smoother output, while the upper right plot applies the SGwrapper function to remove the artifacts at the beginning and end of the smoothed data sequence. Similarly, the lower two plots give the corresponding second-derivative estimates, obtained by applying the savgol procedure with fl = 11 and dorder = 2 (lower left plot) or the SGwrapper procedure with K = 5 and dorder = 2 (lower right plot).

Cleaning time-series and other data streams

2011-11-27T08:37:00.000-08:00

The need to analyze time-series or other forms of streaming data arises frequently in many different application areas. Examples include economic time-series like stock prices, exchange rates, or unemployment figures, biomedical data sequences like electrocardiograms or electroencephalograms, or industrial process operating data sequences like temperatures, pressures or concentrations. As a specific example, the figure below shows four data sequences: the upper two plots represent hourly physical property measurements, one made at the inlet of a product storage tank (the left-hand plot) and the other made at the same time at the outlet of the tank (the right-hand plot). The lower two plots in this figure show the results of applying the data cleaning filter outlierMAD from the R package pracma discussed further below. The two main points of this post are first, that isolated spikes like those seen in the upper two plots at hour 291 can badly distort the results of an otherwise reasonable time-series characterization, and second, that the simple moving window data cleaning filter described here is often very effective in removing these artifacts.

This example is discussed in more detail in Section 8.1.2 of my book Discrete-Time Dynamic Models, but the key observations here are the following. First, the large spikes seen in both of the original data sequences were caused by the simultaneous, temporary loss of both measurements and the subsequent coding of these missing values as zero by the data collection system. The practical question of interest was to determine how long, on average, the viscous, polymeric material being fed into and out of the product storage tank was spending there. A standard method for addressing such questions is the use of cross-correlation analysis, where the expected result is a broad peak like the heavy dashed line in the plot shown below. The location of this peak provides an estimate of the average time spent in the tank, which is approximately 21 hours in this case, as indicated in the plot. This result was about what was expected, and it was obtained by applying standard cross-correlation analysis to the cleaned data sequences shown in the bottom two plots above. The lighter solid curve in the plot below shows the results of applying exactly the same analysis, but to the original data sequences instead of the cleaned data sequences. This dramatically different plot suggests that the material is spending very little time in the storage tank: accepted uncritically, this result would imply severe fouling of the tank, suggesting a need to shut the process down and clean out the tank, an expensive and labor-intensive proposition. The main point of this example is that the difference in these two plots is entirely due to the extreme data anomalies present in the original time-series. Additional examples of problems caused by time-series outliers are discussed in Section 4.3 of my book Mining Imperfect Data.
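To see how a single shared spike can produce this kind of distortion, here is a minimal simulation sketch in base R. The data are synthetic, not the storage tank measurements: the AR(1) model, the level of 200, the noise level, and the names inlet, outlet, and peak_lag are all assumptions chosen for illustration, with the 21-sample delay and the spike at index 291 borrowed from the example above.

```r
set.seed(101)
n     <- 1000   # number of hourly samples (assumed)
delay <- 21     # true transport delay in samples (assumed)

# Slowly varying synthetic signal around a level of 200 (assumed AR(1) model)
x_full <- 200 + as.numeric(arima.sim(model = list(ar = 0.9), n = n + delay))

inlet  <- x_full[(delay + 1):(n + delay)]
outlet <- x_full[1:n] + rnorm(n, sd = 0.5)   # outlet[t] = inlet[t - delay] + noise

# Lag at which the sample cross-correlation between outlet and inlet peaks
peak_lag <- function(out, inp, lag.max = 50) {
  cc <- ccf(out, inp, lag.max = lag.max, plot = FALSE)
  cc$lag[which.max(cc$acf)]
}

peak_clean <- peak_lag(outlet, inlet)

# Simulate a simultaneous measurement loss coded as zero in both sequences
inlet_bad  <- inlet;  inlet_bad[291]  <- 0
outlet_bad <- outlet; outlet_bad[291] <- 0
peak_spiked <- peak_lag(outlet_bad, inlet_bad)

c(clean = peak_clean, spiked = peak_spiked)
```

In this synthetic setting the clean sequences put the peak at the true delay, while the spiked sequences move it to lag zero, because the product of the two simultaneous spikes dominates the lag-zero term of the cross-correlation estimate.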

One of the primary features of the analysis of time-series and other streaming data sequences is the need for local data characterizations. This point is illustrated in the plot below, which shows the first 200 observations of the storage tank inlet data sequence discussed above. All of these observations but one are represented as open circles in this plot, but the data point at k = 110 is shown as a solid circle, to emphasize how far it lies from its immediate neighbors in the data sequence. It is important to note that this point is not anomalous with respect to the overall range of this data sequence – it is, for example, well within the normal range of variation seen for the points from about k = 150 to k = 200 – but it is clearly anomalous with respect to those points that immediately precede and follow it. A general strategy for automatically detecting and removing such spikes from a data sequence like this one is to apply a moving window data cleaning filter which characterizes each data point with respect to a local neighborhood of prior and subsequent samples. That is, for each data point k in the original data sequence, this type of filter forms a cleaned data estimate based on some number J of prior data values (i.e., points k-J through k-1 in the sequence) and, in the simplest implementations, the same number of subsequent data values (i.e., points k+1 through k+J in the sequence).

The specific data cleaning filter considered here is the Hampel filter, which applies the Hampel identifier discussed in Chapter 7 of Exploring Data in Engineering, the Sciences and Medicine to this moving data window. If the k-th data point is declared to be an outlier, it is replaced by the median value computed from this data window; otherwise, the data point is not modified. The results of applying the Hampel filter with a window half-width of J = 5 (i.e., an 11-point window) to the above data sequence are shown in the plot below. The effect is to modify three of the original data points – those at k = 43, 110, and 120 – and the original values of these modified points are shown as solid circles at the appropriate locations in this plot. It is clear that the most pronounced effect of the Hampel filter is to remove the local outlier indicated in the above figure and replace it with a value that is much more representative of the other data points in the immediate vicinity.
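The moving-window logic just described can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the pracma implementation: the function name hampel_sketch, the choice to leave the first and last J points untouched, and the use of the MAD scale estimate with threshold t0 are my assumptions about a reasonable minimal version.

```r
# Minimal moving-window Hampel filter sketch (not the pracma implementation).
# x:  numeric data sequence
# J:  half-width of the window (J points before and J points after point k)
# t0: threshold, in units of the MAD scale estimate
hampel_sketch <- function(x, J = 5, t0 = 3) {
  n <- length(x)
  y <- x
  replaced <- integer(0)
  if (n < 2 * J + 1) return(list(y = y, replaced = replaced))
  for (k in (J + 1):(n - J)) {
    window <- x[(k - J):(k + J)]
    center <- median(window)
    s0     <- mad(window)   # median absolute deviation, scaled by 1.4826
    if (abs(x[k] - center) > t0 * s0) {
      y[k] <- center        # replace the outlier with the window median
      replaced <- c(replaced, k)
    }
  }
  list(y = y, replaced = replaced)
}

x <- c(1, 2, 3, 100, 5, 6, 7)
result <- hampel_sketch(x, J = 2, t0 = 3)
result$y         # the spike at k = 4 is replaced by the window median
result$replaced  # indices of the modified points
```

Because both the median and the MAD are insensitive to a single extreme value in the window, the spike itself does not inflate the scale estimate used to detect it.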

As I noted above, the Hampel filter implementation used here is that available in the R package pracma as procedure outlierMAD. I will discuss this R package in more detail in my next post, but for those seeking a more detailed discussion of the Hampel filter in the meantime, one is freely available on-line in the form of an EDN article I wrote in 2002, Scrub data with scale-invariant nonlinear digital filters. Also, comparisons with alternatives like the standard median filter (generally too aggressive, introducing unwanted distortion into the “cleaned” data sequence) and the center-weighted median filter (sometimes quite effective) are presented in Section 4.2 of the book Mining Imperfect Data mentioned above.

Harmonic means, reciprocals, and ratios of random variables

2011-11-11T14:16:00.000-08:00

In my last few posts, I have considered “long-tailed” distributions whose probability density decays much more slowly than standard distributions like the Gaussian. For these slowly-decaying distributions, the harmonic mean often turns out to be a much better (i.e., less variable) characterization than the arithmetic mean, which is generally not even well-defined theoretically for these distributions. Since the harmonic mean is defined as the reciprocal of the mean of the reciprocal values, it is intimately related to the reciprocal transformation. The main point of this post is to show how profoundly the reciprocal transformation can alter the character of a distribution, for better or worse. One way that reciprocal transformations sneak into analysis results is through attempts to characterize ratios of random numbers. The key issue underlying all of these ideas is the question of when the denominator variable in either a reciprocal transformation or a ratio exhibits non-negligible probability in a finite neighborhood of zero. I discuss transformations in Chapter 12 of Exploring Data in Engineering, the Sciences and Medicine, with a section (12.7) devoted to reciprocal transformations, showing what happens when we apply them to six different distributions: Gaussian, Laplace, Cauchy, beta, Pareto, and lognormal.

In the general case, if a random variable x has the density p(x), the reciprocal y = 1/x has the density g(y) given by:

g(y) = p(1/y)/y²

As I discuss in greater detail in Exploring Data, the consequence of this transformation is typically (though not always) to convert a well-behaved distribution into a very poorly behaved one. As a specific example, the plot below shows the effect of the reciprocal transformation on a Gaussian random variable with mean 1 and standard deviation 2. The most obvious characteristic of this transformed distribution is its strongly asymmetric, bimodal character, but another non-obvious consequence of the reciprocal transformation is that it takes a distribution that is completely characterized by its first two moments into a new distribution with Cauchy-like tails, for which none of the integer moments exist.
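As a quick sanity check on the density formula, the following base R sketch compares g(y) = p(1/y)/y² for the Gaussian example (mean 1, standard deviation 2) against a numerical derivative of the cumulative distribution of 1/x. The function names g and G, the test points, and the step size h are assumptions made for this illustration.

```r
mu <- 1; sigma <- 2

# Density of y = 1/x from the change-of-variables formula g(y) = p(1/y)/y^2
g <- function(y) dnorm(1 / y, mean = mu, sd = sigma) / y^2

# Cumulative distribution of y = 1/x, written directly from that of x:
#   y < 0:  P(1/x <= y) = P(1/y <= x < 0)
#   y > 0:  P(1/x <= y) = P(x < 0) + P(x >= 1/y)
G <- function(y) {
  p0 <- pnorm(0, mean = mu, sd = sigma)
  ifelse(y < 0,
         p0 - pnorm(1 / y, mean = mu, sd = sigma),
         p0 + 1 - pnorm(1 / y, mean = mu, sd = sigma))
}

y <- c(-2, -0.5, 0.25, 1, 3)
h <- 1e-5
numeric_density <- (G(y + h) - G(y - h)) / (2 * h)   # central difference

cbind(y, formula = g(y), numeric = numeric_density)
```

The two columns should agree to several digits at every test point, on both sides of zero.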

The implications of the reciprocal transformation for many other distributions are equally non-obvious. For example, both the badly-behaved Cauchy distribution (no moments exist) and the well-behaved lognormal distribution (all moments exist, but interestingly, do not completely characterize the distribution, as I have discussed in a previous post) are invariant under the reciprocal transformation. Also, applying the reciprocal transformation to the long-tailed Pareto type I distribution (which exhibits few or no finite moments, depending on its tail decay rate) yields a beta distribution, all of whose moments are finite. Finally, it is worth noting that the invariance of the Cauchy distribution under the reciprocal transformation lies at the heart of the following result, presented in the book Continuous Univariate Distributions by Johnson, Kotz, and Balakrishnan (Volume 1, 2nd edition, Wiley, 1994, page 319). They note that if the density of x is positive, continuous, and differentiable at x = 0 – all true for the Gaussian case – the distribution of the harmonic mean of N samples approaches a Cauchy limit as N becomes infinitely large.

As noted above, the key issue responsible for the pathological behavior of the reciprocal transformation is the question of whether the original data distribution exhibits nonzero probability of taking on values within a neighborhood around zero. In particular, note that if x can only assume values larger than some positive lower limit L, it follows that 1/x necessarily lies between 0 and 1/L, which is enough to guarantee that all moments of the transformed distribution exist. For the Gaussian distribution, even if the mean is large enough and the standard deviation is small enough that the probability of observing values less than some limit L > 0 is negligible, the fact that this probability is not zero means that the moments of any reciprocally-transformed Gaussian distribution are not finite. As a practical matter, however, reciprocal transformations and related characterizations – like harmonic means and ratios – do become better-behaved as the probability of observing values near zero becomes negligibly small.

To see this point, consider two reciprocally-transformed Gaussian examples. The first is the one considered above: the reciprocal transformation of a Gaussian random variable with mean 1 and standard deviation 2. In this case, the probability that x assumes values smaller than or equal to zero is non-negligible. Specifically, this probability is simply the cumulative distribution function for the distribution evaluated at zero, easily computed in R as approximately 31%:

> pnorm(0,mean=1,sd=2)

[1] 0.3085375

In contrast, for a Gaussian random variable with mean 1 and standard deviation 0.1, the corresponding probability is negligibly small:

> pnorm(0,mean=1,sd=0.1)

[1] 7.619853e-24

If we consider the harmonic means of these two examples, we see that the first one is horribly behaved, as all of the results presented here would lead us to expect. In fact, the qqPlot command in the car package in R allows us to compute quantile-quantile plots for the Student’s t-distribution with one degree of freedom, corresponding to the Cauchy distribution, yielding the plot shown below. The Cauchy-like tail behavior expected from the results presented by Johnson, Kotz and Balakrishnan is seen clearly in this Cauchy Q-Q plot, constructed from 1000 harmonic means, each computed from statistically independent samples drawn from a Gaussian distribution with mean 1 and standard deviation 2. The fact that almost all of the observations fall within the – very wide – 95% confidence interval around the reference line suggests that the Cauchy tail behavior is appropriate here.

To further confirm this point, compare the corresponding normal Q-Q plot for the same sequence of harmonic means, shown below. There, the extreme non-Gaussian character of these harmonic means is readily apparent from the pronounced outliers evident in both the upper and lower tails.

In marked contrast, for the second example with the mean of 1 as before but the much smaller standard deviation of 0.1, the harmonic mean is much better behaved, as the normal Q-Q plot below illustrates. Specifically, this plot is identical in construction to the one above, except it was computed from samples drawn from the second data distribution. Here, most of the computed harmonic mean values fall within the 95% confidence limits around the Gaussian reference line, suggesting that it is not unreasonable in practice to regard these values as approximately normally distributed, in spite of the pathologies of the reciprocal transformation.
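The contrast between the two cases can be reproduced with a short base R simulation. This is a sketch under assumptions: the post does not state the size of each sample, so N = 100 is my guess, and the seed and variable names are likewise illustrative, so the numbers will not match the plots exactly.

```r
set.seed(101)

harmonic_mean <- function(x) 1 / mean(1 / x)

n_rep <- 1000   # number of harmonic means, as in the plots above
N     <- 100    # samples per harmonic mean (assumed; not stated above)

hm_wide   <- replicate(n_rep, harmonic_mean(rnorm(N, mean = 1, sd = 2)))
hm_narrow <- replicate(n_rep, harmonic_mean(rnorm(N, mean = 1, sd = 0.1)))

# Compare the central bulk and the extremes of the two sets of harmonic means
probs <- c(0, 0.025, 0.5, 0.975, 1)
rbind(sd_2   = quantile(hm_wide, probs),
      sd_0.1 = quantile(hm_narrow, probs))

# Cauchy Q-Q plot in base R (Student's t with one degree of freedom):
# qqplot(qt(ppoints(n_rep), df = 1), hm_wide)
```

With the standard deviation of 0.1, the harmonic means cluster tightly just below 1, while with the standard deviation of 2 the extremes are orders of magnitude further out than the central 95% of the values, which is the signature of the Cauchy-like tails.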

One reason the reciprocal transformation is important in practice – particularly in connection with the Gaussian distribution – is that the desire to characterize ratios of uncertain quantities does arise from time to time. In particular, if we are interested in characterizing the ratio of two averages, the Central Limit Theorem would lead us to expect that, at least approximately, this ratio should behave like the ratio of two Gaussian random variables. If these component averages are statistically independent, the expected value of the ratio can be re-written as the product of the expected value of the numerator average and the expected value of the reciprocal of the denominator average, leading us directly to the reciprocal Gaussian transformation discussed here. In fact, if these two averages are both zero mean, it is a standard result that the ratio has a Cauchy distribution (this result is presented in the same discussion from Johnson, Kotz and Balakrishnan noted above). As in the second harmonic mean example presented above, however, it turns out to be true that if the mean and standard deviation of the denominator variable are such that the probability of a zero or negative denominator is negligible, the distribution of the ratio may be approximated reasonably well as Gaussian. A very readable and detailed discussion of this fact is given in the paper by George Marsaglia in the May 2006 issue of Journal of Statistical Software.

Finally, it is important to note that the “reciprocally-transformed Gaussian distribution” I have been discussing here is not the same as the inverse Gaussian distribution, to which Johnson, Kotz and Balakrishnan devote a 39-page chapter (Chapter 15). That distribution takes only positive values and exhibits moments of all orders, both positive and negative, and as a consequence, it has the interesting characteristic that it remains well-behaved under reciprocal transformations, in marked contrast to the Gaussian case.

The Zipf and Zipf-Mandelbrot distributions

2011-10-23T13:31:00.000-07:00

In my last few posts, I have been discussing some of the consequences of the slow decay rate of the tail of the Pareto type I distribution, along with some other, closely related notions, all in the context of continuously distributed data. Today’s post considers the Zipf distribution for discrete data, which has come to be extremely popular as a model for phenomena like word frequencies, city sizes, or sales rank data, where the values of these quantities associated with randomly selected samples can vary by many orders of magnitude.

More specifically, the Zipf distribution is defined by a probability p_i of observing the i-th element of an infinite sequence of objects in a single random draw from that sequence, where the probability is given by:

p_i = A/i^a

Here, a is a positive number greater than 1 that determines the rate of the distribution’s tail decay, and A is a normalization constant, chosen so that these probabilities sum to 1. Like the continuous-valued Pareto type I distribution, the Zipf distribution exhibits a “long tail,” meaning that its tail decays slowly enough that in a random sample of objects O_i drawn from a Zipf distribution, some very large values of the index i will be observed, particularly for relatively small values of the exponent a. In one of the earliest and most common applications of the Zipf distribution, the objects considered represent words in a document and i represents their rank, ranging from most frequent (for i = 1) to rare (for large i ). In a more business-oriented application, the objects might be products for sale (e.g., books listed on Amazon), with the index i corresponding to their sales rank. For a fairly extensive collection of references to many different applications of the Zipf distribution, the website (originally) from Rockefeller University is an excellent source.

In Exploring Data in Engineering, the Sciences, and Medicine, I give a brief discussion of both the Zipf distribution and the closely related Zipf-Mandelbrot distribution discussed by Benoit Mandelbrot in his book The Fractal Geometry of Nature. The probabilities defining this distribution may be parameterized in several ways, and the one given in Exploring Data is:

p_i = A/(1+Bi)^a

where again a is an exponent that determines the rate at which the tail of the distribution decays, and B is a second parameter with a value that is strictly positive but no greater than 1. For both the Zipf distribution and the Zipf-Mandelbrot distribution, the exponent a must be greater than 1 for the distribution to be well-defined, it must be greater than 2 for the mean to be finite, and it must be greater than 3 for the variance to be finite.

So far, I have been unable to find an R package that supports the generation of random samples drawn from the Zipf distribution, but the package zipfR includes the command rlnre, which generates random samples drawn from the Zipf-Mandelbrot distribution. As I noted, this distribution can be parameterized in several different ways and, as Murphy’s law would have it, the zipfR parameterization is not the same as the one presented above and discussed in Exploring Data. Fortunately, the conversion between these parameters is simple. The zipfR package defines the distribution in terms of a parameter alpha that must lie strictly between 0 and 1, and a second parameter B that I will call B_zipfR to avoid confusion with the parameter B in the above definition. These parameters are related by:

alpha = 1/a and B_zipfR = (a-1) B

Since the a parameter (and thus the alpha parameter in the zipfR package) determines the tail decay rate of the distribution, it is of the most interest here, and the rest of this post will focus on three examples: a = 1.5 (alpha = 2/3), for which both the distribution’s mean and variance are infinite, a = 2.5 (alpha = 2/5), for which the mean is finite but the variance is not, and a = 3.5 (alpha = 2/7), for which both the mean and variance are finite. The value of the parameter B in the Exploring Data definition of the distribution will be fixed at 0.2 in all of these examples, corresponding to values of B_zipfR = 0.1, 0.3, and 0.5 for the three examples considered here.
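The conversion between the two parameterizations is easy to wrap in a small helper. The function name zm_to_zipfR is hypothetical (it is not part of the zipfR package); it simply applies the two relations given above.

```r
# Convert the (a, B) parameterization used in this post to zipfR's (alpha, B)
zm_to_zipfR <- function(a, B) {
  stopifnot(a > 1, B > 0)
  data.frame(a = a, alpha = 1 / a, B_zipfR = (a - 1) * B)
}

params <- zm_to_zipfR(a = c(1.5, 2.5, 3.5), B = 0.2)
params
```

For the three cases considered here this gives alpha = 2/3, 2/5, and 2/7, with B_zipfR = 0.1, 0.3, and 0.5.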

To generate Zipf-Mandelbrot random samples, the zipfR package uses the procedure rlnre in conjunction with the procedure lnre (the abbreviation “lnre” stands for “large number of rare events” and it represents a class of data models that includes the Zipf-Mandelbrot distribution). Specifically, to generate a random sample of size N = 100 for the first case considered here, the following R code is executed:

> library(zipfR)
> ZM = lnre("zm", alpha = 2/3, B = 0.1)
> zmsample = rlnre(ZM, n=100)

The first line loads the zipfR library (which must first be installed, of course, using the install.packages command), the second line invokes the lnre command to set up the distribution with the desired parameters, and the last line invokes the rlnre command to generate 100 random samples from this distribution. (As with all R random number generators, the set.seed command should be used first to initialize the random number generator seed if you want to get repeatable results; for the results presented here, I used set.seed(101).) The sample returned by the rlnre command is a vector of 100 observations, which have the “factor” data type, although their designations are numeric (think of the factor value “1339” as meaning “1 sample of object number 1339”). In the results I present here, I have converted these factor responses to numerical ones so I can interpret them as numerical ranks. This conversion is a little subtle: simply converting from factor to numeric values via something like “zmnumeric = as.numeric(zmsample)” almost certainly doesn’t give you what you want: this returns the internal integer codes R uses for the factor levels rather than their labels, so the first level (which has a numeric label, say “1339”) becomes the number 1, the second level (since this is a random sequence, this might be “73”) becomes the number 2, etc. To get what you want (e.g., the labels “1339” and “73” assigned to the numbers 1339 and 73, respectively), you need to first convert the factors in zmsample into characters and then convert these characters into numeric values:

zmnumeric = as.numeric(as.character(zmsample))
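The difference between the two conversions is easy to see on a tiny example that does not require zipfR. The factor below is made up for illustration, with its levels deliberately listed in order of first appearance.

```r
# A small factor with numeric-looking labels (hypothetical values)
f <- factor(c("1339", "73", "1339", "5"), levels = c("1339", "73", "5"))

codes  <- as.numeric(f)                 # internal level codes: 1 2 1 3
labels <- as.numeric(as.character(f))   # the labels as numbers: 1339 73 1339 5

# An equivalent idiom that converts each level only once
labels2 <- as.numeric(levels(f))[f]
```

The second idiom gives the same result as the character conversion and can be a little faster on long vectors with few distinct levels.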

The three plots below show random samples drawn from each of the three Zipf-Mandelbrot distributions considered here. In all cases, the y-axis corresponds to the number of times the object labeled i was observed in a random sample of size N = 100 drawn from the distribution with the indicated exponent. Since the range of these indices can be quite large in the slowly-decaying members of the Zipf-Mandelbrot distribution family, the plots are drawn with logarithmic x-axes, and to facilitate comparisons, the x-axes have the same range in all three plots, as do the y-axes. In all three plots, object i = 1 occurs most often – about a dozen times in the top plot, two dozen times in the middle plot, and three dozen times in the bottom plot – and those objects with larger indices occur less frequently. The major difference between these three examples lies in the largest indices of the objects seen in the samples: we never see an object with index greater than 50 in the bottom plot, we see only two such objects in the middle plot, while more than a third of the objects in the top plot meet this condition, with the most extreme object having index i = 115,116.

As in the case of the Pareto type I distributions I discussed in several previous posts – which may be regarded as the continuous analog of the Zipf distribution – the mean is generally not a useful characterization for the Zipf distribution. This point is illustrated in the boxplot comparison presented below, which summarizes the means computed from 1000 statistically independent random samples drawn from each of the three distributions considered here, where the object labels have been converted to numerical values as described above. Thus, the three boxplots on the left represent the means – note the logarithmic scale on the y-axis – of these index values i generated for each random sample. The extreme variability seen for Case 1 (a = 1.5) reflects the fact that neither the mean nor the variance is finite for this case, and the consistent reduction in the range of variability for Cases 2 (a = 2.5, finite mean but infinite variance) and 3 (a = 3.5, finite mean and variance) reflects the “shortening tail” of this distribution with increasing exponent a. As I discussed in my last post, a better characterization than the mean for distributions like this is the “95% tail length,” corresponding to the 95% sample quantile. Boxplots summarizing these values for the three distributions considered here are shown to the right of the dashed vertical line in the plot below. In each case, the range of variation seen here is much less extreme for the 95% tail length than it is for the mean, supporting the idea that this is a better characterization for data described by Zipf-like discrete distributions.

Other alternatives to the (arithmetic) mean that I discussed in conjunction with the Pareto type I distribution were the sample median, the geometric mean, and the harmonic mean. The plot below compares these four characterizations for 1000 random samples, each of size N = 100, drawn from the Zipf-Mandelbrot distribution with a = 3.5 (the third case), for which the mean is well-defined. Even here, it is clear that the mean is considerably more variable than these other three alternatives.
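For readers who want to compute these four characterizations themselves, here is a minimal base R sketch. The helper names and the small vector of ranks are hypothetical; the single large index is borrowed from the most extreme value seen in the samples above, purely to show how differently the four summaries react to it.

```r
geometric_mean <- function(x) exp(mean(log(x)))   # requires x > 0
harmonic_mean  <- function(x) 1 / mean(1 / x)     # requires x != 0

location_summary <- function(x) {
  c(mean = mean(x), median = median(x),
    geometric = geometric_mean(x), harmonic = harmonic_mean(x))
}

ranks <- c(1, 1, 2, 3, 5, 8)        # hypothetical sample of ranks
with_extreme <- c(ranks, 115116)    # add one very large index

round(rbind(ranks = location_summary(ranks),
            with_extreme = location_summary(with_extreme)), 3)
```

Adding the one extreme index moves the arithmetic mean by several orders of magnitude, while the median and the harmonic mean barely change and the geometric mean falls in between.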

Finally, the plot below shows boxplot comparisons of these alternative characterizations – the median, the geometric mean, and the harmonic mean – for all three of the distributions considered here. Not surprisingly, Case 1 (a = 1.5) exhibits the largest variability seen for all three characterizations, but the harmonic mean is much more consistent for this case than either the geometric mean or the median. In fact, the same observation holds – although less dramatically – for Case 2 (a = 2.5), and the harmonic mean appears more consistent than the geometric mean for all three cases. This observation is particularly interesting in view of the connection between the harmonic mean and the reciprocal transformation, which I will discuss in more detail next time.

Is the “Long Tail” a Useless Concept?

2011-09-28T15:11:00.000-07:00

In response to my last post, “The Long Tail of the Pareto Distribution,” Neil Gunther had the following comment:

“Unfortunately, you've fallen into the trap of using the ‘long tail’ misnomer. If you think about it, it can't possibly be the length of the tail that sets distributions like Pareto and Zipf apart; even the negative exponential and Gaussian have infinitely long tails.”

He goes on to say that the relevant concept is the “width” or the “weight” of the tails that is important, and that a more appropriate characterization of these “Long Tails” would be “heavy-tailed” or “power-law” distributions.

Neil’s comment raises an important point: while the term “long tail” appears a lot in both the on-line and hard-copy literature, it is often somewhat ambiguously defined. For example, in his book, The Long Tail, Chris Anderson offers the following description (page 10):

“In statistics, curves like that are called ‘long-tailed distributions’ because the tail of the curve is very long relative to the head.”

The difficulty with this description is that it says nothing about how to measure “tail length,” forcing us to adopt our own definitions. It is clear from Neil’s comments that the definition he adopts for “tail length” is the width of the distribution’s support set. Under this definition, the notion of a “long-tailed distribution” is of extremely limited utility: the situation is exactly as Neil describes it, with “long-tailed distributions” corresponding to any distribution with unbounded support, including both distributions like the Gaussian and gamma distribution where the mean is a reasonable characterization, and those like the Cauchy and Pareto distribution where the mean doesn’t even exist.

The situation is analogous to that of confidence intervals, which characterize the uncertainty inherited by any characterization computed from a collection of uncertain (i.e., random) data values. As a specific example consider the mean: the sample mean is the arithmetic average of N observed data samples, and it is generally intended as an estimate of the population mean, defined as the first moment of the data distribution. A q% confidence interval around the sample mean is an interval that contains the population mean with probability at least q%. These intervals can be computed in various ways for different data characterizations, but the key point here is that they are widely used in practice, with the most popular choices being the 90%, 95% and 99% confidence intervals, which necessarily become wider as this percentage q increases. (For a more detailed discussion of confidence intervals, refer to Chapter 9 of Exploring Data in Engineering, the Sciences, and Medicine.) We can, in principle, construct 100% confidence intervals, but this leads us directly back to Neil’s objection: the 100% confidence interval for the mean is the entire support set of the distribution (e.g., for the Gaussian distribution, this 100% confidence interval is the whole real line, while for any gamma distribution, it is the set of all positive numbers). These observations suggest the following notion of “tail length” that addresses Neil’s concern while retaining the essential idea of interest in the business literature: we can compare the “q% tail length” of different distributions for some q less than 100.

In particular, consider the case of J-shaped distributions, defined as those like the Pareto type I distribution whose density p(x) decays monotonically with increasing x, approaching zero as x goes to infinity. The plot below shows two specific examples to illustrate the idea: the solid line corresponds to the (shifted) exponential distribution:

p(x) = e^(-(x-1))

for all x greater than or equal to 1 and zero otherwise, while the dotted line represents the Pareto type I distribution with location parameter k = 1 and shape parameter a = 0.5 discussed in my last post. Initially, as x increases from 1, the exponential density is greater than the Pareto density, but for x larger than about 3.5, the opposite is true: the exponential density rapidly becomes much smaller, reflecting its much more rapid rate of tail decay.

For these distributions, define the q% tail length to be the distance from the minimum possible value of x (the “head” of the distribution; here, x = 1) to the point in the tail where the cumulative probability reaches q% (i.e., the value x_q where x < x_q with probability q%). The values reported here are the quantiles x_q themselves, so the distance from the head at x = 1 is x_q – 1. In practical terms, the q% tail length tells us how far out we have to go in the tail to account for q% of the possible cases. In R, this value is easy to compute using the quantile function included in most families of available distribution functions. As a specific example, for the Pareto type I distribution, the function qparetoI in the VGAM package gives us the desired quantiles for the distribution with specified values of the parameters k (designated “scale” in the qparetoI call) and a (designated “shape” in the qparetoI call). Thus, for the case k = 1 and a = 0.5 (i.e., the dotted curve in the above plot), the “90% tail length” is given by:

> qparetoI(p=0.9,scale=1,shape=0.5)

[1] 100

For comparison, the corresponding shifted exponential distribution has the 90% tail length given by:

> 1 + qexp(p = 0.9)

[1] 3.302585

(Note that here, I added 1 to the exponential quantile to account for the shift in its domain from “all positive numbers” – the domain for the standard exponential distribution – to the shifted domain “all numbers greater than 1”.) Since these 90% tail lengths differ by a factor of 30, they provide a sound basis for declaring the Pareto type I distribution to be “longer tailed” than the exponential distribution.
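The same two numbers can be obtained in base R without the VGAM package, since the Pareto type I quantile has a simple closed form. The function names below are hypothetical helpers introduced for this sketch.

```r
# q% tail length as used above: the quantile x_q of the distribution.
# Pareto type I with location k and shape a has CDF F(x) = 1 - (k/x)^a for x >= k,
# so solving F(x_q) = q gives x_q = k * (1 - q)^(-1/a).
pareto_tail_length <- function(q, k = 1, a = 0.5) k * (1 - q)^(-1 / a)

# Shifted exponential p(x) = exp(-(x - 1)) for x >= 1
shifted_exp_tail_length <- function(q, shift = 1) shift + qexp(q)

pareto_90 <- pareto_tail_length(0.9)        # approximately 100
exp_90    <- shifted_exp_tail_length(0.9)   # approximately 3.302585
pareto_90 / exp_90                          # ratio of roughly 30
```

Writing the quantile this way also makes the role of the shape parameter explicit: the exponent -1/a grows in magnitude as a shrinks, so small values of a stretch the tail dramatically.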

These results also provide a useful basis for assessing the influence of the decay parameter a for the Pareto distribution. As I noted last time, two of the examples I considered did not have finite means (a = 0.5 and 1.0), and none of the four had finite variances (i.e., also a = 1.5 and 2.0), rendering moment characterizations like the mean and standard deviation fundamentally useless. Comparing the 90% tail lengths for these distributions, however, leads to the following results:

a = 0.5: 90% tail length = 100.000

a = 1.0: 90% tail length = 10.000

a = 1.5: 90% tail length = 4.642

a = 2.0: 90% tail length = 3.162

It is clear from these results that the shape parameter a has a dramatic effect on the 90% tail length (in fact, on the q% tail length for any q less than 100). Further, note that the 90% tail length for the Pareto type I distribution with a = 2.0 is actually a little bit shorter than that for the exponential distribution. If we move further out into the tail, however, this situation changes. As a specific example, suppose we compare the 98% tail lengths. For the exponential distribution, this yields the value 4.912, while for the four Pareto shape parameters we have the following results:

a = 0.5: 98% tail length = 2,500.000

a = 1.0: 98% tail length = 50.000

a = 1.5: 98% tail length = 13.572

a = 2.0: 98% tail length = 7.071
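
The tail lengths listed above can also be reproduced without the VGAM package. The following is a minimal sketch, assuming the standard Pareto type I quantile formula x_q = k(1 - q)^(-1/a), which follows from setting the cumulative probability 1 - (k/x)^a equal to q. The helper names pareto_tail_length and shifted_exp_tail_length are hypothetical, introduced here only for illustration.

```r
# q% tail length for the Pareto type I distribution, from its quantile
# function x_q = k * (1 - q)^(-1/a); hypothetical helper, not a package function
pareto_tail_length <- function(q, k = 1, a) {
  k * (1 - q)^(-1 / a)
}

# Shifted exponential (domain x > 1): add the shift to the standard quantile
shifted_exp_tail_length <- function(q, shift = 1) {
  shift + qexp(q)
}

a_values <- c(0.5, 1.0, 1.5, 2.0)
tail_lengths <- data.frame(
  a   = a_values,
  q90 = pareto_tail_length(0.90, a = a_values),
  q98 = pareto_tail_length(0.98, a = a_values)
)
print(round(tail_lengths, 3))

# Exponential reference values for q = 90% and q = 98%
print(round(shifted_exp_tail_length(c(0.90, 0.98)), 3))
```

Writing the tail length as a function of q makes it easy to see how the ranking of two distributions can change as q moves further into the tail.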

This value (i.e., the 98% tail length) seems a particularly appropriate choice to include here since in his book, The Long Tail, Chris Anderson notes that his original presentations on the topic were entitled “The 98% Rule,” reflecting the fact that he was explicitly considering how far out you had to go into the tail of a distribution of goods (e.g., the books for sale by Amazon) to account for 98% of the sales.

Since this discussion originally began with the question, “when are averages useless?” it is appropriate to note that, in contrast to the much better-known average, the “q% tail length” considered here is well-defined for any proper distribution. As the examples discussed here demonstrate, this characterization also provides a useful basis for quantifying the “Long Tail” behavior that is of increasing interest in business applications like Internet marketing. Thus, if we adopt this measure for any q value less than 100%, the answer to the title question of this post is, “No: The Long Tail is a useful concept.”

The downside of this minor change is that – as the results shown here illustrate – the results obtained using the q% tail length depend on the value of q we choose. In my next post, I will explore the computational issues associated with that choice.

The Long Tail of the Pareto Distribution

2011-09-17T09:54:00.000-07:00

In my last two posts, I have discussed cases where the mean is of little or no use as a data characterization. One of the specific examples I discussed last time was the case of the Pareto type I distribution, for which the density is given by:

p(x) = a k^a / x^(a+1)

defined for all x > k, where k and a are numeric parameters that define the distribution. In the example I discussed last time, I considered the case where a = 1.5, which exhibits a finite mean (specifically, the mean is 3 for this case), but an infinite variance. As the results I presented last time demonstrated, the extreme data variability of this distribution renders the computed mean too variable to be useful. Another reason this distribution is particularly interesting is that it exhibits essentially the same tail behavior as the discrete Zipf distribution; there, the probability that a discrete random variable x takes its i^th value is:

p_i = A/i^c,

where A is a normalization constant and c is a parameter that determines how slowly the tail decays. This distribution was originally proposed to characterize the frequency of words in long documents (the Zipf-Estoup law). It was investigated further by Zipf in the mid-twentieth century in a wide range of applications (e.g., the distributions of city sizes), and it has become the subject of considerable recent attention as a model for “long-tailed” business phenomena (for a non-technical introduction to some of these business phenomena, see the book by Chris Anderson, The Long Tail). I will discuss the Zipf distribution further in a later post, but one of the reasons for discussing the Pareto type I distribution first is that, because it is a continuous distribution, the math is easier, meaning that more characterization results are available for the Pareto distribution.

The mean of the Pareto type I distribution is:

Mean = ak/(a-1),

provided a > 1, and the variance of the distribution is finite only if a > 2. Plots of the probability density defined above for this distribution are shown above, for k = 1 in all cases, and with a taking the values 0.5, 1.0, 1.5, and 2.0. (This is essentially the same plot as Figure 4.17 in Exploring Data in Engineering, the Sciences, and Medicine, where I give a brief description of the Pareto type I distribution.) Note that all of the cases considered here are characterized by infinite variance, while the first two (a = 0.5 and 1.0) are also characterized by infinite means. As the results presented below emphasize, the mean represents a very poor characterization in practice for data drawn from any of these distributions, but there are alternatives, including the familiar median that I have discussed previously, along with two others that are more specific to the Pareto type I distribution: the geometric mean and the harmonic mean.

The plot below emphasizes the point made above about the extremely limited utility of the mean as a characterization of Pareto type I data, even in cases where it is theoretically well-defined. Specifically, this plot compares the four characterizations I discuss here – the mean (more precisely known as the “arithmetic mean” to distinguish it from the other means considered here), the median, the geometric mean, and the harmonic mean – for 1000 statistically independent Pareto type I data sequences, each of length N = 400, with parameters k = 1 and a = 2.0. For this example, the mean is well-defined (specifically, it is equal to 2), but compared with the other data characterizations, its variability is much greater, reflecting the more serious impact of this distribution’s infinite variance on the mean than on these other data characterizations.

To give a more complete view of the extreme variability of the arithmetic mean, the boxplots below summarize the arithmetic means computed from 1000 statistically independent samples drawn from each of the four Pareto type I distribution examples plotted above. As before, each sample is of size N = 400 and the parameter k has the value 1, but here the computed arithmetic means are shown for the parameter values a = 0.5, 1.0, 1.5, and 2.0; note the log scale used here because the range of computed means is so large. For the first two of these examples, the population mean does not exist, so it is not surprising that the computed values span such an enormous range, but even when the mean is well-defined, the influence of the infinite variance of these cases is clearly evident. It may be argued that infinite variance is an extreme phenomenon, but it is worth emphasizing here that for the specific “long tail” distributions popular in many applications, the decay rate is sufficiently slow for the variance – and sometimes even the mean – to be infinite, as in these examples.

As I have noted several times in previous posts, the median is much better behaved than the mean, so much so that it is well-defined for any proper distribution. One of the advantages of the Pareto type I distribution is that the form of the density function is simple enough that the median of the distribution can be computed explicitly from the distribution parameters. This result is given in the fabulous book by Johnson, Kotz and Balakrishnan that I have mentioned previously, which devotes an entire chapter (Chapter 20) to the Pareto family of distributions. Specifically, the median of the Pareto type I distribution with parameters k and a is given by:

Median = 2^(1/a) k

Thus, for the four examples considered here, the median values are 4.0 (for a = 0.5), 2.0 (for a = 1.0), 1.587 (for a = 1.5), and 1.414 (for a = 2.0). Boxplot summaries for the same 1000 random samples considered above are shown in the plot below, which also includes horizontal dotted lines at these theoretical median values for the four distributions. The fact that these lines correspond closely with the median lines in the boxplots gives an indication that the computed median is, on average, in good agreement with the correct values it is attempting to estimate. As in the case of the arithmetic means, the variability of these estimates decreases monotonically as a increases, corresponding to the fact that the distribution becomes generally better-behaved as the a parameter increases.
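
As a quick sanity check on the median expression, the sketch below integrates the Pareto type I density from the minimum value k up to the claimed median and confirms that half of the probability lies below it. The function dpareto1 is a hypothetical helper written for this illustration, not a function from any package.

```r
# Pareto type I density a k^a / x^(a+1) for x >= k (hypothetical helper)
dpareto1 <- function(x, k = 1, a) {
  ifelse(x >= k, a * k^a / x^(a + 1), 0)
}

k <- 1
for (a in c(0.5, 1.0, 1.5, 2.0)) {
  med <- 2^(1 / a) * k
  # Probability mass between the minimum value k and the claimed median
  mass <- integrate(dpareto1, lower = k, upper = med, k = k, a = a)$value
  cat(sprintf("a = %.1f: median = %.3f, mass below median = %.4f\n", a, med, mass))
}
```

The mass below the median should come out at 0.5 for every value of a, up to the numerical tolerance of integrate.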

The geometric mean is an alternative characterization to the more familiar arithmetic mean, one that is well-defined for any sequence of positive numbers. Specifically, the geometric mean of N positive numbers is defined as the N^th root of their product. Equivalently, the geometric mean may be computed by exponentiating the arithmetic average of the log-transformed values. In the case of the Pareto type I distribution, the utility of the geometric mean is closely related to the fact that the log transformation converts a Pareto-distributed random variable into an exponentially distributed one, a point that I will discuss further in a later post on data transformations. (These transformations are the topic of Chapter 12 of Exploring Data, where I briefly discuss both the logarithmic transformation on which the geometric mean is based and the reciprocal transformation on which the harmonic mean is based, described next.) The key point here is that the following simple expression is available for the geometric mean of the Pareto type I distribution (Johnson, Kotz, and Balakrishnan, page 577):

Geometric Mean = k exp(1/a)

For the four specific examples considered here, these geometric mean values are approximately 7.389 (for a = 0.5), 2.718 (for a = 1.0), 1.948 (for a = 1.5), and 1.649 (for a = 2.0). The boxplots shown below summarize the range of variation seen in the computed geometric means for the same 1000 statistically independent samples considered above. Again, the horizontal dotted lines indicate the correct values for each distribution, and it may be seen that the computed values are in good agreement, on average. As before, the variability of these computed values decreases with increasing a values as the distribution becomes better-behaved.

The fourth characterization considered here is the harmonic mean, again appropriate to positive values, and defined as the reciprocal of the average of the reciprocal data values. In the case of the geometric mean just discussed, the log transformation on which it is based is often useful in improving the distributional character of data values that span a wide range. In the case of the Pareto type I distribution – and a number of others – the reciprocal transformation on which the harmonic mean is based also improves the behavior of the data distribution, but this is often not the case. In particular, reciprocal transformations often make the character of a data distribution much worse: applied to the extremely well-behaved standard uniform distribution, it yields the Pareto type I distribution with a = 1, for which none of the integer moments exist; similarly, applied to the Gaussian distribution, the reciprocal transformation yields a result that is both infinite variance and bimodal. (A little thought suggests that the reciprocal transformation is inappropriate for the Gaussian distribution because it is not strictly positive, but normality is a favorite working assumption, sometimes applied to the denominators of ratios, leading to a number of theoretical difficulties. I will have more to say about that in a future post.) For the case of the Pareto type I distribution, the reciprocal transformation converts it into the extremely well-behaved beta distribution, and the harmonic mean has the following simple expression:

Harmonic mean = k(1 + 1/a)

For the four examples considered here, this expression yields harmonic mean values of 3 (for a = 0.5), 2 (for a = 1.0), 1.667 (for a = 1.5), and 1.5 (for a = 2.0). Boxplot summaries of the computed harmonic means for the 1000 simulations of each case considered previously are shown below, again with dotted horizontal lines at the theoretical values for each case. As with both the median and the geometric mean, it is clear from these plots that the computed values are correct on average, and their variability decreases with increasing values of the a parameter.
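
For readers who want to reproduce this kind of comparison, here is a minimal simulation sketch for the case k = 1 and a = 2.0. It assumes an inverse-transform sampler (if U is uniform on (0,1), then k U^(-1/a) follows the Pareto type I distribution); the helpers rpareto1, geometric_mean, and harmonic_mean are hypothetical names introduced here, and the seed is arbitrary. The VGAM package used in the tail-length post appears to provide its own generator, rparetoI, which could be substituted.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Inverse-transform sampler for the Pareto type I distribution (hypothetical helper)
rpareto1 <- function(n, k = 1, a) k * runif(n)^(-1 / a)

geometric_mean <- function(x) exp(mean(log(x)))   # exponentiated mean of the logs
harmonic_mean  <- function(x) 1 / mean(1 / x)     # reciprocal of the mean reciprocal

k <- 1; a <- 2.0; N <- 400; n_rep <- 1000

# One row per simulated sample, one column per characterization
estimates <- t(replicate(n_rep, {
  x <- rpareto1(N, k, a)
  c(mean = mean(x), median = median(x),
    geometric = geometric_mean(x), harmonic = harmonic_mean(x))
}))

# Theoretical values from the expressions given in the text
theory <- c(mean = a * k / (a - 1), median = 2^(1 / a) * k,
            geometric = k * exp(1 / a), harmonic = k * (1 + 1 / a))

# Compare the typical computed value and its spread with the theoretical value
print(round(rbind(theory = theory,
                  median_of_estimates = apply(estimates, 2, median),
                  iqr_of_estimates = apply(estimates, 2, IQR)), 3))
```

The spread is summarized with the interquartile range rather than the standard deviation because the latter is itself badly behaved for the arithmetic mean in this infinite-variance setting.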

The key point of this post has been to show that, while averages are not suitable characterizations for “long tailed” phenomena that are becoming an increasing subject of interest in many different fields, useful alternatives do exist. For the case of the Pareto type I distribution considered here, these alternatives include the popular median, along with the somewhat less well-known geometric and harmonic means. In an upcoming post, I will examine the utility of these characterizations for the Zipf distribution.

Some Additional Thoughts on Useless Averages

2011-08-27T13:46:00.000-07:00

In my last post, I described three situations where the average of a sequence of numbers is not representative enough to be useful: in the presence of severe outliers, in the face of multimodal data distributions, and in the face of infinite-variance distributions. The post generated three interesting comments that I want to respond to here.

First and foremost, I want to say thanks to all of you for giving me something to think about further, leading me in some interesting new directions. To begin, chrisbeeleyimh had the following to say:

“I seem to have rather abandoned means and medians in favor of drawing the distribution all the time, which baffles my colleagues somewhat.”

Chris also maintains a collection of data examples where the mean is the same but the shape is very different. In fact, one of the points I illustrate in Section 4.4.1 of Exploring Data in Engineering, the Sciences, and Medicine is that there are cases where not only the means but all of the moments (i.e., variance, skewness, kurtosis, etc.) are identical but the distributions are profoundly different. A specific example is taken from the book Counterexamples in Probability, 2nd Edition by J.M. Stoyanov, who shows that if the lognormal density is multiplied by the following function:

f(x) = 1 + A sin(2 pi ln x),

for any constant A between -1 and +1, the moments are unchanged. The character of the distribution is changed profoundly, however, as the following plot illustrates (this plot is similar to Fig. 4.8 in Exploring Data, which shows the same two distributions, but for A = 0.5 instead of A = 0.9, as shown here). To be sure, this behavior is pathological – distributions that have finite support, for example, are defined uniquely by their complete set of moments – but it does make the point that moment characterizations are not always complete, even if an infinite number of them are available. Within well-behaved families of distributions (such as the one proposed by Karl Pearson in 1895), a complete characterization is possible on the basis of the first few moments, which is one reason for the historical popularity of the method of moments for fitting data to distributions. It is important to recognize, however, that moments do have their limitations and that the first moment alone – i.e., the mean by itself – is almost never a complete characterization. (I am forced to say “almost” here because if we impose certain very strong distributional assumptions – e.g., the Poisson or binomial distributions – the specific distribution considered may be fully characterized by its mean. This begs the question, however, of whether this distributional assumption is adequate. My experience has been that, no matter how firmly held the belief in a particular distribution is, exceptions do arise in practice … overdispersion, anyone?)
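
The moment-matching claim can be checked numerically. The sketch below assumes the standard lognormal density (log-mean 0, log-standard deviation 1), which is an assumption on my part about the specific case intended. With the substitution y = ln x, the n-th moment becomes the integral of exp(n y) times the standard normal density, which is approximated here by a simple sum on a fine grid. The name moment_pair is a hypothetical helper.

```r
A <- 0.9                           # any constant between -1 and +1
h <- 0.001                         # grid spacing on the log scale
y <- seq(-10, 15, by = h)          # y = ln x; wide enough for moments 0 to 3

# With y = ln x, the n-th moment of the standard lognormal is the integral of
# exp(n * y) * dnorm(y); the modified density adds the factor 1 + A sin(2 pi y).
moment_pair <- function(n) {
  base <- exp(n * y) * dnorm(y)
  c(lognormal = sum(base) * h,
    modified  = sum(base * (1 + A * sin(2 * pi * y))) * h)
}

moments <- t(sapply(0:3, moment_pair))
rownames(moments) <- paste0("n = ", 0:3)
print(moments)

# Closed-form lognormal moments exp(n^2 / 2), for reference
print(exp((0:3)^2 / 2))
```

The two columns should agree to many digits for each integer n, even though the two densities look quite different when plotted.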

The plot below provides a further illustration of the inadequacy of the mean as a sole data characterization, comparing four different members of the family of beta distributions. These distributions – in the standard form assumed here – describe variables whose values range from 0 to 1, and they are defined by two parameters, p and q, that determine the shape of the density function and all moments of the distribution. The mean of the beta distribution is equal to p/(p+q), so if p = q – corresponding to the class of symmetric beta distributions – the mean is ½, regardless of the common value of these parameters. The four plots below show the corresponding distributions when both parameters are equal to 0.5 (upper left, the arcsin distribution I discussed last time), 1.0 (upper right, the uniform distribution), 1.5 (lower left), and 8.0 (lower right).
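
A short numerical summary makes the same point as the four panels. This is a minimal sketch using the standard beta mean p/(p+q) and variance pq/((p+q)^2 (p+q+1)), which reduces to 1/(4(2p+1)) when p = q; the central 50% width is obtained from the base R quantile function qbeta.

```r
shape <- c(0.5, 1.0, 1.5, 8.0)     # common value of p and q in the four panels

beta_mean <- shape / (shape + shape)              # p / (p + q), always 1/2 here
beta_sd   <- sqrt(1 / (4 * (2 * shape + 1)))      # standard deviation when p = q

# Width of the interval holding the central 50% of the probability
central_half_width <- qbeta(0.75, shape, shape) - qbeta(0.25, shape, shape)

print(data.frame(p = shape, q = shape, mean = beta_mean,
                 sd = round(beta_sd, 4),
                 central_50_width = round(central_half_width, 4)))
```

All four distributions share the mean ½, but the spread shrinks steadily as the common parameter value grows, so the mean alone cannot distinguish them.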

The second comment on my last post was from Efrique, who suggested the Student’s t-distribution with 2 degrees of freedom as a better infinite-variance example than the Cauchy example I used (corresponding to Student’s t-distribution with one degree of freedom), because the first moment doesn’t even exist for the Cauchy distribution (“there’s nothing to converge to”). The figure below expands the boxplot comparison I presented last time, comparing the means, medians, and modes (from the modeest package), for both of these infinite-variance examples: the Cauchy distribution I discussed last time and the Student’s t-distribution with two degrees of freedom that Efrique suggested. Here, the same characterization (mean, median, or mode) is summarized for both distributions in side-by-side boxplots to facilitate comparisons. It is clear from these boxplots that the results for the median and the mode are essentially identical for these distributions, but the results for the mean differ dramatically (recall that these results are truncated for the Cauchy distribution: 13.6% of the 1000 computed means fell outside the +/- 5 range shown here, exhibiting values approaching +/- 1000). This difference illustrates Efrique’s further point that the mean of the data values is a consistent estimator of the (well-defined) population mean of the Student’s t-distribution with 2 degrees of freedom, while it is not a consistent estimator for the Cauchy distribution. Still, it is also clear from this plot that the mean is substantially more variable for the Student’s t-distribution with 2 degrees of freedom than either the median or the modeest mode estimate.
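
The consistency point can be illustrated by watching how the spread of the computed means changes with sample size. The following is a rough sketch rather than a reproduction of the figure: the sample sizes, the number of replications, and the seed are all arbitrary choices, and iqr_of_means is a hypothetical helper. The interquartile range is used as the spread measure because the variance of the mean is not finite in either case.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Interquartile range of n_rep sample means, each computed from N draws
iqr_of_means <- function(rdist, N, n_rep = 500) {
  IQR(replicate(n_rep, mean(rdist(N))))
}

sample_sizes <- c(100, 10000)
spread <- data.frame(
  N      = sample_sizes,
  cauchy = sapply(sample_sizes, function(N) iqr_of_means(rcauchy, N)),
  t2     = sapply(sample_sizes, function(N) {
    iqr_of_means(function(n) rt(n, df = 2), N)
  })
)
print(spread)
```

For the Student’s t-distribution with 2 degrees of freedom, the spread of the means should shrink noticeably as N grows, while for the Cauchy distribution it should stay near the interquartile range of a single Cauchy observation, which is 2.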

Another example of an infinite-variance distribution where the mean is well-defined but highly variable is the Pareto type I distribution, discussed in Section 4.5.8 of Exploring Data. My favorite reference on distributions is the two volume set by Johnson, Kotz, and Balakrishnan (Continuous Univariate Distributions, Vol. 1 (Wiley Series in Probability and Statistics) and Continuous Univariate Distributions, Vol. 2 (Wiley Series in Probability and Statistics)), who devote an entire 55 page chapter (Chapter 20 in Volume 1) to the Pareto distribution, noting that it is named after Vilfredo Pareto, a mid nineteenth- to early twentieth-century Swiss professor of economics, who proposed it as a description of the distribution of income over a population. In fact, there are several different distributions named after Pareto. The type I distribution considered here exhibits a power-law decay like the Student’s t-distributions, but it is a J-shaped distribution whose mode is equal to its minimum value. More specifically, this distribution is defined by a location parameter that determines this minimum value and a shape parameter that determines how rapidly the tail decays for values larger than this minimum. The example considered here takes this minimum value as 1 and the shape parameter as 1.5, giving a distribution with a finite mean but an infinite variance. As in the above example, the boxplot summary shown below characterizes the mean, median, and mode for 1000 statistically independent random samples drawn from this distribution, each of size N = 100. As before, it is clear from this plot that the mean is much more highly variable than either the median or the mode.

In this case, however, we have the added complication that since this distribution is not symmetric, its mean, median and mode do not coincide. In fact, the population mode is the minimum value (which is 1 here), corresponding to the solid line at the bottom of the plot. The narrow range of the boxplot values around this correct value suggests that the modeest package is reliably estimating this mode value, but as I noted in my last post, this characterization is not useful here because it tells us nothing about the rate at which the density decays. The theoretical median value can also be calculated easily for this distribution, and here it is approximately equal to 1.587, corresponding to the dashed horizontal line in the plot. As with the mode, it is clear from the boxplot that the median estimated from the data is in generally excellent agreement with this value. Finally, the mean value for this particular distribution is 3, corresponding to the dotted horizontal line in the plot. Since this line lies fairly close to the upper quartile of the computed means (i.e., the top of the “box” in the boxplot), it follows that the estimated mean falls below the correct value almost 75% of the time, but it is also clear that when the mean is overestimated, the extent of this overestimation can be very large. Motivated in part by the fact that the mean doesn’t always exist for the Pareto distribution, Johnson, Kotz and Balakrishnan note in their chapter on these distributions that alternative location measures have been considered, including both the geometric and harmonic means. I will examine these ideas further in a future post.

Finally, klr mentioned my post on useless averages in his blog TimelyPortfolio, where he discusses alternatives to the moving average in characterizing financial time-series. For the case he considers, klr compares a 10-month moving average, the corresponding moving median, and a number of the corresponding mode estimators from the modeest package. This is a very interesting avenue of exploration for me since it is closely related to the median filter and other nonlinear digital filters that can be very useful in cleaning noisy time-series data. I discuss a number of these ideas – including moving-window extensions of other data characterizations like skewness and kurtosis – in my book Mining Imperfect Data: Dealing with Contamination and Incomplete Records.

Again, thanks to all of you for your comments. You have given me much to think about and investigate further, which is one of the joys of doing this blog.

When are averages useless?

2011-08-20T08:21:00.000-07:00

Of all possible single-number characterizations of a data sequence, the average is probably the best known. It is also easy to compute and in favorable cases, it provides a useful characterization of “the typical value” of a sequence of numbers. It is not the only such “typical value,” however, nor is it always the most useful one: two other candidates – location estimators in statistical terminology – are the median and the mode, both of which are discussed in detail in Section 4.1.2 of Exploring Data in Engineering, the Sciences, and Medicine. Like the average, these alternative location estimators are not always “fully representative,” but they do represent viable alternatives – at least sometimes – in cases where the average is sufficiently non-representative as to be effectively useless. As the title of this post suggests, the focus here is on those cases where the mean doesn’t really tell us what we want to know about a data sequence, briefly examining why this happens and what we can do about it.

First, it is worth saying a few words about the two alternatives just mentioned: the median and the mode. Of these, the mode is both the more difficult to estimate and the less broadly useful. Essentially, “the mode” corresponds to “the location of the peak in the data distribution.” One difficulty with this somewhat loose definition is that “the mode” is not always well-defined. The above collection of plots shows three examples where the mode is not well-defined, and another where the mode is well-defined but not particularly useful. The upper left plot shows the density of the uniform distribution on the range [1,2]: there, the density is constant over the entire range, so there is no single, well-defined “peak” or unique maximum to serve as a mode for this distribution. The upper right plot shows a nonparametric density estimate for the Old Faithful geyser waiting time data that I have discussed in several of my recent posts (the R data object faithful). Here, the difficulty is that there are not one but two modes, so “the mode” is not well-defined here, either: we must discuss “the modes.” The same behavior is observed for the arcsin distribution, whose density is shown in the lower left plot in the above figure. This density corresponds to the beta distribution with shape parameters both equal to ½, giving a bimodal distribution whose cumulative probability function can be written simply in terms of the arcsin function, motivating its name (see Section 4.5.1 of Exploring Data for a more complete discussion of both the beta distribution family and the special case of the arcsin distribution). In this case, the two modes of the distribution occur at the extremes of the data, at x = 1 and x = 2.

The second difficulty with the mode noted above is that it is sometimes well-defined but not particularly useful. The case of the J-shaped exponential density shown in the lower right plot above illustrates this point: this distribution exhibits a single, well-defined peak at the minimum value x = 0. Here, you don’t even have to look at the data to arrive at this result, which therefore tells you nothing about the data distribution: this density is described by a single parameter that determines how slowly or rapidly the distribution decays and the mode is independent of this parameter. Despite these limitations, there are cases where the mode represents an extremely useful data characterization, even though it is much harder to estimate than the mean or the median. Fortunately, there is a nice package available in R to address this problem: the modeest package provides 11 different mode estimation procedures. I will illustrate one of these in the examples that follow – the half range mode estimator of Bickel – and I will give a more complete discussion of this package in a later post.

The median is a far better-known data characterization than the mode, and it is both much easier to estimate and much more broadly applicable. In particular, unlike either the mean or the mode, the median is well-defined for any proper data distribution, a result demonstrated in Section 4.1.2 of Exploring Data. Conceptually, computing the median only requires sorting the N data values from smallest to largest and then taking either the middle element from this sorted list (if N is odd), or averaging the middle two elements (if N is even).
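
The sorting recipe just described can be written out directly. This is a minimal sketch, assuming a numeric vector with no missing values; median_by_sorting is a hypothetical name, and in practice the built-in median function does the same job.

```r
# Median by sorting: middle element if N is odd, average of the middle two if even
median_by_sorting <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2
  }
}

median_by_sorting(c(7, 1, 3))       # odd N: the middle sorted value, 3
median_by_sorting(c(7, 1, 3, 10))   # even N: (3 + 7) / 2 = 5

# Agreement with the built-in function on the Old Faithful waiting times
median_by_sorting(faithful$waiting) == median(faithful$waiting)
```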

The mean is, of course, both the easiest of these characterizations to compute – simply add the N data values and divide by N – and unquestionably the best known. There are, however, at least three situations where the mean can be so highly non-representative as to be useless:

1. if severe outliers are present;
2. if the distribution is multi-modal;
3. if the distribution has infinite variance.

The rest of this post examines each of these cases in turn.

I have discussed the problem of outliers before, but they are an important enough problem in practice to bear repeating. (I devote all of Chapter 7 to this topic in Exploring Data.) The plot below shows the makeup flow rate dataset, available from the companion website for Exploring Data (the dataset is makeup.csv, available on the R programs and datasets page). This dataset consists of 2,589 successive measurements of the flow rate of a fluid stream in an industrial manufacturing process. The points in this plot show two distinct forms of behavior: those with values on the order of 400 represent measurements made during normal process operation, while those with values less than about 300 correspond to measurements made when the process is shut down (these values are approximately zero) or is in the process of being either shut down or started back up. The three lines in this plot correspond to the mean (the solid line at approximately 315), the median (the dotted line at approximately 393), and the mode (the dashed line at approximately 403, estimated using the “hrm” method in the modeest package). As I have noted previously, the mean in this case represents a useful line of demarcation between the normal operation data (those points above the mean, representing 77.6% of the data) and the shutdown segments (those points below the mean, representing 22.4% of the data). In contrast, both the median and the specific mode estimator used here provide much better characterizations of the normal operating data.

The next plot below shows a nonparametric density estimate of the Old Faithful geyser waiting data I discussed in my last few posts. The solid vertical line at 70.90 corresponds to the mean value computed from the complete dataset. It has been said that a true compromise is an agreement that makes all parties equally unhappy, and this seems a reasonable description of the mean here: the value lies about mid-way between the two peaks in this distribution, centered at approximately 55 and 80; in fact, this value lies fairly close to the trough between the peaks in this density estimate. (The situation is even worse for the arcsin density discussed above: there, the two modes occur at values of 1 and 2, while the mean falls equidistant from both at 1.5, arguably the “least representative” value in the whole data range.) The median waiting time value is 76, corresponding to the dotted line just to the left of the main peak at about 80, and the mode (again, computed using the package modeest with the “hrm” method) corresponds to the dashed line at 83, just to the right of the main peak. The basic difficulty here is that all of these location estimators are inherently inadequate since they are attempting to characterize “the representative value” of a data sequence that has “two representative values:” one representing the smaller peak at around 55 and the other representing the larger peak at around 80. In this case, both the median and the mode do a better job of characterizing the larger of the two peaks in the distribution (but not a great job), although such a partial characterization is not always what we want. This type of behavior is exactly what the mixture models I discussed in my last few posts are intended to describe.
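
The mean and median quoted above are easy to reproduce from the faithful data frame that ships with R. The sketch below also locates the local maxima of a kernel density estimate; it uses the default settings of density, which is an assumption, since the post does not say which bandwidth was used for the plot. The mode estimate from the modeest package is omitted here.

```r
waiting <- faithful$waiting            # Old Faithful waiting times, built into R

mean(waiting)                          # about 70.9
median(waiting)                        # 76

# Local maxima of a kernel density estimate (default bandwidth, an assumption)
dens <- density(waiting)
is_peak <- diff(sign(diff(dens$y))) == -2   # slope changes from rising to falling
peaks <- dens$x[which(is_peak) + 1]
print(round(peaks, 1))
```

Comparing the printed peak locations with the mean shows directly how far the mean sits from either of the two typical waiting times.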

To illustrate the third situation where the mean is essentially useless, consider the Cauchy distribution, corresponding to the Student’s t distribution with one degree of freedom. This is probably the best known infinite-variance distribution there is, and it is often used as an extreme example because it causes a lot of estimation procedures to fail. The plot below is a (truncated) boxplot comparison of the values of the mean, median, and mode computed from 1000 independently generated Cauchy random number sequences, each of length N = 100. It is clear from these boxplots that the variability of the mean is much greater than that of either the median or the mode, the latter again estimated from the data using the half-range mode (hrm) method in the modeest package. One of the consequences of working with the Cauchy distribution is that the mean is no longer a consistent location estimator, meaning that the computed mean does not settle down to any fixed value in the limit of large sample sizes. In fact, the Cauchy distribution is one of the examples I discuss in Chapter 6 of Exploring Data as a counterexample to the Central Limit Theorem: for most data distributions, the distribution of the mean approaches a Gaussian limit with a variance that decreases inversely with the sample size N, but for the Cauchy distribution, the distribution of the mean is exactly the same as that of the data itself. In other words, for the Cauchy distribution, averaging a collection of N numbers does not reduce the variability at all. This is exactly what we are seeing here, although the plot below doesn’t show how bad the situation really is: the smallest value of the mean in this sequence of 1000 estimates is -798.97 and the largest value is 928.85.
In order to see any detail at all in the distribution of the median and mode values, it was necessary to restrict the range of the boxplots shown here to lie between -5 and +5, which eliminated 13.6% of the computed mean values. In contrast, the median is known to be a reasonably good location estimator for the Cauchy distribution (see Section 6.6.1 of Exploring Data for a further discussion of this point), and the results presented here suggest that Bickel’s half-range mode estimator is also a reasonable candidate. The main point here is that the mean is a completely unreasonable estimator in situations like this one, an important point in view of the growing interest in data models like the infinite-variance Zipf distribution to describe “long-tailed” phenomena in business.
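A simulation along these lines can be sketched in a few lines of base R. The seed below is arbitrary, so the extreme values will not match those quoted above, and the mode is omitted because it requires the modeest package.

```r
set.seed(101)   # arbitrary seed, chosen only for reproducibility

n_rep <- 1000   # number of simulated sequences
n_obs <- 100    # length of each sequence

# Each column of 'sims' is one Cauchy sequence of length n_obs
sims <- matrix(rcauchy(n_rep * n_obs), nrow = n_obs, ncol = n_rep)

means   <- colMeans(sims)
medians <- apply(sims, 2, median)

# Robust spread comparison: the IQR of the means is far larger
c(mean = IQR(means), median = IQR(medians))

# Fraction of means falling outside the plotting range [-5, 5]
mean(abs(means) > 5)

# Truncated boxplots, as in the comparison described above
boxplot(list(mean = means, median = medians), ylim = c(-5, 5))
```

Since the mean of N standard Cauchy values is itself standard Cauchy, the expected fraction outside [-5, 5] is 1 - (2/pi) * atan(5), or about 12.6%, which is in line with the 13.6% observed in the run described above.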

I will have more to say about both the modeest package and Zipf distributions in upcoming posts.

Fitting mixture distributions with the R package mixtools

2011-08-06T14:23:00.000-07:00

My last two posts have been about mixture models, with examples to illustrate what they are and how they can be useful. Further discussion and more examples can be found in Chapter 10 of Exploring Data in Engineering, the Sciences, and Medicine. One important topic I haven’t covered is how to fit mixture models to datasets like the Old Faithful geyser data that I have discussed previously: a nonparametric density plot gives fairly compelling evidence for a bimodal distribution, but how do you estimate the parameters of a mixture model that describes these two modes? For a finite Gaussian mixture distribution, one way is by trial and error, first estimating the centers of the peaks by eye in the density plot (these become the component means), and adjusting the standard deviations and mixing percentages to approximately match the peak widths and heights, respectively. This post considers the more systematic alternative of estimating the mixture distribution parameters using the mixtools package in R.

The mixtools package is one of several available in R to fit mixture distributions or to solve the closely related problem of model-based clustering. Further, mixtools includes a variety of procedures for fitting mixture models of different types. This post focuses on one of these – the normalmixEM procedure for fitting normal mixture densities – and applies it to two simple examples, starting with the Old Faithful dataset mentioned above. A much more complete and thorough discussion of the mixtools package – which also discusses its application to the Old Faithful dataset – is given in the R package vignette, mixtools: An R Package for Analyzing Finite Mixture Models.

The above plot shows the results obtained using the normalmixEM procedure with its default parameter values, applied to the Old Faithful waiting time data. Specifically, this plot was generated by the following sequence of R commands:

            library(mixtools)
            wait = faithful$waiting
            mixmdl = normalmixEM(wait)
            plot(mixmdl,which=2)
            lines(density(wait), lty=2, lwd=2)

Like many modeling tools in R, the normalmixEM procedure has associated plot and summary methods. In this case, the plot method displays either the log likelihood associated with each iteration of the EM fitting algorithm (more about that below), or the component densities shown above, or both. Specifying “which=1” displays only the log likelihood plot (this is the default), specifying “which = 2” displays only the density components/histogram plot shown here, and specifying “density = TRUE” without specifying the “which” parameter gives both plots. Note that the two solid curves shown in the above plot correspond to the individual Gaussian density components in the mixture distribution, each scaled by the estimated probability of an observation being drawn from that component distribution. The final line of R code above overlays the nonparametric density estimate generated by the density function with its default parameters, shown here as the heavy dashed line (obtained by specifying “lty = 2”).

Most of the procedures in the mixtools package are based on the iterative expectation maximization (EM) algorithm, discussed in Section 2 of the mixtools vignette and also in Chapter 16 of Exploring Data. A detailed discussion of this algorithm is beyond the scope of this post – books have been devoted to the topic (see, for example, the book by McLachlan and Krishnan, The EM Algorithm and Extensions (Wiley Series in Probability and Statistics) ) – but the following two points are important to note here. First, the EM algorithm is an iterative procedure, and the time required for it to reach convergence – if it converges at all – depends strongly on the problem to which it is applied. The second key point is that because it is an iterative procedure, the EM algorithm requires starting values for the parameters, and algorithm performance can depend strongly on these initial values. The normalmixEM procedure supports both user-supplied starting values and built-in estimation of starting values if none are supplied. These built-in estimates are the default and, in favorable cases, they work quite well. The Old Faithful waiting time data is a case in point – using the default starting values gives the following parameter estimates:

            > mixmdl[c("lambda","mu","sigma")]
            $lambda
            [1] 0.3608868 0.6391132

            $mu
            [1] 54.61489 80.09109

            $sigma
            [1] 5.871241 5.867718

The mixture density described by these parameters is given by:

p(x) = lambda[1] n(x; mu[1], sigma[1]) + lambda[2] n(x; mu[2], sigma[2])

where n(x; mu, sigma) represents the Gaussian probability density function with mean mu and standard deviation sigma.

One reason the default starting values work well for the Old Faithful waiting time data is that if nothing is specified, the number of components (the parameter k) is set equal to 2. Thus, if you are attempting to fit a mixture model with more than two components, this number should be specified, either by setting k to some other value and not specifying any starting estimates for the parameters lambda, mu, and sigma, or by specifying a vector with k components as starting values for at least one of these parameters. (There are a number of useful options in calling the normalmixEM procedure: for example, specifying the initial sigma value as a scalar constant rather than a vector with k components forces the component variances to be equal. I won’t attempt to give a detailed discussion of these options here; for that, type “help(normalmixEM)”.)

Another important point about the default starting values is that, aside from the number of components k, any unspecified initial parameter estimates are selected randomly by the normalmixEM procedure. This means that, even in cases where the default starting values consistently work well – again, the Old Faithful waiting time dataset seems to be such a case – the number of iterations required to obtain the final result can vary significantly from one run to the next. (Specifically, the normalmixEM procedure does not fix the seed for the random number generators used to compute these starting values, so repeated runs of the procedure with the same data will start from different initial parameter values and require different numbers of iterations to achieve convergence. In the case of the Old Faithful waiting time data, I have seen anywhere between 16 and 59 iterations required, with the final results differing only very slightly, typically in the fifth or sixth decimal place. If you want to use the same starting value on successive runs, this can be done by setting the random number seed via the set.seed command before you invoke the normalmixEM procedure.)
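The dependence of the iteration count on the starting values can be seen without mixtools. The following is a minimal base-R sketch of EM for a two-component normal mixture. It is not the normalmixEM implementation: the function name em2, the stopping rule, and the two hand-picked starting points are assumptions made for illustration.

```r
# Minimal EM for a two-component normal mixture (illustration only;
# this is not the normalmixEM implementation)
em2 <- function(x, mu, sigma, lambda = c(0.5, 0.5),
                tol = 1e-8, maxit = 1000) {
  loglik_old <- -Inf
  for (iter in 1:maxit) {
    # E-step: weighted component densities and posterior probabilities
    d1 <- lambda[1] * dnorm(x, mu[1], sigma[1])
    d2 <- lambda[2] * dnorm(x, mu[2], sigma[2])
    loglik <- sum(log(d1 + d2))
    p1 <- d1 / (d1 + d2)
    p2 <- 1 - p1

    # Stop when the log likelihood no longer improves
    if (loglik - loglik_old < tol) break
    loglik_old <- loglik

    # M-step: weighted updates of the mixing proportions, means, and sds
    lambda <- c(mean(p1), mean(p2))
    mu     <- c(sum(p1 * x) / sum(p1), sum(p2 * x) / sum(p2))
    sigma  <- c(sqrt(sum(p1 * (x - mu[1])^2) / sum(p1)),
                sqrt(sum(p2 * (x - mu[2])^2) / sum(p2)))
  }
  list(lambda = lambda, mu = mu, sigma = sigma, iterations = iter)
}

wait <- faithful$waiting

# Two hand-picked (hypothetical) starting points
fit_a <- em2(wait, mu = c(55, 80), sigma = c(5, 5))
fit_b <- em2(wait, mu = c(40, 95), sigma = c(15, 15))

fit_a$iterations
fit_b$iterations
round(rbind(fit_a$mu, fit_b$mu), 2)
```

Both starts should arrive at essentially the same estimates, while the number of iterations needed will generally differ between them, which is the same effect the random default starting values produce in normalmixEM.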

It is important to note that the default starting values do not always work well, even if the correct number of components is specified. This point is illustrated nicely by the following example. The plot above shows two curves: the solid line is the exact density for the three-component Gaussian mixture distribution described by the following parameters:

            mu = (2.00, 5.00, 7.00)
            sigma = (1.000, 1.000, 1.000)
            lambda = (0.200, 0.600, 0.200)

The dashed curve in the figure is the nonparametric density estimate generated from n = 500 observations drawn from this mixture distribution. Note that the first two components of this mixture distribution are evident in both of these plots, from the density peaks at approximately 2 and 5. The third component, however, is too close to the second to yield a clear peak in either density, giving rise instead to slightly asymmetric “shoulders” on the right side of the upper peaks. The key point is that the three components in this mixture distribution are difficult to distinguish in either of these curves, and this hints at further difficulties to come.
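A comparison of this kind can be sketched in base R as follows. The seed is hypothetical, since the one used to generate the original 500 observations is not stated, so the dashed curve will differ in detail from the one described above. The names dmix, comp, and obs are chosen here for illustration.

```r
# Parameters of the three-component mixture given above
mu     <- c(2, 5, 7)
sigma  <- c(1, 1, 1)
lambda <- c(0.2, 0.6, 0.2)

# Exact mixture density: weighted sum of the Gaussian component densities
dmix <- function(x) {
  dens <- numeric(length(x))
  for (j in seq_along(mu)) {
    dens <- dens + lambda[j] * dnorm(x, mean = mu[j], sd = sigma[j])
  }
  dens
}

# Draw n = 500 observations: pick a component for each, then sample from it
set.seed(101)   # hypothetical seed; the original seed is not stated
n    <- 500
comp <- sample(seq_along(mu), size = n, replace = TRUE, prob = lambda)
obs  <- rnorm(n, mean = mu[comp], sd = sigma[comp])

# Exact density (solid) against the nonparametric estimate (dashed)
curve(dmix, from = -2, to = 11, ylab = "Density")
lines(density(obs), lty = 2)
```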

Applying the normalmixEM procedure to the 500 sample sequence used to generate the nonparametric density estimate shown above and specifying k = 3 gives results that are substantially more variable than the Old Faithful results discussed above. In fact, to compare these results, it is necessary to be explicit about the values of the random seeds used to initialize the parameter estimation procedure. Specifying this random seed as 101 and only specifying k=3 in the normalmixEM call yields the following parameter estimates after 78 iterations:

            mu = (1.77, 4.87, 5.44)
            sigma = (0.766, 0.115, 1.463)
            lambda = (0.168, 0.028, 0.803)

Comparing these results with the correct parameter values listed above, it is clear that some of these estimation errors are quite large. The figure shown below compares the mixture density constructed from these parameters (the heavy dashed curve) with the nonparametric density estimate computed from the data used to estimate them. The prominent “spike” in this mixture density plot corresponds to the very small standard deviation estimated for the second component and it provides a dramatic illustration of the relatively poor results obtained for this particular example.

Repeating this numerical experiment with different random seeds to obtain different random starting estimates, the normalmixEM procedure failed to converge in 1000 iterations for seed values of 102 and 103, but it converged after 393 iterations for the seed value 104, yielding the following parameter estimates:

            mu = (1.79, 5.03, 5.46)
            sigma = (0.775, 0.352, 1.493)
            lambda = (0.169, 0.063, 0.768)

Arguably, the general behavior of these parameter estimates is quite similar to those obtained with the random seed value 101, but note that the second standard deviation is larger by a factor of about three (0.352 versus 0.115), and the second component of lambda more than doubles (0.063 versus 0.028).

Increasing the sample size from n = 500 to n = 2000 and repeating these experiments, the normalmixEM procedure failed to converge after 1000 iterations for all four of the random seed values 101 through 104. If, however, we specify the correct standard deviations (i.e., specify “sigma = c(1,1,1)” when we invoke normalmixEM) and we increase the maximum number of iterations to 3000 (i.e., specify “maxit = 3000”), the procedure does converge after 2417 iterations for the seed value 101, yielding the following parameter estimates:

            mu = (1.98, 4.98, 7.15)
            sigma = (1.012, 1.055, 0.929)
            lambda = (0.198, 0.641, 0.161)

While these parameters took a lot more effort to obtain, they are clearly much closer to the correct values, emphasizing the point that when we are fitting a model to data, our results generally improve as the amount of available data increases and as our starting estimates become more accurate. This point is further illustrated by the plot shown below, analogous to the previous one, but constructed from the model fit to the longer data sequence and incorporating better initial parameter estimates. Interestingly, re-running the same procedure but taking the correct means as starting parameter estimates instead of the correct standard deviations, the procedure failed to converge in 3000 iterations.
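The two kinds of call discussed above can be sketched as follows. This assumes the mixtools package is installed, and the data are simulated afresh with a hypothetical seed, so neither the estimates nor the iteration counts should be expected to match the values quoted in this post.

```r
# Sketch only: requires the mixtools package; the data here are simulated
# afresh, so the estimates will not match the values quoted above
if (requireNamespace("mixtools", quietly = TRUE)) {
  set.seed(101)   # hypothetical seed for the simulated data
  n    <- 2000
  comp <- sample(1:3, size = n, replace = TRUE, prob = c(0.2, 0.6, 0.2))
  obs  <- rnorm(n, mean = c(2, 5, 7)[comp], sd = 1)

  # Default random starting values; only the number of components is given
  set.seed(101)
  fit_default <- mixtools::normalmixEM(obs, k = 3)

  # Correct standard deviations as starting values and a larger iteration cap
  set.seed(101)
  fit_sigma <- mixtools::normalmixEM(obs, sigma = c(1, 1, 1), maxit = 3000)

  print(fit_sigma[c("lambda", "mu", "sigma")])
}
```

Setting the seed immediately before each call fixes the random starting values, which is what makes a run repeatable.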

Overall, I like what I have seen so far of the mixtools package, and I look forward to exploring its capabilities further. It’s great to have a built-in procedure – i.e., one I didn’t have to write and debug myself – that does all of the things that this package does. However, the three-component mixture results presented here do illustrate an important point: the behavior of iterative procedures like normalmixEM and others in the mixtools package can depend strongly on the starting values chosen to initialize the iteration process, and the extent of this dependence can vary greatly from one application to another.

Mixture distributions and models: a clarification

2011-07-16T11:32:00.000-07:00

In response to my last post, Chris had the following comment:

I am actually trying to better understand the distinction between mixture models and mixture distributions in my own work. You seem to say mixture models apply to a small set of models – namely regression models.

This comment suggests that my caution about the difference between mixed-effect models and mixture distributions may have caused as much confusion as clarification, and the purpose of this post is to try to clear up this confusion.

So first, let me offer the following general observations. The term “mixture models” refers to a generalization of the class of finite mixture distributions that I discussed in my previous post. I give a more detailed discussion of finite mixture distributions in Chapter 10 of Exploring Data in Engineering, the Sciences, and Medicine, and the more general class of mixture models is discussed in the book Mixture Models (Statistics: A Series of Textbooks and Monographs) by Geoffrey J. McLachlan and Kaye E. Bashford. The basic idea is that we are describing some observed phenomenon like the Old Faithful geyser data (the faithful data object in R) where a close look at the data (e.g., with a nonparametric density estimate) suggests substantial heterogeneity. In particular, the density estimates I presented last time for both of the variables in this dataset exhibit clear evidence of bimodality. Essentially, the idea behind a mixture model/mixture distribution is that we are observing something that isn’t fully characterized by a single, simple distribution or model, but instead by several such distributions or models, with some random selection mechanism at work. In the case of mixture distributions, some observations appear to be drawn from distribution 1, some from distribution 2, and so forth. The more general class of mixture models is quite broad, including things like heterogeneous regression models, where the response may depend approximately linearly on some covariate with one slope and intercept for observations drawn from one sub-population, but with another, very different slope and intercept for observations drawn from another sub-population. I present an example at the end of this post that illustrates this idea.

The probable source of confusion for Chris – and very possibly other readers – is the comment I made about the difference between these mixture models and mixed-effect models. This other class of models – which I only mentioned in passing in my post – typically consists of a linear regression model with two types of prediction variables: deterministic predictors, like those that appear in standard linear regression models, and random predictors that are typically assumed to obey a Gaussian distribution. This framework has been extended to more general settings like generalized linear models (e.g., mixed-effect logistic regression models). The R package lme4 provides support for fitting both linear mixed-effect models and generalized linear mixed-effect models to data. As I noted last time, these model classes are distinct from the mixture distribution/mixture model classes I discuss here. The models that I do discuss – mixture models – have strong connections with cluster analysis, where we are given a heterogeneous group of objects and typically wish to determine how many distinct groups of objects are present and assign individuals to the appropriate groups. A very high-level view of the many R packages available for clustering – some based on mixture model ideas and some not – is available from the CRAN clustering task view page. Two packages from this task view that I plan to discuss in future posts are flexmix and mixtools, both of which support a variety of mixture model applications. The following comments from the vignette FlexMix: A General Framework for Finite Mixture Models and Latent Class Regression in R give an indication of the range of areas where these ideas are useful:

“Finite mixture models have been used for more than 100 years, but have seen a real boost in popularity over the last decade due to the tremendous increase in available computing power. The areas of application of mixture models range from biology and medicine to physics, economics, and marketing. On the one hand, these models can be applied to data where observations originate from various groups and the group affiliations are not known, and on the other hand to provide approximations for multi-modal distributions.”

The following example illustrates the second of these ideas, motivated by the Old Faithful geyser data that I discussed last time. As a reminder, the plot above shows the nonparametric density estimate generated from the 272 observations of the Old Faithful waiting time data included in the faithful data object, using the density procedure in R with the default parameter settings. As I noted last time, the plot shows two clear peaks, the lower one centered at approximately 55 minutes, and the second at approximately 80 minutes. Also, note that the first peak is substantially smaller in amplitude and appears to be somewhat narrower than the second peak.

To illustrate the connection with finite mixture distributions, the R procedure described below generates a two-component Gaussian mixture density whose random samples exhibit approximately the same behavior seen in the Old Faithful waiting time data. The results generated by this procedure are shown in the above figure, which includes two overlaid plots: one corresponding to the exact density for the two-component Gaussian mixture distribution (the solid line), and the other corresponding to the nonparametric density estimate computed from N = 272 random samples drawn from this mixture distribution (the dashed line). As in the previous plot, the nonparametric density estimate was computed using the density command in R with its default parameter values. The first component in this mixture has mean 54.5 and standard deviation 8.0, values chosen by trial and error to approximately match the lower peak in the Old Faithful waiting time distribution. The second component has mean 80.0 and standard deviation 5.0, chosen to approximately match the second peak in the waiting time distribution. The probabilities associated with the first and second components are 0.45 and 0.55, respectively, selected to give approximately the same peak heights seen in the waiting time density estimate. Combining these results, the density of this mixture distribution is:

p(x) = 0.45 n(x; 54.5, 8.0) + 0.55 n(x; 80.0, 5.0),

where n(x;m,s) denotes the Gaussian density function with mean m and standard deviation s. These density functions can be generated using the dnorm function in R.
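As a minimal sketch of how dnorm can be used here, the function below evaluates this mixture density directly and overlays it on the Old Faithful density estimate. The function name dmix2 and the plotting range are chosen for illustration.

```r
# Two-component Gaussian mixture density built from dnorm
dmix2 <- function(x) {
  0.45 * dnorm(x, mean = 54.5, sd = 8.0) +
    0.55 * dnorm(x, mean = 80.0, sd = 5.0)
}

# The mixing probabilities sum to 1, so the density integrates to 1
integrate(dmix2, lower = 0, upper = 150)$value

# Mixture density (solid) against the Old Faithful density estimate (dashed)
curve(dmix2, from = 30, to = 110, xlab = "Waiting time (minutes)",
      ylab = "Density")
lines(density(faithful$waiting), lty = 2)
```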

The R procedure listed below generates n independent, identically distributed random samples from an m-component Gaussian mixture distribution. This procedure is called with the following parameters:

n = the number of random samples to generate

muvec = vector of m mean values

sigvec = vector of m standard deviations

pvec = vector of probabilities for each of the m components

iseed = integer seed to initialize the random number generators

The R code for the procedure looks like this:

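            MixEx01GenProc <- function(n, muvec, sigvec, pvec, iseed=101){
              # Initialize the random number generator
              set.seed(iseed)
              # Number of mixture components
              m <- length(pvec)
              # Random component index (1 through m) for each of the n samples
              indx <- sample(seq(1,m,1), size=n, replace=TRUE, prob=pvec)
              # Start from zero; each loop pass fills in one component's samples
              yvec <- 0
              for (i in 1:m){
                xvec <- rnorm(n, mean=muvec[i], sd=sigvec[i])
                yvec <- yvec + xvec * as.numeric(indx == i)
              }
              yvec
            }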

The first statement initializes the random number generator using the iseed parameter, which is given a default value of 101. The second line determines the number of components in the mixture density from the length of the pvec parameter vector, and the third line generates a random sequence indx of component indices taking the values 1 through m with probabilities determined by the pvec parameter. The rest of the program is a short loop that generates each component in turn, using indx to randomly select observations from each of these components with the appropriate probability. To see how this works, note that the first pass through the loop generates the random vector xvec of length n, with mean given by the first element of the vector muvec and standard deviation given by the first element of the vector sigvec. Then, for every one of the n elements of yvec for which the indx vector is equal to 1, yvec is set equal to the corresponding element of this first random component xvec. On the second pass through the loop, the second random component is generated as xvec, again with length n but now with mean specified by the second element of muvec and standard deviation determined by the second element of sigvec. As before, this value is added to the initial value of yvec whenever the selection index vector indx is equal to 2. Note that since each element of the indx vector takes exactly one of the values 1 through m, none of the nonzero elements of yvec computed during the first iteration of the loop are modified; instead, the only elements of yvec that are modified in the second pass through the loop have their initial value of zero, specified in the line above the start of the loop. More generally, each pass through the loop generates the next component of the mixture distribution and fills in the corresponding elements of yvec as determined by the random selection index vector indx.
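Because rnorm accepts vectors for its mean and sd arguments, the same selection logic can also be written without the loop. The following is an alternative sketch, not the procedure listed above, and the name rmix_sketch is hypothetical. It draws only the n values that are actually kept, rather than n values per component, so for a given seed its output will differ numerically from that of MixEx01GenProc even though both sample from the same distribution.

```r
rmix_sketch <- function(n, muvec, sigvec, pvec, iseed = 101) {
  set.seed(iseed)
  # Component label for each of the n observations
  indx <- sample(seq_along(pvec), size = n, replace = TRUE, prob = pvec)
  # Index the parameter vectors by label so each draw uses its own mean and sd
  rnorm(n, mean = muvec[indx], sd = sigvec[indx])
}

# N = 272 samples from the two-component mixture described above
y <- rmix_sketch(272, muvec = c(54.5, 80.0), sigvec = c(8.0, 5.0),
                 pvec = c(0.45, 0.55))
length(y)
plot(density(y), lty = 2, main = "Simulated waiting times")
```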

As I noted at the beginning of this post, the notion of a mixture model is more general than that of the finite mixture distributions just described, but closely related. I conclude this post with a simple example of a more general mixture model. The above scatter plot shows two variables, x and y, related by the following mixture model:

y = x + e₁ with probability p₁ = 0.40,

and

y = -x + 2 + e₂ with probability p₂ = 0.60,

where e₁ is a zero-mean Gaussian random variable with standard deviation 0.1, and e₂ is a zero-mean Gaussian random variable with standard deviation 0.3. To emphasize the components in the mixture model, points corresponding to the first component are plotted as solid circles, while points corresponding to the second component are plotted as open triangles. The two dashed lines in this plot represent the ordinary least squares regression lines fit to each component separately, and they both correspond reasonably well to the underlying linear relationships that define the two components (e.g., the least squares line fit to the solid circles has a slope of approximately +1 and an intercept of approximately 0). In contrast, the heavier dotted line represents the ordinary least squares regression line fit to the complete dataset without any knowledge of its underlying component structure: this line is almost horizontal and represents a very poor approximation to the behavior of the dataset.
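A dataset of this kind can be simulated with a short base-R sketch. The sample size, the seed, and the distribution of x (taken here as uniform on [0, 2]) are assumptions, since the text above does not specify them.

```r
set.seed(101)   # arbitrary seed for reproducibility
n <- 200        # hypothetical sample size; none is stated above

# Assumption: x is uniform on [0, 2]; its distribution is not given above
x    <- runif(n, min = 0, max = 2)
comp <- sample(1:2, size = n, replace = TRUE, prob = c(0.40, 0.60))

# Component 1: y = x + e1;  component 2: y = -x + 2 + e2
y <- ifelse(comp == 1,
            x + rnorm(n, sd = 0.1),
            -x + 2 + rnorm(n, sd = 0.3))

fit1   <- lm(y ~ x, subset = comp == 1)   # component 1 only
fit2   <- lm(y ~ x, subset = comp == 2)   # component 2 only
fitall <- lm(y ~ x)                       # ignores the component structure

rbind(comp1 = coef(fit1), comp2 = coef(fit2), all = coef(fitall))

# Solid circles for component 1, open triangles for component 2
plot(x, y, pch = ifelse(comp == 1, 16, 2))
abline(fit1, lty = 2); abline(fit2, lty = 2)
abline(fitall, lty = 3, lwd = 2)
```

Averaging the two component lines with their probabilities gives 0.40 x + 0.60 (2 - x) = 1.2 - 0.2 x, so the line fit to the pooled data should have a slope near -0.2 whatever distribution is assumed for x, which is why it comes out almost horizontal.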

The point of this example is to illustrate two things. First, it provides a relatively simple illustration of how the mixture density idea discussed above generalizes to the setting of regression models and beyond: we can construct fairly general mixture models by requiring different randomly selected subsets of the data to conform to different modeling assumptions. The second point – emphasized by the strong disagreement between the overall regression line and both of the component regression lines – is that if we are given only the dataset (i.e., the x and y values themselves) without knowing which component they represent, standard analysis procedures are likely to perform very badly. This question – how do we analyze a dataset like this one without detailed prior knowledge of its heterogeneous structure – is what R packages like flexmix and mixtools are designed to address.