ExploringDataBlog, by Ron Pearson (aka TheNoodleDoodler), http://www.blogger.com/profile/15693640298594791682
A question of model uncertainty (posted 2014-03-09)
<div class="MsoNormal">It has been several months since my last post on classification tree models, because two things have been consuming all of my spare time. The first is that I taught a night class for the University of Connecticut’s Graduate School of Business, introducing R to students with little or no prior exposure to either R or programming. My hope is that the students learned something useful – I can say with certainty that I did – but preparing for the class and teaching it took a lot of time. The other activity, which has taken essentially all of my time since the class ended, is the completion of a book on nonlinear digital filtering using Python, joint work with my colleague Moncef Gabbouj of the Tampere University of Technology in Tampere, Finland. I will have more to say about both of these activities in the future, but for now I wanted to respond to a question raised about my last post.</div><div class="MsoNormal"><br /></div><div class="MsoNormal">Specifically, Professor Frank Harrell, the developer of the extremely useful <b>Hmisc</b> package, asked the following:</div><div class="MsoNormal"><br /></div><blockquote class="tr_bq"> How did you take into account model uncertainty? 
The uncertainty resulting from data mining to find nodes and thresholds for continuous predictors has a massive impact on confidence intervals for estimates from recursive partitioning.</blockquote><div class="MsoNormal"><br /></div><div class="MsoNormal">The short answer is that model uncertainty was not accounted for in the results I presented last time, primarily because – as Professor Harrell’s comments indicate – this is a complicated issue for tree-based models. The primary objective of this post and the next few is to discuss this issue.</div><div class="MsoNormal"><br /></div><div class="MsoNormal">So first, what exactly is model uncertainty? Any time we fit an empirical model to data, the results we obtain inherit some of the uncertainty present in the data. For the specific example of linear regression models, the magnitude of this uncertainty is partially characterized by the standard errors included in the results returned by R’s <b>summary()</b> function. This magnitude depends on both the uncertainty inherent in the data and the algorithm we use to fit the model. Sometimes – and classification tree models are a case in point – this uncertainty is not restricted to variations in the values of a fixed set of parameters, but it can manifest itself in substantial structural variations. That is, if we fit classification tree models to two similar but not identical datasets, the results may differ in the number of terminal nodes, the depths of these terminal nodes, the variables that determine the path to each one, and the values of these variables that determine the split at each intermediate node. This is the issue Professor Harrell raised in his comments, and the primary point of this post is to present some simple examples to illustrate its nature and severity.</div><div class="MsoNormal"><br /></div><div class="MsoNormal">In addition, this post has two other objectives. 
The first is to make amends for a very bad practice demonstrated in my last two posts. Specifically, the classification tree models described there were fit to a relatively large dataset and then evaluated with respect to that same dataset. This is bad practice because it can lead to overfitting, a problem that I will discuss in detail in my next post. (For a simple example that illustrates this problem, see the discussion in Section 1.5.3 of <a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650">Exploring Data in Engineering, the Sciences, and Medicine</a>.) In the machine learning community, this issue is typically addressed by splitting the original dataset randomly into three parts: a training subset (Tr) used for model-fitting, a validation subset (V) used for intermediate modeling decisions (e.g., which variables to include in the model), and a test subset (Te) used for final model evaluation. This approach is described in Section 7.2 of <a href="http://statweb.stanford.edu/~tibs/ElemStatLearn/">The Elements of Statistical Learning</a> by Hastie, Tibshirani, and Friedman, who suggest 50% training, 25% validation, and 25% test as a typical choice.</div><div class="MsoNormal"><br /></div><div class="MsoNormal">The other point of this post is to say something about the different roles of model uncertainty and data uncertainty in the practice of predictive modeling. I will say a little more at the end, but whether we are considering business applications like predicting customer behavior or industrial process control applications to predict the influence of changes in control valve settings, the basic predictive modeling process consists of three steps: build a prediction model; fix (i.e., “finalize”) this model; and apply it to generate predictions from data not seen in the model-building process. 
In these applications, model uncertainty plays an important role in the model development process, but once we have fixed the model, we have eliminated this uncertainty by fiat. Uncertainty remains an important issue in these applications, but the source of this uncertainty is in the data from which the model generates its predictions and not in the model itself once we have fixed it. Conversely, as George Box famously said, “all models are wrong, but some are useful,” and this point is crucial here: if the model uncertainty is great enough, it may be difficult or impossible to select a fixed model that is good enough to be useful in practice.</div><div class="MsoNormal"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-X3GeYf3OqkY/UxylhzQA5xI/AAAAAAAAAN4/QxD1U1bNrY0/s1600/TreeFull.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-X3GeYf3OqkY/UxylhzQA5xI/AAAAAAAAAN4/QxD1U1bNrY0/s1600/TreeFull.png" height="319" width="320" /></a></div><div class="MsoNormal"><br /></div><div class="MsoNormal"><br /></div><div class="MsoNormal">Returning to the topic of uncertainty in tree-based models, the above plot is a graphical representation of a classification tree model repeated from my previous two posts. This model was fit using the <b>ctree</b> procedure in the R package <b>party</b>, taking all optional parameters at their default values. As before, the dataset used to generate this model was the Australian vehicle insurance dataset <b>car.csv</b>, obtained from the website associated with the book <a href="http://www.businessandeconomics.mq.edu.au/our_departments/Applied_Finance_and_Actuarial_Studies/research/books/GLMsforInsuranceData">Generalized Linear Models for Insurance Data</a>, by Piet de Jong and Gillian Z. Heller. 
This model – and all of the others considered in this post – was fit using the same formula as before:</div><div class="MsoNormal"><br /></div><div class="MsoNormal"> Fmla = clm ~ veh_value + veh_body + veh_age + gender + area + agecat</div><div class="MsoNormal"><br /></div><div class="MsoNormal">Each record in this dataset describes a single-vehicle, single-driver insurance policy, and clm is a binary response variable taking the value 1 if the policy filed one or more claims during the observation period and 0 otherwise. The other variables (on the right side of “~”) represent covariates that are either numeric (veh_value, the value of the vehicle) or categorical (all other variables, representing the vehicle body type, its age, the gender of the driver, the region where the vehicle is driven, and the driver’s age).</div><div class="MsoNormal"><br /></div><div class="MsoNormal">As I noted above, this model was fit to the entire dataset, a practice that is to be discouraged since it does not leave independent datasets of similar character for validation and testing. To address this problem, I randomly partitioned the original dataset into a 50% training subset, a 25% validation subset, and a 25% test subset as suggested by Hastie, Tibshirani and Friedman. The plot shown below represents the <b>ctree</b> model we obtain using exactly the same fitting procedure as before, but applied to the 50% random training subset instead of the complete dataset. Comparing these plots reveals substantial differences in the overall structure of the trees we obtain, strictly as a function of the data used to fit the models. In particular, while the original model has seven terminal nodes (i.e., the tree assigns every record to one of seven “leaves”), the model obtained from the training data subset has only four. 
Also, note that the branches in the original tree model are determined by the three variables agecat, veh_body, and veh_value, while the branches in the model built from the training subset are determined by the two variables agecat and veh_value only.</div><div class="MsoNormal"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-8-rb7qgiujo/UxymjpPBTII/AAAAAAAAAOE/z_PySdRZQSo/s1600/TreeT.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-8-rb7qgiujo/UxymjpPBTII/AAAAAAAAAOE/z_PySdRZQSo/s1600/TreeT.png" height="319" width="320" /></a></div><div class="MsoNormal"><o:p><br /></o:p></div><div class="MsoNormal"><br /></div><div class="MsoNormal">These differences illustrate the point noted above about the strong dependence of classification tree model structure on the data used in model-building. One could object that since the two datasets used here differ by a factor of two in size, the comparison isn’t exactly “apples-to-apples.” To see that this is not really the issue, consider the following two cases, based on the idea of bootstrap resampling. I won’t attempt a detailed discussion of the bootstrap approach here, but the basic idea is to assess the effects of data variability on a computational procedure by applying that procedure to multiple datasets, each obtained by sampling with replacement from a single source dataset. (For a comprehensive discussion of the bootstrap and some of its many applications, refer to the book <a href="http://www.amazon.com/Bootstrap-Application-Statistical-Probabilistic-Mathematics/dp/0521574714">Bootstrap Methods and their Application</a> by A.C. Davison and D.V. Hinkley.) The essential motivation is that these datasets – called bootstrap resamples – all have the same essential statistical character as the original dataset. 
Thus, by comparing the results obtained from different bootstrap resamples, we can assess the variability in results for which exact statistical characterizations are either unknown or impractical to compute. Here, I use this idea to obtain datasets that should address the “apples-to-apples” concern raised above. More specifically, I start with the training data subset used to generate the model described in the previous figure, and I use R’s built-in <b>sample()</b> function to sample the rows of this dataframe with replacement. For an arbitrary dataframe DF, the code to do this is simple:</div><div class="MsoNormal"><br /></div><blockquote class="tr_bq" style="margin-left: .5in;">> set.seed(iseed) </blockquote><blockquote class="tr_bq" style="margin-left: .5in;">> BootstrapIndex = sample(seq(1,nrow(DF),1),size=nrow(DF),replace=TRUE) </blockquote><blockquote class="tr_bq" style="margin-left: .5in;">> ResampleFrame = DF[BootstrapIndex,]</blockquote><div class="MsoNormal"><br /></div><div class="MsoNormal">The only variable in this procedure is the seed for the random sampling function <b>sample()</b>, which I have denoted as iseed. 
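</div><div class="MsoNormal"><br /></div><div class="MsoNormal">As a minimal sketch of how the three lines above might be packaged for reuse, the following base R code wraps them in a helper and applies it to a toy dataframe for two seeds. The function name BootstrapResample and the dataframe ToyFrame are hypothetical illustrations, not part of the original analysis, and the toy data merely stands in for the training subset:</div><div class="MsoNormal"><br /></div>

```r
# Hypothetical helper (not from the original analysis): draw one bootstrap
# resample of the rows of DF, using iseed to make the draw repeatable.
BootstrapResample <- function(DF, iseed){
  set.seed(iseed)
  BootstrapIndex <- sample(seq_len(nrow(DF)), size = nrow(DF), replace = TRUE)
  DF[BootstrapIndex, , drop = FALSE]
}

# Toy dataframe standing in for the training subset
ToyFrame <- data.frame(id = 1:1000, x = (1:1000)/10)

Resample5 <- BootstrapResample(ToyFrame, 5)
Resample6 <- BootstrapResample(ToyFrame, 6)

# Each resample has the same number of rows as the source dataframe
print(c(nrow(ToyFrame), nrow(Resample5), nrow(Resample6)))

# Fraction of the original rows that appear at least once in the resample
FracDistinct5 <- length(unique(Resample5$id)) / nrow(ToyFrame)
print(FracDistinct5)
```

<div class="MsoNormal">Because the seed is the only input that changes, calling the helper twice with the same iseed should reproduce the same resample, while different seeds give different mixes of rows from the same source.</div><div class="MsoNormal"><br /></div><div class="MsoNormal">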
The extremely complicated figure below shows the <b>ctree</b> model obtained using the bootstrap resample generated from the training subset with iseed = 5.</div><div class="MsoNormal"><br /></div><div class="separator" style="clear: both; text-align: center;"><br /></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-lxTrGZk7n90/Uxyn_2pfB2I/AAAAAAAAAOY/ekhilLyXVyw/s1600/Tree5.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-lxTrGZk7n90/Uxyn_2pfB2I/AAAAAAAAAOY/ekhilLyXVyw/s1600/Tree5.png" height="319" width="320" /></a></div><div class="MsoNormal"><br /></div><div class="MsoNormal"><br /></div><div class="MsoNormal">Comparing this model with the previous one – both built from datasets of the same size, with the same general data characteristics – we see that the differences are even more dramatic than those between the original model (built from the complete dataset) and the second one (built from the training subset). Specifically, while the training subset model has four terminal nodes, determined by two variables, the bootstrap resample model uses all six of the variables included in the model formula, yielding a tree with 16 terminal nodes. But wait – sampling with replacement generates a significant number of duplicated records (for large datasets, each bootstrap resample contains approximately 63.2% of the original records, meaning that the other 36.8% of the rows in the resample must be duplicates). Could this be the reason the results are so different? 
The following example shows that this is not the issue.</div><div class="MsoNormal"><br /></div><div class="MsoNormal"></div><div class="MsoNormal"><o:p></o:p></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-w6tbbQzqWn0/Uxyn8VuoH-I/AAAAAAAAAOU/RJTj_PhMFsM/s1600/Tree6.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-w6tbbQzqWn0/Uxyn8VuoH-I/AAAAAAAAAOU/RJTj_PhMFsM/s1600/Tree6.png" height="319" width="320" /></a></div><div class="MsoNormal"><br /></div><div class="MsoNormal"><o:p><br /></o:p></div><div class="MsoNormal">This plot shows the <b>ctree</b> model obtained from another bootstrap resample of the training data subset, obtained by specifying iseed = 6 instead of iseed = 5. This second bootstrap resample tree is much simpler, with only 7 terminal nodes instead of 16, and the branches of the tree are based on only four of the prediction variables instead of all six (specifically, neither gender nor veh_body appears in this model). While I don’t include all of the corresponding plots, I have also constructed and compared the <b>ctree</b> models obtained from the bootstrap resamples generated for all iseed values between 1 and 8, giving final models involving between four and six variables, with between 7 and 16 terminal nodes. In all cases, the datasets used in building these models were exactly the same size and had the same statistical character. The key point is that, as Professor Harrell noted in his comments, the structural variability of these classification tree models across similar datasets is substantial. In fact, this variability of individual tree-based models was one of the key motivations for developing the random forest method, which achieves substantially reduced model uncertainty by averaging over many randomly generated trees. 
Unfortunately, the price we pay for this improved model stability is a complete loss of interpretability. That is, looking at any one of the plots shown here, we can construct a simple description (e.g., node 12 in the above figure represents older drivers – agecat > 4 – with less expensive cars, and it has the lowest risk of any of the groups identified there). While we may obtain less variable predictions by averaging over a large number of these trees, such simple intuitive explanations of the resulting model are no longer possible.</div><div class="MsoNormal"><br /></div><div class="MsoNormal">I noted earlier that predictive modeling applications typically involve a three-step strategy: fit the model, fix the model, and apply the model. I also argued that once we fix the model, we have eliminated model uncertainty when we apply it to new data. Unfortunately, if the inherent model uncertainty is large, as in the examples presented here, this greatly complicates the “fix the model” step. That is, if small variations in our training data subset can cause large changes in the structure of our prediction model, it is likely that very different models will exhibit similar performance when applied to our validation data subset. How, then, do we choose? I will examine this issue further in my next post when I discuss overfitting and the training/validation/test split in more detail. </div><br /><div class="MsoNormal"><br /></div>
Assessing the precision of classification tree model predictions (posted 2013-08-06)
<div class="MsoNormal">My last post focused on the use of the <i>ctree</i> procedure in the R package <i>party</i> to build classification tree models. 
These models map each record in a dataset into one of M mutually exclusive groups, which are characterized by their average response. For responses coded as 0 or 1, this average may be regarded as an estimate of the probability that a record in the group exhibits a “positive response.” This interpretation leads to the idea discussed here, which is to replace this estimate with the size-corrected probability estimate I discussed in my previous post (<a href="http://exploringdatablog.blogspot.com/2011/04/screening-for-predictive.html">Screening for predictive characteristics</a>). Also, as discussed in that post, these estimates provide the basis for confidence intervals that quantify their precision, particularly for small groups.</div><div class="MsoNormal"><br /></div><div class="MsoNormal"> </div><div class="MsoNormal">In this post, the basis for these estimates is the R package <i>PropCIs</i>, which includes several procedures for estimating binomial probabilities and their confidence intervals, including an implementation of the method discussed in my previous post. Specifically, the procedure used here is <i>addz2ci</i>, discussed in Chapter 9 of <a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650">Exploring Data in Engineering, the Sciences, and Medicine</a>. As noted in both that discussion and in my previous post, this estimator is described in a paper by Brown, Cai and DasGupta in 2002, but the documentation for the <i>PropCIs</i> package cites an earlier paper by Agresti and Coull (“Approximate is better than exact for interval estimation of binomial proportions,” in <i>The American Statistician,</i> vol. 52, 1998, pp. 119-126). The essential idea is to modify the classical estimator, augmenting the counts of 0’s and 1’s in the data by <i>z<sup>2</sup>/2</i>, where <i>z</i> is the normal z-score associated with the significance level. 
As a specific example, <i>z</i> is approximately 1.96 for 95% confidence limits, so this modification adds approximately 2 to each count. In cases where both of these counts are large, this correction has negligible effect, so the size-corrected estimates and their corresponding confidence intervals are essentially identical with the classical results. In cases where either the sample is small or one of the possible responses is rare, these size-corrected results are much more reasonable than the classical results, which motivated their use both here and in my earlier post.</div><div class="MsoNormal"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-x_lzlloipQg/UgGoBHq-A3I/AAAAAAAAALk/faiVi488bWs/s1600/Tree2PostFig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="283" src="http://4.bp.blogspot.com/-x_lzlloipQg/UgGoBHq-A3I/AAAAAAAAALk/faiVi488bWs/s320/Tree2PostFig01.png" width="320" /></a></div><div class="MsoNormal"><br /></div><div class="MsoNormal"><br /></div><div class="MsoNormal">The above plot provides a simple illustration of the results that can be obtained using the <i>addz2ci</i> procedure, in a case where some groups are small enough for these size-corrections to matter. More specifically, this plot is based on the Australian vehicle insurance dataset that I discussed in my last post, and it characterizes the probability that a policy files a claim (i.e., that the variable <i>clm</i> has the value 1), for each of the 13 vehicle types included in the dataset. The heavy horizontal line segments in this plot represent the size-corrected claim probability estimates for each vehicle type, while the open triangles connected by dotted lines represent the upper and lower 95% confidence limits around these probability estimates, computed as described above. 
The solid horizontal line represents the overall claim probability for the dataset, to serve as a reference value for the individual subset results.</div><div class="MsoNormal"> </div><div class="MsoNormal"><br /></div><div class="MsoNormal">An important observation here is that although this dataset is reasonably large (there are a total of 67,856 records), the subgroups are quite heterogeneous in size, spanning the range from 27 records listing “RDSTR” as the vehicle type to 22,233 listing “SEDAN”. As a consequence, although the classical and size-adjusted claim probability estimates and their confidence intervals are essentially identical for the dataset overall, the extent of this agreement varies substantially across the different vehicle types. Taking the extremes, the results for the largest group (“SEDAN”) are, as with the dataset overall, almost identical: the classical estimate is 0.0665, while the size-adjusted estimate is 0.0664; the lower 95% confidence limit also differs by one in the fourth decimal place (classical 0.0631 versus size-corrected 0.0632), and the upper limit is identical to four decimal places, at 0.0697. In marked contrast, the classical and size-corrected estimates for the “RDSTR” group are 0.0741 versus 0.1271, the upper 95% confidence limits are 0.1729 versus 0.2447, and the lower confidence limits are -0.0247 versus 0.0096. Note that in this case, the lower classical confidence limit violates the requirement that probabilities cannot be negative, something that is not possible for the <i>addz2ci</i> confidence limits (specifically, negative values are less likely to arise, as in this example, and if they ever do arise, they are replaced with zero, the smallest feasible value for the lower confidence limit; similarly for upper confidence limits that exceed 1). 
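</div><div class="MsoNormal"><br /></div><div class="MsoNormal">The “RDSTR” numbers quoted above can be checked by hand. The following base R sketch re-implements the classical and size-corrected calculations directly rather than calling <i>addz2ci</i>, so it is an illustration of the formulas described earlier, not the package's own code. It assumes 2 claims among the 27 “RDSTR” records, a count that is inferred from the classical estimate of 0.0741 rather than stated explicitly; the function names ClassicalCI and SizeCorrectedCI are hypothetical:</div><div class="MsoNormal"><br /></div>

```r
# Classical estimate and confidence limits for x positives in n records
ClassicalCI <- function(x, n, conf.level = 0.95){
  z <- qnorm(1 - (1 - conf.level)/2)
  p <- x/n
  se <- sqrt(p*(1 - p)/n)
  c(estimate = p, lower = p - z*se, upper = p + z*se)
}

# Size-corrected version: add z^2/2 to the count of 1's and to the count
# of 0's, then truncate the limits to the feasible range [0, 1]
SizeCorrectedCI <- function(x, n, conf.level = 0.95){
  z <- qnorm(1 - (1 - conf.level)/2)
  nAdj <- n + z^2
  p <- (x + z^2/2)/nAdj
  se <- sqrt(p*(1 - p)/nAdj)
  c(estimate = p, lower = max(0, p - z*se), upper = min(1, p + z*se))
}

# Assumed counts for the "RDSTR" group: 2 claims in 27 records
Classical <- round(ClassicalCI(2, 27), 4)
Corrected <- round(SizeCorrectedCI(2, 27), 4)
print(rbind(Classical, Corrected))
```

<div class="MsoNormal">Under that assumption, the rounded results agree with the values quoted above: 0.0741, -0.0247, and 0.1729 for the classical case, versus 0.1271, 0.0096, and 0.2447 for the size-corrected case.</div><div class="MsoNormal"><br /></div><div class="MsoNormal">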
As is often the case, the primary advantage of plotting these results is that it gives us a much more immediate indication of the relative precision of the probability estimates, particularly in cases like “RDSTR” where these confidence intervals are quite wide.</div><div class="MsoNormal"><br /></div><div class="MsoNormal">The R code used to generate these results uses both the <i>addz2ci</i> procedure from the <i>PropCIs</i> package and the <i>summaryBy</i> procedure from the <i>doBy</i> package. Specifically, the following function returns a dataframe with one row for each distinct value of the variable <i>GroupingVar</i>. The columns of this dataframe include this value, the total number of records listing this value, the number of these records for which the binary response variable <i>BinVar</i> is equal to 1, the lower confidence limit, the upper confidence limit, and the size-corrected estimate. The function is called with <i>BinVar</i>, <i>GroupingVar</i>, and the confidence level (the argument <i>SigLevel</i>), with a default of 95%. The first two lines of the function require the <i>doBy</i> and <i>PropCIs</i> packages. The third line constructs an internal dataframe, passed to the <i>summaryBy</i> function in the <i>doBy</i> package, which applies the <i>length</i> and <i>sum</i> functions to the subset of <i>BinVar</i> values defined by each level of <i>GroupingVar</i>, giving the total number of records and the total number of records with <i>BinVar</i> = 1. The main loop in this program applies the <i>addz2ci</i> function to these two numbers for each value of <i>GroupingVar</i>, and each call returns a two-element list. The element <i>$estimate</i> gives the size-corrected probability estimate, and the element <i>$conf.int</i> is a vector of length 2 with the lower and upper confidence limits for this estimate. The rest of the program appends these values to the internal dataframe created by the <i>summaryBy</i> function, which is returned as the final result. 
The code listing follows:</div><div class="MsoNormal"><br /></div><blockquote class="tr_bq">BinomialCIbyGroupFunction <- function(BinVar, GroupingVar, SigLevel = 0.95){<br /> # Load the packages that provide summaryBy and addz2ci<br /> require(doBy)<br /> require(PropCIs)<br /> # Count records (b.length) and positive responses (b.sum) for each group<br /> IntFrame = data.frame(b = BinVar, g = as.factor(GroupingVar))<br /> SumFrame = summaryBy(b ~ g, data = IntFrame, FUN=c(length,sum))<br /> # Compute the size-corrected estimate and confidence limits for each group<br /> n = nrow(SumFrame)<br /> EstVec = vector("numeric",n)<br /> LowVec = vector("numeric",n)<br /> UpVec = vector("numeric",n)<br /> for (i in 1:n){<br /> Rslt = addz2ci(x = SumFrame$b.sum[i],n = SumFrame$b.length[i],conf.level=SigLevel)<br /> EstVec[i] = Rslt$estimate<br /> CI = Rslt$conf.int<br /> LowVec[i] = CI[1]<br /> UpVec[i] = CI[2]<br /> }<br /> SumFrame$LowerCI = LowVec<br /> SumFrame$UpperCI = UpVec<br /> SumFrame$Estimate = EstVec<br /> return(SumFrame)<br />}</blockquote><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-I5hspeKvPcI/UgGqALxgzEI/AAAAAAAAAL0/bRuw6CXTG1M/s1600/Tree2PostFig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="283" src="http://4.bp.blogspot.com/-I5hspeKvPcI/UgGqALxgzEI/AAAAAAAAAL0/bRuw6CXTG1M/s320/Tree2PostFig02.png" width="320" /></a></div><br /><br /> The binary response characterization tools just described can be applied to the results obtained from a classification tree model. Specifically, since a classification tree assigns every record to a unique terminal node, we can characterize the response across these nodes, treating the node numbers as the data groups, analogous to the vehicle body types in the previous example. 
As a specific illustration, the figure above gives a graphical representation of the <i>ctree</i> model considered in my previous post, built using the <i>ctree</i> command from the <i>party</i> package with the following formula:<br /> <div class="MsoNormal"><br /></div><div class="MsoNormal"> Fmla = clm ~ veh_value + veh_body + veh_age + gender + area + agecat</div><div class="MsoNormal"><br /></div><div class="MsoNormal">Recall that this formula specifies we want a classification tree that predicts the binary claim indicator <i>clm</i> from the six variables on the right-hand side of the tilde, separated by “+” signs. Each of the terminal nodes in the resulting <i>ctree</i> model is characterized with a rectangular box in the above figure, giving the number of records in each group <i>(n)</i> and the average positive response <i>(y)</i>, corresponding to the classical claim probability estimate. Note that the product <i>ny</i> corresponds to the total number of claims in each group, so these products and the group sizes together provide all of the information we need to compute the size-corrected claim probability estimates and their confidence limits for each terminal node. Alternatively, we can use the <i>where</i> method associated with the binary tree object that <i>ctree</i> returns to extract the terminal nodes associated with each observation. 
Then, we simply use the terminal node in place of vehicle body type in exactly the same analysis as before.</div><div class="MsoNormal"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-WAAQNC5-2IU/UgGq_xDPi_I/AAAAAAAAAME/V85l5kYcm_s/s1600/Tree2PostFig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="283" src="http://1.bp.blogspot.com/-WAAQNC5-2IU/UgGq_xDPi_I/AAAAAAAAAME/V85l5kYcm_s/s320/Tree2PostFig03.png" width="320" /></a></div><div class="MsoNormal"><br /></div><div class="MsoNormal"><br /></div><div class="MsoNormal">The above figure shows these estimates, in the same format as the original plot of claim probability broken down by vehicle body type given earlier. Here, the range of confidence interval widths is much less extreme than before, but it is still clearly evident: the largest group (Node 10, with 23,315 records) exhibits the narrowest confidence interval, while the smallest groups (Node 9, with 1,361 records, and Node 13, with 1,932 records) exhibit the widest confidence intervals. Despite its small size, however, the smallest group does exhibit a significantly lower claim probability than any of the other groups defined by this classification tree model.</div><div class="MsoNormal"><br /></div><div class="MsoNormal"> </div><div class="MsoNormal">The primary point of this post has been to demonstrate that binomial confidence intervals can be used to help interpret and explain classification tree results, especially when displayed graphically as in the above figure. These displays provide a useful basis for comparing classification tree models obtained in different ways (e.g., by different algorithms like <i>rpart</i> and <i>ctree</i>, or by different tuning parameters for one specific algorithm). 
Comparisons of this sort will form the basis for my next post.</div><div class="MsoNormal"> </div>
Classification Tree Models (posted 2013-04-13)
<div class="MsoNormal" style="margin: 0in 0in 0pt;">On March 26, I attended the Connecticut R Meetup in New Haven, which featured a talk by Illya Mowerman on decision trees in <em>R</em>.<span style="mso-spacerun: yes;"> </span>I have gone to these Meetups before, and I have always found them to be interesting and informative.<span style="mso-spacerun: yes;"> </span>Attendees range from those who are just starting to explore <em>R</em> to those who have multiple CRAN packages to their credit.<span style="mso-spacerun: yes;"> </span>Each session is organized around a talk that focuses on some aspect of <em>R</em>, and both the talks and the discussions that follow are typically lively and useful.<span style="mso-spacerun: yes;"> </span>More information about the Connecticut R Meetup can be found <a href="http://www.meetup.com/Conneticut-R-Users-Group/messages/47523342/">here</a>, and information about R Meetups in other areas can be found with a Google search on “R Meetup” with a location.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-9rNn3CpjFYE/UWlfCb5uNMI/AAAAAAAAAK8/bsMaO9uyttM/s1600/ctreeFig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" bua="true" height="319" src="http://3.bp.blogspot.com/-9rNn3CpjFYE/UWlfCb5uNMI/AAAAAAAAAK8/bsMaO9uyttM/s320/ctreeFig01.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 
0pt;">Mowerman’s talk focused on decision trees like the one shown in the figure above.<span style="mso-spacerun: yes;"> </span>I give a somewhat more detailed discussion of this example below, but the basic idea is that the tree assigns every record in a dataset to a unique group, and a predicted response is generated for each group.<span style="mso-spacerun: yes;"> </span>The basic decision tree models are either classification trees, appropriate to binary response variables, or regression tree models, appropriate to numeric response variables.<span style="mso-spacerun: yes;"> </span>The figure above represents a classification tree model that predicts the probability that an automobile insurance policyholder will file a claim, based on a publicly available insurance dataset discussed further below.<span style="mso-spacerun: yes;"> </span>Two advantages of classification tree models that Mowerman emphasized in his talk are, first, their simplicity of interpretation, and second, their ability to generate predictions from a mix of numerical and categorical covariates.<span style="mso-spacerun: yes;"> </span>The above example illustrates both of these points – the decision tree is based on both categorical variables like <strong>veh_body</strong> (vehicle body type) and numerical variables like <strong>veh_value</strong> (the vehicle value in units of 10,000 Australian dollars).<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><span style="mso-spacerun: yes;"><div class="MsoNormal" style="margin: 0in 0in 0pt;">To interpret this tree, begin by reading from the top down, with the root node, numbered 1, which partitions the dataset into two subsets based on the variable <strong>agecat</strong>.<span style="mso-spacerun: yes;"> </span>This variable is an integer-coded driver age group with six levels, ranging from 1 for the youngest drivers to 6 for the oldest drivers.<span style="mso-spacerun: yes;"> 
</span>The root node splits the dataset into a younger driver subgroup (to the left, with <strong>agecat</strong> values 1 through 4) and an older driver subgroup (to the right, with <strong>agecat</strong> values 5 and 6).<span style="mso-spacerun: yes;"> </span>Going to the right, node 11 splits the older driver group on the basis of vehicle value, with node 12 consisting of older drivers with <strong>veh_value</strong> less than or equal to 2.89, corresponding to vehicle values not more than 28,900 Australian dollars.<span style="mso-spacerun: yes;"> </span>This subgroup contains 15,351 policy records, of which 5.3% file claims.<span style="mso-spacerun: yes;"> </span>Similarly, node 13 corresponds to older drivers with vehicles valued more than 28,900 Australian dollars; this is a smaller group (1,932 policy records) with a higher fraction filing claims (8.3%).<span style="mso-spacerun: yes;"> </span>Going to the left, we partition the younger driver group first on vehicle body type (node 2), then possibly a second time on driver age (node 4), possibly further on vehicle value (node 6) and finally again on vehicle body type (node 7).<span style="mso-spacerun: yes;"> </span>The key point is that every record in the dataset is ultimately assigned to one of the seven terminal nodes of this tree (the “leaves,” numbered 3, 5, 8, 9, 10, 12, and 13).<span style="mso-spacerun: yes;"> </span>The numbers associated with these nodes give the size of each group and the fraction of that group that files a claim, which may be viewed as an estimate of the conditional probability that a driver in that group will file a claim.<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Classification trees can be fit to data using a number of different algorithms, several of which are included in various <em>R</em> packages.<span style="mso-spacerun: yes;"> </span>Mowerman’s talk 
focused primarily on the <strong>rpart</strong> package that is part of the standard <em>R</em> distribution and includes a procedure also named <strong>rpart</strong>, based on what is probably the best known algorithm for fitting classification and regression trees.<span style="mso-spacerun: yes;"> </span>In addition, Mowerman also discussed the <strong>rpart.plot</strong> package, a very useful adjunct to <strong>rpart</strong> that provides a lot of flexibility in representing the resulting tree models graphically. In particular, this package can be used to make much nicer plots than the one shown above; I haven't done that here largely because I have used a different tree fitting procedure, for reasons discussed in the next paragraph.<span style="mso-spacerun: yes;"> </span>Another classification package that Mowerman mentioned in his talk is <strong>C50</strong>, which implements the C5.0 algorithm popular in the machine learning community.<span style="mso-spacerun: yes;"> </span>The primary focus of this post is the <strong>ctree</strong> procedure in the <strong>party</strong> package, which was used to fit the tree shown here.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The reason I have used the <strong>ctree</strong> procedure instead of the <strong>rpart</strong> procedure is that for the dataset I consider here, the <strong>rpart</strong> procedure returns a trivial tree.<span style="mso-spacerun: yes;"> </span>That is, when I attempt to fit a tree to the dataset using <strong>rpart</strong> with the response variable and covariates described below, the resulting “tree” assigns the entire dataset to a single node, declaring the overall fraction of positive responses in the dataset to be the common prediction for all records.<span style="mso-spacerun: yes;"> </span>Applying the <strong>ctree</strong> procedure (the code is listed below) yields the nontrivial tree shown in the 
plot above.<span style="mso-spacerun: yes;"> </span>The reason for the difference in these results is that the <strong>rpart</strong> and <strong>ctree</strong> procedures use different tree-fitting algorithms.<span style="mso-spacerun: yes;"> </span>Very likely, the reason <strong>rpart</strong> has such difficulty with this dataset is its high degree of <em>class imbalance:</em> the positive response (i.e., “policy filed one or more claims”) occurs in only 4,624 of 67,856 data records, representing 6.81% of the total.<span style="mso-spacerun: yes;"> </span>This imbalance problem is known to make classification difficult, enough so that it has become the focus of a specialized technical literature.<span style="mso-spacerun: yes;"> </span>For a rather technical survey of this topic, refer to the paper “The Class Imbalance Problem: A Systematic Study,” by Japkowicz and Stephen <a href="http://iospress.metapress.com/content/mxug8cjkjylnk3n0/">(Intelligent Data Analysis, volume 6, number 5, November, 2002).</a> (So far, I have not been able to find a free version of this paper, but if you are interested in the topic, a search on this title turns up a number of other useful papers on the topic, although generally more specialized than this broad survey.)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">To obtain the tree shown in the plot above, I used the following R commands:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><o:p><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;"><o:p>> library(party)</o:p></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><o:p>> carFrame = read.csv("car.csv")</o:p></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><o:p>> Fmla = clm ~ veh_value + veh_body + veh_age + gender + area + agecat</o:p></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><o:p>> TreeModel = ctree(Fmla, data = 
carFrame)</o:p></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><o:p>> plot(TreeModel, type="simple")</o:p></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"></div></o:p><o:p></o:p> </span><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><o:p></o:p>The first line loads the <strong>party</strong> package to make the <strong>ctree</strong> procedure available for our use, and the second line reads the data file described below into the dataframe <strong>carFrame</strong> (note that this assumes the data file "car.csv" has been loaded into <em>R's</em> current working directory, which can be shown using the <strong>getwd()</strong> command). The third line defines the formula that specifies the response as the binary variable <strong>clm</strong> (on the left side of "~") and the six other variables listed above as potential predictors, each separated by the "+" symbol. The fourth line invokes the <strong>ctree</strong> procedure to fit the model and the last line displays the results.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The dataset I used here is <strong>car.csv</strong>, available from the <a href="http://www.businessandeconomics.mq.edu.au/our_departments/Applied_Finance_and_Actuarial_Studies/research/books/GLMsforInsuranceData/data_sets">website</a> associated with the book <a href="http://www.amazon.com/Generalized-Insurance-International-Actuarial-Science/dp/0521879140/ref=sr_1_1?s=books&ie=UTF8&qid=1365819949&sr=1-1&keywords=GLMs+for+insurance+data">Generalized Linear Models for Insurance Data, by Piet de Jong and Gillian Z. 
Heller</a>.<span style="mso-spacerun: yes;"> </span>As noted, this dataset contains 67,856 records, each characterizing an automobile insurance policy associated with one vehicle and one driver.<span style="mso-spacerun: yes;"> </span>The dataset has 10 columns, each representing an observed value for a policy characteristic, including claim and loss information, vehicle characteristics, driver characteristics, and certain other variables (e.g., a categorical variable characterizing the type of region where the vehicle is driven).<span style="mso-spacerun: yes;"> </span>The <strong>ctree</strong> model shown above was built to predict the binary response variable <strong>clm</strong> (where <strong>clm</strong> = 1 if one or more claims have been filed by the policyholder, and 0 otherwise), based on the following prediction variables:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><o:p><o:p><blockquote class="tr_bq"><o:p><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt 0.75in; mso-list: l0 level1 lfo1; tab-stops: list .75in; text-indent: -0.25in;"><span style="mso-list: Ignore;">-<span style="font: 7pt 'Times New Roman';"> </span></span>the numeric variable veh_value;</div><div class="MsoNormal" style="margin: 0in 0in 0pt 0.75in; mso-list: l0 level1 lfo1; tab-stops: list .75in; text-indent: -0.25in;"><span style="mso-list: Ignore;">-<span style="font: 7pt 'Times New Roman';"> </span></span>veh_body, a categorical variable with 13 levels;</div><div class="MsoNormal" style="margin: 0in 0in 0pt 0.75in; mso-list: l0 level1 lfo1; tab-stops: list .75in; text-indent: -0.25in;"><span style="mso-list: Ignore;">-<span style="font: 7pt 'Times New Roman';"> </span></span>veh_age, an integer-coded categorical variable with 4 levels;</div><div class="MsoNormal" style="margin: 0in 0in 0pt 0.75in; mso-list: l0 level1 lfo1; tab-stops: list .75in; text-indent: -0.25in;"><span style="mso-list: 
Ignore;">-<span style="font: 7pt 'Times New Roman';"> </span></span>gender, a binary indicator of driver gender;</div><div class="MsoNormal" style="margin: 0in 0in 0pt 0.75in; mso-list: l0 level1 lfo1; tab-stops: list .75in; text-indent: -0.25in;"><span style="mso-list: Ignore;">-<span style="font: 7pt 'Times New Roman';"> </span></span>area, a categorical variable with six levels;</div><div class="MsoNormal" style="margin: 0in 0in 0pt 0.75in; mso-list: l0 level1 lfo1; tab-stops: list .75in; text-indent: -0.25in;"><span style="mso-list: Ignore;">-<span style="font: 7pt 'Times New Roman';"> </span></span>agecat, an integer-coded driver age variable.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div></o:p></blockquote></o:p></o:p><o:p><div class="MsoNormal" style="margin: 0in 0in 0pt;">The tree model shown above illustrates one of the points Mowerman made in his talk, that classification tree models can easily handle mixed covariate types: here, these covariates include one numeric variable (<strong>veh_value</strong>), one binary variable (<strong>gender</strong>), and four categorical variables.<span style="mso-spacerun: yes;"> </span>In principle, tree models can be built using categorical variables with an arbitrary number of levels, but in practice procedures like <strong>ctree</strong> will fail if the number of levels becomes too large.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">One of the tuning parameters in tree-fitting procedures like <strong>rpart</strong> and <strong>ctree</strong> is the minimum node size.<span style="mso-spacerun: yes;"> </span>In his R Meetup talk, Mowerman showed that increasing this value from the default limit of 7 yielded simpler trees for the dataset he considered (the <strong>churn</strong> dataset from the <strong>C50</strong> package).<span style="mso-spacerun: yes;"> </span>Specifically, increasing the minimum node size parameter 
eliminated very small nodes from the tree, nodes whose practical utility was questionable due to their small size.<span style="mso-spacerun: yes;"> </span>In my next post, I will show how a graphical tool for displaying binomial probability confidence limits can be used to help interpret classification tree results by explicitly displaying the prediction uncertainties.<span style="mso-spacerun: yes;"> </span>The tool I use is <strong>GroupedBinomialPlot</strong>, one of those included in the <strong>ExploringData</strong> package that I am developing.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Finally, I should say in response to a question about my last post that, sadly, I do not yet have a beta test version of the <strong>ExploringData</strong> package.</div></o:p>Ron Pearson (aka TheNoodleDoodler)http://www.blogger.com/profile/15693640298594791682noreply@blogger.com3tag:blogger.com,1999:blog-9179325420174899779.post-59604961577361142892013-02-16T12:10:00.000-08:002013-02-16T12:10:20.193-08:00Finding outliers in numerical data<div class="MsoNormal" style="margin: 0in 0in 0pt;">One of the topics emphasized in <a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650">Exploring Data in Engineering, the Sciences and Medicine</a> is the damage outliers can do to traditional data characterizations.<span style="mso-spacerun: yes;"> </span>Consequently, one of the procedures to be included in the <strong>ExploringData</strong> package is <strong>FindOutliers</strong>, described in this post.<span style="mso-spacerun: yes;"> </span>Given a vector of numeric values, this procedure supports four different methods for identifying possible outliers.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Before describing these methods, it is important to emphasize two points.<span style="mso-spacerun: 
yes;"> </span>First, the <i style="mso-bidi-font-style: normal;">detection</i> of outliers in a sequence of numbers can be approached as a mathematical problem, but the <i style="mso-bidi-font-style: normal;">interpretation</i> of these data observations cannot.<span style="mso-spacerun: yes;"> </span>That is, mathematical outlier detection procedures implement various rules for identifying points that appear to be anomalous with respect to the nominal behavior of the data, but they cannot explain <i style="mso-bidi-font-style: normal;">why</i> these points appear to be anomalous.<span style="mso-spacerun: yes;"> </span>The second point is closely related to the first: one possible source of outliers in a data sequence is gross measurement errors or other data quality problems, but other sources of outliers are also possible so it is important to keep an open mind.<span style="mso-spacerun: yes;"> </span>The terms “outlier” and “bad data” are <i style="mso-bidi-font-style: normal;">not</i> synonymous.<span style="mso-spacerun: yes;"> </span>Chapter 7 of <em>Exploring Data</em> briefly describes two examples of outliers whose detection and interpretation led to a Nobel Prize and to a major new industrial product (Teflon, a registered trademark of the DuPont Company).</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">In the case of a single sequence of numbers, the typical approach to outlier detection is to first determine upper and lower limits on the nominal range of data variation, and then declare any point falling outside this range to be an outlier.<span style="mso-spacerun: yes;"> </span>The <strong>FindOutliers</strong> procedure implements the following methods of computing the upper and lower limits of the nominal data range:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: 
list 1.0in; text-indent: -0.5in;"><span style="mso-list: Ignore;">1.<span style="font: 7pt 'Times New Roman';"> </span></span>The ESD identifier, more commonly known as the “three-sigma edit rule,” well known but unreliable;</div><div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.5in;"><span style="mso-list: Ignore;">2.<span style="font: 7pt 'Times New Roman';"> </span></span>The Hampel identifier, a more reliable procedure based on the median and the MADM scale estimate;</div><div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.5in;"><span style="mso-list: Ignore;">3.<span style="font: 7pt 'Times New Roman';"> </span></span>The standard boxplot rule, based on the upper and lower quartiles of the data distribution;</div><div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.5in;"><span style="mso-list: Ignore;">4.<span style="font: 7pt 'Times New Roman';"> </span></span>An adjusted boxplot rule, based on the upper and lower quartiles, along with a robust skewness estimator called the <i style="mso-bidi-font-style: normal;">medcouple</i>.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The rest of this post briefly describes these four outlier detection rules and illustrates their application to two real data examples.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Without question, the most popular outlier detection rule is the ESD identifier (an abbreviation for “extreme Studentized deviation”), which declares any point more than <i style="mso-bidi-font-style: normal;">t </i>standard deviations from the mean to be an outlier, where the threshold value <i style="mso-bidi-font-style: normal;">t</i> is most commonly taken to 
be 3.<span style="mso-spacerun: yes;"> </span>In other words, the nominal range used by this outlier detection procedure is the closed interval:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>[mean – t * SD, mean + t * SD]</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">where SD is the estimated standard deviation of the data sequence.<span style="mso-spacerun: yes;"> </span>Motivation for the threshold choice t = 3 comes from the fact that for normally-distributed data, the probability of observing a value more than three standard deviations from the mean is only about 0.3%.<span style="mso-spacerun: yes;"> </span>The problem with this outlier detection procedure is that both the mean and the standard deviation are themselves extremely sensitive to the presence of outliers in the data.<span style="mso-spacerun: yes;"> </span>As a consequence, this procedure is likely to miss outliers that are present in the data.<span style="mso-spacerun: yes;"> </span>In fact, it can be shown that for a contamination level greater than 10%, this rule fails completely, detecting no outliers at all, no matter how extreme they are (for details, see the discussion in Sec. 
3.2.1 of <a href="http://www.amazon.com/Mining-Imperfect-Data-Contamination-Incomplete/dp/0898715822">Mining Imperfect Data</a>).</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The default option for the <strong>FindOutliers</strong> procedure is the Hampel identifier, which replaces the mean with the median and the standard deviation with the MAD (or MADM)<span style="mso-spacerun: yes;"> </span>scale estimate.<span style="mso-spacerun: yes;"> </span>The nominal data range for this outlier detection procedure is:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>[median – t * MAD, median + t * MAD]</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">As I have discussed in previous posts, the median and the MAD scale are much more resistant to the influence of outliers than the mean and standard deviation.<span style="mso-spacerun: yes;"> </span>As a consequence, the Hampel identifier is generally more effective than the ESD identifier, although the Hampel identifier can be too aggressive, declaring too many points as outliers.<span style="mso-spacerun: yes;"> </span>For detailed comparisons of the ESD and Hampel identifiers, refer to Sec. 7.5 of <i style="mso-bidi-font-style: normal;">Exploring Data</i> or Sec. 
3.3 of <i style="mso-bidi-font-style: normal;">Mining Imperfect Data</i>.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The third method option for the <strong>FindOutliers</strong> procedure is the standard boxplot rule, based on the following nominal data range:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>[Q1 – c * IQD, Q3 + c * IQD]</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">where Q1 and Q3 represent the lower and upper quartiles, respectively, of the data distribution, and IQD = Q3 – Q1 is the interquartile distance, a measure of the spread of the data similar to the standard deviation.<span style="mso-spacerun: yes;"> </span>The threshold parameter <i style="mso-bidi-font-style: normal;">c</i> is analogous to <i style="mso-bidi-font-style: normal;">t</i> in the first two outlier detection rules, and the value most commonly used in this outlier detection rule is c = 1.5.<span style="mso-spacerun: yes;"> </span>This outlier detection rule is much less sensitive to the presence of outliers than the ESD identifier, but more sensitive than the Hampel identifier, and, like the Hampel identifier, it can be somewhat too aggressive, declaring nominal data observations to be outliers.<span style="mso-spacerun: yes;"> </span>An advantage of the boxplot rule over these two alternatives is that, because it does not depend on an estimate of the “center” of the data (e.g., the mean in the ESD identifier or the median in the Hampel identifier), it is better suited to distributions that are moderately asymmetric.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The fourth method option is an extension of the standard boxplot rule, developed 
for data distributions that may be strongly asymmetric.<span style="mso-spacerun: yes;"> </span>Basically, this procedure modifies the threshold parameter <i style="mso-bidi-font-style: normal;">c</i> by an amount that depends on the asymmetry of the distribution, modifying the upper threshold and the lower threshold differently.<span style="mso-spacerun: yes;"> </span>Because the standard moment-based skewness estimator is <i style="mso-bidi-font-style: normal;">extremely</i> outlier-sensitive (for an illustration of this point, see the discussion in Sec. 7.1.1 of <i style="mso-bidi-font-style: normal;">Exploring Data</i>), it is necessary to use an outlier-resistant alternative to assess distributional asymmetry.<span style="mso-spacerun: yes;"> </span>The asymmetry measure used here is the <i style="mso-bidi-font-style: normal;">medcouple</i>, a robust skewness measure available in the <b style="mso-bidi-font-weight: normal;">robustbase</b> package in <em>R</em> and that I have discussed in a previous post (<a href="http://exploringdatablog.blogspot.com/2011/02/boxplots-and-beyond-part-ii-asymmetry.html">Boxplots and Beyond - Part II: Asymmetry</a><span style="mso-spacerun: yes;"> </span>).<span style="mso-spacerun: yes;"> </span>An important point about the medcouple is that it can be either positive or negative, depending on the direction of the distributional asymmetry; positive values arise more frequently in practice, but negative values can occur and the sign of the medcouple influences the definition of the asymmetric boxplot rule.<span style="mso-spacerun: yes;"> </span>Specifically, for positive values of the medcouple MC, the adjusted boxplot rule’s nominal data range is:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>[Q1 – c * exp(a * MC) * IQD, Q3 + c * exp(b * MC) * IQD ]</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br 
/></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">while for negative medcouple values, the nominal data range is:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>[Q1 – c * exp(-b * MC) * IQD, Q3 + c * exp(-a * MC) * IQD ]</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">An important observation here is that for symmetric data distributions, MC should be zero, reducing the adjusted boxplot rule to the standard boxplot rule described above.<span style="mso-spacerun: yes;"> </span>As in the standard boxplot rule, the threshold parameter is typically taken as c = 1.5, while the other two parameters are typically taken as a = -4 and b = 3.<span style="mso-spacerun: yes;"> </span>In particular, these are the default values for the procedure <b style="mso-bidi-font-weight: normal;">adjboxStats</b> in the <b style="mso-bidi-font-weight: normal;">robustbase</b> package.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-SPUmE3GgKEc/UR_jamXVJaI/AAAAAAAAAKs/ZmL_a4g1Pg4/s1600/FindOutliersFig01Makeup.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" src="http://4.bp.blogspot.com/-SPUmE3GgKEc/UR_jamXVJaI/AAAAAAAAAKs/ZmL_a4g1Pg4/s320/FindOutliersFig01Makeup.png" uea="true" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">To illustrate how these outlier detection methods compare, the above pair of plots shows the results of applying all four of them to the makeup flow rate dataset discussed in <em>Exploring Data</em> (Sec. 
7.1.2) in connection with the failure of the ESD identifier.<span style="mso-spacerun: yes;"> </span>The points in these plots represent approximately 2,500 regularly sampled flow rate measurements from an industrial manufacturing process.<span style="mso-spacerun: yes;"> </span>These measurements were taken over a long enough period of time to contain both periods of regular process operation – during which the measurements fluctuate around a value of approximately 400 – and periods when the process was shut down, was being shut down, or was being restarted, during which the measurements exhibit values near zero.<span style="mso-spacerun: yes;"> </span>If we wish to characterize normal process operation, these shut down episodes represent outliers, and they correspond to about 20% of the data.<span style="mso-spacerun: yes;"> </span>The left-hand plot shows the outlier detection limits for the ESD identifier (lighter, dashed lines) and the Hampel identifier (darker, dotted lines).<span style="mso-spacerun: yes;"> </span>As discussed in <em>Exploring Data</em>, the ESD limits are wide enough that they do not detect any outliers in this data sequence, while the Hampel identifier nicely separates the data into normal operating data and outliers that correspond to the shut down episodes.<span style="mso-spacerun: yes;"> </span>The right-hand plot shows the analogous results obtained with the standard boxplot method (lighter, dashed lines) and the adjusted boxplot method (darker, dotted lines).<span style="mso-spacerun: yes;"> </span>Here, the standard boxplot rule gives results very similar to the Hampel identifier, again nicely separating the dataset into normal operating data and shut down episodes.<span style="mso-spacerun: yes;"> </span>Unfortunately, the adjusted boxplot rule does not perform very well here, placing its lower nominal data limit in about the middle of the shut down data and its upper nominal data limit in about the middle of the normal operating 
data.<span style="mso-spacerun: yes;"> </span>The likely cause of this behavior is the relatively large fraction of lower-tail outliers, which introduces a fairly strong negative skewness (the medcouple value for this example is -0.589).</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-Sa2YMBtbOwA/UR_jSga68dI/AAAAAAAAAKk/9hbC2kIJWbw/s1600/FindOutliersFig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" src="http://1.bp.blogspot.com/-Sa2YMBtbOwA/UR_jSga68dI/AAAAAAAAAKk/9hbC2kIJWbw/s320/FindOutliersFig01.png" uea="true" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The second example considered here is the industrial pressure data sequence shown in the above figure, in the same format as the previous figure.<span style="mso-spacerun: yes;"> </span>This data sequence was discussed in <em>Exploring Data</em> (pp. 
326-327) as a troublesome case because the two smallest values in this data sequence – near the right-hand end of the plots – appear to be downward outliers in a sequence with generally positive skewness (here, the medcouple value is 0.162).<span style="mso-spacerun: yes;"> </span>As a consequence, neither the ESD identifier nor the Hampel identifier gives fully satisfactory performance, in both cases declaring only one of these points as a downward outlier and arguably detecting too many upward outliers.<span style="mso-spacerun: yes;"> </span>In fact, because the Hampel identifier is more aggressive here, it actually declares more upward outliers, making its performance worse for this example.<span style="mso-spacerun: yes;"> </span>The right-hand plot in the above figure shows the outlier detection limits for the standard boxplot rule (lighter, dashed lines) and the adjusted boxplot rule (darker, dotted lines).<span style="mso-spacerun: yes;"> </span>As in the previous example, the limits for the standard boxplot rule are almost the same as those for the Hampel identifier (the darker, dotted lines in the left-hand plot), but here the adjusted boxplot rule gives much better results, identifying both of the visually evident downward outliers and declaring far fewer points as upward outliers.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><span style="font-family: 'Times New Roman'; font-size: 12pt; mso-ansi-language: EN-US; mso-bidi-language: AR-SA; mso-fareast-font-family: 'Times New Roman'; mso-fareast-language: EN-US;">The primary point of this post has been to describe and demonstrate the outlier detection methods to be included in the <strong>FindOutliers</strong> procedure in the forthcoming <strong>ExploringData</strong> <em>R</em> package.<span style="mso-spacerun: yes;"> </span>It should be clear from these results that, when it comes to outlier detection, “one size does not fit all” – method matters, and the choice of method requires 
a comparison of the results obtained by each one.<span style="mso-spacerun: yes;"> </span>I have not included the code for the <strong>FindOutliers</strong> procedure here, but that will be the subject of my next post.</span>Ron Pearson (aka TheNoodleDoodler)http://www.blogger.com/profile/15693640298594791682noreply@blogger.com3tag:blogger.com,1999:blog-9179325420174899779.post-74643460280829807632012-12-15T13:32:00.000-08:002012-12-15T13:32:30.779-08:00Data Science, Data Analysis, R and Python<div class="MsoNormal" style="margin: 0in 0in 0pt;">The October 2012 issue of <i style="mso-bidi-font-style: normal;">Harvard Business Review</i> prominently features the words “Getting Control of Big Data” on the cover, and the magazine includes these three related articles:</div><o:p><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><ol style="margin-top: 0in;" type="1"><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l4 level1 lfo1; tab-stops: list .5in;">“Big Data: The Management Revolution,” by Andrew McAfee and Erik Brynjolfsson, pages 61 – 68;</li><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l4 level1 lfo1; tab-stops: list .5in;">“Data Scientist: The Sexiest Job of the 21<sup>st</sup> Century,” by Thomas H. Davenport and D.J. Patil, pages 70 – 76;</li><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l4 level1 lfo1; tab-stops: list .5in;">“Making Advanced Analytics Work For You,” by Dominic Barton and <st1:street w:st="on"><st1:address w:st="on">David Court</st1:address></st1:street>, pages 79 – 83.</li></ol><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"></div></o:p>All three provide food for thought; this post presents a brief summary of some of those thoughts. 
<div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;">One point made in the first article is that the “size” of a dataset – i.e., what constitutes “Big Data” – can be measured in at least three very different ways: volume, velocity, and variety.<span style="mso-spacerun: yes;"> </span>Each of these three aspects complicates the characterization of Big Data, but in a different way:</div><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt 78pt; mso-list: l1 level1 lfo4; tab-stops: list 78.0pt; text-indent: -0.25in;"><span style="font-family: Symbol; mso-bidi-font-family: Symbol; mso-fareast-font-family: Symbol;"><span style="mso-list: Ignore;">·<span style="font: 7pt 'Times New Roman';"> </span></span></span>For very large data volumes, one fundamental issue is the incomprehensibility of the raw data itself.<span style="mso-spacerun: yes;"> </span>Even if you could display a data table with several million, billion, or trillion rows and hundreds or thousands of columns, making any sense of this display would be a hopeless task.<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt 78pt; mso-list: l1 level1 lfo4; tab-stops: list 78.0pt; text-indent: -0.25in;"><span style="font-family: Symbol; mso-bidi-font-family: Symbol; mso-fareast-font-family: Symbol;"><span style="mso-list: Ignore;">·<span style="font: 7pt 'Times New Roman';"> </span></span></span>For high-velocity datasets – e.g., real-time, Internet-based data sources – the data volume is determined by the observation time: at a fixed rate, the longer you observe, the more you collect.<span style="mso-spacerun: yes;"> </span>If you are attempting to generate a real-time characterization that keeps up with this input data rate, you face a fundamental trade-off between exploiting richer 
datasets acquired over longer observation periods, and the longer computation times required to process those datasets, making you less likely to keep up with the input data rate.<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt 78pt; mso-list: l1 level1 lfo4; tab-stops: list 78.0pt; text-indent: -0.25in;"><span style="font-family: Symbol; mso-bidi-font-family: Symbol; mso-fareast-font-family: Symbol;"><span style="mso-list: Ignore;">·<span style="font: 7pt 'Times New Roman';"> </span></span></span>For high-variety datasets, a key challenge lies in finding useful ways to combine very different data sources into something amenable to a common analysis (e.g., combining images, text, and numerical data into a single joint analysis framework).</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;">One practical corollary to these observations is the need for a computer-based data reduction process or “data funnel” that matches the volume, velocity, and/or variety of the original data sources with the ultimate needs of the organization.<span style="mso-spacerun: yes;"> </span>In large organizations, this data funnel generally involves a mix of different technologies and people.<span style="mso-spacerun: yes;"> </span>While it is not a complete characterization, some of these differences are evident from the primary software platforms used in the different stages of this data funnel: languages like HTML for dealing with web-based data sources; typically, some variant of SQL for dealing with large databases; a package like R for complex quantitative analysis; and often something like Microsoft Word, Excel, or PowerPoint for delivering the final results.<span style="mso-spacerun: yes;"> </span>In addition, to help coordinate some of these tasks, there are likely to be scripts, either in an operating system like UNIX or in a 
platform-independent scripting language like Perl or Python.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;">An important point omitted from all three articles is that there are at least two distinct application areas for Big Data:</div><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l2 level1 lfo5; tab-stops: list 1.0in; text-indent: -0.25in;"><span style="mso-list: Ignore;">1.<span style="font: 7pt 'Times New Roman';"> </span></span>The class of “production applications,” which were discussed in these articles and illustrated with examples like the unnamed U.S. airline described by McAfee and Brynjolfsson that adopted a vendor-supplied procedure to obtain better estimates of flight arrival times, improving its ability to schedule ground crews and saving several million dollars per year at each airport.<span style="mso-spacerun: yes;"> </span>Similarly, the article by Barton and Court described a shipping company (again, unnamed) that used real-time weather forecast data and shipping port status data, developing an automated system to improve the on-time performance of its fleet.<span style="mso-spacerun: yes;"> </span>Examples like these describe automated systems put in place to continuously exploit a large but fixed data source.<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l2 level1 lfo5; tab-stops: list 1.0in; text-indent: -0.25in;"><span style="mso-list: Ignore;">2.<span style="font: 7pt 'Times New Roman';"> </span></span>The exploitation of Big Data for “one-off” analyses: a question is posed, and the data science team scrambles to find an answer.<span style="mso-spacerun: yes;"> </span>This use is not represented by any of the examples described in these articles.<span 
style="mso-spacerun: yes;"> </span>In fact, this second type of application overlaps a lot with the development process required to create a production application, although the end results are very different.<span style="mso-spacerun: yes;"> </span>In particular, the end result of a one-off analysis is a single set of results, ultimately summarized to address the question originally posed.<span style="mso-spacerun: yes;"> </span>In contrast, a production application requires continuing support and often has to meet challenging interface requirements between the IT systems that collect and preprocess the Big Data sources and those that are already in use by the end-users of the tool (e.g., a Hadoop cluster running in a UNIX environment versus periodic reports generated either automatically or on demand from a Microsoft Access database of summary information).</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;">A key point of Davenport and Patil’s article is that data science involves more than just the analysis of data: it is also necessary to identify data sources, acquire what is needed from them, re-structure the results into a form amenable to analysis, clean them up, and in the end, present the analytical results in a useable form.<span style="mso-spacerun: yes;"> </span>In fact, the subtitle of their article is “Meet the people who can coax treasure out of messy, unstructured data,” and this statement forms the core of the article’s working definition for the term “data scientist.” (The authors indicate that the term was coined in 2008 by D.J. 
Patil, who holds a position with that title at Greylock Partners.)<span style="mso-spacerun: yes;"> </span>Also, two particularly interesting tidbits from this article were the authors’ suggestion that a good place to find data scientists is at R User Groups, and their description of R as “an open-source statistical tool favored by data scientists.”</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;"><st1:city w:st="on"><st1:place w:st="on">Davenport</st1:place></st1:city> and Patil emphasize the difference between structured and unstructured data, a distinction that is especially relevant to the R community since most of R’s procedures are designed to work with the structured data types discussed in Chapter 2 of <a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650">Exploring Data in Engineering, the Sciences, and Medicine</a>: continuous, integer, nominal, ordinal, and binary.<span style="mso-spacerun: yes;"> </span>More specifically, note that these variable types can all be included in dataframes, the data object type that is best supported by R’s vast and expanding collection of add-on packages.<span style="mso-spacerun: yes;"> </span>Certainly, there is some support for other data types, and the level of this support is growing – the <b style="mso-bidi-font-weight: normal;">tm</b> package and a variety of other related packages support the analysis of text data, the <b style="mso-bidi-font-weight: normal;">twitteR</b> package provides support for analyzing Twitter tweets, and the <b style="mso-bidi-font-weight: normal;">scrapeR</b> package supports web scraping – but the acquisition and reformatting of unstructured data sources is not R’s primary strength.<span style="mso-spacerun: yes;"> </span>Yet it is a key component of data science, as <st1:city w:st="on"><st1:place w:st="on">Davenport</st1:place></st1:city> and Patil emphasize:</div><div 
class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;">“A quantitative analyst can be great at analyzing data but not at subduing a mass of unstructured data and getting it into a form in which it can be analyzed.<span style="mso-spacerun: yes;"> </span>A data management expert might be great at generating and organizing data in structured form but not at turning unstructured data into structured data – and also not at actually analyzing the data.”</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">To better understand the distinction between the quantitative analyst and the data scientist implied by this quote, consider mathematician George Polya’s book, <a href="http://www.amazon.com/How-Solve-Aspect-Mathematical-Method/dp/4871878309/ref=sr_1_1?s=books&ie=UTF8&qid=1352581828&sr=1-1&keywords=polya+how+to+solve+it#_">How To Solve It</a>.<span style="mso-spacerun: yes;"> </span>Originally published in 1945 and most recently re-issued in 2009, 24 years after the author’s death, this book is a very useful guide to solving math problems.<span style="mso-spacerun: yes;"> </span>Polya’s basic approach consists of these four steps:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;"></div><ol style="margin-top: 0in;" type="1"><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l0 level1 lfo2; tab-stops: list .5in;">Understand the problem;</li><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l0 level1 lfo2; tab-stops: list .5in;">Formulate a plan for solving the problem;</li><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l0 level1 lfo2; tab-stops: list .5in;">Carry out this plan;</li><li class="MsoNormal" style="margin: 0in 
0in 0pt; mso-list: l0 level1 lfo2; tab-stops: list .5in;">Check the results.</li></ol><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"></div>It is important to note what is <i style="mso-bidi-font-style: normal;">not</i> included in the scope of Polya’s four steps: Step 1 assumes a problem has been stated precisely, and Step 4 assumes the final result is well-defined, verifiable, and requires no further explanation.<span style="mso-spacerun: yes;"> </span>While quantitative analysis problems are generally neither as precisely formulated as Polya’s method assumes, nor as clear in their ultimate objective, the class of “quantitative analyst” problems that <st1:city w:st="on"><st1:place w:st="on">Davenport</st1:place></st1:city> and Patil assume in the previous quote corresponds very roughly to problems of this type.<span style="mso-spacerun: yes;"> </span>They begin with something like an R dataframe and a reasonably clear idea of what analytical results are desired; they end by summarizing the problem and presenting the results.<span style="mso-spacerun: yes;"> </span>In contrast, the class of “data scientist” problems implied in <st1:city w:st="on"><st1:place w:st="on">Davenport</st1:place></st1:city> and Patil’s quote comprises an expanded set of steps: <div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;"></div><ol style="margin-top: 0in;" type="1"><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l3 level1 lfo3; tab-stops: list .5in;">Formulate the analytical problem: decide what kinds of questions could and should be asked in a way that is likely to yield useful, quantitative answers;</li><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l3 level1 lfo3; tab-stops: list .5in;">Identify and evaluate potential data sources: what is available in-house, from the Internet, from 
vendors?<span style="mso-spacerun: yes;"> </span>How complete are these data sources?<span style="mso-spacerun: yes;"> </span>What would it cost to use them?<span style="mso-spacerun: yes;"> </span>Are there significant constraints on how they can be used?<span style="mso-spacerun: yes;"> </span>Are some of these data sources strongly incompatible?<span style="mso-spacerun: yes;"> </span>If so, does it make sense to try to merge them approximately, or is it more reasonable to omit some of them?</li><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l3 level1 lfo3; tab-stops: list .5in;">Acquire the data and transform it into a form that is useful for analysis; note that for sufficiently large data collections, part of this data will almost certainly be stored in some form of relational database, probably administered by others, and extracting what is needed for analysis will likely involve writing SQL queries against this database;</li><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l3 level1 lfo3; tab-stops: list .5in;">Once the relevant collection of data has been acquired and prepared, examine the results carefully to make sure they meet analytical expectations: do the formats look right?<span style="mso-spacerun: yes;"> </span>Are the ranges consistent with expectations?<span style="mso-spacerun: yes;"> </span>Do the relationships seen between key variables seem to make sense?</li><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l3 level1 lfo3; tab-stops: list .5in;">Do the analysis: by lumping all of the steps of data analysis into this simple statement, I am not attempting to minimize the effort involved, but rather emphasizing the other aspects of the Big Data analysis problem;</li><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l3 level1 lfo3; tab-stops: list .5in;">After the analysis is complete, develop a concise summary of the results that clearly and succinctly states the motivating problem, highlights what has 
been assumed, what has been neglected and why, and gives the simplest useful summary of the data analysis results.<span style="mso-spacerun: yes;"> </span>(Note that this will often involve several different summaries, with different levels of detail and/or emphases, intended for different audiences.)</li></ol><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"></div>Here, Steps 1 and 6 necessarily involve close interaction with the end users of the data analysis results, and they lie mostly outside the domain of R.<span style="mso-spacerun: yes;"> </span>(Conversely, knowing what is available in R can be extremely useful in formulating analytical problems that are reasonable to solve, and the graphical procedures available in R can be extremely useful in putting together meaningful summaries of the results.)<span style="mso-spacerun: yes;"> </span>The primary domain of R is Step 5: given a dataframe containing what are believed to be the relevant variables, we generate, validate, and refine the analytical results that will form the basis for the summary in Step 6.<span style="mso-spacerun: yes;"> </span>Part of Step 4 also lies clearly within the domain of R: examining the data once it has been acquired to make sure it meets expectations.<span style="mso-spacerun: yes;"> </span>In particular, once we have a dataset or a collection of datasets that can be converted easily into one or more R dataframes (e.g., csv files or possibly relational databases), a preliminary look at the data is greatly facilitated by the vast array of R procedures available for graphical characterizations (e.g., nonparametric density estimates, quantile-quantile plots, boxplots and variants like beanplots or bagplots, and much more); for constructing simple descriptive statistics (e.g., means, medians, and quantiles for numerical variables, tabulations of level counts for categorical variables, etc.); and for 
preliminary multivariate characterizations (e.g., scatter plots, classical and robust covariance ellipses, classical and robust principal component plots, etc.).<span style="mso-spacerun: yes;"> </span><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;">The rest of this post discusses those parts of Steps 2, 3, and 4 above that fall outside the domain of R.<span style="mso-spacerun: yes;"> </span>First, however, I have two observations.<span style="mso-spacerun: yes;"> </span>My first observation is that because R is evolving fairly rapidly, some tasks which are “outside the domain of R” today may very well move “inside the domain of R” in the near future.<span style="mso-spacerun: yes;"> </span>The packages <strong>twitteR</strong> and <strong>scrapeR</strong>, mentioned earlier, are cases in point, as are the continued improvements in packages that simplify the use of R with databases.<span style="mso-spacerun: yes;"> </span>My second observation is that, just because something is possible within a particular software environment doesn’t make it a good idea.<span style="mso-spacerun: yes;"> </span>A number of years ago, I attended a student talk given at an industry/university consortium.<span style="mso-spacerun: yes;"> </span>The speaker set up and solved a simple linear program (i.e., he implemented the <a href="http://en.wikipedia.org/wiki/Simplex_algorithm">simplex algorithm</a> to solve a simple linear optimization problem with linear constraints) using an industrial programmable controller.<span style="mso-spacerun: yes;"> </span>At the time, programming those controllers was done via <a href="http://en.wikipedia.org/wiki/Ladder_logic">relay ladder logic</a>, a diagrammatic approach used by electricians to configure complicated electrical wiring systems.<span style="mso-spacerun: yes;"> </span>I left the talk impressed by the student’s skill, creativity and persistence, but 
I felt his efforts were extremely misguided.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;">Although it does not address every aspect of the “extra-R” components of Steps 2, 3, and 4 defined above – indeed, some of these aspects are so application-specific that no single book possibly could – Paul Murrell’s book <a href="http://www.amazon.com/Introduction-Technologies-Chapman-Computer-Analysis/dp/1420065173">Introduction to Data Technologies</a> provides an excellent introduction to many of them.<span style="mso-spacerun: yes;"> </span>(This book is also available as a free <a href="http://www.stat.auckland.ac.nz/~paul/ItDT/itdt-2012-07-29.pdf">PDF file</a> under a Creative Commons license.) <span style="mso-spacerun: yes;"> </span>A point made in the book’s preface mirrors one in <st1:city w:st="on"><st1:place w:st="on">Davenport</st1:place></st1:city> and Patil’s article:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;">“Data sets never pop into existence in a fully mature and reliable state; they must be cleaned and massaged into an appropriate form.<span style="mso-spacerun: yes;"> </span>Just getting the data ready for analysis often represents a significant component of a research project.”</div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>Since Murrell is the developer of R’s grid graphics system that I have discussed in previous posts, it is no surprise that his book has an R-centric data analysis focus, but the book’s main emphasis is on the tasks of getting data from the outside world – specifically, from the Internet – into a dataframe suitable for analysis in R.<span style="mso-spacerun: yes;"> </span>Murrell therefore gives detailed treatments of 
topics like HTML and Cascading Style Sheets (CSS) for working with Internet web pages; XML for storing and sharing data; and relational databases and their associated query language SQL for efficiently organizing data collections with complex structures.<span style="mso-spacerun: yes;"> </span>Murrell states in his preface that these are things researchers – the target audience of the book – typically aren’t taught, but pick up in bits and pieces as they go along. <span style="mso-spacerun: yes;"> </span>He adds:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;"></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>“A great deal of information on these topics already exists in books and on the internet; the value of this book is in collecting only the important subset of this information that is necessary to begin applying these technologies within a research setting.”</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"></div>My one quibble with Murrell’s book is that he gives Python only a passing mention.<span style="mso-spacerun: yes;"> </span>While I greatly prefer R to Python for data analysis, I have found Python to be more suitable than R for a variety of extra-analytical tasks, including preliminary explorations of the contents of weakly structured data sources, as well as certain important reformatting and preprocessing tasks.<span style="mso-spacerun: yes;"> </span>Like R, <a href="http://www.python.org/">Python</a> is an open-source language, freely available for a wide variety of computing environments.<span style="mso-spacerun: yes;"> </span>Also like R, Python has numerous add-on packages that support an enormous variety of computational tasks (over 25,000 at this writing). 
<span style="mso-spacerun: yes;"> </span>In my day job in a SAS-centric environment, I commonly face tasks like the following: I need to create several nearly-identical SAS batch jobs, each to read a different SAS dataset that is selected on the basis of information contained in the file name; submit these jobs, each of which creates a CSV file; harvest and merge the resulting CSV files; run an R batch job to read this combined CSV file and perform computations on its contents.<span style="mso-spacerun: yes;"> </span>I can do all of these things with a Python script, which also provides a detailed recipe of what I have done, so when I have to modify the procedure slightly and run it again six months later, I can quickly re-construct what I did before.<span style="mso-spacerun: yes;"> </span>I have found Python to be better suited than R to tasks that involve a combination of automatically generating simple programs in another language, data file management, text processing, simple data manipulation, and batch job scheduling. 
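<div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">To make the kind of workflow just described a little more concrete, here is a minimal Python sketch of two of its pieces: generating one small SAS job per dataset from the dataset file names, and merging the resulting CSV files into one.<span style="mso-spacerun: yes;"> </span>This is only an illustration under stated assumptions, not the script I actually use: the function names <strong>write_sas_jobs</strong> and <strong>merge_csv_files</strong>, the SAS job template, the file-name pattern, and the added “source” column are all hypothetical choices made for this example.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div>

```python
import csv
from pathlib import Path
from string import Template

# Hypothetical SAS job template: the libname and PROC EXPORT details are
# illustrative assumptions, not the jobs described in the post.
SAS_TEMPLATE = Template(
    'libname src "$data_dir";\n'
    'proc export data=src.$dataset outfile="$csv_path" dbms=csv replace;\n'
    'run;\n'
)


def write_sas_jobs(data_dir, job_dir, pattern="*.sas7bdat"):
    """Write one SAS job file per dataset whose file name matches `pattern`."""
    data_dir, job_dir = Path(data_dir), Path(job_dir)
    job_dir.mkdir(parents=True, exist_ok=True)
    jobs = []
    for ds_file in sorted(data_dir.glob(pattern)):
        dataset = ds_file.stem  # dataset name is taken from the file name
        job_path = job_dir / f"{dataset}.sas"
        csv_path = job_dir / f"{dataset}.csv"
        job_path.write_text(
            SAS_TEMPLATE.substitute(
                data_dir=data_dir, dataset=dataset, csv_path=csv_path
            )
        )
        jobs.append(job_path)
    return jobs


def merge_csv_files(csv_paths, merged_path):
    """Stack CSV files that share a header, tagging each row with its source file."""
    rows_written = 0
    with open(merged_path, "w", newline="") as out:
        writer = None
        for path in csv_paths:
            with open(path, newline="") as src:
                reader = csv.DictReader(src)
                if writer is None:
                    # The header of the first file defines the merged layout.
                    writer = csv.DictWriter(
                        out, fieldnames=["source"] + list(reader.fieldnames)
                    )
                    writer.writeheader()
                for row in reader:
                    row["source"] = Path(path).stem
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

<div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Submitting the generated jobs and launching the final R batch job are deliberately left out, since those commands are site-specific; in a script like this they would typically be issued with Python’s <strong>subprocess.run</strong>.<span style="mso-spacerun: yes;"> </span>Because the script records every step, rerunning a slightly modified version months later is mostly a matter of editing the template or the file-name pattern.</div>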
<div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Despite my Python quibble, Murrell’s book represents an excellent first step toward filling the knowledge gap that <st1:place w:st="on"><st1:city w:st="on">Davenport</st1:city></st1:place> and Patil note between quantitative analysts and data scientists; in fact, it is the only book I know addressing this gap.<span style="mso-spacerun: yes;"> </span>If you are an R aficionado interested in positioning yourself for “the sexiest job of the 21<sup>st</sup> century,” Murrell’s book is an excellent place to start.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div>Ron Pearson (aka TheNoodleDoodler)http://www.blogger.com/profile/15693640298594791682noreply@blogger.com10tag:blogger.com,1999:blog-9179325420174899779.post-51223581269090045012012-10-27T12:30:00.000-07:002012-10-27T12:30:52.008-07:00Characterizing a new dataset<div class="MsoNormal" style="margin: 0in 0in 0pt;">In my last post, I promised a further examination of the spacing measures I described there, and I still promise to do that, but I am changing the order of topics slightly.<span style="mso-spacerun: yes;"> </span>So, instead of spacing measures, today’s post is about the <strong>DataframeSummary</strong> procedure to be included in the <strong>ExploringData</strong> package, which I also mentioned in my last post and promised to describe later.<span style="mso-spacerun: yes;"> </span>My next post will be a special one on Big Data and Data Science, followed by another one about the <strong>DataframeSummary</strong> procedure (additional features of the procedure and the code used to implement it), after which I will come back to the spacing measures I discussed last time.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">A task that arises frequently in exploratory data analysis is the initial 
characterization of a new dataset.<span style="mso-spacerun: yes;"> </span>Ideally, everything we could want to know about a dataset <em>should</em> come from the accompanying metadata, but this is rarely the case.<span style="mso-spacerun: yes;"> </span>As I discuss in Chapter 2 of <a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650">Exploring Data in Engineering, the Sciences, and Medicine</a>, <em>metadata</em> is the available “data about data” that (usually) accompanies a data source.<span style="mso-spacerun: yes;"> </span>In practice, however, the available metadata is almost never as complete as we would like, and it is sometimes wrong in important respects.<span style="mso-spacerun: yes;"> </span>This is particularly the case when numeric codes are used for missing data, without accompanying notes describing the coding.<span style="mso-spacerun: yes;"> </span>An example illustrating the consequent problem of <i style="mso-bidi-font-style: normal;">disguised missing data</i> is described in my paper <a href="http://www.sigkdd.org/explorations/issues/8-1-2006-06/12-Pearson.pdf">The Problem of Disguised Missing Data</a>.<span style="mso-spacerun: yes;"> </span>(It should be noted that the original source of one of the problems described there – a comment in the UCI Machine Learning Repository header file for the Pima Indians diabetes dataset that there were no missing data records – has since been <a href="http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes">corrected</a>.)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Once we have converted our data source into an <em>R</em> data frame (e.g., via the <strong>read.csv</strong> function for an external CSV file), there are a number of useful tools to help us begin this characterization process.<span style="mso-spacerun: yes;"> </span>Probably the most general is the <strong>str</strong> 
command, applicable to essentially any <em>R</em> object.<span style="mso-spacerun: yes;"> </span>Applied to a dataframe, this command first tells us that the object <i style="mso-bidi-font-style: normal;">is </i>a dataframe, second, gives us the dimensions of the dataframe, and third, presents a brief summary of its contents, including the variable names, their type (specifically, the results of R’s <strong>class</strong> function), and the values of their first few observations.<span style="mso-spacerun: yes;"> </span>As a specific example, if we apply this command to the <strong>rent</strong> dataset from the <strong>gamlss</strong> package, we obtain the following summary:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;">> str(rent)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">'data.frame':<span style="mso-spacerun: yes;"> </span>1969 obs. of<span style="mso-spacerun: yes;"> </span>9 variables:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>$ R<span style="mso-spacerun: yes;"> </span>: num<span style="mso-spacerun: yes;"> </span>693 422 737 732 1295 ...</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>$ Fl : num<span style="mso-spacerun: yes;"> </span>50 54 70 50 55 59 46 94 93 65 ...</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>$ A<span style="mso-spacerun: yes;"> </span>: num<span style="mso-spacerun: yes;"> </span>1972 1972 1972 1972 1893 ...</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>$ Sp : num<span style="mso-spacerun: yes;"> </span>0 0 0 0 0 0 0 0 0 0 ...</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>$ Sm : num<span style="mso-spacerun: yes;"> </span>0 0 0 0 0 0 0 0 0 0 ...</div><div 
class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>$ B<span style="mso-spacerun: yes;"> </span>: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>$ H<span style="mso-spacerun: yes;"> </span>: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>$ L<span style="mso-spacerun: yes;"> </span>: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>$ loc: Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 2 2 ...</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">><o:p> </o:p></div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;">This dataset summarizes a 1993 random sample of housing rental prices in <st1:place w:st="on"><st1:city w:st="on">Munich</st1:city></st1:place>, including a number of important characteristics about each one (e.g., year of construction, floor space in square meters, etc.).<span style="mso-spacerun: yes;"> </span>A more detailed description can be obtained via the command “<strong>help(rent)</strong>”.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The <strong>head</strong> command provides similar information to the <strong>str</strong> command, in slightly less detail (e.g., it doesn’t give us the variable types), but in a format that some will find more natural:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;">> head(rent)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>R Fl<span style="mso-spacerun: yes;"> </span>A Sp Sm B H L loc</div><div class="MsoNormal" style="margin: 0in 0in 
0pt;">1<span style="mso-spacerun: yes;"> </span>693.3 50 1972<span style="mso-spacerun: yes;"> </span>0<span style="mso-spacerun: yes;"> </span>0 0 0 0<span style="mso-spacerun: yes;"> </span>2</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">2<span style="mso-spacerun: yes;"> </span>422.0 54 1972<span style="mso-spacerun: yes;"> </span>0<span style="mso-spacerun: yes;"> </span>0 0 0 0<span style="mso-spacerun: yes;"> </span>2</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">3<span style="mso-spacerun: yes;"> </span>736.6 70 1972<span style="mso-spacerun: yes;"> </span>0<span style="mso-spacerun: yes;"> </span>0 0 0 0<span style="mso-spacerun: yes;"> </span>2</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">4<span style="mso-spacerun: yes;"> </span>732.2 50 1972<span style="mso-spacerun: yes;"> </span>0<span style="mso-spacerun: yes;"> </span>0 0 0 0<span style="mso-spacerun: yes;"> </span>2</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">5 1295.1 55 1893<span style="mso-spacerun: yes;"> </span>0<span style="mso-spacerun: yes;"> </span>0 0 0 0<span style="mso-spacerun: yes;"> </span>2</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">6 1195.9 59 1893<span style="mso-spacerun: yes;"> </span>0<span style="mso-spacerun: yes;"> </span>0 0 0 0<span style="mso-spacerun: yes;"> </span>2</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">><o:p> </o:p></div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><o:p> (An important difference between these representations is that <strong>str</strong> characterizes factor variables by their level <em>number</em> and not their level <em>value:</em> thus the first few observations of the factor B assume the first level of the factor, which is the value 0. As a consequence, while it may appear that <strong>str</strong> is telling us that the first few records list the value 1 for the variable B while <strong>head</strong> is indicating a zero, this is not the case. 
This is one reason that data analysts may prefer the <strong>head</strong> characterization.)</o:p></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">While the <em>R</em> data type of each variable can be useful to know – particularly in cases where it isn’t what we expect, as when integers are coded as factors – this characterization doesn’t really tell us the whole story.<span style="mso-spacerun: yes;"> </span>In particular, note that <em>R</em> has commands like “<strong>as.character</strong>” and “<strong>as.factor</strong>” that can easily convert numeric variables to character or factor data types.<span style="mso-spacerun: yes;"> </span>Even beyond this, the range of inherent behaviors that numerically-coded data can exhibit cannot be fully described by a simple data type designation.<span style="mso-spacerun: yes;"> </span>As a specific example, one of the variables in the <strong>rent</strong> dataframe is “A,” described in the metadata available from the help command as “year of construction.”<span style="mso-spacerun: yes;"> </span>While this variable is coded as type “numeric,” in fact it takes integer values from 1890 to 1988, with some values in this range repeated many times and others absent.<span style="mso-spacerun: yes;"> </span>This point is important, since analysis tools designed for continuous variables – especially outlier-resistant ones like medians and other rank-based methods – sometimes perform poorly in the face of data sequences with many repeated values (i.e., “ties,” which have zero probability for continuous data distributions).<span style="mso-spacerun: yes;"> </span>In extreme cases, these techniques may fail completely, as in the case of the MADM (median absolute deviation from the median) scale estimate, discussed in Chapter 7 of <em>Exploring Data</em>.<span style="mso-spacerun: yes;"> </span>This data characterization <em>implodes</em> if more than 50% of the data values are the same, 
returning the useless value zero in this case, independent of the values of all of the other data points.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">These observations motivate the <strong>DataframeSummary</strong> procedure described here, to be included in the <strong>ExploringData</strong> package.<span style="mso-spacerun: yes;"> </span>This function is called with the name of the dataframe to be characterized and an optional parameter <strong>Option</strong>, which can take any one of the following four values:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><ol style="margin-top: 0in;" type="1"><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l2 level1 lfo1; tab-stops: list .5in;">“Brief” (the default value)</li><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l2 level1 lfo1; tab-stops: list .5in;">“NumericOnly”</li><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l2 level1 lfo1; tab-stops: list .5in;">“FactorOnly”</li><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l2 level1 lfo1; tab-stops: list .5in;">“AllAsFactor”</li></ol><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">In all cases, this function returns a summary dataframe with one row for each column in the dataframe to be characterized.<span style="mso-spacerun: yes;"> </span>Like the <strong>str</strong> command, these results include the name of each variable and its type.<span style="mso-spacerun: yes;"> </span>Under the default option “Brief,” this function also returns the following characteristics for each variable:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><ul style="margin-top: 0in;" type="disc"><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l1 level1 lfo2; tab-stops: list .5in;">Levels = the number of distinct values the variable 
exhibits;</li><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l1 level1 lfo2; tab-stops: list .5in;">AvgFreq = the average number of records listing each value;</li><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l1 level1 lfo2; tab-stops: list .5in;">TopLevel = the most frequently occurring value;</li><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l1 level1 lfo2; tab-stops: list .5in;">TopFreq = the number of records listing this most frequent value;</li><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l1 level1 lfo2; tab-stops: list .5in;">TopPct = the percentage of records listing this most frequent value;</li><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l1 level1 lfo2; tab-stops: list .5in;">MissFreq = the number of missing or blank records;</li><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l1 level1 lfo2; tab-stops: list .5in;">MissPct = the percentage of missing or blank records.</li></ul><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">For the <strong>rent</strong> dataframe, this function (under the default “Brief” option) gives the following summary:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;">> DataframeSummary(rent)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Variable<span style="mso-spacerun: yes;"> </span>Type Levels AvgFreq TopLevel TopFreq TopPct MissFreq MissPct</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">3<span style="mso-spacerun: yes;"> </span>A numeric<span style="mso-spacerun: yes;"> </span>73<span style="mso-spacerun: yes;"> </span>26.97<span style="mso-spacerun: yes;"> </span>1957<span style="mso-spacerun: yes;"> </span>551<span style="mso-spacerun: yes;"> </span>27.98<span style="mso-spacerun: yes;"> </span>0<span style="mso-spacerun: yes;"> </span>0</div><div 
class="MsoNormal" style="margin: 0in 0in 0pt;">6<span style="mso-spacerun: yes;"> </span>B<span style="mso-spacerun: yes;"> </span>factor<span style="mso-spacerun: yes;"> </span>2<span style="mso-spacerun: yes;"> </span>984.50<span style="mso-spacerun: yes;"> </span>0<span style="mso-spacerun: yes;"> </span>1925<span style="mso-spacerun: yes;"> </span>97.77<span style="mso-spacerun: yes;"> </span>0<span style="mso-spacerun: yes;"> </span>0</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">2<span style="mso-spacerun: yes;"> </span>Fl numeric<span style="mso-spacerun: yes;"> </span>91<span style="mso-spacerun: yes;"> </span>21.64<span style="mso-spacerun: yes;"> </span>60<span style="mso-spacerun: yes;"> </span>71<span style="mso-spacerun: yes;"> </span>3.61<span style="mso-spacerun: yes;"> </span>0<span style="mso-spacerun: yes;"> </span>0</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">7<span style="mso-spacerun: yes;"> </span>H<span style="mso-spacerun: yes;"> </span>factor<span style="mso-spacerun: yes;"> </span>2<span style="mso-spacerun: yes;"> </span>984.50<span style="mso-spacerun: yes;"> </span>0<span style="mso-spacerun: yes;"> </span>1580<span style="mso-spacerun: yes;"> </span>80.24<span style="mso-spacerun: yes;"> </span>0<span style="mso-spacerun: yes;"> </span>0</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">8<span style="mso-spacerun: yes;"> </span>L<span style="mso-spacerun: yes;"> </span>factor<span style="mso-spacerun: yes;"> </span>2<span style="mso-spacerun: yes;"> </span>984.50<span style="mso-spacerun: yes;"> </span>0<span style="mso-spacerun: yes;"> </span>1808<span style="mso-spacerun: yes;"> </span>91.82<span style="mso-spacerun: yes;"> </span>0<span style="mso-spacerun: yes;"> </span>0</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">9<span style="mso-spacerun: yes;"> </span><span style="mso-spacerun: yes;"> </span>loc<span style="mso-spacerun: yes;"> </span>factor<span style="mso-spacerun: yes;"> 
</span>3<span style="mso-spacerun: yes;"> </span>656.33<span style="mso-spacerun: yes;"> </span>2<span style="mso-spacerun: yes;"> </span>1247<span style="mso-spacerun: yes;"> </span>63.33<span style="mso-spacerun: yes;"> </span>0<span style="mso-spacerun: yes;"> </span>0</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">1<span style="mso-spacerun: yes;"> </span>R numeric<span style="mso-spacerun: yes;"> </span>1762<span style="mso-spacerun: yes;"> </span>1.12<span style="mso-spacerun: yes;"> </span>900<span style="mso-spacerun: yes;"> </span>7<span style="mso-spacerun: yes;"> </span>0.36<span style="mso-spacerun: yes;"> </span>0<span style="mso-spacerun: yes;"> </span>0</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">5<span style="mso-spacerun: yes;"> </span>Sm numeric<span style="mso-spacerun: yes;"> </span>2<span style="mso-spacerun: yes;"> </span>984.50<span style="mso-spacerun: yes;"> </span>0<span style="mso-spacerun: yes;"> </span>1797<span style="mso-spacerun: yes;"> </span>91.26<span style="mso-spacerun: yes;"> </span>0<span style="mso-spacerun: yes;"> </span>0</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">4<span style="mso-spacerun: yes;"> </span>Sp numeric<span style="mso-spacerun: yes;"> </span>2<span style="mso-spacerun: yes;"> </span>984.50<span style="mso-spacerun: yes;"> </span><span style="mso-spacerun: yes;"> </span>0<span style="mso-spacerun: yes;"> </span>1419<span style="mso-spacerun: yes;"> </span>72.07<span style="mso-spacerun: yes;"> </span>0<span style="mso-spacerun: yes;"> </span>0</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">><o:p> </o:p></div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The variable names and types appear essentially as they do in the results obtained with the <strong>str</strong> function, and the numbers to the far left indicate the column numbers from the dataframe <strong>rent</strong> for each 
variable, since the variable names are listed alphabetically for convenience.<span style="mso-spacerun: yes;"> </span>The “Levels” column of this summary dataframe gives the number of unique values for each variable, and it is clear that this can vary widely even within a given data type.<span style="mso-spacerun: yes;"> </span>For example, the variable “R” (monthly rent in DM) exhibits 1,762 unique values in 1,969 data observations, so its values are almost all unique, while the variables “Sm” and “Sp” exhibit only two possible values, even though all three of these variables are of type “numeric.”<span style="mso-spacerun: yes;"> </span>The AvgFreq column gives the average number of times each level should appear, assuming a uniform distribution over all possible values.<span style="mso-spacerun: yes;"> </span>This number is included as a reference value for assessing the other frequencies (i.e., TopFreq for the most frequently occurring value and MissFreq for missing data values).<span style="mso-spacerun: yes;"> </span>Thus, for the first variable, “A,” AvgFreq is 26.97, meaning that if all 73 possible values for this variable were equally represented, each one would occur about 27 times in the dataset.<span style="mso-spacerun: yes;"> </span>The most frequently occurring level (TopLevel) is “1957,” which occurs 551 times, suggesting a highly nonuniform distribution of values for this variable.<span style="mso-spacerun: yes;"> </span>In contrast, for the variable “R,” AvgFreq is 1.12, meaning that each value of this variable is almost unique.<span style="mso-spacerun: yes;"> </span>The TopPct column gives the percentage of records in the dataset exhibiting the most frequent value for each variable, which varies from 0.36% for the numeric variable “R” to 97.77% for the factor variable “B.”<span style="mso-spacerun: yes;"> </span>It is interesting to note that “B” is of type “factor” but is coded as 0 or 1, while the variables “Sm” and “Sp” are also binary, coded as 
0 or 1, but are of type “numeric.”<span style="mso-spacerun: yes;"> </span>This illustrates the point noted above that the <em>R</em> data type is not always as informative as we might like it to be.<span style="mso-spacerun: yes;"> </span>(This is not a criticism of <em>R</em>, but rather a caution about the fact that, in preparing data, we are free to choose many different representations, and the original logic behind the choice may not be obvious to all ultimate users of the data.)<span style="mso-spacerun: yes;"> </span>In addition, comparing the data with the available metadata for the variable “B” illustrates the point about metadata errors noted earlier: of the 1,969 data records, 1,925 have the value “0” (97.77%), while 44 have the value “1” (2.23%), but the information returned by the help command indicates exactly the opposite proportion of values: 1,925 should have the value “1” (indicating the presence of a bathroom), while 44 should have the value “0” (indicating the absence of a bathroom).<span style="mso-spacerun: yes;"> </span>Since the interpretation of the variables that enter any analysis is important in explaining our final analytical results, it is useful to detect this type of mismatch between the data and the available metadata as early as possible.<span style="mso-spacerun: yes;"> Here, comparing the average rents for records with B = 1 (DM 424.95) against those with B = 0 (DM 820.72) suggests that the levels have been reversed relative to the metadata: the relatively few housing units without bathrooms are represented by B = 1, renting for less than the majority of those units, which have bathrooms and are represented by B = 0. 
</span>Finally, the last two columns of the above summary give the number of records with missing or blank values (MissFreq) and the corresponding percentage (MissPct); here, all records are complete so these numbers are zero.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">In my next post on this topic, I will present results for the other three options of the <strong>DataframeSummary </strong>procedure, along with the code that implements it.<span style="mso-spacerun: yes;"> </span>In all cases, the results include those generated by the “Brief” option just presented, but the difference between the other options lies first, in what additional characterizations are included, and second, in which subset of variables is included in the summary.<span style="mso-spacerun: yes;"> </span>Specifically, for the <strong>rent</strong> dataframe, we obtain:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><ul style="margin-top: 0in;" type="disc"><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l0 level1 lfo3; tab-stops: list .5in;">Under the “NumericOnly” option, a summary of the five numeric variables R, Fl, A, Sp, and Sm results, giving characteristics that are appropriate to numeric data types, like the spacing measures described in my last post;</li><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l0 level1 lfo3; tab-stops: list .5in;">Under the “FactorOnly” option, a summary of the four factor variables B, H, L, and loc results, giving measures that are appropriate to categorical data types, like the normalized Shannon entropy measure discussed in several previous posts;</li><li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l0 level1 lfo3; tab-stops: list .5in;">Under the “AllAsFactor” option, all variables in the dataframe are first converted to factors and then characterized using the same measures as in the “FactorOnly” option.</li></ul><div 
class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The advantage of the “AllAsFactor” option is that it characterizes all variables in the dataframe, but as I discussed in my last post, the characterization of numerical variables with measures like <st1:place w:st="on">Shannon</st1:place> entropy is not always terribly useful.</div>Ron Pearson (aka TheNoodleDoodler)http://www.blogger.com/profile/15693640298594791682noreply@blogger.com4tag:blogger.com,1999:blog-9179325420174899779.post-44562589283462891712012-09-22T18:54:00.000-07:002012-09-22T18:54:04.750-07:00Spacing measures: heterogeneity in numerical distributions<div class="MsoNormal" style="margin: 0in 0in 0pt;">Numerically-coded data sequences can exhibit a very wide range of distributional characteristics, including near-Gaussian (historically, the most popular working assumption), strongly asymmetric, light- or heavy-tailed, multi-modal, or discrete (e.g., count data).<span style="mso-spacerun: yes;"> </span>In addition, numerically coded values can be effectively categorical, either ordered or unordered.<span style="mso-spacerun: yes;"> </span>A specific example that illustrates the range of distributional behavior often seen in a collection of numerical variables is the <st1:city w:st="on">Boston</st1:city> housing dataframe (<st1:city w:st="on"><st1:place w:st="on"><strong>Boston</strong></st1:place></st1:city>) from the <strong>MASS</strong> package in <em>R</em>.<span style="mso-spacerun: yes;"> </span>This dataframe includes 14 numerical variables that characterize 506 suburban housing tracts in the <st1:city w:st="on"><st1:place w:st="on">Boston</st1:place></st1:city> area: 12 of these variables have class “numeric” and the remaining two have class “integer”.<span style="mso-spacerun: yes;"> </span>The integer variable <strong>chas</strong> is in fact a binary flag, taking the value 1 if the tract bounds the Charles River and 0 otherwise, 
and the integer variable <strong>rad</strong> is described as “an index of accessibility to radial highways,” assuming one of nine values: the integers 1 through 8, and 24.<span style="mso-spacerun: yes;"> </span>The other 12 variables assume anywhere from 26 unique values (for the zoning variable <strong>zn</strong>) to 504 unique values (for the per capita crime rate <strong>crim</strong>). The figure below shows nonparametric density estimates for four of these variables: the per-capita crime rate (<strong>crim</strong>, upper left plot), the percentage of the population designated “lower status” by the researchers who provided the data (<strong>lstat</strong>, upper right plot), the average number of rooms per dwelling (<strong>rm</strong>, lower left plot), and the zoning variable (<strong>zn</strong>, lower right plot).<span style="mso-spacerun: yes;"> </span>Comparing the appearances of these density estimates, considerable variability is evident: the distribution of <strong>crim</strong> is very asymmetric with an extremely heavy right tail, the distribution of <strong>lstat</strong> is also clearly asymmetric but far less so, while the distribution of <strong>rm</strong> appears to be almost Gaussian.<span style="mso-spacerun: yes;"> </span>Finally, the distribution of <strong>zn</strong> appears to be tri-modal, mostly concentrated around zero, but with clear secondary peaks at around 20 and 80.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-71PKem_pq00/UF4hdvnxB9I/AAAAAAAAAJ4/cs7iMj7gQOI/s1600/HolesFig01a.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" hea="true" height="319" src="http://4.bp.blogspot.com/-71PKem_pq00/UF4hdvnxB9I/AAAAAAAAAJ4/cs7iMj7gQOI/s320/HolesFig01a.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 
0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Each of these four plots also includes some additional information about the corresponding variable: three vertical reference lines at the mean (the solid line) and the mean offset by plus or minus three standard deviations (the dotted lines), and the value of the normalized <st1:place w:st="on">Shannon</st1:place> entropy, listed in the title of each plot.<span style="mso-spacerun: yes;"> </span>This normalized entropy value is discussed in detail in Chapter 3 of <a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650">Exploring Data in Engineering, the Sciences, and Medicine</a> and in two of my previous posts (<a href="http://exploringdatablog.blogspot.com/2011/04/interestingness-measures.html">April 3, 2011</a> and <a href="http://exploringdatablog.blogspot.com/2011_05_01_archive.html">May 21, 2011</a>), and it forms the basis for the spacing measure described below.<span style="mso-spacerun: yes;"> </span>First, however, the reason for including the three vertical reference lines on the density plots is to illustrate that, while popular “Gaussian expectations” for data are approximately met for some numerical variables (the <strong>rm</strong> variable is a case in point here), often these expectations are violated so much that they are useless.<span style="mso-spacerun: yes;"> </span>Specifically, note that under approximately Gaussian working assumptions, most of the observed values for the data sequence should fall between the two dotted reference lines, which should correspond approximately to the smallest and largest values seen in the dataset.<span style="mso-spacerun: yes;"> </span>This description is reasonably accurate for the variable <strong>rm</strong>, and the upper limit appears fairly reasonable for the variable <strong>lstat</strong>, but the lower limit is substantially negative here, which is not reasonable for this variable since 
it is defined as a percentage.<span style="mso-spacerun: yes;"> </span>These reference lines appear even more divergent from the general shapes of the distributions for the <strong>crim</strong> and <strong>zn</strong> data, where again, the lower reference lines are substantially negative, infeasible values for both of these variables.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The reason the reference values defined by these lines are not particularly representative is the extremely heterogeneous nature of the data distributions, particularly for the variables <strong>crim</strong> – where the distribution exhibits a very long right tail – and <strong>zn</strong> – where the distribution exhibits multiple modes. <span style="mso-spacerun: yes;"> </span>For categorical variables, distributional heterogeneity can be assessed by measures like the normalized Shannon entropy, which varies between 0 and 1, taking the value zero when all levels of the variable are equally represented, and taking the value 1 when only one of several possible values is present.<span style="mso-spacerun: yes;"> </span>This measure is easily computed and, while it is intended for use with categorical variables, the procedures used to compute it will return results for numerical variables as well.<span style="mso-spacerun: yes;"> </span>These values are shown in the titles of each of the above four plots, and it is clear from these results that the <st1:place w:st="on">Shannon</st1:place> measure does not give a reliable indication of distributional heterogeneity here.<span style="mso-spacerun: yes;"> </span>In particular, note that the Shannon measure for the <strong>crim</strong> variable is zero to three decimal places, suggesting a very homogeneous distribution, while the variables <strong>lstat</strong> and <strong>rm</strong> – both arguably less heterogeneous than <strong>crim</strong> – exhibit 
slightly larger values of 0.006 and 0.007, respectively.<span style="mso-spacerun: yes;"> </span>Further, the variable <strong>zn</strong>, whose density estimate resembles that of <strong>crim</strong> more than that of either of the other two variables, exhibits the much larger <st1:place w:st="on">Shannon</st1:place> entropy value of 0.585.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The basic difficulty here is that all observations of a continuously distributed random variable <em>should</em> be unique.<span style="mso-spacerun: yes;"> </span>The normalized <st1:place w:st="on">Shannon</st1:place> entropy – along with the other heterogeneity measures discussed in Chapter 3 of <em>Exploring Data</em> – effectively treats variables as categorical, returning a value that is computed from the fractions of total observations assigned to each possible value for the variable.<span style="mso-spacerun: yes;"> </span>Thus, for an ideal continuously-distributed variable, every possible value appears once and only once, so these fractions should be 1/N for each of the N distinct values observed for the variable.<span style="mso-spacerun: yes;"> </span>This means that the normalized Shannon measure – along with all of the alternative measures just noted – should be identically zero for this case, regardless of whether the continuous distribution in question is Gaussian, Cauchy, Pareto, uniform, or anything else.<span style="mso-spacerun: yes;"> </span>In fact, the <strong>crim</strong> variable considered here almost meets this ideal requirement: in 506 observations, <strong>crim</strong> exhibits 504 unique values, which is why its normalized <st1:place w:st="on">Shannon</st1:place> entropy value is zero to three decimal places.<span style="mso-spacerun: yes;"> </span>In marked contrast, the variable <strong>zn</strong> exhibits only 26 distinct values, meaning that each of these values 
occurs, on average, just over 19 times.<span style="mso-spacerun: yes;"> </span>However, this average behavior is not representative of the data in this case, since the smallest possible value (0) occurs 372 times, while the largest possible value (100) occurs only once.<span style="mso-spacerun: yes;"> </span>It is because of the discrete character of this distribution that the normalized <st1:place w:st="on">Shannon</st1:place> entropy is much larger here, accurately reflecting the pronounced distributional heterogeneity of this variable.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Taken together, these observations suggest a simple extension of the normalized <st1:place w:st="on">Shannon</st1:place> entropy that can give us a more adequate characterization of distributional differences for numerical variables.<span style="mso-spacerun: yes;"> </span>Specifically, the idea is this: begin by dividing the total range of a numerical variable <em>x</em> into M equal intervals.<span style="mso-spacerun: yes;"> </span>Then, count the number of observations that fall into each of these intervals and divide by the total number of observations N to obtain the fraction of observations falling into each group.<span style="mso-spacerun: yes;"> </span>By doing this, we have effectively converted the original numerical variable into an M-level categorical variable, to which we can apply heterogeneity measures like the normalized <st1:place w:st="on">Shannon</st1:place> entropy.<span style="mso-spacerun: yes;"> </span>The four plots below illustrate this basic idea for the four <st1:city w:st="on"><st1:place w:st="on">Boston</st1:place></st1:city> housing variables considered above.<span style="mso-spacerun: yes;"> </span>Specifically, each plot shows the fraction of observations falling into each of 10 equally spaced intervals, spanning the range from the smallest observed value of the variable to the 
largest.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-5f9HSlzZrKk/UF4nSLcji2I/AAAAAAAAAKM/PUf-UnLJZlI/s1600/HolesFig03a.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" hea="true" height="319" src="http://4.bp.blogspot.com/-5f9HSlzZrKk/UF4nSLcji2I/AAAAAAAAAKM/PUf-UnLJZlI/s320/HolesFig03a.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">As a specific example, consider the results shown in the upper left plot for the variable <strong>crim</strong>, which varies from a minimum of 0.00632 to a maximum of 89.0.<span style="mso-spacerun: yes;"> </span>Almost 87% of the observations fall into the smallest 10% of this range, from 0.00632 to 8.9, while the next two groups account for almost all of the remaining observations.<span style="mso-spacerun: yes;"> </span>In fact, none of the other groups (4 through 10) account for more than 1% of the observations, and one of these groups – group 7 – is completely empty.<span style="mso-spacerun: yes;"> </span>Computing the normalized <st1:place w:st="on">Shannon</st1:place> entropy from this ten-level categorical variable yields 0.767, as indicated in the title of the upper left plot.<span style="mso-spacerun: yes;"> </span>In contrast, the corresponding plot for the <strong>lstat</strong> variable, shown in the upper right, is much more uniform, with the first five groups exhibiting roughly the same fractional occupation.<span style="mso-spacerun: yes;"> </span>As a consequence, the normalized <st1:place w:st="on">Shannon</st1:place> entropy for this grouped variable is much smaller than that for the more heterogeneously distributed <strong>crim</strong> variable: 0.138 versus 0.767.<span style="mso-spacerun: yes;"> 
</span>Because the distribution is more sharply peaked for the <strong>rm</strong> variable than for <strong>lstat</strong>, the occupation fractions for the grouped version of this variable (lower left plot) are less homogeneous, and the normalized <st1:place w:st="on">Shannon</st1:place> entropy is correspondingly larger, at 0.272.<span style="mso-spacerun: yes;"> </span>Finally, for the <strong>zn</strong> variable (lower right plot), the grouped distribution appears similar to that for the <strong>crim</strong> variable, and the normalized <st1:place w:st="on">Shannon</st1:place> entropy values are also similar: 0.525 versus 0.767.<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The key point here is that, in contrast to the normalized Shannon entropy applied directly to the numerical variables in the <strong>Boston</strong> dataframe, grouping these values into 10 equally-spaced intervals and then computing the normalized Shannon entropy gives a number that seems to be more consistent with the distributional differences between these variables that can be seen clearly in their density plots.<span style="mso-spacerun: yes;"> </span>Motivation for this numerical measure (i.e., why not just look at the density plots?) 
comes from the fact that we are sometimes faced with the task of characterizing a new dataset that we have not seen before.<span style="mso-spacerun: yes;"> </span>While we can – and should – examine graphical representations of these variables, in cases where we have <em>many</em> such variables, it is desirable to have a few, easily computed numerical measures to use as screening tools, guiding us in deciding which variables to look at first, and which techniques to apply to them.<span style="mso-spacerun: yes;"> </span>The spacing measure described here – i.e., the normalized <st1:place w:st="on">Shannon</st1:place> entropy measure applied to a grouped version of the numerical variable – appears to be a potentially useful measure for this type of preliminary data characterization.<span style="mso-spacerun: yes;"> </span>For this reason, I am including it – along with a few other numerical characterizations – in the <strong>DataFrameSummary</strong> procedure I am implementing as part of the <strong>ExploringData</strong> package, which I will describe in a later post.<span style="mso-spacerun: yes;"> </span>Next time, however, I will explore two obvious extensions of the procedure described here: different choices of the heterogeneity measure, and different choices of the number of grouping levels.<span style="mso-spacerun: yes;"> </span>In particular, as I have shown in previous posts on interestingness measures, the normalized Bray, Gini, and Simpson measures all behave somewhat differently than the <st1:place w:st="on">Shannon</st1:place> measure considered here, raising the question of which one would be most effective in this application.<span style="mso-spacerun: yes;"> </span>In addition, the choice of 10 grouping levels considered here was arbitrary, and it is by no means clear that this choice is the best one.<span style="mso-spacerun: yes;"> </span>In my next post, I will explore how sensitive the Boston housing results are to changes in these two key 
design parameters.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Finally, it is worth saying something about how the grouping used here was implemented.<span style="mso-spacerun: yes;"> </span>The <em>R </em>code listed below is the function I used to convert a numerical variable <em>x</em> into the grouped variable from which I computed the normalized <st1:place w:st="on">Shannon</st1:place> entropy.<span style="mso-spacerun: yes;"> </span>The three key components of this function are the <strong>classIntervals</strong> function from the <em>R</em> package <strong>classInt</strong> (which must be loaded before use; hence, the “library(classInt)” statement at the beginning of the function), and the <strong>cut</strong> and <strong>table</strong> functions from base <em>R.</em><span style="mso-spacerun: yes;"> </span>The <strong>classIntervals</strong> function generates a two-element list with components <strong>var</strong>, which contains the original observations, and <strong>brks</strong>, which contains the M+1 boundary values for the M groups to be generated.<span style="mso-spacerun: yes;"> </span>Note that the <strong>style = “equal”</strong> argument is important here, since we want M equal-width groups.<span style="mso-spacerun: yes;"> </span>The <strong>cut</strong> function then takes these results and converts them into an M-level categorical variable, assigning each original data value to the interval into which it falls.<span style="mso-spacerun: yes;"> </span>The <strong>table</strong> function counts the number of times each of the M possible levels occurs for this categorical variable.<span style="mso-spacerun: yes;"> </span>Dividing this vector by the sum of all entries then gives the fraction of observations falling into each group.<span style="mso-spacerun: yes;"> </span>Plotting the results obtained from this function and reformatting the results slightly yields the 
four plots shown in the second figure above, and applying the <strong>shannon.proc</strong> procedure available from the <a href="http://www.oup.com/us/static/companion.websites/9780195089653/TextFiles/shannonproc.txt">OUP companion website</a> for <em>Exploring Data</em> yields the Shannon entropy values listed in the figure titles.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;">UniformSpacingFunction <- function(x, nLvls = 10){</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>library(classInt)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>xsum = classIntervals(x,n = nLvls, style="equal")</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>xcut = cut(xsum$var, breaks = xsum$brks, include.lowest = TRUE)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>xtbl = table(xcut)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>pvec = xtbl/sum(xtbl)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>pvec</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">}</div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div>Ron Pearson (aka TheNoodleDoodler)http://www.blogger.com/profile/15693640298594791682noreply@blogger.com0tag:blogger.com,1999:blog-9179325420174899779.post-5718125434476277962012-09-08T11:53:00.000-07:002012-09-08T11:53:23.397-07:00Implementing the CountSummary Procedure<div class="MsoNormal" 
style="margin: 0in 0in 0pt;">In my last post, I described and demonstrated the <strong>CountSummary</strong> procedure to be included in the <strong>ExploringData</strong> package that I am in the process of developing.<span style="mso-spacerun: yes;"> </span>This procedure generates a collection of graphical data summaries for a count data sequence, based on the <strong>distplot</strong>, <strong>Ord_plot</strong>, and <strong>Ord_estimate</strong> functions from the <strong>vcd</strong> package.<span style="mso-spacerun: yes;"> </span>The <strong>distplot</strong> function generates both the <em>Poissonness plot</em> and the <em>negative-binomialness plot</em> discussed in Chapters 8 and 9 of <a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650">Exploring Data in Engineering, the Sciences and Medicine</a>.<span style="mso-spacerun: yes;"> </span>These plots provide informal graphical assessments of the conformance of a count data sequence with the two most popular distribution models for count data, the Poisson distribution and the negative-binomial distribution.<span style="mso-spacerun: yes;"> </span>As promised, this post describes the <em>R</em> code needed to implement the <strong>CountSummary</strong> procedure, based on these functions from the <strong>vcd</strong> package.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The key to this implementation lies in the use of the <strong>grid</strong> package, a set of low-level graphics primitives included in base <em>R</em>.<span style="mso-spacerun: yes;"> As I mentioned in my last post, the reason this was necessary - instead of using higher-level graphics packages like <strong>lattice</strong> or <strong>ggplot2</strong> - was that the <strong>vcd</strong> package is based on grid graphics, making it incompatible with base graphics commands like those used to generate arrays of multiple plots. 
</span>The <strong>grid</strong> package was developed by Paul Murrell, who provides a lot of extremely useful information about both <em>R</em> graphics in general and grid graphics in particular on his <a href="http://www.stat.auckland.ac.nz/~paul/">home page</a>, including the article “Drawing Diagrams with R,” which provides a nicely focused introduction to grid graphics.<span style="mso-spacerun: yes;"> </span>The first example I present here is basically a composite of the first two examples presented in this paper.<span style="mso-spacerun: yes;"> </span>Specifically, the code for this example is:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;">library(grid)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">grid.newpage()</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">pushViewport(viewport(width = 0.8, height = 0.4))</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">grid.roundrect()</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">grid.text("This is text in a box")</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">popViewport()</div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The first line of this R code loads the <strong>grid</strong> package and the second tells this package to clear the plot window; failing to do this will cause this particular piece of code to overwrite whatever was there before, which usually isn’t what you want.<span style="mso-spacerun: yes;"> </span>The third line creates a <em>viewport</em>, into which the plot will be placed.<span style="mso-spacerun: yes;"> </span>In this particular example, we specify a width of 0.8, or 80% of the total plot window width, and a height of 0.4, corresponding to 40% of the total window height. 
<span style="mso-spacerun: yes;"> </span>The next two lines draw a rectangular box with rounded corners and put “This is text in a box” in the center of this box.<span style="mso-spacerun: yes;"> </span>The advantage of the <strong>grid </strong>package is that it provides us with simple graphics primitives to draw this kind of figure, without having to compute exact positions (e.g., in inches) for the different figure components.<span style="mso-spacerun: yes;"> </span>Commands like <strong>grid.text</strong> provide useful defaults (i.e., put the text in the center of the viewport), which can be overridden by specifying positional parameters in a variety of ways (e.g., left- or right-justified, offsets in inches or lines of text, etc.).<span style="mso-spacerun: yes;"> </span>The results obtained using these commands are shown in the figure below.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-It8lzfxyfvc/UEtohPq67jI/AAAAAAAAAJU/Z7VZG2enuO4/s1600/ImplementingFig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" hea="true" height="319" src="http://2.bp.blogspot.com/-It8lzfxyfvc/UEtohPq67jI/AAAAAAAAAJU/Z7VZG2enuO4/s320/ImplementingFig01.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The code for the second example is a simple extension of the first one, essentially consisting of the added initial code required to create the desired two-by-two plot array, followed by four slightly modified copies of the above code.<span style="mso-spacerun: yes;"> </span>Specifically, this code is:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;">grid.newpage()</div><div 
class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">pushViewport(viewport(layout=grid.layout(nrow=2,ncol=2)))</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">pushViewport(viewport(layout.pos.row=1,layout.pos.col=1))</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">grid.roundrect(width = 0.8, height=0.4)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">grid.text("Plot 1 goes here")</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">popViewport()</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">pushViewport(viewport(layout.pos.row=1,layout.pos.col=2))</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">grid.roundrect(width = 0.8, height=0.4)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">grid.text("Plot 2 goes here")</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">popViewport()</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">pushViewport(viewport(layout.pos.row=2,layout.pos.col=1))</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">grid.roundrect(width = 0.8, height=0.4)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">grid.text("Plot 3 goes here")</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">popViewport()</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">pushViewport(viewport(layout.pos.row=2,layout.pos.col=2))</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">grid.roundrect(width = 0.8, height=0.4)</div><div class="MsoNormal" style="margin: 0in 0in 
0pt;">grid.text("Plot 4 goes here")</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">popViewport()</div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Here, note that the first “pushViewport” command creates the two-by-two plot array we want, by specifying “layout = grid.layout(nrow=2,ncol=2)”.<span style="mso-spacerun: yes;"> </span>As in initializing a matrix in <em>R</em>, we can create an arbitrary two-dimensional array of grid graphics viewports – say m by n – by specifying “layout = grid.layout(nrow=m, ncol=n)”.<span style="mso-spacerun: yes;"> </span>Once we have done this, we can use whatever <strong>grid</strong> commands – or grid-compatible commands, such as those provided by the <strong>vcd</strong> package – we want, to create the individual elements in our array of plots.<span style="mso-spacerun: yes;"> </span>In this example, I have basically repeated the code from the first example to put text into rounded rectangular boxes in each position of the plot array.<span style="mso-spacerun: yes;"> </span>The two most important details are, first, that the “pushViewport” command at the beginning of each of these individual plot blocks specifies which of the four array elements the following plot will go in, and second, that the “popViewport()” command at the end of each block tells the <strong>grid</strong> package that we are finished with this element of the array.<span style="mso-spacerun: yes;"> </span>If we leave this command out, the next “pushViewport” command will not move to the desired plot element, but will simply overwrite the previous plot.<span style="mso-spacerun: yes;"> </span>Executing this code yields the plot shown below.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a 
href="http://4.bp.blogspot.com/--w8pVClO-Wk/UEtpZ4Fq8mI/AAAAAAAAAJc/IoDInNuG7t8/s1600/ImplementingFig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" hea="true" height="319" src="http://4.bp.blogspot.com/--w8pVClO-Wk/UEtpZ4Fq8mI/AAAAAAAAAJc/IoDInNuG7t8/s320/ImplementingFig02.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The final example replaces the text in the above two-by-two example with the plots I want for the <strong>CountSummary</strong> procedure.<span style="mso-spacerun: yes;"> </span>Before presenting this code, it is important to say something about the structure of the resulting plot and the <strong>vcd</strong> commands used to generate the different plot elements.<span style="mso-spacerun: yes;"> </span>The first plot – in the upper left position of the array shown below – is an Ord plot, generated by the <strong>Ord_plot</strong> command, which does two things.<span style="mso-spacerun: yes;"> </span>The first is to generate the desired plot, but the second is to return estimates of the intercept and slope of one of the two reference lines in the plot.<span style="mso-spacerun: yes;"> </span>The first of these lines is fit to the points in the plot via ordinary least squares, while the second – the one whose parameters are returned – is fit via weighted least squares, to down-weight the widely scattered points seen in this plot that correspond to cases with very few observations.<span style="mso-spacerun: yes;"> </span>The intent of the Ord plot is to help us decide which of several alternative distributions – including both the Poisson and the negative-binomial – fits our count data sequence better.<span style="mso-spacerun: yes;"> </span><span style="mso-spacerun: yes;"> </span>This guidance is based on the reference line parameters, and the 
<strong>Ord_estimate</strong> function in the <strong>vcd</strong> package transforms these parameter estimates into distributional recommendations and the distribution parameter values needed by the <strong>distplot</strong> function in the <strong>vcd</strong> package to generate either the Poissonness plot or the negative-binomialness plot for the count data sequence.<span style="mso-spacerun: yes;"> </span>Although these recommendations are sometimes useful, it is important to emphasize the caution given in the vcd package documentation:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span></div><blockquote class="tr_bq">“Be careful with the conclusions from Ord_estimate as it implements just some simple heuristics!”</blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-j1m2Fw17_Fw/UEtqTNjZs6I/AAAAAAAAAJk/YpIHlA1JYi4/s1600/ImplementingFig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" hea="true" height="319" src="http://1.bp.blogspot.com/-j1m2Fw17_Fw/UEtqTNjZs6I/AAAAAAAAAJk/YpIHlA1JYi4/s320/ImplementingFig03.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">In the <strong>CountSummary</strong> procedure, I use these results both to generate part of the text summary in the upper right element of the plot array, and to decide which type of plot to display in the lower right element of this array.<span style="mso-spacerun: yes;"> </span>Both this plot and the Poissonness reference plot in the lower left element of the display are created using the <strong>distplot</strong> command in 
the <strong>vcd</strong> package.<span style="mso-spacerun: yes;"> </span>I include the Poissonness reference plot because the Poisson distribution is the most commonly assumed distribution for count data – analogous in many ways to the Gaussian distribution so often assumed for continuous-valued data – and, by not specifying the single parameter for this distribution, I allow the function to determine it by fitting the data.<span style="mso-spacerun: yes;"> </span>In cases where the Ord plot heuristic recommends the Poissonness plot, it also provides this parameter, which I provide to the <strong>distplot</strong> function for the lower right plot. Thus, while both the lower right and lower left plots are Poissonness plots in this case, they are generally based on different distribution parameters.<span style="mso-spacerun: yes;"> </span>In the particular example shown here – constructed from the “number of times pregnant” variable in the Pima Indians diabetes dataset that I have discussed in several previous posts (available from the <a href="http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes">UCI Machine Learning Repository</a>) – the Ord plot heuristic recommends the negative binomial distribution.<span style="mso-spacerun: yes;"> </span>Comparing the Poissonness and negative-binomialness plots in the bottom row of the above plot array, it does appear that the negative binomial distribution fits the data better.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Finally, before examining the code for the <strong>CountSummary</strong> procedure, it is worth noting that the <strong>vcd</strong> package’s implementation of the <strong>Ord_plot</strong> and <strong>Ord_estimate</strong> procedures can generate four different distributional recommendations: the Poisson and negative-binomial distributions discussed here, along with the binomial distribution and the much less well-known 
<em>log-series distribution</em>.<span style="mso-spacerun: yes;"> </span>The <strong>distplot</strong> procedure is flexible enough to generate plots for the first three of these distributions, but not the fourth, so in cases where the Ord plot heuristic recommends this last distribution, the <strong>CountSummary</strong> procedure displays the recommended distribution and parameter, but displays a warning message that no distribution plot is available for this case in the lower right plot position.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The code for the <strong>CountSummary</strong> procedure looks like this:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;">CountSummary <- function(xCount,TitleString){</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#<span style="mso-spacerun: yes;"> </span>Initial setup</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>library(vcd)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>grid.newpage()</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#<span style="mso-spacerun: yes;"> </span>Set up 2x2 array of plots</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> 
</span>pushViewport(viewport(layout=grid.layout(nrow=2,ncol=2)))</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#<span style="mso-spacerun: yes;"> </span>Generate the plots:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#<span style="mso-spacerun: yes;"> </span>1 - upper left = Ord plot</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>pushViewport(viewport(layout.pos.row=1,layout.pos.col=1))</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>OrdLine = Ord_plot(xCount, newpage = FALSE, pop=FALSE, legend=FALSE)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>OrdType = Ord_estimate(OrdLine)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>popViewport()</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#<span style="mso-spacerun: yes;"> </span>2 - upper right = text summary</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>OrdTypeText = paste("Type = ",OrdType$type,sep=" ")</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>if (OrdType$type == "poisson"){</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> 
</span>OrdPar = "Lambda = "</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>}</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>else if ((OrdType$type == "nbinomial") || (OrdType$type == "binomial")){</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>OrdPar = "Prob = "</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>}<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>else if (OrdType$type == "log-series"){</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>OrdPar = "Theta = "</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>}</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>else{</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>OrdPar = "Parameter = "</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>}</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>OrdEstText = paste(OrdPar,round(OrdType$estimate,digits=3), sep=" ")</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>TextSummary = paste("Ord plot heuristic results:",OrdTypeText,OrdEstText,sep="\n")</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>pushViewport(viewport(layout.pos.row=1,layout.pos.col=2))</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>grid.text(TitleString,y=2/3,gp=gpar(fontface="bold"))</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> 
</span>grid.text(TextSummary,y=1/3)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>popViewport()</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#<span style="mso-spacerun: yes;"> </span>3 - lower left = standard Poissonness plot</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>pushViewport(viewport(layout.pos.row=2,layout.pos.col=1))</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>distplot(xCount, type="poisson",newpage=FALSE, pop=FALSE, legend = FALSE)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>popViewport()<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#<span style="mso-spacerun: yes;"> </span>4 - lower right = plot suggested by Ord results</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>pushViewport(viewport(layout.pos.row=2,layout.pos.col=2))</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>if (OrdType$type == "poisson"){</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>distplot(xCount, type="poisson",lambda=OrdType$estimate, newpage=FALSE, pop=FALSE, legend=FALSE)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>}</div><div 
class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>else if (OrdType$type == "nbinomial"){</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>prob = OrdType$estimate</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>size = 1/prob - 1</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>distplot(xCount, type="nbinomial",size=size,newpage=FALSE, pop=FALSE, legend=FALSE)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>}</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>else if (OrdType$type == "binomial"){</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>distplot(xCount, type="binomial", newpage=FALSE, pop=FALSE, legend=FALSE)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>}</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>else{</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>Message = paste("No distribution plot","available","for this case",sep="\n")</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>grid.text(Message)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>}</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>popViewport()</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">}</div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">This procedure is a function called with two arguments: 
the sequence of count values, <strong>xCount</strong>, and <strong>TitleString</strong>, a text string that is displayed in the upper right text box in the plot array, along with the recommendations from the Ord plot heuristic.<span style="mso-spacerun: yes;"> </span>When called, the function first loads the <strong>vcd </strong>library to make the <strong>Ord_plot</strong>, <strong>Ord_estimate</strong>, and <strong>distplot</strong> functions available for use, and it executes the <strong>grid.newpage()</strong> command to clear the display.<span style="mso-spacerun: yes;"> </span>(Note that we don’t have to include “library(grid)” here to load the <strong>grid </strong>package, since loading the <strong>vcd</strong> package automatically does this.)<span style="mso-spacerun: yes;"> </span>As in the previous example, the first “pushViewport” command creates the two-by-two plot array, and this is again followed by four code segments, one to generate each of the four displays in this array.<span style="mso-spacerun: yes;"> </span>The first of these segments invokes the <strong>Ord_plot</strong> and <strong>Ord_estimate</strong> commands as discussed above, first to generate the upper left plot (a side-effect of the <strong>Ord_plot</strong> command) and second, to obtain the Ord plot heuristic recommendations, to be used in structuring the rest of the display.<span style="mso-spacerun: yes;"> </span>The second segment creates a text display as in the first example considered here, but the structure of this display depends on the Ord plot heuristic results (i.e., the names of the parameters for the four possible recommended distributions are different, and the logic in this code block matches the display text to this distribution).<span style="mso-spacerun: yes;"> </span>As noted in the preceding discussion, the third plot (lower left) is the Poissonness plot generated by the <strong>distplot</strong> function from the <strong>vcd</strong> package.<span 
style="mso-spacerun: yes;"> </span>In this case, the function is called only specifying ‘type = “poisson”’ without the optional distribution parameter lambda, which is obtained by fitting the data.<span style="mso-spacerun: yes;"> </span>The final element of this plot array, in the lower right, is also generated via a call to the <strong>distplot</strong> function, but here, the results from the Ord plot heuristic are used to specify both the type parameter and any optional or required shape parameters for the distribution.<span style="mso-spacerun: yes;"> </span>As with the displayed text, simple if-then-else logic is required here to match the plot generated with the Ord plot heuristic recommendations.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Finally, it is important to note that in all of the calls made to <strong>Ord_plot</strong> or <strong>distplot</strong> in the <strong>CountSummary</strong> procedure, the parameters <strong>newpage</strong>, <strong>pop</strong>, and <strong>legend</strong>, are all specified as FALSE.<span style="mso-spacerun: yes;"> </span>Specifying “newpage = FALSE” prevents these <strong>vcd</strong> plot commands from clearing the display page and erasing everything we have done so far.<span style="mso-spacerun: yes;"> </span>Similarly, specifying “pop = FALSE” allows us to continue working in the current plot window until we notify the grid graphics system that we are done with it by issuing our own “popViewport()” command.<span style="mso-spacerun: yes;"> </span>Specifying “legend = FALSE” tells <strong>Ord_plot</strong> and <strong>distplot</strong> not to write the default informational legend on each plot.<span style="mso-spacerun: yes;"> </span>This is important here because, given the relatively small size of the plots generated in this two-by-two array, including the default legends would obscure important details.</div>Ron Pearson (aka 
TheNoodleDoodler)http://www.blogger.com/profile/15693640298594791682noreply@blogger.com0tag:blogger.com,1999:blog-9179325420174899779.post-22023757427221913632012-07-21T13:33:00.000-07:002012-07-21T13:33:51.861-07:00Base versus grid graphicsIn a comment in response to my latest post, Robert Young took issue with my characterization of <strong>grid</strong> as an <em>R</em> graphics package. Perhaps <strong>grid</strong> is better described as a “graphics support package,” but my primary point – and the main point of this post – is that to generate the display you want, it is sometimes necessary to use commands from this package. In my case, the necessity to learn something about grid graphics came as the result of my attempt to implement the <strong>CountSummary</strong> procedure to be included in the <strong>ExploringData</strong> package that I am developing. <strong>CountSummary</strong> is a graphical summary procedure for count data, based on <em>Poissonness plots, negative binomialness plots</em>, and <em>Ord plots</em>, all discussed in Chapter 8 of <a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650">Exploring Data in Engineering, the Sciences and Medicine</a>. My original idea was to implement these plots myself, but then I discovered that all three were already available in the <strong>vcd</strong> package. One of the great things about <em>R</em> is that you are encouraged to build on what already exists, so using the <strong>vcd</strong> implementations seemed like a no-brainer. Unfortunately, my first attempt at creating a two-by-two array of plots from the <strong>vcd</strong> package failed, and I didn’t understand why. The reason turned out to be that I was attempting to mix the base graphics command “<strong>par(mfrow=c(2,2))</strong>” that sets up a two-by-two array with various plotting commands from <strong>vcd</strong>, which are based on grid graphics. 
Because these two graphics systems don’t play well together, I didn’t get the results I wanted. In the end, however, by learning a little about the <strong>grid</strong> package and its commands, I was able to generate my two-by-two plot array without a great deal of difficulty. Since grid graphics isn’t even mentioned in my favorite <em>R</em> reference book (Michael Crawley’s <a href="http://www.amazon.com/The-Book-Michael-J-Crawley/dp/0470510242">The R Book</a>), I wanted to say a little here about what the <strong>grid</strong> package is and why you might need to know something about it. To do this, I will describe the ideas that went into the development of the <strong>CountSummary</strong> procedure and conclude this post with an example that shows what the output looks like. Next time, I will give a detailed discussion of the <em>R</em> code that generated these results. (For those wanting a preliminary view of what the code looks like, load the <strong>vcd</strong> package with the <strong>library</strong> command and run “<strong>example(Ord_plot)</strong>” – in addition to generating the plots, this example displays the grid commands needed to construct the two-by-two array.)<br /><br /><br /><br /><br />Count variables – non-negative integer variables like the “number of times pregnant” (NPG) variable from the Pima Indians database described below – are often assumed to obey a Poisson distribution, in much the same way that continuous-valued variables are often assumed to obey a Gaussian (normal) distribution. Like this normality assumption for continuous variables, the Poisson assumption for count data is sometimes reasonable, but sometimes it isn’t. Normal quantile-quantile plots like those generated by the <strong>qqnorm</strong> command in base <em>R</em> or the <strong>qqPlot</strong> command from the <strong>car</strong> package are useful in informally assessing the reasonableness of the normality assumption for continuous data. 
Similarly, Poissonness plots are the corresponding graphical tool for informally evaluating the Poisson hypothesis for count data. The construction and interpretation of these plots is discussed in some detail in Chapters 8 and 9 of <em>Exploring Data</em>, but briefly, this plot constructs a variable called the <em>Poissonness count metameter</em> from the number of times each possible count value occurs in the data; if the data sequence conforms to the Poisson distribution, the points on this plot should fall approximately on a straight line. A simple <em>R</em> function that constructs Poissonness plots is available on the <a href="http://www.oup.com/us/companion.websites/9780195089653/rprogram/?view=usa">OUP companion website</a> for the book, but an implementation that is both more conveniently available and more flexible is the <strong>distplot</strong> function in the <strong>vcd</strong> package, which also generates the negative binomialness plot discussed below.<br /><br /><br /><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-eZyHMtgdR00/UAsBSv87yEI/AAAAAAAAAI8/orrB7rCvnAI/s1600/PoissonnessPlot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" closure_uid_zd5ma="68" hda="true" height="319" src="http://3.bp.blogspot.com/-eZyHMtgdR00/UAsBSv87yEI/AAAAAAAAAI8/orrB7rCvnAI/s320/PoissonnessPlot.png" width="320" /></a></div><br /><br /><br />The figure above is the Poissonness plot constructed using the <strong>distplot</strong> procedure from the <strong>vcd</strong> package for the NPG variable from the Pima Indians diabetes dataset mentioned above. I have discussed this dataset in previous posts and have used it as the basis for several examples in <em>Exploring Data</em>. 
It is available from the <a href="http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes">UCI Machine Learning Repository</a> and it has been incorporated in various forms as an example dataset in a number of <em>R</em> packages, including a cleaned-up version in the <strong>MASS</strong> package (dataset <strong>Pima.tr</strong>). The full version considered here contains nine characteristics for 768 female members of the Pima Indian tribe, including their age, medical characteristics like diastolic blood pressure, and the number of times each woman has been pregnant. If this NPG count sequence obeyed the Poisson distribution, the points in the above plot would fall approximately on the reference line included there. The fact that these points do not conform well to this line – note, in particular, the departure at the lower left end of the plot where most of the counts occur – calls the Poisson working assumption into question.<br /><br /><br /><br />A fundamental feature of the Poisson distribution is that it is defined by a single parameter that determines all distributional characteristics, including both the mean and the variance. In fact, a key characteristic of the Poisson distribution is that the variance is equal to the mean. This constraint is not satisfied by all count data sequences we encounter, however, and these deviations are important enough to receive special designations: integer sequences whose variance is larger than their mean are commonly called <em>overdispersed</em>, while those whose variance is smaller than their mean are commonly called <em>underdispersed</em>. In practice, overdispersion seems to occur more frequently, and a popular distributional alternative for overdispersed sequences is the negative binomial distribution. This distribution is defined by two parameters and it is capable of matching both the mean and variance of arbitrary overdispersed count data sequences. 
For a detailed discussion of this distribution, refer to Chapter 3 of <em>Exploring Data</em>.<br /><br /><br /><br />As with the Poisson distribution, the reasonableness of the negative binomial distribution can be evaluated graphically, via the negative binomialness plot. Like the Poissonness plot, this plot is based on a quantity called the <em>negative binomialness metameter</em>, computed from the number of times each count value occurs and plotted against those count values. To construct this plot, it is necessary to specify a numerical value for the distribution’s second parameter (the <em>size</em> parameter in the <strong>distplot</strong> command, corresponding to the <em>r</em> parameter in the discussion of this distribution given in Chapter 8 of <em>Exploring Data</em>). This can be done in three ways. The first is to specify trial values, the approach taken in the negative binomialness plot procedure that is available from the OUP companion website. This option is also available with the <strong>distplot</strong> command from the <strong>vcd</strong> package: to obtain a negative binomialness plot, specify the <em>type</em> parameter as “nbinomial” and, if a fixed value is desired, give a numerical value for the <em>size</em> parameter in the <strong>distplot</strong> function call. Second, if this parameter is not specified, the <strong>distplot</strong> procedure will estimate it via the method of maximum likelihood, an extremely useful feature, although this estimation process can be time-consuming, especially for long data sequences. Third, the Ord plot described next can be used to obtain an estimate of this parameter based on a simple heuristic. In addition, this heuristic suggests which of these two candidate distributions – the Poisson or the negative binomial – is more appropriate for the data sequence. 
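<br /><br />As a rough cross-check before constructing a negative binomialness plot, the mean and variance of the count sequence can be compared directly. The following base <em>R</em> sketch is an illustration only: it uses a simple method-of-moments estimate of <em>size</em>, which is not one of the approaches to choosing this parameter discussed in this post, and the function name <strong>MomentSizeEstimate</strong> and the simulated sequence <strong>xSim</strong> are hypothetical names introduced for this example. It relies on the fact that a negative binomial distribution with mean m has variance m + m^2/size.

```r
# Method-of-moments overdispersion check for a count vector (base R only).
# NOTE: an illustrative sketch, not the maximum likelihood or Ord plot
# estimates discussed in the post.
MomentSizeEstimate <- function(x){
  m <- mean(x)
  v <- var(x)
  # Negative binomial: variance = mean + mean^2/size,
  # so size = mean^2/(variance - mean) when the data are overdispersed
  size <- if (v > m) m^2/(v - m) else NA
  list(mean = m, variance = v, dispersion = v/m, size = size)
}

set.seed(101)
# Simulated overdispersed counts, a hypothetical stand-in for real data
xSim <- rnbinom(5000, size = 2, mu = 4)
est <- MomentSizeEstimate(xSim)
est
```

A dispersion value well above 1 points toward overdispersion, and in that case the moment-based value could be tried as a fixed <em>size</em> in the <strong>distplot</strong> call (e.g., distplot(xSim, type = "nbinomial", size = est$size)) and compared with the maximum likelihood result.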
<br /><br /><br /><br />Like the Poissonness plot, the Ord plot computes a simple derived quantity from the original count data sequence – specifically, the <em>frequency ratio, </em>defined for each count value as that value multiplied by the ratio of the number of times it occurs to the number of times the next smaller count occurs – and plots this versus the counts. If the data sequence obeys the negative binomial distribution, these points should conform reasonably well to a line with positive slope, and this slope can be used to determine the <em>size</em> parameter for the distribution. Conversely, if the Poisson distribution is appropriate, the best-fit reference line for the Ord plot should have zero slope. Ord plots can also suggest two further discrete distributions (specifically, the binomial distribution and the log-series distribution), and the <strong>vcd </strong>package provides dataset examples to illustrate all four of these cases.<br /><br /><br /><br />For my <strong>CountSummary</strong> procedure, I decided to construct a two-by-two array with the following four components. First, in the upper left, I used the <strong>Ord_plot</strong> command in <strong>vcd</strong> to generate an Ord plot. This command returns the intercept and slope parameters for the reference line in the plot, and the <strong>Ord_estimate</strong> command can then be used to convert these values into a type specification and an estimate of the distribution parameter needed to construct the appropriate discrete distribution plot. I will discuss these results in more detail in my next post, but for the case of the NPG count sequence considered here, the Ord plot results suggest the negative binomial distribution as the most appropriate choice, returning a parameter <em>prob</em>, from which the <em>size</em> parameter required to generate the negative binomialness plot may be computed (specifically, <em>size = 1/prob – 1</em>). 
The upper right quadrant of this display gives a text summary identifying the variable being characterized and listing the Ord plot recommendations and parameter estimate. Since the Poisson distribution is “the default” assumption for count data, the lower left plot shows a Poissonness plot for the data sequence, while the lower right plot is the “distribution-ness plot” for the distribution recommended by the Ord plot results. The results obtained by the <strong>CountSummary</strong> procedure for the NPG sequence are shown below. Next time, I will present the code used to generate this plot.<br /><br /><br /><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-8avnjKu1QWw/UAsEgiaIysI/AAAAAAAAAJI/q7Z52I_U0-k/s1600/CountSummaryExample.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" closure_uid_zd5ma="70" hda="true" height="319" src="http://3.bp.blogspot.com/-8avnjKu1QWw/UAsEgiaIysI/AAAAAAAAAJI/q7Z52I_U0-k/s320/CountSummaryExample.png" width="320" /></a></div><br /><br /><br /><br /><br />Ron Pearson (aka TheNoodleDoodler)http://www.blogger.com/profile/15693640298594791682noreply@blogger.com1tag:blogger.com,1999:blog-9179325420174899779.post-11539726630515992722012-07-07T08:11:00.000-07:002012-07-07T08:11:02.191-07:00Graphical insights from the 2012 UseR! Meeting<div class="MsoNormal" style="margin: 0in 0in 0pt;">About this time last month, I attended the 2012 UseR! 
Meeting.<span style="mso-spacerun: yes;"> </span>Now an annual event, this series of conferences started in Europe in 2004 as an every-other-year gathering that now seems to alternate between the <country-region w:st="on">U.S.</country-region> and <place w:st="on">Europe</place>.<span style="mso-spacerun: yes;"> </span>This year’s meeting was held on the <placename w:st="on">Vanderbilt</placename> <placetype w:st="on">University</placetype> campus in <place w:st="on"><city w:st="on">Nashville</city>, <state w:st="on">TN</state></place>, and it was attended by about 500 <i style="mso-bidi-font-style: normal;">R </i>aficionados, ranging from beginners who have just learned about <i style="mso-bidi-font-style: normal;">R</i> to members of the original group of developers and the R Core Team that continues to maintain it.<span style="mso-spacerun: yes;"> </span>Many different topics were discussed, but one given particular emphasis was data visualization, which forms the primary focus of this post.<span style="mso-spacerun: yes;"> </span>For a more complete view of the range of topics discussed and who discussed them, the conference program is available as a <a href="http://biostat.mc.vanderbilt.edu/wiki/pub/Main/UseR-2012/useR-2012-program.pdf">PDF file</a> that includes short abstracts of the talks.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">All attendees were invited to present a Lightning Talk, and about 20 of us did.<span style="mso-spacerun: yes;"> </span>The format is essentially the technical equivalent of the 50-yard dash: before the talk, you provide the organizers exactly 15 slides, each of which is displayed for 20 seconds.<span style="mso-spacerun: yes;"> </span>The speaker’s challenge is first, to try to keep up with the slides, and second, to try to convey some useful information about each one.<span style="mso-spacerun: yes;"> </span>For my Lightning Talk, I described the <b 
style="mso-bidi-font-weight: normal;">ExploringData</b> <i style="mso-bidi-font-style: normal;">R</i> package that I am in the process of developing, as a companion to both this blog and my book, <a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650">Exploring Data in Engineering, the Sciences, and Medicine</a>.<span style="mso-spacerun: yes;"> </span>The intent of the package is first, to make the <i style="mso-bidi-font-style: normal;">R</i> procedures and datasets from the <a href="http://www.oup.com/us/companion.websites/9780195089653/rprogram/?view=usa">OUP companion site</a> for the book more readily accessible, and second, to provide some additional useful tools for exploratory data analysis, incorporating some of the extensions I have discussed in previous blog posts.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Originally, I had hoped to have the package complete by the time I gave my Lightning Talk, but in retrospect, it is just as well that the package is still in the development stage, because I picked up some extremely useful tips on what constitutes a good package at the meeting.<span style="mso-spacerun: yes;"> </span>As a specific example, Hadley Wickham, Professor of Statistics at <place w:st="on"><placename w:st="on">Rice</placename> <placetype w:st="on">University</placetype></place> and the developer of the <b style="mso-bidi-font-weight: normal;">ggplot2</b> package (more on this later), gave a standing-room-only talk on package development, featuring the <b style="mso-bidi-font-weight: normal;">devtools</b> package, something he developed to make the <i style="mso-bidi-font-style: normal;">R</i> package development process easier.<span style="mso-spacerun: yes;"> </span>In addition, the CRC vendor display at the meeting gave me the opportunity to browse and purchase Paul Murrell’s book, <a 
href="http://www.amazon.com/Graphics-Second-Chapman-Hall-CRC/dp/1439831769/ref=sr_1_1?s=books&ie=UTF8&qid=1341672263&sr=1-1&keywords=R+Graphics">R Graphics</a>, which provides an extremely useful, detailed, and well-written treatment of the four different approaches to graphics in <i style="mso-bidi-font-style: normal;">R</i> that I will say a bit more about below.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Because I am still deciding what to include in the <b style="mso-bidi-font-weight: normal;">ExploringData</b> package, one of the most valuable sessions for me was the invited talk by Di Cook, Professor of Statistics at <place w:st="on"><placename w:st="on">Iowa</placename> <placetype w:st="on">State</placetype> <placetype w:st="on">University</placetype></place>, who emphasized the importance of meaningful graphical displays in understanding the contents of a dataset, particularly if it is new to us.<span style="mso-spacerun: yes;"> </span>One of her key points – illustrated with examples from some extremely standard <i style="mso-bidi-font-style: normal;">R</i> packages – was that the “examples” associated with datasets included in <i style="mso-bidi-font-style: normal;">R</i> packages often fail to include any such graphical visualization, and even for those that do, the displays are often too cryptic to be informative.<span style="mso-spacerun: yes;"> </span>While this point is obvious enough in retrospect, it is one that I – along with a lot of other people, evidently – had not thought about previously.<span style="mso-spacerun: yes;"> </span>As a consequence, I am now giving careful thought to the design of informative display examples for each of the datasets I will include in the <b style="mso-bidi-font-weight: normal;">ExploringData</b> package.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">As I 
mentioned above, there are (at least) four fundamental approaches to doing graphics in <i style="mso-bidi-font-style: normal;">R</i>.<span style="mso-spacerun: yes;"> </span>The one that most of us first encounter – the one we use by default every time we issue a “plot” command – is called <i style="mso-bidi-font-style: normal;">base graphics</i>, and it is included in base R to support a wide range of useful data visualization procedures, including scatter plots, boxplots, histograms, and a variety of other common displays.<span style="mso-spacerun: yes;"> </span>The other three approaches to graphics – grid graphics, lattice graphics, and <b style="mso-bidi-font-weight: normal;">ggplot2</b> – all offer more advanced features than what is typically available in base graphics, but they are, most unfortunately, incompatible in a number of ways with base graphics.<span style="mso-spacerun: yes;"> </span>I discovered this the hard way when I was preparing one of the procedures for the <b style="mso-bidi-font-weight: normal;">ExploringData</b> package (the <b style="mso-bidi-font-weight: normal;">CountSummary</b> procedure, which I will describe and demonstrate in my next post).<span style="mso-spacerun: yes;"> </span>Specifically, the <b style="mso-bidi-font-weight: normal;">vcd</b> package includes implementations of Poissonness plots, negative binomialness plots, and Ord plots, all discussed in <i style="mso-bidi-font-style: normal;">Exploring Data</i>, and I wanted to take advantage of these implementations in building a simple graphical summary display for count data.<span style="mso-spacerun: yes;"> </span>In base graphics, to generate a two-by-two array of plots, you simply specify “par(mfrow=c(2,2))” and then generate each individual plot using standard plot commands.<span style="mso-spacerun: yes;"> </span>When I tried this with the plots generated by the <b style="mso-bidi-font-weight: normal;">vcd</b> package, I didn’t get what I wanted – for the most part, 
it appeared that the “par(mfrow=c(2,2))” command was simply being ignored, and when it wasn’t, multiple plots were piled up on top of each other.<span style="mso-spacerun: yes;"> </span>It turns out that the <b style="mso-bidi-font-weight: normal;">vcd</b> package uses grid graphics, which has a fundamentally different syntax: it’s more complicated, but in the end, it does provide a wider range of display options.<span style="mso-spacerun: yes;"> </span>Ultimately, I was able to generate the display I wanted, although this required some digging, since grid graphics aren’t really discussed much in my standard <i style="mso-bidi-font-style: normal;">R</i> reference books.<span style="mso-spacerun: yes;"> </span>For example, <a href="http://www.amazon.com/R-Book-Michael-J-Crawley/dp/0470510242/ref=sr_1_1?s=books&ie=UTF8&qid=1341672357&sr=1-1&keywords=The+R+book">The R Book</a> by Michael J. Crawley covers an extremely wide range of useful topics, but the only mentions of “grid” in the index refer to the generation of grid lines (e.g., the base graphics command “grid” generates grid lines on a base <i style="mso-bidi-font-style: normal;">R</i> plot, which is <em>not</em> based on grid graphics).<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Often, grid graphics are mentioned in passing in introductory descriptions of trellis (lattice) graphics, since the <b style="mso-bidi-font-weight: normal;">lattice</b> package is based on grid graphics.<span style="mso-spacerun: yes;"> </span>This package is discussed in <i style="mso-bidi-font-style: normal;">The R Book</i>, and I have used it occasionally because it does support things like violin plots that are not part of base graphics. 
<span style="mso-spacerun: yes;"> </span>To date, I haven’t used it much because I find the syntax much more complicated, but I plan to look further into it, since it does appear to have a lot more capability than base graphics do.<span style="mso-spacerun: yes;"> </span>Also, Murrell’s <i style="mso-bidi-font-style: normal;">R Graphics</i> book devotes a chapter to trellis graphics and the lattice package, which goes well beyond the treatments given in my other <i style="mso-bidi-font-style: normal;">R</i> references, and this provides me further motivation to learn more.<span style="mso-spacerun: yes;"> </span>The fourth approach to <i style="mso-bidi-font-style: normal;">R</i> graphics – Hadley Wickham’s <b style="mso-bidi-font-weight: normal;">ggplot2</b> package – was much discussed at the UseR! Meeting, appearing both in examples presented in various authors’ talks and as components for more complex and specialized graphics packages.<span style="mso-spacerun: yes;"> </span>I have not yet used <b style="mso-bidi-font-weight: normal;">ggplot2</b>, but I intend to try it out, since it appears from some of the examples that this package can generate an extremely wide range of data visualizations, with simple types comparable to what is found in base graphics often available as defaults.<span style="mso-spacerun: yes;"> </span>Like the lattice package, <b style="mso-bidi-font-weight: normal;">ggplot2</b> is also based on grid graphics, making it, too, incompatible with base graphics.<span style="mso-spacerun: yes;"> </span>Again, the fact that Murrell’s book devotes a chapter to this package should also be quite helpful in learning when and how to make the best use of it.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">This year’s UseR! 
Meeting was the second one I have attended – I also went to the 2010 meeting in <place w:st="on"><city w:st="on">Gaithersburg</city>, <state w:st="on">MD</state></place>, held at the National Institute of Standards and Technology (NIST).<span style="mso-spacerun: yes;"> </span>Both have been fabulous meetings, and I fully expect future meetings to be as good: next year’s UseR! meeting is scheduled to be held in <country-region w:st="on"><place w:st="on">Spain</place></country-region> and I’m not sure I will be able to attend, but I would love to.<span style="mso-spacerun: yes;"> </span>In any case, if you can get there, I highly recommend it, based on my experiences so far.</div>Ron Pearson (aka TheNoodleDoodler)http://www.blogger.com/profile/15693640298594791682noreply@blogger.com2tag:blogger.com,1999:blog-9179325420174899779.post-8726239151325206052012-06-10T13:13:00.000-07:002012-06-10T13:13:12.654-07:00Classifying the UCI mushrooms<div class="MsoNormal" style="margin: 0in 0in 0pt;">In my last post, I considered the shifts in two interestingness measures as possible tools for selecting variables in classification problems.<span style="mso-spacerun: yes;"> </span>Specifically, I considered the Gini and Shannon interestingness measures applied to the 22 categorical mushroom characteristics from the <a href="http://archive.ics.uci.edu/ml/datasets/Mushroom">UCI mushroom dataset</a>.<span style="mso-spacerun: yes;"> </span>The proposed variable selection strategy was to compare these values when computed from only edible mushrooms or only poisonous mushrooms.<span style="mso-spacerun: yes;"> </span>The rationale was that variables whose interestingness measures changed a lot between these two subsets might be predictive of mushroom edibility.<span style="mso-spacerun: yes;"> </span>In this post, I examine this question a little more systematically, primarily to illustrate the mechanics of setting up classification problems and evaluating their results.</div><div 
class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">More specifically, the classification problem I consider here is that of building and comparing models that predict mushroom edibility, each one based on a different mushroom characteristic.<span style="mso-spacerun: yes;"> </span>In practice, you would generally consider more than one characteristic as the basis for prediction, but here, I want to use standard classification tools to provide a basis for comparing the predictabilities of each of the potentially promising mushroom characteristics identified in my last post.<span style="mso-spacerun: yes;"> </span>In doing this, I also want to highlight three aspects of classification problems: first, the utility of randomly splitting the available data into subsets before undertaking the analysis, second, the fact that we have many different options in building classifiers, and third, one approach to assessing classification results.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">One of the extremely useful ideas emphasized in the machine learning literature is the utility of randomly partitioning our dataset into three parts: one used to fit whatever prediction model we are interested in building, another used to perform intermediate fit comparisons (e.g., compare the performance of models based on different predictor variables), and a third that is saved for a final performance assessment.<span style="mso-spacerun: yes;"> </span>The reasoning behind this partitioning is that if we allow our prediction model to become too complex, we run the risk of <i style="mso-bidi-font-style: normal;">overfitting,</i> or predicting some of the random details in our dataset, resulting in a model that does not perform well on other, similar datasets.<span style="mso-spacerun: yes;"> </span>This is an important practical problem that I illustrate 
with an extreme example in Chapter 1 of <a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650">Exploring Data in Engineering, the Sciences, and Medicine</a>.<span style="mso-spacerun: yes;"> </span>There, a sequence of seven monotonically-decaying observations is fit to a sixth-degree polynomial that exactly predicts the original seven observations, but which exhibits horrible interpolation and extrapolation behavior.<span style="mso-spacerun: yes;"> </span>The point here is that we need a practical means of protecting ourselves against building models that are too specific to the dataset at hand, and the partitioning strategy just described provides a simple way of doing this.<span style="mso-spacerun: yes;"> </span>That is, once we partition the data, we can fit our prediction model to the first subset and then evaluate its performance with respect to the second subset: because these subsets were generated by randomly sampling the original dataset, their general character is the same, so a “good” prediction model built from the first subset should give “reasonable” predictions for the second subset.<span style="mso-spacerun: yes;"> </span>The reason for saving out a third data subset – not used at all until the final evaluation of our model – is that model-building is typically an iterative procedure, so we are likely to cycle repeatedly between the first and second subsets.<span style="mso-spacerun: yes;"> </span>For the final model evaluation, it is desirable to have a dataset available that hasn’t been used at all.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Generating this three-way split in <em>R</em> is fairly easy.<span style="mso-spacerun: yes;"> </span>As with many tasks, this can be done in more than one way, but the following procedure is fairly straightforward and only makes use of procedures available in base <em>R</em>:</div><div 
class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;">RandomThreeWay.proc <- function(df, probs = c(35,35,30), iseed = 101){</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>set.seed(iseed)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>n = nrow(df)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>u = runif(n)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>nprobs = probs/sum(probs)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>brks = c(0,cumsum(nprobs))</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>Subgroup = cut(u, breaks=brks, labels=c("A","B","C"), include.lowest=TRUE)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>Subgroup</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">}</div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">This function is called with three parameters: the data frame that we wish to partition for our analysis, a vector of the relative sizes of our three partitions, and a seed for the random number generator.<span style="mso-spacerun: 
yes;"> </span>In the implementation shown here, the vector of relative sizes is given the default values 35%/35%/30%, but any relative size partitioning can be specified.<span style="mso-spacerun: yes;"> </span>The result returned by this procedure is the factor Subgroup, whose values are “A”, “B”, or “C”, corresponding to the three desired partitions of the dataset.<span style="mso-spacerun: yes;"> </span>The first line of this procedure sets the seed for the uniform random number generator used in the third line, and the second line specifies how many random numbers to generate (i.e., one for each data record in the data frame).<span style="mso-spacerun: yes;"> </span>The basic idea here is to generate uniform random numbers on the interval [0,1] and then assign subgroups depending on whether each value falls into the interval between 0 and 0.35, 0.35 to 0.70, or 0.70 to 1.00.<span style="mso-spacerun: yes;"> </span>The <strong>runif</strong> function generates the required random numbers, the <strong>cumsum</strong> function is used to generate the cumulative breakpoints from the normalized probabilities, and the <strong>cut</strong> function is used to group the uniform random numbers using these breakpoints.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">In the specific example considered here, I use logistic regression as my classifier, although many, many other classification procedures are available in <em>R,</em> including a wide range of decision tree-based models, random forest models, boosted tree models, naïve Bayes classifiers, and support vector machines, to name only a few.<span style="mso-spacerun: yes;"> </span>(For a more complete list, refer to the CRAN task view on <a href="http://cran.r-project.org/web/views/MachineLearning.html">Machine Learning and Statistical Learning</a>).<span style="mso-spacerun: yes;"> </span>Here, I construct and compare six 
logistic regression models, each constructed to predict the probability that a mushroom is poisonous from one of the six mushroom characteristics identified in my previous post: GillSize, StalkShape, CapSurf, Bruises, GillSpace, and Pop.<span style="mso-spacerun: yes;"> </span>In each case, I extract the records for subset “A” of the UCI mushroom dataset, as described above, and use the base <em>R</em> procedure <strong>glm</strong> to construct a logistic regression model.<span style="mso-spacerun: yes;"> </span>Because the model evaluation procedure (<strong>somers2</strong>, described below) that I use here requires a binary response coded as 0 or 1, it is simplest to construct a data frame with this response explicitly, along with the prediction covariate of interest.<span style="mso-spacerun: yes;"> </span>The following code does this for the first predictor (GillSize):</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;">EorP = UCImushroom.frame$EorP</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">PoisonBinary = rep(0,length(EorP))</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">PoisonIndx = which(EorP == "p")</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">PoisonBinary[PoisonIndx] = 1</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">FirstFrame = data.frame(PoisonBinary = PoisonBinary, Covar = UCImushroom.frame$GillSize)</div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">In particular, this code constructs a two-column data frame that contains the binary response variable PoisonBinary, which is equal to 1 whenever EorP is “p” and 0 whenever this variable is “e”, and the prediction covariate Covar, which is here “GillSize”.<span style="mso-spacerun: yes;"> </span>Given this data frame, I 
then apply the following code to randomly partition this data frame into subsets A, B, and C, and I invoke the built-in <strong>glm</strong> procedure to fit a logistic regression model:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;">Subset = RandomThreeWay.proc(FirstFrame)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">IndxA = which(Subset == "A")</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">LogisticModel = glm(PoisonBinary ~ Covar, data = FirstFrame, subset = IndxA, family=binomial())</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;">Note that here I have specified the model form using the <em>R</em> formula construction “PoisonBinary ~ Covar”, I have used the <strong>subset</strong> argument of the <strong>glm</strong> procedure to specify that I only want to fit the model to subset A, and I have specified “family = binomial()” to request a logistic regression model.<span style="mso-spacerun: yes;"> </span>Once I have this model, I evaluate it using the concordance index C available from the <strong>somers2</strong> function in the <em>R</em> package <strong>Hmisc</strong>.<span style="mso-spacerun: yes;"> </span>This value corresponds to the area under the ROC curve and is a measure of agreement between the predictions of the logistic regression model and the actual binary response.<span style="mso-spacerun: yes;"> </span>As discussed above, I want to do this evaluation for subset B to avoid an over-optimistic view of the model’s performance due to overfitting of subset A.<span style="mso-spacerun: yes;"> </span>To do this, I need the model predictions from subset B, which I obtain with the built-in <strong>predict</strong> procedure:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br 
/></div><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;">IndxB = which(Subset == "B")</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">PredPoisonProb = predict(LogisticModel, newdata = FirstFrame[IndxB,], type="response")</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">ObsPoisonBinary = FirstFrame$PoisonBinary[IndxB]</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;">In addition, I have created the variable ObsPoisonBinary, the sequence of binary responses from subset B, which I will use in calling the <strong>somers2</strong> function:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;">library(Hmisc)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">somers2(PredPoisonProb, ObsPoisonBinary)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>C<span style="mso-spacerun: yes;"> </span>Dxy<span style="mso-spacerun: yes;"> </span>n<span style="mso-spacerun: yes;"> </span><span style="mso-spacerun: yes;"> </span>Missing </div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>0.7375031<span style="mso-spacerun: yes;"> </span>0.4750063 2858.0000000<span style="mso-spacerun: yes;"> </span>0.0000000 </div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The results shown here include the concordance index C, an alternative (and fully equivalent, since Dxy = 2(C - 0.5)) measure called Somers’ D (from which the procedure gets its name), the number of records in the dataset (here, in subset B), and the number of missing records (here, none).<span style="mso-spacerun: yes;"> </span>The concordance index C is a number that varies between 0 and 1, with values between 0.5 and 1.0 meaning that the predictions 
are better than random guessing, and values less than 0.5 indicating performance so poor that it is actually worse than random guessing.<span style="mso-spacerun: yes;"> </span>Here, the value of approximately 0.738 suggests that GillSize is a reasonable predictor of mushroom edibility, at least for mushrooms like those characterized in the UCI mushroom dataset.<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Repeating this process for all six of the mushroom characteristics identified as potentially predictive by the interestingness change analysis I discussed last time leads to the following results:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span></div><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;"> Pop:<span style="mso-tab-count: 2;"> </span>C = 0.753<span style="mso-tab-count: 1;"> </span>(6 levels)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>Bruises:<span style="mso-tab-count: 1;"> </span>C = 0.740<span style="mso-tab-count: 1;"> </span>(2 levels)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>GillSize:<span style="mso-tab-count: 1;"> </span>C = 0.738<span style="mso-tab-count: 1;"> </span>(2 levels)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>GillSpace:<span style="mso-tab-count: 1;"> </span>C = 0.635<span style="mso-tab-count: 1;"> </span>(2 levels)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>CapSurf:<span style="mso-tab-count: 1;"> </span>C = 0.595<span style="mso-tab-count: 1;"> </span>(4 levels)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> 
</span>StalkShape:<span style="mso-tab-count: 1;"> </span>C = 0.550<span style="mso-tab-count: 1;"> </span>(2 levels)</div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">These results leave open the questions of whether other mushroom characteristics, not identified on the basis of their interestingness shifts, are in fact more predictive of edibility, or how much better the predictions can be if we use more than one prediction variable.<span style="mso-spacerun: yes;"> </span>I will examine those questions in subsequent posts, using the ideas outlined here.<span style="mso-spacerun: yes;"> </span>For now, it is enough to note that one advantage of the approach described here, relative to that using odds ratios for selected covariates discussed last time, is that this approach can be used to assess the potential prediction power of categorical variables with arbitrary numbers of levels, while the odds ratio approach is limited to two-level predictors.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div>Ron Pearson (aka TheNoodleDoodler)http://www.blogger.com/profile/15693640298594791682noreply@blogger.com0tag:blogger.com,1999:blog-9179325420174899779.post-90028185536808831472012-05-19T13:06:00.000-07:002012-05-19T13:06:16.490-07:00Interestingness comparisons<div class="MsoNormal" style="margin: 0in 0in 0pt;">In three previous posts (<a href="http://exploringdatablog.blogspot.com/2011/04/interestingness-measures.html">April 3, 2011</a>, <a href="http://exploringdatablog.blogspot.com/2011/04/screening-for-predictive.html">April 12, 2011</a>, and <a href="http://exploringdatablog.blogspot.com/2011/05/distribution-of-interestingness.html">May 21, 2011</a>), I have discussed <em>interestingness measures,</em> which characterize the distributional heterogeneity of categorical variables.<span style="mso-spacerun: yes;"> </span>Four specific measures are discussed in Chapter 3 
of <a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650">Exploring Data in Engineering, the Sciences and Medicine</a>: the Bray measure, the Gini measure, the <place w:st="on">Shannon</place> measure, and the Simpson measure.<span style="mso-spacerun: yes;"> </span>All four of these measures vary from 0 to 1 in value, exhibiting their minimum value when all levels of the variable are equally represented, and exhibiting their maximum value when the variable is completely concentrated on a single one of its several possible levels.<span style="mso-spacerun: yes;"> </span>Intermediate values correspond to variables that are more or less homogeneously distributed: more homogeneous for smaller values of the measure, and less homogeneous for larger values.<span style="mso-spacerun: yes;"> </span>One of the points I noted in my first post on this topic was that the different measures exhibit different behavior for the intermediate cases, reflecting different inherent sensitivities to the various ways in which a variable can be “more homogeneous” or “less homogeneous.”<span style="mso-spacerun: yes;"> </span>This post examines changes in interestingness measures as a potential exploratory analysis tool for selecting categorical predictors of some binary response. In fact, I examined the same question from a different perspective in my April 12 post noted above: the primary difference is that there, the characterization I considered generates a single graph for each variable, with the number of points on the graph corresponding to the number of levels of the variable. 
Here, I examine a characterization that represents each variable as a single point on the graph, allowing us to consider all variables simultaneously.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-NBTjPHooKC0/T7fzl4ud0BI/AAAAAAAAAIY/7iD3q7T3LD4/s1600/GiniVsShannonPlot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" kba="true" src="http://4.bp.blogspot.com/-NBTjPHooKC0/T7fzl4ud0BI/AAAAAAAAAIY/7iD3q7T3LD4/s320/GiniVsShannonPlot.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">As a reminder of how these measures behave, the figure above shows a plot of the normalized Gini measure versus the normalized <place w:st="on">Shannon</place> measure for the 23 categorical variables included in the mushroom dataset from the <a href="http://archive.ics.uci.edu/ml/datasets/Mushroom">UCI Machine Learning Repository</a>.<span style="mso-spacerun: yes;"> As I have noted in several previous posts that have discussed</span> this dataset, it gives observable characteristics for 8,124 mushrooms and classifies each one as either edible or poisonous (the binary variable EorP).<span style="mso-spacerun: yes;"> </span>The above plot illustrates the systematic difference between the normalized Shannon and Gini interestingness measures: there, each point represents one of the 23 variables in the dataset, with the horizontal axis representing the Shannon measure computed for the variable and the vertical axis representing the corresponding Gini measure. 
The plot shows that the Gini measure is consistently larger than the <place w:st="on">Shannon</place> measure, since all points lie above the equality reference line in this plot except for the single point at the origin.<span style="mso-spacerun: yes;"> </span>This point corresponds to the variable VeilType, which only exhibits a single value in this dataset, meaning that both the Gini and Shannon measures are inherently ill-defined; consequently, they are given the default value of zero here, consistent with the general interpretation of these measures: if a variable only assumes a single value, it seems reasonable to consider it “completely homogeneous.”</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Because edible and poisonous mushrooms are fairly evenly represented in this dataset (51.8% edible versus 48.2% poisonous), it has been widely used as one of several benchmarks for evaluating classification algorithms.<span style="mso-spacerun: yes;"> </span>In particular, given the other mushroom characteristics, the fundamental classification question is how well we can predict whether each mushroom is poisonous or edible.<span style="mso-spacerun: yes;"> </span>In this post and a follow-up post, I consider a closely related question: can differences in a variable’s interestingness measure between the edible subset and the poisonous subset be used to help us select prediction covariates for these classification algorithms?<span style="mso-spacerun: yes;"> </span>In this post, I present some preliminary evidence to suggest that this may be the case, while in a subsequent post, I will put the question to the test by seeing how well the covariates suggested by this analysis actually predict edibility.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The specific idea I examine here is the following: given an 
interestingness measure and a mushroom characteristic, compute this measure for the chosen characteristic, applied to the edible and poisonous mushrooms separately.<span style="mso-spacerun: yes;"> </span>If these numbers are very different, this suggests that the distribution of levels is different for edible and poisonous mushrooms, further suggesting that this variable may be a useful predictor of edibility.<span style="mso-spacerun: yes;"> </span>To turn this idea into a data analysis tool, it is necessary to define what we mean by “very different,” and this can be done in more than one way.<span style="mso-spacerun: yes;"> </span>Here, I consider two possibilities.<span style="mso-spacerun: yes;"> </span>The first is what I call the “normalized difference,” defined as the difference of the two interestingness measures divided by their sum.<span style="mso-spacerun: yes;"> Since</span> both interestingness measures lie between 0 and 1, it is not difficult to show that this normalized difference lies between -1 and +1.<span style="mso-spacerun: yes;"> </span>As a specific application of this idea, consider the plot below, which shows the normalized difference in the Gini measure between the poisonous mushrooms and the edible mushrooms (the normalized Gini shift) plotted against the corresponding difference for the Shannon measure (the normalized Shannon shift).<span style="mso-spacerun: yes;"> </span>In addition, this plot shows an equality reference line, and the fact that the points consistently lie between this line and the horizontal axis shows that the normalized Gini shift is consistently smaller in magnitude than the normalized <place w:st="on">Shannon</place> shift.<span style="mso-spacerun: yes;"> </span>This suggests that the normalized <place w:st="on">Shannon</place> measure may be more sensitive to distributional differences between edible and poisonous mushrooms.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div 
class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-taWQoCtLtdg/T7f10S9arAI/AAAAAAAAAIg/IISgWJr5lXw/s1600/NormalizedMeasurePlot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" kba="true" src="http://2.bp.blogspot.com/-taWQoCtLtdg/T7f10S9arAI/AAAAAAAAAIg/IISgWJr5lXw/s320/NormalizedMeasurePlot.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The next figure, below, shows a re-drawn version of the above plot, with the equality reference line removed and replaced by four other reference lines.<span style="mso-spacerun: yes;"> </span>The vertical dashed lines correspond to the outlier detection limits obtained by the Hampel identifier with threshold value t = 2 (see Chapter 7 of <em>Exploring Data</em> for a detailed discussion of this procedure), computed from the normalized Shannon shift values, while the horizontal dashed lines represent the corresponding limits computed from the normalized Gini shift values.<span style="mso-spacerun: yes;"> </span>Points falling outside these limits represent variables whose changes in both Gini measure and Shannon measure are “unusually large” according to the Hampel identifier criteria used here.<span style="mso-spacerun: yes;"> </span>These points are represented as solid circles, while those not detected as “unusual” by the Hampel identifier are represented as open circles.<span style="mso-spacerun: yes;"> </span>The idea proposed here – to be investigated in a future post – is that these outlying variables <i style="mso-bidi-font-style: normal;">may</i> be useful in predicting mushroom edibility.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a 
href="http://4.bp.blogspot.com/-7M85IUD9EW8/T7f2mv0pRtI/AAAAAAAAAIo/ITEIoTvLcW4/s1600/NormalizedMeasurePlot3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" kba="true" src="http://4.bp.blogspot.com/-7M85IUD9EW8/T7f2mv0pRtI/AAAAAAAAAIo/ITEIoTvLcW4/s320/NormalizedMeasurePlot3.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">More specifically, the six solid circles in the above plot correspond to the following mushroom characteristics.<span style="mso-spacerun: yes;"> </span>The two points in the lower left corner of the plot – exhibiting almost the most negative normalized <place w:st="on">Shannon</place> shift possible – correspond to GillSize and StalkShape, two binary variables.<span style="mso-spacerun: yes;"> </span>As I discussed in a previous post (<a href="http://exploringdatablog.blogspot.com/2011/05/computing-odds-ratios-in-r.html">May 7, 2011</a>) and I discuss further in Chapter 13 of <i style="mso-bidi-font-style: normal;">Exploring Data</i>, an extremely useful measure of association between two binary variables (e.g., between GillSize and edibility) is the odds ratio.<span style="mso-spacerun: yes;"> </span>An examination of the odds ratios for these two variables suggests that both should be at least somewhat predictive of edibility: the odds ratio between GillSize and edibility is 0.056, suggesting a very strong association (specifically, a GillSize value of “n” for “narrow” is most commonly associated with poisonous mushrooms in the UCI mushroom dataset), while the odds ratio between StalkShape and edibility is less extreme at 1.511, but still different enough from the neutral value of 1 to be suggestive of a clear association between these variables (a StalkShape value of “t” is more strongly associated with edible mushrooms than the alternative 
value of “e”).<span style="mso-spacerun: yes;"> </span>The solid circle in the upper right of this plot corresponds to the variable CapSurf, which has four levels and whose distributional homogeneity appears to change quite substantially, according to both the Gini and <place w:st="on">Shannon</place> measures.<span style="mso-spacerun: yes;"> </span>Because this variable has more than two levels, it is not possible to characterize its association in terms of its odds ratio relative to edibility.<span style="mso-spacerun: yes;"> </span>Finally, the cluster of three points in the upper right, just barely above the upper horizontal dashed line, corresponds to the binary variables Bruises and GillSpace, and the six-level variable Pop.<span style="mso-spacerun: yes;"> </span>Both of these binary variables exhibit very large odds ratios with respect to edibility (9.97 and 13.55 for Bruises and GillSpace, respectively), again suggesting that these variables may be highly predictive of edibility.<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The prevalence of binary variables in these results is noteworthy, and it reflects the fact that distributional shifts for binary variables can only occur in one way (i.e., the relative frequency of either fixed level can either increase or decrease).<span style="mso-spacerun: yes;"> </span>Thus, large shifts in either interestingness measure should correspond to significant odds ratios with respect to the binary response variable, and this is seen to be the case here.<span style="mso-spacerun: yes;"> </span>The situation is more complicated when a variable exhibits more than two levels, since the distribution of these levels can change in many ways between the two binary response values.<span style="mso-spacerun: yes;"> </span>An important advantage of techniques like the interestingness shift analysis described here 
is that they are not restricted to binary characteristics, as odds ratio characterizations are.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The second approach I consider for measuring the shift in interestingness between edible and poisonous mushrooms is what I call the “marginal measure,” corresponding to the difference in either the Gini or the <place w:st="on">Shannon</place> measure between poisonous and edible mushrooms, divided by the original measure for the complete dataset.<span style="mso-spacerun: yes;"> </span>An important difference between the marginal measure and the normalized measure is that the marginal measure is not bounded to lie between -1 and +1, as is evident in the plot below.<span style="mso-spacerun: yes;"> </span>This plot shows the marginal Gini shift against the marginal <place w:st="on">Shannon</place> shift for the mushroom characteristics, in the same format as the plot above.<span style="mso-spacerun: yes;"> </span>Here, only four points are flagged as outliers, corresponding to the four binary variables identified above from the normalized shift plot: Bruises (the point in the extreme upper right), GillSpace (the point just barely in the upper right quadrant), and GillSize and StalkShape (the two points in the extreme lower left).<span style="mso-spacerun: yes;"> </span>However, if we lower the Hampel identifier threshold from t = 2 to t = 1.5, we again identify CapSurf and Pop as potentially influential variables.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-NDv1NRKdFVo/T7f3sYM_CgI/AAAAAAAAAIw/gahvcxEtAMQ/s1600/MarginalMeasurePlot3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" kba="true" 
src="http://3.bp.blogspot.com/-NDv1NRKdFVo/T7f3sYM_CgI/AAAAAAAAAIw/gahvcxEtAMQ/s320/MarginalMeasurePlot3.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">This last observation suggests an alternative interpretation approach that may be worth exploring.<span style="mso-spacerun: yes;"> </span>Specifically, both of the two previous plots give clear visual evidence of “cluster structure,” and the Hampel identifier does extract some or all of this structure from the plot, but only if we apply a sufficiently judicious tuning to the threshold parameter.<span style="mso-spacerun: yes;"> </span>A possible alternative would be to apply <em>cluster analysis</em> procedures, and this will be the subject of one or more subsequent posts.<span style="mso-spacerun: yes;"> </span>In particular, there are many different clustering algorithms that could be applied to this problem, and the results are likely to be quite different.<span style="mso-spacerun: yes;"> </span>The key practical question is which ones – if any – lead to useful ways of grouping these mushroom characteristics.<span style="mso-spacerun: yes;"> </span>Subsequent posts will examine this question further from several different perspectives.</div>Ron Pearson (aka TheNoodleDoodler)http://www.blogger.com/profile/15693640298594791682noreply@blogger.com0tag:blogger.com,1999:blog-9179325420174899779.post-62875307895827496672012-04-21T11:38:00.000-07:002012-04-21T11:38:34.100-07:00David Olive’s median confidence interval<div class="MsoNormal" style="margin: 0in 0in 0pt;">As I have discussed in a number of previous posts, the median represents a well-known and widely-used estimate of the “center” of a data sequence.<span style="mso-spacerun: yes;"> </span>Relative to the better-known mean, the primary advantage of the median is its much reduced outlier 
sensitivity.<span style="mso-spacerun: yes;"> </span>This post briefly describes a simple confidence interval for the median that is discussed in a paper by David Olive, available online via the following link:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;"><a href="http://www.math.siu.edu/olive/ppmedci.pdf">http://www.math.siu.edu/olive/ppmedci.pdf</a></div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">As Olive notes in his paper and I further demonstrate in this post, an advantage of his confidence interval for the median is that it provides a simple, numerical way of identifying situations where the data values deserve a careful, graphical look.<span style="mso-spacerun: yes;"> </span>In particular, he advocates comparing the traditional confidence interval for the mean with his confidence interval for the median: if these intervals are markedly different, it is worth investigating to understand why.<span style="mso-spacerun: yes;"> </span>This strategy may be viewed as a particular instance of Colin Mallows’ “compute and compare” advice, discussed at the end of Chapter 7 of <a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650">Exploring Data in Engineering, the Sciences, and Medicine</a>.<span style="mso-spacerun: yes;"> </span>The key idea here is that under “standard” working assumptions – i.e., distributional symmetry and approximate normality – the mean and the median should be approximately the same: if they are not, it probably means these working assumptions have been violated, due to outliers in the data, pronounced distributional asymmetry, or other less common phenomena like strongly multimodal data distributions or coarse quantization.<span style="mso-spacerun: yes;"> </span>In the 
increasingly common case where we have a lot of numerical variables to consider, it may be undesirable or infeasible to examine them all graphically: numerical comparisons like the one described here may be automated and used to point us to subsets of variables that we really need to look at further.<span style="mso-spacerun: yes;"> </span>In addition to describing this confidence interval estimator and illustrating it for three examples, this post also provides the <em>R</em> code to compute it.<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-Yf8FfTapsaU/T5LcFqPTD_I/AAAAAAAAAH4/kB_y9O0d3fo/s1600/OliveFig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" qda="true" src="http://2.bp.blogspot.com/-Yf8FfTapsaU/T5LcFqPTD_I/AAAAAAAAAH4/kB_y9O0d3fo/s320/OliveFig01.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">As a first example, the plot above shows the makeup flow rate dataset discussed in <em>Exploring Data</em> and available as the makeup dataset (<strong>makeup.csv</strong>) from the book's <a href="http://www.oup.com/us/companion.websites/9780195089653/rprogram/?view=usa">companion website</a>.<span style="mso-spacerun: yes;"> </span>This plot shows 2,589 successive observations of the measured flow rate of a solvent recycle stream in an industrial manufacturing process.<span style="mso-spacerun: yes;"> </span>In normal operation, this flow rate is just under 400 – in fact, the median flow rate is 393.36 – but this data record also includes measurements during time intervals when the process is either being shut down, is not running, or is being started back up, 
and during these periods the measured flow rates decrease toward zero, are approximately equal to zero, and increase from zero back to approximately 400, respectively.<span style="mso-spacerun: yes;"> </span>Because of the presence of these anomalous segments in the data, the mean value is much smaller than the median: specifically, the mean is 315.46, actually serving as a practical dividing line between the normal operation segments (i.e., those data points that lie above the mean) and the shutdown segments (i.e., those data points that lie below the mean).<span style="mso-spacerun: yes;"> </span>The dashed lines in this plot at 309.49 and 321.44 correspond to the classical 95% confidence interval for the mean, computed as described below.<span style="mso-spacerun: yes;"> </span>In contrast, the dotted lines at 391.83 and 394.88 correspond to Olive’s 95% confidence interval for the median, also described below.<span style="mso-spacerun: yes;"> </span>Before proceeding to a more detailed discussion of how these lines were determined, the three primary points to note from this figure are, first, that the two confidence intervals are very different (e.g., they do not overlap at all), second, that the mean confidence interval is much wider than that for the median in this case, and third, that the median confidence interval lies well within the range of the normal operating data, while the mean confidence interval does not.<span style="mso-spacerun: yes;"> </span>It is also worth noting that, if we simply remove the shutdown episodes from this dataset, the mean of this edited dataset is 397.7, a value that lies above the upper 95% confidence limit for the median, but only slightly so (this and other data cleaning strategies for this dataset are discussed in some detail in Chapter 7 of <em>Exploring Data</em>).</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Both the classical 
confidence interval for the mean and David Olive’s confidence interval for the median are based on the fact that these estimators are asymptotically normal: for a sufficiently large data sample, both the estimated mean and the estimated median approach the correct limits for the underlying data distribution, with a standard deviation that decreases inversely with the square root of the sample size.<span style="mso-spacerun: yes;"> Using</span> this description directly would lead to confidence intervals based on the quantiles of the Gaussian distribution, but for small to moderate-sized samples, more accurate confidence intervals are obtained by replacing these Gaussian quantiles with those for the Student’s t-distribution with the appropriate number of degrees of freedom.<span style="mso-spacerun: yes;"> </span>More specifically, for the mean, the confidence interval at a given level p is of the form:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>CI = (Mean – c<sub>p</sub> SE, Mean + c<sub>p</sub> SE),</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">where c<sub>p</sub> is the constant derived from the Gaussian or Student’s t-distribution, and SE is the standard error of the mean, equal to the usual standard deviation estimate divided by the square root of the number of data points.<span style="mso-spacerun: yes;"> </span>(For a more detailed discussion of the math behind these results, refer to either Chapter 9 of <em>Exploring Data</em> or to David Olive’s paper, available through the link given above.)<span style="mso-spacerun: yes;"> </span>For the median, Olive provides a simple estimator for the standard error, described further in the next paragraph.<span style="mso-spacerun: yes;"> </span>First, however, it is worth saying a little about the difference between the Gaussian 
and Student’s t-distribution in these results.<span style="mso-spacerun: yes;"> </span>Probably the most commonly used confidence intervals are the 95% intervals – these are the confidence intervals shown in the plot above for the makeup flow rate data – which represent the interval that should contain the true distribution mean with probability at least 95%.<span style="mso-spacerun: yes;"> </span>In the Gaussian case, the constant c<sub>p</sub> for the 95% confidence interval is approximately 1.96, while for the Student’s t-distribution, this number depends on the degrees of freedom parameter.<span style="mso-spacerun: yes;"> </span>In the case of the mean, the degrees of freedom is one less than the sample size, while for the median confidence intervals described below, this number is typically much smaller.<span style="mso-spacerun: yes;"> </span>The difference between these distributions is that the c<sub>p</sub> parameter decreases from a very large value for few degrees of freedom – e.g., the 95% parameter value is 12.71 for a single degree of freedom – to the Gaussian value (e.g., 1.96 for the 95% case) in the limit of infinite degrees of freedom.<span style="mso-spacerun: yes;"> </span>Thus, using Student’s t-distribution instead of the Gaussian distribution results in wider confidence intervals, wider by the ratio of the Student’s t value for c<sub>p</sub> to the Gaussian value.<span style="mso-spacerun: yes;"> </span>The plot below shows this ratio for the 95% parameter c<sub>p</sub> as the degrees of freedom parameter varies between 5 and 200, with the dashed line corresponding to the Gaussian limit, where this ratio is equal to 1.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-6bAp74V8OyI/T5L3k1Uy1JI/AAAAAAAAAIA/7rBuaVGCj-k/s1600/SizeEffectRatioPlot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" 
height="319" qda="true" src="http://3.bp.blogspot.com/-6bAp74V8OyI/T5L3k1Uy1JI/AAAAAAAAAIA/7rBuaVGCj-k/s320/SizeEffectRatioPlot.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The general structure of Olive’s confidence interval for the median is exactly analogous to that for the mean given above:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>CI = (Median – c<sub>p</sub> SE, Median + c<sub>p</sub> SE)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The key result of Olive’s paper is a simple estimator for the standard error SE, based on order statistics (i.e., rank-ordered data values like the minimum, median, and maximum).<span style="mso-spacerun: yes;"> </span>Instead of describing these results mathematically, I have included an <i style="mso-bidi-font-style: normal;">R</i> procedure that computes the median, Olive’s standard error, the corresponding confidence intervals, and the classical results for the mean (again, for the mathematical details, refer to Olive’s paper; for a more detailed discussion of order statistics, refer to Chapter 6 of <em>Exploring Data</em>).<span style="mso-spacerun: yes;"> </span>Specifically, the following <i style="mso-bidi-font-style: normal;">R</i> procedure is called with a vector y of numerical data values, and the default level of the resulting confidence interval is 95%, although this level can be changed by specifying an alternative value of alpha (this is 1 minus the confidence level, so alpha is 0.05 for the 95% case, 0.01 for 99%, etc.).</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br 
/></div><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;">DOliveCIproc <- function(y, alpha = 0.05){</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#<span style="mso-spacerun: yes;"> </span>This procedure implements David Olive's simple</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#<span style="mso-spacerun: yes;"> </span>median confidence interval, along with the standard</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#<span style="mso-spacerun: yes;"> </span>confidence interval for the mean, for comparison</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#<span style="mso-spacerun: yes;"> </span>First, compute the median</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>n = length(y)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>ysort = sort(y)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>nhalf = floor(n/2)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>if (2*nhalf < n){</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#<span style="mso-spacerun: yes;"> </span>n odd</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>med = ysort[nhalf + 1]</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>}</div><div 
class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>else{</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span># n even</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>med = (ysort[nhalf] + ysort[nhalf+1])/2</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>}</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#<span style="mso-spacerun: yes;"> </span>Next, compute Olive’s standard error for the median</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>Ln = nhalf - ceiling(sqrt(n/4))</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>Un = n - Ln</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>SE = 0.5*(ysort[Un] - ysort[Ln+1])</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#<span style="mso-spacerun: yes;"> </span>Compute the confidence interval based on Student’s t-distribution</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#<span style="mso-spacerun: yes;"> </span>The degrees of freedom parameter p is discussed in Olive’s paper</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>p = Un - Ln - 1</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span 
style="mso-spacerun: yes;"> </span>t = qt(p = 1 - alpha/2, df = p)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>medLCI = med - t * SE</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>medUCI = med + t * SE</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#<span style="mso-spacerun: yes;"> </span>Next, compute the mean and its classical confidence interval</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>mu = mean(y)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>SEmu = sd(y)/sqrt(n)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>tmu = qt(p = 1 - alpha/2, df = n-1)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>muLCI = mu - tmu * SEmu</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>muUCI = mu + tmu * SEmu</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#<span style="mso-spacerun: yes;"> </span>Finally, return a data frame with all of the results computed here</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>OutFrame = data.frame(Median = med, LCI = medLCI, UCI = medUCI, </div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>Mean = mu, 
MeanLCI = muLCI, MeanUCI = muUCI,</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>N = n, dof = p, tmedian = t, tmean = tmu,</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>SEmedian = SE, SEmean = SEmu)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>OutFrame</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">}</div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Briefly, this procedure performs the following computations.<span style="mso-spacerun: yes;"> </span>The first portion of the code computes the median, defined as the middle element of the rank-ordered list of samples if the number of samples n is odd, and the average of the two middle samples if n is even.<span style="mso-spacerun: yes;"> </span>Note that the even/odd character of n is determined by using the <b style="mso-bidi-font-weight: normal;">floor</b> function in <i style="mso-bidi-font-style: normal;">R:</i> floor(n/2) is the largest integer that does not exceed n/2.<span style="mso-spacerun: yes;"> </span>Thus, if n is odd, the <b style="mso-bidi-font-weight: normal;">floor</b> function rounds n/2 down to its integer part, so the product 2 * floor(n/2) is less than n, while if n is even, floor(n/2) is exactly equal to n/2, so this product is equal to n.<span style="mso-spacerun: yes;"> </span>In addition, both the <b style="mso-bidi-font-weight: normal;">floor</b> function and its opposite function <b style="mso-bidi-font-weight: normal;">ceiling</b> are needed to compute the value Ln used in computing Olive’s standard error for the median.<span style="mso-spacerun: yes;"> </span>The c<sub>p</sub> values correspond to the parameters t and tmu that appear in this function, computed from the built-in R function <strong>qt</strong> (which returns quantiles of the 
t-distribution).<span style="mso-spacerun: yes;"> </span>Note that for the median, the degrees of freedom supplied to this function is p, which tends to be much smaller than the degrees of freedom value n-1 for the mean confidence interval computed in the latter part of this function.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">As a specific illustration of the results generated by this procedure, applying it to the makeup flow rate data sequence yields:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">> DOliveCIproc(makeupflow)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>Median<span style="mso-spacerun: yes;"> </span>LCI<span style="mso-spacerun: yes;"> </span>UCI<span style="mso-spacerun: yes;"> </span>Mean<span style="mso-spacerun: yes;"> </span>MeanLCI<span style="mso-spacerun: yes;"> </span>MeanUCI<span style="mso-spacerun: yes;"> </span>N dof<span style="mso-spacerun: yes;"> </span>tmedian</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">1 393.3586 391.8338 394.8834 315.4609 309.4857 321.4361 2589<span style="mso-spacerun: yes;"> </span>52 2.006647</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>tmean SEmedian<span style="mso-spacerun: yes;"> </span>SEmean</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">1 1.960881<span style="mso-spacerun: yes;"> </span>0.75987 3.047188</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">These results were used to construct the confidence interval lines in the makeup flow rate plot shown above.<span style="mso-spacerun: yes;"> </span>In addition, note that these results also illustrate the point noted in the preceding discussion about 
the degrees of freedom used in constructing the Student’s t-based confidence intervals.<span style="mso-spacerun: yes;"> </span>For the mean, the degrees of freedom is N-1, which is 2588 for this example, meaning that there is essentially no difference in this case between these confidence intervals and those based on the Gaussian limiting distribution.<span style="mso-spacerun: yes;"> </span>In contrast, for the median, the degrees of freedom is only 52, giving a c<sub>p</sub> value that is about 2.5% larger than the corresponding Gaussian case; for the next example, the degrees of freedom is only 16, making this parameter about 8% larger than the Gaussian limit.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-ffH0dHWT1SU/T5L44DXg6OI/AAAAAAAAAII/kiTL9LpRSng/s1600/OliveFig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" qda="true" src="http://2.bp.blogspot.com/-ffH0dHWT1SU/T5L44DXg6OI/AAAAAAAAAII/kiTL9LpRSng/s320/OliveFig02.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">One of the points I discussed in my last post was the instability of the median relative to the mean, a point I illustrated with the plot shown above.<span style="mso-spacerun: yes;"> </span>This is a simulation-based dataset consisting of three parts: the first 100 points are narrowly distributed around the value +1, the 101<sup>st</sup> point is exactly zero, and the last 100 points are narrowly distributed around the value -1.<span style="mso-spacerun: yes;"> </span>As I noted last time, removing two points from either the first group or the last group can profoundly alter the median, while having very little effect on the mean.<span style="mso-spacerun: yes;"> </span>The 
figure shown above includes, in addition to the data values, the 95% confidence intervals for both the mean (the dotted lines in the center of the plot) and the median (the heavy dashed lines at the top and bottom of the plot).<span style="mso-spacerun: yes;"> </span>Here, the fact that the median confidence interval is enormously wider (by almost a factor of 13) than the mean confidence interval gives an indication of the instability of the median.<span style="mso-spacerun: yes;"> </span>In fact, the data distribution in this example is strongly bimodal, corresponding to a case where order statistic-based estimators like the median and Olive’s standard error for it perform poorly, a point discussed in Chapter 7 of <em>Exploring Data.</em></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-58FRLcbZTd0/T5L5MJ-D99I/AAAAAAAAAIQ/-Kz26DHi3I4/s1600/OliveFig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" qda="true" src="http://3.bp.blogspot.com/-58FRLcbZTd0/T5L5MJ-D99I/AAAAAAAAAIQ/-Kz26DHi3I4/s320/OliveFig03.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">One of the other important cases where estimators based on order statistics can perform poorly is that of coarsely quantized data, such as temperatures recorded only to the nearest tenth of a degree.<span style="mso-spacerun: yes;"> </span>The difficulty with these cases is that coarse quantization profoundly changes the nature of the data distribution.<span style="mso-spacerun: yes;"> </span>Specifically, it is a standard result in statistics that the probability of any two samples drawn from a continuous distribution having exactly the same value is zero, but this is no longer true for discrete 
distributions (e.g., count data), and coarse quantization introduces an element of discreteness into the data distribution.<span style="mso-spacerun: yes;"> </span>The above figure illustrates this point for a simple simulation-based example.<span style="mso-spacerun: yes;"> </span>The upper left plot shows a random sample of size 200 drawn from a zero-mean, unit-variance Gaussian distribution, and the upper right plot shows the effects of quantizing this sample, rounding it to the nearest half-integer value.<span style="mso-spacerun: yes;"> </span>The lower two plots are normal quantile-quantile plots generated by the <i style="mso-bidi-font-style: normal;">R</i> command <b style="mso-bidi-font-weight: normal;">qqPlot</b> from the <b style="mso-bidi-font-weight: normal;">car</b> package: in the lower left plot, almost all of the points fall within the 95% confidence interval around the normal reference line for this plot, while many of the points fall somewhat outside these confidence limits in the plot shown in the lower right.<span style="mso-spacerun: yes;"> </span>The greatest difference, however, is in the “staircase” appearance of this lower right plot, reflecting the effects of the coarse quantization on this data sample: each “step” corresponds to a group of samples that have exactly the same value.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The influence of this quantization on Olive’s confidence interval for the median is profound: for the original Gaussian data sequence, the 95% confidence interval for the median is approximately (-0.222,0.124), compared with (-0.174,0.095) for the mean.<span style="mso-spacerun: yes;"> </span>These results are consistent with our expectations: since the mean is the best possible location estimator for Gaussian data, it should give the narrower confidence interval, and it does.<span style="mso-spacerun: yes;"> </span>For the quantized case, 
the 95% confidence interval for the mean is (-0.194, 0.079), fairly similar to that for the original data sequence, but the confidence interval for the median reduces to the single value zero.<span style="mso-spacerun: yes;"> </span>This result represents an <i style="mso-bidi-font-style: normal;">implosion</i> of Olive’s standard error estimator for the median, exactly analogous to the behavior of the MADM scale estimate that I have discussed previously when a majority of the data values (i.e., more than 50% of them) are identical.<span style="mso-spacerun: yes;"> </span>Here, the situation is more serious, since the MADM scale estimate does not implode for this example: the MADM scale for the original data sequence is 0.938, versus 0.741 for the quantized sequence.<span style="mso-spacerun: yes;"> </span>The reason Olive’s standard error estimator is more prone to implosion in the face of coarse quantization is that it is based on a small subset of the original data sample.<span style="mso-spacerun: yes;"> </span>In particular, the size of the subsample on which this estimator is based is p, the degrees of freedom for the t-distribution used in constructing the corresponding confidence interval, and this number is approximately the square root of the sample size.<span style="mso-spacerun: yes;"> </span>Thus, for a sample of size 200 like the example considered here, MADM scale implosion requires just over half the sample to have the same value – 101 data points in this case – whereas Olive’s standard error estimator for the median can implode if 16 or more samples have the same value, and this is exactly what happens here: the median value is zero, and this value occurs 39 times in the quantized data sequence.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">David Olive’s confidence interval for the median is easily computed and represents a useful adjunct to the median as a characterization of 
numerical variables.<span style="mso-spacerun: yes;"> </span>As Olive advises, there is considerable advantage in computing and comparing both his median confidence interval and the corresponding standard confidence interval around the mean.<span style="mso-spacerun: yes;"> </span>Although in the summary of his paper, Olive only mentions outliers as a potential cause of substantial differences between these two confidence intervals, this post has illustrated that disagreements can also arise from other causes, including light-tailed, bimodal, or coarsely quantized data, much like the situation with the MADM scale estimate versus the standard deviation.<span style="mso-spacerun: yes;"> </span>In fact, as the last example discussed here illustrates, Olive’s standard error estimator for the median and the confidence intervals based on it can implode – exactly like the MADM scale estimate – in the face of coarsely quantized data.<span style="mso-spacerun: yes;"> </span>Moreover, the implosion problem for Olive’s median standard error estimator is potentially more severe, again as illustrated in the previous example.<span style="mso-spacerun: yes;"> </span>Finally, it is worth noting that Olive’s paper also discusses confidence intervals for trimmed means.</div>Ron Pearson (aka TheNoodleDoodler)http://www.blogger.com/profile/15693640298594791682noreply@blogger.com0tag:blogger.com,1999:blog-9179325420174899779.post-62894266654358476302012-03-03T15:25:00.000-08:002012-03-03T15:25:18.627-08:00Gastwirth’s location estimator<div class="MsoNormal" style="margin: 0in 0in 0pt;">The problem of outliers – data points that are substantially inconsistent with the majority of the other points in a dataset – arises frequently in the analysis of numerical data.<span style="mso-spacerun: yes;"> </span>The practical importance of outliers lies in the fact that even a few of these points can badly distort the results of an otherwise reasonable data analysis.<span style="mso-spacerun: 
yes;"> </span>This outlier-sensitivity problem is often particularly acute for classical data characterizations and analysis methods like means, standard deviations, and linear regression analysis.<span style="mso-spacerun: yes;"> </span>As a consequence, a range of outlier-resistant methods have been developed for many different applications, and new methods continue to be developed.<span style="mso-spacerun: yes;"> </span>For example, the <em>R</em> package <strong>robustbase</strong> that I have discussed in previous posts includes outlier-resistant methods for estimating location (i.e., outlier-resistant alternatives to the mean), estimating scale (outlier-resistant alternatives to the standard deviation), quantifying asymmetry (outlier-resistant alternatives to the skewness), and fitting regression models.<span style="mso-spacerun: yes;"> </span>In <a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650">Exploring Data in Engineering, the Sciences, and Medicine</a>, I discuss a number of outlier-resistant methods for addressing some of these problems, including <em>Gastwirth’s location estimator</em>, an alternative to the mean that is the subject of this post.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The mean is the best-known location estimator, and it gives a useful assessment of the “typical” value of any numerical sequence that is reasonably symmetrically distributed and free of outliers.<span style="mso-spacerun: yes;"> </span>The outlier-sensitivity of the mean is severe, however, which motivates the use of outlier-resistant alternatives like the median.<span style="mso-spacerun: yes;"> </span>While the median is almost as well-known as the mean and extremely outlier-resistant, it can behave unexpectedly (i.e., “badly”) as a result of its non-smooth character.<span style="mso-spacerun: yes;"> </span>This point is illustrated in Fig. 
7.23 in <em>Exploring Data</em>, identical in character to the figure shown below (this figure is slightly different because it uses a different seed to generate the random numbers on which it is based).<span style="mso-spacerun: yes;"> </span>Specifically, this plot shows a sequence of 201 data points, constructed as follows.<span style="mso-spacerun: yes;"> </span>The first 100 points are normally distributed with mean 1 and standard deviation 0.1, the 101<sup>st</sup> point is equal to zero, and points 102 through 201 are normally distributed with mean -1 and standard deviation 0.1.<span style="mso-spacerun: yes;"> </span>Small changes in this dataset in the specific form of deleting points can result in very large changes in the computed median.<span style="mso-spacerun: yes;"> </span>Specifically, in this example, the first 100 points lie between 0.768 and 1.185 and the last 100 points lie between -0.787 and -1.282; because the central data point lies between these two equal-sized groups, it defines the median, which is 0.<span style="mso-spacerun: yes;"> </span>The mean is quite close to this value, at -0.004, but the situation changes dramatically if we omit either the first two or the last two points from this data sequence.<span style="mso-spacerun: yes;"> </span>Specifically, the median value computed from points 1 through 199 is 0.768, while that computed from points 3 through 201 is -0.787.<span style="mso-spacerun: yes;"> </span>In contrast, the mean values for these two modified sequences are 0.006 and -0.014.<span style="mso-spacerun: yes;"> </span>Thus, although the median is much less sensitive than the mean to contamination from outliers, it is extremely sensitive to the 1% change made in this example for this particular dataset.<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><div class="separator" style="clear: both; text-align: 
center;"><a href="http://1.bp.blogspot.com/-nEk7MSvKhcg/T1KgWJGB5xI/AAAAAAAAAHQ/CnxWo2AIR9U/s1600/GastFig00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" src="http://1.bp.blogspot.com/-nEk7MSvKhcg/T1KgWJGB5xI/AAAAAAAAAHQ/CnxWo2AIR9U/s320/GastFig00.png" uda="true" width="320" /></a></div></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The fact that the median is not “universally the best location estimator” provides a practical motivation for examining alternatives that are intermediate in behavior between the very smooth but very outlier-sensitive mean and the very outlier-insensitive but very non-smooth median.<span style="mso-spacerun: yes;"> </span>Some of these alternatives were examined in detail in the book <em>Robust Estimates of Location: Survey and Advances</em>, by D.F. Andrews, P.J. Bickel, F.R. Hampel, P.J. Huber, W.H. Rogers, and J.W. 
Tukey, published by Princeton University Press in 1972 (according to the publisher's website, this book is out of print, but used copies are available through distributors like Amazon or Barnes and Noble).<span style="mso-spacerun: yes;"> </span>The book summarizes the results of a year-long study of 68 different location estimators, including both the mean and the median.<span style="mso-spacerun: yes;"> </span>The fundamental criteria for inclusion in this study were, first, that the estimators had to be computable from any given sequence of real numbers, and second, that they had to be both location- and scale-invariant.<span style="mso-spacerun: yes;"> </span>Specifically, if a given data sequence <i style="mso-bidi-font-style: normal;">{x<sub>k</sub>}</i> yielded a result <i style="mso-bidi-font-style: normal;">m</i>, the scaled and shifted data sequence <i style="mso-bidi-font-style: normal;">{Ax<sub>k</sub> + b}</i> should yield the result <i style="mso-bidi-font-style: normal;">Am + b</i>, for any numbers <i style="mso-bidi-font-style: normal;">A</i> and <i style="mso-bidi-font-style: normal;">b</i>.<span style="mso-spacerun: yes;"> </span>The study was co-authored by six statistical researchers with differing opinions and points of view, but two of the authors – D.F. Andrews and F.R. 
Hampel – included the Gastwirth estimator (described in detail below) in their list of favorites.<span style="mso-spacerun: yes;"> </span>For example, Hampel characterized this estimator as one of a small list of those that were “never bad at the distributions considered.”<span style="mso-spacerun: yes;"> </span>Also, in contrast to many of the location estimators considered in the study, Gastwirth’s estimator does not require iterative computations, making it simpler to implement.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Specifically, Gastwirth’s location estimator is a weighted sum of three order statistics.<span style="mso-spacerun: yes;"> </span>That is, to compute this estimator, we first sort the data sequence in ascending order.<span style="mso-spacerun: yes;"> </span>Then, we take the values that are one-third of the way up this sequence (the 0.33 quantile), half way up the sequence (i.e., the median, or 0.50 quantile), and two-thirds of the way up the sequence (the 0.67 quantile).<span style="mso-spacerun: yes;"> </span>Given these three values, we then form the weighted average, giving the central (median) value a weight of 40% and the two extreme values each a weight of 30%.<span style="mso-spacerun: yes;"> </span>This is extremely easy to do in R, with the following code:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;">Gastwirth <- function(x,...){</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>ordstats = quantile(x, probs=c(1/3,1/2,2/3),...)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>wts = c(0.3,0.4,0.3)</div><div 
class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>sum(wts*ordstats)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">}</div></blockquote></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The key part of this code is the first line, which computes the required order statistics (i.e., the quantiles 1/3, 1/2, and 2/3) using the built-in <b style="mso-bidi-font-weight: normal;">quantile</b> function.<span style="mso-spacerun: yes;"> </span>The first argument passed to this function is <b style="mso-bidi-font-weight: normal;">x</b>, the vector of data values to be characterized, and the second argument (<b style="mso-bidi-font-weight: normal;">probs</b>) defines the specific quantiles we wish to compute.<span style="mso-spacerun: yes;"> </span>The ellipsis (...) in the Gastwirth function’s argument list is passed through to the <b style="mso-bidi-font-weight: normal;">quantile</b> function; several parameters are possible (type “help(quantile)” in your <em>R</em> session for details), but one of the most useful is <b style="mso-bidi-font-weight: normal;">na.rm</b>, a logical variable that specifies how missing data values are to be handled.<span style="mso-spacerun: yes;"> </span>The default is “FALSE” and, with this setting, the <b style="mso-bidi-font-weight: normal;">quantile</b> function (and therefore the <b style="mso-bidi-font-weight: normal;">Gastwirth</b> procedure) stops with an error if any values of <b style="mso-bidi-font-weight: normal;">x</b> are missing; the alternative “TRUE” computes the Gastwirth estimator from the non-missing values, giving a numerical result.<span style="mso-spacerun: yes;"> </span>The three-element vector <b style="mso-bidi-font-weight: normal;">wts</b> holds the quantile weights that define the Gastwirth estimator, which the final <b style="mso-bidi-font-weight: normal;">sum</b> statement 
computes.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">For the data example considered above, the Gastwirth estimator yields the location estimate -0.001 for the complete dataset, 0.308 for points 1 to 199 (vs. 0.768 for the median), and -0.317 for points 3 to 201 (vs. -0.787 for the median).<span style="mso-spacerun: yes;"> </span>Thus, while it does not perform nearly as well as the mean for this example, it performs substantially better than the median.<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-lX85Av_ZRvI/T1KjMM0rX4I/AAAAAAAAAHY/dk5cvsDf_nM/s1600/GastFig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" src="http://3.bp.blogspot.com/-lX85Av_ZRvI/T1KjMM0rX4I/AAAAAAAAAHY/dk5cvsDf_nM/s320/GastFig01.png" uda="true" width="320" /></a></div></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">For the infinite-variance Cauchy distribution that I have discussed in several previous posts, the Gastwirth estimator performs similarly to the median, yielding a useful estimate of the center of the data distribution, in contrast to the mean, which doesn’t actually exist for this distribution (that is, the first moment does not exist for the Cauchy distribution).<span style="mso-spacerun: yes;"> </span>Still, the distribution is symmetric about zero, so the median is well-defined, as is the Gastwirth estimator, and both should be zero for this distribution.<span style="mso-spacerun: yes;"> </span>The above figure shows the results of applying these three estimators – the mean, the median, and Gastwirth’s estimator – to 1,000 independent random samples drawn from the Cauchy distribution.<span style="mso-spacerun: yes;"> 
</span>Specifically, this figure gives a boxplot summary of these results, truncated to the range from -3 to 3 to show the range of variation of the median and Gastwirth estimator (without this restriction, the boxplot comparison would be fairly non-informative, since the mean values range from approximately -161 to 27,793, reflecting the fact that the mean is not a consistent location estimator for the Cauchy distribution).<span style="mso-spacerun: yes;"> </span>To generate these results, the <b style="mso-bidi-font-weight: normal;">replicate</b> function in R was used, followed by the <b style="mso-bidi-font-weight: normal;">apply</b> function, as follows:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;">            </span>RandomSampleFrame = replicate(1000, rt(n=200,df=1))<br />            BoxPlotVector = apply(RandomSampleFrame, MARGIN=2, Gastwirth)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The <b style="mso-bidi-font-weight: normal;">replicate</b> function creates a matrix with the number of columns specified by the first argument (here, 1000), and each column generated by evaluating the R expression that appears as the second argument.<span style="mso-spacerun: yes;"> </span>In this case, this second argument is a call to the function <b style="mso-bidi-font-weight: normal;">rt</b>, which generates a sequence of <b style="mso-bidi-font-weight: normal;">n</b> statistically independent random numbers drawn from the Student’s <i style="mso-bidi-font-style: normal;">t</i>-distribution with the number of degrees of freedom specified by the <b style="mso-bidi-font-weight: normal;">df</b> argument (here, this is 1, corresponding to the fact that the Cauchy distribution is the Student’s <i style="mso-bidi-font-style: normal;">t</i>-distribution with 1 degree of freedom). 
<span style="mso-spacerun: yes;"> </span>Thus, despite its name, <b style="mso-bidi-font-weight: normal;">RandomSampleFrame</b> is a numeric matrix (not a data frame) with 200 rows and 1,000 columns, each of which may be regarded as a Cauchy-distributed random sample.<span style="mso-spacerun: yes;"> </span>The <b style="mso-bidi-font-weight: normal;">apply</b> function applies the function specified in the third argument (here, the <b style="mso-bidi-font-weight: normal;">Gastwirth</b> procedure listed above) to the columns (<b style="mso-bidi-font-weight: normal;">MARGIN</b>=2 specifies columns; <b style="mso-bidi-font-weight: normal;">MARGIN</b>=1 would specify rows) of the matrix specified in the first argument.<span style="mso-spacerun: yes;"> </span>The result is <b style="mso-bidi-font-weight: normal;">BoxPlotVector</b>, a vector of 1,000 Gastwirth estimates, one for each random sample generated by the <strong>replicate</strong> function above.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-ybrBRTX9ypA/T1Kk9T-_9MI/AAAAAAAAAHg/wJHe00-2PkU/s1600/GastFig02a.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" src="http://3.bp.blogspot.com/-ybrBRTX9ypA/T1Kk9T-_9MI/AAAAAAAAAHg/wJHe00-2PkU/s320/GastFig02a.png" uda="true" width="320" /></a></div></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">At the other extreme, in the limit of infinite degrees of freedom, the Student’s <em>t</em>-distribution approaches a Gaussian limit.<span style="mso-spacerun: yes;"> </span>The figure above shows the same comparison as before, except for the Gaussian distribution instead of the Cauchy distribution.<span style="mso-spacerun: yes;"> </span>Here, the mean is the best possible location estimator and it clearly performs the best, but the point of this example is that Gastwirth’s location 
estimator performs better than the median.<span style="mso-spacerun: yes;"> </span>In particular, the interquartile distance (i.e., the width of the “box” in each boxplot) for the mean is 0.094, it is 0.113 for the median, and it is 0.106 for Gastwirth’s estimator.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-7SrgfuHe0K4/T1KlXJD1dfI/AAAAAAAAAHo/3_BtnzblqD0/s1600/ArcsinPlot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" src="http://4.bp.blogspot.com/-7SrgfuHe0K4/T1KlXJD1dfI/AAAAAAAAAHo/3_BtnzblqD0/s320/ArcsinPlot.png" uda="true" width="320" /></a></div></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Another application area where very robust estimators like the median often perform poorly is that of bimodal distributions like the <i style="mso-bidi-font-style: normal;">arc-sine distribution</i> whose density is plotted above.<span style="mso-spacerun: yes;"> </span>This distribution is a symmetric beta distribution, with both shape parameters equal to 0.5 (see <em>Exploring Data</em>, Sec. 
4.5.1 for further discussion of this distribution).<span style="mso-spacerun: yes;"> </span>Because it is symmetrically distributed on the interval from 0 to 1, the location parameter for this distribution is 0.5 and all three of the location estimators considered here yield values that are accurate on average, but with different levels of precision.<span style="mso-spacerun: yes;"> </span>This point is shown in the figure below, which again provides boxplot comparisons for 1,000 random samples drawn from this distribution, each of length 200, for the mean, median, and Gastwirth location estimators.<span style="mso-spacerun: yes;"> </span>As in the Gaussian case considered above, the mean performs best here, with an interquartile distance of 0.035, the median performs worst, with an interquartile distance of 0.077, and Gastwirth’s estimator is intermediate, with an interquartile distance of 0.060.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-NiS3T5T2hEo/T1KlqrDbOZI/AAAAAAAAAHw/J_9cjl80_aM/s1600/GastFig03a.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" src="http://2.bp.blogspot.com/-NiS3T5T2hEo/T1KlqrDbOZI/AAAAAAAAAHw/J_9cjl80_aM/s320/GastFig03a.png" uda="true" width="320" /></a></div></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The point of this post has been to illustrate a location estimator with properties that are intermediate between those of the much better-known mean and median.<span style="mso-spacerun: yes;"> </span>In particular, the results presented here for the Cauchy distribution show that Gastwirth’s estimator is intermediate in outlier sensitivity between the disastrously sensitive mean and the maximally insensitive median.<span style="mso-spacerun: yes;"> </span>Similarly, the first example demonstrated that 
Gastwirth’s estimator is also intermediate in smoothness between the maximally smooth mean and the discontinuous median: the sensitivity of Gastwirth’s estimator to data editing in “swing-vote” examples like the one presented here is still undesirably large, but much better than that of the median.<span style="mso-spacerun: yes;"> </span>Finally, the results presented here for the Gaussian and arc-sine distributions show that Gastwirth’s estimator is better-behaved for these distributions than the median.<span style="mso-spacerun: yes;"> </span>Because it is extremely easy to implement in <em>R</em>, Gastwirth’s estimator seems worth knowing about.</div>Ron Pearson (aka TheNoodleDoodler)http://www.blogger.com/profile/15693640298594791682noreply@blogger.com3tag:blogger.com,1999:blog-9179325420174899779.post-70911523361836659332012-02-04T16:06:00.000-08:002012-02-04T16:06:09.600-08:00Measuring associations between non-numeric variables<div class="MsoNormal" style="margin: 0in 0in 0pt;">It is often useful to know how strongly or weakly two variables are associated: do they vary together or are they essentially unrelated?<span style="mso-spacerun: yes;"> </span>In the case of numerical variables, the best-known measure of association is the product-moment correlation coefficient introduced by Karl Pearson at the end of the nineteenth century.<span style="mso-spacerun: yes;"> </span>For variables that are ordered but not necessarily numeric (e.g., Likert scale responses with levels like “strongly agree,” “agree,” “neither agree nor disagree,” “disagree” and “strongly disagree”), association can be measured in terms of the Spearman rank correlation coefficient.<span style="mso-spacerun: yes;"> </span>Both of these measures are discussed in detail in Chapter 10 of <a href="http://www.amazon.com/s?ie=UTF8&rh=n%3A283155%2Ck%3Aexploring%20data%20in%20engineering.%20the%20sciences.%20and%20medicine&page=1">Exploring Data in Engineering, the Sciences, and Medicine</a>.<span 
style="mso-spacerun: yes;"> </span>For unordered categorical variables (e.g., country, state, county, tumor type, or literary genre), neither of these measures is applicable, but alternatives do exist.<span style="mso-spacerun: yes;"> </span>One of these is Goodman and Kruskal’s tau measure, discussed very briefly in <em>Exploring Data</em> (Chapter 10, page 492).<span style="mso-spacerun: yes;"> </span>The point of this post is to give a more detailed discussion of this association measure, illustrating some of its advantages, disadvantages, and peculiarities.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">A more complete discussion of Goodman and Kruskal’s tau measure is given in Agresti’s book <a href="http://www.amazon.com/s/ref=nb_sb_ss_i_1_8?url=search-alias%3Dstripbooks&field-keywords=agresti+categorical+data+analysis&sprefix=agresti+%2Cstripbooks%2C428">Categorical Data Analysis</a>, on pages 68 and 69.<span style="mso-spacerun: yes;"> </span>It belongs to a family of categorical association measures of the general form:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;">            </span>a(x,y) = [V(y) - E{V(y|x)}]/V(y)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">where V(y) is a measure of the overall (i.e., marginal) variability of y and E{V(y|x)} is the expected value of the conditional variability V(y|x) of y given a fixed value of x, where the expectation is taken over all possible values of x.<span style="mso-spacerun: yes;"> </span>These variability measures can be defined in different ways, leading to different association measures, including Goodman and Kruskal’s tau as a special case.<span style="mso-spacerun: yes;"> </span>For Goodman and Kruskal’s tau, V(y) is one minus the sum of the squared marginal probabilities of the levels of y, which is the form used in the R function given below.<span style="mso-spacerun: yes;"> </span>Agresti’s book gives detailed expressions for several of these 
variability measures, including the one on which Goodman and Kruskal’s tau is based, and an alternative expression for the overall association measure a(x,y) is given in Eq. (10.178) on page 492 of <em>Exploring Data</em>.<span style="mso-spacerun: yes;"> </span>This association measure does not appear to be available in any current <em>R</em> package, but it is easily implemented as the following function:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><blockquote class="tr_bq"><div class="MsoNormal" style="margin: 0in 0in 0pt;">GKtau <- function(x,y){</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#<span style="mso-spacerun: yes;"> </span>First, compute the IxJ contingency table between x and y</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>Nij = table(x,y,useNA="ifany")</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#<span style="mso-spacerun: yes;"> </span>Next, convert this table into a joint probability estimate</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>PIij = Nij/sum(Nij)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#<span style="mso-spacerun: yes;"> </span>Compute the marginal probability estimates</div><div class="MsoNormal" 
style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>PIiPlus = apply(PIij,MARGIN=1,sum)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>PIPlusj = apply(PIij,MARGIN=2,sum)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#<span style="mso-spacerun: yes;"> </span>Compute the marginal variation of y</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>Vy = 1 - sum(PIPlusj^2)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#<span style="mso-spacerun: yes;"> </span>Compute the expected conditional variation of y given x</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>InnerSum = apply(PIij^2,MARGIN=1,sum)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>VyBarx = 1 - sum(InnerSum/PIiPlus)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#<span style="mso-spacerun: yes;"> </span>Compute and return Goodman and Kruskal's tau measure</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>tau = 
(Vy - VyBarx)/Vy</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>tau</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">}</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;">An important feature of this procedure is that it allows missing values in either of the variables x or y, treating “missing” as an additional level.<span style="mso-spacerun: yes;"> </span>In practice, this is sometimes very important since missing values in one variable may be strongly associated with either missing values in another variable or specific non-missing levels of that variable.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">An important characteristic of Goodman and Kruskal’s tau measure is its asymmetry: because the variables x and y enter this expression differently, the value of a(y,x) is <em>not</em> the same as the value of a(x,y), in general.<span style="mso-spacerun: yes;"> </span>This stands in marked contrast to either the product-moment correlation coefficient or the Spearman rank correlation coefficient, which are both symmetric, giving the same association between x and y as that between y and x.<span style="mso-spacerun: yes;"> </span>The fundamental reason for the asymmetry of the general class of measures defined above is that they quantify the extent to which the variable x is useful in predicting y, which may be very different than the extent to which the variable y is useful in predicting x.<span style="mso-spacerun: yes;"> </span>Specifically, if x and y are statistically independent, then E{V(y|x)} = V(y) – i.e., knowing x does not help at all in predicting y – and this implies that a(x,y) = 0.<span style="mso-spacerun: yes;"> </span>At the other extreme, if y is perfectly predictable from x, then E{V(y|x)} = 0, which implies that a(x,y) = 1.<span 
style="mso-spacerun: yes;"> </span>As the examples presented next demonstrate, it is possible that y is extremely predictable from x, but x is only slightly predictable from y.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Specifically, consider the sequence of 400 random numbers, uniformly distributed between 0 and 1 generated by the following R code:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>set.seed(123)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>u = runif(400)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">(Here, I have used the “set.seed” command to initialize the random number generator so repeated runs of this example will give exactly the same results.)<span style="mso-spacerun: yes;"> </span>The second sequence is obtained by quantizing the first, rounding the values of u to a single digit:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>x = round(u,digits=1)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The plot below shows the effects of this coarse quantization: values of u vary continuously from 0 to 1, but values of x are restricted to 0.0, 0.1, 0.2, … , 1.0.<span style="mso-spacerun: yes;"> </span>Although this example is simulation-based, it is important to note that this type of grouping of variables is often encountered in practice (e.g., the use of age groups instead of ages in demographic characterizations, blood pressure characterizations like “normal,” “borderline hypertensive,” etc. 
in clinical data analysis, or the recording of industrial process temperatures to the nearest 0.1 degree, in part due to measurement accuracy considerations and in part due to memory limitations of early data collection systems).<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-1yCneUgZLQE/Ty3C5dfv3II/AAAAAAAAAG4/36tSbqEgXFQ/s1600/GKtauFig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" sda="true" src="http://4.bp.blogspot.com/-1yCneUgZLQE/Ty3C5dfv3II/AAAAAAAAAG4/36tSbqEgXFQ/s320/GKtauFig01.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">In this particular case, because the variables x and u are both numeric, we could compute either the product-moment correlation coefficient or the Spearman rank correlation, obtaining the very large value of approximately 0.995 for either one, showing that these variables are strongly associated.<span style="mso-spacerun: yes;"> </span>We can also apply Goodman and Kruskal’s tau measure here, and the result is much more informative.<span style="mso-spacerun: yes;"> </span>Specifically, the value of a(u,x) is 1 in this case, correctly reflecting the fact that the grouped variable x is exactly computable from the original variable u.<span style="mso-spacerun: yes;"> </span>In contrast, the value of a(x,u) is approximately 0.025, suggesting – again correctly – that the original variable u cannot be well predicted from the grouped variable x.<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">To illustrate a case where the product-moment and rank correlation 
measures are not applicable at all, consider the following alphabetic re-coding of the variable x into an unordered categorical variable c:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;">            </span>letters = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K")</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;">            </span>c = letters[10*x+1]</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">In this case, both of the Goodman and Kruskal tau measures, a(x,c) and a(c,x), are equal to 1, reflecting the fact that these two variables are effectively identical, related via the non-numeric transformation given above.<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Being able to detect relationships like these can be extremely useful in exploratory data analysis where such relationships may be unexpected, particularly in the early stages of characterizing a dataset whose metadata – i.e., detailed descriptions of the variables included in the dataset – is absent, incomplete, ambiguous, or suspect.<span style="mso-spacerun: yes;"> </span>As a real-data illustration, consider the <strong>rent</strong> data frame from the <em>R</em> package <strong>gamlss.data</strong>, which has 1,969 rows, each corresponding to a rental property in Munich, and 9 columns, each giving a characteristic of that unit (e.g., the rent, floor space, or year of construction).<span style="mso-spacerun: yes;"> </span>Three of these variables are <em>Sp</em>, a binary variable indicating whether the location is considered above average (1) or not (0), <em>Sm</em>, another binary variable indicating whether the location is considered 
below average (1) or not (0), and <em>loc</em>, a three-level variable combining the information in these other two, taking the values 1 (below average), 2 (average), or 3 (above average).<span style="mso-spacerun: yes;"> </span>The Goodman and Kruskal tau values between all possible pairs of these three variables are:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>a(Sm,Sp) = a(Sp,Sm) = 0.037</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>a(Sm,loc) = 0.245 vs. a(loc,Sm) = 1</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>a(Sp,loc) = 0.701 vs. a(loc,Sp) = 1</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The first of these results – the symmetry of Goodman and Kruskal’s tau for the variables <em>Sm</em> and <em>Sp</em> – is a consequence of the fact that this measure is symmetric for any pair of <em>binary</em> variables.<span style="mso-spacerun: yes;"> </span>In fact, the odds ratio that I have discussed in previous posts represents a much better way of characterizing the relationship between binary variables (here, the odds ratio between <em>Sm</em> and <em>Sp</em> is zero, reflecting the fact that a location cannot be both “above average” and “below average” at the same time).<span style="mso-spacerun: yes;"> </span>The real utility of the tau measure here is that the second and third lines above show that the variables <em>Sm</em> and <em>Sp</em> are both re-groupings of the finer-grained variable <em>loc</em>.<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a 
href="http://4.bp.blogspot.com/-oRSclb8fvPE/Ty3EgV0qJ9I/AAAAAAAAAHA/gsQgEujOFxs/s1600/GKtauFig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" sda="true" src="http://4.bp.blogspot.com/-oRSclb8fvPE/Ty3EgV0qJ9I/AAAAAAAAAHA/gsQgEujOFxs/s320/GKtauFig02.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Finally, consider a more interesting exploratory application of this measure to the dataset.<span style="mso-spacerun: yes;"> </span>Computing Goodman and Kruskal’s tau measure between the location variable <em>loc</em> and all of the other variables in the dataset – beyond the cases of <em>Sm</em> and <em>Sp</em> just considered – generally yields small values for the associations in either direction.<span style="mso-spacerun: yes;"> </span>As a specific example, the association a(loc,Fl) is 0.001, suggesting that location is not a good predictor of the unit’s floor space in square meters, and although the reverse association a(Fl,loc) is larger (0.057), it is not large enough to suggest that the unit’s floor space is a particularly good predictor of its location quality.<span style="mso-spacerun: yes;"> </span>The same is true of most of the other variables in the dataset: they are neither well predicted by nor good predictors of location quality.<span style="mso-spacerun: yes;"> </span>The one glaring exception is the rent variable <em>R</em>: although the association a(loc,R) is only 0.001, the reverse association a(R,loc) is 0.907, a very large value suggesting that location quality is quite well predicted by the rent.<span style="mso-spacerun: yes;"> </span>The beanplot above shows what is happening here: because the variation in rents for all three location qualities is substantial, knowledge of the <em>loc</em> value is not sufficient to accurately predict the rent 
<em>R</em>, but these rent values do generally increase in going from below-average locations (loc = 1) to average locations (loc = 2) to above-average locations (loc = 3).<span style="mso-spacerun: yes;"> </span>For comparison, the beanplots below show why the association with floor space is so much weaker: both the mean floor space in each location quality group and the overall range of these values are quite comparable, implying that neither location quality can be well predicted from floor space nor vice versa.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-96Hzx9KHTtk/Ty3FGgKun9I/AAAAAAAAAHI/mxprlhMDTYk/s1600/GKtauFig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" sda="true" src="http://3.bp.blogspot.com/-96Hzx9KHTtk/Ty3FGgKun9I/AAAAAAAAAHI/mxprlhMDTYk/s320/GKtauFig03.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The asymmetry of Goodman and Kruskal’s tau measure is disconcerting at first because it has no counterpart in better-known measures like the product-moment correlation coefficient between numerical variables, Spearman’s rank correlation coefficient between ordinal variables, or the odds ratio between binary variables.<span style="mso-spacerun: yes;"> </span>One of the points of this post has been to demonstrate how this unusual asymmetry can be useful in practice, distinguishing between the ability of one variable x to predict another variable y, and the reverse case.</div>Ron Pearson (aka TheNoodleDoodler)http://www.blogger.com/profile/15693640298594791682noreply@blogger.com2tag:blogger.com,1999:blog-9179325420174899779.post-61250175017375564882012-01-14T11:06:00.000-08:002012-01-14T11:06:32.484-08:00Moving window filters 
and the pracma package<div class="MsoNormal" style="margin: 0in 0in 0pt;">In my last post, I discussed the Hampel filter, a useful moving window nonlinear data cleaning filter that is available in the <em>R</em> package <strong>pracma</strong>.<span style="mso-spacerun: yes;"> </span>In this post, I briefly discuss this moving window filter in a little more detail, focusing on two important practical points: the choice of the filter’s local outlier detection threshold, and the question of how to initialize moving window filters.<span style="mso-spacerun: yes;"> </span>This second point is particularly important here because the <strong>pracma</strong> package initializes the Hampel filter in a particularly appropriate way, but doesn’t do such a good job of initializing the Savitzky-Golay filter, a linear smoothing filter that is popular in physics and chemistry.<span style="mso-spacerun: yes;"> </span>Fortunately, this second difficulty is easy to fix, as I demonstrate here.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Recall from my last post that the Hampel filter is a moving window implementation of the Hampel identifier, discussed in Chapter 7 of <a href="http://www.amazon.com/s?ie=UTF8&rh=n%3A283155%2Ck%3Aexploring%20data%20in%20engineering.%20the%20sciences.%20and%20medicine&page=1">Exploring Data in Engineering, the Sciences, and Medicine</a>.<span style="mso-spacerun: yes;"> </span>In particular, this procedure – implemented as <strong>outlierMAD</strong> in the <strong>pracma</strong> package – is a nonlinear data cleaning filter that looks for local outliers in a time-series or other streaming data sequence, replacing them with a more reasonable alternative value when it finds them.<span style="mso-spacerun: yes;"> </span>Specifically, this filter may be viewed as a more effective alternative to a “local three-sigma edit rule” that would replace any data point lying more than 
three standard deviations from the mean of its neighbors with that mean value.<span style="mso-spacerun: yes;"> </span>The difficulty with this simple strategy is that both the mean and especially the standard deviation are badly distorted by the presence of outliers in the data, causing this data cleaning procedure to often fail completely in practice.<span style="mso-spacerun: yes;"> </span>The Hampel filter instead uses the median of neighboring observations as a reference value, and the MAD scale estimator as an alternative measure of distance: that is, a data point is declared an outlier and replaced if it lies more than some number <em>t </em>of MAD scale estimates from the median of its neighbors; the replacement value used in this procedure is the median.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-zeJmhgqThZk/TxHJNN-4XlI/AAAAAAAAAGQ/OxoXHvRm-3U/s1600/HampelIIfig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" kba="true" src="http://4.bp.blogspot.com/-zeJmhgqThZk/TxHJNN-4XlI/AAAAAAAAAGQ/OxoXHvRm-3U/s320/HampelIIfig01.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">More specifically, for each observation in the original data sequence, the Hampel filter constructs a moving window that includes the <em>K</em> prior points, the data point of primary interest, and the <em>K</em> subsequent data points.<span style="mso-spacerun: yes;"> </span>The reference value used for the central data point is the median of these <em>2K+1</em> successive observations, and the MAD scale estimate is computed from these same observations to serve as a measure of the “natural local spread” of the data sequence.<span style="mso-spacerun: yes;"> </span>If 
the central data point lies more than <em>t </em>MAD scale estimate values from the median, it is replaced with the median; otherwise, it is left unchanged.<span style="mso-spacerun: yes;"> </span>To illustrate the performance of this filter, the top plot above shows the sequence of 1024 successive physical property measurements from an industrial manufacturing process that I also discussed in my last post.<span style="mso-spacerun: yes;"> </span>The bottom plot in this pair shows the results of applying the Hampel filter with a window half-width parameter K=5 and a threshold value of t = 3 to this data sequence.<span style="mso-spacerun: yes;"> </span>Comparing these two plots, it is clear that the Hampel filter has removed the glaring outlier – the value zero – at observation k = 291, yielding a cleaned data sequence that varies over a much narrower (and, at least in this case, much more reasonable) range of possible values.<span style="mso-spacerun: yes;"> </span>What is less obvious is that this filter has also replaced 18 other data points with their local median reference values.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-uQht8r4-pk8/TxHJuf8hHiI/AAAAAAAAAGY/Vq7Pu7BaeRA/s1600/HampelIIfig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" kba="true" src="http://4.bp.blogspot.com/-uQht8r4-pk8/TxHJuf8hHiI/AAAAAAAAAGY/Vq7Pu7BaeRA/s320/HampelIIfig02.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The above plot shows the original data sequence, but on approximately the same range as the cleaned data sequence so that the glaring outlier at k = 291 no longer dominates the figure.<span style="mso-spacerun: yes;"> </span>The large solid 
circles represent the 18 additional points that the Hampel filter has declared to be outliers and replaced with their local median values.<span style="mso-spacerun: yes;"> </span>This plot was generated using the Hampel filter implemented in the <strong>outlierMAD</strong> command in the <strong>pracma</strong> package, which has the following syntax:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 2;"> </span>outlierMAD(x,k)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">where <em>x</em> is the data sequence to be cleaned and <em>k</em> is the half-width that defines the moving data window on which the filter is based.<span style="mso-spacerun: yes;"> </span>Here, specifying k = 5 results in an 11-point moving data window.<span style="mso-spacerun: yes;"> </span>Unfortunately, the threshold parameter <em>t</em> is hard-coded as 3 in this <strong>pracma</strong> procedure, which has the following code:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">outlierMAD <- function (x, k){</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>n <- length(x)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>y <- x</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>ind <- c()</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>L <- 1.4826</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>t0 <- 3</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>for (i in (k + 1):(n - k)) {</div><div class="MsoNormal" 
style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>x0 <- median(x[(i - k):(i + k)])</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>S0 <- L * median(abs(x[(i - k):(i + k)] - x0))</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>if (abs(x[i] - x0) > t0 * S0) {</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>y[i] <- x0</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>ind <- c(ind, i)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>}</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>}</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>list(y = y, ind = ind)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">}</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Note that it is a simple matter to create your own version of this filter, specifying the threshold (here, the variable <em>t0</em>) to have a default value of 3, but allowing the user to modify it in the function call.<span style="mso-spacerun: yes;"> </span>Specifically, the code would be:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">HampelFilter <- function (x, k, t0 = 3){</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>n <- length(x)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>y <- x</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span 
style="mso-spacerun: yes;"> </span>ind <- c()</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>L <- 1.4826</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>for (i in (k + 1):(n - k)) {</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>x0 <- median(x[(i - k):(i + k)])</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>S0 <- L * median(abs(x[(i - k):(i + k)] - x0))</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>if (abs(x[i] - x0) > t0 * S0) {</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>y[i] <- x0</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>ind <- c(ind, i)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>}</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>}</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>list(y = y, ind = ind)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">}</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The advantage of this modification is that it allows you to explore the influence of varying the threshold parameter.<span style="mso-spacerun: yes;"> </span>Note that increasing t0 makes the filter more forgiving, allowing more extreme local fluctuations to pass through the filter unmodified, while decreasing t0 makes the filter more aggressive, declaring more points to be local 
outliers and replacing them with the appropriate local median.<span style="mso-spacerun: yes;"> </span>In fact, this filter remains well-defined even for t0 = 0, where it reduces to the median filter, popular in nonlinear digital signal processing.<span style="mso-spacerun: yes;"> </span>John Tukey – the developer or co-developer of many useful things, including the fast Fourier transform (FFT) – introduced the median filter at a technical conference in 1974, and it has profoundly influenced subsequent developments in nonlinear digital filtering.<span style="mso-spacerun: yes;"> </span>It may be viewed as the most aggressive limit of the Hampel filter and, although it is quite effective in removing local outliers, it is often too aggressive in practice, introducing significant distortions into the original data sequence.<span style="mso-spacerun: yes;"> </span>This point may be seen in the plot below, which shows the results of applying the median filter (i.e., the <strong>HampelFilter</strong> procedure defined above with t0=0) to the physical property dataset.<span style="mso-spacerun: yes;"> </span>In particular, the heavy solid line in this plot shows the behavior of the first 250 points of the median filtered sequence, while the lighter dotted line shows the corresponding results for the Hampel filter with t0=3.<span style="mso-spacerun: yes;"> </span>Note the “clipped” or “blocky” appearance of the median filtered results, compared with the more irregular local variation seen in the Hampel filtered results.<span style="mso-spacerun: yes;"> </span>In many applications (e.g., fitting time-series models), the less aggressive Hampel filter gives much better overall results.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-EjtEmc-sw8w/TxHK5E_gtSI/AAAAAAAAAGg/9SldVilqjoc/s1600/HampelIIfig03.png" imageanchor="1" style="margin-left: 1em; 
margin-right: 1em;"><img border="0" height="319" kba="true" src="http://2.bp.blogspot.com/-EjtEmc-sw8w/TxHK5E_gtSI/AAAAAAAAAGg/9SldVilqjoc/s320/HampelIIfig03.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The other main issue I wanted to discuss in this post is that of initializing moving window filters.<span style="mso-spacerun: yes;"> </span>The basic structure of these filters – whether they are nonlinear types like the Hampel and median filters discussed above, or linear types like the Savitzky-Golay filter discussed briefly below – is built on a moving data window that includes a central point of interest, prior observations and subsequent observations.<span style="mso-spacerun: yes;"> </span>For a symmetric window that includes K prior and K subsequent observations, this window is not well defined for the first K or the last K observations in the data sequence.<span style="mso-spacerun: yes;"> </span>These points must be given special treatment, and a very common approach in the digital signal processing community is to extend the original sequence by appending K additional copies of the first element to the beginning of the sequence and K additional copies of the last element to the end of the sequence.<span style="mso-spacerun: yes;"> </span>The <strong>pracma</strong> implementation of the Hampel filter procedure (<strong>outlierMAD</strong>) takes an alternative approach, one that is particularly appropriate for data cleaning filters.<span style="mso-spacerun: yes;"> </span>Specifically, procedure <strong>outlierMAD</strong> simply passes the first and last K observations unmodified from the original data sequence to the filter output.<span style="mso-spacerun: yes;"> </span>This would also seem to be a reasonable option for smoothing filters like the linear Savitzky-Golay filter discussed 
next.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-irmS_ED9KG0/TxHLXwqiMpI/AAAAAAAAAGo/ccj53hOUrHM/s1600/HampelIIfig04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" kba="true" src="http://1.bp.blogspot.com/-irmS_ED9KG0/TxHLXwqiMpI/AAAAAAAAAGo/ccj53hOUrHM/s320/HampelIIfig04.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">As noted, this linear smoothing filter is popular in chemistry and physics, and it is implemented in the <strong>pracma</strong> package as procedure <strong>savgol.</strong><span style="mso-spacerun: yes;"> </span>For a more detailed discussion of this filter, refer to the treatment in the book <a href="http://www.amazon.com/Numerical-Recipes-3rd-Scientific-Computing/dp/0521880688/ref=sr_1_1?s=books&ie=UTF8&qid=1326566316&sr=1-1">Numerical Recipes</a>, which the authors of the <strong>pracma</strong> package cite for further details (Section 14.8).<span style="mso-spacerun: yes;"> </span>Here, the key point is that this filter is a linear smoother, implemented as the convolution of the input sequence with an impulse response function (i.e., a smoothing kernel) that is constructed by the <strong>savgol </strong>procedure.<span style="mso-spacerun: yes;"> </span>The above two plots show the effects of applying this filter with a total window width of 11 points (i.e., the same half-width K = 5 used with the Hampel and median filters), first to the raw physical property data sequence (upper plot), and then to the sequence after it has been cleaned by the Hampel filter (lower plot).<span style="mso-spacerun: yes;"> </span>The large downward spike at k = 291 in the upper plot reflects the impact of the glaring outlier in the 
original data sequence, illustrating the practical importance of removing these artifacts from a data sequence before applying smoothing procedures like the Savitzky-Golay filter.<span style="mso-spacerun: yes;"> </span>Both the upper and lower plots exhibit similarly large spikes at the beginning and end of the data sequence, however, and these artifacts are due to the moving window problem noted above for the first K and the last K elements of the original data sequence.<span style="mso-spacerun: yes;"> </span>In particular, the filter implementation in the <strong>savgol</strong> procedure does not apply the sequence extension procedure discussed above, and this fact is responsible for these artifacts appearing at the beginning and end of the smoothed data sequence.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">It is extremely easy to correct this problem, adopting the same philosophy the package uses for the <strong>outlierMAD</strong> procedure: simply retain the first and last K elements of the original sequence unmodified.<span style="mso-spacerun: yes;"> </span>The procedure <strong>SGwrapper</strong> listed below does this after the fact, calling the <strong>savgol</strong> procedure and then replacing the first and last K elements of the filtered sequence with the original sequence values:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">SGwrapper <- function(x,K,forder=4,dorder=0){</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>#</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>n = length(x)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>fl = 2*K+1</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>y = 
savgol(x,fl,forder,dorder)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>if (dorder == 0){</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>y[1:K] = x[1:K]</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>y[(n-K+1):n] = x[(n-K+1):n]</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>}</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>else{</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>y[1:K] = 0</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>y[(n-K+1):n] = 0</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>}</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>y</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">}</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Before showing the results obtained with this procedure, it is important to note two points.<span style="mso-spacerun: yes;"> </span>First, the moving window width parameter fl required for the <strong>savgol </strong>procedure corresponds to fl = 2K+1 for a half-width parameter K.<span style="mso-spacerun: yes;"> </span>The procedure <strong>SGwrapper</strong> instead takes K as its argument, constructing fl from this value of K.<span style="mso-spacerun: yes;"> </span>Second, note that in addition to serving as a smoother, the Savitzky-Golay filter family can also be used to estimate derivatives (this is tricky since differentiation filters are incredible noise amplifiers, but I’ll talk more about that in another post).<span style="mso-spacerun: yes;"> </span>In the <strong>savgol</strong> 
procedure, this is accomplished by specifying the parameter dorder, which has a default value of zero (implying smoothing), but which can be set to 1 to estimate the first derivative of a sequence, 2 for the second derivative, etc.<span style="mso-spacerun: yes;"> </span>In these cases, replacing the first and last K elements of the filtered sequence with the original data sequence elements is not reasonable: in the absence of any other knowledge, a better default derivative estimate is zero, and the <strong>SGwrapper</strong> procedure listed above does this.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-zc74biMBwes/TxHMuCc4tZI/AAAAAAAAAGw/zYuzLZbdfMk/s1600/HampelIIfig05.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" kba="true" src="http://3.bp.blogspot.com/-zc74biMBwes/TxHMuCc4tZI/AAAAAAAAAGw/zYuzLZbdfMk/s320/HampelIIfig05.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The four plots shown above illustrate the differences between the original <strong>savgol</strong> procedure (the left-hand plots) and those obtained with the <strong>SGwrapper</strong> procedure listed above (the right-hand plots).<span style="mso-spacerun: yes;"> </span>In all cases, the data sequence used to generate these plots was the physical property data sequence cleaned using the Hampel filter with t0 = 3.<span style="mso-spacerun: yes;"> </span>The upper left plot repeats the lower of the two previous plots, corresponding to the <strong>savgol</strong> smoother output, while the upper right plot applies the <strong>SGwrapper</strong> function to remove the artifacts at the beginning and end of the smoothed data sequence.<span style="mso-spacerun: yes;"> 
</span>Similarly, the lower two plots give the corresponding second-derivative estimates, obtained by applying the <strong>savgol</strong> procedure with fl = 11 and dorder = 2 (lower left plot) or the <strong>SGwrapper</strong> procedure with K = 5 and dorder = 2 (lower right plot).<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div>Ron Pearson (aka TheNoodleDoodler)http://www.blogger.com/profile/15693640298594791682noreply@blogger.com1tag:blogger.com,1999:blog-9179325420174899779.post-6061761910664338992011-11-27T08:37:00.000-08:002011-11-27T08:37:14.467-08:00Cleaning time-series and other data streams<div class="MsoNormal" style="margin: 0in 0in 0pt;">The need to analyze time-series or other forms of streaming data arises frequently in many different application areas.<span style="mso-spacerun: yes;"> </span>Examples include economic time-series like stock prices, exchange rates, or unemployment figures, biomedical data sequences like electrocardiograms or electroencephalograms, or industrial process operating data sequences like temperatures, pressures or concentrations.<span style="mso-spacerun: yes;"> </span>As a specific example, the figure below shows four data sequences: the upper two plots represent hourly physical property measurements, one made at the inlet of a product storage tank (the left-hand plot) and the other made at the same time at the outlet of the tank (the right-hand plot).<span style="mso-spacerun: yes;"> </span>The lower two plots in this figure show the results of applying the data cleaning filter <strong>outlierMAD</strong> from the <em>R</em> package <strong>pracma</strong> discussed further below.<span style="mso-spacerun: yes;"> </span>The two main points of this post are first, that isolated spikes like those seen in the upper two plots at hour 291 can badly distort the results of an otherwise reasonable time-series characterization, and second, that the simple moving window 
data cleaning filter described here is often very effective in removing these artifacts.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-xe3qt3qFIjc/TtJe9BAfGtI/AAAAAAAAAFw/GTVB2hnN3fU/s1600/hampelfig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" hda="true" height="319" src="http://3.bp.blogspot.com/-xe3qt3qFIjc/TtJe9BAfGtI/AAAAAAAAAFw/GTVB2hnN3fU/s320/hampelfig01.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">This example is discussed in more detail in Section 8.1.2 of my book <a href="http://www.amazon.com/Discrete-time-Dynamic-Models-Chemical-Engineering/dp/0195121988">Discrete-Time Dynamic Models</a>, but the key observations here are the following.<span style="mso-spacerun: yes;"> </span>First, the large spikes seen in both of the original data sequences were caused by the simultaneous, temporary loss of both measurements and the subsequent coding of these missing values as zero by the data collection system.<span style="mso-spacerun: yes;"> </span>The practical question of interest was to determine how long, on average, the viscous, polymeric material being fed into and out of the product storage tank was spending there.<span style="mso-spacerun: yes;"> </span>A standard method for addressing such questions is the use of cross-correlation analysis, where the expected result is a broad peak like the heavy dashed line in the plot shown below.<span style="mso-spacerun: yes;"> </span>The location of this peak provides an estimate of the average time spent in the tank, which is approximately 21 hours in this case, as indicated in the plot.<span style="mso-spacerun: yes;"> </span>This result was about what was 
expected, and it was obtained by applying standard cross-correlation analysis to the cleaned data sequences shown in the bottom two plots above.<span style="mso-spacerun: yes;"> </span>The lighter solid curve in the plot below shows the results of applying exactly the same analysis, but to the original data sequences instead of the cleaned data sequences.<span style="mso-spacerun: yes;"> </span>This dramatically different plot suggests that the material is spending very little time in the storage tank: accepted uncritically, this result would imply severe fouling of the tank, suggesting a need to shut the process down and clean out the tank, an expensive and labor-intensive proposition.<span style="mso-spacerun: yes;"> </span>The main point of this example is that the difference in these two plots is entirely due to the extreme data anomalies present in the original time-series.<span style="mso-spacerun: yes;"> </span>Additional examples of problems caused by time-series outliers are discussed in Section 4.3 of my book <a href="http://www.amazon.com/Mining-Imperfect-Data-Contamination-Incomplete/dp/0898715822">Mining Imperfect Data</a>.<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-LtDGNc0Pq3w/TtJgcfIkfwI/AAAAAAAAAF4/OP18CGkOpck/s1600/hampelfig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" hda="true" height="319" src="http://1.bp.blogspot.com/-LtDGNc0Pq3w/TtJgcfIkfwI/AAAAAAAAAF4/OP18CGkOpck/s320/hampelfig02.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">One of the primary features of the analysis of time-series and other streaming data sequences is the need for <i 
style="mso-bidi-font-style: normal;">local</i> data characterizations.<span style="mso-spacerun: yes;"> </span>This point is illustrated in the plot below, which shows the first 200 observations of the storage tank inlet data sequence discussed above.<span style="mso-spacerun: yes;"> </span>All of these observations but one are represented as open circles in this plot, but the data point at <em>k = 110</em> is shown as a solid circle, to emphasize how far it lies from its immediate neighbors in the data sequence.<span style="mso-spacerun: yes;"> </span>It is important to note that this point is not anomalous with respect to the overall range of this data sequence – it is, for example, well within the normal range of variation seen for the points from about <em>k = 150</em> to <em>k = 200</em> – but it is clearly anomalous with respect to those points that immediately precede and follow it.<span style="mso-spacerun: yes;"> </span>A general strategy for automatically detecting and removing such spikes from a data sequence like this one is to apply a <i style="mso-bidi-font-style: normal;">moving window data cleaning filter</i> which characterizes each data point with respect to a local neighborhood of prior and subsequent samples.<span style="mso-spacerun: yes;"> </span>That is, for each data point <i style="mso-bidi-font-style: normal;">k</i> in the original data sequence, this type of filter forms a cleaned data estimate based on some number <i style="mso-bidi-font-style: normal;">J</i> of prior data values (i.e., points <i style="mso-bidi-font-style: normal;">k-J</i> through <i style="mso-bidi-font-style: normal;">k-1</i> in the sequence) and, in the simplest implementations, the same number of subsequent data values (i.e., points <i style="mso-bidi-font-style: normal;">k+1</i> through <i style="mso-bidi-font-style: normal;">k+J</i> in the sequence).</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; 
text-align: center;"><a href="http://4.bp.blogspot.com/-bSZnrwrGhFg/TtJg1JK1mLI/AAAAAAAAAGA/I95d4s7VILM/s1600/hampelfig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" hda="true" height="319" src="http://4.bp.blogspot.com/-bSZnrwrGhFg/TtJg1JK1mLI/AAAAAAAAAGA/I95d4s7VILM/s320/hampelfig03.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The specific data cleaning filter considered here is the <em>Hampel filter</em>, which applies the Hampel identifier discussed in Chapter 7 of <a href="http://www.amazon.com/s?ie=UTF8&rh=n%3A283155%2Ck%3Aexploring%20data%20in%20engineering.%20the%20sciences.%20and%20medicine&page=1">Exploring Data in Engineering, the Sciences and Medicine</a> to this moving data window.<span style="mso-spacerun: yes;"> </span>If the <i style="mso-bidi-font-style: normal;">k<sup>th</sup></i> data point is declared to be an outlier, it is replaced by the median value computed from this data window; otherwise, the data point is not modified.<span style="mso-spacerun: yes;"> </span>The results of applying the Hampel filter with a window width of <i style="mso-bidi-font-style: normal;">J = 5</i> to the above data sequence are shown in the plot below.<span style="mso-spacerun: yes;"> </span>The effect is to modify three of the original data points – those at <i style="mso-bidi-font-style: normal;">k = 43, 110</i>, and <i style="mso-bidi-font-style: normal;">120</i> – and the original values of these modified points are shown as solid circles at the appropriate locations in this plot.<span style="mso-spacerun: yes;"> </span>It is clear that the most pronounced effect of the Hampel filter is to remove the local outlier indicated in the above figure and replace it with a value that is much more representative of the other data points 
in the immediate vicinity.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-sPj3GVpR9Uw/TtJhjPvDHnI/AAAAAAAAAGI/W6pb7RUWXdc/s1600/hampelfig04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" hda="true" height="319" src="http://4.bp.blogspot.com/-sPj3GVpR9Uw/TtJhjPvDHnI/AAAAAAAAAGI/W6pb7RUWXdc/s320/hampelfig04.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">As I noted above, the Hampel filter implementation used here is that available in the <em>R</em> package <strong>pracma</strong> as procedure <strong>outlierMAD</strong>.<span style="mso-spacerun: yes;"> </span>I will discuss this <em>R</em> package in more detail in my next post, but for those seeking a more detailed discussion of the Hampel filter in the meantime, one is freely available on-line in the form of an EDN article I wrote in 2002, <a href="http://www.edn.com/article/486039-Scrub_data_with_scale_invariant_nonlinear_digital_filters.php">Scrub data with scale-invariant nonlinear digital filters</a>.<span style="mso-spacerun: yes;"> Also, c</span>omparisons with alternatives like the standard median filter (generally too aggressive, introducing unwanted distortion into the “cleaned” data sequence) and the center-weighted median filter (sometimes quite effective) are presented in Section 4.2 of the book <em>Mining Imperfect Data</em> <span style="mso-spacerun: yes;"> mentioned above.</span></div>Ron Pearson (aka TheNoodleDoodler)http://www.blogger.com/profile/15693640298594791682noreply@blogger.com1tag:blogger.com,1999:blog-9179325420174899779.post-82050412104741903042011-11-11T14:16:00.000-08:002011-11-11T14:16:05.333-08:00Harmonic means, reciprocals, and ratios 
of random variables<div class="MsoNormal" style="margin: 0in 0in 0pt;">In my last few posts, I have considered “long-tailed” distributions whose probability density decays much more slowly than that of standard distributions like the Gaussian.<span style="mso-spacerun: yes;"> </span>For these slowly-decaying distributions, the harmonic mean often turns out to be a much better (i.e., less variable) characterization than the arithmetic mean, which is generally not even well-defined theoretically for these distributions.<span style="mso-spacerun: yes;"> </span>Since the harmonic mean is defined as the reciprocal of the mean of the reciprocal values, it is intimately related to the reciprocal transformation.<span style="mso-spacerun: yes;"> </span>The main point of this post is to show how profoundly the reciprocal transformation can alter the character of a distribution, for better or worse.<span style="mso-spacerun: yes;"> </span>One way that reciprocal transformations sneak into analysis results is through attempts to characterize ratios of random numbers.<span style="mso-spacerun: yes;"> </span>The key issue underlying all of these ideas is the question of when the denominator variable in either a reciprocal transformation or a ratio exhibits non-negligible probability in a finite neighborhood of zero.<span style="mso-spacerun: yes;"> </span>I discuss transformations in Chapter 12 of <a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650/ref=sr_1_1?s=books&ie=UTF8&qid=1321042995&sr=1-1">Exploring Data in Engineering, the Sciences and Medicine</a>, with a section (12.7) devoted to reciprocal transformations, showing what happens when we apply them to six different distributions: Gaussian, Laplace, Cauchy, beta, Pareto, and lognormal.<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">In the general case, 
if a random variable <em>x</em> has the density <em>p(x)</em>, the reciprocal <em>y = 1/x</em> has the density <em>g(y)</em> given by:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span><em>g(y) = p(1/y)/y<sup>2</sup></em> </div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">As I discuss in greater detail in <em>Exploring Data</em>, the consequence of this transformation is <i style="mso-bidi-font-style: normal;">typically</i> (though not always) to convert a well-behaved distribution into a very poorly behaved one.<span style="mso-spacerun: yes;"> </span>As a specific example, the plot below shows the effect of the reciprocal transformation on a Gaussian random variable with mean 1 and standard deviation 2.<span style="mso-spacerun: yes;"> </span>The most obvious characteristic of this transformed distribution is its strongly asymmetric, bimodal character, but a less obvious consequence of the reciprocal transformation is that it takes a distribution that is completely characterized by its first two moments into a new distribution with Cauchy-like tails, for which none of the integer moments exist.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-ihUpKC5yNpg/Tr1xtl2PFDI/AAAAAAAAAFQ/03fpQJy8IIc/s1600/recipfig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" nda="true" src="http://4.bp.blogspot.com/-ihUpKC5yNpg/Tr1xtl2PFDI/AAAAAAAAAFQ/03fpQJy8IIc/s320/recipfig01.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The 
implications of the reciprocal transformation for many other distributions are equally non-obvious.<span style="mso-spacerun: yes;"> </span>For example, both the badly-behaved Cauchy distribution (no moments exist) and the well-behaved lognormal distribution (all moments exist, but interestingly, do not completely characterize the distribution, as I have discussed in a previous post) are invariant under the reciprocal transformation.<span style="mso-spacerun: yes;"> </span>Also, applying the reciprocal transformation to the long-tailed Pareto type I distribution (which exhibits few or no finite moments, depending on its tail decay rate) yields a beta distribution, all of whose moments are finite.<span style="mso-spacerun: yes;"> </span>Finally, it is worth noting that the invariance of the Cauchy distribution under the reciprocal transformation lies at the heart of the following result, presented in the book <a href="http://www.amazon.com/Continuous-Univariate-Distributions-Probability-Statistics/dp/0471584959/ref=sr_1_2?s=books&ie=UTF8&qid=1321042772&sr=1-2">Continuous Univariate Distributions</a> by Johnson, Kotz, and Balakrishnan (Volume 1, 2<sup>nd</sup> edition, Wiley, 1994, page 319).<span style="mso-spacerun: yes;"> </span>They note that if the density of <em>x</em> is positive, continuous, and differentiable at <em>x = 0</em> – all true for the Gaussian case – the distribution of the harmonic mean of <em>N</em> samples approaches a Cauchy limit as <em>N</em> becomes infinitely large.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">As noted above, the key issue responsible for the pathological behavior of the reciprocal transformation is the question of whether the original data distribution exhibits nonzero probability of taking on values within a neighborhood around zero.<span style="mso-spacerun: yes;"> </span>In particular, note that if <em>x</em> can only assume values larger than 
some positive lower limit <em>L</em>, it follows that <em>1/x</em> necessarily lies between <em>0</em> and <em>1/L</em>, which is enough to guarantee that all moments of the transformed distribution exist.<span style="mso-spacerun: yes;"> </span>For the Gaussian distribution, even if the mean is large enough and the standard deviation is small enough that the probability of observing values less than some limit <em>L > 0</em> is negligible, the fact that this probability is not <i style="mso-bidi-font-style: normal;">zero</i> means that the moments of <i style="mso-bidi-font-style: normal;">any</i> reciprocally-transformed Gaussian distribution are not finite.<span style="mso-spacerun: yes;"> </span>As a practical matter, however, reciprocal transformations and related characterizations – like harmonic means and ratios – do become better-behaved as the probability of observing values near zero becomes negligibly small.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">To see this point, consider two reciprocally-transformed Gaussian examples.<span style="mso-spacerun: yes;"> </span>The first is the one considered above: the reciprocal transformation of a Gaussian random variable with mean 1 and standard deviation 2.<span style="mso-spacerun: yes;"> </span>In this case, the probability that <em>x</em> assumes values smaller than or equal to zero is non-negligible.<span style="mso-spacerun: yes;"> </span>Specifically, this probability is simply the cumulative distribution function for the distribution evaluated at zero, easily computed in R as approximately 31%:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">> pnorm(0,mean=1,sd=2)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">[1] 0.3085375</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">In 
contrast, for a Gaussian random variable with mean 1 and standard deviation 0.1, the corresponding probability is negligibly small:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">> pnorm(0,mean=1,sd=0.1)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">[1] 7.619853e-24</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">If we consider the harmonic means of these two examples, we see that the first one is horribly behaved, as all of the results presented here would lead us to expect.<span style="mso-spacerun: yes;"> </span>In fact, the <strong>qqPlot</strong> command in the <strong>car</strong> package in <em>R</em> allows us to compute quantile-quantile plots for the Student’s <em>t</em>-distribution with one degree of freedom, corresponding to the Cauchy distribution, yielding the plot shown below.<span style="mso-spacerun: yes;"> </span>The Cauchy-like tail behavior expected from the results presented by Johnson, Kotz and Balakrishnan is seen clearly in this Cauchy Q-Q plot, constructed from 1000 harmonic means, each computed from statistically independent samples drawn from a Gaussian distribution with mean 1 and standard deviation 2.<span style="mso-spacerun: yes;"> </span>The fact that almost all of the observations fall within the – very wide – 95% confidence interval around the reference line suggests that the Cauchy tail behavior is appropriate here.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-tQbQfuhvKY4/Tr1y6ipHrTI/AAAAAAAAAFY/BWQUNWtTVbg/s1600/recipfig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" nda="true" src="http://2.bp.blogspot.com/-tQbQfuhvKY4/Tr1y6ipHrTI/AAAAAAAAAFY/BWQUNWtTVbg/s320/recipfig02.png" width="320" /></a></div><div 
class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">To further confirm this point, compare the corresponding normal Q-Q plot for the same sequence of harmonic means, shown below.<span style="mso-spacerun: yes;"> Th</span>ere, the extreme non-Gaussian character of these harmonic means is readily apparent from the pronounced outliers evident in both the upper and lower tails.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-II9KLHCeIYw/Tr1zH9K003I/AAAAAAAAAFg/14mIAISzn4U/s1600/recipfig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" nda="true" src="http://2.bp.blogspot.com/-II9KLHCeIYw/Tr1zH9K003I/AAAAAAAAAFg/14mIAISzn4U/s320/recipfig03.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">In marked contrast, for the second example with the mean of 1 as before but the much smaller standard deviation of 0.1, the harmonic mean is much better behaved, as the normal Q-Q plot below illustrates.<span style="mso-spacerun: yes;"> </span>Specifically, this plot is identical in construction to the one above, except it was computed from samples drawn from the second data distribution.<span style="mso-spacerun: yes;"> </span>Here, most of the computed harmonic mean values fall within the 95% confidence limits around the Gaussian reference line, suggesting that it is not unreasonable in practice to regard these values as approximately normally distributed, in spite of the pathologies of the reciprocal transformation.</div><div class="MsoNormal" style="margin: 0in 0in 
0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-9kCbnML55mE/Tr1zVSWL8kI/AAAAAAAAAFo/aGD2h8oow4c/s1600/recipfig04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" nda="true" src="http://2.bp.blogspot.com/-9kCbnML55mE/Tr1zVSWL8kI/AAAAAAAAAFo/aGD2h8oow4c/s320/recipfig04.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">One reason the reciprocal transformation is important in practice – particularly in connection with the Gaussian distribution – is that the desire to characterize ratios of uncertain quantities does arise from time to time.<span style="mso-spacerun: yes;"> </span>In particular, if we are interested in characterizing the ratio of two averages, the Central Limit Theorem would lead us to expect that, at least approximately, this ratio should behave like the ratio of two Gaussian random variables.<span style="mso-spacerun: yes;"> </span>If these component averages are statistically independent, the expected value of the ratio can be re-written as the product of the expected value of the numerator average and the expected value of the reciprocal of the denominator average, leading us directly to the reciprocal Gaussian transformation discussed here.<span style="mso-spacerun: yes;"> </span>In fact, if these two averages are both zero mean, it is a standard result that the ratio has a Cauchy distribution (this result is presented in the same discussion from Johnson, Kotz and Balakrishnan noted above).<span style="mso-spacerun: yes;"> </span>As in the second harmonic mean example presented above, however, it turns out to be true that if the mean and standard deviation of the denominator variable are such that the probability of a zero or negative denominator 
is negligible, the distribution of the ratio may be approximated reasonably well as Gaussian.<span style="mso-spacerun: yes;"> </span>A very readable and detailed discussion of this fact is given in the paper by George Marsaglia in the May 2006 issue of <a href="http://www.jstatsoft.org/v16/i04">Journal of Statistical Software</a>.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Finally, it is important to note that the “reciprocally-transformed Gaussian distribution” I have been discussing here is <i style="mso-bidi-font-style: normal;">not</i> the same as the <em>inverse Gaussian distribution</em>, to which Johnson, Kotz and Balakrishnan devote a 39-page chapter (Chapter 15).<span style="mso-spacerun: yes;"> </span>That distribution takes only positive values and exhibits moments of all orders, both positive and negative, and as a consequence, it has the interesting characteristic that it remains well-behaved under reciprocal transformations, in marked contrast to the Gaussian case.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div>Ron Pearson (aka TheNoodleDoodler)http://www.blogger.com/profile/15693640298594791682noreply@blogger.com0tag:blogger.com,1999:blog-9179325420174899779.post-32075753837295733872011-10-23T13:31:00.000-07:002011-10-23T13:31:47.692-07:00The Zipf and Zipf-Mandelbrot distributions<div class="MsoNormal" style="margin: 0in 0in 0pt;">In my last few posts, I have been discussing some of the consequences of the slow decay rate of the tail of the Pareto type I distribution, along with some other, closely related notions, all in the context of continuously distributed data.<span style="mso-spacerun: yes;"> </span>Today’s post considers the Zipf distribution for discrete data, which has come to be extremely popular as a model for phenomena like word frequencies, city sizes, or sales rank data, 
where the values of these quantities associated with randomly selected samples can vary by many orders of magnitude.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">More specifically, the Zipf distribution is defined by a probability p<sub>i</sub> of observing the i<sup>th</sup> element of an infinite sequence of objects in a single random draw from that sequence, where the probability is given by:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span></div><blockquote>p<sub>i</sub> = A/i<sup>a</sup></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Here, <i style="mso-bidi-font-style: normal;">a</i> is a positive number greater than 1 that determines the rate of the distribution’s tail decay, and <i style="mso-bidi-font-style: normal;">A</i> is a normalization constant, chosen so that these probabilities sum to 1.<span style="mso-spacerun: yes;"> </span>Like the continuous-valued Pareto type I distribution, the Zipf distribution exhibits a “long tail,” meaning that its tail decays slowly enough that in a random sample of objects <i style="mso-bidi-font-style: normal;">O<sub>i</sub></i> drawn from a Zipf distribution, some very large values of the index <i style="mso-bidi-font-style: normal;">i</i> will be observed, particularly for relatively small values of the exponent <i style="mso-bidi-font-style: normal;">a</i>.<span style="mso-spacerun: yes;"> </span>In one of the earliest and most common applications of the Zipf distribution, the objects considered represent words in a document and <i style="mso-bidi-font-style: normal;">i</i> represents their rank, ranging from most frequent (for <i style="mso-bidi-font-style: normal;">i = 1</i>) to rare (for large <i style="mso-bidi-font-style: normal;">i</i>).<span 
style="mso-spacerun: yes;"> </span>In a more business-oriented application, the objects might be products for sale (e.g., books listed on Amazon), with the index <i style="mso-bidi-font-style: normal;">i</i> corresponding to their sales rank.<span style="mso-spacerun: yes;"> </span>For a fairly extensive collection of references to many different applications of the Zipf distribution, the website (originally) from <a href="http://www.nslij-genetics.org/wli/zipf/index.html">Rockefeller University</a> is an excellent source.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">In <a href="http://www.amazon.com/s/ref=nb_sb_ss_i_1_15?url=search-alias%3Dstripbooks&field-keywords=exploring+data+in+engineering.+the+sciences.+and+medicine&sprefix=Exploring+Data+">Exploring Data in Engineering, the Sciences, and Medicine</a>, I give a brief discussion of both the Zipf distribution and the closely related Zipf-Mandelbrot distribution discussed by Benoit Mandelbrot in his book <a href="http://www.amazon.com/s/ref=nb_sb_ss_i_0_12?url=search-alias%3Dstripbooks&field-keywords=the+fractal+geometry+of+nature&sprefix=the+fractal+">The Fractal Geometry of Nature</a>.<span style="mso-spacerun: yes;"> </span>The probabilities defining this distribution may be parameterized in several ways, and the one given in <em>Exploring Data</em> is:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span></div><blockquote>p<sub>i</sub> = A/(1+Bi)<sup>a</sup></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">where again <i style="mso-bidi-font-style: normal;">a</i> is an exponent that determines the rate at which the tail of the distribution decays, and <i style="mso-bidi-font-style: normal;">B</i> is a second parameter with a value that is strictly 
positive but no greater than 1.<span style="mso-spacerun: yes;"> </span>For both the Zipf distribution and the Zipf-Mandelbrot distribution, the exponent <i style="mso-bidi-font-style: normal;">a</i> must be greater than 1 for the distribution to be well-defined, it must be greater than 2 for the mean to be finite, and it must be greater than 3 for the variance to be finite.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">So far, I have been unable to find an <em>R</em> package that supports the generation of random samples drawn from the Zipf distribution, but the package <strong>zipfR</strong> includes the command <strong>rlnre</strong>, which generates random samples drawn from the Zipf-Mandelbrot distribution.<span style="mso-spacerun: yes;"> </span>As I noted, this distribution can be parameterized in several different ways and, as Murphy’s law would have it, the <strong>zipfR</strong> parameterization is not the same as the one presented above and discussed in <em>Exploring Data</em>.<span style="mso-spacerun: yes;"> </span>Fortunately, the conversion between these parameters is simple.<span style="mso-spacerun: yes;"> </span>The <strong>zipfR</strong> package defines the distribution in terms of a parameter <strong>alpha</strong> that must lie strictly between 0 and 1, and a second parameter <strong>B</strong> that I will call <em>B<sub>zipfR</sub></em> to avoid confusion with the parameter <em>B</em> in the above definition.<span style="mso-spacerun: yes;"> </span>These parameters are related by:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span></div><blockquote>alpha = 1/a<span style="mso-spacerun: yes;"> </span>and<span style="mso-spacerun: yes;"> </span>B<sub>zipfR</sub> = (a-1) B</blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div 
class="MsoNormal" style="margin: 0in 0in 0pt;">Since the <i style="mso-bidi-font-style: normal;">a</i> parameter (and thus the <strong>alpha</strong> parameter in the <strong>zipfR</strong> package) determines the tail decay rate of the distribution, it is of the most interest here, and the rest of this post will focus on three examples: a = 1.5 (alpha = 2/3), for which both the distribution’s mean and variance are infinite, a = 2.5 (alpha = 2/5), for which the mean is finite but the variance is not, and a = 3.5 (alpha = 2/7), for which both the mean and variance are finite.<span style="mso-spacerun: yes;"> </span>The value of the parameter <em>B</em> in the <em>Exploring Data</em> definition of the distribution will be fixed at 0.2 in all of these examples, corresponding to values of <em>B<sub>zipfR</sub></em> = 0.1, 0.3, and 0.5 for the three examples considered here.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">To generate Zipf-Mandelbrot random samples, the <strong>zipfR</strong> package uses the procedure <strong>rlnre</strong> in conjunction with the procedure <strong>lnre </strong>(the abbreviation “lnre”<span style="mso-spacerun: yes;"> stands for “large number of rare events” and it represents a class of data models that includes the Zipf-Mandelbrot distribution). 
</span>Specifically, to generate a random sample of size N = 100 for the first case considered here, the following <em>R</em> code is executed:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;">> library(zipfR)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">> ZM = lnre("zm", alpha = 2/3, B = 0.1)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">> zmsample = rlnre(ZM, n=100)</div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The first line loads the <strong>zipfR</strong> library (which must first be installed, of course, using the <strong>install.packages</strong> command), the second line invokes the <strong>lnre</strong> command to set up the distribution with the desired parameters, and the last line invokes the <strong>rlnre</strong> command to generate 100 random samples from this distribution.<span style="mso-spacerun: yes;"> </span>(As with all <em>R</em> random number generators, the <strong>set.seed</strong> command should be used first to initialize the random number generator seed if you want to get repeatable results; for the results presented here, I used <strong>set.seed(101)</strong>.)<span style="mso-spacerun: yes;"> </span>The sample returned by the <strong>rlnre</strong> command is a vector of 100 observations, which have the “factor” data type, although their designations are numeric (think of the factor value “1339” as meaning “1 sample of object number 1339”).<span style="mso-spacerun: yes;"> </span>In the results I present here, I have converted these factor responses to numerical ones so I can interpret them as numerical ranks.<span style="mso-spacerun: yes;"> </span>This conversion is a little subtle: simply converting from factor to numeric values via something like “<strong>zmnumeric = as.numeric(zmsample)</strong>” almost certainly doesn’t give you what you 
want: applied to a factor, <strong>as.numeric</strong> returns the internal integer codes of the factor levels (1, 2, 3, and so on) rather than the numeric labels themselves, so a label such as “1339” or “73” is replaced by the position of its level.<span style="mso-spacerun: yes;"> </span>To get what you want (e.g., the labels “1339” and “73” assigned to the numbers 1339 and 73, respectively), you need to first convert the factors in <strong>zmsample</strong> into characters and then convert these characters into numeric values:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span></div><blockquote>zmnumeric = as.numeric(as.character(zmsample))</blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The three plots below show random samples drawn from each of the three Zipf-Mandelbrot distributions considered here.<span style="mso-spacerun: yes;"> </span>In all cases, the y-axis corresponds to the number of times the object labeled <em>i </em>was observed in a random sample of size N = 100 drawn from the distribution with the indicated exponent.<span style="mso-spacerun: yes;"> </span>Since the range of these indices can be quite large in the slowly-decaying members of the Zipf-Mandelbrot distribution family, the plots are drawn with logarithmic x-axes, and to facilitate comparisons, the x-axes have the same range in all three plots, as do the y-axes.<span style="mso-spacerun: yes;"> </span>In all three plots, object i = 1 occurs most often – about a dozen times in the top plot, two dozen times in the middle plot, and three dozen times in the bottom plot – and those objects with larger indices occur less frequently.<span style="mso-spacerun: yes;"> </span>The major difference between these three examples lies in the largest indices of the objects 
seen in the samples: we never see an object with index greater than 50 in the bottom plot, we see only two such objects in the middle plot, while more than a third of the objects in the top plot meet this condition, with the most extreme object having index i = 115,116.<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-4IkBCpRbZdk/TqRtqcC60pI/AAAAAAAAAEg/0NJarxwlteo/s1600/zipfig00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" rda="true" src="http://1.bp.blogspot.com/-4IkBCpRbZdk/TqRtqcC60pI/AAAAAAAAAEg/0NJarxwlteo/s320/zipfig00.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.5in;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">As in the case of the Pareto type I distributions I discussed in several previous posts – which may be regarded as the continuous analog of the Zipf distribution – the mean is generally not a useful characterization for the Zipf distribution.<span style="mso-spacerun: yes;"> </span>This point is illustrated in the boxplot comparison presented below, which summarizes the means computed from 1000 statistically independent random samples drawn from each of the three distributions considered here, where the object labels have been converted to numerical values as described above.<span style="mso-spacerun: yes;"> </span>Thus, the three boxplots on the left represent the means – note the logarithmic scale on the y-axis – of these index values <i style="mso-bidi-font-style: normal;">i</i> generated for each random sample.<span style="mso-spacerun: yes;"> </span>The extreme variability seen for Case 1 (a = 1.5) reflects the fact that neither the mean nor the variance 
is finite for this case, and the consistent reduction in the range of variability for Cases 2 (a = 2.5, finite mean but infinite variance) and 3 (a = 3.5, finite mean and variance) reflects the “shortening tail” of this distribution with increasing exponent <i style="mso-bidi-font-style: normal;">a</i>.<span style="mso-spacerun: yes;"> </span>As I discussed in my last post, a better characterization than the mean for distributions like this is the “95% tail length,” corresponding to the 95% sample quantile. Boxplots summarizing these values for the three distributions considered here are shown to the right of the dashed vertical line in the plot below.<span style="mso-spacerun: yes;"> </span>In each case, the range of variation seen here is much less extreme for the 95% tail length than it is for the mean, supporting the idea that this is a better characterization for data described by Zipf-like discrete distributions.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-1efDOBGWbGM/TqRuBRTgeMI/AAAAAAAAAEo/GtG7pBgZVIY/s1600/zipfig01a.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" rda="true" src="http://3.bp.blogspot.com/-1efDOBGWbGM/TqRuBRTgeMI/AAAAAAAAAEo/GtG7pBgZVIY/s320/zipfig01a.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.5in;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Other alternatives to the (arithmetic) mean that I discussed in conjunction with the Pareto type I distribution were the sample median, the geometric mean, and the harmonic mean.<span style="mso-spacerun: yes;"> </span>The plot below compares these four characterizations for 1000 random samples, each of size N = 100, drawn from the Zipf-Mandelbrot 
distribution with a = 3.5 (the third case), for which the mean is well-defined.<span style="mso-spacerun: yes;"> </span>Even here, it is clear that the mean is considerably more variable than these other three alternatives.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-iFLdn8T5avk/TqRuODcIExI/AAAAAAAAAEw/sX0yQP6zmzM/s1600/zipfig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" rda="true" src="http://3.bp.blogspot.com/-iFLdn8T5avk/TqRuODcIExI/AAAAAAAAAEw/sX0yQP6zmzM/s320/zipfig02.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.5in;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Finally, the plot below shows boxplot comparisons of these alternative characterizations – the median, the geometric mean, and the harmonic mean – for all three of the distributions considered here.<span style="mso-spacerun: yes;"> </span>Not surprisingly, Case 1 (a = 1.5) exhibits the largest variability seen for all three characterizations, but the harmonic mean is much more consistent for this case than either the geometric mean or the median.<span style="mso-spacerun: yes;"> </span>In fact, the same observation holds – although less dramatically – for Case 2 (a = 2.5), and the harmonic mean appears more consistent than the geometric mean for all three cases.<span style="mso-spacerun: yes;"> </span>This observation is particularly interesting in view of the connection between the harmonic mean and the reciprocal transformation, which I will discuss in more detail next time.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a 
href="http://4.bp.blogspot.com/-swrGBTcTcNs/TqRubBMbQ5I/AAAAAAAAAE4/8tNLF8W1G4Y/s1600/zipfig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" rda="true" src="http://4.bp.blogspot.com/-swrGBTcTcNs/TqRubBMbQ5I/AAAAAAAAAE4/8tNLF8W1G4Y/s320/zipfig03.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.5in;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div>Ron Pearson (aka TheNoodleDoodler)http://www.blogger.com/profile/15693640298594791682noreply@blogger.com0tag:blogger.com,1999:blog-9179325420174899779.post-71806279906162061832011-09-28T15:11:00.000-07:002011-09-28T15:11:01.454-07:00Is the “Long Tail” a Useless Concept?<div class="MsoNormal" style="margin: 0in 0in 0pt;">In response to my last post, “The Long Tail of the Pareto Distribution,” Neil Gunther had the following comment:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span></div><blockquote>“<span style="color: #333333;">Unfortunately, you've fallen into the trap of using the ‘long tail’ misnomer. 
If you think about it, it can't possibly be the length of the tail that sets distributions like Pareto and Zipf apart; even the negative exponential and Gaussian have <i>infinitely</i> long tails.”</span></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">He goes on to say that it is the “width” or the “weight” of the tails that is important, and that a more appropriate characterization of these “Long Tails” would be “heavy-tailed” or “power-law” distributions.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Neil’s comment raises an important point: while the term “long tail” appears a lot in both the on-line and hard-copy literature, it is often somewhat ambiguously defined.<span style="mso-spacerun: yes;"> </span>For example, in his book, <a href="http://www.amazon.com/Long-Tail-Revised-Updated-Business/dp/1401309666/ref=sr_1_1?s=books&ie=UTF8&qid=1317246600&sr=1-1"><em>The Long Tail</em></a>, Chris Anderson offers the following description (page 10):</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span></div><blockquote>“In statistics, curves like that are called ‘long-tailed distributions’ because the tail of the curve is very long relative to the head.”</blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The difficulty with this description is that it says nothing about how to measure “tail length,” forcing us to adopt our own definitions.<span style="mso-spacerun: yes;"> </span>It is clear from Neil’s comments that the definition he adopts for “tail length” is the width of the distribution’s support set.<span style="mso-spacerun: yes;"> </span>Under this definition, the notion of a 
“long-tailed distribution” is of extremely limited utility: the situation is exactly as Neil describes it, with “long-tailed distributions” corresponding to any distribution with unbounded support, including both distributions like the Gaussian and gamma distribution where the mean is a reasonable characterization, and those like the Cauchy and Pareto distribution where the mean doesn’t even exist.<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The situation is analogous to that of confidence intervals, which characterize the uncertainty inherited by any characterization computed from a collection of uncertain (i.e., random) data values.<span style="mso-spacerun: yes;"> </span>As a specific example consider the mean: the <em>sample mean</em> is the arithmetic average of <em>N</em> observed data samples, and it is generally intended as an estimate of the <em>population mean</em>, defined as the first moment of the data distribution.<span style="mso-spacerun: yes;"> </span>A <em>q% confidence interval</em> around the sample mean is an interval that contains the population mean with probability at least <em>q%</em>.<span style="mso-spacerun: yes;"> </span>These intervals can be computed in various ways for different data characterizations, but the key point here is that they are widely used in practice, with the most popular choices being the 90%, 95% and 99% confidence intervals, which necessarily become wider as this percentage <em>q</em> increases.<span style="mso-spacerun: yes;"> </span>(For a more detailed discussion of confidence intervals, refer to Chapter 9 of <a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650/ref=sr_1_1?s=books&ie=UTF8&qid=1317246817&sr=1-1#_">Exploring Data in Engineering, the Sciences, and Medicine</a>.)<span style="mso-spacerun: yes;"> </span>We can, in principle, construct 100% 
confidence intervals, but this leads us directly back to Neil’s objection: the 100% confidence interval for the mean is the entire support set of the distribution (e.g., for the Gaussian distribution, this 100% confidence interval is the whole real line, while for any gamma distribution, it is the set of all positive numbers).<span style="mso-spacerun: yes;"> </span>These observations suggest the following notion of “tail length” that addresses Neil’s concern while retaining the essential idea of interest in the business literature: we can compare the “q% tail length” of different distributions for some <em>q</em> less than 100.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">In particular, consider the case of J-shaped distributions, defined as those like the Pareto type I distribution whose density p(x) decays monotonically with increasing x, approaching zero as x goes to infinity.<span style="mso-spacerun: yes;"> </span>The plot below shows two specific examples to illustrate the idea: the solid line corresponds to the (shifted) exponential distribution:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>p(x) = e<sup>–(x-1)</sup></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">for all x greater than or equal to 1 and zero otherwise, while the dotted line represents the Pareto type I distribution with location parameter <em>k = 1</em> and shape parameter <em>a = 0.5</em> discussed in my last post.<span style="mso-spacerun: yes;"> </span>Initially, as x increases from 1, the exponential density is greater than the Pareto density, but for x larger than about 3.5, the opposite is true: the exponential density rapidly becomes much smaller, reflecting its much more rapid rate of tail decay.</div><div 
class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-WtujfYFLLtw/ToOVGk0u85I/AAAAAAAAAEM/cjJl9R66-hk/s1600/LongUselessFig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" kca="true" src="http://1.bp.blogspot.com/-WtujfYFLLtw/ToOVGk0u85I/AAAAAAAAAEM/cjJl9R66-hk/s320/LongUselessFig01.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">For these distributions, define the q% tail length to be the distance from the minimum possible value of x (the “head” of the distribution; here, x = 1) to the point in the tail where the cumulative probability reaches q% (i.e., the value x<sub>q</sub> where x < x<sub>q</sub> with probability q%). <span style="mso-spacerun: yes;"> </span>In practical terms, the q% tail length tells us how far out we have to go in the tail to account for q% of the possible cases.<span style="mso-spacerun: yes;"> </span>In <em>R</em>, this value is easy to compute using the <em>quantile</em> function included in most families of available distribution functions.<span style="mso-spacerun: yes;"> </span>As a specific example, for the Pareto type I distribution, the function <strong>qparetoI</strong> in the <strong>VGAM</strong> package gives us the desired quantiles for the distribution with specified values of the parameters <em>k</em> (designated “scale” in the <strong>qparetoI</strong> call) and <em>a</em> (designated “shape” in the <strong>qparetoI</strong> call).<span style="mso-spacerun: yes;"> </span>Thus, for the case <em>k = 1</em> and <em>a = 0.5</em> (i.e., the Pareto curve in the above plot), the “90% tail length” is given by:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 
0pt;">> qparetoI(p=0.9,scale=1,shape=0.5)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">[1] 100</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">For comparison, the corresponding shifted exponential distribution has the 90% tail length given by:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">> 1 + qexp(p = 0.9)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">[1] 3.302585</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">(Note that here, I added 1 to the exponential quantile to account for the shift in its domain from “all positive numbers” – the domain for the standard exponential distribution – to the shifted domain “all numbers greater than 1”.)<span style="mso-spacerun: yes;"> </span>Since these 90% tail lengths differ by a factor of 30, they provide a sound basis for declaring the Pareto type I distribution to be “longer tailed” than the exponential distribution.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">These results also provide a useful basis for assessing the influence of the decay parameter a for the Pareto distribution.<span style="mso-spacerun: yes;"> </span>As I noted last time, two of the examples I considered did not have finite means (<em>a = 0.5</em> and <em>1.0</em>), and none of the four had finite variances (i.e., also <em>a = 1.5</em> and <em>2.0</em>), rendering moment characterizations like the mean and standard deviation fundamentally useless.<span style="mso-spacerun: yes;"> </span>Comparing the 90% tail lengths for these distributions, however, leads to the following results:</div><div class="MsoNormal" 
style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span><em>a = 0.5:</em> 90% tail length = 100.000</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span><em>a = 1.0:</em> 90% tail length =<span style="mso-spacerun: yes;"> </span>10.000</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span><em>a = 1.5:</em> 90% tail length =<span style="mso-spacerun: yes;"> </span>4.642</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span><em>a = 2.0:</em> 90% tail length =<span style="mso-spacerun: yes;"> </span>3.162</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">It is clear from these results that the shape parameter <em>a</em> has a dramatic effect on the 90% tail length (in fact, on the q% tail length for any <em>q</em> less than 100).<span style="mso-spacerun: yes;"> </span>Further, note that the 90% tail length for the Pareto type I distribution with <em>a = 2.0</em> is actually a little bit shorter than that for the exponential distribution.<span style="mso-spacerun: yes;"> </span>If we move further out into the tail, however, this situation changes.<span style="mso-spacerun: yes;"> </span>As a specific example, suppose we compare the 98% tail lengths. 
For the exponential distribution, this yields the value 4.912, while for the four Pareto shape parameters we have the following results:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span><em>a = 0.5:</em> 98% tail length = 2,500.000</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span><em>a = 1.0:</em> 98% tail length =<span style="mso-spacerun: yes;"> </span>50.000</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span><em>a = 1.5:</em> 98% tail length =<span style="mso-spacerun: yes;"> </span>13.572</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span><em>a = 2.0:</em> 98% tail length =<span style="mso-spacerun: yes;"> </span>7.071</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">This value (i.e., the 98% tail length) seems a particularly appropriate choice to include here since in his book, <em>The Long Tail</em>, Chris Anderson notes that his original presentations on the topic were entitled “The 98% Rule,” reflecting the fact that he was explicitly considering how far out you had to go into the tail of a distribution of goods (e.g., the books for sale by Amazon) to account for 98% of the sales.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Since this discussion originally began with the question, “when are averages useless?” it is appropriate to note that, in contrast to the much better-known average, the “q% tail length” considered here is well-defined for <em>any </em>proper distribution.<span style="mso-spacerun: yes;"> </span>As the examples discussed here demonstrate, this characterization also provides a useful basis for quantifying the “Long Tail” behavior that is of 
increasing interest in business applications like Internet marketing.<span style="mso-spacerun: yes;"> </span>Thus, if we adopt this measure for any <em>q</em> value less than 100%, the answer to the title question of this post is, “No: The Long Tail is a useful concept.”</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The downside of this minor change is that – as the results shown here illustrate – the results obtained using the q% tail length depend on the value of <em>q</em> we choose.<span style="mso-spacerun: yes;"> </span>In my next post, I will explore the computational issues associated with that choice.</div>Ron Pearson (aka TheNoodleDoodler)http://www.blogger.com/profile/15693640298594791682noreply@blogger.com1tag:blogger.com,1999:blog-9179325420174899779.post-16112661374266985532011-09-17T09:54:00.000-07:002011-09-17T09:54:58.940-07:00The Long Tail of the Pareto Distribution<div class="MsoNormal" style="margin: 0in 0in 0pt;">In my last two posts, I have discussed cases where the mean is of little or no use as a data characterization.<span style="mso-spacerun: yes;"> </span>One of the specific examples I discussed last time was the case of the Pareto type I distribution, for which the density is given by:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 2;"> </span>p(x) = ak<sup>a</sup>/x<sup>a+1</sup></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">defined for all <i style="mso-bidi-font-style: normal;">x > k</i>, where <i style="mso-bidi-font-style: normal;">k</i> and <i style="mso-bidi-font-style: normal;">a</i> are numeric parameters that define the distribution.<span style="mso-spacerun: yes;"> </span>In the example I discussed last time, I considered the case where a = 1.5, which exhibits a finite mean 
(specifically, the mean is 3 for this case), but an infinite variance.<span style="mso-spacerun: yes;"> </span>As the results I presented last time demonstrated, the extreme data variability of this distribution renders the computed mean too variable to be useful.<span style="mso-spacerun: yes;"> </span>Another reason this distribution is particularly interesting is that it exhibits essentially the same tail behavior as the discrete Zipf distribution; there, the probability that a discrete random variable x takes its i<sup>th</sup> value is:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 2;"> </span>p<sub>i</sub> = A/i<sup>c</sup>,</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">where A is a normalization constant and <i style="mso-bidi-font-style: normal;">c</i> is a parameter that determines how slowly the tail decays.<span style="mso-spacerun: yes;"> </span>This distribution was originally proposed to characterize the frequency of words in long documents (the Zipf-Estoup law), it was investigated further by Zipf in the mid-twentieth century in a wide range of applications (e.g., the distributions of city sizes), and it has become the subject of considerable recent attention as a model for “long-tailed” business phenomena (for a non-technical introduction to some of these business phenomena, see the book by Chris Anderson, <a href="http://www.amazon.com/Long-Tail-Future-Business-Selling/dp/1401302378">The Long Tail</a>).<span style="mso-spacerun: yes;"> </span>I will discuss the Zipf distribution further in a later post, but one of the reasons for discussing the Pareto type I distribution first is that since it is a continuous distribution, the math is easier, meaning that more characterization results are available for the Pareto distribution.<span style="mso-spacerun: yes;"> 
</span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-Od4r-V4YWGU/TnTLcJsl0DI/AAAAAAAAAD0/fqILT5xoVxM/s1600/ParetoIFig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" rba="true" src="http://4.bp.blogspot.com/-Od4r-V4YWGU/TnTLcJsl0DI/AAAAAAAAAD0/fqILT5xoVxM/s320/ParetoIFig01.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The mean of the Pareto type I distribution is:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 2;"> </span>Mean = ak/(a-1),</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">provided <i style="mso-bidi-font-style: normal;">a > 1</i>, and the variance of the distribution is finite only if <i style="mso-bidi-font-style: normal;">a > 2</i>.<span style="mso-spacerun: yes;"> </span>Plots of the probability density defined above for this distribution are shown above, for <i style="mso-bidi-font-style: normal;">k = 1</i> in all cases, and with <i style="mso-bidi-font-style: normal;">a</i> taking the values 0.5, 1.0, 1.5, and 2.0.<span style="mso-spacerun: yes;"> </span>(This is essentially the same plot as Figure 4.17 in <a href="http://www.amazon.com/s/ref=nb_sb_ss_i_1_14?url=search-alias%3Dstripbooks&field-keywords=exploring+data+in+engineering.+the+sciences.+and+medicine&sprefix=Exploring+Data">Exploring Data in Engineering, the Sciences, and Medicine</a>, where I give a brief description of the Pareto type I distribution.)<span style="mso-spacerun: yes;"> </span>Note that all of the cases considered here are characterized by infinite variance, while the 
first two (a = 0.5 and 1.0) are also characterized by infinite means.<span style="mso-spacerun: yes;"> </span>As the results presented below emphasize, the mean represents a very poor characterization in practice for data drawn from any of these distributions, but there are alternatives, including the familiar median that I have discussed previously, along with two others that are more specific to the Pareto type I distribution: the geometric mean and the harmonic mean.<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The plot below emphasizes the point made above about the extremely limited utility of the mean as a characterization of Pareto type I data, even in cases where it is theoretically well-defined.<span style="mso-spacerun: yes;"> </span>Specifically, this plot compares the four characterizations I discuss here – the mean (more precisely known as the “arithmetic mean” to distinguish it from the other means considered here), the median, the geometric mean, and the harmonic mean – for 1000 statistically independent Pareto type I data sequences, each of length N = 400, with parameters <i style="mso-bidi-font-style: normal;">k = 1</i> and <i style="mso-bidi-font-style: normal;">a = 2.0</i>.<span style="mso-spacerun: yes;"> </span>For this example, the mean is well-defined (specifically, it is equal to 2), but compared with the other data characterizations, its variability is much greater, reflecting the more serious impact of this distribution’s infinite variance on the mean than on these other data characterizations.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-IPXsTMe5thU/TnTL7h31dMI/AAAAAAAAAD4/RMWinyP9vQU/s1600/ParetoIFig09.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" 
rba="true" src="http://4.bp.blogspot.com/-IPXsTMe5thU/TnTL7h31dMI/AAAAAAAAAD4/RMWinyP9vQU/s320/ParetoIFig09.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">To give a more complete view of the extreme variability of the arithmetic mean, the boxplots below summarize the means computed from 1000 statistically independent samples drawn from each of the four Pareto type I distributions plotted above.<span style="mso-spacerun: yes;"> </span>As before, each sample is of size N = 400 and the parameter <i style="mso-bidi-font-style: normal;">k</i> has the value 1, but here the computed arithmetic means are shown for the parameter values a = 0.5, 1.0, 1.5, and 2.0; note the log scale used here because the range of computed means is so large.<span style="mso-spacerun: yes;"> </span>For the first two of these examples, the population mean does not exist, so it is not surprising that the computed values span such an enormous range, but even when the mean is well-defined, the influence of the infinite variance of these cases is clearly evident.<span style="mso-spacerun: yes;"> </span>It may be argued that infinite variance is an extreme phenomenon, but it is worth emphasizing here that for the specific “long tail” distributions popular in many applications, the decay rate is sufficiently slow for the variance – and sometimes even the mean – to be infinite, as in these examples.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-uAabSu1vdws/TnTMME2enxI/AAAAAAAAAD8/PVl2_aeqXfk/s1600/ParetoIFig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" rba="true" 
src="http://3.bp.blogspot.com/-uAabSu1vdws/TnTMME2enxI/AAAAAAAAAD8/PVl2_aeqXfk/s320/ParetoIFig03.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">As I have noted several times in previous posts, the median is much better behaved than the mean, so much so that it is well-defined for any proper distribution.<span style="mso-spacerun: yes;"> </span>One of the advantages of the Pareto type I distribution is that the form of the density function is simple enough that the median of the distribution can be computed explicitly from the distribution parameters.<span style="mso-spacerun: yes;"> </span>This result is given in the fabulous book by <a href="http://www.amazon.com/Continuous-Univariate-Distributions-Probability-Statistics/dp/0471584959/ref=sr_1_1?s=books&ie=UTF8&qid=1316277338&sr=1-1">Johnson, Kotz and Balakrishnan</a> that I have mentioned previously, which devotes an entire chapter (Chapter 20) to the Pareto family of distributions.<span style="mso-spacerun: yes;"> </span>Specifically, the median of the Pareto type I distribution with parameters <i style="mso-bidi-font-style: normal;">k</i> and <i style="mso-bidi-font-style: normal;">a</i> is given by:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 2;"> </span>Median = 2<sup>1/a</sup>k</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Thus, for the four examples considered here, the median values are 4.0 (for a = 0.5), 2.0 (for a = 1.0), 1.587 (for a = 1.5), and 1.414 (for a = 2.0).<span style="mso-spacerun: yes;"> </span>Boxplot summaries for the same 1000 random samples considered above are shown in the plot below, which also includes horizontal dotted lines at these theoretical median values for the four distributions.<span style="mso-spacerun: 
yes;"> </span>The fact that these lines correspond closely with the median lines in the boxplots gives an indication that the computed median is, on average, in good agreement with the correct values it is attempting to estimate.<span style="mso-spacerun: yes;"> </span>As in the case of the arithmetic means, the variability of these estimates decreases monotonically as <em>a</em> increases, corresponding to the fact that the distribution becomes generally better-behaved as the <i style="mso-bidi-font-style: normal;">a</i> parameter increases.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-Q0nHedEppms/TnTM4jen5fI/AAAAAAAAAEA/YWd432AzrBg/s1600/ParetoIFig04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" rba="true" src="http://1.bp.blogspot.com/-Q0nHedEppms/TnTM4jen5fI/AAAAAAAAAEA/YWd432AzrBg/s320/ParetoIFig04.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The <i style="mso-bidi-font-style: normal;">geometric mean</i> is an alternative characterization to the more familiar arithmetic mean, one that is well-defined for any sequence of positive numbers.<span style="mso-spacerun: yes;"> </span>Specifically, the geometric mean of <i style="mso-bidi-font-style: normal;">N</i> positive numbers is defined as the <i style="mso-bidi-font-style: normal;">N<sup>th</sup></i> root of their product.<span style="mso-spacerun: yes;"> </span>Equivalently, the geometric mean may be computed by exponentiating the arithmetic average of the log-transformed values.<span style="mso-spacerun: yes;"> </span>In the case of the Pareto type I distribution, the utility of the geometric mean is closely related to the fact that the log transformation converts a 
Pareto-distributed random variable into an exponentially distributed one, a point that I will discuss further in a later post on data transformations.<span style="mso-spacerun: yes;"> </span>(These transformations are the topic of Chapter 12 of <em>Exploring Data</em>, where I briefly discuss both the logarithmic transformation on which the geometric mean is based and the reciprocal transformation on which the harmonic mean is based, described next.)<span style="mso-spacerun: yes;"> </span>The key point here is that the following simple expression is available for the geometric mean of the Pareto type I distribution (Johnson, Kotz, and Balakrishnan, page 577):</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 2;"> </span>Geometric Mean = k exp(1/a)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">For the four specific examples considered here, these geometric mean values are approximately 7.389 (for a = 0.5), 2.718 (for a = 1.0), 1.948 (for a = 1.5), and 1.649 (for a = 2.0).<span style="mso-spacerun: yes;"> </span>The boxplots shown below summarize the range of variation seen in the computed geometric means for the same 1000 statistically independent samples considered above.<span style="mso-spacerun: yes;"> </span>Again, the horizontal dotted lines indicate the correct values for each distribution, and it may be seen that the computed values are in good agreement, on average.<span style="mso-spacerun: yes;"> </span>As before, the variability of these computed values decreases with increasing <i style="mso-bidi-font-style: normal;">a </i>values as the distribution becomes better-behaved.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a 
href="http://1.bp.blogspot.com/-oFTSSZfIUNc/TnTNzo_0TAI/AAAAAAAAAEE/ivn91x51gdI/s1600/ParetoIFig06.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" rba="true" src="http://1.bp.blogspot.com/-oFTSSZfIUNc/TnTNzo_0TAI/AAAAAAAAAEE/ivn91x51gdI/s320/ParetoIFig06.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The fourth characterization considered here is the <i style="mso-bidi-font-style: normal;">harmonic mean</i>, again appropriate to positive values, and defined as the reciprocal of the average of the reciprocal data values.<span style="mso-spacerun: yes;"> </span>In the case of the geometric mean just discussed, the log transformation on which it is based is often useful in improving the distributional character of data values that span a wide range.<span style="mso-spacerun: yes;"> </span>In the case of the Pareto type I distribution – and a number of others – the reciprocal transformation on which the harmonic mean is based also improves the behavior of the data distribution, but for many other distributions it does not.<span style="mso-spacerun: yes;"> </span>In particular, the reciprocal transformation often makes the character of a data distribution much worse: applied to the extremely well-behaved standard uniform distribution, it yields the Pareto type I distribution with a = 1, for which none of the integer moments exist; similarly, applied to the Gaussian distribution, the reciprocal transformation yields a distribution that is bimodal and has infinite variance.<span style="mso-spacerun: yes;"> </span>(A little thought suggests that the reciprocal transformation is inappropriate for the Gaussian distribution because it is not strictly positive, but normality is a favorite working assumption, sometimes applied to the denominators of ratios, leading to a number of theoretical 
difficulties.<span style="mso-spacerun: yes;"> </span>I will have more to say about that in a future post.)<span style="mso-spacerun: yes;"> </span>For the case of the Pareto type I distribution, the reciprocal transformation converts it into the extremely well-behaved beta distribution, and the harmonic mean has the following simple expression:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>Harmonic mean = k(1 + a<sup>-1</sup>)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">For the four examples considered here, this expression yields harmonic mean values of 3 (for a = 0.5), 2 (for a = 1.0), 1.667 (for a = 1.5), and 1.5 (for a = 2.0).<span style="mso-spacerun: yes;"> </span>Boxplot summaries of the computed harmonic means for the 1000 simulations of each case considered previously are shown below, again with dotted horizontal lines at the theoretical values for each case.<span style="mso-spacerun: yes;"> </span>As with both the median and the geometric mean, it is clear from these plots that the computed values are correct on average, and their variability decreases with increasing values of the <i style="mso-bidi-font-style: normal;">a</i> parameter.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-pDmiBtXFO0Y/TnTOLsBqh4I/AAAAAAAAAEI/zAuxxfiTnCw/s1600/ParetoIFig08.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" rba="true" src="http://1.bp.blogspot.com/-pDmiBtXFO0Y/TnTOLsBqh4I/AAAAAAAAAEI/zAuxxfiTnCw/s320/ParetoIFig08.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 
0pt;">The key point of this post has been to show that, while averages are not suitable characterizations for “long tailed” phenomena that are becoming an increasing subject of interest in many different fields, useful alternatives do exist.<span style="mso-spacerun: yes;"> </span>For the case of the Pareto type I distribution considered here, these alternatives include the popular median, along with the somewhat less well-known geometric and harmonic means.<span style="mso-spacerun: yes;"> </span>In an upcoming post, I will examine the utility of these characterizations for the Zipf distribution.</div>Ron Pearson (aka TheNoodleDoodler)http://www.blogger.com/profile/15693640298594791682noreply@blogger.com2tag:blogger.com,1999:blog-9179325420174899779.post-12618532900006353532011-08-27T13:46:00.000-07:002011-08-27T13:46:16.463-07:00Some Additional Thoughts on Useless Averages<div class="MsoNormal" style="margin: 0in 0in 0pt;">In my last post, I described three situations where the average of a sequence of numbers is not representative enough to be useful: in the presence of severe outliers, in the face of multimodal data distributions, and in the face of infinite-variance distributions.<span style="mso-spacerun: yes;"> </span>The post generated three interesting comments that I want to respond to here.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">First and foremost, I want to say thanks to all of you for giving me something to think about further, leading me in some interesting new directions.<span style="mso-spacerun: yes;"> </span>First, <strong>chrisbeeleyimh</strong> had the following to say:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span></div><blockquote>“I seem to have rather abandoned means and medians in favor of drawing the distribution all the time, which baffles my 
colleagues somewhat.”</blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Chris also maintains a collection of data examples where the mean is the same but the shape is very different.<span style="mso-spacerun: yes;"> </span>In fact, one of the points I illustrate in Section 4.4.1 of <span><a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650?ie=UTF8&tag=widgetsamazon-20&link_code=btl&camp=213689&creative=392969" target="_blank">Exploring Data in Engineering, the Sciences, and Medicine</a><img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&l=btl&camp=213689&creative=392969&o=1&a=0195089650" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /></span> is that there are cases where not only the means but <em>all </em>of the moments (i.e., variance, skewness, kurtosis, etc.) are identical but the distributions are profoundly different.<span style="mso-spacerun: yes;"> </span>A specific example is taken from the book <span><a href="http://www.amazon.com/Counterexamples-Probability-2nd-Jordan-Stoyanov/dp/0471965383?ie=UTF8&tag=widgetsamazon-20&link_code=btl&camp=213689&creative=392969" target="_blank">Counterexamples in Probability, 2nd Edition</a><img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&l=btl&camp=213689&creative=392969&o=1&a=0471965383" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /></span> by J.M. 
Stoyanov, who shows that if the lognormal density is multiplied by the following function:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 2;"> </span></div><blockquote>f(x) = 1 + A sin(2 pi ln x),</blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">for any constant A between -1 and +1, the moments are unchanged.<span style="mso-spacerun: yes;"> </span>The character of the distribution is changed profoundly, however, as the following plot illustrates (this plot is similar to Fig. 4.8 in <em>Exploring Data,</em> which shows the same two distributions, but for A = 0.5 instead of A = 0.9, as shown here).<span style="mso-spacerun: yes;"> </span>To be sure, this behavior is pathological – distributions that have finite support, for example, are defined uniquely by their complete set of moments – but it does make the point that moment characterizations are not always complete, even if an infinite number of them are available.<span style="mso-spacerun: yes;"> </span>Within well-behaved families of distributions (such as the one proposed by Karl Pearson in 1895), a complete characterization is possible on the basis of the first few moments, which is one reason for the historical popularity of the method of moments for fitting data to distributions.<span style="mso-spacerun: yes;"> </span>It is important to recognize, however, that moments do have their limitations and that the first moment alone – i.e., the mean by itself – is almost never a complete characterization.<span style="mso-spacerun: yes;"> </span>(I am forced to say “almost” here because if we impose certain very strong distributional assumptions – e.g., the Poisson or binomial distributions – the specific distribution considered may be fully characterized by its mean.<span style="mso-spacerun: yes;"> </span>This raises the question, 
however, of whether this distributional assumption is adequate.<span style="mso-spacerun: yes;"> </span>My experience has been that, no matter how firmly held the belief in a particular distribution is, exceptions do arise in practice … overdispersion, anyone?)<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-9SsouQJ0FHo/TllFIR9ESOI/AAAAAAAAADk/mQLCGfSQdH8/s1600/MoreUselessFig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" qaa="true" src="http://2.bp.blogspot.com/-9SsouQJ0FHo/TllFIR9ESOI/AAAAAAAAADk/mQLCGfSQdH8/s320/MoreUselessFig01.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The plot below provides a further illustration of the inadequacy of the mean as a sole data characterization, comparing four different members of the family of beta distributions.<span style="mso-spacerun: yes;"> </span>These distributions – in the standard form assumed here – describe variables whose values range from 0 to 1, and they are defined by two parameters, p and q, that determine the shape of the density function and all moments of the distribution.<span style="mso-spacerun: yes;"> </span>The mean of the beta distribution is equal to p/(p+q), so if p = q – corresponding to the class of symmetric beta distributions – the mean is ½, regardless of the common value of these parameters.<span style="mso-spacerun: yes;"> </span>The four plots below show the corresponding distributions when both parameters are equal to 0.5 (upper left, the arcsin distribution I discussed last time), 1.0 (upper right, the uniform distribution), 1.5 (lower left), and 8.0 (lower right).<span style="mso-spacerun: yes;"> </span></div><div 
class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/--uLGduKNCtY/TllFpqRxinI/AAAAAAAAADo/OVGP_ZITwL8/s1600/MoreUselessFig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" qaa="true" src="http://3.bp.blogspot.com/--uLGduKNCtY/TllFpqRxinI/AAAAAAAAADo/OVGP_ZITwL8/s320/MoreUselessFig02.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The second comment on my last post was from <strong>Efrique</strong>, who suggested the Student’s t-distribution with 2 degrees of freedom as a better infinite-variance example than the Cauchy example I used (corresponding to Student’s t-distribution with one degree of freedom), because the first moment doesn’t even exist for the Cauchy distribution (“there’s nothing to converge to”).<span style="mso-spacerun: yes;"> </span>The figure below expands the boxplot comparison I presented last time, comparing the means, medians, and modes (from the <strong>modeest</strong> package) for both of these infinite-variance examples: the Cauchy distribution I discussed last time and the Student’s t-distribution with two degrees of freedom that Efrique suggested.<span style="mso-spacerun: yes;"> </span>Here, the same characterization (mean, median, or mode) is summarized for both distributions in side-by-side boxplots to facilitate comparisons.<span style="mso-spacerun: yes;"> </span>It is clear from these boxplots that the results for the median and the mode are essentially identical for these distributions, but the results for the mean differ dramatically (recall that these results are truncated for the Cauchy distribution: 13.6% of the 1000 computed means fell outside the +/- 5 range shown here, exhibiting values approaching +/- 1000).<span 
style="mso-spacerun: yes;"> </span>This difference illustrates Efrique’s further point that the mean of the data values is a consistent estimator of the (well-defined) population mean of the Student’s t-distribution with 2 degrees of freedom, while it is not a consistent estimator for the Cauchy distribution.<span style="mso-spacerun: yes;"> </span>Still, it is also clear from this plot that the mean is substantially more variable for the Student’s t-distribution with 2 degrees of freedom than either the median or the <strong>modeest</strong> mode estimate.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-mBKzlxuO3Lo/TllGgZGwsCI/AAAAAAAAADs/L2mdVfDwpo4/s1600/MoreUselessFig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" qaa="true" src="http://3.bp.blogspot.com/-mBKzlxuO3Lo/TllGgZGwsCI/AAAAAAAAADs/L2mdVfDwpo4/s320/MoreUselessFig03.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Another example of an infinite-variance distribution where the mean is well-defined but highly variable is the Pareto type I distribution, discussed in Section 4.5.8 of <em>Exploring Data</em>.<span style="mso-spacerun: yes;"> </span>My favorite reference on distributions is the two-volume set by Johnson, Kotz, and Balakrishnan (<span><a href="http://www.amazon.com/Continuous-Univariate-Distributions-Probability-Statistics/dp/0471584959?ie=UTF8&tag=widgetsamazon-20&link_code=btl&camp=213689&creative=392969" target="_blank">Continuous Univariate Distributions, Vol. 
1 (Wiley Series in Probability and Statistics)</a><img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&l=btl&camp=213689&creative=392969&o=1&a=0471584959" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /> and <span><a href="http://www.amazon.com/Continuous-Univariate-Distributions-Probability-Statistics/dp/0471584940?ie=UTF8&tag=widgetsamazon-20&link_code=btl&camp=213689&creative=392969" target="_blank">Continuous Univariate Distributions, Vol. 2 (Wiley Series in Probability and Statistics)</a>)<img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&l=btl&camp=213689&creative=392969&o=1&a=0471584940" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /></span></span>, who devote an entire 55-page chapter (Chapter 20 in Volume 1) to the Pareto distribution, noting that it is named after Vilfredo Pareto, a mid nineteenth- to early twentieth-century Italian-born Swiss professor of economics, who proposed it as a description of the distribution of income over a population.<span style="mso-spacerun: yes;"> </span>In fact, there are several different distributions named after Pareto; the type I distribution considered here exhibits a power-law decay like the Student’s t-distributions, but it is a J-shaped distribution whose mode is equal to its minimum value.<span style="mso-spacerun: yes;"> </span>More specifically, this distribution is defined by a location parameter that determines this minimum value and a shape parameter that determines how rapidly the tail decays for values larger than this 
minimum.<span style="mso-spacerun: yes;"> </span>The example considered here takes this minimum value as 1 and the shape parameter as 1.5, giving a distribution with a finite mean but an infinite variance.<span style="mso-spacerun: yes;"> </span>As in the above example, the boxplot summary shown below characterizes the mean, median, and mode for 1000 statistically independent random samples drawn from this distribution, each of size N = 100.<span style="mso-spacerun: yes;"> </span>As before, it is clear from this plot that the mean is far more variable than either the median or the mode.<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-xaz314FpZmU/TllHOi1mdEI/AAAAAAAAADw/_iCdakolo68/s1600/MoreUselessFig04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" qaa="true" src="http://3.bp.blogspot.com/-xaz314FpZmU/TllHOi1mdEI/AAAAAAAAADw/_iCdakolo68/s320/MoreUselessFig04.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">In this case, however, we have the added complication that since this distribution is not symmetric, its mean, median and mode do not coincide.<span style="mso-spacerun: yes;"> </span>In fact, the population mode is the minimum value (which is 1 here), corresponding to the solid line at the bottom of the plot.<span style="mso-spacerun: yes;"> </span>The narrow range of the boxplot values around this correct value suggests that the <strong>modeest</strong> package is reliably estimating this mode value, but as I noted in my last post, this characterization is not useful here because it tells us nothing about the rate at which the density decays.<span style="mso-spacerun: yes;"> </span>The 
theoretical median value can also be calculated easily for this distribution, and here it is approximately equal to 1.587, corresponding to the dashed horizontal line in the plot.<span style="mso-spacerun: yes;"> </span>As with the mode, it is clear from the boxplot that the median estimated from the data is in generally excellent agreement with this value.<span style="mso-spacerun: yes;"> </span>Finally, the mean value for this particular distribution is 3, corresponding to the dotted horizontal line in the plot.<span style="mso-spacerun: yes;"> </span>Since this line lies fairly close to the upper quartile of the computed means (i.e., the top of the “box” in the boxplot), it follows that the estimated mean falls below the correct value almost 75% of the time, but it is also clear that when the mean is overestimated, the extent of this overestimation can be very large.<span style="mso-spacerun: yes;"> </span>Motivated in part by the fact that the mean doesn’t always exist for the Pareto distribution, Johnson, Kotz and Balakrishnan note in their chapter on these distributions that alternative location measures have been considered, including both the geometric and harmonic means.<span style="mso-spacerun: yes;"> </span>I will examine these ideas further in a future post.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Finally, <strong>klr</strong> mentioned my post on useless averages in his blog <a href="http://timelyportfolio.blogspot.com/">TimelyPortfolio</a>, where he discusses alternatives to the moving average in characterizing financial time-series.<span style="mso-spacerun: yes;"> </span>For the case he considers, klr compares a 10-month moving average, the corresponding moving median, and a number of the corresponding mode estimators from the <strong>modeest</strong> package.<span style="mso-spacerun: yes;"> </span>This is a very interesting avenue of exploration for me since it is 
closely related to the median filter and other nonlinear digital filters that can be very useful in cleaning noisy time-series data.<span style="mso-spacerun: yes;"> </span>I discuss a number of these ideas – including moving-window extensions of other data characterizations like skewness and kurtosis – in my book <span><a href="http://www.amazon.com/Mining-Imperfect-Data-Contamination-Incomplete/dp/0898715822?ie=UTF8&tag=widgetsamazon-20&link_code=btl&camp=213689&creative=392969" target="_blank">Mining Imperfect Data: Dealing with Contamination and Incomplete Records</a><img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&l=btl&camp=213689&creative=392969&o=1&a=0898715822" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" />. </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Again, thanks to all of you for your comments.<span style="mso-spacerun: yes;"> </span>You have given me much to think about and investigate further, which is one of the joys of doing this blog.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div>Ron Pearson (aka TheNoodleDoodler)http://www.blogger.com/profile/15693640298594791682noreply@blogger.com1tag:blogger.com,1999:blog-9179325420174899779.post-41467424617327171532011-08-20T08:21:00.000-07:002011-08-20T08:21:05.423-07:00When are averages useless?<div class="MsoNormal" style="margin: 0in 0in 0pt;">Of all possible single-number characterizations of a data sequence, the average is probably the best known.<span style="mso-spacerun: yes;"> </span>It is also easy to compute and in 
favorable cases, it provides a useful characterization of “the typical value” of a sequence of numbers.<span style="mso-spacerun: yes;"> </span>It is not the only such “typical value,” however, nor is it always the most useful one: two other candidates – location estimators in statistical terminology – are the median and the mode, both of which are discussed in detail in Section 4.1.2 of <span><a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650?ie=UTF8&tag=widgetsamazon-20&link_code=btl&camp=213689&creative=392969" target="_blank">Exploring Data in Engineering, the Sciences, and Medicine</a><img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&l=btl&camp=213689&creative=392969&o=1&a=0195089650" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /></span>.<span style="mso-spacerun: yes;"> </span>Like the average, these alternative location estimators are not always “fully representative,” but they do represent viable alternatives – at least sometimes – in cases where the average is sufficiently non-representative as to be effectively useless.<span style="mso-spacerun: yes;"> </span>As the title of this post suggests, the focus here is on those cases where the mean doesn’t really tell us what we want to know about a data sequence, briefly examining why this happens and what we can do about it.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-bDidkJnPnX4/Tk_CdddskeI/AAAAAAAAADU/BFEJDtmip7U/s1600/UselessFig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" qaa="true" 
src="http://2.bp.blogspot.com/-bDidkJnPnX4/Tk_CdddskeI/AAAAAAAAADU/BFEJDtmip7U/s320/UselessFig01.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">First, it is worth saying a few words about the two alternatives just mentioned: the median and the mode.<span style="mso-spacerun: yes;"> </span>Of these, the mode is both the more difficult to estimate and the less broadly useful.<span style="mso-spacerun: yes;"> </span>Essentially, “the mode” corresponds to “the location of the peak in the data distribution.”<span style="mso-spacerun: yes;"> </span>One difficulty with this somewhat loose definition is that “the mode” is not always well-defined.<span style="mso-spacerun: yes;"> </span>The above collection of plots shows three examples where the mode is not well-defined, and another where the mode is well-defined but not particularly useful.<span style="mso-spacerun: yes;"> </span>The upper left plot shows the density of the uniform distribution on the range [1,2]: there, the density is constant over the entire range, so there is no single, well-defined “peak” or unique maximum to serve as a mode for this distribution.<span style="mso-spacerun: yes;"> </span>The upper right plot shows a nonparametric density estimate for the Old Faithful geyser waiting time data that I have discussed in several of my recent posts (the <em>R</em> data object <strong>faithful</strong>).<span style="mso-spacerun: yes;"> </span>Here, the difficulty is that there are not one but two modes, so “the mode” is not well-defined here, either: we must discuss “the modes.”<span style="mso-spacerun: yes;"> </span>The same behavior is observed for the <em>arcsin distribution</em>, whose density is shown in the lower left plot in the above figure.<span style="mso-spacerun: yes;"> </span>This density corresponds to the 
beta distribution with shape parameters both equal to ½, giving a bimodal distribution whose cumulative probability function can be written simply in terms of the arcsin function, motivating its name (see Section 4.5.1 of <em>Exploring Data</em> for a more complete discussion of both the beta distribution family and the special case of the arcsin distribution).<span style="mso-spacerun: yes;"> </span>In this case, the two modes of the distribution occur at the extremes of the data, at x = 1 and x = 2.<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The second difficulty with the mode noted above is that it is sometimes well-defined but not particularly useful.<span style="mso-spacerun: yes;"> </span>The case of the J-shaped exponential density shown in the lower right plot above illustrates this point: this distribution exhibits a single, well-defined peak at the minimum value x = 0.<span style="mso-spacerun: yes;"> </span>Here, you don’t even have to look at the data to arrive at this result, which therefore tells you nothing about the data distribution: this density is described by a single parameter that determines how slowly or rapidly the distribution decays and the mode is independent of this parameter. 
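This claim about the exponential mode is easy to check numerically. The sketch below uses only base R, with rate values and an evaluation grid chosen arbitrarily for illustration (they are assumptions, not values taken from the figure above). It locates the peak of the exponential density for each rate and compares it with the mean and median, both of which do change with the rate:

```r
# Illustrative rate parameters (assumed values, chosen only for this sketch)
rates = c(0.5, 1, 2, 5)

# Grid of non-negative x values on which each density is evaluated
x = seq(0, 10, by = 0.01)

# Location of the density peak (the mode) for each rate
modes = sapply(rates, function(r) x[which.max(dexp(x, rate = r))])

# Mean and median of the exponential distribution both scale as 1/rate
means = 1 / rates
medians = log(2) / rates

print(data.frame(rate = rates, mode = modes, mean = means, median = medians))
```

The mode column is zero for every rate, while the mean and median columns shrink as the rate increases.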
Despite these limitations, there are cases where the mode represents an extremely useful data characterization, even though it is much harder to estimate than the mean or the median.<span style="mso-spacerun: yes;"> </span>Fortunately, there is a nice package available in <em>R</em> to address this problem: the <strong>modeest</strong> package provides 11 different mode estimation procedures.<span style="mso-spacerun: yes;"> </span>I will illustrate one of these in the examples that follow – the half-range mode estimator of Bickel – and I will give a more complete discussion of this package in a later post.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The median is a far better-known data characterization than the mode, and it is both much easier to estimate and much more broadly applicable.<span style="mso-spacerun: yes;"> </span>In particular, unlike either the mean or the mode, the median is well-defined for <em>any</em> proper data distribution, a result demonstrated in Section 4.1.2 of <em>Exploring Data</em>.<span style="mso-spacerun: yes;"> </span>Conceptually, computing the median only requires sorting the N data values from smallest to largest and then either taking the middle element of this sorted list (if N is odd) or averaging the middle two elements (if N is even).<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The mean is, of course, both the easiest of these characterizations to compute – simply add the N data values and divide by N – and unquestionably the best known.<span style="mso-spacerun: yes;"> </span>There are, however, at least three situations where the mean can be so highly non-representative as to be useless:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><span style="mso-list: Ignore;"><blockquote><div class="MsoNormal" 
style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.25in;"><span style="mso-list: Ignore;">1.<span style="font: 7pt "Times New Roman";"> </span></span>if severe outliers are present;</div><div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.25in;"><span style="mso-list: Ignore;">2.<span style="font: 7pt "Times New Roman";"> </span></span>if the distribution is multi-modal;</div><div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.25in;"><span style="mso-list: Ignore;">3.<span style="font: 7pt "Times New Roman";"> </span></span>if the distribution has infinite variance.</div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.25in;"></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The rest of this post examines each of these cases in turn.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">I have discussed the problem of outliers before, but they are an important enough problem in practice to bear repeating.<span style="mso-spacerun: yes;"> </span>(I devote all of Chapter 7 to this topic in <em>Exploring Data</em>.)<span style="mso-spacerun: yes;"> </span>The plot below shows the makeup flow rate dataset, available from the companion website for <em>Exploring Data</em> (the dataset is <strong>makeup.csv</strong>, available on the <a href="http://www.oup.com/us/companion.websites/9780195089653/rprogram">R programs and datasets page</a>).<span style="mso-spacerun: yes;"> </span>This dataset consists of 2,589 successive measurements of the flow rate of a fluid stream in an industrial manufacturing process.<span style="mso-spacerun: yes;"> </span>The points in this plot show two distinct forms of behavior: those with values on the order 
of 400 represent measurements made during normal process operation, while those with values less than about 300 correspond to measurements made when the process is shut down (these values are approximately zero) or is in the process of being either shut down or started back up.<span style="mso-spacerun: yes;"> </span>The three lines in this plot correspond to the mean (the solid line at approximately 315), the median (the dotted line at approximately 393), and the mode (the dashed line at approximately 403, estimated using the “hrm” method in the <strong>modeest</strong> package).<span style="mso-spacerun: yes;"> </span>As I have noted previously, the mean in this case represents a useful line of demarcation between the normal operation data (those points above the mean, representing 77.6% of the data) and the shutdown segments (those points below the mean, representing 22.4% of the data).<span style="mso-spacerun: yes;"> </span>In contrast, both the median and the specific mode estimator used here provide much better characterizations of the normal operating data.<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-CnPhxb-jIgE/Tk_H08ceYNI/AAAAAAAAADY/IIByhlS_LcI/s1600/UselessFig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" qaa="true" src="http://3.bp.blogspot.com/-CnPhxb-jIgE/Tk_H08ceYNI/AAAAAAAAADY/IIByhlS_LcI/s320/UselessFig02.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The next plot below shows a nonparametric density estimate of the <place w:st="on">Old Faithful</place> geyser waiting data I discussed in my last few posts.<span style="mso-spacerun: yes;"> </span>The solid vertical line at 
70.90 corresponds to the mean value computed from the complete dataset.<span style="mso-spacerun: yes;"> </span>It has been said that a true compromise is an agreement that makes all parties equally unhappy, and this seems a reasonable description of the mean here: the value lies about mid-way between the two peaks in this distribution, centered at approximately 55 and 80; in fact, this value lies fairly close to the trough between the peaks in this density estimate.<span style="mso-spacerun: yes;"> </span>(The situation is even worse for the arcsin density discussed above: there, the two modes occur at values of 1 and 2, while the mean falls equidistant from both at 1.5, arguably the “least representative” value in the whole data range.)<span style="mso-spacerun: yes;"> </span>The median waiting time value is 76, corresponding to the dotted line just to the left of the main peak at about 80, and the mode (again, computed using the package <strong>modeest</strong> with the “hrm” method) corresponds to the dashed line at 83, just to the right of the main peak.<span style="mso-spacerun: yes;"> </span>The basic difficulty here is that all of these location estimators are inherently inadequate since they are attempting to characterize “the representative value” of a data sequence that has “two representative values:” one representing the smaller peak at around 55 and the other representing the larger peak at around 80.<span style="mso-spacerun: yes;"> In this case, both the median and the mode do a better job of characterizing the larger of the two peaks in the distribution (but not a great job), although such a partial characterization is not always what we want. 
</span>This type of behavior is exactly what the mixture models I discussed in my last few posts are intended to describe.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-aFToge7EWtc/Tk_IdMboJOI/AAAAAAAAADc/CRKirO7Nh0s/s1600/UselessFig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" qaa="true" src="http://2.bp.blogspot.com/-aFToge7EWtc/Tk_IdMboJOI/AAAAAAAAADc/CRKirO7Nh0s/s320/UselessFig03.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">To illustrate the third situation where the mean is essentially useless, consider the Cauchy distribution, corresponding to the Student’s t distribution with one degree of freedom.<span style="mso-spacerun: yes;"> </span>This is probably the best-known infinite-variance distribution there is, and it is often used as an extreme example because it causes a lot of estimation procedures to fail.<span style="mso-spacerun: yes;"> </span>The plot below is a (truncated) boxplot comparison of the values of the mean, median, and mode computed from 1000 independently generated Cauchy random number sequences, each of length N = 100.<span style="mso-spacerun: yes;"> </span>It is clear from these boxplots that the variability of the mean is much greater than that of either the median or the mode, the latter again estimated from the data using the half-range mode (hrm) method in the <strong>modeest</strong> package.<span style="mso-spacerun: yes;"> </span>One of the consequences of working with heavy-tailed distributions like the Cauchy, for which the variance is infinite and the mean itself is undefined, is that the sample mean is no longer a consistent location estimator: its variability does not shrink toward zero in the limit of large sample 
sizes.<span style="mso-spacerun: yes;"> </span>In fact, the Cauchy distribution is one of the examples I discuss in Chapter 6 of <em>Exploring Data</em> as a counterexample to the Central Limit Theorem: for most data distributions, the distribution of the mean approaches a Gaussian limit with a variance that decreases inversely with the sample size N, but for the Cauchy distribution, the distribution of the mean is exactly the same as that of the data itself.<span style="mso-spacerun: yes;"> </span>In other words, for the Cauchy distribution, averaging a collection of N numbers does not reduce the variability at all.<span style="mso-spacerun: yes;"> </span>This is exactly what we are seeing here, although the plot below doesn’t show how bad the situation really is: the smallest value of the mean in this sequence of 1000 estimates is -798.97 and the largest value is 928.85.<span style="mso-spacerun: yes;"> </span>In order to see any detail at all in the distribution of the median and mode values, it was necessary to restrict the range of the boxplots shown here to lie between -5 and +5, which eliminated 13.6% of the computed mean values.<span style="mso-spacerun: yes;"> </span>In contrast, the median is known to be a reasonably good location estimator for the Cauchy distribution (see Section 6.6.1 of <em>Exploring Data</em> for a further discussion of this point), and the results presented here suggest that Bickel’s half-range mode estimator is also a reasonable candidate.<span style="mso-spacerun: yes;"> </span>The main point here is that the mean is a completely unreasonable estimator in situations like this one, an important point in view of the growing interest in data models like the infinite-variance Zipf distribution to describe “long-tailed” phenomena in business.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a 
href="http://1.bp.blogspot.com/-uwnQfCezko8/Tk_JXm5QSOI/AAAAAAAAADg/pTyXC8kq8iI/s1600/UselessFig04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" qaa="true" src="http://1.bp.blogspot.com/-uwnQfCezko8/Tk_JXm5QSOI/AAAAAAAAADg/pTyXC8kq8iI/s320/UselessFig04.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">I will have more to say about both the <strong>modeest</strong> package and Zipf distributions in upcoming posts.</div></span>Ron Pearson (aka TheNoodleDoodler)http://www.blogger.com/profile/15693640298594791682noreply@blogger.com3tag:blogger.com,1999:blog-9179325420174899779.post-26562012778515765402011-08-06T14:23:00.000-07:002011-08-06T14:23:22.895-07:00Fitting mixture distributions with the R package mixtools<div class="MsoNormal" style="margin: 0in 0in 0pt;">My last two posts have been about mixture models, with examples to illustrate what they are and how they can be useful.<span style="mso-spacerun: yes;"> </span>Further discussion and more examples can be found in Chapter 10 of <span><a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650?ie=UTF8&tag=widgetsamazon-20&link_code=btl&camp=213689&creative=392969" target="_blank">Exploring Data in Engineering, the Sciences, and Medicine</a></span>.<span style="mso-spacerun: yes;"> </span>One important topic I haven’t covered is how to fit mixture models to datasets like the <place w:st="on">Old Faithful</place> geyser data that I have discussed previously: a nonparametric density plot gives fairly compelling evidence for a bimodal distribution, but how do you estimate the parameters of a mixture model that describes these two modes?<span style="mso-spacerun: yes;"> </span>For a finite Gaussian mixture distribution, one way is by trial and error, first estimating the centers of 
the peaks by eye in the density plot (these become the component means), and adjusting the standard deviations and mixing percentages to approximately match the peak widths and heights, respectively.<span style="mso-spacerun: yes;"> </span>This post considers the more systematic alternative of estimating the mixture distribution parameters using the <strong>mixtools</strong> package in <em>R</em>.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The <strong>mixtools</strong> package is one of several available in <em>R</em> to fit mixture distributions or to solve the closely related problem of model-based clustering.<span style="mso-spacerun: yes;"> </span>Further, <strong>mixtools</strong> includes a variety of procedures for fitting mixture models of different types.<span style="mso-spacerun: yes;"> </span>This post focuses on one of these – the <strong>normalmixEM</strong> procedure for fitting normal mixture densities – and applies it to two simple examples, starting with the <place w:st="on">Old Faithful</place> dataset mentioned above.<span style="mso-spacerun: yes;"> </span>A much more complete and thorough discussion of the <strong>mixtools</strong> package – which also discusses its application to the <place w:st="on">Old Faithful</place> dataset – is given in the <em>R</em> package vignette, <a href="http://cran.r-project.org/web/packages/mixtools/vignettes/vignette.pdf">mixtools: An R Package for Analyzing Finite Mixture Models</a>.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-kbk_korLXMw/Tj2JMvEPiPI/AAAAAAAAADE/avAFubexWKk/s1600/mixtoolsFig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" src="http://3.bp.blogspot.com/-kbk_korLXMw/Tj2JMvEPiPI/AAAAAAAAADE/avAFubexWKk/s320/mixtoolsFig01.png" t$="true" width="320" 
/></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The above plot shows the results obtained using the <strong>normalmixEM</strong> procedure with its default parameter values, applied to the <place w:st="on">Old Faithful</place> waiting time data.<span style="mso-spacerun: yes;"> </span>Specifically, this plot was generated by the following sequence of <em>R</em> commands:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span></div><blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"> library(mixtools)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>wait = faithful$waiting</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>mixmdl = normalmixEM(wait)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>plot(mixmdl,which=2)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>lines(density(wait), lty=2, lwd=2)</div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Like many modeling tools in <em>R</em>, the <strong>normalmixEM</strong> procedure has associated plot and summary methods.<span style="mso-spacerun: yes;"> </span>In this case, the plot method displays either the log likelihood associated with each iteration of the EM fitting algorithm (more about that below), or the component densities shown above, or both.<span style="mso-spacerun: yes;"> </span>Specifying “which=1” displays only the log likelihood plot (this is the default), specifying “which = 2” displays only the density components/histogram plot shown here, and specifying “density = TRUE” 
without specifying the “which” parameter gives both plots.<span style="mso-spacerun: yes;"> </span>Note that the two solid curves shown in the above plot correspond to the individual Gaussian density components in the mixture distribution, each scaled by the estimated probability of an observation being drawn from that component distribution.<span style="mso-spacerun: yes;"> </span>The final line of <em>R</em> code above overlays the nonparametric density estimate generated by the <strong>density</strong> function with its default parameters, shown here as the heavy dashed line (obtained by specifying “lty = 2”).</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Most of the procedures in the <strong>mixtools</strong> package are based on the iterative <em>expectation maximization (EM) algorithm,</em> discussed in Section 2 of the <strong>mixtools</strong> vignette and also in Chapter 16 of <em>Exploring Data</em>.<span style="mso-spacerun: yes;"> </span>A detailed discussion of this algorithm is beyond the scope of this post – books have been devoted to the topic (see, for example, the book by McLachlan and Krishnan, <span><a href="http://www.amazon.com/Algorithm-Extensions-Wiley-Probability-Statistics/dp/0471201707?ie=UTF8&tag=widgetsamazon-20&link_code=btl&camp=213689&creative=392969" target="_blank">The EM Algorithm and Extensions (Wiley Series in Probability and Statistics)</a><img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&l=btl&camp=213689&creative=392969&o=1&a=0471201707" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /></span> ) – but the following two points are important to note here.<span style="mso-spacerun: yes;"> </span>First, the 
EM algorithm is an iterative procedure, and the time required for it to reach convergence – if it converges at all – depends strongly on the problem to which it is applied.<span style="mso-spacerun: yes;"> </span>The second key point is that because it is an iterative procedure, the EM algorithm requires starting values for the parameters, and algorithm performance can depend strongly on these initial values.<span style="mso-spacerun: yes;"> </span>The <strong>normalmixEM</strong> procedure supports both user-supplied starting values and built-in estimation of starting values if none are supplied.<span style="mso-spacerun: yes;"> </span>These built-in estimates are the default and, in favorable cases, they work quite well.<span style="mso-spacerun: yes;"> </span>The <place w:st="on">Old Faithful</place> waiting time data is a case in point – using the default starting values gives the following parameter estimates:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span></div><blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"> > mixmdl[c("lambda","mu","sigma")]</div><div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;">$lambda</div><div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;">[1] 0.3608868 0.6391132</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;">$mu</div><div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;">[1] 54.61489 80.09109</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;">$sigma</div><div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;">[1] 5.871241 5.867718</div></blockquote><div class="MsoNormal" style="margin: 
0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The mixture density described by these parameters is given by:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>p(x) = lambda[1] n(x; mu[1], sigma[1]) + lambda[2] n(x; mu[2], sigma[2])</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">where <em>n(x; mu, sigma)</em> represents the Gaussian probability density function with mean <em>mu</em> and standard deviation <em>sigma.</em></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">One reason the default starting values work well for the Old Faithful waiting time data is that if nothing is specified, the number of components (the parameter k) is set equal to 2.<span style="mso-spacerun: yes;"> </span>Thus, if you are attempting to fit a mixture model with more than two components, this number should be specified, either by setting k to some other value and not specifying any starting estimates for the parameters lambda, mu, and sigma, or by specifying a vector with k components as starting values for at least one of these parameters.<span style="mso-spacerun: yes;"> </span>(There are a number of useful options in calling the <strong>normalmixEM</strong> procedure: for example, specifying the initial sigma value as a scalar constant rather than a vector with k components forces the component variances to be equal.<span style="mso-spacerun: yes;"> </span>I won’t attempt to give a detailed discussion of these options here; for that, type “help(normalmixEM)”.)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Another important point 
about the default starting values is that, aside from the number of components k, any unspecified initial parameter estimates are selected randomly by the <strong>normalmixEM</strong> procedure.<span style="mso-spacerun: yes;"> </span>This means that, even in cases where the default starting values consistently work well – again, the <place w:st="on">Old Faithful</place> waiting time dataset seems to be such a case – the number of iterations required to obtain the final result can vary significantly from one run to the next.<span style="mso-spacerun: yes;"> </span>(Specifically, the <strong>normalmixEM</strong> procedure does not fix the seed for the random number generators used to compute these starting values, so repeated runs of the procedure with the same data will start from different initial parameter values and require different numbers of iterations to achieve convergence.<span style="mso-spacerun: yes;"> </span>In the case of the Old Faithful waiting time data, I have seen anywhere between 16 and 59 iterations required, with the final results differing only very slightly, typically in the fifth or sixth decimal place.<span style="mso-spacerun: yes;"> </span>If you want to use the same starting value on successive runs, this can be done by setting the random number seed via the <strong>set.seed</strong> command before you invoke the <strong>normalmixEM</strong> procedure.)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-lOvyRUwwCyM/Tj2Za7E91KI/AAAAAAAAADI/Ld4rDY-0gvM/s1600/mixtoolsFig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" src="http://4.bp.blogspot.com/-lOvyRUwwCyM/Tj2Za7E91KI/AAAAAAAAADI/Ld4rDY-0gvM/s320/mixtoolsFig02.png" t$="true" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br 
/></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">It is important to note that the default starting values do not always work well, even if the correct number of components is specified.<span style="mso-spacerun: yes;"> </span>This point is illustrated nicely by the following example.<span style="mso-spacerun: yes;"> </span>The plot above shows two curves: the solid line is the exact density for the three-component Gaussian mixture distribution described by the following parameters:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><span style="mso-tab-count: 1;"><blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>mu = (2.00, 5.00, 7.00)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>sigma = (1.000, 1.000, 1.000)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>lambda = (0.200, 0.600, 0.200)</div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The dashed curve in the figure is the nonparametric density estimate generated from n = 500 observations drawn from this mixture distribution.<span style="mso-spacerun: yes;"> </span>Note that the first two components of this mixture distribution are evident in both of these plots, from the density peaks at approximately 2 and 5.<span style="mso-spacerun: yes;"> </span>The third component, however, is too close to the second to yield a clear peak in either density, giving rise instead to slightly asymmetric “shoulders” on the right side of the upper peaks.<span style="mso-spacerun: yes;"> </span>The key point is that the components in this mixture distribution are difficult to distinguish from either of these density estimates, and this hints at further difficulties to come.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" 
style="margin: 0in 0in 0pt;">Applying the <strong>normalmixEM</strong> procedure to the 500 sample sequence used to generate the nonparametric density estimate shown above and specifying k = 3 gives results that are substantially more variable than the <place w:st="on">Old Faithful</place> results discussed above.<span style="mso-spacerun: yes;"> </span>In fact, to compare these results, it is necessary to be explicit about the values of the random seeds used to initialize the parameter estimation procedure.<span style="mso-spacerun: yes;"> </span>Specifying this random seed as 101 and only specifying k=3 in the <strong>normalmixEM</strong> call yields the following parameter estimates after 78 iterations:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><span style="mso-tab-count: 1;"><blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>mu = (1.77, 4.87, 5.44)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>sigma = (0.766, 0.115, 1.463)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>lambda = (0.168, 0.028, 0.803)</div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Comparing these results with the correct parameter values listed above, it is clear that some of these estimation errors are quite large.<span style="mso-spacerun: yes;"> </span>The figure shown below compares the mixture density constructed from these parameters (the heavy dashed curve) with the nonparametric density estimate computed from the data used to estimate them.<span style="mso-spacerun: yes;"> </span>The prominent “spike” in this mixture density plot corresponds to the very small standard deviation estimated for the second component and it provides a dramatic illustration of the relatively poor results obtained for this particular example.</div><div 
class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-_EAqyfF3h3k/Tj2ftS37PxI/AAAAAAAAADM/yV__jlaiSSc/s1600/mixtoolsFig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" src="http://4.bp.blogspot.com/-_EAqyfF3h3k/Tj2ftS37PxI/AAAAAAAAADM/yV__jlaiSSc/s320/mixtoolsFig03.png" t$="true" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">When I repeated this numerical experiment with different random seeds to obtain different random starting estimates, the <strong>normalmixEM</strong> procedure failed to converge in 1000 iterations for seed values of 102 and 103, but it converged after 393 iterations for the seed value 104, yielding the following parameter estimates:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><span style="mso-tab-count: 1;"><blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>mu = (1.79, 5.03, 5.46)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>sigma = (0.775, 0.352, 1.493)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>lambda = (0.169, 0.063, 0.768)</div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Arguably, the general behavior of these parameter estimates is quite similar to those obtained with the random seed value 101, but note that the estimated standard deviation of the second component differs by a factor of three (0.115 versus 0.352), and the second component of lambda more than doubles (0.028 versus 0.063). 
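An experiment of this kind can be sketched as follows. The sample used in this post is not available, so the code draws its own n = 500 observations from the three-component mixture, using an assumed data-generation seed; the fitted values it prints will therefore not reproduce the numbers quoted above. The helper name mixdens is hypothetical, introduced only for this illustration, and the fitting step runs only if the mixtools package is installed:

```r
# True mixture parameters, as listed in this post
lambda = c(0.2, 0.6, 0.2)
mu = c(2, 5, 7)
sigma = c(1, 1, 1)

# Draw n observations from the mixture (the seed is an assumption of this sketch)
set.seed(1)
n = 500
comp = sample(1:3, size = n, replace = TRUE, prob = lambda)
x = rnorm(n, mean = mu[comp], sd = sigma[comp])

# Exact mixture density: weighted sum of the Gaussian component densities
mixdens = function(x, lambda, mu, sigma) {
  p = numeric(length(x))
  for (j in seq_along(lambda)) {
    p = p + lambda[j] * dnorm(x, mean = mu[j], sd = sigma[j])
  }
  p
}

# Refit with several seeds; each seed gives different random starting values
if (requireNamespace("mixtools", quietly = TRUE)) {
  for (seed in 101:104) {
    set.seed(seed)
    fit = try(mixtools::normalmixEM(x, k = 3, maxit = 1000), silent = TRUE)
    if (!inherits(fit, "try-error")) {
      # all.loglik holds the initial and final log likelihoods,
      # so the iteration count is one less than its length
      cat("seed", seed, "iterations", length(fit$all.loglik) - 1, "\n")
      print(round(rbind(lambda = fit$lambda, mu = fit$mu, sigma = fit$sigma), 3))
    }
  }
}
```

Comparing the printed estimates across seeds gives a rough sense of how strongly the fit depends on the starting values for a given sample.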
</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br />Increasing the sample size from n = 500 to n = 2000 and repeating these experiments, the <strong>normalmixEM</strong> procedure failed to converge after 1000 iterations for all four of the random seed values 101 through 104.<span style="mso-spacerun: yes;"> </span>If, however, we specify the correct standard deviations (i.e., specify “sigma = c(1,1,1)” when we invoke <strong>normalmixEM</strong>) and we increase the maximum number of iterations to 3000 (i.e., specify “maxit = 3000”), the procedure does converge after 2417 iterations for the seed value 101, yielding the following parameter estimates:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><span style="mso-tab-count: 1;"><blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>mu = (1.98, 4.98, 7.15)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>sigma = (1.012, 1.055, 0.929)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>lambda = (0.198, 0.641, 0.161)</div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">While these parameters took a lot more effort to obtain, they are clearly much closer to the correct values, emphasizing the point that when we are fitting a model to data, our results generally improve as the amount of available data increases and as our starting estimates become more accurate.<span style="mso-spacerun: yes;"> </span>This point is further illustrated by the plot shown below, analogous to the previous one, but constructed from the model fit to the longer data sequence and incorporating better initial parameter estimates.<span style="mso-spacerun: yes;"> </span>Interestingly, re-running the same procedure but taking the correct 
means as starting parameter estimates instead of the correct standard deviations, the procedure failed to converge in 3000 iterations.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-RE8snFLFQDk/Tj2gaHs0hcI/AAAAAAAAADQ/cjNii4IW3w8/s1600/mixtoolsFig04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" src="http://1.bp.blogspot.com/-RE8snFLFQDk/Tj2gaHs0hcI/AAAAAAAAADQ/cjNii4IW3w8/s320/mixtoolsFig04.png" t$="true" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">Overall, I like what I have seen so far of the <strong>mixtools</strong> package, and I look forward to exploring its capabilities further.<span style="mso-spacerun: yes;"> </span>It’s great to have a built-in procedure – i.e., one I didn’t have to write and debug myself – that does all of the things that this package does.<span style="mso-spacerun: yes;"> </span>However, the three-component mixture results presented here do illustrate an important point: the behavior of iterative procedures like <strong>normalmixEM</strong> and others in the <strong>mixtools</strong> package can depend strongly on the starting values chosen to initialize the iteration process, and the extent of this dependence can vary greatly from one application to another.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div></span></span><br /> </span></span>Ron Pearson (aka TheNoodleDoodler)http://www.blogger.com/profile/15693640298594791682noreply@blogger.com3tag:blogger.com,1999:blog-9179325420174899779.post-36360260655971212432011-07-16T11:32:00.000-07:002011-07-16T11:32:52.103-07:00Mixture distributions and models: a 
clarification<span></span> <div class="MsoNormal" style="margin: 0in 0in 0pt;">In response to my last post, Chris had the following comment:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span></div><blockquote><span style="mso-tab-count: 1;"></span>I am actually trying to better understand the distinction between mixture models and mixture distributions in my own work.<span style="mso-spacerun: yes;"> </span>You seem to say mixture models apply to a small set of models – namely regression models.</blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">This comment suggests that my caution about the difference between <em>mixed-effect models</em> and <em>mixture distributions</em> may have caused as much confusion as clarification, and the purpose of this post is to try to clear up this confusion.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">So first, let me offer the following general observations.<span style="mso-spacerun: yes;"> </span>The term “mixture models” refers to a generalization of the class of finite mixture distributions that I discussed in my previous post.<span style="mso-spacerun: yes;"> </span>I give a more detailed discussion of finite mixture distributions in Chapter 10 of <span><a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650?ie=UTF8&tag=widgetsamazon-20&link_code=btl&camp=213689&creative=392969" target="_blank">Exploring Data in Engineering, the Sciences, and Medicine</a></span>, and the more general class of mixture models is discussed in the book <span><a href="http://www.amazon.com/Mixture-Models-Statistics-Textbooks-Monographs/dp/0824776917?ie=UTF8&tag=widgetsamazon-20&link_code=btl&camp=213689&creative=392969" target="_blank">Mixture Models (Statistics: A Series of Textbooks and Monographs)</a></span> by Geoffrey J. McLachlan and Kaye E. Basford.<span style="mso-spacerun: yes;"> </span>The basic idea is that we are describing some observed phenomenon like the Old Faithful geyser data (the <strong>faithful</strong> data object in <em>R</em>) where a close look at the data (e.g., with a nonparametric density estimate) suggests substantial heterogeneity.<span style="mso-spacerun: yes;"> </span>In particular, the density estimates I presented last time for both of the variables in this dataset exhibit clear evidence of bimodality.<span style="mso-spacerun: yes;"> </span>Essentially, the idea behind a mixture model/mixture distribution is that we are observing something that isn’t fully characterized by a single, simple distribution or model, but instead by several such distributions or models, with some random selection mechanism at work. 
In the case of mixture distributions, some observations appear to be drawn from distribution 1, some from distribution 2, and so forth.<span style="mso-spacerun: yes;"> </span>The more general class of mixture models is quite broad, including things like heterogeneous regression models, where the response may depend approximately linearly on some covariate with one slope and intercept for observations drawn from one sub-population, but with another, very different slope and intercept for observations drawn from another sub-population. I present an example at the end of this post that illustrates this idea.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The probable source of confusion for Chris – and very possibly other readers – is the comment I made about the difference between these mixture models and <i style="mso-bidi-font-style: normal;">mixed-effect models</i>.<span style="mso-spacerun: yes;"> </span>This other class of models – which I only mentioned in passing in my post – typically consists of a linear regression model with two types of prediction variables: deterministic predictors, like those that appear in standard linear regression models, and random predictors that are typically assumed to obey a Gaussian distribution.<span style="mso-spacerun: yes;"> </span>This framework has been extended to more general settings like generalized linear models (e.g., mixed-effect logistic regression models).<span style="mso-spacerun: yes;"> </span>The <em>R</em> package <strong>lme4</strong> provides support for fitting both linear mixed-effect models and generalized linear mixed-effect models to data.<span style="mso-spacerun: yes;"> </span>As I noted last time, these model classes are distinct from the mixture distribution/mixture model classes I discuss here.<span style="mso-spacerun: yes;"> </span>The models that I do discuss – mixture models – have strong connections with cluster 
analysis, where we are given a heterogeneous group of objects and typically wish to determine how many distinct groups of objects are present and assign individuals to the appropriate groups.<span style="mso-spacerun: yes;"> </span>A very high-level view of the many <em>R</em> packages available for clustering – some based on mixture model ideas and some not – is available from the <a href="http://cran.r-project.org/web/views/Cluster.html">CRAN clustering task view page</a>.<span style="mso-spacerun: yes;"> </span>Two packages from this task view that I plan to discuss in future posts are <strong>flexmix</strong> and <strong>mixtools</strong>, both of which support a variety of mixture model applications.<span style="mso-spacerun: yes;"> </span>The following comments from the vignette <a href="http://cran.r-project.org/web/packages/flexmix/vignettes/flexmix-intro.pdf">FlexMix: A General Framework for Finite Mixture Models and Latent Class Regression in R</a> give an indication of the range of areas where these ideas are useful:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;">“Finite mixture models have been used for more than 100 years, but have seen a real boost in popularity over the last decade due to the tremendous increase in available computing power.<span style="mso-spacerun: yes;"> </span>The areas of application of mixture models range from biology and medicine to physics, economics, and marketing.<span style="mso-spacerun: yes;"> </span>On the one hand, these models can be applied to data where observations originate from various groups and the group affiliations are not known, and on the other hand to provide approximations for multi-modal distributions.”</div></blockquote><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a 
href="http://1.bp.blogspot.com/-Do8IMtBIjKY/TiHSCHhpw-I/AAAAAAAAAC4/lPxrVps2ZNs/s1600/OldFaithfulEx01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" m$="true" src="http://1.bp.blogspot.com/-Do8IMtBIjKY/TiHSCHhpw-I/AAAAAAAAAC4/lPxrVps2ZNs/s320/OldFaithfulEx01.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>The following example illustrates the second of these ideas, motivated by the <place w:st="on">Old Faithful</place> geyser data that I discussed last time.<span style="mso-spacerun: yes;"> </span>As a reminder, the plot above shows the nonparametric density estimate generated from the 272 observations of the <place w:st="on">Old Faithful</place> waiting time data included in the <strong>faithful</strong> data object, using the <strong>density</strong> procedure in <em>R</em> with the default parameter settings.<span style="mso-spacerun: yes;"> </span>As I noted last time, the plot shows two clear peaks, the lower one centered at approximately 55 minutes, and the second at approximately 80 minutes.<span style="mso-spacerun: yes;"> </span>Also, note that the first peak is substantially smaller in amplitude and appears to be somewhat narrower than the second peak.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-kTFAkWBQjpw/TiHSPs00adI/AAAAAAAAAC8/3Nf0iIoj2Zw/s1600/MixDensFig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" m$="true" src="http://1.bp.blogspot.com/-kTFAkWBQjpw/TiHSPs00adI/AAAAAAAAAC8/3Nf0iIoj2Zw/s320/MixDensFig01.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 
0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">To illustrate the connection with finite mixture distributions, the <em>R</em> procedure described below generates a two-component Gaussian mixture density whose random samples exhibit approximately the same behavior seen in the <place w:st="on">Old Faithful</place> waiting time data.<span style="mso-spacerun: yes;"> </span>The results generated by this procedure are shown in the above figure, which includes two overlaid plots: one corresponding to the exact density for the two-component Gaussian mixture distribution (the solid line), and the other corresponding to the nonparametric density estimate computed from N = 272 random samples drawn from this mixture distribution (the dashed line).<span style="mso-spacerun: yes;"> </span>As in the previous plot, the nonparametric density estimate was computed using the <strong>density</strong> command in <em>R</em> with its default parameter values.<span style="mso-spacerun: yes;"> </span>The first component in this mixture has mean 54.5 and standard deviation 8.0, values chosen by trial and error to approximately match the lower peak in the <place w:st="on">Old Faithful</place> waiting time distribution.<span style="mso-spacerun: yes;"> </span>The second component has mean 80.0 and standard deviation 5.0, chosen to approximately match the second peak in the waiting time distribution.<span style="mso-spacerun: yes;"> </span>The probabilities associated with the first and second components are 0.45 and 0.55, respectively, selected to give approximately the same peak heights seen in the waiting time density estimate.<span style="mso-spacerun: yes;"> </span>Combining these results, the density of this mixture distribution is:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>p(x) = 0.45 n(x; 54.5, 8.0) + 0.55 n(x; 80.0, 5.0),</div><div 
class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">where n(x;m,s) denotes the Gaussian density function with mean m and standard deviation s.<span style="mso-spacerun: yes;"> </span>These density functions can be evaluated using the <strong>dnorm</strong> function in <em>R</em>.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span>The <em>R</em> procedure listed below generates <strong>n</strong> independent, identically distributed random samples from an <em>m</em>-component Gaussian mixture distribution.<span style="mso-spacerun: yes;"> </span>This procedure is called with the following parameters:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span><strong>n</strong> = the number of random samples to generate</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span><strong>muvec</strong> = vector of <em>m</em> mean values</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span><strong>sigvec</strong> = vector of <em>m</em> standard deviations</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span><strong>pvec</strong> = vector of probabilities for each of the <em>m</em> components</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 1;"> </span><strong>iseed</strong> = integer seed to initialize the random number generator</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The <em>R</em> code for the procedure looks like this:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">MixEx01GenProc <- 
function(n, muvec, sigvec, pvec, iseed=101){</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span># Initialize the random number generator so the results are reproducible</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>set.seed(iseed)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span># Number of components, then a random component index for each sample</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>m <- length(pvec)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>indx <- sample(seq_len(m), size=n, replace=TRUE, prob=pvec)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span># Start from zeros and fill in yvec one component per pass</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>yvec <- numeric(n)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>for (i in 1:m){</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>xvec <- rnorm(n, mean=muvec[i], sd=sigvec[i])</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>yvec <- yvec + xvec * as.numeric(indx == i)</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>}</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span># Return the n mixture samples</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>yvec</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">}</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">The first statement initializes the random number generator using the <strong>iseed</strong> parameter, which is given a default value of 101.<span style="mso-spacerun: yes;"> </span>The second line determines the number of components in the mixture density from the length of 
the <strong>pvec</strong> parameter vector, and the third line generates a random sequence <strong>indx</strong> of component indices taking the values 1 through <em>m</em> with probabilities determined by the <strong>pvec</strong> parameter.<span style="mso-spacerun: yes;"> </span>The rest of the program is a short loop that generates each component in turn, using <strong>indx</strong> to randomly select observations from each of these components with the appropriate probability. <span style="mso-spacerun: yes;"> </span>To see how this works, note that the first pass through the loop generates the random vector <strong>xvec</strong> of length <strong>n</strong>, with mean given by the first element of the vector <strong>muvec</strong> and standard deviation given by the first element of the vector <strong>sigvec</strong>.<span style="mso-spacerun: yes;"> </span>Then, for every one of the <strong>n</strong> elements of <strong>yvec</strong> for which the <strong>indx</strong> vector is equal to 1, <strong>yvec</strong> is set equal to the corresponding element of this first random component <strong>xvec</strong>.<span style="mso-spacerun: yes;"> </span>On the second pass through the loop, the second random component is generated as <strong>xvec</strong>, again with length <strong>n</strong> but now with mean specified by the second element of <strong>muvec</strong> and standard deviation determined by the second element of <strong>sigvec</strong>.<span style="mso-spacerun: yes;"> </span>As before, this value is added to the initial value of <strong>yvec</strong> whenever the selection index vector <strong>indx</strong> is equal to 2.<span style="mso-spacerun: yes;"> </span>Note that since each element of the <strong>indx</strong> vector takes exactly one of the values 1 through <em>m</em>, none of the elements of <strong>yvec</strong> filled in during the first iteration of the loop are modified; instead, the only elements of <strong>yvec</strong> that are modified in the second pass through the 
loop have their initial value of zero, specified in the line above the start of the loop.<span style="mso-spacerun: yes;"> </span>More generally, each pass through the loop generates the next component of the mixture distribution and fills in the corresponding elements of <strong>yvec</strong> as determined by the random selection index vector <strong>indx</strong>.</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-Zu-7I2Gm0Ew/TiHUTcXeLSI/AAAAAAAAADA/8VTpCBWBFMI/s1600/MixExFig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" m$="true" src="http://3.bp.blogspot.com/-Zu-7I2Gm0Ew/TiHUTcXeLSI/AAAAAAAAADA/8VTpCBWBFMI/s320/MixExFig03.png" width="320" /></a></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">As I noted at the beginning of this post, the notion of a mixture model is more general than that of the finite mixture distributions just described, but closely related.<span style="mso-spacerun: yes;"> </span>I conclude this post with a simple example of a more general mixture model.<span style="mso-spacerun: yes;"> </span>The above scatter plot shows two variables, x and y, related by the following mixture model:</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-tab-count: 2;"> </span>y = x + e<sub>1</sub> with probability p<sub>1</sub> = 0.40,</div><div class="MsoNormal" style="margin: 0in 0in 0pt;">and</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"> </span>y = -x + 2 + e<sub>2</sub> with probability p<sub>2</sub> = 0.60,</div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 
0pt;">where e<sub>1</sub> is a zero-mean Gaussian random variable with standard deviation 0.1, and e<sub>2</sub> is a zero-mean Gaussian random variable with standard deviation 0.3.<span style="mso-spacerun: yes;"> </span>To emphasize the components in the mixture model, points corresponding to the first component are plotted as solid circles, while points corresponding to the second component are plotted as open triangles.<span style="mso-spacerun: yes;"> </span>The two dashed lines in this plot represent the ordinary least squares regression lines fit to each component separately, and they both correspond reasonably well to the underlying linear relationships that define the two components (e.g., the least squares line fit to the solid circles has a slope of approximately +1 and an intercept of approximately 0). In contrast, the heavier dotted line represents the ordinary least squares regression line fit to the complete dataset without any knowledge of its underlying component structure: this line is almost horizontal and represents a very poor approximation to the behavior of the dataset.<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><span style="mso-spacerun: yes;"></span>The point of this example is to illustrate two things.<span style="mso-spacerun: yes;"> </span>First, it provides a relatively simple illustration of how the mixture density idea discussed above generalizes to the setting of regression models and beyond: we can construct fairly general mixture models by requiring different randomly selected subsets of the data to conform to different modeling assumptions.<span style="mso-spacerun: yes;"> </span>The second point – emphasized by the strong disagreement between the overall regression line and both of the component regression lines – is that if we are given only the dataset (i.e., the x and y values themselves) without knowing which 
component they represent, standard analysis procedures are likely to perform very badly.<span style="mso-spacerun: yes;"> </span>This question – how do we analyze a dataset like this one without detailed prior knowledge of its heterogeneous structure – is what <em>R</em> packages like <strong>flexmix</strong> and <strong>mixtools</strong> are designed to address.<span style="mso-spacerun: yes;"> </span></div><div class="MsoNormal" style="margin: 0in 0in 0pt;"><br /></div><div class="MsoNormal" style="margin: 0in 0in 0pt;">More about that in future posts. </div>Ron Pearson (aka TheNoodleDoodler)http://www.blogger.com/profile/15693640298594791682noreply@blogger.com3