<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-9179325420174899779</id><updated>2012-02-24T14:48:33.315-08:00</updated><category term='beanplots'/><category term='normal quantiles'/><category term='Pareto distributions'/><category term='Time-series'/><category term='mixture distributions'/><category term='skewness'/><category term='Q-Q plots'/><category term='Exploring Data'/><category term='medians'/><category term='binary associations'/><category term='R package modeest'/><category term='asymptotic normality'/><category term='R procedures'/><category term='Hampel filter'/><category term='regression models'/><category term='Hampel identifier'/><category term='association measures'/><category term='interestingness measures'/><category term='categorical variables'/><category term='arithmetic means'/><category term='Gini&apos;s mean difference'/><category term='odds ratios'/><category term='violin plots'/><category term='derivative estimation'/><category term='screening predictors'/><category term='moving window filters'/><category term='moving averages'/><category term='harmonic means'/><category term='contingency tables'/><category term='R statistical software'/><category term='limitations of the mean'/><category term='bimodality'/><category term='inverse Gaussian distribution'/><category term='Savitzky-Golay smoothing'/><category term='binary confidence intervals'/><category term='Expectation Maximization algorithm'/><category term='Old Faithful dataset'/><category term='Gaussian mixture distributions'/><category term='multimodal distributions'/><category term='outliers'/><category term='long tail phenomena'/><category 
term='pracma R package'/><category term='data cleaning'/><category term='UCI mushroom dataset'/><category term='reciprocal transformations'/><category term='mixture models'/><category term='geometric means'/><category term='boxplots'/><category term='Shannon entropy'/><category term='initialization of iterative algorithms'/><category term='Goodman and Kruskal&apos;s tau'/><category term='Zipf distribution'/><category term='infinite variance distributions'/><category term='moving window data characterizations'/><category term='Gaussian distribution'/><category term='mixtools'/><category term='mode estimation'/><category term='medcouple'/><category term='beeswarm plots'/><category term='correlation measures'/><category term='R packages'/><category term='asymmetry'/><category term='Zipf-Mandelbrot distribution'/><category term='nonparametric density estimates'/><title type='text'>ExploringDataBlog</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://exploringdatablog.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://exploringdatablog.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Ron Pearson (aka TheNoodleDoodler)</name><uri>http://www.blogger.com/profile/15693640298594791682</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' 
uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>24</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-9179325420174899779.post-7091152336183665933</id><published>2012-02-04T16:06:00.000-08:00</published><updated>2012-02-04T16:06:09.600-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='association measures'/><category scheme='http://www.blogger.com/atom/ns#' term='Goodman and Kruskal&apos;s tau'/><category scheme='http://www.blogger.com/atom/ns#' term='correlation measures'/><category scheme='http://www.blogger.com/atom/ns#' term='categorical variables'/><category scheme='http://www.blogger.com/atom/ns#' term='odds ratios'/><title type='text'>Measuring associations between non-numeric variables</title><content type='html'>&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;It is often useful to know how strongly or weakly two variables are associated: do they vary together or are they essentially unrelated?&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In the case of numerical variables, the best-known measure of association is the product-moment correlation coefficient introduced by Karl Pearson at the end of the nineteenth century.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For variables that are ordered but not necessarily numeric (e.g., Likert scale responses with levels like “strongly agree,” “agree,” “neither agree nor disagree,” “disagree” and “strongly disagree”), association can be measured in terms of the Spearman rank correlation coefficient.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Both of these measures are discussed in detail in Chapter 10 of &lt;a href="http://www.amazon.com/s?ie=UTF8&amp;amp;rh=n%3A283155%2Ck%3Aexploring%20data%20in%20engineering.%20the%20sciences.%20and%20medicine&amp;amp;page=1"&gt;Exploring Data in Engineering, 
the Sciences, and Medicine&lt;/a&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For unordered categorical variables (e.g., country, state, county, tumor type, literary genre, etc.), neither of these measures is applicable, but suitable alternatives do exist.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;One of these is Goodman and Kruskal’s tau measure, discussed very briefly in &lt;em&gt;Exploring Data&lt;/em&gt; (Chapter 10, page 492).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The point of this post is to give a more detailed discussion of this association measure, illustrating some of its advantages, disadvantages, and peculiarities.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;A more complete discussion of Goodman and Kruskal’s tau measure is given in Agresti’s book &lt;a href="http://www.amazon.com/s/ref=nb_sb_ss_i_1_8?url=search-alias%3Dstripbooks&amp;amp;field-keywords=agresti+categorical+data+analysis&amp;amp;sprefix=agresti+%2Cstripbooks%2C428"&gt;Categorical Data Analysis&lt;/a&gt;, on pages 68 and 69.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;It belongs to a family of categorical association measures of the general form:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;a(x,y) = [V(y) – E{V(y|x)}]/V(y)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;where V(y) is a measure of the overall (i.e., marginal) variability of y and E{V(y|x)} is the expected value of the conditional variability V(y|x) of y given a fixed 
value of x, where the expectation is taken over all possible values of x.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;These variability measures can be defined in different ways, leading to different association measures, including Goodman and Kruskal’s tau as a special case.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Agresti’s book gives detailed expressions for several of these variability measures, including the one on which Goodman and Kruskal’s tau is based, and an alternative expression for the overall association measure a(x,y) is given in Eq. (10.178) on page 492 of &lt;em&gt;Exploring Data&lt;/em&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This association measure does not appear to be available in any current &lt;em&gt;R&lt;/em&gt; package, but it is easily implemented as the following function:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;blockquote class="tr_bq"&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;GKtau &amp;lt;- function(x,y){&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;First, compute the IxJ contingency table between x and y&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Nij = table(x,y,useNA="ifany")&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; 
&lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Next, convert this table into a joint probability estimate&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;PIij = Nij/sum(Nij)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Compute the marginal probability estimates&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;PIiPlus = apply(PIij,MARGIN=1,sum)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;PIPlusj = apply(PIij,MARGIN=2,sum)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Compute the marginal variation of y&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: 
yes;"&gt;&amp;nbsp; &lt;/span&gt;Vy = 1 - sum(PIPlusj^2)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Compute the expected conditional variation of y given x&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;InnerSum = apply(PIij^2,MARGIN=1,sum)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;VyBarx = 1 - sum(InnerSum/PIiPlus)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Compute and return Goodman and Kruskal's tau measure&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;tau = (Vy - VyBarx)/Vy&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;tau&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;}&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;/blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;An important feature of this 
procedure is that it allows missing values in either of the variables x or y, treating “missing” as an additional level.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In practice, this is sometimes very important since missing values in one variable may be strongly associated with either missing values in another variable or specific non-missing levels of that variable.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;An important characteristic of Goodman and Kruskal’s tau measure is its asymmetry: because the variables x and y enter this expression differently, the value of a(y,x) is &lt;em&gt;not&lt;/em&gt; the same as the value of a(x,y), in general.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This stands in marked contrast to either the product-moment correlation coefficient or the Spearman rank correlation coefficient, which are both symmetric, giving the same association between x and y as that between y and x.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The fundamental reason for the asymmetry of the general class of measures defined above is that they quantify the extent to which the variable x is useful in predicting y, which may be very different from the extent to which the variable y is useful in predicting x.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Specifically, if x and y are statistically independent, then E{V(y|x)} = V(y) – i.e., knowing x does not help at all in predicting y – and this implies that a(x,y) = 0.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;At the other extreme, if y is perfectly predictable from x, then E{V(y|x)} = 0, which implies that a(x,y) = 1.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As the examples presented next demonstrate, it is possible that y is extremely predictable from x, but x is only slightly 
predictable from y.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;As a first example, consider the sequence of 400 random numbers, uniformly distributed between 0 and 1, generated by the following R code:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;set.seed(123)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;u = runif(400)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;(Here, I have used the “set.seed” command to initialize the random number generator so repeated runs of this example will give exactly the same results.)&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The second sequence is obtained by quantizing the first, rounding the values of u to a single decimal place:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;x = round(u,digits=1)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The plot below shows the effects of this coarse quantization: values of u vary continuously from 0 to 1, but values of x are restricted to 0.0, 0.1, 0.2, … , 
1.0.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Although this example is simulation-based, it is important to note that this type of grouping of variables is often encountered in practice (e.g., the use of age groups instead of ages in demographic characterizations, blood pressure characterizations like “normal,” “borderline hypertensive,” etc. in clinical data analysis, or the recording of industrial process temperatures to the nearest 0.1 degree, in part due to measurement accuracy considerations and in part due to memory limitations of early data collection systems).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-1yCneUgZLQE/Ty3C5dfv3II/AAAAAAAAAG4/36tSbqEgXFQ/s1600/GKtauFig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" sda="true" src="http://4.bp.blogspot.com/-1yCneUgZLQE/Ty3C5dfv3II/AAAAAAAAAG4/36tSbqEgXFQ/s320/GKtauFig01.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;In this particular case, because the variables x and u are both numeric, we could compute either the product-moment correlation coefficient or the Spearman rank correlation, obtaining the very large value of approximately 0.995 for either one, showing that these variables are strongly associated.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;We can also apply Goodman and Kruskal’s tau measure here, and the result is much more informative.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Specifically, the value of a(u,x) is 1 in this case, correctly reflecting the 
fact that the grouped variable x is exactly computable from the original variable u.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In contrast, the value of a(x,u) is approximately 0.025, suggesting – again correctly – that the original variable u cannot be well predicted from the grouped variable x.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;To illustrate a case where the product-moment and rank correlation measures are not applicable at all, consider the following alphabetic re-coding of the variable x into an unordered categorical variable c:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;letters = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K")&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;c = letters[round(10*x)+1]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;In this case, both of the Goodman and Kruskal tau measures, a(x,c) and a(c,x), are equal to 1, reflecting the fact that these two variables are effectively identical, related via the non-numeric transformation given above.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Being able to detect relationships like 
these can be extremely useful in exploratory data analysis where such relationships may be unexpected, particularly in the early stages of characterizing a dataset whose metadata – i.e., detailed descriptions of the variables included in the dataset – is absent, incomplete, ambiguous, or suspect.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As a real data illustration, consider the &lt;strong&gt;rent&lt;/strong&gt; data frame from the &lt;em&gt;R&lt;/em&gt; package &lt;strong&gt;gamlss.data&lt;/strong&gt;, which has 1,969 rows, each corresponding to a rental property in Munich, and 9 columns, each giving a characteristic of that unit (e.g., the rent, floor space, year of construction, etc.).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Three of these variables are &lt;em&gt;Sp&lt;/em&gt;, a binary variable indicating whether the location is considered above average (1) or not (0), &lt;em&gt;Sm&lt;/em&gt;, another binary variable indicating whether the location is considered below average (1) or not (0), and &lt;em&gt;loc&lt;/em&gt;, a three-level variable combining the information in these other two, taking the values 1 (below average), 2 (average), or 3 (above average).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The Goodman and Kruskal tau values between all possible pairs of these three variables are:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;a(Sm,Sp) = a(Sp,Sm) = 0.037&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 
1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;a(Sm,loc) = 0.245 vs. a(loc,Sm) = 1&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;a(Sp,loc) = 0.701 vs. a(loc,Sp) = 1&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The first of these results – the symmetry of Goodman and Kruskal’s tau for the variables &lt;em&gt;Sm&lt;/em&gt; and &lt;em&gt;Sp&lt;/em&gt; – is a consequence of the fact that this measure is symmetric for any pair of &lt;em&gt;binary&lt;/em&gt; variables.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In fact, the odds ratio that I have discussed in previous posts represents a much better way of characterizing the relationship between binary variables (here, the odds ratio between &lt;em&gt;Sm&lt;/em&gt; and &lt;em&gt;Sp&lt;/em&gt; is zero, reflecting the fact that a location cannot be both “above average” and “below average” at the same time).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The real utility of the tau measure here is that&amp;nbsp;the second and third lines above show&amp;nbsp;that the variables &lt;em&gt;Sm&lt;/em&gt; and &lt;em&gt;Sp&lt;/em&gt; are both re-groupings of the finer-grained variable &lt;em&gt;loc&lt;/em&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-oRSclb8fvPE/Ty3EgV0qJ9I/AAAAAAAAAHA/gsQgEujOFxs/s1600/GKtauFig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" 
height="319" sda="true" src="http://4.bp.blogspot.com/-oRSclb8fvPE/Ty3EgV0qJ9I/AAAAAAAAAHA/gsQgEujOFxs/s320/GKtauFig02.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Finally, a more interesting exploratory application to this dataset is the following one.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Computing Goodman and Kruskal’s tau measure between the location variable &lt;em&gt;loc&lt;/em&gt; and all of the other variables in the dataset – beyond the cases of &lt;em&gt;Sm&lt;/em&gt; and &lt;em&gt;Sp&lt;/em&gt; just considered – generally yields small values for the associations in either direction.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As a specific example, the association a(loc,Fl) is 0.001, suggesting that location is not a good predictor of the unit’s floor space in square meters, and although the reverse association a(Fl,loc) is larger (0.057), it is not large enough to suggest that the unit’s floor space is a particularly good predictor of its location quality.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The same is true of most of the other variables in the dataset: they are neither well predicted by nor good predictors of location quality.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The one glaring exception is the rent variable &lt;em&gt;R&lt;/em&gt;: although the association a(loc,R) is only 0.001, the reverse association a(R,loc) is 0.907, a very large value suggesting that location quality is quite well predicted by the rent.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The beanplot above shows what is happening here: because the variation in rents for all three location qualities is substantial, knowledge of the &lt;em&gt;loc&lt;/em&gt; value is 
not sufficient to accurately predict the rent &lt;em&gt;R&lt;/em&gt;, but these rent values do generally increase in going from below-average locations (loc = 1) to average locations (loc = 2) to above-average locations (loc = 3).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For comparison, the beanplots below show why the association with floor space is so much weaker: both the mean floor space in each location quality group and the overall range of these values are quite comparable, implying that location quality cannot be well predicted from floor space, nor floor space from location quality.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-96Hzx9KHTtk/Ty3FGgKun9I/AAAAAAAAAHI/mxprlhMDTYk/s1600/GKtauFig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" sda="true" src="http://3.bp.blogspot.com/-96Hzx9KHTtk/Ty3FGgKun9I/AAAAAAAAAHI/mxprlhMDTYk/s320/GKtauFig03.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The asymmetry of Goodman and Kruskal’s tau measure is disconcerting at first because it has no counterpart in better-known measures like the product-moment correlation coefficient between numerical variables, Spearman’s rank correlation coefficient between ordinal variables, or the odds ratio between binary variables.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;One of the points of this post has been to demonstrate how this unusual asymmetry can be useful in practice, distinguishing between the ability of one variable x to predict another variable y, and the reverse case.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img 
width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9179325420174899779-7091152336183665933?l=exploringdatablog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploringdatablog.blogspot.com/feeds/7091152336183665933/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://exploringdatablog.blogspot.com/2012/02/measuring-associations-between-non.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/7091152336183665933'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/7091152336183665933'/><link rel='alternate' type='text/html' href='http://exploringdatablog.blogspot.com/2012/02/measuring-associations-between-non.html' title='Measuring associations between non-numeric variables'/><author><name>Ron Pearson (aka TheNoodleDoodler)</name><uri>http://www.blogger.com/profile/15693640298594791682</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-1yCneUgZLQE/Ty3C5dfv3II/AAAAAAAAAG4/36tSbqEgXFQ/s72-c/GKtauFig01.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9179325420174899779.post-6125017501737556488</id><published>2012-01-14T11:06:00.000-08:00</published><updated>2012-01-14T11:06:32.484-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data cleaning'/><category scheme='http://www.blogger.com/atom/ns#' term='derivative estimation'/><category scheme='http://www.blogger.com/atom/ns#' term='Savitzky-Golay smoothing'/><category scheme='http://www.blogger.com/atom/ns#' term='moving window filters'/><category 
scheme='http://www.blogger.com/atom/ns#' term='Hampel identifier'/><category scheme='http://www.blogger.com/atom/ns#' term='Hampel filter'/><category scheme='http://www.blogger.com/atom/ns#' term='pracma R package'/><title type='text'>Moving window filters and the pracma package</title><content type='html'>&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;In my last post, I discussed the Hampel filter, a useful moving window nonlinear data cleaning filter that is available in the &lt;em&gt;R&lt;/em&gt; package &lt;strong&gt;pracma&lt;/strong&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In this post, I briefly discuss this moving window filter in a little more detail, focusing on two important practical points: the choice of the filter’s local outlier detection threshold, and the question of how to initialize moving window filters.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This second point is particularly important here because the &lt;strong&gt;pracma&lt;/strong&gt; package initializes the Hampel filter in a particularly appropriate way, but doesn’t do such a good job of initializing the Savitzky-Golay filter, a linear smoothing filter that is&amp;nbsp;popular in physics and chemistry.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Fortunately, this second difficulty is easy to fix, as I demonstrate here.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Recall from my last post that the Hampel filter is a moving window implementation of the Hampel identifier, discussed in Chapter 7 of &lt;a href="http://www.amazon.com/s?ie=UTF8&amp;amp;rh=n%3A283155%2Ck%3Aexploring%20data%20in%20engineering.%20the%20sciences.%20and%20medicine&amp;amp;page=1"&gt;Exploring Data in Engineering, the Sciences, and Medicine&lt;/a&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In particular, this 
procedure – implemented as &lt;strong&gt;outlierMAD&lt;/strong&gt; in the &lt;strong&gt;pracma&lt;/strong&gt; package – is a nonlinear data cleaning filter that looks for local outliers in a time-series or other streaming data sequence, replacing them with a more reasonable alternative value when it finds them.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Specifically, this filter may be viewed as a more effective alternative to a “local three-sigma edit rule” that would replace any data point lying more than three standard deviations from the mean of its neighbors with that mean value.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The difficulty with this simple strategy is that both the mean and especially the standard deviation are badly distorted by the presence of outliers in the data, causing this data cleaning procedure to often fail completely in practice.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The Hampel filter instead uses the median of neighboring observations as a reference value, and the MAD scale estimator as an alternative measure of distance: that is, a data point is declared an outlier and replaced if it lies more than some number &lt;em&gt;t &lt;/em&gt;of MAD scale estimates from the median of its neighbors; the replacement value used in this procedure is the median.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-zeJmhgqThZk/TxHJNN-4XlI/AAAAAAAAAGQ/OxoXHvRm-3U/s1600/HampelIIfig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" kba="true" src="http://4.bp.blogspot.com/-zeJmhgqThZk/TxHJNN-4XlI/AAAAAAAAAGQ/OxoXHvRm-3U/s320/HampelIIfig01.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" 
style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;More specifically, for each observation in the original data sequence, the Hampel filter constructs a moving window that includes the &lt;em&gt;K&lt;/em&gt; prior points, the data point of primary interest, and the &lt;em&gt;K&lt;/em&gt; subsequent data points.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The reference value used for the central data point is the median of these &lt;em&gt;2K+1&lt;/em&gt; successive observations, and the MAD scale estimate is computed from these same observations to serve as a measure of the “natural local spread” of the data sequence.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;If the central data point lies more than &lt;em&gt;t &lt;/em&gt;MAD scale estimate values from the median, it is replaced with the median; otherwise, it is left unchanged.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;To illustrate the performance of this filter, the top plot above shows the sequence of 1024 successive physical property measurements from an industrial manufacturing process that I also discussed in my last post.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The bottom plot in this pair shows the results of applying the Hampel filter with a window half-width parameter K=5 and a threshold value of t = 3 to this data sequence.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Comparing these two plots, it is clear that the Hampel filter has removed the glaring outlier – the value zero – at observation k = 291, yielding a cleaned data sequence that varies over a much narrower (and, at least in this case, much more reasonable) range of possible values.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;What is less obvious is that this filter has also replaced 18 other data points with their local median reference values.&lt;/div&gt;&lt;div 
class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-uQht8r4-pk8/TxHJuf8hHiI/AAAAAAAAAGY/Vq7Pu7BaeRA/s1600/HampelIIfig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" kba="true" src="http://4.bp.blogspot.com/-uQht8r4-pk8/TxHJuf8hHiI/AAAAAAAAAGY/Vq7Pu7BaeRA/s320/HampelIIfig02.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The above plot shows the original data sequence, but on approximately the same range as the cleaned data sequence so that the glaring outlier at k = 291 no longer dominates the figure.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The large solid circles represent the 18 additional points that the Hampel filter has declared to be outliers and replaced with their local median values.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This plot was generated using the Hampel filter implemented in the &lt;strong&gt;outlierMAD&lt;/strong&gt; command in the &lt;strong&gt;pracma&lt;/strong&gt; package, which has the following syntax:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 2;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;outlierMAD(x,k)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;where 
&lt;em&gt;x&lt;/em&gt; is the data sequence to be cleaned and &lt;em&gt;k&lt;/em&gt; is the half-width that defines the moving data window on which the filter is based.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Here, specifying k = 5 results in an 11-point moving data window.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Unfortunately, the threshold parameter &lt;em&gt;t&lt;/em&gt; is hard-coded as 3 in this &lt;strong&gt;pracma&lt;/strong&gt; procedure, which has the following code:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;outlierMAD &amp;lt;- function (x, k){&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;n &amp;lt;- length(x)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;y &amp;lt;- x&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;ind &amp;lt;- c()&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;L &amp;lt;- 1.4826&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;t0 &amp;lt;- 3&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;for (i in (k + 1):(n - k)) {&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 
&lt;/span&gt;x0 &amp;lt;- median(x[(i - k):(i + k)])&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;S0 &amp;lt;- L * median(abs(x[(i - k):(i + k)] - x0))&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;if (abs(x[i] - x0) &amp;gt; t0 * S0) {&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;y[i] &amp;lt;- x0&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;ind &amp;lt;- c(ind, i)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;}&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;}&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;list(y = y, ind = ind)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;}&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 
0pt;"&gt;Note that it is a simple matter to create your own version of this filter, specifying the threshold (here, the variable &lt;em&gt;t0&lt;/em&gt;) to have a default value of 3, but allowing the user to modify it in the function call.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Specifically, the code would be:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;HampelFilter &amp;lt;- function (x, k, t0 = 3){&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;n &amp;lt;- length(x)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;y &amp;lt;- x&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;ind &amp;lt;- c()&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;L &amp;lt;- 1.4826&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;for (i in (k + 1):(n - k)) {&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;x0 &amp;lt;- median(x[(i - k):(i + k)])&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;S0 &amp;lt;- L * median(abs(x[(i - k):(i + k)] - x0))&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 
0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;if (abs(x[i] - x0) &amp;gt; t0 * S0) {&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;y[i] &amp;lt;- x0&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;ind &amp;lt;- c(ind, i)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;}&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;}&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;list(y = y, ind = ind)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;}&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The advantage of this modification is that it allows you to explore the influence of varying the threshold parameter.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Note that increasing t0 makes the filter more forgiving, allowing more extreme local fluctuations to pass through the filter unmodified, while decreasing t0 makes the 
filter more aggressive, declaring more points to be local outliers and replacing them with the appropriate local median.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In fact, this filter remains well-defined even for t0 = 0, where it reduces to the median filter, popular in nonlinear digital signal processing.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;John Tukey&amp;nbsp;– the developer or co-developer of many useful things, including the fast Fourier transform (FFT) – introduced the median filter at a technical conference in 1974, and it has profoundly influenced subsequent developments in nonlinear digital filtering.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;It may be viewed as the most aggressive limit of the Hampel filter and, although it is quite effective in removing local outliers, it is often too aggressive in practice, introducing significant distortions into the original data sequence.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This point may be seen in the plot below, which shows the results of applying the median filter (i.e., the &lt;strong&gt;HampelFilter&lt;/strong&gt; procedure defined above with t0=0) to the physical property dataset.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In particular, the heavy solid line in this plot shows the behavior of the first 250 points of the median filtered sequence, while the lighter dotted line shows the corresponding results for the Hampel filter with t0=3.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Note the “clipped” or “blocky” appearance of the median filtered results, compared with the more irregular local variation seen in the Hampel filtered results.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In many applications (e.g., fitting time-series models), the less aggressive Hampel filter gives much better overall results.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 
0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-EjtEmc-sw8w/TxHK5E_gtSI/AAAAAAAAAGg/9SldVilqjoc/s1600/HampelIIfig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" kba="true" src="http://2.bp.blogspot.com/-EjtEmc-sw8w/TxHK5E_gtSI/AAAAAAAAAGg/9SldVilqjoc/s320/HampelIIfig03.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The other main issue I wanted to discuss in this post is that of initializing moving window filters.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The basic structure of these filters – whether they are nonlinear types like the Hampel and median filters discussed above, or linear types like the Savitzky-Golay filter discussed briefly below –&amp;nbsp;is built on&amp;nbsp;a moving data window that includes a central point of interest, prior observations and subsequent observations.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For a symmetric window that includes K prior and K subsequent observations, this window is not well defined for the first K or the last K observations in the data sequence.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;These points must be given special treatment, and a very common approach in the digital signal processing community is to extend the original sequence by appending K additional copies of the first element to the beginning of the sequence and K additional copies of the last element to the end of the sequence.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The &lt;strong&gt;pracma&lt;/strong&gt; implementation of the Hampel filter procedure (&lt;strong&gt;outlierMAD&lt;/strong&gt;) takes an 
alternative approach, one that is particularly appropriate for data cleaning filters.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Specifically, procedure &lt;strong&gt;outlierMAD&lt;/strong&gt; simply passes the first and last K observations unmodified from the original data sequence to the filter output.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This would also seem to be a reasonable option for smoothing filters like the linear Savitzky-Golay filter discussed next.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-irmS_ED9KG0/TxHLXwqiMpI/AAAAAAAAAGo/ccj53hOUrHM/s1600/HampelIIfig04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" kba="true" src="http://1.bp.blogspot.com/-irmS_ED9KG0/TxHLXwqiMpI/AAAAAAAAAGo/ccj53hOUrHM/s320/HampelIIfig04.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;As noted, this linear smoothing filter is popular in chemistry and physics, and it is implemented in the &lt;strong&gt;pracma&lt;/strong&gt; package as procedure &lt;strong&gt;savgol.&lt;/strong&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For a more detailed discussion of this filter, refer to the treatment in the book &lt;a href="http://www.amazon.com/Numerical-Recipes-3rd-Scientific-Computing/dp/0521880688/ref=sr_1_1?s=books&amp;amp;ie=UTF8&amp;amp;qid=1326566316&amp;amp;sr=1-1"&gt;Numerical Recipes&lt;/a&gt;, which the authors of the &lt;strong&gt;pracma&lt;/strong&gt; package cite for further details (Section 14.8).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Here, the key point is that this 
filter is a linear smoother, implemented as the convolution of the input sequence with an impulse response function (i.e., a smoothing kernel) that is constructed by the &lt;strong&gt;savgol &lt;/strong&gt;procedure.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The above two plots show the effects of applying this filter with a total window width of 11 points (i.e., the same half-width K = 5 used with the Hampel and median filters), first to the raw physical property data sequence (upper plot), and then to the sequence after it has been cleaned by the Hampel filter (lower plot).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The large downward spike at k = 291 in the upper plot reflects the impact of the glaring outlier in the original data sequence, illustrating the practical importance of removing these artifacts from a data sequence before applying smoothing procedures like the Savitzky-Golay filter.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Both the upper and lower plots exhibit similarly large spikes at the beginning and end of the data sequence, however, and these artifacts are due to the moving window problem noted above for the first K and the last K elements of the original data sequence.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In particular, the filter implementation in the &lt;strong&gt;savgol&lt;/strong&gt; procedure does not apply the sequence extension procedure discussed above, and this fact is responsible for these artifacts appearing at the beginning and end of the smoothed data sequence.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;It is extremely easy to correct this problem, adopting the same philosophy the package uses for the &lt;strong&gt;outlierMAD&lt;/strong&gt; procedure: simply retain the first and last K elements of the original sequence unmodified.&lt;span 
style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The procedure &lt;strong&gt;SGwrapper&lt;/strong&gt; listed below does this after the fact, calling the &lt;strong&gt;savgol&lt;/strong&gt; procedure and then replacing the first and last K elements of the filtered sequence with the original sequence values:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;SGwrapper &amp;lt;- function(x,K,forder=4,dorder=0){&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;# Call savgol, then overwrite the first and last K points: original values when dorder is 0, zeros otherwise&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;n = length(x)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;fl = 2*K+1&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;y = savgol(x,fl,forder,dorder)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;if (dorder == 0){&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;y[1:K] = x[1:K]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;y[(n-K+1):n] = x[(n-K+1):n]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;}&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;else{&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: 
yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;y[1:K] = 0&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;y[(n-K+1):n] = 0&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;}&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;y&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;}&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Before showing the results obtained with this procedure, it is important to note two points.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;First, the moving window width parameter fl required for the &lt;strong&gt;savgol &lt;/strong&gt;procedure corresponds to fl = 2K+1 for a half-width parameter K.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The procedure &lt;strong&gt;SGwrapper&lt;/strong&gt; instead takes K as its argument, constructing fl from this value of K.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Second, note that in addition to serving as a smoother, the Savitzky-Golay filter family can also be used to estimate derivatives (this is tricky since differentiation filters are incredible noise amplifiers, but I’ll talk more about that in another post).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In the &lt;strong&gt;savgol&lt;/strong&gt; procedure, this is accomplished by specifying the parameter dorder, which has a default value of zero (implying smoothing), but which can be set to 1 to estimate the first derivative of a sequence, 2 for the second derivative, etc.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In these cases, replacing the first and last 
K elements of the filtered sequence with the original data sequence elements is not reasonable: in the absence of any other knowledge, a better default derivative estimate is zero, and the &lt;strong&gt;SGwrapper&lt;/strong&gt; procedure listed above does this.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-zc74biMBwes/TxHMuCc4tZI/AAAAAAAAAGw/zYuzLZbdfMk/s1600/HampelIIfig05.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" kba="true" src="http://3.bp.blogspot.com/-zc74biMBwes/TxHMuCc4tZI/AAAAAAAAAGw/zYuzLZbdfMk/s320/HampelIIfig05.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The four plots shown above illustrate the differences between the results obtained with the original &lt;strong&gt;savgol&lt;/strong&gt; procedure (the left-hand plots) and those obtained with the &lt;strong&gt;SGwrapper&lt;/strong&gt; procedure listed above (the right-hand plots).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In all cases, the data sequence used to generate these plots was the physical property data sequence cleaned using the Hampel filter with t0 = 3.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The upper left plot repeats the lower of the two previous plots, corresponding to the &lt;strong&gt;savgol&lt;/strong&gt; smoother output, while the upper right plot applies the &lt;strong&gt;SGwrapper&lt;/strong&gt; function to remove the artifacts at the beginning and end of the smoothed data sequence.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Similarly, the lower two plots give the corresponding second-derivative estimates, obtained by 
applying the &lt;strong&gt;savgol&lt;/strong&gt; procedure with fl = 11 and dorder = 2 (lower left plot) or the &lt;strong&gt;SGwrapper&lt;/strong&gt; procedure with K = 5 and dorder = 2 (lower right plot).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9179325420174899779-6125017501737556488?l=exploringdatablog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploringdatablog.blogspot.com/feeds/6125017501737556488/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://exploringdatablog.blogspot.com/2012/01/moving-window-filters-and-pracma.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/6125017501737556488'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/6125017501737556488'/><link rel='alternate' type='text/html' href='http://exploringdatablog.blogspot.com/2012/01/moving-window-filters-and-pracma.html' title='Moving window filters and the pracma package'/><author><name>Ron Pearson (aka TheNoodleDoodler)</name><uri>http://www.blogger.com/profile/15693640298594791682</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-zeJmhgqThZk/TxHJNN-4XlI/AAAAAAAAAGQ/OxoXHvRm-3U/s72-c/HampelIIfig01.png' height='72' 
width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9179325420174899779.post-606176191066433899</id><published>2011-11-27T08:37:00.000-08:00</published><updated>2011-11-27T08:37:14.467-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Time-series'/><category scheme='http://www.blogger.com/atom/ns#' term='data cleaning'/><category scheme='http://www.blogger.com/atom/ns#' term='Hampel identifier'/><category scheme='http://www.blogger.com/atom/ns#' term='Hampel filter'/><category scheme='http://www.blogger.com/atom/ns#' term='moving window data characterizations'/><title type='text'>Cleaning time-series and other data streams</title><content type='html'>&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The need to analyze time-series or other forms of streaming data arises frequently in many different application areas.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Examples include economic time-series like stock prices, exchange rates, or unemployment figures, biomedical data sequences like electrocardiograms or electroencephalograms, or industrial process operating data sequences like temperatures, pressures or concentrations.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As a specific example, the figure below shows four data sequences: the upper two plots represent hourly physical property measurements, one made at the inlet of a product storage tank (the left-hand plot) and the other made at the same time at the outlet of the tank (the right-hand plot).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The lower two plots in this figure show the results of applying the data cleaning filter &lt;strong&gt;outlierMAD&lt;/strong&gt; from the &lt;em&gt;R&lt;/em&gt; package &lt;strong&gt;pracma&lt;/strong&gt; discussed further below.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The two main points of this post are first, that isolated spikes like 
those seen in the upper two plots at hour 291 can badly distort the results of an otherwise reasonable time-series characterization, and second, that the simple moving window data cleaning filter described here is often very effective in removing these artifacts.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-xe3qt3qFIjc/TtJe9BAfGtI/AAAAAAAAAFw/GTVB2hnN3fU/s1600/hampelfig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" hda="true" height="319" src="http://3.bp.blogspot.com/-xe3qt3qFIjc/TtJe9BAfGtI/AAAAAAAAAFw/GTVB2hnN3fU/s320/hampelfig01.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;This example is discussed in more detail in Section 8.1.2 of my book &lt;a href="http://www.amazon.com/Discrete-time-Dynamic-Models-Chemical-Engineering/dp/0195121988"&gt;Discrete-Time Dynamic Models&lt;/a&gt;, but the key observations here are the following.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;First, the large spikes seen in both of the original data sequences were caused by the simultaneous, temporary loss of both measurements and the subsequent coding of these missing values as zero by the data collection system.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The practical question of interest was to determine how long, on average, the viscous, polymeric material being fed into and out of the product storage tank was spending there.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; 
&lt;/span&gt;A standard method for addressing such questions is the use of cross-correlation analysis, where the expected result is a broad peak like the heavy dashed line in the plot shown below.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The location of this peak provides an estimate of the average time spent in the tank, which is approximately 21 hours in this case, as indicated in the plot.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This result was about what was expected, and it was obtained by applying standard cross-correlation analysis to the cleaned data sequences shown in the bottom two plots above.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The lighter solid curve in the plot below shows the results of applying exactly the same analysis, but to the original data sequences instead of the cleaned data sequences.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This dramatically different plot suggests that the material is spending very little time in the storage tank: accepted uncritically, this result would imply severe fouling of the tank, suggesting a need to shut the process down and clean out the tank, an expensive and labor-intensive proposition.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The main point of this example is that the difference in these two plots is entirely due to the extreme data anomalies present in the original time-series.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Additional examples of problems caused by time-series outliers are discussed in Section 4.3 of my book &lt;a href="http://www.amazon.com/Mining-Imperfect-Data-Contamination-Incomplete/dp/0898715822"&gt;Mining Imperfect Data&lt;/a&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a 
href="http://1.bp.blogspot.com/-LtDGNc0Pq3w/TtJgcfIkfwI/AAAAAAAAAF4/OP18CGkOpck/s1600/hampelfig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" hda="true" height="319" src="http://1.bp.blogspot.com/-LtDGNc0Pq3w/TtJgcfIkfwI/AAAAAAAAAF4/OP18CGkOpck/s320/hampelfig02.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;One of the primary features of the analysis of time-series and other streaming data sequences is the need for &lt;i style="mso-bidi-font-style: normal;"&gt;local&lt;/i&gt; data characterizations.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This point is illustrated in the plot below, which shows the first 200 observations of the storage tank inlet data sequence discussed above.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;All of these observations but one are represented as open circles in this plot, but the data point at &lt;em&gt;k = 110&lt;/em&gt; is shown as a solid circle, to emphasize how far it lies from its immediate neighbors in the data sequence.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;It is important to note that this point is not anomalous with respect to the overall range of this data sequence – it is, for example, well within the normal range of variation seen for the points from about &lt;em&gt;k = 150&lt;/em&gt; to &lt;em&gt;k = 200&lt;/em&gt; – but it is clearly anomalous with respect to those points that immediately precede and follow it.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;A general strategy for automatically detecting and removing&amp;nbsp;such spikes from 
a data sequence like this one is to apply a &lt;i style="mso-bidi-font-style: normal;"&gt;moving window data cleaning filter&lt;/i&gt; which characterizes each data point with respect to a local neighborhood of prior and subsequent samples.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;That is, for each data point &lt;i style="mso-bidi-font-style: normal;"&gt;k&lt;/i&gt; in the original data sequence, this type of filter forms a cleaned data estimate based on some number &lt;i style="mso-bidi-font-style: normal;"&gt;J&lt;/i&gt; of prior data values (i.e., points &lt;i style="mso-bidi-font-style: normal;"&gt;k-J&lt;/i&gt; through &lt;i style="mso-bidi-font-style: normal;"&gt;k-1&lt;/i&gt; in the sequence) and, in the simplest implementations, the same number of subsequent data values (i.e., points &lt;i style="mso-bidi-font-style: normal;"&gt;k+1&lt;/i&gt; through &lt;i style="mso-bidi-font-style: normal;"&gt;k+J&lt;/i&gt; in the sequence).&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-bSZnrwrGhFg/TtJg1JK1mLI/AAAAAAAAAGA/I95d4s7VILM/s1600/hampelfig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" hda="true" height="319" src="http://4.bp.blogspot.com/-bSZnrwrGhFg/TtJg1JK1mLI/AAAAAAAAAGA/I95d4s7VILM/s320/hampelfig03.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The specific data cleaning filter considered here is the &lt;em&gt;Hampel filter&lt;/em&gt;, which applies the Hampel identifier discussed 
in Chapter 7 of &lt;a href="http://www.amazon.com/s?ie=UTF8&amp;amp;rh=n%3A283155%2Ck%3Aexploring%20data%20in%20engineering.%20the%20sciences.%20and%20medicine&amp;amp;page=1"&gt;Exploring Data in Engineering, the Sciences and Medicine&lt;/a&gt; to this moving data window.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;If the &lt;i style="mso-bidi-font-style: normal;"&gt;k&lt;sup&gt;th&lt;/sup&gt;&lt;/i&gt; data point is declared to be an outlier, it is replaced by the median value computed from this data window; otherwise, the data point is not modified.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The results of applying the Hampel filter with a window width of &lt;i style="mso-bidi-font-style: normal;"&gt;J = 5&lt;/i&gt; to the above data sequence&amp;nbsp;are shown in the plot below.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The effect is to modify three of the original data points – those at &lt;i style="mso-bidi-font-style: normal;"&gt;k = 43, 110&lt;/i&gt;, and &lt;i style="mso-bidi-font-style: normal;"&gt;120&lt;/i&gt; – and the original values of these modified points are shown as solid circles at the appropriate locations&amp;nbsp;in this plot.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;It is clear that the most pronounced effect of the Hampel filter is to remove the local outlier indicated in the above figure and replace it with a value that is much more representative of the other data points in the immediate vicinity.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-sPj3GVpR9Uw/TtJhjPvDHnI/AAAAAAAAAGI/W6pb7RUWXdc/s1600/hampelfig04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" hda="true" height="319" 
src="http://4.bp.blogspot.com/-sPj3GVpR9Uw/TtJhjPvDHnI/AAAAAAAAAGI/W6pb7RUWXdc/s320/hampelfig04.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;As I noted above, the Hampel filter implementation used here is that available in the &lt;em&gt;R&lt;/em&gt; package &lt;strong&gt;pracma&lt;/strong&gt; as procedure &lt;strong&gt;outlierMAD&lt;/strong&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;I will discuss this &lt;em&gt;R&lt;/em&gt; package in more detail in my next post, but for those seeking a more detailed discussion of the Hampel filter in the meantime, one is freely available on-line in the form of an EDN article I wrote in 2002, &lt;a href="http://www.edn.com/article/486039-Scrub_data_with_scale_invariant_nonlinear_digital_filters.php"&gt;Scrub data with scale-invariant nonlinear digital filters&lt;/a&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; Also, c&lt;/span&gt;omparisons with alternatives like the standard median filter (generally too aggressive, introducing unwanted distortion into the “cleaned” data sequence) and the center-weighted median filter (sometimes quite effective) are presented in Section 4.2 of the book&amp;nbsp;&lt;em&gt;Mining Imperfect Data&lt;/em&gt;&amp;nbsp;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;mentioned above.&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9179325420174899779-606176191066433899?l=exploringdatablog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://exploringdatablog.blogspot.com/feeds/606176191066433899/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://exploringdatablog.blogspot.com/2011/11/cleaning-time-series-and-other-data.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/606176191066433899'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/606176191066433899'/><link rel='alternate' type='text/html' href='http://exploringdatablog.blogspot.com/2011/11/cleaning-time-series-and-other-data.html' title='Cleaning time-series and other data streams'/><author><name>Ron Pearson (aka TheNoodleDoodler)</name><uri>http://www.blogger.com/profile/15693640298594791682</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-xe3qt3qFIjc/TtJe9BAfGtI/AAAAAAAAAFw/GTVB2hnN3fU/s72-c/hampelfig01.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9179325420174899779.post-8205041210474190304</id><published>2011-11-11T14:16:00.000-08:00</published><updated>2011-11-11T14:16:05.333-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='harmonic means'/><category scheme='http://www.blogger.com/atom/ns#' term='reciprocal transformations'/><category scheme='http://www.blogger.com/atom/ns#' term='inverse Gaussian distribution'/><category scheme='http://www.blogger.com/atom/ns#' term='Gaussian distribution'/><category scheme='http://www.blogger.com/atom/ns#' term='Pareto distributions'/><category scheme='http://www.blogger.com/atom/ns#' term='bimodality'/><category scheme='http://www.blogger.com/atom/ns#' term='asymmetry'/><title 
type='text'>Harmonic means, reciprocals, and ratios of random variables</title><content type='html'>&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;In my last few posts, I have considered “long-tailed” distributions whose probability density decays much more slowly than standard distributions like the Gaussian.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For these slowly-decaying distributions, the harmonic mean often turns out to be a much better (i.e., less variable) characterization than the arithmetic mean, which is generally not even well-defined theoretically for these distributions.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Since the harmonic mean is defined as the reciprocal of the mean of the reciprocal values, it is intimately related to the reciprocal transformation.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The main point of this post is to show how profoundly the reciprocal transformation can alter the character of a distribution, for better or worse.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;One way that reciprocal transformations sneak into analysis results is through attempts to characterize ratios of random numbers.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The key issue underlying all of these ideas is the question of when the denominator variable in either a reciprocal transformation or a ratio exhibits non-negligible probability in a finite neighborhood of zero.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;I discuss transformations in Chapter 12 of &lt;a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650/ref=sr_1_1?s=books&amp;amp;ie=UTF8&amp;amp;qid=1321042995&amp;amp;sr=1-1"&gt;Exploring Data in Engineering, the Sciences and Medicine&lt;/a&gt;, with a section (12.7) devoted to reciprocal transformations, showing what happens when we apply them to six different distributions: 
Gaussian, Laplace, Cauchy, beta, Pareto, and lognormal.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;In the general case, if a random variable &lt;em&gt;x&lt;/em&gt; has the density &lt;em&gt;p(x)&lt;/em&gt;, the reciprocal &lt;em&gt;y = 1/x&lt;/em&gt; has the density &lt;em&gt;g(y)&lt;/em&gt; given by:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;em&gt;g(y) = p(1/y)/y&lt;sup&gt;2&lt;/sup&gt;&lt;/em&gt; &lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;As I discuss in greater detail in &lt;em&gt;Exploring Data&lt;/em&gt;, the consequence of this transformation is &lt;i style="mso-bidi-font-style: normal;"&gt;typically&lt;/i&gt; (though not always) to convert a well-behaved distribution into a very poorly behaved one.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As a specific example, the plot below shows the effect of the reciprocal transformation on a Gaussian random variable with mean 1 and standard deviation 2.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The most obvious characteristic of this transformed distribution is its strongly asymmetric, bimodal character, but another, less obvious consequence of the reciprocal transformation is that it takes a distribution that is completely characterized by its first two moments into a new distribution with Cauchy-like tails, for which none of the integer moments exist.&lt;/div&gt;&lt;div class="MsoNormal" 
style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-ihUpKC5yNpg/Tr1xtl2PFDI/AAAAAAAAAFQ/03fpQJy8IIc/s1600/recipfig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" nda="true" src="http://4.bp.blogspot.com/-ihUpKC5yNpg/Tr1xtl2PFDI/AAAAAAAAAFQ/03fpQJy8IIc/s320/recipfig01.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The implications of the reciprocal transformation for many other distributions are equally non-obvious.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For example, both the badly-behaved Cauchy distribution (no moments exist) and the well-behaved lognormal distribution (all moments exist, but interestingly, do not completely characterize the distribution, as I have discussed in a previous post) are invariant under the reciprocal transformation.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Also, applying the reciprocal transformation to the long-tailed Pareto type I distribution (which exhibits few or no finite moments, depending on its tail decay rate) yields a beta distribution, all of whose moments are finite.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Finally, it is worth noting that the invariance of the Cauchy distribution under the reciprocal transformation lies at the heart of the following result, presented in the book &lt;a 
href="http://www.amazon.com/Continuous-Univariate-Distributions-Probability-Statistics/dp/0471584959/ref=sr_1_2?s=books&amp;amp;ie=UTF8&amp;amp;qid=1321042772&amp;amp;sr=1-2"&gt;Continuous Univariate Distributions&lt;/a&gt;&amp;nbsp;by Johnson, Kotz, and Balakrishnan (Volume 1, 2&lt;sup&gt;nd&lt;/sup&gt; edition, Wiley, 1994, page 319).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;They note that if the density of &lt;em&gt;x&lt;/em&gt; is positive, continuous, and differentiable at &lt;em&gt;x = 0&lt;/em&gt; – all true for the Gaussian case – the distribution of the harmonic mean of &lt;em&gt;N&lt;/em&gt; samples approaches a Cauchy limit as &lt;em&gt;N&lt;/em&gt; becomes infinitely large.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;As noted above, the key issue responsible for the pathological behavior of the reciprocal transformation is the question of whether the original data distribution exhibits nonzero probability of taking on values within a neighborhood around zero.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In particular, note that if &lt;em&gt;x&lt;/em&gt; can only assume values larger than some positive lower limit &lt;em&gt;L&lt;/em&gt;, it follows that &lt;em&gt;1/x&lt;/em&gt; necessarily lies between &lt;em&gt;0&lt;/em&gt; and &lt;em&gt;1/L&lt;/em&gt;, which is enough to guarantee that all moments of the transformed distribution exist.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For the Gaussian distribution, even if the mean is large enough and the standard deviation is small enough that the probability of observing values less than some limit &lt;em&gt;L &amp;gt; 0&lt;/em&gt; is negligible, the fact that this probability is not &lt;i style="mso-bidi-font-style: normal;"&gt;zero&lt;/i&gt; means that the moments of &lt;i style="mso-bidi-font-style: normal;"&gt;any&lt;/i&gt; 
reciprocally-transformed Gaussian distribution are not finite.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As a practical matter, however, reciprocal transformations and related characterizations – like harmonic means and ratios – do become better-behaved as the probability of observing values near zero becomes negligibly small.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;To see this point, consider two reciprocally-transformed Gaussian examples.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The first is the one considered above: the reciprocal transformation of a Gaussian random variable with mean 1 and standard deviation 2.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In this case, the probability that &lt;em&gt;x&lt;/em&gt; assumes values smaller than or equal to zero is non-negligible.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Specifically, this probability is simply the cumulative distribution function for the distribution evaluated at zero, easily computed in R as approximately 31%:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt; pnorm(0,mean=1,sd=2)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;[1] 0.3085375&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;In contrast, for a Gaussian random variable with mean 1 and standard deviation 0.1, the corresponding probability is negligibly small:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt; pnorm(0,mean=1,sd=0.1)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;[1] 
7.619853e-24&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;If we consider the harmonic means of these two examples, we see that the first one is horribly behaved, as all of the results presented here would lead us to expect.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In fact, the &lt;strong&gt;qqPlot&lt;/strong&gt; command in the &lt;strong&gt;car&lt;/strong&gt; package&amp;nbsp; in &lt;em&gt;R &lt;/em&gt;allows us to construct quantile-quantile plots for the Student’s &lt;em&gt;t&lt;/em&gt;-distribution with one degree of freedom, corresponding to the Cauchy distribution, yielding the plot shown below.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The Cauchy-like tail behavior expected from the results presented by Johnson, Kotz and Balakrishnan is seen clearly in this Cauchy Q-Q plot, constructed from 1000 harmonic means, each computed from statistically independent samples drawn from a Gaussian distribution with mean 1 and standard deviation 2.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The fact that almost all of the observations fall within the – very wide – 95% confidence interval around the reference line suggests that the Cauchy tail behavior is appropriate here.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-tQbQfuhvKY4/Tr1y6ipHrTI/AAAAAAAAAFY/BWQUNWtTVbg/s1600/recipfig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" nda="true" src="http://2.bp.blogspot.com/-tQbQfuhvKY4/Tr1y6ipHrTI/AAAAAAAAAFY/BWQUNWtTVbg/s320/recipfig02.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 
1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;To further confirm this point, compare the corresponding normal Q-Q plot for the same sequence of harmonic means, shown below.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; Th&lt;/span&gt;ere, the extreme non-Gaussian character of these harmonic means is readily apparent from the pronounced outliers evident in both the upper and lower tails.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-II9KLHCeIYw/Tr1zH9K003I/AAAAAAAAAFg/14mIAISzn4U/s1600/recipfig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" nda="true" src="http://2.bp.blogspot.com/-II9KLHCeIYw/Tr1zH9K003I/AAAAAAAAAFg/14mIAISzn4U/s320/recipfig03.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;In marked contrast, for the second example with the mean of 1 as before but the much smaller standard deviation of 0.1, the harmonic mean is much better behaved, as the normal Q-Q plot below illustrates.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Specifically, this plot is identical in construction to the one above, except it was computed from samples drawn from the second data distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; 
&lt;/span&gt;Here, most of the computed harmonic mean values fall within the 95% confidence limits around the Gaussian reference line, suggesting that it is not unreasonable in practice to regard these values as approximately normally distributed, in spite of the pathologies of the reciprocal transformation.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-9kCbnML55mE/Tr1zVSWL8kI/AAAAAAAAAFo/aGD2h8oow4c/s1600/recipfig04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" nda="true" src="http://2.bp.blogspot.com/-9kCbnML55mE/Tr1zVSWL8kI/AAAAAAAAAFo/aGD2h8oow4c/s320/recipfig04.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;One reason the reciprocal transformation is important in practice – particularly in connection with the Gaussian distribution – is that the desire to characterize ratios of uncertain quantities does arise from time to time.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In particular, if we are interested in characterizing the ratio of two averages, the Central Limit Theorem would lead us to expect that, at least approximately, this ratio should behave like the ratio of two Gaussian random variables.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;If these component averages are statistically independent, the expected value of the ratio can be re-written as the product of the expected value of the numerator average and the expected value of the reciprocal of 
the denominator average, leading us directly to the reciprocal Gaussian transformation discussed here.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In fact, if these two averages both have zero mean, it is a standard result that the ratio has a Cauchy distribution (this result is presented in the same discussion from Johnson, Kotz and Balakrishnan noted above).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As in the second harmonic mean example presented above, however, it turns out that if the mean and standard deviation of the denominator variable are such that the probability of a zero or negative denominator is negligible, the distribution of the ratio may be approximated reasonably well as Gaussian.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;A very readable and detailed discussion of this fact is given in the paper by George Marsaglia in the May 2006 issue of &lt;a href="http://www.jstatsoft.org/v16/i04"&gt;Journal of Statistical Software&lt;/a&gt;.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Finally, it is important to note that the “reciprocally-transformed Gaussian distribution” I have been discussing here is &lt;i style="mso-bidi-font-style: normal;"&gt;not&lt;/i&gt; the same as the &lt;em&gt;inverse Gaussian distribution&lt;/em&gt;, to which Johnson, Kotz and Balakrishnan devote a 39-page chapter (Chapter 15).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;That distribution takes only positive values and exhibits moments of all orders, both positive and negative, and as a consequence, it has the interesting characteristic that it remains well-behaved under reciprocal transformations, in marked contrast to the Gaussian case.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 
0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9179325420174899779-8205041210474190304?l=exploringdatablog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploringdatablog.blogspot.com/feeds/8205041210474190304/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://exploringdatablog.blogspot.com/2011/11/harmonic-means-reciprocals-and-ratios.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/8205041210474190304'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/8205041210474190304'/><link rel='alternate' type='text/html' href='http://exploringdatablog.blogspot.com/2011/11/harmonic-means-reciprocals-and-ratios.html' title='Harmonic means, reciprocals, and ratios of random variables'/><author><name>Ron Pearson (aka TheNoodleDoodler)</name><uri>http://www.blogger.com/profile/15693640298594791682</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-ihUpKC5yNpg/Tr1xtl2PFDI/AAAAAAAAAFQ/03fpQJy8IIc/s72-c/recipfig01.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9179325420174899779.post-3207575383729573387</id><published>2011-10-23T13:31:00.000-07:00</published><updated>2011-10-23T13:31:47.692-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='harmonic means'/><category scheme='http://www.blogger.com/atom/ns#' term='Zipf distribution'/><category scheme='http://www.blogger.com/atom/ns#' term='Zipf-Mandelbrot 
distribution'/><category scheme='http://www.blogger.com/atom/ns#' term='geometric means'/><category scheme='http://www.blogger.com/atom/ns#' term='infinite variance distributions'/><category scheme='http://www.blogger.com/atom/ns#' term='long tail phenomena'/><title type='text'>The Zipf and Zipf-Mandelbrot distributions</title><content type='html'>&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;In my last few posts, I have been discussing some of the consequences of the slow decay rate of the tail of the Pareto type I distribution, along with some other, closely related notions, all in the context of continuously distributed data.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Today’s post considers the Zipf distribution for discrete data, which has come to be extremely popular as a model for phenomena like word frequencies, city sizes, or sales rank data, where the values of these quantities associated with randomly selected samples can vary by many orders of magnitude.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;More specifically, the Zipf distribution is defined by a probability p&lt;sub&gt;i&lt;/sub&gt; of observing the i&lt;sup&gt;th&lt;/sup&gt; element of an infinite sequence of objects in a single random draw from that sequence, where the probability is given by:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;blockquote&gt;p&lt;sub&gt;i&lt;/sub&gt; = A/i&lt;sup&gt;a&lt;/sup&gt;&lt;/blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Here, &lt;i 
style="mso-bidi-font-style: normal;"&gt;a&lt;/i&gt; is a positive number greater than 1 that determines the rate of the distribution’s tail decay, and &lt;i style="mso-bidi-font-style: normal;"&gt;A&lt;/i&gt; is a normalization constant, chosen so that these probabilities sum to 1.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Like the continuous-valued Pareto type I distribution, the Zipf distribution exhibits a “long tail,” meaning that its tail decays slowly enough that in a random sample of objects &lt;i style="mso-bidi-font-style: normal;"&gt;O&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt; drawn from a Zipf distribution, some very large values of the index &lt;i style="mso-bidi-font-style: normal;"&gt;i&lt;/i&gt; will be observed, particularly for relatively small values of the exponent &lt;i style="mso-bidi-font-style: normal;"&gt;a&lt;/i&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In one of the earliest and most common applications of the Zipf distribution, the objects considered represent words in a document and &lt;i style="mso-bidi-font-style: normal;"&gt;i&lt;/i&gt; represents their rank, ranging from most frequent (for &lt;i style="mso-bidi-font-style: normal;"&gt;i = 1&lt;/i&gt;) to rare (for large &lt;i style="mso-bidi-font-style: normal;"&gt;i&lt;/i&gt; ).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;In a more business-oriented application, the objects might be products for sale (e.g., books listed on Amazon), with the index &lt;i style="mso-bidi-font-style: normal;"&gt;i&lt;/i&gt; corresponding to their sales rank.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For a fairly extensive collection of references to many different applications of the Zipf distribution, the website&amp;nbsp;(originally) from&amp;nbsp;&lt;a href="http://www.nslij-genetics.org/wli/zipf/index.html"&gt;Rockefeller University&lt;/a&gt;&amp;nbsp;is an excellent source.&lt;/div&gt;&lt;div class="MsoNormal" 
style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;In &lt;a href="http://www.amazon.com/s/ref=nb_sb_ss_i_1_15?url=search-alias%3Dstripbooks&amp;amp;field-keywords=exploring+data+in+engineering.+the+sciences.+and+medicine&amp;amp;sprefix=Exploring+Data+"&gt;Exploring Data in Engineering, the Sciences, and Medicine&lt;/a&gt;, I give a brief discussion of both the Zipf distribution and the closely related Zipf-Mandelbrot distribution discussed by Benoit Mandelbrot in his book &lt;a href="http://www.amazon.com/s/ref=nb_sb_ss_i_0_12?url=search-alias%3Dstripbooks&amp;amp;field-keywords=the+fractal+geometry+of+nature&amp;amp;sprefix=the+fractal+"&gt;The Fractal Geometry of Nature&lt;/a&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The probabilities defining this distribution may be parameterized in several ways, and&amp;nbsp;the one given in &lt;em&gt;Exploring Data&lt;/em&gt; is:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;blockquote&gt;p&lt;sub&gt;i&lt;/sub&gt; = A/(1+Bi)&lt;sup&gt;a&lt;/sup&gt;&lt;/blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;where again &lt;i style="mso-bidi-font-style: normal;"&gt;a&lt;/i&gt; is an exponent that determines the rate at which the tail of the distribution decays, and &lt;i style="mso-bidi-font-style: normal;"&gt;B&lt;/i&gt; is a second parameter with a value that is strictly positive but no greater than 1.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For both the Zipf distribution and the Zipf-Mandelbrot distribution, the exponent &lt;i 
style="mso-bidi-font-style: normal;"&gt;a&lt;/i&gt; must be greater than 1 for the distribution to be well-defined, it must be greater than 2 for the mean to be finite, and it must be greater than 3 for the variance to be finite.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;So far, I have been unable to find an &lt;em&gt;R&lt;/em&gt; package that supports the generation of random samples drawn from the Zipf distribution, but the package &lt;strong&gt;zipfR&lt;/strong&gt; includes the command &lt;strong&gt;rlnre&lt;/strong&gt;, which generates random samples drawn from the Zipf-Mandelbrot distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As I noted, this distribution can be parameterized in several different ways and, as Murphy’s law would have it, the &lt;strong&gt;zipfR&lt;/strong&gt; parameterization is not the same as the one presented above and discussed in &lt;em&gt;Exploring Data&lt;/em&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Fortunately, the conversion between these parameters is simple.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The &lt;strong&gt;zipfR&lt;/strong&gt; package defines the distribution in terms of a parameter &lt;strong&gt;alpha&lt;/strong&gt; that must lie strictly between 0 and 1, and a second parameter &lt;strong&gt;B&lt;/strong&gt; that I will call &lt;em&gt;B&lt;sub&gt;zipfR&lt;/sub&gt;&lt;/em&gt; to avoid confusion with the parameter &lt;em&gt;B&lt;/em&gt; in the above definition.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;These parameters are related by:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 
1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;blockquote&gt;alpha = 1/a&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;and&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;B&lt;sub&gt;zipfR&lt;/sub&gt; = (a-1) B&lt;/blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Since the &lt;i style="mso-bidi-font-style: normal;"&gt;a&lt;/i&gt; parameter (and thus the &lt;strong&gt;alpha&lt;/strong&gt; parameter in the &lt;strong&gt;zipfR&lt;/strong&gt; package) determines the tail decay rate of the distribution, it is of the most interest here, and the rest of this post will focus on three examples: a = 1.5 (alpha = 2/3), for which both the distribution’s mean and variance are infinite, a = 2.5 (alpha = 2/5), for which the mean is finite but the variance is not, and a = 3.5 (alpha = 2/7), for which both the mean and variance are finite.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The value of the parameter &lt;em&gt;B&lt;/em&gt; in the &lt;em&gt;Exploring Data&lt;/em&gt; definition of the distribution will be fixed at 0.2 in all of these examples, corresponding to values of &lt;em&gt;B&lt;sub&gt;zipfR&lt;/sub&gt;&lt;/em&gt; = 0.1, 0.3, and 0.5 for the three examples considered here.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;To generate Zipf-Mandelbrot random samples, the &lt;strong&gt;zipfR&lt;/strong&gt; package uses the procedure &lt;strong&gt;rlnre&lt;/strong&gt; in conjunction with the procedure &lt;strong&gt;lnre &lt;/strong&gt;(the abbreviation&amp;nbsp;“lnre”&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;stands for “large number of rare events” and it represents a class of 
data models that includes the Zipf-Mandelbrot distribution). &amp;nbsp;&lt;/span&gt;Specifically, to generate a random sample of size N = 100 for the first case considered here, the following &lt;em&gt;R&lt;/em&gt; code is executed:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt; library(zipfR)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt; ZM = lnre("zm", alpha = 2/3, B = 0.1)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt; zmsample = rlnre(ZM, n=100)&lt;/div&gt;&lt;/blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The first line loads the &lt;strong&gt;zipfR&lt;/strong&gt; library (which must first be installed, of course, using the &lt;strong&gt;install.packages&lt;/strong&gt; command), the second line invokes the &lt;strong&gt;lnre&lt;/strong&gt; command to set up the distribution with the desired parameters, and the last line invokes the &lt;strong&gt;rlnre&lt;/strong&gt; command to generate a random sample of 100 observations from this distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;(As with all &lt;em&gt;R&lt;/em&gt; random number generators, the &lt;strong&gt;set.seed&lt;/strong&gt; command should be used first to initialize the random number generator seed if you want to get repeatable results; for the results presented here, I used &lt;strong&gt;set.seed(101)&lt;/strong&gt;.)&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The sample returned by the &lt;strong&gt;rlnre&lt;/strong&gt; command is a vector of 100 observations, which have the “factor” data type, although their designations are numeric (think of the factor value “1339” as meaning “1 sample of object number 1339”).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In 
the results I present here, I have converted these factor responses to numerical ones so I can interpret them as numerical ranks.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This conversion is a little subtle: simply converting from factor to numeric values via something like “&lt;strong&gt;zmnumeric = as.numeric(zmsample)&lt;/strong&gt;” almost certainly doesn’t give you what you want: it returns the internal integer codes that index the factor’s levels (1, 2, 3, and so on) rather than the numeric labels themselves, so labels like “1339” and “73” are replaced by position numbers that generally bear no relation to 1339 and 73.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;To get what you want (e.g., the labels “1339” and “73” assigned to the numbers 1339 and 73, respectively), you need to first convert the factors in &lt;strong&gt;zmsample&lt;/strong&gt; into characters and then convert these characters into numeric values:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;blockquote&gt;zmnumeric = as.numeric(as.character(zmsample))&lt;/blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The three plots below show random samples drawn from each of the three Zipf-Mandelbrot distributions considered here.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In all cases, the y-axis corresponds to the number of times the object labeled &lt;em&gt;i &lt;/em&gt;was observed in a random sample of size N = 100 drawn from the distribution with the indicated exponent.&lt;span style="mso-spacerun: 
yes;"&gt;&amp;nbsp; &lt;/span&gt;Since the range of these indices can be quite large in the slowly-decaying members of the Zipf-Mandelbrot distribution family, the plots are drawn with logarithmic x-axes, and to facilitate comparisons, the x-axes have the same range in all three plots, as do the y-axes.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In all three plots, object i = 1 occurs most often – about a dozen times in the top plot, two dozen times in the middle plot, and three dozen times in the bottom plot – and those objects with larger indices occur less frequently.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The major difference between these three examples lies in the largest indices of the objects seen in the samples: we never see an object with index greater than 50 in the bottom plot, we see only two such objects in the middle plot, while more than a third of the objects in the top plot meet this condition, with the most extreme object having index i = 115,116.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-4IkBCpRbZdk/TqRtqcC60pI/AAAAAAAAAEg/0NJarxwlteo/s1600/zipfig00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" rda="true" src="http://1.bp.blogspot.com/-4IkBCpRbZdk/TqRtqcC60pI/AAAAAAAAAEg/0NJarxwlteo/s320/zipfig00.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.5in;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;As in the case of the Pareto type I distributions I discussed in several previous posts – which may be 
regarded as the continuous analog of the Zipf distribution – the mean is generally not a useful characterization for the Zipf distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This point is illustrated in the boxplot comparison presented below, which summarizes the means computed from 1000 statistically independent random samples drawn from each of the three distributions considered here, where the object labels have been converted to numerical values as described above.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Thus, the three boxplots on the left represent the means – note the logarithmic scale on the y-axis – of these index values &lt;i style="mso-bidi-font-style: normal;"&gt;i&lt;/i&gt; generated for each random sample.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The extreme variability seen for Case 1 (a = 1.5) reflects the fact that neither the mean nor the variance is finite for this case, and the consistent reduction in the range of variability for Cases 2 (a = 2.5, finite mean but infinite variance) and 3 (a = 3.5, finite mean and variance) reflects the “shortening tail” of this distribution with increasing exponent &lt;i style="mso-bidi-font-style: normal;"&gt;a&lt;/i&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As I discussed in my last post, a better characterization than the mean for distributions like this is the “95% tail length,” corresponding to the 95% sample quantile. 
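&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;A minimal sketch of how both summaries might be computed for a single sample, assuming the &lt;strong&gt;zipfR&lt;/strong&gt; setup for Case 1 shown earlier and the base &lt;em&gt;R&lt;/em&gt; functions &lt;strong&gt;mean&lt;/strong&gt; and &lt;strong&gt;quantile&lt;/strong&gt;; the names &lt;strong&gt;samplemean&lt;/strong&gt; and &lt;strong&gt;taillength95&lt;/strong&gt; are illustrative only, and the numerical results depend on the seed and on the installed package version:&lt;/div&gt;
&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt; library(zipfR)&lt;/div&gt;
&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt; set.seed(101)&amp;nbsp; # fix the seed so the sample is repeatable&lt;/div&gt;
&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt; ZM = lnre("zm", alpha = 2/3, B = 0.1)&amp;nbsp; # Case 1: a = 1.5&lt;/div&gt;
&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt; zmsample = rlnre(ZM, n = 100)&amp;nbsp; # factor-valued sample of size 100&lt;/div&gt;
&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt; zmnumeric = as.numeric(as.character(zmsample))&amp;nbsp; # factor labels to numeric ranks&lt;/div&gt;
&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt; samplemean = mean(zmnumeric)&amp;nbsp; # arithmetic mean of the ranks (illustrative name)&lt;/div&gt;
&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt; taillength95 = quantile(zmnumeric, probs = 0.95)&amp;nbsp; # 95% sample quantile (illustrative name)&lt;/div&gt;&lt;/blockquote&gt;
&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;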
Boxplots summarizing these values for the three distributions considered here are shown to the right of the dashed vertical line in the plot below.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In each case, the range of variation seen here is much less extreme for the 95% tail length than it is for the mean, supporting the idea that this is a better characterization for data described by Zipf-like discrete distributions.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-1efDOBGWbGM/TqRuBRTgeMI/AAAAAAAAAEo/GtG7pBgZVIY/s1600/zipfig01a.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" rda="true" src="http://3.bp.blogspot.com/-1efDOBGWbGM/TqRuBRTgeMI/AAAAAAAAAEo/GtG7pBgZVIY/s320/zipfig01a.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.5in;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Other alternatives to the (arithmetic) mean that I discussed in conjunction with the Pareto type I distribution were the sample median, the geometric mean, and the harmonic mean.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The plot below compares these four characterizations for 1000 random samples, each of size N = 100, drawn from the Zipf-Mandelbrot distribution with a = 3.5 (the third case), for which the mean is well-defined.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Even here, it is clear that the mean is considerably more variable than these other three alternatives.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" 
style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-iFLdn8T5avk/TqRuODcIExI/AAAAAAAAAEw/sX0yQP6zmzM/s1600/zipfig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" rda="true" src="http://3.bp.blogspot.com/-iFLdn8T5avk/TqRuODcIExI/AAAAAAAAAEw/sX0yQP6zmzM/s320/zipfig02.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.5in;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Finally, the plot below shows boxplot comparisons of these alternative characterizations – the median, the geometric mean, and the harmonic mean – for all three of the distributions considered here.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Not surprisingly, Case 1 (a = 1.5) exhibits the largest variability seen for all three characterizations, but the harmonic mean is much more consistent for this case than either the geometric mean or the median.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In fact, the same observation holds – although less dramatically – for Case 2 (a = 2.5), and the harmonic mean appears more consistent than the geometric mean for all three cases.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This observation is particularly interesting in view of the connection between the harmonic mean and the reciprocal transformation, which I will discuss in more detail next time.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-swrGBTcTcNs/TqRubBMbQ5I/AAAAAAAAAE4/8tNLF8W1G4Y/s1600/zipfig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" 
height="319" rda="true" src="http://4.bp.blogspot.com/-swrGBTcTcNs/TqRubBMbQ5I/AAAAAAAAAE4/8tNLF8W1G4Y/s320/zipfig03.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.5in;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9179325420174899779-3207575383729573387?l=exploringdatablog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploringdatablog.blogspot.com/feeds/3207575383729573387/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://exploringdatablog.blogspot.com/2011/10/zipf-and-zipf-mandelbrot-distributions.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/3207575383729573387'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/3207575383729573387'/><link rel='alternate' type='text/html' href='http://exploringdatablog.blogspot.com/2011/10/zipf-and-zipf-mandelbrot-distributions.html' title='The Zipf and Zipf-Mandelbrot distributions'/><author><name>Ron Pearson (aka TheNoodleDoodler)</name><uri>http://www.blogger.com/profile/15693640298594791682</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-4IkBCpRbZdk/TqRtqcC60pI/AAAAAAAAAEg/0NJarxwlteo/s72-c/zipfig00.png' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9179325420174899779.post-7180627990616206183</id><published>2011-09-28T15:11:00.000-07:00</published><updated>2011-09-28T15:11:01.454-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='arithmetic means'/><category scheme='http://www.blogger.com/atom/ns#' term='Pareto distributions'/><category scheme='http://www.blogger.com/atom/ns#' term='long tail phenomena'/><category scheme='http://www.blogger.com/atom/ns#' term='normal quantiles'/><title type='text'>Is the “Long Tail” a Useless Concept?</title><content type='html'>&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;In response to my last post, “The Long Tail of the Pareto Distribution,” Neil Gunther had the following comment:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;blockquote&gt;“&lt;span style="color: #333333;"&gt;Unfortunately, you've fallen into the trap of using the ‘long tail’ misnomer. 
If you think about it, it can't possibly be the length of the tail that sets distributions like Pareto and Zipf apart; even the negative exponential and Gaussian have &lt;i&gt;infinitely&lt;/i&gt; long tails.”&lt;/span&gt;&lt;/blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;He goes on to say that it is the “width” or the “weight” of the tails that is important, and that a more appropriate characterization of these “Long Tails” would be “heavy-tailed” or “power-law” distributions.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Neil’s comment raises an important point: while the term “long tail” appears a lot in both the on-line and hard-copy literature, it is often somewhat ambiguously defined.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For example, in his book, &lt;a href="http://www.amazon.com/Long-Tail-Revised-Updated-Business/dp/1401309666/ref=sr_1_1?s=books&amp;amp;ie=UTF8&amp;amp;qid=1317246600&amp;amp;sr=1-1"&gt;&lt;em&gt;The Long Tail&lt;/em&gt;&lt;/a&gt;, Chris Anderson offers the following description (page 10):&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;blockquote&gt;“In statistics, curves like that are called ‘long-tailed distributions’ because the tail of the curve is very long relative to the head.”&lt;/blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The difficulty with this description is that it is somewhat ambiguous since 
it says nothing about how to measure “tail length,” forcing us to adopt our own definitions.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;It is clear from Neil’s comments that the definition he adopts for “tail length” is the width of the distribution’s support set.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Under this definition, the notion of a “long-tailed distribution” is of extremely limited utility: the situation is exactly as Neil describes it, with “long-tailed distributions” corresponding to any distribution with unbounded support, including both distributions like the Gaussian and gamma distribution where the mean is a reasonable characterization, and those like the Cauchy and Pareto distribution where the mean doesn’t even exist.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The situation is analogous to that of confidence intervals, which characterize the uncertainty inherited by any characterization computed from a collection of uncertain (i.e., random) data values.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As a specific example consider the mean: the &lt;em&gt;sample mean&lt;/em&gt; is the arithmetic average of &lt;em&gt;N&lt;/em&gt; observed data samples, and it is generally intended as an estimate of the &lt;em&gt;population mean&lt;/em&gt;, defined as the&amp;nbsp;first moment of&amp;nbsp;the data distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;A &lt;em&gt;q% confidence interval&lt;/em&gt; around the&amp;nbsp;sample mean is an interval that contains the population mean&amp;nbsp;with probability at least &lt;em&gt;q%&lt;/em&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;These intervals can be computed in various ways for different data characterizations, but the key point here is 
that they are widely used in practice,&amp;nbsp;with the most popular choices being the 90%, 95% and 99% confidence intervals, which necessarily become wider as&amp;nbsp;this percentage&amp;nbsp;&lt;em&gt;q&lt;/em&gt; increases.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;(For a more detailed discussion of confidence intervals, refer to Chapter 9 of &lt;a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650/ref=sr_1_1?s=books&amp;amp;ie=UTF8&amp;amp;qid=1317246817&amp;amp;sr=1-1#_"&gt;Exploring Data in Engineering, the Sciences, and Medicine&lt;/a&gt;.)&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;We can, in principle, construct 100% confidence intervals, but this leads us directly back to Neil’s objection: the 100% confidence interval for the mean is the entire support set of the distribution (e.g., for the Gaussian distribution, this 100% confidence interval is the whole real line, while for any gamma distribution,&amp;nbsp;it is the set of all positive numbers).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;These observations suggest the following notion of “tail length” that addresses Neil’s concern while retaining the essential idea of interest in the business literature: we can compare the “q% tail length” of different distributions for some &lt;em&gt;q&lt;/em&gt; less than 100.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;In particular, consider the case of J-shaped distributions, defined as those like the Pareto type I distribution whose density p(x) decays monotonically with increasing x, approaching zero as x goes to infinity.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The plot below shows two specific examples to illustrate the idea: the solid line corresponds to the (shifted) exponential distribution:&lt;/div&gt;&lt;div class="MsoNormal" 
style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;p(x) = e&lt;sup&gt;–(x-1)&lt;/sup&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;for all x greater than or equal to 1 and zero otherwise, while the dotted line represents the Pareto type I distribution with location parameter &lt;em&gt;k = 1&lt;/em&gt; and shape parameter &lt;em&gt;a = 0.5&lt;/em&gt; discussed in my last post.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Initially, as x increases from 1, the exponential density is greater than the Pareto density, but for x larger than about 3.5, the opposite is true: the exponential distribution rapidly becomes much smaller, reflecting its much more rapid rate of tail decay.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-WtujfYFLLtw/ToOVGk0u85I/AAAAAAAAAEM/cjJl9R66-hk/s1600/LongUselessFig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" kca="true" src="http://1.bp.blogspot.com/-WtujfYFLLtw/ToOVGk0u85I/AAAAAAAAAEM/cjJl9R66-hk/s320/LongUselessFig01.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;For these distributions, define the q% tail length to be the distance from the minimum possible value of x (the “head” of the distribution; here, x = 1) to the point in the tail where the cumulative probability 
reaches q% (i.e., the value x&lt;sub&gt;q&lt;/sub&gt; where x &amp;lt; x&lt;sub&gt;q&lt;/sub&gt; with probability q%). &lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&lt;/span&gt;In practical terms, the q% tail length tells us how far out we have to go in the tail to account for q% of the possible cases.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In &lt;em&gt;R&lt;/em&gt;, this value is easy to compute using the &lt;em&gt;quantile&lt;/em&gt; function included in most families of available distribution functions.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As a specific example, for the Pareto type I distribution, the function &lt;strong&gt;qparetoI&lt;/strong&gt; in the &lt;strong&gt;VGAM&lt;/strong&gt; package gives us the desired quantiles for the distribution with specified values of the parameters &lt;em&gt;k&lt;/em&gt; (designated “scale” in the &lt;strong&gt;qparetoI&lt;/strong&gt; call) and &lt;em&gt;a&lt;/em&gt; (designated “shape” in the &lt;strong&gt;qparetoI&lt;/strong&gt; call).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Thus, for the case &lt;em&gt;k = 1&lt;/em&gt; and &lt;em&gt;a = 0.5&lt;/em&gt; (i.e., the dashed curve in the above plot), the “90% tail length” is given by:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt; qparetoI(p=0.9,scale=1,shape=0.5)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;[1] 100&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;For comparison, the corresponding shifted exponential distribution has the 90% tail length given by:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 
0in 0in 0pt;"&gt;&amp;gt; 1 + qexp(p = 0.9)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;[1] 3.302585&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;(Note that here,&amp;nbsp;I added 1 to the exponential quantile to account for the shift in its domain from “all positive numbers” – the domain for the standard exponential distribution – to the shifted domain “all numbers greater than 1”.)&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Since these 90% tail lengths differ by a factor of 30, they provide a sound basis for declaring the Pareto type I distribution to be “longer tailed” than the exponential distribution.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;These results also provide a useful basis for assessing the influence of the decay parameter a for the Pareto distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As I noted last time, two of the examples I considered did not have finite means (&lt;em&gt;a = 0.5&lt;/em&gt; and &lt;em&gt;1.0&lt;/em&gt;), and none of the four had finite variances (i.e., also &lt;em&gt;a = 1.5&lt;/em&gt; and &lt;em&gt;2.0&lt;/em&gt;), rendering moment characterizations like the mean and standard deviation&amp;nbsp;fundamentally useless.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Comparing the 90% tail lengths for these distributions, however, leads to the following results:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 
&lt;/span&gt;&lt;em&gt;a = 0.5:&lt;/em&gt; 90% tail length = 100.000&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;em&gt;a = 1.0:&lt;/em&gt; 90% tail length =&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;10.000&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;em&gt;a = 1.5:&lt;/em&gt; 90% tail length =&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;4.642&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;em&gt;a = 2.0:&lt;/em&gt; 90% tail length =&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;3.162&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;It is clear from these results that the shape parameter &lt;em&gt;a&lt;/em&gt; has a dramatic effect on the 90% tail length (in fact, on the q% tail length for any &lt;em&gt;q&lt;/em&gt; less than 100).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Further, note that the 90% tail length for the Pareto type I distribution with &lt;em&gt;a = 2.0&lt;/em&gt; is actually a little bit shorter than that for the exponential distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;If we move further out into the tail, however, this situation changes.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As a specific example, 
suppose we compare the 98% tail lengths. For the exponential distribution, this yields the value 4.912, while for the four Pareto shape parameters we have the following results:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;em&gt;a = 0.5:&lt;/em&gt; 98% tail length = 2,500.000&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;em&gt;a = 1.0:&lt;/em&gt; 98% tail length =&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;50.000&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;em&gt;a = 1.5:&lt;/em&gt; 98% tail length =&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;13.572&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;em&gt;a = 2.0:&lt;/em&gt; 98% tail length =&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;7.071&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;This value (i.e., the 98% tail length) seems a particularly appropriate choice to include here since in his book, 
&lt;em&gt;The Long Tail&lt;/em&gt;, Chris Anderson notes that his original presentations on the topic were entitled “The 98% Rule,” reflecting the fact that he was explicitly considering how far out you had to go into the tail of a distribution of goods (e.g., the books for sale by Amazon) to account for 98% of the sales.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Since this discussion originally began with the question, “when are averages useless?” it is appropriate to note that, in contrast to the much better-known average, the “q% tail length” considered here is well-defined for &lt;em&gt;any &lt;/em&gt;proper distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;As the examples discussed here demonstrate, this characterization also provides a useful basis for quantifying the “Long Tail” behavior that is of increasing interest in business applications like Internet marketing.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Thus, if we adopt this measure for any &lt;em&gt;q&lt;/em&gt; value less than 100%, the answer to the title question of this post is, “No: The Long Tail is a useful concept.”&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The downside of this minor change is that – as the results shown here illustrate – the results obtained using the q% tail length depend on the value of&amp;nbsp;&lt;em&gt;q&lt;/em&gt; we choose.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In my next post, I will explore the computational issues associated with that choice.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9179325420174899779-7180627990616206183?l=exploringdatablog.blogspot.com' alt='' 
/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploringdatablog.blogspot.com/feeds/7180627990616206183/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://exploringdatablog.blogspot.com/2011/09/is-long-tail-useless-concept.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/7180627990616206183'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/7180627990616206183'/><link rel='alternate' type='text/html' href='http://exploringdatablog.blogspot.com/2011/09/is-long-tail-useless-concept.html' title='Is the “Long Tail” a Useless Concept?'/><author><name>Ron Pearson (aka TheNoodleDoodler)</name><uri>http://www.blogger.com/profile/15693640298594791682</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-WtujfYFLLtw/ToOVGk0u85I/AAAAAAAAAEM/cjJl9R66-hk/s72-c/LongUselessFig01.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9179325420174899779.post-1611266137426698553</id><published>2011-09-17T09:54:00.000-07:00</published><updated>2011-09-17T09:54:58.940-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='harmonic means'/><category scheme='http://www.blogger.com/atom/ns#' term='medians'/><category scheme='http://www.blogger.com/atom/ns#' term='arithmetic means'/><category scheme='http://www.blogger.com/atom/ns#' term='Pareto distributions'/><category scheme='http://www.blogger.com/atom/ns#' term='geometric means'/><category scheme='http://www.blogger.com/atom/ns#' term='infinite variance distributions'/><category 
scheme='http://www.blogger.com/atom/ns#' term='long tail phenomena'/><title type='text'>The Long Tail of the Pareto Distribution</title><content type='html'>&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;In my last two posts, I have discussed cases where the mean is of little or no use as a data characterization.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;One of the specific examples I discussed last time was the case of the Pareto type I distribution, for which the density is given by:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 2;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;p(x) = ak&lt;sup&gt;a&lt;/sup&gt;/x&lt;sup&gt;a+1&lt;/sup&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;defined for all &lt;i style="mso-bidi-font-style: normal;"&gt;x &amp;gt; k&lt;/i&gt;, where &lt;i style="mso-bidi-font-style: normal;"&gt;k&lt;/i&gt; and &lt;i style="mso-bidi-font-style: normal;"&gt;a&lt;/i&gt; are numeric parameters that define the distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In the example I discussed last time, I considered the case where a = 1.5, which exhibits a finite mean (specifically, the mean is 3 for this case), but an infinite variance.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As the results I presented last time demonstrated, the extreme data variability of this distribution renders the computed mean too variable to be useful.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Another reason this distribution is particularly 
interesting is that it exhibits essentially the same tail behavior as the discrete Zipf distribution; there, the probability that a discrete random variable x takes its i&lt;sup&gt;th&lt;/sup&gt; value is:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 2;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;p&lt;sub&gt;i&lt;/sub&gt; = A/i&lt;sup&gt;c&lt;/sup&gt;,&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;where A is a normalization constant and &lt;i style="mso-bidi-font-style: normal;"&gt;c&lt;/i&gt; is a parameter that determines how slowly the tail decays.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This distribution was originally proposed to characterize the frequency of words in long documents (the Zipf-Estoup law); it was investigated further by Zipf in the mid-twentieth century in a wide range of applications (e.g., the distributions of city sizes); and it has recently attracted considerable attention as a model for “long-tailed” business phenomena (for a non-technical introduction to some of these business phenomena, see the book by Chris Anderson, &lt;a href="http://www.amazon.com/Long-Tail-Future-Business-Selling/dp/1401302378"&gt;The Long Tail&lt;/a&gt;).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;I will discuss the Zipf distribution further in a later post, but one of the reasons for discussing the Pareto type I distribution first is that since it is a continuous distribution, the math is easier, meaning that more characterization results are available for the Pareto 
distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-Od4r-V4YWGU/TnTLcJsl0DI/AAAAAAAAAD0/fqILT5xoVxM/s1600/ParetoIFig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" rba="true" src="http://4.bp.blogspot.com/-Od4r-V4YWGU/TnTLcJsl0DI/AAAAAAAAAD0/fqILT5xoVxM/s320/ParetoIFig01.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The mean of the Pareto type I distribution is:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 2;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;Mean = ak/(a-1),&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;provided &lt;i style="mso-bidi-font-style: normal;"&gt;a &amp;gt; 1&lt;/i&gt;, and the variance of the distribution is finite only if &lt;i style="mso-bidi-font-style: normal;"&gt;a &amp;gt; 2&lt;/i&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Plots of the probability density defined above for this distribution are shown above, for &lt;i style="mso-bidi-font-style: normal;"&gt;k = 1&lt;/i&gt; in all cases, and with &lt;i style="mso-bidi-font-style: normal;"&gt;a&lt;/i&gt; taking the values 0.5, 1.0, 1.5, 
and 2.0.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;(This is essentially the same plot as Figure 4.17 in &lt;a href="http://www.amazon.com/s/ref=nb_sb_ss_i_1_14?url=search-alias%3Dstripbooks&amp;amp;field-keywords=exploring+data+in+engineering.+the+sciences.+and+medicine&amp;amp;sprefix=Exploring+Data"&gt;Exploring Data in Engineering, the Sciences, and Medicine&lt;/a&gt;, where I give a brief description of the Pareto type I distribution.)&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Note that all of the cases considered here are characterized by infinite variance, while the first two (a = 0.5 and 1.0) are also characterized by infinite means.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As the results presented below emphasize, the mean represents a very poor characterization in practice for data drawn from any of these distributions, but there are alternatives, including the familiar median that I have discussed previously, along with two others that are more specific to the Pareto type I distribution: the geometric mean and the harmonic mean.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The plot below emphasizes the point made above about the extremely limited utility of the mean as a characterization of Pareto type I data, even in cases where&amp;nbsp;it is theoretically well-defined.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Specifically, this plot compares the four characterizations I discuss here – the mean (more precisely known as the “arithmetic mean” to distinguish it from the other means considered here), the median, the geometric mean, and the harmonic mean – for 1000 statistically independent Pareto type I data sequences, each of length N = 400, with parameters &lt;i style="mso-bidi-font-style: normal;"&gt;k = 
1&lt;/i&gt; and &lt;i style="mso-bidi-font-style: normal;"&gt;a = 2.0&lt;/i&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For this example, the mean is well-defined (specifically, it is equal to 2), but compared with the other data characterizations, its variability is much greater, reflecting the more serious impact of this distribution’s infinite variance on the mean than on these other data characterizations.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-IPXsTMe5thU/TnTL7h31dMI/AAAAAAAAAD4/RMWinyP9vQU/s1600/ParetoIFig09.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" rba="true" src="http://4.bp.blogspot.com/-IPXsTMe5thU/TnTL7h31dMI/AAAAAAAAAD4/RMWinyP9vQU/s320/ParetoIFig09.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;To give a more complete view of the extreme variability of the arithmetic mean, the plot below shows boxplots of the arithmetic means computed from 1000 statistically independent samples drawn from each of the four Pareto type I distribution examples plotted above.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As before, each sample is of size N = 400 and the parameter &lt;i style="mso-bidi-font-style: normal;"&gt;k&lt;/i&gt; has the value 1, but here the computed arithmetic means are shown for the parameter values a = 0.5, 1.0, 1.5, and 2.0; note the log scale used here because the range of computed means is so large.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For the first two of these examples, the population mean does not exist, so it is not surprising that the computed values span such an 
enormous range, but even when the mean is well-defined, the influence of the infinite variance of these cases is clearly evident.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;It may be argued that infinite variance is an extreme phenomenon, but it is worth emphasizing here that for the specific “long tail” distributions popular in many applications, the decay rate is sufficiently slow for the variance – and sometimes even the mean – to be infinite, as in these examples.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-uAabSu1vdws/TnTMME2enxI/AAAAAAAAAD8/PVl2_aeqXfk/s1600/ParetoIFig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" rba="true" src="http://3.bp.blogspot.com/-uAabSu1vdws/TnTMME2enxI/AAAAAAAAAD8/PVl2_aeqXfk/s320/ParetoIFig03.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;As I have noted several times in previous posts, the median is much better behaved than the mean, so much so that it is well-defined for any proper distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;One of the advantages of the Pareto type I distribution is that the form of the density function is simple enough that the median of the distribution can be computed explicitly from the distribution parameters.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This result is given in the fabulous book by &lt;a href="http://www.amazon.com/Continuous-Univariate-Distributions-Probability-Statistics/dp/0471584959/ref=sr_1_1?s=books&amp;amp;ie=UTF8&amp;amp;qid=1316277338&amp;amp;sr=1-1"&gt;Johnson, Kotz and 
Balakrishnan&lt;/a&gt; that I have mentioned previously, which devotes an entire chapter (Chapter 20) to the Pareto family of distributions.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Specifically, the median of the Pareto type I distribution with parameters &lt;i style="mso-bidi-font-style: normal;"&gt;k&lt;/i&gt; and &lt;i style="mso-bidi-font-style: normal;"&gt;a&lt;/i&gt; is given by:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 2;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;Median = 2&lt;sup&gt;1/a&lt;/sup&gt;k&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Thus, for the four examples considered here, the median values are 4.0 (for a = 0.5), 2.0 (for a = 1.0), 1.587 (for a = 1.5), and 1.414 (for a = 2.0).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Boxplot summaries for the same 1000 random samples considered above are shown in the plot below, which also includes horizontal dotted lines at these theoretical median values for the four distributions.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The fact that these lines correspond closely with the median lines in the boxplots gives an indication that the computed median is, on average, in good agreement with the correct values it is attempting to estimate.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As in the case of the arithmetic means, the variability of these estimates decreases monotonically as &lt;em&gt;a&lt;/em&gt; increases, corresponding to the fact that the distribution becomes generally 
better-behaved as the &lt;i style="mso-bidi-font-style: normal;"&gt;a&lt;/i&gt; parameter increases.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-Q0nHedEppms/TnTM4jen5fI/AAAAAAAAAEA/YWd432AzrBg/s1600/ParetoIFig04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" rba="true" src="http://1.bp.blogspot.com/-Q0nHedEppms/TnTM4jen5fI/AAAAAAAAAEA/YWd432AzrBg/s320/ParetoIFig04.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The &lt;i style="mso-bidi-font-style: normal;"&gt;geometric mean&lt;/i&gt; is an alternative characterization to the more familiar arithmetic mean, one that is well-defined for any sequence of positive numbers.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Specifically, the geometric mean of &lt;i style="mso-bidi-font-style: normal;"&gt;N&lt;/i&gt; positive numbers is defined as the &lt;i style="mso-bidi-font-style: normal;"&gt;N&lt;sup&gt;th&lt;/sup&gt;&lt;/i&gt; root of their product.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Equivalently, the geometric mean may be computed by exponentiating the arithmetic average of the log-transformed values.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In the case of the Pareto type I distribution, the utility of the geometric mean is closely related to the fact that the log transformation converts a Pareto-distributed random variable into an exponentially distributed one, a point that I will discuss further in a later post on data transformations.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;(These transformations are the topic 
of Chapter 12 of &lt;em&gt;Exploring Data&lt;/em&gt;, where I briefly discuss both the logarithmic transformation on which the geometric mean is based and the reciprocal transformation on which the harmonic mean is based, described next.)&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;The key point here is that the following simple expression is available for the geometric mean of the Pareto type I distribution (Johnson, Kotz, and Balakrishnan, page 577):&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 2;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;Geometric Mean = k exp(1/a)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;For the four specific examples considered here, these geometric mean values are approximately 7.389 (for a = 0.5), 2.718 (for a = 1.0), 1.948 (for a = 1.5), and 1.649 (for a = 2.0).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The boxplots shown below summarize the range of variation seen in the computed geometric means for the same 1000 statistically independent samples considered above.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Again, the horizontal dotted lines indicate the correct values for each distribution, and it may be seen that the computed values are in good agreement, on average.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As before, the variability of these computed values decreases with increasing &lt;i style="mso-bidi-font-style: normal;"&gt;a &lt;/i&gt;values as the distribution becomes 
better-behaved.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-oFTSSZfIUNc/TnTNzo_0TAI/AAAAAAAAAEE/ivn91x51gdI/s1600/ParetoIFig06.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" rba="true" src="http://1.bp.blogspot.com/-oFTSSZfIUNc/TnTNzo_0TAI/AAAAAAAAAEE/ivn91x51gdI/s320/ParetoIFig06.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The fourth characterization considered here is the &lt;i style="mso-bidi-font-style: normal;"&gt;harmonic mean&lt;/i&gt;, again appropriate to positive values, and defined as the reciprocal of the average of the reciprocal data values.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In the case of the geometric mean just discussed, the log transformation on which it is based is often useful in improving the distributional character of data values that span a wide range.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In the case of the Pareto type I distribution – and a number of others – the reciprocal transformation on which the harmonic mean is based also improves the behavior of the data distribution, but this is often not the case.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In particular, the reciprocal transformation often makes the character of a data distribution much worse: applied to the extremely well-behaved standard uniform distribution, it yields the Pareto type I distribution with a = 1, for which none of the integer moments exist; similarly, applied to the Gaussian distribution, the reciprocal transformation yields a distribution that both has infinite variance and is 
bimodal.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;(A little thought suggests that the reciprocal transformation is inappropriate for the Gaussian distribution because it is not strictly positive, but normality is a favorite working assumption, sometimes applied to the denominators of ratios, leading to a number of theoretical difficulties.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;I will have more to say about that in a future post.)&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For the case of the Pareto type I distribution, the reciprocal transformation converts it into the extremely well-behaved beta distribution, and the harmonic mean has the following simple expression:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;Harmonic mean = k(1 + a&lt;sup&gt;-1&lt;/sup&gt;)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;For the four examples considered here, this expression yields harmonic mean values of 3 (for a = 0.5), 2 (for a = 1.0), 1.667 (for a = 1.5), and 1.5 (for a = 2.0).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Boxplot summaries of the computed harmonic means for the 1000 simulations of each case considered previously are shown below, again with dotted horizontal lines at the theoretical values for each case.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As with both the median and the geometric mean, it is clear from these plots that the computed values are correct on average, and their variability decreases with increasing values of the &lt;i style="mso-bidi-font-style: normal;"&gt;a&lt;/i&gt; 
parameter.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-pDmiBtXFO0Y/TnTOLsBqh4I/AAAAAAAAAEI/zAuxxfiTnCw/s1600/ParetoIFig08.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" rba="true" src="http://1.bp.blogspot.com/-pDmiBtXFO0Y/TnTOLsBqh4I/AAAAAAAAAEI/zAuxxfiTnCw/s320/ParetoIFig08.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The key point of this post has been to show that, while averages are not suitable characterizations for the “long tailed” phenomena that are a subject of increasing interest in many different fields, useful alternatives do exist.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For the case of the Pareto type I distribution considered here, these alternatives include the popular median, along with the somewhat less well-known geometric and harmonic means.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In an upcoming post, I will examine the utility of these characterizations for the Zipf distribution.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9179325420174899779-1611266137426698553?l=exploringdatablog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploringdatablog.blogspot.com/feeds/1611266137426698553/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://exploringdatablog.blogspot.com/2011/09/long-tail-of-pareto-distribution.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/9179325420174899779/posts/default/1611266137426698553'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/1611266137426698553'/><link rel='alternate' type='text/html' href='http://exploringdatablog.blogspot.com/2011/09/long-tail-of-pareto-distribution.html' title='The Long Tail of the Pareto Distribution'/><author><name>Ron Pearson (aka TheNoodleDoodler)</name><uri>http://www.blogger.com/profile/15693640298594791682</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-Od4r-V4YWGU/TnTLcJsl0DI/AAAAAAAAAD0/fqILT5xoVxM/s72-c/ParetoIFig01.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9179325420174899779.post-1261853290000635353</id><published>2011-08-27T13:46:00.000-07:00</published><updated>2011-08-27T13:46:16.463-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='medians'/><category scheme='http://www.blogger.com/atom/ns#' term='Pareto distributions'/><category scheme='http://www.blogger.com/atom/ns#' term='moving averages'/><category scheme='http://www.blogger.com/atom/ns#' term='R package modeest'/><category scheme='http://www.blogger.com/atom/ns#' term='mode estimation'/><category scheme='http://www.blogger.com/atom/ns#' term='limitations of the mean'/><category scheme='http://www.blogger.com/atom/ns#' term='moving window data characterizations'/><title type='text'>Some Additional Thoughts on Useless Averages</title><content type='html'>&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;In my last post, I described three situations where the average of a sequence of numbers is not representative enough to be useful: in the presence of severe outliers, in the face of 
multimodal data distributions, and in the face of infinite-variance distributions.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The post generated three interesting comments that I want to respond to here.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;First and foremost, I want to say thanks to all of you for&amp;nbsp;giving me something to think about further, leading me in some interesting new directions.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;First, &lt;strong&gt;chrisbeeleyimh&lt;/strong&gt; had the following to say:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;blockquote&gt;“I seem to have rather abandoned means and medians in favor of drawing the distribution all the time, which baffles my colleagues somewhat.”&lt;/blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Chris also maintains a collection of data examples where the mean is the same but the shape is very different.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In fact, one of the points I illustrate in Section 4.4.1 of &lt;span&gt;&lt;a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;Exploring Data in Engineering, the Sciences, and Medicine&lt;/a&gt;&lt;img alt="" border="0" height="1" 
src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=0195089650" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt;&lt;/span&gt; is that there are cases where not only the means but&amp;nbsp;&lt;em&gt;all &lt;/em&gt;of the moments&amp;nbsp;(i.e., variance, skewness, kurtosis, etc.) are identical but the distributions are profoundly different.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;A specific example is taken from the book &lt;span&gt;&lt;a href="http://www.amazon.com/Counterexamples-Probability-2nd-Jordan-Stoyanov/dp/0471965383?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;Counterexamples in Probability, 2nd Edition&lt;/a&gt;&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=0471965383" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt;&lt;/span&gt; by J.M. 
Stoyanov, who shows that if the lognormal density is multiplied by the following function:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 2;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;blockquote&gt;f(x) = 1 + A sin(2 pi ln x),&lt;/blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;for any constant A between -1 and +1, the moments are unchanged.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The character of the distribution is changed profoundly, however, as the following plot illustrates (this plot is similar to Fig. 4.8 in &lt;em&gt;Exploring Data,&lt;/em&gt; which shows the same two distributions, but for A = 0.5 instead of A = 0.9, as shown here).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;To be sure, this behavior is pathological – distributions that have finite support, for example, are defined uniquely by their complete set of moments – but it does make the point that moment characterizations are not always complete, even if an infinite number of them are available.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Within well-behaved families of distributions (such as the one proposed by Karl Pearson in 1895), a complete characterization is possible on the basis of the first few moments, which is one reason for the historical popularity of the method of moments for fitting data to distributions.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;It is important to recognize, however, that moments do have their limitations and that the first moment 
alone – i.e., the mean by itself – is almost never a complete characterization.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;(I am forced to say “almost” here because if we impose certain very strong distributional assumptions – e.g.,&amp;nbsp;the Poisson or binomial distributions&amp;nbsp;– the specific distribution considered may be fully characterized by its mean.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This raises the question, however, of whether this distributional assumption is adequate.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;My experience has been that, no matter how firmly held the belief in a particular distribution is, exceptions do arise in practice … overdispersion, anyone?)&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-9SsouQJ0FHo/TllFIR9ESOI/AAAAAAAAADk/mQLCGfSQdH8/s1600/MoreUselessFig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" qaa="true" src="http://2.bp.blogspot.com/-9SsouQJ0FHo/TllFIR9ESOI/AAAAAAAAADk/mQLCGfSQdH8/s320/MoreUselessFig01.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The plot below provides a further illustration of the inadequacy of the mean as a sole data characterization,&amp;nbsp;comparing four different members of the family of beta distributions.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;These distributions – in the standard form assumed here – describe variables whose values range from 0 to 1, and they are&amp;nbsp;defined by two parameters, p and q, that 
determine the shape of the density function and all moments of the distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The mean of the beta distribution is equal to p/(p+q), so&amp;nbsp;if p = q – corresponding to the class of symmetric beta distributions – the mean is ½, regardless of the common value of these parameters.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The four plots below show the corresponding distributions when both parameters are equal to 0.5 (upper left, the arcsin distribution I discussed last time), 1.0 (upper right, the uniform distribution), 1.5 (lower left), and 8.0 (lower right).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/--uLGduKNCtY/TllFpqRxinI/AAAAAAAAADo/OVGP_ZITwL8/s1600/MoreUselessFig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" qaa="true" src="http://3.bp.blogspot.com/--uLGduKNCtY/TllFpqRxinI/AAAAAAAAADo/OVGP_ZITwL8/s320/MoreUselessFig02.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The second comment on my last post was from &lt;strong&gt;Efrique&lt;/strong&gt;, who suggested the Student’s t-distribution with 2 degrees of freedom as a better infinite-variance example than the Cauchy example I used (corresponding to Student’s t-distribution with one degree of freedom), because the first moment doesn’t even exist for the Cauchy distribution (“there’s nothing to converge to”).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The figure below expands the boxplot comparison I presented last 
time, comparing the means, medians, and modes (from the &lt;strong&gt;modeest &lt;/strong&gt;package) for both of these infinite-variance examples: the Cauchy distribution I discussed last time and the Student’s t-distribution with two degrees of freedom that Efrique suggested.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Here, the same characterization (mean, median, or mode) is&amp;nbsp;summarized for both distributions in side-by-side boxplots to facilitate comparisons.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;It is clear from these&amp;nbsp;boxplots that the results for the median and the mode are essentially identical for these distributions, but the results for the mean differ dramatically (recall that these results are truncated for the Cauchy distribution: 13.6% of the 1000 computed means fell outside the +/- 5 range shown here, exhibiting values&amp;nbsp;approaching +/- 1000).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This difference illustrates Efrique’s further point that the mean of the data values is a consistent estimator of the (well-defined) population mean of the Student’s t-distribution with 2 degrees of freedom, while it is not a consistent estimator for the Cauchy distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Still, it is also clear from this plot that the mean is substantially more variable for the Student’s t-distribution with 2 degrees of freedom than either the median or the &lt;strong&gt;modeest&lt;/strong&gt; mode estimate.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-mBKzlxuO3Lo/TllGgZGwsCI/AAAAAAAAADs/L2mdVfDwpo4/s1600/MoreUselessFig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" qaa="true" 
src="http://3.bp.blogspot.com/-mBKzlxuO3Lo/TllGgZGwsCI/AAAAAAAAADs/L2mdVfDwpo4/s320/MoreUselessFig03.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Another example of an infinite-variance distribution where the mean is well-defined but highly variable is the Pareto type I distribution, discussed in Section 4.5.8 of &lt;em&gt;Exploring Data&lt;/em&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;My favorite reference on distributions is the two volume set by Johnson, Kotz, and Balakrishnan (&lt;span&gt;&lt;a href="http://www.amazon.com/Continuous-Univariate-Distributions-Probability-Statistics/dp/0471584959?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;Continuous Univariate Distributions, Vol. 1 (Wiley Series in Probability and Statistics)&lt;/a&gt;&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=0471584959" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt;&amp;nbsp;and &lt;span&gt;&lt;a href="http://www.amazon.com/Continuous-Univariate-Distributions-Probability-Statistics/dp/0471584940?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;Continuous Univariate Distributions, Vol. 
2 (Wiley Series in Probability and Statistics)&lt;/a&gt;)&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=0471584940" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt;&lt;/span&gt;&lt;/span&gt;, who devote an entire 55-page chapter (Chapter 20 in Volume 1) to the Pareto distribution, noting that it is named after Vilfredo Pareto, a&amp;nbsp;mid nineteenth- to early twentieth-century Swiss professor of economics, who proposed it as a description of the distribution of income over a population.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In fact, there are several different distributions named after Pareto; the type I distribution considered here exhibits a power-law decay like the Student’s t-distributions, but it is a J-shaped distribution whose mode is equal to its minimum value.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;More specifically, this distribution is defined by a location parameter that determines this minimum value and a shape parameter that determines how rapidly the tail decays for values larger than this minimum.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The example considered here takes this minimum value as 1 and the shape parameter as 1.5, giving a distribution with a finite mean but an infinite variance.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As in the above example, the boxplot summary shown below characterizes the mean, median, and mode for 1000 statistically independent random samples drawn from this distribution, each of size N = 100.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As before, it is clear from 
this plot that the mean is much more variable than either the median or the mode.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-xaz314FpZmU/TllHOi1mdEI/AAAAAAAAADw/_iCdakolo68/s1600/MoreUselessFig04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" qaa="true" src="http://3.bp.blogspot.com/-xaz314FpZmU/TllHOi1mdEI/AAAAAAAAADw/_iCdakolo68/s320/MoreUselessFig04.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;In this case, however, we have the added complication that since this distribution is not symmetric,&amp;nbsp;its mean, median and mode do not coincide.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In fact, the population mode is the minimum value (which is 1 here), corresponding to the solid line at the bottom of the plot.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The narrow range of the boxplot values around this correct value suggests that the &lt;strong&gt;modeest&lt;/strong&gt; package is reliably estimating this mode value, but as I noted in my last post, this characterization is not useful here because it tells us nothing about the rate at which the density decays.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The theoretical median value can also be calculated easily for this distribution, and here it is approximately equal to 1.587, corresponding to the dashed horizontal line in the plot.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As with the mode, it is clear from the boxplot that the median 
estimated from the data is in generally excellent agreement with this value.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Finally, the mean value for this particular distribution is 3, corresponding to the dotted horizontal line in the plot.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Since this line lies fairly close to the upper quartile of the computed means (i.e., the top of the “box” in the boxplot), it follows that the estimated mean falls below the correct value almost 75% of the time, but it is also clear that when the mean is overestimated, the extent of this overestimation can be very large.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Motivated in part by the fact that the mean doesn’t always exist for the Pareto distribution, Johnson, Kotz and Balakrishnan note in their chapter on these distributions that alternative location measures have been considered, including both the geometric and harmonic means.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;I will examine these ideas further in a future post.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Finally, &lt;strong&gt;klr&lt;/strong&gt; mentioned my post on useless averages in his blog &lt;a href="http://timelyportfolio.blogspot.com/"&gt;TimelyPortfolio&lt;/a&gt;, where he discusses alternatives to the moving average in characterizing financial time-series.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For the case he considers, klr compares a 10-month moving average, the corresponding moving median, and a number of the corresponding mode estimators from the &lt;strong&gt;modeest&lt;/strong&gt; package.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This is a very interesting avenue of exploration for me since it is closely related to the median filter and other nonlinear digital filters that can be 
very useful in cleaning noisy time-series data.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;I discuss a number of these ideas – including moving-window extensions of other data characterizations like skewness and kurtosis – in my book &lt;span&gt;&lt;a href="http://www.amazon.com/Mining-Imperfect-Data-Contamination-Incomplete/dp/0898715822?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;Mining Imperfect Data: Dealing with Contamination and Incomplete Records&lt;/a&gt;&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=0898715822" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt;.&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Again, thanks to all of you for your comments.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;You have given me much to think about and investigate further, which is one of the joys of doing this blog.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9179325420174899779-1261853290000635353?l=exploringdatablog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://exploringdatablog.blogspot.com/feeds/1261853290000635353/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://exploringdatablog.blogspot.com/2011/08/some-additional-thoughts-on-useless.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/1261853290000635353'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/1261853290000635353'/><link rel='alternate' type='text/html' href='http://exploringdatablog.blogspot.com/2011/08/some-additional-thoughts-on-useless.html' title='Some Additional Thoughts on Useless Averages'/><author><name>Ron Pearson (aka TheNoodleDoodler)</name><uri>http://www.blogger.com/profile/15693640298594791682</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-9SsouQJ0FHo/TllFIR9ESOI/AAAAAAAAADk/mQLCGfSQdH8/s72-c/MoreUselessFig01.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9179325420174899779.post-4146742461732717153</id><published>2011-08-20T08:21:00.000-07:00</published><updated>2011-08-20T08:21:05.423-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Old Faithful dataset'/><category scheme='http://www.blogger.com/atom/ns#' term='medians'/><category scheme='http://www.blogger.com/atom/ns#' term='outliers'/><category scheme='http://www.blogger.com/atom/ns#' term='infinite variance distributions'/><category scheme='http://www.blogger.com/atom/ns#' term='R package modeest'/><category scheme='http://www.blogger.com/atom/ns#' term='mode estimation'/><category scheme='http://www.blogger.com/atom/ns#' term='limitations of the mean'/><title 
type='text'>When are averages useless?</title><content type='html'>&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Of all possible single-number characterizations of a data sequence, the average is probably the best known.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;It is also easy to compute and in favorable cases, it provides a useful characterization of “the typical value” of a sequence of numbers.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;It is not the only such “typical value,” however, nor is it always the most useful one: two other candidates – location estimators in statistical terminology – are the median and the mode, both of which are discussed in detail in Section 4.1.2 of &lt;span&gt;&lt;a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;Exploring Data in Engineering, the Sciences, and Medicine&lt;/a&gt;&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=0195089650" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt;&lt;/span&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Like the average, these alternative location estimators are not always “fully representative,” but they do represent viable alternatives – at least sometimes – in cases where the average is sufficiently non-representative as to be effectively useless.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As the title of this post suggests, the focus&amp;nbsp;here is on those cases where the mean doesn’t really tell us what 
we want to know about a data sequence, briefly examining why this happens and what we can do about it.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-bDidkJnPnX4/Tk_CdddskeI/AAAAAAAAADU/BFEJDtmip7U/s1600/UselessFig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" qaa="true" src="http://2.bp.blogspot.com/-bDidkJnPnX4/Tk_CdddskeI/AAAAAAAAADU/BFEJDtmip7U/s320/UselessFig01.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;First, it is worth saying a few words about the two alternatives just mentioned: the median and the mode.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Of these, the mode is both the more difficult to estimate and the less broadly useful.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Essentially, “the mode” corresponds to “the location of the peak in the data distribution.”&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;One difficulty with this somewhat loose definition is that “the mode” is not always well-defined.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The above collection of plots shows three examples where the mode is not well-defined, and another where the mode is well-defined but not particularly useful.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The upper left plot shows the density of the uniform distribution on the range [1,2]: there, the density is constant over the entire range, so there is no single, well-defined “peak” or unique maximum to serve as a mode for this distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; 
&lt;/span&gt;The upper right plot shows a nonparametric density estimate for the Old Faithful geyser waiting time data that I have discussed in several of my recent posts (the &lt;em&gt;R&lt;/em&gt; data object &lt;strong&gt;faithful&lt;/strong&gt;).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Here, the difficulty is that there are not one but two modes, so “the mode” is not well-defined here, either: we must discuss “the modes.”&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The same behavior is observed for the &lt;em&gt;arcsin distribution&lt;/em&gt;, whose density is shown in the lower left plot in the above figure.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This density corresponds to the beta distribution with shape parameters both equal to ½, giving a bimodal distribution whose cumulative&amp;nbsp;probability function&amp;nbsp;can be written simply in terms of the arcsin function, motivating its name&amp;nbsp;(see Section 4.5.1 of &lt;em&gt;Exploring Data&lt;/em&gt; for a more complete discussion of both the beta distribution family and the special case of the arcsin distribution).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In this case, the two modes of the distribution occur at the extremes of the data, at x = 1 and x = 2.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The second difficulty with the mode noted above is that it is sometimes well-defined but not particularly useful.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The case of the J-shaped exponential density shown in the lower right plot above illustrates this point: this distribution exhibits a single, well-defined peak at the minimum value x = 0.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; 
&lt;/span&gt;Here, you don’t even have to look at the data to arrive at this result, which therefore tells you nothing about the data distribution: this density is described by a single parameter that determines how slowly or rapidly the distribution decays and the mode is independent of this parameter.&amp;nbsp; Despite these limitations, there are cases where the mode represents an extremely useful data characterization, even though it is much harder to estimate than the mean or the median.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Fortunately, there is a nice package available in &lt;em&gt;R&lt;/em&gt; to address this problem: the &lt;strong&gt;modeest &lt;/strong&gt;package provides 11 different mode estimation procedures.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;I will illustrate one of these in the examples that follow – the half range mode estimator of Bickel – and I will give a more complete discussion of this package in a later post.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The median is a far better-known data characterization than the mode, and it is both much easier to estimate and much more broadly applicable.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In particular, unlike either the mean or the mode, the median is well-defined for &lt;em&gt;any&lt;/em&gt; proper data distribution, a&amp;nbsp;result demonstrated&amp;nbsp;in Section 4.1.2 of &lt;em&gt;Exploring Data&lt;/em&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Conceptually, computing the median only requires sorting the N data values from smallest to largest and then taking either the middle element from this sorted list (if N is odd), or averaging the middle two elements (if N is even).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 
0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The mean is, of course, both the easiest of these characterizations to compute – simply add the N data values and divide by N – and unquestionably the best known.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;There are, however, at least three situations where the mean can be so highly non-representative as to be useless:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;span style="mso-list: Ignore;"&gt;&lt;blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.25in;"&gt;&lt;span style="mso-list: Ignore;"&gt;1.&lt;span style="font: 7pt &amp;quot;Times New Roman&amp;quot;;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;if severe outliers are present;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.25in;"&gt;&lt;span style="mso-list: Ignore;"&gt;2.&lt;span style="font: 7pt &amp;quot;Times New Roman&amp;quot;;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;if the distribution is multi-modal;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.25in;"&gt;&lt;span style="mso-list: Ignore;"&gt;3.&lt;span style="font: 7pt &amp;quot;Times New Roman&amp;quot;;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;if the distribution has infinite variance.&lt;/div&gt;&lt;/blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.25in;"&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The rest of this post examines each of these cases in turn.&lt;/div&gt;&lt;div class="MsoNormal" 
style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;I have discussed the problem of outliers before, but they are an important enough problem in practice to bear repeating.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;(I devote all of Chapter 7 to this topic in &lt;em&gt;Exploring Data&lt;/em&gt;.)&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The plot below shows the makeup flow rate dataset, available from the companion website for &lt;em&gt;Exploring Data&lt;/em&gt; (the dataset is &lt;strong&gt;makeup.csv&lt;/strong&gt;, available on the &lt;a href="http://www.oup.com/us/companion.websites/9780195089653/rprogram"&gt;R programs and datasets page&lt;/a&gt;).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This dataset consists of 2,589 successive measurements of the flow rate of a fluid stream in an industrial manufacturing process.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The points in this plot show two distinct forms of behavior: those with values on the order of 400 represent measurements made during normal process operation, while those with values less than about 300 correspond to measurements made when the process is shut down (these values are approximately zero) or is in the process of being either shut down or started back up.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The three lines in this plot correspond to the mean (the solid line at approximately 315), the median (the dotted line at approximately 393), and the mode (the dashed line at approximately 403, estimated using the “hrm” method in the &lt;strong&gt;modeest&lt;/strong&gt; package).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As I have noted previously, the mean in this case represents a useful line of demarcation between the normal operation data (those points above the mean, representing 77.6% of the data) and the shutdown 
segments (those points below the mean, representing 22.4% of the data).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In contrast, both the median and the specific mode estimator used here provide much better characterizations of the normal operating data.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-CnPhxb-jIgE/Tk_H08ceYNI/AAAAAAAAADY/IIByhlS_LcI/s1600/UselessFig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" qaa="true" src="http://3.bp.blogspot.com/-CnPhxb-jIgE/Tk_H08ceYNI/AAAAAAAAADY/IIByhlS_LcI/s320/UselessFig02.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The next plot below shows a nonparametric density estimate of the &lt;place w:st="on"&gt;Old Faithful&lt;/place&gt; geyser waiting data I discussed in my last few posts.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The solid vertical line at 70.90 corresponds to the mean value computed from the complete dataset.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;It has been said that a true compromise is an agreement that makes all parties equally unhappy, and this seems a reasonable description of the mean here: the value lies about mid-way between the two peaks in this distribution, centered at approximately 55 and 80; in fact, this value lies fairly close to the trough between the peaks in this density estimate.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;(The situation is even worse for the arcsin density discussed above: there, the two modes occur at values 
of 0 and 1, while the mean falls equidistant from both at 0.5, arguably the “least representative” value in the whole data range.)&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The median waiting time value is 76, corresponding to the dotted line just to the left of the main peak at about 80, and the mode (again, computed using the package &lt;strong&gt;modeest&lt;/strong&gt; with the “hrm” method) corresponds to the dashed line at 83, just to the right of the main peak.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The basic difficulty here is that all of these location estimators are inherently inadequate since they are attempting to characterize “the representative value” of a data sequence that has “two representative values”: one representing the smaller peak at around 55 and the other representing the larger peak at around 80.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; In this case, both the median and the mode do a better job than the mean of characterizing the larger of the two peaks in the distribution (though not a great one), although such a partial characterization is not always what we want.&amp;nbsp; &lt;/span&gt;This type of behavior is exactly what the mixture models I discussed in my last few posts are intended to describe.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-aFToge7EWtc/Tk_IdMboJOI/AAAAAAAAADc/CRKirO7Nh0s/s1600/UselessFig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" qaa="true" src="http://2.bp.blogspot.com/-aFToge7EWtc/Tk_IdMboJOI/AAAAAAAAADc/CRKirO7Nh0s/s320/UselessFig03.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" 
style="margin: 0in 0in 0pt;"&gt;To illustrate the third situation where the mean is essentially useless, consider the Cauchy distribution, corresponding to the Student’s t distribution with one degree of freedom.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This is probably the best-known infinite-variance distribution there is, and it is often used as an extreme example because it causes a lot of estimation procedures to fail.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The plot below is a (truncated) boxplot comparison of the values of the mean, median, and mode computed from 1000 independently generated Cauchy random number sequences, each of length N = 100.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;It is clear from these boxplots that the variability of the mean is much greater than that of either the median or the mode, the latter again estimated from the data using the half-range mode (hrm) method in the &lt;strong&gt;modeest&lt;/strong&gt; package.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;One of the consequences of working with infinite variance distributions like this one is that the mean can fail to be a consistent location estimator, meaning that the variability of the estimated mean does not shrink toward zero in the limit of large sample sizes.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In fact, the Cauchy distribution is one of the examples I discuss in Chapter 6 of &lt;em&gt;Exploring Data&lt;/em&gt; as a counterexample to the Central Limit Theorem: for most data distributions, the distribution of the mean approaches a Gaussian limit with a variance that decreases inversely with the sample size N, but for the Cauchy distribution, the distribution of the mean is exactly the same as that of the data itself.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In other words, for the Cauchy distribution, averaging a collection of N 
numbers does not reduce the variability at all.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This is exactly what we are seeing here, although the plot below doesn’t show how bad the situation really is: the smallest value of the mean in this sequence of 1000 estimates is -798.97 and the largest value is 928.85.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In order to see any detail at all in the distribution of the median and mode values, it was necessary to restrict the range of the boxplots shown here to lie between -5 and +5, which eliminated 13.6% of the computed mean values.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In contrast, the median is known to be a reasonably good location estimator for the Cauchy distribution (see Section 6.6.1 of &lt;em&gt;Exploring Data&lt;/em&gt; for a further discussion of this point), and the results presented here suggest that Bickel’s half-range mode estimator is also a reasonable candidate.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The main point here is that the mean is a completely unreasonable estimator in situations like this one, an important point in view of the growing interest in data models like the infinite-variance Zipf distribution to describe “long-tailed” phenomena in business.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-uwnQfCezko8/Tk_JXm5QSOI/AAAAAAAAADg/pTyXC8kq8iI/s1600/UselessFig04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" qaa="true" src="http://1.bp.blogspot.com/-uwnQfCezko8/Tk_JXm5QSOI/AAAAAAAAADg/pTyXC8kq8iI/s320/UselessFig04.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br 
/&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;I will have more to say about both the &lt;strong&gt;modeest&lt;/strong&gt; package and Zipf distributions in upcoming posts.&lt;/div&gt;&lt;/span&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploringdatablog.blogspot.com/feeds/4146742461732717153/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://exploringdatablog.blogspot.com/2011/08/when-are-averages-useless.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/4146742461732717153'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/4146742461732717153'/><link rel='alternate' type='text/html' href='http://exploringdatablog.blogspot.com/2011/08/when-are-averages-useless.html' title='When are averages useless?'/><author><name>Ron Pearson (aka TheNoodleDoodler)</name><uri>http://www.blogger.com/profile/15693640298594791682</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-bDidkJnPnX4/Tk_CdddskeI/AAAAAAAAADU/BFEJDtmip7U/s72-c/UselessFig01.png' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9179325420174899779.post-2656201277851576540</id><published>2011-08-06T14:23:00.000-07:00</published><updated>2011-08-06T14:23:22.895-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mixtools'/><category 
scheme='http://www.blogger.com/atom/ns#' term='Expectation Maximization algorithm'/><category scheme='http://www.blogger.com/atom/ns#' term='Gaussian mixture distributions'/><category scheme='http://www.blogger.com/atom/ns#' term='multimodal distributions'/><category scheme='http://www.blogger.com/atom/ns#' term='initialization of iterative algorithms'/><category scheme='http://www.blogger.com/atom/ns#' term='R packages'/><title type='text'>Fitting mixture distributions with the R package mixtools</title><content type='html'>&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;My last two posts have been about mixture models, with examples to illustrate what they are and how they&amp;nbsp;can be&amp;nbsp;useful.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Further discussion and more examples can be found in Chapter 10 of &lt;span&gt;&lt;a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;Exploring Data in Engineering, the Sciences, and Medicine&lt;/a&gt;&lt;/span&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;One important topic&amp;nbsp;I haven’t&amp;nbsp;covered is how to fit mixture models to datasets like the &lt;place w:st="on"&gt;Old Faithful&lt;/place&gt; geyser data that I have discussed previously: a nonparametric density plot gives fairly compelling evidence for a bimodal distribution, but how do you estimate the parameters of a mixture model that describes these two modes?&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For a finite Gaussian mixture distribution, one way is by trial and error, first estimating the centers of the peaks by eye in the density plot (these become the component means), and adjusting the standard deviations and mixing percentages to approximately match the peak widths and heights, respectively.&lt;span 
style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This post considers the more systematic alternative of estimating the mixture distribution parameters using the &lt;strong&gt;mixtools&lt;/strong&gt; package in &lt;em&gt;R&lt;/em&gt;.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The &lt;strong&gt;mixtools&lt;/strong&gt; package is one of several available in &lt;em&gt;R&lt;/em&gt; to fit mixture distributions or to solve the closely related problem of model-based clustering.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Further, &lt;strong&gt;mixtools&lt;/strong&gt;&amp;nbsp;includes a variety of procedures for fitting mixture models of&amp;nbsp;different types.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This post focuses on one of these – the &lt;strong&gt;normalmixEM&lt;/strong&gt; procedure for fitting normal mixture densities – and applies it to&amp;nbsp;two simple examples, starting with the &lt;place w:st="on"&gt;Old Faithful&lt;/place&gt; dataset mentioned above.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;A much more complete and thorough discussion of the &lt;strong&gt;mixtools&lt;/strong&gt; package – which also discusses its application to the &lt;place w:st="on"&gt;Old Faithful&lt;/place&gt; dataset – is given in the &lt;em&gt;R&lt;/em&gt; package vignette,&amp;nbsp;&amp;nbsp;&lt;a href="http://cran.r-project.org/web/packages/mixtools/vignettes/vignette.pdf"&gt;mixtools: An R Package for Analyzing Finite Mixture Models&lt;/a&gt;.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-kbk_korLXMw/Tj2JMvEPiPI/AAAAAAAAADE/avAFubexWKk/s1600/mixtoolsFig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" 
src="http://3.bp.blogspot.com/-kbk_korLXMw/Tj2JMvEPiPI/AAAAAAAAADE/avAFubexWKk/s320/mixtoolsFig01.png" t$="true" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The above plot shows the results obtained using the &lt;strong&gt;normalmixEM&lt;/strong&gt; procedure with its default parameter values, applied to the &lt;place w:st="on"&gt;Old Faithful&lt;/place&gt; waiting time data.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Specifically, this plot was generated by the following sequence of &lt;em&gt;R&lt;/em&gt; commands:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; library(mixtools)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;wait = faithful$waiting&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;mixmdl = normalmixEM(wait)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 
1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;plot(mixmdl,which=2)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;lines(density(wait), lty=2, lwd=2)&lt;/div&gt;&lt;/blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Like many modeling tools in &lt;em&gt;R&lt;/em&gt;, the &lt;strong&gt;normalmixEM&lt;/strong&gt; procedure has associated plot and summary methods.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In this case, the plot method displays either the log likelihood associated with each iteration of the EM fitting algorithm (more about that below), or the component densities shown above, or both.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Specifying “which = 1” displays only the log likelihood plot (this is the default), specifying “which = 2” displays only the density components/histogram plot shown here, and specifying “density = TRUE” without specifying the “which” parameter gives both plots.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Note that the two solid curves shown in the above plot correspond to the individual Gaussian density components in the mixture distribution, each scaled by the estimated probability of an observation being drawn from&amp;nbsp;that component distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The final line of &lt;em&gt;R&lt;/em&gt; code above overlays the nonparametric density estimate generated by the &lt;strong&gt;density&lt;/strong&gt; function with its default parameters, shown here as the heavy dashed line (obtained by specifying “lty = 2” for the dashed style and “lwd = 2” for the heavier weight).&lt;/div&gt;&lt;div class="MsoNormal" 
style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Most of the procedures in the &lt;strong&gt;mixtools&lt;/strong&gt; package are based on the iterative &lt;em&gt;expectation maximization (EM) algorithm,&lt;/em&gt; discussed in Section 2 of the &lt;strong&gt;mixtools&lt;/strong&gt; vignette and also in Chapter 16 of &lt;em&gt;Exploring Data&lt;/em&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;A detailed discussion of this algorithm is beyond the scope of this post – books have been devoted to the topic (see, for example, the book by McLachlan and Krishnan, &lt;span&gt;&lt;a href="http://www.amazon.com/Algorithm-Extensions-Wiley-Probability-Statistics/dp/0471201707?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;The EM Algorithm and Extensions (Wiley Series in Probability and Statistics)&lt;/a&gt;&lt;/span&gt;) – but the following two points are important to note here.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;First, the EM algorithm is an iterative procedure, and the time required for it to reach convergence – if it converges at all – depends strongly on the problem to which it is applied.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The second key point is that because it is an iterative procedure, the EM algorithm requires starting values for the parameters, and algorithm performance can depend strongly on these initial 
values.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The &lt;strong&gt;normalmixEM&lt;/strong&gt; procedure supports both user-supplied starting values and built-in estimation of starting values if none are supplied.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;These built-in estimates are the default and, in favorable cases, they work quite well.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The &lt;place w:st="on"&gt;Old Faithful&lt;/place&gt; waiting time data is a case in point – using the default starting values gives the following parameter estimates:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;gt; mixmdl[c("lambda","mu","sigma")]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;"&gt;$lambda&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;"&gt;[1] 0.3608868 0.6391132&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;"&gt;$mu&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;"&gt;[1] 54.61489 80.09109&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 
1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;"&gt;$sigma&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;"&gt;[1] 5.871241 5.867718&lt;/div&gt;&lt;/blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The mixture density described by these parameters is given by:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;p(x) = lambda[1] n(x; mu[1], sigma[1]) + lambda[2] n(x; mu[2], sigma[2])&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;where &lt;em&gt;n(x; mu, sigma)&lt;/em&gt; represents the Gaussian probability density function with mean &lt;em&gt;mu&lt;/em&gt; and standard deviation &lt;em&gt;sigma.&lt;/em&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;One reason the default starting values work well for the Old Faithful waiting time data is that if nothing is specified, the number of components (the parameter k) is set equal to 2.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Thus, if you are attempting to fit a mixture model with more than two components, this number should be specified, either by setting k to some other value and not specifying any starting estimates for the parameters lambda, mu, and sigma, or by specifying a vector with k components as starting values for at 
least one of these parameters.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;(There are a number of useful options in calling the &lt;strong&gt;normalmixEM&lt;/strong&gt; procedure: for example, specifying the initial sigma value as a scalar constant rather than a vector with k components forces the component variances to be equal.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;I won’t attempt to give a detailed discussion of these options here; for that, type “help(normalmixEM)”.)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Another important point about the default starting values is that, aside from the number of components k, any unspecified initial parameter estimates are selected randomly by the &lt;strong&gt;normalmixEM&lt;/strong&gt; procedure.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This means that, even in cases where the default starting values consistently work well – again, the &lt;place w:st="on"&gt;Old Faithful&lt;/place&gt; waiting time dataset seems to be such a case – the number of iterations required to obtain the final result can vary significantly from one run to the next.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;(Specifically, the &lt;strong&gt;normalmixEM&lt;/strong&gt; procedure does not fix the seed for the random number generators used to compute these starting values, so repeated runs of the procedure with the same data will start from different initial parameter values and require different numbers of iterations to achieve convergence.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In the case of the Old Faithful waiting time data, I have seen anywhere between 16 and 59 iterations required, with the final results 
differing only very slightly, typically in the fifth or sixth decimal place.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;If you want to use the same starting value on successive runs, this can be done by setting the random number seed via the &lt;strong&gt;set.seed&lt;/strong&gt; command before you invoke the &lt;strong&gt;normalmixEM&lt;/strong&gt; procedure.)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-lOvyRUwwCyM/Tj2Za7E91KI/AAAAAAAAADI/Ld4rDY-0gvM/s1600/mixtoolsFig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" src="http://4.bp.blogspot.com/-lOvyRUwwCyM/Tj2Za7E91KI/AAAAAAAAADI/Ld4rDY-0gvM/s320/mixtoolsFig02.png" t$="true" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;It is important to note that the default starting values do not always work well, even if the correct number of components is specified.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This point is illustrated nicely by the following example.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The plot above shows two curves: the solid line is the exact density for the three-component Gaussian mixture distribution described by the following parameters:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;span style="mso-tab-count: 1;"&gt;&lt;blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;mu = (2.00, 5.00, 
7.00)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;sigma = (1.000, 1.000, 1.000)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;lambda = (0.200, 0.600, 0.200)&lt;/div&gt;&lt;/blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The dashed curve in the figure is the nonparametric density estimate generated from n = 500 observations drawn from this mixture distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Note that the first two components of this mixture distribution are evident in&amp;nbsp;both of these&amp;nbsp;curves, from the density peaks at approximately 2 and 5.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The third component, however, is too close to the second to yield a clear peak in&amp;nbsp;either density, giving rise instead to slightly asymmetric “shoulders” on the right side of the upper peaks.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The key point is that the components in this mixture distribution are difficult to distinguish from either of these density curves, and this hints at further difficulties to come.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Applying the &lt;strong&gt;normalmixEM&lt;/strong&gt; procedure to the 500-sample sequence used to generate the nonparametric density estimate shown above and specifying k = 3 gives results that are substantially more variable than the &lt;place w:st="on"&gt;Old 
Faithful&lt;/place&gt; results discussed above.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In fact, to compare these results, it is necessary to be explicit about the values of the random seeds used to initialize the parameter estimation procedure.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Specifying this random seed as 101 and only specifying k=3 in the &lt;strong&gt;normalmixEM&lt;/strong&gt; call yields the following parameter estimates after 78 iterations:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;span style="mso-tab-count: 1;"&gt;&lt;blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;mu = (1.77, 4.87, 5.44)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;sigma = (0.766, 0.115, 1.463)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;lambda = (0.168, 0.028, 0.803)&lt;/div&gt;&lt;/blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Comparing these results with the correct parameter values listed above, it is clear that some of these estimation errors are&amp;nbsp;quite large.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The figure shown below compares the mixture density constructed from these parameters (the heavy dashed curve) with the nonparametric density estimate computed from the data used to estimate them.&lt;span 
style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The prominent “spike” in this mixture density plot corresponds to the very small standard deviation estimated for the second component and it provides a dramatic illustration of the relatively poor results obtained for this particular example.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-_EAqyfF3h3k/Tj2ftS37PxI/AAAAAAAAADM/yV__jlaiSSc/s1600/mixtoolsFig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" src="http://4.bp.blogspot.com/-_EAqyfF3h3k/Tj2ftS37PxI/AAAAAAAAADM/yV__jlaiSSc/s320/mixtoolsFig03.png" t$="true" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Repeating this numerical experiment with different random seeds to obtain different random starting estimates, the &lt;strong&gt;normalmixEM&lt;/strong&gt; procedure failed to converge&amp;nbsp;in 1000 iterations for seed values of 102 and 103, but it converged after 393 iterations for the seed value 104, yielding the following parameter estimates:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;span style="mso-tab-count: 1;"&gt;&lt;blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;mu = (1.79, 5.03, 5.46)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 
1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;sigma = (0.775, 0.352, 1.493)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;lambda = (0.169, 0.063, 0.768)&lt;/div&gt;&lt;/blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Arguably, the general behavior of these parameter estimates is quite similar to that of the estimates obtained with the random seed value 101, but note that the second standard deviation differs by a factor of three, and the second component of lambda more than doubles.&amp;nbsp;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;Increasing the sample size from n = 500 to n = 2000 and repeating these experiments, the &lt;strong&gt;normalmixEM&lt;/strong&gt; procedure failed to converge after 1000 iterations for all four of the random seed values 101 through 104.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;If, however, we specify the correct standard deviations as starting values (i.e., specify “sigma = c(1,1,1)” when we invoke &lt;strong&gt;normalmixEM&lt;/strong&gt;) and we increase the maximum number of iterations to 3000 (i.e., specify “maxit = 3000”), the procedure does converge after 2417 iterations for the seed value 101, yielding the following parameter estimates:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;span style="mso-tab-count: 1;"&gt;&lt;blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 
1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;mu = (1.98, 4.98, 7.15)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;sigma = (1.012, 1.055, 0.929)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;lambda = (0.198, 0.641, 0.161)&lt;/div&gt;&lt;/blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;While these parameters took a lot more effort to obtain, they are clearly much closer to the correct values, emphasizing the point that when we are fitting a model to data, our results generally improve as the amount of available data increases and as our starting estimates become more accurate.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This point is further illustrated by the plot shown below, analogous to the previous one, but constructed from the model fit to the longer data sequence and incorporating better initial parameter estimates.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Interestingly, re-running the same procedure but taking the correct means as starting parameter estimates instead of the correct standard deviations, the&amp;nbsp;procedure failed to converge in 3000 iterations.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-RE8snFLFQDk/Tj2gaHs0hcI/AAAAAAAAADQ/cjNii4IW3w8/s1600/mixtoolsFig04.png" imageanchor="1" style="margin-left: 1em; 
margin-right: 1em;"&gt;&lt;img border="0" height="319" src="http://1.bp.blogspot.com/-RE8snFLFQDk/Tj2gaHs0hcI/AAAAAAAAADQ/cjNii4IW3w8/s320/mixtoolsFig04.png" t$="true" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Overall, I like what I have seen so far of the &lt;strong&gt;mixtools&lt;/strong&gt; package, and I look forward to exploring its capabilities further.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;It’s great to have a built-in procedure – i.e., one I didn’t have to write and debug myself – that does all of the things that this package does.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;However, the three-component mixture results presented here do illustrate an important point: the behavior of iterative procedures like &lt;strong&gt;normalmixEM&lt;/strong&gt; and others in the &lt;strong&gt;mixtools&lt;/strong&gt; package can depend strongly on&amp;nbsp;the starting values chosen to initialize the iteration process, and the extent of this dependence can vary greatly from one application to another.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9179325420174899779-2656201277851576540?l=exploringdatablog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploringdatablog.blogspot.com/feeds/2656201277851576540/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://exploringdatablog.blogspot.com/2011/08/fitting-mixture-distributions-with-r.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/2656201277851576540'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/2656201277851576540'/><link rel='alternate' type='text/html' href='http://exploringdatablog.blogspot.com/2011/08/fitting-mixture-distributions-with-r.html' title='Fitting mixture distributions with the R package mixtools'/><author><name>Ron Pearson (aka TheNoodleDoodler)</name><uri>http://www.blogger.com/profile/15693640298594791682</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-kbk_korLXMw/Tj2JMvEPiPI/AAAAAAAAADE/avAFubexWKk/s72-c/mixtoolsFig01.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9179325420174899779.post-3636026065597121243</id><published>2011-07-16T11:32:00.000-07:00</published><updated>2011-07-16T11:32:52.103-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Old Faithful dataset'/><category scheme='http://www.blogger.com/atom/ns#' term='mixture distributions'/><category scheme='http://www.blogger.com/atom/ns#' term='regression models'/><category scheme='http://www.blogger.com/atom/ns#' term='mixture models'/><category scheme='http://www.blogger.com/atom/ns#' term='Gaussian mixture distributions'/><title type='text'>Mixture distributions and models: a clarification</title><content type='html'>&lt;span&gt;&lt;/span&gt; &lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;In response to my last post, Chris had&amp;nbsp;the following comment:&lt;/div&gt;&lt;div class="MsoNormal" 
style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;blockquote&gt;&lt;span style="mso-tab-count: 1;"&gt;&lt;/span&gt;I am actually trying to better understand the distinction between mixture models and mixture distributions in my own work.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;You seem to say mixture models apply to a small set of models – namely regression models.&lt;/blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;This&amp;nbsp;comment suggests that my caution about the difference between &lt;em&gt;mixed-effect models&lt;/em&gt; and &lt;em&gt;mixture distributions&lt;/em&gt; may have caused as much confusion as clarification, and the purpose of this post is to try to clear up this confusion.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;So first, let me offer the following general observations.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The term “mixture models” refers to a generalization of the class of finite mixture distributions that I discussed in my previous post.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;I give a more detailed discussion of finite mixture distributions in Chapter 10 of&amp;nbsp;&lt;span&gt;&lt;a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;Exploring Data in Engineering, the Sciences, and 
Medicine&lt;/a&gt;&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=0195089650" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt;&lt;/span&gt;&lt;span&gt; &lt;/span&gt;, and the more general class of mixture models is discussed in the book &lt;span&gt;&lt;a href="http://www.amazon.com/Mixture-Models-Statistics-Textbooks-Monographs/dp/0824776917?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;Mixture Models (Statistics: A Series of Textbooks and Monographs)&lt;/a&gt;&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=0824776917" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt;&lt;/span&gt; by Geoffrey J. McLachlan and Kaye E. 
Bashford.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The basic idea is that we are describing some observed phenomenon like the Old Faithful geyser data (the &lt;strong&gt;faithful&lt;/strong&gt; data object in &lt;em&gt;R&lt;/em&gt;) where a close look at the data (e.g., with a nonparametric density estimate) suggests substantial heterogeneity.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In particular, the density estimates I presented last time for both of the variables in this dataset exhibit clear evidence of bimodality.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Essentially, the idea behind a mixture model/mixture distribution is that we are observing something that isn’t fully characterized by a single, simple distribution or model, but instead by several such distributions or models, with some random selection mechanism at work.&amp;nbsp; In the case of mixture distributions,&amp;nbsp;some observations appear to be drawn from distribution 1, some from distribution 2, and so forth.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The more general class of mixture models is quite broad, including things like heterogeneous regression models, where the response may depend approximately linearly on some covariate with one slope and intercept for observations drawn from one sub-population, but with another, very different slope and intercept for observations drawn from another sub-population.&amp;nbsp; I present an example at the end of this post that illustrates this idea.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The probable source of confusion for Chris – and very possibly other readers – is the comment I made about the difference between these mixture models and &lt;i style="mso-bidi-font-style: normal;"&gt;mixed-effect models&lt;/i&gt;.&lt;span style="mso-spacerun: 
yes;"&gt;&amp;nbsp; &lt;/span&gt;This other class of models – which I only mentioned in passing in my post – typically consists of a linear regression model with two types of prediction variables: deterministic predictors, like those that appear in standard linear regression models, and random predictors that are typically assumed to obey a Gaussian distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This framework has been extended to more general settings like generalized linear models (e.g., mixed-effect logistic regression models).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The &lt;em&gt;R&lt;/em&gt; package &lt;strong&gt;lme4&lt;/strong&gt; provides support for fitting both linear mixed-effect models and generalized linear mixed-effect models to data.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As I noted last time, these model classes are distinct from the mixture distribution/mixture model classes I discuss here.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The models that I do discuss – mixture models – have strong connections with cluster analysis, where we are given a heterogeneous group of objects and typically wish to determine how many distinct groups of objects are present and assign individuals to the appropriate groups.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;A very high-level view of the many &lt;em&gt;R&lt;/em&gt; packages available for clustering – some based on mixture model ideas and some not – is available from the &lt;a href="http://cran.r-project.org/web/views/Cluster.html"&gt;CRAN clustering task view page&lt;/a&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Two packages from this task view that I plan to discuss in future posts are &lt;strong&gt;flexmix&lt;/strong&gt; and &lt;strong&gt;mixtools&lt;/strong&gt;, both of which support a variety of mixture model applications.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; 
&lt;/span&gt;The following comments&amp;nbsp;from the vignette &lt;a href="http://cran.r-project.org/web/packages/flexmix/vignettes/flexmix-intro.pdf"&gt;FlexMix: A General Framework for Finite Mixture Models and Latent Class Regression in R&lt;/a&gt;&amp;nbsp;give an indication of the range of areas where these ideas are useful:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;“Finite mixture models have been used for more than 100 years, but have seen a real boost in popularity over the last decade due to the tremendous increase in available computing power.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The areas of application of mixture models range from biology and medicine to physics, economics, and marketing.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;On the one hand, these models can be applied to data where observations originate from various groups and the group affiliations are not known, and on the other hand to provide approximations for multi-modal distributions.”&lt;/div&gt;&lt;/blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-Do8IMtBIjKY/TiHSCHhpw-I/AAAAAAAAAC4/lPxrVps2ZNs/s1600/OldFaithfulEx01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" m$="true" src="http://1.bp.blogspot.com/-Do8IMtBIjKY/TiHSCHhpw-I/AAAAAAAAAC4/lPxrVps2ZNs/s320/OldFaithfulEx01.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 
1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;The following example illustrates the second of these ideas, motivated by the &lt;place w:st="on"&gt;Old Faithful&lt;/place&gt; geyser data that I discussed last time.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As a reminder, the plot above shows the nonparametric density estimate generated from the 272 observations of the &lt;place w:st="on"&gt;Old Faithful&lt;/place&gt; waiting time data included in the &lt;strong&gt;faithful&lt;/strong&gt; data object, using the &lt;strong&gt;density&lt;/strong&gt; procedure in &lt;em&gt;R&lt;/em&gt; with the default parameter settings.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As I noted last time, the plot shows two clear peaks, the lower one centered at approximately 55 minutes, and the second at approximately 80 minutes.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Also, note that the first peak is substantially smaller in amplitude and appears to be somewhat narrower than the second peak.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-kTFAkWBQjpw/TiHSPs00adI/AAAAAAAAAC8/3Nf0iIoj2Zw/s1600/MixDensFig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" m$="true" src="http://1.bp.blogspot.com/-kTFAkWBQjpw/TiHSPs00adI/AAAAAAAAAC8/3Nf0iIoj2Zw/s320/MixDensFig01.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;To illustrate the connection with finite mixture distributions, the &lt;em&gt;R&lt;/em&gt; procedure described below generates 
a two-component Gaussian mixture density whose random samples exhibit approximately the same behavior seen in the &lt;place w:st="on"&gt;Old Faithful&lt;/place&gt; waiting time data.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The results generated by this procedure are shown in the above figure, which includes two overlaid plots: one corresponding to the exact density for the two-component Gaussian mixture distribution (the solid line), and the other corresponding to the nonparametric density estimate computed from N = 272 random samples drawn from this mixture distribution (the dashed line).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As in the previous plot, the nonparametric density estimate was computed using the &lt;strong&gt;density&lt;/strong&gt; command in &lt;em&gt;R&lt;/em&gt; with its default parameter values.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The first component in this mixture has mean 54.5 and standard deviation 8.0, values chosen by trial and error to approximately match the lower peak in the &lt;place w:st="on"&gt;Old Faithful&lt;/place&gt; waiting time distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The second component has mean 80.0 and standard deviation 5.0, chosen to approximately match the second peak in the waiting time distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The probabilities associated with the first and second components are 0.45 and 0.55, respectively, selected to give approximately the same peak heights seen in the waiting time density estimate.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Combining these results, the density of this mixture distribution is:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 
1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;p(x) = 0.45 n(x; 54.5, 8.0) + 0.55 n(x; 80.0, 5.0),&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;where n(x;m,s) denotes the Gaussian density function with mean m and standard deviation s.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;These density functions can be generated using the &lt;strong&gt;dnorm&lt;/strong&gt; function in &lt;em&gt;R&lt;/em&gt;.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;The &lt;em&gt;R&lt;/em&gt; procedure listed below generates &lt;strong&gt;n&lt;/strong&gt; independent, identically distributed random samples from an &lt;em&gt;m&lt;/em&gt;-component Gaussian mixture distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This procedure is called with the following parameters:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;strong&gt;n&lt;/strong&gt; = the number of random samples to generate&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;strong&gt;muvec&lt;/strong&gt; = vector of &lt;em&gt;m&lt;/em&gt; mean values&lt;/div&gt;&lt;div class="MsoNormal" 
style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;strong&gt;sigvec&lt;/strong&gt; = vector of &lt;em&gt;m&lt;/em&gt; standard deviations&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;strong&gt;pvec&lt;/strong&gt; = vector of probabilities for each of the &lt;em&gt;m &lt;/em&gt;components&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;strong&gt;iseed&lt;/strong&gt; = integer seed to initialize the random number generators&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The &lt;em&gt;R&lt;/em&gt; code for the procedure looks like this:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;MixEx01GenProc &amp;lt;- function(n, muvec, sigvec, pvec, iseed=101){&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;set.seed(iseed)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;m &amp;lt;- length(pvec)&lt;/div&gt;&lt;div class="MsoNormal" 
style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;indx &amp;lt;- sample(seq(1,m,1), size=n, replace=TRUE, prob=pvec)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;yvec &amp;lt;- 0&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;for (i in 1:m){&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;xvec &amp;lt;- rnorm(n, mean=muvec[i], sd=sigvec[i])&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;yvec &amp;lt;- yvec + xvec * as.numeric(indx == i)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;}&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;yvec&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;}&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The first statement initializes the random number generator using the &lt;strong&gt;iseed&lt;/strong&gt; parameter, which is given a default value of 101.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The second line determines the number of components in the mixture density from the length of the &lt;strong&gt;pvec&lt;/strong&gt; parameter vector, and the 
third line generates a random sequence &lt;strong&gt;indx&lt;/strong&gt; of component indices taking the values 1 through &lt;em&gt;m&lt;/em&gt; with probabilities determined by the &lt;strong&gt;pvec&lt;/strong&gt; parameter.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The rest of the program is a short loop that generates each component in turn, using &lt;strong&gt;indx&lt;/strong&gt; to randomly select observations from each of these components with the appropriate probability. &lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&lt;/span&gt;To see how this works, note that the first pass through the loop generates the random vector &lt;strong&gt;xvec&lt;/strong&gt; of length &lt;strong&gt;n&lt;/strong&gt;, with mean given by the first element of the vector &lt;strong&gt;muvec&lt;/strong&gt; and standard deviation given by the first element of the vector &lt;strong&gt;sigvec&lt;/strong&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Then, for every one of the &lt;strong&gt;n&lt;/strong&gt; elements of &lt;strong&gt;yvec&lt;/strong&gt; for which the &lt;strong&gt;indx&lt;/strong&gt; vector is equal to 1, &lt;strong&gt;yvec&lt;/strong&gt; is set equal to the corresponding element of this first random component &lt;strong&gt;xvec&lt;/strong&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;On the second pass through the loop, the second random component is generated as &lt;strong&gt;xvec&lt;/strong&gt;, again with length &lt;strong&gt;n&lt;/strong&gt; but now with mean specified by the second element of &lt;strong&gt;muvec&lt;/strong&gt; and standard deviation determined by the second element of &lt;strong&gt;sigvec&lt;/strong&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As before, this value is added to the initial value of &lt;strong&gt;yvec&lt;/strong&gt; whenever the selection index vector &lt;strong&gt;indx&lt;/strong&gt; is equal to 2.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; 
&lt;/span&gt;Note that since each element of the &lt;strong&gt;indx&lt;/strong&gt; vector selects exactly one component, none of the nonzero elements of &lt;strong&gt;yvec&lt;/strong&gt; computed during the first iteration of the loop are modified; instead, the only elements of &lt;strong&gt;yvec&lt;/strong&gt; that are modified in the second pass through the loop have their initial value of zero, specified in the line above the start of the loop.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;More generally, each pass through the loop generates the next component of the mixture distribution and fills in the corresponding elements of &lt;strong&gt;yvec&lt;/strong&gt; as determined by the random selection index vector &lt;strong&gt;indx&lt;/strong&gt;.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-Zu-7I2Gm0Ew/TiHUTcXeLSI/AAAAAAAAADA/8VTpCBWBFMI/s1600/MixExFig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" m$="true" src="http://3.bp.blogspot.com/-Zu-7I2Gm0Ew/TiHUTcXeLSI/AAAAAAAAADA/8VTpCBWBFMI/s320/MixExFig03.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;As I noted at the beginning of this post, the notion of a mixture model is more general than that of the finite mixture distributions just described, but closely related.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;I conclude this post with a simple example of a more general mixture model.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The&amp;nbsp;above scatter plot&amp;nbsp;shows two variables, x and y, related by the following mixture model:&lt;/div&gt;&lt;div 
class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 2;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;y = x + e&lt;sub&gt;1&lt;/sub&gt; with probability p&lt;sub&gt;1&lt;/sub&gt; = 0.40,&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;and&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;y = -x + 2 + e&lt;sub&gt;2&lt;/sub&gt; with probability p&lt;sub&gt;2&lt;/sub&gt; = 0.60,&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;where e&lt;sub&gt;1&lt;/sub&gt; is a zero-mean Gaussian random variable with standard deviation 0.1, and e&lt;sub&gt;2&lt;/sub&gt; is a zero-mean Gaussian random variable with standard deviation 0.3.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;To emphasize the components in the mixture model, points corresponding to the first component are plotted as solid circles, while points corresponding to the second component are plotted as open triangles.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The two dashed lines in this plot represent the ordinary least squares regression lines fit to each component separately, and they both correspond reasonably well to the underlying linear relationships that define the two components (e.g., the least squares line fit to the solid circles has a slope of 
approximately +1 and an intercept of approximately 0).&amp;nbsp; In contrast, the heavier dotted line represents the ordinary least squares regression line fit to the complete dataset without any knowledge of its underlying component structure: this line is almost horizontal and represents a very poor approximation to the behavior of the dataset.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&lt;/span&gt;The point of this example is to illustrate two things.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;First, it provides a relatively simple illustration of how the mixture density idea discussed above generalizes to the setting of regression models and beyond: we can construct fairly general mixture models by requiring different randomly selected subsets of the data to conform to different modeling assumptions.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The second point – emphasized by the strong disagreement between the overall regression line and both of the component regression lines – is that if we are given only the dataset (i.e., the x and y values themselves) without knowing which component they represent, standard analysis procedures are likely to perform very badly.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This question – how do we analyze a dataset like this one without detailed prior knowledge of its heterogeneous structure – is what &lt;em&gt;R&lt;/em&gt; packages like &lt;strong&gt;flexmix&lt;/strong&gt; and &lt;strong&gt;mixtools&lt;/strong&gt; are designed to address.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;More about 
that in future posts. &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9179325420174899779-3636026065597121243?l=exploringdatablog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploringdatablog.blogspot.com/feeds/3636026065597121243/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://exploringdatablog.blogspot.com/2011/07/mixture-distributions-and-models.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/3636026065597121243'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/3636026065597121243'/><link rel='alternate' type='text/html' href='http://exploringdatablog.blogspot.com/2011/07/mixture-distributions-and-models.html' title='Mixture distributions and models: a clarification'/><author><name>Ron Pearson (aka TheNoodleDoodler)</name><uri>http://www.blogger.com/profile/15693640298594791682</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-Do8IMtBIjKY/TiHSCHhpw-I/AAAAAAAAAC4/lPxrVps2ZNs/s72-c/OldFaithfulEx01.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9179325420174899779.post-9160323063841521076</id><published>2011-06-18T11:20:00.000-07:00</published><updated>2011-06-18T11:20:33.864-07:00</updated><title type='text'>A Brief Introduction to Mixture Distributions</title><content type='html'>Last time, I discussed some of the advantages and disadvantages of robust estimators like the median and the MADM scale estimator, noting that certain types of 
datasets – like the rainfall dataset discussed last time – can cause these estimators to fail spectacularly.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;An extremely useful idea in working with datasets like this one is that of &lt;em&gt;mixture distributions&lt;/em&gt;, which describe random variables that are drawn from more than one parent population.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;I discuss mixture distributions in some detail in Chapter 10 of &lt;a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;Exploring Data in Engineering, the Sciences, and Medicine&lt;/a&gt;&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=0195089650" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt; (Section 10.6), and the objective of this post is to give a brief introduction to the basic ideas of mixture distributions, with some hints about how they can be useful in connection with problems like analyzing the rainfall data discussed last time.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Subsequent posts will discuss these ideas in more detail, with pointers to specific &lt;em&gt;R&lt;/em&gt; packages that are useful in applying these ideas to real data analysis problems. 
&lt;br /&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Before proceeding, it is important to emphasize that the topic of this post and subsequent discussions is “mixture distributions” and &lt;em&gt;not&lt;/em&gt; “mixture models,” which are the subject of considerable current interest in the statistics community.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The distinction is important because these topics are very different: mixture distributions represent a useful way of describing heterogeneity in the distribution of a variable, whereas mixture models provide a foundation for incorporating both deterministic and random predictor variables in regression models.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-23UcJUd7_yE/TfziuPmMc4I/AAAAAAAAACs/fDnv4PD1TYs/s1600/mixfig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" i$="true" src="http://4.bp.blogspot.com/-23UcJUd7_yE/TfziuPmMc4I/AAAAAAAAACs/fDnv4PD1TYs/s320/mixfig01.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;To motivate the ideas presented here, the figure above shows four plots constructed from the Old Faithful geyser data that I have discussed previously (the &lt;em&gt;R&lt;/em&gt; dataset &lt;strong&gt;faithful&lt;/strong&gt;).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This data frame contains 272 observations of the durations of successive eruptions and the waiting time until the next eruption, both measured in 
minutes.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The upper left plot above is the normal Q-Q plot for the duration values, computed using the &lt;strong&gt;qqPlot&lt;/strong&gt; procedure from the &lt;strong&gt;car&lt;/strong&gt; package that I have discussed previously, and the upper right plot is the corresponding normal Q-Q plot for the waiting times.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Both of these Q-Q plots exhibit pronounced “kinks,” which I have noted previously often indicate the presence of a multimodal data distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The lower two plots are the corresponding nonparametric density estimates (computed using the &lt;strong&gt;density&lt;/strong&gt; procedure in &lt;em&gt;R&lt;/em&gt; with its default parameters).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The point of these plots is that they give strong evidence that the distributions of both the durations and waiting times are bimodal.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;In fact, bimodal and more general multimodal distributions arise frequently in practice, particularly in cases where we are observing a composite response from multiple, distinct sources.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This is the basic idea behind mixture distributions: the response &lt;em&gt;x&lt;/em&gt; that we observe is modeled as a random variable that has some probability &lt;em&gt;p&lt;sub&gt;1&lt;/sub&gt;&lt;/em&gt; of being drawn from distribution &lt;em&gt;D&lt;sub&gt;1&lt;/sub&gt;&lt;/em&gt;, probability &lt;em&gt;p&lt;sub&gt;2&lt;/sub&gt;&lt;/em&gt; of being drawn from distribution &lt;em&gt;D&lt;sub&gt;2&lt;/sub&gt;&lt;/em&gt;, and so forth, with probability &lt;em&gt;p&lt;sub&gt;n&lt;/sub&gt;&lt;/em&gt; of being drawn from distribution 
&lt;em&gt;D&lt;sub&gt;n&lt;/sub&gt;&lt;/em&gt;, where &lt;em&gt;n&lt;/em&gt; is the number of components in our mixture distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The key assumption here is one of statistical independence between the process of randomly selecting the component distribution &lt;em&gt;D&lt;sub&gt;i&lt;/sub&gt;&lt;/em&gt; to be drawn and these distributions themselves.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;That is, we assume there is a random selection process that first generates the numbers 1 through &lt;em&gt;n&lt;/em&gt; with probabilities &lt;em&gt;p&lt;sub&gt;1&lt;/sub&gt;&lt;/em&gt; through &lt;em&gt;p&lt;sub&gt;n&lt;/sub&gt;&lt;/em&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Then, once we have drawn some number &lt;em&gt;j,&lt;/em&gt; we turn to distribution &lt;em&gt;D&lt;sub&gt;j&lt;/sub&gt; &lt;/em&gt;and draw the random variable &lt;em&gt;x&lt;/em&gt; from this distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;So long as the probabilities – or &lt;em&gt;mixing percentages&lt;/em&gt; – &lt;em&gt;p&lt;sub&gt;i&lt;/sub&gt; &lt;/em&gt;sum to 1, and all of the distributions &lt;em&gt;D&lt;sub&gt;i&lt;/sub&gt;&lt;/em&gt; are proper densities, the combination also defines a proper probability density function, which can be used as the basis for computing expectations, formulating maximum likelihood estimation problems, and so forth.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The simplest case is that for &lt;em&gt;n&lt;/em&gt; = 2, which provides many different specific examples that have been found to be extremely useful in practice.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Because it is the easiest to understand, this post will focus on this special case.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 
0pt;"&gt;Because the mixing percentages must sum to 1, &lt;em&gt;p&lt;sub&gt;2&lt;/sub&gt; = 1 – p&lt;sub&gt;1&lt;/sub&gt;&lt;/em&gt; when &lt;em&gt;n = 2&lt;/em&gt;, so it simplifies the discussion to drop the subscripts, writing &lt;em&gt;p&lt;sub&gt;1&lt;/sub&gt; = p&lt;/em&gt; and &lt;em&gt;p&lt;sub&gt;2&lt;/sub&gt; = 1 – p&lt;/em&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Similarly, let &lt;em&gt;f(x)&lt;/em&gt; denote the density associated with the distribution &lt;em&gt;D&lt;sub&gt;1&lt;/sub&gt;&lt;/em&gt; and &lt;em&gt;g(x)&lt;/em&gt; denote the density associated with the distribution &lt;em&gt;D&lt;sub&gt;2&lt;/sub&gt;&lt;/em&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The overall probability density function &lt;em&gt;p(x)&lt;/em&gt; is then given by:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 2;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;em&gt;p(x) = p f(x) + (1 – p) g(x)&lt;/em&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;To 
illustrate the flexibility of this idea, consider the case where both &lt;em&gt;f(x)&lt;/em&gt; and &lt;em&gt;g(x)&lt;/em&gt; are Gaussian distributions.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The figure below shows four specific examples.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-O7jtwat9gW0/Tfzi440K96I/AAAAAAAAACw/zatnB-x_gJA/s1600/mixfig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" i$="true" src="http://1.bp.blogspot.com/-O7jtwat9gW0/Tfzi440K96I/AAAAAAAAACw/zatnB-x_gJA/s320/mixfig02.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The upper left plot shows the standard normal distribution (i.e., mean 0 and standard deviation 1), which corresponds to taking both &lt;em&gt;f(x)&lt;/em&gt; and &lt;em&gt;g(x)&lt;/em&gt; as standard normal densities, for any choice of &lt;em&gt;p&lt;/em&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;I have included it here both because it represents what has been historically the most common distributional assumption, and because it represents a reference case in interpreting the other distributions considered here.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;The upper right plot corresponds to the &lt;em&gt;contaminated normal outlier distribution&lt;/em&gt;, widely adopted in the robust statistics literature.&amp;nbsp; There, the idea is that measurement errors – traditionally modeled as zero-mean Gaussian random variables with some unknown standard deviation &lt;em&gt;S&lt;/em&gt; – mostly conform to this model, but some fraction 
(typically, 10% to 20%) of these measurement errors have larger variability.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The traditional contaminated normal model defines &lt;em&gt;p&lt;/em&gt; as this contamination percentage and assumes a standard deviation of &lt;em&gt;3S&lt;/em&gt; for the contaminated measurements.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The upper right plot above shows the density for this contaminated normal model with &lt;em&gt;p = 0.15&lt;/em&gt; (i.e., 15% contamination).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;Visually, this plot looks almost identical to the one on the upper left for the standard normal distribution, but plotting them on common axes (as done in Fig. 10.17 of &lt;em&gt;Exploring Data&lt;/em&gt;) shows that these distributions are not identical, a point discussed further below on the basis of Q-Q plots.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The lower left plot in the figure above corresponds to a two-component Gaussian mixture distribution where both components are equally represented (i.e., &lt;em&gt;p = 0.5&lt;/em&gt;), the first component &lt;em&gt;f(x)&lt;/em&gt; is the standard normal distribution as before, and the second component &lt;em&gt;g(x)&lt;/em&gt; is a Gaussian distribution with mean 3 and standard deviation 3.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The first component contributes the sharper main peak centered at zero, while the second component contributes the broad “shoulder” seen in the right half of this plot.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Finally, the lower&amp;nbsp;right plot shows a mixture distribution with &lt;em&gt;p = 0.40&lt;/em&gt; where the first component is a Gaussian distribution with mean -2 and standard deviation 1, and the second component is a Gaussian distribution with 
mean +2 and standard deviation 1.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The result is a bimodal distribution with the same general characteristics as the Old Faithful geyser data.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-Be1WhADvqgE/TfzjpjdECXI/AAAAAAAAAC0/HhRADITa-Xs/s1600/mixfig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" i$="true" src="http://3.bp.blogspot.com/-Be1WhADvqgE/TfzjpjdECXI/AAAAAAAAAC0/HhRADITa-Xs/s320/mixfig03.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The point of these examples has been to illustrate the flexibility of the mixture distribution concept in describing everything from outliers to the inherent heterogeneity of natural phenomena with more than one distinct generation mechanism.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Before leaving this discussion, it is worth considering the contaminated normal case a bit further.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The plot above shows the normal Q-Q plot for 272 samples of the contaminated normal distribution just described, generated using the &lt;em&gt;R&lt;/em&gt; procedure listed below (the number 272 was chosen to make the sample size the same as that of the Old Faithful geyser data discussed above).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As before, this Q-Q plot was generated using the procedure &lt;strong&gt;qqPlot&lt;/strong&gt; from the 
&lt;strong&gt;car&lt;/strong&gt; package.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The heavy-tailed, non-Gaussian character of this distribution is evident from the fact that both the upper and lower points in this plot fall well outside the 95% confidence interval around the Gaussian reference line.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This example illustrates the power of the Q-Q plot for distributional assessment: like the density plots shown above for the standard normal and contaminated normal distributions, nonparametric density plots (not shown) generated from 272 samples drawn from each of these data distributions are not markedly different.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The contaminated normal random samples&amp;nbsp;used to construct&amp;nbsp;this Q-Q plot were generated with the following simple R function:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;cngen.proc01 &amp;lt;- function(n=272,cpct = 0.15, mu1 = 0, mu2 = 0, sig1 = 1, sig2 = 3,iseed=101){&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;set.seed(iseed)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;y0 &amp;lt;- rnorm(n,mean=mu1, sd = sig1)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;y1 &amp;lt;- rnorm(n,mean=mu2, sd = sig2)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 
0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;flag &amp;lt;- rbinom(n,size=1,prob=cpct)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;y &amp;lt;- y0*(1 - flag) + y1*flag&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;y&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;}&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;This function is called with seven parameters, all of which are given default values.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The parameter &lt;strong&gt;n&lt;/strong&gt; is the sample size, &lt;strong&gt;cpct&lt;/strong&gt; is the contamination percentage (expressed as a fraction, so &lt;strong&gt;cpct&lt;/strong&gt; = 0.15 corresponds to 15% contamination), &lt;strong&gt;mu1&lt;/strong&gt; and &lt;strong&gt;mu2&lt;/strong&gt; are the means of the component Gaussian distributions, and &lt;strong&gt;sig1&lt;/strong&gt; and &lt;strong&gt;sig2&lt;/strong&gt; are the corresponding standard deviations.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The default values for &lt;strong&gt;cpct, mu1, mu2, sig1,&lt;/strong&gt; and &lt;strong&gt;sig2&lt;/strong&gt; are those for a typical contaminated normal outlier distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Finally, the parameter &lt;strong&gt;iseed&lt;/strong&gt; is the seed for the random number generator: specifying its value as a default means that the 
procedure will return the same set of pseudorandom numbers each time it is called; to obtain an independent set, simply specify a different value for &lt;strong&gt;iseed.&lt;/strong&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The procedure itself generates three mutually independent random variables: &lt;strong&gt;y0&lt;/strong&gt; corresponds to the first component distribution, &lt;strong&gt;y1&lt;/strong&gt; corresponds to the second, and &lt;strong&gt;flag&lt;/strong&gt; is a binomial random variable that determines which component distribution is selected for the final random number: when &lt;strong&gt;flag&lt;/strong&gt; = 1 (an event with probability &lt;strong&gt;cpct&lt;/strong&gt;), the contaminated value &lt;strong&gt;y1&lt;/strong&gt; is returned, and when &lt;strong&gt;flag&lt;/strong&gt; = 0 (an event with probability 1 –&lt;strong&gt; cpct&lt;/strong&gt;), the non-contaminated value &lt;strong&gt;y0&lt;/strong&gt; is returned.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The basic ideas just described – random selection of a random variable from &lt;em&gt;n&lt;/em&gt; distinct distributions – can be applied to a very wide range of distributions, leading to an extremely wide range of data models.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The examples described above give a very preliminary indication of the power of Gaussian mixture distributions as approximations to the distribution of heterogeneous phenomena that arise from multiple sources.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Practical applications of more general (i.e., not necessarily Gaussian) mixture distributions include modeling particle sizes generated by multiple mechanisms (e.g., accretion of large particles from smaller ones and 
fragmentation of larger particles to form smaller ones, possibly due to differences in material characteristics), pore size distributions in rocks, polymer chain length or chain branching distributions, and many others.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;More generally, these ideas are also applicable to discrete distributions, leading to ideas like the zero-inflated Poisson and negative binomial distributions that I will discuss in my next post, or to combinations of continuous distributions with degenerate distributions (e.g., concentrated at zero), leading to ideas like zero-augmented continuous distributions that may be appropriate in applications like the analysis of the rainfall data I discussed last time.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;One of the advantages of &lt;em&gt;R&lt;/em&gt; is that it provides a wide variety of support for various useful implementations of these ideas.&lt;/div&gt;&lt;/div&gt;&lt;blockquote&gt;&lt;em&gt;&lt;/em&gt;&amp;nbsp;&lt;/blockquote&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9179325420174899779-9160323063841521076?l=exploringdatablog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploringdatablog.blogspot.com/feeds/9160323063841521076/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://exploringdatablog.blogspot.com/2011/06/brief-introduction-to-mixture.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/9160323063841521076'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/9160323063841521076'/><link rel='alternate' type='text/html' href='http://exploringdatablog.blogspot.com/2011/06/brief-introduction-to-mixture.html' title='A 
Brief Introduction to Mixture Distributions'/><author><name>Ron Pearson (aka TheNoodleDoodler)</name><uri>http://www.blogger.com/profile/15693640298594791682</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-23UcJUd7_yE/TfziuPmMc4I/AAAAAAAAACs/fDnv4PD1TYs/s72-c/mixfig01.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9179325420174899779.post-202960599817804400</id><published>2011-06-06T17:40:00.000-07:00</published><updated>2011-06-06T17:40:04.344-07:00</updated><title type='text'>The pros and cons of robust data characterizations</title><content type='html'>&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Over the years, I have looked at a lot of data contaminated with outliers, the subject of Chapter 7 of &lt;span&gt;&lt;a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;Exploring Data in Engineering, the Sciences, and Medicine&lt;/a&gt;&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=0195089650" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt;&lt;/span&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;That chapter adopts the definition of an outlier presented by Barnett and Lewis in their book &lt;span&gt;&lt;a 
href="http://www.amazon.com/Outliers-Statistical-Data-Barnett-Lewis/dp/B001COQ282?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;Outliers in Statistical Data 2nd Edition&lt;/a&gt;&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=B001COQ282" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt;&lt;/span&gt;, that outliers are “data points inconsistent with the majority of values in a dataset.”&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The practical importance of outliers lies in the fact that even a few of them can cause standard data characterization and analysis procedures to give highly misleading results.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Robust procedures have been developed to address these problems, and their performance in the face of outliers can be dramatically better than that of standard methods.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;On the other hand, robust procedures can also fail, and when they do, they often fail spectacularly.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-mRzZxRoj6hY/Te1ouL2iwuI/AAAAAAAAACg/Auz_Ycgpmnk/s1600/RobustFig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" src="http://2.bp.blogspot.com/-mRzZxRoj6hY/Te1ouL2iwuI/AAAAAAAAACg/Auz_Ycgpmnk/s320/RobustFig01.png" t8="true" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 
0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The above figure is a plot of successive measurements of the makeup flow rate from an industrial manufacturing process.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This example is introduced in Chapter 2 of &lt;em&gt;Exploring Data&lt;/em&gt; and is discussed in detail in Chapter 7; the dataset itself is available as the file &lt;strong&gt;makeup.csv&lt;/strong&gt; from the &lt;a href="http://www.oup.com/us/ExploringData"&gt;companion website&lt;/a&gt; for the book.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Each point in this dataset corresponds to a measurement of the flow rate of a liquid component in a process with a recycle stream: most of this component is recovered from a downstream processing step and recycled back to the upstream step, but a small amount of this material is lost to evaporation and must be “made up” by a separate feed stream.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The data points plotted above correspond to the flow rate of this makeup stream.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;By default, the &lt;strong&gt;quantile&lt;/strong&gt; function in &lt;em&gt;R&lt;/em&gt; computes the Tukey five-number summary of the data sequence (i.e., the minimum value, lower quartile, median, upper quartile, and maximum value).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Applied to this sequence, the results are:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt; quantile(MakeUpFlow)&lt;/div&gt;&lt;div class="MsoNormal" 
style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;0%&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;25%&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;50%&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;75%&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;100% &lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;0.00086&amp;nbsp;&amp;nbsp; 370.43747&amp;nbsp; &amp;nbsp;393.35861&amp;nbsp;&amp;nbsp; 404.29602&amp;nbsp;&amp;nbsp;&amp;nbsp; 439.59091 &lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Looking at the plot, it is clear that these observations partition naturally into two subsets: those values between 0 and about 200 – all falling between the minimum value and the lower quartile – and those values between about 300 and 440.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In fact, 12.2% of the data values lie between 0 and 10, and 22.3% of the values are less than 200.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; 
&lt;/span&gt;Physically, these smaller-than-average&amp;nbsp;flow rates correspond to &lt;em&gt;shutdown&lt;/em&gt; &lt;em&gt;episodes,&lt;/em&gt; where the manufacturing process is either not running, is in the process of being shut down, or is in the process of being started back up.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;If we compute the mean of the complete dataset, we obtain an average makeup flow rate of 315.46, corresponding to the dotted line in the above plot.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;It is clear that while this number represents the center of the overall data distribution, it is not representative of either the shutdown episodes or normal process operation, since it forms a perfect dividing line between these two data subsets.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;If we exclude the shutdown episodes – specifically, if we omit all values smaller than 200 – and compute the mean of what remains, we obtain 397.56, corresponding to the solid line in the plot above.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;It is clear that this number much better represents “typical process operation.”&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;This example illustrates the well-known outlier sensitivity of the mean.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;An outlier-resistant alternative is the median, defined as follows: if the number N of data values is odd, the median is the middle element in the ordered list we obtain by sorting these values from smallest to largest; if N is even, the median is the average of the two middle elements in this list.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Applied to the complete makeup flow rate dataset, the median gives a “typical” flow rate of 393.36, fairly close to the average value 
computed without the shutdown episodes.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Excluding these episodes changes the median value only a little, to 399.50, again quite close to the mean value after we remove these anomalous data points.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The sensitivity of the standard deviation to outliers is even worse than that of the mean.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For the complete dataset, the computed standard deviation is 155.05, an order of magnitude larger than the value (15.22) computed from the dataset without the shutdown episodes.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;While it is not as well known as the median, an outlier-resistant alternative to the standard deviation is the MADM scale estimator, constructed as follows.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;First, compute the median as a reference value and compute the differences between every data point and this reference value.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Then, take the absolute values of these differences, giving a set of non-negative numbers that tell us how far each point lies from the median reference value.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Next, we compute the median of this sequence of absolute differences, which tells us how far a &lt;em&gt;typical &lt;/em&gt;data point lies from the reference value.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Because we have used the outlier-resistant median both to compute the reference value and to compute this “typical distance” number, the result is highly outlier-resistant.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The only difficulty is that, for approximately normally distributed data with no outliers, this number 
is consistently smaller than the standard deviation.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;To obtain an alternative estimator of the standard deviation – i.e., a number that is approximately equal to the standard deviation when we compute it from a large, outlier-free collection of approximately Gaussian data – we need to multiply the number just defined by 1.4826.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;(A derivation of this result is given in Chapter 7 of &lt;em&gt;Exploring Data&lt;/em&gt;.)&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This bias-corrected scale estimate is available as the built-in function &lt;strong&gt;mad&lt;/strong&gt; in &lt;em&gt;R&lt;/em&gt;, and applying it to the makeup flow rate dataset gives a scale estimate of 20.22, about the same magnitude as the standard deviation of the data sequence with the shutdown episodes removed.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The MADM scale estimate is not completely immune to outliers, so when it is applied to the dataset without the shutdown episodes, it gives a somewhat smaller scale estimate of 13.40.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Still, both of these estimates are much more consistent with the standard deviation of the normal operation data than the uncorrected standard deviation computed from the complete dataset.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The key point of this discussion has been to provide a simple, real-data illustration of both the considerable outlier sensitivity of the “standard” location and scale estimators (i.e., the mean and standard deviation) and the existence and performance of robust alternatives to both (i.e., the median and MADM scale estimator).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As I noted at the beginning of 
this discussion, however, while robust estimators can work extremely well – as in this example – they can also fail spectacularly.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In general, robust estimators are designed to protect us against anomalous observations that occur out in the tails of the distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In the case of “light-tailed” distributions like the uniform distribution, robust measures like the MADM scale estimator can exhibit surprising behavior.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In particular, in the absence of outliers, we expect the standard deviation and the MADM scale estimate to give us roughly comparable results.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;If the standard deviation is much larger than the MADM scale estimate – as in the makeup flow rate example just discussed – this can usually be taken as an indication of either a naturally heavy-tailed data distribution (e.g., the Student’s t distribution with few degrees of freedom), or, as in this example, the presence of contaminating outliers.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In the face of light-tailed data distributions, however, the MADM scale estimate can be substantially &lt;em&gt;larger&lt;/em&gt; than the standard deviation, and we need to be aware of this fact if we use the MADM scale estimator as a measure of data spread.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-myLkWPECtX8/Te1rZplVT3I/AAAAAAAAACk/zHYzWDcF-eQ/s1600/RobustFig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" 
src="http://1.bp.blogspot.com/-myLkWPECtX8/Te1rZplVT3I/AAAAAAAAACk/zHYzWDcF-eQ/s320/RobustFig02.png" t8="true" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;More serious problems arise in the case of coarsely quantized data.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In particular, it was once popular in engineering applications to record variables like temperature only to single digit accuracy (e.g., temperature to the nearest tenth of a degree).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This practice was motivated both by limited memory available in early process monitoring computers, and by a recognition that this was roughly the limit of measurement accuracy, anyway.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This kind of data truncation does degrade traditional characterizations like the standard deviation somewhat, but the consequences for the MADM scale estimate can be catastrophic.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The plot above shows a sequence of 100 Gaussian data samples, with mean 0 and standard deviation 0.15, represented two ways.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The line represents the original Gaussian data sample, while the solid circles correspond to these data values truncated to a single digit (e.g., the first value in the original data sequence is -0.11658569, which gets truncated to -0.1).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Both the standard deviation and the MADM scale estimate are reasonably accurate when computed from the unmodified data sequence: the standard deviation is 0.137, while the MADM scale estimate is 0.133, both within about 10% of the correct value of 0.150.&lt;span 
style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;If we compute the standard deviation from the truncated data samples, the value declines to 0.103, reflecting the fact that by truncating the data from the 9 digits generated by the random number generator to 1, we are consistently reducing the magnitude of each number, so now our estimation error is more like 30%.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Here, however, the MADM scale estimate fails completely, returning the value 0.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The reason for this is that, on truncation, 55% of these data values are zero (note that any value strictly between -0.1 and 0.1 will be truncated to zero, and these data values occur quite commonly in a zero-mean, normally distributed data sequence with standard deviation 0.15).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Whenever more than 50% of the data values are equal, we don’t need to bother computing either the median or the MADM scale estimator.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In particular, the median of this data sequence is simply equal to this majority value.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Similarly, since more than 50% of the observed data values lie a distance of zero from the median, the median of the absolute deviation sequence on which the MADM scale estimate is based is necessarily zero.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Thus, the MADM scale for any such sequence is zero, as in this example; this behavior is commonly called &lt;em&gt;implosion.&lt;/em&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a 
href="http://1.bp.blogspot.com/-WcV80rRmBfU/Te1sE3IXowI/AAAAAAAAACo/u40UsUy4wXI/s1600/RobustFig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" src="http://1.bp.blogspot.com/-WcV80rRmBfU/Te1sE3IXowI/AAAAAAAAACo/u40UsUy4wXI/s320/RobustFig03.png" t8="true" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;This coarse quantization example indirectly illustrates one of the key distinctions between continuous and discrete variables: for a continuously distributed random variable, no two observations can have exactly the same value, but for discrete random variables, such ties are characteristic.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The coarse quantization example represents something in between, being a discrete approximation of a continuous variable.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The distinction is important because it is precisely these ties that were responsible for the complete failure of the MADM scale estimator in the example just described.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;There are other cases where the underlying data distribution has both continuous and discrete components, and these cases can also cause rank- and order-based data characterizations like the MADM scale estimator to behave very badly.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The plot above shows rainfall measurements made every half hour for two months (a total of 732 data points), included in the dataset&amp;nbsp; associated with Chapter 14 (“Half-hourly Precipitation and Streamflow, River Hirnant, Wales, U.K., November and December, 1972”) from the book &lt;span&gt;&lt;a 
href="http://www.amazon.com/Data-Collection-Problems-Research-Statistics/dp/0387961259?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;Data: A Collection of Problems from Many Fields for the Student and Research Worker (Springer Series in Statistics)&lt;/a&gt;&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=0387961259" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt;,&lt;/span&gt; by D.F. Andrews and A.M. Herzberg.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;It was noted above that the built-in function &lt;strong&gt;quantile&lt;/strong&gt; in &lt;em&gt;R&lt;/em&gt; generates the Tukey five-number summary for a data sequence by default, but by specifying the &lt;strong&gt;probs&lt;/strong&gt; argument, it is possible to obtain finer-grained data characterizations.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For example, the following command gives the deciles of this rainfall dataset:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt; quantile(Rain, probs=seq(0,1,0.1))&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;0%&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;10%&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;20%&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;30%&lt;span 
style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;40%&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;50%&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;60%&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;70%&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;80%&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;90%&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;100% &lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.040 0.146 0.440 2.800 &lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;It is clear from these results that more than 60% of these recorded rainfall values are identically zero, reflecting the fact that, most of the time, it is not raining.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As with the coarse quantization example discussed above, the MADM scale estimator implodes for this example, returning the value zero.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Similarly, the median value of these numbers is also zero, but the mean is not (specifically, it is 0.136), and the standard deviation is not (this value is 0.354).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Even though the mean and standard deviation give more reasonable-sounding results for the rainfall dataset than their robust counterparts do, it is not completely clear how to interpret these numbers.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In particular, we might be interested in asking any or all 
of the following three types of question:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.25in;"&gt;&lt;span style="mso-list: Ignore;"&gt;1.&lt;span style="font: 7pt &amp;quot;Times New Roman&amp;quot;;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;What is the typical rainfall for this region in a month? (or in a week, in the entire two-month period, in a day, etc.)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.25in;"&gt;&lt;span style="mso-list: Ignore;"&gt;2.&lt;span style="font: 7pt &amp;quot;Times New Roman&amp;quot;;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;How frequently does it rain?&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.25in;"&gt;&lt;span style="mso-list: Ignore;"&gt;3.&lt;span style="font: 7pt &amp;quot;Times New Roman&amp;quot;;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;When it does rain, what is the average amount?&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Characterizations like the mean and standard deviation may be used to address the first of these three questions, but not the other two.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As a preliminary answer to the second question, we can note that rainfall occurs in 34.6% of the half-hourly intervals recorded here, but this requires a slightly more detailed look at the data than what the mean and standard deviation give us.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Additional work would be required if 
we wanted an estimate of how probable it was to rain on any given day, since we would need to summarize the data at the level of days instead of half-hourly observations.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Finally, to address questions of the third type, it is necessary to segment the dataset into a “rainy part” and a “dry part.”&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Note that this segmentation is analogous to that made for the makeup flow rate dataset, but the difference is that here we are interested in characterizing the minority part of the dataset (i.e., “the outliers”) instead of the majority part, and it is the majority part that robust estimators are good at characterizing.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Special types of statistical models have been developed to deal with these situations, and I will discuss some of these in my next post.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For now, I will conclude with two key observations.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The first is that of Colin Mallows (C.L. Mallows, “Robust methods – some examples of their use,” &lt;em&gt;American Statistician,&lt;/em&gt; vol. 
33, 1979, pages 179-184), which I quote at the end of Chapter 7 of &lt;em&gt;Exploring Data:&lt;/em&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;blockquote&gt;&lt;blockquote&gt;&lt;blockquote&gt;“A simple and useful strategy is to perform one’s analysis both robustly and by standard methods and to compare the results.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;If the differences are minor, either set can be presented.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;If the differences are not, one must perforce consider why not, and the robust analysis is already at hand to guide the next steps.&amp;nbsp; &lt;/blockquote&gt;&lt;br /&gt;&lt;blockquote&gt;The importance of these considerations is enhanced when we are dealing with large amounts of data, since then examining all the data in detail is impractical, and we are forced to contemplate working with data that is, almost certainly, partially bad and with models that are almost certainly inadequate.”&lt;/blockquote&gt;&lt;/blockquote&gt;&lt;/blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The second observation is that Mallows’s advice is practical for almost any type of data analysis we want to consider, since robust methods exist to handle a very wide variety of different analysis tasks.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;An excellent introduction to many of these robust methods is given by Rand Wilcox in his book &lt;span&gt;&lt;a href="http://www.amazon.com/Introduction-Estimation-Hypothesis-Statistical-Modeling/dp/0127515429?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" 
target="_blank"&gt;Introduction to Robust Estimation and Hypothesis Testing, Second Edition (Statistical Modeling and Decision Science)&lt;/a&gt;&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=0127515429" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt;&lt;/span&gt;, which includes &lt;em&gt;R&lt;/em&gt; procedures for implementing most of the methods he discusses.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9179325420174899779-202960599817804400?l=exploringdatablog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploringdatablog.blogspot.com/feeds/202960599817804400/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://exploringdatablog.blogspot.com/2011/06/pros-and-cons-of-robust-data.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/202960599817804400'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/202960599817804400'/><link rel='alternate' type='text/html' href='http://exploringdatablog.blogspot.com/2011/06/pros-and-cons-of-robust-data.html' title='The pros and cons of robust data characterizations'/><author><name>Ron Pearson (aka TheNoodleDoodler)</name><uri>http://www.blogger.com/profile/15693640298594791682</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-mRzZxRoj6hY/Te1ouL2iwuI/AAAAAAAAACg/Auz_Ycgpmnk/s72-c/RobustFig01.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9179325420174899779.post-4573601951707069878</id><published>2011-05-21T10:51:00.000-07:00</published><updated>2011-05-21T10:51:48.816-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='interestingness measures'/><category scheme='http://www.blogger.com/atom/ns#' term='asymptotic normality'/><category scheme='http://www.blogger.com/atom/ns#' term='UCI mushroom dataset'/><category scheme='http://www.blogger.com/atom/ns#' term='categorical variables'/><title type='text'>The distribution of interestingness</title><content type='html'>&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;On April 22, David Landy posed a question about the distribution of interestingness values in response to my April 3&lt;sup&gt;rd&lt;/sup&gt; post on “Interestingness Measures.”&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;He noted that the survey paper by Hilderman and Hamilton that I cited there makes the following comment:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;“Our belief is that a useful measure of interestingness should generate index values that are reasonably distributed throughout the range of possible values (such as in a SND)”&lt;/div&gt;&lt;/blockquote&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;David's&amp;nbsp;question was whether there is a way of showing that interestingness measures should be normally distributed.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In fact, this 
question raises a number of important issues, and this post explores a few of them.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;First, as I noted in my comment in response to&amp;nbsp;this question, the short answer is that the “standard behavior” we expect from most estimators – i.e., most characterizations computed from a set of uncertain data values – is asymptotic normality.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;That is, for sufficiently large data samples, we expect most of the standard characterizations we might compute – variance estimates, correlation estimates, least squares regression coefficients,&amp;nbsp;or linear model parameters computed using robust estimators like Least Trimmed Squares (LTS) – to be approximately normally distributed.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;I discuss this idea in Chapter 6 of &lt;span&gt;&lt;a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;Exploring Data in Engineering, the Sciences, and Medicine&lt;/a&gt;&lt;/span&gt;, beginning with the Central Limit Theorem (CLT), which says, very roughly, that “averages of &lt;em&gt;N&lt;/em&gt; numbers tend to Gaussian limits as &lt;em&gt;N&lt;/em&gt; becomes infinitely large.”&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;There are exceptions – cases where asymptotic normality does not hold – but the phenomenon is common enough to make violations noteworthy.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In fact, one of the reasons the conceptually simpler least median of squares (LMS) estimator was dropped in favor of the more complex least trimmed squares (LTS) estimator is that the LMS estimator is not asymptotically normal, 
approaching a non-normal limiting distribution at an anomalously slow rate.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In contrast, the LTS estimator is asymptotically normal, with the standard convergence rate (i.e., the standard deviation of the estimator approaches zero inversely with the square root of the sample size as the sample becomes infinitely large). &lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The longer – and far less satisfying – answer to the question of how interestingness measures should be distributed is, “it depends,” as the following discussion illustrates.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;One of the many cases where asymptotic normality has been shown to hold is Gini’s mean difference, which forms the basis for one of the four interestingness measures I discussed in my earlier post.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;While this result appears to provide a reasonably specific answer to the question of how this interestingness measure is distributed, there are two closely related issues that greatly complicate the matter.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The first is a question that plagues many applications of asymptotic normality, which is quite commonly used for constructing approximate confidence intervals: how large must the sample be for this asymptotic approximation to be reasonable?&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;(The confidence intervals described in my first post on odds ratios, for example, were derived on the basis of asymptotic normality.)&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Once again, the answer is, “it depends,” as the examples discussed here demonstrate. 
&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&lt;/span&gt;The second issue is that the interestingness measures that I describe in &lt;em&gt;Exploring Data&lt;/em&gt; and that I discussed in my previous post are &lt;i style="mso-bidi-font-style: normal;"&gt;normalized&lt;/i&gt; measures.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In particular, the Gini interestingness measure is defined as Gini’s mean difference divided by its maximum possible value, giving a measure that is bounded between 0 and 1.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;An important aspect of the Gaussian distribution is that it assigns a nonzero probability density to &lt;i style="mso-bidi-font-style: normal;"&gt;any&lt;/i&gt; real value, although the probability associated with values more than a few standard deviations from the mean rapidly becomes very small.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Thus, for a bounded quantity like the Gini interestingness measure, a Gaussian approximation can only be reasonable if the standard deviation is small enough that the probability of exhibiting values less than 0 or greater than 1 is acceptably small.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Further, this “maximum reasonable standard deviation” will depend on the mean value of the estimator: if the computed interestingness measure is approximately 0.5, the maximum reasonable standard deviation for a Gaussian approximation is considerably larger than if the computed interestingness measure is 0.01 or 0.99.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As a practical matter, this usually means that, for a fixed sample size, the empirical distribution of normalized interestingness values for sequences exhibiting a given degree of heterogeneity (i.e., a specified “true” interestingness value) will vary significantly in shape (specifically, in symmetry) as this “true” interestingness value 
varies from near 0 to approximately 0.5, to near 1.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This idea is easily demonstrated in &lt;i style="mso-bidi-font-style: normal;"&gt;R&lt;/i&gt; on the basis of simulated examples.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The essential idea behind this simulation study is the following.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Suppose we have a categorical variable that can assume any one of &lt;i style="mso-bidi-font-style: normal;"&gt;L&lt;/i&gt; distinct levels, and suppose we want to generate random samples of this variable where each level &lt;i style="mso-bidi-font-style: normal;"&gt;i&lt;/i&gt; has a certain probability &lt;i style="mso-bidi-font-style: normal;"&gt;p&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt; of occurring in our sample.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The multinomial distribution discussed in Sec. 
10.2.1 of &lt;i style="mso-bidi-font-style: normal;"&gt;Exploring Data&lt;/i&gt; is appropriate here, characterizing the number of times &lt;i style="mso-bidi-font-style: normal;"&gt;n&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt; we observe each of these levels in a random sample of size &lt;i style="mso-bidi-font-style: normal;"&gt;N&lt;/i&gt;, given the probability of observing each level &lt;i style="mso-bidi-font-style: normal;"&gt;i&lt;/i&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The simulation strategy considered here then proceeds by specifying the number of levels (here, I take &lt;i style="mso-bidi-font-style: normal;"&gt;L&lt;/i&gt; = 4) and the probability of observing each level.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Then, I generate a large number &lt;i style="mso-bidi-font-style: normal;"&gt;B&lt;/i&gt; of random samples (here, I take &lt;i style="mso-bidi-font-style: normal;"&gt;B&lt;/i&gt; = 1000), each generated from the appropriate multinomial distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In what follows, I consider the following four cases:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;ol style="margin-top: 0in;" type="1"&gt;&lt;li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l0 level1 lfo1; tab-stops: list .5in;"&gt;Case A: four equally-represented levels, each with probability 0.25;&lt;/li&gt;&lt;li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l0 level1 lfo1; tab-stops: list .5in;"&gt;Case B: one dominant level (&lt;i style="mso-bidi-font-style: normal;"&gt;p&lt;sub&gt;1&lt;/sub&gt;&lt;/i&gt; = 0.97) and three rare levels (&lt;i style="mso-bidi-font-style: normal;"&gt;p&lt;sub&gt;2&lt;/sub&gt; = p&lt;sub&gt;3&lt;/sub&gt; = p&lt;sub&gt;4&lt;/sub&gt; =&lt;/i&gt; 0.01);&lt;/li&gt;&lt;li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l0 level1 lfo1; tab-stops: list .5in;"&gt;Case C: three 
equally-represented majority levels (&lt;i style="mso-bidi-font-style: normal;"&gt;p&lt;sub&gt;1&lt;/sub&gt; = p&lt;sub&gt;2&lt;/sub&gt; = p&lt;sub&gt;3&lt;/sub&gt; =&lt;/i&gt; 0.33) and one rare minority level (&lt;i style="mso-bidi-font-style: normal;"&gt;p&lt;sub&gt;4&lt;/sub&gt; =&lt;/i&gt; 0.01);&lt;/li&gt;&lt;li class="MsoNormal" style="margin: 0in 0in 0pt; mso-list: l0 level1 lfo1; tab-stops: list .5in;"&gt;Case D: four distinct probabilities (&lt;i style="mso-bidi-font-style: normal;"&gt;p&lt;sub&gt;1&lt;/sub&gt; =&lt;/i&gt; 0.05, &lt;i style="mso-bidi-font-style: normal;"&gt;p&lt;sub&gt;2&lt;/sub&gt; =&lt;/i&gt; 0.10, &lt;i style="mso-bidi-font-style: normal;"&gt;p&lt;sub&gt;3&lt;/sub&gt; =&lt;/i&gt; 0.25, &lt;i style="mso-bidi-font-style: normal;"&gt;p&lt;sub&gt;4&lt;/sub&gt; =&lt;/i&gt; 0.60).&lt;/li&gt;&lt;/ol&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;To generate these sequences, I use &lt;i style="mso-bidi-font-style: normal;"&gt;R&lt;/i&gt;’s built-in multinomial random number generator, &lt;b style="mso-bidi-font-weight: normal;"&gt;rmultinom&lt;/b&gt;, with parameters n = 1000 (this corresponds to &lt;i style="mso-bidi-font-style: normal;"&gt;B&lt;/i&gt; in the above discussion: I want to generate 1000 multinomial random vectors), size = 100 (initially, I take each vector to be of length &lt;i style="mso-bidi-font-style: normal;"&gt;N&lt;/i&gt; = 100; later, I consider larger vectors), and, for Case A, I specify prob = c(0.25, 0.25, 0.25, 0.25), corresponding to the four &lt;i style="mso-bidi-font-style: normal;"&gt;p&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt; values specified above (for the other three cases, I specify the prob parameter appropriately).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The value returned by this procedure is a matrix, with one row for each level and one column for each simulation; for each column, the number in 
the &lt;i style="mso-bidi-font-style: normal;"&gt;i&lt;sup&gt;th&lt;/sup&gt;&lt;/i&gt; row is &lt;i style="mso-bidi-font-style: normal;"&gt;n&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt;, the number of times the &lt;i style="mso-bidi-font-style: normal;"&gt;i&lt;sup&gt;th&lt;/sup&gt; &lt;/i&gt;level is observed in that simulated response.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Dividing these counts by the sample size gives the fractional representation of each level, from which the normalized interestingness measures are then computed.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For the Gini interestingness measure and Case A, these steps are accomplished using the following &lt;i style="mso-bidi-font-style: normal;"&gt;R&lt;/i&gt; code:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;set.seed(101)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;pvectA &amp;lt;- c(0.25, 0.25, 0.25, 0.25)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;CountsA &amp;lt;- rmultinom(n = 1000, size = 100, prob = pvectA)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;FractionsA &amp;lt;- CountsA/100&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;GiniA &amp;lt;- apply(FractionsA, MARGIN=2, gini.proc)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;
&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The first line of this sequence sets the seed value to initialize the random number generator: if you don’t do this, running the same procedure again will give you results that are statistically similar but not exactly the same as before; specifying the seed guarantees you get exactly the same results the next time you execute the command sequence.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The second line defines the “true” heterogeneity of each sequence (i.e., the underlying probability of observing each level that the random number generator uses to simulate 
the data).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The third line invokes the multinomial random number generator to return the desired matrix of counts, where each column represents an independent simulation and the rows correspond to the four possible levels of the response.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The fourth line converts these counts to fractions by dividing by the size of each sample, and the last line uses the &lt;em&gt;R&lt;/em&gt; function &lt;b style="mso-bidi-font-weight: normal;"&gt;apply&lt;/b&gt; to invoke the Gini interestingness procedure gini.proc for each column of the fractions matrix: setting MARGIN = 2 tells &lt;em&gt;R&lt;/em&gt; to “apply the indicated function gini.proc to the columns of FractionsA.”&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;(It would also be possible to write this procedure as a loop in &lt;em&gt;R&lt;/em&gt;, but the &lt;b style="mso-bidi-font-weight: normal;"&gt;apply&lt;/b&gt; function is more compact and typically at least as fast.)&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Replacing gini.proc by bray.proc in the above sequence of commands would generate the Bray interestingness measures, using shannon.proc would generate the Shannon interestingness measures, and using simpson.proc would generate the Simpson interestingness measures (note that all of these procedures are available from the &lt;a href="http://www.oup.com/us/ExploringData"&gt;companion website&lt;/a&gt; for &lt;i style="mso-bidi-font-style: normal;"&gt;Exploring Data&lt;/i&gt;).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;For this first example – i.e., Case A, where all four levels are equally represented – note that the “true” value for any of the four normalized interestingness 
measures I discussed in my earlier post is 0, the smallest possible value.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Since the random samples generated by the multinomial random number generator rarely have identical counts for the four levels, the interestingness measure computed from each sample is generally larger than 0, but it can never be smaller than this value.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This observation suggests two things.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The first is that the average of these interestingness values is larger than 0, implying that any of the normalized measures I considered exhibit a positive bias for this case.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The second consequence of this observation is that the distribution of any of these interestingness values is probably asymmetric.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The first of these conclusions is evident in the plot below, which shows nonparametric density estimates for the Gini measure computed from the 1000 random simulations generated for each case.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The left-most peak corresponds to Case A and it is clear from this plot that the average value is substantially larger than zero (specifically, the mean of these simulations is 0.114).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This plot also suggests asymmetry, although the visual evidence is not overwhelming.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-8tGjRhvLvMQ/Tdfj4YyvJbI/AAAAAAAAACM/k-zFFsCI1wc/s1600/GiniDistPlot01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" 
height="319" j8="true" src="http://4.bp.blogspot.com/-8tGjRhvLvMQ/Tdfj4YyvJbI/AAAAAAAAACM/k-zFFsCI1wc/s320/GiniDistPlot01.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The right-most peak in the above plot corresponds to Case B, for which the “true” Gini interestingness value is 0.96.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Here, the peak is much narrower, although again the average of the simulation values (0.971) is slightly larger than the true value, which is represented by the right-most vertical dashed line.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Case C is modeled on the CapSurf variable from the &lt;a href="http://archive.ics.uci.edu/ml/datasets/Mushroom"&gt;UCI mushroom dataset&lt;/a&gt; that I discussed in my earlier post on interestingness measures: the first three levels are equally distributed, but there is also an extremely rare fourth level.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The “true” Gini measure for this case is 0.32, corresponding to the second-from-left dashed vertical line in the above plot, and – as with the previous two cases –&amp;nbsp; the distribution of Gini values computed from the multinomial random samples is biased above this correct value.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Here, however, the shape of the distribution is more symmetric, suggesting possible conformance with the Gaussian expectations suggested above.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Finally, Case D is the third peak from the left in the above plot, and here the distribution appears reasonably symmetric around the correct value of 0.6, again indicated by a vertical dashed line.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 
0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-3tGyhMOXVRc/TdfkcwgqMeI/AAAAAAAAACQ/raBXvdI3uEM/s1600/ShannonDistPlot01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" j8="true" src="http://1.bp.blogspot.com/-3tGyhMOXVRc/TdfkcwgqMeI/AAAAAAAAACQ/raBXvdI3uEM/s320/ShannonDistPlot01.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The above plot shows the corresponding results for the Shannon interestingness measure for the same four cases.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Here, the distributions for the different cases exhibit a much wider range of variation than they did for the Gini measure.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As before, Case D (the smoothest of the four peaks) appears reasonably symmetric in its&amp;nbsp;distribution around the “true” value of 0.255, possibly consistent with a normal distribution, but this is clearly not true for any of the other three cases.&amp;nbsp;&amp;nbsp;The asymmetry seen for Case A (the left-most peak) appears fairly pronounced, the density for Case C (the second-from-left peak) appears to be multimodal,&amp;nbsp;and the right-most peak (Case B) exhibits both asymmetry and multimodality.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The key point here is that the observed distribution of interestingness values depends on both the specific simulation case considered (i.e., the true probability distribution of the levels) and the particular interestingness measure chosen.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; 
&lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-LZ_DKumTOZE/TdflRqpa_XI/AAAAAAAAACU/HlynUqOIKO8/s1600/CaseDdensityPlot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" j8="true" src="http://1.bp.blogspot.com/-LZ_DKumTOZE/TdflRqpa_XI/AAAAAAAAACU/HlynUqOIKO8/s320/CaseDdensityPlot.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;As noted above, if we expect approximate normality for a data characterization on the basis of its asymptotic behavior, an important practical question is whether we are working with large enough samples for asymptotic approximations to be at all reasonable.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Here, the sample size is &lt;i style="mso-bidi-font-style: normal;"&gt;N&lt;/i&gt; = 100: we are generating and characterizing a lot more sequences than this, but each characterization is based on 100 data observations.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In the UCI mushroom example I discussed in my previous post, the sample size was much larger: that dataset characterizes 8,124 different mushrooms, so the interestingness measures considered there were based on a sample size of &lt;i style="mso-bidi-font-style: normal;"&gt;N&lt;/i&gt; = 8124.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In fact, this can make an important difference, as the following example demonstrates.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The above plot compares the Gini measure density estimates obtained from 1000 random samples, each of length &lt;i 
style="mso-bidi-font-style: normal;"&gt;N&lt;/i&gt; = 100 (the dashed curve), with those obtained from 1000 random samples, each of length &lt;i style="mso-bidi-font-style: normal;"&gt;N&lt;/i&gt; = 8124 (the solid curve), for Case D.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The vertical dotted line corresponds to the true Gini value of 0.6 for this case.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Note that both of these densities appear to be symmetrically distributed around this correct value, but the distribution of values is much narrower when they are computed from the larger samples.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;These results are consistent with our expectations of asymptotic normality for the Gini measure, at least for cases that are not too near either extreme.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Further confirmation of these general expectations for a “well-behaved interestingness measure” is provided by the four Q-Q plots shown below, generated using the &lt;b style="mso-bidi-font-weight: normal;"&gt;qqPlot&lt;/b&gt; command in the &lt;b style="mso-bidi-font-weight: normal;"&gt;car&lt;/b&gt; package in &lt;i style="mso-bidi-font-style: normal;"&gt;R&lt;/i&gt; that I discussed in a previous post.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;These plots compare the Shannon measures computed from Cases C (upper two plots) and D (lower two plots), for the smaller sample size &lt;i style="mso-bidi-font-style: normal;"&gt;N&lt;/i&gt; = 100 (left-hand plots) and the larger sample size &lt;i style="mso-bidi-font-style: normal;"&gt;N&lt;/i&gt; = 8124 (right-hand plots).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In particular, in the upper left Q-Q plot (Case C, &lt;i style="mso-bidi-font-style: 
normal;"&gt;N&lt;/i&gt; = 100), compelling evidence of non-normality is given by both the large number of points falling outside the 95% confidence intervals for this Q-Q plot, and the “kinks” in the plot that reflect the multimodality seen in the second peak in the second plot above.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In contrast, all of the points in the upper right Q-Q plot (Case C, &lt;i style="mso-bidi-font-style: normal;"&gt;N&lt;/i&gt; = 8124) fall within the 95% confidence limits for this plot, suggesting that the approximate normality assumption is reasonable for the larger sample. &lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&lt;/span&gt;Qualitatively similar behavior is seen for Case D in the two lower plots: the results computed for &lt;i style="mso-bidi-font-style: normal;"&gt;N&lt;/i&gt; = 100 exhibit some evidence for asymmetry, with both the lower and upper tails lying above the upper 95% confidence limit for the Q-Q plot, while the results computed for &lt;i style="mso-bidi-font-style: normal;"&gt;N&lt;/i&gt; = 8124 show no such deviations.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-UZYnxVpRNBA/Tdfl1Xq8OWI/AAAAAAAAACY/EY1k3poTDB0/s1600/ShannonquadPlot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" j8="true" src="http://3.bp.blogspot.com/-UZYnxVpRNBA/Tdfl1Xq8OWI/AAAAAAAAACY/EY1k3poTDB0/s320/ShannonquadPlot.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;It is important to emphasize, however, that approximate normality is not always to be expected.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; 
&lt;/span&gt;The plot below shows Q-Q plots for the Gini measure (upper plots) and the Shannon measure (lower plots) for Case A, computed from both the smaller samples (&lt;i style="mso-bidi-font-style: normal;"&gt;N&lt;/i&gt; = 100, left-hand plots) and the larger samples (&lt;i style="mso-bidi-font-style: normal;"&gt;N&lt;/i&gt; = 8124, right-hand plots).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Here, all of these plots provide strong evidence for distributional asymmetry, consistent with the point made above that, while the computed interestingness value can exceed the true value of 0 for this case, it can never fall below this value.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Thus, we expect both a positive bias and a skewed distribution of values, even for large sample sizes, and we see both of these features here.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-WwhSQd4sW1o/TdfmBJGnzrI/AAAAAAAAACc/4VrJ6Ttgk3s/s1600/CaseAquadPlot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" j8="true" src="http://2.bp.blogspot.com/-WwhSQd4sW1o/TdfmBJGnzrI/AAAAAAAAACc/4VrJ6Ttgk3s/s320/CaseAquadPlot.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;To conclude, then, I reiterate what I said at the beginning: the answer to the question of what distribution we should expect for a “good” interestingness measure is, “it depends.”&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;But now I can be more specific.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The answer depends, first, on the sample size 
from which we compute the interestingness measure: the larger the sample, the more likely this distribution is to be approximately normal.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This is, essentially, a restatement of the definition of asymptotic normality, but it raises the key practical question&amp;nbsp;of how large &lt;i style="mso-bidi-font-style: normal;"&gt;N&lt;/i&gt; must be for this approximation to be reasonable.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;That answer also depends on at least two things: first, the specific interestingness measure we consider, and second, the true heterogeneity of the data sample from which we compute it.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Both of these points were illustrated in the first two plots shown above: the shape of the estimated densities varied significantly both with the case considered (i.e., the “true” heterogeneity) and with the measure considered (Gini or Shannon).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Analogous comments apply to the other interestingness measures I discussed in my previous post.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9179325420174899779-4573601951707069878?l=exploringdatablog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploringdatablog.blogspot.com/feeds/4573601951707069878/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://exploringdatablog.blogspot.com/2011/05/distribution-of-interestingness.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/4573601951707069878'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/9179325420174899779/posts/default/4573601951707069878'/><link rel='alternate' type='text/html' href='http://exploringdatablog.blogspot.com/2011/05/distribution-of-interestingness.html' title='The distribution of interestingness'/><author><name>Ron Pearson (aka TheNoodleDoodler)</name><uri>http://www.blogger.com/profile/15693640298594791682</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-8tGjRhvLvMQ/Tdfj4YyvJbI/AAAAAAAAACM/k-zFFsCI1wc/s72-c/GiniDistPlot01.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9179325420174899779.post-1769950154357189006</id><published>2011-05-07T08:21:00.000-07:00</published><updated>2011-05-07T08:21:12.546-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='R procedures'/><category scheme='http://www.blogger.com/atom/ns#' term='contingency tables'/><category scheme='http://www.blogger.com/atom/ns#' term='odds ratios'/><title type='text'>Computing Odds Ratios in R</title><content type='html'>&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;In my last post, I discussed the use of odds ratios&amp;nbsp;to characterize the association between edibility and binary mushroom characteristics for the mushrooms characterized in the &lt;a href="http://archive.ics.uci.edu/ml/datasets/Mushroom"&gt;UCI mushroom dataset&lt;/a&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;I did not, however,&amp;nbsp;describe those computations in detail, and the purpose of this post is to give a brief discussion of how they were done.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;That said, I should emphasize that the primary focus of this blog is not &lt;em&gt;R&lt;/em&gt; programming per se, but 
rather the almost limitless number of things that &lt;em&gt;R&lt;/em&gt; allows you to do in the realm of exploratory data analysis.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For more detailed discussions of the mechanics of &lt;em&gt;R&lt;/em&gt; programming, two excellent resources are &lt;span&gt;&lt;a href="http://www.amazon.com/R-Book-Michael-J-Crawley/dp/0470510242?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;The R Book&lt;/a&gt;&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=0470510242" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt;&lt;/span&gt; by Michael J. Crawley, and &lt;span&gt;&lt;a href="http://www.amazon.com/Modern-Applied-Statistics-Computing/dp/1441930086?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;Modern Applied Statistics with S (Statistics and Computing)&lt;/a&gt;&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=1441930086" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt;&amp;nbsp;by W.N. Venables and B.D. 
Ripley&lt;/span&gt;.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;As a reminder, the odds ratio, which I discuss in Chapter 13 of &lt;span&gt;&lt;a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;Exploring Data in Engineering, the Sciences, and Medicine&lt;/a&gt;&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=0195089650" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt;&lt;/span&gt;, may be viewed as an association measure between binary variables.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As I discussed last time, for two binary variables, &lt;em&gt;x&lt;/em&gt; and &lt;em&gt;y&lt;/em&gt;, each taking the values 0 and 1, the odds ratio is defined on the basis of the following four numbers:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;N&lt;sub&gt;00&lt;/sub&gt; = the number of data records with &lt;em&gt;x&lt;/em&gt; = 0 and &lt;em&gt;y&lt;/em&gt; = 0&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 
&lt;/span&gt;N&lt;sub&gt;01&lt;/sub&gt; = the number of data records with &lt;em&gt;x&lt;/em&gt; = 0 and &lt;em&gt;y&lt;/em&gt; = 1&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;N&lt;sub&gt;10&lt;/sub&gt; = the number of data records with &lt;em&gt;x&lt;/em&gt; = 1 and &lt;em&gt;y&lt;/em&gt; = 0&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;N&lt;sub&gt;11&lt;/sub&gt; = the number of data records with &lt;em&gt;x&lt;/em&gt; = 1 and &lt;em&gt;y&lt;/em&gt; = 1&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Specifically, the odds ratio is given by the following expression: &lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;"&gt;OR = (N&lt;sub&gt;00&lt;/sub&gt; N&lt;sub&gt;11&lt;/sub&gt;) / (N&lt;sub&gt;01&lt;/sub&gt; N&lt;sub&gt;10&lt;/sub&gt;)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Confidence intervals for the odds ratio are easily constructed by appealing to the asymptotic normality of log OR, whose limiting standard deviation is given by the square root of the sum of the reciprocals of these four numbers.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The &lt;em&gt;R&lt;/em&gt; procedure &lt;strong&gt;oddsratioWald.proc&lt;/strong&gt; available from the &lt;a href="http://www.oup.com/us/ExploringData"&gt;companion website&lt;/a&gt; for &lt;em&gt;Exploring 
Data&lt;/em&gt; computes the odds ratio and the upper and lower confidence limits at a specified level alpha from these four values:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;oddsratioWald.proc &amp;lt;- function(n00, n01, n10, n11, alpha = 0.05){&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Compute the odds ratio between two binary variables, x and y,&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;as defined by the four numbers nij:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;n00 = number of cases where x = 0 and y = 0&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;n01 = number of cases where x = 0 and y = 1&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;n10 = number of cases where x = 1 and y = 0&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: 
yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;n11 = number of cases where x = 1 and y = 1&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;OR &amp;lt;- (n00 * n11)/(n01 * n10)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&lt;/span&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&lt;/span&gt;#&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Compute the Wald confidence intervals:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;siglog &amp;lt;- sqrt((1/n00) + (1/n01) + (1/n10) + (1/n11))&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;zalph &amp;lt;- qnorm(1 - alpha/2)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;logOR &amp;lt;- log(OR)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;loglo &amp;lt;- logOR - zalph * siglog&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;loghi &amp;lt;- logOR + zalph * siglog&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: 
yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;ORlo &amp;lt;- exp(loglo)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;ORhi &amp;lt;- exp(loghi)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;oframe &amp;lt;- data.frame(LowerCI = ORlo, OR = OR, UpperCI = ORhi, alpha = alpha)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;oframe&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;}&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Including “alpha = 0.05” in the parameter list fixes the default value for alpha at 0.05, which yields the 95% confidence intervals for the computed odds ratio, based on the Wald approximation described above.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;An important practical point is that these intervals become infinitely wide if any of the four numbers N&lt;sub&gt;ij&lt;/sub&gt; are equal to zero; also, note that in this case, the computed odds ratio is either zero or infinite.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Finally, it is worth noting that if the numbers N&lt;sub&gt;ij&lt;/sub&gt; are large enough, the procedure just described can encounter numerical overflow problems (i.e., the products in either the numerator or the denominator become too large to be represented 
in machine arithmetic).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;If this is a possibility, a better alternative is to regroup the computations as follows:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;"&gt;OR = (N&lt;sub&gt;00&lt;/sub&gt; / N&lt;sub&gt;01&lt;/sub&gt;) &amp;times; (N&lt;sub&gt;11&lt;/sub&gt; / N&lt;sub&gt;10&lt;/sub&gt;)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;To use the routine just described, it is necessary to have the four numbers defined above, which form the basis for a two-by-two contingency table.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Because contingency tables are widely used in characterizing categorical data, these numbers are easily computed in &lt;em&gt;R&lt;/em&gt; using the &lt;strong&gt;table&lt;/strong&gt; command.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As a simple example, the following code reads the UCI mushroom dataset and generates the two-by-two contingency table for the EorP and GillSize attributes: &lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt; mushrooms &amp;lt;- read.csv("mushroom.csv")&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt; &lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt; table(mushrooms$EorP, mushrooms$GillSize)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: 
yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;b&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;n&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;e 3920&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;288&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;p 1692 2224&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt; &lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;(Note that the first line reads the csv file containing the mushroom data; for this command to work as shown, it is necessary for this file to be in the working directory.&amp;nbsp; Alternatively, you can change the working directory using the &lt;strong&gt;setwd&lt;/strong&gt; command.)&amp;nbsp; &lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;To facilitate the computation of odds ratios, the following preliminary procedure combines the &lt;strong&gt;table&lt;/strong&gt; command with the &lt;strong&gt;oddsratioWald.proc&lt;/strong&gt; procedure, allowing you to compute the odds ratio and its level-alpha confidence interval from the two-level variables directly:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;TableOR.proc00 &amp;lt;- function(x,y,alpha=0.05){&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: 
yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;xtab &amp;lt;- table(x,y)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;n00 &amp;lt;- xtab[1,1]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;n01 &amp;lt;- xtab[1,2]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;n10 &amp;lt;- xtab[2,1]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;n11 &amp;lt;- xtab[2,2]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;oddsratioWald.proc(n00,n01,n10,n11,alpha)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;}&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The primary disadvantage of this procedure is that it doesn’t tell you which levels of the two variables are being characterized by the computed odds ratio.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In fact, this characterization describes the first level of each of these variables, and the following slight modification makes this fact explicit:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;TableOR.proc &amp;lt;- function(x,y,alpha=0.05){&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; 
&lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;xtab &amp;lt;- table(x,y)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;n00 &amp;lt;- xtab[1,1]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;n01 &amp;lt;- xtab[1,2]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;n10 &amp;lt;- xtab[2,1]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;n11 &amp;lt;- xtab[2,2]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;outList &amp;lt;- vector("list",2)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;outList[[1]] &amp;lt;- paste("Odds ratio between the level [",dimnames(xtab)[[1]][1],"] of the first variable and the level [",dimnames(xtab)[[2]][1],"] of the second variable:",sep=" ")&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;outList[[2]] &amp;lt;- oddsratioWald.proc(n00,n01,n10,n11,alpha)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;outList&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: 
yes;"&gt;&amp;nbsp;&lt;/span&gt;}&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Specifically, I have used the fact that the dimension names of the 2x2 table &lt;em&gt;xtab&lt;/em&gt; correspond to the levels of the variables &lt;em&gt;x&lt;/em&gt; and &lt;em&gt;y&lt;/em&gt;, and I have used the &lt;strong&gt;paste&lt;/strong&gt; command to include these values in a text string displayed to the user.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;(I have enclosed the levels in square brackets to make them stand out from the surrounding text, particularly useful here since the levels are coded as single letters.)&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Applying this procedure to the mushroom characteristics EorP and GillSize yields the following results:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt; TableOR.proc(mushrooms$EorP, mushrooms$GillSize)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;[[1]]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;[1] "Odds ratio between the level [ e ] of the first variable and the level [ b ] of the second variable:"&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;[[2]]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;LowerCI&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;OR&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;UpperCI alpha&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;1 15.62615 17.89073 20.48349&lt;span style="mso-spacerun: 
yes;"&gt;&amp;nbsp; &lt;/span&gt;0.05&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Almost certainly, the formatting I have used here could be improved – probably a lot – but the key point is to provide a result that is reasonably complete and easy to interpret.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Finally, I noted in my last post that if we are interested in using odds ratios to compare or rank associations, it is useful to code the levels so that the computed odds ratio is larger than 1.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In particular, note that applying the above procedure to characterize the relationship between edibility and the Bruises characteristic yields:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt; TableOR.proc(mushrooms$EorP, mushrooms$Bruises)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;[[1]]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;[1] "Odds ratio between the level [ e ] of the first variable and the level [ f ] of the second variable:"&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;[[2]]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;LowerCI&lt;span style="mso-spacerun: 
yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;OR&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;UpperCI alpha&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;1 0.09014769 0.1002854 0.1115632&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;0.05&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;It is clear from these results that both Bruises and GillSize exhibit odds ratios with respect to mushroom edibility that are significantly different from the neutral value 1 (i.e., the 95% confidence interval excludes the value 1 in both cases), but it is not obvious which variable has the stronger association, based on the available data.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The following procedure automatically restructures the computation so that the computed odds ratio is larger than or equal to 1, allowing us to make this comparison:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;AutomaticOR.proc &amp;lt;- function(x,y,alpha=0.05){&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;xtab &amp;lt;- table(x,y)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;n00 &amp;lt;- 
xtab[1,1]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;n01 &amp;lt;- xtab[1,2]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;n10 &amp;lt;- xtab[2,1]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;n11 &amp;lt;- xtab[2,2]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;#&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;rawOR &amp;lt;- (n00*n11)/(n01*n10)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;if (rawOR &amp;lt; 1){&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;n01 &amp;lt;- xtab[1,1]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;n00 &amp;lt;- xtab[1,2]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;n11 &amp;lt;- xtab[2,1]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;n10 &amp;lt;- xtab[2,2]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;iLevel &amp;lt;- 2&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span 
style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;}&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;else{&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;iLevel &amp;lt;- 1&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;} &lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;outList &amp;lt;- vector("list",2)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;outList[[1]] &amp;lt;- paste("Odds ratio between the level [",dimnames(xtab)[[1]][1],"] of the first variable and the level [",dimnames(xtab)[[2]][iLevel],"] of the second variable:",sep=" ")&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;outList[[2]] &amp;lt;- oddsratioWald.proc(n00,n01,n10,n11,alpha)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&lt;/span&gt;outList&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&lt;/span&gt;}&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Note that this procedure first constructs the 2x2 table on which everything is based and then computes the odds ratio in the default coding: if this value is smaller than 1, the coding of the second variable (&lt;em&gt;y&lt;/em&gt;) is reversed.&lt;span 
style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The odds ratio and its confidence interval are then computed and the levels of the variables used in computing it are presented as before.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Applying this procedure to the Bruises characteristic yields the following result, from which we can see that GillSize appears to have the stronger association, as noted last time:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt; AutomaticOR.proc(mushrooms$EorP, mushrooms$Bruises)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;[[1]]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;[1] "Odds ratio between the level [ e ] of the first variable and the level [ t ] of the second variable:"&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;[[2]]&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;LowerCI&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;OR&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;UpperCI alpha&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;1 8.963532 9.971541 11.09291&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;0.05&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&amp;gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Well, that’s it for now.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Next time, I will come back to the 
question of how we should expect interestingness measures to be distributed and whether those expectations are met in practice.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9179325420174899779-1769950154357189006?l=exploringdatablog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploringdatablog.blogspot.com/feeds/1769950154357189006/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://exploringdatablog.blogspot.com/2011/05/computing-odds-ratios-in-r.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/1769950154357189006'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/1769950154357189006'/><link rel='alternate' type='text/html' href='http://exploringdatablog.blogspot.com/2011/05/computing-odds-ratios-in-r.html' title='Computing Odds Ratios in R'/><author><name>Ron Pearson (aka TheNoodleDoodler)</name><uri>http://www.blogger.com/profile/15693640298594791682</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9179325420174899779.post-6696430712139624523</id><published>2011-04-23T11:13:00.000-07:00</published><updated>2011-04-23T11:13:10.195-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='reciprocal transformations'/><category scheme='http://www.blogger.com/atom/ns#' term='binary associations'/><category scheme='http://www.blogger.com/atom/ns#' term='UCI mushroom dataset'/><category scheme='http://www.blogger.com/atom/ns#' term='Exploring Data'/><category scheme='http://www.blogger.com/atom/ns#' 
term='odds ratios'/><title type='text'>Measuring association using odds ratios</title><content type='html'>&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;In my last two posts, I have&amp;nbsp;used the &lt;a href="http://archive.ics.uci.edu/ml/datasets/Mushroom"&gt;UCI mushroom dataset&lt;/a&gt;&amp;nbsp;to illustrate two things.&amp;nbsp; The first was the use of interestingness measures to characterize categorical variables, and the second was&amp;nbsp;the use of binary confidence intervals to&amp;nbsp;visualize the relationship between a categorical predictor variable and a binary response variable.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This second&amp;nbsp;approach can be&amp;nbsp;applied to&amp;nbsp;categorical predictors having any number of levels, but in the case of a binary (i.e., two-level) predictor, an attractive alternative is to measure their association&amp;nbsp;with odds ratios.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The objective of this post is to illustrate this&amp;nbsp;idea and highlight a few important details.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-BidOUAUunmg/TbL3aX7xxlI/AAAAAAAAACI/log4CBAAoKc/s1600/OddsRatioPlot01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" i8="true" src="http://1.bp.blogspot.com/-BidOUAUunmg/TbL3aX7xxlI/AAAAAAAAACI/log4CBAAoKc/s320/OddsRatioPlot01.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The above plots show the binomial confidence intervals discussed last time for four different binary mushroom characteristics: GillSize (upper 
left), GillAtt (upper right), Bruises (lower left), and StalkShape (lower right).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Specifically, these plots show the estimated probability that mushrooms with each of the two possible values&amp;nbsp;for these variables are edible.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Thus, the upper left plot shows that mushrooms with GillSize characteristic “b” (“broad”) are much more likely to be edible than mushrooms with GillSize characteristic “n” (“narrow”).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The other three plots have analogous interpretations: mushrooms with GillAtt value “a” (“attached”) are more likely to be edible than those with value “f” (“free”), mushrooms with bruises (Bruises value “t”) are more likely to be edible than those without (Bruises value “f”), and mushrooms with StalkShape value “t” (“tapering”) are slightly more likely to be edible than those with value “e” (“enlarging”).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;Also, while the smaller slopes for GillAtt and StalkShape suggest this association is weaker for these variables than for GillSize, where the slope appears much larger, it would be nice to have a quantitative measure of this degree of association that we could compare directly.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This is particularly the case for GillSize and Bruises, where both associations appear to be reasonably strong, but since the reference lines run in opposite directions on the plots, it is difficult to reliably compare the slopes on the basis of appearance alone.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The &lt;i style="mso-bidi-font-style: normal;"&gt;odds ratio&lt;/i&gt; provides a simple quantitative association measure for these variables that allows us to 
make these comparisons directly.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;I discuss the odds ratio in Chapter 13 of&amp;nbsp;&amp;nbsp;&lt;span&gt;&lt;a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;Exploring Data in Engineering, the Sciences, and Medicine&lt;/a&gt;&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=0195089650" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt;&amp;nbsp;&lt;/span&gt;in&amp;nbsp;connection with&amp;nbsp;the practical implications of data type (e.g., numerical versus categorical data).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The odds ratio may be viewed as an association measure between binary variables, and it is defined as follows.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For simplicity, suppose &lt;em&gt;x&lt;/em&gt; and &lt;em&gt;y&lt;/em&gt; are two binary variables of interest and assume that they are coded so that they each take the values 0 or 1 – this assumption is easily relaxed, as discussed below, but it simplifies the basic description of the odds ratio.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;Next,&lt;/span&gt; define the following four numbers:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 
&lt;/span&gt;&lt;em&gt;N&lt;sub&gt;00&lt;/sub&gt;&lt;/em&gt; = the number of data records with &lt;em&gt;x&lt;/em&gt; = 0 and &lt;em&gt;y&lt;/em&gt; = 0&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;em&gt;N&lt;sub&gt;01&lt;/sub&gt;&lt;/em&gt; = the number of data records with &lt;em&gt;x&lt;/em&gt; = 0 and &lt;em&gt;y&lt;/em&gt; = 1&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;em&gt;N&lt;sub&gt;10&lt;/sub&gt;&lt;/em&gt; = the number of data records with &lt;em&gt;x&lt;/em&gt; = 1 and &lt;em&gt;y&lt;/em&gt; = 0&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;em&gt;N&lt;sub&gt;11&lt;/sub&gt;&lt;/em&gt; = the number of data records with &lt;em&gt;x&lt;/em&gt; = 1 and &lt;em&gt;y&lt;/em&gt; = 1&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The odds ratio is defined in terms of these four numbers as &lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt; text-indent: 0.5in;"&gt;&lt;em&gt;OR = N&lt;sub&gt;00&lt;/sub&gt; N&lt;sub&gt;11&lt;/sub&gt; / N&lt;sub&gt;01&lt;/sub&gt; N&lt;sub&gt;10&lt;/sub&gt;&lt;/em&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Since all of the four numbers appearing in this ratio are 
nonnegative, it follows that the odds ratio is also nonnegative and can assume any value between 0 and positive infinity.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Further, if &lt;em&gt;x&lt;/em&gt; and &lt;em&gt;y&lt;/em&gt; are two statistically independent binary random variables, it can be shown that the odds ratio is equal to 1.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Values greater than 1 imply that records with &lt;em&gt;y&lt;/em&gt; = 1 are more likely to&amp;nbsp;have &lt;em&gt;x&lt;/em&gt; = 1 than &lt;em&gt;x&lt;/em&gt; = 0, and similarly, that records with &lt;em&gt;y&lt;/em&gt; = 0 are more likely to&amp;nbsp;have &lt;em&gt;x&lt;/em&gt; = 0 than &lt;em&gt;x = 1&lt;/em&gt;; in other words, &lt;em&gt;OR &amp;gt; 1&lt;/em&gt; implies that the variables &lt;em&gt;x&lt;/em&gt; and &lt;em&gt;y&lt;/em&gt; are more likely to agree than they are to disagree.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Conversely, odds ratio values less than 1 imply that the variables &lt;em&gt;x&lt;/em&gt; and &lt;em&gt;y&lt;/em&gt; are more likely to disagree: records with &lt;em&gt;y&lt;/em&gt; = 1 are more likely to have &lt;em&gt;x&lt;/em&gt; = 0 than &lt;em&gt;x&lt;/em&gt; = 1, and those with &lt;em&gt;y&lt;/em&gt; = 0 are more likely to have &lt;em&gt;x&lt;/em&gt; = 1 than &lt;em&gt;x&lt;/em&gt; = 0.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Often – as in the mushroom dataset – the binary variables are not coded as 0 or 1, but instead as two different categorical values.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As a specific example, the binary response variable considered last time – the edibility variable EorP – assumes the values “e” (for “edible”) or “p” (for “poisonous” or “non-edible”).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In the results presented here, 
we recode EorP to have the values 1 for edible mushrooms and 0 for non-edible mushrooms.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For the mushroom characteristic GillSize shown in the upper left plot above, suppose we initially code the value “b” (“broad”) as 0 and the value “n” (“narrow”) as 1.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This choice is arbitrary – we could equally well code “b” as 1 and “n” as 0 – and its practical consequences are explored further below.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For the coding just described, the odds ratio between mushroom edibility (EorP) and gill size (GillSize) is 0.056.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Since this number is substantially smaller than 1, it suggests that edible mushrooms (&lt;em&gt;y&lt;/em&gt; = 1) are unlikely to be associated with narrow gills (&lt;em&gt;x&lt;/em&gt; = 1), a result that is consistent with the appearance of the upper left plot above.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;An important practical issue in interpreting odds ratios is that of how much smaller or larger than 1 the computed odds ratio should be to be regarded as evidence for a “significant” association between the variables &lt;em&gt;x&lt;/em&gt; and &lt;em&gt;y&lt;/em&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;That is, since we are computing this ratio from uncertain data, we need a measure of precision for the odds ratio, like the binomial confidence intervals discussed in&amp;nbsp;my last post: e.g.,&amp;nbsp;how much does the odds ratio change&amp;nbsp;if&amp;nbsp;some mushrooms previously declared edible are reclassified as poisonous, or if&amp;nbsp;some additional&amp;nbsp;mushrooms are added to our dataset?&lt;span 
style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Fortunately, confidence intervals for the odds ratio are easily constructed.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In his book &lt;span&gt;&lt;a href="http://www.amazon.com/Categorical-Analysis-Wiley-Probability-Statistics/dp/0471360937?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;Categorical Data Analysis (Wiley Series in Probability and Statistics)&lt;/a&gt;&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=0471360937" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt;&lt;/span&gt;, Alan Agresti notes that confidence intervals for the odds ratio can be computed directly by appealing to the fact that the odds ratio estimator is asymptotically normal, approaching a Gaussian distribution in the limit of large sample sizes.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;He does not give explicit results for these direct confidence intervals, however, because he does not recommend them.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Instead, Agresti advocates the construction of confidence intervals for the &lt;em&gt;log&lt;/em&gt; of the odds ratio and transforming them back to get upper and lower confidence limits for the odds ratio itself.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This recommendation rests primarily on three practical points: first, that the log of the odds ratio approaches normality faster than the odds ratio itself does, so this approach yields more accurate confidence intervals; second, this approach guarantees a 
positive lower confidence limit for the odds ratio, which is not the case for the direct approach; and, third, the same result can be used to compute confidence intervals for both the odds ratio and its reciprocal, a property that again does not hold for the direct approach and that will be useful in the discussion presented below.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;For the gill size example, Agresti’s recommended procedure yields a 95% confidence interval between 0.049 and 0.064.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Since this interval does not include the value 1, we conclude that there is evidence to support an association between a mushroom’s gill size and its edibility, &lt;b style="mso-bidi-font-weight: normal;"&gt;at least for mushrooms in the UCI dataset&lt;/b&gt;. Applying this procedure to the GillAtt characteristic shown in the upper right plot above yields an estimated odds ratio of 0.097 with a 95% confidence interval between 0.059 and 0.157.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Again, the fact that this interval does not include 1 supports the idea that the GillAtt characteristic is associated with edibility (again, for the mushrooms considered here), but the fact that this odds ratio is larger (i.e., closer to the neutral value 1) also suggests that this association is weaker than that between edibility and the GillSize characteristic.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Again, this result is in agreement with the visual appearance of the upper right plot above, relative to that of the upper left plot.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The advantage of the odds ratio over these plots is that&amp;nbsp;it provides a quantitative measure that can be used to make more objective 
comparisons, removing the subjective visual judgment required in comparing plots.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Applying this procedure to the Bruises variable yields an odds ratio of 9.972, with a 95% confidence interval from 8.963 to 11.093.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The fact that these values are larger than 1 implies that mushrooms whose bruise characteristics have been coded as 1 (here, “t” for “true” or “bruised”) are more likely to be edible than those whose characteristics have been coded as 0 (here, “f” for “false” or “not bruised”).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As noted above, this coding is arbitrary, as were the earlier assignments.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;An extremely useful observation is that if we reverse this assignment – i.e., for this example, if we code “bruised” as 0 and “not bruised” as 1 – we simply exchange the numbers &lt;em&gt;N&lt;sub&gt;00&lt;/sub&gt;&lt;/em&gt; with &lt;em&gt;N&lt;sub&gt;10&lt;/sub&gt;&lt;/em&gt; and&amp;nbsp;also the numbers &lt;em&gt;N&lt;sub&gt;11&lt;/sub&gt;&lt;/em&gt; with &lt;em&gt;N&lt;sub&gt;01&lt;/sub&gt;&lt;/em&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The effect of these exchanges on the odds ratio is a reciprocal transformation:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;em&gt;OR = N&lt;sub&gt;00&lt;/sub&gt;N&lt;sub&gt;11&lt;/sub&gt;/N&lt;sub&gt;01&lt;/sub&gt;N&lt;sub&gt;10&lt;/sub&gt; -&amp;gt; 
N&lt;sub&gt;10&lt;/sub&gt;N&lt;sub&gt;01&lt;/sub&gt;/N&lt;sub&gt;11&lt;/sub&gt;N&lt;sub&gt;00&lt;/sub&gt; = 1/OR&lt;/em&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;This observation provides a simple basis for comparing results like those for GillSize where the odds ratio is less than 1 with those for Bruises where the odds ratio is greater than 1.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As with the visual comparisons discussed above, it is not obvious from the odds ratios computed so far which of these variables is more strongly associated with mushroom edibility.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Reversing the coding for GillSize so that “b” is coded as 1 and “n” is coded as 0 changes the odds ratio from 0.056 to 1/0.056 = 17.857.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Since this number is larger than the odds ratio of 9.972 for Bruises, we can conclude that GillSize is more strongly associated with edibility – i.e., it is a better predictor of edibility for the mushrooms considered here – than Bruises.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;In fact, the same trick can be applied to the confidence intervals, illustrating the third advantage noted above for Agresti’s preferred approach to constructing these intervals.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Specifically, the asymptotically normal approximation says that the log of the odds ratio has a mean of &lt;em&gt;log OR&lt;/em&gt; and a standard deviation &lt;em&gt;S&lt;/em&gt; that can be simply computed from the four numbers &lt;em&gt;N&lt;sub&gt;00&lt;/sub&gt;&lt;/em&gt;&lt;em&gt;, 
N&lt;sub&gt;01&lt;/sub&gt;, N&lt;sub&gt;10&lt;/sub&gt;&lt;/em&gt;, and &lt;em&gt;N&lt;sub&gt;11&lt;/sub&gt;&lt;/em&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Since the Gaussian distribution is symmetric about its mean and &lt;em&gt;log(1/OR) = - log(OR),&lt;/em&gt; it follows that the&amp;nbsp;log of the reciprocal odds ratio has the same approximate standard deviation &lt;em&gt;S&lt;/em&gt; as &lt;em&gt;log(OR).&lt;/em&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In practical terms, this means that if we reverse the coding of our binary predictor variables, it is a simple matter to compute new confidence intervals as follows:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;New lower CI = 1/Old upper CI&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;New odds ratio = 1/Old odds ratio&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;New upper CI = 1/Old lower CI&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;(Note here that because the reciprocal transformation is order-reversing, the transformation of the lower confidence limit yields the new upper confidence limit, and vice-versa; for a more detailed discussion of order-preserving and order-reversing transformations in general and the reciprocal 
transformation in particular, refer to Chapter 12 of &lt;em&gt;Exploring Data&lt;/em&gt;.)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Applying these transformation results to the odds ratio for Bruises yields a new odds ratio of 0.100, with a 95% confidence interval from 0.090 to 0.112, which we can now compare with the earlier results for GillSize and GillAtt.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Alternatively, if we reverse the coding of the results for GillSize and GillAtt, we obtain odds ratios that are larger than 1 between edibility and “the more edible value” of each of these mushroom characteristics.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This has the advantage of giving us a sequence of odds ratios, all larger than 1, with the largest value suggesting the strongest association between a mushroom characteristic and edibility.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For the four mushroom characteristics shown in the above four plots, this approach yields the following odds ratios and their 95% confidence intervals:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;GillSize:&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;Lower CI = 15.625,&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;OR = 17.857,&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Upper CI = 20.408&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 
1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;GillAtt:&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/span&gt;Lower CI =&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;6.369,&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;OR = 10.309,&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Upper CI = 16.949&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;Bruises:&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;Lower CI =&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;8.963,&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;OR =&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;9.972,&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;Upper CI = 11.093&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 1;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;StalkShape: Lower CI =&amp;nbsp;&amp;nbsp; 1.384,&amp;nbsp; OR = 1.512,&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;Upper CI = 1.651&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;These results suggest that, of these four variables, the best predictor of mushroom edibility is GillSize, followed by GillAtt as second-best, then Bruises, and finally StalkShape as least predictive.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; 
&lt;/span&gt;These conclusions are probably the same as we would draw based on a careful comparison of the plots shown above, but the odds ratios computed in the way just described lead us to these conclusions much more directly.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Finally, it is important to make&amp;nbsp;three points.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;First, as I have noted before – but the point is important enough to bear repeating – the associations described here between these binary mushroom characteristics and edibility are based entirely on the UCI mushroom dataset.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Thus, these conclusions are only as representative of mushrooms in general or in any particular setting as the UCI mushroom dataset is representative of this larger and/or different mushroom population.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In particular, mushrooms from other locales or unusual environments may exhibit different relationships between edibility and gill size or other characteristics than the UCI mushrooms do.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The second key point is that the results presented here only attempt to assess the predictability of a single binary mushroom characteristic in isolation.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;To get a more complete picture of the relationship between mushroom characteristics and edibility, it is necessary to explore more general multivariate analysis techniques like logistic regression.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;More about that later.&amp;nbsp; Last but not least, the third point is that I realized in reviewing this post before I issued it that I hadn't included any actual R code to compute odds ratios.&amp;nbsp; In my next 
post, I will remedy this problem, giving a detailed view of how the numbers presented here were obtained.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9179325420174899779-6696430712139624523?l=exploringdatablog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploringdatablog.blogspot.com/feeds/6696430712139624523/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://exploringdatablog.blogspot.com/2011/04/measuring-association-using-odds-ratios.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/6696430712139624523'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/6696430712139624523'/><link rel='alternate' type='text/html' href='http://exploringdatablog.blogspot.com/2011/04/measuring-association-using-odds-ratios.html' title='Measuring association using odds ratios'/><author><name>Ron Pearson (aka TheNoodleDoodler)</name><uri>http://www.blogger.com/profile/15693640298594791682</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-BidOUAUunmg/TbL3aX7xxlI/AAAAAAAAACI/log4CBAAoKc/s72-c/OddsRatioPlot01.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9179325420174899779.post-6027137527912648852</id><published>2011-04-12T16:46:00.000-07:00</published><updated>2011-04-12T16:46:51.900-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='screening predictors'/><category scheme='http://www.blogger.com/atom/ns#' term='UCI mushroom 
dataset'/><category scheme='http://www.blogger.com/atom/ns#' term='Exploring Data'/><category scheme='http://www.blogger.com/atom/ns#' term='binary confidence intervals'/><title type='text'>Screening for predictive characteristics … and a mea culpa</title><content type='html'>&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;In my last post, I considered the &lt;a href="http://archive.ics.uci.edu/ml/datasets/Mushroom"&gt;UCI mushroom dataset&lt;/a&gt; and characterized the variables included there using four different interestingness measures.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;When I began drafting this post, my intention was to consider the question of how the different mushroom characteristics included in this dataset relate to each mushroom’s classification as edible or poisonous.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In fact, I do consider this problem here, but in the process of working out the example, I discovered a minor typographical error in&amp;nbsp;&lt;a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;Exploring Data in Engineering, the Sciences, and Medicine&lt;/a&gt;&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=0195089650" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt; that has somewhat less minor consequences.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Specifically, in Eq. 
(9.67) on page 413, two square roots were omitted, making the result incorrect as stated.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;(More specifically, the term in curly brackets that appears twice should have an exponent of ½, like the different term that appears twice in curly brackets in Eq. (9.66) just above it on the same page.)&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The consequence of this omission is that the confidence intervals defined by Eq. (9.67) are too narrow; further, since this equation was used to implement the &lt;em&gt;R &lt;/em&gt;procedure &lt;b style="mso-bidi-font-weight: normal;"&gt;binomCI.proc&lt;/b&gt; available from the &lt;a href="http://www.oup.com/us/ExploringData"&gt;companion website&lt;/a&gt;, the results generated by this procedure are also incorrect.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;I have brought these errors to &lt;city w:st="on"&gt;&lt;place w:st="on"&gt;Oxford&lt;/place&gt;&lt;/city&gt;’s attention and have asked them to replace the original &lt;em&gt;R&lt;/em&gt; procedure with a corrected update, but if you have already downloaded this procedure, you need to be aware of the missing square root.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The rest of this post carries out my original plan – which was to show how binomial confidence intervals can be useful in screening categorical variables for their ability to predict a binary outcome like edibility.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Recall that the UCI mushroom dataset gives 23 characteristics for each of 8,124 mushrooms, including a binary classification of each mushroom as “edible” or “poisonous.”&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The question considered here is which – if any – 
of the 22 mushroom attributes included in the dataset is potentially useful in predicting edibility.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The basic idea is the following: each of these predictors is a categorical variable that can take on any one of a fixed set of possible values, so we can examine the groups of mushrooms&amp;nbsp;defined by&amp;nbsp;each of these values and estimate the probability that the mushrooms in the&amp;nbsp;group&amp;nbsp;are edible.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As a specific example, the mushroom characteristic CapSurf has four possible values: “f” (fibrous), “g” (grooves), “s” (smooth), or “y” (scaly).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; In this case, we want to estimate the probability that mushrooms with CapSurf = f are edible, the probability that those with CapSurf = g are edible, and similarly for CapSurf = s and y.&lt;/span&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The most common approach to this problem is to estimate&amp;nbsp;these probabilities as the fractions of edible mushrooms in each group: &lt;i style="mso-bidi-font-style: normal;"&gt;P&lt;sub&gt;edible&lt;/sub&gt; = n&lt;sub&gt;edible&lt;/sub&gt;/n&lt;sub&gt;group&lt;/sub&gt;&lt;/i&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The difficulty with this number, taken by itself, is that it doesn’t tell us how much weight to give the final result: if we have one edible mushroom in a group of five, we get &lt;i style="mso-bidi-font-style: normal;"&gt;P&lt;sub&gt;edible&lt;/sub&gt;&lt;/i&gt; = 0.200, and we get the same result if we have 200 edible mushrooms in a group of 1,000.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;We are likely to put more faith in the second result than in the first, however, because it has a lot more weight of evidence behind it.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For example, if we add a 
single edible mushroom to each group, our probability estimate for the first case increases from &lt;i style="mso-bidi-font-style: normal;"&gt;P&lt;sub&gt;edible&lt;/sub&gt;&lt;/i&gt; = 0.200 to &lt;i style="mso-bidi-font-style: normal;"&gt;P&lt;sub&gt;edible&lt;/sub&gt;&lt;/i&gt; = 0.333, while in the second case, the&amp;nbsp;estimated probability only&amp;nbsp;increases to &lt;i style="mso-bidi-font-style: normal;"&gt;P&lt;sub&gt;edible&lt;/sub&gt;&lt;/i&gt; = 0.201.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Even worse, if we &lt;em&gt;remove&lt;/em&gt; one edible mushroom from the first group, &lt;i style="mso-bidi-font-style: normal;"&gt;P&lt;sub&gt;edible&lt;/sub&gt;&lt;/i&gt; drops to 0.000, while in the second case, it only drops to 0.199.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;This is where statistics can come to our rescue: in addition to computing the point estimate &lt;i style="mso-bidi-font-style: normal;"&gt;P&lt;sub&gt;edible&lt;/sub&gt;&lt;/i&gt; of the probability that a mushroom is edible, we can also compute confidence intervals, which quantify the uncertainty in this result.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;That is, a confidence interval is defined as a set of values that has at least some specified probability of containing the true but unknown value of &lt;i style="mso-bidi-font-style: normal;"&gt;P&lt;sub&gt;edible&lt;/sub&gt;.&lt;/i&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;A common choice is 95% confidence limits: the true value of &lt;i style="mso-bidi-font-style: normal;"&gt;P&lt;sub&gt;edible&lt;/sub&gt;&lt;/i&gt; lies between some lower limit &lt;i style="mso-bidi-font-style: normal;"&gt;P&lt;sub&gt;-&lt;/sub&gt;&lt;/i&gt; and some upper limit &lt;i style="mso-bidi-font-style: 
normal;"&gt;P&lt;sub&gt;+&lt;/sub&gt;&lt;/i&gt; with probability at least 95%.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;One of the key points of this post is that these intervals can be computed in more than one way, and the way that was widely adopted as “the standard method” for a long time has been found to be inadequate.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Fortunately, a simple alternative is available that gives much better results, at least if you implement it correctly.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The details follow, all based on the material presented in Section 9.7 of &lt;em&gt;Exploring Data.&lt;/em&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;This&amp;nbsp;standard method&amp;nbsp;relies on the assumption of asymptotic normality: for a “sufficiently large” group (i.e., for “large enough” values of &lt;i style="mso-bidi-font-style: normal;"&gt;n&lt;sub&gt;group&lt;/sub&gt;&lt;/i&gt; and possibly &lt;i style="mso-bidi-font-style: normal;"&gt;n&lt;sub&gt;edible&lt;/sub&gt;&lt;/i&gt;), the estimator &lt;i style="mso-bidi-font-style: normal;"&gt;P&lt;sub&gt;edible&lt;/sub&gt;&lt;/i&gt; approaches a Gaussian limiting distribution with variance &lt;i style="mso-bidi-font-style: normal;"&gt;P&lt;sub&gt;edible&lt;/sub&gt;(1 – P&lt;sub&gt;edible&lt;/sub&gt;)/n&lt;sub&gt;group&lt;/sub&gt;&lt;/i&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;If we assume our sample is large enough for this to be a good approximation, we can rely on known results for the Gaussian distribution to construct our confidence intervals.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As a specific example, the 95% confidence interval would be centered at &lt;i style="mso-bidi-font-style: normal;"&gt;P&lt;sub&gt;edible&lt;/sub&gt;&lt;/i&gt; with upper and lower 
limits lying approximately plus or minus 1.96 standard deviations from this value, where the standard deviation is just the square root of the variance given above.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The plot below shows the results obtained by applying this strategy to the groups of mushrooms defined by the four possible values of the CapSurf variable.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Specifically, the open circles in this plot correspond to the estimated probability &lt;i style="mso-bidi-font-style: normal;"&gt;P&lt;sub&gt;edible&lt;/sub&gt;&lt;/i&gt; that a mushroom from the group defined by each CapSurf value is edible.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The downward-pointing triangles represent the upper 95% confidence limit for this value, and the upward-pointing triangles represent the lower 95% confidence limit for this value.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The horizontal dotted line corresponds to the average fraction of edible mushrooms in the UCI dataset, giving us a frame of reference for assessing the edibility results for each individual CapSurf value.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;That is, points lying well above this average line represent groups of mushrooms that are more edible than average, while points lying well below this average line represent groups of mushrooms that are less edible than average.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The result obtained for level “g” clearly illustrates one difficulty with this approach: this group is extremely small, containing only four mushrooms, none of which are classified as edible.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Thus, not only is &lt;i style="mso-bidi-font-style: normal;"&gt;P&lt;sub&gt;edible&lt;/sub&gt;&lt;/i&gt; zero, its associated variance is also zero, giving us zero-width confidence 
intervals.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In words, this result is suggesting that mushrooms with grooved cap surfaces are never edible and that we are quite certain of this, despite the fact that&amp;nbsp;this conclusion&amp;nbsp;is only based on four mushrooms.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In contrast, we seem to be less certain about the probability that scaly (“y”) or smooth (“s”) mushrooms are edible, despite the fact that these results are based on groups of 3,244 and 2,556 mushrooms, respectively.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-e-3kL_Eyouw/TaO7Ryg7PRI/AAAAAAAAAB4/Hrle3ob_AwM/s1600/binomCIfig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" r6="true" src="http://1.bp.blogspot.com/-e-3kL_Eyouw/TaO7Ryg7PRI/AAAAAAAAAB4/Hrle3ob_AwM/s320/binomCIfig01.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;An alternative approach that gives more accurate confidence intervals and also overcomes this particular difficulty is one proposed by Brown, Cai, and DasGupta.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The details are given in &lt;em&gt;Exploring Data&lt;/em&gt; (with the exception of the error noted at the beginning of this post, I believe they are correct), and they are somewhat messy, so I won’t repeat them here, but the basic ideas are, first, to add positive offsets to both &lt;i style="mso-bidi-font-style: normal;"&gt;n&lt;sub&gt;edible&lt;/sub&gt;&lt;/i&gt; and &lt;i style="mso-bidi-font-style: normal;"&gt;n&lt;sub&gt;group&lt;/sub&gt;&lt;/i&gt; 
in computing the probability that a mushroom is edible, and second, to modify the expression for the variance.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Both of these modifications depend explicitly on the confidence level considered (i.e., the offsets are different for 95% confidence intervals than they are for 99%&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;confidence intervals, as are the variance modifications), and they become negligible in the limit as both &lt;i style="mso-bidi-font-style: normal;"&gt;n&lt;sub&gt;edible&lt;/sub&gt;&lt;/i&gt; and &lt;i style="mso-bidi-font-style: normal;"&gt;n&lt;sub&gt;group&lt;/sub&gt;&lt;/i&gt; become very large.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;To see the impact of these modifications, the plot below gives the modified 95% confidence intervals for the CapSurf data, in the same general format as before.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Comparing this plot with the one above, it is clear that the most dramatic difference is that for level “g,” the grooved cap mushrooms: whereas the asymptotic result suggested the mushrooms in this group were poisonous with absolute certainty, the very wide confidence intervals for this group in the plot below reflect the fact that this result is only based on four mushrooms and, while none of these four are edible, the confidence intervals extend from essentially zero probability of being edible to almost the average probability for the complete dataset.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Thus, while we can conclude from this plot that mushrooms with grooved cap surfaces appear less likely than average to be edible, the available evidence isn’t enough to make this argument too strongly.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In contrast, the results for the mushrooms with scaly cap surfaces (“y”) or smooth cap surfaces (“s”) are essentially 
identical to those presented above, consistent with the much larger groups on which these results are based.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-Bm-QL2tEGN0/TaO7tFeCTeI/AAAAAAAAAB8/WTtQBFAnHcA/s1600/binomCIfig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" r6="true" src="http://1.bp.blogspot.com/-Bm-QL2tEGN0/TaO7tFeCTeI/AAAAAAAAAB8/WTtQBFAnHcA/s320/binomCIfig02.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Before leaving this example, it is worth showing how the results are changed in light of my typographical error in Eq. (9.67) of &lt;em&gt;Exploring Data&lt;/em&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The missing square roots were omitted from terms that define the width of the confidence intervals, and these terms are numerically smaller than 1.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Since if x is smaller than 1, the square root of x is larger than x but still smaller than 1, the effect of these omitted square roots is to make the resulting confidence intervals too narrow (i.e., we are using the value x rather than the larger square root of x that we should be using to determine the width of the confidence intervals).&amp;nbsp; This error&amp;nbsp;causes our results&amp;nbsp;to appear&amp;nbsp;more precise than they really are.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This effect may be seen clearly in the plot below, which corresponds to the two plots discussed above, but with the confidence intervals based on the erroneous implementation of the estimator of 
Brown et al.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-15ZFjTmu_2o/TaO71mfQKrI/AAAAAAAAACA/oInenxOXQSU/s1600/binomCIfig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" r6="true" src="http://4.bp.blogspot.com/-15ZFjTmu_2o/TaO71mfQKrI/AAAAAAAAACA/oInenxOXQSU/s320/binomCIfig03.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Finally, the figure below shows four plots, each in the same format as those discussed above, corresponding to the&amp;nbsp;&lt;em&gt;P&lt;sub&gt;edible&lt;/sub&gt;&lt;/em&gt; estimates obtained by applying the method of Brown et al. to four different mushroom characteristics.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The upper left plot shows the results obtained for the mushroom characteristic “Odor,” which appears to be highly predictive of edibility. 
Careful examination of these results reveals that, &lt;em&gt;&lt;strong&gt;for the mushrooms in the UCI dataset&lt;/strong&gt;&lt;/em&gt;, those with odors characterized as “a” (almond) or “l” (anise) are always edible, those with odors characterized as “c” (creosote), “f” (foul), “m” (musty), “p” (pungent), “s” (spicy), or “y” (fishy) are always poisonous, and those with no odor are more likely to be edible than not, but they can still be poisonous.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In contrast, CapShape (upper right plot) appears much less predictive: some values seem to be strongly associated with edibility (“b” or “s”), while the levels “f” and “x” seem to convey no information at all: the likelihood that these mushrooms are edible is essentially the same as that of the complete collection, without regard to CapShape.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The lower left plot shows the corresponding results for StalkRoot, which suggest that levels “c,” “e,” and “r” are more likely to be edible than average, level “b” conveys no information, and mushrooms where StalkRoot values are missing are somewhat more likely to be poisonous (the class “?”).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This result is somewhat distressing, raising the possibility that the missing values for this variable are not missing at random, but that there may be some systematic mechanism at work (e.g., is the StalkRoot characterization somehow more difficult for poisonous mushrooms?).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Finally, the lower right plot shows the results for the binary characteristic GillSize: it appears that mushrooms with GillSize “n” (narrow) are much more likely to be poisonous than those with GillSize “b” (broad).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Because both the response (i.e., edibility) and the candidate predictor GillSize are binary in this 
case, an alternative – and arguably better – approach to characterizing their relationship is in terms of odds ratios, which I will take up in my next post.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-XTECTqTRO8k/TaO79qwIovI/AAAAAAAAACE/LCkwg8EJ4UA/s1600/binomCIfig04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" r6="true" src="http://1.bp.blogspot.com/-XTECTqTRO8k/TaO79qwIovI/AAAAAAAAACE/LCkwg8EJ4UA/s320/binomCIfig04.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9179325420174899779-6027137527912648852?l=exploringdatablog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploringdatablog.blogspot.com/feeds/6027137527912648852/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://exploringdatablog.blogspot.com/2011/04/screening-for-predictive.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/6027137527912648852'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/6027137527912648852'/><link rel='alternate' type='text/html' href='http://exploringdatablog.blogspot.com/2011/04/screening-for-predictive.html' title='Screening for predictive characteristics … and a mea culpa'/><author><name>Ron Pearson (aka TheNoodleDoodler)</name><uri>http://www.blogger.com/profile/15693640298594791682</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-e-3kL_Eyouw/TaO7Ryg7PRI/AAAAAAAAAB4/Hrle3ob_AwM/s72-c/binomCIfig01.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9179325420174899779.post-872218228257381784</id><published>2011-04-03T08:06:00.000-07:00</published><updated>2011-04-03T08:06:06.449-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='interestingness measures'/><category scheme='http://www.blogger.com/atom/ns#' term='Shannon entropy'/><category scheme='http://www.blogger.com/atom/ns#' term='Gini&apos;s mean difference'/><category scheme='http://www.blogger.com/atom/ns#' term='UCI mushroom dataset'/><category scheme='http://www.blogger.com/atom/ns#' term='Exploring Data'/><title type='text'>Interestingness Measures</title><content type='html'>&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=0195089650" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt;&lt;br /&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Probably because I first encountered them somewhat late in my professional life, I am fascinated by categorical data types.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Without question, my favorite book on the subject is Alan Agresti’s &lt;a href="http://www.amazon.com/Categorical-Analysis-Wiley-Probability-Statistics/dp/0471360937?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;Categorical Data Analysis (Wiley Series in Probability and 
Statistics)&lt;/a&gt;&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=0471360937" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt;, which provides a well-integrated, comprehensive treatment of the analysis of categorical variables.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;One of the topics that Agresti does not discuss, however, is that of &lt;em&gt;interestingness measures&lt;/em&gt;, a useful quantitative characterization of categorical variables that comes from the computer science literature rather than the statistics literature.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As I discuss in Chapter 3 of &lt;a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;Exploring Data in Engineering, the Sciences, and Medicine&lt;/a&gt;&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=0195089650" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt;, interestingness measures are essentially numerical characterizations of the extent to which a categorical variable is uniformly distributed over its range of possible values: variables where the levels are equally represented are deemed “less interesting” than those 
whose distribution varies widely across these levels.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Many different interestingness measures have been proposed, and Hilderman and Hamilton give an excellent survey, describing 13 different measures in detail.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;(I have been unable to find a PDF version on-line, but the reference is R.J. Hilderman and H.J. Hamilton, “Evaluation of interestingness measures for ranking discovered knowledge,” in &lt;em&gt;Proceedings of the 5&lt;sup&gt;th&lt;/sup&gt; Asia-Pacific Conference on Knowledge Discovery and Data Mining&lt;/em&gt;, D. Cheung, G.J. Williams, and Q. Li, eds., Hong Kong, April, 2001, pages 247 to 259.)&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In addition, the authors present five behavioral axioms for characterizing interestingness measures.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In &lt;em&gt;Exploring Data&lt;/em&gt;, I consider four normalized interestingness measures that satisfy the following three of Hilderman and Hamilton’s axioms:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.25in;"&gt;&lt;span style="mso-list: Ignore;"&gt;1.&lt;span style="font-family: &amp;quot;Times New Roman&amp;quot;;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;The minimum value principle: the measure exhibits its minimum value when the variable is uniformly distributed over its range of possible values.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For the normalized measures considered here, this means the measure takes the value 0.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; 
text-indent: -0.25in;"&gt;&lt;span style="mso-list: Ignore;"&gt;2.&lt;span style="font-family: &amp;quot;Times New Roman&amp;quot;;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;The maximum value principle: the measure exhibits its maximum value when the variable is completely concentrated on one of its several possible values.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For the normalized measures considered here, this means the measure takes the value 1.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1in; mso-list: l0 level1 lfo1; tab-stops: list 1.0in; text-indent: -0.25in;"&gt;&lt;span style="mso-list: Ignore;"&gt;3.&lt;span style="font-family: &amp;quot;Times New Roman&amp;quot;;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;The permutation-invariance principle: re-labeling the levels of the variable does not change the value of the interestingness measure.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;To compute the interestingness measures considered here, it is necessary to first compute the empirical probabilities that a variable assumes each of its possible values.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;To be specific, assume that the variable x can take any one of M possible values, and let p&lt;sub&gt;i &lt;/sub&gt;denote the fraction of the N observed samples of x that assume the i&lt;sup&gt;th&lt;/sup&gt; possible value.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;All of the interestingness measures considered here attempt to characterize the extent to which these empirical probabilities are constant, i.e. 
the extent to which p&lt;sub&gt;i&lt;/sub&gt;&amp;nbsp;is approximately equal to&amp;nbsp;1/M for all i.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Probably the best known of the four interestingness measures I consider is the &lt;place w:st="on"&gt;Shannon&lt;/place&gt; entropy from information theory, which is based on the average of the product p&lt;sub&gt;i&lt;/sub&gt; log p&lt;sub&gt;i &lt;/sub&gt;&lt;span style="mso-spacerun: yes;"&gt;over all i.&amp;nbsp; &lt;/span&gt;A second measure is the normalized version of Gini’s mean difference from statistics, which is the average distance that p&lt;sub&gt;i&lt;/sub&gt; lies from p&lt;sub&gt;j&lt;/sub&gt; for all i distinct from j, and a third – Simpson’s measure – is a normalized version of the variance of the p&lt;sub&gt;i&lt;/sub&gt; values.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The fourth characterization considered in &lt;em&gt;Exploring Data&lt;/em&gt; is Bray’s measure, which comes from ecology and is based on the average of the smaller of p&lt;sub&gt;i&lt;/sub&gt; and 1/M.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The key point here is that, because these measures are computed in different ways, they are sensitive to different aspects of the distributional heterogeneity of a categorical variable over its range of possible values.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Specifically, since all four of these measures assume the value 0 for uniformly distributed variables and 1 for variables completely concentrated on a single value, they can only differ for intermediate degrees of heterogeneity.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The &lt;em&gt;R&lt;/em&gt; procedures &lt;strong&gt;bray.proc&lt;/strong&gt;, &lt;strong&gt;gini.proc&lt;/strong&gt;, 
&lt;strong&gt;shannon.proc&lt;/strong&gt;, and &lt;strong&gt;simpson.proc&lt;/strong&gt; are all available from the &lt;em&gt;Exploring Data&amp;nbsp;&lt;/em&gt;&amp;nbsp;&lt;a href="http://www.oup.com/us/companion.websites/9780195089653/?view=usa"&gt;companion website&lt;/a&gt;,&amp;nbsp;each implementing the corresponding interestingness measure.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;To illustrate the use of these procedures, they are applied here to the &lt;a href="http://archive.ics.uci.edu/ml/datasets/Mushroom"&gt;UCI Machine Learning Repository Mushroom dataset&lt;/a&gt;, which gives 23 categorical characterizations of 8,124 mushrooms, taken from &lt;em&gt;The Audubon Society Field Guide to North American Mushrooms&lt;/em&gt; (G.H. Lincoff, Pres., published by Alfred A. Knopf, New York, 1981).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;A typical characterization is gill color, which exhibits 12 distinct values, each corresponding to a one-character color code (e.g., “p” for pink, “u” for purple, etc.).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;To evaluate the four interestingness measures for any of these attributes, it is necessary to first compute its associated empirical probability vector.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This is easily done with the following &lt;em&gt;R &lt;/em&gt;function:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 2;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;ComputePvector &amp;lt;- function(x){&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span 
style="mso-tab-count: 3;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;xtab &amp;lt;- table(x)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 3;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;pvect &amp;lt;- xtab/sum(xtab)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 3;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;pvect&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;span style="mso-tab-count: 2;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;}&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Given this empirical probability vector, the four &lt;em&gt;R&lt;/em&gt; procedures listed above 
can be used to compute the corresponding interestingness measure.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As a specific example, the following sequence gives the values for the four interestingness measures for the seven-level variable “Habitat” from the mushroom dataset:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1.25in; mso-list: l0 level1 lfo1; tab-stops: list 1.25in; text-indent: -0.25in;"&gt;&lt;span style="font-family: Wingdings; mso-bidi-font-family: Wingdings; mso-fareast-font-family: Wingdings;"&gt;&lt;span style="mso-list: Ignore;"&gt;&lt;span style="font-family: &amp;quot;Times New Roman&amp;quot;;"&gt;&amp;gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;x &amp;lt;- mushroom$Habitat&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1.25in; mso-list: l0 level1 lfo1; tab-stops: list 1.25in; text-indent: -0.25in;"&gt;&lt;span style="font-family: Wingdings; mso-bidi-font-family: Wingdings; mso-fareast-font-family: Wingdings;"&gt;&lt;span style="mso-list: Ignore;"&gt;&lt;span style="font-family: &amp;quot;Times New Roman&amp;quot;;"&gt;&amp;gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;pvect &amp;lt;- ComputePvector(x)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1.25in; mso-list: l0 level1 lfo1; tab-stops: list 1.25in; text-indent: -0.25in;"&gt;&lt;span style="font-family: Wingdings; mso-bidi-font-family: Wingdings; mso-fareast-font-family: Wingdings;"&gt;&lt;span style="mso-list: Ignore;"&gt;&lt;span style="font-family: &amp;quot;Times New Roman&amp;quot;;"&gt;&amp;gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;bray.proc(pvect)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1.25in; mso-list: l0 level1 lfo1; tab-stops: list 1.25in; text-indent: 
-0.25in;"&gt;&lt;span style="font-family: Wingdings; mso-bidi-font-family: Wingdings; mso-fareast-font-family: Wingdings;"&gt;&lt;span style="mso-list: Ignore;"&gt;&lt;span style="font-family: &amp;quot;Times New Roman&amp;quot;;"&gt;&amp;gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;0.427212&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1.25in; mso-list: l0 level1 lfo1; tab-stops: list 1.25in; text-indent: -0.25in;"&gt;&lt;span style="font-family: Wingdings; mso-bidi-font-family: Wingdings; mso-fareast-font-family: Wingdings;"&gt;&lt;span style="mso-list: Ignore;"&gt;&lt;span style="font-family: &amp;quot;Times New Roman&amp;quot;;"&gt;&amp;gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;gini.proc(pvect)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1.25in; mso-list: l0 level1 lfo1; tab-stops: list 1.25in; text-indent: -0.25in;"&gt;&lt;span style="font-family: Wingdings; mso-bidi-font-family: Wingdings; mso-fareast-font-family: Wingdings;"&gt;&lt;span style="mso-list: Ignore;"&gt;&lt;span style="font-family: &amp;quot;Times New Roman&amp;quot;;"&gt;&amp;gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;0.548006&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1.25in; mso-list: l0 level1 lfo1; tab-stops: list 1.25in; text-indent: -0.25in;"&gt;&lt;span style="font-family: Wingdings; mso-bidi-font-family: Wingdings; mso-fareast-font-family: Wingdings;"&gt;&lt;span style="mso-list: Ignore;"&gt;&lt;span style="font-family: &amp;quot;Times New Roman&amp;quot;;"&gt;&amp;gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;shannon.proc(pvect)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1.25in; mso-list: l0 level1 lfo1; tab-stops: list 1.25in; text-indent: -0.25in;"&gt;&lt;span style="font-family: Wingdings; mso-bidi-font-family: 
Wingdings; mso-fareast-font-family: Wingdings;"&gt;&lt;span style="mso-list: Ignore;"&gt;&lt;span style="font-family: Times New Roman;"&gt;&amp;gt;&lt;/span&gt;&lt;span style="font-family: &amp;quot;Times New Roman&amp;quot;;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;0.189719&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1.25in; mso-list: l0 level1 lfo1; tab-stops: list 1.25in; text-indent: -0.25in;"&gt;&lt;span style="font-family: Wingdings; mso-bidi-font-family: Wingdings; mso-fareast-font-family: Wingdings;"&gt;&lt;span style="mso-list: Ignore;"&gt;&lt;span style="font-family: Times New Roman;"&gt;&amp;gt;&lt;/span&gt;&lt;span style="font-family: &amp;quot;Times New Roman&amp;quot;;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;simpson.proc(pvect)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1.25in; mso-list: l0 level1 lfo1; tab-stops: list 1.25in; text-indent: -0.25in;"&gt;&lt;span style="font-family: Wingdings; mso-bidi-font-family: Wingdings; mso-fareast-font-family: Wingdings;"&gt;&lt;span style="mso-list: Ignore;"&gt;&lt;span style="font-family: Times New Roman;"&gt;&amp;gt;&lt;/span&gt;&lt;span style="font-family: &amp;quot;Times New Roman&amp;quot;;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;0.129993&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;These results illustrate the point noted above, that while all of these measures span the same range – from 0 for completely uniform level distributions to 1 for variables completely concentrated on one level – in general, the values of the four&amp;nbsp;measures are different.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;These differences arise from the fact that the different interestingness measures weight 
specific types of deviations from uniformity differently.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;To see this, note that the seven levels of the habitat variable considered here have the following distribution:&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 1.25in; mso-list: l0 level1 lfo1; tab-stops: list 1.25in; text-indent: -0.25in;"&gt;&lt;span style="font-family: Wingdings; mso-bidi-font-family: Wingdings; mso-fareast-font-family: Wingdings;"&gt;&lt;span style="mso-list: Ignore;"&gt;&lt;span style="font-family: &amp;quot;Times New Roman&amp;quot;;"&gt;&amp;gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;table(mushroom$Habitat)&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 0.5in; text-indent: 0.5in;"&gt;&amp;gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; d&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;span style="mso-tab-count: 1;"&gt;g&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;l&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; m&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; p&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; u&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; w&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt 0.5in; text-indent: 0.5in;"&gt;&amp;gt;&amp;nbsp;&amp;nbsp; 3148&amp;nbsp;&amp;nbsp; 2148&amp;nbsp;&amp;nbsp; 832&amp;nbsp;&amp;nbsp; 292&amp;nbsp;&amp;nbsp; 1144&amp;nbsp;&amp;nbsp; 368&amp;nbsp;&amp;nbsp; 192&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;div class="separator" 
style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;It is clear from looking at these numbers that the seven different habitat levels are not all equally represented in this dataset, with the most common level (“d”) occurring about 16 times as often as the rarest level (“w”).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The average representation is approximately 1160, so the two most populous levels occur much more frequently than average, one level occurs with about average frequency (“p”), and the other four levels occur anywhere from about 70 percent of the average frequency down to about one sixth of it.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Comparing the four values computed above, the Gini measure (0.548) is the most sensitive to these deviations from homogeneity, at least for this example, while the Simpson measure (0.130) is the least sensitive.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;These observations raise the following question: to what extent is this behavior typical, and to what extent is it specific to this particular example?&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The rest of this post examines this question further.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-dFVQ3xcf_KQ/TZdrJR1K5JI/AAAAAAAAABk/Ey0MNyMx3nA/s1600/interestingnessFig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" r6="true" src="http://2.bp.blogspot.com/-dFVQ3xcf_KQ/TZdrJR1K5JI/AAAAAAAAABk/Ey0MNyMx3nA/s320/interestingnessFig01.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The four plots shown above compare these different interestingness measures for all of the variables included in the 
UCI mushroom dataset.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The upper left plot shows the Gini measure plotted against the &lt;place w:st="on"&gt;Shannon&lt;/place&gt; measure for all 23 of these categorical variables.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The dashed line in the plot is the equality reference line: if both measures were identical, all points would lie along this line.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The fact that all points lie above this line means that – at least for this dataset – the Gini measure is always larger than the &lt;place w:st="on"&gt;Shannon&lt;/place&gt; measure.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The upper right plot shows the Simpson measure plotted against the Shannon measure, and here it is clear that the Simpson measure is generally smaller than the &lt;place w:st="on"&gt;Shannon&lt;/place&gt; measure, but there are exceptions.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Further, it is also clear from this plot that the differences between the Shannon and Simpson measures are&amp;nbsp;substantially less&amp;nbsp;than those between the &lt;place w:st="on"&gt;Shannon&lt;/place&gt; and Gini measures, again, at least for this dataset.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The lower left plot shows the Bray measure plotted against the Shannon measure, and this plot has the same general shape as the one above for the Gini versus &lt;place w:st="on"&gt;Shannon&lt;/place&gt; measures.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The differences between the Bray and &lt;place w:st="on"&gt;Shannon&lt;/place&gt; measures appear slightly less than those between the Gini and Shannon measures, suggesting that the Bray measure is slightly smaller than the Gini measure.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The lower right plot of the Bray versus the Gini 
measures shows that this is generally true: some of the points on this plot appear to fall exactly on the reference line, while the others fall slightly below this line, implying that the Bray measure is generally approximately equal to the Gini measure, but where they differ, the Bray measure is consistently smaller.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Again, it is important to emphasize that the results presented here are based on a single dataset, and they may not hold in general, but these observations do what a good exploratory analysis should: they raise the question.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-nfNaJlij_ko/TZdsb3Fe2wI/AAAAAAAAABo/IFRX7DZ0lhk/s1600/interestingnessFig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" r6="true" src="http://1.bp.blogspot.com/-nfNaJlij_ko/TZdsb3Fe2wI/AAAAAAAAABo/IFRX7DZ0lhk/s320/interestingnessFig02.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Another question raised by these results is whether there is anything special about the few points that violate the general rule that the Simpson measure is less than the &lt;place w:st="on"&gt;Shannon&lt;/place&gt; measure.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The above figure repeats the upper right plot from the four shown earlier, with the Simpson measure plotted against the &lt;place w:st="on"&gt;Shannon&lt;/place&gt; measure.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Here, the six points that correspond to binary characterizations are represented as solid circles, and they correspond to six of the eight points that fall above the line, implying that 
the Simpson measure is greater than the &lt;place w:st="on"&gt;Shannon&lt;/place&gt; measure for these points.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The fact that these are the only binary variables in the dataset suggests – but does not prove – that the Simpson measure tends to be greater than the &lt;place w:st="on"&gt;Shannon&lt;/place&gt; measure for binary variables.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Of the other two points above the reference line, one represents the only three-level characterization in the mushroom dataset (specifically, the point corresponding to “RingNum,” a categorical variable based on the number of rings on the mushroom, labeled on the plot).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The last remaining point above the equality reference line corresponds to one of four four-level characterizations in the dataset; the other three four-level characterizations fall below the reference line, as do all of those with more than four levels.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Taken together, these results suggest that the number of levels in a categorical variable influences the Shannon and Simpson measures differently.&amp;nbsp; In particular, it appears that those cases where the Simpson measure exceeds the Shannon measure tend to be variables with relatively few levels.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;In fact, these observations raise a subtle but extremely important point in dealing with categorical variables.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The metadata describing the mushroom dataset indicates that the variable VeilType is binary, with levels universal (“u”) and partial (“p”), but the type listed for all 8,124 mushrooms in the dataset is “p.”&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; 
&lt;/span&gt;As a consequence, VeilType appears in this analysis as a one-level variable.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Since the basic numerical expressions for all four of the interestingness measures considered here become indeterminate for one-level variables, this case is tested for explicitly in the &lt;em&gt;R&lt;/em&gt; procedures used here, and&amp;nbsp;it is assigned a value of 0: this seems reasonable, representing a classification of “fully homogeneous, evenly spread over the range of possible values,” as it is difficult to imagine declaring a one-level variable to be strongly heterogeneous or “interesting.”&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Here, this result is a consequence of defining the number of levels for this categorical variable from the data alone, neglecting the possibility of other values that could be present but are not.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In particular, if we regard this variable as binary – in agreement with the metadata – all four of the interestingness measures would yield the value 1, corresponding to the fact that the observed values are fully concentrated in one of two possible levels.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This example illustrates the difference between &lt;em&gt;internal categorization&lt;/em&gt; – i.e., determination of the number of levels for a categorical variable from the observed data alone – and &lt;em&gt;external categorization&lt;/em&gt;, where the number of levels is specified by the metadata.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As this example illustrates, the difference can have important consequences for both numerical data characterizations and their interpretation.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div 
class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-CIhueXs1cQs/TZh9_vtw9BI/AAAAAAAAAB0/yU8Yz1bTk9c/s1600/interestingnessFig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" r6="true" src="http://1.bp.blogspot.com/-CIhueXs1cQs/TZh9_vtw9BI/AAAAAAAAAB0/yU8Yz1bTk9c/s320/interestingnessFig03.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Finally, the above figure shows the Gini measure plotted against the &lt;place w:st="on"&gt;Shannon&lt;/place&gt; measure for all 23 of the mushroom characteristics, corresponding to the upper left plot in the first figure, but with three additions.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;First, the solid curve represents the result of applying the lowess nonparametric smoothing procedure to the scatterplot of Gini versus Shannon measures.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Essentially, this curve fills in the line that your eye suggests is present,&amp;nbsp;approximating the nonlinear relationship between these two measures that is satisfied by most of the points on the plot.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Second, the solid circle represents the variable whose results lie farthest from this curve, which is CapSurf, a four-level categorical variable describing the surface of the mushroom cap as fibrous (“f”), grooved (“g”), scaly (“y”), or smooth (“s”), distributed as follows: 2,320 “f,” 4 “g,” 2,556 “y,” and 3,244 “s.”&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Note that if the rare category “g” were omitted from consideration, this variable would be almost uniformly distributed over the remaining three values, and this modified result (i.e., with the rare level "g" omitted from 
the analysis) corresponds to the solid square point that falls on the lowess curve at the lower left end.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Thus, the original CapSurf result appears to represent a specific type of deviation from uniformity that violates the curvilinear relationship that seems to generally hold – at least approximately – between the Gini and &lt;place w:st="on"&gt;Shannon&lt;/place&gt; interestingness measures.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The real point of these examples has been two-fold.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;First, I wanted to introduce the notion of an interestingness measure and illustrate the application of these numerical summaries to categorical variables, something that is not covered in typical introductory statistics or data analysis courses.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Second, I wanted to show that – in common with most other summary statistics – different data characteristics influence alternative measures differently.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Thus, while it appears that the Simpson measure is generally slightly smaller than the Shannon measure – at least for the examples considered here – this behavior depends strongly on the number of levels the categorical variable can assume, with binary variables consistently violating this rule of thumb (again, at least for the examples considered here).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Similarly, while it appears that the Gini measure always exceeds the &lt;place w:st="on"&gt;Shannon&lt;/place&gt; measure, often substantially so, there appear to be certain types of heterogeneous data distributions for which this difference is substantially smaller than normal.&amp;nbsp; Thus, as is often the case, it can be 
extremely useful to compute similar characterizations of the same type and compare them, since unexpected differences can illuminate features of the data that are not initially apparent and which may turn out to be interesting.&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9179325420174899779-872218228257381784?l=exploringdatablog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploringdatablog.blogspot.com/feeds/872218228257381784/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://exploringdatablog.blogspot.com/2011/04/interestingness-measures.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/872218228257381784'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/872218228257381784'/><link rel='alternate' type='text/html' href='http://exploringdatablog.blogspot.com/2011/04/interestingness-measures.html' title='Interestingness Measures'/><author><name>Ron Pearson (aka TheNoodleDoodler)</name><uri>http://www.blogger.com/profile/15693640298594791682</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-dFVQ3xcf_KQ/TZdrJR1K5JI/AAAAAAAAABk/Ey0MNyMx3nA/s72-c/interestingnessFig01.png' height='72' 
width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9179325420174899779.post-3179326419182672661</id><published>2011-03-23T18:17:00.000-07:00</published><updated>2011-03-23T18:17:46.987-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='beeswarm plots'/><category scheme='http://www.blogger.com/atom/ns#' term='Q-Q plots'/><category scheme='http://www.blogger.com/atom/ns#' term='boxplots'/><category scheme='http://www.blogger.com/atom/ns#' term='normal quantiles'/><title type='text'>The Many Uses of Q-Q Plots</title><content type='html'>&lt;span&gt;&lt;/span&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;My last four posts have dealt with boxplots and some useful variations on that theme.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Just after I finished the series, Tal Galili, who maintains the &lt;a href="http://www.r-statistics.com/tag/r-bloggers/"&gt;R-bloggers&lt;/a&gt; website, pointed me to a variant I hadn’t seen before.&amp;nbsp; It's&amp;nbsp;called a &lt;i style="mso-bidi-font-style: normal;"&gt;beeswarm plot,&lt;/i&gt; and it's produced by the &lt;strong&gt;beeswarm&lt;/strong&gt; package in &lt;em&gt;R&lt;/em&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;I haven’t played with this package a lot yet, but it does appear to be useful for datasets that aren’t too large and that you want to examine across a moderate number of different segments.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The plot shown below provides a typical illustration: it shows the beeswarm plot comparing the potassium content of different cereals, broken down by manufacturer, from the &lt;strong&gt;UScereal &lt;/strong&gt;dataset included in the &lt;strong&gt;MASS&lt;/strong&gt; package in &lt;em&gt;R.&lt;/em&gt;&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;I discussed this data example in my first couple of boxplot posts and I think this is a case 
where the beeswarm plot gives you a more useful picture of how the data points are distributed than the boxplots do.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;For more information&amp;nbsp;about the &lt;strong&gt;beeswarm&lt;/strong&gt;&amp;nbsp;package, I recommend &lt;a href="http://www.r-statistics.com/2011/03/beeswarm-boxplot-and-plotting-it-with-r/"&gt;Tal's post&lt;/a&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;More generally, anyone interested in learning more about what you can do with the &lt;em&gt;R&lt;/em&gt; software package should find the R-bloggers website extremely useful.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="https://lh6.googleusercontent.com/-pQ80HZSJVI0/TYqJztV1bMI/AAAAAAAAABU/QZbtVgv7r74/s1600/beeswarmfig00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" r6="true" src="https://lh6.googleusercontent.com/-pQ80HZSJVI0/TYqJztV1bMI/AAAAAAAAABU/QZbtVgv7r74/s320/beeswarmfig00.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Besides boxplots, one of the other useful graphical data characterizations I discuss in&amp;nbsp;&lt;span&gt;&lt;a href="http://www.amazon.com/Exploring-Data-Engineering-Sciences-Medicine/dp/0195089650?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;Exploring Data in Engineering, the Sciences, and Medicine&lt;/a&gt;&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=0195089650" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px 
!important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt;&lt;/span&gt; is the quantile-quantile (Q-Q) plot.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The most common form of this characterization is the &lt;em&gt;normal Q-Q plot,&lt;/em&gt; which represents an informal graphical test of the hypothesis that a data sequence is normally distributed.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;That is, if the points on a normal Q-Q plot are reasonably well approximated by a straight line, the popular Gaussian data hypothesis is plausible, while marked deviations from linearity provide evidence against this hypothesis.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The utility of normal Q-Q plots goes well beyond this informal hypothesis test, however, which is the main point of this post.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In particular, the shape of a normal Q-Q plot can be extremely useful in highlighting distributional asymmetry, heavy tails, outliers, multi-modality, or other data anomalies.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The specific objective of this post is to illustrate some of these ideas, expanding on the discussion presented in &lt;em&gt;Exploring Data&lt;/em&gt;.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="https://lh6.googleusercontent.com/-Z_kXbFQnUyM/TYqJtKIjUtI/AAAAAAAAABQ/QNXne7ASN7Y/s1600/qqplotfig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" r6="true" src="https://lh6.googleusercontent.com/-Z_kXbFQnUyM/TYqJtKIjUtI/AAAAAAAAABQ/QNXne7ASN7Y/s320/qqplotfig01.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The above figure shows four different normal Q-Q plots that illustrate 
some of the different data characteristics these plots can emphasize.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The upper left plot demonstrates that normal Q-Q plots can be extremely effective in highlighting glaring outliers in a data sequence.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This plot shows the annual number of traffic deaths per ten thousand drivers over an unspecified time period, for 25 of the 50 states in the &lt;country-region w:st="on"&gt;U.S.&lt;/country-region&gt;, plus the &lt;state w:st="on"&gt;&lt;place w:st="on"&gt;District of Columbia&lt;/place&gt;&lt;/state&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This plot was constructed from the &lt;strong&gt;road&lt;/strong&gt; dataset included in the &lt;strong&gt;MASS&lt;/strong&gt; package in &lt;em&gt;R&lt;/em&gt;, which gives the numbers of deaths, the numbers of drivers (in tens of thousands), and several other characteristics for each of these regions.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Based on the interpretation of normal Q-Q plots offered above, the normal distribution hypothesis appears fairly reasonable for this data sequence, in all cases except the point in the extreme upper right.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This point corresponds to the state of &lt;place w:st="on"&gt;&lt;state w:st="on"&gt;Maine&lt;/state&gt;&lt;/place&gt;, which exhibited 26 deaths per ten thousand drivers, well above the average of approximately 5 for all other regions considered. 
&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="https://lh6.googleusercontent.com/-c-ZxRM4T3J8/TYqKIZzOXfI/AAAAAAAAABY/URDdfAJlnyc/s1600/qqplotfig00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" r6="true" src="https://lh6.googleusercontent.com/-c-ZxRM4T3J8/TYqKIZzOXfI/AAAAAAAAABY/URDdfAJlnyc/s320/qqplotfig00.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;It is not clear why the reported traffic death rate is so high for &lt;state w:st="on"&gt;&lt;place w:st="on"&gt;Maine&lt;/place&gt;&lt;/state&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The scatterplot above shows the reported traffic deaths for each state or district against the number of drivers, in tens of thousands.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The dashed line in the plot corresponds to the average traffic death rate for all regions except Maine, and it is clear that this line fits most of the data points reasonably well, with Maine (the solid point) representing the most glaring exception.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp;Although it still leaves us wanting to know more,&lt;/span&gt;&amp;nbsp;this plot&amp;nbsp;suggests that the number of deaths for &lt;state w:st="on"&gt;&lt;place w:st="on"&gt;Maine&lt;/place&gt;&lt;/state&gt; is unusually high, rather than the number of drivers being unusually low, which might be a more tempting explanation.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The Q-Q plot for this denominator variable – i.e., for the number of drivers – is shown as the upper right plot in the original set of four shown 
above.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;There, the fact that both tails of the distribution lie above the reference line is suggestive of distributional asymmetry, a point examined further below using Q-Q plots for other reference distributions.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Also, note that both of the upper Q-Q plots shown above are based on only 26 data values, which is right at the lower limit on sample size that various authors have suggested for normal Q-Q plots to be useful (see the discussion of normal Q-Q plots in Section 6.3.3 of &lt;em&gt;Exploring Data&lt;/em&gt; for details).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The tricky task of separating outliers, asymmetry, and other potentially interesting data characteristics in samples this small is greatly facilitated by the Q-Q plot confidence intervals discussed below.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The lower left Q-Q plot in the above sequence is that for the &lt;place w:st="on"&gt;Old Faithful&lt;/place&gt; geyser dataset &lt;strong&gt;faithful &lt;/strong&gt;included with the base &lt;em&gt;R&lt;/em&gt; distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As I have discussed previously, the eruption duration data exhibits a pronounced bimodal distribution, which may be seen clearly in nonparametric density estimates computed from these data values.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Normal Q-Q plots constructed from bimodal data typically exhibit a “kink” like the one seen in this plot.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;A crude way of explaining this behavior is the following: the lower portion of the Q-Q plot is very roughly linear, suggesting a&amp;nbsp;very approximate&amp;nbsp;Gaussian distribution, corresponding to the first mode of the 
eruption data distribution (i.e., the durations of the shorter group of eruptions).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Similarly, the upper portion of the Q-Q plot is again very roughly linear, but with a much different intercept that corresponds to the larger mean of the second peak in the distribution (i.e., the durations of the longer group of eruptions).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;To connect these two “roughly linear” local segments, the curve must exhibit a “kink” or rapid transition region between them.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;By the same reasoning, more general multi-modal distributions will exhibit more than one such “kink” in their Q-Q plots.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Finally, the lower right Q-Q plot in the collection above was constructed from the Pima Indians diabetes dataset available from the&amp;nbsp;&lt;a href="http://www.ics.uci.edu/~mlearn/MLRepository.html"&gt;UCI Machine Learning Repository&lt;/a&gt;.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This dataset includes a number of clinical measurements for 768 female members of the Pima tribe of Native Americans, including their diastolic blood pressure.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The lower right Q-Q plot was constructed from this blood pressure data, and its most obvious feature is the prominent lower tail anomaly.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In fact, careful examination of this plot reveals that these points correspond to the value &lt;em&gt;zero&lt;/em&gt;, which is not realistic for any living person.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;What has happened here is that zero has been used to code missing values, both for this variable and several others in this dataset.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This observation is 
important because the metadata associated with this dataset indicates that there is no missing data, and a number of studies in the classification literature have proceeded under the assumption that this is true.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Unfortunately, this assumption can lead to badly biased results, a point discussed in detail in a paper I published in SIGKDD Explorations (&lt;a href="http://www.sigkdd.org/explorations/issues/8-1-2006-06/12-Pearson.pdf"&gt;Disguised Missing Data paper PDF&lt;/a&gt;).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The point of the example presented here is to show that normal Q-Q plots can be extremely effective in highlighting this kind of data anomaly.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The normal Q-Q plots considered so far were constructed using the &lt;strong&gt;qqnorm&lt;/strong&gt; procedure available in base &lt;em&gt;R&lt;/em&gt;, and the reference lines shown in these plots were constructed using the &lt;strong&gt;qqline&lt;/strong&gt; command.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;It is not difficult to construct Q-Q plots for other reference distributions using procedures in base &lt;em&gt;R&lt;/em&gt;, but a much simpler alternative is to use the &lt;strong&gt;qqPlot&lt;/strong&gt; command in the optional &lt;strong&gt;car&lt;/strong&gt; package.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This &lt;em&gt;R&lt;/em&gt; add-on package was developed in association with the book &lt;span&gt;&lt;a href="http://www.amazon.com/R-Companion-Applied-Regression/dp/141297514X?ie=UTF8&amp;amp;tag=widgetsamazon-20&amp;amp;link_code=btl&amp;amp;camp=213689&amp;amp;creative=392969" target="_blank"&gt;An R Companion to Applied Regression&lt;/a&gt;&lt;img alt="" border="0" height="1" 
src="http://www.assoc-amazon.com/e/ir?t=widgetsamazon-20&amp;amp;l=btl&amp;amp;camp=213689&amp;amp;creative=392969&amp;amp;o=1&amp;amp;a=141297514X" style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none; margin: 0px; padding-bottom: 0px !important; padding-left: 0px !important; padding-right: 0px !important; padding-top: 0px !important;" width="1" /&gt;,&lt;/span&gt; by Fox and Weisberg, and it includes a number of very useful procedures.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The default options of the &lt;strong&gt;qqPlot&lt;/strong&gt; procedure automatically generate a reference line, along with upper and lower 95% confidence intervals for the plot, which are particularly useful for small samples like the &lt;strong&gt;road&lt;/strong&gt; dataset.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The figure below shows a normal Q-Q plot for the number of traffic deaths per 10,000 drivers generated using the &lt;strong&gt;qqPlot&lt;/strong&gt; command.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp;&amp;nbsp; &lt;/span&gt;The fact that all of the points but the one obvious outlier fall within the 95% confidence limits suggests that the scatter around the reference line seen for these 25 observations is small enough to be consistent with a normal reference distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Further, these confidence limits also emphasize how much the outlying result for the state of &lt;state w:st="on"&gt;&lt;place w:st="on"&gt;Maine&lt;/place&gt;&lt;/state&gt; violates this normality assumption.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="https://lh5.googleusercontent.com/-Ql-OcHO9AEw/TYqQtHqmmcI/AAAAAAAAABc/tyqxf7Ua-jY/s1600/qqplotfig02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" r6="true" 
src="https://lh5.googleusercontent.com/-Ql-OcHO9AEw/TYqQtHqmmcI/AAAAAAAAABc/tyqxf7Ua-jY/s320/qqplotfig02.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Another advantage of the &lt;strong&gt;qqPlot&lt;/strong&gt; command is that it provides the basis for very easy generation of Q-Q plots for essentially any reference distribution that is available in &lt;em&gt;R&lt;/em&gt;, including those available in add-on packages like &lt;strong&gt;gamlss.dist&lt;/strong&gt;, which supports an &lt;em&gt;extremely&lt;/em&gt; wide range of distributions (generalized inverse Gaussian distributions, anyone?).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This capability is illustrated in the four Q-Q plots shown below, all generated with the &lt;strong&gt;qqPlot&lt;/strong&gt; command for non-Gaussian distributions.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In all of these plots, the data corresponds to the driver counts for the 26 states and districts summarized in the &lt;strong&gt;road&lt;/strong&gt; dataset.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Motivation for the specific Q-Q plots shown here is that the four distributions represented by these plots are all better suited to capturing the asymmetry seen in the normal Q-Q plot for this data sequence than the symmetric Gaussian distribution is.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The upper left plot shows the results obtained for the exponential distribution which, like the Gaussian distribution, does not require the specification of a shape parameter.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Comparing this plot with the normal Q-Q plot shown above for this data sequence, it is clear that the exponential distribution is more consistent with the driver data than the Gaussian distribution 
is.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The data point in the extreme upper right does fall just barely outside the 95% confidence limits shown on this plot, and careful inspection reveals that the points in the lower left fall slightly below these confidence limits, which become quite narrow at this end of the plot.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="https://lh5.googleusercontent.com/-Eb2cQMS8xmA/TYqRSvWgjMI/AAAAAAAAABg/ZUr8Yr1IUqc/s1600/qqplotfig03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" r6="true" src="https://lh5.googleusercontent.com/-Eb2cQMS8xmA/TYqRSvWgjMI/AAAAAAAAABg/ZUr8Yr1IUqc/s320/qqplotfig03.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;The exponential distribution represents&amp;nbsp;a special case of the gamma distribution, with a shape parameter equal to 1.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;In fact, the exponential distribution exhibits a &lt;em&gt;J-shaped&lt;/em&gt; density, decaying from a maximum value at zero, and it corresponds to a “dividing line” within the gamma family: members with shape parameters larger than 1 exhibit unimodal densities with a single maximum at some positive value, while gamma distributions with shape parameters less than 1 are J-shaped like the exponential distribution.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;To construct Q-Q plots for general members of the gamma family, it is necessary to specify a particular value for this shape parameter, and the other three Q-Q plots shown above have done this using the &lt;strong&gt;qqPlot &lt;/strong&gt;command.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Comparing these 
plots, it appears that increasing the shape parameter causes the points in the upper tail to fall farther outside the 95% confidence limits, while decreasing the shape parameter better accommodates these upper tail points.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Conversely, decreasing the shape parameter causes the cluster of points in the lower tail to fall farther outside the confidence limits.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;It is not obvious that any of the plots shown here suggests a better fit than the exponential distribution, but the point of this example was to show the flexibility of the &lt;strong&gt;qqPlot&lt;/strong&gt; procedure in being able to pose the question and examine the results graphically.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Alternatively, the Weibull distribution – which also includes the exponential distribution as a special case – might describe these data values better than any member of the gamma distribution family, and these plots can also be easily generated using the &lt;strong&gt;qqPlot&lt;/strong&gt; command (just specify &lt;strong&gt;dist = "weibull"&lt;/strong&gt; instead of &lt;strong&gt;dist = "gamma"&lt;/strong&gt;, along with &lt;strong&gt;shape = a&lt;/strong&gt; for some positive value of &lt;strong&gt;a &lt;/strong&gt;other than 1).&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;Finally, one cautionary note is important here for those working with very large datasets.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Q-Q plots are based on sorting data, something that can be done quite efficiently, but which can still take a very long time for a really huge dataset.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;As a consequence, while you can attempt to construct Q-Q plots for sequences of hundreds of thousands of 
points or more, you may have to wait a long time to get your plot.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Further, it is often true that plots made up of a very large number of points reduce to ugly-looking dark blobs that can use up a lot of toner if you make the further mistake of trying to print them.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;So, if you are working with really enormous datasets, my suggestion is to construct Q-Q plots from a representative random sample of a few hundred or a few thousand points, not hundreds of thousands or millions of points.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;It will make your life a lot easier.&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9179325420174899779-3179326419182672661?l=exploringdatablog.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://exploringdatablog.blogspot.com/feeds/3179326419182672661/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://exploringdatablog.blogspot.com/2011/03/many-uses-of-q-q-plots.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/3179326419182672661'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9179325420174899779/posts/default/3179326419182672661'/><link rel='alternate' type='text/html' href='http://exploringdatablog.blogspot.com/2011/03/many-uses-of-q-q-plots.html' title='The Many Uses of Q-Q Plots'/><author><name>Ron Pearson (aka 
TheNoodleDoodler)</name><uri>http://www.blogger.com/profile/15693640298594791682</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='https://lh6.googleusercontent.com/-pQ80HZSJVI0/TYqJztV1bMI/AAAAAAAAABU/QZbtVgv7r74/s72-c/beeswarmfig00.png' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9179325420174899779.post-7798615030770181250</id><published>2011-03-05T13:53:00.000-08:00</published><updated>2011-03-05T13:53:13.749-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Old Faithful dataset'/><category scheme='http://www.blogger.com/atom/ns#' term='beanplots'/><category scheme='http://www.blogger.com/atom/ns#' term='boxplots'/><category scheme='http://www.blogger.com/atom/ns#' term='violin plots'/><category scheme='http://www.blogger.com/atom/ns#' term='asymmetry'/><title type='text'>Boxplots &amp; Beyond IV: Beanplots</title><content type='html'>&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;This post is the last in a series of four on boxplots and some of their extensions.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Previous posts in this series have discussed basic boxplots, modified boxplots based on a robust asymmetry measure, and violin plots, an alternative that essentially combines boxplots with nonparametric density estimates.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;This post introduces beanplots, a boxplot extension similar to violin plots but with some added features.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;These plots are generated by the &lt;strong&gt;beanplot&lt;/strong&gt; command in the &lt;em&gt;R&lt;/em&gt; package of the same name and the purpose of this post is to introduce beanplots and briefly discuss their 
advantages and disadvantages relative to the basic boxplot and the other variants discussed in previous posts.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="https://lh3.googleusercontent.com/-0iE3PKqmwaA/TXKQbxnRxcI/AAAAAAAAABA/yY_thDqvGcc/s1600/beanplotsfig01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="319" l6="true" src="https://lh3.googleusercontent.com/-0iE3PKqmwaA/TXKQbxnRxcI/AAAAAAAAABA/yY_thDqvGcc/s320/beanplotsfig01.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin: 0in 0in 0pt;"&gt;One of the examples discussed in Chapter 13 of &lt;em&gt;Exploring Data&lt;/em&gt; is based on a dataset from the book &lt;em&gt;Data&lt;/em&gt; by D.F. Andrews and A.M. Herzberg that summarizes the prevalence of &lt;em&gt;bitter pit&lt;/em&gt; in 42 apple trees, including information on supplemental nitrogen treatments applied to the trees and further chemical composition data.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Essentially, bitter pit is a cosmetic defect in apples that makes them unattractive to consumers, and the intent of the study that generated this dataset was to better understand how various factors influence the prevalence of bitter pit.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;Four supplemental nitrogen treatments are compared in this dataset, labeled A through D, including the control case of “no supplemental nitrogen treatment applied” (treatment A).&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;To quantify the relationship between the prevalence of bitter pit and the other variables in the dataset, the discussion given in &lt;em&gt;Exploring Data&lt;/em&gt; applies both classical analysis of variance (ANOVA) and logistic regression, but much can be seen by simply looking at a sufficiently 
informative representation of the data.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The figure above shows four different visualizations of the percentage of apples with bitter pit observed in the apples harvested from each tree, broken down by the four treatments considered.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The upper left plot gives side-by-side boxplot summaries of the bitter pit percentage for each tree,&amp;nbsp;with one boxplot for each&amp;nbsp;treatment.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;These summaries were generated by the &lt;strong&gt;boxplot&lt;/strong&gt; command in base &lt;em&gt;R&lt;/em&gt; with its default settings, and they suggest that the choice of treatment strongly influences the prevalence of bitter pit; indeed, they suggest that all of the “non-control” treatments considered here are harmful with respect to bitter pit, increasing its prevalence.&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;(In fact, this is only part of the story here, since&amp;nbsp;two of these treatments substantially increase the average apple weight, another important commercial consideration.)&lt;span style="mso-spacerun: yes;"&gt;&amp;nbsp; &lt;/span&gt;The upper right plot in the above figure was generated using the &lt;strong&gt;adjbox&lt;/strong&gt; command in the &lt;strong&gt;robustbase&lt;/strong&gt; package that I discussed in the second post
