This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Saturday, June 18, 2011

A Brief Introduction to Mixture Distributions

Last time, I discussed some of the advantages and disadvantages of robust estimators like the median and the MADM scale estimator, noting that certain types of datasets – like the rainfall dataset discussed last time – can cause these estimators to fail spectacularly.  An extremely useful idea in working with datasets like this one is that of mixture distributions, which describe random variables that are drawn from more than one parent population.  I discuss mixture distributions in some detail in Chapter 10 of Exploring Data in Engineering, the Sciences, and Medicine (Section 10.6), and the objective of this post is to give a brief introduction to the basic ideas of mixture distributions, with some hints about how they can be useful in connection with problems like analyzing the rainfall data discussed last time.  Subsequent posts will discuss these ideas in more detail, with pointers to specific R packages that are useful in applying these ideas to real data analysis problems.

Before proceeding, it is important to emphasize that the topic of this post and subsequent discussions is “mixture distributions” and not “mixed models,” which are the subject of considerable current interest in the statistics community.  The distinction is important because these topics are very different: mixture distributions represent a useful way of describing heterogeneity in the distribution of a variable, whereas mixed models provide a framework for incorporating both fixed and random effects in regression models.



To motivate the ideas presented here, the figure above shows four plots constructed from the Old Faithful geyser data that I have discussed previously (the R dataset faithful).  This data frame contains 272 observations of the durations of successive eruptions and the waiting time until the next eruption, both measured in minutes.  The upper left plot above is the normal Q-Q plot for the duration values, computed using the qqPlot procedure from the car package that I have discussed previously, and the upper right plot is the corresponding normal Q-Q plot for the waiting times.  Both of these Q-Q plots exhibit pronounced “kinks,” which I have noted previously often indicate the presence of a multimodal data distribution.  The lower two plots are the corresponding nonparametric density estimates (computed using the density procedure in R with its default parameters).  The point of these plots is that they give strong evidence that the distributions of both the durations and waiting times are bimodal.
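
For readers who want to reproduce plots of this kind, a minimal sketch along the following lines should work (assuming the car package is installed; the panel annotations of the figure above are approximated rather than reproduced exactly):

library(car)                         # provides qqPlot
par(mfrow = c(2, 2))                 # 2 x 2 panel layout
qqPlot(faithful$eruptions, main = "Eruption durations")          # normal Q-Q plot
qqPlot(faithful$waiting, main = "Waiting times")                 # normal Q-Q plot
plot(density(faithful$eruptions), main = "Duration density")     # default parameters
plot(density(faithful$waiting), main = "Waiting time density")   # default parameters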

In fact, bimodal and more general multimodal distributions arise frequently in practice, particularly in cases where we are observing a composite response from multiple, distinct sources.  This is the basic idea behind mixture distributions: the response x that we observe is modeled as a random variable that has some probability p1 of being drawn from distribution D1, probability p2 of being drawn from distribution D2, and so forth, with probability pn of being drawn from distribution Dn, where n is the number of components in our mixture distribution.  The key assumption here is one of statistical independence between the process of randomly selecting the component distribution Di to be drawn and these distributions themselves.  That is, we assume there is a random selection process that first generates the numbers 1 through n with probabilities p1 through pn.  Then, once we have drawn some number j, we turn to distribution Dj and draw the random variable x from this distribution.  So long as the probabilities – or mixing percentages – pi sum to 1, and all of the distributions Di are proper densities, the combination also defines a proper probability density function, which can be used as the basis for computing expectations, formulating maximum likelihood estimation problems, and so forth.  The simplest case is that for n = 2, which provides many different specific examples that have been found to be extremely useful in practice.  Because it is the easiest to understand, this post will focus on this special case.
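
As a concrete illustration of this two-stage sampling scheme, the following sketch draws N observations from an n-component mixture, given a vector of mixing percentages and a list of component random number generators; the function rmixture and its interface are hypothetical, introduced here only for illustration:

rmixture <- function(N, probs, rgenList){
  # Stage 1: randomly select a component index for each observation
  j <- sample(seq_along(probs), size = N, replace = TRUE, prob = probs)
  # Stage 2: draw each observation from its selected component distribution
  x <- numeric(N)
  for (i in seq_along(probs)){
    ni <- sum(j == i)
    if (ni > 0) x[j == i] <- rgenList[[i]](ni)
  }
  x
}
# Example with two components: 40% from a Gaussian with mean -2, 60% from one with mean +2
x <- rmixture(1000, c(0.4, 0.6),
              list(function(n) rnorm(n, -2, 1), function(n) rnorm(n, 2, 1)))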

Since the mixing percentages must sum to 1, it follows that p2 = 1 – p1 when n = 2, so it simplifies the discussion to drop the subscripts, writing p1 = p and p2 = 1 – p.  Similarly, let f(x) denote the density associated with the distribution D1 and g(x) denote the density associated with the distribution D2.  The overall probability density function p(x) is then given by:
                       

                        p(x) = p f(x) + (1 – p)g(x)



To illustrate the flexibility of this idea, consider the case where both f(x) and g(x) are Gaussian distributions.  The figure below shows four specific examples.



The upper left plot shows the standard normal distribution (i.e., mean 0 and standard deviation 1), which corresponds to taking both f(x) and g(x) as standard normal densities, for any choice of p.  I have included it here both because it represents what has been historically the most common distributional assumption, and because it provides a reference case for interpreting the other distributions considered here.  The upper right plot corresponds to the contaminated normal outlier distribution, widely adopted in the robust statistics literature.  There, the idea is that measurement errors – traditionally modeled as zero-mean Gaussian random variables with some unknown standard deviation S – mostly conform to this model, but some fraction (typically, 10% to 20%) of these measurement errors have larger variability.  The traditional contaminated normal model defines p as this contamination percentage and assumes a standard deviation of 3S for these measurements.  The upper right plot above shows the density for this contaminated normal model with p = 0.15 (i.e., 15% contamination).  Visually, this plot looks nearly identical to the one on the upper left for the standard normal distribution, but plotting them on common axes (as done in Fig. 10.17 of Exploring Data) shows that the two distributions are not identical, a point discussed further below on the basis of Q-Q plots.

The lower left plot in the figure above corresponds to a two-component Gaussian mixture distribution where both components are equally represented (i.e., p = 0.5), the first component f(x) is the standard normal distribution as before, and the second component g(x) is a Gaussian distribution with mean 3 and standard deviation 3.  The first component contributes the sharper main peak centered at zero, while the second component contributes the broad “shoulder” seen in the right half of this plot.  Finally, the lower right plot shows a mixture distribution with p = 0.40 where the first component is a Gaussian distribution with mean -2 and standard deviation 1, and the second component is a Gaussian distribution with mean +2 and standard deviation 1.  The result is a bimodal distribution with the same general characteristics as the Old Faithful geyser data. 
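
A minimal sketch of how densities like the lower two just described can be computed and plotted, working directly from the two-component formula given above:

dmix2 <- function(x, p, mu1, sd1, mu2, sd2){
  # two-component Gaussian mixture density: p*f(x) + (1 - p)*g(x)
  p * dnorm(x, mean = mu1, sd = sd1) + (1 - p) * dnorm(x, mean = mu2, sd = sd2)
}
x <- seq(-6, 12, length.out = 500)
par(mfrow = c(1, 2))
plot(x, dmix2(x, 0.5, 0, 1, 3, 3), type = "l")    # lower left: main peak at zero with a broad shoulder
plot(x, dmix2(x, 0.4, -2, 1, 2, 1), type = "l")   # lower right: bimodal distribution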



The point of these examples has been to illustrate the flexibility of the mixture distribution concept in describing everything from outliers to the heterogeneity of natural phenomena with more than one distinct generating mechanism.  Before leaving this discussion, it is worth considering the contaminated normal case a bit further.  The plot above shows the normal Q-Q plot for 272 samples of the contaminated normal distribution just described, generated using the R procedure listed below (the number 272 was chosen to make the sample size the same as that of the Old Faithful geyser data discussed above).  As before, this Q-Q plot was generated using the procedure qqPlot from the car package.  The heavy-tailed, non-Gaussian character of this distribution is evident from the fact that both the upper and lower points in this plot fall well outside the 95% confidence interval around the Gaussian reference line.  This example illustrates the power of the Q-Q plot for distributional assessment: as with the density plots shown above for the standard normal and contaminated normal distributions, nonparametric density estimates (not shown) computed from 272 samples drawn from each of these two distributions are not markedly different.

The contaminated normal random samples used to construct this Q-Q plot were generated with the following simple R function:

cngen.proc01 <- function(n = 272, cpct = 0.15, mu1 = 0, mu2 = 0, sig1 = 1, sig2 = 3, iseed = 101){
  #
  # Set the random number seed so the results are reproducible
  set.seed(iseed)
  # Draw n samples from each of the two component distributions
  y0 <- rnorm(n, mean = mu1, sd = sig1)
  y1 <- rnorm(n, mean = mu2, sd = sig2)
  #
  # Select the contaminated component with probability cpct
  flag <- rbinom(n, size = 1, prob = cpct)
  y <- y0*(1 - flag) + y1*flag
  #
  y
}

This function is called with seven parameters, all of which are given default values.  The parameter n is the sample size, cpct is the contamination percentage (expressed as a fraction, so cpct = 0.15 corresponds to 15% contamination), mu1 and mu2 are the means of the component Gaussian distributions, and sig1 and sig2 are the corresponding standard deviations.  The default values for cpct, mu1, mu2, sig1, and sig2 are those for a typical contaminated normal outlier distribution.  Finally, the parameter iseed is the seed for the random number generator: specifying its value as a default means that the procedure will return the same set of pseudorandom numbers each time it is called; to obtain an independent set, simply specify a different value for iseed.

The procedure itself generates three mutually independent random variables: y0 corresponds to the first component distribution, y1 corresponds to the second, and flag is a binomial random variable that determines which component distribution is selected for the final random number: when flag = 1 (an event with probability cpct), the contaminated value y1 is returned, and when flag = 0 (an event with probability 1 – cpct), the non-contaminated value y0 is returned.
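
Putting these pieces together, a Q-Q plot like the one discussed above can be generated along the following lines (assuming the car package is installed):

library(car)
yCN <- cngen.proc01()    # 272 contaminated normal samples, using the default parameters
qqPlot(yCN)              # normal Q-Q plot with the default confidence envelope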

The basic ideas just described – random selection of a random variable from n distinct distributions – can be applied to a very wide range of distributions, leading to an extremely wide range of data models.  The examples described above give a very preliminary indication of the power of Gaussian mixture distributions as approximations to the distribution of heterogeneous phenomena that arise from multiple sources.  Practical applications of more general (i.e., not necessarily Gaussian) mixture distributions include modeling particle sizes generated by multiple mechanisms (e.g., accretion of large particles from smaller ones and fragmentation of larger particles to form smaller ones, possibly due to differences in material characteristics), pore size distributions in rocks, polymer chain length or chain branching distributions, and many others.  More generally, these ideas are also applicable to discrete distributions, leading to ideas like the zero-inflated Poisson and negative binomial distributions that I will discuss in my next post, or to combinations of continuous distributions with degenerate distributions (e.g., concentrated at zero), leading to ideas like zero-augmented continuous distributions that may be appropriate in applications like the analysis of the rainfall data I discussed last time.  One of the advantages of R is that it provides a wide variety of support for various useful implementations of these ideas.
 

Monday, June 6, 2011

The pros and cons of robust data characterizations

Over the years, I have looked at a lot of data contaminated with outliers, the subject of Chapter 7 of Exploring Data in Engineering, the Sciences, and Medicine.  That chapter adopts the definition of an outlier presented by Barnett and Lewis in their book Outliers in Statistical Data 2nd Edition, that outliers are “data points inconsistent with the majority of values in a dataset.”  The practical importance of outliers lies in the fact that even a few of them can cause standard data characterization and analysis procedures to give highly misleading results.  Robust procedures have been developed to address these problems, and their performance in the face of outliers can be dramatically better than that of standard methods.  On the other hand, robust procedures can also fail, and when they do, they often fail spectacularly.



The above figure is a plot of successive measurements of the makeup flow rate from an industrial manufacturing process.  This example is introduced in Chapter 2 of Exploring Data and is discussed in detail in Chapter 7; the dataset itself is available as makeup.csv from the companion website for the book.  Each point in this dataset corresponds to a measurement of the flow rate of a liquid component in a process with a recycle stream: most of this component is recovered from a downstream processing step and recycled back to the upstream step, but a small amount of this material is lost to evaporation and must be “made up” by a separate feed stream.  The data points plotted above correspond to the flow rate of this makeup stream.  The quantile function in R, with its default probabilities, returns a five-number summary of the data sequence (the minimum value, lower quartile, median, upper quartile, and maximum value).  Applied to this sequence, the results are:


> quantile(MakeUpFlow)
       0%       25%       50%       75%      100% 
  0.00086 370.43747 393.35861 404.29602 439.59091 
>

Looking at the plot, it is clear that these observations partition naturally into two subsets: those values between 0 and about 200 – all falling between the minimum value and the lower quartile – and those values between about 300 and 440.  In fact, 12.2% of the data values lie between 0 and 10, and 22.3% of the values are less than 200.  Physically, these smaller-than-average flow rates correspond to shutdown episodes, where the manufacturing process is either not running, is in the process of being shut down, or is in the process of being started back up.  If we compute the mean of the complete dataset, we obtain an average makeup flow rate of 315.46, corresponding to the dotted line in the above plot.  It is clear that while this number represents the center of the overall data distribution, it is not representative of either the shutdown episodes or normal process operation, since it forms a perfect dividing line between these two data subsets.  If we exclude the shutdown episodes – specifically, if we omit all values smaller than 200 – and compute the mean of what remains, we obtain 397.56, corresponding to the solid line in the plot above.  It is clear that this number much better represents “typical process operation.”

This example illustrates the well-known outlier sensitivity of the mean.  An outlier-resistant alternative is the median, defined as follows: if the number N of data values is odd, the median is the middle element in the ordered list we obtain by sorting these values from smallest to largest; if N is even, the median is the average of the two middle elements in this list.  Applied to the complete makeup flow rate dataset, the median gives a “typical” flow rate of 393.36, fairly close to the average value computed without the shutdown episodes.  Excluding these episodes changes the median value only a little, to 399.50, again quite close to the mean value after we remove these anomalous data points.

The sensitivity of the standard deviation to outliers is even worse than that of the mean.  For the complete dataset, the computed standard deviation is 155.05, an order of magnitude larger than the value (15.22) computed from the dataset without the shutdown episodes.  While it is not as well known as the median, an outlier-resistant alternative to the standard deviation is the MADM scale estimator, constructed as follows.  First, compute the median as a reference value and compute the differences between every data point and this reference value.  Then, take the absolute values of these differences, giving a set of non-negative numbers that tell us how far each point lies from the median reference value.  Next, we compute the median of this sequence of absolute differences, which tells us how far a typical data point lies from the reference value.  Because we have used the outlier-resistant median both to compute the reference value and to compute this “typical distance” number, the result is highly outlier-resistant.  The only difficulty is that, for approximately normally distributed data with no outliers, this number is consistently smaller than the standard deviation.  To obtain an alternative estimator of the standard deviation – i.e., a number that is approximately equal to the standard deviation when we compute it from a large, outlier-free collection of approximately Gaussian data – we need to multiply the number just defined by 1.4826.  (A derivation of this result is given in Chapter 7 of Exploring Data.)  This bias-corrected scale estimate is available as the built-in function mad in R, and applying it to the makeup flow rate dataset gives a scale estimate of 20.22, about the same magnitude as the standard deviation of the data sequence with the shutdown episodes removed.  The MADM scale estimate is not completely immune to outliers, so when it is applied to the dataset without the shutdown episodes, it gives a somewhat smaller scale estimate of 13.40.  Still, both of these estimates are much more consistent with the standard deviation of the normal operation data than the uncorrected standard deviation computed from the complete dataset.
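
The following sketch collects the computations quoted in the last few paragraphs, assuming the makeup flow rate values have been read into the vector MakeUpFlow used above (the numbers in the comments are those quoted in the text):

normalOp <- MakeUpFlow[MakeUpFlow >= 200]   # exclude the shutdown episodes
mean(MakeUpFlow);   mean(normalOp)          # 315.46 vs. 397.56
median(MakeUpFlow); median(normalOp)        # 393.36 vs. 399.50
sd(MakeUpFlow);     sd(normalOp)            # 155.05 vs. 15.22
mad(MakeUpFlow);    mad(normalOp)           # 20.22 vs. 13.40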

The key point of this discussion has been to provide a simple, real-data illustration of both the considerable outlier sensitivity of the “standard” location and scale estimators (i.e., the mean and standard deviation) and the existence and performance of robust alternatives to both (i.e., the median and MADM scale estimator).  As I noted at the beginning of this discussion, however, while robust estimators can work extremely well – as in this example – they can also fail spectacularly.  In general, robust estimators are designed to protect us against anomalous observations that occur out in the tails of the distribution.  In the case of “light-tailed” distributions like the uniform distribution, robust measures like the MADM scale estimator can exhibit surprising behavior.  In particular, in the absence of outliers, we expect the standard deviation and the MADM scale estimate to give us roughly comparable results.  If the standard deviation is much larger than the MADM scale estimate – as in the makeup flow rate example just discussed – this can usually be taken as an indication of either a naturally heavy-tailed data distribution (e.g., the Student’s t distribution with few degrees of freedom), or, as in this example, the presence of contaminating outliers.  In the face of light-tailed data distributions, however, the MADM scale estimate can be substantially larger than the standard deviation, and we need to be aware of this fact if we use the MADM scale estimator as a measure of data spread. 




More serious problems arise in the case of coarsely quantized data.  In particular, it was once popular in engineering applications to record variables like temperature only to single digit accuracy (e.g., temperature to the nearest tenth of a degree).  This practice was motivated both by limited memory available in early process monitoring computers, and by a recognition that this was roughly the limit of measurement accuracy, anyway.  This kind of data truncation does degrade traditional characterizations like the standard deviation somewhat, but the consequences for the MADM scale estimate can be catastrophic.  The plot above shows a sequence of 100 Gaussian data samples, with mean 0 and standard deviation 0.15, represented two ways.  The line represents the original Gaussian data sample, while the solid circles correspond to these data values truncated to a single digit (e.g., the first value in the original data sequence is -0.11658569, which gets truncated to -0.1).  Both the standard deviation and the MADM scale estimate are reasonably accurate when computed from the unmodified data sequence: the standard deviation is 0.137, while the MADM scale estimate is 0.133, both within about 10% of the correct value of 0.150.  If we compute the standard deviation from the truncated data samples, the value declines to 0.103, reflecting the fact that by truncating the data from the 9 digits generated by the random number generator to 1, we are consistently reducing the magnitude of each number, so now our estimation error is more like 30%.  Here, however, the MADM scale estimate fails completely, returning the value 0.  The reason for this is that, on truncation, 55% of these data values are zero (note that any value between -0.09 and 0.09 will be truncated to zero, and these data values occur quite commonly in a zero-mean, normally distributed data sequence with standard deviation 0.15).  Whenever more than 50% of the data values are equal, we don’t need to bother computing either the median or the MADM scale estimator.  In particular, the median of this data sequence is simply equal to this majority value.  Similarly, since more than 50% of the observed data values lie a distance of zero from the median, the median of the absolute deviation sequence on which the MADM scale estimate is based is necessarily zero.  Thus, the MADM scale for any such sequence is zero, as in this example; this behavior is commonly called implosion.
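
A small simulation in the same spirit (not the exact sequence plotted above, which depends on the seed used there) illustrates the implosion mechanism; whether mad actually returns zero for a particular simulated sequence depends on whether more than half of the truncated values happen to be zero:

set.seed(3)                            # hypothetical seed, for reproducibility only
x <- rnorm(100, mean = 0, sd = 0.15)   # original Gaussian sequence
xTrunc <- trunc(10 * x) / 10           # truncate to a single decimal digit
sd(x);  mad(x)                         # both should be close to the true value of 0.15
sd(xTrunc)                             # smaller, since truncation shrinks each value toward zero
mean(xTrunc == 0)                      # fraction of exact zeros after truncation
mad(xTrunc)                            # implodes to exactly 0 whenever that fraction exceeds 0.5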




This coarse quantization example indirectly illustrates one of the key distinctions between continuous and discrete variables: for a continuously distributed random variable, ties (repeated values) occur with probability zero, but for discrete random variables, such ties are characteristic.  The coarse quantization example represents something in between, being a discrete approximation of a continuous variable.  The distinction is important because it is precisely these ties that were responsible for the complete failure of the MADM scale estimator in the example just described.  There are other cases where the underlying data distribution has both continuous and discrete components, and these cases can also cause rank- and order-based data characterizations like the MADM scale estimator to behave very badly.  The plot above shows rainfall measurements made every half hour for two months (a total of 732 data points), included in the dataset associated with Chapter 14 (“Half-hourly Precipitation and Streamflow, River Hirnant, Wales, U.K., November and December, 1972”) from the book Data: A Collection of Problems from Many Fields for the Student and Research Worker (Springer Series in Statistics), by D.F. Andrews and A.M. Herzberg.  It was noted above that the built-in function quantile in R generates a five-number summary for a data sequence, but by specifying the probs parameter, it is possible to obtain finer-grained data characterizations.  For example, the following command gives the deciles of this rainfall dataset:


> quantile(Rain, probs = seq(0, 1, 0.1))
   0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100%
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.040 0.146 0.440 2.800
>

It is clear from these results that more than 60% of these recorded rainfall values are identically zero, reflecting the fact that it is not raining during most of these half-hourly intervals.  As with the coarse quantization example discussed above, the MADM scale estimator implodes for this example, returning the value zero.  Similarly, the median value of these numbers is also zero, but the mean is not (specifically, it is 0.136), and the standard deviation is not (this value is 0.354). 
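
A sketch of these computations, along with the segmented characterizations discussed below, assuming the half-hourly rainfall values are in the vector Rain used above (the numbers in the comments are those quoted in the text):

mad(Rain)                # implodes to 0, since more than half of the values are zero
median(Rain)             # also 0
mean(Rain); sd(Rain)     # 0.136 and 0.354, respectively
mean(Rain > 0)           # fraction of half-hourly intervals with any recorded rain (about 0.346)
mean(Rain[Rain > 0])     # average rainfall amount, given that it is raining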

Even though the mean and standard deviation give more reasonable-sounding results for the rainfall dataset than their robust counterparts do, it is not completely clear how to interpret these numbers.  In particular, we might be interested in asking any or all of the following three types of question:

1. What is the typical rainfall for this region in a month? (or in a week, in the entire two-month period, in a day, etc.)
2. How frequently does it rain?
3. When it does rain, what is the average amount?

Characterizations like the mean and standard deviation may be used to address the first of these three questions, but not the other two.  As a preliminary answer to the second question, we can note that rainfall occurs in 34.6% of the half-hourly intervals recorded here, but this requires a slightly more detailed look at the data than what the mean and standard deviation give us.  Additional work would be required if we wanted an estimate of how probable it was to rain on any given day, since we would need to summarize the data at the level of days instead of half-hourly observations.  Finally, to address questions of the third type, it is necessary to segment the dataset into a “rainy part” and a “dry part.”  Note that this segmentation is analogous to that made for the makeup flow rate dataset, but the difference is that here we are interested in characterizing the minority part of the dataset (i.e., “the outliers”) instead of the majority part, which is what the robust estimators are good at.  Special types of statistical models have been developed to deal with these situations, and I will discuss some of these in my next post.  For now, I will conclude with two key observations.

The first is that of Colin Mallows (C.L. Mallows, “Robust methods – some examples of their use,” American Statistician, vol. 33, 1979, pages 179-184), which I quote at the end of Chapter 7 of Exploring Data:

           
“A simple and useful strategy is to perform one’s analysis both robustly and by standard methods and to compare the results.  If the differences are minor, either set can be presented.  If the differences are not, one must perforce consider why not, and the robust analysis is already at hand to guide the next steps. 

“The importance of these considerations is enhanced when we are dealing with large amounts of data, since then examining all the data in detail is impractical, and we are forced to contemplate working with data that is, almost certainly, partially bad and with models that are almost certainly inadequate.”
The second observation is that Mallows’ advice is practical for almost any type of data analysis we want to consider, since robust methods exist to handle a very wide variety of different analysis tasks.  An excellent introduction to many of these robust methods is given by Rand Wilcox in his book Introduction to Robust Estimation and Hypothesis Testing, Second Edition (Statistical Modeling and Decision Science), which includes R procedures for implementing most of the methods he discusses.

Saturday, May 21, 2011

The distribution of interestingness

On April 22, David Landy posed a question about the distribution of interestingness values in response to my April 3rd post on “Interestingness Measures.”  He noted that the survey paper by Hilderman and Hamilton that I cited there makes the following comment:

“Our belief is that a useful measure of interestingness should generate index values that are reasonably distributed throughout the range of possible values (such as in a SND)”

David's question was whether there is a way of showing that interestingness measures should be normally distributed.  In fact, this question raises a number of important issues, and this post explores a few of them.

First, as I noted in my comment in response to this question, the short answer is that the “standard behavior” we expect from most estimators – i.e., most characterizations computed from a set of uncertain data values – is asymptotic normality.  That is, for sufficiently large data samples, we expect most of the standard characterizations we might compute – variance estimates, correlation estimates, least squares regression coefficients, or linear model parameters computed using robust estimators like Least Trimmed Squares (LTS) – to be approximately normally distributed.  I discuss this idea in Chapter 6 of Exploring Data in Engineering, the Sciences, and Medicine, beginning with the Central Limit Theorem (CLT), which says, very roughly, that “averages of N numbers tend to Gaussian limits as N becomes infinitely large.”  There are exceptions – cases where asymptotic normality does not hold – but the phenomenon is common enough to make violations noteworthy.  In fact, one of the reasons the conceptually simpler least median of squares (LMS) estimator was dropped in favor of the more complex least trimmed squares (LTS) estimator is that the LMS estimator is not asymptotically normal, approaching a non-normal limiting distribution at an anomalously slow rate.  In contrast, the LTS estimator is asymptotically normal, with the standard convergence rate (i.e., the standard deviation of the estimator approaches zero inversely with the square root of the sample size as the sample becomes infinitely large).

The longer – and far less satisfying – answer to the question of how interestingness measures should be distributed is, “it depends,” as the following discussion illustrates.

One of the many cases where asymptotic normality has been shown to hold is Gini’s mean difference, which forms the basis for one of the four interestingness measures I discussed in my earlier post.  While this result appears to provide a reasonably specific answer to the question of how this interestingness measure is distributed, there are two closely related issues that greatly complicate the matter.  The first is a question that plagues many applications of asymptotic normality, which is quite commonly used for constructing approximate confidence intervals: how large must the sample be for this asymptotic approximation to be reasonable?  (The confidence intervals described in my first post on odds ratios, for example, were derived on the basis of asymptotic normality.)  Once again, the answer is, “it depends,” as the examples discussed here demonstrate.  The second issue is that the interestingness measures that I describe in Exploring Data and that I discussed in my previous post are normalized measures.  In particular, the Gini interestingness measure is defined as Gini’s mean difference divided by its maximum possible value, giving a measure that is bounded between 0 and 1.  An important aspect of the Gaussian distribution is that it assigns a nonzero probability to any real value, although the probability associated with values more than a few standard deviations from the mean rapidly becomes very small.  Thus, for a bounded quantity like the Gini interestingness measure, a Gaussian approximation can only be reasonable if the standard deviation is small enough that the probability of exhibiting values less than 0 or greater than 1 is acceptably small.  Further, this “maximum reasonable standard deviation” will depend on the mean value of the estimator: if the computed interestingness measure is approximately 0.5, the maximum feasible standard deviation for a reasonable Gaussian approximation is considerably larger than if the computed interestingness measure is 0.01 or 0.99.  As a practical matter, this usually means that, for a fixed sample size, the empirical distribution of normalized interestingness values for sequences exhibiting a given degree of heterogeneity (i.e., a specified “true” interestingness value) will vary significantly in shape (specifically, in symmetry) as this “true” interestingness value varies from near 0, to approximately 0.5, to near 1.  This idea is easily demonstrated in R on the basis of simulated examples.

The essential idea behind this simulation study is the following.  Suppose we have a categorical variable that can assume any one of L distinct levels, and suppose we want to generate random samples of this variable where each level i has a certain probability pi of occurring in our sample.  The multinomial distribution discussed in Sec. 10.2.1 of Exploring Data is appropriate here, characterizing the number of times ni we observe each of these levels in a random sample of size N, given the probability of observing each level i.  The simulation strategy considered here then proceeds by specifying the number of levels (here, I take L = 4) and the probability of observing each level.  Then, I generate a large number B of random samples (here, I take B = 1000), each generated from the appropriate multinomial distribution.  In what follows, I consider the following four cases:

  1. Case A: four equally-represented levels, each with probability 0.25;
  2. Case B: one dominant level (p1 = 0.97) and three rare levels (p2 = p3 = p4 = 0.01);
  3. Case C: three equally-represented majority levels (p1 = p2 = p3 = 0.33) and one rare minority level (p4 = 0.01);
  4. Case D: four distinct probabilities (p1 = 0.05, p2 = 0.10, p3 = 0.25, p4 = 0.60).

To generate these sequences, I use R’s built-in multinomial random number generator, rmultinom, with parameters n = 1000 (this corresponds to B in the above discussion: I want to generate 1000 multinomial random vectors), size = 100 (initially, I take each sample to be of size N = 100; later, I consider larger samples), and, for Case A, I specify prob = c(0.25, 0.25, 0.25, 0.25), corresponding to the four pi values specified above (for the other three cases, I specify the prob parameter appropriately).  The value returned by this procedure is a matrix, with one row for each level and one column for each simulation; for each column, the number in the ith row is ni, the number of times the ith level is observed in that simulated response.  Dividing these counts by the sample size gives the fractional representation of each level, from which the normalized interestingness measures are then computed.  For the Gini interestingness measure and Case A, these steps are accomplished using the following R code:

set.seed(101)
pvectA <- c(0.25, 0.25, 0.25, 0.25)
CountsA <- rmultinom(n = 1000, size = 100, prob = pvectA)
FractionsA <- CountsA/100
GiniA <- apply(FractionsA, MARGIN=2, gini.proc)
                                                                                                                                    
The first line of this sequence sets the seed value to initialize the random number generator: if you don’t do this, running the same procedure again will give you results that are statistically similar but not exactly the same as before; specifying the seed guarantees you get exactly the same results the next time you execute the command sequence.  The second line defines the “true” heterogeneity of each sequence (i.e., the underlying probability of observing each level that the random number generator uses to simulate the data).  The third line invokes the multinomial random number generator to return the desired matrix of counts, where each column represents an independent simulation and the rows correspond to the four possible levels of the response.  The fourth line converts these counts to fractions by dividing by the size of each sample, and the last line uses the R function apply to invoke the Gini interestingness procedure gini.proc for each column of the fractions matrix: setting MARGIN = 2 tells R to “apply the indicated function gini.proc to the columns of FractionsA.”  (It would also be possible to write this procedure as a loop in R, but the apply function is much faster.)  Replacing gini.proc by bray.proc in the above sequence of commands would generate the Bray interestingness measures, using shannon.proc would generate the Shannon interestingness measures, and using simpson.proc would generate the Simpson interestingness measures (note that all of these procedures are available from the companion website for Exploring Data). 
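
The procedures gini.proc, bray.proc, shannon.proc, and simpson.proc themselves are not reproduced in this post.  For readers who want something self-contained, the following hypothetical stand-in for gini.proc implements the definition given earlier (Gini’s mean difference of the level fractions, normalized by its maximum possible value) and is consistent with the “true” Gini values quoted below for the four cases; it is a sketch for illustration, not the companion-website code:

gini.proc <- function(p){
  # Gini's mean difference of the level fractions p, normalized so the result
  # lies between 0 (all levels equally represented) and 1 (one level completely
  # dominant); hypothetical stand-in, for illustration only
  pairDiffs <- abs(outer(p, p, "-"))
  sum(pairDiffs) / (2 * (length(p) - 1))
}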

For this first example – i.e., Case A, where all four levels are equally represented – note that the “true” value for any of the four normalized interestingness measures I discussed in my earlier post is 0, the smallest possible value.  Since the random samples generated by the multinomial random number generator rarely have identical counts for the four levels, the interestingness measure computed from each sample is generally larger than 0, but it can never be smaller than this value.  This observation suggests two things.  The first is that the average of these interestingness values is larger than 0, implying that all of the normalized measures I considered exhibit a positive bias for this case.  The second consequence of this observation is that the distribution of each of these interestingness values is probably asymmetric.  The first of these conclusions is evident in the plot below, which shows nonparametric density estimates for the Gini measure computed from the 1000 random simulations generated for each case.  The left-most peak corresponds to Case A, and it is clear from this plot that the average value is substantially larger than zero (specifically, the mean of these simulations is 0.114).  This plot also suggests asymmetry, although the visual evidence is not overwhelming. 



The right-most peak in the above plot corresponds to Case B, for which the “true” Gini interestingness value is 0.96.  Here, the peak is much narrower, although again the average of the simulation values (0.971) is slightly larger than the true value, which is represented by the right-most vertical dashed line.  Case C is modeled on the CapSurf variable from the UCI mushroom dataset that I discussed in my earlier post on interestingness measures: the first three levels are equally distributed, but there is also an extremely rare fourth level.  The “true” Gini measure for this case is 0.32, corresponding to the second-from-left dashed vertical line in the above plot, and – as with the previous two cases –  the distribution of Gini values computed from the multinomial random samples is biased above this correct value.  Here, however, the shape of the distribution is more symmetric, suggesting possible conformance with the Gaussian expectations suggested above.  Finally, Case D is the third peak from the left in the above plot, and here the distribution appears reasonably symmetric around the correct value of 0.6, again indicated by a vertical dashed line.



The above plot shows the corresponding results for the Shannon interestingness measure for the same four cases.  Here, the distributions for the different cases exhibit a much wider range of variation than they did for the Gini measure.  As before, Case D (the smoothest of the four peaks) appears reasonably symmetric in its distribution around the “true” value of 0.255, possibly consistent with a normal distribution, but this is clearly not true for any of the other three cases.  The asymmetry seen for Case A (the left-most peak) appears fairly pronounced, the density for Case C (the second-from-left peak) appears to be multi-modal, and the right-most peak (Case B) exhibits both asymmetry and multimodal character.  The key point here is that the observed distribution of interestingness values depends on both the specific simulation case considered (i.e., the true probability distribution of the levels), and the particular interestingness measure chosen. 



As noted above, if we expect approximate normality for a data characterization on the basis of its asymptotic behavior, an important practical question is whether we are working with large enough samples for asymptotic approximations to be at all reasonable.  Here, the sample size is N = 100: we are generating and characterizing a lot more sequences than this, but each characterization is based on 100 data observations.  In the UCI mushroom example I discussed in my previous post, the sample size was much larger: that dataset characterizes 8,124 different mushrooms, so the interestingness measures considered there were based on a sample size of N = 8124.  In fact, this can make an important difference, as the following example demonstrates.  The above plot compares the Gini measure density estimates obtained from 1000 random samples, each of length N = 100 (the dashed curve), with those obtained from 1000 random samples, each of length N = 8124 (the solid curve), for Case D.  The vertical dotted line corresponds to the true Gini value of 0.6 for this case.  Note that both of these densities appear to be symmetrically distributed around this correct value, but the distribution of values is much narrower when they are computed from the larger samples.
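
The sample size comparison just described can be reproduced along the following lines; gini.proc is either the companion-website procedure or the stand-in sketched above, and pvectD simply follows the naming pattern used earlier:

set.seed(101)
pvectD <- c(0.05, 0.10, 0.25, 0.60)
GiniD100  <- apply(rmultinom(1000, size = 100,  prob = pvectD) / 100,  MARGIN = 2, gini.proc)
GiniD8124 <- apply(rmultinom(1000, size = 8124, prob = pvectD) / 8124, MARGIN = 2, gini.proc)
d100 <- density(GiniD100)
d8124 <- density(GiniD8124)
plot(d100, lty = 2, xlim = range(d100$x, d8124$x), ylim = range(d100$y, d8124$y),
     main = "Gini measure, Case D")    # N = 100: broad dashed curve
lines(d8124)                           # N = 8124: much narrower solid curve
abline(v = 0.6, lty = 3)               # true Gini value for Case D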

These results are consistent with our expectations of asymptotic normality for the Gini measure, at least for cases that are not too near either extreme.  Further confirmation of these general expectations for a “well-behaved interestingness measure” is provided by the four Q-Q plots shown below, generated using the qqPlot command in the car package in R that I discussed in a previous post.  These plots compare the Shannon measures computed from Cases C (upper two plots) and D (lower two plots), for the smaller sample size N = 100 (left-hand plots) and the larger sample size N = 8124 (right-hand plots).  In particular, in the upper left Q-Q plot (Case C, N = 100), compelling evidence of non-normality is given by both the large number of points falling outside the 95% confidence intervals for this Q-Q plot, and the “kinks” in the plot that reflect the multimodality seen in the Case C peak of the Shannon density plot above.  In contrast, all of the points in the upper right Q-Q plot (Case C, N = 8124) fall within the 95% confidence limits for this plot, suggesting that the approximate normality assumption is reasonable for the larger sample.  Qualitatively similar behavior is seen for Case D in the two lower plots: the results computed for N = 100 exhibit some evidence for asymmetry, with both the lower and upper tails lying above the upper 95% confidence limit for the Q-Q plot, while the results computed for N = 8124 show no such deviations.



It is important to emphasize, however, that approximate normality is not always to be expected.  The plot below shows Q-Q plots for the Gini measure (upper plots) and the Shannon measure (lower plots) for Case A, computed from both the smaller samples (N = 100, left-hand plots) and the larger samples (N = 8124, right-hand plots).  Here, all of these plots provide strong evidence for distributional asymmetry, consistent with the point made above that, while the computed interestingness value can exceed the true value of 0 for this case, it can never fall below this value.  Thus, we expect both a positive bias and a skewed distribution of values, even for large sample sizes, and we see both of these features here.



To conclude, then, I reiterate what I said at the beginning: the answer to the question of what distribution we should expect for a “good” interestingness measure is, “it depends.”  But now I can be more specific.  The answer depends, first, on the sample size from which we compute the interestingness measure: the larger the sample, the more likely this distribution is to be approximately normal.  This is, essentially, a restatement of the definition of asymptotic normality, but it raises the key practical question of how large N must be for this approximation to be reasonable.  That answer also depends on at least two things: first, the specific interestingness measure we consider, and second, the true heterogeneity of the data sample from which we compute it.  Both of these points were illustrated in the first two plots shown above: the shape of the estimated densities varied significantly both with the case considered (i.e., the “true” heterogeneity) and with whether we considered the Gini measure or the Shannon measure.  Analogous comments apply to the other interestingness measures I discussed in my previous post.

Saturday, May 7, 2011

Computing Odds Ratios in R

In my last post, I discussed the use of odds ratios to characterize the association between edibility and binary mushroom characteristics for the mushrooms characterized in the UCI mushroom dataset.  I did not, however, describe those computations in detail, and the purpose of this post is to give a brief discussion of how they were done.  That said, I should emphasize that the primary focus of this blog is not R programming per se, but rather the almost limitless number of things that R allows you to do in the realm of exploratory data analysis.  For more detailed discussions of the mechanics of R programming, two excellent resources are The R Book by Michael J. Crawley, and Modern Applied Statistics with S (Statistics and Computing) by W.N. Venables and B.D. Ripley.

As a reminder, the odds ratio, which I discuss in Chapter 13 of Exploring Data in Engineering, the Sciences, and Medicine, may be viewed as an association measure between binary variables.  As I discussed last time, for two binary variables, x and y, each taking the values 0 and 1, the odds ratio is defined on the basis of the following four numbers:

            N00 = the number of data records with x = 0 and y = 0
            N01 = the number of data records with x = 0 and y = 1
            N10 = the number of data records with x = 1 and y = 0
            N11 = the number of data records with x = 1 and y = 1

Specifically, the odds ratio is given by the following expression:

OR = (N00 N11) / (N01 N10)

Similarly, confidence intervals for the odds ratio are easily constructed by appealing to the asymptotic normality of log OR, which has an asymptotic variance given by the sum of the reciprocals of these four numbers (so the standard error is the square root of this sum).  The R procedure oddsratioWald.proc available from the companion website for Exploring Data computes the odds ratio and the upper and lower confidence limits at a specified level alpha from these four values:

oddsratioWald.proc <- function(n00, n01, n10, n11, alpha = 0.05){
  #
  #  Compute the odds ratio between two binary variables, x and y,
  #  as defined by the four numbers nij:
  #
  #    n00 = number of cases where x = 0 and y = 0
  #    n01 = number of cases where x = 0 and y = 1
  #    n10 = number of cases where x = 1 and y = 0
  #    n11 = number of cases where x = 1 and y = 1
  #
  OR <- (n00 * n11)/(n01 * n10)
  #
  #  Compute the Wald confidence intervals:
  #
  siglog <- sqrt((1/n00) + (1/n01) + (1/n10) + (1/n11))
  zalph <- qnorm(1 - alpha/2)
  logOR <- log(OR)
  loglo <- logOR - zalph * siglog
  loghi <- logOR + zalph * siglog
  #
  ORlo <- exp(loglo)
  ORhi <- exp(loghi)
  #
  oframe <- data.frame(LowerCI = ORlo, OR = OR, UpperCI = ORhi, alpha = alpha)
  oframe
}

Including “alpha = 0.05” in the parameter list fixes the default value for alpha at 0.05, which yields the 95% confidence intervals for the computed odds ratio, based on the Wald approximation described above.  An important practical point is that these intervals become infinitely wide if any of the four numbers Nij are equal to zero; also, note that in this case, the computed odds ratio is either zero or infinite.  Finally, it is worth noting that if the numbers Nij are large enough, the procedure just described can encounter numerical overflow problems (i.e., the products in either the numerator or the denominator become too large to be represented in machine arithmetic).  If this is a possibility, a better alternative is to regroup the computations as follows:

OR = (N00 / N01) x (N11 / N10)
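
In code, this regrouping amounts to a one-line change in the procedure above, for example:

OR <- (n00 / n01) * (n11 / n10)    # same value as before, but avoids forming large products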


To use the routine just described, it is necessary to have the four numbers defined above, which form the basis for a two-by-two contingency table.  Because contingency tables are widely used in characterizing categorical data, these numbers are easily computed in R using the table command.  As a simple example, the following code reads the UCI mushroom dataset and generates the two-by-two contingency table for the EorP and GillSize attributes:

> mushrooms <- read.csv("mushroom.csv")
>
> table(mushrooms$EorP, mushrooms$GillSize)
  
       b    n
  e 3920  288
  p 1692 2224
>
>

(Note that the first line reads the csv file containing the mushroom data; for this command to work as shown, it is necessary for this file to be in the working directory.  Alternatively, you can change the working directory using the setwd command.) 


To facilitate the computation of odds ratios, the following preliminary procedure combines the table command with the oddsratioWald.proc procedure, allowing you to compute the odds ratio and its level-alpha confidence interval from the two-level variables directly:


TableOR.proc00 <- function(x,y,alpha=0.05){
  #
  xtab <- table(x,y)
  n00 <- xtab[1,1]
  n01 <- xtab[1,2]
  n10 <- xtab[2,1]
  n11 <- xtab[2,2]
  oddsratioWald.proc(n00,n01,n10,n11,alpha)
}

The primary disadvantage of this procedure is that it doesn’t tell you which levels of the two variables are being characterized by the computed odds ratio.  In fact, this characterization describes the first level of each of these variables, and the following slight modification makes this fact explicit:


TableOR.proc <- function(x,y,alpha=0.05){
   #
   xtab <- table(x,y)
   n00 <- xtab[1,1]
   n01 <- xtab[1,2]
   n10 <- xtab[2,1]
   n11 <- xtab[2,2]
   #
   outList <- vector("list",2)
   outList[[1]] <- paste("Odds ratio between the level [",dimnames(xtab)[[1]][1],"] of the first variable and the level [",dimnames(xtab)[[2]][1],"] of the second variable:",sep=" ")
   outList[[2]] <- oddsratioWald.proc(n00,n01,n10,n11,alpha)
   outList
 }

Specifically, I have used the fact that the dimension names of the 2x2 table xtab correspond to the levels of the variables x and y, and I have used the paste command to include these values in a text string displayed to the user.  (I have enclosed the levels in square brackets to make them stand out from the surrounding text, particularly useful here since the levels are coded as single letters.)  Applying this procedure to the mushroom characteristics EorP and GillSize yields the following results:

> TableOR.proc(mushrooms$EorP, mushrooms$GillSize)
[[1]]
[1] "Odds ratio between the level [ e ] of the first variable and the level [ b ] of the second variable:"

[[2]]
   LowerCI       OR  UpperCI alpha
1 15.62615 17.89073 20.48349  0.05

>

Almost certainly, the formatting I have used here could be improved – probably a lot – but the key point is to provide a result that is reasonably complete and easy to interpret. 

Finally, I noted in my last post that if we are interested in using odds ratios to compare or rank associations, it is useful to code the levels so that the computed odds ratio is larger than 1.  In particular, note that applying the above procedure to characterize the relationship between edibility and the Bruises characteristic yields:

> TableOR.proc(mushrooms$EorP, mushrooms$Bruises)
[[1]]
[1] "Odds ratio between the level [ e ] of the first variable and the level [ f ] of the second variable:"

[[2]]
     LowerCI        OR   UpperCI alpha
1 0.09014769 0.1002854 0.1115632  0.05

>

It is clear from these results that both Bruises and GillSize exhibit odds ratios with respect to mushroom edibility that are significantly different from the neutral value 1 (i.e., the 95% confidence interval excludes the value 1 in both cases), but it is not obvious which variable has the stronger association, based on the available data.  The following procedure automatically restructures the computation so that the computed odds ratio is larger than or equal to 1, allowing us to make this comparison:


AutomaticOR.proc <- function(x,y,alpha=0.05){
   #
   # Build the 2x2 contingency table and extract its four counts
   xtab <- table(x,y)
   n00 <- xtab[1,1]
   n01 <- xtab[1,2]
   n10 <- xtab[2,1]
   n11 <- xtab[2,2]
   #
   # Compute the odds ratio under the default coding; if it is less than 1,
   # reverse the coding of the second variable (y) so the reported value is >= 1
   rawOR <- (n00*n11)/(n01*n10)
   if (rawOR < 1){
     n01 <- xtab[1,1]
     n00 <- xtab[1,2]
     n11 <- xtab[2,1]
     n10 <- xtab[2,2]
     iLevel <- 2
   }
   else{
     iLevel <- 1
   }
   # Report which levels the odds ratio characterizes, then the Wald results
   outList <- vector("list",2)
   outList[[1]] <- paste("Odds ratio between the level [",dimnames(xtab)[[1]][1],"] of the first variable and the level [",dimnames(xtab)[[2]][iLevel],"] of the second variable:",sep=" ")
   outList[[2]] <- oddsratioWald.proc(n00,n01,n10,n11,alpha)
   outList
 }

Note that this procedure first constructs the 2x2 table on which everything is based and then computes the odds ratio in the default coding: if this value is smaller than 1, the coding of the second variable (y) is reversed.  The odds ratio and its confidence interval are then computed and the levels of the variables used in computing it are presented as before.  Applying this procedure to the Bruises characteristic yields the following result, from which we can see that GillSize appears to have the stronger association, as noted last time:

> AutomaticOR.proc(mushrooms$EorP, mushrooms$Bruises)
[[1]]
[1] "Odds ratio between the level [ e ] of the first variable and the level [ t ] of the second variable:"

[[2]]
   LowerCI       OR  UpperCI alpha
1 8.963532 9.971541 11.09291  0.05

>

Well, that’s it for now.  Next time, I will come back to the question of how we should expect interestingness measures to be distributed and whether those expectations are met in practice.