ExploringDataBlog: June 2011

Saturday, June 18, 2011

A Brief Introduction to Mixture Distributions

Last time, I discussed some of the advantages and disadvantages of robust estimators like the median and the MADM scale estimator, noting that certain types of datasets – like the rainfall dataset discussed last time – can cause these estimators to fail spectacularly. An extremely useful idea in working with datasets like this one is that of mixture distributions, which describe random variables that are drawn from more than one parent population. I discuss mixture distributions in some detail in Chapter 10 of Exploring Data in Engineering, the Sciences, and Medicine (Section 10.6), and the objective of this post is to give a brief introduction to the basic ideas of mixture distributions, with some hints about how they can be useful in connection with problems like analyzing the rainfall data discussed last time. Subsequent posts will discuss these ideas in more detail, with pointers to specific R packages that are useful in applying these ideas to real data analysis problems.

Before proceeding, it is important to emphasize that the topic of this post and subsequent discussions is “mixture distributions” and not “mixture models,” which are the subject of considerable current interest in the statistics community. The distinction is important because these topics are very different: mixture distributions represent a useful way of describing heterogeneity in the distribution of a variable, whereas mixture models provide a foundation for incorporating both deterministic and random predictor variables in regression models.

To motivate the ideas presented here, the figure above shows four plots constructed from the Old Faithful geyser data that I have discussed previously (the R dataset faithful). This data frame contains 272 observations of the durations of successive eruptions and the waiting time until the next eruption, both measured in minutes. The upper left plot above is the normal Q-Q plot for the duration values, computed using the qqPlot procedure from the car package that I have discussed previously, and the upper right plot is the corresponding normal Q-Q plot for the waiting times. Both of these Q-Q plots exhibit pronounced “kinks,” which I have noted previously often indicate the presence of a multimodal data distribution. The lower two plots are the corresponding nonparametric density estimates (computed using the density procedure in R with its default parameters). The point of these plots is that they give strong evidence that the distributions of both the durations and waiting times are bimodal.

In fact, bimodal and more general multimodal distributions arise frequently in practice, particularly in cases where we are observing a composite response from multiple, distinct sources. This is the basic idea behind mixture distributions: the response x that we observe is modeled as a random variable that has some probability p₁ of being drawn from distribution D₁, probability p₂ of being drawn from distribution D₂, and so forth, with probability p_n of being drawn from distribution D_n, where n is the number of components in our mixture distribution. The key assumption here is one of statistical independence between the process of randomly selecting the component distribution D_i to be drawn and these distributions themselves. That is, we assume there is a random selection process that first generates the numbers 1 through n with probabilities p₁ through p_n. Then, once we have drawn some number j, we turn to distribution D_j and draw the random variable x from this distribution. So long as the probabilities – or mixing percentages – p_i sum to 1, and all of the distributions D_i are proper densities, the combination also defines a proper probability density function, which can be used as the basis for computing expectations, formulating maximum likelihood estimation problems, and so forth. The simplest case is that for n = 2, which provides many different specific examples that have been found to be extremely useful in practice. Because it is the easiest to understand, this post will focus on this special case.

Since the mixing percentages must sum to 1, it follows that p₂ = 1 – p₁ when n = 2, it simplifies the discussion to drop the subscripts, writing p₁ = p and p₂ = 1 – p. Similarly, let f(x) denote the density associated with the distribution D₁ and g(x) denote the density associated with the distribution D₂. The overall probability density function p(x) is then given by:

p(x) = p f(x) + (1 – p)g(x)

To illustrate the flexibility of this idea, consider the case where both f(x) and g(x) are Gaussian distributions. The figure below shows four specific examples.

The upper left plot shows the standard normal distribution (i.e., mean 0 and standard deviation 1), which corresponds to taking both f(x) and g(x) as standard normal densities, for any choice of p. I have included it here both because it represents what has been historically the most common distributional assumption, and because it represents a reference case in interpreting the other distributions considered here. The upper right plot corresponds to the contaminated normal outlier distribution, widely adopted in the robust statistics literature. There, the idea is that measurement errors – traditionally modeled as zero-mean Gaussian random variables with some unknown standard deviation S – mostly conform to this model, but some fraction (typically, 10% to 20%) of these measurement errors have larger variability. The traditional contaminated normal model defines p as this contamination percentage and assumes a standard deviation of 3S for these measurements. The upper right plot above shows the density for this contaminated normal model with p = 0.15 (i.e., 15% contamination). Visually, this plot looks identical to the one on the upper left for the standard normal distribution: plotting them on common axes (as done in Fig. 10.17 of Exploring Data) shows that these distributions are not identical, a point discussed further below on the basis of Q-Q plots.

The lower left plot in the figure above corresponds to a two-component Gaussian mixture distribution where both components are equally represented (i.e., p = 0.5), the first component f(x) is the standard normal distribution as before, and the second component g(x) is a Gaussian distribution with mean 3 and standard deviation 3. The first component contributes the sharper main peak centered at zero, while the second component contributes the broad “shoulder” seen in the right half of this plot. Finally, the lower right plot shows a mixture distribution with p = 0.40 where the first component is a Gaussian distribution with mean -2 and standard deviation 1, and the second component is a Gaussian distribution with mean +2 and standard deviation 1. The result is a bimodal distribution with the same general characteristics as the Old Faithful geyser data.

The point of these examples has been to illustrate the flexibility of the mixture distribution concept, in describing everything from outliers to the natural heterogeneity of natural phenomena with more than one distinct generation mechanism. Before leaving this discussion, it is worth considering the contaminated normal case a bit further. The plot above shows the normal Q-Q plot for 272 samples of the contaminated normal distribution just described, generated using the R procedure listed below (the number 272 was chosen to make the sample size the same as that of the Old Faithful geyser data discussed above). As before, this Q-Q plot was generated using the procedure qqPlot from the car package. The heavy-tailed, non-Gaussian character of this distribution is evident from the fact that both the upper and lower points in this plot fall well outside the 95% confidence interval around the Gaussian reference line. This example illustrates the power of the Q-Q plot for distributional assessment: like the density plots shown above for the standard normal and contaminated normal distributions, nonparametric density plots (not shown) generated from 272 samples drawn from each of these data distributions are not markedly different.

The contaminated normal random samples used to construct this Q-Q plot were generated with the following simple R function:

cngen.proc01 <- function(n=272,cpct = 0.15, mu1 = 0, mu2 = 0, sig1 = 1, sig2 = 3,iseed=101){

set.seed(iseed)

y0 <- rnorm(n,mean=mu1, sd = sig1)

y1 <- rnorm(n,mean=mu2, sd = sig2)

flag <- rbinom(n,size=1,prob=cpct)

y <- y0*(1 - flag) + y1*flag

}

This function is called with seven parameters, all of which are given default values. The parameter n is the sample size, cpct is the contamination percentage (expressed as a fraction, so cpct = 0.15 corresponds to 15% contamination), mu1 and mu2 are the means of the component Gaussian distributions, and sig1 and sig2 are the corresponding standard deviations. The default values for cpct, mu1, mu2, sig1, and sig2 are those for a typical contaminated normal outlier distribution. Finally, the parameter iseed is the seed for the random number generator: specifying its value as a default means that the procedure will return the same set of pseudorandom numbers each time it is called; to obtain an independent set, simply specify a different value for iseed.

The procedure itself generates three mutually independent random variables: y0 corresponds to the first component distribution, y1 corresponds to the second, and flag is a binomial random variable that determines which component distribution is selected for the final random number: when flag = 1 (an event with probability cpct), the contaminated value y1 is returned, and when flag = 0 (an event with probability 1 – cpct), the non-contaminated value y0 is returned.

The basic ideas just described – random selection of a random variable from n distinct distributions – can be applied to a very wide range of distributions, leading to an extremely wide range of data models. The examples described above give a very preliminary indication of the power of Gaussian mixture distributions as approximations to the distribution of heterogeneous phenomena that arise from multiple sources. Practical applications of more general (i.e., not necessarily Gaussian) mixture distributions include modeling particle sizes generated by multiple mechanisms (e.g., accretion of large particles from smaller ones and fragmentation of larger particles to form smaller ones, possibly due to differences in material characteristics), pore size distributions in rocks, polymer chain length or chain branching distributions, and many others. More generally, these ideas are also applicable to discrete distributions, leading to ideas like the zero-inflated Poisson and negative binomial distributions that I will discuss in my next post, or to combinations of continuous distributions with degenerate distributions (e.g., concentrated at zero), leading to ideas like zero-augmented continuous distributions that may be appropriate in applications like the analysis of the rainfall data I discussed last time. One of the advantages of R is that it provides a wide variety of support for various useful implementations of these ideas.

Monday, June 6, 2011

The pros and cons of robust data characterizations

Over the years, I have looked at a lot of data contaminated with outliers, the subject of Chapter 7 of Exploring Data in Engineering, the Sciences, and Medicine. That chapter adopts the definition of an outlier presented by Barnett and Lewis in their book Outliers in Statistical Data 2nd Edition, that outliers are “data points inconsistent with the majority of values in a dataset.” The practical importance of outliers lies in the fact that even a few of them can cause standard data characterization and analysis procedures to give highly misleading results. Robust procedures have been developed to address these problems, and their performance in the face of outliers can be dramatically better than that of standard methods. On the other hand, robust procedures can also fail, and when they do, they often fail spectacularly.

The above figure is a plot of successive measurements of the makeup flow rate from an industrial manufacturing process. This example is introduced in Chapter 2 of Exploring Data and is discussed in detail in Chapter 7; the dataset itself is available as the dataset makeup.csv from the companion website

for the book. Each point in this dataset corresponds to a measurement of the flow rate of a liquid component in a process with a recycle stream: most of this component is recovered from a downstream processing step and recycled back to the upstream step, but a small amount of this material is lost to evaporation and must be “made up” by a separate feed stream. The data points plotted above correspond to the flow rate of this makeup stream. The quantile function in R computes the Tukey five number summary of the data sequence (i.e., the minimum value, lower quartile, median, upper quartile, and maximum value). Applied to this sequence, the results are:

> quantile(MakeUpFlow)

0% 25% 50% 75% 100%

0.00086 370.43747 393.35861 404.29602 439.59091

Looking at the plot, it is clear that these observations partition naturally into two subsets: those values between 0 and about 200 – all falling between the minimum value and the lower quartile – and those values between about 300 and 440. In fact, 12.2% of the data values lie between 0 and 10, and 22.3% of the values are less than 200. Physically, these smaller-than-average flow rates correspond to shutdown episodes, where the manufacturing process is either not running, is in the process of being shut down, or is in the process of being started back up. If we compute the mean of the complete dataset, we obtain an average makeup flow rate of 315.46, corresponding to the dotted line in the above plot. It is clear that while this number represents the center of the overall data distribution, it is not representative of either the shutdown episodes or normal process operation, since it forms a perfect dividing line between these two data subsets. If we exclude the shutdown episodes – specifically, if we omit all values smaller than 200 – and compute the mean of what remains, we obtain 397.56, corresponding to the solid line in the plot above. It is clear that this number much better represents “typical process operation.”

This example illustrates the well-known outlier sensitivity of the mean. An outlier-resistant alternative is the median, defined as follows: if the number N of data values is odd, the median is the middle element in the ordered list we obtain by sorting these values from smallest to largest; if N is even, the median is the average of the two middle elements in this list. Applied to the complete makeup flow rate dataset, the median gives a “typical” flow rate of 393.36, fairly close to the average value computed without the shutdown episodes. Excluding these episodes changes the median value only a little, to 399.50, again quite close to the mean value after we remove these anomalous data points.

The sensitivity of the standard deviation to outliers is even worse than that of the mean. For the complete dataset, the computed standard deviation is 155.05, an order of magnitude larger than the value (15.22) computed from the dataset without the shutdown episodes. While it is not as well known as the median, an outlier-resistant alternative to the standard deviation is the MADM scale estimator, constructed as follows. First, compute the median as a reference value and compute the differences between every data point and this reference value. Then, take the absolute values of these differences, giving a set of non-negative numbers that tell us how far each point lies from the median reference value. Next, we compute the median of this sequence of absolute differences, which tells us how far a typical data point lies from the reference value. Because we have used the outlier-resistant median both to compute the reference value and to compute this “typical distance” number, the result is highly outlier-resistant. The only difficulty is that, for approximately normally distributed data with no outliers, this number is consistently smaller than the standard deviation. To obtain an alternative estimator of the standard deviation – i.e., a number that is approximately equal to the standard deviation when we compute it from a large, outlier-free collection of approximately Gaussian data – we need to multiply the number just defined by 1.4826. (A derivation of this result is given in Chapter 7 of Exploring Data.) This bias-corrected scale estimate is available as the built-in function mad in R, and applying it to the makeup flow rate dataset gives a scale estimate of 20.22, about the same magnitude as the standard deviation of the data sequence with the shutdown episodes removed. The MADM scale estimate is not completely immune to outliers, so when it is applied to the dataset without the shutdown episodes, it gives a somewhat smaller scale estimate of 13.40. Still, both of these estimates are much more consistent with the standard deviation of the normal operation data than the uncorrected standard deviation computed from the complete dataset.

The key point of this discussion has been to provide a simple, real-data illustration of both the considerable outlier sensitivity of the “standard” location and scale estimators (i.e., the mean and standard deviation) and the existence and performance of robust alternatives to both (i.e., the median and MADM scale estimator). As I noted at the beginning of this discussion, however, while robust estimators can work extremely well – as in this example – they can also fail spectacularly. In general, robust estimators are designed to protect us against anomalous observations that occur out in the tails of the distribution. In the case of “light-tailed” distributions like the uniform distribution, robust measures like the MADM scale estimator can exhibit surprising behavior. In particular, in the absence of outliers, we expect the standard deviation and the MADM scale estimate to give us roughly comparable results. If the standard deviation is much larger than the MADM scale estimate – as in the makeup flow rate example just discussed – this can usually be taken as an indication of either a naturally heavy-tailed data distribution (e.g., the Student’s t distribution with few degrees of freedom), or, as in this example, the presence of contaminating outliers. In the face of light-tailed data distributions, however, the MADM scale estimate can be substantially larger than the standard deviation, and we need to be aware of this fact if we use the MADM scale estimator as a measure of data spread.

More serious problems arise in the case of coarsely quantized data. In particular, it was once popular in engineering applications to record variables like temperature only to single digit accuracy (e.g., temperature to the nearest tenth of a degree). This practice was motivated both by limited memory available in early process monitoring computers, and by a recognition that this was roughly the limit of measurement accuracy, anyway. This kind of data truncation does degrade traditional characterizations like the standard deviation somewhat, but the consequences for the MADM scale estimate can be catastrophic. The plot above shows a sequence of 100 Gaussian data samples, with mean 0 and standard deviation 0.15, represented two ways. The line represents the original Gaussian data sample, while the solid circles correspond to these data values truncated to a single digit (e.g., the first value in the original data sequence is -0.11658569, which gets truncated to -0.1). Both the standard deviation and the MADM scale estimate are reasonably accurate when computed from the unmodified data sequence: the standard deviation is 0.137, while the MADM scale estimate is 0.133, both within about 10% of the correct value of 0.150. If we compute the standard deviation from the truncated data samples, the value declines to 0.103, reflecting the fact that by truncating the data from the 9 digits generated by the random number generator to 1, we are consistently reducing the magnitude of each number, so now our estimation error is more like 30%. Here, however, the MADM scale estimate fails completely, returning the value 0. The reason for this is that, on truncation, 55% of these data values are zero (note that any value between -0.09 and 0.09 will be truncated to zero, and these data values occur quite commonly in a zero-mean, normally distributed data sequence with standard deviation 0.15). Whenever more than 50% of the data values are equal, we don’t need to bother computing either the median or the MADM scale estimator. In particular, the median of this data sequence is simply equal to this majority value. Similarly, since more than 50% of the observed data values lie a distance of zero from the median, the median of the absolute deviation sequence on which the MADM scale estimate is based is necessarily zero. Thus, the MADM scale for any such sequence is zero, as in this example; this behavior is commonly called implosion.

This coarse quantization example indirectly illustrates one of the key distinctions between continuous and discrete variables: for a continuously distributed random variable, no two observations can have exactly the same value, but for discrete random variables, such ties are characteristic. The coarse quantization example represents something in between, being a discrete approximation of a continuous variable. The distinction is important because it is precisely these ties that were responsible for the complete failure of the MADM scale estimator in the example just described. There are other cases where the underlying data distribution has both continuous and discrete components, and these cases can also cause rank- and order-based data characterizations like the MADM scale estimator to behave very badly. The plot above shows rainfall measurements made every half hour for two months (a total of 732 data points), included in the dataset associated with Chapter 14 (“Half-hourly Precipitation and Streamflow, River Hirnant, Wales, U.K., November and December, 1972”) from the book Data:: A Collection of Problems from Many Fields for the Student and Research Worker (Springer Series in Statistics), by D.F. Andrews and A.M. Herzberg. It was noted above that the built-in function quantile in R generates the Tukey five-number summary for a data sequence, but by specifying the prob parameter, it is possible to obtain finer-grained data characterizations. For example, the following command gives the deciles of this rainfall dataset:

>quantile(Rain,prob=seq(0,1,0.1))

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.040 0.146 0.440 2.800

It is clear from these results that more than 60% of these recorded rainfall values are identically zero, reflecting the fact that it does not rain most days. As with the coarse quantization example discussed above, the MADM scale estimator implodes for this example, returning the value zero. Similarly, the median value of these numbers is also zero, but the mean is not (specifically, it is 0.136), and the standard deviation is not (this value is 0.354).

Even though the mean and standard deviation give more reasonable-sounding results for the rainfall dataset than their robust counterparts do, it is not completely clear how to interpret these numbers. In particular, we might be interested in asking any or all of the following three types of question:

1. What is the typical rainfall for this region in a month? (or in a week, in the entire two-month period, in a day, etc.)

2. How frequently does it rain?

3. When it does rain, what is the average amount?

Characterizations like the mean and standard deviation may be used to address the first of these three questions, but not the other two. As a preliminary answer to the second question, we can note that rainfall occurs in 34.6% of the half-hourly intervals recorded here, but this requires a slightly more detailed look at the data than what the mean and standard deviation give us. Additional work would be required if we wanted an estimate of how probable it was to rain on any given day, since we would need to summarize the data at the level of days instead of half-hourly observations. Finally, to address questions of the third type, it is necessary to segment the dataset into a “rainy part” and a “dry part.” Note that this segmentation is analogous to that made for the makeup flow rate dataset, but the difference is that here we are interested in characterizing the minority part of the dataset (i.e., “the outliers”) instead of the majority part, which is what the robust estimators are good at. Special types of statistical models have been developed to deal with these situations, and I will discuss some of these in my next post. For now, I will conclude with two key observations.

The first is that of Collin Mallows (C.L. Mallows, “Robust methods – some examples of their use,” American Statistician, vol. 33, 1979, pages 179-184), which I quote at the end of Chapter 7 of Exploring Data:

“A simple and useful strategy is to perform one’s analysis both robustly and by standard methods and to compare the results. If the differences are minor, either set can be presented. If the differences are not, one must perforce consider why not, and the robust analysis is already at hand to guide the next steps.

The importance of these considerations is enhanced when we are dealing with large amounts of data, since then examining all the data in detail is impractical, and we are forced to contemplate working with data that is, almost certainly, partially bad and with models that are almost certainly inadequate.”

The second observation is that Mallows advice is practical for almost any type of data analysis we want to consider, since robust methods exist to handle a very wide variety of different analysis tasks. An excellent introduction to many of these robust methods is given by Rand Wilcox in his book Introduction to Robust Estimation and Hypothesis Testing, Second Edition (Statistical Modeling and Decision Science), which includes R procedures for implementing most of the methods he discusses.