This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Saturday, June 18, 2011

A Brief Introduction to Mixture Distributions

Last time, I discussed some of the advantages and disadvantages of robust estimators like the median and the MADM scale estimator, noting that certain types of datasets – like the rainfall dataset discussed last time – can cause these estimators to fail spectacularly.  An extremely useful idea in working with datasets like this one is that of mixture distributions, which describe random variables that are drawn from more than one parent population.  I discuss mixture distributions in some detail in Chapter 10 of Exploring Data in Engineering, the Sciences, and Medicine (Section 10.6), and the objective of this post is to give a brief introduction to the basic ideas of mixture distributions, with some hints about how they can be useful in connection with problems like analyzing the rainfall data discussed last time.  Subsequent posts will discuss these ideas in more detail, with pointers to specific R packages that are useful in applying these ideas to real data analysis problems.

Before proceeding, it is important to emphasize that the topic of this post and subsequent discussions is “mixture distributions” and not “mixture models,” which are the subject of considerable current interest in the statistics community.  The distinction is important because these topics are very different: mixture distributions represent a useful way of describing heterogeneity in the distribution of a variable, whereas mixture models provide a foundation for incorporating both deterministic and random predictor variables in regression models.



To motivate the ideas presented here, the figure above shows four plots constructed from the Old Faithful geyser data that I have discussed previously (the R dataset faithful).  This data frame contains 272 observations of the durations of successive eruptions and the waiting time until the next eruption, both measured in minutes.  The upper left plot above is the normal Q-Q plot for the duration values, computed using the qqPlot procedure from the car package that I have discussed previously, and the upper right plot is the corresponding normal Q-Q plot for the waiting times.  Both of these Q-Q plots exhibit pronounced “kinks,” which I have noted previously often indicate the presence of a multimodal data distribution.  The lower two plots are the corresponding nonparametric density estimates (computed using the density procedure in R with its default parameters).  The point of these plots is that they give strong evidence that the distributions of both the durations and waiting times are bimodal.

In fact, bimodal and more general multimodal distributions arise frequently in practice, particularly in cases where we are observing a composite response from multiple, distinct sources.  This is the basic idea behind mixture distributions: the response x that we observe is modeled as a random variable that has some probability p1 of being drawn from distribution D1, probability p2 of being drawn from distribution D2, and so forth, with probability pn of being drawn from distribution Dn, where n is the number of components in our mixture distribution.  The key assumption here is one of statistical independence between the process of randomly selecting the component distribution Di to be drawn and these distributions themselves.  That is, we assume there is a random selection process that first generates the numbers 1 through n with probabilities p1 through pn.  Then, once we have drawn some number j, we turn to distribution Dj and draw the random variable x from this distribution.  So long as the probabilities – or mixing percentagespi sum to 1, and all of the distributions Di are proper densities, the combination also defines a proper probability density function, which can be used as the basis for computing expectations, formulating maximum likelihood estimation problems, and so forth.  The simplest case is that for n = 2, which provides many different specific examples that have been found to be extremely useful in practice.  Because it is the easiest to understand, this post will focus on this special case.

Since the mixing percentages must sum to 1, it follows that p2 = 1 – p1 when n = 2, it simplifies the discussion to drop the subscripts, writing p1 = p and p2 = 1 – p.  Similarly, let f(x) denote the density associated with the distribution D1 and g(x) denote the density associated with the distribution D2.  The overall probability density function p(x) is then given by:
                       

                        p(x) = p f(x) + (1 – p)g(x)



To illustrate the flexibility of this idea, consider the case where both f(x) and g(x) are Gaussian distributions.  The figure below shows four specific examples.



The upper left plot shows the standard normal distribution (i.e., mean 0 and standard deviation 1), which corresponds to taking both f(x) and g(x) as standard normal densities, for any choice of p.  I have included it here both because it represents what has been historically the most common distributional assumption, and because it represents a reference case in interpreting the other distributions considered here.   The upper right plot corresponds to the contaminated normal outlier distribution, widely adopted in the robust statistics literature.  There, the idea is that measurement errors – traditionally modeled as zero-mean Gaussian random variables with some unknown standard deviation S – mostly conform to this model, but some fraction (typically, 10% to 20%) of these measurement errors have larger variability.  The traditional contaminated normal model defines p as this contamination percentage and assumes a standard deviation of 3S for these measurements.  The upper right plot above shows the density for this contaminated normal model with p = 0.15 (i.e., 15% contamination).   Visually, this plot looks identical to the one on the upper left for the standard normal distribution: plotting them on common axes (as done in Fig. 10.17 of Exploring Data) shows that these distributions are not identical, a point discussed further below on the basis of Q-Q plots.

The lower left plot in the figure above corresponds to a two-component Gaussian mixture distribution where both components are equally represented (i.e., p = 0.5), the first component f(x) is the standard normal distribution as before, and the second component g(x) is a Gaussian distribution with mean 3 and standard deviation 3.  The first component contributes the sharper main peak centered at zero, while the second component contributes the broad “shoulder” seen in the right half of this plot.  Finally, the lower right plot shows a mixture distribution with p = 0.40 where the first component is a Gaussian distribution with mean -2 and standard deviation 1, and the second component is a Gaussian distribution with mean +2 and standard deviation 1.  The result is a bimodal distribution with the same general characteristics as the Old Faithful geyser data. 



The point of these examples has been to illustrate the flexibility of the mixture distribution concept, in describing everything from outliers to the natural heterogeneity of natural phenomena with more than one distinct generation mechanism.  Before leaving this discussion, it is worth considering the contaminated normal case a bit further.  The plot above shows the normal Q-Q plot for 272 samples of the contaminated normal distribution just described, generated using the R procedure listed below (the number 272 was chosen to make the sample size the same as that of the Old Faithful geyser data discussed above).  As before, this Q-Q plot was generated using the procedure qqPlot from the car package.  The heavy-tailed, non-Gaussian character of this distribution is evident from the fact that both the upper and lower points in this plot fall well outside the 95% confidence interval around the Gaussian reference line.  This example illustrates the power of the Q-Q plot for distributional assessment: like the density plots shown above for the standard normal and contaminated normal distributions, nonparametric density plots (not shown) generated from 272 samples drawn from each of these data distributions are not markedly different. 

The contaminated normal random samples used to construct this Q-Q plot were generated with the following simple R function:

cngen.proc01 <- function(n=272,cpct = 0.15, mu1 = 0, mu2 = 0, sig1 = 1, sig2 = 3,iseed=101){
  #
  set.seed(iseed)
  y0 <- rnorm(n,mean=mu1, sd = sig1)
  y1 <- rnorm(n,mean=mu2, sd = sig2)
  #
  flag <- rbinom(n,size=1,prob=cpct)
  y <- y0*(1 - flag) + y1*flag
  #
  y
}

This function is called with seven parameters, all of which are given default values.  The parameter n is the sample size, cpct is the contamination percentage (expressed as a fraction, so cpct = 0.15 corresponds to 15% contamination), mu1 and mu2 are the means of the component Gaussian distributions, and sig1 and sig2 are the corresponding standard deviations.  The default values for cpct, mu1, mu2, sig1, and sig2 are those for a typical contaminated normal outlier distribution.  Finally, the parameter iseed is the seed for the random number generator: specifying its value as a default means that the procedure will return the same set of pseudorandom numbers each time it is called; to obtain an independent set, simply specify a different value for iseed.

The procedure itself generates three mutually independent random variables: y0 corresponds to the first component distribution, y1 corresponds to the second, and flag is a binomial random variable that determines which component distribution is selected for the final random number: when flag = 1 (an event with probability cpct), the contaminated value y1 is returned, and when flag = 0 (an event with probability 1 – cpct), the non-contaminated value y0 is returned.

The basic ideas just described – random selection of a random variable from n distinct distributions – can be applied to a very wide range of distributions, leading to an extremely wide range of data models.  The examples described above give a very preliminary indication of the power of Gaussian mixture distributions as approximations to the distribution of heterogeneous phenomena that arise from multiple sources.  Practical applications of more general (i.e., not necessarily Gaussian) mixture distributions include modeling particle sizes generated by multiple mechanisms (e.g., accretion of large particles from smaller ones and fragmentation of larger particles to form smaller ones, possibly due to differences in material characteristics), pore size distributions in rocks, polymer chain length or chain branching distributions, and many others.  More generally, these ideas are also applicable to discrete distributions, leading to ideas like the zero-inflated Poisson and negative binomial distributions that I will discuss in my next post, or to combinations of continuous distributions with degenerate distributions (e.g., concentrated at zero), leading to ideas like zero-augmented continuous distributions that may be appropriate in applications like the analysis of the rainfall data I discussed last time.  One of the advantages of R is that it provides a wide variety of support for various useful implementations of these ideas.
 

2 comments:

  1. Hi Ron,

    Your writing style is clear and easy to follow--a very refreshing find in the statistics literature!

    I am actually trying to better understand the distinction between mixture models and mixture distributions for my own work. You seem to say mixture models apply to a small set of models--namely regression models.

    But in looking at the distinction between mixture models and distributions on Wikipedia, the two appear related. For instance (from the Mixture Model entry on Wikipedia): Formally a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population.

    Do you agree with this more general assertion, as you only seem to view mixture models from a regression framework?

    ReplyDelete
  2. This comment has been removed by the author.

    ReplyDelete