This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Saturday, August 6, 2011

Fitting mixture distributions with the R package mixtools

My last two posts have been about mixture models, with examples to illustrate what they are and how they can be useful.  Further discussion and more examples can be found in Chapter 10 of Exploring Data in Engineering, the Sciences, and Medicine.  One important topic I haven’t covered is how to fit mixture models to datasets like the Old Faithful geyser data that I have discussed previously: a nonparametric density plot gives fairly compelling evidence for a bimodal distribution, but how do you estimate the parameters of a mixture model that describes these two modes?  For a finite Gaussian mixture distribution, one way is by trial and error, first estimating the centers of the peaks by eye in the density plot (these become the component means), and adjusting the standard deviations and mixing percentages to approximately match the peak widths and heights, respectively.  This post considers the more systematic alternative of estimating the mixture distribution parameters using the mixtools package in R.

The mixtools package is one of several available in R to fit mixture distributions or to solve the closely related problem of model-based clustering.  Further, mixtools includes a variety of procedures for fitting mixture models of different types.  This post focuses on one of these – the normalmixEM procedure for fitting normal mixture densities – and applies it to two simple examples, starting with the Old Faithful dataset mentioned above.  A much more complete and thorough discussion of the mixtools package – which also discusses its application to the Old Faithful dataset – is given in the R package vignette,  mixtools: An R Package for Analyzing Finite Mixture Models.

The above plot shows the results obtained using the normalmixEM procedure with its default parameter values, applied to the Old Faithful waiting time data.  Specifically, this plot was generated by the following sequence of R commands:

            wait = faithful$waiting
            mixmdl = normalmixEM(wait)
            lines(density(wait), lty=2, lwd=2)

Like many modeling tools in R, the normalmixEM procedure has associated plot and summary methods.  In this case, the plot method displays either the log likelihood associated with each iteration of the EM fitting algorithm (more about that below), or the component densities shown above, or both.  Specifying “which=1” displays only the log likelihood plot (this is the default), specifying “which = 2” displays only the density components/histogram plot shown here, and specifying “density = TRUE” without specifying the “which” parameter gives both plots.  Note that the two solid curves shown in the above plot correspond to the individual Gaussian density components in the mixture distribution, each scaled by the estimated probability of an observation being drawn from that component distribution.  The final line of R code above overlays the nonparametric density estimate generated by the density function with its default parameters, shown here as the heavy dashed line (obtained by specifying “lty = 2”).

Most of the procedures in the mixtools package are based on the iterative expectation maximization (EM) algorithm, discussed in Section 2 of the mixtools vignette and also in Chapter 16 of Exploring Data.  A detailed discussion of this algorithm is beyond the scope of this post – books have been devoted to the topic (see, for example, the book by McLachlan and Krishnan, The EM Algorithm and Extensions (Wiley Series in Probability and Statistics) ) – but the following two points are important to note here.  First, the EM algorithm is an iterative procedure, and the time required for it to reach convergence – if it converges at all – depends strongly on the problem to which it is applied.  The second key point is that because it is an iterative procedure, the EM algorithm requires starting values for the parameters, and algorithm performance can depend strongly on these initial values.  The normalmixEM procedure supports both user-supplied starting values and built-in estimation of starting values if none are supplied.  These built-in estimates are the default and, in favorable cases, they work quite well.  The Old Faithful waiting time data is a case in point – using the default starting values gives the following parameter estimates:

            > mixmdl[c("lambda","mu","sigma")]
[1] 0.3608868 0.6391132
[1] 54.61489 80.09109
[1] 5.871241 5.867718

The mixture density described by these parameters is given by:

            p(x) = lambda[1] n(x; mu[1], sigma[1]) + lambda[2] n(x; mu[2], sigma[2])

where n(x; mu, sigma) represents the Gaussian probability density function with mean mu and standard deviation sigma.

One reason the default starting values work well for the Old Faithful waiting time data is that if nothing is specified, the number of components (the parameter k) is set equal to 2.  Thus, if you are attempting to fit a mixture model with more than two components, this number should be specified, either by setting k to some other value and not specifying any starting estimates for the parameters lambda, mu, and sigma, or by specifying a vector with k components as starting values for at least one of these parameters.  (There are a number of useful options in calling the normalmixEM procedure: for example, specifying the initial sigma value as a scalar constant rather than a vector with k components forces the component variances to be equal.  I won’t attempt to give a detailed discussion of these options here; for that, type “help(normalmixEM)”.)

Another important point about the default starting values is that, aside from the number of components k, any unspecified initial parameter estimates are selected randomly by the normalmixEM procedure.  This means that, even in cases where the default starting values consistently work well – again, the Old Faithful waiting time dataset seems to be such a case – the number of iterations required to obtain the final result can vary significantly from one run to the next.  (Specifically, the normalmixEM procedure does not fix the seed for the random number generators used to compute these starting values, so repeated runs of the procedure with the same data will start from different initial parameter values and require different numbers of iterations to achieve convergence.  In the case of the Old Faithful waiting time data, I have seen anywhere between 16 and 59 iterations required, with the final results differing only very slightly, typically in the fifth or sixth decimal place.  If you want to use the same starting value on successive runs, this can be done by setting the random number seed via the set.seed command before you invoke the normalmixEM procedure.)

It is important to note that the default starting values do not always work well, even if the correct number of components is specified.  This point is illustrated nicely by the following example.  The plot above shows two curves: the solid line is the exact density for the three-component Gaussian mixture distribution described by the following parameters:

            mu = (2.00, 5.00, 7.00)
            sigma = (1.000, 1.000, 1.000)
            lambda = (0.200, 0.600, 0.200)
The dashed curve in the figure is the nonparametric density estimate generated from n = 500 observations drawn from this mixture distribution.  Note that the first two components of this mixture distribution are evident in both of these plots, from the density peaks at approximately 2 and 5.  The third component, however, is too close to the second to yield a clear peak in either density, giving rise instead to slightly asymmetric “shoulders” on the right side of the upper peaks.  The key point is that the components in this mixture distribution are difficult to distinguish from either of these density estimates, and this hints at further difficulties to come.

Applying the normalmixEM procedure to the 500 sample sequence used to generate the nonparametric density estimate shown above and specifying k = 3 gives results that are substantially more variable than the Old Faithful results discussed above.  In fact, to compare these results, it is necessary to be explicit about the values of the random seeds used to initialize the parameter estimation procedure.  Specifying this random seed as 101 and only specifying k=3 in the normalmixEM call yields the following parameter estimates after 78 iterations:

            mu = (1.77, 4.87, 5.44)
            sigma = (0.766, 0.115, 1.463)
            lambda = (0.168, 0.028, 0.803)
Comparing these results with the correct parameter values listed above, it is clear that some of these estimation errors are quite large.  The figure shown below compares the mixture density constructed from these parameters (the heavy dashed curve) with the nonparametric density estimate computed from the data used to estimate them.  The prominent “spike” in this mixture density plot corresponds to the very small standard deviation estimated for the second component and it provides a dramatic illustration of the relatively poor results obtained for this particular example.

Repeating this numerical experiment with different random seeds to obtain different random starting estimates, the normalmixEM procedure failed to converge in 1000 iterations for seed values of 102 and 103, but it converged after 393 iterations for the seed value 104, yielding the following parameter estimates:

            mu = (1.79, 5.03, 5.46)
            sigma = (0.775, 0.352, 1.493)
            lambda = (0.169, 0.063, 0.768)

Arguably, the general behavior of these parameter estimates is quite similar to those obtained with the random seed value 101, but note that the second variance component differs by a factor of three, and the second component of lambda increases almost as much. 

Increasing the sample size from n = 500 to n = 2000 and repeating these experiments, the normalmixEM procedure failed to converge after 1000 iterations for all four of the random seed values 101 through 104.  If, however, we specify the correct standard deviations (i.e., specify “sigma = c(1,1,1)” when we invoke normalmixEM) and we increase the maximum number of iterations to 3000 (i.e., specify “maxit = 3000”), the procedure does converge after 2417 iterations for the seed value 101, yielding the following parameter estimates:

            mu = (1.98, 4.98, 7.15)
            sigma = (1.012, 1.055, 0.929)
            lambda = (0.198, 0.641, 0.161)
While these parameters took a lot more effort to obtain, they are clearly much closer to the correct values, emphasizing the point that when we are fitting a model to data, our results generally improve as the amount of available data increases and as our starting estimates become more accurate.  This point is further illustrated by the plot shown below, analogous to the previous one, but constructed from the model fit to the longer data sequence and incorporating better initial parameter estimates.  Interestingly, re-running the same procedure but taking the correct means as starting parameter estimates instead of the correct standard deviations, the procedure failed to converge in 3000 iterations.

Overall, I like what I have seen so far of the mixtools package, and I look forward to exploring its capabilities further.  It’s great to have a built-in procedure – i.e., one I didn’t have to write and debug myself – that does all of the things that this package does.  However, the three-component mixture results presented here do illustrate an important point: the behavior of iterative procedures like normalmixEM and others in the mixtools package can depend strongly on the starting values chosen to initialize the iteration process, and the extent of this dependence can vary greatly from one application to another.



  1. Interesting review! Have you compared its performance with mclust in identifying the true clusters (components) exist?

  2. Really neat example. I'm interested in getting the X intercept of the two Gaussians, but I don't think any of the attributes of the model gives it.. Is there another simple way for calculating it?