In response to my last post, Chris had the following comment:
I am actually trying to better understand the distinction between mixture models and mixture distributions in my own work. You seem to say mixture models apply to a small set of models – namely regression models.
This comment suggests that my caution about the difference between mixed-effect models and mixture distributions may have caused as much confusion as clarification, and the purpose of this post is to try to clear up this confusion.
So first, let me offer the following general observations. The term “mixture model” refers to a generalization of the class of finite mixture distributions that I discussed in my previous post. I give a more detailed discussion of finite mixture distributions in Chapter 10 of Exploring Data in Engineering, the Sciences, and Medicine, and the more general class of mixture models is discussed in the book Mixture Models (Statistics: A Series of Textbooks and Monographs) by Geoffrey J. McLachlan and Kaye E. Basford. The basic idea is that we are describing some observed phenomenon like the Old Faithful geyser data (the faithful data object in R), where a close look at the data (e.g., with a nonparametric density estimate) suggests substantial heterogeneity. In particular, the density estimates I presented last time for both of the variables in this dataset exhibit clear evidence of bimodality. Essentially, the idea behind a mixture model/mixture distribution is that we are observing something that isn’t fully characterized by a single, simple distribution or model, but instead by several such distributions or models, with some random selection mechanism at work. In the case of mixture distributions, some observations appear to be drawn from distribution 1, some from distribution 2, and so forth. The more general class of mixture models is quite broad, including things like heterogeneous regression models, where the response may depend approximately linearly on some covariate with one slope and intercept for observations drawn from one sub-population, but with another, very different slope and intercept for observations drawn from another sub-population. I present an example at the end of this post that illustrates this idea.
The probable source of confusion for Chris – and very possibly other readers – is the comment I made about the difference between these mixture models and mixed-effect models. This other class of models – which I only mentioned in passing in my post – typically consists of a regression model with two types of terms: fixed effects, deterministic predictors like those that appear in standard linear regression models, and random effects, which are typically assumed to obey a Gaussian distribution. This framework has been extended to more general settings like generalized linear models (e.g., mixed-effect logistic regression models). The R package lme4 provides support for fitting both linear mixed-effect models and generalized linear mixed-effect models to data. As I noted last time, these model classes are distinct from the mixture distribution/mixture model classes I discuss here. The models that I do discuss – mixture models – have strong connections with cluster analysis, where we are given a heterogeneous group of objects and typically wish to determine how many distinct groups of objects are present and assign individuals to the appropriate groups. A very high-level view of the many R packages available for clustering – some based on mixture model ideas and some not – is available from the CRAN clustering task view page. Two packages from this task view that I plan to discuss in future posts are flexmix and mixtools, both of which support a variety of mixture model applications. The following comments from the vignette FlexMix: A General Framework for Finite Mixture Models and Latent Class Regression in R give an indication of the range of areas where these ideas are useful:
“Finite mixture models have been used for more than 100 years, but have seen a real boost in popularity over the last decade due to the tremendous increase in available computing power. The areas of application of mixture models range from biology and medicine to physics, economics, and marketing. On the one hand, these models can be applied to data where observations originate from various groups and the group affiliations are not known, and on the other hand to provide approximations for multi-modal distributions.”
The following example illustrates the second of these ideas, motivated by the Old Faithful geyser data that I discussed last time. As a reminder, the plot above shows the nonparametric density estimate generated from the 272 observations of the Old Faithful waiting time data included in the faithful data object, using the density procedure in R with the default parameter settings. As I noted last time, the plot shows two clear peaks, the first centered at approximately 55 minutes and the second at approximately 80 minutes. Also, note that the first peak is substantially smaller in amplitude and appears to be somewhat narrower than the second peak.
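For readers who want to reproduce that plot, it takes only a couple of lines of R (the plot annotations below are arbitrary choices):

# Nonparametric density estimate of the Old Faithful waiting times,
# using the density procedure with its default parameter settings
data(faithful)
plot(density(faithful$waiting),
     main = "Old Faithful waiting times",
     xlab = "Waiting time (minutes)")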
To illustrate the connection with finite mixture distributions, the R procedure described below generates a two-component Gaussian mixture density whose random samples exhibit approximately the same behavior seen in the Old Faithful waiting time data. The results generated by this procedure are shown in the above figure, which includes two overlaid plots: one corresponding to the exact density for the two-component Gaussian mixture distribution (the solid line), and the other corresponding to the nonparametric density estimate computed from N = 272 random samples drawn from this mixture distribution (the dashed line). As in the previous plot, the nonparametric density estimate was computed using the density command in R with its default parameter values. The first component in this mixture has mean 54.5 and standard deviation 8.0, values chosen by trial and error to approximately match the lower peak in the Old Faithful waiting time distribution. The second component has mean 80.0 and standard deviation 5.0, chosen to approximately match the second peak in the waiting time distribution. The probabilities associated with the first and second components are 0.45 and 0.55, respectively, selected to give approximately the same peak heights seen in the waiting time density estimate. Combining these results, the density of this mixture distribution is:
p(x) = 0.45 n(x; 54.5, 8.0) + 0.55 n(x; 80.0, 5.0),
where n(x;m,s) denotes the Gaussian density function with mean m and standard deviation s. These density functions can be generated using the dnorm function in R.
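As a concrete illustration, the following short sketch evaluates and plots this mixture density (the function name MixDensity and the plotting range are arbitrary choices of mine):

# Exact density of the two-component Gaussian mixture defined above
MixDensity <- function(x){
  0.45 * dnorm(x, mean = 54.5, sd = 8.0) + 0.55 * dnorm(x, mean = 80.0, sd = 5.0)
}
#
# Evaluate and plot the exact density over a range covering both peaks
xvals <- seq(30, 110, length.out = 500)
plot(xvals, MixDensity(xvals), type = "l", xlab = "Waiting time (minutes)", ylab = "Density")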
The R procedure listed below generates n independent, identically distributed random samples from an m-component Gaussian mixture distribution. This procedure is called with the following parameters:
n = the number of random samples to generate
muvec = vector of m mean values
sigvec = vector of m standard deviations
pvec = vector of probabilities for each of the m components
iseed = integer seed to initialize the random number generator
The R code for the procedure looks like this:
MixEx01GenProc <- function(n, muvec, sigvec, pvec, iseed=101){
  #
  # Initialize the random number generator from the iseed parameter
  set.seed(iseed)
  #
  # Determine the number of mixture components from the probability vector
  m <- length(pvec)
  # Generate a random component index for each of the n observations
  indx <- sample(seq(1, m, 1), size=n, replace=TRUE, prob=pvec)
  #
  yvec <- 0
  for (i in 1:m){
    # Generate n samples from component i ...
    xvec <- rnorm(n, mean=muvec[i], sd=sigvec[i])
    # ... and keep only those observations assigned to component i by indx
    yvec <- yvec + xvec * as.numeric(indx == i)
  }
  #
  yvec
}
The call to set.seed initializes the random number generator using the iseed parameter, which is given a default value of 101. The next statement determines the number of components m in the mixture density from the length of the pvec parameter vector, and the call to sample then generates a random sequence indx of component indices taking the values 1 through m with probabilities determined by the pvec parameter. The rest of the program is a short loop that generates each component in turn, using indx to randomly select observations from each of these components with the appropriate probability. To see how this works, note that the first pass through the loop generates the random vector xvec of length n, with mean given by the first element of muvec and standard deviation given by the first element of sigvec. Then, every element of yvec for which the corresponding element of indx is equal to 1 is set equal to the corresponding element of this first random component xvec. On the second pass through the loop, the second random component is generated as xvec, again of length n but now with mean specified by the second element of muvec and standard deviation determined by the second element of sigvec, and these values are added to yvec wherever the selection index indx is equal to 2. Note that because each element of indx takes exactly one value between 1 and m, each element of yvec is filled in on exactly one pass through the loop: the values assigned during the first pass are never overwritten later, and the elements modified during the second pass still have the initial value of zero assigned in the line just before the loop. More generally, each pass through the loop generates the next component of the mixture distribution and fills in the corresponding elements of yvec, as determined by the random selection index vector indx.
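Putting these pieces together, the following sketch calls MixEx01GenProc with the parameter values given earlier, draws N = 272 samples as in the figure above, and overlays the nonparametric density estimate of these samples (dashed line) on the exact mixture density (solid line); the plotting details are again arbitrary:

# Draw 272 random samples from the two-component mixture described above
yvec <- MixEx01GenProc(n = 272, muvec = c(54.5, 80.0),
                       sigvec = c(8.0, 5.0), pvec = c(0.45, 0.55))
#
# Exact mixture density, for comparison
xvals <- seq(30, 110, length.out = 500)
exact <- 0.45 * dnorm(xvals, mean = 54.5, sd = 8.0) + 0.55 * dnorm(xvals, mean = 80.0, sd = 5.0)
#
# Solid line: exact density; dashed line: estimate from the random samples
plot(xvals, exact, type = "l", xlab = "Waiting time (minutes)", ylab = "Density")
lines(density(yvec), lty = 2)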
As I noted at the beginning of this post, the notion of a mixture model is more general than that of the finite mixture distributions just described, but closely related. I conclude this post with a simple example of a more general mixture model. The above scatter plot shows two variables, x and y, related by the following mixture model:
y = x + e1 with probability p1 = 0.40,
and
y = -x + 2 + e2 with probability p2 = 0.60,
where e1 is a zero-mean Gaussian random variable with standard deviation 0.1, and e2 is a zero-mean Gaussian random variable with standard deviation 0.3. To emphasize the components in the mixture model, points corresponding to the first component are plotted as solid circles, while points corresponding to the second component are plotted as open triangles. The two dashed lines in this plot represent the ordinary least squares regression lines fit to each component separately, and they both correspond reasonably well to the underlying linear relationships that define the two components (e.g., the least squares line fit to the solid circles has a slope of approximately +1 and an intercept of approximately 0). In contrast, the heavier dotted line represents the ordinary least squares regression line fit to the complete dataset without any knowledge of its underlying component structure: this line is almost horizontal and represents a very poor approximation to the behavior of the dataset.
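For anyone who wants to experiment with this example, a simulation along the following lines reproduces its essential behavior (a minimal sketch: the sample size, the range of x values, and the random seed are arbitrary choices):

# Simulate the two-component regression mixture described above
set.seed(101)
n <- 200
x <- runif(n, min = 0, max = 2)
comp <- sample(1:2, size = n, replace = TRUE, prob = c(0.40, 0.60))
y <- ifelse(comp == 1, x + rnorm(n, sd = 0.1), -x + 2 + rnorm(n, sd = 0.3))
simData <- data.frame(x = x, y = y, comp = comp)
#
# Least squares fits to each component separately, and to the pooled data
coef(lm(y ~ x, data = simData, subset = comp == 1))  # slope near +1, intercept near 0
coef(lm(y ~ x, data = simData, subset = comp == 2))  # slope near -1, intercept near 2
coef(lm(y ~ x, data = simData))                      # pooled fit: much flatter, a poor summary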
The point of this example is to illustrate two things. First, it provides a relatively simple illustration of how the mixture density idea discussed above generalizes to the setting of regression models and beyond: we can construct fairly general mixture models by requiring different randomly selected subsets of the data to conform to different modeling assumptions. The second point – emphasized by the strong disagreement between the overall regression line and both of the component regression lines – is that if we are given only the dataset (i.e., the x and y values themselves) without knowing which component they represent, standard analysis procedures are likely to perform very badly. This question – how do we analyze a dataset like this one without detailed prior knowledge of its heterogeneous structure – is what R packages like flexmix and mixtools are designed to address.
More about that in future posts.