This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

## Saturday, August 6, 2011

### Fitting mixture distributions with the R package mixtools

My last two posts have been about mixture models, with examples to illustrate what they are and how they can be useful.  Further discussion and more examples can be found in Chapter 10 of Exploring Data in Engineering, the Sciences, and Medicine.  One important topic I haven’t covered is how to fit mixture models to datasets like the Old Faithful geyser data that I have discussed previously: a nonparametric density plot gives fairly compelling evidence for a bimodal distribution, but how do you estimate the parameters of a mixture model that describes these two modes?  For a finite Gaussian mixture distribution, one way is by trial and error, first estimating the centers of the peaks by eye in the density plot (these become the component means), and adjusting the standard deviations and mixing percentages to approximately match the peak widths and heights, respectively.  This post considers the more systematic alternative of estimating the mixture distribution parameters using the mixtools package in R.

The mixtools package is one of several available in R to fit mixture distributions or to solve the closely related problem of model-based clustering.  Further, mixtools includes a variety of procedures for fitting mixture models of different types.  This post focuses on one of these – the normalmixEM procedure for fitting normal mixture densities – and applies it to two simple examples, starting with the Old Faithful dataset mentioned above.  A much more complete and thorough discussion of the mixtools package – which also discusses its application to the Old Faithful dataset – is given in the R package vignette,  mixtools: An R Package for Analyzing Finite Mixture Models.

The above plot shows the results obtained using the normalmixEM procedure with its default parameter values, applied to the Old Faithful waiting time data.  Specifically, this plot was generated by the following sequence of R commands:

library(mixtools)
wait = faithful\$waiting
mixmdl = normalmixEM(wait)
plot(mixmdl,which=2)
lines(density(wait), lty=2, lwd=2)

Like many modeling tools in R, the normalmixEM procedure has associated plot and summary methods.  In this case, the plot method displays either the log likelihood associated with each iteration of the EM fitting algorithm (more about that below), or the component densities shown above, or both.  Specifying “which=1” displays only the log likelihood plot (this is the default), specifying “which = 2” displays only the density components/histogram plot shown here, and specifying “density = TRUE” without specifying the “which” parameter gives both plots.  Note that the two solid curves shown in the above plot correspond to the individual Gaussian density components in the mixture distribution, each scaled by the estimated probability of an observation being drawn from that component distribution.  The final line of R code above overlays the nonparametric density estimate generated by the density function with its default parameters, shown here as the heavy dashed line (obtained by specifying “lty = 2”).

Most of the procedures in the mixtools package are based on the iterative expectation maximization (EM) algorithm, discussed in Section 2 of the mixtools vignette and also in Chapter 16 of Exploring Data.  A detailed discussion of this algorithm is beyond the scope of this post – books have been devoted to the topic (see, for example, the book by McLachlan and Krishnan, ) – but the following two points are important to note here.  First, the EM algorithm is an iterative procedure, and the time required for it to reach convergence – if it converges at all – depends strongly on the problem to which it is applied.  The second key point is that because it is an iterative procedure, the EM algorithm requires starting values for the parameters, and algorithm performance can depend strongly on these initial values.  The normalmixEM procedure supports both user-supplied starting values and built-in estimation of starting values if none are supplied.  These built-in estimates are the default and, in favorable cases, they work quite well.  The Old Faithful waiting time data is a case in point – using the default starting values gives the following parameter estimates:

> mixmdl[c("lambda","mu","sigma")]
\$lambda
[1] 0.3608868 0.6391132

\$mu
[1] 54.61489 80.09109

\$sigma
[1] 5.871241 5.867718

The mixture density described by these parameters is given by:

p(x) = lambda[1] n(x; mu[1], sigma[1]) + lambda[2] n(x; mu[2], sigma[2])

where n(x; mu, sigma) represents the Gaussian probability density function with mean mu and standard deviation sigma.

One reason the default starting values work well for the Old Faithful waiting time data is that if nothing is specified, the number of components (the parameter k) is set equal to 2.  Thus, if you are attempting to fit a mixture model with more than two components, this number should be specified, either by setting k to some other value and not specifying any starting estimates for the parameters lambda, mu, and sigma, or by specifying a vector with k components as starting values for at least one of these parameters.  (There are a number of useful options in calling the normalmixEM procedure: for example, specifying the initial sigma value as a scalar constant rather than a vector with k components forces the component variances to be equal.  I won’t attempt to give a detailed discussion of these options here; for that, type “help(normalmixEM)”.)

Another important point about the default starting values is that, aside from the number of components k, any unspecified initial parameter estimates are selected randomly by the normalmixEM procedure.  This means that, even in cases where the default starting values consistently work well – again, the Old Faithful waiting time dataset seems to be such a case – the number of iterations required to obtain the final result can vary significantly from one run to the next.  (Specifically, the normalmixEM procedure does not fix the seed for the random number generators used to compute these starting values, so repeated runs of the procedure with the same data will start from different initial parameter values and require different numbers of iterations to achieve convergence.  In the case of the Old Faithful waiting time data, I have seen anywhere between 16 and 59 iterations required, with the final results differing only very slightly, typically in the fifth or sixth decimal place.  If you want to use the same starting value on successive runs, this can be done by setting the random number seed via the set.seed command before you invoke the normalmixEM procedure.)

It is important to note that the default starting values do not always work well, even if the correct number of components is specified.  This point is illustrated nicely by the following example.  The plot above shows two curves: the solid line is the exact density for the three-component Gaussian mixture distribution described by the following parameters:

mu = (2.00, 5.00, 7.00)
sigma = (1.000, 1.000, 1.000)
lambda = (0.200, 0.600, 0.200)
The dashed curve in the figure is the nonparametric density estimate generated from n = 500 observations drawn from this mixture distribution.  Note that the first two components of this mixture distribution are evident in both of these plots, from the density peaks at approximately 2 and 5.  The third component, however, is too close to the second to yield a clear peak in either density, giving rise instead to slightly asymmetric “shoulders” on the right side of the upper peaks.  The key point is that the components in this mixture distribution are difficult to distinguish from either of these density estimates, and this hints at further difficulties to come.

Applying the normalmixEM procedure to the 500 sample sequence used to generate the nonparametric density estimate shown above and specifying k = 3 gives results that are substantially more variable than the Old Faithful results discussed above.  In fact, to compare these results, it is necessary to be explicit about the values of the random seeds used to initialize the parameter estimation procedure.  Specifying this random seed as 101 and only specifying k=3 in the normalmixEM call yields the following parameter estimates after 78 iterations:

mu = (1.77, 4.87, 5.44)
sigma = (0.766, 0.115, 1.463)
lambda = (0.168, 0.028, 0.803)
Comparing these results with the correct parameter values listed above, it is clear that some of these estimation errors are quite large.  The figure shown below compares the mixture density constructed from these parameters (the heavy dashed curve) with the nonparametric density estimate computed from the data used to estimate them.  The prominent “spike” in this mixture density plot corresponds to the very small standard deviation estimated for the second component and it provides a dramatic illustration of the relatively poor results obtained for this particular example.

Repeating this numerical experiment with different random seeds to obtain different random starting estimates, the normalmixEM procedure failed to converge in 1000 iterations for seed values of 102 and 103, but it converged after 393 iterations for the seed value 104, yielding the following parameter estimates:

mu = (1.79, 5.03, 5.46)
sigma = (0.775, 0.352, 1.493)
lambda = (0.169, 0.063, 0.768)

Arguably, the general behavior of these parameter estimates is quite similar to those obtained with the random seed value 101, but note that the second variance component differs by a factor of three, and the second component of lambda increases almost as much.

Increasing the sample size from n = 500 to n = 2000 and repeating these experiments, the normalmixEM procedure failed to converge after 1000 iterations for all four of the random seed values 101 through 104.  If, however, we specify the correct standard deviations (i.e., specify “sigma = c(1,1,1)” when we invoke normalmixEM) and we increase the maximum number of iterations to 3000 (i.e., specify “maxit = 3000”), the procedure does converge after 2417 iterations for the seed value 101, yielding the following parameter estimates:

mu = (1.98, 4.98, 7.15)
sigma = (1.012, 1.055, 0.929)
lambda = (0.198, 0.641, 0.161)
While these parameters took a lot more effort to obtain, they are clearly much closer to the correct values, emphasizing the point that when we are fitting a model to data, our results generally improve as the amount of available data increases and as our starting estimates become more accurate.  This point is further illustrated by the plot shown below, analogous to the previous one, but constructed from the model fit to the longer data sequence and incorporating better initial parameter estimates.  Interestingly, re-running the same procedure but taking the correct means as starting parameter estimates instead of the correct standard deviations, the procedure failed to converge in 3000 iterations.

Overall, I like what I have seen so far of the mixtools package, and I look forward to exploring its capabilities further.  It’s great to have a built-in procedure – i.e., one I didn’t have to write and debug myself – that does all of the things that this package does.  However, the three-component mixture results presented here do illustrate an important point: the behavior of iterative procedures like normalmixEM and others in the mixtools package can depend strongly on the starting values chosen to initialize the iteration process, and the extent of this dependence can vary greatly from one application to another.

1. Interesting review! Have you compared its performance with mclust in identifying the true clusters (components) exist?

2. Really neat example. I'm interested in getting the X intercept of the two Gaussians, but I don't think any of the attributes of the model gives it.. Is there another simple way for calculating it?

3. This was a very, very helpful post. I am trying to fit a mixture Gamma distribution - the gut retention time data I am analzying is bimodal. However, I am unable to figure out the expansion for mixture density (as you do here for the normal distribution). Mixtools lists lambda, alpha and beta coefficients for Gamma distributions - what do these correspond to? Are alpha and beta the fitted parameters for each of the Gamma distributions? It would be very helpful if you can suggest how we can expand this and fit the curves to histograms. Thanks!

4. Thanks for the post. I used this package on my data. When I used it the plot produces two curves but I see more populations when I increase bin widths. My question is how to increase bin width using this function?

Thanks a lot

5. Hi! Thank you very much for this useful post. How can you get the median of each population?

6. Once you have the mixmdl how can you generate new samples from it?

7. It was really a nice post and I was really impressed by reading thisData Science online Course

8. Thank you for sharing your article. Great efforts put it to find the list of articles which is very useful to know, Definitely will share the same to other forums.
Data Science Training in chennai at Credo Systemz | data science course fees in chennai | data science course in chennai quora | data science with python training in chennai

9. Thankful to you for this amazing information sharing with us. Get website designing and development services by Ogen Infosystem.
Website Designing Company in Delhi

10. Your content is really awesome and understandable, thanks for the efforts in this blog. Visit Mutual Fund Wala for Mutual Fund Schemes.
Mutual Fund Companies

11. Decent, Get Service for Night out page 3 parties and this magnificent service provided by Lifestyle Magazine.
Lifestyle Magazine India

12. agar aapka husband aapse ladai karta hai or aap karna nhi chatey toh aap jarur apne pati ko apne vash main kar sakty ho पति को वश में करने के टोटके kare

13. Great, I think this is one of the best blog in past some time I have seen. Visit Kalakutir for Fleet Painting, Godown Line Marking Painting and Caution & Indication Signages.
Fleet Painting

14. I have to search sites with relevant information on given topic and provide them to teacher our opinion and the article.
data analytics courses in mumbai

data science interview questions

data science course in mumbai

15. I feel very grateful that I read this. It is very helpful and very informative and I really learned a lot from it.
Anchor tag
Data Science Certification in Bangalore

16. All the scientists think that the computers will have to come to this point to make sure that they are artificially intelligent and would be self aware. artificial intelligence training in hyderabad

17. I just got to this amazing site not long ago. I was actually captured with the piece of resources you have got here. Big thumbs up for making such wonderful blog page!
Data Science Course in Bangalore

18. Happy to visit your blog, I am by all accounts forward to more solid articles and I figure we as a whole wish to thank such huge numbers of good articles, blog to impart to us.
360DigiTMG

19. Mindblowing blog appreciating your endless efforts in developing a truly transparent content. Which probably the best one to come across disclosing the content which people might not aware of it. Thanks for bringing out the amazing content and keep sharing more further.

360DigiTMG PMP Certification Course

20. I would like to thank you for the efforts you had made for writing this awesome article. This article inspired me to read more. keep it up.
Correlation vs Covariance
Simple Linear Regression
data science interview questions
KNN Algorithm
Logistic Regression explained

21. Thanks for sharing such a great information.. It is really helpful to me..I always search to read the quality content and finally i found this in you post. keep it up!
Our Service:
Digital marketing Company
SMM Services
PPC Services in Delhi
Website Design & Development Packages
Seo Packages India
Web Development Packages

22. I have read your excellent post. This is a great job. I have enjoyed reading your post first time. I want to say thanks for this post. Thank you...
Data Science Training in Hyderabad

23. Cool stuff you have and you keep overhaul every one of us
Best Institute for Data Science in Hyderabad

24. I see the greatest contents on your blog and I extremely love reading them.
Best Institutes For Digital Marketing in Hyderabad

25. It's really nice and meaningful. it's a really cool blog. Linking is a very useful thing.you have really helped lots of people who visit blogs and provide them useful information.

26. Thanks for the informative and helpful post, obviously in your blog everything is good.. ExcelR Data Analyst Course

27. Very awesome!!! When I seek for this I found this website at the top of all blogs in search engine.

28. Just the way I have expected. Your website really is interesting.
data scientist course in hyderabad

29. A good blog always comes-up with new and exciting information and while reading I have felt that this blog really has all those qualities that qualify a blog to be a one.

Best Data Science courses in Hyderabad

30. Very nice blog and articles. I am really very happy to visit your blog. Now I am finding which I actually want. I check your blog everyday and try to learn something from your blog. Thank you and I'm waiting for your new post.

Best Data Science courses in Hyderabad

31. Really nice and interesting post. I was looking for this kind of information and enjoyed reading this one. Keep posting. Thanks for sharing.
best data science institute in hyderabad

32. I truly like your composing style, incredible data, thank you for posting.
digital marketing courses in hyderabad with placement

33. Most of the small and mid-sized businesses that are willing to enhance the growth prospects through the effective implementation of cloud-based data management systems capable of providing real-time information for the compatible devices consider Salesforce implementation services and Salesforce salesforce training in bangalore

34. Very wonderful informative article. I appreciated looking at your article. Very wonderful reveal. I would like to twit this on my followers. Many thanks! .
Data Science certification training in Bangalore

35. Great tips and very easy to understand. This will definitely be very useful for me when I get a chance to start my blog.
best data science institute in hyderabad

36. Thanks for such a great post and the review, I am totally impressed! Keep stuff like this coming.
data scientist training and placement in hyderabad

37. Such a very useful article. Very interesting to read this article.I would like to thank you for the efforts you had made for writing this awesome article.

DevOps Training in Hyderabad

38. Highly appreciable regarding the uniqueness of the content. This perhaps makes the readers feels excited to get stick to the subject. Certainly, the learners would thank the blogger to come up with the innovative content which keeps the readers to be up to date to stand by the competition. Once again nice blog keep it up and keep sharing the content as always.

data science course in faridabad

39. You’ve got some interesting points in this article. I would have never considered any of these if I didn’t come across this. Thanks!. Fitness Bay

40. A great website with interesting and unique material what else would you need.
data science training

41. Thank you so much for this great article. Could you provide the code used to plot the two curves: the solid line (exact density for the three-component Gaussian mixture) and dashed line (nonparametric density estimate generated from n = 500 observations drawn from the mixture distribution).
Thank you

42. I trust you post again soon... Kanye West Donda Vest

43. I see some amazingly important and kept up to length of your strength searching for in your on the site
data scientist course in malaysia

44. Extremely overall quite fascinating post. I was searching for this sort of data and delighted in perusing this one. Continue posting. A debt of gratitude is in order for sharing.
aws certification cost hyderabad

45. Extremely overall quite fascinating post. I was searching for this sort of data and delighted in perusing this one. Continue posting. A debt of gratitude is in order for sharing.data scientist course in warangal

46. Excellent effort to make this blog more wonderful and informative. The information shared was very useful.
Data Analytics Course in Chandigarh

47. Nice blog and informative content. Really useful for many people, I bookmarked your website for further blogs. Thanks, you.
Data Science Course in Hyderabad

48. This is an excellent post I seen thanks to share it. It is really what I wanted to see hope in future you will continue for sharing such a excellent post.
cyber security course malaysia

49. I truly like your style of blogging. I added it to my preferred's blog webpage list and will return soon…

50. This is truly a great read for me. I have bookmarked it and I am looking forward to reading new articles
Keep up the good work!. Red Yeast Rice Brands To Avoid

51. Awesome article. I enjoyed reading your articles. this can be really a good scan for me. wanting forward to reading new articles. maintain the nice work! wingstop promo code

52. Really nice and interesting post. I was looking for this kind of information and enjoyed reading this one.
data analytics course in hyderabad

53. I think this is an informative post and it is very useful and knowledgeable. therefore, I would like to thank you for the efforts you have made in writing this article.
cyber security course in malaysia

54. 360DigiTMG, the top-rated organisation among the most prestigious industries around the world, is an educational destination for those looking to pursue their dreams around the globe. The company is changing careers of many people through constant improvement, 360DigiTMG provides an outstanding learning experience and distinguishes itself from the pack. 360DigiTMG is a prominent global presence by offering world-class training. Its main office is in India and subsidiaries across Malaysia, USA, East Asia, Australia, Uk, Netherlands, and the Middle East.

55. I like the way you express information to us. Thanks for such post and please keep it up. Biker Boyz Jacket

56. I truly like your style of blogging. I added it to my preferred's blog webpage list and will return soon…
data analytics courses in hyderabad

57. Acquire a firm grounding in the theory of Data Science by signing up for the Data Science courses in Bangalore. Master the relevant skills along with all the essential tools and techniques of Data Science. Get to avail benefits like Flexible timings, Best industry trainers, and a meticulously crafted curriculum with hands-on projects that will give you exposure to a real-world working environment.

Best Data Science Training institute in Bangalore

58. Get a comprehensive overview of Data Science and learn all the essential skills including collecting, modeling, and interpreting data. Register with Data Science institute Bangalore and build a strong foundation for a career where you will be involved in uncovering valuable information for your organization. Learn Python, Machine Learning, Big Data, Deep Learning, and Analytics to take center stage in Data Science.

Data Science Course in Bangalore

59. Develop technical skills and become an expert in analyzing large sets of data by enrolling for the Best Data Science course in Bangalore. Gain in-depth knowledge in Data Visualization, Statistics, and Predictive Analytics along with the two famous programming languages and Python. Learn to derive valuable insights from data using skills of Data Mining, Statistics, Machine Learning, Network Analysis, etc, and apply the skills you will learn in your final Capstone project to get recognized by potential employers.

Data Science in Bangalore

60. 360DigiTMG offers the best Data Science Courses on market. Enroll now for a bright future.
best data science courses in chennai

61. I was quite impressed with this valuable content this was purely interesting. Star Trek Picard Field Jacket

62. Hello,
This informative post delves into the intricacies of fitting mixture models using the mixtools package in R. Your detailed exploration sheds light on the challenges and nuances of parameter estimation. Great work!
Data Analytics Courses in Nashik

63. This comprehensive exploration of fitting mixture distributions using the mixtools package in R demonstrates your analytical prowess and dedication to robust statistical modeling. Impressive writeup.
Data Analytics Courses In Dubai

64. It's a powerful resource for anyone working with mixed data distributions, helping us gain deeper insights and make more informed decisions.
Data Analytics Courses In Chennai

65. It's clear that you have a deep understanding of the subject, and your ability to convey it in a straightforward manner is greatly appreciated. Thanks for sharing this valuable information.
Data Analytics Courses In Chennai

66. "Thank you for the insightful blog post on fitting mixture distributions using the R package mixtools.
Data analytics courses in new Jersey

67. Fitting mixture distributions with the R package mixtools is an essential technique in statistics for modeling complex data with multiple underlying components, allowing for more accurate data representation and analysis. In the domain of data analytics, Glasgow offers specialized Data Analytics courses that delve into advanced statistical tools like mixtools, enabling professionals to gain expertise in handling intricate datasets. Please also read Data Analytics courses in Glasgow.

68. Thanks for sharing informative and insightful blog post on mixtools package in R.
data analyst courses in limerick

69. a very informative blog, very useful.
Digital Marketing Courses In port-harcourt

70. This comment has been removed by the author.

71. Thank you for providing in depth knowledge and excellent insights on mixtools package in R.
Digital Marketing Courses In Bhutan

72. In a world where technological trends can be fleeting, your blog post serves as a valuable resource for those seeking a deeper understanding of the trajectories of gamification and big data. Thank you for sharing this insightful retrospective, and I look forward to exploring more of your analyses on technology trends. Digital marketing for business

73. Excellent blog post. Great information. Thanks for sharing.

Investment banking analyst jobs

74. Thanks for sharing this post.
Investment banking courses in the world

75. "Your blog on fitting mixture distributions with the R package mixtools is a treasure trove for statisticians and data enthusiasts. The detailed exploration and practical insights not only demystify the complexities of mixture distributions but also empower readers to leverage the full potential of the mixtools package in R. Thanks for providing a clear roadmap in statistical modeling, making mixture distribution fitting more accessible and actionable for researchers and analysts."
Investment banking as a career in India

76. "Your blog on fitting mixture distributions with the R package mixtools is a statistical gem for analysts and researchers. The detailed exploration and practical insights not only unravel the intricacies of mixture distributions but also empower readers to wield the full potential of the mixtools package in R. Thanks for providing a comprehensive guide that makes statistical modeling more approachable and impactful for the R community."
Investment banking as a career in India

77. Learning is fun if supplemented by valuable insights as provided by the author. Thanks for sharing.

Investment banking courses after 12th

78. Beloved by smoked meat purists, Memphis, Tennessee developed one of the four dominant styles of regional BBQ in the U.S. While you’ll certainly find beef and chicken bbq rubs is the foundation of Memphis style BBQ, and pork ribs are its crowned jewel. Memphis style BBQ prepares ribs in two ways: wet and dry.