This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Sunday, February 6, 2011

Boxplots and Beyond – Part II: Asymmetry

In my last post, I discussed boxplots in their simplest forms, illustrating some of the useful options available with the boxplot command in the open-source statistical software package R. As I noted in that post, the basic boxplot is both useful and popular, but it does have its limitations. One of those limitations is that the standard boxplot outlier rule is not appropriate for highly asymmetric data. Specifically, the standard rule declares points to be outliers if they lie more than a fixed distance (typically 1.5 times the interquartile distance, or IQD) beyond the quartiles. While this rule is appropriate for symmetric, approximately Gaussian data distributions, highly asymmetric situations call for an outlier detection rule that treats upward outliers and downward outliers differently. For a strongly right-skewed distribution, for example, flagging any point more than 1.5 IQDs above the upper quartile may be too liberal, declaring too many points to be upward outliers, while flagging any point more than 1.5 IQDs below the lower quartile may be too conservative, declaring too few points to be downward outliers. To handle these cases, it is necessary to incorporate some measure of skewness into the outlier detection rule.
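The standard rule described above can be sketched in a few lines of R. This is a minimal illustration, not the exact computation inside boxplot: the helper names standard_fences and flag_outliers are hypothetical, and the sketch uses quantile() for the quartiles, whereas base R's boxplot works from hinges, which can differ slightly in small samples.

```r
# Standard boxplot rule: flag points lying more than coef * IQD
# beyond the lower or upper quartile.
standard_fences <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqd <- q[2] - q[1]                      # interquartile distance
  c(lower = q[1] - coef * iqd, upper = q[2] + coef * iqd)
}

# Logical vector marking the points outside the fences.
flag_outliers <- function(x, coef = 1.5) {
  f <- standard_fences(x, coef)
  x < f[["lower"]] | x > f[["upper"]]
}

# Compare a symmetric sample with a strongly right-skewed one.
set.seed(1)
x_sym  <- rnorm(1000)   # approximately Gaussian
x_skew <- rexp(1000)    # J-shaped exponential
c(symmetric = sum(flag_outliers(x_sym)),
  skewed    = sum(flag_outliers(x_skew)))
```

Running the last expression on the two simulated samples gives a quick feel for how many more points the symmetric rule flags when the data are skewed.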




One of the points I discuss in Chapter 7 of Exploring Data is that the standard skewness estimator, defined as a normalized third moment, is extremely sensitive to outliers, worse even than the standard deviation in this respect. This point is illustrated in the above figure, which gives a standard boxplot summary comparing four different skewness estimates for two different cases. The boxplots on the left present results for these skewness measures, each applied to 1000 statistically independent standard Gaussian data samples of length n = 100, while the boxplots on the right present the corresponding results for exactly the same data sequences, but with each Gaussian data sample contaminated by a single additive outlier lying 8 standard deviations above the mean. The boxplots designated “Mo” in these comparisons correspond to the standard moment-based estimator, while the other boxplots correspond to three other skewness estimators, discussed in the next paragraph. Two points are immediately evident from these results: first, the moment estimator exhibits much higher variability than the other three estimators considered here, and second, the moment estimator is extremely outlier-sensitive. In particular, note that a single outlier in a dataset of size 100 causes the median estimate to shift from about zero, the correct nominal result, to a value large enough to suggest asymmetry comparable to the J-shaped exponential distribution (for which the skewness is 2), all on the basis of a single anomalous data point.
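The simulation behind the “Mo” boxplots can be approximated as follows. This is a sketch under stated assumptions rather than the code used for the figure: the function name moment_skewness is hypothetical, the random seed is arbitrary, and the contamination is implemented by adding 8 to the first point of each sample, which is one reading of “a single additive outlier.”

```r
# Moment-based skewness: third central moment divided by the
# 3/2 power of the second central moment.
moment_skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

set.seed(101)                      # arbitrary seed, for reproducibility only
n_sim <- 1000                      # number of simulated samples
n     <- 100                       # length of each sample
clean <- matrix(rnorm(n_sim * n), nrow = n_sim)   # one sample per row

# Assumption: the outlier is created by shifting one point up by 8,
# i.e. 8 standard deviations for standard Gaussian data.
contaminated <- clean
contaminated[, 1] <- contaminated[, 1] + 8

skew_clean  <- apply(clean, 1, moment_skewness)
skew_contam <- apply(contaminated, 1, moment_skewness)

# Side-by-side boxplots of the estimates, clean vs. contaminated.
boxplot(list(clean = skew_clean, contaminated = skew_contam),
        ylab = "Moment-based skewness estimate")
```

The same two matrices can be reused with any other skewness estimator by swapping the function passed to apply.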

Two of the other three skewness estimators compared in these boxplots are Hotelling’s skewness measure, designated “Ho,” and Galton’s skewness measure, designated “Ga.” Both of these alternatives to the standard moment-based skewness measure are discussed in Chapter 7 of Exploring Data, which presents similar comparisons to those presented here for these three asymmetry measures. Hotelling’s measure is equal to the difference between the mean and the median of the data, divided by the estimated standard deviation, and it can be shown to lie between -1 and +1. The boxplot comparisons presented here show that this measure is much less outlier-sensitive than the standard skewness measure, although this estimate does exhibit a slight positive bias (slight is the operative word here: the mean value for the contaminated “Mo” estimates is 2.36, while that for the “Ho” estimates is 0.05). Galton’s skewness measure is based on sample quartiles and is closely related to the fourth measure, discussed next. Specifically, both of these measures are based on the following function: let x and y be any two values from the data sample satisfying the condition that x is smaller than the sample median m, which is in turn smaller than y. The function h(x, y) on which these measures are based is defined as a ratio: h(x, y) = [(y – m) – (m – x)] / (y – x). If we take x as the lower quartile and y as the upper quartile, we obtain Galton’s skewness measure, designated “Ga” in the boxplots. The other skewness measure is the medcouple, designated “Mc” and defined as the median of all values of h(x, y) computed from points satisfying the condition that x < m < y. Note that both the Galton and medcouple measures also lie between -1 and +1, since this is true for h(x, y) for any admissible values of x and y. Detailed comparison of the “Ga” and “Mc” boxplots shows that both of these estimators exhibit small bias (i.e., their median values are both approximately zero, as they should be) and that the medcouple exhibits a slightly smaller range of variation than Galton’s skewness measure does.
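The three alternative measures just described translate almost directly into R. The following is a minimal sketch, not the code from the companion website or from robustbase: all function names here are hypothetical, and medcouple_naive follows the verbal definition literally (strict x < m < y, all pairs formed at once), so it needs memory proportional to the square of the sample size and ignores the special handling of points tied with the median that a production implementation would need.

```r
# Hotelling's measure: (mean - median) / standard deviation.
hotelling_skewness <- function(x) (mean(x) - median(x)) / sd(x)

# Kernel function h(x, y) for x < m < y, where m is the sample median.
h <- function(x, y, m) ((y - m) - (m - x)) / (y - x)

# Galton's measure: h evaluated at the lower and upper quartiles.
galton_skewness <- function(x) {
  q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)
  h(q[1], q[3], q[2])
}

# Naive medcouple: median of h over all pairs with x < m < y.
medcouple_naive <- function(x) {
  m  <- median(x)
  lo <- x[x < m]                   # points strictly below the median
  hi <- x[x > m]                   # points strictly above the median
  median(outer(lo, hi, h, m = m))  # h applied to every (lo, hi) pair
}

# Example: a small right-skewed sample.
x <- c(1, 2, 3, 4, 10)
c(Ho = hotelling_skewness(x),
  Ga = galton_skewness(x),
  Mc = medcouple_naive(x))
```

For real work, the mc function in the robustbase package is the better choice, since the all-pairs approach here becomes impractical for large samples.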

Procedures in R to compute the standard skewness measure, Hotelling’s measure, and Galton’s measure are available from the companion website for Exploring Data. The medcouple is not discussed in the book, but it is available in the R add-on package robustbase, which also includes a procedure to construct the adjusted boxplot described next. There, rather than using 1.5 interquartile distances to set both upper and lower outlier limits as in the standard boxplot rule, the adjbox procedure in the robustbase package makes this distance depend on the medcouple value computed from the data. Specifically, both outlier detection limits have the same general form, but different specific parameters: in both cases, the nominal distance of 1.5 IQDs is multiplied by a scale factor computed from the medcouple value. For lower outliers, this scale factor is exp(-3.5 Mc), while for upper outliers, this scale factor is exp(+4 Mc). A detailed discussion of this outlier detection rule and the rationale behind it is presented both in the paper, “An Adjusted Boxplot for Skewed Distributions,” by Ellen Vandervieren and Mia Hubert, published in the proceedings of the 2004 COMPSTAT symposium, and in the technical report cited in the R documentation for the adjbox procedure.
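The adjusted outlier limits can be written out explicitly to make the asymmetry visible. The sketch below is a hypothetical helper, adjusted_fences, that applies the scale factors exactly as quoted above; it is not the robustbase implementation. Two assumptions are worth flagging: the formula is applied as stated, which is the right-skewed case (Mc of zero or more), and the constants -3.5 and +4 are taken from the text, so they may not match the defaults used by the package (its adjboxStats function exposes them as the arguments a and b, and its documentation is the place to check).

```r
# Adjusted fences: the nominal 1.5 * IQD distance is rescaled by
# exp(a * mc) below the lower quartile and exp(b * mc) above the upper one.
# Assumption: a and b are the constants quoted in the text.
adjusted_fences <- function(x, mc, coef = 1.5, a = -3.5, b = 4) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqd <- q[2] - q[1]
  c(lower = q[1] - coef * exp(a * mc) * iqd,
    upper = q[2] + coef * exp(b * mc) * iqd)
}

x <- 1:9
adjusted_fences(x, mc = 0)     # mc = 0 reproduces the standard rule
adjusted_fences(x, mc = 0.2)   # positive mc: lower fence pulled in, upper pushed out
```

With a positive medcouple, the lower fence moves toward the lower quartile and the upper fence moves away from the upper quartile, which is exactly the asymmetric treatment the adjusted boxplot is after.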



The above figure compares two sets of boxplots for the UScereal dataset discussed last time, from the MASS package in R. Specifically, both sets of boxplots summarize the reported potassium content for breakfast cereals from 6 different manufacturers, each identified by a one-letter designation. Both are generated using the log = “y” and varwidth options, giving a logarithmic vertical axis and boxplots whose width is proportional to the square root of the number of records in each subsample. The difference between these two sets of boxplots is that the one on the left uses the standard outlier identification algorithm, which is the default for the boxplot command in base R, while the boxplot series on the right uses the modified outlier detection rule just described, based on the medcouple skewness measure and available via the adjbox command in the robustbase add-on package. The points declared outliers for each boxplot are shown as solid circles for emphasis (simply specify “pch = 16” in either boxplot command), and these points illustrate the difference between the two boxplot results. Specifically, the left-hand group of boxplots shows only upward outliers, one for manufacturer G (General Mills) and two for manufacturer K (Kellogg’s), while the right-hand boxplots show only downward outliers, three for General Mills, one for Kellogg’s, and one for Quaker Oats (manufacturer Q). The question remains which of these conclusions is more reasonable, and this question will be revisited in subsequent posts using two additional boxplot extensions: violin plots and bean plots, both of which incorporate nonparametric density estimates to give a more detailed picture of the data.

For now, it is enough to note three things: first, an alternative to the standard boxplot is available that provides distinct detection rules for upper and lower outliers; second, this alternative approach often finds different outliers than the standard boxplot rule does; and third, there is evidence to suggest that this modified boxplot does indeed perform better in the face of significant distributional asymmetry. For a more detailed discussion of this point, refer to the technical report cited above.
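A comparison like the one described above can be reproduced along the following lines. This is a sketch based on the options named in the text, not necessarily the exact commands used for the figure; the panel titles are additions for readability, and the adjusted boxplot is drawn only if the robustbase package is installed.

```r
library(MASS)   # provides the UScereal data frame

# Two panels side by side: standard rule on the left, adjusted rule on the right.
op <- par(mfrow = c(1, 2))

# Standard boxplot: log-scaled y-axis, widths proportional to the square
# root of the group sizes, outliers drawn as solid circles.
boxplot(potassium ~ mfr, data = UScereal, log = "y",
        varwidth = TRUE, pch = 16, main = "Standard boxplot")

# Adjusted boxplot with the medcouple-based outlier rule.
if (requireNamespace("robustbase", quietly = TRUE)) {
  robustbase::adjbox(potassium ~ mfr, data = UScereal, log = "y",
                     varwidth = TRUE, pch = 16, main = "Adjusted boxplot")
} else {
  message("Package 'robustbase' is not installed; skipping adjbox().")
}

par(op)         # restore the previous graphics settings
```

Because both calls accept the same formula and graphical options, switching between the two outlier rules amounts to changing the function name.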

Finally, it is worth emphasizing that the robustbase package contains a lot more than just the adjbox procedure discussed here, including multivariable outlier detection procedures like adjOutlyingness, robust multivariate location and scale estimation procedures like covMcd, and robust fitting procedures like lmrob and glmrob for linear models and generalized linear models (specifically, a robust logistic regression procedure for binomial data, and a robust Poisson regression procedure for count data), among others.  The biggest drawback of this package is that the documentation is incomplete, so it is important to check out the references cited there to get a clear idea of what some of these procedures do and how they do it.  That said, it is important to emphasize two points.  First, if R were a commercially supported software package, this level of documentation would be inexcusable, but R is not a commercial product: it is the freely-available end result of the loosely coordinated efforts of a very large group of unpaid volunteers.  The second point is that, despite the limitations of the documentation, the robustbase package provides implementations of procedures that could, in principle, be built on the basis of their published descriptions by anyone who wanted to.  The experience of doing that would be educational and intellectually rewarding, but it would also be a lot of work that most of us would simply never do, so we would not have any of the neat goodies that the authors have made available to us.  In the end, while I would strongly encourage them to finish the documentation, more importantly, I offer my heartfelt appreciation for their releasing the package so I can use it.

 
