This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

## Tuesday, February 15, 2011

### Boxplots and Beyond III: Violin Plots

This post is the third in a series of four on boxplots and closely related data visualization techniques for comparing subsets of a dataset, or comparing different datasets that we hope or expect to be similarly distributed.  The previous two posts in this series have dealt with the basic boxplot, simple variations like log transformations and variable-width boxplots, and the procedures used to identify and represent outliers in a boxplot display, both traditionally and in the face of significant asymmetry.  All of the boxplot variations discussed so far retain the basic “box” structure, however, and this is one of the inherent limitations of this data display: their visual simplicity can hide a lot of important details about how the data values are distributed.  A much more flexible extension of the basic boxplot is the violin plot, constructed by combining the concept of the boxplot with that of nonparametric density estimates.  Both boxplots and nonparametric density estimates are discussed in Exploring Data, but the idea of putting them together into a single data display was not.  The violin plot discussed in this post is one way of achieving this combination, and my next post will discuss beanplots, also built from these same components.  Violin plots are supported in R both as part of the lattice graphics package and via the add-on package wvioplot.  Because I find it somewhat easier to use, this post discusses the second of these options.

I have previously discussed the Old Faithful geyser data frame faithful, included in the datasets package in the base R installation, in part because it provides a nice, real example of a dataset with a bimodal distribution of values.  In fact, this data frame includes two variables – the waiting times between successive eruptions discussed last time, and the duration of each eruption – and it turns out that both of these variables exhibit strongly bimodal distributions.  This point is illustrated clearly in the four plots shown above.  The upper two plots are scatter plots of the observed values of the variables themselves, while the bottom two plots are nonparametric density estimates generated via the density command in R, using its default options.  As in my previous discussion of the waiting time data from this dataset, the bimodal character of these variables is clear from the two pronounced peaks seen in these density estimates.  Further, this bimodal character is also clearly evident in the scatter plots of the data values themselves, particularly in the duration data which shows a gap between approximately 2.5 and 3.0 where there are almost no points at all.  The main reason for considering this example here is that bimodality represents one of those potentially significant distributional details that is generally lost in boxplots.  This point is clearly illustrated in the figure shown below, which presents side-by-side boxplots of the waiting time divided by 20 on the left, and the duration of the eruptions on the right.  (The waiting times were rescaled simply to make the two boxplots fit reasonably in the same figure.)  Both plots suggest significant distributional asymmetry, but neither one gives any hint of the strongly bimodal character of these variables.

For comparison, the two figures shown below were both generated using the wvioplot package, which generates plots that may be viewed as boxplots whose “boxes” have been curved to reflect the estimated distribution of values over the observed data range.  The plot on the left uses the default options for wvioplot, and one of the points of this example is that these default options are not always much more informative than the basic boxplot.  This is certainly the case here: neither of the left-hand violin plots is at all suggestive of the bimodality we know to be a key characteristic of these variables.  The right-hand violin plots represent are slightly customized, generated by specifying two options.  First, note that the default violin plots on the left are both clipped, exhibiting hard limits at the smallest and largest values that appear in the data.  Clipping is the default, but specifying “clip = F” overrides this behavior, which is not typical for nonparametric density estimates: instead of abruptly dropping to zero, these estimates typically decay smoothly to zero and, as a result, extrapolate somewhat beyond the observed data range.  Specifying “clip = F” causes the density estimates used in the violin plots to exhibit this smooth decay behavior.

The second, more important, difference between the two sets of violin plots shown in the above pair of figures is that the right-hand pair specifies a different bandwidth parameter for these nonparametric density estimators.  In Chapter 8 of Exploring Data, I discuss the trade-offs inherent in selecting the bandwidth parameter for nonparametric density estimators: making the bandwidth too small yields a high-variance, “noisy” estimate with a lot of spurious “details,” while making the bandwidth too large yields low-variance but highly biased estimates that can miss critical details that are present.  For the violin plots generated by the wvioplot package, this bandwidth is specified by the adjust parameter, with the default value “adjust = 3” giving the results shown on the left.  The plots on the right were generated using the smaller value, “adjust = 1”, which reduces the bandwidth of the density estimator enough to see the bimodal character of the two variables.

One of the key points of this example is that, while the added capability inherent in the violin plot relative to the simpler boxplot can convey extremely useful details that we might otherwise miss, we can’t always rely on the default settings for the procedure to show us these details.  The basic source of this difficulty is that more complex procedures like nonparametric density estimators (at least, more complex compared with the five-number summary on which the basic boxplot is based) have tuning parameters, and simply using the default options for these parameters doesn’t always lead to the results we want.  For this reason, it is a good idea to spend a bit of time exploring the key options in using these procedures.

## Sunday, February 6, 2011

### Boxplots and Beyond – Part II: Asymmetry

In my last post, I discussed boxplots in their simplest forms, illustrating some of the useful options available with the boxplot command in the open-source statistical software package R.  As I noted in that post, the basic boxplot is both useful and popular, but it does have its limitations.  One of those limitations is that the standard boxplot outlier rule is not appropriate for highly asymmetric data.  Specifically, the standard rule declares points to be outliers if they lie a fixed distance – typically, 1.5 times the interquartile distance (IQD) – beyond the quartiles.  While this rule is appropriate for symmetric, approximately Gaussian data distributions, highly asymmetric situations call for an outlier detection rule that treats upward-outliers and downward-outliers differently.  For a strongly right-skewed distribution, for example, flagging any point more than 1.5 IQD’s above the upper quartile may be too liberal, declaring too many points to be upward-outliers, while flagging any point more than 1.5 IQD’s below the lower quartile may be too conservative, declaring too few points to be downward-outliers.  To handle these cases, it is necessary to incorporate some measure of skewness into the outlier detection rule.

One of the points I discuss in Chapter 7 of Exploring Data is that the standard skewness estimator - defined as a normalized third moment - is extremely sensitive to outliers, worse even than the standard deviation in this respect.   This point is illustrated in the above figure, which gives a standard boxplot summary comparing four different skewness estimates for two different cases.  The boxplots on the left present results for these skewness measures each applied to 1000 statistically independent standard Gaussian data samples, each of length n = 100, while the boxplots on the right present the corresponding results for exactly the same data sequences, but with each Gaussian data sample contaminated by a single additive outlier, lying 8 standard deviations above the mean.  The boxplots designated “Mo” in these comparisons correspond to the standard moment-based estimator, while the other boxplots correspond to three other skewness estimators, discussed in the next paragraph.  Two points are immediately evident from these results: first, that the moment estimator exhibits much higher variability than the other three estimators considered here, and second, that the moment estimator is extremely outlier-sensitive.  In particular, note that a single outlier in a dataset of size 100 causes the median estimate to shift from about zero, corresponding to the correct nominal result, to a large enough value to suggest asymmetry comparable to the J-shaped exponential distribution (for which the skewness is 2), all on the basis of a single anomalous data point.

Two of the other three skewness estimators compared in these boxplots are Hotelling’s skewness measure, designated “Ho,” and Galton’s skewness measure, designated “Ga.  Both of these alternatives to the standard moment-based skewness measure are discussed in Chapter 7 of Exploring Data, which presents similar comparisons to those presented here for these three asymmetry measures.  Hotelling’s measure is equal to the difference between the mean and the median of the data, divided by the estimated standard deviation, and it can be shown to lie between -1 and +1.  The boxplot comparisons presented here show that this measure is much less outlier-sensitive than the standard skewness measure, although this estimate does exhibit a slight positive bias (slight is the operative word here: the mean value for the contaminated “Mo” estimates is 2.36, while that for the “Ho” estimate is 0.05).  Galton’s skewness measure is based on sample quartiles and is closely related to the fourth measure discussed next.  Specifically, both of these measures are based on the following function: let x and y be any two values from the data sample satisfying the condition that x is smaller than the sample median m, which is in turn smaller than y.  The function h(x,y) on which these measures are based is defined as the ratio of two computed values: the numerator is (y – m) – (m – x), and the denominator is y – x.  If we take x as the lower quartile and y as the upper quartile, we obtain Galton’s skewness measure, designated “Ga” in the boxplots.  The other skewness measure is the medcouple, defined as the median of all values of h(x,y) computed from points satisfying the condition that x < m < y.  Note that both the Galton and medcouple measures also lie between -1 and +1, since this is true for h(x,y) for any admissible values of x and yDetailed comparison of the “Ga” and “Mo” boxplots shows that both of these estimators exhibit small bias – i.e., their median values are both approximately zero, as they should be – and that the medcouple exhibits a slightly smaller range of variation than Galton’s skewness measure does.

Procedures in R to compute the standard skewness measure, Hotelling’s measure, and Galton’s measure are available from the companion website for Exploring Data.  The medcouple is not discussed in the book, but it is available in the R add-on package robustbase, which also includes a procedure to construct the adjusted boxplot described next.  There, rather than using 1.5 interquartile distances to set both upper and lower outlier limits as in the standard boxplot rule, the adjbox procedure in the robustbase package makes this distance depend on the medcouple value computed from the data.  Specifically, both outlier detection limits have the same general form, but different specific parameters: in both cases, the nominal distance of 1.5 IQD’s is multiplied by a scale factor computed from the medcouple value.  For lower outliers, this scale factor is exp(-3.5 Mc), while for upper outliers, this scale factor is exp(+4Mc).  A detailed discussion of this outlier detection rule and the rationale behind it is presented both in the paper, “An Adjusted Boxplot for Skewed Distributions,” by Ellen Vanderviere and Mia Huber, published in the proceedings of the 2004 COMPSTAT symposium, and from the technical report cited in the R documentation for the adjbox procedure.

The above figure compares two sets of boxplots for the UScereal dataset discussed last time, from the MASS package in R.  Specifically, both boxplots summarize the reported potassium content for breakfast cereals from 6 different manufacturers, each identified by a one-letter designation.  Both boxplots are generated using the log = “y” and varwidth options to obtain boxplots whose width is proportional to the square root of the number of records in each subsample.  The difference in these two sets of boxplots is that the one on the left uses the standard outlier identification algorithm, which is the default for the boxplot command in base R, while the boxplot series on the right uses the modified outlier detection rule just described, based on the medcouple skewness measure and available via the adjbox command in the robustbase add-on package.  The points declared outliers for each boxplot are shown as solid circles for emphasis (simply specify “pch = 16” in either boxplot command), and these points illustrate the difference between the two boxplot results.  Specifically, the left-hand group of boxplots only show upward outliers, one for manufacturer G (General Mills) and two for manufacturer K (Kellogs), while the right-hand boxplots only show downward outliers, three for General Mills, one for Kellogs, and one for Quaker Oats (manufacturer Q).  The question remains which of these conclusions is more reasonable, and this question will be revisited in subsequent posts using two additional boxplot extensions: violin plots and bean plots, both of which incorporate nonparametric density estimates to give a more detailed picture of the data.  For now, it is enough to note three things: first, an alternative to the standard boxplot is available that provides distinct detection rules for upper and lower outliers; second, that this alternative approach often finds different outliers than the standard boxplot rule does; and third, there is evidence to suggest that this modified boxplot does indeed perform better in the face of significant distributional asymmetry.  For a more detailed discussion of this point, refer to the technical report cited above.

Finally, it is worth emphasizing that the robustbase package contains a lot more than just the adjbox procedure discussed here, including multivariable outlier detection procedures like adjOutlyingness, robust multivariate location and scale estimation procedures like covMcd, and robust fitting procedures like lmrob and glmrob for linear models and generalized linear models (specifically, a robust logistic regression procedure for binomial data, and a robust Poisson regression procedure for count data), among others.  The biggest drawback of this package is that the documentation is incomplete, so it is important to check out the references cited there to get a clear idea of what some of these procedures do and how they do it.  That said, it is important to emphasize two points.  First, if R were a commercially supported software package, this level of documentation would be inexcusable, but R is not a commercial product: it is the freely-available end result of the loosely coordinated efforts of a very large group of unpaid volunteers.  The second point is that, despite the limitations of the documentation, the robustbase package provides implementations of procedures that could, in principle, be built on the basis of their published descriptions by anyone who wanted to.  The experience of doing that would be educational and intellectually rewarding, but it would also be a lot of work that most of us would simply never do, so we would not have any of the neat goodies that the authors have made available to us.  In the end, while I would strongly encourage them to finish the documentation, more importantly, I offer my heartfelt appreciation for their releasing the package so I can use it.