This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Saturday, January 29, 2011

Boxplots and Beyond - Part I

Boxplots are a simple and reasonably popular way of summarizing the range of variation of a real-valued variable across different subsets of data.  Typical examples might include diastolic blood pressure across a group of patients, broken down by gender and smoking status, or the breaking strength of material samples, broken down by material type and manufacturing process.  Like all simple summaries, boxplots have their limitations, and these have motivated a number of useful variations on the theme.  The basic boxplot is introduced in Chapter 1 of my book Exploring Data in Engineering, the Sciences, and Medicine, and it is used fairly extensively throughout the rest of the book.  This post is the first in a series of four that briefly describe some useful extensions of the basic boxplot that are not covered in the book, including variable-width boxplots, boxplots based on a robust asymmetry measure, violin plots, and beanplots, all easily constructed using procedures from the open-source R programming language.


As an example, the above figure shows the simplest form of the boxplot summary, constructed from the UScereal dataset included in the MASS package in R, which characterizes 65 cereals from 6 different manufacturers, based on the information included on their FDA-mandated package labels.  The six manufacturers are abbreviated "G" for General Mills, "K" for Kellogg's, "N" for Nabisco, "P" for Post, "Q" for Quaker Oats, and "R" for Ralston Purina, and the above boxplot summarizes the potassium content for one cup of each cereal, based on the label information.  (The metadata for this dataset list the units for this quantity as "grams," but it seems more likely that the units are actually milligrams; otherwise, the largest value in this dataset would be almost a kilogram, which seems like an awful lot of potassium for one cup of breakfast cereal.  This highlights one of the other points I discuss in Exploring Data, that metadata is not always accurate.)  For those not familiar with the basic boxplot, it is based on Tukey's 5-number summary for each data subset, plotted as follows:
  1. the sample minimum, defining the horizontal line at the bottom of each plot;
  2. the lower quartile, defining the lower limit of the box in each figure;
  3. the sample median, represented by the heavy line inside each box;
  4. the upper quartile, defining the upper limit of each box;
  5. the sample maximum, defining the horizontal line at the top of each plot.
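For readers who want to see the numbers behind the plot, the five-number summary can be computed for each manufacturer along the following lines.  This is only a sketch, not necessarily the code used to generate the figure: it assumes the MASS package is installed, and the object name five is purely illustrative.  Note that R's fivenum function returns Tukey's hinges, which can differ slightly from the quartiles returned by quantile with its default settings.

```r
library(MASS)   # provides the UScereal data frame

# Split the potassium values by manufacturer and compute Tukey's
# five-number summary for each subset
five <- sapply(split(UScereal$potassium, UScereal$mfr), fivenum)
rownames(five) <- c("minimum", "lower hinge", "median",
                    "upper hinge", "maximum")

# One column per manufacturer (G, K, N, P, Q, R)
print(round(five, 1))
```

Each column of the result lists, from top to bottom, the five values that the simple boxplot draws from bottom to top.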
Although the figure shown above represents the simplest form of the boxplot, it is probably not the most commonly used, and it is not the default boxplot generated in R.  Instead of always marking the sample minimum and maximum with horizontal lines as in the plot above, it is more common to put these lines at the limits of the nominal range of the data inferred from the upper and lower quartiles, denoting any points that fall outside this range as open circles.  This is illustrated in the boxplot shown below, which was constructed from the same dataset as that shown above, but using the default boxplot command in R.
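The two forms of the boxplot can be compared side by side with something like the following sketch.  The axis labels, titles, and object names (simple, dflt) are my own illustrative choices rather than anything taken from the original figures; the key difference is the range argument, where range = 0 extends the whiskers to the sample extremes and the default value of 1.5 gives the more common form.

```r
library(MASS)   # provides the UScereal data frame

op <- par(mfrow = c(1, 2))   # two plots side by side

# Simplest form: range = 0 puts the whiskers at the sample min and max
simple <- boxplot(potassium ~ mfr, data = UScereal, range = 0,
                  xlab = "Manufacturer", ylab = "Potassium per cup",
                  main = "Whiskers at min and max")

# Default form: whiskers limited to the nominal data range,
# with any points outside that range drawn as open circles
dflt <- boxplot(potassium ~ mfr, data = UScereal,
                xlab = "Manufacturer", ylab = "Potassium per cup",
                main = "R default")

par(op)   # restore the previous plotting layout

# Number of points drawn individually in each version
length(simple$out)
length(dflt$out)
```

The boxplot function invisibly returns the statistics it plots, so the whisker positions and any flagged points can be inspected in the stats and out components of the returned list.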



In Chapter 7 of Exploring Data, I discuss the problem of outlier detection, using either the well-known "three-sigma edit rule" or the more robust Hampel identifier.  Ironically enough, the three-sigma rule often fails to detect outliers because the outliers themselves inflate the standard deviation used to classify points, making this classification less likely.  The Hampel identifier is analogous, but it replaces the outlier-sensitive mean with the outlier-resistant median, and the extremely outlier-sensitive standard deviation with the highly robust MADM scale estimate.  Unfortunately, I do not discuss the boxplot outlier detection rule in Exploring Data, but I do discuss it at some length in my other book, Mining Imperfect Data.  The basic idea is that the interquartile distance (IQD) - i.e., the difference between the upper and lower quartiles, corresponding to the height of the central box in the boxplot - is used to determine the nominal range of data variation.  Like the MADM scale estimate, the IQD is less outlier-sensitive than the standard deviation, so it provides a more reliable basis for detecting outliers.  In the typical boxplot representation - like the example shown above - the upper end of the nominal data range is defined as the upper quartile plus 1.5 times the IQD, and the lower end of the nominal data range is defined as the lower quartile minus 1.5 times the IQD.  If the observed range of data values falls within these limits, the horizontal lines at the top and bottom of the boxplots correspond to the sample maximum and sample minimum as described above.  If any points fall outside this range, however, they are plotted as open circles and the horizontal lines correspond to the most extreme data values that fall within this nominal range.  Because this rule applies the same margin on both sides of the box, it may not be appropriate in cases where the data distribution is markedly asymmetric, and the next post in this series will describe a better alternative from the R package robustbase.
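The boxplot outlier rule described above is simple enough to work through by hand.  The following sketch does this for the Kellogg's cereals and compares the result with R's boxplot.stats function, which is what the boxplot command uses internally.  The variable names are illustrative, and the choice of manufacturer is arbitrary.

```r
library(MASS)   # provides the UScereal data frame

# Potassium values for one manufacturer (Kellogg's, coded "K")
x <- UScereal$potassium[UScereal$mfr == "K"]

h   <- fivenum(x)      # min, lower hinge, median, upper hinge, max
iqd <- h[4] - h[2]     # interquartile distance (height of the box)

# Nominal data range: 1.5 IQDs beyond each end of the box
lower <- h[2] - 1.5 * iqd
upper <- h[4] + 1.5 * iqd

# Points outside the nominal range are the ones drawn as open circles
flagged <- x[x < lower | x > upper]

# Whiskers sit at the most extreme values inside the nominal range
whiskers <- range(x[x >= lower & x <= upper])

# The same quantities as computed by R itself
bs <- boxplot.stats(x, coef = 1.5)
print(flagged)
print(bs$out)
print(whiskers)
print(bs$stats[c(1, 5)])
```

The coef argument is the multiplier applied to the IQD, so changing it from 1.5 makes the rule more or less aggressive; the corresponding argument in the boxplot command itself is range.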


One of the topics I discuss at some length in Exploring Data is the use of transformations in exploratory data analysis, both to simplify some problem formulations and to potentially give more informative views of a dataset.  One transformation that is often extremely useful when the data values span a wide range is the logarithm.  The plot above shows the same boxplot as before, but now with the log="y" plotting option specified to give a logarithmic scaling to the y-axis.  Note that the appearance of this plot is quite different, giving greater visual emphasis to the range from 20 to 500 where most of the data values lie, while the previous plot gave equal emphasis to the range of values above 500, occupied by only a few points.  Logarithmic transformations are not always useful or even feasible - the R boxplot command returns an error message and generates no plot if the data includes zeros or negative values - but in the right circumstances they can be very informative.
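A log-scaled boxplot along these lines might be generated as follows.  This is a sketch rather than the exact code behind the figure: the labels and the variable name y are illustrative, and the check for strictly positive values is simply one way of avoiding the error described above.

```r
library(MASS)   # provides the UScereal data frame

y <- UScereal$potassium

# A logarithmic axis requires strictly positive data, so check first
if (all(y > 0, na.rm = TRUE)) {
  boxplot(potassium ~ mfr, data = UScereal, log = "y",
          xlab = "Manufacturer",
          ylab = "Potassium per cup (log scale)")
} else {
  message("Data include zeros or negative values; log scaling skipped.")
}
```

Note that log = "y" changes only the axis scaling: the quartiles and the outlier limits are still computed from the untransformed values, which is not the same as drawing a boxplot of log(potassium).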


Finally, another extremely useful boxplot option is varwidth=TRUE, which causes the width of each boxplot to vary with the size of the data subset it summarizes: wider for boxplots based on more data and narrower for boxplots based on less data.  The above figure shows the effect of adding this option to the previous boxplot.  Here, the width of each boxplot is drawn proportional to the square root of the number of data values on which it is based.  Thus, it is clear from this boxplot that two of the manufacturers (G and K) are more widely represented in these 65 cereals than the others, while manufacturer N has the smallest representation.  (In fact, of the 65 cereals, the representations of the different manufacturers are 22 for G, 21 for K, 3 for N, 9 for P, and 5 each for Q and R.)  Note that the results of designed experiments are often fairly evenly balanced, so that the different subsets being compared are of about the same size.  The variable width option becomes particularly useful in the analysis of historical datasets that were not collected on the basis of a designed experiment, where the sizes of the different subsets compared may vary substantially from one to another.  The practical advantage of variable width boxplots like the one shown above is that they draw your attention to any substantial size differences that may exist between the data subsets being compared, differences that may not be obvious from the outset.  The rationale for the square root scaling used by the varwidth option is that the standard deviation of asymptotically normal estimators - such as the median and the upper and lower quartiles - decreases inversely with the square root of the sample size.  Thus, the square root of the sample size may be taken as a rough measure of "strength of evidence" in comparing these estimators or displays like boxplots that are based on these estimators.
Also, in characterizing very large datasets it may happen that the sizes of the subsets being compared vary by several orders of magnitude, and in such cases, making the boxplot widths proportional to the square root of sample size rather than proportional to sample size itself gives a much more reasonable display.  On the other hand, arbitrary scalings are possible with the R boxplot command, by explicitly specifying the relative width of each boxplot via the width parameter.
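The following sketch shows both width options on the cereal data: the square root scaling obtained with varwidth = TRUE, and, purely for comparison, widths made proportional to the sample sizes themselves via the width parameter.  The object names n and rel_width and the plot titles are illustrative choices, not part of the original figures.

```r
library(MASS)   # provides the UScereal data frame

# Number of cereals per manufacturer, as a named integer vector
counts <- table(UScereal$mfr)
n <- setNames(as.integer(counts), names(counts))
print(n)

# Relative box widths implied by the square root scaling
rel_width <- sqrt(n) / max(sqrt(n))
print(round(rel_width, 2))

op <- par(mfrow = c(1, 2))   # two plots side by side

# Widths proportional to the square root of the subset size
boxplot(potassium ~ mfr, data = UScereal, log = "y", varwidth = TRUE,
        main = "varwidth = TRUE")

# Arbitrary scaling: here, widths proportional to the subset size itself
boxplot(potassium ~ mfr, data = UScereal, log = "y",
        width = as.numeric(n),
        main = "width proportional to n")

par(op)   # restore the previous plotting layout
```

With only six subsets whose sizes range from 3 to 22, the two scalings give broadly similar pictures; the difference becomes much more pronounced when subset sizes differ by orders of magnitude.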

That's all for now.  Subsequent posts in this sequence will discuss the alternative outlier detection strategy employed in the robustbase package (along with some of the other useful goodies in that package), an extension called violin plots that combines the idea of the boxplot with a nonparametric density estimate (these are available both in the wvioplot add-on package and as part of the lattice package, a recommended package included with standard R installations), and another similar extension called beanplots, available via the beanplot add-on package.
