This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Saturday, January 29, 2011

Boxplots and Beyond - Part I

Boxplots are a simple and reasonably popular way of summarizing the range of variation of a real-valued variable across different subsets of data.  Typical examples might include diastolic blood pressure across a group of patients, broken down by gender and smoking status, or the breaking strength of material samples broken down by material type and manufacturing process.  Like all simple summaries, boxplots have their limitations, and these have motivated a number of useful variations on the theme.  The basic boxplot is introduced in Chapter 1 of my book Exploring Data in Engineering, the Sciences, and Medicine, and it is used fairly extensively throughout the rest of the book.  This post is the first in a series of four posts that briefly describe some useful extensions of the basic boxplot that are not covered in the book, including variable width boxplots, boxplots based on a robust asymmetry measure, violin plots, and beanplots, all easily constructed using procedures from the open-source R programming language.


As an example, the above figure shows the simplest form of the boxplot summary, constructed from the UScereal dataset included in the MASS package in R, which characterizes 65 cereals from 6 different manufacturers, based on the information included on their FDA-mandated package labels.  The six manufacturers are abbreviated "G" for General Mills, "K" for Kellogg's, "N" for Nabisco, "P" for Post, "Q" for Quaker Oats, and "R" for Ralston Purina, and the above boxplot summarizes the potassium content for one cup of each cereal, based on the label information.  (The metadata for this dataset list the units for this quantity as "grams," but it seems more likely that the units are actually milligrams; otherwise, the largest value in this dataset would be almost a kilogram, which seems like an awful lot of potassium for one cup of breakfast cereal.  This highlights one of the other points I discuss in Exploring Data, that metadata is not always accurate.)  For those not familiar with the basic boxplot, it is based on Tukey's 5-number summary for each data subset, plotted as follows:
  1. the sample minimum, defining the horizontal line at the bottom of each plot;
  2. the lower quartile, defining the lower limit of the box in each figure;
  3. the sample median, represented by the heavy line inside each box;
  4. the upper quartile, defining the upper limit of each box;
  5. the sample maximum, defining the horizontal line at the top of each plot.
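
For readers who want to try this in R, here is a minimal sketch of how a plot like the one above can be generated (the axis labels are my own choices, and the original figure may have used slightly different options):

    library(MASS)                                  # provides the UScereal dataset
    # range = 0 extends the whiskers to the sample minimum and maximum,
    # giving the simple five-number-summary boxplot described above
    boxplot(potassium ~ mfr, data = UScereal, range = 0,
            xlab = "Manufacturer", ylab = "Potassium")
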
Although the figure shown above represents the simplest form of the boxplot, it is probably not the most commonly used, and it is not the default boxplot generated in R.  Instead of always marking the sample minimum and maximum with horizontal lines as in the plot above, it is more common to put these lines at the limits of the nominal range of the data inferred from the upper and lower quartiles, denoting any points that fall outside this range as open circles.  This is illustrated in the boxplot shown below, which was constructed from the same dataset as that shown above, but using the default boxplot command in R.
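
The corresponding R command with its default options is sketched below; the default whisker behavior implements the outlier rule discussed in the next paragraph:

    # Default boxplot: whiskers extend to the most extreme points lying within
    # 1.5 interquartile distances of the quartiles; anything beyond is drawn
    # as an open circle
    boxplot(potassium ~ mfr, data = UScereal,
            xlab = "Manufacturer", ylab = "Potassium")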



In Chapter 7 of Exploring Data, I discuss the problem of outlier detection, using either the well-known "three-sigma edit rule" (which, ironically enough, often fails to detect outliers because they inflate the standard deviation used to classify points as outliers, making this classification less likely), or the more robust Hampel identifier, which is analogous but replaces the outlier-sensitive mean with the outlier-resistant median and the extremely outlier-sensitive standard deviation with the highly robust MADM scale estimate.  Unfortunately, I do not discuss the boxplot outlier detection rule in Exploring Data, but I do discuss it at some length in my other book, Mining Imperfect Data.  The basic idea is that the interquartile distance (IQD) - i.e., the difference between the upper and lower quartiles, corresponding to the vertical extent of the central box in the boxplot - is used to determine the nominal range of data variation.  Like the MADM scale estimate, the IQD is less outlier-sensitive than the standard deviation, so it provides a more reliable basis for detecting outliers.  In the typical boxplot representation - like the example shown above - the upper end of the nominal data range is defined as the upper quartile plus 1.5 times the IQD, and the lower end of the nominal data range is defined as the lower quartile minus 1.5 times the IQD.  If the observed range of data values falls within these limits, the horizontal lines at the top and bottom of the boxplots correspond to the sample maximum and sample minimum as described above.  If any points fall outside this range, however, they are plotted as open circles and the horizontal lines correspond to the most extreme data values that fall within this nominal range.  In cases where the data distribution is markedly asymmetric, this approach may not be appropriate, and the next post in this series will describe a better alternative from the R package robustbase.
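
As a concrete illustration of this rule, the following sketch computes the nominal data range for one of the subsets by hand.  (Note that R's boxplot.stats function actually uses Tukey's hinges rather than the quantile function's default quartile estimates, so the fences it reports can differ slightly from those computed here.)

    x <- UScereal$potassium[UScereal$mfr == "G"]   # General Mills cereals
    q <- quantile(x, probs = c(0.25, 0.75))        # lower and upper quartiles
    IQD <- q[2] - q[1]                             # interquartile distance
    lowerLimit <- q[1] - 1.5 * IQD                 # lower end of nominal range
    upperLimit <- q[2] + 1.5 * IQD                 # upper end of nominal range
    x[x < lowerLimit | x > upperLimit]             # points plotted as open circles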


One of the topics I discuss at some length in Exploring Data is the use of transformations in exploratory data analysis, both to simplify some problem formulations and to potentially give more informative views of a dataset.  One transformation that is often extremely useful when the data values span a wide range is the logarithm.  The plot above shows the same boxplot as before, but now with the log="y" plotting option specified to give a logarithmic scaling to the y-axis.  Note that the appearance of this plot is quite different, giving greater visual emphasis to the range from 20 to 500 where most of the data values lie, while the previous plot gave equal emphasis to the range of values above 500, occupied by only a few points.  Logarithmic transformations are not always useful or even feasible - the R boxplot command returns an error message and generates no plot if the data includes zeros or negative values - but in the right circumstances they can be very informative.
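
The log-scaled plot shown above can be generated by adding the log option to the same boxplot command, roughly as follows:

    # Same data as before; only the y-axis scaling changes
    boxplot(potassium ~ mfr, data = UScereal, log = "y",
            xlab = "Manufacturer", ylab = "Potassium (log scale)")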


Finally, another extremely useful boxplot option is varwidth=TRUE, which causes the individual boxplots to be drawn with variable widths, wider for boxplots based on more data and narrower for boxplots based on less data.  The above figure shows the effect of adding this option to the previous boxplot.  Here, the width of each boxplot is drawn proportional to the square root of the number of data values on which it is based.  Thus, it is clear from this boxplot that two of the manufacturers (G and K) are more widely represented in these 65 cereals than the others, while manufacturer N has the smallest representation.  (In fact, of the 65 cereals, the representations of the different manufacturers are 22 for G, 21 for K, 3 for N, 9 for P, and 5 each for Q and R.)  Note that the results of designed experiments are often fairly evenly balanced, so that the different subsets being compared are of about the same size.  Where the variable width option becomes particularly useful is in the analysis of historical datasets that were not collected on the basis of a designed experiment, where the sizes of the different subsets compared may vary substantially from one to another.  The practical advantage of variable width boxplots like the one shown above is that they draw your attention to any substantial size differences that may exist between the data subsets being compared, differences that may not be obvious from the outset.  The rationale for the square root scaling used by the varwidth option is that the variability of asymptotically normal estimators - such as the median and the upper and lower quartiles - exhibits a standard deviation that decreases inversely with the square root of the sample size.  Thus, the square root of the sample size may be taken as a rough measure of "strength of evidence" in comparing these estimators or displays like boxplots that are based on these estimators.  Also, in characterizing very large datasets it may happen that the sizes of the subsets being compared vary by several orders of magnitude, and in such cases, making the boxplot widths proportional to the square root of the sample size rather than to the sample size itself gives a much more reasonable display.  Arbitrary scalings are also possible with the R boxplot command, by explicitly specifying the relative width of each boxplot via the width parameter.
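
A sketch of the command with this option added, together with a tabulation of the subset sizes quoted above:

    # Box widths proportional to the square root of each subset size
    boxplot(potassium ~ mfr, data = UScereal, log = "y", varwidth = TRUE,
            xlab = "Manufacturer", ylab = "Potassium (log scale)")
    table(UScereal$mfr)                            # how many cereals per manufacturer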

That's all for now.  Subsequent posts in this sequence will discuss the alternative outlier detection strategy employed in the robustbase package (along with some of the other useful goodies in that package), an extension called violin plots that combines the idea of the boxplot with a nonparametric density estimate (these are available both in the wvioplot add-on package and as part of the lattice package included with base R installations), and another similar extension called beanplots, available via the beanplot add-on package.
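
For those who want to follow along, the add-on packages mentioned above can be installed from CRAN in the usual way, assuming they are available for your platform:

    install.packages(c("robustbase", "wvioplot", "beanplot"))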

Saturday, January 22, 2011

The Art of Exploratory Data Analysis

This blog is about the art of exploratory data analysis, which is also the subject of my new book, Exploring Data in Engineering, the Sciences, and Medicine (http://www.oup.com/us/ExploringData).  This art is appropriate in situations where you are faced with an existing dataset that you want to understand better.  As Stanford University statistics professor Persi Diaconis describes it in "Theories of data analysis: From magical thinking through classical statistics" (Chapter 1 of Exploring Data Tables, Trends, and Shapes, D.C. Hoaglin, F. Mosteller, and J.W. Tukey, eds., Wiley, 1985), this art involves the following activities:
We look at numbers and try to find patterns.  We pursue leads suggested by background information, imagination, patterns perceived, and experience with other data analyses.
Later posts will explore specific topics in more detail, but the objective of this first post is to set the stage for those discussions by emphasizing the following three points.  First, data analysis is an exercise in approximation rather than a purely mathematical activity that leads to definite, correct answers.  In particular, any answer other than “4” to the question, “how much is 2 + 2?” is simply wrong (barring tricks, of course, like casting the problem in base 3 arithmetic).  Similarly, other mathematical questions like “what is the square root of 3?” or “what is the probability of drawing a sample from a Gaussian distribution that lies 2 or more standard deviations away from the mean?” also have clear, correct answers.  In contrast, data analysis questions like, “how is the waiting time between eruptions related to the duration of eruptions for the Old Faithful geyser in Yellowstone National Park?” do not have similarly well-defined answers.  Certainly, we can provide quantitative characterizations that attempt to describe this relationship (for example, the product-moment correlation coefficient between these values computed from a set of 272 observations is approximately 0.901), but this can usually be done in more than one way (e.g., the Spearman rank correlation coefficient between these same numbers is somewhat smaller, at 0.778), and however we compute these numbers, they are generally only partial characterizations.  The practical consequence of this difference between data analysis problems and purely mathematical problems is the necessity of somehow dealing with unexplainable variation in the data.
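
Both correlation values quoted above come from the faithful dataset included with R and can be reproduced with a couple of lines of code:

    # Product-moment (Pearson) and Spearman rank correlations between
    # eruption duration and the waiting time between eruptions
    cor(faithful$eruptions, faithful$waiting)                       # roughly 0.901
    cor(faithful$eruptions, faithful$waiting, method = "spearman")  # roughly 0.778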

This need for approximation in analyzing real-world data has been widely recognized, and it motivates the widespread use of probability and statistics in analyzing data.  This is not the only possible approach - set-theoretic methods based on the “unknown but bounded” uncertainty model are also possible, for example, although they are not nearly as popular or as well developed as statistical methods - and there are even those who are unwilling to accept uncertainty in describing real-world data.  One colorful example is Alfred William Lawson’s Principle of Zig-Zag-and-Swirl, discussed briefly in my book and in more detail in L.D. Henry’s biography of Lawson, Zig-Zag-and-Swirl (see the Other Interesting Books section of The Exploring Data Store at the end of this post for details).  Lawson dabbled in many things, from playing and managing minor league baseball to writing a Utopian novel (the late Martin Gardner characterized it as “the worst work of fiction ever published”).  He is credited with introducing the term “aircraft” into general use not long after the Wright brothers’ first flight, and he obtained the first U.S. airmail contract before his company fell into receivership.  He is of interest here because he attempted to develop his own laws of physics, advocating the development of something he termed “Supreme Mathematics” to deal with the complexity of the real world.  As an illustrative example, he asked us to consider “a germ, moving across a blood corpuscle in the body of a man who is walking down the aisle of a flying airplane.”  To describe the true complexity of the germ’s path, including the man’s motion in the airplane, the airplane’s flight over the earth, the earth’s rotation about the sun, and so forth, Lawson argued that ordinary mathematics was inadequate and needed to be replaced by his Supreme Mathematics.  The fundamental difficulty is that, if we are really attempting to describe anything interesting about the germ, we probably don’t care at all about whatever eighteenth-order influence Jupiter’s gravity may have on the germ’s motion.  If we are willing to live with approximations, we can avoid the necessity of Supreme Mathematics.

To proceed, then, we need to make working assumptions about the nature of our approximations, providing practical descriptions of the uncertainty inherent in our data.  As noted, this data approximation problem can be approached in a number of different ways, but the most popular approach is via probability theory, describing our data uncertainty with some sort of random variable model.  To make this idea useful, we need to adopt a specific random variable model, which brings us to the second key aspect of data analysis considered here: exactly how do we describe data uncertainty?  Historically, the Gaussian distribution has been adopted so extensively as a data characterization that it has frequently been simply assumed without question.  In fact, a contemporary of Poincaré’s once observed that:
Experimentalists tend to regard it as a mathematical result that data values obey a Gaussian distribution, whereas mathematicians tend to regard it as an experimental result.
In practice, this assumption sometimes represents a reasonable approximation of reality, but sometimes it is horribly inadequate.  This point is important because adopting the Gaussian distribution as a working assumption leads to a lot of very useful data analysis techniques (ordinary least squares regression methods, to name only one example), but in cases where the Gaussian distribution is a poor approximation, the results obtained using these methods can be extremely misleading. 

The third key point of this post is that simple graphical tools for assessing the reasonableness of popular working assumptions like this one are extremely useful in exploratory data analysis, as are alternative analysis approaches that perform better when these assumptions are inappropriate.  Future posts will describe some of these tools and alternatives in greater detail, but the following example provides a useful illustration of both how we might assess real data distributions and the fact that these distributions can sometimes be very non-Gaussian.  Both of these points are illustrated in the four plots in the figure below.  The upper two plots show nonparametric density estimates, computed from two different datasets.  The curves in these plots are estimates of the probability density function appropriate to these datasets, if we regard each record in the dataset as a statistically independent random sample, drawn from some unknown probability distribution.  (If you are familiar with histograms, think of these curves as “smoothed histograms.”)


The upper left plot shows the density estimated from the measurements of the carapace length of 200 crabs, from the crabs dataset included in the MASS package that is part of the open-source R programming language discussed briefly at the end of this post.  In this case, the estimated probability density appears reasonably similar to the “bell-shaped curve” that characterizes the Gaussian probability distribution, so this working assumption may be fairly reasonable here.  This assumption is further supported by the bottom left plot, which shows the normal quantile-quantile (Q-Q) plot constructed from these 200 measurements.  The construction and interpretation of these plots are described in detail in my book and in various other places (see, for example, the Wikipedia entry on Q-Q plots at http://en.wikipedia.org/wiki/Q-Q_plot), but the key point here is that they represent an informal graphical test of the assumption that the Gaussian distribution is a reasonable approximation.  If the data values fall approximately on the reference line - as in this plot - the Gaussian assumption may be a reasonable basis for analysis.  Conversely, the plot on the lower right shows marked deviations from this reference line, suggesting that the Gaussian distribution may not be reasonable there.  This plot was constructed from 272 measurements of the waiting times between eruptions of the Old Faithful geyser in Yellowstone National Park, and the nonparametric density estimate shown in the upper right plot was also constructed from this data sequence.  (These numbers were taken from the faithful dataset that is included as part of the base installation of the R programming language.)  This upper plot shows two pronounced peaks – one at approximately 50 minutes and the other at approximately 80 minutes – suggesting that the distribution of waiting times is bimodal, consistent with the “kink” seen in the Q-Q plot below it.  The point here is that, while the Gaussian distribution may be reasonable as an approximate description of the crab carapace length data, it is not at all reasonable for the Old Faithful waiting time data.
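
For completeness, a plot array much like the one described above can be constructed with the following sketch (the original figure may have been generated with different options, but the essential ideas are the same):

    library(MASS)                                  # provides the crabs dataset
    par(mfrow = c(2, 2))                           # 2 x 2 array of plots
    plot(density(crabs$CL), main = "Crab carapace length")
    plot(density(faithful$waiting), main = "Old Faithful waiting times")
    qqnorm(crabs$CL, main = "Normal Q-Q plot: carapace length")
    qqline(crabs$CL)
    qqnorm(faithful$waiting, main = "Normal Q-Q plot: waiting times")
    qqline(faithful$waiting)
    par(mfrow = c(1, 1))                           # restore the default layout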

In future posts, I plan to discuss some of these ideas in more detail, illustrating them with examples.  In no particular order, a few of the topics that appear on my radar are:

  • Metadata (the information describing the contents of a dataset that is all too often missing, incomplete, or incorrect);
  • Boxplots, modified boxplots, violin plots, and beanplots - useful tools for characterizing the range of variation of a numerical variable over different data subgroups;
  • Various types of data anomalies, including outliers, inliers, missing data, and misalignment errors: what they are, how to detect them, and what to do about them;
  • Interestingness measures as useful characterizations of categorical variables;
  • Data transformations and the things they can do, both expected and unexpected, sometimes good and sometimes very bad;
  • And anything else related to exploratory data analysis that strikes me as interesting along the way.

The results presented here were obtained using the open-source software package R, which I like a lot because it is both freely available and fabulous in the range of computational tools it provides.  For details, documentation, and download instructions, go to the Comprehensive R Archive Network (CRAN) at the following Web site:

            http://cran.r-project.org/

I should emphasize, however, that this blog is not about the R programming language per se, but rather about the kinds of data analysis that R supports extremely well. 

I hope you find this blog useful and interesting, and I welcome your comments and questions.  I can’t promise to respond right away, but I do promise to carefully consider any questions and comments that come my way and to respond to them when and how I can.