This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Saturday, July 21, 2012

Base versus grid graphics

In a comment in response to my previous post, Robert Young took issue with my characterization of grid as an R graphics package. Perhaps grid is better described as a “graphics support package,” but my primary point – and the main point of this post – is that to generate the display you want, it is sometimes necessary to use commands from this package. In my case, the need to learn something about grid graphics arose from my attempt to implement the CountSummary procedure to be included in the ExploringData package that I am developing. CountSummary is a graphical summary procedure for count data, based on Poissonness plots, negative binomialness plots, and Ord plots, all discussed in Chapter 8 of Exploring Data in Engineering, the Sciences, and Medicine. My original idea was to implement these plots myself, but then I discovered that all three were already available in the vcd package. One of the great things about R is that you are encouraged to build on what already exists, so using the vcd implementations seemed like a no-brainer.

Unfortunately, my first attempt at creating a two-by-two array of plots from the vcd package failed, and I didn’t understand why. The reason turned out to be that I was attempting to mix the base graphics command “par(mfrow=c(2,2))”, which sets up a two-by-two array, with various plotting commands from vcd, which are based on grid graphics. Because these two graphics systems don’t play well together, I didn’t get the results I wanted. In the end, however, by learning a little about the grid package and its commands, I was able to generate my two-by-two plot array without a great deal of difficulty.

Since grid graphics isn’t even mentioned in my favorite R reference book (Michael Crawley’s The R Book), I wanted to say a little here about what the grid package is and why you might need to know something about it. To do this, I will describe the ideas that went into the development of the CountSummary procedure and conclude this post with an example that shows what the output looks like. Next time, I will give a detailed discussion of the R code that generated these results. (For those wanting a preliminary view of what the code looks like, load the vcd package with the library command and run “example(Ord_plot)” – in addition to generating the plots, this example displays the grid commands needed to construct the two-by-two array.)




Count variables – non-negative integer variables like the “number of times pregnant” (NPG) variable from the Pima Indians database described below – are often assumed to obey a Poisson distribution, in much the same way that continuous-valued variables are often assumed to obey a Gaussian (normal) distribution. Like the normality assumption for continuous variables, the Poisson assumption for count data is sometimes reasonable, but sometimes it isn’t. Normal quantile-quantile plots like those generated by the qqnorm command in base R or the qqPlot command from the car package are useful in informally assessing the reasonableness of the normality assumption for continuous data. Similarly, Poissonness plots are the corresponding graphical tool for informally evaluating the Poisson hypothesis for count data. The construction and interpretation of these plots is discussed in some detail in Chapters 8 and 9 of Exploring Data, but briefly, the plot displays a quantity called the Poissonness count metameter, computed from the number of times each possible count value occurs in the data, against those count values; if the data sequence conforms to the Poisson distribution, the points on this plot should fall approximately on a straight line. A simple R function that constructs Poissonness plots is available on the OUP companion website for the book, but an implementation that is both more conveniently available and more flexible is the distplot function in the vcd package, which also generates the negative binomialness plot discussed below.


[Figure: Poissonness plot of the NPG variable, generated with the distplot procedure from the vcd package]

The figure above is the Poissonness plot constructed using the distplot procedure from the vcd package for the NPG variable from the Pima Indians diabetes dataset mentioned above. I have discussed this dataset in previous posts and have used it as the basis for several examples in Exploring Data. It is available from the UCI Machine Learning Repository and it has been incorporated in various forms as an example dataset in a number of R packages, including a cleaned-up version in the MASS package (dataset Pima.tr). The full version considered here contains nine characteristics for 768 female members of the Pima Indian tribe, including their age, medical characteristics like diastolic blood pressure, and the number of times each woman has been pregnant. If this NPG count sequence obeyed the Poisson distribution, the points in the above plot would fall approximately on the reference line included there. The fact that these points do not conform well to this line – note, in particular, the departure at the lower left end of the plot where most of the counts occur – calls the Poisson working assumption into question.
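
For readers who want to experiment, here is a rough sketch of how such a plot can be generated. It uses the cleaned-up Pima.tr dataset from the MASS package (in which the NPG variable appears as npreg) as a convenient stand-in for the full 768-record dataset used to construct the figure above, so the resulting plot will differ somewhat in its details.

## Rough sketch: a Poissonness plot for a count variable, using the cleaned-up
## Pima.tr data from MASS as a stand-in for the full Pima Indians dataset
## (in Pima.tr, the number-of-pregnancies variable is called npreg)
library(MASS)   # provides the Pima.tr dataset
library(vcd)    # provides the distplot function

NPGfreq <- table(Pima.tr$npreg)        # frequency of each observed count value
distplot(NPGfreq, type = "poisson")    # Poissonness plot with reference line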



A fundamental feature of the Poisson distribution is that it is defined by a single parameter that determines all of its distributional characteristics, including both the mean and the variance; in fact, for the Poisson distribution the variance is equal to the mean. This constraint is not satisfied by all of the count data sequences we encounter, however, and these deviations are important enough to receive special designations: integer sequences whose variance is larger than their mean are commonly called overdispersed, while those whose variance is smaller than their mean are commonly called underdispersed. In practice, overdispersion seems to occur more frequently, and a popular distributional alternative for overdispersed sequences is the negative binomial distribution. This distribution is defined by two parameters, and it is capable of matching both the mean and the variance of an arbitrary overdispersed count data sequence. For a detailed discussion of this distribution, refer to Chapter 3 of Exploring Data.
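
As a quick informal check of dispersion, the sample mean and variance of a count sequence can simply be compared; the sketch below does this for the Pima.tr stand-in variable introduced above.

## Quick informal dispersion check: for Poisson data the variance should be
## roughly equal to the mean, while a variance well above the mean suggests
## overdispersion (again using Pima.tr$npreg as a stand-in for NPG)
library(MASS)
npg <- Pima.tr$npreg
mean(npg)              # sample mean of the count sequence
var(npg)               # sample variance, to compare with the mean
var(npg) / mean(npg)   # dispersion ratio: roughly 1 for Poisson, > 1 if overdispersed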



As with the Poisson distribution, it is possible to evaluate the reasonableness of the negative binomial distribution graphically, via the negative binomialness plot. Like the Poissonness plot, this plot is based on a quantity – the negative binomialness metameter – computed from the number of times each count value occurs and plotted against those count values. To construct this plot, it is necessary to specify a numerical value for the distribution’s second parameter (the size parameter in the distplot command, corresponding to the r parameter in the discussion of this distribution given in Chapter 8 of Exploring Data). This can be done in several different ways, including the specification of trial values, the approach taken in the negative binomialness plot procedure available from the OUP companion website. The distplot command from the vcd package also supports this option: to obtain a negative binomialness plot, specify the type parameter as “nbinomial” and, if a fixed value is desired, give it as the size parameter in the distplot function call. Alternatively, if this parameter is not specified, the distplot procedure estimates it via the method of maximum likelihood – an extremely useful feature, although this estimation process can be time-consuming, especially for long data sequences. Finally, a third approach is to use the Ord plot described next to obtain an estimate of this parameter from a simple heuristic. In addition, this heuristic suggests which of the two candidate distributions – the Poisson or the negative binomial – is more appropriate for the data sequence.
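
Before turning to the Ord plot, here is a rough sketch of both distplot options just described; the fixed size value in the second call is arbitrary, chosen only for illustration.

## Negative binomialness plots with distplot (continuing the Pima.tr stand-in example)
## With no size value given, distplot estimates it by maximum likelihood:
distplot(table(Pima.tr$npreg), type = "nbinomial")
## With a fixed size parameter (the value 1 here is arbitrary, for illustration only):
distplot(table(Pima.tr$npreg), type = "nbinomial", size = 1)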



Like the Poissonness plot, the Ord plot computes a simple derived quantity from the original count data sequence – specifically, the frequency ratio, defined for each count value as that value multiplied by the ratio of the number of times it occurs to the number of times the next smaller count occurs – and plots this quantity against the counts. If the data sequence obeys the negative binomial distribution, these points should conform reasonably well to a line with positive slope, and this slope can be used to determine the size parameter for the distribution. Conversely, if the Poisson distribution is appropriate, the best-fit reference line for the Ord plot should have zero slope. Ord plots can also suggest two other discrete distributions (specifically, the binomial distribution and the log-series distribution), and the vcd package provides example datasets illustrating all four of these cases.
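
A rough sketch of this heuristic in action, again using the Pima.tr stand-in, appears below; as discussed in the next paragraph, Ord_plot returns the intercept and slope of its fitted line, and Ord_estimate converts these coefficients into a suggested distribution type and parameter estimate.

## Ord plot heuristic (continuing the Pima.tr stand-in example): Ord_plot draws
## the plot and returns the intercept and slope of its fitted line, and
## Ord_estimate turns these coefficients into a suggested distribution type
## and an associated parameter estimate
ordFit <- Ord_plot(table(Pima.tr$npreg))
ordFit                 # fitted intercept and slope
Ord_estimate(ordFit)   # suggested distribution type and parameter estimate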



For my CountSummary procedure, I decided to construct a two-by-two array with the following four components. First, in the upper left, I used the Ord_plot command in vcd to generate an Ord plot. This command returns the intercept and slope parameters for the reference line in the plot, and the Ord_estimate command can then be used to convert these values into a type specification and an estimate of the distribution parameter needed to construct the appropriate discrete distribution plot. I will discuss these results in more detail in my next post, but for the NPG count sequence considered here, the Ord plot results suggest the negative binomial distribution as the most appropriate choice, returning a parameter prob, from which the size parameter required for the negative binomialness plot may be computed (specifically, size = 1/prob - 1). The upper right quadrant of the display gives a text summary identifying the variable being characterized and listing the Ord plot recommendation and parameter estimate. Since the Poisson distribution is “the default” assumption for count data, the lower left plot shows a Poissonness plot for the data sequence, while the lower right plot is the “distribution-ness plot” for the distribution recommended by the Ord plot results. The results obtained by the CountSummary procedure for the NPG sequence are shown below, following a rough sketch of the grid layout machinery involved. Next time, I will present the code actually used to generate this plot.
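
To give a preview of the grid machinery involved – this is emphatically not the CountSummary code itself, which I will walk through next time – the following rough sketch shows one way of arranging grid-based vcd plots and a text panel in a two-by-two array, along the lines of what example(Ord_plot) displays. It assumes, as in that example, that the vcd plotting functions accept a newpage argument telling them to draw into the current viewport rather than opening a new page; the text panel here is only a placeholder, and the size value in the last call is arbitrary.

## Rough sketch (NOT the CountSummary code): a two-by-two array of grid-based
## vcd plots plus a text panel, built with grid viewports in the same way as
## the example(Ord_plot) demo
library(grid)
library(vcd)
library(MASS)

NPGfreq <- table(Pima.tr$npreg)   # stand-in count frequencies, as before

grid.newpage()
pushViewport(viewport(layout = grid.layout(2, 2)))

## Upper left: Ord plot
pushViewport(viewport(layout.pos.row = 1, layout.pos.col = 1))
Ord_plot(NPGfreq, newpage = FALSE)
popViewport()

## Upper right: text summary panel (placeholder text only)
pushViewport(viewport(layout.pos.row = 1, layout.pos.col = 2))
grid.text("NPG count summary (text panel placeholder)")
popViewport()

## Lower left: Poissonness plot
pushViewport(viewport(layout.pos.row = 2, layout.pos.col = 1))
distplot(NPGfreq, type = "poisson", newpage = FALSE)
popViewport()

## Lower right: negative binomialness plot (size value arbitrary here)
pushViewport(viewport(layout.pos.row = 2, layout.pos.col = 2))
distplot(NPGfreq, type = "nbinomial", size = 1, newpage = FALSE)
popViewport()

popViewport()   # pop the layout viewport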



[Figure: CountSummary display for the NPG count sequence]



Saturday, July 7, 2012

Graphical insights from the 2012 UseR! Meeting

About this time last month, I attended the 2012 UseR! Meeting.  Now an annual event, this series of conferences started in Europe in 2004 as an every-other-year gathering that now seems to alternate between the U.S. and Europe.  This year’s meeting was held on the Vanderbilt University campus in Nashville, TN, and it was attended by about 500 R aficionados, ranging from beginners who have just learned about R to members of the original group of developers and the R Core Team that continues to maintain it.  Many different topics were discussed, but one given particular emphasis was data visualization, which forms the primary focus of this post.  For a more complete view of the range of topics discussed and who discussed them, the conference program is available as a PDF file that includes short abstracts of the talks.

All attendees were invited to present a Lightning Talk, and about 20 of us did.  The format is essentially the technical equivalent of the 50-yard dash: before the talk, you provide the organizers exactly 15 slides, each of which is displayed for 20 seconds.  The speaker’s challenge is first, to try to keep up with the slides, and second, to try to convey some useful information about each one.  For my Lightning Talk, I described the ExploringData R package that I am in the process of developing, as a companion to both this blog and my book, Exploring Data in Engineering, the Sciences, and Medicine.  The intent of the package is first, to make the R procedures and datasets from the OUP companion site for the book more readily accessible, and second, to provide some additional useful tools for exploratory data analysis, incorporating some of the extensions I have discussed in previous blog posts.

Originally, I had hoped to have the package complete by the time I gave my Lightning Talk, but in retrospect, it is just as well that the package is still in the development stage, because I picked up some extremely useful tips on what constitutes a good package at the meeting.  As a specific example, Hadley Wickham, Professor of Statistics at Rice University and the developer of the ggplot2 package (more on this later), gave a standing-room-only talk on package development, featuring the devtools package, something he developed to make the R package development process easier.  In addition, the CRC vendor display at the meeting gave me the opportunity to browse and purchase Paul Murrell’s book, R Graphics, which provides an extremely useful, detailed, and well-written treatment of the four different approaches to graphics in R that I will say a bit more about below.

Because I am still deciding what to include in the ExploringData package, one of the most valuable sessions for me was the invited talk by Di Cook, Professor of Statistics at Iowa State University, who emphasized the importance of meaningful graphical displays in understanding the contents of a dataset, particularly if it is new to us.  One of her key points – illustrated with examples from some extremely standard R packages – was that the “examples” associated with datasets included in R packages often fail to include any such graphical visualization, and even for those that do, the displays are often too cryptic to be informative.  While this point is obvious enough in retrospect, it is one that I – along with a lot of other people, evidently – had not thought about previously.  As a consequence, I am now giving careful thought to the design of informative display examples for each of the datasets I will include in the ExploringData package.

As I mentioned above, there are (at least) four fundamental approaches to doing graphics in R.  The one that most of us first encounter – the one we use by default every time we issue a “plot” command – is called base graphics, and it is included in base R to support a wide range of useful data visualization procedures, including scatter plots, boxplots, histograms, and a variety of other common displays.  The other three approaches to graphics – grid graphics, lattice graphics, and ggplot2 – all offer more advanced features than what is typically available in base graphics, but they are, most unfortunately, incompatible in a number of ways with base graphics.  I discovered this the hard way when I was preparing one of the procedures for the ExploringData package (the CountSummary procedure, which I will describe and demonstrate in my next post).  Specifically, the vcd package includes implementations of Poissonness plots, negative binomialness plots, and Ord plots, all discussed in Exploring Data, and I wanted to take advantage of these implementations in building a simple graphical summary display for count data.  In base graphics, to generate a two-by-two array of plots, you simply specify “par(mfrow=c(2,2))” and then generate each individual plot using standard plot commands.  When I tried this with the plots generated by the vcd package, I didn’t get what I wanted – for the most part, it appeared that the “par(mfrow=c(2,2))” command was simply being ignored, and when it wasn’t, multiple plots were piled up on top of each other.  It turns out that the vcd package uses grid graphics, which has a fundamentally different syntax: it’s more complicated, but in the end, it does provide a wider range of display options.  Ultimately, I was able to generate the display I wanted, although this required some digging, since grid graphics aren’t really discussed much in my standard R reference books.  For example, The R Book by Michael J. Crawley covers an extremely wide range of useful topics, but the only mentions of “grid” in the index refer to the generation of grid lines (e.g., the base graphics command “grid” generates grid lines on a base R plot, which is not based on grid graphics). 
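
As a minimal illustration, the base graphics approach works fine when all of the individual plots are themselves base graphics plots, as in the following sketch:

## Minimal base graphics example: par(mfrow) works here because hist, plot,
## boxplot, and qqnorm are all base graphics commands, which is NOT the case
## for the grid-based vcd plotting functions discussed in the text
oldpar <- par(mfrow = c(2, 2))   # set up a two-by-two plot array
x <- rnorm(100)
hist(x, main = "Histogram")
plot(x, type = "l", main = "Sequence plot")
boxplot(x, main = "Boxplot")
qqnorm(x); qqline(x)             # normal Q-Q plot with reference line
par(oldpar)                      # restore the previous graphics settings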

Often, grid graphics are mentioned in passing in introductory descriptions of trellis (lattice) graphics, since the lattice package is based on grid graphics.  This package is discussed in The R Book, and I have used it occasionally because it does support things like violin plots that are not part of base graphics.  To date, I haven’t used it much because I find the syntax much more complicated, but I plan to look further into it, since it does appear to have a lot more capability than base graphics do.  Also, Murrell’s R Graphics book devotes a chapter to trellis graphics and the lattice package, which goes well beyond the treatments given in my other R references, and this provides me further motivation to learn more.  The fourth approach to R graphics – Hadley Wickham’s ggplot2 package – was much discussed at the UseR! Meeting, appearing both in examples presented in various authors’ talks and as components for more complex and specialized graphics packages.  I have not yet used ggplot2, but I intend to try it out, since it appears from some of the examples that this package can generate an extremely wide range of data visualizations, with simple types comparable to what is found in base graphics often available as defaults.  Like the lattice package, ggplot2 is also based on grid graphics, making it, too, incompatible with base graphics.  Again, the fact that Murrell’s book devotes a chapter to this package should also be quite helpful in learning when and how to make the best use of it.
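
As a small example of the kind of display that motivates me to look further into lattice, the following sketch generates violin plots from the singer dataset that ships with the lattice package:

## A simple lattice example: violin plots of singer heights by voice part,
## using the singer dataset included with the lattice package
library(lattice)
bwplot(voice.part ~ height, data = singer,
       panel = panel.violin,
       xlab = "Height (inches)")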

This year’s UseR! Meeting was the second one I have attended – I also went to the 2010 meeting in Gaithersburg, MD, held at the National Institute of Standards and Technology (NIST).  Both have been fabulous meetings, and I fully expect future meetings to be as good: next year’s UseR! meeting is scheduled to be held in Spain and I’m not sure I will be able to attend, but I would love to.  In any case, if you can get there, I highly recommend it, based on my experiences so far.