This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Saturday, July 7, 2012

Graphical insights from the 2012 UseR! Meeting

About this time last month, I attended the 2012 UseR! Meeting.  This series of conferences started in Europe in 2004 as an every-other-year gathering; it is now an annual event that seems to alternate between the U.S. and Europe.  This year’s meeting was held on the Vanderbilt University campus in Nashville, TN, and it was attended by about 500 R aficionados, ranging from beginners who have just learned about R to members of the original group of developers and the R Core Team that continues to maintain it.  Many different topics were discussed, but one given particular emphasis was data visualization, which forms the primary focus of this post.  For a more complete view of the range of topics discussed and who discussed them, the conference program is available as a PDF file that includes short abstracts of the talks.

All attendees were invited to present a Lightning Talk, and about 20 of us did.  The format is essentially the technical equivalent of the 50-yard dash: before the talk, you provide the organizers exactly 15 slides, each of which is displayed for 20 seconds.  The speaker’s challenge is first, to try to keep up with the slides, and second, to try to convey some useful information about each one.  For my Lightning Talk, I described the ExploringData R package that I am in the process of developing, as a companion to both this blog and my book, Exploring Data in Engineering, the Sciences, and Medicine.  The intent of the package is first, to make the R procedures and datasets from the OUP companion site for the book more readily accessible, and second, to provide some additional useful tools for exploratory data analysis, incorporating some of the extensions I have discussed in previous blog posts.

Originally, I had hoped to have the package complete by the time I gave my Lightning Talk, but in retrospect, it is just as well that the package is still in the development stage, because I picked up some extremely useful tips on what constitutes a good package at the meeting.  As a specific example, Hadley Wickham, Professor of Statistics at Rice University and the developer of the ggplot2 package (more on this later), gave a standing-room-only talk on package development, featuring the devtools package, something he developed to make the R package development process easier.  In addition, the CRC vendor display at the meeting gave me the opportunity to browse and purchase Paul Murrell’s book, R Graphics, which provides an extremely useful, detailed, and well-written treatment of the four different approaches to graphics in R that I will say a bit more about below.

Because I am still deciding what to include in the ExploringData package, one of the most valuable sessions for me was the invited talk by Di Cook, Professor of Statistics at Iowa State University, who emphasized the importance of meaningful graphical displays in understanding the contents of a dataset, particularly if it is new to us.  One of her key points – illustrated with examples from some extremely standard R packages – was that the “examples” associated with datasets included in R packages often fail to include any such graphical visualization, and even for those that do, the displays are often too cryptic to be informative.  While this point is obvious enough in retrospect, it is one that I – along with a lot of other people, evidently – had not thought about previously.  As a consequence, I am now giving careful thought to the design of informative display examples for each of the datasets I will include in the ExploringData package.

As I mentioned above, there are (at least) four fundamental approaches to doing graphics in R.  The one that most of us first encounter – the one we use by default every time we issue a “plot” command – is called base graphics, and it is included in base R to support a wide range of useful data visualization procedures, including scatter plots, boxplots, histograms, and a variety of other common displays.  The other three approaches to graphics – grid graphics, lattice graphics, and ggplot2 – all offer more advanced features than what is typically available in base graphics, but they are, most unfortunately, incompatible in a number of ways with base graphics.  I discovered this the hard way when I was preparing one of the procedures for the ExploringData package (the CountSummary procedure, which I will describe and demonstrate in my next post).  Specifically, the vcd package includes implementations of Poissonness plots, negative binomialness plots, and Ord plots, all discussed in Exploring Data, and I wanted to take advantage of these implementations in building a simple graphical summary display for count data.  In base graphics, to generate a two-by-two array of plots, you simply specify “par(mfrow=c(2,2))” and then generate each individual plot using standard plot commands.  When I tried this with the plots generated by the vcd package, I didn’t get what I wanted – for the most part, it appeared that the “par(mfrow=c(2,2))” command was simply being ignored, and when it wasn’t, multiple plots were piled up on top of each other.  It turns out that the vcd package uses grid graphics, which has a fundamentally different syntax: it’s more complicated, but in the end, it does provide a wider range of display options.  Ultimately, I was able to generate the display I wanted, although this required some digging, since grid graphics aren’t really discussed much in my standard R reference books.  For example, The R Book by Michael J. Crawley covers an extremely wide range of useful topics, but the only mentions of “grid” in the index refer to the generation of grid lines (e.g., the base graphics command “grid” generates grid lines on a base R plot, which is not based on grid graphics).
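
To make the contrast concrete, here is a minimal sketch of the two ways of arranging a two-by-two array.  The function names draw_base_array and draw_grid_array are hypothetical, chosen only for this illustration, and the grid version draws labeled placeholder boxes rather than actual vcd plots.  The faithful data frame is one of the example datasets that ships with R.

```r
library(grid)  # the grid package ships with R

# Base graphics: par(mfrow = ...) splits the device into a 2-by-2 array,
# and each subsequent high-level plot call fills the next cell.
draw_base_array <- function() {
  old_par <- par(mfrow = c(2, 2))
  on.exit(par(old_par))            # restore the previous layout on exit
  x <- faithful$eruptions
  hist(x, main = "Histogram")
  boxplot(x, main = "Boxplot")
  plot(x, main = "Index plot")
  qqnorm(x, main = "Normal Q-Q plot")
  invisible(NULL)
}

# Grid graphics: par(mfrow) is not consulted. The page is divided by
# pushing a viewport that carries a layout, and each cell is then
# addressed explicitly by its row and column.
draw_grid_array <- function(labels = c("A", "B", "C", "D")) {
  grid.newpage()
  pushViewport(viewport(layout = grid.layout(nrow = 2, ncol = 2)))
  for (i in seq_along(labels)) {
    row <- (i - 1) %/% 2 + 1
    col <- (i - 1) %% 2 + 1
    pushViewport(viewport(layout.pos.row = row, layout.pos.col = col))
    grid.rect(gp = gpar(col = "grey40"))   # outline of the cell
    grid.text(labels[i])                   # placeholder for a real plot
    popViewport()                          # back up to the layout viewport
  }
  popViewport()                            # leave the layout viewport
  invisible(length(labels))
}

draw_base_array()
draw_grid_array()
```

In a real display, the placeholder calls inside the loop would be replaced by a grid-based plotting call.  Such a function must be told not to start a new page of its own; whether and how it supports this (for example, through an argument along the lines of newpage = FALSE) is something to check on the function's help page.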

Often, grid graphics are mentioned in passing in introductory descriptions of trellis (lattice) graphics, since the lattice package is based on grid graphics.  This package is discussed in The R Book, and I have used it occasionally because it does support things like violin plots that are not part of base graphics.  To date, I haven’t used it much because I find the syntax much more complicated, but I plan to look further into it, since it does appear to have a lot more capability than base graphics do.  Also, Murrell’s R Graphics book devotes a chapter to trellis graphics and the lattice package, which goes well beyond the treatments given in my other R references, and this provides me further motivation to learn more.  The fourth approach to R graphics – Hadley Wickham’s ggplot2 package – was much discussed at the UseR! Meeting, appearing both in examples presented in various authors’ talks and as components for more complex and specialized graphics packages.  I have not yet used ggplot2, but I intend to try it out, since it appears from some of the examples that this package can generate an extremely wide range of data visualizations, with simple types comparable to what is found in base graphics often available as defaults.  Like the lattice package, ggplot2 is also based on grid graphics, making it, too, incompatible with base graphics.  Again, the fact that Murrell’s book devotes a chapter to this package should also be quite helpful in learning when and how to make the best use of it.
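
For readers who want to compare the syntax, the following is a minimal sketch of a violin plot in lattice, with a ggplot2 counterpart for reference.  It assumes the built-in chickwts dataset, and the ggplot2 portion runs only if that package happens to be installed.  It is an illustration of the general calling style, not a recommendation of either package.

```r
library(lattice)  # a recommended package, included with standard R installations

# lattice: bwplot() with the panel.violin panel function draws violins
# in place of the default box-and-whisker glyphs.
violin <- bwplot(weight ~ feed, data = chickwts,
                 panel = panel.violin,
                 xlab = "Feed type", ylab = "Weight (g)")
print(violin)  # lattice functions return "trellis" objects; print() draws them

# ggplot2: assumed to be installed separately, so the call is guarded.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(chickwts, aes(x = feed, y = weight)) +
    geom_violin()
  print(p)
}
```

Both objects are drawn by grid when printed, which is why neither responds to par(mfrow=c(2,2)).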

This year’s UseR! Meeting was the second one I have attended – I also went to the 2010 meeting in Gaithersburg, MD, held at the National Institute of Standards and Technology (NIST).  Both have been fabulous meetings, and I fully expect future meetings to be as good: next year’s UseR! meeting is scheduled to be held in Spain and I’m not sure I will be able to attend, but I would love to.  In any case, if you can get there, I highly recommend it, based on my experiences so far.

2 comments:

  1. grid isn't a graphics package, in the sense of Base, lattice, or ggplot2. I can recommend the latter over lattice, and Wickham's book. grid is the underlying plumbing used by lattice and ggplot2, or some graphic engine one wishes to write. Murrell explains this on page 119 (2nd edition) of his book; he wrote grid.

  2. It was good to meet you.

    See you at next year's conference (I hope).

    With regards,
    Tal
