This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

## Saturday, September 8, 2012

### Implementing the CountSummary Procedure

In my last post, I described and demonstrated the CountSummary procedure to be included in the ExploringData package that I am in the process of developing.  This procedure generates a collection of graphical data summaries for a count data sequence, based on the distplot, Ord_plot, and Ord_estimate functions from the vcd package.  The distplot function generates both the Poissonness plot and the negative-binomialness plot discussed in Chapters 8 and 9 of Exploring Data in Engineering, the Sciences and Medicine.  These plots provide informal graphical assessments of the conformance of a count data sequence with the two most popular distribution models for count data, the Poisson distribution and the negative-binomial distribution.  As promised, this post describes the R code needed to implement the CountSummary procedure, based on these functions from the vcd package.

The key to this implementation lies in the use of the grid package, a set of low-level graphics primitives included in base R.  As I mentioned in my last post, the reason this was necessary - instead of using higher-level graphics packages like lattice or ggplot2 - was that the vcd package is based on grid graphics, making it incompatible with base graphics commands like those used to generate arrays of multiple plots.  The grid package was developed by Paul Murrell, who provides a lot of extremely useful information about both R graphics in general and grid graphics in particular on his home page, including the article “Drawing Diagrams with R,” which provides a nicely focused introduction to grid graphics.  The first example I present here is basically a composite of the first two examples presented in this paper.  Specifically, the code for this example is:

library(grid)
grid.newpage()
pushViewport(viewport(width = 0.8, height = 0.4))
grid.roundrect()
grid.text("This is text in a box")
popViewport()

The first line of this R code loads the grid package and the second tells this package to clear the plot window; failing to do this will cause this particular piece of code to overwrite whatever was there before, which usually isn’t what you want.  The third line creates a viewport, into which the plot will be placed.  In this particular example, we specify a width of 0.8, or 80% of the total plot window width, and a height of 0.4, corresponding to 40% of the total window height.  The next two lines draw a rectangular box with rounded corners and put “This is text in a box” in the center of this box.  The advantage of the grid package is that it provides us with simple graphics primitives to draw this kind of figure, without having to compute exact positions (e.g., in inches) for the different figure components.  Commands like grid.text provide useful defaults (i.e., put the text in the center of the viewport), which can be overridden by specifying positional parameters in a variety of ways (e.g., left- or right-justified, offsets in inches or lines of text, etc.).  The results obtained using these commands are shown in the figure below.

The code for the second example is a simple extension of the first one, essentially consisting of the added initial code required to create the desired two-by-two plot array, followed by four slightly modified copies of the above code.  Specifically, this code is:

grid.newpage()

pushViewport(viewport(layout=grid.layout(nrow=2,ncol=2)))

pushViewport(viewport(layout.pos.row=1,layout.pos.col=1))
grid.roundrect(width = 0.8, height=0.4)
grid.text("Plot 1 goes here")
popViewport()

pushViewport(viewport(layout.pos.row=1,layout.pos.col=2))
grid.roundrect(width = 0.8, height=0.4)
grid.text("Plot 2 goes here")
popViewport()

pushViewport(viewport(layout.pos.row=2,layout.pos.col=1))
grid.roundrect(width = 0.8, height=0.4)
grid.text("Plot 3 goes here")
popViewport()

pushViewport(viewport(layout.pos.row=2,layout.pos.col=2))
grid.roundrect(width = 0.8, height=0.4)
grid.text("Plot 4 goes here")
popViewport()

Here, note that the first “pushViewport” command creates the two-by-two plot array we want, by specifying “layout = grid.layout(nrow=2,ncol=2)”.  As in initializing a data frame in R, we can create an arbitrary two-dimensional array of grid graphics viewports – say m by n – by specifying “layout = grid.layout(nrow=m, ncol=n)”.  Once we have done this, we can use whatever grid commands – or grid-compatible commands, such as those generated by the vcd package – we want, to create the individual elements in our array of plots.  In this example, I have basically repeated the code from the first example to put text into rounded rectangular boxes in each position of the plot array.  The two most important details are, first, the “pushViewport” command at the beginning of each of these individual plot blocks specifies which of the four array elements the following plot will go in, and second, the “popViewport()” command at the end of each block, which tells the grid package that we are finished with this element of the array.  If we leave this command out, the next “pushViewport” command will not move to the desired plot element, but will simply overwrite the previous plot.  Executing this code yields the plot shown below.

The final example replaces the text in the above two-by-two example with the plots I want for the CountSummary procedure.  Before presenting this code, it is important to say something about the structure of the resulting plot and the vcd commands used to generate the different plot elements.  The first plot – in the upper left position of the array shown below – is an Ord plot, generated by the Ord_plot command, which does two things.  The first is to generate the desired plot, but the second is to return estimates of the intercept and slope of one of the two reference lines in the plot.  The first of these lines is fit to the points in the plot via ordinary least squares, while the second – the one whose parameters are returned – is fit via weighted least squares, to down-weight the widely scattered points seen in this plot that correspond to cases with very few observations.  The intent of the Ord plot is to help us decide which of several alternative distributions – including both the Poisson and the negative-binomial – fits our count data sequence better.   This guidance is based on the reference line parameters, and the Ord_estimate function in the vcd package transforms these parameter estimates into distributional recommendations and the distribution parameter values needed by the distplot function in the vcd package to generate either the Poissonness plot or the negative-binomialness plot for the count data sequence.  Although these recommendations are sometimes useful, it is important to emphasize the caution given in the vcd package documentation:

“Be careful with the conclusions from Ord_estimate as it implements just some simple heuristics!”

In the CountSummary procedure, I use these results both to generate part of the text summary in the upper right element of the plot array, and to decide which type of plot to display in the lower right element of this array.  Both this plot and the Poissonness reference plot in the lower left element of the display are created using the distplot command in the vcd package.  I include the Poissonness reference plot because the Poisson distribution is the most commonly assumed distribution for count data – analogous in many ways to the Gaussian distribution so often assumed for continuous-valued data – and, by not specifying the single parameter for this distribution, I allow the function to determine it by fitting the data.  In cases where the Ord plot heuristic recommends the Poissonness plot, it also provides this parameter, which I provide to the distplot function for the lower right plot.  Thus, while both the lower right and lower left plots are Poissonness plots in this case, they are generally based on different distribution parameters.  In the particular example shown here – constructed from the “number of times pregnant” variable in the Pima Indians diabetes dataset that I have discussed in several previous posts (available from the UCI Machine Learning Repository) – the Ord plot heuristic recommends the negative binomial distribution.  Comparing the Poissonness and negative-binomialness plots in the bottom row of the above plot array, it does appear that the negative binomial distribution fits the data better.

Finally, before examining the code for the CountSummary procedure, it is worth noting that the vcd package’s implementation of the Ord_plot and Ord_estimate procedures can generate four different distributional recommendations: the Poisson and negative-binomial distributions discussed here, along with the binomial distribution and the much less well-known log-series distribution.  The distplot procedure is flexible enough to generate plots for the first three of these distributions, but not the fourth, so in cases where the Ord plot heuristic recommends this last distribution, the CountSummary procedure displays the recommended distribution and parameter, but displays a warning message that no distribution plot is available for this case in the lower right plot position.

The code for the CountSummary procedure looks like this:

CountSummary <- function(xCount,TitleString){
#
#  Initial setup
#
library(vcd)
grid.newpage()
#
#  Set up 2x2 array of plots
#
pushViewport(viewport(layout=grid.layout(nrow=2,ncol=2)))
#
#  Generate the plots:
#
#  1 - upper left = Ord plot
#
pushViewport(viewport(layout.pos.row=1,layout.pos.col=1))
OrdLine = Ord_plot(xCount, newpage = FALSE, pop=FALSE, legend=FALSE)
OrdType = Ord_estimate(OrdLine)
popViewport()
#
#  2 - upper right = text summary
#
OrdTypeText = paste("Type = ",OrdType\$type,sep=" ")
if (OrdType\$type == "poisson"){
OrdPar = "Lambda = "
}
else if ((OrdType\$type == "nbinomial")|(OrdType\$type == "nbinomial")){
OrdPar = "Prob = "
}
else if (OrdType\$type == "log-series"){
OrdPar = "Theta = "
}
else{
OrdPar = "Parameter = "
}
OrdEstText = paste(OrdPar,round(OrdType\$estimate,digits=3), sep=" ")
TextSummary = paste("Ord plot heuristic results:",OrdTypeText,OrdEstText,sep="\n")
pushViewport(viewport(layout.pos.row=1,layout.pos.col=2))
grid.text(TitleString,y=2/3,gp=gpar(fontface="bold"))
grid.text(TextSummary,y=1/3)
popViewport()
#
#  3 - lower left = standard Poissonness plot
#
pushViewport(viewport(layout.pos.row=2,layout.pos.col=1))
distplot(xCount, type="poisson",newpage=FALSE, pop=FALSE, legend = FALSE)
popViewport()
#
#  4 - lower right = plot suggested by Ord results
#
pushViewport(viewport(layout.pos.row=2,layout.pos.col=2))
if (OrdType\$type == "poisson"){
distplot(xCount, type="poisson",lambda=OrdType\$estimate, newpage=FALSE, pop=FALSE, legend=FALSE)
}
else if (OrdType\$type == "nbinomial"){
prob = OrdType\$estimate
size = 1/prob - 1
distplot(xCount, type="nbinomial",size=size,newpage=FALSE, pop=FALSE, legend=FALSE)
}
else if (OrdType\$type == "binomial"){
distplot(xCount, type="binomial", newpage=FALSE, pop=FALSE, legend=FALSE)
}
else{
Message = paste("No distribution plot","available","for this case",sep="\n")
grid.text(Message)
}
popViewport()
#
}

This procedure is a function called with two arguments: the sequence of count values, xCounts, and TitleString, a text string that is displayed in the upper right text box in the plot array, along with the recommendations from the Ord plot heuristic.  When called, the function first loads the vcd library to make the Ord_plot, Ord_estimate, and distplot functions available for use, and it executes the grid.newpage() command to clear the display.  (Note that we don’t have to include “library(grid)” here to load the grid package, since loading the vcd package automatically does this.)   As in the previous example, the first “pushViewport” command creates the two-by-two plot array, and this is again followed by four code segments, one to generate each of the four displays in this array.  The first of these segments invokes the Ord_plot and Ord_estimate commands as discussed above, first to generate the upper left plot (a side-effect of the Ord_plot command) and second, to obtain the Ord plot heuristic recommendations, to be used in structuring the rest of the display.  The second segment creates a text display as in the first example considered here, but the structure of this display depends on the Ord plot heuristic results (i.e., the names of the parameters for the four possible recommended distributions are different, and the logic in this code block matches the display text to this distribution).  As noted in the preceding discussion, the third plot (lower left) is the Poissonness plot generated by the distplot function from the vcd package.  In this case, the function is called only specifying ‘type = “poisson”’ without the optional distribution parameter lambda, which is obtained by fitting the data.  The final element of this plot array, in the lower right, is also generated via a call to the distplot function, but here, the results from the Ord plot heuristic are used to specify both the type parameter and any optional or required shape parameters for the distribution.  As with the displayed text, simple if-then-else logic is required here to match the plot generated with the Ord plot heuristic recommendations.

Finally, it is important to note that in all of the calls made to Ord_plot or distplot in the CountSummary procedure, the parameters newpage, pop, and legend, are all specified as FALSE.  Specifying “newpage = FALSE” prevents these vcd plot commands from clearing the display page and erasing everything we have done so far.  Similarly, specifying “pop = FALSE” allows us to continue working in the current plot window until we notify the grid graphics system that we are done with it by issuing our own “popViewport()” command.  Specifying “legend = FALSE” tells Ord_plot and distplot not to write the default informational legend on each plot.  This is important here because, given the relatively small size of the plots generated in this two-by-two array, including the default legends would obscure important details.

1. 2. 3. 