This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.
Saturday, July 21, 2012
Count variables – non-negative integer variables like the “number of times pregnant” (NPG) variable from the Pima Indians database described below – are often assumed to obey a Poisson distribution, in much the same way that continuous-valued variables are often assumed to obey a Gaussian (normal) distribution. Like this normality assumption for continuous variables, the Poisson assumption for count data is sometimes reasonable, but sometimes it isn’t. Normal quantile-quantile plots like those generated by the qqnorm command in base R or the qqPlot command from the car package are useful in informally assessing the reasonableness of the normality assumption for continuous data. Similarly, Poissonness plots are the corresponding graphical tool for informally evaluating the Poisson hypothesis for count data. The construction and interpretation of these plots is discussed in some detail in Chapters 8 and 9 of Exploring Data, but briefly, this plot constructs a variable called the Poissonness count metameter from the number of times each possible count value occurs in the data; if the data sequence conforms to the Poisson distribution, the points on this plot should fall approximately on a straight line. A simple R function that constructs Poissonness plots is available on the OUP companion website for the book, but an implementation that is both more conveniently available and more flexible is the distplot function in the vcd package, which also generates the negative binomialness plot discussed below.
The figure above is the Poissonness plot constructed using the distplot procedure from the vcd package for the NPG variable from the Pima Indians diabetes dataset mentioned above. I have discussed this dataset in previous posts and have used it as the basis for several examples in Exploring Data. It is available from the UCI Machine Learning Repository and it has been incorporated in various forms as an example dataset in a number of R packages, including a cleaned-up version in the MASS package (dataset Pima.tr). The full version considered here contains nine characteristics for 768 female members of the Pima Indian tribe, including their age, medical characteristics like diastolic blood pressure, and the number of times each woman has been pregnant. If this NPG count sequence obeyed the Poisson distribution, the points in the above plot would fall approximately on the reference line included there. The fact that these points do not conform well to this line – note, in particular, the departure at the lower left end of the plot where most of the counts occur – calls the Poisson working assumption into question.
A fundamental feature of the Poisson distribution is that it is defined by a single parameter that determines all distributional characteristics, including both the mean and the variance. In fact, a key characteristic of the Poisson distribution is that the variance is equal to the mean. This constraint is not satisfied by all count data sequences we encounter, however, and these deviations are important enough to receive special designations: integer sequences whose variance is larger than their mean are commonly called overdispersed, while those whose variance is smaller than their mean are commonly called underdispersed. In practice, overdispersion seems to occur more frequently, and a popular distributional alternative for overdispersed sequences is the negative binomial distribution. This distribution is defined by two parameters and it is capable of matching both the mean and variance of arbitrary overdispersed count data sequences. For a detailed discussion of this distribution, refer to Chapter 3 of Exploring Data.
Like the Poisson distribution, it is possible to evaluate the reasonableness of the negative binomial distribution graphically, via the negative binomialness plot. Like the Poissonness plot, this plot is based on a quantity called the negative binomialness metameter, computed from the number of times each count value occurs, plotted against those count values. To construct this plot, it is necessary to specify a numerical value for the distribution’s second parameter (the size parameter in the distplot command, corresponding to the r parameter in the discussion of this distribution given in Chapter 8 of Exploring Data). This can be done in several different ways, including the specification of trial values, the approach taken in the negative binomialness plot procedure that is available from the OUP companion website. This option is also available with the distplot command from the vcd package: to obtain a negative binomialness plot, specify the type parameter as “nbinomial” and, if a fixed size parameter is desired, it is specified by giving a numerical value for the size parameter in the distplot function call. Alternatively, if this parameter is not specified, the distplot procedure will estimate it via the method of maximum likelihood, an extremely useful feature, although it is important to note that this estimation process can be time-consuming, especially for long data sequences. Finally, a third approach that can be adopted is to use the Ord plot described next to obtain an estimate of this parameter based on a simple heuristic. In addition, this heuristic suggests which of these two candidate distributions – the Poisson or the negative binomial – is more appropriate for the data sequence.
Like the Poissonness plot, the Ord plot computes a simple derived quantity from the original count data sequence – specifically, the frequency ratio, defined for each count value as that value multiplied by the ratio of the number of times it occurs to the number of times the next smaller count occurs – and plots this versus the counts. If the data sequence obeys the negative binomial distribution, these points should conform reasonably well to a line with positive slope, and this slope can be used to determine the size parameter for the distribution. Conversely, if the Poisson distribution is appropriate, the best fit reference line for the Ord plot should have zero slope. In addition, Ord plots can also be used to suggest two additional discrete distributions (specifically, the binomial distribution and the log-series distribution), and the vcd package provides dataset examples to illustrate all four of these cases.
For my CountSummary procedure, I decided to construct a two-by-two array with the following four components. First, in the upper left, I used the Ord_plot command in vcd to generate an Ord plot. This command returns the intercept and slope parameters for the reference line in the plot, and the Ord_estimate command can then be used to convert these values into a type specification and an estimate of the distribution parameter needed to construct the appropriate discrete distribution plot. I will discuss these results in more detail in my next post, but for the case of the NPG count sequence considered here, the Ord plot results suggest the negative binomial distribution as the most appropriate choice, returning a parameter prob, from which the size parameter required to generate the negative binomialness plot may be generated (specifically, size = 1/prob – 1). The upper right quadrant of this display gives a text summary identifying the variable being characterized and listing the Ord plot recommendations and parameter estimate. Since the Poisson distribution is “the default” assumption for count data, the lower left plot shows a Poissonness plot for the data sequence, while the lower right plot is the “distribution-ness plot” for the distribution recommended by the Ord plot results. The results obtained by the CountSummary procedure for the NPG sequence are shown below. Next time, I will present the code used to generate this plot.