This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Showing posts with label bimodality. Show all posts
Showing posts with label bimodality. Show all posts

Friday, November 11, 2011

Harmonic means, reciprocals, and ratios of random variables

In my last few posts, I have considered “long-tailed” distributions whose probability density decays much more slowly than standard distributions like the Gaussian.  For these slowly-decaying distributions, the harmonic mean often turns out to be a much better (i.e., less variable) characterization than the arithmetic mean, which is generally not even well-defined theoretically for these distributions.  Since the harmonic mean is defined as the reciprocal of the mean of the reciprocal values, it is intimately related to the reciprocal transformation.  The main point of this post is to show how profoundly the reciprocal transformation can alter the character of a distribution, for better or worse.   One way that reciprocal transformations sneak into analysis results is through attempts to characterize ratios of random numbers.  The key issue underlying all of these ideas is the question of when the denominator variable in either a reciprocal transformation or a ratio exhibits non-negligible probability in a finite neighborhood of zero.  I discuss transformations in Chapter 12 of Exploring Data in Engineering, the Sciences and Medicine, with a section (12.7) devoted to reciprocal transformations, showing what happens when we apply them to six different distributions: Gaussian, Laplace, Cauchy, beta, Pareto, and lognormal. 

In the general case, if a random variable x has the density p(x), the distribution g(y) of the reciprocal y = 1/x has the density:

            g(y) = p(1/y)/y2

As I discuss in greater detail in Exploring Data, the consequence of this transformation is typically (though not always) to convert a well-behaved distribution into a very poorly behaved one.  As a specific example, the plot below shows the effect of the reciprocal transformation on a Gaussian random variable with mean 1 and standard deviation 2.  The most obvious characteristic of this transformed distribution is its strongly asymmetric, bimodal character, but another non-obvious consequence of the reciprocal transformation is that it takes a distribution that is completely characterized by its first two moments into a new distribution with Cauchy-like tails, for which none of the integer moments exist.

           

The implications of the reciprocal transformation for many other distributions are equally non-obvious.  For example, both the badly-behaved Cauchy distribution (no moments exist) and the well-behaved lognormal distribution (all moments exist, but interestingly, do not completely characterize the distribution, as I have discussed in a previous post) are invariant under the reciprocal transformation.  Also, applying the reciprocal transformation to the long-tailed Pareto type I distribution (which exhibits few or no finite moments, depending on its tail decay rate) yields a beta distribution, all of whose moments are finite.  Finally, it is worth noting that the invariance of the Cauchy distribution under the reciprocal transformation lies at the heart of the following result, presented in the book Continuous Univariate Distributions by Johnson, Kotz, and Balakrishnan (Volume 1, 2nd edition, Wiley, 1994, page 319).  They note that if the density of x is positive, continuous, and differentiable at x = 0 – all true for the Gaussian case – the distribution of the harmonic mean of N samples approaches a Cauchy limit as N becomes infinitely large.

As noted above, the key issue responsible for the pathological behavior of the reciprocal transformation is the question of whether the original data distribution exhibits nonzero probability of taking on values within a neighborhood around zero.  In particular, note that if x can only assume values larger than some positive lower limit L, it follows that 1/x necessarily lies between 0 and 1/L, which is enough to guarantee that all moments of the transformed distribution exist.  For the Gaussian distribution, even if the mean is large enough and the standard deviation is small enough that the probability of observing values less than some limit L > 0 is negligible, the fact that this probability is not zero means that the moments of any reciprocally-transformed Gaussian distribution are not finite.  As a practical matter, however, reciprocal transformations and related characterizations – like harmonic means and ratios – do become better-behaved as the probability of observing values near zero become negligibly small.

To see this point, consider two reciprocally-transformed Gaussian examples.  The first is the one considered above: the reciprocal transformation of a Gaussian random variable with mean 1 and standard deviation 2.  In this case, the probability that x assumes values smaller than or equal to zero is non-negligible.  Specifically, this probability is simply the cumulative distribution function for the distribution evaluated at zero, easily computed in R as approximately 31%:

> pnorm(0,mean=1,sd=2)
[1] 0.3085375

In contrast, for a Gaussian random variable with mean 1 and standard deviation 0.1, the corresponding probability is negligibly small:

> pnorm(0,mean=1,sd=0.1)
[1] 7.619853e-24

If we consider the harmonic means of these two examples, we see that the first one is horribly behaved, as all of the results presented here would lead us to expect.  In fact, the qqPlot command in the car package  in R allows us to compute quantile-quantile plots for the Student’s t-distribution with one degree of freedom, corresponding to the Cauchy distribution, yielding the plot shown below.  The Cauchy-like tail behavior expected from the results presented by Johnson, Kotz and Balakrishnan is seen clearly in this Cauchy Q-Q plot, constructed from 1000 harmonic means, each computed from statistically independent samples drawn from a Gaussian distribution with mean 1 and standard deviation 2.  The fact that almost all of the observations fall within the – very wide – 95% confidence interval around the reference line suggest that the Cauchy tail behavior is appropriate here.

           

To further confirm this point, compare the corresponding normal Q-Q plot for the same sequence of harmonic means, shown below.  There, the extreme non-Gaussian character of these harmonic means is readily apparent from the pronounced outliers evident in both the upper and lower tails.

           

In marked contrast, for the second example with the mean of 1 as before but the much smaller standard deviation of 0.1, the harmonic mean is much better behaved, as the normal Q-Q plot below illustrates.  Specifically, this plot is identical in construction to the one above, except it was computed from samples drawn from the second data distribution.  Here, most of the computed harmonic mean values fall within the 95% confidence limits around the Gaussian reference line, suggesting that it is not unreasonable in practice to regard these values as approximately normally distributed, in spite of the pathologies of the reciprocal transformation.

           

One reason the reciprocal transformation is important in practice – particularly in connection with the Gaussian distribution – is that the desire to characterize ratios of uncertain quantities does arise from time to time.  In particular, if we are interested in characterizing the ratio of two averages, the Central Limit Theorem would lead us to expect that, at least approximately, this ratio should behave like the ratio of two Gaussian random variables.  If these component averages are statistically independent, the expected value of the ratio can be re-written as the product of the expected value of the numerator average and the expected value of the reciprocal of the denominator average, leading us directly to the reciprocal Gaussian transformation discussed here.  In fact, if these two averages are both zero mean, it is a standard result that the ratio has a Cauchy distribution (this result is presented in the same discussion from Johnson, Kotz and Balakrishnan noted above).  As in the second harmonic mean example presented above, however, it turns out to be true that if the mean and standard deviation of the denominator variable are such that the probability of a zero or negative denominator are negligible, the distribution of the ratio may be approximated reasonably well as Gaussian.  A very readable and detailed discussion of this fact is given in the paper by George Marsaglia in the May 2006 issue of Journal of Statistical Software.

Finally, it is important to note that the “reciprocally-transformed Gaussian distribution” I have been discussing here is not the same as the inverse Gaussian distribution, to which Johnson, Kotz and Balakrishnan devote a 39-page chapter (Chapter 15).  That distribution takes only positive values and exhibits moments of all orders, both positive and negative, and as a consequence, it has the interesting characteristic that it remains well-behaved under reciprocal transformations, in marked contrast to the Gaussian case.


Tuesday, February 15, 2011

Boxplots and Beyond III: Violin Plots

This post is the third in a series of four on boxplots and closely related data visualization techniques for comparing subsets of a dataset, or comparing different datasets that we hope or expect to be similarly distributed.  The previous two posts in this series have dealt with the basic boxplot, simple variations like log transformations and variable-width boxplots, and the procedures used to identify and represent outliers in a boxplot display, both traditionally and in the face of significant asymmetry.  All of the boxplot variations discussed so far retain the basic “box” structure, however, and this is one of the inherent limitations of this data display: their visual simplicity can hide a lot of important details about how the data values are distributed.  A much more flexible extension of the basic boxplot is the violin plot, constructed by combining the concept of the boxplot with that of nonparametric density estimates.  Both boxplots and nonparametric density estimates are discussed in Exploring Data, but the idea of putting them together into a single data display was not.  The violin plot discussed in this post is one way of achieving this combination, and my next post will discuss beanplots, also built from these same components.  Violin plots are supported in R both as part of the lattice graphics package and via the add-on package wvioplot.  Because I find it somewhat easier to use, this post discusses the second of these options.

I have previously discussed the Old Faithful geyser data frame faithful, included in the datasets package in the base R installation, in part because it provides a nice, real example of a dataset with a bimodal distribution of values.  In fact, this data frame includes two variables – the waiting times between successive eruptions discussed last time, and the duration of each eruption – and it turns out that both of these variables exhibit strongly bimodal distributions.  This point is illustrated clearly in the four plots shown above.  The upper two plots are scatter plots of the observed values of the variables themselves, while the bottom two plots are nonparametric density estimates generated via the density command in R, using its default options.  As in my previous discussion of the waiting time data from this dataset, the bimodal character of these variables is clear from the two pronounced peaks seen in these density estimates.  Further, this bimodal character is also clearly evident in the scatter plots of the data values themselves, particularly in the duration data which shows a gap between approximately 2.5 and 3.0 where there are almost no points at all.  The main reason for considering this example here is that bimodality represents one of those potentially significant distributional details that is generally lost in boxplots.  This point is clearly illustrated in the figure shown below, which presents side-by-side boxplots of the waiting time divided by 20 on the left, and the duration of the eruptions on the right.  (The waiting times were rescaled simply to make the two boxplots fit reasonably in the same figure.)  Both plots suggest significant distributional asymmetry, but neither one gives any hint of the strongly bimodal character of these variables.

For comparison, the two figures shown below were both generated using the wvioplot package, which generates plots that may be viewed as boxplots whose “boxes” have been curved to reflect the estimated distribution of values over the observed data range.  The plot on the left uses the default options for wvioplot, and one of the points of this example is that these default options are not always much more informative than the basic boxplot.  This is certainly the case here: neither of the left-hand violin plots is at all suggestive of the bimodality we know to be a key characteristic of these variables.  The right-hand violin plots represent are slightly customized, generated by specifying two options.  First, note that the default violin plots on the left are both clipped, exhibiting hard limits at the smallest and largest values that appear in the data.  Clipping is the default, but specifying “clip = F” overrides this behavior, which is not typical for nonparametric density estimates: instead of abruptly dropping to zero, these estimates typically decay smoothly to zero and, as a result, extrapolate somewhat beyond the observed data range.  Specifying “clip = F” causes the density estimates used in the violin plots to exhibit this smooth decay behavior.

The second, more important, difference between the two sets of violin plots shown in the above pair of figures is that the right-hand pair specifies a different bandwidth parameter for these nonparametric density estimators.  In Chapter 8 of Exploring Data, I discuss the trade-offs inherent in selecting the bandwidth parameter for nonparametric density estimators: making the bandwidth too small yields a high-variance, “noisy” estimate with a lot of spurious “details,” while making the bandwidth too large yields low-variance but highly biased estimates that can miss critical details that are present.  For the violin plots generated by the wvioplot package, this bandwidth is specified by the adjust parameter, with the default value “adjust = 3” giving the results shown on the left.  The plots on the right were generated using the smaller value, “adjust = 1”, which reduces the bandwidth of the density estimator enough to see the bimodal character of the two variables.

One of the key points of this example is that, while the added capability inherent in the violin plot relative to the simpler boxplot can convey extremely useful details that we might otherwise miss, we can’t always rely on the default settings for the procedure to show us these details.  The basic source of this difficulty is that more complex procedures like nonparametric density estimators (at least, more complex compared with the five-number summary on which the basic boxplot is based) have tuning parameters, and simply using the default options for these parameters doesn’t always lead to the results we want.  For this reason, it is a good idea to spend a bit of time exploring the key options in using these procedures.