This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.


Friday, November 11, 2011

Harmonic means, reciprocals, and ratios of random variables

In my last few posts, I have considered “long-tailed” distributions whose probability density decays much more slowly than standard distributions like the Gaussian.  For these slowly-decaying distributions, the harmonic mean often turns out to be a much better (i.e., less variable) characterization than the arithmetic mean, which is generally not even well-defined theoretically for these distributions.  Since the harmonic mean is defined as the reciprocal of the mean of the reciprocal values, it is intimately related to the reciprocal transformation.  The main point of this post is to show how profoundly the reciprocal transformation can alter the character of a distribution, for better or worse.  One way that reciprocal transformations sneak into analysis results is through attempts to characterize ratios of random numbers.  The key issue underlying all of these ideas is whether the denominator variable in either a reciprocal transformation or a ratio exhibits non-negligible probability in a finite neighborhood of zero.  I discuss transformations in Chapter 12 of Exploring Data in Engineering, the Sciences, and Medicine, with a section (12.7) devoted to reciprocal transformations, showing what happens when we apply them to six different distributions: Gaussian, Laplace, Cauchy, beta, Pareto, and lognormal.

In the general case, if a random variable x has the density p(x), the distribution g(y) of the reciprocal y = 1/x has the density:

            g(y) = p(1/y) / y²

As I discuss in greater detail in Exploring Data, the consequence of this transformation is typically (though not always) to convert a well-behaved distribution into a very poorly behaved one.  As a specific example, the plot below shows the effect of the reciprocal transformation on a Gaussian random variable with mean 1 and standard deviation 2.  The most obvious characteristic of this transformed distribution is its strongly asymmetric, bimodal character, but another non-obvious consequence of the reciprocal transformation is that it takes a distribution that is completely characterized by its first two moments into a new distribution with Cauchy-like tails, for which none of the integer moments exist.

[Plot: density of the reciprocal of a Gaussian random variable with mean 1 and standard deviation 2]
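A minimal sketch of how this density can be plotted in R, using the transformation formula above (the plotting range and grid size are my own choices):

# Density of y = 1/x when x is Gaussian with mean 1 and standard deviation 2
g <- function(y) dnorm(1/y, mean = 1, sd = 2) / y^2
curve(g, from = -3, to = 3, n = 500, xlab = "y", ylab = "g(y)")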

The implications of the reciprocal transformation for many other distributions are equally non-obvious.  For example, both the badly-behaved Cauchy distribution (no moments exist) and the well-behaved lognormal distribution (all moments exist, but interestingly, do not completely characterize the distribution, as I have discussed in a previous post) are invariant under the reciprocal transformation.  Also, applying the reciprocal transformation to the long-tailed Pareto type I distribution (which exhibits few or no finite moments, depending on its tail decay rate) yields a beta distribution, all of whose moments are finite.  Finally, it is worth noting that the invariance of the Cauchy distribution under the reciprocal transformation lies at the heart of the following result, presented in the book Continuous Univariate Distributions by Johnson, Kotz, and Balakrishnan (Volume 1, 2nd edition, Wiley, 1994, page 319).  They note that if the density of x is positive, continuous, and differentiable at x = 0 – all true for the Gaussian case – the distribution of the harmonic mean of N samples approaches a Cauchy limit as N becomes infinitely large.
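As a quick empirical check of this invariance, we can reciprocally transform simulated Cauchy samples and compare them with the standard Cauchy distribution (the sample size and seed here are arbitrary choices):

# The reciprocal of a standard Cauchy sample is again standard Cauchy
set.seed(1)
x <- rcauchy(100000)
ks.test(1/x, "pcauchy")   # should not reject the standard Cauchy hypothesis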

As noted above, the key issue responsible for the pathological behavior of the reciprocal transformation is whether the original data distribution exhibits nonzero probability of taking on values within a neighborhood around zero.  In particular, note that if x can only assume values larger than some positive lower limit L, it follows that 1/x necessarily lies between 0 and 1/L, which is enough to guarantee that all moments of the transformed distribution exist.  For the Gaussian distribution, even if the mean is large enough and the standard deviation small enough that the probability of observing values less than some limit L > 0 is negligible, the fact that this probability is not zero means that the moments of any reciprocally-transformed Gaussian distribution are not finite.  As a practical matter, however, reciprocal transformations and related characterizations – like harmonic means and ratios – do become better-behaved as the probability of observing values near zero becomes negligibly small.

To see this point, consider two reciprocally-transformed Gaussian examples.  The first is the one considered above: the reciprocal transformation of a Gaussian random variable with mean 1 and standard deviation 2.  In this case, the probability that x assumes values smaller than or equal to zero is non-negligible.  Specifically, this probability is simply the cumulative distribution function for the distribution evaluated at zero, easily computed in R as approximately 31%:

> pnorm(0,mean=1,sd=2)
[1] 0.3085375

In contrast, for a Gaussian random variable with mean 1 and standard deviation 0.1, the corresponding probability is negligibly small:

> pnorm(0,mean=1,sd=0.1)
[1] 7.619853e-24

If we consider the harmonic means of these two examples, we see that the first one is horribly behaved, as all of the results presented here would lead us to expect.  In fact, the qqPlot command in the car package in R allows us to construct quantile-quantile plots against the Student’s t-distribution with one degree of freedom, corresponding to the Cauchy distribution, yielding the plot shown below.  The Cauchy-like tail behavior expected from the results presented by Johnson, Kotz and Balakrishnan is seen clearly in this Cauchy Q-Q plot, constructed from 1000 harmonic means, each computed from statistically independent samples drawn from a Gaussian distribution with mean 1 and standard deviation 2.  The fact that almost all of the observations fall within the – very wide – 95% confidence interval around the reference line suggests that the Cauchy tail behavior is appropriate here.

[Plot: Cauchy Q-Q plot of 1000 harmonic means, each computed from Gaussian samples with mean 1 and standard deviation 2]
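A minimal sketch of how this plot can be constructed (the number of observations per harmonic mean, here 100, and the seed are my own assumptions; the post does not specify them):

library(car)   # provides the qqPlot command

# 1000 harmonic means, each from 100 independent draws of a Gaussian
# distribution with mean 1 and standard deviation 2
set.seed(3)
hmeans <- replicate(1000, 1/mean(1/rnorm(100, mean = 1, sd = 2)))

# Cauchy Q-Q plot: Student's t-distribution with one degree of freedom
qqPlot(hmeans, distribution = "t", df = 1)

The normal Q-Q plot discussed next can be generated from the same hmeans vector with qqPlot(hmeans), since the default reference distribution is Gaussian.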

To further confirm this point, compare the corresponding normal Q-Q plot for the same sequence of harmonic means, shown below.  There, the extreme non-Gaussian character of these harmonic means is readily apparent from the pronounced outliers evident in both the upper and lower tails.

[Plot: normal Q-Q plot of the same 1000 harmonic means]

In marked contrast, for the second example with the mean of 1 as before but the much smaller standard deviation of 0.1, the harmonic mean is much better behaved, as the normal Q-Q plot below illustrates.  Specifically, this plot is identical in construction to the one above, except it was computed from samples drawn from the second data distribution.  Here, most of the computed harmonic mean values fall within the 95% confidence limits around the Gaussian reference line, suggesting that it is not unreasonable in practice to regard these values as approximately normally distributed, in spite of the pathologies of the reciprocal transformation.

[Plot: normal Q-Q plot of 1000 harmonic means from Gaussian samples with mean 1 and standard deviation 0.1]
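Under the same assumptions as the sketch above, this second example differs only in the standard deviation passed to rnorm:

# Same construction with standard deviation 0.1; the default reference
# distribution for qqPlot is Gaussian, giving the normal Q-Q plot
set.seed(3)
hmeans2 <- replicate(1000, 1/mean(1/rnorm(100, mean = 1, sd = 0.1)))
qqPlot(hmeans2)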

One reason the reciprocal transformation is important in practice – particularly in connection with the Gaussian distribution – is that the desire to characterize ratios of uncertain quantities does arise from time to time.  In particular, if we are interested in characterizing the ratio of two averages, the Central Limit Theorem would lead us to expect that, at least approximately, this ratio should behave like the ratio of two Gaussian random variables.  If these component averages are statistically independent, the expected value of the ratio can be re-written as the product of the expected value of the numerator average and the expected value of the reciprocal of the denominator average, leading us directly to the reciprocal Gaussian transformation discussed here.  In fact, if these two averages are both zero mean, it is a standard result that the ratio has a Cauchy distribution (this result is presented in the same discussion from Johnson, Kotz and Balakrishnan noted above).  As in the second harmonic mean example presented above, however, it turns out that if the mean and standard deviation of the denominator variable are such that the probability of a zero or negative denominator is negligible, the distribution of the ratio may be approximated reasonably well as Gaussian.  A very readable and detailed discussion of this fact is given in the paper by George Marsaglia in the May 2006 issue of the Journal of Statistical Software.
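The following small simulation illustrates this last point; all of the parameter values are my own choices for illustration, not taken from the Marsaglia paper:

# Ratio of two independent Gaussian sample means: the denominator mean (3)
# is many standard errors away from zero, so the probability of a zero or
# negative denominator is negligible and the ratio looks roughly Gaussian
set.seed(5)
ratios <- replicate(1000,
  mean(rnorm(50, mean = 2, sd = 1)) / mean(rnorm(50, mean = 3, sd = 1)))
qqnorm(ratios)
qqline(ratios)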

Finally, it is important to note that the “reciprocally-transformed Gaussian distribution” I have been discussing here is not the same as the inverse Gaussian distribution, to which Johnson, Kotz and Balakrishnan devote a 39-page chapter (Chapter 15).  That distribution takes only positive values and exhibits moments of all orders, both positive and negative, and as a consequence, it has the interesting characteristic that it remains well-behaved under reciprocal transformations, in marked contrast to the Gaussian case.


Saturday, April 23, 2011

Measuring association using odds ratios

In my last two posts, I have used the UCI mushroom dataset to illustrate two things.  The first was the use of interestingness measures to characterize categorical variables, and the second was the use of binomial confidence intervals to visualize the relationship between a categorical predictor variable and a binary response variable.  This second approach can be applied to categorical predictors having any number of levels, but in the case of a binary (i.e., two-level) predictor, an attractive alternative is to measure the association using odds ratios.  The objective of this post is to illustrate this idea and highlight a few important details.

[Four plots: binomial confidence intervals for edibility versus GillSize (upper left), GillAtt (upper right), Bruises (lower left), and StalkShape (lower right)]

The above plots show the binomial confidence intervals discussed last time for four different binary mushroom characteristics: GillSize (upper left), GillAtt (upper right), Bruises (lower left), and StalkShape (lower right).  Specifically, these plots show the estimated probability that mushrooms with each of the two possible values for these variables are edible.  Thus, the upper left plot shows that mushrooms with GillSize characteristic “b” (“broad”) are much more likely to be edible than mushrooms with GillSize characteristic “n” (“narrow”).  The other three plots have analogous interpretations: mushrooms with GillAtt value “a” (“attached”) are more likely to be edible than those with value “f” (“free”), mushrooms with bruises (Bruises value “t”) are more likely to be edible than those without (Bruises value “f”), and mushrooms with StalkShape value “t” (“tapering”) are slightly more likely to be edible than those with value “e” (“enlarging”).  Also, while the smaller slopes for GillAtt and StalkShape suggest that their associations with edibility are weaker than that for GillSize, where the slope appears much larger, it would be nice to have a quantitative measure of the degree of association that we could compare directly.  This is particularly the case for GillSize and Bruises, where both associations appear to be reasonably strong, but since the reference lines run in opposite directions on the plots, it is difficult to reliably compare the slopes on the basis of appearance alone.

The odds ratio provides a simple quantitative association measure for these variables that allows us to make these comparisons directly.  I discuss the odds ratio in Chapter 13 of Exploring Data in Engineering, the Sciences, and Medicine in connection with the practical implications of data type (e.g., numerical versus categorical data).  The odds ratio may be viewed as an association measure between binary variables, and it is defined as follows.  For simplicity, suppose x and y are two binary variables of interest and assume that they are coded so that they each take the values 0 or 1 – this assumption is easily relaxed, as discussed below, but it simplifies the basic description of the odds ratio.  Next, define the following four numbers:

            N00 = the number of data records with x = 0 and y = 0
            N01 = the number of data records with x = 0 and y = 1
            N10 = the number of data records with x = 1 and y = 0
            N11 = the number of data records with x = 1 and y = 1

The odds ratio is defined in terms of these four numbers as

OR = N00 N11 / N01 N10

Since all of the four numbers appearing in this ratio are nonnegative, it follows that the odds ratio is also nonnegative and can assume any value between 0 and positive infinity.  Further, if x and y are two statistically independent binary random variables, it can be shown that the odds ratio is equal to 1.  Values greater than 1 imply that records with y = 1 are more likely to have x = 1 than x = 0, and similarly, that records with y = 0 are more likely to have x = 0 than x = 1; in other words, OR > 1 implies that the variables x and y are more likely to agree than they are to disagree.  Conversely, odds ratio values less than 1 imply that the variables x and y are more likely to disagree: records with y = 1 are more likely to have x = 0 than x = 1, and those with y = 0 are more likely to have x = 1 than x = 0.

Often – as in the mushroom dataset – the binary variables are not coded as 0 or 1, but instead as two different categorical values.  As a specific example, the binary response variable considered last time – the edibility variable EorP – assumes the values “e” (for “edible”) or “p” (for “poisonous” or “non-edible”).  In the results presented here, we recode EorP to have the values 1 for edible mushrooms and 0 for non-edible mushrooms.  For the mushroom characteristic GillSize shown in the upper left plot above, suppose we initially code the value “b” (“broad”) as 0 and the value “n” (“narrow”) as 1.  This choice is arbitrary – we could equally well code “b” as 1 and “n” as zero – and its practical consequences are explored further below.  For the coding just described, the odds ratio between mushroom edibility (EorP) and gill size (GillSize) is 0.056.  Since this number is substantially smaller than 1, it suggests that edible mushrooms (y = 1) are unlikely to be associated with narrow gills (x = 1), a result that is consistent with the appearance of the upper left plot above. 
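For concreteness, here is one way this number could be computed in R; the data frame name mushroom and its column names are my own assumptions about how the UCI data have been read in:

# Code edibility and gill size as binary 0/1 variables
y <- as.numeric(mushroom$EorP == "e")       # 1 = edible, 0 = poisonous
x <- as.numeric(mushroom$GillSize == "n")   # 1 = narrow, 0 = broad

# Tabulate the four counts N00, N01, N10, N11 and form the odds ratio
tab <- table(x, y)
OR <- (tab["0","0"] * tab["1","1"]) / (tab["0","1"] * tab["1","0"])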

An important practical issue in interpreting odds ratios is that of how much smaller or larger than 1 the computed odds ratio should be to be regarded as evidence for a “significant” association between the variables x and y.  That is, since we are computing this ratio from uncertain data, we need a measure of precision for the odds ratio, like the binomial confidence intervals discussed in my last post: e.g., how much does the odds ratio change if some mushrooms previously declared edible are reclassified as poisonous, or if some additional mushrooms are added to our dataset?  Fortunately, confidence intervals for the odds ratio are easily constructed.  In his book Categorical Data Analysis (Wiley Series in Probability and Statistics), Alan Agresti notes that confidence intervals for the odds ratio can be computed directly by appealing to the fact that the odds ratio estimator is asymptotically normal, approaching a Gaussian distribution in the limit of large sample sizes.  He does not give explicit results for these direct confidence intervals, however, because he does not recommend them.  Instead, Agresti advocates constructing confidence intervals for the log of the odds ratio and transforming them back to get upper and lower confidence limits for the odds ratio itself.  This recommendation rests primarily on three practical points: first, the log of the odds ratio approaches normality faster than the odds ratio itself does, so this approach yields more accurate confidence intervals; second, this approach guarantees a positive lower confidence limit for the odds ratio, which is not the case for the direct approach; and third, the same result can be used to compute confidence intervals for both the odds ratio and its reciprocal, a result that is again not true for the direct approach and that will be useful in the discussion presented below.
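A minimal sketch of this log-scale construction (the function name is mine; the standard error expression is the standard asymptotic result for the log of the odds ratio):

# Confidence interval for the odds ratio, computed on the log scale and
# then transformed back by exponentiation
oddsRatioCI <- function(n00, n01, n10, n11, conf.level = 0.95){
  logOR <- log((n00 * n11)/(n01 * n10))
  S <- sqrt(1/n00 + 1/n01 + 1/n10 + 1/n11)   # std. deviation of log(OR)
  z <- qnorm(1 - (1 - conf.level)/2)
  c(lower = exp(logOR - z*S), OR = exp(logOR), upper = exp(logOR + z*S))
}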

For the gill size example, Agresti’s recommended procedure yields a 95% confidence interval between 0.049 and 0.064.  Since this interval does not include the value 1, we conclude that there is evidence to support an association between a mushroom’s gill size and its edibility, at least for mushrooms in the UCI dataset.  Applying this procedure to the GillAtt characteristic shown in the upper right plot above yields an estimated odds ratio of 0.097 with a 95% confidence interval between 0.059 and 0.157.  Again, the fact that this interval does not include 1 supports the idea that the GillAtt characteristic is associated with edibility (again, for the mushrooms considered here), but the fact that this odds ratio is larger (i.e., closer to the neutral value 1) also suggests that this association is weaker than that between edibility and the GillSize characteristic.  Again, this result is in agreement with the visual appearance of the upper right plot above, relative to that of the upper left plot.  The advantage of the odds ratio over these plots is that it provides a quantitative measure that can be used to make more objective comparisons, removing the subjective visual judgment required in comparing plots.

Applying this procedure to the Bruises variable yields an odds ratio of 9.972, with a 95% confidence interval from 8.963 to 11.093.  The fact that these values are larger than 1 implies that mushrooms whose bruise characteristics have been coded as 1 (here, “t” for “true” or “bruised”) are more likely to be edible than those whose characteristics have been coded as 0 (here, “f” for “false” or “not bruised”).  As noted above, this coding is arbitrary, as were the earlier assignments.  An extremely useful observation is that if we reverse this assignment – i.e., for this example, if we code “bruised” as 0 and “not bruised” as 1 – we simply exchange the numbers N00 with N10 and also the numbers N11 with N01.  The effect of these exchanges on the odds ratio is a reciprocal transformation:

            OR = N00 N11 / N01 N10  ->  N10 N01 / N11 N00 = 1/OR

This observation provides a simple basis for comparing results like those for GillSize where the odds ratio is less than 1 with those for Bruises where the odds ratio is greater than 1.  As with the visual comparisons discussed above, it is not obvious from the odds ratios computed so far which of these variables is more strongly associated with mushroom edibility.  Reversing the coding for GillSize so that “b” is coded as 1 and “n” is coded as 0 changes the odds ratio from 0.056 to 1/0.056 = 17.857.  Since this number is larger than the odds ratio of 9.972 for Bruises, we can conclude that GillSize is more strongly associated with edibility – i.e., it is a better predictor of edibility for the mushrooms considered here – than Bruises, at least for this dataset.

In fact, the same trick can be applied to the confidence intervals, illustrating the third advantage noted above for Agresti’s preferred approach to constructing these intervals.  Specifically, the asymptotically normal approximation says that the log of the odds ratio has a mean of log OR and a standard deviation S that can be simply computed from the four numbers N00, N01, N10, and N11.  Since the Gaussian distribution is symmetric about its mean and log(1/OR) = - log(OR), it follows that the log of the reciprocal odds ratio has the same approximate standard deviation S as log(OR).  In practical terms, this means that if we reverse the coding of our binary predictor variables, it is a simple matter to compute new confidence intervals as follows:

            New lower CI = 1/Old upper CI
            New odds ratio = 1/Old odds ratio
            New upper CI = 1/Old lower CI

(Note here that because the reciprocal transformation is order-reversing, the transformation of the lower confidence limit yields the new upper confidence limit, and vice-versa; for a more detailed discussion of order-preserving and order-reversing transformations in general and the reciprocal transformation in particular, refer to Chapter 12 of Exploring Data.)
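In R, this reversal amounts to taking reciprocals and swapping the endpoints; applied to the GillSize interval quoted earlier, it recovers the values listed below:

# Reverse the predictor coding: reciprocals, with the endpoints exchanged
reverseCoding <- function(ci) rev(1/ci)   # ci = c(lower, OR, upper)
reverseCoding(c(0.049, 0.056, 0.064))     # approximately 15.625 17.857 20.408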

Applying these transformation results to the odds ratio for Bruises yields a new odds ratio of 0.100, with a 95% confidence interval from 0.090 to 0.112, which we can now compare with the earlier results for GillSize and GillAtt.  Alternatively, if we reverse the coding of the results for GillSize and GillAtt, we obtain odds ratios that are larger than 1 between edibility and “the more edible value” of each of these mushroom characteristics.  This has the advantage of giving us a sequence of odds ratios, all larger than 1, with larger values indicating stronger associations between the corresponding mushroom characteristic and edibility.  For the four mushroom characteristics shown in the above four plots, this approach yields the following odds ratios and their 95% confidence intervals:

            GillSize:       Lower CI = 15.625,  OR = 17.857,  Upper CI = 20.408
            GillAtt:        Lower CI =  6.369,  OR = 10.309,  Upper CI = 16.949
            Bruises:        Lower CI =  8.963,  OR =  9.972,  Upper CI = 11.093
            StalkShape:     Lower CI =  1.384,  OR =  1.512,  Upper CI =  1.651

These results suggest that, of these four variables, the best predictor of mushroom edibility is GillSize, followed by GillAtt as second-best, then Bruises, and finally StalkShape as least predictive.  These conclusions are probably the same as we would draw based on a careful comparison of the plots shown above, but the odds ratios computed in the way just described lead us to these conclusions much more directly.

Finally, it is important to make three points.  First, as I have noted before – but the point is important enough to bear repeating – the associations described here between these binary mushroom characteristics and edibility are based entirely on the UCI mushroom dataset.  Thus, these conclusions are only as representative of mushrooms in general or in any particular setting as the UCI mushroom dataset is representative of this larger and/or different mushroom population.  In particular, mushrooms from other locales or unusual environments may exhibit different relationships between edibility and gill size or other characteristics than the UCI mushrooms do.  The second key point is that the results presented here only assess the predictive value of a single binary mushroom characteristic in isolation.  To get a more complete picture of the relationship between mushroom characteristics and edibility, it is necessary to explore more general multivariate analysis techniques like logistic regression.  More about that later.  Last but not least, the third point is that I realized in reviewing this post before I issued it that I hadn’t included any actual R code to compute odds ratios.  In my next post, I will remedy this problem, giving a detailed view of how the numbers presented here were obtained.