ExploringDataBlog: May 2011

Saturday, May 21, 2011

The distribution of interestingness

On April 22, David Landy posed a question about the distribution of interestingness values in response to my April 3^rd post on “Interestingness Measures.” He noted that the survey paper by Hilderman and Hamilton that I cited there makes the following comment:

“Our belief is that a useful measure of interestingness should generate index values that are reasonably distributed throughout the range of possible values (such as in a SND)”

David's question was whether there is a way of showing that interestingness measures should be normally distributed. In fact, this question raises a number of important issues, and this post explores a few of them.

First, as I noted in my comment in response to this question, the short answer is that the “standard behavior” we expect from most estimators – i.e., most characterizations computed from a set of uncertain data values – is asymptotic normality. That is, for sufficiently large data samples, we expect most of the standard characterizations we might compute – variance estimates, correlation estimates, least squares regression coefficients, or linear model parameters computed using robust estimators like Least Trimmed Squares (LTS) – to be approximately normally distributed. I discuss this idea in Chapter 6 of Exploring Data in Engineering, the Sciences, and Medicine, beginning with the Central Limit Theorem (CLT), which says, very roughly, that “averages of N numbers tend to Gaussian limits as N becomes infinitely large.” There are exceptions – cases where asymptotic normality does not hold – but the phenomenon is common enough to make violations noteworthy. In fact, one of the reasons the conceptually simpler least median of squares (LMS) estimator was dropped in favor of the more complex least trimmed squares (LTS) estimator is that the LMS estimator is not asymptotically normal, approaching a non-normal limiting distribution at an anomalously slow rate. In contrast, the LTS estimator is asymptotically normal, with the standard convergence rate (i.e., the standard deviation of the estimator approaches zero inversely with the square root of the sample size as the sample becomes infinitely large).

The longer – and far less satisfying – answer to the question of how interestingness measures should be distributed is, “it depends,” as the following discussion illustrates.

One of the many cases where asymptotic normality has been shown to hold is Gini’s mean difference, which forms the basis for one of the four interestingness measures I discussed in my earlier post. While this result appears to provide a reasonably specific answer to the question of how this interestingness measure is distributed, there are two closely related issues that greatly complicate the matter. The first is a question that plagues many applications of asymptotic normality, which is quite commonly used for constructing approximate confidence intervals: how large must the sample be for this asymptotic approximation to be reasonable? (The confidence intervals described in my first post on odds ratios, for example, were derived on the basis of asymptotic normality.) Once again, the answer is, “it depends,” as the examples discussed here demonstrate. The second issue is that the interestingness measures that I describe in Exploring Data and that I discussed in my previous post are normalized measures. In particular, the Gini interestingness measure is defined as Gini’s mean difference divided by its maximum possible value, giving a measure that is bounded between 0 and 1. An important aspect of the Gaussian distribution is that it assigns a nonzero probability to any real value, although the probability associated with values more than a few standard deviations from the mean rapidly becomes very small. Thus, for a bounded quantity like the Gini interestingness measure, a Gaussian approximation can only be reasonable if the standard deviation is small enough that the probability of exhibiting values less than 0 or greater than 1 is acceptably small. Further, this “maximum reasonable standard deviation” will depend on the mean value of the estimator: if the computed interestingness measure is approximately 0.5, the maximum feasible standard deviation for a reasonable Gaussian approximation is considerably larger than if the computed interestingness measure is 0.01 or 0.99. As a practical matter, this usually means that, for a fixed sample size, the shape of the empirical distribution of normalized interestingness values for sequences exhibiting a given degree of heterogeneity (i.e., a specified “true” interestingness value) will vary significantly in shape (specifically, symmetry) as this “true” interestingness value varies from near 0 to approximately 0.5, to near 1. This idea is easily demonstrated in R on the basis of simulated examples.

The essential idea behind this simulation study is the following. Suppose we have a categorical variable that can assume any one of L distinct levels, and suppose we want to generate random samples of this variable where each level i has a certain probability p_i of occurring in our sample. The multinomial distribution discussed in Sec. 10.2.1 of Exploring Data is appropriate here, characterizing the number of times n_i we observe each of these levels in a random sample of size N, given the probability of observing each level i. The simulation strategy considered here then proceeds by specifying the number of levels (here, I take L = 4) and the probability of observing each level. Then, I generate a large number B of random samples (here, I take B = 1000), each generated from the appropriate multinomial distribution. In what follows, I consider the following four cases:

Case A: four equally-represented levels, each with probability 0.25;
Case B: one dominant level (p₁ = 0.97) and three rare levels (p₂ = p₃ = p₄ = 0.01);
Case C: three equally-represented majority levels (p₁ = p₂ = p₃ = 0.33) and one rare minority level (p₄ = 0.01);
Case D: four distinct probabilities (p₁ = 0.05, p₂ = 0.10, p₃ = 0.25, p₄ = 0.60).

To generate these sequences, I use R’s built-in multinomial random number generator, rmultinom, with parameters n = 1000 (this corresponds to B in the above discussion: I want to generate 1000 multinomial random vectors), size = 100 (initially, I take each vector to be of length N = 100; later, I consider larger vectors), and, for Case A, I specify prob = c(0.25, 0.25, 0.25, 0.25), corresponding to the four p_i values specified above (for the other three cases, I specify the prob parameter appropriately). The value returned by this procedure is a matrix, with one row for each level and one column for each simulation; for each column, the number in the i^th row is n_i, the number of times the i^th level is observed in that simulated response. Dividing these counts by the sample size gives the fractional representation of each level, from which the normalized interestingness measures are then computed. For the Gini interestingness measure and Case A, these steps are accomplished using the following R code:

set.seed(101)

pvectA <- c(0.25, 0.25, 0.25, 0.25)

CountsA <- rmultinom(n = 1000, size = 100, prob = pvectA)

FractionsA <- CountsA/100

GiniA <- apply(FractionsA, MARGIN=2, gini.proc)

The first line of this sequence sets the seed value to initialize the random number generator: if you don’t do this, running the same procedure again will give you results that are statistically similar but not exactly the same as before; specifying the seed guarantees you get exactly the same results the next time you execute the command sequence. The second line defines the “true” heterogeneity of each sequence (i.e., the underlying probability of observing each level that the random number generator uses to simulate the data). The third line invokes the multinomial random number generator to return the desired matrix of counts, where each column represents an independent simulation and the rows correspond to the four possible levels of the response. The fourth line converts these counts to fractions by dividing by the size of each sample, and the last line uses the R function apply to invoke the Gini interestingness procedure gini.proc for each column of the fractions matrix: setting MARGIN = 2 tells R to “apply the indicated function gini.proc to the columns of FractionsA.” (It would also be possible to write this procedure as a loop in R, but the apply function is much faster.) Replacing gini.proc by bray.proc in the above sequence of commands would generate the Bray interestingness measures, using shannon.proc would generate the Shannon interestingness measures, and using simpson.proc would generate the Simpson interestingness measures (note that all of these procedures are available from the companion website for Exploring Data).

For this first example – i.e., Case A, where all four levels are equally represented – note that the “true” value for any of the four normalized interestingness measures I discussed in my earlier post is 0, the smallest possible value. Since the random samples generated by the multinomial random number generator rarely have identical counts for the four levels, the interestingness measure computed from each sample is generally larger than 0, but it can never be smaller than this value. This observation suggests two things. The first is that the average of these interestingness values is larger than 0, implying that any of the normalized measures I considered exhibit a positive bias for this case. The second consequence of this observation is that the distribution of any of these interestingness values is probably asymmetric. The first of these conclusions is evident in the plot below, which shows nonparametric density estimates for the Gini measure computed from the 1000 random simulations generated for each case. The left-most peak corresponds to Case A and it is clear from this plot that the average value is substantially larger than zero (specifically, the mean of these simulations is 0.114). This plot also suggests asymmetry, although the visual evidence is not overwhelming.

The right-most peak in the above plot corresponds to Case B, for which the “true” Gini interestingness value is 0.96. Here, the peak is much narrower, although again the average of the simulation values (0.971) is slightly larger than the true value, which is represented by the right-most vertical dashed line. Case C is modeled on the CapSurf variable from the UCI mushroom dataset that I discussed in my earlier post on interestingness measures: the first three levels are equally distributed, but there is also an extremely rare fourth level. The “true” Gini measure for this case is 0.32, corresponding to the second-from-left dashed vertical line in the above plot, and – as with the previous two cases – the distribution of Gini values computed from the multinomial random samples is biased above this correct value. Here, however, the shape of the distribution is more symmetric, suggesting possible conformance with the Gaussian expectations suggested above. Finally, Case D is the third peak from the left in the above plot, and here the distribution appears reasonably symmetric around the correct value of 0.6, again indicated by a vertical dashed line.

The above plot shows the corresponding results for the Shannon interestingness measure for the same four cases. Here, the distributions for the different cases exhibit a much wider range of variation than they did for the Gini measure. As before, Case D (the smoothest of the four peaks) appears reasonably symmetric in its distribution around the “true” value of 0.255, possibly consistent with a normal distribution, but this is clearly not true for any of the other three cases. The asymmetry seen for Case A (the left-most peak) appears fairly pronounced, the density for Case C (the second-from-left peak) appears to be multi-modal, and the right-most peak (Case B) exhibits both asymmetry and multimodal character. The key point here is that the observed distribution of interestingness values depends on both the specific simulation case considered (i.e., the true probability distribution of the levels), and the particular interestingness measure chosen.

As noted above, if we expect approximate normality for a data characterization on the basis of its asymptotic behavior, an important practical question is whether we are working with large enough samples for asymptotic approximations to be at all reasonable. Here, the sample size is N = 100: we are generating and characterizing a lot more sequences than this, but each characterization is based on 100 data observations. In the UCI mushroom example I discussed in my previous post, the sample size was much larger: that dataset characterizes 8,124 different mushrooms, so the interestingness measures considered there were based on a sample size of N = 8124. In fact, this can make an important difference, as the following example demonstrates. The above plot compares the Gini measure density estimates obtained from 1000 random samples, each of length N = 100 (the dashed curve), with those obtained from 1000 random samples, each of length N = 8124 (the solid curve), for Case D. The vertical dotted line corresponds to the true Gini value of 0.6 for this case. Note that both of these densities appear to be symmetrically distributed around this correct value, but the distribution of values is much narrower when they are computed from the larger samples.

These results are consistent with our expectations of asymptotic normality for the Gini measure, at least for cases that are not too near either extreme. Further confirmation of these general expectations for a “well-behaved interestingness measure” is provided by the four Q-Q plot shown below, generated using the qqPlot command in the car package in R that I discussed in a previous post. These plots compare the Shannon measures computed from Cases C (upper two plots) and D (lower two plots), for the smaller sample size N = 100 (left-hand plots) and the larger sample size N = 8124 (right-hand plots). In particular, in the upper left Q-Q plot (Case C, N = 100), compelling evidence of non-normality is given by both the large number of points falling outside the 95% confidence intervals for this Q-Q plot, and the “kinks” in the plot that reflect the multimodality seen in the second peak in the second plot above. In contrast, all of the points in the upper right Q-Q plot (Case C, N = 8124) fall within the 95% confidence limits for this plot, suggesting that the approximate normality assumption is reasonable for the larger sample. Qualitatively similar behavior is seen for Case D in the two lower plots: the results computed for N = 100 exhibit some evidence for asymmetry, with both the lower and upper tails lying above the upper 95% confidence limit for the Q-Q plot, while the results computed for N = 8124 show no such deviations.

It is important to emphasize, however, that approximate normality is not always to be expected. The plot below shows Q-Q plots for the Gini measure (upper plots) and the Shannon measure (lower plots) for Case A, computed from both the smaller samples (N = 100, left-hand plots) and the larger samples (N = 8124, right-hand plots). Here, all of these plots provide strong evidence for distributional asymmetry, consistent with the point made above that, while the computed interestingness value can exceed the true value of 0 for this case, it can never fall below this value. Thus, we expect both a positive bias and a skewed distribution of values, even for large sample sizes, and we see both of these features here.

To conclude, then, I reiterate what I said at the beginning: the answer to the question of what distribution we should expect for a “good” interestingness measure is, “it depends.” But now I can be more specific. The answer depends, first, on the sample size from which we compute the interestingness measure: the larger the sample, the more likely this distribution is to be approximately normal. This is, essentially, a restatement of the definition of asymptotic normality, but it raises the key practical question of how large N must be for this approximation to be reasonable. That answer also depends, on at least two things: first, the specific interestingness measure we consider, and second, the true heterogeneity of the data sample from which we compute it. Both of these points were illustrated in the first two plots shown above: the shape of the estimated densities varied significantly both with the case considered (i.e., the “true” heterogeneity) and whether we considered the Gini measure or the Shannon measure. Analogous comments apply to the other interestingness measures I discussed in my previous post.

Saturday, May 7, 2011

Computing Odds Ratios in R

In my last post, I discussed the use of odds ratios to characterize the association between edibility and binary mushroom characteristics for the mushrooms characterized in the UCI mushroom dataset. I did not, however, describe those computations in detail, and the purpose of this post is to give a brief discussion of how they were done. That said, I should emphasize that the primary focus of this blog is not R programming per se, but rather the almost limitless number of things that R allows you to do in the realm of exploratory data analysis. For more detailed discussions of the mechanics of R programming, two excellent resources are The R Book by Michael J. Crawley, and Modern Applied Statistics with S (Statistics and Computing) by W.N. Venables and B.D. Ripley.

As a reminder, I discuss the odds ratio in Chapter 13 of Exploring Data in Engineering, the Sciences, and Medicine, which may be viewed as an association measure between binary variables. As I discussed last time, for two binary variables, x and y, each taking the values 0 and 1, the odds ratio is defined on the basis of the following four numbers:

N₀₀ = the number of data records with x = 0 and y = 0

N₀₁ = the number of data records with x = 0 and y = 1

N₁₀ = the number of data records with x = 1 and y = 0

N₁₁ = the number of data records with x = 1 and y = 1

Specifically, the odds ratio is given by the following expression:

OR = N₀₀ N₁₁ / N₀₁ N₁₀

Similarly, confidence intervals for the odds ratio are easily constructed by appealing to the asymptotic normality of log OR, which has a limiting variance given by the square root of the sum of the reciprocals of these four numbers. The R procedure oddsratioWald.proc available from the companion website for Exploring Data computes the odds ratio and the upper and lower confidence limits at a specified level alpha from these four values:

oddsratioWald.proc <- function(n00, n01, n10, n11, alpha = 0.05){

# Compute the odds ratio between two binary variables, x and y,

# as defined by the four numbers nij:

# n00 = number of cases where x = 0 and y = 0

# n01 = number of cases where x = 0 and y = 1

# n10 = number of cases where x = 1 and y = 0

# n11 = number of cases where x = 1 and y = 1

OR <- (n00 * n11)/(n01 * n10)

# Compute the Wald confidence intervals:

siglog <- sqrt((1/n00) + (1/n01) + (1/n10) + (1/n11))

zalph <- qnorm(1 - alpha/2)

logOR <- log(OR)

loglo <- logOR - zalph * siglog

loghi <- logOR + zalph * siglog

ORlo <- exp(loglo)

ORhi <- exp(loghi)

oframe <- data.frame(LowerCI = ORlo, OR = OR, UpperCI = ORhi, alpha = alpha)

oframe

}

Including “alpha = 0.05” in the parameter list fixes the default value for alpha at 0.05, which yields the 95% confidence intervals for the computed odds ratio, based on the Wald approximation described above. An important practical point is that these intervals become infinitely wide if any of the four numbers N_ij are equal to zero; also, note that in this case, the computed odds ratio is either zero or infinite. Finally, it is worth noting that if the numbers N_ij are large enough, the procedure just described can encounter numerical overflow problems (i.e., the products in either the numerator or the denominator become too large to be represented in machine arithmetic). If this is a possibility, a better alternative is to regroup the computations as follows:

OR = (N₀₀ / N₀₁) x (N₁₁ / N₁₀)

To use the routine just described, it is necessary to have the four numbers defined above, which form the basis for a two-by-two contingency table. Because contingency tables are widely used in characterizing categorical data, these numbers are easily computed in R using the table command. As a simple example, the following code reads the UCI mushroom dataset and generates the two-by-two contingency table for the EorP and GillSize attributes:

> mushrooms <- read.csv("mushroom.csv")

> table(mushrooms$EorP, mushrooms$GillSize)

b n

e 3920 288

p 1692 2224

(Note that the first line reads the csv file containing the mushroom data; for this command to work as shown, it is necessary for this file to be in the working directory. Alternatively, you can change the working directory using the setwd command.)

To facilitate the computation of odds ratios, the following preliminary procedure combines the table command with the oddsratioWald.proc procedure, allowing you to compute the odds ratio and its level-alpha confidence interval from the two-level variables directly:

TableOR.proc00 <- function(x,y,alpha=0.05){

xtab <- table(x,y)

n00 <- xtab[1,1]

n01 <- xtab[1,2]

n10 <- xtab[2,1]

n11 <- xtab[2,2]

oddsratioWald.proc(n00,n01,n10,n11,alpha)

}

The primary disadvantage of this procedure is that it doesn’t tell you which levels of the two variables are being characterized by the computed odds ratio. In fact, this characterization describes the first level of each of these variables, and the following slight modification makes this fact explicit:

TableOR.proc <- function(x,y,alpha=0.05){

xtab <- table(x,y)

n00 <- xtab[1,1]

n01 <- xtab[1,2]

n10 <- xtab[2,1]

n11 <- xtab[2,2]

outList <- vector("list",2)

outList[[1]] <- paste("Odds ratio between the level [",dimnames(xtab)[[1]][1],"] of the first variable and the level [",dimnames(xtab)[[2]][1],"] of the second variable:",sep=" ")

outList[[2]] <- oddsratioWald.proc(n00,n01,n10,n11,alpha)

outList

}

Specifically, I have used the fact that the dimension names of the 2x2 table xtab correspond to the levels of the variables x and y, and I have used the paste command to include these values in a text string displayed to the user. (I have enclosed the levels in square brackets to make them stand out from the surrounding text, particularly useful here since the levels are coded as single letters.) Applying this procedure to the mushroom characteristics EorP and GillSize yields the following results:

> TableOR.proc(mushrooms$EorP, mushrooms$GillSize)

[[1]]

[1] "Odds ratio between the level [ e ] of the first variable and the level [ b ] of the second variable:"

[[2]]

LowerCI OR UpperCI alpha

1 15.62615 17.89073 20.48349 0.05

Almost certainly, the formatting I have used here could be improved – probably a lot – but the key point is to provide a result that is reasonably complete and easy to interpret.

Finally, I noted in my last post that if we are interested in using odds ratios to compare or rank associations, it is useful to code the levels so that the computed odds ratio is larger than 1. In particular, note that applying the above procedure to characterize the relationship between edibility and the Bruises characteristic yields:

> TableOR.proc(mushrooms$EorP, mushrooms$Bruises)

[[1]]

[1] "Odds ratio between the level [ e ] of the first variable and the level [ f ] of the second variable:"

[[2]]

LowerCI OR UpperCI alpha

1 0.09014769 0.1002854 0.1115632 0.05

It is clear from these results that both Bruises and GillSize exhibit odds ratios with respect to mushroom edibility that are significantly different from the neutral value 1 (i.e., the 95% confidence interval excludes the value 1 in both cases), but it is not obvious which variable has the stronger association, based on the available data. The following procedure automatically restructures the computation so that the computed odds ratio is larger than or equal to 1, allowing us to make this comparison:

AutomaticOR.proc <- function(x,y,alpha=0.05){

xtab <- table(x,y)

n00 <- xtab[1,1]

n01 <- xtab[1,2]

n10 <- xtab[2,1]

n11 <- xtab[2,2]

rawOR <- (n00*n11)/(n01*n10)

if (rawOR < 1){

n01 <- xtab[1,1]

n00 <- xtab[1,2]

n11 <- xtab[2,1]

n10 <- xtab[2,2]

iLevel <- 2

}

else{

iLevel <- 1

}

outList <- vector("list",2)

outList[[1]] <- paste("Odds ratio between the level [",dimnames(xtab)[[1]][1],"] of the first variable and the level [",dimnames(xtab)[[2]][iLevel],"] of the second variable:",sep=" ")

outList[[2]] <- oddsratioWald.proc(n00,n01,n10,n11,alpha)

outList

}

Note that this procedure first constructs the 2x2 table on which everything is based and then computes the odds ratio in the default coding: if this value is smaller than 1, the coding of the second variable (y) is reversed. The odds ratio and its confidence interval are then computed and the levels of the variables used in computing it are presented as before. Applying this procedure to the Bruises characteristic yields the following result, from which we can see that GillSize appears to have the stronger association, as noted last time:

> AutomaticOR.proc(mushrooms$EorP, mushrooms$Bruises)

[[1]]

[1] "Odds ratio between the level [ e ] of the first variable and the level [ t ] of the second variable:"

[[2]]

LowerCI OR UpperCI alpha

1 8.963532 9.971541 11.09291 0.05

Well, that’s it for now. Next time, I will come back to the question of how we should expect interestingness measures to be distributed and whether those expectations are met in practice.