In my last post, I discussed the use of odds ratios to characterize the association between edibility and binary mushroom characteristics for the mushrooms characterized in the UCI mushroom dataset. I did not, however, describe those computations in detail, and the purpose of this post is to give a brief discussion of how they were done. That said, I should emphasize that the primary focus of this blog is not R programming per se, but rather the almost limitless number of things that R allows you to do in the realm of exploratory data analysis. For more detailed discussions of the mechanics of R programming, two excellent resources are The R Book by Michael J. Crawley, and Modern Applied Statistics with S (Statistics and Computing) by W.N. Venables and B.D. Ripley.
As a reminder, I discuss the odds ratio in Chapter 13 of Exploring Data in Engineering, the Sciences, and Medicine, which may be viewed as an association measure between binary variables. As I discussed last time, for two binary variables, x and y, each taking the values 0 and 1, the odds ratio is defined on the basis of the following four numbers:
N00 = the number of data records with x = 0 and y = 0
N01 = the number of data records with x = 0 and y = 1
N10 = the number of data records with x = 1 and y = 0
N11 = the number of data records with x = 1 and y = 1
Specifically, the odds ratio is given by the following expression:
OR = N00 N11 / N01 N10
Similarly, confidence intervals for the odds ratio are easily constructed by appealing to the asymptotic normality of log OR, which has a limiting variance given by the square root of the sum of the reciprocals of these four numbers. The R procedure oddsratioWald.proc available from the companion website for Exploring Data computes the odds ratio and the upper and lower confidence limits at a specified level alpha from these four values:
oddsratioWald.proc <- function(n00, n01, n10, n11, alpha = 0.05){
#
# Compute the odds ratio between two binary variables, x and y,
# as defined by the four numbers nij:
#
# n00 = number of cases where x = 0 and y = 0
# n01 = number of cases where x = 0 and y = 1
# n10 = number of cases where x = 1 and y = 0
# n11 = number of cases where x = 1 and y = 1
#
OR <- (n00 * n11)/(n01 * n10)
#
# Compute the Wald confidence intervals:
#
siglog <- sqrt((1/n00) + (1/n01) + (1/n10) + (1/n11))
zalph <- qnorm(1 - alpha/2)
logOR <- log(OR)
loglo <- logOR - zalph * siglog
loghi <- logOR + zalph * siglog
#
ORlo <- exp(loglo)
ORhi <- exp(loghi)
#
oframe <- data.frame(LowerCI = ORlo , OR = OR, UpperCI = ORhi, alpha = alpha)
oframe
}
Including “alpha = 0.05” in the parameter list fixes the default value for alpha at 0.05, which yields the 95% confidence intervals for the computed odds ratio, based on the Wald approximation described above. An important practical point is that these intervals become infinitely wide if any of the four numbers Nij are equal to zero; also, note that in this case, the computed odds ratio is either zero or infinite. Finally, it is worth noting that if the numbers Nij are large enough, the procedure just described can encounter numerical overflow problems (i.e., the products in either the numerator or the denominator become too large to be represented in machine arithmetic). If this is a possibility, a better alternative is to regroup the computations as follows:
OR = (N00 / N01) x (N11 / N10 )
To use the routine just described, it is necessary to have the four numbers defined above, which form the basis for a two-by-two contingency table. Because contingency tables are widely used in characterizing categorical data, these numbers are easily computed in R using the table command. As a simple example, the following code reads the UCI mushroom dataset and generates the two-by-two contingency table for the EorP and GillSize attributes:
> mushrooms <- read.csv("mushroom.csv")
>
> table(mushrooms$EorP, mushrooms$GillSize)
b n
e 3920 288
p 1692 2224
>
>
(Note that the first line reads the csv file containing the mushroom data; for this command to work as shown, it is necessary for this file to be in the working directory. Alternatively, you can change the working directory using the setwd command.)
To facilitate the computation of odds ratios, the following preliminary procedure combines the table command with the oddsratioWald.proc procedure, allowing you to compute the odds ratio and its level-alpha confidence interval from the two-level variables directly:
TableOR.proc00 <- function(x,y,alpha=0.05){
#
xtab <- table(x,y)
n00 <- xtab[1,1]
n01 <- xtab[1,2]
n10 <- xtab[2,1]
n11 <- xtab[2,2]
oddsratioWald.proc(n00,n01,n10,n11,alpha)
}
The primary disadvantage of this procedure is that it doesn’t tell you which levels of the two variables are being characterized by the computed odds ratio. In fact, this characterization describes the first level of each of these variables, and the following slight modification makes this fact explicit:
TableOR.proc <- function(x,y,alpha=0.05){
#
xtab <- table(x,y)
n00 <- xtab[1,1]
n01 <- xtab[1,2]
n10 <- xtab[2,1]
n11 <- xtab[2,2]
#
outList <- vector("list",2)
outList[[1]] <- paste("Odds ratio between the level [",dimnames(xtab)[[1]][1],"] of the first variable and the level [",dimnames(xtab)[[2]][1],"] of the second variable:",sep=" ")
outList[[2]] <- oddsratioWald.proc(n00,n01,n10,n11,alpha)
outList
}
Specifically, I have used the fact that the dimension names of the 2x2 table xtab correspond to the levels of the variables x and y, and I have used the paste command to include these values in a text string displayed to the user. (I have enclosed the levels in square brackets to make them stand out from the surrounding text, particularly useful here since the levels are coded as single letters.) Applying this procedure to the mushroom characteristics EorP and GillSize yields the following results:
> TableOR.proc(mushrooms$EorP, mushrooms$GillSize)
[[1]]
[1] "Odds ratio between the level [ e ] of the first variable and the level [ b ] of the second variable:"
[[2]]
LowerCI OR UpperCI alpha
1 15.62615 17.89073 20.48349 0.05
>
Almost certainly, the formatting I have used here could be improved – probably a lot – but the key point is to provide a result that is reasonably complete and easy to interpret.
Finally, I noted in my last post that if we are interested in using odds ratios to compare or rank associations, it is useful to code the levels so that the computed odds ratio is larger than 1. In particular, note that applying the above procedure to characterize the relationship between edibility and the Bruises characteristic yields:
> TableOR.proc(mushrooms$EorP, mushrooms$Bruises)
[[1]]
[1] "Odds ratio between the level [ e ] of the first variable and the level [ f ] of the second variable:"
[[2]]
LowerCI OR UpperCI alpha
1 0.09014769 0.1002854 0.1115632 0.05
>
It is clear from these results that both Bruises and GillSize exhibit odds ratios with respect to mushroom edibility that are significantly different from the neutral value 1 (i.e., the 95% confidence interval excludes the value 1 in both cases), but it is not obvious which variable has the stronger association, based on the available data. The following procedure automatically restructures the computation so that the computed odds ratio is larger than or equal to 1, allowing us to make this comparison:
AutomaticOR.proc <- function(x,y,alpha=0.05){
#
xtab <- table(x,y)
n00 <- xtab[1,1]
n01 <- xtab[1,2]
n10 <- xtab[2,1]
n11 <- xtab[2,2]
#
rawOR <- (n00*n11)/(n01*n10)
if (rawOR < 1){
n01 <- xtab[1,1]
n00 <- xtab[1,2]
n11 <- xtab[2,1]
n10 <- xtab[2,2]
iLevel <- 2
}
else{
iLevel <- 1
}
outList <- vector("list",2)
outList[[1]] <- paste("Odds ratio between the level [",dimnames(xtab)[[1]][1],"] of the first variable and the level [",dimnames(xtab)[[2]][iLevel],"] of the second variable:",sep=" ")
outList[[2]] <- oddsratioWald.proc(n00,n01,n10,n11,alpha)
outList
}
Note that this procedure first constructs the 2x2 table on which everything is based and then computes the odds ratio in the default coding: if this value is smaller than 1, the coding of the second variable (y) is reversed. The odds ratio and its confidence interval are then computed and the levels of the variables used in computing it are presented as before. Applying this procedure to the Bruises characteristic yields the following result, from which we can see that GillSize appears to have the stronger association, as noted last time:
> AutomaticOR.proc(mushrooms$EorP, mushrooms$Bruises)
[[1]]
[1] "Odds ratio between the level [ e ] of the first variable and the level [ t ] of the second variable:"
[[2]]
LowerCI OR UpperCI alpha
1 8.963532 9.971541 11.09291 0.05
>
Well, that’s it for now. Next time, I will come back to the question of how we should expect interestingness measures to be distributed and whether those expectations are met in practice.
No comments:
Post a Comment