This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Saturday, February 4, 2012

Measuring associations between non-numeric variables

It is often useful to know how strongly or weakly two variables are associated: do they vary together or are they essentially unrelated?  In the case of numerical variables, the best-known measure of association is the product-moment correlation coefficient introduced by Karl Pearson at the end of the nineteenth century.  For variables that are ordered but not necessarily numeric (e.g., Likert scale responses with levels like “strongly agree,” “agree,” “neither agree nor disagree,” “disagree” and “strongly disagree”), association can be measured in terms of the Spearman rank correlation coefficient.  Both of these measures are discussed in detail in Chapter 10 of Exploring Data in Engineering, the Sciences, and Medicine.  For unordered categorical variables (e.g., country, state, county, tumor type, literary genre, etc.), neither of these measures is applicable, but alternatives do exist.  One of these is Goodman and Kruskal’s tau measure, discussed very briefly in Exploring Data (Chapter 10, page 492).  The point of this post is to give a more detailed discussion of this association measure, illustrating some of its advantages, disadvantages, and peculiarities.

A more complete discussion of Goodman and Kruskal’s tau measure is given in Agresti’s book Categorical Data Analysis, on pages 68 and 69.  It belongs to a family of categorical association measures of the general form:

            a(x,y) = [V(y) - E{V(y|x)}]/V(y)

where V(y) is a measure of the overall (i.e., marginal) variability of y and E{V(y|x)} is the expected value of the conditional variability V(y|x) of y given a fixed value of x, where the expectation is taken over all possible values of x.  These variability measures can be defined in different ways, leading to different association measures, including Goodman and Kruskal’s tau as a special case.  Agresti’s book gives detailed expressions for several of these variability measures, including the one on which Goodman and Kruskal’s tau is based, and an alternative expression for the overall association measure a(x,y) is given in Eq. (10.178) on page 492 of Exploring Data.  This association measure does not appear to be available in any current R package, but it is easily implemented as the following function:

GKtau <- function(x,y){
  #  First, compute the IxJ contingency table between x and y,
  #  counting missing values (NA) as an additional level
  Nij = table(x,y,useNA="ifany")
  #  Next, convert this table into a joint probability estimate
  PIij = Nij/sum(Nij)
  #  Compute the marginal probability estimates
  PIiPlus = apply(PIij,MARGIN=1,sum)
  PIPlusj = apply(PIij,MARGIN=2,sum)
  #  Compute the marginal variation of y
  Vy = 1 - sum(PIPlusj^2)
  #  Compute the expected conditional variation of y given x
  InnerSum = apply(PIij^2,MARGIN=1,sum)
  VyBarx = 1 - sum(InnerSum/PIiPlus)
  #  Compute and return Goodman and Kruskal's tau measure
  tau = (Vy - VyBarx)/Vy
  tau
}

An important feature of this procedure is that it allows missing values in either of the variables x or y, treating “missing” as an additional level.  In practice, this is sometimes very important since missing values in one variable may be strongly associated with either missing values in another variable or specific non-missing levels of that variable.
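To make the role of missing values concrete, here is a minimal sketch on made-up data; the vectors x and y are hypothetical and are not taken from the book or from any real dataset. The function is restated in compact form only so that the snippet runs on its own. In this toy example, x is missing exactly when y takes the level "r", so treating NA as a level lets x carry real predictive information about y, while dropping the incomplete records leaves two unrelated variables.

```r
# Compact restatement of GKtau so this snippet is self-contained
GKtau <- function(x, y) {
  Pij <- table(x, y, useNA = "ifany")   # NA is counted as its own level
  Pij <- Pij / sum(Pij)                 # joint probability estimate
  Vy <- 1 - sum(colSums(Pij)^2)         # marginal variation of y
  VyBarx <- 1 - sum(rowSums(Pij^2) / rowSums(Pij))  # expected conditional variation
  (Vy - VyBarx) / Vy
}

# Hypothetical data: x is missing whenever y equals "r"
x <- c("a", "b", "a", "b", NA, NA, NA, NA)
y <- c("p", "p", "q", "q", "r", "r", "r", "r")

# Association with "missing" treated as a level of x
tau_with_na <- GKtau(x, y)

# Association after discarding the records where x is missing
ok <- !is.na(x)
tau_complete <- GKtau(x[ok], y[ok])

print(c(with_na = tau_with_na, complete_cases = tau_complete))
# with_na = 0.6, complete_cases = 0
```

The nonzero value in the first case comes entirely from the missing-data pattern, which is the kind of relationship that would be invisible if incomplete records were simply discarded.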

An important characteristic of Goodman and Kruskal’s tau measure is its asymmetry: because the variables x and y enter this expression differently, the value of a(y,x) is not the same as the value of a(x,y), in general.  This stands in marked contrast to either the product-moment correlation coefficient or the Spearman rank correlation coefficient, which are both symmetric, giving the same association between x and y as that between y and x.  The fundamental reason for the asymmetry of the general class of measures defined above is that they quantify the extent to which the variable x is useful in predicting y, which may be very different than the extent to which the variable y is useful in predicting x.  Specifically, if x and y are statistically independent, then E{V(y|x)} = V(y) – i.e., knowing x does not help at all in predicting y – and this implies that a(x,y) = 0.  At the other extreme, if y is perfectly predictable from x, then E{V(y|x)} = 0, which implies that a(x,y) = 1.  As the examples presented next demonstrate, it is possible that y is extremely predictable from x, but x is only slightly predictable from y.
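The two extremes and the asymmetry can be checked by hand on a very small example. The sketch below uses made-up vectors (x, y, and z are hypothetical names introduced only for illustration): y is a two-level grouping of the four-level variable x, and z is constructed to be exactly independent of x. The function is restated compactly so the snippet stands alone.

```r
# Compact restatement of GKtau so this snippet is self-contained
GKtau <- function(x, y) {
  Pij <- table(x, y, useNA = "ifany")
  Pij <- Pij / sum(Pij)
  Vy <- 1 - sum(colSums(Pij)^2)
  VyBarx <- 1 - sum(rowSums(Pij^2) / rowSums(Pij))
  (Vy - VyBarx) / Vy
}

# Hypothetical four-level variable, two records per level
x <- rep(c("a", "b", "c", "d"), each = 2)
# y is a coarser grouping of x, so y is perfectly predictable from x
y <- ifelse(x %in% c("a", "b"), "lo", "hi")
# z takes each of its two values once within every level of x (independence)
z <- rep(c("u", "v"), times = 4)

tau_xy <- GKtau(x, y)   # 1:   E{V(y|x)} = 0
tau_yx <- GKtau(y, x)   # 1/3: V(x) = 3/4, E{V(x|y)} = 1/2
tau_xz <- GKtau(x, z)   # 0:   E{V(z|x)} = V(z) = 1/2

print(c(tau_xy = tau_xy, tau_yx = tau_yx, tau_xz = tau_xz))
```

Knowing y halves the number of candidate x levels but does not pin x down, which is why the reverse association is 1/3 rather than 1.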

Specifically, consider the sequence of 400 random numbers, uniformly distributed between 0 and 1, generated by the following R code:

            u = runif(400)

(Here, I have used the “set.seed” command to initialize the random number generator so repeated runs of this example will give exactly the same results.)  The second sequence is obtained by quantizing the first, rounding the values of u to a single digit:

            x = round(u,digits=1)

The plot below shows the effects of this coarse quantization: values of u vary continuously from 0 to 1, but values of x are restricted to 0.0, 0.1, 0.2, … , 1.0.  Although this example is simulation-based, it is important to note that this type of grouping of variables is often encountered in practice (e.g., the use of age groups instead of ages in demographic characterizations, blood pressure characterizations like “normal,” “borderline hypertensive,” etc. in clinical data analysis, or the recording of industrial process temperatures to the nearest 0.1 degree, in part due to measurement accuracy considerations and in part due to memory limitations of early data collection systems). 

In this particular case, because the variables x and u are both numeric, we could compute either the product-moment correlation coefficient or the Spearman rank correlation, obtaining the very large value of approximately 0.995 for either one, showing that these variables are strongly associated.  We can also apply Goodman and Kruskal’s tau measure here, and the result is much more informative.  Specifically, the value of a(u,x) is 1 in this case, correctly reflecting the fact that the grouped variable x is exactly computable from the original variable u.  In contrast, the value of a(x,u) is approximately 0.025, suggesting – again correctly – that the original variable u cannot be well predicted from the grouped variable x. 
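The comparison in this paragraph can be reproduced along the following lines. This is a sketch rather than the exact code behind the numbers quoted above: the seed value is an assumption (the one used for the post is not shown), so the results should be close to, but not necessarily identical to, the quoted values. The function is restated compactly so the snippet runs by itself.

```r
# Compact restatement of GKtau so this snippet is self-contained
GKtau <- function(x, y) {
  Pij <- table(x, y, useNA = "ifany")
  Pij <- Pij / sum(Pij)
  Vy <- 1 - sum(colSums(Pij)^2)
  VyBarx <- 1 - sum(rowSums(Pij^2) / rowSums(Pij))
  (Vy - VyBarx) / Vy
}

set.seed(1)                 # hypothetical seed, chosen only for repeatability
u <- runif(400)
x <- round(u, digits = 1)

# Symmetric measures: both directions give the same value
cor_pearson  <- cor(u, x)
cor_spearman <- cor(u, x, method = "spearman")

# Asymmetric measure: the two directions differ sharply
tau_ux <- GKtau(u, x)       # x is a function of u
tau_xu <- GKtau(x, u)       # u is not recoverable from x

# If all N values of u are distinct and x takes K distinct values,
# the definition reduces to a(x,u) = (K - 1)/(N - 1)
K <- length(unique(x))
N <- length(u)
closed_form <- (K - 1) / (N - 1)

print(c(pearson = cor_pearson, spearman = cor_spearman,
        tau_ux = tau_ux, tau_xu = tau_xu, closed_form = closed_form))
```

The closed form explains the small reverse association: with 11 groups and 400 distinct values, (11 - 1)/(400 - 1) is about 0.025, and it would shrink further as the number of observations grows.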

To illustrate a case where the product-moment and rank correlation measures are not applicable at all, consider the following alphabetic re-coding of the variable x into an unordered categorical variable c:

            letters = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K")
            c = letters[10*x+1]

In this case, both of the Goodman and Kruskal tau measures, a(x,c) and a(c,x), are equal to 1, reflecting the fact that these two variables are effectively identical, related via the non-numeric transformation given above. 

Being able to detect relationships like these can be extremely useful in exploratory data analysis where such relationships may be unexpected, particularly in the early stages of characterizing a dataset whose metadata – i.e., detailed descriptions of the variables included in the dataset – is absent, incomplete, ambiguous, or suspect.  As a real data illustration, consider the rent data frame from the R package, which has 1,969 rows, each corresponding to a rental property in Munich, and 9 columns, each giving a characteristic of that unit (e.g., the rent, floor space, year of construction, etc.).  Three of these variables are Sp, a binary variable indicating whether the location is considered above average (1) or not (0), Sm, another binary variable indicating whether the location is considered below average (1) or not (0), and loc, a three-level variable combining the information in these other two, taking the values 1 (below average), 2 (average), or 3 (above average).  The Goodman and Kruskal tau values between all possible pairs of these three variables are:

            a(Sm,Sp) = a(Sp,Sm) = 0.037
            a(Sm,loc) = 0.245 vs. a(loc,Sm) = 1
            a(Sp,loc) = 0.701 vs. a(loc,Sp) = 1

The first of these results – the symmetry of Goodman and Kruskal’s tau for the variables Sm and Sp – is a consequence of the fact that this measure is symmetric for any pair of binary variables.  In fact, the odds ratio that I have discussed in previous posts represents a much better way of characterizing the relationship between binary variables (here, the odds ratio between Sm and Sp is zero, reflecting the fact that a location cannot be both “above average” and “below average” at the same time).  The real utility of the tau measure here is that the second and third lines above show that the variables Sm and Sp are both re-groupings of the finer-grained variable loc. 
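The pattern in these three lines can be imitated with a small synthetic stand-in. The counts below are made up and are not the Munich rent data; loc, Sm, and Sp are simply built to have the same logical relationship as the variables described above. The sketch also checks a standard fact about 2x2 tables, namely that Goodman and Kruskal's tau coincides with the squared product-moment correlation of the two binary indicators, which is one way to see why it is symmetric in the binary case.

```r
# Compact restatement of GKtau so this snippet is self-contained
GKtau <- function(x, y) {
  Pij <- table(x, y, useNA = "ifany")
  Pij <- Pij / sum(Pij)
  Vy <- 1 - sum(colSums(Pij)^2)
  VyBarx <- 1 - sum(rowSums(Pij^2) / rowSums(Pij))
  (Vy - VyBarx) / Vy
}

# Hypothetical counts: 2 below-average, 5 average, 3 above-average locations
loc <- rep(c(1, 2, 3), times = c(2, 5, 3))
Sm <- as.integer(loc == 1)   # below-average indicator
Sp <- as.integer(loc == 3)   # above-average indicator

# Binary pair: the two directions agree and equal the squared correlation
tau_SmSp <- GKtau(Sm, Sp)
tau_SpSm <- GKtau(Sp, Sm)
phi_sq   <- cor(Sm, Sp)^2

# Binary indicator versus the three-level variable it was derived from
tau_locSm <- GKtau(loc, Sm)  # 1: Sm is a re-grouping of loc
tau_Smloc <- GKtau(Sm, loc)  # less than 1: loc is not recoverable from Sm

print(c(tau_SmSp = tau_SmSp, tau_SpSm = tau_SpSm, phi_sq = phi_sq,
        tau_locSm = tau_locSm, tau_Smloc = tau_Smloc))
```

A value of exactly 1 in one direction paired with a smaller value in the other is the signature of one variable being a coarser version of the other.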

Finally, a more interesting exploratory application to this dataset is the following one.  Computing Goodman and Kruskal’s tau measure between the location variable loc and all of the other variables in the dataset – beyond the cases of Sm and Sp just considered – generally yields small values for the associations in either direction.  As a specific example, the association a(loc,Fl) is 0.001, suggesting that location is not a good predictor of the unit’s floor space in square meters, and although the reverse association a(Fl,loc) is larger (0.057), it is not large enough to suggest that the unit’s floor space is a particularly good predictor of its location quality.  The same is true of most of the other variables in the dataset: they are neither well predicted by nor good predictors of location quality.  The one glaring exception is the rent variable R: although the association a(loc,R) is only 0.001, the reverse association a(R,loc) is 0.907, a very large value suggesting that location quality is quite well predicted by the rent.  The beanplot above shows what is happening here: because the variation in rents for all three location qualities is substantial, knowledge of the loc value is not sufficient to accurately predict the rent R, but these rent values do generally increase in going from below-average locations (loc = 1) to average locations (loc = 2) to above-average locations (loc = 3).  For comparison, the beanplots below show why the association with floor space is so much weaker: both the mean floor space in each location quality group and the overall range of these values are quite comparable, implying that location quality cannot be well predicted from floor space, nor floor space from location quality.

The asymmetry of Goodman and Kruskal’s tau measure is disconcerting at first because it has no counterpart in better-known measures like the product-moment correlation coefficient between numerical variables, Spearman’s rank correlation coefficient between ordinal variables, or the odds ratio between binary variables.  One of the points of this post has been to demonstrate how this unusual asymmetry can be useful in practice, distinguishing between the ability of one variable x to predict another variable y, and the reverse case.