This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.


Saturday, February 16, 2013

Finding outliers in numerical data

One of the topics emphasized in Exploring Data in Engineering, the Sciences, and Medicine is the damage outliers can do to traditional data characterizations.  Consequently, one of the procedures to be included in the ExploringData package is FindOutliers, described in this post.  Given a vector of numeric values, this procedure supports four different methods for identifying possible outliers.

Before describing these methods, it is important to emphasize two points.  First, the detection of outliers in a sequence of numbers can be approached as a mathematical problem, but the interpretation of these data observations cannot.  That is, mathematical outlier detection procedures implement various rules for identifying points that appear to be anomalous with respect to the nominal behavior of the data, but they cannot explain why these points appear to be anomalous.  The second point is closely related to the first: one possible source of outliers in a data sequence is gross measurement errors or other data quality problems, but other sources of outliers are also possible, so it is important to keep an open mind.  The terms “outlier” and “bad data” are not synonymous.  Chapter 7 of Exploring Data briefly describes two examples of outliers whose detection and interpretation led to a Nobel Prize and to a major new industrial product (Teflon, a registered trademark of the DuPont Company).

In the case of a single sequence of numbers, the typical approach to outlier detection is to first determine upper and lower limits on the nominal range of data variation, and then declare any point falling outside this range to be an outlier.  The FindOutliers procedure implements the following methods of computing the upper and lower limits of the nominal data range:

1. The ESD identifier, more commonly known as the “three-sigma edit rule,” well known but unreliable;
2. The Hampel identifier, a more reliable procedure based on the median and the MADM scale estimate;
3. The standard boxplot rule, based on the upper and lower quartiles of the data distribution;
4. An adjusted boxplot rule, based on the upper and lower quartiles, along with a robust skewness estimator called the medcouple.

The rest of this post briefly describes these four outlier detection rules and illustrates their application to two real data examples.

Without question, the most popular outlier detection rule is the ESD identifier (an abbreviation for “extreme Studentized deviation”), which declares any point more than t standard deviations from the mean to be an outlier, where the threshold value t is most commonly taken to be 3.  In other words, the nominal range used by this outlier detection procedure is the closed interval:

            [mean – t * SD, mean + t * SD]

where SD is the estimated standard deviation of the data sequence.  Motivation for the threshold choice t = 3 comes from the fact that for normally distributed data, the probability of observing a value more than three standard deviations from the mean is only about 0.3%.  The problem with this outlier detection procedure is that both the mean and the standard deviation are themselves extremely sensitive to the presence of outliers in the data.  As a consequence, this procedure is likely to miss outliers that are present in the data.  In fact, it can be shown that for a contamination level greater than 10%, this rule fails completely, detecting no outliers at all, no matter how extreme they are (for details, see the discussion in Sec. 3.2.1 of Mining Imperfect Data).
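To make this rule concrete, here is a minimal R sketch of the computation just described; the function name and interface are mine, for illustration only, and do not represent the actual FindOutliers implementation:

    # Illustrative sketch of the ESD (three-sigma) detection limits
    ESDLimits <- function(x, t = 3) {
      xbar <- mean(x, na.rm = TRUE)
      s <- sd(x, na.rm = TRUE)
      c(lower = xbar - t * s, upper = xbar + t * s)
    }

    # Any point outside these limits is declared an outlier, e.g.:
    # limits <- ESDLimits(x)
    # which(x < limits[1] | x > limits[2])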

The default option for the FindOutliers procedure is the Hampel identifier, which replaces the mean with the median and the standard deviation with the MADM scale estimate (the median absolute deviation from the median, commonly abbreviated as MAD).  The nominal data range for this outlier detection procedure is:

            [median – t * MAD, median + t * MAD]

As I have discussed in previous posts, the median and the MAD scale are much more resistant to the influence of outliers than the mean and standard deviation.  As a consequence, the Hampel identifier is generally more effective than the ESD identifier, although the Hampel identifier can be too aggressive, declaring too many points as outliers.  For detailed comparisons of the ESD and Hampel identifiers, refer to Sec. 7.5 of Exploring Data or Sec. 3.3 of Mining Imperfect Data.
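The corresponding sketch for the Hampel identifier is equally simple; note that R’s built-in mad function applies, by default, the 1.4826 scaling factor that makes the MADM scale estimate consistent with the standard deviation for Gaussian data:

    # Illustrative sketch of the Hampel identifier detection limits
    HampelLimits <- function(x, t = 3) {
      m <- median(x, na.rm = TRUE)
      s <- mad(x, na.rm = TRUE)  # includes the 1.4826 consistency factor
      c(lower = m - t * s, upper = m + t * s)
    }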

The third method option for the FindOutliers procedure is the standard boxplot rule, based on the following nominal data range:

            [Q1 – c * IQD, Q3 + c * IQD]

where Q1 and Q3 represent the lower and upper quartiles, respectively, of the data distribution, and IQD = Q3 – Q1 is the interquartile distance, a measure of the spread of the data similar to the standard deviation.  The threshold parameter c is analogous to t in the first two outlier detection rules, and the value most commonly used in this outlier detection rule is c = 1.5.  This outlier detection rule is much less sensitive to the presence of outliers than the ESD identifier, but more sensitive than the Hampel identifier, and, like the Hampel identifier, it can be somewhat too aggressive, declaring nominal data observations to be outliers.  An advantage of the boxplot rule over these two alternatives is that, because it does not depend on an estimate of the “center” of the data (e.g., the mean in the ESD identifier or the median in the Hampel identifier), it is better suited to distributions that are moderately asymmetric.
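Again for illustration only, the standard boxplot limits can be computed along the following lines; R’s quantile function supports several quartile estimation methods that give slightly different answers, so relying on its default here is an arbitrary choice on my part:

    # Illustrative sketch of the standard boxplot rule detection limits;
    # the argument k plays the role of the threshold parameter c above
    BoxplotLimits <- function(x, k = 1.5) {
      q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
      iqd <- q[2] - q[1]  # interquartile distance
      c(lower = q[1] - k * iqd, upper = q[2] + k * iqd)
    }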

The fourth method option is an extension of the standard boxplot rule, developed for data distributions that may be strongly asymmetric.  Basically, this procedure modifies the threshold parameter c by an amount that depends on the asymmetry of the distribution, modifying the upper threshold and the lower threshold differently.  Because the standard moment-based skewness estimator is extremely outlier-sensitive (for an illustration of this point, see the discussion in Sec. 7.1.1 of Exploring Data), it is necessary to use an outlier-resistant alternative to assess distributional asymmetry.  The asymmetry measure used here is the medcouple, a robust skewness measure available in the robustbase package in R, which I discussed in a previous post (Boxplots and Beyond – Part II: Asymmetry).  An important point about the medcouple is that it can be either positive or negative, depending on the direction of the distributional asymmetry; positive values arise more frequently in practice, but negative values can occur, and the sign of the medcouple influences the definition of the asymmetric boxplot rule.  Specifically, for positive values of the medcouple MC, the adjusted boxplot rule’s nominal data range is:

            [Q1 – c * exp(a * MC) * IQD, Q3 + c * exp(b * MC) * IQD ]

while for negative medcouple values, the nominal data range is:

            [Q1 – c * exp(-b * MC) * IQD, Q3 + c * exp(-a * MC) * IQD ]

An important observation here is that for symmetric data distributions, MC should be zero, reducing the adjusted boxplot rule to the standard boxplot rule described above.  As in the standard boxplot rule, the threshold parameter is typically taken as c = 1.5, while the other two parameters are typically taken as a = -4 and b = 3.  In particular, these are the default values for the procedure adjboxStats in the robustbase package.
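Given the medcouple function mc from the robustbase package, the adjusted limits can be sketched as follows; this is my own transcription of the formulas above (essentially what adjboxStats computes internally), not the FindOutliers code:

    # Illustrative sketch of the adjusted boxplot rule detection limits,
    # using the medcouple function mc() from the robustbase package
    library(robustbase)
    AdjustedBoxplotLimits <- function(x, k = 1.5, a = -4, b = 3) {
      MC <- mc(x, na.rm = TRUE)
      q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
      iqd <- q[2] - q[1]
      if (MC >= 0) {
        c(lower = q[1] - k * exp(a * MC) * iqd,
          upper = q[2] + k * exp(b * MC) * iqd)
      } else {
        c(lower = q[1] - k * exp(-b * MC) * iqd,
          upper = q[2] + k * exp(-a * MC) * iqd)
      }
    }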



To illustrate how these outlier detection methods compare, the above pair of plots shows the results of applying all four of them to the makeup flow rate dataset discussed in Exploring Data (Sec. 7.1.2) in connection with the failure of the ESD identifier.  The points in these plots represent approximately 2,500 regularly sampled flow rate measurements from an industrial manufacturing process.  These measurements were taken over a long enough period of time to contain both periods of regular process operation – during which the measurements fluctuate around a value of approximately 400 – and periods when the process was shut down, was being shut down, or was being restarted, during which the measurements exhibit values near zero.  If we wish to characterize normal process operation, these shutdown episodes represent outliers, and they correspond to about 20% of the data.  The left-hand plot shows the outlier detection limits for the ESD identifier (lighter, dashed lines) and the Hampel identifier (darker, dotted lines).  As discussed in Exploring Data, the ESD limits are wide enough that they do not detect any outliers in this data sequence, while the Hampel identifier nicely separates the data into normal operating data and outliers that correspond to the shutdown episodes.  The right-hand plot shows the analogous results obtained with the standard boxplot method (lighter, dashed lines) and the adjusted boxplot method (darker, dotted lines).  Here, the standard boxplot rule gives results very similar to the Hampel identifier, again nicely separating the dataset into normal operating data and shutdown episodes.  Unfortunately, the adjusted boxplot rule does not perform very well here, placing its lower nominal data limit in about the middle of the shutdown data and its upper nominal data limit in about the middle of the normal operating data.  The likely cause of this behavior is the relatively large fraction of lower-tail outliers, which introduces a fairly strong negative skewness (the medcouple value for this example is -0.589).



The second example considered here is the industrial pressure data sequence shown in the above figure, in the same format as the previous figure.  This data sequence was discussed in Exploring Data (pp. 326-327) as a troublesome case because the two smallest values in this data sequence – near the right-hand end of the plots – appear to be downward outliers in a sequence with generally positive skewness (here, the medcouple value is 0.162).  As a consequence, neither the ESD identifier nor the Hampel identifier gives fully satisfactory performance, in both cases declaring only one of these points a downward outlier and arguably detecting too many upward outliers.  In fact, because the Hampel identifier is more aggressive here, it actually declares more upward outliers, making its performance worse for this example.  The right-hand plot in the above figure shows the outlier detection limits for the standard boxplot rule (lighter, dashed lines) and the adjusted boxplot rule (darker, dotted lines).  As in the previous example, the limits for the standard boxplot rule are almost the same as those for the Hampel identifier (the darker, dotted lines in the left-hand plot), but here the adjusted boxplot rule gives much better results, identifying both of the visually evident downward outliers and declaring far fewer points as upward outliers.

The primary point of this post has been to describe and demonstrate the outlier detection methods to be included in the FindOutliers procedure in the forthcoming ExploringData R package.  It should be clear from these results that, when it comes to outlier detection, “one size does not fit all”: the method matters, and choosing one is often best done by comparing the results obtained from several.  I have not included the code for the FindOutliers procedure here, but that will be the subject of my next post.

Sunday, February 6, 2011

Boxplots and Beyond – Part II: Asymmetry

In my last post, I discussed boxplots in their simplest forms, illustrating some of the useful options available with the boxplot command in the open-source statistical software package R.  As I noted in that post, the basic boxplot is both useful and popular, but it does have its limitations.  One of those limitations is that the standard boxplot outlier rule is not appropriate for highly asymmetric data.  Specifically, the standard rule declares points to be outliers if they lie a fixed distance – typically, 1.5 times the interquartile distance (IQD) – beyond the quartiles.  While this rule is appropriate for symmetric, approximately Gaussian data distributions, highly asymmetric situations call for an outlier detection rule that treats upward outliers and downward outliers differently.  For a strongly right-skewed distribution, for example, flagging any point more than 1.5 IQDs above the upper quartile may be too liberal, declaring too many points to be upward outliers, while flagging any point more than 1.5 IQDs below the lower quartile may be too conservative, declaring too few points to be downward outliers.  To handle these cases, it is necessary to incorporate some measure of skewness into the outlier detection rule.




One of the points I discuss in Chapter 7 of Exploring Data is that the standard skewness estimator – defined as a normalized third moment – is extremely sensitive to outliers, worse even than the standard deviation in this respect.  This point is illustrated in the above figure, which gives a standard boxplot summary comparing four different skewness estimates for two different cases.  The boxplots on the left present results for these skewness measures each applied to 1000 statistically independent standard Gaussian data samples, each of length n = 100, while the boxplots on the right present the corresponding results for exactly the same data sequences, but with each Gaussian data sample contaminated by a single additive outlier, lying 8 standard deviations above the mean.  The boxplots designated “Mo” in these comparisons correspond to the standard moment-based estimator, while the other boxplots correspond to three other skewness estimators, discussed in the next paragraph.  Two points are immediately evident from these results: first, the moment estimator exhibits much higher variability than the other three estimators considered here, and second, the moment estimator is extremely outlier-sensitive.  In particular, note that a single outlier in a dataset of size 100 causes the median of the “Mo” estimates to shift from about zero, corresponding to the correct nominal result, to a value large enough to suggest asymmetry comparable to the J-shaped exponential distribution (for which the skewness is 2), all on the basis of a single anomalous data point.
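This outlier sensitivity is easy to demonstrate directly.  The following sketch reproduces the flavor of the experiment summarized in the figure; the seed and the use of the simple, bias-uncorrected third-moment formula are my own arbitrary choices:

    # Standard moment-based skewness: the normalized third moment
    MomentSkew <- function(x) {
      z <- (x - mean(x)) / sd(x)
      mean(z^3)
    }

    # 1000 Gaussian samples of length 100, clean vs. one outlier at +8 sigma
    set.seed(101)  # arbitrary, for reproducibility
    clean <- replicate(1000, MomentSkew(rnorm(100)))
    dirty <- replicate(1000, MomentSkew(c(rnorm(99), 8)))
    median(clean)  # approximately zero, as it should be
    median(dirty)  # strongly positive, from a single anomalous point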

Two of the other three skewness estimators compared in these boxplots are Hotelling’s skewness measure, designated “Ho,” and Galton’s skewness measure, designated “Ga.”  Both of these alternatives to the standard moment-based skewness measure are discussed in Chapter 7 of Exploring Data, which presents comparisons similar to those given here for these three asymmetry measures.  Hotelling’s measure is equal to the difference between the mean and the median of the data, divided by the estimated standard deviation, and it can be shown to lie between -1 and +1.  The boxplot comparisons presented here show that this measure is much less outlier-sensitive than the standard skewness measure, although this estimate does exhibit a slight positive bias (slight is the operative word here: the mean value for the contaminated “Mo” estimates is 2.36, while that for the “Ho” estimates is 0.05).  Galton’s skewness measure is based on sample quartiles and is closely related to the fourth measure discussed next.  Specifically, both of these measures are based on the following function: let x and y be any two values from the data sample satisfying the condition that x is smaller than the sample median m, which is in turn smaller than y.  The function h(x,y) on which these measures are based is defined as the ratio of two computed values: the numerator is (y – m) – (m – x), and the denominator is y – x.  If we take x as the lower quartile and y as the upper quartile, we obtain Galton’s skewness measure, designated “Ga” in the boxplots.  The other skewness measure is the medcouple (“Mc” in the boxplots), defined as the median of all values of h(x,y) computed from points satisfying the condition that x < m < y.  Note that both the Galton and medcouple measures also lie between -1 and +1, since this is true for h(x,y) for any admissible values of x and y.  Detailed comparison of the “Ga” and “Mc” boxplots shows that both of these estimators exhibit small bias – i.e., their median values are both approximately zero, as they should be – and that the medcouple exhibits a slightly smaller range of variation than Galton’s skewness measure does.
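The function h(x,y) translates directly into R, giving simple implementations of both measures.  The brute-force computation over all pairs in the medcouple sketch below is for clarity only – the mc procedure in robustbase uses a much faster algorithm – and it ignores the special handling of points tied with the median that the full definition requires:

    # Galton's skewness measure: h(x, y) evaluated at the quartiles
    GaltonSkew <- function(x) {
      q <- unname(quantile(x, probs = c(0.25, 0.5, 0.75)))
      ((q[3] - q[2]) - (q[2] - q[1])) / (q[3] - q[1])
    }

    # Naive medcouple: median of h(x, y) over all pairs with x < m < y
    MedcoupleNaive <- function(x) {
      m <- median(x)
      lo <- x[x < m]
      hi <- x[x > m]
      h <- outer(hi, lo, function(y, xl) ((y - m) - (m - xl)) / (y - xl))
      median(h)
    }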

Procedures in R to compute the standard skewness measure, Hotelling’s measure, and Galton’s measure are available from the companion website for Exploring Data.  The medcouple is not discussed in the book, but it is available in the R add-on package robustbase, which also includes a procedure to construct the adjusted boxplot described next.  Rather than using 1.5 interquartile distances to set both the upper and lower outlier limits, as in the standard boxplot rule, the adjbox procedure in the robustbase package makes this distance depend on the medcouple value computed from the data.  Specifically, both outlier detection limits have the same general form, but different specific parameters: in both cases, the nominal distance of 1.5 IQDs is multiplied by a scale factor computed from the medcouple value MC.  For positive MC, the scale factor for lower outliers is exp(-4 MC), while that for upper outliers is exp(+3 MC) (these are the robustbase defaults; the original paper proposed the slightly different factors exp(-3.5 MC) and exp(+4 MC)).  A detailed discussion of this outlier detection rule and the rationale behind it is presented both in the paper, “An Adjusted Boxplot for Skewed Distributions,” by Ellen Vandervieren and Mia Hubert, published in the proceedings of the 2004 COMPSTAT symposium, and in the technical report cited in the R documentation for the adjbox procedure.



The above figure compares two sets of boxplots for the UScereal dataset discussed last time, from the MASS package in R.  Specifically, both sets of boxplots summarize the reported potassium content for breakfast cereals from 6 different manufacturers, each identified by a one-letter designation.  Both are generated using the log = “y” option, which gives a logarithmic vertical axis, and the varwidth option, which makes the width of each boxplot proportional to the square root of the number of records in each subsample.  The difference between these two sets of boxplots is that the one on the left uses the standard outlier identification algorithm, which is the default for the boxplot command in base R, while the boxplot series on the right uses the modified outlier detection rule just described, based on the medcouple skewness measure and available via the adjbox command in the robustbase add-on package.  The points declared outliers in each boxplot are shown as solid circles for emphasis (simply specify “pch = 16” in either boxplot command), and these points illustrate the difference between the two boxplot results.  Specifically, the left-hand group of boxplots shows only upward outliers, one for manufacturer G (General Mills) and two for manufacturer K (Kellogg’s), while the right-hand boxplots show only downward outliers, three for General Mills, one for Kellogg’s, and one for Quaker Oats (manufacturer Q).  The question remains which of these conclusions is more reasonable, and it will be revisited in subsequent posts using two additional boxplot extensions: violin plots and bean plots, both of which incorporate nonparametric density estimates to give a more detailed picture of the data.  For now, it is enough to note three things: first, an alternative to the standard boxplot is available that provides distinct detection rules for upper and lower outliers; second, this alternative approach often finds different outliers than the standard boxplot rule does; and third, there is evidence to suggest that this modified boxplot does indeed perform better in the face of significant distributional asymmetry.  For a more detailed discussion of this point, refer to the technical report cited above.
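For readers who want to reproduce this comparison, the following sketch generates both sets of boxplots; the side-by-side layout via par and the panel titles are my own choices, not taken from the original figures:

    # Standard vs. adjusted boxplots of potassium content by manufacturer
    library(MASS)        # provides the UScereal dataset
    library(robustbase)  # provides the adjbox procedure
    par(mfrow = c(1, 2))
    boxplot(potassium ~ mfr, data = UScereal, log = "y",
            varwidth = TRUE, pch = 16, main = "Standard boxplot rule")
    adjbox(potassium ~ mfr, data = UScereal, log = "y",
           varwidth = TRUE, pch = 16, main = "Adjusted boxplot rule")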

Finally, it is worth emphasizing that the robustbase package contains a lot more than just the adjbox procedure discussed here, including multivariate outlier detection procedures like adjOutlyingness, robust multivariate location and scale estimation procedures like covMcd, and robust fitting procedures like lmrob and glmrob for linear models and generalized linear models (specifically, a robust logistic regression procedure for binomial data and a robust Poisson regression procedure for count data), among others.  The biggest drawback of this package is that the documentation is incomplete, so it is important to check out the references cited there to get a clear idea of what some of these procedures do and how they do it.  That said, two points deserve emphasis.  First, if R were a commercially supported software package, this level of documentation would be inexcusable, but R is not a commercial product: it is the freely available end result of the loosely coordinated efforts of a very large group of unpaid volunteers.  Second, despite the limitations of the documentation, the robustbase package provides implementations of procedures that could, in principle, be built from their published descriptions by anyone who wanted to.  Doing so would be educational and intellectually rewarding, but it would also be a lot of work that most of us would simply never undertake, so we would not have any of the neat goodies the authors have made available to us.  In the end, while I would strongly encourage the authors to finish the documentation, I am, more importantly, sincerely grateful to them for releasing the package so that I can use it.