This blog is a companion to my recent book, Exploring Data in Engineering, the Sciences, and Medicine, published by Oxford University Press. The blog expands on topics discussed in the book, and the content is heavily example-based, making extensive use of the open-source statistical software package R.

Sunday, November 27, 2011

Cleaning time-series and other data streams

The need to analyze time-series or other forms of streaming data arises frequently in many different application areas.  Examples include economic time-series like stock prices, exchange rates, or unemployment figures; biomedical data sequences like electrocardiograms or electroencephalograms; and industrial process operating data sequences like temperatures, pressures, or concentrations.  As a specific example, the figure below shows four data sequences: the upper two plots represent hourly physical property measurements, one made at the inlet of a product storage tank (the left-hand plot) and the other made at the same time at the outlet of the tank (the right-hand plot).  The lower two plots in this figure show the results of applying the data cleaning filter outlierMAD from the R package pracma, discussed further below.  The two main points of this post are, first, that isolated spikes like those seen in the upper two plots at hour 291 can badly distort the results of an otherwise reasonable time-series characterization, and, second, that the simple moving window data cleaning filter described here is often very effective in removing these artifacts.

[Figure: the original inlet and outlet measurement sequences (top) and the corresponding outlierMAD-cleaned sequences (bottom)]

This example is discussed in more detail in Section 8.1.2 of my book Discrete-Time Dynamic Models, but the key observations here are the following.  First, the large spikes seen in both of the original data sequences were caused by the simultaneous, temporary loss of both measurements and the subsequent coding of these missing values as zero by the data collection system.  The practical question of interest was to determine how long, on average, the viscous, polymeric material being fed into and out of the product storage tank was spending there.  A standard method for addressing such questions is the use of cross-correlation analysis, where the expected result is a broad peak like the heavy dashed line in the plot shown below.  The location of this peak provides an estimate of the average time spent in the tank, which is approximately 21 hours in this case, as indicated in the plot.  This result was about what was expected, and it was obtained by applying standard cross-correlation analysis to the cleaned data sequences shown in the bottom two plots above.  The lighter solid curve in the plot below shows the results of applying exactly the same analysis, but to the original data sequences instead of the cleaned data sequences.  This dramatically different plot suggests that the material is spending very little time in the storage tank: accepted uncritically, this result would imply severe fouling of the tank, pointing to a need to shut the process down and clean it out, an expensive and labor-intensive proposition.  The main point of this example is that the difference in these two plots is entirely due to the extreme data anomalies present in the original time-series.  Additional examples of problems caused by time-series outliers are discussed in Section 4.3 of my book Mining Imperfect Data.
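
As a rough illustration of this type of analysis, here is a minimal sketch using simulated data rather than the actual plant data discussed above: the base R function ccf computes the cross-correlation function, and the location of its peak gives the lag estimate.  The 21-hour delay, the AR(1) signal model, and the variable names are all purely illustrative choices for this sketch:

# Illustrative only: simulated inlet/outlet sequences with a known 21-hour delay
set.seed(123)
n <- 500
lagHours <- 21                                   # "true" residence time for this simulation
x <- as.numeric(arima.sim(list(ar = 0.8), n = n + lagHours))
inlet <- x[(lagHours + 1):(n + lagHours)]        # inlet property measurement
outlet <- x[1:n] + rnorm(n, sd = 0.2)            # same signal delayed by lagHours, plus noise
cc <- ccf(inlet, outlet, lag.max = 40)           # plots and returns the cross-correlation function
cc$lag[which.max(cc$acf)]                        # peak near lag -21; its magnitude estimates the residence time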

[Figure: cross-correlation analysis results; the heavy dashed curve from the cleaned data shows a broad peak near 21 hours, while the lighter solid curve from the original data does not]

One of the primary features of the analysis of time-series and other streaming data sequences is the need for local data characterizations.  This point is illustrated in the plot below, which shows the first 200 observations of the storage tank inlet data sequence discussed above.  All but one of these observations are represented as open circles in this plot; the data point at k = 110 is shown as a solid circle to emphasize how far it lies from its immediate neighbors in the data sequence.  It is important to note that this point is not anomalous with respect to the overall range of this data sequence – it is, for example, well within the normal range of variation seen for the points from about k = 150 to k = 200 – but it is clearly anomalous with respect to those points that immediately precede and follow it.  A general strategy for automatically detecting and removing such spikes from a data sequence like this one is to apply a moving window data cleaning filter, which characterizes each data point with respect to a local neighborhood of prior and subsequent samples.  That is, for each data point k in the original data sequence, this type of filter forms a cleaned data estimate based on some number J of prior data values (i.e., points k-J through k-1 in the sequence) and, in the simplest implementations, the same number of subsequent data values (i.e., points k+1 through k+J in the sequence).

[Figure: the first 200 observations of the storage tank inlet sequence, with the local outlier at k = 110 shown as a solid circle]

The specific data cleaning filter considered here is the Hampel filter, which applies the Hampel identifier discussed in Chapter 7 of Exploring Data in Engineering, the Sciences, and Medicine to this moving data window.  If the kth data point is declared to be an outlier, it is replaced by the median value computed from this data window; otherwise, the data point is not modified.  The results of applying the Hampel filter with a window half-width of J = 5 to the above data sequence are shown in the plot below.  The effect is to modify three of the original data points – those at k = 43, 110, and 120 – and the original values of these modified points are shown as solid circles at the appropriate locations in this plot.  It is clear that the most pronounced effect of the Hampel filter is to remove the local outlier indicated in the above figure and replace it with a value that is much more representative of the other data points in the immediate vicinity.
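
Since the filter itself is simple, a bare-bones base R sketch may help make the procedure concrete.  The function below is not the pracma implementation discussed later in this post; it is just an illustration of the moving window Hampel filter described above, with the window half-width J and the outlier detection threshold t0 (in units of the MAD scale estimate) as its two tuning parameters:

# Bare-bones Hampel filter sketch, for illustration only (not the pracma code)
hampelFilter <- function(x, J = 5, t0 = 3) {
  y <- x                                         # cleaned copy of the input sequence
  outlierIndex <- integer(0)
  for (k in (J + 1):(length(x) - J)) {
    window <- x[(k - J):(k + J)]                 # J prior points, point k, and J subsequent points
    m <- median(window)
    s <- mad(window)                             # MAD scale estimate (scaled for Gaussian consistency)
    if (s > 0 && abs(x[k] - m) > t0 * s) {       # Hampel identifier applied to the moving window
      y[k] <- m                                  # replace the detected outlier with the window median
      outlierIndex <- c(outlierIndex, k)
    }
  }
  list(cleaned = y, outliers = outlierIndex)
}

# Quick check on a synthetic sequence with a single artificial spike
z <- sin(seq(0, 6 * pi, length.out = 200))
z[110] <- 5
hampelFilter(z)$outliers                         # the spike at k = 110 is flagged and replaced by the local median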

[Figure: the Hampel filter results, with the original values of the three modified points at k = 43, 110, and 120 shown as solid circles]

As I noted above, the Hampel filter implementation used here is that available in the R package pracma as procedure outlierMAD.  I will discuss this R package in more detail in my next post, but for those seeking a more detailed discussion of the Hampel filter in the meantime, one is freely available on-line in the form of an EDN article I wrote in 2002, Scrub data with scale-invariant nonlinear digital filters.  Also, comparisons with alternatives like the standard median filter (generally too aggressive, introducing unwanted distortion into the “cleaned” data sequence) and the center-weighted median filter (sometimes quite effective) are presented in Section 4.2 of the book Mining Imperfect Data mentioned above.

Friday, November 11, 2011

Harmonic means, reciprocals, and ratios of random variables

In my last few posts, I have considered “long-tailed” distributions whose probability density decays much more slowly than that of standard distributions like the Gaussian.  For these slowly-decaying distributions, the harmonic mean often turns out to be a much better (i.e., less variable) characterization than the arithmetic mean, which is generally not even well-defined theoretically for these distributions.  Since the harmonic mean is defined as the reciprocal of the mean of the reciprocal values, it is intimately related to the reciprocal transformation.  The main point of this post is to show how profoundly the reciprocal transformation can alter the character of a distribution, for better or worse.  One way that reciprocal transformations sneak into analysis results is through attempts to characterize ratios of random numbers.  The key issue underlying all of these ideas is the question of when the denominator variable in either a reciprocal transformation or a ratio exhibits non-negligible probability in a finite neighborhood of zero.  I discuss transformations in Chapter 12 of Exploring Data in Engineering, the Sciences, and Medicine, with a section (12.7) devoted to reciprocal transformations, showing what happens when we apply them to six different distributions: Gaussian, Laplace, Cauchy, beta, Pareto, and lognormal.

In the general case, if a random variable x has the density p(x), then the reciprocal y = 1/x has the density:

            g(y) = p(1/y) / y²

As I discuss in greater detail in Exploring Data, the consequence of this transformation is typically (though not always) to convert a well-behaved distribution into a very poorly behaved one.  As a specific example, the plot below shows the effect of the reciprocal transformation on a Gaussian random variable with mean 1 and standard deviation 2.  The most obvious characteristic of this transformed distribution is its strongly asymmetric, bimodal character, but another non-obvious consequence of the reciprocal transformation is that it takes a distribution that is completely characterized by its first two moments into a new distribution with Cauchy-like tails, for which none of the integer moments exist.
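
A quick way to see this behavior is by simulation.  The following sketch (illustrative only, not the code used to generate the plot below) draws a large Gaussian sample with mean 1 and standard deviation 2, applies the reciprocal transformation, and examines both the bimodal shape and the rapidly growing upper quantiles that reflect the heavy tails:

set.seed(42)
x <- rnorm(100000, mean = 1, sd = 2)
y <- 1/x                                         # reciprocally-transformed Gaussian sample
hist(y[abs(y) < 10], breaks = 200, freq = FALSE, # restricted to |y| < 10 for display only
     main = "Reciprocal of a Gaussian(1, 2) sample", xlab = "y = 1/x")
quantile(abs(y), c(0.50, 0.99, 0.999))           # extreme quantiles signal the Cauchy-like tails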

[Figure: density of the reciprocal transformation of a Gaussian random variable with mean 1 and standard deviation 2]

The implications of the reciprocal transformation for many other distributions are equally non-obvious.  For example, both the badly-behaved Cauchy distribution (no moments exist) and the well-behaved lognormal distribution (all moments exist, but interestingly, do not completely characterize the distribution, as I have discussed in a previous post) are invariant under the reciprocal transformation.  Also, applying the reciprocal transformation to the long-tailed Pareto type I distribution (which exhibits few or no finite moments, depending on its tail decay rate) yields a beta distribution, all of whose moments are finite.  Finally, it is worth noting that the invariance of the Cauchy distribution under the reciprocal transformation lies at the heart of the following result, presented in the book Continuous Univariate Distributions by Johnson, Kotz, and Balakrishnan (Volume 1, 2nd edition, Wiley, 1994, page 319).  They note that if the density of x is positive, continuous, and differentiable at x = 0 – all true for the Gaussian case – the distribution of the harmonic mean of N samples approaches a Cauchy limit as N becomes infinitely large.
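
The Cauchy invariance, in particular, is easy to check by simulation.  In the sketch below (again illustrative rather than definitive), the quantiles of a large standard Cauchy sample are plotted against the quantiles of its reciprocals; the points should scatter around the 45-degree reference line:

set.seed(42)
x <- rcauchy(100000)                             # standard Cauchy sample
qqplot(x, 1/x, xlim = c(-50, 50), ylim = c(-50, 50),
       xlab = "Cauchy sample quantiles", ylab = "Quantiles of the reciprocals")
abline(0, 1, lty = 2)                            # agreement with this line indicates the same distribution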

As noted above, the key issue responsible for the pathological behavior of the reciprocal transformation is the question of whether the original data distribution exhibits nonzero probability of taking on values within a neighborhood around zero.  In particular, note that if x can only assume values larger than some positive lower limit L, it follows that 1/x necessarily lies between 0 and 1/L, which is enough to guarantee that all moments of the transformed distribution exist.  For the Gaussian distribution, even if the mean is large enough and the standard deviation is small enough that the probability of observing values less than some limit L > 0 is negligible, the fact that this probability is not zero means that the moments of any reciprocally-transformed Gaussian distribution are not finite.  As a practical matter, however, reciprocal transformations and related characterizations – like harmonic means and ratios – do become better-behaved as the probability of observing values near zero becomes negligibly small.

To see this point, consider two reciprocally-transformed Gaussian examples.  The first is the one considered above: the reciprocal transformation of a Gaussian random variable with mean 1 and standard deviation 2.  In this case, the probability that x assumes values smaller than or equal to zero is non-negligible.  Specifically, this probability is simply the cumulative distribution function for the distribution evaluated at zero, easily computed in R as approximately 31%:

> pnorm(0,mean=1,sd=2)
[1] 0.3085375

In contrast, for a Gaussian random variable with mean 1 and standard deviation 0.1, the corresponding probability is negligibly small:

> pnorm(0,mean=1,sd=0.1)
[1] 7.619853e-24

If we consider the harmonic means of these two examples, we see that the first one is horribly behaved, as all of the results presented here would lead us to expect.  In fact, the qqPlot command in the car package in R allows us to compute quantile-quantile plots for the Student’s t-distribution with one degree of freedom, corresponding to the Cauchy distribution, yielding the plot shown below.  The Cauchy-like tail behavior expected from the results presented by Johnson, Kotz and Balakrishnan is seen clearly in this Cauchy Q-Q plot, constructed from 1000 harmonic means, each computed from statistically independent samples drawn from a Gaussian distribution with mean 1 and standard deviation 2.  The fact that almost all of the observations fall within the – very wide – 95% confidence interval around the reference line suggests that the Cauchy tail behavior is appropriate here.
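
This construction is easy to reproduce.  The sketch below is not the exact code used to generate the figure, and the choice of 100 Gaussian samples per harmonic mean is an arbitrary one (the qualitative behavior does not hinge on a particular sample size):

library(car)                                     # provides the qqPlot procedure
set.seed(42)
harmonicMean <- function(x) 1/mean(1/x)          # reciprocal of the mean of the reciprocals
hm1 <- replicate(1000, harmonicMean(rnorm(100, mean = 1, sd = 2)))
qqPlot(hm1, distribution = "t", df = 1)          # Student's t with 1 df = Cauchy reference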

[Figure: Cauchy Q-Q plot of 1000 harmonic means computed from Gaussian samples with mean 1 and standard deviation 2]

To further confirm this point, consider the corresponding normal Q-Q plot for the same sequence of harmonic means, shown below.  There, the extreme non-Gaussian character of these harmonic means is readily apparent from the pronounced outliers evident in both the upper and lower tails.
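
Continuing the sketch above, this normal Q-Q plot corresponds to the same qqPlot call with its default reference distribution:

qqPlot(hm1)                                      # default distribution = "norm" gives the normal Q-Q plot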

[Figure: normal Q-Q plot of the same sequence of harmonic means]

In marked contrast, for the second example with the mean of 1 as before but the much smaller standard deviation of 0.1, the harmonic mean is much better behaved, as the normal Q-Q plot below illustrates.  Specifically, this plot is identical in construction to the one above, except it was computed from samples drawn from the second data distribution.  Here, most of the computed harmonic mean values fall within the 95% confidence limits around the Gaussian reference line, suggesting that it is not unreasonable in practice to regard these values as approximately normally distributed, in spite of the pathologies of the reciprocal transformation.
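
In terms of the sketch above, the only change is the standard deviation of the underlying Gaussian samples:

hm2 <- replicate(1000, harmonicMean(rnorm(100, mean = 1, sd = 0.1)))
qqPlot(hm2)                                      # now approximately linear: roughly Gaussian behavior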

[Figure: normal Q-Q plot of the harmonic means for the second example, with mean 1 and standard deviation 0.1]

One reason the reciprocal transformation is important in practice – particularly in connection with the Gaussian distribution – is that the desire to characterize ratios of uncertain quantities does arise from time to time.  In particular, if we are interested in characterizing the ratio of two averages, the Central Limit Theorem would lead us to expect that, at least approximately, this ratio should behave like the ratio of two Gaussian random variables.  If these component averages are statistically independent, the expected value of the ratio can be re-written as the product of the expected value of the numerator average and the expected value of the reciprocal of the denominator average, leading us directly to the reciprocal Gaussian transformation discussed here.  In fact, if these two averages are both zero mean, it is a standard result that the ratio has a Cauchy distribution (this result is presented in the same discussion from Johnson, Kotz and Balakrishnan noted above).  As in the second harmonic mean example presented above, however, it turns out that if the mean and standard deviation of the denominator variable are such that the probability of a zero or negative denominator is negligible, the distribution of the ratio may be approximated reasonably well as Gaussian.  A very readable and detailed discussion of this fact is given in the paper by George Marsaglia in the May 2006 issue of the Journal of Statistical Software.
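
Both regimes are easy to demonstrate by simulation.  In the sketch below (illustrative parameter values only), the ratio of two independent zero-mean Gaussian variables is compared against a Cauchy reference, while a ratio whose denominator is far from zero is compared against a Gaussian reference:

library(car)
set.seed(42)
ratioCauchy <- rnorm(10000) / rnorm(10000)                 # both zero mean: exactly Cauchy
ratioSafe <- rnorm(10000, mean = 5, sd = 1) /
             rnorm(10000, mean = 10, sd = 0.5)             # denominator far from zero
qqPlot(ratioCauchy, distribution = "t", df = 1)            # consistent with the Cauchy reference
qqPlot(ratioSafe)                                          # approximately Gaussian, per Marsaglia's discussion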

Finally, it is important to note that the “reciprocally-transformed Gaussian distribution” I have been discussing here is not the same as the inverse Gaussian distribution, to which Johnson, Kotz and Balakrishnan devote a 39-page chapter (Chapter 15).  That distribution takes only positive values and exhibits moments of all orders, both positive and negative, and as a consequence, it has the interesting characteristic that it remains well-behaved under reciprocal transformations, in marked contrast to the Gaussian case.