Outliers, Page 1

**Outliers**

**Introduction**
**Outliers** are defined as *data points that are statistically inconsistent with the rest of the data.* We must be careful because some “questionable” data points end up being outliers, but others do not. *Questionable data points should never be discarded without proper statistical justification.* Here, we discuss statistical methods that help us to know whether to *keep* or *discard* suspected outliers. We discuss two types of outliers: (1) outliers in a sample of a *single variable* (*x*), and (2) outliers in a set of *data pairs* (*y* vs. *x*).

**Outliers in a sample of a single variable**
Consider a sample of *n* measurements of a *single* variable *x*, i.e., *x*₁, *x*₂, ..., *x*ₙ. It is helpful to arrange the *x* values in increasing order, so that the suspected outliers are easily spotted (typically either the first or last data point(s) are suspected, since they are the lowest and highest values of *x* in the sample, respectively).

The **modified Thompson tau technique** is a *statistical method for deciding whether to keep or discard suspected outliers in a sample of a single variable.* Here is the procedure:
o The sample mean *x̄* and the sample standard deviation *S* are calculated in the usual fashion.
o For each data point, the *absolute value of the deviation* is calculated as δᵢ = |*x*ᵢ − *x̄*|.
o The data point *most suspected* as a possible outlier is *the data point with the maximum value of* δᵢ.
o The value of *the modified Thompson* **τ** (Greek letter tau) is calculated from the critical value of the Student's *t* PDF, and is therefore a function of the number of data points *n* in the sample.
o τ is obtained from the expression

  τ = *t*α/2 (*n* − 1) / [ √*n* √(*n* − 2 + *t*α/2²) ],

  where
  *n* is the number of data points
  *t*α/2 is the critical Student's *t* value, based on α = 0.05 and df = *n* − 2 (note that here df = *n* − 2 instead of *n* − 1). In Excel, we calculate *t*α/2 as TINV(α, df), i.e., here *t*α/2 = TINV(α, *n* − 2).
o A table of the modified Thompson τ as a function of *n* is provided below (table not reproduced here; for *n* = 10, τ = 1.7984).
o We determine whether to reject or keep the suspected outlier, using the following simple rules:
  If δᵢ > τ*S*, *reject* the data point. It *is* an outlier.
  If δᵢ ≤ τ*S*, *keep* the data point. It is *not* an outlier.
o With the modified Thompson technique, **we consider only one suspected outlier at a time** – namely, the data point with the largest value of δᵢ. If that data point is rejected as an outlier, we remove it and start over. In other words, we calculate a *new* sample mean and a *new* sample standard deviation, and search for more outliers. This process is repeated *one at a time* until no more outliers are found.
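The procedure above can be sketched in Python. This is a minimal sketch, assuming scipy is available; `thompson_tau` and `max_deviation` are hypothetical helper names, and `stats.t.ppf(1 - alpha/2, n - 2)` plays the role of Excel's TINV(α, *n* − 2).

```python
from scipy import stats

def thompson_tau(n, alpha=0.05):
    """Modified Thompson tau for a sample of n points.

    Uses the critical Student's t value with df = n - 2
    (TINV(alpha, n-2) in Excel, t.ppf(1 - alpha/2, n-2) in scipy).
    """
    t = stats.t.ppf(1 - alpha / 2, n - 2)
    return t * (n - 1) / (n ** 0.5 * (n - 2 + t ** 2) ** 0.5)

def max_deviation(data):
    """Index and absolute deviation of the most suspect data point."""
    mean = sum(data) / len(data)
    devs = [abs(x - mean) for x in data]
    i = devs.index(max(devs))
    return i, devs[i]
```

For *n* = 10 this reproduces the tabulated value τ ≈ 1.7984 used in the example below.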

**Example**:
**Given:** Ten values of variable *x* are measured. The data have been arranged in increasing order for convenience: 48.9, 49.2, 49.2, 49.3, 49.3, 49.8, 49.9, 50.1, 50.2, and 50.5.
**To do:** Apply the modified Thompson tau test to see if any data points can be rejected as outliers.
**Solution:**
o The number of data points is counted: *n* = 10.
o You suspect the first (smallest *x*) and last (largest *x*) data points to be possible outliers.
o The sample mean and sample standard deviation are calculated: *x̄* = 49.64, and *S* = 0.52957.
o For the smallest (first) *x* value, the absolute value of the deviation is δ₁ = |*x*₁ − *x̄*| = |48.9 − 49.64| = 0.74.
  [We keep one extra digit here to avoid round-off error in the remaining calculations.]
o Similarly, for the largest (*n*th) *x* value, the absolute value of the deviation is δ₁₀ = |*x*₁₀ − *x̄*| = |50.5 − 49.64| = 0.86.
o Since the absolute value of the deviation of the largest data point is bigger than that of the smallest data point, the modified Thompson tau technique is applied *only* to data point number 10. (Point number 10 is the most suspect data point.)
o For *n* = 10, the value of the modified Thompson τ is 1.7984 (see the above table). We calculate τ*S* = (1.7984)(0.52957) = 0.95238.
o Since δ₁₀ = 0.86 < τ*S* = 0.95238, **there are no outlier points**. All ten data points must be kept.
**Discussion:** Since the tenth data point had the largest value of δ, there is no need to test any other data point – we are sure that there are no outliers in this sample.

If an outlier point is identified, it can be removed without guilt. (The engineer is not “fudging” the data.)
To identify any *further* suspected outliers, however, the mean and standard deviation must be re-calculated, and *n* must be decreased by 1, before re-applying the modified Thompson tau technique.
This process is continued as necessary until no further outlier points are found.
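The one-at-a-time rejection loop can be sketched as follows. This is a minimal sketch, assuming scipy is available; `thompson_tau` and `reject_outliers` are hypothetical helper names, and the sample is the ten-point data set of the example above.

```python
from scipy import stats

def thompson_tau(n, alpha=0.05):
    # critical Student's t with df = n - 2, as in the table above
    t = stats.t.ppf(1 - alpha / 2, n - 2)
    return t * (n - 1) / (n ** 0.5 * (n - 2 + t ** 2) ** 0.5)

def reject_outliers(data, alpha=0.05):
    """Apply the modified Thompson tau test one point at a time."""
    data = sorted(data)
    while len(data) > 2:
        n = len(data)
        mean = sum(data) / n
        s = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5
        devs = [abs(x - mean) for x in data]
        i = devs.index(max(devs))            # most suspect data point
        if devs[i] > thompson_tau(n, alpha) * s:
            data.pop(i)                      # reject it, then start over
        else:
            break                            # no more outliers
    return data

sample = [48.9, 49.2, 49.2, 49.3, 49.3, 49.8, 49.9, 50.1, 50.2, 50.5]
kept = reject_outliers(sample)               # all ten points survive
```

Note that the mean, standard deviation, and τ are all recomputed on each pass, exactly as the procedure requires.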

**Outliers in a set of data pairs**
Now consider a **set of *n* data pairs** (*y* vs. *x*), for which we have used regression analysis to find the best-fit line or curve through the data. The best-fit line or curve is denoted by *Y* as a function of *x*.
In regression analysis, we defined the **error** or **residual** *e*ᵢ for each data pair (*x*ᵢ, *y*ᵢ) as *the difference between the predicted or fitted value and the actual value*: *e*ᵢ = error or residual at data pair *i*, or *e*ᵢ = *Y*ᵢ − *y*ᵢ.
As mentioned in our discussion of regression analysis, a useful measure of overall error is called the **standard error of estimate** or simply the **standard error**, *S*y,x. In general, for regression analysis with *m* independent variables (and so also for a polynomial fit of order *m*), *S*y,x is defined by

  *S*y,x = √[ Σᵢ₌₁ⁿ (*y*ᵢ − *Y*ᵢ)² / df ],

where df = **degrees of freedom**, df = *n* − (*m* + 1). When we perform a regression analysis in Excel, *S*y,x is calculated for us as part of Excel's **Summary Output**. *S*y,x has the same units and dimensions as variable *y*.
To determine if there are any outliers, we calculate the *standardized residual* *e*ᵢ/*S*y,x for each data pair.
Since *S*y,x is a kind of standard deviation of the original data compared to the predicted least-squares fit values, we would expect that approximately 95% of the standardized residuals should lie between −2 and 2, in other words, within 2 standard deviations from the best-fit curve, assuming random errors.
So, to determine if a particular (*x*, *y*) data point is an outlier, we check if |*e*ᵢ/*S*y,x| > 2. If so, we suspect that data point to be an outlier. But wait! We are not finished yet.
Unfortunately, this is not the *only* test for outliers. There is another criterion for calling the data point an outlier – namely, *the standardized residual must be inconsistent with its neighbors*.
Bottom line: There are **two criteria** for a data pair to be called an outlier:
o |*e*ᵢ/*S*y,x| > 2
o The standardized residual is inconsistent with its neighbors when plotted as a function of *x*.
If *either* of these criteria fails, we *cannot* claim this data point to be an outlier.
The second criterion (inconsistency) is rather subjective, and is best illustrated by examples.
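The standard error of estimate and the first criterion can be sketched with numpy. This is a minimal sketch with hypothetical function names; the second criterion (inconsistency with neighbors) is judged from a plot, so it is not automated here.

```python
import numpy as np

def standardized_residuals(x, y, m=1):
    """Residuals e_i = Y_i - y_i of an m-th order polynomial fit,
    divided by the standard error of estimate S_yx (df = n - (m + 1))."""
    coeffs = np.polyfit(x, y, m)
    Y = np.polyval(coeffs, np.asarray(x))
    e = Y - np.asarray(y)
    df = len(x) - (m + 1)
    s_yx = np.sqrt(np.sum(e ** 2) / df)
    return e / s_yx

def first_criterion(x, y, m=1, threshold=2.0):
    """Indices of data pairs whose |e_i / S_yx| exceeds the threshold."""
    r = standardized_residuals(x, y, m)
    return [i for i, ri in enumerate(r) if abs(ri) > threshold]
```

For instance, a point sitting far above an otherwise straight-line trend produces a standardized residual beyond the ±2 band and is flagged by the first criterion.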

**Example**:
**Given:** Eight data pairs (*x*, *y*) are measured in a calibration experiment, as shown below (data table not reproduced here):
**To do:** Fit a straight line through these data, and determine if any of the data points should be thrown out as legitimate outliers.
**Solution:**
o We plot the data pairs and perform a linear (1st-order) regression analysis. Standard error = *S*y,x = 0.5833.
o By eye, we suspect that the third data point (at *x* = 30.01) might be an outlier. To test our suspicion, we calculate and plot the standardized residuals *e*ᵢ/*S*y,x. For comparison, we also plot the zero line.
o For the third data point, *e*ᵢ/*S*y,x = 2.064. Since |2.064| > 2, we meet the *first* criterion for an outlier.
o What about the *second* criterion? Here, the third data point is indeed *inconsistent* with its neighbors (it is not smooth and *does not follow any kind of pattern*, but is simply way lower than any of the others).
o Since *both* criteria are met, we say that **the third data point is definitely an outlier**, and we remove it.
**Discussion:** After removing the outlier, we should re-do the regression analysis, obtaining a better fit.

**Example**:
**Given:** Twelve data pairs (*x*, *y*) are measured in a calibration experiment, as shown below (data table not reproduced here):
**To do:** Fit a straight line through these data, and determine if any of the data points should be thrown out as legitimate outliers.
**Solution:**
o We plot the data pairs and perform a linear (1st-order) regression analysis. Standard error = *S*y,x = 4.385.
o By eye, we suspect that perhaps the last data point (*x* = 22.00) might be an outlier. To test our suspicion, we calculate and plot the standardized residuals *e*ᵢ/*S*y,x. For comparison, we also plot the zero line.
o For the last data point, *e*ᵢ/*S*y,x = 2.025. Since |2.025| > 2, we meet the *first* criterion for an outlier.

o The second criterion is a bit subjective, but the last data point *is* consistent with its neighbors (the data are *smooth* and follow a recognizable pattern). The second criterion is *not* met for this case.
o Since both criteria must be met and the second is not, we say that **the last data point is not an outlier**, and we cannot justify removing it.
**Discussion:** When the standardized residual plot is *not* random, but instead the data follow a pattern, as in this example, it often indicates that we should *use a different curve fit*.
o Here, we apply a second-order polynomial fit to the data, and re-plot the results. We plot the data pairs and perform a quadratic (2nd-order) regression analysis. Standard error = *S*y,x = 1.066 (much lower than the previous value of 4.385). *The quadratic fit is much better than the linear fit!*
o By eye, we see that the 2nd-order fit is indeed much better – the fitted curve passes through nearly all the data points.
o However, we still suspect that the last data point may be an outlier. Let's check.
o First criterion: Using the new value of *S*y,x, we calculate the standardized residual at the last data point to be *e*ᵢ/*S*y,x = 1.976. Since |1.976| < 2, we do *not* meet the *first* criterion for an outlier. **There are no outlier points for this quadratic fit**.
o Finally, we plot the standardized residuals, just for fun:
o There still appears to be somewhat of a pattern in the standardized residuals, but the last data point is inconsistent with the pattern of the other points. Thus, we would say that the second criterion *is* met. However, since the first criterion failed, we cannot legitimately claim that the last data point is an outlier.
o Since the standardized residuals still show somewhat of a pattern, we might want to try a different kind of curve fit. However, comparing the fitted curve to the data, the agreement is excellent, so further analysis is probably not necessary in this example.
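The model-selection point of this example can be sketched with *hypothetical* curved calibration data (the example's actual data table is not reproduced here): when the underlying relationship is quadratic, the linear fit's *S*y,x is much larger than the quadratic fit's, just as 4.385 dropped to 1.066 above.

```python
import numpy as np

def s_yx(x, y, m):
    """Standard error of estimate for an m-th order polynomial fit."""
    coeffs = np.polyfit(x, y, m)
    Y = np.polyval(coeffs, np.asarray(x))
    return np.sqrt(np.sum((np.asarray(y) - Y) ** 2) / (len(x) - (m + 1)))

# Hypothetical calibration data with quadratic behavior (NOT the example's data)
x = np.linspace(0.0, 22.0, 12)
y = 0.05 * x ** 2 + 1.2 * x + 3.0

s_linear = s_yx(x, y, 1)     # linear fit leaves a curved residual pattern
s_quadratic = s_yx(x, y, 2)  # quadratic fit captures the curvature
```

Comparing the two standard errors gives a quantitative version of the by-eye judgment made in the example.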
