Linear Models Lab: Residual Analysis of a Cats Dataset, Study notes of Linear Algebra

The process of analyzing the residuals of a linear regression model using the cats dataset from ETH Zurich. The analysis includes checks for normality, constant variance, independence, and influential observations. The document also discusses the use of robust smoothers and weighted regression.

Typology: Study notes

2021/2022

Uploaded on 07/05/2022

paul.kc
paul.kc 🇦🇺

4.7

(68)

1K documents

1 / 17

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Linear Models 4: Residual analysis Lab
Dr. Matteo Tanadini
Angewandte statistische Regression I, HS19 (ETHZ)
Contents
1 Cats Example 2
1.1 Normalityoftheresiduals...................................... 3
1.2 Expected value of the errors is zero . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Homoscedasticity........................................... 5
1.4 Independenceoftheerrors...................................... 6
1.5 Influentialobservations........................................ 6
1.6 Theplot.lm()function........................................ 9
1.7 Furtherplots(*) ........................................... 11
2 Appendix 13
2.1 Heteroscedasticity(**)........................................ 13
2.2 Tests to test assumptions (**) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Normalityofthedata ........................................ 14
2.4 Robust smoothers for residual plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Session Information 17
1
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff

Partial preview of the text

Download Linear Models Lab: Residual Analysis of a Cats Dataset and more Study notes Linear Algebra in PDF only on Docsity!

Linear Models 4: Residual analysis Lab

Dr. Matteo Tanadini Angewandte statistische Regression I, HS19 (ETHZ)

  • 1 Cats Example Contents
    • 1.1 Normality of the residuals
    • 1.2 Expected value of the errors is zero
    • 1.3 Homoscedasticity
    • 1.4 Independence of the errors
    • 1.5 Influential observations
    • 1.6 The plot.lm() function
    • 1.7 Further plots (*)
  • 2 Appendix
    • 2.1 Heteroscedasticity (**)
    • 2.2 Tests to test assumptions (**)
    • 2.3 Normality of the data
    • 2.4 Robust smoothers for residual plots
  • 3 Session Information

1 Cats Example

In this exercise we work with the “cats” data set. Let’s load the data. d.cats <- read.csv ("../../Data_sets/Cats.csv", header = TRUE)

head (d.cats)

Sex Bwt Hwt 1 F 2.0 7. 2 F 2.0 7. 3 F 2.0 9. 4 F 2.1 7. 5 F 2.1 7. 6 F 2.1 7. str (d.cats)

’data.frame’: 144 obs. of 3 variables: $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ... $ Bwt: num 2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2. ... $ Hwt: num 7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8. ...

We produce a graph to display the data. Note that we use smoothers to characterise the relationship between the response (i.e. heart weight) and the predictor body weight. library (ggplot2) ggplot (data = d.cats, mapping = aes (y = Hwt, x = Bwt)) + geom_point () + geom_smooth () + facet_wrap (. ~ Sex)

F M

2.0 2.5 3.0 3.5 2.0 2.5 3.0 3.

10

15

20

Bwt

Hwt

The effect of Bwt really seems to be linear for males. It is less clear whether assuming a linear effect for females is correct. Assuming a linear effect in both groups, we would expect the slopes to differ.

theoretical quantiles

−2 −1 0 1 2

st.sm.res_Hwt

0

2

4

This graph confirms that residuals seem to follow a normal distribution. Indeed, the observed observations (in black) lay within the simulated light blue lines.

Note that there are also some formal tests to test for normality. As explained in the appendix, they should not be used.

1.2 Expected value of the errors is zero

You may remember from the slides that we tested this assumption by plotting the residuals against the only predictor we had in our model. In this case there are two predictors (i.e. Bwt and Sex ). Usually, a model contains many predictors. Therefore, a quick and powerful solution is needed to test with only one graph whether the expected value of the errors is zero. The solution to this problem, proposed by two famous statisticians (Tukey and Anscombe), consists in plotting the residuals against the fitted values. The plot is nowadays referred to as the Tukey-Anscombe plot (or TA-plot for short). ggplot (mapping = aes (y = resid (lm.cats), x = fitted (lm.cats))) + geom_abline (intercept = 0, slope = 0) + geom_point () + geom_smooth ()

−2.

8 10 12 14 16 fitted(lm.cats)

resid(lm.cats)

In our case, the smoother is on zero. Therefore, we may conclude that the model equation is correct. If the effect of the Bwt predictor would have been clearly non-linear, then this plot should have given us a hint for that.

1.3 Homoscedasticity

We now want to verify that the “variability” of the residuals is constant. For this purpose, we plot the square-root transformed absolute value of the residuals against the fitted values. ggplot (mapping = aes (y = sqrt ( abs ( resid (lm.cats))), x = fitted (lm.cats))) + geom_abline (intercept = 0, slope = 0) + geom_point () + geom_smooth ()

8 10 12 14 16 fitted(lm.cats)

sqrt(abs(resid(lm.cats)))

geom_smooth (method = "lm", se = FALSE) + facet_wrap (. ~ Sex)

F M

2.0 2.5 3.0 3.5 2.0 2.5 3.0 3.

10

15

20

Bwt

Hwt

By looking carefully, we see that both females and males have an observation on their right hand-side that has a large positive residual.

To see whether they have a large influence on the estimated regression coefficients, we may want to use an influence measure: the cooks distance. This quantity measures the effect that omitting a given observation has on the regression coefficients. This higher its value, the larger the effect.

Below, we compute the Cook’s distance for all observations. The higher the value, the brighter the color of an observation (note the colour argument in the ggplot() function call).

1) compute Cook’s distance

cooks.dist.cats <- cooks.distance (lm.cats)

2) plot it

ggplot (data = d.cats, mapping = aes (y = Hwt, x = Bwt, colour = cooks.dist.cats)) + geom_point () + geom_smooth (method = "lm", se = FALSE) + facet_wrap (. ~ Sex)

F M

2.0 2.5 3.0 3.5 2.0 2.5 3.0 3.

10

15

20

Bwt

Hwt

cooks.dist.cats

As expected, the two observations at the very end of Bwt , which have a large residual, have also high leverage. The question that raises naturally here, is “when is the Cook’s distance too large? When shall I do something?”

The thumb rule is that observations with a value larger that 0.5 “should be inspected”. Observations that have a value larger than 1 are seriously influencing the estimated regression coefficients and “action should be undertaken”. What exactly should be done with these influential observation is case-specific:

  • If the observation turns out to be a mistake, and the correct value cannot be retrieved, the observation can safely be omitted (but this should be documented).
  • If the observation has a logical explanation with respect to another predictor not yet included in the model, the latter may be included in the model to account for that. For example, a very old cat could turn out to be could be an influential observation. Thus, the age of the cats could be included in the model.
  • If there is no explanation and the observation seems to be correct, robust methods could/should be used^1.
  • If the is no explanation and the observation seems to be correct, it could be removed. Note, however, that this should be documented and discussed when reporting the results. Furthermore, it is very often the case, that by removing an influential observation and refitting the model, the next influential observation pops up. the problem is a long-tailed distribution of the errors, and not a few “wrong” observations.
  • Note that in some cases, a missed non-linearity can also lead to influential observations. In theory, we would have noticed in the preliminary graphical analysis that a given predictor may have a non-linear effect. However, sometimes a non-linearity may be overlooked. One solution to spot non-linearities after fitting a model is to plot the residuals against the continuous predictors.

Note that in this simple case, where we have only two predictors in the model, we can display the information about the Cook’s distance graphically with the raw data. However, in reality we often have many predictors. Fortunately, there is a built-in R function that creates an adequate graph to inspect influential observations.

plot (lm.cats, which = 5)

(^1) See for example the function lmrob() in the {robustbase} package.

8 10 12 14

0

2

4

Fitted values

Residuals

lm(Hwt ~ Bwt * Sex)

Residuals vs Fitted

144

140

135

−2 −1 0 1 2

0

1

2

3

4

Theoretical Quantiles

Standardized residuals

lm(Hwt ~ Bwt * Sex)

Normal Q−Q

144

140

135

8 10 12 14

Fitted values

Standardized residuals

lm(Hwt ~ Bwt * Sex)

Scale−Location 144 140 135

0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.

0

1

2

3

4

Leverage

Standardized residuals

lm(Hwt ~ Bwt * Sex)

Cook's distance

Residuals vs Leverage

144

47

140

This function is very handy because it produces all four plots in one go. However, sometimes we may want to customise our plots or create additional ones. For example, here we may want to further test the linearity of the effect for Bwt in the females group.

1.7 Further plots (*)

It is good practice to plot the residuals against all predictors in a model.

We start with the Sex predictor.

ggplot (data = d.cats, mapping = aes (y = resid.lm.cats, x = Sex)) + geom_boxplot ()

with only two predictors, the Tukey-Anscombe plot is very similar to this plot.

Nevertheless, you may remember that there were indications that the effect of Bwt for females may be non-linear. Therefore, we may want to inspect this a bit further.

ggplot (data = d.cats, mapping = aes (y = resid.lm.cats, x = Bwt)) + geom_hline (yintercept = 0) + geom_point () + geom_smooth () + facet_wrap (. ~ Sex, scales = "free_x") ## beware of different x-axes

F M

2.0 2.2 2.5 2.8 3.0 2.0 2.5 3.0 3.

−2.

Bwt

resid.lm.cats

The effect of Bwt does not seem to be perfectly linear for females. However, this may be due to “natural randomness” of the data. If we want to further inspect this potential non-linearity, we could use Generalised Additive Models, GAMs for short, where no linearity assumptions is being made. Note that smoothers are indeed GAMs.

2 Appendix

2.1 Heteroscedasticity (**)

Let’s assume that by looking at the scale-location plot (absolute residuals versus fitted values) we note a problem of non-constant variance of the residuals (i.e. heteroscedasticity). There are three possible solutions to this issue:

  1. The response variable needs to be transformed such that the variance of the residuals is stabilised.
  2. The model must be updated such that a predictor indicates that residuals have different variances (i.e. weighted regression).
  3. An important predictor or interaction is missing from the model. The model must be updated. 1. Transforming the response variable

It is common practice to apply the following transformation to the response variable (these transformation are known as “first aid transformations”).

  • log(y) transformation for amounts. Amounts are variables that can only take positive values such as salary, weight or concentration.
  • square-root transformation for counts. Count data consists of positive integers (e.g. number of siblings). Note also that there is a wide range of extensions of the linear model that have been developed to deal with count data (e.g. Poisson model).
  • arc-sine-square-root transformation for percentages^3. 2. Weighted regression Sometimes, the variance of the residuals can happen to depend on one predictor only. For example, it may be very different between groups (say males and females). The variance of the residuals can also depend on a continuous predictor (e.g. older cats show more variable observations). In this case, weighted regression can be used. It consist in defining a weight for each observation that is inversely proportional to its variance. More variable observations are less reliable and have therefore a lower weight. A classical example where weighted regression is very helpful is when the observations themselves represent averages of other observations. Let’s assume that we are analysing the factors that influence body weight in the Swiss population. For privacy reasons, the data was aggregated at municipality level, by averaging the weights of all measured people in a given municipality. Obviously, the average for a very small municipality (say Vergeletto in Valle Onsernone) is based on a few measurements and is thus less precise than the average for Zurich which is based on way more observations. In this case, we include the “sample size” information in our model via the weights argument as lm(average.weight ~ ., weights = N) , where N indicates the number of observations taken to estimate the average in each municipality.

2.2 Tests to test assumptions (**)

There are two main reasons why tests should not be used to test assumptions about the errors of a linear model:

  1. Philosophical reason: these tests also rely on some assumptions which should also be tested, leading to an infinite circle of testing.
  2. Practical reason: the power of any test relies on the sample size^4. For a low sample size the test could be non-significant even though the residuals are, for example, clearly not normally distributed. On the other hand, a very small deviation from normality, may be detected in a very large sample size.

2.3 Normality of the data

A commonly encountered mistake is to check the normality of the response variable itself. To explain why this is incorrect, we simulate a very simple example. set.seed (7) x <- gl (n = 2, k = 50, labels = c ("A", "B")) errors <- rnorm (n = 100, sd = 2) y <- rep (x = c (8,20), each = 50) + errors plot (y ~ x)

(^3) In R this is achieved with asin(sqrt(y)) where y is a variable between 0 and 1. (^4) In statistics the power is defined as the probability to detect an effect when this is present. In more formal words, the probability of rejecting the null hypothesis when it is false.

−2 −1 0 1 2

0

2

4

Normal Q−Q Plot

Theoretical Quantiles

Sample Quantiles

2.4 Robust smoothers for residual plots

The plot.lm() function as well as the plregr() function use a robust smoother (i.e. a loess smoother with family set to symmetric). We can use the same condition when using the ggplot() function by specifying further arguments.

ggplot (data = d.cats, mapping = aes (y = resid.lm.cats, x = Bwt)) + geom_hline (yintercept = 0) + geom_point () + geom_smooth (method = "loess", method.args = list (family = "symmetric")) + ## robust smoother facet_wrap (. ~ Sex, scales = "free_x") ## beware of different x-axes

F M

2.0 2.2 2.5 2.8 3.0 2.0 2.5 3.0 3.

−2.

Bwt

resid.lm.cats

3 Session Information

sessionInfo ()

R version 3.5.3 (2019-03-11) Platform: x86_64-redhat-linux-gnu (64-bit) Running under: Fedora 30 (Workstation Edition)

Matrix products: default BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF- [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF- [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] plgraphics_1.0 ggplot2_3.0.0 knitr_1.

loaded via a namespace (and not attached): [1] Rcpp_0.12.18 nloptr_1.2.0 pillar_1.3.1 compiler_3.5. [5] plyr_1.8.4 bindr_0.1.1 tools_3.5.3 lme4_1.1-18- [9] digest_0.6.16 nlme_3.1-137 lattice_0.20-35 evaluate_0.10. [13] tibble_2.1.1 gtable_0.2.0 pkgconfig_2.0.2 rlang_0.3. [17] Matrix_1.2-15 yaml_2.1.19 bindrcpp_0.2.2 withr_2.1. [21] dplyr_0.7.6 stringr_1.3.1 rprojroot_1.3-2 grid_3.5. [25] tidyselect_0.2.4 glue_1.2.0 R6_2.2.2 survival_2.43- [29] rmarkdown_1.10 minqa_1.2.4 purrr_0.2.5 magrittr_1. [33] splines_3.5.3 backports_1.1.2 scales_1.0.0 htmltools_0.3. [37] MASS_7.3-51.1 assertthat_0.2.0 colorspace_1.3-2 labeling_0. [41] stringi_1.2.4 lazyeval_0.2.1 munsell_0.5.0 chron_2.3- [45] crayon_1.3.