









Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
The process of analyzing the residuals of a linear regression model using the cats dataset from ETH Zurich. The analysis includes checks for normality, constant variance, independence, and influential observations. The document also discusses the use of robust smoothers and weighted regression.
Typology: Study notes
1 / 17
This page cannot be seen from the preview
Don't miss anything!










Dr. Matteo Tanadini Angewandte statistische Regression I, HS19 (ETHZ)
1 Cats Example
In this exercise we work with the “cats” data set. Let’s load the data. d.cats <- read.csv ("../../Data_sets/Cats.csv", header = TRUE)
head (d.cats)
Sex Bwt Hwt 1 F 2.0 7. 2 F 2.0 7. 3 F 2.0 9. 4 F 2.1 7. 5 F 2.1 7. 6 F 2.1 7. str (d.cats)
’data.frame’: 144 obs. of 3 variables: $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ... $ Bwt: num 2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2. ... $ Hwt: num 7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8. ...
We produce a graph to display the data. Note that we use smoothers to characterise the relationship between the response (i.e. heart weight) and the predictor body weight. library (ggplot2) ggplot (data = d.cats, mapping = aes (y = Hwt, x = Bwt)) + geom_point () + geom_smooth () + facet_wrap (. ~ Sex)
F M
2.0 2.5 3.0 3.5 2.0 2.5 3.0 3.
10
15
20
Bwt
Hwt
The effect of Bwt really seems to be linear for males. It is less clear whether assuming a linear effect for females is correct. Assuming a linear effect in both groups, we would expect the slopes to differ.
theoretical quantiles
−2 −1 0 1 2
st.sm.res_Hwt
−
0
2
4
This graph confirms that residuals seem to follow a normal distribution. Indeed, the observed observations (in black) lay within the simulated light blue lines.
Note that there are also some formal tests to test for normality. As explained in the appendix, they should not be used.
You may remember from the slides that we tested this assumption by plotting the residuals against the only predictor we had in our model. In this case there are two predictors (i.e. Bwt and Sex ). Usually, a model contains many predictors. Therefore, a quick and powerful solution is needed to test with only one graph whether the expected value of the errors is zero. The solution to this problem, proposed by two famous statisticians (Tukey and Anscombe), consists in plotting the residuals against the fitted values. The plot is nowadays referred to as the Tukey-Anscombe plot (or TA-plot for short). ggplot (mapping = aes (y = resid (lm.cats), x = fitted (lm.cats))) + geom_abline (intercept = 0, slope = 0) + geom_point () + geom_smooth ()
−2.
8 10 12 14 16 fitted(lm.cats)
resid(lm.cats)
In our case, the smoother is on zero. Therefore, we may conclude that the model equation is correct. If the effect of the Bwt predictor would have been clearly non-linear, then this plot should have given us a hint for that.
We now want to verify that the “variability” of the residuals is constant. For this purpose, we plot the square-root transformed absolute value of the residuals against the fitted values. ggplot (mapping = aes (y = sqrt ( abs ( resid (lm.cats))), x = fitted (lm.cats))) + geom_abline (intercept = 0, slope = 0) + geom_point () + geom_smooth ()
8 10 12 14 16 fitted(lm.cats)
sqrt(abs(resid(lm.cats)))
geom_smooth (method = "lm", se = FALSE) + facet_wrap (. ~ Sex)
F M
2.0 2.5 3.0 3.5 2.0 2.5 3.0 3.
10
15
20
Bwt
Hwt
By looking carefully, we see that both females and males have an observation on their right hand-side that has a large positive residual.
To see whether they have a large influence on the estimated regression coefficients, we may want to use an influence measure: the cooks distance. This quantity measures the effect that omitting a given observation has on the regression coefficients. This higher its value, the larger the effect.
Below, we compute the Cook’s distance for all observations. The higher the value, the brighter the color of an observation (note the colour argument in the ggplot() function call).
cooks.dist.cats <- cooks.distance (lm.cats)
ggplot (data = d.cats, mapping = aes (y = Hwt, x = Bwt, colour = cooks.dist.cats)) + geom_point () + geom_smooth (method = "lm", se = FALSE) + facet_wrap (. ~ Sex)
F M
2.0 2.5 3.0 3.5 2.0 2.5 3.0 3.
10
15
20
Bwt
Hwt
cooks.dist.cats
As expected, the two observations at the very end of Bwt , which have a large residual, have also high leverage. The question that raises naturally here, is “when is the Cook’s distance too large? When shall I do something?”
The thumb rule is that observations with a value larger that 0.5 “should be inspected”. Observations that have a value larger than 1 are seriously influencing the estimated regression coefficients and “action should be undertaken”. What exactly should be done with these influential observation is case-specific:
Note that in this simple case, where we have only two predictors in the model, we can display the information about the Cook’s distance graphically with the raw data. However, in reality we often have many predictors. Fortunately, there is a built-in R function that creates an adequate graph to inspect influential observations.
plot (lm.cats, which = 5)
(^1) See for example the function lmrob() in the {robustbase} package.
8 10 12 14
−
−
0
2
4
Fitted values
Residuals
lm(Hwt ~ Bwt * Sex)
Residuals vs Fitted
144
140
135
−2 −1 0 1 2
−
−
0
1
2
3
4
Theoretical Quantiles
Standardized residuals
lm(Hwt ~ Bwt * Sex)
Normal Q−Q
144
140
135
8 10 12 14
Fitted values
Standardized residuals
lm(Hwt ~ Bwt * Sex)
Scale−Location 144 140 135
0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.
−
−
−
0
1
2
3
4
Leverage
Standardized residuals
lm(Hwt ~ Bwt * Sex)
Cook's distance
Residuals vs Leverage
144
47
140
This function is very handy because it produces all four plots in one go. However, sometimes we may want to customise our plots or create additional ones. For example, here we may want to further test the linearity of the effect for Bwt in the females group.
It is good practice to plot the residuals against all predictors in a model.
We start with the Sex predictor.
ggplot (data = d.cats, mapping = aes (y = resid.lm.cats, x = Sex)) + geom_boxplot ()
with only two predictors, the Tukey-Anscombe plot is very similar to this plot.
Nevertheless, you may remember that there were indications that the effect of Bwt for females may be non-linear. Therefore, we may want to inspect this a bit further.
ggplot (data = d.cats, mapping = aes (y = resid.lm.cats, x = Bwt)) + geom_hline (yintercept = 0) + geom_point () + geom_smooth () + facet_wrap (. ~ Sex, scales = "free_x") ## beware of different x-axes
F M
2.0 2.2 2.5 2.8 3.0 2.0 2.5 3.0 3.
−2.
Bwt
resid.lm.cats
The effect of Bwt does not seem to be perfectly linear for females. However, this may be due to “natural randomness” of the data. If we want to further inspect this potential non-linearity, we could use Generalised Additive Models, GAMs for short, where no linearity assumptions is being made. Note that smoothers are indeed GAMs.
2 Appendix
Let’s assume that by looking at the scale-location plot (absolute residuals versus fitted values) we note a problem of non-constant variance of the residuals (i.e. heteroscedasticity). There are three possible solutions to this issue:
It is common practice to apply the following transformation to the response variable (these transformation are known as “first aid transformations”).
There are two main reasons why tests should not be used to test assumptions about the errors of a linear model:
A commonly encountered mistake is to check the normality of the response variable itself. To explain why this is incorrect, we simulate a very simple example. set.seed (7) x <- gl (n = 2, k = 50, labels = c ("A", "B")) errors <- rnorm (n = 100, sd = 2) y <- rep (x = c (8,20), each = 50) + errors plot (y ~ x)
(^3) In R this is achieved with asin(sqrt(y)) where y is a variable between 0 and 1. (^4) In statistics the power is defined as the probability to detect an effect when this is present. In more formal words, the probability of rejecting the null hypothesis when it is false.
−2 −1 0 1 2
−
0
2
4
Normal Q−Q Plot
Theoretical Quantiles
Sample Quantiles
The plot.lm() function as well as the plregr() function use a robust smoother (i.e. a loess smoother with family set to symmetric). We can use the same condition when using the ggplot() function by specifying further arguments.
ggplot (data = d.cats, mapping = aes (y = resid.lm.cats, x = Bwt)) + geom_hline (yintercept = 0) + geom_point () + geom_smooth (method = "loess", method.args = list (family = "symmetric")) + ## robust smoother facet_wrap (. ~ Sex, scales = "free_x") ## beware of different x-axes
F M
2.0 2.2 2.5 2.8 3.0 2.0 2.5 3.0 3.
−2.
Bwt
resid.lm.cats
3 Session Information
sessionInfo ()
R version 3.5.3 (2019-03-11) Platform: x86_64-redhat-linux-gnu (64-bit) Running under: Fedora 30 (Workstation Edition)
Matrix products: default BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF- [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF- [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] plgraphics_1.0 ggplot2_3.0.0 knitr_1.
loaded via a namespace (and not attached): [1] Rcpp_0.12.18 nloptr_1.2.0 pillar_1.3.1 compiler_3.5. [5] plyr_1.8.4 bindr_0.1.1 tools_3.5.3 lme4_1.1-18- [9] digest_0.6.16 nlme_3.1-137 lattice_0.20-35 evaluate_0.10. [13] tibble_2.1.1 gtable_0.2.0 pkgconfig_2.0.2 rlang_0.3. [17] Matrix_1.2-15 yaml_2.1.19 bindrcpp_0.2.2 withr_2.1. [21] dplyr_0.7.6 stringr_1.3.1 rprojroot_1.3-2 grid_3.5. [25] tidyselect_0.2.4 glue_1.2.0 R6_2.2.2 survival_2.43- [29] rmarkdown_1.10 minqa_1.2.4 purrr_0.2.5 magrittr_1. [33] splines_3.5.3 backports_1.1.2 scales_1.0.0 htmltools_0.3. [37] MASS_7.3-51.1 assertthat_0.2.0 colorspace_1.3-2 labeling_0. [41] stringi_1.2.4 lazyeval_0.2.1 munsell_0.5.0 chron_2.3- [45] crayon_1.3.