







Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
The relationship between urbanization and coliform bacteria concentration using least-squares regression analysis. how to draw regression lines, minimize residuals, and interpret the results. The data is presented in Figure 2.1 (a) and (b), which show the arithmetic and logarithmic plots, respectively. The document also covers the concept of sums of squared residuals for a nested sequence of fitted models.
Typology: Study notes
1 / 13
This page cannot be seen from the preview
Don't miss anything!








eesc BC 3017 statistics notes 1
Systematic variation in the true value
Up to now, we have been thinking about measurement as sampling of values from an ensemble of all possible outcomes in order to estimate the true value (which would, according to our previous discussion, be well approximated by the mean of a very large sample). Given a sample of outcomes, we have sometimes checked the hypothesis that it is a random sample from some ensemble of outcomes, by plotting the data points against some other variable, such as ordinal position. Under the hypothesis of random selection, no clear trend should appear. However, the contrary case, where one finds a clear trend, is very important. A clear trend can be a discovery, rather than a nuisance! Whether it is a discovery or a nuisance (or both) depends on what one finds out about the reasons underlying the trend. In either case one must be prepared to deal with trends in analyzing data.
Figure 2.1 (a) shows a plot of (hypothetical) data in which there is a very clear trend. The y axis scales concentration of coliform bacteria sampled from rivers in various regions (units are colonies per liter). The x axis is a hypothetical index of regional urbanization, ranging from 1 to 10. The hypothetical data consist of 6 different measurements at each level of urbanization. The mean of each set of 6 measurements gives a rough estimate of the true value for coliform bacteria concentration for rivers in a region with that urbanization level. The jagged dark line drawn on the graph connects these estimates of true value and makes the trend quite clear: more extensive urbanization is associated with higher true values of bacteria concentration. The straight dashed line shows that much of this trend can be approximated by a linear relationship. The equation of the dashed line is y = 5. 3 + 8. 0 x, i.e., the true value of bacteria concentration is estimated to increase by about 8 colonies per liter for each unit increase in the urbanization index.
The linear relationship does not fit the data too well at the lowest and especially at the highest end of urbanization. A linear trend represents a useful approximation, but there may be better ways to represent the actual trend. One device that is often very useful is to replace concentration by its logarithm. Figure 2.1 (b) shows the analogous plot, using base-10 logarithms of bacterial concentration for the y-axis scale. Now the linear relationship fits very well, and indeed, one can view the bumps in the jagged dark line as plausibly due to noise. The equation of the dashed straight line in 2.1(b) is y = 1. 25 + 0. 071 x. Each increase of one unit urbanization adds 0.071 to the log concentration, i.e., multiplies the concentration itself by a factor of 10.071^ ≈ 1. 18. This trend can be summarized by saying that the true value of bacteria concentration is estimated to increase by about 18% for each unit increase in urbanization index.
eesc BC 3017 statistics notes 2
Fitting lines (or other models) by least-squares
The lines in Figure 2.1 (a,b) could be drawn freehand, or by eye, so as to come as close as possible to the various plotted points. A more systematic approach, introduced by the great mathematician Gauss around 1800, is to draw a line such that the sum of squared vertical deviations of the points from the line is as small as possible. This used to be called the method of "least squares." Nowadays, it is called "ordinary least squares," often abbreviated as OLS, because some situations turn out to be handled better by methods that also use a least-squares principle but in a more complex fashion. (These latter methods are beyond the scope of what we will do here.)
The vertical deviations from a fitted line or curve are called residuals. The least-squares criterion minimizes the sum of squared residuals; when this has been done, the (unsquared) residuals, which can be positive (points above the curve) or negative (below) always add up to 0.
Figure 2.2 demonstrates what is meant with a very small set of paired numbers (X , Y). Four points are plotted, and the least-squares line is drawn. One can see that its slope is over 2 (actually about 2.44). At the lower left point, there is a small positive residual, 0.36, at the next point, a very small negative residual, −0. 08, and the last two points have larger residuals, one negative and one positive. The sum of the four residuals is 0: this will always be the case, apart from small rounding errors. When one squares these four residuals, only the two large ones matter very much, giving a sum of squares of around 4. Note also that the point ’m’ whose coordinates are the means of X and Y falls exactly on the line. This will always be true: the least-squares line goes through the point defined by the empirical means, (X , Y).
One can imagine different lines drawn on this plot. Each line would produce a different set of residuals, but the sum of their squares would always be larger than for the line depicted.
The ordinary sample average is actually a least-squares fit: the sum of squared deviations of a set of observations from their average is smaller than the sum of squared deviations from any other value. For example, the average of 1 and 5 is 3; the sum of squared deviations of 1 and 5 from 3 is 2^2 + 22 = 8. The sum of squared deviations from a value different from the average would be larger than 8; for example, if one replaces 3 by 3.1, the sum is 2. 1^2 + 1. 9^2 = 8.02, slightly larger; etc.
As another example, consider Figure 2.2 again. The horizontal dashed line is drawn at the level of the Y mean, and so one can visualize the residuals about the mean value, as the vertical distances of the 4 points from this horizontal. One of these is actually much smaller than the residual from the least-squares line, but the other 3 are all much larger. Because this horizontal line goes through Y, it has a smaller sum of squared residuals than any other horizontal line, but a much larger sum of squared residuals than the sloping least-squared line. This sum of squared residuals about the horizontal line at the level of the mean will be discussed further below, in the section on Analysis of Variance tables.
eesc BC 3017 statistics notes 4
(iv) Finally, there is the peculiar word "regression," which has gone unexplained thus far. For now, it should be taken as an odd historical accident. Modern statisticians often speak of linear modelling, or nonlinear modelling, and avoid this odd word. A somewhat interesting explanation can be given for the term, but this is not the place for it.
In a situation where the trend is clearly curvilinear, so that a straight line fit is really not satisfactory, one can often get a better representation with a family of functions more complicated than straight lines. For example, the data in Figure 2.1 (a) are better fit by the quadratic polynomial,
y = 26. 2 − 2. 47 x + 0. 95 x^2
than by the linear equation y = 5. 3 + 8. 0 x. In fact, the total sum of squared deviations of the 60 points from the dashed line in Figure 2.1 (a) is a little over 5900, but the sum of squared residuals from the quadratic goes down to a little over 3000. We do not graph this, because we think that the logarithmic fit in Figure 2.1 (b) is in this instance a more appropriate way to handle the deviation from a straight line. However, quadratic (or cubic) fits do often come up in data analysis. The point we want to make is that the selection of a best-fitting curve is again done by least squares. This gives us three examples of least squares: a horizontal line at the level of Y minimizes the sum of squared residuals, among all possible horizontal lines; the regression line minimizes this sum among all possible lines, horizontal or otherwise; and the function shown above minimizes this sum among all possible functions of the form y = a + b x + c x^2.
eesc BC 3017 statistics notes 5
The output of regression functions
A fairly standard set of summary calculations is reported by regression or linear modelling functions^1 in most data-analysis packages; including the regression function in Excel. We give a list of the summary calculations that we consider most important, and then subsequently comment on each one in detail.
Just as before, one can form 95% confidence intervals for the true value of the slope or intercept by centering the confidence interval around the estimated value and taking ± the standard error multiplied by the appropriate t percentile.
To illustrate, the Excel regression output for the linear fit in Figure 2.1 (a) would look as follows:
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 5.27 2.81 1.87 0.066 -0.35 10.
Urbanization 8.00 0.45 17.65 0.000 7.10 8.
We focus here on the coefficients, standard errors, and the lower and upper limits for the 95%-confidence intervals for the coefficients. (The t statistic is just the ratio between the estimated coefficient and its estimated standard error; the P-values in tables like this can be very tricky to interpret and are better ignored entirely.
eesc BC 3017 statistics notes 7
(minimized) sum of squared residuals, dividing by N − k, and then taking the square root. For a linear trend, k = 2 (slope and intercept are estimated), so one divides by N − 2 before taking the square root.
There is a very sad fact about the residual standard deviation. We hate to tell you this, it is so sad. Most statistical software packages (including Excel) continue to mislabel it, even in the year 2002, calling it the "residual standard error" or sometimes just the "standard error" and thereby confusing a lot of people. How would a student feel if all her professors not only called her by a wrong name but actually misled a lot of her friends? In these notes, we are going to deal with such a sad fact by pretending henceforth that it just is not so; we use the correct term, "residual standard deviation." Perhaps by 2102 things will be labelled correctly.
Analysis of variance table. This table is built around the idea of sums of squared residuals for a nested sequence of fitted models. In the simplest case, there are only two models in the sequence. One, called the null model, assumes no effect of x on y. That is, in the equation y = a + b x, the slope b is zero, and so the model is just y = a. The least-squares value of the parameter a is just the sample mean y of the N values of y. In Figure 2.2, this null model is shown by the horizontal line. The sum of squared residuals from the null model is just the numerator of the usual estimate of the variance of N numbers. It is often called the total sum of squares. The second model in the sequence is the linear regression model, where the intercept a and the slope b are both obtained by least squares. The sum of squared residuals for this model is necessarily less than for the null model (since b = 0 and a = y is a possible set of parameter estimates, and the actual estimates are chosen so as to do at least slightly better than that). The difference between these two sums of squares is the sum of squares attributed to regression. The analysis of variance (ANOVA) table reports the these three quantities (two residual sums of squares and their difference) and the corresponding residual df (and their difference).
This is how the analysis of variance (ANOVA) table would look in Excel, for the fitted linear trend in Figure 2.1 (a).
df SS MS F Significance F Regression 1 31700 31700 311.4 0 Residual 58 5904 102 Total 59 37604
Pay attention only to the first two columns, labelled df and SS (sum of squares), where there is an entry in each row. The bottom row (Total) shows the df and sum of squared residuals for the null model, the next row up (Residual) shows the corresponding values for the linear regression model, and the top row (Regression) shows the differences in df and in sum of squares between the null and the regression models.
In this case, clearly, the linear regression model fits much better than the null model. That is what we already knew as soon as we looked at Figure 2.1 (a) and saw a clear upward trend that was well approximated by a simple straight line, over much of the
eesc BC 3017 statistics notes 8
range. The proportional reduction in residual sum of squares is 31700/37604, or 84.3%. And that, in turn is R^2 , exactly. (Earlier we rounded to 84%.) So now you know more exactly how R^2 measures the goodness-of-fit of the fitted model to the data: it looks at the reduction in squared residuals for y, by using the particular regression model, as compared with a null model that attributes all the variation to random error, rather than a systematic trend involving x. This definition of R^2 applies equally well to quadratic trends or to many other models one might try.
The last 3 columns of the analysis of variance table are best ignored for the present. They will not be needed. Just to make one interconnection: the F value (should be called "F Stat") is just the square of the t Stat for the Urbanization slope in the preceding table (17. 65^2 = 311. 5, which differs only through rounding error).
Examining Residuals. The final output from most regression functions is some sort of examination of the residuals. In the simple case of a sample of measurements, one checks for randomness by looking for an alternative: one plots the data against ordinal position or against some other paired variable to look for a trend. If one does not find any clearly systematic pattern, one continues to assume that observations can be viewed as a "random sample" from some ensemble. In regression, one has already found a trend; now one similarly looks for randomness in the residuals from that trend.
Figure 2.3 examines the residuals from the linear trends in Figure 2.1, by simply plotting the residuals against Urbanization. We already noted, in Figure 2.1 (a), that the linear trend was far from a perfect description. The corresponding residual plot, Figure 2.3 (a), dramatizes the departure. In addition to showing the systematic trend in the mean residuals, this plot also makes obvious something that was partly hidden in Figure 2.1 (a): there is also a trend in the standard deviation of the residuals, with small variability at the left changing to large variability at the right-hand side of the plot. This in turn means that the residual standard deviation is not a good representation of variability—it is an average, but does not do justice to either the left or the right side of the plot.
Figure 2.3 (b) does the same job for the linear trend in Figure 2.1 (b). Now one sees two reasons for strongly preferring the logarithmic model. The trend in mean residuals is abolished, and so is the trend in standard deviation of the residuals. The residuals in 2. (b) still do not look quite random, since there are clusters above and below zero at different Urbanization levels, but at least there are no trends, and so the linear fit is giving quite a good approximation to the data and the measurement errors look fairly homogeneous on the logarithmic scale.