Stat 512 Topic 1

PPT 1

Topic Overview

In this topic we will cover:
− Course overview & policies
− SAS
− KNNL Chapter 1 (much should be review) – simple linear regression
− KNNL Chapter 2 – inference in simple linear regression, prediction intervals and confidence bands, ANOVA tables, the general linear test

Class website
http://web.ics.purdue.edu/~simonsen/stat512/

Class policies
Refer to handout.

Overview

We will cover
- simple linear regression (SLR) (Chapters 1 – 5)
- multiple regression (MLR) (Chapters 6 – 11)
- analysis of variance (ANOVA) (Chapters 16 – 25)

The emphasis will be on applying selected practical tools using SAS rather than on mathematical manipulation. We want to understand the theory well enough to apply it appropriately. Some of the material on SLR will be review, but our goal with SLR is to be able to generalize the methods to MLR.

References
− Text: Applied Linear Statistical Models (4th ed.) by Neter, Kutner, Nachtsheim, and Wasserman, cited below as NKNW; the current edition, by Kutner, Nachtsheim, Neter, and Li, is cited as KNNL.
− SAS System for Regression by Freund and Littell.
− SAS/STAT User's Guide, Vol. 1 and 2.
− SAS System for Elementary Statistical Analysis by Schlotzhauer and Littell.
− SAS Help menus

SAS

Getting Help with SAS

Statistical Consulting Service
Math B5
Hours 10–4, M through F
http://www.stat.purdue.edu/consulting/

Evening computer labs
− Room?
− help with SAS and Excel for multiple Stat courses
− Hours 7–9, M through Th
− starting the second week of classes
− staffed with graduate student TAs

SAS

SAS is the program we will use to perform data analysis for this class. I will often give examples from SAS in class. The commands are meant to be run together as one program, but it will be easier to understand if I show each command followed by its output. The programs will be available for you to download from the website.

I will use the following font convention in the lecture notes:
SAS input will look like this: Courier New
SAS output will look like this: SAS Monospace
I will usually have to edit the output somewhat to get it to fit on the page of notes. My own comments will be in regular (Times/printout or Arial/lecture) fonts like this. Let me know if you get confused about what is input, output, or my comments.

You should run the SAS programs yourself to see the real output, and experiment with changing the commands to learn how they work. I will tell you the names of all SAS files I use in these notes. If the notes differ from the SAS file, take the SAS file to be correct, since there may be cut-and-paste errors.

There is a tutorial in SAS to help you get started: Help → Getting Started with SAS Software. You should spend some time before next week getting comfortable with SAS (see HW #0). For today, don't worry about the detailed syntax of the commands. Just try to get a sense of what is going on.

Example (Price Analysis for Diamond Rings in Singapore)

Variables
− response variable: price in Singapore dollars (Y)
− explanatory variable: weight of the diamond in carats (X)

Goals
− create a scatterplot of the data
− fit a regression line
− predict the sale price of a 0.43-carat diamond ring

SAS Data Step
− file diamond.sas on website

One way to input data in SAS is to type or paste it in. In this case, we have a sequence of ordered pairs (weight, price).
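The full diamond.sas is on the website; what follows is only a sketch of what its data step plausibly looks like, reconstructed from the output shown below. The data set name diamonds and the variables weight and price are taken from the PROC REG call that follows; only the first five of the 48 rings are typed in here, and the final case is the 0.43-carat ring with a missing price, so SAS will compute a prediction for it. The plotting statements are likewise an assumed sketch, not the actual course file.

* Sketch (not the actual course file) of a diamond.sas-style data step;
* the (weight, price) pairs are copied from the PROC REG output below;
data diamonds;
  input weight price @@;      * @@ lets several pairs share one line;
  datalines;
0.17 355  0.16 328  0.17 350  0.18 325  0.25 642
0.43 .
;
* scatterplot with a fitted regression line overlaid;
symbol1 v=circle i=rl;
proc gplot data=diamonds;
  plot price*weight;
run;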
proc reg data=diamonds;
  model price=weight/p r;
  output out=diag p=pred r=resid;
  id weight;
run;

                      Analysis of Variance

                              Sum of         Mean
Source              DF       Squares       Square    F Value    Pr > F
Model                1       2098596      2098596    2069.99    <.0001
Error               46         46636   1013.81886
Corrected Total     47       2145232

Root MSE           31.84052    R-Square    0.9783
Dependent Mean    500.08333    Adj R-Sq    0.9778
Coeff Var           6.36704

                      Parameter Estimates

                   Parameter     Standard
Variable    DF      Estimate        Error    t Value    Pr > |t|
Intercept    1    -259.62591     17.31886     -14.99      <.0001
weight       1    3721.02485     81.78588      45.50      <.0001

proc print data=diag;
run;

                         Output Statistics

            Dep Var   Predicted    Std Error                Std Error
Obs weight    price       Value Mean Predict    Residual     Residual
  1   0.17 355.0000    372.9483       5.3786    -17.9483       31.383
  2   0.16 328.0000    335.7381       5.8454     -7.7381       31.299
  3   0.17 350.0000    372.9483       5.3786    -22.9483       31.383
  4   0.18 325.0000    410.1586       5.0028    -85.1586       31.445
  5   0.25 642.0000    670.6303       5.9307    -28.6303       31.283
...
 46   0.15 287.0000    298.5278       6.3833    -11.5278       31.194
 47   0.26 693.0000    707.8406       6.4787    -14.8406       31.174
 48   0.15 316.0000    298.5278       6.3833     17.4722       31.194
 49   0.43        .        1340      19.0332           .            .

PPT 2

Simple Linear Regression

Why Use It?
− descriptive purposes (cause/effect relationships)
− control (often of cost)
− prediction of outcomes

Data for Simple Linear Regression
− We observe i = 1, 2, …, n pairs of variables (explanatory, response)
− Each pair is often called a case or a data point
− $Y_i$ = ith response variable
− $X_i$ = ith explanatory variable

Simple Linear Regression Model

For i = 1, 2, …, n:  $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

Simple Linear Regression Model Parameters
− $\beta_0$ is the intercept
− $\beta_1$ is the slope
− The $\varepsilon_i$ are independent, normally distributed random errors with mean 0 and variance $\sigma^2$, i.e. $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$

Features of Simple Linear Regression Model
− Individual observations: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
− Since the $\varepsilon_i$ are random, the $Y_i$ are also random, and
  − $E(Y_i) = \beta_0 + \beta_1 X_i + E(\varepsilon_i) = \beta_0 + \beta_1 X_i$
  − $Var(Y_i) = 0 + Var(\varepsilon_i) = \sigma^2$
− Since $\varepsilon_i$ is normally distributed, $Y_i \sim N(\beta_0 + \beta_1 X_i,\ \sigma^2)$ (see A.36, p. 1319)

Fitted Regression Equation and Residuals

We must estimate the parameters $\beta_0, \beta_1, \sigma^2$ from the data. The estimates are denoted $b_0, b_1, s^2$. These give us the fitted or estimated regression line $\hat Y_i = b_0 + b_1 X_i$, where
− $b_0$ is the estimated intercept
− $b_1$ is the estimated slope
− $\hat Y_i$ is the estimated mean for Y when the predictor is $X_i$ (i.e. the point on the fitted line)
− $e_i$ is the residual for the ith case (the vertical distance from the data point to the fitted regression line). Note that $e_i = Y_i - \hat Y_i = Y_i - (b_0 + b_1 X_i)$.

Using SAS to plot the residuals (Diamond Example)

When we called PROC REG earlier, we assigned the residuals to the name "resid" and placed them in a new data set called "diag". We now plot them vs. X.

symbol1 v=circle i=NONE;
title2 color=blue 'Residual Plot';
axis2 label=(angle=90 'Residual');
proc gplot data=diag;
  plot resid*weight / haxis=axis1 vaxis=axis2 vref=0;
  where price ne .;
run;

Notice that there does not appear to be any obvious pattern in the residuals. We'll talk a lot more about diagnostics later, but for now you should know that looking at residual plots is an important way to check assumptions.
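Before moving on, here is a quick arithmetic check on the PROC REG output above, using the parameter estimates and observation 1:

$\hat Y_1 = b_0 + b_1 X_1 = -259.62591 + 3721.02485(0.17) = 372.9483$
$e_1 = Y_1 - \hat Y_1 = 355 - 372.9483 = -17.9483$

in agreement with the Predicted Value and Residual columns. The same line answers our prediction goal: for the 0.43-carat ring, $\hat Y = -259.62591 + 3721.02485(0.43) \approx 1340.4$, which is the Predicted Value SAS reports for observation 49.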
Least Squares
− We want to find the "best" estimates $b_0$ and $b_1$.
− These will minimize the sum of the squared residuals:
  $\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(Y_i - (b_0 + b_1 X_i)\right)^2$

Confidence Intervals

We are 100(1−α)% confident that the following interval contains μ:

$W \pm t_c\, s\{W\} = \left[\,W - t_c\, s\{W\},\ W + t_c\, s\{W\}\,\right]$

where $t_c = t(1-\alpha/2;\ n-1)$, the upper $100(1-\alpha/2)$ percentile of the t distribution with n − 1 degrees of freedom, and 1 − α is the confidence level (e.g. 0.95 = 95%, so α = 0.05).

Significance tests

To test whether μ has a specific value, we use a t-test (one-sample, non-directional).

$H_0: \mu = \mu_0$ vs $H_a: \mu \neq \mu_0$

$t = \dfrac{W - \mu_0}{s\{W\}}$ has a $t(n-1)$ distribution under $H_0$.

Reject $H_0$ if $|t| \geq t_c$, where $t_c = t(1-\alpha/2;\ n-1)$.

P-value = $\text{Prob}_{H_0}(|T| > |t|)$, where $T \sim t(n-1)$.

The p-value is twice the area in the upper tail of the t(n−1) distribution above the observed |t|. It is the probability of observing a test statistic at least as extreme as what was actually observed, when the null hypothesis is really true. We reject H0 if P ≤ α.

Important notational comment

The text says "conclude HA" if t is in the rejection region (|t| ≥ tc), and otherwise "conclude H0". This is shorthand:
− "conclude HA" means "there is sufficient evidence in the data to conclude that H0 is false, and so we assume that HA is true"
− "conclude H0" means "there is insufficient evidence in the data to conclude that either H0 or HA is true or false, so we default to assuming that H0 is true"
Notice that a failure to reject H0 does not mean that there was any evidence in favour of H0.

NOTE: In this course α = 0.05 unless otherwise specified.

2.1: β1 Inference

$b_1 \sim N(\beta_1,\ \sigma^2\{b_1\})$, where $\sigma^2\{b_1\} = \dfrac{\sigma^2}{SS_X}$ and $SS_X = \sum (X_i - \bar X)^2$

$t = \dfrac{b_1 - \beta_1}{s\{b_1\}}$, where $s^2\{b_1\} = \dfrac{s^2}{SS_X}$

$t \sim t(n-2)$; in particular, $\dfrac{b_1}{s\{b_1\}} \sim t(n-2)$ if $\beta_1 = 0$.

According to our discussion above for "W", you therefore know how to obtain CIs and t-tests for β1. (I'll go through it now but not in the future.) There is one important difference: the df here are n − 2, not n − 1, because we are also estimating β0.
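To make this concrete with the diamond data (n = 48, so df = 46; the estimates come from the PROC REG output earlier, and the confidence limits appear in the clb output later in this topic):

$t = \dfrac{b_1 - 0}{s\{b_1\}} = \dfrac{3721.02485}{81.78588} = 45.50$

$b_1 \pm t(0.975;\ 46)\, s\{b_1\} = 3721.025 \pm 2.013(81.786) = (3556.4,\ 3885.7)$

These match the t Value and the 95% confidence limits that SAS reports.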
Confidence Interval for β1
− $b_1 \pm t_c\, s\{b_1\}$, where $t_c = t(1-\alpha/2;\ n-2)$, the upper $100(1-\alpha/2)$ percentile of the t distribution with n − 2 degrees of freedom
− 1 − α is the confidence level

Significance tests for β1

$H_0: \beta_1 = 0$ vs $H_a: \beta_1 \neq 0$
$t = \dfrac{b_1 - 0}{s\{b_1\}}$
Reject $H_0$ if $|t| \geq t_c$, where $t_c = t(1-\alpha/2;\ n-2)$
$P = \text{Prob}(|T| > |t|)$, where $T \sim t(n-2)$

2.2: β0 Inference

$b_0 \sim N(\beta_0,\ \sigma^2\{b_0\})$, where $\sigma^2\{b_0\} = \sigma^2\left[\dfrac{1}{n} + \dfrac{\bar X^2}{SS_X}\right]$

$t = \dfrac{b_0 - \beta_0}{s\{b_0\}} \sim t(n-2)$

For $s^2\{b_0\}$, replace $\sigma^2$ by $s^2$ (and take the square root for $s\{b_0\}$): $s^2\{b_0\} = s^2\left[\dfrac{1}{n} + \dfrac{\bar X^2}{SS_X}\right]$

Confidence Interval for β0
− $b_0 \pm t_c\, s\{b_0\}$, where $t_c = t(1-\alpha/2;\ n-2)$ and 1 − α is the confidence level

Significance tests for β0

$H_0: \beta_0 = 0$ vs $H_A: \beta_0 \neq 0$
$t = \dfrac{b_0 - 0}{s\{b_0\}}$
Reject $H_0$ if $|t| \geq t_c$, where $t_c = t(1-\alpha/2;\ n-2)$
$P = \text{Prob}(|T| > |t|)$, where $T \sim t(n-2)$

Notes
− The normality of b0 and b1 follows from the fact that each is a linear combination of the Yi, each of which is an independent normal random variable.
− For b1, see NKNW p. 46.
− For b0, try this as an exercise.
− Usually the CI and significance test for β0 are not of interest.
− If the εi are not normal but are approximately normal, then the CIs and significance tests are generally reasonable approximations.
− These procedures can easily be modified to produce one-sided confidence intervals and significance tests.
− Because $\sigma^2\{b_1\} = \dfrac{\sigma^2}{\sum (X_i - \bar X)^2}$, we can make this quantity small by making $\sum (X_i - \bar X)^2$ large, i.e. by spreading out the Xi's.

SAS Proc Reg

Here is how to get the parameter estimates in SAS (still using diamond.sas). The option "clb" asks SAS to give you confidence limits for the parameter estimates b0 and b1.

proc reg data=diamonds;
  model price=weight/clb;

                      Parameter Estimates

                  Parameter    Standard
Variable    DF     Estimate       Error   t Value   Pr > |t|       95% Confidence Limits
Intercept    1   -259.62591    17.31886    -14.99     <.0001   -294.48696    -224.76486
weight       1   3721.02485    81.78588     45.50     <.0001   3556.39841    3885.65129

Points to remember
− What is the default value of α that we use in this class?
− What is the default confidence level that we use in this class?
− Suppose you could choose the X's. How would you choose them if you wanted a precise estimate of the slope? Of the intercept? Of both?

PPT 3

output;
end;
proc print data=a2;
run;
title1 'Power for the slope in simple linear regression';
symbol1 v=none i=join;
proc gplot data=a2;
  plot power*beta1;
run;

[Figure: power plotted against beta1, for beta1 from −2 to 2]

2.4: Estimation of E(Yh)
− $E(Y_h) = \mu_h = \beta_0 + \beta_1 X_h$, the mean value of Y for the subpopulation with X = Xh
− We will estimate E(Yh) by $\hat Y_h = \hat\mu_h = b_0 + b_1 X_h$
− KNNL use $\hat Y_h$ to denote this estimate; we will use the symbols $\hat Y_h$ and $\hat\mu_h$ interchangeably
− See equation (2.28) on p. 57

Theory for Estimation of E(Yh)

$\hat Y_h$ is normal with mean $\mu_h$ and variance

$\sigma^2\{\hat Y_h\} = \sigma^2\left[\dfrac{1}{n} + \dfrac{(X_h - \bar X)^2}{\sum (X_i - \bar X)^2}\right]$

− The normality is a consequence of the fact that $b_0 + b_1 X_h$ is a linear combination of the Yi's.
− The variance has two components, one for the intercept and one for the slope. The variance associated with the slope depends on the distance $X_h - \bar X$; the estimation is more accurate near $\bar X$.
− See NKNW pp. 56–59 for details.

Application of the Theory

We estimate $\sigma^2\{\hat Y_h\}$ by

$s^2\{\hat Y_h\} = s^2\left[\dfrac{1}{n} + \dfrac{(X_h - \bar X)^2}{\sum (X_i - \bar X)^2}\right]$

It follows that $t = \dfrac{\hat Y_h - E(Y_h)}{s\{\hat Y_h\}} \sim t(n-2)$; proceed as usual.
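As a numeric preview of the interval defined next, using the Chapter 1 example data read by nknw060.sas below (n = 25, hence df = 23): at Xh = 65, SAS reports $\hat Y_h = 294.429$ and $s\{\hat Y_h\} = 9.9176$, so

$\hat Y_h \pm t(0.975;\ 23)\, s\{\hat Y_h\} = 294.429 \pm 2.069(9.9176) = (273.9,\ 314.9)$

which matches the 95% CL Mean columns in the output that follows.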
95% Confidence Interval for E(Yh)

$\hat Y_h \pm t_c\, s\{\hat Y_h\}$, where $t_c = t(0.975;\ n-2)$

NOTE: significance tests can be performed for $\hat Y_h$, but they are rarely used in practice.

Example

See program nknw060.sas for the estimation of subpopulation means. The option "clm" on the model statement asks for confidence limits for the mean response.

data a1;
  infile 'H:\Stat512\Datasets\Ch01ta01.dat';
  input size hours;
data a2;
  size=65; output;
  size=100; output;
data a3;
  set a1 a2;
proc print data=a3;
run;
proc reg data=a3;
  model hours=size/clm;
  id size;
run;

            Dep Var   Predicted    Std Error
Obs  size     hours       Value Mean Predict         95% CL Mean
 25    70     323.0    312.2800       9.7647    292.0803   332.4797
 26    65         .    294.4290       9.9176    273.9129   314.9451
 27   100         .    419.3861      14.2723    389.8615   448.9106

PPT 4

2.5: Prediction of Yh(new)

We wish to construct an interval into which we predict the next observation (for a given Xh) will fall.
− The only difference between this and $E(Y_h)$ is that the variance is different.
− In prediction, we have two variance components: (1) variance associated with the estimation of the mean response $\hat Y_h$, and (2) variability in a single observation taken from the distribution with that mean.
− $Y_{h(new)} = \beta_0 + \beta_1 X_h + \varepsilon$ is the value for a new observation with X = Xh.

We estimate Yh(new) starting with the predicted value $\hat Y_h$. This is the centre of the interval, just as it was for E(Yh). However, the width of the interval is different because the variances are different.

$Var(Y_{h(new)}) = Var(\hat Y_h) + Var(\varepsilon)$, so $s^2\{pred\} = s^2\{\hat Y_h\} + s^2$

$s^2\{pred\} = s^2\left[1 + \dfrac{1}{n} + \dfrac{(X_h - \bar X)^2}{\sum (X_i - \bar X)^2}\right]$

$\dfrac{Y_{h(new)} - \hat Y_h}{s\{pred\}} \sim t(n-2)$

s{pred} denotes the estimated standard deviation of a new observation with X = Xh. It takes into account variability in estimating the mean $\hat Y_h$ as well as variability in a single observation from the distribution with that mean.

Notes

The procedure can be modified for the mean of m observations at X = Xh (see 2.39a on page 66).

The standard error is affected by how far Xh is from $\bar X$ (see Figure 2.3). As was the case for the mean response, prediction is more accurate near $\bar X$.

See program nknw065.sas for the prediction interval example. The "cli" option on the model statement asks SAS to give confidence limits for an individual observation (cf. clb and clm).

data a1;
  infile 'H:\Stat512\Datasets\Ch01ta01.dat';
  input size hours;
data a2;
  size=65; output;
  size=100; output;
data a3;
  set a1 a2;
proc reg data=a3;
  model hours=size/cli;
run;

            Dep Var   Predicted    Std Error
Obs  size     hours       Value Mean Predict        95% CL Predict    Residual
 25    70  323.0000    312.2800       9.7647    209.2811   415.2789     10.7200
 26    65         .    294.4290       9.9176    191.3676   397.4904           .
 27   100         .    419.3861      14.2723    314.1604   524.6117           .

Notes

The standard error (Std Error Mean Predict) given in this output is the standard error of $\hat Y_h$, not s{pred}. (That's why the word "Mean" is in there.) The CL Predict label tells you that the confidence interval is for the prediction of a new observation. The prediction interval for $Y_{h(new)}$ is wider than the confidence interval for $\hat Y_h$ because it has a larger variance.
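A numeric check of s{pred}, using observation 25 above together with s² = MSE = 2383.71562 from the ANOVA output at the end of this topic (n = 25, df = 23):

$s\{pred\} = \sqrt{s^2 + s^2\{\hat Y_h\}} = \sqrt{2383.716 + (9.7647)^2} = 49.79$

$\hat Y_h \pm t(0.975;\ 23)\, s\{pred\} = 312.28 \pm 2.069(49.79) = (209.3,\ 415.3)$

which reproduces the 95% CL Predict limits for size = 70.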
Prediction intervals:

symbol1 v=circle i=rlcli95;
proc gplot data=a1;
  plot hours*size;
run;

2.7: Analysis of Variance (ANOVA) Table

Organize the results arithmetically. The total sum of squares in Y is

$SS_Y = \sum (Y_i - \bar Y)^2$

Partition this into two sources:
- Model (explained by regression)
- Error (unexplained / residual)

$Y_i - \bar Y = (\hat Y_i - \bar Y) + (Y_i - \hat Y_i)$
$\sum (Y_i - \bar Y)^2 = \sum (\hat Y_i - \bar Y)^2 + \sum (Y_i - \hat Y_i)^2$
(the cross terms cancel: see p. 72)

Total Sum of Squares
− Consider ignoring X when predicting Y. Then the best predictor would be the sample mean $\bar Y$.
− SST is the sum of squared deviations from this predictor: $SST = SS_Y = \sum (Y_i - \bar Y)^2$.
− The total degrees of freedom is dfT = n − 1.
− MST = SST/dfT.
− MST is the usual estimate of the variance of Y if there are no explanatory variables, also known as $s^2\{Y\}$.
− SAS uses the term Corrected Total for this source. The "uncorrected" total is $\sum Y_i^2$. The term "corrected" means that we subtract off the mean $\bar Y$ before squaring.

Model Sum of Squares
− $SSM = \sum (\hat Y_i - \bar Y)^2$
− The model degrees of freedom is $df_M = 1$, since one parameter (the slope) is estimated.
− MSM = SSM/dfM.
− KNNL use the word "regression" for what SAS calls "model", so SSR (KNNL) is the same as SS Model (SAS). I prefer the terms SSM and dfM because R stands for regression, residual, and reduced (later), which I find confusing.

Error Sum of Squares
− $SSE = \sum (Y_i - \hat Y_i)^2$
− The error degrees of freedom is dfE = n − 2, since estimates have been made for both slope and intercept.
− MSE = SSE/dfE.
− MSE = s² is an estimate of the variance of Y taking into account (or conditioning on) the explanatory variable(s).

ANOVA Table for SLR

Source               df     SS                              MS
Model (Regression)   1      $\sum (\hat Y_i - \bar Y)^2$    SSM/dfM
Error                n−2    $\sum (Y_i - \hat Y_i)^2$       SSE/dfE
Total                n−1    $\sum (Y_i - \bar Y)^2$         SST/dfT

Note about degrees of freedom

Occasionally you will run across a reference to "degrees of freedom" without a specification of whether it is model, error, or total. Sometimes it will be clear from context; although that is sloppy usage, you can generally assume that if it is not specified, it means error degrees of freedom.

Expected Mean Squares
− MSM and MSE are random variables.
− $E(MSM) = \sigma^2 + \beta_1^2 SS_X$
− $E(MSE) = \sigma^2$
− When $H_0: \beta_1 = 0$ is true, E(MSM) = E(MSE). This makes sense, since in that case $\hat Y_i = \bar Y$.

F test
− $F = MSM/MSE \sim F(df_M, df_E) = F(1, n-2)$ under $H_0: \beta_1 = 0$
− See NKNW pp. 76–77.
− When H0: β1 = 0 is false, MSM tends to be larger than MSE, so we want to reject H0 when F is large.
− Generally, our decision rule is to reject the null hypothesis if $F \geq F_c = F(1-\alpha;\ df_M, df_E) = F(0.95;\ 1, n-2)$.
− In practice we use p-values (and reject H0 if the p-value is less than α).
− Recall that t = b1/s{b1} tests H0: β1 = 0. It can be shown that $[t(df)]^2 = F(1, df)$, i.e. the square of a t statistic with df degrees of freedom is an F statistic with (1, df) degrees of freedom. The two approaches give the same P-value; they are really the same test.
− Aside: When H0: β1 = 0 is false, F has a noncentral F distribution; this can be used to calculate power.

ANOVA Table

Source   df     SS     MS     F          P
Model    1      SSM    MSM    MSM/MSE    .xxx
Error    n−2    SSE    MSE
Total    n−1    SST

See program nknw073.sas for the program used to generate the other output used in this lecture.

data a1;
  infile 'H:\Stat512\Datasets\Ch01ta01.dat';
  input size hours;
proc reg data=a1;
  model hours=size;
run;

                       Sum of        Mean
Source        DF      Squares      Square    F Value    Pr > F
Model          1       252378      252378     105.88    <.0001
Error         23        54825  2383.71562
C. Total      24       307203
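As a final arithmetic check on this table (n = 25, so dfE = 23):

$MSE = \dfrac{54825}{23} = 2383.7, \qquad F = \dfrac{MSM}{MSE} = \dfrac{252378}{2383.71562} = 105.88$

The t² = F identity can also be verified on the diamond output from earlier in this topic: $45.50^2 \approx 2070.3$, which agrees with the reported F value of 2069.99 up to rounding of the t value.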