





FOUNDATIONAL CONCEPTS Model Specification: What is SLR? y = β₀ + β₁x + ε Response vs. predictor variable What does the line represent? (conditional mean) Slope interpretation. Intercept interpretation.






Professor
School of Industrial and Systems Engineering
Learning Objectives:
Ø Learn about the model assumptions and about identification of outliers
Ø Evaluate goodness of fit and linear model performance
Data: $\{(x_1, y_1), \ldots, (x_n, y_n)\}$
Model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $i = 1, \ldots, n$
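Assumptions:
Ø Linearity (mean zero): $E(\varepsilon_i) = 0$
Ø Constant variance: $\mathrm{Var}(\varepsilon_i) = \sigma^2$
Ø Independence: $\{\varepsilon_1, \ldots, \varepsilon_n\}$ are independent random variables
Ø Normality: $\varepsilon_i \sim$ Normal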
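Residual Values: $\hat{\varepsilon}_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$
Graphical display: Plot of the residuals $\hat{\varepsilon}_i$
If the scatter of $\hat{\varepsilon}_i$ is not random around the zero line, it could be that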
Ø The relationship between X and Y is not linear
Ø Variances of error terms are not equal
Ø Response data are not independent
→ Goodness-of-fit (GOF): Use diagnostics to evaluate assumptions.
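A minimal sketch of how the residuals behind such a plot could be computed, assuming the usual least-squares estimates of the intercept and slope. The function names `fit_slr` and `residuals` and the data values are hypothetical, chosen only for illustration.

```python
from statistics import mean

def fit_slr(x, y):
    """Least-squares estimates (b0, b1) for y = b0 + b1*x + error."""
    x_bar, y_bar = mean(x), mean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def residuals(x, y):
    """Residuals: observed response minus fitted value."""
    b0, b1 = fit_slr(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
res = residuals(x, y)

# Text stand-in for the residual plot: look for patterns around zero
for xi, ri in zip(x, res):
    print(f"x = {xi:.1f}  residual = {ri:+.3f}")
```

With an intercept in the model, least-squares residuals sum to zero, so the diagnostic question is whether they scatter around zero without a pattern, not whether they average to zero.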
Ø Residual analysis can check for uncorrelated errors but not independence.
Ø If the data are from a randomized trial, then independence is established, but most data are from observational studies.
Independence Assumption:
There are clusters of residuals: the independence assumption does not hold.
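Clusters of residuals can also be summarized numerically. The sketch below uses the lag-1 sample autocorrelation of the residuals, taken in observation order. This is a swapped-in technique, not something the slides prescribe, and it only speaks to correlation between neighboring errors, not to independence in general. The function name and both residual sequences are hypothetical.

```python
def lag1_autocorrelation(res):
    """Lag-1 sample autocorrelation of residuals in observation order."""
    n = len(res)
    r_bar = sum(res) / n
    num = sum((res[i] - r_bar) * (res[i - 1] - r_bar) for i in range(1, n))
    den = sum((r - r_bar) ** 2 for r in res)
    return num / den

# Hypothetical residual sequences, for illustration only
clustered = [1.0, 1.2, 0.9, 1.1, -1.0, -1.1, -0.9, -1.0]    # runs of same sign
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # sign flips each step

print(lag1_autocorrelation(clustered))    # clearly positive
print(lag1_autocorrelation(alternating))  # clearly negative
```

A value far from zero in either direction suggests neighboring errors are correlated. A value near zero does not establish independence.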
One way to check the normality assumption in a regression is using a
Normal Probability Plot
x-axis: $\Phi^{-1}\left(\frac{r_i - 3/8}{n + 1/4}\right)$
y-axis: $\hat{\varepsilon}_i$
$r_i$ = rank of $\hat{\varepsilon}_i$ (between 1 and n)
$\Phi$ = CDF of the standard Normal distribution
Ø Let the R statistical software do this for you!
Ø A straight line in the normal probability plot
implies the assumption of normality is valid
Ø Curvature (especially at the ends) shows
non-normality
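A minimal sketch of the plot coordinates, assuming plotting positions of the form (rank - 3/8) / (n + 1/4). The denominator n + 1/4 is an assumption taken from Blom's common convention. The function name `normal_plot_points` and the residual values are hypothetical.

```python
from statistics import NormalDist

def normal_plot_points(res):
    """Return (theoretical quantile, residual) pairs, smallest residual first."""
    n = len(res)
    # Indices of the residuals in increasing order; ties keep input order
    order = sorted(range(n), key=lambda i: res[i])
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for rank, i in enumerate(order, start=1):
        p = (rank - 3 / 8) / (n + 1 / 4)  # assumed plotting position
        points.append((std_normal.inv_cdf(p), res[i]))
    return points

# Hypothetical residuals, for illustration only
res = [0.3, -1.2, 0.1, 1.5, -0.4]
for q, r in normal_plot_points(res):
    print(f"theoretical quantile = {q:+.3f}  residual = {r:+.3f}")
```

If the pairs fall close to a straight line, the normality assumption looks reasonable. Bending at the ends points to heavy or light tails.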
A complementary way to check the
normality assumption is to plot a
histogram of the residuals
Normality Assumption:
The residuals should have an approximately
symmetric distribution, unimodal, and with
no gaps in the data.
A data point far from the majority of the data (in y and/or x) may be called an
outlier, especially if it does not follow the general trend of the rest of the data.
Ø Data points that are far from the mean of the x’s are called leverage points.
Ø A data point that is far from the mean of the x’s and/or the y’s
is an influential point if it influences the fit of the regression.
Ø An outlier may or may not impact the regression fit significantly, thus it may or
may not be an influential point.
The upshot: Sometimes there are good reasons for excluding subsets (there
were errors in the data entry; there were errors in the experiment).
Sometimes the outlier belongs in the data. Outliers should always be examined.
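One way to examine a suspect point is to measure how far its x value sits from the rest and to refit the line without it. The sketch below assumes the standard simple-regression leverage, 1/n + (x_i - mean of x)^2 / S_xx. The function names and the data, whose last point is far from the others in x, are hypothetical.

```python
from statistics import mean

def fit_slr(x, y):
    """Least-squares estimates (b0, b1) for y = b0 + b1*x + error."""
    x_bar, y_bar = mean(x), mean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

def leverage(x):
    """Leverage of each point: large when x_i is far from the mean of the x's."""
    n, x_bar = len(x), mean(x)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

# Hypothetical data; the last point is far from the others in x
x = [1.0, 2.0, 3.0, 4.0, 5.0, 15.0]
y = [1.1, 2.0, 2.9, 4.2, 5.0, 3.0]

h = leverage(x)
b0_all, b1_all = fit_slr(x, y)
b0_drop, b1_drop = fit_slr(x[:-1], y[:-1])  # refit without the last point

print("leverages:", [round(v, 3) for v in h])
print(f"slope with all points: {b1_all:.3f}")
print(f"slope without last point: {b1_drop:.3f}")
```

Here the last point has high leverage and also changes the slope substantially, so it is influential. A high-leverage point that follows the trend of the other data would leave the fit nearly unchanged.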
Look at the standardized residuals:
Compare the standardized residuals to the -2 to +2 band (or -1 to +1).
Most statistics packages will calculate these automatically.
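A minimal sketch of the calculation. It assumes one common definition, the residual divided by its estimated standard deviation sigma_hat * sqrt(1 - h_i), with sigma_hat^2 = SSE / (n - 2) and h_i the leverage. Packages differ in the exact scaling, so this definition is an assumption. The function name and data are hypothetical.

```python
from math import sqrt
from statistics import mean

def standardized_residuals(x, y):
    """Residuals scaled by their estimated standard deviation (assumed definition)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    res = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sigma_hat = sqrt(sum(r ** 2 for r in res) / (n - 2))  # n - 2 degrees of freedom
    lev = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]
    return [r / (sigma_hat * sqrt(1 - h)) for r, h in zip(res, lev)]

# Hypothetical data; the fifth response is unusually large
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.0, 2.1, 2.9, 4.0, 9.0, 6.1, 7.0, 7.9]

std_res = standardized_residuals(x, y)
flagged = [i for i, r in enumerate(std_res) if abs(r) > 2]  # outside the -2 to +2 band
print("standardized residuals:", [round(r, 2) for r in std_res])
print("indices outside the band:", flagged)
```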
[Figure: fitted regression lines y = 2.6 + 0.07x, y = 1.34 + 0.13x, y = 1.29 + 0.17x, and y = 4.9 - 0.07x]
A statistic that efficiently summarizes how well the X’s can be used
to predict Y is the R-square:
$R^2 = 1 - \mathrm{SSE} / \mathrm{SST}$
which is interpreted as:
$R^2$ = Proportion of total
variability in Y that can be
explained by the regression
(that uses X)
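A minimal sketch of the R-square calculation for a least-squares line, with SSE the sum of squared residuals and SST the total sum of squares of the responses around their mean. The function name `r_squared` and the data are hypothetical.

```python
from statistics import mean

def r_squared(x, y):
    """R-square = 1 - SSE / SST for the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
    sst = sum((yi - y_bar) ** 2 for yi in y)                       # total
    return 1 - sse / sst

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.0, 2.1, 2.9, 4.0, 9.0, 6.1, 7.0, 7.9]

r2 = r_squared(x, y)
print(f"R-square = {r2:.3f}")
```

A value near 1 means the line accounts for most of the variability in Y. A value near 0 means the line explains little beyond the mean of Y.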