






































A comprehensive overview of various statistical modeling techniques, including regression analysis and generalized linear models. It covers topics such as power and log transformations, outliers, leverage points, influential points, r-squared, anova, pooled variance estimator, sum of square errors, sum of square treatments, mean square error, f-test, simple linear regression, multiple linear regression, marginal and conditional models, explanatory and predictive factors, logistic regression, poisson regression, goodness of fit, and model selection methods like aic, bic, and stepwise regression. The document aims to equip readers with a deep understanding of these statistical concepts and their practical applications in data analysis and modeling.
Typology: Exams







































response (dependent) variables - Precise Answer ✔✔one particular variable that we are interested in understanding or modeling (y)
predicting or explanatory (independent) variables - Precise Answer ✔✔a set of other variables that might be useful in predicting or modeling the response variable (x1, x2)
What kind of variable is a response variable and why? - Precise Answer ✔✔random, because it varies with changes in the predictor(s) along with other random changes.
What kind of variable is a predicting variable and why? - Precise Answer ✔✔fixed, because it does not change with the response; it is fixed before the response is measured.
linear relationship - Precise Answer ✔✔a simple deterministic relationship between 2 factors, x and y
What are three things that a regression analysis is used for? - Precise Answer ✔✔1. Prediction of the response variable, 2. Modeling the relationship between the response and explanatory variables, 3. Testing hypotheses of association relationships
B0 =? - Precise Answer ✔✔intercept
B1 =? - Precise Answer ✔✔slope
For our linear model Y = B0 + B1*X + EPSILON (E), what does the epsilon represent? - Precise Answer ✔✔deviance of the data from the linear model (error term)
What are the 4 assumptions of linear regression? - Precise Answer ✔✔Linearity/Mean Zero, Constant Variance, Independence, Normality
Linearity/Mean zero assumption - Precise Answer ✔✔Means that the expected value of the errors (deviances) is zero. A violation of this assumption leads to difficulties in estimating B0 and means that our model does not include a necessary systematic component.
Constant variance assumption - Precise Answer ✔✔Means that it cannot be true that the model is more accurate for some parts of the population and less accurate for other parts of the population. A violation can result in less accurate parameter estimates and poorly calibrated prediction intervals.
Assumption of Independence - Precise Answer ✔✔Means that the deviances, or in fact the response variables (the y's), are independently drawn from the data-generating process. A violation most often occurs in time series data and can result in very misleading assessments of the strength of the regression.
Normality assumption - Precise Answer ✔✔This is needed if we want to compute any confidence or prediction intervals or run hypothesis tests, which we usually do. If this assumption is violated, hypothesis tests and confidence and prediction intervals can be very misleading.
What are the 3 parameters we estimate in regression? - Precise Answer ✔✔B0, B1, sigma squared (variance of the one pop.)
What do we mean by model parameters in statistics? - Precise Answer ✔✔Model parameters are unknown quantities, and they stay unknown regardless of how much data are observed. We estimate those parameters given the model assumptions and the data, but through estimation we are not identifying the true parameters. We are just estimating approximations of those parameters.
What is the estimated sampling distribution of s^2? - Precise Answer ✔✔chi-square with n-1 DF
Why do we lose 1 DF for s^2? - Precise Answer ✔✔we replace mu with zbar (the sample mean)
What is the relationship between s^2 and sigma^2? - Precise Answer ✔✔s^2 estimates sigma^2
What is the estimated sampling distribution of sigma^2 hat? - Precise Answer ✔✔chi-square with n-2 DF (~ equivalent to MSE)
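The three estimated parameters named in the cards above (B0, B1, sigma squared) can be illustrated with a minimal sketch. It assumes the usual least-squares formulas for simple linear regression. The function name `fit_slr` and the toy data are hypothetical and are not part of the original material.

```python
def fit_slr(x, y):
    """Least-squares fit of y = b0 + b1*x. Returns (b0, b1, sigma2_hat)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products around the means
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx              # slope estimate
    b0 = y_bar - b1 * x_bar       # intercept estimate
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(r ** 2 for r in residuals)
    # Two degrees of freedom are lost because b0 and b1 are estimated
    sigma2_hat = sse / (n - 2)
    return b0, b1, sigma2_hat


# Toy data (made up for illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, sigma2_hat = fit_slr(x, y)
print(b0, b1, sigma2_hat)
```

Dividing SSE by n - 2 rather than n is what ties the variance estimate to the chi-square distribution with n-2 DF mentioned above.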
The estimators for the regression coefficients are: A) Biased but with small variance B) Unbiased under normality assumptions but biased otherwise. C) Unbiased regardless of the distribution of the data. - Precise Answer ✔✔C
The assumption of normality: A) It is needed for deriving the estimators of the regression coefficients. B) It is not needed for linear regression modeling and inference. C) It is needed for the sampling distribution of the estimators of the regression coefficients and hence for inference. D) It is needed for deriving the expectation and variance of the estimators of the regression coefficients.
B) Have the same expectation C) Have the same variance and expectation D) None of the above - Precise Answer ✔✔B
The variability in the prediction comes from: A) The variability due to a new measurement. B) The variability due to estimation. C) The variability due to a new measurement and due to estimation. D) None of the above. - Precise Answer ✔✔C
residuals - Precise Answer ✔✔the differences between the observed responses and the fitted responses
What does residual analysis NOT check for? (for SLR assumptions) - Precise Answer ✔✔independence
What can we use to check for normality? - Precise Answer ✔✔QQ plot and histogram
What are two ways to transform data? - Precise Answer ✔✔power and log transformation
outliers - Precise Answer ✔✔data points far from the majority of the data in both x and y or just x
leverage points - Precise Answer ✔✔data points that are far from the mean of the x's
influential points - Precise Answer ✔✔data points that are far from the mean of both the x's and the y's; they are called influential because they influence the fit of the regression.
R^2 or coefficient of determination - Precise Answer ✔✔a statistic that efficiently summarizes how well the x can be used to predict the response variable.
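To make the leverage and R^2 cards above concrete, here is a small sketch for simple linear regression. It assumes the standard formulas R^2 = 1 - SSE/SST and h_i = 1/n + (x_i - xbar)^2 / S_xx. The function names and the data are hypothetical.

```python
def r_squared(y, fitted):
    """Coefficient of determination: 1 - SSE / SST."""
    y_bar = sum(y) / len(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)              # total variability
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained part
    return 1 - sse / sst


def leverages(x):
    """Leverage of each point in simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    # Points far from the mean of the x's get leverage close to 1
    return [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]


# Made-up data: the last x value sits far from the others
x = [1.0, 2.0, 3.0, 4.0, 20.0]
h = leverages(x)
print(h)

# Made-up observed and fitted values
r2 = r_squared([1.0, 2.0, 3.0, 4.0], [1.5, 1.5, 3.5, 3.5])
print(r2)
```

In this sketch the point at x = 20 carries almost all of the leverage, which matches the definition of a leverage point as one far from the mean of the x's.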
MSSTr measures... - Precise Answer ✔✔between-group variability
ANOVA measures... - Precise Answer ✔✔the variability between samples relative to the variability within a sample.
F-test measures... - Precise Answer ✔✔ratio of between-group variability and within-group variability
Which are all the model parameters in ANOVA? A) The means of the k populations. B) The sample means of the k populations. C) The sample means of the k samples. D) None of the above. - Precise Answer ✔✔D
The pooled variance estimator is: A) The sample variance estimator assuming equal variances. B) The variance estimator assuming equal means and equal variances. C) The sample variance estimator assuming equal means. D) None of the above. - Precise Answer ✔✔A
The total sum of squares divided by N-1 is: A) The mean sum of squared errors B) The sample variance estimator assuming equal means and equal variances C) The sample variance estimator assuming equal variances. D) None of the above. - Precise Answer ✔✔B
The mean squared error (MSE) measures: A) The within-treatment variability. B) The between-treatment variability. C) The sum of the within-treatment and between-treatment variability. D) None of the above. - Precise Answer ✔✔A
Which is correct? A) If we reject the test of equal means, we conclude that all treatment means are not equal. B) If we do not reject the test of equal means, we conclude that means are definitely all equal C) If we reject the test of equal means, we conclude that some treatment means are not equal. D) None of the above. - Precise Answer ✔✔C
When would we use the 'Comparing pairs of Means' method? - Precise Answer ✔✔After we reject the null hypothesis of equal means
pairwise comparison - Precise Answer ✔✔we estimate the difference in the means (for example, a pair: mean i and mean j) as the difference between their corresponding sample means
Using the Tukey method to find the confidence interval of the means, what does having a '0' in the CI mean? - Precise Answer ✔✔For the confidence intervals that include zero, it is plausible that the difference between means is zero.
In the pairwise comparison, if the confidence interval only contains positive values, then we conclude... - Precise Answer ✔✔that the difference in means is statistically significantly positive
The 3 assumptions of ANOVA with respect to the error term: - Precise Answer ✔✔Constant variance, independence, normality
How will we diagnose the assumptions for ANOVA? - Precise Answer ✔✔we diagnose the assumptions on the residuals, because the error terms depend on the means, which we do not know
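The between-group and within-group quantities in the ANOVA cards above can be sketched as follows. This is a minimal one-way ANOVA under the assumption of k independent groups. The function name `one_way_anova` and the three toy groups are hypothetical, and the sketch stops at the F statistic without computing a p-value.

```python
def one_way_anova(groups):
    """Return SSTr, SSE, MSTr, MSE and the F statistic for k groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Between-group (treatment) sum of squares
    sstr = sum(len(g) * (m - grand_mean) ** 2
               for g, m in zip(groups, group_means))
    # Within-group (error) sum of squares
    sse = sum((v - m) ** 2
              for g, m in zip(groups, group_means) for v in g)
    mstr = sstr / (k - 1)        # between-group variability
    mse = sse / (n_total - k)    # within-group variability (pooled variance)
    return {"SSTr": sstr, "SSE": sse, "MSTr": mstr, "MSE": mse,
            "F": mstr / mse}


# Toy data (made up): three groups of three observations each
result = one_way_anova([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
print(result)
```

A large F means the between-group variability dominates the within-group variability, which is the evidence used to reject the null hypothesis of equal means.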
If our ANOVA model does not have an intercept, then how many dummy variables? - Precise Answer ✔✔all k dummy variables
T/F: For assessing the normality assumption of the ANOVA model, we can only use the quantile-quantile normal plot of the residuals. - Precise Answer ✔✔F
T/F: The constant variance assumption is diagnosed using the histogram. - Precise Answer ✔✔F
T/F: The estimator σ^2 hat is a random variable. - Precise Answer ✔✔T
T/F: The regression coefficients are used to measure the linear dependence between two variables. - Precise Answer ✔✔F
T/F: The mean sum of square errors in ANOVA measures variability within groups. - Precise Answer ✔✔T
T/F: β1 hat is an unbiased estimator for β0. - Precise Answer ✔✔F
T/F: Under the normality assumption, the estimator for β1 is a linear combination of normally distributed random variables. - Precise Answer ✔✔T
T/F: In simple linear regression models, we lose three degrees of freedom because of the estimation of the three model parameters β0, β1, σ^2. - Precise Answer ✔✔F
T/F: The assumptions to diagnose with a linear regression model are independence, linearity, constant variance, and normality. - Precise Answer ✔✔T
T/F: The sampling distribution for the variance estimator in ANOVA is χ^2 (chi-square) regardless of the assumptions of the data. - Precise Answer ✔✔F
T/F: If the constant variance assumption in ANOVA does not hold, the inference on the equality of the means will not be reliable. - Precise Answer ✔✔T
T/F: A negative value of β1 is consistent with an inverse relationship between x and y. - Precise Answer ✔✔T
T/F: If one confidence interval in the pairwise comparison does not include zero, we conclude that the two means are plausibly equal. - Precise Answer ✔✔F
T/F: The mean sum of square errors in ANOVA measures variability between groups. - Precise Answer ✔✔F
T/F: The linear regression model with a qualitative predicting variable with k levels/classes will have k + 1 parameters to estimate - Precise Answer ✔✔T
T/F: We assess the assumption of constant-variance by plotting the response variable against fitted values. - Precise Answer ✔✔F
T/F: The number of degrees of freedom of the χ^2 (chi-square) distribution for the variance estimator is N - k + 1 where k is the number of samples. - Precise Answer ✔✔F
T/F: Only the log-transformation of the response variable can be used when the normality assumption does not hold. - Precise Answer ✔✔F
T/F: The prediction interval will never be smaller than the confidence interval for data points with identical predictor values. - Precise Answer ✔✔T
T/F: If one confidence interval in the pairwise comparison includes only positive values, we conclude that the difference in means is statistically significantly positive. - Precise Answer ✔✔T
What are the assumptions for multiple linear regression? - Precise Answer ✔✔Linearity/Mean zero assumption, Constant Variance, Independence and Normality (for statistical inference)
What are the model parameters to be estimated in MLR? - Precise Answer ✔✔B0 (intercept), B1-Bp, and sigma squared
Explanatory factors - Precise Answer ✔✔to explain variability in the response variable; they may be included in the model even if other "similar" variables are in the model
Predictive factors - Precise Answer ✔✔to best predict variability in the response regardless of their explanatory power
The objective of multiple linear regression is: A) To predict future new responses B) To model the association of explanatory variables to a response variable accounting for controlling factors. C) To test hypotheses using statistical inference on the model. D) All of the above. - Precise Answer ✔✔D
D) Causality is the same as association in interpreting the relationship between the response and predicting variables. - Precise Answer ✔✔A
Which one correctly characterizes the sampling distribution of the estimated variance? A) The estimated variance of the error term has a chi-squared distribution regardless of the distribution assumption of the error terms. B) The number of degrees of freedom for the chi-squared distribution of the estimated variance is n - p - 1 for a model without an intercept. C) The sampling distribution of the mean squared error is different from that of the estimated variance. D) None of the above. - Precise Answer ✔✔D
What is B hat in MLR? - Precise Answer ✔✔a linear combination of the Y's; it is normally distributed.
σ^2 hat distribution is? - Precise Answer ✔✔chi-square, n-p-1 DF
What is the sampling distribution for an individual β hat? - Precise Answer ✔✔t-distribution with n-p-1 DF
From what distribution can we derive the confidence interval? - Precise Answer ✔✔t-distribution
What does it mean if 0 is included in the CI? - Precise Answer ✔✔we conclude that Bj is NOT statistically significant
What does it mean if 0 is NOT included in the CI? - Precise Answer ✔✔we conclude that Bj IS statistically significant
What are the null and alternative hypotheses for MLR? - Precise Answer ✔✔H0: all the coefficients are equal to 0; HA: the coefficients (at least 1) are not equal to 0
If the t-value is large... - Precise Answer ✔✔we reject the null hypothesis and conclude that the coefficient is statistically significant.
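The distributional facts in the cards above can be written out as a short worked summary. The matrix notation is an assumption introduced here and does not appear in the original: X denotes the n x (p+1) design matrix including the intercept column, and se(·) denotes a standard error.

```latex
\begin{align*}
Y &= X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n) \\
\hat{\beta} &= (X^\top X)^{-1} X^\top Y
  \quad \text{(a linear combination of the } Y\text{'s)} \\
\hat{\sigma}^2 &= \frac{SSE}{n-p-1}
  = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-p-1},
  \qquad \frac{(n-p-1)\,\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p-1} \\
\frac{\hat{\beta}_j - \beta_j}{\operatorname{se}(\hat{\beta}_j)} &\sim t_{n-p-1},
  \qquad \operatorname{se}(\hat{\beta}_j)
  = \sqrt{\hat{\sigma}^2 \left[(X^\top X)^{-1}\right]_{jj}} \\
\text{CI for } \beta_j:\quad & \hat{\beta}_j \pm t_{\alpha/2,\,n-p-1}\,
  \operatorname{se}(\hat{\beta}_j)
\end{align*}
```

If this interval excludes 0, the coefficient is judged statistically significant at level alpha, which is the rule stated in the cards.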
The sampling distribution of the estimated regression coefficients is: A) Centered at the true regression parameters. B) The t-distribution assuming that the variance of the error term is unknown and replaced by its estimate. C) Dependent on the design matrix. D) All of the above. - Precise Answer ✔✔D
The estimators for the regression coefficients are: A) Biased but with small variance B) Unbiased under normality assumptions but biased otherwise C) Biased regardless of the distribution of the data. D) Unbiased regardless of the distribution of the data. - Precise Answer ✔✔D
We can test for a subset of regression coefficients: A) Using the F-statistic test of the overall regression. B) Only if we are interested in whether additional explanatory variables should be considered in addition to the controlling variables. C) To evaluate whether all regression coefficients corresponding to the predicting variables excluded from the reduced model are statistically significant. D) None of the above. - Precise Answer ✔✔D
The estimator of the mean response is: - Precise Answer ✔✔UNBIASED
If we replace the unknown variance with its estimator, sigma^2 hat = MSE, for PREDICTION, the sampling distribution becomes... - Precise Answer ✔✔t distribution with n-p-1 DF
Is the predicted regression line the same as the estimated regression line at x*? How does it affect confidence intervals? - Precise Answer ✔✔Yes, but the prediction interval is wider than the confidence interval for the estimated mean response because of the higher variability in the prediction.
The estimated versus predicted regression line for a given x*: A) Have the same variance B) Have the same expectation C) Have the same variance and expectation D) None of the above - Precise Answer ✔✔B
Which one is correct? A) The prediction intervals need to be corrected for simultaneous inference when multiple predictions are made jointly. B) The prediction intervals are centered at the predicted value. C) The sampling distribution of the prediction of a new response is a t-distribution. D) All of the above. - Precise Answer ✔✔D
T/F: In a multiple linear regression model with 6 predicting variables but without intercept, there are 7 parameters to estimate. - Precise Answer ✔✔True
T/F: The only objective of multiple linear regression is prediction. - Precise Answer ✔✔False
T/F: We can make causal inference in observational studies. - Precise Answer ✔✔False
T/F: In order to make statistical inference on the regression coefficients, we need to estimate the variance of the error terms. - Precise Answer ✔✔True
T/F: We cannot estimate a multiple linear regression model if the predicting variables are linearly dependent. - Precise Answer ✔✔True
T/F: The estimated regression coefficients are unbiased estimators. - Precise Answer ✔✔True
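The cards above state that the estimated and predicted regression lines share the same expectation but not the same variance. A brief worked comparison shows why. The notation is an assumption added here: x* includes a leading 1 for the intercept, X is the design matrix, and y* is a new response at x*.

```latex
\begin{align*}
\hat{y}(x^*) &= x^{*\top}\hat{\beta}
  \quad \text{(same point estimate for both purposes)} \\
\operatorname{Var}\big(\hat{y}(x^*)\big)
  &= \sigma^2\, x^{*\top}(X^\top X)^{-1}x^*
  \quad \text{(estimation only)} \\
\operatorname{Var}\big(y^* - \hat{y}(x^*)\big)
  &= \sigma^2\left(1 + x^{*\top}(X^\top X)^{-1}x^*\right)
  \quad \text{(new measurement plus estimation)} \\
\text{CI:}\quad & \hat{y}(x^*) \pm t_{\alpha/2,\,n-p-1}
  \sqrt{MSE\; x^{*\top}(X^\top X)^{-1}x^*} \\
\text{PI:}\quad & \hat{y}(x^*) \pm t_{\alpha/2,\,n-p-1}
  \sqrt{MSE\left(1 + x^{*\top}(X^\top X)^{-1}x^*\right)}
\end{align*}
```

The extra 1 inside the prediction variance is the variability due to the new measurement, so the prediction interval can never be narrower than the confidence interval at the same x*.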
T/F: The error term variance estimator has a (chi-squared) distribution with degrees of freedom for a multiple regression model with 10 predictors. - Precise Answer ✔✔True
T/F: The sampling distribution for estimating confidence intervals for the regression coefficients is a normal distribution. - Precise Answer ✔✔False
T/F: The estimated variance of the error terms is the sum of squared residuals divided by the sample size minus the number of predictors minus one. - Precise Answer ✔✔True
T/F: Conducting t-tests on each β parameter in a multiple regression model is the best way for testing the overall significance of the model. - Precise Answer ✔✔False
T/F: In the case of a multiple linear regression model containing 6 quantitative predicting variables and an intercept, the number of parameters to estimate is 7. - Precise Answer ✔✔False
T/F: The regression coefficient corresponding to one predictor in multiple linear regression is interpreted in terms of the estimated expected change in the response variable when there is a change of one unit in the corresponding predicting variable holding all other predictors fixed. - Precise Answer ✔✔True
T/F: The proportion of variability in the response variable that is explained by the predicting variables is called correlation. - Precise Answer ✔✔False
T/F: Predicting values of the response variable for values of the predictors that are within the data range is known as extrapolation. - Precise Answer ✔✔False
T/F: In multiple linear regression we study the relationship between a single response variable and several predicting quantitative and/or qualitative variables. - Precise Answer ✔✔True
T/F: The sampling distribution used for estimating confidence intervals for the regression coefficients is the normal distribution. - Precise Answer ✔✔False
T/F: A partial F-test can be used to test whether a subset of regression coefficients are all equal to zero. - Precise Answer ✔✔True
T/F: Prediction is the only objective of multiple linear regression. - Precise Answer ✔✔False
T/F: The equation to find the estimated variance of the error terms of a multiple linear regression model with intercept can be obtained by summing up the squared residuals and dividing that by n - p, where n is the sample size and p is the number of predictors. - Precise Answer ✔✔False
T/F: For a given predicting variable, the estimated coefficient of regression associated with it will likely be different in a model with other predicting variables or in the model with only the predicting variable alone. - Precise Answer ✔✔True
T/F: Observational studies allow us to make causal inference. - Precise Answer ✔✔False
T/F: In the case of multiple linear regression, controlling variables are used to control for sample bias. - Precise Answer ✔✔True
T/F: In the case of a multiple regression model with 10 predictors, the error term variance estimator follows a χ^2 (chi-squared) distribution with n - 10 degrees of freedom. - Precise Answer ✔✔False
T/F: The estimated coefficients obtained by using the method of least squares are unbiased estimators of the true coefficients. - Precise Answer ✔✔True
T/F: Before making statistical inference on regression coefficients, estimation of the variance of the error terms is necessary. - Precise Answer ✔✔True
T/F: Given a qualitative predicting variable with 7 categories in a linear regression model with intercept, 7 dummy variables need to be included in the model. - Precise Answer ✔✔False
T/F: It is good practice to create a multiple linear regression model using a linearly dependent set of predictor variables. - Precise Answer ✔✔False
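The dummy-variable cards above (k - 1 dummies with an intercept, all k without one) can be illustrated with a small sketch. The helper `dummy_encode` and the seven weekday labels are hypothetical. The baseline level is chosen alphabetically here, which is an arbitrary convention for this example only.

```python
def dummy_encode(values, drop_first=True):
    """Encode a qualitative variable as 0/1 dummy columns.

    With drop_first=True (model with intercept) one baseline level is
    dropped, so k levels give k - 1 columns. With drop_first=False
    (model without intercept) all k columns are kept.
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    rows = [[1 if v == lvl else 0 for lvl in kept] for v in values]
    return kept, rows


# Made-up qualitative predictor with 7 categories
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

kept_with_intercept, rows_with_intercept = dummy_encode(days)
kept_no_intercept, rows_no_intercept = dummy_encode(days, drop_first=False)

print(len(kept_with_intercept))  # number of dummies with an intercept
print(len(kept_no_intercept))    # number of dummies without an intercept
```

Keeping all 7 dummies together with an intercept would make the columns linearly dependent, since the dummies sum to the intercept column, and the cards note that such a model cannot be estimated.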