Statistical Modeling and Experimental Design: Regression Techniques and Applications, Assignments of Mathematical Modeling and Simulation

This document is a series of homework questions on statistical modeling, with a focus on regression techniques. It covers stepwise regression, lasso, and elastic net, and emphasizes data scaling and cross-validation for model evaluation. The solutions include methodologies, results, and analyses that inform model selection. It also applies experimental design principles to a market research survey on housing features, and gives example data, drawn largely from semiconductor manufacturing, for the binomial, geometric, Poisson, exponential, and Weibull distributions.

Typology: Assignments

2024/2025

Uploaded on 07/24/2025

bindu-manusani-1

WEEK 5 HOMEWORK
Question 11.1
Using the crime data set uscrime.txt from Questions 8.2, 9.1, and 10.1, build a regression model
using:
1. Stepwise regression
2. Lasso
3. Elastic net
For Parts 2 and 3, remember to scale the data first – otherwise, the regression coefficients will be on
different scales and the constraint won’t have the desired effect.
For Parts 2 and 3, use the glmnet function in R.
Notes on R:
• For the elastic net model, what we called λ in the videos, glmnet calls “alpha”; you can get a
  range of results by varying alpha from 1 (lasso) to 0 (ridge regression) [and, of course, other
  values of alpha in between].
• In a function call like glmnet(x,y,family="mgaussian",alpha=1) the predictors x
  need to be in R’s matrix format, rather than data frame format. You can convert a data frame
  to a matrix using as.matrix – for example, x <- as.matrix(data[,1:(n-1)])
• Rather than specifying a value of T, glmnet returns models for a variety of values of T.
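The scaling and matrix conversion described in these notes could look roughly like the sketch below. It uses only base R. The built-in mtcars data is a stand-in for uscrime.txt (an assumption, since the crime data is not reproduced here), and the columns are reordered so that the response sits in the last column, as the as.matrix example assumes.

```r
# Stand-in data (assumption): mtcars replaces uscrime.txt, with the response
# (mpg) moved to the last column.
data <- mtcars[, c(2:11, 1)]
n <- ncol(data)

# Convert the predictors to a matrix and standardize each column so that the
# penalty treats all coefficients on a comparable scale.
x <- scale(as.matrix(data[, 1:(n - 1)]))
y <- data[, n]

print(round(colMeans(x), 10))  # column means are approximately zero
print(apply(x, 2, sd))         # column standard deviations are approximately one
```

Note that glmnet also standardizes predictors internally by default, so explicit scaling mainly matters when that default is turned off or when the coefficients are to be reported on the standardized scale.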
Solution
Methodology
In this analysis, the entire dataset is used for training without a separate test set. Cross-validation is
employed to compute the R² value for the models, as well as to determine the optimal alpha parameter
for Elastic Net regression.
For Elastic Net, alpha values ranging from 0.1 to 0.9 are evaluated, and the best model is selected based
on the highest R² value obtained from cross-validation using cv.lm and cv.glmnet. The R² value is derived
from the cross-validated mean squared error (MSE) reported by these functions.
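A minimal sketch of the alpha search for elastic net, under several assumptions: mtcars stands in for uscrime.txt, 5 folds are used, family = "gaussian" is chosen for the single response, and R² is computed as 1 − MSE / (SST / n). None of these choices is confirmed by the solution itself; only the use of cv.glmnet and the 0.1 to 0.9 alpha range come from the text.

```r
library(glmnet)

# Stand-in data (assumption): mtcars replaces uscrime.txt.
x <- scale(as.matrix(mtcars[, -1]))  # standardized predictors as a matrix
y <- mtcars$mpg

set.seed(1)
alphas <- seq(0.1, 0.9, by = 0.1)

# Reuse one fold assignment for every alpha so the comparison is like for like.
foldid <- sample(rep(1:5, length.out = nrow(x)))

cv_r2 <- sapply(alphas, function(a) {
  fit <- cv.glmnet(x, y, alpha = a, foldid = foldid, family = "gaussian")
  mse <- min(fit$cvm)              # cross-validated MSE at lambda.min
  1 - mse / mean((y - mean(y))^2)  # R^2 = 1 - MSE / (SST / n)
})

best_alpha <- alphas[which.max(cv_r2)]
print(data.frame(alpha = alphas, cv_r2 = round(cv_r2, 4)))
print(best_alpha)
```

With a data set as small as the crime data, the selected alpha can change with the random fold assignment, so it is worth repeating the search with a few different seeds.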
For stepwise regression, the lm() function is first used to fit the full model, followed by the step()
function with direction = "both" to perform bidirectional stepwise selection (allowing both forward
addition and backward elimination of variables). step() adds or drops terms based on AIC, which
narrows the full model down to a smaller set of useful predictors.
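The lm() and step() sequence can be sketched in base R as follows. mtcars again stands in for uscrime.txt (an assumption), so the selected variables and the adjusted R² printed here are not the results reported for the crime data.

```r
# Stand-in data (assumption): mtcars replaces uscrime.txt, with mpg as response.
full_model <- lm(mpg ~ ., data = mtcars)

# Bidirectional stepwise selection starting from the full model.
# trace = 0 suppresses the step-by-step log.
step_model <- step(full_model, direction = "both", trace = 0)

print(formula(step_model))                # predictors kept by the search
print(summary(step_model)$adj.r.squared)  # adjusted R^2 of the reduced model
```

Because step() compares models by AIC, some retained predictors can still have p-values above the usual 0.05 threshold in summary(step_model).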
Result and analysis
Stepwise Regression:


The stepwise regression model summary shows a reduction from 15 predictors to 8, with an adjusted R² of 0.7444. Some of the retained features have marginally high p-values and could be eliminated as well, depending on model simplicity and performance requirements.

Stepwise Regression Model:
Crime = -6426.1 + 93.32 M + 180.12 Ed + 102.65 Po1 + 22.34 M.F - 6086.63 U1 + 187.35 U2 + 61.33 Ineq - 3796.03 Prob

Lasso and Elastic Net
For elastic net, alpha = 0.8 is selected because it gives the highest R² value.

                     Ridge     Elastic Net   Lasso
Alpha                0         0.8           1
Lambda from model
R²                   0.5245    0.5656        0.

To determine the value of 10 different yes/no features to the market value of a house (large yard, solar roof, etc.), a real estate agent plans to survey 50 potential buyers, showing a fictitious house with different combinations of features. To reduce the survey size, the agent wants to show just 16 fictitious houses. Use R’s FrF2 function (in the FrF2 package) to find a fractional factorial design for this experiment: what set of features should each of the 16 fictitious houses have? Note: the output of FrF2 is “1” (include) or “-1” (don’t include) for each feature.

Solution:

Methodology:
The function FrF2(nruns = 16, nfactors = 10, factor.names = features) is used to generate a fractional factorial design. Here, nruns = 16 specifies that 16 experimental combinations (fictitious houses) will be generated, while nfactors = 10 indicates that 10 binary (yes/no) features are being considered. These features are labeled F1 through F10 using features <- c("F1", "F2", ..., "F10"). The set.seed(1) command is included to make the results reproducible. Finally, a summary column is added to the output to list which features are included (i.e., coded as 1) in each combination.

Result and analysis:
A full factorial design with 10 binary (yes/no) features would require 2¹⁰ = 1024 combinations, since every possible combination of the 10 features would need to be tested. Conducting that many surveys would be impractical and costly. With the FrF2 function, the number of combinations is reduced to just 16 while the main effects of all 10 features can still be estimated. With so few runs, the main effects are aliased with two-factor interactions, so this relies on interactions between features being negligible. The features included in each combination are listed in the Included_Features column of the output.

Question 13.
For each of the following distributions, give an example of data that you would expect to follow this distribution (besides the examples already discussed in class).
a. Binomial
b. Geometric
c. Poisson
d. Exponential

e. Weibull

Solution:

Binomial
  Definition: Binomial distribution models the number of successes or failures in a fixed number of independent trials.
  Example: A quality engineer inspects 100 randomly selected wafers each day to count how many are defective or fall outside control limits.

Geometric
  Definition: Geometric distribution models the number of trials until the first failure.
  Example: After monthly maintenance, a particle monitoring system records the number of tests until the first contamination event occurs.

Poisson
  Definition: Poisson distribution models the count of events happening in a fixed time interval, assuming the events are rare and independent.
  Example: A semiconductor plant records the number of power outages or trips that occur each year.

Exponential
  Definition: Exponential distribution models time between events (such as failures), especially when failures occur randomly and independently.
  Example: The time between random failures of a machine’s hardware components is monitored.

Weibull
  Definition: Weibull distribution models lifetimes of components, especially when the failure rate is not constant (e.g., wear-out over time).
  Example: The lifespan of a semiconductor tool transfer fork used in high-temperature processes is tracked until it fails due to wear or heat damage.

Reference:
https://www.rdocumentation.org/packages/FrF2/versions/2.3-4/topics/FrF
https://www.rdocumentation.org/packages/DAAG/versions/0.98/topics/cv.lm
https://www.statology.org/lasso-regression-in-r/
https://www.geeksforgeeks.org/r-machine-learning/stepwise-regression-in-r/
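Returning to the fractional factorial design for the house survey, the call described in that methodology could be sketched as follows. The FrF2 arguments, the F1 to F10 labels, set.seed(1), and the Included_Features column name come from the solution text; the way the summary column is built here (via apply over the factor columns) is an assumption, since the original code is not shown.

```r
library(FrF2)

features <- paste0("F", 1:10)  # F1 ... F10, one label per yes/no feature

set.seed(1)  # fixes the randomized run order so the output is reproducible
design <- FrF2(nruns = 16, nfactors = 10, factor.names = features)

# Work on a plain data frame copy and list, for each fictitious house,
# the features coded as "1" (include).
design_df <- as.data.frame(design)
design_df$Included_Features <- apply(
  design_df[, features], 1,
  function(row) paste(features[row == "1"], collapse = ", ")
)

print(design_df)
```

Each of the 16 rows is one fictitious house, and each feature appears as "1" in exactly half of the rows, which is what keeps the main-effect estimates balanced.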