Regression Analysis: Regularized Regression and Case Studies, Exercises of Statistics

2022/2023

Uploaded on 12/11/2023

Regression Analysis
Module 5: Other Regression Models and Case Studies
Outline of Lessons
1. Regularized Regression
2. Case Study: ER Volume
3. Case Study: Customer Churn

Nicoleta Serban

Gamze Tokol-Goldsman

Dave Goldsman

H. Milton Stewart School of Industrial and Systems Engineering

5.1 Regularized Regression

Bias-Variance Tradeoff

Biased Regression: Penalties

Not all biased models are better.

We need a way to find “good” biased models!

  • Penalize large values of the $\beta_j$'s jointly
    • Should lead to “multivariate” shrinkage of the vector $\boldsymbol{\beta}$
  • Goal is really to penalize “complex” models
    • Heuristically, “large” is interpreted as “complex model”
      • If truth really is complex, this may not work!
        • It will then be hard to build a good model anyways

Regularized Regression (cont’d)

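  • The penalized sum of squared errors:

$$Q(\beta_1, \ldots, \beta_p) = \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \right)^2 + \lambda \, \mathrm{Penalty}(\beta_1, \ldots, \beta_p)$$

  • We consider three choices for the penalty:
    • $L_0$ penalty: $\|\boldsymbol{\beta}\|_0 = \#\{ j : \beta_j \neq 0 \}$ ⇒ Minimizing $Q$ means searching through all submodels
    • $L_1$ penalty (LASSO Regression): $\|\boldsymbol{\beta}\|_1 = \sum_{j=1}^{p} |\beta_j|$ ⇒ Minimizing $Q$ forces many $\beta_j$'s to be zero
    • $L_2$ penalty (Ridge Regression): $\|\boldsymbol{\beta}\|_2^2 = \sum_{j=1}^{p} \beta_j^2$ ⇒ Minimizing $Q$ accounts for multicollinearity
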

Comparing Penalties

$L_0$ penalty

  • Provides best model given a selection criterion
  • Requires fitting all submodels

$L_1$ penalty

  • Measures sparsity

$L_2$ penalty

  • Easy to implement
  • Does not do variable selection

  • Example: Consider vectors $\mathbf{u} = (1, 0, \ldots, 0)$ and $\mathbf{v} = (1/\sqrt{p}, \ldots, 1/\sqrt{p})$, both of length $p$. Vector $\mathbf{u}$ is sparse, because it contains mostly zeros.
  • Using the $L_1$ norm, we have $\|\mathbf{u}\|_1 = \sum_{i=1}^{p} |u_i| = 1$ and $\|\mathbf{v}\|_1 = \sum_{i=1}^{p} |v_i| = \sqrt{p}$.
  • Using the $L_2$ norm, we have $\|\mathbf{u}\|_2^2 = \sum_{i=1}^{p} u_i^2 = 1$ and $\|\mathbf{v}\|_2^2 = \sum_{i=1}^{p} v_i^2 = 1$.
  • The $L_1$ penalty rewards the sparsity of $\mathbf{u}$; the $L_2$ penalty makes no distinction.
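
The comparison of the two vectors can be checked numerically. The sketch below is only an illustration: it assumes $\mathbf{v} = (1/\sqrt{p}, \ldots, 1/\sqrt{p})$, the choice consistent with an $L_2$ norm of 1, and the value p = 16 and the helper names `l0_norm`, `l1_norm` and `l2_norm` are hypothetical.

```r
# Sparse vector u and dense vector v, both of length p
# (p = 16 is an arbitrary value chosen for illustration)
p <- 16
u <- c(1, rep(0, p - 1))
v <- rep(1 / sqrt(p), p)

# Hypothetical helper functions for the three norms
l0_norm <- function(b) sum(b != 0)      # number of nonzero entries
l1_norm <- function(b) sum(abs(b))      # sum of absolute values
l2_norm <- function(b) sqrt(sum(b^2))   # Euclidean length

norms <- rbind(
  u = c(L0 = l0_norm(u), L1 = l1_norm(u), L2 = l2_norm(u)),
  v = c(L0 = l0_norm(v), L1 = l1_norm(v), L2 = l2_norm(v))
)
print(norms)
```

Both vectors have the same $L_2$ norm, while the $L_0$ and $L_1$ norms are both smaller for the sparse vector.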

Ridge Regression

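  • Minimizes the SSE plus the $L_2$ penalty term:

$$SSE_\lambda(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

  • Provides a closed-form estimate of the regression coefficients $\hat{\boldsymbol{\beta}}$:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{Y}$$

    where $\mathbf{I}$ is the identity matrix

  • λ = 0 gives the least squares estimate (low bias, high variance)
  • λ → ∞ gives $\hat{\boldsymbol{\beta}} \to 0$ (high bias, low variance)
  • Commonly used under multicollinearity
  • Not used for model selection
    • Shrinks but does not “force” any $\hat{\beta}_j$ to equal 0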
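
The closed-form ridge estimate can be evaluated directly in base R. This is a minimal sketch on simulated data, not the course's code: the function name `ridge_coef`, the simulated variables and the λ values are assumptions made for illustration, and the predictors are standardized and the response centered so that no intercept is needed.

```r
# Hypothetical helper: ridge estimate (X'X + lambda I)^{-1} X'y
ridge_coef <- function(X, y, lambda) {
  # Solve the linear system instead of forming the inverse explicitly
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

# Simulated data with two nearly collinear predictors (illustrative only)
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1
x3 <- rnorm(n)
X  <- scale(cbind(x1, x2, x3))  # standardized predictors
y  <- x1 + x2 + 0.5 * x3 + rnorm(n)
y  <- y - mean(y)               # centered response

b_ols   <- ridge_coef(X, y, lambda = 0)     # least squares estimate
b_ridge <- ridge_coef(X, y, lambda = 10)    # shrunken estimate
b_big   <- ridge_coef(X, y, lambda = 1e6)   # essentially zero
print(round(rbind(b_ols, b_ridge, b_big), 4))
```

With λ = 0 the result coincides with the least squares fit, and the coefficients move toward zero as λ grows without any of them being set exactly to zero.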

LASSO Regression

  • Least Absolute Shrinkage and Selection Operator
  • Normal Linear Regression minimizes

$$SSE_\lambda(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

  • Generalized Linear Model minimizes

$$SSE_\lambda(\boldsymbol{\beta}) = -\ell(\beta_0, \ldots, \beta_p) + \lambda \sum_{j=1}^{p} |\beta_j|$$

    where $\ell(\boldsymbol{\beta})$ is the log-likelihood function

  • Estimated regression coefficients
    • Must use numerical algorithms
    • No closed-form expression
  • Used for model selection
    • Does “force” some $\hat{\beta}_j$ to equal 0
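
One numerical algorithm commonly used for the normal-regression LASSO objective is cyclic coordinate descent with soft-thresholding. The slides do not say which algorithm is used, so this is a swapped-in illustration. It assumes standardized predictors and a centered response, and the names `soft_threshold`, `lasso_cd` and `lambda_max` are hypothetical.

```r
# Soft-thresholding: shrinks z toward 0 by g and returns 0 when |z| <= g
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Cyclic coordinate descent for sum((y - X b)^2) + lambda * sum(|b|)
# (assumes centered y and standardized columns of X, so no intercept)
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  beta <- rep(0, ncol(X))
  for (iter in seq_len(n_iter)) {
    for (j in seq_along(beta)) {
      # Residual with the contribution of predictor j left out
      partial_resid <- y - drop(X[, -j, drop = FALSE] %*% beta[-j])
      z <- sum(X[, j] * partial_resid)
      beta[j] <- soft_threshold(z, lambda / 2) / sum(X[, j]^2)
    }
  }
  beta
}

# Simulated data: only the first two predictors matter (illustrative only)
set.seed(2)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
y <- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(n)
y <- y - mean(y)

# For this objective, every coefficient is zero once lambda >= 2 * max|x_j'y|
lambda_max <- 2 * max(abs(crossprod(X, y)))
b_ols   <- lasso_cd(X, y, lambda = 0)
b_lasso <- lasso_cd(X, y, lambda = 0.2 * lambda_max)
b_null  <- lasso_cd(X, y, lambda = lambda_max)
print(round(rbind(b_ols, b_lasso, b_null), 4))
```

The threshold is λ/2 because the objective above uses the unscaled sum of squared errors. Software that scales the loss differently uses a different threshold.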

Choosing λ: Cross-Validation

  • Split the data $(x_{11}, \ldots, x_{1p}, y_1), \ldots, (x_{n1}, \ldots, x_{np}, y_n)$ into two sets.

  • Training set
    • Use to fit the penalized model
      • Given λ, estimate $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$

  • Testing/Validation set
    • Use to evaluate performance of model obtained with training set
      • Estimate mean squared error (MSE) for normal regression
      • Estimate classification error rate for logistic regression
      • Estimate sum of squared deviances for Poisson regression
      • Generally, estimate a scoring rule depending on the regression model

The process can be repeated for multiple values of λ.

Cross Validation: How to Split Data?

K-fold cross-validation (KCV)

  • Divide data into K chunks of approximately equal size
  • For a range of λ penalty values, e.g., $\lambda_1, \ldots, \lambda_B$, and for k = 1 to K:
    • The training set consists of the data without the k-th fold, and the testing set consists of the k-th fold
    • Given λ, fit a model on the training data and predict the responses in the testing set
    • Given λ, compute the mean squared error or classification error rate for the k-th fold testing data
  • Given λ, after all K folds have been processed, compute the overall error (e.g., MSE or classification error) for that λ across all folds

➔ Select the λ penalty providing the minimum overall error
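
The K-fold procedure can be sketched in base R for ridge regression with MSE as the scoring rule. Everything here is illustrative: the simulated data, the λ grid, K = 5 and the helper `ridge_coef` are assumptions. For brevity the data are standardized once before splitting, whereas a more careful version would standardize within each training set.

```r
# Simulated data (illustrative only); standardized X and centered y
set.seed(3)
n <- 120; p <- 4; K <- 5
X <- scale(matrix(rnorm(n * p), n, p))
y <- drop(X %*% c(2, -1, 0, 0)) + rnorm(n)
y <- y - mean(y)

lambdas <- c(0, 0.5, 1, 5, 10, 50)        # candidate penalty values
fold <- sample(rep(1:K, length.out = n))  # random fold labels of equal size

# Hypothetical helper: closed-form ridge estimate for a given lambda
ridge_coef <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

# For each lambda, fit on K-1 folds, predict the held-out fold, pool the errors
cv_mse <- sapply(lambdas, function(lam) {
  sq_err <- numeric(n)
  for (k in 1:K) {
    test <- fold == k
    b <- ridge_coef(X[!test, , drop = FALSE], y[!test], lam)
    sq_err[test] <- (y[test] - drop(X[test, , drop = FALSE] %*% b))^2
  }
  mean(sq_err)   # overall MSE across all folds for this lambda
})

best_lambda <- lambdas[which.min(cv_mse)]
print(data.frame(lambda = lambdas, cv_mse = round(cv_mse, 4)))
print(best_lambda)
```

For a classification model, the same loop applies with the squared error replaced by the misclassification indicator.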

LASSO: Limitations

  • LASSO selects only up to n variables
    • n is the number of observations
    • If the number of potential predictors is greater than the number of observations, LASSO will select at most n of them
    • Since, normally, n > p, this is not a significant limitation
  • If there are high correlations among predictors
    • The prediction performance of LASSO is dominated by that of ridge regression
  • If there is a group of variables with high correlation
    • LASSO tends to select only one variable from the group
    • LASSO does not care which one

Elastic Net

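  • Elastic Net minimizes

$$\sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \right)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2$$

  • The $L_1$ penalty generates a sparse model
  • The $L_2$ penalty
    • Removes the limitation on the number of selected variables
    • Encourages group effect
    • Stabilizes the $L_1$ regularization path
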

Reference: Hui Zou and Trevor Hastie. "Regularization and variable selection via the elastic net." Journal of the Royal Statistical Society: Series B 67.2 (2005): 301-320.
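
The group effect can be illustrated with a small coordinate descent sketch that minimizes the elastic net objective exactly as written above, with no rescaling of the coefficients. It is an illustration under assumptions, not the reference's implementation: the function name `enet_cd`, the simulated data and the penalty values are hypothetical, and the second predictor is an exact copy of the first to make the effect visible.

```r
# Soft-thresholding: shrinks z toward 0 by g and returns 0 when |z| <= g
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Cyclic coordinate descent for
#   sum((y - X b)^2) + lambda1 * sum(|b|) + lambda2 * sum(b^2)
# (assumes centered y and standardized columns of X, so no intercept)
enet_cd <- function(X, y, lambda1, lambda2, n_iter = 500) {
  beta <- rep(0, ncol(X))
  for (iter in seq_len(n_iter)) {
    for (j in seq_along(beta)) {
      # Residual with the contribution of predictor j left out
      partial_resid <- y - drop(X[, -j, drop = FALSE] %*% beta[-j])
      z <- sum(X[, j] * partial_resid)
      beta[j] <- soft_threshold(z, lambda1 / 2) / (sum(X[, j]^2) + lambda2)
    }
  }
  beta
}

# Simulated data in which x2 is an exact copy of x1 (illustrative only)
set.seed(4)
n  <- 100
x1 <- rnorm(n)
x3 <- rnorm(n)
X  <- scale(cbind(x1 = x1, x2 = x1, x3 = x3))
y  <- 2 * X[, "x1"] + X[, "x3"] + rnorm(n)
y  <- y - mean(y)

b_lasso <- enet_cd(X, y, lambda1 = 20, lambda2 = 0)    # L1 penalty only
b_enet  <- enet_cd(X, y, lambda1 = 20, lambda2 = 10)   # L1 and L2 penalties
print(round(rbind(b_lasso, b_enet), 4))
```

In this sketch the L1-only fit puts the shared effect on one of the two identical predictors, while the fit with the added $L_2$ penalty splits it equally between them.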

library(MASS)

## Scale the predicting variables and the response variable
## (takers, rank, income, years, public, expend and sat are assumed
## to be available in the workspace)
ltakers = log(takers)
predictors = cbind(ltakers, rank, income, years, public, expend)
predictors = scale(predictors)
sat.scaled = scale(sat)

## Apply ridge regression for a range of penalty constants
lambda = seq(0, 10, by=0.25)
out = lm.ridge(sat.scaled~predictors, lambda=lambda)
round(out$GCV, 5)

## Index of the lambda with the smallest GCV score
which(out$GCV == min(out$GCV))
## 10

## Coefficients at the selected lambda (lambda[10] = 2.25)
round(out$coef[,10], 4)
## predictorsltakers   predictorsrank predictorsincome  predictorsyears predictorspublic
##           -0.4771           0.4195           0.0223           0.1796              -0.
## predictorsexpend

Ridge Regression

The ridge regression outputs estimates for each lambda in the considered range (not shown).

The lambda is selected to minimize the (generalized) CV score.

## Plot the path of each regression coefficient against lambda
plot(lambda, out$coef[1,], type = "l", col=1, lwd=3,
     xlab = "Lambda", ylab = "Coefficients",
     main = "Plot of Regression Coefficients vs. Lambda\nPenalty Ridge Regression",
     ylim = c(min(out$coef), max(out$coef)))
for(i in 2:6)
  points(lambda, out$coef[i,], type = "l", col=i, lwd=3)

## Reference lines: zero coefficient and the selected lambda
abline(h = 0, lty = 2, lwd = 3)
abline(v = 2.25, lty = 2, lwd=3)

Ridge Regression