In this lab, we introduce penalized methods for variable selection in linear regression. Note that these methods are not limited to linear regression; we will see them again in logistic regression.

Least Absolute Shrinkage and Selection Operator (LASSO)

LASSO is probably one of the greatest revolutions in statistics. It addresses an important statistical problem, dimension reduction, by performing estimation and variable selection simultaneously. You may find more details on the author's webpage http://statweb.stanford.edu/~tibs/lasso.html. The original paper is also worth reading.

In LASSO regression, we minimize the following objective function

\[\frac{1}{n}\sum_{i=1}^{n}(Y_i-\boldsymbol{\beta}^TX_i)^2+\lambda\sum_{j=1}^{p}|\beta_j|,\] where \(\lambda\) is the shrinkage tuning parameter that controls the model sparsity.
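
To make the two terms concrete, here is a toy sketch (not part of the lab code) that simply evaluates this objective for a given coefficient vector. The helper name lasso_obj is just for illustration; glmnet uses a slightly different internal scaling and does not penalize the intercept.

# toy illustration of the objective above: average squared error plus the L1 penalty
lasso_obj <- function(beta, X, Y, lambda){
  mean((Y - X %*% beta)^2) + lambda*sum(abs(beta))
}
# a larger lambda puts more weight on sum(abs(beta)), favoring sparser coefficient vectors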

Data preparation

  • Creating the design matrix \(\mathbf{X}\): Most packages for penalized methods do not work with data frames. That is, we need to define the matrix \(\mathbf{X}\) and the response vector \(Y\). For categorical variables, we need to handle the dummy conversion manually. If we suspect there are higher-order terms or interactions, we also have to add these variables to \(\mathbf{X}\) manually.

  • Standardization of \(\mathbf{X}\): Unlike classical OLS, penalized methods require the covariates in \(\mathbf{X}\) to be standardized.

We use the Boston housing data for demonstration.

library(MASS)
data(Boston)

First, get the original design matrix \(X\) with dummy conversion and standardization.

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Y <- Boston$medv
X <- dplyr::select(Boston, -medv)%>%
  mutate(rad=as.factor(rad), across(where(is.numeric), scale))
Xmat <- model.matrix(~., X)
Xmat <- Xmat[,-1] #We do not need the first column (intercept).

Second, add higher-order terms and interaction terms if needed.

Xmat.df <- mutate(data.frame(Xmat), lstat2=lstat^2, rm2=rm^2, chas_crim=chas*crim)  
# note that mutate has to work with a data frame, so we have a middle step here
Xmat <- as.matrix(Xmat.df)
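
As a quick optional check (not in the original lab code), we can inspect the dimensions and column names of the final design matrix before fitting anything:

dim(Xmat)       # number of observations and number of (dummy-expanded) predictors
colnames(Xmat)  # includes the rad dummies and the added lstat2, rm2, chas_crim terms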

Now the data is ready. Next we fit the Lasso regression.

Fitting Lasso Model

install.packages("glmnet")
library(glmnet)

glmnet is one of the most popular R packages, and it comes with a very detailed help document.

?glmnet
lasso.fit<- glmnet(x=Xmat, y=Y)
plot(lasso.fit, xvar = "lambda")

By default, the glmnet() function fits the model along a sequence of up to 100 values of \(\lambda\); the path can stop early, which is why we see 81 values below.

lasso.fit$lambda
##  [1] 6.777653645 6.175545575 5.626927126 5.127046429 4.671573755 4.256564018
##  [7] 3.878422602 3.533874228 3.219934581 2.933884467 2.673246257 2.435762428
## [13] 2.219376007 2.022212759 1.842564951 1.678876559 1.529729793 1.393832814
## [19] 1.270008549 1.157184489 1.054383409 0.960714893 0.875367629 0.797602382
## [25] 0.726745585 0.662183510 0.603356952 0.549756383 0.500917541 0.456417407
## [31] 0.415870543 0.378925750 0.345263031 0.314590815 0.286643434 0.261178821
## [37] 0.237976415 0.216835246 0.197572200 0.180020430 0.164027912 0.149456124
## [43] 0.136178854 0.124081100 0.113058077 0.103014309 0.093862802 0.085524289
## [49] 0.077926547 0.071003767 0.064695988 0.058948575 0.053711746 0.048940143
## [55] 0.044592435 0.040630966 0.037021423 0.033732542 0.030735836 0.028005349
## [61] 0.025517431 0.023250533 0.021185019 0.019303001 0.017588175 0.016025690
## [67] 0.014602012 0.013304810 0.012122847 0.011045887 0.010064601 0.009170489
## [73] 0.008355808 0.007613501 0.006937139 0.006320862 0.005759334 0.005247691
## [79] 0.004781501 0.004356725 0.003969686
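
If you want a different grid, glmnet() also accepts the nlambda and lambda arguments. A small sketch (object names and values here are illustrative, not part of the lab):

lasso.fit.coarse <- glmnet(x=Xmat, y=Y, nlambda = 20)                                # shorter path
lasso.fit.custom <- glmnet(x=Xmat, y=Y, lambda = exp(seq(2, -6, length.out = 50)))   # user-supplied decreasing grid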

We can extract the estimated coefficients at different values of the tuning parameter.

coef_lamb1 <- coef(lasso.fit, s=0.1)
coef_lamb2 <- coef(lasso.fit, s=0.5)
coef_lamb3 <- coef(lasso.fit, s=1)
cbind(coef_lamb1, coef_lamb2, coef_lamb3)
## 24 x 3 sparse Matrix of class "dgCMatrix"
##                      s1            s1           s1
## (Intercept) 20.53783284 21.1568696414 21.593266282
## crim        -0.21939355 -0.3695231751 -0.143985473
## zn           0.18400422  .             .          
## indus        .           .             .          
## chas         1.13351691  0.3827143794  .          
## nox         -1.49320306 -0.0001531678  .          
## rm           2.27674197  2.3961734746  2.231833453
## age          .           .             .          
## dis         -1.84578607  .             .          
## rad2        -0.02284459  .             .          
## rad3         1.76459253  0.6688851661  .          
## rad4         .           .             .          
## rad5         0.07095061  .             .          
## rad6        -0.34816998  .             .          
## rad7         0.59758640  .             .          
## rad8         0.02608297  .             .          
## rad24        0.50111952  .             .          
## tax         -0.02842682 -0.0032983556 -0.004857795
## ptratio     -1.47644962 -1.1550696248 -1.026681198
## black        0.50430519  0.3752692834  0.107460935
## lstat       -4.73483557 -4.1488009182 -3.864477511
## lstat2       0.81215419  0.2547205888  .          
## rm2          1.04768594  1.0939005367  0.941400518
## chas_crim    2.56183971  0.3630559021  .

What could be some potential issues here?

Choosing the optimal tuning parameter

A natural question is which \(\lambda\) we should use. Recall cross-validation: the glmnet package provides a cross-validation routine for selecting the optimal tuning parameter.

cv.lasso<- cv.glmnet(x=Xmat, y=Y)
plot(cv.lasso)

The left bar is the \(\lambda\) that gives the smallest cross-validation error (mean squared error). You can extract that \(\lambda\) and the corresponding \(\hat{\boldsymbol{\beta}}\).

cv.lasso$lambda.min
## [1] 0.008355808
coef.lambmin <- coef(lasso.fit, s=cv.lasso$lambda.min)

The right bar is the largest \(\lambda\) whose cross-validation error (mean squared error) is within one standard error of the smallest CV error. You can extract that \(\lambda\) by using

cv.lasso$lambda.1se
## [1] 0.345263
coef.lamb1se <- coef(lasso.fit, s=cv.lasso$lambda.1se)

Obtaining model assessment measures

Now we can examine the two optimal models suggested by cross-validation. For each model, we can compute commonly used measures for a linear model.

MSE

# first get predicted Y 
# note that we use the same predict() function, but the input is slightly different from the one we used for lm.
pred.lambmin <- as.vector(predict(lasso.fit, newx = Xmat, s=cv.lasso$lambda.min))
# then figure out degrees of freedom
df.lambmin <- nrow(Xmat)-sum(coef.lambmin!=0)
# then compute mse
rss.lambmin <- sum((pred.lambmin-Y)^2)
mse.lambmin <- rss.lambmin/df.lambmin

Adjusted R^2

n <- nrow(X)
tss <- var(Y)*(n-1)
adjR2 <- 1-(n-1)/df.lambmin*rss.lambmin/tss

Cross validation error

ind.lambmin <- cv.lasso$index[1]
cv.lambmin <- cv.lasso$cvm[ind.lambmin]
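
An equivalent way to look up the same value (just a cross-check, not required) is to match lambda.min against the lambda sequence stored in the cv.glmnet object:

cv.lasso$cvm[which(cv.lasso$lambda == cv.lasso$lambda.min)]  # should equal cv.lambmin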

Exercise

  1. Obtain the MSE, adjusted \(R^2\), and CV error for the other optimal model (lambda.1se).
  2. Randomly split the data into training (80%) and testing (20%) sets, conduct the training-testing procedure, and obtain the MSPE and out-of-sample \(R^2\) (oos-\(R^2\)).



High-dimensional regression (Gene expression data example)

Ordinary least squares fails when the dimensionality is greater than the sample size, because the coefficients are then unidentifiable. However, with a constraint such as the LASSO penalty, the optimization problem still has feasible solutions.

genedata<- read.csv("https://www.dropbox.com/s/idok8ibbcw5h5ql/gene_exp.csv?dl=1")
dim(genedata)
## [1]  60 501

Ordinary least squares does not work here: lm() cannot estimate more coefficients than there are observations, so most of them come back as NA. LASSO, in contrast, still produces a fit.

lm.fit.high<- lm(Y~., data = genedata)
lm.fit.high
lasso.fit.high<- glmnet(x= as.matrix(genedata[,-1]), y=genedata[,1])
cv.lasso.fit.high<- cv.glmnet(x= as.matrix(genedata[,-1]), y=genedata[,1])
plot(cv.lasso.fit.high)

# obtain coefficient estimates
coef.high<- as.vector(coef(lasso.fit.high, s=cv.lasso.fit.high$lambda.1se))
coef.high[which(coef.high!=0)]
##  [1] -3.693333e-01  4.529087e-02  1.141203e-05 -9.399617e-02 -1.089989e-01
##  [6]  1.431998e-01 -9.154978e-03 -6.068309e-02 -3.527416e-02  4.588481e-02
## [11]  2.097038e-02 -4.339215e-02 -9.996550e-02 -7.367877e-02 -4.963812e-02
## [16] -3.069497e-02  6.459509e-02 -2.648972e-02 -1.359200e-02 -2.924191e-02
## [21] -6.638786e-02  3.945393e-04  4.015780e-02  1.547374e-02 -6.058680e-02
## [26] -8.623803e-04 -2.237364e-02 -3.267342e-02 -3.568744e-02  1.746737e-02
## [31] -4.436771e-02  3.614462e-03  7.637187e-02  1.731131e-02  3.213579e-02
## [36] -1.929614e-02 -2.431746e-04 -3.984807e-02  7.626571e-02  8.309433e-03
colnames(genedata[,-1])[which(coef.high!=0)]
##  [1] "X1451557_at"   "X1418706_at"   "X1416125_at"   "X1418123_at"  
##  [5] "X1417709_at"   "X1423523_at"   "X1416424_at"   "X1418616_at"  
##  [9] "X1416374_at"   "X1451488_at"   "X1415991_a_at" "X1421092_at"  
## [13] "X1438654_x_at" "X1455106_a_at" "X1453281_at"   "X1424683_at"  
## [17] "X1419213_at"   "X1423887_a_at" "X1419449_a_at" "X1449323_a_at"
## [21] "X1451160_s_at" "X1449936_at"   "X1424352_at"   "X1436243_at"  
## [25] "X1420966_at"   "X1429947_a_at" "X1430127_a_at" "X1419510_at"  
## [29] "X1416528_at"   "X1449248_at"   "X1422917_at"   "X1449112_at"  
## [33] "X1449565_at"   "X1448185_at"   "X1449009_at"   "X1435981_at"  
## [37] "X1416656_at"   "X1422821_s_at" "X1454706_at"   "X1421205_at"



Ridge regression

Ridge regression uses the \(L_2\) penalty. Its objective function has the following form \[\frac{1}{n}\sum_{i=1}^{n}(Y_i-\boldsymbol{\beta}^TX_i)^2+\lambda\sum_{j=1}^{p}\beta_j^2.\] We know that ridge regression shrinks coefficients toward zero but does not set them exactly to zero, so ridge regression does not perform variable selection. Its benefit is that it can mitigate multicollinearity, and the resulting estimator has smaller variance at the cost of some bias.

In the glmnet package, alpha=1 corresponds to the \(L_1\) penalty (LASSO, the default), and alpha=0 corresponds to the \(L_2\) penalty (ridge regression).

ridge.fit<- glmnet(x=Xmat, y=Y, family = "gaussian", alpha = 0)
plot(ridge.fit, xvar = "lambda")

cv.ridge<- cv.glmnet(x=Xmat, y=Y, family = "gaussian", alpha = 0, nfolds = 10)
plot(cv.ridge)

Similarly, we can get predictions given a specific lambda.

pred.ridge.min<- predict(ridge.fit, newx = Xmat, s=cv.ridge$lambda.min)
rss.min <- sum((Y-pred.ridge.min)^2)
pred.ridge.1se<- predict(ridge.fit, newx = Xmat, s=cv.ridge$lambda.1se)
rss.1se <- sum((Y-pred.ridge.1se)^2)
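
To see the sparsity difference mentioned above (a quick check, not part of the original lab code), we can count the exact-zero coefficients of the two fits at their respective lambda.1se values:

sum(as.vector(coef(ridge.fit, s=cv.ridge$lambda.1se)) == 0)  # ridge: typically none
sum(as.vector(coef(lasso.fit, s=cv.lasso$lambda.1se)) == 0)  # LASSO: several exact zeros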



Bias-variance tradeoff – a Monte Carlo simulation

I leave this as an exercise for advanced learning.

First, generate a testing set.

n=50
p=45
sigma= 3
set.seed(2020)
beta=c(runif(30, -2,2), rep(0, p-30))
X.test<- matrix(rnorm(n*p), nrow = n)
Y.test<- X.test%*%beta+rnorm(n, 0, sigma)
EY.test<- X.test%*%beta

Second, conduct a Monte Carlo simulation for ridge regression estimation and prediction.

m=500
lambda= exp(seq(-4,6,length.out = 100))
pred.ridge= matrix(0, length(lambda)*n, m)
for(k in 1:m){
  
}
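
If you get stuck, here is one possible sketch of the loop body (under the setup above; other designs are also reasonable): in each replication, draw a new training set from the same model, fit ridge regression on the supplied lambda grid, predict on the fixed X.test, and store the stacked predictions as one column of pred.ridge.

for(k in 1:m){
  X.train <- matrix(rnorm(n*p), nrow = n)
  Y.train <- X.train %*% beta + rnorm(n, 0, sigma)
  fit.k <- glmnet(x = X.train, y = Y.train, alpha = 0, lambda = sort(lambda, decreasing = TRUE))
  pred.k <- predict(fit.k, newx = X.test, s = lambda)  # n x length(lambda) predictions
  pred.ridge[, k] <- as.vector(pred.k)                 # n rows per lambda, matching rep(Y.test, length(lambda))
}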

Now calculate the MSE, bias, and variance at the aggregated level.

## calculate RSS and aggregate MSE (two level aggregation)
raw.RSS= (pred.ridge-rep(Y.test, length(lambda)))^2

## Calculate Bias^2 and aggregate

## calculate variance and aggregate

Finally, generate the plot.