In this lab, we introduce penalized methods for variable selection in linear regression. Note that these methods are not limited to linear regression; we will see them again in logistic regression.
LASSO is probably one of the greatest revolutions in statistics. It addresses an important statistical problem, dimension reduction, and performs estimation and variable selection simultaneously. You may find more details on the author's webpage http://statweb.stanford.edu/~tibs/lasso.html. The original paper is also worth reading.
In LASSO regression, we minimize the following objective function
\[\frac{1}{n}\sum_{i=1}^{n}(Y_i-\boldsymbol{\beta}^TX_i)^2+\lambda\sum_{j=1}^{p}|\beta_j|,\] where \(\lambda\) is the shrinkage tuning parameter that controls the model sparsity.
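To see how \(\lambda\) controls sparsity, it helps to look at a standard special case (an illustrative assumption, not the setting of the data below): with an orthonormal design satisfying \(\frac{1}{n}\mathbf{X}^T\mathbf{X}=\mathbf{I}\), the LASSO estimate is a soft-thresholded version of the OLS estimate,
\[\hat{\beta}_j^{lasso}=\mathrm{sign}(\hat{\beta}_j^{OLS})\left(|\hat{\beta}_j^{OLS}|-\frac{\lambda}{2}\right)_{+},\]
so any coefficient whose OLS estimate is below \(\lambda/2\) in absolute value is set exactly to zero, and increasing \(\lambda\) gives a sparser model.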
Creating the design matrix \(\mathbf{X}\): Most packages for penalized methods do not work with data frames. That is, we need to define the matrix \(\mathbf{X}\) ourselves, along with the response vector \(Y\). For categorical variables, we need to handle the dummy conversion manually. If we suspect there are higher-order terms or interactions, we also have to add these variables to \(\mathbf{X}\) manually.
Standardization of \(\mathbf{X}\): Unlike classical OLS, penalized methods require the covariates \(\mathbf{X}\) to be standardized, because the penalty depends on the scale of each coefficient.
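As a quick illustration of what standardization does (a minimal sketch using the built-in mtcars data rather than the data below), scale() centers each column and divides it by its standard deviation. Note that glmnet() also standardizes internally by default (standardize = TRUE).
Z <- scale(mtcars)        # center each column, then divide by its sample standard deviation
round(colMeans(Z), 10)    # column means are now (numerically) zero
apply(Z, 2, sd)           # column standard deviations are now one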
We use the Boston housing data for demonstration.
library(MASS)
data(Boston)
First, get the original design matrix \(X\) with dummy conversion and standardization.
library(dplyr)
Y <- Boston$medv
X <- dplyr::select(Boston, -medv)%>%
mutate(rad=as.factor(rad), across(where(is.numeric), scale))
Xmat <- model.matrix(~., X)
Xmat <- Xmat[,-1] #We do not need the first column (intercept).
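Before moving on, it can help to confirm what model.matrix() produced; a quick optional check:
dim(Xmat)        # rows = observations; columns = numeric predictors plus the rad dummies
colnames(Xmat)   # the factor rad appears as dummy columns rad2, rad3, ..., rad24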
Second, add higher-order terms and interaction terms if we need them.
Xmat.df <- mutate(data.frame(Xmat), lstat2=lstat^2, rm2=rm^2, chas_crim=chas*crim)
# note that mutate has to work with a data frame, so we have a middle step here
Xmat <- as.matrix(Xmat.df)
Now the data are ready. Next, we fit the LASSO regression.
install.packages("glmnet")  # run this once if glmnet is not yet installed
library(glmnet)
glmnet is one of the most popular R packages, and it comes with very detailed help documentation.
?glmnet
lasso.fit<- glmnet(x=Xmat, y=Y)
plot(lasso.fit, xvar = "lambda")
By default, the glmnet() function fits the model over a sequence of up to 100 values of \(\lambda\); it may stop the path early, so fewer values can appear in practice.
lasso.fit$lambda
## [1] 6.777653645 6.175545575 5.626927126 5.127046429 4.671573755 4.256564018
## [7] 3.878422602 3.533874228 3.219934581 2.933884467 2.673246257 2.435762428
## [13] 2.219376007 2.022212759 1.842564951 1.678876559 1.529729793 1.393832814
## [19] 1.270008549 1.157184489 1.054383409 0.960714893 0.875367629 0.797602382
## [25] 0.726745585 0.662183510 0.603356952 0.549756383 0.500917541 0.456417407
## [31] 0.415870543 0.378925750 0.345263031 0.314590815 0.286643434 0.261178821
## [37] 0.237976415 0.216835246 0.197572200 0.180020430 0.164027912 0.149456124
## [43] 0.136178854 0.124081100 0.113058077 0.103014309 0.093862802 0.085524289
## [49] 0.077926547 0.071003767 0.064695988 0.058948575 0.053711746 0.048940143
## [55] 0.044592435 0.040630966 0.037021423 0.033732542 0.030735836 0.028005349
## [61] 0.025517431 0.023250533 0.021185019 0.019303001 0.017588175 0.016025690
## [67] 0.014602012 0.013304810 0.012122847 0.011045887 0.010064601 0.009170489
## [73] 0.008355808 0.007613501 0.006937139 0.006320862 0.005759334 0.005247691
## [79] 0.004781501 0.004356725 0.003969686
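To check how many \(\lambda\) values were actually used here:
length(lasso.fit$lambda)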
We can get the estimated coefficients at different values of the tuning parameter.
coef_lamb1 <- coef(lasso.fit, s=0.1)
coef_lamb2 <- coef(lasso.fit, s=0.5)
coef_lamb3 <- coef(lasso.fit, s=1)
cbind(coef_lamb1, coef_lamb2, coef_lamb3)
## 24 x 3 sparse Matrix of class "dgCMatrix"
## s1 s1 s1
## (Intercept) 20.53783284 21.1568696414 21.593266282
## crim -0.21939355 -0.3695231751 -0.143985473
## zn 0.18400422 . .
## indus . . .
## chas 1.13351691 0.3827143794 .
## nox -1.49320306 -0.0001531678 .
## rm 2.27674197 2.3961734746 2.231833453
## age . . .
## dis -1.84578607 . .
## rad2 -0.02284459 . .
## rad3 1.76459253 0.6688851661 .
## rad4 . . .
## rad5 0.07095061 . .
## rad6 -0.34816998 . .
## rad7 0.59758640 . .
## rad8 0.02608297 . .
## rad24 0.50111952 . .
## tax -0.02842682 -0.0032983556 -0.004857795
## ptratio -1.47644962 -1.1550696248 -1.026681198
## black 0.50430519 0.3752692834 0.107460935
## lstat -4.73483557 -4.1488009182 -3.864477511
## lstat2 0.81215419 0.2547205888 .
## rm2 1.04768594 1.0939005367 0.941400518
## chas_crim 2.56183971 0.3630559021 .
What could be some potential issues here?
A natural question is which \(\lambda\) we should use. Recall cross-validation: the glmnet package provides cross-validation for selecting the optimal tuning parameter.
cv.lasso<- cv.glmnet(x=Xmat, y=Y)
plot(cv.lasso)
The left vertical bar marks the \(\lambda\) that gives the smallest cross-validation error (mean squared error). You can extract that \(\lambda\) and the corresponding \(\hat{\boldsymbol{\beta}}\).
cv.lasso$lambda.min
## [1] 0.008355808
coef.lambmin <- coef(lasso.fit, s=cv.lasso$lambda.min)
The right vertical bar marks the largest \(\lambda\) whose cross-validation error (mean squared error) is within one standard error of the smallest CV error. You can extract that \(\lambda\) by using
cv.lasso$lambda.1se
## [1] 0.345263
coef.lamb1se <- coef(lasso.fit, s=cv.lasso$lambda.1se)
Now we can look at the two optimal models suggested by cross-validation. For each model, we can compute commonly used measures for a linear model.
# first get predicted Y
# note that we use the same predict() function, but the input is slightly different from the one we used for lm.
pred.lambmin <- as.vector(predict(lasso.fit, newx = Xmat, s=cv.lasso$lambda.min))
# then figure out the residual degrees of freedom: n minus the number of nonzero coefficients (including the intercept)
df.lambmin <- nrow(Xmat)-sum(coef.lambmin!=0)
# then compute the residual sum of squares and the MSE
rss.lambmin <- sum((pred.lambmin-Y)^2)
mse.lambmin <- rss.lambmin/df.lambmin
# adjusted R-squared
n <- nrow(X)
tss <- var(Y)*(n-1)
adjR2 <- 1-(n-1)/df.lambmin*rss.lambmin/tss
# cross-validation error at lambda.min
ind.lambmin <- cv.lasso$index[1]
cv.lambmin <- cv.lasso$cvm[ind.lambmin]
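The same measures can be computed for the lambda.1se model by mirroring the code above; a sketch (the object names follow the same pattern and are only illustrative, and the second entry of cv.lasso$index points to lambda.1se):
pred.lamb1se <- as.vector(predict(lasso.fit, newx = Xmat, s=cv.lasso$lambda.1se))
df.lamb1se <- nrow(Xmat)-sum(coef.lamb1se!=0)
rss.lamb1se <- sum((pred.lamb1se-Y)^2)
mse.lamb1se <- rss.lamb1se/df.lamb1se
adjR2.1se <- 1-(n-1)/df.lamb1se*rss.lamb1se/tss
cv.lamb1se <- cv.lasso$cvm[cv.lasso$index[2]]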
You can use glmnet() for elastic net by setting alpha to any number between 0 and 1 (a one-line example is given below).
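For instance, using the Xmat and Y constructed above, an elastic-net fit with an equal mix of the two penalties would be:
enet.fit <- glmnet(x = Xmat, y = Y, alpha = 0.5)  # alpha between 0 and 1 mixes the L1 and L2 penalties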
You can also use glmnet() for the adaptive lasso. However, you need two steps here. First, get estimates from a ridge regression; second, use them as the weights for the lasso. Specifically, you need to set penalty.factor = log(n)/n*abs(best_ridge_coef), where best_ridge_coef is the vector of coefficient estimates from the ridge regression (a sketch is given below).
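A minimal sketch of that two-step recipe, reusing Xmat and Y from the Boston example (the object names are only illustrative):
n <- nrow(Xmat)
cv.ridge.init <- cv.glmnet(x = Xmat, y = Y, alpha = 0)                    # step 1: ridge regression
best_ridge_coef <- as.vector(coef(cv.ridge.init, s = "lambda.min"))[-1]   # drop the intercept
alasso.fit <- glmnet(x = Xmat, y = Y, alpha = 1,
                     penalty.factor = log(n)/n*abs(best_ridge_coef))      # step 2: weighted lasso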
There is an R package, ncvreg, which performs several types of non-concave penalized regression. Here is a useful tutorial.
Group lasso is a type of estimator used when the predictors have a group structure. It is commonly used when categorical variables are converted into dummy variables. In this case, if the categorical variable is useful, then all of the corresponding dummy variables should be selected (nonzero \(\hat{\beta}\)); otherwise, all of those dummy variables should be deselected (zero \(\hat{\beta}\)). As we saw earlier, the classical lasso does not have this feature. I leave this to the following exercise.
There is an R package, gglasso, which performs the group lasso.
Ordinary least squares fails when the dimensionality is greater than the sample size, because the model is not identifiable. However, with constraints such as the LASSO penalty, there are feasible solutions to the optimization problem.
genedata<- read.csv("https://www.dropbox.com/s/idok8ibbcw5h5ql/gene_exp.csv?dl=1")
dim(genedata)
## [1] 60 501
Ordinary least squares does not work here.
lm.fit.high<- lm(Y~., data = genedata)  # many coefficients are returned as NA because p > n
lm.fit.high
lasso.fit.high<- glmnet(x= as.matrix(genedata[,-1]), y=genedata[,1])
cv.lasso.fit.high<- cv.glmnet(x= as.matrix(genedata[,-1]), y=genedata[,1])
plot(cv.lasso.fit.high)
# obtain coefficient estimates; the first element of coef.high is the intercept
coef.high<- as.vector(coef(lasso.fit.high, s=cv.lasso.fit.high$lambda.1se))
coef.high[which(coef.high!=0)]
## [1] -3.693333e-01 4.529087e-02 1.141203e-05 -9.399617e-02 -1.089989e-01
## [6] 1.431998e-01 -9.154978e-03 -6.068309e-02 -3.527416e-02 4.588481e-02
## [11] 2.097038e-02 -4.339215e-02 -9.996550e-02 -7.367877e-02 -4.963812e-02
## [16] -3.069497e-02 6.459509e-02 -2.648972e-02 -1.359200e-02 -2.924191e-02
## [21] -6.638786e-02 3.945393e-04 4.015780e-02 1.547374e-02 -6.058680e-02
## [26] -8.623803e-04 -2.237364e-02 -3.267342e-02 -3.568744e-02 1.746737e-02
## [31] -4.436771e-02 3.614462e-03 7.637187e-02 1.731131e-02 3.213579e-02
## [36] -1.929614e-02 -2.431746e-04 -3.984807e-02 7.626571e-02 8.309433e-03
colnames(genedata[,-1])[which(coef.high[-1]!=0)]  # drop the intercept so the indices line up with the gene columns
## [1] "X1451557_at" "X1418706_at" "X1416125_at" "X1418123_at"
## [5] "X1417709_at" "X1423523_at" "X1416424_at" "X1418616_at"
## [9] "X1416374_at" "X1451488_at" "X1415991_a_at" "X1421092_at"
## [13] "X1438654_x_at" "X1455106_a_at" "X1453281_at" "X1424683_at"
## [17] "X1419213_at" "X1423887_a_at" "X1419449_a_at" "X1449323_a_at"
## [21] "X1451160_s_at" "X1449936_at" "X1424352_at" "X1436243_at"
## [25] "X1420966_at" "X1429947_a_at" "X1430127_a_at" "X1419510_at"
## [29] "X1416528_at" "X1449248_at" "X1422917_at" "X1449112_at"
## [33] "X1449565_at" "X1448185_at" "X1449009_at" "X1435981_at"
## [37] "X1416656_at" "X1422821_s_at" "X1454706_at" "X1421205_at"
Ridge regression uses the \(L_2\) penalty. Its objective function has the following form \[\frac{1}{n}\sum_{i=1}^{n}(Y_i-\boldsymbol{\beta}^TX_i)^2+\lambda\sum_{j=1}^{p}\beta_j^2.\] We know that ridge regression shrinks coefficients toward zero but does not shrink them to exactly zero, so ridge regression is not a variable selection method. The benefit of ridge regression is that it reduces the impact of multicollinearity, and the estimated model has a smaller variance at the cost of some bias.
In the glmnet package, alpha=1 means the \(L_1\) penalty, which is LASSO, and alpha=0 means the \(L_2\) penalty, which is ridge regression.
ridge.fit<- glmnet(x=Xmat, y=Y, family = "gaussian", alpha = 0)
plot(ridge.fit, xvar = "lambda")
cv.ridge<- cv.glmnet(x=Xmat, y=Y, family = "gaussian", alpha = 0, nfolds = 10)
plot(cv.ridge)
Similarly, we can get predictions for a specific value of \(\lambda\).
pred.ridge.min<- predict(ridge.fit, newx = Xmat, s=cv.ridge$lambda.min)
rss.min <- sum((Y-pred.ridge.min)^2)
pred.ridge.1se<- predict(ridge.fit, newx = Xmat, s=cv.ridge$lambda.1se)
rss.1se <- sum((Y-pred.ridge.1se)^2)
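We can also verify the claim above that ridge regression does not set coefficients exactly to zero (a quick check on the fit obtained above):
ridge.coef.min <- as.vector(coef(ridge.fit, s = cv.ridge$lambda.min))[-1]  # drop the intercept
sum(ridge.coef.min == 0)  # expected to be 0, since ridge does not produce exact zeros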
I leave the following, a simulation study of the bias-variance trade-off of ridge regression over a grid of \(\lambda\) values, as an exercise for advanced learning.
n=50
p=45
sigma= 3
set.seed(2020)
beta=c(runif(30, -2,2), rep(0, p-30))          # true coefficients: 30 nonzero, the rest zero
X.test<- matrix(rnorm(n*p), nrow = n)          # fixed test design
Y.test<- X.test%*%beta+rnorm(n, 0, sigma)      # observed test responses
EY.test<- X.test%*%beta                        # true mean of the test responses
m=500                                          # number of Monte Carlo replications
lambda= exp(seq(-4,6,length.out = 100))        # grid of tuning parameters
pred.ridge= matrix(0, length(lambda)*n, m)     # predictions stacked over lambda, one column per replication
for(k in 1:m){
  # Exercise: in each replication, generate a new training data set from the same
  # model, fit ridge regression over the lambda grid, and store the predictions
  # for X.test (stacked over lambda) in column k of pred.ridge.
}
## calculate RSS and aggregate MSE (two level aggregation)
raw.RSS= (pred.ridge-rep(Y.test, length(lambda)))^2
## Calculate Bias^2 and aggregate
## calculate variance and aggregate