In this lab, we introduce different techniques of variable selection for linear regression. Note that these methods are not limited to linear regression; we will see them again in the following lectures and labs.

Get the data ready

library(MASS)
data(Boston)
set.seed(1)  # fix the random seed so the train/test split is reproducible
index <- sample(nrow(Boston), nrow(Boston)*0.90)
Boston.train <- Boston[index,]
Boston.test <- Boston[-index,]

Classical Methods

Recall the different models we obtained from lab 3.

model0<- lm(medv~lstat, data = Boston.train)
model1<- lm(medv~., data=Boston.train)
model2<- lm(medv~. -indus -age, data=Boston.train)

Compare Model Fit (AIC and BIC)

AIC(model0); BIC(model0)
AIC(model1); BIC(model1)
AIC(model2); BIC(model2)

Exercise: Compare MSE, \(R^2\), and MSPE of these three models.
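As a hint, one way to compute these quantities for model0 is sketched below; it assumes the Boston.train/Boston.test split and model0 defined earlier in this lab.

```r
# In-sample measures from the summary of model0
model0.sum  <- summary(model0)
model0.mse  <- model0.sum$sigma^2        # MSE = SSE/(n - p), following Kutner et al.
model0.rsq  <- model0.sum$r.squared      # in-sample R^2

# Out-of-sample: mean squared prediction error (MSPE) on the test set
pred0       <- predict(model0, newdata = Boston.test)
model0.mspe <- mean((Boston.test$medv - pred0)^2)
```

The same three lines of code apply to model1 and model2 after swapping in the corresponding fitted object.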

Forward/Backward/Stepwise Regression Using AIC

To perform the Forward/Backward/Stepwise Regression in R, we need to define the starting points:

nullmodel<- lm(medv~1, data=Boston.train)
fullmodel<- lm(medv~., data=Boston.train)

nullmodel is the model with no variables in it (intercept only), while fullmodel is the model with every variable in it.

Backward Elimination

model.step.b<- step(fullmodel,direction='backward')
## Start:  AIC=1452.6
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + 
##     tax + ptratio + black + lstat
## 
##           Df Sum of Sq   RSS    AIC
## - age      1      0.09 10418 1450.6
## - indus    1      9.82 10428 1451.0
## <none>                 10418 1452.6
## - black    1    130.23 10548 1456.2
## - crim     1    187.67 10606 1458.7
## - chas     1    195.25 10613 1459.0
## - tax      1    239.10 10657 1460.9
## - zn       1    240.33 10658 1461.0
## - rad      1    447.86 10866 1469.8
## - nox      1    472.75 10891 1470.8
## - ptratio  1   1053.88 11472 1494.5
## - dis      1   1139.27 11557 1497.8
## - rm       1   1651.97 12070 1517.6
## - lstat    1   2459.05 12877 1547.0
## 
## Step:  AIC=1450.61
## medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax + 
##     ptratio + black + lstat
## 
##           Df Sum of Sq   RSS    AIC
## - indus    1      9.82 10428 1449.0
## <none>                 10418 1450.6
## - black    1    131.37 10549 1454.3
## - crim     1    187.63 10606 1456.7
## - chas     1    196.36 10614 1457.1
## - tax      1    239.17 10657 1458.9
## - zn       1    244.40 10662 1459.2
## - rad      1    449.31 10867 1467.8
## - nox      1    504.05 10922 1470.1
## - ptratio  1   1057.22 11475 1492.6
## - dis      1   1247.51 11666 1500.1
## - rm       1   1741.58 12160 1518.9
## - lstat    1   2771.21 13189 1555.9
## 
## Step:  AIC=1449.03
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + 
##     black + lstat
## 
##           Df Sum of Sq   RSS    AIC
## <none>                 10428 1449.0
## - black    1    128.39 10556 1452.6
## - crim     1    191.66 10620 1455.3
## - chas     1    207.63 10636 1456.0
## - zn       1    237.08 10665 1457.3
## - tax      1    245.71 10674 1457.6
## - rad      1    449.67 10878 1466.2
## - nox      1    505.30 10933 1468.6
## - ptratio  1   1050.25 11478 1490.7
## - dis      1   1373.07 11801 1503.3
## - rm       1   1731.77 12160 1516.9
## - lstat    1   2766.80 13195 1554.1

The default selection criterion is AIC, which corresponds to the argument k=2 (the default). We can change the selection criterion to BIC by specifying k=log(n), where n is the sample size (recall the definition of BIC).

Another argument, trace=1 (the default), prints the output of each step; setting trace=0 suppresses all printing.
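For example, the following runs backward elimination under BIC with the per-step output suppressed (a sketch; it assumes the fullmodel defined above):

```r
# backward elimination with BIC: set k = log(n); trace = 0 silences the steps
n <- nrow(Boston.train)
model.step.bic <- step(fullmodel, direction = 'backward', k = log(n), trace = 0)
```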

The object returned by step() is an ordinary lm/glm fit, so you can use functions such as summary() and predict() on it.
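For instance, we can summarize the model chosen by backward elimination and compute its out-of-sample MSPE (a sketch; it assumes model.step.b and the Boston.test set from above):

```r
summary(model.step.b)   # same output format as any other lm fit

# out-of-sample mean squared prediction error on the test set
pred.step <- predict(model.step.b, newdata = Boston.test)
mspe.step <- mean((Boston.test$medv - pred.step)^2)
mspe.step
```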

Forward Selection (Output Omitted)

model.step.f<- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel), direction='forward')

Stepwise Selection (Output Omitted)

model.step.s<- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel), direction='both')

Forward and stepwise selection are a little different from backward elimination. As you can see, we need to specify scope=list(lower=nullmodel, upper=fullmodel), because both methods start from the null model and need an upper bound on the set of models to search.

Exercise

  1. Compare the in-sample and out-of-sample performance of these reduced models.
  2. Conduct 10-fold cross validation on the full sample and compare the CV scores of the different models.
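As a hint for the second exercise, cv.glm() from the boot package performs K-fold cross-validation on a glm fit; the default Gaussian family gives linear regression. A sketch on the full sample, comparing the full model with the reduced model2 from above:

```r
library(boot)

# cv.glm() requires glm objects; with the default family these are the
# same linear regressions fitted earlier with lm()
full.glm    <- glm(medv ~ ., data = Boston)
reduced.glm <- glm(medv ~ . -indus -age, data = Boston)

set.seed(2)  # CV folds are random, so fix the seed
cv.full    <- cv.glm(data = Boston, glmfit = full.glm, K = 10)$delta[2]
cv.reduced <- cv.glm(data = Boston, glmfit = reduced.glm, K = 10)$delta[2]
c(cv.full, cv.reduced)   # CV scores (estimated mean squared error)
```

Here delta[2] is the bias-adjusted CV estimate of prediction error; delta[1] is the raw estimate.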

Note: For the pros and cons of variable/model selection using common fit statistics such as (adjusted) \(R^2\), MSE, AIC, and BIC, refer to Ch. 9 of “Applied Linear Regression Models” by Kutner et al.

Best Subset Regression

This approach is not as popular as the stepwise methods above, but you are encouraged to work through this section.

The ‘leaps’ package provides procedures for best subset regression.

install.packages('leaps')
library(leaps)

Which subset of variables should you include in order to minimize BIC?

# regsubsets() takes a data frame as input;
# nbest = 2 keeps the two best models of each size, nvmax = 13 allows all 13 predictors
model.subset<- regsubsets(medv~., data=Boston.train, nbest=2, nvmax=13)
sum.subset<- summary(model.subset)
sum.subset
## Subset selection object
## Call: regsubsets.formula(medv ~ ., data = Boston.train, nbest = 2, 
##     nvmax = 13)
## 13 Variables  (and intercept)
##         Forced in Forced out
## crim        FALSE      FALSE
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## 2 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           crim zn  indus chas nox rm  age dis rad tax ptratio black lstat
## 1  ( 1 )  " "  " " " "   " "  " " " " " " " " " " " " " "     " "   "*"  
## 1  ( 2 )  " "  " " " "   " "  " " "*" " " " " " " " " " "     " "   " "  
## 2  ( 1 )  " "  " " " "   " "  " " "*" " " " " " " " " " "     " "   "*"  
## 2  ( 2 )  " "  " " " "   " "  " " " " " " " " " " " " "*"     " "   "*"  
## 3  ( 1 )  " "  " " " "   " "  " " "*" " " " " " " " " "*"     " "   "*"  
## 3  ( 2 )  " "  " " " "   "*"  " " "*" " " " " " " " " " "     " "   "*"  
## 4  ( 1 )  " "  " " " "   " "  " " "*" " " "*" " " " " "*"     " "   "*"  
## 4  ( 2 )  " "  " " " "   "*"  " " "*" " " " " " " " " "*"     " "   "*"  
## 5  ( 1 )  " "  " " " "   " "  "*" "*" " " "*" " " " " "*"     " "   "*"  
## 5  ( 2 )  " "  " " " "   " "  " " "*" " " "*" " " " " "*"     "*"   "*"  
## 6  ( 1 )  " "  " " " "   "*"  "*" "*" " " "*" " " " " "*"     " "   "*"  
## 6  ( 2 )  " "  " " " "   " "  "*" "*" " " "*" " " " " "*"     "*"   "*"  
## 7  ( 1 )  " "  "*" " "   "*"  "*" "*" " " "*" " " " " "*"     " "   "*"  
## 7  ( 2 )  " "  " " " "   "*"  "*" "*" " " "*" " " " " "*"     "*"   "*"  
## 8  ( 1 )  " "  "*" " "   "*"  "*" "*" " " "*" " " " " "*"     "*"   "*"  
## 8  ( 2 )  "*"  "*" " "   "*"  "*" "*" " " "*" " " " " "*"     " "   "*"  
## 9  ( 1 )  "*"  "*" " "   " "  "*" "*" " " "*" "*" "*" "*"     " "   "*"  
## 9  ( 2 )  "*"  " " " "   "*"  "*" "*" " " "*" "*" "*" "*"     " "   "*"  
## 10  ( 1 ) "*"  "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     " "   "*"  
## 10  ( 2 ) " "  "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 11  ( 1 ) "*"  "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 11  ( 2 ) "*"  "*" "*"   "*"  "*" "*" " " "*" "*" "*" "*"     " "   "*"  
## 12  ( 1 ) "*"  "*" "*"   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 12  ( 2 ) "*"  "*" " "   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"  
## 13  ( 1 ) "*"  "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"
plot(model.subset, scale="bic")

Each row represents a model. A black cell indicates that the variable is included in the model, while a white cell indicates that it is not. The scale argument can be set to “Cp”, “adjr2”, “r2”, or “bic”.

We can get exactly the model we want.

# extract certain criteria for all evaluated models
sum.subset$bic
##  [1] -349.7094 -286.1470 -444.3650 -402.3445 -486.2507 -455.0146 -498.3736
##  [8] -492.7276 -518.0025 -503.7115 -523.5791 -518.5852 -524.2030 -523.2882
## [15] -524.7728 -523.3203 -525.6123 -524.7662 -528.9998 -526.2810 -528.4476
## [22] -523.1749 -522.7560 -522.3315 -516.6397
# which one is the best?
which.min(sum.subset$bic)
## [1] 19
# get the model (selected variables)
coef(model.subset, id=which.min(sum.subset$bic))
## (Intercept)        crim          zn        chas         nox          rm 
##  41.9654047  -0.1214356   0.0460730   2.7495145 -18.1879404   3.6257594 
##         dis         rad         tax     ptratio       lstat 
##  -1.5408278   0.2933702  -0.0123629  -0.9241449  -0.5694629

**Exercise:** Get the best model in terms of adjusted \(R^2\).
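As a hint, note that adjusted \(R^2\) is maximized rather than minimized, so the code mirrors the BIC example above with which.max() in place of which.min():

```r
# index of the best model by adjusted R^2 (larger is better)
best.adjr2 <- which.max(sum.subset$adjr2)
best.adjr2

# selected variables and their coefficients
coef(model.subset, id = best.adjr2)
```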
