In this lab, we introduce different techniques of variable selection for linear regression. Note that these methods are not limited to linear regression; we will see them again in later lectures and labs.
library(MASS)
data(Boston)
index <- sample(nrow(Boston), nrow(Boston)*0.90)  # randomly sample 90% of row indices
Boston.train <- Boston[index,]   # training set
Boston.test <- Boston[-index,]   # test set
Recall the different models we obtained from lab 3.
model0<- lm(medv~lstat, data = Boston.train)
model1<- lm(medv~., data=Boston.train)
model2<- lm(medv~. -indus -age, data=Boston.train)
AIC(model0); BIC(model0)
AIC(model1); BIC(model1)
AIC(model2); BIC(model2)
Exercise: Compare MSE, \(R^2\), and MSPE of these three models.
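As a starting point, here is a sketch for one of the three models (`model1`), assuming the `Boston.train`/`Boston.test` split defined above; the object name `pred1` is just an illustration. Repeat the same steps for `model0` and `model2`.

```r
# In-sample fit for model1 (the full model):
model1.summary <- summary(model1)
(model1.summary$sigma)^2      # MSE (squared residual standard error, one common definition)
model1.summary$r.squared      # R^2

# Out-of-sample prediction error (MSPE) on the test set:
pred1 <- predict(model1, newdata = Boston.test)
mean((Boston.test$medv - pred1)^2)
```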
To perform the Forward/Backward/Stepwise Regression in R, we need to define the starting points:
nullmodel<- lm(medv~1, data=Boston.train)
fullmodel<- lm(medv~., data=Boston.train)
nullmodel is the model with no variables in it (intercept only), while fullmodel is the model with every variable in it.
model.step.b<- step(fullmodel,direction='backward')
## Start: AIC=1452.6
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
## tax + ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - age 1 0.09 10418 1450.6
## - indus 1 9.82 10428 1451.0
## <none> 10418 1452.6
## - black 1 130.23 10548 1456.2
## - crim 1 187.67 10606 1458.7
## - chas 1 195.25 10613 1459.0
## - tax 1 239.10 10657 1460.9
## - zn 1 240.33 10658 1461.0
## - rad 1 447.86 10866 1469.8
## - nox 1 472.75 10891 1470.8
## - ptratio 1 1053.88 11472 1494.5
## - dis 1 1139.27 11557 1497.8
## - rm 1 1651.97 12070 1517.6
## - lstat 1 2459.05 12877 1547.0
##
## Step: AIC=1450.61
## medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax +
## ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - indus 1 9.82 10428 1449.0
## <none> 10418 1450.6
## - black 1 131.37 10549 1454.3
## - crim 1 187.63 10606 1456.7
## - chas 1 196.36 10614 1457.1
## - tax 1 239.17 10657 1458.9
## - zn 1 244.40 10662 1459.2
## - rad 1 449.31 10867 1467.8
## - nox 1 504.05 10922 1470.1
## - ptratio 1 1057.22 11475 1492.6
## - dis 1 1247.51 11666 1500.1
## - rm 1 1741.58 12160 1518.9
## - lstat 1 2771.21 13189 1555.9
##
## Step: AIC=1449.03
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
## black + lstat
##
## Df Sum of Sq RSS AIC
## <none> 10428 1449.0
## - black 1 128.39 10556 1452.6
## - crim 1 191.66 10620 1455.3
## - chas 1 207.63 10636 1456.0
## - zn 1 237.08 10665 1457.3
## - tax 1 245.71 10674 1457.6
## - rad 1 449.67 10878 1466.2
## - nox 1 505.30 10933 1468.6
## - ptratio 1 1050.25 11478 1490.7
## - dis 1 1373.07 11801 1503.3
## - rm 1 1731.77 12160 1516.9
## - lstat 1 2766.80 13195 1554.1
The default selection criterion is AIC, which corresponds to the argument k=2 (the default). We can change the selection criterion to BIC by specifying k=log(n), where n is the sample size (recall the definition of BIC). Another option, trace=1 (the default), prints the message of each step; setting it to 0 suppresses this output. The resulting object can be treated the same as lm/glm output, so you can use functions such as summary() and predict() on it.
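For example, to run backward elimination under BIC with the step-by-step output suppressed (the object name `model.step.bic` is just an illustration):

```r
n <- nrow(Boston.train)   # training sample size
# k = log(n) turns the AIC penalty of 2 per parameter into the BIC penalty
model.step.bic <- step(fullmodel, direction = 'backward', k = log(n), trace = 0)
summary(model.step.bic)
```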
model.step.f<- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel), direction='forward')
model.step.s<- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel), direction='both')
Forward and stepwise selection are a little different from backward elimination. As you can see, we need to specify scope=list(lower=nullmodel, upper=fullmodel), because both methods start from the null model and require an upper bound for the search.
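Once the searches finish, you can check whether the three directions agree (whether they select the same variables depends on the data and the random train/test split):

```r
# formula() extracts the final model formula from each fitted lm object
formula(model.step.b)
formula(model.step.f)
formula(model.step.s)

# out-of-sample check for one of them, e.g. the stepwise model:
mean((Boston.test$medv - predict(model.step.s, newdata = Boston.test))^2)
```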
Note: For the pros and cons of variable/model selection using common fit statistics ((adjusted) \(R^2\), MSE, AIC, BIC, etc.), refer to Ch. 9 in “Applied Linear Regression Models” by Kutner et al.
Best subset regression is not as popular as the stepwise methods, but you are encouraged to go through this section. The ‘leaps’ package provides procedures for best subset regression.
install.packages('leaps')
library(leaps)
Which subset of variables should you include in order to minimize BIC?
# regsubsets only takes a data frame as input
# nbest = 2: report the best 2 subsets of each size; nvmax = 13: allow up to all 13 predictors
model.subset<- regsubsets(medv~.,data=Boston.train, nbest=2, nvmax = 13)
sum.subset<- summary(model.subset)
sum.subset
## Subset selection object
## Call: regsubsets.formula(medv ~ ., data = Boston.train, nbest = 2,
## nvmax = 13)
## 13 Variables (and intercept)
## Forced in Forced out
## crim FALSE FALSE
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## 2 subsets of each size up to 13
## Selection Algorithm: exhaustive
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " "*"
## 1 ( 2 ) " " " " " " " " " " "*" " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " " " "*"
## 2 ( 2 ) " " " " " " " " " " " " " " " " " " " " "*" " " "*"
## 3 ( 1 ) " " " " " " " " " " "*" " " " " " " " " "*" " " "*"
## 3 ( 2 ) " " " " " " "*" " " "*" " " " " " " " " " " " " "*"
## 4 ( 1 ) " " " " " " " " " " "*" " " "*" " " " " "*" " " "*"
## 4 ( 2 ) " " " " " " "*" " " "*" " " " " " " " " "*" " " "*"
## 5 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " " "*" " " "*"
## 5 ( 2 ) " " " " " " " " " " "*" " " "*" " " " " "*" "*" "*"
## 6 ( 1 ) " " " " " " "*" "*" "*" " " "*" " " " " "*" " " "*"
## 6 ( 2 ) " " " " " " " " "*" "*" " " "*" " " " " "*" "*" "*"
## 7 ( 1 ) " " "*" " " "*" "*" "*" " " "*" " " " " "*" " " "*"
## 7 ( 2 ) " " " " " " "*" "*" "*" " " "*" " " " " "*" "*" "*"
## 8 ( 1 ) " " "*" " " "*" "*" "*" " " "*" " " " " "*" "*" "*"
## 8 ( 2 ) "*" "*" " " "*" "*" "*" " " "*" " " " " "*" " " "*"
## 9 ( 1 ) "*" "*" " " " " "*" "*" " " "*" "*" "*" "*" " " "*"
## 9 ( 2 ) "*" " " " " "*" "*" "*" " " "*" "*" "*" "*" " " "*"
## 10 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" "*" "*" "*" " " "*"
## 10 ( 2 ) " " "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 11 ( 2 ) "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*" " " "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 12 ( 2 ) "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
plot(model.subset, scale="bic")
Each row represents a model. A black square indicates that the variable is included in that model, while white indicates that it is not. The scale argument can be "Cp", "adjr2", "r2", or "bic".
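For instance, to rank the subsets by adjusted \(R^2\) or Mallows' Cp instead of BIC:

```r
plot(model.subset, scale = "adjr2")  # rows sorted by adjusted R^2
plot(model.subset, scale = "Cp")     # rows sorted by Mallows' Cp
```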
We can also extract exactly the model we want from the summary object.
# extract certain criteria for all evaluated models
sum.subset$bic
## [1] -349.7094 -286.1470 -444.3650 -402.3445 -486.2507 -455.0146 -498.3736
## [8] -492.7276 -518.0025 -503.7115 -523.5791 -518.5852 -524.2030 -523.2882
## [15] -524.7728 -523.3203 -525.6123 -524.7662 -528.9998 -526.2810 -528.4476
## [22] -523.1749 -522.7560 -522.3315 -516.6397
# which one is the best?
which.min(sum.subset$bic)
## [1] 19
# get the model (selected variables)
coef(model.subset, id=which.min(sum.subset$bic))
## (Intercept) crim zn chas nox rm
## 41.9654047 -0.1214356 0.0460730 2.7495145 -18.1879404 3.6257594
## dis rad tax ptratio lstat
## -1.5408278 0.2933702 -0.0123629 -0.9241449 -0.5694629
**Exercise:** Get the best model in terms of adjusted \(R^2\).