In this lab, we cover bagging, random forests, gradient boosting, and extreme gradient boosting for regression problems, using the same Boston Housing data. In the next lab, we discuss classification problems with these state-of-the-art machine learning algorithms.

# load Boston data
library(MASS)
library(tidyverse)
data(Boston)
index <- sample(nrow(Boston),nrow(Boston)*0.70)
boston.train <- Boston[index,]
boston.test <- Boston[-index,]

Bagging

Bagging stands for Bootstrap Aggregating. It employs the idea of the bootstrap, but the purpose is not to study the bias and standard errors of estimates. Instead, the goal of bagging is to improve prediction accuracy. It fits a tree to each bootstrap sample and then aggregates the predicted values from all of these trees. For more details, you may look at Wikipedia, or you can read the original paper by Leo Breiman (1996).
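Before turning to a package, here is a minimal hand-rolled sketch of the idea (the number of bootstrap samples B = 20 and the object names below are ours, purely for illustration): fit an rpart tree on each bootstrap sample of the training data and average the predictions on the testing sample.

library(rpart)
B<- 20
pred.mat<- matrix(0, nrow(boston.test), B)
for(b in 1:B){
  boot.id<- sample(nrow(boston.train), replace = TRUE)   # bootstrap sample of row indices
  tree.b<- rpart(medv~., data = boston.train[boot.id, ])
  pred.mat[, b]<- predict(tree.b, newdata = boston.test)
}
pred.bag.manual<- rowMeans(pred.mat)                      # aggregate by averaging
mean((boston.test$medv - pred.bag.manual)^2)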

The R package ipred provides functions to perform bagging. You need to install this package if you have not done so before.
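If needed, it can be installed once with the usual command:

install.packages("ipred")   # only needs to be run the first time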

library(ipred)

Fit trees with bagging on the Boston training data, and calculate the MSE on the testing sample.

boston.bag<- bagging(medv~., data = boston.train, nbagg=100)
boston.bag
## 
## Bagging regression trees with 100 bootstrap replications 
## 
## Call: bagging.data.frame(formula = medv ~ ., data = boston.train, nbagg = 100)

Prediction on the testing sample.

boston.bag.pred<- predict(boston.bag, newdata = boston.test)
mean((boston.test$medv-boston.bag.pred)^2)
## [1] 11.07093

Comparing with a single tree.

library(rpart)
boston.tree<- rpart(medv~., data = boston.train)
boston.tree.pred<- predict(boston.tree, newdata = boston.test)
mean((boston.test$medv-boston.tree.pred)^2)
## [1] 12.99985

How many trees are good?

ntree<- c(seq(10, 200, 10))
MSE.test<- rep(0, length(ntree))
for(i in 1:length(ntree)){
  boston.bag1<- bagging(medv~., data = boston.train, nbagg=ntree[i])
  boston.bag.pred1<- predict(boston.bag1, newdata = boston.test)
  MSE.test[i]<- mean((boston.test$medv-boston.bag.pred1)^2)
}
plot(ntree, MSE.test, type = 'l', col=2, lwd=2)

By fitting the bagging model multiple times and predicting on the testing sample, we can draw the following boxplot to show the variability of the prediction error for different numbers of trees.

ntree<- c(1, 3, 5, seq(10, 200, 10))
MSE.test<- matrix(0, length(ntree), 50)
for(k in 1:50){
  for(i in 1:length(ntree)){
    boston.bag1<- bagging(medv~., data = boston.train, nbagg=ntree[i])
    boston.bag.pred1<- predict(boston.bag1, newdata = boston.test)
    MSE.test[i,k]<- mean((boston.test$medv-boston.bag.pred1)^2)
  }
}
boxplot(t(MSE.test), names=ntree, xlab="Number of Trees", ylab="Test MSE")
lines(apply(MSE.test, 1, mean), col="red", lty=2, lwd=2)

Out-of-bag (OOB) prediction

Out-of-bag (OOB) prediction is similar in spirit to LOOCV, and it uses the full sample. In each bootstrap replication, the observations that were not drawn serve as a testing sample, and the testing error is calculated on them. Aggregating over all replications gives the OOB error, reported as a root mean squared error by default.
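As a quick check of this idea (a standalone sketch, not part of ipred), in a single bootstrap sample roughly \(1/e \approx 37\%\) of the training rows are not drawn and are therefore available as an out-of-bag testing set:

boot.id<- sample(nrow(boston.train), replace = TRUE)
mean(!(1:nrow(boston.train) %in% boot.id))   # share of out-of-bag rows, typically around 0.37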

boston.bag.oob<- bagging(medv~., data = boston.train, coob=T, nbagg=100)
boston.bag.oob
## 
## Bagging regression trees with 100 bootstrap replications 
## 
## Call: bagging.data.frame(formula = medv ~ ., data = boston.train, coob = T, 
##     nbagg = 100)
## 
## Out-of-bag estimate of root mean squared error:  4.2785

Random Forests

Random forests are an extension of bagging, but they make a significant improvement in terms of prediction. The idea of random forests is to randomly select \(m\) out of the \(p\) predictors as candidate variables for each split in each tree. Commonly, \(m\approx\sqrt{p}\). The reason for doing this is that it decorrelates the trees, which reduces variance when we aggregate them. You may refer to Wikipedia and the tutorial on the author’s website.
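Purely as an illustration of this mechanism (this is not a randomForest call), drawing a random subset of \(m=4\) candidate predictors out of the 13 available looks like:

sample(setdiff(names(Boston), "medv"), 4)   # 4 randomly chosen candidate predictors for one split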

We start with Boston Housing data.

library(randomForest)
boston.rf<- randomForest(medv~., data = boston.train, importance=TRUE)
boston.rf
## 
## Call:
##  randomForest(formula = medv ~ ., data = boston.train, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 11.61353
##                     % Var explained: 86.35

By default, \(m=p/3\) for regression trees and \(m=\sqrt{p}\) for classification problems. You can change it by specifying mtry=. You can also specify the number of trees with ntree=; the default is 500. The argument importance=TRUE allows us to see the variable importance.
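For reference, a sketch of a fit with non-default settings (the values mtry = 6 and ntree = 300 are arbitrary, only to show the syntax):

boston.rf2<- randomForest(medv~., data = boston.train, mtry = 6, ntree = 300)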

boston.rf$importance
##            %IncMSE IncNodePurity
## crim     9.8312596     2057.9620
## zn       0.3770528      213.2592
## indus    6.2854103     1532.6133
## chas     0.7969057      199.9381
## nox     10.5453903     2047.2360
## rm      33.4866583     8416.2336
## age      3.7684947      816.7203
## dis      7.9094067     2047.6588
## rad      1.5776123      287.3580
## tax      4.3494128     1031.8469
## ptratio  6.6302754     1721.8515
## black    1.6629025      647.2576
## lstat   62.5959692     8459.6503

The mean of squared residuals is the MSE of the out-of-bag predictions (recall the OOB idea in bagging). The fitted randomForest object actually saves the OOB error for every ntree value from 1 to 500, so we can plot how the OOB error changes with ntree.

plot(boston.rf$mse, type='l', col=2, lwd=2, xlab = "ntree", ylab = "OOB Error")
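A quick way to see where the OOB error settles (the curve is usually flat after a few hundred trees, so the exact minimizer matters little):

which.min(boston.rf$mse)   # ntree value with the smallest OOB MSE
min(boston.rf$mse)         # the corresponding OOB MSE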

Prediction on the testing sample.

boston.rf.pred<- predict(boston.rf, boston.test)
mean((boston.test$medv-boston.rf.pred)^2)
## [1] 8.110514

Try different \(m\)

As we mentioned before, the default number of candidate predictors at each split for regression is \(m = p/3 = 13/3 \approx 4\). We can also specify \(m\) with the argument mtry=. Now let’s see how the OOB error and testing error change with mtry.

oob.err<- rep(0, 13)
test.err<- rep(0, 13)
for(i in 1:13){
  fit<- randomForest(medv~., data = boston.train, mtry=i)
  oob.err[i]<- fit$mse[500]
  test.err[i]<- mean((boston.test$medv-predict(fit, boston.test))^2)
  cat(i, " ")
}
## 1  2  3  4  5  6  7  8  9  10  11  12  13
matplot(cbind(test.err, oob.err), pch=15, col = c("red", "blue"), type = "b", ylab = "MSE", xlab = "mtry")
legend("topright", legend = c("test Error", "OOB Error"), pch = 15, col = c("red", "blue"))

Exercise: Create a plot displaying the test error across ntree = 1, …, 500 and mtry = 1, …, 13. (You can draw 13 lines in different colors, one for each value of \(m\).)

Boosting

Boosting builds a number of small trees sequentially: at each step, the response being fit is the residual left by the previous trees. We use the gbm package to build boosted trees.
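To see the sequential idea in a toy form (a sketch of the mechanism only, not how gbm is implemented internally; the tree depth is arbitrary): fit a small tree, fit a second small tree to the residuals of the first, and add the two predictions. gbm repeats this many times and additionally shrinks each tree’s contribution via shrinkage.

library(rpart)
step1<- rpart(medv~., data = boston.train, maxdepth = 2)         # first small tree
train.res<- boston.train
train.res$res<- boston.train$medv - predict(step1, boston.train) # residuals from step 1
step2<- rpart(res~. - medv, data = train.res, maxdepth = 2)      # second tree fits the residuals
pred.2step<- predict(step1, boston.test) + predict(step2, boston.test)
mean((boston.test$medv - pred.2step)^2)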

library(gbm)
?gbm
boston.boost<- gbm(medv~., data = boston.train, distribution = "gaussian", n.trees = 10000, shrinkage = 0.01, interaction.depth = 8)
summary(boston.boost)

##             var    rel.inf
## lstat     lstat 37.9054831
## rm           rm 27.9768272
## dis         dis 10.5565415
## crim       crim  5.6162115
## age         age  4.1245061
## nox         nox  3.8883540
## black     black  3.0829563
## tax         tax  2.1290656
## ptratio ptratio  2.0356151
## indus     indus  1.0604646
## chas       chas  0.7387621
## rad         rad  0.7341870
## zn           zn  0.1510259

Note that we need to specify distribution = "gaussian" when working on a regression tree; the default is the Bernoulli distribution for binary classification problems. n.trees is the number of small trees we fit; we need to choose this parameter carefully because too large a value may result in overfitting. shrinkage is another tuning parameter that controls how much each tree contributes. interaction.depth is how many splits we allow in each tree. All of these tuning parameters can be chosen by cross-validation; the idea is that we do not want to overfit.
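One common way to choose n.trees is gbm’s built-in cross-validation through the cv.folds argument together with gbm.perf() (the specific values below are only for illustration):

boston.boost.cv<- gbm(medv~., data = boston.train, distribution = "gaussian",
                      n.trees = 5000, shrinkage = 0.01, interaction.depth = 4, cv.folds = 5)
gbm.perf(boston.boost.cv, method = "cv")   # CV-optimal number of trees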

The fitted boosted tree also shows the relationship between the response and each predictor.

par(mfrow=c(1,2))
plot(boston.boost, i="lstat")

plot(boston.boost, i="rm")

Prediction on the testing sample.

boston.boost.pred.test<- predict(boston.boost, boston.test, n.trees = 10000)
mean((boston.test$medv-boston.boost.pred.test)^2)
## [1] 8.546093

We can investigate how the testing error changes with different number of trees.

par(mfrow=c(1,1))
ntree<- seq(100, 10000, 100)
predmat<- predict(boston.boost, newdata = boston.test, n.trees = ntree)
err<- apply((predmat-boston.test$medv)^2, 2, mean)
plot(ntree, err, type = 'l', col=2, lwd=2, xlab = "n.trees", ylab = "Test MSE")
abline(h=min(test.err), lty=2)

The horizontal line is the best prediction error from random forests we obtained earlier.

XGBoost

XGBoost stands for extreme gradient boosting. It is a much more efficient and faster (often more than 10x faster) implementation of the traditional gradient boosting machine.

library(xgboost)
fit.xgboost.reg<- xgboost(data = as.matrix(boston.train[,-14]), label = boston.train[,14], max.depth = 4, eta = 0.1, 
                          nthread = 1, nrounds = 100, objective = "reg:squarederror", verbose = 0)

Cross-validation

How many trees are good? What value of the learning rate is good? We can use cross-validation to determine these tuning parameters.

xgboost.cv<- xgb.cv(data = as.matrix(boston.train[,-14]), label = boston.train[,14], max.depth = 4, eta = 0.1, 
                    nfold=10, nthread = 4, nrounds = 200, objective = "reg:squarederror", verbose = 0)
plot(xgboost.cv$evaluation_log$iter, xgboost.cv$evaluation_log$test_rmse_mean, type='l')

#print(xgboost.cv)

The performance levels off after about 100 iterations, so we choose nrounds = 100 and use xgboost() to fit the model. You may also grid-search the learning rate eta in a similar way.
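A sketch of such a grid search over eta (the grid values and the reuse of nrounds = 100 are our choices for illustration):

eta.grid<- c(0.05, 0.1, 0.2, 0.3)
cv.rmse<- rep(0, length(eta.grid))
for(j in 1:length(eta.grid)){
  cv.fit<- xgb.cv(data = as.matrix(boston.train[,-14]), label = boston.train[,14], max.depth = 4,
                  eta = eta.grid[j], nfold = 10, nthread = 1, nrounds = 100,
                  objective = "reg:squarederror", verbose = 0)
  cv.rmse[j]<- min(cv.fit$evaluation_log$test_rmse_mean)   # best CV RMSE for this eta
}
cbind(eta.grid, cv.rmse)   # pick the eta with the smallest CV RMSE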

Prediction

pred.xgboost<- predict(fit.xgboost.reg, newdata = as.matrix(boston.test[,-14]))
mean((boston.test$medv-pred.xgboost)^2)
## [1] 8.880745