In this lab, we cover bagging, random forests, gradient boosting, and extreme gradient boosting for regression problems. We use the same Boston Housing data. In the next lab, we discuss classification problems with these state-of-the-art machine learning algorithms.
# load Boston data
library(MASS)
library(tidyverse)
data(Boston)
# randomly split into 70% training and 30% testing
index <- sample(nrow(Boston), nrow(Boston)*0.70)
boston.train <- Boston[index,]
boston.test <- Boston[-index,]
Bagging stands for Bootstrap Aggregating. It employs the idea of the bootstrap, but the purpose is not to study the bias and standard errors of estimates. Instead, the goal of bagging is to improve prediction accuracy. It fits a tree to each bootstrap sample, and then aggregates the predicted values from all these trees. For more details, you may look at Wikipedia, or you can read the original paper by Leo Breiman (1996).
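To make the idea concrete, here is a minimal sketch of bagging done by hand with rpart (for illustration only; the number of bootstrap samples B and the object names are our own choices, not part of any package):
# bagging by hand: fit a tree to each bootstrap sample, then average the predictions
library(rpart)
B <- 50
pred.mat <- matrix(0, nrow(boston.test), B)
for(b in 1:B){
  boot.index <- sample(nrow(boston.train), replace = TRUE)   # bootstrap sample
  tree.b <- rpart(medv~., data = boston.train[boot.index, ])
  pred.mat[, b] <- predict(tree.b, newdata = boston.test)
}
pred.bag.manual <- rowMeans(pred.mat)   # aggregate by averaging
mean((boston.test$medv - pred.bag.manual)^2)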
The R package ipred provides functions to perform bagging. You need to install this package if you haven't done so before.
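For example (run once, only if the package is not yet installed):
# install.packages("ipred")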
library(ipred)
Fit a tree with bagging on the Boston training data, and calculate the MSE on the testing sample.
boston.bag<- bagging(medv~., data = boston.train, nbagg=100)
boston.bag
##
## Bagging regression trees with 100 bootstrap replications
##
## Call: bagging.data.frame(formula = medv ~ ., data = boston.train, nbagg = 100)
Prediction on the testing sample.
boston.bag.pred<- predict(boston.bag, newdata = boston.test)
mean((boston.test$medv-boston.bag.pred)^2)
## [1] 11.07093
Compare this with a single tree.
library(rpart)
boston.tree<- rpart(medv~., data = boston.train)
boston.tree.pred<- predict(boston.tree, newdata = boston.test)
mean((boston.test$medv-boston.tree.pred)^2)
## [1] 12.99985
How many trees are good?
ntree<- c(seq(10, 200, 10))
MSE.test<- rep(0, length(ntree))
for(i in 1:length(ntree)){
boston.bag1<- bagging(medv~., data = boston.train, nbagg=ntree[i])
boston.bag.pred1<- predict(boston.bag1, newdata = boston.test)
MSE.test[i]<- mean((boston.test$medv-boston.bag.pred1)^2)
}
plot(ntree, MSE.test, type = 'l', col=2, lwd=2)
By fitting the bagging model multiple times and predicting on the testing sample, we can draw the following boxplot to show the variance of the prediction error at different numbers of trees.
ntree<- c(1, 3, 5, seq(10, 200, 10))
MSE.test<- matrix(0, length(ntree), 50)
for(k in 1:50){
for(i in 1:length(ntree)){
boston.bag1<- bagging(medv~., data = boston.train, nbagg=ntree[i])
boston.bag.pred1<- predict(boston.bag1, newdata = boston.test)
MSE.test[i,k]<- mean((boston.test$medv-boston.bag.pred1)^2)
}
}
boxplot(t(MSE.test), names=ntree, xlab="Number of Trees", ylab="Test MSE")
lines(apply(MSE.test, 1, mean), col="red", lty=2, lwd=2)
The out-of-bag (OOB) prediction is similar to LOOCV. We use the full sample. In each bootstrap replication, the observations that are not drawn serve as the testing sample, and the testing error is calculated on them. In the end, the OOB error (root mean squared error by default) is obtained.
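As a quick illustration (a sketch of ours, not part of ipred), on average about 1 − 1/e, or roughly 37%, of the training observations are left out of a single bootstrap sample and can serve as out-of-bag test cases:
boot.index <- sample(nrow(boston.train), replace = TRUE)
mean(!(1:nrow(boston.train) %in% boot.index))   # proportion of out-of-bag observations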
boston.bag.oob<- bagging(medv~., data = boston.train, coob=T, nbagg=100)
boston.bag.oob
##
## Bagging regression trees with 100 bootstrap replications
##
## Call: bagging.data.frame(formula = medv ~ ., data = boston.train, coob = T,
## nbagg = 100)
##
## Out-of-bag estimate of root mean squared error: 4.2785
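Note that the OOB estimate above is a root mean squared error. Assuming, as in ipred, that this value is stored in the err component of the fitted object, squaring it puts it on the same scale as the test MSE we computed earlier:
boston.bag.oob$err^2   # OOB RMSE squared, comparable to the test MSE above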
Random forest is an extension of bagging, but it makes a significant improvement in terms of prediction. The idea of random forests is to randomly select \(m\) out of \(p\) predictors as candidate variables for each split in each tree. Commonly, \(m=\sqrt{p}\). The reason for doing this is that it decorrelates the trees, which reduces the variance when we aggregate them. You may refer to Wikipedia and the tutorial on the author's website.
We start with Boston Housing data.
library(randomForest)
boston.rf<- randomForest(medv~., data = boston.train, importance=TRUE)
boston.rf
##
## Call:
## randomForest(formula = medv ~ ., data = boston.train, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 11.61353
## % Var explained: 86.35
By default, \(m=p/3\) for regression trees and \(m=\sqrt{p}\) for classification problems. You can change it by specifying mtry=. You can also specify the number of trees with ntree=; the default is 500. The argument importance=TRUE allows us to see the variable importance.
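For example, a sketch with both arguments set explicitly (the values 6 and 1000 are illustrative, not tuned):
boston.rf2 <- randomForest(medv~., data = boston.train, mtry = 6, ntree = 1000)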
boston.rf$importance
## %IncMSE IncNodePurity
## crim 9.8312596 2057.9620
## zn 0.3770528 213.2592
## indus 6.2854103 1532.6133
## chas 0.7969057 199.9381
## nox 10.5453903 2047.2360
## rm 33.4866583 8416.2336
## age 3.7684947 816.7203
## dis 7.9094067 2047.6588
## rad 1.5776123 287.3580
## tax 4.3494128 1031.8469
## ptratio 6.6302754 1721.8515
## black 1.6629025 647.2576
## lstat 62.5959692 8459.6503
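The randomForest package also provides a built-in plot of these importance measures:
varImpPlot(boston.rf)   # plots %IncMSE and IncNodePurity for each predictor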
The MSR is the MSE of the out-of-bag prediction (recall the OOB error in bagging). The fitted randomForest object actually saves the OOB error for each ntree value from 1 to 500. We can make a plot to see how the OOB error changes with ntree.
plot(boston.rf$mse, type='l', col=2, lwd=2, xlab = "ntree", ylab = "OOB Error")
Prediction on the testing sample.
boston.rf.pred<- predict(boston.rf, boston.test)
mean((boston.test$medv-boston.rf.pred)^2)
## [1] 8.110514
As mentioned above, the default number of candidate predictors at each split for a regression forest is \(m = p/3 = 13/3 \approx 4\). We can also specify \(m\) with the argument mtry. Now let's see how the OOB error and the testing error change with mtry.
oob.err<- rep(0, 13)
test.err<- rep(0, 13)
for(i in 1:13){
fit<- randomForest(medv~., data = boston.train, mtry=i)
oob.err[i]<- fit$mse[500]
test.err[i]<- mean((boston.test$medv-predict(fit, boston.test))^2)
cat(i, " ")
}
## 1 2 3 4 5 6 7 8 9 10 11 12 13
matplot(cbind(test.err, oob.err), pch=15, col = c("red", "blue"), type = "b", ylab = "MSE", xlab = "mtry")
legend("topright", legend = c("test Error", "OOB Error"), pch = 15, col = c("red", "blue"))
As an exercise, you can make a plot of the test error across ntree = 1, …, 500 and mtry = 1, …, 13. (You can draw 13 lines in different colors, one for each \(m\).)
Boosting builds a number of small trees sequentially: each new tree is fit to the residuals from the current model. We use the gbm package to build boosted trees.
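To see the "fit to residuals" idea concretely, here is a minimal hand-rolled sketch of boosting with small rpart trees (illustration only; the shrinkage value, tree depth, and object names are our own choices, not gbm defaults):
library(rpart)
shrink <- 0.1; B <- 100
pred.train <- rep(mean(boston.train$medv), nrow(boston.train))   # start from the mean
pred.test  <- rep(mean(boston.train$medv), nrow(boston.test))
for(b in 1:B){
  resid.b <- boston.train$medv - pred.train                      # current residuals
  dat.b <- data.frame(boston.train[, -14], resid.b)              # column 14 is medv
  tree.b <- rpart(resid.b~., data = dat.b, control = rpart.control(maxdepth = 2))
  pred.train <- pred.train + shrink*predict(tree.b, newdata = boston.train)
  pred.test  <- pred.test  + shrink*predict(tree.b, newdata = boston.test)
}
mean((boston.test$medv - pred.test)^2)                           # test MSE of the sketch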
library(gbm)
?gbm
boston.boost<- gbm(medv~., data = boston.train, distribution = "gaussian", n.trees = 10000, shrinkage = 0.01, interaction.depth = 8)
summary(boston.boost)
## var rel.inf
## lstat lstat 37.9054831
## rm rm 27.9768272
## dis dis 10.5565415
## crim crim 5.6162115
## age age 4.1245061
## nox nox 3.8883540
## black black 3.0829563
## tax tax 2.1290656
## ptratio ptratio 2.0356151
## indus indus 1.0604646
## chas chas 0.7387621
## rad rad 0.7341870
## zn zn 0.1510259
Note that we need to specify distribution = "gaussian" if we are working on a regression tree; the default is the Bernoulli distribution for binary classification problems. n.trees is the number of small trees we fit. We need to choose this parameter carefully because too large a number may result in overfitting. shrinkage is another tuning parameter that controls how much each tree contributes. interaction.depth is the number of splits we want in each tree. All of these tuning parameters can be chosen by cross-validation; the idea is to avoid overfitting.
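As one way to do this, here is a sketch that uses gbm's built-in cross-validation (the cv.folds argument) to pick n.trees via gbm.perf; the shrinkage and interaction.depth values are illustrative, not tuned:
boston.boost.cv <- gbm(medv~., data = boston.train, distribution = "gaussian",
                       n.trees = 5000, shrinkage = 0.01, interaction.depth = 8,
                       cv.folds = 5)
best.ntrees <- gbm.perf(boston.boost.cv, method = "cv")   # CV-optimal number of trees
best.ntrees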
The fitted boosted tree also gives the relationship between the response and each predictor.
par(mfrow=c(1,2))
plot(boston.boost, i="lstat")
plot(boston.boost, i="rm")
Prediction on the testing sample.
boston.boost.pred.test<- predict(boston.boost, boston.test, n.trees = 10000)
mean((boston.test$medv-boston.boost.pred.test)^2)
## [1] 8.546093
We can investigate how the testing error changes with different number of trees.
par(mfrow=c(1,1))
ntree<- seq(100, 10000, 100)
predmat<- predict(boston.boost, newdata = boston.test, n.trees = ntree)
err<- apply((predmat-boston.test$medv)^2, 2, mean)
plot(ntree, err, type = 'l', col=2, lwd=2, xlab = "n.trees", ylab = "Test MSE")
abline(h=min(test.err), lty=2)
The horizontal line is the best prediction error from random forests we obtained earlier.
XGBoost stands for extreme gradient boosting. It is a much more efficient and faster (often more than 10x faster) implementation of the traditional gradient boosting machine.
library(xgboost)
fit.xgboost.reg<- xgboost(data = as.matrix(boston.train[,-14]), label = boston.train[,14], max.depth = 4, eta = 0.1,
nthread = 1, nrounds = 100, objective = "reg:squarederror", verbose = 0)
How many trees are good? What value of learning rate is good? We can use cross-validation to determine these tuning parameters.
xgboost.cv<- xgb.cv(data = as.matrix(boston.train[,-14]), label = boston.train[,14], max.depth = 4, eta = 0.1,
nfold=10, nthread = 4, nrounds = 200, objective = "reg:squarederror", verbose = 0)
plot(xgboost.cv$evaluation_log$iter, xgboost.cv$evaluation_log$test_rmse_mean, type='l')
#print(xgboost.cv)
The performance stays about the same after 100 iterations. We choose nrounds=100 and use xgboost() to fit the model. You may also grid-search the learning rate eta in a similar way.
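For example, a sketch of a small grid search over eta with xgb.cv (the grid values are illustrative):
eta.grid <- c(0.05, 0.1, 0.2, 0.3)
cv.rmse <- rep(0, length(eta.grid))
for(j in 1:length(eta.grid)){
  cv.j <- xgb.cv(data = as.matrix(boston.train[,-14]), label = boston.train[,14],
                 max.depth = 4, eta = eta.grid[j], nfold = 10, nthread = 4,
                 nrounds = 200, objective = "reg:squarederror", verbose = 0)
  cv.rmse[j] <- min(cv.j$evaluation_log$test_rmse_mean)
}
cbind(eta.grid, cv.rmse)   # the smallest CV RMSE suggests a reasonable eta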
pred.xgboost<- predict(fit.xgboost.reg, newdata = as.matrix(boston.test[,-14]))
mean((boston.test$medv-pred.xgboost)^2)
## [1] 8.880745