In this lab we will go through model building, validation, and interpretation of tree models. The focus will be on the rpart package. Recall that when the response variable \(Y\) is continuous, we fit a regression tree; when the response variable \(Y\) is categorical, we fit a classification tree. We build tree models for our familiar datasets, the Boston Housing data and the Credit Card Default data, for the regression tree and the classification tree respectively.
Load the data, and randomly split it into training and testing samples.
library(tidyverse)
library(MASS)
data(Boston)
index <- sample(nrow(Boston),nrow(Boston)*0.90)
boston.train <- Boston[index,]
boston.test <- Boston[-index,]
We will use the ‘rpart’ library for model building and ‘rpart.plot’ for plotting.
install.packages('rpart')
install.packages('rpart.plot')
library(rpart)
library(rpart.plot)
The basic usage of the rpart() function is similar to lm() and glm(): it takes a formula argument, in which you specify the response and predictor variables, and a data argument for the dataset.
boston.rpart <- rpart(formula = medv ~ ., data = boston.train)
boston.rpart
## n= 455
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 455 36914.84000 22.422640
## 2) lstat>=9.95 258 6232.28600 17.190310
## 4) lstat>=16.085 128 2319.63700 14.063280
## 8) nox>=0.603 85 1087.11000 12.381180
## 16) crim>=10.4524 33 209.86240 9.615152 *
## 17) crim< 10.4524 52 464.54060 14.136540 *
## 9) nox< 0.603 43 516.60420 17.388370 *
## 5) lstat< 16.085 130 1428.65700 20.269230 *
## 3) lstat< 9.95 197 14368.77000 29.275130
## 6) rm< 7.141 152 5325.59900 26.052630
## 12) dis>=2.0224 143 2702.59800 25.290910
## 24) rm< 6.543 80 694.70190 22.818750 *
## 25) rm>=6.543 63 898.11270 28.430160 *
## 13) dis< 2.0224 9 1221.70200 38.155560 *
## 7) rm>=7.141 45 2133.10800 40.160000
## 14) rm< 7.437 19 68.10421 34.436840 *
## 15) rm>=7.437 26 987.88350 44.342310 *
prp(boston.rpart, digits = 4, extra = 1)
The first output (boston.rpart) is the tree structure,
which can be visualized as the tree plot. In the tree structure output,
you can see what is going on behind each split. For example, the
root node has 455 observations, RSS = 36914.84 (this is the TSS because no
split has been made yet), and a predicted value of 22.42 (this is the average
medv of the training sample, again because no split has been made). The
first split then generates two child nodes (2 and 3), and for each node you can see
the values of n, RSS, and yhat. This information can also be obtained
from boston.rpart$frame.
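For instance, the following line (a quick sketch) pulls the splitting variable, node size n, within-node RSS (dev), and fitted value (yval) for every node from the frame:
boston.rpart$frame[, c("var", "n", "dev", "yval")]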
As a quick exercise, trace the fitted tree by hand: what is the predicted medv for the following observation?

| crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.98 | 0 | 21.89 | 0 | 0.62 | 5.76 | 98.4 | 2.35 | 4 | 437 | 21.2 | 262.76 | 17.31 |
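To check your hand-traced answer with code (a sketch; the observation is entered as a one-row data frame with the same column names as Boston):
new.obs <- data.frame(crim = 0.98, zn = 0, indus = 21.89, chas = 0, nox = 0.62,
                      rm = 5.76, age = 98.4, dis = 2.35, rad = 4, tax = 437,
                      ptratio = 21.2, black = 262.76, lstat = 17.31)
predict(boston.rpart, newdata = new.obs)   # should match the terminal node you land in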
We know that a regression tree is fit by minimizing the RSS, just like a linear
regression. However, with rpart you cannot directly pull out the
final RSS, MSE, or sigma (or even the fitted values) the way you can with
lm(). But we can obtain the fitted values using
predict() and then compute the RSS or MSE by
definition.
## obtain fitted value
fitted <- predict(boston.rpart)
# please complete the following lines as a quick exercise.
RSS <-
MSE <- ## note that MSE is simply the average of squared residual. We do not use the df of n-p-1
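One possible completion (a sketch), reusing the fitted values obtained above:
RSS <- sum((boston.train$medv - fitted)^2)   # residual sum of squares
MSE <- mean((boston.train$medv - fitted)^2)  # average squared residual; no n-p-1 adjustment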
Out-of-sample prediction for a regression tree works the same way as for
lm() and glm() models.
boston.test.pred.tree = predict(boston.rpart, newdata=boston.test)
MSPE.tree<-
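One possible completion (a sketch; the MSPE is the average squared prediction error on the test sample):
MSPE.tree <- mean((boston.test$medv - boston.test.pred.tree)^2)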
Calculate the mean squared prediction error (MSPE) for a linear regression model using all variables, then compare the results. What is your conclusion? Further, try to compare the regression tree with the best linear regression model obtained from a variable selection procedure. A sketch follows the blanks below.
boston.lm<-
pred.lm<-
MSPE.lm<-
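One way to fill in the blanks, as a sketch using the full linear model (the variable-selection comparison is left to you):
boston.lm <- lm(medv ~ ., data = boston.train)          # linear regression with all predictors
pred.lm <- predict(boston.lm, newdata = boston.test)    # out-of-sample predictions
MSPE.lm <- mean((boston.test$medv - pred.lm)^2)         # compare with MSPE.tree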
In rpart(), the cp (complexity parameter) argument is one of the parameters used to control the complexity of the tree. The help document for rpart tells you that "any split that does not decrease the overall lack of fit by a factor of cp is not attempted". For a regression tree, this means the overall R-squared must increase by cp at each step. Basically, the smaller the cp value, the larger (more complex) the tree that rpart will attempt to fit. The default value of cp is 0.01.
The idea of pruning a tree is that we start with a large tree (small cp value) that may overfit the data, and then cut it back by choosing an appropriate cp value (via cross-validation).
boston.largetree <- rpart(formula = medv ~ ., data = boston.train, cp = 0.001)
prp(boston.largetree)
The plotcp() function shows the relationship between the 10-fold cross-validation error and the size of the tree.
plotcp(boston.largetree)
From left to right, cp decreases, which means the tree grows larger. You can observe from the graph that the cross-validation error (xerror) does not always go down as the tree becomes more complex. The analogy is that adding more variables to a regression model does not necessarily improve its ability to predict future observations. A good choice of cp for pruning is often the leftmost value for which the mean lies below the horizontal line. In the Boston Housing example, you may conclude that a tree with more than about 10 splits is not helpful.
To look at the error versus the size of the tree more carefully, you can examine the following table:
cptable <- printcp(boston.largetree)
##
## Regression tree:
## rpart(formula = medv ~ ., data = boston.train, cp = 0.001)
##
## Variables actually used in tree construction:
## [1] black crim dis indus lstat nox ptratio rad rm
## [10] tax
##
## Root node error: 36915/455 = 81.132
##
## n= 455
##
## CP nsplit rel error xerror xstd
## 1 0.4419302 0 1.00000 1.00499 0.086707
## 2 0.1871893 1 0.55807 0.60615 0.054567
## 3 0.0672898 2 0.37088 0.43606 0.046555
## 4 0.0379603 3 0.30359 0.37646 0.043525
## 5 0.0300633 4 0.26563 0.38794 0.044756
## 6 0.0291785 5 0.23557 0.36276 0.040834
## 7 0.0193939 6 0.20639 0.32230 0.038289
## 8 0.0111800 7 0.18699 0.30102 0.038532
## 9 0.0078412 8 0.17581 0.28046 0.039221
## 10 0.0070571 9 0.16797 0.27921 0.039392
## 11 0.0068462 10 0.16092 0.28034 0.039492
## 12 0.0052269 11 0.15407 0.26897 0.036615
## 13 0.0050412 12 0.14884 0.26432 0.035393
## 14 0.0040287 13 0.14380 0.25672 0.035338
## 15 0.0036138 14 0.13977 0.25200 0.035261
## 16 0.0028639 15 0.13616 0.25313 0.035239
## 17 0.0021987 16 0.13330 0.24677 0.034806
## 18 0.0021533 17 0.13110 0.24939 0.034807
## 19 0.0018212 18 0.12894 0.24977 0.034805
## 20 0.0017329 19 0.12712 0.25055 0.034796
## 21 0.0016753 20 0.12539 0.24717 0.032903
## 22 0.0016528 21 0.12371 0.24763 0.032908
## 23 0.0013995 22 0.12206 0.24839 0.032922
## 24 0.0013935 23 0.12066 0.25029 0.033050
## 25 0.0013211 25 0.11788 0.25016 0.033061
## 26 0.0010855 26 0.11655 0.24983 0.033035
## 27 0.0010000 28 0.11438 0.24891 0.032917
The root node error is the error when you make no smart use of the predictors; in the regression case, it is the mean squared error (MSE) if you use the average of medv as the prediction. That is,
sum((boston.train$medv - mean(boston.train$medv))^2)/nrow(boston.train)
## [1] 81.13151
The first two columns, CP and nsplit, tell you how large the tree is. The third column, rel error, is a relative error: MSE / root node error. Therefore rel error \(\times\) root node error is the training MSE. For example, for the last row, \(0.11438 \times 81.132 \approx 9.28\), which matches the in-sample MSE you get from predict():
mean((predict(boston.largetree) - boston.train$medv)^2)
## [1] 9.280078
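As a quick numerical check (a sketch using the cp table stored in the fitted object), multiply the relative error of the last row by the root node error:
root.err <- mean((boston.train$medv - mean(boston.train$medv))^2)    # root node error
tail(boston.largetree$cptable[, "rel error"], 1) * root.err          # reproduces the in-sample MSE above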
xerror is the cross-validation error (10-fold by default). You can see that the rel error (training error) always decreases as the model becomes more complex, while the cross-validation error (a measure of performance on future observations) does not. That is why we prune the tree, to avoid overfitting.
To prune the tree, we need to choose an appropriate cp value from
the cp table, and then use the function prune() with the
chosen cp value to cut the tree back.
## cp that minimizes the cross-validation error (xerror, column 4)
cp.min <- cptable[which.min(cptable[,4]), 1]
tree.prune.min <- prune(boston.largetree, cp = cp.min)
## cp by the 1-SE rule: the row whose xerror is closest to min(xerror) + xstd at the minimum
## (a little tedious to code; you can also read it off the cp table by eye)
ind <- which.min(abs(cptable[,4] - sum(cptable[which.min(cptable[,4]), c(4,5)])))
cp.1se <- cptable[ind, 1]
tree.prune.1se <- prune(boston.largetree, cp = cp.1se)
Load the data, rename the response variable (because its original name is too long), convert the categorical variables to factors, and randomly split into training and testing samples.
credit.data <- read_csv("https://www.dropbox.com/s/tnoo06n8m842uit/credit_card_default.csv?dl=1")
# rename
credit.data<- rename(credit.data, default=`default payment next month`)
# convert categorical data to factor
credit.data$EDUCATION<- as.factor(credit.data$EDUCATION)
credit.data$MARRIAGE<- as.factor(credit.data$MARRIAGE)
# random splitting
set.seed(123)
index <- sample(nrow(credit.data),nrow(credit.data)*0.80)
credit.train = credit.data[index,]
credit.test = credit.data[-index,]
You need to tell R that you want a classification tree by specifying
method="class", since the default for a numeric 0/1 response is to fit a regression
tree.
credit.rpart0 <- rpart(formula = default ~ ., data = credit.train, method = "class")
prp(credit.rpart0, extra = 1)
We see a tree with only one split under the default cp. This tree minimizes the symmetric cost, which is equivalent to the misclassification rate.
Below is the tree structure output.
- Can you draw the confusion matrix based on this output?
- Obtain the confusion matrix using predictions (e.g., predict(credit.rpart0, type="class")); a sketch is given after the output below.
- Please interpret this confusion matrix. Does it look good? Why?
credit.rpart0
## n= 24000
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 24000 5346 0 (0.7772500 0.2227500)
## 2) PAY_0< 1.5 21488 3593 0 (0.8327904 0.1672096) *
## 3) PAY_0>=1.5 2512 759 1 (0.3021497 0.6978503) *
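A sketch of the in-sample confusion matrix (rows are the truth, columns the prediction):
pred0 <- predict(credit.rpart0, type = "class")             # in-sample class predictions
table(truth = credit.train$default, prediction = pred0)     # confusion matrix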
Recall the example from logistic regression. In the credit scoring case, false negatives (predicting 0 when the truth is 1, i.e., giving out loans that end up in default) cost more than false positives (predicting 1 when the truth is 0, i.e., rejecting loans that would have been repaid).
Here we make the assumption that a false negative costs 5 times as much as a false positive. In real life the cost structure should be carefully researched.
credit.rpart1 <- rpart(formula = default ~ .,
data = credit.train,
method = "class",
parms = list(loss=matrix(c(0,5,1,0), nrow = 2)))
prp(credit.rpart1, extra = 1)
The parms argument is a list, in which the loss matrix can be specified via the loss element. The diagonal elements are 0, and the off-diagonal elements give the loss (cost) of each type of misclassification. For binary classification, the numbers in c() specify the costs in this sequence: c(0, False Negative, False Positive, 0), because the matrix is filled column by column. If you have a symmetric cost, you can omit the parms argument.

Exercises (a sketch for the first two items follows this list):
- For each of the classification trees above, obtain the FPR and FNR on the testing sample.
- Compute the cost (on the testing sample) for each model.
- Compare these tree models with logistic regression in terms of MR, FPR, and FNR. (Use the same cost ratio to select the optimal cutoff.)
- In predict() for the tree model, try type="prob". What can you say about these predicted probabilities?
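A possible sketch for the asymmetric-cost tree credit.rpart1 on the testing sample (the same steps apply to credit.rpart0):
pred.test1 <- predict(credit.rpart1, newdata = credit.test, type = "class")
FPR <- sum(credit.test$default == 0 & pred.test1 == 1)/sum(credit.test$default == 0)   # false positive rate
FNR <- sum(credit.test$default == 1 & pred.test1 == 0)/sum(credit.test$default == 1)   # false negative rate
cost.test <- 5*sum(credit.test$default == 1 & pred.test1 == 0) +                       # weighted cost, FN:FP = 5:1
  1*sum(credit.test$default == 0 & pred.test1 == 1)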
To get the ROC curve, we first obtain the predicted probability that Y = 1 from the fitted tree.
credit.test.prob.rpart<- predict(credit.rpart1, credit.test, type="prob")
credit.test.prob.rpart has two columns: the first is the predicted probability that Y = 0, and the second that Y = 1. We need the second column.
library(ROCR)
pred = prediction(credit.test.prob.rpart[,2], credit.test$default)
perf = performance(pred, "tpr", "fpr")
plot(perf)
slot(performance(pred, "auc"), "y.values")[[1]]
## [1] 0.7341126
CAUTION!!! When the cost is symmetric (the default), the pruning process is the same as for a regression tree. However, when you specify an asymmetric cost, the cp table and plot can be misleading. Below we demonstrate this problem with our example.
First, we fit a big tree with the asymmetric cost FN:FP = 5:1 and cp = 0.001, and then draw the cp plot.
Cmat <- matrix(c(0,5,1,0), nrow = 2)
credit.rpart.bigtree <- rpart(formula = default ~ .,
data = credit.train,
method = "class",
parms = list(loss=Cmat),
cp=0.001)
plotcp(credit.rpart.bigtree)
This cp plot suggests that the optimal tree is the one with a single split, and that anything beyond it appears noticeably worse.
Now let's compute the cost on a hold-out sample (the testing set) for the largest tree (no pruning, supposedly worse) and for the "optimal" tree (pruned back to one split, supposedly better). We would expect cost1 to be larger than cost2.
## prediction and cost of the large tree
pred1<- predict(credit.rpart.bigtree, newdata = credit.test, type = "class")
cost1 <- sum(pred1==1 & credit.test$default==0)*Cmat[1,2]+sum(pred1==0 & credit.test$default==1)*Cmat[2,1]
## prediction and cost of the one-split tree
credit.rpart.prune <- prune(credit.rpart.bigtree, cp=0.1)
pred2<- predict(credit.rpart.prune, newdata = credit.test, type = "class")
cost2 <- sum(pred2==1 & credit.test$default==0)*Cmat[1,2]+sum(pred2==0 & credit.test$default==1)*Cmat[2,1]
## compare the two costs (We would expect that cost1 > cost2)
cost1 > cost2
## [1] FALSE
This means the first (large) tree outperforms the second (one-split) tree, which contradicts the suggestion from the cp plot. You can verify this "issue" by using a different cost matrix (for example, flip the ratio) or a different dataset.
Then how should we prune a classification tree with an asymmetric cost?
The easiest way is to write a loop that searches over a sequence of cp
values and evaluates the weighted cost on a hold-out sample. More
formally, you can use caret for cross-validation, or
manually implement cross-validation with the weighted cost as the
evaluation score.
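Below is a minimal sketch of such a search, reusing credit.rpart.bigtree and Cmat from above and evaluating the total weighted cost on the testing sample (the candidate cp values are taken from the fitted tree's own cp table):
cp.grid <- credit.rpart.bigtree$cptable[, "CP"]          # candidate cp values
cost.grid <- sapply(cp.grid, function(cp){
  tree.cp <- prune(credit.rpart.bigtree, cp = cp)        # prune the big tree at this cp
  pred.cp <- predict(tree.cp, newdata = credit.test, type = "class")
  sum(pred.cp == 1 & credit.test$default == 0)*Cmat[1,2] +
    sum(pred.cp == 0 & credit.test$default == 1)*Cmat[2,1]
})
cp.grid[which.min(cost.grid)]                            # cp value with the smallest weighted cost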
As a final example, we apply a classification tree to a handwritten digit recognition dataset. Each row stores one 28 \(\times\) 28 grey-scale image as 784 pixel columns, together with a label column giving the digit (0-9).
digit<- data.matrix(read_csv("https://www.dropbox.com/s/ulujvi2a4ykfzju/train.csv?dl=1"))
dim(digit)
## [1] 42000 785
## visualize the data
plotTrain <- function(data, index){
op <- par(no.readonly=TRUE)
x <- ceiling(sqrt(length(index)))
par(mfrow=c(x, x), mar=c(.1, .1, .1, .1))
for (i in index){ #reverse and transpose each matrix to rotate images
m <- matrix(data[i,-1], nrow=28, byrow=TRUE)
m <- apply(m, 2, rev)
image(t(m), col=grey.colors(255), axes=FALSE)
text(0.05, 0.2, col="white", cex=1.2, data[i, 1])
}
par(op) #reset the original graphics parameters
}
plotTrain(data=digit, index=1:100)
index<- sample(1:nrow(digit), 0.6*nrow(digit))
train<- digit[index,]
test<- digit[-index,]
Currently, each pixel uses the values 0-255 to represent the grey scale. We rescale them to the range 0-1.
## standardize X
train.x <- train[,-1] #remove 'label' column
test.x<- test[,-1]
train.y <- train[,1] #label column
test.y<- test[,1]
train.x <- train.x/255
test.x <- test.x/255
Here we use a classification tree to train a classifier, and then compare it with the multinomial logit model (from the last lab).
Due to the large size of the data, we only use the first 3000 observations as the training sample.
fit.tree <- rpart(y ~., method = "class", data = data.frame(y=train.y[1:3000], x=train.x[1:3000,]), cp=0.00001)
plotcp(fit.tree)
printcp(fit.tree)
##
## Classification tree:
## rpart(formula = y ~ ., data = data.frame(y = train.y[1:3000],
## x = train.x[1:3000, ]), method = "class", cp = 1e-05)
##
## Variables actually used in tree construction:
## [1] x.pixel103 x.pixel120 x.pixel131 x.pixel150 x.pixel152 x.pixel155
## [7] x.pixel156 x.pixel157 x.pixel177 x.pixel181 x.pixel183 x.pixel184
## [13] x.pixel188 x.pixel189 x.pixel210 x.pixel216 x.pixel238 x.pixel239
## [19] x.pixel244 x.pixel267 x.pixel269 x.pixel270 x.pixel294 x.pixel296
## [25] x.pixel299 x.pixel300 x.pixel317 x.pixel321 x.pixel323 x.pixel326
## [31] x.pixel345 x.pixel346 x.pixel348 x.pixel349 x.pixel350 x.pixel353
## [37] x.pixel355 x.pixel356 x.pixel372 x.pixel375 x.pixel376 x.pixel378
## [43] x.pixel379 x.pixel400 x.pixel401 x.pixel405 x.pixel407 x.pixel408
## [49] x.pixel409 x.pixel429 x.pixel432 x.pixel433 x.pixel455 x.pixel458
## [55] x.pixel463 x.pixel484 x.pixel485 x.pixel486 x.pixel488 x.pixel489
## [61] x.pixel491 x.pixel493 x.pixel496 x.pixel513 x.pixel514 x.pixel515
## [67] x.pixel516 x.pixel518 x.pixel519 x.pixel522 x.pixel538 x.pixel542
## [73] x.pixel547 x.pixel551 x.pixel553 x.pixel568 x.pixel569 x.pixel595
## [79] x.pixel597 x.pixel598 x.pixel601 x.pixel636 x.pixel657 x.pixel658
## [85] x.pixel660 x.pixel70 x.pixel713 x.pixel96
##
## Root node error: 2682/3000 = 0.894
##
## n= 3000
##
## CP nsplit rel error xerror xstd
## 1 0.09545116 0 1.00000 1.01081 0.0060255
## 2 0.09134974 1 0.90455 0.91462 0.0078854
## 3 0.06413125 2 0.81320 0.78896 0.0093103
## 4 0.04399702 5 0.62081 0.63833 0.0101086
## 5 0.03914989 7 0.53281 0.58688 0.0101987
## 6 0.03318419 8 0.49366 0.53990 0.0102049
## 7 0.01752424 9 0.46048 0.52312 0.0101897
## 8 0.01603281 10 0.44295 0.47949 0.0101066
## 9 0.01528710 11 0.42692 0.46532 0.0100660
## 10 0.01267711 12 0.41163 0.45638 0.0100367
## 11 0.01230425 13 0.39896 0.44855 0.0100089
## 12 0.01043997 15 0.37435 0.44482 0.0099949
## 13 0.00745712 16 0.36391 0.43251 0.0099453
## 14 0.00708427 17 0.35645 0.41573 0.0098690
## 15 0.00633855 18 0.34937 0.40604 0.0098203
## 16 0.00596570 19 0.34303 0.39448 0.0097577
## 17 0.00410142 20 0.33706 0.37248 0.0096247
## 18 0.00391499 25 0.31581 0.34862 0.0094590
## 19 0.00372856 29 0.30015 0.34526 0.0094339
## 20 0.00335570 31 0.29269 0.33371 0.0093437
## 21 0.00316928 32 0.28934 0.32252 0.0092510
## 22 0.00298285 34 0.28300 0.31991 0.0092286
## 23 0.00260999 38 0.27107 0.31581 0.0091927
## 24 0.00242356 46 0.25019 0.31245 0.0091628
## 25 0.00223714 50 0.24049 0.30425 0.0090877
## 26 0.00205071 59 0.22036 0.30425 0.0090877
## 27 0.00186428 61 0.21626 0.30350 0.0090807
## 28 0.00167785 69 0.20134 0.30089 0.0090560
## 29 0.00149142 71 0.19799 0.29679 0.0090166
## 30 0.00111857 73 0.19500 0.29679 0.0090166
## 31 0.00074571 81 0.18606 0.29418 0.0089911
## 32 0.00037286 92 0.17785 0.29567 0.0090057
## 33 0.00012429 97 0.17599 0.30052 0.0090525
## 34 0.00001000 100 0.17562 0.30052 0.0090525
# make prediction
pred.y.tree<- predict(fit.tree, data.frame(y=test.y, x=test.x), type = "class")
# accuracy rate
mean(test.y==pred.y.tree)
## [1] 0.7406548
We got pretty good accuracy. As we learn more advanced machine learning algorithms, you will see that the accuracy can approach 99%.
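To see where the classifier goes wrong, here is a sketch of the 10 \(\times\) 10 confusion matrix on the test sample:
table(truth = test.y, prediction = pred.y.tree)    # which digits are confused with which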
plotResults <- function(testdata, index, preds){
op <- par(no.readonly=TRUE)
x <- ceiling(sqrt(length(index)))
par(mfrow=c(x,x), mar=c(.1,.1,.1,.1))
for (i in index){
m <- matrix(testdata[i,], nrow=28, byrow=TRUE)
m <- apply(m, 2, rev)
image(t(m), col=grey.colors(255), axes=FALSE)
text(0.05,0.1,col="green", cex=1.2, preds[i])
}
par(op)
}
Here are the first 100 images in the test set and their predicted values:
plotResults(testdata=test.x, index=1:100, preds=pred.y.tree)
We see it did a very good job.
To summarize, after this lab you should be able to:
- Use rpart() to fit regression and classification trees.
- Interpret a fitted tree.
- Use predict() for prediction, and assess the prediction performance.
- Use the cp plot/table to prune a large tree.