In this lab we will go through model building, validation, and interpretation of tree models. The focus will be on the rpart package. Recall that when the response variable \(Y\) is continuous, we fit a regression tree; when the response variable \(Y\) is categorical, we fit a classification tree. We build tree models for our familiar datasets, the Boston Housing data and the Credit Card Default data, for a regression tree and a classification tree respectively.

Regression Tree (Boston Housing Data)

Preparation

Load the data, and randomly split to training and testing sample.

library(tidyverse)
library(MASS)
data(Boston)
index <- sample(nrow(Boston),nrow(Boston)*0.90)
boston.train <- Boston[index,]
boston.test <- Boston[-index,]

We will use the ‘rpart’ library for model building and ‘rpart.plot’ for plotting.

install.packages('rpart')
install.packages('rpart.plot') 
library(rpart)
library(rpart.plot)

Fitting regression tree

The simple form of the rpart function is similar to lm and glm. It takes a formula argument in which you specify the response and predictor variables, and the dataset.

boston.rpart <- rpart(formula = medv ~ ., data = boston.train)

Printing and plotting the tree

boston.rpart
## n= 455 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 455 36914.84000 22.422640  
##    2) lstat>=9.95 258  6232.28600 17.190310  
##      4) lstat>=16.085 128  2319.63700 14.063280  
##        8) nox>=0.603 85  1087.11000 12.381180  
##         16) crim>=10.4524 33   209.86240  9.615152 *
##         17) crim< 10.4524 52   464.54060 14.136540 *
##        9) nox< 0.603 43   516.60420 17.388370 *
##      5) lstat< 16.085 130  1428.65700 20.269230 *
##    3) lstat< 9.95 197 14368.77000 29.275130  
##      6) rm< 7.141 152  5325.59900 26.052630  
##       12) dis>=2.0224 143  2702.59800 25.290910  
##         24) rm< 6.543 80   694.70190 22.818750 *
##         25) rm>=6.543 63   898.11270 28.430160 *
##       13) dis< 2.0224 9  1221.70200 38.155560 *
##      7) rm>=7.141 45  2133.10800 40.160000  
##       14) rm< 7.437 19    68.10421 34.436840 *
##       15) rm>=7.437 26   987.88350 44.342310 *
prp(boston.rpart, digits = 4, extra = 1)

The first output (boston.rpart) is the tree structure, which can be visualized as the tree plot. In the tree structure output, you can see what is going on behind each split. For example, the root node shows 455 observations, RSS = 36914.84 (this is the TSS because no split has been made yet), and a predicted value of 22.42 (the average medv of the training sample, again because no split has been made). The first split then generates two nodes (2 and 3), and for each node you can see the values of n, RSS, and yhat. This information can also be obtained from boston.rpart$frame.
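
For example, the node-level summary (splitting variable, n, deviance, and fitted value) can be pulled out directly; a minimal sketch:

## node-level summary stored by rpart
boston.rpart$frame[, c("var", "n", "dev", "yval")]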

Exercise: What is the predicted median housing price (in thousand) given following information:

crim    zn   indus   chas   nox    rm     age    dis    rad   tax   ptratio   black    lstat
0.98    0    21.89   0      0.62   5.76   98.4   2.35   4     437   21.2      262.76   17.31

We know that a regression tree minimizes RSS, just like a linear regression. However, unlike lm(), an rpart object does not directly store the final RSS, MSE, or sigma (or even the fitted values). We can obtain the fitted values using predict() and then compute RSS or MSE by definition.

## obtain fitted value
fitted <- predict(boston.rpart)
# please complete the following lines as a quick exercise.
RSS <-      
MSE <- ## note that MSE is simply the average of squared residual. We do not use the df of n-p-1
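
One possible completion (treat this as a sketch rather than an answer key):

res <- boston.train$medv - fitted   # residuals on the training sample
RSS <- sum(res^2)
MSE <- mean(res^2)                  # average squared residual; no df adjustment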

Prediction using regression tree

Prediction for a regression tree works the same way as for lm() and glm() models.

boston.test.pred.tree = predict(boston.rpart, newdata=boston.test)

Exercise: Calculate the mean squared prediction error (MSPE) for this tree model

MSPE.tree<- 
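
For reference, one possible way to fill this in:

MSPE.tree <- mean((boston.test$medv - boston.test.pred.tree)^2)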

Comparing the performance of the regression tree with a linear regression model in terms of prediction error (Exercise)

Calculate the mean squared prediction error (MSPE) for a linear regression model using all variables, then compare the results. What is your conclusion? Further, try comparing the regression tree with the best linear regression model obtained from a variable selection procedure.

boston.lm<- 
pred.lm<- 
MSPE.lm<-
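
A minimal sketch of the full-model comparison (variable selection is left as an exercise):

boston.lm <- lm(medv ~ ., data = boston.train)
pred.lm <- predict(boston.lm, newdata = boston.test)
MSPE.lm <- mean((boston.test$medv - pred.lm)^2)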

Pruning (fine tune the tree)

In rpart(), the cp (complexity parameter) argument is one of the parameters used to control the complexity of the tree. The help document for rpart says: “Any split that does not decrease the overall lack of fit by a factor of cp is not attempted”. For a regression tree, the overall R-squared must increase by at least cp at each step. Basically, the smaller the cp value, the larger (more complex) a tree rpart will attempt to fit. The default value of cp is 0.01.

The idea of pruning a tree is that we start with a large tree (small cp value) that may overfit the data, and then cut it down by choosing an appropriate cp value (via cross validation).

boston.largetree <- rpart(formula = medv ~ ., data = boston.train, cp = 0.001)
prp(boston.largetree)

The plotcp() function plots the relationship between the (default 10-fold) cross-validation error and the size of the tree.

plotcp(boston.largetree)

From left to right, cp decreases, which means the tree is growing. You can observe from the above graph that the cross-validation error (x-val) does not always go down as the tree becomes more complex. The analogy is that adding more variables to a regression model does not necessarily improve its ability to predict future observations. A good choice of cp for pruning is often the leftmost value for which the mean lies below the horizontal line. In the Boston housing example, you may conclude that a tree with more than 10 splits is not helpful.

To examine the error versus the size of the tree more carefully, you can look at the following table:

cptable <- printcp(boston.largetree)
## 
## Regression tree:
## rpart(formula = medv ~ ., data = boston.train, cp = 0.001)
## 
## Variables actually used in tree construction:
##  [1] black   crim    dis     indus   lstat   nox     ptratio rad     rm     
## [10] tax    
## 
## Root node error: 36915/455 = 81.132
## 
## n= 455 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.4419302      0   1.00000 1.00499 0.086707
## 2  0.1871893      1   0.55807 0.60615 0.054567
## 3  0.0672898      2   0.37088 0.43606 0.046555
## 4  0.0379603      3   0.30359 0.37646 0.043525
## 5  0.0300633      4   0.26563 0.38794 0.044756
## 6  0.0291785      5   0.23557 0.36276 0.040834
## 7  0.0193939      6   0.20639 0.32230 0.038289
## 8  0.0111800      7   0.18699 0.30102 0.038532
## 9  0.0078412      8   0.17581 0.28046 0.039221
## 10 0.0070571      9   0.16797 0.27921 0.039392
## 11 0.0068462     10   0.16092 0.28034 0.039492
## 12 0.0052269     11   0.15407 0.26897 0.036615
## 13 0.0050412     12   0.14884 0.26432 0.035393
## 14 0.0040287     13   0.14380 0.25672 0.035338
## 15 0.0036138     14   0.13977 0.25200 0.035261
## 16 0.0028639     15   0.13616 0.25313 0.035239
## 17 0.0021987     16   0.13330 0.24677 0.034806
## 18 0.0021533     17   0.13110 0.24939 0.034807
## 19 0.0018212     18   0.12894 0.24977 0.034805
## 20 0.0017329     19   0.12712 0.25055 0.034796
## 21 0.0016753     20   0.12539 0.24717 0.032903
## 22 0.0016528     21   0.12371 0.24763 0.032908
## 23 0.0013995     22   0.12206 0.24839 0.032922
## 24 0.0013935     23   0.12066 0.25029 0.033050
## 25 0.0013211     25   0.11788 0.25016 0.033061
## 26 0.0010855     26   0.11655 0.24983 0.033035
## 27 0.0010000     28   0.11438 0.24891 0.032917

Root node error is the error when you make the simplest possible prediction; in the regression case, it is the mean squared error (MSE) when you use the average of medv as the prediction. That is,

sum((boston.train$medv - mean(boston.train$medv))^2)/nrow(boston.train)
## [1] 81.13151

The first two columns, CP and nsplit, tell you how large the tree is. The third column is a relative error, which is MSE/(root node error). Therefore, rel error \(\times\) root node error is the training MSE. For example, for the last row, 0.11438 \(\times\) 81.132 \(\approx\) 9.28, which matches the in-sample MSE calculated with predict():

mean((predict(boston.largetree) - boston.train$medv)^2)
## [1] 9.280078

xerror is the cross-validation (default 10-fold) error. You can see that the rel error (training error) always decreases as the model gets more complex, while the cross-validation error (a measure of performance on future observations) does not. That is why we prune the tree to avoid overfitting.

To prune the tree, we need to choose the appropriate cp value from the cp table. Then we can use the function prune() with the chosen cp value to reduce the tree.

## cp that corresponds to the minimum CV error
cp.min <- cptable[which.min(cptable[,4]),1]
tree.prune.min<- prune(boston.largetree, cp = cp.min)
## 1-SE rule: find the row whose xerror is closest to min(xerror) + its xstd
## (the code is a little involved; you can also eyeball the cp table instead)
ind <- which.min(abs(cptable[,4]-sum(cptable[which.min(cptable[,4]),c(4,5)])))
cp.1se <- cptable[ind,1]
tree.prune.1se<- prune(boston.largetree, cp = cp.1se)

Exercise: For different models, make prediction on testing sample and compute MSPE.
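
One possible sketch:

## MSPE of the two pruned trees on the testing sample
MSPE.min <- mean((boston.test$medv - predict(tree.prune.min, newdata = boston.test))^2)
MSPE.1se <- mean((boston.test$medv - predict(tree.prune.1se, newdata = boston.test))^2)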




Classification Tree (Credit Card Default data)

Preparation

Load the data, rename the response variable (because the original name is too long), convert categorical variables to factors, and randomly split into training and testing samples.

credit.data <- read_csv("https://www.dropbox.com/s/tnoo06n8m842uit/credit_card_default.csv?dl=1")
# rename
credit.data<- rename(credit.data, default=`default payment next month`)

# convert categorical data to factor
credit.data$EDUCATION<- as.factor(credit.data$EDUCATION)
credit.data$MARRIAGE<- as.factor(credit.data$MARRIAGE)

# random splitting
set.seed(123)
index <- sample(nrow(credit.data),nrow(credit.data)*0.80)
credit.train = credit.data[index,]
credit.test = credit.data[-index,]

Fitting classification tree

You need to tell R that you want a classification tree by specifying method = "class"; because the response default is coded as 0/1 (numeric), rpart would otherwise fit a regression tree.

credit.rpart0 <- rpart(formula = default ~ ., data = credit.train, method = "class")
prp(credit.rpart0, extra = 1)

With the default cp we get a tree with only one split. This tree minimizes the symmetric cost, which is equivalent to the misclassification rate.

Exercise:

  1. Below is the tree structure output. Can you draw the confusion matrix based on this output?

  2. Obtain the confusion matrix using predictions (e.g., predict(credit.rpart0, type="class")); a sketch is given after the output below.

  3. Please interpret this confusion matrix. Does it look good? Why?

credit.rpart0
## n= 24000 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 24000 5346 0 (0.7772500 0.2227500)  
##   2) PAY_0< 1.5 21488 3593 0 (0.8327904 0.1672096) *
##   3) PAY_0>=1.5 2512  759 1 (0.3021497 0.6978503) *
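
For item 2, one possible sketch of the in-sample confusion matrix:

pred0 <- predict(credit.rpart0, type = "class")
table(observed = credit.train$default, predicted = pred0)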

Classification tree with asymmetric cost

Recall the example in logistic regression. In the credit scoring case it means that false negatives (predicting 0 when truth is 1, or giving out loans that end up in default) will cost more than false positives (predicting 1 when truth is 0, rejecting loans that you should not reject).

Here we assume that a false negative costs 5 times as much as a false positive. In real life the cost structure should be carefully researched.

credit.rpart1 <- rpart(formula = default ~ ., 
                      data = credit.train, 
                      method = "class", 
                      parms = list(loss=matrix(c(0,5,1,0), nrow = 2)))
prp(credit.rpart1, extra = 1)

  • The parms argument is a list, in which the loss matrix can be specified via loss. The diagonal elements are 0, and the off-diagonal elements tell you the loss (cost) of a misclassification. For binary classification, the numbers in c() specify the cost in this sequence: c(0, False Negative, False Positive, 0), because the matrix is filled column by column. If you have a symmetric cost, you can omit the parms argument.
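
For clarity, this is how that vector fills the 2-by-2 loss matrix (column by column):

matrix(c(0, 5, 1, 0), nrow = 2)
##      [,1] [,2]
## [1,]    0    1
## [2,]    5    0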

Exercise:

  1. For each of the above classification trees, obtain the FPR and FNR for the testing sample (one possible sketch follows this list).

  2. Compute the cost (on testing sample) for each model, respectively.

  3. Compare these tree models with logistic regression in terms of MR, FPR, and FNR. (Use the same cost ratio to select the optimal cutoff.)

  4. In predict() for the tree model, try type="prob". What can you say about these predicted probabilities?
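
As a starting point for items 1 and 2, a possible sketch using credit.rpart1 (credit.rpart0 is analogous):

pred.test1 <- predict(credit.rpart1, newdata = credit.test, type = "class")
FPR <- sum(credit.test$default == 0 & pred.test1 == 1)/sum(credit.test$default == 0)
FNR <- sum(credit.test$default == 1 & pred.test1 == 0)/sum(credit.test$default == 1)
## weighted cost with FN:FP = 5:1
cost <- 1*sum(credit.test$default == 0 & pred.test1 == 1) + 5*sum(credit.test$default == 1 & pred.test1 == 0)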

ROC Curve for Classification Tree

To get the ROC curve, we need the predicted probability of Y being 1 from the fitted tree.

credit.test.prob.rpart<- predict(credit.rpart1, credit.test, type="prob")

credit.test.prob.rpart has 2 columns: the first one is P(Y = 0) and the second is P(Y = 1). We need the second column.

library(ROCR)

pred = prediction(credit.test.prob.rpart[,2], credit.test$default) 
perf = performance(pred, "tpr", "fpr")
plot(perf)

slot(performance(pred, "auc"), "y.values")[[1]]
## [1] 0.7341126

Exercise: Compare tree model with logistic regression in terms of AUC.
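
A possible baseline for this comparison, assuming a plain logistic regression with all predictors (drop any ID-type column first if your data has one, and adjust the formula to match your earlier logistic model):

credit.glm <- glm(default ~ ., data = credit.train, family = binomial)
prob.glm <- predict(credit.glm, newdata = credit.test, type = "response")
pred.glm <- prediction(prob.glm, credit.test$default)
slot(performance(pred.glm, "auc"), "y.values")[[1]]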

Prune a classification tree

CAUTION!!! When the cost is symmetric (the default), the pruning process is the same as for a regression tree. However, when you specify an asymmetric cost, the cp table and plot can be misleading. Below I demonstrate this problem using our example.

First, we fit a big tree with asymmetric cost of FN:FP=5:1 and cp=0.001, and then plot the cp plot.

Cmat <- matrix(c(0,5,1,0), nrow = 2)
credit.rpart.bigtree <- rpart(formula = default ~ ., 
                              data = credit.train, 
                              method = "class", 
                              parms = list(loss=Cmat), 
                              cp=0.001)
plotcp(credit.rpart.bigtree)

This cp plot suggests that the optimal tree is the tree with one split, and anything beyond is significantly worse.

Now let’s compute the cost on a hold-out sample (the test set) for the largest tree (no pruning, supposedly worse) and the optimal tree (pruned to one split, supposedly better). We would expect cost1 to be bigger than cost2.

## prediction and cost of the large tree
pred1<- predict(credit.rpart.bigtree, newdata = credit.test, type = "class")
cost1 <- sum(pred1==1 & credit.test$default==0)*Cmat[1,2]+sum(pred1==0 & credit.test$default==1)*Cmat[2,1]
## prediction and cost of the one-split tree
credit.rpart.prune <- prune(credit.rpart.bigtree, cp=0.1)
pred2<- predict(credit.rpart.prune, newdata = credit.test, type = "class")
cost2 <- sum(pred2==1 & credit.test$default==0)*Cmat[1,2]+sum(pred2==0 & credit.test$default==1)*Cmat[2,1]
## compare the two costs (We would expect that cost1 > cost2)
cost1 > cost2
## [1] FALSE

This means the first tree outperforms the second tree, which contradicts the suggestion from the cp plot. You can verify this “issue” by using a different cost matrix (e.g., flip the ratio) or a different dataset.

Then how should we prune a classification tree with asymmetric cost? The easiest way is to write a loop that searches across a sequence of cp values and evaluates the weighted cost on a hold-out sample; a sketch is given below. More formally, you can use caret or manually implement cross-validation with the weighted cost as the evaluation score.
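
A minimal sketch of such a search over the cp values recorded in the big tree's cp table, scored by the weighted cost on the hold-out sample:

cp.seq <- credit.rpart.bigtree$cptable[, "CP"]
cost.seq <- sapply(cp.seq, function(cp){
  tree.cp <- prune(credit.rpart.bigtree, cp = cp)
  pred.cp <- predict(tree.cp, newdata = credit.test, type = "class")
  ## weighted cost: FP weight Cmat[1,2]=1, FN weight Cmat[2,1]=5
  sum(pred.cp == 1 & credit.test$default == 0)*Cmat[1,2] +
    sum(pred.cp == 0 & credit.test$default == 1)*Cmat[2,1]
})
cp.best <- cp.seq[which.min(cost.seq)]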



CART for MNIST data

Preparations

digit<- data.matrix(read_csv("https://www.dropbox.com/s/ulujvi2a4ykfzju/train.csv?dl=1"))
dim(digit)
## [1] 42000   785

Visualizing data

## visualize the data
plotTrain <- function(data, index){
  op <- par(no.readonly=TRUE)
  x <- ceiling(sqrt(length(index)))
  par(mfrow=c(x, x), mar=c(.1, .1, .1, .1))
  
  for (i in index){ #reverse and transpose each matrix to rotate images
    m <- matrix(data[i,-1], nrow=28, byrow=TRUE)
    m <- apply(m, 2, rev)
    image(t(m), col=grey.colors(255), axes=FALSE)
    text(0.05, 0.2, col="white", cex=1.2, data[i, 1])
  }
  par(op) #reset the original graphics parameters
}

plotTrain(data=digit, index=1:100)

Split data to training and testing

index<- sample(1:nrow(digit), 0.6*nrow(digit))
train<- digit[index,]
test<- digit[-index,]

Standardize

Currently, each pixel uses 0-255 to represent the grey scale. We rescale it to 0-1.

## standardize X
train.x <- train[,-1] #remove 'label' column
test.x<- test[,-1]
train.y <- train[,1] #label column
test.y<- test[,1]
train.x <- train.x/255
test.x <- test.x/255

Classification tree

Here we use a classification tree to train a classifier, and then compare it with the multinomial logit model (from the last lab).

Due to the large size, we only use the first 3000 observations as the training sample.

fit.tree <- rpart(y ~., method = "class", data = data.frame(y=train.y[1:3000], x=train.x[1:3000,]), cp=0.00001)
plotcp(fit.tree)

printcp(fit.tree)
## 
## Classification tree:
## rpart(formula = y ~ ., data = data.frame(y = train.y[1:3000], 
##     x = train.x[1:3000, ]), method = "class", cp = 1e-05)
## 
## Variables actually used in tree construction:
##  [1] x.pixel103 x.pixel120 x.pixel131 x.pixel150 x.pixel152 x.pixel155
##  [7] x.pixel156 x.pixel157 x.pixel177 x.pixel181 x.pixel183 x.pixel184
## [13] x.pixel188 x.pixel189 x.pixel210 x.pixel216 x.pixel238 x.pixel239
## [19] x.pixel244 x.pixel267 x.pixel269 x.pixel270 x.pixel294 x.pixel296
## [25] x.pixel299 x.pixel300 x.pixel317 x.pixel321 x.pixel323 x.pixel326
## [31] x.pixel345 x.pixel346 x.pixel348 x.pixel349 x.pixel350 x.pixel353
## [37] x.pixel355 x.pixel356 x.pixel372 x.pixel375 x.pixel376 x.pixel378
## [43] x.pixel379 x.pixel400 x.pixel401 x.pixel405 x.pixel407 x.pixel408
## [49] x.pixel409 x.pixel429 x.pixel432 x.pixel433 x.pixel455 x.pixel458
## [55] x.pixel463 x.pixel484 x.pixel485 x.pixel486 x.pixel488 x.pixel489
## [61] x.pixel491 x.pixel493 x.pixel496 x.pixel513 x.pixel514 x.pixel515
## [67] x.pixel516 x.pixel518 x.pixel519 x.pixel522 x.pixel538 x.pixel542
## [73] x.pixel547 x.pixel551 x.pixel553 x.pixel568 x.pixel569 x.pixel595
## [79] x.pixel597 x.pixel598 x.pixel601 x.pixel636 x.pixel657 x.pixel658
## [85] x.pixel660 x.pixel70  x.pixel713 x.pixel96 
## 
## Root node error: 2682/3000 = 0.894
## 
## n= 3000 
## 
##            CP nsplit rel error  xerror      xstd
## 1  0.09545116      0   1.00000 1.01081 0.0060255
## 2  0.09134974      1   0.90455 0.91462 0.0078854
## 3  0.06413125      2   0.81320 0.78896 0.0093103
## 4  0.04399702      5   0.62081 0.63833 0.0101086
## 5  0.03914989      7   0.53281 0.58688 0.0101987
## 6  0.03318419      8   0.49366 0.53990 0.0102049
## 7  0.01752424      9   0.46048 0.52312 0.0101897
## 8  0.01603281     10   0.44295 0.47949 0.0101066
## 9  0.01528710     11   0.42692 0.46532 0.0100660
## 10 0.01267711     12   0.41163 0.45638 0.0100367
## 11 0.01230425     13   0.39896 0.44855 0.0100089
## 12 0.01043997     15   0.37435 0.44482 0.0099949
## 13 0.00745712     16   0.36391 0.43251 0.0099453
## 14 0.00708427     17   0.35645 0.41573 0.0098690
## 15 0.00633855     18   0.34937 0.40604 0.0098203
## 16 0.00596570     19   0.34303 0.39448 0.0097577
## 17 0.00410142     20   0.33706 0.37248 0.0096247
## 18 0.00391499     25   0.31581 0.34862 0.0094590
## 19 0.00372856     29   0.30015 0.34526 0.0094339
## 20 0.00335570     31   0.29269 0.33371 0.0093437
## 21 0.00316928     32   0.28934 0.32252 0.0092510
## 22 0.00298285     34   0.28300 0.31991 0.0092286
## 23 0.00260999     38   0.27107 0.31581 0.0091927
## 24 0.00242356     46   0.25019 0.31245 0.0091628
## 25 0.00223714     50   0.24049 0.30425 0.0090877
## 26 0.00205071     59   0.22036 0.30425 0.0090877
## 27 0.00186428     61   0.21626 0.30350 0.0090807
## 28 0.00167785     69   0.20134 0.30089 0.0090560
## 29 0.00149142     71   0.19799 0.29679 0.0090166
## 30 0.00111857     73   0.19500 0.29679 0.0090166
## 31 0.00074571     81   0.18606 0.29418 0.0089911
## 32 0.00037286     92   0.17785 0.29567 0.0090057
## 33 0.00012429     97   0.17599 0.30052 0.0090525
## 34 0.00001000    100   0.17562 0.30052 0.0090525
# make prediction
pred.y.tree<- predict(fit.tree, data.frame(y=test.y, x=test.x), type = "class")
# accuracy rate
mean(test.y==pred.y.tree)
## [1] 0.7406548

We got pretty good accuracy. As we learn more advanced ML algorithms, you will see that the accuracy can get close to 99%.

Visualize the results

plotResults <- function(testdata, index, preds){
  op <- par(no.readonly=TRUE)
  x <- ceiling(sqrt(length(index)))
  par(mfrow=c(x,x), mar=c(.1,.1,.1,.1))
  
  for (i in index){
    m <- matrix(testdata[i,], nrow=28, byrow=TRUE)
    m <- apply(m, 2, rev)
    image(t(m), col=grey.colors(255), axes=FALSE)
    text(0.05,0.1,col="green", cex=1.2, preds[i])
  }
  par(op)
}

Here are the first 100 images in the test set and their predicted values:

plotResults(testdata=test.x, index=1:100, preds=pred.y.tree)

We see it did a very good job.

Things to remember

  • Use rpart() to fit regression and classification trees.

  • Know how to interpret a tree.

  • Use predict() for prediction, and know how to assess the performance.

  • Know how to use the cp plot/table to prune a large tree.