We will use the Credit Card Default Data for this lab and illustration. The details of the data can be found at http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients. Think about what kinds of factors could lead people to fail to pay their credit card balance.

Data Preparation and EDA

library(tidyverse)
credit.data <- read.csv("https://www.dropbox.com/s/tnoo06n8m842uit/credit_card_default.csv?dl=1")
dim(credit.data)
## [1] 30000    24

Let's look at what information we have.

colnames(credit.data)
##  [1] "LIMIT_BAL"                  "SEX"                       
##  [3] "EDUCATION"                  "MARRIAGE"                  
##  [5] "AGE"                        "PAY_0"                     
##  [7] "PAY_2"                      "PAY_3"                     
##  [9] "PAY_4"                      "PAY_5"                     
## [11] "PAY_6"                      "BILL_AMT1"                 
## [13] "BILL_AMT2"                  "BILL_AMT3"                 
## [15] "BILL_AMT4"                  "BILL_AMT5"                 
## [17] "BILL_AMT6"                  "PAY_AMT1"                  
## [19] "PAY_AMT2"                   "PAY_AMT3"                  
## [21] "PAY_AMT4"                   "PAY_AMT5"                  
## [23] "PAY_AMT6"                   "default.payment.next.month"

Let's look at how many people actually defaulted in this sample.

mean(credit.data$default.payment.next.month)
## [1] 0.2212

The name of the response variable is too long! Let's make it shorter by renaming it. Recall the rename() function.

credit.data<- rename(credit.data, default=default.payment.next.month)

How about the variable type and summary statistics?

str(credit.data)    # structure - see variable type
summary(credit.data) # summary statistics

We see that all variables are int, but we know that EDUCATION and MARRIAGE are categorical, so we convert them to factors.

credit.data$EDUCATION<- as.factor(credit.data$EDUCATION)
credit.data$MARRIAGE<- as.factor(credit.data$MARRIAGE)

Two-way contingency table and Chi-square test

A two-way contingency table is a very useful tool for exploring the relationship between categorical variables. It is essentially the simplest pivot table (see the example below). Often, after you create a two-way contingency table, a Chi-square test is used to test whether X affects Y. The null hypothesis is: X and Y are independent (e.g., MARRIAGE has nothing to do with the likelihood of default).

Here is a very good tutorial for Chi-square test https://www.youtube.com/watch?v=WXPBoFDqNVk.

table.edu<- table(credit.data$EDUCATION, credit.data$default)
chisq.test(table.edu)
## 
##  Pearson's Chi-squared test
## 
## data:  table.edu
## X-squared = 160.41, df = 3, p-value < 2.2e-16

What we see from the above test result is that the p-value < 0.05. What is your conclusion?

We omit the rest of the EDA here, but you shouldn't skip it whenever you are doing data analysis.


Fitting a Logistic Regression

For this example, our goal is to predict credit card default, which is a binary outcome variable.

To practice the training-testing procedure, we randomly split the data into training (80%) and testing (20%) samples.

index <- sample(nrow(credit.data),nrow(credit.data)*0.80)
credit.train = credit.data[index,]
credit.test = credit.data[-index,]
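
Note (optional, a hedged suggestion): because the split is random, your numbers below will differ slightly from the ones shown. If you want a reproducible split, call set.seed() with any fixed value before sample(), e.g.:

set.seed(13579)   # arbitrary seed value, for reproducibility only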

Train a logistic regression model with selected X variables

credit.glm1<- glm(default~LIMIT_BAL+SEX+EDUCATION+MARRIAGE+PAY_0+BILL_AMT1+PAY_AMT1, family=binomial, data=credit.train)
summary(credit.glm1)
## 
## Call:
## glm(formula = default ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE + 
##     PAY_0 + BILL_AMT1 + PAY_AMT1, family = binomial, data = credit.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9938  -0.7020  -0.5447  -0.3071   3.6484  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.508e-01  7.244e-02  -8.984  < 2e-16 ***
## LIMIT_BAL   -1.566e-06  1.646e-07  -9.513  < 2e-16 ***
## SEX         -1.194e-01  3.375e-02  -3.539 0.000402 ***
## EDUCATION2  -6.584e-02  3.943e-02  -1.670 0.094963 .  
## EDUCATION3  -6.018e-02  5.166e-02  -1.165 0.244011    
## EDUCATION4  -1.180e+00  2.058e-01  -5.733 9.89e-09 ***
## MARRIAGE2   -2.490e-01  3.454e-02  -7.211 5.56e-13 ***
## MARRIAGE3   -3.862e-01  1.524e-01  -2.534 0.011282 *  
## PAY_0        6.849e-01  1.660e-02  41.250  < 2e-16 ***
## BILL_AMT1   -1.153e-06  2.780e-07  -4.148 3.36e-05 ***
## PAY_AMT1    -1.382e-05  2.303e-06  -6.002 1.94e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25402  on 23999  degrees of freedom
## Residual deviance: 22542  on 23989  degrees of freedom
## AIC: 22564
## 
## Number of Fisher Scoring iterations: 5

You may have seen glm() before. In this lab, it is the main function used to build logistic regression models, because logistic regression is a member of the generalized linear model family. In glm(), the only thing new is family. It specifies the distribution of your response variable. You may also specify the link function after the name of the distribution, for example, family=binomial(link="logit") (the default link is logit). You may also use glm() to build many other generalized linear models…
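
For instance, here is a minimal sketch (not part of the lab's required output; the object name credit.glm.probit is just illustrative) of fitting the same model with a probit link instead of the default logit:

credit.glm.probit <- glm(default ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE + PAY_0 + BILL_AMT1 + PAY_AMT1,
                         family = binomial(link = "probit"), data = credit.train)
summary(credit.glm.probit)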

Get some criteria of model fitting

You can easily extract some criteria of the model fit, for example, the residual deviance (analogous to SSE in a linear regression model), AIC, and BIC. Unlike linear regression models, there is no \(R^2\) in logistic regression.

credit.glm1$deviance
## [1] 22542.49
AIC(credit.glm1)
## [1] 22564.49
BIC(credit.glm1)
## [1] 22653.44
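
As a quick sanity check (a hedged sketch; the exact numbers depend on your random split), AIC and BIC can be reproduced from the residual deviance, since for a binomial glm with 0/1 data AIC = deviance + 2p and BIC = deviance + p*log(n), where p is the number of estimated coefficients:

p <- length(coef(credit.glm1))    # number of estimated coefficients (11 in this model)
n <- nrow(credit.train)
credit.glm1$deviance + 2*p        # should match AIC(credit.glm1)
credit.glm1$deviance + log(n)*p   # should match BIC(credit.glm1)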

Prediction

Similar to linear regression, we use the predict() function for prediction.

In-sample prediction (less important)

pred.glm1.train<- predict(credit.glm1, type="response")

(IMPORTANT!!!) You need to specify type="response" in order to get the probability outcome, which is what we want. Otherwise, what it produces is the linear predictor term \(\beta_0+\beta_1X_1+\beta_2X_2+\dots\). Recall from the lecture: how is this linear predictor related to the probability?
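
As a quick check (a minimal sketch), the default output of predict() is the linear predictor \(\eta\), and applying the inverse-logit (logistic) function \(p = \frac{1}{1+e^{-\eta}}\) recovers the probabilities:

eta.train <- predict(credit.glm1)                              # linear predictor (type="link" is the default)
p.train <- 1/(1 + exp(-eta.train))                             # same as plogis(eta.train)
all.equal(as.numeric(p.train), as.numeric(pred.glm1.train))    # should return TRUE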

ROC Curve

install.packages('ROCR')
library(ROCR)
pred <- prediction(pred.glm1.train, credit.train$default)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7173588

Be careful that the function prediction() is different from predict(). It is in the package ROCR, and is used specifically to prepare for the ROC curve. Recall our lecture: this function basically calculates many confusion matrices with different cut-off probabilities. Therefore, it requires two vectors as inputs – the predicted probability and the observed response (0/1). The next line, performance(), calculates TPR and FPR based on all the confusion matrices you got from the previous step. Then you can simply draw the ROC curve, which is a curve of FPR vs. TPR. The last line gets the AUC (area under the curve). I recommend keeping these four lines of code together and reusing them whenever you need an ROC curve and AUC. If you don't want to draw the ROC curve (because it takes time), just comment out the plot line.
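
One way to keep them together (a hedged sketch; the function name roc.auc is just illustrative) is to wrap the four lines into a small helper that takes the predicted probability and the observed 0/1 response:

roc.auc <- function(prob, obs, draw = TRUE){
  pred <- ROCR::prediction(prob, obs)                          # confusion matrices across all cut-offs
  if(draw) plot(ROCR::performance(pred, "tpr", "fpr"), colorize = TRUE)
  unlist(slot(ROCR::performance(pred, "auc"), "y.values"))     # return the AUC
}
roc.auc(pred.glm1.train, credit.train$default)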

Precision-recall Curve and F1-score

Precision is defined as TP/(TP+FP), which measures how many of the predicted positive cases were actually positive. High precision means that when the model predicts a positive outcome, it is usually correct. This is particularly valuable in scenarios where false positives are costly or undesirable (e.g., in spam detection, where high precision ensures fewer non-spam emails are incorrectly flagged as spam). Recall is the same as TPR, defined as TP/(TP+FN), which measures how good the model is at capturing all actual positives. These two measures often have a tradeoff. To balance this tradeoff, their harmonic mean is used, which is called the \(F_1\)-score, defined as \(2\cdot\frac{\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}\).

Precision-recall curves are particularly useful for evaluating models on imbalanced data because they focus on the model’s performance with respect to the positive class (often the minority class in imbalanced datasets) without being influenced by the high number of true negatives (majority class), which can otherwise mask poor performance.

RP.perf <- performance(pred, "prec", "rec")
plot(RP.perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "aucpr"), "y.values"))
## [1] 0.4948925
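
To make these definitions concrete, here is a minimal sketch computing precision, recall, and F1 on the training sample at an illustrative cut-off of 0.5 (choosing the cut-off is discussed in the classification section below):

pred.class.5 <- (pred.glm1.train > 0.5)*1
TP <- sum(credit.train$default == 1 & pred.class.5 == 1)
FP <- sum(credit.train$default == 0 & pred.class.5 == 1)
FN <- sum(credit.train$default == 1 & pred.class.5 == 0)
precision <- TP/(TP + FP)
recall    <- TP/(TP + FN)
F1 <- 2*precision*recall/(precision + recall)
c(precision = precision, recall = recall, F1 = F1)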

Out-of-sample prediction (more important)

pred.glm1.test<- predict(credit.glm1, newdata = credit.test, type="response")

For out-of-sample prediction, you have to supply the testing sample through the newdata argument.

ROC Curve

pred <- prediction(pred.glm1.test, credit.test$default)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7210586

Exercise: draw the precision-recall curve and find its AUC.


Binary Classification

As we discussed in the lecture, people may be more interested in the classification results. But we have to define a cut-off probability first.

These tables illustrate the impact of choosing different cut-off probabilities. Choosing a large cut-off probability will result in few cases being predicted as 1, and choosing a small cut-off probability will result in many cases being predicted as 1.

table((pred.glm1.train > 0.9)*1)
## 
##     0     1 
## 23959    41
table((pred.glm1.train > 0.5)*1)
## 
##     0     1 
## 22195  1805
table((pred.glm1.train > 0.2)*1)
## 
##     0     1 
## 12514 11486
table((pred.glm1.train > 0.0001)*1)
## 
##     0     1 
##     1 23999

Therefore, determining the optimal cut-off probability is crucial. The simplest way to determine the cut-off is to use the proportion of "1" in the original data. We will introduce a more appropriate way to determine the optimal p-cut.

Naive Choice of Cut-off probability

The simplest way is to choose the event proportion in the training sample. This is roughly reasonable because the sample proportion is an estimate of the mean probability of \(Y=1\).

pcut1<- mean(credit.train$default)

Based on this cut-off probability, we can obtain the binary prediction (predicted classification) and the confusion matrix.

# get binary prediction
class.glm1.train<- (pred.glm1.train>pcut1)*1
# get confusion matrix
table(credit.train$default, class.glm1.train, dnn = c("True", "Predicted"))
##     Predicted
## True     0     1
##    0 12972  5704
##    1  1886  3438

In the table() function, both input vectors must be binary in order to get a confusion matrix (which is essentially a pivot table or contingency table); dnn specifies the row and column names of this 2*2 table. The first input vector is the observed (true) response, so the first name is "True" accordingly.

Then it is easy to get different types of classification error rates, i.e., the false positive rate (FPR), false negative rate (FNR), and overall misclassification rate (MR). Commonly, you can use the overall MR as the cost (a criterion) to evaluate the model prediction.

# (equal-weighted) misclassification rate
MR<- mean(credit.train$default!=class.glm1.train)
# False positive rate
FPR<- sum(credit.train$default==0 & class.glm1.train==1)/sum(credit.train$default==0)
# False negative rate (exercise)
FNR<- 

Determine Optimal cut-off Probability using Grid Search Method

Recall from the lecture that a different p-cut results in a different confusion matrix, hence a different MR (or cost). You need to search all possible p-cuts to find the one that provides the minimum cost. The first step is to define a symmetric/asymmetric cost function (misclassification rate) as a function of the cut-off. Think about what else is needed to calculate MR. The answer: the observed \(Y\) and the predicted probability.

Asymmetric cost

Searching for the optimal cutoff is especially useful with asymmetric costs. In many cases, such as our default prediction example, the costs of the two types of misclassification error are different, and the total misclassification rate (MR) may not be useful. For example, among 1000 observations with 950 nondefault and 50 default cases, if we simply predict everyone to be nondefault, then MR=0.05, which seems very low. We know such a prediction is not acceptable.
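
A tiny numeric illustration of this point (hedged; the 5:1 cost ratio is purely illustrative):

# 950 nondefault, 50 default; naive rule predicts everyone as nondefault
obs.toy  <- c(rep(0, 950), rep(1, 50))
pred.toy <- rep(0, 1000)
mean(obs.toy != pred.toy)                  # MR = 0.05, looks excellent
sum((obs.toy == 0 & pred.toy == 1)*1 +
    (obs.toy == 1 & pred.toy == 0)*5)      # cost = 50*5 = 250 with $1 per FP, $5 per FN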

In general, we can define the cost in economic values. For example, 1 case of FP costs $1 and 1 case of FN costs $10 (these values are typically determined by domain experts). Then we write:

# first, define weight, e.g., $1 vs. $10.
w.FP <- 1
w.FN <- 10
# then make binary prediction with a specific value of pcut
pcut <- 0.5
pred.p <- pred.glm1.train
pred.class <- (pred.p>pcut)*1
# finally calculate the cost
obs <- credit.train$default
cost <- sum((obs==0 & pred.class==1)*w.FP+(obs==1 & pred.class==0)*w.FN)
cost
## [1] 40932

Now we can write a loop to search for the optimal pcut among a series of pcut values.

Exercise: complete the following code to plot the cost against pcut, and find the optimal value of pcut.

w.FP <- 1
w.FN <- 5
p.seq <- seq(0.01, 1, 0.01) 
cost <- rep(0, length(p.seq))
for(i in 1:length(p.seq)){
  
}

Exercise:

  • Change the weights to different values, and see how your optimal cut-off changes.

  • Obtain the confusion matrix and calculate the (asymmetric) cost based on the optimal cut-off.

  • Find optimal cut-off probability using symmetric cost.

  • Calculate MR and cost, what do you find?

  • Write the cost as a function of obs, pred.p, w.FP, w.FN and pcut.

  • Use F1 score to determine the optimal cut-off.

  • (Optional) Try to use cross-validation to determine the optimal cut-off. You may need to write your own function.

Out-of-sample Classification

Everything you have done about classification so far is for the training sample. Now let's turn to the testing sample. Keep in mind the principle: the testing sample is only used for evaluating your model's prediction accuracy! There is NO NEED TO CHOOSE A CUT-OFF PROBABILITY at this stage.

Exercise:

  • Calculate MR, FPR, FNR based on the optimal cut-off you get from training sample with weights (5:1)
  • Calculate the asymmetric cost based on the same optimal cut-off.
  • Compute F1 score based on the same optimal cut-off.
  • Calculate above statistics based on the cut-off you get from training sample with symmetric weights (1:1)
  • Calculate above statistics based on the cut-off you get from training sample with F1-score

Cross validation

In classification problems, the commonly used scores in cross-validation are MR (or accuracy) and AUC. We can conduct CV with these two metrics using the caret package.

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
# define training control parameters
# note: caret's default CV score is accuracy for a factor response; since default is
# numeric 0/1 here, caret treats this as regression and reports RMSE/Rsquared/MAE below
fit.control <- trainControl(method = "cv", number = 5)
cv.credit.model1 <- train(
  form = as.formula(credit.glm1$call),  
  data = credit.data,
  trControl = fit.control,
  method = "glm",
  family = "binomial"
)
cv.credit.model1
## Generalized Linear Model 
## 
## 30000 samples
##     7 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 24000, 24000, 24000, 24000, 24000 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.3826615  0.1512638  0.2969105
# define training control parameters 
# "summaryFunction = twoClassSummary, classProbs = TRUE" specifies AUC
fit.control <- trainControl(method = "cv", number = 5, summaryFunction = twoClassSummary, classProbs = TRUE)
# note that if AUC is specified, the response must be a factor whose levels are valid
# R variable names, so as.factor() on the 0/1 values won't work here; we recode to "No"/"Yes"
credit.data1 <- mutate(credit.data, default=recode(default, `1`="Yes", `0`="No"))
cv.credit.model1 <- train(
  form = as.formula(credit.glm1$call), 
  data = credit.data1,
  trControl = fit.control,
  method = "glm",
  family = "binomial"
)
cv.credit.model1
## Generalized Linear Model 
## 
## 30000 samples
##     7 predictor
##     2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 24000, 24000, 24000, 24001, 23999 
## Resampling results:
## 
##   ROC        Sens       Spec    
##   0.7172831  0.9716229  0.246685

Variable Selection

Variable Selection with Stepwise Approach

We can use the same variable selection procedures as for linear regression models, i.e., forward, backward, and stepwise selection. Caution: this will take a long time since the sample size is not small.

credit.glm.full <- glm(default~. , family=binomial, data=credit.train)
credit.glm.back <- step(credit.glm.full, trace = 0) # backward selection (if you don't specify anything)
summary(credit.glm.back)
credit.glm.back$deviance
AIC(credit.glm.back)
BIC(credit.glm.back)

You can try model selection with BIC (it usually results in a simpler model than the AIC criterion).

credit.glm.back.BIC <- step(credit.glm.full, k=log(nrow(credit.train)), trace = 0) 
summary(credit.glm.back.BIC)
credit.glm.back.BIC$deviance
AIC(credit.glm.back.BIC)
BIC(credit.glm.back.BIC)

Exercise: Try forward and stepwise selection procedures to see if they deliver the same best model.

Exercise:

  • Get ROC curve and AUC for both training and testing sample

  • Using training sample, find the optimal cut-off by grid search method with asymmetric cost (weights ratio = 5:1)

  • Calculate MR, FPR, FNR, and the asymmetric cost for both training and testing samples.


Variable selection with LASSO

Data preparation

Be careful that LASSO requires x to be a numeric matrix. Therefore, we need to manually convert the categorical variables ("SEX", "EDUCATION" and "MARRIAGE") to dummy variables. For simplicity, you can directly convert a categorical variable to numeric with as.numeric() only if you have evidence that it has a monotonic relationship to the response, for example, if the probability of default increases/decreases as EDUCATION goes from 1 to 4. This can be checked from the two-way contingency table by calculating the default proportion at each education level.
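
For example, the default proportion at each education level can be read off the contingency table with row-wise proportions (a quick sketch):

# row-wise proportions: default rate within each EDUCATION level
prop.table(table(credit.data$EDUCATION, credit.data$default), margin = 1)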

Here I will show how to convert categorical variables to dummy variables.

yind <- which(names(credit.data)=="default")
credit.train.Y <- credit.train[,yind]
credit.train.X <- scale(model.matrix(~ ., data = credit.train[,-yind])[,-1])
credit.test.Y <- credit.test[,yind]
credit.test.X <- scale(model.matrix(~ ., data = credit.test[,-yind])[,-1])

The function model.matrix() can automatically convert categorical variables to dummies. It also creates a column of 1s, which we don't need at this time. That column of 1s is used for estimating the intercept if you write the algorithm yourself, but most available functions create it automatically during estimation.
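
You can verify what model.matrix() produced by inspecting the column names of the design matrix (a quick check, not required):

# dummy columns such as EDUCATION2, EDUCATION3, MARRIAGE2, ... replace the original factors
colnames(credit.train.X)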

Model fitting

cv.glmnet() estimates the model and performs cross-validation at the same time.

library(glmnet)
credit.lasso<- glmnet(x=credit.train.X, y=credit.train.Y, family = "binomial")
credit.lasso.cv<- cv.glmnet(x=credit.train.X, y=credit.train.Y, family = "binomial", type.measure = "auc")
plot(credit.lasso.cv)

For logistic regression, we can specify type.measure="auc" so that the CV error will be AUC. Another popular option is type.measure="class", so that the CV error will be the misclassification error (not a good choice for imbalanced data or asymmetric costs).
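
As a small sketch, the selected \(\lambda\) values and the cross-validated AUC at lambda.min can be extracted from the cv.glmnet object:

credit.lasso.cv$lambda.min        # lambda selected by CV (best AUC here)
credit.lasso.cv$lambda.1se        # largest lambda within one standard error of the best
credit.lasso.cv$cvm[credit.lasso.cv$lambda == credit.lasso.cv$lambda.min]   # CV AUC at lambda.min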

Get the coefficients at the optimal \(\lambda\) (lambda.min and lambda.1se)

coef(credit.lasso, s=c(credit.lasso.cv$lambda.min, credit.lasso.cv$lambda.1se))
## 27 x 2 sparse Matrix of class "dgCMatrix"
##                       s1           s2
## (Intercept) -1.462126889 -1.425508076
## LIMIT_BAL   -0.105648158 -0.096677754
## SEX         -0.044339597 -0.019741875
## EDUCATION2  -0.041546180  .          
## EDUCATION3  -0.036869739  .          
## EDUCATION4  -0.129821061 -0.077314569
## MARRIAGE2   -0.096294728 -0.064981662
## MARRIAGE3   -0.041552251 -0.011191000
## AGE          0.057061577  0.035190248
## PAY_0        0.628110567  0.620099151
## PAY_2        0.097442957  0.092359128
## PAY_3        0.104751607  0.106503547
## PAY_4        0.014144712  0.007348165
## PAY_5        0.058255618  0.051698767
## PAY_6        0.003234858  .          
## BILL_AMT1   -0.261930157 -0.112797265
## BILL_AMT2    .            .          
## BILL_AMT3    0.136624465  .          
## BILL_AMT4    .            .          
## BILL_AMT5    .            .          
## BILL_AMT6    0.026070431  .          
## PAY_AMT1    -0.195469573 -0.121775517
## PAY_AMT2    -0.266363174 -0.102725673
## PAY_AMT3    -0.039543261 -0.013214880
## PAY_AMT4    -0.039731926 -0.015607068
## PAY_AMT5    -0.053879417 -0.024416708
## PAY_AMT6    -0.007275098  .

Prediction

# in-sample prediction
pred.lasso.train<- predict(credit.lasso, newx=credit.train.X, s=credit.lasso.cv$lambda.1se, type = "response")
# out-of-sample prediction
pred.lasso.test<- predict(credit.lasso, newx=credit.test.X, s=credit.lasso.cv$lambda.1se, type = "response")

Exercise:

  • Get ROC curve and AUC for both training and testing sample

  • Using training sample, find the optimal cut-off by grid search method with asymmetric cost (weights ratio = 5:1)

  • Calculate MR, FPR, FNR, and the asymmetric cost for both training and testing samples.


Multinomial logistic regression for hand-written digit recognition

This is a very famous data set – MNIST. It is also an ongoing Kaggle competition. Part of the code here was taken from here.

Preparations

Load data

digit<- data.matrix(read_csv("https://www.dropbox.com/s/ulujvi2a4ykfzju/train.csv?dl=1"))
dim(digit)
## [1] 42000   785
colnames(digit)[1:10]
##  [1] "label"  "pixel0" "pixel1" "pixel2" "pixel3" "pixel4" "pixel5" "pixel6"
##  [9] "pixel7" "pixel8"
barplot(table(digit[,1]), col=rainbow(10, 0.5), main="Freq. of Digits")

Visualizing data

## visualize the data
plotTrain <- function(data, index){
  op <- par(no.readonly=TRUE)
  x <- ceiling(sqrt(length(index)))
  par(mfrow=c(x, x), mar=c(.1, .1, .1, .1))
  
  for (i in index){ #reverse and transpose each matrix to rotate images
    m <- matrix(data[i,-1], nrow=28, byrow=TRUE)
    m <- apply(m, 2, rev)
    image(t(m), col=grey.colors(255), axes=FALSE)
    text(0.05, 0.2, col="white", cex=1.2, data[i, 1])
  }
  par(op) #reset the original graphics parameters
}

plotTrain(data=digit, index=1:100)

Split data into training and testing

index<- sample(1:nrow(digit), 0.6*nrow(digit))
train<- digit[index,]
test<- digit[-index,]

Standardize

Currently, each cell uses 0-255 to represent the grey color scale. We rescale it to 0-1.

## standardize X
train.x <- train[,-1] #remove 'label' column
test.x<- test[,-1]
train.y <- train[,1] #label column
test.y<- test[,1]
train.x <- train.x/255
test.x <- test.x/255

Multinomial logit

We use the package nnet (which stands for neural network) for the multinomial logit model. There are many other packages, such as mlogit, but this one is relatively easy to use.

Due to the large size, we only use the first 3000 observations as the training sample.

library(nnet)
fit.mnl <- multinom(y ~., family = "multinomial", data = data.frame(y=train.y[1:3000], x=train.x[1:3000,]), MaxNWts =10000, maxit=50)
## # weights:  7860 (7065 variable)
## initial  value 6907.755279 
## iter  10 value 1413.616169
## iter  20 value 778.643783
## iter  30 value 397.902593
## iter  40 value 251.362238
## iter  50 value 96.218333
## final  value 96.218333 
## stopped after 50 iterations
# make prediction
pred.y.mnl<- predict(fit.mnl, data.frame(y=test.y, x=test.x), type = "class")
# accuracy rate
mean(test.y==pred.y.mnl)
## [1] 0.825119

We got pretty good accuracy. As we learn more advanced ML algorithms, you will see that the accuracy rate can reach 99%.
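
Beyond the overall accuracy, a confusion matrix (a quick sketch) shows which digits the model confuses most often:

# rows = true digit, columns = predicted digit
table(True = test.y, Predicted = pred.y.mnl)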

Visualize the results

plotResults <- function(testdata, index, preds){
  op <- par(no.readonly=TRUE)
  x <- ceiling(sqrt(length(index)))
  par(mfrow=c(x,x), mar=c(.1,.1,.1,.1))
  
  for (i in index){
    m <- matrix(testdata[i,], nrow=28, byrow=TRUE)
    m <- apply(m, 2, rev)
    image(t(m), col=grey.colors(255), axes=FALSE)
    text(0.05,0.1,col="red", cex=1.2, preds[i])
  }
  par(op)
}

Here are the first 100 images in the test set and their predicted values:

plotResults(testdata=test.x, index=1:100, preds=pred.y.mnl)

We see it did a very good job.


Summary

Things to remember

  • Know how to use glm() to build logistic regression;

  • Know how to get ROC and AUC based on predicted probability;

  • Know how to find optimal cut-off;

  • Know how to do binary classification, and calculate MR, FPR, FNR, and cost;

