We will use the Credit Card Default data for this lab and illustration. The details of the data can be found at http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients. Think about what kinds of factors could lead people to fail to pay their credit balance.
library(tidyverse)
credit.data <- read.csv("https://www.dropbox.com/s/tnoo06n8m842uit/credit_card_default.csv?dl=1")
dim(credit.data)
## [1] 30000 24
Let's look at what information we have.
colnames(credit.data)
## [1] "LIMIT_BAL" "SEX"
## [3] "EDUCATION" "MARRIAGE"
## [5] "AGE" "PAY_0"
## [7] "PAY_2" "PAY_3"
## [9] "PAY_4" "PAY_5"
## [11] "PAY_6" "BILL_AMT1"
## [13] "BILL_AMT2" "BILL_AMT3"
## [15] "BILL_AMT4" "BILL_AMT5"
## [17] "BILL_AMT6" "PAY_AMT1"
## [19] "PAY_AMT2" "PAY_AMT3"
## [21] "PAY_AMT4" "PAY_AMT5"
## [23] "PAY_AMT6" "default.payment.next.month"
Let’s look at how many people actually defaulted in this sample.
mean(credit.data$default.payment.next.month)
## [1] 0.2212
The name of the response variable is too long! Let's make it shorter by renaming it. Recall the rename() function.
credit.data<- rename(credit.data, default=default.payment.next.month)
How about the variable type and summary statistics?
str(credit.data) # structure - see variable type
summary(credit.data) # summary statistics
We see that all variables are of type int, but we know that EDUCATION and MARRIAGE are categorical, so we convert them to factors.
credit.data$EDUCATION<- as.factor(credit.data$EDUCATION)
credit.data$MARRIAGE<- as.factor(credit.data$MARRIAGE)
A two-way contingency table is a very useful tool for exploring the relationship between categorical variables. It is essentially the simplest pivot table (see the example below). Often, after you create a two-way contingency table, a Chi-square test is used to test whether X affects Y. The null hypothesis is that X and Y are independent (e.g., MARRIAGE has nothing to do with the likelihood of default).
Here is a very good tutorial for Chi-square test https://www.youtube.com/watch?v=WXPBoFDqNVk.
table.edu<- table(credit.data$EDUCATION, credit.data$default)
chisq.test(table.edu)
##
## Pearson's Chi-squared test
##
## data: table.edu
## X-squared = 160.41, df = 3, p-value < 2.2e-16
What we see from the test result above is that the p-value < 0.05. What is your conclusion?
We omit the rest of the EDA here, but you shouldn’t skip it when you are doing your own data analysis.
For this example, our goal is to predict credit card default, which is a binary outcome variable.
As a practice of the training-testing procedure, we randomly split the data into training (80%) and testing (20%) samples.
# you may want to call set.seed() first so that the random split is reproducible
index <- sample(nrow(credit.data), nrow(credit.data)*0.80)
credit.train = credit.data[index,]
credit.test = credit.data[-index,]
credit.glm1<- glm(default~LIMIT_BAL+SEX+EDUCATION+MARRIAGE+PAY_0+BILL_AMT1+PAY_AMT1, family=binomial, data=credit.train)
summary(credit.glm1)
##
## Call:
## glm(formula = default ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE +
## PAY_0 + BILL_AMT1 + PAY_AMT1, family = binomial, data = credit.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9938 -0.7020 -0.5447 -0.3071 3.6484
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.508e-01 7.244e-02 -8.984 < 2e-16 ***
## LIMIT_BAL -1.566e-06 1.646e-07 -9.513 < 2e-16 ***
## SEX -1.194e-01 3.375e-02 -3.539 0.000402 ***
## EDUCATION2 -6.584e-02 3.943e-02 -1.670 0.094963 .
## EDUCATION3 -6.018e-02 5.166e-02 -1.165 0.244011
## EDUCATION4 -1.180e+00 2.058e-01 -5.733 9.89e-09 ***
## MARRIAGE2 -2.490e-01 3.454e-02 -7.211 5.56e-13 ***
## MARRIAGE3 -3.862e-01 1.524e-01 -2.534 0.011282 *
## PAY_0 6.849e-01 1.660e-02 41.250 < 2e-16 ***
## BILL_AMT1 -1.153e-06 2.780e-07 -4.148 3.36e-05 ***
## PAY_AMT1 -1.382e-05 2.303e-06 -6.002 1.94e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 25402 on 23999 degrees of freedom
## Residual deviance: 22542 on 23989 degrees of freedom
## AIC: 22564
##
## Number of Fisher Scoring iterations: 5
You may have seen glm() before. In this lab, it is the main function used to build logistic regression models, because logistic regression is a member of the family of generalized linear models. In glm(), the only new thing is family. It specifies the distribution of your response variable. You may also specify the link function after the name of the distribution, for example, family=binomial(logit) (the default link is logit). You may also use glm() to build many other generalized linear models…
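For instance, the model above could equivalently be fit with the link function written out explicitly (a minimal sketch; the object name credit.glm1.logit is just for illustration, and the fit is identical because logit is the default link for the binomial family):
# same model as credit.glm1, with the logit link made explicit
credit.glm1.logit <- glm(default ~ LIMIT_BAL + SEX + EDUCATION + MARRIAGE +
                           PAY_0 + BILL_AMT1 + PAY_AMT1,
                         family = binomial(link = "logit"), data = credit.train)
# e.g., binomial(link = "probit") would give a probit model instead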
You can simply extract some criteria of the model fit, for example, the residual deviance (analogous to SSE in a linear regression model), AIC, and BIC. Unlike linear regression models, there is no \(R^2\) in logistic regression.
credit.glm1$deviance
## [1] 22542.49
AIC(credit.glm1)
## [1] 22564.49
BIC(credit.glm1)
## [1] 22653.44
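For a binary (0/1) response, the residual deviance equals \(-2\) times the log-likelihood, so AIC and BIC are just the deviance plus a penalty on the number of estimated parameters. A quick sanity check (using the credit.glm1 fit above, which has 11 coefficients including the intercept):
k <- length(coef(credit.glm1))                      # number of parameters (11)
credit.glm1$deviance + 2*k                          # reproduces AIC(credit.glm1), up to rounding
credit.glm1$deviance + log(nrow(credit.train))*k    # reproduces BIC(credit.glm1), up to rounding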
Similar to linear regression, we use the predict() function for prediction.
pred.glm1.train<- predict(credit.glm1, type="response")
(IMPORTANT!!!) You need to specify type="response" in order to get the probability outcome, which is what we want. Otherwise, what it produces is the linear predictor term \(\beta_0+\beta_1X_1+\beta_2X_2+\dotso\). Recall from the lecture: how is this linear predictor related to the probability?
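To answer that for yourself, you can compare the two prediction types directly (a small sketch; pred.link is just an illustrative name). The probability is the inverse logit of the linear predictor:
# the default type is "link", which returns the linear predictor
pred.link <- predict(credit.glm1, type = "link")
# applying the inverse logit 1/(1+exp(-x)) recovers the probabilities from type = "response"
head(1/(1 + exp(-pred.link)))
head(pred.glm1.train)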
install.packages('ROCR')
library(ROCR)
pred <- prediction(pred.glm1.train, credit.train$default)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7173588
Be careful that the function prediction() is different from predict(). It is in the package ROCR, and is used specifically to prepare for the ROC curve. Recall from the lecture that this function basically calculates many confusion matrices with different cut-off probabilities. Therefore, it requires two vectors as inputs – the predicted probabilities and the observed response (0/1). The next line, performance(), calculates the TPR and FPR based on all the confusion matrices you got from the previous step. Then you can simply draw the ROC curve, which is a curve of TPR against FPR. The last line gets the AUC (area under the curve). I would recommend you stick these four lines of code together and use them whenever you need an ROC curve and AUC. If you don’t want to draw the ROC curve (because it takes time), just comment out the plot line.
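One convenient way to keep those four lines together (just a sketch; the function name is mine) is to wrap them in a small helper:
# helper: draw the ROC curve (optionally) and return the AUC
get.roc.auc <- function(prob, obs, draw = TRUE){
  pred <- prediction(prob, obs)
  if(draw) plot(performance(pred, "tpr", "fpr"), colorize = TRUE)
  unlist(slot(performance(pred, "auc"), "y.values"))
}
# usage: get.roc.auc(pred.glm1.train, credit.train$default)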
Precision is defined as TP/(TP+FP), which measures how many of the predicted positive cases were actually positive. High precision means that when the model predicts a positive outcome, it is usually correct. This is particularly valuable in scenarios where false positives are costly or undesirable (e.g., in spam detection, where high precision ensures fewer non-spam emails are incorrectly flagged as spam). Recall is the same as TPR, which is defined as TP/(TP+FN), measuring how good the model is at capturing all actual positives. These two measures often have a tradeoff. To balance this tradeoff, their harmonic mean is used; it is called the \(F1\)-score and is defined as \(2\frac{Precision \times Recall}{Precision+Recall}\).
Precision-recall curves are particularly useful for evaluating models on imbalanced data because they focus on the model’s performance with respect to the positive class (often the minority class in imbalanced datasets) without being influenced by the high number of true negatives (majority class), which can otherwise mask poor performance.
RP.perf <- performance(pred, "prec", "rec")
plot(RP.perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "aucpr"), "y.values"))
## [1] 0.4948925
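To connect these definitions back to the confusion-matrix counts, here is a small sketch that computes precision, recall, and the F1-score at one illustrative cut-off (0.5); the object names below are mine:
cutoff <- 0.5
class.pred <- (pred.glm1.train > cutoff)*1
obs.y <- credit.train$default
TP <- sum(obs.y == 1 & class.pred == 1)
FP <- sum(obs.y == 0 & class.pred == 1)
FN <- sum(obs.y == 1 & class.pred == 0)
precision <- TP/(TP + FP)
recall <- TP/(TP + FN)     # same as TPR
F1 <- 2*precision*recall/(precision + recall)
c(precision = precision, recall = recall, F1 = F1)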
pred.glm1.test<- predict(credit.glm1, newdata = credit.test, type="response")
For out-of-sample prediction, you have to supply the newdata argument with your testing sample (a data frame), as above.
pred <- prediction(pred.glm1.test, credit.test$default)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7210586
Exercise: draw the precision-recall curve and find its AUC.
As we discussed in the lecture, people may be more interested in the classification results. But we have to define a cut-off probability first.
These tables illustrate the impact of choosing different cut-off probabilities. Choosing a large cut-off probability will result in few cases being predicted as 1, and choosing a small cut-off probability will result in many cases being predicted as 1.
table((pred.glm1.train > 0.9)*1)
##
## 0 1
## 23959 41
table((pred.glm1.train > 0.5)*1)
##
## 0 1
## 22195 1805
table((pred.glm1.train > 0.2)*1)
##
## 0 1
## 12514 11486
table((pred.glm1.train > 0.0001)*1)
##
## 0 1
## 1 23999
Therefore, determining the optimal cut-off probability is crucial. The simplest way to determine the cut-off is to use the proportion of “1” in the original data. We will introduce a more appropriate way to determine the optimal p-cut below.
The simplest way is to choose the event proportion in the training sample. This is roughly reasonable because the sample proportion is an estimate of the mean probability of \(Y=1\).
pcut1<- mean(credit.train$default)
Based on this cut-off probability, we can obtain the binary prediction (predicted classification) and the confusion matrix
# get binary prediction
class.glm1.train<- (pred.glm1.train>pcut1)*1
# get confusion matrix
table(credit.train$default, class.glm1.train, dnn = c("True", "Predicted"))
## Predicted
## True 0 1
## 0 12972 5704
## 1 1886 3438
In the table() function, both input vectors must be binary in order to get the confusion matrix (it is essentially a pivot table or contingency table). The dnn argument specifies the row and column names of this 2×2 table. The first input vector is the true response, so the first name should be "True" accordingly.
Then it is easy to get different types of classification error rates, i.e., the false positive rate (FPR), the false negative rate (FNR), and the overall misclassification rate (MR). Commonly, you can use the overall MR as the cost (a criterion) to evaluate the model prediction.
# (equal-weighted) misclassification rate
MR<- mean(credit.train$default!=class.glm1.train)
# False positive rate
FPR<- sum(credit.train$default==0 & class.glm1.train==1)/sum(credit.train$default==0)
# False negative rate (exercise)
FNR<-
Recall from the lecture that different p-cuts result in different confusion matrices, hence different MR (or cost). You need to search over all possible p-cuts to find the one that gives the minimum cost. The first step is to define a symmetric/asymmetric cost function (misclassification rate) as a function of the cut-off. Think about what else is needed to calculate MR. The answer: the observed \(Y\) and the predicted probabilities.
Searching for the optimal cut-off is especially useful with asymmetric costs. In many cases, such as our default-prediction example, the costs of the two types of misclassification error are different, and the total misclassification rate (MR) may not be useful. For example, suppose that among 1000 observations, 950 are non-default and 50 are default cases; if we simply predict everyone to be non-default, then MR=0.05, which seems very low. We know such a prediction won’t be acceptable.
In general, we can define the cost in economic values. For example, suppose 1 case of FP costs $1 and 1 case of FN costs $10 (these values are typically determined by domain experts). Then we write:
# first, define weight, e.g., $1 vs. $10.
w.FP <- 1
w.FN <- 10
# then make binary prediction with a specific value of pcut
pcut <- 0.5
pred.p <- pred.glm1.train
pred.class <- (pred.p>pcut)*1
# finally calculate the cost
obs <- credit.train$default
cost <- sum((obs==0 & pred.class==1)*w.FP+(obs==1 & pred.class==0)*w.FN)
cost
## [1] 40932
Now we can write a loop to search for the optimal pcut among a series of pcut values.
Exercise: complete the following code to obtain the plot shown below, and find the optimal value of pcut.
w.FP <- 1
w.FN <- 5
p.seq <- seq(0.01, 1, 0.01)
cost <- rep(0, length(p.seq))
for(i in 1:length(p.seq)){
}
Exercise:
Change the weights to different values, and see how your optimal cut-off changes.
obtain confusion matrix and calculate the (asymmetric) cost based on the optimal cut-off.
Find optimal cut-off probability using symmetric cost.
Calculate MR and cost, what do you find?
Write the cost as a function of obs, pred.p, w.FP, w.FN and pcut.
Use F1 score to determine the optimal cut-off.
(Optional) Try to use cross-validation to determine the optimal cut-off. You may need to write your own function.
Everything you have done about classification so far has been on the training sample. Now let’s move to the testing sample. Keep in mind the principle: the testing sample is only used for evaluating your model’s prediction accuracy! There is NO NEED TO CHOOSE A CUT-OFF PROBABILITY at this stage.
Exercise:
In classification problems, the commonly used scores in cross-validation are MR (or accuracy) and AUC. We can conduct CV with these two metrics using the caret package.
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# define training control parameters
# note: the CV score defaults to accuracy for classification, but because the response
# 'default' is numeric 0/1 here, caret treats this as regression and reports RMSE/Rsquared/MAE
fit.control <- trainControl(method = "cv", number = 5)
cv.credit.model1 <- train(
form = as.formula(credit.glm1$call),
data = credit.data,
trControl = fit.control,
method = "glm",
family = "binomial"
)
cv.credit.model1
## Generalized Linear Model
##
## 30000 samples
## 7 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 24000, 24000, 24000, 24000, 24000
## Resampling results:
##
## RMSE Rsquared MAE
## 0.3826615 0.1512638 0.2969105
# define training control parameters
# "summaryFunction = twoClassSummary, classProbs = TRUE" specifies AUC
fit.control <- trainControl(method = "cv", number = 5, summaryFunction = twoClassSummary, classProbs = TRUE)
# note that if AUC is specified, the response must be a factor whose levels are valid R variable
# names (e.g., "Yes"/"No"); plain as.factor() on 0/1 won't work here because "0"/"1" are not valid names
credit.data1 <- mutate(credit.data, default=recode(default, `1`="Yes", `0`="No"))
cv.credit.model1 <- train(
form = as.formula(credit.glm1$call),
data = credit.data1,
trControl = fit.control,
method = "glm",
family = "binomial"
)
cv.credit.model1
## Generalized Linear Model
##
## 30000 samples
## 7 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 24000, 24000, 24000, 24001, 23999
## Resampling results:
##
## ROC Sens Spec
## 0.7172831 0.9716229 0.246685
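If you want the score on each individual fold rather than only the average, it is stored in the resample element of the fitted object (a quick check; output not shown here):
cv.credit.model1$resample   # per-fold ROC, Sens and Spec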
We can use the same variable selection procedures as for linear regression models, i.e., forward, backward, and stepwise selection. Caution: this will take a long time since the sample size is not small.
credit.glm.full <- glm(default~. , family=binomial, data=credit.train)
credit.glm.back <- step(credit.glm.full, trace = 0) # backward selection is the default when no scope is specified
summary(credit.glm.back)
credit.glm.back$deviance
AIC(credit.glm.back)
BIC(credit.glm.back)
You can also try model selection with BIC (it usually results in a simpler model than the AIC criterion):
credit.glm.back.BIC <- step(credit.glm.full, k=log(nrow(credit.train)), trace = 0)
summary(credit.glm.back.BIC)
credit.glm.back.BIC$deviance
AIC(credit.glm.back.BIC)
BIC(credit.glm.back.BIC)
Exercise: Try forward and stepwise selection procedures to see if they deliver the same best model (a sketch of the mechanics is given after this list).
Get the ROC curve and AUC for both the training and testing samples.
Using the training sample, find the optimal cut-off by the grid search method with asymmetric cost (weights ratio = 5:1).
Calculate the MR, FPR, FNR, and the asymmetric cost for both the training and testing samples.
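For the forward/stepwise part of the exercise, the mechanics look roughly like the sketch below (forward selection starts from the null model and needs a scope argument); comparing the selected models is left to you.
credit.glm.null <- glm(default ~ 1, family = binomial, data = credit.train)
credit.glm.forward <- step(credit.glm.null,
                           scope = list(lower = formula(credit.glm.null),
                                        upper = formula(credit.glm.full)),
                           direction = "forward", trace = 0)
credit.glm.stepwise <- step(credit.glm.null,
                            scope = list(lower = formula(credit.glm.null),
                                         upper = formula(credit.glm.full)),
                            direction = "both", trace = 0)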
Be careful that LASSO requires x to be a numeric matrix. Therefore, we need to manually convert the categorical variables (“SEX”, “EDUCATION” and “MARRIAGE”) to dummy variables. For simplicity, only if you have evidence that a categorical variable has a monotonic relationship to the response can you directly convert it to numeric using as.numeric(). For example, the probability of default increases/decreases as the EDUCATION level goes from 1 to 4. This can be seen from the two-way contingency table by calculating the default proportion at each education level, as shown below.
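For instance, reusing the two-way table from the EDA above, a quick check of whether the default proportion is monotonic in EDUCATION could look like this:
prop.table(table.edu, margin = 1)   # row proportions: default rate within each EDUCATION level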
Here I will show how to convert categorical variables to dummy variables.
yind <- which(names(credit.data)=="default")
credit.train.Y <- credit.train[,yind]
credit.train.X <- scale(model.matrix(~ ., data = credit.train[,-yind])[,-1])
credit.test.Y <- credit.test[,yind]
credit.test.X <- scale(model.matrix(~ ., data = credit.test[,-yind])[,-1])
The function model.matrix() can automatically convert categorical variables to dummy variables. It also creates a column of 1s, which we don’t need at this time. That column of 1s is used for estimating the intercept if you write the algorithm yourself, but most available functions automatically create that column during estimation.
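You can quickly verify what model.matrix() produced by looking at the column names of the design matrix:
colnames(credit.train.X)   # EDUCATION and MARRIAGE now appear as dummy columns such as EDUCATION2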
cv.glmnet() estimates the model and performs cross-validation at the same time.
library(glmnet)
credit.lasso<- glmnet(x=credit.train.X, y=credit.train.Y, family = "binomial")
credit.lasso.cv<- cv.glmnet(x=credit.train.X, y=credit.train.Y, family = "binomial", type.measure = "auc")
plot(credit.lasso.cv)
For logistic regression, we can specify type.measure="auc" so that the CV error will be AUC. Another popular option is type.measure="class", so that the CV error will be the misclassification error (not good for imbalanced data or asymmetric costs).
Get the coefficients at the optimal \(\lambda\) values (lambda.min and lambda.1se):
coef(credit.lasso, s=c(credit.lasso.cv$lambda.min, credit.lasso.cv$lambda.1se))
## 27 x 2 sparse Matrix of class "dgCMatrix"
## s1 s2
## (Intercept) -1.462126889 -1.425508076
## LIMIT_BAL -0.105648158 -0.096677754
## SEX -0.044339597 -0.019741875
## EDUCATION2 -0.041546180 .
## EDUCATION3 -0.036869739 .
## EDUCATION4 -0.129821061 -0.077314569
## MARRIAGE2 -0.096294728 -0.064981662
## MARRIAGE3 -0.041552251 -0.011191000
## AGE 0.057061577 0.035190248
## PAY_0 0.628110567 0.620099151
## PAY_2 0.097442957 0.092359128
## PAY_3 0.104751607 0.106503547
## PAY_4 0.014144712 0.007348165
## PAY_5 0.058255618 0.051698767
## PAY_6 0.003234858 .
## BILL_AMT1 -0.261930157 -0.112797265
## BILL_AMT2 . .
## BILL_AMT3 0.136624465 .
## BILL_AMT4 . .
## BILL_AMT5 . .
## BILL_AMT6 0.026070431 .
## PAY_AMT1 -0.195469573 -0.121775517
## PAY_AMT2 -0.266363174 -0.102725673
## PAY_AMT3 -0.039543261 -0.013214880
## PAY_AMT4 -0.039731926 -0.015607068
## PAY_AMT5 -0.053879417 -0.024416708
## PAY_AMT6 -0.007275098 .
# in-sample prediction
pred.lasso.train<- predict(credit.lasso, newx=credit.train.X, s=credit.lasso.cv$lambda.1se, type = "response")
# out-of-sample prediction
pred.lasso.test<- predict(credit.lasso, newx=credit.test.X, s=credit.lasso.cv$lambda.1se, type = "response")
Get the ROC curve and AUC for both the training and testing samples.
Using the training sample, find the optimal cut-off by the grid search method with asymmetric cost (weights ratio = 5:1).
Calculate the MR, FPR, FNR, and the asymmetric cost for both the training and testing samples.
This is a very famous data set – MNIST. It is also an ongoing Kaggle competition. Part of the code here was taken from here.
digit<- data.matrix(read_csv("https://www.dropbox.com/s/ulujvi2a4ykfzju/train.csv?dl=1"))
dim(digit)
## [1] 42000 785
colnames(digit)[1:10]
## [1] "label" "pixel0" "pixel1" "pixel2" "pixel3" "pixel4" "pixel5" "pixel6"
## [9] "pixel7" "pixel8"
barplot(table(digit[,1]), col=rainbow(10, 0.5), main="Freq. of Digits")
## visualize the data
plotTrain <- function(data, index){
op <- par(no.readonly=TRUE)
x <- ceiling(sqrt(length(index)))
par(mfrow=c(x, x), mar=c(.1, .1, .1, .1))
for (i in index){ #reverse and transpose each matrix to rotate images
m <- matrix(data[i,-1], nrow=28, byrow=TRUE)
m <- apply(m, 2, rev)
image(t(m), col=grey.colors(255), axes=FALSE)
text(0.05, 0.2, col="white", cex=1.2, data[i, 1])
}
par(op) #reset the original graphics parameters
}
plotTrain(data=digit, index=1:100)
index<- sample(1:nrow(digit), 0.6*nrow(digit))
train<- digit[index,]
test<- digit[-index,]
Currently, each cell uses 0–255 to represent the grey scale. We rescale it to 0–1.
## standardize X
train.x <- train[,-1] #remove 'label' column
test.x<- test[,-1]
train.y <- train[,1] #label column
test.y<- test[,1]
train.x <- train.x/255
test.x <- test.x/255
We use the package nnet (which stands for neural network) for the multinomial logit model. There are many other packages, such as mlogit, but this one is relatively easy to use.
Due to the large size of the data, we only use the first 3000 observations as the training sample.
library(nnet)
fit.mnl <- multinom(y ~ ., data = data.frame(y=train.y[1:3000], x=train.x[1:3000,]), MaxNWts = 10000, maxit = 50) # multinom() fits a multinomial logit; it has no 'family' argument
## # weights: 7860 (7065 variable)
## initial value 6907.755279
## iter 10 value 1413.616169
## iter 20 value 778.643783
## iter 30 value 397.902593
## iter 40 value 251.362238
## iter 50 value 96.218333
## final value 96.218333
## stopped after 50 iterations
# make prediction
pred.y.mnl<- predict(fit.mnl, data.frame(y=test.y, x=test.x), type = "class")
# accuracy rate
mean(test.y==pred.y.mnl)
## [1] 0.825119
We got pretty good accuracy. As we learn more advanced ML algorithms, you will see that the accuracy rate can reach 99%.
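To see which digits get confused with which, you can cross-tabulate the true and predicted labels (a quick check; output not shown here):
table(True = test.y, Predicted = pred.y.mnl)   # 10 x 10 confusion matrix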
plotResults <- function(testdata, index, preds){
op <- par(no.readonly=TRUE)
x <- ceiling(sqrt(length(index)))
par(mfrow=c(x,x), mar=c(.1,.1,.1,.1))
for (i in index){
m <- matrix(testdata[i,], nrow=28, byrow=TRUE)
m <- apply(m, 2, rev)
image(t(m), col=grey.colors(255), axes=FALSE)
text(0.05,0.1,col="red", cex=1.2, preds[i])
}
par(op)
}
Here are the first 100 images in the test set and their predicted values:
plotResults(testdata=test.x, index=1:100, preds=pred.y.mnl)
We see it did a very good job.
Know how to use glm() to build logistic regression;
Know how to get ROC and AUC based on predicted probability;
Know how to find optimal cut-off;
Know how to do binary classification, and calculation of MR, FPR, FNR, and cost;