In this lab, we use the Boston Housing data as an illustrative example to learn basic linear regression and analysis with R.
The Boston housing data is a built-in dataset in the MASS package, so you do not need to download it externally. The MASS package is installed together with R, so there is no need to run install.packages("MASS") to download and install it, but you do need to load the package.
library(MASS)
data(Boston) #this data is in MASS package
You can find details of the dataset in its help document.
?Boston
We can obtain a help document for this dataset because it is built in, but most of the time your dataset will be external.
We have introduced many EDA techniques in previous labs. Here we conduct only some very basic, high-level EDA, but in general a much more in-depth EDA is needed. Below is a quick overview and summary of the data.
dim(Boston)
## [1] 506 14
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
We can also draw a pairwise scatter plot.
pairs(Boston)
We should be careful when drawing a pairwise scatter plot. First, scatter plots only make sense for numeric variables, so make sure the input data frame contains only numeric variables. Second, we should not include too many variables; the one shown here is already a little overwhelming.
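For a less crowded plot, you can restrict the pairwise scatter plot to a handful of numeric variables of interest; the particular subset below is just an illustration.
# pairwise scatter plot on a small subset of numeric variables
pairs(Boston[, c("medv", "rm", "lstat", "crim", "nox")])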
We first fit a simple linear regression using, say, "rm" as the explanatory variable \(X\).
# draw a scatter plot
plot(medv~rm, data=Boston)
# fit the simple linear regression
model0<- lm(medv~rm, data = Boston)
model0
##
## Call:
## lm(formula = medv ~ rm, data = Boston)
##
## Coefficients:
## (Intercept) rm
## -34.671 9.102
abline(model0, col="red", lty=2, lwd=2)
To get more detailed output for the fitted model, we use the summary() function.
sum.model0<- summary(model0)
sum.model0
##
## Call:
## lm(formula = medv ~ rm, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.346 -2.547 0.090 2.986 39.433
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34.671 2.650 -13.08 <2e-16 ***
## rm 9.102 0.419 21.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.616 on 504 degrees of freedom
## Multiple R-squared: 0.4835, Adjusted R-squared: 0.4825
## F-statistic: 471.8 on 1 and 504 DF, p-value: < 2.2e-16
Note that both of the objects "model0" and "sum.model0" are lists, so we can extract any component from them.
## pull coefficients
model0$coefficients
## (Intercept) rm
## -34.670621 9.102109
## pull coefficients with hypothesis testing info
sum.model0$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34.670621 2.6498030 -13.08423 6.950229e-34
## rm 9.102109 0.4190266 21.72203 2.487229e-74
## pull sigma (think about what it is)
sum.model0$sigma
## [1] 6.61616
## pull adjusted R-square
sum.model0$adj.r.squared
## [1] 0.4825007
## pull fitted values
head(model0$fitted.values)
## 1 2 3 4 5 6
## 25.17575 23.77402 30.72803 29.02594 30.38215 23.85594
## pull residuals
head(model0$residuals)
## 1 2 3 4 5 6
## -1.175746 -2.174021 3.971968 4.374062 5.817848 4.844060
You can also obtain some of these results using the standard extractor functions:
coef(model0)
coef(sum.model0)
sigma(model0)
head(fitted(model0))
head(residuals(model0))
You can also obtain confidence intervals for the estimated coefficients. Recall the formula for the confidence interval: \(\hat{\beta}_j \pm t_{\alpha/2, n-p-1}\times SE(\hat{\beta}_j)\).
confint(model0)
## 2.5 % 97.5 %
## (Intercept) -39.876641 -29.464601
## rm 8.278855 9.925363
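As a sanity check, the interval for the slope can be reproduced from the formula above using components we pulled earlier (a minimal sketch):
# manual 95% CI for the "rm" coefficient: estimate +/- t quantile * SE
est <- sum.model0$coefficients["rm", "Estimate"]
se  <- sum.model0$coefficients["rm", "Std. Error"]
tq  <- qt(0.975, df = model0$df.residual)   # t_{alpha/2, n-p-1}
c(est - tq*se, est + tq*se)                 # should match confint(model0)["rm", ]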
Exercise: what is sigma? Can you obtain sigma based on model0$residuals?
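One possible check (a sketch): the residual standard error is the square root of the residual sum of squares divided by the residual degrees of freedom.
# sigma from the residuals: sqrt(SSE / (n - p - 1))
sqrt(sum(model0$residuals^2) / model0$df.residual)   # compare with sum.model0$sigma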
Let's now consider a model with three \(X\) variables.
model1 <- lm(medv~crim+rm+lstat, data = Boston)
sum1 <- summary(model1)
sum1$adj.r.squared
## [1] 0.6437356
sum1$sigma^2
## [1] 30.13524
AIC(model1)
## [1] 3165.232
BIC(model1)
## [1] 3186.364
Clearly, the adjusted R-squared, MSE, AIC and BIC of model1 are much better than those of model0.
Given the estimated coefficients, we know how to compute the predicted response \(\hat{y}\) for any given input vector \(\mathbf{x}\); that is, \(\hat{y}=\mathbf{\hat{\beta}}^\top \mathbf{x}\). In R, we can also use the predict() function.
predY <- predict(model1, newdata=data.frame(crim=3, rm=4, lstat=6.4))
predY
## 1
## 14.29444
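The same number can be reproduced by hand from \(\hat{y}=\mathbf{\hat{\beta}}^\top \mathbf{x}\); remember to include the intercept term (a minimal sketch):
x.new <- c(1, 3, 4, 6.4)    # intercept, crim, rm, lstat (same order as coef(model1))
sum(coef(model1) * x.new)   # should match predY above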
Here, we need to make sure that newdata is a data frame that contains the same variables as those used in the training data. If newdata is not specified, predict() returns the fitted values.
PredY_insample <- predict(model1)
head(PredY_insample)
## 1 2 3 4 5 6
## 28.85772 25.64564 32.58746 32.24192 31.63289 27.96579
The following code provides a prediction interval.
predY_interval <- predict(model1, newdata=data.frame(crim=3, rm=4, lstat=6.4), interval = "prediction")
predY_interval
## fit lwr upr
## 1 14.29444 3.241033 25.34784
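Similarly, setting interval = "confidence" returns a confidence interval for the mean response at the same input; it is narrower than the prediction interval.
predict(model1, newdata=data.frame(crim=3, rm=4, lstat=6.4), interval = "confidence")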
Exercise: what does predict() return without specifying newdata?
The aim of model diagnostics is to check whether the model assumptions are valid. Recall that the residual plot is the primary tool for model diagnostics.
plot(model1$fitted.values, model1$residuals)
What can you see from this plot?
More plots for model diagnostics.
#par(mar=c(4,4.5,2,1), oma=c(2,1,1,2))
par(mfrow=c(2,2)) # to specify the figure layout
plot(model1)
Some notes about the residual-leverage plot:
The residual vs. fitted plot and the normal Q-Q plot are quite standard for checking the assumptions of normality and constant variance, and whether higher-order terms need to be included. The last figure, the residual vs. leverage plot, is mainly used to identify influential data points and potential outliers. Leverage is defined as the diagonal element \(h_i\) of the hat matrix \(H=X(X^\top X)^{-1}X^\top\), which depends only on \(X\); hence a larger value indicates extreme values in \(X\).
Cook's distance is defined as \(D_i=\frac{r_i^2 h_i}{p(1-h_i)}\), where \(r_i\) is the standardized residual, so it is a function of both the residual and the leverage; this is why it appears as contours in the figure. A Cook's distance is often considered large if \(D_i>4/n\), and such data points can be considered influential. In the last figure, however, the two threshold values of Cook's distance, 0.5 and 1, are much larger than \(4/n\). Therefore, any point falling outside the dashed lines (Cook's distance = 0.5 or 1) is highly influential and very likely an outlier.
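Leverage values and standardized residuals can be extracted directly, so the Cook's distance formula above can be verified by hand (a minimal sketch):
h <- hatvalues(model1)        # leverage: diagonal of the hat matrix
r <- rstandard(model1)        # standardized residuals
p <- length(coef(model1))     # number of estimated coefficients
head(r^2 * h / (p * (1 - h))) # compare with head(cooks.distance(model1)) below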
The base R package stats provides a function to calculate Cook's distance:
CookD <- cooks.distance(model1)
which(CookD>4/nrow(Boston)) # threshold = 4/n, general rule for influential point
## 9 49 142 148 149 162 163 164 167 187 196 204 205 215 226 229 234 258 262 263
## 9 49 142 148 149 162 163 164 167 187 196 204 205 215 226 229 234 258 262 263
## 268 281 283 284 365 366 368 369 370 371 372 373 374 375 376 381 385 387 407 413
## 268 281 283 284 365 366 368 369 370 371 372 373 374 375 376 381 385 387 407 413
## 414 415 417 420 428
## 414 415 417 420 428
which(CookD>0.5) # threshold = 0.5 (default value in the plot)
## named integer(0)
Here is a great tutorial for linear model diagnostics with R: Link.
The following model includes all \(x\) variables:
model2<- lm(medv~., data=Boston)
# summary
summary(model2)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
Are there any insignificant variables?
model3<- lm(medv~. -indus -age, data=Boston)
summary(model3)
##
## Call:
## lm(formula = medv ~ . - indus - age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## crim -0.108413 0.032779 -3.307 0.001010 **
## zn 0.045845 0.013523 3.390 0.000754 ***
## chas 2.718716 0.854240 3.183 0.001551 **
## nox -17.376023 3.535243 -4.915 1.21e-06 ***
## rm 3.801579 0.406316 9.356 < 2e-16 ***
## dis -1.492711 0.185731 -8.037 6.84e-15 ***
## rad 0.299608 0.063402 4.726 3.00e-06 ***
## tax -0.011778 0.003372 -3.493 0.000521 ***
## ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
## black 0.009291 0.002674 3.475 0.000557 ***
## lstat -0.522553 0.047424 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
Note: when you use . to indicate all variables, you have to make sure all variables in the data are relevant and ready to be plugged in. This can be tricky, and errors are often reported when it comes to categorical variables. Therefore, EDA and data cleaning before model fitting are extremely important.
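A quick check of the column types before using the . shorthand (a sketch):
sapply(Boston, class)   # all columns here are numeric or integer, so medv ~ . is safe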
We sample 90% of the original data and use it as the training set; the remaining 10% is used as the test set. The regression model will be built on the training set, and the future performance of the model will be evaluated on the test set.
index <- sample(nrow(Boston),nrow(Boston)*0.90)
Boston.train <- Boston[index,]
Boston.test <- Boston[-index,]
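Note that sample() produces a different split each run; for reproducibility you may set a seed before drawing the index (the value 123 below is arbitrary).
set.seed(123)   # arbitrary seed; then rerun the sampling code above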
We always start by training a model using the training sample.
model1 <- lm(medv~crim+rm+lstat, data = Boston.train)
sum.model1 <- summary(model1)
We can obtain two kinds of prediction error: the in-sample MSE and the out-of-sample MSPE (we also compute the mean absolute prediction error below). Please make sure you know what they are.
## MSE
mse.model1 <- sum.model1$sigma^2
## MSPE
pred.model1<- predict(model1, newdata = Boston.test)
mspe.model1 <- mean((Boston.test$medv-pred.model1)^2)
## MAPE (here, the mean absolute prediction error on the test set)
mape.model1 <- mean(abs(Boston.test$medv-pred.model1))
Exercise: fill in the blanks below to evaluate model2 and model3 on the same test set (one possible sketch follows the blanks).
model2 <-
pred.model2 <-
mspe.model2 <-
mape.model2 <-
Rsq.oos.model2 <-
model3 <-
pred.model3 <-
mspe.model3 <-
mape.model3 <-
Rsq.oos.model3 <-
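One possible way to fill in the blanks for model2, assuming it refers to the full model medv ~ . from earlier refit on the training sample, and assuming Rsq.oos denotes out-of-sample R-squared (model3 is analogous):
model2 <- lm(medv ~ ., data = Boston.train)
pred.model2 <- predict(model2, newdata = Boston.test)
mspe.model2 <- mean((Boston.test$medv - pred.model2)^2)
mape.model2 <- mean(abs(Boston.test$medv - pred.model2))
# out-of-sample R-squared: 1 - SSE/SST on the test set
Rsq.oos.model2 <- 1 - sum((Boston.test$medv - pred.model2)^2) /
  sum((Boston.test$medv - mean(Boston.test$medv))^2)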
Instead of fitting a model on a pre-specified 90% training sample and evaluating the MSPE on the hold-out 10% testing sample, it is more reliable to use cross-validation for out-of-sample performance evaluation. For k-fold cross-validation, the dataset is divided into k parts of (roughly) equal sample size. Each part serves as the testing sample in turn, and the rest (the other k-1 parts together) serve as the training sample. This training/testing procedure is performed iteratively k times. The CV score is usually the average of the out-of-sample performance metric across the k iterations.
Note that we often use the entire dataset for cross-validation. Think about why.
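Before turning to a package, here is a minimal sketch of a manual 10-fold cross-validation for the three-variable model; the seed and the choice of MSE as the metric are illustrative.
set.seed(123)                                          # arbitrary seed
k <- 10
folds <- sample(rep(1:k, length.out = nrow(Boston)))   # random fold labels
cv.mspe <- numeric(k)
for (i in 1:k) {
  train.i <- Boston[folds != i, ]
  test.i  <- Boston[folds == i, ]
  fit.i   <- lm(medv ~ crim + rm + lstat, data = train.i)
  pred.i  <- predict(fit.i, newdata = test.i)
  cv.mspe[i] <- mean((test.i$medv - pred.i)^2)
}
mean(cv.mspe)   # CV score: average out-of-sample MSE across the 10 folds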
Instead of manually implementing a cross-validation procedure, we can use the caret package, which provides built-in cross-validation capabilities. The caret package (short for Classification And REgression Training) contains functions to streamline the model training process for complex regression and classification problems. For more details, please see https://cran.r-project.org/web/packages/caret/vignettes/caret.html.
library(caret)
cv_model1 <- train(
form = medv ~ rm,
data = Boston,
method = "lm",
trControl = trainControl(method = "cv", number = 10)
)
cv_model1
## Linear Regression
##
## 506 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 455, 455, 455, 455, 457, 455, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 6.60593 0.490575 4.477051
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Conduct 10-fold cross-validation for the other three models we have explored.
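A possible sketch for two of them, so that cv_model2 and cv_model3 used below are defined; here cv_model2 is assumed to use the three predictors crim, rm and lstat, and cv_model3 all predictors, so adjust the formulas to whichever models you compare.
cv_model2 <- train(
  form = medv ~ crim + rm + lstat,
  data = Boston,
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)
cv_model3 <- train(
  form = medv ~ .,
  data = Boston,
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)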
Now we can generate a very user-friendly summary table.
allCVresults<- resamples(list(model1 = cv_model1,
model2 = cv_model2,
model3 = cv_model3))
summary(allCVresults)
##
## Call:
## summary.resamples(object = allCVresults)
##
## Models: model1, model2, model3
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## model1 3.437536 4.487691 4.597107 4.477051 4.708215 4.950190 0
## model2 3.502776 3.695231 3.875887 3.944833 4.044314 4.640150 0
## model3 2.785270 3.016912 3.463437 3.355030 3.633388 3.917857 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## model1 4.941122 6.348525 6.736368 6.605930 7.253039 7.646193 0
## model2 4.583821 5.009231 5.171352 5.488724 5.259959 7.974033 0
## model3 3.710898 4.039980 4.829046 4.777103 5.369552 5.981773 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## model1 0.2426731 0.3177243 0.4626704 0.4905750 0.6666814 0.7809753 0
## model2 0.3982181 0.6320153 0.6671200 0.6632233 0.7676352 0.8214734 0
## model3 0.5820728 0.6958518 0.7652234 0.7356172 0.7828319 0.7961814 0
We can also visualize the comparison.
bwplot(allCVresults, metric = "RMSE", main="Model Comparison")
dotplot(allCVresults, main="Model Comparison")
Bootstrap is one of the most useful inventions in statistics. It is based on random sampling with replacement. We can use this technique to construct confidence intervals empirically.
library(boot)
coef.lm<- function(formula, data, indices) {
d <- data[indices,] # allows boot to select sample
fit <- lm(formula, data=d)
return(fit$coef)
}
# bootstrapping with 1000 replications
results <- boot(data=Boston, statistic=coef.lm, R=1000, formula=medv~.)
# view results
results
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston, statistic = coef.lm, R = 1000, formula = medv ~
## .)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 3.645949e+01 -2.652835e-01 8.239937033
## t2* -1.080114e-01 6.312763e-03 0.034684479
## t3* 4.642046e-02 -4.442950e-04 0.013758976
## t4* 2.055863e-02 -2.801766e-04 0.050071441
## t5* 2.686734e+00 6.434535e-02 1.342600465
## t6* -1.776661e+01 1.668126e-01 3.825150283
## t7* 3.809865e+00 1.864467e-02 0.875424390
## t8* 6.922246e-04 -1.127862e-04 0.016616599
## t9* -1.475567e+00 3.497049e-03 0.216324862
## t10* 3.060495e-01 -2.202251e-03 0.065209932
## t11* -1.233459e-02 -5.530330e-05 0.002798938
## t12* -9.527472e-01 3.558182e-03 0.117745316
## t13* 9.311683e-03 8.906344e-05 0.002729802
## t14* -5.247584e-01 -3.185466e-03 0.103084826
plot(results, index = 2)
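Bootstrap confidence intervals for individual coefficients can then be obtained with boot.ci(); for example, a percentile interval for the second statistic (the crim coefficient):
boot.ci(results, type = "perc", index = 2)   # 95% percentile CI for the crim coefficient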
In summary, in this lab we have learned:
- the lm() function for linear regression model fitting
- the predict() function for prediction and prediction error assessment
- the train() function in the caret package for cross-validation
- the boot() function for bootstrapping