We use Boston Housing Data as an illustrative example in this lab. We learn basic linear regression and analysis with R.

Load Data

Boston housing data is a built-in dataset in MASS package, so you do not need to download externally. Package MASS comes with R when you installed R, so no need to use install.packages(MASS) to download and install, but you do need to load this package.

library(MASS)
data(Boston) #this data is in MASS package

You can find details of the dataset from help document.

?Boston

We can obtain the help document for the dataset because it is built-in, but most time your dataset is external.

EDA

We have introduced many EDA techniques in previous labs. Here we only conduct a few very basic and high-level EDA, but in general, a much more in-depth EDA is needed. Below is a quick overview and summary of the data.

dim(Boston) 
## [1] 506  14
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

We can also draw a pairwise scatter plot.

pairs(Boston)

We should be careful when we draw the pairwise scatter plot. First, scatter plot makes sense for numeric variables, so make sure the input data frame contains only numeric variables. Second, we should not have too many variables. The one we show here is already a little overwhelming.

Exercise

  • In the pairwise scatter plot, which variables should be excluded due to nonnumeric meaning?
  • Use one of those approaches we learned before to construct a nice summary statistics table for all variables (because they are all numeric).

Back to top


Model Fitting

Simple linear regression

We first fit a simple linear regression using, say “rm”, as the explanatory variable \(X\).

# draw a scatter plot
plot(medv~rm, data=Boston)
# fit the simple linear regression
model0<- lm(medv~rm, data = Boston)
model0
## 
## Call:
## lm(formula = medv ~ rm, data = Boston)
## 
## Coefficients:
## (Intercept)           rm  
##     -34.671        9.102
abline(model0, col="red", lty=2, lwd=2)

To get more detailed outputs of the fitted model, we use function summary().

sum.model0<- summary(model0)
sum.model0
## 
## Call:
## lm(formula = medv ~ rm, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.346  -2.547   0.090   2.986  39.433 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -34.671      2.650  -13.08   <2e-16 ***
## rm             9.102      0.419   21.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.616 on 504 degrees of freedom
## Multiple R-squared:  0.4835, Adjusted R-squared:  0.4825 
## F-statistic: 471.8 on 1 and 504 DF,  p-value: < 2.2e-16

Model output

Note that both of the objects “model0” and “sum.model0” are lists. Therefore, we can extract any components on that list.

## pull coefficients
model0$coefficients
## (Intercept)          rm 
##  -34.670621    9.102109
## pull coefficients with hypothesis testing info
sum.model0$coefficients
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -34.670621  2.6498030 -13.08423 6.950229e-34
## rm            9.102109  0.4190266  21.72203 2.487229e-74
## pull sigma (think about what is it?)
sum.model0$sigma
## [1] 6.61616
## pull adjusted R-square
sum.model0$adj.r.squared
## [1] 0.4825007
## pull fitted values
head(model0$fitted.values)
##        1        2        3        4        5        6 
## 25.17575 23.77402 30.72803 29.02594 30.38215 23.85594
## pull residuals
head(model0$residuals)
##         1         2         3         4         5         6 
## -1.175746 -2.174021  3.971968  4.374062  5.817848  4.844060

You can get some of these results using alternative ways:

coef(model0)
coef(sum.model0)
sigma(model0)
head(fitted(model0))
head(residuals(model0))

You may get the confidence interval for the estimated coefficients. Recall the formula to calculate the confidence interval: \(\hat{\beta}_j \pm t_{\alpha/2, n-p-1}\times SE(\hat{\beta}_j)\).

confint(model0)
##                  2.5 %     97.5 %
## (Intercept) -39.876641 -29.464601
## rm            8.278855   9.925363

Exercise:

  1. Interpret the estimated coefficient.
  2. What is sigma? Can you obtain sigma based on $residual?
  3. Compute MSE.
  4. Verify that residual = y - yhat.
  5. How does residual distributed?

Multiple linear regression

Let’s consider a model with 3 \(X\) variables

model1 <- lm(medv~crim+rm+lstat, data = Boston)
sum1 <- summary(model1)
sum1$adj.r.squared
## [1] 0.6437356
sum1$sigma^2
## [1] 30.13524
AIC(model1)
## [1] 3165.232
BIC(model1)
## [1] 3186.364

Very clear, the adjusted R square, MSE, AIC and BIC of model1 are much better than those of model0.

Prediction

Given the estimated coefficients, we know how to compute the predicted response \(\hat{y}\) for any given input vector \(\mathbf{x}\). That is \(\hat{y}=\mathbf{\hat{\beta}}^\top \mathbf{x}\). In R, we can also use predict() function.

predY <- predict(model1, newdata=data.frame(crim=3, rm=4, lstat=6.4))
predY
##        1 
## 14.29444

Here, we need to make sure that newdata is a data frame that contains the same variable as those used in your training data. If newdata is not specified, predict() returns the fitted value.

PredY_insample <- predict(model1)
head(PredY_insample)
##        1        2        3        4        5        6 
## 28.85772 25.64564 32.58746 32.24192 31.63289 27.96579

The following code provides prediction interval.

predY_interval <- predict(model1, newdata=data.frame(crim=3, rm=4, lstat=6.4), interval = "prediction")
predY_interval
##        fit      lwr      upr
## 1 14.29444 3.241033 25.34784

Exercise:

  1. Verify that the fitted value extracted from the model is the same as using predict() without specifying newdata.
  2. Draw a scatter plot of medv vs. lstat. Then draw the fitted line on top of it.
  3. Draw the prediction interval on top of the figure from above question.

Model diagnostics

The aim of model diagnostics is to check if the model assumptions are valid. Recall that residual plot is primary tool for model diagnostics.

plot(model1$fitted.values, model1$residuals)

What can you see from this plot?

More plots for model diagnostics.

#par(mar=c(4,4.5,2,1), oma=c(2,1,1,2))
par(mfrow=c(2,2)) # to specify the figure layout
plot(model1)

Exercise:

  1. The first figure, residual vs. fitted plot, shows evidence of model inadequacy. What is the problem?
  2. Can you address it, and build new model?
  3. For the updated model, conduct model diagnostic again, what do you see?
  4. If you identify any issues, address it and update the model again.

Some notes about the residual-leverage plot:

Residual vs. fitted and normal Q-Q plot are quite standard to check assumptions of normality, constant variance and whether higher order terms need to be included. The last figure, residual vs leverage plot is mainly used to identify influential data points and potential outliers. Leverage is defined as the diagonal elements \(h_i\) of hat matrix \(H=X(X^T X)^{-1}X^T\), which only depends on \(X\), hence larger value is an indication of extreme values in \(X\).

Cook’s distance is defined as \(D_i=\frac{r_i^2 h_i}{p(1-h_i)}\), so it is a function of both residual and leverage. Therefore it is a contour as shown in the figure. A Cook’s Distance is often considered large if \(D_i>4/n\), and those data points can be considered as influential point. In the last figure, however, we see there are two threshold values of Cook’s distance, 0.5 and 1, which are much larger than \(4/n\). Therefore, any points fall outside the dashed line (Cook’s distance = 0.5 (or 1)) is highly influential and very likely to be outliers.

In R basic package stat, there is a function to calculate Cook’s distance:

CookD <- cooks.distance(model1)
which(CookD>4/nrow(Boston))  # threshold = 4/n, general rule for influential point
##   9  49 142 148 149 162 163 164 167 187 196 204 205 215 226 229 234 258 262 263 
##   9  49 142 148 149 162 163 164 167 187 196 204 205 215 226 229 234 258 262 263 
## 268 281 283 284 365 366 368 369 370 371 372 373 374 375 376 381 385 387 407 413 
## 268 281 283 284 365 366 368 369 370 371 372 373 374 375 376 381 385 387 407 413 
## 414 415 417 420 428 
## 414 415 417 420 428
which(CookD>0.5)  # threshold = 0.5 (default value in the plot)
## named integer(0)

Here is a great tutorial for linear model diagnostic with R: Link.

A model with all X variables

The following model includes all \(x\) varables in the model

model2<- lm(medv~., data=Boston)
# summary
summary(model2)
## 
## Call:
## lm(formula = medv ~ ., data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288    
## chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229    
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

Are there any insignificant variables?

model3<- lm(medv~. -indus -age, data=Boston)
summary(model3)
## 
## Call:
## lm(formula = medv ~ . - indus - age, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5984  -2.7386  -0.5046   1.7273  26.2373 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
## crim         -0.108413   0.032779  -3.307 0.001010 ** 
## zn            0.045845   0.013523   3.390 0.000754 ***
## chas          2.718716   0.854240   3.183 0.001551 ** 
## nox         -17.376023   3.535243  -4.915 1.21e-06 ***
## rm            3.801579   0.406316   9.356  < 2e-16 ***
## dis          -1.492711   0.185731  -8.037 6.84e-15 ***
## rad           0.299608   0.063402   4.726 3.00e-06 ***
## tax          -0.011778   0.003372  -3.493 0.000521 ***
## ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
## black         0.009291   0.002674   3.475 0.000557 ***
## lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7348 
## F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16

Note: when you use . to indicate all variables, you have to make sure all variables in the data are relevant and ready to be plug in. This can be tricky, and error often are reported, when it comes to categorical variables. Therefore, before fitting model, EDA and data cleaning are extremely important.

Back to top


Model assessment – prediction error

Training vs. testing procedure

Splitting data to training and testing samples

We sample 90% of the original data and use it as the training set. The remaining 10% is used as test set. The regression model will be built on the training set and future performance of your model will be evaluated with the test set.

index <- sample(nrow(Boston),nrow(Boston)*0.90)
Boston.train <- Boston[index,]
Boston.test <- Boston[-index,]

Model training

We always start by training a model using the training sample.

model1 <- lm(medv~crim+rm+lstat, data = Boston.train)
sum.model1 <- summary(model1)

Prediction error

We can obtain two prediction errors: MSE and MSPE. Please make sure you know what they are.

## MSE
mse.model1 <- sum.model1$sigma^2
## MSPE
pred.model1<- predict(model1, newdata = Boston.test)
mspe.model1 <- mean((Boston.test$medv-pred.model1)^2)
## MAPE
mape.model1 <- mean(abs(Boston.test$medv-pred.model1))

Exercise:

  1. Get the predicted value of another two models we explored, and calculate mean squared prediction error (MSPE) and mean absolute prediction error (MAPE).
  2. Recall the definition of \(R^2\). Can you obtain it for the testing sample?
model2 <- 
pred.model2 <- 
mspe.model2 <-
mape.model2 <-
Rsq.oos.model2 <- 

model3 <- 
pred.model3 <-
mspe.model3 <-
mape.model3 <-
Rsq.oos.model3 <- 
  1. Conduct the same assessment for an updated model (model4) based on your model diagnostics for model3.
  2. Which model performs the best among all models evaluated?

Cross-validation

Instead of fitting a model on a pre-specified 90% training sample and evaluate the MSPE on the hold-out 10% testing sample, it is more reliable to use cross-validation for out-of-sample performance evaluation. For k-fold cross-validation, the dataset is divided into k parts (equal sample size). Each part serves as the testing sample in and the rest (k-1 together) serves as training sample. This training/testing procedure is iteratively performed k times. The CV score is usually the average of the metric of out-of-sample performance across k iterations.

Note that we often use the entire dataset for cross validation. Think about why?

Instead of manually implement a cross-validation procedure, we can use caret package, which provides built-in cross-validation capabilities. The caret package (short for Classification And REgression Training) contains functions to streamline the model training process for complex regression and classification problems. For more details, please see https://cran.r-project.org/web/packages/caret/vignettes/caret.html.

library(caret)
cv_model1 <- train(
  form = medv ~ rm, 
  data = Boston, 
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)
cv_model1
## Linear Regression 
## 
## 506 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 455, 455, 455, 455, 457, 455, ... 
## Resampling results:
## 
##   RMSE     Rsquared  MAE     
##   6.60593  0.490575  4.477051
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Exercise

Conduct 10 fold cross-validation for the other three models we have explored.

Now we can generate a very user-friendly summary table.

allCVresults<- resamples(list(model1 = cv_model1, 
                              model2 = cv_model2,
                              model3 = cv_model3))
summary(allCVresults)
## 
## Call:
## summary.resamples(object = allCVresults)
## 
## Models: model1, model2, model3 
## Number of resamples: 10 
## 
## MAE 
##            Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## model1 3.437536 4.487691 4.597107 4.477051 4.708215 4.950190    0
## model2 3.502776 3.695231 3.875887 3.944833 4.044314 4.640150    0
## model3 2.785270 3.016912 3.463437 3.355030 3.633388 3.917857    0
## 
## RMSE 
##            Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## model1 4.941122 6.348525 6.736368 6.605930 7.253039 7.646193    0
## model2 4.583821 5.009231 5.171352 5.488724 5.259959 7.974033    0
## model3 3.710898 4.039980 4.829046 4.777103 5.369552 5.981773    0
## 
## Rsquared 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## model1 0.2426731 0.3177243 0.4626704 0.4905750 0.6666814 0.7809753    0
## model2 0.3982181 0.6320153 0.6671200 0.6632233 0.7676352 0.8214734    0
## model3 0.5820728 0.6958518 0.7652234 0.7356172 0.7828319 0.7961814    0

We can also visualize the comparison.

bwplot(allCVresults, metric = "RMSE", main="Model Comparison")

dotplot(allCVresults, main="Model Comparison")

Back to top


Bootstrap

Bootstrap is one of the most useful inventions in statistics. It is based on random sampling with replacement. We can use this technique to construct confidence intervals empirically.

library(boot)
coef.lm<- function(formula, data, indices) {
  d <- data[indices,] # allows boot to select sample 
  fit <- lm(formula, data=d)
  return(fit$coef)
} 
# bootstrapping with 1000 replications 
results <- boot(data=Boston, statistic=coef.lm, R=1000, formula=medv~.)

# view results
results
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston, statistic = coef.lm, R = 1000, formula = medv ~ 
##     .)
## 
## 
## Bootstrap Statistics :
##           original        bias    std. error
## t1*   3.645949e+01 -2.652835e-01 8.239937033
## t2*  -1.080114e-01  6.312763e-03 0.034684479
## t3*   4.642046e-02 -4.442950e-04 0.013758976
## t4*   2.055863e-02 -2.801766e-04 0.050071441
## t5*   2.686734e+00  6.434535e-02 1.342600465
## t6*  -1.776661e+01  1.668126e-01 3.825150283
## t7*   3.809865e+00  1.864467e-02 0.875424390
## t8*   6.922246e-04 -1.127862e-04 0.016616599
## t9*  -1.475567e+00  3.497049e-03 0.216324862
## t10*  3.060495e-01 -2.202251e-03 0.065209932
## t11* -1.233459e-02 -5.530330e-05 0.002798938
## t12* -9.527472e-01  3.558182e-03 0.117745316
## t13*  9.311683e-03  8.906344e-05 0.002729802
## t14* -5.247584e-01 -3.185466e-03 0.103084826
plot(results, index = 2)

Back to top


Summary – Things to remember

  • lm() function for linear regression model fitting
  • How to get different metrics about model fitting
  • Model diagnostics and possible ways to address issues
  • Random sampling (training vs. testing)
  • predict() function for prediction and then assess prediction error
  • train() function in caret package for cross-validation
  • boot function for bootstrapping