We use the Boston Housing data as an illustrative example in this lab to learn basic linear regression and analysis with R.
The Boston housing data is a built-in dataset in the MASS package, so you do not need to download it externally. MASS is installed together with R, so there is no need to run install.packages("MASS"); you do, however, need to load the package.
library(MASS)
data(Boston) # this data is in the MASS package
You can find details of the dataset in the help document.
?Boston
We can obtain the help document for this dataset because it is built-in; most of the time, however, your dataset will be external.
We have introduced many EDA techniques in previous labs. Here we conduct only some very basic, high-level EDA; in general, a much more in-depth EDA is needed. Below is a quick overview and summary of the data.
dim(Boston)
## [1] 506  14
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00  
We can also draw a pairwise scatter plot.
pairs(Boston)
We should be careful when drawing a pairwise scatter plot. First, scatter plots only make sense for numeric variables, so make sure the input data frame contains numeric variables only. Second, we should not include too many variables; the plot shown here is already a little overwhelming.
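If the full matrix is too busy, one option (a minimal sketch; the choice of columns here is just for illustration) is to plot only a handful of variables of interest:
pairs(Boston[, c("medv", "rm", "lstat", "crim")])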
We first fit a simple linear regression using, say “rm”, as the explanatory variable \(X\).
# draw a scatter plot
plot(medv~rm, data=Boston)
# fit the simple linear regression
model0<- lm(medv~rm, data = Boston)
model0
## 
## Call:
## lm(formula = medv ~ rm, data = Boston)
## 
## Coefficients:
## (Intercept)           rm  
##     -34.671        9.102
abline(model0, col="red", lty=2, lwd=2)
To get more detailed outputs of the fitted model, we use the function summary().
sum.model0<- summary(model0)
sum.model0
## 
## Call:
## lm(formula = medv ~ rm, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.346  -2.547   0.090   2.986  39.433 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -34.671      2.650  -13.08   <2e-16 ***
## rm             9.102      0.419   21.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.616 on 504 degrees of freedom
## Multiple R-squared:  0.4835, Adjusted R-squared:  0.4825 
## F-statistic: 471.8 on 1 and 504 DF,  p-value: < 2.2e-16
Note that both of the objects model0 and sum.model0 are lists, so we can extract any component of these lists.
## pull coefficients
model0$coefficients
## (Intercept)          rm 
##  -34.670621    9.102109
## pull coefficients with hypothesis testing info
sum.model0$coefficients
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -34.670621  2.6498030 -13.08423 6.950229e-34
## rm            9.102109  0.4190266  21.72203 2.487229e-74
## pull sigma (think about what it is)
sum.model0$sigma
## [1] 6.61616
## pull adjusted R-square
sum.model0$adj.r.squared
## [1] 0.4825007
## pull fitted values
head(model0$fitted.values)
##        1        2        3        4        5        6 
## 25.17575 23.77402 30.72803 29.02594 30.38215 23.85594
## pull residuals
head(model0$residuals)
##         1         2         3         4         5         6 
## -1.175746 -2.174021  3.971968  4.374062  5.817848  4.844060
You can obtain some of these results in alternative ways:
coef(model0)
coef(sum.model0)
sigma(model0)
head(fitted(model0))
head(residuals(model0))
You can also obtain confidence intervals for the estimated coefficients. Recall the formula for the confidence interval: \(\hat{\beta}_j \pm t_{\alpha/2, n-p-1}\times SE(\hat{\beta}_j)\).
confint(model0)
##                  2.5 %     97.5 %
## (Intercept) -39.876641 -29.464601
## rm            8.278855   9.925363
Exercise: what is sigma? Can you obtain sigma based on $residuals?
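As a quick check (a minimal sketch using only the components shown above), we can reproduce confint() from the formula and recover sigma from the residuals:
## confidence interval by hand: beta_hat +/- t_{alpha/2, n-p-1} * SE(beta_hat)
est <- sum.model0$coefficients[, "Estimate"]
se  <- sum.model0$coefficients[, "Std. Error"]
tcrit <- qt(0.975, df = model0$df.residual)   # df = n - p - 1 = 504 here
cbind(lower = est - tcrit * se, upper = est + tcrit * se)
## sigma from the residuals: sqrt(SSE / (n - p - 1))
sqrt(sum(model0$residuals^2) / model0$df.residual)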
Let's consider a model with 3 \(X\) variables.
model1 <- lm(medv~crim+rm+lstat, data = Boston)
sum1 <- summary(model1)
sum1$adj.r.squared
## [1] 0.6437356
sum1$sigma^2
## [1] 30.13524
AIC(model1)
## [1] 3165.232
BIC(model1)
## [1] 3186.364
Clearly, the adjusted R-squared, MSE, AIC, and BIC of model1 are all much better than those of model0.
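To see the comparison explicitly (a quick sketch using objects defined earlier), you can compute the same quantities for model0:
sum.model0$adj.r.squared   # about 0.48, versus about 0.64 for model1
sum.model0$sigma^2         # about 43.8, versus about 30.1 for model1
AIC(model0)
BIC(model0)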
Given the estimated coefficients, we know how to compute the
predicted response \(\hat{y}\) for any
given input vector \(\mathbf{x}\). That
is \(\hat{y}=\mathbf{\hat{\beta}}^\top
\mathbf{x}\), where \(\mathbf{x}\) includes a leading 1 for the intercept. In R, we can also use the predict() function.
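As a sanity check (a minimal sketch of the formula above), we can compute the prediction by hand for the input used below, crim = 3, rm = 4, lstat = 6.4:
x_new <- c(1, 3, 4, 6.4)     # 1 for the intercept, then crim, rm, lstat
sum(coef(model1) * x_new)    # should match the predict() result below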
predY <- predict(model1, newdata=data.frame(crim=3, rm=4, lstat=6.4))
predY
##        1 
## 14.29444 
Here, we need to make sure that newdata is a data frame containing the same variables as those used in your training data. If newdata is not specified, predict() returns the fitted values.
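For instance (a hypothetical illustration), if newdata is missing one of the predictors, predict() fails with an error rather than guessing a value:
## predict(model1, newdata = data.frame(crim = 3, rm = 4))
## this would fail because lstat is not in newdata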
PredY_insample <- predict(model1)
head(PredY_insample)
##        1        2        3        4        5        6 
## 28.85772 25.64564 32.58746 32.24192 31.63289 27.96579 
The following code provides a prediction interval.
predY_interval <- predict(model1, newdata=data.frame(crim=3, rm=4, lstat=6.4), interval = "prediction")
predY_interval
##        fit      lwr      upr
## 1 14.29444 3.241033 25.34784
Exercise: try predict() without specifying newdata.
The aim of model diagnostics is to check whether the model assumptions are valid. Recall that the residual plot is the primary tool for model diagnostics.
plot(model1$fitted.values, model1$residuals)
What can you see from this plot?
More plots for model diagnostics.
#par(mar=c(4,4.5,2,1), oma=c(2,1,1,2))
par(mfrow=c(2,2)) # to specify the figure layout
plot(model1)
Some notes about these diagnostic plots, in particular the residual-leverage plot:
The residual vs. fitted and normal Q-Q plots are quite standard for checking the assumptions of normality and constant variance, and whether higher-order terms need to be included. The last figure, the residual vs. leverage plot, is mainly used to identify influential data points and potential outliers. Leverage is defined as the diagonal element \(h_i\) of the hat matrix \(H=X(X^T X)^{-1}X^T\), which depends only on \(X\); hence a larger value indicates extreme values in \(X\).
Cook's distance is defined as \(D_i=\frac{r_i^2 h_i}{p(1-h_i)}\), where \(r_i\) is the standardized residual and \(p\) is the number of model parameters, so it is a function of both the residual and the leverage; this is why it appears as contours in the figure. A Cook's distance is often considered large if \(D_i>4/n\), and such data points can be regarded as influential points. In the last figure, however, the two threshold values of Cook's distance, 0.5 and 1, are much larger than \(4/n\). Therefore, any point falling outside the dashed lines (Cook's distance = 0.5 or 1) is highly influential and very likely an outlier.
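Leverage values themselves can be pulled out directly (a small sketch; hatvalues() is the base-R accessor for the \(h_i\)):
lev <- hatvalues(model1)
head(sort(lev, decreasing = TRUE))   # observations most extreme in the X space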
In the base R package stats, there is a function to calculate Cook's distance:
CookD <- cooks.distance(model1)
which(CookD>4/nrow(Boston))  # threshold = 4/n, a general rule for influential points
##   9  49 142 148 149 162 163 164 167 187 196 204 205 215 226 229 234 258 262 263 
##   9  49 142 148 149 162 163 164 167 187 196 204 205 215 226 229 234 258 262 263 
## 268 281 283 284 365 366 368 369 370 371 372 373 374 375 376 381 385 387 407 413 
## 268 281 283 284 365 366 368 369 370 371 372 373 374 375 376 381 385 387 407 413 
## 414 415 417 420 428 
## 414 415 417 420 428 
which(CookD>0.5)  # threshold = 0.5 (default value in the plot)
## named integer(0)
Here is a great tutorial on linear model diagnostics with R: Link.
The following model includes all \(X\) variables.
model2<- lm(medv~., data=Boston)
# summary
summary(model2)
## 
## Call:
## lm(formula = medv ~ ., data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288    
## chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229    
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
Are there any insignificant variables?
model3<- lm(medv~. -indus -age, data=Boston)
summary(model3)
## 
## Call:
## lm(formula = medv ~ . - indus - age, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5984  -2.7386  -0.5046   1.7273  26.2373 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
## crim         -0.108413   0.032779  -3.307 0.001010 ** 
## zn            0.045845   0.013523   3.390 0.000754 ***
## chas          2.718716   0.854240   3.183 0.001551 ** 
## nox         -17.376023   3.535243  -4.915 1.21e-06 ***
## rm            3.801579   0.406316   9.356  < 2e-16 ***
## dis          -1.492711   0.185731  -8.037 6.84e-15 ***
## rad           0.299608   0.063402   4.726 3.00e-06 ***
## tax          -0.011778   0.003372  -3.493 0.000521 ***
## ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
## black         0.009291   0.002674   3.475 0.000557 ***
## lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7348 
## F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16
Note: when you use . to indicate all variables, you have to make sure all variables in the data are relevant and ready to be plugged in. This can be tricky, and errors are often reported when categorical variables are involved. Therefore, EDA and data cleaning before model fitting are extremely important.
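One quick pre-check (a minimal sketch) is to look at the class of each column before relying on the . shortcut:
sapply(Boston, class)   # all columns here are numeric or integer, so medv ~ . is safe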
We sample 90% of the original data and use it as the training set; the remaining 10% is used as the test set. The regression model is built on the training set, and the future performance of the model is evaluated on the test set.
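Because sample() is random, your split (and hence the numbers below) will differ from run to run; setting a seed first makes the split reproducible (the particular value here is arbitrary):
set.seed(2023)   # any fixed value works; chosen only for illustration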
index <- sample(nrow(Boston),nrow(Boston)*0.90)
Boston.train <- Boston[index,]
Boston.test <- Boston[-index,]
We always start by training a model using the training sample.
model1 <- lm(medv~crim+rm+lstat, data = Boston.train)
sum.model1 <- summary(model1)
We can obtain the in-sample MSE and two out-of-sample prediction errors, the MSPE and MAPE. Please make sure you know what they are.
## MSE
mse.model1 <- sum.model1$sigma^2
## MSPE
pred.model1<- predict(model1, newdata = Boston.test)
mspe.model1 <- mean((Boston.test$medv-pred.model1)^2)
## MAPE
mape.model1 <- mean(abs(Boston.test$medv-pred.model1))
Now compute the same quantities for model2 and model3 (a possible completion is sketched after the template).
model2 <- 
pred.model2 <- 
mspe.model2 <-
mape.model2 <-
Rsq.oos.model2 <- 
model3 <- 
pred.model3 <-
mspe.model3 <-
mape.model3 <-
Rsq.oos.model3 <- 
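One possible completion (a sketch; it reuses the earlier definitions of model2 and model3, and assumes Rsq.oos denotes the out-of-sample R-squared, \(1-\mathrm{SSE}/\mathrm{SST}\) computed on the test set):
model2 <- lm(medv ~ ., data = Boston.train)
pred.model2 <- predict(model2, newdata = Boston.test)
mspe.model2 <- mean((Boston.test$medv - pred.model2)^2)
mape.model2 <- mean(abs(Boston.test$medv - pred.model2))
Rsq.oos.model2 <- 1 - sum((Boston.test$medv - pred.model2)^2) /
  sum((Boston.test$medv - mean(Boston.test$medv))^2)

model3 <- lm(medv ~ . - indus - age, data = Boston.train)
pred.model3 <- predict(model3, newdata = Boston.test)
mspe.model3 <- mean((Boston.test$medv - pred.model3)^2)
mape.model3 <- mean(abs(Boston.test$medv - pred.model3))
Rsq.oos.model3 <- 1 - sum((Boston.test$medv - pred.model3)^2) /
  sum((Boston.test$medv - mean(Boston.test$medv))^2)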
Instead of fitting a model on a pre-specified 90% training sample and evaluating the MSPE on the held-out 10% testing sample, it is more reliable to use cross-validation for out-of-sample performance evaluation. For k-fold cross-validation, the dataset is divided into k parts of (roughly) equal size. Each part serves as the testing sample in turn, and the rest (the other k-1 parts together) serves as the training sample. This training/testing procedure is performed k times, and the CV score is usually the average of the out-of-sample performance metric across the k iterations.
Note that we often use the entire dataset for cross-validation. Think about why.
Instead of manually implementing a cross-validation procedure, we can use the caret package, which provides built-in cross-validation capabilities. The caret package (short for Classification And REgression Training) contains functions to streamline the model training process for complex regression and classification problems. For more details, please see https://cran.r-project.org/web/packages/caret/vignettes/caret.html.
library(caret)
cv_model1 <- train(
  form = medv ~ rm, 
  data = Boston, 
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)
cv_model1
## Linear Regression 
## 
## 506 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 455, 455, 455, 455, 457, 455, ... 
## Resampling results:
## 
##   RMSE     Rsquared  MAE     
##   6.60593  0.490575  4.477051
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
Conduct 10-fold cross-validation for the other three models we have explored (a possible sketch for cv_model2 and cv_model3, which are used below, follows).
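A possible sketch (the formulas assigned to cv_model2 and cv_model3 here are our assumption, chosen to be consistent with the RMSE values in the summary below):
cv_model2 <- train(
  form = medv ~ crim + rm + lstat,
  data = Boston,
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)
cv_model3 <- train(
  form = medv ~ . - indus - age,
  data = Boston,
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)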
Now we can generate a very user-friendly summary table.
allCVresults<- resamples(list(model1 = cv_model1, 
                              model2 = cv_model2,
                              model3 = cv_model3))
summary(allCVresults)
## 
## Call:
## summary.resamples(object = allCVresults)
## 
## Models: model1, model2, model3 
## Number of resamples: 10 
## 
## MAE 
##            Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## model1 3.437536 4.487691 4.597107 4.477051 4.708215 4.950190    0
## model2 3.502776 3.695231 3.875887 3.944833 4.044314 4.640150    0
## model3 2.785270 3.016912 3.463437 3.355030 3.633388 3.917857    0
## 
## RMSE 
##            Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## model1 4.941122 6.348525 6.736368 6.605930 7.253039 7.646193    0
## model2 4.583821 5.009231 5.171352 5.488724 5.259959 7.974033    0
## model3 3.710898 4.039980 4.829046 4.777103 5.369552 5.981773    0
## 
## Rsquared 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## model1 0.2426731 0.3177243 0.4626704 0.4905750 0.6666814 0.7809753    0
## model2 0.3982181 0.6320153 0.6671200 0.6632233 0.7676352 0.8214734    0
## model3 0.5820728 0.6958518 0.7652234 0.7356172 0.7828319 0.7961814    0
We can also visualize the comparison.
bwplot(allCVresults, metric = "RMSE", main="Model Comparison")
dotplot(allCVresults, main="Model Comparison")
The bootstrap is one of the most useful inventions in statistics. It is based on random sampling with replacement, and we can use this technique to construct confidence intervals empirically.
library(boot)
coef.lm<- function(formula, data, indices) {
  d <- data[indices,] # allows boot to select sample 
  fit <- lm(formula, data=d)
  return(fit$coef)
} 
# bootstrapping with 1000 replications 
results <- boot(data=Boston, statistic=coef.lm, R=1000, formula=medv~.)
# view results
results
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston, statistic = coef.lm, R = 1000, formula = medv ~ 
##     .)
## 
## 
## Bootstrap Statistics :
##           original        bias    std. error
## t1*   3.645949e+01 -2.652835e-01 8.239937033
## t2*  -1.080114e-01  6.312763e-03 0.034684479
## t3*   4.642046e-02 -4.442950e-04 0.013758976
## t4*   2.055863e-02 -2.801766e-04 0.050071441
## t5*   2.686734e+00  6.434535e-02 1.342600465
## t6*  -1.776661e+01  1.668126e-01 3.825150283
## t7*   3.809865e+00  1.864467e-02 0.875424390
## t8*   6.922246e-04 -1.127862e-04 0.016616599
## t9*  -1.475567e+00  3.497049e-03 0.216324862
## t10*  3.060495e-01 -2.202251e-03 0.065209932
## t11* -1.233459e-02 -5.530330e-05 0.002798938
## t12* -9.527472e-01  3.558182e-03 0.117745316
## t13*  9.311683e-03  8.906344e-05 0.002729802
## t14* -5.247584e-01 -3.185466e-03 0.103084826
plot(results, index = 2)
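To turn the bootstrap replicates into an empirical confidence interval (a minimal sketch; index = 2 refers to the crim coefficient, the second statistic above), we can use boot.ci() from the boot package:
boot.ci(results, type = c("perc", "bca"), index = 2)   # 95% percentile and BCa intervals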
In summary, we have used the lm() function for linear regression model fitting, the predict() function for prediction and assessment of prediction error, the train() function in the caret package for cross-validation, and the boot() function for bootstrapping.