Dataset: mtcars

We will use a data set from 1974 on fuel consumption for various cars, which is part of the datasets package in R.

head(mtcars)
                   mpg cyl disp  hp drat   wt qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.62 16.5  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.88 17.0  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.32 18.6  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.21 19.4  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.0  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.46 20.2  1  0    3    1

As you can see, there are multiple variables. If you have the datasets package, you may look at the help file for the data by:

?datasets::mtcars

Ex-1: Model Fitting

  • Fit a multiple linear regression model with mpg (Miles/Gallon) as response variable and wt (Weight) and cyl (Number of cylinders) as predictors. Call the model object “cars1”.
cars1 <- lm(mpg ~ cyl + wt, data = mtcars)
  • Check the model assumptions by residual analysis.
par(mfrow = c(1,2)) #This creates a layout for plots, one row and two columns
plot(cars1, which = c(1,2)) #Only the first two residual plots
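
If you prefer to build these diagnostics by hand, the same two plots can be constructed directly from the model object; a minimal base-R sketch:

plot(fitted(cars1), resid(cars1), xlab = "Fitted values", ylab = "Residuals") #Residuals vs fitted
abline(h = 0, lty = 2) #Reference line at zero
qqnorm(resid(cars1)) #Normal probability plot of the residuals
qqline(resid(cars1)) #Reference line through the quartiles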

  • Give a summary of the results and compute the ANOVA-table.
summary(cars1)

Call:
lm(formula = mpg ~ cyl + wt, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-4.289 -1.551 -0.468  1.574  6.100 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   39.686      1.715   23.14  < 2e-16 ***
cyl           -1.508      0.415   -3.64  0.00106 ** 
wt            -3.191      0.757   -4.22  0.00022 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

s: 2.57 on 29 degrees of freedom
Multiple R-squared: 0.83,
Adjusted R-squared: 0.819 
F-statistic: 70.9 on 2 and 29 DF,  p-value: 0.00000000000681 
anova(cars1)
Analysis of Variance Table

Response: mpg
          Df Sum Sq Mean Sq F value          Pr(>F)    
cyl        1    818     818   124.0 0.0000000000054 ***
wt         1    117     117    17.8         0.00022 ***
Residuals 29    191       7                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Give a short report on the results.

Both cyl and wt appear to be highly significant predictors of mpg. The estimated effects are negative, implying that the mileage decreases as both weight and the number of cylinders increase, which is a reasonable result. The \(R^2\) is 0.83; hence, about 83% of the variability in mileage is explained by the linear relationship with cyl and wt. The plot of residuals versus fitted values gives an indication of a non-linear relationship, which may be a result of non-linear dependencies or missing explanatory variable(s). The normal probability plot is more or less OK.
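
The quantities cited above can also be pulled programmatically from the summary object, e.g.:

summary(cars1)$r.squared #Multiple R-squared, about 0.83
summary(cars1)$adj.r.squared #Adjusted R-squared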

Ex-2: Indicator variable

The am variable is an indicator variable for the transmission system of the cars (0 = automatic, 1 = manual). Run the following model in R:

cars2 <- lm(mpg ~ cyl + wt*am, data = mtcars)
  • Write up the assumed model which has been run here. Also write up the estimated models for automatic and manual transmission, respectively.

The model is:

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_2\cdot x_3 + \epsilon\]

where \(y\) = mpg, \(x_1\) = cyl, \(x_2\) = wt, \(x_3\) = am and \(\epsilon \sim N(0, \sigma^2)\).

The fitted model from R is:

summary(cars2)

Call:
lm(formula = mpg ~ cyl + wt * am, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-3.462 -1.491 -0.788  1.396  5.350 

Coefficients:
            Estimate Std. Error t value        Pr(>|t|)    
(Intercept)   34.283      2.796   12.26 0.0000000000015 ***
cyl           -1.181      0.380   -3.11          0.0044 ** 
wt            -2.369      0.824   -2.87          0.0078 ** 
am            11.939      3.845    3.10          0.0044 ** 
wt:am         -4.197      1.312   -3.20          0.0035 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

s: 2.26 on 27 degrees of freedom
Multiple R-squared: 0.877,
Adjusted R-squared: 0.859 
F-statistic: 48.1 on 4 and 27 DF,  p-value: 0.00000000000664 

For automatic transmission (am=\(x_3\)=0) we have the estimated model:

\[\hat{y} = 34.28 - 1.18x_1 - 2.37x_2\]

For manual transmission (am=\(x_3\)=1) we have \[\hat{y} = 34.28 - 1.18x_1 - 2.37x_2 + 11.94\cdot 1 - 4.20x_2\cdot 1\] \[= 46.22 - 1.18x_1 - 6.57x_2\]
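
These numbers can be verified from the coefficient vector of cars2 (a small check, not part of the exercise):

b <- coef(cars2)
b["(Intercept)"] + b["am"] #Manual intercept: 34.28 + 11.94 = 46.22
b["wt"] + b["wt:am"] #Manual weight slope: -2.37 - 4.20 = -6.57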

We observe that the negative effect of weight on mileage is larger for manual transmission than for automatic.

Ex-3: Comparing models - Partial F-test

From the p-values we observe that transmission gives a significant addition to the intercept and to the effect of weight, respectively. These p-values correspond to testing each effect GIVEN that all other variables are included in the model. Sometimes we would rather test several effects jointly. For instance, should we add both transmission (am) AND the interaction between transmission and weight (wt:am) to the model? This is a joint test of the significance of transmission in the model. To accomplish this we may compare the fits of cars2 (a full model) with cars1 (a reduced model), since the difference between these models is exactly the transmission effects. This is called a partial F-test (Fisher test), where we test whether the SSE has decreased significantly as we go from the reduced model to the full model. The partial F-test may be run in R by:

anova(cars1, cars2)
Analysis of Variance Table

Model 1: mpg ~ cyl + wt
Model 2: mpg ~ cyl + wt * am
  Res.Df RSS Df Sum of Sq    F Pr(>F)  
1     29 191                           
2     27 138  2      52.7 5.13  0.013 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • From the output we see that the test statistic is an F-statistic. Do you think there is a significant effect of transmission?

A lengthy answer:

We are here really testing the hypotheses:

\[H_0: \beta_3 = \beta_4 = 0\] versus the alternative that at least one of them is different from zero.

We reject the null-hypothesis at test-level \(\alpha\) if

\[F = \frac{(SSE_\text{red.mod}-SSE_\text{full.mod})/r}{MSE_\text{full.mod}}\] is larger than \(F_{\alpha, r, n-p}\), where \(r\) is the difference in the degrees of freedom for SSE for the two models (here \(r=2\)), and \(n-p\) are the degrees of freedom for SSE of the full model (here \(n-p=27\)).

From the output we have (note RSS = SSE): \[F = \frac{(191.17 - 138.51)/2}{138.51/27} = 5.13\]

We reject at level \(\alpha=0.05\) if this observed F is larger than \(F_{0.05, 2, 27}\). At this point we could look this up in a Fisher-table, or alternatively compute this quantile of the Fisher distribution by:

qf(0.05, 2, 27, lower.tail = FALSE)
[1] 3.35

See ?FDist for the help file for the Fisher distribution.

We reject the null-hypothesis.

Alternatively we reject since the p-value from the output is smaller than 0.05.
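
For completeness, the same F-statistic and p-value can be computed by hand from the two fitted models:

sse_red <- sum(resid(cars1)^2) #SSE of the reduced model, 191.17
sse_full <- sum(resid(cars2)^2) #SSE of the full model, 138.51
F_obs <- ((sse_red - sse_full)/2)/(sse_full/27)
F_obs #5.13, matching the anova() output
pf(F_obs, 2, 27, lower.tail = FALSE) #p-value, about 0.013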

  • Perform a residual analysis of the cars2 model.
par(mfrow = c(1,2)) 
plot(cars2, which = c(1,2)) 

The linearity has improved, but maybe there is an increasing variance with increasing fitted value (estimated mileage). The normality looks good.

Ex-4: Influential measurements

  • Use the influence.measures() function to compute the Cook’s distances and the leverage (hat) values for all observations according to the cars2 model. Are there any influential observations according to these measures?
summary(influence.measures(cars2))
Potentially influential observations of
     lm(formula = mpg ~ cyl + wt * am, data = mtcars) :

                    dfb.1_ dfb.cyl dfb.wt dfb.am dfb.wt:m dffit cov.r  
Lincoln Continental  0.38   0.16   -0.52  -0.28   0.23    -0.59  1.57_*
Fiat 128             0.10  -0.31    0.17   0.25  -0.15     0.91  0.36_*
Ford Pantera L       0.01  -0.02    0.01   0.01  -0.02    -0.04  1.61_*
Maserati Bora       -0.03   0.09   -0.05  -0.35   0.49     0.73  1.64_*
                    cook.d hat  
Lincoln Continental  0.07   0.33
Fiat 128             0.13   0.10
Ford Pantera L       0.00   0.25
Maserati Bora        0.11   0.38

Four observations are flagged by R, all due to the covariance ratio (cov.r), but none according to Cook’s distance or leverage (hat).
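
The underlying measures are also available directly if you want to apply explicit cutoffs. The thresholds below are common rules of thumb (an assumption on our part, not prescribed by the exercise):

n <- nrow(mtcars)
p <- length(coef(cars2))
which(cooks.distance(cars2) > 4/n) #Rule-of-thumb cutoff for Cook's distance
which(hatvalues(cars2) > 2*p/n) #Rule-of-thumb cutoff for leverage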

Ex-5: Model selection

  • Fit a third multiple linear regression model with mpg (Miles/Gallon) as response variable and cyl, disp, hp, drat, wt and qsec as predictor variables. Call the model object “cars3”. Report a summary of the analysis.
cars3 <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec, data = mtcars)
summary(cars3)

Call:
lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-3.968 -1.580 -0.435  1.166  5.527 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  26.3074    14.6299    1.80   0.0842 . 
cyl          -0.8186     0.8116   -1.01   0.3228   
disp          0.0132     0.0120    1.10   0.2831   
hp           -0.0179     0.0155   -1.16   0.2585   
drat          1.3204     1.4795    0.89   0.3806   
wt           -4.1908     1.2579   -3.33   0.0027 **
qsec          0.4015     0.5166    0.78   0.4444   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

s: 2.56 on 25 degrees of freedom
Multiple R-squared: 0.855,
Adjusted R-squared: 0.82 
F-statistic: 24.5 on 6 and 25 DF,  p-value: 0.00000000245 

Apparently only wt is significant, but having many variables in a model may inflate the standard errors of the estimates due to correlation between the predictors (multicollinearity), making it hard to identify the truly significant variables.
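
One way to quantify such collinearity (not required by the exercise) is the variance inflation factor, available e.g. through the car package; values far above 5-10 are usually taken as a warning sign:

#install.packages("car") #If not already installed
car::vif(cars3) #Variance inflation factor for each predictor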

We would like to check various sub-models of this model by combining different variables. Install and load the ‘mixlm’ package. The package contains a function called best.subsets() which can help us find a good model.
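
If the package is not yet installed, it can be installed and loaded as:

#install.packages("mixlm") #Run once
library(mixlm)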

Best Subset Output:

best.subsets(cars3)
         cyl disp hp drat wt qsec RSS    R2 R2adj    Cp
1  ( 1 )                   *      278 0.753 0.745 14.56
1  ( 2 )   *                      308 0.726 0.717 19.15
1  ( 3 )        *                 317 0.718 0.709 20.50
1  ( 4 )           *              448 0.602 0.589 40.46
1  ( 5 )                *         604 0.464 0.446 64.30
2  ( 1 )   *               *      191 0.830 0.819  3.24
2  ( 2 )           *       *      195 0.827 0.815  3.83
2  ( 3 )                   *    * 195 0.826 0.814  3.89
2  ( 4 )        *          *      247 0.781 0.766 11.72
2  ( 5 )                *  *      269 0.761 0.744 15.17
3  ( 1 )   *       *       *      177 0.843 0.826  3.01
3  ( 2 )   *               *    * 181 0.840 0.822  3.62
3  ( 3 )                *  *    * 184 0.837 0.820  4.07
3  ( 4 )           *    *  *      184 0.837 0.819  4.09
3  ( 5 )           *       *    * 186 0.835 0.817  4.45
4  ( 1 )   *    *  *       *      170 0.849 0.826  4.07
4  ( 2 )           *    *  *    * 174 0.845 0.822  4.63
4  ( 3 )   *       *    *  *      174 0.845 0.822  4.67
4  ( 4 )   *       *       *    * 175 0.844 0.821  4.80
4  ( 5 )   *    *          *    * 176 0.844 0.821  4.86
5  ( 1 )   *    *  *    *  *      167 0.851 0.823  5.60
5  ( 2 )   *    *  *       *    * 169 0.850 0.821  5.80
5  ( 3 )        *  *    *  *    * 170 0.849 0.820  6.02
5  ( 4 )   *       *    *  *    * 171 0.848 0.819  6.20
5  ( 5 )   *    *       *  *    * 172 0.847 0.818  6.34
6  ( 1 )   *    *  *    *  *    * 163 0.855 0.820  7.00

The function reports by default the 5 best models for each model size (number of predictors). The model size is given in the first column. Column two is the rank within model size; then comes a column for each variable, with a star indicating that a given variable is part of the model. Finally come the residual sum of squares (RSS or SSE), \(R^2\), adjusted \(R^2\), and a diagnostic called Mallows’ Cp.

  • Which sub-model would you say is the best fitting model according to the \(R^2\)-adjusted?

A couple of models are quite similar, but the largest \(R^2\)-adjusted is obtained with a model with predictors cyl, hp and wt. This is also a quite simple model with few predictors. We should always strive for simple models and choose the simpler model in cases where the fit appears to be more or less equal for several models.

Ex-6: Model validation

On Canvas you will find a file called “CV.R” containing two functions, CV() and Kfold(). Download this file, open it in RStudio and press the “Source” button at the top right of the script window. This will run the file and create these functions. You can also source the file as:

source('_functions/CV.R')

A fitted model should ideally be validated on a test set of new and untouched data. We could predict the new samples using our best-choice model and evaluate the prediction performance. If the model predicts well, we probably have a good model!

If we don’t have a test set, we may perform cross-validation. The most common version is Leave-One-Out Cross-validation, where we successively remove one observation from the data and fit the model to the remaining observations. The fitted model is used to predict the left-out observation. After fitting a model, the left-out observation is put back, and another is left out. In total we then fit \(n\) models and perform \(n\) predictions.
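
The CV() function itself is provided on Canvas, but as an illustration of the idea, a minimal Leave-One-Out sketch in base R (assuming the model formula only uses untransformed columns of the data) could look like this:

loo_cv <- function(model) {
  mf <- model.frame(model) #Data the model was fitted on
  n <- nrow(mf)
  pred <- numeric(n)
  for (i in seq_len(n)) {
    fit_i <- update(model, data = mf[-i, ]) #Refit without observation i
    pred[i] <- predict(fit_i, newdata = mf[i, , drop = FALSE])
  }
  y <- model.response(mf)
  list(pred = pred,
       msep = mean((y - pred)^2), #Mean square error of prediction
       r.squared.pred = 1 - sum((y - pred)^2)/sum((y - mean(y))^2))
}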

  • Use the CV() function to perform a Leave-One-Out CV using the cars2 model fit by:
res <- CV(cars2)
print(res)
$pred
 [1] 22.0 20.1 26.6 19.4 16.4 19.1 16.6 21.3 21.9 19.0 19.1 15.1 15.9 15.9
[15] 13.1 12.8 11.1 26.5 31.0 28.7 24.5 16.6 16.9 15.9 15.4 29.0 27.6 32.0
[29] 16.0 21.1 12.3 23.7

$msep
[1] 6.09

$r.squared.pred
[1] 0.828

CV() returns a list with three elements: the predictions, the Mean Square Error of Prediction (MSEP) and the predicted \(R^2\).

  • Make a plot of the observed mpg versus the cross-validation predictions.
plot(mtcars$mpg, res$pred, xlab = "Observed", ylab = "Predicted")
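
Adding the identity line (a small extra step, not part of the exercise) makes over- and under-prediction easier to spot:

abline(0, 1, lty = 2) #Points on this line are predicted perfectly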

  • Is the best model from the previous exercise, identified by best.subsets(), better in terms of prediction error (MSEP)? The MSEP is defined by:

\[\text{MSEP} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_{(i)}\right)^2\]

where \(\hat{y}_{(i)}\) is the prediction of \(y_i\) using a model where observation \(i\) was left out from the model estimation. A small value implies better prediction.
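
Note that the MSEP reported by CV() is exactly this average of squared leave-one-out prediction errors; for cars2 it can be reproduced from the stored predictions:

mean((mtcars$mpg - res$pred)^2) #Reproduces res$msep, about 6.09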

cars4 <- lm(mpg ~ cyl + hp + wt, data = mtcars)
CV(cars4)
$pred
 [1] 22.97 22.08 26.26 20.91 16.98 20.41 15.65 23.65 23.38 20.03 20.10
[12] 14.97 16.05 16.07 11.02 10.09  8.74 26.32 28.71 27.35 25.83 17.69
[23] 18.09 14.79 15.57 27.70 26.62 27.72 16.58 21.28 12.89 24.61

$msep
[1] 7.19

$r.squared.pred
[1] 0.797

No, this model does not predict better than cars2.

  • Leave-One-Out CV is known to have large uncertainty, and K-fold CV is an alternative. The data are then divided into K subsets (folds) of approximately equal size, and a leave-one-fold-out CV is performed instead. K=10 is often recommended; here, since n=32, K=8 is better because it gives folds of equal size. Create K=8 random folds by
myfolds <- Kfold(n = 32, K = 8, random = TRUE)
print(myfolds)
[[1]]
[1] 11 22 14 10

[[2]]
[1]  8  6  4 21

[[3]]
[1] 17 25  9 16

[[4]]
[1] 29 26  5  3

[[5]]
[1]  1 24 19 30

[[6]]
[1] 32 15 28 31

[[7]]
[1]  2 27 20 23

[[8]]
[1] 13 18  7 12

Since the folds are sampled randomly, we get different folds each time we run Kfold().
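
If you need reproducible folds, for instance in a report, fix the random seed before calling Kfold() (assuming it draws the folds with R’s random number generator); the seed value itself is arbitrary:

set.seed(123) #Any fixed seed gives the same folds on every run
myfolds <- Kfold(n = 32, K = 8, random = TRUE)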

As you see, myfolds is a list of 8 random subsets of observation numbers. Re-run the validation of cars2 by

CV(cars2, folds = myfolds)
$pred
 [1] 22.3 20.1 27.0 19.5 16.2 18.8 16.6 21.7 21.9 19.2 19.2 15.3 16.1 16.1
[15] 13.2 10.8 11.0 26.5 31.2 28.8 23.5 16.8 16.8 16.0 15.3 29.4 26.9 32.4
[29] 16.0 21.3 12.3 23.2

$msep
[1] 5.83

$r.squared.pred
[1] 0.837

Note that the result will now vary if you repeat the creation of random K-folds for the cross-validation. Try to make a new myfolds and run the CV again, for example:
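
myfolds2 <- Kfold(n = 32, K = 8, random = TRUE) #A fresh random split
CV(cars2, folds = myfolds2)$msep #The MSEP will typically change slightly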