Barley Data

In barley.rdata are results from an experiment where the response is yield of barley pr 1000 square meter, and the factors sorts of barley, soil types, types of fertilizers. In addition was the experiment done in two different geographical areas (sites).

  • Assume a two factor model including the main effects of sort and soil and their interaction. State the model and explain all parameters under a sum-to-zero parametrization.

The model can be written as:

\[ y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk}\]

where, \(\epsilon \sim N(0, \sigma^2)\)

The parametrization restrictions for “sum-to-zero”" are:

\[ \begin{aligned} \sum_i{\alpha_i} = 0, & \sum_j{\beta_j} = 0, & \sum_{i}{(\alpha\beta)_{ij}} = 0 \text{ and } \sum_{j}{(\alpha\beta)_{ij}} = 0 \end{aligned} \]

  • Fit the model in R and estimate all parameters
options(contrasts = c("contr.sum", "contr.poly"))
mod1 <- lm(Yield ~ sort * soil, data = Barley)
summod1 <- summary(mod1)
summod1$coef
                Estimate Std. Error t value Pr(>|t|)
(Intercept)        380.9       10.1   37.55 1.70e-25
sort(1)            -23.6       10.1   -2.33 2.73e-02
soil(1)            -43.1       10.1   -4.25 2.14e-04
sort(1):soil(1)     13.8       10.1    1.36 1.84e-01
summod1$sigma ^ 2
[1] 3293

From the model summaries we find the estimated parameters to be:

  1. The estimated overall mean yield: \(\hat{\mu} = 380.94\)
  2. The estimated effects of sort 1 and 2: \(\hat{\alpha}_1=\) -23.625, \(\hat{\alpha}_2 = -\hat{\alpha}_1 =\) 23.625
  3. The estimated effects of soil 1 and 2: \(\hat{\beta}_1=\) -43.125, \(\hat{\beta}_2 = -\hat{\beta}_1 =\) 43.125
  4. The estimated interaction effects of sort and soil: \(\hat{\alpha\beta}_{11}=\) 13.812, \(\hat{\alpha\beta}_{12}=-\hat{\alpha\beta}_{11}\) -13.812, \(\hat{\alpha\beta}_{21}=-\hat{\alpha\beta}_{11}\) -13.812, \(\hat{\alpha\beta}_{22}=-\hat{\alpha\beta}_{21}=-\hat{\alpha\beta}_{12}=\) 13.812
  5. The estimated within sort:soil levels unexplained variability: \(\hat{\sigma}^2=MSE=\) 3293.21
  • Make an interaction plot and try to conclude about the presence of interaction from the plot. How would you explain interaction effect in this example for a person with experience in agriculture, but minimal statistical experience?
plot(allEffects(mod1,confidence.level = .95))

From the plot, it seems that a farmer can expect an increase in yield when using sort 2 instead of sort 1 for both types of soil, and there may be a higher gain of sort 2 over sort 1 for soil-type 2 than for soil type 1. If this is so, there is a so-called interaction effect of sort and soil on yield.
  • Perform a hypothesis test for the interaction effect. Conclusion?
Anova(mod1, type = "III")
Anova Table (Type III tests)

Response: Yield
             Sum Sq Df F value  Pr(>F)    
(Intercept) 4643628  1 1410.06 < 2e-16 ***
sort          17861  1    5.42 0.02731 *  
soil          59513  1   18.07 0.00021 ***
sort:soil      6105  1    1.85 0.18419    
Residuals     92210 28                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction between soil and sort is insignificant (high p-value). This result indicates that the slight non-parallel tendency of the lines in the interaction plot may be due to random errors. More data would be needed to verify any interaction effect.

  • Fit a four factor model with sort, soil, fert and site as factors and with all possible 2-factor, 3-factor and 4-factor interactions. Use Anova() from the car-package with argument type="III" to produce an ANOVA table. Type III means that all factors are tested as if they were the last factor to be added to the model. Are there any higher order significant effects? Compare the R-squared and the adjusted R-squared values. What can you conclude from these?
mod2 <- lm(Yield ~ sort * soil * fert * site, data = Barley)
Anova(mod2, type = "III")
Anova Table (Type III tests)

Response: Yield
                     Sum Sq Df F value    Pr(>F)    
(Intercept)         4643628  1 4503.19   < 2e-16 ***
sort                  17861  1   17.32   0.00073 ***
soil                  59512  1   57.71 0.0000011 ***
fert                   2813  1    2.73   0.11812    
site                   1458  1    1.41   0.25176    
sort:soil              6105  1    5.92   0.02707 *  
sort:fert              2080  1    2.02   0.17472    
soil:fert             33930  1   32.90 0.0000306 ***
sort:site             23220  1   22.52   0.00022 ***
soil:site               595  1    0.58   0.45849    
fert:site               105  1    0.10   0.75364    
sort:soil:fert         2245  1    2.18   0.15953    
sort:soil:site         1405  1    1.36   0.26029    
sort:fert:site         7442  1    7.22   0.01622 *  
soil:fert:site           12  1    0.01   0.91370    
sort:soil:fert:site     406  1    0.39   0.53914    
Residuals             16499 16                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod2)$r.squared
[1] 0.906
summary(mod2)$adj.r.squared
[1] 0.818

There is only one interaction effect significant at 5% levels, namely the sort:fert:site interaction. However, the least squares estimator is prone to inflated variance estimates and difficulties in finding significance when the number of variables k rises compared to the number of observations N and if variables are highy correlated. To many variables to estimate from a limited number of observations leaves few degrees of freedom to SSE, and tests with low power. Lack of significance may be an over-fitting problem. By removing some of the least significant variables, the problem is reduced, and a reduced model with significant effects may be identified.

The large difference between \(R^2\) and \(R_{adj}^2\) is also an indication of an over-fitted model.

  • Fit also a reduced model without site, but all other effects up to 3rd order interactions. Perform a partial F-test with the anova() function to test whether site (and all its interactions with the others) should be excluded from the model (See also Exercise set 3 and Ex-3).
mod3 <- lm(Yield ~ sort * soil * fert, data = Barley)
anova(mod2, mod3)
Analysis of Variance Table

Model 1: Yield ~ sort * soil * fert * site
Model 2: Yield ~ sort * soil * fert
  Res.Df   RSS Df Sum of Sq   F Pr(>F)   
1     16 16499                           
2     24 51142 -8    -34644 4.2 0.0071 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The anova result for partial F-test shows that site is needed to a certain degree in the model. The low p-value reject the hypothesis that there is no effect of site on Yield.

  • In the mixlm-package there are convenient functions for performing automatic model selection by either backward() elimination of factors from a "full-model, by forward() addition of factors from a minimal model with only intercept, or a combined stepwise() function which combines both forward and backward addition/elimination.

For the backward function the least significant factor is removed from the model in each step if the p-value is larger than testlevel alpha. If all factors are significant at any step, the procedure stops. The elimination obeys the so-called marginality principle (hierarchy of factors) which states that any lower order effect or interaction should not be removed from the model if it is part of a higher order significant interaction. This is a good principle for practical data analysis.

Try to run the backward() function on the model object you created above with all four factors and interactions. Use alpha=0.05 as test level. Use the Step-information from the output to explain which factors being excluded at each step. What is the final reduced model?

mod4 <- backward(mod2, alpha = 0.05)
Backward elimination, alpha-to-remove: 0.05

Full model: Yield ~ sort * soil * fert * site
<environment: 0x7f938ede4740>

                    Step   RSS AIC R2pred   Cp F value Pr(>F)
sort:soil:fert:site    1 16905 231  0.659 14.4    0.39   0.54
soil:fert:site         2 16918 229  0.696 12.4    0.01   0.91
sort:soil:site         3 18322 229  0.704 11.8    1.49   0.24
soil:site              4 18917 228  0.724 10.3    0.62   0.44
sort:soil:fert         5 21162 230  0.720 10.5    2.37   0.14

In the first step, the 4th order interaction is removed. In 2nd, 3rd and 4th step some 3rd order and a 2nd order interactions are removed. After removing interaction between sort,soil and site in the 5th step, the resulting model is,

Yield ~ sort + soil + fert + site + sort:soil + sort:fert + soil:fert + 
    sort:site + fert:site + sort:fert:site
  • Fit the reduced model and perform a model check using residual analysis.

The backward function has fitted the reduced model (her named mod4):

mod4 <- lm(Yield ~ sort + soil + fert + site + sort:soil + 
     sort:fert + soil:fert + sort:site + fert:site + sort:fert:site, 
     data = Barley)

par(mfrow = c(2,2))
plot(mod4)

The residual plots give no evidence of any problems with the model assumptions.

  • Make an interaction effects plot using the following code:
eff <- allEffects(mod4, confidence.level = .95)
plot(eff[3])
eff <- allEffects(mod4, confidence.level = .95)
plot(eff[3])

Try to explain the plot. Why do you think the interaction between sort, fertilizer and site was significant?

The plot shows that, when changing sort from 1 to 2, a farmer can expect different change in average Yield for site:1 and site:2 and the difference is not same when using fertilizer:1 and fertilizer:2.