Chlorine levels in cities

Independent measurements of chlorine (ppm parts per million) were taken from 3 large cities:

city
   City1 City2 City3
1   0.46  1.21  0.44
2   0.92 10.86  0.71
3   0.86  2.09  2.87
4   0.33  6.76  0.81
5   0.53  3.20  7.34
6   5.13  2.02  2.21
7   1.51  2.16  9.00
8   1.00  0.44 17.20
9   0.70  7.80  2.76
10  1.89  1.20  1.07
11  0.68  1.39  2.07
12  0.71  2.42  1.70
13  0.41  1.62  7.89
14  0.78 10.78  0.79
15  0.11  3.25  0.24
16  2.70  0.41  4.44
17  0.41  0.52 15.40
18  0.75  3.71  4.21
19  0.81  5.20  8.63
20  0.35 13.89  3.90
  • Load the city.rdata which is available on Canvas.

Before analyzing the data it is best to “stack” the data into two columns, a response column y and a city factor column city. By using the stack() function in R you can restructure the city data by

citydata <- stack(city)
  • Use the colnames() function to rename the variables as y and city.
colnames(citydata) <- c("y", "city")
  • Assume an ANOVA-model with chlorine as response and city as a factor. What assumptions do you make?

The model assumptions are:

\[y_{ij} = \mu_i + \epsilon_{ij}\]

where the error terms are assumed independent and identically distributed \(\epsilon_{ij} \sim N(0, \sigma^2)\). Further \(\mu_i\) is the expected chlorine level in city \(i\).

  • Make a box-plot of the data. Describe the variabilities between cities and within cities.
plot(y ~ city, data = citydata)

City 1 seems to have both the lowest average chlorine levels and the smallest variability in the measurements. Cities 2 and 3 are quite similar with both higher means and variability.

  • Fit the model and perform a residual analysis. Comment on the model fit.
options(contrasts=c("contr.treatment", "contr.poly"))
citymod1 <- lm(y ~ city, data = citydata)
par(mfrow = c(1, 2))
plot(citymod1, which = c(1, 2))

The residual analysis reveals two problems: 1) Non-constant variability, which was also observed in the boxplot, and 2) A skeewed and non-normal distribution for the residuals with a heavy tail to the right.

  • Make a log-transformation of \(y\) to create the variable logy, and add the new variable to your data set by:
citydata$logy <- log(citydata$y)
  • Fit a new model with logy as response and check the model fit again.
citymod2 <- lm(logy~city, data = citydata)
par(mfrow = c(1,2))
plot(citymod2, which = c(1,2))

Both problems from the previous model appear to be corrected. We continue with this model.

  • Estimate all model parameters from the latter model. Explain what the parameters measure given the parameterization of your choice (See lecture notes for parameterization).

The ANOVA analysis using contr.treatment parametrization:

summary(citymod2)

Call:
lm(formula = logy ~ city, data = citydata)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3987 -0.5954 -0.0175  0.7122  1.9326 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -0.297      0.231   -1.29  0.20266    
city(City2)    1.226      0.326    3.76  0.00041 ***
city(City3)    1.269      0.326    3.89  0.00027 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

s: 1.03 on 57 degrees of freedom
Multiple R-squared: 0.255,
Adjusted R-squared: 0.229 
F-statistic: 9.75 on 2 and 57 DF,  p-value: 0.000228 
Anova(citymod2)
Anova Table (Type II tests)

Response: logy
          Sum Sq Df F value  Pr(>F)    
city        20.8  2    9.75 0.00023 ***
Residuals   60.7 57                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefs <- citymod2$coefficients

We observe that \(R^2 = 0.25\) indicating that there is still a lot of unexplained variability in the response. We get the following estimates for the city means of log(chlorine)-levels using this parametrization (city 1 is reference city):

\(\mu_1 = \mu =\) -0.3

\(\mu_2 = \mu + \alpha_2 =\) 0.93

\(\mu_3 = \mu + \alpha_3 =\) 0.97

The estimate of the error variance is \(\hat{\sigma}^2 = MSE= SSE/(N-3) = 60.724/57 = 1.065\).

As we see from above, the “intercept” \(\mu\) is the expected log-chlorine level for city 1. The \(\alpha_2\) is the difference between city 1 and city 2, and \(\alpha_3\) is the difference between city 1 and city 3. The \(\sigma^2\) is the variance for log-chlorine levels measured in the same city.

  • State the hypotheses for testing whether the expected chlorine levels differ between the cities, choose a test level \(\alpha\) and perform the test. What is the conclusion?

The hypotheses are:

\[H_0: \mu_1 = \mu_2 = \mu_3\] versus \[H_1:\mu_i \neq\mu_{i'}\] for at least two different cities \(i\) and \(i'\).

Note:This is by the chosen parametrization equivalent to testing

\[H_0: \alpha_2 = \alpha_3 = 0\] versus \[H_1:\alpha_i\neq0\] for at least one \(i \in \{2,3\}\).

For test-level \(\alpha=0.05\) we reject the null-hypothesis if \(F>F_{\alpha, 2, 57}\) or if the p-value is smaller than 0.05. Here we observe a very small p-value and reject the null-hypothesis. We claim that the expected log-chlorine levels differ between at least two of the cities, and from the observed means we know that cities 1 and 3 are significantly different since they have the largest observed difference in means. There may also be a significant difference between cities 1 and 2, and 2 and 3, but this should be checked by a pair-wise contrast.