Chlorine levels in cities

Independent measurements of chlorine (ppm parts per million) were taken from 3 large cities:

city

   City1 City2 City3
1   0.46  1.21  0.44
2   0.92 10.86  0.71
3   0.86  2.09  2.87
4   0.33  6.76  0.81
5   0.53  3.20  7.34
6   5.13  2.02  2.21
7   1.51  2.16  9.00
8   1.00  0.44 17.20
9   0.70  7.80  2.76
10  1.89  1.20  1.07
11  0.68  1.39  2.07
12  0.71  2.42  1.70
13  0.41  1.62  7.89
14  0.78 10.78  0.79
15  0.11  3.25  0.24
16  2.70  0.41  4.44
17  0.41  0.52 15.40
18  0.75  3.71  4.21
19  0.81  5.20  8.63
20  0.35 13.89  3.90

Load the city.rdata which is available on Canvas.

Before analyzing the data it is best to “stack” the data into two columns, a response column y and a city factor column city. By using the stack() function in R you can restructure the city data by

citydata <- stack(city)

Use the colnames() function to rename the variables as y and city.

colnames(citydata) <- c("y", "city")

Assume an ANOVA-model with chlorine as response and city as a factor. What assumptions do you make?

The model assumptions are:

\[y_{ij} = \mu_i + \epsilon_{ij}\]

where the error terms are assumed independent and identically distributed \(\epsilon_{ij} \sim N(0, \sigma^2)\). Further \(\mu_i\) is the expected chlorine level in city \(i\).

Make a box-plot of the data. Describe the variabilities between cities and within cities.

plot(y ~ city, data = citydata)

City 1 seems to have both the lowest average chlorine levels and the smallest variability in the measurements. Cities 2 and 3 are quite similar with both higher means and variability.

Fit the model and perform a residual analysis. Comment on the model fit.

options(contrasts=c("contr.treatment", "contr.poly"))
citymod1 <- lm(y ~ city, data = citydata)
par(mfrow = c(1, 2))
plot(citymod1, which = c(1, 2))

The residual analysis reveals two problems: 1) Non-constant variability, which was also observed in the boxplot, and 2) A skeewed and non-normal distribution for the residuals with a heavy tail to the right.

Make a log-transformation of \(y\) to create the variable logy, and add the new variable to your data set by:

citydata$logy <- log(citydata$y)

Fit a new model with logy as response and check the model fit again.

citymod2 <- lm(logy~city, data = citydata)
par(mfrow = c(1,2))
plot(citymod2, which = c(1,2))

Both problems from the previous model appear to be corrected. We continue with this model.

Estimate all model parameters from the latter model. Explain what the parameters measure given the parameterization of your choice (See lecture notes for parameterization).

The ANOVA analysis using contr.treatment parametrization:

summary(citymod2)


Call:
lm(formula = logy ~ city, data = citydata)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3987 -0.5954 -0.0175  0.7122  1.9326 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -0.297      0.231   -1.29  0.20266    
city(City2)    1.226      0.326    3.76  0.00041 ***
city(City3)    1.269      0.326    3.89  0.00027 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

s: 1.03 on 57 degrees of freedom
Multiple R-squared: 0.255,
Adjusted R-squared: 0.229 
F-statistic: 9.75 on 2 and 57 DF,  p-value: 0.000228

Anova(citymod2)

Anova Table (Type II tests)

Response: logy
          Sum Sq Df F value  Pr(>F)    
city        20.8  2    9.75 0.00023 ***
Residuals   60.7 57                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

coefs <- citymod2$coefficients

We observe that \(R^2 = 0.25\) indicating that there is still a lot of unexplained variability in the response. We get the following estimates for the city means of log(chlorine)-levels using this parametrization (city 1 is reference city):

\(\mu_1 = \mu =\) -0.3

\(\mu_2 = \mu + \alpha_2 =\) 0.93

\(\mu_3 = \mu + \alpha_3 =\) 0.97

The estimate of the error variance is \(\hat{\sigma}^2 = MSE= SSE/(N-3) = 60.724/57 = 1.065\).

As we see from above, the “intercept” \(\mu\) is the expected log-chlorine level for city 1. The \(\alpha_2\) is the difference between city 1 and city 2, and \(\alpha_3\) is the difference between city 1 and city 3. The \(\sigma^2\) is the variance for log-chlorine levels measured in the same city.

State the hypotheses for testing whether the expected chlorine levels differ between the cities, choose a test level \(\alpha\) and perform the test. What is the conclusion?

The hypotheses are:

\[H_0: \mu_1 = \mu_2 = \mu_3\] versus \[H_1:\mu_i \neq\mu_{i'}\] for at least two different cities \(i\) and \(i'\).

Note:This is by the chosen parametrization equivalent to testing

\[H_0: \alpha_2 = \alpha_3 = 0\] versus \[H_1:\alpha_i\neq0\] for at least one \(i \in \{2,3\}\).

For test-level \(\alpha=0.05\) we reject the null-hypothesis if \(F>F_{\alpha, 2, 57}\) or if the p-value is smaller than 0.05. Here we observe a very small p-value and reject the null-hypothesis. We claim that the expected log-chlorine levels differ between at least two of the cities, and from the observed means we know that cities 1 and 3 are significantly different since they have the largest observed difference in means. There may also be a significant difference between cities 1 and 2, and 2 and 3, but this should be checked by a pair-wise contrast.