Prediction of cow milk percentage

In the data file “maldidata.rdata” you find four objects:

  • Y : The percentage of cow-milk for 4 replicates of 45 different milk mixtures
  • X : Mass Spectrometry data (MALDI-TOF) for the milk samples. The 6179 variables are a quantification of molecule “size” and “charge” in a sample. For simplicity we may say that the size of the molecules increases from variable 1 to variable 6179. The measurements are then amounts of molecules of different sizes. The method is used to separate proteins, peptides and other ionizable compounds.
  • Ytest : Cow-milk percentages for 45 extra test samples
  • Xtest : The MALDI-TOF values for the test samples.

You may plot the spectra in X one by one, e.g.:

plot(X[1, ], type = "l")   # one row of X is one spectrum

We can also plot several (or all) spectra together, which takes a bit more time:

matplot(t(X[1:5, ]), type = "l", lty = 1, col = "black")   # the first five spectra

The peaks show molecule sizes that are abundant in the sample.

  • First run a PCA on X. How many components are needed to explain 80% and 90% of the variability in X?
pr <- prcomp(X)
summary(pr)$importance[, 1:14]
                          PC1    PC2    PC3    PC4    PC5    PC6     PC7
Standard deviation     12.502 4.4833 3.0218 2.5373 2.0112 1.7430 1.30637
Proportion of Variance  0.648 0.0833 0.0379 0.0267 0.0168 0.0126 0.00708
Cumulative Proportion   0.648 0.7313 0.7691 0.7958 0.8126 0.8252 0.83228
                          PC8     PC9    PC10    PC11    PC12    PC13
Standard deviation     1.0875 1.05133 0.96666 0.91363 0.84363 0.83145
Proportion of Variance 0.0049 0.00458 0.00387 0.00346 0.00295 0.00287
Cumulative Proportion  0.8372 0.84176 0.84564 0.84910 0.85205 0.85491
                          PC14
Standard deviation     0.81638
Proportion of Variance 0.00276
Cumulative Proportion  0.85768

We need 5 components to explain 80% of the variability and 36 to explain 90% (check yourself). Hopefully we do not need all of the information in X to predict Y.
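
One way to check this yourself is to read the cut-offs directly from the cumulative proportions (a small check using the pr object fitted above):

cumvar <- summary(pr)$importance["Cumulative Proportion", ]
min(which(cumvar >= 0.80))   # smallest number of components explaining 80%
min(which(cumvar >= 0.90))   # smallest number of components explaining 90%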

  • Compute the correlations between \(Y\) and the principal components. How can you use this to get an idea of how many components PLS-regression will require to make a good model for cow-milk prediction?
cor(Y, pr$x)
       PC1    PC2    PC3    PC4    PC5     PC6   PC7    PC8      PC9
[1,] 0.528 -0.754 0.0338 -0.225 0.0744 -0.0503 0.015 -0.166 -0.00757
        PC10   PC11    PC12   PC13   PC14    PC15    PC16   PC17    PC18
[1,] -0.0754 -0.137 -0.0432 0.0183 0.0167 -0.0155 -0.0263 0.0222 -0.0187
         PC19    PC20    PC21    PC22    PC23    PC24    PC25    PC26
[1,] -0.00495 -0.0413 0.00649 -0.0187 -0.0442 -0.0069 -0.0411 0.00412
        PC27  PC28    PC29   PC30     PC31   PC32    PC33   PC34     PC35
[1,] -0.0123 0.017 -0.0147 0.0387 0.000618 0.0185 -0.0213 0.0478 -0.00964
        PC36     PC37  PC38   PC39     PC40    PC41   PC42   PC43    PC44
[1,] -0.0106 -0.00503 0.021 -0.011 -0.00874 -0.0191 -0.014 0.0082 -0.0124
       PC45    PC46   PC47    PC48   PC49     PC50     PC51    PC52
[1,] 0.0176 0.00597 0.0141 -0.0193 0.0217 -0.00107 -0.00454 0.00781
        PC53    PC54   PC55     PC56    PC57    PC58     PC59      PC60
[1,] 0.00674 -0.0041 0.0027 -0.00404 -0.0269 0.00521 -0.00471 -0.000586
       PC61    PC62   PC63    PC64     PC65    PC66   PC67    PC68
[1,] 0.0124 -0.0142 0.0125 -0.0181 -0.00766 0.00276 0.0146 -0.0221
         PC69     PC70     PC71     PC72   PC73    PC74    PC75   PC76
[1,] -0.00245 -0.00385 -0.00334 -0.00403 -0.038 -0.0394 0.00702 0.0103
       PC77    PC78    PC79     PC80    PC81     PC82   PC83    PC84
[1,] -0.017 -0.0104 -0.0116 -0.00895 0.00664 -0.00716 0.0086 -0.0205
        PC85  PC86   PC87    PC88   PC89    PC90     PC91    PC92     PC93
[1,] -0.0217 0.018 0.0284 0.00673 0.0166 0.00445 -0.00696 -0.0109 -0.00113
       PC94   PC95    PC96     PC97   PC98     PC99    PC100    PC101
[1,] 0.0244 0.0304 -0.0186 -0.00388 0.0214 -0.00707 -0.00238 0.000283
      PC102    PC103   PC104   PC105  PC106  PC107   PC108   PC109  PC110
[1,] 0.0251 -0.00757 -0.0175 -0.0259 0.0244 -0.026 0.00272 -0.0157 0.0142
       PC111  PC112   PC113  PC114   PC115   PC116 PC117    PC118   PC119
[1,] -0.0115 -0.009 -0.0115 0.0299 0.00924 0.00246 -0.01 -0.00907 -0.0242
       PC120   PC121  PC122   PC123   PC124   PC125   PC126   PC127  PC128
[1,] 0.00516 -0.0066 0.0147 0.00467 -0.0254 -0.0224 -0.0124 0.00816 0.0101
       PC129   PC130    PC131   PC132    PC133   PC134    PC135  PC136
[1,] -0.0142 -0.0119 -0.00018 -0.0108 -0.00172 0.00275 -0.00635 0.0103
       PC137   PC138   PC139    PC140    PC141  PC142    PC143   PC144
[1,] -0.0087 0.00552 0.00742 -0.00805 -0.00327 0.0159 -0.00109 0.00291
        PC145    PC146    PC147    PC148   PC149    PC150  PC151   PC152
[1,] -0.00378 -0.00148 -0.00736 -0.00656 0.00354 -0.00579 0.0034 0.00252
      PC153    PC154   PC155  PC156  PC157    PC158    PC159   PC160
[1,] 0.0015 -0.00918 -0.0111 0.0232 0.0108 -0.00734 -0.00148 0.00503
        PC161    PC162    PC163   PC164    PC165   PC166   PC167   PC168
[1,] -0.00242 -0.00583 -0.00701 0.00558 -0.00344 0.00436 0.00368 0.00529
      PC169   PC170   PC171    PC172    PC173  PC174   PC175     PC176
[1,] -0.015 -0.0092 -0.0134 -0.00933 -0.00255 0.0113 0.00024 -0.000418
       PC177   PC178  PC179  PC180
[1,] -0.0057 -0.0103 0.0218 0.0288

Components 1 and 2 are highly correlated with Y, while components 4 and perhaps 8 show moderate correlation. We therefore expect that around 3 to 4 components will be needed in the PLS regression.
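
Rather than scanning the full printout, we can for instance list the components whose absolute correlation with Y exceeds a threshold (0.15 here; the threshold is an arbitrary choice for illustration):

corY <- drop(cor(Y, pr$x))           # one correlation per principal component
round(corY[abs(corY) > 0.15], 3)     # components most correlated with Y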

  • Fit a PLS-regression model using Y as response and X as predictor (You may simply write Y ~ X as your model call in plsr. Also use ncomp=10 as an extra argument to fit only 1 to 10 components). Use the scoreplot() function to make a score plot. Check the help file for this function to see how you can choose the component numbers to plot, and how you can put labels on your observations. Plot component 1 against 2 and put observation numbers 1:180 as labels. If the noise level of the measurements is low, the replicates should group in clusters of size four (obs 1, 2, 3 and 4 are from the same mixture, and so on). Do you see any tendency to this?
library(pls)                        # plsr() and scoreplot() come from the pls package
plsmod <- plsr(Y ~ X, ncomp = 10)
scoreplot(plsmod, comps = c(1, 2), labels = 1:180, cex = 0.7)

We observe that there are clusters of observations and a tendency for observations with successive numbers to lie close together in the score plot. Cluster sizes are, however, difficult to judge from the plot alone.
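
To see the replicate structure a bit more clearly we can colour the scores by mixture. The grouping below (observations 1-4 = mixture 1, 5-8 = mixture 2, and so on) is taken from the description of Y; note that the default colour palette is short, so colours are reused across mixtures:

mixture <- rep(1:45, each = 4)   # replicate group of each observation
plot(plsmod$scores[, 1], plsmod$scores[, 2], col = mixture, pch = 19,
     xlab = "Comp 1", ylab = "Comp 2")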

  • Perform hierarchical clustering of the samples using the 3 first PLS-component scores as input variables. Try both the “complete” and the “average” agglomeration method and make a dendrogram. Are the replicate clusters more apparent in the dendrogram than in the score plot? Are there any samples that are very different from all others?
clust1 <- hclust(dist(plsmod$scores[,1:3]), method = "complete")
plot(clust1, cex = 0.5)

From the dendrogram we observe several clusters of size four, and some of size three, with successive observation numbers. This implies that the replicates are more similar to each other than to samples from other mixtures. Sample 159 seems to be an outlier, very different from all the others.
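
The “average” agglomeration method asked for in the exercise is obtained the same way; only the method argument changes:

clust1b <- hclust(dist(plsmod$scores[,1:3]), method = "average")
plot(clust1b, cex = 0.5)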

  • Use K-means clustering with K=45 on the three first PLS-component scores. How do the replicates cluster?
# Note: kmeans() starts from random centers, so cluster labels (and some assignments) vary between runs
clust2 <- kmeans(plsmod$scores[, 1:3], centers = 45)
clust2$cluster
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
 39  39  39  39  13  13  13  44  20  20  20  20  31  31  31  31  38  15 
 19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36 
 38  15  13   1   1  40  25  25  14  17   2   2   2   2  22  32  22  32 
 37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54 
 25  17  25  25  44  23   9  34   3   3   3   3  11  11  19  23  17  24 
 55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72 
 24  24  30  30  14  14  40  23  40  39   9   9   9   9   1  30   1  13 
 73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90 
 12  45  45  45  44  44  33  22   5  13  12  35  34  34  34  23  14  17 
 91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 
 17  17   6   6  28  28  42  42  42  42  22  32  43  43   7   7   7   7 
109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 
 38  38  38  38   4   4   4   4  37  37  37  37  41   8   8  41  29  29 
127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 
 29  26  25  16  16  16  25  41  25  25  35   5   5  35  42  42  42  42 
145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 
 21  21  21  21  36  36  36  36  18  18  18  18  33   6  27  28   8   8 
163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 
  8  30  14  41  41  16  12  45  45  35  10  30  10  10  15  15   3  15 

Many replicates fall into the same clusters, but some clusters contain observations from more than one “true” replicate group. E.g. cluster number 42 contains both observations 97, 98, 99, 100 and observations 141, 142, 143, 144, i.e. two true replicate groups merged into one cluster (the cluster numbers will differ between runs because of the random starting centers). We conclude that the replicates are fairly similar, but there is some noise in the MALDI-TOF data which makes similar milk mixtures hard to distinguish.
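
A compact way to check how well the replicates stay together is to count, for each true mixture, how many different k-means clusters its four replicates were assigned to (the mixture grouping is again taken from the description of Y):

mixture <- rep(1:45, each = 4)
split.count <- tapply(clust2$cluster, mixture, function(x) length(unique(x)))
table(split.count)   # number of mixtures spread over 1, 2, ... different clusters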

  • We will use cross-validation to estimate the expected prediction error of PLSR and to choose the appropriate number of components. Instead of using Leave-One-Out Cross-validation we will exclude all four replicates in a “Leave-Four-Out” type of Cross-validation. Why is this smart?

Since the replicates come from the same mixture, they are not independent. With Leave-One-Out CV the three replicates that remain in the training set make the model “too good” at predicting the left-out replicate, and we would under-estimate the prediction error for new samples.

  • Refit the PLS model from exercise c., but add the arguments validation="CV", segments=45 and segment.type="consecutive" in the plsr model call. This sets up the “Leave-Four-Out” CV to be performed.
plsmod <- plsr(Y ~ X, ncomp = 10, validation = "CV", 
               segments = 45, segment.type = "consecutive")
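
With 180 observations, segments=45 and segment.type="consecutive", each CV segment consists of four consecutive observations, i.e. exactly the replicates of one mixture. A small sketch of the resulting partition (just illustrating the grouping, not calling any plsr internals):

seg <- split(1:180, rep(1:45, each = 4))   # 45 consecutive segments of four observations
seg[1:2]                                   # segment 1 = obs 1-4, segment 2 = obs 5-8
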
  • The sum of squared prediction errors (PRESS) for different numbers of components is given in the validation$PRESS element of the fitted pls model. The null-model PRESS (prediction using the mean response) is given in the validation$PRESS0 element. You find the MSEP values (Mean Squared Error of Prediction) by dividing the PRESS by the number of observations (N=180). Make a prediction error plot of MSEP for 0 to 10 components. How many components do you think give satisfactory prediction of cow-milk content in new mixtures? Remember that simple models are often better than complex ones.
MSEP <- c(plsmod$validation$PRESS0, plsmod$validation$PRESS)/180   # MSEP for 0 to 10 components
plot(0:10, MSEP, type = "b", col = 2, lwd = 2,
     xlab = "Number of components", ylab = "MSEP")

The prediction error drops sharply when components 1 and 2 are introduced, and there is only a small additional gain from components 3, 4 and 5. Simple models are usually more robust, so I would not go any further than 5 components here.
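
If you want a more formal check than eyeballing the curve, you can locate the minimum of the CV error and compare it with a parsimony rule such as the “one-sigma” heuristic offered by newer versions of the pls package (the formal choice may of course differ from the visual one):

which.min(MSEP) - 1                                     # component count minimising the CV error
selectNcomp(plsmod, method = "onesigma", plot = TRUE)   # one-sigma rule from the pls package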

  • Predict the cow-milk content of the test samples using Xtest as newdata in the predict function, and use the number of components you found best in the previous exercise. Save the predictions into an object called pred. (See ?predict.mvr for the help file for the predict method for pls objects.) The predict function returns an array of dimension [45,1,1] of predictions. You can extract all predictions by pred[,1,1].
pred <- predict(plsmod, newdata = Xtest, ncomp = 5)

Copy the following code into R and execute:

MSEPfunc <- function(y, yhat){
    mean((y - yhat) ^ 2)
}
  • Use MSEPfunc() to compute the MSEP-value for the test-predictions using Ytest and the predicted values as inputs. Did the cross-validation under-estimate the prediction error?
MSEPfunc(Ytest, pred[,1,1])
[1] 0.00692

The MSEP-value is slightly larger than the value found for five components using Cross-validation, but the order of magnitude is the same.
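
To make the comparison explicit we can print the cross-validated estimate for five components next to the test-set value, and plot predicted against observed percentages:

MSEP[6]                       # CV estimate for 5 components (element 1 is the 0-component model)
MSEPfunc(Ytest, pred[,1,1])   # test-set estimate
plot(Ytest, pred[,1,1], xlab = "Observed cow-milk %", ylab = "Predicted cow-milk %")
abline(0, 1)                  # line of perfect prediction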

  • EXTRA for those interested. Redo the exercises g. and h. with Leave-One-Out CV. (See also lecture notes Lesson 7.) Compare the MSEP-values you find with the previous CV-routine.
plsmod2 <- plsr(Y ~ X, ncomp = 10, validation = "LOO")
MSEP2 <- c(plsmod2$validation$PRESS0, plsmod2$validation$PRESS)/180
plot(0:10, MSEP, type = "b", col = 2, lwd = 2,
     xlab = "Number of components", ylab = "MSEP")
points(0:10, MSEP2, type = "b", col = 4, lwd = 2)
legend("topright", c("Leave-four-out CV", "LOO CV"), col = c(2, 4), lwd = 2)

We see that LOO-CV underestimates the prediction error, as commented in exercise f.
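
The difference is easiest to see by putting the two sets of CV estimates side by side, here for 0 to 5 components:

round(cbind(LeaveFourOut = MSEP, LOO = MSEP2)[1:6, ], 5)   # rows correspond to 0-5 components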