Iris Dataset

    a. Consider the famous iris data set iris.train.rdata as introduced in lesson 8. Reproduce the pairs plot for the four sepal and petal variables as given in lesson 8, slide 4. Which variable appears to discriminate the species best, and which worst?

The two species are best separated along Petal.Length; Sepal.Width looks like the worst discriminator.

    b. Explain the difference between “discrimination” and “classification”.

Discrimination corresponds to the model-fitting process in statistical inference: we seek variables that discriminate well between the classes. Classification corresponds to the actual prediction for new samples, that is, allocating new samples to classes using the presumably best classifier estimated from the training data.

    c. Explain what is meant by the assumption “We assume a priori that versicolor and virginica are equally likely”.

This means that both species are equally probable if you sample a random plant from the population.
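Formally, the prior enters the classifier through Bayes' rule; a sketch of how the equal-prior assumption simplifies the posterior comparison:

```latex
P(\text{class } k \mid \mathbf{x})
  \;=\; \frac{\pi_k \, f_k(\mathbf{x})}{\sum_j \pi_j \, f_j(\mathbf{x})},
\qquad
\pi_{\text{versicolor}} = \pi_{\text{virginica}} = 0.5 .
```

With equal priors the \(\pi_k\) cancel, so a new sample is simply allocated to the class whose estimated density \(f_k(\mathbf{x})\) is largest.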

    d. Fit an LDA model to the iris data using Sepal.Length as the predictor. Assume equal prior probabilities for both species. Use the print()-function on your fitted model. What are the sample means of each species for this predictor variable?
library(MASS)  # lda() is in the MASS package
mod1.lda <- lda(Species ~ Sepal.Length, data = iris.train, prior = c(0.5, 0.5))
mod1.lda
Call:
lda(Species ~ Sepal.Length, data = iris.train, prior = c(0.5, 
    0.5))

Prior probabilities of groups:
versicolor  virginica 
       0.5        0.5 

Group means:
           Sepal.Length
versicolor         5.89
virginica          6.59

Coefficients of linear discriminants:
              LD1
Sepal.Length 1.89

The sample means are reported in the print output under “Group means”: 5.89 for versicolor and 6.59 for virginica.
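The group means are nothing more than the per-species sample means of the predictor. As a sketch, they can be verified directly with tapply() — here using the built-in iris data restricted to the two species as a stand-in for iris.train, so the exact numbers differ slightly from the slides:

```r
# Stand-in for iris.train: built-in iris restricted to the two species.
dat <- droplevels(subset(iris, Species %in% c("versicolor", "virginica")))

# Per-species sample means of the predictor -- these are exactly the
# "Group means" that print() reports for an lda() fit on the same data.
group.means <- tapply(dat$Sepal.Length, dat$Species, mean)
round(group.means, 2)  # versicolor 5.94, virginica 6.59 on the full iris data
```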

    e. Source in the CV.class.R file (open the file in the script window and press the “Source” button to the upper right). Look at the CV.class.examples.R file for reference. Perform a Leave-One-Out Cross-Validation of the model you fitted in the previous exercise. Report the confusion matrix, the accuracy and the cross-validated error rate.
cvres1 <- CV.class(mod1.lda, data = iris.train)
            True
Predicted    versicolor virginica
  versicolor         32        11
  virginica           8        29
  Total              40        40
  Correct            32        29

Proportions correct
versicolor  virginica 
     0.800      0.725 

N correct/N total = 61/80 = 0.762

The confusion matrix is given in the first part of the output: 32 out of 40 versicolor are correctly classified, and 29 out of 40 virginica. In total 61 out of 80 are correct, giving an accuracy of 61/80 = 0.7625 as reported. The cross-validated error rate is 1 − accuracy = 1 − 0.7625 = 0.2375. We typically seek classifiers that minimize this error rate.
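If CV.class is not at hand, the same Leave-One-Out scheme can be sketched with MASS's own lda(), which performs LOOCV when CV = TRUE — again using the built-in iris data as a stand-in for iris.train, so the counts differ slightly from the output above:

```r
library(MASS)  # lda() with CV = TRUE performs Leave-One-Out CV

# Stand-in for iris.train: built-in iris restricted to the two species.
dat <- droplevels(subset(iris, Species %in% c("versicolor", "virginica")))

# CV = TRUE returns, for each observation, the class predicted by the
# model fitted on the remaining n - 1 observations.
fit <- lda(Species ~ Sepal.Length, data = dat,
           prior = c(0.5, 0.5), CV = TRUE)

conf <- table(Predicted = fit$class, True = dat$Species)
acc  <- sum(diag(conf)) / sum(conf)   # accuracy
err  <- 1 - acc                       # cross-validated error rate
conf
c(accuracy = acc, error = err)
```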

    f. Use the scheme from exercises d. and e. to identify a good classifier for iris species. You may use either lda or qda and you may use one or several predictors. Report the cross-validated error for your “best choice”.

Through some trial and error, and by looking at the pairs-plot from exercise a, my personal choice is the following model:

mymod <- qda(Species ~ Petal.Length + Petal.Width, data = iris.train, prior = c(0.5, 0.5))
cvres2 <- CV.class(mymod, data = iris.train)
            True
Predicted    versicolor virginica
  versicolor         38         2
  virginica           2        38
  Total              40        40
  Correct            38        38

Proportions correct
versicolor  virginica 
      0.95       0.95 

N correct/N total = 76/80 = 0.95

This gives an accuracy of 0.95 and, hence, a cross-validated error of 0.05.
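The trial-and-error search can also be made systematic by looping over candidate formulas and comparing their LOOCV errors. A sketch using MASS's built-in CV = TRUE (the candidate list is my own illustration, and built-in iris again stands in for iris.train):

```r
library(MASS)  # qda() with CV = TRUE performs Leave-One-Out CV

dat <- droplevels(subset(iris, Species %in% c("versicolor", "virginica")))

# A few illustrative candidate models (not an exhaustive search).
candidates <- list(
  Species ~ Sepal.Length,
  Species ~ Petal.Length,
  Species ~ Petal.Length + Petal.Width
)

# LOOCV error rate of a QDA model for each candidate formula.
cv.err <- sapply(candidates, function(f) {
  fit <- qda(f, data = dat, prior = c(0.5, 0.5), CV = TRUE)
  mean(fit$class != dat$Species)
})
names(cv.err) <- sapply(candidates, function(f) deparse(f[[3]]))
sort(cv.err)  # smallest cross-validated error first
```

The same loop works with lda() in place of qda() for comparing the two model families on equal footing.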

    g. What is the model assumption difference between an LDA and a QDA model?

In LDA we assume that all classes share a common covariance matrix, whereas in QDA each class has its own covariance matrix. This makes the LDA decision boundary linear in the predictors, while the QDA boundary is quadratic.
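In distributional terms, both models assume multivariate normal classes and differ only in the covariance assumption:

```latex
\text{LDA: } \mathbf{x} \mid \text{class } k \;\sim\; N(\boldsymbol{\mu}_k, \boldsymbol{\Sigma})
\qquad \text{(one common covariance matrix } \boldsymbol{\Sigma}\text{)}

\text{QDA: } \mathbf{x} \mid \text{class } k \;\sim\; N(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
\qquad \text{(a separate } \boldsymbol{\Sigma}_k \text{ per class)}
```

With a common \(\boldsymbol{\Sigma}\) the quadratic terms cancel when two classes are compared, leaving a linear discriminant; class-specific \(\boldsymbol{\Sigma}_k\) leave the quadratic term in, which is what gives QDA its curved boundaries.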

    h. Use the model of your choice to predict the samples in iris.test.rdata. Use the confusion()-function in the mixlm package to evaluate the performance of your classifier.
pred <- predict(mymod, newdata = iris.test)
confusion(iris.test$Species, pred$class)
            True
Predicted    versicolor virginica
  versicolor         10         0
  virginica           0        10
  Total              10        10
  Correct            10        10

Proportions correct
versicolor  virginica 
         1          1 

N correct/N total = 20/20 = 1

The model chosen in exercise f. classifies the test set perfectly: accuracy = 1.0 and error = 0.0.