###################################################################################################### # # # A simple example of Principal Component Analysis on the build in iris dataset. # # # # Iris is a dataset that contains observations of three iris varieties # # (Iris setosa, Iris virginica, Iris Versicolor ) recording five features # # for each observaion: Sepal length, Sepal width, Pedal lentgh, Pedal width and # # the species it belongs to (Setosa, Virginica, Versicolor ) # # # # For a picture of the Iris plant and what is recorded in the dataset see: # # http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2015/04/iris_petal_sepal.png # # # # The Iris dataset contains a total of 150 observations, 50 from each variety. # # # # The aim of the code is to perform Principal Component Analysis on the Iris dataset. # # Please note that this may not make sense for only 5 variables as PCA is mostly used in # # situations where the number of variables must be reduces, if these are large (dimensionality). # # # ###################################################################################################### # Let's take a quick look at the Iris dataset, by displaying the first 6 observations in the dataset. # Note: you can peek at more data by suplying a second argument like this: head(iris, 10) which # means show the first 10 observations of the iris dataaset. head(iris) # By peeking at the dataset we see that it has 5 variables, of which the first 4 looks to be numerical values # and the last one (Species) nominal i.e. not numerical. # Let's take a summary of the iris data. This will show some fisrt desctiptive statistics. summary(iris) # Since the variable 'Species' has nominal values, we can't use them for PCA. PCA works only # on numerical values. Hence, get rid of the column 'Species' and keep just the other ones. # This can be done using the following command: iris[, 1:4] => This means from the iris dataset # return all rows, but only columns 1 until 4 (i.e. exluding column 5 which is the 'Species' variable). # We store this data in a new variable data pcaData. # NOTE: We could also do the following: pcaData <- iris[, -5] which means all rows and all columns except # column 5. pcaData<-iris[,1:4] # Let's take a look to see if everything is ok head(pcaData, 12) # Yep, looks ok. Now, we need to make sure that there are no # missing values (that are displayed as NA). # PCA does not work if there are missing values in any column/variable. # Let's check this. # Does variable Sepal.Length have any (at least 1) missing (na) value? if (any(is.na(pcaData[,"Sepal.Length"]))) { sprintf("Sepal.Length has NA values") } else sprintf("Sepal.Length is ok! ") # Does variable Sepal.Width have any (at least 1) missing (na) value? if (any(is.na(pcaData[,"Sepal.Width"]))) { sprintf("Sepal.Width has NA values") } else sprintf("Sepal.Width is ok! ") # Does variable Petal.Length have any (at least 1) missing (na) value? if (any(is.na(pcaData[,"Petal.Length"]))){ sprintf("Petal.Length has NA values") } else sprintf("Petal.Length is ok! ") # Does variable Petal.Width have any (at least 1) missing (na) value? if (any(is.na(pcaData[,"Petal.Width"]))){ sprintf("Petal.Width has NA values") } else sprintf("Petal.Width is ok! ") # # OK, our data (pcaData) is ok. Now ready to execute PCA # # R's princomp() function is one way of executing PCA (there are also other functions available e.g. prcomp() ). # We use princomp() because it can be configured to use the Covariance matrix to calculate Eigenvalues/Eigenvectors # Please note that princomp() supports also performing PCA by using a Correlation matrix and SVD. This all depends # on the parameters that you will provide. # # First parameter is our data (pcaData). Since we'll use the Covariance matrix and we signify this by setting # the parameter cor equal to FALSE. Set to TRUE, princomp() will calculate the Covariance matrix. However you can # provide yourself the Covariance matrix by setting the covmat parameter. # If you set parameter cor to TRUE, the Correlation matrix will be used instead. # Parameter score=TRUE tells princomp to transform each original observation to the new system defined by the principal # components. Remember that principal components define a new coordinate system and the original data MUST be mapped # properly onto this new coordinate system. principalComponents<-princomp(pcaData, cor=FALSE, score=TRUE) # Ok, done. Now the result of the princomp() function is a new object -that we store in a new variable # called principalComponents- that has inside all necessary information. This object has attributes that # you can access. # But first, let's see what attributes it has. attributes(principalComponents) #You can see attributes such sdev, scores etc. You can display their values. # Here, we display the calculated scores principalComponents$scores # Or, you can display a summary, which gives you a better overview. summary(principalComponents) # You may also plot the principal components to see the variances of each principal component # in decreasing order plot(principalComponents) # Or you can plot the original data on a coordinate system defined by the two biggest principal # components (note the bi- in biplot). Note that eigenvectors are the vertical and horizontal axes. # The red vectors you see are pointing in the direction of the variables, as projected # into the 2-d plane of the biplot. biplot(principalComponents)