######################################################################################################
#                                                                                                    #  
# A simple example of Principal Component Analysis on the build in iris dataset.                     #
#                                                                                                    #
# Iris is a dataset that contains observations of three iris varieties                               #
# (Iris setosa, Iris virginica, Iris Versicolor ) recording five features                            #
# for each observaion: Sepal length, Sepal width, Pedal lentgh, Pedal width and                      #
# the species it belongs to (Setosa, Virginica, Versicolor )                                         #
#                                                                                                    #
#  For a picture of the Iris plant and what is recorded in the dataset see:                          #
#  http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2015/04/iris_petal_sepal.png        #
#                                                                                                    #
# The Iris dataset contains a total of 150 observations, 50 from each variety.                       #
#                                                                                                    #
# The aim of the code is to perform Principal Component Analysis on the Iris dataset.                #  
# Please note that this may not make sense for only 5 variables as PCA is mostly used in             #
# situations where the number of variables must be reduces, if these are large (dimensionality).     #
#                                                                                                    #
######################################################################################################

# Let's take a quick look at the Iris dataset, by displaying the first 6 observations in the dataset.
# Note: you can peek at more data by suplying a second argument like this: head(iris, 10) which
# means show the first 10 observations of the iris dataaset.
head(iris)

# By peeking at the dataset we see that it has 5 variables, of which the first 4 looks to be numerical values
# and the last one (Species) nominal i.e. not numerical.

# Let's take a summary of the iris data. This will show some fisrt desctiptive statistics.
summary(iris)


# Since the variable 'Species' has nominal values, we can't use them for PCA. PCA works only
# on numerical values. Hence, get rid of the  column 'Species' and keep just the other ones.
# This can be done using the following command: iris[, 1:4] => This means from the iris dataset
# return  all rows, but only columns 1 until 4 (i.e. exluding column 5 which is the 'Species' variable).
# We store this data in a new variable data pcaData.
# NOTE: We could also do the following: pcaData <- iris[, -5] which means all rows and all columns except 
# column 5.

pcaData<-iris[,1:4] 

# Let's take a look to see if everything is ok
head(pcaData, 12)


# Yep, looks ok. Now, we need to make sure that there are no
# missing values (that are displayed as NA). 
# PCA does not work if there are missing values in any column/variable.
# Let's check this.

# Does variable Sepal.Length have any (at least 1) missing (na) value? 
if (any(is.na(pcaData[,"Sepal.Length"]))) {
  sprintf("Sepal.Length has NA values")
} else
  sprintf("Sepal.Length is ok! ")


# Does variable Sepal.Width have any (at least 1) missing (na) value? 
if (any(is.na(pcaData[,"Sepal.Width"]))) {
  sprintf("Sepal.Width has NA values")
} else
   sprintf("Sepal.Width is ok! ")


# Does variable Petal.Length have any (at least 1) missing (na) value? 
if (any(is.na(pcaData[,"Petal.Length"]))){ 
  sprintf("Petal.Length has NA values")
} else
   sprintf("Petal.Length is ok! ")


# Does variable Petal.Width have any (at least 1) missing (na) value? 
if (any(is.na(pcaData[,"Petal.Width"]))){
  sprintf("Petal.Width has NA values")
} else
   sprintf("Petal.Width is ok! ")


#
# OK, our data (pcaData) is ok. Now ready to execute PCA
#


# R's princomp() function is one way of executing PCA (there are also other functions available e.g. prcomp() ).
# We use princomp() because it can be configured to use the Covariance matrix to calculate Eigenvalues/Eigenvectors
# Please note that princomp() supports also performing PCA by using a Correlation matrix and SVD. This all depends
# on the parameters that you will provide.
# 
# First parameter is our data (pcaData). Since we'll use the Covariance matrix and we signify this by setting 
# the parameter cor equal to FALSE. Set to TRUE, princomp() will calculate the Covariance matrix. However you can 
# provide yourself the Covariance matrix by setting the covmat parameter. 
# If you set parameter cor to TRUE, the Correlation matrix will be used instead.
# Parameter score=TRUE tells princomp to transform each original observation to the new system defined by the principal
# components. Remember that principal components define a new coordinate system and the original data MUST be mapped
# properly onto this new coordinate system.
principalComponents<-princomp(pcaData, cor=FALSE, score=TRUE)

# Ok, done. Now the result of the princomp() function is a new object -that we store in a new variable
# called principalComponents- that has inside all necessary information. This object has attributes that
# you can access. 
# But first, let's see what attributes it has.
attributes(principalComponents)

#You can see attributes such sdev, scores etc. You can display their values.
# Here, we display the calculated scores 
principalComponents$scores

# Or, you can display a summary, which gives you a better overview.
summary(principalComponents)

# You may also plot the principal components to see the variances of each principal component 
# in decreasing order
plot(principalComponents)

# Or you can plot the original data on a coordinate system defined by the two biggest principal 
# components (note the bi- in biplot). Note that eigenvectors are the vertical and horizontal axes. 
# The red vectors you see are pointing in the direction of the variables, as projected
# into the 2-d plane of the biplot.
biplot(principalComponents)