1. Introduction
Wheat is a grass widely
cultivated for its seed, a cereal grain which is a worldwide staple food. Wheat
is grown on more land area than any other food crop (220.4 million hectares,
2014). World trade in wheat is greater than for all other crops combined. In
2016, world production of wheat was 749 million tonnes, making it the second
most-produced cereal after maize.
Since 1960, world production of
wheat and other grain crops has tripled and is expected to grow further through
the middle of the 21st century. Global demand for wheat is increasing due to
the unique viscoelastic and adhesive properties of gluten proteins, which
facilitate the production of processed foods, whose consumption is increasing
as a result of the worldwide industrialization process and the westernization
of the diet.
This study describes two unsupervised learning approaches, dimension reduction and clustering, applied to a real-life dataset. Dimension reduction is used to obtain fast and efficient results in the clustering analysis. In this project, principal component analysis (PCA) and the K-means algorithm are used for dimension reduction and clustering analysis, respectively.
In this article, I will demonstrate an unsupervised learning analysis using the Wheat Seeds dataset from the UCI Machine Learning Repository. The analysis includes clustering with the K-means algorithm and dimensionality reduction with principal component analysis (PCA).
Through this analysis, I want
to evaluate the possibility of performing clustering to produce a new label for
the dataset and the possibility of dimensionality reduction using PCA.
Additionally, I will also analyze the pattern of the data to obtain insights by
combining PCA and clustering.
2. Data Description and Preparation
The dataset's variables are specific features of wheat seeds for three wheat types: Kama, Rosa and Canadian. The type of each wheat seed is recorded in the seedtype variable. Moreover, 70 observations exist for every type of wheat seed.
Each variable is represented as follows:
• Seedtype: Represents the Kama, Rosa and Canadian types as 1, 2 and 3.
• Area (A): Area of each wheat seed.
• Perimeter (P): Perimeter of each wheat seed.
• Compactness (C): Compactness of each wheat seed, calculated as C = 4 * pi * A / P^2.
• Lengthofkernel: Length of the kernel.
• Widthofkernel: Width of the kernel.
• Asymmetrycoefficient: Asymmetry coefficient.
• Lengthofkernelgroove: Length of the kernel groove.
Data Set Characteristics: Multivariate
The dataset is imported into R as seeds using the function shown below:
seeds <- read.csv(file.choose())
head(seeds)
##        A     P      C    LK    WK A_Coef   LKG target
## 1 15.26 14.84 0.8710 5.763 3.312  2.221 5.220      0
## 2 14.88 14.57 0.8811 5.554 3.333  1.018 4.956      0
## 3 14.29 14.09 0.9050 5.291 3.337  2.699 4.825      0
## 4 13.84 13.94 0.8955 5.324 3.379  2.259 4.805      0
## 5 16.14 14.99 0.9034 5.658 3.562  1.355 5.175      0
## 6 14.38 14.21 0.8951 5.386 3.312  2.462 4.956      0
In the data, seven geometric parameters of wheat kernels were measured:
• A: area
• P: perimeter
• C: compactness, C = 4 * pi * A / P^2
• LK: length of kernel
• WK: width of kernel
• A_Coef: asymmetry coefficient
• LKG: length of kernel groove
• target: kernel type
Let us rename the columns so that they are easier to understand:
names(seeds)[names(seeds) == "A"] <- "area"
names(seeds)[names(seeds) == "P"] <- "perimeter"
names(seeds)[names(seeds) == "C"] <- "compactness"
names(seeds)[names(seeds) == "LK"] <- "lengthofkernel"
names(seeds)[names(seeds) == "WK"] <- "widthofkernel"
names(seeds)[names(seeds) == "A_Coef"] <- "asymmetrycoefficient"
names(seeds)[names(seeds) == "LKG"] <- "lengthofkernelgroove"
2.1. Data Cleaning
2.1.1. Missing Values
We are checking if there are any missing values in any
of the variables.
colSums(is.na(seeds))
##                 area            perimeter          compactness
##                    0                    0                    0
##       lengthofkernel        widthofkernel asymmetrycoefficient
##                    0                    0                    0
## lengthofkernelgroove               target
##                    0                    0
print(paste(sum(complete.cases(seeds)),"Complete cases!"))
## [1] "210 Complete cases!"
any(is.null(seeds))
## [1] FALSE
We see that there are no
missing values in any of the variables and there are 210 complete cases. Also,
there are no invalid or spurious data in the dataset.
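To back up this claim about invalid or spurious values, we can also look at the summary statistics of the numeric variables (a minimal sketch; no formal validation rules are assumed here):
# range check: all measurements should be positive and of plausible magnitude
summary(seeds[, 1:7])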
2.1.2. Duplicate Records
We are checking if there are duplicate records in the dataset, since such records would be redundant and would need to be removed.
dim(unique(seeds))[1]
## [1] 210
There are no duplicate records in the dataset.
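Equivalently, duplicated() gives a quick cross-check (a minimal sketch):
# count exact duplicate rows; 0 means every row is unique
sum(duplicated(seeds))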
2.1.3. Dimensionality Reduction
First, the relationships between the variables are analysed. According to the figure below, most variables have a linear relationship and a positive correlation with one another. In particular, there is an almost perfectly linear relationship between the area and the perimeter.
Here we explore the distribution of each numeric variable using density plots and the correlation between variables using scatterplots, both provided by the ggpairs function from the GGally package.
library(GGally)  # provides ggpairs (and loads ggplot2)

ggpairs(seeds[, c(1:7)], showStrips = F) +
  theme(axis.text = element_text(colour = "black", size = 11),
        strip.background = element_rect(fill = "#d63d2d"),
        strip.text = element_text(colour = "white", size = 12, face = "bold"))
It can be seen that there is a strong correlation between some variables in the data, including area, perimeter, length, and width. This indicates that the dataset suffers from multicollinearity and might not be suitable for classification algorithms that assume the absence of multicollinearity.
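To make this concrete, we can list the variable pairs whose absolute correlation exceeds a chosen threshold (a small sketch; the 0.9 cut-off is just an illustrative choice):
# list variable pairs with |correlation| above 0.9 (illustrative threshold)
cm <- cor(seeds[, 1:7])
idx <- which(abs(cm) > 0.9 & upper.tri(cm), arr.ind = TRUE)
data.frame(var1 = rownames(cm)[idx[, 1]],
           var2 = colnames(cm)[idx[, 2]],
           r    = round(cm[idx], 3))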
Because there is a fairly distinct separation of values between the kernel types, this data might be suitable for clustering using K-means. Although the data already contains labels for the kernel type, clustering will still be used to test whether the geometrical properties of the kernels can produce clusters that resemble the kernel types. If not, can we produce new clusters based on those geometrical properties? Additionally, visualization and profiling of the clustering result will be performed to gain valuable insights.
Principal Component Analysis can be performed on this data to remove the multicollinearity, while also reducing the dimension of the data and retaining as much information as possible. The result of this analysis can be utilized further for classification purposes with lower computational cost.
2.1.4. Correlation Between Variables
library(corrplot)

corrplot(cor(seeds[,1:7]), type = "lower", method = "number")
• Area and Perimeter are highly positively correlated.
• Area and Width are highly positively correlated, as are Length and Perimeter.
• Length and Area, Width and Perimeter, and Length and Length of groove are also all highly positively correlated.
3. Principal Component Analysis (PCA)
In this step, PCA is performed on the scaled dataset. After scaling, the variables become more comparable.
Through PCA, I can retain a few informative principal components (high in cumulative variance) from the kernels dataset to perform dimensionality reduction. By doing this, I can reduce the dimension of the dataset while retaining as much information as possible.
In this study, I want to retain at least 90% of the information in the data. From the PCA summary below, I picked PC1-PC3 out of a total of 7 PCs. By doing this, I was able to reduce the dimension of the original data by about 57% (from 7 variables to 3 components) while retaining 98.7% of the information in the data.
We can extract the values of PC1-PC3 for all of the observations and put them into a new data frame; a short sketch of this is shown after the PCA summary below. This data frame can later be analyzed with supervised learning classification techniques or used for other purposes.
mypr <- prcomp(seeds[,1:7], center = T, scale = T)
summary(mypr)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.2430 1.0943 0.82341 0.26147 0.13680 0.07302 0.02850
## Proportion of Variance 0.7187 0.1711 0.09686 0.00977 0.00267 0.00076 0.00012
## Cumulative Proportion  0.7187 0.8898 0.98668 0.99645 0.99912 0.99988 1.00000
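The choice of three components and the extraction mentioned above can be written out explicitly (a minimal sketch; the data frame name seeds_pc is just an illustrative choice):
# pick the smallest number of PCs whose cumulative variance reaches 90%
cum_var <- summary(mypr)$importance["Cumulative Proportion", ]
n_pc <- which(cum_var >= 0.90)[1]  # 3 for this data
# put the retained principal component scores into a new data frame
seeds_pc <- data.frame(mypr$x[, 1:n_pc], target = seeds$target)
head(seeds_pc)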
As can be seen in the scree (elbow) plot below, the first three components explain 98.7% of the variance, well above the roughly 85% of explained variance that is usually considered sufficient for a PCA-based reduction.
library(factoextra)

fviz_eig(mypr, addlabels = TRUE, ylim = c(0, 75))
We can also inspect the principal component analysis results for the variables:
var <- get_pca_var(mypr)
var
## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates for the variables"
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"
## 4 "$contrib" "contributions of the variables"
Cos2 (squared cosine) measures how well each variable is represented on the factor map:
head(var$cos2, 4)
##                    Dim.1        Dim.2        Dim.3       Dim.4        Dim.5
## area           0.9939475 0.0008450341 0.0004537914 0.002563424 0.0007819319
## perimeter      0.9810106 0.0084506414 0.0024277403 0.005967849 0.0005683714
## compactness    0.3860875 0.3353216539 0.2688363174 0.007572511 0.0020708334
## lengthofkernel 0.9026272 0.0508079572 0.0304376025 0.004743335 0.0109831433
##                       Dim.6        Dim.7
## area           0.0009696239 4.386450e-04
## perimeter      0.0012093247 3.655035e-04
## compactness    0.0001069541 4.276371e-06
## lengthofkernel 0.0003990721 1.739726e-06
fviz_pca_biplot(mypr,
                # fill individuals by groups
                geom.ind = as.factor("point"),
                addEllipses = TRUE, label = "var",
                pointshape = 21,
                pointsize = 2.5,
                fill.ind = as.factor(seeds$target),
                col.ind = as.factor("black"),
                # color variable arrows by variable name
                col.var = factor(c("area", "perimeter", "compactness",
                                   "lengthofkernel", "widthofkernel",
                                   "asymmetrycoefficient", "lengthofkernelgroove")),
                legend.title = list(fill = "Type of seed", color = "Variables"),
                repel = TRUE  # avoid label overplotting
                ) +
  ggpubr::fill_palette("jco") +  # individual fill colors
  ggpubr::color_palette("npg")   # variable colors
A biplot is created to display the result of the PCA. Each variable that went into the PCA has an associated arrow, and the arrows point in the direction of increasing values of that variable. The PCA scores of the seed types can also be observed as groups in the plot.
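The arrow directions correspond to the variable loadings, which can also be inspected directly from the prcomp object (a short sketch):
# variable loadings on the first three principal components
round(mypr$rotation[, 1:3], 3)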
4. K-means Clustering Model
library(cluster)
# scaling data
seed_z <- scale(seeds[,-8])
wssplot <- function(data, nc = 15, seed = 1234){
  # within-groups sum of squares for k = 1
  wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:nc, wss, type = "b", xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")
  wss
}

wssplot(seed_z)
From the plot, we can see that 3 is the optimum number of clusters (k). After k = 3, increasing k does not result in a considerable decrease of the total within-cluster sum of squares (strong internal cohesion) nor a considerable increase of the between-cluster sum of squares and the between/total sum of squares ratio (maximum external separation).
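As an optional cross-check of this choice, the average silhouette width can be computed for a range of k values with factoextra (a small sketch; it is not part of the original elbow analysis):
# average silhouette width for a range of k; the peak suggests the best k
fviz_nbclust(seed_z, kmeans, method = "silhouette")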
# k-means clustering
set.seed(50)
seed_k <- kmeans(seed_z, 3)
# result analysis
seed_k
## K-means clustering with 3 clusters of sizes 67, 72, 71
## 
## Cluster means:
##         area  perimeter compactness lengthofkernel widthofkernel
## 1  1.2536860  1.2589580   0.5591283      1.2349319   1.162075101
## 2 -1.0277967 -1.0042491  -0.9626050     -0.8955451  -1.082995635
## 3 -0.1407831 -0.1696372   0.4485346     -0.2571999   0.001643014
##   asymmetrycoefficient lengthofkernelgroove
## 1          -0.04511157            1.2892273
## 2           0.69314821           -0.6233191
## 3          -0.66034079           -0.5844965
## 
## Within cluster sum of squares by cluster:
## [1] 139.5542 144.5954 144.4586
##  (between_SS / total_SS =  70.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
From the clustering result, it is interesting that the cluster sizes are not equal (cluster 1: 67; cluster 2: 72; cluster 3: 71), which differs slightly from the true number of observations of each kernel type (70 each).
This indicates that there might be kernels with similar geometrical properties that originate from a different type/species. In the following, I compare the clustering result (clustering vector) with the actual labels to see how many observations fall into a different class.
seed_k$cluster
##   [1] 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [38] 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 3 2 3 3 3 3 3 2 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 3 1 1 3 1 3 3 1 2 2 2 2 2 2 2 2
## [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 2 2 3 2 3 2 3 2 2 2 2 2 2 2 2
7 + 5 + 4 # total number of misclassified observations
## [1] 16
(16/210)*100 # percentage of misclassified observations in the data
## [1] 7.619048
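The same count can be obtained directly with a cross-tabulation of the cluster assignments against the original labels (a minimal sketch; note that the cluster numbers are arbitrary, so each cluster has to be matched to its majority label):
# rows: k-means clusters, columns: original kernel types
table(cluster = seed_k$cluster, target = seeds$target)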
From the calculation, 16 observations (7.6% of the data) fall into a different class than they should. This indicates that the geometrical properties of the kernels alone are not sufficient to obtain a clustering that resembles the kernel types. Additional properties, such as genetic or metabolite properties of each kernel, might be needed to overcome this problem.
# additional information
seed_k$tot.withinss # total within sum of squares
## [1] 428.6082
seed_k$totss # total sum of squares
## [1] 1463
seed_k$iter # number of iterations needed to obtain the optimum clustering
## [1] 3
Nevertheless, new clusters can be made using this dataset, and each of these clusters has its own characteristics. Visualization and profiling of the clustering results can give us additional information about each cluster, which can be useful from a business perspective.
To visualize the result of the K-means clustering we can use various functions from the factoextra package or combine the clustering with PCA. This time I will use the factoextra package (I will combine the result with PCA in a later section).
seeds$cluster <- as.factor(seed_k$cluster)

# clustering visualization
fviz_cluster(object = seed_k, data = seed_z)
library(dplyr)

# cluster profiling
seeds %>%
  group_by(cluster) %>%
  summarise_all(.funs = "mean") %>%
  select(-target)
## # A tibble: 3 x 8
##   cluster  area perimeter compactness lengthofkernel widthofkernel
##   <fct>   <dbl>     <dbl>       <dbl>          <dbl>         <dbl>
## 1 1        18.5      16.2       0.884           6.18          3.70
## 2 2        11.9      13.2       0.848           5.23          2.85
## 3 3        14.4      14.3       0.882           5.51          3.26
## # ... with 2 more variables: asymmetrycoefficient <dbl>,
## #   lengthofkernelgroove <dbl>
From the profiling result, we can derive the following insights:
• Cluster 1: consists mostly of kernels with the highest values of the geometrical attributes.
• Cluster 2: consists mostly of kernels with the lowest values of the geometrical attributes but the highest asymmetry coefficient.
• Cluster 3: consists mostly of kernels with middle values of the geometrical attributes.
These characteristics of each cluster can also be visualized by combining clustering and PCA in the next section.
4.1. Combining Clustering and PCA
From the previous section,
we have discussed that PCA can be combined with clustering to obtain better visualization
of our clustering result, or simply to understand the pattern in our dataset.
This can be done by using a biplot, a common plot in PCA to visualize high
dimensional data using PC1 and PC2 as the axes.
We can use plot.PCA to
visualize a PCA object with added arguments for customization.
library(FactoMineR)

seed_pca <- PCA(seeds[,1:7], graph = F)

# analysis of clustering result
par(mfcol = c(1,2)) # graphical parameter to arrange plots
plot(x = seed_pca, choix = "ind")
plot.PCA(x = seed_pca, choix = "ind", label = "quali")
The plots above are examples of the individual factor map of a biplot. The points in the plot represent observations and are colored by their type (original kernel type) and cluster (cluster from the K-means result). Dim1 and Dim2 are PC1 and PC2 respectively, each with its own share (percentage) of the total information in the dataset.
From the biplot colored by type, we can clearly see that some observations from different clusters are located very close to one another and the clusters overlap. Meanwhile, in the plot colored by cluster, the clusters separate nicely without overlapping.
This visualization supports the
assumption made during clustering result analysis, which was, “..there might be
Kernels with similar geometrical properties which originate from different
type/species. This indicates that the geometrical properties of Kernels alone
are not sufficient enough to obtain a clustering that resembles Kernels types.”
After this, I will focus on the interpretation of the biplots in which observations are colored based on the clusters that we made before.
Some insights from the plots are:
• The dataset can be clustered according to its geometrical properties.
• There are no considerable outliers in the data.
• The variables that contribute most to PC1 are the length of the kernel groove, kernel length, perimeter, area and kernel width.
• The variables that contribute most to PC2 are the asymmetry coefficient and compactness.
• The asymmetry coefficient possibly has a negative correlation with compactness.
• The asymmetry coefficient and compactness possibly have a low correlation with the groove length and the other geometrical variables that contribute to PC1.
• The three clusters differ strongly from one another in their scores on the variables that contribute most to PC1. The cluster with the highest score on those variables is cluster 1, followed by cluster 3, then cluster 2.
Additionally, what differentiates cluster 3 from clusters 1 and 2 is its asymmetry coefficient, which is lower than in the other two clusters. Notice that most of the observations in cluster 3 are positioned below the 0 coordinate of PC2, and the asymmetry coefficient contributes strongly to PC2.
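This observation can be checked directly by averaging the asymmetry coefficient per cluster (a small sketch using base R; not part of the original profiling):
# mean asymmetry coefficient and groove length for each cluster
aggregate(cbind(asymmetrycoefficient, lengthofkernelgroove) ~ cluster,
          data = seeds, FUN = mean)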
# analysis of biplot
par(mfcol = c(1,2)) # graphical parameter to arrange plots
plot.PCA(x = seed_pca, choix = "ind", label = "quali", title = "Colored by Cluster")
# legend(x = "topright", levels(seeds$cluster), pch = 19, col = 1:4)
plot.PCA(x = seed_pca, choix = "var", title = "Variable Factor Map")

library(scatterplot3d)

comp <- data.frame(mypr$x[,1:3])
scatterplot3d(comp[,1:3], pch = 16, color = seeds$cluster,
              grid = TRUE, box = FALSE)
5. Conclusion
1. K-means clustering can be performed on this dataset, although the resulting clusters did not fully resemble the kernel types. The geometrical properties of the kernels alone are not sufficient to obtain a clustering that resembles the kernel types; additional properties such as genetic or metabolite properties of each kernel might be needed to obtain such a clustering.
2. Dimensionality reduction can be performed on this dataset. To perform dimensionality reduction, we can pick PCs from a total of 7 PCs according to the amount of information we want to retain. In this article, I used 3 PCs to reduce the dimension of my original data by about 57% while retaining 98.7% of the information in the data.
In conclusion, principal component analysis (PCA) and the k-means clustering algorithm were applied in this project. The analysis started by examining the relationships between the variables. Then, principal component analysis was carried out: the percentage of explained variance for each dimension was determined, a correlation plot was created to identify the variables contributing to each dimension, and the PCA scores and loadings were represented with a biplot. The first three principal components were selected, the optimal number of clusters for k-means was determined, and the k-means clustering result was examined in the space of those components on a combined plot.
6. Future Work
The improved data set
obtained from unsupervised learning (eg.PCA) can be utilized further for
supervised learning (classification) or for better data visualization (high
dimensional data) with various insights.
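As an illustration of this idea (a hypothetical sketch, not part of the analysis above), the retained principal components could be fed into a simple classifier such as a decision tree:
# hypothetical downstream use: classify kernel type from the first three PCs
library(rpart)
pc_data <- data.frame(mypr$x[, 1:3], target = as.factor(seeds$target))
tree_fit <- rpart(target ~ ., data = pc_data)
printcp(tree_fit)  # summary of the fitted tree and its cross-validated error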
7. References
1. Wheat Seeds Data Set, UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/seeds