Cluster Analysis and Unsupervised Machine Learning in R

August 4, 2016

Cluster analysis is one of the most widely used techniques for segmenting data in a multivariate analysis. It is an example of unsupervised machine learning and has widespread application in business analytics. Cluster analysis groups a set of objects so that objects in the same group are more similar to each other than to objects in other groups. More precisely, it tries to identify homogeneous groups of cases such as observations, participants, or respondents. In this post, I will take you through the two most important clustering techniques using R. These are:

Hierarchical Clustering: This method identifies clusters nested within clusters. It groups data over a variety of scales by creating a cluster tree, or dendrogram. For instance, wine can be subcategorized as Fortified Wine, Sparkling Wine, or Still Wine according to similarities in composition. Hierarchical clustering is further categorized as agglomerative clustering (bottom-up) and divisive clustering (top-down).
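To make the bottom-up versus top-down distinction concrete, here is a small sketch using the cluster package, which ships with R: agnes() builds the tree agglomeratively, while diana() builds it divisively. The function choices here are mine for illustration; the rest of this post uses hclust(), which is agglomerative.

```r
library(cluster)   # recommended package bundled with R

x <- iris[, -5]    # the four numeric iris measurements

agg <- agnes(x, method = "complete")  # agglomerative: repeatedly merge the closest clusters
div <- diana(x)                       # divisive: start from one cluster and recursively split

# Both trees can be converted to hclust objects and cut into k groups.
agg_groups <- cutree(as.hclust(agg), k = 3)
div_groups <- cutree(as.hclust(div), k = 3)
table(agg_groups, div_groups)         # cross-tabulate the two 3-cluster solutions
```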

Partitional Clustering: This method constructs a partition of n objects into a set of K clusters. The most popular partitional method is K-means. Here, each cluster is associated with a centroid, and each point is assigned to the cluster with the closest centroid. The method requires specifying the number of clusters K in advance.
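The assign-to-nearest-centroid idea can be sketched in a few lines. This is a single Lloyd-style iteration written purely for illustration; it is not the Hartigan-Wong algorithm that R's kmeans() runs by default:

```r
x <- as.matrix(iris[, -5])
K <- 3
set.seed(1)
centroids <- x[sample(nrow(x), K), ]   # K random data points as initial centroids

# Assignment step: squared distance from every point to every centroid,
# then label each point with its nearest centroid.
d2 <- sapply(1:K, function(k) colSums((t(x) - centroids[k, ])^2))
labels <- max.col(-d2)                 # column index of the smallest squared distance

# Update step: move each centroid to the mean of its assigned points.
centroids <- t(sapply(1:K, function(k) colMeans(x[labels == k, , drop = FALSE])))
```

A full K-means run simply repeats these two steps until the labels stop changing.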

A practical difference between the two methods is that K-means handles large datasets better than hierarchical clustering, whose distance matrix grows quadratically with the number of observations. So, let’s go ahead and use both of them one by one. For cluster analysis, I will use the “iris” dataset available in R’s datasets package. There are other datasets available in the package, but this one is famous and appears in many statistical classification techniques in machine learning. It consists of four measurements, in centimeters, for 150 flowers: 50 from each of three species. These three species are setosa, versicolor, and virginica.

library(datasets)
head(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa

Now let’s apply K-means to the data using the code below. We set the random seed to 30 so the results are reproducible, and we use nstart = 30 so that the algorithm tries 30 random starting assignments and keeps the best one. We also set the number of clusters to 3, since there are three species of flowers in the data.

> set.seed(30)
> iris_K <- kmeans(iris[, -5], 3, nstart = 30) # iris[, -5] drops the Species column.
> iris_K

K-means clustering with 3 clusters of sizes 50, 38, 62
Cluster means:

Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.006000 3.428000 1.462000 0.246000
2 6.850000 3.073684 5.742105 2.071053
3 5.901613 2.748387 4.393548 1.433871

Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2 2 2 3 2 2 3

Within cluster sum of squares by cluster:
[1] 15.15100 23.87947 39.82097
(between_SS / total_SS = 88.4 %)

Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"         "iter"         "ifault"

The algorithm has grouped the data into three clusters, keeping the best of 30 random starting assignments. Let’s now compare the clusters with the species column.

> table(iris$Species, iris_K$cluster)

              1  2  3
  setosa     50  0  0
  versicolor  0  2 48
  virginica   0 36 14
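One simple way to put a number on this table is to match each cluster to its majority species and count the matches. This accuracy measure is my own addition for illustration; it is not part of the kmeans() output:

```r
set.seed(30)
iris_K <- kmeans(iris[, -5], 3, nstart = 30)

# Match each cluster (column) to its majority species and count the matches.
tab <- table(iris$Species, iris_K$cluster)
agreement <- sum(apply(tab, 2, max)) / sum(tab)
agreement   # roughly 0.89: about 134 of 150 flowers sit in a majority cell
```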

Now, by plotting the clusters with their centers, we can distinguish them visually. Let’s do that with the following command.

> library(cluster)
> clusplot(iris[, -5], iris_K$cluster, color=TRUE, shade=TRUE, lines=0) # clusplot needs the numeric columns only.


The above plot gives a clear understanding of the three clusters in different colors.
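We chose K = 3 here because we know there are three species. When the right number of clusters is not known in advance, a common heuristic is the elbow method: plot the total within-cluster sum of squares for a range of K and look for the bend. A quick sketch:

```r
# Elbow method: total within-cluster sum of squares for K = 1..8.
set.seed(30)
wss <- sapply(1:8, function(k) kmeans(iris[, -5], k, nstart = 20)$tot.withinss)

plot(1:8, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
# The curve drops sharply up to K = 3 and flattens afterwards,
# suggesting three clusters for the iris data.
```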

Let’s go further to see how hierarchical clustering performs with the same data set.

> iris_h <- hclust(dist(iris[, -5]))
> iris_h
Call:
hclust(d = dist(iris[, -5]))

Cluster method : complete
Distance  : euclidean
Number of objects: 150
We will now plot the clusters to create a dendrogram.
> library(sparcl)
> y = cutree(iris_h, 3)

> ColorDendrogram(iris_h, y = y, labels = names(y), main = "Iris", branchlength = 80)
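sparcl is a CRAN package that may not be installed on every machine. If it is not available, base R can draw the same tree and outline the three clusters with rect.hclust():

```r
# Base-R alternative to sparcl::ColorDendrogram.
iris_h <- hclust(dist(iris[, -5]))

plot(iris_h, labels = FALSE, main = "Iris", xlab = "", sub = "")
rect.hclust(iris_h, k = 3, border = 2:4)   # box the 3-cluster cut in colors 2..4
```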

We will now plot the three clusters against species to know if the algorithm has done the job correctly.

> table(y, iris$Species)

y   setosa versicolor virginica
1     50          0         0
2      0         23        49
3      0         27         1

From the table, we can see that the algorithm has clustered setosa correctly but mixed up the other two species. Let us now use the average linkage method to perform the clustering again.

> newcluster <- hclust(dist(iris[, -5]), method = "average")
> cut <- cutree(newcluster, 3)
> table(cut, iris$Species)

cut setosa versicolor virginica
1     50          0         0
2      0         50        14
3      0          0        36

Evidently, this is the most accurate clustering so far. It has clustered setosa and versicolor perfectly; only virginica is split, with 14 of its flowers placed in the versicolor cluster. We will now plot it to get a clear visual understanding of the same.
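Average linkage worked better here than complete linkage, but that will not hold for every dataset. A quick way to compare several linkage methods at once is to score each 3-cluster cut with the same majority-match idea used earlier; this comparison loop is my own sketch:

```r
# Compare linkage methods by how well a 3-cluster cut matches the species.
d <- dist(iris[, -5])
methods <- c("complete", "average", "single", "ward.D2")

scores <- sapply(methods, function(m) {
  groups <- cutree(hclust(d, method = m), k = 3)
  tab <- table(groups, iris$Species)
  sum(apply(tab, 1, max)) / sum(tab)   # majority-match accuracy per method
})
round(scores, 3)
```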

From the two clustering methods, we have a fair idea of how data can be divided into groups of similar objects. We encounter clustering in almost every aspect of our daily lives. We make friends on the basis of shared feelings and emotions, and a group of these friends forms a cluster. In supermarkets, similar food items are placed near each other, forming a cluster. There are countless ways in which cluster analysis plays a role in our lives. In business, cluster analysis is used mainly in market segmentation, one of the most fundamental strategic marketing concepts. In a nutshell, cluster analysis helps reduce the complexity of data, turning it into information, then knowledge, and finally wisdom.