K-Means Clustering

Concept¶

K means clustering is the most commonly used unsupervised machine learning algorithm for partitioning a given data set into a set of k groups. It classifies objects in multiple groups (i.e., clusters), such that objects within the same cluster are as similar as possible (low intra-cluster variation), whereas objects from different clusters are as dissimilar as possible (i.e., high inter-cluster variation).
In k-means clustering, each cluster is represented by its centre (i.e. centroid).

Steps¶

In K-means clustering, the observations in the sample are assigned to one of the clusters by using the following steps:
1. Choose K observations from the data that are likely to be in different clusters
2. The K observations chosen in step 1 are the centroids of those clusters
3. For remaining observations, find the cluster closest to the centroid. Add the new observation (say observation j) to the cluster with the closest centroid
4. Adjust the centroid after adding a new observation to the cluster. The closest centroid is chosen based on an appropriate distance measure
5. Repeat step 3 and 4 till all observations are assigned to a cluster

The centroids keep moving when new observations are added.

Data¶

The data used in this blog is taken from kaggle. It is a customer segmentation problem to define market strategy. The sample Dataset summarizes the usage behaviour of about 9000 active credit card holders during the last 6 months. The file is at a customer level with 18 behavioural variables. Visit this link to know more about the data. Sample data is given below.

Credit card dataset
CUST_ID	BALANCE	BALANCE_FREQUENCY	PURCHASES	ONEOFF_PURCHASES	INSTALLMENTS_PURCHASES	CASH_ADVANCE	PURCHASES_FREQUENCY	ONEOFF_PURCHASES_FREQUENCY	PURCHASES_INSTALLMENTS_FREQUENCY	CASH_ADVANCE_FREQUENCY	CASH_ADVANCE_TRX	PURCHASES_TRX	CREDIT_LIMIT	PAYMENTS	MINIMUM_PAYMENTS	PRC_FULL_PAYMENT	TENURE
C16568	3948.7769	1.000000	0.00	0.00	0.00	5291.215	0.000000	0.000000	0.000000	0.500000	17	0	4000	4576.5676	1819.1397	0.083333	12
C13598	2104.9617	0.727273	4115.95	230.50	3885.45	9358.071	0.833333	0.166667	0.666667	0.916667	20	59	8000	15172.5853	778.2392	0.000000	12
C13190	2089.4517	1.000000	1256.49	879.61	376.88	1381.853	0.666667	0.583333	0.500000	0.166667	2	32	2500	637.0447	1072.3948	0.090909	12
C16080	912.9127	0.857143	0.00	0.00	0.00	1563.921	0.000000	0.000000	0.000000	0.285714	9	0	1500	212.0849	331.4871	0.000000	7
C11056	380.3848	0.545455	1188.00	1188.00	0.00	0.000	0.083333	0.083333	0.000000	0.000000	0	2	9000	2461.3203	158.3398	0.000000	12

Optimal number of clusters¶

Identifying the optimal number of clusters is the first step in K-Means clustering. This can be done using many ways, some of which are:
1. Business decision
2. Calinski and Harabasz index
3. Silhouette statistic
4. Elbow curve

Calinski and Harabasz index¶

Calinhara index is an F-statistic kind of index which compares the between and within cluster variances. It is given by:
$$ CH(k) = \frac{B(k)/k-1}{W(k)/(n-k)} $$ Where k is number of clusters and B(k) and W(k) are between and within cluster variances respectively.

From the above plot, the ideal k value to be selected is 2.

Silhouette statistic¶

Silhouette statistic is the ratio of average distance within the cluster with average distance outside the cluster. If a(i) is the average distance between an observation i and other points in the cluster to which observation i belongs and b(i) is the minimum average distance between observation i and observations in other clusters. Then the Silhouette statistic is defined by

$$ S(i) = (\frac{b(i)-a(i)}{Max[a(i), b(i)]})$$

The ideal k using silhouette 2 to 6 as there is not much reduction in silhouette metric.

Elbow curve¶

Elbow curve is based on within-cluster variation. The location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters. (see blog on hierarchical clustering for more)

fviz_nbclust(cluster_data, kmeans, method = "wss")

From the above three metrics, the optimum value of k is 3.

Analysing the cluster¶

K-means can be computed in R with the kmeans function. We will group the data into three clusters (centres = 3). The kmeans function also has a nstart option that attempts multiple initial configurations and reports on the best one. For example, adding nstart = 25 will generate 25 initial configurations. This approach is often recommended.

final_cluster <- kmeans(cluster_data, centers = 3, nstart = 25)

The final clusters can be visualized below. In the below figure, all the columns are reduced to two columns for visualization using PCA.

In each column, the variation between the three clusters can be analysed using the following plots.

From the comparison plot, we can observe the properties of the three clusters. They are:

Cluster 1¶

They are customers who purchase lesser amounts using the card but have reasonably high balance in their account.

Cluster 2¶

They are customers who have a low balance in their account and also purchase less using the account.

Cluster 3¶

They are customers who purchase in higher value using their card.

References¶

Business Analytics: The Science of Data-Driven Decision Making Available
UC Business Analytics R Programming Guide - University of Cincinnati - Online
Alboukadel Kassambara - factoextra package - git