Clustering is an unsupervised learning technique used to discover the intrinsic groups present in unlabelled data. For instance, a B2C business might be interested in finding segments in its customer base. Clustering is therefore commonly used for use-cases like customer segmentation, market segmentation, pattern recognition, and search-result clustering. Some standard clustering techniques are K-means, DBSCAN, and hierarchical clustering, among other methods.
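As a quick, minimal sketch (scikit-learn and a synthetic toy dataset are assumed here), the three techniques mentioned above can be run as follows:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

# Toy data with three well-separated groups
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans_labels = KMeans(n_clusters=3, random_state=42).fit_predict(X_demo)
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X_demo)
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_demo)

print(np.unique(kmeans_labels), np.unique(dbscan_labels), np.unique(hier_labels))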
Clusters created using techniques like K-means are often not easy to interpret because it is difficult to determine why a particular row of data is assigned to a particular bucket. Knowing the boundary conditions that move a data item (such as a customer) from one cluster to another, more profitable cluster is usually very useful for decision-makers. For example, if a business has an inactive customer segment and a fairly active customer segment, knowing the boundary conditions on the variables that could move a customer from the inactive to the active segment would be highly insightful.
Figure 1. Partitioning a Dataset into K clusters using K-means and decision trees (IMM & ExKMC) – Source
The following steps describe, at a high level, how the algorithm works:
1) Run standard K-means on the data to obtain the k reference cluster centres.
2) Build a binary threshold tree with exactly k leaves, where every internal node splits the data on a single feature and a threshold value.
3) At each node, pick the feature and threshold combination that separates the reference centres while causing the fewest 'mistakes', i.e. points that get cut off from their reference cluster centre.
4) Each leaf of the resulting tree defines one cluster, and the path of rules leading to that leaf explains the cluster.
Figure 2. A red point moves to the cluster of blue points by the split. This, in IMM clustering, is known as a mistake as the split led to separation of the red point from its original cluster – Source
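To make the notion of a mistake concrete, here is a minimal sketch (the helper function and the toy data are illustrative, not part of the ExKMC package) that counts how many points a single feature and threshold split would separate from their reference K-means centre:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def count_mistakes(X, centers, labels, feature, threshold):
    # Count points that the split (feature <= threshold) sends to the opposite
    # side from their own reference centre. For simplicity the split is applied
    # to the full dataset; in the tree, only points reaching the node count.
    point_side = X[:, feature] <= threshold          # side each point falls on
    center_side = centers[:, feature] <= threshold   # side each centre falls on
    return int(np.sum(point_side != center_side[labels]))

X_toy, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, random_state=0).fit(X_toy)
print(count_mistakes(X_toy, km.cluster_centers_, km.labels_, feature=0, threshold=0.0))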
Under the hood, the following steps happen while building the decision tree:
Given reference centres μ1, …, μk and a threshold tree T that defines the clustering (C1, …, Ck), the surrogate cost is:

cost(T) = Σ_{j=1..k} min_{i in {1,…,k}} Σ_{x in Cj} ||x − μi||²

In other words, each cluster of the threshold tree is matched to the reference centre that fits it best, and the squared distances of its points to that centre are summed.
A node is split using the feature and threshold combination that leads to the largest improvement in the surrogate cost.
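As a minimal sketch of this criterion (the function below is illustrative and not part of the ExKMC package), the surrogate cost of a candidate clustering can be computed by assigning every tree cluster its best reference centre and summing the squared distances:

import numpy as np

def surrogate_cost(X, tree_labels, reference_centers):
    # For each cluster produced by the threshold tree, pick the reference
    # centre that minimises the summed squared distance, and add it up.
    total = 0.0
    for cluster_id in np.unique(tree_labels):
        points = X[tree_labels == cluster_id]
        dists = ((points[:, None, :] - reference_centers[None, :, :]) ** 2).sum(axis=2)
        total += dists.sum(axis=0).min()
    return total

A split is then chosen by evaluating candidate feature and threshold pairs and keeping the one that improves this cost the most.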
The Python package for the IMM & ExKMC algorithms is available publicly and can be installed using the following line of code:
pip install ExKMC
We use the Iris dataset here to analyze how the two algorithms perform in terms of the ‘mistakes’ they produce with respect to the reference K-means clustering.
1) We start by importing the libraries that we need (pandas, NumPy and scikit-learn's KMeans are also imported here since they are used in the steps below):

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from ExKMC.Tree import Tree
from IPython.display import Image
2) We then read the dataset:
df1 = pd.read_csv(r"/content/drive/MyDrive/iris.csv")
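If the CSV on Google Drive is not at hand, the same data can be loaded directly from scikit-learn; note that this is just an alternative sketch, and the resulting column names (e.g. 'sepal length (cm)') will differ from those in the CSV used in this article:

from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df1 = iris.frame.rename(columns={'target': 'variety'})
df1['variety'] = df1['variety'].map(dict(enumerate(iris.target_names)))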
3) The tree we build in the next step requires a fitted K-means model, so we first standardize the features and then perform the default K-means clustering with k = 3.
X = df1.drop('variety', axis=1)

# Standardize every feature to zero mean and unit variance
for cols in X.columns:
    X[cols] = X[cols].astype(float)
    k1 = X[cols].mean()
    k2 = np.std(X[cols])
    X[cols] = (X[cols] - k1) / k2

# Fit K-means with k = 3 clusters and get the cluster assignments
k = 3
kmeans = KMeans(k, random_state=43)
kmeans.fit(X)
p = kmeans.predict(X)

class_names = np.array(['Setosa', 'Versicolor', 'Virginica'])
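The loop above standardizes each feature manually; an equivalent, arguably more idiomatic sketch uses scikit-learn's StandardScaler (which, like np.std, uses the population standard deviation):

from sklearn.preprocessing import StandardScaler

X_raw = df1.drop('variety', axis=1).astype(float)
X = pd.DataFrame(StandardScaler().fit_transform(X_raw),
                 columns=X_raw.columns, index=X_raw.index)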
4) As per the IMM algorithm, we build the decision tree; the Tree object needs the number of clusters used in K-means (k), and fit() takes the data along with the fitted K-means model:
tree = Tree(k=k)
tree.fit(X, kmeans)
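To gauge how closely the tree-induced clustering matches the reference solution, one can compare the k-means cost of the tree's clusters with kmeans.inertia_; this is a sketch that only assumes the package's Tree.predict method:

tree_labels = tree.predict(X)   # cluster assignment induced by the threshold tree

# k-means cost of the tree clustering: squared distance of each point to the
# mean of the tree cluster it was assigned to
X_np = np.asarray(X, dtype=float)
tree_cost = sum(((X_np[tree_labels == c] - X_np[tree_labels == c].mean(axis=0)) ** 2).sum()
                for c in np.unique(tree_labels))

print('tree cost:', round(tree_cost, 2), ' k-means cost:', round(kmeans.inertia_, 2))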
5) Next, we plot the tree that we just built using the code below:
tree.plot(filename="test", feature_names=X.columns)
Image(filename='test.gv.png')
Output:
Since X was standardized before carrying out the clustering, some of the thresholds in the decision tree are negative.
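To read such thresholds in the original measurement units, they can be un-standardized with the per-feature mean and standard deviation computed on the raw data; a small sketch (the column name and threshold value below are illustrative):

raw = df1.drop('variety', axis=1).astype(float)

def unscale(feature, threshold):
    # Convert a threshold on the standardized scale back to the original units
    return threshold * np.std(raw[feature]) + raw[feature].mean()

# e.g. for a split such as "petal.length <= -0.74" in the plot (illustrative values):
# print(unscale('petal.length', -0.74))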
6) So far we have used the IMM algorithm, which is why we got a decision tree with exactly K leaves. To obtain a decision tree with K' > K leaves, we use the ExKMC algorithm: we pass an additional parameter, max_leaves, to the Tree object, specifying the number of leaves (K') that we would like the decision tree to have.
tree = Tree(k=k, max_leaves=6)
tree.fit(X, kmeans)
tree.plot(filename="test", feature_names=X.columns)
Image(filename='test.gv.png')
Output:
As we can see in the figure above, the tree has 6 leaves, as specified through max_leaves.
The first output came from the IMM algorithm, with the number of leaves equal to the number of clusters (3). The rules of the resulting decision tree are what bring in the explainability aspect. Because a decision tree with only 3 leaves cannot reproduce the K-means clustering exactly, we see some mistakes at the bottom two nodes. As mentioned before, mistakes here refer to points separated from their reference cluster centre (from K-means) at a node. For example, at the left-most bottom leaf, 8 samples belong to a different cluster.
The second output came from the ExKMC algorithm, which produced a decision tree with 6 leaves. Here, because the tree is deeper than the one produced by the IMM algorithm, we see fewer mistakes at the 6 leaves. As Table 1 below shows, some splits in the tree are redundant for explanation purposes, e.g. the bottom-right node with the decision rule Sepal Length >= 1.280.
Table 1. Decision Rules For Explaining Each Cluster (From Output of ExKMC algorithm)
Decision Rule | Cluster | Number of Samples (in Leaf) | Mistakes
Petal Length > 1.056 & Sepal Length <= 0.553 & Sepal Width <= -0.132 | 0 | 52 | 2
Petal Length <= 1.056 | 1 | 50 | 0
Petal Length > 1.056 & Sepal Length > 0.553 & Petal Width > 0.133 | 2 | 40 | 1
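The rules in Table 1 can be checked directly on the standardized data with boolean masks; in this sketch the column name is an assumption and should be replaced with the actual name used in the CSV:

# Rule for cluster 1 from Table 1: Petal Length <= 1.056
mask_cluster_1 = X['petal.length'] <= 1.056   # 'petal.length' is an assumed column name
print('samples matching the rule:', int(mask_cluster_1.sum()))   # Table 1 reports 50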
Standard clustering techniques are difficult to interpret because they cannot pinpoint the reasons behind the formation of the clusters. Knowing the rules or boundary conditions that push a data point into a certain cluster can be very insightful for decision-makers. Algorithms that bring explainability to clustering are therefore sought after across the industry.
Connect with me on LinkedIn: Nibedita Dutta