
Algorithm

· 2 min read
Shaurya Singhal

Source: View original notebook on GitHub

Category: Machine Learning / Learn ML

Algorithm

The K-means clustering algorithm uses iterative refinement to produce a final result. The algorithm inputs are the number of clusters K and the data set. The data set is a collection of features for each data point. The algorithm starts with initial estimates for the K centroids, which can either be randomly generated or randomly selected from the data set.

The algorithm then iterates between two steps:

  1. Data assignment step:

Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest centroid, based on the squared Euclidean distance. More formally, if $c_i$ is a centroid in the collection of centroids $C$, then each data point $x$ is assigned to a cluster based on

$$\underset{c_i \in C}{\arg\min} \; dist(c_i,x)^2$$

where $dist(\cdot)$ is the standard (L2) Euclidean distance. Let $S_i$ be the set of data points assigned to the $i$-th cluster centroid.

  2. Centroid update step:

In this step, the centroids are recomputed. This is done by taking the mean of all data points assigned to that centroid's cluster.

$$c_i=\frac{1}{|S_i|}\sum_{x_i \in S_i} x_i$$

The algorithm iterates between steps one and two until a stopping criterion is met (e.g., no data points change clusters, the sum of the distances is minimized, or some maximum number of iterations is reached).

This algorithm is guaranteed to converge to a result. The result may be a local optimum (i.e., not necessarily the best possible outcome), so running the algorithm more than once with randomized starting centroids and keeping the best result may give a better outcome.
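The two steps above can be sketched as a minimal from-scratch implementation in NumPy. This is an illustrative sketch, not the scikit-learn implementation used below; the function name `kmeans` and its parameters are my own, and it assumes no cluster ever becomes empty during the update step.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means: alternate the assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    # initial estimates: k centroids randomly selected from the data set
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # data assignment step: nearest centroid by squared Euclidean distance
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # centroid update step: mean of all points assigned to each cluster
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # stopping criterion: centroids stopped moving, so assignments are stable
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

Because of the local-optimum caveat above, a practical wrapper would call this several times with different seeds and keep the run with the lowest total squared distance.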

# loading the dataset

from sklearn.datasets import make_blobs
X, Y = make_blobs(n_samples=500, centers=5, n_features=2, random_state=101)
X.shape

Output:

(500, 2)
Y.shape

Output:

(500,)
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], c=Y)


Output:

<matplotlib.collections.PathCollection at 0x37e130>
# applying K-means to it
from sklearn.cluster import KMeans
obj = KMeans(n_clusters=5, max_iter=300, random_state=101)
obj.fit(X)

Output:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=101, tol=0.0001, verbose=0)
centers = obj.cluster_centers_
centers

Output:

array([[ 4.6497515 , -6.25312567],
[ 3.89604178, 6.53341535],
[-9.44302162, -6.5596307 ],
[-3.81222587, 7.75606791],
[ 0.25812161, 1.56886054]])
plt.scatter(X[:, 0], X[:, 1], c=Y)
for i in centers:
    plt.scatter(i[0], i[1], c='k', marker='*')

Output: scatter plot of the data with the five cluster centroids marked as black stars.
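Beyond the centroids, a fitted `KMeans` object exposes the cluster assignments and the quantity the algorithm minimizes. As a quick sketch (reusing the same data; `n_init=10` is passed explicitly since newer scikit-learn versions changed its default), `labels_` holds each training point's cluster index, `inertia_` is the sum of squared distances to the nearest centroid, and `predict` applies the assignment step to new points:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, Y = make_blobs(n_samples=500, centers=5, n_features=2, random_state=101)
obj = KMeans(n_clusters=5, max_iter=300, n_init=10, random_state=101).fit(X)

print(obj.labels_[:10])           # cluster index assigned to each training point
print(obj.inertia_)               # sum of squared distances to nearest centroid
print(obj.predict([[0.0, 1.5]]))  # assign a new point to its nearest centroid
```

Comparing `inertia_` across runs (or across values of `n_clusters`) is the usual way to judge which of several random restarts found the better local optimum.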