Algorithm
The K-means clustering algorithm uses iterative refinement to produce a final result. The algorithm inputs are the number of clusters K and the data set, which is a collection of features for each data point. The algorithm starts with initial estimates for the K centroids, which can either be randomly generated or randomly selected from the data set.
The algorithm then iterates between two steps:
- Data assignment step:
Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest centroid, based on the squared Euclidean distance. More formally, if $c_i$ is a centroid in the collection $C$, then each data point $x$ is assigned to a cluster based on
$$\underset{c_i \in C}{\arg\min} \; dist(c_i,x)^2$$
where $dist(\cdot)$ is the standard (L2) Euclidean distance. Let $S_i$ be the set of data points assigned to the $i$th cluster centroid.
- Centroid update step:
In this step, the centroids are recomputed. This is done by taking the mean of all data points assigned to that centroid's cluster.
$$c_i=\frac{1}{|S_i|}\sum_{x_i \in S_i} x_i$$
The algorithm iterates between steps one and two until a stopping criterion is met (i.e., no data points change clusters, the sum of the distances is minimized, or some maximum number of iterations is reached).
This algorithm is guaranteed to converge to a result. The result may be a local optimum (i.e., not necessarily the best possible outcome), so assessing more than one run of the algorithm with randomized starting centroids may give a better outcome.
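The two steps above can be sketched with plain NumPy on a toy data set (the points and initial centroids below are illustrative values, not from the notebook): the assignment step computes squared Euclidean distances and takes the arg min, and the update step averages each cluster's points.

```python
import numpy as np

# Toy data: six 2-D points forming two obvious groups (illustrative values).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])
C = np.array([[0.0, 0.0], [5.0, 5.0]])  # initial centroids, k = 2

# Assignment step: squared Euclidean distance from every point to every centroid.
sq_dist = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # shape (n, k)
labels = np.argmin(sq_dist, axis=1)

# Update step: each centroid becomes the mean of its assigned points.
C_new = np.array([X[labels == i].mean(axis=0) for i in range(len(C))])
```

One full K-means run simply repeats these two lines of work until the centroids stop moving.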
# loading the dataset
from sklearn.datasets import make_blobs
X, Y = make_blobs(n_samples=100, centers=5, n_features=2, random_state=101)
X.shape
Output:
(100, 2)
Y.shape
Output:
(100,)
import numpy as np
import matplotlib.pyplot as plt
def distance(v1, v2):
    # L2 (Euclidean) distance between two feature vectors
    return np.sqrt(np.sum((v1 - v2)**2))
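As a quick sanity check, this helper is numerically equivalent to `np.linalg.norm` of the difference vector; a 3-4-5 right triangle gives a distance of exactly 5:

```python
import numpy as np

v1 = np.array([1.0, 2.0])
v2 = np.array([4.0, 6.0])
d = np.linalg.norm(v1 - v2)  # legs of length 3 and 4 -> hypotenuse 5.0
```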
# step 1 implementation -> assigning each point to its nearest cluster, and returning a dictionary with each cluster's points and mean
def points_filteration(X, mu):
    mu = np.array(mu)
    k = mu.shape[0]
    # each cluster stores two things -> its points and its mean
    clusters = {}
    for i in range(k):
        clusters[i] = {
            'points': [],
            'mean': None
        }
    for x in X:
        dist = []
        for i in range(k):
            dist.append(distance(mu[i], x))
        dist = np.array(dist)
        clusters[np.argmin(dist)]['points'].append(x)
    for i in range(k):
        if clusters[i]['points'] != []:
            clusters[i]['mean'] = np.mean(clusters[i]['points'], axis=0)
    # plots only for 2-featured data
    if X.shape[1] == 2:
        plt.scatter(X[:, 0], X[:, 1])
        for i in range(k):
            mean = clusters[i]['mean']
            if mean is not None:  # skip clusters that received no points
                plt.scatter(mean[0], mean[1], c='k', marker='*', s=100)
        plt.show()
    return clusters
# step 2 implementation
def Kmeans(X, k=2, max_iter=5):
    # finding the max absolute feature value in the data so we can randomly initialize the centers within the data range
    data_range = max(np.max(X), np.abs(np.min(X)))
    # initializing k random means
    mu = []
    np.random.seed(0)  # setting the seed so the same random initialization is used every time
    for _ in range(k):
        mean = data_range * (2 * np.random.random(size=(X.shape[1])) - 1)
        mu.append(mean)
    # updating means until convergence
    itr = 0
    while itr < max_iter:
        # finding the points closest to each cluster and plotting the data with the means
        clusters = points_filteration(X, mu)
        # updating the mu array for the next iteration
        for i in range(k):
            if clusters[i]['mean'] is not None:  # keep the old mean for empty clusters
                mu[i] = clusters[i]['mean']
        itr += 1
    return mu
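The loop above always runs for a fixed `max_iter`. The stopping criteria mentioned earlier (e.g., centroids no longer moving) could instead be checked each iteration; a minimal sketch, where `tol` is an assumed tolerance threshold, not part of the notebook:

```python
import numpy as np

def has_converged(old_mu, new_mu, tol=1e-4):
    # Stop when every centroid has moved less than tol (a hypothetical threshold).
    return all(np.linalg.norm(o - n) < tol
               for o, n in zip(np.array(old_mu), np.array(new_mu)))
```

Inside `Kmeans`, the `while` loop would break as soon as `has_converged(mu, new_mu)` returns True, often well before `max_iter`.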
k = 5
final_mu = Kmeans(X, k, max_iter=3)  # after three iterations the centroids settle into their correct positions
final_mu
Output:
[array([3.62186823, 6.59111541]),
array([0.23713963, 1.85541645]),
array([-9.56458048, -6.09665251]),
array([-4.05050671, 7.90104677]),
array([ 4.58002297, -6.28061932])]
# plotting the final data, colored by the true blob labels
plt.scatter(X[:, 0], X[:, 1], c=Y)
for i in range(k):
    plt.scatter(final_mu[i][0], final_mu[i][1], c='k', marker='*', s=100)
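For comparison, scikit-learn ships the same Lloyd-style iteration in `sklearn.cluster.KMeans`; a sketch of the equivalent fit on this data, where `n_init=10` restarts the algorithm from several random initializations to mitigate the local-optimum issue noted earlier:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, Y = make_blobs(n_samples=100, centers=5, n_features=2, random_state=101)
# n_init restarts with different random centroids and keeps the best run.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # 5 centroids, comparable to final_mu above
```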
