Clustering keywords with K-means in Python

A quick method for grouping similar keywords into clusters

SEOs often need to categorize keywords quickly for different purposes.
Machine learning offers several ways to do this, one of which is to group keywords that are similar because they contain the same terms.
In this post I’ll give you a quick example of this method, using K-means clustering.

K-means clustering

[Animation: K-means convergence]

K-means is a popular, basic clustering algorithm based on unsupervised learning.
Its principle is quite simple:

  1. randomly set k cluster centroids
  2. assign each item to the nearest centroid
  3. recalculate each centroid’s position as the “mean” of its cluster

Steps 2 and 3 are repeated until the positions of the centroids no longer change, at which point the algorithm has converged.
See the animation on the right for a more visual example.
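The three steps above can be sketched in a few lines of NumPy. This is only an illustration, not the code from the notebook: the function name `kmeans`, the toy 2D points, and the choice to initialize the centroids on randomly picked data points are all assumptions made for the example.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (a centroid with no points keeps its previous position)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Converged: the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two obvious groups of 2D points (made-up data)
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

The cluster numbers themselves are arbitrary. What matters is which points end up sharing a label.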

Clustering keywords with K-means in Python

You’ll find the complete documented code in this Google Colab notebook. Most of this could have been done in fewer steps, but the aim is to help people understand the main steps of the process.
Bear in mind that this code is not optimized ;)
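The notebook is not reproduced here, but the general idea can be sketched as follows. This assumes scikit-learn is installed and uses its `TfidfVectorizer` and `KMeans` classes; the keyword list, the TF-IDF weighting, and k = 2 are choices made for this example and may differ from what the notebook does.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up keyword list for illustration
keywords = [
    "buy running shoes", "cheap running shoes", "women running shoes",
    "python tutorial", "python tutorial beginners", "free python tutorial",
]

# Turn each keyword into a vector with one dimension per vocabulary term
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(keywords)

# k is chosen by hand; random_state fixes the random centroid initialization
model = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = model.fit_predict(X)

# Group the keywords by the cluster they were assigned to
clusters = defaultdict(list)
for keyword, label in zip(keywords, labels):
    clusters[int(label)].append(keyword)

for label, members in sorted(clusters.items()):
    print(label, members)
```

Keywords that share terms ("running shoes", "python tutorial") should land in the same cluster, since their vectors overlap on those dimensions.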

K-means clustering is relatively fast, even on huge lists of items. However, I find it has a few flaws.

First of all, about K-means itself:

  • You need to specify the number of clusters up front, and the value you pick might not be optimal. There are several methods for estimating the best k, like the elbow method, or you might prefer to limit the number of clusters in some contexts.
  • Centroids are initialized randomly. This means they won’t always converge to the same position, so unless you fix the random seed, the clusters won’t always be the same even when using exactly the same data.
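Both points can be explored with a short experiment. The sketch below assumes scikit-learn and NumPy, and uses made-up 2D data rather than keyword vectors. It records the `inertia_` of a `KMeans` fit (the sum of squared distances to the nearest centroid) for several values of k, which is the quantity the elbow method looks at, and then checks that fixing `random_state` makes the result reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three well-separated 2D blobs (hypothetical, for illustration only)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Elbow method: fit K-means for several k and record the inertia
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))

# Same seed, same data: the clustering is reproducible
labels_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print((labels_a == labels_b).all())
```

The inertia always decreases as k grows, so the idea is to look for the k after which the improvement becomes marginal (the "elbow"). On this toy data that should be k = 3; on real keyword lists the bend is often much less clear.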

Regarding keyword clustering, another problem is the size of the vocabulary: a larger list of keywords means a larger vocabulary, thus longer vectors and longer computation time.
To reduce the size of the vocabulary and the size of the vectors, several options come to mind:

  • Use only the most frequent terms,
  • Lemmatize and/or stem terms to regroup inflected forms together.

You’ll find some quick examples of such techniques in the notebook.
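As a rough, dependency-free illustration of both options (not the notebook’s code), the sketch below builds a vocabulary from a made-up keyword list. `naive_stem` is a hypothetical toy function that only strips a trailing plural "s"; it stands in for a real stemmer or lemmatizer, which would handle far more cases.

```python
from collections import Counter

# Made-up keyword list for illustration
keywords = ["running shoes", "running shoe", "shoes for runners",
            "trail running shoes", "shoe store"]

def naive_stem(term):
    """Toy stand-in for a real stemmer: strips a trailing plural 's'."""
    if term.endswith("s") and not term.endswith("ss") and len(term) > 3:
        return term[:-1]
    return term

def vocabulary(keyword_list, stem=False, max_terms=None):
    """Return the terms used as vector dimensions, most frequent first."""
    counts = Counter()
    for keyword in keyword_list:
        for term in keyword.lower().split():
            counts[naive_stem(term) if stem else term] += 1
    return [term for term, _ in counts.most_common(max_terms)]

full = vocabulary(keywords)                           # every distinct term
stemmed = vocabulary(keywords, stem=True)             # inflected forms merged
top2 = vocabulary(keywords, stem=True, max_terms=2)   # most frequent terms only

print(len(full), full)
print(len(stemmed), stemmed)
print(top2)
```

Merging "shoes" into "shoe" and "runners" into "runner" shrinks the vocabulary, and capping it at the most frequent terms shrinks it further, at the cost of ignoring rare terms entirely.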

Finally, the method described in the notebook will group keywords that look alike, but not necessarily keywords that describe similar intents.
For example, it won’t tell you that a car is the same thing as an automobile.
Other clustering methods can help solve this issue, but that will be another story ;)
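To make the car/automobile limitation concrete, here is a small sketch using plain term counts and cosine similarity. The helper names `bag_of_words` and `cosine_similarity` are made up for this example, and cosine similarity is used only as a convenient way to show term overlap; it is not the distance K-means itself uses.

```python
import math
from collections import Counter

def bag_of_words(keyword):
    """Count the terms in a keyword (one dimension per distinct term)."""
    return Counter(keyword.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Shared term: the vectors overlap
print(cosine_similarity(bag_of_words("car"), bag_of_words("car insurance")))

# Same meaning, no shared term: the vectors have nothing in common
print(cosine_similarity(bag_of_words("car"), bag_of_words("automobile")))
```

With term-based vectors, "car" and "automobile" are as unrelated as any two keywords that share no words, which is why they will not be grouped on meaning alone.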
