Let’s now explore the task of clustering. Contrary to classification or regression, clustering is an unsupervised learning task; there are no labels involved here. In its typical form, the goal of clustering is to separate a set of examples into groups called clusters. Clustering has many applications, such as segmenting customers (to design better products, ads, etc.), aggregating news, identifying communities on social media, or even defining gene families from DNA sequences. Also, clustering can be used on just about any dataset in order to explore and obtain insights about it. Interestingly, clustering played (and still plays) a fundamental role for us humans: we clustered our world into concepts and use them to reason and communicate.
Let’s start with a simple example from the Fisher’s Irises dataset, which is a record of 150 specimens of irises and some of their characteristics:
For this example, we only keep the petal length and sepal width. Let’s extract their values and visualize the resulting data:
We can visually distinguish two clusters. Let’s see if we can find these clusters automatically:
The automatic function FindClusters identified two clusters, one with 49 examples and one with 101 examples. Let’s visualize them:
This is more or less what we would expect, except for the bottom-left example of cluster 2 () that is quite close to cluster 1 (). While this goes against our intuition, this clustering makes more sense if we realize that the automatic function has no idea what the features are. From its perspective, they could have different units or be different things entirely, so by default, the function standardizes the data, which means it independently transforms each feature to have a zero mean and a standard deviation of 1. In the standardized space, the clustering we found makes more sense.
As we can see, the bottom-left example is actually quite close to cluster 2. As it happens, the “correct” clustering is to group this bottom-left example with cluster 1 instead of cluster 2 because it belongs to the same species as the irises in cluster 1, so the standardization is harmful here. Often, obtaining good clustering requires a bit more work than obtaining a good classifier, for which everything can be fairly automated.
One way to guide the clustering is to specify the number of clusters, the clustering method used, and, of course, the preprocessing. Here is an example where we use the k-means method to identify three clusters while we deactivate the automatic standardization:
Interestingly, this clustering is quite effective at placing the irises back with their correct group. Here are the irises labeled by their actual species for comparison:
To measure the agreement between our clustering and the actual classes (species), let’s count how many irises of each species are present in each cluster found:
From this, we can compute the purity measure, which is a sort of accuracy measuring the proportion of examples that belong to the main class of their cluster:
In this case, the purity is about 91%, which is pretty good. Purity is an example of an external measure because it uses our knowledge of the correct clustering (the separation by species). If we do not know the correct clustering, we can use an internal measure. An example of such a measure is the within-cluster variance, which is the mean squared Euclidean distance from the center of the cluster:
This measure tells us how close the points are to their center, so the smaller, the better. This value can be compared to the overall variance of the data, which we would obtain if there was only one cluster:
This within-cluster variance is the cost that the k-means method is optimizing, so we should not be able to improve its value here. This does not mean that our clustering is optimal though (and we know it is not thanks to our external measure). There exists other internal measures, called clustering criterion functions, which would give a better clustering if optimized. The choice of criterion function, which is also related to the choice of the method, is something that depends on the application though. This is quite unlike supervised learning, where the choice of objective does not matter too much.
Let’s now try to cluster images and, in particular, images of people. This could be used by a home robot to understand who lives in the house or, more simply, to organize a photo album.
Let’s consider the following faces belonging to three different people:
We can try to cluster these images automatically:
The results are not really what we are looking for. It is not clear why such clustering is found, maybe something to do with the background or the clothes, but clearly, this clustering is not capturing the identity of the people. We could add more faces to see if it helps, but there is a more fundamental problem here: the computer does not know what we want. Indeed, we might want to separate pictures depending on where they were taken (e.g. outside vs. inside), to separate males from females, to cluster according to people’s ages, and so on. This is a fundamental issue of clustering and, in a way, of unsupervised learning in general: the goal is often not well defined.
One way to tell the computer what we want is to extract features that we know are relevant for our task. In this case let’s use a model that has been trained to recognize some famous people:
We can use the probabilities obtained from this model to create a feature vector (with a bit of smoothing):
Now to obtain better results, let’s define a pseudo-distance that works well on probability distributions, such as a symmetrized version of the Kullback–Leibler divergence:
A way to check if this has a chance of working is to visualize the distance matrix of the data:
We can see clear groups already. Faces belonging to the same person are close to each other. Let’s now use these features and this distance function to cluster the faces again:
This time the clustering is good; it is exactly what we were looking for. This shows the importance of choosing appropriate features (and distance functions) for clustering, even more so than for supervised tasks.
We now would like to create a news aggregator. The goal is to group news articles that are about the same topic. This can be used to make news browsing more convenient or as a first step to creating an automatic press review.
Let’s simulate news articles using Wikipedia. To make things simpler, let’s imagine that there are only three news stories today: one about a tennis match, one about SpaceX, and one about a pandemic. We can load Wikipedia pages that correspond to these topics:
Let’s now separate these texts into groups of 10 sentences (we also want keep track of the origin of each article to test our clustering):
We now have 148 “news” articles of 10 sentences each. Here is one of them:
Let’s see if we can automatically cluster these articles:
The clustering function automatically found 42 clusters, which is too many for our application. Let’s specify that a maximum of three topics should be found (here topic labels are returned instead of articles, but these labels are not used to help the clustering):
Again, the results are not good to say the least. Unfortunately this is not uncommon for clustering, which is harder to automatize than supervised learning tasks. We could try to use a specific feature extractor as we did for images (in that case, using the probabilities of the "FacebookTopic" built-in classifier would probably help!), but to make the clustering a bit more robust, let’s just try some other classic preprocessing. The t-SNE dimensionality reduction method (see Chapter 7, Dimensionality Reduction) happens to work well in this case. Let’s reduce the dimension of the articles to numeric vectors of size 2:
Since text is not a numeric vector, there is also an internal preprocessing that converts the text into a numeric vector before the dimensionality reduction. Here it is: a tf–idf transformation (see Chapter 9, Data Preprocessing). Let’s visualize the embeddings of these articles while showing their topics of origin (note that the dimensionality reduction did not have access to these labels):
We can clearly see clusters. SpaceX articles are well grouped, and the same goes for pandemic articles. Only tennis articles are a bit more dispersed. Let’s cluster this data and visualize the resulting clusters in this reduced space:
Exactly three clusters are found, which is good, and the clustering more or less corresponds to the ones we wanted. SpaceX and pandemic articles are well grouped while tennis articles leak a little bit into the pandemic cluster:
This is confirmed by visualizing the word clouds of each cluster, in which we can see that topics are fairly well separated:
We thus found a pretty good way to cluster these articles. The next steps would be to try this procedure on other topics and probably play with the preprocessing, the method used, and so on in order to obtain something more robust. Once the clustering procedure is set up, one last task could be to automatically name these clusters, which could be done, for example, by taking the largest word in each word cloud (which corresponds to the most common word besides stop words, such as “the,” “a,” etc.).
As a final example, let’s cluster some DNA sequences. Instead of performing a regular clustering though, we are going to create a hierarchical clustering, which means finding a hierarchy of clusters from the individual examples to the entire dataset.
Let’s start by loading some DNA sequences using some sequences of the coronavirus SARS-CoV-2:
This data contains tens of thousands of DNA sequences, including metadata such as collection date and place. Here is the sequence, date, and location of the first decoded sequence:
Each DNA sequence is a string of about 30000 characters (“A,” “C,” “G,” or “T”), which represent nucleobases. Some sequences are abnormally short though, and we will ignore them:
To simplify the problem, let’s only focus on sequences that have been obtained in these countries:
Furthermore, let’s pick only the first five sequences from each country. Here are the resulting 75 sequences:
We can extract the corresponding DNA sequences and their countries:
Let’s look at the first 20 nucleobases of some sequences:
We can already see differences between them. A way to quantify these differences is to compute an edit distance. For example, our first and last sequences have an edit distance of 470, which is not so much given their length:
We could use such a distance to determine a clustering; however, it would be very slow given the lengths of the strings. Instead, we can use an alignment-free method, which consists of transforming each sequence into its Frequency Chaos Game Representation image:
Here are the first and last images along with a visualization of their differences:
This feature extraction happens to work pretty well for DNA sequences, and it is much faster than methods requiring the finding of alignment between sequences.
Let’s now create a clustering tree from these images (note that the labels are only here for visualization purposes):
Each node of this tree corresponds to a different cluster, going from the individual examples to the entire dataset. We can see that sequences belonging to the same countries are fairly well grouped, which makes sense. An alternative way to visualize a hierarchical clustering is to use a dendrogram, which is particularly appreciated because the position of the nodes also reflects the similarity between clusters. Here is a dendrogram using the first 20 sequences:
This type of clustering is usually made from the bottom up: clusters are iteratively merged according to some criterion. Such hierarchical clusterings are mostly useful for data visualization in order to understand the data, get insights, and make discoveries.
|■||Clustering is the task of grouping data examples together.|
|■||Clusters can be disjoint or organized into a hierarchy.|
|■||Clustering can be used for specific applications or just to explore datasets.|
|■||Clustering tools cannot be as automated as supervised learning tools because clustering goals vary depending on the application.|
|■||We often need to play with the choice of features, distance function, criterion, method, and number of clusters in order to obtain an appropriate clustering.|
|clustering||separating a set of examples into groups called clusters|
|external measure||measure of a clustering quality that compares the clustering to a known target clustering|
|measure of a clustering quality that only uses the given clustering|
|purity measure||external clustering measure, computes the proportion of examples that belong to the correct cluster|
|within-cluster variance||internal clustering measure, computes the mean squared Euclidean distance from the center of the clusters|
|k-means||classic clustering method that minimizes the within-cluster variance for a fixed number (k) of clusters|
|hierarchical clustering||clustering for which clusters are organized into a hierarchy going from the individual examples to the entire dataset, forming a tree structure|
|dendrogram||method to visualize a hierarchical clustering|
|standardization||preprocessing that sets the mean of each variable to 0 and variance to 1|