Clustering algorithms, a popular family of unsupervised machine learning tools, are not just the fantasy of mad scientists looking for ways to make computers as intelligent as we are. They are already widely used because they open up possibilities that prove extremely useful in many areas of life. A look at some popular uses of clustering algorithms shows how widespread they already are.
What is clustering?
First, let’s look at what clustering means. Clustering is a whole category of unsupervised algorithms. In machine learning, unsupervised means the training data is unlabelled: it carries no information about the correct answer to a particular problem. Instead of showing the computer input features together with the target values that correspond to them, we give the algorithm a dataset and ask it to look for patterns. Based on these patterns, which may not be obvious to humans, the computer comes up with its own way of grouping the data entries; this grouping is called clustering. New data can then be assigned to clusters based on the same patterns.
Although clustering may seem similar to the classification methods of supervised learning, it is not the same. Classification uses labelled data that states exactly which category each data entry belongs to, and the computer uses these labelled examples to predict categories for new data. In clustering, by contrast, the computer has to come up with the groups and assign data entries to them on its own. In some algorithms, the only thing a human chooses is the number of clusters, and even that is not always required.
The most popular types of clustering algorithms
The most popular types of clustering algorithms are K-means, hierarchical clustering, and DBSCAN. Each has its own advantages and disadvantages, which make it better suited to particular purposes.
In the K-means algorithm, you have to choose the number of clusters, k, in advance. The algorithm divides the data into k clusters and calculates the centre of each one. It then reassigns every data point to the cluster whose centre is closest and recalculates the centres, repeating these two steps until the assignments stop changing.
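The K-means loop can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not a production implementation: the function names (kmeans, squared_distance, mean_point) are made up for this example, and the first k points are used as the starting centres purely to keep the result reproducible, whereas real libraries usually pick the starting centres more carefully.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iterations=100):
    # Assumption for this sketch: the first k points act as the initial centres.
    centres = [tuple(p) for p in points[:k]]
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centres[i]))
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its points
        # (an empty cluster keeps its previous centre).
        new_centres = [
            mean_point(c) if c else centres[i] for i, c in enumerate(clusters)
        ]
        if new_centres == centres:
            break  # centres stopped moving, so the assignments are stable
        centres = new_centres
    # Final label of each point: index of its nearest centre.
    labels = [
        min(range(k), key=lambda i: squared_distance(p, centres[i])) for p in points
    ]
    return centres, labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, labels = kmeans(points, k=2)
print(labels)
```

On this toy dataset the two starting centres both lie in the same group, yet the repeated assign-and-update steps still pull one centre over to the distant group.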
Hierarchical clustering can work in two directions. Depending on the dataset and its intended use, the algorithm can move from the bottom up (agglomerative): it starts with every data point as its own cluster and, with each iteration, links the two closest clusters, so the number of clusters shrinks each time. The process continues until a single cluster is left or until a predefined number of clusters is reached. It is also possible to move from the top down (divisive), taking the entire dataset as one cluster and splitting it with each iteration.
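The bottom-up direction can be illustrated with a short sketch. It assumes single linkage, meaning the distance between two clusters is taken as the distance between their two closest points; this is one of several possible linkage rules and is chosen here only for simplicity. The function name agglomerative is hypothetical, and the brute-force search is meant for small examples only.

```python
import math


def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain (single linkage)."""
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Link the two closest clusters into one.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


points = [(1, 1), (1.2, 1), (5, 5), (5.1, 5), (9, 9)]
for cluster in agglomerative(points, n_clusters=3):
    print(cluster)
```

Passing n_clusters=1 runs the merging all the way to a single cluster, which corresponds to the first stopping condition described above.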
Finally, there is DBSCAN. Like K-means, it assigns points to clusters based on the distances between them, but it groups points by density rather than around centres, so it can find clusters of irregular shapes rather than only roughly spherical ones. It can even separate one cluster that is surrounded by another.
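A compact sketch of DBSCAN shows how density drives the grouping. The parameter names eps (neighbourhood radius) and min_pts (minimum number of neighbours, counting the point itself, for a point to start or extend a cluster) follow common usage but are assumptions of this example, as is the brute-force neighbour search, which is only suitable for small datasets.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already visited
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # too sparse; may later become a border point
            continue
        cluster += 1  # point i is dense enough to start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # dense point: keep growing the cluster
    return labels


# An elongated chain, a second small group, and one isolated point.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (0, 5), (1, 5), (2, 5), (10, 10)]
print(dbscan(points, eps=1.1, min_pts=2))
```

The chain of five points ends up in one cluster even though its two ends are far apart, because each point is within eps of the next; the number of clusters is never specified in advance.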
Where are these algorithms used?
There are many uses for clustering algorithms. For example, your own email provider may use a clustering algorithm to filter spam, which reportedly can be done with 97% accuracy. K-means is commonly used for this purpose: it examines different parts of a message and groups messages into clusters against which new messages are compared.
Another popular use is the identification of fake news. This is an important task that is not simple to solve, but machine learning techniques make it possible. Clustering algorithms have proved useful for recognising patterns that are characteristic of fake articles, patterns that people frequently fail to spot on their own.
Clustering is particularly good at identifying criminal activity in different areas of life. DBSCAN, a density-based algorithm, is well suited to this because it can spot outliers. A data point that does not fit well into any cluster is treated as an outlier, which in practice can be a signal of abnormal activity.
Other popular uses of clustering algorithms include the classification of network traffic, marketing, and document analysis.
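The outlier idea can be shown without running the full clustering. The sketch below applies only the density test that DBSCAN uses to decide what counts as noise: a point is flagged when it is neither dense itself nor within reach of a dense point. The function name find_outliers, the parameters eps and min_pts, and the sample data are all assumptions made for illustration.

```python
import math


def find_outliers(points, eps, min_pts):
    """Return the points that a density-based method would treat as noise."""

    def neighbour_count(p):
        # Number of points within eps of p, counting p itself.
        return sum(1 for q in points if math.dist(p, q) <= eps)

    # Dense ("core") points have at least min_pts points in their neighbourhood.
    core = [p for p in points if neighbour_count(p) >= min_pts]
    # An outlier is not within eps of any core point (core points are at
    # distance 0 from themselves, so they are never flagged).
    return [p for p in points if not any(math.dist(p, c) <= eps for c in core)]


# Hypothetical data: four records that sit close together and one far away.
records = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8)]
print(find_outliers(records, eps=1.5, min_pts=3))
```

In a real application the flagged records would only be candidates for review; whether an outlier actually reflects abnormal activity still has to be checked.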