Comparison of Clustering Algorithms

There are a lot of clustering algorithms. Nevertheless, there are questions: how to decide which algorithm should be used, how fast are they and what kind of data they can process properly. In this technical note I would like to compare several classical and bio-inspired clustering algorithms (that are based on oscillatory networks).

For comparison clustering algorithms python implementation of them is used (library pyclustering – branch 0.6.dev0). The library provides opportunity to use CCORE library (C/C++ part of the library), but currently amount of python implemented clustering algorithms is greater than amount of C/C++ implemented algorithms. Nevertheless, functionality and cluster analysis abilities can be compared using only python implementation.

Following algorithms are considered in this technical note: Agglomerative, BIRCH, CLARANS, CURE, DBSCAN, HSyncNet, K-Means, K-Medians, K-Medoids (PAM), OPTICS, ROCK, SyncNet, SYNC-SOM, X-Means. Almost of them are classical algorithms, except bio-inspired class: HSyncNet, SyncNet, SYNC-SOM. Let’s divide them into following categories:

  • Hierarchical algorithms: Agglomerative, BIRCH, CURE, HSyncNet.
  • Density based algorithms: DBSCAN, OPTICS, ROCK, SyncNet, SYNC-SOM.
  • Partitioning based algorithms: K-Means, K-Medians, K-Medoids (PAM), X-Means.

Almost all these algorithms require input data in RAM except BIRCH and SYNC-SOM, and thus there is limitation to use them for processing large databases. BIRCH encodes input data space (reduces it) by CF-Tree and SYNC-SOM encodes input data by SOM layer (self-organized feature map). FCPS data set collection for clustering testing is used. This data collection consists of 2D and 3D data sets with various complexity. Examples of data and proper clustering results (DBSCAN with correct parameters) are presented on figure 1.

Figure 1. Clustering results of FCPS data by DBSCAN algorithm.

Another one example of clustering where some clusters are allocated incorrectly (K-Medians) presented on figure 2.

Figure 2. Clustering results of FCPS data by K-Mediands algorithm.

Time processing and correctness (proper allocation is marked by green color, improper allocation – red color) of clusters allocation for FCPS are presented in table 1.

Table 1. Time processing and correctness of clustering algorithms.

Some words about bio-inspired algorithms: SyncNet, HSyncNet, SYNC-SOM are able to allocate clusters but allocation depends on output of these oscillatory networks. Approach is pretty simple: each oscillator of a first layer (SyncNet and HSyncNet have only one layer) corresponds to only one object from input data space (SYNC-SOM’s neurons of the first layer compete with each other to capture objects). After simulation (processing input data space) oscillatory network has one or several ensembles of synchronous oscillators where each ensembles corresponds to only one cluster. But ensembles might have the same phases because considered algorithms have got de-synchronization mechanism – ensembles just do not interact with each other. It means that sometimes clustering result might be incorrect and simulation should be repeated. This problem can be resolved by using high level of local synchronization but it leads to increasing of simulation time.