As a developer of the pyclustering library, which is hosted in a GitHub repository, I want to keep "up-to-date" API documentation for users. The source code of the library is self-documented and a doxygen style is used to build the documentation. But the question is: where can it be published?
There is a popular service, "Read the Docs", that uses a webhook to build and publish documentation whenever code is pushed to a repository. Unfortunately, this service does not support doxygen (at least for now – 10 Jul. 2017), so it does not suit me. Of course, there are mediators that can be used to publish documentation to "Read the Docs", but I much prefer to avoid third parties.
After some searching I found another, similar service – "Code Docs". It is not as popular as "Read the Docs", but it fully covers my needs and supports the doxygen style.
To use this service it is enough to create a file .codedocs in the root folder of your repository, containing the path to the doxygen settings:
DOXYFILE = docs/doxygen_conf_pyclustering
That is all! Here you can find an example of the built and published documentation of the pyclustering library.
This technical note is a continuation of the previous one, 'Sync: Oscillatory based Clustering Algorithm'. The hierarchical Sync algorithm 'hSync' for cluster analysis is considered here as one of the representatives of clustering algorithms based on oscillatory networks that use the Kuramoto model for synchronization.
As mentioned in the previous note, the output dynamic of the Sync algorithm can sometimes be analysed to extract the required amount of clusters. This is useful when the connectivity radius is greater than the real one, because in that case global synchronization is observed in the oscillatory network, which means that only one cluster is encoded. But the keyword here is 'sometimes': it does not mean that we can set a bigger radius for any data, reach global synchronization and then analyse it. Why not? The output dynamic of the Sync algorithm can be analysed only when the oscillatory network consists of local, highly coupled sub-networks that have only a few links to each other, as presented in figure 1.
Global synchronization will be reached for an oscillatory network with such a structure (because there are no sub-networks that are disconnected from each other), and it is possible to analyse the output dynamic due to the small number of connections between these sub-networks – figure 2. At the beginning synchronization is reached within the sub-networks (two groups of oscillators), and after some time these two ensembles synchronize with each other.
In other cases it is hard or even impossible to extract the required amount of clusters this way. The idea of processing the network dynamic has been successfully implemented in the hSync algorithm.
How does the hSync algorithm work? hSync uses one mandatory parameter – the amount of clusters that should be allocated – instead of the connectivity radius as in Sync. At the beginning the smallest connectivity radius is chosen, for example the average Euclidean distance to connect the 3 nearest neighbours; alternatively, the initial radius can be specified by the user as an optional parameter. At the second step the classic Sync algorithm is run to allocate clusters, and if the amount of allocated clusters is greater than required, the radius is increased and Sync is run again. The last step is repeated until the required amount of clusters is obtained. The hSync algorithm can be described by the following pseudo-code:
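The scheme above can be sketched as runnable Python. Note that this is a simplified illustration, not pyclustering's implementation: the inner Sync run is approximated by connected components within the connectivity radius (which is what Sync converges to for well-separated groups), and the radius-increase rule (multiplication by a fixed factor) is likewise a hypothetical choice:

```python
from math import dist

def sync_approximation(points, radius):
    # Stand-in for a Sync run: objects whose pairwise distances chain
    # within 'radius' end up in one synchronous ensemble (cluster).
    n = len(points)
    labels = [None] * n
    clusters = []
    for start in range(n):
        if labels[start] is not None:
            continue
        labels[start] = len(clusters)
        stack, members = [start], [start]
        while stack:
            p = stack.pop()
            for q in range(n):
                if labels[q] is None and dist(points[p], points[q]) <= radius:
                    labels[q] = labels[start]
                    stack.append(q)
                    members.append(q)
        clusters.append(sorted(members))
    return clusters

def hsync(points, amount_clusters, radius, increase_factor=1.5):
    # hSync outline: run Sync; while too many clusters are allocated,
    # increase the connectivity radius and run Sync again.
    clusters = sync_approximation(points, radius)
    while len(clusters) > amount_clusters:
        radius *= increase_factor
        clusters = sync_approximation(points, radius)
    return clusters
```

For example, for two compact groups of points, hsync(points, 2, radius=0.05) keeps enlarging the radius until exactly two clusters remain.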
The authors of the algorithm do not specify what rule should be used for increasing the radius, so in the pyclustering library (module hsyncnet in pyclustering.cluster) it can also be specified by the user as an optional parameter. Let's process the simplest data set, 'SampleSimple1', to demonstrate how it works:
from pyclustering.cluster.hsyncnet import hsyncnet;
from pyclustering.samples.definitions import SIMPLE_SAMPLES;
from pyclustering.utils import read_sample, draw_clusters;

# Read a dataset that consists of 10 objects.
sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE1);

# Two clusters should be extracted.
amount_clusters = 2;

# Create hSync clustering algorithm instance.
network = hsyncnet(sample, amount_clusters, ccore = True);

# Start clustering process and collect output dynamic of the network.
analyser = network.process(collect_dynamic = True);

# Allocate clusters and display them.
clusters = analyser.allocate_clusters();
draw_clusters(sample, clusters);
At the beginning the 10 oscillators have their own phases, initialized randomly. Each oscillator corresponds to exactly one object in the data 'SampleSimple1'. After the first Sync step is over (t = 1), 7 groups (ensembles) are formed. The connectivity radius is increased because 7 ensembles encode 7 clusters, which is greater than the required 2, so the simulation is continued with the new radius using Sync. At the end of the second step (t = 2) we have 4 groups, which is still too many, so the procedure is repeated: increase the radius and continue the simulation. Finally (t = 3), 2 ensembles are formed that encode 2 clusters, which means that the clustering process is over.
Other hSync clustering examples can be found in the pyclustering library. Advantages of the hSync algorithm:
1. If the real number of clusters is known, it is much easier to extract them using hSync than to find a proper connectivity radius for the Sync algorithm.
2. hSync can be used to build a kind of dendrogram that illustrates the arrangement of clusters, similar to the dendrogram produced by a classical hierarchical algorithm.
3. Clustering abilities (quality) are the same as for Sync, DBSCAN and OPTICS.
Disadvantages of the hSync algorithm:
1. Processing time is the real problem for this algorithm. Sync is run on each iteration, which is quadratic in the number of objects in the worst case, and the amount of iterations depends on the initial radius and the increase rule. Moreover, on each iteration the connectivity radius has to be recalculated using a k-nearest-neighbour search, so even in the best case a single iteration is expensive. So, in case of a lack of any information about the connectivity radius, the processing time increases dramatically.
2. Results are not stable, just as for the Sync algorithm: oscillator ensembles may occupy the same phase coordinates, because there is no desynchronization mechanism between ensembles in Sync and hSync.
The Kuramoto model is able to ensure global synchronization in oscillatory networks. Global synchronization is a state of the oscillatory network where each oscillator has the same phase coordinate; this state is not interesting from a practical point of view, because nothing is encoded by the oscillators (see the previous note 'Oscillatory networks based on Kuramoto model #1 – Introduction'). Thus, the Kuramoto model should be modified to ensure partial synchronization, where there are two or more groups of oscillators and each oscillator within a group has the same phase coordinate. In other words, each group encodes objects that are similar to each other, and such a set of objects is considered a cluster.
There are different clustering algorithms that are based on a modified Kuramoto model: Sync, hSync, Sync-SOM and so on. Let's start with the first one and consider the Sync algorithm in this note.
The Sync clustering algorithm is based on an oscillatory network that uses a modified Kuramoto model. Each oscillator corresponds to exactly one object from the input data, thus the amount of oscillators is equal to the amount of objects. Two oscillators are connected if the Euclidean distance between the two corresponding objects is less than a threshold value. The output dynamic of each oscillator is described by the following differential equation:

dθi/dt = (K / |Ni|) · Σ_{j ∈ Ni} sin(θj − θi)

This formula is simpler than the original Kuramoto model: the intrinsic frequency is omitted because it affects the phase coordinate only as a permanent bias. |Ni| is the amount of neighbours of oscillator i, and the sum runs over the neighbours Ni of oscillator i as well. K is a constant value that denotes the coupling strength between oscillators. The phase coordinate θi is the main variable of each oscillator.
When should the simulation process be stopped? In other words, when can we obtain the results of cluster analysis? An estimation of local synchronization is used for that purpose: when it is close to 1.0, clustering is close to the end and the output dynamic of the oscillatory network can be used for extracting clusters. The estimation is defined by the following formula, which averages phase agreement over all connected pairs of oscillators:

rc = (1 / Σi |Ni|) · Σi Σ_{j ∈ Ni} cos(θj − θi)

so that rc = 1.0 when every pair of connected oscillators has the same phase.
The ability to estimate the end of the synchronization process is a distinctive feature of oscillatory networks based on the Kuramoto model. Such an estimation helps to simulate the network for only the required number of steps, until the desired state of the system is reached.
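One common way to compute such a local synchronization estimate is to average cos(θj − θi) over all connected pairs of oscillators. The sketch below uses this form (an assumption for illustration; pyclustering's internal implementation may differ):

```python
from math import cos

def local_sync_order(phases, connections):
    """Average cos(theta_j - theta_i) over all connected pairs.

    'phases' is a list of oscillator phases; 'connections' is an
    adjacency matrix (connections[i][j] is True when i and j are linked).
    Returns 1.0 when every pair of connected oscillators is in phase.
    """
    total, pairs = 0.0, 0
    for i in range(len(phases)):
        for j in range(len(phases)):
            if i != j and connections[i][j]:
                total += cos(phases[j] - phases[i])
                pairs += 1
    return total / pairs if pairs else 1.0
```

When all connected oscillators share one phase the function returns 1.0, so a simulation loop can stop as soon as the value crosses a threshold such as 0.998.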
An implementation of the Sync algorithm can be found in the pyclustering library. Using pyclustering, let's perform cluster analysis of the sample LSUN:
from pyclustering.cluster.syncnet import syncnet, syncnet_visualizer;
from pyclustering.samples.definitions import FCPS_SAMPLES;
from pyclustering.utils import read_sample, draw_clusters;
# Read sample 'Lsun' from file.
sample = read_sample(FCPS_SAMPLES.SAMPLE_LSUN);
# Create oscillatory network based algorithm Sync, using connectivity
# radius 0.45 and ccore calculation to make processing faster.
network = syncnet(sample, 0.45, ccore = True);
# Start simulation process until rc = 0.998 (default value, can be changed).
analyser = network.process();
# Show output dynamic of the network.
# Allocate clusters and display them.
clusters = analyser.allocate_clusters();
draw_clusters(sample, clusters);
Here the connectivity radius is 0.45 – as mentioned, this parameter is similar to the connectivity radius in DBSCAN and OPTICS. Two oscillators are connected if the Euclidean distance between their objects is less than 0.45. The estimation of local synchronization is considered the stop condition for the simulation – the time point when the clustering process is finished. In our example the local synchronization threshold is 0.998 by default.
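The connectivity rule itself can be sketched in plain Python (an illustration of the rule, not pyclustering's internal code):

```python
from math import dist

def build_connections(points, radius):
    # Two oscillators are connected when the Euclidean distance
    # between their corresponding objects is less than 'radius'.
    n = len(points)
    connections = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if dist(points[i], points[j]) < radius:
                connections[i][j] = connections[j][i] = True
    return connections
```

For example, with radius 0.45 the points (0, 0) and (0.3, 0) become connected, while a point at (2, 2) stays isolated.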
Let's consider the output dynamic of the oscillatory network – figure 1.
Oscillators that have the same phase coordinate encode a cluster, therefore it is easy to extract clusters using the pyclustering library and the function draw_clusters() from the package pyclustering.utils.
Points that have the same color belong to one cluster, and as can be observed in figure 2, three clusters are allocated by the Sync algorithm. The class cluster_visualizer and the function draw_clusters can visualize only 1-, 2- and 3-dimensional data, but clusters can always be printed to a file or to the console: the clustering result is represented by a list of clusters, where each cluster is a list of object indexes from the input data.
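Printing such a result to the console is trivial, since it is just nested Python lists of indexes (the three-cluster result below is hypothetical, purely for illustration):

```python
def format_clusters(clusters):
    # 'clusters' is a list of clusters; each cluster is a list of
    # object indexes from the input data.
    return ["Cluster %d (%d objects): %s" % (number, len(cluster), cluster)
            for number, cluster in enumerate(clusters)]

for line in format_clusters([[0, 1, 2], [3, 4], [5]]):
    print(line)
```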
An animation of the clustering process of the sample 'Hepta' by the Sync algorithm is shown in the following movie:
Advantages of the Sync algorithm:
1. Clustering results of Sync are similar to the DBSCAN algorithm with the amount of neighbours equal to 1. For example, the algorithm is able to process data with elongated clusters of homogeneous density.
2. The output dynamic can be analysed to extract the true number of clusters in some cases, when the connectivity radius is slightly bigger than needed (this idea has been extended in the hSync algorithm that is considered in the next note, 'hSync: Oscillatory based Clustering Algorithm'). An illustration of cluster allocation by output dynamic analysis is shown in figure 3.
3. The differential equation that is used to calculate the phase coordinate of each oscillator can be simplified, and as a result performance increases (pyclustering uses the simplified equation by default).
Disadvantages of the Sync algorithm:
1. Groups of oscillators do not communicate with each other and may occasionally occupy the same phase coordinate; in this case it is not possible to extract the true number of clusters or to use the output dynamic. There is nothing to do about it – the simulation should be repeated.
2. Complexity tends to its worst case for elongated clusters such as LSUN. The execution time for processing each sample from the FCPS collection is much higher in comparison with traditional algorithms (DBSCAN, OPTICS).
3. It is hard to find the correct connectivity radius, and for some samples it is even impossible.
There is a well-known issue related to the inability to sniff the loopback interface on the Windows OS family using Wireshark. This technical note helps resolve the issue.
What: how to collect a dump on the loopback interface on Windows OS?
Solution: the key to the problem is RawCap – a free command-line network sniffer that allows collecting a dump on the loopback interface. The collected dump is stored to a *.pcap file that can be opened by Wireshark for further analysis.
What is an oscillatory network? An oscillatory network is a special type of neural network that is represented by a dynamic system consisting of oscillatory elements. The general distinction from classical neural networks is the dynamic output, like in real neurons.
There is an assumption that synchronization processes are used to implement cognitive functions, for example vision, motion or memory. Oscillatory networks use this assumption to encode patterns and features. How does a network encode features in synchronization? The answer is simple: oscillators that have the same phase coordinate form special groups – synchronous ensembles. An oscillatory network might have more than one synchronous ensemble, which means that the network encodes more than one feature, where each ensemble of synchronous oscillators encodes exactly one feature. For example, suppose we want to extract colors from an image – color image segmentation; in this case each synchronous ensemble of oscillators encodes exactly one color. Consider another example – data clustering; in this case each synchronous ensemble encodes exactly one cluster of data.
How does synchronization look in the network? Suppose we have a network with four oscillators, where the first and the second encode a feature A and the third and the fourth encode a feature B. Thus there are two synchronous ensembles, [0, 1] and [2, 3] – figure 1.
A spike neuron output is used for simplicity in figure 1. Here neuron #1 and neuron #2 have the same phase coordinate, and at the same time their phases are different from the phases of neurons #3 and #4, which form another synchronous ensemble. The output of a neuron depends on the type of transfer function; for example, it might have a wave form or even a chaotic form. Let us return to our example: suppose we have a 2×2 image where two pixels are blue and the other two are green. Each ensemble encodes exactly one feature, as mentioned before, so for such an image neurons #1 and #2 correspond to pixels #1 and #2 (blue) and neurons #3 and #4 correspond to pixels #3 and #4 (green).
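Extracting synchronous ensembles from a snapshot of phases can be sketched like this (a toy illustration: real networks compare phases with a tolerance, and the tolerance value here is an arbitrary choice):

```python
def allocate_ensembles(phases, tolerance=0.05):
    # Group oscillator indexes whose phases differ by less than
    # 'tolerance' into synchronous ensembles.
    ensembles = []
    for index, phase in enumerate(phases):
        for ensemble in ensembles:
            if abs(phases[ensemble[0]] - phase) < tolerance:
                ensemble.append(index)
                break
        else:
            ensembles.append([index])
    return ensembles

# Four oscillators: #0 and #1 encode feature A, #2 and #3 feature B.
print(allocate_ensembles([0.10, 0.11, 2.30, 2.31]))  # → [[0, 1], [2, 3]]
```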
Now we understand how an oscillatory network encodes features, but the next question is: "How to synchronize oscillators?". We need to synchronize the oscillators that correspond to the same feature. The easiest way to do it is to use the Kuramoto model. The original Kuramoto model ensures global synchronization in a fully interconnected network of phase oscillators:

dθi/dt = ωi + Σj Kij · sin(θj − θi)
Here θi is the phase, considered the current state of oscillator i; ωi is the oscillator's intrinsic frequency, which can be considered an oscillator bias; Kij is the coupling between oscillators i and j. An oscillatory network based on the Kuramoto model is able to ensure global synchronization, when all oscillators have the same phase. Let us simulate an oscillatory network of 100 oscillators using pyclustering library 0.6.0:
from pyclustering.nnet import conn_type;
from pyclustering.nnet.sync import sync_network, sync_visualizer;
# amount of oscillators in the network
num_osc = 100;
# multiplier for coupling strength between oscillators
k = 1;
# type of network structure - all-to-all (full-interconnected)
conn = conn_type.ALL_TO_ALL;
# permission to use core of the library to reach maximum performance
ccore_flag = True;
# create the oscillatory network
network = sync_network(num_osc, k, type_conn = conn, ccore = ccore_flag);
# simulate it until global synchronization and collect output dynamic
sync_output_dynamic = network.simulate_dynamic(collect_dynamic = True);
# show output dynamic of oscillators in the network
sync_visualizer.show_output_dynamic(sync_output_dynamic);
# show animation of synchronization process in polar coordinates
The model of the oscillatory network is simple and can be easily implemented. Moreover, the simple Euler method can be used for solving the differential equation of each oscillator; there is no need to implement the classical fourth-order Runge-Kutta method. Nevertheless, for a review it is enough to use the pyclustering library, therefore the library is used in this technical note.
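For the curious, the Euler approach mentioned above can be written in a few lines without any library. This toy simulation (all-to-all coupling, identical intrinsic frequencies set to zero, step size and coupling picked arbitrarily) reaches global synchronization:

```python
from math import sin, pi
import cmath
import random

def simulate_kuramoto(num_osc=10, k=1.0, dt=0.01, steps=2000, seed=42):
    rnd = random.Random(seed)
    phases = [rnd.uniform(0.0, 2.0 * pi) for _ in range(num_osc)]
    for _ in range(steps):
        # Euler step: d(theta_i)/dt = (k / N) * sum_j sin(theta_j - theta_i)
        deltas = [dt * k / num_osc * sum(sin(theta_j - theta_i) for theta_j in phases)
                  for theta_i in phases]
        phases = [theta + delta for theta, delta in zip(phases, deltas)]
    return phases

def order_parameter(phases):
    # |mean(exp(i * theta))| equals 1.0 under global synchronization.
    return abs(sum(cmath.exp(1j * theta) for theta in phases) / len(phases))

phases = simulate_kuramoto()
print(order_parameter(phases))  # close to 1.0: the network synchronized
```

The order parameter gives a simple way to watch the synchronization process: starting from random phases it is small, and it grows toward 1.0 as oscillators lock onto a common phase.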
The presented code simulates an oscillatory network where global synchronization is observed at the end of the simulation. It means that all oscillators have the same phase coordinate. The amount of simulation steps required to reach global synchronization differs for various network structures, as you can see in figure 2.
As you can see from the example above, synchronization is reached in all networks. Pay attention to the forms of the output dynamic for the grid and bidirectional-list structures. I have prepared several movies devoted to global synchronization in the Kuramoto model.
The Kuramoto model is an interesting system and much research is devoted to it. Using the pyclustering library you can perform your own investigation and study the model. I hope this article encourages you to check what the model can do: try to play with the coupling strength, add new components to the differential equation, such as noise, or try to substitute the sine function with another one.
Now there is the next question that should be clarified to understand how to apply the model to practical problems, for example cluster analysis, graph coloring, image segmentation and so on. We have global synchronization, but how do we get local synchronization, where the oscillatory network has more than one synchronous ensemble? This issue is considered in the next technical note – Sync: Oscillatory based Clustering Algorithm.
The pyclustering library was born in 2014 when I decided to implement an oscillatory network based on the Kuramoto model to study it in my PhD. I have tried to adapt it for solving practical problems and to compare the obtained models with already existing algorithms and methods. Due to personal circumstances I work in one big corporation as an R&D engineer in an area absolutely different from my PhD, thus I spend time on the library in the evenings, at the weekends and during vacations. As a result, the pyclustering library was implemented after two years, and it consists of bio-inspired and classical data mining algorithms (cluster analysis, image segmentation, graph coloring algorithms and models of oscillatory networks).
Beginning my PhD research, I planned to create several oscillatory networks based on various modified Kuramoto models that have the most interesting properties. The implemented oscillatory networks based on the Kuramoto model are able to solve clustering (modules syncnet, syncsom, hsyncnet in pyclustering.cluster), image segmentation (module syncsegm in pyclustering.nnet), graph coloring (module syncgcolor in pyclustering.gcolor) and even pattern recognition (module syncpr in pyclustering.nnet) problems. A little bit later I decided to implement several other oscillatory networks, using the corresponding articles, in order to compare them:
LEGION – local excitatory global inhibitory oscillatory network (module legion in pyclustering.nnet), created for the image segmentation problem (object extraction; an example of a simulation where three objects are allocated is presented in figure 1, where each ensemble of synchronous oscillators encodes exactly one object).
PCNN – pulse-coupled neural network (module pcnn in pyclustering.nnet), also implemented for image segmentation (edge extraction; an example of segmentation is presented in figure 2).
HHN – Hodgkin-Huxley neural network (module hhn in pyclustering.nnet) for image segmentation (color extraction).
SOM – self-organized feature map (module som in pyclustering.nnet; an example of visualization of a data space by distance and density matrices is presented in figure 3) for encoding a data space.
Hysteresis oscillatory network (module hysteresis in pyclustering.nnet), a modification of which is used for graph coloring (module hysteresis in pyclustering.gcolor).
Despite the specified purposes of the mentioned networks, they are presented as abstract models, which makes it possible to simulate them and study their properties without reference to those purposes.
I have studied their abilities and tried to answer the question: "Are they really applicable to practical problems such as cluster analysis, image segmentation, pattern recognition and graph coloring?". I am not going to present the results of that study in this note, but examples and demos of "how they work" and "what they can do" can be found for each oscillatory network and bio-inspired algorithm in the example modules (for example, pyclustering.nnet.examples.*, where the general features of each implemented oscillatory or neural network are demonstrated). The library also provides tools for visualizing and analyzing the results of network simulation.
After some time I had an idea not only to create models of oscillatory networks and bio-inspired algorithms but also to compare bio-inspired and classical (traditional) algorithms, because almost every article about a bio-inspired method, algorithm or approach says that the presented solution is the best, in the vast majority of cases shows only advantages, and almost always keeps silent about disadvantages. Sometimes such articles contain an unfair comparison (from my point of view) with classical algorithms. Thus more attention was paid to the implementation of cluster analysis algorithms (the vast majority of them are well known): agglomerative, BIRCH, CLARANS, CURE, DBSCAN, hierarchical Sync (bio-inspired), K-Means, K-Medians, K-Medoids, OPTICS, ROCK, SyncNet (bio-inspired), Sync-SOM (bio-inspired), X-Means. Each algorithm has examples and demos in the library (the rule of the library: each algorithm is followed by a set of examples). If you want to compare the mentioned algorithms yourself, you are welcome to use the library for that and make your own conclusions. An example of visualization of a clustering result is presented in figure 4.
Initially the library was implemented using only Python code, but modeling of oscillatory networks requires a lot of computational resources and it is really slow to simulate the networks in pure Python; therefore the core of the library has been written in C/C++ and is used as a shared library, CCORE. CCORE is an independent part of pyclustering where the same algorithms and models are implemented. CCORE does not use the Python library (Python.h) to communicate with the Python part; it exposes a plain C interface that is called from Python via ctypes. It was done this way to preserve the ability to use the CCORE library in other projects and applications where Python is not required.
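The ctypes mechanism works roughly like this. The sketch below calls a function from the standard C library instead of CCORE (the CCORE function names differ; this just shows the pattern, and ctypes.CDLL(None) works on POSIX systems):

```python
import ctypes

# Load a shared library; pyclustering loads its CCORE binary the same
# way, just with the path to the ccore shared library instead.
libc = ctypes.CDLL(None)  # symbols of the current process (incl. libc)

# Declare the C signature before calling: int abs(int)
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int

print(libc.abs(-5))  # → 5
```

Because the boundary is plain C, the same shared library can be linked into a C++ application directly, which is exactly the portability argument made above.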
The pyclustering library is an open-source project; people report bugs, propose new features, contribute and help to develop the library. The library consists of the following general modules:
pyclustering.cluster – algorithms for cluster analysis.
pyclustering.nnet – models of oscillatory and neural networks.
pyclustering.gcolor – algorithms for graph coloring.
pyclustering.tsp – algorithms for travelling salesman problem.
pyclustering.container – special data structures that are used by other modules.
The library provides a simple and intuitive API. Here is a listing for the simulation of a Hodgkin-Huxley oscillatory network (where each oscillator is a Hodgkin-Huxley neuron model) using Python code:
from pyclustering.utils import draw_dynamics;
from pyclustering.nnet.hhn import hhn_network;
# six oscillators in the network
# and an external stimulus for each of them
net = hhn_network(6, [11, 11, 11, 25, 25, 25]);
# Simulation of the network during 100 time
# units (consider it like time in some units),
# simulated in 750 steps (iterations).
(t, dyn) = net.simulate(750, 100);
# visualization of the output dynamic
draw_dynamics(t, dyn, x_title = "t", y_title = "V");
Thus, if you need to study, to use, or just to look at oscillatory networks, clustering algorithms and other data mining algorithms, then the pyclustering library or some part of it might be a useful instrument for that. I hope you will find it useful.
There are a lot of clustering algorithms. Nevertheless, there are questions: how to decide which algorithm should be used, how fast are they, and what kind of data can they process properly? In this technical note I would like to compare several classical and bio-inspired clustering algorithms (the latter are based on oscillatory networks).
For the comparison, the Python implementation of the algorithms is used (the pyclustering library, branch 0.6.dev0). The library provides the opportunity to use the CCORE library (the C/C++ part of the library), but currently the amount of clustering algorithms implemented in Python is greater than the amount implemented in C/C++. Nevertheless, functionality and cluster analysis abilities can be compared using only the Python implementation.
The following algorithms are considered in this technical note: Agglomerative, BIRCH, CLARANS, CURE, DBSCAN, HSyncNet, K-Means, K-Medians, K-Medoids (PAM), OPTICS, ROCK, SyncNet, SYNC-SOM, X-Means. Almost all of them are classical algorithms, except the bio-inspired class: HSyncNet, SyncNet, SYNC-SOM. Let's divide them into the following categories:
Density based algorithms: DBSCAN, OPTICS, ROCK, SyncNet, SYNC-SOM.
Partitioning based algorithms: K-Means, K-Medians, K-Medoids (PAM), X-Means.
Almost all of these algorithms require the input data to be in RAM, except BIRCH and SYNC-SOM, and thus there is a limitation on using them for processing large databases. BIRCH encodes (reduces) the input data space by a CF-tree, and SYNC-SOM encodes the input data by a SOM layer (self-organized feature map). The FCPS data set collection is used for clustering testing. This collection consists of 2D and 3D data sets of various complexity. Examples of the data and proper clustering results (DBSCAN with correct parameters) are presented in figure 1.
One more example of clustering, where some clusters are allocated incorrectly (K-Medians), is presented in figure 2.
Processing time and correctness of cluster allocation (proper allocation is marked green, improper allocation red) for FCPS are presented in table 1.
A few words about the bio-inspired algorithms: SyncNet, HSyncNet and SYNC-SOM are able to allocate clusters, but the allocation depends on the output of these oscillatory networks. The approach is pretty simple: each oscillator of the first layer (SyncNet and HSyncNet have only one layer) corresponds to exactly one object from the input data space (SYNC-SOM's neurons of the first layer compete with each other to capture objects). After simulation (processing the input data space) the oscillatory network has one or several ensembles of synchronous oscillators, where each ensemble corresponds to exactly one cluster. But ensembles might end up with the same phases, because the considered algorithms have no desynchronization mechanism – ensembles just do not interact with each other. It means that sometimes the clustering result might be incorrect and the simulation should be repeated. This problem can be mitigated by using a higher level of local synchronization, but that increases the simulation time.