Agglomerative Clustering
Application Description
This benchmark is an implementation of a well-known data-mining algorithm called Agglomerative Clustering [1]. The input to this algorithm is (1) a data-set consisting of points in an n-dimensional space and (2) a measure of the similarity between items in the data set. We refer to this measure of similarity as a distance metric; the more dissimilar items are, the farther apart they are according to the distance metric. The output of the algorithm is a binary tree (called a dendrogram) representing a hierarchical, pair-wise clustering of the items in the data set. The parallelism in this algorithm arises from the multiple pairs of points that can be clustered independently.
Figure 1(a) shows a data-set containing points in the plane, with a distance metric corresponding to Euclidean distance. The clustering for the data-set is in Figure 1(b), and the resulting dendrogram is in Figure 1(c).
[1] P.-N. Tan, M. Steinbach, and V. Kumar, editors. "Introduction to Data Mining." Pearson Addison Wesley, 2005.
[2] B. Walter, K. Bala, M. Kulkarni, and K. Pingali. "Fast Agglomerative Clustering for Rendering." IEEE Symposium on Interactive Ray Tracing, 2008.
Figure 1: Example of Agglomerative Clustering.
Algorithm
We implement a cautious variant of Agglomerative Clustering algorithm first described by Walter et al. [2], which is based on the following observation: if at any time two points agree that they are one another's nearest neighbor, they will be clustered together in the final dendrogram (subject to certain conditions on the distance metric). Pseudocode for this algorithm is provided in Figure 2.
Initially, all data points are placed onto a worklist (line 1). We then build a kd-tree, an acceleration structure that allows the quick determination of a point's nearest neighbor (line 2). The algorithm proceeds as follows. For each point p in the worklist, find its nearest neighbor q (line 5) and determine if q's nearest neighbor is p (lines 6-8). If so, cluster p and q and insert a new point into the new nodes list representing the new clusters. Otherwise, place p back onto the new nodes list to be processed in the next iteration of the outer loop. The new clusters are then copied back to the worklist after rebuilding a new Kd-Tree. The algorithm terminates when there is only one point left in the worklist.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | Workset ws = new Set(points); KDTree kdtree = new KDTree(points); while (true) { foreach (Element p in ws) { if (p.hasCluster()) continue; Point q = kdtree.findNearest(p); if (q == null) break; // stop if p is last element Point r = kdtree.findNearest(q); if (p == r) { // create new cluster e that contains a and b Element e = cluster(p, q); newWork.add(e); } else { // can't cluster yet, try again later newWork.add(p); // add back to worklist } } if (newWork.size() == 1) // we have a single cluster break; ws.addAll(newWork); //add new nodes to worklist kdtree.clear(); kdtree.addAll(newWork); newWork.clear(); } |
Figure 2: Pseudocode for Agglomerative Clustering
Data Structures
There are two key data structures used in Agglomerative Clustering:
- Unordered Set
- The points left to be clustered can be processed in any order, so the worklist holding those points is represented as an unordered set.
- KD-Tree
- The kd-tree represents a hierarchical decomposition of the point space in a manner similar to an octree. It is built prior to performing clustering as a means of accelerating nearest-neighbor queries. Rather than requiring O(n) time to find a point's nearest neighbor, the kd-tree allows the search to occur in O(log n) time. It is kept up-to-date throughout execution by removing clustered points and adding newly created clusters.
Parallelism
The active nodes in agglomerative clustering are the points to be clustered; they can be processed in any order. Intuitively, two points can be processed simultaneously if their clustering decisions are independent. This means that the following conditions must hold: (i) the formed clusters must involve different points (otherwise, both iterations would attempt to remove the same point from the kd-tree) and (ii) the newly added points must not interfere with the nearest neighbor computations performed by other iterations.
Performance
Figure 4 and 5 show the exeution time (miliseconds) and scalability of the algorithm relative to its performance with one thread using the C++ version of Galois. The input is a set of 10000.
Figure 3: Execution time of Agglomerative Clustering.
Figure 4: Scalability of Agglomerative Clustering.
Machine Description
Performance numbers are collected on a 4 package, 10 cores per package, Intel Xeon E7-4860 machine at 2.27GHz. Thread numbers above 40 use SMT. The machine has 128GB of ram with 20GB dedicated to 2M hugepages used by the Galois allocator. The operating system is Ubuntu Linux 10.04 LTS (Linux 2.6.32). All runs of the Galois benchmarks used gcc/g++ 4.7 to compile with a patch to the C++ runtime to improve the scalability of exception handling. All other implementations were compiled with Intel ICC 12.1 with thread parallelization provided by Intel Cilk.