Maximum Cardinality Bipartite Matching

Application Description

Given a graph G, a matching M is a subset of edges where no two edges share an endpoint. The cardinality |M| of M is the number of edges in M . Given an unweighted bipartite graph G = (V, E) V = (A, B), the maximum cardinality bipartite matching problem is to find a matching with maximum cardinality.

Algorithm

Algorithms to solve this problem try to extend existing matchings by finding augmenting paths. When there are no more augmenting paths, the matching is a maximum matching.

The Ford-Fulkerson algorithm [1] visits each node in A trying to find an augmenting path at each. It has complexity O(|V||E|). The Alt-Blum-Mehlhorn-Paul (ABMP) algorithm [1] tries to find the shortest augmenting path first. After all the augmenting paths of less than a particular length are found or the number of unmatched nodes is reduced to a particular size, the algorithm applies the Ford-Fulkerson algorithm to complete the matching. The ABMP algorithm has complexity O(|E|sqrt(|V|)).

This benchmark implements the Alt-Blum-Mehlhorn-Paul algorithm.

[1] K. Mehlhorn and S. Naeher. LEDA: A Platform for Combinatorial and Geometric Computing. Cambridge University Press, 1999.

Data Structures

There are two key data structures used in this algorithm:

Priority Scheduler
The ABMP algorithm uses priority on tasks to find the shortest augmenting paths first.
Directed Graph
The algorithm successively reads and updates a graph.

Parallelism

The parallelism in this algorithm is very data dependent.

Performance

Figure 3 gives the performance of this algorithm. One thing to note about the the ABMP algorithm is that its performance is sensitive to the scheduling of iterations and different schedulers result in different amounts of work, leading to the possibility of super-linear speedup.

Figure 1: Performance results on a random graph. The input is a bipartite graph G = (A, B, E) where |A| = |B| = 10^6 , and there are 10^8 edges. Nodes in each partition are divided evenly into 10^4 groups. Each node in A has degree d = |E|/|A| and the edges out of a node in group i of A go to random nodes in groups i + 1 and i − 1 of B.

Machine Description

Performance numbers are collected on a 4 package, 10 cores per package, Intel Xeon E7-4860 machine at 2.27GHz. Thread numbers above 40 use SMT. The machine has 128GB of ram with 20GB dedicated to 2M hugepages used by the Galois allocator. The operating system is Ubuntu Linux 10.04 LTS (Linux 2.6.32). All runs of the Galois benchmarks used gcc/g++ 4.7 to compile with a patch to the C++ runtime to improve the scalability of exception handling. All other implementations were compiled with Intel ICC 12.1 with thread parallelization provided by Intel Cilk.