# Maximum Cardinality Bipartite Matching

## Application Description

Given a graph *G*, a matching *M* is a subset of edges where no
two
edges share an endpoint. The cardinality |*M*| of *M* is the
number of
edges in *M* . Given an unweighted bipartite graph *G = (V, E) V =
(A, B)*, the maximum cardinality bipartite matching problem is to find a
matching with maximum cardinality.

## Algorithm

Algorithms to solve this problem try to extend existing matchings by finding augmenting paths. When there are no more augmenting paths, the matching is a maximum matching.

The Ford-Fulkerson algorithm [1] visits each node in *A* trying
to find an augmenting path at each. It has complexity *O(|V||E|)*.
The Alt-Blum-Mehlhorn-Paul (ABMP) algorithm [1] tries to find
the shortest augmenting path first. After all the augmenting paths of
less than a particular length are found or the number of unmatched
nodes is reduced to a particular size, the algorithm applies the
Ford-Fulkerson algorithm to complete the matching. The ABMP
algorithm has complexity *O(|E|sqrt(|V|))*.

This benchmark implements the Alt-Blum-Mehlhorn-Paul algorithm.

[1] K. Mehlhorn and S. Naeher. LEDA: A Platform for Combinatorial and Geometric Computing. Cambridge University Press, 1999.

## Data Structures

There are two key data structures used in this algorithm:

- Priority Scheduler
- The ABMP algorithm uses priority on tasks to find the shortest augmenting paths first.
- Directed Graph
- The algorithm successively reads and updates a graph.

## Parallelism

The parallelism in this algorithm is very data dependent.

## Performance

Figure 3 gives the performance of this algorithm. One thing to note about the the ABMP algorithm is that its performance is sensitive to the scheduling of iterations and different schedulers result in different amounts of work, leading to the possibility of super-linear speedup.

Figure 1: Performance results on a random graph. The input is a bipartite graph G = (A, B, E) where |A| = |B| = 10^6 , and there are 10^8 edges. Nodes in each partition are divided evenly into 10^4 groups. Each node in A has degree d = |E|/|A| and the edges out of a node in group i of A go to random nodes in groups i + 1 and i − 1 of B.

## Machine Description

Performance numbers are collected on a 4 package, 10 cores per package, Intel Xeon E7-4860 machine at 2.27GHz. Thread numbers above 40 use SMT. The machine has 128GB of ram with 20GB dedicated to 2M hugepages used by the Galois allocator. The operating system is Ubuntu Linux 10.04 LTS (Linux 2.6.32). All runs of the Galois benchmarks used gcc/g++ 4.7 to compile with a patch to the C++ runtime to improve the scalability of exception handling. All other implementations were compiled with Intel ICC 12.1 with thread parallelization provided by Intel Cilk.