Breadth-First Search

Problem Description

This benchmark computes the level of each node from a source node in an unweighted graph.


The computation is implemented in a topology-driven manner; all the graph edges are tried for relaxation in every iteration and another iteration is executed if at least one edge was relaxed in the previous iteration. This continues until a fixed-point.
  for all nodes src {
    lsrc = level[src];
    for each edge <src, dst> {
      ldst = level[dst];
      if ldst > lsrc + 1 {
         level[dst] = lsrc + 1;
         anotheriteration = true;
The body of the outer for loop is implemented as the kernel. The kernel uses global barriers to avoid repeated invocations from the host CPU. The inner for loop is distributed according to strategy described in Merrill et al. (PPoPP 2010)


The benchmark is written in CUDA.
Compile as: make bfs-wlc
Execute as: bfs-wlc <graphfile>
e.g., bfs-wlc

GraphTime (ms)

The above performance is observed on a 1.45 GHz Quadro 6000 with 6 GB of main memory and 448 cores distributed over 14 SMs. It has 64 kB of fast memory per SM that is split between the L1 data cache and the shared memory.