Barnes-Hut N-body Simulation

Problem Description

This benchmark simulates the gravitational forces acting on a galactic cluster using the Barnes-Hut n-body algorithm [1]. The positions and velocities of the n galaxies are initialized according to the empirical Plummer model. The program calculates the motion of each galaxy through space for a number of time steps. The data parallelism in this algorithm arises primarily from the independent force calculations.

[1] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324(4):446-449, December 1986.


Barnes-Hut's force-calculation algorithm employs a hierarchical data structure, called an octree, to approximately compute the force (e.g., gravitational, electric, or magnetic force) that the n bodies in the system induce upon each other. With n bodies, O(n2) interactions need to be considered, i.e., the precise calculation is quadratic in the number of bodies. The Barnes-Hut algorithm hierarchically partitions the volume around the n bodies into successively smaller cells. Each cell forms an internal node of the octree and summarizes information about the bodies it contains, in particular their combined mass and center of gravity. The leaves of the octree are the individual bodies. This hierarchy reduces the time to calculate the force on the n bodies to O(n log n) because, for cells that are sufficiently far away, it suffices to perform only one force calculation with the cell instead of performing one calculation with each body inside the cell. For example, consider the two-dimensional hierarchical subdivision of space in Figure 1. Sometime during the force calculation of the galaxy that is the source of the arrows, the algorithm will check whether the red cell's center of gravity (red circle) is far enough away. Because it is not, the code proceeds to look at all subcells and performs the force calculations marked by the green arrows. When the algorithm considers the orange cell, however, it will find that this cell is sufficiently far away and will therefore only perform one force calculation (orange arrow) and will not look deeper into this cell (blue arrows).

Set bodies = /* read input */; 
for (int step = 0; step < maxTimestep; step++) { 
  Octree octree = new Octree(); 
  foreach (Body b: bodies) 
  foreach (Body b : bodies) 
  foreach (Body b : bodies) 
The body of the outer for loop is implemented as a sequence of kernels.


The computation implements several optimizations.
  • Warp-based execution: The force-computation kernel expands octree-prefixes to make sure all warp-threads execute the same instruction.
  • Sorting: The code sorts bodies based on their distance to improve execution.
  • Wait-free pre-pass: The consumer threads, instead of waiting for all the producers to produce the leaf nodes, start processing the children as and when they are ready.
  • Combined synchronization: The code exploits the synchronization hierarchy to execute a __syncthread before __threadfence.


The benchmark is written in CUDA.
Compile as: make or nvcc -O3 -arch=sm_20 -o bh
Execute as: bh <bodies> <timesteps> <deviceid>
e.g., bh 50000 2 0

Number of bodiesTime stepsTime (ms)

The above performance is observed on a 1.45 GHz Quadro 6000 with 6 GB of main memory and 448 cores distributed over 14 SMs. It has 64 kB of fast memory per SM that is split between the L1 data cache and the shared memory.