# Barnes-Hut

## Application Description

This benchmark simulates the gravitational forces acting
on a galactic cluster using the Barnes-Hut *n*-body
algorithm [1]. The positions and velocities of
the *n* galaxies are initialized according to the
empirical Plummer model. The program calculates the motion
of each galaxy through space for a number of time
steps. The data parallelism in this algorithm arises
primarily from the independent force calculations.

[1] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324(4):446-449, December 1986.

## Algorithm

Barnes-Hut's force-calculation algorithm employs a
hierarchical data structure, called an octree, to
approximately compute the force (e.g., gravitational,
electric, or magnetic force) that the *n* bodies in
the system induce upon each other. With *n* bodies,
O(*n*^{2}) interactions need to be
considered, i.e., the precise calculation is quadratic in
the number of bodies. The Barnes-Hut algorithm
hierarchically partitions the volume around the *n*
bodies into successively smaller cells. Each cell forms
an internal node of the octree and summarizes information
about the bodies it contains, in particular their combined
mass and center of gravity. The leaves of the octree are
the individual bodies. This hierarchy reduces the time to
calculate the force on the *n* bodies to O(*n*
log *n*) because, for cells that are sufficiently far
away, it suffices to perform only one force calculation
with the cell instead of performing one calculation with
each body inside the cell. For example, consider the
two-dimensional hierarchical subdivision of space in
Figure 1. Sometime during the force calculation of the
galaxy that is the source of the arrows, the algorithm
will check whether the red cell's center of gravity (red
circle) is far enough away. Because it is not, the code
proceeds to look at all subcells and performs the force
calculations marked by the green arrows. When the
algorithm considers the orange cell, however, it will find
that this cell is sufficiently far away and will therefore
only perform one force calculation (orange arrow) and
will *not* look deeper into this cell (blue arrows).

Figure 1: Hierarchical force calculation example.

In more detail, the algorithm proceeds as follows (pseudocode is provided in Figure 2). First, the set of bodies is initialized with the starting location and velocity of each body (line 1). Then the code iterates over the time steps (line 2). In each iteration, a new octree (i.e., spatial hierarchy) is generated by inserting all bodies (lines 3-5). Then the cumulative mass and center of mass of each cell is recursively computed (line 6). Next, the force acting upon each body is computed (lines 7-8) by traversing the octree. The traversal along any path is terminated as soon as a leaf node (i.e., a body) or an internal node (i.e., a cell) that is far enough away is encountered. Finally, each body's position and velocity are updated based on the computed force (lines 9-10).

1 2 3 4 5 6 7 8 9 10 11 | Set bodies = /* read input */; for (int step = 0; step < maxTimestep; step++) { Octree octree = new Octree(); foreach (Body b : bodies) octree.Insert(b); octree.SummarizeSubtrees(); foreach (Body b : bodies) b.ComputeForce(octree); foreach (Body b : bodies) b.Advance(); } |

Figure 2: Pseudocode for Barnes-Hut

## Data Structures

There are two key data structures used in Barnes-Hut:

- Unordered set
- The bodies are stored in an unordered set as they can be processed in any order. Once initialized, the set does not change. Only the bodies' attributes are modified in each time step.
- Octree
- The spatial hierarchy is represented by an octree (the 3-dimensional equivalent of a binary tree), where each node has up to eight children. The leaves of the octree correspond to individual bodies whereas the internal nodes represent the cells.

## Parallelism

While many phases of Barnes-Hut can be parallelized (including building the octree and calculating the summary information), we focus on parallelizing the force-computation step, which consumes the vast majority of the runtime. The active nodes in this step are the bodies. Calculating the force on each body requires reading some portion of the octree, so the accessed nodes form the body's neighborhood. Because these accesses are read-only, the bodies can be processed in any order and in parallel, provided that a body does not update its position or velocity until all force computations are complete. For data locality reasons, it may be beneficial to process spatially close bodies together.

Figure 3 shows the available parallelism in the Barnes-Hut algorithm for an input of 10,000 points generated with the Plummer model and one time step.

Figure 3: Available parallelism in Barnes-Hut algorithm.

## Performance

Figure 3: Performance results on a 50,000 points generated with the Plummer model for one timestep.

## Machine Description

Performance numbers are collected on a 4 package, 10 cores per package, Intel Xeon E7-4860 machine at 2.27GHz. Thread numbers above 40 use SMT. The machine has 128GB of ram with 20GB dedicated to 2M hugepages used by the Galois allocator. The operating system is Ubuntu Linux 10.04 LTS (Linux 2.6.32). All runs of the Galois benchmarks used gcc/g++ 4.7 to compile with a patch to the C++ runtime to improve the scalability of exception handling. All other implementations were compiled with Intel ICC 12.1 with thread parallelization provided by Intel Cilk.