Delaunay Mesh Refinement

Problem Description

This benchmark is an implementation of the algorithm described by Kulkarni et al. [1]. The application produces a guaranteed quality Delaunay mesh, which is a Delaunay triangulation with the additional constraint that no angle in the mesh be less than 30 degrees. The benchmark takes as input an unrefined Delaunay triangulation and produces a new mesh that satisfies the quality constraint. The data parallelism in this algorithm arises from the badly shaped triangles that must be "fixed."

[1] Milind Kulkarni, Keshav Pingali, Bruce Walter, Ganesh Ramanarayanan, Kavita Bala and L. Paul Chew. Optimistic Parallelism Requires Abstractions. In Proceedings of the ACM Conference on Programming Languages Design and Implementation (PLDI), 211 - 222, June 2007.

Algorithm

A 2D Delaunay mesh is a triangulation of a set of points with the following property: the circumcircle of any triangle in the mesh must contain no other point from the mesh. A refined Delaunay mesh is a Delaunay mesh with the additional constraint that no triangle have an angle less than 30 degrees. The algorithm takes an input Delaunay mesh, which may contain triangles that do not meet the quality constraints called bad triangles, and produces a refined mesh by iteratively re-triangulating the affected portions of the mesh.


Figure: Before refinement and after refinement

The algorithm is implemented in a topology-driven manner. In each step, the refinement procedure (i) checks if the triangle assigned to it is bad, (ii) collects the affected triangles in the neighborhood of that bad triangle called the cavity, and (iii) re-triangulates the cavity. If this re-triangulation creates new badly-shaped triangles in the cavity, they are processed as well until all bad triangles have been eliminated from the mesh. The order in which the bad triangles are processed is irrelevant, all orders lead to a valid refined mesh.

  for all mesh triangles t {
    if t is bad {
      cavity = findcavity(t);
      newcavity = retriangulate(cavity);
      updatemesh(cavity, newcavity);
    }
  }
The body of the for loop is implemented as the kernel. The kernel is repeatedly called from the host until there are no more bad triangles in the mesh.

Optimizations

The computation implements several optimizations.
  • Race and resolve: Instead of relying on expensive locking, the code implements a barrier-based exclusive ownership detection to avoid conflicts.
  • Triangles are distributed to threads to maximize chances of non-overlapping cavities.

Performance

The benchmark is written in CUDA.
Compile as: make
Execute as: dmr <meshfile>
e.g., dmr r1M
Test inputs (files with extensions .ele, .node, .poly) can be downloaded from this url.

MeshTime (ms)
250k.2 (250000)619.05
r1M (1 million)1651.23
r5M (5 million)8172.59

The above performance is observed on a 1.45 GHz Quadro 6000 with 6 GB of main memory and 448 cores distributed over 14 SMs. It has 64 kB of fast memory per SM that is split between the L1 data cache and the shared memory.