Asynchronous Variational Integrators

Application Description

AVI is an algorithm for performing Elastodynamic simulation with minimal number of updates. In Elastodynamic simulation, we simulate changes in shape of an elastic material under the effect of a force. Typical strategies discretize the body being simulated into a Finite Element Mesh and perform simulation updates after time intervals to compute the position and velocities. A naive approach to discretize time is to pick a time interval that is safe for all regions of the mesh and perform updates on the whole mesh in regular steps. This safe time interval is determined by the fastest changing regions of the mesh. However, we can observe that the slow changing regions of the mesh can afford update intervals larger than this safe interval. Thus using this naive approach leads to many extra updates. AVI uses this intuition and assigns a time-step to each element, which is the smallest safe interval at which the element should be updated. In order to ensure that time increases monotonically over the whole mesh, AVI assigns a time stamp to each element and processes the elements in time stamp order. The time stamp marks when the element should be updated and is incremented by the time step of element on each simulation update. AVI thus avoids performing any extra update simulation updates and also guarantees stability of the computation.

[1] Lew A., Marsden J., Ortiz M. & West M., "Asynchronous variational integrators", Archive for Rational Mechanics & Analysis, 167 (2), 85-146, 2003.

Sequential Algorithm

The sequential algorithm for AVI performs simulation updates in the increasing order of time stamps. Initially, the time stamp of each element is initialized to its time step and the elements are put in a priority queue, which orders the elements by time stamp. The algorithm iterates over the priority queue in a loop. In each iteration, the element with smallest time stamp is removed from the priority queue. The simulation updates are performed upon the element and its time stamp is incremented by its time step. The element is added back to the priority queue if its time stamp hasn't exceeded the ending time of the simulation. Figure 1 below gives the pseudocode for this sequential algorithm.

1 2 3 4 5 6 7 8 9 PriorityQueue pq = /* all mesh elements ordered by time stamp */ while (!pq.isEmpty()) { AVIelement elem = pq.removeMin(); elem.simulate(); elem.timeStamp = elem.timeStamp + elem.timeStep; if (elem.timeStamp < simulationEndTime) { pq.add(elem); } }

Figure 1: Pseudocode for ordered sequential algorithm for AVI.

Unordered Algorithm

During a simulation udpate to a mesh element, we update values of some target functions per each vertex of the mesh-element. Thus each element shares the data values on its vertices with nearby elements that share a vertex with this element. Let us define a graph called Element Adjacency Graph (EAG), where we have a node for each element and an edge between two nodes if elements corresponding to these nodes share a vertex in the mesh. A mesh element depends, for its update, upon values from neighboring elements in EAG if they bear a time stamp smaller than this element. Kale and Lew [2] showed that picking the element with smallest time stamp from a priority queue is too restrictive and it is sufficient to ensure that the mesh element being updated has the minimum time stamp among neighboring elements i.e. it does not depend on any updates from its neighbors for values on its mesh vertices. Let us call a mesh element "active" if it has smallest time stamp among its neighbors in the EAG. The unordered algorithm shown in Figure 2 maintains an unordered workset of nodes corresponding to active mesh elements (line 2). The algorithm iterates over this unordered workset, picking a node in each step. The element corresponding to the active node is updated in the same way as the sequential ordered algorithm (lines 4-6). Then we iterate over the neighbors of the active node in order to determine if any of the neighbor has become active or if the node being processed is still active i.e. has minimum time stamp among its neighbors (lines 7-15). The resulting active nodes are added back to the workset if their time stamps have not execeeded the ending time of the simulation. The algorithm terminates when the workset becomes empty.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 Graph eag = /* Element Adjacency Graph */; Workset ws = /* initial active mesh elements */; foreach (Node n: ws) { AVIelement elem = eag.getData(n); elem.simulate(); elem.timeStamp = elem.timeStamp + elem.timeStep; for (Node m: eag.neighbors(n)) { AVIelement melem = eag.getData(m); if (melem.isActive() && melem.timeStamp < simEndTime ) { /* has min time stamp among its neighbors */ ws.add(m); } } if (elem.isActive() && elem.timeStamp < simEndTime) { /* still has the min time stamp among neighbors */ ws.add(n); } }

Figure 2: Pseudocode for Unordered AVI algorithm.

[2] Kale K. & Lew A., "Parallel Asynchronous Variational Integrators", International Journal for Numerical Methods in Engineering, 70, 2007, 291-321. [3] J.-C. Huang, X. Jiao, R. M. Fujimoto, and H. Zha. "DAG-guided parallel asynchronous variational integrators with super-elements." In Proc. Summer Comp. Sim. Conf., San Diego, CA, USA, July 2007.

Data Structures

There are two key data structures used in unordered algorithm for Asynchronous Variational Integrators

Graph
The Element Adjacency Graph, which has a node for each mesh element. There is an edge between nodes if elements corresponding to them share a vertex in mesh.
Unordered Set
The unordered workset used to hold "active" nodes corresponding to mesh elements that have the smallest time stamp among their neighbors in EAG.

Available Parallelism

Figure 3 shows the parallelism profile for unordered AVI algorithm generated using ParaMeter. Input for this parallelism profile is a 1m x 1m Neo-Hookean plate with 3424 elements. The corresponding EAG has 3424 nodes and 19921 edges. The number of active nodes at any point during the execution depends upon the structure of the graph and time steps of the elements. We observe that the number of active nodes that can be processed in parallel at each step does not vary a lot as neighbors of active nodes become active. The available parallelism goes down only at the end of the simulation.

Figure 3: Available parallelism in unordered AVI algorithm.

Important Optimizations

In each iteration of the simulation, certain functions use temporary vectors. These temporary vectors allocate storage on the heap, and are destroyed at the end of the function. These frequent calls to memory allocation library routines proves to be a major bottleneck in the scalability of the implementation. Therefore, we store these vectors in per thread memory, so that they can be reused safely every iteration, thus avoiding the need for calling memory management routines. We used work chunking to reduce the serialization overhead of accessing the worklist. The worklist used here is a chunked FIFO.

Performance

Figure 4: Performance results. The input is a 10m x 10m Neo-Hookean plate with 166,000 elements and 84,000 nodes whose EAG has 166,000 nodes and 1 million edges. The horizontal line marks the time taken by serial ordered AVI algorithm for the same input.

Acknowledgements

A major portion of our code for this application comes from MPI/C++ implementation of AVI, shared generously by Prof. Lew and his group at Stanford. We are thankful for this collaboration and help with questions related to the code itself.

Machine Description

Performance numbers are collected on a 4 package, 10 cores per package, Intel Xeon E7-4860 machine at 2.27GHz. Thread numbers above 40 use SMT. The machine has 128GB of ram with 20GB dedicated to 2M hugepages used by the Galois allocator. The operating system is Ubuntu Linux 10.04 LTS (Linux 2.6.32). All runs of the Galois benchmarks used gcc/g++ 4.7 to compile with a patch to the C++ runtime to improve the scalability of exception handling. All other implementations were compiled with Intel ICC 12.1 with thread parallelization provided by Intel Cilk.