Tuning Galois Programs
This section provides some Galois-specific tuning hints that we have found useful to improve the performance of our codes.
Graph class selection
Galois programs have to use one of the provided concurrent graph classes for all data that is shared between parallel activities. Several graph implementations are available that are optimized for different uses. The three currently supported implementations are listed in Table 1.
Table 1: Supported graph classes
Each of these three graph classes contains a builder to create graphs of this type that are tailored to specific needs. For example, the indexed tree is optimized for tree-based algorithms that do not need edge data and allows the indexing of specific children, such as the first child of a node in a search tree. The user can tailor the graph by specifying the maximum number of children per node (e.g., two in case of a binary tree). To maximize the performance and minimize the memory consumption, this number should always be chosen as small as possible.
The morph graph is the most general graph. It can store any type of object in each node and edge. However, the builder allows the user to specify what type of data, if any, the edges can hold. The supported types are
void (no edge data). Using objects results in the slowest execution and should therefore be avoided if one of the other types can be used instead. The user has to specify whether the edges are directed or undirected, which is application dependent. The user can choose between an implementation that internally uses a
vector or a
hashset. The vector-based implementation is preferable for sparse graphs (where all nodes have few neighbors) whereas the hashset-based implementation results in better performance for dense graphs (where at least some nodes have many neighbors at some time during the computation).
The local computation graph does not permit operations that modify the graph in any way, such as adding or removing edges or nodes. Hence, it can only be used for computations that merely modify the values contained in the nodes or edges. Whereas using a local computation graph makes local computation operations much faster, a local computation graph has to first be generated by copying the structure and the values from a morph graph, which takes time. Thus, using a local computation graph is only worthwhile if more than O(E+V) operations will be performed on it. If the edges of the graph have one data field of type
double, the user can request an optimized local computation graph from the graph builder that is tailored towards these edge data types. Using the most restrictive graph version typically results in the best performance and the smallest memory footprint.
The parallel foreach loops in Galois take an initial worklist as a parameter. Because any class that implements the
List interface can be passed, it is often unnecessary to first populate an explicit worklist. Instead, the original object should be passed directly, or a List-based view can be passed such as
The internal worklist is not directly accessible. For example, it cannot be queried to determine whether a specific item is in the list or not. Hence, we sometimes found it useful to add a flag to our worklist objects that is programmatically updated to indicate whether the object is on the worklist or not.
Table 2: Selection of supported schedules
The Galois system currently supports many different schedules for determining the order in which the worklist elements are handed out to the worker threads. Depending on the application, selecting a good schedule can have a substantial impact on performance. Table 2 provides some examples. For the full list of supported schedules, see the
galois.runtime.wl package in the API section.
The FIFO schedule selects elements in First-In First-Out order whereas LIFO uses Last-In First-Out order. Chunked FIFO and LIFO are similar but assign an entire chunk (range of adjacent elements) from the worklist to the worker threads. The bounded versions limit the maximum number of elements the worklist can hold but are a little faster. RandomOrder uses a random number generator to select each element, which can be slow. ChunkedRandomOrder is faster because it randomly assigns chunks to threads but does not randomize the elements within a chunk. Note that it is possible to build a hierarchy of schedules. For example, first specifying ChunkedFIFO and then LIFO means that chunks of elements are assigned to worker threads in FIFO order, but each thread will see the elements from its chunk in LIFO order. More detail can be found in the Foreach Usage section. If the schedule does not matter, ChunkedFIFO should be used. If it is unclear which schedule to use for a given application, we recommend experimenting with the different schedules.
Method invocation flags
Most methods in the Galois graph classes allow an optional flag to be passed as the final parameter. This flag can be used to turn off conflict detection and/or saving of undo information to make the runtime system faster. However, these flags may only be used under certain conditions where their usage does not interfere with program correctness. See the
galois.objects.MethodFlag class in the API section.
The implementation of an operator can sometimes be changed so that it still computes the same result but exposes more opportunity to exploit these flags. One very important code transformation in this regard is making operators cautious, possibly using the one-shot optimization [PPoPP'10]. If an operator is guaranteed to first execute read-only methods that touch all nodes of the graph that later method invocations in the same operator will touch, the operator is said to be cautious. As long as all the nodes have been touched and locked before the first update in a cautious operator, any possible conflict will already have been raised or the operator is guaranteed to terminate without a conflict. Therefore, it is not necessary to check for conflicts or to save undo information for any method call after and including the first method that performs a graph update in a cautious operator. The flags can be used to communicate this information to the system, thus making the program execution much faster.
If the operator cannot directly be made cautious, the one-shot optimization may be used to make it so if the nodes an operator will touch are statically known. For example, if the operator will only touch the direct neighbors of the active node (or a subset thereof), one can insert a method call at the beginning of the operator that reads the direct neighbors (and discards the result). As a side effect, this invocation will lock all the neighbors, making the operator cautious and therefore allowing the programmer to invoke all further method calls without conflict detection or saving undo information.
If none of these optimizations apply, it may still be possible to optimize some method calls. For example, when building a tree top down, the tree is modified at the leafs and the rest of the tree is only read. Hence, it is not necessary (and bad for performance) to lock the entire path from the root to the leaf where a new node needs to be inserted. Instead, the leaf should first be found in read-only mode using the flag that disables conflict detection. Then the leaf should be read again, but this time with conflict detection enabled to ensure that it is not currently in use by another activity. Finally, the new node can be inserted without saving undo information since no other activity can insert a node at the same place in the tree. See the binary search tree code in the Getting Started section for an example.
Finally, some foreach loops only read the shared data. For such reader operators, conflict detection and saving undo information can always be turned off.
General performance improvements
Almost all general (i.e., non-Galois specific) code optimizations will also help make Galois code run faster and should therefore be applied. One important example is to reuse objects as much as possible to avoid calls to "new". To get the most from parallelization, as much work as possible should be performed inside the parallel loops. Because there is a fixed overhead associated with each loop iteration, it is important to perform enough work per iteration so that the overhead is small in comparison. To access the neighbors of a node, a map method needs to be called, which takes a
Lambda object as a parameter, i.e., a function to be applied to each neighbor (see the Code Example section for an example). Since creating
Lambda objects is expensive, they should be reused as much as possible for best performance. Finally, Galois application writers should strive to make their computations as local as possible to minimize conflicts between concurrent activities and to maximize cache performance.