Discrete Event Simulation

Application Description

The Discrete Event Simulation (DES) algorithm is commonly used to simulate systems of interacting stations, e.g., circuit simulation, units in a battle field, assembly line etc. The interacting objects in the system are modeled as stations, which store some state. During the course of simulation, the stations interact with each other via messages or events, and cause changesin their own state or the state of other stations. The changes are modeled as events being sent from a station to other stations. An event causes the target station to potentially update its state and generate new events destined for other stations in the system dependent upon this change. The system is represented as a graph where nodes are the stations and edges represent the communication links between the stations. Every event has a time stamp associated with it. The system evolves by processing the events in their time stamp order. The data parallelism in this algorithm arises from multiple events executing on different stations in the system. This benchmark uses Discrete Event Simulation to model logic circuit simulation. Here the processing stations are logic gates, which are interconnected with each other through wires (edges in the graph). The events in the system represent a change of value on the input or output of a gate. The simulation starts with some predefined initial events for the inputs of the circuit. Processing these initial events causes additional events to be generated in the system. When the input of a gate changes, it updates its output and generates new events to update the inputs of the gates in its fanout and so on. The simulation finishes when there are no unprocessed events.

Ordered Algorithm

The basic condition for correctness of the simulation is that events be processed in time stamp order. Therefore, the commonly used sequential algorithm orders the events in a priority queue. The algorithm iterates over this priority queue, removing the event with least time stamp during each iteration, and processing it. Processing of an event may generate new events which are added back to the priority queue for processing. The algorithm terminates when the priority queue becomes empty. Figure 1 below shows the pseudocode for this algorithm.

1 2 3 4 5 6 7 8 9 10 11 Graph circuit = /* create graph from netlist */ Set events = /* initial events */ PriorityQueue pq = new PriorityQueue(); pq.addAll(events); while (!pq.isEmpty()) { Event e = pq.removeMin(); Station s = e.targetObj(); List newEvents = s.simulate(e); pq.addAll(newEvents); }

Figure 1: Pseudocode for Discrete Event Simulation: Sequential Ordered Algorithm.

Unordered Algorithm

Chandy and Misra [1] proposed an unordered algorithm for Discrete Event Simulation. Their algorithm is based on an insight from data-flow architectures that a station must adhere to certain firing rules. For example, a station can fire, i.e., process events, only after it has received events on all of its inputs, at which point it can determine a set called Ready-Events. Let T(i) be the timestamp of the latest event received on input i of a station. Then the set Ready-Events is { Event e | timestamp(e) <= min(T(i) forall i) }. Here min(T(i)) is the minimum of the time stamps of the latest event received on all inputs i. The key observation here is that the events received on an input always have monotonically increasing timestamps. A station can therefore safely process all events in the Ready-Events because it will never receive an event with a timestamp lower than min(T(i)) on any of its inputs. A station with a non-empty Ready-Events set is called an active station.

Even though Chandy-Misra's algorithm is (globally) unordered, each station processes its Ready-Events in timestamp order. Moreover, if a station sends a message with timestamp T to one of its neighbors, it must send a NULL message with timestamp T to the remaining neighbors that have not received a message with timestamp T to avoid deadlock. The NULL messages do not correspond to any change in state but merely serve to inform other stations of how far a station has advanced its simulation time. When a station processes an incoming NULL message, it sends out corresponding NULL messages to all of its neighbors. Figure 3 shows the pseudocode for the unordered DES algorithm. The graph nodes corresponding to active stations are stored in an unordered workset. In each iteration, the call to simulate() computes and processes the Ready-Events (line 6). The outgoing neighbors that have become active are added to the workset(lines 7-10). The algorithm terminates when there are no active stations left.

1 2 3 4 5 6 7 8 9 10 11 12 13 Graph g = /* read in graph */; Workset ws = new Workset(); ws.addAll(g.nodes()); foreach (Node n : ws) { Station s = n.getStation(); s.simulate(); // process ready events for (Node m: g.getNeighbors(n)) { if (m.isActive()) ws.add(m); } if (n.isActive()) ws.add(n); }

Figure 2: Pseudocode for unordered algorithm for Discrete Event Simulation.

[1] K. Mani Chandy and Jayadev Misra, Distributed Simulation: A case study in design and verification of distributed programs, IEEE Trans. Software Eng. 1979

Data Structures

There are two key data structures used in unordered Discrete Event Simulation:

Unordered Set
The unordered workset used to hold the stations with nonempty Ready-Events set.
Graph
The network of gates is represented as a directed graph. The logic gates are the nodes in the graph and the interconnecting wires are the edges in the graph.

Available Parallelism

The active elements in the unordered DES algorithm are stations that have non-empty Ready-Events set. An iteration processing events on a station needs to acquire ownership of the station itself and its outgoing neighbors, to which it can send events. Therefore the neighborhood comprises a station and its outgoing neighbors.

Figure 4 shows the available parallelism of the unordered DES algorithm for an input circuit of a 6x6 tree multiplier. Initially the input ports are the active stations because the inputs store the initial events of the simulation. Thus, initially there is low available parallelism. The circuit has large fanouts in the middle, therefore the number of active stations (thus the parallelism) increases rapidly. The available parallelism decreases towards the end because there are few outputs and due to the fact that there are no more active stations left.

Figure 3: Available parallelism in unordered Discrete Event Simulation.

Important Optimizations

One Shot: Usually, parallel activities need to save undo operations for the changes made to graph and other shared data, so that if an activity aborts, the shared data can be restored to its original state. Saving and performing these undo operations incurs an overhead. Our implementation of Chandy-Misra's algorithm employs One-Shot optimization, where at the beginning of an iteration, the executing thread acquires ownership (abstract locks) on the node being processed and all of its out-going neighbors. A thread that fails to acquire all the locks aborts itself and does not have to perform any undo operations. While on the other hand, a thread that succeeds in acquiring all the locks does not need save any undo operations because it is guaranteed to complete the iteration.

No Duplicates on Worklist: Galois worklists, currently, do not protect against addition of duplicate items i.e. do not support set semantics. Putting duplicates on the workset in unordered DES does not affect correctness, but, the duplicates generate unnecessary iterations , and, increase chance of conflict among threads. Two threads processing the same node will conflict with each other when acquire abstract locks on the node. To ensure uniqueness of items on the worklist, we keep a list of boolean flags for each node, which indicate whether the node is on the worklist. When adding a node to the worklist, the flag corresponding to a node is set to True if it was previously False. The flag reset to False when the node is removed from the worklist. This list of flags provides a cheap way of implementing set semantics.

Events processed per iteration: The size of Ready-Events set can grow very big for some nodes in the graph for certain inputs. This size also depends on scheduling policy used for active nodes. A naive policy would be to process just one event from Ready-Events set whenever the node becomes active. However, overheads such as accessing the work-set dominate in this policy. On the other extreme, we can choose to process all of the nodes in the Ready-Events set. This can lead to load imbalance because the time per iteration becomes very big. Therefore we define, as a parameter, an upper bound on the number of events to be processed from Ready-Events set in each iteration. This parameter is also used to control the size of Ready-Events set on the outgoing neighbors of a station. If the size were not controlled, the scalability of the implementation becomes limited by serialization due to memory allocations in the parallel section. Using this parameter, we can reserve enough memory for Ready-Events set at initialization time to avoid most of memory allocations in the parallel section.

Performance

We measured the performance of our Galois C++ based implementation of unordered DES. We used a 64 bit Kogge-Stone tree adder circuit as our input. The corresponding graph contains 1306 nodes and 2289 edges. The simulation performs 89 million events. Figure 3 shows the execution time of this experiment for different number of threads. at higher number of threads is limited due to lack of enough parallelism. Threads have to wait more often to remove items from the worklist, which becomes a bottleneck. Nevertheless, our implementation scales to 15 on this large input.

Figure 3: Performance results.

Machine Description

Performance numbers are collected on a 4 package, 10 cores per package, Intel Xeon E7-4860 machine at 2.27GHz. Thread numbers above 40 use SMT. The machine has 128GB of ram with 20GB dedicated to 2M hugepages used by the Galois allocator. The operating system is Ubuntu Linux 10.04 LTS (Linux 2.6.32). All runs of the Galois benchmarks used gcc/g++ 4.7 to compile with a patch to the C++ runtime to improve the scalability of exception handling. All other implementations were compiled with Intel ICC 12.1 with thread parallelization provided by Intel Cilk.