Fault Tolerance
New trends in supercomputing hardware are moving us towards platforms such as large arrays of commodity parts and Grid Computing that are larger and less reliable than their predecessors. While some such machines have a mean time to failure as short as a day, many scientific computing applications run for days, months and even as long as a year. Our work on Fault Tolerance is motivated by the supercomputing community's need for a software solution that can efficiently deal with hardware failures without placing an undue burden on the programmer. More specifically, we aim to provide automated rollback recovery for programs written in C that use an implementation of MPI for communication.
Our target is supercomputing, which means programs that run on a large number of processors, hold a large amount of state and perform large amounts of communication. As such it quickly becomes clear that:
- If we use checkpointing then we cannot save the entire process state as a system-level checkpointer would. With large programs running on many processors, the entire system state would be in the hundreds of gigabytes or even terabytes. Instead, we need to perform application-level checkpointing, which involves modifying the program itself so that it saves only the data necessary for recovery. For example, for ab initio protein folding one only needs to save the positions and velocities of the particles. Since our goal is to be as automatic as possible, we must use a compiler tool to automatically modify the program so that it saves its own state and minimizes the amount of state it saves.
- We cannot use message logging, since any such approach requires us to record all of our messages. We don't want to record these messages on disk because of the overhead and we cannot record them in RAM because we'd fill it up within minutes (as shown by our early tests with sample scientific programs). As such, message logging doesn't seem to be an option for implementing rollback recovery for scientific applications.
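To make the first point concrete, here is a minimal sketch of an application-level checkpoint for a particle simulation. It is hand-written for illustration and is not output from our tool; the `Sim` structure, the function names and the tiny particle count are all assumptions. Only the step counter, positions and velocities are written, while the force array is treated as scratch data that the program can recompute.

```c
#include <assert.h>
#include <stdio.h>

#define N_PARTICLES 4  /* tiny, for illustration only */

/* Hypothetical simulation state. Only step, pos and vel are needed to
   resume; force is assumed to be recomputed from pos at every step. */
typedef struct {
    long   step;
    double pos[N_PARTICLES][3];
    double vel[N_PARTICLES][3];
    double force[N_PARTICLES][3];  /* derived data: never checkpointed */
} Sim;

/* Write only the recovery-critical fields. Returns 0 on success, -1 on error. */
int save_checkpoint(const Sim *s, const char *path)
{
    assert(s != NULL && path != NULL);
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    int ok = fwrite(&s->step, sizeof s->step, 1, f) == 1
          && fwrite(s->pos, sizeof s->pos, 1, f) == 1
          && fwrite(s->vel, sizeof s->vel, 1, f) == 1;
    if (fclose(f) != 0) ok = 0;
    return ok ? 0 : -1;
}

/* Read the same fields back in the same order. force is left untouched
   and must be recomputed by the caller before the next step. */
int load_checkpoint(Sim *s, const char *path)
{
    assert(s != NULL && path != NULL);
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    int ok = fread(&s->step, sizeof s->step, 1, f) == 1
          && fread(s->pos, sizeof s->pos, 1, f) == 1
          && fread(s->vel, sizeof s->vel, 1, f) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

/* Bytes written per checkpoint; smaller than sizeof(Sim) because
   the force array is skipped. */
size_t checkpoint_bytes(void)
{
    return sizeof(long) + 2 * sizeof(double[N_PARTICLES][3]);
}
```

A system-level checkpointer would write all of `Sim` plus the rest of the process image; here the saving comes from knowing which fields are recoverable by recomputation.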
With this in mind, our work has proceeded as follows:
Single-Processor Checkpointing
- Our first goal has been to develop a compiler tool that can take an arbitrary C program and transform it into a version of itself that can save its own state. This tool has already been created by Dan Marques on top of Kamen Yotov's Bernoulli compiler as a source-to-source preprocessor. The programmer or some automated tool simply places potential checkpoint locations into the code (spots that presumably have a minimal amount of live state) and the tool transforms the program into a version that saves its state at those locations.
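The sketch below shows, by hand, roughly what a program with a potential checkpoint location might behave like after such a transformation. It is an assumption-laden illustration rather than the tool's actual output: the `LiveState` structure, the function names, the checkpoint interval and the `fail_at` parameter (used to simulate a crash) are all hypothetical, and a real transformation must also restore the call stack and heap, which this toy loop does not have.

```c
#include <assert.h>
#include <stdio.h>

#define CKPT_INTERVAL 10  /* hypothetical: checkpoint every 10 iterations */

/* Everything that is live at the checkpoint location. */
typedef struct { long i; long acc; } LiveState;

static int ckpt_save(const char *path, const LiveState *st)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    int ok = fwrite(st, sizeof *st, 1, f) == 1;
    if (fclose(f) != 0) ok = 0;
    return ok ? 0 : -1;
}

/* Leaves *st unchanged unless a complete checkpoint was read. */
static int ckpt_restore(const char *path, LiveState *st)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;  /* no checkpoint yet: fresh start */
    LiveState tmp;
    int ok = fread(&tmp, sizeof tmp, 1, f) == 1;
    fclose(f);
    if (ok) *st = tmp;
    return ok ? 0 : -1;
}

/* Sums 0..n-1. If fail_at >= 0, a crash is simulated at that iteration
   by returning -1. Calling the function again resumes from the last
   checkpoint instead of from iteration 0. */
long checkpointed_sum(long n, const char *path, long fail_at)
{
    assert(path != NULL);
    LiveState st = {0, 0};
    ckpt_restore(path, &st);  /* resume if a checkpoint exists */
    for (; st.i < n; st.i++) {
        if (st.i % CKPT_INTERVAL == 0)  /* potential checkpoint location */
            ckpt_save(path, &st);
        if (st.i == fail_at)
            return -1;                  /* simulated failure */
        st.acc += st.i;
    }
    return st.acc;
}
```

The checkpoint is placed at the top of the loop body because only two scalars are live there, which is the kind of spot the text describes. A production version would also write to a temporary file and rename it, so that a crash during the write cannot corrupt the previous checkpoint.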
Building on this basis, we are now looking at techniques to optimize the amount of state saved while checkpointing. Because we already have a compiler tool, we have direct access to the program's source code and have full freedom to develop various compile-time analysis techniques. Some examples of what we're looking at are:
- intelligent memory allocation for incremental checkpointing (allocating data that is modified together on the same memory page)
- live/dead data analysis (so that we can avoid checkpointing dead data)
- state saving vs. recomputation analysis (can we save time by recomputing some data instead of checkpointing it?)
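The first item can be illustrated with a small sketch of page-granularity incremental checkpointing. This is an assumption-heavy toy: it detects modified pages by comparing against a copy of the previous checkpoint, whereas real incremental checkpointers typically rely on page protection, and the function names and page size are hypothetical. It shows why placing data that is modified together on the same page reduces the amount written.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the pages of cur that differ from the previous checkpoint prev. */
size_t count_dirty_pages(const unsigned char *cur, const unsigned char *prev,
                         size_t len, size_t page_size)
{
    assert(page_size > 0);
    size_t dirty = 0;
    for (size_t off = 0; off < len; off += page_size) {
        size_t n = (len - off < page_size) ? len - off : page_size;
        if (memcmp(cur + off, prev + off, n) != 0)
            dirty++;
    }
    return dirty;
}

/* Copy only the modified pages into prev, which stands in for the
   checkpoint on stable storage. Returns the number of bytes "written". */
size_t incremental_checkpoint(const unsigned char *cur, unsigned char *prev,
                              size_t len, size_t page_size)
{
    assert(page_size > 0);
    size_t written = 0;
    for (size_t off = 0; off < len; off += page_size) {
        size_t n = (len - off < page_size) ? len - off : page_size;
        if (memcmp(cur + off, prev + off, n) != 0) {
            memcpy(prev + off, cur + off, n);
            written += n;
        }
    }
    return written;
}
```

With four modified bytes spread over four pages, four pages must be saved; with the same four bytes allocated next to each other, only one page is dirty.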
Distributed Checkpoint Coordination
- On the distributed front, we have designed a protocol that can deal with the special needs of application-level checkpointing. Specifically, the fact that application-level checkpoints can only be taken at specific points in the program (rather than just anywhere as with system-level checkpointing) means that their recovery lines do not form consistent cuts (in the Chandy-Lamport sense). As a result, our protocol keeps around a log of messages and non-deterministic events (similar to that used in message logging but much shorter) that allows us to perform recovery.
- Another key feature of our approach is that it can handle all of the complexities of the MPI standard. In addition to supporting the basic MPI message-passing constructs such as MPI_Send() and MPI_Recv(), it can be extended to deal with all of MPI, including non-blocking sends and receives, collective communication, communicators and derived datatypes. Furthermore, because we are dealing with MPI at the application level, we need no knowledge of the actual implementation of the MPI standard. As such, our solution is portable across MPI implementations and can take advantage of the efficiencies of each platform's local implementation of MPI.
- This protocol is currently being implemented by Greg Bronevetsky. While it already supports the major parts of MPI (such as most communication constructs) and has been shown to have low overhead in our tests, the implementation is still in progress and won't be out in stable form for another few months.
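As a rough illustration of why a short message log is needed when recovery lines are not consistent cuts, the sketch below classifies each received message by comparing a checkpoint count ("epoch") carried with the message against the receiver's own. It is a simplified assumption, not the protocol's actual interface: the type and function names, the fixed-size log and the idea of passing the sender's epoch as an explicit argument are all hypothetical, and no MPI calls are made. In a real system this logic would sit in a wrapper around calls such as MPI_Recv().

```c
#include <assert.h>
#include <string.h>

#define LOG_CAPACITY 8   /* hypothetical limits, for illustration */
#define MAX_PAYLOAD  32

typedef enum { MSG_INTRA_EPOCH, MSG_LATE, MSG_EARLY } MsgClass;

typedef struct { int src; int len; unsigned char data[MAX_PAYLOAD]; } LoggedMsg;

/* epoch = number of local checkpoints this process has taken so far */
typedef struct { int epoch; int n_logged; LoggedMsg log[LOG_CAPACITY]; } Process;

/* A message crosses the recovery line if sender and receiver were in
   different epochs when it was sent and received. */
MsgClass classify_message(int sender_epoch, int receiver_epoch)
{
    if (sender_epoch < receiver_epoch) return MSG_LATE;   /* sent before the sender's checkpoint,
                                                             received after the receiver's */
    if (sender_epoch > receiver_epoch) return MSG_EARLY;  /* the reverse situation */
    return MSG_INTRA_EPOCH;
}

/* Called by a hypothetical receive wrapper once the payload has arrived.
   Late messages are logged so they can be replayed on recovery, because
   the restarted sender will not send them again. Returns the message
   class, or -1 if the message cannot be logged. Handling of early
   messages and truncation of the log are left out of this sketch. */
int on_receive(Process *p, int src, int sender_epoch, const void *buf, int len)
{
    assert(p != NULL && buf != NULL && len >= 0);
    MsgClass c = classify_message(sender_epoch, p->epoch);
    if (c == MSG_LATE) {
        if (p->n_logged == LOG_CAPACITY || len > MAX_PAYLOAD)
            return -1;
        LoggedMsg *m = &p->log[p->n_logged++];
        m->src = src;
        m->len = len;
        memcpy(m->data, buf, (size_t)len);
    }
    return (int)c;
}
```

Only messages that cross a recovery line are recorded, which is why such a log can stay much shorter than the log kept by full message logging.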
Publications:
- Experimental Evaluation of Application-Level Checkpointing for OpenMP Programs. 20th ACM International Conference on Supercomputing (ICS), 2006
- Mobile MPI Programs in Computational Grids. ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), 2006
- Implementation and Evaluation of a Scalable Application-level Checkpoint-Recovery Scheme for MPI Programs. Supercomputing, 11/06/2004
- Checkpointing Shared Memory Programs at the Application-level. European Workshop on OpenMP, 10/20/2004
- Application-level Checkpointing for Shared Memory Programs. Conference on Architectural Support for Programming Languages and Operating Systems, 10/10/2004
- Collective Operations in an Application-level Fault Tolerant MPI System. International Conference on Supercomputing, 06/23/2003
- Automated Application-level Checkpointing of MPI Programs. Principles and Practice of Parallel Programming, 06/11/2003





