Memory Hierarchy Optimizations
Programs that access large data structures often perform poorly on high-performance computers because the processor spends all its time waiting for memory access requests to be completed. Many of these programs benefit from being restructured to improve locality of reference. It is tedious to do this restructuring by hand, so it is desirable to have restructuring compiler technology that can accomplish this task.
Our group is a world-leader in this area.
Our early work focused on restructuring perfectly-nested loops, which are loop nests in which all assignment statements occur in the innermost loop of the loop nest. BLAS kernels such as matrix-vector product and matrix-matrix product are examples of such loops. Wei Li showed how non-singular matrices can be used to synthesize sequences of transformations for enhancing locality and parallelism in perfectly-nested loops.
Although perfectly-nested loops have received a lot of attention from the compiler community, matrix-multiplication is about the only important kernel that is perfectly-nested! Most codes such as matrix factorization codes and relaxation codes contain imperfectly-nested loops in which assignment statements are contained in some but not all of the loops of the loop nest. To restructure imperfectly-nested loop nests, we have developed two approaches.
The first approach is data-shackling - roughly speaking, the compiler detemines an order in which data will be brought into the cache from main memory at run-time, and then restructures code so that all statements that touch a given piece of data are scheduled close together in the program. Data-shackling is not restricted to dense arrays - for example, Sally McKee et al recently showed that data-shackling can be used to enhance locality in programs with pointer-based data structures.
The second approach embeds execution instances of statements in an imperfectly-nested loop into a large Cartesian space - intuitively, this space models a perfectly-nested loop which can then be transformed appropriately and then tiled. The code generation phase produces imperfectly-nested loops.
Neither data-shackling nor the Cartesian space approach subsume each other in practice. The Cartesian space approach is difficult to apply to semi-regular codes like LU with pivoting and irregular codes like pointer-based codes for which data-shackling does well. For other codes such as relaxation codes, data-shackling is difficult to apply because it is difficult to synthesize the optimal order of data traversal, whereas the Cartesian space approach works well as Nawaaz Ahmed showed in his thesis. The state of the art of locality enhancement for imperfectly-nested loops is less than satisfactory.
An intriguing feature of divide-and-conquer algorithms is that they should run well on machines with multiple levels of memory hierarchy without the need for blocking since each successive divide step generates subproblems that touch smaller amounts of data. We have shown how divide-and-conquer style codes can be generated from iterative versions of the same algorithm.
Here are some companies that use our technology.
- Hewlett-Packard uses the technology in LAMBDA in its entire compiler product line.
- Intel has recently licensed the LAMBDA toolkit for use in Merced compilers.
- Silicon Graphics incorporated data-shackling into their MIPSPro compiler.
Publications:
-
Perfectly Nested Loop Transformations: Linear Transformations
- The LAMBDA Loop Transformation Toolkit (User's Reference Manual) Technical Report TR94-1431, Cornell Computer Science Department, 06/01/1994
- Access Normalization: Loop Restructuring for NUMA Compilers ACM Transactions On Computer Systems, 11/01/1993
- Compiling for NUMA Parallel Machines Cornell Computer Science Department PhD Thesis, 08/01/1993
- Access Normalization: Loop Restructuring for NUMA Compilers Architectural Support for Programming Languages and Operating Systems, 10/12/1992
-
Imperfectly Nested Loop Transformations: Data Shackling
- Data-Centric Transformations for Locality Enhancement International Journal of Parallel Programming, 10/01/2001
- An experimental evaluation of tiling and shackling for memory hierarchy management International Conference on Supercomputing, 06/20/1999
- Data-centric Compilation Cornell Computer Science Department PhD Thesis, 08/01/1998
- Data-centric Multi-level Blocking Programming Language Design and Implementation, 06/15/1997
-
Imperfectly Nested Loop Transformations: Catesian-space Approach
- Synthesizing Transformations for Locality Enhancement of Imperfectly-Nested Loop Nests International Journal of Parallel Programming 29(5), 10/01/2001
- Tiling Imperfectly-Nested Loop Nests Supercomputing, SC, 11/04/2000
- Locality Enhancement for Imperfectly-nested Loop Nests Cornell Computer Science Department PhD Thesis, 08/01/2000
- Synthesizing Transformations for Locality Enhancement of Imperfectly-nested Loop Nests International Conference on Supercomputing (ICS 2000), 05/11/2000
- Transformations for Imperfectly Nested Loops Supercomputing, SC, 11/01/1996
-
Generating Block-Recursive Code From Iterative Code
- Automatic Generation of Block-Recursive Codes Euro-Par, 08/29/2000





