Memory Hierarchy Optimizations

Programs that access large data structures often perform poorly on high-performance computers because the processor spends all its time waiting for memory access requests to be completed. Many of these programs benefit from being restructured to improve locality of reference. It is tedious to do this restructuring by hand, so it is desirable to have restructuring compiler technology that can accomplish this task.

Our group is a world-leader in this area.

Our early work focused on restructuring perfectly-nested loops, which are loop nests in which all assignment statements occur in the innermost loop of the loop nest. BLAS kernels such as matrix-vector product and matrix-matrix product are examples of such loops. Wei Li showed how non-singular matrices can be used to synthesize sequences of transformations for enhancing locality and parallelism in perfectly-nested loops.

Although perfectly-nested loops have received a lot of attention from the compiler community, matrix-multiplication is about the only important kernel that is perfectly-nested! Most codes such as matrix factorization codes and relaxation codes contain imperfectly-nested loops in which assignment statements are contained in some but not all of the loops of the loop nest. To restructure imperfectly-nested loop nests, we have developed two approaches.

The first approach is data-shackling - roughly speaking, the compiler detemines an order in which data will be brought into the cache from main memory at run-time, and then restructures code so that all statements that touch a given piece of data are scheduled close together in the program. Data-shackling is not restricted to dense arrays - for example, Sally McKee et al recently showed that data-shackling can be used to enhance locality in programs with pointer-based data structures.

The second approach embeds execution instances of statements in an imperfectly-nested loop into a large Cartesian space - intuitively, this space models a perfectly-nested loop which can then be transformed appropriately and then tiled. The code generation phase produces imperfectly-nested loops.

Neither data-shackling nor the Cartesian space approach subsume each other in practice. The Cartesian space approach is difficult to apply to semi-regular codes like LU with pivoting and irregular codes like pointer-based codes for which data-shackling does well. For other codes such as relaxation codes, data-shackling is difficult to apply because it is difficult to synthesize the optimal order of data traversal, whereas the Cartesian space approach works well as Nawaaz Ahmed showed in his thesis. The state of the art of locality enhancement for imperfectly-nested loops is less than satisfactory.

An intriguing feature of divide-and-conquer algorithms is that they should run well on machines with multiple levels of memory hierarchy without the need for blocking since each successive divide step generates subproblems that touch smaller amounts of data. We have shown how divide-and-conquer style codes can be generated from iterative versions of the same algorithm.

Here are some companies that use our technology.

Publications: