Multigrid methods combine a number of standard sparse matrix techniques. Usual implementations separate the individual components (e.g., an iterative methods, residual computation, and interpolation between grids) into nicely structured routines. However, many computers today employ quite sophisticated and potentially large caches whose correct use are instrumental in gaining much of the peak performance of the processors. This is true independent of how many processors are used in a computation.
We investigate when it makes sense to combine several of the multigrid components into one, using block oriented algorithms. We determine how large (or small) the blocks must be in order for the data in the block to just fit into the processor's primary cache. By re-using the data in cache several times, a potential savings in run time can be predicted. This is analyzed for a set of examples. It is surprising to see how large a subdomain can fit into a relatively small, well designed cache (e.g., 256Kb). As caches continue to increase in size, the ideas here will extend quite nicely to three dimensional problems. In particular, machines with 1 - 4Mb caches are under design which will make three dimensional cache based multigrid a reality.
While most of the savings in time are with respect to the approximate solver, using a multiplicative or additive domain decomposition method saves even more time and allows us to use theoretical convergence rates from that field for our problems.
An automatic, low overhead software tool for determining a good cache based ordering for the multigrid components for general problems will also be discussed.