Conventional implementations of iterative numerical algorithms, especially multigrid methods, reach only a disappointingly small percentage of the theoretically available CPU performance when applied to representative large problems. One of the most important reasons for this phenomenon is that many developers of numerical software entirely neglect the need for data locality, which arises from the high latency and limited bandwidth of main memory. Fast program execution can be expected only when most of the data accessed during the computation are found in the system cache (or in one of the caches, if the machine architecture comprises a cache hierarchy). Otherwise, i.e., when the rate of cache misses is significant, the processor must stay idle until the necessary operands are fetched from main memory, whose cycle time is in general extremely large compared to the time needed to execute a floating-point instruction. In this paper, we extend techniques developed to improve the cache performance of two-dimensional multigrid algorithms to the three-dimensional case. We introduce sophisticated blocking techniques and program restructurings that are significantly more involved than in the two-dimensional case. Numerical experiments are presented that demonstrate the efficiency of our cache-aware methods.