for Parallel CFD Enhancements
on SGI ccNUMA and Cluster Architectures
Kremenetsky, PhD, Principal Scientist, CFD Applications
Posey, HPC Applications Market Development
|For presentation at 10th Copper Mountain Conference on Multigrid
APRIL 1-6, 2001
|The maturity of Computational
Fluid Dynamics (CFD) methods and the increasing computational power of
contemporary computers have enabled industry to incorporate CFD technology
in several stages of design processes. As the application of the CFD technology
grows from component level analysis to system level, the complexity and
the size of models increase continuously. Successful simulation requires
synergy between CAD, grid generation and solvers.
The requirement for shorter design cycles has put severe limitations on the turnaround time of the numerical simulations. The time required for (1) mesh generation for computational domains of complex geometry and (2) obtaining numerical solutions for flows with complex physics has traditionally been the pacing item for CFD applications. Unstructured grid generation techniques and parallel algorithms have been instrumental in making such calculations affordable. Availability of these algorithms in commercial packages has grown in the last few years and parallel performance has become a very important factor in the selection of such methods for production work.
Although extensive research has been devoted to determining the optimum parallel paradigm, in practice the best parallel performance can be obtained only when algorithms and paradigms take into consideration the architectural design of the target computer system. This paper addresses issues related to the efficient performance of the commercial CFD software FLUENT on a cache-coherent Non-Uniform Memory Access (ccNUMA) architecture. Also presented are results from the implementation of FLUENT on workstation cluster systems for both the Linux and SGI IRIX operating systems. Issues related to performance of the message passing system and memory-processor affinity are investigated for efficient scalability of FLUENT when applied to a variety of industrial problems.
|In recent
years Computer Aided Engineering (CAE) has contributed significantly to
changes in the product development process. The major factors behind increased
use of CAE were the requirements for continuous improvement in product performance
and quality, reductions in number and cost of prototypes, and faster time
to market.
Due to these requirements, the conventional method of product design and development has been replaced by a modern approach that relies more on CAE tools and techniques. Computational Fluid Dynamics (CFD), a discipline of CAE, has recently been accepted as one such tool and is used extensively to guide and complement the experimental methods used for product design and verification.
Beginning in the mid-1990s, the introduction of unstructured CFD grid technology, accurate and robust numerical solutions, and the availability of powerful parallel computers have acted as catalysts in the rapid acceptance of a CFD-assisted design approach. The availability of commercial as well as in-house codes that use parallel processing has considerably increased in recent years, leading to larger models and reduced solution times. All parallel implementations on existing system architectures are based on one of two alternatives: (1) fine-grain and (2) coarse-grain parallelism.
The first parallel paradigm is often termed compiler parallelism. It exploits loop-level parallelism, implemented by automatic parallelizing compilers for shared memory architectures. This technique is popular for its ease of use and its incremental approach to existing source code. The coarse-grain parallel method can be further subdivided into (2a) shared memory parallelism, (2b) library-based distributed memory parallelism such as High Performance Fortran (HPF), and (2c) parallelism based on explicit message passing with systems such as MPI.
Distributed memory coarse-grain parallelism (2c) has increasingly become the preferred method because of its ability to accommodate both shared and distributed memory computer architectures. It is also considered to provide better scalability, although recent studies indicate that selection of the solver algorithm, and not the programming model, is responsible for the scalability impediments usually associated with the shared memory parallel model.
This plethora of parallel programming paradigms and parallel computer architectures necessitates careful porting and performance tuning of various applications to ensure they take full advantage of computational system resources. The objective of the present study is to examine performance issues associated with the implementation of the commercial CFD software FLUENT on ccNUMA and cluster system architectures.
An investigation of performance bottlenecks was conducted for FLUENT with industrial-sized cases. This paper describes the parallel implementation of the FLUENT solver and the system environment for the investigations, and presents results for each. Also discussed are new directions in parallel performance enhancement, such as hybrid parallel programming paradigms and a new class of communication primitives.
|2. FLUENT CFD Software|
|The commercial CFD software FLUENT
is a fully-unstructured finite-volume CFD solver for complex flows ranging
from incompressible (subsonic) to mildly compressible (transonic) to highly
compressible (supersonic and hypersonic) flows. The wealth of physical
models in FLUENT allows you to accurately predict laminar, transitional
and turbulent flows, various modes of heat transfer, chemical reaction,
multiphase flows and other complex phenomena.
The cell-based discretization approach used in FLUENT is capable of handling arbitrary convex polyhedral elements. For the solution strategy, FLUENT allows a choice of two numerical methods, segregated or coupled. With either method FLUENT solves the governing integral equations for conservation of mass, momentum, energy and other scalars such as turbulence and chemical species.
The FLUENT solution consists of a control-volume based technique that includes (1) division of the domain into discrete control volumes using a computational grid, (2) integration of governing equations on the individual control volumes to construct algebraic equations for the discrete dependent variables such as velocities, pressure, temperature, and conserved scalars, and (3) linearization of discretized equations for solution of the resulting linear system to yield updated values of dependent variables.
Both segregated and coupled numerical methods employ a similar discretization process (finite-volume) but their approach to linearization and solution of the discretized equations is different. A point implicit (Gauss-Seidel) linear equation solver is used in conjunction with an Algebraic Multigrid (AMG) scheme to solve the resultant linear system for the dependent variables in each cell.
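As an illustration of the point implicit (Gauss-Seidel) idea mentioned above, the following minimal Python sketch sweeps through the unknowns and updates each one using the newest available values of the others. It is not FLUENT's solver: the function name `gauss_seidel`, the dense list-of-lists matrix storage, and the small 3 x 3 test system are assumptions made only for this example, and no multigrid acceleration is included.

```python
def gauss_seidel(a, b, x0=None, sweeps=50, tol=1e-10):
    """Solve a x = b by point Gauss-Seidel sweeps.

    a is a dense square matrix stored as a list of rows (illustrative only;
    a production solver would use sparse storage).
    """
    n = len(b)
    x = list(x0) if x0 is not None else [0.0] * n
    for _ in range(sweeps):
        max_change = 0.0
        for i in range(n):
            # x is updated in place, so entries j < i already hold new values.
            sigma = sum(a[i][j] * x[j] for j in range(n) if j != i)
            new_xi = (b[i] - sigma) / a[i][i]
            max_change = max(max_change, abs(new_xi - x[i]))
            x[i] = new_xi
        if max_change < tol:
            break
    return x


# Hypothetical diagonally dominant system whose exact solution is (1, 2, 3).
a = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]
x = gauss_seidel(a, b)
```

Gauss-Seidel reduces short-wavelength error quickly but long-wavelength error slowly, which is the usual motivation for pairing it with a multigrid scheme such as AMG.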
The AMG scheme is most often used, but FLUENT also contains a full-approximation storage (FAS) multigrid scheme. AMG is an essential component of both the segregated and coupled implicit solvers, while FAS is an important but optional component of the coupled explicit solver. Parallelism is implemented through a coarse-grain domain decomposition technique using the Message Passing Interface (MPI) system.
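The domain decomposition idea can be sketched with a deliberately simple model. The Python example below assumes a one-dimensional chain of cells split into contiguous blocks, which is not how FLUENT partitions an unstructured grid; the helper names `partition_cells` and `halo_cells` are hypothetical. It shows only the general principle: each partition owns a set of cells and must receive a small set of neighbouring ("halo") cell values from adjacent partitions on every iteration.

```python
def partition_cells(num_cells, num_parts):
    """Split cell indices 0..num_cells-1 into contiguous, near-equal blocks."""
    base, extra = divmod(num_cells, num_parts)
    parts, start = [], 0
    for p in range(num_parts):
        size = base + (1 if p < extra else 0)
        parts.append(range(start, start + size))
        start += size
    return parts


def halo_cells(parts, p):
    """Cells owned by neighbouring partitions that partition p must receive.

    Assumes a 1D chain, so each partition has at most two neighbours.
    """
    halo = []
    if p > 0:
        halo.append(parts[p - 1][-1])   # last cell of the left neighbour
    if p < len(parts) - 1:
        halo.append(parts[p + 1][0])    # first cell of the right neighbour
    return halo


parts = partition_cells(10, 3)
sizes = [len(block) for block in parts]   # work per partition
middle_halo = halo_cells(parts, 1)        # data partition 1 must receive
```

In this model the work per partition shrinks as partitions are added while the halo stays the same size, so communication grows in relative importance at high processor counts.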
|3. Computer System Architectures|
Provided are descriptions of the system architectures used for the FLUENT parallel studies. These include two proprietary SGI systems based on the IRIX operating system and an Intel IA-32 architecture based on the Linux operating system.
3.1 SGI ccNUMA Architecture
The majority of FLUENT computations presented in this paper were performed on an SGI Origin2000 server. The Origin2000 is a cache-coherent non-uniform memory access (ccNUMA) multiprocessor architecture. The SGI ccNUMA memory is physically distributed amongst the nodes but is globally addressable by all processors through the interconnection network.
The distribution of memory among processors ensures that memory latency is reduced. The globally addressable memory model is retained, but memory access times are no longer uniform. The ccNUMA design incorporates hardware and software features that minimize latency differences between remote and local memory. Page migration hardware moves data into memory closer to a processor that frequently uses it, so that most memory references are local.
Cache coherence is maintained via a directory-based protocol, and caches are used to reduce memory latency as well. While data exists only in either local or remote memory, copies of the data can exist in various processor caches. Keeping these copies consistent is the responsibility of the logic (cache-coherence protocol) of the various hubs. The directory-based cache coherence protocol is preferable to snooping since it reduces the amount of coherence traffic; cache-line invalidations are sent only to those CPUs actually using the cache line instead of to all CPUs in the system.
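The difference in coherence traffic can be illustrated with a toy model. The Python sketch below is not the Origin2000 protocol: the class `Directory`, its methods, and the 64-CPU example are assumptions made purely for illustration. It counts how many invalidation messages a write would trigger when a directory records the sharers of each cache line, and compares this with a broadcast to every other CPU.

```python
class Directory:
    """Toy directory that records which CPUs hold a copy of each cache line."""

    def __init__(self, num_cpus):
        self.num_cpus = num_cpus
        self.sharers = {}  # cache line address -> set of CPU ids

    def read(self, cpu, line):
        # A read adds the CPU to the sharer set for that line.
        self.sharers.setdefault(line, set()).add(cpu)

    def write(self, cpu, line):
        """Make cpu the sole owner; return the number of invalidations sent."""
        others = self.sharers.get(line, set()) - {cpu}
        self.sharers[line] = {cpu}
        return len(others)


def broadcast_invalidations(num_cpus):
    """Invalidations needed if every other CPU must be notified of a write."""
    return num_cpus - 1


directory = Directory(num_cpus=64)
directory.read(cpu=3, line=0x40)
directory.read(cpu=7, line=0x40)
directed = directory.write(cpu=3, line=0x40)    # only CPU 7 holds a copy
broadcast = broadcast_invalidations(64)
```

In this toy case the directory sends one invalidation where a broadcast scheme would send 63, and the gap widens as the system grows.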
The building block of the system is the node, which contains two processors, up to 4 GB of main memory and its corresponding directory memory, and a connection to a portion of IO subsystem. The hub chip is the distributed memory controller and is responsible for providing transparent access to all of the distributed memory in a cache-coherent manner to all of the processors and I/O devices. The nodes can be connected together via any choice of scalable interconnection network.
3.2 Linux and IRIX Cluster Systems
Cluster computing is based on a simple approach of connecting several compute servers to utilize their collective resources for solving a single problem or multiple problems quickly. The rapidly improving price-performance of hardware and system software, emerging high-speed networks, and the availability of mature commercial CFD application software allow cluster computing to become an attractive option for industrial CAE users.
There are two major classes of cluster configuration: (1) capacity clusters and (2) capability clusters. A capacity cluster is targeted at the solution of multiple problems, each running on a single dedicated CPU, with minimal communication between individual servers. A capability cluster uses the collective computational power of several computational nodes to solve a single problem as rapidly as possible.
The capability approach requires a well developed and fast interconnect between clustered nodes. The studies presented in this paper concentrate only on capability clusters. Additionally, we subdivide the capability cluster configuration into two distinct subclasses: (1) High End Clusters, consisting of a few powerful multiprocessor nodes connected within a simple but powerful and mostly proprietary network topology, and (2) Low End Clusters, employing substantial numbers of mostly single or low CPU count nodes, generally connected via inexpensive commercially available networks.
|4. Parallel Performance Issues|
|The fundamental issues behind parallel
algorithm design are well understood and described in various research
publications. For grid-oriented problems such as the numerical solution
of partial differential equations, one can define four major sources
of parallel performance degradation:
Even with such a sophisticated approach, parallel performance can be unsatisfactory owing to the lack of a specific mapping to the architecture of a particular computer system. Parallel performance of FLUENT, as originally ported to the SGI ccNUMA architecture, did not meet initial expectations, with most models scaling only up to 4 processors. Examination of the parallel performance for a number of cases identified bottlenecks with (a) MPI latency and (b) non-enforcement of processor-memory affinity (data placement) as the key reasons for limited scalability. The data placement concern was related to a feature of the ccNUMA architecture and was addressed through implementation of the IRIX dplace set of tools.
The latency bottlenecks associated with MPI required more attention. FLUENT parallelization is based on an explicit message passing paradigm that uses MPI to exchange boundary information between partitions. For AMG schemes, MPI is used to exchange information at both fine-grid and coarse-grid levels. The size of MPI messages decreases on coarse grids, making the message initiation cost (latency) more important than the message transmission cost (bandwidth). Thus, unlike classic one-level algorithms, where the bandwidth of the MPI implementation is critical to scalability, FLUENT scalability depends primarily on the latency of the MPI implementation.
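The shift from bandwidth-bound to latency-bound communication can be made concrete with the usual linear cost model, in which the time to send a message is the latency plus the message size divided by the bandwidth. The Python sketch below uses hypothetical numbers throughout: the 10 microsecond latency, the 100 MB/s bandwidth, and the assumption that each coarser multigrid level exchanges one quarter of the data of the level above are illustrative choices, not measurements from this study.

```python
def message_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Linear cost model: start-up latency plus transmission time."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def latency_fraction(nbytes, latency_s, bandwidth_bytes_per_s):
    """Share of the total message time that is spent on latency."""
    total = message_time(nbytes, latency_s, bandwidth_bytes_per_s)
    return latency_s / total


LATENCY = 10e-6      # hypothetical: 10 microseconds per message
BANDWIDTH = 100e6    # hypothetical: 100 MB/s

# Hypothetical boundary-exchange sizes (bytes) for five multigrid levels,
# finest first, each level carrying a quarter of the previous one.
sizes = [64000 // 4 ** level for level in range(5)]
fractions = [latency_fraction(n, LATENCY, BANDWIDTH) for n in sizes]

for n, f in zip(sizes, fractions):
    print(f"{n:6d} bytes: {100 * f:5.1f}% of message time is latency")
```

Under these assumptions latency is a small part of the cost on the finest level and the dominant part on the coarsest, which is the behaviour described in the paragraph above.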
Total MPI latency is determined by both the specifics of the system architecture and the implementation of MPI for that system. Since system architecture latency is fixed by the design of a particular interconnect, overall latency improvements can only be made in the MPI implementation. Modifying the MPI software to make it "aware" of a specific architecture is the only way to reduce the total latency and, subsequently, the communication overhead.
Table 1 shows scalability data for one of the FLUENT/UNS test problems using both MPICH (a public domain MPI) and the SGI implementation of MPI coupled with the use of placement tools. Clearly the latter provides higher scalability than the public domain MPI. A "ping-pong" test verified that MPICH latency is almost three times higher than that of the SGI-specific MPI implementation. The results presented in later sections demonstrate the influence of MPI latency on FLUENT scalability for various SGI system architectures.
The current release of FLUENT for SGI systems is instrumented with data placement tools for use with MPICH and exhibits improved scalability over this experiment that was conducted during the beginning of our investigations. Still the lower latency of SGI-MPI over MPICH will produce maximum efficiency in parallel performance.
|5. FLUENT Parallel Performance|
|The overall objective of these studies
is to demonstrate the performance of Fluent CFD software on the SGI ccNUMA
architecture as a single system image (SSI) configuration, as well as with
various cluster configurations based on SGI systems. Performance is examined
on a moderate SSI and clusters with several models less than 1M cells,
then on large SSI and clusters for models much greater than 1M cells.
5.1 Linux and IRIX Cluster Performance
The use of a cluster of systems as a CFD computational resource is increasingly appealing for CAE professionals. Flexibility, local control, and relatively low cost combined with suitable performance make clusters a popular choice, especially for small engineering companies and departments. The major components of a cluster for CFD software are the set of computational nodes (single or multiprocessor), a network interconnect consisting of hardware and software, and operating system software and administration tools. Specific configurations used for the moderate SSI and cluster experiments are summarized in Table 2.
Cluster studies were conducted on FLUENT performance as a function of architecture type, processor type and choice of interconnect. Performance depends on a number of factors, including the latency and bandwidth of each cluster arrangement. A simple "ping-pong" test allows one to measure the bandwidth and latency of each system in this study.
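One common way to turn ping-pong timings into the two numbers quoted for an interconnect is to fit a straight line to one-way time against message size: the intercept estimates the latency and the inverse of the slope estimates the bandwidth. The Python sketch below shows such a fit. The function name `fit_latency_bandwidth` is hypothetical, and the timings are synthetic values generated from an assumed 20 microsecond latency and 50 MB/s bandwidth; they are not measurements from the systems in this paper.

```python
def fit_latency_bandwidth(sizes, one_way_times):
    """Least-squares fit of time = latency + size / bandwidth.

    Returns (latency in seconds, bandwidth in bytes per second).
    """
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(one_way_times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(sizes, one_way_times))
    slope = sxy / sxx                  # seconds per byte
    intercept = mean_y - slope * mean_x
    return intercept, 1.0 / slope


# Synthetic one-way times (half the measured round trip in a real test).
TRUE_LATENCY = 20e-6     # assumed for this example
TRUE_BANDWIDTH = 50e6    # assumed for this example
sizes = [0, 1024, 8192, 65536, 1048576]
times = [TRUE_LATENCY + n / TRUE_BANDWIDTH for n in sizes]

latency, bandwidth = fit_latency_bandwidth(sizes, times)
```

In a real test each time would be the average over many round trips between two processes, with message sizes spanning several orders of magnitude so that both parameters are well resolved.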
From the ping-pong test it is observed that the Single System Image architecture (SGI 2400), with its hypercube topology, fast NUMALink communication hardware and SGI-MPI communication software, exhibits the lowest latency and highest bandwidth. Next in line is the SGI 1400 Linux cluster (SGI 1400/Myr) with Myrinet interconnect and the MPICH port to the GM Myrinet communication protocol. The SGI 2100 cluster with HIPPI interconnect shows quite respectable levels of bandwidth, but its latency is almost an order of magnitude higher than the SSI latency. Finally, the SGI 1400 Linux cluster with 100BT demonstrates the least favorable communication parameters.
It is well known that the parallel performance of numerical applications is influenced by the size of the model used for the benchmark case, as well as by the number of CPUs used for a particular execution. In order to investigate the behavior of these test systems across a broad spectrum of problem sizes, three benchmark tests were chosen for the study:
|Table 4. FLUENT Performance for SMALL Model|
degree elbow duct, 78,887 tetrahedral cells, k-e turbulence, segregated
Metric: Number of jobs completed in 24 hours with parallel speed-up
|CPUs||SGI 2400||SGI 2100||SGI 1400||SGI 1400/Myr|
|1||606 (1.0)||606 (1.0)||432 (1.0)||427 (1.0)|
|2||1234 (2.0)||1234 (2.0)||786 (1.8)||786 (1.8)|
|4||2304 (3.8)||2304 (3.8)||987 (2.3)||987 (2.3)|
|8||3456 (5.7)||3142 (5.2)||735 (1.7)||1819 (4.3)|
|16||3142 (5.2)||2659 (4.4)||364 (0.8)||3142 (7.4)|
|Table 5. FLUENT Performance for MEDIUM Model|
valveport, 242,782 hybrid cells, k-e turbulence, segregated implicit solver
Metric: Number of jobs completed in 24 hours with parallel speed-up
|CPUs||SGI 2400||SGI 2100||SGI 1400||SGI 1400/Myr|
|1||116 (1.0)||115 (1.0)||96 (1.0)||98 (1.0)|
|2||213 (1.9)||212 (1.8)||166 (1.7)||175 (1.8)|
|4||421 (3.6)||416 (3.6)||243 (2.5)||252 (2.6)|
|8||823 (7.1)||804 (7.0)||432 (4.5)||508 (5.2)|
|16||1234 (10.7)||1115 (9.7)||576 (6.0)||1017 (10.3)|
|32||2033 (17.6)||1382 (12.0)||n/a||1571 (16.0)|
|Table 6. FLUENT Performance for LARGE Model|
aircraft, 847,764 hexahedral cells, RNG k-e turbulence, coupled explicit
Metric: Number of jobs completed in 24 hours with parallel speed-up
|CPUs||SGI 2400||SGI 2100||SGI 1400||SGI 1400/Myr|
|1||12.5 (1.0)||12.5 (1.0)||10.2 (1.0)||10.6 (1.0)|
|2||24.1 (1.9)||23.3 (1.9)||15.9 (1.6)||16.3 (1.6)|
|4||45.7 (3.8)||47.2 (3.8)||26.9 (2.7)||28.6 (2.7)|
|8||85.1 (6.8)||80.4 (6.4)||43.6 (4.3)||52.6 (5.0)|
|16||154.3 (12.3)||110.1 (8.8)||39.6 (3.9)||80.0 (7.6)|
|32||259.8 (20.6)||91.9 (7.3)||n/a||118.0 (11.2)|
|64||n/a||n/a||n/a||159.3 (15.0)|
Although we intentionally chose systems whose CPU clock rates provide approximately the same hardware peak performance, about 0.5 GFlops, absolute performance varies quite substantially, especially for low processor counts. This can be explained by the strong dependence of FLUENT single-CPU performance on memory subsystem parameters, particularly memory bandwidth and secondary cache size. Here, however, we concentrate on parallel performance metrics, which are defined more by interconnect characteristics than by CPU performance.

First of all, the results above confirm the well-known fact that larger problems scale better. The small problem (modine) begins to level off in parallel scaling after 8 CPUs, while the large problem (fl5l1) continues to scale even beyond 32 processors. The most important observation, however, is that the level of parallel scalability correlates quite clearly with the latency of the interconnect in use. Both the SSI system and the Linux cluster with the Myrinet interconnect have the lowest latency and, at the same time, the highest parallel scaling. Systems with an inferior interconnect (mice and especially 100BT) scale much worse.

Another interesting fact is that as the number of processors increases, the ratio of computational work to communication overhead diminishes, and the speed of the network practically defines the absolute performance of the application. This is seen especially clearly in the smallest benchmark: on a single CPU the SSI system is faster than the sc1 cluster by a factor of 1.4, but on 16 CPUs this factor goes down to 1.0. These results show that a low-cost cluster with a fast network (low-latency communication protocols) can be a viable solution for low- and mid-range CFD servers.
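The speed-up figures in the tables, and the SSI-to-cluster ratio quoted above, follow directly from the jobs-per-day numbers. The short Python sketch below reproduces them for the SMALL model using the Table 4 values for the SGI 2400 and SGI 1400/Myr columns. The function names and the added parallel efficiency measure (speed-up divided by CPU count) are conveniences for this example and do not appear in the original tables.

```python
# Jobs completed in 24 hours, copied from Table 4 (SMALL model).
jobs_per_day = {
    "SGI 2400":     {1: 606, 2: 1234, 4: 2304, 8: 3456, 16: 3142},
    "SGI 1400/Myr": {1: 427, 2: 786,  4: 987,  8: 1819, 16: 3142},
}


def speedup(system, ncpus):
    """Throughput on ncpus relative to the single-CPU throughput."""
    jobs = jobs_per_day[system]
    return jobs[ncpus] / jobs[1]


def efficiency(system, ncpus):
    """Parallel efficiency: speed-up divided by the number of CPUs."""
    return speedup(system, ncpus) / ncpus


def ssi_to_cluster_ratio(ncpus):
    """Absolute throughput of the SSI system relative to the Myrinet cluster."""
    return jobs_per_day["SGI 2400"][ncpus] / jobs_per_day["SGI 1400/Myr"][ncpus]


for system in jobs_per_day:
    for ncpus in (1, 2, 4, 8, 16):
        print(f"{system:13s} {ncpus:2d} CPUs: "
              f"speed-up {speedup(system, ncpus):4.1f}, "
              f"efficiency {efficiency(system, ncpus):4.2f}")

ratio_1 = ssi_to_cluster_ratio(1)
ratio_16 = ssi_to_cluster_ratio(16)
```

The single-CPU ratio of about 1.4 and the 16-CPU ratio of 1.0 match the factors cited in the text for the smallest benchmark.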
5.2 SGI ccNUMA Performance
Another important class of computers, those that provide resources capable of solving Grand Challenge class problems, is often referred to as supercomputers. Until recently only vector computers (e.g. Cray C90 or T90) were identified as supercomputers. The appearance of massively parallel computers (MPP) in the last decade of the 20th century changed this situation. Massively parallel computers are called upon to solve by simulation a set of Grand Challenge application problems in various areas of science and engineering, including Computational Fluid Dynamics. These applications drive issues in computer science and numerical mathematics, as well as pacing needs in supercomputing, high speed communications, visualization and databases. Attacking Grand Challenge problems involves coordinated research in mathematical models of physical behavior, numerical algorithms, parallel implementations and computer science methodology.

Until the last several years, only large research facilities such as national laboratories and supercomputer centers were involved in the solution of extremely large problems, and correspondingly they were the major customers for supercomputer vendors. In recent years the high computational demands of the aerospace and automotive industries have significantly changed the supercomputer "battlefield". As a result of significant improvements in the processing speed and storage capacity of the new generation of parallel computers, and of improvements in computational algorithms, the computer industry has been able to satisfy, to a certain degree, the growing demand for supercomputer resources. Practically all large aerospace and automotive companies now possess supercomputer-class systems, which are heavily used in their everyday engineering practice. The majority of those systems now belong to the MPP class.
Massively parallel systems generally come in two flavors:
- large Single System Image systems, which allow both shared and distributed memory parallel paradigms to be used across the whole system (e.g. the 512 CPU O2800 at NASA Ames Research Center)
- clusters of smaller multiprocessor systems connected via a high performance network (e.g. the 64 x 128 CPU O2000 ASCI BLUE cluster at Los Alamos National Lab)
The same spectrum, on a somewhat smaller scale, can be seen among industrial supercomputer users. One of the major automotive companies uses a 128 CPU O2000 to simulate various flow phenomena for car aerodynamics optimization. At the same time, one of the US aerospace companies uses a cluster of 4 x 64 O2000 to study flow around rotating blades, etc. Interestingly, both companies use the same commercial CFD package, namely FLUENT. An important question arises: which configuration provides better performance for a particular parallel code, and what are the major reasons for the performance variations? In order to investigate this problem we configured the following three large systems:
|Table 2. System Environments for FLUENT Performance Investigations|
|SGI 2800||MIPS R12000/300 MHz||1 x 256||IRIX 6.5||N/A (SSI)|
|SGI 2800||MIPS R12000/300 MHz||4 x 64||IRIX 6.5||HIPPI|
In order to avoid any possible constraints on parallel performance due to problem size, we built a very large test case consisting of ~30 million cells. This case is a derivative of one of the FLUENT standard benchmarks (fl5l3), with the grid refined by 8X relative to the original problem. It simulates turbulent flow in a rectangular duct and has the following characteristics:
|Table 3. Communication Rates From MPI Point-to-Point Test|
|System||SGI 2800/SSI||SGI 2800/Cluster|
Now let us compare the performance of both systems using the same metric, wall time/iteration, as before. In order to save time we ran the test from 10 to 240 CPUs on the SSI configuration and from 30 to 240 CPUs on the cluster. For the cluster configuration we tried to distribute the load uniformly between the cluster hosts.
|Table 5. FLUENT Performance for EXTRA LARGE Model|
aerodynamics, 28,944,640 tetrahedral cells, k-e turbulence, segregated
Metric: Number of jobs completed in 24 hours with parallel speed-up
|CPUs||SGI 2800/SSI||SGI 2800/Cluster|
As we can see, the performance of the cluster is always inferior to that of the SSI system, and the difference increases with the number of CPUs in use. There are actually two reasons for this phenomenon. The first and major one is the higher inter-host latency, which we already discussed in previous sections. The other is the limited pipelining capability of inter-host communication. Owing to the very rich topology of the SSI system (hypercube), there are usually no or very few outstanding messages contending for the path from source to destination. The story is different for the inter-host communication topology, where only very few communication routes between hosts are available. The HIPPI pipelining capability does not resolve this bottleneck completely. The problem is not so evident for low-end clusters, where the number of domains per host is low, but it becomes increasingly influential for large multiprocessor hosts that contain a high number of domains in need of inter-host communication. Our conclusion is that, for the solution of large problems on a high number of processors, the Single System Image architecture provides much better performance than a cluster approach.
|6. Considerations For Further Performance Enhancements|
|Hardware vendors keep telling
their customers that it is getting easier to program for new parallel systems.
Although there is a lot of truth in this statement, we believe that it is slightly
oversold. The reality is that with the development of the shared memory paradigm
and improvements in compiler technology it has become easier to "jump
on the bandwagon", get something running, and even show some respectable performance.
But a programmer or user who really wants to obtain a significant portion of peak
hardware performance should pay careful attention to microprocessor
and system architecture details. Here is an example. One of the major changes
from FLUENT v4.2 to v5.0 was the move away from a linked-list-based data structure
to an array-based data structure. This change allows the compiler to do a much better
job of scheduling and, as a result, gives much better cache utilization. Code that
makes better use of a hierarchical memory structure not only performs better
on existing hardware but also responds better to the introduction
of faster microprocessors. The following table demonstrates the single-CPU
performance improvement due to the introduction of the new data structure
and a faster microprocessor.
Future single-processor performance enhancements will come from the new generation of MIPS processors (R12000/400 MHz and R14000/500 MHz) and from the new Intel IA-64 processor family. We have just completed the first phase of porting FLUENT to the first processor in this family (800 MHz, 3.2 GFlops peak performance). The preliminary results were presented at the Intel Developer Forum 2000 in Palm Springs and generated substantial interest.

Further progress in parallel performance can come from both software and hardware development. SGI developers have implemented a subset of the point-to-point communication primitives of the MPI-2 standard. These so-called one-sided MPI primitives exploit the memory structure of a ccNUMA system to a much fuller extent than the two-sided functions of MPI-1. Our measurements showed that MPI_Get and MPI_Put calls cut latency by almost 4X in comparison with a traditional MPI implementation. This is an experimental development and the communication call interfaces are not yet easy to use, but we were able to bundle them into the communication layer of FLUENT. Our first results show a significant improvement in parallel scalability.

Another quite powerful means of boosting parallel performance is the possibility of coupling various parallel paradigms in the same code. Such an approach is very natural for the ccNUMA architecture, which allows both shared and distributed memory paradigms to be used. FLUENT developers skillfully used this opportunity for multiphase flow simulation. The basis for the approach is as follows. It is well known that a static frame of reference (the Eulerian coordinate system) is most convenient for describing the motion of the continuum phase. The static nature of this frame of reference allows a static partitioning to be created, which can then be used effectively in the distributed memory parallel model. At the same time, the discrete phase is usually described and calculated in a dynamic (moving) frame of reference, the so-called Lagrangian system of coordinates.
Imposing a static partitioning scheme on such a topology is very hard and not effective. It turns out that the shared memory parallel paradigm resolves this conflict very elegantly. The two systems work in a segregated manner and exchange data between the two phases after completion of the distributed or shared step in each iteration. The following table demonstrates how this hybrid approach improves parallel performance. We use two versions of FLUENT:
- v5.0 - parallel continuum computations, sequential discrete phase computations
- v5.1 - both phases are computed in parallel using the hybrid parallel model
Again we consider one of the FLUENT standard benchmarks, FL5M1 (boilr), which simulates two-phase flow in an industrial boiler with particle tracking.
|Table 10. FLUENT Performance with Hybrid Parallel (parallel speed-up)|
|Model: Coal combustion in a boiler, 155,188 tetrahedral cells, 6 species with reaction, dispersed phase, P1 radiation, k-e turbulence, segregated implicit solver|
|CPUs||FLUENT 5.0||FLUENT 5.1|
As we can see, the hybrid parallel model provides much better scalability for this complex problem. One more point to mention: if the hybrid model is used in a cluster environment, the number of shared memory processes is restricted by the number of CPUs on the master computational node. In contrast, in an SSI environment the shared memory phase of the algorithm can use the same number of processors as the distributed memory part of the code, which leads to better parallel performance.

Finally, SGI is introducing a new architecture based on the ccNUMA model. This new system, named SN1, brings twice the bandwidth to local memory and cuts remote reference latency by at least 2X. FLUENT shows an improvement on this system over the O2000 with the same microprocessors in the range of 1.3X - 2.0X.
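Returning to the comparison of the two FLUENT versions: the scalability limit imposed by a sequential discrete phase is what Amdahl's law predicts for any code with a serial portion. The Python sketch below evaluates that law for a hypothetical serial fraction. The value of 0.2 is an assumption chosen for illustration and is not a measured property of the boiler benchmark.

```python
def amdahl_speedup(ncpus, serial_fraction):
    """Amdahl's law: speed-up when a fixed fraction of the work stays serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / ncpus)


# Hypothetical case: 20% of the run time (e.g. particle tracking) is serial.
SERIAL_FRACTION = 0.2

with_serial_phase = amdahl_speedup(16, SERIAL_FRACTION)   # about 4 on 16 CPUs
fully_parallel = amdahl_speedup(16, 0.0)                  # ideal limit, 16
upper_bound = 1.0 / SERIAL_FRACTION                       # limit as CPUs grow
```

Under this assumption the speed-up can never exceed 5 regardless of processor count, which is why parallelizing the discrete phase as well matters for scalability.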
Presented were parallel performance results of the commercial CFD software FLUENT for a variety of applications and system configurations. Results of these experiments lead to the following observations:
 FLUENT 5 User's Guide, 1998, Fluent Incorporated, Lebanon, NH.
 Dagum, L., McDonald, J., and Menon, R., "The NAS LU Benchmark as a Scalable Shared-Memory Program", Silicon Graphics Internal Report, 1997.