| PERFORMANCE TESTING ON THE BEOWULF:
Dr. Richard S. Miller
INTRODUCTION: This page contains results from performance testing on the Department of Mechanical Engineering's Beowulf parallel computing cluster. Comparisons are made using both the fast ethernet network (100 mbps) and the `gigabit' myrinet network, as well as with two other parallel computers; a SUN HPC 6000 (helios) and a Silicon Graphics Origin 2000 supercomputer located at NCSA. All testing was performed using the original 16 processor cluster during November 2000. Two codes are used for the testing. The first is a finite difference based single-phase flow Navier Stokes solver run on a two-dimensional domain using either 256 x 256 or 2048 x 16 numerical grid points (same total memory and number of operations). The pertinent aspects of the code for considering the performance results are that an explicit finite difference scheme is utilized for derivative approximations in the x direction, whereas an implicit compact scheme is utilized for all derivatives in the y direction. The explicit (x) scheme is ideally parallelizable; although it is based on an eight point finite difference stencil and therefore requires a relatively large four columns of data transfer during communication. In contrast, the compact scheme calculates derivatives by solving a tridiagonal matrix which is inherently non-parallelizable in the y direction (this is true when the standard Thomas algorithm is used to solve the matrix equations; however, there are semi-parallelizable, though less direct, methods for tridiagonal matrices). The second code is Monte Carlo modeled solution for grain growth in solids (such as steel) at high temperatures. This particular problem is extremely communication intensive. The particular parameters used in the run presented below involves 256 x 256 x 256 computational `cells' defined by a particular grain orientation. Approximately 1.8 billion random Monte Carlo events are performed, with communication between processors occuring on the order of every 1000 events. A single case is described below which was conducted using 8 CPUs in a 2 x 2 x 2 domain decomposition.
Finite difference code performance on parallel processing computers is related to the reltive number of communications required relative to the number of `interior' operations not requiring communications. Figure 1 illustrates this point with a simple computational mesh with 32 x 16 grid points, divided among two processors denoted `0' and `1'. The boxed regions show the four column wide sets of data which need to be communicated during each cycle of the code. This is due to the fact that the boundary conditions are periodic and the computational stencil is four points wide on either side of the grid point being computed. All interior points do not require communication to calculate the terms involved in the Navier Stokes solver. Therefore, in order to minimize the relative extent of communications during some problem, it is required that the relative area of each CPUs section of the domain be large relative to the circumferential area involved with communication. In what follows, performance `speed up' results (defined as code time on 1 CPU divided by code time on N CPUs) are presented for domain decompositions in either the x direction (ideal), the y direction (serial computation), and also for 2D domain composition using processors in both directions. Results are provided for both the fast ethernet and myrinet networks. The computational code is parallelized using the Message Passing Interface (MPI) routines.
FAST ETHERNET RESULTS:
Performance results for the 2D Navier-Stokes code running on the fast ethernet (100 mbps) network for all inter-processor communications are presented in Fig. 2. Three sets of data are provided in the figure: In the first case, the 2D domain is decomposed only in the x direction along which the `ideally' parallelizable explicit finite differencing is used (denoted NPX). Super-linear performance due to increased available L2 cache memory is observed for small numbers of processors. Two primary factors influence the perfromance degradation for greater than eight processors. First, MPI spawns new processes in a `round robin' fashion jumping from machine to machine (not in consecutive CPUs within a dual processor machine). Therefore, for greater than eight CPUs both processors in at least one machine are being used simultaneously. Since both CPUs share the same memory the effective transfer rate between the CPUs and memory is halved when compared to a single CPU accessing the memory. Second, communications costs over the relatively slow ethernet eventually degrade the performace as more processors are added to the problem. Communication traffic over ethernet cables routing any dual CPU machine using both CPUs is also doubled due to having a single ethernet card/cable for each machine. In the second case, the the domain is decomposed in the y direction (data denoted by NPY) which cannot be parallelized due to the nature of the tridiagonal matrix solver. Since the derivative calculations represent the bulk of the Navier Stokes solver, very little of the overall operations can be parallelized through this decomposition. The net effect is that the code shows no improvement using two CPUs, and the performace is approximately halved using four CPUs due to communications costs. Results are also shown for cases using 2D domain decomposition for which the number of processors used in both the x and y directions are equal (NPX=NPY). In this case, performance gains are observed when four CPUs are employed; however, use of 16 CPUs results in excessive communications costs and the code runs more slowly than with 4 CPUs. MYRINET RESULTS: Figure 3 presents speed up results using the finite difference code with x domain decomposition using a grid resolution of 2048 x 16 grid points. The first domain discussed above had a resolution of 256 x 256 grid points, which when decomposed with 16 CPUs each processors has sub-domains consisting of 16 x 256 grid points. Eight columns of data are involved in communication within each sub-domain; therefore 50% of all data points must be communicated over the network. In contrast, the 2048 x 16 grid point domain has the exact same number of grid points (and therefore also of operations). However, in this case 16 CPUs results in sub-domains with 128 x 16 grid points, and only 6.25% of the domain is involved in communication. The single CPU performance (on ethernet) is 34% faster for the 2048 x 16 node case due to both lesser relative communication and more efficient memory usage with the longer inner `i' looping.
Despite the lesser communication, the myrinet is observed to markedly increase code performance. This occurs even when using 1 CPU since the MPI communcation routines are still called and the single processor communicates with itself. The single CPU run times are 122% faster using the myrinet than using the ethernet network. For both networks superlinear scaling is observed up to 8 CPUs; however, a performance reduction is observed when going to 9 CPUs since one of the machines is now running both processors which must share a common memory and bus. Adding more processors alleviates this problem until the full 16 CPUs at which point the primary CPU running system applications is also utilized. EFFECT OF COMPILER OPTIMIZATIONS:
Fortran compilers allow for several levels of compiler optimizations which are specified during compilation:
mpif77 code.f No optimization
mpif77 -O2 code.f Minimal optimization
mpif77 -O3 code.f Strong optimization
mpif77 -O4 code.f Strongest optimization
These optimization flags alert the comiler to attempt to varying degrees to optimize the resulting assembly language executable file. Care should be taken by the user to ensure that the compiler interprets the code in the correct manner (ie. the results do not change), and added optimization requires longer compilation time. Note also that the relative performance gains incurred by compiler optimizations are stongly code dependent; users should perform testing with their own codes. Results for the present finite difference code are presented in Fig. 4 which shows nearly a doubling in code performance using anything by the default optimization. All results presented above were compiled using -O3 optimization. COMPARISON WITH SUN HPC6000 AND SGI ORIGIN2000:
In this section the performance of the finite difference code with 1D domain decomposition in the x direction is compared with the University's 16 processor SUN HPC 6000 shared memory parallel computer (`Helios'), and with the Silicon Graphics Origin2000 parallel supercomputer at the National Computational Science Alliance (NCSA) at the University of Illinois at Urbana-Champaign (1520 CPUs). Both the SUN and the SGI machines are shared memory computers, and both cost substantially more than a beowulf cluster. For example, the beowulf cost approximately $26,000 during the summer of 2000; the SUN, which also has 16 CPUs, would cost `in the ballpark' of $500,000. Both the SUN and SGI machines use chips with significantly higher floating point operation performance than a typical PC chip. Memory bus speeds and bandwidth are also much greater, and the L2 cache memory size is much larger compared with the 256KB L2 cache on the beowulf. For example, the SGI Origin2000 at NCSA uses both 195MHz and 250MHz R10000 processors with a 4MB L2 cache memory. In contrast, the beowulf has a much faster (700 MHz) clock speed. Figure 5 presents performance results for the 2048 x 16 grid point finite difference code using the three computers and a range of numbers of processors. Considering the hardware specs, the myrinet equiped beowulf does a remarkable job of `keeping' up with the shared memory machines. The SUN HPC is actually approximately linear at 16 CPUs relative to its own single processor performance (the curvature is due to super linear performance for less CPUs). Performance on the SGI machine scales super-linearly throughout the range of processors tested; this same code has shown near linear scaling up to 220 CPUs (the maximum tested) on a comparable supercomputer. Single CPU relative performance of the code on the three machines is presented in Fig. 6. As expected, the floating point optimized chips on the SGI and SUN machines calculate the sinlge CPU problem faster than the PC chip based beowulf when `generic' fast ethernet is used. However, when the myrinet network is used for communication the faster clock speed now makes up for the slower bus speed on the PC chips, and the beowulf actually surpasses the single CPU performance of the more expensive computers.
Figure 7 shows the total simulation time for a 256 x 256 x 256 run using the Monte Carlo grain growth code. This particular code is extremely communication intensive, and the myrinet network shows a 46% performance gain in comparison to the fast ethernet network. Results are also shown for the SUN HPC6000 which communicates rapidly through its shared memory.
|
Home
Publications
Teaching
Research
Team
Tutorial
Beowulf
Quotes