| I: PARALLEL COMPUTING BASICS: Dr. Richard S. Miller
I(A): Introduction The philosophy of parallel programming is to take advantage of more than a single computer processor that will work simultaneously to perform a single task. This can be accomplished in any of a number of ways, and on a multitude of choices of computer platforms. Consider an overly simplified example in which we wish to use a computer to calculate the average of a large dataset of numbers which are stored in a file. Of course, a single `serial' processor computer could accomplish this task; however, we may wish to speed up the turn around time (ie. the wall clock time) required to complete the task. A naive approach would be to edit the file and split it into N equal size files each containing a portion of the original dataset. Then we could load each of these files onto N computers, each of which calculates the sum of their portion of the original file; after which the results from each computer could be added and divided by the total number of data points, either by hand or on another computer, to yield the desired result. Of course, this is a very naive approach and would probably be much slower than having a single computer work on the problem; unless the number of data points were tremendous.
![]()
On the other hand, consider a computer system containing multiple processors which are able to `communicate' with one another in some direct, but as yet unspecified, manner. On this computer we could then simply have one processor, usually referred to as the `master' processor, read in the original dataset and then send portions of it to N other processors, denoted `slaves'. These processors can then calculate the sums of their respective portions of the problem, and send the resulting single number sum back to the master. The master then finishes the problem by adding the N numbers is has received and dividing the result by the total number of data points, and outputting the result to the user. This is the essence of parallel programming; to take advantage of multiple processors in order to decrease the time required to solve some task. In theory, if the time required for communication between processors is negligible, the wall clock time can be reduced to 1/N times the single processor performance. This limiting behavior is referred to as `linear scaling'. It is denoted `linear' (despite the 1/N) because performance is usually measured as the calculation time on a single processor divided by the actual time on N processors; which in theory can be made to scale linearly with N for `optimal problems' solved with an `optimal parallelization procedure'. The concepts of `optimal problems' and `optimal parallelization procedures' refer to the fact that not all problems are equally amenable to parallelization, and not all choices of procedures for parallelizing a particular problem are equally efficient. In fact, most `real' problems are not amenable to `perfect' parallelization; ie. linear scaling is not possible. Even for such perfect problems, linear scaling is not possible for all problem sizes and number of processors employed. A perfect problem will in general be one in which completely uniform `load balancing' is achieved such that all processors have an equal and identical fraction of the computational calculations to perform. In this manner, no processors will ever be sitting idle waiting for other processors to `catch up' (since before proceeding some information is generally required from the other processors). Note that this perfect load balancing is at odds with the `master-slave' parallelization philosophy alluded to above, since the master processor did not partake in the actual calculation. In what follows, three examples will be posed which will illustrate the concepts described above. The first is as near to an ideal problem as is generally possible; solution of the steady heat conduction equation on a rectangular domain by a finite difference technique. The second is the solution of a tridiagonal matrix using the Thomas algorithm which represents a very poor parallelization problem. The third is one of the author's current research projects which illustrates some concepts of parallel programming found in `real' computational fluid dynamics problems. I(B): Heat Conduction Example A typical and elementary computational heat transfer problem is the numerical solution of the two dimensional steady heat conduction equation governed by the Laplacian of the temperature equal to zero subject to appropriate boundary conditions. Consider the computational domain illustrated in Fig. 1. The rectangular domain in x-y space has been `discretized' into a discrete set of `grid points', located at the intersection of the horizontal and vertical lines, on which an approximate form of the heat conduction equation is to be solved. In this particular case we have chosen MX=128 and MY=64 equally spaced grid points in the x and y directions, respectively. One typical solution procedure which could be applied on a serial computer is the Gauss-Seidel iteration technique. In this case a single processor performs the entire solution by looping through all grid points updating the temperature as:
do 10 j=1,MY
do 10 i=1,MX
T(i,j)=a*T(i+1,j)+b*T(i-1,j)+c*T(i,j-1)+d*T(i,j+1)
10 continue
continuously until some convergence criteria is achieved (a, b, c, and d are appropriately defined coefficients). Note that the value of T(i,j) at each iteration only depends on `local' information; ie. the four adjacent grid points. This is a typical characteristic of problems which are readily amenable to parallelization.
Consider next parallelization of the same heat conduction problem using two processors. The simplest way to accomplish this task is to assign each processor, denoted processor `0' and processor `1', to solve the problem on half of the domain; Fig 2. In this case two `threads' are initiated (`thread' is the notation for a process spawned off onto a unique processor) each of which is initialized on a domain having NX =MX/2=64, and NY=MY=64 grid points. Notice that there are now actually two programs which have been started on two unique processors. The only distinction between the two thread programs is that each now has an initial condition corresponding to its particular half of the true full domain problem. Also, each thread program only requires one half of the memory required by the single processor program described above. Now each thread program begins its iteration loop:
do 10 j=1,NY
do 10 i=1,NX
T(i,j)=a*T(i+1,j)+b*T(i-1,j)+c*T(i,j-1)+d*T(i,j+1)
10 continue
Neither processor's thread program requires any knowledge of the other processors existence except along the column i=NX for processor 0 and along column i=1 for processor 1 (due to the T(i+1,j) and T(i-1,j), respectively). Because of these points, some mechanism is needed for communicating between these two processors. This is the crux of parallel programming, and many options are available. Presently, the most prolific is the Message Passing Interface (MPI) set of communication subroutines; however, this will be discussed below, and the particular communication software chosen is not relevant to the present discussion. Whichever communication procedure is chosen, the user must insert the proper commands into the code between each iteration of the above loop to `send' the required boundary information from each processor to the other; the boundary information can then be used within each loop as the `ghost cell' values. With this accomplished the solution proceeds exactly as on the serial machine. Note that it is only during this communication that the two thread programs are ever aware of the others existence. Other than this, the two threads behave as two independent programs being run on two independent computers. Note also that at the end of each loop over the domain that if either processor finishes first it must `wait' for the other to also finish since it will need new ghost cell information before it can begin the next loop. This is in part why uniform load balancing is so essential to achieving linear performance scaling.
The parallelization described above is an example of `one-dimensional domain decomposition' since the domain is divided among processors only in the x coordinate direction. Figure 3 illustrates the use of four processors, and Fig. 4 shows a case of two-dimensional domain decomposition using sixteen processors with the domain now divided in both directions. More processors can be added as needed. In all of these cases smaller sub-programs are spawned to the required processors which compute their fractions of the entire problem using identical loops. Each of these threads treats a domain having MX/NPX times MY/NPY grid points, where NPX and NPY are the number of processors used to decompose the x and y coordinate directions, respectively. Three-dimensional domain decomposition extrapolates to 3D problems in an analogous manner.
As many processors as are wanted (and available) may be added to the particular task at hand; however, communication is not `free' and can add substantial time to a problem. The specific cost added by a single communication is dependent on a number of parameters, including the specific computer platform being used, whether communication is through memory (as in shared memory platforms) or over some `backbone' such as ethernet or myrinet, number of other users on the system, etc. (hardware is the subject of section II). Nevertheless, regardless of the actual cost per communication, specific observations about minimizing communications can be made. Consider Fig. 5 in which the same heat conduction problem is decomposed using two processors. In this case processor 0 treats all odd numbered grid points in the x direction, and processor 1 handles the even numbered points. Notice that in this case communication must be performed at every individual grid point due to the appearance of the T(i-1,j) and T(i+1,j) in the Gauss-Seidel procedure. This solution paradigm would be substantially more costly than the two processor paradigm in Fig. 2 due to the added communication. This illustrates that it is the ratio of the number of grid points requiring communication relative to the number of interior points not involved with communication which generally should be minimized in order to optimize a parallel solution algorithm. This optimal procedure may be found in either 1D, 2D, or 3D domain decomposition, depending on the number of grid points used, the number of processors employed, and the geometry of the domain.
I(C): Tridiagonal Matrix Example The next example is meant to illustrate some of the difficulties which may be encountered in attempting to parallelize particular numerical solutions. Consider what occurs when a finite difference approximation to some function F(x) is obtained by a tridiagonal `compact' scheme. In this case the derivative to the function, denoted F'(x) is approximated on the discrete domain with MX grid points as:
a*F'(i-1)+F'(i)+a*F'(i+1)=b*[F(i+1) - F(i-1)]
with appropriate coefficients a and b. The left hand side functions are all unknowns so the above represents an implicit set of equations for the unknown derivatives at each of the MX grid points. The set of equations can be expressed as:
where C is a known vector and D is a tridiagonal matrix of known coefficients (ie. all elements of the matrix are equal to zero except for the primary diagonal and the two adjacent diagonals; i-1, i, and i+1). The Thomas algorithm is an efficient solution procedure for a tridiagonal matrix such as this in which two `sweeps' are made, first `up' and then `down' the the MX grid points:
do 10 i=MX-1,1,-1
R(i)=f1(F(i+1))
10 continue
do 20 i=2,MX
F'(i)=f2(R(i-1))
20 continue
The important concept to grasp from the above is that in both sweeps the value computed always depends on the result from the previous calculation. Now consider decomposing the domain using two processors treating the 1:MX/2 and MX/2+1:MX grid points, respectively. If processor 1 has the latter half of the domain then it proceeds first with the upward sweep. However, processor 0 cannot in this case begin its calculations until processor 1 has finished and communicated its last result. This is because all of the calculations to be performed by processor 0 depend upon the cumulative results of the entire proceeding sweep. In this case parallelization not only does not speed up the calculation, it actually slows it down somewhat due to the addition of a communication. The key point here is that the solution dependency is `non-local' as opposed to the heat conduction problem which was entirely local. This does not mean necessarily that non-local problems cannot be parallelized; clever techniques can often be found to alleviate the inherent difficulties. However, in general non-local problems such as the tridiagonal matrix solver, N-body problems, fast Fourier transforms, etc. will rarely, if ever, exhibit linear scaling performance in parallel applications. I(D): Droplet Laden Mixing Layer Example One of the author's current research projects is the numerical simulation of a two-dimensional two-phase chemically reacting mixing layer. The project and some sample results are provided on a seperate page: http://www.ces.clemson.edu/~rm/project1.html The solution procedure chosen is eighth order accurate explicit central finite differences for all x derivatives, and fourth order compact tridiagonal finite differencing for y derivatives. The explicit differences are parallelized in the same manner as described above for the heat conduction problem; however, four rows of ghost cells require communication due to the wider computational stencil which spans F(i-4,j) to F(i+4,j). The compact scheme is chosen due to accuracy and stability constraints, but is not easily amenable to parallelization. To overcome this deficiency typically at most four processors are used to decompose the y direction, while as many as sixteen are used for the x direction (64 processors). The two-phase mixing layer represents a hybrid local/non-local communication problem in that the gas phase variables are solved on an Eulerian mesh (local with finite differences), while the particles or droplets are tracked in the Lagrangian reference frame and can `wander' throughout the domain. Consider a case which is initialized with only the lower half of the domain (the fuel stream) laden with particles. If the task of solving these particle equations, which require local information concerning the gas phase flow state, are divided evenly among all processors, then the particles will not always be treated by the same processor treating the local gas phase equations. This leads to tremendous communications costs. In practice, the most efficient solution to this problem is to always have the particle equations solved by the processor that is solving the local gas phase equations. In the case of one laden stream this does not yield good load balancing since half of the processors (those solving the upper stream) do not have any particles; however, communication is minimized. Note also in this problem that every particle must be checked at every time step to determine if it has left the physical domain of its processor, in which case it, and all of its storage arrays, must be communicated to the new processor. Nevertheless, despite these inconviences, this is the optimal solution procedure for this particular problem. Perhaps somewhat surprisingly, the author has observed very near to linear performance scaling up to 220 processors during a previous study in which the code was run in full three-dimensional mode:
REFERENCES: R.S. Miller and J. Bellan, `` Direct Numerical Simulation and Subgrid Analysis of a Transitional Droplet Laden Mixing Layer ,'' Physics of Fluids, 12(3), 650-671, 2000. |
Home
Publications
Teaching
Research
Team
Tutorial
Beowulf
Quotes