| II: PARALLEL COMPUTING HARDWARE: Dr. Richard S. Miller I(A): Introduction The subject of this section is computer hardware on parallel processing computers. In contrast to serial computer programming in which relatively little knowledge of hardware is necessary for efficient code writing, in parallel programming the particular hardware being used can be quite important to both the choice of parallelization methodology and in understanding performace constraints. Modern communication paradigms, such as the Message Passing Interface (MPI) and Parallel Virtual Machine (PVM) have their primary strength in that they are easily portable to most if not all hardware platforms. There are also many proprietary and platform specific parallelization paradigms in use today primarily on shared memory parallel computers. Understanding the basic hardware architecuture of parallel computers is very usefull in understanding the constraints and limiting factors involved with parallel programming. To this end, the following discussion is divided into four primary subsections addressing basic processor/memory architecture, shared, distributed and hybrid memory platforms, and beowulf clusters. A fifth section describing networking options on beowulf clusters is also provided at the end. I(B): Processor Basics The discussion begins with a very basic introduction to computer processor design as it relates to memory, since this is the aspect of the processor hardware of most importance to efficient parallel programming. Figure 1 schematically illustrates a typical `PC' microprocessor. The main memory typically resides off of the actual chip (exceptions exist even for PCs, such as the new Intel Xenon chip). This is where the primary bulk of an application resides and PCs generally are sold with 64 to 256 MB of primary memory these days. Two additional, and less well understood, forms of memory exist on the chip. The `L2 cache' holds the subset of the application immediately awaiting operation by the processor. It is connected to the primary memory by the `bus'. The bus is where the main bottleneck to computer performace is found. Today, the fastest bus speeds found on PC computers are 133MHz. On the other hand, the actual processor clock speeds are now over 1GHz. However, if the processor can only get the data to work on at 133MHz, then gigahertz ratings become rather deceptive. Finally, the `L1 cache' memory is where the processor actually performs all operations. It is quite small on PCs (~640 bytes) and will not be important to the discusssions which follow for this reason. Consider a performace test case in which some application is run at progressivley larger memory sizes and the true number of operations per second is measured. What occurs is that three very distinct levels of performace are observed. The fastest occurs for very small applications in which the entire program fits on the L1 cache. The second somewhat slower performance is found when some of the problem must be stored on the L2 cache. Finally, for code larger than the sum of the L1 and L2 caches the performace drops of substantially due to bottelnecking across the bus in both directions as data is exchanged with primary memory. In practice, no realistic applications fit entirely on the L1 cache due to its small size; hence its irrelevance to the discussion.
Understanding the L2 cache memory is key to understanding many aspects of computer program performance, PCs vs. workstations, as well as `super-linear' parallel programming performance. Consider for example the difference between so called sub $1000 PCs and `normal' PCs which are often sold with comparable chip speed ratings but at higher prices. The primary, and often the only, difference is that the inexpensive PCs contain processors with much smaller L2 cache sizes; typically 256KB. On the other hand, the normal PC chips contain 512KB L2 caches, and some are going to 1MB cache sizes. This increases the manufacturing price considerably; however, the performance gains can be quite substantial due to the larger percentage of an application which can reside on the L2 cache, thereby avoiding the relatively slow journey over the bus. The `workstation' class of computers are often sold at substantially increased costs but with typically much smaller chip speed ratings. Again, one of the primary differences is the L2 cache size which ranges from 1-4MB on most workstations. The memory requirement for a particluar array with MTOTAL elements is calculated as: MEM = MTOTAL(words)*8(bytes/word) for double precision (8 byte word) or as MEM = MTOTAL(words)*4(bytes/word) for single precision. Therefore, the heat conduction problem described in the previous section requires only 66KB and can reside entirely within cache memory. However, larger more realistic problems and applications will generally exceed the cache size of typical PCs. For example, a small 3D heat conduction problem with only one array, T(128,64,64) requires 4.2MB of memory in double precision. Another important difference between PCs, workstations, and expensive `supercomputers' is related to the bandwidth across the bus. PC computers tend to have a 32 bit bus, while workstations and supercomputers can have 64 bit, 128 bit or larger buses. If the bus speed is the analogy to the speed limit on a highway, then the bandwidth is the analogy to the number of lanes of traffic which may move at the speed; thereby increasing the total flow rate of information between the primary and L2 cache. As alluded to above, knowledge of the way a computer handles memory can be very beneficial to optimizing computer code. For example, Fortran stores multi-dimensional arrays actually as a single dimensional array in memory. Consider 2D array: A(4,2). This is stored in memory as a single column of numbers: A(1,1), A(2,1), A(3,1), A(4,1), A(1,2), A(2,2), A(3,2), A(4,2). Notice that the first index is cycled first. This is why an efficient Fortran code will always have the inner loop over the first array index:
do 10 k=1,MZ
do 10 j=1,MY
do 10 i=1,MX
A(i,j,k)=...
10 continue
for some 3D array. Notice that in this manner the memory buffer is accessed sequentially. On the other hand if the loop structure is reversed:
do 10 i=1,MX
do 10 j=1,MY
do 10 k=1,MZ
A(i,j,k)=...
10 continue
then the memory buffer has to skip in increments of MX*MY places to access each successive operation within the loop. This can substantially slow down a calculation; particularly if the MX*MY information does not reside entirely within the L2 cache memory. Under this scenario the operation can only proceed at the speed of the memory bus. As a final observation here, note that in a parallel processing program as more processors are added to the problem a larger percentage of the problem can now reside within L2 cache since more L2 cache is available. In some cases the performace increase due to this effect can supercede the performace degradation associated with communciation costs; thus leading to so called super-linear performance (ie. better than the theoretical linear scaling). I(C): Shared Memory Platforms Parallel computers come in a wide variety of forms and cost ranges which can be classified as having either `shared' memory, `distributed' memory, or `hybrid' memory architectures. Shared memory systems have several advantages in user friendliness over distributed memory systems; however, their cost is generally much larger. A schematic of a shared memory architecture is illustrated in Fig. 1. The fundamental concept is that only one primary memory exists which all processors share and access directly through their respective memory buses. All communication between processors is through the primary memory, and all processors have direct access to the entire program stored in the memory. This complete access by all processors allows for a wide variety of parallel programming paradigms to be adopted on shared memory systems. Shared memory computer hardware is not mass market when large numbers of processors are required, and is considerably more costly to produce and/or purchase than other options described below. However, at the lower end dual processor PCs and workstations are shared memory and relatively inexpensive.
The simplest parallelization available on shared memory platforms involves simply `asking' the compiler to attempt to parallelize an otherwise serial code. In Fortran this can usually be accomplished using:
f77 -Oparallel serial.f
setenv NCPUS 4
a.out
where serial.f is the Fortran source code, the environment variable tells the computer to try to run the executable using four processors, and a.out is the source code. Whether or not any performance gains are actually observed depends on the nature of the code as well as on the compiler's sophistication. In the author's experience many codes will experience no performance enhancement at all, whereas even in the best scenario performance gains are found only up to a few processors. Adding more than the optimal number of processors will show either no additional gains, or even degradation in the performace. The next level of sophistication in parallelization paradigms on shared memory computers involves mostly proprietary command structures which are usually entirely platform specific; code written with these paradigms is not portable to other computers. The command structure at this level of parallelization is relatively simple. The compiler often handles much of the actual parallelization; however, commands are usually provided which the user manually inserts within the programming source code. These commands are relatively simple and typically in effect ask the compiler to `please try to parallelize this loop'. The author's experience with the Hewlett-Packard/Convex Exemplar (a hybrid computer, but shared memory within groups of sixteen processors) is that near linear performance scaling is achievable up to approximately eight to ten processors, after which the performamce degrades. These paradigms are not appropriate for so called massively parallel applications. However, they are relatively easy to use and require little deviation from traditional serial programming knowledge. Several commercial engineering codes are on the market which can take advantage of shared memory parallel computers for this reason (eg. Fluent, a commercial computational fluid dynamics code). The final, and most sophisticated, level of parallelization paradigms are the `pure' parallel structures, including the Message Passing Interface (MPI) communication subroutines. MPI consists of a series of subroutines called by the user within a code. The MPI libraries must be linked during compilation. Two common ways to do so with Unix based Fortran77 compilers are: (1) f77 code.f -lmpi (2) mpif77 code.f MPI actually treat the primary shared memory computer as though it were a truly distributed memory system, and do not take direct advantage of the shared memory. However, these paradigms allow for `true' parallelization and total control by the user. One of their primary strengths is that they are completely portable to almost any parallel computer architecture; including shared memory. On the other hand, these require a much higher level of programming sophistication on the part of the user, and no commercial engineering code is yet available to the author's knowledge. I(D): Distributed Memory Platforms In some sense, the `purest' form of parallel computing architecture is the distributed memory computer illustrated schematically in Fig. 3. In contrast to shared memory computers, every individual processor has its own unique primary memory. The only means of communicating between processors is over an external network. This network can take many forms including highly sophisticated proprietary hardware on supercomputers, local netoworks using ethernet, or non-local computers communicating over the internet. Aside from network communication, which is strictly controlled by the user, no processor has any `knowledge' whatsoever concerning either the existence of the other processors or the contents of their memory. For this reason, the relatively simple and user friendly options for parallelization available on shared memory systems are not available on distributed memory platforms. Parallelization must be completely programmed within the source code by the user using a paradigm such as MPI. This makes distributed memory platforms considerably less accessible than shared memory; however, two major advantages are gained. The first is performance. Whereas user friendly shared memory parallelization may scale linearly to of order ten or more processors, well written MPI code run on distributed (or shared) memory platforms can yield near linear performance enhancement into thousands of processors on supercomputers. The second advantage is in cost. Distributed memory computers using hundreds or even thousands of processors can be built using `off the shelf' processors and memory. A computer built this way will in general cost an order of magnitude less than a comparable shared memory computer.
I(E): Hybrid Memory Platforms The final parallel computing architecture category is the `hybrid' architecure which is a cross between shared and distributed computers. Figure 4 shows a schematic of an arbitrary hybrid system. Local `clusters' of processors share a primary memory cache; however, these clusters must communicate with other clusters over some network in the same manner as the distributed memory system. The fastest computers in existence today are the ASCI Red and ASCI Blue machines located at Sandia National Laboratory; the largest has over nine thousand processors. Both of these machines are based on off the shelf Pentium processors built in groups of two sharing a local memory. In this sense both machines are hybrid parallel computers. Another example is the Hewlett-Packard/Convex parallel supercomputer which is built on `hyper-nodes' containing sixteen shared memory processors, connected with other hyper-nodes over a fast network.
I(F): Beowulf Systems Beowulf is the generic term for a cluster of PCs or workstations connected via a local network to operate as a parallel computer. Because beowulf systems are constructed using off the shelf hardware, they represent the most affordable means of purchasing a parallel computer. In addition, beowulfs generally utilize open source (ie. free) software including the Linux operating system, making them even more affordable. In their simplest form, a beowulf may be comprised of two or more inexpensive PCs connected with standard 10 megabit per second (mbps) ethernet (there are 8 bits per byte). At the other extreme, clusters of several hundreds of PCs or workstations have been constructed using high speed gigabit per second (gbps) ethernet or myrinet networks. By their nature, beowulfs are distributed memory platforms having all of the advantages and disadvantages described above (however, a cluster of dual processor PCs, such as the beowulf operated by Clemson's Department of Mechanical Engineering, is strictly speaking a hybrid memory system). This particular system is comprised of sixteen 700 MHz Pentium chips and cost less than an order of magnitude less than the cost of a comparable SUN shared memory computer. More specific information regarding the Department's beowulf is available at: http://www.ces.clemson.edu/~rm/beowulf.html I(G): Beowulf Networks The two primary networks available for beowulf systems are ethernet and myrinet. Ethernet comes in standard 10mbps, `fast ethernet' operating at 100mbps, and `gigabit ethernet' operating at approximately 1gbps. These performace ratings represent the potential performance under ideal conditions, but are rarely attainable when operating a beowulf cluster of PCs; particularly for gigabit ethernet. The second option is the myrinet network sold by Myricom Inc. which operates under a similar principle; however, myrinet was developed and optimized for beowulf applications and will generally outperform comparable ethernet. Myrinet one-way 64 bit PCI slot data rates are reported by the manufacturer to sustain approximately 1.8gbps; testing on 32 bit PCI slot Pentium based beowulfs has been reported to the author at over 700mbps. However, unless tremendous amounts of data are being communicated during the running of a parallel code, the primary bottleneck to performance will almost always be the bus speed between the processor's L2 cache and primary memories. Myrient has several additional advantages over standard ethernet. It has both TCP/IP and proprietary transfer protocols either of which can be specified by the user during code compiling. Myrinet cards also have an on-board processor operating at two times the bus data rate which reduces short message latency. Parallel programming strategies can be affected by the choice of network on a beowulf cluster. For example, if a relatively slow 10mbps ethernet backbone is employed, then the primary bottleneck to code performce may very well be communication costs. The user may therefore wish to tailor the code to minimize communications across the network; perhaps at the expense of added processors oeprations. On the other hand, a beowulf cluster built with 100 MHz memory bus based Pentium type processors with a high speed gigabit backbone may very well exhibit bottlenecking between the processors and their individual primary memories, rather than over the network. In this case, minimizing network communications becomes much less important to code optimization. Of course, many variables are involved in code optimization. The information provided on this page is therefore only meant to provide a starting point for the novice parallel programmer to begin the writing and optimization of whatever specific task is at hand.
|
Home
Publications
Teaching
Research
Team
Tutorial
Beowulf
Quotes