







An overview of interprocessor communication algorithms and topologies in massively parallel computers, focusing on the Intel Touchstone Delta system. It covers communication algorithms for one-to-all, all-to-one, and all-to-all operations, together with their performance and scalability issues. The authors also introduce the concept of actual and effective message sizes in the context of k-level algorithms.








Chi Zhang 658310279
This survey introduces interconnection topologies for massively parallel processors, then communication algorithms and experimental results on a mesh-connected computer; finally, an architecture-independent model is introduced briefly. The focus is on the performance and scalability of the communication algorithms. To make the material easier to follow, I add my own explanations and figures throughout. In this survey, “the authors” means the authors of a referenced paper, while the author of this survey is simply referred to as “I”.
Coarse-grained machines have emerged as major architectures in massively parallel computing. Compared with fine-grained machines, coarse-grained machines typically fall into the shared memory MIMD category, where processors operate asynchronously on the large, shared memory. There are several popular interconnection architectures for coarse-grained machines, such as fully connected, hypercube, mesh, fat-tree, and ring. Ring and fat-tree are low-cost, low-performance solutions, and communication on them is not predictable because of congestion. On the other hand, the fully connected network and the hypercube enjoy high performance but suffer from extremely high cost and a lack of extensibility.

Topology     Number of connections    Diameter            Degree
complete     p(p-1)/2                 1                   p-1
hypercube    (p log2 p)/2             log2 p              log2 p
mesh         2(p - p^(1/2))           2(p^(1/2) - 1)      4
fat-tree     linear                   2 log((p+1)/2)      3
ring         p                        floor(p/2)          2
Table 1 Comparison of topologies
The mesh has some potential advantages over the hypercube topology. First, a mesh is more extensible. Hypercubes must be expanded in powers of two, which is often prohibitively expensive, so commercial hypercubes often have a fixed maximum dimension, whereas meshes can be expanded at linear cost by adding a row or column. Second, a mesh saves a lot of connections: it needs only 2(p - p^(1/2)) connections, compared with (p log2 p)/2 for a hypercube and p(p-1)/2, which grows as the square of p, for a complete graph. Third, as a tradeoff, the mesh still provides a reasonable diameter.
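To make the comparison concrete, the closed-form entries of Table 1 can be evaluated for a given machine size. The following is only a sketch: the function name `topology_metrics` is hypothetical, it assumes p is both a power of two and a perfect square (so that the hypercube and the square mesh are both well defined), and it omits the fat-tree because the table gives its connection count only as "linear".

```python
import math


def topology_metrics(p):
    """Return {topology: (connections, diameter, degree)} for p processors.

    Assumption: p is a power of two and a perfect square (16, 64, 256, ...).
    The formulas are the ones listed in Table 1.
    """
    d = p.bit_length() - 1   # log2 p, valid when p is a power of two
    s = math.isqrt(p)        # side length of the square mesh
    if p < 4 or 2 ** d != p or s * s != p:
        raise ValueError("p must be a power of two and a perfect square")
    return {
        "complete": (p * (p - 1) // 2, 1, p - 1),
        "hypercube": (p * d // 2, d, d),
        "mesh": (2 * (p - s), 2 * (s - 1), 4),
        "ring": (p, p // 2, 2),
    }


if __name__ == "__main__":
    for name, (links, diameter, degree) in topology_metrics(256).items():
        print(f"{name:10s} links={links:6d} diameter={diameter:4d} degree={degree}")
```

For p = 256 the mesh needs 480 connections against 1024 for the hypercube and 32640 for the complete graph, at the price of a diameter of 30 instead of 8.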
The Intel Touchstone DELTA system is a mesh-connected parallel processor consisting of 528 i860 compute nodes, 32 80386 I/O nodes, two 80386 network interface nodes, six service nodes, and two tape nodes. Each compute node has 16 million bytes of memory and is connected to a Mesh Routing Chip (MRC) through a Mesh Interface Module (MIM). Each MRC channel is 8 bits wide and has a bandwidth of 65 million bytes/second (MB/s), but the FIFOs on the MIM have a data rate of only 26.7 MB/s. The largest mesh available to an application is 16 x 32. [1]
Figure 1 DELTA mesh and nodes [1]
The communication network of the Touchstone Delta uses wormhole routing. The packet size is 512 bytes, with 482 bytes reserved for data and 30 bytes for the message header. The operating system supports both blocking and non-blocking communication primitives. [3]
Wormhole routing is a special case of cut-through switching. Instead of storing a packet completely in a node and then transmitting it to the next node, wormhole routing advances the head of a packet directly from the incoming to the outgoing channel of the routing chip. A packet is divided into a number of flits (flow control digits) for transmission. The size of a flit depends on system parameters, in particular the channel width. The header flit (or flits) governs the route. As soon as a node examines the header flit(s) of a message, it selects the next channel on the route and begins forwarding flits down that channel. As the header advances along the specified route, the remaining flits follow in a pipelined fashion. [2]
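The packet format quoted above (512-byte packets carrying 482 data bytes and a 30-byte header) gives a quick way to estimate how many packets a message occupies. The sketch below is illustrative only: the function name `packetize` is hypothetical, and it assumes that the last packet is not padded to the full 512 bytes, which the source does not state.

```python
PACKET_BYTES = 512                              # packet size on the Delta [3]
HEADER_BYTES = 30                               # header bytes per packet [3]
PAYLOAD_BYTES = PACKET_BYTES - HEADER_BYTES     # 482 data bytes per packet


def packetize(message_bytes):
    """Return (number of packets, total bytes on the wire) for one message.

    Assumption: a partially filled final packet carries only its own data
    plus a full header, i.e. no padding.
    """
    if message_bytes < 1:
        raise ValueError("message_bytes must be positive")
    packets = -(-message_bytes // PAYLOAD_BYTES)   # ceiling division
    return packets, message_bytes + packets * HEADER_BYTES


if __name__ == "__main__":
    for size in (256, 482, 483, 4096):
        print(size, packetize(size))
```

Under these assumptions a 4096-byte message needs 9 packets, so the header overhead is 270 bytes.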
each processor in its column.
The two 1-lev algorithms are direct send and broadcasting. The 2-lev algorithms differ in how they partition the machine. I give two figures to explain these methods.
Figure 2 The two supersteps of the 2-lev-rec algorithm
As shown in Figure 2, each row is a submachine, and the leftmost node of each submachine (in red) serves as its leader. In the first superstep, source node P0 makes p-1 sends directly to the leaders (shown as red lines with arrows); in the second superstep, each leader performs a 1-lev-dir(p-1) within its row.
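The two supersteps described for Figure 2 can be sketched as a list of sends. This is an illustration and not the authors' implementation. It assumes row-major processor numbering with P0 as the source, one leader per row (the leftmost node), one send from the source to every other leader, and a personalized message of L bytes per destination, so that the bundle sent to a leader has an effective size of cols * L bytes. The function names are hypothetical.

```python
def one_level_direct(p, L):
    """1-level direct send: the source sends L bytes to each of the p-1 others."""
    return [(0, dst, L) for dst in range(1, p)]


def two_level_one_to_all(rows, cols, L):
    """Return (step1, step2), each a list of (sender, receiver, bytes).

    step1: the source sends every other row leader the bundle for its row.
    step2: every leader, the source included, sends directly within its row.
    """
    step1 = [(0, r * cols, cols * L) for r in range(1, rows)]
    step2 = [(r * cols, r * cols + c, L)
             for r in range(rows) for c in range(1, cols)]
    return step1, step2


def sends_by(node, messages):
    """Count the message set-ups incurred by one node."""
    return sum(1 for src, _, _ in messages if src == node)


if __name__ == "__main__":
    step1, step2 = two_level_one_to_all(16, 16, 256)
    print("source set-ups, 2-level:", sends_by(0, step1 + step2))
    print("source set-ups, 1-level:", sends_by(0, one_level_direct(256, 256)))
    print("effective size in step 1:", step1[0][2], "bytes")
```

On a 16 x 16 machine with L = 256 bytes, this sketch has the source perform 30 set-ups instead of 255, while the effective message size in the first superstep grows from 256 to 4096 bytes. That is the trade-off between set-up cost and effective message size discussed in the text.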
Figure 3 The three supersteps of the 3-lev-sq algorithm
Table 2 Performance for one-to-all communication on a 256-processor machine (times are in msec).
Observations: Algorithm 1-lev-dir is a reasonable choice only for large message sizes. Algorithms 1-lev-sys-br and 1-lev-our-br give the worst performance. 2-lev-rec, 3-lev-sq, and logp-lev-rec(0.75) perform best.
1-lev-dir minimizes the effective message size but incurs a total of p - 1 message set-up costs. The poor performance of the broadcast algorithms is partly due to their large effective message size: each node takes much longer to send and receive the effective message, so the broadcast algorithms give acceptable results only for small messages. The 3-lev-sq algorithm balances the effective message size against the total number of messages and hence provides one of the best performances. logp-lev-rec(0.75) performs even better than 3-lev-sq because on the Touchstone Delta the sending and receiving capabilities are not equal; the partition factor 0.75 was obtained by experiment.
The authors also discuss scalability.
Figure 5 Scalability, sending a total of 64 Kbytes and 256 Kbytes, varying machine size.
Figure 6 Scalability, sending an actual message size of 256 bytes and 4 Kbytes, varying machine size.
Figure 5 shows scalability for a constant total message size (constant problem size): as the machine size increases, the best algorithm provides almost constant performance. Figure 6 shows scalability for a constant actual message size (the problem size grows with the machine size): the performance gap narrows for messages of 4 Kbytes. This suggests that the multi-level algorithms bring much benefit only when the actual message size is small.
Conceptually, all-to-one is the inverse of one-to-all. Except for the algorithms based on broadcasting, all one-to-all algorithms have corresponding all-to-one algorithms. In practice, however, the best one-to-all algorithms do not necessarily correspond to the best all-to-one algorithms.
results indicate that for the logp-lev-rec(γ) algorithm, γ = 0.65 performs better for small actual messages (L < 2048) and γ = 0.60 performs better for large actual messages (L > 2048).
Comparison of all-to-one and one-to-all
Observations: For messages shorter than 512 bytes, the all-to-one algorithms are slightly faster than their one-to-all counterparts; for messages longer than 512 bytes, the all-to-one algorithms are significantly slower. There are several reasons. First, in the one-to-all algorithms, when a processor issues multiple sends, higher priority is given to the destinations that are further away; in all-to-one, a processor that issues multiple receives cannot apply such a rule. Second, the set-up time for receiving a message is less than the set-up time for sending one. For small message sizes, the set-up time experienced when sending messages constitutes a bigger fraction of the overall time.
All-to-all communication is the most difficult case, because it gives rise to congestion, which is hard to analyze. The most straightforward 1-level approach to all-to-all routing is to have each processor send its p-1 messages directly, which is referred to as 1-lev-dir. Several interesting methods have been proposed that partition the problem into permutations: [4][5] provide experimental results for exclusive-or permutations, and [6] proposes balanced permutations. The authors [3] refer to these as Algorithm 1-lev-xor and Algorithm 1-lev-balance. Algorithm 2-lev-sq is independent of the underlying architecture. It consists of 3 steps. In each step every processor sends out a total of pL bytes; the first and the last step send them as p^(1/2) messages, and the second step sends them as one single message. Algorithm 2-lev-r,c consists of only 2 steps, with each step sending out a total of pL bytes in the form of p^(1/2) messages.
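The idea of partitioning all-to-all routing into exclusive-or permutations can be sketched as follows. This is the commonly used formulation, in which processor j exchanges with processor j XOR i in step i; I assume it matches what [4][5] measure, but the source does not spell out the schedule. The function name `xor_schedule` is hypothetical, and p must be a power of two.

```python
def xor_schedule(p):
    """Return p-1 permutations; entry [i-1][j] is the partner of j in step i.

    Assumption: in step i (1 <= i <= p-1), processor j exchanges its message
    with processor j XOR i, so each step is a pairwise exchange.
    """
    if p < 2 or p & (p - 1):
        raise ValueError("p must be a power of two")
    return [[j ^ i for j in range(p)] for i in range(1, p)]


if __name__ == "__main__":
    for step, partners in enumerate(xor_schedule(8), start=1):
        print(f"step {step}: {partners}")
```

Every step pairs the processors off without conflicts at the endpoints, and over the p-1 steps each processor meets every other processor exactly once. Congestion on the mesh links between the endpoints is a separate matter that this sketch does not model.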
Figure 10 Performance results for all-to-all communication
Algorithm 1-lev-xor gives the best performance for larger message sizes, i.e. L > 256. For small message sizes (i.e. L < 256), Algorithm 2-lev-c,r achieves the best performance. This conclusion holds not only for a 256-processor machine but for all machine sizes.
An architecture-independent model could bridge the gap between software and hardware. It should accurately reflect the constraints of a parallel machine and allow accurate prediction of an algorithm's performance. Several papers propose models of this kind [7][8]. [7] proposes the C^3 model, which models the computation units and the sending and receiving protocols, and arrives at simple formulas for evaluating congestion and total time. It provides a method to predict and evaluate arbitrary communication patterns.
Figure 11 Predicted performance (in units) of the One-to-All Algorithms on a 256-Processor Intel Delta
Figure 12 Experimental results of the One-to-All Algorithms on a 256-Processor Intel Delta
[7] Susanne E. Hambrusch, Ashfaq A. Khokhar, “An architecture-independent model for coarse-grained parallel machines”,
[8] T. Heywood and S. Ranka, “A Practical Hierarchical Model of Parallel Computation: I. The model”, JPDC , Vol.16, pp.212-232,