Interprocessor Communication Algorithms on Parallel Computers: Performance and Scalability, Papers of Electrical and Electronics Engineering

An overview of interprocessor communication algorithms and topologies in massively parallel computers, focusing on the Intel Touchstone Delta system. It covers communication algorithms for one-to-all, all-to-one, and all-to-all operations, their performance, and their scalability. It also introduces the concept of actual and effective message sizes in the context of k-level algorithms.

Uploaded on 09/17/2009

koofers-user-20n
Interprocessor Communication Algorithms on
Massive Processor Computers
Chi Zhang 658310279
Abstract
This survey introduces interconnection topologies for massively parallel processors, then communication algorithms and experimental results on a mesh-connected computer, and finally gives a brief introduction to an architecture-independent model. The focus is on the performance and scalability of the communication algorithms. To make the material easier to follow, I add my own explanations and figures throughout. In this survey, “the authors” means the authors of a referenced paper, while the author of this survey is simply referred to as “I”.
Interprocessor topology
Coarse-grained machines have emerged as a major architecture for massively parallel computers. In contrast to fine-grained machines, coarse-grained machines typically fall into the shared-memory MIMD category, where processors operate asynchronously on a large, shared memory.
There are several popular interconnection architectures for coarse-grained machines, such as fully connected, hypercube, mesh, fat-tree, and ring. Ring and fat-tree are low-cost, low-performance solutions, and communication on them is not predictable because of congestion. On the other hand, fully connected networks and hypercubes offer high performance but suffer from extremely high cost and a lack of extensibility.
Topology    Number of connections    Diameter                 Degree
complete    $p(p-1)/2$               1                        $p-1$
hypercube   $(p \log_2 p)/2$         $\log_2 p$               $\log_2 p$
mesh        $2(p - p^{1/2})$         $2(p^{1/2} - 1)$         4
fat-tree    linear                   $2\log((p+1)/2)$         3
ring        $p$                      $\lfloor p/2 \rfloor$    2
Table 1 Comparison of topologies
The mesh has some potential advantages over the hypercube topology. First, the mesh is more extensible. Hypercubes must be expanded in powers of two, which is often prohibitively expensive, so commercial hypercubes often have a fixed maximum dimension. In contrast,
meshes can be expanded at linear cost by adding a row or column. Second, the mesh saves many connections: it needs only $2(p - p^{1/2})$ connections, compared with $(p \log_2 p)/2$ for the hypercube and $p(p-1)/2$, which is on the order of $p^2$, for the complete graph. Third, as a tradeoff, the mesh provides a reasonable diameter.
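The entries of Table 1 can be checked numerically. The following is a minimal Python sketch, assuming p is both a perfect square (for a square mesh) and a power of two (for the hypercube); the function name `topology_metrics` is illustrative and does not come from the referenced papers. The fat-tree is left out because the table gives its connection count only as "linear".

```python
import math


def topology_metrics(p):
    """Return {topology: (connections, diameter, degree)} for p processors.

    Assumption: p is a perfect square and a power of two, e.g. 16, 64, 256.
    """
    side = math.isqrt(p)          # side length of the square mesh
    dim = p.bit_length() - 1      # hypercube dimension, log2(p)
    if side * side != p or 1 << dim != p:
        raise ValueError("p must be a perfect square and a power of two")
    return {
        "complete": (p * (p - 1) // 2, 1, p - 1),
        "hypercube": (p * dim // 2, dim, dim),
        "mesh": (2 * (p - side), 2 * (side - 1), 4),
        "ring": (p, p // 2, 2),
    }


for name, (links, diameter, degree) in topology_metrics(256).items():
    print(f"{name:10s} links={links:6d} diameter={diameter:3d} degree={degree}")
```

For p = 256 the mesh needs 480 links against 1024 for the hypercube and 32640 for the complete graph, at the price of a diameter of 30 instead of 8.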

Platform: Intel Delta Touchstone architecture and routing

The Intel Touchstone DELTA system is a mesh-connected parallel processor consisting of 528 i860 compute nodes, 32 80386 I/O nodes, two 80386 network interface nodes, six service nodes, and two tape nodes. Each compute node has 16 million bytes of memory and is connected to a Mesh Routing Chip (MRC) through a Mesh Interface Module (MIM). Each MRC channel is 8 bits wide and has a bandwidth of 65 million bytes/second (MB/s), but the FIFOs on the MIM have a data rate of only 26.7 MB/s. The largest mesh available to an application is 16 × 32. [1]

Figure 1 DELTA mesh and nodes [1]

The communication network of the Touchstone Delta uses wormhole routing. The packet size is 512 bytes, with 482 bytes reserved for data and 30 bytes for the message header. The operating system supports both blocking and non-blocking communication primitives. [3]

Wormhole routing is a special case of cut-through switching. Instead of storing a packet completely in a node and then transmitting it to the next node, wormhole routing advances the head of a packet directly from the incoming to the outgoing channel of the routing chip. A packet is divided into a number of flits (flow control digits) for transmission. The size of a flit depends on system parameters, in particular the channel width. The header flit (or flits) governs the route. As soon as a node examines the header flit(s) of a message, it selects the next channel on the route and begins forwarding flits down that channel. As the header advances along the specified route, the remaining flits follow in a pipelined fashion. [2]
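To make the pipelining effect concrete, here is a small Python sketch. It uses the packet layout stated above (512-byte packets with a 30-byte header) and an idealized, contention-free latency model that is my own simplification, not a model from the referenced papers: routing decisions take no time, and the flit size defaults to one byte on the assumption that it matches the 8-bit channel width. All function names are illustrative.

```python
import math

PACKET_BYTES = 512                          # packet size stated in the text
HEADER_BYTES = 30                           # header bytes per packet
DATA_BYTES = PACKET_BYTES - HEADER_BYTES    # 482 payload bytes per packet


def packet_count(message_bytes):
    """Number of packets needed to carry a message (at least one)."""
    return max(1, math.ceil(message_bytes / DATA_BYTES))


def store_and_forward_time(packet_bytes, hops, bytes_per_time_unit):
    """Every hop receives the whole packet before passing it on."""
    return hops * packet_bytes / bytes_per_time_unit


def wormhole_time(packet_bytes, hops, bytes_per_time_unit, flit_bytes=1):
    """The header flit runs hops-1 channels ahead; the body streams behind it.

    Assumption: no contention and no per-hop routing delay.
    """
    return ((hops - 1) * flit_bytes + packet_bytes) / bytes_per_time_unit
```

Under these assumptions the distance term is paid per flit rather than per packet, so a full packet crossing 10 hops costs 521 byte-times with wormhole routing against 5120 with store-and-forward.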

each processor in its column.

A processor that received a long message applies Algorithm 1-lev-dir($p^{1/2}$) within its row.

Algorithm 3-lev-sq(p)

  1. The machine is partitioned into $p^{1/2}$ square submachines.
  2. The source processor prepares $p^{1/2}-1$ long messages, each containing $p^{1/2}$ messages, and sends one long message to the leader processor of each other submachine.
  3. Each submachine applies Algorithm 2-lev-rec($p^{1/2}$).

Algorithm logp-lev-sq(p)

  1. The machine is partitioned into 2 submachines, alternating partitions along the columns and rows.
  2. The source processor concatenates p/2 messages into one long message and sends the long message to the leader processor in the other submachine.
  3. Each submachine applies Algorithm logp-lev-sq(p/2).

Algorithm logp-lev-rec(p, $\gamma$)

  1. The machine is partitioned into 2 submachines, one containing $\gamma p$ processors including the source processor, and the other containing $(1-\gamma)p$ processors.
  2. The source processor concatenates $(1-\gamma)p$ messages into one long message and sends it to the leader processor in the other submachine.
  3. The submachine with $\gamma p$ processors applies Algorithm logp-lev-rec($\gamma p$, $\gamma$), and the submachine with $(1-\gamma)p$ processors applies Algorithm logp-lev-rec($(1-\gamma)p$, $\gamma$).
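The recursive partitioning can be sketched in a few lines of Python. This is an illustration under simplifying assumptions of my own, not the authors' implementation: the machine is treated as a flat list of processor ids rather than a mesh, the leader of the other submachine is simply its first processor, and the split size is rounded to the nearest integer while keeping both parts non-empty. The name `logp_lev_rec` mirrors the algorithm name but is otherwise hypothetical.

```python
def logp_lev_rec(procs, source, gamma):
    """Return a list of sends (level, sender, receiver, n_messages).

    n_messages is the number of original messages concatenated into the
    long message, i.e. the size of the submachine that receives it.
    """
    sends = []

    def recurse(group, src, level):
        if len(group) <= 1:
            return
        # Size of the part that keeps the source: between 1 and len(group) - 1.
        keep = min(len(group) - 1, max(1, round(gamma * len(group))))
        others = [q for q in group if q != src]
        mine = [src] + others[:keep - 1]   # submachine containing the source
        rest = others[keep - 1:]           # the other submachine
        leader = rest[0]                   # assumed leader: first processor
        sends.append((level, src, leader, len(rest)))
        recurse(mine, src, level + 1)
        recurse(rest, leader, level + 1)

    recurse(list(procs), source, 0)
    return sends


schedule = logp_lev_rec(range(8), source=0, gamma=0.5)
for level, sender, receiver, n in sorted(schedule):
    print(f"level {level}: {sender} -> {receiver} carrying {n} message(s)")
```

With `gamma=0.5` the recursion halves the machine at every level, which corresponds to the message counts of Algorithm logp-lev-sq; other values of `gamma` shift work between the source's side and the other side.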

The 1-lev algorithms are direct send and broadcasting. The 2-lev algorithms differ in their partitioning methods. Figures 2 and 3 illustrate these methods.

Figure 2 The two supersteps of Algorithm 2-lev-rec

As shown in Figure 2, each row is a submachine, and the leftmost node of each submachine (in red) serves as the leader. In the first superstep, the source node P0 sends one long message directly to each of the other $p^{1/2}-1$ leaders (red arrows); in the second superstep, each leader applies Algorithm 1-lev-dir($p^{1/2}$) within its row.
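The two supersteps can be written down as an explicit send schedule. The sketch below assumes a square n × n mesh with n = $p^{1/2}$, processors addressed as (row, column), and leaders located in the source's column, as in the figure; the function name and the tuple layout are hypothetical.

```python
def two_lev_rec_schedule(n, source=(0, 0)):
    """Return (step1, step2), each a list of (sender, receiver, n_messages).

    step1: the source sends one long message of n messages to every other
           leader in its column.
    step2: every leader, including the source, sends one message directly
           to each other processor in its row.
    """
    src_row, src_col = source
    step1 = [(source, (r, src_col), n) for r in range(n) if r != src_row]
    step2 = [((r, src_col), (r, c), 1)
             for r in range(n) for c in range(n) if c != src_col]
    return step1, step2


step1, step2 = two_lev_rec_schedule(4)
print(len(step1), "long messages, then", len(step2), "short messages")
```

On a 4 × 4 mesh the source issues 3 sends in the first superstep instead of the 15 that direct sending would need, and the 12 sends of the second superstep are spread over the 4 leaders.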

Figure 3 The three supersteps of Algorithm 3-lev-sq

Table 2 Performance for one-to-all communication on a 256-processor machine (times are in msec).

Observations:

  • Algorithm 1-lev-dir is a reasonable choice only for large message sizes.
  • Algorithms 1-lev-sys-br and 1-lev-our-br give the worst performance.
  • Algorithms 2-lev-rec, 3-lev-sq, and logp-lev-rec(0.75) perform best.

Algorithm 1-lev-dir minimizes the effective message size but incurs a total of p - 1 message set-up costs. The poor performance of the broadcast algorithms is partly due to their large effective message size: every node takes much longer to send and receive each effective message. The broadcast algorithms therefore give acceptable results only for small messages. Algorithm 3-lev-sq balances the effective message size against the total number of messages and hence is among the best performers. Algorithm logp-lev-rec(0.75) performs even better than 3-lev-sq because on the Touchstone Delta the sending and receiving capabilities are not equal; the partition factor 0.75 was obtained by experiment.
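The tradeoff between set-up costs and effective message size can be illustrated with a toy cost model. Everything in the following Python sketch is an assumption made for illustration: a send of b bytes costs `setup + per_byte * b`, the sends issued by one processor are serialized, the leaders in the second superstep work in parallel, and congestion is ignored. The parameter values are hypothetical and are not measurements from the Delta.

```python
import math


def send_time(n_bytes, setup, per_byte):
    """Linear cost of one send: fixed set-up plus a per-byte term."""
    return setup + per_byte * n_bytes


def one_lev_dir_time(p, L, setup, per_byte):
    """Source issues p-1 sends, each with effective size L."""
    return (p - 1) * send_time(L, setup, per_byte)


def two_lev_rec_time(p, L, setup, per_byte):
    """Source sends n-1 long messages of n*L bytes, then each leader
    sends n-1 messages of L bytes (leaders run in parallel), n = sqrt(p)."""
    n = math.isqrt(p)
    if n * n != p:
        raise ValueError("p must be a perfect square")
    step1 = (n - 1) * send_time(n * L, setup, per_byte)
    step2 = (n - 1) * send_time(L, setup, per_byte)
    return step1 + step2


for L in (256, 65536):
    d = one_lev_dir_time(256, L, setup=100, per_byte=0.01)
    t = two_lev_rec_time(256, L, setup=100, per_byte=0.01)
    print(f"L={L}: 1-lev-dir/2-lev-rec cost ratio = {d / t:.2f}")
```

In this model both algorithms move the same number of bytes through the source, so the difference comes entirely from the number of set-ups: the two-level scheme wins clearly for small L, and the ratio approaches 1 as L grows, which is in line with the observation that direct sending is reasonable only for large messages.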

The authors also discuss scalability.

Figure 5 Scalability: sending a total of 64 Kbytes and 256 Kbytes, varying machine size.

Figure 6 Scalability: sending an actual message size of 256 bytes and 4 Kbytes, varying machine size.

Figure 5 shows scalability for a constant total message size (constant problem size): as the machine size increases, the best algorithm delivers almost constant performance. Figure 6 shows scalability for a constant actual message size (the problem size grows with the machine size): the performance gap narrows for messages of 4 Kbytes. This suggests that the multi-level algorithms bring much benefit only when the actual message size is small.

All-to-one communication

Conceptually, all-to-one is the inverse of one-to-all. Except for the algorithms based on broadcasting, every one-to-all algorithm has a corresponding all-to-one algorithm. In practice, however, the best one-to-all algorithms do not necessarily correspond to the best all-to-one algorithms.

results indicate that for Algorithm logp-lev-rec($\gamma$), $\gamma = 0.65$ performs better for small actual messages (L < 2048), and $\gamma = 0.60$ performs better for large actual messages (L > 2048).

Comparison of all-to-one and one-to-all

Observations:

  • For messages shorter than 512 bytes, the all-to-one algorithms are slightly faster than their one-to-all counterparts.
  • For messages longer than 512 bytes, the all-to-one algorithms are significantly slower.

There are several reasons. First, in the one-to-all algorithms, a processor that issues multiple sends gives higher priority to the destinations further away; in all-to-one, a processor that issues multiple receives is unable to apply such a rule. Second, the set-up time for receiving a message is less than the set-up time for sending one. For small messages, the set-up time constitutes a bigger fraction of the overall time, so the cheaper set-up on the receiving side gives all-to-one a slight advantage there.

All-to-all communication

All-to-all communication is the most difficult, because it gives rise to congestion, which is difficult to analyze. The most straightforward 1-level approach to all-to-all routing is to have each processor send its p-1 messages directly; this is referred to as Algorithm 1-lev-dir. Several interesting methods partition the problem into permutations: [4][5] provide experimental results for exclusive-or permutations, and [6] proposes balanced permutations. The authors [3] refer to these as Algorithm 1-lev-xor and Algorithm 1-lev-balance.

Algorithm 2-lev-sq is independent of the underlying architecture and consists of 3 steps. In each step, every processor sends out a total of pL bytes; the first and last steps send them in the form of $p^{1/2}$ messages, and the second step sends them as one single message. Algorithm 2-lev-r,c consists of only 2 steps, each sending out a total of pL bytes in the form of $p^{1/2}$ messages.
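An exclusive-or permutation schedule is commonly formulated as follows: in step k, for k = 1, ..., p-1, processor i exchanges a message with processor i XOR k. The Python sketch below uses this common formulation; I assume, without confirmation from the source, that it matches what [4][5] measure. It requires p to be a power of two, and the function name `xor_schedule` is illustrative.

```python
def xor_schedule(p):
    """Return p-1 steps; step k-1 lists the pairs (i, i XOR k) for all i.

    Each step is a permutation in which partners are paired symmetrically,
    so every processor sends and receives exactly one message per step.
    """
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two")
    return [[(i, i ^ k) for i in range(p)] for k in range(1, p)]


for k, step in enumerate(xor_schedule(4), start=1):
    print(f"step {k}: {step}")
```

Over the p-1 steps every processor is paired with every other processor exactly once, which is what all-to-all communication requires.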

Figure 10 Performance results for all-to-all communication

Algorithm 1-lev-xor gives the best performance for larger message sizes (L > 256). For small message sizes (L < 256), Algorithm 2-lev-c,r achieves the best performance. This conclusion holds not only for a 256-processor machine but for all machine sizes.

An Architecture Independent Communication Model

An architecture-independent model could bridge the gap between software and hardware. It should accurately reflect the constraints of a parallel machine and allow accurate prediction of an algorithm's performance. Several papers propose models of this kind [7][8]. [7] proposes the $C^3$ model, which models the computation units and the sending and receiving protocols, and yields simple formulas for evaluating congestion and total time. It provides a method to predict and evaluate arbitrary communication patterns.

Figure 11 Predicted performance (in units) of the One-to-All Algorithms on a 256-Processor Intel Delta

Figure 12 Experimental results of the One-to-All Algorithms on a 256-Processor Intel Delta

[7] Susanne E. Hambrusch, Ashfaq A. Khokhar, “An architecture-independent model for coarse-grained parallel machines”,

[8] T. Heywood and S. Ranka, “A Practical Hierarchical Model of Parallel Computation: I. The model”, JPDC , Vol.16, pp.212-232,