Efficient Solutions for Massively Parallel Machines: Exploring Parallel Algorithms, Study notes of Computer Science

An introduction to the design of parallel algorithms, focusing on strategies for exploiting parallelism and on architectural constraints. It covers SIMD machines, coarse-grain and fine-grain computers, PRAM read/write models, and the development of parallel algorithms on PRAM and interconnection network models. Examples of parallel search algorithms are presented for both the PRAM and the interconnection network models.

Typology: Study notes

Pre 2010

Uploaded on 11/08/2009

koofers-user-tig

The Design of Parallel Algorithms
A major challenge facing computer scientists today, given the existence of
massively parallel machines, is to design algorithms which exploit this
parallelism. There are three main approaches to the design of parallel
algorithms.
1. Modify existing sequential algorithms, exploiting those parts of the
algorithm that are naturally parallelizable. To some extent, this is what we
did in the last set of notes with the algorithm to find the largest key from a
set of keys. The tournament method (also called the binary fan-in
technique) is not unique to parallel algorithms; indeed, the same technique
can be applied sequentially. However, that part of the algorithm is
inherently parallel.
2. Design an entirely new parallel algorithm that may have no natural
sequential analog.
3. For some problems, such as finding roots, the same sequential algorithm
is run on many different processors concurrently with different seed values
until one of the processors reports “success”. That is, all the processors
start running a sequential algorithm with different initial conditions, and the
first processor to achieve the desired result “wins the race.”
All three of these strategies are viable in certain situations and we will see
examples of each as we explore parallel algorithms further.
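As a rough illustration of the third strategy, the sketch below runs the same sequential root finder (Newton's method, chosen here only as an example) from several seeds and returns the first success. The function names `newton` and `race`, the test polynomial, and the seed values are all assumptions made for this illustration, and Python threads merely stand in for the processors of a parallel machine.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def newton(f, df, seed, tol=1e-12, max_iter=50):
    """Sequential Newton iteration; returns a root, or None if it fails."""
    x = seed
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x          # this "processor" reports success
        d = df(x)
        if d == 0:
            return None       # flat tangent: give up on this seed
        x -= fx / d
    return None


def race(f, df, seeds):
    """Run the same algorithm from every seed; the first success wins."""
    with ThreadPoolExecutor(max_workers=len(seeds)) as pool:
        futures = [pool.submit(newton, f, df, s) for s in seeds]
        for fut in as_completed(futures):
            root = fut.result()
            if root is not None:
                return root
    return None


# Hypothetical example problem: the single real root of x^3 - 2x - 5.
f = lambda x: x * x * x - 2 * x - 5
df = lambda x: 3 * x * x - 2
root = race(f, df, seeds=[-3.0, 0.5, 2.0, 10.0])
```

Which seed wins is not deterministic, but any returned value satisfies the success test inside `newton`.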
Architectural Constraints When Designing a Parallel Algorithm
A number of constraints arise when designing parallel algorithms that do not
occur when designing sequential algorithms. These constraints are imposed by
the architecture of the particular parallel machine on which the algorithm is
intended to be executed. We alluded to some of these constraints in the last
section of notes and now we will expand this discussion somewhat. There are
five basic constraints that we will examine in some detail:
1. Single instruction versus multiple instruction architecture. Do all the
processors execute the same instruction or different instructions
concurrently?
2. The number and type of processors that are available.
Parallel Algorithms II (16)


  3. Does the architecture support the PRAM model of shared memory or does it support a distributed memory through an interconnection network?
  4. Communication constraints. PRAM models have read/write restrictions while interconnection networks must specify (through a graph) the direct connections that exist between processors.
  5. I/O constraints. How is the connection to the “outside world” handled?
Single Instruction vs. Multiple Instruction
A single-processor computer can only execute one instruction at a time. A parallel computer with p processors can execute p instructions concurrently. Each processor may operate on possibly different data. In the PRAM model, each processor executes the same instruction on possibly different data concurrently in a synchronized manner. This common instruction also contains information that can instruct a given processor to remain idle (masked out) during a given step. The operations of each processor must be controlled by a front-end processor (central control) and a global clock. During each time interval of the global clock, all the processors concurrently perform the same operation (input or output data, perform computations on data, read from local memories, communicate between processors, and so forth). Since the operations pulse through the system in regular clock intervals, this model is often referred to as systolic computing.
Parallel computers that follow this model are called SIMD (Single Instruction Multiple Data) machines. Parallel machines that allow different instructions to be performed at the same time on possibly different data are called MIMD (Multiple Instruction Multiple Data) machines. Since SIMD machines are conceptually simpler and easier to implement, we’ll focus on SIMD machines.
Number and Type of Processor
In practice, computer manufacturers must decide whether to build a coarse-grain computer, one which has tens or hundreds of powerful processors, or a fine-grain computer, one which has thousands and thousands of relatively simple processors. For our modeling purposes we will consider that the processors available are powerful enough to execute all the normal instructions of a serial computer.
There are two approaches to designing parallel algorithms with respect to the number of processors available. The first approach is to design the algorithm where the number of processors used by the algorithm is an input parameter. In this approach, the number of processors p does not depend on the input size n. The second approach is to allow the number of processors to grow with the

handled in various fashions. A commonly used technique only allows concurrent writes when all the processors are attempting to write the same value. Another method, which can be applied to numeric data, is to write the sum of all these values. Still other methods involve allowing a randomly chosen processor among the contending processors to write its value, or establishing a total ordering of the processors and allowing the processor with the smallest value (typically pid based) to write first, and so on.
The EREW model is the most realistic of the PRAM models to build in practice. Further, any algorithm designed for an EREW PRAM will run without alteration on the other PRAM models. Unless otherwise noted, all PRAM algorithms assume the EREW model. With the current state of technology, EREW PRAMs are difficult to build (although there are efforts currently underway to do so). Nevertheless, the PRAM model is still a very good model for theoretical results and for the initial design of a parallel algorithm, without the burden of processor communication details getting in the way of the design.
Mesh Model Communication Constraints
Most parallel computers built today more closely follow the guidelines of the hypercube and degree-bounded network models (also called mesh models). This means that there is an interconnection network which links the processors together. In these models, each processor has its own RAM, with no common shared memory accessible to every processor. Since we are focusing on SIMD machines, variables in processor memories each have instantiations in every processor, and are thus called parallel or distributed variables. In other words, if x is a distributed variable, then each processor in the network has a memory location reserved for its own version of x. In the interconnection network models, the assumption is that each processor has sufficient memory to handle the various tasks to which it will be applied.
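As a rough illustration of a distributed variable, the sketch below models each distributed variable as one list with a slot per processor, and models a single SIMD step as an assignment carried out by every processor that is not masked out. The helper name `parallel_assign`, the processor count, and the sample values are assumptions made for this illustration only.

```python
p = 4                        # number of processors (illustrative)
x = [0] * p                  # distributed variable x: one copy per processor
y = [10, 20, 30, 40]         # distributed variable y: one copy per processor


def parallel_assign(dest, src, mask=None):
    """One SIMD step: every unmasked processor i performs dest[i] = src[i]."""
    for i in range(len(dest)):
        if mask is None or mask[i]:
            dest[i] = src[i]


# Processors 2 and 4 are masked out and keep their old value of x.
parallel_assign(x, y, mask=[True, False, True, False])
```

The loop only simulates what the processors would do simultaneously; on a real SIMD machine the whole call corresponds to one clock step.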
Nevertheless, parallel algorithms are usually written so that they require only a constant (independent of input size) number of distributed variables. Once again, statements which involve distributed variables might only be executed in a subset of the available processors; certain processors can be masked out at certain steps.
Information is communicated between processors using messages sent along the network. Messages pass along routes in the network where each link in a route is between directly connected (adjacent) processors. To avoid routing conflicts, most parallel algorithms will assume communication occurs in each step between adjacent processors only. While this is not necessarily true in general for parallel machines, it again makes the algorithms somewhat easier to develop and analyze if we can remove such detail from the algorithm.
To describe how communication takes place between adjacent processors PX and PY, suppose the central control instructs PY to assign to the variable y in its local memory the value of the variable x in the local memory of PX. This is accomplished as follows: PX reads the value of local variable x and sends this value along a link in the network to PY. Upon receiving this message, PY writes this value into its copy of y. [In parallel pseudo-code: PY:y ← PX:x]
For the time being, we’ll focus on the mesh model for developing parallel algorithms using the interconnection network model. For our example, we’ll use the two-dimensional mesh shown in Figure 2.
Figure 2 – Two-dimensional mesh of degree 4 (M4,4).
The two-dimensional mesh $M_{q,q}$ with $p = q^2$ processors $P_{i,j}$, $i, j \in \{1, \ldots, q\}$, has $P_{i,j}$ directly connected with $P_{r,s}$ iff ($i = r$ and $|j - s| = 1$) or ($j = s$ and $|i - r| = 1$) (see Figure 2 above).
I/O Constraints
As with any computer, a parallel machine must have some mechanism to read “outside world” data from external input devices into the processors’ local memories, as well as to write data from these memories to external output devices. Most parallel algorithm development takes a very high-level approach to this type of constraint, leaving the exact nature of the I/O mechanism


Now let’s assume the more realistic EREW PRAM model and see the different assumptions that must be in place for this algorithm to be successful. For each processor to have access to the value of x simultaneously in the EREW PRAM, we need to allocate an auxiliary array temp[1:n] and assign the value of x to each array element temp[i], $1 \le i \le n$. Assigning the value of x to each entry in temp[1:n] can be achieved by assigning x to temp[1] and then broadcasting x to the other positions in the array as follows. (For simplicity, assume that $n = 2^k$ for some nonnegative integer k.) In the first step, P1 reads x and writes it to temp[1]. In the second step, P1 reads temp[1] and writes it to temp[2]. In the third step, processors P1 and P2 read temp[1] and temp[2], respectively, and write x to temp[3] and temp[4], respectively. Each further step doubles the number of filled positions in the same way. This broadcasting process is illustrated in Figure 3 when x = 5 and n = 16.
  Array positions: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
  Step 1 (P1 active):       5
  Step 2 (P1 active):       5 5
  Step 3 (P1, P2 active):   5 5 5 5
  Step 4 (P1 to P4 active): 5 5 5 5 5 5 5 5
  Step 5 (P1 to P8 active): 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
Figure 3 – Broadcasting the value of x into the array in $\log_2 n$ steps.
As Figure 3 illustrates, the broadcasting of x is complete after $\log_2 n$ doubling steps (counted after the initial write of x to temp[1]). Figure 3 also illustrates that not all processors must be active at every step in SIMD processing.
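The doubling broadcast described above can be sketched as a short sequential simulation. This is only an illustration under stated assumptions: the function name `erew_broadcast` is hypothetical, n is taken to be a power of two as in the notes, and the inner loops stand in for what the processors would do in one parallel step. The assertion checks that no two processors touch the same cell within a step, which is the EREW restriction.

```python
def erew_broadcast(x, n):
    """Fill temp[1..n] with x by doubling.

    Returns (temp, doubling_steps). The initial write of x into temp[1]
    is not counted, so doubling_steps equals log2(n) when n is a power of two.
    """
    temp = [None] * (n + 1)      # index 0 is unused so positions match the notes
    temp[1] = x                  # initial step: P1 reads x and writes temp[1]
    filled = 1
    doubling_steps = 0
    while filled < n:
        # Processor P_i (1 <= i <= filled) reads temp[i] and writes temp[filled + i].
        reads = [i for i in range(1, filled + 1) if filled + i <= n]
        writes = [filled + i for i in reads]
        # Exclusive read, exclusive write: every cell is touched by one processor.
        assert len(set(reads)) == len(reads)
        assert len(set(writes)) == len(writes)
        values = [temp[i] for i in reads]        # read phase
        for w, v in zip(writes, values):         # write phase
            temp[w] = v
        filled = min(2 * filled, n)
        doubling_steps += 1
    return temp[1:], doubling_steps


temp, steps = erew_broadcast(5, 16)
```

For x = 5 and n = 16 the simulation passes through the same array states as the five rows of Figure 3.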

After x has been broadcast and temp has been filled, then, in parallel, processor Pi compares L[i] to temp[i] = x and writes i in temp[i] if L[i] = x; otherwise it writes the value ∞ (maxint) in temp[i]. The array temp[1:n] now contains the results of the search. The parallel search step is illustrated in Figure 4.
  Processor:    P1   P2   P3   P4   P5   P6   P7   P8   P9   P10  P11  P12  P13  P14  P15  P16
  temp before:  5    5    5    5    5    5    5    5    5    5    5    5    5    5    5    5
  L:            2    -1   9    -4   2    5    -2   0    5    1    5    -5   8    5    3    -
  temp after:   ∞    ∞    ∞    ∞    ∞    6    ∞    ∞    9    ∞    11   ∞    ∞    14   ∞    ∞
Figure 4 – Single parallel comparison step between search elements and list elements.
However, we are still left with the problem of signaling a successful search. Note that the value that we wish to return is nothing more than the minimum value in the temp array. Using the binary fan-in technique (see previous day’s notes), we can obtain a straightforward parallel algorithm for determining the minimum of a set of n numbers on an EREW PRAM with n/2 processors. For n = 16, the binary fan-in technique reduces the 15 comparison steps required by a sequential processor to only four parallel steps, thereby achieving a speed-up of 15/4 over the sequential algorithm.
The basic operation for a sequential search algorithm was the comparison of a list element to the search element. In the parallel algorithm all such comparisons are made in a single step. Therefore, we must choose another basic operation to make a meaningful statement about the complexity (the number of parallel basic operations) of our parallel search algorithm. We could use either the number of parallel assignment statements performed when broadcasting the search element or the number of parallel comparison steps in computing the index of the minimum value in temp[1:n] in the final phase. Either choice yields a complexity of $\log_2 n$.
Example: Searching in the Two-dimensional Mesh Model

Figure 5 – Initial states of the distributed variables L, x, and index in the mesh M4,4.
After the search element has been broadcast to all n processors (so that each Pi,j:x contains the value of the search element), in a single parallel comparison step each processor Pi,j compares Pi,j:x to its list element Pi,j:L and writes the value ∞ to Pi,j:index if the search element is not equal to the list element. Where the two are equal, index holds the element’s position in the list, as Figure 6 shows. After the single parallel comparison step, the distributed variable index contains the results of the search. Figure 6 illustrates the configuration of the mesh after the parallel comparison step has completed.
  Each cell lists L, x, index for processor Pi,j.
           j = 1         j = 2         j = 3         j = 4
  i = 1    2   5  ∞      -1  5  ∞      9   5  ∞      -4  5  ∞
  i = 2    2   5  ∞      5   5  6      -2  5  ∞      0   5  ∞
  i = 3    5   5  9      1   5  ∞      5   5  11     -5  5  ∞
  i = 4    8   5  ∞      5   5  14     3   5  ∞      -   5  ∞

Figure 6 – State of the distributed variables in the mesh after value 5 has been broadcast throughout x and the single parallel comparison step has been performed.
As before, the mesh now contains the results of the search, but how do we return the result? Typically, whenever a scalar-valued function defined on an interconnection network terminates, the value to be returned by the function resides in a particular processor’s local instantiation of a suitable distributed variable. Let’s assume that processor P1,1 is our designated processor to return the value of the search in its instantiation of the distributed variable index. In other words, at the termination of the search, we want P1,1:index to hold the smallest index i such that L[i] = x, or ∞ if no such index exists. To compute this minimum value, we need to perform what amounts to a reverse broadcast procedure. In phase 1, column minimums are computed as shown in Figure 7.

Figure 8 – Final state of the distributed variables in the 2-d mesh upon completion of the searching algorithm.
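The mesh search can be sketched as a sequential simulation. The sketch makes several assumptions beyond what the notes state: the function name `mesh_search` is hypothetical, the list is stored in row-major order (consistent with the indices 6, 9, 11 and 14 in Figure 6), and phase 2 is assumed to pass the minimum leftward along the first row into P1,1, since only phase 1 is described in the text. The last list value is a placeholder.

```python
INF = float("inf")


def mesh_search(L, x, q):
    """Search a list held row-major on a q x q mesh, one element per processor.

    Returns (value left in P(1,1):index, number of communication steps).
    """
    # Parallel comparison step: P(i,j) keeps its 1-based list position or writes INF.
    index = [[(i * q + j + 1) if L[i * q + j] == x else INF
              for j in range(q)] for i in range(q)]
    steps = 0
    # Phase 1: column minimums flow upward, one adjacent link per step.
    for i in range(q - 1, 0, -1):
        for j in range(q):               # all columns act in the same parallel step
            index[i - 1][j] = min(index[i - 1][j], index[i][j])
        steps += 1
    # Phase 2 (assumed): the minimum of row 1 flows leftward into P(1,1).
    for j in range(q - 1, 0, -1):
        index[0][j - 1] = min(index[0][j - 1], index[0][j])
        steps += 1
    return index[0][0], steps


# List values follow Figure 4; the final entry (-7) is a placeholder.
L = [2, -1, 9, -4, 2, 5, -2, 0, 5, 1, 5, -5, 8, 5, 3, -7]
result, steps = mesh_search(L, 5, 4)
missing, _ = mesh_search(L, 42, 4)
```

Every transfer in both phases is between adjacent processors, so the reverse broadcast in this sketch uses 2(q - 1) communication steps on a q x q mesh.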