







An introduction to the design of parallel algorithms, focusing on strategies for exploiting parallelism and on architectural constraints. It covers SIMD machines, coarse-grain and fine-grain computers, CRW PRAMs, and the development of parallel algorithms on the PRAM and interconnection network models. Examples of parallel search algorithms are presented for both the PRAM and the interconnection network models.
Typology: Study notes








The Design of Parallel Algorithms

A major challenge facing computer scientists today, given the existence of massively parallel machines, is to design algorithms which exploit this parallelism. There are three main approaches to the design of parallel algorithms.
handled in various fashions. A commonly used technique allows concurrent writes only when all the processors are attempting to write the same value. Another method, which can be applied to numeric data, is to write the sum of all these values. Still other methods involve allowing a randomly chosen processor among the contending processors to write its value, or establishing a total ordering of the processors and allowing the processor with the smallest value (typically pid based) to write first, and so on.

The EREW model is the most realistic of the PRAM models to build in practice. Further, any algorithm designed for an EREW PRAM will run without alteration on the other PRAM models. Unless otherwise noted, all PRAM algorithms assume the EREW model. With the current state of technology, EREW PRAMs are difficult to build (although there are efforts currently underway to do so). Nevertheless, the PRAM model is still a very good model for theoretical results and for the initial design of a parallel algorithm, without the burden of processor communication details getting in the way of the design.

Mesh Model

Communication Constraints

Most parallel computers built today more closely follow the guidelines of the hypercube and degree-bounded network models (also called mesh models). This means that there is an interconnection network which links the processors together. In these models, each processor has its own RAM, with no common shared memory accessible to every processor. Since we are focusing on SIMD machines, variables in processor memories each have instantiations in every processor, and are thus called parallel or distributed variables. In other words, if x is a distributed variable, then each processor in the network has a memory location reserved for its own version of x. In the interconnection network models, the assumption is that each processor has sufficient memory to handle the various tasks to which it will be applied.
Nevertheless, parallel algorithms are usually written so that they require only a constant (independent of input size) number of distributed variables. Once again, statements which involve distributed variables might be executed in only a subset of the available processors: certain processors can be masked out at certain steps.

Information is communicated between processors using messages sent along the network. Messages pass along routes in the network, where each link in a route joins directly connected (adjacent) processors. To avoid routing conflicts, most parallel algorithms assume that communication in each step occurs between adjacent processors only. While this is not necessarily true in general for parallel machines, it again makes the algorithms somewhat easier to develop and analyze if we can remove such detail from the algorithm.

To describe how communication takes place between adjacent processors PX and PY, suppose the central control instructs PY to assign to the variable y in its local memory the value of the variable x in the local memory of PX. This is accomplished as follows: PX reads the value of its local variable x and sends this value along a link in the network to PY. Upon receiving this message, PY writes this value into its copy of y. [In parallel pseudo-code: PY: y ← PX: x].

For the time being, we’ll focus on the mesh model for developing parallel algorithms using the interconnection network model. For our example, we’ll use the two-dimensional mesh shown in Figure 2.

Figure 2 – Two-dimensional mesh of degree 4 (M4,4).

The two-dimensional mesh Mq,q with p = q^2 processors Pi,j, i, j ∈ {1, …, q}, has Pi,j directly connected with Pr,s iff i = r and |j - s| = 1, or j = s and |i - r| = 1 (see Figure 2 above).

I/O Constraints

As with any computer, a parallel machine must have some mechanism to read “outside world” data from external input devices into the processors’ local memories, as well as to write data from these memories to external output devices. Most parallel algorithm development takes a very high-level approach to this type of constraint, leaving the exact nature of the I/O mechanism
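The adjacency rule for the mesh Mq,q and the single-step neighbor assignment from PX to PY described in the communication constraints can be illustrated with a small sequential simulation. This is only a sketch: the function names (`mesh_neighbors`, `copy_from_left`), the dictionary representation of distributed variables, and the `active` mask argument are assumptions made for illustration, not part of the notes.

```python
def mesh_neighbors(i, j, q):
    """Processors directly connected to P(i,j) in the mesh M(q,q), 1-based.

    P(i,j) and P(r,s) are adjacent iff they share a row and differ by one
    column, or share a column and differ by one row.
    """
    candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(r, s) for r, s in candidates if 1 <= r <= q and 1 <= s <= q]


def copy_from_left(x, y, q, active=None):
    """Simulate one SIMD step: each active P(i,j) with j > 1 performs
    P(i,j): y <- P(i,j-1): x.

    x and y are distributed variables stored as dicts keyed by (i, j).
    active is a hypothetical mask; processors not marked True are masked out.
    """
    new_y = dict(y)  # masked-out processors keep their old value of y
    for i in range(1, q + 1):
        for j in range(2, q + 1):
            if active is None or active.get((i, j), False):
                new_y[(i, j)] = x[(i, j - 1)]
    return new_y
```

A corner processor such as P1,1 has two neighbors and an interior processor has four, which is why the mesh is said to have degree 4.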
Now let’s assume the more realistic EREW PRAM model and see the different assumptions that must be in place for this algorithm to be successful. For each processor to have access to the value of x simultaneously in the EREW PRAM, we need to allocate an auxiliary array temp[1:n] and assign the value of x to each array element temp[i], 1 ≤ i ≤ n. Assigning the value of x to each entry in temp[1:n] can be achieved by assigning x to temp[1] and then broadcasting x to the other positions in the array as follows. (For simplicity, assume that n = 2^k for some nonnegative integer k.) In the first step, P1 reads x and writes it to temp[1]. In the second step, P1 reads temp[1] and writes it to temp[2]. In the third step, processors P1 and P2 read temp[1] and temp[2], respectively, and write x to temp[3] and temp[4], respectively. Each subsequent step doubles the number of filled entries in the same way. This broadcasting process is illustrated in Figure 3 for x = 5 and n = 16.

Active processors          temp[1:16] after the step
P1                         5
P1                         5 5
P1 P2                      5 5 5 5
P1 P2 P3 P4                5 5 5 5 5 5 5 5
P1 P2 P3 P4 P5 P6 P7 P8    5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5

Figure 3 – Broadcasting the value of x into the array in log_2 n steps.

As Figure 3 illustrates, once x has been written to temp[1], the broadcasting of x is complete after log_2 n doubling steps. Figure 3 also illustrates that not all processors must be active at every step in SIMD processing.
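The doubling broadcast described above can be simulated sequentially as follows. This is a minimal sketch, not an actual PRAM program: the function name `erew_broadcast` and the returned step counter are assumptions for illustration, and the inner loop stands in for what the processors would do simultaneously.

```python
def erew_broadcast(x, n):
    """Fill temp[1..n] with x by repeated doubling.

    Returns (temp, steps), where steps counts the initial write to temp[1]
    plus the doubling steps. Assumes n is a power of two.
    """
    assert n >= 1 and n & (n - 1) == 0, "sketch assumes n = 2^k"
    temp = [None] * (n + 1)  # slot 0 unused so indices match temp[1:n]
    temp[1] = x              # first step: P1 reads x and writes temp[1]
    steps = 1
    filled = 1
    while filled < n:
        # One parallel step: P_i reads temp[i] and writes temp[filled + i]
        # for 1 <= i <= filled. Every cell is read by at most one processor
        # and written by at most one processor, as EREW requires.
        reads = [temp[i] for i in range(1, filled + 1)]
        for i, value in enumerate(reads, start=1):
            temp[filled + i] = value
        filled *= 2
        steps += 1
    return temp[1:], steps
```

For x = 5 and n = 16 the simulation performs the initial write followed by four doubling steps, with 1, 1, 2, 4, and 8 processors active in turn, matching the rows of Figure 3.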
After x has been broadcast and temp has been filled, then, in parallel, processor Pi compares L[i] to temp[i] = x and writes i in temp[i] if L[i] = x; otherwise it writes the value ∞ (maxint) in temp[i]. The array temp[1:n] now contains the results of the search. The parallel search step is illustrated in Figure 4.

Processor:                        P1  P2  P3  P4  P5  P6  P7  P8  P9 P10 P11 P12 P13 P14 P15 P16
temp[i] (= x):                     5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5
L[i]:                              2  -1   9  -4   2   5  -2   0   5   1   5  -5   8   5   3   -
temp[1:n] after parallel step:     ∞   ∞   ∞   ∞   ∞   6   ∞   ∞   9   ∞  11   ∞   ∞  14   ∞   ∞

Figure 4 – Single parallel comparison step between search elements and list elements.

However, we are still left with the problem of signaling a successful search. Note that the value that we wish to return is nothing more than the minimum value in the temp array. Using the binary fan-in technique (see previous day’s notes), we can obtain a straightforward parallel algorithm for determining the minimum of a set of n numbers on an EREW PRAM with n/2 processors. For n = 16, the binary fan-in technique reduces the 15 sequential steps required by a sequential processor to only four parallel steps, thereby achieving a speed-up of 15/4 over the sequential algorithm.

The basic operation for a sequential search algorithm was the comparison of a list element to the search element. In the parallel algorithm all such comparisons are made in a single step. Therefore, we must choose another basic operation to make a meaningful statement about the complexity (the number of parallel basic operations) of our parallel search algorithm. We could use either the number of parallel assignment statements performed when broadcasting the search element or the number of parallel comparison steps in computing the minimum value in temp[1:n] in the final phase. Either choice yields a complexity of log_2 n.

Example: Searching in the Two-dimensional Mesh Model
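For comparison with the mesh version developed in this example, the comparison and fan-in phases of the EREW PRAM search can be sketched as a sequential simulation. The names `compare_step` and `fan_in_min` are hypothetical, `math.inf` stands in for the maxint sentinel, and the pairing order used in the fan-in is one common choice that may differ from the one in the earlier notes.

```python
import math

INF = math.inf  # stands in for the "maxint" sentinel


def compare_step(L, temp):
    """Single parallel step: P_i writes i if L[i] equals temp[i], else INF.

    temp is assumed to already hold the broadcast search element in every
    position. Indices in the result are 1-based, as in the notes.
    """
    return [i if L[i - 1] == temp[i - 1] else INF
            for i in range(1, len(L) + 1)]


def fan_in_min(values):
    """Binary fan-in minimum; returns (minimum, number of parallel steps).

    Assumes len(values) is a power of two.
    """
    vals = list(values)
    n = len(vals)
    assert n >= 1 and n & (n - 1) == 0, "sketch assumes n = 2^k"
    steps = 0
    stride = 1
    while stride < n:
        # One parallel step: each active processor combines its own cell
        # with the cell `stride` positions away. No cell is touched by two
        # processors in the same step, so EREW access is respected.
        for i in range(0, n, 2 * stride):
            vals[i] = min(vals[i], vals[i + stride])
        stride *= 2
        steps += 1
    return vals[0], steps


L = [2, -1, 9, -4, 2, 5, -2, 0]        # first eight list values of Figure 4
temp = compare_step(L, [5] * len(L))   # temp already filled with x = 5
print(fan_in_min(temp))                # (6, 3)
```

With 16 values the same loop runs four times, which is the four-step reduction quoted above.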
Figure 5 – Initial states of the distributed variables L, x, and index in the mesh M4,4.

After the search element has been broadcast to all n processors (so that each Pi,j: x contains the value of the search element), in a single parallel comparison step each processor Pi,j compares Pi,j: x to its list element Pi,j: L and writes the value ∞ to Pi,j: index if the search element is not equal to the list element. After the single parallel comparison step, the distributed variable index contains the results of the search. Figure 6 illustrates the configuration of the mesh after the parallel comparison step has completed.

Each entry lists L, x, index for processor Pi,j (row i, column j):

          j = 1        j = 2        j = 3        j = 4
i = 1     2, 5, ∞      -1, 5, ∞     9, 5, ∞      -4, 5, ∞
i = 2     2, 5, ∞      5, 5, 6      -2, 5, ∞     0, 5, ∞
i = 3     5, 5, 9      1, 5, ∞      5, 5, 11     -5, 5, ∞
i = 4     8, 5, ∞      5, 5, 14     3, 5, ∞      -, 5, ∞
Figure 6 – State of the distributed variables in the mesh after value 5 has been broadcast throughout x and the single parallel comparison step has been performed.

As before, the mesh now contains the results of the search, but how do we return the result? Typically, whenever a scalar-valued function defined on an interconnection network terminates, the value to be returned by the function resides in a particular processor’s local instantiation of a suitable distributed variable. Let’s assume that processor P1,1 is our designated processor to return the value of the search in its instantiation of the distributed variable index. In other words, at the termination of the search, we want P1,1: index to hold the smallest index i such that L[i] = x, or ∞ if no such index exists. To compute this minimum value, we need to perform what amounts to a reverse broadcast procedure. In phase 1, column minimums are computed as shown in Figure 7.
Figure 8 – Final state of the distributed variables in the 2-d mesh upon completion of the searching algorithm.
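The mesh search as a whole can be sketched as a sequential simulation. Several details here are assumptions rather than statements from the notes: the list is taken to be stored in row-major order (as the index values 6, 9, 11, and 14 in Figure 6 suggest), the search element is taken to be already broadcast, and the second phase, which moves the minimum along the first row into P1,1, is inferred from the phrase "reverse broadcast" because only phase 1 is described in the surviving text. The function name `mesh_search` is hypothetical.

```python
import math

INF = math.inf  # stands in for the "maxint" sentinel


def mesh_search(L, x, q):
    """Search q*q values stored row-major on the mesh M(q,q).

    Returns (value left in P(1,1): index, number of communication steps).
    Grid positions are 0-based internally; list indices are 1-based.
    """
    assert len(L) == q * q
    # Distributed variables L and index: one copy per processor.
    Lm = [[L[i * q + j] for j in range(q)] for i in range(q)]
    index = [[i * q + j + 1 for j in range(q)] for i in range(q)]

    # Single parallel comparison step (x assumed already broadcast).
    for i in range(q):
        for j in range(q):
            if Lm[i][j] != x:
                index[i][j] = INF

    steps = 0
    # Phase 1: column minimums. In each step every processor in row i
    # sends its index to the adjacent processor in row i - 1.
    for i in range(q - 1, 0, -1):
        for j in range(q):  # all columns act in parallel
            index[i - 1][j] = min(index[i - 1][j], index[i][j])
        steps += 1

    # Phase 2 (assumed): the first row passes its minimum leftward
    # one adjacent link at a time until it reaches P(1,1).
    for j in range(q - 1, 0, -1):
        index[0][j - 1] = min(index[0][j - 1], index[0][j])
        steps += 1

    return index[0][0], steps
```

Under these assumptions every step uses only links between adjacent processors, and the result reaches P1,1 after 2(q - 1) communication steps, which is 6 for M4,4.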