






The implementation of parallel matrix transpose algorithms on distributed memory concurrent processors using non-blocking communication and block cyclic data distribution. It includes figures and pseudocode to illustrate the concepts.







This report describes parallel matrix transpose algorithms on distributed memory concurrent processors. We assume that the matrix is distributed over a P x Q processor template with a block scattered data distribution. We have implemented the communication scheme that involves complete exchange communication, which arises when P and Q are relatively prime. The algorithms make use of non-blocking, point-to-point communication between processors. The use of non-blocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Details of the parallel implementation of the algorithm are given, and results are presented for runs on the Beowulf high-performance computer.
Matrix transposition is a fundamental operation of linear algebra and arises in many scientific and engineering applications. On a uniprocessor, an algorithm involving a transposed matrix may not actually require the matrix data to be transposed in physical memory. Instead, the transposed matrix may be accessed simply by exchanging the row and column indices. In a distributed-memory multiprocessor environment, however, we cannot simply interchange the global row and column indices: the data must be physically moved from one processor to another. In this paper, parallel matrix transpose algorithms based on the block scattered decomposition are presented. The matrix transpose algorithm involves complete exchange communication, also called all-to-all personalized communication, in which each of the Np = P x Q processors is required to send distinct subblocks to each of the remaining Np - 1 processors, and to receive distinct subblocks from each of them.
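The uniprocessor case mentioned above can be sketched in a few lines of C. This is only an illustration: the function name `transposed_at` and the row-major storage layout are assumptions made for the example, not part of the report.

```c
#include <assert.h>
#include <stddef.h>

/* Read element (i, j) of the transpose of a row-major matrix `a` that has
 * `rows` rows and `cols` columns, without moving any data.
 * The transpose has `cols` rows and `rows` columns.
 * (Hypothetical helper; row-major storage is an assumption.) */
double transposed_at(const double *a, size_t rows, size_t cols,
                     size_t i, size_t j)
{
    assert(i < cols && j < rows);  /* (i, j) indexes the cols x rows transpose */
    return a[j * cols + i];        /* same storage, indices exchanged */
}
```

No such shortcut exists once the matrix is spread over several processors, because element (j, i) generally lives in the memory of a different processor.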
In a block cyclic data distribution, a P x Q processor template maps each position in the template to a processor rank. This is shown in the table below. The rank of a processor is determined by its row and column position in the template. Each element (or block) of the matrix is assigned to the processor whose rank appears at the corresponding position of the template. The P by Q template is repeated across the matrix until every element of the initial matrix has been mapped to a processor. This is known as block cyclic data distribution. The block scattered decomposition provides a simple, yet general-purpose way of distributing a block-partitioned matrix on distributed memory concurrent computers.
12/2/2020 CSE 5240
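The mapping described above can be sketched as follows. The names `block_home` and `locate_block` are hypothetical, and numbering the ranks row by row across the template is an assumption (it agrees with the P = 2, Q = 3 example in Figure 4).

```c
#include <assert.h>

/* Position of one block under a block scattered (block cyclic) distribution.
 * (Hypothetical type, introduced only for this sketch.) */
typedef struct {
    int rank;       /* owning processor, numbered row by row in the template */
    int local_row;  /* block row index inside the owner's local array */
    int local_col;  /* block column index inside the owner's local array */
} block_home;

/* Locate global block (I, J) on a P x Q processor template. */
block_home locate_block(int I, int J, int P, int Q)
{
    block_home h;
    assert(I >= 0 && J >= 0 && P > 0 && Q > 0);
    h.rank = (I % P) * Q + (J % Q);  /* the template repeats every P rows, Q columns */
    h.local_row = I / P;             /* how many template repeats lie above */
    h.local_col = J / Q;             /* how many template repeats lie to the left */
    return h;
}
```

For example, with P = 2 and Q = 3, block (2, 4) maps to rank 1 and is stored at local position (1, 1) on that processor.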
If P and Q are relatively prime, the matrix transpose algorithm involves a two-dimensional complete exchange communication, where each of the Np processors is required to send distinct subblocks to each of the remaining Np - 1 processors, and to receive distinct subblocks from each of them. We implemented the more complicated two-dimensional complete exchange algorithm by generalizing the one-dimensional complete exchange algorithm. In the direct point-to-point communication scheme depicted in Figure 2, the number of steps is Np - 1, but the amount of data transmitted in each step is only one subblock.
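A one-dimensional direct exchange with Np - 1 steps can be sketched as below. The partner rule used here (send to rank + s, receive from rank - s, both modulo Np) is an assumption chosen for illustration; the report's Figure 2 may order the partners differently. The function names are hypothetical.

```c
#include <assert.h>

/* Partner that `rank` sends to at step s, for 1 <= s <= np - 1 (assumed rule). */
int send_partner(int rank, int s, int np)
{
    assert(np > 1 && rank >= 0 && rank < np && s >= 1 && s < np);
    return (rank + s) % np;
}

/* Partner that `rank` receives from at the same step. */
int recv_partner(int rank, int s, int np)
{
    assert(np > 1 && rank >= 0 && rank < np && s >= 1 && s < np);
    return (rank - s + np) % np;
}

/* Returns 1 if, over steps 1 .. np-1, every processor sends to every other
 * processor exactly once and each send is matched by the receiver. */
int schedule_is_complete(int np)
{
    for (int r = 0; r < np; ++r) {
        for (int dst = 0; dst < np; ++dst) {
            int hits = 0;
            for (int s = 1; s < np; ++s) {
                if (send_partner(r, s, np) == dst) {
                    ++hits;
                    /* the receiver must expect this sender at step s */
                    if (recv_partner(dst, s, np) != r) return 0;
                }
            }
            if (hits != (dst == r ? 0 : 1)) return 0;
        }
    }
    return 1;
}
```

Because each step pairs every sender with a distinct receiver, one subblock per processor is in flight at each of the Np - 1 steps.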
Pseudocode:

    DO J = 0, Q - 1
      DO I = 0, P - 1
        [ Copy all blocks of A required by P(p + I, q - J) to T1 (in condensed and transposed form) ]
        [ Send T1 to P(p + I, q - J) ]
        [ Receive T2 from P(p - I, q + J) ]
        [ Copy T2 to AT ]
      END DO
    END DO

Figure 3

In this case, blocks in P0 are scattered to all processors after being locally transposed, as shown in Figure 4 (b). This case involves the two-dimensional complete exchange communication; that is, every processor needs to communicate with every other processor. The complete exchange problem is implemented by direct communication between sender and receiver. Figure 3 shows the pseudocode from the processor point-of-view, where P(p, q) represents the processor at position (MOD(p, P), MOD(q, Q)) in the processor template.

Processor P(p, q) (0 <= p < P and 0 <= q < Q) starts by transposing the blocks whose transposed blocks belong to itself. Then it deals with blocks whose transposes belong to processors in the same column of the template (P(p - i, q), 0 <= i < P). The processor sends blocks to its top neighbor, P(p - 1, q), and receives blocks from its bottom neighbor, P(p + 1, q). Before sending the blocks, it is necessary to copy the blocks to be sent into a contiguous message buffer. Next it sends blocks to the next top processor, P(p - 2, q), and receives blocks from the next bottom processor, P(p + 2, q). After it completes its operations with the processors in the same column, it sends blocks to the processors to the left in the template (P(p - i, q - 1), 0 <= i < P), and receives blocks from the processors to the right (P(p + i, q + 1)). All operations are completed in P x Q = LCM(P, Q) steps, since P and Q are relatively prime.
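The partner selection in the double loop of Figure 3 can be sketched in C as follows. This is a minimal sketch under stated assumptions: it follows the signs as printed in the pseudocode (send toward P(p + I, q - J), receive from P(p - I, q + J)), numbers ranks row by row across the template, and uses hypothetical function names. Flipping both signs gives the mirror-image schedule, which visits the same set of partners.

```c
#include <assert.h>

enum { MAX_RANKS = 64 };  /* arbitrary limit for this sketch */

/* Wrap-around index, valid for negative x too (C's % can return negatives). */
static int wrap(int x, int n) { return ((x % n) + n) % n; }

/* Rank of P(p, q) in a P x Q template, ranks numbered row by row (assumed). */
static int rank_of(int p, int q, int P, int Q)
{
    return wrap(p, P) * Q + wrap(q, Q);
}

/* Rank that P(p, q) sends to at loop indices (I, J). */
int send_target(int p, int q, int I, int J, int P, int Q)
{
    return rank_of(p + I, q - J, P, Q);
}

/* Rank that P(p, q) receives from at loop indices (I, J). */
int recv_source(int p, int q, int I, int J, int P, int Q)
{
    return rank_of(p - I, q + J, P, Q);
}

/* Returns 1 if the P*Q iterations reach every rank exactly once from P(p, q). */
int covers_all(int p, int q, int P, int Q)
{
    int seen[MAX_RANKS] = {0};
    assert(P > 0 && Q > 0 && P * Q <= MAX_RANKS);
    for (int J = 0; J < Q; ++J) {
        for (int I = 0; I < P; ++I) {
            int t = send_target(p, q, I, J, P, Q);
            if (seen[t]++) return 0;  /* a rank was targeted twice */
        }
    }
    return 1;  /* P*Q distinct targets among P*Q ranks: all are reached */
}
```

The iteration (I, J) = (0, 0) targets the processor itself, which corresponds to the local transposition of blocks that stay on the same processor.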
Fig 4a: Matrix transpose from matrix point-of-view. Each entry is the processor that owns the block at that block row and block column.

         0  1  2  3  4  5
    0    0  1  2  0  1  2
    1    3  4  5  3  4  5
    2    0  1  2  0  1  2
    3    3  4  5  3  4  5
    4    0  1  2  0  1  2
    5    3  4  5  3  4  5

Fig 4b: Matrix transpose from processor point-of-view. Block rows and block columns are grouped by owning processor.

              0 3    1 4    2 5
    0 2 4     P0     P1     P2
    1 3 5     P3     P4     P5

Figure 4: An example of matrix transpose for a block scattered decomposition, when P = 2, Q = 3, and Mb = Nb = 6.
After block cyclic distribution of the 12 x 12 matrix over the 12 processors:

         0      4      8      | 1      5      9      | 2      6      10     | 3      7      11
    0    A0,0   A0,4   A0,8   | A0,1   A0,5   A0,9   | A0,2   A0,6   A0,10  | A0,3   A0,7   A0,11
    3    A3,0   A3,4   A3,8   | A3,1   A3,5   A3,9   | A3,2   A3,6   A3,10  | A3,3   A3,7   A3,11
    6    A6,0   A6,4   A6,8   | A6,1   A6,5   A6,9   | A6,2   A6,6   A6,10  | A6,3   A6,7   A6,11
    9    A9,0   A9,4   A9,8   | A9,1   A9,5   A9,9   | A9,2   A9,6   A9,10  | A9,3   A9,7   A9,11

    1    A1,0   A1,4   A1,8   | A1,1   A1,5   A1,9   | A1,2   A1,6   A1,10  | A1,3   A1,7   A1,11
    4    A4,0   A4,4   A4,8   | A4,1   A4,5   A4,9   | A4,2   A4,6   A4,10  | A4,3   A4,7   A4,11
    7    A7,0   A7,4   A7,8   | A7,1   A7,5   A7,9   | A7,2   A7,6   A7,10  | A7,3   A7,7   A7,11
    10   A10,0  A10,4  A10,8  | A10,1  A10,5  A10,9  | A10,2  A10,6  A10,10 | A10,3  A10,7  A10,11

    2    A2,0   A2,4   A2,8   | A2,1   A2,5   A2,9   | A2,2   A2,6   A2,10  | A2,3   A2,7   A2,11
    5    A5,0   A5,4   A5,8   | A5,1   A5,5   A5,9   | A5,2   A5,6   A5,10  | A5,3   A5,7   A5,11
    8    A8,0   A8,4   A8,8   | A8,1   A8,5   A8,9   | A8,2   A8,6   A8,10  | A8,3   A8,7   A8,11
    11   A11,0  A11,4  A11,8  | A11,1  A11,5  A11,9  | A11,2  A11,6  A11,10 | A11,3  A11,7  A11,11

The figure above shows the block cyclic distributed matrix across the 12 processors. The top-left group of 12 elements belongs to processor 0, the next group of 12 elements to its right belongs to processor 1, and so on along the first row of groups. The leftmost group in the next row of groups belongs to processor 4, the next to processor 5, and so on. In this way the elements of the original matrix are distributed over the 12 processors in a block cyclic fashion. Now we will see how the elements are scattered after the transpose operation has been performed on all the processors.

         0      4      8      | 1      5      9      | 2      6      10     | 3      7      11
    0    A0,0   A4,0   A8,0   | A1,0   A5,0   A9,0   | A2,0   A6,0   A10,0  | A3,0   A7,0   A11,0
    3    A0,3   A4,3   A8,3   | A1,3   A5,3   A9,3   | A2,3   A6,3   A10,3  | A3,3   A7,3   A11,3
    6    A0,6   A4,6   A8,6   | A1,6   A5,6   A9,6   | A2,6   A6,6   A10,6  | A3,6   A7,6   A11,6
    9    A0,9   A4,9   A8,9   | A1,9   A5,9   A9,9   | A2,9   A6,9   A10,9  | A3,9   A7,9   A11,9

    1    A0,1   A4,1   A8,1   | A1,1   A5,1   A9,1   | A2,1   A6,1   A10,1  | A3,1   A7,1   A11,1
    4    A0,4   A4,4   A8,4   | A1,4   A5,4   A9,4   | A2,4   A6,4   A10,4  | A3,4   A7,4   A11,4
    7    A0,7   A4,7   A8,7   | A1,7   A5,7   A9,7   | A2,7   A6,7   A10,7  | A3,7   A7,7   A11,7
    10   A0,10  A4,10  A8,10  | A1,10  A5,10  A9,10  | A2,10  A6,10  A10,10 | A3,10  A7,10  A11,10

    2    A0,2   A4,2   A8,2   | A1,2   A5,2   A9,2   | A2,2   A6,2   A10,2  | A3,2   A7,2   A11,2
    5    A0,5   A4,5   A8,5   | A1,5   A5,5   A9,5   | A2,5   A6,5   A10,5  | A3,5   A7,5   A11,5
    8    A0,8   A4,8   A8,8   | A1,8   A5,8   A9,8   | A2,8   A6,8   A10,8  | A3,8   A7,8   A11,8
    11   A0,11  A4,11  A8,11  | A1,11  A5,11  A9,11  | A2,11  A6,11  A10,11 | A3,11  A7,11  A11,11
The figure above shows how the elements of the transposed matrix are distributed among the 12 processors. The layout is the same block cyclic distribution, but with the indices of every element reversed; for example, the entry with indices (5, 4) becomes the entry with indices (4, 5). This is achieved by sending each element Aij to the position occupied by Aji and receiving from that position the element that was previously stored there. The pseudocode for the matrix transpose carries out this exchange.
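The movement of elements in this 12 x 12 example can be checked with a short counting sketch. It assumes one element per block, a 3 x 4 processor template, and ranks numbered row by row across the template; the names `owner` and `transpose_traffic` are hypothetical.

```c
#include <assert.h>

enum { MAX_RANKS = 64 };  /* arbitrary limit for this sketch */

/* Owner of global element (i, j) on a P x Q template with one element per
 * block; ranks numbered row by row (an assumption matching the example). */
int owner(int i, int j, int P, int Q)
{
    return (i % P) * Q + (j % Q);
}

/* Fill count[src][dst] with the number of elements of an n x n matrix that
 * move from rank src to rank dst when the matrix is transposed. */
void transpose_traffic(int n, int P, int Q, int count[MAX_RANKS][MAX_RANKS])
{
    assert(n > 0 && P > 0 && Q > 0 && P * Q <= MAX_RANKS);
    for (int s = 0; s < MAX_RANKS; ++s)
        for (int d = 0; d < MAX_RANKS; ++d)
            count[s][d] = 0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            /* A(i, j) ends up at position (j, i) of the transposed matrix */
            ++count[owner(i, j, P, Q)][owner(j, i, P, Q)];
}
```

With n = 12, P = 3 and Q = 4, every entry count[src][dst] for ranks 0 to 11 comes out as 1: each processor sends exactly one element to each processor (including itself), which is the complete exchange pattern that arises when P and Q are relatively prime.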
The matrix transpose was implemented in C on the 48-node Beowulf cluster, and the results are given below. For each case we report the time taken by the whole algorithm as well as the time taken by the transpose operation alone. Square matrices with M = N ranging from 360 up to 1800 were transposed on 6 to 36 processors. The timings were measured using the MPI_Wtime() function call. The timings given here are the average times for 6, 12, ..., 36 processors in each case. The results for a square matrix are shown in the following tables, where rows give the number of processors and columns give the matrix size.

    Processors   360      720      1080     1440      1800
    6            68.11    294.74   692.76   2364.65   3590.
    12           43.01    183.23   377.93   674.2     926.
    24           11.17    91.89    206.98   393.43    594.
    30           9.07     91.51    173.24   309.8     492.
    36           6.97     41.32    140.49   268.8     407.

Table 1a. Time for Total Execution

    Processors   360      720      1080     1440      1800
    12           1.58     1.61     1.83     3.51      3.
    24           6.09     3.2      3.35     6.01      6.
    30           7.51     3.22     3.99     7.63      7.
    36           9.77     7.13     4.93     8.79      8.
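To compare rows of Table 1a, the timings can be turned into relative speedup and efficiency figures. The sketch below is an illustration only: the function names are hypothetical, and the 6-processor row is used as the baseline because no single-processor timing is reported. Both quantities are ratios, so they do not depend on the time unit.

```c
#include <assert.h>

/* Speedup of a run taking t_now relative to a baseline run taking t_base. */
double relative_speedup(double t_base, double t_now)
{
    assert(t_base > 0.0 && t_now > 0.0);
    return t_base / t_now;
}

/* Relative efficiency: speedup divided by the growth in processor count. */
double relative_efficiency(double t_base, int p_base, double t_now, int p_now)
{
    assert(p_base > 0 && p_now > 0);
    return relative_speedup(t_base, t_now) * (double)p_base / (double)p_now;
}
```

For the 360 column, `relative_speedup(68.11, 6.97)` compares the 6-processor and 36-processor total execution times and evaluates to roughly 9.8.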