Implications of the CRCW Hierarchy of Submodels and Parallel Algorithms, Slides of Parallel Computing and Programming

These slides present the CRCW hierarchy of PRAM submodels and its implications for efficient parallel algorithms. They cover various PRAM computations, including matrix multiplication, and provide examples of algorithms for sorting and selection. The document also touches upon topics like NUMA, UMA, and COMA.

Typology: Slides

2012/2013

Uploaded on 04/30/2013

devank

Part II

Extreme Models

II Extreme Models

Study the two extremes of parallel computation models:

  • Abstract SM (PRAM); ignores implementation issues
  • Concrete circuit model; incorporates hardware details
  • Everything else falls between these two extremes

Topics in This Part

Chapter 5 PRAM and Basic Algorithms

Chapter 6 More Shared-Memory Algorithms

Chapter 7 Sorting and Selection Networks

Chapter 8 Other Circuit-Level Examples

5.1 PRAM Submodels and Assumptions

Fig. 4.6 Conceptual view of a parallel random-access machine (PRAM).

[Figure: p processors, numbered 0 to p – 1, all connected to a shared memory with locations 0 to m – 1.]

Processor i can do the following in three phases of one cycle:

  1. Fetch a value from address s_i in shared memory
  2. Perform computations on data held in local registers
  3. Store a value into address d_i in shared memory
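The three phases above can be mimicked in a few lines of Python. This is only a sketch under stated assumptions: `pram_cycle`, `src`, `dst`, and `compute` are hypothetical names, and the p processors are simulated one after another, with all fetches completed before any store.

```python
def pram_cycle(memory, src, compute, dst):
    """Simulate one synchronous PRAM cycle for p = len(src) processors.

    src[i] and dst[i] play the roles of the addresses s_i and d_i;
    compute(i, value) stands in for the local work of processor i.
    """
    p = len(src)
    fetched = [memory[src[i]] for i in range(p)]          # phase 1: all fetches
    results = [compute(i, fetched[i]) for i in range(p)]  # phase 2: local work
    for i in range(p):                                    # phase 3: all stores
        memory[dst[i]] = results[i]
    return memory


# Four processors shift the array right by one position. Because every
# fetch precedes every store, no processor sees a value written this cycle.
mem = [10, 20, 30, 40, 0]
pram_cycle(mem, src=[0, 1, 2, 3], compute=lambda i, v: v, dst=[1, 2, 3, 4])
print(mem)  # [10, 10, 20, 30, 40]
```

Separating the fetch and store phases is what lets the later sketches treat each parallel step as "read old values, then write new ones".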

Types of PRAM

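Fig. 5.1 Submodels of the PRAM model.

The submodels are classified by whether reads from the same location and writes to the same location are exclusive or concurrent:

  • EREW (exclusive read, exclusive write): Least “powerful”, most “realistic”
  • CREW (concurrent read, exclusive write): Default
  • ERCW (exclusive read, concurrent write): Not useful
  • CRCW (concurrent read, concurrent write): Most “powerful”, further subdivided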

Power of CRCW PRAM Submodels

Theorem 5.1: A p-processor CRCW-P (priority) PRAM can be simulated (emulated) by a p-processor EREW PRAM with slowdown factor Θ(log p).

Intuitive justification for concurrent read emulation (write is similar):

  1. Write the p memory addresses in a list
  2. Sort the list of addresses in ascending order
  3. Remove all duplicate addresses
  4. Access data at desired addresses
  5. Replicate data via parallel prefix computation

Each of these steps requires constant or O(log p) time.
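The emulation steps can be traced with a small sequential Python sketch. The function name `emulate_concurrent_read` and the sample memory contents are assumptions made for illustration. On a real EREW PRAM the sort and the replication would be parallel O(log p) procedures, whereas this sketch does them sequentially.

```python
def emulate_concurrent_read(memory, addresses):
    """Serve p read requests while touching each distinct address once."""
    p = len(addresses)
    # Steps 1-2: list the (address, processor) pairs and sort by address.
    requests = sorted((addresses[j], j) for j in range(p))
    # Step 3: the first request in each run of equal addresses is the leader.
    is_leader = [i == 0 or requests[i][0] != requests[i - 1][0]
                 for i in range(p)]
    # Step 4: only leaders access memory, so no cell is read twice.
    values = [memory[requests[i][0]] if is_leader[i] else None
              for i in range(p)]
    # Step 5: replicate each leader's value along its run (a sequential
    # stand-in for the parallel prefix computation).
    for i in range(1, p):
        if not is_leader[i]:
            values[i] = values[i - 1]
    # Return each value to the processor that requested it.
    result = [None] * p
    for i, (_, j) in enumerate(requests):
        result[j] = values[i]
    return result


memory = [0, 10, 20, 30, 40, 50, 60]       # hypothetical contents
addresses = [1, 6, 5, 2, 3, 6, 1, 1, 2]    # nine requests, five distinct cells
print(emulate_concurrent_read(memory, addresses))
# [10, 60, 50, 20, 30, 60, 10, 10, 20]
```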

Model U is more powerful than model V if $T_U(n) = o(T_V(n))$ for some problem

EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P

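Example of the address-list steps for nine requests:

  Requested addresses:  1 6 5 2 3 6 1 1 2
  Sorted:               1 1 1 2 2 3 5 6 6
  Duplicates removed:   1 2 3 5 6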

Implications of the CRCW Hierarchy of Submodels

Our most powerful CRCW PRAM submodel can be emulated by the least powerful submodel (EREW) with logarithmic slowdown

Efficient parallel algorithms have polylogarithmic running times

Running time still polylogarithmic after slowdown due to emulation

A p-processor CRCW-P (priority) PRAM can be simulated (emulated) by a p-processor EREW PRAM with slowdown factor Θ(log p).

EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P

We need not be too concerned with the CRCW submodel used

Simply use whichever submodel is most natural or convenient

5.2 Data Broadcasting

Fig. 5.2 Data broadcasting in EREW PRAM via recursive doubling.

Making p copies of B[0] by recursive doubling

  for k = 0 to ⌈log_2 p⌉ – 1
    Proc j, 0 ≤ j < p, do
      Copy B[j] into B[j + 2^k]
  endfor

[Figure: array B with locations 0 through 11.]
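Recursive doubling is easy to simulate. The sketch below is one possible reading, not the slides' exact algorithm: it restricts each step to processors that already hold a copy and skips targets beyond the array bound, so no redundant copying occurs. The name `broadcast` is an assumption.

```python
def broadcast(value, p):
    """Fill B[0..p-1] with copies of value in ceil(log2 p) doubling steps."""
    B = [None] * p
    B[0] = value
    steps = (p - 1).bit_length()        # equals ceil(log2 p) for p >= 1
    for k in range(steps):
        stride = 2 ** k
        old = B[:]                      # all reads of a step precede its writes
        for j in range(stride):         # processors that already hold a copy
            if j + stride < p:          # stay inside the array bound
                B[j + stride] = old[j]
    return B


print(broadcast("x", 12))   # twelve copies of "x" after 4 steps
```

In each step every source cell is read by one processor and every target cell is written by one processor, which is why the scheme fits the EREW submodel.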

Fig. 5.3 EREW PRAM data broadcasting without redundant copying.

[Figure: array B with locations 0 through 11.]

Can modify the algorithm so that redundant copying does not occur and array bound is not exceeded

All-to-All Broadcasting on EREW PRAM

EREW PRAM algorithm for all-to-all broadcasting

  Processor j, 0 ≤ j < p, write own data value into B[j]
  for k = 1 to p – 1
    Processor j, 0 ≤ j < p, do
      Read the data value in B[(j + k) mod p]
  endfor

This O( p )-step algorithm is time-optimal
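A short simulation makes the exclusive-read property of the rotation visible. This is a sketch; `all_to_all_broadcast` and `received` are hypothetical names, and the processors are simulated sequentially.

```python
def all_to_all_broadcast(data):
    """Each of p processors collects the values held by all the others."""
    p = len(data)
    B = list(data)                            # processor j writes into B[j]
    received = [[data[j]] for j in range(p)]  # each starts with its own value
    for k in range(1, p):
        targets = [(j + k) % p for j in range(p)]
        # The p processors read p distinct cells, so every read is exclusive.
        assert len(set(targets)) == p
        for j in range(p):
            received[j].append(B[targets[j]])
    return received


print(all_to_all_broadcast([3, 1, 2]))
# [[3, 1, 2], [1, 2, 3], [2, 3, 1]]
```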

Naive EREW PRAM sorting algorithm (using all-to-all broadcasting)

  Processor j, 0 ≤ j < p, write 0 into R[j]
  for k = 1 to p – 1
    Processor j, 0 ≤ j < p, do
      l := (j + k) mod p
      if S[l] < S[j] or (S[l] = S[j] and l < j)
        then R[j] := R[j] + 1
      endif
  endfor
  Processor j, 0 ≤ j < p, write S[j] into S[R[j]]


This O( p )-step sorting algorithm is far from optimal; sorting is possible in O(log p ) time
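The rank-counting sort can be checked with a direct simulation. The sketch below assumes the final write is synchronous, so it places the keys in a separate output list rather than overwriting S in place. The name `naive_erew_sort` is hypothetical.

```python
def naive_erew_sort(S):
    """Sort by counting, for each key, how many keys must precede it."""
    p = len(S)
    R = [0] * p
    for k in range(1, p):
        for j in range(p):               # simulated processors
            l = (j + k) % p
            # Ties are broken by index, which keeps the ranks distinct.
            if S[l] < S[j] or (S[l] == S[j] and l < j):
                R[j] += 1
    out = [None] * p
    for j in range(p):                   # processor j writes S[j] to slot R[j]
        out[R[j]] = S[j]
    return out


print(naive_erew_sort([5, 2, 9, 2, 7]))  # [2, 2, 5, 7, 9]
```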


5.3 Semigroup or Fan-in Computation

EREW PRAM semigroup computation algorithm

  Proc j, 0 ≤ j < p, copy X[j] into S[j]
  s := 1
  while s < p
    Proc j, 0 ≤ j < p – s, do
      S[j + s] := S[j] ⊗ S[j + s]
    s := 2s
  endwhile
  Broadcast S[p – 1] to all processors
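The doubling loop can be simulated as follows. This is a sketch under stated assumptions: `erew_semigroup` is a hypothetical name, the operator ⊗ is passed in as `op`, and each parallel step reads from a snapshot so that all reads precede all writes.

```python
import operator


def erew_semigroup(X, op=operator.add):
    """Return S after the doubling loop; S[-1] is the semigroup result."""
    p = len(X)
    S = list(X)
    s = 1
    while s < p:
        old = S[:]                       # synchronous step: read old values
        for j in range(p - s):           # processors 0 .. p - s - 1
            S[j + s] = op(old[j], old[j + s])
        s *= 2
    return S


print(erew_semigroup(list(range(10)))[-1])    # 45
print(erew_semigroup(["a", "b", "c"]))        # ['a', 'ab', 'abc']
```

The string example uses concatenation, which is associative but not commutative. It shows that operand order is preserved and that every S[j] ends up holding a prefix result.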

If we use p processors on a list of size n = O( p log p ), then optimal speedup can be achieved

This algorithm is optimal for PRAM, but its speedup of O( p / log p ) is not
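A rough operation count suggests why a longer list helps. The sketch assumes that each processor first combines its n/p local items sequentially in about n/p steps, that fan-in and the final broadcast each take about log_2 p steps, and that the sequential time is about n. Additive constants are ignored.

```latex
\begin{align*}
T(n, p) &\approx \frac{n}{p} + 2\log_2 p \\
S(n, p) &= \frac{T(n, 1)}{T(n, p)} \approx \frac{n}{n/p + 2\log_2 p} \\
E(n, p) &= \frac{S(n, p)}{p} \approx \frac{1}{1 + (2p\log_2 p)/n}
\end{align*}
```

Under these assumptions, taking n = c p log_2 p for a constant c gives E ≈ c/(c + 2), a constant, so the speedup is Θ(p). With n = p, the efficiency drops to about 1/(1 + 2 log_2 p).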

Fig. 5.4 Semigroup computation in EREW PRAM.

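Contents of S after each step, where a:b denotes x_a ⊗ ... ⊗ x_b:

  j         0    1    2    3    4    5    6    7    8    9
  initial   0:0  1:1  2:2  3:3  4:4  5:5  6:6  7:7  8:8  9:9
  s = 1     0:0  0:1  1:2  2:3  3:4  4:5  5:6  6:7  7:8  8:9
  s = 2     0:0  0:1  0:2  0:3  1:4  2:5  3:6  4:7  5:8  6:9
  s = 4     0:0  0:1  0:2  0:3  0:4  0:5  0:6  0:7  1:8  2:9
  s = 8     0:0  0:1  0:2  0:3  0:4  0:5  0:6  0:7  0:8  0:9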

Fig. 5.5 Intuitive justification of why parallel slack helps improve the efficiency.

Higher degree of parallelism near the leaves

Lower degree of parallelism near the root

5.4 Parallel Prefix Computation

Fig. 5.6 Parallel prefix computation in EREW PRAM via recursive doubling.

Each row shows S after one doubling step (a:b denotes x_a ⊗ ... ⊗ x_b); the last row holds all prefix results 0:j.

  j         0    1    2    3    4    5    6    7    8    9
  initial   0:0  1:1  2:2  3:3  4:4  5:5  6:6  7:7  8:8  9:9
  step 1    0:0  0:1  1:2  2:3  3:4  4:5  5:6  6:7  7:8  8:9
  step 2    0:0  0:1  0:2  0:3  1:4  2:5  3:6  4:7  5:8  6:9
  step 3    0:0  0:1  0:2  0:3  0:4  0:5  0:6  0:7  1:8  2:9
  step 4    0:0  0:1  0:2  0:3  0:4  0:5  0:6  0:7  0:8  0:9

Another Divide-and-Conquer Algorithm

Fig. 5.8 Another divide-and-conquer scheme for parallel prefix computation.

Strictly optimal algorithm, but requires commutativity

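[Figure: the inputs x_0, x_1, x_2, x_3, ..., x_(n–2), x_(n–1) are split by index parity. One parallel prefix computation runs on the n/2 even-indexed inputs and another on the n/2 odd-indexed inputs. A final row of ⊗ operations combines their results into the outputs, ending with 0:n–2 and 0:n–1.]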

$T(p) = T(p/2) + 1$, which gives $T(p) = \log_2 p$
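The even/odd scheme can be sketched recursively. The code below is an illustrative reading of the figure, not a verified transcription of it: `prefix_even_odd` is a hypothetical name, and the two recursive calls, which would run in parallel on a PRAM, are executed one after the other here. The single combining step after the two half-size subproblems corresponds to the "+ 1" in the recurrence.

```python
import operator


def prefix_even_odd(x, op=operator.add):
    """Prefix results via separate prefixes on even- and odd-indexed inputs.

    Correct only when op is associative AND commutative, because the
    combining step merges operands out of their original order.
    """
    n = len(x)
    if n <= 1:
        return list(x)
    even = prefix_even_odd(x[0::2], op)   # even[k] covers x0, x2, ..., x(2k)
    odd = prefix_even_odd(x[1::2], op)    # odd[k] covers x1, x3, ..., x(2k+1)
    out = [even[0]]
    for i in range(1, n):                 # one combining step per output
        if i % 2 == 0:
            out.append(op(even[i // 2], odd[i // 2 - 1]))
        else:
            out.append(op(even[i // 2], odd[i // 2]))
    return out


print(prefix_even_odd([1, 2, 3, 4, 5, 6, 7, 8]))
# [1, 3, 6, 10, 15, 21, 28, 36]
```

With a non-commutative operator such as string concatenation the outputs would come out scrambled, which illustrates why the scheme requires commutativity.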

Each vertical line represents a location in shared memory

5.5 Ranking the Elements of a Linked List

Fig. 5.9 Example linked list and the ranks of its elements.

  Element (head to terminal element):      C  F  A  E  B  D
  Rank (or distance from terminal):        5  4  3  2  1  0
  Distance from head:                      1  2  3  4  5  6

Fig. 5.10 PRAM data structures representing a linked list and the ranking results.

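  index  info  next
    0     A     4
    1     B     3
    2     C     5    (head)
    3     D     3    (terminal element; points to itself)
    4     E     1
    5     F     0

The rank array has one entry per element and is filled in by the ranking algorithm.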

List ranking appears to be hopelessly sequential; one cannot get to a list element except through its predecessor!

PRAM List Ranking Algorithm

Question: Which PRAM submodel is implicit in this algorithm?

If we do not want to modify the original list, we simply make a copy of it first, in constant time


PRAM list ranking algorithm (via pointer jumping)

  Processor j, 0 ≤ j < p, do {initialize the partial ranks}
    if next[j] = j
      then rank[j] := 0
      else rank[j] := 1
    endif
  while rank[next[head]] ≠ 0
    Processor j, 0 ≤ j < p, do
      rank[j] := rank[j] + rank[next[j]]
      next[j] := next[next[j]]
  endwhile

Answer: CREW
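Pointer jumping can be simulated on the example list. This is a sketch: `rank_list` is a hypothetical name, the list is copied so the original `next` array is left intact, and each round reads from snapshots to model the synchronous step.

```python
def rank_list(next_ptr, head):
    """Return rank[j] = distance of element j from the terminal element."""
    p = len(next_ptr)
    nxt = list(next_ptr)                          # work on a copy of the list
    rank = [0 if nxt[j] == j else 1 for j in range(p)]
    while rank[nxt[head]] != 0:
        old_rank, old_next = rank[:], nxt[:]      # all reads before any write
        for j in range(p):                        # simulated processors
            rank[j] = old_rank[j] + old_rank[old_next[j]]
            nxt[j] = old_next[old_next[j]]        # jump over one element
    return rank


# info = A B C D E F at indices 0..5, head = C (index 2), as in Fig. 5.10.
print(rank_list([4, 3, 5, 3, 1, 0], head=2))      # [3, 1, 5, 0, 2, 4]
```

In later rounds several elements point to the terminal element and read its rank and pointer at the same time, which is consistent with the CREW answer above.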

5.6 Matrix Multiplication

Sequential matrix multiplication

  for i = 0 to m – 1 do
    for j = 0 to m – 1 do
      t := 0
      for k = 0 to m – 1 do
        t := t + a_ik b_kj
      endfor
      c_ij := t
    endfor
  endfor
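For reference, the same triple loop in runnable form. This is a minimal Python sketch that assumes square matrices stored as lists of rows; the name `matmul` is an assumption.

```python
def matmul(A, B):
    """Multiply two m x m matrices with the sequential triple loop."""
    m = len(A)
    C = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            t = 0
            for k in range(m):           # inner product of row i and column j
                t += A[i][k] * B[k][j]
            C[i][j] = t
    return C


print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

The loop performs m^3 multiplications in total, which is the work that a PRAM solution has to distribute among its processors.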

[Figure: A × B = C for m × m matrices; row i of A and column j of B determine element ij of C.]

$c_{ij} := \sum_{k=0}^{m-1} a_{ik} b_{kj}$

PRAM solution with m^3 processors: each processor does one multiplication (not very efficient)