PRAM and Other Models: Understanding Parallel Computation - Prof. Jingke Li, Study notes of Computer Science

An overview of PRAM (parallel random access machine) models, their submodels, and algorithms. It covers the RAM model, the PRAM model, the exclusive-read and concurrent-write submodels, and specific PRAM algorithms such as global sum and prefix sums. The document also explains Brent’s theorem and its application to these algorithms.

Typology: Study notes

Pre 2010

Uploaded on 08/18/2009


PRAM and Other Models

Jingke Li

Portland State University

Jingke Li (Portland State University) CS 415/515 PRAM and Other Models 1 / 27

The RAM Model

(for Sequential Computation)

RAM = Random Access Machine

  • A processor operating under the control of a sequential program.
  • A memory with M cells (M can be unbounded).
  • Basic Operations:
    • READ: the processor reads a datum from an arbitrary location in memory into one of its internal registers.
    • COMPUTE: the processor performs an (arithmetic) operation on data in register(s).
    • WRITE: the processor writes the content of one register into an arbitrary memory cell.
  • Uniform Cost Criterion: Each basic operation takes one time unit to execute.

RAM Algorithms

Presented in pseudo code. Complexity is measured by number of time units needed.

Algorithm Prefix Sums (RAM)

    // Compute the prefix sums of n numbers.
    s_0 ← x_0
    for i = 1 to n − 1 do
        s_i ← s_{i−1} + x_i
    endfor

Complexity Analysis:

  • n (implicit) READs to read x_0, x_1, ..., x_{n−1} from memory;
  • n COMPUTEs to obtain s_0, s_1, ..., s_{n−1};
  • n (implicit) WRITEs to write s_0, s_1, ..., s_{n−1} into memory.
  • Total: 3n time units.
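The 3n count can be checked with a small simulation. This is a minimal sketch, assuming that the initial assignment s_0 ← x_0 is charged as one COMPUTE (as the analysis above does); the function name `ram_prefix_sums` and the counter dictionary are illustrative, not part of the notes.

```python
def ram_prefix_sums(x):
    """Sequential prefix sums with an explicit count of RAM basic operations."""
    ops = {"READ": 0, "COMPUTE": 0, "WRITE": 0}
    s = []
    running = 0
    for value in x:
        ops["READ"] += 1             # read x_i from memory into a register
        running = running + value    # s_i = s_{i-1} + x_i (s_0 = x_0 charged as one COMPUTE)
        ops["COMPUTE"] += 1
        s.append(running)            # write s_i back to memory
        ops["WRITE"] += 1
    return s, ops


sums, ops = ram_prefix_sums([3, 1, 4, 1, 5])
print(sums)                # [3, 4, 8, 9, 14]
print(sum(ops.values()))   # 15, i.e. 3n for n = 5
```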


The PRAM Model

PRAM = Parallel RAM

  • A number of identical processors, P_1, P_2, ..., P_N.
  • A common memory with M cells, shared by the N processors.
  • The processors work in a synchronous fashion.
  • Basic Operations:
    • READ: (up to N) processors read simultaneously from (up to N) memory cells. Each processor reads from at most one memory cell and stores the value obtained in a local register.
    • COMPUTE: (up to N) processors perform an (arithmetic) operation on their local data in register(s).
    • WRITE: (up to N) processors write simultaneously into (up to N) memory cells. Each processor writes the content of one local register into an arbitrary memory cell.
  • Inter-processor communication is simplistically modeled by READ/WRITE operations.

Three Specific PRAM Models

  • EREW PRAM (exclusive read, exclusive write): The weakest among the three models. It takes O(log n) time units to spread a value from one processor to n other processors.
  • CREW PRAM (concurrent read, exclusive write): A processor can spread a value to n other processors in O(1) time units.
  • CRCW PRAM (concurrent read, concurrent write): The strongest model, but concurrent writes (CW) need special handling to resolve memory write conflicts.
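The O(log n) broadcast time on an EREW PRAM comes from doubling: in every step, each cell that already holds the value is copied to one new cell, so no cell is read or written by more than one processor. The sketch below is one way to simulate this, assuming the value is spread through an array of shared cells; `erew_broadcast` is a hypothetical name.

```python
def erew_broadcast(value, n):
    """Spread `value` into n cells by doubling; returns (cells, parallel_steps)."""
    cells = [None] * n
    cells[0] = value
    filled = 1          # cells[0:filled] already hold the value
    steps = 0
    while filled < n:
        # One synchronous step: processor i copies cells[i] to cells[i + filled].
        # Each cell is read by at most one processor and written by at most one.
        for i in range(min(filled, n - filled)):
            cells[i + filled] = cells[i]
        filled = min(2 * filled, n)
        steps += 1
    return cells, steps


cells, steps = erew_broadcast(7, 10)
print(steps)   # 4, which is ceil(log2(10))
```

Under CREW the same broadcast takes a single step, because all n processors may read the source cell at once.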


Concurrent Write Sub Models

  • Priority CW: The processors are assigned certain priorities; the processor with the highest priority wins.
  • Common CW: The processors are allowed to write to the same memory cell only if they are attempting to write the same value; otherwise a special flag will be raised.
  • Arbitrary CW: Any of the attempting processors can succeed; the selection is made by an algorithm.
  • Random CW: A processor is chosen by a random process to succeed.
  • Combining CW: All the values from the attempting processors are combined into a single value, which is then stored in the memory cell. Typical combining operations are sum, product, and, or, xor, max, and min.
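The five policies differ only in how one memory cell settles a set of simultaneous write attempts. The following sketch makes that concrete. It is illustrative only: the function name, the policy strings, and the convention that a lower processor id means a higher priority are assumptions, not part of the notes.

```python
import random


def resolve_concurrent_write(attempts, policy, combine=None, rng=None):
    """Return the value stored in one cell.

    attempts: list of (processor_id, value) pairs aimed at the same cell.
    """
    values = [value for _, value in attempts]
    if policy == "priority":
        # Assumption: the lowest processor id has the highest priority.
        return min(attempts)[1]
    if policy == "common":
        if len(set(values)) != 1:
            raise ValueError("Common CW: processors attempted different values")
        return values[0]
    if policy == "arbitrary":
        return values[0]    # any fixed selection rule is acceptable; take the first
    if policy == "random":
        return (rng or random).choice(values)
    if policy == "combining":
        result = values[0]
        for value in values[1:]:
            result = combine(result, value)
        return result
    raise ValueError(f"unknown policy: {policy}")


attempts = [(2, 5), (0, 9), (1, 5)]
print(resolve_concurrent_write(attempts, "priority"))                          # 9
print(resolve_concurrent_write(attempts, "combining", lambda a, b: a + b))     # 19
```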

PRAM Algorithms

Complexity Measurement:

  • Time Complexity — number of time units used
  • Space Complexity — number of processors involved
  • Lower Bounds — the least time/space needed for a problem

Cost-Optimal PRAM Algorithms:

  • Cost — The cost of a PRAM algorithm is the product of its time complexity and space complexity.
  • Cost-Optimal — A cost-optimal PRAM algorithm is one in which the cost is in the same complexity class as the optimal sequential algorithm.


Algorithm Global Sum (EREW PRAM)

    // Given n numbers A[0], A[1], ..., A[n − 1], compute
    // the sum Σ_{i=0}^{n−1} A[i] and store it in A[0].
    spawn(P_0, P_1, ..., P_{⌊n/2⌋−1})
    for all P_i where 0 ≤ i ≤ ⌊n/2⌋ − 1 do
        for j ← 0 to ⌈log n⌉ − 1 do
            if i mod 2^j = 0 and 2i + 2^j < n then
                A[2i] ← A[2i] + A[2i + 2^j]
            endif
        endfor
    endfor

Complexity Analysis: Time: O(log n), Space: ⌊n/2⌋ processors

Optimality Analysis: Cost = ⌈log n⌉ × ⌊n/2⌋ = O(n log n), while the sequential algorithm is Θ(n). Not cost-optimal!

Reason: Some processors have idle steps — the total number of operations performed is n − 1. How to improve?
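The gap between cost and useful work is easy to observe by simulating Algorithm Global Sum and counting additions. This is a sketch only: the processors are simulated one after another inside each synchronous step, which is safe here because within a step the cells being written (indices 2i) and the cells being read (indices 2i + 2^j) never coincide. The name `erew_global_sum` is illustrative.

```python
def erew_global_sum(values):
    """Simulate Algorithm Global Sum; returns (sum, parallel_steps, additions)."""
    a = list(values)
    n = len(a)                        # assumes n >= 1
    steps = (n - 1).bit_length()      # equals ceil(log2(n)) for n >= 1
    additions = 0
    for j in range(steps):
        stride = 2 ** j
        # All processors P_i execute iteration j in the same synchronous step.
        for i in range(n // 2):
            if i % stride == 0 and 2 * i + stride < n:
                a[2 * i] += a[2 * i + stride]
                additions += 1
    return a[0], steps, additions


total, steps, additions = erew_global_sum(range(1, 9))
print(total, steps, additions)   # 36 3 7
```

For n = 8 the cost is 3 steps × 4 processors = 12 processor-steps, yet only n − 1 = 7 additions are performed; the remaining slots are idle.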

Apply Brent’s Theorem to Algorithm Global Sum

  • Total number of operations performed: n − 1
  • Using p = ⌊n/ log n⌋ processors, the time is

$$\lceil \log n \rceil + \frac{n - 1 - \lceil \log n \rceil}{\lfloor n/\log n \rfloor} = \Theta\left(2\log n - \frac{\log n}{n} - \frac{\log^2 n}{n}\right) = \Theta(\log n)$$

Cost = ⌊n/ log n⌋ × Θ(log n) = Θ(n), the same as the sequential algorithm. This is cost-optimal!
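The bound used above has the form t + (m − t)/p, where t is the parallel time with unlimited processors, m is the total number of operations, and p is the number of processors actually used. A small numeric check, assuming that form and base-2 logarithms, shows the cost per input element settling near a constant. The helper names are hypothetical.

```python
import math


def brent_time(t, m, p):
    """Time bound t + (m - t) / p for m operations, parallel time t, p processors."""
    return t + (m - t) / p


def global_sum_bound(n):
    """Return (processors, time bound, cost) for global sum rescheduled on n/log n processors."""
    log_n = math.log2(n)
    t = math.ceil(log_n)           # steps of the original algorithm
    m = n - 1                      # additions actually performed
    p = math.floor(n / log_n)      # processors after rescheduling
    time = brent_time(t, m, p)
    return p, time, p * time


for n in (2 ** 10, 2 ** 16, 2 ** 20):
    p, time, cost = global_sum_bound(n)
    # cost / n stays bounded as n grows, which is what cost-optimality requires
    print(n, p, round(time, 2), round(cost / n, 3))
```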


Apply Brent’s Theorem to Algorithm Prefix Sums

  • Total number of operations performed: n⌈log n⌉
  • Using p = ⌊n/ log n⌋ processors, the time is

$$\lceil \log n \rceil + \frac{n\lceil \log n \rceil - \lceil \log n \rceil}{\lfloor n/\log n \rfloor} = \Theta\left(\log n + \log^2 n - \frac{\log^2 n}{n}\right) = \Theta(\log^2 n)$$

  • Cost = ⌊n/ log n⌋ × Θ(log^2 n) = Θ(n log n) — Still not optimal!

Reason: The total number of PRAM operations is higher than that of the optimal sequential algorithm.
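The PRAM prefix-sums algorithm itself does not appear in this excerpt. The sketch below assumes the common doubling scheme, in which step j has every processor i ≥ 2^j add the value 2^j positions to its left. It counts the additions to show why the work is Θ(n log n) rather than Θ(n); charging all n processors for all ⌈log n⌉ steps gives the n⌈log n⌉ figure used above. The function name is illustrative.

```python
def pram_prefix_sums(values):
    """Doubling prefix sums; returns (prefix_sums, parallel_steps, additions)."""
    a = list(values)
    n = len(a)
    steps = 0
    additions = 0
    stride = 1
    while stride < n:
        previous = a[:]                  # every processor reads before anyone writes
        for i in range(stride, n):       # processors P_stride .. P_{n-1} are active
            a[i] = previous[i - stride] + previous[i]
            additions += 1
        stride *= 2
        steps += 1
    return a, steps, additions


result, steps, additions = pram_prefix_sums([1, 2, 3, 4, 5, 6, 7, 8])
print(result)             # [1, 3, 6, 10, 15, 21, 28, 36]
print(steps, additions)   # 3 17, compared with n - 1 = 7 additions sequentially
```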

“Coarse-Grain” Algorithms

Another approach for improving performance.

Algorithm Prefix Sums (“Coarse-Grain”)

Given: n values, p processors (p < n)

  • Divide the n values into p sets, each containing ≤ ⌈n/p⌉ values.
  • The first p − 1 processors each use the optimal sequential algorithm to do a local prefix computation. (⌈n/p⌉ − 1 steps)
  • Then the processors run the PRAM parallel prefix algorithm on the subtotals. (log(p − 1) steps)
  • Each processor then goes back and updates its local prefix values with the results from the global computation. (⌈n/p⌉ steps)

Total Cost: (2⌈n/p⌉ + log(p − 1))p = Θ(n + p log p)

For small p (p log p = O(n)), the cost is Θ(n), so this algorithm is cost-optimal!
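The three phases can be sketched as follows. This is a simplified simulation under stated assumptions: the p processors are simulated in sequence, every block (including the last) performs the local pass, and the prefix over the subtotals is computed sequentially as a stand-in for the PRAM parallel prefix step. The function name is hypothetical.

```python
def coarse_grain_prefix_sums(values, p):
    """Prefix sums of `values` using p simulated processors (1 <= p <= len(values))."""
    n = len(values)
    block = -(-n // p)      # ceil(n / p) values per processor
    chunks = [list(values[k:k + block]) for k in range(0, n, block)]

    # Phase 1: each processor computes prefix sums of its own block sequentially.
    for chunk in chunks:
        for k in range(1, len(chunk)):
            chunk[k] += chunk[k - 1]

    # Phase 2: prefix sums over the block subtotals. On a PRAM this is the
    # parallel prefix algorithm on about p values, taking O(log p) steps.
    offsets = []
    running = 0
    for chunk in chunks:
        offsets.append(running)     # total of all earlier blocks
        running += chunk[-1]

    # Phase 3: each processor adds the total of the preceding blocks to its values.
    return [v + offset for chunk, offset in zip(chunks, offsets) for v in chunk]


print(coarse_grain_prefix_sums([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3))
# [1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
```

Phases 1 and 3 each touch every value once, which is where the Θ(n) term in the cost comes from; only phase 2 contributes the p log p term.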


Concurrent-Write Algorithms

  • Algorithm Global Sum (CRCW(combine) PRAM):

        spawn(P_0, P_1, ..., P_{n−1})
        for all P_i where 0 ≤ i ≤ n − 1 do
            A[0] ← A[i]
        endfor

    Time: O(1)
  • Algorithm Prefix Sum (CRCW(combine) PRAM):

        spawn(P_{0,0}, P_{0,1}, ..., P_{0,n−1}, P_{1,0}, P_{1,1}, ..., P_{1,n−2}, ..., P_{n−2,0}, P_{n−2,1}, P_{n−1,0})
        for all P_{i,j} where 0 ≤ i ≤ n − 1, 0 ≤ j ≤ n − i − 1 do
            x_{i,j} = A[j]; A[n − 1 − i] ← x_{i,j}
        endfor

    Time: O(1); Space: n(n + 1)/2
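A combining write can be simulated by grouping the pending writes by address and folding each group with the combining operation. The sketch below uses sum as the operation. Its indexing is its own assumption: for prefix sums it uses one simulated processor per pair (k, j) with j ≤ k, which reads A[j] and writes it to A[k]. All function names are illustrative.

```python
def combining_write(memory, writes, combine):
    """Apply one CRCW write step.

    writes: list of (address, value) pairs; values aimed at the same address
    are merged with `combine` before being stored.
    """
    pending = {}
    for address, value in writes:
        if address in pending:
            pending[address] = combine(pending[address], value)
        else:
            pending[address] = value
    for address, value in pending.items():
        memory[address] = value


def crcw_global_sum(a):
    memory = list(a)
    # Read phase: processor P_i reads A[i]. Write phase: all write to A[0].
    writes = [(0, memory[i]) for i in range(len(memory))]
    combining_write(memory, writes, lambda x, y: x + y)
    return memory[0]


def crcw_prefix_sums(a):
    memory = list(a)
    n = len(memory)
    # All reads happen here, before any write is applied.
    writes = [(k, memory[j]) for k in range(n) for j in range(k + 1)]
    combining_write(memory, writes, lambda x, y: x + y)
    return memory, len(writes)      # len(writes) is the number of processors


print(crcw_global_sum([1, 2, 3, 4]))     # 10
print(crcw_prefix_sums([1, 2, 3, 4]))    # ([1, 3, 6, 10], 10)
```

Both run in a constant number of PRAM steps, but the prefix-sums version pays for it with n(n + 1)/2 processors.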

Problems of the PRAM Model

  • It assumes the processors operate in a fully-synchronous mode.
  • It assumes a single shared memory in which each processor can access any cell in unit time.
  • It neglects the issue of contention caused by concurrent access to different cells within the same memory module.
  • It assumes the interprocessor communication has infinite bandwidth, zero latency, and zero overhead.
  • It assumes the number of processors can increase with the problem size.

In conclusion, the PRAM model is good for a gross classification of algorithms, but not very useful for describing realistic algorithms or predicting the performance of algorithms.


Extensions of the PRAM Model

  • Phase PRAM — Introduces asynchrony.
  • Module Parallel Computer — Divides memory into modules.
  • Local-Memory PRAM — Divides memory into local and global.
  • Memory Hierarchy Model — Views the memory as a hierarchy.
  • Delay Model — Introduces communication delay.

Valiant’s BSP Model

BSP = Bulk Synchronous Parallel

The BSP model is developed to model distributed-memory multiprocessors. It consists of three components:

  • A group of p processors each with local memory.
  • An interconnection network for point-to-point communication between the processors.
  • A mechanism for synchronizing all the processors at defined intervals.

Properties:

  • It’s simple to use. — BSP programs look much the same as sequential programs.
  • It’s high-level, architecture-independent.
  • The performance of a BSP program on a given architecture is predictable: a small set of parameters is used to describe the target architecture.


Supersteps

In BSP, a computation consists of a sequence of supersteps. Each superstep is further subdivided into three ordered phases:

  • Local Computation: each processor performs computation using only data stored in its local memory.
  • Communication: processors send messages to and receive messages from each other.
  • Barrier Synchronization: this global synchronization waits for all of the communication actions to complete.

[Figure: one superstep, drawn as Local Computation, then Communication, then Barrier Synchronization]

A BSP Program

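The following function calculates the partial sums of p integers stored on p processors. (The code is written in C with the Oxford BSP library.)

    int bsp_allsums(int x) {
        int i, left, right;
        int mypid = bsp_pid();      /* id of this process, 0 .. p-1 */
        int p = bsp_nprocs();       /* total number of processes */

        /* Register `left` so that other processes can write into it. */
        bsp_pushregister(&left, sizeof(int));
        bsp_sync();

        right = x;
        for (i = 1; i < p; i *= 2) {
            /* Send the running sum to the process i positions to the right. */
            if (mypid + i < p)
                bsp_put(mypid + i, &right, &left, 0, sizeof(int));
            bsp_sync();             /* barrier: all puts have been delivered */
            /* Add the value received from the process i positions to the left. */
            if (mypid >= i)
                right = left + right;
        }

        bsp_popregister(&left);
        return right;
    }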
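To trace what each superstep of bsp_allsums does without a BSP library, the p processes can be simulated with two lists. This is a sketch, not a BSP implementation: `left` and `right` mirror the C variables of the same names, the barrier is modeled by collecting all puts before delivering any of them, and only the supersteps inside the loop are counted (the initial registration sync is left out).

```python
def simulate_bsp_allsums(xs):
    """Simulate bsp_allsums for p = len(xs) processes; returns (results, supersteps)."""
    p = len(xs)
    right = list(xs)     # right[pid] is the local `right` of process pid
    left = [0] * p       # left[pid] is the registered variable targeted by puts
    supersteps = 0
    i = 1
    while i < p:
        # Communication: process pid puts its `right` into `left` of process pid + i.
        puts = [(pid + i, right[pid]) for pid in range(p) if pid + i < p]
        # Barrier: the puts take effect only once every process has synchronized.
        for dest, value in puts:
            left[dest] = value
        supersteps += 1
        # Local computation that follows the barrier.
        for pid in range(i, p):
            right[pid] = left[pid] + right[pid]
        i *= 2
    return right, supersteps


print(simulate_bsp_allsums([1, 2, 3, 4, 5]))   # ([1, 3, 6, 10, 15], 3)
```

The number of loop supersteps is ⌈log p⌉, since the distance i doubles each time.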


Berkeley’s LogP Model

L = latency (of message transmissions)

o = overhead (of communication incurred on a processor)

g = gap (between consecutive message transmissions)

P = processors (the number of processors)

This model is aimed at accurately modeling the performance of algorithms on distributed-memory multiprocessors. It is not as useful for developing parallel algorithms.

LogP Example

Given L = 6, o = 2, g = 4, P = 8, predict the time for a broadcast.

Time   Actions
0      P0 starts to send a message to P1
2      the message leaves P0
4      P0 starts to send a second message to P2
6      the second message leaves P0
8      P0 starts to send a third message to P3
...
8      the first message arrives at P1
10     P1 receives the message, and starts to forward it to P4
...
14     P1 starts to send a second message to P5
...

Total time: 24
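The total of 24 can be reproduced with a greedy simulation. The sketch assumes that a message whose send starts at time t is usable at the receiver at t + o + L + o, that a sender may start its next send max(g, o) time units after the previous one, and that every informed processor keeps forwarding as early as it can. The function name `logp_broadcast_time` is hypothetical.

```python
import heapq


def logp_broadcast_time(L, o, g, P):
    """Greedy broadcast under LogP; returns the time at which the last processor has the item."""
    if P <= 1:
        return 0
    send_slots = [0]    # earliest times at which some informed processor can start a send
    informed = 1
    finish = 0
    while informed < P:
        start = heapq.heappop(send_slots)
        arrival = start + o + L + o     # send overhead, latency, receive overhead
        finish = max(finish, arrival)
        informed += 1
        heapq.heappush(send_slots, start + max(g, o))   # next send by the same sender
        heapq.heappush(send_slots, arrival)             # the receiver can start forwarding
    return finish


print(logp_broadcast_time(L=6, o=2, g=4, P=8))   # 24
```

With these parameters the seven receivers obtain the message at times 10, 14, 18, 20, 22, 24, and 24, which matches the first entries of the schedule above.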