PRAM and Other Models
Jingke Li
Portland State University
Jingke Li (Portland State University) CS 415/515 PRAM and Other Models 1 / 27
The RAM Model
(for Sequential Computation)
RAM = Random Access Machine
- A processor operating under the control of a sequential program.
- A memory with M cells (M can be unbounded).
- Basic Operations:
  - READ — the processor reads a datum from an arbitrary location in memory into one of its internal registers.
  - COMPUTE — the processor performs an (arithmetic) operation on data in register(s).
  - WRITE — the processor writes the content of one register into an arbitrary memory cell.
- Uniform Cost Criterion: Each basic operation takes one time unit to execute.
RAM Algorithms
Presented in pseudo code. Complexity is measured by number of time units needed.
Algorithm Prefix Sums (RAM)
    // Compute the prefix sums of n numbers.
    s_0 ← x_0
    for i = 1 to n − 1 do
        s_i ← s_{i−1} + x_i
    endfor
Complexity Analysis:
- n (implicit) READs to read x_0, x_1, ..., x_{n−1} from memory;
- n COMPUTEs to obtain s_0, s_1, ..., s_{n−1};
- n (implicit) WRITEs to write s_0, s_1, ..., s_{n−1} into memory.
- Total: 3n time units.
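The accounting above can be checked with a small C sketch. The function name `prefix_sums_ram` and the choice to charge exactly one READ, one COMPUTE and one WRITE per element are assumptions made for illustration; they mirror the slide's count rather than any real machine.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential prefix sums: s[i] = x[0] + ... + x[i].
 * Returns the number of RAM time units under the uniform cost
 * criterion, charging one READ, one COMPUTE and one WRITE per element
 * (the same accounting as the analysis above). */
size_t prefix_sums_ram(const long *x, long *s, size_t n)
{
    assert(n > 0);
    size_t time_units = 0;
    long acc = 0;
    for (size_t i = 0; i < n; i++) {
        long xi = x[i];   /* READ */
        acc = acc + xi;   /* COMPUTE (for i = 0 this is just a copy) */
        s[i] = acc;       /* WRITE */
        time_units += 3;
    }
    return time_units;
}
```

For n = 5 the function reports 15 time units, matching the 3n total.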
The PRAM Model
PRAM = Parallel RAM
- A number of identical processors, P_1, P_2, ..., P_N.
- A common memory with M cells, shared by the N processors.
- The processors work in a synchronous fashion.
- Basic Operations:
  - READ — (up to N) processors read simultaneously from (up to N) memory cells. Each processor reads from at most one memory cell and stores the value obtained in a local register.
  - COMPUTE — (up to N) processors perform an (arithmetic) operation on their local data in register(s).
  - WRITE — (up to N) processors write simultaneously into (up to N) memory cells. Each processor writes the content of one register into an arbitrary memory cell.
- Inter-processor communication is modeled, simplistically, by READ/WRITE operations on the shared memory.
Three Specific PRAM Models
- EREW PRAM (Exclusive Read, Exclusive Write) — The weakest among the three models. It takes O(log n) time units to spread a value from one processor to n other processors.
- CREW PRAM (Concurrent Read, Exclusive Write) — A processor can spread a value to n other processors in O(1) time units.
- CRCW PRAM (Concurrent Read, Concurrent Write) — The strongest model, but concurrent writes (CW) need special handling to resolve memory write conflicts.
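The O(log n) bound for EREW comes from doubling: in each round, every processor that already holds the value passes it to one processor that does not. The following C sketch simulates this sequentially; the function name `erew_broadcast` and the array-of-cells representation are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simulate spreading cell[0] to cell[1..n-1] on an EREW PRAM.
 * In each round, every processor i that already holds the value copies
 * it to cell[i + have]. Reads touch cell[0..have-1] and writes touch
 * cell[have..], one processor per cell, so no cell is accessed
 * concurrently. Returns the number of rounds used. */
size_t erew_broadcast(long *cell, size_t n)
{
    assert(n > 0);
    size_t rounds = 0;
    for (size_t have = 1; have < n; have *= 2) {
        /* One parallel step, simulated sequentially. */
        for (size_t i = 0; i < have && i + have < n; i++) {
            cell[i + have] = cell[i];
        }
        rounds++;
    }
    return rounds;
}
```

With 8 cells the value reaches every cell in 3 rounds; on a CREW PRAM all processors could instead read `cell[0]` in a single step.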
Concurrent Write Sub Models
- Priority CW — The processors are assigned certain priorities; the processor with the highest priority wins.
- Common CW — The processors are allowed to write to the same memory cell only if they are attempting to write the same value; otherwise a special flag will be raised.
- Arbitrary CW — Any of the attempting processors can succeed; the selection is according to an algorithm.
- Random CW — A processor is chosen by a random process to succeed.
- Combining CW — All the values from the attempting processors are combined into a single value, which is then stored in the memory cell. Typical combining operations are sum, product, and, or, xor, max, and min.
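To make the differences between the sub-models concrete, here is a minimal C sketch that resolves one set of simultaneous writes to a single cell under three of the policies. The names `cw_policy` and `resolve_cw`, the convention that a lower array index means higher priority, and the use of a -1 return value as the "special flag" are all assumptions of this sketch. Arbitrary and Random CW are omitted because their outcome is not determined by the inputs alone.

```c
#include <assert.h>
#include <stddef.h>

enum cw_policy { CW_PRIORITY, CW_COMMON, CW_COMBINE_SUM };

/* Resolve k simultaneous writes to one memory cell.
 * vals[i] is the value written by the i-th attempting processor;
 * index 0 is assumed to have the highest priority.
 * Returns 0 on success, or -1 (the "flag") if the Common rule is
 * violated, in which case *cell is left unchanged. */
int resolve_cw(enum cw_policy policy, const long *vals, size_t k, long *cell)
{
    assert(k > 0);
    switch (policy) {
    case CW_PRIORITY:
        *cell = vals[0];               /* highest-priority writer wins */
        return 0;
    case CW_COMMON:
        for (size_t i = 1; i < k; i++) {
            if (vals[i] != vals[0]) {
                return -1;             /* writers disagree */
            }
        }
        *cell = vals[0];
        return 0;
    case CW_COMBINE_SUM: {
        long sum = 0;
        for (size_t i = 0; i < k; i++) {
            sum += vals[i];            /* combine all attempted values */
        }
        *cell = sum;
        return 0;
    }
    }
    return -1;                         /* unknown policy */
}
```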
PRAM Algorithms
Complexity Measurement:
- Time Complexity — number of time units used
- Space Complexity — number of processors involved
- Lower Bounds — the least time/space needed for a problem
Cost-Optimal PRAM Algorithms:
- Cost — The cost of a PRAM algorithm is the product of its time complexity and space complexity.
- Cost-Optimal — A cost-optimal PRAM algorithm is one whose cost is in the same complexity class as the running time of the optimal sequential algorithm.
Algorithm Global Sum (EREW PRAM)
    // Given n numbers A[0], A[1], ..., A[n − 1], compute the sum
    // Σ_{i=0}^{n−1} A[i] and store it in A[0].
    spawn(P_0, P_1, ..., P_{⌊n/2⌋−1})
    for all P_i where 0 ≤ i ≤ ⌊n/2⌋ − 1 do
        for j ← 0 to ⌈log n⌉ − 1 do
            if i mod 2^j = 0 and 2i + 2^j < n then
                A[2i] ← A[2i] + A[2i + 2^j]
            endif
        endfor
    endfor
Complexity Analysis: Time: O(log n), Space: ⌊n/2⌋ processors
Optimality Analysis: Cost = ⌈log n⌉ × ⌊n/2⌋ = O(n log n), while the sequential algorithm is Θ(n). Not optimal!
Reason: Some processors have idle steps — the total number of operations performed is n − 1. How to improve?
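The idle steps are easy to see in a sequential simulation of Algorithm Global Sum. The sketch below follows the pseudocode's indexing; the names `sum_stats` and `erew_global_sum` are hypothetical, and the inner loop stands in for the "for all P_i" parallel step. Within one step the cells written (multiples of 2^(j+1)) and the cells read (odd multiples of 2^j) are disjoint, so simulating the step sequentially gives the same result.

```c
#include <assert.h>
#include <stddef.h>

struct sum_stats {
    size_t steps;      /* parallel time steps, ceil(log2 n) */
    size_t additions;  /* operations actually performed     */
};

/* Simulate Algorithm Global Sum; the total ends up in A[0]. */
struct sum_stats erew_global_sum(long *A, size_t n)
{
    assert(n > 0);
    struct sum_stats st = {0, 0};
    /* stride plays the role of 2^j in the pseudocode. */
    for (size_t stride = 1; stride < n; stride *= 2) {
        for (size_t i = 0; i < n / 2; i++) {   /* "for all P_i" */
            if (i % stride == 0 && 2 * i + stride < n) {
                A[2 * i] += A[2 * i + stride];
                st.additions++;
            }
            /* otherwise P_i is idle in this step */
        }
        st.steps++;
    }
    return st;
}
```

For n = 8 the simulation takes 3 steps with 4 processors (12 processor-steps) but performs only 7 additions.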
Apply Brent’s Theorem to Algorithm Global Sum
- Total number of operations performed: n − 1
- Using p = ⌊n/ log n⌋ processors, the time is

$$\lceil \log n \rceil + \frac{n - 1 - \lceil \log n \rceil}{\lfloor n/\log n \rfloor} = \Theta\left(2\log n - \frac{\log n}{n} - \frac{\log^2 n}{n}\right) = \Theta(\log n)$$
Cost = ⌊n/ log n⌋ × Θ(log n) = Θ(n). This is cost-optimal!
Apply Brent’s Theorem to Algorithm Prefix Sums
- Total number of operations performed: n⌈log n⌉
- Using p = ⌊n/ log n⌋ processors, the time is

$$\lceil \log n \rceil + \frac{n\lceil \log n \rceil - \lceil \log n \rceil}{\lfloor n/\log n \rfloor} = \Theta\left(\log n + \log^2 n - \frac{\log^2 n}{n}\right) = \Theta(\log^2 n)$$
- Cost = ⌊n/ log n⌋ × Θ(log^2 n) = Θ(n log n) — Still not optimal!
Reason: The total number of PRAM operations is higher than that of the optimal sequential algorithm.
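Both applications of Brent's theorem use the same bound: an algorithm that runs in t steps and performs m operations in total can be executed by p processors in at most t + (m − t)/p steps. The C sketch below plugs in concrete numbers. The function names are hypothetical, and p is computed as ⌊n/⌈log n⌉⌋, which is an assumption about how the rounding is done.

```c
#include <assert.h>
#include <stddef.h>

/* ceil(log2(n)) for n >= 1, computed without floating point. */
size_t ceil_log2(size_t n)
{
    assert(n >= 1);
    size_t k = 0;
    for (size_t pow = 1; pow < n; pow *= 2) {
        k++;
    }
    return k;
}

/* Brent bound: t + (m - t) / p. */
double brent_time(size_t t, size_t m, size_t p)
{
    assert(p > 0 && m >= t);
    return (double)t + (double)(m - t) / (double)p;
}

/* Cost (processors x time) of Global Sum: t = ceil(log n), m = n - 1. */
double global_sum_cost(size_t n)
{
    assert(n >= 2);
    size_t t = ceil_log2(n);
    size_t p = n / t;              /* assumed rounding of n / log n */
    return (double)p * brent_time(t, n - 1, p);
}

/* Cost of Prefix Sums: t = ceil(log n), m = n * ceil(log n). */
double prefix_sums_cost(size_t n)
{
    assert(n >= 2);
    size_t t = ceil_log2(n);
    size_t p = n / t;
    return (double)p * brent_time(t, n * t, p);
}
```

For n = 1024 (so t = 10 and p = 102) the Global Sum cost is about 2033, roughly 2n, while the Prefix Sums cost is about 11250, roughly n log n. The gap comes entirely from the operation count m.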
“Coarse-Grain” Algorithms
Another approach for improving performance.
Algorithm Prefix Sums (“Coarse-Grain”)
Given: n values, p processors (p < n)
- Divide the n values into p sets, each containing ≤ ⌈n/p⌉ values.
- The first p − 1 processors each use the optimal sequential algorithm to do a local prefix computation. (⌈n/p⌉ − 1 steps)
- Then the processors run the PRAM parallel prefix algorithm on the subtotals. (log(p − 1) steps)
- Each processor then goes back and updates its local prefix values with the results from the global computation. (⌈n/p⌉ steps)
Total Cost: (2⌈n/p⌉ + log(p − 1))p = Θ(n + p log p)
For small p (when p log p = O(n)), this algorithm is cost-optimal!
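The three phases can be sketched as a sequential C simulation. This is a simplified illustration, not the slide's exact algorithm: all p blocks are scanned in phase 1, and phase 2 is done with a plain loop where a PRAM would run the parallel prefix algorithm on the subtotals. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* End index (exclusive) of the block that starts at lo. */
static size_t block_end(size_t lo, size_t b, size_t n)
{
    return (lo + b < n) ? lo + b : n;
}

/* In-place prefix sums of a[0..n-1] using p blocks of size ceil(n/p).
 * Returns 0 on success, -1 if the scratch allocation fails. */
int coarse_prefix_sums(long *a, size_t n, size_t p)
{
    assert(p > 0 && p <= n);
    size_t b = (n + p - 1) / p;                 /* ceil(n/p) */
    long *offset = malloc(p * sizeof *offset);
    if (offset == NULL) {
        return -1;
    }

    /* Phase 1: each processor scans its own block sequentially. */
    for (size_t k = 0; k < p; k++) {
        size_t lo = k * b, hi = block_end(lo, b, n);
        for (size_t i = lo + 1; i < hi; i++) {
            a[i] += a[i - 1];
        }
    }

    /* Phase 2: offset[k] = sum of all blocks before block k.
     * Sequential here; this is the parallel prefix step on a PRAM. */
    long running = 0;
    for (size_t k = 0; k < p; k++) {
        size_t lo = k * b, hi = block_end(lo, b, n);
        offset[k] = running;
        if (lo < hi) {
            running += a[hi - 1];               /* block subtotal */
        }
    }

    /* Phase 3: each processor adds its offset to its local values. */
    for (size_t k = 1; k < p; k++) {
        size_t lo = k * b, hi = block_end(lo, b, n);
        for (size_t i = lo; i < hi; i++) {
            a[i] += offset[k];
        }
    }

    free(offset);
    return 0;
}
```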
Concurrent-Write Algorithms
- Algorithm Global Sum (CRCW(combine) PRAM):
    spawn(P_0, P_1, ..., P_{n−1})
    for all P_i where 0 ≤ i ≤ n − 1 do
        A[0] ← A[i]
    endfor
  Time: O(1)
- Algorithm Prefix Sum (CRCW(combine) PRAM):
    spawn(P_{0,0}, P_{0,1}, ..., P_{0,n−1}, P_{1,0}, P_{1,1}, ..., P_{1,n−2}, ..., P_{n−2,0}, P_{n−2,1}, P_{n−1,0})
    for all P_{i,j} where 0 ≤ i ≤ n − 1, 0 ≤ j ≤ n − i − 1 do
        x_{i,j} ← A[j]
        A[i + j] ← x_{i,j}
    endfor
  Time: O(1); Space: n(n + 1)/2
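A sequential C sketch of the combining-write prefix sum follows. It makes two assumptions that should be read as this sketch's choices: processor P_{i,j} sends A[j] to cell i + j, so that cell k collects exactly A[0], ..., A[k], and separate input and output arrays stand in for the synchronous "all reads, then all writes" behavior of the PRAM. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simulate a sum-combining CRCW prefix computation.
 * out[k] receives in[0] + ... + in[k].
 * Returns the number of (virtual) processors used. */
size_t crcw_prefix_sums(const long *in, long *out, size_t n)
{
    assert(n > 0);
    size_t processors = 0;
    for (size_t k = 0; k < n; k++) {
        out[k] = 0;    /* the combined write replaces the old contents */
    }
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; i + j < n; j++) {   /* 0 <= j <= n - i - 1 */
            long x = in[j];                    /* concurrent read   */
            out[i + j] += x;                   /* combining write   */
            processors++;
        }
    }
    return processors;
}
```

For n = 4 the simulation uses 10 processors, which is n(n + 1)/2: constant time is bought with a quadratic number of processors.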
Problems of the PRAM Model
- It assumes the processors operate in a fully-synchronous mode.
- It assumes a single shared memory in which each processor can access any cell in unit time.
- It neglects the issue of contention caused by concurrent access to different cells within the same memory module.
- It assumes the interprocessor communication has infinite bandwidth, zero latency, and zero overhead.
- It assumes the number of processors can increase with the problem size.
In conclusion, the PRAM model is good for a gross classification of algorithms, but not very useful for describing realistic algorithms or predicting the performance of algorithms.
Extensions of the PRAM Model
- Phase PRAM — Introduces asynchrony.
- Module Parallel Computer — Divides memory into modules.
- Local-Memory PRAM — Divides memory into local and global.
- Memory Hierarchy Model — Views the memory as a hierarchy.
- Delay Model — Introduces communication delay.
Valiant’s BSP Model
BSP = Bulk Synchronous Parallel
The BSP model was developed to model distributed-memory multiprocessors. It consists of three components:
- A group of p processors each with local memory.
- An interconnection network for point-to-point communication between the processors.
- A mechanism for synchronizing all the processors at defined intervals.
Properties:
- It’s simple to use: BSP programs look much the same as sequential programs.
- It’s high-level and architecture-independent.
- The performance of a BSP program on a given architecture is predictable: a small set of parameters is used to describe the target architecture.
Supersteps
In BSP, a computation consists of a sequence of supersteps. Each superstep is further subdivided into three ordered phases:
- Local Computation — each processor performs computation using only data stored in its local memory.
- Communication — processors send messages to and receive messages from each other.
- Barrier Synchronization — this global synchronization waits for all of the communication actions to complete.
[Figure: one superstep, showing Local Computation, Communication, and Barrier Synchronization in order]
A BSP Program
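The following function calculates the partial (prefix) sums of p integers stored on p processors, one integer per processor. (The code is written in C with the Oxford BSP library.)

    int bsp_allsums(int x) {
        int i, left, right;
        int mypid = bsp_pid();      /* this processor's id */
        int p = bsp_nprocs();       /* number of processors */

        /* Make `left` remotely writable; takes effect at the next sync. */
        bsp_pushregister(&left, sizeof(int));
        bsp_sync();

        right = x;
        for (i = 1; i < p; i *= 2) {
            /* Send the running sum to the processor i places to the right. */
            if (mypid + i < p)
                bsp_put(mypid + i, &right, &left, 0, sizeof(int));
            bsp_sync();             /* end of superstep: puts are delivered */
            if (mypid >= i)
                right = left + right;
        }
        bsp_popregister(&left);
        return right;
    }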
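To see what `bsp_allsums` computes without a BSP library, the superstep structure can be simulated in plain C. In this sketch, `right[pid]` and `left[pid]` stand for each processor's local variables, the first inner loop plays the role of the `bsp_put` calls, and the boundary between the two inner loops plays the role of `bsp_sync()`. The function name `simulate_allsums` and the caller-provided arrays are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simulate bsp_allsums on p virtual processors.
 * x[pid] is the input held by processor pid; on return right[pid]
 * holds x[0] + ... + x[pid]. left is scratch space of length p.
 * Returns the number of loop supersteps (the registration sync
 * before the loop is not counted). */
size_t simulate_allsums(const int *x, int *right, int *left, size_t p)
{
    assert(p > 0);
    size_t supersteps = 0;
    for (size_t pid = 0; pid < p; pid++) {
        right[pid] = x[pid];
    }
    for (size_t i = 1; i < p; i *= 2) {
        /* Communication phase: every put reads the sender's old value. */
        for (size_t pid = 0; pid + i < p; pid++) {
            left[pid + i] = right[pid];
        }
        /* Barrier, then local computation on the received value. */
        for (size_t pid = i; pid < p; pid++) {
            right[pid] = left[pid] + right[pid];
        }
        supersteps++;
    }
    return supersteps;
}
```

With p = 5 and inputs 1, 2, 3, 4, 5 the simulation finishes in 3 supersteps and leaves 1, 3, 6, 10, 15 on the five processors.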
Berkeley’s LogP Model
L = latency (of message transmissions)
o = overhead (of communication incurred on a processor)
g = gap (between consecutive message transmissions)
P = processors (the number of processors)
This model aims to predict accurately the performance of algorithms on distributed-memory multiprocessors. It is not as useful for developing parallel algorithms.
LogP Example
Given L = 6, o = 2, g = 4, P = 8, predict the time for a broadcast.
Time  Actions
0     P0 starts to send a message to P1
2     the message leaves P0
4     P0 starts to send a second message to P2
6     the second message leaves P0
8     P0 starts to send a third message to P3
...
8     the first message arrives at P1
10    P1 receives the message, and starts to forward it to P4
...
14    P1 starts to send a second message to P5
...
Total time: 24
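The total can be reproduced with a small C sketch of a greedy broadcast schedule: every informed processor keeps sending to uninformed ones, and the next send always comes from whichever informed processor is free earliest. The sketch assumes a message started at time s is usable by its receiver at s + o + L + o, and that a sender can start its next message max(g, o) later. The function name is hypothetical, and the order in which processors are informed may differ from the table above; only the finishing time is compared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Time at which the last of P processors has received a broadcast
 * under the greedy schedule described above.
 * Returns -1 if the scratch allocation fails. */
long logp_broadcast_time(long L, long o, long g, size_t P)
{
    assert(P >= 1 && L >= 0 && o >= 0 && g >= 0);
    long *next_send = malloc(P * sizeof *next_send);
    if (next_send == NULL) {
        return -1;
    }
    long gap = (g > o) ? g : o;   /* spacing between sends on one node */
    size_t informed = 1;          /* the root already has the message  */
    long finish = 0;
    next_send[0] = 0;

    while (informed < P) {
        /* Pick the informed processor that can start sending earliest. */
        size_t sender = 0;
        for (size_t k = 1; k < informed; k++) {
            if (next_send[k] < next_send[sender]) {
                sender = k;
            }
        }
        long start = next_send[sender];
        long ready = start + o + L + o;   /* send o, latency L, receive o */
        next_send[informed++] = ready;    /* receiver may forward at once */
        next_send[sender] = start + gap;
        if (ready > finish) {
            finish = ready;
        }
    }
    free(next_send);
    return finish;
}
```

With L = 6, o = 2, g = 4 and P = 8 the function returns 24, in agreement with the total above.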