PRAM and Other Models: Understanding Parallel Computation - Prof. Jingke Li, Study notes of Computer Science

Portland State University (PSU), Computer Science

Prof. Jingke Li

An overview of PRAM (parallel random access machine) models, their submodels, and algorithms. It covers the RAM model, the PRAM model, the exclusive-read and concurrent-write submodels, and specific PRAM algorithms such as global sum and prefix sums. The document also explains Brent's theorem and its application to these algorithms.

Typology: Study notes

Pre 2010

Uploaded on 08/18/2009


PRAM and Other Models

Jingke Li

Portland State University

Jingke Li (Portland State University) CS 415/515 PRAM and Other Models 1 / 27

The RAM Model

(for Sequential Computation)

RAM = Random Access Machine

• A processor operating under the control of a sequential program.
• A memory with M cells (M can be unbounded).
• Basic Operations:
  - READ: the processor reads a datum from an arbitrary location in memory into one of its internal registers.
  - COMPUTE: the processor performs an (arithmetic) operation on data in register(s).
  - WRITE: the processor writes the content of one register into an arbitrary memory cell.
• Uniform Cost Criterion: each basic operation takes one time unit to execute.


RAM Algorithms

Presented in pseudo code. Complexity is measured by number of time units needed.

Algorithm Prefix Sums (RAM)
// Compute the prefix sums of n numbers.
    s_0 ← x_0
    for i = 1 to n − 1 do
        s_i ← s_{i−1} + x_i
    endfor

Complexity Analysis:

n (implicit) READs to read x_0, x_1, ..., x_{n−1} from memory;
n COMPUTEs to obtain s_0, s_1, ..., s_{n−1};
n (implicit) WRITEs to write s_0, s_1, ..., s_{n−1} into memory.
Total: 3n time units.
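The RAM algorithm above can be sketched in C as follows. This is a minimal illustration, not part of the original slides; the function name `prefix_sums_ram` is hypothetical. Each loop iteration corresponds to one READ of x[i], one COMPUTE, and one WRITE of s[i].

```c
#include <assert.h>
#include <stddef.h>

/* Sequential (RAM) prefix sums: s[i] = x[0] + x[1] + ... + x[i].
 * The name prefix_sums_ram is hypothetical, chosen for this sketch. */
void prefix_sums_ram(const int *x, int *s, size_t n)
{
    if (n == 0)
        return;
    assert(x != NULL && s != NULL);
    s[0] = x[0];                      /* s_0 <- x_0 */
    for (size_t i = 1; i < n; i++)
        s[i] = s[i - 1] + x[i];       /* s_i <- s_{i-1} + x_i */
}
```

Calling it on the input 3, 1, 4, 1, 5 fills `s` with 3, 4, 8, 9, 14.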


The PRAM Model

PRAM = Parallel RAM

A number of identical processors, P_1, P_2, ..., P_N.
A common memory with M cells, shared by the N processors.
The processors work in a synchronous fashion.
Basic Operations:
- READ: (up to N) processors read simultaneously from (up to N) memory cells. Each processor reads from at most one memory cell and stores the value obtained in a local register.
- COMPUTE: (up to N) processors perform an (arithmetic) operation on their local data in register(s).
- WRITE: (up to N) processors write simultaneously into (up to N) memory cells. Each processor writes the content of one register into an arbitrary memory cell.
Inter-processor communication is simplistically modeled by READ/WRITE operations.

Three Specific PRAM Models

EREW PRAM (exclusive read, exclusive write): The weakest among the three models. It takes O(log n) time units to spread a value from one processor to n other processors.
CREW PRAM (concurrent read, exclusive write): A processor can spread a value to n other processors in O(1) time units.
CRCW PRAM (concurrent read, concurrent write): The strongest model, but concurrent writes need special handling to resolve memory write conflicts.
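The O(log n) bound for spreading a value on an EREW PRAM comes from doubling: in every round each cell that already holds the value is copied into one distinct empty cell. The sketch below simulates this sequentially in C; the function name `erew_broadcast` and the array layout are assumptions made for illustration.

```c
#include <assert.h>

/* Simulates an EREW broadcast of `value` into buf[0..n-1].
 * In each round, processor P_i (0 <= i < have) copies buf[i] to
 * buf[i + have], so no cell is read or written by two processors in
 * the same round. Returns the number of rounds, which is ceil(log2 n).
 * The name erew_broadcast is hypothetical. */
int erew_broadcast(int *buf, int n, int value)
{
    int rounds = 0;
    assert(buf != 0 && n >= 1);
    buf[0] = value;
    for (int have = 1; have < n; have *= 2) {
        /* The inner loop stands in for `have` processors acting in parallel. */
        for (int i = 0; i < have && i + have < n; i++)
            buf[i + have] = buf[i];
        rounds++;
    }
    return rounds;
}
```

For n = 8 the value reaches all cells after 3 rounds. On a CREW PRAM the same effect takes a single round, because all processors may read the source cell at once.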


Concurrent Write Sub Models

Priority CW: The processors are assigned certain priorities; the processor with the highest priority wins.
Common CW: The processors are allowed to write to the same memory cell only if they are attempting to write the same value; otherwise a special flag will be raised.
Arbitrary CW: Any of the attempting processors can succeed; the selection is according to an algorithm.
Random CW: A processor is chosen by a random process to succeed.
Combining CW: All the values from the attempting processors are combined into a single value, which is then stored in the memory cell. Possible combining operations include sum, product, and, or, xor, max, and min.
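To make the differences concrete, the following C sketch resolves a set of simultaneous writes to one cell under three of the rules above (Priority, Common, and Combining with sum). It is an illustration only: the names `cw_policy` and `resolve_cw` are hypothetical, and index 0 is assumed to be the highest-priority processor.

```c
#include <assert.h>

/* Hypothetical names for this sketch; not taken from the slides. */
enum cw_policy { CW_PRIORITY, CW_COMMON, CW_COMBINE_SUM };

/* Resolves k >= 1 simultaneous writes to one memory cell.
 * vals[i] is the value written by the i-th attempting processor; index 0
 * is assumed to have the highest priority. Returns 0 on success, or -1
 * when the Common CW rule is violated (the "special flag" case), in which
 * case *cell is left unchanged. */
int resolve_cw(enum cw_policy policy, const int *vals, int k, int *cell)
{
    assert(vals != 0 && cell != 0 && k >= 1);
    switch (policy) {
    case CW_PRIORITY:
        *cell = vals[0];
        return 0;
    case CW_COMMON:
        for (int i = 1; i < k; i++)
            if (vals[i] != vals[0])
                return -1;            /* writers disagree: raise the flag */
        *cell = vals[0];
        return 0;
    case CW_COMBINE_SUM: {
        int sum = 0;
        for (int i = 0; i < k; i++)
            sum += vals[i];
        *cell = sum;
        return 0;
    }
    }
    return -1;
}
```

With attempted writes 4, 9, 2, Priority stores 4, Common reports a conflict, and Combining (sum) stores 15.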

PRAM Algorithms

Complexity Measurement:

Time Complexity — number of time units used
Space Complexity — number of processors involved
Lower Bounds — the least time/space needed for a problem

Cost-Optimal PRAM Algorithms:

Cost: The cost of a PRAM algorithm is the product of its time complexity and its space complexity (the number of processors).
Cost-Optimal: A PRAM algorithm is cost-optimal if its cost is in the same complexity class as the running time of the optimal sequential algorithm.


Algorithm Global Sum (EREW PRAM)

// Given n numbers A[0], A[1], ..., A[n − 1], compute the sum
// Σ_{i=0}^{n−1} A[i] and store it in A[0].
    spawn(P_0, P_1, ..., P_{⌊n/2⌋−1})
    for all P_i where 0 ≤ i ≤ ⌊n/2⌋ − 1 do
        for j ← 0 to ⌈log n⌉ − 1 do
            if i mod 2^j = 0 and 2i + 2^j < n then
                A[2i] ← A[2i] + A[2i + 2^j]
            endif
        endfor
    endfor

Complexity Analysis: Time: O(log n); Space: ⌊n/2⌋ processors.

Optimality Analysis: Cost = ⌈log n⌉ × ⌊n/2⌋ = Θ(n log n), while the sequential algorithm is Θ(n). Not cost-optimal!

Reason: Some processors have idle steps. The total number of operations performed is only n − 1. How to improve?
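The idle steps are easy to see in a sequential simulation of Algorithm Global Sum. In the C sketch below (the function name `global_sum_erew` is hypothetical), the outer loop is the time step j and the inner loop stands in for the ⌊n/2⌋ processors that would run simultaneously; the function counts how many additions are actually performed.

```c
#include <assert.h>

/* Simulates Algorithm Global Sum on A[0..n-1]; the sum ends up in A[0].
 * Returns the total number of additions performed, which is n - 1.
 * Within one step the cells written (multiples of 2*stride) and the cells
 * read (odd multiples of stride) are disjoint, so running the processors
 * one after another gives the same result as running them in parallel.
 * The name global_sum_erew is hypothetical. */
int global_sum_erew(int *A, int n)
{
    int ops = 0;
    assert(A != 0 && n >= 1);
    for (int stride = 1; stride < n; stride *= 2) {   /* stride = 2^j */
        for (int i = 0; i < n / 2; i++) {             /* processor P_i */
            if (i % stride == 0 && 2 * i + stride < n) {
                A[2 * i] += A[2 * i + stride];
                ops++;
            }                                         /* else: P_i is idle */
        }
    }
    return ops;
}
```

For n = 8 the steps perform 4, 2, and 1 additions, 7 in total, although 4 processors are held for all 3 steps.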

Apply Brent’s Theorem to Algorithm Global Sum

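Brent's theorem: if a parallel algorithm performs m operations in time t, then p processors can execute it in time t + (m − t)/p.

Total number of operations performed: m = n − 1, in t = ⌈log n⌉ time steps.
Using p = ⌊n/log n⌋ processors, the time is

\[
\lceil \log n \rceil + \frac{n - 1 - \lceil \log n \rceil}{\lfloor n / \log n \rfloor}
= \Theta\!\left(2\log n - \frac{\log n}{n} - \frac{\log^2 n}{n}\right)
= \Theta(\log n)
\]

Cost = ⌊n/log n⌋ × Θ(log n) = Θ(n). This is cost-optimal!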
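As a quick numeric check, the bound can be evaluated for a concrete n. The C sketch below assumes the form of Brent's theorem that matches the expression above, namely time t + (m − t)/p for an algorithm with m operations and t steps, and rounds the quotient up to whole steps. The function name `brent_time` is hypothetical.

```c
#include <assert.h>

/* Brent's bound t + (m - t) / p, with the quotient rounded up to a whole
 * number of steps. t = parallel time, m = total operations, p = processors.
 * The name brent_time is hypothetical. */
long brent_time(long t, long m, long p)
{
    assert(p >= 1 && t >= 0 && m >= t);
    return t + (m - t + p - 1) / p;
}
```

For Global Sum with n = 1024 we have t = 10, m = 1023, and p = ⌊1024/10⌋ = 102, so `brent_time(10, 1023, 102)` gives 20 steps. The resulting cost is 102 × 20 = 2040, roughly 2n, which is in line with Θ(n).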


Apply Brent’s Theorem to Algorithm Prefix Sums

Total number of operations performed: n⌈log n⌉
Using p = ⌊n/log n⌋ processors, the time is

\[
\lceil \log n \rceil + \frac{n\lceil \log n \rceil - \lceil \log n \rceil}{\lfloor n / \log n \rfloor}
= \Theta\!\left(\log n + \log^2 n - \frac{\log^2 n}{n}\right)
= \Theta(\log^2 n)
\]

Cost = ⌊n/log n⌋ × Θ(log^2 n) = Θ(n log n). Still not optimal!

Reason: The total number of PRAM operations is higher than that of the optimal sequential algorithm.

“Coarse-Grain” Algorithms

Another approach for improving performance.

Algorithm Prefix Sums (“Coarse-Grain”)

Given: n values, p processors (p < n)

Divide the n values into p sets, each containing at most ⌈n/p⌉ values.
The first p − 1 processors each use the optimal sequential algorithm to do a local prefix computation. (⌈n/p⌉ − 1 steps)
The processors then run the PRAM parallel prefix algorithm on the subtotals. (⌈log(p − 1)⌉ steps)
Each processor then goes back and updates its local prefix values with the results from the global computation. (⌈n/p⌉ steps)

Total Cost: (2⌈n/p⌉ + ⌈log(p − 1)⌉) × p = Θ(n + p log p)

For small p, this algorithm is cost-optimal!
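The three phases can be sketched as a sequential simulation in C. This is an illustration under stated assumptions: the names `prefix_sums_coarse` and `block_start` are hypothetical, the block layout (contiguous blocks of ⌈n/p⌉ values) is one possible choice, and the prefix over the block subtotals is done with a plain loop here where the slides use the PRAM parallel prefix algorithm.

```c
#include <assert.h>

/* First index of block k when n values are split into blocks of `chunk`
 * values; clamped to n so trailing blocks may be empty. Hypothetical helper. */
static int block_start(int k, int chunk, int n)
{
    long s = (long)k * chunk;
    return s < n ? (int)s : n;
}

/* Coarse-grain prefix sums over x[0..n-1], in place, simulating p processors.
 * Each loop over k stands in for the p processors working in parallel. */
void prefix_sums_coarse(int *x, int n, int p)
{
    enum { MAX_P = 64 };             /* arbitrary limit for this sketch */
    int offset[MAX_P];
    int chunk;

    assert(x != 0 && n >= 1 && p >= 1 && p <= MAX_P);
    chunk = (n + p - 1) / p;         /* ceil(n/p) values per processor */

    /* Phase 1: local sequential prefix sums inside every block. */
    for (int k = 0; k < p; k++)
        for (int i = block_start(k, chunk, n) + 1;
             i < block_start(k + 1, chunk, n); i++)
            x[i] += x[i - 1];

    /* Phase 2: prefix sums over the block subtotals (sequential stand-in
     * for the PRAM parallel prefix step). */
    offset[0] = 0;
    for (int k = 1; k < p; k++) {
        int lo = block_start(k - 1, chunk, n);
        int hi = block_start(k, chunk, n);
        offset[k] = offset[k - 1] + (hi > lo ? x[hi - 1] : 0);
    }

    /* Phase 3: every processor adds the total of all earlier blocks. */
    for (int k = 1; k < p; k++)
        for (int i = block_start(k, chunk, n);
             i < block_start(k + 1, chunk, n); i++)
            x[i] += offset[k];
}
```

With the values 1 through 8 and p = 3, the blocks are of sizes 3, 3, and 2, and the array ends up holding 1, 3, 6, 10, 15, 21, 28, 36.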


Concurrent-Write Algorithms

Algorithm Global Sum (CRCW(combine) PRAM):
    spawn(P_0, P_1, ..., P_{n−1})
    for all P_i where 0 ≤ i ≤ n − 1 do
        A[0] ← A[i]
    endfor
Time: O(1)

Algorithm Prefix Sum (CRCW(combine) PRAM):
    spawn(P_{0,0}, P_{0,1}, ..., P_{0,n−1}, P_{1,0}, P_{1,1}, ..., P_{1,n−2}, ..., P_{n−2,0}, P_{n−2,1}, P_{n−1,0})
    for all P_{i,j} where 0 ≤ i ≤ n − 1, 0 ≤ j ≤ n − i − 1 do
        x_{i,j} = A[j]; A[i + j] ← x_{i,j}
    endfor
Time: O(1); Space: n(n + 1)/2
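A sequential simulation helps check the indexing of the combining prefix-sum algorithm. The C sketch below assumes that processor P_{i,j} reads A[j] and that its write is combined (summed) into cell i + j, which is the indexing under which cell k receives A[0] + ... + A[k]. The function name `prefix_sums_crcw` is hypothetical, and a separate output array models the fact that all reads happen before the combined write.

```c
#include <assert.h>
#include <string.h>

/* Simulates one combining-CW (sum) step with n(n+1)/2 processors.
 * Processor P_{i,j}, 0 <= i < n, 0 <= j < n - i, reads A[j] and its write
 * is summed into out[i + j]. Assumed indexing; hypothetical function name. */
void prefix_sums_crcw(const int *A, int *out, int n)
{
    assert(A != 0 && out != 0 && A != out && n >= 1);
    memset(out, 0, (size_t)n * sizeof out[0]);
    for (int i = 0; i < n; i++)            /* processor row i */
        for (int j = 0; j < n - i; j++)    /* processor P_{i,j} */
            out[i + j] += A[j];            /* combined write into cell i + j */
}
```

For the input 3, 1, 4, 1 the output is 3, 4, 8, 9. On a real combining CRCW PRAM all of these writes happen in one time unit, which is where the O(1) bound comes from.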

Problems of the PRAM Model

It assumes the processors operate in a fully-synchronous mode.
It assumes a single shared memory in which each processor can access any cell in unit time.
It neglects the issue of contention caused by concurrent access to different cells within the same memory module.
It assumes the interprocessor communication has infinite bandwidth, zero latency, and zero overhead.
It assumes the number of processors can increase with the problem size.

In conclusion, the PRAM model is good for a gross classification of algorithms, but not very useful for describing realistic algorithms or predicting the performance of algorithms.


Extensions of the PRAM Model

Phase PRAM — Introduces asynchrony.
Module Parallel Computer — Divides memory into modules.
Local-Memory PRAM — Divides memory into local and global.
Memory Hierarchy Model — Views the memory as a hierarchy.
Delay Model — Introduces communication delay.

Valiant’s BSP Model

BSP = Bulk Synchronous Parallel

The BSP model is developed to model distributed-memory multiprocessors. It consists of three components:

A group of p processors each with local memory.
An interconnection network for point-to-point communication between the processors.
A mechanism for synchronizing all the processors at defined intervals.

Properties:

It’s simple to use. — BSP programs look much the same as sequential programs.
It’s high-level, architecture-independent.
The performance of a BSP program on a given architecture is predictable. — A small set of parameters are used to describe the target architecture.


Supersteps

In BSP, a computation consists of a sequence of supersteps. Each superstep is further subdivided into three ordered phases:

Local Computation: each processor performs computation using only data stored in its local memory.
Communication: processors send messages to and receive messages from each other.
Barrier Synchronization: this global synchronization waits for all of the communication actions to complete.

A BSP Program

The following function calculates the partial sums of p integers stored on p processors. (The code is written in C with Oxford BSP library.)

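    /* Returns the sum of the x values held by processors 0..mypid. */
    int bsp_allsums(int x) {
        int i, left, right;
        int mypid = bsp_pid();      /* id of this processor */
        int p = bsp_nprocs();       /* number of processors */

        bsp_pushregister(&left, sizeof(int));   /* make `left` a target for remote puts */
        bsp_sync();
        right = x;
        for (i = 1; i < p; i *= 2) {
            if (mypid + i < p)      /* send the running sum i places to the right */
                bsp_put(mypid + i, &right, &left, 0, sizeof(int));
            bsp_sync();             /* end of superstep: all puts have completed */
            if (mypid >= i)
                right = left + right;
        }
        bsp_popregister(&left);
        return right;
    }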
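To follow the data movement without a BSP library, the same doubling scheme can be simulated sequentially. In the C sketch below, which is an illustration and not part of the original program, the arrays `right` and `left` hold the per-processor variables of the same names, and the function name `simulate_allsums` is hypothetical. Each loop iteration is one superstep: all puts are carried out first, and only then are the received values added, which is what the barrier guarantees.

```c
#include <assert.h>

/* Simulates bsp_allsums for p processors. On entry right[k] is the value x
 * held by processor k; on return right[k] = x_0 + x_1 + ... + x_k.
 * The name simulate_allsums is hypothetical. */
void simulate_allsums(int *right, int p)
{
    enum { MAX_P = 64 };            /* arbitrary limit for this sketch */
    int left[MAX_P];
    assert(right != 0 && p >= 1 && p <= MAX_P);
    for (int i = 1; i < p; i *= 2) {
        /* Communication: processor k puts its `right` into `left`
         * on processor k + i. */
        for (int k = 0; k + i < p; k++)
            left[k + i] = right[k];
        /* Barrier: every put is complete before any `left` is read. */
        /* Local computation: processors k >= i add the received value. */
        for (int k = i; k < p; k++)
            right[k] += left[k];
    }
}
```

Starting from 1, 2, 3, 4, 5 on five processors, the supersteps with i = 1, 2, 4 produce 1, 3, 6, 10, 15.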


Berkeley’s LogP Model

L = latency (of message transmissions)

o = overhead (of communication incurred on a processor)

g = gap (between consecutive message transmissions)

P = processors (the number of processors)

This model is targeted at accurately modeling the performance of algorithms on distributed-memory multiprocessors. It is not as useful for developing parallel algorithms.

LogP Example

Given L = 6, o = 2, g = 4, P = 8, predict the time for a broadcast.

Time  Actions
0     P_0 starts to send a message to P_1
2     the message leaves P_0
4     P_0 starts to send a second message to P_2
6     the second message leaves P_0
8     P_0 starts to send a third message to P_3
...
8     the first message arrives at P_1
10    P_1 receives the message and starts to forward it to P_4
...
14    P_1 starts to send a second message to P_5
...

Total time: 24
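The total can be reproduced with a small greedy simulation. The C sketch below rests on several assumptions: every informed processor starts a new send every g time units, a message started at time s is usable by its receiver at s + o + L + o, g is at least o, and the earliest available send slot is always used next. The function name `logp_broadcast_time` is hypothetical, and the processor numbering in the simulation need not match the table.

```c
#include <assert.h>

/* Earliest completion time of a one-message broadcast to P processors under
 * the assumptions stated above. next_send[k] is the time at which informed
 * processor k can start its next send. Hypothetical function name. */
int logp_broadcast_time(int L, int o, int g, int P)
{
    enum { MAX_P = 64 };            /* arbitrary limit for this sketch */
    int next_send[MAX_P];
    int informed = 1;               /* the root already has the message */
    int finish = 0;

    assert(P >= 1 && P <= MAX_P && L >= 0 && o >= 0 && g >= o);
    next_send[0] = 0;
    while (informed < P) {
        int s = 0;                  /* informed processor that can send earliest */
        for (int k = 1; k < informed; k++)
            if (next_send[k] < next_send[s])
                s = k;
        /* Send overhead o, network latency L, receive overhead o. */
        int ready = next_send[s] + o + L + o;
        next_send[s] += g;          /* sender waits one gap before sending again */
        next_send[informed++] = ready;  /* the receiver can forward from `ready` on */
        if (ready > finish)
            finish = ready;
    }
    return finish;
}
```

With L = 6, o = 2, g = 4, and P = 8 the sends start at times 0, 4, 8, 10, 12, 14, 14, and the last receivers are ready at time 24, matching the total above.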
