Parallelism & Communication in Multiprocessor Apps: Jacobi Relaxation Algorithm Study, Assignments of Computer Science

This document examines the challenges of exploiting parallelism in multiprocessor applications, focusing on the tension between parallelism and communication. Using the Jacobi relaxation algorithm as an example, it discusses how to partition the graph of an application into pieces for a P-processor machine, and the impact of communication overhead on performance.

Uploaded on 08/05/2009

GEORGIA INSTITUTE OF TECHNOLOGY
College of Computing
CS6290/CS4290 | High-Performance Computer Architecture
Fall 1999
CS6290/CS4290 Homework 4
Issued: November 12, 1999
Due: November 22, 1999
Purpose:
This homework explores network performance in the context of an
abstract parallel machine. It starts by introducing a simple
performance model for parallel execution using a particular application
as an example and then uses the model to estimate the effect of
the network on performance. The goal is to develop intuition and
a back-of-the-envelope strategy for evaluating the suitability of an
architecture to an application.
Reading:
H&P Chapter 7, particularly Sections 7.1-7.3
H&P Sections 8.1 and 8.2 for parallel applications
Problems:
1. Partitioning.
2. Communication Overhead.

Introduction

When developing multiprocessor applications, we attempt to exploit parallelism to achieve increased performance. With increased parallelism, however, comes increased interprocessor communication. Since real multiprocessors can provide only a finite amount of network bandwidth, this tension between parallelism and communication has significant ramifications which we need to keep in mind when designing and programming multiprocessors. In this exercise, you will examine several of these issues.

Multiprocessor programs also require the introduction of synchronization to coordinate multiple processes. While different languages offer varied forms of synchronization, in this homework you will explore producer-consumer style synchronization, implemented using barrier() constructs for shared memory and via messages in message passing.

The goal of these exercises is to familiarize you with the problems of partitioning parallel programs and data structures. In addition, you will develop a simple model of parallel program performance that gives insights into the demands of applications and the capabilities of machines.

Example Application: Jacobi Relaxation

Jacobi relaxation is an iterative algorithm which, given a set of boundary conditions, finds (discretized) solutions to differential equations of the form $\nabla^2 A + B = 0$. As we've seen in lecture, we begin by choosing the grid which will form the basis of our discretization:

[Figure: discretization grid bordered by boundary conditions, with an interior point labeled (i,j).]

To find a solution on a grid, we repeatedly apply the following iterative step until we converge on a solution.

$$A^{k+1}_{i,j} = \frac{A^{k}_{i+1,j} + A^{k}_{i-1,j} + A^{k}_{i,j+1} + A^{k}_{i,j-1}}{4} + b_{i,j}$$

Jacobi differs from most other iterative relaxation algorithms in that the update of each point (at iteration step k + 1) requires the previous values of the neighboring points (from iteration step k).
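The iterative step can be sketched in a few lines of Python. This is a minimal illustration, not part of the assignment: the function names `jacobi_step` and `jacobi_solve`, the list-of-lists grid layout, the convergence tolerance, and the assumption that the `b` term is added after the four-neighbor average are all choices made here for the example. The outermost ring of the grid is assumed to hold the fixed boundary conditions.

```python
def jacobi_step(a, b):
    """Return the grid after one Jacobi sweep.

    a: 2D list holding the step-k values, including the fixed boundary ring.
    b: 2D list of the same shape holding the per-point term b[i][j].
    """
    n_rows, n_cols = len(a), len(a[0])
    # Work on a copy so every update reads only step-k values.
    new = [row[:] for row in a]
    for i in range(1, n_rows - 1):
        for j in range(1, n_cols - 1):
            neighbours = a[i + 1][j] + a[i - 1][j] + a[i][j + 1] + a[i][j - 1]
            new[i][j] = neighbours / 4 + b[i][j]
    return new


def jacobi_solve(a, b, tol=1e-6, max_iters=10_000):
    """Repeat jacobi_step until the largest per-point change drops below tol.

    Returns the final grid and the number of sweeps performed.
    """
    for iteration in range(1, max_iters + 1):
        new = jacobi_step(a, b)
        delta = max(abs(new[i][j] - a[i][j])
                    for i in range(len(a)) for j in range(len(a[0])))
        a = new
        if delta < tol:
            return a, iteration
    return a, max_iters
```

Because each sweep reads only the previous grid, all interior points can be updated independently, which is what makes the algorithm a natural candidate for partitioning across processors.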

We use a simple graphical representation to capture those features we are interested in and abstract away excess detail. In this representation, graph nodes represent fixed amounts of

Figure 1: Partitioning by rows and by square tiles.

on the arrows between the nodes represent the data that must be communicated between processors.

Machine Model

The machine model which we'll be using in this homework is relatively simple: a distributed-memory multiprocessor in which some number of processor/memory nodes communicate via an interconnection network. Each processor contains some amount of memory in which it stores the data partition for which it is responsible.

INTERCONNECTION NETWORK

P/M P/M P/M P/M P/M P/M

In this machine model, we assume the interconnection network provides uniform access: all interprocessor communication is equally expensive in terms of network resources consumed and communication latency!

Note that we don't make any assumptions about what programming model (e.g. shared memory, message passing, etc.) this machine provides to the end user (yet).

Given an application's graphical representation (such as that described above for Jacobi relaxation), we "program" a P-processor machine by deciding which processor should perform the computation represented by each of the nodes in the graph. Since we could equivalently view this process as one of dividing the graph into (at most) P pieces, we usually refer to this as the partitioning problem.
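One way to experiment with a partitioning is to write it as an "owner" function that maps each grid point to a processor, then count how many neighbor reads cross a partition boundary. The sketch below does this for a four-neighbor Jacobi stencil. The names `owner_by_rows`, `owner_by_tiles`, and `communication_per_processor` are hypothetical, and the code assumes that n is divisible by P for rows, and that P is a perfect square whose root divides n for tiles.

```python
def owner_by_rows(i, j, n, p):
    """Row partition: each processor owns n // p consecutive rows."""
    return i // (n // p)


def owner_by_tiles(i, j, n, p):
    """Tile partition: processors form a q x q array of square tiles, q = sqrt(p)."""
    q = int(round(p ** 0.5))
    tile = n // q
    return (i // tile) * q + (j // tile)


def communication_per_processor(n, p, owner):
    """Count, per processor, the neighbour values owned by another processor
    that it must read during one iteration of the four-neighbour stencil."""
    received = [0] * p
    for i in range(n):
        for j in range(n):
            me = owner(i, j, n, p)
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                inside = 0 <= ni < n and 0 <= nj < n
                if inside and owner(ni, nj, n, p) != me:
                    received[me] += 1
    return received
```

Calling `communication_per_processor(8, 4, owner_by_rows)` and `communication_per_processor(8, 4, owner_by_tiles)` gives per-processor counts that can be compared directly for a small grid.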

A: (Warmup: nothing to turn in) For each of the two partitions in Figure 1, how should the B matrix be distributed to the processors? How should the boundary values be distributed?

B: (Warmup: nothing to turn in) Using total amount of communicated data as a metric, which of the two partitions in Figure 1 is better?

Problem 1: Partitioning

The total running time of the program is the ideal metric of goodness: one partition of a program graph is better than another if it results in a shorter running time. But how do we determine running time for a particular partition running on a P-processor machine? If we assume that there is no overlap between computation and communication, we can estimate the running time as the sum of the computation time and the communication time.

T = (time to compute) + (time to communicate)

We start by determining the following information from the program graph and partition:

$w_i$: the total amount of computation for processor i (in abstract "computation units")

$c_i$: the total amount of communication invoked by processor i which cannot be resolved on that processor (in abstract "communication units")

For simplicity, assume that we always partition things such that all processors get the same amount of work, $w$, and invoke the same amount of external communication, $c$ ($w = w_1 = \cdots = w_i$ and $c = c_1 = \cdots = c_i$).

Given $w$ and $c$, we initially compute the running time $T$ by summing the computation and communication times required by one processor. Since all the processors are running in parallel and doing the same amount of work, the running time of a single processor should be the same as the running time of the entire application.

Thus, $T = s \cdot w + l \cdot c$

where

s is a measure of processor speed: a processor requires s time units to complete one unit of computation
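The running-time model is simple enough to evaluate directly. The sketch below assumes, by analogy with s, that l is the number of time units needed for one unit of communication; that reading of l, the function names, and the numbers in the usage note are assumptions made for illustration only.

```python
def running_time(w, c, s, l):
    """T = s*w + l*c, with no overlap between computation and communication.

    w: computation units per processor
    c: external communication units per processor
    s: time units per computation unit
    l: time units per communication unit (assumed meaning of l)
    """
    return s * w + l * c


def speedup(total_work, w, c, s, l):
    """Sequential time (all work on one processor, no external
    communication) divided by the modelled parallel time."""
    return (s * total_work) / running_time(w, c, s, l)
```

For example, with purely hypothetical values `total_work=4096`, `w=256`, `c=64`, `s=1`, and `l=2`, `running_time` returns 384 time units, so the modelled speedup is 4096 / 384, noticeably below the 16 that the division of work alone would suggest.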

What are the appropriate values of w and c? Using these values, what values do you get for T? How much of a speedup is this over the sequential running time?

C: Assuming an $n \times n$ Jacobi grid, derive an expression for the amount of communication for one partition, when the aspect ratio of each partition is given as a : b. The aspect ratio is specified as a : b, where a is the size of the x dimension of the partition, and b is the size of the y dimension of the partition. Furthermore, assume there are P processors and that each processor gets an equal amount of work.

D: Prove that the volume of communication per partition is minimized when a = b. Assume as before that there are P processors, and that each processor must get an equal amount of work.
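Before attempting the proof, it can help to check the claim numerically on a small case. The sketch below enumerates the integer rectangle shapes with a given area and compares their communication volume. It assumes an interior partition (neighbors on all four sides) and a four-neighbor stencil, so that one value is fetched for every point along each edge; the function names are hypothetical, and a numerical check like this is a sanity test rather than a proof.

```python
def boundary_values(a, b):
    """Values fetched per iteration by an interior a x b partition:
    one per point along each of its four edges."""
    return 2 * (a + b)


def shapes_with_area(area):
    """All integer (a, b) pairs with a * b == area."""
    return [(a, area // a) for a in range(1, area + 1) if area % a == 0]


def best_shape(area):
    """Shape with the smallest communication volume for the given area."""
    return min(shapes_with_area(area), key=lambda ab: boundary_values(*ab))
```

For an area of 64 points, `shapes_with_area(64)` ranges from 1 x 64 to 64 x 1, and `best_shape(64)` picks out the shape that the proof should identify in general.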

E: Optimal Rectangular Partitioning. The classic Jacobi algorithm has a square "kernel", but other computations on elements of an array may have arbitrary access patterns. Consider the following computation performed for each (i,j) on an $n \times n$ grid.

$$A^{k+1}_{i,j} = \frac{A^{k}_{i+x,j} + A^{k}_{i-x,j} + A^{k}_{i,j+y} + A^{k}_{i,j-y}}{4}$$

Assume $n \gg P$, $n \gg x$ and $n \gg y$. Derive the aspect ratio that minimizes communication for the communication pattern inherent in the computation shown above.

F: (Optional) Does your answer to the previous question change if some additional terms are included in the iteration step as shown below,

$$A^{k+1}_{i,j} = \frac{A^{k}_{i+x,j} + A^{k}_{i+x',j} + A^{k}_{i-x,j} + A^{k}_{i-x',j} + A^{k}_{i,j+y} + A^{k}_{i,j-y}}{6}$$

where $x > x'$? Discuss briefly.

Problem 2: Communication Overhead

In this problem, we'll look at the costs of communication and computation more closely and (somewhat) more realistically.

As discussed in class, the performance of an interconnect can be summarized in three parameters: the time-of-flight latency across the wire, the maximum bandwidth of the wire and the interface overhead (hardware and software) at the endpoints. Here are parameters that resemble some real networks:

Network       Wire Latency (s)   Bandwidth (bytes/s)   Overhead (s)
T3E           0.1 × 10^-6        150 × 10^6            0.5 × 10^-6
Myrinet       1 × 10^-6          150 × 10^6            2 × 10^-6
ATM (uNet)    22 × 10^-6         80 × 10^6             5 × 10^-6
ATM (TCP)     22 × 10^-6         80 × 10^6             50 × 10^-6

A: Endpoint overhead limits the effective bandwidth of an interconnect for "short" messages. How long (in bytes) must a message be to achieve half of the maximum bandwidth (the "3dB point") in these networks? The overhead is for one endpoint (assume sending and receiving cost the same). Give two answers for each network: first, the bandwidth assuming two processors communicate completely synchronously (no messages may overlap) and, second, the bandwidth assuming messages may be overlapped/pipelined.
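A generic way to reason about the half-bandwidth point is to lump every per-message cost that does not scale with message length into a single fixed cost. The sketch below assumes the common model in which a message of n bytes takes `fixed_cost + n / peak_bw` seconds. Which latency and overhead terms belong in `fixed_cost` for the synchronous and pipelined cases is what the question asks, so it is left as a parameter; the function names are hypothetical.

```python
def effective_bandwidth(n_bytes, fixed_cost, peak_bw):
    """Delivered bytes per second when each message pays a fixed cost
    (seconds) in addition to its transfer time at the peak bandwidth."""
    return n_bytes / (fixed_cost + n_bytes / peak_bw)


def half_bandwidth_length(fixed_cost, peak_bw):
    """Message length at which transfer time equals the fixed cost,
    so the effective bandwidth is half of the peak."""
    return fixed_cost * peak_bw
```

As a purely illustrative case, a fixed cost of 2 microseconds on a 150 × 10^6 bytes/s link gives a half-bandwidth length of about 300 bytes; shorter messages are dominated by the fixed cost and longer ones by the wire.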

Simple model of applications: compute time and communication volume. You can give these parameters as a function of the number of processors and the size of the data set to get a gross feel for how an application will perform on an architecture.

At this point, we can nail down some constants, include overhead and get a much better estimate of performance:

• Use the Myrinet parameters above to answer the rest of the questions in this problem. Note that the Myrinet network is full-duplex: you can simultaneously send and receive data at 150 × 10^6 bytes/s.
• Assume the processor is capable of 1 floating-point operation per cycle at a clock rate of 300 MHz. Ignore all other instructions (i.e., assume they are covered by instruction-level parallelism). Jacobi requires 3 FADDs plus one FMUL per cell, so, for instance, a 64 × 64 grid has a sequential execution time of

4 × 64 × 64 = 16384 cycles per iteration
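The cycle count converts to wall-clock time once the clock rate is fixed. The short sketch below uses the stated figures (4 floating-point operations per cell, 1 operation per cycle, 300 MHz); the constant and function names are made up for this example.

```python
CLOCK_HZ = 300e6      # stated clock rate, one floating-point operation per cycle
FLOPS_PER_CELL = 4    # 3 FADDs plus 1 FMUL per grid cell


def sequential_cycles(n):
    """Cycles for one Jacobi iteration over an n x n grid on one processor."""
    return FLOPS_PER_CELL * n * n


def sequential_seconds(n, clock_hz=CLOCK_HZ):
    """Wall-clock time for one sequential iteration."""
    return sequential_cycles(n) / clock_hz
```

For the 64 × 64 grid, `sequential_cycles(64)` gives 16384 cycles, which at 300 MHz is roughly 55 microseconds per iteration. That is only a few tens of times the per-message overhead and latency figures listed for Myrinet, which is why communication cost matters at this grid size.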

Aside: the machine parameters above approximately describe one of the CoC Intel clusters, "beetle".