Parallel Programming in Computer Science: Making Use of Multiple Processing Units

Study notes of Computer Architecture and Organization, Rensselaer Polytechnic Institute (RPI)

Prof. James D. Teresco

This document from Rensselaer Polytechnic Institute's Computer Science 2500 course explores parallel programming, specifically how to make use of multiple processing units in a computer system. It covers the concept of concurrent programming, the need for communication and synchronization between processes, and the process of achieving parallelism. It also provides examples of shared memory parallelism using threads and introduces the concept of critical sections to prevent race conditions.

Typology: Study notes

Pre 2010

Uploaded on 08/09/2009

Computer Science 2500

Computer Organization

Rensselaer Polytechnic Institute

Spring 2009

Topic Notes: Parallel Programming Intro

Given a multicore/SMT processor, or a computer with multiple processors on separate chips (a symmetric multiprocessor (SMP)), how can we make use of the multiple processing units?

This level of parallelism is at a much higher level than the instruction-level parallelism we looked at before. There, the compiler and/or architecture takes a single program made up of a sequential series of instructions, and executes those instructions in parallel in a way that produces the same result as a one-by-one sequential execution of the instructions.

For a computer with multiple processors, we need to provide multiple streams of instructions to be executed by the processors. A single stream of instructions will only make use of one of our processors at a time.

The easiest way to program these systems is to program them just like a regular single-processor system, but to run multiple programs at once. Each program being run will be assigned to a CPU by the operating system.

However, we would like to consider an approach where a single program can make use of these multiple CPUs.

If we are going to do this, we first need to think about how we would break down the problem to be solved into components that can be executed in parallel, then write a program to achieve it.

Consider some examples:

• Taking a census of Troy.

One person doing this would visit each house, count the people, and ask whatever questions are supposed to be asked. This person would keep running counts. At the end, this person has gathered everything.

If there are two people, they can work concurrently. Each visits some houses, and they need to “report in” along the way or at the end to combine their information. But how to split up the work?

- Each person could do what the individual was originally doing, but would check to make sure each house along the way had not yet been counted.

- Each person could start at the city hall, get an address that has not yet been visited, go visit it, then go back to the city hall to report the result and get another address to visit. Someone at city hall keeps track of the cumulative totals. This is nice because neither person will be left without work to do until the whole thing is done. This is the master-slave method of breaking up the work.


- The city could be split up beforehand. Each could get a randomly selected collection of addresses to visit. Maybe one person takes all houses with even street numbers and the other all houses with odd street numbers. Or perhaps one person would take everything north of Hoosick St. and the other everything south of Hoosick St. The choice of how to divide up the city may have a big effect on the total cost. There could be excessive travel if one person walks right past a house that has not yet been visited. Also, one person could finish completely while the other still has a lot of work to do. This is a domain decomposition approach.

• Grading a stack of exams. Suppose each has several questions. Again, assume two graders to start.

- Each person could take half of the stack. Simple enough. But we still have the potential of one person finishing before the other.

- Each person could take a paper from the “ungraded” stack, grade it, then put it into the “graded” stack.

- Perhaps it makes more sense to have each person grade half of the questions instead of half of the exams, maybe because it would be unfair to have the same question graded by different people. Here, we could use variations on the approaches above. Each takes half the stack, grades his own questions, then they swap stacks.

- Or we form a pipeline, where each exam goes from one grader to the next to the finished pile. Some time is needed to start up the pipeline and drain it out, especially if we add more graders. These models could be applied to the census example, if different census takers each went to every house to ask different questions.

- Suppose we also add in a “grade totaler and recorder” person. Does that make any of the approaches better or worse?

• Adding two 1,000,000 × 1,000,000 matrices.

- Each matrix entry in the sum can be computed independently, so we can break this up any way we like. We could use the master-slave approach, though a domain decomposition would probably make more sense. Depending on how many processes we have, we might break it down by individual entries, or maybe by rows or columns.

In each of these cases, we have taken what we might normally think of as a sequential process, and taken advantage of the availability of concurrent processing to make use of multiple workers (processing units).
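As a rough illustration of the domain decomposition idea for the matrix addition example, the sketch below splits the rows of a matrix into contiguous blocks, one per worker. The function names `row_range` and `add_rows`, the row-major storage layout, and the choice to split by rows are all assumptions made for this example and do not come from the notes.

```c
#include <stddef.h>

/* Compute the half-open row range [*start, *end) owned by worker `id`
 * out of `nworkers`. Any leftover rows are spread over the first workers
 * so that no two workers differ by more than one row. */
void row_range(size_t nrows, size_t nworkers, size_t id,
               size_t *start, size_t *end) {
    size_t base = nrows / nworkers;   /* rows every worker gets */
    size_t extra = nrows % nworkers;  /* leftover rows to hand out */
    *start = id * base + (id < extra ? id : extra);
    *end = *start + base + (id < extra ? 1 : 0);
}

/* Add rows [start, end) of two matrices stored in row-major order.
 * Each worker would call this on its own range; the ranges do not
 * overlap, so the workers never write the same entry. */
void add_rows(const double *a, const double *b, double *c,
              size_t ncols, size_t start, size_t end) {
    for (size_t i = start; i < end; i++) {
        for (size_t j = 0; j < ncols; j++) {
            c[i * ncols + j] = a[i * ncols + j] + b[i * ncols + j];
        }
    }
}
```

With 10 rows and 3 workers, this scheme gives the workers rows 0 to 3, 4 to 6, and 7 to 9. A master-slave version would instead hand out one row (or one small block) at a time as workers become free.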

Some Terminology

Sequential Program: a sequence of actions that produces a result (statements + variables), called a process, task, or thread (of control). The state of the program is determined by the code, data, and a single program counter.

6 has to happen after 4 (so 4 doesn’t clobber its value) and after 5 (because it depends on its value)
7 has to happen last.

This can be formalized into a set of rules called Bernstein’s conditions to determine if a pair of tasks can be executed in parallel:

Two tasks $P_1$ and $P_2$ can execute in parallel if all three of these conditions hold:

1. $I_1 \cap O_2 = \emptyset$

2. $I_2 \cap O_1 = \emptyset$

3. $O_1 \cap O_2 = \emptyset$

where $I_i$ and $O_i$ are the input and output sets, respectively, for task $i$ (Bernstein, 1966). The input set is the set of variables read by a task and the output set is the set of variables modified by a task.
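The three conditions can be checked mechanically once the input and output sets are known. The sketch below represents each set as a bitmask with one bit per program variable; that representation, and the names `varset`, `task_sets`, `bernstein_parallel`, and the `VAR_*` constants, are assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* One bit per program variable (hypothetical variables a, b, x, y). */
typedef uint32_t varset;
enum { VAR_A = 1u << 0, VAR_B = 1u << 1, VAR_X = 1u << 2, VAR_Y = 1u << 3 };

typedef struct {
    varset in;   /* I: variables the task reads */
    varset out;  /* O: variables the task modifies */
} task_sets;

/* Returns true when all three of Bernstein's conditions hold, i.e. when
 * the two tasks may safely execute in parallel. */
bool bernstein_parallel(task_sets p1, task_sets p2) {
    return (p1.in & p2.out) == 0    /* I1 and O2 are disjoint */
        && (p2.in & p1.out) == 0    /* I2 and O1 are disjoint */
        && (p1.out & p2.out) == 0;  /* O1 and O2 are disjoint */
}
```

For example, `x = a + b` (reads a and b, writes x) and `y = a * 2` (reads a, writes y) satisfy all three conditions, while `x = a + b` and `a = x` do not, because each task reads a variable that the other writes.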

Back to our example, let’s see what can be done concurrently.

/* initialize matrices, just fill with junk */
for (i=0; i<SIZE; i++) {
  for (j=0; j<SIZE; j++) {
    a[i][j] = i+j;
    b[i][j] = i-j;
  }
}

/* matrix-matrix multiply */
for (i=0; i<SIZE; i++) {  /* for each row */
  for (j=0; j<SIZE; j++) {  /* for each column */
    /* initialize result to 0 */
    c[i][j] = 0;

    /* perform dot product */
    for(k=0; k<SIZE; k++) {
      c[i][j] = c[i][j] + a[i][k]*b[k][j];
    }
  }
}

sum=0;
for (i=0; i<SIZE; i++) {
  for (j=0; j<SIZE; j++) {
    sum += c[i][j];
  }
}

The initialization can all be done in any order – each i and j combination is independent of each other, and the assignment of a[i][j] and b[i][j] can be done in either order.

In the actual matrix-matrix multiply, each c[i][j] must be initialized to 0 before the sum can start to be accumulated. Also, iteration k of the inner loop can only be done after row i of a and column j of b have been initialized.

Finally, the sum contribution of each c[i][j] can be added as soon as that c[i][j] has been computed, and after sum has been initialized to 0.

That granularity seems a bit cumbersome, so we might step back and just say that we can initialize a and b in any order, but that it should be completed before we start computing values in c. Then we can initialize and compute each c[i][j] in any order, but we do not start accumulating sum until c is completely computed.

But all of these dependencies in this case can be determined by a relatively straightforward computation. Seems like a job for a compiler! (And in this case, it can be.)
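The coarse-grained ordering described above (initialize a and b, then compute c, then accumulate the sum) might be expressed with compiler directives roughly as follows. This is only a sketch: the use of OpenMP, the function name `multiply_and_sum`, the `double` element type, and the small `SIZE` are assumptions for this example, and the notes do not say which compiler or directive system the course used.

```c
#define SIZE 64  /* small size chosen only for this sketch */

static double a[SIZE][SIZE], b[SIZE][SIZE], c[SIZE][SIZE];

double multiply_and_sum(void) {
    double sum = 0.0;

    /* Phase 1: every (i, j) initialization is independent. */
    #pragma omp parallel for
    for (int i = 0; i < SIZE; i++) {
        for (int j = 0; j < SIZE; j++) {
            a[i][j] = i + j;
            b[i][j] = i - j;
        }
    }

    /* Phase 2: each c[i][j] can be computed in any order, but only
     * after phase 1 has finished (the loop above ends with a barrier). */
    #pragma omp parallel for
    for (int i = 0; i < SIZE; i++) {
        for (int j = 0; j < SIZE; j++) {
            double dot = 0.0;
            for (int k = 0; k < SIZE; k++) {
                dot += a[i][k] * b[k][j];
            }
            c[i][j] = dot;
        }
    }

    /* Phase 3: accumulate the sum; the reduction clause gives each
     * thread a private partial sum that is combined at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < SIZE; i++) {
        for (int j = 0; j < SIZE; j++) {
            sum += c[i][j];
        }
    }
    return sum;
}
```

With gcc, the directives take effect when the program is built with -fopenmp; without that flag they are ignored and the same code runs sequentially, producing the same result.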

Unfortunately, not everything can be parallelized by the compiler:

If we change the initialization code to:

for (i=0; i<SIZE; i++) {
  for (j=0; j<SIZE; j++) {
    if ((i == 0) || (j == 0)) {
      a[i][j] = i+j;
      b[i][j] = i-j;
    } else {
      a[i][j] = a[i-1][j-1] + i + j;
      b[i][j] = b[i-1][j-1] + i - j;
    }
  }
}

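it can’t be parallelized automatically in the same way: each a[i][j] and b[i][j] now depends on the entry diagonally above and to its left, so the iterations can no longer run in any order, and no matter how many processors we throw at the loop as written, we can’t speed it up.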

Approaches to Parallelism

Automatic parallelism is great, when it’s possible. We got it for free (at least once we bought the compiler)! It does have limitations, though:

• some potential parallelization opportunities cannot be detected automatically, though we can add directives to help

• bigger complication: this executable cannot run on distributed-memory systems

Parallel programs can be categorized by how the cooperating processes communicate with each other:

must take a single parameter of type void * and return void *. The fourth parameter is the pointer that will be passed as the argument to the thread function.

pthread_exit(3THR): This causes the calling thread to exit. It is called implicitly if the thread function called during thread creation returns. Its argument is a return status value, which can be retrieved by pthread_join().

pthread_join(3THR): This causes the calling thread to block (wait) until the thread with the identifier passed as the first argument to pthread_join() has exited. The second argument is a pointer to a location where the return status passed to pthread_exit() can be stored. In the pthreadhello program, we pass in NULL, and hence ignore the value.

Prototypes for pthread functions are in pthread.h and programs need to link with libpthread.a (use -lpthread at link time). When using the Sun compiler, the -mt flag should also be specified to indicate multithreaded code.
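To make the roles of the thread function, pthread_exit(), and pthread_join() concrete, here is a minimal sketch. It is not the pthreadhello program from the course, which is not reproduced in these notes; the names `hello`, `run_hello`, and `MAX_THREADS`, and the choice of returning ten times the thread's id as its status, are assumptions for illustration. Unlike pthreadhello, this sketch does collect the return status instead of passing NULL.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_THREADS 16  /* arbitrary limit for this sketch */

/* Thread function: takes a single void * and returns a void *. */
static void *hello(void *arg) {
    int id = *(int *)arg;
    printf("Hello from thread %d\n", id);
    /* Exit with a status value; returning the value would be equivalent. */
    pthread_exit((void *)(intptr_t)(id * 10));
    return NULL;  /* not reached */
}

/* Create nthreads threads, wait for each, and return the sum of their
 * status values (or -1 if a thread could not be created). */
int run_hello(int nthreads) {
    pthread_t tids[MAX_THREADS];
    int ids[MAX_THREADS];
    int created = 0, total = 0, failed = 0;

    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;

    for (int i = 0; i < nthreads; i++) {
        ids[i] = i;  /* each thread gets a pointer to its own id */
        if (pthread_create(&tids[i], NULL, hello, &ids[i]) != 0) {
            failed = 1;
            break;
        }
        created++;
    }

    for (int i = 0; i < created; i++) {
        void *status;
        pthread_join(tids[i], &status);  /* blocks until thread i exits */
        total += (int)(intptr_t)status;
    }
    return failed ? -1 : total;
}
```

Calling `run_hello(4)` starts four threads with ids 0 through 3, so the collected status values add up to 60. The order of the printed greetings is not guaranteed.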

A slightly more interesting example:

See: /cs/terescoj/shared/cs2500/examples/proctree_threads

This example builds a “tree” of threads to a depth given on the command line. It includes calls to pthread_self(). This function returns the thread identifier of the calling thread.

Try it out and study the code to make sure you understand how it works.

A bit of extra initialization is necessary to make sure the system will allow your threads to make use of all available processors. It may, by default, allow only one thread in your program to be executing at any given time. If your program will create up to n concurrent threads, you should make the call:

pthread_setconcurrency(n+1);

somewhere before your first thread creation. The “+1” is needed to account for the original thread plus the n you plan to create.

You may also want to specify actual attributes as the second argument to pthread_create(). To do this, declare a variable for the attributes:

pthread_attr_t attr;

and initialize it with:

pthread_attr_init(&attr);

and set parameters on the attributes with calls such as:

pthread_attr_setscope(&attr, PTHREAD_SCOPE_PROCESS);

I recommend the above setting for threads in Solaris.

Then, you can pass in &attr as the second parameter to pthread_create().

Any global variables in your program are accessible to all threads. Local variables are directly accessible only to the thread in which they were created, though the memory can be shared by passing a pointer as part of the last argument to pthread_create().
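The attribute calls and the pointer-passing idea can be combined in one small sketch. The names `work_item`, `square`, and `square_in_thread` are hypothetical, and the example assumes that a system which does not support PTHREAD_SCOPE_PROCESS simply rejects the request and leaves the attributes at their defaults, so the return value of pthread_attr_setscope() is deliberately ignored.

```c
#include <pthread.h>

/* Data shared between the creating thread and the new thread. */
typedef struct {
    int input;
    int result;
} work_item;

/* Thread function: reads and writes the caller's local variable
 * through the pointer it was given. */
static void *square(void *arg) {
    work_item *w = arg;
    w->result = w->input * w->input;
    return NULL;
}

/* Run square() in a separate thread; returns the result, or -1 if the
 * thread could not be created. */
int square_in_thread(int value) {
    work_item w = { value, 0 };  /* local to the calling thread */
    pthread_attr_t attr;
    pthread_t tid;

    pthread_attr_init(&attr);
    /* The scope the notes recommend for Solaris. Some systems do not
     * support it and return an error, leaving attr unchanged. */
    (void)pthread_attr_setscope(&attr, PTHREAD_SCOPE_PROCESS);

    if (pthread_create(&tid, &attr, square, &w) != 0) {
        pthread_attr_destroy(&attr);
        return -1;
    }
    pthread_attr_destroy(&attr);

    /* w must stay alive until the thread is done with it, so join
     * before this function returns. */
    pthread_join(tid, NULL);
    return w.result;
}
```

The join before returning matters here: the new thread holds a pointer into the caller's stack frame, and that memory is only valid while `square_in_thread` is still running.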

Brief Intro to Critical Sections

As you may have been shown in other contexts, concurrent access to shared variables can be dangerous.

Consider this example:

See: /cs/terescoj/shared/cs2500/examples/pthread_danger

Run it with one thread, and we get 100000. What if we run it with 2 threads? On a multiprocessor, it is going to give the wrong answer! Why?

The answer is that we have concurrent access to the shared variable counter. Suppose that two threads are each about to execute counter++, what can go wrong?

counter++ really requires three machine instructions: (i) load a register with the value of counter’s memory location, (ii) increment the register, and (iii) store the register value back in counter’s memory location. Even on a single processor, the operating system could switch the process out in the middle of this. With multiple processors, the statements really could be happening concurrently.

Consider two threads running the statements that modify counter:

Thread A                 Thread B
A1: R0 = counter;        B1: R1 = counter;
A2: R0 = R0 + 1;         B2: R1 = R1 + 1;
A3: counter = R0;        B3: counter = R1;

Consider one possible ordering: A1 A2 B1 A3 B2 B3, where counter=17 before starting. Both threads load 17 and both store 18, so one of the two increments is lost. Uh oh.
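The effect of a particular ordering can be traced deterministically by simulating the two registers and the shared variable in a single thread. The sketch below does that; the names `machine`, `step`, and `run_schedule` are invented for this illustration and are not part of the course examples.

```c
/* Simulated state: the shared variable plus one register per thread. */
typedef struct {
    int counter;
    int r0;  /* thread A's register */
    int r1;  /* thread B's register */
} machine;

/* The six machine-level steps from the table above. */
typedef enum { A1, A2, A3, B1, B2, B3 } step;

static void execute(machine *m, step s) {
    switch (s) {
    case A1: m->r0 = m->counter;  break;  /* A loads counter */
    case A2: m->r0 = m->r0 + 1;   break;  /* A increments its register */
    case A3: m->counter = m->r0;  break;  /* A stores back */
    case B1: m->r1 = m->counter;  break;  /* B loads counter */
    case B2: m->r1 = m->r1 + 1;   break;  /* B increments its register */
    case B3: m->counter = m->r1;  break;  /* B stores back */
    }
}

/* Execute the n steps in `order` starting from counter == initial and
 * return the final value of counter. */
int run_schedule(int initial, const step *order, int n) {
    machine m = { initial, 0, 0 };
    for (int i = 0; i < n; i++) {
        execute(&m, order[i]);
    }
    return m.counter;
}
```

Starting from 17, the ordering A1 A2 B1 A3 B2 B3 leaves counter at 18, because thread B loads the old value before thread A stores its result. The ordering A1 A2 A3 B1 B2 B3, in which A finishes before B starts, gives the intended 19.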

What we have here is a race condition that can lead to interference of the actions of one thread with another. We need to make sure that when one process starts modifying counter, it finishes before the other can try to modify it. This requires synchronization of the processes.

When we run it on a single-processor system, the problem is unlikely to show itself: we almost certainly get the correct sum when we run it. However, there is no guarantee that this would be the case. The operating system could switch threads in the middle of the load-increment-store, resulting in a race condition and an incorrect result.

We need to make those statements that increment counter atomic. We say that the modification of counter is a critical section.
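This excerpt of the notes does not show how the critical section is protected. One common approach, assumed here, is to guard the increment with a POSIX mutex so that only one thread at a time can be inside the critical section. The names `counter_lock`, `increment`, `count_with_threads`, `INCREMENTS`, and `MAX_THREADS` are hypothetical, and this is not the course's pthread_danger program.

```c
#include <pthread.h>

#define INCREMENTS 100000  /* increments performed by each thread */
#define MAX_THREADS 16     /* arbitrary limit for this sketch */

static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg) {
    (void)arg;  /* unused */
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&counter_lock);    /* enter critical section */
        counter++;                            /* load, increment, store */
        pthread_mutex_unlock(&counter_lock);  /* leave critical section */
    }
    return NULL;
}

/* Run nthreads incrementing threads and return the final counter. */
long count_with_threads(int nthreads) {
    pthread_t tids[MAX_THREADS];
    int created = 0;

    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;
    counter = 0;

    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[i], NULL, increment, NULL) != 0) break;
        created++;
    }
    for (int i = 0; i < created; i++) {
        pthread_join(tids[i], NULL);
    }
    return counter;
}
```

With the lock in place, two threads produce exactly 200000. Removing the lock and unlock calls brings back the race described above, and the total may then come out lower on a multiprocessor.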
