





This document from Rensselaer Polytechnic Institute's Computer Science 2500 course explores parallel programming, specifically how to make use of multiple processing units in a computer system. It covers the concept of concurrent programming, the need for communication and synchronization between processes, and the process of achieving parallelism. It also provides examples of shared memory parallelism using threads and introduces the concept of critical sections to prevent race conditions.






Rensselaer Polytechnic Institute Spring 2009
Given a multicore/SMT processor, or a computer with multiple processors on separate chips (a symmetric multiprocessor, or SMP), how can we make use of the multiple processing units?
This level of parallelism is at a much higher level than the instruction-level parallelism we looked at before. There, the compiler and/or architecture takes a single program made up of a sequential series of instructions, and executes those instructions in parallel in a way that produces the same result as a one-by-one sequential execution of the instructions.
For a computer with multiple processors, we need to provide multiple streams of instructions to be executed by the processors. A single stream of instructions will only make use of one of our processors at a time.
The easiest way to program these systems is to program them just like a regular single-processor system, but to run multiple programs at once. The operating system assigns each running program to a CPU.
However, we would like to consider an approach where a single program can make use of these multiple CPUs.
If we are going to do this, we first need to think about how we would break down the problem to be solved into components that can be executed in parallel, then write a program to achieve it.
Consider some examples:
- The city could be split up beforehand. Each could get a randomly selected collection of addresses to visit. Maybe one person takes all houses with even street numbers and the other all houses with odd street numbers. Or perhaps one person would take everything north of Hoosick St. and the other everything south of Hoosick St. The choice of how to divide up the city may have a big effect on the total cost. There could be excessive travel if one person walks right past a house that has not yet been visited. Also, one person could finish completely while the other still has a lot of work to do. This is a domain decomposition approach.
In each of these cases, we have taken what we might normally think of as a sequential process, and taken advantage of the availability of concurrent processing to make use of multiple workers (processing units).
Sequential Program: a sequence of actions that produces a result (statements + variables), called a process, task, or thread (of control). The state of the program is determined by the code, data, and a single program counter.
This can be formalized into a set of rules called Bernstein’s conditions to determine if a pair of tasks can be executed in parallel:
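Two tasks $P_1$ and $P_2$ can execute in parallel if all three of these conditions hold:
- $I_1 \cap O_2 = \emptyset$ ($P_2$ does not modify anything $P_1$ reads)
- $I_2 \cap O_1 = \emptyset$ ($P_1$ does not modify anything $P_2$ reads)
- $O_1 \cap O_2 = \emptyset$ (the two tasks do not modify the same variable)
where $I_i$ and $O_i$ are the input and output sets, respectively, for task $i$ (Bernstein, 1966). The input set is the set of variables read by a task and the output set is the set of variables modified by a task.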
Back to our example, let’s see what can be done concurrently.
    /* initialize matrices, just fill with junk */
    for (i=0; i<SIZE; i++) {
      for (j=0; j<SIZE; j++) {
        a[i][j] = i+j;
        b[i][j] = i-j;
      }
    }

    /* matrix-matrix multiply */
    for (i=0; i<SIZE; i++) {  /* for each row */
      for (j=0; j<SIZE; j++) { /* for each column */
        /* initialize result to 0 */
        c[i][j] = 0;

        /* perform dot product */
        for (k=0; k<SIZE; k++) {
          c[i][j] = c[i][j] + a[i][k]*b[k][j];
        }
      }
    }

    sum=0;
    for (i=0; i<SIZE; i++) {
      for (j=0; j<SIZE; j++) {
        sum += c[i][j];
      }
    }
The initialization can all be done in any order – each i and j combination is independent of each other, and the assignment of a[i][j] and b[i][j] can be done in either order.
In the actual matrix-matrix multiply, each c[i][j] must be initialized to 0 before the sum can start to be accumulated. Also, iteration k of the inner loop can only be done after row i of a and column j of b have been initialized.
Finally, the sum contribution of each c[i][j] can be added as soon as that c[i][j] has been computed, and after sum has been initialized to 0.
That granularity seems a bit cumbersome, so we might step back and just say that we can initialize a and b in any order, but that it should be completed before we start computing values in c. Then we can initialize and compute each c[i][j] in any order, but we do not start accumulating sum until c is completely computed.
But all of these dependencies in this case can be determined by a relatively straightforward computation. Seems like a job for a compiler! (And in this case, it can be.)
Unfortunately, not everything can be parallelized by the compiler:
If we change the initialization code to:
    for (i=0; i<SIZE; i++) {
      for (j=0; j<SIZE; j++) {
        if ((i == 0) || (j == 0)) {
          a[i][j] = i+j;
          b[i][j] = i-j;
        } else {
          a[i][j] = a[i-1][j-1] + i + j;
          b[i][j] = b[i-1][j-1] + i - j;
        }
      }
    }
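it can’t be parallelized in the same way: each a[i][j] and b[i][j] now depends on the element at [i-1][j-1], which is computed in an earlier iteration, so the iterations can no longer be done in any order. No matter how many processors we throw at these loops as written, we can’t speed them up.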
Automatic parallelism is great, when it’s possible. We got it for free (at least once we bought the compiler)! It does have limitations, though:
Parallel programs can be categorized by how the cooperating processes communicate with each other:
must take a single parameter of type void * and return void *. The fourth parameter is the pointer that will be passed as the argument to the thread function.
Prototypes for pthread functions are in pthread.h and programs need to link with libpthread.a (use -lpthread at link time). When using the Sun compiler, the -mt flag should also be specified to indicate multithreaded code.
A slightly more interesting example:
See: /cs/terescoj/shared/cs2500/examples/proctree threads
This example builds a “tree” of threads to a depth given on the command line. It includes calls to pthread_self(). This function returns the thread identifier of the calling thread.
Try it out and study the code to make sure you understand how it works.
A bit of extra initialization is necessary to make sure the system will allow your threads to make use of all available processors. It may, by default, allow only one thread in your program to be executing at any given time. If your program will create up to n concurrent threads, you should make the call:
pthread_setconcurrency(n+1);
somewhere before your first thread creation. The “+1” is needed to account for the original thread plus the n you plan to create.
You may also want to specify actual attributes as the second argument to pthread_create(). To do this, declare a variable for the attributes:
pthread_attr_t attr;
and initialize it with:
pthread_attr_init(&attr);
and set parameters on the attributes with calls such as:
pthread_attr_setscope(&attr, PTHREAD_SCOPE_PROCESS);
I recommend the above setting for threads in Solaris.
Then, you can pass in &attr as the second parameter to pthread_create().
Any global variables in your program are accessible to all threads. Local variables are directly accessible only to the thread in which they were created, though the memory can be shared by passing a pointer as part of the last argument to pthread_create().
As you may have been shown in other contexts, concurrent access to shared variables can be dangerous.
Consider this example:
See: /cs/terescoj/shared/cs2500/examples/pthread danger
Run it with one thread, and we get 100000. What if we run it with 2 threads? On a multiprocessor, it is going to give the wrong answer! Why?
The answer is that we have concurrent access to the shared variable counter. Suppose that two threads are each about to execute counter++, what can go wrong?
counter++ really requires three machine instructions: (i) load a register with the value of counter’s memory location, (ii) increment the register, and (iii) store the register value back in counter’s memory location. Even on a single processor, the operating system could switch the process out in the middle of this. With multiple processors, the statements really could be happening concurrently.
Consider two threads running the statements that modify counter:
    Thread A                 Thread B
    A1  R0 = counter;        B1  R1 = counter;
    A2  R0 = R0 + 1;         B2  R1 = R1 + 1;
    A3  counter = R0;        B3  counter = R1;
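Consider one possible ordering: A1 A2 B1 A3 B2 B3, where counter=17 before starting. Uh oh. Both threads load 17, so both store 18, and one of the two increments is lost: counter ends up as 18 instead of 19.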
What we have here is a race condition that can lead to interference of the actions of one thread with another. We need to make sure that when one thread starts modifying counter, it finishes before the other can try to modify it. This requires synchronization of the threads.
When we run it on a single-processor system, the problem is unlikely to show itself: we almost certainly get the correct sum when we run it. However, there is no guarantee that this would be the case. The operating system could switch threads in the middle of the load-increment-store, resulting in a race condition and an incorrect result.
We need to make those statements that increment counter atomic. We say that the modification of counter is a critical section.