
A Sophomoric∗ Introduction to Shared-Memory Parallelism and Concurrency

Dan Grossman

Version of December 8, 2012

∗As in intended for second-year students, not as in immaturely pretentious

Contents

1 Meta-Introduction: An Instructor’s View of These Notes
- 1.1 Where This Material Fits in a Changing Curriculum
- 1.2 Six Theses On A Successful Approach to this Material
- 1.3 How to Use These Notes — And Improve Them
- 1.4 Acknowledgments
2 Introduction
- 2.1 More Than One Thing At Once
- 2.2 Parallelism vs. Concurrency
- 2.3 Basic Threads and Shared Memory
- 2.4 Other Models
3 Basic Fork-Join Parallelism
- 3.1 A Simple Example: Okay Idea, Inferior Style
- 3.2 Why Not To Use One Thread Per Processor
- 3.3 Divide-And-Conquer Parallelism
- 3.4 The Java ForkJoin Framework
- 3.5 Reductions and Maps
- 3.6 Data Structures Besides Arrays
4 Analyzing Fork-Join Algorithms
- 4.1 Work and Span
- 4.2 Amdahl’s Law
- 4.3 Comparing Amdahl’s Law and Moore’s Law
5 Fancier Fork-Join Algorithms: Prefix, Pack, Sort
- 5.1 Parallel-Prefix Sum
- 5.2 Pack
- 5.3 Parallel Quicksort
- 5.4 Parallel Mergesort


6 Basic Shared-Memory Concurrency
- 6.1 The Programming Model
- 6.2 The Need for Synchronization
- 6.3 Locks
- 6.4 Locks in Java
7 Race Conditions: Bad Interleavings and Data Races
- 7.1 Bad Interleavings: An Example with Stacks
- 7.2 Data Races: Wrong Even When They Look Right
8 Concurrency Programming Guidelines
- 8.1 Conceptually Splitting Memory in Three Parts
- 8.2 Approaches to Synchronization
9 Deadlock
10 Additional Synchronization Primitives
- 10.1 Reader/Writer Locks
- 10.2 Condition Variables
- 10.3 Other

1 Meta-Introduction: An Instructor’s View of These Notes

1.1 Where This Material Fits in a Changing Curriculum

These notes teach parallelism and concurrency as part of an advanced sophomore-level data-structures course: the course that covers asymptotic complexity, balanced trees, hash tables, graph algorithms, sorting, etc.

Why parallelism and concurrency should be taught early:

Parallelism and concurrency are increasingly important topics in computer science and engineering. Traditionally, most undergraduates learned rather little about these topics and did so rather late in the curriculum: Senior-level operating-systems courses cover threads, scheduling, and synchronization. Early hardware courses have circuits and functional units with parallel parts. Advanced architecture courses discuss cache coherence. Electives might cover parallel algorithms or use distributed computing to solve embarrassingly parallel tasks. Little of this scattered material emphasizes the essential concepts of parallelism and concurrency — and certainly not in a central place such that subsequent courses can rely on it. These days, most desktop and laptop computers have multiple cores. Modern high-level languages have threads built into them and standard libraries use threads (e.g., Java’s Swing library for GUIs). It no longer seems reasonable to bring up threads “as needed” or delay until an operating-systems course that should be focusing on operating systems. There is no reason to introduce threads first in C or assembly when all the usual conveniences of high-level languages for introducing core concepts apply.

Why parallelism and concurrency should not be taught too early:

Conversely, it is tempting to introduce threads “from day one” in introductory programming courses before students learn “sequential habits.” I suspect this approach is infeasible in most curricula. For example, it may make little sense in programs where most students in introductory courses do not end up majoring in computer science. “Messing with intro” is a high-risk endeavor, and introductory courses are already packed with essential concepts like variables, functions, arrays, linked structures, etc. There is probably no room.

So put it in the data-structures course:

There may be multiple natural places to introduce parallelism and concurrency, but I claim “sophomore-level” data structures (after CS2 and discrete math, but before “senior-level” algorithms) works very well. Here are some reasons:

1.3 How to Use These Notes — And Improve Them

These notes were originally written for CSE332 at the University of Washington (http://www.cs.washington.edu/education/courses/cse332). They account for 3 weeks of a required 10-week course (the University uses a quarter system). Alongside these notes are PowerPoint slides, homework assignments, and a programming project. In fact, these notes were the last aspect to be written; the first edition of the course went great without them and students reported parallelism to be their favorite aspect of the course.

Surely these notes have errors and explanations could be improved. Please let me know of any problems you find. I am a perfectionist: if you find a typo I would like to know. Ideally you would first check that the most recent version of the notes does not already have the problem fixed.

As for permission, do with these notes whatever you like. Seriously: “steal these notes.” The LaTeX sources are available so you can modify them however you like. I would like to know if you are using these notes and how. It motivates me to improve them and, frankly, it’s not bad for my ego. Constructive criticism is also welcome. That said, I don’t expect any thoughtful instructor to agree completely with me on what to cover and how to cover it.

Contact me at the email djg and then the at-sign and then cs.washington.edu. The current home for these notes and related materials is http://www.cs.washington.edu/homes/djg/teachingMaterials. Students are more than welcome to contact me: who better to let me know where these notes could be improved.

1.4 Acknowledgments

I deserve no credit for the material in these notes. If anything, my role was simply to distill decades of wisdom from others down to three weeks of core concepts and integrate the result into a data-structures course. When in doubt, I stuck with the basic and simplest topics and examples.

I was particularly influenced by Guy Blelloch and Charles Leiserson in terms of teaching parallelism before concurrency and emphasizing divide-and-conquer algorithms that do not consider the number of processors. Doug Lea and other developers of Java’s ForkJoin framework provided a wonderful library that, with some hand-holding, is usable by sophomores. Larry Snyder was also an excellent resource for parallel algorithms.

The treatment of shared-memory synchronization is heavily influenced by decades of operating-systems courses, but with the distinction of ignoring all issues of scheduling and synchronization implementation. Moreover, the emphasis on the need to avoid data races in high-level languages is frustratingly under-appreciated despite the noble work of memory-model experts such as Sarita Adve, Hans Boehm, and Bill Pugh.

Feedback from Ruth Anderson, Kim Bruce, Kristian Lieberg, Tyler Robison, Cody Schroeder, and Martin Tompa helped improve explanations and remove typos. Tyler and Martin deserve particular mention for using these notes when they were very new. James Fogarty made many useful improvements to the presentation slides that accompany these reading notes. Steve Wolfman created the C++ version of these notes.

I have had enlightening and enjoyable discussions on “how to teach this stuff” with too many researchers and educators over the last few years to list them all, but I am grateful.

This work was funded in part via grants from the National Science Foundation and generous support, financial and otherwise, from Intel.

2 Introduction

2.1 More Than One Thing At Once

In sequential programming, one thing happens at a time. Sequential programming is what most people learn first and how most programs are written. Probably every program you have written in Java is sequential: Execution starts at the beginning of main and proceeds one assignment / call / return / arithmetic operation at a time. Removing the one-thing-at-a-time assumption complicates writing software. The multiple threads of execution (things performing computations) will somehow need to coordinate so that they can work together to complete a task — or at least not get in each other’s way while they are doing separate things. These notes cover basic concepts related to multithreaded programming, i.e., programs where there are multiple threads of execution. We will cover:

- How to create multiple threads
- How to write and analyze divide-and-conquer algorithms that use threads to produce results more quickly
- How to coordinate access to shared objects so that multiple threads using the same data do not produce the wrong answer

A useful analogy is with cooking. A sequential program is like having one cook who does each step of a recipe in order, finishing one step before starting the next. Often there are multiple steps that could be done at the same time — if you had more cooks. But having more cooks requires extra coordination. One cook may have to wait for another cook to finish something. And there are limited resources: If you have only one oven, two cooks won’t be able to bake casseroles at different temperatures at the same time. In short, multiple cooks present efficiency opportunities, but also significantly complicate the process of producing a meal. Because multithreaded programming is so much more difficult, it is best to avoid it if you can. For most of computing’s history, most programmers wrote only sequential programs. Notable exceptions were:

- Programmers writing programs to solve such computationally large problems that it would take years or centuries for one computer to finish. So they would use multiple computers together.
- Programmers writing systems like an operating system where a key point of the system is to handle multiple things happening at once. For example, you can have more than one program running at a time. If you have only one processor, only one program can actually run at a time, but the operating system still uses threads to keep track of all the running programs and let them take turns. If the taking turns happens fast enough (e.g., 10 milliseconds), humans fall for the illusion of simultaneous execution. This is called time-slicing.

Sequential programmers were lucky: since every 2 years or so computers got roughly twice as fast, most programs would get exponentially faster over time without any extra effort.

Around 2005, computers stopped getting twice as fast every 2 years. To understand why requires a course in computer architecture. In brief, increasing the clock rate (very roughly and technically inaccurately speaking, how quickly instructions execute) became infeasible without generating too much heat. Also, the relative cost of memory accesses can become too high for faster processors to help.

Nonetheless, chip manufacturers still plan to make exponentially more powerful chips. Instead of one processor running faster, they will have more processors. The next computer you buy will likely have 4 processors (also called cores) on the same chip and the number of available cores will likely double every few years.

What would 256 cores be good for? Well, you can run multiple programs at once, for real, not just with time-slicing. But for an individual program to run any faster than with one core, it will need to do more than one thing at once. This is the reason that multithreaded programming is becoming more important.

To be clear, multithreaded programming is not new. It has existed for decades and all the key concepts these notes cover are just as old. Before there were multiple cores on one chip, you could use multiple chips and/or use time-slicing on one chip, and both remain important techniques today. The move to multiple cores on one chip is “just” having the effect of making multithreading something that more and more software wants to do.

2.2 Parallelism vs. Concurrency

These notes are organized around a fundamental distinction between parallelism and concurrency. Unfortunately, the way we define these terms is not entirely standard, so you should not assume that everyone uses these terms as we will. Nonetheless, most computer scientists agree that this distinction is important.

Parallel programming is about using additional computational resources to produce an answer faster.

As a canonical example, consider the trivial problem of summing up all the numbers in an array. We know no sequential algorithm can do better than Θ(n) time. Suppose instead we had 4 processors. Then hopefully we could produce the result roughly 4 times faster by having each processor add 1/4 of the elements and then we could just add these 4 partial results together with 3 more additions. Θ(n/4) is still Θ(n), but constant factors can matter. Moreover, when designing and analyzing a parallel algorithm, we should leave the number of processors as a variable, call it P.

2.3 Basic Threads and Shared Memory

Before writing any parallel or concurrent programs, we need some way to make multiple things happen at once and some way for those different things to communicate. Put another way, your computer may have multiple cores, but all the Java constructs you know are for sequential programs, which do only one thing at once. Before showing any Java specifics, we need to explain the programming model. The model we will assume is explicit threads with shared memory. A thread is itself like a running sequential program, but one thread can create other threads that are part of the same program and those threads can create more threads, etc. Two or more threads can communicate by writing and reading fields of the same objects. In other words, they share memory. This is only one model of parallel/concurrent programming, but it is the only one we will use. The next section briefly mentions other models that a full course on parallel/concurrent programming would likely cover. Conceptually, all the threads that have been started but not yet terminated are “running at once” in a program. In practice, they may not all be running at any particular moment:

- There may be more threads than processors. It is up to the Java implementation, with help from the underlying operating system, to find a way to let the threads “take turns” using the available processors. This is called scheduling and is a major topic in operating systems. All we need to know is that it is not under the Java programmer’s control: you create the threads and the system schedules them.
- A thread may be waiting for something to happen before it continues. For example, the next section discusses the join primitive where one thread does not continue until another thread has terminated.

Let’s be more concrete about what a thread is and how threads communicate. It’s helpful to start by enumerating the key pieces that a sequential program has while it is running:

- One call stack, where each stack frame holds the local variables for a method call that has started but not yet finished. Calling a method pushes a new frame and returning from a method pops a frame. Call stacks are why recursion is not “magic.”
- One program counter. This is just a low-level name for keeping track of what statement is currently executing. In a sequential program, there is exactly one such statement.
- Static fields of classes.
- Objects. An object is created by calling new, which returns a reference to the new object. We call the memory that holds all the objects the heap. This use of the word “heap” has nothing to do with the heap data structure used to implement priority queues. It is separate memory from the memory used for the call stack and static fields.

With this overview of the sequential program state, it is much easier to understand threads:

Each thread has its own call stack and program counter, but all the threads share one collection of static fields and objects.

- When a new thread starts running, it will have its own new call stack. It will have one frame on it, which is like that thread’s main, but it won’t actually be main.
- When a thread returns from its first method, it terminates.
- Each thread has its own program counter and local variables, so there is no “interference” from other threads for these things. The way loops, calls, assignments to variables, exceptions, etc. work for each thread is just like you learned in sequential programming and is separate for each thread.
- What is different is how static fields and objects work. In sequential programming we know x.f=42; y = x.f; always assigns 42 to the variable y. But now the object that x refers to might also have its f field written to by other threads, so we cannot be so sure.
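The split between per-thread and shared state can be illustrated with a small sketch. The class names `Box`, `Writer`, and `SharedFieldDemo` are hypothetical, not from the notes. The local variable lives on the helper thread's own call stack, while the `Box` object lives on the heap and is reachable from both threads. The sketch uses join, which the notes introduce later, only so that the read is guaranteed to happen after the write.

```java
class SharedFieldDemo {
    // A heap object: any thread holding a reference can read or write its field.
    static class Box {
        int f;
    }

    // Helper thread (hypothetical name).
    static class Writer extends Thread {
        private final Box box;

        Writer(Box box) { this.box = box; }

        @Override
        public void run() {
            int local = 41;     // local variable: private to this thread's call stack
            box.f = local + 1;  // field write: visible to other threads via the shared object
        }
    }

    // Starts one helper, waits for it to terminate, and returns what it wrote.
    static int runOnce() {
        Box shared = new Box();
        Writer w = new Writer(shared);
        w.start();
        try {
            w.join();           // without this wait, the read below could see 0 or 42
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return shared.f;
    }

    public static void main(String[] args) {
        System.out.println("helper wrote " + runOnce());
    }
}
```

Removing the call to join would make the result depend on scheduling, which is exactly the uncertainty described above.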

In practice, even though all objects could be shared among threads, most are not. In fact, just as having static fields is often poor style, having lots of objects shared among threads is often poor style. But we need some shared objects because that is how threads communicate. If we are going to create parallel algorithms where helper threads run in parallel to compute partial answers, they need some way to communicate those partial answers back to the “main” thread. The way we will do it is to have the helper threads write to some object fields that the main thread later reads. We finish this section with some Java specifics for exactly how to create a new thread in Java. The details vary in different languages and in fact the parallelism portion of these notes mostly uses a different Java library with slightly different specifics. In addition to creating threads, we will need other language constructs for coordinating them. For example, for one thread to read the result another thread wrote as its answer, the reader often needs to know the writer is done. We will present such primitives as we need them. To create a new thread in Java requires that you define a new class (step 1) and then perform two actions at run-time (steps 2–3):

1. Define a subclass of java.lang.Thread and override the public method run, which takes no arguments and has return type void. The run method will act like “main” for threads created using this class. It must take no arguments, but the example below shows how to work around this inconvenience.
2. Create an instance of the class you defined in step 1. That is, if you defined class C, then use new to create a C object. Note this does not yet create a running thread. It just creates an object of class C, which is a subclass of Thread.
3. Call the start method of the object you created in step 2. This step does the “magic” creation of a new thread. That new thread will execute the run method of the object. Notice that you do not call run; that would just be an ordinary method call. You call start, which makes a new thread that runs run. The call to start “returns immediately” so the caller continues on, in parallel with the newly-created thread running run. The new thread terminates when its run method completes.

Here is a complete example of a useless Java program that starts with one thread and then creates 20 more threads:

class C extends java.lang.Thread {
  int i;
  C(int i) { this.i = i; }
  public void run() {
    System.out.println("Thread " + i + " says hi");
    System.out.println("Thread " + i + " says bye");
  }
}
class M {
  public static void main(String[] args) {
    for(int i=1; i <= 20; ++i) {
      C c = new C(i);
      c.start();
    }
  }
}

When this program runs, it will print 40 lines of output, one of which is:

Thread 13 says hi

Interestingly, we cannot predict the order for these 40 lines of output. In fact, if you run the program multiple times, you will probably see the output appear in different orders on different runs. After all, each of the 21 separate threads running “at the same time” (conceptually, since your machine may not have 21 processors available for the program) can run in an unpredictable order. The main thread is the first thread and then it creates 20 others. The main thread always creates the other threads in the same order, but it is up to the Java implementation to let all the threads “take

Data parallelism does not have explicit threads or nodes running different parts of the program at different times. Instead, it has primitives for parallelism that involve applying the same operation to different pieces of data at the same time. For example, you would have a primitive for applying some function to every element of an array. The implementation of this primitive would use parallelism rather than a sequential for-loop. Hence all the parallelism is done for you provided you can express your program using the available primitives. Examples include vector instructions on some processors and map-reduce style distributed systems.
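As one possible illustration of this style in Java, the sketch below uses the parallel streams in `java.util.stream`. This library postdates these notes and is not the mechanism they use. The class and method names (`DataParallelDemo`, `squareAll`, `sumAll`) are hypothetical. The program only says which operation to apply to every element; how the work is divided among threads is left to the library.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

class DataParallelDemo {
    // Map: apply the same function to every element.
    // toArray() preserves element order even though the stream is parallel.
    static int[] squareAll(int[] arr) {
        return Arrays.stream(arr).parallel().map(x -> x * x).toArray();
    }

    // Reduce: combine all elements with an associative operation (addition).
    static int sumAll(int[] arr) {
        return Arrays.stream(arr).parallel().sum();
    }

    public static void main(String[] args) {
        int[] data = IntStream.rangeClosed(1, 10).toArray();
        System.out.println(Arrays.toString(squareAll(data)));
        System.out.println(sumAll(data));
    }
}
```

There is no thread creation and no join anywhere in this code, which is the point of the model: the parallelism is hidden inside the primitives.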

3 Basic Fork-Join Parallelism

This section shows how to use threads and shared memory to implement simple parallel algorithms. The only synchronization primitive we will need is join, which causes one thread to wait until another thread has terminated. We begin with simple pseudocode and then show how using threads in Java to achieve the same idea requires a bit more work (Section 3.1). We then argue that it is best for parallel code to not be written in terms of the number of processors available (Section 3.2) and show how to use recursive divide-and-conquer instead (Section 3.3). Because Java’s threads are not engineered for this style of programming, we switch to the Java ForkJoin framework which is designed for our needs (Section 3.4). With all of this discussion in terms of the single problem of summing an array of integers, we then turn to other similar problems, introducing the terminology of maps and reduces (Section 3.5) as well as data structures other than arrays (Section 3.6).

3.1 A Simple Example: Okay Idea, Inferior Style

Most of this section will consider the problem of computing the sum of an array of integers. An O(n) sequential solution to this problem is trivial:

int sum(int[] arr) {
  int ans = 0;
  for(int i=0; i < arr.length; i++)
    ans += arr[i];
  return ans;
}

If the array is large and we have extra processors available, we can get a more efficient parallel algorithm. Suppose we have 4 processors. Then we could do the following:

1. Use the first processor to sum the first 1/4 of the array and store the result somewhere.
2. Use the second processor to sum the second 1/4 of the array and store the result somewhere.
3. Use the third processor to sum the third 1/4 of the array and store the result somewhere.
4. Use the fourth processor to sum the fourth 1/4 of the array and store the result somewhere.
5. Add the 4 stored results and return that as the answer.

This algorithm is clearly correct provided that the last step is started only after the previous four steps have completed. The first four steps can occur in parallel. More generally, if we have P processors, we can divide the array into P equal segments and have an algorithm that runs in time O(n/P + P), where n/P is for the parallel part and P is for combining the stored results. Later we will see we can do better if P is very large, though that may be less of a practical concern.

In pseudocode, a convenient way to write this kind of algorithm is with a FORALL loop. A FORALL loop is like a for loop except it does all the iterations in parallel. Like a regular for loop, the code after a FORALL loop does not execute until the loop (i.e., all its iterations) is done. Unlike the for loop, the programmer is “promising” that all the iterations can be done at the same time without them interfering with each other. Therefore, if one loop iteration writes to a location, then another iteration must not read or write to that location. However, it is fine for two iterations to read the same location: that does not cause any interference.

Here, then, is a pseudocode solution to using 4 processors to sum an array. Note it is essential that we store the 4 partial results in separate locations to avoid any interference between loop iterations.^1

int sum(int[] arr) {
  results = new int[4];
  len = arr.length;
  FORALL(i=0; i < 4; ++i) {
    results[i] = sumRange(arr,(i*len)/4,((i+1)*len)/4);
  }
  return results[0] + results[1] + results[2] + results[3];
}
int sumRange(int[] arr, int lo, int hi) {
  result = 0;
  for(j=lo; j < hi; ++j)
    result += arr[j];
  return result;
}
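The range arguments passed to sumRange are easy to get wrong when the array length is not divisible by 4. The sketch below compares multiplying before dividing, as in (i*len)/4, with dividing first, as in i*(len/4). The class and method names (`RangeBoundsDemo`, `goodBound`, `badBound`) are hypothetical and the array length 10 is just an illustrative choice.

```java
class RangeBoundsDemo {
    // Boundary i when splitting len elements into `pieces` segments,
    // multiplying before dividing as in the pseudocode.
    static int goodBound(int i, int len, int pieces) {
        return (i * len) / pieces;
    }

    // Dividing first truncates len / pieces, so the last boundary can fall short of len.
    static int badBound(int i, int len, int pieces) {
        return i * (len / pieces);
    }

    public static void main(String[] args) {
        int len = 10, pieces = 4;
        for (int i = 0; i < pieces; i++) {
            System.out.println("piece " + i
                + ": multiply-first [" + goodBound(i, len, pieces) + "," + goodBound(i + 1, len, pieces) + ")"
                + "  divide-first [" + badBound(i, len, pieces) + "," + badBound(i + 1, len, pieces) + ")");
        }
    }
}
```

With len = 10, multiplying first yields the boundaries 0, 2, 5, 7, 10, so every element is covered exactly once. Dividing first yields 0, 2, 4, 6, 8, and the last two elements are never summed. One caveat the sketch ignores: for very large arrays the product i*len can overflow an int.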

Unfortunately, Java and most other general-purpose languages do not have a FORALL loop. We can encode this programming pattern explicitly using threads as follows:

1. In a regular for loop, create one thread to do each iteration of our FORALL loop, passing the data needed in the constructor. Have the threads store their answers in fields of themselves.
2. Wait for all the threads created in step 1 to terminate.
3. Combine the results by reading the answers out of the fields of the threads created in step 1.

To understand this pattern, we will first show a wrong version to get the idea. That is a common technique in these notes — learning from wrong versions is extremely useful — but wrong versions are always clearly indicated. Here is our WRONG attempt:

class SumThread extends java.lang.Thread {
  int lo; // fields for communicating inputs
  int hi;
  int[] arr;
  int ans = 0; // for communicating result
  SumThread(int[] a, int l, int h) { lo=l; hi=h; arr=a; }
  public void run() { // overriding, must have this type
    for(int i=lo; i<hi; i++)
      ans += arr[i];
  }
}
class C {
  static int sum(int[] arr) {

^1 We must take care to avoid bugs due to integer-division truncation with the arguments to sumRange. We need to process each array element exactly once even if len is not divisible by 4. This code is correct; notice in particular that ((i+1)*len)/4 will always be len when i==3 because 4*len is divisible by 4. Moreover, we could write (i+1)*len/4 since * and / have the same precedence and associate left-to-right. But (i+1)*(len/4) would not be correct. For the same reason, defining a variable int rangeSize = len/4 and using (i+1)*rangeSize would not be correct.

for(int i=0; i < 4; i++)
  ts[i].join();
for(int i=0; i < 4; i++)
  ans += ts[i].ans;
return ans;

There is nothing wrong with the code above, but the following is also correct:

for(int i=0; i < 4; i++) {
  ts[i].join();
  ans += ts[i].ans;
}
return ans;

Here we do not wait for all the helper threads to finish before we start producing the final answer. But we still ensure that the main thread does not access a helper thread’s ans field until at least that helper thread has terminated. There is one last Java-specific detail we need when using the join method defined in java.lang.Thread. It turns out this method can throw a java.lang.InterruptedException so a method calling join will not compile unless it catches this exception or declares that it might be thrown. For concurrent programming, it may be bad style to ignore this exception, but for basic parallel programming like we are doing, this exception is a nuisance and will not occur. So we will say no more about it. Also the ForkJoin framework we will use starting in Section 3.4 has a different join method that does not throw exceptions. Here, then, is a complete and correct program. There is no change to the SumThread class. This example shows many of the key concepts of fork-join parallelism, but Section 3.2 will explain why it is poor style and can lead to suboptimal performance. Sections 3.3 and 3.4 will then present a similar but better approach.
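The complete program referred to above is not reproduced in this excerpt. What follows is a minimal sketch assembled from the pieces that are shown: the SumThread class as given earlier, the range arithmetic from the pseudocode, and the join-then-read loop. The wrapper class name `FourThreadSum` is hypothetical. The choice to catch InterruptedException inside sum, rather than declare it with a throws clause, is also an assumption made to keep callers simple.

```java
class SumThread extends Thread {
    int lo;        // fields for communicating inputs
    int hi;
    int[] arr;
    int ans = 0;   // for communicating the result

    SumThread(int[] a, int l, int h) { lo = l; hi = h; arr = a; }

    @Override
    public void run() {
        for (int i = lo; i < hi; i++)
            ans += arr[i];
    }
}

class FourThreadSum {   // hypothetical class name
    static int sum(int[] arr) {
        int len = arr.length;
        int ans = 0;
        SumThread[] ts = new SumThread[4];
        for (int i = 0; i < 4; i++) {   // fork: create and start the four helpers
            ts[i] = new SumThread(arr, (i * len) / 4, ((i + 1) * len) / 4);
            ts[i].start();              // start, not run
        }
        try {
            for (int i = 0; i < 4; i++) {
                ts[i].join();           // wait for helper i to terminate
                ans += ts[i].ans;       // only then read its result
            }
        } catch (InterruptedException e) {
            // Not expected in this simple setting; restore the flag and give up.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while joining", e);
        }
        return ans;
    }

    public static void main(String[] args) {
        int[] data = new int[1000];
        for (int i = 0; i < data.length; i++)
            data[i] = i + 1;
        System.out.println(sum(data));  // 1 + 2 + ... + 1000
    }
}
```

Calling run instead of start would still produce the right sum, but each helper would execute as an ordinary method call on the main thread, so nothing would happen in parallel.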

3.2 Why Not To Use One Thread Per Processor

Having now presented a basic parallel algorithm, we will argue that the approach the algorithm takes is poor style and likely to lead to unnecessary inefficiency. Do not despair: the concepts we have learned like creating threads and using join will remain useful — and it was best to explain them using a too-simple approach. Moreover, many parallel programs are written in pretty much exactly this style, often because libraries like those in Section 3.4 are unavailable. Fortunately, such libraries are now available on many platforms. The problem with the previous approach was dividing the work into exactly 4 pieces. This approach assumes there are 4 processors available to do the work (no other code needs them) and that each processor is given approximately the same amount of work. Sometimes these assumptions may hold, but it would be better to use algorithms that do not rely on such brittle assumptions. The rest of this section explains in more detail why these assumptions are unlikely to hold and some partial solutions. Section 3.3 then describes the better solution that we advocate.

Different computers have different numbers of processors

We want parallel programs that effectively use the processors available to them. Using exactly 4 threads is a horrible approach. If 8 processors are available, half of them will sit idle and our program will be no faster than with 4 processors. If 3 processors are available, our 4-thread program will take approximately twice as long as with 4 processors, because one processor must run two of the four equal-size pieces one after the other. If 3 processors are available and we rewrite our program to use 3 threads, then we will use resources effectively and the result will only be about 33% slower than when we had 4 processors and 4 threads. (We will take 1/3 as much time as the sequential version compared to 1/4 as much time. And 1/3 is about 33% more than 1/4, since (1/3)/(1/4) = 4/3.) But we do not want to have to edit our code every time we run it on a computer with a different number of processors.

A natural solution is a core software-engineering principle you should already know: Do not use constants where a variable is appropriate. Our sum method can take as a parameter the number of threads to use, leaving it to some other part of the program to decide the number. (There are Java library methods to ask for the number of processors on the computer, for example, but we argue next that using that number is often unwise.) It would look like this:

static int sum(int[] arr, int numThreads) throws java.lang.InterruptedException {
  int len = arr.length;
  int ans = 0;
  SumThread[] ts = new SumThread[numThreads];
  for(int i=0; i < numThreads; i++) {
    ts[i] = new SumThread(arr,(i*len)/numThreads,((i+1)*len)/numThreads);
    ts[i].start();
  }
  for(int i=0; i < numThreads; i++) {
    ts[i].join();
    ans += ts[i].ans;
  }
  return ans;
}

Note that integer division must be handled carefully when dividing the work, so that truncation does not cause any array element to be skipped or processed twice.
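To make the integer-division point concrete, here is a small sketch that compares the boundary formula used in sum with the tempting but incorrect alternative of computing a range size first. The class ChunkBounds and its method names are hypothetical and exist only for this illustration.

```java
class ChunkBounds {
    // Boundaries as in sum: multiply first, then divide.
    static int lo(int i, int len, int numThreads) { return (i * len) / numThreads; }
    static int hi(int i, int len, int numThreads) { return ((i + 1) * len) / numThreads; }

    // Total number of elements covered by the multiply-then-divide boundaries.
    static int covered(int len, int numThreads) {
        int total = 0;
        for (int i = 0; i < numThreads; i++)
            total += hi(i, len, numThreads) - lo(i, len, numThreads);
        return total;
    }

    // Total covered if we compute rangeSize = len / numThreads once and
    // give thread i the range [i*rangeSize, (i+1)*rangeSize).
    static int coveredWithRangeSize(int len, int numThreads) {
        int rangeSize = len / numThreads; // truncates
        return numThreads * rangeSize;
    }

    public static void main(String[] args) {
        int len = 10, numThreads = 4;
        for (int i = 0; i < numThreads; i++)
            System.out.println("[" + lo(i, len, numThreads) + "," + hi(i, len, numThreads) + ")");
        System.out.println("covered: " + covered(len, numThreads));
        System.out.println("covered with rangeSize: " + coveredWithRangeSize(len, numThreads));
    }
}
```

With len = 10 and numThreads = 4, the first formula yields the ranges [0,2), [2,5), [5,7), [7,10), which cover all 10 elements, while the rangeSize variant covers only 8 and silently drops the last two. One caveat of this sketch: for very large arrays the product i*len can overflow an int, so a production version might do that multiplication in long arithmetic.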

The processors available to part of the code can change

The second dubious assumption made so far is that every processor is available to the code we are writing. But some processors may be needed by other programs or even other parts of the same program. We have parallelism after all — maybe the caller to sum is already part of some outer parallel algorithm. The operating system can reassign processors at any time, even when we are in the middle of summing array elements. It is fine to assume that the underlying Java implementation will try to use the available processors effectively, but we should not assume 4 or even numThreads processors will be available from the beginning to the end of running our parallel algorithm.

We cannot always predictably divide the work into approximately equal pieces

In our sum example, it is quite likely that the threads processing equal-size chunks of the array take approximately the same amount of time. They may not, due to memory-hierarchy issues or other architectural effects, however.

(b) Recursively sum the elements from the middle of the range to hi.

2. Add the two results from the previous step.

The essence of the recursion is that steps 1a and 1b will themselves use parallelism to divide the work of their halves in half again. It is the same divide-and-conquer recursive idea as you have seen in algorithms like mergesort. For sequential algorithms for simple problems like summing an array, such fanciness is overkill. But for parallel algorithms, it is ideal. As a small example (too small to actually want to use parallelism), consider summing an array with 10 elements. The algorithm produces the following tree of recursion, where the range [i,j) includes i and excludes j:

Thread: sum range [0,10)
  Thread: sum range [0,5)
    Thread: sum range [0,2)
      Thread: sum range [0,1) (return arr[0])
      Thread: sum range [1,2) (return arr[1])
      add results from two helper threads
    Thread: sum range [2,5)
      Thread: sum range [2,3) (return arr[2])
      Thread: sum range [3,5)
        Thread: sum range [3,4) (return arr[3])
        Thread: sum range [4,5) (return arr[4])
        add results from two helper threads
      add results from two helper threads
    add results from two helper threads
  Thread: sum range [5,10)
    Thread: sum range [5,7)
      Thread: sum range [5,6) (return arr[5])
      Thread: sum range [6,7) (return arr[6])
      add results from two helper threads
    Thread: sum range [7,10)
      Thread: sum range [7,8) (return arr[7])
      Thread: sum range [8,10)
        Thread: sum range [8,9) (return arr[8])
        Thread: sum range [9,10) (return arr[9])
        add results from two helper threads
      add results from two helper threads
    add results from two helper threads
  add results from two helper threads

The total amount of work done by this algorithm is O(n) because we create approximately 2n threads and each thread either returns an array element or adds together results from two helper threads it created. Much more interestingly, if we have O(n) processors, then this algorithm can run in O(log n) time, which is exponentially faster than the sequential algorithm. The key reason for the improvement is that the algorithm is combining results in parallel. The recursion forms a binary tree for summing subranges and the height of this tree is log n for a range of size n. With enough processors, the total time corresponds to the tree height, not the tree size: this is the fundamental running-time benefit of parallelism. Later sections will discuss why the problem of summing an array has such an efficient parallel algorithm; not every problem enjoys exponential improvement from parallelism.

Having described the algorithm in English, seen an example, and informally analyzed its running time, let us now consider an actual implementation with Java threads and then modify it with two important improvements that affect only constant factors, but the constant factors are large. Then the next section will show the “final” version where we use the improvements and use a different library for the threads.
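The size and height claims can be checked with a short sequential sketch that walks the same recursion tree without creating any threads. The class RecursionTree and its methods are hypothetical names; the sketch assumes the range is split at (hi+lo)/2 and that a range of size 1 is a leaf, as in the example tree.

```java
class RecursionTree {
    // Number of nodes (one per thread in the example tree) for the range [lo,hi), hi - lo >= 1.
    static int nodes(int lo, int hi) {
        if (hi - lo == 1) return 1;          // leaf: returns one array element
        int mid = (hi + lo) / 2;
        return 1 + nodes(lo, mid) + nodes(mid, hi);
    }

    // Height of the tree, counted in edges from the root to the deepest leaf.
    static int height(int lo, int hi) {
        if (hi - lo == 1) return 0;
        int mid = (hi + lo) / 2;
        return 1 + Math.max(height(lo, mid), height(mid, hi));
    }

    public static void main(String[] args) {
        for (int n : new int[]{10, 1024, 1000000})
            System.out.println("n=" + n + " nodes=" + nodes(0, n) + " height=" + height(0, n));
    }
}
```

For n = 10 this gives 19 nodes (that is, 2n - 1) and height 4, matching the tree drawn above. For n = 1,000,000 the tree has 1,999,999 nodes but height only 20, which is the gap between total work and running time with enough processors.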

To start, here is the algorithm directly translated into Java, omitting some boilerplate like putting the main sum method in a class and handling java.lang.InterruptedException.^2

class SumThread extends java.lang.Thread {
  int lo; // fields for communicating inputs
  int hi;
  int[] arr;
  int ans = 0; // for communicating result
  SumThread(int[] a, int l, int h) { arr=a; lo=l; hi=h; }
  public void run() {
    if(hi - lo == 1) {
      ans = arr[lo];
    } else {
      SumThread left = new SumThread(arr,lo,(hi+lo)/2);
      SumThread right = new SumThread(arr,(hi+lo)/2,hi);
      left.start();
      right.start();
      left.join();
      right.join();
      ans = left.ans + right.ans;
    }
  }
}
int sum(int[] arr) {
  SumThread t = new SumThread(arr,0,arr.length);
  t.run();
  return t.ans;
}

Notice how each thread creates two helper threads left and right and then waits for them to finish. Crucially, the calls to left.start and right.start precede the calls to left.join and right.join. If, for example, left.join() came before right.start(), then the algorithm would have no effective parallelism whatsoever. It would still produce the correct answer, but so would the original much simpler sequential program.

As a minor but important coding point, notice that the “main” sum method calls the run method directly. As such, this is an ordinary method call like you have used since you started programming; the caller and callee are part of the same thread. The fact that the object is a subclass of java.lang.Thread is only relevant if you call the “magic” start method, which calls run in a new thread.

In practice, code like this produces far too many threads to be efficient. To add up four numbers, does it really make sense to create six new threads? Therefore, implementations of fork/join algorithms invariably use a cutoff below which they switch over to a sequential algorithm. Because this cutoff is a constant, it has no effect on the asymptotic behavior of the algorithm. What it does is eliminate the vast majority of the threads created, while still preserving enough parallelism to balance the load among the processors. Here is code using a cutoff of 1000. As you can see, using a cutoff does not really complicate the code.
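The difference between calling run directly and calling start can be observed with a small sketch that records which thread executes run. The classes WhoRunsMe and RunVersusStart are hypothetical and serve only this illustration; wrapping InterruptedException in an unchecked exception is one possible choice, not the notes' recommendation.

```java
class WhoRunsMe extends java.lang.Thread {
    Thread executedBy; // the thread that actually executed run

    @Override
    public void run() {
        executedBy = Thread.currentThread();
    }
}

class RunVersusStart {
    // Calling run directly is an ordinary method call in the caller's thread.
    static boolean runStaysInCaller() {
        WhoRunsMe t = new WhoRunsMe();
        t.run();
        return t.executedBy == Thread.currentThread();
    }

    // Calling start makes run execute in the new thread represented by t itself.
    static boolean startUsesNewThread() {
        WhoRunsMe t = new WhoRunsMe();
        t.start();
        try {
            t.join(); // wait, so that reading t.executedBy is safe
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return t.executedBy == t;
    }

    public static void main(String[] args) {
        System.out.println("run() stays in caller: " + runStaysInCaller());
        System.out.println("start() uses new thread: " + startUsesNewThread());
    }
}
```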

class SumThread extends java.lang.Thread {
  static int SEQUENTIAL_CUTOFF = 1000;
  int lo; // fields for communicating inputs
  int hi;
  int[] arr;
  int ans = 0; // for communicating result

(^2) For the exception, you cannot declare that run throws this exception because it overrides a method in java.lang.Thread that does not have this declaration. Since this exception is not going to be raised, it is reasonable to insert a catch statement and ignore this exception. The Java ForkJoin Framework introduced in Section 3.4 does not have this problem; its join method does not throw checked exceptions.

      SumThread right= new SumThread(arr,(hi+lo)/2,hi);
      left.start();
      right.run();
      left.join();
      ans = left.ans + right.ans;
    }
  }
}
int sum(int[] arr) {
  SumThread t = new SumThread(arr,0,arr.length);
  t.run();
  return t.ans;
}

Notice how the code above creates two SumThread objects, but only creates one helper thread with left.start(). It then does the right half of the work itself by calling right.run(). There is only one call to join because only one helper thread was created. The order here is still essential so that the two halves of the work are done in parallel. Creating a SumThread object for the right half and then calling run rather than creating a thread may seem odd, but it keeps the code from getting more complicated and still conveys the idea of dividing the work into two similar parts that are done in parallel.

Unfortunately, even with these optimizations, the code above will run poorly in practice, especially if given a large array. The implementation of Java’s threads is not engineered for threads that do such a small amount of work as adding 1000 numbers: it takes much longer just to create, start running, and dispose of a thread. The space overhead may also be prohibitive. In particular, it is not uncommon for a Java implementation to pre-allocate the maximum amount of space it allows for the call-stack, which might be 2MB or more. So creating thousands of threads could use gigabytes of space. Hence we will switch to the library described in the next section for parallel programming. We will return to Java’s threads when we learn concurrency because the synchronization operations we will use work with Java’s threads.
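For reference, here is a self-contained sketch that combines both improvements (the sequential cutoff and the single helper thread) with Java threads. It is an illustration under stated assumptions rather than the notes' exact listing: the cutoff test hi - lo < SEQUENTIAL_CUTOFF, the class Main, and the choice to wrap InterruptedException in an unchecked exception are all assumptions made here.

```java
class SumThread extends java.lang.Thread {
    static int SEQUENTIAL_CUTOFF = 1000;
    int lo;       // inclusive lower bound
    int hi;       // exclusive upper bound
    int[] arr;
    int ans = 0;  // result, valid once run has finished

    SumThread(int[] a, int l, int h) { arr = a; lo = l; hi = h; }

    @Override
    public void run() {
        if (hi - lo < SEQUENTIAL_CUTOFF) {
            // Small range: sum sequentially instead of creating more threads.
            for (int i = lo; i < hi; i++)
                ans += arr[i];
        } else {
            SumThread left = new SumThread(arr, lo, (hi + lo) / 2);
            SumThread right = new SumThread(arr, (hi + lo) / 2, hi);
            left.start();  // left half runs in a new helper thread
            right.run();   // right half runs in the current thread
            try {
                left.join();
            } catch (InterruptedException e) {
                // run cannot declare a checked exception, so wrap it (assumed policy).
                throw new IllegalStateException(e);
            }
            ans = left.ans + right.ans;
        }
    }
}

class Main {
    static int sum(int[] arr) {
        SumThread t = new SumThread(arr, 0, arr.length);
        t.run(); // ordinary call: the top-level range is handled by the calling thread
        return t.ans;
    }

    public static void main(String[] args) {
        int[] data = new int[10000];
        java.util.Arrays.fill(data, 1);
        System.out.println(sum(data));
    }
}
```

On an array of 10,000 elements the recursion stops at ranges of 625 elements, so this sketch creates 15 helper threads rather than roughly 20,000.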

3.4 The Java ForkJoin Framework

Java 7 includes classes in the java.util.concurrent package designed exactly for the kind of fine-grained fork-join parallel computing these notes use. In addition to supporting lightweight threads (which the library calls ForkJoinTasks) that are small enough that even a million of them should not overwhelm the system, the implementation includes a scheduler and run-time system with provably optimal expected-time guarantees, as described in Section 4. Similar libraries for other languages include Intel’s Threading Building Blocks, Microsoft’s Task Parallel Library for C#, and others. The core ideas and implementation techniques go back much further to the Cilk language, an extension of C developed since 1994.

This section describes just a few practical details and library specifics. Compared to Java threads, the core ideas are all the same, but some of the method names and interfaces are different: in places more complicated and in others simpler. Naturally, we give a full example (actually two) for summing an array of numbers. The actual library contains many other useful features and classes, but we will use only the primitives related to forking and joining, implementing anything else we need ourselves.

For introductory notes on installing and using the library and avoiding some difficult-to-diagnose pitfalls, see http://www.cs.washington.edu/homes/djg/teachingMaterials/grossmanSPAC_forkJoinFramework.html. The main web site for the library is http://gee.cs.oswego.edu/dl/concurrency-interest/index.html and javadoc documentation is at http://gee.cs.oswego.edu/dl/jsr166/dist/jsr166ydocs/.

We first show a full program (minus a main method) that is as much as possible like the version we wrote using Java threads. We show a version using a sequential cutoff and only one helper thread at each recursive subdivision, though removing these important improvements would be easy. After discussing this version, we show a second version that uses Java’s generic types and a different library class. This second version is better style, but easier to understand after the first version.

FIRST VERSION (INFERIOR STYLE):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class SumArray extends RecursiveAction {
  static int SEQUENTIAL_THRESHOLD = 1000;

  int lo;
  int hi;
  int[] arr;
  int ans = 0;
  SumArray(int[] a, int l, int h) { lo=l; hi=h; arr=a; }

  protected void compute() {
    if(hi - lo <= SEQUENTIAL_THRESHOLD) {
      for(int i=lo; i < hi; ++i)
        ans += arr[i];
    } else {
      SumArray left = new SumArray(arr,lo,(hi+lo)/2);
      SumArray right = new SumArray(arr,(hi+lo)/2,hi);
      left.fork();
      right.compute();
      left.join();
      ans = left.ans + right.ans;
    }
  }
}
class Main {
  static final ForkJoinPool fjPool = new ForkJoinPool();
  static int sumArray(int[] array) {
    SumArray t = new SumArray(array,0,array.length);
    fjPool.invoke(t);
    return t.ans;
  }
}

While there are many differences compared to using Java’s threads, the overall structure of the algorithm should look similar. Furthermore, most of the changes are just different names for classes and methods:

Subclass java.util.concurrent.RecursiveAction instead of java.lang.Thread.
The method that “magically” creates parallelism is called fork instead of start.
The method that starts executing when a new thread begins is called compute instead of run. Recall these methods can also be called normally.
(The method join is still called join.)

The small additions involve creating a ForkJoinPool and using the invoke method on it. These are just some details because the library is not built into the Java language, so we have to do a little extra to initialize the library and start using it. Here is all you really need to know:

The entire program should have exactly one ForkJoinPool, so it makes sense to store it in a static field and use it for all the parallel algorithms in your program.