


























































These notes were originally written for CSE332 at the University of Washington. (http://www.cs.washington.edu/education/courses/cse332).



























































Shared-Memory Parallelism and Concurrency
Version of December 8, Dan Grossman

6 Basic Shared-Memory Concurrency
    6.1 The Programming Model
    6.2 The Need for Synchronization
    6.3 Locks
    6.4 Locks in Java
7 Race Conditions: Bad Interleavings and Data Races
    7.1 Bad Interleavings: An Example with Stacks
    7.2 Data Races: Wrong Even When They Look Right
8 Concurrency Programming Guidelines
    8.1 Conceptually Splitting Memory in Three Parts
    8.2 Approaches to Synchronization
9 Deadlock
10 Additional Synchronization Primitives
    10.1 Reader/Writer Locks
    10.2 Condition Variables
    10.3 Other
These notes teach parallelism and concurrency as part of an advanced sophomore-level data-structures course
Why parallelism and concurrency should be taught early:
Parallelism and concurrency are increasingly important topics in computer science and engineering. Traditionally, most undergraduates learned rather little about these topics and did so rather late in the curriculum: Senior-level operating-systems courses cover threads, scheduling, and synchronization. Early hardware courses have circuits and functional units with parallel parts. Advanced architecture courses discuss cache coherence. Electives might cover parallel algorithms or use distributed computing to solve embarrassingly parallel tasks. Little of this scattered material emphasizes the essential concepts of parallelism and concurrency — and certainly not in a central place such that subsequent courses can rely on it. These days, most desktop and laptop computers have multiple cores. Modern high-level languages have threads built into them and standard libraries use threads (e.g., Java’s Swing library for GUIs). It no longer seems reasonable to bring up threads “as needed” or delay until an operating-systems course that should be focusing on operating systems. There is no reason to introduce threads first in C or assembly when all the usual conveniences of high-level languages for introducing core concepts apply.
Why parallelism and concurrency should not be taught too early:
Conversely, it is tempting to introduce threads “from day one” in introductory programming courses before students learn “sequential habits.” I suspect this approach is infeasible in most curricula. For example, it may make little sense in programs where most students in introductory courses do not end up majoring in computer science. “Messing with intro” is a high-risk endeavor, and introductory courses are already packed with essential concepts like variables, functions, arrays, linked structures, etc. There is probably no room.
So put it in the data-structures course:
There may be multiple natural places to introduce parallelism and concurrency, but I claim “sophomore-level” data structures (after CS2 and discrete math, but before “senior-level” algorithms) works very well. Here are some reasons:
These notes were originally written for CSE332 at the University of Washington (http://www.cs.washington.edu/education/courses/cse332). They account for 3 weeks of a required 10-week course (the University uses a quarter system). Alongside these notes are PowerPoint slides, homework assignments, and a programming project. In fact, these notes were the last aspect to be written: the first edition of the course went great without them and students reported parallelism to be their favorite aspect of the course. Surely these notes have errors and explanations could be improved. Please let me know of any problems you find. I am a perfectionist: if you find a typo I would like to know. Ideally you would first check that the most recent version of the notes does not already have the problem fixed. As for permission, do with these notes whatever you like. Seriously: “steal these notes.” The LaTeX sources are available so you can modify them however you like. I would like to know if you are using these notes and how. It motivates me to improve them and, frankly, it’s not bad for my ego. Constructive criticism is also welcome. That said, I don’t expect any thoughtful instructor to agree completely with me on what to cover and how to cover it. Contact me at the email djg and then the at-sign and then cs.washington.edu. The current home for these notes and related materials is http://www.cs.washington.edu/homes/djg/teachingMaterials. Students are more than welcome to contact me: who better to let me know where these notes could be improved?
I deserve no credit for the material in these notes. If anything, my role was simply to distill decades of wisdom from others down to three weeks of core concepts and integrate the result into a data-structures course. When in doubt, I stuck with the basic and simplest topics and examples. I was particularly influenced by Guy Blelloch and Charles Leiserson in terms of teaching parallelism before concurrency and emphasizing divide-and-conquer algorithms that do not consider the number of processors. Doug Lea and other developers of Java’s ForkJoin framework provided a wonderful library that, with some hand-holding, is usable by sophomores. Larry Snyder was also an excellent resource for parallel algorithms. The treatment of shared-memory synchronization is heavily influenced by decades of operating-systems courses, but with the distinction of ignoring all issues of scheduling and synchronization implementation. Moreover, the emphasis on the need to avoid data races in high-level languages is frustratingly under-appreciated despite the noble work of memory-model experts such as Sarita Adve, Hans Boehm, and Bill Pugh. Feedback from Ruth Anderson, Kim Bruce, Kristian Lieberg, Tyler Robison, Cody Schroeder, and Martin Tompa helped improve explanations and remove typos. Tyler and Martin deserve particular mention for using these notes when they were very new. James Fogarty made many useful improvements to the presentation slides that accompany these reading notes. Steve Wolfman created the C++ version of these notes. I have had enlightening and enjoyable discussions on “how to teach this stuff” with too many researchers and educators over the last few years to list them all, but I am grateful. This work was funded in part via grants from the National Science Foundation and generous support, financial and otherwise, from Intel.
In sequential programming, one thing happens at a time. Sequential programming is what most people learn first and how most programs are written. Probably every program you have written in Java is sequential: Execution starts at the beginning of main and proceeds one assignment / call / return / arithmetic operation at a time. Removing the one-thing-at-a-time assumption complicates writing software. The multiple threads of execution (things performing computations) will somehow need to coordinate so that they can work together to complete a task — or at least not get in each other’s way while they are doing separate things. These notes cover basic concepts related to multithreaded programming, i.e., programs where there are multiple threads of execution. We will cover:
A useful analogy is with cooking. A sequential program is like having one cook who does each step of a recipe in order, finishing one step before starting the next. Often there are multiple steps that could be done at the same time — if you had more cooks. But having more cooks requires extra coordination. One cook may have to wait for another cook to finish something. And there are limited resources: If you have only one oven, two cooks won’t be able to bake casseroles at different temperatures at the same time. In short, multiple cooks present efficiency opportunities, but also significantly complicate the process of producing a meal. Because multithreaded programming is so much more difficult, it is best to avoid it if you can. For most of computing’s history, most programmers wrote only sequential programs. Notable exceptions were:
Sequential programmers were lucky: since every 2 years or so computers got roughly twice as fast, most programs would get exponentially faster over time without any extra effort. Around 2005, computers stopped getting twice as fast every 2 years. To understand why requires a course in computer architecture. In brief, increasing the clock rate (very roughly and technically inaccurately speaking, how quickly instructions execute) became infeasible without generating too much heat. Also, the relative cost of memory accesses can become too high for faster processors to help. Nonetheless, chip manufacturers still plan to make exponentially more powerful chips. Instead of one processor running faster, they will have more processors. The next computer you buy will likely have 4 processors (also called cores) on the same chip and the number of available cores will likely double every few years. What would 256 cores be good for? Well, you can run multiple programs at once (for real, not just with time-slicing). But for an individual program to run any faster than with one core, it will need to do more than one thing at once. This is the reason that multithreaded programming is becoming more important. To be clear, multithreaded programming is not new. It has existed for decades and all the key concepts these notes cover are just as old. Before there were multiple cores on one chip, you could use multiple chips and/or use time-slicing on one chip, and both remain important techniques today. The move to multiple cores on one chip is “just” having the effect of making multithreading something that more and more software wants to do.
These notes are organized around a fundamental distinction between parallelism and concurrency. Unfortunately, the way we define these terms is not entirely standard, so you should not assume that everyone uses these terms as we will. Nonetheless, most computer scientists agree that this distinction is important.
Parallel programming is about using additional computational resources to produce an answer faster.
As a canonical example, consider the trivial problem of summing up all the numbers in an array. We know no sequential algorithm can do better than Θ(n) time. Suppose instead we had 4 processors. Then hopefully we could produce the result roughly 4 times faster by having each processor add 1/4 of the elements and then we could just add these 4 partial results together with 3 more additions. Θ(n/4) is still Θ(n), but constant factors can matter. Moreover, when designing and analyzing a parallel algorithm, we should leave the number of processors as a variable, call it P.
Before writing any parallel or concurrent programs, we need some way to make multiple things happen at once and some way for those different things to communicate. Put another way, your computer may have multiple cores, but all the Java constructs you know are for sequential programs, which do only one thing at once. Before showing any Java specifics, we need to explain the programming model. The model we will assume is explicit threads with shared memory. A thread is itself like a running sequential program, but one thread can create other threads that are part of the same program and those threads can create more threads, etc. Two or more threads can communicate by writing and reading fields of the same objects. In other words, they share memory. This is only one model of parallel/concurrent programming, but it is the only one we will use. The next section briefly mentions other models that a full course on parallel/concurrent programming would likely cover. Conceptually, all the threads that have been started but not yet terminated are “running at once” in a program. In practice, they may not all be running at any particular moment:
Let’s be more concrete about what a thread is and how threads communicate. It’s helpful to start by enumerating the key pieces that a sequential program has while it is running:
With this overview of the sequential program state, it is much easier to understand threads:
Each thread has its own call stack and program counter, but all the threads share one collection of static fields and objects.
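To make this split concrete, here is a minimal sketch that is not part of the original notes. The class names SharedBox, BoxWriter, and StackVsHeapDemo are hypothetical. Each helper thread keeps a local variable on its own call stack, while both helpers write to fields of a single shared object. The sketch uses join, a primitive these notes introduce later, only so that the main thread reads the fields after the helpers have terminated.

```java
// Hypothetical illustration: one shared object, per-thread local variables.
class SharedBox {
    int fromA; // written only by helper A
    int fromB; // written only by helper B
}

class BoxWriter extends Thread {
    private final SharedBox box; // reference to the shared object
    private final boolean isA;

    BoxWriter(SharedBox box, boolean isA) { this.box = box; this.isA = isA; }

    @Override
    public void run() {
        int local = 0; // lives on this thread's own call stack
        for (int i = 1; i <= 100; i++) local += i;
        if (isA) box.fromA = local;     // communicate through the shared object
        else     box.fromB = local * 2;
    }
}

class StackVsHeapDemo {
    static int[] runDemo() {
        SharedBox box = new SharedBox();
        BoxWriter a = new BoxWriter(box, true);
        BoxWriter b = new BoxWriter(box, false);
        a.start();
        b.start();
        try {
            a.join(); // wait so the reads below happen after the helpers finish
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while joining", e);
        }
        return new int[] { box.fromA, box.fromB };
    }

    public static void main(String[] args) {
        int[] r = runDemo();
        System.out.println(r[0] + " " + r[1]); // prints 5050 10100
    }
}
```

Each helper writes a different field, so neither helper interferes with the other, and the main thread touches the fields only after both helpers are done.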
In practice, even though all objects could be shared among threads, most are not. In fact, just as having static fields is often poor style, having lots of objects shared among threads is often poor style. But we need some shared objects because that is how threads communicate. If we are going to create parallel algorithms where helper threads run in parallel to compute partial answers, they need some way to communicate those partial answers back to the “main” thread. The way we will do it is to have the helper threads write to some object fields that the main thread later reads. We finish this section with some Java specifics for exactly how to create a new thread in Java. The details vary in different languages and in fact the parallelism portion of these notes mostly uses a different Java library with slightly different specifics. In addition to creating threads, we will need other language constructs for coordinating them. For example, for one thread to read the result another thread wrote as its answer, the reader often needs to know the writer is done. We will present such primitives as we need them. To create a new thread in Java requires that you define a new class (step 1) and then perform two actions at run-time (steps 2–3):
Here is a complete example of a useless Java program that starts with one thread and then creates 20 more threads:
class C extends java.lang.Thread {
    int i;
    C(int i) { this.i = i; }
    public void run() {
        System.out.println("Thread " + i + " says hi");
        System.out.println("Thread " + i + " says bye");
    }
}
class M {
    public static void main(String[] args) {
        for(int i=1; i <= 20; ++i) {
            C c = new C(i);
            c.start();
        }
    }
}
When this program runs, it will print 40 lines of output, one of which is:
Thread 13 says hi
Interestingly, we cannot predict the order for these 40 lines of output. In fact, if you run the program multiple times, you will probably see the output appear in different orders on different runs. After all, each of the 21 separate threads running “at the same time” (conceptually, since your machine may not have 21 processors available for the program) can run in an unpredictable order. The main thread is the first thread and then it creates 20 others. The main thread always creates the other threads in the same order, but it is up to the Java implementation to let all the threads “take
Data parallelism does not have explicit threads or nodes running different parts of the program at different times. Instead, it has primitives for parallelism that involve applying the same operation to different pieces of data at the same time. For example, you would have a primitive for applying some function to every element of an array. The implementation of this primitive would use parallelism rather than a sequential for-loop. Hence all the parallelism is done for you provided you can express your program using the available primitives. Examples include vector instructions on some processors and map-reduce style distributed systems.
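As one possible illustration of such primitives, the sketch below uses Java's stream library (java.util.Arrays and parallel IntStream operations, available since Java 8). This is an assumption of the example rather than something these notes rely on, and the class name DataParallelDemo is hypothetical. The program states what to do to every element and leaves the division of work to the library.

```java
import java.util.Arrays;

class DataParallelDemo {
    // Apply the same operation (squaring) to every element. The library,
    // not the programmer, decides how to split the work across threads.
    static int[] squareAll(int[] arr) {
        return Arrays.stream(arr).parallel().map(x -> x * x).toArray();
    }

    // A reduction expressed with a primitive instead of an explicit loop.
    static int sum(int[] arr) {
        return Arrays.stream(arr).parallel().sum();
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5};
        System.out.println(Arrays.toString(squareAll(data))); // prints [1, 4, 9, 16, 25]
        System.out.println(sum(data));                        // prints 15
    }
}
```

No thread is created or joined explicitly here, which is the defining trait of the data-parallel model described above.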
This section shows how to use threads and shared memory to implement simple parallel algorithms. The only synchronization primitive we will need is join, which causes one thread to wait until another thread has terminated. We begin with simple pseudocode and then show how using threads in Java to achieve the same idea requires a bit more work (Section 3.1). We then argue that it is best for parallel code to not be written in terms of the number of processors available (Section 3.2) and show how to use recursive divide-and-conquer instead (Section 3.3). Because Java’s threads are not engineered for this style of programming, we switch to the Java ForkJoin framework which is designed for our needs (Section 3.4). With all of this discussion in terms of the single problem of summing an array of integers, we then turn to other similar problems, introducing the terminology of maps and reduces (Section 3.5) as well as data structures other than arrays (Section 3.6).
Most of this section will consider the problem of computing the sum of an array of integers. An O(n) sequential solution to this problem is trivial:
int sum(int[] arr) {
    int ans = 0;
    for(int i=0; i < arr.length; i++)
        ans += arr[i];
    return ans;
}
If the array is large and we have extra processors available, we can get a more efficient parallel algorithm. Suppose we have 4 processors. Then we could do the following:
This algorithm is clearly correct provided that the last step is started only after the previous four steps have completed. The first four steps can occur in parallel. More generally, if we have P processors, we can divide the array into P equal segments and have an algorithm that runs in time O(n/P + P) where n/P is for the parallel part and P is for combining the stored results. Later we will see we can do better if P is very large, though that may be less of a practical concern. In pseudocode, a convenient way to write this kind of algorithm is with a FORALL loop. A FORALL loop is like a for loop except it does all the iterations in parallel. Like a regular for loop, the code after a FORALL loop does not execute until the loop (i.e., all its iterations) is done. Unlike the for loop, the programmer is “promising” that all the iterations can be done at the same time without them interfering with each other. Therefore, if one loop iteration writes to a location, then another iteration must not read or write to that location. However, it is fine for two iterations to read the same location: that does not cause any interference. Here, then, is a pseudocode solution to using 4 processors to sum an array. Note it is essential that we store the 4 partial results in separate locations to avoid any interference between loop iterations.^1
int sum(int[] arr) {
    results = new int[4];
    len = arr.length;
    FORALL(i=0; i < 4; ++i) {
        results[i] = sumRange(arr,(i*len)/4,((i+1)*len)/4);
    }
    return results[0] + results[1] + results[2] + results[3];
}
int sumRange(int[] arr, int lo, int hi) {
    result = 0;
    for(j=lo; j < hi; ++j)
        result += arr[j];
    return result;
}
Unfortunately, Java and most other general-purpose languages do not have a FORALL loop. We can encode this programming pattern explicitly using threads as follows:
To understand this pattern, we will first show a wrong version to get the idea. That is a common technique in these notes — learning from wrong versions is extremely useful — but wrong versions are always clearly indicated. Here is our WRONG attempt:
class SumThread extends java.lang.Thread {
    int lo; // fields for communicating inputs
    int hi;
    int[] arr;
    int ans = 0; // for communicating result
    SumThread(int[] a, int l, int h) { lo=l; hi=h; arr=a; }
    public void run() { // overriding, must have this type
        for(int i=lo; i<hi; i++)
            ans += arr[i];
    }
}
class C {
    static int sum(int[] arr) {

^1 We must take care to avoid bugs due to integer-division truncation with the arguments to sumRange. We need to process each array element exactly once even if len is not divisible by 4. This code is correct; notice in particular that ((i+1)*len)/4 will always be len when i==3 because 4*len is divisible by 4. Moreover, we could write (i+1)*len/4 since * and / have the same precedence and associate left-to-right. But (i+1)*(len/4) would not be correct. For the same reason, defining a variable int rangeSize = len/4 and using (i+1)*rangeSize would not be correct.
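The truncation pitfall described in the footnote can be checked directly. The following is a small sketch; the names RangeBoundsDemo, goodHi, and badHi are hypothetical and exist only for this illustration. It compares the upper bound used in the notes with the divide-first variant for an array length that is not divisible by 4.

```java
class RangeBoundsDemo {
    // Upper bound as in the notes: multiply first, then divide.
    static int goodHi(int i, int len) { return ((i + 1) * len) / 4; }

    // Tempting but wrong: dividing first throws away the remainder.
    static int badHi(int i, int len) { return (i + 1) * (len / 4); }

    public static void main(String[] args) {
        int len = 10; // not divisible by 4
        for (int i = 0; i < 4; i++) {
            System.out.println("piece " + i
                + ": correct range [" + (i * len) / 4 + "," + goodHi(i, len) + ")"
                + ", divide-first upper bound " + badHi(i, len));
        }
    }
}
```

With len = 10 the correct ranges are [0,2), [2,5), [5,7), and [7,10), which together cover every index exactly once. The divide-first formula yields 8 as the last upper bound, so the elements at indices 8 and 9 would never be summed.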
for(int i=0; i < 4; i++)
    ts[i].join();
for(int i=0; i < 4; i++)
    ans += ts[i].ans;
return ans;
There is nothing wrong with the code above, but the following is also correct:
for(int i=0; i < 4; i++) {
    ts[i].join();
    ans += ts[i].ans;
}
return ans;
Here we do not wait for all the helper threads to finish before we start producing the final answer. But we still ensure that the main thread does not access a helper thread’s ans field until at least that helper thread has terminated. There is one last Java-specific detail we need when using the join method defined in java.lang.Thread. It turns out this method can throw a java.lang.InterruptedException so a method calling join will not compile unless it catches this exception or declares that it might be thrown. For concurrent programming, it may be bad style to ignore this exception, but for basic parallel programming like we are doing, this exception is a nuisance and will not occur. So we will say no more about it. Also the ForkJoin framework we will use starting in Section 3.4 has a different join method that does not throw exceptions. Here, then, is a complete and correct program. There is no change to the SumThread class. This example shows many of the key concepts of fork-join parallelism, but Section 3.2 will explain why it is poor style and can lead to suboptimal performance. Sections 3.3 and 3.4 will then present a similar but better approach.
class SumThread extends java.lang.Thread {
    int lo; // fields for communicating inputs
    int hi;
    int[] arr;
    int ans = 0; // for communicating result
    SumThread(int[] a, int l, int h) { lo=l; hi=h; arr=a; }
    public void run() { // overriding, must have this type
        for(int i=lo; i<hi; i++)
            ans += arr[i];
    }
}
class C {
    static int sum(int[] arr) throws java.lang.InterruptedException {
        int len = arr.length;
        int ans = 0;
        SumThread[] ts = new SumThread[4];
        for(int i=0; i < 4; i++) {
            ts[i] = new SumThread(arr,(i*len)/4,((i+1)*len)/4);
            ts[i].start(); // start the helper thread, which calls run
        }
        for(int i=0; i < 4; i++) {
            ts[i].join(); // wait for helper i before reading its ans
            ans += ts[i].ans;
        }
        return ans;
    }
}
Having now presented a basic parallel algorithm, we will argue that the approach the algorithm takes is poor style and likely to lead to unnecessary inefficiency. Do not despair: the concepts we have learned like creating threads and using join will remain useful — and it was best to explain them using a too-simple approach. Moreover, many parallel programs are written in pretty much exactly this style, often because libraries like those in Section 3.4 are unavailable. Fortunately, such libraries are now available on many platforms. The problem with the previous approach was dividing the work into exactly 4 pieces. This approach assumes there are 4 processors available to do the work (no other code needs them) and that each processor is given approximately the same amount of work. Sometimes these assumptions may hold, but it would be better to use algorithms that do not rely on such brittle assumptions. The rest of this section explains in more detail why these assumptions are unlikely to hold and some partial solutions. Section 3.3 then describes the better solution that we advocate.
Different computers have different numbers of processors
We want parallel programs that effectively use the processors available to them. Using exactly 4 threads is a horrible approach. If 8 processors are available, half of them will sit idle and our program will be no faster than with 4 processors. If 3 processors are available, our 4-thread program will take approximately twice as long as with 4 processors. If 3 processors are available and we rewrite our program to use 3 threads, then we will use resources effectively and the result will only be about 33% slower than when we had 4 processors and 4 threads. (We will take 1/3 as much time as the sequential version compared to 1/4 as much time. And 1/3 is 33% slower than 1/4.) But we do not want to have to edit our code every time we run it on a computer with a different number of processors. A natural solution is a core software-engineering principle you should already know: Do not use constants where a variable is appropriate. Our sum method can take as a parameter the number of threads to use, leaving it to some other part of the program to decide the number. (There are Java library methods to ask for the number of processors on the computer, for example, but we argue next that using that number is often unwise.) It would look like this:
static int sum(int[] arr, int numThreads) throws java.lang.InterruptedException {
    int len = arr.length;
    int ans = 0;
    SumThread[] ts = new SumThread[numThreads];
    for(int i=0; i < numThreads; i++) {
        ts[i] = new SumThread(arr,(i*len)/numThreads,((i+1)*len)/numThreads);
        ts[i].start();
    }
    for(int i=0; i < numThreads; i++) {
        ts[i].join();
        ans += ts[i].ans;
    }
    return ans;
}
Note that you need to be careful with integer division not to introduce rounding errors when dividing the work.
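For readers who want to run the parameterized version, here is a self-contained sketch. The class names ChunkSumThread and NumThreadsDemo are hypothetical, and catching InterruptedException inside sum is a choice made for this example rather than the approach taken in the notes. The main method queries Runtime.getRuntime().availableProcessors(), one of the library methods alluded to above, purely to show how a caller might pick a value; the discussion that follows explains why relying on that number is often unwise.

```java
class ChunkSumThread extends Thread {
    final int lo, hi;   // range [lo, hi) to sum
    final int[] arr;
    int ans = 0;        // result, read by the creator after join

    ChunkSumThread(int[] a, int l, int h) { arr = a; lo = l; hi = h; }

    @Override
    public void run() {
        for (int i = lo; i < hi; i++) ans += arr[i];
    }
}

class NumThreadsDemo {
    static int sum(int[] arr, int numThreads) {
        int len = arr.length;
        int ans = 0;
        ChunkSumThread[] ts = new ChunkSumThread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            // Multiply before dividing so every index is covered exactly once.
            ts[i] = new ChunkSumThread(arr, (i * len) / numThreads,
                                       ((i + 1) * len) / numThreads);
            ts[i].start();
        }
        try {
            for (int i = 0; i < numThreads; i++) {
                ts[i].join();
                ans += ts[i].ans;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while joining", e);
        }
        return ans;
    }

    public static void main(String[] args) {
        int[] data = new int[1000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        int p = Runtime.getRuntime().availableProcessors();
        System.out.println("threads: " + p + ", sum: " + sum(data, p));
    }
}
```

The result is the same for any positive numThreads; only the running time changes.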
The processors available to part of the code can change
The second dubious assumption made so far is that every processor is available to the code we are writing. But some processors may be needed by other programs or even other parts of the same program. We have parallelism after all — maybe the caller to sum is already part of some outer parallel algorithm. The operating system can reassign processors at any time, even when we are in the middle of summing array elements. It is fine to assume that the underlying Java implementation will try to use the available processors effectively, but we should not assume 4 or even numThreads processors will be available from the beginning to the end of running our parallel algorithm.
We cannot always predictably divide the work into approximately equal pieces
In our sum example, it is quite likely that the threads processing equal-size chunks of the array take approximately the same amount of time. They may not, due to memory-hierarchy issues or other architectural effects, however.
(b) Recursively sum the elements from the middle of the range to hi.
The essence of the recursion is that steps 1a and 1b will themselves use parallelism to divide the work of their halves in half again. It is the same divide-and-conquer recursive idea as you have seen in algorithms like mergesort. For sequential algorithms for simple problems like summing an array, such fanciness is overkill. But for parallel algorithms, it is ideal. As a small example (too small to actually want to use parallelism), consider summing an array with 10 elements. The algorithm produces the following tree of recursion, where the range [i,j) includes i and excludes j:
Thread: sum range [0,10)
    Thread: sum range [0,5)
        Thread: sum range [0,2)
            Thread: sum range [0,1) (return arr[0])
            Thread: sum range [1,2) (return arr[1])
            add results from two helper threads
        Thread: sum range [2,5)
            Thread: sum range [2,3) (return arr[2])
            Thread: sum range [3,5)
                Thread: sum range [3,4) (return arr[3])
                Thread: sum range [4,5) (return arr[4])
                add results from two helper threads
            add results from two helper threads
        add results from two helper threads
    Thread: sum range [5,10)
        Thread: sum range [5,7)
            Thread: sum range [5,6) (return arr[5])
            Thread: sum range [6,7) (return arr[6])
            add results from two helper threads
        Thread: sum range [7,10)
            Thread: sum range [7,8) (return arr[7])
            Thread: sum range [8,10)
                Thread: sum range [8,9) (return arr[8])
                Thread: sum range [9,10) (return arr[9])
                add results from two helper threads
            add results from two helper threads
        add results from two helper threads
    add results from two helper threads
The total amount of work done by this algorithm is O(n) because we create approximately 2n threads and each thread either returns an array element or adds together results from two helper threads it created. Much more interestingly, if we have O(n) processors, then this algorithm can run in O(log n) time, which is exponentially faster than the sequential algorithm. The key reason for the improvement is that the algorithm is combining results in parallel. The recursion forms a binary tree for summing subranges and the height of this tree is log n for a range of size n. With enough processors, the total time corresponds to the tree height, not the tree size: this is the fundamental running-time benefit of parallelism. Later sections will discuss why the problem of summing an array has such an efficient parallel algorithm; not every problem enjoys exponential improvement from parallelism. Having described the algorithm in English, seen an example, and informally analyzed its running time, let us now consider an actual implementation with Java threads and then modify it with two important improvements that affect only constant factors, but the constant factors are large. Then the next section will show the “final” version where we use the improvements and use a different library for the threads.
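The counts behind this informal analysis can be written out. The following is a sketch under the assumption that ranges are split until they hold one element, as in the 10-element example; the symbols W(n) for total work and S(n) for running time with unlimited processors are introduced here only for illustration.

```latex
\begin{align*}
\text{threads}(n) &= \underbrace{n}_{\text{leaves}} + \underbrace{n-1}_{\text{internal nodes}} = 2n - 1
  && \text{(for } n = 10\text{: } 10 + 9 = 19\text{)} \\
W(n) &= 2\,W(n/2) + O(1) = O(n)
  && \text{(every thread does } O(1) \text{ work)} \\
S(n) &= S(n/2) + O(1) = O(\log n)
  && \text{(for } n = 10\text{: tree height } \lceil \log_2 10 \rceil = 4\text{)}
\end{align*}
```

The first line explains the "approximately 2n threads" figure. The last line differs from the second only in that the two halves run at the same time, so only one of them contributes to the elapsed time.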
To start, here is the algorithm directly translated into Java, omitting some boilerplate like putting the main sum method in a class and handling java.lang.InterruptedException.^2
class SumThread extends java.lang.Thread {
    int lo; // fields for communicating inputs
    int hi;
    int[] arr;
    int ans = 0; // for communicating result
    SumThread(int[] a, int l, int h) { arr=a; lo=l; hi=h; }
    public void run() {
        if(hi - lo == 1) {
            ans = arr[lo];
        } else {
            SumThread left = new SumThread(arr,lo,(hi+lo)/2);
            SumThread right = new SumThread(arr,(hi+lo)/2,hi);
            left.start();
            right.start();
            left.join();
            right.join();
            ans = left.ans + right.ans;
        }
    }
}
int sum(int[] arr) {
    SumThread t = new SumThread(arr,0,arr.length);
    t.run();
    return t.ans;
}
Notice how each thread creates two helper threads left and right and then waits for them to finish. Crucially, the calls to left.start and right.start precede the calls to left.join and right.join. If for example, left.join() came before right.start(), then the algorithm would have no effective parallelism whatsoever. It would still produce the correct answer, but so would the original much simpler sequential program. As a minor but important coding point, notice that the “main” sum method calls the run method directly. As such, this is an ordinary method call like you have used since you started programming; the caller and callee are part of the same thread. The fact that the object is a subclass of java.lang.Thread is only relevant if you call the “magic” start method, which calls run in a new thread. In practice, code like this produces far too many threads to be efficient. To add up four numbers, does it really make sense to create six new threads? Therefore, implementations of fork/join algorithms invariably use a cutoff below which they switch over to a sequential algorithm. Because this cutoff is a constant, it has no effect on the asymptotic behavior of the algorithm. What it does is eliminate the vast majority of the threads created, while still preserving enough parallelism to balance the load among the processors. Here is code using a cutoff of 1000. As you can see, using a cutoff does not really complicate the code.
class SumThread extends java.lang.Thread {
    static int SEQUENTIAL_CUTOFF = 1000;
    int lo; // fields for communicating inputs
    int hi;
    int[] arr;
    int ans = 0; // for communicating result

^2 For the exception, you cannot declare that run throws this exception because it overrides a method in java.lang.Thread that does not have this declaration. Since this exception is not going to be raised, it is reasonable to insert a catch statement and ignore this exception. The Java ForkJoin Framework introduced in Section 3.4 does not have this problem; its join method does not throw checked exceptions.
SumThread right= new SumThread(arr,(hi+lo)/2,hi); left.start(); right.run(); left.join(); ans = left.ans + right.ans; } } } int sum(int[] arr) { SumThread t = new SumThread(arr,0,arr.length); t.run(); return t.ans; }
Notice how the code above creates two SumThread objects, but only creates one helper thread with left.start(). It then does the right half of the work itself by calling right.run(). There is only one call to join because only one helper thread was created. The order here is still essential so that the two halves of the work are done in parallel. Creating a SumThread object for the right half and then calling run rather than creating a thread may seem odd, but it keeps the code from getting more complicated and still conveys the idea of dividing the work into two similar parts that are done in parallel.

Unfortunately, even with these optimizations, the code above will run poorly in practice, especially if given a large array. The implementation of Java’s threads is not engineered for threads that do as little work as adding 1000 numbers: creating, starting, and disposing of a thread takes much longer than the additions themselves. The space overhead may also be prohibitive. In particular, it is not uncommon for a Java implementation to pre-allocate the maximum amount of space it allows for the call stack, which might be 2 MB or more. So creating thousands of threads could use gigabytes of space.

Hence we will switch to the library described in the next section for parallel programming. We will return to Java’s threads when we learn concurrency because the synchronization operations we will use work with Java’s threads.
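To make the thread counts in this discussion concrete, here is a minimal sketch that counts how many helper threads the recursive sum would start. The class and method names (ThreadCountSketch, twoHelpersPerSplit, oneHelperPerSplit) are hypothetical and not part of any library, and the 2 MB stack size is only the illustrative figure mentioned above; real JVMs differ. The sketch mirrors the (hi+lo)/2 split and treats any range shorter than the cutoff as sequential work that starts no threads.

```java
// Sketch only: counts helper threads for the recursive array sum.
// All names here are hypothetical and chosen for illustration.
class ThreadCountSketch {

    // Every split starts two helper threads (left.start(), right.start()).
    // Ranges shorter than `cutoff` are summed sequentially and start none.
    static long twoHelpersPerSplit(long n, long cutoff) {
        if (cutoff < 2) throw new IllegalArgumentException("cutoff must be >= 2");
        if (n < cutoff) return 0;
        long half = n / 2; // same split point as (hi+lo)/2
        return 2 + twoHelpersPerSplit(half, cutoff)
                 + twoHelpersPerSplit(n - half, cutoff);
    }

    // Every split starts one helper thread (left.start()) and does the
    // right half in the current thread (right.run()).
    static long oneHelperPerSplit(long n, long cutoff) {
        if (cutoff < 2) throw new IllegalArgumentException("cutoff must be >= 2");
        if (n < cutoff) return 0;
        long half = n / 2;
        return 1 + oneHelperPerSplit(half, cutoff)
                 + oneHelperPerSplit(n - half, cutoff);
    }

    public static void main(String[] args) {
        // Four numbers, recursing down to single elements (cutoff 2).
        System.out.println(twoHelpersPerSplit(4, 2)); // prints 6

        // One million numbers with the cutoff of 1000 used in the notes.
        long threads = oneHelperPerSplit(1_000_000, 1000);
        System.out.println(threads); // prints 1023

        // Rough upper bound on stack space if every thread were alive at
        // once, ASSUMING 2 MB is pre-allocated per thread.
        long assumedStackMB = 2;
        System.out.println(threads * assumedStackMB + " MB");
    }
}
```

Even with both optimizations, a million-element array still leads to roughly a thousand threads, and the count grows linearly with the array size, which is why a library with much lighter-weight tasks is attractive.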
Java 7 includes classes in the java.util.concurrent package designed exactly for the kind of fine-grained fork-join parallel computing these notes use. In addition to supporting lightweight threads (which the library calls ForkJoinTasks) that are small enough that even a million of them should not overwhelm the system, the implementation includes a scheduler and run-time system with provably optimal expected-time guarantees, as described in Section 4. Similar libraries for other languages include Intel’s Threading Building Blocks, Microsoft’s Task Parallel Library for C#, and others. The core ideas and implementation techniques go back much further to the Cilk language, an extension of C developed since 1994.

This section describes just a few practical details and library specifics. Compared to Java threads, the core ideas are all the same, but some of the method names and interfaces are different: in places more complicated and in others simpler. Naturally, we give a full example (actually two) for summing an array of numbers. The actual library contains many other useful features and classes, but we will use only the primitives related to forking and joining, implementing anything else we need ourselves.

For introductory notes on installing and using the library and avoiding some difficult-to-diagnose pitfalls, see http://www.cs.washington.edu/homes/djg/teachingMaterials/grossmanSPAC_forkJoinFramework.html. The main web site for the library is http://gee.cs.oswego.edu/dl/concurrency-interest/index.html and javadoc documentation is at http://gee.cs.oswego.edu/dl/jsr166/dist/jsr166ydocs/.

We first show a full program (minus a main method) that is as much as possible like the version we wrote using Java threads. We show a version using a sequential cut-off and only one helper thread at each recursive subdivision, though removing these important improvements would be easy.
After discussing this version, we show a second version that uses Java’s generic types and a different library class. This second version is better style, but easier to understand after the first version.
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class SumArray extends RecursiveAction {
  static int SEQUENTIAL_THRESHOLD = 1000;

  int lo;
  int hi;
  int[] arr;
  int ans = 0;
  SumArray(int[] a, int l, int h) { lo=l; hi=h; arr=a; }

  protected void compute() {
    if(hi - lo <= SEQUENTIAL_THRESHOLD) {
      for(int i=lo; i < hi; ++i)
        ans += arr[i];
    } else {
      SumArray left = new SumArray(arr,lo,(hi+lo)/2);
      SumArray right = new SumArray(arr,(hi+lo)/2,hi);
      left.fork();
      right.compute();
      left.join();
      ans = left.ans + right.ans;
    }
  }
}

class Main {
  static final ForkJoinPool fjPool = new ForkJoinPool();
  static int sumArray(int[] array) {
    SumArray t = new SumArray(array,0,array.length);
    fjPool.invoke(t);
    return t.ans;
  }
}
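The program above omits a main method. As a minimal sketch of how one might exercise it, the following standalone program repeats the same RecursiveAction structure and compares the parallel result against a plain sequential loop. The names SumArrayTask, SumArrayDriver, parallelSum, and sequentialSum are hypothetical, chosen only so the sketch compiles on its own without clashing with the classes in the listing; the test data is arbitrary.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Compact copy of the RecursiveAction listing so this sketch is self-contained.
class SumArrayTask extends RecursiveAction {
    static final int SEQUENTIAL_THRESHOLD = 1000;
    final int lo;
    final int hi;
    final int[] arr;
    int ans = 0;

    SumArrayTask(int[] a, int l, int h) { arr = a; lo = l; hi = h; }

    @Override
    protected void compute() {
        if (hi - lo <= SEQUENTIAL_THRESHOLD) {
            for (int i = lo; i < hi; ++i) ans += arr[i];
        } else {
            int mid = (hi + lo) / 2;
            SumArrayTask left = new SumArrayTask(arr, lo, mid);
            SumArrayTask right = new SumArrayTask(arr, mid, hi);
            left.fork();      // hand the left half to the pool
            right.compute();  // do the right half in the current task
            left.join();      // wait for the left half
            ans = left.ans + right.ans;
        }
    }
}

class SumArrayDriver {
    static final ForkJoinPool POOL = new ForkJoinPool();

    static int parallelSum(int[] array) {
        SumArrayTask t = new SumArrayTask(array, 0, array.length);
        POOL.invoke(t); // returns only after the whole computation finishes
        return t.ans;
    }

    // Reference implementation used only to check the parallel result.
    static int sequentialSum(int[] array) {
        int total = 0;
        for (int x : array) total += x;
        return total;
    }

    public static void main(String[] args) {
        int[] data = new int[100_000];
        for (int i = 0; i < data.length; i++) data[i] = i % 7;
        // The two sums should agree, so this is expected to print true.
        System.out.println(parallelSum(data) == sequentialSum(data));
    }
}
```

Checking against a sequential loop is a cheap sanity test for any fork-join reduction: a mistake in the split bounds or in the order of fork, compute, and join typically shows up as a wrong total or as a program with no speedup.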
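While there are many differences compared to using Java’s threads, the overall structure of the algorithm should look similar. Furthermore, most of the changes are just different names for classes and methods: the class extends RecursiveAction instead of java.lang.Thread, the method that creates parallelism is called fork instead of start, and the method that holds the work is called compute instead of run. The method join is still called join.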
The small additions involve creating a ForkJoinPool and using the invoke method on it. These are just some details because the library is not built into the Java language, so we have to do a little extra to initialize the library and start using it. Here is all you really need to know: