Proceedings of the 37th International Symposium on Microarchitecture (MICRO-37 2004)
1072-4451/04 $20.00 © 2004 IEEE

Balanced Multithreading: Increasing Throughput via a Low Cost Multithreading Hierarchy

Eric Tune, Rakesh Kumar, Dean M. Tullsen, Brad Calder

Computer Science and Engineering Department

University of California at San Diego

{etune,rakumar,tullsen,calder}@cs.ucsd.edu

Abstract

A simultaneous multithreading (SMT) processor can issue instructions from several threads every cycle, allowing it to effectively hide various instruction latencies; this effect increases with the number of simultaneous contexts supported. However, each added context on an SMT processor incurs a cost in complexity, which may lead to an increase in pipeline length or a decrease in the maximum clock rate. This paper presents new designs for multithreaded processors which combine a conservative SMT implementation with a coarse-grained multithreading capability. By presenting more virtual contexts to the operating system and user than are supported in the core pipeline, the new designs can take advantage of the memory parallelism present in workloads with many threads, while avoiding the performance penalties inherent in a many-context SMT processor design. A design with 4 virtual contexts, but which is based on a 2-context SMT processor core, gains an additional 26% throughput when 4 threads are run together.

1 Introduction

The ratio between main memory access time and core clock rates continues to grow. As a result, a processor pipeline may be idle during much of a program's execution. A multithreading processor can maintain a high throughput despite large relative memory latencies by executing instructions from several programs. Many models of multithreading have been proposed. They can be categorized by how close together in time instructions from different threads may be executed, which affects how the state for different threads must be managed. Simultaneous Multithreading [31, 30, 12, 33] (SMT) is the least restrictive model, in that instructions from multiple threads can execute in the same cycle. This flexibility allows an SMT processor to hide stalls in one thread by executing instructions from other threads. However, the flexibility of SMT comes at a cost. The register file and rename tables must be enlarged to accommodate the architectural registers of the additional threads. This in turn can increase the clock cycle time and/or the depth of the pipeline.

Coarse-grained multithreading (CGMT) [1, 21, 26] is a more restrictive model where the processor can only execute instructions from one thread at a time, but where it can switch to a new thread after a short delay. This makes CGMT suited for hiding longer delays. Soon, general-purpose microprocessors will be experiencing delays to main memory of 500 or more cycles. This means that a context switch in response to a memory access can take tens of cycles and still provide a considerable performance benefit. Previous CGMT designs relied on a larger register file to allow fast context switches, which would likely slow down current pipeline designs and interfere with register renaming. Instead, we describe a new implementation of CGMT which does not affect the size or design of the register file or renaming table.

We find that CGMT alone, triggered only by main-memory accesses, provides unimpressive increases in performance because it cannot hide the effect of shorter stalls in a single thread. However, CGMT and SMT complement each other very well. A design which combines both types of multithreading provides a balance between support for hiding long and short stalls, and a balance between high throughput and high single-thread performance. We call this combination of techniques Balanced Multithreading (BMT).

This combination of multithreading models can be compared to a cache hierarchy, which results in a multithreading hierarchy. The lowest level of multithreading (SMT) is small (fewer contexts), fast, expensive, and closely tied to the processor cycle time. The next level of multithreading (CGMT) is slower, potentially larger (fewer limits to the number of contexts that can be supported), cheaper, and has no impact on processor cycle time or pipeline depth.

In our design, the operating system sees more virtual contexts than are supported in the core pipeline. These virtual contexts are controlled by a mechanism to quickly switch between threads on long latency load misses. The method we propose for adding more virtual contexts does not increase the size of the physical register file or of the renaming tables. Instead, inactive contexts reside in a separate memory dedicated to that purpose, which can be simpler than a register file, can be located far from the core, and will not be timing critical. Further, those threads that are swapped out of the processor core do not need to be renamed, which avoids an increase in the size of the renaming table. This architecture can achieve throughput near that of a many-context SMT processor, but with the pipeline and clock rate of an SMT implementation that supports fewer threads. We find that we can increase the throughput of an SMT processor design by as much as 26% by applying these small changes to the processor core.

This paper is organized as follows: Section 2 discusses related prior work. Section 3 presents the architecture and mechanisms for combining SMT and CGMT. Section 4 discusses our evaluation methodology. Results are presented in Section 5.

2 Related Work

There has been a large body of work on the three primary multithreading execution models. Fine-grained multithreading architectures [24, 2, 10, 16] switch threads every processor cycle. Coarse-grained multithreading [1, 21, 26, 18, 5] (CGMT) architectures switch to a different thread if the current thread has a costly stall. Simultaneous multithreading [31, 30, 12, 33] (SMT) architectures can issue instructions from multiple threads simultaneously.

2.1 Coarse-Grain Multithreading

The Sparcle CPU [1] in the Alewife machine implements CGMT, performing a context switch in 14 cycles (4 cycles with aggressive optimizations). The Sparcle architects disabled the register windows present in the Sparc processor that they reused, and used the extra registers to support a second context. The Sparcle processor was in-order, with a short pipeline, and did not perform register renaming. The IBM RS64 IV processor [5] supports CGMT with 2 threads, and is in-order. The RS64 designers chose to implement only two contexts, which avoided any cycle-time penalty from the additional registers. For the processors we seek to improve, which have large instruction windows backed by additional registers, the register file access time is much more likely to be on the critical timing path. Therefore, we present a different approach to context switching.

Waldspurger and Weihl [32] avoid expanding the register file in a CGMT architecture by recompiling code so that each thread uses fewer registers. Mowry and Ramkissoon [18] propose software-controlled CGMT to help tolerate the latency of shared data in a shared-memory multiprocessor. They suggest compiler-based register file partitioning to reduce context-switch overhead. Horowitz et al. similarly suggest using memory references which cause cache misses to branch or trap to a user-level handler [13]. Our approach uses lightweight hardware support to make context switches faster than would be possible purely using software, and does not require recompilation.

2.2 Simultaneous Multithreading

Simultaneous multithreading can increase the utilization of the execution resources of a single processor core by executing instructions from different threads at the same time. However, each additional simultaneous thread expands structures whose speed may directly affect performance, in particular the register file. To reduce the incremental cost of additional threads in an SMT processor, Redstone et al. [20] propose partitioning the architectural register file. Lo et al. [17] propose software-directed register deallocation to decrease dynamic register file demand for SMT processors. Both [20] and [17] require compiler support. Multi-level register file organizations reduce the average register access time [4, 8, 3].

Register file speed is a function of the number of ports, as well as the number of registers it contains. A processor with a high issue width requires a register file with many ports to avoid contention. The port requirements can be relaxed [19, 14, 27], but that requires additional hardware to arbitrate among the ports.

Tullsen and Brown [29] note that very long latency memory operations can create problems for an SMT processor. They suggest that when a thread is stalled waiting for a memory access, the instructions after the miss should be flushed from the pipeline, freeing critical shared execution resources. Our scheme inherently provides the same functionality. However, their proposal fails to free the most critical shared resource: thread contexts. We compare our processor designs against an SMT processor which implements their flushing mechanism. Our results show that freeing resources being held by a stalled thread is indeed very important; however, making those same resources available to a thread that would not otherwise have a chance to run is also important. Other researchers have suggested more sophisticated flushing policies for SMT [6], which we do not evaluate. However, improvements to policies which control when to flush an SMT processor can also be applied to controlling thread-swapping in a BMT processor.

3 Architecture

In this paper, we use the term context to refer to the hardware which gives a processor the ability to run a process without operating system or software intervention. We use the term thread to refer to a program assigned to a context by the


would need to be accessed through a multiplexer which would be controlled by a physical-to-virtual context mapping register. The special thread-switch instruction changes this register to correspond to the next thread to run.

Selecting the Next Thread: The next thread to swap in is known before a thread swap occurs. We use a Least-Recently-Run policy for selecting the next thread. When an active thread is swapped out of the pipeline, the least recently run thread is swapped in.

When a thread incurs a miss, but all inactive threads are also waiting for memory, we found that a good policy was to swap out the stalled thread, swap in the least recently run thread, but gate (stall) fetch for that least recently run thread until its data is returned from memory. This prevents the still-stalled thread from introducing instructions into the processor that will interfere with other active threads. Eickemeyer et al. [9] refer to this policy as switch-when-ready in their evaluation of a CGMT-only processor.

Inactive Register Buffer (IRB): Adding physical contexts to a processor increases the total number of registers in the register file, which is likely to affect the clock rate or pipeline length. The access requirements for active and inactive registers are quite different. As a result of these differences, the design constraints on the IRB are considerably relaxed, compared to the register file. (We will use the term primary register file to emphasize that we are not referring to the IRB.) For a 4-wide processor design, the IRB has at most 4 ports (read/write), compared to 12 ports (8 read and 4 write) for the primary register file. It does not require bypassing, because the same locations are never written and then read close together in time. Also, it can tolerate being placed far from the core pipeline, and thus has fewer layout constraints. In regard to the last item, we model a 10 cycle (pipelined) access time for the IRB, implying its distance from the core is similar to the L2 cache, certainly further than the L1.

In addition, firmware context switching is well-suited to a processor with a unified register file for both architectural registers and for uncommitted results, as in [34, 11]. In that type of architecture, including those with separate floating-point and integer register files, an architectural register is not mapped to a fixed location in the register file, so saving or restoring it involves first consulting the renaming table. The alternative architecture, with a separate reorder buffer and commit register file, may allow for greater hardware support of context switching, but it requires a higher read bandwidth on the reorder buffer for a given level of instruction throughput, and is poorly suited to SMT.

Our firmware approach to context switching does not add additional ports to the register file, since the thread switching operations use the ordinary instruction path. In summary, the inactive register buffer adds no complexity to the core of the processor.
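The Least-Recently-Run selection and the switch-when-ready gating described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `Thread`, `LeastRecentlyRunScheduler`, and `swap` are hypothetical, and the sketch assumes that fetch is gated whenever the incoming thread is still waiting on memory (the text only states this for the case where all inactive threads are waiting).

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Thread:
    name: str
    waiting_on_memory: bool = False  # True while a main-memory miss is outstanding


class LeastRecentlyRunScheduler:
    """Hypothetical model of Least-Recently-Run thread selection."""

    def __init__(self, inactive_threads):
        # Left end of the deque holds the least recently run thread.
        self.inactive = deque(inactive_threads)

    def swap(self, outgoing):
        """Swap out the stalled thread; return (incoming, fetch_gated)."""
        outgoing.waiting_on_memory = True
        incoming = self.inactive.popleft()   # least recently run
        self.inactive.append(outgoing)       # outgoing is now most recently run
        # Assumption: gate fetch while the incoming thread's data has not
        # yet returned, so it cannot interfere with other active threads.
        fetch_gated = incoming.waiting_on_memory
        return incoming, fetch_gated


a, b, c = Thread("A"), Thread("B", waiting_on_memory=True), Thread("C")
scheduler = LeastRecentlyRunScheduler([b, c])
incoming, gated = scheduler.swap(a)
print(incoming.name, gated, [t.name for t in scheduler.inactive])
```

Because the queue order alone determines the next thread, the choice is known before any swap occurs, which matches the property the text relies on.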

3.2 Time Required to Swap Threads

In our simulations, with the baseline BMT configuration, a majority of firmware context switches take 60 cycles or less. However, there is considerable room for variation. This section describes the range of times required for each step of the context switch.

25 cycles to detect main memory access: If a load instruction does not complete execution in 25 cycles, then it is considered to be a main-memory access. This includes a 3 cycle load instruction latency, a 14 cycle L2 latency, and several extra cycles to account for contention when accessing the L2 cache. This is for the baseline memory architecture. For the other memory designs investigated in Section 5.4, this threshold is adjusted. In principle, this time could be reduced by an early reply from the L2 tag array, or by consulting a load-hit predictor. However, as we show in Section 5.6, switching prematurely can decrease memory parallelism by missing the opportunity to issue independent load misses in parallel with the first miss encountered. The 60 cycle figure above does not include these 25 cycles.

3–30 cycles to trigger flush: There is a 3-cycle minimum delay to trigger a flush in our model. However, older uncommitted instructions from the same thread may further delay the flush. In our simulations, the flush occurs after 3 cycles 64% of the time, within 15 cycles 94% of the time, and very rarely after more than 30 cycles. A flush could be triggered before the canceled load becomes the oldest instruction in its thread, but we found that the cost of unnecessary flushes caused by wrong path instructions outweighed the advantage of flushing sooner.

15 cycles for microcode to reach execute: Instructions can be fetched from the microcode control store immediately after the flush has been triggered. In the pipeline we model, there are 15 stages between fetch and execute.

~10 cycles to issue rsave instructions: The microprogram will contain 1–62 rsave instructions, depending on the number of dirty registers. There is considerable variation between benchmarks. Overall, though, on 50% of thread swaps, 20 or fewer registers had been modified, and on 90% of thread swaps, 40 or fewer had been modified. The rsave instructions compete to use the integer units with instructions from other active threads, but in the best case, 40 rsaves take 10 cycles to execute, 4 at a time.

~16 cycles to issue rrestore instructions: The microprogram concludes with 62 rrestore instructions to restore the registers of the new thread. These take at least 16 cycles to execute. For those 4 of the 16 benchmarks which do not use floating-point registers, there are only 31 rrestores.

≤ 10 cycles restore-use latency: After the microprogram is fetched, but concurrently with the execution of the rrestores, the processor fetches from the new thread. We model a 10 cycle latency for the rrestore instructions, and the execution of the rrestore instructions is fully pipelined. Depending on what registers are used first by the new thread, there will be a 0–10 cycle delay. This could be reduced by strategically reordering the rrestore instructions to match the order of their use by the new thread, based on the instructions previously flushed. Of course, the new thread may also incur an instruction cache miss.

Fetch                    ≤ 3 instructions per thread from ≤ 2 threads each cycle
Branch prediction        64Kbit 2bcGskew
Deep pipeline            22 stages, 16 cycle misprediction penalty
Out-of-order execution   48/32/20 entry integer/fp/memory instruction queues, which may issue
                         4 integer/mem instructions (≤ 2 mem) and 2 fp instructions each cycle
Instruction window       supports 128 in-flight instructions†
Memory system            32k 4-way 3 cycle L1 Instruction and Data caches (2 acc/cyc)
                         64 byte linesize for L1 caches
                         64 entry DTLB / 48 entry ITLB, fully associative
                         256 entry second level Data and Instruction TLBs
                         128 byte linesize for higher-level caches
                         2MB 8-way 14 cycle L2 cache (1 acc/cyc)†
                         500 cycle memory access time†
† Baseline parameter, different where noted.

Table 1. Simulated Processor Specifications.
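The per-step figures in this section can be combined into a rough estimate of a single context switch. The sketch below is only an illustrative additive model under stated assumptions: the steps are taken to run back to back, there is no contention from other threads, and the function names `swap_cycles` and `latency_left_to_hide` are hypothetical. The constants come from the text and from Table 1.

```python
import math

# Figures taken from the text of this section and from Table 1.
ISSUE_WIDTH = 4          # rsave/rrestore instructions executed per cycle, best case
FLUSH_MIN_CYCLES = 3     # minimum delay to trigger the flush
FETCH_TO_EXECUTE = 15    # pipeline stages between fetch and execute
DETECT_CYCLES = 25       # threshold for declaring a main-memory access
MEMORY_LATENCY = 500     # baseline memory access time


def swap_cycles(dirty_regs, restored_regs=62,
                flush_delay=FLUSH_MIN_CYCLES, restore_use=0):
    """Estimate cycles for one firmware context switch (additive model)."""
    rsave = math.ceil(dirty_regs / ISSUE_WIDTH)        # save dirty registers
    rrestore = math.ceil(restored_regs / ISSUE_WIDTH)  # restore incoming thread
    return flush_delay + FETCH_TO_EXECUTE + rsave + rrestore + restore_use


def latency_left_to_hide(switch_cycles):
    """Cycles of the miss still outstanding once the new thread can run."""
    return MEMORY_LATENCY - DETECT_CYCLES - switch_cycles


best = swap_cycles(dirty_regs=40)
print(best, latency_left_to_hide(best))
```

With 40 dirty registers and no extra delays, the model gives 3 + 15 + 10 + 16 = 44 cycles, which is consistent with the statement that a majority of switches take 60 cycles or less, and it leaves most of the 500 cycle miss available for the incoming thread to overlap.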

3.3 Common Architecture

The parameters common to all processor designs are shown in Table 1. We intend that these parameters represent a reasonable processor design one or two process generations from now, except that the cache sizes are somewhat smaller than might be projected. We chose relatively smaller cache sizes to match the memory footprint of the benchmarks we use.

The baseline SMT processors we evaluate implement the flush-on-cache-miss policy from [29], which makes more room in the instruction window for instructions from non-stalled threads. Thus, the miss detection and flushing capability required by BMT should not be viewed as an extra cost of our design.

We model a software TLB miss handler mechanism close to that used in the Alpha architecture [7] for all processor designs. For some workloads, page-table walks due to TLB misses represent a significant fraction of all main memory accesses, and a fraction which increases as more threads are run together. Therefore, we allow thread swaps to occur on the loads in the TLB miss trap handler routine. A system with a hardware TLB handler should be able to accommodate thread-swapping as well.

Name     Code  Input        Fast Forward Instructions (× 10^6)
ammp     0     -            2000
art      1     -startx 110  7500
crafty   2     -            700
eon      3     rushmeier    100
galgel   4     -            5000
gap      5     -            185330
gcc      6     166          2100
gzip     7     graphic      39300
mcf      8     -            12600
mesa     9     -            1300
mgrid    A     -            2100
parser   B     -            400
perl     C     makerand     10000
twolf    D     -            900
vortex   E     2            6000
vpr      F     route        36100

Table 2. Benchmarks.

Each entry is a workload name followed by the codes of its benchmarks (see Table 2).

2A 01    2B 12    2C 23    2D 34    2E 45    2F 56    2G 67    2H 78
2I 89    2J 9A    2K AB    2L BC    2M CD    2N DE    2O EF    2P F
3A 012   3B 345   3C 678   3D 9AB   3E CDE   3F 024   3G 68A   3H CEF
3I 135   3J 79B   3K DF
4A 0123  4B 4567  4C 89AB  4D CDEF  4E 0246  4F 8ACE  4G 1357  4H 9BDF
6A 012345    6B 6789AB    6C ABCDEF    6D 0369EF    6E 147C
8A 01234567  8B 89ABCDEF  8C 02468ACE  8D 13579BDF
10A 0123456789    10B 456789ABCD    10C 89ABCDEF      10D CDEF
12A 456789ABCDEF  12B 012389ABCDEF  12C 01234567CDEF  12D 0123456789AB
16A 0123456789ABCDEF

Table 3. Workloads.
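Since each workload in Table 3 is written as a string of hexadecimal benchmark codes, expanding a workload into benchmark names is a simple lookup against Table 2. The helper below is a small sketch; the name `decode_workload` is hypothetical, and the benchmark order is copied from Table 2.

```python
# Benchmarks in code order 0..F, as listed in Table 2.
BENCHMARKS = (
    "ammp art crafty eon galgel gap gcc gzip "
    "mcf mesa mgrid parser perl twolf vortex vpr"
).split()


def decode_workload(codes):
    """Map a Table 3 code string (hex digits) to Table 2 benchmark names."""
    return [BENCHMARKS[int(ch, 16)] for ch in codes if not ch.isspace()]


print(decode_workload("0123"))  # workload 4A
print(decode_workload("9BDF"))  # workload 4H
```

For example, workload 4A (codes 0123) runs ammp, art, crafty, and eon together.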

4 Methodology

We evaluate each design alternative by simulation. For each design, we simulate workloads of different sizes. For each workload size, we present the average of several different workloads. Each workload is composed of a subset of the SPEC2000 benchmarks.

We perform all simulations using a detailed, execution-driven simulator, based on SMTSIM [28]. The simulator executes Alpha binaries which are compiled with the DEC C (-O4) or Fortran (-O5) compiler. We added a software TLB miss handler that closely models the Alpha architecture PALCode TLB trap handler.

The speedup results we present are meant to be an estimate of the overall improvement in throughput for a system which continuously runs the 16 benchmarks shown in Table 2, as compared to a single-threaded system. We simulate a portion of each benchmark. With the assistance of SimPoint [22], we select a starting point for simulation within each benchmark. Using the multiple simulation point algorithm, we select a phase in each benchmark that represents the largest amount of execution.


a long-latency load [29]. We present this to emphasize the importance of having such a mechanism in any multithreading processor with a shared instruction window.

The points labeled BMT-m/n represent BMT designs with Cphys = m and Cvirt ≥ n, running workloads of n threads. The BMT-2/4 processor gets 26% more throughput than an SMT-2 processor, while running at the same clock speed with the same pipeline depth. A 4-context SMT processor gets 17% more throughput when enhanced with BMT, assuming 8 jobs are ready to run.

We model the same pipeline depth and cycle time for all SMT and BMT configurations. As additional hardware contexts are added, keeping the pipeline depth or cycle time constant is unlikely, but our focus is on comparisons between SMT and BMT configurations with the same number of physical contexts. Because the speedup is not adjusted for these effects, care should be taken when comparing points with different values of Cphys. For example, while the 4-context SMT processor shows 54% higher throughput than a 2-context SMT processor when 4 threads are available, differences in the pipeline and/or clock rate between those two designs mean that the relative throughput of the 4-context SMT processor will be lower than that number.

Even ignoring complexity differences, however, the additional benefit of our approach is significant. Regardless of whether a 2, 4 or 6 context SMT processor design is the best choice for particular technology and performance goals, BMT can be added to boost throughput without affecting pipeline complexity. Additionally, these results assume all physical contexts are filled. When there are fewer threads than contexts, the advantage of the BMT designs over SMT is even greater.

This figure also shows the performance of CGMT alone. It provides only marginal gains over a single-threaded processor. Because of the high cost of moving state in and out of the processor core, CGMT alone is of less value. But when CGMT is added to SMT, the additional physical contexts can do useful work while a context switch is underway, hiding the cost of the switch.

5.2 Scalability of Balanced Multithreading

Adding more threads to a processor can increase performance by increasing memory parallelism. However, with too many threads, the benefits can be outweighed by the cost of contention between threads. In this section, we investigate how well different BMT designs perform, compared to SMT designs, as the virtual-to-physical context ratio, Cvirt/Cphys, increases.

The firmware mechanism to swap threads in and out of the processor core has two costs. First, the time required to complete the context switch delays the start of execution of the incoming thread. Second, the firmware save/restore instructions contend with other active threads for execution resources. To understand the cost of the firmware context switching mechanism, we compare the performance of the firmware mechanism with a hypothetical instant save/restore mechanism.

Figure 2. Speedup versus workload size. [Plot: weighted speedup (y-axis, 1 to 3) versus number of threads in workload, T (x-axis: 2, 3, 4, 6, 8, 10, 12, 16); curves for SMT and for BMT with 2, 4, and 6 physical contexts, each BMT design with firmware and instant thread swapping.]

Figure 2 shows the weighted speedup of several different SMT and BMT designs. The x-axis shows the number of threads in a workload, n, which is assumed to be equal to Cvirt for this study. The y-axis shows weighted speedup of each design compared to a single-thread processor. On the curve where the points are labeled SMT-n, the points represent SMT processors capable of running workloads of n threads together. There are three sets of curves for BMT designs with 2, 4, or 6 physical contexts. Within each set, there is a curve labeled firmware, for a processor using the firmware thread swapping mechanism, and a curve labeled instant, which represents a processor with an idealized, nearly instantaneous thread-swapping mechanism. The instant mechanism requires only 1 cycle to save and restore the architectural registers of the outgoing and incoming threads, once the miss-to-memory is detected and a thread is flushed.

Figure 2 illustrates two effects. First, for each value of Cphys, there is an optimal value of Cvirt/Cphys. Second, as Cphys increases, the relative cost of the firmware thread swapping mechanism increases too. The figure shows that the gain from BMT peaks when Cvirt/Cphys = 2. When the ratio is larger than 2, the costs of running multiple threads begin to outweigh the benefits. For a BMT processor, that cost has two components: the cost of thread swapping and the cost of interference between threads. The curves labeled instant, while being perhaps impractical, show the relative contribution of these two effects.

Figure 3. Performance of BMT with different levels of hardware support. [Plot: weighted speedup (y-axis, 1 to 3) versus number of threads in workload (x-axis: 2, 3, 4, 6, 8); series: SMT, BMT, BMT w/o IRB, BMT w/o DRM.]

When n is small, the cost of swapping is low. The cost of thread swapping comes from contention for instruction queue space and load/store ports from the thread-swapping instructions. Thus, at the BMT-2 design point, there is little reason to try to further optimize the thread swapping mechanism, but for BMT-6, there is an incentive to improve it.

For larger values of Cvirt/Cphys and larger n, the benefit from increased memory parallelism is outweighed by a loss of locality in the higher level caches. The loss of locality is caused by having many threads in the workload. The optimal Cvirt/Cphys ratios suggested by this graph are for an average over many workloads, but will vary with the particular threads running. This represents an opportunity to further improve performance by adaptively sizing the number of threads in a workload based on the behavior of the constituent threads.
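The portion of the paper reproduced here does not define weighted speedup. One common definition in the SMT literature sums, over the threads in a workload, each thread's IPC when run together divided by its IPC when run alone on the single-threaded baseline. Assuming that definition (it may differ in detail from the one used for these results), the metric can be sketched as follows; the function name and the IPC values are hypothetical and chosen only for illustration.

```python
def weighted_speedup(ipc_together, ipc_alone):
    """Sum of per-thread IPC ratios (assumed definition, see lead-in)."""
    if len(ipc_together) != len(ipc_alone):
        raise ValueError("one IPC value per thread is required in each list")
    return sum(t / a for t, a in zip(ipc_together, ipc_alone))


# Hypothetical IPC values for a two-thread workload, not measured data.
together = [0.5, 1.0]
alone = [1.0, 2.0]
print(weighted_speedup(together, alone))
```

Under this definition a value of 1 means the multithreaded design delivers the same aggregate progress as the single-threaded baseline, so each thread's slowdown is weighed against the added throughput from the others.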

5.3 Hardware Support for Thread Swapping

The previous section compared the performance of our baseline thread swapping mechanism with a hypothetical one-cycle latency thread swapping mechanism. Our baseline mechanism already includes some optimizations to reduce swapping latency. This section evaluates two of those optimizations: the Dirty Register Mask (DRM) and the Inactive Register Buffer (IRB). The DRM, discussed in Section 3.1, allows the thread swap to save only those register values that have been touched. The IRB may be considered an optimization compared to a purely software thread swap, where a context’s state is stored using conventional loads and stores.

Figure 3 shows the performance of BMT processors with 2 or 4 physical contexts, with varying levels of hardware support for thread swapping, and of SMT processors with 2–6 contexts. The two BMT features are not important for the BMT- processor, but are important for the BMT-4 processor. Of the two, the DRM is more important. The benefit of the dirty-register mask increases as more threads are run because of the greater contention for functional units. A lesser effect may be that programs are swapped in for less time when more threads are present, and thus have time to dirty fewer registers.

Without an IRB, inactive registers could be stored directly into memory (where they would typically be caught by the cache). Thus, for the no-IRB configuration, the save-restore instructions use the load/store units, which halves the rate at which they may issue. In the no-IRB configuration, if a miss occurs in the thread-swap microcode, the thread waits instead of performing a second swap. Because such misses are uncommon in the BMT-2 configurations, there is little performance impact. With a larger workload size, the IRB is important for good performance.
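The idea behind the Dirty Register Mask can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware mechanism of Section 3.1: the class `Context`, the function `swap_out`, and the list used as a backing store are hypothetical names introduced here, and the register count of 62 is taken from the programmer-visible state size quoted in Section 5.7.

```python
class Context:
    """Hypothetical model of one thread's architectural registers."""

    def __init__(self, num_regs=62):
        self.regs = [0] * num_regs
        # Bit i is set when register i has been written since the last swap.
        self.dirty_mask = 0

    def write(self, index, value):
        self.regs[index] = value
        self.dirty_mask |= 1 << index


def swap_out(ctx, backing_store):
    """Copy only dirty registers to backing_store; return the number of saves."""
    saves = 0
    mask = ctx.dirty_mask
    while mask:
        low_bit = mask & -mask            # isolate the lowest set bit
        index = low_bit.bit_length() - 1  # position of that bit
        backing_store[index] = ctx.regs[index]
        saves += 1
        mask ^= low_bit                   # clear the bit just handled
    ctx.dirty_mask = 0
    return saves
```

In this model a thread that wrote two distinct registers while it was swapped in costs two save operations instead of 62, which is the kind of saving that matters more as contention for issue resources grows.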

5.4 Sensitivity to Memory Hierarchy

The speedup provided by balanced multithreading is sensitive to three parameters of the memory hierarchy: the size of the caches, the latency to access the lowest level of cache, and the latency to main memory. Figure 4 shows the performance of SMT and BMT with different memory configurations. Each group of bars shows the performance of different processor designs with the same memory hierarchy. All configurations have the L1 caches described in Table 1, but the lower levels of the hierarchy are varied. The configurations were chosen to study the sensitivity to individual memory-system parameters.

The y-axis represents weighted speedup. For each group of bars, the speedup is computed relative to a single-threaded processor with the same memory hierarchy. As a result, the speedup for a design with a larger cache hierarchy may be less than that for a design with a smaller cache.

The best Cvirt/Cphys ratio for a BMT system depends on the memory system, so we show two BMT configurations next to each corresponding SMT processor design. Above each group of bars is shown the speedup of the better of the two BMT bars over the adjacent SMT bar. All three of those bars have the same Cphys. For example, the first group of bars, labeled Base, represents the memory configuration used for all previous results in this paper: a 500-cycle memory latency and a 14-cycle 2MB L2 cache. As noted in the plot, a BMT-2/4 design gets 26% speedup over an SMT-2 processor, and a BMT-4/8 gets 16% speedup over an SMT-4 design.

Running more threads at the same time has a cost and a benefit. Part of the cost is from increased contention in the caches, predictors and other structures. The benefit is an increase in the number of parallel memory accesses. Changes to the memory parameters shift these costs and benefits.

A larger cache, as in Big $, reduces the number of opportunities to use coarse-grained thread switching. Also, a slower cache increases the cost of misses caused by cache contention, and increases the latency before the processor can detect a main memory access. This is illustrated by the lower

Proceedings of the 37th International Symposium on Microarchitecture (MICRO-37 2004)

[Figure 6 plot. x-axis: Processor Configuration (BMT-2/3, BMT-2/4, BMT4/6, BMT4/); y-axis: Δ Weighted Speedup; legend (values of l): 5, 15, 25, 35, 65, 95, Per-T]

Figure 6. Performance of BMT processors with different delays to initiate swapping on a miss.

ing a load miss and switching sooner may improve performance, since the next thread begins executing sooner. However, flushing a thread too soon can prevent the execution of a second load instruction, which would otherwise initiate a second, parallel, main-memory access.

We evaluated the performance of 4 BMT designs with different values for the load-execution to miss-detection latency, l. Those results are shown in Figure 6. The value of l is indicated in the legend. The y-axis shows the change in weighted speedup for a given design when l is changed from its baseline value of 25. Note that detecting an L2 miss after only 5 cycles would require either checking the L2 tags very quickly, or a load hit predictor. Fortunately, detecting a miss sooner actually decreases throughput. For example, if L2 misses could be detected 5 cycles after a load first executed, the weighted speedup of BMT-2/3 would drop by 0. (from 1.70, as indicated in Figure 1 or Table 4, to 1.65). In all cases, increasing l to 35 increases the throughput of BMT. However, with larger workloads, higher values of l may reduce throughput. With a larger workload, it is more likely that there is a ready-to-run thread waiting to be swapped in.

The best single value of l depends on the number of virtual and physical contexts, and the particular set of benchmarks. However, an even better policy would be one which sets a different value of l for each thread. It is profitable to delay swapping out a thread if it is likely that additional main memory accesses can be initiated by waiting. As illustrated in Figure 7, benchmarks differ considerably in the number of main-memory accesses that may occur in parallel. There is one subgraph for each benchmark we use. The y-axis shows the probability that no additional main-memory accesses will be initiated following a load which misses in the L2 cache. The x-axis shows time in cycles after the first miss. Note that only subsequent accesses to different cache lines are counted. For perl and ammp, when a load misses in the L2, it is highly unlikely that subsequent loads will initiate additional memory activity, so

[Figure 7 plots. One panel per benchmark: ammp, art-, crafty, eon-rushmeier, galgel, gap, gcc-166, gzip-graphic, mcf, mesa, mgrid, parser, perlbmk-makerand, twolf, vortex-, vpr-route; x-axis: 0 to 60 cycles; y-axis: 0 to 1]

Figure 7. Probability, y, at time x after executing a load instruction which misses in L2, that no further main-memory accesses will be initiated.

those threads should be swapped out as soon as possible. For gcc, even 60 cycles following an L2 miss, it is quite likely that additional misses will occur before the first miss completes, so gcc should be swapped out after a longer delay.

We evaluated a static, per-thread swap-delay policy. This is shown as the bar labeled Per-T in Figure 6. For this policy, all threads are swapped out on a load which takes more than 20 cycles, except art, galgel, gap, gcc, and vpr, for which l = 80. In all 4 cases, the Per-T policy performs better than any single value of l. With this policy, BMT-2/4 gets an additional 3% speedup over single-thread execution.

We present the Per-T policy to show that there is benefit from a dynamic policy which detects which threads have high memory-level parallelism. To implement such a policy, l could be held in a counter which is periodically set to a high value, and which is decremented each time no concurrent misses occur.
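The counter-based dynamic policy suggested above could look roughly like the following. This is a minimal sketch, not an evaluated design: the class name `SwapDelayCounter`, the decrement `step`, the floor `low`, and the choice to measure the reset period in observed L2 misses are all assumptions made for illustration. The defaults of 80 and 20 cycles simply reuse the two values of l from the static Per-T policy.

```python
class SwapDelayCounter:
    """Hypothetical per-thread counter holding the swap delay l (in cycles)."""

    def __init__(self, high=80, low=20, step=5, reset_period=1000):
        self.high = high                  # value the counter is reset to
        self.low = low                    # assumed floor for the delay
        self.step = step                  # assumed decrement size
        self.reset_period = reset_period  # assumed period, in L2 misses
        self.delay = high
        self.misses_since_reset = 0

    def on_l2_miss(self, had_concurrent_miss):
        """Record one L2 miss and return the delay to use for the next one."""
        self.misses_since_reset += 1
        if self.misses_since_reset >= self.reset_period:
            # Periodic reset gives the thread a chance to show parallelism again.
            self.delay = self.high
            self.misses_since_reset = 0
        elif not had_concurrent_miss:
            # Waiting bought nothing this time, so wait less next time.
            self.delay = max(self.low, self.delay - self.step)
        return self.delay
```

Under these assumptions, a thread that rarely overlaps misses drifts toward the short delay and is swapped out quickly, while a thread with frequent concurrent misses keeps a long delay until the next reset.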

5.7 Quantifying the Cost of Additional Registers

The weighted speedup results presented in this paper do not reflect any cycle time or pipeline length penalties that may arise from adding physical contexts to a processor. In this section, we attempt to quantify the cost of adding additional physical contexts to an SMT processor, as opposed to adding virtual contexts. Table 4 lists the different architectures studied in previous


Type    Cp   n    WSU    Rren   Rphys   tacc   nstg
Uni     1    1    1.00   128    190     0.46   5
SMT     2    2    1.46   128    252     0.58   6
        3    3    1.89   ”      314     0.60   7
        4    4    2.26   ”      376     0.62   7
        6    6    2.73   ”      500     0.65   7
        8    8    3.07   ”      628     0.72   8
BMT-2   2    3    1.70   128    252     0.58   6
        2    4    1.84   ”      ”       ”
        2    6    1.76   ”      ”       ”
BMT-4   4    6    2.57   128    376     0.62   7
        4    8    2.64   ”      ”       ”
BMT-6   6    12   2.93   128    500     0.65   7

Cp: the number of physical contexts
n: the number of threads in workload
WSU: the weighted speedup
Rren: the number of additional registers for renaming
Rphys: the number of physical registers
tacc: the register file access time (ns)
nstg: the estimated number of stages at 10 GHz

Table 4. Performance and register file speed.

sections. The speedups shown are for the base memory configuration (see Table 1). The last two columns show estimates of the register file access times for different architectures and an estimate of the number of clock cycles that it would require if pipelined at 10 GHz. By this estimate, 3 additional pipeline stages would be needed for an 8-context SMT processor, compared to an otherwise similar 1-context processor. Our access time estimates do not quantify several additional costs of additional contexts. A slower register file read time can add stages between issue and execute, which complicates scheduling. A slower register file write time requires additional hardware to hold bypassed results longer. And a larger register file in turn increases the size of the renaming table. Also, the additional pipeline stages required to tolerate a larger register file fall in a particularly inopportune place in the pipeline. Lengthening the pipeline at this point increases load hit misspeculation penalties [4].
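As a rough cross-check, the weighted speedups in Table 4 can be used to recompute the gain of each BMT design over the SMT design with the same number of physical contexts. The sketch below only copies values from the table; the helper names `gain_over_smt` and `best_workload_size` are hypothetical and introduced here for illustration.

```python
# Weighted speedups copied from Table 4 (base memory configuration).
SMT_WSU = {2: 1.46, 4: 2.26, 6: 2.73}   # keyed by physical contexts
BMT_WSU = {                              # keyed by (physical contexts, threads)
    (2, 3): 1.70, (2, 4): 1.84, (2, 6): 1.76,
    (4, 6): 2.57, (4, 8): 2.64,
    (6, 12): 2.93,
}


def gain_over_smt(c_phys, n_threads):
    """Fractional gain of BMT-c_phys/n_threads over SMT-c_phys."""
    return BMT_WSU[(c_phys, n_threads)] / SMT_WSU[c_phys] - 1.0


def best_workload_size(c_phys):
    """Thread count with the highest listed weighted speedup for c_phys."""
    candidates = {n: w for (c, n), w in BMT_WSU.items() if c == c_phys}
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    for c_phys in (2, 4):
        n = best_workload_size(c_phys)
        print(f"BMT-{c_phys}/{n}: {gain_over_smt(c_phys, n):.1%} over SMT-{c_phys}")
```

The ratios come out at a little over 26% for BMT-2/4 and a little under 17% for BMT-4/8, in line with the 26% and 16% figures quoted in Section 5.4.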

In previous sections, we simulate processors with a large instruction window. The instruction window requires 128 registers beyond those required to hold programmer-visible state (which is 62 per physical context). An alternate way to reduce the size of the register file is to provide fewer of these additional registers. Doing this does not negate the benefit of BMT. If reducing the size of the instruction window increases the performance of the processor, or makes room for additional physical contexts, then BMT can still be used. Figure 8 shows that a BMT-2/4 processor configuration beats an SMT-2 processor configuration, with fewer additional registers for renaming (a smaller instruction window).
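The register file sizes discussed here follow from simple arithmetic: 62 programmer-visible registers per physical context plus the renaming registers. The sketch below assumes exactly that formula; the function name `physical_registers` is hypothetical. It matches the Rphys column of Table 4 for one to six contexts, while the listed 8-context value (628) is slightly higher than the formula gives, so it is best read as an approximation.

```python
ARCH_REGS_PER_CONTEXT = 62  # programmer-visible state per physical context


def physical_registers(c_phys, r_ren=128):
    """Approximate register file size for c_phys physical contexts."""
    return ARCH_REGS_PER_CONTEXT * c_phys + r_ren


if __name__ == "__main__":
    # Virtual contexts add no registers: BMT-2/4 has the SMT-2 register file.
    print("SMT-2 or BMT-2/4:", physical_registers(2))
    print("SMT-4:", physical_registers(4))
    # Shrinking the instruction window is the alternative lever.
    print("SMT-2 with 64 renaming registers:", physical_registers(2, r_ren=64))
```

Under this formula, two extra physical contexts cost 124 registers, whereas two extra virtual contexts cost none in the timing-critical register file.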

[Figure 8 plot. x-axis: Number of Renaming Registers, Rren (64, 96, 128); y-axis: Weighted Speedup; series: ST, SMT−, BMT−2/]

Figure 8. Speedup vs instruction window size. Weighted speedup is relative to single-thread execution with 128 renaming registers.

6 Conclusions

This paper explores the benefits of adding coarse-grained threading support to an SMT processor, creating an architecture we call Balanced Multithreading. SMT allows the processor to tolerate even the smallest latencies. CGMT is sufficient to tolerate long memory latencies. We present a form of CGMT which requires no changes to timing-critical processor resources such as the register file and the renaming table. The combination of the two results in a processor that provides high single thread performance via a high clock rate, shorter pipeline and high instruction-level parallelism; and high memory parallelism and thread-level parallelism when more threads are available.

We evaluate the combination of CGMT and SMT, over a range of workload sizes, memory configurations, and several context-switching optimizations, including a method for reducing register saves. We find that in the face of long memory latencies, balanced multithreading can provide instruction throughput similar to a wide SMT processor, but without many of the hardware costs. In particular, we show that by adding support for balanced multithreading, the throughput of an SMT processor can be improved by 26%, with no significant changes to the core of the processor, the cycle time, or the pipeline.

7 Acknowledgements

We would like to thank the anonymous reviewers for their comments. Eric Tune was supported by an Intel Foundation Fellowship. Other support for this research came from NSF grants CCR-0311683 and CCR-0105743, and a grant from Intel.

References

[1] A. Agarwal, J. Kubiatowicz, D. Kranz, B.-H. Lim, D. Yeung, G. D’Souza, and M. Parkin. Sparcle: An evolutionary proces-
