Balanced Multithreading: Increasing Throughput via a Low Cost Multithreading Hierarchy

Eric Tune, Rakesh Kumar, Dean M. Tullsen, Brad Calder

Computer Science and Engineering Department

University of California at San Diego

{etune,rakumar,tullsen,calder}@cs.ucsd.edu

Abstract

A simultaneous multithreading (SMT) processor can issue instructions from several threads every cycle, allowing it to effectively hide various instruction latencies; this effect increases with the number of simultaneous contexts supported. However, each added context on an SMT processor incurs a cost in complexity, which may lead to an increase in pipeline length or a decrease in the maximum clock rate. This paper presents new designs for multithreaded processors which combine a conservative SMT implementation with a coarse-grained multithreading capability. By presenting more virtual contexts to the operating system and user than are supported in the core pipeline, the new designs can take advantage of the memory parallelism present in workloads with many threads, while avoiding the performance penalties inherent in a many-context SMT processor design. A design with 4 virtual contexts, but which is based on a 2-context SMT processor core, gains an additional 26% throughput when 4 threads are run together.

1 Introduction

The ratio between main memory access time and core clock rates continues to grow. As a result, a processor pipeline may be idle during much of a program's execution. A multithreading processor can maintain a high throughput despite large relative memory latencies by executing instructions from several programs. Many models of multithreading have been proposed. They can be categorized by how close together in time instructions from different threads may be executed, which affects how the state for different threads must be managed. Simultaneous Multithreading [31, 30, 12, 33] (SMT) is the least restrictive model, in that instructions from multiple threads can execute in the same cycle. This flexibility allows an SMT processor to hide stalls in one thread by executing instructions from other threads. However, the flexibility of SMT comes at a cost. The register file and rename tables must be enlarged to accommodate the architectural registers of the additional threads. This in turn can increase the clock cycle time and/or the depth of the pipeline.

Coarse-grained multithreading (CGMT) [1, 21, 26] is a more restrictive model where the processor can only execute instructions from one thread at a time, but where it can switch to a new thread after a short delay. This makes CGMT suited for hiding longer delays. Soon, general-purpose microprocessors will be experiencing delays to main memory of 500 or more cycles. This means that a context switch in response to a memory access can take tens of cycles and still provide a considerable performance benefit. Previous CGMT designs relied on a larger register file to allow fast context switches, which would likely slow down current pipeline designs and interfere with register renaming. Instead, we describe a new implementation of CGMT which does not affect the size or design of the register file or renaming table.
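To make the arithmetic behind that claim concrete, the following is a minimal sketch under a deliberately simple assumption: each miss to main memory costs one fixed swap overhead, and the remaining cycles of the miss are available to another thread. The function name `usable_fraction` and the sample switch costs are illustrative choices, not values taken from this paper.

```python
def usable_fraction(miss_latency: int, switch_cost: int) -> float:
    """Fraction of a miss's latency that another thread can still use
    after the context switch has been paid for (assumed toy model)."""
    if miss_latency <= 0:
        raise ValueError("miss_latency must be positive")
    # Cycles left over once the switch overhead is subtracted; never negative.
    hidden = max(miss_latency - switch_cost, 0)
    return hidden / miss_latency


# A 500-cycle miss with hypothetical switch costs of 10, 30 and 50 cycles.
for cost in (10, 30, 50):
    print(cost, usable_fraction(500, cost))
```

Under this model, even a switch costing a few tens of cycles leaves most of a 500-cycle miss available for useful work, which is why a relatively slow switch mechanism can still pay off.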

We find that CGMT alone, triggered only by main-memory accesses, provides unimpressive increases in performance because it cannot hide the effect of shorter stalls in a single thread. However, CGMT and SMT complement each other very well. A design which combines both types of multithreading provides a balance between support for hiding long and short stalls, and a balance between high throughput and high single-thread performance. We call this combination of techniques Balanced Multithreading (BMT).

This combination of multithreading models can be compared to a cache hierarchy, which results in a multithreading hierarchy. The lowest level of multithreading (SMT) is small (fewer contexts), fast, expensive, and closely tied to the processor cycle time. The next level of multithreading (CGMT) is slower, potentially larger (fewer limits to the number of contexts that can be supported), cheaper, and has no impact on processor cycle time or pipeline depth.
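The two-level idea can be illustrated with a toy model. This is only a sketch under strong assumptions and not the design evaluated in this paper: every physical context retires one instruction per cycle (sharing of issue bandwidth between SMT contexts is not modeled), every thread takes a long-latency miss after a fixed number of instructions, and a stalled thread is swapped out only if another thread is ready to run. All names and parameters (`Thread`, `Slot`, `simulate`, `miss_every`, `switch_cost`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    name: str
    miss_every: int = 100  # hypothetical: one long-latency miss per 100 instructions
    retired: int = 0       # instructions completed so far
    ready_at: int = 0      # cycle at which the outstanding miss returns


@dataclass
class Slot:
    thread: Thread         # thread currently occupying this physical context
    busy_until: int = 0    # the slot does no work before this cycle


def simulate(num_threads, physical_contexts, cycles,
             miss_latency=500, switch_cost=30):
    """Total instructions retired by the toy two-level scheme."""
    threads = [Thread(f"T{i}") for i in range(num_threads)]
    slots = [Slot(t) for t in threads[:physical_contexts]]
    waiting = threads[physical_contexts:]  # virtual contexts outside the core
    for cycle in range(cycles):
        for slot in slots:
            if cycle < slot.busy_until:
                continue  # stalled on a miss or still swapping
            running = slot.thread
            running.retired += 1
            if running.retired % running.miss_every != 0:
                continue
            # Long-latency miss: look for a waiting thread that is ready.
            running.ready_at = cycle + miss_latency
            replacement = next(
                (t for t in waiting if t.ready_at <= cycle), None)
            if replacement is None:
                # Nothing to swap in, so stall in place until the miss returns.
                slot.busy_until = running.ready_at
            else:
                waiting.remove(replacement)
                waiting.append(running)
                slot.thread = replacement
                slot.busy_until = cycle + switch_cost
    return sum(t.retired for t in threads)


baseline = simulate(num_threads=2, physical_contexts=2, cycles=6000)
balanced = simulate(num_threads=4, physical_contexts=2, cycles=6000)
print(baseline, balanced)
```

In this model, four virtual contexts on two physical contexts retire more instructions than two threads on the same two contexts, because part of each miss is overlapped with work from a swapped-in thread. The absolute numbers depend entirely on the assumed parameters and say nothing about the results reported in this paper.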

In our design, the operating system sees more virtual contexts than are supported in the core pipeline. These virtual contexts are controlled by a mechanism to quickly switch between threads on long latency load misses. The method we

Proceedings of the 37th International Symposium on Microarchitecture (MICRO-37 2004)

Abstract

1 Introduction

2 Related Work

2.1 Coarse-Grain Multithreading

2.2 Simultaneous Multithreading

3 Architecture

3.2 Time Required to Swap Threads

3.3 Common Architecture

4 Methodology

5.2 Scalability of Balanced Multithreading

5.3 Hardware Support for Thread Swapping

5.4 Sensitivity to Memory Hierarchy

5.7 Quantifying the Cost of Additional Registers

6 Conclusions

7 Acknowledgements

References