Multithreading: Motivation, Architectures, and Performance Implications, Slides of Computer Science

These slides present the motivation for multithreaded architectures, discussing why processors fail to execute code at their hardware potential. They introduce multithreaded processors as a way to increase instruction throughput and address multiple causes of processor stalling. They cover two styles of traditional multithreading, coarse-grain and fine-grain, and a third style, simultaneous multithreading (SMT), and discuss the performance implications of multithreading for various workloads.

Typology: Slides

2012/2013

Uploaded on 03/22/2013

dhirendra

Spring 2007 CSE 471 - Multithreading 1

Motivation for Multithreaded Architectures

Processors not executing code at their hardware potential

  • late 70’s: performance lost to memory latency
  • 90’s: performance not in line with the increasingly complex parallel hardware as well
    • increase in instruction issue bandwidth
    • increase in number of functional units
    • out-of-order execution
    • techniques for decreasing/hiding branch & memory latencies
  • Still, processor utilization was decreasing & instruction throughput not increasing in proportion to the issue width

Motivation for Multithreaded Architectures

Major cause is the lack of instruction-level parallelism in a single executing thread.

Therefore the solution has to be more general than building a smarter cache or a more accurate branch predictor.

Multithreaded Processors

Multithreaded processors can increase the pool of independent instructions & consequently address multiple causes of processor stalling

  • holds processor state for more than one thread of execution
    • registers
    • PC
    • each thread’s state is a hardware context
  • execute the instruction stream from multiple threads without software context switching
  • utilize thread-level parallelism (TLP) to compensate for a lack of ILP
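
The idea of keeping several hardware contexts resident can be illustrated with a toy model. This is only a sketch under stated assumptions: `HardwareContext`, `ToyCore`, `busy_cycles`, the one-instruction-per-cycle issue limit, and the latency values are all invented for illustration and do not describe any real processor.

```python
from dataclasses import dataclass, field


@dataclass
class HardwareContext:
    """Per-thread state held in hardware: a PC and a private register file."""
    thread_id: int
    pc: int = 0
    registers: list = field(default_factory=lambda: [0] * 8)
    ready_at: int = 0  # first cycle at which the next instruction may issue


class ToyCore:
    """Hypothetical core that issues at most one instruction per cycle."""

    def __init__(self, programs):
        self.programs = programs  # one list of instruction latencies per thread
        self.contexts = [HardwareContext(i) for i in range(len(programs))]

    def run(self, cycles):
        trace = []
        n = len(self.contexts)
        for cycle in range(cycles):
            issued = None
            for offset in range(n):  # rotate the starting thread for fairness
                ctx = self.contexts[(cycle + offset) % n]
                program = self.programs[ctx.thread_id]
                if ctx.pc < len(program) and ctx.ready_at <= cycle:
                    ctx.ready_at = cycle + program[ctx.pc]
                    ctx.registers[0] += 1  # stand-in for real work
                    ctx.pc += 1
                    issued = ctx.thread_id
                    break  # switching threads saves or restores no state
            trace.append(issued)  # None marks a wasted cycle
        return trace


def busy_cycles(trace):
    """Count the cycles in which some thread issued an instruction."""
    return sum(1 for t in trace if t is not None)


PROGRAM = [3, 1, 3, 1]  # assumed latencies: long loads alternating with adds
one_thread = ToyCore([PROGRAM]).run(8)
two_threads = ToyCore([PROGRAM, PROGRAM]).run(8)
print(busy_cycles(one_thread), busy_cycles(two_threads))
```

In this model the second thread fills cycles that the first thread spends waiting on a long-latency instruction, which is the sense in which TLP compensates for missing ILP.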

Comparison of Issue Capabilities

Simultaneous Multithreading (SMT)

Third style of multithreading, different concept

  1. simultaneous multithreading (SMT)
    • issues multiple instructions from multiple threads each cycle
    • no hardware context switching
    • same-cycle multithreading
    • huge boost in instruction throughput with less degradation to individual threads
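
The difference in issue capability can be sketched with a small counting model. Everything here is an assumption made for illustration: the issue width of 4, the function names, and the `ready` table, which lists how many independent instructions each of two threads has available in each cycle.

```python
ISSUE_WIDTH = 4  # assumed issue width for this sketch


def superscalar(ready, width=ISSUE_WIDTH):
    """Single-threaded: only thread 0 can fill issue slots."""
    return sum(min(width, cycle[0]) for cycle in ready)


def fine_grain(ready, width=ISSUE_WIDTH):
    """One thread per cycle, chosen round-robin."""
    n = len(ready[0])
    return sum(min(width, cycle[i % n]) for i, cycle in enumerate(ready))


def smt(ready, width=ISSUE_WIDTH):
    """Slots within one cycle may be filled from several threads."""
    return sum(min(width, sum(cycle)) for cycle in ready)


# ready[c][t] = independent instructions thread t could issue in cycle c
ready = [
    (2, 1),
    (0, 3),
    (1, 2),
    (3, 1),
]

total_slots = ISSUE_WIDTH * len(ready)
for name, model in [("superscalar", superscalar),
                    ("fine-grain", fine_grain),
                    ("SMT", smt)]:
    print(f"{name}: {model(ready)} of {total_slots} issue slots used")
```

With this made-up table, switching threads between cycles recovers only the cycles in which the chosen thread would otherwise be idle, while filling slots from both threads in the same cycle also recovers the unused slots within each cycle.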

Comparison of Issue Capabilities

Cray (Tera) MTA

Goals

  • the appearance of uniform memory access
  • lightweight synchronization
  • heterogeneous parallelism

Cray (Tera) MTA

Interesting features

  • Trade-off between avoiding memory bank conflicts & exploiting spatial locality for data
  • conflicts:
    • memory distributed among hardware contexts
    • memory addresses are randomized to avoid conflicts
      • want to fully utilize all memory bandwidth
  • locality:
    • run-time system can confine consecutive virtual addresses to a single (close-by) memory unit
      • used mainly for the stack
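
Why randomizing addresses helps with bank conflicts can be seen in a short sketch. The bank count, the multiplicative hash, and the function names are assumptions chosen for illustration; the slides do not say how the MTA actually scrambles addresses.

```python
from collections import Counter

NUM_BANKS = 64  # assumed bank count (the hash below relies on 64 = 2**6)


def interleaved_bank(address):
    """Plain interleaving: low-order address bits select the bank."""
    return address % NUM_BANKS


def hashed_bank(address):
    """Illustrative scrambling: top 6 bits of a 32-bit multiplicative hash."""
    return ((address * 2654435761) & 0xFFFFFFFF) >> 26


def max_bank_load(addresses, bank_of):
    """Number of accesses that land on the most heavily used bank."""
    return max(Counter(bank_of(a) for a in addresses).values())


# A strided access pattern whose stride equals the number of banks.
strided = [i * NUM_BANKS for i in range(256)]

print("interleaved:", max_bank_load(strided, interleaved_bank))
print("hashed:     ", max_bank_load(strided, hashed_bank))
```

Under plain interleaving every access in this pattern hits the same bank, whereas the hashed mapping spreads them out. The price is the one named above: consecutive addresses no longer stay close together, so spatial locality is lost unless the run-time system opts out for regions such as the stack.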

Cray (Tera) MTA

Interesting features

  • tagged memory
    • indirectly set full/empty bits to prevent data races
      • prevents a consumer/producer from loading/overwriting a value before a producer/consumer has written/read it
      • example for the consumer:
        • set to empty when producer instruction starts executing
        • consumer instructions block if they try to read the producer’s value
        • set to full when producer writes value
        • consumers can now read a valid value
    • explicitly set full/empty bits for thread synchronization
      • primarily used when accessing shared data
        • lock: read memory location & set to empty
        • other readers are blocked
        • unlock: write & set to full
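
The full/empty protocol above can be modeled in software. This is a minimal sketch, not the MTA's hardware mechanism: `TaggedWord`, `read_when_full`, and `write_when_empty` are hypothetical names, and a Python condition variable stands in for the blocking that the hardware performs.

```python
import threading


class TaggedWord:
    """Software model of one memory word with a full/empty bit."""

    def __init__(self):
        self._value = None
        self._full = False
        self._cond = threading.Condition()

    def write_when_empty(self, value):
        """Block until the word is empty, write it, and mark it full."""
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_when_full(self):
        """Block until the word is full, read it, and mark it empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            self._full = False
            self._cond.notify_all()
            return self._value


def producer_consumer(n=5):
    """The consumer never sees a value before the producer has written it."""
    word, received = TaggedWord(), []
    producer = threading.Thread(
        target=lambda: [word.write_when_empty(i) for i in range(n)])
    consumer = threading.Thread(
        target=lambda: [received.append(word.read_when_full())
                        for _ in range(n)])
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return received


def locked_counter(num_threads=4, increments=100):
    """Lock = read & set empty; unlock = write & set full."""
    counter = TaggedWord()
    counter.write_when_empty(0)

    def worker():
        for _ in range(increments):
            value = counter.read_when_full()     # lock: other readers block
            counter.write_when_empty(value + 1)  # unlock

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.read_when_full()


print(producer_consumer())
print(locked_counter())
```

The same two operations serve both uses: the producer/consumer case relies on the bit to order a write before a read, and the lock case relies on the word being empty while one thread holds it.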

Cray (Tera) MTA

Interesting features

  • no paging
    • want pages pinned down in memory for consistent latency
    • page size is 256MB
  • forward bit
    • memory contents interpreted as a pointer & dereferenced
    • used for GC & null reference checking
  • user-mode trap handlers
    • lighter weight
    • used for fatal exceptions, overflow, normalizing floating point numbers
    • not used for protection - user might override the RT
    • designed for user-written trap handlers, but too complicated for users
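
The forward bit listed among the features above can be sketched as follows. The `Cell` layout, the `load` and `relocate` helpers, the hop limit, and the addresses are all hypothetical; the sketch only shows how a load might follow forwarding pointers after a garbage collector has moved a value.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    value: int
    forward: bool = False  # when set, `value` holds the address of another cell


def load(memory, address, max_hops=16):
    """Follow forwarding pointers until a cell with the bit clear is reached."""
    for _ in range(max_hops):
        cell = memory[address]
        if not cell.forward:
            return cell.value
        address = cell.value  # contents interpreted as a pointer
    raise RuntimeError("forwarding chain too long (possible cycle)")


def relocate(memory, old, new):
    """Move a value and leave a forwarding pointer behind, as a GC might."""
    memory[new] = Cell(memory[old].value)
    memory[old] = Cell(new, forward=True)


memory = {0x10: Cell(42)}
relocate(memory, 0x10, 0x80)
relocate(memory, 0x80, 0x90)
print(load(memory, 0x10))  # stale address still reaches the moved value
```

Code holding the old address keeps working without being patched, which is what makes the bit useful for garbage collection.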

Cray (Tera) MTA

Compiler support

  • VLIW instructions
    • memory/arithmetic/branch
    • load/store architecture
    • need a good code scheduler
  • memory dependence look-ahead
    • field in a memory instruction that specifies the number of independent memory ops that follow
    • guarantees nonstalling instruction choice
    • improves memory parallelism
  • handling branches
    • special instruction to store a branch target in a register before the branch is executed
    • can start prefetching the target code
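
The effect of the look-ahead field can be sketched with a toy timing model. The assumptions are loud ones: every instruction is treated as a memory operation with the same fixed latency, one instruction issues per cycle at most, and `issue_cycles` and the field values are invented for illustration rather than taken from the MTA.

```python
def issue_cycles(lookaheads, mem_latency):
    """Return the cycle at which each instruction of one thread can issue.

    lookaheads[i] is the number of following instructions that are
    independent of instruction i and so may issue before i completes.
    """
    issue = []
    for j in range(len(lookaheads)):
        earliest = issue[-1] + 1 if issue else 0
        for i in range(j):
            if j > i + lookaheads[i]:  # j lies beyond i's look-ahead window
                earliest = max(earliest, issue[i] + mem_latency)
        issue.append(earliest)
    return issue


LATENCY = 10  # assumed memory latency in cycles

print(issue_cycles([0, 0, 0, 0], LATENCY))  # no look-ahead: fully serialized
print(issue_cycles([1, 1, 1, 1], LATENCY))  # pairs of memory ops overlap
print(issue_cycles([3, 2, 1, 0], LATENCY))  # all four memory ops overlap
```

Because the compiler encodes the independence in the instruction, the hardware in this model never has to discover it at run time; a larger field value simply lets more memory operations from the same thread be in flight at once.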

Performance Implications

Multiprogramming workload

  • 2.5X on SPEC95, 4X on SPEC

Parallel programs

  • ~1.7X on SPLASH

Commercial databases

  • 2-3X on TPC B; 1.5X on TPC D

Web servers & OS

  • 4X on Apache and Digital Unix

Does this Processor Sound Familiar?

Technology transfer =>

  • 2-context Intel Hyperthreading
  • 4-context IBM Power
  • 2-context Sun UltraSPARC on a 4-processor CMP
  • 4-context Compaq 21464
  • network processor & mobile device start-ups
  • others in the wings

An SMT Architecture

Three primary goals for this architecture:

  1. Achieve significant throughput gains with multiple threads
  2. Minimize the performance impact on a single thread executing alone
  3. Minimize the microarchitectural impact on a conventional out-of-order superscalar design

Implementing SMT

From Superscalar to SMT

Per-thread hardware

  • small stuff
  • all part of current out-of-order processors
  • none endangers the cycle time
  • other per-thread processor state, e.g.,
    • program counters
    • return stacks
    • thread identifiers, e.g., with BTB entries, TLB entries
  • per-thread bookkeeping for, e.g.,
    • instruction queue flush
    • instruction retirement
    • trapping

This is why there is only a 15% increase to Alpha 21464 chip area.

Implementing SMT

Thread-shared hardware:

  • fetch buffers
  • branch prediction structures
  • instruction queues
  • functional units
  • active list
  • all caches & TLBs
  • store buffers & MSHRs

This is why there is little single-thread performance degradation (~1.5%).

Architecture Research

Concept & potential of Simultaneous Multithreading

Designing the microarchitecture

  • straightforward extension of out-of-order superscalars

I-fetch thread chooser

  • 40% faster than round-robin

The lockbox for cheap synchronization

  • orders of magnitude faster
  • can parallelize previously unparallelizable codes
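
The slide does not name the fetch policy that beat round-robin. As a sketch, assume an ICOUNT-style chooser of the kind described in the SMT literature, which fetches from the thread with the fewest instructions waiting in the pre-issue stages. The function names and the occupancy counts below are hypothetical.

```python
def round_robin_choice(cycle, pre_issue_counts):
    """Baseline: rotate through the threads regardless of their state."""
    return cycle % len(pre_issue_counts)


def icount_choice(cycle, pre_issue_counts):
    """Assumed ICOUNT-style policy: favor the thread with the fewest
    instructions in the pre-issue stages (ties go to the lowest thread id)."""
    threads = range(len(pre_issue_counts))
    return min(threads, key=lambda t: pre_issue_counts[t])


# Thread 0 is clogging the instruction queue; thread 1 is draining quickly.
counts = [12, 3, 7]
print("round-robin fetches from thread", round_robin_choice(0, counts))
print("ICOUNT-style fetches from thread", icount_choice(0, counts))
```

A chooser of this kind steers fetch bandwidth away from a thread whose instructions are piling up unissued, which keeps the shared instruction queue filled with instructions that are likely to issue soon.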

Architecture Research

Software-directed register deallocation

  • large register-file performance w. small register file

Mini-threads

  • large SMT performance w. small SMTs

SMT instruction speculation

  • don’t execute as far down a wrong path
  • speculative instructions don’t get as far down the pipeline
  • speculation keeps a good thread mix in the IQ
  • most important factor for performance

Others are Now Carrying the Ball

  • Fault detection & recovery
  • Thread-level speculation
  • Instruction & data prefetching
  • Instruction issue hardware design
  • Thread scheduling & thread priority
  • Single-thread execution
  • Profiling executing threads
  • SMT-CMP hybrids
  • Power considerations

SMT Collaborators