Multithreading and Out-of-Order Execution: A Deep Dive into Concurrent Processing, Slides of Computer Architecture and Organization

The concepts of multithreading and out-of-order execution, two fundamental techniques in computer architecture used to improve processor efficiency. Topics covered include thread scheduling policies, pipeline hazards, multithreading design choices, and various hardware implementations such as the cdc 6600 peripheral processors, simple multithreaded pipeline, denelcor hep, and tera mta. The document also discusses the benefits and challenges of coarse-grain multithreading and simultaneous multithreading.

Typology: Slides

2012/2013

Uploaded on 04/27/2013

jutt
jutt 🇮🇳

4.5

(154)

75 documents

1 / 35

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Multithreading
Docsity.com
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e
pf1f
pf20
pf21
pf22
pf23

Partial preview of the text

Download Multithreading and Out-of-Order Execution: A Deep Dive into Concurrent Processing and more Slides Computer Architecture and Organization in PDF only on Docsity!

Multithreading

Outline

  • Multithreading
  • Thread scheduling policies
  • Grain multithreading
  • Design choices
  • Multi threading architecture
  • Out-of-order superscalar processor
  • Simultaneous multithreading
  • From superscalar to SMT
  • SMT design issues

Multithreading

  • How can we guarantee no dependencies between instructions in a pipeline? - One way is to interleave execution of instructions from different program threads on same pipeline Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe

T1: LW r1, 0(r2) T2: ADD r7, r1, r T3: XORI r5, r4, # T4: SW 0(r7), r T1: LW r5, 12(r1)

CDC 6600 Peripheral Processors (Cray,

  • First multithreaded hardware
  • 10 “virtual” I/O processors
  • fixed interleave on simple pipeline
  • pipeline has 100ns cycle time
  • each processor executes one

instruction every 1000ns

  • accumulator-based instruction set to

reduce processor state

Multithreading Costs

  • Appears to software (including OS) as multiple slower CPUs
  • Each thread requires its own user state
    • GPRs
    • PC
  • Also, needs own OS control state
    • virtual memory page table base register
    • exception handling registers
  • Other costs?

Thread Scheduling Policies

  • Fixed interleave (CDC 6600 PPUs, 1965)
    • each of N threads executes one instruction every N cycles
    • if thread not ready to go in its slot, insert pipeline bubble
  • Software-controlled interleave (TI ASC PPUs, 1971)
    • OS allocates S pipeline slots amongst N threads
    • hardware performs fixed interleave over S slots, executing whichever thread is in that slot
  • Hardware-controlled thread scheduling (HEP, 1982)
    • hardware keeps track of which threads are ready to go
    • picks next thread to execute based on hardware priority scheme

Multithreading Design Choices

  • Context switch to another thread every cycle, or on hazard or L1 miss or L2 miss or network request
  • Per-thread state and context-switch overhead
  • Interactions between threads in memory hierarchy

Denelcor HEP

(Burton Smith, 1982)

  • First commercial machine to use

hardware threading in main CPU

  • 120 threads per processor
  • 10 MHz clock rate
  • Up to 8 processors
  • precursor to Tera MTA (Multithreaded

Architecture)

MTA Instruction Format

  • Three operations packed into 64-bit instruction word (short VLIW)
  • One memory operation, one arithmetic operation, plus one arithmetic or branch operation
  • Memory operations incur ~150 cycles of latency
  • Explicit 3-bit “lookahead” field in instruction gives number of subsequent instructions (0-7) that are independent of this one - c.f. Instruction grouping in VLIW - allows fewer threads to fill machine pipeline - used for variable- sized branch delay slots
  • Thread creation and termination instructions

MTA Multithreading

  • Each processor supports 128 active hardware threads - 128 SSWs, 1024 target registers, 4096 general-purpose registers
  • Every cycle, one instruction from one active thread is launched into pipeline
  • Instruction pipeline is 21 cycles long
  • At best, a single thread can issue one instruction every 21 cycles - Clock rate is 260MHz, effective single thread issue rate is 260/21 = 12.4MHz

Coarse-Grain Multithreading

  • Tera MTA designed for supercomputing applications with large data sets and low locality - No data cache - Many parallel threads needed to hide large memory latency
  • Other applications are more cache friendly
    • Few pipeline bubbles when cache getting hits
    • Just add a few threads to hide occasional cache miss latencies
    • Swap threads on cache misses

MIT Alewife

• Modified SPARC chips

  • register windows hold different thread

contexts

• Up to four threads per node

• Thread switch on local cache miss

Speculative, Out-of-Order

Superscalar Processor

Out-Of-Order Execution

  • Why Out-Of-Order Execution?
  • In-order Processors Stalls

Pipeline may not be full because of the frequent stalls

  • Example:

Allow In the Out-Of-Order Processors

**- No dependency Move the Instruction for execution

  • Means the Instructions that are Ready**