Multithreading - Intro to Computer Architecture - Lecture Slides, Slides of Computer Architecture and Organization

In the Intro to Computer Architecture course, we study the main concepts regarding: Multithreading, Pipeline Hazards, Peripheral Processors, Simple Multithreaded Pipeline, Multithreading Costs, Thread Scheduling Policies, Coarse-Grained Multithreading, Multithreading Design Choices, Instruction Format

Typology: Slides

2012/2013

Uploaded on 05/06/2013

anurati

CS 162 Computer Architecture
Lecture 10: Multithreading

Pipeline Hazards

    LW r1, 0(r2)
    LW r5, 12(r1)
    ADDI r5, r5, #
    SW 12(r1), r

  • Each instruction may depend on the one before it
    • Without bypassing, need interlocks
  • Bypassing cannot completely eliminate interlocks or delay slots
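The back-to-back dependences in the sequence above can be checked with a short sketch. It models only register names; the third instruction's immediate is truncated in the source, so it appears here as the placeholder `imm`, and the fourth instruction is left out because its source register is also truncated. The tuple layout and the helper name `raw_on_previous` are illustrative assumptions.

```python
# Each entry: (text, destination register or None, list of source registers).
# Immediates do not matter for register dependences, so "imm" is a placeholder.
program = [
    ("LW r1, 0(r2)",     "r1", ["r2"]),
    ("LW r5, 12(r1)",    "r5", ["r1"]),
    ("ADDI r5, r5, imm", "r5", ["r5"]),
]

def raw_on_previous(program):
    """Return indices of instructions that read the register
    written by the instruction immediately before them."""
    hits = []
    for i in range(1, len(program)):
        prev_dest = program[i - 1][1]
        if prev_dest is not None and prev_dest in program[i][2]:
            hits.append(i)
    return hits

dependent = raw_on_previous(program)
print(dependent)  # [1, 2]
```

Both later instructions read a value produced by their immediate predecessor, which is the case where a pipeline needs either an interlock or a bypass path.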

CDC 6600 Peripheral Processors

(Cray, 1965)

  • First multithreaded hardware
  • 10 “virtual” I/O processors
  • fixed interleave on simple pipeline
  • pipeline has 100ns cycle time
  • each processor executes one instruction every 1000ns
  • accumulator-based instruction set to reduce processor state

Simple Multithreaded Pipeline

  • Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage
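A minimal sketch of why the thread select has to travel with each instruction: at write-back, the carried thread id is the only thing that says which thread's register file to update. The pipeline depth, the `(dest, value)` instruction encoding, and the function name `run_pipeline` are assumptions made for illustration, not part of the slides.

```python
from collections import deque

def run_pipeline(streams, depth, cycles):
    """Fixed-interleave pipeline; every in-flight entry carries its thread id."""
    n_threads = len(streams)
    regfiles = [{} for _ in range(n_threads)]  # one register file per thread
    next_instr = [0] * n_threads               # per-thread program position
    pipe = deque([None] * depth)               # left end = newest, right end = oldest
    for cycle in range(cycles):
        # Write-back: the carried thread id selects the register file to update.
        finished = pipe.pop()
        if finished is not None:
            tid, dest, value = finished
            regfiles[tid][dest] = value
        # Fetch: choose this cycle's thread and tag the instruction with its id.
        tid = cycle % n_threads
        if next_instr[tid] < len(streams[tid]):
            dest, value = streams[tid][next_instr[tid]]
            next_instr[tid] += 1
            pipe.appendleft((tid, dest, value))
        else:
            pipe.appendleft(None)              # bubble
    return regfiles

# Two threads both write "r1"; the carried id keeps their state separate.
streams = [[("r1", 10)], [("r1", 20)]]
regs = run_pipeline(streams, depth=4, cycles=8)
print(regs)  # [{'r1': 10}, {'r1': 20}]
```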

Thread Scheduling Policies

  • Fixed interleave (CDC 6600 PPUs, 1965)
    • each of N threads executes one instruction every N cycles
    • if thread not ready to go in its slot, insert pipeline bubble
  • Software-controlled interleave (TI ASC PPUs, 1971)
    • OS allocates S pipeline slots amongst N threads
    • hardware performs fixed interleave over S slots, executing whichever thread is in that slot
  • Hardware-controlled thread scheduling (HEP, 1982)
    • hardware keeps track of which threads are ready to go
    • picks next thread to execute based on hardware priority scheme
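The three policies can be sketched as slot-selection functions. This is a rough illustration, not a model of any of the machines named: the function names are hypothetical, the software-controlled variant is assumed to insert a bubble when its slot's thread is not ready, and round-robin stands in for the hardware priority scheme, which the slides do not specify.

```python
def fixed_interleave(cycle, n_threads, ready):
    """Thread (cycle mod N) owns the slot; an unready thread leaves a bubble."""
    tid = cycle % n_threads
    return tid if ready[tid] else None  # None = pipeline bubble

def software_interleave(cycle, slot_table, ready):
    """The OS fills slot_table (length S); hardware cycles over it in fixed order."""
    tid = slot_table[cycle % len(slot_table)]
    return tid if ready[tid] else None  # bubble on unready thread (assumption)

def hardware_scheduled(ready, last):
    """Hardware tracks readiness and picks the next ready thread after `last`.
    Round-robin is an assumed stand-in for the real priority scheme."""
    n = len(ready)
    for offset in range(1, n + 1):
        tid = (last + offset) % n
        if ready[tid]:
            return tid
    return None  # no thread is ready

ready = [True, False, True]
print(fixed_interleave(1, 3, ready))                # None
print(software_interleave(1, [0, 0, 2, 1], ready))  # 0
print(hardware_scheduled(ready, last=0))            # 2
```

Only the hardware-controlled policy skips over the unready thread; the two interleave policies lose the slot.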

What “Grain” Multithreading?

  • So far assumed fine-grained multithreading
    • CPU switches every cycle to a different thread
    • When does this make sense?
  • Coarse-grained multithreading
    • CPU switches every few cycles to a different thread
    • When does this make sense?

Denelcor HEP

(Burton Smith, 1982)

  • First commercial machine to use hardware threading in main CPU

  • 120 threads per processor
  • 10 MHz clock rate
  • Up to 8 processors
  • precursor to Tera MTA (Multithreaded Architecture)

Tera MTA Overview

  • Up to 256 processors
  • Up to 128 active threads per processor
  • Processors and memory modules populate a sparse 3D torus interconnection fabric
  • Flat, shared main memory
    • No data cache
    • Sustains one main memory access per cycle per processor
  • 50W/processor @ 260MHz

MTA Multithreading

  • Each processor supports 128 active hardware threads
    • 128 SSWs, 1024 target registers, 4096 general-purpose registers
  • Every cycle, one instruction from one active thread is launched into pipeline
  • Instruction pipeline is 21 cycles long
  • At best, a single thread can issue one instruction every 21 cycles
    • Clock rate is 260MHz, effective single thread issue rate is 260/21 = 12.4MHz
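The arithmetic on this slide can be reproduced directly. The per-thread register counts assume the register files are split evenly across the 128 hardware threads, and the thread count needed to issue every cycle assumes each thread has at most one instruction in flight; the variable names are illustrative.

```python
CLOCK_MHZ = 260
PIPELINE_DEPTH = 21   # cycles
HW_THREADS = 128
GENERAL_REGS = 4096
TARGET_REGS = 1024

# One instruction per thread every PIPELINE_DEPTH cycles.
single_thread_mhz = CLOCK_MHZ / PIPELINE_DEPTH

# With one instruction in flight per thread, this many ready threads
# are needed to launch an instruction every cycle.
min_threads_for_full_issue = PIPELINE_DEPTH

# Even split of the register files across hardware threads (assumption).
regs_per_thread = GENERAL_REGS // HW_THREADS
targets_per_thread = TARGET_REGS // HW_THREADS

print(round(single_thread_mhz, 1))  # 12.4
print(min_threads_for_full_issue)   # 21
print(regs_per_thread)              # 32
print(targets_per_thread)           # 8
```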

MTA Pipeline

MIT Alewife

  • Modified SPARC chips
    • register windows hold different thread contexts
  • Up to four threads per node
  • Thread switch on local cache miss

IBM PowerPC RS64-III

(Pulsar)

  • Commercial coarse-grain multithreading CPU
  • Based on PowerPC with quad-issue in-order five-stage pipeline
  • Each physical CPU supports two virtual CPUs
  • On L2 cache miss, pipeline is flushed and execution switches to second thread
    • short pipeline minimizes flush penalty (4 cycles), small compared to memory access latency
    • flush pipeline to simplify exception handling
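A back-of-the-envelope sketch of the trade-off: on a miss the pipeline loses either the flush penalty (if the second thread can take over) or the whole miss latency (if it cannot). The 100-cycle miss latency is a made-up figure, and the model assumes the second thread has useful work for the entire miss; only the 4-cycle flush penalty comes from the slide.

```python
FLUSH_PENALTY = 4  # cycles, from the slide

def cycles_lost_per_miss(miss_latency, second_thread_ready):
    """Issue cycles lost on one L2 miss under the simple model above."""
    if second_thread_ready:
        return FLUSH_PENALTY  # flush, then the other thread runs
    return miss_latency       # nothing to switch to: stall for the miss

HYPOTHETICAL_MISS_LATENCY = 100  # cycles, illustrative only
print(cycles_lost_per_miss(HYPOTHETICAL_MISS_LATENCY, True))   # 4
print(cycles_lost_per_miss(HYPOTHETICAL_MISS_LATENCY, False))  # 100
```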

Superscalar Machine Efficiency

  • Why horizontal waste?
  • Why vertical waste?

Vertical Multithreading

  • Cycle-by-cycle interleaving of second thread removes vertical waste
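A small sketch of the two kinds of waste, using the usual definitions: vertical waste is the issue slots lost in cycles where nothing issues, horizontal waste is the unused slots in cycles where something issues. The issue traces, the issue width of 4, and the merge rule (the second thread takes a cycle only when the first issues nothing) are hypothetical simplifications.

```python
def waste(issued_per_cycle, width):
    """Return (vertical, horizontal) wasted issue slots."""
    vertical = sum(width for n in issued_per_cycle if n == 0)
    horizontal = sum(width - n for n in issued_per_cycle if n > 0)
    return vertical, horizontal

WIDTH = 4
thread_a = [2, 0, 1, 0, 3, 0]  # instructions thread A could issue each cycle
thread_b = [1, 2, 0, 1, 1, 2]  # same for thread B

# Each cycle belongs to one thread: A if it can issue, otherwise B.
merged = [a if a > 0 else b for a, b in zip(thread_a, thread_b)]

print(waste(thread_a, WIDTH))  # (12, 6)
print(waste(merged, WIDTH))    # (0, 13)
```

In this trace the second thread fills every empty cycle, so vertical waste drops to zero, while the partially filled cycles remain and horizontal waste is still present.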