Computer Architecture: Understanding Pipelining and Instruction Dependencies, Lecture notes of Parallel Computing and Programming

An in-depth exploration of computer architecture, focusing on pipelining and instruction dependencies. Topics include the fetch, decode, execute, and memory stages of the pipeline, as well as data and control hazards and their resolution. Students will gain a solid understanding of the concepts and terminology related to computer architecture and pipelining.

Typology: Lecture notes

2018/2019

Uploaded on 03/20/2019

aman-rao-1
CSE 502:
Computer Architecture
Core Pipelining

Before there was pipelining…

  • Single-cycle control: hardwired
    • Low CPI (1)
    • Long clock period (to accommodate slowest instruction)
  • Multi-cycle control: micro-programmed
    • Short clock period
    • High CPI
  • Can we have both low CPI and short clock period?
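
The trade-off in the list above can be put in numbers with the usual relation: execution time = instruction count × CPI × clock period. The sketch below is illustrative only. The three step latencies, the instruction count, and the helper name `exec_time` are assumptions, not figures from the lecture, and the pipelined case ignores fill time and hazards.

```python
# Illustrative step latencies in ns (assumed, not from the lecture).
STEP_NS = {"fetch": 2.0, "decode": 1.0, "exec": 3.0}

def exec_time(n_insns, cpi, clock_ns):
    """Execution time = instruction count x CPI x clock period."""
    return n_insns * cpi * clock_ns

N = 1000  # hypothetical instruction count

# Single-cycle: CPI = 1, but one clock must fit all the work of an instruction.
single = exec_time(N, cpi=1, clock_ns=sum(STEP_NS.values()))

# Multi-cycle: the clock only has to fit the slowest step,
# but every instruction here takes one cycle per step.
multi = exec_time(N, cpi=len(STEP_NS), clock_ns=max(STEP_NS.values()))

# Ideal pipeline: the short clock AND, in steady state, one instruction per cycle.
pipelined = exec_time(N, cpi=1, clock_ns=max(STEP_NS.values()))

print(single, multi, pipelined)
```

In this simplified model every instruction uses every step, so the multi-cycle design gains nothing from its shorter clock. Only the pipelined case combines a CPI of 1 with the short clock period.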

Single-cycle: insn0.(fetch,decode,exec) | insn1.(fetch,decode,exec)

Multi-cycle: insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec

(time →)

Pipeline Examples

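  • Unpipelined: stage delay = n, bandwidth = ~(1/n)
  • Two stages: stage delay = n/2, bandwidth = ~(2/n)
  • Three stages: stage delay = n/3, bandwidth = ~(3/n)

[Figure: a lookup circuit taking an address through four "=" comparators to a "hit?" output, shown at each of the three pipeline depths]

Increases throughput at the expense of latency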
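
A rough numeric sketch of "increases throughput at the expense of latency". It assumes the logic splits evenly into k stages and that each pipeline register adds a fixed delay, which is one common reason the bandwidth figures are only approximate. The values `n = 12.0` and `latch = 0.5` and the helper name `pipeline_stats` are hypothetical.

```python
def pipeline_stats(n, k, latch=0.0):
    """Return (stage_delay, bandwidth, latency) for logic of total delay n
    split evenly into k stages, each followed by a register of delay `latch`."""
    stage_delay = n / k + latch      # the clock period is set by one stage
    bandwidth = 1.0 / stage_delay    # results per unit time in steady state
    latency = k * stage_delay        # time for one item to pass through all stages
    return stage_delay, bandwidth, latency

n = 12.0
for k in (1, 2, 3):
    delay, bw, lat = pipeline_stats(n, k, latch=0.5)
    print(f"k={k}: stage delay={delay:.2f}, bandwidth={bw:.3f}, latency={lat:.2f}")
```

With `latch=0.0` the bandwidth scales exactly with k and the latency stays at n. With a nonzero register delay the bandwidth still rises with k, but the end-to-end latency grows with every added stage.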

Processor Pipeline Review

[Figure: five-stage pipeline datapath. Fetch (PC, + 4 adder, I-cache) → Decode (Reg File) → Execute (ALU) → Memory (D-cache) → (Write-back)]

Stage 1: Fetch Diagram

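[Figure: fetch stage. The PC addresses the Instruction Cache; an adder forms PC + 1; a MUX selects between PC + 1 and the branch target. Instruction bits and PC + 1 are written to the IF/ID pipeline register, which feeds Decode. "en" marks enable inputs.]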

Stage 2: Decode

  • Decode opcode bits
    • Set up control signals for later stages
  • Read input operands from register file
    • Specified by decoded instruction bits
  • Write state to the pipeline register (ID/EX)
    • Opcode
    • Register contents
    • PC+1 (even though decode didn’t use it)
    • Control signals (from insn) for opcode and destReg
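
The decode steps above can be sketched as a function that turns the IF/ID register into the ID/EX register. This is a minimal model, not the lecture's implementation: the dict layout, the field names (`op`, `dest`, `valA`, `valB`, `pc_plus_1`), and the idea that the instruction arrives already split into fields are all assumptions made for illustration.

```python
def decode(if_id, regfile):
    """Build the ID/EX pipeline register contents from IF/ID and the register file."""
    insn = if_id["insn"]                   # hypothetical: fields already extracted
    return {
        "op": insn["op"],                  # opcode; stands in for the control signals
        "dest": insn["dest"],              # destReg travels along with the instruction
        "valA": regfile[insn["regA"]],     # register contents are read in this stage
        "valB": regfile[insn["regB"]],
        "offset": insn.get("offset", 0),   # constant taken from the instruction bits
        "pc_plus_1": if_id["pc_plus_1"],   # passed through, not used by decode
    }

regfile = [0, 10, 20, 30, 0, 0, 0, 0]
if_id = {"insn": {"op": "add", "regA": 1, "regB": 2, "dest": 3}, "pc_plus_1": 5}
id_ex = decode(if_id, regfile)
print(id_ex)
```

Note that `pc_plus_1` is copied forward untouched: decode does not need it, but the execute stage does when it computes a branch target.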

Stage 3: Execute

  • Perform ALU operations
    • Calculate result of instruction
      • Control signals select operation
      • Contents of regA used as one input
      • Either regB or constant offset (from insn) used as second input
    • Calculate PC-relative branch target
      • PC+1+(constant offset)
  • Write state to the pipeline register (EX/Mem)
    • ALU result, contents of regB, and PC+1+offset
    • Control signals (from insn) for opcode and destReg
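
A minimal sketch of the execute stage described above, mapping the ID/EX register to the EX/Mem register. The opcode names (`add`, `sub`, `ld`, `st`, `beq`), the dict layout, and the choice to compare branch operands by subtraction are assumptions for illustration, not details confirmed by the lecture.

```python
def execute(id_ex):
    """Build the EX/Mem pipeline register contents from ID/EX."""
    op = id_ex["op"]
    a = id_ex["valA"]                      # regA contents: first ALU input
    # MUX: loads/stores use the constant offset; other ops use regB contents.
    b = id_ex["offset"] if op in ("ld", "st") else id_ex["valB"]
    if op in ("add", "ld", "st"):          # address calculation is also an add
        alu_result = a + b
    elif op in ("sub", "beq"):             # assumed: beq compares by subtracting
        alu_result = a - b
    else:
        raise ValueError(f"unknown opcode {op!r}")
    return {
        "op": op,                          # control information passed along
        "dest": id_ex["dest"],
        "alu_result": alu_result,
        "valB": id_ex["valB"],             # regB contents, e.g. the data for a store
        "target": id_ex["pc_plus_1"] + id_ex["offset"],  # PC+1+offset
    }

id_ex = {"op": "ld", "dest": 4, "valA": 100, "valB": 7, "offset": 8, "pc_plus_1": 5}
ex_mem = execute(id_ex)
print(ex_mem)
```

The branch target is computed for every instruction, whether or not it is a branch. In hardware that adder is always active, and later control logic decides whether the value is used.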

Stage 3: Execute Diagram

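[Figure: execute stage, between Decode and Memory. Inputs from the ID/EX pipeline register: PC + 1, regA contents, regB contents, control signals. A MUX feeds the second ALU input. Outputs to the EX/Mem pipeline register: PC+1+offset, ALU result, regB contents, control signals. Feedback signals shown: destReg, data, target.]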

Stage 4: Memory Diagram

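[Figure: memory stage, between Execute and Write-back. Inputs from the EX/Mem pipeline register: PC+1+offset, ALU result, regB contents, control signals. The Data Cache has in_addr, in_data, en, and R/W inputs. Outputs to the Mem/WB pipeline register: ALU result, loaded data, control signals. Feedback signals shown: destReg, data, target.]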

Stage 5: Write-back

  • Write result to register file (if required)
    • Write loaded data to destReg for LD
    • Write ALU result to destReg for arithmetic insn
    • Opcode bits control register write enable signal
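
The write-back rules above, as a minimal sketch. The opcode names (`ld`, `add`, `sub`, `st`, `beq`), the dict layout of the Mem/WB register, and the field name `mdata` for the loaded data are assumptions for illustration.

```python
def write_back(mem_wb, regfile):
    """Update the register file from the Mem/WB pipeline register, if required."""
    op = mem_wb["op"]
    if op == "ld":
        # Loads write the data that came back from the data cache.
        regfile[mem_wb["dest"]] = mem_wb["mdata"]
    elif op in ("add", "sub"):
        # Arithmetic instructions write the ALU result.
        regfile[mem_wb["dest"]] = mem_wb["alu_result"]
    # Any other opcode (e.g. st, beq): write enable stays off, nothing is written.

regfile = [0] * 8
write_back({"op": "ld", "dest": 4, "alu_result": 108, "mdata": 42}, regfile)
write_back({"op": "add", "dest": 3, "alu_result": 30, "mdata": None}, regfile)
write_back({"op": "st", "dest": 0, "alu_result": 108, "mdata": None}, regfile)
print(regfile)
```

The opcode alone decides whether a write happens and which value is selected, which is the role the slide gives to the register write enable signal.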

Putting It All Together

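[Figure: complete five-stage datapath. PC → Inst Cache → IF/ID → Register file (ports regA, regB, data, dest) → ID/EX → ALU (with eq? output) → EX/Mem → Data Cache → Mem/WB, with MUXes on the PC input, the second ALU input, and the write-back data. Pipeline register fields: IF/ID (instruction, PC+1); ID/EX (op, dest, offset, valB, valA, PC+1); EX/Mem (op, dest, valB, ALU result, target); Mem/WB (op, dest, ALU result, mdata).]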

Pipelining Idealism

  • Uniform Sub-operations
    • Operation can be partitioned into uniform-latency sub-ops
  • Repetition of Identical Operations
    • Same ops performed on many different inputs
  • Repetition of Independent Operations
    • All repetitions of op are mutually independent

The Generic Instruction Pipeline

  • IF: Instruction Fetch
  • ID: Instruction Decode
  • OF: Operand Fetch
  • EX: Instruction Execute
  • WB: Write-back

Balancing Pipeline Stages

T_IF = 6 units

T_ID = 2 units

T_OF = 9 units

T_EX = 5 units

T_OS = 9 units

Without pipelining

T_cyc ≥ T_IF + T_ID + T_OF + T_EX + T_OS = 31 units

Pipelined

T_cyc ≥ max{T_IF, T_ID, T_OF, T_EX, T_OS} = 9 units

Speedup = 31 / 9 ≈ 3.44
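
The arithmetic behind the speedup figure, as a short check. The dictionary keys follow the stage names used in the cycle-time formula. The variable names and the comparison against a perfectly balanced pipeline are additions for illustration.

```python
# Stage latencies in units, keyed by the names used in the cycle-time formula.
stage_units = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}

unpipelined_cycle = sum(stage_units.values())   # one long cycle covers all stages
pipelined_cycle = max(stage_units.values())     # the slowest stage sets the clock
speedup = unpipelined_cycle / pipelined_cycle

# With perfectly balanced stages the speedup would equal the stage count.
ideal_speedup = len(stage_units)

print(f"speedup = {unpipelined_cycle} / {pipelined_cycle} = {speedup:.2f}")
print(f"ideal speedup with balanced stages = {ideal_speedup}")
```

The gap between roughly 3.44 and 5 comes entirely from imbalance: the 2-unit and 5-unit stages sit idle for part of every 9-unit cycle.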

[Figure: IF → ID → OF → EX → WB pipeline]

Can we do better?