organization-review, Study notes of Advanced Computer Architecture

Good material for Advanced Computer Architecture

Typology: Study notes

2013/2014

Uploaded on 06/03/2014

nagesh

CS 211: Computer Architecture

Instructor: Prof. Bhagi Narahari

Dept. of Computer Science

Course URL: www.seas.gwu.edu/~narahari/cs211/

CS 211: Bhagi Narahari,CS, GWU

Summary: Architecture Trends?

• Moore’s law: density doubles every 18-24 months
¾ smaller processors, faster clocks
¾ Price drops due to volume and dev. costs. What next?
• Interconnect delays could dominate over feature delay
¾ Need for simpler architectures
¾ Distributed logic and control
• More functionality
¾ communicating processors
¾ network of embedded processors
• To extract max performance
¾ Rules of thumb: Amdahl’s law, parallelism, locality
¾ Software and compiler support needed!
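Amdahl’s law, one of the rules of thumb listed above, can be made concrete with a short sketch. The function name `amdahl_speedup` and the sample numbers are illustrative assumptions, not part of the course material.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only part of the execution time is improved.

    Amdahl's law: 1 / ((1 - f) + f / s), where f is the fraction of the
    original execution time that benefits and s is the speedup of that part.
    """
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Hypothetical example: 80% of the run time is made 4x faster.
overall = amdahl_speedup(0.8, 4.0)
print(f"overall speedup: {overall:.2f}")  # 1 / (0.2 + 0.2) = 2.50
```

The untouched 20% caps the benefit: even an infinitely fast enhancement of the other 80% could not push the overall speedup past 5.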

Next: Review

Computer Organization in an hour!

• Overview of Computer Organization

¾ Components

¾ Sample processor design process


Review: Computer Organization Basics

• What are the components of a CPU?
• What is the microarchitecture level?
• What is an ISA (instruction set architecture)?
• What does a sample processor design look like?
¾ A simple processor architecture
• What is the basic concept of pipelining?

A Computer

The computer is composed of input devices, a central processing unit, a memory unit and output devices.

[Figure: block diagram connecting input devices, the central processing unit, memory, an output device, and an auxiliary storage device]

Memory Unit

  • An ordered sequence of storage cells, each capable of holding a piece of data.
  • Volatile Memory
¾ RAM – Random Access Memory
  • Non-volatile Memory
¾ ROM – Read Only Memory

Computer System

[Figure: a processor with its cache sits on a memory-I/O bus together with memory and several I/O controllers; the I/O controllers serve disks, a display, and a network, and signal the processor through interrupts]

Memory Hierarchy: The Tradeoff

[Figure: CPU registers, L1 cache, L2 cache, memory, and disk, ordered from left to right as larger, slower, cheaper]

Level      Reference              Size           Speed     $/Mbyte    Block size
CPU regs   register reference     608 B          1.4 ns               4 B
L1 cache   L1-cache reference     128 kB         4.2 ns               4 B
L2 cache   L2-cache reference     512 kB - 4 MB  16.8 ns   $90/MB     16 B
Memory     memory reference       128 MB         112 ns    $2-6/MB    4-8 KB
Disk       disk memory reference  27 GB          9 ms      $0.01/MB

Transfer sizes marked between levels in the figure: 16 B, 8 B, 4 KB. Region labels in the figure: cache, virtual memory.

(Numbers are for a 21264 at 700MHz)

Architecture Models: Von Neumann architecture

  • Memory holds data, instructions.
  • Central processing unit (CPU) fetches instructions from memory.
¾ Separate CPU and memory distinguishes programmable computer.
  • CPU registers help out: program counter (PC), instruction register (IR), general-purpose registers, etc.

CPU + memory

[Figure: memory and CPU connected by address and data lines; the PC holds address 200, memory location 200 holds ADD r5,r1,r3, and the fetched instruction ADD r5,r1,r3 sits in the IR]

Harvard architecture

[Figure: a CPU with its PC connected to a data memory and to a separate program memory, each through its own address and data lines]

von Neumann vs. Harvard

  • Harvard can’t use self-modifying code.
  • Harvard allows two simultaneous memory fetches.
  • Most DSPs use Harvard architecture for streaming data:
¾ greater memory bandwidth;
¾ more predictable bandwidth.

Instruction Set Architecture

  • The Instruction Set Architecture (ISA) describes a set of instructions whose syntactic and semantic characteristics are defined by the underlying computer architecture.

Programming model

  • Programming model: registers visible to the programmer.
  • Some registers are not visible (e.g., the IR).

Multiple implementations

  • Successful architectures have several implementations:
¾ varying clock speeds;
¾ different bus widths;
¾ different cache sizes;
¾ etc.

Assembly language

  • One-to-one with instructions (more or less).
  • Basic features:
¾ One instruction per line.
¾ Labels provide names for addresses (usually in first column).
¾ Instructions often start in later columns.
¾ Comments run to end of line.

Evolution of Instruction Sets

  • Major advances in computer architecture are typically associated with landmark instruction set designs

¾ Ex: Stack vs GPR (System 360)

  • Design decisions must take into account:

¾ technology

¾ machine organization

¾ programming languages

¾ compiler technology

¾ operating systems

¾ applications

  • And they in turn influence these


CISC vs. RISC

  • Complex instruction set computer (CISC):
¾ many addressing modes;
¾ many operations.
  • Reduced instruction set computer (RISC):
¾ load/store;
¾ pipelined instructions.

CISC Processors

  • Instruction decoding is performed with large microcode ROMs
  • Some instructions require more than a single instruction cycle to execute
  • Many addressing modes supported
  • Register set was designed to support specific functions

RISC Processors

  • Instruction decoding is performed with static (hard-wired) logic for a much faster result
  • Instructions are designed to execute in a single instruction cycle
  • Data processing instructions operate only on registers. Only load and store instructions are designated to access memory
  • Register set is large and general purpose (in many cases)

IA-32

  • 1978: The Intel 8086 is announced (16 bit architecture)
  • 1980: The 8087 floating point coprocessor is added
  • 1982: The 80286 increases address space to 24 bits and adds instructions
  • 1985: The 80386 extends to 32 bits, new addressing modes
  • 1989-1995: The 80486, Pentium, Pentium Pro add a few instructions (mostly designed for higher performance)
  • 1997: 57 new “MMX” instructions are added, Pentium II
  • 1999: The Pentium III added another 70 instructions (SSE)
  • 2001: Another 144 instructions (SSE2)
  • 2003: AMD extends the architecture to increase address space to 64 bits, widens all registers to 64 bits and other changes (AMD64)
  • 2004: Intel capitulates and embraces AMD64 (calls it EM64T) and adds more media extensions
  • This history illustrates the impact of the “golden handcuffs” of compatibility:
¾ “adding new features as someone might add clothing to a packed bag”
¾ “an architecture that is difficult to explain and impossible to love”

IA-32 Overview

  • Complexity:
¾ Instructions from 1 to 17 bytes long
¾ one operand must act as both a source and destination
¾ one operand can come from memory
¾ complex addressing modes, e.g., “base or scaled index with 8 or 32 bit displacement”
  • Saving grace:
¾ the most frequently used instructions are not too difficult to build
¾ compilers avoid the portions of the architecture that are slow

“what the 80x86 lacks in style is made up in quantity, making it beautiful from the right perspective”

Quick look at ISA

  • Will use MIPS

¾ Simple RISC ISA

¾ Widely used


Instruction set characteristics

  • Fixed vs. variable length.
  • Addressing modes.
  • Number of operands.
  • Types of operands.


The Big Picture: The Performance Perspective

  • Performance of a machine is determined by:

¾ Instruction count

¾ Clock cycle time

¾ Clock cycles per instruction

  • Processor design (datapath and control) will determine:

¾ Clock cycle time

¾ Clock cycles per instruction

[Figure: triangle relating CPI, Inst. Count, and Cycle Time]
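The three factors above combine multiplicatively: CPU time = instruction count × CPI × clock cycle time. A minimal sketch follows; the function name `cpu_time` and all numbers are hypothetical and chosen only to show the trade-off between CPI and cycle time.

```python
def cpu_time(instruction_count: int, cpi: float, cycle_time_ns: float) -> float:
    """CPU time in nanoseconds = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_ns


# Hypothetical workload of one million instructions on two designs.
long_cycle = cpu_time(1_000_000, cpi=1.0, cycle_time_ns=10.0)   # CPI = 1, long cycle
short_cycle = cpu_time(1_000_000, cpi=4.0, cycle_time_ns=2.0)   # higher CPI, short cycle

print(f"CPI=1 design: {long_cycle / 1e6:.1f} ms")
print(f"CPI=4 design: {short_cycle / 1e6:.1f} ms")
```

With these made-up numbers the design with the worse CPI still finishes first, which is why CPI and cycle time have to be judged together rather than separately.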


Microarchitecture Design: How?

  • Any design must attempt to meet the requirements
¾ Where do the requirements come from?
¾ Ex: need to represent numbers in binary; integers, text, floating point
  • How to proceed with design?

Some History…

  • The Indiana Legislature once introduced legislation declaring that the value of π was exactly 3.

How to Design a Processor: step-by-step

    1. Analyze instruction set => datapath requirements
¾ the meaning of each instruction is given by the register transfers
¾ datapath must include storage element for ISA registers (possibly more)
¾ datapath must support each register transfer
    2. Select set of datapath components and establish clocking methodology
    3. Assemble datapath meeting the requirements
    4. Analyze implementation of each instruction to determine setting of control points that effects the register transfer.
    5. Assemble the control logic
  • Let’s look at a single-cycle implementation…

The MIPS Instruction Formats

  • All MIPS instructions are 32 bits long. The three instruction formats:
¾ R-type: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
¾ I-type: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
¾ J-type: op (6 bits) | target address (26 bits)
  • The different fields are:
¾ op: operation of the instruction
¾ rs, rt, rd: the source and destination register specifiers
¾ shamt: shift amount
¾ funct: selects the variant of the operation in the “op” field
¾ address / immediate: address offset or immediate value
¾ target address: target address of the jump instruction
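The field widths above fix where each field sits in the 32-bit word, so decoding is just shifting and masking. The sketch below assumes the fields are laid out from the most significant bit down in the order listed; the helper name `decode` is hypothetical.

```python
def decode(word: int) -> dict:
    """Split a 32-bit MIPS instruction word into the fields of all three formats."""
    word &= 0xFFFFFFFF
    return {
        "op": (word >> 26) & 0x3F,      # bits 31..26, 6 bits (all formats)
        "rs": (word >> 21) & 0x1F,      # bits 25..21, 5 bits (R- and I-type)
        "rt": (word >> 16) & 0x1F,      # bits 20..16, 5 bits (R- and I-type)
        "rd": (word >> 11) & 0x1F,      # bits 15..11, 5 bits (R-type)
        "shamt": (word >> 6) & 0x1F,    # bits 10..6, 5 bits (R-type)
        "funct": word & 0x3F,           # bits 5..0, 6 bits (R-type)
        "immediate": word & 0xFFFF,     # bits 15..0, 16 bits (I-type)
        "target": word & 0x03FFFFFF,    # bits 25..0, 26 bits (J-type)
    }


# Build an R-type word for "addu r5, r1, r3": op = 0, funct = 0x21.
word = (0 << 26) | (1 << 21) | (3 << 16) | (5 << 11) | (0 << 6) | 0x21
fields = decode(word)
print(fields["rs"], fields["rt"], fields["rd"], hex(fields["funct"]))
```

The decoder returns every field regardless of format; which ones are meaningful depends on the `op` value, which is the job of the control logic.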

Step 1a: The MIPS Instruction Set (example subset)

  • ADD and SUB (R-type: op | rs | rt | rd | shamt | funct, with 6, 5, 5, 5, 5, 6 bits)
¾ addU rd, rs, rt
¾ subU rd, rs, rt
  • OR Immediate (I-type: op | rs | rt | immediate, with 6, 5, 5, 16 bits)
¾ ori rt, rs, imm
  • LOAD and STORE Word (I-type)
¾ lw rt, rs, imm
¾ sw rt, rs, imm
  • BRANCH (I-type)
¾ beq rs, rt, imm

  • Registers rs and rt are the source registers.
  • If the instruction has three register operands, then rd is the destination register
  • If the instruction has two register operands, then rt is the destination register

Logical Register Transfers

  • RTL gives the meaning of the instructions
  • All start by fetching the instruction

op | rs | rt | rd | shamt | funct = MEM[ PC ]
op | rs | rt | Imm16 = MEM[ PC ]

inst     Register Transfers
ADDU     R[rd] <– R[rs] + R[rt]; PC <– PC + 4
SUBU     R[rd] <– R[rs] – R[rt]; PC <– PC + 4
ORi      R[rt] <– R[rs] | zero_ext(Imm16); PC <– PC + 4
LOAD     R[rt] <– MEM[ R[rs] + sign_ext(Imm16) ]; PC <– PC + 4
STORE    MEM[ R[rs] + sign_ext(Imm16) ] <– R[rt]; PC <– PC + 4
BEQ      if ( R[rs] == R[rt] ) then PC <– PC + 4 + ( sign_ext(Imm16) || 00 ) else PC <– PC + 4
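The register transfers above translate almost line for line into a small interpreter. This is only a sketch under stated assumptions: instructions are pre-decoded tuples rather than 32-bit words, data memory is a dictionary keyed by byte address, register 0 is not forced to zero, and the names `step`, `sign_ext`, and `zero_ext` are illustrative.

```python
def sign_ext(imm16: int) -> int:
    """Sign-extend a 16-bit value to a Python int."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def zero_ext(imm16: int) -> int:
    """Zero-extend a 16-bit value."""
    return imm16 & 0xFFFF


def step(inst: tuple, R: list, MEM: dict, pc: int) -> int:
    """Apply one instruction's register transfers and return the new PC."""
    name = inst[0]
    if name == "addu":
        _, rd, rs, rt = inst
        R[rd] = (R[rs] + R[rt]) & 0xFFFFFFFF
    elif name == "subu":
        _, rd, rs, rt = inst
        R[rd] = (R[rs] - R[rt]) & 0xFFFFFFFF
    elif name == "ori":
        _, rt, rs, imm = inst
        R[rt] = R[rs] | zero_ext(imm)
    elif name == "lw":
        _, rt, rs, imm = inst
        R[rt] = MEM.get((R[rs] + sign_ext(imm)) & 0xFFFFFFFF, 0)
    elif name == "sw":
        _, rt, rs, imm = inst
        MEM[(R[rs] + sign_ext(imm)) & 0xFFFFFFFF] = R[rt]
    elif name == "beq":
        _, rs, rt, imm = inst
        if R[rs] == R[rt]:
            return pc + 4 + (sign_ext(imm) << 2)  # sign_ext(Imm16) || 00
    else:
        raise ValueError(f"unsupported instruction: {name}")
    return pc + 4


# Hypothetical program, keyed by byte address.
program = {
    0: ("ori", 1, 0, 7),     # R[1] = 7
    4: ("ori", 3, 0, 5),     # R[3] = 5
    8: ("addu", 5, 1, 3),    # R[5] = R[1] + R[3]
    12: ("sw", 5, 0, 100),   # MEM[100] = R[5]
    16: ("lw", 6, 0, 100),   # R[6] = MEM[100]
    20: ("beq", 5, 6, 1),    # taken: skips the instruction at address 24
    24: ("subu", 5, 5, 5),   # not executed
}
R, MEM, pc = [0] * 32, {}, 0
while pc in program:
    pc = step(program[pc], R, MEM, pc)
print(R[5], R[6], pc)  # 12 12 28
```

Every branch of `step` ends by producing the next PC, mirroring the fact that each RTL line pairs a data transfer with a PC update.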

Step 2: Components of the Datapath

  • Combinational Elements
  • Storage Elements

¾ Clocking methodology


Clocking Methodology

  • Clocks are needed in sequential logic to decide when an element that contains state should be updated.
  • A clock is a free-running circuit with a fixed cycle time or clock period. The clock frequency is the inverse of the cycle time.
  • The clock cycle time or clock period is divided into two portions: when the clock is high and when the clock is low.
  • Edge-triggered clocking: all state changes occur on a clock edge.

[Figure: Clk waveform over one Clock Period, marking the Rising Edge and Falling Edge, the Setup and Hold windows around the triggering edge, and “Don’t Care” data outside those windows]

Step 3: Assemble DataPath meeting our requirements

  • Register Transfer Requirements ⇒ Datapath Assembly
  • Instruction Fetch
  • Read Operands and Execute Operation

The common RTL operations for all instructions are:

(a) Fetch the instruction using the Program Counter (PC) at the beginning of an instruction’s execution (PC -> Instruction Memory -> Instruction Word).

(b) Then at the end of the instruction’s execution, you need to update the Program Counter (PC -> Next Address Logic -> PC).

More specifically, you need to increment the PC by 4 if you are executing sequential code. For Branch and Jump instructions, you need to update the program counter to “something else” other than plus 4.

The Next Address Logic block handles both cases:

  • Add 4 (number of bytes in an instruction), or
  • Compute the new address for Branch and Jump instructions

3a: Overview of the Instruction Fetch Unit

  • The common RTL operations
¾ Fetch the Instruction: mem[PC]
¾ Update the program counter:
¾ Sequential Code: PC <- PC + 4
¾ Branch and Jump: PC <- “something else”

[Figure: the PC, clocked by Clk, supplies the Address to the Instruction Memory, which outputs the Instruction Word; the Next Address Logic computes the next PC]

3b: Add & Subtract

  • R[rd] <- R[rs] op R[rt]. Example: addU rd, rs, rt
¾ Ra, Rb, and Rw come from instruction’s rs, rt, and rd fields
¾ ALUctr and RegWr: control logic after decoding the instruction

[Figure: register file of 32 32-bit registers, clocked by Clk, with read ports Ra/busA and Rb/busB and write port Rw/busW enabled by RegWr; Rs, Rt, and Rd from the R-type instruction (op | rs | rt | rd | shamt | funct, with 6, 5, 5, 5, 5, 6 bits) select the registers; busA and busB feed the ALU, controlled by ALUctr, whose Result returns on busW]

Putting it All Together: A Single Cycle Datapath

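[Figure: single-cycle datapath. Fetch unit: the PC (clocked by Clk), two Adders, PC Ext, and a Mux selected by nPC_sel produce the Adr for the Inst Memory, which outputs Instruction<31:0> with fields Rs, Rt, Rd, and Imm. Register file: 32 32-bit registers with ports Rw, Ra, Rb, busW, busA, busB, write enable RegWr, and a RegDst Mux choosing between Rt and Rd. Execute: an Extender (ExtOp) and the ALUSrc Mux feed the ALU (ALUctr), which also produces Equal. Memory: Data Memory with Data In, WrEn, Adr, and MemWr; the MemtoReg Mux selects the value written back on busW.]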

An Abstract View of the Critical Path

  • Register file and ideal memory:
¾ The CLK input is a factor ONLY during write operation
¾ During read operation, behave as combinational logic: Address valid => Output valid after “access time.”

Critical Path (Load Operation) =
    PC’s Clk-to-Q
  + Instruction Memory’s Access Time
  + Register File’s Access Time
  + ALU to Perform a 32-bit Add
  + Data Memory Access Time
  + Setup Time for Register File Write
  + Clock Skew

[Figure: the PC and Next Address logic drive the Instruction Address of the Ideal Instruction Memory; the Instruction’s Rs, Rt, Rd, and Imm fields go to the register file of 32 32-bit registers (Rw, Ra, Rb); register outputs A and B feed the ALU, whose result is the Data Address of the Ideal Data Memory (Data In); Clk drives the PC, the register file, and the data memory]
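The critical-path sum for a load sets the minimum clock period of the single-cycle design. The sketch below adds up the seven terms; every delay value is a made-up placeholder in picoseconds, not a figure from the course.

```python
# Hypothetical component delays in picoseconds (placeholders, not measured values).
load_path_ps = {
    "PC clk-to-Q": 100,
    "instruction memory access": 2000,
    "register file access": 1000,
    "ALU 32-bit add": 1500,
    "data memory access": 2000,
    "register file write setup": 100,
    "clock skew": 100,
}

# The clock period must cover the slowest instruction, which is the load.
min_cycle_ps = sum(load_path_ps.values())
max_freq_mhz = 1e6 / min_cycle_ps  # 1 / (ps * 1e-12) Hz, expressed in MHz

print(f"minimum cycle time: {min_cycle_ps} ps")
print(f"maximum clock frequency: {max_freq_mhz:.1f} MHz")
```

With these placeholder values the two memory accesses account for more than half of the period, which hints at why memory behavior matters so much to a single-cycle design.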

An Abstract View of the Implementation

[Figure: the implementation split into Control and Datapath. The Datapath contains the PC and Next Address logic, the Ideal Instruction Memory, the register file of 32 32-bit registers (Rw, Ra, Rb), the ALU, and the Ideal Data Memory (Data In, Data Out, Data Address). The Instruction feeds Control, which sends Control Signals to the Datapath and receives Conditions back.]

Step 4: Given Datapath: RTL -> Control

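[Figure: the Inst Memory, addressed by Adr, outputs Instruction<31:0>. Fields Rs <21:25>, Rt <16:20>, Rd <11:15>, and Imm <0:15> go to the DATA PATH; Op and Fun go to Control. Control generates nPC_sel, RegWr, RegDst, ExtOp, ALUSrc, ALUctr, MemWr, and MemtoReg, and receives Equal from the datapath.]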
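Step 4 amounts to filling in one row of control settings per instruction. The table below is a sketch derived from the register transfers and the signal names in the datapath figure; the symbolic values (for example "rd" versus "rt" for RegDst) and the choice of a subtract for the beq comparison are assumptions made for readability, not the encoding used in the course.

```python
X = None  # don't care: the signal does not affect this instruction

FIELDS = ("RegDst", "ALUSrc", "MemtoReg", "RegWr", "MemWr", "nPC_sel", "ExtOp", "ALUctr")

CONTROL = {
    #        RegDst ALUSrc  MemtoReg RegWr  MemWr  nPC_sel   ExtOp   ALUctr
    "addu": ("rd",  "busB", "alu",   True,  False, "+4",     X,      "add"),
    "subu": ("rd",  "busB", "alu",   True,  False, "+4",     X,      "sub"),
    "ori":  ("rt",  "imm",  "alu",   True,  False, "+4",     "zero", "or"),
    "lw":   ("rt",  "imm",  "mem",   True,  False, "+4",     "sign", "add"),
    "sw":   (X,     "imm",  X,       False, True,  "+4",     "sign", "add"),
    "beq":  (X,     "busB", X,       False, False, "branch", X,      "sub"),
}


def control_signals(inst: str) -> dict:
    """Return the control-point settings for one instruction as a dict."""
    return dict(zip(FIELDS, CONTROL[inst]))


for name in CONTROL:
    print(name, control_signals(name))
```

Reading a row against the RTL is a useful check: lw is the only instruction that routes memory data to the register file, and sw is the only one that asserts MemWr.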

Summary

  • 5 steps to design a processor
¾ 1. Analyze instruction set => datapath requirements
¾ 2. Select set of datapath components & establish clock methodology
¾ 3. Assemble datapath meeting the requirements
¾ 4. Analyze implementation of each instruction to determine setting of control points that effects the register transfer.
¾ 5. Assemble the control logic
  • MIPS makes it easier
¾ Instructions same size
¾ Source registers always in same place
¾ Immediates same size, location
¾ Operations always on registers/immediates
  • Single cycle datapath => CPI=1, but the clock cycle time (CCT) is long

Systematic Generation of Control

  • In a single-cycle processor, each instruction is realized by exactly one control command or “microinstruction”
¾ in general, the controller is a finite state machine
¾ microinstruction can also control sequencing (see later)

[Figure: Instruction Decode extracts the OPcode, which together with Conditions from the Datapath addresses the Control Logic / Store (PLA, ROM); the resulting microinstruction drives the Control Points of the Datapath]

What’s wrong with our CPI=1 processor?

  • Long Cycle Time
  • All instructions take as much time as the slowest
  • Real memory is not as nice as our idealized memory
¾ cannot always get the job done in one (short) cycle

Path through the datapath for each instruction class:

Arithmetic & Logical:  PC | Inst Memory | Reg File | mux | ALU | mux | setup
Load (Critical Path):  PC | Inst Memory | Reg File | mux | ALU | Data Mem | mux | setup
Store:                 PC | Inst Memory | Reg File | mux | ALU | Data Mem
Branch:                PC | Inst Memory | Reg File | cmp | mux

Partitioning the CPI=1 Datapath

  • Add registers between smallest steps

[Figure: the datapath divided into steps Instruction Fetch, Operand Fetch, Exec, Mem Access, and Result Store, with PC and Next PC logic, Reg. File, and Data Mem; control signals nPC_sel, RegDst, ExtOp, ALUSrc, ALUctr, MemRd, MemWr, RegWr, and the Equal condition]

Example Multicycle Datapath

  • Critical Path?

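[Figure: multicycle datapath with registers IR, A, B, E, S, and M inserted between the steps Instruction Fetch, Operand Fetch (Reg File), Exec (Ext, ALU), Mem Access (Data Mem), and Result Store (Reg. File), plus PC and Next PC logic; control signals nPC_sel, ExtOp, ALUSrc, ALUctr, MemRd, MemWr, MemToReg, RegDst, RegWr, and the Equal condition]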

Controller Design

  • The state diagrams that define the controller for an instruction set processor are highly structured
  • Use this structure to construct a simple “microsequencer”
  • Control reduces to programming this very simple device ⇒ microprogramming

[Figure: a sequencer updates the micro-PC, which selects the microinstruction; the microinstruction supplies sequencer control and datapath control]

Microprogramming

  • Microprogramming is a convenient method for implementing structured control state diagrams:
¾ Random logic replaced by microPC sequencer and ROM
¾ Each line of ROM called a μinstruction: contains sequencer control + values for control points
¾ limited state transitions: branch to zero, next sequential, branch to μinstruction address from dispatch ROM
  • Horizontal μCode: one control bit in μInstruction for every control line in datapath
  • Vertical μCode: groups of control-lines coded together in μInstruction (e.g. possible ALU dest)
  • Control design reduces to Microprogramming
¾ Part of the design process is to develop a “language” that describes control and is easy for humans to understand
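The difference between horizontal and vertical μCode can be shown on a single group of control lines. In this sketch the three line names are hypothetical stand-ins for the "possible ALU dest" example, and the group is assumed to be mutually exclusive (at most one line asserted per microinstruction), which is what makes encoding it safe.

```python
DEST_LINES = ("dest_rt", "dest_rd", "dest_mem")  # hypothetical control lines


def horizontal_bits(active):
    """Horizontal form: one bit per control line, stored directly in the ROM."""
    return tuple(int(line == active) for line in DEST_LINES)


def vertical_encode(active) -> int:
    """Vertical form: the group is stored as a small code (0 means none active)."""
    return 0 if active is None else DEST_LINES.index(active) + 1


def vertical_decode(code: int):
    """Decoder between the ROM and the datapath: expand the code to one bit per line."""
    return tuple(int(code == i + 1) for i in range(len(DEST_LINES)))


horizontal_width = len(DEST_LINES)             # 3 bits in the microinstruction
vertical_width = len(DEST_LINES).bit_length()  # 2 bits cover codes 0..3

for choice in (None, *DEST_LINES):
    code = vertical_encode(choice)
    print(choice, horizontal_bits(choice), code, vertical_decode(code))
```

The vertical form saves ROM width at the cost of a decoder on the way to the datapath; the saving grows with the size of each mutually exclusive group.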

Microprogramming

  • Microprogramming is a fundamental concept
¾ implement an instruction set by building a very simple processor and interpreting the instructions
¾ essential for very complex instructions and when few register transfers are possible
¾ overkill when ISA matches datapath 1-1

[Figure: microprogrammed controller. The Opcode indexes a Dispatch ROM; the μ-sequencer (fetch, dispatch, sequential) updates the micro-PC, which addresses the μ-Code ROM; the selected microinstruction (μ) supplies sequencer control and, through decoders, datapath control to the DataPath.]

Decoders implement our μ-code language. For instance: rt-ALU, rd-ALU, mem-ALU

Sequential Laundry

  • Sequential laundry takes 6 hours for 4 loads
  • If they learned pipelining, how long would laundry take?

[Figure: timeline from 6 PM to midnight with loads A, B, C, and D in task order; the loads run one after another, each taking 30, 40, and 20 minutes for its three steps]

Pipelined Laundry

Start work ASAP

  • Pipelined laundry takes 3.5 hours for 4 loads

[Figure: timeline starting at 6 PM with loads A, B, C, and D in task order; the loads overlap, and the intervals along the time axis are 30, 40, 40, 40, 40, and 20 minutes]
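Both laundry totals can be checked with a few lines of arithmetic. The sketch below uses the 30, 40, and 20 minute steps from the figures and assumes a load may wait between steps until the next machine is free; the function name `pipelined_finish` is illustrative.

```python
STAGE_MINUTES = (30, 40, 20)  # the three steps of one load, from the figure
LOADS = 4


def pipelined_finish(stage_minutes, loads: int) -> int:
    """Finish time in minutes when each load starts a stage as soon as it is free."""
    free_at = [0] * len(stage_minutes)  # time at which each stage becomes free
    for _ in range(loads):
        ready = 0  # time at which this load has left the previous stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, free_at[s])  # wait for the load and for the stage
            ready = start + minutes
            free_at[s] = ready
    return free_at[-1]


sequential = LOADS * sum(STAGE_MINUTES)            # one load at a time
pipelined = pipelined_finish(STAGE_MINUTES, LOADS)

print(f"sequential: {sequential / 60} hours")  # 6.0 hours
print(f"pipelined:  {pipelined / 60} hours")   # 3.5 hours
```

The pipelined total works out to 30 + 4 × 40 + 20 minutes: the 40-minute step paces the whole pipeline.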

Pipelining Lessons

  • Pipelining doesn’t help latency of single task, it helps throughput of entire workload
  • Pipeline rate limited by slowest pipeline stage
  • Multiple tasks operating simultaneously
  • Potential speedup = Number of pipe stages
  • Unbalanced lengths of pipe stages reduce speedup
  • Time to “fill” pipeline and time to “drain” it reduce speedup

[Figure: the pipelined laundry timeline again, with loads A, B, C, and D overlapping from 6 PM and intervals of 30, 40, 40, 40, 40, and 20 minutes]
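The effect of fill and drain time on speedup can be quantified for the ideal case. This sketch assumes perfectly balanced stages of one time unit each and no stalls; the function name `pipeline_speedup` is hypothetical.

```python
def pipeline_speedup(num_tasks: int, num_stages: int) -> float:
    """Speedup of an ideal balanced pipeline over running tasks back to back."""
    unpipelined = num_tasks * num_stages       # each task passes through every stage alone
    pipelined = num_stages + (num_tasks - 1)   # fill the pipe once, then one task per unit
    return unpipelined / pipelined


for tasks in (4, 100, 1000):
    print(tasks, round(pipeline_speedup(tasks, num_stages=3), 3))
```

With only 4 tasks a 3-stage pipeline gives a speedup of 2; the value approaches the number of stages as the task count grows, because the fill and drain time is amortized over more work.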

Instruction Pipeline

  • Instruction execution process lends itself naturally to pipelining
¾ overlap the subtasks of instruction fetch, decode and execute

How to improve performance?

  • Recall that performance is a function of
¾ CPI: cycles per instruction
¾ Clock cycle
¾ Instruction count
  • Reducing any of the 3 factors will lead to improved performance

How to improve performance?

  • First step is to apply the concept of pipelining to the instruction execution process
¾ Overlap computations
  • What does this do?
¾ Decreases the clock cycle
¾ Decreases effective CPU time compared with the original (longer) clock cycle

Pipeline Approach to Improve System Performance

  • Analogous to fluid flow in pipelines and assembly line in factories
  • Divide process into “stages” and send tasks into a pipeline
¾ Overlap computations of different tasks by operating on them concurrently in different stages

Instruction Level Parallel Processors (ILP)

  • early ILP - one of two orthogonal concepts:
¾ pipelining - vertical approach
¾ multiple (non-pipelined) units - horizontal approach
  • progression to multiple pipelined units
  • instruction issue became bottleneck, led to
¾ superscalar ILP processors
¾ Very Long Instruction Word (VLIW)
  • Note: key performance metric in all ILP processor classes is IPC (instructions per cycle)
¾ this is the degree of parallelism achieved
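IPC is the reciprocal of CPI, so it can be computed from the same two counts. The instruction and cycle counts below are hypothetical and only illustrate the difference between a single-issue pipeline and a multiple-issue processor.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle: the reciprocal of CPI."""
    return instructions / cycles


# Hypothetical runs of the same 1000-instruction program.
scalar = ipc(1000, 1250)        # single-issue pipeline with some stalls
multi_issue = ipc(1000, 400)    # several instructions issued per cycle

print(f"scalar pipeline: IPC = {scalar}, CPI = {1 / scalar}")
print(f"multiple issue:  IPC = {multi_issue}, CPI = {1 / multi_issue}")
```

An IPC above 1 is only possible when more than one instruction can be issued per cycle, which is the step that superscalar and VLIW designs take.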