CS 211: Computer Architecture
Instructor: Prof. Bhagi Narahari
Dept. of Computer Science
Course URL: www.seas.gwu.edu/~narahari/cs211/
CS 211: Bhagi Narahari,CS, GWU
Summary: Architecture Trends?
• Moore’s law: density doubles roughly every 18 months
  - smaller processors, faster clocks
  - price drops due to volume and dev. costs; what next?
• Interconnect delays could dominate over feature delay
  - need for simpler architectures
  - distributed logic and control
• More functionality
  - communicating processors
  - network of embedded processors
• To extract maximum performance
  - rules of thumb: Amdahl’s law, parallelism, locality
  - software and compiler support needed!
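Amdahl's law, named above as a rule of thumb, can be sketched in a few lines. This is a minimal illustration, not part of the original slides; the function name `amdahl_speedup` and the example numbers are assumptions.

```python
def amdahl_speedup(enhanced_fraction: float, enhancement_speedup: float) -> float:
    """Overall speedup when only a fraction of the run time is improved.

    enhanced_fraction: share of the original run time that benefits (0..1).
    enhancement_speedup: how much faster that share becomes.
    """
    if not 0.0 <= enhanced_fraction <= 1.0:
        raise ValueError("enhanced_fraction must be between 0 and 1")
    if enhancement_speedup <= 0:
        raise ValueError("enhancement_speedup must be positive")
    unchanged = 1.0 - enhanced_fraction
    return 1.0 / (unchanged + enhanced_fraction / enhancement_speedup)


# Hypothetical case: making 80% of the run time 4x faster
# yields only about 2.5x overall, because 20% is unchanged.
print(amdahl_speedup(0.8, 4.0))
```

The unchanged fraction bounds the result: even an unlimited speedup of the 80% could not exceed 5x overall.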
Next: Review
Computer Organization in an hour!
• Overview of Computer Organization
  - Components
  - Sample processor design process
Review: Computer Organization Basics
• What are the components of a CPU?
• What is the microarchitecture level?
• What is an ISA (instruction set architecture)?
• What does a sample processor design look like?
  - A simple processor architecture
• What is the basic concept of pipelining?
A Computer
The computer is composed of input devices, a central processing
unit, a memory unit and output devices.
[Figure: block diagram with input devices, the central processing unit, memory, an auxiliary storage device, and an output device]
Memory Unit
- An ordered sequence of storage cells,
each capable of holding a piece of data.
  - RAM – Random Access Memory
  - ROM – Read Only Memory
Computer System
[Figure: a processor with its cache sits on the memory-I/O bus together with memory and several I/O controllers; the controllers attach disks, a display, and a network, and signal the processor with interrupts]
Memory Hierarchy: The Tradeoff
(Numbers are for a 21264 at 700MHz)

Level                   Size             Speed     $/MB       Block size
register reference      608 B            1.4 ns               4 B
L1-cache reference      128 KB           4.2 ns               4 B
L2-cache reference      512 KB - 4 MB    16.8 ns   $90/MB     16 B
memory reference        128 MB           112 ns    $2-6/MB    4-8 KB
disk memory reference   27 GB            9 ms      $0.01/MB

Moving from the CPU registers toward disk, each level is larger, slower, and cheaper.
[Figure: CPU regs, caches, memory, and disk in a row; the transfers between levels are labeled 16 B, 8 B, and 4 KB, with "cache" and "virtual memory" marking the mechanisms that manage them]
Architecture Models: Von Neumann architecture
- Memory holds data, instructions.
- Central processing unit (CPU) fetches instructions from memory.
  - Separate CPU and memory distinguishes programmable computer.
- CPU registers help out: program counter (PC), instruction register (IR), general-purpose registers, etc.
CPU + memory
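[Figure: the CPU is connected to memory by address and data lines. The PC holds address 200; memory location 200 holds ADD r5,r1,r3, which is fetched into the IR.]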
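The fetch step pictured on this slide can be sketched as follows. This is only an illustration: the dictionary layout, the function name `fetch`, and the 4-byte PC increment (borrowed from the MIPS slides later in the deck) are assumptions.

```python
# Memory maps addresses to contents; address 200 holds one instruction.
memory = {200: "ADD r5,r1,r3"}

# The two CPU registers involved in a fetch.
registers = {"PC": 200, "IR": None}


def fetch(memory, registers, instruction_bytes=4):
    """Copy the instruction addressed by PC into IR, then advance PC."""
    registers["IR"] = memory[registers["PC"]]
    registers["PC"] += instruction_bytes  # assumed fixed-size instructions
    return registers["IR"]


fetch(memory, registers)
print(registers)
```

After the call, IR holds the instruction from address 200 and the PC points at the next sequential instruction.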
Harvard architecture
[Figure: a CPU with its PC is connected to a separate data memory and program memory, each through its own address and data lines]
von Neumann vs. Harvard
- Harvard can’t use self-modifying code.
- Harvard allows two simultaneous memory fetches.
- Most DSPs use Harvard architecture for streaming data:
  - greater memory bandwidth;
  - more predictable bandwidth.
Instruction Set Architecture
- The Instruction Set Architecture (ISA)
describes a set of instructions whose
syntactic and semantic characteristics
are defined by the underlying computer architecture.
Programming model
- Programming model: registers visible to
the programmer.
- Some registers are not visible (IR).
Multiple implementations
- Successful architectures have several implementations:
  - varying clock speeds;
  - different bus widths;
  - different cache sizes;
  - etc.
Assembly language
- One-to-one with instructions (more or less).
  - One instruction per line.
  - Labels provide names for addresses (usually in first column).
  - Instructions often start in later columns.
  - Comments run to end of line.
Evolution of Instruction Sets
- Major advances in computer architecture are typically associated with landmark instruction set designs
  - Ex: Stack vs GPR (System 360)
- Design decisions must take into account:
  - technology
  - machine organization
  - programming languages
  - compiler technology
  - operating systems
  - applications
- And they in turn influence these
CISC vs. RISC
- Complex instruction set computer (CISC):
  - many addressing modes;
  - many operations.
- Reduced instruction set computer (RISC):
  - load/store;
  - pipelined instructions.
CISC Processors
- Instruction decoding is performed with
large microcode ROMs
- Some instructions require more than a
single instruction cycle to execute
- Many addressing modes supported
- Register set was designed to support
specific functions
RISC Processors
- Instruction decoding is performed with
static (hard-wired) logic for a much faster result
- Instructions are designed to execute in a
single instruction cycle
- Data processing instructions operate
only on registers; only load and store
instructions access memory
- Register set is large and general purpose
(in many cases)
IA - 32
- 1978: The Intel 8086 is announced (16 bit architecture)
- 1980: The 8087 floating point coprocessor is added
- 1982: The 80286 increases address space to 24 bits, +instructions
- 1985: The 80386 extends to 32 bits, new addressing modes
- 1989-1995: The 80486, Pentium, Pentium Pro add a few instructions
(mostly designed for higher performance)
- 1997: 57 new “MMX” instructions are added, Pentium II
- 1999: The Pentium III added another 70 instructions (SSE)
- 2001: Another 144 instructions (SSE2)
- 2003: AMD extends the architecture to increase address space to 64 bits,
widens all registers to 64 bits and other changes (AMD64)
- 2004: Intel capitulates and embraces AMD64 (calls it EM64T) and adds
more media extensions
- “This history illustrates the impact of the ‘golden handcuffs’ of compatibility”
- “adding new features as someone might add clothing to a packed bag”
- “an architecture that is difficult to explain and impossible to love”
IA-32 Overview
  - Instructions from 1 to 17 bytes long
  - one operand must act as both a source and destination
  - one operand can come from memory
  - complex addressing modes, e.g., “base or scaled index with 8 or 32 bit displacement”
  - the most frequently used instructions are not too difficult to build
  - compilers avoid the portions of the architecture that are slow
“what the 80x86 lacks in style is made up in quantity, making it beautiful from the right perspective”
Quick look at ISA
  - Simple RISC ISA
  - Widely used
Instruction set characteristics
- Fixed vs. variable length.
- Addressing modes.
- Number of operands.
- Types of operands.
The Big Picture: The Performance Perspective
- Performance of a machine is determined by:
  - Instruction count
  - Clock cycle time
  - Clock cycles per instruction (CPI)
- Processor design (datapath and control) will determine:
  - Clock cycle time
  - Clock cycles per instruction
- CPU time = Inst. Count × CPI × Cycle Time
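The three factors on this slide multiply to give execution time. A minimal sketch follows; the function name and the example workload are hypothetical.

```python
def cpu_time_seconds(instruction_count: float, cpi: float, cycle_time_seconds: float) -> float:
    """Execution time = instruction count x cycles per instruction x cycle time."""
    return instruction_count * cpi * cycle_time_seconds


# Hypothetical workload: one billion instructions, CPI of 1.5, 1 ns clock cycle.
baseline = cpu_time_seconds(1e9, 1.5, 1e-9)

# Halving the cycle time halves the run time only if CPI and
# instruction count stay the same.
faster_clock = cpu_time_seconds(1e9, 1.5, 0.5e-9)

print(baseline, faster_clock)
```

Because the terms multiply, a design change that shortens the cycle but raises CPI by the same factor gains nothing.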
Microarchitecture Design: How?
- Any design must attempt to meet the requirements
  - Where do the requirements come from?
  - Ex: need to represent numbers in binary; integers, text, floating point
- How to proceed with design?
Some History…
- In 1897 the Indiana Legislature considered
a bill that would in effect have set the value of π
to 3.2; it passed the House but was never enacted.
How to Design a Processor: step-by-step
- Analyze instruction set => datapath requirements
  - the meaning of each instruction is given by the register transfers
  - datapath must include storage elements for ISA registers
    - possibly more
  - datapath must support each register transfer
- Select set of datapath components and establish clocking methodology
- Assemble datapath meeting the requirements
- Analyze implementation of each instruction to determine setting of control points that effects the register transfer.
- Assemble the control logic
- Let’s look at a single cycle ISA…
The MIPS Instruction Formats
- All MIPS instructions are 32 bits long. The three instruction formats are:
  - R-type
  - I-type
  - J-type
- The different fields are:
  - op: operation of the instruction
  - rs, rt, rd: the source and destination register specifiers
  - shamt: shift amount
  - funct: selects the variant of the operation in the “op” field
  - address / immediate: address offset or immediate value
  - target address: target address of the jump instruction

Format   Fields (width in bits)
R-type   op (6) | rs (5) | rt (5) | rd (5) | shamt (5) | funct (6)
I-type   op (6) | rs (5) | rt (5) | immediate (16)
J-type   op (6) | target address (26)
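The field widths above translate directly into shifts and masks. The sketch below is illustrative only; the helper names `decode` and `encode_r_type` are assumptions, and the example uses the standard MIPS encoding of `addu` (op 0, funct 0x21).

```python
def decode(word: int) -> dict:
    """Split a 32-bit instruction word into the fields of all three formats.

    Which fields are meaningful depends on the format selected by op.
    """
    return {
        "op": (word >> 26) & 0x3F,        # bits 31..26, all formats
        "rs": (word >> 21) & 0x1F,        # bits 25..21, R- and I-type
        "rt": (word >> 16) & 0x1F,        # bits 20..16, R- and I-type
        "rd": (word >> 11) & 0x1F,        # bits 15..11, R-type
        "shamt": (word >> 6) & 0x1F,      # bits 10..6,  R-type
        "funct": word & 0x3F,             # bits 5..0,   R-type
        "immediate": word & 0xFFFF,       # bits 15..0,  I-type
        "target": word & 0x3FFFFFF,       # bits 25..0,  J-type
    }


def encode_r_type(op: int, rs: int, rt: int, rd: int, shamt: int, funct: int) -> int:
    """Pack R-type fields into one 32-bit word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


# addu $3, $1, $2
word = encode_r_type(op=0, rs=1, rt=2, rd=3, shamt=0, funct=0x21)
fields = decode(word)
print(hex(word), fields)
```

Fixed field positions are what let the register file be read before the opcode is fully decoded.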
Step 1a: The MIPS-Inst Set (eg.)
  - addU rd, rs, rt     (R-type: op | rs | rt | rd | shamt | funct)
  - subU rd, rs, rt     (R-type)
  - ori  rt, rs, imm    (I-type: op | rs | rt | immediate)
  - lw   rt, rs, imm    (I-type)
  - sw   rt, rs, imm    (I-type)
  - beq  rs, rt, imm    (I-type)
- Registers rs and rt are the source registers.
- If the instruction has three register operands, then rd is the destination register.
- If the instruction has two register operands, then rt is the destination register.
Logical Register Transfers
- RTL gives the meaning of the instructions
- All start by fetching the instruction
    op | rs | rt | rd | shamt | funct = MEM[ PC ]
    op | rs | rt | Imm16              = MEM[ PC ]

inst     Register Transfers
ADDU     R[rd] <- R[rs] + R[rt];                      PC <- PC + 4
SUBU     R[rd] <- R[rs] - R[rt];                      PC <- PC + 4
ORi      R[rt] <- R[rs] | zero_ext(Imm16);            PC <- PC + 4
LOAD     R[rt] <- MEM[ R[rs] + sign_ext(Imm16) ];     PC <- PC + 4
STORE    MEM[ R[rs] + sign_ext(Imm16) ] <- R[rt];     PC <- PC + 4
BEQ      if ( R[rs] == R[rt] ) then PC <- PC + 4 + ( sign_ext(Imm16) || 00 )
                               else PC <- PC + 4
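The register transfers above can be executed directly by a small interpreter. This is a sketch under stated assumptions: instructions are stored as already-decoded tuples instead of 32-bit words, the `state` layout and helper names are hypothetical, and register 0 is not forced to zero.

```python
MASK32 = 0xFFFFFFFF


def sign_ext(imm16: int) -> int:
    """Interpret the low 16 bits as a signed value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def zero_ext(imm16: int) -> int:
    return imm16 & 0xFFFF


def step(state: dict) -> None:
    """Fetch the instruction at PC and apply its register transfer."""
    R, MEM, pc = state["R"], state["MEM"], state["PC"]
    name, a, b, c = state["IMEM"][pc]
    next_pc = pc + 4
    if name == "addu":                      # addu rd, rs, rt
        R[a] = (R[b] + R[c]) & MASK32
    elif name == "subu":                    # subu rd, rs, rt
        R[a] = (R[b] - R[c]) & MASK32
    elif name == "ori":                     # ori rt, rs, imm
        R[a] = R[b] | zero_ext(c)
    elif name == "lw":                      # lw rt, rs, imm
        R[a] = MEM.get((R[b] + sign_ext(c)) & MASK32, 0)
    elif name == "sw":                      # sw rt, rs, imm
        MEM[(R[b] + sign_ext(c)) & MASK32] = R[a]
    elif name == "beq":                     # beq rs, rt, imm
        if R[a] == R[b]:
            next_pc = pc + 4 + (sign_ext(c) << 2)   # "|| 00" is a shift by 2
    else:
        raise ValueError(f"unsupported instruction: {name}")
    state["PC"] = next_pc & MASK32


state = {
    "PC": 0,
    "R": [0] * 32,
    "MEM": {},
    "IMEM": {
        0: ("ori", 1, 0, 5),     # R1 = 5
        4: ("ori", 2, 0, 7),     # R2 = 7
        8: ("addu", 3, 1, 2),    # R3 = R1 + R2
        12: ("sw", 3, 0, 16),    # MEM[16] = R3
        16: ("lw", 4, 0, 16),    # R4 = MEM[16]
        20: ("subu", 5, 3, 1),   # R5 = R3 - R1
        24: ("beq", 5, 2, 1),    # taken: skips one instruction
    },
}
for _ in range(7):
    step(state)
print(state["R"][:6], state["PC"])
```

Each branch of `step` is one row of the table; the datapath built in the following steps has to support exactly these transfers.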
Step 2: Components of the Datapath
- Combinational Elements
- Storage Elements
  - Clocking methodology
Clocking Methodology
- Clocks are needed in sequential logic to decide when an element
that contains state should be updated.
- A clock is a free-running circuit with a fixed cycle time or clock
period. The clock frequency is the inverse of the cycle time.
- The clock cycle time or clock period is divided into two
portions: when the clock is high and when the clock is low.
- Edge-triggered clocking: all state changes occur on a clock
edge.
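[Figure: Clk waveform showing one clock period, the rising and falling edges, and the setup and hold windows around the triggering edges; data outside those windows is marked "Don't Care"]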
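The inverse relation between clock frequency and cycle time can be checked numerically. The function names below are assumptions; the 700 MHz figure is the clock rate quoted earlier for the 21264.

```python
def clock_period_ns(frequency_mhz: float) -> float:
    """Cycle time in nanoseconds for a clock frequency given in MHz."""
    return 1000.0 / frequency_mhz


def clock_frequency_mhz(period_ns: float) -> float:
    """Clock frequency in MHz for a cycle time given in nanoseconds."""
    return 1000.0 / period_ns


# A 700 MHz clock has a period of roughly 1.43 ns.
print(clock_period_ns(700.0))
```

That period is close to the 1.4 ns register reference listed in the memory hierarchy table, which is one way to read "a register access takes about one cycle".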
Step 3: Assemble DataPath meeting our requirements
- Register Transfer Requirements ⇒ Datapath Assembly
- Instruction Fetch
- Read Operands and Execute Operation
The common RTL operations for all instructions are:
(a) Fetch the instruction using the Program Counter (PC) at the beginning of an
instruction’s execution (PC -> Instruction Memory -> Instruction Word).
(b) Then at the end of the instruction’s execution, you need to update the
Program Counter (PC -> Next Address Logic -> PC).
More specifically, you need to increment the PC by 4 if you are executing sequential code.
For Branch and Jump instructions, you need to update the program counter to “something
else” other than plus 4.
The Next Address Logic block either:
- adds 4 (the number of bytes in an instruction), or
- computes the new PC for Branch and Jump instructions
3a: Overview of the Instruction Fetch Unit
- The common RTL operations
  - Fetch the instruction: mem[PC]
  - Update the program counter:
    - Sequential code: PC <- PC + 4
    - Branch and Jump: PC <- “something else”
[Figure: the PC, clocked by Clk, drives the Address input of the Instruction Memory, which outputs the Instruction Word; the Next Address Logic computes the new PC value]
3b: Add & Subtract
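- R[rd] <- R[rs] op R[rt]
- Example: addU rd, rs, rt
  - Ra, Rb, and Rw come from the instruction’s rs, rt, and rd fields
  - ALUctr and RegWr: control logic after decoding the instruction
[Figure: a register file of 32 32-bit registers, clocked by Clk and gated by RegWr, takes Rs, Rt, and Rd on its Ra, Rb, and Rw ports; busA and busB feed the ALU (controlled by ALUctr), and the ALU Result returns on busW. The R-type format op (6 bits) | rs (5) | rt (5) | rd (5) | shamt (5) | funct (6) is shown for reference.]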
Putting it All Together: A Single Cycle Datapath
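[Figure: single-cycle datapath. The PC, updated through two adders, PC Ext, and a mux selected by nPC_sel, addresses the Inst Memory. Instruction<31:0> supplies Rs, Rt, Rd, and Imm. A mux selected by RegDst chooses Rt or Rd as Rw for the 32 32-bit register file (Ra, Rb, busA, busB, busW, RegWr, Clk). The Extender (ExtOp) and a mux (ALUSrc) feed the ALU (ALUctr), which also reports Equal. The ALU output addresses the Data Memory (Data In, WrEn, Adr, MemWr, Clk), and a mux selected by MemtoReg drives busW.]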
An Abstract View of the Critical Path
- Register file and ideal memory:
  - The CLK input is a factor ONLY during write operation
  - During read operation, behave as combinational logic:
    - Address valid => Output valid after “access time.”
- Critical Path (Load Operation) =
    PC’s Clk-to-Q
  + Instruction Memory’s Access Time
  + Register File’s Access Time
  + ALU to Perform a 32-bit Add
  + Data Memory Access Time
  + Setup Time for Register File Write
  + Clock Skew
[Figure: the PC and Next Address logic feed the Instruction Address of the Ideal Instruction Memory; the Instruction supplies Rs, Rt, Rd, and Imm (16 bits) to the 32 32-bit register file, whose A and B outputs feed the ALU; the ALU output is the Data Address of the Ideal Data Memory, which also has a Data In port]
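The critical-path expression is a plain sum, so the minimum cycle time follows once each delay is known. The delays below are hypothetical placeholders chosen only to make the arithmetic concrete; they are not from the slides.

```python
# Assumed component delays in picoseconds (illustrative values only),
# listed in the order the load instruction passes through them.
LOAD_PATH_PS = [
    ("PC clk-to-Q", 50),
    ("instruction memory access", 400),
    ("register file access", 200),
    ("ALU 32-bit add", 300),
    ("data memory access", 400),
    ("register file write setup", 50),
    ("clock skew", 50),
]

# The clock period must cover the whole load path.
min_cycle_ps = sum(delay for _, delay in LOAD_PATH_PS)

# 1 / period, converted from picoseconds to MHz.
max_clock_mhz = 1_000_000 / min_cycle_ps

print(min_cycle_ps, round(max_clock_mhz, 1))
```

With these assumed numbers the two memory accesses account for more than half of the cycle, which is why the memory model matters so much for a single-cycle design.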
An Abstract View of the Implementation
[Figure: the same components split into two blocks. The Datapath contains the PC, Next Address logic, Ideal Instruction Memory, the 32 32-bit register file, the ALU, and the Ideal Data Memory (Data In, Data Out, Data Address). The Control block receives the Instruction and Conditions from the datapath and sends Control Signals back to it.]
Step 4: Given Datapath: RTL -> Control
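[Figure: Instruction<31:0> from the Inst Memory is split into Op, Fun, Rs <21:25>, Rt <16:20>, Rd <11:15>, and Imm <0:15>. The Control block takes Op and Fun and drives nPC_sel, RegWr, RegDst, ExtOp, ALUSrc, ALUctr, MemWr, and MemtoReg into the DATA PATH, which returns Equal.]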
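One way to picture the RTL-to-control step is a table with one row of control-signal settings per instruction. The values below are derived from the register transfers given earlier, under assumed signal polarities that the slides do not state: RegDst=1 selects rd, ALUSrc=1 selects the extended immediate, MemtoReg=1 selects memory data, ExtOp=1 means sign extension, and `None` marks a don't-care.

```python
# Control settings per instruction (assumed polarities, see lead-in).
CONTROL = {
    "addu": dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWr=1, MemWr=0,
                 nPC_sel=0, ExtOp=None, ALUctr="add"),
    "subu": dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWr=1, MemWr=0,
                 nPC_sel=0, ExtOp=None, ALUctr="sub"),
    "ori":  dict(RegDst=0, ALUSrc=1, MemtoReg=0, RegWr=1, MemWr=0,
                 nPC_sel=0, ExtOp=0, ALUctr="or"),
    "lw":   dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWr=1, MemWr=0,
                 nPC_sel=0, ExtOp=1, ALUctr="add"),
    "sw":   dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWr=0, MemWr=1,
                 nPC_sel=0, ExtOp=1, ALUctr="add"),
    "beq":  dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWr=0, MemWr=0,
                 nPC_sel=1, ExtOp=None, ALUctr="sub"),
}


def take_branch(instruction: str, equal: bool) -> bool:
    """The PC takes the branch target only when nPC_sel is set and Equal holds."""
    return bool(CONTROL[instruction]["nPC_sel"]) and equal


def destination_register(instruction: str, rt: int, rd: int):
    """Register written by the instruction, or None if it writes no register."""
    signals = CONTROL[instruction]
    if not signals["RegWr"]:
        return None
    return rd if signals["RegDst"] else rt


print(destination_register("addu", rt=2, rd=3), destination_register("lw", rt=2, rd=3))
```

In hardware this table becomes combinational logic (for example a PLA) indexed by the op and funct fields.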
Summary
- 5 steps to design a processor
  1. Analyze instruction set => datapath requirements
  2. Select set of datapath components & establish clock methodology
  3. Assemble datapath meeting the requirements
  4. Analyze implementation of each instruction to determine setting of control points that effects the register transfer
  5. Assemble the control logic
- MIPS makes it easier:
  - Instructions same size
  - Source registers always in same place
  - Immediates same size, location
  - Operations always on registers/immediates
- Single-cycle datapath => CPI = 1, but the clock cycle time (CCT) is long
Systematic Generation of Control
- In a single-cycle processor, each instruction is realized by exactly one control command or “microinstruction”
  - in general, the controller is a finite state machine
  - a microinstruction can also control sequencing (see later)
[Figure: the opcode of the Instruction feeds the Control Logic / Store (PLA, ROM); the microinstruction it outputs is decoded into Control Points on the Datapath, and the Datapath returns Conditions]
What’s wrong with our CPI=1 processor?
- Long Cycle Time
- All instructions take as much time as the slowest
- Real memory is not as nice as our idealized memory
  - cannot always get the job done in one (short) cycle

Instruction class      Path through the datapath
Arithmetic & Logical   PC -> Inst Memory -> Reg File -> mux -> ALU -> mux -> setup
Load                   PC -> Inst Memory -> Reg File -> mux -> ALU -> Data Mem -> mux -> setup   (Critical Path)
Store                  PC -> Inst Memory -> Reg File -> mux -> ALU -> Data Mem
Branch                 PC -> Inst Memory -> Reg File -> cmp -> mux
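The cost of making every instruction as slow as the slowest can be estimated with a short calculation. All delays and the instruction mix below are hypothetical, and muxes are ignored for simplicity; only the structure of the four paths comes from the slide.

```python
# Assumed component delays in picoseconds (illustrative only).
DELAY_PS = {"imem": 400, "reg_read": 200, "alu": 300, "dmem": 400, "reg_write": 100}

# Components each instruction class actually uses.
PATHS = {
    "arith":  ["imem", "reg_read", "alu", "reg_write"],
    "load":   ["imem", "reg_read", "alu", "dmem", "reg_write"],
    "store":  ["imem", "reg_read", "alu", "dmem"],
    "branch": ["imem", "reg_read", "alu"],   # the compare is modeled as an ALU pass
}

# Assumed instruction mix in percent (sums to 100).
MIX_PERCENT = {"arith": 50, "load": 20, "store": 10, "branch": 20}

path_ps = {name: sum(DELAY_PS[c] for c in comps) for name, comps in PATHS.items()}

# Single-cycle design: the clock must fit the slowest class.
single_cycle_ps = max(path_ps.values())

# Average time if each class needed only its own path.
ideal_average_ps = sum(MIX_PERCENT[n] * path_ps[n] for n in PATHS) / 100

print(path_ps, single_cycle_ps, ideal_average_ps)
```

Under these assumptions the single-cycle clock is set by the load, while the average instruction would need noticeably less time; that gap is the motivation for splitting execution into shorter steps.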
Partitioning the CPI=1 Datapath
- Add registers between smallest steps
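[Figure: the single-cycle datapath cut into steps: PC and Next PC, Instruction Fetch, Operand Fetch (Reg. File), Exec, Mem Access (Data Mem), and Result Store. Control signals shown: nPC_sel, RegDst, ExtOp, ALUSrc, ALUctr, MemRd, MemWr, RegWr; condition: Equal.]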
Example Multicycle Datapath
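[Figure: multicycle datapath with registers between the steps: PC and Next PC; Instruction Fetch into IR; Operand Fetch from the Reg File; Ext and ALU; Mem Access to Data Mem; Result Store back to the Reg. File. Intermediate registers are labeled IR, A, B, E, S, and M. Control signals: nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemRd, MemWr, MemToReg; condition: Equal.]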
Controller Design
- The state diagrams that define the controller for an instruction
set processor are highly structured
- Use this structure to construct a simple “microsequencer”
- Control reduces to programming this very simple device
  ⇒ microprogramming
[Figure: a sequencer updates the micro-PC; the microinstruction it selects supplies both sequencer control and datapath control]
Microprogramming
- Microprogramming is a convenient method for implementing structured control state diagrams:
  - Random logic replaced by microPC sequencer and ROM
  - Each line of ROM is called a μinstruction: it contains sequencer control + values for control points
  - Limited state transitions: branch to zero, next sequential, branch to μinstruction address from dispatch ROM
- Horizontal μCode: one control bit in μInstruction for every control line in datapath
- Vertical μCode: groups of control lines coded together in μInstruction (e.g. possible ALU dest)
- Control design reduces to microprogramming
  - Part of the design process is to develop a “language” that describes control and is easy for humans to understand
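The three permitted state transitions (next sequential, dispatch, branch to zero) are enough to drive a tiny microsequencer. The sketch below is purely illustrative: the ROM contents, the step labels, and the two opcodes are invented for the example and do not come from the slides.

```python
from enum import Enum


class Seq(Enum):
    NEXT = "next sequential"
    DISPATCH = "branch via dispatch ROM"
    ZERO = "branch to zero"


# Hypothetical microcode ROM: each line pairs a label standing in for the
# control-point values with the sequencer control for that line.
UCODE_ROM = [
    ("fetch", Seq.NEXT),            # 0
    ("decode", Seq.DISPATCH),       # 1
    ("alu_execute", Seq.NEXT),      # 2
    ("alu_writeback", Seq.ZERO),    # 3
    ("mem_address", Seq.NEXT),      # 4
    ("mem_read", Seq.NEXT),         # 5
    ("load_writeback", Seq.ZERO),   # 6
]

# Hypothetical dispatch ROM: opcode -> microinstruction address.
DISPATCH_ROM = {"addu": 2, "lw": 4}


def run_instruction(opcode: str) -> list:
    """Return the microinstruction labels executed for one machine instruction."""
    micro_pc = 0
    trace = []
    while True:
        label, seq = UCODE_ROM[micro_pc]
        trace.append(label)
        if seq is Seq.NEXT:
            micro_pc += 1
        elif seq is Seq.DISPATCH:
            micro_pc = DISPATCH_ROM[opcode]
        else:
            # Branch to zero: the next machine instruction starts at fetch.
            return trace


print(run_instruction("addu"))
print(run_instruction("lw"))
```

Adding a new machine instruction in this scheme means adding ROM lines and one dispatch entry, without touching the sequencer itself.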
Microprogramming
- Microprogramming is a fundamental concept
  - implement an instruction set by building a very simple processor and interpreting the instructions
  - essential for very complex instructions and when few register transfers are possible
  - overkill when ISA matches datapath 1-
[Figure: the Opcode indexes a Dispatch ROM; the μ-sequencer (fetch, dispatch, sequential) sets the micro-PC, which addresses the μ-Code ROM. Each microinstruction (μ) supplies sequencer control and datapath control, and Decode blocks expand it into signals to the datapath.]
- Decoders implement our μ-code language, for instance: rt-ALU, rd-ALU, mem-ALU
Sequential Laundry
- Sequential laundry takes 6 hours for 4 loads
- If they learned pipelining, how long would laundry take?
[Figure: task order vs. time, from 6 PM to midnight. Loads A, B, C, and D run back to back, each taking 30 + 40 + 20 minutes.]
Pipelined Laundry
- Start work ASAP
- Pipelined laundry takes 3.5 hours for 4 loads
[Figure: task order vs. time, starting at 6 PM. Loads A, B, C, and D overlap; the segments along the time axis are 30, 40, 40, 40, 40, and 20 minutes.]
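Both laundry totals can be reproduced from the three stage times shown in the figures (30, 40, and 20 minutes). The sketch below assumes each stage handles one load at a time and that a load enters a stage as soon as both the load and the stage are free; the stage names in the comment follow the usual wash/dry/fold reading of the analogy and are an assumption.

```python
STAGE_MINUTES = [30, 40, 20]  # assumed to be wash, dry, fold


def sequential_minutes(loads: int, stages: list) -> int:
    """Each load finishes all stages before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(loads: int, stages: list) -> int:
    """Loads overlap; a stage serves one load at a time."""
    stage_free_at = [0] * len(stages)   # when each stage next becomes idle
    for _ in range(loads):
        ready = 0                        # when this load leaves the previous stage
        for i, duration in enumerate(stages):
            start = max(ready, stage_free_at[i])
            ready = start + duration
            stage_free_at[i] = ready
    return stage_free_at[-1]


print(sequential_minutes(4, STAGE_MINUTES) / 60)   # hours, sequential
print(pipelined_minutes(4, STAGE_MINUTES) / 60)    # hours, pipelined
```

The pipelined total is 30 + 4 x 40 + 20 minutes: once the pipeline is full, the 40-minute stage sets the pace.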
Pipelining Lessons
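- Pipelining doesn’t help latency of a single task; it helps throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to “fill” the pipeline and time to “drain” it reduce speedup
[Figure: task order vs. time from 6 PM to 9 PM; loads A, B, C, and D overlap, with segments of 30, 40, 40, 40, 40, and 20 minutes]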
Instruction Pipeline
- Instruction execution process lends itself
naturally to pipelining
  - overlap the subtasks of instruction fetch, decode and execute
How to improve performance?
- Recall performance is function of
  - CPI: cycles per instruction
  - Clock cycle
  - Instruction count
- Reducing any of the 3 factors will lead to
improved performance
How to improve performance?
- First step is to apply concept of
pipelining to the instruction execution
process
  - Overlap computations
  - Decrease clock cycle
  - Decrease effective CPU time compared to original clock cycle
Pipeline Approach to Improve System Performance
- Analogous to fluid flow in pipelines and
assembly line in factories
- Divide process into “stages” and send
tasks into a pipeline
  - Overlap computations of different tasks by operating on them concurrently in different stages
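For an idealized pipeline the benefit of overlapping tasks has a simple closed form. The sketch assumes equal-length stages, one task entering per cycle, and no stalls; the function names are hypothetical.

```python
def pipeline_cycles(num_stages: int, num_tasks: int) -> int:
    """Cycles to finish all tasks: fill the pipeline once, then one task per cycle."""
    return num_stages + num_tasks - 1


def ideal_speedup(num_stages: int, num_tasks: int) -> float:
    """Speedup over running every task through all stages one at a time."""
    unpipelined_cycles = num_stages * num_tasks
    return unpipelined_cycles / pipeline_cycles(num_stages, num_tasks)


# With a 5-stage pipeline the speedup approaches 5 as the task count grows.
for tasks in (1, 10, 1000):
    print(tasks, round(ideal_speedup(5, tasks), 2))
```

A single task gains nothing, since its latency is unchanged; the gain comes entirely from throughput over many tasks.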
Instruction Level Parallel Processors (ILP)
- early ILP - one of two orthogonal concepts:
  - pipelining - vertical approach
  - multiple (non-pipelined) units - horizontal approach
- progression to multiple pipelined units
- instruction issue became the bottleneck, which led to
  - superscalar ILP processors
  - Very Long Instruction Word (VLIW)
- Note: the key performance metric in all ILP processor classes is IPC (instructions per cycle)
  - this is the degree of parallelism achieved