




























































































An in-depth analysis of CPU execution time, Millions of Instructions Per Second (MIPS), and Millions of Floating Point Operations Per Second (MFLOP/s). It covers the CPU performance equation, instruction pipelining, and the impact of data and control hazards on pipeline performance. The document also discusses the concept of Instructions Per Cycle (IPC) and its significance in optimizing pipelined instruction execution.





























































































Lec # 1
Spring 2016 1-26-
Advanced Computer Architecture
Course Goal: Understanding the important emerging design techniques, machine structures, technology factors, and evaluation methods that will determine the form of high-performance programmable processors and computing systems in the 21st century.
Important Factors:
Driving Force: Applications with diverse and increased computational demands, even in mainstream computing (multimedia etc.)
Techniques must be developed to overcome the major limitations of current computing systems to meet such demands:
Instruction-Level Parallelism (ILP) limitations, memory latency, I/O performance.
Increased branch penalty/other stalls in deeply pipelined CPUs.
General-purpose processors as the only homogeneous system computing resource.
Enables implementing more advanced architectural enhancements.
Enables chip-level Thread Level Parallelism:
Simultaneous Multithreading (SMT)/Chip Multiprocessors (CMPs, AKA multi-core processors).
Enables a high-level of chip-level system integration.
System On Chip (SOC) approach
Course Topics
Overcoming inherent ILP & clock scaling limitations by exploiting Thread-Level Parallelism (TLP):
Support for Simultaneous Multithreading (SMT).
Alpha EV8, Intel P4 Xeon and Core i7 (aka Hyper-Threading), IBM Power5.
Chip Multiprocessors (CMPs):
The Hydra Project: An example CMP with Hardware Data/Thread Level Speculation (TLS) Support.
IBM Power4, 5, 6 ….
Instruction Fetch Bandwidth/Memory Latency Reduction:
Conventional & Block-based Trace Cache (Intel P4).
Advanced Dynamic Branch Prediction Techniques.
Towards micro heterogeneous computing systems:
Vector processing.
Vector Intelligent RAM (VIRAM).
Digital Signal Processing (DSP), Media Processors.
Graphics Processor Units (GPUs).
Re-Configurable Computing and Processors.
Virtual Memory Design/Implementation Issues.
High Performance Storage: Redundant Arrays of Disks (RAID).
Computing Engine Choices
General Purpose Processors (GPPs): Intended for general purpose computing (desktops, servers, clusters...)
Application-Specific Processors (ASPs): Processors with ISAs and architectural features tailored towards specific application domains
E.g. Digital Signal Processors (DSPs), Network Processors (NPs), Media Processors, Graphics Processing Units (GPUs),
Vector Processors??? ...
Co-Processors: A hardware (hardwired) implementation of specific algorithms with a limited programming interface (augment GPPs or ASPs)
Configurable Hardware:
Field Programmable Gate Arrays (FPGAs)
Configurable array of simple processing elements
Application Specific Integrated Circuits (ASICs): A custom VLSI hardware solution for a specific computational task
The choice of one or more depends on a number of factors including:
(general purpose vs. specialized)
Desired level of flexibility
Development cost
Power requirements
Real-time constraints
Computing Engine Choices
[Figure: spectrum of computing engines, with General Purpose Processors (GPPs) at the most programmable/flexible end and Application Specific Integrated Circuits (ASICs) at the most specialized end. Application-Specific Processors (ASPs), Co-Processors, and Configurable Hardware lie in between. Programmability/flexibility falls toward the ASIC end, while specialization, development cost/time, and performance/chip area/watt (computational efficiency) rise.]
ASPs: e.g. Digital Signal Processors (DSPs), Network Processors (NPs), Media Processors, Graphics Processing Units (GPUs)
Selection factors: (general purpose vs. specialized), desired level of flexibility, development cost, power requirements
Processor = Programmable computing element that runs programs written using a pre-defined set of instructions (ISA)
ISA Requirements → Processor Design (Software / Hardware)
Recent Trends in Computer Design.
Computer Performance Measures.
Instruction Pipelining.
Dynamic Branch Prediction.
Instruction-Level Parallelism (ILP).
Loop-Level Parallelism (LLP) + Data Parallelism.
Dynamic Pipeline Scheduling.
Multiple Instruction Issue (CPI < 1):
Superscalar vs. VLIW
Dynamic Hardware-Based Speculation
Cache Design & Performance.
Basic Virtual memory Issues
Trends in Computer Design
Integrated circuit technology: decreasing feature size, λ
Clock rate improves roughly in proportion to the improvement in λ.
Number of transistors improves in proportion to λ² (or faster).
The rate of clock speed improvement has decreased in recent years.
Architectural improvements in CPU design.
Chip-level Thread-Level Parallelism (TLP): Simultaneous Multithreading (SMT)/Chip Multiprocessors (CMPs)
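The two scaling rules on this slide can be turned into a rough estimate. The Python sketch below is only an illustration under the slide's stated proportionalities. The function name `scale_with_feature_size` and the `shrink` parameter (old feature size divided by new feature size) are assumptions introduced here, not part of the course material.

```python
def scale_with_feature_size(shrink: float) -> dict:
    """Estimate first-order gains from a feature-size improvement.

    shrink: old feature size / new feature size (2.0 means lambda halved).
    Hypothetical helper; it encodes only the slide's rules of thumb.
    """
    if shrink <= 0:
        raise ValueError("shrink must be positive")
    return {
        # Clock rate improves roughly in proportion to the improvement in lambda.
        "clock_rate_gain": shrink,
        # Transistor count improves in proportion to lambda squared (or faster),
        # so this value is best read as a lower bound.
        "transistor_count_gain": shrink ** 2,
    }


if __name__ == "__main__":
    print(scale_with_feature_size(2.0))
```

Under these assumptions, halving the feature size suggests about a 2x clock rate gain and at least a 4x transistor count gain. Since the slide notes that clock speed improvement has slowed, the clock estimate is likely optimistic for recent generations.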
Microprocessor Performance 1987-97
[Figure: Integer SPEC92 performance versus year, 1987 to 1997; labeled data points include Sun and DEC Alpha processors (DEC Alpha 4/, DEC Alpha 5/, DEC Alpha 21264/).]
> 100x performance increase in the last decade
Microprocessor Transistor Count Growth Rate
Intel 4004 (4-bit, 2300 transistors)
Still holds today
~4,000,000x transistor density increase in the last 45 years
How to best exploit increased transistor count?
Keep increasing cache capacity/levels?
Multiple GPP cores?
Integrate other types of computing elements?
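The "~4,000,000x in the last 45 years" figure implies an average doubling period, which a short calculation makes concrete. The Python sketch below assumes steady exponential growth over the whole interval. The function name `doubling_period_years` is a hypothetical helper introduced for this example.

```python
import math


def doubling_period_years(growth_factor: float, years: float) -> float:
    """Average time per doubling, assuming steady exponential growth."""
    if growth_factor <= 1 or years <= 0:
        raise ValueError("growth_factor must exceed 1 and years must be positive")
    doublings = math.log2(growth_factor)  # number of doublings in the interval
    return years / doublings


if __name__ == "__main__":
    period = doubling_period_years(4_000_000, 45)
    print(f"Implied doubling period: {period:.2f} years")
```

With the slide's numbers this works out to roughly two years per doubling (about 22 doublings in 45 years).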
Parallelism in Microprocessor VLSI Generations
[Figure: transistors per chip (log scale) versus year, 1970 to 2005, with processors such as the i80286 and Pentium plotted across three eras of parallelism.]
Bit-level parallelism
Multiple micro-operations per cycle (multi-cycle non-pipelined)
AKA Operation-Level Parallelism
Instruction-level parallelism (ILP)
Not pipelined: CPI >> 1
Single-issue pipelined: CPI = 1
Superscalar/VLIW: CPI < 1
Thread-level parallelism (TLP) (?)
Simultaneous Multithreading (SMT): e.g. Intel's Hyper-threading
Chip-Multiprocessors (CMPs): e.g. IBM Power 4, 5, Intel Pentium D, Core Duo, AMD Athlon 64 X, Dual Core Opteron, Sun UltraSparc T1 (Niagara)
Even more important due to slowing clock rate increase
Improving microprocessor generation performance by exploiting more levels of parallelism
Microprocessor Architecture Trends
CISC Machines
instructions take variable times to complete
RISC Machines (microcode)
simple instructions, optimized for speed
RISC Machines (pipelined)
same individual instruction latency
greater throughput through instruction "overlap"
Superscalar Processors
multiple instructions executing simultaneously
Multithreaded Processors
additional HW resources (regs, PC, SP)
each context gets processor for x cycles
VLIW
"Superinstructions" grouped together
decreased HW control complexity
Single Chip Multiprocessors (CMPs)
duplicate entire processors
(tech soon due to Moore's Law)
e.g. IBM Power 4/5, AMD X2, X3, X4, Intel Core 2
Simultaneous Multithreading (SMT)
multiple HW contexts (regs, PC, SP)
each cycle, any context may execute
e.g. Intel's HyperThreading (P4)
SMT/CMPs
e.g. IBM Power5,6,7, Intel Pentium D, Sun Niagara (UltraSparc T1), Intel Nehalem (Core i7)
[Figure labels: General Purpose Processor (GPP); Single Threaded; (Single or Multi-Threaded); ILP; Chip-Level TLP]
Architectural Improvements
Including Simultaneous Multithreading (SMT)
Metrics of Computer Performance
[Figure: levels of a computer system, from Application, Programming Language, and Compiler through the ISA down to Datapath, Control, Function Units, Transistors, Wires, and Pins, with the performance metrics used at each level.]
Execution time: target workload, SPEC95, SPEC2000, etc.
(Millions of) instructions per second: MIPS
(Millions of) floating-point (F.P.) operations per second: MFLOP/s
Megabytes per second.
Cycles per second (clock rate).
Each metric has a purpose, and each can be misused.
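The rate metrics listed here follow directly from counts and execution time. The Python sketch below uses the standard definitions (count divided by execution time, scaled by 10^6). The function names `mips` and `mflops` and the sample workload numbers are illustrative assumptions, not measurements from the slides.

```python
def mips(instruction_count: float, exec_time_s: float) -> float:
    """Millions of instructions executed per second."""
    if exec_time_s <= 0:
        raise ValueError("execution time must be positive")
    return instruction_count / (exec_time_s * 1e6)


def mflops(fp_operation_count: float, exec_time_s: float) -> float:
    """Millions of floating-point operations executed per second."""
    if exec_time_s <= 0:
        raise ValueError("execution time must be positive")
    return fp_operation_count / (exec_time_s * 1e6)


if __name__ == "__main__":
    # Hypothetical program: 2 billion instructions, 400 million of them
    # floating-point operations, finishing in 4 seconds.
    print(mips(2e9, 4.0))
    print(mflops(4e8, 4.0))
```

A higher MIPS rating does not by itself mean a shorter execution time, because instruction counts differ across ISAs and compilers. That is one way such a metric can be misused.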
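Factors Affecting CPU Performance
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
[Table: a matrix of X marks showing which factors, including VLSI technology, affect each of the three terms (instruction count, cycles per instruction, clock cycle time).]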
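The CPU time equation can be evaluated directly once the three terms are known. The Python sketch below is a minimal illustration. The function name `cpu_time` and all workload numbers are hypothetical, and the clock is given as a rate in Hz, so seconds per cycle is its reciprocal.

```python
def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time in seconds = instructions x cycles/instruction x seconds/cycle."""
    if clock_rate_hz <= 0:
        raise ValueError("clock rate must be positive")
    seconds_per_cycle = 1.0 / clock_rate_hz
    return instruction_count * cpi * seconds_per_cycle


if __name__ == "__main__":
    # Hypothetical baseline: 1e9 instructions, CPI of 2.0, 2 GHz clock.
    baseline = cpu_time(1e9, 2.0, 2e9)
    # Hypothetical compiler change: 10% fewer instructions, but CPI rises to 2.3.
    modified = cpu_time(0.9e9, 2.3, 2e9)
    print(f"baseline: {baseline:.3f} s, modified: {modified:.3f} s")
```

In this made-up case the instruction count drops, yet CPU time rises from 1.000 s to 1.035 s. The three terms have to be considered together, not one at a time.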
Performance Enhancement Calculations: Amdahl's Law
Performance improvement or speedup due to enhancement E:
$$\text{Speedup}(E) = \frac{\text{Execution Time without } E}{\text{Execution Time with } E} = \frac{\text{Performance with } E}{\text{Performance without } E}$$
Suppose that enhancement E accelerates a fraction F of the execution time by a factor S and the remainder of the time is unaffected. Then:
$$\text{Execution Time with } E = \left((1 - F) + \frac{F}{S}\right) \times \text{Execution Time without } E$$
Hence speedup is given by:
$$\text{Speedup}(E) = \frac{\text{Execution Time without } E}{\left((1 - F) + \frac{F}{S}\right) \times \text{Execution Time without } E} = \frac{1}{(1 - F) + \frac{F}{S}}$$
F (fraction of execution time enhanced) refers to the original execution time before the enhancement is applied.
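Amdahl's Law as stated here reduces to Speedup(E) = 1 / ((1 - F) + F/S), which is simple to evaluate. The Python sketch below is an illustration only. The function name `amdahl_speedup` and the sample values of F and S are assumptions chosen for the example.

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Overall speedup when a fraction f of the original execution time
    is accelerated by a factor s and the rest is unaffected."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a fraction between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


if __name__ == "__main__":
    # Hypothetical enhancement: 40% of the original time sped up 10x.
    print(amdahl_speedup(0.4, 10.0))
    # Even a very large S cannot beat 1 / (1 - F) for the same F.
    print(amdahl_speedup(0.4, 1e12))
```

With F = 0.4 and S = 10 the overall speedup is 1 / 0.64 = 1.5625. As S grows without bound, the speedup approaches 1 / (1 - F), about 1.67 here, so the unenhanced fraction sets the ceiling.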