Download Computer Architecture Lecture Notes: OoO Wrap-Up and Advanced Caching and more Lecture notes Computer Architecture and Organization in PDF only on Docsity!
Computer Architecture
Lecture 11: OoO Wrap-Up and Advanced Caching
Prof. Onur Mutlu
Carnegie Mellon University
Announcements
Chuck Thacker (Microsoft Research) Seminar
RARE: Rethinking Architectural Research and Education
October 7, 4:30-5:30pm, GHC Rashid Auditorium
Ben Zorn (Microsoft Research) Seminar
Performance is Dead, Long Live Performance!
October 8, 11am-noon, GHC 6115
Guest lecture Friday
Dr. Ben Zorn, Microsoft Research
Fault Tolerant, Efficient, and Secure Runtimes
Last Time …
Full Window Stalls
Runahead Execution
Memory Level Parallelism
Memory Latency Tolerance Techniques
Caching
Prefetching
Multithreading
Out-of-order execution
Improving Runahead Execution
Efficiency
Dependent Cache Misses: Address-Value Delta Prediction
OoO/Runahead Readings
Mutlu et al., “Runahead Execution: An Alternative to Very Large
Instruction Windows for Out-of-order Processors,” HPCA 2003.
Mutlu et al., “Efficient Runahead Execution: Power-Efficient
Memory Latency Tolerance,” IEEE Micro Top Picks 2006.
Zhou, Dual-Core Execution: “Building a Highly Scalable Single-
Thread Instruction Window,” PACT 2005.
Chrysos and Emer, “Memory Dependence Prediction Using Store
Sets,” ISCA 1998.
Runahead Execution (III)
Advantages:
+ Very accurate prefetches for data/instructions (all cache levels)
+ Follows the program path
+ Uses the same thread context as main thread, no waste of context
+ Simple to implement, most of the hardware is already built in
Disadvantages/Limitations:
-- Extra executed instructions
-- Limited by branch prediction accuracy
-- Cannot prefetch dependent cache misses. Solution?
-- Effectiveness limited by available “memory-level parallelism” (MLP)
-- Prefetch distance limited by memory latency
Implemented in IBM POWER6, Sun “Rock”
Memory Latency Tolerance Techniques
Caching [initially by Wilkes, 1965]
Widely used, simple, effective, but inefficient, passive
Not all applications/phases exhibit temporal or spatial locality
Prefetching [initially in IBM 360/91, 1967]
Works well for regular memory access patterns
Prefetching irregular access patterns is difficult, inaccurate, and hardware-
intensive
Multithreading [initially in CDC 6600, 1964]
Works well if there are multiple threads
Improving single thread performance using multithreading hardware is an
ongoing research effort
Out-of-order execution [initially by Tomasulo, 1967]
Tolerates cache misses that cannot be prefetched
Requires extensive hardware resources for tolerating long latencies
Runahead and Dual Core Execution
Load 3 Hit Load 2 Miss Stall Miss 1 Load 1 Miss Saved Cycles
DCE: front processor
Compute Load 1 Miss Runahead Load 2 Miss Load 2 Hit Miss 1 Miss 2 Compute Load 1 Hit
Runahead:
Load 3 Miss Runahead Saved Cycles Load 1 Miss
DCE: back processor
Compute Compute (^) Compute Load 2 Hit Miss 3 Miss 2 Load 3 Miss
Handling of Store-Load Dependencies
A load’s dependence status is not known until all previous store
addresses are available.
How does the OOO engine detect dependence of a load instruction on a
previous store?
Option 1: Wait until all previous stores committed (no need to
check)
Option 2: Keep a list of pending stores in a store buffer and check
whether load address matches a previous store address
How does the OOO engine treat the scheduling of a load instruction wrt
previous stores?
Option 1: Assume load independent of all previous stores
Option 2: Assume load dependent on all previous stores
Option 3: Predict the dependence of a load on an outstanding store
Store Buffer Design (II)
Why is it complex to design a store buffer?
Content associative, age-ordered, range search on an
address range
Check for overlap of [load EA, load EA + load size] and [store
EA, store EA + store size]
EA: effective address
A key limiter of instruction window scalability
Simplifying store buffer design or alternative designs an
important topic of research
Memory Disambiguation (I)
Option 1: Assume load independent of all previous stores
+ Simple and can be common case: no delay for independent loads
-- Requires recovery and re-execution of load and dependents on misprediction
Option 2: Assume load dependent on all previous stores
+ No need for recovery
-- Too conservative: delays independent loads unnecessarily
Option 3: Predict the dependence of a load on an
outstanding store
+ More accurate. Load store dependencies persist over time
-- Still requires recovery/re-execution on misprediction
Alpha 21264 : Initially assume load independent, delay loads found to be dependent Moshovos et al., “Dynamic speculation and synchronization of data dependences,” ISCA 1997. Chrysos and Emer, “Memory Dependence Prediction Using Store Sets,” ISCA 1998.
Speculative Execution and Data Coherence
Speculatively executed loads can load a stale value in a
multiprocessor system
The same address can be written by another processor before
the load is committed load and its dependents can use the
wrong value
Solutions:
1. A store from another processor invalidates a load that loaded
the same address
-- Stores of another processor check the load buffer
-- How to handle dependent instructions? They are also
invalidated.
2. All loads re-executed at the time of retirement
Open Research Issues in OOO Execution (I)
Performance with simplicity and energy-efficiency
How to build scalable and energy-efficient instruction windows
To tolerate very long memory latencies and to expose more memory
level parallelism
Problems:
How to scale or avoid scaling register files, store buffers
How to supply useful instructions into a large window in the
presence of branches
How to approximate the benefits of a large window
MLP benefits vs. ILP benefits
Can the compiler pack more misses (MLP) into a smaller window?
How to approximate the benefits of OOO with in-order +
enhancements
Open Research Issues in OOO Execution (III)
Out-of-order execution in the presence of multi-core
Powerful execution engines are needed to execute
Single-threaded applications
Serial sections of multithreaded applications (remember Amdahl’s law)
Where single thread performance matters (e.g., transactions, game logic)
Accelerate multithreaded applications (e.g., critical sections)
Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Large core
ACMP Approach
Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core
“Niagara” Approach
Large core Large core Large core Large core
“Tile-Large” Approach
Asymmetric vs. Symmetric Cores
Advantages of Asymmetric
+ Can provide better performance when thread parallelism is
limited
+ Can be more energy efficient
+ Schedule computation to the core type that can best execute it
Disadvantages
- Need to design more than one type of core. Always?
- Scheduling becomes more complicated
- What computation should be scheduled on the large core?
- Who should decide? HW vs. SW?
- Managing locality and load balancing can become difficult if
threads move between cores (transparently to software)
- Cores have different demands from shared resources