Computer Architecture Lecture Notes: OoO Wrap-Up and Advanced Caching, Lecture notes of Computer Architecture and Organization

A set of lecture notes from a Computer Architecture course at Carnegie Mellon University. The notes cover topics such as Runahead Execution, Dual-Core Execution, Store-Load Dependencies, and Memory Disambiguation. readings and seminars from Microsoft Research and RARE. The notes are detailed and technical, making them useful for university students studying Computer Architecture or related fields.

Typology: Lecture notes

2021/2022

Uploaded on 05/11/2023

shekara_44het
shekara_44het 🇺🇸

4.3

(7)

229 documents

1 / 27

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
15-740/18-740
Computer Architecture
Lecture 11: OoO Wrap-Up and Advanced Caching
Prof. Onur Mutlu
Carnegie Mellon University
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b

Partial preview of the text

Download Computer Architecture Lecture Notes: OoO Wrap-Up and Advanced Caching and more Lecture notes Computer Architecture and Organization in PDF only on Docsity!

Computer Architecture

Lecture 11: OoO Wrap-Up and Advanced Caching

Prof. Onur Mutlu

Carnegie Mellon University

Announcements

 Chuck Thacker (Microsoft Research) Seminar

 RARE: Rethinking Architectural Research and Education

 October 7, 4:30-5:30pm, GHC Rashid Auditorium

 Ben Zorn (Microsoft Research) Seminar

 Performance is Dead, Long Live Performance!

 October 8, 11am-noon, GHC 6115

 Guest lecture Friday

 Dr. Ben Zorn, Microsoft Research

 Fault Tolerant, Efficient, and Secure Runtimes

Last Time …

 Full Window Stalls

 Runahead Execution

 Memory Level Parallelism

 Memory Latency Tolerance Techniques

 Caching

 Prefetching

 Multithreading

 Out-of-order execution

 Improving Runahead Execution

 Efficiency

 Dependent Cache Misses: Address-Value Delta Prediction

OoO/Runahead Readings

 Mutlu et al., “Runahead Execution: An Alternative to Very Large

Instruction Windows for Out-of-order Processors,” HPCA 2003.

 Mutlu et al., “Efficient Runahead Execution: Power-Efficient

Memory Latency Tolerance,” IEEE Micro Top Picks 2006.

 Zhou, Dual-Core Execution: “Building a Highly Scalable Single-

Thread Instruction Window,” PACT 2005.

 Chrysos and Emer, “Memory Dependence Prediction Using Store

Sets,” ISCA 1998.

Runahead Execution (III)

 Advantages:

+ Very accurate prefetches for data/instructions (all cache levels)

+ Follows the program path

+ Uses the same thread context as main thread, no waste of context

+ Simple to implement, most of the hardware is already built in

 Disadvantages/Limitations:

-- Extra executed instructions

-- Limited by branch prediction accuracy

-- Cannot prefetch dependent cache misses. Solution?

-- Effectiveness limited by available “memory-level parallelism” (MLP)

-- Prefetch distance limited by memory latency

 Implemented in IBM POWER6, Sun “Rock”

Memory Latency Tolerance Techniques

 Caching [initially by Wilkes, 1965]
 Widely used, simple, effective, but inefficient, passive
 Not all applications/phases exhibit temporal or spatial locality
 Prefetching [initially in IBM 360/91, 1967]
 Works well for regular memory access patterns
 Prefetching irregular access patterns is difficult, inaccurate, and hardware-
intensive
 Multithreading [initially in CDC 6600, 1964]
 Works well if there are multiple threads
 Improving single thread performance using multithreading hardware is an
ongoing research effort
 Out-of-order execution [initially by Tomasulo, 1967]
 Tolerates cache misses that cannot be prefetched
 Requires extensive hardware resources for tolerating long latencies

Runahead and Dual Core Execution

Load 3 Hit Load 2 Miss Stall Miss 1 Load 1 Miss Saved Cycles

DCE: front processor

Compute Load 1 Miss Runahead Load 2 Miss Load 2 Hit Miss 1 Miss 2 Compute Load 1 Hit

Runahead:

Load 3 Miss Runahead Saved Cycles Load 1 Miss

DCE: back processor

Compute Compute (^) Compute Load 2 Hit Miss 3 Miss 2 Load 3 Miss

Handling of Store-Load Dependencies

 A load’s dependence status is not known until all previous store

addresses are available.

 How does the OOO engine detect dependence of a load instruction on a

previous store?

 Option 1: Wait until all previous stores committed (no need to

check)

 Option 2: Keep a list of pending stores in a store buffer and check

whether load address matches a previous store address

 How does the OOO engine treat the scheduling of a load instruction wrt

previous stores?

 Option 1: Assume load independent of all previous stores

 Option 2: Assume load dependent on all previous stores

 Option 3: Predict the dependence of a load on an outstanding store

Store Buffer Design (II)

 Why is it complex to design a store buffer?

 Content associative, age-ordered, range search on an

address range

 Check for overlap of [load EA, load EA + load size] and [store

EA, store EA + store size]

 EA: effective address

 A key limiter of instruction window scalability

 Simplifying store buffer design or alternative designs an

important topic of research

Memory Disambiguation (I)

 Option 1: Assume load independent of all previous stores

+ Simple and can be common case: no delay for independent loads

-- Requires recovery and re-execution of load and dependents on misprediction

 Option 2: Assume load dependent on all previous stores

+ No need for recovery

-- Too conservative: delays independent loads unnecessarily

 Option 3: Predict the dependence of a load on an

outstanding store

+ More accurate. Load store dependencies persist over time

-- Still requires recovery/re-execution on misprediction

 Alpha 21264 : Initially assume load independent, delay loads found to be dependent  Moshovos et al., “Dynamic speculation and synchronization of data dependences,” ISCA 1997.  Chrysos and Emer, “Memory Dependence Prediction Using Store Sets,” ISCA 1998.

Speculative Execution and Data Coherence

 Speculatively executed loads can load a stale value in a

multiprocessor system

 The same address can be written by another processor before

the load is committed  load and its dependents can use the

wrong value

 Solutions:

1. A store from another processor invalidates a load that loaded

the same address

-- Stores of another processor check the load buffer

-- How to handle dependent instructions? They are also

invalidated.

2. All loads re-executed at the time of retirement

Open Research Issues in OOO Execution (I)

 Performance with simplicity and energy-efficiency

 How to build scalable and energy-efficient instruction windows

 To tolerate very long memory latencies and to expose more memory
level parallelism
 Problems:
 How to scale or avoid scaling register files, store buffers
 How to supply useful instructions into a large window in the
presence of branches

 How to approximate the benefits of a large window

 MLP benefits vs. ILP benefits
 Can the compiler pack more misses (MLP) into a smaller window?

 How to approximate the benefits of OOO with in-order +

enhancements

Open Research Issues in OOO Execution (III)

 Out-of-order execution in the presence of multi-core

 Powerful execution engines are needed to execute

 Single-threaded applications
 Serial sections of multithreaded applications (remember Amdahl’s law)
 Where single thread performance matters (e.g., transactions, game logic)
 Accelerate multithreaded applications (e.g., critical sections)

Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Large core

ACMP Approach

Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core

“Niagara” Approach

Large core Large core Large core Large core

“Tile-Large” Approach

Asymmetric vs. Symmetric Cores

 Advantages of Asymmetric

+ Can provide better performance when thread parallelism is

limited

+ Can be more energy efficient

+ Schedule computation to the core type that can best execute it

 Disadvantages

- Need to design more than one type of core. Always?

- Scheduling becomes more complicated

- What computation should be scheduled on the large core?

- Who should decide? HW vs. SW?

- Managing locality and load balancing can become difficult if

threads move between cores (transparently to software)

- Cores have different demands from shared resources