Prepare for your exams
Get points
Guidelines and tips
Sell on Docsity
Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Find documents

Prepare for your exams with the study notes shared by other students like you on Docsity

Search for your university

Find the specific documents for your university's exams

Docsity AINEW

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

20 Points

For each uploaded document

Answer questions

5 Points

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

Computer Architecture Lecture Notes: OoO Wrap-Up and Advanced Caching, Lecture notes of Computer Architecture and Organization

Carnegie Mellon University (CMU)Computer Architecture and Organization

A set of lecture notes from a Computer Architecture course at Carnegie Mellon University. The notes cover topics such as Runahead Execution, Dual-Core Execution, Store-Load Dependencies, and Memory Disambiguation. readings and seminars from Microsoft Research and RARE. The notes are detailed and technical, making them useful for university students studying Computer Architecture or related fields.

Typology: Lecture notes

2021/2022

Uploaded on 05/11/2023

shekara_44het 🇺🇸

4.3

(7)

229 documents

1 / 27

This page cannot be seen from the preview

Don't miss anything!

15-740/18-740

Computer Architecture

Lecture 11: OoO Wrap-Up and Advanced Caching

Prof. Onur Mutlu

Carnegie Mellon University

Discover Lecture notes of Computer Architecture and Organization Carnegie Mellon University (CMU)

Partial preview of the text

Download Computer Architecture Lecture Notes: OoO Wrap-Up and Advanced Caching and more Lecture notes Computer Architecture and Organization in PDF only on Docsity!

Computer Architecture

Lecture 11: OoO Wrap-Up and Advanced Caching

Prof. Onur Mutlu

Carnegie Mellon University

Announcements

 Chuck Thacker (Microsoft Research) Seminar

 RARE: Rethinking Architectural Research and Education

 October 7, 4:30-5:30pm, GHC Rashid Auditorium

 Ben Zorn (Microsoft Research) Seminar

 Performance is Dead, Long Live Performance!

 October 8, 11am-noon, GHC 6115

 Guest lecture Friday

 Dr. Ben Zorn, Microsoft Research

 Fault Tolerant, Efficient, and Secure Runtimes

Last Time …

 Full Window Stalls

 Runahead Execution

 Memory Level Parallelism

 Memory Latency Tolerance Techniques

 Caching

 Prefetching

 Multithreading

 Out-of-order execution

 Improving Runahead Execution

 Efficiency

 Dependent Cache Misses: Address-Value Delta Prediction

OoO/Runahead Readings

 Mutlu et al., “Runahead Execution: An Alternative to Very Large

Instruction Windows for Out-of-order Processors,” HPCA 2003.

 Mutlu et al., “Efficient Runahead Execution: Power-Efficient

Memory Latency Tolerance,” IEEE Micro Top Picks 2006.

 Zhou, Dual-Core Execution: “Building a Highly Scalable Single-

Thread Instruction Window,” PACT 2005.

 Chrysos and Emer, “Memory Dependence Prediction Using Store

Sets,” ISCA 1998.

Runahead Execution (III)

 Advantages:

+ Very accurate prefetches for data/instructions (all cache levels)

+ Follows the program path

+ Uses the same thread context as main thread, no waste of context

+ Simple to implement, most of the hardware is already built in

 Disadvantages/Limitations:

-- Extra executed instructions

-- Limited by branch prediction accuracy

-- Cannot prefetch dependent cache misses. Solution?

-- Effectiveness limited by available “memory-level parallelism” (MLP)

-- Prefetch distance limited by memory latency

 Implemented in IBM POWER6, Sun “Rock”

Memory Latency Tolerance Techniques

 Caching [initially by Wilkes, 1965]

 Widely used, simple, effective, but inefficient, passive

 Not all applications/phases exhibit temporal or spatial locality

 Prefetching [initially in IBM 360/91, 1967]

 Works well for regular memory access patterns

 Prefetching irregular access patterns is difficult, inaccurate, and hardware-

intensive

 Multithreading [initially in CDC 6600, 1964]

 Works well if there are multiple threads

 Improving single thread performance using multithreading hardware is an

ongoing research effort

 Out-of-order execution [initially by Tomasulo, 1967]

 Tolerates cache misses that cannot be prefetched

 Requires extensive hardware resources for tolerating long latencies

Runahead and Dual Core Execution

Load 3 Hit Load 2 Miss Stall Miss 1 Load 1 Miss Saved Cycles

DCE: front processor

Compute Load 1 Miss Runahead Load 2 Miss Load 2 Hit Miss 1 Miss 2 Compute Load 1 Hit

Runahead:

Load 3 Miss Runahead Saved Cycles Load 1 Miss

DCE: back processor

Compute Compute (^) Compute Load 2 Hit Miss 3 Miss 2 Load 3 Miss

Handling of Store-Load Dependencies

 A load’s dependence status is not known until all previous store

addresses are available.

 How does the OOO engine detect dependence of a load instruction on a

previous store?

 Option 1: Wait until all previous stores committed (no need to

check)

 Option 2: Keep a list of pending stores in a store buffer and check

whether load address matches a previous store address

 How does the OOO engine treat the scheduling of a load instruction wrt

previous stores?

 Option 1: Assume load independent of all previous stores

 Option 2: Assume load dependent on all previous stores

 Option 3: Predict the dependence of a load on an outstanding store

Store Buffer Design (II)

 Why is it complex to design a store buffer?

 Content associative, age-ordered, range search on an

address range

 Check for overlap of [load EA, load EA + load size] and [store

EA, store EA + store size]

 EA: effective address

 A key limiter of instruction window scalability

 Simplifying store buffer design or alternative designs an

important topic of research

Memory Disambiguation (I)

 Option 1: Assume load independent of all previous stores

+ Simple and can be common case: no delay for independent loads

-- Requires recovery and re-execution of load and dependents on misprediction

 Option 2: Assume load dependent on all previous stores

+ No need for recovery

-- Too conservative: delays independent loads unnecessarily

 Option 3: Predict the dependence of a load on an

outstanding store

+ More accurate. Load store dependencies persist over time

-- Still requires recovery/re-execution on misprediction

 Alpha 21264 : Initially assume load independent, delay loads found to be dependent  Moshovos et al., “Dynamic speculation and synchronization of data dependences,” ISCA 1997.  Chrysos and Emer, “Memory Dependence Prediction Using Store Sets,” ISCA 1998.

Speculative Execution and Data Coherence

 Speculatively executed loads can load a stale value in a

multiprocessor system

 The same address can be written by another processor before

the load is committed  load and its dependents can use the

wrong value

 Solutions:

1. A store from another processor invalidates a load that loaded

the same address

-- Stores of another processor check the load buffer

-- How to handle dependent instructions? They are also

invalidated.

2. All loads re-executed at the time of retirement

Open Research Issues in OOO Execution (I)

 Performance with simplicity and energy-efficiency

 How to build scalable and energy-efficient instruction windows

 To tolerate very long memory latencies and to expose more memory

level parallelism

 Problems:

 How to scale or avoid scaling register files, store buffers

 How to supply useful instructions into a large window in the

presence of branches

 How to approximate the benefits of a large window

 MLP benefits vs. ILP benefits

 Can the compiler pack more misses (MLP) into a smaller window?

 How to approximate the benefits of OOO with in-order +

enhancements

Open Research Issues in OOO Execution (III)

 Out-of-order execution in the presence of multi-core

 Powerful execution engines are needed to execute

 Single-threaded applications

 Serial sections of multithreaded applications (remember Amdahl’s law)

 Where single thread performance matters (e.g., transactions, game logic)

 Accelerate multithreaded applications (e.g., critical sections)

Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Niagara -like core Large core

Computer Architecture Lecture Notes: OoO Wrap-Up and Advanced Caching, Lecture notes of Computer Architecture and Organization

Related documents

Partial preview of the text

Download Computer Architecture Lecture Notes: OoO Wrap-Up and Advanced Caching and more Lecture notes Computer Architecture and Organization in PDF only on Docsity!

Computer Architecture

Lecture 11: OoO Wrap-Up and Advanced Caching

Prof. Onur Mutlu

Carnegie Mellon University

Announcements

 Chuck Thacker (Microsoft Research) Seminar

 RARE: Rethinking Architectural Research and Education

 October 7, 4:30-5:30pm, GHC Rashid Auditorium

 Ben Zorn (Microsoft Research) Seminar

 Performance is Dead, Long Live Performance!

 October 8, 11am-noon, GHC 6115

 Guest lecture Friday

 Dr. Ben Zorn, Microsoft Research

 Fault Tolerant, Efficient, and Secure Runtimes

Last Time …

 Full Window Stalls

 Runahead Execution

 Memory Level Parallelism

 Memory Latency Tolerance Techniques

 Caching

 Prefetching

 Multithreading

 Out-of-order execution

 Improving Runahead Execution

 Efficiency

 Dependent Cache Misses: Address-Value Delta Prediction

OoO/Runahead Readings

 Mutlu et al., “Runahead Execution: An Alternative to Very Large

Instruction Windows for Out-of-order Processors,” HPCA 2003.

 Mutlu et al., “Efficient Runahead Execution: Power-Efficient

Memory Latency Tolerance,” IEEE Micro Top Picks 2006.

 Zhou, Dual-Core Execution: “Building a Highly Scalable Single-

Thread Instruction Window,” PACT 2005.

 Chrysos and Emer, “Memory Dependence Prediction Using Store

Sets,” ISCA 1998.

Runahead Execution (III)

 Advantages:

+ Very accurate prefetches for data/instructions (all cache levels)

+ Follows the program path

+ Uses the same thread context as main thread, no waste of context

+ Simple to implement, most of the hardware is already built in

 Disadvantages/Limitations:

-- Extra executed instructions

-- Limited by branch prediction accuracy

-- Cannot prefetch dependent cache misses. Solution?

-- Effectiveness limited by available “memory-level parallelism” (MLP)

-- Prefetch distance limited by memory latency

 Implemented in IBM POWER6, Sun “Rock”

Memory Latency Tolerance Techniques

 Caching [initially by Wilkes, 1965]

 Widely used, simple, effective, but inefficient, passive

 Not all applications/phases exhibit temporal or spatial locality

 Prefetching [initially in IBM 360/91, 1967]

 Works well for regular memory access patterns

 Prefetching irregular access patterns is difficult, inaccurate, and hardware-

intensive

 Multithreading [initially in CDC 6600, 1964]

 Works well if there are multiple threads

 Improving single thread performance using multithreading hardware is an

ongoing research effort

 Out-of-order execution [initially by Tomasulo, 1967]

 Tolerates cache misses that cannot be prefetched

 Requires extensive hardware resources for tolerating long latencies

Runahead and Dual Core Execution

DCE: front processor

Runahead:

DCE: back processor

Handling of Store-Load Dependencies

 A load’s dependence status is not known until all previous store

addresses are available.

 How does the OOO engine detect dependence of a load instruction on a

previous store?

 Option 1: Wait until all previous stores committed (no need to

check)

 Option 2: Keep a list of pending stores in a store buffer and check

whether load address matches a previous store address