Caches & Memory-Level Parallelism: Fall 2008 ECE 587/687 PSU Course - Prof. Alaa R. Alamel, Study notes of Computer Architecture and Organization

Portland State University (PSU)Computer Architecture and Organization

Prof. Alaa R. Alameldeen

An overview of caches and memory-level parallelism as taught in the Fall 2008 ECE 587/687 course at Portland State University. Topics include CPU execution time, cache performance, cache performance metrics, memory hierarchy, cache structure, associativity, cache misses, and cache organization. The document also covers miss status holding (handling) registers (MSHRs) and provides examples and references.

Typology: Study notes

Pre 2010

Uploaded on 08/19/2009


Caches and Memory-Level Parallelism

Alaa R. Alameldeen

Portland State University – ECE 587/687 – Fall 2008

Revisiting Processor Performance

CPU Execution Time = (CPU clock cycles + Memory stall cycles) x clock cycle time

For each instruction:

CPI = CPI(Perfect Memory) + Memory stall cycles per instruction

With no caches, all memory requests require main memory access

Very long latency (discuss)

Caches filter out many memory accesses to improve execution time


Cache Performance

Memory stall cycles per instruction = Cache misses per instruction x miss penalty

Processor Performance:

CPI = CPI(Perfect Memory) + miss rate x miss penalty

Average memory access time = Hit ratio x Hit latency + Miss ratio x Miss penalty

Cache hierarchies attempt to reduce average memory access time
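As a rough illustration of the two formulas above, the following Python sketch evaluates CPI and average memory access time. The function names and every numeric input are hypothetical values chosen for the example, not figures from the course.

```python
def cpi_with_cache(cpi_perfect, misses_per_inst, miss_penalty):
    """CPI = CPI(perfect memory) + miss rate x miss penalty."""
    return cpi_perfect + misses_per_inst * miss_penalty


def average_memory_access_time(hit_ratio, hit_latency, miss_penalty):
    """AMAT = hit ratio x hit latency + miss ratio x miss penalty."""
    miss_ratio = 1.0 - hit_ratio
    return hit_ratio * hit_latency + miss_ratio * miss_penalty


# Hypothetical inputs: latencies and penalties are in clock cycles.
cpi = cpi_with_cache(cpi_perfect=1.0, misses_per_inst=0.02, miss_penalty=200)
amat = average_memory_access_time(hit_ratio=0.95, hit_latency=2, miss_penalty=200)
print(f"CPI  = {cpi:.2f}")
print(f"AMAT = {amat:.2f} cycles")
```

With these made-up inputs, the miss term (0.02 x 200 cycles) dominates the perfect-memory CPI, which is why reducing either the miss rate or the miss penalty matters so much.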


Cache Performance Metrics

Hit ratio: #hits / #accesses

Miss ratio: #misses / #accesses

Miss rate: Misses per instruction (or per 1000 instructions)

Miss rate = miss ratio x memory accesses per instruction

Hit time: time from when a request is issued to the cache until data is returned to the processor

Depends on cache design parameters

Bigger caches, higher associativity, and more ports increase hit time

Miss penalty: depends on memory hierarchy parameters
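The relationship between miss ratio (per access) and miss rate (per instruction) can be sketched in Python as follows. The helper names and the event counts are hypothetical, chosen only to make the arithmetic concrete.

```python
def miss_ratio(misses, accesses):
    """Fraction of cache accesses that miss."""
    return misses / accesses


def miss_rate_per_instruction(ratio, accesses_per_inst):
    """Miss rate = miss ratio x memory accesses per instruction."""
    return ratio * accesses_per_inst


def misses_per_1000_instructions(misses, instructions):
    """The same miss rate, scaled to 1000 instructions."""
    return 1000.0 * misses / instructions


# Hypothetical counts from an imaginary run.
instructions = 1_000_000
accesses = 300_000
misses = 15_000

ratio = miss_ratio(misses, accesses)
rate = miss_rate_per_instruction(ratio, accesses / instructions)
print(f"miss ratio = {ratio:.3f}")
print(f"miss rate  = {rate:.4f} misses per instruction")
print(f"           = {misses_per_1000_instructions(misses, instructions):.1f} per 1000 instructions")
```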


Memory Hierarchy Example

Levels in memory hierarchy:

First-level caches

Usually Split I & D caches

Small and fast

Second-level caches

Usually on die, SRAM cells

Main memory

DRAM cells, focus on density

Disk

Usually magnetic device

Non-volatile, slow access

[Figure: memory hierarchy diagram: Processor, L1I$ / L1D$, L2 Cache, Main Memory, Disk]


Basic Cache Structure

Array of blocks (lines)

Each block is usually 32-128 bytes

Finding a block in cache:

Offset: byte offset within the block

Index: the set in the cache where the block is located

Tag: must match the address tag stored in the cache

[Figure: address divided into Tag, Index, and Offset fields, with the cache Data array]


Associativity

Set associativity

Set: Group of blocks corresponding to the same index

Each block in the set is called a way

2-way set-associative cache: each set contains two blocks

Direct-mapped cache: each set contains one block

Fully-associative cache: the whole cache is one set

Need to check all tags in a set to determine hit/miss status
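A minimal Python sketch of a set-associative lookup is shown below. The class name, its parameters, and the first-in-first-out eviction are assumptions made for illustration; the slides do not specify a replacement policy. Direct-mapped and fully-associative caches fall out as the special cases `ways=1` and `num_sets=1`.

```python
class SetAssociativeCache:
    """Toy tag store: tracks which blocks are present, not their data."""

    def __init__(self, num_sets, ways, block_size):
        self.num_sets = num_sets
        self.ways = ways
        self.block_size = block_size
        # One list of tags per set; each list holds at most `ways` tags.
        self.sets = [[] for _ in range(num_sets)]

    def split(self, address):
        """Return (tag, index) for a byte address."""
        block_number = address // self.block_size
        return block_number // self.num_sets, block_number % self.num_sets

    def access(self, address):
        """Return True on a hit, False on a miss (the block is then filled)."""
        tag, index = self.split(address)
        tags = self.sets[index]
        if tag in tags:          # check every tag in the selected set
            return True
        if len(tags) == self.ways:
            tags.pop(0)          # assumed FIFO eviction, not from the slides
        tags.append(tag)
        return False


# Three hypothetical 4-block caches with 64-byte blocks.
direct_mapped = SetAssociativeCache(num_sets=4, ways=1, block_size=64)
two_way = SetAssociativeCache(num_sets=2, ways=2, block_size=64)
fully_assoc = SetAssociativeCache(num_sets=1, ways=4, block_size=64)

for name, cache in [("direct-mapped", direct_mapped),
                    ("2-way", two_way),
                    ("fully-associative", fully_assoc)]:
    results = [cache.access(a) for a in (0, 256, 0, 256)]
    print(name, results)
```

Addresses 0 and 256 map to the same set in the direct-mapped configuration, so they keep evicting each other there, while the other two configurations can hold both blocks at once.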


Example: Cache Block Placement

Consider a 4-way, 32KB cache with 64-byte lines

Where is 48-bit address 0x0000FFFFAB64?

Number of lines = cache size / line size = 32K / 64 = 512

Each set contains 4 lines ⇒ Number of sets = 512/4 = 128 sets

Offset bits = log2(64) = 6: 0x24

Index bits = log2(128) = 7: 0x2D

Tag bits = 48 - (6 + 7) = 35: 0x00007FFFD
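The placement arithmetic in this example can be checked with a short Python sketch. The function name `decompose` and its parameter names are hypothetical, and the code assumes that the line size and the number of sets are powers of two.

```python
def decompose(address, cache_bytes, ways, line_bytes, address_bits=48):
    """Split an address into tag, index, and offset fields."""
    num_lines = cache_bytes // line_bytes
    num_sets = num_lines // ways
    # For a power of two n, n.bit_length() - 1 equals log2(n).
    offset_bits = line_bytes.bit_length() - 1
    index_bits = num_sets.bit_length() - 1
    tag_bits = address_bits - offset_bits - index_bits
    return {
        "num_lines": num_lines,
        "num_sets": num_sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": tag_bits,
        "offset": address & (line_bytes - 1),
        "index": (address >> offset_bits) & (num_sets - 1),
        "tag": address >> (offset_bits + index_bits),
    }


# The cache and address from the example: 4-way, 32KB, 64-byte lines.
fields = decompose(0x0000FFFFAB64, cache_bytes=32 * 1024, ways=4, line_bytes=64)
print(fields["num_lines"], "lines,", fields["num_sets"], "sets")
for name in ("offset", "index", "tag"):
    print(f"{name}: {fields[name + '_bits']} bits, value {hex(fields[name])}")
```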


Types of Cache Misses

Compulsory (cold) misses: First access to a block

Prefetching can reduce these misses

Capacity misses: A cache cannot contain all blocks needed in a program, so some blocks are discarded and later accessed again

Replacement policies should target blocks that won't be used later

Conflict misses: Blocks mapping to the same set may be discarded (in direct-mapped and set-associative caches)

Increasing associativity can reduce these misses

For multiprocessors, coherency misses can also happen
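One common way to tell the three miss types apart in simulation is to compare the cache against a fully-associative cache of the same capacity: a repeat miss that the fully-associative cache would also suffer is counted as a capacity miss, otherwise as a conflict miss. The Python sketch below follows that convention. The function name, the block-number trace format, and the LRU replacement policy are all assumptions made for this illustration, and coherency misses are not modeled.

```python
from collections import OrderedDict


def classify_misses(block_numbers, num_sets, ways):
    """Count hits and compulsory/capacity/conflict misses for a block trace."""
    capacity = num_sets * ways
    seen = set()                      # blocks referenced at least once
    fully_assoc = OrderedDict()       # reference LRU cache, same capacity
    sets = [OrderedDict() for _ in range(num_sets)]
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    for block in block_numbers:
        # Update the fully-associative reference cache.
        fa_hit = block in fully_assoc
        if fa_hit:
            fully_assoc.move_to_end(block)
        else:
            if len(fully_assoc) == capacity:
                fully_assoc.popitem(last=False)   # evict least recently used
            fully_assoc[block] = True

        # Update the cache being studied.
        target = sets[block % num_sets]
        if block in target:
            target.move_to_end(block)
            counts["hit"] += 1
        else:
            if len(target) == ways:
                target.popitem(last=False)
            target[block] = True
            if block not in seen:
                counts["compulsory"] += 1
            elif fa_hit:
                counts["conflict"] += 1
            else:
                counts["capacity"] += 1
        seen.add(block)
    return counts


# Blocks 0 and 4 share a set in a 4-set direct-mapped cache.
trace = [0, 4, 0, 4]
print("direct-mapped:", classify_misses(trace, num_sets=4, ways=1))
print("2-way:        ", classify_misses(trace, num_sets=2, ways=2))
```

In this hypothetical trace, the direct-mapped configuration turns every repeat access into a conflict miss, while the 2-way configuration of the same size sees only the two compulsory misses.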


Non-Blocking Cache Hierarchy

Superscalar processors require parallel execution units

Multiple pipelined functional units

Cache hierarchies capable of simultaneously servicing multiple memory requests

Do not block cache references that do not need the miss data

Service multiple miss requests to memory concurrently

Revisit miss penalty with memory-level parallelism


Miss Status Holding (Handling) Registers

MSHRs facilitate non-blocking memory-level parallelism

Used to track address, data, and status for multiple outstanding cache misses

Need to provide correct memory ordering, respond to CPU requests, and maintain cache coherence

Design details vary widely between different processors

But basic functions are similar


Cache & MSHR Organization

Paper Fig 1: block diagram of cache organization

Main Components:

MSHR: One register for each miss to be handled concurrently

N-way comparator: Compares an address to all block addresses in MSHRs (N = #MSHRs)

Input Stack: Buffer space for all misses corresponding to MSHR entries

Size = #MSHRs x block size

Status update and collecting networks

Current implementations combine the MSHR and input stack
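The bookkeeping role of the MSHRs can be sketched in Python as below. This is a simplified software model under stated assumptions, not the organization from the paper's figure: the class and method names are hypothetical, a dictionary lookup stands in for the N-way comparator, and merging later misses to an already-outstanding block and stalling when no register is free are assumed behaviors. Memory ordering and coherence are not modeled.

```python
class MSHRFile:
    """Toy model of a set of miss status holding registers."""

    def __init__(self, num_entries, block_size=64):
        self.num_entries = num_entries
        self.block_size = block_size
        # Outstanding block number -> list of waiting request ids.
        self.entries = {}

    def handle_miss(self, address, request_id):
        block = address // self.block_size
        if block in self.entries:
            # A miss to this block is already outstanding: record the new
            # request against the same entry (assumed merging behavior).
            self.entries[block].append(request_id)
            return "merged"
        if len(self.entries) == self.num_entries:
            # No free register: the request must wait (assumed behavior).
            return "stall"
        self.entries[block] = [request_id]
        return "allocated"

    def fill(self, block):
        """Data returned from memory: free the entry, return its waiters."""
        return self.entries.pop(block, [])


# Hypothetical sequence with two MSHRs and 64-byte blocks.
mshrs = MSHRFile(num_entries=2)
print(mshrs.handle_miss(0x100, "load A"))
print(mshrs.handle_miss(0x104, "load B"))   # same block as load A
print(mshrs.handle_miss(0x200, "load C"))
print(mshrs.handle_miss(0x300, "load D"))   # both registers are busy
print(mshrs.fill(0x100 // 64))
```

The number of entries bounds how many misses can be in flight at once, which is the limit on memory-level parallelism in this simplified model.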
