Caches & Memory-Level Parallelism: Fall 2008 ECE 587/687 PSU Course - Prof. Alaa R. Alamel, Study notes of Computer Architecture and Organization

An overview of caches and memory-level parallelism as taught in the Fall 2008 ECE 587/687 course at Portland State University. Topics include CPU execution time, cache performance, cache performance metrics, memory hierarchy, cache structure, associativity, cache misses, and cache organization. The document also covers miss status holding (handling) registers (MSHRs) and provides examples and references.

Typology: Study notes

Pre 2010

Uploaded on 08/19/2009

Caches and Memory-Level Parallelism
Portland State University
ECE 587/687 – Fall 2008
Alaa R. Alameldeen
© Copyright by Alaa Alameldeen and Haitham Akkary 2008

Revisiting Processor Performance
• CPU Execution Time =
  (CPU clock cycles + Memory stall cycles) x clock cycle time
• For each instruction:
  CPI = CPI(Perfect Memory) + Memory stall cycles per instruction
• With no caches, all memory requests require main memory access
  - Very long latency (discuss)
• Caches filter out many memory accesses to improve execution time
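
As a rough illustration of the execution-time equation above, the following sketch plugs in made-up numbers. The function name, the instruction count, the CPI values, and the clock period are all assumptions chosen for the example, not figures from the course.

```python
def cpu_execution_time(instructions, cpi_perfect, stall_cycles_per_inst, cycle_time_ns):
    """Execution time in ns = (CPU clock cycles + memory stall cycles) x cycle time."""
    cpu_cycles = instructions * cpi_perfect
    stall_cycles = instructions * stall_cycles_per_inst
    return (cpu_cycles + stall_cycles) * cycle_time_ns


# Hypothetical workload: 1,000,000 instructions, perfect-memory CPI of 1.0,
# 0.5 memory stall cycles per instruction, 0.5 ns cycle time (2 GHz clock).
with_stalls = cpu_execution_time(1_000_000, 1.0, 0.5, 0.5)
no_stalls = cpu_execution_time(1_000_000, 1.0, 0.0, 0.5)
print(with_stalls, no_stalls)
```

With these assumed inputs, memory stalls account for one third of the total time, which is the portion a cache can attack.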

Cache Performance
• Memory stall cycles per instruction =
  Cache misses per instruction x Miss penalty
• Processor performance:
  CPI = CPI(Perfect Memory) + Miss rate x Miss penalty
• Average memory access time =
  Hit ratio x Hit latency + Miss ratio x Miss penalty
• Cache hierarchies attempt to reduce average memory access time
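
The two formulas above can be tried out with a short sketch. The function names and all numeric inputs are hypothetical. One reading is assumed here: because the slide weights the miss penalty by the miss ratio and the hit latency by the hit ratio, the miss penalty is treated as the full latency seen by an access that misses.

```python
def cpi_with_misses(cpi_perfect, misses_per_inst, miss_penalty):
    """CPI = CPI(perfect memory) + miss rate (per instruction) x miss penalty."""
    return cpi_perfect + misses_per_inst * miss_penalty


def average_memory_access_time(hit_ratio, hit_latency, miss_penalty):
    """AMAT as written on the slide: hit ratio x hit latency + miss ratio x miss penalty."""
    miss_ratio = 1.0 - hit_ratio
    return hit_ratio * hit_latency + miss_ratio * miss_penalty


# Hypothetical machine: perfect-memory CPI of 1.0, 0.02 misses per instruction,
# 200-cycle miss penalty.
cpi = cpi_with_misses(1.0, 0.02, 200)

# Hypothetical cache: 75% hit ratio, 2-cycle hits, 100-cycle misses.
amat = average_memory_access_time(0.75, 2, 100)
print(cpi, amat)
```

Under these assumptions even a small miss rate dominates the CPI, which is why the slide turns to cache hierarchies next.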

Cache Performance Metrics
• Hit ratio: #hits / #accesses
• Miss ratio: #misses / #accesses
• Miss rate: misses per instruction (or per 1000 instructions)
  - Miss rate = miss ratio x memory accesses per instruction
• Hit time: time from request issued to cache until data is returned to the processor
  - Depends on cache design parameters
  - Bigger caches, larger associativity, more ports increase hit time
• Miss penalty: depends on memory hierarchy parameters
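
To keep the ratio and rate definitions apart, here is a minimal sketch that derives each metric from raw event counts. The counter values and the function name are invented for illustration.

```python
def cache_metrics(hits, misses, instructions):
    """Derive the metrics defined above from raw (hypothetical) event counts."""
    accesses = hits + misses
    hit_ratio = hits / accesses
    miss_ratio = misses / accesses
    accesses_per_inst = accesses / instructions
    # Miss rate = miss ratio x memory accesses per instruction
    miss_rate = miss_ratio * accesses_per_inst
    return {
        "hit_ratio": hit_ratio,
        "miss_ratio": miss_ratio,
        "miss_rate_per_inst": miss_rate,
        "misses_per_1000_inst": 1000 * miss_rate,
    }


# Assumed counts: 380,000 hits and 20,000 misses over 1,000,000 instructions.
metrics = cache_metrics(380_000, 20_000, 1_000_000)
for name, value in metrics.items():
    print(name, value)
```

Note that the miss ratio (per access) and the miss rate (per instruction) differ by the factor "memory accesses per instruction", so the two numbers are not interchangeable.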

Memory Hierarchy Example
• Levels in memory hierarchy:
  - First-level caches
    - Usually split I & D caches
    - Small and fast
  - Second-level caches
    - Usually on die, SRAM cells
  - Main memory
    - DRAM cells, focus on density
  - Disk
    - Usually magnetic device
    - Non-volatile, slow access
[Figure: Processor, L1I$ and L1D$, L2 Cache, Main Memory, Disk, from top to bottom]

Basic Cache Structure
• Array of blocks (lines)
  - Each block is usually 32-128 bytes
• Finding a block in cache:
  - Offset: byte offset within the block
  - Index: which set in the cache the block is located in
  - Tag: must match the address tag stored in the cache
[Figure: Address fields (Tag | Index | Offset) and Data]

Associativity
• Set associativity
  - Set: group of blocks corresponding to the same index
  - Each block in the set is called a way
  - 2-way set-associative cache: each set contains two blocks
  - Direct-mapped cache: each set contains one block
  - Fully-associative cache: the whole cache is one set
• Need to check all tags in a set to determine hit/miss status
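
The definitions above can be made concrete with a small sketch. `number_of_sets` and `CacheSet` are hypothetical names, and the FIFO eviction in `fill` is chosen only to keep the example short; the slides do not specify a replacement policy here. Hardware compares all tags of a set in parallel, while this software model simply loops over them.

```python
from dataclasses import dataclass, field


def number_of_sets(total_blocks, ways):
    """Sets = blocks / ways: ways=1 is direct-mapped, ways=total_blocks is fully associative."""
    if total_blocks % ways != 0:
        raise ValueError("ways must divide the number of blocks")
    return total_blocks // ways


@dataclass
class CacheSet:
    ways: int
    tags: list = field(default_factory=list)  # tags of the blocks resident in this set

    def lookup(self, tag):
        # Hit/miss status requires checking the request tag against every way.
        return any(stored == tag for stored in self.tags)

    def fill(self, tag):
        if len(self.tags) == self.ways:
            self.tags.pop(0)  # FIFO eviction (an assumption made for brevity)
        self.tags.append(tag)


print(number_of_sets(512, 1), number_of_sets(512, 2), number_of_sets(512, 512))

two_way = CacheSet(ways=2)
two_way.fill(0xA)
two_way.fill(0xB)
print(two_way.lookup(0xA), two_way.lookup(0xC))
```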

Example: Cache Block Placement
• Consider a 4-way, 32KB cache with 64-byte lines
• Where is 48-bit address 0x0000FFFFAB64?
  - Number of lines = cache size / line size = 32K / 64 = 512
  - Each set contains 4 lines ⇒ Number of sets = 512 / 4 = 128 sets
  - Offset bits = log2(64) = 6: 0x24
  - Index bits = log2(128) = 7: 0x2D
  - Tag bits = 48 - (6 + 7) = 35: 0x00007FFFD
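
The placement arithmetic in this example can be checked with a few shifts and masks. The sketch below assumes power-of-two cache and line sizes; `split_address` is a hypothetical helper name.

```python
def split_address(address, cache_bytes, line_bytes, ways):
    """Split an address into (tag, index, offset) for a set-associative cache.

    Assumes cache_bytes, line_bytes, and the resulting set count are powers of two.
    """
    num_lines = cache_bytes // line_bytes
    num_sets = num_lines // ways
    offset_bits = line_bytes.bit_length() - 1  # log2 of a power of two
    index_bits = num_sets.bit_length() - 1
    offset = address & (line_bytes - 1)
    index = (address >> offset_bits) & (num_sets - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset


# The 4-way, 32KB cache with 64-byte lines and the 48-bit address from the example.
tag, index, offset = split_address(0x0000FFFFAB64, 32 * 1024, 64, 4)
print(hex(tag), hex(index), hex(offset))  # 0x7fffd 0x2d 0x24
```

The low 6 bits of the address give the byte offset, the next 7 bits select one of the 128 sets, and the remaining 35 bits form the tag.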

Types of Cache Misses
• Compulsory (cold) misses: first access to a block
  - Prefetching can reduce these misses
• Capacity misses: a cache cannot contain all blocks needed in a program, so some blocks are discarded and later accessed again
  - Replacement policies should target blocks that won't be used later
• Conflict misses: blocks mapping to the same set may be discarded (in direct-mapped and set-associative caches)
  - Increasing associativity can reduce these misses
• For multiprocessors, coherency misses can also happen
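
A tiny simulation can show why associativity reduces conflict misses. The sketch below is a toy model, not a simulator from the course: it assumes LRU replacement, works on block numbers rather than byte addresses, and uses a made-up access trace.

```python
from collections import OrderedDict


def count_misses(block_numbers, num_sets, ways):
    """Count misses for a trace of block numbers, assuming LRU replacement."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in block_numbers:
        cache_set = sets[block % num_sets]
        if block in cache_set:
            cache_set.move_to_end(block)  # hit: mark as most recently used
        else:
            misses += 1
            if len(cache_set) == ways:
                cache_set.popitem(last=False)  # evict the least recently used block
            cache_set[block] = True
    return misses


# Hypothetical trace alternating between two blocks that share a set.
trace = [0, 4, 0, 4, 0, 4]

# Both caches hold 4 blocks in total; only the organization differs.
direct_mapped = count_misses(trace, num_sets=4, ways=1)
two_way = count_misses(trace, num_sets=2, ways=2)
print(direct_mapped, two_way)  # 6 2
```

In the 2-way cache only the two compulsory misses remain. The four extra misses in the direct-mapped cache are conflict misses, since the cache has spare capacity that the mapping cannot use.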

Non-Blocking Cache Hierarchy
• Superscalar processors require parallel execution units
  - Multiple pipelined functional units
  - Cache hierarchies capable of simultaneously servicing multiple memory requests
    - Do not block cache references that do not need the miss data
    - Service multiple miss requests to memory concurrently
• Revisit miss penalty with memory-level parallelism
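
One way to see how memory-level parallelism changes the effective miss penalty is an idealized first-order model. This is an assumption made for illustration, not a formula from the slides: independent misses are grouped into batches of `overlap` requests, and each batch costs one full miss penalty because its misses are serviced concurrently.

```python
def stall_cycles(num_misses, miss_penalty, overlap):
    """Idealized stall estimate when up to `overlap` misses are serviced concurrently."""
    if overlap < 1:
        raise ValueError("overlap must be at least 1")
    batches = -(-num_misses // overlap)  # ceiling division
    return batches * miss_penalty


# Hypothetical numbers: 8 independent misses, 200-cycle miss penalty.
blocking = stall_cycles(8, 200, overlap=1)      # blocking cache: one miss at a time
non_blocking = stall_cycles(8, 200, overlap=4)  # four misses in flight at once
print(blocking, non_blocking)
```

Real processors overlap misses only partially, since dependent loads cannot issue together, so this model gives a bound on the benefit and is not a prediction.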

Miss Status Holding (Handling) Registers
• MSHRs facilitate non-blocking memory-level parallelism
• Used to track address, data, and status for multiple outstanding cache misses
• Need to provide correct memory ordering, respond to CPU requests, and maintain cache coherence
• Design details vary widely between different processors
  - But basic functions are similar

Cache & MSHR Organization
• Paper Fig 1: block diagram of cache organization
• Main components:
  - MSHR: one register for each miss to be handled concurrently
  - N-way comparator: compares an address to all block addresses in MSHRs (N = #MSHRs)
  - Input stack: buffer space for all misses corresponding to MSHR entries
    - Size = #MSHRs x block size
  - Status update and collecting networks
• Current implementations combine the MSHR and input stack
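
The components listed above can be mimicked in a few lines of software. This is a minimal sketch under several assumptions, not the design from the referenced paper: the class and method names are hypothetical, a dictionary lookup stands in for the N-way comparator, a miss to a block that is already outstanding is merged into the existing entry, and the cache stalls when every MSHR is busy.

```python
class MSHRFile:
    """Toy model of a set of MSHRs tracking outstanding cache misses."""

    def __init__(self, num_mshrs):
        self.num_mshrs = num_mshrs
        self.entries = {}  # block address -> requests waiting on that block

    def handle_miss(self, block_addr, request_id):
        # Stand-in for the N-way comparator: is this block already being fetched?
        if block_addr in self.entries:
            self.entries[block_addr].append(request_id)
            return "merged"
        if len(self.entries) == self.num_mshrs:
            return "stall"  # no free MSHR: the miss must wait
        self.entries[block_addr] = [request_id]
        return "allocated"

    def fill(self, block_addr):
        # Data arrived from memory: free the MSHR and return the waiting requests.
        return self.entries.pop(block_addr, [])


mshrs = MSHRFile(num_mshrs=2)
print(mshrs.handle_miss(0x40, "load1"))  # allocated
print(mshrs.handle_miss(0x40, "load2"))  # merged
print(mshrs.handle_miss(0x80, "load3"))  # allocated
print(mshrs.handle_miss(0xC0, "load4"))  # stall
print(mshrs.fill(0x40))                  # ['load1', 'load2']
```

The number of MSHRs bounds how many misses can be in flight at once, which is the limit on memory-level parallelism this structure imposes.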