Memory Organization and Virtual Memory Management: Concurrency and Caching - Prof. Sudhaka, Study notes of Computer Architecture and Organization

Covers memory organization techniques for cache and main memory, focusing on concurrency and virtual memory management. Topics include sequential and concurrent bank operations, data skewing, virtual memory, and caching from disk. The document also covers management policies, address translation concepts, and implementation using page tables and translation lookaside buffers (TLBs).

Typology: Study notes

Pre 2010

Uploaded on 09/17/2009

© Sudhakar Yalamanchili, Georgia Institute of Technology
Module: Main Memory Organization and Operation
ECE 4100/6100 (2)
Reading for this Module

  • Main memory organizations
    • Section 5.8
  • Memory Technology
    • Section 5.9
  • Virtual memory
    • Section 5.10
  • Examples
    • Paged virtual memory: Section 5.11
    • Memory hierarchies: Section 5.13

Main and Secondary Memory Organization

[Figure: levels of organization: single bit, single chip, single bank, interleaved banks (four memory modules), interleaved disks]

Each level is characterized by distinct technology and architecture

Main Memory Storage Technologies

  • DRAM: “Dynamic” Random Access Memory
    • Highest densities
    • Optimized for cost/bit → main memory
  • SRAM: “Static” Random Access Memory
    • Densities 1/4 to 1/8 of DRAM
    • Speeds 8-16x faster than DRAM
    • Cost 8-16x more per bit
    • Optimized for speed → caches

Generic Memory Organization

  • Memory cell stores a bit of information: DRAM or SRAM cell
  • Row address causes the read of a row of bits into the buffer
  • Column address selects bit from the buffer
  • CS: Chip Select, RW: Read/Write
  • Use half the address pins: the row and column addresses share the same pins, latched by the Row Address Strobe (RAS) and the Column Address Strobe (CAS)

[Figure: memory array with sense amplifiers, RAS and CAS inputs, and 11 address lines]

DRAM Optimizations

  • Faster to read data from the same row
    • Called “page mode” (fast page mode, EDO are variations)
    • Multiple CAS accesses
  • Bandwidth determined by cycle time
    • Example Row: 100ns +
    • Example Column: 30ns +, usually more like 50ns due to external components
  • Add a clock to the interface → synchronous DRAMs
    • Enable split transactions
  • Use both edges of the clock → double data rate
  • In practice
    • There are multiple banks on chip
    • Arrays are 1-4Mbits
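
The benefit of page mode can be estimated with a rough timing model. The sketch below assumes that the first access to a row pays the row time plus one column time, and that every further access to the same open row pays only the column time. The function names and this simplified model are illustrative assumptions, not something the slides specify.

```c
#include <assert.h>

/* Assumed model: one row activation, then n column (CAS) accesses
 * to the same open row. Times are in nanoseconds. */
unsigned page_mode_ns(unsigned n, unsigned t_row, unsigned t_col)
{
    assert(n > 0);
    return t_row + n * t_col;
}

/* Assumed baseline: every access re-opens the row before the column access. */
unsigned random_access_ns(unsigned n, unsigned t_row, unsigned t_col)
{
    assert(n > 0);
    return n * (t_row + t_col);
}
```

With the example figures from the slide (100ns row, 30ns column), four accesses to one row take 220ns under this model, against 520ns if each access re-opens the row.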

Conventional SDRAM Memory System

  • Define memory bank as an independently addressable and controllable entity
    • Typically 2-4 banks per chip
    • Different topologies for each set of signals between memory controller and memory chips
  • (S)DRAM based designs typically constructed with chips contributing part of a word
    • Maximum number of outstanding requests limited by concurrency in the chip
    • Internal “banks” used for performance optimizations
    • Speed via concurrency
    • Power

[Figure: memory controller connected to the memory chips by ADDR/CLK, DATA, and RAS/CAS/CS signals; word slice with 8 bits per chip forming 32 bit words]

Rambus: Direct RDRAM

  • Designed to improve bandwidth to memory: Reinvent the DRAM interface
    • Split transaction bus
    • Tailored signal traces: the Rambus Signal Layer
    • Short bus length for high speed
    • 2 bytes @ 800 MHz (1.6 GB/s per bus)
  • Each chip operates a set of 16 memory banks rather than a slice
    • Actually 2 banks where pairs share the sense amplifiers
    • Performs its own refresh
    • Block transfer, streaming, arbitrary sequencing

[Figure: Direct RDRAM channel with ADDR/CLK and DATA lines; word slice replaced by 16 bit packets; only one device responds]

Main Memory Organizations

  • The processor-memory bus may have width of one or more memory words
  • Multiple memory banks can operate in parallel
    • Transfer from memory to the cache is subject to the width of the processor-memory bus
  • Wide memory comes with constraints on expansion
    • Use of error correcting codes requires the complete “width” to be read to recompute the codes on writes
    • Minimum expansion unit size is increased

[Figure: three organizations, each with registers, ALU, and cache: a single memory on a bus, four memory modules on a wide bus, and four memory modules on a bus]

Word Level Interleaved Memory

  • Memory is organized into multiple, concurrent, banks
  • Word level interleaving across banks
  • Single address generates multiple, concurrent accesses
  • Well matched to cache line access patterns
  • Assuming a word-wide bus, cache miss penalty is T_address + T_mem_access + #words × T_transfer cycles
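
The miss penalty expression can be checked with a few lines of C. The first function is the formula from the slide; it implicitly assumes there are at least as many banks as words in the cache line. The second function is an assumed baseline for comparison (a single bank that repeats the whole sequence for every word) and is not taken from the slides. All names and cycle counts are illustrative.

```c
#include <assert.h>

/* Interleaved banks on a word-wide bus: one address phase, one memory
 * access performed by all banks in parallel, then one transfer per word. */
unsigned interleaved_miss_penalty(unsigned t_address, unsigned t_mem_access,
                                  unsigned t_transfer, unsigned words)
{
    assert(words > 0);
    return t_address + t_mem_access + words * t_transfer;
}

/* Assumed baseline: a single bank, so each word pays address,
 * access, and transfer time in turn. */
unsigned single_bank_miss_penalty(unsigned t_address, unsigned t_mem_access,
                                  unsigned t_transfer, unsigned words)
{
    assert(words > 0);
    return words * (t_address + t_mem_access + t_transfer);
}
```

For a hypothetical 4-word line with 1 cycle to send the address, 15 cycles of memory access, and 1 cycle per transfer, interleaving gives 20 cycles against 68 for the single-bank baseline.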

[Figure: timing of memory access 1 and memory access 2 in a memory module, each taking time τ, followed by the time to read the output of the memory access]

Note the effect of a split transaction vs. locked bus

[Figure: word interleaving across Bank 0, Bank 1, Bank 2, Bank 3; words 0, 1, 2 shown in the first row and 4, 5, 6 in the second]
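
Word level interleaving places consecutive words in consecutive banks. A minimal sketch of that mapping follows, assuming word addresses that start at bank 0; the type and function names are hypothetical.

```c
#include <assert.h>

/* Where a word lives under word level interleaving. */
typedef struct {
    unsigned bank;  /* which bank holds the word */
    unsigned row;   /* position of the word inside that bank */
} bank_location;

bank_location locate_word(unsigned word_address, unsigned num_banks)
{
    bank_location loc;
    assert(num_banks > 0);
    loc.bank = word_address % num_banks;  /* low-order part selects the bank */
    loc.row  = word_address / num_banks;  /* high-order part selects the row */
    return loc;
}
```

With four banks, word 5 maps to bank 1, row 1. When the number of banks is a power of two, the modulo and divide reduce to selecting low-order and high-order address bits.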

Sequential Bank Operation

[Figure: address split into n-m higher order bits and m lower order bits; banks accessed in sequence, with access 1 to module 0 followed by module 1]

  • Implement using DRAM arrays with page mode access

Concurrent Bank Operation

[Figure: address split into n-m and m bits; module 0, module 1, module 2 with ADDR and DATA lines]

  • Supports arbitrary accesses
  • Needs sources of multiple, independent accesses
    • Lock-up free caches, data speculation, write buffers, prefetching

Conflict-Free Access

  • Conflict free access to elements of the vector if
    • M >= N
    • M >= N · gcd(M, d)
  • Multi-dimensional arrays treated as arrays of 1-d vectors
  • Conflict free access for various patterns in a matrix requires
    • M >= N · gcd(M, δ1) for columns
    • M >= N · gcd(M, δ2) for rows
    • M >= N · gcd(M, δ1 + δ2) for forward diagonals
    • M >= N · gcd(M, δ1 - δ2) for backward diagonals
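
The condition can be evaluated directly. The sketch below reads M as the number of banks, N as the number of cycles a bank stays busy per access, and d as the access stride in words. These readings are assumptions, since the slide that defines the symbols is not part of this preview, and the function names are hypothetical.

```c
#include <assert.h>

/* Greatest common divisor by Euclid's algorithm; gcd_u(m, 0) returns m. */
unsigned gcd_u(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Number of distinct banks touched by a stride-d sweep over m banks. */
unsigned banks_visited(unsigned m, unsigned d)
{
    assert(m > 0);
    return m / gcd_u(m, d);
}

/* The slide's condition: M >= N * gcd(M, d). */
int conflict_free(unsigned m, unsigned n, unsigned d)
{
    assert(m > 0);
    return m >= n * gcd_u(m, d);
}
```

Under these assumptions, 8 banks with a busy time of 4 cycles handle strides 1 and 2 without conflict but not stride 4, which touches only 2 banks. With 17 banks the same stride 4 touches all 17.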

Conflict-Free Access

  • Implications for M = N = even number?
  • For non-power-of-two values of M, indexing and address computation must be efficient
  • Vectors that are accessed are scrambled
    • Unscrambling of vectors is a non-trivial performance issue
  • Data dependencies can still reduce bandwidth far below O(M)

Avoiding Bank Conflicts: Compiler Techniques

  • Many banks

    int x[256][512];
    int i, j;

    /* Inner loop walks down a column: successive accesses are 512 words apart */
    for (j = 0; j < 512; j = j + 1)
        for (i = 0; i < 256; i = i + 1)
            x[i][j] = 2 * x[i][j];

  • Even with 128 banks, since 512 is a multiple of 128, conflict on word accesses

  • Solutions :
    • Software: loop interchange
    • Software: adjust array size to a prime # (“array padding”)
    • Hardware: prime number of banks (e.g. 17)
    • Data skewing
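
Two of the software solutions can be sketched in C. The helper `bank_of` is a hypothetical model that assumes row-major layout, word interleaving, and an array that starts at bank 0; none of these details are stated on the slide. The second function shows loop interchange applied to the loop from the slide.

```c
#include <assert.h>

#define ROWS 256
#define COLS 512

/* Bank holding element [i][j] of an array whose rows are row_len words
 * long (assumed model: row-major layout, word interleaving, base in bank 0). */
unsigned bank_of(unsigned i, unsigned j, unsigned row_len, unsigned num_banks)
{
    assert(num_banks > 0);
    return (i * row_len + j) % num_banks;
}

/* Loop interchange: the inner loop now walks along a row, so successive
 * accesses are one word apart and fall in successive banks. */
void scale_interchanged(int x[ROWS][COLS])
{
    int i, j;
    for (i = 0; i < ROWS; i = i + 1)
        for (j = 0; j < COLS; j = j + 1)
            x[i][j] = 2 * x[i][j];
}
```

In this model, with 128 banks and 512-word rows, elements [0][0] and [1][0] both fall in bank 0. Padding each row to 521 words (a prime, chosen here only for illustration) moves [1][0] to bank 9, and using 17 banks with unpadded rows moves it to bank 2.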

Pentium IV Execution Core

  • Exec Port 0
    • ALU (2X): Add/Sub, Logic, Store Data, Branches
    • FP Move: FP/SSE Move, FP/SSE Store
  • Exec Port 1
    • ALU (2X): Add/Sub
    • Integer: Shift/Rotate
    • FP: FP/SSE Add, Mul, Div
  • Load Port: Load
  • Store Port: Store
  • Out-of-order schedulers feed dispatch ports

[Figure labels: compute μop queue, memory μop queue, four schedulers, dispatch ports]

Summary

  • Two dominant memory technologies for cache and main memory

  • SRAM and DRAM
  • Chip level organizations for concurrency
  • Main memory organizations for concurrency
  • Match the cache design
  • Optimize for bandwidth or latency
  • Latency and bandwidth computation
  • Miss penalty computation
  • Compiler techniques for organizing data for concurrent access

Virtual Memory

Topical Outline

  • Review of virtual memory management
    • Motivation
    • Operation
  • Speeding up address translation via concurrency
  • Speeding up address translation via use of virtually addressed caches

The Memory Hierarchy

[Figure: memory hierarchy from registers and ALU through the cache and successive memory levels; registers are managed by the compiler, the cache by the hardware, and the lower memory levels by the operating system]

  • Caching is the mode of operation at each level of the hierarchy
    • data movement from tape → disk → memory → cache → registers
  • Control of movement is performed by HW or SW guided by
    • Static vs. run-time management of resources
    • Time scales of operation
      • Span nanoseconds (register access) to seconds (tapes)

Virtual Memory Management

  • Follows the same basic principles as cache management
  • Unit of management is a page
    • A page is typically 4 Kbytes to 32 Kbytes
  • The program sees a virtual address space partitioned into virtual pages
  • Virtual address space typically exceeds physical memory size
  • The program resides on disk
  • Physical memory is partitioned into physical pages

Management Policies

  • Demand driven operation
    • Pages are brought into memory when referenced
  • Placement policy
    • Fully associative placement
  • Replacement policy
    • Approximations to least recently used (LRU) are the most common
  • Update policy
    • At current disk latencies, write-through is infeasible
    • Write-back update policy is employed
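
The slides do not say which LRU approximation is meant. One widely used approximation is the clock (second-chance) algorithm, sketched below under that assumption. The frame count, type, and function names are all hypothetical.

```c
#include <assert.h>

#define NUM_FRAMES 4   /* hypothetical number of physical pages */

typedef struct {
    int referenced[NUM_FRAMES];  /* one reference bit per physical page */
    unsigned hand;               /* next frame the clock hand examines */
} clock_state;

/* Record a reference to a resident page. */
void clock_touch(clock_state *c, unsigned frame)
{
    assert(frame < NUM_FRAMES);
    c->referenced[frame] = 1;
}

/* Choose a victim: sweep the hand, clearing reference bits, until a
 * frame that has not been referenced since the last sweep is found. */
unsigned clock_pick_victim(clock_state *c)
{
    for (;;) {
        unsigned frame = c->hand;
        c->hand = (c->hand + 1) % NUM_FRAMES;
        if (!c->referenced[frame])
            return frame;
        c->referenced[frame] = 0;  /* give the page a second chance */
    }
}
```

A page that was referenced recently survives one pass of the hand, which approximates LRU with a single bit of state per page instead of a full ordering.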

Address Translation: Concepts

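[Figure: the virtual address (VPN, offset) passes through the address translation data structure to give the physical address (PPN, offset); the PPN gives the base address of the physical page]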

  • Offsets within the virtual page and corresponding physical page are the same
  • We only need to translate the virtual page number (VPN) to the corresponding physical page number (PPN)
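
The split into page number and offset is plain bit manipulation. The sketch below assumes 32-bit addresses and 4 Kbyte pages, so the offset is the low 12 bits; the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET_BITS 12u   /* assumes 4 Kbyte pages */

/* Virtual page number: the address bits above the page offset. */
uint32_t vpn_of(uint32_t va)
{
    return va >> PAGE_OFFSET_BITS;
}

/* Page offset: identical in the virtual and the physical address. */
uint32_t offset_of(uint32_t va)
{
    return va & ((1u << PAGE_OFFSET_BITS) - 1u);
}

/* Rebuild the physical address from a translated PPN and the offset. */
uint32_t physical_address(uint32_t ppn, uint32_t offset)
{
    assert(offset < (1u << PAGE_OFFSET_BITS));
    return (ppn << PAGE_OFFSET_BITS) | offset;
}
```

For the virtual address 0x12345678, the VPN is 0x12345 and the offset is 0x678. If the VPN translates to PPN 0xABC, the physical address is 0xABC678.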

[Figure: virtual memory pages (located on disk) mapped to physical memory pages, VPN to PPN; move page and translate address]

Address Translation Implementation: The Page Table

  • Translate the virtual page address to the physical page address

  • Keep state information on a page
    • Modified/Not Modified
    • Access rights
    • In memory or on disk
    • Caching policies

[Figure: the page table base register plus the VPN as index selects a page table entry (V, state, PPN/Disk Address); the PPN and the offset form the physical address sent to the cache]
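
A single-level page table lookup can be sketched as an array indexed by the VPN. The entry layout, the 4 Kbyte page size, and the table size below are illustrative assumptions; real page table entries pack these fields differently.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET_BITS 12u    /* assumes 4 Kbyte pages */
#define NUM_VIRTUAL_PAGES 16u   /* hypothetical, tiny address space */

typedef struct {
    unsigned valid : 1;      /* V: page is in physical memory */
    unsigned modified : 1;   /* state: Modified/Not Modified */
    unsigned writable : 1;   /* state: access rights (simplified) */
    uint32_t ppn_or_disk;    /* PPN if valid, otherwise a disk address */
} pte;

/* Returns 1 and stores the physical address in *pa on success.
 * Returns 0 on a page fault (page not in memory). */
int translate(const pte *page_table, uint32_t va, uint32_t *pa)
{
    uint32_t vpn = va >> PAGE_OFFSET_BITS;
    uint32_t offset = va & ((1u << PAGE_OFFSET_BITS) - 1u);
    const pte *entry;

    assert(vpn < NUM_VIRTUAL_PAGES);
    entry = &page_table[vpn];   /* page table base register + index */
    if (!entry->valid)
        return 0;               /* the OS must bring the page in from disk */
    *pa = (entry->ppn_or_disk << PAGE_OFFSET_BITS) | offset;
    return 1;
}
```

Here `page_table` plays the role of the page table base register. A return value of 0 stands in for the page fault that the operating system would service.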

TLB Operation

  • TLB size typically a function of the target domain
    • High end machines will have fully associative large TLBs
  • PTE entries are replaced on a demand driven basis
  • The TLB is in the critical path
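
A fully associative TLB with demand-driven replacement can be sketched as follows. The entry count and the round-robin replacement order are assumptions made for illustration; hardware compares all entries in parallel rather than in a loop, and the slides do not name a replacement policy.

```c
#include <assert.h>
#include <stdint.h>

#define TLB_ENTRIES 4u   /* hypothetical size */

typedef struct {
    int valid;
    uint32_t vpn;
    uint32_t ppn;
} tlb_entry;

typedef struct {
    tlb_entry entry[TLB_ENTRIES];
    unsigned next;   /* next entry to replace (round-robin, an assumption) */
} tlb;

/* Fully associative lookup: the VPN is compared against every entry.
 * Returns 1 on a hit and stores the PPN in *ppn, 0 on a miss. */
int tlb_lookup(const tlb *t, uint32_t vpn, uint32_t *ppn)
{
    unsigned i;
    for (i = 0; i < TLB_ENTRIES; i++) {
        if (t->entry[i].valid && t->entry[i].vpn == vpn) {
            *ppn = t->entry[i].ppn;
            return 1;
        }
    }
    return 0;
}

/* After a miss, install the translation fetched from the page table. */
void tlb_insert(tlb *t, uint32_t vpn, uint32_t ppn)
{
    assert(t->next < TLB_ENTRIES);
    t->entry[t->next].valid = 1;
    t->entry[t->next].vpn = vpn;
    t->entry[t->next].ppn = ppn;
    t->next = (t->next + 1u) % TLB_ENTRIES;
}
```

On a miss the caller would walk the page table, call `tlb_insert`, and retry the access.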

[Figure: the TLB sits between the ALU and the cache and converts the virtual address into a physical address; on a miss, translate and update the TLB]

The Memory Access Path

  • Virtual-to-physical address translation occurs on every access!

  • Adds to the latency of every memory access
  • How can we optimize the critical path?

[Figure: memory access path from registers and ALU through the TLB (virtual address to physical address) to the cache and the memory levels below it]

Optimizing the Critical Path: Principles

  • Concurrency between address translation and cache access

  • Overlap cache access and TLB translation
  • Making translation the exception
  • Address the cache with the virtual address

Overlapping Cache and TLB Access

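  • Example: direct mapped, 16 Kbyte cache with 64 byte lines and 16 Kbyte pages

[Figure: a 32-bit virtual address (bits 31:0) is split as an 18-bit VPN and a 14-bit page offset for the TLB, and as an 18-bit tag, 8-bit line index, and 6-bit line offset for the direct mapped cache. TLB entries hold VPN, PPN, V, and state; cache lines hold state bits, tag, and data, selected through a mux. The TLB is accessed with the VPN while the cache is accessed with the line address]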
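
The overlap works in this example because the cache index and line offset together fit inside the page offset, so the cache can be indexed with bits that translation never changes. A minimal sketch of the field arithmetic follows, assuming 32-bit addresses; the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_OFFSET_BITS 6u    /* 64 byte lines */
#define INDEX_BITS 8u          /* 16 Kbytes / 64 bytes = 256 lines */
#define PAGE_OFFSET_BITS 14u   /* 16 Kbyte pages */

/* Cache line index, taken from untranslated address bits. */
uint32_t cache_index(uint32_t va)
{
    return (va >> LINE_OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
}

/* Virtual page number sent to the TLB in parallel (18 bits here). */
uint32_t vpn_of(uint32_t va)
{
    return va >> PAGE_OFFSET_BITS;
}

/* Overlap is safe when index plus line offset fit in the page offset. */
int index_within_page_offset(void)
{
    return LINE_OFFSET_BITS + INDEX_BITS <= PAGE_OFFSET_BITS;
}
```

Here 6 + 8 = 14 bits, exactly the page offset, so the cache lookup can start with the virtual address while the TLB produces the PPN that is then compared against the 18-bit tag.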