© Sudhakar Yalamanchili, Georgia Institute of Technology
Module: Main Memory Organization and Operation
Reading for this Module
- Main memory organizations
- Memory Technology
- Virtual memory
- Examples
- Paged virtual memory: Section 5.
- Memory hierarchies: Section 5.
ECE 4100/6100 (3)
Main and Secondary Memory Organization
[Figure: levels of memory organization: single bit, single chip, single bank, interleaved banks of memory modules, interleaved disks]
Each level is characterized by distinct technology and architecture
Main Memory Storage Technologies
- DRAM: “Dynamic” Random Access Memory
  - Highest densities
  - Optimized for cost/bit → main memory
- SRAM: “Static” Random Access Memory
  - Densities 1/4 to 1/8 of DRAM
  - Speeds 8-16x faster than DRAM
  - Cost 8-16x more per bit
  - Optimized for speed → caches
Generic Memory Organization
- Memory cell stores a bit of information: DRAM or SRAM cell
- Row address causes the read of a row of bits into the buffer
- Column address selects a bit from the buffer
- CS: Chip Select, RW: Read/Write
- Row and column addresses share the same pins, so only half the address pins are needed: Row Address Strobe (RAS), Column Address Strobe (CAS)
[Figure: memory cell array with sense amplifiers; RAS latches the row address and CAS latches the column address]
DRAM Optimizations
- Faster to read data from the same row
  - Called “page mode” (fast page mode and EDO are variations)
  - Multiple CAS accesses
- Bandwidth determined by cycle time
  - Example row: 100 ns+
  - Example column: 30 ns+, usually more like 50 ns due to external components
- Add a clock to the interface → synchronous DRAMs
  - Enable split transactions
- Use both edges of the clock → double data rate
- In practice
  - There are multiple banks on chip
  - Arrays are 1-4 Mbits
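The page-mode benefit can be estimated with a rough timing model. The sketch below is an illustration only: the function names and the simple additive model are assumptions, and real parts add precharge, refresh, and controller overheads that are ignored here.

```c
#include <assert.h>

/* Illustrative model (an assumption, not a datasheet formula):
 * every random access pays the row time plus the column time. */
static unsigned random_access_ns(unsigned n, unsigned t_row, unsigned t_col)
{
    return n * (t_row + t_col);
}

/* In page mode the row is opened once; each access to that row
 * then pays only the column (CAS) time. */
static unsigned page_mode_ns(unsigned n, unsigned t_row, unsigned t_col)
{
    assert(n > 0);
    return t_row + n * t_col;
}
```

With the example figures above (100 ns row, 30 ns column), four accesses to the same row take 220 ns under this model instead of 520 ns.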
Conventional SDRAM Memory System
- Define memory bank as an independently addressable and controllable entity
  - Typically 2-4 banks per chip
  - Different topologies for each set of signals between memory controller and memory chips
- (S)DRAM based designs typically constructed with chips contributing part of a word
  - Maximum number of outstanding requests limited by concurrency in the chip
  - Internal “banks” used for performance optimizations
    - Speed via concurrency
    - Power
[Figure: memory controller connected to the memory chips by ADDR/CLK, DATA, and RAS/CAS/CS signals; each chip supplies an 8-bit word slice of a 32-bit word]
Rambus: Direct RDRAM
- Designed to improve bandwidth to memory: reinvent the DRAM interface
  - Split transaction bus
  - Tailored signal traces: the Rambus Signal Layer
  - Short bus length for high speed
  - 2 bytes @ 800 MHz (1.6 GB/s per bus)
- Each chip operates a set of 16 memory banks rather than a slice
  - Actually 2 banks where pairs share the sense amplifiers
  - Performs its own refresh
  - Block transfer, streaming, arbitrary sequencing
[Figure: Direct RDRAM system with shared ADDR/CLK and DATA lines; data moves as 16-bit packets rather than word slices, and only one device responds]
Main Memory Organizations
- The processor-memory bus may have width of one or more memory words
- Multiple memory banks can operate in parallel
- Transfer from memory to the cache is subject to the width of the processor-memory bus
- Wide memory comes with constraints on expansion
- Use of error correcting codes requires the complete “width” to be read to recompute the codes on writes
- Minimum expansion unit size is increased
[Figure: three processor-memory organizations, each with registers/ALU and a cache: a single memory on a bus, four memory banks on a wide bus, and four interleaved memory banks on a bus]
Word Level Interleaved Memory
- Memory is organized into multiple, concurrent, banks
- Word-level interleaving across banks
- Single address generates multiple, concurrent accesses
- Well matched to cache line access patterns
- Assuming a word-wide bus, the cache miss penalty is T_address + T_mem_access + (#words × T_transfer) cycles
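The miss penalty expression can be written as a small helper. This is a sketch: the function and parameter names are hypothetical, it assumes the line has no more words than there are banks, and the single-bank variant is added only for comparison and is not taken from the slide.

```c
#include <assert.h>

/* Word-interleaved memory behind a one-word-wide bus: the banks are
 * accessed in parallel, so the access time is paid once and the words
 * are then transferred one at a time. All times are in bus cycles. */
static unsigned interleaved_miss_penalty(unsigned t_address,
                                         unsigned t_mem_access,
                                         unsigned words,
                                         unsigned t_transfer)
{
    assert(words > 0);
    return t_address + t_mem_access + words * t_transfer;
}

/* Comparison case (an assumption for illustration): a single bank
 * repeats the whole address/access/transfer sequence for every word. */
static unsigned single_bank_miss_penalty(unsigned t_address,
                                         unsigned t_mem_access,
                                         unsigned words,
                                         unsigned t_transfer)
{
    assert(words > 0);
    return words * (t_address + t_mem_access + t_transfer);
}
```

With hypothetical values of 1 cycle to send the address, 15 cycles per access, and 1 cycle per transfer, a 4-word line costs 20 cycles interleaved versus 68 cycles from a single bank.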
[Figure: timing diagram of memory accesses 1 and 2 in a memory module, each taking access time τ followed by the time to read the output of the memory access; words 0, 1, 2, ... and 4, 5, 6, ... are interleaved across Bank 0 to Bank 3]
Note the effect of a split transaction vs. locked bus
Sequential Bank Operation
[Figure: sequential bank operation. The address is split into n-m higher-order bits and m lower-order bits; the banks (module 0, module 1, ...) are accessed one after another]
- Implement using DRAM arrays with page mode access
Concurrent Bank Operation
[Figure: concurrent bank operation. The address is split into n-m and m bit fields; module 0, module 1, module 2, ... are connected to ADDR and DATA lines]
- Supports arbitrary accesses
- Needs sources of multiple, independent accesses
- Lock-up free caches, data speculation, write buffers, prefetching
Conflict-Free Access
- Conflict-free access to elements of the vector if
  - M >= N
  - M >= N × gcd(M, d)
- Multi-dimensional arrays are treated as arrays of 1-d vectors
- Conflict-free access for various patterns in a matrix requires
  - M >= N × gcd(M, δ1) for columns
  - M >= N × gcd(M, δ2) for rows
  - M >= N × gcd(M, δ1 + δ2) for forward diagonals
  - M >= N × gcd(M, δ1 - δ2) for backward diagonals
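The condition can be checked mechanically. The sketch below assumes the usual reading of the symbols, which the slide text does not spell out: M is the number of banks, N is the bank busy time in cycles, and d is the access stride in words. The function names are hypothetical.

```c
#include <assert.h>

/* Greatest common divisor by Euclid's algorithm; gcd_u(a, 0) == a. */
static unsigned gcd_u(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Returns 1 when a stride-d access satisfies M >= N * gcd(M, d),
 * with M = m_banks, N = n_busy, d = stride (assumed meanings). */
static int conflict_free(unsigned m_banks, unsigned n_busy, unsigned stride)
{
    assert(m_banks > 0);
    return m_banks >= n_busy * gcd_u(m_banks, stride);
}
```

For example, with 8 banks and a busy time of 4 cycles, strides 1 and 2 pass the test but stride 4 does not, while 17 banks pass for a stride of 512.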
Conflict-Free Access
- Implications for M = N = even number?
- For non-power-of-two values of M, indexing and
address computation must be efficient
- Vectors that are accessed are scrambled
- Unscrambling of vectors is a non-trivial performance issue
- Data dependencies can still reduce bandwidth far
below O(M)
Avoiding Bank Conflicts: Compiler Techniques
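int x[256][512];
int i, j;
/* Column-order traversal: consecutive accesses are 512 words apart */
for (j = 0; j < 512; j = j + 1)
    for (i = 0; i < 256; i = i + 1)
        x[i][j] = 2 * x[i][j];
- Even with 128 banks, the inner loop conflicts on every word access: its stride is 512 words, and since 512 is a multiple of 128, all accesses map to the same bank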
- Solutions :
- Software: loop interchange
- Software: adjust array size to a prime # (“array padding”)
- Hardware: prime number of banks (e.g. 17)
- Data skewing
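The effect of these solutions can be illustrated by counting how many distinct banks the inner loop touches. This is a sketch under an assumed word-interleaved mapping (bank = word index mod number of banks); the helper names are hypothetical.

```c
#include <assert.h>

enum { ROWS = 256, COLS = 512, MAX_BANKS = 1024 };

/* Assumed word-interleaved mapping: consecutive words go to
 * consecutive banks. */
static unsigned bank_of(unsigned word_index, unsigned banks)
{
    assert(banks > 0);
    return word_index % banks;
}

/* Number of distinct banks touched by `count` accesses that are
 * `stride` words apart. */
static unsigned banks_touched(unsigned stride, unsigned count, unsigned banks)
{
    unsigned char seen[MAX_BANKS] = {0};
    unsigned distinct = 0;
    assert(banks <= MAX_BANKS);
    for (unsigned k = 0; k < count; k++) {
        unsigned b = bank_of(k * stride, banks);
        if (!seen[b]) {
            seen[b] = 1;
            distinct++;
        }
    }
    return distinct;
}

/* Loop interchange: the inner loop now walks along a row (stride 1). */
static void scale_interchanged(int x[ROWS][COLS])
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}
```

Under this model the original column-order loop (stride 512, 128 banks) touches a single bank, while the interchanged loop, a row length padded to the prime 521, or a bank count of 17 each spread the accesses over every bank.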
Pentium IV Execution Core
[Figure: Pentium IV execution core. A compute μop queue and a memory μop queue feed the out-of-order schedulers, which feed four dispatch ports: Exec Port 0 (ALU (2X): add/sub, logic, store data, branches; FP Move: FP/SSE move, FP/SSE store), Exec Port 1 (ALU (2X): add/sub; Integer: shift/rotate; FP: FP/SSE), Load Port (load), and Store Port (store)]
Summary
- Two dominant memory technologies for cache and
main memory
- SRAM and DRAM
- Chip level organizations for concurrency
- Main memory organizations for concurrency
- Match the cache design
- Optimize for bandwidth or latency
- Latency and bandwidth computation
- Miss penalty computation
- Compiler techniques for organizing data for
concurrent access
Virtual Memory
Topical Outline
- Review of virtual memory management
- Speeding up address translation via concurrency
- Speeding up address translation via use of virtually
addressed caches
The Memory Hierarchy
[Figure: the memory hierarchy, from registers and ALU through the cache down to successive memory levels; the levels are labeled as managed by the compiler, by the hardware, and by the operating system]
- Caching is the mode of operation at each level of the hierarchy
- Data movement from tape → disk → memory → cache → registers
- Control of movement is performed by HW or SW guided by
- Static vs. run-time management of resources
- Time scales of operation
- Span nanoseconds (register access) to seconds (tapes)
Virtual Memory Management
- Follows the same basic principles as cache
management
- Unit of management is a page
- A page is typically 4 Kbytes to 32 Kbytes
- The program sees a virtual address space
partitioned into virtual pages
- Virtual address space typically exceeds physical memory size
- The program resides on disk
- Physical memory is partitioned into physical pages
Management Policies
- Demand driven operation
- Pages are brought into memory when referenced
- Placement policy
- Fully associative placement
- Replacement policy
- Approximations to least recently used (LRU) are the most common
- Update policy
- At current disk latencies, write-through is infeasible
- Write-back update policy is employed
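One widely used approximation to LRU is the clock (second-chance) algorithm. The sketch below is one illustrative example of such an approximation, not the policy of any specific system; the frame count and names are assumptions.

```c
#include <assert.h>

enum { NUM_FRAMES = 4 };   /* assumed number of physical page frames */

typedef struct {
    int      referenced[NUM_FRAMES];  /* reference bit per frame */
    unsigned hand;                    /* next frame to examine */
} clock_state_t;

/* Clock victim selection: frames whose reference bit is set are
 * skipped once, and the bit is cleared as the hand passes. The loop
 * ends within one full sweep plus one step. */
static unsigned clock_pick_victim(clock_state_t *c)
{
    assert(c != 0);
    for (;;) {
        unsigned frame = c->hand;
        c->hand = (c->hand + 1) % NUM_FRAMES;
        if (!c->referenced[frame])
            return frame;
        c->referenced[frame] = 0;   /* second chance */
    }
}
```

Under the write-back policy above, a victim chosen this way would be written to disk only if its page was modified.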
Address Translation: Concepts
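[Figure: the address translation data structure maps the VPN of a virtual address (VPN, offset) to a PPN; the PPN gives the page base address and the offset is carried over unchanged]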
- Offsets within the virtual page and corresponding physical page are the same
- We only need to translate the virtual page number (VPN) to the corresponding physical page number (PPN)
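The split between page number and offset is a matter of shifts and masks. The sketch below assumes 4 Kbyte pages and 32-bit addresses; the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { PAGE_SHIFT = 12 };                     /* assumed 4 Kbyte pages */
#define PAGE_OFFSET_MASK ((1u << PAGE_SHIFT) - 1u)

/* Upper bits of the virtual address: the virtual page number. */
static uint32_t vpn_of(uint32_t vaddr)
{
    return vaddr >> PAGE_SHIFT;
}

/* Lower bits: the offset, identical in the virtual and physical page. */
static uint32_t offset_of(uint32_t vaddr)
{
    return vaddr & PAGE_OFFSET_MASK;
}

/* Physical address = translated PPN joined with the unchanged offset. */
static uint32_t make_paddr(uint32_t ppn, uint32_t offset)
{
    assert(offset <= PAGE_OFFSET_MASK);
    return (ppn << PAGE_SHIFT) | offset;
}
```

Only the value returned by vpn_of() goes through translation; offset_of() is passed straight into make_paddr().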
[Figure: virtual memory pages (located on disk) and physical memory pages; on a reference, move the page and translate the address from VPN to PPN]
Address Translation Implementation: The Page Table
- Translate the virtual page address to the physical
page address
- Keep state information on a page
- Modified/Not Modified
- Access rights
- In memory or on disk
- Caching policies
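A single-level page table lookup can be sketched as follows. The entry layout, table size, and names are assumptions made for illustration; real page table formats are architecture specific.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { PAGE_SHIFT = 12, NUM_VPAGES = 16 };   /* assumed sizes */

typedef struct {
    unsigned valid    : 1;   /* V: page is resident in physical memory */
    unsigned modified : 1;   /* state: written since it was loaded */
    unsigned writable : 1;   /* state: access rights */
    uint32_t ppn_or_disk;    /* PPN when valid, disk address otherwise */
} pte_t;

/* Returns 1 and stores the physical address when the page is resident;
 * returns 0 (a page fault in this sketch) otherwise. */
static int translate(const pte_t *table, uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1u);
    assert(table != NULL && paddr != NULL);
    if (vpn >= NUM_VPAGES || !table[vpn].valid)
        return 0;
    *paddr = (table[vpn].ppn_or_disk << PAGE_SHIFT) | offset;
    return 1;
}
```

Here the table pointer plays the role of the page table base register and the VPN is the index into the table.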
[Figure: the page table base register plus the VPN index selects a page table entry (V, state, PPN/Disk Address); the PPN joined with the offset forms the physical address sent to the cache]
TLB Operation
- TLB size typically a function of the target domain
- High end machines will have fully associative large TLBs
- PTEs in the TLB are replaced on a demand driven basis
- The TLB is in the critical path
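A fully associative TLB with demand-driven fill might look like the sketch below. The entry count, the round-robin replacement, and all names are assumptions chosen to keep the example small; hardware compares all entries in parallel rather than in a loop.

```c
#include <assert.h>
#include <stdint.h>

enum { TLB_ENTRIES = 4 };   /* assumed size; real TLBs are larger */

typedef struct {
    int      valid;
    uint32_t vpn;
    uint32_t ppn;
} tlb_entry_t;

typedef struct {
    tlb_entry_t entry[TLB_ENTRIES];
    unsigned    next_victim;   /* round-robin replacement (assumption) */
} tlb_t;

/* Fully associative lookup: the VPN is compared against every entry. */
static int tlb_lookup(const tlb_t *tlb, uint32_t vpn, uint32_t *ppn)
{
    assert(tlb != 0 && ppn != 0);
    for (unsigned i = 0; i < TLB_ENTRIES; i++) {
        if (tlb->entry[i].valid && tlb->entry[i].vpn == vpn) {
            *ppn = tlb->entry[i].ppn;
            return 1;   /* hit */
        }
    }
    return 0;           /* miss: translate via the page table, then insert */
}

/* Demand-driven fill: install the translation found after a miss. */
static void tlb_insert(tlb_t *tlb, uint32_t vpn, uint32_t ppn)
{
    tlb_entry_t *e = &tlb->entry[tlb->next_victim];
    e->valid = 1;
    e->vpn = vpn;
    e->ppn = ppn;
    tlb->next_victim = (tlb->next_victim + 1) % TLB_ENTRIES;
}
```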
[Figure: the TLB sits between the registers/ALU and the cache, translating each virtual address to a physical address; on a miss, translate and update the TLB]
The Memory Access Path
- Virtual-to-physical address translation occurs on
every access!
- Adds to the latency of every memory access
- How can we optimize the critical path?
[Figure: the memory access path from registers and ALU through the TLB (virtual address to physical address) to the cache and the lower memory levels]
Optimizing the Critical Path: Principles
- Concurrency between address translation and cache
access
- Overlap cache access and TLB translation
- Making translation the exception
- Address the cache with the virtual address
Overlapping Cache and TLB Access
- Example: direct mapped, 16 Kbyte cache with 64-byte lines and 16 Kbyte pages
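The overlap works in this example because the cache index and line offset fit entirely inside the page offset, so the cache can be indexed before translation finishes. The sketch below checks that condition and extracts the line index; it assumes 32-bit addresses and power-of-two sizes, and the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* log2 of an exact power of two. */
static unsigned log2_u32(uint32_t x)
{
    unsigned n = 0;
    assert(x != 0 && (x & (x - 1)) == 0);
    while (x > 1) {
        x >>= 1;
        n++;
    }
    return n;
}

/* The cache can be indexed in parallel with the TLB when the index and
 * line-offset bits all lie within the page offset, that is, when
 * cache_bytes / ways <= page_bytes. */
static int can_overlap(uint32_t cache_bytes, uint32_t ways, uint32_t page_bytes)
{
    assert(ways != 0);
    return cache_bytes / ways <= page_bytes;
}

/* Line index of a direct-mapped cache. */
static uint32_t line_index(uint32_t addr, uint32_t cache_bytes, uint32_t line_bytes)
{
    return (addr >> log2_u32(line_bytes)) & (cache_bytes / line_bytes - 1u);
}
```

For the example above, the page offset is 14 bits, the line offset 6 bits, and the index 8 bits, so the index computed from the page offset alone equals the index computed from the full address.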
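[Figure: a 32-bit address (bits 31:0) viewed two ways: an 18-bit VPN with a 14-bit page offset, and an 18-bit tag with an 8-bit line index and a 6-bit line offset. Access the TLB (entries: VPN, PPN, V, state) with the VPN while accessing the direct mapped cache (state bits, tag, data) with the line address; a mux selects the TLB output]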