

































































Material Type: Notes; Professor: Yalamanchili; Class: Adv Computer Architecture; Subject: Electrical & Computer Engr; University: Georgia Institute of Technology-Main Campus; Term: Unknown 2002;


































































DRAM: why bother? (i mean, besides the “memory wall” thing? ... is it just a performance issue?)
think about embedded systems: think cellphones, think printers, think switches ... nearly every embedded product that used to be expensive is now cheap. why? for one thing, rapid turnover from high performance to obsolescence guarantees generous supply of CHEAP, HIGH-PERFORMANCE embedded processors to suit nearly any design need.
what does the “memory wall” mean in this context? perhaps it will take longer for a high-performance design to become obsolete?
DRAM: Architectures,
Interfaces, and Systems
A Tutorial
Bruce Jacob and David Wang
Electrical & Computer Engineering Dept.
University of Maryland at College Park
http://www.ece.umd.edu/~blj/DRAM/
UNIVERSITY OF MARYLAND
Outline
Architectures, Systems, Embedded
Break at 10 a.m. — Stop us or starve
first off -- what is DRAM? an array of storage elements (capacitor-transistor pairs)
“DRAM” is an acronym (explain) why “dynamic”?
Basics
DRAM ORGANIZATION
[Figure: DRAM cell array. Each cell is a storage element (capacitor) plus a switching element, connected to a word line and a bit line, with data in/out at the edge of the array.]
so how do you interact with this thing? let’s look at a traditional organization first ... CPU connects to a memory controller that connects to the DRAM itself.
let’s look at a read operation
Basics
BUS TRANSMISSION
then the data is valid on the data bus ... depending on what you are using for in/out buffers, you might be able to overlap a little or a lot of the data transfer with the next CAS to the same page (this is PAGE MODE)
Basics
DATA TRANSFER
note: page mode enables overlap with CAS
Basics
BUS TRANSMISSION
DRAM “latency” isn’t deterministic because an access may need only a CAS or RAS+CAS, and there may be significant queuing delays within the CPU and the memory controller. Each transaction has some overhead. Some types of overhead cannot be pipelined. This means that in general, longer bursts are more efficient.
Basics
[Figure: CPU, memory controller, and DRAM, with latency components A through F marked along the request path]
A: Transaction request may be delayed in Queue
B: Transaction request sent to Memory Controller
C: Transaction converted to Command Sequences (may be queued)
D: Command/s Sent to DRAM
E1: Requires only a CAS, or
E2: Requires RAS + CAS, or
E3: Requires PRE + RAS + CAS
F: Transaction sent back to CPU
“DRAM Latency” = A + B + C + D + E + F
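The latency sum above, and the earlier remark that longer bursts are more efficient, can be sketched in a few lines of Python. This is only an illustration: the function names, the parameter names (`t_cas`, `t_ras`, `t_pre`), and every numeric value are assumptions made for the example and do not come from the slides or from any data sheet.

```python
def dram_access_ns(case: str, t_cas: float, t_ras: float, t_pre: float) -> float:
    """Component E: time spent in the DRAM itself, by case (E1, E2, E3)."""
    if case == "E1":              # row already open: CAS only
        return t_cas
    if case == "E2":              # bank idle: RAS + CAS
        return t_ras + t_cas
    if case == "E3":              # another row open: PRE + RAS + CAS
        return t_pre + t_ras + t_cas
    raise ValueError(f"unknown case: {case}")


def dram_latency_ns(a: float, b: float, c: float, d: float, e: float, f: float) -> float:
    """'DRAM Latency' = A + B + C + D + E + F, as on the slide."""
    return a + b + c + d + e + f


def burst_efficiency(overhead_cycles: int, burst_length: int) -> float:
    """Fraction of cycles that carry data when each transaction pays a fixed overhead."""
    return burst_length / (overhead_cycles + burst_length)


if __name__ == "__main__":
    # All timings below are made-up placeholders.
    for case in ("E1", "E2", "E3"):
        e = dram_access_ns(case, t_cas=20.0, t_ras=20.0, t_pre=20.0)
        print(case, dram_latency_ns(a=5.0, b=5.0, c=5.0, d=5.0, e=e, f=5.0))
    for burst in (1, 2, 4, 8):
        print(burst, burst_efficiency(overhead_cycles=4, burst_length=burst))
```

With a fixed overhead, the efficiency figure rises as the burst gets longer, which is the point of the note about non-pipelined overhead.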
Basics
PHYSICAL ORGANIZATION
This is per bank …
Typical DRAMs have 2+ banks
[Figure: x2 DRAM, x4 DRAM, and x8 DRAM organizations. Each shows a Memory Array with a Row Decoder and a Column Decoder, with Bit Lines feeding Sense Amps and Data Buffers.]
DRAM Evolution
Read Timing for Conventional DRAM
FPM allows you to keep the sense amps active for multiple CAS commands ...
much better throughput
problem: cannot latch a new value in the column address buffer until the read-out of the data is complete
DRAM Evolution
Read Timing for Fast Page Mode
solution to that problem -- instead of simple tri-state buffers, use a latch as well.
by putting a latch after the column mux, the next column address command can begin sooner
DRAM Evolution
Read Timing for Extended Data Out
by driving the col-addr latch from an internal counter rather than an external signal, the minimum cycle time for driving the output bus was reduced by roughly 30%
DRAM Evolution
Read Timing for Burst EDO
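As a rough illustration of the internal counter mentioned in the notes, the sketch below generates the column addresses for one burst from a single externally supplied starting column. The function name, the default burst length of 4, and the wrap-within-the-burst ordering are assumptions for the example; the slides do not specify them.

```python
def burst_columns(start_col: int, burst_length: int = 4) -> list[int]:
    """Column addresses an internal counter would step through for one burst.

    Assumes the counter wraps inside a block of `burst_length` aligned columns.
    """
    if start_col < 0 or burst_length <= 0:
        raise ValueError("start_col must be >= 0 and burst_length must be > 0")
    base = start_col - (start_col % burst_length)   # first column of the aligned block
    return [base + (start_col + i) % burst_length for i in range(burst_length)]


if __name__ == "__main__":
    print(burst_columns(0))   # starts at an aligned column
    print(burst_columns(5))   # starts mid-block and wraps around
```

Only the first column address has to cross the external bus; the rest are produced on chip, which is where the cycle-time saving comes from.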
output latch on EDO allowed you to start CAS sooner for next access (to same row)
latch whole row in ESDRAM -- allows you to start precharge & RAS sooner for the next page access -- HIDE THE PRECHARGE OVERHEAD.
DRAM Evolution
Inter-Row Read Timing for ESDRAM
[Timing diagrams (two): signals shown are Clock, Command, Address, and DQ. Commands shown are ACT, READ, and PRE; address phases are Row Addr and Col Addr, labeled by Bank; each READ returns four Valid Data beats on DQ.]
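To make "hide the precharge overhead" concrete, here is a deliberately simplified cycle-count model of two back-to-back reads to different rows of the same bank. Everything in it is an assumption for illustration: the parameter names, the rule that PRE can be issued one cycle after READ once the row is latched, and the example cycle counts. It is not taken from an ESDRAM data sheet.

```python
def second_read_done(t_rcd: int, cl: int, burst: int, t_rp: int, row_latched: bool) -> int:
    """Cycle at which the last data beat of the second read appears (first ACT at cycle 0)."""
    read1 = t_rcd                      # first READ follows ACT by tRCD
    data1_end = read1 + cl + burst     # first burst finishes here
    if row_latched:
        pre = read1 + 1                # row is held in a latch, so PRE can go out right away
    else:
        pre = data1_end                # sense amps must hold the row until the burst ends
    act2 = pre + t_rp                  # next ACT waits for precharge to finish
    read2 = act2 + t_rcd
    data2_start = max(read2 + cl, data1_end)   # the data bus cannot carry two bursts at once
    return data2_start + burst


if __name__ == "__main__":
    timing = dict(t_rcd=2, cl=2, burst=4, t_rp=2)   # placeholder cycle counts
    print("no row latch:", second_read_done(row_latched=False, **timing))
    print("row latched: ", second_read_done(row_latched=True, **timing))
```

In this toy model the latched case finishes earlier because the precharge and the next activate overlap with the first burst instead of following it.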
neat feature of this type of buffering: write-around
DRAM Evolution
Write-Around in ESDRAM
(can second READ be this aggressive?)
[Timing diagrams (three): signals shown are Clock, Command, Address, and DQ. Commands shown are ACT, READ, WRITE, and PRE; address phases are Row Addr and Col Addr, labeled by Bank; Valid Data beats appear on DQ for each READ and WRITE.]
main thing ... it is like having a bunch of open row buffers (a la rambus), but the problem is that you must deal with the cache directly (move into and out of it), not the DRAM banks ... adds an extra couple of cycles of latency ... however, you get good bandwidth if the data you want is in the cache, and you can “prefetch” into the cache ahead of when you want it ... originally targeted at reducing latency; now that SDRAM is CAS-2 and RCD-2, this makes sense only in a throughput way
DRAM Evolution
Internal Structure of Virtual Channel
Segment cache is software-managed, reduces energy
[Figure: segment cache ($) built from 2Kb segments]
FCRAM opts to break up the data array .. only activate a portion of the word line
8K rows requires 13 bits to select ... FCRAM uses 15 (assuming the array is 8k x 1k ... the data sheet does not specify)
DRAM Evolution
Internal Structure of Fast Cycle RAM
Reduces access time and energy/access
SDRAM: tRCD = 15ns (two clocks); 8M Array (8Kr x 1Kb)
FCRAM: tRCD = 5ns (one clock); 8M Array (?)
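The address-bit arithmetic in the notes can be checked with a short sketch. It takes the notes' own assumption of an 8K x 1K array; reading the two extra FCRAM address bits as selecting one of four word-line portions is a further assumption made here for illustration, since the data sheet is said not to specify the array shape.

```python
def address_bits(count: int) -> int:
    """Bits needed to select one of `count` items, where count is a power of two."""
    if count <= 0 or count & (count - 1):
        raise ValueError("count must be a positive power of two")
    return count.bit_length() - 1


ROWS = 8 * 1024          # 8K rows, as assumed in the notes
COLS = 1024              # 1K columns, as assumed in the notes
FCRAM_ROW_BITS = 15      # from the notes

sdram_row_bits = address_bits(ROWS)                    # bits to pick a whole row
extra_bits = FCRAM_ROW_BITS - sdram_row_bits           # bits left over in FCRAM
word_line_portions = 2 ** extra_bits                   # assumed: portions per word line
cells_per_activation = COLS // word_line_portions      # cells driven per activation

if __name__ == "__main__":
    print(sdram_row_bits, extra_bits, word_line_portions, cells_per_activation)
```

Under these assumptions, each activation drives a quarter of the word line, which is consistent with the slide's claim of lower access time and energy per access.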
Outline
Architectures, Systems, Embedded
Some technologies have legs, some do not, and some have gone belly up.
We’ll start by examining the fundamental technologies (I/O, packaging, etc.) and then explore some of these technologies in depth a bit later.
What Does This All Mean?
EDO
FPM
SLDRAM
DDR II
DDR
SDRAM
SDRAM
D-RDRAM
ESDRAM
FCRAM
RLDRAM
xDDR II
netDRAM
What is a “good” system?
It’s all about the cost of a system. This is a multi-dimensional tradeoff problem. It is especially tough when the relative cost factors of pins and die area, and the demands of bandwidth and latency, keep changing. Good decisions for one generation may not be good for future generations. This is why we don’t keep a DRAM protocol for a long time. FPM lasted a while, but we’ve quickly progressed through EDO, SDRAM, DDR/RDRAM, and now DDR II and whatever else is on the horizon.
Cost - Benefit Criterion
[Figure: DRAM System Design at the center, surrounded by the cost and benefit factors: Logic Overhead, Power Consumption, Package Cost, Test and Implementation, Bandwidth, Latency, Interconnect Cost]
Now we’ll really get our hands dirty, and try to become DRAM designers. That is, we want to understand the tradeoffs, and design our own memory system with DRAM cells. By doing this, we can gain some insight into the basis of the claims made by proponents of various DRAM memory systems.
A Memory System is a system that has many parts. It’s a set of technologies and design decisions. All of the parts are inter-related, but for the sake of discussion, we’ll split the components into the ovals seen here, and try to examine each part of a DRAM system separately.
Memory System Design
[Figure: DRAM Memory System at the center, surrounded by ovals: Topology, I/O Technology, Access Protocol, DRAM Chip Architecture, Clock Network, Row Buffer Management, Address Mapping, Pin Count, Chip Packaging]
First, we have to introduce the concept that signal propagation takes finite time. Signals are limited by the speed of light; on an ideal transmission line we should get approximately 2/3 the speed of light, which is about 20 cm/ns. All signals, including system-wide clock signals, have to be sent across a system board, so if you send a clock signal from point A to point B on an ideal signal line, point B won’t be able to tell that the clock has risen until, at the earliest, (1/20 ns/cm) * distance later.
Then again, PC boards are not exactly ideal transmission lines (ringing effect, drive strength, etc.). The concept of “Synchronous” breaks down when different parts of the system observe different clocks. Kind of like relativity.
Signal Propagation
Ideal Transmission Line (A to B): ~0.66c = 20 cm/ns
PC Board + Module Connectors + Varying Electrical Loads = Rather non-Ideal Transmission Line
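A quick way to get a feel for the 20 cm/ns figure is to turn a trace length into a delay and compare it with a clock period. The sketch below does that for an ideal line; the function names, the example trace length, and the example clock frequency are assumptions chosen for illustration.

```python
SIGNAL_SPEED_CM_PER_NS = 20.0   # ~0.66c on an ideal transmission line, from the slide


def propagation_delay_ns(distance_cm: float, speed_cm_per_ns: float = SIGNAL_SPEED_CM_PER_NS) -> float:
    """Earliest time after launch at which the far end can see an edge."""
    return distance_cm / speed_cm_per_ns


def fraction_of_clock_period(distance_cm: float, clock_mhz: float) -> float:
    """Propagation delay expressed as a fraction of one clock period."""
    period_ns = 1000.0 / clock_mhz
    return propagation_delay_ns(distance_cm) / period_ns


if __name__ == "__main__":
    # Hypothetical 20 cm trace and 100 MHz clock.
    print(propagation_delay_ns(20.0), "ns")
    print(fraction_of_clock_period(20.0, 100.0), "of a clock period")
```

A real board will do worse than this, for the reasons the slide lists: connectors and varying electrical loads make the line non-ideal.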
When we build a “synchronous system” on a PCB, how do we distribute the clock signal? Do we want a sliding time domain? Is an H-tree doable for N modules in parallel? What about skew compensation?
Clocking Issues
What Kind of Clocking System?
[Figure 1: Sliding Time. Clk SRC and chips 0th through Nth.]
[Figure 2: H Tree? Clk SRC and chips 0th through Nth.]
We would want the chips to be on a “global clock”, with everyone perfectly synchronous, but since clock signals are delivered through wires, different chips in the system will see the rising edge of a clock a little bit earlier or later than other chips.
While an H-tree may work for a low-frequency system, we really need one clock for sending (writing) signals from the controller to the chips, and another one for sending signals from the chips to the controller (reading).
Clocking Issues
We need different “clocks” for R/W
[Figure 1: Write Data. Clk SRC and chips 0th through Nth, with the signal direction marked.]
[Figure 2: Read Data. Clk SRC and chips 0th through Nth, with the signal direction marked.]
We purposefully routed path # to be a bit longer than path #1 to illustrate the point about signal path length differentials. As illustrated, signals will reach load B at a later time than load A simply because load B is farther away from the controller.
It is also difficult to do path length and impedance matching on a system board. Sometimes heroic efforts are needed to get a nice “parallel” bus.
Path Length Differential
[Figure: Controller driving Bus Signal 1 and Bus Signal 2 over paths of different lengths, through Intermodule Connectors, to loads A and B]
High Frequency AND Wide Parallel
Busses are Difficult to Implement
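The path length differential can be put into numbers with a small sketch: given the routed length of each bus line, it reports the spread in arrival times and how much of a clock period that spread eats. The function names, the trace lengths, and the clock frequency are hypothetical, and the 20 cm/ns speed assumes an ideal line.

```python
def bus_skew_ns(path_lengths_cm: list[float], speed_cm_per_ns: float = 20.0) -> float:
    """Arrival-time spread between the shortest and longest path of a parallel bus."""
    if not path_lengths_cm:
        raise ValueError("need at least one path length")
    return (max(path_lengths_cm) - min(path_lengths_cm)) / speed_cm_per_ns


def skew_fraction_of_period(path_lengths_cm: list[float], clock_mhz: float) -> float:
    """Skew as a fraction of the clock period; it grows as the clock gets faster."""
    period_ns = 1000.0 / clock_mhz
    return bus_skew_ns(path_lengths_cm) / period_ns


if __name__ == "__main__":
    lengths = [10.0, 12.0, 15.0]          # hypothetical trace lengths in cm
    print(bus_skew_ns(lengths), "ns of skew")
    for mhz in (100.0, 400.0):
        print(mhz, "MHz:", skew_fraction_of_period(lengths, mhz))
```

The same physical mismatch costs a larger share of the cycle at a higher clock rate, and a wider bus has more lines to match, which is why high frequency and wide parallel busses are hard to combine.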
A “System” is a hard thing to design, especially one that allows end users to perform configurations that will impact timing. To guarantee functional correctness of the system, all corner cases of variance in loading and timing must be accounted for.
Timing Variations
[Figure: one Controller driving 1 Load and another driving 4 Loads, with waveforms for Clock, Cmd to 1 Load, and Cmd to 4 Loads]
How many DIMMs in System?
How many devices on each DIMM?
Infinite variations on timing!
Who built the memory module?
To ensure that a lightly loaded system and a fully loaded system do not differ significantly in timing, we either send duplicate signals to different memory modules, or we keep a single signal line but drive the I/O pads with variable strength, depending on whether the system has 1, 2, 3, or 4 loads.
Loading Balance
[Figure: controllers illustrating the two options: Duplicate Signal Lines, and Variable Signal Drive Strength]
Self-explanatory: topology determines loading and signal propagation lengths.
[Figure: Controller connected to 16 DRAM chips]
Very simple topology. The clock signal that turns around is very nice. Solves problem of needing multiple clocks.
Single
Channel
SDRAM
Controller