Memory Systems in Computer Architecture: An Examination of DRAM and SRAM, Study notes of Computer Architecture and Organization

An in-depth examination of memory systems in computer architecture, focusing on DRAM and SRAM technologies: their organization, timing parameters, and performance. It also covers the memory hierarchy and the impact of memory performance on overall processor performance.

Typology: Study notes

Pre 2010

Uploaded on 02/25/2010

koofers-user-v8c
Adapted from D.A.Patterson, UCB
EEL 4768
Computer System Design 2
Lecture 8: Memory Systems

The Big Picture: Where are We Now?
The Five Classic Components of a Computer:
  Processor (Control, Datapath)
  Memory
  Input
  Output

Technology Trends (from 1st lecture)

         Capacity          Speed (latency)
Logic:   2x in 3 years     2x in 3 years
DRAM:    4x in 3 years     2x in 10 years
Disk:    4x in 3 years     2x in 10 years

DRAM
Year     Size       Cycle Time
1980     64 Kb      250 ns
1983     256 Kb     220 ns
1986     1 Mb       190 ns
1989     4 Mb       165 ns
1992     16 Mb      145 ns
1995     64 Mb      120 ns
2004     1 Gb       35 ns
1980 to 1995: Size 1000:1! Cycle Time 2:1!

Who Cares About the Memory Hierarchy?
[Figure: Processor-DRAM Memory Gap (latency). Performance (log scale, 1 to 1000) versus Time (1980 to 2000). µProc (“Moore’s Law”): 60%/yr. (2X/1.5 yr). DRAM: 9%/yr. (2X/10 yrs). Processor-Memory Performance Gap: grows 50% / year.]
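
The "grows 50% / year" figure for the gap can be checked from the two growth rates quoted in the plot. The following is a minimal sketch; the function name `years_to_double` and the variable names are illustrative and not part of the lecture.

```python
import math

# Per-year improvement rates quoted in the figure.
cpu_growth = 0.60
dram_growth = 0.09

# The gap is the ratio of the two performance curves, so its yearly growth
# is the ratio of the yearly growth factors.
gap_growth = (1 + cpu_growth) / (1 + dram_growth) - 1

def years_to_double(rate_per_year):
    """Years needed to double at a compound yearly rate."""
    return math.log(2) / math.log(1 + rate_per_year)

print(f"gap grows about {gap_growth:.0%} per year")
print(f"60%/yr doubles in about {years_to_double(cpu_growth):.1f} years")
```

The ratio works out to a little under 50% per year, which matches the rounded figure on the slide.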

Today’s Situation: Microprocessor
Rely on caches to bridge gap
Microprocessor-DRAM performance gap
  time of a full cache miss in instructions executed
  1st Alpha (7000): 340 ns/5.0 ns = 68 clks x 2 or 136 instructions
  2nd Alpha (8400): 266 ns/3.3 ns = 80 clks x 4 or 320 instructions
  3rd Alpha (t.b.d.): 180 ns/1.7 ns = 108 clks x 6 or 648 instructions
  1/2X latency x 3X clock rate x 3X Instr/clock => ~5X

Impact on Performance
Suppose a processor executes at
  Clock Rate = 200 MHz (5 ns per cycle)
  CPI = 1.1
  50% arith/logic, 30% ld/st, 20% control
Suppose that 10% of memory operations get 50 cycle miss penalty
CPI = ideal CPI + average stalls per instruction
    = 1.1 (cyc) + (0.30 (datamops/ins) x 0.10 (miss/datamop) x 50 (cycle/miss))
    = 1.1 cycle + 1.5 cycle
    = 2.6
58% of the time the processor is stalled waiting for memory!
A 1% instruction miss rate would add an additional 0.5 cycles to the CPI!
[Figure: pie chart of the CPI breakdown. Data Miss (1.6) 49%, Ideal CPI (1.1) 35%, Inst Miss (0.5) 16%.]
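
The CPI arithmetic on this slide can be reproduced in a few lines. This is a small sketch using only the numbers given above; the function and variable names are made up for illustration.

```python
def cpi_with_misses(ideal_cpi, mem_ops_per_instr, miss_rate, miss_penalty):
    """Return (total CPI, stall cycles per instruction) for data-access misses."""
    stalls = mem_ops_per_instr * miss_rate * miss_penalty
    return ideal_cpi + stalls, stalls

# Values from the slide: CPI 1.1, 30% loads/stores, 10% miss rate, 50-cycle penalty.
cpi, stalls = cpi_with_misses(1.1, 0.30, 0.10, 50)
stall_fraction = stalls / cpi
print(f"CPI = {cpi:.1f}, stalled {stall_fraction:.0%} of the time")

# Instruction fetches happen once per instruction, so a 1% instruction miss
# rate with the same 50-cycle penalty adds this many cycles per instruction.
inst_stalls = 1.0 * 0.01 * 50
print(f"extra CPI from instruction misses = {inst_stalls:.1f}")
```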

The Goal: illusion of large, fast, cheap memory

Fact: Large memories are slow, fast memories are small

How do we create a memory that is large, cheap and fast (most of the time)?

  Hierarchy

  Parallelism

An Expanded View of the Memory System

[Figure: the processor (control, datapath) connected to a chain of memory levels. Moving away from the processor, Speed goes from fastest to slowest, Size from smallest to biggest, and Cost from highest to lowest.]

Why hierarchy works

The Principle of Locality:

  Programs access a relatively small portion of the address space at any instant of time.

[Figure: probability of reference plotted against the address space, up to 2^n - 1.]

Memory Hierarchy: How Does it Work?

Temporal Locality (Locality in Time):

  => Keep most recently accessed data items closer to the processor

Spatial Locality (Locality in Space):

  => Move blocks consisting of contiguous words to the upper levels

[Figure: Upper Level Memory (holding Blk X) sits between the processor and Lower Level Memory (holding Blk Y); data moves to and from the processor through the upper level.]
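
To make the effect of locality concrete, here is a minimal sketch of a direct-mapped upper level that fetches whole blocks on a miss. The cache geometry (64 lines of 8 words), the three address traces, and all names are assumptions chosen for illustration, not taken from the lecture.

```python
import random

def hit_rate(addresses, num_blocks=64, block_words=8):
    """Hit rate of a small direct-mapped cache over a trace of word addresses."""
    tags = [None] * num_blocks            # block number currently held by each line
    hits = 0
    for addr in addresses:
        block = addr // block_words       # memory block containing this word
        line = block % num_blocks         # direct-mapped placement
        if tags[line] == block:
            hits += 1                     # found in the upper level
        else:
            tags[line] = block            # miss: bring in the whole block
    return hits / len(addresses)

sequential = list(range(4096))                    # spatial locality only
loop = [i % 256 for i in range(4096)]             # temporal and spatial locality
rng = random.Random(0)
scattered = [rng.randrange(1 << 20) for _ in range(4096)]   # little locality

for name, trace in (("sequential", sequential), ("loop", loop), ("scattered", scattered)):
    print(f"{name:10s} hit rate = {hit_rate(trace):.3f}")
```

In this toy model the sequential trace benefits only from block fetches, the loop also benefits from reuse, and the scattered trace gets almost no help from the upper level.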

Memory Hierarchy: Terminology

Hit: data appears in some block in the upper level (example: Block X)

  Hit Rate: the fraction of memory accesses found in the upper level

  Hit Time: Time to access the upper level, which consists of RAM access time + Time to determine hit/miss

Miss: data needs to be retrieved from a block in the lower level (Block Y)

  Miss Rate = 1 - (Hit Rate)

  Miss Penalty: Time to replace a block in the upper level + Time to deliver the block to the processor

Hit Time << Miss Penalty

[Figure: Upper Level Memory (Blk X) between the processor and Lower Level Memory (Blk Y).]
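
These terms combine into the usual average access time estimate: every access pays the hit time, and the fraction that misses also pays the miss penalty. The sketch below assumes a 1-cycle hit time and a 50-cycle miss penalty purely for illustration; the function name is hypothetical.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average time per access: hit time plus the expected miss cost."""
    return hit_time + miss_rate * miss_penalty

# Assumed values: 1-cycle hit time, 50-cycle miss penalty.
for hit_rate in (0.99, 0.95, 0.90):
    amat = average_access_time(1, 1 - hit_rate, 50)
    print(f"hit rate {hit_rate:.2f}: {amat:.2f} cycles per access")
```

Because Hit Time << Miss Penalty, a few percentage points of hit rate change the average noticeably.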

Memory Hierarchy of a Modern Computer System

By taking advantage of the principle of locality:

  Present the user with as much memory as is available in the cheapest technology.

  Provide access at the speed offered by the fastest technology.

[Figure: processor (control, datapath, registers, on-chip cache), second level cache (SRAM), main memory (DRAM), secondary storage (disk) and tertiary storage (disk), ordered from fastest and smallest to slowest and largest.]

Level                                         Speed (ns)                   Size (bytes)
Registers                                     1s                           100s
On-Chip Cache / Second Level Cache (SRAM)     10s                          Ks
Main Memory (DRAM)                            100s                         Ms
Secondary Storage (Disk)                      10,000,000s (10s ms)         Gs
Tertiary Storage (Disk)                       10,000,000,000s (10s sec)    Ts

Logic Diagram of a Typical SRAM

[Figure: 2^N words x M bit SRAM block with an N-bit address input A, an M-bit data pin D, and control inputs WE_L and OE_L.]

Write Enable is usually active low (WE_L)

Din and Dout are combined to save pins:

  A new control signal, output enable (OE_L), is needed

  WE_L is asserted (Low), OE_L is deasserted (High)

    D serves as the data input pin

  WE_L is deasserted (High), OE_L is asserted (Low)

    D is the data output pin

  Both WE_L and OE_L are asserted:

    Result is unknown. Don’t do that!!!

Although you could change the VHDL to do what you desire, you must do the best with what you’ve got (vs. what you need)
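
The control-signal rules above amount to a small truth table. A possible sketch follows, with 0 meaning asserted because both signals are active low. The function name is hypothetical, and the "high-Z" result for the case where neither signal is asserted is an assumption based on the High Z label in the SRAM timing diagram.

```python
def sram_data_pin(we_l: int, oe_l: int) -> str:
    """Role of the shared D pin for active-low WE_L and OE_L (0 = asserted)."""
    writing = we_l == 0
    reading = oe_l == 0
    if writing and reading:
        return "unknown"      # both asserted: the slide says not to do this
    if writing:
        return "input"        # D serves as the data input pin
    if reading:
        return "output"       # D is the data output pin
    return "high-Z"           # assumption: neither asserted leaves D undriven

for we_l in (0, 1):
    for oe_l in (0, 1):
        print(f"WE_L={we_l} OE_L={oe_l} -> D is {sram_data_pin(we_l, oe_l)}")
```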

Typical SRAM Timing

[Figure: 2^N words x M bit SRAM block (pins A, D, WE_L, OE_L) with its timing diagrams. Write Timing shows D (Data In), A (Write Address), OE_L and WE_L, annotated with Write Setup Time and Write Hold Time. Read Timing shows D moving from High Z through Junk to Data Out, one Read Access Time after each Read Address on A.]

Problems with SRAM

Six transistors use up a lot of area

[Figure: six-transistor SRAM cell with N and P transistors, drawn with Select = 1 and the two bit lines at bit = 1 and bit = 0; each transistor is marked On or Off.]

1-Transistor Memory Cell (DRAM)

Write:

  1. Drive bit line

  2. Select row

Read:

  1. Precharge bit line to Vdd

  2. Select row

  3. Cell and bit line share charges

     Very small voltage changes on the bit line

  4. Sense (fancy sense amp)

     Can detect changes of ~1 million electrons

  5. Write: restore the value

Refresh

  1. Just do a dummy read to every cell.

[Figure: 1-transistor cell connected to the row select line and the bit line.]
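
Read step 3 (cell and bit line share charges) explains why the voltage change is so small: the cell capacitor is much smaller than the bit line it is connected to. The sketch below applies charge conservation to the two capacitors. The supply voltage and both capacitance values are assumed for illustration and do not come from the lecture.

```python
def bitline_voltage(v_cell, v_precharge, c_cell, c_bitline):
    """Voltage after the cell capacitor and the precharged bit line share charge."""
    total_charge = c_cell * v_cell + c_bitline * v_precharge
    return total_charge / (c_cell + c_bitline)

VDD = 3.3             # assumed supply voltage, in volts
C_CELL = 30e-15       # assumed cell capacitance, in farads
C_BITLINE = 300e-15   # assumed bit line capacitance, in farads

for stored in (0.0, VDD):
    v = bitline_voltage(stored, VDD, C_CELL, C_BITLINE)
    print(f"cell at {stored:.1f} V -> bit line moves by {v - VDD:+.3f} V")
```

With these assumed values a stored 0 pulls the bit line down by only a fraction of a volt, which is why a sense amplifier is needed and why the value has to be written back afterwards.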

Classical DRAM Organization (square)

[Figure: square RAM Cell Array. The row address feeds the row decoder, which drives the word (row) select lines; the bit (data) lines run to the Column Selector & I/O Circuits, which take the Column Address and connect to data.]

Row and Column Address together:

  Select 1 bit at a time

Each intersection represents a 1-T DRAM Cell

DRAM logical organization (4 Mbit)

Square root of bits per RAS/CAS

[Figure: address pins A0…A feed the Memory Array (2,048 x 2,048) and the Column Decoder; a Word Line selects a row of Storage Cells, and the Sense Amps & I/O connect to data pins D and Q.]
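
"Square root of bits per RAS/CAS" means each of the two address phases selects along one side of the square array. A minimal sketch for the 4 Mbit part follows. Which half of the address is used as the row is an assumption, and `split_address` is a hypothetical helper.

```python
import math

TOTAL_BITS = 4 * 2**20     # 4 Mbit part from the slide

def split_address(addr, total_bits=TOTAL_BITS):
    """Split a bit address into (row, column) for a square DRAM array."""
    side = math.isqrt(total_bits)              # rows = columns
    if side * side != total_bits:
        raise ValueError("array is not square")
    if not 0 <= addr < total_bits:
        raise ValueError("address out of range")
    # Assumption: the high half of the address is the row, the low half the column.
    return divmod(addr, side)

side = math.isqrt(TOTAL_BITS)
bits_per_phase = (side - 1).bit_length()
print(f"{side} x {side} array, {bits_per_phase} address bits per phase")
print(split_address(5 * side + 7))
```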

Logic Diagram of a Typical DRAM

[Figure: 256K x 8 DRAM block with a 9-bit address input A, an 8-bit data pin D, and control inputs RAS_L, CAS_L, WE_L and OE_L.]

Control Signals (RAS_L, CAS_L, WE_L, OE_L) are all active low

Din and Dout are combined (D):

  WE_L is asserted (Low), OE_L is deasserted (High)

    D serves as the data input pin

  WE_L is deasserted (High), OE_L is asserted (Low)

    D is the data output pin

Row and column addresses share the same pins (A)

  RAS_L goes low: Pins A are latched in as row address

  CAS_L goes low: Pins A are latched in as column address

  RAS/CAS edge-sensitive

Key DRAM Timing Parameters

tRAC: minimum time from RAS line falling to the valid data output.

  Quoted as the speed of a DRAM

  A fast 4Mb DRAM tRAC = 60 ns

tRC: minimum time from the start of one row access to the start of the next.

  tRC = 110 ns for a 4Mbit DRAM with a tRAC of 60 ns

tCAC: minimum time from CAS line falling to valid data output.

  15 ns for a 4Mbit DRAM with a tRAC of 60 ns

tPC: minimum time from the start of one column access to the start of the next.

  35 ns for a 4Mbit DRAM with a tRAC of 60 ns

DRAM Performance

A 60 ns (tRAC) DRAM can

  perform a row access only every 110 ns (tRC)

  perform column access (tCAC) in 15 ns, but time between column accesses is at least 35 ns (tPC).

    In practice, external address delays and turning around buses make it 40 to 50 ns

These times do not include the time to drive the addresses off the microprocessor nor the memory controller overhead.

  Drive parallel DRAMs, external memory controller, bus to turn around, SIMM module, pins…

  180 ns to 250 ns latency from processor to memory is good for a “60 ns” (tRAC) DRAM
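
The difference between the row cycle time and the column cycle time is easiest to see for a burst of reads. The sketch below uses a deliberately simplified chip-level model: it ignores the controller and bus overheads mentioned above, and it assumes that consecutive column accesses within one open row are spaced tPC apart. The function names are illustrative.

```python
T_RAC, T_RC, T_CAC, T_PC = 60, 110, 15, 35   # ns, values from the slides

def million_accesses_per_second(period_ns):
    """Peak access rate if one access can start every period_ns."""
    return 1e3 / period_ns

def same_row_read_time(n):
    """n reads from one row: first data at tRAC, then one column access per tPC."""
    return T_RAC + (n - 1) * T_PC

def new_row_read_time(n):
    """n reads that each open a new row: one row cycle apart, last data after tRAC."""
    return (n - 1) * T_RC + T_RAC

print(f"row accesses:    {million_accesses_per_second(T_RC):.2f} M/s")
print(f"column accesses: {million_accesses_per_second(T_PC):.2f} M/s")
print(f"4 reads, same row: {same_row_read_time(4)} ns")
print(f"4 reads, new rows: {new_row_read_time(4)} ns")
```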

Main Memory Performance

Simple:

  CPU, Cache, Bus, Memory same width (32 bits)

Interleaved:

  CPU, Cache, Bus 1 word: Memory N Modules (4 Modules); example is word interleaved

Wide:

  CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits)

Cycle Time versus Access Time

DRAM (Read/Write) Cycle Time >> DRAM (Read/Write) Access Time

  ~2:1; why?

DRAM (Read/Write) Cycle Time:

  How frequently can you initiate an access?

  Analogy: A little kid can only ask his father for money on Saturday

DRAM (Read/Write) Access Time:

  How quickly will you get what you want once you initiate an access?

  Analogy: As soon as he asks, his father will give him the money

DRAM Bandwidth Limitation analogy:

  What happens if he runs out of money on Wednesday?

[Figure: timeline showing Access Time as the first part of the longer Cycle Time.]

Increasing Bandwidth - Interleaving

Access Pattern without Interleaving:

[Figure: CPU and one Memory. Start Access for D1; D1 available; the access for the next item starts only after the full cycle time.]

Access Pattern with 4-way Interleaving:

[Figure: CPU connected to Memory Bank 0, Memory Bank 1, Memory Bank 2 and Memory Bank 3. Access Bank 0, Access Bank 1, Access Bank 2 and Access Bank 3 are started one after another; then we can access Bank 0 again.]
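
The two access patterns can be compared with a small timing model. It is only a sketch under stated assumptions: word i lives in bank i mod N (word interleaving), each bank needs the 60 ns access time and 110 ns cycle time quoted earlier for the 4 Mbit DRAM, and the CPU can start a new access every 27.5 ns (an assumed value, one quarter of the cycle time). The names are illustrative.

```python
ACCESS_NS, CYCLE_NS = 60, 110   # tRAC and tRC quoted for the 4 Mbit DRAM

def finish_times(n_words, n_banks, issue_ns):
    """Times at which sequential words become available with word interleaving."""
    bank_free = [0.0] * n_banks          # earliest time each bank can start again
    next_issue = 0.0                     # earliest time the CPU can start an access
    done = []
    for i in range(n_words):
        bank = i % n_banks               # word-interleaved bank selection
        start = max(next_issue, bank_free[bank])
        bank_free[bank] = start + CYCLE_NS
        done.append(start + ACCESS_NS)
        next_issue = start + issue_ns
    return done

ISSUE_NS = 27.5                          # assumed spacing between access starts
print("1 bank: ", finish_times(4, 1, ISSUE_NS))
print("4 banks:", finish_times(4, 4, ISSUE_NS))
```

With a single bank every access waits out the full cycle time; with four banks the accesses overlap, and Bank 0 is ready again by the time the fifth word is requested.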

DRAM History

DRAMs: capacity +60%/yr, cost -30%/yr

  2.5X cells/area, 1.5X die size in ~3 years

DRAM fab line costs $1B to $2B

  DRAM only: density, leakage v. speed

Rely on increasing no. of computers & memory per computer (60% market)

  SIMM or DIMM is replaceable unit => computers use any generation DRAM

Commodity, second source industry => high volume, low profit, conservative

  Little organization innovation in 20 years: page mode, EDO, Synch DRAM

Order of importance: 1) Cost/bit 1a) Capacity

  RAMBUS: 10X BW, +30% cost => little impact

Today’s Situation: DRAM

Commodity, second source industry => high volume, low profit, conservative

  Little organization innovation (vs. processors) in 20 years: page mode, EDO, Synch DRAM

DRAM industry at a crossroads:

  Fewer DRAMs per computer over time

    Growth bits/chip DRAM: 50%-60%/yr

    Nathan Myhrvold M/S: mature software growth (33%/yr for NT) ~ growth MB/$ of DRAM (25%-30%/yr)

  Starting to question buying larger DRAMs?

DRAM Revenue per Quarter

[Figure: bar chart of DRAM revenue per quarter (millions of dollars) over thirteen quarters ending 1Q 97; labeled values $16B and $7B.]

Intel: 30%/year since 1987; 1/3 income profit

Summary:

Two Different Types of Locality:

  Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon.

  Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon.

By taking advantage of the principle of locality:

  Present the user with as much memory as is available in the cheapest technology.

  Provide access at the speed offered by the fastest technology.

DRAM is slow but cheap and dense:

  Good choice for presenting the user with a BIG memory system

SRAM is fast but expensive and not very dense:

  Good choice for providing the user FAST access time.

Summary: Processor-Memory Performance Gap “Tax”

Processor        % Area (~cost)    % Transistors (~power)
Alpha 21164
StrongArm SA
Pentium Pro

  Pentium Pro: 2 dies per package: Proc/I$/D$ + L2$

Caches have no inherent value, only try to close performance gap