Memory Systems in Computer Architecture: An Examination of DRAM and SRAM, Study notes of Computer Architecture and Organization

University of Central Florida (UCF), Computer Architecture and Organization

An in-depth examination of memory systems in computer architecture, focusing on DRAM and SRAM technologies, their organization, timing parameters, and performance. It also covers the memory hierarchy and the impact of memory performance on overall processor performance.

Typology: Study notes

Pre 2010

Uploaded on 02/25/2010

koofers-user-v8c 🇺🇸


Adapted from D.A.Patterson, UCB

EEL 4768

Computer System Design 2

Lecture 8: Memory Systems

Adapted from D.A.Patterson, UCB

°The Five Classic Components of a Computer

The Big Picture: Where are We Now?

Control

Datapath

Memory

Processor

Input

Output

Adapted from D.A.Patterson, UCB

Technology Trends (from 1st lecture)

DRAM

Year Size Cycle Time

1980 64 Kb 250 ns

1983 256 Kb 220 ns

1986 1 Mb 190 ns

1989 4 Mb 165 ns

1992 16 Mb 145 ns

1995 64 Mb 120 ns

2004 1 Gb 35 ns

Capacity Speed (latency)

Logic:2x in 3 years 2x in 3 years

DRAM: 4x in 3 years 2x in 10 years

Disk: 4x in 3 years 2x i n 10 years

1000:1! 2:1!

Adapted from D.A.Patterson, UCB

Who Cares About the Memory Hierarchy?

µProc

60%/yr.

(2X/1.5yr)

DRAM

9%/yr.

(2X/10 yrs)

1

10

100

1000

1980

1981

1983

1984

1985

1986

1987

1988

1989

1990

1991

1992

1993

1994

1995

1996

1997

1998

1999

2000

DRAM

CPU

1982

Processor-Memory

Performance Gap:

(grows 50% / year)

Performance

Time

“Moore’s Law”

Processor-DRAM Memory Gap (latency)

Adapted from D.A.Patterson, UCB

Today’s Situation: Microprocessor

°Rely on caches to bridge gap

°Microprocessor-DRAM performance gap

•time of a full cache miss in instructions executed

1st Alpha (7000): 340 ns/5.0 ns = 68 clksx 2 or 136 instructions

2nd Alpha (8400): 266 ns/3.3 ns = 80 clksx 4 or 320 instructions

3rd Alpha (t.b.d.): 180 ns/1.7 ns =108 clks x 6 or 648 instructions

•1/2X latency x 3X clock rate x 3XInstr/clock ⇒-5X
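The per-machine arithmetic above can be sketched in a few lines. The function name `miss_cost` and the choice to round the clock count down are assumptions made for illustration; the latencies, cycle times, and issue widths are the ones listed on the slide. The third Alpha is left out because the slide's 108 clocks does not follow exactly from the rounded 1.7 ns cycle time.

```python
import math

def miss_cost(miss_latency_ns, cycle_time_ns, instr_per_clock):
    """Return (clocks, instructions) that fit in one full cache miss."""
    # Rounding down to whole clocks is an assumption of this sketch.
    clocks = math.floor(miss_latency_ns / cycle_time_ns)
    return clocks, clocks * instr_per_clock

# (name, miss latency in ns, cycle time in ns, instructions per clock)
machines = [
    ("1st Alpha (7000)", 340, 5.0, 2),
    ("2nd Alpha (8400)", 266, 3.3, 4),
]
for name, latency, cycle, width in machines:
    clocks, instructions = miss_cost(latency, cycle, width)
    print(f"{name}: {clocks} clks, {instructions} instructions")
```

The point of the calculation is that the cost of a miss, measured in lost instruction slots, grows even while the miss latency in nanoseconds shrinks.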

Adapted from D.A.Patterson, UCB

Impact on Performance

°Suppose a processor executes at

•Clock Rate = 200 MHz (5 ns per cycle)

•CPI = 1.1

•50% arith/logic, 30% ld/st, 20% control

°Suppose that 10% of memory

operations get 50 cycle

miss penalty

°CPI = ideal CPI + average stalls per instruction

= 1.1(cyc) +( 0.30 (datamops/ins)

x 0.10 (miss/datamop) x 50 (cycle/miss) )

= 1.1 cycle + 1.5 cycle

= 2. 6

°58 % of the time the processor

is stalled waiting for memory!

°a 1% instruction miss rate would add

an additional 0.5 cycles to the CPI!

DataMiss

(1.6)

49%

Ideal CPI

(1.1)

35%

Inst Miss

(0.5)

16%
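The CPI calculation on this slide can be checked with a short sketch. The function and variable names are illustrative assumptions; the numbers (ideal CPI of 1.1, 30% loads/stores, 10% miss rate, 50-cycle penalty, 1% instruction miss rate) come from the slide.

```python
def cpi_with_data_misses(ideal_cpi, data_ops_per_instr, miss_rate, miss_penalty):
    """Return (total CPI, stall cycles per instruction)."""
    stalls = data_ops_per_instr * miss_rate * miss_penalty
    return ideal_cpi + stalls, stalls

total_cpi, stalls = cpi_with_data_misses(1.1, 0.30, 0.10, 50)
print(f"CPI = {total_cpi:.1f}")
print(f"stalled {stalls / total_cpi:.0%} of the time")

# Every instruction is fetched once, so a 1% instruction miss rate adds
# 1.0 fetch/instruction x 0.01 miss/fetch x 50 cycles/miss.
instr_stalls = 1.0 * 0.01 * 50
print(f"extra CPI from instruction misses = {instr_stalls:.1f}")
```

Adding the instruction-miss term gives a total of about 3.1 cycles per instruction, which is the total that the pie-chart percentages are taken from.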


The Goal: illusion of large, fast, cheap memory

° Fact: Large memories are slow, fast memories are small
° How do we create a memory that is large, cheap and fast (most of the time)?
   • Hierarchy
   • Parallelism

An Expanded View of the Memory System

[Figure: the Processor (Control and Datapath) connected to a chain of Memory levels. Moving away from the processor:]
Speed: Fastest to Slowest
Size: Smallest to Biggest
Cost: Highest to Lowest

Why hierarchy works

° The Principle of Locality:
   • Programs access a relatively small portion of the address space at any instant of time.

[Figure: probability of reference plotted across the address space, up to 2^n - 1]



Memory Hierarchy: How Does it Work?

° Temporal Locality (Locality in Time):
   => Keep most recently accessed data items closer to the processor
° Spatial Locality (Locality in Space):
   => Move blocks consisting of contiguous words to the upper levels

[Figure: Upper Level Memory, holding Blk X, exchanges data to and from the processor; Lower Level Memory holds Blk Y]
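A small simulation can make the two kinds of locality concrete. This is only a sketch: the slides do not say how blocks are placed in the upper level, so the direct-mapped placement, the 64-block size, the 4-word blocks, and the three access patterns are all assumptions chosen for illustration.

```python
def hit_rate(addresses, num_blocks=64, words_per_block=4):
    """Simulate a direct-mapped upper level and return its hit rate."""
    slots = [None] * num_blocks          # block number held in each slot
    hits = 0
    for addr in addresses:
        block = addr // words_per_block  # contiguous words share a block
        slot = block % num_blocks
        if slots[slot] == block:
            hits += 1
        else:
            slots[slot] = block          # miss: bring the block up a level
    return hits / len(addresses)

sequential = list(range(1024))              # spatial locality only
loop = list(range(16)) * 64                 # temporal locality: 16 words reused
strided = [i * 256 for i in range(1024)]    # neither: every access a new block

for name, trace in [("sequential", sequential), ("loop", loop), ("strided", strided)]:
    print(f"{name}: hit rate {hit_rate(trace):.3f}")
```

The sequential trace hits on three of every four words because each miss brings in a whole block, the loop hits on nearly everything after the first pass, and the strided trace never hits.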

Memory Hierarchy: Terminology

° Hit: data appears in some block in the upper level (example: Block X)
   • Hit Rate: the fraction of memory accesses found in the upper level
   • Hit Time: Time to access the upper level, which consists of RAM access time + Time to determine hit/miss
° Miss: data needs to be retrieved from a block in the lower level (Block Y)
   • Miss Rate = 1 - (Hit Rate)
   • Miss Penalty: Time to replace a block in the upper level + Time to deliver the block to the processor
° Hit Time << Miss Penalty

[Figure: Upper Level Memory, holding Blk X, exchanges data to and from the processor; Lower Level Memory holds Blk Y]
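These terms are commonly combined into an average access time, Hit Time + Miss Rate x Miss Penalty. That formula is not stated on the slide, so the sketch below is an illustration under that assumption. The 10% miss rate and 50-cycle penalty echo the "Impact on Performance" slide; the 1-cycle hit time is a hypothetical value.

```python
def average_access_time(hit_time, hit_rate, miss_penalty):
    """Average time per access, in the same units as the inputs."""
    miss_rate = 1.0 - hit_rate           # Miss Rate = 1 - (Hit Rate)
    return hit_time + miss_rate * miss_penalty

# Hypothetical 1-cycle hit time, 90% hit rate, 50-cycle miss penalty.
print(f"{average_access_time(1, 0.90, 50):.1f} cycles on average")
# Raising the hit rate to 99% removes most of the penalty.
print(f"{average_access_time(1, 0.99, 50):.1f} cycles on average")
```

Because Hit Time << Miss Penalty, even a small miss rate dominates the average, which is why the hit rate matters so much.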

Memory Hierarchy of a Modern Computer System

° By taking advantage of the principle of locality:
   • Present the user with as much memory as is available in the cheapest technology.
   • Provide access at the speed offered by the fastest technology.

[Figure: levels from the processor outward: Processor (Control, Datapath, Registers, On-Chip Cache), Second Level Cache (SRAM), Main Memory (DRAM), Secondary Storage (Disk), Tertiary Storage (Disk)]
Speed (ns), fastest to slowest: 1s, 10s, 100s, 10,000,000s (10s ms), 10,000,000,000s (10s sec)
Size (bytes), smallest to largest: Ks, Ms, Gs, Ts



Logic Diagram of a Typical SRAM

° Write Enable is usually active low (WE_L)
° Din and Dout are combined to save pins:
   • A new control signal, output enable (OE_L), is needed
   • WE_L is asserted (Low), OE_L is deasserted (High)
      - D serves as the data input pin
   • WE_L is deasserted (High), OE_L is asserted (Low)
      - D is the data output pin
   • Both WE_L and OE_L are asserted:
      - Result is unknown. Don’t do that!!!
° Although you could change the VHDL to do what you desire, you must do the best with what you’ve got (vs. what you need)

[Figure: SRAM block labeled "N words x M bit SRAM" with address input A (N), data pin D (M), and control inputs WE_L and OE_L]
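The WE_L/OE_L rules above can be sketched as a small behavioral model. Everything here is illustrative rather than taken from the lecture: the class name, the assumption that an n-bit address selects one of 2**n words, and the decision to raise an exception for the "both asserted" case that the slide leaves undefined.

```python
class SRAM:
    """Behavioral sketch of an SRAM with a shared data pin D."""

    def __init__(self, addr_bits, data_bits):
        self.words = [0] * (1 << addr_bits)   # assumes 2**addr_bits words
        self.mask = (1 << data_bits) - 1

    def access(self, addr, we_l, oe_l, d_in=0):
        """Return the value the SRAM drives on D, or None if it drives nothing."""
        if we_l == 0 and oe_l == 0:
            # The slide only says the result is unknown; raising is a
            # modeling choice of this sketch.
            raise ValueError("WE_L and OE_L both asserted")
        if we_l == 0:                  # write: D serves as the data input pin
            self.words[addr] = d_in & self.mask
            return None
        if oe_l == 0:                  # read: D is the data output pin
            return self.words[addr]
        return None                    # neither asserted: D is not driven

ram = SRAM(addr_bits=4, data_bits=8)
ram.access(3, we_l=0, oe_l=1, d_in=0xAB)       # write 0xAB to word 3
print(hex(ram.access(3, we_l=1, oe_l=0)))      # read it back
```

Signals are passed as 0 (low) and 1 (high), so an active-low control is asserted when its value is 0.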

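Typical SRAM Timing

[Figure: SRAM block diagram (A, D, OE_L, WE_L; "N words x M bit SRAM") with two timing diagrams.
Write Timing: the Write Address is on A and Data In is on D while WE_L is pulsed; the Write Setup Time and Write Hold Time are marked around the pulse.
Read Timing: D is High Z until OE_L is asserted; for each Read Address on A, D carries Junk until the Read Access Time has elapsed, and then Data Out is valid.]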

Problems with SRAM

° Six transistors use up a lot of area

[Figure: six-transistor SRAM cell with Select = 1 and bit lines carrying bit = 1 and bit = 0; the N and P transistors are marked On or Off]



1-Transistor Memory Cell (DRAM)

° Write:
   1. Drive bit line
   2. Select row
° Read:
   1. Precharge bit line to Vdd
   2. Select row
   3. Cell and bit line share charges
      • Very small voltage changes on the bit line
   4. Sense (fancy sense amp)
      • Can detect changes of ~1 million electrons
   5. Write: restore the value
° Refresh
   1. Just do a dummy read to every cell.

[Figure: one-transistor cell connected to the row select line and the bit line]

Classical DRAM Organization (square)

[Figure: square RAM Cell Array. The row address feeds a row decoder that drives the word (row) select lines; the bit (data) lines feed the Column Selector & I/O Circuits, which take the Column Address and connect to data]

° Row and Column Address together:
   • Select 1 bit at a time
° Each intersection represents a 1-T DRAM Cell

DRAM logical organization (4 Mbit)

° Square root of bits per RAS/CAS

[Figure: address pins A0…A feed the Memory Array (2,048 x 2,048) and the Column Decoder; Sense Amps & I/O connect the array to data pins D and Q; a Word Line selects a Storage Cell]
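"Square root of bits per RAS/CAS" means that each of the two address phases carries half of the address bits. A minimal sketch for the 2,048 x 2,048 array follows; the function name and the choice to put the row in the high-order bits are assumptions, since the slide does not say which half is which.

```python
ROWS = COLS = 2048                       # the 2,048 x 2,048 array on the slide
ROW_BITS = (ROWS - 1).bit_length()       # bits sent with RAS
COL_BITS = (COLS - 1).bit_length()       # bits sent with CAS

def split_address(cell_address):
    """Split a cell address into (row, column).

    Placing the row in the high-order bits is an assumption of this sketch.
    """
    if not 0 <= cell_address < ROWS * COLS:
        raise ValueError("address outside the 4 Mbit array")
    return cell_address >> COL_BITS, cell_address & (COLS - 1)

print(ROW_BITS, COL_BITS)                # address bits per phase
print(split_address(2048 * 5 + 7))       # row 5, column 7
```

A 4 Mbit part has 2^22 cells, so 22 address bits are needed in total and 11 pins suffice when rows and columns are sent in turn.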



Logic Diagram of a Typical DRAM

[Figure: 256K x 8 DRAM block with address pins A (9), data pins D (8), and control inputs RAS_L, CAS_L, WE_L, and OE_L]

° Control Signals (RAS_L, CAS_L, WE_L, OE_L) are all active low
° Din and Dout are combined (D):
   • WE_L is asserted (Low), OE_L is deasserted (High)
      - D serves as the data input pin
   • WE_L is deasserted (High), OE_L is asserted (Low)
      - D is the data output pin
° Row and column addresses share the same pins (A)
   • RAS_L goes low: Pins A are latched in as row address
   • CAS_L goes low: Pins A are latched in as column address
   • RAS/CAS edge-sensitive
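The shared address pins can be illustrated with a toy model of the 256K x 8 part, in which 18 address bits arrive over 9 pins in two steps. The class and method names are hypothetical, and the model ignores timing entirely; it only shows the latching order described above.

```python
class MultiplexedDRAM:
    """Toy model of a 256K x 8 DRAM with row/column address multiplexing."""
    PINS = 9                                   # address pins A

    def __init__(self):
        self.cells = bytearray(1 << (2 * self.PINS))   # 256K bytes
        self.row = 0

    def ras_falls(self, a):
        """RAS_L goes low: pins A are latched in as the row address."""
        self.row = a & ((1 << self.PINS) - 1)

    def cas_falls(self, a, we_l, oe_l, d=0):
        """CAS_L goes low: pins A are latched in as the column address."""
        col = a & ((1 << self.PINS) - 1)
        index = (self.row << self.PINS) | col
        if we_l == 0 and oe_l == 1:            # write: D is the data input pin
            self.cells[index] = d & 0xFF
            return None
        if we_l == 1 and oe_l == 0:            # read: D is the data output pin
            return self.cells[index]
        return None                            # other combinations not modeled

dram = MultiplexedDRAM()
dram.ras_falls(5)
dram.cas_falls(7, we_l=0, oe_l=1, d=0x5A)      # write at row 5, column 7
dram.ras_falls(5)
print(hex(dram.cas_falls(7, we_l=1, oe_l=0)))  # read the same location
```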

Key DRAM Timing Parameters

° t_RAC: minimum time from RAS line falling to the valid data output.
   • Quoted as the speed of a DRAM
   • A fast 4Mb DRAM t_RAC = 60 ns
° t_RC: minimum time from the start of one row access to the start of the next.
   • t_RC = 110 ns for a 4Mbit DRAM with a t_RAC of 60 ns
° t_CAC: minimum time from CAS line falling to valid data output.
   • 15 ns for a 4Mbit DRAM with a t_RAC of 60 ns
° t_PC: minimum time from the start of one column access to the start of the next.
   • 35 ns for a 4Mbit DRAM with a t_RAC of 60 ns

DRAM Performance

° A 60 ns (t_RAC) DRAM can
   • perform a row access only every 110 ns (t_RC)
   • perform column access (t_CAC) in 15 ns, but time between column accesses is at least 35 ns (t_PC).
      - In practice, external address delays and turning around buses make it 40 to 50 ns
° These times do not include the time to drive the addresses off the microprocessor nor the memory controller overhead.
   • Drive parallel DRAMs, external memory controller, bus to turn around, SIMM module, pins…
   • 180 ns to 250 ns latency from processor to memory is good for a “60 ns” (t_RAC) DRAM
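A rough sketch of what these parameters imply for reading several items in a row. The two timing models below are simplifying assumptions, not formulas from the lecture: accesses to different rows are spaced by the row cycle time, while accesses within one open row are spaced by the column cycle time after the first. The constants are the 4 Mbit values quoted on the slides, and controller and bus overheads are ignored.

```python
T_RAC, T_RC, T_CAC, T_PC = 60, 110, 15, 35     # ns, from the slides

def separate_rows_read_time(n):
    """Time (ns) until the n-th datum when every access opens a new row."""
    return (n - 1) * T_RC + T_RAC

def same_row_read_time(n):
    """Time (ns) until the n-th datum when all accesses hit one open row."""
    return T_RAC + (n - 1) * T_PC

for n in (1, 4, 8):
    print(n, separate_rows_read_time(n), same_row_read_time(n))
```

Under these assumptions, staying within one row cuts the time for four reads from 390 ns to 165 ns, which is the motivation for the page-mode style of access.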



Main Memory Performance

° Simple:
   • CPU, Cache, Bus, Memory same width (32 bits)
° Wide:
   • CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits)
° Interleaved:
   • CPU, Cache, Bus 1 word; Memory N Modules (4 Modules); example is word interleaved

Cycle Time versus Access Time

[Figure: timeline showing the Access Time as the first part of the longer Cycle Time]

° DRAM (Read/Write) Cycle Time >> DRAM (Read/Write) Access Time
   • ≈ 2:1; why?
° DRAM (Read/Write) Cycle Time:
   • How frequently can you initiate an access?
   • Analogy: A little kid can only ask his father for money on Saturday
° DRAM (Read/Write) Access Time:
   • How quickly will you get what you want once you initiate an access?
   • Analogy: As soon as he asks, his father will give him the money
° DRAM Bandwidth Limitation analogy:
   • What happens if he runs out of money on Wednesday?

Increasing Bandwidth - Interleaving

° Access Pattern without Interleaving:
[Figure: CPU and a single Memory. Start Access for D1; D1 available; Start Access for D2]
° Access Pattern with 4-way Interleaving:
[Figure: CPU connected to Memory Bank 0, Memory Bank 1, Memory Bank 2, and Memory Bank 3. Access Bank 0, Access Bank 1, Access Bank 2, and Access Bank 3 in successive slots; then we can access Bank 0 again]
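The benefit of interleaving can be estimated with a small timing sketch. The model is an assumption made for illustration: words are assigned to banks round-robin (word interleaving), each bank is busy for one cycle time after an access starts, and the CPU can start at most one access per `issue_time`. The 60 ns access time and 110 ns cycle time are borrowed from the DRAM timing slides; the 30 ns issue time is a hypothetical value.

```python
def time_to_read(n_words, n_banks, access_time, cycle_time, issue_time):
    """Return the time at which the last of n_words sequential words arrives."""
    bank_free = [0] * n_banks        # earliest time each bank can start again
    next_issue = 0                   # earliest time the CPU can start an access
    done = 0
    for word in range(n_words):
        bank = word % n_banks        # word interleaving across banks
        start = max(next_issue, bank_free[bank])
        bank_free[bank] = start + cycle_time
        next_issue = start + issue_time
        done = start + access_time
    return done

print(time_to_read(4, 1, 60, 110, 30))   # one bank: each access waits a full cycle
print(time_to_read(4, 4, 60, 110, 30))   # four banks: accesses overlap
```

With four banks, by the time the CPU returns to Bank 0 its cycle time has nearly elapsed, so the banks hide each other's recovery time.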



DRAM History

° DRAMs: capacity +60%/yr, cost –30%/yr
   • 2.5X cells/area, 1.5X die size in ≈3 years
° DRAM fab line costs $1B to $2B
   • DRAM only: density, leakage v. speed
° Rely on increasing no. of computers & memory per computer (60% market)
   • SIMM or DIMM is replaceable unit => computers use any generation DRAM
° Commodity, second source industry => high volume, low profit, conservative
   • Little organization innovation in 20 years: page mode, EDO, Synch DRAM
° Order of importance: 1) Cost/bit 1a) Capacity
   • RAMBUS: 10X BW, +30% cost => little impact

Today’s Situation: DRAM

° Commodity, second source industry => high volume, low profit, conservative
   • Little organization innovation (vs. processors) in 20 years: page mode, EDO, Synch DRAM
° DRAM industry at a crossroads:
   • Fewer DRAMs per computer over time
      - Growth bits/chip DRAM: 50%-60%/yr
      - Nathan Myhrvold (Microsoft): mature software growth (33%/yr for NT) ≈ growth MB/$ of DRAM (25%-30%/yr)
   • Starting to question buying larger DRAMs?

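DRAM Revenue per Quarter

[Figure: bar chart of DRAM revenue per quarter (in millions) for thirteen consecutive quarters, 1Q through 4Q of three successive years and then 1Q 97; the values $16B and $7B are labeled on the chart]

• Intel: 30%/year since 1987; 1/3 income profit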



Summary:

° Two Different Types of Locality:
   • Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon.
   • Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon.
° By taking advantage of the principle of locality:
   • Present the user with as much memory as is available in the cheapest technology.
   • Provide access at the speed offered by the fastest technology.
° DRAM is slow but cheap and dense:
   • Good choice for presenting the user with a BIG memory system
° SRAM is fast but expensive and not very dense:
   • Good choice for providing the user FAST access time.

Summary: Processor-Memory Performance Gap “Tax”

Processor       % Area (≈cost)    % Transistors (≈power)
Alpha 21164
StrongArm SA
Pentium Pro
   • 2 dies per package: Proc/I$/D$ + L2$
