Mainstream Computer System Components, Summaries of Performance Evaluation

Double Data Rate (DDR) SDRAM. One channel = 8 bytes = 64 bits wide. Current DDR3 SDRAM Example: PC3-12800 (DDR3-1600). 200 MHz (internal ...

Typology: Summaries

2022/2023

Uploaded on 03/01/2023

aeinstein
CMPE550 - Shaaban
lec # 10   Fall 2017   11-14-2017
Mainstream Computer System Components

CPU: one core or multi-core (2-8) per chip
  CPU Core: 2 GHz - 3.5 GHz, 4-way superscalar (RISC or RISC-core (x86)):
    Dynamic scheduling, hardware speculation
    Multiple FP and integer FUs, dynamic branch prediction …

Caches (SRAM): L1, L2, L3; all non-blocking caches
  L1: 16-128K, 2-8 way set associative (usually separate/split)
  L2: 256K-4M, 8-16 way set associative (unified)
  L3: 4-24M, 16-64 way set associative (unified)

System Bus = CPU-Memory Bus = Front Side Bus (FSB)
  Examples: AMD K8: HyperTransport
            Alpha, AMD K7: EV6, 200-400 MHz
            Intel PII, PIII: GTL+ 133 MHz
            Intel P4: 800 MHz

Chipset (AKA System Core Logic): North Bridge, South Bridge, Memory Controller
  Off or on-chip

System Memory (DRAM), on the Memory Bus:
  Double Data Rate (DDR) SDRAM. One channel = 8 bytes = 64 bits wide.
  Current DDR3 SDRAM Example: PC3-12800 (DDR3-1600)
    200 MHz (internal base chip clock)
    8-way interleaved (8 banks)
    ~12.8 GBYTES/SEC (peak) (one 64-bit channel)
    ~25.6 GBYTES/SEC (peak) (two 64-bit channels, e.g. AMD x4, x6)
    ~38.4 GBYTES/SEC (peak) (three 64-bit channels, e.g. Intel Core i7)
  PC2-6400 (DDR2-800)
    200 MHz (internal base chip clock)
    64-128 bits wide
    4-way interleaved (4 banks)
    ~6.4 GBYTES/SEC (peak) (one 64-bit channel)
    ~12.8 GBYTES/SEC (peak) (two 64-bit channels)
  DDR SDRAM Example: PC3200 (DDR-400)
    200 MHz (base chip clock)
    4-way interleaved (4 banks)
    ~3.2 GBYTES/SEC (peak) (one 64-bit channel)
    ~6.4 GBYTES/SEC (two 64-bit channels)
  Single Data Rate SDRAM: PC100/PC133
    100-133 MHz (base chip clock)
    64-128 bits wide
    2-way interleaved (2 banks)
    ~900 MBYTES/SEC peak (64-bit)

I/O Buses:
  Example: PCI, 33-66 MHz, 32-64 bits wide, 133-528 MBYTES/SEC
  PCI-X, 133 MHz, 64 bit, 1024 MBYTES/SEC

I/O Devices: Controllers/adapters, Disks, Displays, Keyboards, Networks (NICs)

I/O Subsystem: 4th Edition in Chapter 6 (3rd Edition in Chapter 7)


The Memory Hierarchy

• Review of Memory Hierarchy & Cache Basics (from 350):
  - Cache Basics
  - CPU Performance Evaluation with Cache
  - Classification of Steady-State Cache Misses: The Three C's of Cache Misses
  - Cache Write Policies/Performance Evaluation
  - Cache Write Miss Policies
  - Multi-Level Caches & Performance
• Main Memory:
  - Performance Metrics: Latency & Bandwidth
  - Key DRAM Timing Parameters
  - DRAM System Memory Generations
  - Basic Memory Bandwidth Improvement/Miss Penalty Reduction Techniques
• Techniques To Improve Cache Performance:
  - Reduce Miss Rate
  - Reduce Cache Miss Penalty
  - Reduce Cache Hit Time
• Virtual Memory:
  - Benefits, Issues/Strategies
  - Basic Virtual-to-Physical Address Translation: Page Tables
  - Speeding Up Address Translation: Translation Lookaside Buffer (TLB)

Cache exploits access locality to:
  1 - Lower AMAT by hiding long main memory access latency (i.e. memory latency reduction).
  2 - Lower demands on main memory bandwidth.

(4th Edition: Chapter 5. 3rd Edition: Chapter 5.8, 5.)


Main Memory

• Main (or system) memory generally utilizes Dynamic RAM (DRAM), which uses a single transistor to store a bit but requires a periodic data refresh by reading every row, increasing cycle time. (DRAM: slow but high density)
• Static RAM may be used for main memory if the added expense, low density, high power consumption, and complexity are feasible (e.g. Cray Vector Supercomputers). (SRAM: fast but low density)
• Main memory performance is affected by:
  - Memory latency or delay: affects cache miss penalty, M. Measured by:
      Memory access time: the time between a memory access request being issued to main memory and the requested information becoming available to cache/CPU.
      Memory cycle time: the minimum time between requests to memory (greater than access time in DRAM to allow address lines to be stable).
  - Peak or nominal (ideal?) memory bandwidth: the maximum sustained data transfer rate between main memory and cache/CPU.
      In current memory technologies (e.g. Double Data Rate SDRAM), published peak memory bandwidth does not take into account most of the memory access latency.
      This leads to achievable realistic memory bandwidth (or maximum effective memory bandwidth) < peak memory bandwidth.

(4th Edition: Chapter 5. 3rd Edition: Chapter 5.8, 5.)


Logical Dynamic RAM (DRAM) Chip Organization (16 Mbit)

[Figure: the row/column address lines (A0…A, labeled 14) feed the Row Decoder and the Column Decoder; the Memory Array (16,384 x 16,384) of storage cells on word lines connects through Sense Amps & I/O to Data In (D) and Data Out (Q).]

D, Q share the same pins (shared pins).
Single transistor per bit.
A periodic data refresh is required by reading every bit.
Now: 16 Gbit/chip

Control Signals:
1 - Row Access Strobe (RAS): Low to latch row address
2 - Column Address Strobe (CAS): Low to latch column address
3 - Write Enable (WE) or Output Enable (OE)
4 - Wait for data to be ready

Basic Steps:
1 - Supply Row Address
2 - Supply Column Address
3 - Read/Write Data
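
The row-then-column addressing of the basic steps can be sketched in a few lines. This is only an illustration, not the chip's actual interface: it assumes the 16,384 x 16,384 array shown in the figure, so 14 row bits and 14 column bits sent over the same multiplexed address lines, and it assumes the row is taken from the high-order bits. The function name and the flat cell index are hypothetical.

```python
ROW_BITS = 14  # assumption: 2**14 = 16,384 rows, as in the array shown
COL_BITS = 14  # assumption: 2**14 = 16,384 columns


def split_cell_address(cell_address):
    """Split a flat cell index into (row, column).

    The row is what would be latched on RAS (step 1) and the column
    what would be latched on CAS (step 2). Using the high-order bits
    for the row is an illustrative choice, not taken from the slide.
    """
    if not 0 <= cell_address < (1 << (ROW_BITS + COL_BITS)):
        raise ValueError("cell address out of range")
    row = cell_address >> COL_BITS
    column = cell_address & ((1 << COL_BITS) - 1)
    return row, column
```

Because both halves are 14 bits wide, the same 14 address pins can carry the row first and the column second, which is why two strobes (RAS, then CAS) are needed.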


Simplified Asynchronous DRAM Read Timing (late 1970s)

Source: http://arstechnica.com/paedia/r/ram_guide/ram_guide.part2-1.html

[Figure: non-burst mode memory access example. The timing diagram marks steps 1-4, tRAC (memory access time), tRC (memory cycle time), tPC, and the recovery time.]

tRAC: Minimum time from RAS (Row Access Strobe) line falling to the valid data output (memory access time).
tRC: Minimum time from the start of one row access to the start of the next (memory cycle time).
tCAC: Minimum time from CAS (Column Access Strobe) line falling to valid data output.
tPC: Minimum time from the start of one column access to the start of the next.

Memory Cycle Time = tRC = tRAC + RAS Precharge Time

Peak Memory Bandwidth = Memory bus width / Memory cycle time
Example: Memory Bus Width = 8 Bytes, Memory Cycle time = 200 ns
Peak Memory Bandwidth = 8 / (200 x 10^-9) = 40 x 10^6 Bytes/sec
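
The peak-bandwidth example for non-burst access can be checked with a short sketch. The function names are illustrative, and the arithmetic is just the two formulas from this slide.

```python
def memory_cycle_time_ns(t_rac_ns, ras_precharge_ns):
    """tRC = tRAC + RAS precharge time, all in nanoseconds."""
    return t_rac_ns + ras_precharge_ns


def peak_bandwidth_bytes_per_sec(bus_width_bytes, cycle_time_ns):
    """Peak bandwidth = bus width / memory cycle time.

    One bus-width transfer completes per memory cycle, so the result
    is bytes per cycle divided by seconds per cycle.
    """
    return bus_width_bytes * 1e9 / cycle_time_ns


# Slide example: 8-byte bus, 200 ns memory cycle time.
example = peak_bandwidth_bytes_per_sec(8, 200)  # 40 x 10^6 bytes/sec
```

Since every access pays the full cycle time in this mode, the only ways to raise peak bandwidth are a wider bus or a shorter cycle.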


Simplified DRAM Speed Parameters

• Row Access Strobe (RAS) Time (similar to tRAC):
  - Minimum time from RAS (Row Access Strobe) line falling (activated) to the first valid data output.
  - A major component of memory latency (and cache miss penalty M).
  - Only improves ~5% every year.
• Column Access Strobe (CAS) Time/data transfer time (similar to tCAC):
  - The minimum time required to read additional data by changing column address while keeping the same row address.
  - Along with memory bus width, determines peak memory bandwidth.
  - e.g. For SDRAM: Peak Memory Bandwidth = Bus Width / (0.5 x tCAC)
    For PC100 SDRAM: Memory bus width = 8 bytes, tCAC = 20 ns
    Peak Bandwidth = 8 x 100 x 10^6 = 800 x 10^6 bytes/sec

Simplified SDRAM Burst-Mode Access Timing Example
For PC100 SDRAM: Clock = 100 MHz. Burst length shown = 4.

[Figure: effective burst-mode access. RAS is followed by four transfers of 8 bytes each (1st, 2nd, 3rd, 4th), one per 1/2 CAS; time marks at 40 ns, 50 ns, 60 ns, 70 ns and 80 ns; memory latency is marked.]
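
A small sketch shows how the burst timing relates peak bandwidth to what a single burst actually delivers. It assumes, reading the time marks in the figure, that RAS takes 40 ns and that each 8-byte transfer then completes 0.5 x tCAC = 10 ns after the previous one; that reading of the figure and the function names are assumptions of this sketch.

```python
def burst_completion_times_ns(ras_ns, t_cac_ns, burst_length):
    """Time at which each bus-width transfer of one burst completes."""
    per_transfer_ns = 0.5 * t_cac_ns
    return [ras_ns + per_transfer_ns * (i + 1) for i in range(burst_length)]


def effective_bandwidth_bytes_per_sec(bus_width_bytes, ras_ns, t_cac_ns,
                                      burst_length):
    """Bytes delivered by one burst divided by the total burst time."""
    total_ns = burst_completion_times_ns(ras_ns, t_cac_ns, burst_length)[-1]
    return bus_width_bytes * burst_length * 1e9 / total_ns


# PC100 values from the slide, with RAS time assumed to be 40 ns.
times = burst_completion_times_ns(40, 20, 4)                 # 50, 60, 70, 80 ns
effective = effective_bandwidth_bytes_per_sec(8, 40, 20, 4)  # 400 x 10^6
```

Under these assumptions one burst of 32 bytes takes 80 ns, so the effective rate is half the 800 x 10^6 bytes/sec peak; the gap is the RAS latency that the peak figure leaves out.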


Page Mode DRAM (Early 1980s)

Asynchronous DRAM. Non-burst mode memory access.
Last system memory type to use non-burst access mode.

[Figure: timing diagram with the memory cycle time marked.]

1 - Supply Row Address
2 - Supply Column Address
3 - Read/Write Data


Fast Page Mode (FPM) DRAM (late 1980s)

The first "burst mode" DRAM.

[Figure: burst mode memory access, a read burst of length 4 shown. One address stays constant for the entire burst access while the other changes; the memory access time is marked.]


Simplified Asynchronous Extended Data Out (EDO) DRAM Read Timing (early 90s)

Extended Data Out DRAM operates in a similar fashion to Fast Page Mode DRAM, except that it puts data from one read on the output pins at the same time the column address for the next read is being latched in.

Source: http://arstechnica.com/paedia/r/ram_guide/ram_guide.part2-1.html

EDO DRAM speed rated using tRAC ~ 40-60 ns (memory access time)

Typical timing at 66 MHz: 5-2-2-2 (burst of length 4)
For bus width = 64 bits = 8 bytes:
Max. Bandwidth = 8 x 66 / 2 = 264 Mbytes/sec
It takes 5+2+2+2 = 11 memory cycles, or 15 ns x 11 = 165 ns, to read a 32-byte cache block.
Minimum read miss penalty for a CPU running at 1 GHz = M = 11 x 15 = 165 CPU cycles
(One memory cycle at 66 MHz = 1000/66 ≈ 15 CPU cycles at 1 GHz)
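
The EDO numbers above can be reproduced with a short sketch. The function and key names are illustrative; like the slide, it rounds one 66 MHz memory cycle to 15 ns (15 CPU cycles at 1 GHz).

```python
def edo_read_miss(timing_cycles, memory_clock_mhz, cpu_clock_mhz,
                  bus_width_bytes):
    """Burst read cost for a timing pattern such as 5-2-2-2.

    timing_cycles lists the memory cycles taken by each transfer of
    the burst; the last entry is the steady-state cycles per transfer.
    """
    total_memory_cycles = sum(timing_cycles)
    # Rounded as on the slide: 1000 / 66 is taken as 15.
    memory_cycle_ns = round(1000 / memory_clock_mhz)
    cpu_cycles_per_memory_cycle = round(cpu_clock_mhz / memory_clock_mhz)
    return {
        "block_bytes": len(timing_cycles) * bus_width_bytes,
        "memory_cycles": total_memory_cycles,
        "block_read_ns": total_memory_cycles * memory_cycle_ns,
        "miss_penalty_cpu_cycles":
            total_memory_cycles * cpu_cycles_per_memory_cycle,
        "max_bandwidth_mbytes_per_sec":
            bus_width_bytes * memory_clock_mhz / timing_cycles[-1],
    }


result = edo_read_miss([5, 2, 2, 2], 66, 1000, 8)
```

Note that the 264 Mbytes/sec figure counts only the 2-cycle steady-state transfers; the 5-cycle first access is what makes the 32-byte block take 11 cycles rather than 8.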


Basic Memory Bandwidth Improvement/Miss Penalty (M) Latency Reduction Techniques

1 - Wider Main Memory (CPU-Memory Bus/Interface):
    Memory bus width is increased to a number of words ("usually" up to the size of a cache block).
    - Memory bandwidth is proportional to memory bus width.
      e.g. Doubling the width of cache and memory doubles potential memory bandwidth available to the CPU.
      e.g. 128 bit (16 bytes) memory bus instead of 64 bits (8 bytes); now 24 bytes (192 bits). (Wider FSB?)
    - The miss penalty is reduced since fewer memory bus accesses are needed to fill a cache block on a miss.
2 - Interleaved (Multi-Bank) Memory:
    Memory is organized as a number of independent banks.
    - Multiple interleaved memory reads or writes are accomplished by sending memory addresses to several memory banks at once or by pipelining access to the banks.
    - Interleaving factor: refers to the mapping of memory addresses to memory banks. Goal: reduce bank conflicts.
      e.g. using 4 banks (width one word), bank 0 has all words whose address is: (word address) mod 4 = 0
3 - Burst Mode Memory Access

The above three techniques can also be applied to any cache level to reduce cache hit time and increase cache bandwidth.


Four Way (Four Banks) Interleaved Memory

Example: Sequential Mapping of Memory Addresses To Memory Banks
Bank Width = One Word
Bank Number = (Word Address) Mod (4)

Memory Bank Number    Word addresses (in order of address within bank)
Bank 0                0   4   8   12  16  20  ..
Bank 1                1   5   9   13  17  21  ..
Bank 2                2   6   10  14  18  22  ..
Bank 3                3   7   11  15  19  23  ..

(Cache block? Next cache block?)
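
The sequential mapping can be sketched as follows. The bank number is the slide's (word address) mod 4; taking the address within a bank as the integer quotient is an assumption that matches the order in which words appear in each bank. Function names are illustrative.

```python
def bank_number(word_address, num_banks=4):
    """Bank Number = (Word Address) mod (number of banks)."""
    return word_address % num_banks


def address_within_bank(word_address, num_banks=4):
    """Position of the word inside its bank (assumed: integer quotient)."""
    return word_address // num_banks


def bank_contents(num_banks, num_words):
    """List the word addresses held by each bank, in within-bank order."""
    banks = [[] for _ in range(num_banks)]
    for word_address in range(num_words):
        banks[bank_number(word_address, num_banks)].append(word_address)
    return banks


layout = bank_contents(4, 24)
# layout[0] is [0, 4, 8, 12, 16, 20] and layout[3] is [3, 7, 11, 15, 19, 23]
```

With this mapping, four consecutive words (for example words 0 to 3) always fall in four different banks, so they can be accessed in parallel or in a pipelined fashion.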


Memory Bank Interleaving (Multi-Banked Memory)

[Figure: access timelines for One Memory Bank and for Four Memory Banks (4 banks), with the memory bank cycle time marked. Labels: number of banks; number of cycles to access word in a bank. A very long memory bank recovery time is shown here.]

4 banks is similar to the organization of DDR SDRAM memory chips (also DDR2; DDR3 increases the number to 8 banks, while DDR4 increases it to 16 banks).

Pipeline access to different memory banks to increase effective bandwidth.

Can be applied at:
1 - DRAM chip level (e.g. SDRAM, DDR)
2 - DRAM module level
3 - DRAM channel level

Bank interleaving does not reduce latency of accesses to the same bank.
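
A toy timing model illustrates why pipelining across banks raises effective bandwidth but does not help accesses to the same bank. Everything here is an assumption of the sketch rather than the slide's exact timing: one new access may be issued per cycle, a bank stays busy for a fixed number of cycles per access, and words map to banks by (word address) mod (number of banks).

```python
def total_cycles(word_addresses, num_banks, bank_busy_cycles):
    """Cycles until the last access completes, issuing in program order."""
    bank_free_at = [0] * num_banks   # cycle at which each bank is idle again
    next_issue = 0                   # earliest cycle the next access may issue
    finish = 0
    for word_address in word_addresses:
        bank = word_address % num_banks
        start = max(next_issue, bank_free_at[bank])
        bank_free_at[bank] = start + bank_busy_cycles
        finish = start + bank_busy_cycles
        next_issue = start + 1
    return finish


one_bank = total_cycles(range(8), 1, 4)            # 32 cycles
four_banks = total_cycles(range(8), 4, 4)          # 11 cycles
same_bank = total_cycles(range(0, 32, 4), 4, 4)    # 32 cycles
```

In this model eight sequential words take 32 cycles with one bank and 11 cycles with four banks, while eight words that all map to bank 0 still take 32 cycles even with four banks available.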


Synchronous Dynamic RAM (SDR SDRAM) Organization

SDR = Single Data Rate, DDR = Double Data Rate

[Figure: SDR SDRAM chip organization, showing the address lines and data lines, and an SDRAM vs. DDR SDRAM timing comparison.]

SDR SDRAM (mid 90s):
  SDRAM speed is rated at max. clock speed supported: 100 MHz = PC100, 133 MHz = PC133
  SDR SDRAM Peak Memory Bandwidth = Bus Width / (0.5 x tCAC) = Bus Width x Clock rate

DDR SDRAM (late 90s - 2006):
  DDR SDRAM organization is similar, but four banks are used in each DDR SDRAM chip instead of two (also DDR2; DDR3 increases the number of banks to 8 banks).
  Data transfer on both rising and falling edges of the clock.
  DDR SDRAM is rated by maximum or peak memory bandwidth: PC3200 = 8 bytes x 200 MHz x 2 = 3200 Mbytes/sec
  DDR SDRAM Peak Memory Bandwidth = Bus Width / (0.25 x tCAC) = Bus Width x Clock rate x 2
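
The two forms of each peak-bandwidth formula agree when tCAC equals two clock periods, which is what equating Bus Width / (0.5 x tCAC) with Bus Width x Clock rate implies. The sketch below checks this for PC100 and PC3200; the function names are illustrative, and the tCAC value used for PC3200 (10 ns, two periods of a 200 MHz clock) is derived from that relation rather than stated on the slide.

```python
def peak_from_clock_mbytes_per_sec(bus_width_bytes, clock_mhz,
                                   transfers_per_clock):
    """Bus Width x Clock rate x transfers per clock (1 for SDR, 2 for DDR)."""
    return bus_width_bytes * clock_mhz * transfers_per_clock


def peak_from_t_cac_mbytes_per_sec(bus_width_bytes, t_cac_ns, fraction):
    """Bus Width / (fraction x tCAC); fraction is 0.5 for SDR, 0.25 for DDR.

    Bytes per nanosecond times 1000 gives Mbytes/sec (10^6 bytes/sec).
    """
    return bus_width_bytes * 1000 / (fraction * t_cac_ns)


pc100_clock = peak_from_clock_mbytes_per_sec(8, 100, 1)      # 800
pc100_t_cac = peak_from_t_cac_mbytes_per_sec(8, 20, 0.5)     # 800
pc3200_clock = peak_from_clock_mbytes_per_sec(8, 200, 2)     # 3200
pc3200_t_cac = peak_from_t_cac_mbytes_per_sec(8, 10, 0.25)   # 3200
```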


Comparison of Synchronous Dynamic RAM (SDRAM) Generations: DDR2 vs. DDR and SDR SDRAM

Figure Source: http://www.elpida.com/pdfs/E0678E10.pdf

• Single Data Rate (SDR) SDRAM transfers data on every rising edge of the clock, whereas both DDR and DDR2 are double pumped; they transfer data on the rising and falling edges of the clock.
• DDR2 vs. DDR:
  - DDR2 doubles bus frequency for the same physical DRAM chip clock rate (as shown), thus doubling the effective data rate another time.
  - Ability for much higher clock speeds than DDR, due to design improvements (still 4 banks per chip): DDR2's bus frequency is boosted by electrical interface improvements, on-die termination, prefetch buffers and off-chip drivers.
  - However, latency vs. DDR is greatly increased as a trade-off.

Internal Base Frequency = 133 MHz. Peak bandwidth given for a single 64-bit memory channel (i.e. 8-byte memory bus width).

                   DDR2                      DDR                       SDR
Shown              DDR2-533 (PC2-4200)       DDR-266 (PC-2100)         PC133
Peak bandwidth     ~4.2 GB/s                 ~2.1 GB/s                 ~1.05 GB/s
Calculation        8 x 133 x 4 = 4256 MB/s   8 x 133 x 2 = 2128 MB/s   8 x 133 = 1064 MB/s
Banks per chip     4 Banks                   4 Banks                   2 Banks

For DDR3: The trend continues with another external frequency doubling.
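
The generation-to-generation doubling can be captured as a multiplier on the internal base clock. The table of multipliers below follows the pattern described here (x1 for SDR, x2 for DDR, x4 for DDR2, and x8 for DDR3 after another doubling); the dictionary and function names are illustrative.

```python
# Data transfers per internal base clock cycle, per generation.
TRANSFERS_PER_BASE_CLOCK = {"SDR": 1, "DDR": 2, "DDR2": 4, "DDR3": 8}


def peak_mbytes_per_sec(generation, base_clock_mhz,
                        bus_width_bytes=8, channels=1):
    """Peak bandwidth = bus width x base clock x multiplier x channels."""
    multiplier = TRANSFERS_PER_BASE_CLOCK[generation]
    return bus_width_bytes * base_clock_mhz * multiplier * channels


# 133 MHz internal base frequency, one 64-bit channel.
sdr_133 = peak_mbytes_per_sec("SDR", 133)     # 1064
ddr_133 = peak_mbytes_per_sec("DDR", 133)     # 2128
ddr2_133 = peak_mbytes_per_sec("DDR2", 133)   # 4256

# 200 MHz internal base clock: DDR-400, DDR2-800, DDR3-1600.
ddr3_1600_three_channels = peak_mbytes_per_sec("DDR3", 200, channels=3)
```

The same function reproduces the 200 MHz examples from the first slide: 3200 MB/s for PC3200, 6400 MB/s for PC2-6400, and 12800 MB/s per channel for PC3-12800 (38400 MB/s with three channels).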