Introduction to Microprocessor Performance Growth | CS 3853, Study notes of Computer Architecture and Organization

Material Type: Notes; Professor: Whaley; Class: Computer Architecture; Subject: Computer Science; University: University of Texas - San Antonio; Term: Fall 2008;

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

Chapter 1 Concepts
  1. Terminology
  2. Trends
  3. Measures of Computer Performance
  4. Principles of Computer Design

Basic Terminology
  ILP: instruction-level parallelism
  CPI: clocks per instruction
  IPC: instructions per clock
  FPU: floating-point unit
  ALU: arithmetic logic unit
  PE: processing element (e.g., FPU, ALU, etc.)
  ISA: instruction set architecture
  CISC: complex instruction set computer
  RISC: reduced instruction set computer
  VLIW: very long instruction word
  CMP: chip multiprocessor
  multi-issue: ability to issue more than one instruction at once
  pipelining: breaking a single instruction up into sub-processes (stages)

Microprocessor Performance Growth (iSPEC)
Computer Architecture, Hennessy & Patterson, Fig 1.1, pg 3
  - Software trends reducing architectural inertia:
      Standard operating systems (unix, dos, windows)
      Less use of assembly
  - Performance is plotted relative to the VAX 11/780
  - Difference between 25% and 50% growth per year is due to architectural & organizational ideas:
      caches
      Increased ILP (pipelining, multiple instruction issue)
  - FP speedups even greater.
  - In 2002 hit wall, growth drops to 20% per year:
      Power dissipation
      little ILP left
      memory latency

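To get a feel for what the gap between these annual growth rates means, the compounding can be worked out directly. The sketch below is illustrative only: the function name `relative_perf` and the 16-year window are assumptions chosen for the example, not figures from the text.

```python
def relative_perf(annual_growth, years):
    """Performance relative to the starting point after compounding growth."""
    return (1.0 + annual_growth) ** years

years = 16  # assumed window, for illustration only
for rate in (0.20, 0.25, 0.50):
    factor = relative_perf(rate, years)
    print(f"{rate:.0%} per year for {years} years -> about {factor:.0f}x")
```

Because growth compounds, a rate that is merely twice as large ends up more than an order of magnitude ahead over a window of this length.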
Classes of Computers
Old classes based loosely on size:
  - Microcomputer: computer in which the entire CPU is on one chip
      microprocessors are now used in all types of computers
  - Minicomputer
  - Mainframe
  - Supercomputer
New classes based on usage:
  1. Desktop: interfaces directly with the user; includes PCs and workstations
      Emphasizes both price and performance
  2. Servers: provide computation or data to other devices
      Emphasize availability, scalability, and throughput
  3. Embedded systems: computers lodged in other devices
      Variable, but tend to emphasize price, minimizing memory and power usage, and often require predictability

Three aspects to architecture

  1. Instruction Set Architecture: programmer-visible machine instructions.
      Really binary numbers, but can think of it as assembly code
      Architecture used to imply only ISA design, but modern ISAs are fairly uniform, except in embedded
  2. Organization: high-level aspects of machine
      memory system (including caches), bus structure, CPU, etc.
  3. Hardware: low-level aspects of machine
      detailed logic design and packaging technology

We will emphasize application of 2 (organization)

Goals for Computer Architect

  1. Functional requirements/usage
      Application area: main jobs (desktop, scientific, database, embedded)
      Software compatibility: ISA compatibility needed or not
      OS requirements: address space, memory management, etc.
      Standards: IEEE fp, I/O devices, networking, etc.

Find correct balance of:
  - Price: lowered through less customization and greater volume
  - Performance
  - Power constraints: increasingly important, even in desktop/servers
  - Time-to-market: new features must be worth additional development time

Trends

hardware increase per year
  - Transistors on a chip:
  - Program memory usage: per year.
  - DRAM density: 40-60%
  - disk density: more than 100%

software trends
  - Languages become higher level: assembly, F77/C, C++, java/python
  - Backend optimization controlled at higher levels: user+assembler, compiler, VM

design trends
  - CISC gives way to RISC
  - PentiumPRO: Revenge of the CISC
      Frontend translates CISC inst to RISC microcode
  - semiannually: newest gen of VLIW fails in marketplace
  - PowerPC 970FX: VLIW, kinda
      Frontend translates pseudo-RISC to pseudo-VLIW inst
  - Dual-core on one chip (AMD/Intel/IBM)
  - now: heat wall refocuses arch from clock rate to CMP

Results of Trends

Use for transistors
  - Put entire CPU on one chip
  - Put fancy front-end on chip
      Can design backend independently, make ISA almost irrelevant
  - Put caches on chip too
  - Perform compiler-like transforms in hardware
      inst reordering (dependence analysis), register renaming, OOE (out-of-order execution), speculative execution, etc.
  - Put multiple cores on chip
  - Mem controllers, GPU, more cache, etc., on chip.

Some observations

  - Software slow to adapt, hardware quick
      VLIW exposed at assembly level requires software to adapt with hardware
  - Clock speed is hitting power/heat walls, but so far transistor counts are not
  - Expect hardware support to continue to expand (VLIW/SS, multicore, VM)
  - Expect ISA to describe the machine less and less
  - Experts predict a reversal of these trends; I'm dubious

Benchmarking

type, best to worst
  1. User applications: most important apps used by end-user
  2. Kernels: key kernels extracted from popular applications (linpack, livermore loops, gemm, etc.)
  3. Toy benchmarks: small programs that produce a known result (puzzle, sieve, quicksort, arraymerge, etc.)
  4. Synthetic benchmarks: attempt to match frequency of ops and operands from real programs (dhrystone, whetstone)

gotchas
  - Compilers/archs may tune for benchmark, without generality
  - Benchmark may not use kernel in same way as application
  - Hard to capture all important info (compilers/caches/OS/settings)
      Worsens problems of reproducibility
  - Hard to tell what is causing scores
  - Benchmarks will vary on relative score for same machine

Amdahl’s Law

  - States that the performance improvement from using some faster mode of execution is limited by the fraction of time the faster mode can be used
  - Allows user to understand limits of a considered optimization
  - May be applied to any method of speedup: hardware improvement, software optimization, parallelization, etc.
  - Urges us to optimize common case
  - More rigorous form of 90/10 rule
  - Can get rough idea by asking "what if this execution were free"

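$T_n$: new time, $T_o$: original time, $F_e$: fraction of $T_o$ exploiting the enhancement, $S_e$: speedup from the enhancement, $S_t$: speedup of the total program after the enhancement

$$T_n = (\text{opt time}) + (\text{non-opt time}) = \frac{T_o \times F_e}{S_e} + \left(T_o - (T_o \times F_e)\right) = T_o\left(\frac{F_e}{S_e}\right) + T_o\,(1 - F_e) = T_o\left((1 - F_e) + \frac{F_e}{S_e}\right)$$

$$\therefore\; T_n = T_o\left((1 - F_e) + \frac{F_e}{S_e}\right)$$

$$S_t = \frac{T_o}{T_n} = \frac{T_o}{T_o\left((1 - F_e) + \frac{F_e}{S_e}\right)} = \frac{1}{(1 - F_e) + \frac{F_e}{S_e}}$$

$$\therefore\; S_t = \frac{1}{(1 - F_e) + \frac{F_e}{S_e}}$$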
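The final formula is easy to try out numerically. The following is a minimal sketch, not part of the original notes: the function names `amdahl_speedup` and `speedup_if_free` and the sample fractions are assumptions made for illustration. The second function is the "what if this execution were free" bound, i.e. the limit as $S_e$ grows without bound.

```python
def amdahl_speedup(f_e, s_e):
    """Total speedup S_t when a fraction f_e of the original time is sped up by s_e."""
    if not 0.0 <= f_e <= 1.0:
        raise ValueError("f_e must be a fraction between 0 and 1")
    if s_e <= 0.0:
        raise ValueError("s_e must be positive")
    return 1.0 / ((1.0 - f_e) + f_e / s_e)

def speedup_if_free(f_e):
    """Upper bound on S_t: the enhanced part takes no time at all."""
    return float("inf") if f_e == 1.0 else 1.0 / (1.0 - f_e)

# Hypothetical cases: (fraction of time enhanced, speedup of that fraction)
for f_e, s_e in [(0.5, 2.0), (0.9, 10.0), (0.1, 100.0)]:
    print(f"F_e={f_e}, S_e={s_e}: S_t={amdahl_speedup(f_e, s_e):.3f}, "
          f"bound={speedup_if_free(f_e):.3f}")
```

The last case shows why the common case matters: a 100x speedup applied to 10% of the run time cannot push the total speedup past roughly 1.11.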

Measuring clock speed

Clock Rates (clock ticks in a second)
  - hertz (Hz): one cycle per sec
  - kHz: one thousand cyc/sec ($10^{3}$)
  - MHz: one million cyc/sec ($10^{6}$)
  - GHz: one billion cyc/sec ($10^{9}$)
  - THz: one trillion cyc/sec ($10^{12}$)

Clock Periods (tick interval)
  - millisecond (ms): $\frac{1}{1000} = 10^{-3}$
  - microsecond (μs): $\frac{1}{1000000} = 10^{-6}$
  - nanosecond (ns): (billionth) $\frac{1}{10^{9}} = 10^{-9}$
  - picosecond (ps): (trillionth) $\frac{1}{10^{12}} = 10^{-12}$

rate: cycles/sec, period = sec/cycle
they are inverses!

  1. What is the period of a 500 MHz chip?
  2. What is the clock rate of a chip with a 50 picosecond cycle time?
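Since rate and period are inverses, both questions reduce to a single division. A small sketch follows; the helper names `period_from_rate` and `rate_from_period` are assumptions for this example.

```python
def period_from_rate(hz):
    """Clock period in seconds for a clock rate in cycles/sec."""
    return 1.0 / hz

def rate_from_period(seconds):
    """Clock rate in cycles/sec for a clock period in seconds."""
    return 1.0 / seconds

period = period_from_rate(500e6)   # 500 MHz
rate = rate_from_period(50e-12)    # 50 ps cycle time
print(f"500 MHz -> {period * 1e9:.1f} ns per cycle")
print(f"50 ps   -> {rate / 1e9:.1f} GHz")
```

So a 500 MHz chip has a 2 ns period, and a 50 ps cycle time corresponds to 20 GHz.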

CPUtime, CPI, etc

CPUtime = CPUCycles * ClockPeriod (duh)

CPUtime = InstCount * CPI * Period

$$CPI = \frac{CPUCycles}{IC}, \quad IC: \text{Instruction Count}$$

  - InstCount: ISA & compiler
  - ClockPeriod: hardware tech and org
  - CPI*: compiler (inst selection), org, ISA

* CPI varies by instruction, machine state, and inst mix

Given a program with $IC_t$ insts of $n$ types, each with $CPI_i$ and $IC_i$,

$$CPI_t = \sum_{i=1}^{n} (\text{fraction of insts of type } i) \times CPI_i = \sum_{i=1}^{n} \left(\frac{IC_i}{IC_t} \times CPI_i\right)$$

  - Book posits various scenarios: increase CPI, reduce period, what speedup.
  - In practice, each affects the other, and varies widely by program, and so this is not as useful as they make it sound
  - Equations become more useful when considering improvements on subsets of instructions
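The weighted-CPI sum and the CPUtime product can be checked with a few lines of code. This is a sketch under stated assumptions: the instruction mix, the per-type CPI values, and the 1 ns clock period are made up for the example, as are the function names.

```python
def total_cpi(mix):
    """mix: list of (instruction_count, cpi) pairs, one pair per instruction type."""
    ic_total = sum(ic for ic, _ in mix)
    # Sum over types of (IC_i / IC_t) * CPI_i
    return sum(ic * cpi for ic, cpi in mix) / ic_total

def cpu_time(inst_count, cpi, clock_period_s):
    """CPUtime = InstCount * CPI * Period."""
    return inst_count * cpi * clock_period_s

# Hypothetical mix: (count, CPI) per instruction type
mix = [
    (50_000, 1.0),  # ALU ops
    (30_000, 2.0),  # loads/stores
    (20_000, 3.0),  # branches
]
inst_count = sum(ic for ic, _ in mix)
cpi = total_cpi(mix)
time_s = cpu_time(inst_count, cpi, 1e-9)  # assumed 1 ns period (1 GHz)
print(f"CPI_t = {cpi:.2f}, CPUtime = {time_s * 1e6:.1f} microseconds")
```

Changing one entry of the mix (say, halving the branch CPI) and re-running shows how an improvement on a subset of instructions feeds through to the total.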

Measurement Techniques

Amdahl's Law tells us to make common case fast: need a way to find common case! Therefore, common to run 'typical' applications with:
  - Hardware perf counters: most machines keep track of number of inst, general type, total cycles, etc.
      pros: represents machine used in real world, very fast
      cons: may be imprecise & unrepeatable, hard to avoid contamination, only limited # of counters
  - Instrumented execution: extra code inserted into exec to monitor events, which can then be processed later
      pro/con: same as above, but more flexible at cost of running slower and possibly having instrumentation interfere with run
  - ISA interpretation: write an interpreter that simulates the architecture in question (to greater or lesser accuracy)
      pros: repeatable, cycle accurate, does not require hardware
      cons: very slow (1000 times slower), hard to take all factors into account, and without hardware, unverifiable.
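As a rough illustration of ISA interpretation, the sketch below interprets a made-up four-instruction ISA and records the dynamic instruction mix and a cycle count. Everything here is hypothetical: the opcodes, the cycle costs in `CYCLES`, and the sample program are assumptions for the example and do not model any real machine.

```python
from collections import Counter, defaultdict

# Assumed per-opcode cycle costs for the toy ISA.
CYCLES = {"li": 1, "add": 1, "addi": 1, "bnez": 2}

def run(program):
    """Interpret a list of tuples; return (registers, opcode counts, total cycles)."""
    regs = defaultdict(int)   # registers default to 0
    counts = Counter()        # dynamic instruction count per opcode
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        counts[op] += 1
        pc += 1
        if op == "li":        # li rd, imm
            regs[args[0]] = args[1]
        elif op == "add":     # add rd, rs1, rs2
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "addi":    # addi rd, rs, imm
            regs[args[0]] = regs[args[1]] + args[2]
        elif op == "bnez":    # bnez rs, target
            if regs[args[0]] != 0:
                pc = args[1]
        else:
            raise ValueError(f"unknown opcode: {op}")
    cycles = sum(CYCLES[op] * n for op, n in counts.items())
    return regs, counts, cycles

# Sum the integers 5, 4, 3, 2, 1 with a countdown loop.
program = [
    ("li", "n", 5),
    ("li", "sum", 0),
    ("add", "sum", "sum", "n"),
    ("addi", "n", "n", -1),
    ("bnez", "n", 2),
]
regs, counts, cycles = run(program)
insts = sum(counts.values())
print(dict(counts), f"cycles={cycles}", f"CPI={cycles / insts:.2f}")
```

The run is exactly repeatable and needs no hardware, which are the listed pros; the cycle costs are only as trustworthy as the model behind them, which is the listed con.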

Locality of Reference

Steeper architectural improvement rate enabled by exploiting properties possessed by most applications. One of the most important of these is locality of reference: programs tend to reuse data and instructions they have recently accessed. Two important types of locality:
  - Temporal Locality: recently accessed items are likely to be accessed in the near future
      inst: loops, small functions, etc.
      data: stack, global scalar
  - Spatial Locality: items stored in close proximity tend to be referenced close together during execution
      inst: only long jumps mess this up
      data: sequential array access

Exploit these localities through caches (smaller, faster memories that store recently used and proximal addresses)
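A toy cache model makes the effect of locality visible. The sketch below simulates a direct-mapped cache; the cache geometry (64 lines of 64 bytes) and the three access patterns are assumptions chosen for illustration, not parameters from the notes.

```python
def hit_rate(addresses, num_lines=64, line_bytes=64):
    """Simulate a direct-mapped cache and return the fraction of accesses that hit."""
    tags = [None] * num_lines
    hits = 0
    for addr in addresses:
        block = addr // line_bytes   # which memory block the address falls in
        index = block % num_lines    # the one cache line that block may occupy
        if tags[index] == block:
            hits += 1
        else:
            tags[index] = block      # miss: fetch the whole block into the line
    return hits / len(addresses)

sequential = [4 * i for i in range(4096)]     # 4-byte elements in order (spatial)
strided = [4096 * i for i in range(4096)]     # one element every 4 KiB (no locality)
repeated = [4 * i for i in range(16)] * 100   # same 16 elements over and over (temporal)

for name, trace in [("sequential", sequential), ("strided", strided), ("repeated", repeated)]:
    print(f"{name:10s} hit rate = {hit_rate(trace):.4f}")
```

Sequential access hits on every element after the first in each line, repeated access hits on everything after the first touch, and the strided trace never reuses a line.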

Exploiting Parallelism

  - digital design: parallel completely from hardware
      Checking for address in all levels of cache simultaneously
      Checking multiple tags simultaneously in set-associative cache
      carry lookahead for addition
  - within a processor: (exploiting ILP), hardware & compiler
      pipelining
      issuing multiple instructions simultaneously (superscalar)
  - system level: (improves throughput), assigned by OS
      perform tasks on different processors
      accessing different disks simultaneously
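For pipelining, the ideal gain can be estimated with simple counting. The sketch below assumes an idealized pipeline in which every stage takes one cycle and there are no stalls or hazards; the function names and the 5-stage, 100-instruction figures are assumptions for the example.

```python
def unpipelined_cycles(n_insts, n_stages):
    """Each instruction passes through all stages before the next one starts."""
    return n_insts * n_stages

def pipelined_cycles(n_insts, n_stages):
    """The first instruction fills the pipe; each later one finishes a cycle after."""
    return n_stages + (n_insts - 1)

n_insts, n_stages = 100, 5   # hypothetical workload and pipeline depth
before = unpipelined_cycles(n_insts, n_stages)
after = pipelined_cycles(n_insts, n_stages)
print(f"{before} cycles -> {after} cycles, speedup = {before / after:.2f}")
```

As the instruction count grows, the speedup approaches the number of stages; real pipelines fall short of this because of stalls.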

Fallacies and Pitfalls

fallacy: incorrect common belief
  - Relative performance of archs with the same ISA can be judged by clock rate or a single benchmark
  - Actual performance tracks peak
  - MIPS is a good way to compare computers
  - Widely used benchmarks are always meaningful

pitfall: an easily made mistake
  - Extrapolating compilation performance from hand-tuned kernels
  - Assuming software is cheap and quick
  - Not applying Amdahl's Law
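The MIPS fallacy can be seen with a small calculation. In the sketch below, the two machines, their instruction counts, and their CPI values are invented for illustration; MIPS is taken to mean millions of instructions executed per second.

```python
def exec_time(inst_count, cpi, clock_hz):
    """CPUtime = InstCount * CPI / ClockRate."""
    return inst_count * cpi / clock_hz

def mips(inst_count, time_s):
    """Millions of instructions executed per second."""
    return inst_count / (time_s * 1e6)

# Same program compiled for two hypothetical machines with different ISAs.
# Machine A: many simple instructions. Machine B: fewer, slower instructions.
ic_a, cpi_a, hz_a = 10_000_000, 1.0, 1e9
ic_b, cpi_b, hz_b = 5_000_000, 1.5, 1e9

time_a = exec_time(ic_a, cpi_a, hz_a)
time_b = exec_time(ic_b, cpi_b, hz_b)
mips_a = mips(ic_a, time_a)
mips_b = mips(ic_b, time_b)
print(f"A: {time_a * 1e3:.2f} ms, {mips_a:.0f} MIPS")
print(f"B: {time_b * 1e3:.2f} ms, {mips_b:.0f} MIPS")
```

Machine B finishes first yet reports the lower MIPS rating, because MIPS ignores how much work each instruction does.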