



Material Type: Notes; Professor: Whaley; Class: Computer Architecture; Subject: Computer Science; University: University of Texas - San Antonio; Term: Fall 2008;
Typology: Study notes




Chapter 1 Concepts
Basic Terminology
ILP: instruction-level parallelism
CPI: clocks per instruction
IPC: instructions per clock
FPU: floating point unit
ALU: arithmetic logic unit
PE: processing element (eg, FPU, ALU, etc)
ISA: instruction set architecture
CISC: complex instruction set computer
RISC: reduced instruction set computer
VLIW: very long instruction word
CMP: chip multiprocessor
multi-issue: ability to perform more than one inst at once
pipelining: breaking single inst up into sub-processes
Microprocessor Performance Growth (iSPEC)
Computer Architecture, Henn & Patt, Fig 1.1, pg 3
Software trends reducing arch inertia:
Standard operating systems (unix, dos, windows)
Less use of assembly
Perf relative to [...]
Diff between [...] and [...] ideas: 50% due to arch & org
caches
Increased [...] (pipelining, multiple inst issue)
speedups even greater.
In 2002 hit wall: 20%
Power dissipation
little ILP left
memory latency
Classes of Computers
Old classes based loosely on size
Microcomputer: computer in which entire CPU is on one chip
microprocs now used in all types of computers
Minicomputer
Mainframes
Supercomputer
New classes based on usage
Desktop: iface directly wt user, includes PCs and workstations
Emphasize both price and performance
Servers: provides computation or data to other devices
Emphasize availability, scalability, and throughput
Embedded systems: computer lodged in other devices
Variable, but tend to emphasize price, minimizing memory and power usage, and often require predictability
Three aspects to architecture
Instruction Set Architecture: programmer-visible machine instructions.
Really binary numbers, but can think of it as assembly code
Architecture used to imply only ISA design, but modern ISAs fairly uniform, except in embedded
Organization: high-level aspects of machine
memory system (including caches), bus structure, CPU, etc.
Hardware: low-level aspects of machine
detailed logic design and packaging technology
We will emphasize application of 2
Goals for Computer Architect
Application area: main jobs (desktop, scientific, database, embedded)
software compatibility: compat needed or not
requirements: address space, memory management, etc
standards: IEEE fp, I/O devices, networking, etc.
find correct balance of:
lowered through less customization and greater volume
Increasingly important, even in desktop/servers
New features must be worth additional devel time
Trends
hardware increase per year
Transistors on a chip:
Program memory usage:
per year.
DRAM density: 40-60%
disk density: more than 100%
software trends
Languages become higher level
assembly, java/python
Backend optimization controlled at higher levels
user+assmblr, compiler, VM
design trends
: CISC gives way to RISC
: PentiumPRO: Revenge of the CISC
Frontend translates inst to RISC microcode
semiannually newest gen of VLIW fails in marketplace
PowerPC970FX: VLIW, kinda
Frontend translates pseudo-RISC to pseudo-VLIW inst
Dual-core on one chip (AMD/Intel/IBM)
now heat wall refocuses arch from clock rate to CMP
Results of Trends
Use for transistors
Put entire CPU on one chip
Put fancy front-end on chip
Can design backend independently, make ISA almost irrelevant
put caches on chip too
Perform compiler-like transforms in hardware
inst reordering (dep analysis), reg renaming, speculative execution, etc.
Put multiple cores on chip
Mem controllers, more cache, etc., on chip.
Some observations
Software slow to adapt, hardware quick
VLIW exposed at assembly level requires software to adapt with hardware
Clock speed is hitting power/heat walls, but so far transistor counts are not
Expect hardware support to continue to expand (VLIW/SS, multicore, VM)
Expect ISA to less and less describe machine
Experts predict a reversal of these trends, I’m dubious
Benchmarking
type, best to worst
User applications: most important apps used by end-user
Kernels: key kernels extracted from popular applications (linpack, livermore loops, gemm, etc)
toy benchmark: small progs that produce known result (puzzle, sieve, quicksort, arraymerge, etc)
synthetic benchmarks: attempts to match freq of ops and operands from real progs (dhrystone, whetstone)
gotchas
Compilers/archs may tune for benchmark, wtout generality
Benchmark may not use kernel in same way as application
Hard to capture all important info (compilers/caches/OS/settings)
Worsens problems of reproducibility
Hard to tell what is causing scores
Benchmarks will vary on relative score for same machine
Amdahl’s Law
States that the perf improv from using some faster mode of exec is limited by the fraction of time the faster mode can be used
Allows user to understand limits of a considered optimization
May be applied to any method of speedup: hardware improvement, software optimization, parallelization, etc.
Urges us to optimize common case
More rigorous form of 90/10 rule
Can get rough idea by asking "what if this execution were free"
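$T_n$: new time, $T_o$: original time, $F_e$: frac of $T_o$ exploiting enhanc, $S_e$: speedup from enhanc, $S_t$: speedup of total program after enhanc
$$T_n = (\text{opt time}) + (\text{non-opt time}) = \frac{T_o \times F_e}{S_e} + \left(T_o - (T_o \times F_e)\right) = T_o\left(\frac{F_e}{S_e} + (1 - F_e)\right) = T_o\left((1 - F_e) + \frac{F_e}{S_e}\right)$$
$$\therefore T_n = T_o\left((1 - F_e) + \frac{F_e}{S_e}\right)$$
$$\frac{T_o}{T_n} = \frac{T_o}{T_o\left((1 - F_e) + \frac{F_e}{S_e}\right)} = \frac{1}{(1 - F_e) + \frac{F_e}{S_e}}$$
$$S_t = \frac{1}{(1 - F_e) + \frac{F_e}{S_e}}$$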
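A minimal Python sketch of the speedup formula above, handy for checking numbers by hand. The function names `amdahl_speedup` and `amdahl_limit` and the 40% / 10x figures are illustrative assumptions, not taken from the notes.

```python
def amdahl_speedup(frac_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup S_t = 1 / ((1 - F_e) + F_e / S_e)."""
    if not 0.0 <= frac_enhanced <= 1.0:
        raise ValueError("frac_enhanced (F_e) must be in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced (S_e) must be positive")
    return 1.0 / ((1.0 - frac_enhanced) + frac_enhanced / speedup_enhanced)


def amdahl_limit(frac_enhanced: float) -> float:
    """Upper bound on S_t as S_e grows without bound: the enhanced part is 'free'."""
    if not 0.0 <= frac_enhanced <= 1.0:
        raise ValueError("frac_enhanced (F_e) must be in [0, 1]")
    if frac_enhanced == 1.0:
        return float("inf")
    return 1.0 / (1.0 - frac_enhanced)


if __name__ == "__main__":
    # Hypothetical case: 40% of the run time can use a mode that is 10x faster.
    print(f"S_t   = {amdahl_speedup(0.4, 10.0):.4f}")
    print(f"limit = {amdahl_limit(0.4):.4f}")
```

The second function is the "what if this execution were free" question in code form: even an infinitely fast enhancement cannot push the total speedup past 1 / (1 - F_e).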
Measuring clock speed
Clock Rates (clock ticks in a second)
hertz (Hz): one cycle per sec
KHz: one thousand cyc/sec ($10^{3}$)
MHz: one million cyc/sec ($10^{6}$)
GHz: one billion cyc/sec ($10^{9}$)
THz: one trillion cyc/sec ($10^{12}$)
Clock Periods (tick interval)
millisecond (ms): $\frac{1}{1000} = 10^{-3}$
microsecond ($\mu$s): $\frac{1}{1000000} = 10^{-6}$
nanosecond (ns): (billionth) $\frac{1}{10^{9}} = 10^{-9}$
picosecond (ps): (trillionth) $\frac{1}{10^{12}} = 10^{-12}$
rate: cycles/sec, period = sec/cycle
they are inverses!
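Since rate and period are inverses, converting between them is a single division. A small Python sketch, where the helper names and the 2 GHz example are assumptions chosen for illustration:

```python
def period_seconds(rate_hz: float) -> float:
    """Clock period in seconds for a clock rate given in Hz (period = 1 / rate)."""
    if rate_hz <= 0:
        raise ValueError("clock rate must be positive")
    return 1.0 / rate_hz


def period_nanoseconds(rate_hz: float) -> float:
    """Clock period in nanoseconds; 1 s = 1e9 ns."""
    if rate_hz <= 0:
        raise ValueError("clock rate must be positive")
    return 1e9 / rate_hz


def rate_hz(period_s: float) -> float:
    """Clock rate in Hz for a period given in seconds (rate = 1 / period)."""
    if period_s <= 0:
        raise ValueError("clock period must be positive")
    return 1.0 / period_s


if __name__ == "__main__":
    two_ghz = 2e9  # hypothetical 2 GHz clock
    print(period_nanoseconds(two_ghz), "ns per cycle")
```

A GHz-range clock therefore has a sub-nanosecond period, which is why periods for modern parts are usually quoted in ns or ps.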
CPUtime, CPI, etc
CPUtime = CPUCycles * ClockPeriod (duh)
CPUtime = InstCount * CPI * Period
CPI = CPUCycles / IC, IC: Instruction Count
InstCount: ISA & compiler
ClockPeriod: hardware tech and org
CPI*: compiler (inst selection), org, ISA
* CPI varies by instruction, machine state, and inst mix
Given program wt $IC_t$ insts of $n$ types each with $IC_i$ and $CPI_i$,
$$CPI = \sum_{i=1}^{n} (\text{\% of inst of type } i) \times CPI_i = \sum_{i=1}^{n} \left(\frac{IC_i}{IC_t} \times CPI_i\right)$$
Book posits various scen: increase [...], reduce period, what speedup.
In practice, each affects the other, and varies widely by program, and so this is not as useful as they make it sound
Equations become more useful when considering improvements on subsets of instructions
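The CPI and CPUtime equations above can be sketched in a few lines of Python. The instruction mix below (counts and per-type CPIs) and the 1 ns period are made-up numbers for illustration only, and the function names are assumptions.

```python
def average_cpi(mix):
    """Weighted CPI: sum over types of (IC_i / IC_t) * CPI_i.

    mix is a list of (inst_count, cpi) pairs, one pair per instruction type.
    """
    total_insts = sum(count for count, _ in mix)
    if total_insts == 0:
        raise ValueError("instruction mix is empty")
    return sum(count / total_insts * cpi for count, cpi in mix)


def cpu_time(inst_count, cpi, clock_period_s):
    """CPUtime = InstCount * CPI * Period."""
    return inst_count * cpi * clock_period_s


if __name__ == "__main__":
    # Hypothetical mix: (instruction count, CPI) for ALU, load/store, branch.
    mix = [(50, 1.0), (30, 2.0), (20, 4.0)]
    cpi = average_cpi(mix)
    total = sum(count for count, _ in mix)
    print(f"CPI = {cpi:.2f}")
    print(f"CPUtime = {cpu_time(total, cpi, 1e-9):.3e} s at a 1 ns period")
```

Changing the CPI of one entry in `mix` shows the point about subsets of instructions: the effect on total time is scaled by that type's share of the instruction count.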
Measurement Techniques
Amdahl’s Law tells us to make common case fast: need a way to find common case! Therefore, common to run ‘typical’ applications with:
Hardware perf counters: most machines keep track of number of inst, general type, total cycles, etc.
pros: represents machine used in real world, very fast
cons: may be imprecise & unrepeatable, hard to avoid contamination, only limited # of counters
Instrumented execution: extra code inserted into exec to monitor events, which can then be processed later
pro/con: same as above, but more flexible at cost of running slower and possibly having instrumentation interfere with run
ISA interpretation: Write an interpreter that simulates the architecture in question (to greater or lesser accuracy)
pros: repeatable, cycle accurate, does not require hardware
cons: very slow (1000 times slower), hard to take all factors into account, and without hardware, unverifiable.
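A rough source-level illustration of instrumented execution: a Python decorator inserts extra code around a function to count calls and accumulate wall-clock time for later processing. The names `instrumented`, `event_log`, and `dot` are hypothetical. Real tools usually instrument at the compiler or binary level, so treat this only as a sketch of the idea.

```python
import functools
import time
from collections import defaultdict

# Events are recorded here during the run and can be processed afterwards.
event_log = defaultdict(lambda: {"calls": 0, "seconds": 0.0})


def instrumented(func):
    """Wrap func with monitoring code that records call count and elapsed time."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            record = event_log[func.__name__]
            record["calls"] += 1
            record["seconds"] += time.perf_counter() - start
    return wrapper


@instrumented
def dot(xs, ys):
    """Example kernel: dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(xs, ys))


if __name__ == "__main__":
    for _ in range(1000):
        dot(range(100), range(100))
    for name, record in event_log.items():
        print(name, record["calls"], f"{record['seconds']:.6f} s")
```

The timer calls and dictionary updates run on every call, which is the cost the notes mention: the instrumented program is slower, and the monitoring itself can perturb what is being measured.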
Locality of Reference
Steeper architectural improvement rate enabled by exploiting properties possessed by most applications. One of the most important of these is locality of reference: programs tend to reuse data and instructions they have recently accessed. Two important types of locality:
Temporal Locality: recently accessed items are likely to be accessed in the near future
inst: loops, small functions, etc
data: stack, global scalar
Spatial Locality: items stored in close proximity tend to be referenced close together during execution
inst: only long jumps mess this up
data: sequential array access
Exploit these localities through caches (smaller, faster memories that store recently used and proximal addresses)
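To make the effect of locality concrete, here is a toy direct-mapped cache model in Python that only counts hits. The cache geometry (64 lines of 16 bytes), the function name `hit_rate`, and the three access patterns are assumptions chosen for illustration; real caches add associativity, replacement policy, and write handling.

```python
def hit_rate(addresses, num_lines=64, line_size=16):
    """Fraction of accesses that hit in a direct-mapped cache.

    addresses: iterable of byte addresses, in access order.
    """
    tags = [None] * num_lines  # one stored tag per cache line, None = empty
    hits = 0
    total = 0
    for addr in addresses:
        block = addr // line_size   # which memory block holds this byte
        index = block % num_lines   # cache line the block maps to
        tag = block // num_lines    # identifies the block within that line
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag       # miss: load the block, evicting the old one
        total += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    sequential = range(0, 4096, 4)                              # spatial locality
    small_loop = [a for _ in range(10) for a in range(0, 256, 4)]  # temporal too
    strided = range(0, 1024 * 100, 1024)                        # no reuse
    print("sequential:", hit_rate(sequential))
    print("small loop:", hit_rate(small_loop))
    print("strided:   ", hit_rate(strided))
```

Sequential 4-byte accesses miss once per 16-byte line and hit on the other three words. The repeated small loop fits entirely in the cache after its first pass. The stride of 1024 bytes maps every access to the same line with a new tag, so nothing is ever reused.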
Exploiting Parallelism
digital design: parallel completely from hardware
Checking for address in all levels of cache sim
Checking multiple tags sim in set-assoc cache
carry lookahead for addition
within a processor: (exploiting ILP), hardware & compiler
pipelining
issuing multiple instructions simultaneously (super scalar)
system level: (improves throughput), assigned by OS
perform tasks on different processors
accessing different disks simultaneously
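As a back-of-the-envelope view of why pipelining pays off, the sketch below compares cycle counts for an idealized pipeline. It assumes every stage takes exactly one cycle and there are no stalls or hazards, which real pipelines do not achieve. The function names and the 5-stage, 100-instruction example are illustrative assumptions.

```python
def unpipelined_cycles(num_insts: int, num_stages: int) -> int:
    """Each instruction passes through all stages before the next one starts."""
    return num_insts * num_stages


def pipelined_cycles(num_insts: int, num_stages: int) -> int:
    """Ideal pipeline: fill once, then complete one instruction per cycle."""
    if num_insts == 0:
        return 0
    return num_stages + (num_insts - 1)


def pipeline_speedup(num_insts: int, num_stages: int) -> float:
    """Ratio of unpipelined to ideal pipelined cycle counts."""
    if num_insts <= 0:
        raise ValueError("need at least one instruction")
    return unpipelined_cycles(num_insts, num_stages) / pipelined_cycles(
        num_insts, num_stages
    )


if __name__ == "__main__":
    n, k = 100, 5  # hypothetical: 100 instructions, 5 stages
    print(unpipelined_cycles(n, k), "cycles unpipelined")
    print(pipelined_cycles(n, k), "cycles pipelined (ideal)")
    print(f"speedup = {pipeline_speedup(n, k):.2f}")
```

As the instruction count grows, the ideal speedup approaches the number of stages. Stalls from dependences, branches, and memory latency are what keep real machines below that bound.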
Fallacies and Pitfalls
fallacy: incorrect common belief
Rel perf of archs wt same ISA judged by clock rate or single benchmark
Actual perf tracks peak
MIPS is a good way to compare computers
Widely used benchmarks are always meaningful
pitfall: an easily made mistake
Extrapolating compilation performance from hand-tuned kernels
Assuming software cheap & quick
Not applying Amdahl’s Law
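One way to see why the MIPS fallacy listed above fails: two machines can run the same program with different instruction counts, so the one with the higher MIPS rating can still take longer. The two machines below, their instruction counts, and their CPIs are invented for illustration, and the function names are assumptions.

```python
def cpu_time(inst_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPUtime = InstCount * CPI / ClockRate (same as InstCount * CPI * Period)."""
    return inst_count * cpi / clock_rate_hz


def mips(inst_count: float, seconds: float) -> float:
    """Millions of instructions executed per second."""
    return inst_count / (seconds * 1e6)


if __name__ == "__main__":
    clock = 1e9  # both hypothetical machines run at 1 GHz

    # Machine A: simple instructions, so the program needs more of them.
    insts_a, cpi_a = 2e9, 1.0
    # Machine B: richer instructions, fewer needed, higher CPI.
    insts_b, cpi_b = 1e9, 1.5

    time_a = cpu_time(insts_a, cpi_a, clock)
    time_b = cpu_time(insts_b, cpi_b, clock)
    print(f"A: {time_a:.2f} s, {mips(insts_a, time_a):.0f} MIPS")
    print(f"B: {time_b:.2f} s, {mips(insts_b, time_b):.0f} MIPS")
```

Machine A posts the higher MIPS figure yet finishes later, because MIPS ignores how much work each instruction does. Execution time for the actual program is the comparison that matters.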