CA226 — Advanced Computer Architecture
Stephen Blott
Preliminaries
Contacting me:
- before or after lectures, or during labs
- in my office: L1.
- at [email protected]; please put the module code (ca226) in the subject line
More Preliminaries
Course web site:
- http://ca226.computing.dcu.ie/ (use your School of Computing credentials)
There’s a link to this site on Moodle [http://moodle.dcu.ie/].
Still More Preliminaries
Labs:
Lab exams:
- weeks eight and twelve, in the regular lab slot (Fridays at 14:00)
Starters for 10.
- List the powers of 2?
- What is $2^{32}$?
- What is $2^{64}$?
Starters for 10..
- What is a register?
- What is a bus?
- What does USB stand for?
- What is a frame buffer?
- What is an interrupt?
Starters for 10…
- What’s special about this IP address: 127.0.0.1?
- What’s special about this IP address: 192.168.3.3?
- What’s special about this IP address: 192.168.3.255?
- Could every person on earth be allocated a unique IP address?
- Old versions of the Linux ext2 file system had a 2GB limit on file sizes. Why?
Observations on Processor Speed
CISC versus RISC
Memory constraints influenced early processor designs:
- with small memories, high code density [http://en.wikipedia.org/wiki/Instruction_set#Code_density] was necessary
- this led to the development of processors with complex instruction sets:
- a single instruction might implement a high-level programming-language operation
- complex addressing modes
- e.g. b = a[i] + 1
CISC versus RISC
As memory costs reduced:
- memory size constraints lessened
- code did not need to be so dense
- reduced instruction sets became viable
- a single high-level programming-language operation might be implemented by several instructions
Almost all modern processors implement reduced instruction sets.
A simple computer…
Note
Source [http://www-cs-faculty.stanford.edu/~eroberts/courses/soco/projects/risc/risccisc/].
Example — The Problem
The problem:
- a = a * b;
- so: multiply memory locations 5:2 and 2:3 (say)
Example — CISC Approach
CISC approach:
MULT 5:2 2:
- a single, complex instruction
- load both memory locations into registers
- multiply
- store the result back in the appropriate memory location say 5:
Just one instruction encodes a commonly-occurring programming operation which, at the hardware level, involves several steps.
Example — RISC Approach
RISC approach:
LOAD A, 2:
LOAD B, 5:
MULT A, B
STORE 2:3, A
Four steps are required:
- so the program memory required is (well, may be) four times larger
- so this approach was only possible when cheaper/larger memory systems became more widespread
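The two approaches can be sketched with a toy machine. This is only an illustrative model, not a real instruction set: the helper names (`load`, `mult`, `store`, `cisc_mult`) and the stored values 6 and 7 are assumptions made up for the example, and the address strings are the two locations named in the problem statement.

```python
# Toy model: memory is a dict keyed by "row:column" address strings.
# The initial values (6 and 7) are made up for illustration.
memory = {"2:3": 6, "5:2": 7}
registers = {}

def load(reg, addr):
    """Copy a memory location into a register."""
    registers[reg] = memory[addr]

def mult(dst, src):
    """Multiply two registers, leaving the product in dst."""
    registers[dst] = registers[dst] * registers[src]

def store(addr, reg):
    """Copy a register back into a memory location."""
    memory[addr] = registers[reg]

def cisc_mult(dst_addr, src_addr):
    """One complex instruction: internally it still loads, multiplies and stores."""
    load("A", dst_addr)
    load("B", src_addr)
    mult("A", "B")
    store(dst_addr, "A")

# RISC style: four simple instructions issued by the program.
load("A", "2:3")
load("B", "5:2")
mult("A", "B")
store("2:3", "A")
risc_result = memory["2:3"]

# CISC style: reset the operand, then issue a single complex instruction.
memory["2:3"] = 6
cisc_mult("2:3", "5:2")
cisc_result = memory["2:3"]

print(risc_result, cisc_result)
```

Both paths perform the same four internal steps; the difference is whether the program or the hardware spells them out.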
RISC
RISC:
- reduced instruction set computing
- computations are performed only on register contents
- the only memory operations are LOAD and STORE
- few, uniformly-sized instructions
RISC Advantages
Both approaches are likely to require roughly the same number of computational steps.
RISC advantages:
- moves complexity from hardware to software (compilers)
- smarter compilers make better use of registers
- fewer transistors:
- so smaller, can be clocked faster, reduced power consumption, less heat
- pipelining (and super-scalar processing)
Answer?
It depends.
Answer?
Usually:
- we’re interested in how long it takes to get some work done
So:
- wall-clock time might be a good measure
However …
It depends on how and why we’re measuring.
Wall-clock time includes:
- user CPU time
- system CPU time
- interrupt handling time
- I/O time (to/from terminal, disk, network)
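The gap between wall-clock time and CPU time can be observed directly. The sketch below uses Python's standard `time.perf_counter` and `os.times`; the `measure` helper and `sample_job` are hypothetical names introduced here, and the short sleep merely stands in for time spent waiting on I/O.

```python
import os
import time

def measure(job):
    """Return (wall, user_cpu, system_cpu) in seconds for one call to job()."""
    wall_start = time.perf_counter()
    cpu_start = os.times()
    job()
    cpu_end = os.times()
    wall_end = time.perf_counter()
    return (wall_end - wall_start,
            cpu_end.user - cpu_start.user,
            cpu_end.system - cpu_start.system)

def sample_job():
    sum(i * i for i in range(200_000))  # CPU-bound work
    time.sleep(0.05)                    # waiting, as if blocked on I/O

wall, user, system = measure(sample_job)
print(f"wall={wall:.3f}s user={user:.3f}s system={system:.3f}s")
```

The wall-clock figure includes the time spent waiting, while the user and system CPU figures do not, which is why the numbers differ.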
CPU Architectures
If we’re interested in comparing processors:
- we may be more interested in the number of clock cycles necessary to complete some task
Clock Rate
Clock rate:
- the number of clock cycles per unit time (usually, per second)
- say, 2 GHz
CPU Clock Cycles
CPU clock cycles:
- the number of clock cycles necessary to complete some job
Alternatively
But that approach:
- is too dependent on a single job
Alternatively
Better:
- derive a metric which is (somewhat) independent of any particular job
- let IC be the instruction count: the number of instructions needed to complete some job
Say:
- $\text{IC} = 2 \times 10^8$, for a job taking $4 \times 10^8$ clock cycles
Then …
Then:
- cycles per instruction (CPI): $\text{CPI} = \frac{\text{CPU clock cycles}}{\text{IC}}$
Example:
- $\text{CPI} = \frac{4 \times 10^8}{2 \times 10^8} = 2$; so, two cycles per instruction
Then again …
Then:
- CPU time: $\text{CPU time} = \frac{\text{IC} \times \text{CPI}}{\text{clock rate}}$
Example:
- $\text{CPU time} = \frac{2 \times 10^8 \times 2}{2 \times 10^9} = 0.2\,\text{s}$
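The CPI and CPU-time formulas translate directly into code. The function and variable names below are illustrative choices; the numbers are the ones used in the examples above (4 × 10^8 cycles, 2 × 10^8 instructions, a 2 GHz clock).

```python
def cpi(cpu_clock_cycles, instruction_count):
    """Cycles per instruction: CPU clock cycles / IC."""
    return cpu_clock_cycles / instruction_count

def cpu_time(instruction_count, cycles_per_instruction, clock_rate_hz):
    """CPU time in seconds: (IC * CPI) / clock rate."""
    return instruction_count * cycles_per_instruction / clock_rate_hz

cycles = 4e8       # CPU clock cycles for the job
ic = 2e8           # instruction count
clock_rate = 2e9   # 2 GHz, in cycles per second

job_cpi = cpi(cycles, ic)
job_time = cpu_time(ic, job_cpi, clock_rate)
print(f"CPI = {job_cpi:.1f}, CPU time = {job_time:.2f} s")
```

Halving any one of IC or CPI, or doubling the clock rate, halves the computed CPU time, which is the point the formula makes.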
So …
- $\text{CPU time} = \frac{\text{IC} \times \text{CPI}}{\text{clock rate}}$
So, to make things go faster (reduce CPU time):
- reduce the instruction count (IC)
- reduce the number of cycles per instruction (CPI), or
- increase the clock rate
Improvements in CPI
The Intel 8086 instruction PUSH AX:
- 8086 — 11 clock cycles
- 80286 — 3 clock cycles
- 80386 — 2 clock cycles
- 80486 — 1 clock cycle
So:
- it is not just clock speed that has improved over the years
- in fact: it is now commonplace to see $\text{CPI} \le 1$
Example
Example:
- two machines (A and B) implementing the same instruction set architecture
- A has a cycle time of 10 ns and a CPI of 2.0 (for some program P)
- B has a cycle time of 20 ns and a CPI of 1.2 (for the same program P)
Which is faster?
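One way to work this out: because both machines implement the same instruction set, the instruction count for P is the same on each, so it is enough to compare the average time per instruction (CPI × cycle time). The sketch below does that; the function name is an illustrative choice.

```python
def time_per_instruction_ns(cycle_time_ns, cpi):
    """Average time per instruction: CPI * cycle time."""
    return cpi * cycle_time_ns

machine_a = time_per_instruction_ns(cycle_time_ns=10, cpi=2.0)
machine_b = time_per_instruction_ns(cycle_time_ns=20, cpi=1.2)

faster = "A" if machine_a < machine_b else "B"
ratio = max(machine_a, machine_b) / min(machine_a, machine_b)
print(f"A: {machine_a:.0f} ns, B: {machine_b:.0f} ns per instruction")
print(f"{faster} is faster, by a factor of about {ratio:.1f}")
```

Under these figures A comes out ahead despite its higher CPI, because its cycle time is half that of B.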
Aside
Note
The cycle time (in seconds) is just the reciprocal of the clock rate (in hertz), and vice versa.
More Common Metrics
MIPS:
- $\text{MIPS} = \frac{\text{clock rate}}{\text{CPI} \times 10^6}$
MFLOPS:
- $\text{MFLOPS} = \frac{\text{clock rate}}{\text{C-per-FPI} \times 10^6}$, where C-per-FPI is the number of cycles per floating-point instruction
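As a quick sketch, both metrics can be computed from the clock rate. The function names are illustrative; the 2 GHz clock and CPI of 2 come from the earlier examples, while the figure of 4 cycles per floating-point instruction is an assumption made up for this example.

```python
def mips(clock_rate_hz, cpi):
    """Millions of instructions per second: clock rate / (CPI * 10^6)."""
    return clock_rate_hz / (cpi * 1e6)

def mflops(clock_rate_hz, cycles_per_fp_instruction):
    """Millions of FP operations per second: clock rate / (C-per-FPI * 10^6)."""
    return clock_rate_hz / (cycles_per_fp_instruction * 1e6)

clock_rate = 2e9  # 2 GHz
machine_mips = mips(clock_rate, cpi=2.0)
machine_mflops = mflops(clock_rate, cycles_per_fp_instruction=4.0)  # assumed value
print(f"{machine_mips:.0f} MIPS, {machine_mflops:.0f} MFLOPS")
```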
MIPS and MFLOPS
These can be poor metrics for comparing different processors:
- some implement FP division (e.g. Pentium)
- some don’t (e.g. SPARC)
Instruction counts:
- they may have different instruction sets (so the ICs will be different)
- for complex operations like sine and cosine, the instruction counts may be quite large
- so these differences can be significant
Improving Performance
Generally:
- optimise for the common case
Improving Performance
However, (particularly) with computer hardware:
- optimisation is expensive (it requires substantial investment)
So:
- we need to decide where to invest in optimisation, and
- we need to know that the payback is going to be worth it
Speedup
Consider some possible hardware or software enhancement.
Speedup:
- $\text{speedup} = \frac{\text{performance without enhancement}}{\text{performance with enhancement}}$
Note
"Performance", here, might be response time (say). With speedup, larger values are better.
Speedup — Example
Example:
- a baseline implementation might execute a job in 3 seconds
- with some enhancement, that might be reduced to 2 seconds
Speedup:
- $3 / 2 = 1.5$
Important Gotcha!
Typically:
- only a portion of an entire job will be sped up by any proposed enhancement
Example:
- sort the contents of a disk file, storing the sorted results back in a new file on disk; so: read data in, sort it, write data out
- an enhanced sorting algorithm can only improve the CPU costs, not the IO costs
- an enhanced IO subsystem can only improve the IO costs, not the sorting costs
Example
Assume:
- some job involving sub-jobs A and B
- B accounts for 70% of the execution time, A the rest
Given a proposed enhancement:
- running B 20 times faster
How much faster would our job run overall?
Amdahl’s Law — Example
Overall speedup:
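- $\text{speedup} = \frac{1}{(1-P) + P/S}$, where $P$ is the fraction of execution time affected by the enhancement and $S$ is the speedup of that fraction
- $= \frac{1}{(1-0.7) + 0.7/20}$
- $= \frac{1}{0.3 + 0.035}$
- $\approx 2.985$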
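The calculation above can be wrapped in a small helper. The function name `amdahl_speedup` is an illustrative choice, and it takes P to be the fraction of execution time affected and S the speedup of that fraction; the extra values of S are included only to show how quickly the overall gain levels off.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the time is made s times faster."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a fraction between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)

# The example from the text: B is 70% of the job and runs 20 times faster.
overall = amdahl_speedup(0.7, 20)
print(f"overall speedup: {overall:.3f}")

# Making B faster still gives diminishing returns.
for s in (2, 20, 200, 2000):
    print(f"S = {s:>4}: overall speedup {amdahl_speedup(0.7, s):.3f}")
```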
Example
It will run about three times faster:
- this may be less than you intuitively expected.
Another Example
Amdahl’s law also allows comparison between two or more design alternatives.
Example:
- a program spends:
- half its time doing floating-point operations
- including 20% of its total time calculating floating-point square roots
Alternative optimisations:
- Add floating-point square root hardware which speeds up such operations by a factor of 10.
- Make all floating-point operations run twice as fast.
Engineering
Assuming we can only choose one:
- in which of these optimisations should we invest?
Engineering — First Case
Optimisations:
- Add floating-point square root hardware which speeds up such operations by a factor of 10.
Amdahl’s law:
- $\text{speedup} = \frac{1}{0.8 + 0.2/10} \approx 1.22$
Engineering — Second Case
Optimisations:
- Make all floating-point operations run twice as fast.
Amdahl’s law:
- $\text{speedup} = \frac{1}{0.5 + 0.5/2} \approx 1.33$
So, under these assumptions, the second approach looks like the better investment.
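The comparison of the two alternatives can be sketched as a small table of (fraction, factor) pairs fed through Amdahl's law. The dictionary keys and function name are illustrative; the fractions and factors are the ones given in the example.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the time is made s times faster."""
    return 1.0 / ((1.0 - p) + p / s)

# (fraction of total time affected, speedup of that fraction)
alternatives = {
    "FP square root hardware": (0.2, 10),
    "all FP operations twice as fast": (0.5, 2),
}

results = {name: amdahl_speedup(p, s) for name, (p, s) in alternatives.items()}
best = max(results, key=results.get)

for name, speedup in results.items():
    print(f"{name}: {speedup:.2f}")
print(f"better investment: {best}")
```

The more modest factor wins here because it applies to a larger share of the execution time.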
Corollary
Amdahl’s law tells us to:
- make the common case fast!
Or:
- we can never see a big speedup by optimising the uncommon case
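The second point follows from taking the limit of Amdahl's law as the enhancement becomes arbitrarily fast. The derivation below is a sketch that takes $P$ as the fraction of execution time affected and $S$ as the speedup of that fraction:

```latex
\[
\lim_{S \to \infty} \frac{1}{(1-P) + P/S} = \frac{1}{1-P}
\]
```

So however large $S$ is, the overall speedup cannot exceed $1/(1-P)$. For instance, with $P = 0.2$ the bound is $1/0.8 = 1.25$, and with $P = 0.7$ it is $1/0.3 \approx 3.33$.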