Download Revolution in Computer Architecture: Old Conventional Wisdom vs. New Conventional Wisdom and more Slides Electronics engineering in PDF only on Docsity!
High Level Message
- Everything is changing; Old conventional wisdom
is out
- We DESPERATELY need a new architectural
solution for microprocessors based on parallelism
- Need to create a “watering hole” to bring
everyone together to quickly find that solution
- architects, language designers, application experts, numerical analysts, algorithm designers, programmers, …
Docsity.com
Outline
• Part I: A New Agenda for Computer
Architecture
- Old Conventional Wisdom vs. New Conventional
Wisdom
• Part II: A “Watering Hole” for Parallel Systems
- Research Accelerator for Multiple Processors
• Conclusion
Docsity.com
Conventional Wisdom (CW) in Computer Architecture
- Old CW: Power is free, Transistors expensive
- New CW: “Power wall” Power expensive, Xtors free
(Can put more on chip than can afford to turn on)
- Old: Multiplies are slow, Memory access is fast
- New: “Memory wall” Memory slow, multiplies fast
(200 clocks to DRAM memory, 4 clocks for FP multiply)
- Old : Increasing Instruction Level Parallelism via compilers,
innovation (Out-of-order, speculation, VLIW, …)
- New CW: “ILP wall” diminishing returns on more ILP
- New: Power Wall + Memory Wall + ILP Wall = Brick Wall
- Old CW: Uniprocessor performance 2X / 1.5 yrs
- New CW: Uniprocessor performance only 2X / 5 yrs? Docsity.com
Uniprocessor Performance (SPECint)
1
10
100
1000
10000
1978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006
Performance (vs. VAX-11/780) 25%/year
52%/year
??%/year
- VAX : 25%/year 1978 to 1986
- RISC + x86: 52%/year 1986 to 2002
- RISC + x86: ??%/year 2002 to present
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach , 4th edition, 2006
⇒ Sea change in chip design: multiple “cores” or processors per chip
3X
Docsity.com
Déjà vu all over again? “… today’s processors … are nearing an impasse as technologies approach the speed of light..” David Mitchell, The Transputer: The Time Is Now (1989)
- Transputer had bad timing (Uniprocessor performance↑) ⇒ Procrastination rewarded: 2X seq. perf. / 1.5 years
- “We are dedicating all of our future product development to multicore designs. … This is a sea change in computing” Paul Otellini, President, Intel (2005)
- All microprocessor companies switch to MP (2X CPUs / 2 yrs) ⇒ Procrastination penalized: 2X sequential perf. / 5 yrs
Manufacturer/Year (^) AMD/’05 Intel/’06 IBM/’04 Sun/’
Processors/chip 2 2 2
Threads/Processor (^1 2 2 )
Threads/chip 2 4 4 32 Docsity.com
21 st^ Century Computer Architecture
- Old CW: Since cannot know future programs, find
set of old programs to evaluate designs of
computers for the future
- E.g., SPEC
- What about parallel codes?
- Few available, tied to old models, languages, architectures, …
- New approach: Design computers of future for
numerical methods important in future
- Claim: key methods for next decade are 7
dwarves (+ a few), so design for them!
- Representative codes may vary over time, but these numerical methods will be important for > 10 years Docsity.com
6/11 Dwarves Covers 24/30 SPEC
- SPECfp
- 8 Structured grid
- 3 using Adaptive Mesh Refinement
- 2 Sparse linear algebra
- 2 Particle methods
- 5 TBD: Ray tracer, Speech Recognition, Quantum Chemistry, Lattice Quantum Chromodynamics (many kernels inside each benchmark?)
- SPECint
- 8 Finite State Machine
- 2 Sorting/Searching
- 2 Dense linear algebra (data type differs from dwarf)
- 1 TBD: 1 C compiler (many kernels?) Docsity.com
21 st^ Century Code Generation
• Old CW: Takes a decade for compilers to
introduce an architecture innovation
• New approach: “Auto-tuners” 1st run
variations of program on computer to find
best combinations of optimizations (blocking,
padding, …) and algorithms, then produce C
code to be compiled for that computer
- E.g., PHiPAC (BLAS), Atlas (BLAS),
Sparsity (Sparse linear algebra), Spiral (DSP), FFT-
W
- Can achieve 10X over conventional compiler Docsity.com
Best Sparse Blocking for 8 Computers
- All possible column block sizes selected for 8 computers; How could compiler know?
Intel Pentium M
Sun Ultra 2, Sun Ultra 3, AMD Opteron IBM Power 4, Intel/HP Itanium
Intel/HP Itanium 2
IBM
Power 3
8
4
2 1 1 2 4 8
row block size (r)
column block size (c)
Docsity.com
21 st^ Century Measures of Success
• Old CW: Don’t waste resources on accuracy,
reliability
- Speed kills competition
- Blame Microsoft for crashes
• New CW: SPUR is critical for future of IT
- S ecurity
- P rivacy
- U sability (cost of ownership)
- R eliability
• Success not limited to performance/cost“20th century vs. 21st century C&C: the SPUR manifesto,”
Communications of the ACM , 48:3, 2005.Docsity.com
Parallel Framework – Apps (so far) Original 7 dwarves: 6 data parallel, 1 no coupling TLP Bonus 4 dwarves: 2 data parallel, 2 no coupling TLP EEMBC (Embedded): Stream 10, DLP 19, Barrier TLP 2 SPEC (Desktop): 14 DLP, 2 no coupling TLP
E E M B C
E E M B C
S P E C
S P E C (^) D w a r f S
D W A R F S
Streaming DLP DLP No coupling TLP Barrier TLP Tight TLP
Most New Architectures
Most Important Apps?
Docsity.com
Outline
• Part I: A New Agenda for Computer
Architecture
- Old Conventional Wisdom vs. New Conventional
Wisdom
• Part II: A “Watering Hole” for Parallel Systems
- Research Accelerator for Multiple Processors
• Conclusion
Docsity.com
Build Academic MPP from
• As ≈ 25 CPUs will fit in Field Programmable Gate ArrayFPGAs
(FPGA), 1000-CPU system from ≈ 40 FPGAs?
- 16 32-bit simple “soft core” RISC at 150MHz in 2004
(Virtex-II)
- FPGA generations every 1.5 yrs; ≈ 2X CPUs, ≈ 1.2X clock
rate
- HW research community does logic design (“gate
shareware”) to create out-of-the-box, MPP
- E.g., 1000 processor, standard ISA binary-compatible,
64-bit, cache-coherent supercomputer @ ≈ 100
MHz/CPU in 2007
- RAMPants: Arvind (MIT), Krste Asanovíc (MIT), DerekDocsity.com
Why RAMP Good for Research MPP?
SMP Cluster Simulate RAMP
Scalability (1k CPUs) C A A A
Cost (1k CPUs) F ($40M) C ($2-3M) A+ ($0M) A ($0.1-0.2M)
Cost of ownership A D A A
Power/Space (kilowatts, racks)
D (120 kw, 12 racks)
D (120 kw, 12 racks)
A+ (.1 kw, 0.1 racks)
A (1.5 kw, 0.3 racks)
Community D A A A
Observability D C A+ A+
Reproducibility B D A+ A+
Reconfigurability D C A+ A+
Credibility A+ A+ F B+/A-
Perform. (clock) A (2 GHz) A (3 GHz) F (0 GHz) C (0.1-.2 GHz)
GPA C B- B A- Docsity.com