Download Maximizing On-chip Parallelism: Superscalar vs. Chip Multiprocessor - Prof. John M. Mellor and more Study notes Computer Science in PDF only on Docsity!
John Mellor-Crummey
Department of Computer Science
Rice University
Simultaneous Multithreading and the Case for Chip Multiprocessing COMP 522 Lecture 2 28 August 2008
Microprocessor Architecture (Mid 90’s)
- Superscalar (SS) designs were the state of the art
—multiple instruction issue
—dynamic scheduling: h/w tracks instruction dependencies
—speculative execution: looks past predicted branches
—non-blocking caches: multiple outstanding memory accesses
- Circuit density continuing to double every 18 months
—provides raw material for more logic
—enables higher clock frequencies
- Apparent path to higher performance?
—wider instruction issue
—support for more speculation
Superscalar Designs (e.g. R10K, PA-8K)
- 3 phases —instruction fetch —issue and retirement —execution
- Circuit-level performance limiters? —issue and retirement - issue instruction when all operands ready - register renaming: avoid artificial dependences mapping table: (operands * issue width) ports reorder buffer: # 1-bit comparators = operands * issue width * log 2 (# reg) * issue Q length - wider issue widths increase need for deeper issue Q’s for sufficient ||-ism claims: quadratic increase in size of issue Q long wires for bcast tags of issued instr. ⇒ ultimately limit cycle time —execution phase: quadratic^ ⇑^ register file, quadratic^ ⇑^ bypass logic - Farkas et al ‘96: 8-issue only 20% > 4-issue (reg file complexity limits perf))
PA-8K:
56-ins Q 20% die area The case for a single-chip multiprocessor, Olukotun et al., ASPLOS-VII, 1996.
Circuit Technology Impact
- Delays^ ⇑^ as issue queues^ ⇑^ and multi-port register files^ ⇑
- Increasing delays limit performance returns from wider issue
Sources of Wasted Issue Slots
- TLB miss —larger TLB; h/w instruction prefetching; h/w or s/w data prefetch
- I cache miss —larger icache, more icache associativity; h/w prefetch
- D cache miss —larger, more associative, prefetching, more dynamic execution
- Control hazard —speculative execution; aggressive if-conversion
- Branch misprediction —better prediction logic; lower mispredict penalty
- Load delays (L1 hits) —shorten load latency; better scheduling
- Instruction delays —better instruction scheduling
- Memory conflict (multiple access to same location in a cycle) —improved instruction scheduling
How Much IPC is There?
- Approach: study applications and evaluate their characteristics
—assess quantity and character of parallelism present
- Are there any pitfalls to this approach?
- Is there any better approach?
Intel’s Recognition, Mining, Synthesis Research Initiative
Compute-Intensive, Highly Parallel Applications and Uses Intel Technology Journal, 9(2), May 19, 2005 DOI: 10.1535/itj. http://www.intel.com/technology/itj/2005/volume09issue02/
Analysis of 8-issue Simulations
- No dominant cause of wasted cycles, but 61% vertical waste
- No dominant solution
—no single latency-tolerance technique likely to help dramatically
- Even if memory latencies eliminated, utilization < 40%
- Tullesen et. al. claim
—“instruction scheduling targets several important segments of
the wasted issue bandwidth, but we expect that our compiler has
already achieved most of the available gains in that regard”
(using Multiflow trace-scheduling compiler)
- If specific latency hiding mechanisms limited, then need general latency hiding solution to increase parallelism
What Should be Next?
- If not^ ⇑^ superscalar issue width, then what?
- Alternatives
—single chip multiprocessor
—simultaneous multithreaded processor
- How should we decide?
- Best approach depends upon application characteristics
The Case for a Single Chip Multiprocessor Two motivations
—delays^ ⇑^ as^ issue queues^ ⇑^ and multi-port register files grow
—increasing delays limit performance returns from wider issue
—limited IPC is a problem
Comparing Alternative Designs
- Consider two microarchitectures
—6-way superscalar architecture
—4 x 2-way superscalar multiprocessor architecture
Integer Benchmarks
- eqntott: translates logic equations into truth tables
—manually parallelized bit vector comparison routine (90% time)
- compress: compresses and uncompresses files in memory
—unmodified on both SS and SMT architectures
- m88ksim: simulates Motorola 88000 CPU
—manually parallelized into 3 threads using SUIF compiler
- threads simulate different instructions in different phases parallelization analogous to the h/w pipelining it simulates
- MPSim: simulates a bus-based multiprocessor
—manually assign parts of model hierarchy to different threads
—4 threads: one for each simulated CPU
The case for a single-chip multiprocessor, Olukotun et al., ASPLOS-VII, 1996.
FP and Multiprogramming Benchmarks Floating point applications (all parallelized with SUIF)
- applu: solves parabolic/eliptic PDEs
- apsi: computes temp, wind, velocity, and distrib. of pollutants
- swim: shallow water model with 1k x 1k grid
- tomcatv: generates mesh using Thompson solver Multiprogramming application
- pmake: performs parallel make of gnuchess (C compiler)
—same application simulated on both architectures
—OS exploits extra PEs in MP architecture to run parallel compiles
The case for a single-chip multiprocessor, Olukotun et al., ASPLOS-VII, 1996.
Cache Performance SS vs. CMP
2-issue
6-issue
- MPCI = misses/completed instruction
- Cache miss rates inflated for speculative OO PE —misses for speculative instructions that abort —non-blocking caches: multiple misses/line
4x2-issue
Performance: 4 x 2 CMP vs. 6-issue SS
- Codes with coarse-grain thread-level parallelism
and multiprogramming workloads
—CMP performs 50-100% better than wide superscalar —SS less able to dynamically extract parallelism from 128-wide instruction window
- Non-parallelizable codes —wide superscalar architecture performs up to 30% better
- Codes with fine-grain thread-level
parallelism
—wide superscalar architecture is at most 10% better w/ same clock frequency —expect that simpler CPUs in CMP would support higher clock rates that would eliminate this difference