Maximizing On-chip Parallelism: Superscalar vs. Chip Multiprocessor - Prof. John M. Mellor, Study notes of Computer Science

A lecture note from rice university's comp 522 course, focusing on the comparison of superscalar and chip multiprocessor architectures. The lecture covers the limitations of superscalar designs, the case for chip multiprocessors, and the performance comparison of two microarchitectures. It also includes benchmark results and discussions on instruction-level parallelism and cache performance.

Typology: Study notes

Pre 2010

Uploaded on 08/16/2009

koofers-user-iax
koofers-user-iax 🇺🇸

5

(1)

10 documents

1 / 36

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
John Mellor-Crummey
Department of Computer Science
Rice University
Simultaneous Multithreading
and
the Case for Chip Multiprocessing
COMP 522 Lecture 2 28 August 2008
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e
pf1f
pf20
pf21
pf22
pf23
pf24

Partial preview of the text

Download Maximizing On-chip Parallelism: Superscalar vs. Chip Multiprocessor - Prof. John M. Mellor and more Study notes Computer Science in PDF only on Docsity!

John Mellor-Crummey

Department of Computer Science

Rice University

[email protected]

Simultaneous Multithreading and the Case for Chip Multiprocessing COMP 522 Lecture 2 28 August 2008

Microprocessor Architecture (Mid 90’s)

  • Superscalar (SS) designs were the state of the art

—multiple instruction issue

—dynamic scheduling: h/w tracks instruction dependencies

—speculative execution: looks past predicted branches

—non-blocking caches: multiple outstanding memory accesses

  • Circuit density continuing to double every 18 months

—provides raw material for more logic

—enables higher clock frequencies

  • Apparent path to higher performance?

—wider instruction issue

—support for more speculation

Superscalar Designs (e.g. R10K, PA-8K)

  • 3 phases —instruction fetch —issue and retirement —execution
  • Circuit-level performance limiters? —issue and retirement - issue instruction when all operands ready - register renaming: avoid artificial dependences mapping table: (operands * issue width) ports reorder buffer: # 1-bit comparators = operands * issue width * log 2 (# reg) * issue Q length - wider issue widths increase need for deeper issue Q’s for sufficient ||-ism claims: quadratic increase in size of issue Q long wires for bcast tags of issued instr. ⇒ ultimately limit cycle time —execution phase: quadratic^ ⇑^ register file, quadratic^ ⇑^ bypass logic - Farkas et al ‘96: 8-issue only 20% > 4-issue (reg file complexity limits perf))

PA-8K:

56-ins Q 20% die area The case for a single-chip multiprocessor, Olukotun et al., ASPLOS-VII, 1996.

Circuit Technology Impact

  • Delays^ ⇑^ as issue queues^ ⇑^ and multi-port register files^ ⇑
  • Increasing delays limit performance returns from wider issue

Sources of Wasted Issue Slots

  • TLB miss —larger TLB; h/w instruction prefetching; h/w or s/w data prefetch
  • I cache miss —larger icache, more icache associativity; h/w prefetch
  • D cache miss —larger, more associative, prefetching, more dynamic execution
  • Control hazard —speculative execution; aggressive if-conversion
  • Branch misprediction —better prediction logic; lower mispredict penalty
  • Load delays (L1 hits) —shorten load latency; better scheduling
  • Instruction delays —better instruction scheduling
  • Memory conflict (multiple access to same location in a cycle) —improved instruction scheduling

How Much IPC is There?

  • Approach: study applications and evaluate their characteristics

—assess quantity and character of parallelism present

  • Are there any pitfalls to this approach?
  • Is there any better approach?

Intel’s Recognition, Mining, Synthesis Research Initiative

Compute-Intensive, Highly Parallel Applications and Uses Intel Technology Journal, 9(2), May 19, 2005 DOI: 10.1535/itj. http://www.intel.com/technology/itj/2005/volume09issue02/

Analysis of 8-issue Simulations

  • No dominant cause of wasted cycles, but 61% vertical waste
  • No dominant solution

—no single latency-tolerance technique likely to help dramatically

  • Even if memory latencies eliminated, utilization < 40%
  • Tullesen et. al. claim

—“instruction scheduling targets several important segments of

the wasted issue bandwidth, but we expect that our compiler has

already achieved most of the available gains in that regard”

(using Multiflow trace-scheduling compiler)

  • If specific latency hiding mechanisms limited, then need general latency hiding solution to increase parallelism

What Should be Next?

  • If not^ ⇑^ superscalar issue width, then what?
  • Alternatives

—single chip multiprocessor

—simultaneous multithreaded processor

  • How should we decide?
  • Best approach depends upon application characteristics

The Case for a Single Chip Multiprocessor Two motivations

  • “Technology push”

—delays^ ⇑^ as^ issue queues^ ⇑^ and multi-port register files grow

—increasing delays limit performance returns from wider issue

  • “Application pull”

—limited IPC is a problem

Comparing Alternative Designs

  • Consider two microarchitectures

—6-way superscalar architecture

—4 x 2-way superscalar multiprocessor architecture

Integer Benchmarks

  • eqntott: translates logic equations into truth tables

—manually parallelized bit vector comparison routine (90% time)

  • compress: compresses and uncompresses files in memory

—unmodified on both SS and SMT architectures

  • m88ksim: simulates Motorola 88000 CPU

—manually parallelized into 3 threads using SUIF compiler

  • threads simulate different instructions in different phases parallelization analogous to the h/w pipelining it simulates
  • MPSim: simulates a bus-based multiprocessor

—manually assign parts of model hierarchy to different threads

—4 threads: one for each simulated CPU

The case for a single-chip multiprocessor, Olukotun et al., ASPLOS-VII, 1996.

FP and Multiprogramming Benchmarks Floating point applications (all parallelized with SUIF)

  • applu: solves parabolic/eliptic PDEs
  • apsi: computes temp, wind, velocity, and distrib. of pollutants
  • swim: shallow water model with 1k x 1k grid
  • tomcatv: generates mesh using Thompson solver Multiprogramming application
  • pmake: performs parallel make of gnuchess (C compiler)

—same application simulated on both architectures

—OS exploits extra PEs in MP architecture to run parallel compiles

The case for a single-chip multiprocessor, Olukotun et al., ASPLOS-VII, 1996.

Cache Performance SS vs. CMP

2-issue

6-issue

  • MPCI = misses/completed instruction
  • Cache miss rates inflated for speculative OO PE —misses for speculative instructions that abort —non-blocking caches: multiple misses/line

4x2-issue

Performance: 4 x 2 CMP vs. 6-issue SS

  • Codes with coarse-grain thread-level parallelism

and multiprogramming workloads

—CMP performs 50-100% better than wide superscalar —SS less able to dynamically extract parallelism from 128-wide instruction window

  • Non-parallelizable codes —wide superscalar architecture performs up to 30% better
  • Codes with fine-grain thread-level

parallelism

—wide superscalar architecture is at most 10% better w/ same clock frequency —expect that simpler CPUs in CMP would support higher clock rates that would eliminate this difference