Prepare for your exams
Get points
Guidelines and tips
Sell on Docsity
Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Find documents

Prepare for your exams with the study notes shared by other students like you on Docsity

Search for your university

Find the specific documents for your university's exams

Docsity AINEW

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

20 Points

For each uploaded document

Answer questions

5 Points

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

Maximizing On-chip Parallelism: Superscalar vs. Chip Multiprocessor - Prof. John M. Mellor, Study notes of Computer Science

Rice University Computer Science

Prof. John M. Mellor-Crummey

A lecture note from rice university's comp 522 course, focusing on the comparison of superscalar and chip multiprocessor architectures. The lecture covers the limitations of superscalar designs, the case for chip multiprocessors, and the performance comparison of two microarchitectures. It also includes benchmark results and discussions on instruction-level parallelism and cache performance.

Typology: Study notes

Pre 2010

Uploaded on 08/16/2009

koofers-user-iax 🇺🇸

(1)

10 documents

1 / 36

This page cannot be seen from the preview

Don't miss anything!

John Mellor-Crummey

Department of Computer Science

Rice University

[email protected]

Simultaneous Multithreading

and

the Case for Chip Multiprocessing

COMP 522 Lecture 2 28 August 2008

Discover Study notes of Computer Science Rice University

Partial preview of the text

Download Maximizing On-chip Parallelism: Superscalar vs. Chip Multiprocessor - Prof. John M. Mellor and more Study notes Computer Science in PDF only on Docsity!

John Mellor-Crummey

Department of Computer Science

Rice University

[email protected]

Simultaneous Multithreading and the Case for Chip Multiprocessing COMP 522 Lecture 2 28 August 2008

Microprocessor Architecture (Mid 90’s)

Superscalar (SS) designs were the state of the art

—multiple instruction issue

—dynamic scheduling: h/w tracks instruction dependencies

—speculative execution: looks past predicted branches

—non-blocking caches: multiple outstanding memory accesses

Circuit density continuing to double every 18 months

—provides raw material for more logic

—enables higher clock frequencies

Apparent path to higher performance?

—wider instruction issue

—support for more speculation

Superscalar Designs (e.g. R10K, PA-8K)

3 phases —instruction fetch —issue and retirement —execution
Circuit-level performance limiters? —issue and retirement - issue instruction when all operands ready - register renaming: avoid artificial dependences mapping table: (operands * issue width) ports reorder buffer: # 1-bit comparators = operands * issue width * log 2 (# reg) * issue Q length - wider issue widths increase need for deeper issue Q’s for sufficient ||-ism claims: quadratic increase in size of issue Q long wires for bcast tags of issued instr. ⇒ ultimately limit cycle time —execution phase: quadratic^ ⇑^ register file, quadratic^ ⇑^ bypass logic - Farkas et al ‘96: 8-issue only 20% > 4-issue (reg file complexity limits perf))

PA-8K:

56-ins Q 20% die area The case for a single-chip multiprocessor, Olukotun et al., ASPLOS-VII, 1996.

Circuit Technology Impact

Delays^ ⇑^ as issue queues^ ⇑^ and multi-port register files^ ⇑
Increasing delays limit performance returns from wider issue

Sources of Wasted Issue Slots

TLB miss —larger TLB; h/w instruction prefetching; h/w or s/w data prefetch
I cache miss —larger icache, more icache associativity; h/w prefetch
D cache miss —larger, more associative, prefetching, more dynamic execution
Control hazard —speculative execution; aggressive if-conversion
Branch misprediction —better prediction logic; lower mispredict penalty
Load delays (L1 hits) —shorten load latency; better scheduling
Instruction delays —better instruction scheduling
Memory conflict (multiple access to same location in a cycle) —improved instruction scheduling

How Much IPC is There?

Approach: study applications and evaluate their characteristics

—assess quantity and character of parallelism present

Are there any pitfalls to this approach?
Is there any better approach?

Intel’s Recognition, Mining, Synthesis Research Initiative

Compute-Intensive, Highly Parallel Applications and Uses Intel Technology Journal, 9(2), May 19, 2005 DOI: 10.1535/itj. http://www.intel.com/technology/itj/2005/volume09issue02/

Analysis of 8-issue Simulations

No dominant cause of wasted cycles, but 61% vertical waste
No dominant solution

—no single latency-tolerance technique likely to help dramatically

Even if memory latencies eliminated, utilization < 40%
Tullesen et. al. claim

—“instruction scheduling targets several important segments of

the wasted issue bandwidth, but we expect that our compiler has

already achieved most of the available gains in that regard”

(using Multiflow trace-scheduling compiler)

If specific latency hiding mechanisms limited, then need general latency hiding solution to increase parallelism

What Should be Next?

If not^ ⇑^ superscalar issue width, then what?
Alternatives

—single chip multiprocessor

—simultaneous multithreaded processor

How should we decide?
Best approach depends upon application characteristics

The Case for a Single Chip Multiprocessor Two motivations

“Technology push”

—delays^ ⇑^ as^ issue queues^ ⇑^ and multi-port register files grow

—increasing delays limit performance returns from wider issue

“Application pull”

—limited IPC is a problem

Comparing Alternative Designs

Consider two microarchitectures

—6-way superscalar architecture

—4 x 2-way superscalar multiprocessor architecture

Integer Benchmarks

eqntott: translates logic equations into truth tables

—manually parallelized bit vector comparison routine (90% time)

compress: compresses and uncompresses files in memory

—unmodified on both SS and SMT architectures

m88ksim: simulates Motorola 88000 CPU

—manually parallelized into 3 threads using SUIF compiler

threads simulate different instructions in different phases parallelization analogous to the h/w pipelining it simulates
MPSim: simulates a bus-based multiprocessor

—manually assign parts of model hierarchy to different threads

—4 threads: one for each simulated CPU

The case for a single-chip multiprocessor, Olukotun et al., ASPLOS-VII, 1996.

FP and Multiprogramming Benchmarks Floating point applications (all parallelized with SUIF)

applu: solves parabolic/eliptic PDEs
apsi: computes temp, wind, velocity, and distrib. of pollutants
swim: shallow water model with 1k x 1k grid
tomcatv: generates mesh using Thompson solver Multiprogramming application
pmake: performs parallel make of gnuchess (C compiler)

—same application simulated on both architectures

—OS exploits extra PEs in MP architecture to run parallel compiles

The case for a single-chip multiprocessor, Olukotun et al., ASPLOS-VII, 1996.

Cache Performance SS vs. CMP

2-issue

6-issue

MPCI = misses/completed instruction
Cache miss rates inflated for speculative OO PE —misses for speculative instructions that abort —non-blocking caches: multiple misses/line

4x2-issue

Performance: 4 x 2 CMP vs. 6-issue SS

Codes with coarse-grain thread-level parallelism

and multiprogramming workloads

—CMP performs 50-100% better than wide superscalar —SS less able to dynamically extract parallelism from 128-wide instruction window

Non-parallelizable codes —wide superscalar architecture performs up to 30% better
Codes with fine-grain thread-level

parallelism

—wide superscalar architecture is at most 10% better w/ same clock frequency —expect that simpler CPUs in CMP would support higher clock rates that would eliminate this difference

Maximizing On-chip Parallelism: Superscalar vs. Chip Multiprocessor - Prof. John M. Mellor, Study notes of Computer Science

Related documents

Partial preview of the text

Download Maximizing On-chip Parallelism: Superscalar vs. Chip Multiprocessor - Prof. John M. Mellor and more Study notes Computer Science in PDF only on Docsity!

John Mellor-Crummey

Department of Computer Science

Rice University

[email protected]

—multiple instruction issue

—dynamic scheduling: h/w tracks instruction dependencies

—speculative execution: looks past predicted branches

—non-blocking caches: multiple outstanding memory accesses

—provides raw material for more logic

—enables higher clock frequencies

—wider instruction issue

—support for more speculation

PA-8K:

—assess quantity and character of parallelism present

Intel’s Recognition, Mining, Synthesis Research Initiative

—no single latency-tolerance technique likely to help dramatically

—“instruction scheduling targets several important segments of

the wasted issue bandwidth, but we expect that our compiler has

already achieved most of the available gains in that regard”

(using Multiflow trace-scheduling compiler)

—single chip multiprocessor

—simultaneous multithreaded processor

—delays^ ⇑^ as^ issue queues^ ⇑^ and multi-port register files grow

—increasing delays limit performance returns from wider issue

—limited IPC is a problem

—6-way superscalar architecture

—4 x 2-way superscalar multiprocessor architecture

—manually parallelized bit vector comparison routine (90% time)

—unmodified on both SS and SMT architectures

—manually parallelized into 3 threads using SUIF compiler

—manually assign parts of model hierarchy to different threads

—4 threads: one for each simulated CPU

—same application simulated on both architectures

—OS exploits extra PEs in MP architecture to run parallel compiles

2-issue

6-issue

4x2-issue

and multiprogramming workloads

parallelism