Multiple Issue Processors: Superscalar, VLIW/EPIC and Performance Considerations, Study notes of Advanced Computer Architecture

Osmania University Advanced Computer Architecture

Covers multiple-issue processors, including superscalar and VLIW/EPIC. Topics include instruction fetch, pipeline stages, issue packets, and the challenges of achieving high instructions per clock (IPC) rates. The document also touches on performance and cost considerations.

Typology: Study notes

2013/2014

Uploaded on 06/03/2014

nagesh 🇮🇳


Summary of discussions

ILP processors
- VLIW/EPIC, Superscalar

Superscalar has hardware logic for extracting parallelism
- Solutions for stalls etc. must be provided in hardware

Stalls play an even greater role in ILP processors

Software solutions, such as code scheduling through code movement, can lead to improved execution times
- More sophisticated techniques needed
- Can we provide some H/W support to help the compiler? This leads to EPIC/VLIW

Multiple Issue ILP Processors

In a statically scheduled superscalar, instructions issue in order, and all pipeline hazards are checked at issue time
- An instruction causing a hazard forces subsequent instructions to stall

In a statically scheduled VLIW, the compiler generates multiple-issue packets of instructions

During instruction fetch, the pipeline receives a number of instructions from the IF stage: the issue packet
- Examine each instruction in the packet: if no hazard then issue, else wait
- Issue unit examines all instructions in the packet
  • Complexity implies further splitting of the issue stage

Getting CPI < 1: Issuing Multiple Instructions/Cycle

Vector Processing: explicit coding of independent loops as operations on large vectors of numbers
- Multimedia instructions being added to many processors

Superscalar: varying number of instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo)
- IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4

(Very) Long Instruction Words (V)LIW: fixed number of instructions (4-16) scheduled by the compiler; put ops into wide templates (TBD)
- Intel Architecture-64 (IA-64) 64-bit address
  • Renamed: “Explicitly Parallel Instruction Computer (EPIC)”

Anticipated success of issuing multiple instructions led to the use of Instructions Per Clock cycle (IPC) vs. CPI

Getting CPI < 1: Issuing Multiple Instructions/Cycle

Superscalar MIPS: 2 instructions, 1 FP & 1 anything
- Fetch 64 bits/clock cycle; Int on left, FP on right
- Can only issue 2nd instruction if 1st instruction issues
- More ports for FP registers to do FP load & FP op in a pair

Type               Pipe stages
Int. instruction   IF  ID  EX  MEM WB
FP instruction     IF  ID  EX  MEM WB
Int. instruction       IF  ID  EX  MEM WB
FP instruction         IF  ID  EX  MEM WB
Int. instruction           IF  ID  EX  MEM WB
FP instruction             IF  ID  EX  MEM WB

1-cycle load delay expands to 3 instructions in the superscalar
- The instruction in the right half of the same pair can’t use the loaded value, nor can the two instructions in the next issue slot
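As a rough illustration of why the Int/FP pairing rarely reaches its peak rate, the sketch below computes a best-case CPI for the two-issue machine. It assumes a strict split (one integer slot and one FP slot per cycle), no hazards, and a compiler that pairs instructions perfectly. The function name `best_case_cpi` is made up for this example.

```python
def best_case_cpi(fp_fraction):
    """Best-case CPI for a 2-issue machine with one integer slot and one FP slot.

    Assumptions (not from the notes): strict Int/FP split, no hazards,
    and perfect pairing of instructions by the compiler.
    """
    if not 0.0 <= fp_fraction <= 1.0:
        raise ValueError("fp_fraction must be between 0 and 1")
    # Each cycle retires at most one integer and one FP instruction, so the
    # more common instruction class determines the number of cycles needed.
    return max(fp_fraction, 1.0 - fp_fraction)


if __name__ == "__main__":
    for f in (0.0, 0.25, 0.5, 0.75):
        print(f"FP fraction {f:.2f}: best-case CPI {best_case_cpi(f):.2f}")
```

Under these assumptions the machine reaches CPI 0.5 only when exactly half of the instructions are FP; with no FP instructions the FP slot stays empty and CPI falls back to 1.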


Multiple Issue Issues

issue packet: group of instructions from the fetch unit that could potentially issue in 1 clock
- If an instruction causes a structural hazard or a data hazard, either due to an earlier instruction in execution or to an earlier instruction in the issue packet, then the instruction does not issue
- 0 to N instructions issue per clock cycle, for N-issue

Performing issue checks in 1 cycle could limit clock cycle time: O(n² − n) comparisons
- => issue stage usually split and pipelined
- 1st stage decides how many instructions from within this packet can issue; 2nd stage examines hazards among the selected instructions and those already issued
- => higher branch penalties => prediction accuracy important
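The issue check described above can be sketched in a few lines. This is a simplified model, not the logic of any particular processor: it looks only at register data hazards (RAW and WAW), ignores structural hazards, and the names `Inst`, `issue_count` and `pairwise_comparisons` are invented for the example. The comparison count shows one way to arrive at n² − n, assuming two source registers per instruction, each compared against the destination of every earlier instruction in the packet.

```python
from typing import NamedTuple, Tuple


class Inst(NamedTuple):
    dest: str               # destination register
    srcs: Tuple[str, ...]   # source registers


def issue_count(packet, pending=frozenset()):
    """Number of leading instructions in the packet that can issue this cycle.

    `pending` holds registers whose results are still being produced by
    instructions already in execution. Issue is in order, so the first
    instruction with a hazard blocks everything behind it.
    """
    unavailable = set(pending)
    issued = 0
    for inst in packet:
        raw = any(src in unavailable for src in inst.srcs)
        waw = inst.dest in unavailable
        if raw or waw:
            break
        unavailable.add(inst.dest)
        issued += 1
    return issued


def pairwise_comparisons(n):
    """2 sources x n(n-1)/2 ordered pairs of instructions = n^2 - n."""
    return n * n - n


if __name__ == "__main__":
    packet = [
        Inst("r1", ("r2", "r3")),
        Inst("r4", ("r1", "r2")),   # reads r1 written by the first instruction
        Inst("r5", ("r6", "r7")),
    ]
    print(issue_count(packet))        # only the first instruction issues
    print(pairwise_comparisons(4))
```

Because issue is in order, the independent third instruction also waits; 0 to N instructions issue per cycle depending on where the first hazard falls.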

Multiple Issue Challenges

While the Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with:
- Exactly 50% FP operations, AND
- No hazards

If more instructions issue at the same time, greater difficulty of decode and issue:
- Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue (N-issue ~O(N² − N) comparisons)
- Register file: need 2x reads and 1x writes/cycle
- Rename logic: must be able to rename the same register multiple times in one cycle! For instance, consider 4-way issue:

    add r1, r2, r3        add p11, p4, p?
    sub r4, r1, r2   ⇒    sub p22, p11, p4
    lw  r1, 4(r4)         lw  p23, 4(p22)
    add r5, r1, r2        add p12, p23, p4

  Imagine doing this transformation in a single cycle!
- Result buses: need to complete multiple instructions/cycle
  • So, need multiple buses with associated matching logic at every reservation station
  • Or, need multiple forwarding paths
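The 4-way renaming example can be reproduced with a small sequential sketch. It is only a software model of what the rename hardware must achieve in one cycle. The function name `rename_packet` and the tuple layout are invented here, the free-list order is chosen to match the physical register names in the example, and the initial mapping of r3 to p7 is an assumption because the source truncates that operand.

```python
def rename_packet(packet, rename_table, free_list):
    """Rename a packet of (op, dest, srcs) tuples in program order.

    rename_table maps architectural registers to physical registers;
    free_list supplies fresh physical registers for destinations.
    Neither input is modified.
    """
    table = dict(rename_table)
    free = list(free_list)
    renamed = []
    for op, dest, srcs in packet:
        # Sources must see mappings created by earlier instructions
        # in the same packet (e.g. r1 -> p11, then later r1 -> p23).
        phys_srcs = tuple(table[s] for s in srcs)
        phys_dest = free.pop(0)
        table[dest] = phys_dest
        renamed.append((op, phys_dest, phys_srcs))
    return renamed


packet = [
    ("add", "r1", ("r2", "r3")),
    ("sub", "r4", ("r1", "r2")),
    ("lw",  "r1", ("r4",)),        # the offset 4 is not a register, so it is omitted
    ("add", "r5", ("r1", "r2")),
]
initial = {"r2": "p4", "r3": "p7"}     # p7 for r3 is an assumed value
free = ["p11", "p22", "p23", "p12"]    # order chosen to match the example

if __name__ == "__main__":
    for op, dest, srcs in rename_packet(packet, initial, free):
        print(op, dest, ", ".join(srcs))
```

Note that r1 is renamed twice within the packet, and the last add must pick up the second mapping (p23). Hardware has to resolve these dependent lookups in parallel rather than one after another as the loop does.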

Summary of Course Performance and Cost

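Amdahl’s Law:

$$\text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$

CPI Law:

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

Designing to Last through Trends

             Capacity          Speed
Logic        2x in 3 years     2x in 3 years
DRAM         4x in 4 years     2x in 10 years
Disk         4x in 3 years     2x in 5 years
Processor                      2x every 1.5 years?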
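Amdahl's Law and the CPI Law from this summary can be tried out numerically with a short sketch. The function names and the sample numbers are illustrative choices, not values from the notes.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of execution time is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)


def cpu_time(instructions, cpi, cycle_time_s):
    """CPU time = instructions/program x cycles/instruction x seconds/cycle."""
    return instructions * cpi * cycle_time_s


if __name__ == "__main__":
    # Speeding up half of the execution time by 2x gives about 1.33x overall.
    print(amdahl_speedup(0.5, 2.0))
    # 1e9 instructions at CPI 0.5 with a 1 ns cycle take about 0.5 s.
    print(cpu_time(1e9, 0.5, 1e-9))
```

The second call ties back to multiple issue: halving CPI from 1 to 0.5 halves CPU time only if the instruction count and cycle time stay unchanged.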

Software Scheduling

Instruction Level Parallelism (ILP) found either by compiler or hardware

Loop level parallelism is easiest to see
- SW dependencies/compiler sophistication determine if compiler can unroll loops
- Memory dependencies hardest to determine => memory disambiguation
- Very sophisticated transformations available

Trace Scheduling to parallelize if statements

Superscalar and VLIW: CPI < 1 (IPC > 1)
- Dynamic issue vs. static issue
- More instructions issue at same time => larger hazard penalty
- Limitation is often the number of instructions that you can successfully fetch and decode per cycle
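Loop unrolling, mentioned in this summary as the transformation that exposes loop-level parallelism, can be illustrated with a small sketch. The function names and the unroll factor of 4 are choices made for this example. A real compiler applies the transformation to machine-level code; Python is used here only to show the shape of the transformed loop.

```python
def axpy_rolled(a, x, y):
    """y[i] = a * x[i] + y[i], one element per iteration."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]


def axpy_unrolled4(a, x, y):
    """Same computation with the loop body unrolled four times.

    The four statements in the body touch different elements, so they are
    independent of each other and could be scheduled in the same issue packet.
    """
    n = len(x)
    i = 0
    while i + 4 <= n:
        y[i] = a * x[i] + y[i]
        y[i + 1] = a * x[i + 1] + y[i + 1]
        y[i + 2] = a * x[i + 2] + y[i + 2]
        y[i + 3] = a * x[i + 3] + y[i + 3]
        i += 4
    # Cleanup loop for the leftover elements when n is not a multiple of 4.
    while i < n:
        y[i] = a * x[i] + y[i]
        i += 1


if __name__ == "__main__":
    x = list(range(10))
    y1 = [1] * 10
    y2 = [1] * 10
    axpy_rolled(2, x, y1)
    axpy_unrolled4(2, x, y2)
    print(y1 == y2)
```

The unrolled version is only legal because the iterations do not depend on one another. If `x` and `y` could overlap in memory with an offset, the compiler would first have to prove independence, which is the memory disambiguation problem noted above.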

EPIC/VLIW

What did IA-64/EPIC do well besides floating point programs?
- Was the only difference the 64-bit address vs. 32-bit address?
- What happened to the AMD 64-bit address 80x86 proposal?

What happened on EPIC code size vs. x86?

Did anybody propose anything at the ISA level to help with software quality? Availability? Security?

Hardware versus Software Speculation Mechanisms

To speculate extensively, must be able to disambiguate memory references
- Much easier in HW than in SW for code with pointers

HW-based speculation works better when control flow is unpredictable, and when HW-based branch prediction is superior to SW-based branch prediction done at compile time
- Mispredictions mean wasted speculation

HW-based speculation maintains precise exception model even for speculated instructions

HW-based speculation does not require compensation or bookkeeping code

Hardware versus Software Speculation Mechanisms cont’d

Compiler-based approaches may benefit from the ability to see further in the code sequence, resulting in better code scheduling

HW-based speculation with dynamic scheduling does not require different code sequences to achieve good performance for different implementations of an architecture
- May be the most important in the long run?

IA-64 EPIC vs. Classic VLIW

Similarities:
- Compiler generated wide instructions
- Static detection of dependencies
- ILP encoded in the binary (a group)
- Large number of architected registers

Differences:
- Instructions in a bundle can have dependencies
- Hardware interlock between dependent instructions
- Accommodates varying number of functional units and latencies
- Allows dynamic scheduling and functional unit binding
- Static scheduling is “suggestive” rather than absolute
  ⇒ Code compatibility across generations, but software won’t run at top speed until it is recompiled, so a “shrink-wrap binary” might need to include multiple builds

EPIC and Compiler Optimization

EPIC requires dependency-free “scheduled code”

Burden of extracting parallelism falls on the compiler

Success of EPIC architectures depends on the efficiency of compilers!

We provide an overview of compiler optimization techniques (as they apply to EPIC/ILP)

Introduction to Compiler Optimization

Hardware-Software Interface

Machine
- Available resources statically fixed
- Designed to support wide variety of programs
- Interested in running many programs fast

Program
- Required resources dynamically varying
- Designed to run well on a variety of machines
- Interested in having itself run fast

Performance = t_cyc × CPI × code size
- Reflects how well the machine resources match the program requirements
