Explicitly Parallel Instruction Computing - Computer Design and Organisation - Lecture Slides, Slides of Computer Science

These are the lecture slides for Computer Design and Organisation, which cover Brick and Mortar, Silicon Manufacturing, Cost of Production, Mortar Chips, Fixed Set of Functions, Standard Interface, Benefits of Brick and Mortar, Chip Design, etc. Key points are: Explicitly Parallel Instruction Computing, Sequence of Steps, Possibilities of Optimization, Static Scheduling Techniques, Partial Predication, Predication Benefits, Itanium Overview.

Typology: Slides

2012/2013

Uploaded on 03/22/2013

dhirendra

The EPIC-VLIW Approach

• Explicitly Parallel Instruction Computing (EPIC) is a "philosophy"
• Very Long Instruction Word (VLIW) is an implementation of EPIC
• Concept derives from horizontal microprogramming, namely:
– A sequence of steps (micro-operations) that interprets the ISA
– If only one micro-op per cycle: vertical microprogramming
– If several units (at the extreme, all of them; say, incr PC, add, f-p, register file read, register file write, etc.) can be activated in the same cycle: horizontal microprogramming
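
The vertical/horizontal distinction above can be illustrated with a minimal Python sketch. The unit names are taken from the slide's examples; the one-enable-bit-per-unit encoding and the two tiny micro-programs are assumptions made only for illustration, not a description of any real microcode format.

```python
# Hypothetical functional units, named after the examples on the slide.
UNITS = ["incr_pc", "add", "fp", "rf_read", "rf_write"]

def encode_horizontal(active_units):
    """Pack the units activated in one cycle into a control word,
    one enable bit per unit (bit i corresponds to UNITS[i])."""
    word = 0
    for unit in active_units:
        word |= 1 << UNITS.index(unit)
    return word

def decode_horizontal(word):
    """Recover the list of activated units from a control word."""
    return [u for i, u in enumerate(UNITS) if (word >> i) & 1]

# Vertical microprogramming: one micro-op per cycle.
vertical_program = [["rf_read"], ["add"], ["rf_write"], ["incr_pc"]]

# Horizontal microprogramming: independent micro-ops share a cycle.
horizontal_program = [["rf_read", "incr_pc"], ["add"], ["rf_write"]]

control_words = [encode_horizontal(step) for step in horizontal_program]
```

In this toy example the horizontal version needs 3 cycles instead of 4 because the PC increment overlaps with the register read.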

The EPIC "philosophy"

  • Compiler generates packets, or bundles, of instructions that can execute together
  • Instructions are executed in order (static scheduling) and assumed to have a fixed latency
  • Architecture should provide features that assist the compiler in exploiting ILP
  • Branch prediction, load speculation (see later), and associated recoveries
  • Difficulties occur with unpredictable latencies:
    • Branch prediction → use of predication in addition to static and dynamic branch prediction
    • Pointer-based computations → use of cache hints, speculative loads
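
As a rough illustration of how a compiler might form bundles of instructions that can execute together, here is a minimal Python sketch. It assumes a hypothetical three-operand instruction format (`Instr`), a bundle width of 3, and a deliberately conservative rule: an instruction starts a new bundle whenever it has any register dependence (RAW, WAW, or WAR) on the current bundle. Real EPIC compilers use far more elaborate schedulers.

```python
from typing import NamedTuple, Tuple

class Instr(NamedTuple):
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read

def form_bundles(instrs, width=3):
    """Greedy in-order bundling: close the current bundle when it is
    full or when the next instruction depends on something in it."""
    bundles, current = [], []
    for ins in instrs:
        written = {i.dest for i in current}
        read = {s for i in current for s in i.srcs}
        conflict = (
            any(s in written for s in ins.srcs)  # read after write
            or ins.dest in written               # write after write
            or ins.dest in read                  # write after read
        )
        if current and (conflict or len(current) == width):
            bundles.append(current)
            current = []
        current.append(ins)
    if current:
        bundles.append(current)
    return bundles

program = [
    Instr("r1", ("r2", "r3")),
    Instr("r4", ("r5", "r6")),
    Instr("r7", ("r1", "r4")),   # needs r1 and r4, so it starts a new bundle
    Instr("r8", ("r9",)),
]
bundle_sizes = [len(b) for b in form_bundles(program)]
```

For this program the sketch produces two bundles of two instructions each.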

Other Static Scheduling Techniques

• Eliminate branches via predication (next slides)
• Loop unrolling
• Software pipelining (see in a few slides)
• Use of global scheduling
– Trace scheduling technique: focus on the critical path
• Software prefetching
– We'll talk about prefetching at length later
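
Loop unrolling, one of the techniques listed above, can be sketched as follows. This is only an illustration in Python of the transformation a compiler would apply to machine code: the loop body is replicated four times (an arbitrary unroll factor chosen here) with independent accumulators, so a static scheduler has more independent operations per iteration, and a cleanup loop handles the leftover elements.

```python
def dot_rolled(a, b):
    """Reference loop: one multiply-add per iteration."""
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

def dot_unrolled4(a, b):
    """Same computation unrolled by 4 with separate accumulators."""
    n = len(a)
    s0 = s1 = s2 = s3 = 0.0
    i = 0
    while i + 4 <= n:
        # Four independent multiply-adds: no dependence between them.
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
        i += 4
    while i < n:
        # Cleanup loop for the remaining n mod 4 elements.
        s0 += a[i] * b[i]
        i += 1
    return (s0 + s1) + (s2 + s3)

a = [float(k) for k in range(1, 11)]
b = [2.0] * 10
```

Note that the unrolled version adds the products in a different order, so with general floating-point data the two results can differ in the last bits.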

Predication Basic Idea

• Associate a Boolean condition (predicate) with the issue, execution, or commit of an instruction
– The stage in which to test the predicate is an implementation choice
• If the predicate is true, the result of the instruction is kept
• If the predicate is false, the instruction is nullified
• Distinction between
– Partial predication: only a few opcodes can be predicated
– Full predication: every instruction is predicated
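
The basic idea can be sketched with a tiny interpreter for a hypothetical fully predicated instruction set. Everything about the encoding (the dictionary fields, the register names, and a predicate `p0` that is always true) is an assumption for illustration. The program is the if-converted form of `if (a < b) c = a + b; else c = a - b;`: both arms are fetched, and the predicate decides which result is kept.

```python
import operator

def execute(program, regs):
    """Run a straight-line predicated program. An instruction whose
    guard predicate is false is nullified (it has no effect)."""
    preds = {"p0": True}  # assumption: p0 is hard-wired to true
    for op in program:
        if not preds.get(op["guard"], False):
            continue  # predicate false: nullify
        outcome = op["fn"](regs[op["src1"]], regs[op["src2"]])
        if op["kind"] == "cmp":
            # A compare writes a predicate and its complement.
            preds[op["ptrue"]] = outcome
            preds[op["pfalse"]] = not outcome
        else:
            regs[op["dest"]] = outcome
    return regs

# if (a < b) c = a + b; else c = a - b;   with no branch
program = [
    {"guard": "p0", "kind": "cmp", "fn": operator.lt,
     "src1": "a", "src2": "b", "ptrue": "p1", "pfalse": "p2"},
    {"guard": "p1", "kind": "alu", "fn": operator.add,
     "dest": "c", "src1": "a", "src2": "b"},
    {"guard": "p2", "kind": "alu", "fn": operator.sub,
     "dest": "c", "src1": "a", "src2": "b"},
]

taken = execute(program, {"a": 3, "b": 5, "c": 0})["c"]      # a < b
not_taken = execute(program, {"a": 5, "b": 3, "c": 0})["c"]  # a >= b
```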

Predication Costs

• Increased fetch utilization
• Increased register consumption
• If predication is tested at commit time, increased functional-unit utilization
• With code movement, increased complexity of exception handling
– For example, insert extra instructions for exception checking
• If every instruction is predicated, larger instructions
– Impacts I-cache

Flavors of Predication Implementation

  • Has its roots in vector machines like CRAY-
    • Creation of vector masks to control vector operations on an element-by-element basis
  • Often (partial) predication is limited to conditional moves as, e.g., in the Alpha, MIPS 10000, IBM PowerPC, SPARC, and Intel P microarchitecture
  • Full predication: every instruction is predicated, as in Intel Itanium (IA-64 ISA)

Other Forms of Partial Predication

• Select dest, src1, src2, cond
– Corresponds to the C-like expression dest = (cond) ? src1 : src2
– Note that the destination register is always assigned a value
– Used in the Multiflow (first commercial VLIW machine)
• Nullify
– Any register-register instruction can nullify the next instruction, thus making it conditional
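
The difference between a select and a conditional move can be made concrete with a short Python sketch. The function names `select` and `cmov` are illustrative only: `select` always produces a value for the destination, whereas a conditional move leaves the old destination value in place when the condition is false.

```python
def select(cond, src1, src2):
    """dest = cond ? src1 : src2 (the destination is always written)."""
    return src1 if cond else src2

def cmov(dest, src, cond):
    """Conditional move: dest is overwritten only if cond is true."""
    return src if cond else dest

def abs_with_select(x):
    # Both candidate values are computed up front; select picks one,
    # so no branch is needed.
    negated = -x
    return select(x < 0, negated, x)
```

For example, `abs_with_select(-4)` evaluates to 4 and `cmov(1, 9, False)` leaves the destination at 1.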

Full Predication

• Define predicates with instructions of the form:

Pred_<cmp> Pout1<type>, Pout2<type>, src1, src2 (Pin)

where

– Pout1 and Pout2 are assigned values according to the comparison between src1 and src2 and the cmp "opcode"
– The predicate types are most often U (unconditional) and its complement Ū, and OR and its complement (OR with an overbar)
– The predicate define instruction can itself be predicated with the value of Pin
  • There are definite rules for that, e.g., if Pin = 0, U and Ū are set to 0 independently of the result of the comparison, and the OR predicates are not modified.
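
The rules for predicate define instructions can be written out as a small Python function. The behaviour for Pin = 0 follows the slide. The behaviour for Pin = 1 is an assumption based on the usual reading of these types: U takes the comparison result, its complement takes the negated result, and the OR types can only set a predicate to 1, never clear it. The type names `"U_bar"` and `"OR_bar"` are just spellings chosen here for the overbarred types.

```python
def define_predicate(ptype, old_value, cmp_result, p_in):
    """Return the new value of one output predicate.

    ptype      -- "U", "U_bar", "OR", or "OR_bar"
    old_value  -- value of the predicate before the instruction
    cmp_result -- outcome of comparing src1 and src2
    p_in       -- guarding predicate of the define instruction
    """
    if ptype == "U":
        # Unconditional: always written; forced to 0 when Pin = 0.
        return cmp_result if p_in else False
    if ptype == "U_bar":
        return (not cmp_result) if p_in else False
    if ptype == "OR":
        # OR type: may only set the predicate; untouched when Pin = 0.
        return True if (p_in and cmp_result) else old_value
    if ptype == "OR_bar":
        return True if (p_in and not cmp_result) else old_value
    raise ValueError("unknown predicate type: %r" % (ptype,))
```

Because OR-type predicates are only ever set, several compares can accumulate into the same predicate, which is one way to build the predicate for a compound condition such as `a < b || c < d`.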

IA-64: Explicitly Parallel Architecture

  • IA-64 template specifies
    • The type of operation for each instruction, e.g.
      • MFI, MMI, MII, MLI, MIB, MMF, MFB, MMB, MBB, BBB
    • Intra-bundle relationship, e.g.
      • M / MI or MI / I (/ is a "stop" meaning no parallelism)
    • Inter-bundle relationship
  • Most common combinations covered by templates
    • Headroom for additional templates
  • Simplifies hardware requirements
  • Scales compatibly to future generations

Bundle layout (128 bits): Instruction 2 (41 bits) | Instruction 1 (41 bits) | Instruction 0 (41 bits) | Template (5 bits)

Operation types: M = Memory, F = Floating-point, I = Integer, L = Long Immediate, B = Branch

Example: Memory (M), Memory (M), Integer (I) form an MMI bundle
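
The bundle layout above (5 + 3 × 41 = 128 bits) can be checked with a short Python sketch that packs and unpacks a bundle as an integer. It assumes the template occupies the low-order 5 bits with instruction slots 0, 1, and 2 following in increasing bit positions, which matches the order shown in the layout. The template and slot values in the example are arbitrary bit patterns, not real IA-64 encodings.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1

def pack_bundle(template, slots):
    """Combine a 5-bit template and three 41-bit slots into 128 bits."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3 or any(not 0 <= s <= SLOT_MASK for s in slots):
        raise ValueError("need three 41-bit instruction slots")
    bundle = template
    for i, slot in enumerate(slots):
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle

def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
        for i in range(3)
    ]
    return template, slots

# Arbitrary example values (not real opcodes).
example = pack_bundle(0x10, [1, 2, SLOT_MASK])
```

Because the template is decoded first, the hardware knows which kind of unit each of the three slots needs before it looks at the instructions themselves.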

Itanium Overview

Itanium implementation

• Can execute 2 bundles (6 instructions) per cycle
• 10-stage pipeline
• 4 integer units (2 of them can handle load-store), 2 f-p units, and 3 branch units
• Issues in order and executes in order, but can complete out of order. Uses a (restricted) register scoreboard technique to resolve dependencies.

Itanium implementation

• Predication reduces the number of branches and the number of mispredicts
• Nonetheless: sophisticated branch predictor
– Two-level branch predictor of the SAs variety
– Some provision for multiway branches
  • Several basic blocks can terminate in the same bundle
– 4 registers for highly predictable target addresses (end of loops), hence no bubble on taken branch
– Return address stack
– Hints from the compiler
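
To give a feel for what a two-level predictor does, here is a simplified Python sketch in the spirit of the SAs scheme: branches are grouped into sets by address, each set has its own history register, and the history selects a 2-bit saturating counter in that set's pattern table. The table sizes, the modulo indexing, and the initial counter value are all assumptions for illustration; this is not the Itanium's actual predictor configuration.

```python
class TwoLevelPredictor:
    def __init__(self, n_sets=4, history_bits=2):
        self.n_sets = n_sets
        self.mask = (1 << history_bits) - 1
        # First level: one branch-history register per set.
        self.history = [0] * n_sets
        # Second level: per-set tables of 2-bit counters,
        # initialised to 1 (weakly not-taken).
        self.counters = [[1] * (1 << history_bits) for _ in range(n_sets)]

    def predict(self, pc):
        s = pc % self.n_sets
        return self.counters[s][self.history[s]] >= 2  # True means taken

    def update(self, pc, taken):
        s = pc % self.n_sets
        h = self.history[s]
        c = self.counters[s][h]
        # Saturating increment or decrement of the selected counter.
        self.counters[s][h] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the outcome into the history register.
        self.history[s] = ((h << 1) | int(taken)) & self.mask

# A branch that alternates taken / not taken defeats a single 2-bit
# counter but is learned once the history distinguishes the two cases.
predictor = TwoLevelPredictor()
hits = []
for k in range(20):
    outcome = (k % 2 == 0)
    hits.append(predictor.predict(8) == outcome)
    predictor.update(8, outcome)
```

After a short warm-up, every prediction for the alternating branch in this example is correct.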

Traditional Register Models

  • Procedure A calls procedure B
  • Procedures must share space in the register file
  • Performance penalty due to register save / restore

[Figure: Traditional Register Models. Procedures A and B; labels: Procedure, Register, Memory.]

Traditional Register Stacks

  • Eliminate the need for save / restore by reserving fixed blocks in the register file
  • However, fixed blocks waste resources

[Figure: Traditional Register Stacks. Procedures A, B, C, D each mapped to a block; labels: Procedures, Register.]

I think that the "traditional register stack" model they refer to is the "register windows" model used in SPARC.

IA-64 Register Stack

[Figure: Traditional Register Stacks compared with the IA-64 Register Stack. In each panel, procedures A, B, C, D are mapped to blocks of the register file; labels: Procedures, Register.]

  • Traditional register stacks reserve fixed blocks, which waste resources
  • IA-64 able to reserve variable block sizes
  • No wasted resources
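
The saving from variable block sizes can be quantified with a small Python sketch. The per-procedure register needs and the fixed window size of 16 are hypothetical numbers chosen only to make the comparison concrete.

```python
def registers_used_fixed(frame_sizes, window=16):
    """Fixed blocks: every active procedure gets a full window,
    however few registers it actually needs."""
    if any(n > window for n in frame_sizes):
        raise ValueError("a procedure needs more than one window")
    return window * len(frame_sizes)

def allocate_variable(frame_sizes):
    """Variable blocks: frames are packed back to back.
    Returns a (base, size) pair for each active procedure."""
    frames, base = [], 0
    for size in frame_sizes:
        frames.append((base, size))
        base += size
    return frames

# Hypothetical register needs of procedures A, B, C, D on the call chain.
needs = [5, 12, 3, 8]
fixed_total = registers_used_fixed(needs)
variable_frames = allocate_variable(needs)
variable_total = sum(size for _, size in variable_frames)
```

With these numbers the fixed scheme ties up 64 registers while the variable scheme uses 28, so deeper call chains fit in the register file before anything has to be spilled to memory.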