EPIC Architecture: A Compiler-Controlled Processor Design - Prof. Scott Mahlke, Assignments of Electrical and Electronics Engineering

An overview of EPIC architecture, a compiler-controlled processor design, as taught in EECS 583 at the University of Michigan. The EPIC philosophy lets the compiler create a complete plan of run-time execution (POE), specifying at what time and on which resource each operation runs. The lecture covers defining features such as MultiOp and exposed latency, along with other architectural features: register structure, branch architecture, data/control speculation, memory hierarchy, and predicated execution. These features aim to create more efficient POEs, expose the microarchitecture, and let the compiler play the statistics.

Typology: Assignments

Pre 2010

Uploaded on 09/02/2009

EECS 583 – Lecture 3
EPIC Architectures
University of Michigan
January 14, 2002

EPIC philosophy

- Compiler creates a complete plan of run-time execution (POE)
  - At what time and using what resource
  - POE communicated to hardware via the instruction set
  - Processor obediently follows the POE
  - No dynamic scheduling or out-of-order execution (these second-guess the compiler's plan)
- Compiler allowed to play the statistics
  - Many types of information are only available at run time (branch directions, locations accessed via pointers)
  - Traditionally compilers behave conservatively → handle the worst-case possibility
  - Allow the compiler to gamble when it believes the odds are in its favor
    - Profiling
- Expose the microarchitecture to the compiler
  - Memory system, branch execution

Defining feature II - Exposed latency

- Superscalar
  - Sequence of atomic operations
  - Sequential order defines semantics (UAL)
  - Each conceptually finishes before the next one starts
- EPIC: non-atomic operations
  - Register reads/writes for one operation separated in time
  - Semantics determined by relative ordering of reads/writes
- Assumed latency (NUAL if > 1)
  - Contract between the compiler and hardware
  - Instruction issuance provides common notion of time

    MultiOp1: r1 = r2 + r3
    MultiOp2: r4 = r1 * r5
    MultiOp3: r6 = r1 / r
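
The assumed-latency contract can be illustrated with a small simulator. This is a minimal sketch, not the lecture's own model: it assumes one operation issues per cycle, sources are read at issue, and the destination is written `latency` cycles later. The function name `run_nual` and the latency values are hypothetical.

```python
def run_nual(program, regs, latency):
    """Execute one operation per cycle under assumed-latency semantics.

    program: list of (dest, op, src1, src2), issued one per cycle
    latency: hypothetical op -> cycles table (not from the lecture)
    """
    regs = dict(regs)
    pending = []  # entries are (cycle_when_write_lands, dest, value)
    ops = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y}
    cycle = 0
    while cycle < len(program) or pending:
        # Writes scheduled for this cycle become visible before reads.
        for when, dest, value in [p for p in pending if p[0] == cycle]:
            regs[dest] = value
        pending = [p for p in pending if p[0] > cycle]
        if cycle < len(program):
            dest, op, a, b = program[cycle]
            value = ops[op](regs[a], regs[b])   # sources read at issue
            pending.append((cycle + latency[op], dest, value))
        cycle += 1
    return regs


regs = {"r1": 1, "r2": 2, "r3": 3, "r5": 10}
prog = [("r1", "add", "r2", "r3"), ("r4", "mul", "r1", "r5")]
unit = run_nual(prog, regs, {"add": 1, "mul": 3})      # add has unit latency
non_unit = run_nual(prog, regs, {"add": 2, "mul": 3})  # add takes 2 cycles
```

With a unit-latency add, the multiply sees the new `r1`. With a two-cycle add, the multiply issued one cycle later still reads the old `r1`, so the same instruction sequence means something different. This is why the latency has to be part of the compiler/hardware contract.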

Other architectural features of EPIC

- Add features to the architecture to support the EPIC philosophy
  - Create more efficient POEs
  - Expose the microarchitecture
  - Play the statistics
- Register structure
- Branch architecture
- Data/control speculation
- Memory hierarchy
- Predicated execution (largest impact on the compiler)

Rotating registers

- Overlap loop iterations
  - How do you prevent register overwrite in later iterations?
  - Compiler-controlled dynamic register renaming
- Rotating registers
  - Each iteration writes to r
  - But this gets mapped to a different physical register
  - Block of consecutive regs allocated for each reg in the loop, corresponding to the number of iterations it is needed

[Figure: two overlapped iterations started II apart; iteration n runs with RRB = 7, iteration n + 1 with RRB = 6]

    actual reg = (reg + RRB) % NumRegs
    At end of each iteration, RRB--
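
The mapping formula above can be sketched as a small register file model. The class name, the register count of 8, and the wrap-around of RRB are assumptions made for illustration; only the mapping formula and the per-iteration decrement come from the slide.

```python
class RotatingRegisterFile:
    """Sketch of a rotating register file; sizes are illustrative."""

    def __init__(self, num_regs, rrb=0):
        self.num_regs = num_regs
        self.rrb = rrb                      # rotating register base
        self.physical = [None] * num_regs

    def actual(self, reg):
        # Mapping from the slide: actual reg = (reg + RRB) % NumRegs
        return (reg + self.rrb) % self.num_regs

    def write(self, reg, value):
        self.physical[self.actual(reg)] = value

    def read(self, reg):
        return self.physical[self.actual(reg)]

    def end_iteration(self):
        # At the end of each iteration, RRB-- (wrap-around is assumed).
        self.rrb = (self.rrb - 1) % self.num_regs


rf = RotatingRegisterFile(num_regs=8, rrb=7)
rf.write(3, "iter n")        # physical register (3 + 7) % 8
rf.end_iteration()           # RRB becomes 6
rf.write(3, "iter n+1")      # physical register (3 + 6) % 8
```

Both iterations name the same register in the code, yet the second write does not clobber the first. The value from iteration n stays reachable in iteration n + 1 under the next higher register name, which is what lets overlapped iterations share one loop body.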

Branch architecture

- Branch actions
  - Branch condition computed
  - Target address formed
  - Instructions fetched from taken, fall-through, or both paths
  - Branch itself executes
  - After the branch, target of the branch is decoded/executed
- Superscalar processors use hardware to hide the latency of all the actions
  - Icache prefetching
  - Branch prediction: guess outcome of branch
  - Dynamic scheduling: overlap other instructions with branch
  - Reorder buffer: squash when wrong

Speculation

- Allow the compiler to play the statistics
  - Reordering operations to find enough parallelism
  - Branch outcome
    - Control speculation
  - Lack of memory dependence in pointer code
    - Data speculation
  - Profile or clever analysis provides "the statistics"
- General plan of action
  - Compiler reorders aggressively
  - Hardware support to catch times when it's wrong
  - Execution repaired, continue
    - Repair is expensive
    - So the compiler has to be right most of the time or performance will suffer
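
The "right most of the time" requirement can be made concrete with a simple expected-value calculation. This is a simplified model that is not from the lecture: the function names and the cycle counts below are hypothetical, and real compilers weigh more factors.

```python
def expected_gain(p_right, cycles_saved, repair_cycles):
    """Expected cycles gained per execution by speculating.

    p_right: probability the compiler's guess is correct (e.g. from a profile)
    cycles_saved: cycles saved when the guess is right
    repair_cycles: cycles lost to repair when the guess is wrong
    All three are hypothetical inputs to a simplified model.
    """
    return p_right * cycles_saved - (1.0 - p_right) * repair_cycles


def break_even_probability(cycles_saved, repair_cycles):
    """Probability of being right at which speculation breaks even."""
    return repair_cycles / (cycles_saved + repair_cycles)


# Saving 2 cycles per correct guess against a 20-cycle repair:
threshold = break_even_probability(cycles_saved=2, repair_cycles=20)
```

Under these made-up numbers the guess must be right about 91% of the time before speculation pays off, which shows how an expensive repair pushes the required accuracy up.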

Control speculation

- Compile-time movement of operations above branches
  - Guess the operation result is needed
  - If wrong, wasted execution
- Potential problems
  - Too much wasted execution
    - Speculate likely operations
  - Spurious exceptions
    - Useless op causes problem
    - NAT/poison/exception bit
    - Check NAT operations
  - Register overwrite
    - Rename or don't do it
  - Memory corrupted
    - Don't speculate stores

[Figure: a branch block with "taken" and "fallthru" edges, and the following operations]

    blt r1, r2, L1

    r6 = r7 + r8
    r9 = r6 << 3
    r4 = r9 + 7

    r3 = load(r4)
    r5 = r3 + 1
    store (r4, r5)
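
The NAT/poison-bit idea can be sketched in software. This is a minimal illustration under assumptions: the names `NaT`, `spec_load`, `spec_add`, `check`, and `run` are hypothetical, a Python dictionary stands in for memory, and the schedule in `run` is one possible hoisting rather than the one shown on the slide.

```python
class NaT:
    """Marker for a deferred exception (poison bit)."""


def spec_load(memory, addr):
    # Speculative load: a bad address yields NaT instead of raising.
    return memory[addr] if addr in memory else NaT


def spec_add(x, y):
    # Speculative operations propagate a poisoned input.
    return NaT if x is NaT or y is NaT else x + y


def check(value):
    # Non-speculative check, left in the block where the load started.
    if value is NaT:
        raise MemoryError("deferred exception from speculative load")
    return value


def run(memory, r1, r2, r4):
    # Load and add hoisted above the branch (hypothetical schedule).
    r3 = spec_load(memory, r4)
    r5 = spec_add(r3, 1)
    if r1 < r2:
        return "other path"      # r5 unused: no spurious exception
    memory[r4] = check(r5)       # the store is never speculated
    return "store path"
```

A faulting load on the path that does not need the result is silently ignored, while the same fault on the path that uses it is reported at the check. The store stays below the branch, matching the "don't speculate stores" rule.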

Management of the memory hierarchy

- Common problems
  - Kick out good-locality data with bad-locality data
  - Capacity/conflict misses → prefetch the data
  - Non-deterministic latency: what should be assumed?
- Expose cache hierarchy to the compiler
  - No longer a black box
  - Placement made explicit
  - Assumed latency explicit

[Figure: CPU, L1-I and L1-D caches, L2-U, main memory]

Source/target cache specifiers

- Source specifier: compiler tells hardware where within the cache hierarchy the data is expected to be found
  - Assumed latency
- Target specifier: compiler tells hardware the highest level in which the data should be placed
  - Reduce pollution of lower levels
- Prefetching: speculative load to some dummy register
- Icache managed by PBRs
- Traditional processors use C1/C

[Figure: annotated opcodes L_B_C3_C and S_H_C. Source cache specifier: where it's coming from → latency. Target cache specifier: where to place the data.]
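
How a compiler might use the two specifiers can be sketched as follows. Everything here is an assumption for illustration: the level names, the latency numbers, the `Load` record, and the reading of "highest level" as the level closest to the CPU.

```python
from dataclasses import dataclass

# Hypothetical level names and latencies, chosen only for illustration.
ASSUMED_LATENCY = {"C1": 2, "C2": 7, "C3": 35}
LEVELS = ["C1", "C2", "C3"]      # C1 is assumed closest to the CPU


@dataclass
class Load:
    source: str   # level where the compiler expects to find the data
    target: str   # highest level in which the data should be placed


def assumed_latency(load):
    # The source specifier fixes the latency the schedule assumes.
    return ASSUMED_LATENCY[load.source]


def levels_filled(load):
    # Data is placed at the target level and the levels farther from
    # the CPU; levels nearer the CPU are left untouched.
    return LEVELS[LEVELS.index(load.target):]


streaming = Load(source="C3", target="C2")   # data with poor locality
```

Here a load of streaming data is scheduled with the long latency of its source level and is kept out of the level nearest the CPU, so it cannot evict data with good locality there.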

Predicated execution example

Source:

    a = b + c
    if (a > 0)
        e = f + g
    else
        e = f / g
    h = i - j

Traditional branching code:

    BB1    add a, b, c
    BB1    bgt a, 0, L1
    BB3    div e, f, g
    BB3    jump L2
    BB2    L1: add e, f, g
    BB     L2: sub h, i, j

Predicated code:

    BB1    add a, b, c    if T
    BB1    p2 = a > 0     if T
    BB1    p3 = a <= 0    if T
    BB3    div e, f, g    if p3
    BB2    add e, f, g    if p2
    BB     sub h, i, j    if T

[Figure: control-flow graph of the basic blocks, with edges labeled by predicates]
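
The equivalence of the two versions can be checked with a small model. This is a sketch under assumptions: the function names are hypothetical, and a guarded operation is modeled by skipping it when its predicate is false, whereas hardware may execute it and discard the result.

```python
def branching(b, c, f, g, i, j):
    # Straightforward version with control flow.
    a = b + c
    if a > 0:
        e = f + g
    else:
        e = f / g
    h = i - j
    return a, e, h


def predicated(b, c, f, g, i, j):
    # Straight-line version: every operation carries a guard predicate.
    s = {"T": True, "b": b, "c": c, "f": f, "g": g, "i": i, "j": j}
    ops = [
        ("T", "a", lambda: s["b"] + s["c"]),
        ("T", "p2", lambda: s["a"] > 0),
        ("T", "p3", lambda: s["a"] <= 0),
        ("p3", "e", lambda: s["f"] / s["g"]),
        ("p2", "e", lambda: s["f"] + s["g"]),
        ("T", "h", lambda: s["i"] - s["j"]),
    ]
    for guard, dest, compute in ops:
        if s[guard]:              # a false predicate nullifies the operation
            s[dest] = compute()
    return s["a"], s["e"], s["h"]
```

The operation list mirrors the predicated code above line by line. Because `p2` and `p3` are complementary, exactly one of the two writes to `e` takes effect, and no branch is needed.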

What about nested if-then-else's?

Source:

    a = b + c
    if (a > 0)
        if (a > 25)
            e = f + g
        else
            e = f * g
    else
        e = f / g
    h = i - j

Traditional branching code:

    BB1    add a, b, c
    BB1    bgt a, 0, L1
    BB3    div e, f, g
    BB3    jump L2
    BB2    L1: bgt a, 25, L3
    BB6    mpy e, f, g
    BB6    jump L2
    BB5    L3: add e, f, g
    BB     L2: sub h, i, j

[Figure: control-flow graph of the basic blocks]
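
One plausible way to predicate the nested example is to make each inner predicate the conjunction of the outer predicate and the inner condition. The sketch below is an assumption, not the lecture's own if-converted code, and the predicate names are hypothetical. Each `if` models one guarded operation.

```python
def nested_predicates(a):
    """Guards for the three assignments to e (an assumed if-conversion)."""
    p_outer = a > 0
    p_div = not p_outer                 # guards e = f / g
    p_add = p_outer and a > 25          # guards e = f + g
    p_mpy = p_outer and not (a > 25)    # guards e = f * g
    return p_div, p_add, p_mpy


def nested_predicated(b, c, f, g, i, j):
    a = b + c
    p_div, p_add, p_mpy = nested_predicates(a)
    e = None
    if p_div:       # div e, f, g  if p_div
        e = f / g
    if p_add:       # add e, f, g  if p_add
        e = f + g
    if p_mpy:       # mpy e, f, g  if p_mpy
        e = f * g
    h = i - j
    return a, e, h
```

Exactly one of the three guards is true for any value of `a`, so the three writes to `e` never conflict even though they appear in one straight-line block.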

Benefits/Costs of predicated execution

- Benefits
  - Remove branches (both conditional and unconditional)
  - Remove branch mispredictions
  - Overlap execution of if-then-else statements
    - Branches tend to sequentialize operations
    - Predicates can be computed/used in parallel
- Costs
  - Useless instructions executed
  - Code size (extra operand, can't fit into 32 bits)
  - Possibly longer schedule lengths
- The real story
  - Must be applied selectively or you get worse performance than not using it at all
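
Why predication must be applied selectively can be seen in a rough cost comparison. This is a deliberately crude model, not one from the lecture: it ignores data dependences, treats schedule length as purely resource-bound, and all function names and numbers are hypothetical.

```python
import math


def cycles(num_ops, issue_width):
    # Resource-bound schedule length; ignores dependences.
    return math.ceil(num_ops / issue_width)


def branch_cost(p_then, ops_then, ops_else,
                mispredict_rate, penalty, issue_width):
    # Only one path executes, but a misprediction costs a penalty.
    expected = (p_then * cycles(ops_then, issue_width)
                + (1.0 - p_then) * cycles(ops_else, issue_width))
    return expected + mispredict_rate * penalty


def predicated_cost(ops_then, ops_else, issue_width):
    # Both paths always issue, so their operations share the machine.
    return cycles(ops_then + ops_else, issue_width)


def should_if_convert(p_then, ops_then, ops_else,
                      mispredict_rate, penalty, issue_width):
    return (predicated_cost(ops_then, ops_else, issue_width)
            < branch_cost(p_then, ops_then, ops_else,
                          mispredict_rate, penalty, issue_width))
```

In this model a small, balanced, hard-to-predict if-then-else favors predication, while a well-predicted branch with a long, rarely taken path does not, since predication would pay for the long path on every execution.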

Benefits/Costs of predicated execution (2)

Benefits:
- No branches, no mispredicts
- Can freely reorder independent operations in the predicated block
- Overlap BB2 with BB5 and BB6

Costs (execute all paths):
- Worst-case schedule length
- Worst-case resources required

[Figure: control-flow graph of the basic blocks alongside the single predicated block]