


















An overview of the EPIC architecture, a compiler-controlled processor design, from material presented at the University of Michigan. The EPIC philosophy lets the compiler create a complete plan of run-time execution (POE), including when each operation executes and which resources it uses. The slides cover the defining features, MultiOp and exposed latency, and other architectural features such as register structure, branch architecture, and data/control speculation. The goals are to create more efficient POEs, expose the microarchitecture, and play the statistics.



















EPIC philosophy
- Compiler creates a complete plan of run-time execution (POE)
  - At what time and using what resource
  - POE communicated to hardware via the instruction set
  - Processor obediently follows POE
  - No dynamic scheduling or out-of-order execution (these second-guess the compiler's plan)
- Play the statistics
  - Many types of info only available at run-time (branch directions, locations accessed via pointers)
  - Traditionally compilers behave conservatively: handle worst-case possibility
  - Allow the compiler to gamble when it believes the odds are in its favor
    - Profiling
- Expose the microarchitecture
  - memory system, branch execution

Defining feature II - Exposed latency
- Sequence of atomic operations
  - Sequential order defines semantics (UAL, unit assumed latency)
  - Each conceptually finishes before the next one starts
- Register reads/writes for 1 operation separated in time
  - Semantics determined by relative ordering of reads/writes
- Contract between the compiler and hardware
  - Instruction issuance provides common notion of time
    MultiOp1: r1 = r2 + r3
    MultiOp2: r4 = r1 * r5
    MultiOp3: r6 = r1 / r
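The effect of exposed latency on the first two MultiOps can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `run` function, the `(dest, func, srcs, latency)` tuple format, the latency values, and the initial register contents are all hypothetical and not taken from the slides. Sources are read at issue, and the destination is written `latency` cycles later, so a consumer issued too early sees the old value of `r1`.

```python
import operator

def run(multiops, regs):
    """Run a list of MultiOps, issuing one MultiOp per cycle.

    Each op is (dest, func, srcs, latency): sources are read at issue
    and dest is written `latency` cycles after issue (assumed model).
    """
    regs = dict(regs)
    pending = []  # entries are (cycle_due, dest, value)
    cycle = 0
    while cycle < len(multiops) or pending:
        # Writes whose latency has elapsed become visible before this cycle's reads.
        for due, dest, value in [p for p in pending if p[0] <= cycle]:
            regs[dest] = value
        pending = [p for p in pending if p[0] > cycle]
        if cycle < len(multiops):
            for dest, func, srcs, latency in multiops[cycle]:
                value = func(*(regs[s] for s in srcs))
                pending.append((cycle + latency, dest, value))
        cycle += 1
    return regs

regs = {"r1": 10, "r2": 2, "r3": 3, "r5": 4}  # assumed initial values

# Add has latency 2: MultiOp2 issues one cycle later and reads the OLD r1.
slow_add = [[("r1", operator.add, ("r2", "r3"), 2)],
            [("r4", operator.mul, ("r1", "r5"), 1)]]
# Add has latency 1: MultiOp2 reads the NEW r1, as in a sequential model.
fast_add = [[("r1", operator.add, ("r2", "r3"), 1)],
            [("r4", operator.mul, ("r1", "r5"), 1)]]

print(run(slow_add, regs)["r4"])  # 40 (10 * 4)
print(run(fast_add, regs)["r4"])  # 20 (5 * 4)
```

The same instruction sequence gives different results depending on the assumed latency, which is why the latency has to be part of the contract between compiler and hardware.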

Other architectural features of EPIC
- Create more efficient POEs
- Expose the microarchitecture
- Play the statistics

Rotating registers
- Overlap loop iterations
  - How do you prevent register overwrite in later iterations?
  - Compiler-controlled dynamic register renaming
- Rotating registers
  - Each iteration writes to r
  - But this gets mapped to a different physical register
  - Block of consecutive regs allocated for each reg in loop, corresponding to number of iterations it is needed

[Figure: operations Op1 and Op shown for iteration n with RRB = 7 and for iteration n + 1 with RRB = 6]

actual reg = (reg + RRB) % NumRegs
At end of each iteration, RRB--
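The mapping rule can be tried out with a minimal sketch. The class name `RotatingRegs`, the register file size of 8, and the wrap-around of RRB on decrement are assumptions made for illustration; only the formula `(reg + RRB) % NumRegs`, the RRB decrement, and the RRB values 7 and 6 come from the slide.

```python
class RotatingRegs:
    """Toy rotating register file (hypothetical model, size assumed)."""

    def __init__(self, num_regs=8, rrb=7):
        self.phys = [None] * num_regs
        self.rrb = rrb

    def physical(self, reg):
        # actual reg = (reg + RRB) % NumRegs
        return (reg + self.rrb) % len(self.phys)

    def write(self, reg, value):
        self.phys[self.physical(reg)] = value

    def read(self, reg):
        return self.phys[self.physical(reg)]

    def end_iteration(self):
        # At end of each iteration, RRB-- (wrapping is an assumption here).
        self.rrb = (self.rrb - 1) % len(self.phys)


rr = RotatingRegs()
rr.write(1, "iteration n")       # RRB = 7: r1 maps to physical register 0
rr.end_iteration()               # RRB = 6
rr.write(1, "iteration n + 1")   # r1 now maps to physical register 7

print(rr.read(1))  # iteration n + 1
print(rr.read(2))  # iteration n (the older value is now visible as r2)
```

Both iterations write to the same register name, yet neither overwrites the other, because the decrement of RRB shifts the name onto a different physical register.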

Branch architecture
- Steps in executing a branch
  - Branch condition computed
  - Target address formed
  - Instructions fetched from taken, fall-through or both paths
  - Branch itself executes
  - After the branch, target of the branch is decoded/executed
- Hardware techniques for hiding branch latency
  - Icache prefetching
  - Branch prediction – guess outcome of branch
  - Dynamic scheduling – overlap other instructions with branch
  - Reorder buffer – squash when wrong

Speculation
- Reordering operations to find enough parallelism
  - Branch outcome
    - Control speculation
  - Lack of memory dependence in pointer code
    - Data speculation
  - Profile or clever analysis provides "the statistics"
- Compiler reorders aggressively
  - Hardware support to catch times when it's wrong
    - Execution repaired, continue
    - Repair is expensive
    - So have to be right most of the time or performance will suffer
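How often the compiler has to be right can be estimated with a simple expected-value model. This is an illustrative assumption, not a model given in the slides: a correct guess saves `cycles_saved` cycles, a wrong guess costs `repair_cycles`, and both function names are hypothetical.

```python
def expected_gain(p_correct, cycles_saved, repair_cycles):
    """Expected cycles saved per speculated operation (negative means a loss)."""
    return p_correct * cycles_saved - (1.0 - p_correct) * repair_cycles


def break_even_probability(cycles_saved, repair_cycles):
    """Success probability at which speculation neither gains nor loses cycles."""
    return repair_cycles / (cycles_saved + repair_cycles)


# Assumed numbers: a correct guess saves 2 cycles, a repair costs 18 cycles.
print(break_even_probability(2, 18))
print(expected_gain(0.99, 2, 18))
print(expected_gain(0.80, 2, 18))
```

With these assumed numbers the break-even point is 18 / (2 + 18) = 0.9, so a guess that is right only 80% of the time loses cycles on average, while one that is right 99% of the time gains them.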

Control speculation
- Compile-time movement of operations above branches
  - Guess the operation result is needed
  - If wrong, wasted execution
- Potential problems
  - Too much wasted execution
    - Speculate likely operations
  - Spurious exceptions
    - Useless op causes problem
    - NAT/poison/exception bit
    - check NAT operations
  - Register overwrite
    - Rename or don't do it
  - Memory corrupted
    - Don't speculate stores

Example (branch with taken and fallthru paths):
    blt r1, r2, L1
    r6 = r7 + r8
    r9 = r6 << 3
    r4 = r9 + 7
    r3 = load(r4)
    r5 = r3 + 1
    store(r4, r5)
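The NAT/poison bit idea can be sketched for the `r3 = load(r4)` and `r5 = r3 + 1` operations from the example. Everything named here (`Value`, `spec_load`, `add_imm`, `check`, the dictionary used as memory) is a hypothetical model for illustration and not an actual instruction set: a speculative load that would fault sets the poison bit instead of raising, dependent ops propagate the bit, and a check op left at the original position raises only if the result is really needed.

```python
class Value:
    """A register value plus a NaT (poison) bit."""

    def __init__(self, data=0, nat=False):
        self.data = data
        self.nat = nat


def spec_load(memory, addr):
    """Speculative load: defer a fault by setting NaT instead of raising."""
    if addr.nat or addr.data not in memory:
        return Value(nat=True)
    return Value(memory[addr.data])


def add_imm(src, imm):
    """Ordinary op: the NaT bit propagates from source to destination."""
    return Value(src.data + imm, src.nat)


def check(value):
    """Check op left where the load originally was."""
    if value.nat:
        raise MemoryError("deferred exception from speculative load")
    return value


memory = {100: 41}  # assumed memory contents

# Good address: the hoisted ops produce a usable result.
r5 = add_imm(spec_load(memory, Value(100)), 1)
print(check(r5).data)  # 42

# Bad address: no exception at the hoisted load, only a poisoned result.
r5_bad = add_imm(spec_load(memory, Value(999)), 1)
print(r5_bad.nat)  # True
```

If the branch goes the other way, `check` is never executed and the poisoned value is simply discarded, which avoids the spurious exception. The store is not modeled as speculative, in line with the "don't speculate stores" rule.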

Management of the memory hierarchy
- Common problems
  - Kick out good locality data with bad locality data
  - Capacity/conflict misses → prefetch the data
  - Non-deterministic latency – what should be assumed?
- Expose cache hierarchy to the compiler
  - No longer a black box
  - Placement made explicit
  - Assumed latency explicit

[Figure: memory hierarchy down to Main Memory]

Source/target cache specifiers
- Assumed latency
- Reduce pollution of lower levels
- Source cache specifier – where it's coming from (latency)
- Target cache specifier – where to place the data
- Example mnemonics: L_B_C3_C, S_H_C

Predicated execution example

    a = b + c
    if (a > 0)
        e = f + g
    else
        e = f / g
    h = i - j

Traditional branching code:

        add a, b, c
        bgt a, 0, L1
        div e, f, g
        jump L2
    L1: add e, f, g
    L2: sub h, i, j

Predicated code:

    add a, b, c    if T
    p2 = a > 0     if T
    p3 = a <= 0    if T
    div e, f, g    if p3
    add e, f, g    if p2
    sub h, i, j    if T
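To see that the predicated sequence computes the same result as the branching version, the six ops can be simulated directly. This is a sketch: the helper `op` and the register dictionary are hypothetical, and the only behavior assumed is that an op whose predicate is false is nullified, meaning it writes nothing and raises nothing.

```python
def branching(b, c, f, g, i, j):
    a = b + c
    if a > 0:
        e = f + g
    else:
        e = f / g
    h = i - j
    return a, e, h


def predicated(b, c, f, g, i, j):
    T = True
    regs = {}

    def op(pred, dest, compute):
        # A false predicate nullifies the op: no write and no exception.
        if pred:
            regs[dest] = compute()

    op(T, "a", lambda: b + c)
    op(T, "p2", lambda: regs["a"] > 0)
    op(T, "p3", lambda: regs["a"] <= 0)
    op(regs["p3"], "e", lambda: f / g)
    op(regs["p2"], "e", lambda: f + g)
    op(T, "h", lambda: i - j)
    return regs["a"], regs["e"], regs["h"]


# g = 0 here: the divide is nullified because p3 is false, so nothing faults.
print(predicated(1, 2, 4, 0, 9, 3))   # (3, 4, 6)
print(predicated(-5, 2, 8, 2, 9, 3))  # (-3, 4.0, 6)
```

Every op is fetched and issued on both inputs; only the predicate decides which of the two writes to `e` takes effect.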

What about nested if-then-else's?

    a = b + c
    if (a > 0)
        if (a > 25)
            e = f + g
        else
            e = f * g
    else
        e = f / g
    h = i - j

Traditional branching code:

        add a, b, c
        bgt a, 0, L1
        div e, f, g
        jump L2
    L1: bgt a, 25, L3
        mpy e, f, g
        jump L2
    L3: add e, f, g
    L2: sub h, i, j
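One way to handle the nested case is to guard the inner compares with the outer predicate. The sketch below is an assumption about how this could be done, not the slides' own predicated code: the predicate names `p2` to `p5` are hypothetical, and guarding is modeled by ANDing with `p2`, so both inner predicates are false whenever the outer "then" path is not taken.

```python
def nested_predicates(a):
    """Return (p3, p4, p5): the predicates guarding div, add and mpy."""
    p2 = a > 0             # outer "then"
    p3 = a <= 0            # outer "else": guards div e, f, g
    p4 = p2 and a > 25     # inner "then": guards add e, f, g
    p5 = p2 and a <= 25    # inner "else": guards mpy e, f, g
    return p3, p4, p5


def nested_predicated(b, c, f, g):
    a = b + c
    p3, p4, p5 = nested_predicates(a)
    e = None
    if p3:
        e = f / g
    if p5:
        e = f * g
    if p4:
        e = f + g
    return e


for a in (-3, 10, 40):
    print(a, nested_predicates(a))
```

For any value of `a` exactly one of the three predicates is true, so exactly one of the three writes to `e` takes effect and the three branches of the original code disappear.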

Benefits/Costs of predicated execution
- Benefits
  - Remove branches (both conditional and unconditional)
  - Remove branch mispredictions
  - Overlap execution of if-then-else statements
    - Branches tend to sequentialize operations
    - Predicates can be computed/used in parallel
- Costs
  - Useless instructions executed
  - Code size (extra operand, can't fit into 32 bits)
  - Possibly longer schedule lengths
- Must be applied selectively or you get worse performance than not using it at all

Benefits/Costs of predicated execution (2)
- Benefits
  - No branches, no mispredicts
  - Can freely reorder independent operations in the predicated block
  - Overlap BB2 with BB5 and BB6
- Costs (execute all paths)
  - Worst-case schedule length
  - Worst-case resources required