





A set of lecture notes covering instruction issue algorithms in single-threaded execution, including in-order and out-of-order issue, WAR hazards, register renaming, and the pipeline. It also discusses the limits of ILP (instruction-level parallelism) and cycle time reduction techniques.
Typology: Slides






Simplest possible design: issue the instructions sequentially (in-order).
Scan the issue queue and stop as soon as you come to an instruction dependent on one already issued.
The last two instructions cannot issue even though they are independent of the first two.
In-order completion is a must for precise exception support.
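The in-order scan can be sketched as follows. This is a minimal Python model, not the lecture's own code: the `Instr` tuple, the register names, the example queue, and the choice to treat both RAW and WAW within the issue group as "dependent" are assumptions made for illustration.

```python
from typing import NamedTuple, Tuple

class Instr(NamedTuple):
    dest: str               # destination register (hypothetical encoding)
    srcs: Tuple[str, ...]   # source registers

def in_order_issue(queue, width):
    """Issue from the head of the queue; stop at the first dependent instruction."""
    issued = []
    for instr in queue[:width]:
        written = {i.dest for i in issued}
        raw = any(s in written for s in instr.srcs)  # reads a result of this issue group
        waw = instr.dest in written                   # rewrites a register of this issue group
        if raw or waw:
            break   # in-order: nothing behind a stalled instruction may issue
        issued.append(instr)
    return issued

# Hypothetical queue: the second instruction depends on the first,
# the last two depend on nothing ahead of them.
queue = [
    Instr("r1", ("r2", "r3")),
    Instr("r4", ("r1", "r5")),
    Instr("r6", ("r7", "r8")),
    Instr("r9", ("r7", "r10")),
]
print(len(in_order_issue(queue, width=4)))  # prints 1
```

Only the first instruction issues in this model; the independent third and fourth wait behind the stalled second one.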
Complexity of selection logic:
Need to check for RAW and WAW.
Comparisons for RAW: N(N-1), where N is the issue width.
Comparisons for WAW: N(N-1)/2.
That gives 18 comparators for 4-issue.
Still need to make sure instructions write back in-order to support precise exceptions.
As instructions issue, they are removed from the issue queue and put in a re-order buffer (also called the active list in MIPS processors). [Isn't a WAW check sufficient?]
Instructions write back or retire in-order from the re-order buffer (ROB).
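The comparator counts can be checked with a few lines of Python. The sketch assumes two source operands per instruction, which is what makes the RAW count N(N-1); the function name and the `srcs_per_instr` parameter are hypothetical.

```python
def comparators(n, srcs_per_instr=2):
    """Return (RAW, WAW, total) comparator counts for an n-wide in-order issue group."""
    pairs = n * (n - 1) // 2        # (older, younger) instruction pairs in the group
    raw = srcs_per_instr * pairs    # each source of the younger vs. the older's destination
    waw = pairs                     # destination of the younger vs. the older's destination
    return raw, waw, raw + waw

print(comparators(4))  # prints (12, 6, 18)
```

The quadratic growth is visible immediately: under the same assumption, an 8-wide group needs 84 comparators.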
Out-of-order issue takes the parallelism to a new dimension and is central to all modern microprocessors.
Scan the issue queue completely, select independent instructions, and issue as many as possible, limited only by the number of functional units.
This needs more comparators.
Able to extract more ILP: CPI goes down further.
Possible to overlap the latency of mult/div and load/store with the execution of other independent instructions.
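A full-queue scan can be sketched like this. It is a simplified single-cycle model under stated assumptions: the `Instr` tuple and register names are hypothetical, every register not produced inside the queue is taken to be ready, and without renaming a younger instruction is also held back if it would overwrite a register that an older waiting instruction still has to read (WAR).

```python
from typing import NamedTuple, Tuple

class Instr(NamedTuple):
    dest: str               # destination register (hypothetical encoding)
    srcs: Tuple[str, ...]   # source registers

def out_of_order_issue(queue, num_units):
    """Scan the whole queue; issue independent instructions up to num_units."""
    issued, waiting = [], []
    for instr in queue:
        older = issued + waiting   # everything ahead in program order
        raw = any(s == o.dest for o in older for s in instr.srcs)
        waw = any(instr.dest == o.dest for o in older)
        war = any(instr.dest in o.srcs for o in waiting)
        if len(issued) < num_units and not (raw or waw or war):
            issued.append(instr)
        else:
            waiting.append(instr)   # stays in the issue queue for a later cycle
    return issued, waiting

# Hypothetical queue: only the second instruction depends on an earlier one.
queue = [
    Instr("r1", ("r2", "r3")),
    Instr("r4", ("r1", "r5")),
    Instr("r6", ("r7", "r8")),
    Instr("r9", ("r7", "r10")),
]
issued, waiting = out_of_order_issue(queue, num_units=4)
print(len(issued), len(waiting))  # prints 3 1
```

Unlike an in-order scan, the dependent second instruction no longer blocks the two independent ones behind it.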
An executing instruction must broadcast its result to the issue queue.
Waiting instructions compare their source register numbers with the destination register number of the bypassed value.
A waiting instruction must now also make sure that it consumes the right value in program order, to avoid WAR hazards.
Need to tag every instruction with its last producer.
Can we simplify this?
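One way to picture producer tagging is the wakeup step below. This is a sketch under assumptions, not the lecture's design: each issue-queue entry is a hypothetical dict holding the set of producer tags it still waits on (for example ROB indices), and a broadcast carries the tag of the instruction that just produced its result.

```python
def wakeup(issue_queue, producer_tag):
    """Broadcast one producer's tag; return the names of entries that become ready."""
    woken = []
    for entry in issue_queue:
        if producer_tag in entry["waiting_on"]:
            entry["waiting_on"].discard(producer_tag)   # this operand is now available
            if not entry["waiting_on"]:                 # all operands available
                woken.append(entry["name"])
    return woken

# Hypothetical entries: I2 waits on producer 0, I5 waits on producers 0 and 3.
issue_queue = [
    {"name": "I2", "waiting_on": {0}},
    {"name": "I5", "waiting_on": {0, 3}},
]
print(wakeup(issue_queue, 0))  # prints ['I2']
print(wakeup(issue_queue, 3))  # prints ['I5']
```

Matching on the producer's tag rather than on the bare register number is what keeps a consumer from picking up a value written by a later instruction to the same register.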
WAR and WAW are really false dependencies.
They arise due to register allocation by the compiler.
Thus far we have assumed that the ROB has space to hold the destination values: this needs wide ROB entries.
These values are written back to the register file when the instructions retire or commit in-order from the ROB.
Also, bypass becomes complicated.
Better way to solve it: rename the destination registers.
More physical registers allow more in-flight instructions, and hence the possibility of more parallelism.
But the register file cannot be made very big: it takes time to access and burns power.
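Renaming destination registers onto a larger physical register file can be sketched as below. The map table, free list, register names, and sizes are illustrative assumptions; freeing physical registers at retire is not modeled.

```python
def rename(instrs, num_arch=4, num_phys=8):
    """Rename (dest, srcs) pairs; stop (stall) when the free list is empty."""
    map_table = {f"r{i}": f"p{i}" for i in range(num_arch)}   # architectural -> physical
    free_list = [f"p{i}" for i in range(num_arch, num_phys)]
    renamed = []
    for dest, srcs in instrs:
        new_srcs = tuple(map_table[s] for s in srcs)  # latest mapping keeps true (RAW) dependencies
        if not free_list:
            break                                      # out of physical registers: rename stalls
        phys = free_list.pop(0)
        map_table[dest] = phys                         # later readers of dest see the new register
        renamed.append((phys, new_srcs))
    return renamed

# Hypothetical sequence with a WAR on r2 and a WAW on r1.
program = [
    ("r1", ("r2", "r3")),
    ("r2", ("r1", "r0")),   # WAR with the first instruction (writes r2)
    ("r1", ("r0", "r3")),   # WAW with the first instruction (writes r1)
]
for line in rename(program):
    print(line)
# prints ('p4', ('p2', 'p3'))
#        ('p5', ('p4', 'p0'))
#        ('p6', ('p0', 'p3'))
```

After renaming, only the true dependency remains (the second instruction reads `p4`); the three destinations are distinct physical registers. With a smaller `num_phys` the loop stalls earlier, which is the link between register file size and the number of in-flight instructions.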
Fetch, decode, rename, issue, register file read, ALU, cache, retire.
Fetch, decode, and rename are in-order stages; each handles multiple instructions every cycle.
The ROB entry is allocated in the rename stage.
Issue, register file read, ALU, and cache are out-of-order.
Retire is again in-order, but multiple instructions may retire each cycle: need to free the resources and drain the pipeline quickly.
Instruction cache miss (normally not a big issue).
Branch misprediction:
Observe that you predict a branch in decode, and the branch executes in the ALU.
There are four pipeline stages before you know the outcome.
A misprediction amounts to a loss of at least 4F instructions, where F is the fetch width.
Data cache miss:
Assuming an issue width of 4, a frequency of 3 GHz, and a memory latency of 120 ns, you need to find 1440 independent instructions to issue so that you can hide the memory latency: this is impossible (resource shortage).
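The two estimates above are simple products, spelled out here as a quick check. The function names are hypothetical, and the fetch width of 4 in the first call is an assumed value chosen only to give a concrete number.

```python
def mispredict_loss(stages_to_resolve, fetch_width):
    """Instructions fetched down the wrong path before the branch resolves."""
    return stages_to_resolve * fetch_width

def instructions_to_hide_latency(issue_width, freq_ghz, latency_ns):
    """Independent instructions needed to keep issuing during one memory access."""
    latency_cycles = freq_ghz * latency_ns   # GHz * ns = cycles
    return issue_width * latency_cycles

print(mispredict_loss(4, fetch_width=4))          # prints 16
print(instructions_to_hide_latency(4, 3, 120))    # prints 1440
```

The 120 ns miss is 360 cycles at 3 GHz, and a 4-wide machine would have to issue 4 instructions in each of them.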
Execution time = CPI × instruction count × cycle time.
So far we talked about CPI reduction, or improvement in IPC (instructions retired per cycle).
Cycle time reduction is another technique to boost performance: faster clock frequency.
Pipelining poses a problem: each pipeline stage should take one cycle for balanced progress.
A smaller cycle time means pipe stages need to be broken into smaller stages: superpipelining.
Faster clock frequency necessarily means deep pipes.
Each pipe stage contains a small amount of logic so that it fits in a small cycle time.
This may severely degrade CPI if not careful.
Now the branch penalty is even bigger (31 cycles for Intel Prescott): branch mispredictions cause massive loss in performance (93 micro-ops are lost, F=3).
Long pipes also put more pressure on resources such as the ROB and registers because instruction latency increases (in terms of cycles, not in absolute terms).
Instructions occupy ROB entries and registers longer.
The design becomes increasingly complicated (long wires).
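The trade-off between cycle time and CPI can be made concrete with the execution time formula. In the sketch below, the function names and the CPI, instruction count, and cycle time values are made-up illustrative numbers, not measurements; only the 31-cycle penalty and F=3 come from the notes.

```python
def execution_time(cpi, instruction_count, cycle_time_ns):
    """Execution time in ns = CPI x instruction count x cycle time."""
    return cpi * instruction_count * cycle_time_ns

def micro_ops_lost(penalty_cycles, fetch_width):
    """Work thrown away on one branch misprediction."""
    return penalty_cycles * fetch_width

# Hypothetical: halving the cycle time gains nothing if CPI doubles.
print(execution_time(1.0, 1000, 0.5))    # prints 500.0
print(execution_time(2.0, 1000, 0.25))   # prints 500.0

# Prescott figures from the notes: 31-cycle penalty, F = 3.
print(micro_ops_lost(31, 3))             # prints 93
```

A deeper pipe only pays off when the CPI increase from longer penalties stays smaller than the cycle time reduction.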