Computer Architecture Problem Set 2: McKinley IA64 Pipeline Analysis | Assignments Computer Architecture and Organization

CE202 – Computer Architecture

Problem Set 2

Here’s part of last year’s midterm.

1. (80 points) The McKinley implementation of IA64 has an 8-stage core (integer) pipeline. The phases can approximately be called Instruction Fetch (IF), Instruction Issue 1 (IS1), IS2, IS3, Register Read (RF), Execute and L1 cache access (EXE), Exception Detect and Branch Correction (DET), and Write Back (WB). IS1, IS2, and IS3 are concerned with analyzing a packet of instructions for issue, and with register renaming. Branch targets are calculated in IS1. Branches are resolved in DET. L1 cache access, instruction or data, takes a single cycle. After the RF stage, floating-point instructions have stages FP1, FP2, FP3, FP4, and FWB, replacing EXE, DET, and WB.

Consider the instructions Jump (J), Jump Register (JR), Branch, Integer, FP, Load, and Store.
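As a starting point for the diagrams requested in the parts that follow, here is a minimal Python sketch that lays out a stall-free, single-issue timing diagram from the stage names given above. The function names (`pipeline_diagram`, `render`) and the one-instruction-per-cycle, no-stall layout are assumptions made for illustration; forwarding arrows and stalls still have to be added by hand.

```python
# Stage names come from the problem statement.
# After RF, the FP path replaces EXE/DET/WB with FP1..FP4 and FWB.
INT_STAGES = ["IF", "IS1", "IS2", "IS3", "RF", "EXE", "DET", "WB"]
FP_STAGES = ["IF", "IS1", "IS2", "IS3", "RF", "FP1", "FP2", "FP3", "FP4", "FWB"]


def stages_for(kind):
    """Pick the stage list for an instruction kind ("FP" uses the FP path)."""
    return FP_STAGES if kind == "FP" else INT_STAGES


def pipeline_diagram(kinds):
    """Return (kind, cells) rows; cells[c] is the stage occupied in cycle c+1.

    ASSUMPTION: one instruction enters IF per cycle and nothing stalls.
    """
    width = max(i + len(stages_for(k)) for i, k in enumerate(kinds))
    rows = []
    for i, kind in enumerate(kinds):
        cells = [""] * i + stages_for(kind)      # instruction i starts in cycle i+1
        cells += [""] * (width - len(cells))     # pad so every row has equal length
        rows.append((kind, cells))
    return rows


def render(rows):
    """Format the rows as fixed-width text, one instruction per line."""
    return "\n".join(
        f"{kind:<8}" + " ".join(f"{c:<4}" for c in cells).rstrip()
        for kind, cells in rows
    )


print(render(pipeline_diagram(["Load", "Integer", "FP", "Branch"])))
```

Each printed row is shifted one cycle to the right of the previous one, which makes it easy to see which stages overlap when deciding where a forwarding path would have to start and end.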

(a) (10 points) What data and control forwarding is needed? Use an IF IS1 ... diagram to show several instructions executing, and label and comment on each arrow. Assume single instruction issue.

(b) (10 points) Make a symbolic pipeline diagram by showing the IF ... RF stages (with latches in between), and a split into two possible pipelines: EXE ... WB and FP1 ... FWB. Draw forwarding paths between the stages for data and control forwarding.

(c) (10 points) What data stalls remain? Provide a neat list of the form “Instructions X, Y, and Z with result R1 followed by W or X with source R1: 2 intervening clock cycles.” (Or more compactly, {X,Y,Z} followed by {W,X}: 2.)

(d) (10 points) Control hazards. Assume 10% of instructions are branches, 2% are jumps and calls, and 2% are returns (JR). The McKinley has a Branch Target Buffer (BTB) and a Branch Prediction Table (BPT). The BPT is 90% accurate. Assume that both correctly and incorrectly predicted branches are 60% taken. The Branch Target Buffer has a hit rate of 70%, regardless of the accuracy of the prediction. What are the costs in clocks for the various branch possibilities (i.e., presence or absence in the BTB, correct or incorrect BPT prediction, taken or not taken)? What is the average cost in clocks for a control statement?

(e) (10 points) McKinley tries to issue 6 instructions per clock cycle (ideal IPC = 6). What is the reduction in IPC considering only control hazards? Next, assume that of the remaining 86% of instructions, only 50% of the peak issue rate can be achieved (roughly based on Figure 4.57) due to structural and data hazard considerations. What is the overall IPC of the machine?


(f) (10 points) If a cache miss happens, an additional 8 cycles are required to access the on-chip 256K L2 cache. The L1 caches are 16K. If L2 fits all memory necessary for a program, and the instruction cache miss rate is 0.4%, and the data cache miss rate is 6% (estimated from Table 5.7 for a 16K cache), what is IPC considering cache misses? Assume that instruction cache misses fully stall the pipeline, but that there are sufficient reservation stations so that other instructions may execute during a data cache miss.

(g) (10 points) Consider the merging of DET and WB. What would the implications of this be? What issues would come up in deciding whether or not to merge the two stages?

(h) (10 points) Why is instruction issue rate insufficient for comparing different processors?

(i) (Extra credit) Design a VLSI layout for this architecture.
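The arithmetic behind parts (d) through (f) can be organized as a weighted average over the eight (BTB, BPT, taken) cases followed by a stall-cycles-per-instruction model. The Python sketch below shows one way to set that up. The probabilities and the 8-cycle L2 latency come from the problem statement; everything else is an assumption made for illustration: the penalty values in `placeholder_cost`, the jump and return costs, the reading of the 0.4% miss rate as "per instruction", and the choice to model IPC as 1 / (ideal CPI + stall cycles per instruction). It is not a worked solution.

```python
from itertools import product


def expected_branch_cost(cost, btb_hit=0.70, bpt_correct=0.90, taken=0.60):
    """Average penalty in clocks per branch.

    cost maps (in_btb, predicted_correctly, is_taken) -> penalty in clocks.
    The three events are treated as independent, as the problem states.
    """
    total = 0.0
    for in_btb, correct, is_taken in product([True, False], repeat=3):
        prob = ((btb_hit if in_btb else 1.0 - btb_hit)
                * (bpt_correct if correct else 1.0 - bpt_correct)
                * (taken if is_taken else 1.0 - taken))
        total += prob * cost[(in_btb, correct, is_taken)]
    return total


def ipc_with_stalls(base_ipc, stall_cycles_per_instr):
    """ASSUMED model: CPI = 1/base_ipc + stall cycles per instruction."""
    return 1.0 / (1.0 / base_ipc + stall_cycles_per_instr)


def placeholder_cost(in_btb, correct, is_taken):
    """PLACEHOLDER penalties, not the answer to part (d)."""
    if not correct:
        return 6          # hypothetical: wait until the branch resolves in DET
    if is_taken and not in_btb:
        return 1          # hypothetical: wait for the target computed in IS1
    return 0


COST = {case: placeholder_cost(*case) for case in product([True, False], repeat=3)}

avg_branch = expected_branch_cost(COST)

# 10% branches, 2% jumps/calls, 2% returns (from the problem).
# The jump cost (1) and return cost (6) are hypothetical values.
control_stalls = 0.10 * avg_branch + 0.02 * 1 + 0.02 * 6

# Instruction-cache misses fully stall the pipeline (from the problem);
# treating 0.4% as a per-instruction rate is an assumption.
icache_stalls = 0.004 * 8

print("average branch cost:", avg_branch)
print("IPC, control hazards only:", ipc_with_stalls(6.0, control_stalls))
print("IPC, control + I-cache:", ipc_with_stalls(6.0, control_stalls + icache_stalls))
```

Swapping in the penalties derived in part (d) only requires changing `placeholder_cost`; the 50% issue-rate derating in part (e) can be expressed by passing a lower `base_ipc`.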

2. (10 points) Dynamic instruction scheduling

Consider a Tomasulo pipeline (similar to the PowerPC 620 in the text) with stages IF, ID, IS (the process of moving instructions to reservation stations), EX (1 cycle for integer, 2 cycles for the LSU, 3 cycles for FP multiply or add), and WB (commit). There is one integer unit with two reservation stations (I1, I2), one FP unit with two reservation stations (F1, F2), one Load/Store unit with two reservation stations (LS1, LS2), and one Branch unit with two reservation stations (B1, B2) that takes 1 cycle. Indicate the clock cycle in which each of the following instructions would be in the various stages. Assume 1 instruction issue per cycle and separate FP and integer result buses. In the RS# column, indicate the reservation station slot (e.g., F1, I2, B1) that was used. For an instruction already in a reservation station, EX1 can commence during the same cycle as the common data bus write in WB. EX2 and EX3 are not used for all instructions; leave them blank for those that do not need them. A new instruction can be loaded into a reservation station when the old one is in WB, so WB and IS can overlap.
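As a baseline for filling in the table, the sketch below computes the stage timing of a single instruction that never waits. The execution latencies are the ones stated above; the function name `stall_free_row`, the unit labels, and the assumption that EX1 starts in the cycle after IS are illustrative choices. Operand dependences, busy functional units, and full reservation stations are deliberately ignored, so real rows will often be later than this.

```python
# Execution latencies from the problem: integer 1, load/store 2, FP 3, branch 1.
EX_CYCLES = {"INT": 1, "LSU": 2, "FP": 3, "BR": 1}


def stall_free_row(fetch_cycle, unit):
    """Map each stage to its clock cycle for one instruction with no stalls.

    ASSUMPTIONS: IF, ID and IS take one cycle each, EX1 starts the cycle
    after IS, and WB follows the last EX cycle. Unused EX columns are
    simply absent (left blank in the table).
    """
    row = {"IF": fetch_cycle, "ID": fetch_cycle + 1, "IS": fetch_cycle + 2}
    first_ex = fetch_cycle + 3
    for k in range(EX_CYCLES[unit]):
        row[f"EX{k + 1}"] = first_ex + k
    row["WB"] = first_ex + EX_CYCLES[unit]
    return row


# One instruction fetched per cycle, each on a different unit.
for cycle, unit in enumerate(["LSU", "FP", "INT", "BR"], start=1):
    print(unit, stall_free_row(cycle, unit))
```

Comparing an actual table row against this baseline shows how many cycles an instruction spent waiting in its reservation station.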
