






Material Type: Exam; Professor: Rotenberg; Class: Advanced Microprocessor Systems Design; Subject: Electrical and Computer Engineering; University: North Carolina State University; Term: Spring 2005;







Name: __________________________
Question 1 Question 2 Question 3 Question 4 Question 5 Question 6 Total
This is a 180-minute open-book test. You may use the textbook and any course notes for this course that you may have. You may not use sample tests or published solutions to problem sets. Answer five of the six questions. Each question is worth 20 points. If you answer all six questions, your five highest scores will count. For partial credit, show your work, especially where the answer is just a number. Please write your answers in the space provided. If you run out of room, you may use the back of the page or another piece of paper.
Question 1. (4 points each) For a given benchmark program, the frequency (% of total instruction count) of each kind of instruction and its clock cycles (number of clock cycles needed if not pipelined) are, respectively:
Instruction type    Load    Store    ALU    Conditional branch    Call/return
Frequency           41%     8%       32%    12%                   7%
Clock cycles        7       6        3      4                     3
(a) What is the overall CPI if a non-pipelined processor A runs the given benchmark program?
CPI_overall = 41% × 7 + 8% × 6 + 32% × 3 + 12% × 4 + 7% × 3 = 5.0
(b) Assume the clock rate for processor A is 1.4 GHz, and the total number of instructions in the given program is 1,000,000,000 (10^9). How many seconds does it take to run the benchmark program?
CPU time = CPI_overall × Instruction count × cycle time = 5 × 1,000,000,000 × (1 / 1,400,000,000) ≈ 3.57 seconds
(c) Now consider a pipelined processor B. In order to obtain an instruction throughput of 0.8 instructions per cycle while running this given benchmark program, what is the maximum average number of stalls per instruction that can be allowed?
For a pipeline: CPI_pipe = CPI_no-stall + stall cycles per instruction = 1 + stall cycles per instruction = 1 / 0.8 = 1.25
⇒ Stall cycles per instruction = 0.25
(d) Suppose the instruction throughput of the pipelined processor B is 3 times higher than that of the non-pipelined processor A. If B has a throughput of 0.8 instructions per cycle, then which processor, B or A, must have a higher clock rate and by what percent?
CPI_unpipe = CPI_overall = 5
CPI_pipe = 1.25
Speedup = (CPI_unpipe × CT_unpipe) / (CPI_pipe × CT_pipe) = 3
⇒ CT_unpipe / CT_pipe = 3 × 1.25 / 5 = 0.75
i.e., Clock rate A / Clock rate B = 1 / 0.75 ≈ 1.33
Thus, the clock rate of machine A is one-third faster than the clock rate of machine B.
(e) Is this more likely to be an integer or a floating-point benchmark?
Answer: It’s more likely to be an integer benchmark. From the tables at the end of Lecture 13, we would expect a much higher frequency of ALU operations in a floating-point benchmark.
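The arithmetic in parts (a) through (d) can be checked with a short Python sketch. The variable names are illustrative only; the numbers come from the instruction-mix table and the question text.

```python
# Instruction mix from the table: (frequency, clock cycles when not pipelined).
mix = {
    "load": (0.41, 7),
    "store": (0.08, 6),
    "alu": (0.32, 3),
    "cond_branch": (0.12, 4),
    "call_return": (0.07, 3),
}

# (a) Overall CPI is the frequency-weighted average of the cycle counts.
cpi_a = sum(freq * cycles for freq, cycles in mix.values())

# (b) CPU time = CPI x instruction count x cycle time.
instruction_count = 1e9
clock_rate_a = 1.4e9  # Hz
cpu_time_a = cpi_a * instruction_count / clock_rate_a

# (c) A throughput of 0.8 instructions per cycle means CPI = 1 / 0.8;
#     everything above the ideal CPI of 1 is stall cycles.
cpi_b = 1 / 0.8
stalls_per_instruction = cpi_b - 1

# (d) Speedup = (CPI_A x CT_A) / (CPI_B x CT_B) = 3, solved for CT_A / CT_B.
speedup = 3
cycle_time_ratio = speedup * cpi_b / cpi_a   # CT_A / CT_B
clock_rate_ratio = 1 / cycle_time_ratio      # rate_A / rate_B

print(f"CPI_A = {cpi_a:.2f}")
print(f"CPU time = {cpu_time_a:.2f} s")
print(f"stalls per instruction = {stalls_per_instruction:.2f}")
print(f"CT_A / CT_B = {cycle_time_ratio:.2f}, rate_A / rate_B = {clock_rate_ratio:.2f}")
```

A clock-rate ratio above 1 means processor A needs the faster clock, which agrees with the one-third figure in part (d).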
Question 3. Here is an assembly-language program for a machine that uses delayed jumps. Note that arithmetic operands may come from memory as well as from registers. The code has not yet been modified to accommodate the delayed jumps. (The program is the loop portion of a binary search. You may find the comments helpful, but you do not need to understand the code in order to do the problem.)
 1  LOOP     LOAD    R1,L          Calculate M := [L + U] ÷ 2
 2           ADD     R1,U          and leave result in register 1.
 3           RSHIFT  R1,1          :
 4           LOAD    R2,A[1]       Test A[M] – K
 5           SUB     R2,K          :
 6           JZERO   R2,SUCCESS    If A[M] = K, search was successful
 7           JPOS    R2,SETU       Branch if A[M] > K
 8           MOVE    R1,R3         Set L := M + 1
 9           ADD1    R3            :
10           STORE   R3,L          :
11           JUMP    TESTL         Unconditional jump
12  SETU     MOVE    R1,R4         Set U := M – 1
13           SUB1    R4            :
14           STORE   R4,U          :
15  TESTL    LOAD    R1,U          If L > U then search was
16           SUB     R1,L          unsuccessful; otherwise continue
17           JPOS    R1,LOOP       loop.
18  FAILURE  LOAD    R5,L
             :
             :
    SUCCESS  LOAD    R5,M
             :

(a) (2 points) The easiest way to accommodate the delayed jumps is simply to add no-ops (NOPs). After which instructions in this program would no-ops need to be added?
Answer: After instructions 6, 7, 11, and 17.

(b) (4 points) Sometimes a NOP can be removed by interchanging two instructions. Which of the delayed jumps in the above program can be removed in this way? Mark up the program to show how to remove them.
Answer: The NOP after line 11 can be removed if the instructions at lines 10 and 11 are interchanged. This is true because the result of the STORE R3,L instruction is not required before the jump is executed. The other three jumps in the program are all conditional jumps that depend on the value computed by the previous instruction; hence no interchange is possible.

(c) (6 points) Otherwise, a delayed jump can be removed by copying or moving an instruction from elsewhere in the program. Which delayed jumps must be removed in this fashion? Mark up the program to show how to remove them.
Answer: One technique is to copy the target instruction of the branch to the place occupied by the NOP. If the instruction can be reached only via the branch, the original copy may then be discarded. Since the target instruction is then executed regardless of whether the branch is taken, it is a useless instruction execution when no branching occurs. One must also be careful that the moved instruction does not write into registers or memory that will be needed later if the branch is not taken. If the changes from parts (b) and (c) are made, the program will now read:

 1  LOOP     LOAD    R1,L          Calculate M := [L + U] ÷ 2
 2           ADD     R1,U          and leave result in register 1.
 3           RSHIFT  R1,1          :
 4           LOAD    R2,A[1]       Test A[M] – K
 5           SUB     R2,K          :
 6           JZERO   R2,SUCCESS    If A[M] = K, search was successful
 7           LOAD    R5,M
 8           JPOS    R2,SETU       Branch if A[M] > K
 9           MOVE    R1,R4         Start setting U := M – 1
10           MOVE    R1,R3         Set L := M + 1
11           ADD1    R3            :
12           JUMP    TESTL         Unconditional jump
13           STORE   R3,L          :
14  SETU     SUB1    R4            Finish setting U := M – 1
15           STORE   R4,U          :
16  TESTL    LOAD    R1,U          If L > U then search was
17           SUB     R1,L          unsuccessful; otherwise continue
18           JPOS    R1,LOOP+1     loop.
19           LOAD    R1,L
20  FAILURE  LOAD    R5,L
             :
             :
    SUCCESS

(d) (4 points) Assume that this machine uses the following MIPS-like pipeline.
S 1 S 2 S 3 S 4 S 5 Instr. fetch Instr. decode Operand decode^ Store execute^ or Operand fetch
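The answer to part (a) of Question 3 can be reproduced mechanically. The following Python sketch holds the original (unmodified) listing as numbered opcode/operand pairs, with labels and comments dropped, and reports every jump instruction as a site that needs a NOP. The names `JUMP_OPCODES`, `nop_sites`, and `with_nops` are illustrative and not part of the exam.

```python
# Original listing from Question 3, lines 1-18 (labels and comments omitted).
program = [
    (1, "LOAD R1,L"),
    (2, "ADD R1,U"),
    (3, "RSHIFT R1,1"),
    (4, "LOAD R2,A[1]"),
    (5, "SUB R2,K"),
    (6, "JZERO R2,SUCCESS"),
    (7, "JPOS R2,SETU"),
    (8, "MOVE R1,R3"),
    (9, "ADD1 R3"),
    (10, "STORE R3,L"),
    (11, "JUMP TESTL"),
    (12, "MOVE R1,R4"),
    (13, "SUB1 R4"),
    (14, "STORE R4,U"),
    (15, "LOAD R1,U"),
    (16, "SUB R1,L"),
    (17, "JPOS R1,LOOP"),
    (18, "LOAD R5,L"),
]

# Opcodes that transfer control in this listing.
JUMP_OPCODES = {"JZERO", "JPOS", "JUMP"}


def nop_sites(program):
    """Return the line numbers of instructions that need a NOP after them."""
    return [n for n, text in program if text.split()[0] in JUMP_OPCODES]


def with_nops(program):
    """Return the instruction stream with a NOP filling every delay slot."""
    out = []
    for _, text in program:
        out.append(text)
        if text.split()[0] in JUMP_OPCODES:
            out.append("NOP")
    return out


print(nop_sites(program))       # [6, 7, 11, 17]
print(len(with_nops(program)))  # 22 instructions: 18 original plus 4 NOPs
```

Parts (b) and (c) then amount to replacing each of those four NOPs with a useful instruction, which takes the dependence reasoning given in the answers rather than a simple scan.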
Question 4. Consider the following code in our simple 5-stage MIPS-like pipeline. This code sums the elements of an array starting at address 1000.
First instruction (e.g. “3”) | Action (R or W, which data item?) | Second instruction (e.g. “4”) | Action (R or W, which data item?) | Type of hazard (RAW, WAR, WAW)
Answer: Our simple 5-stage pipeline does not allow WAR or WAW hazards. RAW hazards still exist:
Insts. 2 and 3: inst. 2 writes r7, inst. 3 reads r7.
Insts. 2 and 5: inst. 2 writes r7, inst. 5 reads r7.
Insts. 3 and 4: inst. 3 writes r6, inst. 4 reads r6.
Insts. 5 and 3: inst. 5 writes r7; on the next iteration of the loop, inst. 3 reads r7.
Insts. 6 and 7: inst. 6 writes r2, inst. 7 reads r2 to determine whether to branch.
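The RAW pairs inside a single iteration can be found with a last-writer scan. The Python sketch below is only an illustration: it encodes just the register reads and writes named in the answer above, so any other operands of the original code are unknown here and left out, and the function name `raw_hazards` is hypothetical.

```python
# (instruction number, registers read, registers written), restricted to the
# accesses mentioned in the answer; other operands are not known and omitted.
accesses = [
    (2, set(), {"r7"}),
    (3, {"r7"}, {"r6"}),
    (4, {"r6"}, set()),
    (5, {"r7"}, {"r7"}),
    (6, set(), {"r2"}),
    (7, {"r2"}, set()),
]


def raw_hazards(accesses):
    """Return (producer, consumer, register) for each read-after-write pair."""
    last_writer = {}
    hazards = []
    for inst, reads, writes in accesses:
        # Check reads first, so an instruction that reads and writes the same
        # register depends on the earlier writer, not on itself.
        for reg in sorted(reads):
            if reg in last_writer:
                hazards.append((last_writer[reg], inst, reg))
        for reg in writes:
            last_writer[reg] = inst
    return hazards


for producer, consumer, reg in raw_hazards(accesses):
    print(f"inst. {producer} writes {reg}, inst. {consumer} reads {reg}")
```

A single pass like this covers only one trip through the code. The loop-carried pair (inst. 5 and inst. 3 of the next iteration) needs a second pass over the loop body that starts with the writers left over from the first.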
(d) (4 points) Assuming that a branch target becomes known in the MEM stage, and that there are no bypasses or branch prediction, what is the delay between the start of two successive iterations of the loop in an in-order architecture (no Tomasulo’s algorithm, etc.)?
Answer: The easiest way to determine this is to draw a diagram.
[Pipeline timing diagram covering clock cycles 1 through 24]
On the first iteration, instruction 3 starts at cycle 3. On the second iteration, it starts at cycle 20. Therefore, the delay is 17 cycles.
Question 5. (4 points each) Tomasulo's algorithm. For this problem, consider the following architecture specification:
(c) It is often possible to speed up the code by local scheduling, that is, moving code around within a basic block. Is that possible in this case? If so, rearrange the instructions to improve the speed of this code fragment as much as possible, and then fill out the tableau again. If not, explain why not.
Answer: It’s not possible in this case. The first two loads must be finished before any multiplications begin. This means no multiplications can issue before the second cycle. So one multiplication needs to issue on the third cycle, one on the fourth, and one on the fifth. There is no way to get the schedule to finish more quickly.
(d) Redo part (b), assuming two instructions issued per cycle, and assume that the functional units are fully pipelined (they can begin an operation each cycle).
Answer:
Instruction          Issued    Execution    WB
LD F0, 0(R1)         1         2            3
LD F2, 0(R2)         1         3            4
LD F4, 16(R2)        2         4            5
MULTD F6, F2, F0     2         4 – 12       13
MULTD F8, F4, F0     3         5 – 13       14
LD F2, 32(R2)        3         4            6
LD F4, 48(R2)        4         5            7
MULTD F10, F2, F0    4         5 – 13       14
MULTD F12, F4, F0    5         6 – 14       15
(e) Now, suppose that Thornton’s algorithm was in use (no register renaming). Without register renaming, but with fully pipelined functional units, how long would it take to complete the schedule? Answer: Fifteen cycles, the same as in part (d). Although the values in F2 and F4 are overwritten, the “old” values are already in the pipeline by that point, so there is no additional delay.
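The fifteen-cycle figure quoted in parts (d) and (e) is simply the latest write-back cycle in the part (d) tableau. As a minimal sketch, the Python below copies the Issued and WB columns from that tableau and checks both the completion time and the two-per-cycle issue limit. It verifies the given schedule and does not simulate Tomasulo's algorithm.

```python
from collections import Counter

# (instruction, issue cycle, write-back cycle), copied from the part (d) tableau.
schedule = [
    ("LD F0, 0(R1)", 1, 3),
    ("LD F2, 0(R2)", 1, 4),
    ("LD F4, 16(R2)", 2, 5),
    ("MULTD F6, F2, F0", 2, 13),
    ("MULTD F8, F4, F0", 3, 14),
    ("LD F2, 32(R2)", 3, 6),
    ("LD F4, 48(R2)", 4, 7),
    ("MULTD F10, F2, F0", 4, 14),
    ("MULTD F12, F4, F0", 5, 15),
]

# The schedule is complete when the last instruction writes back.
completion = max(wb for _, _, wb in schedule)

# Count how many instructions issue in each cycle.
issued_per_cycle = Counter(issue for _, issue, _ in schedule)
max_issued = max(issued_per_cycle.values())

print(f"schedule completes in cycle {completion}")
print(f"at most {max_issued} instructions issue in any one cycle")
```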
Question 6. In Lecture 24, we went through an example with a ROB. In Lecture 25, we introduced history buffers and future files as alternatives to a ROB. Suppose we have the code sequence given on the right, the same sequence as in Lecture 24. Consider the point where instruction F has just been issued. The ROB is as shown below.
Entry    Dest    Result    Exception    Completed    PC
0        R2                                          A
1        R3                                          B
2        R2      667       No           √            C
3        R1      743       No           √            D
4        R2      –689      No           √            E
5        R0                                          F
(a) (5 points) Assume that a history buffer is in use instead of a ROB. Show its contents below. You may assume the initial register contents were as shown at the right.
Entry    Dest    Old value    Exception    Valid    PC
0        R2                                         A
1        R3                                         B
2        R2      –5           No           √        C
3        R1      76           No           √        D
4        R2      667          No           √        E
5        R0                                         F

(b) (2 points) With the history buffer in use, what will be the contents of the register file at this point? Show them at the right.
(c) (3 points) Suppose that a future file is in use instead of a history buffer. As above, assume that instruction F has just been issued. How, if at all, do the contents of the ROB differ from the ROB shown at the beginning of the problem? Explain. You may write on the tableau at the beginning of the problem, if that helps you show the differences.
Answer: There is no difference from the tableau above. The only difference is in the register file.
(d) (4 points) With a future file in use, show the contents of the architectural register file and the future file at this point in the program.
Architectural file R0: 54 R1: 76 R2: – 5 R3: – 99
Future file R0: 54 R1: 743 R2: – 689 R3: – 99
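The history-buffer and future-file answers both follow from replaying the completed instructions against the register file. The Python sketch below takes the initial register contents to be the architectural-file values listed above; that is an assumption, since the original figure with the initial contents is not reproduced here. It also assumes instructions C, D, and E complete in program order, and the function name `replay` is illustrative.

```python
# Assumed initial register contents (taken from the architectural file above).
initial = {"R0": 54, "R1": 76, "R2": -5, "R3": -99}

# Completed instructions from the ROB tableau: (PC, destination, result).
# A, B, and F have not completed, so they contribute nothing yet.
completed = [("C", "R2", 667), ("D", "R1", 743), ("E", "R2", -689)]


def replay(initial, completed):
    """Apply completed results in order, logging each overwritten value."""
    regs = dict(initial)
    history = []
    for pc, dest, value in completed:
        history.append((pc, dest, regs[dest]))  # old value, kept for rollback
        regs[dest] = value
    return history, regs


history, speculative = replay(initial, completed)

# Nothing has committed because A, the oldest instruction, is incomplete,
# so the architectural file still equals the initial contents.
architectural = dict(initial)

print("history buffer old values:", history)
print("future file:", speculative)
print("architectural file:", architectural)
```

The logged old values (–5, 76, 667) match the history-buffer tableau, and the register contents after the replay match the future file. Under the same assumptions, this is also the state the register file itself holds when a history buffer is in use.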