Solved Final Exam - Advanced Microprocessor Systems Design | ECE 463, Exams of Electrical and Electronics Engineering

Material Type: Exam; Professor: Rotenberg; Class: Advanced Microprocessor Systems Design; Subject: Electrical and Computer Engineering; University: North Carolina State University; Term: Spring 2005;


Uploaded on 03/18/2009

Name: __________________________
Question 1 Question 2 Question 3 Question 4 Question 5 Question 6 Total
ECE 463: Advanced Microprocessor Design
ECE 521: Computer Design and Technology
Final Examination
Wednesday, May 4, 2005

This is a 180-minute open-book test. You may use the textbook and any course notes for this course that you may have. You may not use sample tests or published solutions to problem sets.

Answer five of the six questions. Each question is worth 20 points. If you answer all six questions, your five highest scores will count. For partial credit, show your work, especially where the answer is just a number.

Please write your answers in the space provided. If you run out of room, you may use the back of the page or another piece of paper.

Here is a list of questions and subjects.

Question Topic
  1. CPI of an instruction mix
  2. Cache replacement strategies
  3. Delayed jumps and data hazards
  4. Structural, control, and data hazards
  5. Tomasulo’s and Thornton’s algorithm
  6. History buffers and future files

Question 1. (4 points each) For a given benchmark program, the frequency (% of total instruction count) of each kind of instruction and its clock cycles (number of clock cycles needed if not pipelined) are as follows:

Instruction type      Frequency   Clock cycles
Load                  41%         7
Store                 8%          6
ALU                   32%         3
Conditional branch    12%         4
Call/return           7%          3

(a) What is the overall CPI if a non-pipelined processor A runs the given benchmark program?

CPI_overall = 0.41 × 7 + 0.08 × 6 + 0.32 × 3 + 0.12 × 4 + 0.07 × 3 = 5.0

(b) Assume the clock rate for processor A is 1.4 GHz, and the total number of instructions in the given program is 1,000,000,000 (10^9). How many seconds does it take to run the benchmark program?

CPU time = CPI_overall × instruction count × cycle time = 5 × 1,000,000,000 × (1 / 1,400,000,000) = 3.57 seconds

(c) Now consider a pipelined processor B. In order to obtain an instruction throughput of 0.8 instructions per cycle while running this benchmark program, what is the maximum average number of stalls per instruction that can be allowed?

For a pipeline: CPI_pipe = CPI_no-stall + stall cycles per instruction = 1 + stall cycles per instruction = 1 / 0.8 = 1.25
⇒ Stall cycles per instruction = 0.25

(d) Suppose the instruction throughput of the pipelined processor B is 3 times higher than that of the non-pipelined processor A. If B has a throughput of 0.8 instructions per cycle, then which processor, B or A, must have a higher clock rate, and by what percent?

CPI_unpipe = CPI_overall = 5
CPI_pipe = 1.25
Speedup = (CPI_unpipe × CT_unpipe) / (CPI_pipe × CT_pipe) = 3
⇒ CT_unpipe / CT_pipe = 3 × 1.25 / 5 = 0.75
i.e., clock rate A / clock rate B = 1 / 0.75 ≈ 1.33

Thus, the clock rate of machine A is about one-third (33%) higher than the clock rate of machine B.

(e) Is this more likely to be an integer or a floating-point benchmark?

Answer: It’s more likely to be an integer benchmark. From the tables at the end of Lecture 13, we would expect a much higher frequency of ALU operations in a floating-point benchmark.
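The arithmetic in parts (a) through (d) can be checked with a short script. This is a minimal sketch in Python; the function names and the MIX dictionary are illustrative choices and are not part of the exam.

```python
# Instruction mix from the Question 1 table: type -> (frequency, clock cycles).
MIX = {
    "load": (0.41, 7),
    "store": (0.08, 6),
    "alu": (0.32, 3),
    "cond_branch": (0.12, 4),
    "call_return": (0.07, 3),
}


def overall_cpi(mix):
    """Weighted average of cycles per instruction over the mix."""
    return sum(freq * cycles for freq, cycles in mix.values())


def cpu_time(cpi, instruction_count, clock_hz):
    """CPU time in seconds = CPI x instruction count x cycle time."""
    return cpi * instruction_count / clock_hz


def max_stalls_per_instruction(target_ipc, base_cpi=1.0):
    """Stall budget for a pipeline whose no-stall CPI is base_cpi."""
    return 1.0 / target_ipc - base_cpi


def cycle_time_ratio(speedup, cpi_unpipe, cpi_pipe):
    """CT_unpipe / CT_pipe, from speedup = (CPI_u * CT_u) / (CPI_p * CT_p)."""
    return speedup * cpi_pipe / cpi_unpipe


if __name__ == "__main__":
    cpi_a = overall_cpi(MIX)
    print("CPI of A:", round(cpi_a, 2))
    print("CPU time (s):", round(cpu_time(cpi_a, 10**9, 1.4e9), 2))
    print("Stalls per instruction:", round(max_stalls_per_instruction(0.8), 2))
    ratio = cycle_time_ratio(3, cpi_a, 1 / 0.8)
    print("CT_A / CT_B:", round(ratio, 2))
    print("Clock rate A / clock rate B:", round(1 / ratio, 2))
```

Keeping each formula in its own function makes it easy to try other mixes or target throughputs without redoing the algebra.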

Question 3. Here is an assembly-language program for a machine that uses delayed jumps. Note that arithmetic operands may come from memory as well as from registers. The code has not yet been modified to accommodate the delayed jumps. (The program is the loop portion of a binary search. You may find the comments helpful, but you do not need to understand the code in order to do the problem.)

Line  Label    Instruction          Comment
 1    LOOP     LOAD   R1,L          Calculate M := [L + U] ÷ 2
 2             ADD    R1,U          and leave result in register 1.
 3             RSHIFT R1,1
 4             LOAD   R2,A[1]       Test A[M] – K
 5             SUB    R2,K
 6             JZERO  R2,SUCCESS    If A[M] = K, search was successful
 7             JPOS   R2,SETU       Branch if A[M] > K
 8             MOVE   R1,R3         Set L := M + 1
 9             ADD1   R3
10             STORE  R3,L
11             JUMP   TESTL         Unconditional jump
12    SETU     MOVE   R1,R4         Set U := M – 1
13             SUB1   R4
14             STORE  R4,U
15    TESTL    LOAD   R1,U          If L > U then search was
16             SUB    R1,L          unsuccessful; otherwise continue
17             JPOS   R1,LOOP       loop.
18    FAILURE  LOAD   R5,L
               :
      SUCCESS  LOAD   R5,M

(a) (2 points) The easiest way to accommodate the delayed jumps is simply to add no-ops (NOPs). After which instructions in this program would no-ops need to be added?

Answer: After instructions 6, 7, 11, and 17.

(b) (4 points) Sometimes a NOP can be removed by interchanging two instructions. Which of the delayed jumps in the above program can be removed in this way? Mark up the program to show how to remove them.

Answer: The NOP after line 11 can be removed if the instructions at lines 10 and 11 are interchanged. This is true because the result of the STORE R3,L instruction is not required before the jump is executed. The other three jumps in the program are all conditional jumps that depend on the value computed by the previous instruction; hence no interchange is possible.

(c) (6 points) Otherwise, a delayed jump can be removed by copying or moving an instruction from elsewhere in the program. Which delayed jumps must be removed in this fashion? Mark up the program to show how to remove them.

Answer: One technique is to copy the target instruction of the branch to the place occupied by the NOP. If the instruction can be reached only via the branch, the original copy may then be discarded. Since the target instruction is then executed regardless of whether the branch is taken, it is a useless instruction execution when no branching occurs. One must also be careful that the moved instruction does not write into registers or memory that will be needed later if the branch is not taken.

If the changes from parts (b) and (c) are made, the program will now read:

Line  Label    Instruction          Comment
 1    LOOP     LOAD   R1,L          Calculate M := [L + U] ÷ 2
 2             ADD    R1,U          and leave result in register 1.
 3             RSHIFT R1,1
 4             LOAD   R2,A[1]       Test A[M] – K
 5             SUB    R2,K
 6             JZERO  R2,SUCCESS    If A[M] = K, search was successful
 7             LOAD   R5,M
 8             JPOS   R2,SETU       Branch if A[M] > K
 9             MOVE   R1,R4         Start setting U := M – 1
10             MOVE   R1,R3         Set L := M + 1
11             ADD1   R3
12             JUMP   TESTL         Unconditional jump
13             STORE  R3,L
14    SETU     SUB1   R4            Finish setting U := M – 1
15             STORE  R4,U
16    TESTL    LOAD   R1,U          If L > U then search was
17             SUB    R1,L          unsuccessful; otherwise continue
18             JPOS   R1,LOOP+1     loop.
19             LOAD   R1,L
20    FAILURE  LOAD   R5,L
               :
      SUCCESS

(d) (4 points) Assume that this machine uses the following MIPS-like pipeline.

Stage      S1             S2              S3               S4                 S5
Function   Instr. fetch   Instr. decode   Operand decode   Store or execute   Operand fetch

  • In LOAD instructions, registers are loaded in cycle 5.
  • In arithmetic or bit-manipulation instructions, calculations take place during cycle 4.
  • In STORE instructions, registers are stored in cycle 4.
  • MOVE a, b means “move contents of register a to register b”; it takes place on cycle 4.

The original program is reproduced below. [It was reproduced on the test, though not here.] Indicate where stalls would occur due to data hazards. For simplicity’s sake, assume that all useful bypasses are available, and that the jumps are not delayed jumps in parts (d) and (e).

Answer: A stall would be needed after instruction 1 (the LOAD and ADD conflict on the LOAD’s 5th cycle). Stalls would also be needed after the LOADs at lines 4, 15, and 18, for the same reason.

(e) (4 points) Show how to remove these stalls, if possible, by reorganizing the program. (You may mark up the code above.)

Answer: The modified code is shown below. Changes are italicized. The first stall can be removed by moving the “LOAD R2,A[I]” (instruction 4) into the place of the stall; this has the effect
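The answers to parts (a) and (d) of Question 3 follow two simple scanning rules: a NOP goes after every jump, and a stall follows every LOAD whose destination register is used by the very next instruction. The sketch below applies both rules to the original 18-line program. It is an illustration only: the PROGRAM encoding and the function names are assumptions, and because the instruction after line 18 is elided in the listing, the sketch cannot check that last LOAD.

```python
# Original program from Question 3 as (line number, opcode, operands).
PROGRAM = [
    (1, "LOAD", ["R1", "L"]),
    (2, "ADD", ["R1", "U"]),
    (3, "RSHIFT", ["R1", "1"]),
    (4, "LOAD", ["R2", "A[1]"]),
    (5, "SUB", ["R2", "K"]),
    (6, "JZERO", ["R2", "SUCCESS"]),
    (7, "JPOS", ["R2", "SETU"]),
    (8, "MOVE", ["R1", "R3"]),
    (9, "ADD1", ["R3"]),
    (10, "STORE", ["R3", "L"]),
    (11, "JUMP", ["TESTL"]),
    (12, "MOVE", ["R1", "R4"]),
    (13, "SUB1", ["R4"]),
    (14, "STORE", ["R4", "U"]),
    (15, "LOAD", ["R1", "U"]),
    (16, "SUB", ["R1", "L"]),
    (17, "JPOS", ["R1", "LOOP"]),
    (18, "LOAD", ["R5", "L"]),
]

JUMP_OPCODES = {"JZERO", "JPOS", "JUMP"}


def nop_sites(program):
    """Lines after which a NOP is needed: every delayed jump."""
    return [line for line, opcode, _ in program if opcode in JUMP_OPCODES]


def load_use_sites(program):
    """LOAD lines whose destination register is an operand of the next listed line."""
    sites = []
    for (line, opcode, operands), (_, _, next_operands) in zip(program, program[1:]):
        if opcode == "LOAD" and operands[0] in next_operands:
            sites.append(line)
    return sites


if __name__ == "__main__":
    print("NOP after lines:", nop_sites(PROGRAM))
    print("Load-use stall after lines:", load_use_sites(PROGRAM))
```

The load-use check looks only at the textually next instruction, which matches the one-cycle gap described in the answer to part (d); it does not follow jumps.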

Question 4. Consider the following code in our simple 5-stage MIPS-like pipeline. This code sums the elements of an array starting at address 1000.

  1. li r1, 0
  2. li r7, 1000
  3. loop: lw r6, (r7)
  4. add r1, r6, r1
  5. addi r7, 4
  6. subi r2, 1
  7. bne r2, r0, loop

(a) (3 points) What are the structural hazards in the above code sequence?

Answer: Structural hazards may occur if an instruction accesses memory during the same cycle in which another instruction is fetched. Thus, a potential structural hazard exists between the MEM stage of the load at instruction 3 (loop: lw r6, (r7)) and the IF stage of an instruction fetched later.

(b) (3 points) What are the control hazards in the above code sequence?

Answer: Control hazards occur at branches and jumps; therefore instruction 7 (bne r2, r0, loop) causes a control hazard.

(c) (10 points) What are the potential data hazards for the above code sequence (do not forget that there is a loop)? For each hazard, is it RAW, WAR, or WAW?

First instruction (e.g. “3”) | Action (R or W, which data item?) | Second instruction (e.g. “4”) | Action (R or W, which data item?) | Type of hazard (RAW, WAR, WAW)

Answer: Our simple 5-stage pipeline does not allow WAR or WAW hazards. RAW hazards still exist:

Insts. 2 and 3: inst. 2 writes r7, inst. 3 reads r7.
Insts. 2 and 5: inst. 2 writes r7, inst. 5 reads r7.
Insts. 3 and 4: inst. 3 writes r6, inst. 4 reads r6.
Insts. 5 and 3: inst. 5 writes r7; on the next iteration of the loop, inst. 3 reads r7.
Insts. 6 and 7: inst. 6 writes r2, inst. 7 reads r2 to determine whether to branch.
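The search for RAW pairs in part (c) can be mechanized by walking the instruction trace and remembering the last writer of each register. The following is a minimal sketch under stated assumptions: the tuple encoding of each instruction and the choice of two loop iterations (enough to expose loop-carried pairs) are illustrative. It reports every write followed by a read of the same register regardless of distance, so it returns more pairs than the five listed in the answer (for example, those through r1); a pair becomes an actual hazard only when the two instructions overlap in the pipeline.

```python
# Each instruction: (number, destination register or None, source registers).
PROLOGUE = [
    (1, "r1", []),            # li r1, 0
    (2, "r7", []),            # li r7, 1000
]
LOOP_BODY = [
    (3, "r6", ["r7"]),        # lw r6, (r7)
    (4, "r1", ["r6", "r1"]),  # add r1, r6, r1
    (5, "r7", ["r7"]),        # addi r7, 4
    (6, "r2", ["r2"]),        # subi r2, 1
    (7, None, ["r2", "r0"]),  # bne r2, r0, loop
]


def raw_dependences(prologue, loop_body, iterations=2):
    """Return sorted (writer, reader, register) triples found in the trace."""
    trace = prologue + loop_body * iterations
    last_writer = {}
    found = set()
    for number, dest, sources in trace:
        # Sources are read before this instruction's own result is written.
        for reg in sources:
            if reg in last_writer:
                found.add((last_writer[reg], number, reg))
        if dest is not None:
            last_writer[dest] = number
    return sorted(found)


if __name__ == "__main__":
    for writer, reader, reg in raw_dependences(PROLOGUE, LOOP_BODY):
        print(f"inst. {writer} writes {reg}, inst. {reader} reads {reg}")
```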

(d) (4 points) Assuming that a branch target becomes known in the MEM stage, and that there are no bypasses or branch prediction, what is the delay between the start of two successive iterations of the loop in an in-order architecture (no Tomasulo’s algorithm, etc.)?

Answer: The easiest way to determine this is to draw a diagram.

Clock #               1     2     3     4     5     6     7     8     9     10    11    12
1. li r1, 0           IF    ID    EX    MEM   WB
2. li r7, 1000              IF    ID    EX    MEM   WB
3. loop: lw r6, (r7)              IF    ID    stall stall ID    EX    MEM   WB
4. add r1, r6, r1                       IF    ID    stall stall stall stall stall ID    EX
5. addi r7, 4                                 IF    ID    stall stall stall stall stall ID
6. subi r2, 1                                       IF    ID    stall stall stall stall stall
7. bne r2, r0, loop                                       IF    ID    stall stall stall stall
8. loop: lw r6, (r7)

Clock #               13    14    15    16    17    18    19    20    21    22    23    24
4. add r1, r6, r1     MEM   WB
5. addi r7, 4         EX    MEM   WB
6. subi r2, 1         ID    EX    MEM   WB
7. bne r2, r0, loop   stall stall stall stall ID    EX    MEM   WB
8. loop: lw r6, (r7)                                            IF    ID    EX    MEM   WB

On the first iteration, instruction 3 starts at cycle 3. On the second iteration, it starts at cycle 20. Therefore, the delay is 20 – 3 = 17 cycles.

Question 5. (4 points each) Tomasulo's algorithm. For this problem, consider the following architecture specification:

  • 2 FP Multiply units, with 9 EX cycles.
  • 1 FP Add unit, with 5 EX cycles.
  • 1 FP Divide unit, with 12 EX cycles.
  • 1 integer unit, used for all memory references, with 1 EX cycle in the case of a cache hit.

In parts (a)–(c), these units are not fully pipelined. When one instruction is in a functional unit, another cannot enter until the first one is done. Use the following assumptions:

  • Simulate only issue, execution, and WB. Do not include IF and ID.
  • Assume that the number of reservation stations is large enough, i.e., none of the instructions suffers a structural hazard due to the lack of free reservation stations.

(c) It is often possible to speed up the code by local scheduling, that is, moving code around within a basic block. Is that possible in this case? If so, rearrange the instructions to improve the speed of this code fragment as much as possible, and then fill out the tableau again. If not, explain why not.

Answer: It’s not possible in this case. The first two loads must be finished before any multiplications begin. This means no multiplications can issue before the second cycle. So one multiplication needs to issue on the third cycle, one on the fourth, and one on the fifth. There is no way to get the schedule to finish more quickly.

(d) Redo part (b), assuming two instructions issued per cycle, and assume that the functional units are fully pipelined, so they can begin an operation each cycle.

Answer:

Instruction            Issued   Execution   WB
LD    F0, 0(R1)        1        2           3
LD    F2, 0(R2)        1        3           4
LD    F4, 16(R2)       2        4           5
MULTD F6, F2, F0       2        4 – 12      13
MULTD F8, F4, F0       3        5 – 13      14
LD    F2, 32(R2)       3        4           6
LD    F4, 48(R2)       4        5           7
MULTD F10, F2, F0      4        5 – 13      14
MULTD F12, F4, F0      5        6 – 14      15

(e) Now, suppose that Thornton’s algorithm was in use (no register renaming). Without register renaming, but with fully pipelined functional units, how long would it take to complete the schedule? Answer: Fifteen cycles, the same as in part (d). Although the values in F2 and F4 are overwritten, the “old” values are already in the pipeline by that point, so there is no additional delay.

Question 6. In Lecture 24, we went through an example with a ROB. In Lecture 25, we introduced history buffers and future files as alternatives to a ROB. Suppose we have the code sequence given on the right, the same sequence as in Lecture 24. Consider the point where instruction F has just been issued. The ROB is as shown below.

A  LW    R2, 4(R0)
B  MULT  R3, R1, R
C  LW    R2, 8(R0)
D  ADD   R1, R1, R
E  SUB   R2, R0, R1
F  ADD   R0, R1, R

Entry   Dest   Result   Exception   Completed   PC
0       R2                                      A
1       R3                                      B
2       R2     667      No          √           C
3       R1     743      No          √           D
4       R2     –689     No          √           E
5       R0                                      F

(a) (5 points) Assume that a history buffer is in use instead of a ROB. Show its contents below. You may assume the initial register contents were as shown at the right.

R0: 54
R1: 76
R2: –5
R3: –99

Entry   Dest   Old value   Exception   Valid   PC
0       R2                                     A
1       R3                                     B
2       R2     –5          No          √       C
3       R1     76          No          √       D
4       R2     667         No          √       E
5       R0                                     F

(b) (2 points) With the history buffer in use, what will be the contents of the register file at this point? Show them at the right.

R0: 54
R1: 743
R2: –689
R3: –99

(c) (3 points) Suppose that a future file is in use instead of a history buffer. As above, assume that instruction F has just been issued. How, if at all, do the contents of the ROB differ from the ROB shown at the beginning of the problem? Explain. You may write on the tableau at the beginning of the problem, if that helps you show the differences.

Answer: There is no difference from the tableau above. The only difference is in the register file.

(d) (4 points) With a future file in use, show the contents of the architectural register file and the future file at this point in the program.

Architectural file
R0: 54
R1: 76
R2: –5
R3: –99

Future file
R0: 54
R1: 743
R2: –689
R3: –99
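The register contents in parts (a), (b), and (d) of Question 6 can be reproduced by replaying the three completed results (C, D, and E) against the initial register values. The sketch below is illustrative only: the function names are hypothetical, results are assumed to be applied in program order, and nothing is assumed to have committed yet because instruction A at the head of the buffer has not completed.

```python
# Initial register contents and the completed results taken from the ROB tableau.
INITIAL = {"R0": 54, "R1": 76, "R2": -5, "R3": -99}
COMPLETED = [("C", "R2", 667), ("D", "R1", 743), ("E", "R2", -689)]


def apply_with_history_buffer(initial, completed):
    """Write results straight into the register file, logging each old value."""
    regs = dict(initial)
    history = []
    for pc, dest, result in completed:
        history.append((pc, dest, regs[dest]))  # save the value being overwritten
        regs[dest] = result
    return regs, history


def roll_back(regs, history):
    """Undo logged writes in reverse order, as exception recovery would."""
    restored = dict(regs)
    for _, dest, old_value in reversed(history):
        restored[dest] = old_value
    return restored


def apply_with_future_file(initial, completed):
    """Results update only the future file; the architectural file waits for commit."""
    architectural = dict(initial)  # assumption: no instruction has committed yet
    future = dict(initial)
    for _, dest, result in completed:
        future[dest] = result
    return architectural, future


if __name__ == "__main__":
    regs, history = apply_with_history_buffer(INITIAL, COMPLETED)
    print("Register file:", regs)
    print("History buffer:", history)
    print("After rollback:", roll_back(regs, history))
    print("Architectural, future:", apply_with_future_file(INITIAL, COMPLETED))
```

The rollback step shows why the history buffer stores old values: undoing the entries newest-first recovers the precise state that the architectural file holds directly in the future-file scheme.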