






Material Type: Exam; Class: Computer System Organization; Subject: Computer Science; University: University of Illinois - Urbana-Champaign; Term: Spring 2008;







Name Solutions NetID Category 3 Credit Hours 4 Credit Hours Instructions
Problem 1 [5 points]: Suppose we apply an enhancement, E1, that speeds up 20% of a program by a factor of 4, and another enhancement, E2, that speeds up another 10% of the program by a factor of 10. Assume the two enhancements are independent and affect different parts of the program (there is no overlap between the 20% and the 10% of the program). What is the overall speedup for the entire program using both E1 and E2 simultaneously?

Solution: This problem tests the understanding of Amdahl’s law. According to Amdahl’s law,

Speedup = (old latency) / (new latency)
new latency = (1 - f1 - f2) * old latency + (f1 / S1) * old latency + (f2 / S2) * old latency

With f1 = 20%, S1 = 4 and f2 = 10%, S2 = 10:

new latency = old latency * (0.70 + 0.05 + 0.01) = old latency * 0.76
Final speedup = 1 / 0.76 ≈ 1.32

Grading: 2 points for listing Amdahl’s law and the standard equation; 1 point for the equation of this problem; 2 points for substituting the correct values in the equation.

Problem 2 [20 points]: This problem concerns Tomasulo’s algorithm (with reservation stations) with the reorder buffer scheme discussed in detail in the lecture notes. We have the following changes/additions/clarifications relative to the discussion in class. Assume the following information about functional units.

Functional Unit Type    Cycles in EX
Integer Mul             2
Integer Div             10
Integer Add             1
Grading: 0.5 points per entry. 1 point if the majority is correct. Do not cascade errors: do not take off additional points if an earlier error causes later inaccuracies.
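As a quick check of the Amdahl's law arithmetic in Problem 1, the following minimal Python sketch evaluates the same formula for any set of non-overlapping enhancements. The function name `amdahl_speedup` and the (fraction, factor) input format are illustrative choices, not part of the exam.

```python
def amdahl_speedup(enhancements):
    """Overall speedup for non-overlapping enhancements.

    `enhancements` is a list of (fraction, factor) pairs, where `fraction`
    is the share of the original run time that is sped up by `factor`.
    """
    enhanced_fraction = sum(f for f, _ in enhancements)
    if enhanced_fraction > 1.0:
        raise ValueError("fractions must not sum to more than 1")
    # New latency relative to an old latency of 1.0:
    # untouched part + each enhanced part divided by its speedup factor.
    new_latency = (1.0 - enhanced_fraction) + sum(f / s for f, s in enhancements)
    return 1.0 / new_latency


# Problem 1: E1 speeds up 20% by 4x, E2 speeds up another 10% by 10x.
speedup = amdahl_speedup([(0.20, 4), (0.10, 10)])
print(f"{speedup:.3f}")  # prints 1.316
```

Applying E1 alone, `amdahl_speedup([(0.20, 4)])`, gives 1 / 0.85, so the second enhancement adds comparatively little.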
Problem 3 [15 points] Consider the following code fragment:

Loop: LD.D   F2, 0(R1)
      MUL.D  F4, F6, F
      ADD.D  F4, F4, F
      SD.D   F4, 0(R1)
      DADDUI R1, R1, #
      BNE    R1, R3, Loop

Consider a pipeline with the following latencies:
1 cycle between a load and a dependent ALU instruction;
2 cycles between two dependent FP ALU instructions;
3 cycles between an FP ALU and a dependent store instruction; and
0 cycles between all other pairs.
That is, there would need to be one stall cycle between the load and multiply above for correct operation.

Unroll the above loop 4 times and write the resulting code on the left of the table below. You have access to temporary registers T0…T63. Assume the total number of iterations for the original loop is a multiple of 4. Then schedule the unrolled loop for a VLIW machine where each VLIW instruction can contain one memory reference, one FP operation, and one integer operation. Write the scheduled instructions in the table below to minimize the number of stalls. You may use L for LD.D, M for MUL.D, etc.

Mem    FP ALU    Int ALU
Solution:

Loop: LD.D   F2, 0(R1)
      MUL.D  F4, F6, F
      ADD.D  F4, F4, F
      SD.D   F4, 0(R1)
      LD.D   T0, 8(R1)
      MUL.D  T2, F6, T
      ADD.D  T4, T2, F
      SD.D   T4, 8(R1)
      LD.D   T6, 16(R1)
      MUL.D  T8, F6, T
      ADD.D  T10, T8, F
      SD.D   T10, 16(R1)
      LD.D   T12, 24(R1)
      MUL.D  T14, F6, T
      ADD.D  T16, T14, F
      SD.D   T16, 24(R1)
      DADDUI R1, R1, #
      BNE    R1, R3, Loop

Note: For loop unrolling, any unrolled loop that works the same as the original should be given full credit. No need to schedule at this point.

Mem                  FP ALU               Int ALU
LD.D F2, 0(R1)       -                    -
LD.D T0, 8(R1)       -                    -
LD.D T6, 16(R1)      MUL.D F4, F6, F      -
LD.D T12, 24(R1)     MUL.D T2, F6, T      -
-                    MUL.D T8, F6, T6     DADDUI R1, R1, #
-                    MUL.D T14, F6, T     -
-                    ADD.D F4, F4, F      -
-                    ADD.D T4, T2, F      -
-                    ADD.D T10, T8, F     -
-                    ADD.D T16, T14, F    -
SD.D F4, -32(R1)     -                    -
SD.D T4, -24(R1)     -                    -
SD.D T10, -16(R1)    -                    -
SD.D T16, -8(R1)     -                    BNE R1, R3, L
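The issue cycles implied by the latency rules in Problem 3 can be sanity-checked with a small greedy scheduler. This is only a sketch under stated assumptions: "N cycles between" is read as "the consumer issues no earlier than producer cycle + N + 1" (consistent with the one stall cycle between the load and the multiply), operations are placed phase by phase (all loads, then multiplies, adds, stores), and the integer slot is ignored because DADDUI and BNE fit into otherwise empty slots. The function name `schedule_unrolled` is hypothetical.

```python
def schedule_unrolled(iterations=4):
    """Greedy issue cycles for the unrolled loop on a VLIW machine with
    one memory slot and one FP slot per instruction (cycle numbers start at 1)."""
    # Required gaps between a producer and a dependent consumer (from the problem).
    LOAD_TO_FP, FP_TO_FP, FP_TO_STORE = 1, 2, 3

    mem_free = 1  # next cycle in which the memory slot is unused
    fp_free = 1   # next cycle in which the FP slot is unused
    loads, muls, adds, stores = [], [], [], []

    for _ in range(iterations):          # one load per cycle
        loads.append(mem_free)
        mem_free += 1
    for ld in loads:                     # multiply depends on its load
        cycle = max(ld + LOAD_TO_FP + 1, fp_free)
        muls.append(cycle)
        fp_free = cycle + 1
    for mul in muls:                     # add depends on its multiply
        cycle = max(mul + FP_TO_FP + 1, fp_free)
        adds.append(cycle)
        fp_free = cycle + 1
    for add in adds:                     # store depends on its add
        cycle = max(add + FP_TO_STORE + 1, mem_free)
        stores.append(cycle)
        mem_free = cycle + 1

    return {"LD.D": loads, "MUL.D": muls, "ADD.D": adds, "SD.D": stores}


sched = schedule_unrolled()
for op, cycles in sched.items():
    print(op, cycles)
```

Under these assumptions the last store issues in cycle 14, so four original iterations take 14 VLIW instructions.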
Grading: 6 points for loop unrolling, 1/3 pt for each instruction, round up. 9 pts for filling in the table, 0.5 pt for each instruction element, round up.

Problem 4 [6 points] Consider a loop that is entered several times in a program. Each time it is entered, the loop performs 10 iterations. Each iteration executes four branches with the following outcomes (branch 1 occurs before branch 2, which occurs before branch 3, which occurs before branch 4 in each iteration):

Iteration   1  2  3  4  5  6  7  8  9  10
Branch 1    N  N  N  N  N  N  N  N  N  T
Branch 2    T  T  T  T  T  T  T  T  T
Branch 3    T  T  N  T  T  T  N  T  N
Branch 4    N  N  T  N  N  N  T  N  T

When Branch 1 is taken at iteration 10, the program counter leaves the loop, and branches 2, 3, and 4 are not reached. Of all the dynamic branch predictors studied in class, state the cheapest predictor that will give the best misprediction rate for each of the following branches. Explain why.

(A) Branch 1:
Solution:
(B) Branch 2:
Solution: Branch 2 is always taken. It will have the same result on all predictors. Therefore, use a 1-bit predictor.

(C) Branch 4:
Solution: Branch 4 is always the opposite of the most recent Branch 3. Therefore, use a (1,1) correlating predictor.

Grading: 2 points each: 1 per predictor, 1 per reason.
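The claim about Branch 4 can be illustrated by simulating one pass through the loop. The sketch below compares a plain 1-bit predictor with a (1,1) correlating predictor whose single bit of global history is the outcome of the most recent branch, which for Branch 4 is Branch 3. Both predictors are assumed to start out predicting "not taken"; that initial state and the function names are assumptions for illustration only.

```python
def one_bit_mispredictions(outcomes, initial=False):
    """Mispredictions of a 1-bit predictor that predicts the last outcome."""
    state = initial
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
        state = taken  # remember the most recent outcome
    return misses


def correlating_1_1_mispredictions(history, outcomes, initial=False):
    """Mispredictions of a (1,1) predictor: one 1-bit entry per value of the
    1-bit global history (the outcome of the previously executed branch)."""
    table = {False: initial, True: initial}
    misses = 0
    for last_taken, taken in zip(history, outcomes):
        if table[last_taken] != taken:
            misses += 1
        table[last_taken] = taken  # update only the entry that was used
    return misses


def parse(row):
    """Turn a row such as 'T N T' into a list of booleans (True = taken)."""
    return [c == "T" for c in row.split()]


branch3 = parse("T T N T T T N T N")
branch4 = parse("N N T N N N T N T")

print(one_bit_mispredictions(branch4))                    # prints 5
print(correlating_1_1_mispredictions(branch3, branch4))   # prints 1
```

The single miss of the correlating predictor is a warm-up miss for the "Branch 3 not taken" entry; once both entries are trained, Branch 4 is predicted correctly on every later iteration.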
entire block has been evicted to make room for the next block. Adding the L2 cache removes all capacity misses, and moves the miss ratio down to 50%.

ii) Each cache line (for both L1 and L2 caches) is 1KB, and each element is 64 bytes. One line = 16 elements, and the array has 640K entries. The elements are accessed in this order: 0, 64, 128, … 640K, 1, 65, … After the entire array is accessed, it is accessed again in the same order. The first 1024 entries are compulsory misses. The rest are capacity misses. By the time the process accesses entry 1, the cache line with entries 0-15 will already have been evicted from the L1 cache. But it will still be in the L2 cache. When the process gets to entry 16, none of the desired lines will be in either cache; ultimately every line in L2 will be replaced. The same will happen starting at 32 and 48. Therefore, only multiples of 16 will be misses (compulsory ones), so the miss ratio drops to 1/16 = 6.25%.
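The line-size arithmetic behind the 1/16 figure can be checked in a few lines. This sketch only reproduces the arithmetic; it does not simulate the caches, whose sizes are not given in this excerpt, and it takes the solution's conclusion (exactly one miss per 16-element line) as an assumption.

```python
LINE_BYTES = 1024      # cache line size for both L1 and L2 (1KB)
ELEMENT_BYTES = 64     # size of one array element

elements_per_line = LINE_BYTES // ELEMENT_BYTES

# Assumption taken from the solution text: only the first element of each
# line (indices that are multiples of 16) misses; the other 15 hit.
miss_ratio = 1 / elements_per_line

print(elements_per_line, f"{miss_ratio:.2%}")  # prints: 16 6.25%
```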