







Material Type: Exam; Class: COMPU ARCHITECT PRIN; Subject: COMPUTER DESIGN/ARCHITECTURE; University: University of Florida; Term: Summer 2003;
Typology: Exams








Problem No.   Max Points   Points Scored   Comment
1             19
2             15
3             20
4             20
5             13
6             23
Total         110
1. [19 points] Branch Prediction Bits

Throughout this problem we will be working with the following program:

    loop: LW   R4, 0(R3)
          ADDI R3, R3, #
          SUBI R1, R1, #
    b1:   BEQZ R4, b
          ADDI R2, R2, #
    b2:   BNEZ R1, loop

Assume the initial value of R1 is n (n>0).
Assume the initial value of R2 is 0 (R2 holds the result of the program).
Assume the initial value of R3 is p (a pointer to the beginning of an array of 32-bit integers).
Assume that the MIPS ISA has no delay slots.

We will use a 2-bit predictor state machine, as shown below. In state 1X we will guess not taken. In state 0X we will guess taken. Assume that b1 and b2 do not conflict in the BHT.

a) [4 pts] What does the program compute? That is, what does R2 contain when we exit the loop?

Contains the number of non-zero values in the first n integers.

b) [15 pts] Branch prediction
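As a sanity check on the answer to part (a), and to generate the taken/not-taken sequences that part (b) is concerned with, the loop can be modelled in a few lines of Python. This is only a sketch under stated assumptions: the immediates and the target of b1 are elided in the listing, so the code assumes the pointer advances by one array element per iteration, R1 is decremented by 1, R2 is incremented by 1, and b1 branches to b2. The name `run_loop` is hypothetical.

```python
def run_loop(array, n):
    """Model the loop and return (R2, b1 outcomes, b2 outcomes).

    Assumptions (the listing elides these values): the pointer advances one
    element per iteration, R1 is decremented by 1, R2 is incremented by 1,
    and b1 branches to b2. In the traces, True means "taken".
    """
    r1, r2, index = n, 0, 0
    b1_trace, b2_trace = [], []
    while True:
        r4 = array[index]          # LW   R4, 0(R3)
        index += 1                 # ADDI R3, R3, #(assumed: one element)
        r1 -= 1                    # SUBI R1, R1, #(assumed: 1)
        b1_taken = (r4 == 0)       # b1: BEQZ R4, (assumed target: b2)
        b1_trace.append(b1_taken)
        if not b1_taken:
            r2 += 1                # ADDI R2, R2, #(assumed: 1)
        b2_taken = (r1 != 0)       # b2: BNEZ R1, loop
        b2_trace.append(b2_taken)
        if not b2_taken:
            return r2, b1_trace, b2_trace


result, b1, b2 = run_loop([7, 0, 3, 0, 5], 5)
print(result)  # count of non-zero values among the first 5 entries
print(b1)      # b1 is taken exactly when the loaded value is zero
print(b2)      # b2 is taken on every iteration except the last
```

The two traces can be fed by hand through the 2-bit state machine to count mispredictions for any input array.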
2. [15 points] Cost, yield and defect-tolerance

Suppose a 40-cm diameter wafer costs $100,000 to run through a fabrication line having 3 critical mask levels where the resulting defect density in the process is 0. defects per square cm. Assume the wafer yield is 100%, and ignore testing and packaging costs.

a. [5 pts] Taking the yield into account, compute the effective cost of an ordinary microprocessor die having an area of 5 cm^2 in this process.

Yield = (1 + 0.15 * 5/3)^(-3) = 51.2%
Dies per wafer = (Pi * 20 * 20)/5 - (Pi * 40)/Sqrt(2 * 5) ≈ 211.6, rounded to 212
Good dies = 212 * 51.2% ≈ 108
Therefore cost = $

b. [10 pts] Now, consider an alternative, fault-tolerant design in which each die is now somewhat larger, 6 cm^2, but is logically sub-divided into 5 equal-size pieces, at least 4 of which must be defective before the whole chip will fail to function. What is the yield of the individual pieces? What is the probability that a given piece will be defective? What is the probability that either 4 or 5 pieces out of the 5 on a die will be defective? What is then the effective yield and cost for the whole die? Is it better than in part (a)?

Each piece has an area of 6/5 = 1.2 cm^2.
Yield of individual pieces = (1 + 0.15 * 1.2/3)^(-3) = 83.96%
Thus every piece has a 16.04% probability of being defective.
Probability that either 4 or 5 pieces are defective = (1 - 0.8396)^5 + C(5,4) * (1 - 0.8396)^4 * 0.8396 = 0.0029 = 0.29%
The probability that a chip is defective is therefore 0.29%, so the effective die yield is 99.71%.
Dies per wafer = (Pi * 20^2)/6 - (Pi * 40)/Sqrt(2 * 6) = 173
Therefore, good dies = 173 * 99.71% ≈ 172
Cost = $
This is certainly better than (a).
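The arithmetic in this solution can be re-checked with a short Python sketch. It assumes the defect density is 0.15 per cm^2 and the fault-tolerant die is 6 cm^2, both inferred from the solution's own arithmetic rather than stated cleanly in the problem text, and it takes the exponent 3 from the number of critical mask levels. All function names are made up for illustration.

```python
import math


def die_yield(defect_density, area_cm2, alpha):
    """Yield model used in the solution: (1 + D*A/alpha)^(-alpha)."""
    return (1 + defect_density * area_cm2 / alpha) ** (-alpha)


def dies_per_wafer(diameter_cm, area_cm2):
    """Wafer area over die area, minus the edge-loss term pi*d/sqrt(2*A)."""
    radius = diameter_cm / 2
    return (math.pi * radius ** 2 / area_cm2
            - math.pi * diameter_cm / math.sqrt(2 * area_cm2))


def prob_at_least_k_defective(pieces, k, piece_yield):
    """Binomial probability that k or more of the pieces are defective."""
    p_bad = 1 - piece_yield
    return sum(math.comb(pieces, j) * p_bad ** j * piece_yield ** (pieces - j)
               for j in range(k, pieces + 1))


D = 0.15          # assumed defect density per cm^2 (inferred from the solution)
ALPHA = 3         # number of critical mask levels
WAFER_COST = 100_000

# Part (a): ordinary 5 cm^2 die.
yield_a = die_yield(D, 5, ALPHA)
cost_a = WAFER_COST / (dies_per_wafer(40, 5) * yield_a)

# Part (b): 6 cm^2 die (assumed) split into 5 pieces; fails if 4 or 5 are bad.
piece_yield = die_yield(D, 6 / 5, ALPHA)
fail_b = prob_at_least_k_defective(5, 4, piece_yield)
cost_b = WAFER_COST / (dies_per_wafer(40, 6) * (1 - fail_b))

print(f"(a) yield {yield_a:.3f}, cost per good die {cost_a:.2f}")
print(f"(b) piece yield {piece_yield:.4f}, die failure {fail_b:.4f}, "
      f"cost per good die {cost_b:.2f}")
```

The sketch uses unrounded die counts, so its costs can differ slightly from a hand calculation that rounds the number of good dies first.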
3. [20 points] Loop unrolling and software pipelining

Consider the following loop.

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-
          BNEZ   R1, Loop

Assume that the ADD.D takes 5 EX cycles, ignore the hazard between DADDUI and BNEZ, and ignore the branch delay slot.

[6 points] Show the software pipelined version of this loop. Do not show the start-up or finish-up code.

Solution:

    Loop: S.D    F4, 16(R1)
          ADD.D  F4, F0, F
          L.D    F0, 0(R1)
          DADDUI R1, R1, #-
          BNEZ   R1, Loop

Note: the difference between the S.D and L.D offsets should be 16.

Suppose that, instead, ADD.D takes 6 EX cycles, so we have to first unroll and then software pipeline. So:

[6 points] Unroll the original loop once, to get a two-body iteration. You do not need to schedule it for the pipeline - just unroll it.

Solution:

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F
          S.D    F4, 0(R1)
          L.D    F6, -8(R1)
          ADD.D  F10,F6,F
          S.D    F10, -8(R1)
          DADDUI R1, R1, #-
          BNEZ   R1, Loop
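To see why the store in the software-pipelined loop sits two elements (16 bytes) away from the load, the same schedule can be mimicked in Python. This is an illustrative sketch, not the exam's solution: it assumes the loop adds a scalar to every element of an array of 8-byte doubles while walking downward through it (the second ADD.D operand and the DADDUI immediate are elided in the listings), and it includes the start-up and finish-up code that the exam says to omit. Function names are hypothetical.

```python
def add_scalar_plain(x, s):
    """Original loop: load, add, store for each element, highest index first."""
    out = list(x)
    for i in range(len(out) - 1, -1, -1):
        f0 = out[i]        # L.D
        f4 = f0 + s        # ADD.D (assumed: adds a scalar)
        out[i] = f4        # S.D
    return out


def add_scalar_pipelined(x, s):
    """Software-pipelined version: each pass stores element i+2,
    adds for element i+1, and loads element i."""
    out = list(x)
    n = len(out)
    if n < 3:
        return add_scalar_plain(x, s)   # too short for a steady state

    # Start-up: fill the pipeline with the two highest elements.
    f0 = out[n - 1]
    f4 = f0 + s
    f0 = out[n - 2]

    # Steady state, mirroring S.D 16(R1) / ADD.D / L.D 0(R1).
    for i in range(n - 3, -1, -1):
        out[i + 2] = f4    # store result computed two iterations ago
        f4 = f0 + s        # add for the element loaded one iteration ago
        f0 = out[i]        # load the current element

    # Finish-up: drain the two results still in flight.
    out[1] = f4
    out[0] = f0 + s
    return out


print(add_scalar_pipelined([1.0, 2.0, 3.0, 4.0, 5.0], 10.0))
```

The index offset of 2 between the store and the load in the steady-state loop corresponds to the 16-byte offset difference noted in the solution.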
4. [20 points] Cache Performance Analysis

The following MIPS code loop adds the vectors X, Y, and Z and stores the result in vector W. All vectors are of length n and contain words of four bytes each.

    W[i] = X[i] + Y[i] + Z[i] for 0 ≤ i < n

Registers 1 through 5 are initialized with the following values:
R1 holds the address of X
R2 holds the address of Y
R3 holds the address of Z
R4 holds the address of W
R5 holds n

All vector addresses are word aligned. The MIPS code for our function is:

    loop: LW   R6, 0(R1)    ; load X
          LW   R7, 0(R2)    ; load Y
          LW   R8, 0(R3)    ; load Z
          ADD  R9, R6, R7   ; do the add
          ADD  R9, R9, R
          SW   R9, 0(R4)    ; store W
          ADDI R1, R1, 4    ; increment the vector indices
          ADDI R2, R2, 4    ; 4 bytes per word
          ADDI R3, R3, 4
          ADDI R4, R4, 4
          ADDI R5, R5, -1   ; decrement n
          BNEZ R5, loop

We run the loop using a cache simulator that reports the average memory access time for a range of cache configurations.
[Figure 1: Simulator Results for Varying Block Size. Horizontal axis: Block Size (words), from 1024 down to 1. Vertical axis: Average memory access time (cycles), from 0 to 160.]

In Figure 1 we plot the results of one set of experiments. For these measurements, we used a single level cache of fixed capacity and varied the block size. We used large values of n to ensure the loop settled into a steady state. For each block size, we plot the average memory access time, i.e., the average time in cycles required to access each data memory operand (cache hit time plus memory access time).

Assume that capacity (represented by C) and block size (represented by b) are restricted to powers of 2 and measured in four byte words. The cache is fully associative and uses an LRU replacement policy. Assume the processor stalls during a cache miss and that the cache miss penalty is independent of the block size.

[5 points] Based on the results in Figure 1, what is the cache capacity C?

When C < 4b, the cache cannot hold one block from each of the four vectors at the same time, so under the LRU replacement policy each vector's block is evicted before it is used again and every access causes a cache miss.

When C ≥ 4b, a vector's block is not evicted within an iteration. Accesses hit until either the function completes or the end of the cache block is reached and the next section of the vector has to be loaded. Larger values of b result in less reloading and, therefore, smaller average memory access times.

Therefore, the curve should drop at C=4b and then climb again as b decreases. So, C=4*128=512 words
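The argument above can be checked with a small fully associative LRU cache model in Python. It is a sketch under assumptions not given in the exam: the four vectors start at widely separated, block-aligned word addresses, and the model reports the miss rate rather than access time in cycles (the hit time and miss penalty are not stated). The function name `miss_rate` is hypothetical.

```python
from collections import OrderedDict


def miss_rate(capacity_words, block_words, n=4096):
    """Miss rate of a fully associative LRU cache for the X, Y, Z, W loop."""
    num_blocks = capacity_words // block_words
    if num_blocks == 0:
        return 1.0                      # a block does not even fit
    cache = OrderedDict()               # keys are block numbers, oldest first
    # Assumed layout: vectors far apart and aligned to the largest block size.
    bases = [k * (1 << 20) for k in range(4)]
    misses = accesses = 0
    for i in range(n):
        for base in bases:              # X[i], Y[i], Z[i], then W[i]
            block = (base + i) // block_words
            accesses += 1
            if block in cache:
                cache.move_to_end(block)        # mark as most recently used
            else:
                misses += 1
                cache[block] = True
                if len(cache) > num_blocks:
                    cache.popitem(last=False)   # evict least recently used
    return misses / accesses


for b in (1024, 512, 256, 128, 64, 32, 16, 8, 4, 2, 1):
    print(b, miss_rate(512, b))
```

With C = 512 words the model misses on every access for b ≥ 256 and on one access in b for b ≤ 128, which matches the shape the solution describes: a drop at C = 4b followed by a climb as b shrinks.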
5. [13 points] Input/Output

Consider a database system with the following characteristics:
o The CPU can execute at most 1800 MIPS.
o Each database transaction requires 1 disk read plus 3 disk writes.
o Each database transaction involves 100,000 user instructions plus 20,000 kernel instructions for each disk read or write.

That is, a total of 100,000 + 4 * 20,000 = 180,000 instructions are required for each database transaction.

[3 points] Calculate the maximum number of transactions per second (TPS) that the CPU can handle.

The CPU can handle 1,800,000,000/180,000 = 10,000 TPS.

[3 points] Assume that, in addition:
o The I/O bus can sustain a data rate of 2100MB/sec.
o The transfer size for each read or write is 1K bytes.
Calculate the maximum number of transactions per second (TPS) that the bus can handle.

Each transaction needs 4 transfers of 1K bytes, so the I/O bus can handle 2,100,000,000/4,000 = 525,000 TPS.

[4 points] Assume that, in addition:
o Each disk is 10GB in size.
o A disk can average 60 reads or writes per second.
If the system must contain 200 GB of data, calculate the maximum number of transactions per second (TPS) that the disk group can handle.

Since the system has to be able to handle 200GB of data, there must be 20 10GB disks. These disks can sustain a TPS rate of 20 * 60/4 = 300 TPS.

[3 points] Assume the CPU and bus cost $5000 together and each disk is $250. What is the cost per TPS of the system?

Since the CPU and bus cost $5000 together and each disk is $250, the total cost of the system is $5000 + 20 * $250 = $10,000. The disks are the bottleneck at 300 TPS, so the cost per TPS = $10,000/300 ≈ $33.
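The bottleneck reasoning in this problem can be written out as a short Python sketch. It follows the solution's decimal units (1K = 1,000 bytes, 1MB = 1,000,000 bytes); the function names are made up for illustration.

```python
import math

IOS_PER_TXN = 1 + 3   # 1 disk read plus 3 disk writes


def cpu_tps(mips, user_instr, kernel_instr_per_io):
    """Transactions per second the CPU can sustain."""
    instr_per_txn = user_instr + kernel_instr_per_io * IOS_PER_TXN
    return mips * 1_000_000 / instr_per_txn


def bus_tps(bytes_per_sec, bytes_per_io):
    """Transactions per second the I/O bus can sustain."""
    return bytes_per_sec / (bytes_per_io * IOS_PER_TXN)


def disk_tps(total_gb, gb_per_disk, ios_per_sec_per_disk):
    """Number of disks needed for capacity, and the TPS they can sustain."""
    disks = math.ceil(total_gb / gb_per_disk)
    return disks, disks * ios_per_sec_per_disk / IOS_PER_TXN


cpu = cpu_tps(1800, 100_000, 20_000)
bus = bus_tps(2_100_000_000, 1_000)
disks, disk = disk_tps(200, 10, 60)

system_tps = min(cpu, bus, disk)          # the slowest component limits TPS
total_cost = 5000 + disks * 250
cost_per_tps = total_cost / system_tps

print(cpu, bus, disk, system_tps, total_cost, round(cost_per_tps, 2))
```

Taking the minimum makes the bottleneck explicit: adding disks would raise throughput until the CPU limit is reached, at the price of a higher total cost.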
6. [23 points] Write through & Write back

You bought a computer with the following properties:
o 95% of all memory accesses are found in the cache.
o A cache block is two words long and is read as a whole on a cache miss.
o The processor accesses the cache at a rate of 10^9 words per second.
o 75% of all accesses are reads.
o 25% of all accesses are writes.
o The memory system can read or write 2 × 10^8 words per second.
o The bus can only transfer one word at a time.
o Assume that there are always 30% dirty blocks in the cache.
o The replacement strategy is write allocate on write miss.

Then answer the following questions in the following two cases:
o The cache implementation is write through.
o The cache implementation is write back.

[5 points] Analysis of Cache hit. Indicate the number of words transferred on the bus for every read hit and write hit. Also compute the number of read hits and write hits as a percentage of the number of instructions.

Solution: The answer for read hit is given to you as an example of how to fill the table.

Analysis of Cache hit

                              Number of accesses on main memory
                              Read      Write      Probability
Read hit                      0         0          95% * 75% =
Write hit   Write through
            Write back        0         0          95% * 25%

[12 points] Analysis of Cache Miss. Indicate the number of words transferred on the bus for every read miss and write miss. Also compute the number of read misses and write misses as a percentage of the number of instructions.
o The cache implementation is write back.

Overall analysis of write back:

Access_average = 71.25% * 0 + 23.75% * 0 + 2.625% * 2 + 1.125% * 4
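The terms of this expression can be enumerated with a short Python sketch. It rests on assumptions: a miss with a clean victim transfers one 2-word block, a miss with a dirty victim transfers two (write the old block back, then read the new one), and, because of write allocate, write misses are treated like read misses. The write-miss terms do not appear in the expression above, so that part in particular is an assumption rather than the exam's stated solution. Function names are hypothetical.

```python
def write_back_cases(hit_rate, read_fraction, dirty_fraction, block_words):
    """List (case, probability, words on the bus) for a write-back cache."""
    miss_rate = 1 - hit_rate
    write_fraction = 1 - read_fraction
    clean_fraction = 1 - dirty_fraction
    fetch = block_words                 # read the missing block
    fetch_and_flush = 2 * block_words   # write back dirty victim, then read
    return [
        ("read hit", hit_rate * read_fraction, 0),
        ("write hit", hit_rate * write_fraction, 0),
        ("read miss, clean victim",
         miss_rate * read_fraction * clean_fraction, fetch),
        ("read miss, dirty victim",
         miss_rate * read_fraction * dirty_fraction, fetch_and_flush),
        # Assumed: write allocate makes write misses behave like read misses.
        ("write miss, clean victim",
         miss_rate * write_fraction * clean_fraction, fetch),
        ("write miss, dirty victim",
         miss_rate * write_fraction * dirty_fraction, fetch_and_flush),
    ]


def average_words_per_access(cases):
    """Expected number of words transferred on the bus per cache access."""
    return sum(prob * words for _, prob, words in cases)


cases = write_back_cases(hit_rate=0.95, read_fraction=0.75,
                         dirty_fraction=0.30, block_words=2)
for name, prob, words in cases:
    print(f"{name:26s} {prob * 100:7.3f}%  {words} words")
print(average_words_per_access(cases))
```

Multiplying the average by the processor's access rate gives the bus traffic in words per second, which can then be compared with the memory system's rate.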