Cache Systems and Memory Access Exam - Summer 2004: Advanced Computer Architecture, Exams of Computer Architecture and Organization

This document contains the questions and answers for Exam III of the Advanced Computer Architecture course (ECE 4100/6100), held in summer 2004. The exam covers cache systems and memory access, including the limits on cache size, techniques for reducing the cache miss penalty, and ways to mitigate cache conflicts. Students apply this material to compute cache miss rates and the average memory access time, and to unroll and schedule a loop.

Uploaded on 08/05/2009

koofers-user-a7d
SCORE:________ Name:__________________________________________
ECE 4100/6100 Advanced Computer Architecture
Exam III – Summer 2004
1. (10 points) What limits the size and complexity of the L1 cache and why is an L2 (and even an L3) in
common use today?
The L1 cache is in the critical delay path for clock cycle time on most processors. If it gets too large
(larger memory is slower) everything else would slow down. Adding larger L2 or L3 cache levels keeps
L1 hits as fast as possible and reduces the miss penalty on an L1 miss.
2. (10 points) List the five approaches suggested in the text to reduce the cache miss penalty.
Write Policy
Read Priority over write on a miss
Early Restart – Critical Word First
Non Blocking Caches
Add an additional level of cache
3. (10 points) Other than increasing cache size, associativity, or number of levels, how can you reduce the
bad effects of cache conflicts (in hardware and software)?
Hardware: Victim Cache
Software: compiler optimizations to reduce conflicts
4. (15 points) A computer has three levels of cache. The L1 cache has an 8% local miss rate and the L2
cache has a local miss rate of 30%. The global (L1, L2, L3 combined) cache system hit rate is 98%. Main
memory takes 50 clock cycles at 4 GHz, an L3 hit is 20 clock cycles, an L2 hit is 8 clock cycles, and an L1
hit is 2 clock cycles. Compute the average memory access time.
Need the L3 local miss rate: MRL1 * MRL2 * MRL3 = global miss rate, so .08 * .30 * MRL3 = 1 - .98 = .02 and MRL3 = .02 / .024 = .833
AMAT = HTL1 + MRL1 * (HTL2 + MRL2 * (HTL3 + MRL3 * Mem))
AMAT = 2 + .08 * (8 + .3 * (20 + .833 * 50)) = 4.12 clocks
4.12 clocks / 4 GHz = 4.12 * .25 ns = 1.03 ns
Average memory access time = _____4.12______clocks and _____1.03___________ ns.
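The arithmetic above can be cross-checked with a short script. This is only a sketch: the function name `amat` and the list-based argument layout are illustrative choices, not part of the exam. It folds the nested AMAT formula from the outermost level (main memory) inward.

```python
def amat(hit_times, local_miss_rates, memory_time):
    """Average memory access time in clocks; lists are ordered L1 first."""
    # Start from the cost of going to main memory and fold inward:
    # time(level) = hit_time + local_miss_rate * time(next level).
    time = memory_time
    for hit, miss in zip(reversed(hit_times), reversed(local_miss_rates)):
        time = hit + miss * time
    return time


# Values from the problem statement.
global_miss = 1.0 - 0.98            # global hit rate is 98%
m1, m2 = 0.08, 0.30                 # local miss rates of L1 and L2
m3 = global_miss / (m1 * m2)        # local miss rate of L3

clocks = amat([2, 8, 20], [m1, m2, m3], 50)
ns = clocks / 4.0                   # 4 GHz means 0.25 ns per clock

print(round(m3, 3), round(clocks, 2), round(ns, 2))
```

The key step is that the global miss rate is the product of the local miss rates, which is what lets the L3 local miss rate be recovered from the 98% global hit rate.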


| Instruction producing result | Instruction using result | Latency in clock cycles |
| --- | --- | --- |
| FP ALU op | FP ALU op | 3 |
| FP ALU op | Store double | 2 |
| Int. ALU op | Any | 1 |
| Load double | FP ALU op | 2 |
| Load double | Store double | 0 |

    LOOP: L.D    F2, 0(R1)
          SUB.D  F6, F4, F2
          ADD.D  F4, F6, F0
          S.D    F4, 0(R1)
          DADDIU R1, R1, #8
          BNE    R1, R3, LOOP

Before loop unrolling, a single execution requires ___15_____clocks (original code; do not reschedule).

(25 points) Unroll the loop shown above four times and schedule the operations to reduce the number of stalls and the control overhead. You can assume that the loop executes a multiple of four times. Use extra registers F8..F30 as needed. Indicate any stalls in your answer. Assume one branch delay slot is present.

    LOOP: L.D    F2, 0(R1)
          L.D    F8, 8(R1)
          L.D    F14, 16(R1)
          L.D    F20, 24(R1)
          SUB.D  F6, F4, F2
          SUB.D  F12, F10, F8
          SUB.D  F18, F16, F14
          SUB.D  F24, F22, F20
          ADD.D  F4, F6, F0
          ADD.D  F10, F12, F0
          DADDIU R1, R1, #32
          ADD.D  F16, F18, F0
          ADD.D  F22, F24, F0
          S.D    F4, -32(R1)
          S.D    F10, -24(R1)
          S.D    F16, -16(R1)
          BNE    R1, R3, LOOP
          S.D    F22, -8(R1)      ; fills the branch delay slot

After unrolling, a (single) execution of the original loop’s operations now requires ___4.5____clocks
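The 15-clock and 4.5-clock figures can be cross-checked with a small in-order, single-issue stall counter. This is a rough sketch under stated assumptions: the tuple encoding `(class, destination, sources)` and the function names are hypothetical, producer/consumer pairs not listed in the latency table are assumed to need no stall, and the operands follow the VLIW table in the next problem (loaded values in F2, F8, F14, F20; F0 as the added constant).

```python
# Latencies from the exam's table: (producer class, consumer class) -> stalls.
LATENCY = {
    ("fp", "fp"): 3,
    ("fp", "store"): 2,
    ("load", "fp"): 2,
    ("load", "store"): 0,
}


def latency(producer, consumer):
    if producer == "int":
        return 1                                  # Int. ALU op -> any: 1
    return LATENCY.get((producer, consumer), 0)   # assumption: unlisted pairs never stall


def clocks(program, delay_slot_filled):
    """Clocks for one pass over `program`, a list of (class, dest, sources)."""
    ready = {}   # register -> (producer class, issue cycle)
    cycle = 0
    for kind, dest, sources in program:
        issue = cycle + 1                         # one instruction per clock at best
        for reg in sources:
            if reg in ready:
                prod_kind, prod_cycle = ready[reg]
                issue = max(issue, prod_cycle + 1 + latency(prod_kind, kind))
        cycle = issue
        if dest is not None:
            ready[dest] = (kind, issue)
    # An unfilled branch delay slot costs one extra clock.
    return cycle if delay_slot_filled else cycle + 1


def ld(d):
    return ("load", d, ["R1"])


def fp(d, a, b):
    return ("fp", d, [a, b])


def st(s):
    return ("store", None, [s, "R1"])


DADDIU = ("int", "R1", ["R1"])
BNE = ("branch", None, ["R1", "R3"])

original = [ld("F2"), fp("F6", "F4", "F2"), fp("F4", "F6", "F0"),
            st("F4"), DADDIU, BNE]

unrolled = [ld("F2"), ld("F8"), ld("F14"), ld("F20"),
            fp("F6", "F4", "F2"), fp("F12", "F10", "F8"),
            fp("F18", "F16", "F14"), fp("F24", "F22", "F20"),
            fp("F4", "F6", "F0"), fp("F10", "F12", "F0"), DADDIU,
            fp("F16", "F18", "F0"), fp("F22", "F24", "F0"),
            st("F4"), st("F10"), st("F16"), BNE, st("F22")]

before = clocks(original, delay_slot_filled=False)
after = clocks(unrolled, delay_slot_filled=True)
print(before, after, after / 4)
```

Under these assumptions the scheduled code issues one instruction per clock with no stalls, so the per-iteration cost is simply the 18 instructions divided by the 4 original iterations they cover.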

(10 points) Using your four-times-unrolled loop code from the earlier problem, schedule it on this VLIW machine. Assume the same latencies as in the earlier problem and no branch delay. Just like the book's example, the machine can issue two memory operations (loads and/or stores), two floating-point operations, and one integer ALU or branch operation every clock cycle.

| Memory Ref. 1 | Memory Ref. 2 | FP Operation 1 | FP Operation 2 | Int. Op/Branch |
| --- | --- | --- | --- | --- |
| L.D F2,0(R1) | L.D F8,8(R1) | | | |
| L.D F14,16(R1) | L.D F20,24(R1) | | | |
| Must stall | | | | |
| | | SUB.D F6,F4,F2 | SUB.D F12,F10,F8 | |
| | | SUB.D F18,F16,F14 | SUB.D F24,F22,F20 | |
| Must stall | | | | |
| Must stall | | | | |
| | | ADD.D F4,F6,F0 | ADD.D F10,F12,F0 | |
| | | ADD.D F16,F18,F0 | ADD.D F22,F24,F0 | DADDIU R1,R1,#32 |
| Must stall | | | | |
| S.D F4,-32(R1) | S.D F10,-24(R1) | | | |
| S.D F16,-16(R1) | S.D F22,-8(R1) | | | BNE R1,R3,LOOP |

A single loop iteration now takes 12/4 = 3 clocks (based on the operations in the original loop code).
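One way to sanity-check a VLIW schedule like this one is to verify, cycle by cycle, that no issue-slot limit is exceeded and that every consumer issues late enough after its producer. The sketch below does that for a 12-cycle schedule with two loads, two SUB.Ds, two ADD.Ds, or two stores per cycle. The row layout, the helper names, and the treatment of unlisted producer/consumer pairs as needing no stall are all assumptions made for illustration.

```python
LATENCY = {("fp", "fp"): 3, ("fp", "store"): 2,
           ("load", "fp"): 2, ("load", "store"): 0}
SLOT = {"load": "mem", "store": "mem", "fp": "fp", "int": "int", "branch": "int"}
LIMIT = {"mem": 2, "fp": 2, "int": 1}   # issue slots available per clock


def latency(producer, consumer):
    # Int. ALU op -> any is 1; unlisted pairs are assumed to need no stall.
    return 1 if producer == "int" else LATENCY.get((producer, consumer), 0)


def check(schedule):
    """Return a list of violations; an empty list means the schedule is legal."""
    problems = []
    ready = {}   # register -> (producer class, issue cycle)
    for cycle, row in enumerate(schedule, start=1):
        used = {"mem": 0, "fp": 0, "int": 0}
        for kind, dest, sources in row:
            used[SLOT[kind]] += 1
            for reg in sources:
                if reg in ready:
                    prod_kind, prod_cycle = ready[reg]
                    earliest = prod_cycle + 1 + latency(prod_kind, kind)
                    if cycle < earliest:
                        problems.append((cycle, kind, reg, earliest))
        for unit, count in used.items():
            if count > LIMIT[unit]:
                problems.append((cycle, unit, "too many operations", count))
        # Results become visible only to later cycles, not to the same row.
        for kind, dest, _ in row:
            if dest is not None:
                ready[dest] = (kind, cycle)
    return problems


def ld(d):
    return ("load", d, ["R1"])


def fp(d, a, b):
    return ("fp", d, [a, b])


def st(s):
    return ("store", None, [s, "R1"])


schedule = [
    [ld("F2"), ld("F8")],
    [ld("F14"), ld("F20")],
    [],                                                   # must stall
    [fp("F6", "F4", "F2"), fp("F12", "F10", "F8")],
    [fp("F18", "F16", "F14"), fp("F24", "F22", "F20")],
    [],                                                   # must stall
    [],                                                   # must stall
    [fp("F4", "F6", "F0"), fp("F10", "F12", "F0")],
    [fp("F16", "F18", "F0"), fp("F22", "F24", "F0"), ("int", "R1", ["R1"])],
    [],                                                   # must stall
    [st("F4"), st("F10")],
    [st("F16"), st("F22"), ("branch", None, ["R1", "R3"])],
]

violations = check(schedule)
print(violations, len(schedule), len(schedule) / 4)
```

Deleting any of the stall rows makes `check` report a latency violation, which is a quick way to see why the stalls cannot be squeezed out without more independent work to fill them.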