Module 8: Cache Memory Optimizations | ECE 4100, Study notes of Computer Architecture and Organization

Material Type: Notes; Professor: Yalamanchili; Class: Adv Computer Architecure; Subject: Electrical & Computer Engr; University: Georgia Institute of Technology-Main Campus; Term: Fall 2003;

Typology: Study notes

Pre 2010

Uploaded on 08/05/2009

koofers-user-yij
koofers-user-yij 🇺🇸

10 documents

1 / 31

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
ECE 4100/6100: Yalamanchili Fall 2003
Module 8: Cache Memory
Module 8: Cache Memory
Optimizations
Optimizations
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e
pf1f

Partial preview of the text

Download Module 8: Cache Memory Optimizations | ECE 4100 and more Study notes Computer Architecture and Organization in PDF only on Docsity!

ECE 4100/6100: Yalamanchili^

Module 8: Cache MemoryModule 8: Cache MemoryOptimizationsOptimizations

Improving the Performance of the Cache Hierarchy • Reductions in miss penalty • Reductions in the miss rate • Reductions in the hit time • Compiler optimizations

Hit Time Miss PenaltyMiss Rate

Analysis AMAT = Hit_time+ miss_rate* miss_penalty^ L1 L1^ L1 Miss_penalty = hit_time + miss_rate^ * miss_penalty^ L1 L2 L2^ L2 AMAT = Hit_time+ miss_rateL1 L1 (hit_time + miss_rate* miss_penalty^ L2 L2^ L2 • Local miss rate – Defined with respect to the cache • Global miss rate – Defined with respect to the total number of memory references

Performance • Note L2 hit time not that important, why? • Miss rate behavior of large L2 indistinguishable from a singlecache – Global miss rate a good indicator of performance

ECE 4100/6100: S. Yalamanchili^

Multilevel Inclusion/Exclusion 0x40 0x22 0x35^ 0x27v 0x66 0x76 0x01^ 0x00v 0x40 0x22 0x35^ 0x27v 0x55 0x56 0x12^ 0x34v 0x08 0x16 0x32^ 0x64v 0x66 0x76 0x01^ 0x00v^ •^ Increase in miss rate • Simplifies coherencebut reduced costmaintenance

0x54^ 0x36^ 0x22^ 0x47v0x000x270x69^ 0x36^ 0x21^ 0x02v0x400x00 0x50^ 0x22^ 0x35^ 0x28v0x270x770x55^ 0x76^ 0x42^ 0x34v0x340x440x08^ 0x26^ 0x38^ 0x64v0x640x620x68^ 0x76^ 0x01^ 0x40v0x400x Invalidate L1 entry when replaced

Swap with L2 entry when replaced

2. Critical Word First/Early Restart • Fetch referenced^ word^ first^ and^ remainder

of^ the^ line^ in^ the Wordboundary background • Standard line fetch but referenced word is forwarded to the CPUwhen it is fetched • Gains improve for larger line size • Complexity of multiple, successive references to the same block Memory 0x400x44 line 0x480x4C WordReferenced bythe CPU

4. Merging Write Buffer (write combining) • Improving the efficiency of write buffers • Combine sequential writes into a burst transaction to memory • Amortize transfer startup overhead

Fall 2003^^11

Performance of Write Combining^ Effects of Write-Combining on PIO 1401201008060402001 10 100 1000 10000 • Close to 90% bus bandwidth utilization

(^100000 1000000) Injection Burst Size Injection Bandwidth (MB/s)

PIO Write-CombiningPIO Plain

Reducing the Miss Rate • Reading: Section 5.5 • Focus on reducing – Compulsory misses^ Æ^ e.g., larger block size – Capacity misses^ Æ^ e.g., larger cache – Conflict misses^ Æ^ e.g., higher associativity • Trade-off miss rate with – Hit time – e.g., higher associativity can increase hit time – Miss penalty – e.g., larger block size can increase miss penalty

1: Larger Block Size • Larger block size increases spatial locality at the (eventual) expenseof temporal locality (compare Figures 5.17 & 5.18) • Reduces compulsory^ misses^ but^ (eventually)^

increases^ conflict misses • Reductions in miss rate are accompanied by increase in miss penalty^ –^ What happens to AMAT?^ –^ High (low) latency, high (low) bandwidth memory impact on block size

3. Way Prediction and Pseudo Associative Caches • Way Prediction^ –^ Use set-associative caches but predict the line in the set^ –^ Multiplexer is set early^ –^ Makes the common case fast^ –^ We can see a natural affinity with I-cache behavior^ –^ 2-way set associative 21264 I-cache with predictor bit– 1cycle hit vs. 3 cycle hit^ –^ Activity management for power management • Variant is pseudo associative cache^ –^ Each set has a fast hit line and a slow hit line (fixed!)^ –^ Maintenance of a fast hit block requires transfers from the slowhit block^ –^ Performance degradation due to too many slow hits

4. Compiler Techniques • Memory hierarchy exposed to the compiler – We can schedule for execution performance, why not schedulefor miss rate or miss penalty? • Examples – Re-ordering instructions to improve locality – Re-ordering data accesses to improve locality – Reduce conflict misses by re-mapping of instructions or data inmemory

ECE 4100/6100: S. Yalamanchili^

4.2 Blocking

for (i=0; i<N; i++)for (j=0; j<N; j++){r=0;for (k=0; k<n; k++)r = r + y[i][k] * z[k][[j];x[i][j] = r;};

for (jj=0; jj<N; jj = jj+B)for (kk=0; kk<N; kk = kk+B)for (i=0; i<N; i++)for (j=jj; j<min(jj+B,N); j++){r=0;for(k=kk;j<min(kk+B,N); k++)r = r + y[i][k] * z[k][[j];x[i][j] = x[i][j] + r;};

-^ Restructure the loops to improve^ –^ Fit in the cache^ –^ Improve temporal locality •^ Solutions now become machine dependent

Transformed Code A row in a blockOne block in a column^ All blocks in a columnAll columns of blocks Original Code

4.2 Blocking (cont.) Compute the partialproduct for this blockCompute a row in theblock (j and k)Complete computationof Block (0,0) (i)Complete computationof Blocks in a column(kk) • What is the miss behavior? • Decompose the computation to operate on BxB blocks such thatthree blocks fit in the cache • Reduce the overall number of worst case misses by a factor of B

y[i][k] Complete computationof all columns (jj) z[k][j]