Memory Cache Optimization: An Impact Case Study on Compress Algorithm, Study notes of Computer Architecture and Organization

These notes examine how the compress algorithm uses the memory cache and how that usage can be optimized. The study focuses on the hash table accesses, which have little reuse in current caches, and suggests bypassing the cache for infrequently accessed data. The notes also introduce memory macroblocks and their use in caching decisions.

Typology: Study notes

Pre 2010

Uploaded on 02/24/2010

koofers-user-a7d
IMPACT
Case Study: 026.compress
while ( (c = getchar()) != EOF ) {
    in_count++;
    fcode = (long) (((long) c << maxbits) + ent);
    i = ((c << hshift) ^ ent);    /* xor hashing */
    if ( htabof (i) == fcode ) {
        ent = codetabof (i);
        continue;
    } else if ( (long)htabof (i) < 0 )    /* empty slot */
        goto nomatch;
    disp = hsize_reg - i;    /* secondary hash */
    if ( i == 0 ) disp = 1;
probe:
    if ( (i -= disp) < 0 ) i += hsize_reg;
    if ( htabof (i) == fcode ) {
        ent = codetabof (i);
        continue;
    }
    if ( (long)htabof (i) > 0 ) goto probe;
nomatch:
    output ( (code_int) ent );
    out_count++;
    ent = c;
    if ( free_ent < maxmaxcode ) {
        codetabof (i) = free_ent++;    /* code -> hashtable */
        htabof (i) = fcode;
    }
    else if ( (count_int)in_count >= checkpoint && block_compress )
        cl_block ();
}

  • Inner loop of compress(): most of execution time!
  • The hash table accesses have little reuse in current caches
    • htab ~ 270K; codetab ~ 135K
    • Even in a 4-way set associative cache there are many conflicts
Htab Memory Access Distribution

[Figure: htab load hits and htab load misses plotted by Cycle (x-axis, 2000000 to 2100000) against Address, as an offset from 1G (y-axis, 150000 to 450000).]


Compress Cache Bypass

  • Infrequently accessed data should bypass cache
    • Only when it conflicts with more frequently accessed data
  • Less cache pollution
  • Increase cache reuse of frequently accessed addresses
    • overall increase in hit ratio
  • Need a way to keep track of usage patterns

[Figure: diagram of Main Memory, Data Cache, and RegFile, with data labeled "Infrequently Accessed" and "Frequently Accessed".]

Memory Macroblocks

  • Used to track memory accessing behavior
  • Combine groups of adjacent cache blocks into larger regions called macroblocks
  • Macroblock size should be:
    • large enough so the total number of macroblocks is not unreasonable to track
    • small enough so accessing frequency within each is relatively uniform

[Figure: diagram of Main Memory.]


MAT Bypassing Operation

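  • On a memory access, look up the counter in the MAT (Memory Address Table)
    • accessed with macroblock address
    • increment counter
  • Save updated counter value

[Figure: diagram of Main Memory, Data Cache, MAT, and Reg File. Labels: blocks A and B, "Infrequently Accessed", "Frequently Accessed", and a counter path (ctr, ++ctr, ctr) through the MAT.]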

MAT Bypassing Operation (cont.)

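  • If the access hits in the cache, execute normally
  • If the access misses, look up the MAT counter for the cache block that would be replaced
  • Compare counter values
    • if ctr1 (the counter for the missing access) is lower than the counter for the block that would be replaced, bypass the cache
  • Place bypassing data in a small set-associative bypass buffer
  • Holds bypassed data for a short time to exploit some temporal locality

[Figure: diagram of Main Memory, Data Cache, MAT, bypass Buffer, and Reg File. Labels: blocks A and B, "Infrequently Accessed", "Frequently Accessed", two counters (ctr) feeding a comparator (CMP) that produces "bypass?", and "ctr2--".]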

Performance Improvement

[Figure: chart of % Improvement over Base for each benchmark, with series Sampling and No Sampling. Benchmarks: 026.compress, 072.sc, 099.go, 147.vortex, Pcode, lmdes2_customizer, 085.gcc, 130.li, 134.perl, 124.m88ksim, word, excel, photo.]

L1 Hit Ratios

[Figure: chart of L1 Hit Ratio for the same benchmarks, with series Base, Simulator (1024-byte macroblocks), and Upper Bound.]

Example: 085.gcc Routine

int rtx_renumbered_equal_p (rtx x, rtx y)
    rtx_code code = x->code;
    if (code != y->code) return 0;    /* Exits here 448 times (1%) */
                                      /* A: 8 bytes sufficient */
    switch (code) {
        [...]
        case CONST_INT:
            return x->fld[0].rtint == y->fld[0].rtint;    /* Exits here 29096 times (46%) */
                                                          /* A: 8 bytes sufficient */
    fmt = GET_RTX_FORMAT (code);
    for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--) {
        switch (fmt[i]) {
            [...]
            case 'e':
                if (! rtx_renumbered_equal_p (x->fld[i].rtx, y->fld[i].rtx))    /* Executed 33030 times */
                    return 0;    /* A: More than 8 bytes needed! */
                break;

Q: How much data to fetch on a miss to y->code?

Cache Organization Alternative:

Variable Fetch: 8-byte lines & 32-byte virtual lines

[Figure: two cache diagrams for the access sequence ld A, ld B, ld C, ld A, ld C, ld B+, with 8-byte fetches. Labels: "miss C" in the first diagram; "hit B+", "hit C", and "hit A" in the second.]

SLDT Hardware (Spatial Locality Detection Table)

[Figure: SLDT hardware block diagram. Labeled components: Memory Address, L1 Data Cache (8-byte blocks, fi bit), MAT (sctr), and SLDT. Labeled signals: "fetch size?", "hit?", "spatial reuse?", and "update sctr with hit and spatial reuse results". SLDT entry format: tag, sz, vc, sr.]

SLDT Actions

  • When max fetch size block is fetched into data cache
    • sz = 1; vc = max_fetch_size/min_fetch_size - 1; sr = 0
  • When min fetch size block is fetched into data cache
    • no prior SLDT entry: sz = 0; vc = 0; sr = 0
    • existing SLDT entry: vc ++

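Cache Access Result | SLDT Access Result | fi Value | sz Value | vc Value | Action
miss                | hit                | -        | 0        |          | sr = 1; sctr++
miss                | hit                | -        | 1        |          | sr = 1
hit                 | hit                | 0        |          |          | sr = 1
hit                 | hit                |          | 0        | >0       | sr = 1
hit                 | miss               | 0        | -        | -        | allocate SLDT entry; sz = 1; sr = 1
hit                 | miss               | 1        | -        | -        | allocate SLDT entry

Event                              | Condition | Action
Cache entry replaced               | vc > 0    | vc--
Cache entry replaced               | vc == 0   | invalidate SLDT entry
SLDT entry replaced or invalidated | sr == 0   | sctr--
SLDT entry replaced or invalidated | sr == 1   | no action

SLDT entry: tag sz vc sr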

L2 Varying Fetch Sizes

[Figure: chart of % Improvement over 64-byte L2 Lines with L1 Varying Fetches for each benchmark. Series: 32-byte L2 lines (L1 vary fetch), 128-byte L2 lines (L1 vary fetch), 256-byte L2 lines (L1 vary fetch), L2 vary fetch (1-bit sctr), L2 vary fetch (4-bit sctr). Benchmarks: compress, 072.sc, go, 147.vortex, Pcode, lmdes2_customizer, gcc, 130.li, 134.perl, 124.m88ksim, word, excel, photo.]

Performance Improvements

[Figure: chart of % Improvement over Base for the same benchmarks. Series: (sampling) L1/L2 varying fetch (4-bit sctrs), (no sampling) L1/L2 varying fetch (4-bit sctrs).]

085.gcc Example Revisited

  • The access to y->code missed 11,223 times
    • Fetched 32 bytes on 47% of misses
    • Fetched 8 bytes on 53% of misses
  • On average:
    • 0.99 spatial hits to the resulting data per miss
    • 0.02 spatial misses to the resulting data per miss
  • Illustrates that the run-time scheme is doing a good job of selecting the correct fetch size for the example

Related Work

  • A large body of work has examined prefetching techniques for numeric codes
  • Prefetching techniques for integer codes often focus on prefetching of pointer targets [MeHa95][LuMo96][LiSKR95]
    • orthogonal to my techniques
  • [DuLe92] proposed variable blocksizes to reduce cache coherence traffic in multiprocessors
    • mechanism not geared to exploiting temporal-only data
    • used subblocking to support multiple blocksizes
  • Split caches provide one cache with short lines and one with long lines for spatial data [GoAV95][MiMTT96]
    • ratio between temporal and spatial data determined at design time