Memory Cache Optimization: An Impact Case Study on Compress Algorithm, Study notes of Computer Architecture and Organization

University of Illinois - Urbana-Champaign Computer Architecture and Organization

These notes examine how the 026.compress benchmark uses the data cache. They focus on the hash table accesses, which see little reuse in current caches, and propose letting infrequently accessed data bypass the cache. They also introduce memory macroblocks and their role in caching decisions.

Typology: Study notes

Pre 2010

Uploaded on 02/24/2010

koofers-user-a7d 🇺🇸


IMPACT

Case Study: 026.compress

while ( (c = getchar()) != EOF ) {
    in_count++;
    fcode = (long) (((long) c << maxbits) + ent);
    i = ((c << hshift) ^ ent);          /* xor hashing */
    if ( htabof (i) == fcode ) {
        ent = codetabof (i);
        continue;
    } else if ( (long)htabof (i) < 0 )  /* empty slot */
        goto nomatch;
    disp = hsize_reg - i;               /* secondary hash */
    if ( i == 0 ) disp = 1;
probe:
    if ( (i -= disp) < 0 ) i += hsize_reg;
    if ( htabof (i) == fcode ) {
        ent = codetabof (i);
        continue;
    }
    if ( (long)htabof (i) > 0 ) goto probe;
nomatch:
    output ( (code_int) ent );
    out_count++;
    ent = c;
    if ( free_ent < maxmaxcode ) {
        codetabof (i) = free_ent++;     /* code -> hashtable */
        htabof (i) = fcode;
    }
    else if ( (count_int)in_count >= checkpoint && block_compress )
        cl_block ();
}

• Inner loop of compress(): most of the execution time!
• The hash table accesses have little reuse in current caches
  - htab ~ 270K; codetab ~ 135K
  - Even in a 4-way set-associative cache there are many conflicts
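
The table sizes quoted above can be checked with a little arithmetic. The sketch below assumes the dimensions used by the public compress source for 16-bit codes (69001 hash slots, 4-byte htab entries, 2-byte codetab entries); none of these values appears on the slide, and the 16 KB cache used for comparison is purely illustrative.

```c
#include <stddef.h>

/* Assumed values (not stated on the slide): table dimensions from the
 * public compress source when built for 16-bit codes. */
#define HSIZE               69001u  /* number of hash table slots */
#define HTAB_ENTRY_BYTES    4u      /* count_int, assuming a 32-bit long */
#define CODETAB_ENTRY_BYTES 2u      /* unsigned short */

/* Total bytes occupied by a table of `entries` slots of `entry_bytes` each. */
static size_t table_bytes(size_t entries, size_t entry_bytes)
{
    return entries * entry_bytes;
}

/* How many times larger than the cache the table is, rounded down. */
static size_t times_cache_size(size_t table, size_t cache_bytes)
{
    return table / cache_bytes;
}
```

Under these assumptions htab occupies 276,004 bytes (about 270K) and codetab 138,002 bytes (about 135K), so htab alone is more than 16 times the size of a 16 KB cache. Because the xor hash spreads probes across the whole table, most probes land on lines that have already been evicted.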

Htab Memory Access Distribution

[Figure: plot of htab load hits and htab load misses. x-axis: Cycle (2,000,000 to 2,100,000); y-axis: Address (offset from 1G), 150,000 to 450,000.]


Compress Cache Bypass

• Infrequently accessed data should bypass cache
  - Only when it conflicts with more frequently accessed data
• Less cache pollution
• Increase cache reuse of frequently accessed addresses
  - overall increase in hit ratio
• Need a way to keep track of usage patterns

[Figure: Main Memory, Data Cache, and RegFile, with separate paths labeled Infrequently Accessed and Frequently Accessed.]

Memory Macroblocks

• Used to track memory accessing behavior
• Combine groups of adjacent cache blocks into larger regions called macroblocks
• Macroblock size should be:
  - large enough so the total number of macroblocks is not unreasonable to track
  - small enough so accessing frequency within each is relatively uniform

[Figure: Main Memory]
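
The size trade-off can be made concrete with a few helper functions. This is a minimal sketch: the 1024-byte macroblock size is the one named in the L1 hit ratio chart in these notes, while the 32-byte cache block and the function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Map a byte address to its macroblock number.
 * macroblock_bytes must be a nonzero power of two. */
static uint32_t macroblock_of(uint32_t addr, uint32_t macroblock_bytes)
{
    assert(macroblock_bytes != 0 &&
           (macroblock_bytes & (macroblock_bytes - 1)) == 0);
    return addr / macroblock_bytes;
}

/* Number of adjacent cache blocks grouped into one macroblock. */
static uint32_t blocks_per_macroblock(uint32_t macroblock_bytes,
                                      uint32_t block_bytes)
{
    return macroblock_bytes / block_bytes;
}

/* Number of macroblocks needed to cover a region of region_bytes
 * (rounded up). */
static uint32_t macroblocks_in(uint32_t region_bytes,
                               uint32_t macroblock_bytes)
{
    return (region_bytes + macroblock_bytes - 1) / macroblock_bytes;
}
```

With 1024-byte macroblocks and 32-byte blocks, each macroblock groups 32 cache blocks, and a table of roughly 270K bytes spans about 270 macroblocks. Larger macroblocks shrink that count but mix hot and cold data under one counter.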

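MAT Bypassing Operation

• On a memory access
  - look up the counter in the MAT, accessed with the macroblock address
  - increment the counter
  - save the updated counter value

[Figure: Main Memory, Data Cache (blocks A and B), MAT, and Reg File. The MAT counter ctr for the accessed macroblock is read, incremented (++ctr), and written back. Paths are labeled Infrequently Accessed and Frequently Accessed.]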
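
The lookup, increment, and save steps can be sketched as a small table of counters. The table size, the 8-bit saturating counters, and the absence of tags are all assumptions made to keep the example short; the slides do not specify them.

```c
#include <stdint.h>

#define MAT_ENTRIES      1024u  /* assumed number of MAT entries */
#define MACROBLOCK_SHIFT 10u    /* log2 of a 1024-byte macroblock */
#define CTR_MAX          255u   /* assumed 8-bit saturating counter */

typedef struct {
    uint8_t ctr[MAT_ENTRIES];   /* one access counter per MAT entry */
} mat_t;

/* MAT index for an address: macroblock number modulo the table size.
 * Tags are omitted here, so distinct macroblocks may share a counter. */
static uint32_t mat_index(uint32_t addr)
{
    return (addr >> MACROBLOCK_SHIFT) % MAT_ENTRIES;
}

/* On a memory access: look up the counter, increment it without
 * wrapping, and save it back. Returns the updated value. */
static uint8_t mat_touch(mat_t *mat, uint32_t addr)
{
    uint32_t idx = mat_index(addr);
    uint8_t ctr = mat->ctr[idx];
    if (ctr < CTR_MAX)
        ctr++;
    mat->ctr[idx] = ctr;
    return ctr;
}
```

Two accesses that fall in the same 1024-byte macroblock bump the same counter, which is what lets the counter approximate how heavily that region is used.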

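MAT Bypassing Operation (cont.)

• If the access hits in the cache, execute normally
• If the access misses, look up the MAT counter for the cache block that would be replaced
• Compare counter values
  - if ctr1 (the counter for the missing access) is lower than ctr2 (the counter for the block that would be replaced), bypass the cache
• Place bypassing data in a small set-associative bypass buffer
  - Holds bypassed data for a short time to exploit some temporal locality

[Figure: Main Memory, Data Cache (blocks A and B), MAT, bypass Buffer, and Reg File. The MAT counters ctr and ctr2 feed a comparator (CMP) that produces the bypass? decision; the figure also shows ctr2--. Paths are labeled Infrequently Accessed and Frequently Accessed.]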
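
The miss-time decision can be sketched as a single function over the two counters. The names below are hypothetical, and decrementing the victim's counter when a bypass occurs is an assumption read from the ctr2-- label in the figure, not something the slide text states.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    ACCESS_HIT,              /* data already in the cache */
    PLACE_IN_CACHE,          /* replace the victim block */
    PLACE_IN_BYPASS_BUFFER   /* leave the victim in place */
} placement_t;

/* Decide where the data for one access goes.
 *   cache_hit  : whether the access hit in the data cache
 *   ctr_access : MAT counter of the macroblock being accessed (ctr1)
 *   ctr_victim : MAT counter of the macroblock holding the block that
 *                would be replaced (ctr2), updated in place */
static placement_t mat_place(bool cache_hit, uint8_t ctr_access,
                             uint8_t *ctr_victim)
{
    if (cache_hit)
        return ACCESS_HIT;
    if (ctr_access < *ctr_victim) {
        /* Assumption: decay the victim's counter on each bypass so a
         * once-hot block cannot hold its line indefinitely. The test
         * above guarantees *ctr_victim > 0 here. */
        (*ctr_victim)--;
        return PLACE_IN_BYPASS_BUFFER;
    }
    return PLACE_IN_CACHE;
}
```

A miss with ctr1 = 3 against a victim with ctr2 = 10 is routed to the bypass buffer, while a miss whose counter is at least as large as the victim's replaces the victim as usual.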

Performance Improvement

[Figure: % Improvement over Base per benchmark, with Sampling and No Sampling series. Benchmarks: 026.compress, 072.sc, 099.go, 147.vortex, Pcode, lmdes2_customizer, 085.gcc, 130.li, 134.perl, 124.m88ksim, word, excel, photo.]

L1 Hit Ratios

[Figure: L1 Hit Ratio per benchmark for the same benchmarks. Series: Base, Simulator (1024-byte macroblocks), Upper Bound.]

Example: 085.gcc Routine

Q: How much data to fetch on a miss to y->code?

int rtx_renumbered_equal_p (rtx x, rtx y)
    rtx_code code = x->code;
    if (code != y->code) return 0;   /* Exits here 448 times (1%) */
                                     /* A: 8 bytes sufficient */
    switch (code) {
    [...]
    case CONST_INT:
        return x->fld[0].rtint == y->fld[0].rtint;   /* Exits here 29096 times (46%) */
                                                     /* A: 8 bytes sufficient */
    fmt = GET_RTX_FORMAT (code);
    for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--) {
        switch (fmt[i]) {
        [...]
        case 'e':
            if (! rtx_renumbered_equal_p (x->fld[i].rtx, y->fld[i].rtx))   /* Executed 33030 times */
                return 0;            /* A: More than 8 bytes needed! */
            break;

Cache Organization Alternative: Variable Fetch

• 8-byte lines & 32-byte virtual lines

[Figure: the access sequence ld A, ld B, ld C, ld A, ld C, ld B+ shown against two caches. Labels in the figure: 8 bytes, miss C, hit B+, hit C, hit A.]
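
The two fetch sizes differ only in how the miss address is aligned and how many 8-byte lines are brought in. The sketch below illustrates that calculation; the type and function names are hypothetical, and only the 8-byte and 32-byte sizes come from the slide.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES         8u   /* physical cache line */
#define VIRTUAL_LINE_BYTES 32u  /* largest fetch: one virtual line */

typedef struct {
    uint32_t base;   /* address of the first byte fetched */
    uint32_t lines;  /* number of 8-byte lines fetched */
} fetch_t;

/* Compute what to fetch on a miss to addr.
 * fetch_bytes must be LINE_BYTES or VIRTUAL_LINE_BYTES. */
static fetch_t plan_fetch(uint32_t addr, uint32_t fetch_bytes)
{
    assert(fetch_bytes == LINE_BYTES || fetch_bytes == VIRTUAL_LINE_BYTES);
    fetch_t f;
    f.base = addr & ~(fetch_bytes - 1u);   /* align down to the fetch size */
    f.lines = fetch_bytes / LINE_BYTES;
    return f;
}

/* Nonzero if addr falls inside the fetched region. */
static int covered(fetch_t f, uint32_t addr)
{
    return addr >= f.base && addr - f.base < f.lines * LINE_BYTES;
}
```

For a miss at 0x1014, an 8-byte fetch brings in only 0x1010 to 0x1017, so a later access to 0x1018 misses again. A 32-byte fetch brings in 0x1000 to 0x101F, so that access hits at the cost of occupying four lines.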

SLDT Hardware (Spatial Locality Detection Table)

[Figure: a Memory Address is presented to the L1 Data Cache (8-byte blocks), the MAT (entries include sctr), and the SLDT. Signals shown: hit?, spatial reuse?, fetch size?. Note in the figure: update sctr with hit and spatial reuse results. Other labels: fi bit; SLDT entry format: tag, sz, vc, sr.]

SLDT Actions

• When max fetch size block is fetched into data cache
  - sz = 1; vc = max_fetch_size/min_fetch_size - 1; sr = 0
• When min fetch size block is fetched into data cache
  - no prior SLDT entry: sz = 0; vc = 0; sr = 0
  - existing SLDT entry: vc++

Cache Access   SLDT Access   fi      sz      vc
Result         Result        Value   Value   Value   Action
miss           hit           -       0               sr = 1; sctr++
                                     1               sr = 1
hit            hit           0                       sr = 1
                                     0       >0      sr = 1
hit            miss          0       -       -       allocate SLDT entry; sz = 1; sr = 1
                             1       -       -       allocate SLDT entry

Event                                 Condition   Action
Cache entry replaced                  vc > 0      vc--
                                      vc == 0     invalidate SLDT entry
SLDT entry replaced or invalidated    sr == 0     sctr--
                                      sr == 1     no action

SLDT entry: tag | sz | vc | sr
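
The fetch and replacement rules can be sketched as updates to one SLDT entry. This is an illustration under assumptions: the struct layout, the function names, the 8-byte and 32-byte fetch sizes, and the use of a plain int for sctr are hypothetical, and the per-access rows of the table are not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

#define MIN_FETCH 8u    /* assumed min fetch size in bytes */
#define MAX_FETCH 32u   /* assumed max fetch size in bytes */

typedef struct {
    bool    valid;
    uint8_t sz;  /* 1 if fetched at max size, 0 if at min size */
    uint8_t vc;  /* count maintained by the fetch/replace rules */
    uint8_t sr;  /* 1 once spatial reuse has been observed */
} sldt_entry_t;

/* A max-fetch-size block is fetched into the data cache. */
static void sldt_on_max_fetch(sldt_entry_t *e)
{
    e->valid = true;
    e->sz = 1;
    e->vc = MAX_FETCH / MIN_FETCH - 1;
    e->sr = 0;
}

/* A min-fetch-size block is fetched into the data cache. */
static void sldt_on_min_fetch(sldt_entry_t *e)
{
    if (!e->valid) {            /* no prior SLDT entry */
        e->valid = true;
        e->sz = 0;
        e->vc = 0;
        e->sr = 0;
    } else {                    /* existing SLDT entry */
        e->vc++;
    }
}

/* A cache entry covered by this SLDT entry is replaced.
 * sctr is the spatial counter kept in the MAT. */
static void sldt_on_cache_replace(sldt_entry_t *e, int *sctr)
{
    if (e->vc > 0) {
        e->vc--;
        return;
    }
    /* vc == 0: invalidate the entry; no spatial reuse seen means sctr-- */
    if (e->sr == 0)
        (*sctr)--;
    e->valid = false;
}
```

After a 32-byte fetch the entry starts with vc = 3, so it survives three replacements and is invalidated on the fourth. If sr is still 0 at that point, the larger fetch was never exploited and sctr is decremented.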

L2 Varying Fetch Sizes

[Figure: % Improvement over 64-byte L2 Lines with L1 Varying Fetches, per benchmark. Series: 32-byte L2 lines (L1 vary fetch), 128-byte L2 lines (L1 vary fetch), 256-byte L2 lines (L1 vary fetch), L2 vary fetch (1-bit sctr), L2 vary fetch (4-bit sctr). Benchmarks: compress, 072.sc, go, 147.vortex, Pcode, lmdes2_customizer, gcc, 130.li, 134.perl, 124.m88ksim, word, excel, photo.]

Performance Improvements

[Figure: % Improvement over Base per benchmark for the same benchmarks. Series: (sampling) L1/L2 varying fetch (4-bit sctrs), (no sampling) L1/L2 varying fetch (4-bit sctrs).]

085.gcc Example Revisited

• Access to y->code missed 11,223 times
  - Fetched 32 bytes on 47% of misses
  - Fetched 8 bytes on 53% of misses
• On average:
  - 0.99 spatial hits to the resulting data per miss
  - 0.02 spatial misses to the resulting data per miss
• Illustrates that the run-time scheme is doing a good job of selecting the correct fetch size for the example
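
The split between fetch sizes also implies an average fetch size per miss, which is a rough way to see the bandwidth saved relative to always fetching 32 bytes. The helpers below are a hypothetical sketch; because the slide's percentages are rounded, the per-category miss counts are approximate.

```c
/* Figures quoted on the slide. */
#define MISSES      11223
#define PCT_32_BYTE 47
#define PCT_8_BYTE  53

/* Average bytes fetched per miss, in hundredths of a byte so the
 * arithmetic stays in integers (percent times bytes). */
static int avg_bytes_x100(int pct_large, int large_bytes,
                          int pct_small, int small_bytes)
{
    return pct_large * large_bytes + pct_small * small_bytes;
}

/* Approximate number of misses in a category, rounded to nearest. */
static int misses_in(int total, int pct)
{
    return (total * pct + 50) / 100;
}
```

With the slide's numbers this gives about 5,275 misses that fetched 32 bytes and about 5,948 that fetched 8 bytes, for an average of 19.28 bytes per miss. That is roughly 40% less data than a fixed 32-byte fetch for this access.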

Related Work

• A large body of work has examined prefetching techniques for numeric codes
• Prefetching techniques for integer codes often focus on prefetching of pointer targets [MeHa95][LuMo96][LiSKR95]
  - orthogonal to my techniques
• [DuLe92] proposed variable blocksizes to reduce cache coherence traffic in multiprocessors
  - mechanism not geared to exploiting temporal-only data
  - used subblocking to support multiple blocksizes
• Split caches provide one cache with short lines and one with long lines for spatial data [GoAV95][MiMTT96]
  - ratio between temporal and spatial data determined at design time
