CDA 5155: Fall 2008 HW3 - Parallelism, Cache, Memory Hierarchy - Prof. Prabhat Kumar Mishr | Assignments Electrical and Electronics Engineering

Homework 3

CDA 5155: Fall 2008

Due Date: 11/06/2008 11:59 PM (UF EDGE Students: 11/13/2008 11:59 PM)

Primary TA: Weixun Wang

You are not allowed to give or receive help in completing this assignment. Submit a PDF version of your submission to the e-Learning website before the deadline. Please include the following sentence in bold at the top of your submission: “I have neither given nor received any unauthorized aid on this assignment”.

Problem 1

1. [1 + 2 + 2] What is the “window” in exploiting instruction-level parallelism? What is the relationship between window size and issue rate (how does one limit the other, and vice versa)? What other factors limit window size and issue rate?

2. [2 + 2] Briefly describe the fine-grained and coarse-grained multithreading techniques, along with their pros and cons.

3. [1] What is the main disadvantage of allowing a processor to execute instructions from multiple processes at the same time (compared to executing multiple threads)?

Problem 2

A traditional way to improve cache performance is to separately optimize memory accesses with distinct purposes, such as instruction accesses versus data accesses. This technique can be taken even further by separating data memory accesses into sub-categories, such as stack versus non-stack accesses. Such approaches are sometimes called region caching, in reference to the different address-space regions that are conventionally devoted to instructions, stack, static data and heap.

In this problem, you are going to calculate and compare the performance of two different machines, A and B. Note that in a given benchmark suite, 15% of benchmark instructions are loads and 5% are stores, and that the average CPI is 2 for the given benchmark suite on both machines, excluding both the “store → load” and “load → other” latencies (which are explained below) as well as data cache misses. Assume that 20% of loads in the benchmark are close enough to a store that they depend on to incur the maximum “store → load” latency on either machine, and that the rest of the loads are far enough away to incur no latency on either machine. Similarly, assume that 30% of loads are followed by a data-dependent instruction that is close enough to incur the maximum “load → other” latency on either machine, and that the rest incur no latency. Moreover, assume that 50% of the data memory accesses are stack accesses.


1. [5] Machine A uses a data cache (there is a separate instruction cache, but stack accesses just use the data cache) and runs at a clock rate of 4 GHz. Suppose the machine has a deep pipeline, so that the minimum latency between a store and a subsequent data-dependent load from the same address (“store → load”) is 10 cycles, even for a cache hit. The latency between a load and an instruction using the result of the load (“load → other”) is 6 cycles. A cache miss takes 50 cycles of latency on average. The miss rate for data accesses on machine A is 5%. What is the average CPI of machine A?
2. [5] Machine B is just like machine A, except for the following improvements. All stack accesses (which are identified by the base register used) go to a separate fully-associative stack value file instead of to the data cache (assume that the stack value file is large enough that it never fills up). The stack value file uses dependence prediction, which guesses the effective address correctly on 80% of stack accesses. When this happens, the speculative load result is available early enough in the pipeline that the “store → load” and “load → other” latencies are each only 1 cycle. When the prediction is incorrect, the load-related latencies are the same as on machine A. The miss rate for stack accesses is 2%, and the miss rate for non-stack data accesses is reduced to 3%, since the static and heap accesses to the main cache no longer have to compete with the stack accesses for available cache lines. Because of these changes, machine B can only run at a clock rate of 3.5 GHz. What is the average CPI of machine B?
3. [5] Determine the performance (in millions of instructions per second, a.k.a. MIPS) of machine A and machine B.
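The CPI and MIPS arithmetic for this kind of problem can be sketched in a few lines of Python. This is one possible reading of the problem statement, not an official solution: it assumes that all stall cycles simply add to the base CPI, that the miss rate applies to both loads and stores, and that every miss costs the full average miss latency. The function and parameter names (`average_cpi`, `mips`, and so on) are made up for illustration. The example call plugs in the figures given for machine A.

```python
def average_cpi(base_cpi, load_frac, store_frac,
                store_load_frac, store_load_lat,
                load_use_frac, load_use_lat,
                miss_rate, miss_penalty):
    """Base CPI plus per-instruction stall cycles (additive model, an assumption)."""
    # Stalls per instruction caused by loads that wait on a nearby store
    # or that feed a nearby dependent instruction.
    dependence_stalls = load_frac * (store_load_frac * store_load_lat
                                     + load_use_frac * load_use_lat)
    # Stalls per instruction caused by data cache misses; assumes both
    # loads and stores can miss and each miss costs the full penalty.
    miss_stalls = (load_frac + store_frac) * miss_rate * miss_penalty
    return base_cpi + dependence_stalls + miss_stalls


def mips(clock_hz, cpi):
    """Millions of instructions per second for a given clock rate and CPI."""
    return clock_hz / (cpi * 1e6)


# Machine A figures taken from the problem statement.
cpi_a = average_cpi(base_cpi=2.0, load_frac=0.15, store_frac=0.05,
                    store_load_frac=0.20, store_load_lat=10,
                    load_use_frac=0.30, load_use_lat=6,
                    miss_rate=0.05, miss_penalty=50)
mips_a = mips(4e9, cpi_a)
print(f"CPI = {cpi_a:.2f}, MIPS = {mips_a:.1f}")
```

Under these assumptions the dependence stalls contribute 0.57 cycles and the miss stalls 0.50 cycles per instruction for machine A. Machine B would need the load and miss terms split into stack and non-stack halves, with the stack half further weighted by the 80% prediction accuracy.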

Problem 3

[5] Assume a cache with a total of 4 lines (blocks) that can be organized as direct-mapped, 2-way set-associative, or fully-associative. Consider 6 references: A (0000), B (0001), C (0010), D (0011), E (0100) and F (0101), with their block addresses in ‘( )’. Your job is to come up with a referencing sequence (from left to right) of the 6 blocks (in any order, and each block can be repeated), such that the hit ratios of the three organizations are ordered as follows: fully-associative > 2-way set-associative > direct-mapped. (Note: the content of the cache should be listed from the most recently used line to the least recently used, assuming that the LRU replacement policy is applied in all three organizations.)

Request sequence    A    B
Dir-map (set 0)     A    A
Dir-map (set 1)     -    B
Dir-map (set 2)
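A candidate sequence can be checked mechanically with a small LRU cache model. The sketch below assumes that the set index is the block address modulo the number of sets and that all three organizations use LRU replacement within a set; the names `BLOCKS` and `count_hits` are invented for this example. The trial sequence A, E, A is only a demonstration of a direct-mapped conflict, not an answer to the problem.

```python
from collections import OrderedDict

# Block addresses as given in the problem statement.
BLOCKS = {"A": 0b0000, "B": 0b0001, "C": 0b0010,
          "D": 0b0011, "E": 0b0100, "F": 0b0101}


def count_hits(sequence, num_lines=4, ways=1):
    """Count hits for a set-associative cache with LRU replacement.

    ways=1 models direct-mapped; ways=num_lines models fully-associative.
    """
    num_sets = num_lines // ways
    # One OrderedDict per set; the oldest entry is the least recently used.
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for name in sequence:
        addr = BLOCKS[name]
        lines = sets[addr % num_sets]
        if addr in lines:
            hits += 1
            lines.move_to_end(addr)         # mark as most recently used
        else:
            if len(lines) == ways:
                lines.popitem(last=False)   # evict the least recently used
            lines[addr] = True
    return hits


trial = ["A", "E", "A"]
hits_by_org = {
    "direct-mapped": count_hits(trial, ways=1),
    "2-way": count_hits(trial, ways=2),
    "fully-associative": count_hits(trial, ways=4),
}
print(hits_by_org)
```

In this trial, A and E map to the same direct-mapped set and evict each other, while both fit in one set of the 2-way organization. Dividing each hit count by the sequence length gives the hit ratio to compare across organizations.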

Problem 4

1. [1 + 2 + 2 + 2 + 2 + 2] Figure C.24 on Page C-47 in the textbook gives an overall picture of a hypothetical memory hierarchy going from virtual address to physical address. It also shows how the caches are accessed using the virtual/physical address. Assume that we have a processor architecture in which virtual addresses and physical addresses have 52 bits and 44 bits, respectively. The page size is 8KB. The cache block size is 64 bytes for the L1 cache and 256 bytes for the L2 cache. The TLB is configured to have 4K entries with a 16-way set-associative design. The “virtual-indexed and physical-tagged” L1 cache is 32KB with a 4-way set-associative design. The L2 cache is 4MB with an 8-way set-associative design. Give the number of bits of the following address components: a) L1 block offset b) L1 index c) TLB index d) TLB tag e) L2 index f) L2 tag
2. [2 + 2] In order to speed up cache access, the address can be translated in parallel with the cache tag and data array access. In other words, the right cache set (one line/block for a direct-mapped cache or k lines for a k-way set-associative cache) needs to be fetched at the same time as the virtual address is translated to a physical address. a) From this point of view, describe intuitively the relation between the page size and the L1 cache size. b) In sub-problem 1, if we increase the L1 cache size from 32KB to 128KB while the other parameters remain the same, what difficulty arises in achieving the goal in a)? Also, give one solution that resolves this difficulty and allows parallel TLB and cache accesses. (Hint: think about what “virtual-indexed and physical-tagged” means.)
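The field widths asked for above follow from base-2 logarithms of the sizes involved. The sketch below is a hedged illustration under common textbook conventions, not an official answer key: it assumes that all sizes are powers of two, that the L2 cache is physically indexed and physically tagged, and that the TLB tag is the virtual page number minus the TLB index bits. The helper names (`bits`, `cache_fields`, `tlb_fields`, `index_fits_in_page_offset`) are hypothetical.

```python
KB = 1024
MB = 1024 * KB


def bits(n):
    """Number of address bits needed to select one of n items (n a power of two)."""
    b = n.bit_length() - 1
    if 1 << b != n:
        raise ValueError(f"{n} is not a power of two")
    return b


def cache_fields(cache_bytes, block_bytes, ways, addr_bits):
    """Split an address into block offset, set index and tag widths."""
    offset = bits(block_bytes)
    index = bits(cache_bytes // (block_bytes * ways))  # log2(number of sets)
    return {"offset": offset, "index": index, "tag": addr_bits - index - offset}


def tlb_fields(entries, ways, virt_bits, page_bytes):
    """Split the virtual page number into TLB index and TLB tag widths."""
    index = bits(entries // ways)
    vpn_bits = virt_bits - bits(page_bytes)
    return {"index": index, "tag": vpn_bits - index}


def index_fits_in_page_offset(cache_bytes, block_bytes, ways, page_bytes):
    """True if index + block offset lie entirely within the page offset bits."""
    f = cache_fields(cache_bytes, block_bytes, ways, addr_bits=0)
    return f["offset"] + f["index"] <= bits(page_bytes)


l1 = cache_fields(32 * KB, 64, 4, addr_bits=44)
l2 = cache_fields(4 * MB, 256, 8, addr_bits=44)
tlb = tlb_fields(4 * KB, 16, virt_bits=52, page_bytes=8 * KB)
print(l1, l2, tlb)
print(index_fits_in_page_offset(32 * KB, 64, 4, 8 * KB))
print(index_fits_in_page_offset(128 * KB, 64, 4, 8 * KB))
```

The last two calls relate to sub-problem 2: when the index and block offset no longer fit inside the page offset, some index bits come from the virtual page number, so they are not guaranteed to match the physical address before translation completes.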
