Performance Improvement with Victim Cache and CPI Calculation - Prof. Josep Torrellas, Exams of Computer Architecture and Organization

A solution to calculate the average memory access time (AMAT) with and without a victim cache, and the overall CPI for a given cache configuration. It includes formulas for instruction access stalls, data read miss stalls, and data write miss stalls.

Uploaded on 03/10/2009

CS 433g Final Exam – December 15, 2004
Professor Sarita Adve
Time: 7:00-10:00pm, 3 hours
Please clearly print your full name, NetID and circle the appropriate category in the space provided
below. Failure to completely fill out this table will result in a ZERO grade.
Name
NetID
Category (circle one) 3 Credit Hours 4 Credit Hours
Instructions
1. You may only use class handouts from this semester’s offering, the course text (Computer
Architecture: A Quantitative Approach - 3rd Edition - by Hennessy and Patterson), your own
homework submissions for this course, and notes written or typed by yourself. No other materials are
allowed, including other books, notes prepared by others, or materials from previous offerings of this
course or from other universities.
2. Calculators are allowed. You may not use any other electronic devices.
3. Please do not turn in your loose scrap paper. Limit your answers to the space provided, if possible. If
not, write on the back of the same sheet. You may use the back of each sheet for scratch work.
4. In all cases, show your work. No credit will be given for numeric answers if there is no indication of
how the answer was derived. Partial credit will be given even if your final solution is incorrect,
provided you show the intermediate steps in getting the final solution.
5. If you believe a problem is incorrectly or incompletely specified, make a reasonable assumption
and solve the problem. The assumption should not result in a trivial solution. In all cases,
clearly state any assumptions you make in your answers.
6. This exam has 6 problems and 9 pages (including this one). All students should attempt all problems.
Please budget your time appropriately. Good luck!
| Problem | Maximum Points | Received Points |
|---|---|---|
| 1 | 6 | |
| 2 | 5 | |
| 3 | 11 | |
| 4 | 14 | |
| 5 | 8 | |
| 6 | 6 | |
| Total | 50 | |

Question 1 [6 points]

A four-entry victim cache for a 4KB direct-mapped cache removes 85% of the conflict misses in a program. Without the victim cache, the miss rate is 0.06 and 67% of these misses are conflict misses. What is the percentage improvement in the AMAT (average memory access time) due to the victim cache? Assume a hit in the main cache takes 1 cycle. For a miss in the main cache that hits in the victim cache, assume an additional penalty of 1 cycle to access the victim cache. For a miss in the main and victim caches, assume a further penalty of 48 cycles to get the data from memory. Assume a simple, single-issue, 5-stage pipeline, in-order processor that blocks on every read and write until it completes.

Solution: AMAT = Hit time + Miss Rate x Miss Penalty

Without the victim cache: AMAT = 1 + 0.06*48 = 3.88 cycles

With the victim cache: AMAT = 1 + 0.06 * {(0.67 * 0.85 * 1) + ((1 - (0.67 * 0.85)) * 49)} = 2.29984 cycles

Improvement: (3.88 – 2.29984)/3.88 = 40.73%

Grading: 1 point for original AMAT formula, 4 points for victim cache AMAT formula, 1 point for percent improvement formula. For the victim cache AMAT, 1 point for determining the correct rate for victim cache hits and misses, 1 point for assigning the correct penalty to victim cache hits, 1 point for assigning the correct penalty to victim cache misses, and 1 point for putting it all together correctly.
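The AMAT arithmetic above can be checked with a short C sketch. The function names (`amat`, `victim_miss_penalty`, `percent_improvement`) are illustrative and not part of the exam; the sketch assumes, as the solution does, that a miss in both caches costs the 1-cycle victim lookup plus the 48-cycle memory access.

```c
#include <assert.h>

/* AMAT = hit time + miss rate * average miss penalty (all times in cycles). */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}

/* Average penalty of a main-cache miss when a victim cache is present.
 * victim_hit_fraction: fraction of main-cache misses that hit in the victim cache
 * victim_penalty:      extra cycles to probe the victim cache
 * memory_penalty:      further cycles when the victim cache also misses */
double victim_miss_penalty(double victim_hit_fraction,
                           double victim_penalty,
                           double memory_penalty) {
    assert(victim_hit_fraction >= 0.0 && victim_hit_fraction <= 1.0);
    return victim_hit_fraction * victim_penalty
         + (1.0 - victim_hit_fraction) * (victim_penalty + memory_penalty);
}

/* Percentage improvement of `improved` relative to `baseline`. */
double percent_improvement(double baseline, double improved) {
    return 100.0 * (baseline - improved) / baseline;
}
```

With the exam's numbers, `amat(1.0, 0.06, 48.0)` gives the baseline, and `amat(1.0, 0.06, victim_miss_penalty(0.67 * 0.85, 1.0, 48.0))` gives the victim-cache case; passing both to `percent_improvement` reproduces the improvement computed above.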

Question 3 [11 points]

Consider a machine M with a simple, 5-stage, single-issue, in-order pipeline that blocks on loads until the requested data is received and blocks on stores until the data is stored in the cache. Consider a single level of separate data and instruction caches and the following characteristics:

  • The base CPI with a perfect memory system (i.e., where every memory access takes 1 cycle) is 1.5.
  • The hit latency for both I and D caches is 1 cycle.
  • Main memory access latency is 40 cycles. This is measured as the time from when the cache issues a main memory request until the time the main memory is ready to deliver the first piece of data. After this latency, both caches receive data/instructions from main memory at the rate of 4 bytes per clock cycle.
  • Assume each cache waits for the entire requested block before servicing the processor’s request to any word in the block.
  • The block size is 32 bytes for both caches.
  • Memory load and store instructions account for 32% of the dynamic instruction mix. 75% of these instructions are loads and 25% are stores.
  • Each load and store instruction accesses 64 bits of data. All the instructions are 32 bits long.
  • Assume that there is no overlap among stalls resulting from computation, data cache misses, or instruction cache misses.
  • Both the data and instruction caches are 8 KB and direct mapped.
  • The data cache is write-back, write-allocate. Assume the cache stalls for the write-back before it can issue the request that caused the write-back. Assume the time to write-back a block from the data cache to memory is the same as the time for the cache to read a block from memory.
  • The data cache and instruction cache have miss rates of 5% and 3% respectively.
  • Assume that for 40% of the data cache misses, you need to replace a block that is dirty.

Part A [4 points]

What is the CPI for the machine M? Be sure to show your work.

Solution: We have the following formula for the overall CPI:

CPI = base CPI + instruction access stalls + data read miss stalls + data write miss stalls

We can calculate the individual values as follows. First, cache miss penalty for x bytes = memory access latency + x bytes / data receive rate. For 32 bytes, we have cache miss penalty = 40 cycles + 32 bytes / 4 bytes/cycle = 48 cycles.

We now have the following formulas:

Instruction access stalls = instruction cache miss rate * instruction cache miss penalty

Data read miss stalls = data cache miss rate * percent load instructions * total load/store percent * ((2 * cache miss penalty) * percent data cache misses with dirty block + data cache miss penalty * (1 – percent data cache misses with dirty block))

Data write miss stalls = data cache miss rate * percent store instructions * total load/store percent * ((2 * cache miss penalty) * percent data cache misses with dirty block + data cache miss penalty * (1 – percent data cache misses with dirty block))

Applying these formulas, we find the following values:

Instruction access stalls = 0.03 * 48 cycles = 1.44 cycles

Data read miss stalls = 0.05 * 0.75 * 0.32 * ((48 cycles * 2) * 0.4 + 48 cycles * 0.6) = 0.8064 cycles

Data write miss stalls = 0.05 * 0.25 * 0.32 * ((48 cycles * 2) * 0.4 + 48 cycles * 0.6) = 0.2688 cycles

Thus, we have the following overall CPI using the CPI formula given above:

CPI = 1.5 + 1.44 + 0.8064 + 0.2688 = 4.0152

Grading: 1 point for correct CPI equation. 1 point for each correct subequation
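The Part A calculation can be sketched in C as follows. This is a minimal sketch under the solution's stated assumptions (a dirty replacement costs one extra block transfer, and stalls do not overlap); the function and parameter names are illustrative, not taken from the exam.

```c
#include <assert.h>

/* Cycles to fetch one block: latency to the first data plus transfer time. */
double block_miss_penalty(double latency_cycles, double block_bytes,
                          double bytes_per_cycle) {
    assert(bytes_per_cycle > 0.0);
    return latency_cycles + block_bytes / bytes_per_cycle;
}

/* Average data-miss penalty for a write-back cache: replacing a dirty block
 * costs a write-back (same time as a block read) before the fill. */
double writeback_miss_penalty(double penalty, double dirty_fraction) {
    assert(dirty_fraction >= 0.0 && dirty_fraction <= 1.0);
    return dirty_fraction * (2.0 * penalty) + (1.0 - dirty_fraction) * penalty;
}

/* CPI = base CPI + instruction stalls + data read stalls + data write stalls.
 * mem_fraction:  fraction of instructions that are loads or stores
 * load_fraction: fraction of those memory instructions that are loads */
double machine_cpi(double base_cpi, double i_miss_rate, double d_miss_rate,
                   double mem_fraction, double load_fraction,
                   double penalty, double dirty_fraction) {
    double d_penalty    = writeback_miss_penalty(penalty, dirty_fraction);
    double i_stalls     = i_miss_rate * penalty;
    double read_stalls  = d_miss_rate * mem_fraction * load_fraction * d_penalty;
    double write_stalls = d_miss_rate * mem_fraction * (1.0 - load_fraction) * d_penalty;
    return base_cpi + i_stalls + read_stalls + write_stalls;
}
```

Calling `machine_cpi(1.5, 0.03, 0.05, 0.32, 0.75, block_miss_penalty(40.0, 32.0, 4.0), 0.4)` uses the values given for machine M and evaluates to approximately 4.0152.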

We have the following formula for the overall CPI:

CPI = base CPI + instruction access stalls + data read miss stalls + data write miss stalls

We can calculate the individual values as follows. First, cache miss penalty for x bytes = memory access latency + x bytes / data receive rate. For 32 bytes, we have cache miss penalty = 40 cycles + 32 bytes / 4 bytes/cycle = 48 cycles.

We now have the following formulas:

Instruction access stalls = instruction cache miss rate * instruction cache miss penalty

Data read miss stalls = data cache miss rate * percent load instructions * total load/store percent * cache miss penalty

Data write miss stalls = percent store instructions * total load/store percent * percent buffer full * buffer stall time

Applying these formulas, we find the following values:

Instruction access stalls = 0.06 * 48 cycles = 2.88 cycles

Data read miss stalls = 0.06 * 0.75 * 0.32 * 48 cycles = 0.6912 cycles

Data write miss stalls = 0.25 * 0.32 * y * z cycles = 0.08 yz cycles

Thus, we have the following overall CPI using the CPI formula given above:

CPI = 1.5 + 2.88 + 0.6912 + 0.08 yz = 5.0712 + 0.08 yz

Grading: 3 points for a correct set of assumptions. Either assumption may receive full credit. 1 point each for the instruction access stalls and data read miss stalls formulas and 2 points for the data write miss stalls formula.
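Because this solution leaves the buffer terms symbolic, a small C sketch can show how the CPI varies with them. The constants are copied from the worked values above; the parameters `buffer_full_fraction` and `buffer_stall_cycles` stand for the unknowns y and z, and any numbers plugged in for them are hypothetical.

```c
#include <assert.h>

/* Values taken from the worked solution above. */
#define BASE_CPI         1.5
#define INSTR_STALLS     (0.06 * 48.0)               /* 2.88 cycles   */
#define DATA_READ_STALLS (0.06 * 0.75 * 0.32 * 48.0) /* 0.6912 cycles */
#define STORE_FRACTION   (0.25 * 0.32)               /* stores per instruction */

/* CPI = 5.0712 + 0.08 * y * z, where
 * y = buffer_full_fraction (fraction of stores that find the buffer full)
 * z = buffer_stall_cycles  (stall cycles when that happens) */
double cpi_with_buffer(double buffer_full_fraction, double buffer_stall_cycles) {
    assert(buffer_full_fraction >= 0.0 && buffer_full_fraction <= 1.0);
    assert(buffer_stall_cycles >= 0.0);
    double write_stalls = STORE_FRACTION * buffer_full_fraction * buffer_stall_cycles;
    return BASE_CPI + INSTR_STALLS + DATA_READ_STALLS + write_stalls;
}
```

For instance, `cpi_with_buffer(0.0, 0.0)` returns the 5.0712 constant term, which is the CPI when stores never stall.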

Question 4 [14 points]

This question concerns a snooping update (as opposed to invalidate) cache coherence protocol. Consider a system where the processors are connected by a bus, the caches are write-back and write-allocate and cache coherence is maintained through a snooping update protocol. In a snooping update protocol, when a cache modifies its data, it broadcasts the updated data on a bus using a bus update transaction, if necessary. Memory and all caches that have a copy of that data then update their own copies. This is in contrast to the invalidation protocol discussed in class where a cache invalidates its copy in response to another processor’s write request to a block.

Our update protocol has three states – CE, CS and DE:

  • CE (Clean Exclusive): The block is present only in this cache (exclusively) and memory also has the same (clean) copy.
  • CS (Clean Shared): The block is present in several caches (shared) and memory and all those caches have the same (clean) copy.
  • DE (Dirty Exclusive): The block is present only in this cache (exclusively) and the data in the cache is updated or dirty (i.e., a more recent version than the copy in memory).

The bus has a special line called the Shared Line (SL) whose state is usually 0. When cache i performs a bus transaction for a cache block, all caches that have the same block pull up the Shared Line (SL) to 1. The SL is only checked when a bus transaction is performed.

Assume that if a request is made to a block for which memory has a clean copy, memory will service that request. If the memory does not have a clean copy, the cache with the updated block will service the request and memory will also get updated.

For the question below, consider the following bus transactions:

  • BR: Bus Read – Request to get the cache line (on a cache miss).
  • BU: Bus update – Request to update copies of the cache line in memory and other caches with the new value of a word in the block.
  • BRU: Bus read and update – A combination of BR and BU.

Note: you are not required to consider Bus Writeback, which may take place on a replacement.

Part A [8 points]

Fill out the following state transition table for processor i showing the next state for a block in the cache of processor i and any bus transaction performed by processor i.

Each of the entries should be filled out as:

Next State/Bus Transaction (e.g., CS/BR), where

Next State = CS, CE, DE or NIC (Not in Cache; i.e., a cache miss)

Bus Transaction = BR, BU, BRU, or NT (No transaction)

Note: If an entry is not possible (i.e., the system cannot be in such a state) write “Not Possible” in that entry.

Part B [6 points]

Fill out the following state transition table for the cache of processor i showing the next state for a block in the cache of processor i and the action(s) taken by the cache when a bus transaction is initiated by another processor j.

Each of the entries should be filled out as:

Next State/Action (e.g., CS/UPDL), where

Next State = CS, CE, DE or NIC (Not in Cache; i.e., a cache miss)

Action =

  • PULLSL1: Pull SL to 1
  • UPDL: Update line in cache i (i.e., one’s own cache)
  • PROVL: Provide line in response to a BR or BRU (main memory is also updated as part of this action)
  • NA: No Action

Note: If an entry is not possible (i.e., the system cannot be in such a state) write “Not Possible” in that entry.

| State in proc i | BR by proc j | BU by proc j | BRU by proc j |
|---|---|---|---|
| CE | | | |
| CS | | | |
| DE | | | |
| NIC | NIC/NA | NIC/NA | NIC/NA |

Solution:

| State in proc i | BR by proc j | BU by proc j | BRU by proc j |
|---|---|---|---|
| CE | CS/PULLSL1 | Not possible | CS/PULLSL1, UPDL |
| CS | CS/PULLSL1 | CS/UPDL, PULLSL1 | CS/UPDL, PULLSL1 |
| DE | CS/PROVL, PULLSL1 | Not possible | CS/PROVL, UPDL, PULLSL1 |
| NIC | NIC/NA | NIC/NA | NIC/NA |

Grading: ¼ point for each correct Next State, ¼ point for each correct component of each Action. Each “Not Possible” carries ½ point.
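The Part B solution table can also be written as a lookup table, which makes it easy to query how cache i reacts to a transaction from processor j. The following C sketch encodes the table as given; the type and identifier names (`SnoopEntry`, `snoop`, the `ST_`, `TXN_` and `ACT_` prefixes) are illustrative, and the table entries printed as PULLSL are read as PULLSL1, which is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

enum State  { ST_CE, ST_CS, ST_DE, ST_NIC };   /* state of the block in cache i */
enum BusTxn { TXN_BR, TXN_BU, TXN_BRU };       /* transaction issued by proc j  */
enum Action { ACT_NA = 0, ACT_PULLSL1 = 1, ACT_UPDL = 2, ACT_PROVL = 4 };

struct SnoopEntry {
    bool       possible;  /* false for the "Not possible" entries */
    enum State next;      /* next state in cache i                */
    unsigned   actions;   /* bitwise OR of enum Action values     */
};

/* Rows: state in proc i. Columns: BR, BU, BRU by proc j. */
static const struct SnoopEntry snoop_table[4][3] = {
    /* CE  */ { {true, ST_CS, ACT_PULLSL1},
                {false, ST_CE, ACT_NA},
                {true, ST_CS, ACT_PULLSL1 | ACT_UPDL} },
    /* CS  */ { {true, ST_CS, ACT_PULLSL1},
                {true, ST_CS, ACT_UPDL | ACT_PULLSL1},
                {true, ST_CS, ACT_UPDL | ACT_PULLSL1} },
    /* DE  */ { {true, ST_CS, ACT_PROVL | ACT_PULLSL1},
                {false, ST_DE, ACT_NA},
                {true, ST_CS, ACT_PROVL | ACT_UPDL | ACT_PULLSL1} },
    /* NIC */ { {true, ST_NIC, ACT_NA},
                {true, ST_NIC, ACT_NA},
                {true, ST_NIC, ACT_NA} },
};

/* Look up cache i's response to a bus transaction from another processor. */
struct SnoopEntry snoop(enum State state, enum BusTxn txn) {
    assert(state <= ST_NIC && txn <= TXN_BRU);
    return snoop_table[state][txn];
}
```

For example, `snoop(ST_DE, TXN_BR)` yields next state `ST_CS` with actions `ACT_PROVL | ACT_PULLSL1`, matching the DE row of the table.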

Question 5 [8 Points]

This problem involves implementing a stack using an array in a multiprocessor system. The elements of the array can be accessed in parallel by multiple processors. You are to write the following two functions:

  • Push : This will add an element to the top of the stack.
  • Pop: This will delete an element from the top of the stack.

Assume that Push is never called on a full stack and Pop is never called on an empty stack (i.e., you do not have to worry about overflow and underflow conditions).

Write the Push and Pop functions using an atomic test&set instruction to achieve synchronization. Add C-like pseudocode to the following stub (complete the incomplete statements as well):

int top; /* index for the top of the stack */
int index; /* current index for adding or deleting an element */
Lock lock_var; /* Lock variable for synchronization */

Push (item) {

index =

stack[index] = item;

}

Pop (void) {

index =

item = stack[index];

return item;

}
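One way to fill in the stub is sketched below in C11, using `atomic_flag_test_and_set` from `<stdatomic.h>` as the test&set primitive. This is not the exam's official solution: it conservatively holds the lock across the array access, uses a local `index` rather than the shared one, and assumes a hypothetical `STACK_CAPACITY` and `int` elements, none of which the exam specifies.

```c
#include <assert.h>
#include <stdatomic.h>

#define STACK_CAPACITY 64                 /* hypothetical; the exam gives no size */

static int stack[STACK_CAPACITY];
static int top = 0;                       /* index of the next free slot */
static atomic_flag lock_var = ATOMIC_FLAG_INIT;

/* Spin until test&set returns the old value "clear", i.e. we took the lock. */
static void acquire(atomic_flag *lock) {
    while (atomic_flag_test_and_set(lock)) {
        /* busy-wait */
    }
}

static void release(atomic_flag *lock) {
    atomic_flag_clear(lock);
}

void Push(int item) {
    acquire(&lock_var);
    assert(top < STACK_CAPACITY);         /* exam guarantees no overflow */
    int index = top;                      /* claim the current top slot  */
    top = top + 1;
    stack[index] = item;
    release(&lock_var);
}

int Pop(void) {
    acquire(&lock_var);
    assert(top > 0);                      /* exam guarantees no underflow */
    top = top - 1;
    int index = top;                      /* slot holding the newest item */
    int item = stack[index];
    release(&lock_var);
    return item;
}
```

Keeping the array access inside the critical section trades some parallelism for simplicity: a Pop can never read a slot that a concurrent Push has claimed but not yet written.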

Question 6 [6 points]

What criteria would you use to compare different RAID architectures? Use these criteria to compare RAID level 3 and RAID level 5. Be sure to say which is better with respect to each criterion and why.

Solution: Comparison criteria for RAID levels include the following: disk space overhead for fault tolerance (redundancy), read performance, write performance, and level of fault tolerance.

Both RAID levels can tolerate one disk crash since they both use parity. The remaining parity data and actual data can be used to reconstruct the contents of the failed disk. Thus, the fault tolerance level is the same for both.
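The reconstruction step can be illustrated with a short C sketch, assuming the usual XOR parity: the parity block is the XOR of the data blocks, so XOR-ing the parity with the surviving blocks recovers the lost one. The sizes `NDISKS` and `BLOCK_LEN` and the function names are hypothetical, and the sketch ignores how RAID 3 and RAID 5 lay the blocks out across disks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NDISKS    4   /* hypothetical number of data disks */
#define BLOCK_LEN 8   /* hypothetical bytes per block      */

/* parity[i] is the XOR of byte i across all data disks. */
void compute_parity(uint8_t data[NDISKS][BLOCK_LEN], uint8_t parity[BLOCK_LEN]) {
    for (size_t i = 0; i < BLOCK_LEN; i++) {
        parity[i] = 0;
        for (size_t d = 0; d < NDISKS; d++) {
            parity[i] ^= data[d][i];
        }
    }
}

/* Rebuild the block of the failed disk `lost` from the parity and the
 * surviving data disks; the contents of data[lost] are never read. */
void reconstruct(uint8_t data[NDISKS][BLOCK_LEN], const uint8_t parity[BLOCK_LEN],
                 size_t lost, uint8_t out[BLOCK_LEN]) {
    assert(lost < NDISKS);
    for (size_t i = 0; i < BLOCK_LEN; i++) {
        out[i] = parity[i];
        for (size_t d = 0; d < NDISKS; d++) {
            if (d != lost) {
                out[i] ^= data[d][i];
            }
        }
    }
}
```

Only one block can be recovered this way, which is why a single parity block tolerates exactly one disk failure.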

Both levels use the same number of parity bits. So they are both the same in terms of disk space overhead as well.

Read performance in RAID level 5 is better than in RAID level 3 because RAID level 5 is block interleaved while RAID level 3 is bit interleaved. This means RAID level 3 can do only one read I/O at a time while RAID level 5 can do multiple block I/Os (note that it does not have to check the parity disk on reads unless there is an error – the presence of an error is detected through information on the data disk itself).

Write performance in RAID level 5 is also better for the same reason as read performance, and since the parity bits are distributed on different disks (ensuring that the parity update is not a bottleneck either).

Note that the read and write performance improvement assumes that the interleaving in RAID 5 is not in conflict with the access pattern. It is also worth noting that writes on RAID level 5 incur more overhead than reads because they have to compute the new parity. [Neither of these notes is required for full credit.]

Grading: 3 points for listing 3 criteria. 3 points for correct comparison (and justification) using each criterion listed.