







A solution to calculate the average memory access time (amat) with and without a victim cache, and the overall cpi for a given cache configuration. It includes formulas for instruction access stalls, data read miss stalls, and data write miss stalls.








Please clearly print your full name, NetID and circle the appropriate category in the space provided below. Failure to completely fill out this table will result in a ZERO grade.
Name NetID Category (circle one) 3 Credit Hours 4 Credit Hours
Instructions
Problem   Maximum Points   Received Points
1         6
2         5
3         11
4         14
5         8
6         6
Total     50
Question 1 [6 points]
A four-entry victim cache for a 4KB direct-mapped cache removes 85% of the conflict misses in a program. Without the victim cache, the miss rate is 0.06 and 67% of these misses are conflict misses. What is the percentage improvement in the AMAT (average memory access time) due to the victim cache? Assume a hit in the main cache takes 1 cycle. For a miss in the main cache that hits in the victim cache, assume an additional penalty of 1 cycle to access the victim cache. For a miss in the main and victim caches, assume a further penalty of 48 cycles to get the data from memory. Assume a simple, single-issue, 5-stage pipeline, in-order processor that blocks on every read and write until it completes.
Solution: AMAT = Hit time + Miss Rate x Miss Penalty
Without the victim cache: AMAT = 1 + 0.06*48 = 3.88 cycles
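With the victim cache: a fraction 0.67 * 0.85 = 0.5695 of the main cache misses hit in the victim cache and cost 1 cycle; the remaining 0.4305 miss in both caches and cost 1 + 48 = 49 cycles.
AMAT = 1 + 0.06 * {(0.67 * 0.85 * 1) + ((1 - (0.67 * 0.85)) * 49)} = 2.29984 cycles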
Improvement: (3.88 – 2.29984)/3.88 = 40.73%
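The arithmetic above can be checked with a minimal C sketch. The function and parameter names are illustrative assumptions, not part of the exam; the formulas are the ones used in the solution.

```c
/* AMAT without a victim cache: hit time + miss rate * miss penalty. */
double amat_no_victim(double hit_time, double miss_rate, double mem_penalty) {
    return hit_time + miss_rate * mem_penalty;
}

/* AMAT with a victim cache.
 * conflict_frac:  fraction of main cache misses that are conflict misses
 * removed_frac:   fraction of conflict misses that hit in the victim cache
 * victim_penalty: extra cycles to probe the victim cache
 * mem_penalty:    further cycles to fetch from memory on a victim miss */
double amat_victim(double hit_time, double miss_rate, double conflict_frac,
                   double removed_frac, double victim_penalty,
                   double mem_penalty) {
    double victim_hit = conflict_frac * removed_frac;
    double per_miss = victim_hit * victim_penalty
                    + (1.0 - victim_hit) * (victim_penalty + mem_penalty);
    return hit_time + miss_rate * per_miss;
}

/* Relative improvement, in percent, going from `before` to `after`. */
double improvement_percent(double before, double after) {
    return 100.0 * (before - after) / before;
}
```

With the question's numbers, `amat_no_victim(1, 0.06, 48)` gives 3.88 cycles and `amat_victim(1, 0.06, 0.67, 0.85, 1, 48)` gives 2.29984 cycles.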
Grading: 1 point for the original AMAT formula, 4 points for the victim cache AMAT formula, 1 point for the percent improvement formula. For the victim cache AMAT: 1 point for determining the correct rates for victim cache hits and misses, 1 point for assigning the correct penalty to victim cache hits, 1 point for assigning the correct penalty to victim cache misses, and 1 point for putting it all together correctly.
Question 3 [11 points]
Consider a machine M with a simple, 5-stage, single-issue, in-order pipeline that blocks on loads until the requested data is received and blocks on stores until the data is stored in the cache. Consider a single level of separate data and instruction caches and the following characteristics:
Part A [4 points]
What is the CPI for the machine M? Be sure to show your work.
Solution: We have the following formula for the overall CPI:
CPI = base CPI + instruction access stalls + data read miss stalls + data write miss stalls
We can calculate the individual values as follows. First, cache miss penalty for x bytes = memory access latency + x bytes / data receive rate. For 32 bytes, we have cache miss penalty = 40 cycles + 32 bytes / 4 bytes/cycle = 48 cycles.
We now have the following formulas:
Instruction access stalls = instruction cache miss rate * instruction cache miss penalty
A data cache miss that replaces a dirty block costs two miss penalties (one to write the dirty block back and one to fetch the new block); a miss that replaces a clean block costs one.
Data read miss stalls = data cache miss rate * percent load instructions * total load/store percent * ((2 * cache miss penalty) * percent data cache misses with dirty block + cache miss penalty * (1 – percent data cache misses with dirty block))
Data write miss stalls = data cache miss rate * percent store instructions * total load/store percent * ((2 * cache miss penalty) * percent data cache misses with dirty block + cache miss penalty * (1 – percent data cache misses with dirty block))
Applying these formulas, we find the following values:
Instruction access stalls = 0.03 * 48 cycles = 1.44 cycles
Data read miss stalls = 0.05 * 0.75 * 0.32 * ((48 cycles * 2) * 0.4 + 48 cycles * 0.6) = 0.8064 cycles
Data write miss stalls = 0.05 * 0.25 * 0.32 * ((48 cycles * 2) * 0.4 + 48 cycles * 0.6) = 0.2688 cycles
Thus, we have the following overall CPI using the CPI formula given above:
CPI = 1.5 + 1.44 + 0.8064 + 0.2688 = 4.0152
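As a cross-check, the formulas of this solution can be sketched in C. The function and parameter names are illustrative assumptions; the numeric inputs are the ones that appear in the solution text, since the table of machine characteristics is not reproduced here.

```c
/* Miss penalty in cycles: memory latency plus block transfer time. */
double miss_penalty(double latency_cycles, double block_bytes,
                    double bytes_per_cycle) {
    return latency_cycles + block_bytes / bytes_per_cycle;
}

/* Stall cycles per instruction for loads (or stores) that miss in the
 * data cache. A miss on a dirty block costs two penalties, a miss on a
 * clean block costs one. op_frac is the load (or store) share of all
 * load/store instructions; ls_frac is the load/store share of all
 * instructions. */
double data_miss_stalls(double dmiss_rate, double op_frac, double ls_frac,
                        double dirty_frac, double penalty) {
    double per_miss = 2.0 * penalty * dirty_frac
                    + penalty * (1.0 - dirty_frac);
    return dmiss_rate * op_frac * ls_frac * per_miss;
}

/* Overall CPI = base CPI + instruction stalls + read stalls + write stalls. */
double cpi_write_back(double base_cpi, double imiss_rate, double dmiss_rate,
                      double load_frac, double ls_frac, double dirty_frac,
                      double penalty) {
    double instr  = imiss_rate * penalty;
    double reads  = data_miss_stalls(dmiss_rate, load_frac, ls_frac,
                                     dirty_frac, penalty);
    double writes = data_miss_stalls(dmiss_rate, 1.0 - load_frac, ls_frac,
                                     dirty_frac, penalty);
    return base_cpi + instr + reads + writes;
}
```

Calling `cpi_write_back(1.5, 0.03, 0.05, 0.75, 0.32, 0.4, miss_penalty(40, 32, 4))` evaluates 1.5 + 1.44 + 0.8064 + 0.2688 = 4.0152.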
Grading: 1 point for correct CPI equation. 1 point for each correct subequation
We have the following formula for the overall CPI:
CPI = base CPI + instruction access stalls + data read miss stalls + data write miss stalls
We can calculate the individual values as follows. First, cache miss penalty for x bytes = memory access latency + x bytes / data receive rate. For 32 bytes, we have cache miss penalty = 40 cycles + 32 bytes / 4 bytes/cycle = 48 cycles.
We now have the following formulas:
Instruction access stalls = instruction cache miss rate * instruction cache miss penalty
Data read miss stalls = data cache miss rate * percent load instructions * total load/store percent * cache miss penalty
Data write miss stalls = percent store instructions * total load/store percent * percent buffer full * buffer stall time
Applying these formulas, we find the following values:
Instruction access stalls = 0.06 * 48 cycles = 2.88 cycles
Data read miss stalls = 0.06 * 0.75 * 0.32 * 48 cycles = 0.6912 cycles
Data write miss stalls = 0.25 * 0.32 * y * z cycles = 0.08 yz cycles
Thus, we have the following overall CPI using the CPI formula given above:
CPI = 1.5 + 2.88 + 0.6912 + 0.08 yz = 5.0712 + 0.08 yz
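The write-buffer variant of the CPI formula can be sketched the same way. The names below are illustrative assumptions; `buffer_full_frac` and `buffer_stall` stand for the unknowns y and z in the solution, and the values passed for them in the usage note are hypothetical.

```c
/* CPI when stores go through a write buffer: stores stall only when the
 * buffer is full, loads stall for one miss penalty on a data cache miss. */
double cpi_write_buffer(double base_cpi, double imiss_rate, double dmiss_rate,
                        double load_frac, double ls_frac, double penalty,
                        double buffer_full_frac, /* y in the solution */
                        double buffer_stall)     /* z in the solution */ {
    double instr  = imiss_rate * penalty;
    double reads  = dmiss_rate * load_frac * ls_frac * penalty;
    double writes = (1.0 - load_frac) * ls_frac
                  * buffer_full_frac * buffer_stall;
    return base_cpi + instr + reads + writes;
}
```

With y = 0 the result is the constant term 5.0712; with a hypothetical y = 0.5 and z = 10 cycles, the write term adds 0.08 * 5 = 0.4 for a CPI of 5.4712.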
Grading: 3 points for a correct set of assumptions. Either assumption may receive full credit. 1 point each for the instruction access stalls and data read miss stalls formulas and 2 points for the data write miss stalls formula.
Question 4 [14 points]
This question concerns a snooping update (as opposed to invalidate) cache coherence protocol. Consider a system where the processors are connected by a bus, the caches are write-back and write-allocate and cache coherence is maintained through a snooping update protocol. In a snooping update protocol, when a cache modifies its data, it broadcasts the updated data on a bus using a bus update transaction, if necessary. Memory and all caches that have a copy of that data then update their own copies. This is in contrast to the invalidation protocol discussed in class where a cache invalidates its copy in response to another processor’s write request to a block.
Our update protocol has three states – CE, CS and DE:
The bus has a special line called the Shared Line (SL) whose state is usually 0. When cache i performs a bus transaction for a cache block, all caches that have the same block pull up the Shared Line (SL) to 1. The SL is only checked when a bus transaction is performed.
Assume that if a request is made to a block for which memory has a clean copy, memory will service that request. If the memory does not have a clean copy, the cache with the updated block will service the request and memory will also get updated.
For the question below, consider the following bus transactions:
Note: you are not required to consider Bus Writeback, which may take place on a replacement.
Fill out the following state transition table for processor i showing the next state for a block in the cache of processor i and any bus transaction performed by processor i.
Each of the entries should be filled out as:
Next State/Bus Transaction (e.g. CS/BR), where
Next State = CS, CE, DE or NIC (Not in Cache; i.e., a cache miss)
Bus Transaction = BR, BU, BRU, or NT (No transaction)
Note: If an entry is not possible (i.e., the system cannot be in such a state) write “Not Possible” in that entry.
Part B [6 points]
Fill out the following state transition table for the cache of processor i showing the next state for a block in the cache of processor i and the action(s) taken by the cache when a bus transaction is initiated by another processor j.
Each of the entries should be filled out as:
Next State/Action (e.g. CS/UPDL), where
Next State = CS, CE, DE or NIC (Not in Cache; i.e., a cache miss)
Action = one or more of:
PULLSL1: Pull SL to 1
UPDL: Update line in cache i (i.e., one’s own cache)
PROVL: Provide line in response to a BR or BRU (main memory is also updated as part of this action)
NA: No Action
Note: If an entry is not possible (i.e., the system cannot be in such a state) write “Not Possible” in that entry.
State in proc i   BR by proc j         BU by proc j         BRU by proc j
CE                CS/PULLSL1           Not possible         CS/PULLSL1, UPDL
CS                CS/PULLSL1           CS/UPDL, PULLSL1     CS/UPDL, PULLSL1
DE                CS/PROVL, PULLSL1    Not possible         CS/PROVL, UPDL, PULLSL1
NIC               NIC/NA               NIC/NA               NIC/NA
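One way to read the Part B table is as a lookup indexed by the local state and the snooped transaction. The C sketch below encodes it that way; the type names, the bit-flag encoding of actions, and the `possible` field are assumptions made for illustration, and entries printed as PULLSL in the table are taken to mean PULLSL1.

```c
typedef enum { CE, CS, DE, NIC } State;   /* state of the block in cache i */
typedef enum { BR, BU, BRU } BusTxn;      /* transaction issued by proc j  */

/* Actions are bit flags so that one entry can combine several of them. */
enum { NA = 0, PULLSL1 = 1, UPDL = 2, PROVL = 4 };

typedef struct {
    int      possible;  /* 0 for "Not possible" entries */
    State    next;      /* next state in cache i        */
    unsigned actions;   /* OR of the action flags above */
} Snoop;

/* Rows follow State order (CE, CS, DE, NIC); columns follow BusTxn order. */
static const Snoop snoop_table[4][3] = {
    /* CE  */ { {1, CS, PULLSL1},         {0, CE, NA},
                {1, CS, PULLSL1 | UPDL} },
    /* CS  */ { {1, CS, PULLSL1},         {1, CS, UPDL | PULLSL1},
                {1, CS, UPDL | PULLSL1} },
    /* DE  */ { {1, CS, PROVL | PULLSL1}, {0, DE, NA},
                {1, CS, PROVL | UPDL | PULLSL1} },
    /* NIC */ { {1, NIC, NA},             {1, NIC, NA},
                {1, NIC, NA} },
};

/* Response of cache i to a transaction snooped on the bus. */
Snoop snoop(State s, BusTxn t) {
    return snoop_table[s][t];
}
```

For example, `snoop(DE, BR)` yields next state CS with actions `PROVL | PULLSL1`, and `snoop(CE, BU).possible` is 0.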
Grading: ¼ point for each correct Next State, 1/4 point for each correct component of each Action. Each “Not Possible” carries ½ point.
Question 5 [8 Points]
This problem involves implementing a stack using an array in a multiprocessor system. The elements of the array can be accessed in parallel by multiple processors. You are to write the following two functions:
Assume that Push is never called on a full stack and Pop is never called on an empty stack (i.e., you do not have to worry about overflow and underflow conditions).
Write the Push and Pop functions using an atomic test&set instruction to achieve synchronization. Add C-like pseudocode to the following stub (complete the incomplete statements as well):
int top;        /* index for the top of the stack */
int index;      /* current index for adding or deleting an element */
Lock lock_var;  /* Lock variable for synchronization */

Push (item) {
    index =
    stack[index] = item;
}

Pop (void) {
    index =
    item = stack[index];
    return item;
}
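The exam's own solution to this stub is not reproduced here. The following is only a sketch of one possible answer, written with the C11 `atomic_flag` type as the test&set primitive. It assumes that `top` holds the index of the current top element (so an empty stack has `top == -1`), uses a local `index` instead of the shared one, and picks a hypothetical capacity `STACK_CAPACITY`. It conservatively keeps the array access inside the critical section, so a Pop can never read a slot that a concurrent Push has reserved but not yet written.

```c
#include <stdatomic.h>

#define STACK_CAPACITY 64            /* hypothetical size, not from the exam */

static int stack[STACK_CAPACITY];
static int top = -1;                 /* assumption: index of the top element */
static atomic_flag lock_var = ATOMIC_FLAG_INIT;

/* Spin until test&set returns the old value "clear", i.e. we own the lock. */
static void acquire(void) {
    while (atomic_flag_test_and_set(&lock_var)) {
        /* busy-wait */
    }
}

static void release(void) {
    atomic_flag_clear(&lock_var);
}

void Push(int item) {
    acquire();
    int index = top + 1;             /* slot just above the current top */
    stack[index] = item;
    top = index;
    release();
}

int Pop(void) {
    acquire();
    int index = top;                 /* slot holding the current top */
    int item = stack[index];
    top = index - 1;
    release();
    return item;
}
```

Holding the lock only around the update of `top` would allow more parallelism on the array accesses, but it would need an additional mechanism to keep a Pop from reading a reserved slot before the matching Push has stored into it.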
Question 6 [6 points]
What criteria would you use to compare different RAID architectures? Use these criteria to compare RAID level 3 and RAID level 5. Be sure to say which is better with respect to each criterion and why.
Solution: Comparison criteria for RAID levels include the following: disk space overhead for fault tolerance (redundancy), read performance, write performance, and level of fault tolerance.
Both RAID levels can tolerate one disk crash since they both use parity. The remaining parity data and actual data can be used to reconstruct the contents of the failed disk. Thus, the fault tolerance level is the same for both.
Both levels use the same number of parity bits. So they are both the same in terms of disk space overhead as well.
Read performance in RAID level 5 is better than in RAID level 3 because RAID level 5 is block interleaved while RAID level 3 is bit interleaved. This means RAID level 3 can do only one read I/O at a time while RAID level 5 can do multiple block I/Os (note that it does not have to check the parity disk on reads unless there is an error – the presence of an error is detected through information on the data disk itself).
Write performance in RAID level 5 is also better, both for the same reason as read performance and because the parity bits are distributed across different disks, so the parity update is not a bottleneck either.
Note that the read and write performance improvement assumes that the interleaving in RAID 5 is not in conflict with the access pattern. It is also worth noting that writes on RAID level 5 incur more overhead than reads because they have to compute the new parity. [Neither of these notes is required for full credit.]
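The parity mechanics mentioned above (reconstruction after a disk failure, and the extra parity computation on writes) can be illustrated with a small C sketch. It models blocks as byte arrays and uses XOR parity; the function names are illustrative assumptions and the code is not tied to any particular RAID implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* XOR `src` into `acc`. Starting from a zeroed buffer and folding in every
 * data block gives the parity block; folding the surviving data blocks
 * into a copy of the parity block reconstructs the one missing block. */
void xor_into(uint8_t *acc, const uint8_t *src, size_t len) {
    for (size_t i = 0; i < len; i++) {
        acc[i] ^= src[i];
    }
}

/* Small-write parity update: new parity = old parity ^ old data ^ new data.
 * The old data and old parity must be read first, which is the extra
 * overhead writes pay compared with reads. */
void update_parity(uint8_t *parity, const uint8_t *old_data,
                   const uint8_t *new_data, size_t len) {
    for (size_t i = 0; i < len; i++) {
        parity[i] ^= old_data[i] ^ new_data[i];
    }
}
```

In this model a block write touches only the disk holding the data block and the disk holding its parity block, which is why distributing parity across disks lets independent writes proceed in parallel.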
Grading: 3 points for listing 3 criteria. 3 points for correct comparison (and justification) using each criterion listed.