








A lab assignment for the ECE 4750 Computer Architecture course at Cornell University. The lab focuses on designing two finite-state-machine cache microarchitectures, implementing them, and comparing their performance. The lab aims to give students experience with memory system design, complex FSM cache controllers, microarchitectural techniques, and design principles. The lab logistics document provides more information about the expectations for all lab assignments and how they will be assessed.









School of Electrical and Computer Engineering Cornell University
revision: 2022-11-02-00-
In this lab, you will design two finite-state-machine (FSM) cache microarchitectures, which we will eventually compose with the processor designs you developed in the previous lab. The baseline design is a direct-mapped, write-back, write-allocate cache, and the alternative design is a two-way set associative cache that should reduce the miss rate by avoiding conflict misses. You are required to implement the baseline and alternative designs, verify the designs using an effective testing strategy, and perform an evaluation comparing the two implementations. The milestone for this lab is to complete as many transactions as you can for the baseline design along with an initial set of directed tests. You should consult the lab logistics document for more information about the expectations for all lab assignments and how they will be assessed.
This lab is designed to give you experience with:
This handout assumes that you have read and understood the course tutorials. You should have already used the ece4750-lab-admin script to create or join a GitHub group. To get started, log in to an ecelinux server, source the setup script, and clone your lab group’s remote repository from GitHub:
% source setup-ece4750.sh
% mkdir -p ${HOME}/ece
% cd ${HOME}/ece
% git clone [email protected]:cornell-ece4750/lab-groupXX
where XX is your group number. You should never fork your lab group’s remote repository! If you need to work in isolation then use a branch within your lab group’s remote repository. If you have already cloned your lab group’s remote repository, then use git pull to ensure you have any recent updates before running all of the tests. You can run all of the tests in the lab like this:
% cd ${HOME}/ece4750/lab-groupXX
% git pull --rebase
% mkdir -p sim/build
% cd sim/build
% pytest ../lab3_mem
All of the tests for the provided functional-level model should pass, while the tests for the baseline and alternative cache designs should fail. For this lab, you will be working in the lab3_mem subproject which includes the following files:
Accessing main memory can require hundreds of cycles, but cache memories can significantly reduce the average memory access latency for well-structured address patterns. Caches are faster than main memory because they are smaller and are located close to the processor; but because a cache can only hold a subset of all memory locations at any one time, we must carefully manage what data we keep in the cache. A cache hit occurs when the data we are requesting is already in the cache, while a cache miss occurs when the data we are requesting is not in the cache and thus requires accessing main memory. Caches exploit spatial and temporal locality to increase the number of cache hits. In an address pattern with significant spatial locality, if we access a given address then in the near future, we are likely to access an address close to the first address. In an address pattern with significant temporal locality, if we access a given address then in the near future, we are likely to access that same address again. In this lab, you will implement and evaluate two cache microarchitectures that organize cache lines in two different ways: (1) direct-mapped where every cache line can only be placed in a single location in the cache, and (2) two-way set-associative where every cache line can be placed in one of two locations in the cache. Both caches will use a write-back, write-allocate policy for handling write misses. Additionally, both caches will have the ability to act as a bank in a larger multi-bank cache organization.
We have provided you with a functional-level model of a cache, which essentially just passes all cache requests through to the memory interface, and passes all memory responses through to the cache response interface. While this might not seem useful, the functional-level model will enable us to develop many of our test cases with the test memory before attempting to use these tests with the baseline and alternative designs.
Figure 1 shows a block-level diagram illustrating how the functional-level, baseline, and alternative designs are integrated with a test source, test sink, and test memory for testing and evaluation. We will load data into the test memory before resetting the cache. Once we start the execution, the
Figure 3: Banked Cache Organization – Four-bank cache organization (banks 0 through 3 sit between the cache request/response networks and the memory request/response networks). Highlighted path is for a memory request that hits in bank 0.

Address With No Banking
  bits 31:8    bits 7:4    bits 3:0
  tag          index       offset

Address With Four Banks
  bits 31:10   bits 9:6    bits 5:4    bits 3:0
  tag          index       bank        offset

Figure 4: Memory Address Formats With and Without Banking – When using the baseline or alternative design as a bank in a larger cache, we need to exclude the bank bits when indexing into the cache since the request network will take care of routing messages to the correct bank.
field can be used for implementation-defined behavior. You should always ensure the opaque field is correctly preserved in the response. Note that the memory messages used for the cachereq and cacheresp interfaces are for a single word (i.e., 32-bit data field and 2-bit length field), while the memory messages used for the memreq and memresp interfaces are for an entire cache line (i.e., 128-bit data field and 4-bit length field). If the length field is one then only the least significant byte of the data field (i.e., bits 7-0) is valid. If the length field is two then only the least significant two bytes of the data field (i.e., bits 15-0) are valid. If the length field is zero then all bytes are valid. Note that while the memory message format is quite flexible, our cache designs will only support 4-byte cache requests and 16-byte memory requests. The data field can contain an arbitrary value in a write memory request; however, the data field must contain all zeros in a write memory response. This simplifies creating reference responses when testing. We add a two-bit test field to each cacheresp and memresp message. We use the test field in cacheresp for testing. If a cachereq results in a cache miss, we should set the corresponding cacheresp message’s test field to 2’b0. If a cachereq results in a cache hit, we should set the corresponding cacheresp message’s test field to 2’b1. By checking the test field, the test harness can verify whether a cache transaction is a hit or a miss.
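The length-field encoding described above can be sketched in a few lines of Python. This is only an illustration of the encoding, not part of the provided harness; the helper names valid_byte_count and valid_data are hypothetical.

```python
def valid_byte_count(len_field, data_nbits):
    """Number of valid bytes in the data field.

    A length of zero means "all bytes are valid"; any other value is the
    number of valid least-significant bytes.
    """
    total_bytes = data_nbits // 8
    return total_bytes if len_field == 0 else len_field


def valid_data(data, len_field, data_nbits):
    """Mask off the bytes of the data field that the length field marks invalid."""
    nbytes = valid_byte_count(len_field, data_nbits)
    return data & ((1 << (8 * nbytes)) - 1)
```

For a 32-bit cachereq data field, a length of one keeps bits 7-0 and a length of two keeps bits 15-0; for a 128-bit memreq data field, a length of zero covers all 16 bytes of the cache line.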
[Figure 5 diagram not reproduced. Labeled elements: Tag Array (16x24b) and Data Array (16x128b), each with read_addr, read_en, read_data, write_addr, write_data, and write_en ports (the data array also has write_byte_en); registers cachereq_opaque_reg, cachereq_type_reg, cachereq_addr_reg, cachereq_data_reg, memresp_data_reg, read_data_reg, and evict_addr_reg; mux selects write_data_mux_sel, wben_mux_sel, memreq_addr_mux_sel, and read_data_zero_mux_sel; a tag comparator producing tag_match; and the cachereq_msg, cacheresp_msg, memreq_msg, and memresp_msg field extraction/insertion blocks.]
Figure 5: Baseline Datapath – Direct-mapped, write-back, write-allocate cache with 16-byte cache lines and a total capacity of 256 bytes. repl = replicate 32b four times to create 128b signal; mkaddr = concatenate address as appropriate. Orange blocks represent extracting or inserting fields into either a PyMTL BitStruct or Verilog struct.
One way to increase cache bandwidth is to enable a cache to process multiple transactions at the same time. Figure 3 shows an alternative approach based on a banked cache organization. In a banked cache, we add a request network which directs a cache request to the appropriate bank based on some bits in the address of this cache request. Cache responses are returned over a different response network. Different cache banks can potentially execute different transactions at the same time, and this increases the overall throughput of the system. The cache you design in this lab will be used both on its own (i.e., in a single-bank configuration) as well as in a four-bank configuration. Figure 4 illustrates which bits are used by the request network to direct a given cache request to the appropriate cache bank in a four-bank configuration. For example, if bits 4 and 5 of the cache request address are zero then the cache request is sent to bank zero, and if bits 4 and 5 of the cache request address are one then the cache request is sent to bank three. For a single-bank design to support its use in a banked cache organization, we need to exclude the bank bits when indexing into the cache bank. In other words, if we consider all four banks holistically as a single “banked cache”, a fourth of the sets of the banked cache are in each bank. Note you cannot completely ignore the bank bits since you will need them when generating a cache line address for refills. Your design does not need to support an arbitrary number of banks, only single-bank and four-bank configurations.
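The two address formats in Figure 4 can be sketched as a small Python helper. This is a behavioral illustration for the baseline cache's 4-bit index only; the names AddrFields, split_addr, and line_addr are hypothetical and do not come from the lab code.

```python
from collections import namedtuple

# Hypothetical container for the fields shown in Figure 4.
AddrFields = namedtuple("AddrFields", "tag index bank offset")


def split_addr(addr, num_banks=1):
    """Split a 32-bit address into tag/index/bank/offset fields.

    Only the single-bank and four-bank configurations are handled,
    matching the two formats in Figure 4.
    """
    if num_banks not in (1, 4):
        raise ValueError("only 1 or 4 banks are supported")
    bank_bits = 0 if num_banks == 1 else 2
    offset = addr & 0xF                          # bits 3:0
    bank = (addr >> 4) & (num_banks - 1)         # bits 5:4 when banked, else 0
    index = (addr >> (4 + bank_bits)) & 0xF      # bits 7:4 or bits 9:6
    tag = (addr & 0xFFFFFFFF) >> (8 + bank_bits) # bits 31:8 or bits 31:10
    return AddrFields(tag, index, bank, offset)


def line_addr(addr):
    """Cache line address for a refill: keep addr[31:4] (bank bits included)."""
    return addr & 0xFFFFFFF0
```

Note that line_addr keeps the bank bits, which is why they cannot be discarded entirely even though they are skipped when indexing.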
The baseline design for this lab is a direct-mapped, write-back, write-allocate cache with a total capacity of 256 bytes, 16 cache lines, and 16 bytes per cache line. As with the earlier labs, we will be decomposing the baseline design into two separate modules: the datapath which has paths for moving data through various arithmetic blocks, muxes, and registers; and the control unit which is in charge of managing the movement of data through the datapath. As in the first lab, the control unit will use an FSM. Because the cache design is significantly more complicated than the first lab, we have decided to place the datapath module, control unit module, and the parent module that connects the datapath and control unit together in three different files.
[Figure 6 diagram not reproduced. States: idle, tag check, init data access, read data access, write data access, wait, evict prepare, evict request, evict wait, refill request, refill wait, refill update. Transition labels: !cachereq_val, init transaction, read hit, write hit, miss & dirty, miss & !dirty, memreq_rdy, !memreq_rdy, memresp_val, !memresp_val, read, write, !cacheresp_rdy.]
Figure 6: Baseline FSM Control Unit
This FSM control unit differs from the basic FSM cache discussed in lecture. This is due to the need to handle the various latency insensitive interfaces, the init transaction, and waiting for eviction responses. The FSM cache microarchitectures used in this lab will have a hit latency of four cycles (e.g., I → TC → RD → W). In the previous lab, the memory access time was usually a single cycle although you also tested your processor with random delays on the memory interface. Assuming we correctly implement the latency insensitive val/rdy microprotocol in both the processor and cache, then there should be no problem composing these two subsystems. Later optimizations can reduce the cache hit latency without the need to modify the processor.
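As a rough behavioral sketch, the state names and transition labels in Figure 6 can be read as the following next-state function. This is one plausible reading of the figure, not the reference control unit: the abbreviations other than I, TC, RD, and W, the function names, and the exact edge structure are assumptions.

```python
from enum import Enum


class State(Enum):
    IDLE = "I"
    TAG_CHECK = "TC"
    INIT_DATA_ACCESS = "IN"   # abbreviation assumed
    READ_DATA_ACCESS = "RD"
    WRITE_DATA_ACCESS = "WD"  # abbreviation assumed
    WAIT = "W"
    EVICT_PREPARE = "EP"      # abbreviation assumed
    EVICT_REQUEST = "ER"      # abbreviation assumed
    EVICT_WAIT = "EW"         # abbreviation assumed
    REFILL_REQUEST = "RR"     # abbreviation assumed
    REFILL_WAIT = "RW"        # abbreviation assumed
    REFILL_UPDATE = "RU"      # abbreviation assumed


def next_state(state, *, cachereq_val=False, cacheresp_rdy=False,
               memreq_rdy=False, memresp_val=False,
               is_init=False, is_read=False, hit=False, dirty=False):
    """Assumed next-state logic; stalls hold the current state."""
    S = State
    if state is S.IDLE:
        return S.TAG_CHECK if cachereq_val else S.IDLE
    if state is S.TAG_CHECK:
        if is_init:
            return S.INIT_DATA_ACCESS
        if hit:
            return S.READ_DATA_ACCESS if is_read else S.WRITE_DATA_ACCESS
        return S.EVICT_PREPARE if dirty else S.REFILL_REQUEST
    if state in (S.INIT_DATA_ACCESS, S.READ_DATA_ACCESS, S.WRITE_DATA_ACCESS):
        return S.WAIT
    if state is S.WAIT:
        return S.IDLE if cacheresp_rdy else S.WAIT
    if state is S.EVICT_PREPARE:
        return S.EVICT_REQUEST
    if state is S.EVICT_REQUEST:
        return S.EVICT_WAIT if memreq_rdy else S.EVICT_REQUEST
    if state is S.EVICT_WAIT:
        return S.REFILL_REQUEST if memresp_val else S.EVICT_WAIT
    if state is S.REFILL_REQUEST:
        return S.REFILL_WAIT if memreq_rdy else S.REFILL_REQUEST
    if state is S.REFILL_WAIT:
        return S.REFILL_UPDATE if memresp_val else S.REFILL_WAIT
    if state is S.REFILL_UPDATE:
        return S.READ_DATA_ACCESS if is_read else S.WRITE_DATA_ACCESS
    raise ValueError(state)


def run(inputs, max_cycles=64):
    """Step from IDLE with fixed inputs until the FSM returns to IDLE."""
    state = State.IDLE
    visited = [state]
    for _ in range(max_cycles):
        state = next_state(state, **inputs)
        if state is State.IDLE:
            break
        visited.append(state)
    return [s.value for s in visited]
```

With every val/rdy handshake asserted, a read hit visits four states (I, TC, RD, W), which matches the four-cycle hit latency described above; a dirty miss additionally passes through the evict and refill states before reaching RD.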
We strongly encourage you to take an incremental design approach using the following steps:
The hit path is the simplest place to start, but in order to hit in the cache, we need valid data in the cache. The init transaction (explained further in Section 4) allows you to write data into the cache without doing a refill from main memory. This kind of transaction is an example of design-for-test since this transaction is only meant to simplify testing and has no real use once we have a working cache model. We recommend implementing the init transaction first, and then using this transaction to implement and test both hit paths. Once the hit paths are working, then you can move onto the more complicated miss paths.
The alternative design for this lab is a two-way set-associative, write-back, write-allocate cache with the same capacity (256 bytes) and cache line size (16 bytes) as the baseline cache. The general FSM for the alternative design will be very similar to that of the baseline design, except that the address map and control signals will likely be different. Note that you will need to split the valid bits into two parts, one for each way and carefully keep track of them. You will need to AND the result of the tag match in each way with the appropriate valid bit to determine if there is a hit or miss. The control unit should use a least-recently-used (LRU) replacement policy to choose between the two ways during eviction. You should track the LRU status with separate bits in the control unit.
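The hit detection and LRU bookkeeping described above can be sketched behaviorally for a single set. This is a Python model for reasoning about the policy, not RTL, and the class and method names are hypothetical.

```python
class TwoWaySet:
    """Behavioral model of one set in a two-way set-associative cache."""

    def __init__(self):
        self.valid = [False, False]  # one valid bit per way
        self.tag = [0, 0]            # one stored tag per way
        self.lru = 0                 # way that would be evicted next

    def access(self, tag):
        """Look up a tag; return (hit, way) and update the LRU bit."""
        # A way hits only if its tag matches AND its valid bit is set.
        match = [self.valid[w] and self.tag[w] == tag for w in (0, 1)]
        hit = match[0] or match[1]
        if hit:
            way = 0 if match[0] else 1
        else:
            way = self.lru           # victim chosen by the LRU policy
            self.valid[way] = True
            self.tag[way] = tag
        self.lru = 1 - way           # the other way is now least recently used
        return hit, way
```

For example, accessing tags A, B, A, C in order gives miss, miss, hit, miss, and the final miss evicts B (way 1) because A was used more recently.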
We provide you with basic directed tests for each of the incremental design steps described in Section 2. Work in a systematic way by implementing and passing each of these groups of tests in order as you add support for specific transactions and FSM states. When finished you must add some additional directed tests for specific corner cases, and you must develop a random testing strategy.
Although you will not need to write as many tests as in the previous lab, the tests for this lab may be more challenging since you will need to carefully craft directed tests that exercise all paths in your datapath and all states and state transitions in your FSM. As with the previous labs, you will want to initially write tests using the functional-level model. Once these tests are working on the functional-level model, you can move on to testing the baseline and alternative designs.
The following commands illustrate how to run all of the tests for the entire project, how to run just the tests for this lab, and how to run just the basic tests we provide on the various designs.
% cd ${HOME}/ece4750/lab-groupXX/sim/build
% pytest ..
% pytest ../lab3_mem
% pytest ../lab3_mem/test/CacheFL_test.py
% pytest ../lab3_mem/test/CacheBase_test.py
% pytest ../lab3_mem/test/CacheAlt_test.py
You will add your directed and random tests to CacheFL_test.py. Since this harness is shared across the functional-level model, the baseline design, and the alternative design, you can write your tests once and reuse them to test all three models. You will be adding more test cases. Do not just make the given test cases larger. A key challenge in writing directed tests for cache memories is that most of the miss path must be working before you can test the hit path. The miss path is significantly more complicated than the hit path, so this lends itself more towards a monolithic design process. Most of the cache must be implemented before we can run our first directed test. We could potentially use white-box ad-hoc testing that directly initializes the tag and data arrays in the cache before starting the test, but these ad-hoc tests are fragile and difficult to develop on the functional-level model.
To address this challenge, we will use a design-for-test (DFT) approach. DFT involves adding features to a design that are solely meant to facilitate test-driven development. In this specific design,
 1 #----------------------------------------------------------------------
 2 # Test Case: read hit path
 3 #----------------------------------------------------------------------
 4 # The test field in the response message: 0 == MISS, 1 == HIT
 5
 6 def read_hit_1word():
 7   return [
 8     #    type  opq   addr       len data                type  opq   test len data
 9     req( 'in', 0x00, base_addr, 0,  0xdeadbeef ), resp( 'in', 0x00, 0,   0,  0          ),
10     req( 'rd', 0x01, base_addr, 0,  0          ), resp( 'rd', 0x01, 1,   0,  0xdeadbeef ),
11   ]
12
13 #-------------------------------------------------------------------------
14 # Test Case: read miss path
15 #-------------------------------------------------------------------------
16
17 def read_miss_1word():
18   return [
19     #    type  opq   addr        len data                type  opq   test len data
20     req( 'rd', 0x00, 0x00000000, 0,  0          ), resp( 'rd', 0x00, 0,   0,  0xdeadbeef ),
21     req( 'rd', 0x01, 0x00000004, 0,  0          ), resp( 'rd', 0x01, 1,   0,  0x00c0ffee ),
22   ]
23
24 # Data to be loaded into memory before running the test
25
26 def read_miss_1word_mem():
27   return [
28     # addr      data
29     0x00000000, 0xdeadbeef,
30     0x00000004, 0x00c0ffee,
31   ]
32
33 #-------------------------------------------------------------------------
34 # Test table for generic test
35 #-------------------------------------------------------------------------
36
37 test_case_table_generic = mk_test_case_table([
38   (                        "msg_func         mem_data_func        stall lat src sink"),
39   [ "read_hit_1word",      read_hit_1word,   None,                0.0,  0,  0,  0 ],
40   [ "read_miss_1word_mem", read_miss_1word,  read_miss_1word_mem, 0.0,  0,  0,  0 ],
41 ])
42
43 @pytest.mark.parametrize( **test_case_table_generic )
44 def test_generic( test_params, cmdline_opts ):
45   run_test( CacheFL(), test_params, cmdline_opts, cmp_wo_test_field )
Figure 7: Writing Directed Tests – Example of how to write tests in the CacheFL_test.py file. We create all test cases in this file and use them to test the FL, baseline RTL, and alternative RTL models.
if we use this comparison function we are effectively ignoring the test field. If we do not specify a comparison function, then the stream sink will check the test field.
As mentioned above, your baseline and alternative designs will have different hit/miss behaviors, so you may need to use different expected response messages for the same sequence of source messages. You should add tests designed specifically for your baseline or alternative design. For example, you should have tests that only hit in a two-way set-associative cache to make sure your alternative design is indeed two-way set-associative. We group the test cases into five test case tables. test_case_table_generic is shown on lines 37–41 in Figure 7 and is used to generically test both the baseline and alternative designs (i.e., tests in this table should have the same expected behavior for both the baseline and alternative design). test_case_table_random is for random testing. test_case_table_dmap and test_case_table_sassoc are used to test only one of the designs. test_case_table_bank should explicitly test a four-bank configuration. We provide examples for most of these kinds of tests.
Make sure the combination of the given tests and your new tests cover all of these scenarios:
Once you have finished writing your directed tests you should move on to writing random tests. You can use the same Python-based random test generation system we used in the first lab. Some suggestions for what you might want to test are listed below. Each of these would probably be a separate test pattern, or potentially multiple test patterns with different random parameters. You don’t necessarily need to do all of these.
Writing random tests for memories can actually be quite challenging. With the first lab, the correct output was trivial to calculate based on the random inputs, but with a memory system the correct output (i.e., the data we expect in a memory read response) depends on the last write to the corresponding address. To write random tests with random address patterns and/or types, you will need to keep track of a “reference memory” in your Python script. This reference memory can just be an array of words. Every time you generate a write request, you should update the reference memory in addition to generating the appropriate write request. Every time you generate a read request, you should consult your reference memory to determine what data we expect to be returned in a memory read response.
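A minimal sketch of the reference-memory idea follows. It uses plain tuples as stand-ins for the req and resp helpers from Figure 7, with fields in the same order; the function name, the 8-bit opaque width, and the choice to leave the test field at zero (on the assumption that a comparison function ignoring the test field is used) are all assumptions for illustration.

```python
import random


def random_rw_messages(num_msgs, base_addr=0x1000, num_words=64, seed=0):
    """Generate random word reads/writes plus their expected responses.

    Returns (msgs, mem_image) where msgs is a list of (request, response)
    tuples and mem_image is a flat [addr, data, addr, data, ...] list to
    preload into the test memory.
    """
    rng = random.Random(seed)
    ref_mem = {}      # reference memory: word address -> last value written
    mem_image = []
    for i in range(num_words):
        addr = base_addr + 4 * i
        data = rng.getrandbits(32)
        ref_mem[addr] = data
        mem_image += [addr, data]

    msgs = []
    for n in range(num_msgs):
        opq = n & 0xFF  # assumed 8-bit opaque field
        addr = base_addr + 4 * rng.randrange(num_words)
        if rng.random() < 0.5:
            data = rng.getrandbits(32)
            ref_mem[addr] = data  # update the reference on every write
            # Write responses carry all-zero data.
            msgs.append((("wr", opq, addr, 0, data), ("wr", opq, 0, 0, 0)))
        else:
            # Read responses return the last value written to this address.
            msgs.append((("rd", opq, addr, 0, 0), ("rd", opq, 0, 0, ref_mem[addr])))
    return msgs, mem_image
```

Seeding the generator keeps a failing random test reproducible, and narrowing num_words increases the chance that reads follow writes to the same address.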
You will almost certainly want to use line tracing to visualize the execution of transactions on your baseline and alternative designs. We have provided some line tracing code for you in the test harness which traces the cache request/response and memory request/response interfaces. Figure 8 illustrates a line trace for the basic test in Figure 7 executing on the baseline design with extra annotations to indicate what the columns mean. The first column shows when memory request messages are sent from the test source into the cache, and the last column shows when memory response messages are sent from the cache back to the test sink. The second column shows the state of the cache. This column is critical to understanding the behavior of your cache, but it is not currently implemented
// loop1 pattern
// a array allocated at 0x1000
for ( i = 0; i < 100; i++ )
  result += a[i];

// loop2 pattern
// a array allocated at 0x1000
// b array allocated at 0x2000
for ( i = 0; i < 100; i++ )
  result += a[i] * b[i];

// loop3 pattern
// a array allocated at 0x
// b array allocated at 0x
// c array allocated at 0x
for ( i = 0; i < 100; i++ )
  result += a[i] * b[i] + c[i];
Figure 10: Evaluation Patterns – Three loops that correspond to the given evaluation patterns.
Once you have verified the functionality of the baseline and alternative designs, you should then use the provided simulator to evaluate these two designs. The simulator delays all responses from the test memory by 20 cycles to model a long main-memory latency. You can run the simulator to see the performance of each cache implementation as follows:
% cd ${HOME}/ece4750/lab-groupXX/sim/build
% ../lab3_mem/mem-sim --impl base --input loop1 --stats
% ../lab3_mem/mem-sim --impl alt --input loop1 --stats
We provide you with three input patterns that capture common access patterns in loops. The C code for each loop that might generate the corresponding pattern is shown in Figure 10. The loop1 pattern simply iterates through an array. The loop2 pattern iterates through two arrays, and the loop3 pattern iterates through three arrays. All three patterns start the array accesses at carefully chosen memory addresses to demonstrate interesting cache behaviors.
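To see why the starting addresses matter, the following sketch models only hits and misses for a loop2-like address stream. It is not the provided simulator: it assumes 4-byte array elements, read-only accesses in the order a[i], b[i], and LRU replacement, and the function names are hypothetical.

```python
def miss_rate(addresses, num_lines=16, line_bytes=16, ways=1):
    """Miss rate of a small LRU cache model over a list of byte addresses."""
    num_sets = num_lines // ways
    # Each set is a list of resident line numbers, least recently used first.
    sets = [[] for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        resident = sets[line % num_sets]
        if line in resident:
            resident.remove(line)    # hit: refreshed below as most recent
        else:
            misses += 1
            if len(resident) == ways:
                resident.pop(0)      # evict the least recently used line
        resident.append(line)
    return misses / len(addresses)


def loop2_addresses(n=100, a_base=0x1000, b_base=0x2000, elem_bytes=4):
    """Interleaved a[i], b[i] reads, using the loop2 base addresses."""
    addrs = []
    for i in range(n):
        addrs.append(a_base + elem_bytes * i)
        addrs.append(b_base + elem_bytes * i)
    return addrs
```

Under these assumptions, a[i] and b[i] always map to the same line of a 256-byte direct-mapped cache and evict each other on every access, whereas a two-way set-associative cache of the same capacity can hold both lines and misses only on the first touch of each line.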
The simulator will display a collection of statistics: number of cycles, number of memory and cache accesses, number of misses, miss rate, and the average memory access latency (AMAL). You should study the line traces (with the --trace option) and possibly the waveforms (with the --dump-vcd option) to understand the reason why each design performs as it does on the various patterns.
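When interpreting the AMAL statistic, the textbook decomposition into hit latency, miss rate, and miss penalty can be a useful cross-check. The sketch below uses that standard formula; how mem-sim itself computes AMAL is not specified here, and the miss penalty value in the example is a hypothetical number, not a measured one.

```python
def amal(hit_latency, miss_rate, miss_penalty):
    """Average memory access latency in cycles (textbook decomposition)."""
    return hit_latency + miss_rate * miss_penalty


# Example: four-cycle hit latency, 25% miss rate, and an assumed
# 24-cycle miss penalty (hypothetical value chosen for illustration).
example = amal(4, 0.25, 24)
```

Comparing such an estimate against the reported AMAL is one way to check whether the miss penalty observed in the line trace agrees with your expectations.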
You must add additional evaluation patterns with different amounts of spatial and temporal locality. We recommend including some patterns that mix reads/writes and random patterns. We recommend adding three more patterns for evaluation. Obviously, these patterns need to be carefully chosen to highlight the differences between the baseline and alternative designs. You will also need to add the name of the new dataset to the mem-sim simulator script.
Writing an access pattern for the simulator is similar to writing a test case. For each pattern you need to write a Python function that returns a list of source messages, a list of sink messages, and a list of memory address-data pairs that will be loaded into the test memory before the simulation. Please keep in mind that the patterns you write for the simulator are not for testing. Instead, you need to fully test your designs using test cases and then use patterns in the simulator script as a way to evaluate your designs.
In lab 4, we will compose the pipelined processor designed in lab 2 (which includes the multiplier designed in lab 1) and the cache memory designed in lab 3 to create a baseline single-core system, and we will compose the pipelined processor, cache memory, and a simple ring network designed in lab 4 to create the alternative multicore system. You will be developing a serial and parallel sorting microbenchmark, and comparing the performance of this microbenchmark across the baseline and alternative designs. You will quickly find that the performance of your alternative multicore system is limited by the hit latency of the cache you designed in this lab. While we could move to a more aggressive pipelined cache microarchitecture, you can achieve much of the same benefit by simply merging states in the FSM control unit used in this lab. Ideally, you would merge enough states to enable a single-cycle hit latency for reads (i.e., a single state for read hits) and a sustained throughput of one read hit per cycle. This requires performing tag check and data access in parallel, and carefully handling the val/rdy signals for the cache request and response interfaces. Writes can potentially use two states to do tag check and data access in sequence, although single-cycle hit latency for writes is still possible if the cache response is sent back in the first state. Reducing the read hit latency is the most critical since this would improve the performance of instruction fetch in your processor. There is no need to wait until lab 4. Students should feel free to start optimizing their cache as part of the alternative design in this lab, or after this lab is submitted.
This lab was created by Shunning Jiang, Shuang Chen, Ian Thompson, Moyang Wang, Christopher Torng, Berkin Ilbeyi, Ackerley Tng, Shreesha Srinath, Christopher Batten, and Ji Kim as part of the course ECE 4750 Computer Architecture at Cornell University.