





















This document compares the performance of loops on out-of-order processors and vector machines. The out-of-order processor characteristics include up to 6 instructions per cycle, a 128-entry ROB, 96-entry unified physical register file, 2 integer ALUs, 2 load/store units, 1 floating-point adder, and 1 floating-point multiplier. The vector machine characteristics include 16 elements per vector register, 4 vector lanes, 1 load/store unit per lane, 1 floating-point adder per lane, and 1 floating-point multiplier per lane. The document also discusses virtual memory management and page table updates in the context of virtualization.






















Question  Topic               Point Value
1         Parallelism         32
2         Virtual Memory      21
3         Branch Prediction   28
4         Cache Coherence     24
5         Memory Consistency  30
6         Synchronization     25
TOTAL                         160
In this problem, we will explore how out-of-order processors, VLIW machines, and vector machines extract parallelism from the following code:
    // Assume that N is large
    for (i = 0; i < N; i++) {
        a = A[i];
        b = B[i];
        C[i] = (a * a) + (b * b);
    }
Problem 1.A: Out-of-Order Execution (8 points)
This loop is translated into the following scalar code:
    loop:
        fld    f1, 0(a0)       # a = A[i]
        fld    f2, 0(a1)       # b = B[i]
        fmul.d f1, f1, f1      # a * a
        fmul.d f2, f2, f2      # b * b
        fadd.d f1, f1, f2      # (a * a) + (b * b)
        fsd    f1, 0(a2)       # C[i] = result
        addi   a0, a0, 8
        addi   a1, a1, 8
        addi   a2, a2, 8
        bltu   a2, a3, loop
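Consider an out-of-order processor with the following characteristics: up to 6 instructions per cycle, a 128-entry ROB, a 96-entry unified physical register file, 2 integer ALUs, 2 load/store units, 1 floating-point adder, and 1 floating-point multiplier.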
Note: Not all rows may be needed.
Steady-state schedule (functional units: ALU0, ALU1, MEM0, MEM1, FADD, FMUL):

Cycle 1:
    ALU0: addi a0,a0, [iter i+4]
    ALU1: addi a2,a2, [iter i+4]
    MEM0: fld f1,0(a0) [iter i+4]
    MEM1: fld f2,0(a1) [iter i+4]
    FADD: fadd.d f5,f3,f [iter i+1]
    FMUL: fmul.d f3,f1,f [iter i+3]

Cycle 2:
    ALU0: addi a1,a1, [iter i+4]
    ALU1: bltu a2,a3,loop [iter i+4]
    MEM0: fsd f5,-40(15) [iter i]
    FMUL: fmul.d f4,f2,f [iter i+3]
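FLOPs per cycle: 3/2. Each iteration performs 3 FLOPs (two multiplies and one add), and in steady state one iteration completes every 2 cycles.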
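The 3/2 figure can be cross-checked with a simple resource-bound calculation. The sketch below assumes that steady-state throughput is limited only by functional-unit occupancy and issue width, with every unit fully pipelined and accepting one operation per cycle. The names `OPS_PER_ITERATION`, `UNITS`, and `flops_per_cycle` are illustrative and not part of the exam.

```python
from fractions import Fraction

# Operations per loop iteration, grouped by the unit class that executes them.
# ALU: 3 addi + 1 bltu; MEM: 2 fld + 1 fsd; FADD: 1 fadd.d; FMUL: 2 fmul.d.
OPS_PER_ITERATION = {"ALU": 4, "MEM": 3, "FADD": 1, "FMUL": 2}
UNITS = {"ALU": 2, "MEM": 2, "FADD": 1, "FMUL": 1}
ISSUE_WIDTH = 6
FLOPS_PER_ITERATION = 3  # two multiplies and one add


def cycles_per_iteration(ops, units, issue_width):
    """Lower bound on cycles per iteration from unit occupancy and issue width."""
    unit_bound = max(Fraction(ops[kind], units[kind]) for kind in ops)
    issue_bound = Fraction(sum(ops.values()), issue_width)
    return max(unit_bound, issue_bound)


def flops_per_cycle(ops=OPS_PER_ITERATION, units=UNITS, issue_width=ISSUE_WIDTH):
    """Steady-state FLOPs per cycle under the occupancy-only assumption."""
    return FLOPS_PER_ITERATION / cycles_per_iteration(ops, units, issue_width)


print(flops_per_cycle())  # 3/2
```

Under these assumptions, both the ALUs (4 operations on 2 units) and the single multiplier (2 operations on 1 unit) need 2 cycles per iteration, so the bound is 2 cycles. Calling `flops_per_cycle(units={**UNITS, "FMUL": 2})` still returns 3/2 because the ALUs remain the limit.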
The VLIW machine has the same mix of functional units as the OoO processor from Part 1.A, and so static scheduling should yield the same performance as dynamic scheduling under these conditions.
Problem 1.C: Vector Machines (8 points)
The loop is translated into the following vector assembly code:
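    loop:
        vsetvli  t0, a3, e
        vle.v    v0, 0(a0)       # load A
        vfmul.vv v1, v0, v0      # a * a
        vle.v    v2, 0(a1)       # load B
        vfmul.vv v3, v2, v2      # b * b
        vfadd.vv v4, v1, v3      # (a * a) + (b * b)
        vse.v    v4, 0(a2)       # store C
        sub      a3, a3, t0      # elements remaining
        slli     t0, t0, 3       # elements processed -> bytes
        add      a0, a0, t0
        add      a1, a1, t0
        add      a2, a2, t0
        bnez     a3, loop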
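For this question, consider a vector machine with the following characteristics: 16 elements per vector register, 4 vector lanes, 1 load/store unit per lane, 1 floating-point adder per lane, and 1 floating-point multiplier per lane.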
Problem 1.D: Impact on Performance (4+4 points)
For each of the processors (OoO, VLIW, vector) introduced in Parts 1.A to 1.C, discuss how the following hardware changes would impact performance on the given loop. Assume that all other design parameters remain unchanged.
(i) Doubling the register file size (i.e., doubling the size of the unified physical register file for OoO, doubling the number of architectural registers for VLIW, doubling the length of the vector registers). Does taking advantage of the expanded register file capacity require changing the code?
OoO: FLOPs/cycle is unchanged; performance is bottlenecked by structural hazards with the functional units, not lack of registers. Increasing the number of physical registers beyond the number of ROB entries also serves no useful purpose. No code changes are needed since renaming automatically exploits the larger physical register file.
VLIW: FLOPs/cycle is unchanged for the same reasons as the OoO case. The code must be rewritten to use the extra architectural registers now provided by the ISA.
Vector: FLOPs/cycle improves if chaining is used. Chaining longer vectors increases the proportion of cycles in which the adder and multiplier are simultaneously utilized. No code changes are needed if the code is written to be vector-length-agnostic.
(ii) Adding another floating-point multiplier. Does taking advantage of this new functional unit require changing the code?
OoO: FLOPs/cycle is unchanged. Performance is still bottlenecked to the same degree by structural hazards with the ALUs and load/store units. No code changes are needed since the issue logic can automatically exploit the extra multiplier.
VLIW: FLOPs/cycle is unchanged for the same reasons as the OoO case. (Some improvement is possible if the instruction scheduling is initially suboptimal.) The code must be rewritten to use the extra VLIW slot.
Vector: FLOPs/cycle is unchanged. The lack of a second load unit in each lane prevents the two multipliers from being utilized simultaneously. No code changes are needed.
A virtual machine monitor (VMM) runs several guest OSs on a single host machine. The guest OSs run in user (unprivileged) mode, whereas the VMM runs in supervisor (privileged) mode. The OS in each guest virtual machine manages its own set of page tables, which reflect the mapping of the guest virtual address space to the guest physical address space (“virtual-to-real”). The guest physical addresses must then be mapped to host physical addresses.
To reuse the hardware TLB, the VMM maintains a set of shadow page tables that map directly from the guest virtual address space to the host physical address space (“virtual-to-physical”). When running the guest OS in user mode, the VMM sets the hardware page table base pointer to point to the shadow page table. The TLB works as if there were no virtualization.
Problem 2.A: TLB Miss Latency (3 points)
Suppose that the guest and host machines both use three-level page tables. The host has a hardware-refilled TLB. When running the guest OS, what is the TLB miss latency if the TLB access takes 1 cycle and the memory latency is 50 cycles per access?
1 + 50 + 50 + 50 = 151 cycles
A hardware TLB refill involves a page table walk of the shadow page tables in memory, same as if there were no virtualization in effect. Note that the guest page tables are not directly involved.
The TLB miss latency must also include the cost of the initial TLB lookup (1 cycle).
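The arithmetic generalizes to any number of page-table levels. A minimal sketch, assuming one memory access per level of the shadow page table and no page-walk caching (the function name is illustrative):

```python
def tlb_miss_latency(levels, tlb_cycles=1, mem_cycles=50):
    """Cycles for a hardware-refilled TLB miss: one lookup plus one access per level."""
    return tlb_cycles + levels * mem_cycles


print(tlb_miss_latency(3))  # 151
```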
Problem 2.D: Page Table Updates (5 points)
From the perspective of the guest OS, the guest page tables live in guest physical memory. How does the VMM ensure that the shadow page table is updated when the corresponding guest page table is modified by the guest OS?
In the shadow page table, the VMM write-protects the pages holding the guest page tables so that any attempt to modify the guest page tables causes a trap into the VMM. On a trap, the VMM mirrors the changes to the guest page table entry in the shadow page table. The VMM must translate the guest physical addresses presented by the guest OS to host physical addresses; this typically requires the VMM to maintain a separate table with the real-to-physical mappings.
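The trap-and-mirror mechanism can be sketched as follows. This is a simplified model, not the exam's reference design: page tables are flattened into single-level dictionaries, and the class and method names (`ShadowPaging`, `guest_write_pte`) are hypothetical.

```python
class ShadowPaging:
    """Toy model of a VMM keeping a shadow page table in sync with a guest page table."""

    def __init__(self, real_to_physical):
        # VMM-private table: guest physical page -> host physical page.
        self.real_to_physical = dict(real_to_physical)
        self.guest_pt = {}   # guest virtual page -> guest physical page
        self.shadow_pt = {}  # guest virtual page -> host physical page
        self.traps = 0

    def guest_write_pte(self, vpn, guest_ppn):
        # The pages holding the guest page table are write-protected,
        # so the guest's store never completes directly; it traps to the VMM.
        self._trap_handler(vpn, guest_ppn)

    def _trap_handler(self, vpn, guest_ppn):
        self.traps += 1
        # Apply the write on the guest's behalf ...
        self.guest_pt[vpn] = guest_ppn
        # ... and mirror it, translating real -> physical for the shadow entry.
        self.shadow_pt[vpn] = self.real_to_physical[guest_ppn]


vmm = ShadowPaging({0: 7, 1: 3})
vmm.guest_write_pte(0x10, 1)
print(vmm.shadow_pt[0x10])  # 3
```

The guest only ever sees `guest_pt`, while the hardware walks `shadow_pt`. The separate `real_to_physical` table is what lets the VMM fill in the host physical page.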
Problem 2.E: Different Page Sizes (5 points)
Describe how it is possible to support a guest virtual machine with an 8 KiB page size on a host machine with a 4 KiB page size.
Each guest page is mapped to two host pages; a leaf entry in the guest page table corresponds to two consecutive entries in the shadow page table with the same permissions. To properly support I/O (i.e., DMA requests), the two host pages should be physically contiguous.
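The two-for-one mapping can be sketched as below, assuming the VMM backs each guest page with physically contiguous host pages starting at a chosen base page. The helper names and the `(page number, permissions)` entry format are illustrative only.

```python
GUEST_PAGE = 8 * 1024
HOST_PAGE = 4 * 1024
RATIO = GUEST_PAGE // HOST_PAGE  # host pages per guest page (2)


def shadow_entries(guest_vpn, host_base_ppn, perms):
    """Expand one guest leaf entry into RATIO consecutive shadow entries."""
    return {
        guest_vpn * RATIO + i: (host_base_ppn + i, perms)
        for i in range(RATIO)
    }


def translate(shadow, guest_vaddr):
    """Translate a guest virtual address using host-sized shadow entries."""
    host_vpn, offset = divmod(guest_vaddr, HOST_PAGE)
    host_ppn, _perms = shadow[host_vpn]
    return host_ppn * HOST_PAGE + offset


shadow = shadow_entries(guest_vpn=3, host_base_ppn=100, perms="rw")
print(sorted(shadow.items()))  # [(6, (100, 'rw')), (7, (101, 'rw'))]
print(translate(shadow, 3 * GUEST_PAGE + 5000))  # 414600
```

Because the two host pages are contiguous, the translated address equals `100 * HOST_PAGE + 5000`, which is what a device performing DMA across the 4 KiB boundary would expect.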
The following loop iterates through two arrays of integers and compares their elements. The code contains four branches labeled B1, B2, B3, and B4. Assume that the arrays X and Y are populated with uniformly random values.

    c = 0;
    for (i = 0; i < N; i++) {  // B4
        x = X[i];
        y = Y[i];
        if (x == 0)            // B1
            c++;
        if (y == 0)            // B2
            c--;
        if (x != y)            // B3
            c += (x - y);
    }

        la   x1, X
        la   x2, Y
        li   x3, N
        li   x4, 0             # c
    loop:
        lw   x5, (x1)          # x
        lw   x6, (x2)          # y
        bnez x5, skip1         # B1
        addi x4, x4, 1
    skip1:
        bnez x6, skip2         # B2
        addi x4, x4, -1
    skip2:
        beq  x5, x6, skip3     # B3
        sub  x5, x5, x6
        add  x4, x4, x5
    skip3:
        addi x1, x1, 4
        addi x2, x2, 4
        addi x3, x3, -1
        bnez x3, loop          # B4
Problem 3.A: Branch Correlation (2+2 points)
In contrast to spatial correlation, a branch may also demonstrate temporal correlation such that the present outcome of the branch is related to the previous outcomes of the same branch.
(i) For the code above, briefly explain which branches exhibit spatial correlation, if any.
B3 is correlated with B1 and B2. For example, if B1 and B2 are both not taken (x = 0 and y = 0), then B3 will always be taken.
(ii) For the code above, briefly explain which branches exhibit temporal correlation, if any.
Only B4. B1, B2, and B3 are not temporally correlated since the value of each element is independent of the others.
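Let X = { 0, 1, 2, … } and Y = { 2, 0, 2, … }.

The Way 00 to Way 11 columns give the branch predictor state, and the Predicted and Actual columns give the branch behavior. In iterations 1 and 2, only the counter updated since the previous iteration is listed; blank entries are unchanged.

Iteration  Branch  History  Way 00  Way 01  Way 10  Way 11  Predicted  Actual
0          B1      00       01      01      01      01      NT         NT
0          B2      00       01      01      01      01      NT         T
0          B3      01       01      01      01      01      NT         NT
0          B4      10       01      01      01      01      NT         T
1          B1      01       00                              NT         T
1          B2      11       10                              NT         NT
1          B3      10               00                      NT         NT
1          B4      00                       10              NT         T
2          B1      01               10                      T          T
2          B2      11                               00      NT         T
2          B3      11                       00              NT         T
2          B4      11       10                              NT         T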
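The table above can be reproduced with a short simulation. The model here is inferred from the table rather than stated in the text, so treat it as an assumption: each branch has its own four 2-bit saturating counters, indexed by a 2-bit global history that starts at 00; counters start at 01 and predict taken at 10 or 11. B4 is assumed taken in every iteration shown (N larger than 3).

```python
X = [0, 1, 2]
Y = [2, 0, 2]


def branch_outcomes(x, y):
    """Taken (True) or not taken (False) for B1..B4 as written in the assembly."""
    return [
        ("B1", x != 0),  # bnez x5, skip1
        ("B2", y != 0),  # bnez x6, skip2
        ("B3", x == y),  # beq x5, x6, skip3
        ("B4", True),    # bnez x3, loop (assumed taken: not the last iteration)
    ]


def simulate(xs, ys):
    """Return (iteration, branch, history, predicted, actual) for every branch."""
    counters = {name: [1, 1, 1, 1] for name in ("B1", "B2", "B3", "B4")}
    history = 0  # 2-bit global history, most recent outcome in the low bit
    rows = []
    for i, (x, y) in enumerate(zip(xs, ys)):
        for name, taken in branch_outcomes(x, y):
            counter = counters[name][history]
            predicted = counter >= 2
            rows.append((i, name, format(history, "02b"),
                         "T" if predicted else "NT",
                         "T" if taken else "NT"))
            # Saturating update of the selected counter.
            counters[name][history] = min(3, counter + 1) if taken else max(0, counter - 1)
            # Shift the outcome into the global history.
            history = ((history << 1) | int(taken)) & 0b11
    return rows


for row in simulate(X, Y):
    print(row)
```

Under these assumptions, the only correct prediction of a taken branch in the first three iterations is B1 in iteration 2, which reuses the Way 01 counter trained by B1 in iteration 1.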
Problem 3.C: Expected Accuracy (6 points)
Suppose that the elements in arrays X and Y are randomly and uniformly distributed over the set of integers 0, 1, and 2 (each possibility is equally likely). With the two-level predictor, what is the expected accuracy in predicting branch B3 correctly (beq x5, x6, skip3) as the loop approaches an infinite number of iterations? Show your work.
Hint: Consider the combined outcomes of branches B1 and B2, their probabilities, and how they contribute to the bimodal counters for B3. It may be helpful to separate the cases as such:
Conditions      Possibilities of (x, y)
x ≠ 0, y ≠ 0    (1, 1), (1, 2), (2, 1), (2, 2)
x = 0, y ≠ 0    (0, 1), (0, 2)
x ≠ 0, y = 0    (1, 0), (2, 0)
x = 0, y = 0    (0, 0)
In three of the four cases, the global history perfectly determines the outcome of B3, so it is predicted with 100% accuracy. In the remaining case (x ≠ 0 and y ≠ 0), both outcomes are equally likely, so the counter increments and decrements cancel out on average. Thus, the predictor does not learn anything useful and performs no better than a random guess, resulting in only 50% accuracy.
B1  B2  P(B1, B2)  P(B3 = T)  Prediction
T   T   4/9        0.5        50% accurate on average
NT  T   2/9        0.0        Always NT; counter converges to 00
T   NT  2/9        0.0        Always NT; counter converges to 00
NT  NT  1/9        1.0        Always T; counter converges to 11

The expected prediction accuracy is therefore: (1/9 + 2/9 + 2/9)(100%) + (4/9)(50%) = 5/9 + 2/9 = 7/9 ≈ 77.8%
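The same result can be obtained by enumerating all nine (x, y) pairs. The sketch assumes, as the argument above does, that for each (B1, B2) history the counter settles on the majority outcome of B3, and that a 50/50 split yields 50% accuracy. The function name is illustrative.

```python
from fractions import Fraction
from itertools import product


def expected_b3_accuracy(values=(0, 1, 2)):
    """Exact steady-state accuracy for B3, grouping (x, y) by the B1/B2 history."""
    groups = {}  # (B1 taken, B2 taken) -> (times B3 taken, number of pairs)
    for x, y in product(values, repeat=2):
        key = (x != 0, y != 0)
        taken, total = groups.get(key, (0, 0))
        groups[key] = (taken + (x == y), total + 1)

    pairs = len(values) ** 2
    accuracy = Fraction(0)
    for taken, total in groups.values():
        p_taken = Fraction(taken, total)
        # Assumption: the counter tracks the majority outcome for this history.
        accuracy += Fraction(total, pairs) * max(p_taken, 1 - p_taken)
    return accuracy


print(expected_b3_accuracy())  # 7/9
```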
Problem 3.D: Trace Scheduling (5 points)
Now consider a different microarchitecture without a dynamic branch predictor. The processor statically predicts that branches are never taken, and taken branches incur a multi-cycle penalty.
Although originally conceived in a VLIW context, trace scheduling is a general compiler technique for removing control hazards that can also be applied to conventional scalar architectures. Assuming the contents of arrays X and Y follow the same uniform distribution as Part 3.C (all elements are equally likely to be either 0, 1, or 2), reschedule the assembly code to minimize the branch penalty along the most frequently executed code path.
        la   x1, X
        la   x2, Y
        li   x3, N
        li   x4, 0             # c
    loop:
        lw   x5, (x1)          # x
        lw   x6, (x2)          # y
        beqz x5, B1            # B1
    cont1:
        beqz x6, B2            # B2
    cont2:
        beq  x5, x6, skip3     # B3
        sub  x5, x5, x6
        add  x4, x4, x5
    skip3:
        addi x1, x1, 4
        addi x2, x2, 4
        addi x3, x3, -1
        bnez x3, loop          # B4
    B1:
        addi x4, x4, 1
        j    cont1
    B2:
        addi x4, x4, -1
        j    cont2
Problem 4.A: Inclusion Policy (4+4 points)
In lecture, it was mentioned that an inclusive L2 cache can act as a filter to reduce the amount of L1 coherence traffic in a snoopy cache-coherence protocol. If a coherence request misses in the L2 cache, there is no need to probe the L1 cache for the given line.
(i) Explain how a strictly exclusive L2 cache can also be used to optimize snooping by the L1 cache.
If a coherence request hits in the strictly exclusive L2 cache, the given line cannot be in the L1 cache, so there is no need to probe the L1.
Another acceptable answer is that coherence requests to upgrade permission for lines already present in the L1 (e.g., S to M) do not need to be broadcast to the L2.
(ii) Could a non-inclusive, non-exclusive L2 cache (i.e., neither strictly inclusive nor strictly exclusive) be similarly used to optimize snooping by the L1 cache? Explain.
No, hits and misses at the L2 cache reveal no information about what the L1 cache contains, so the L1 must snoop every coherence request.
A “yes” answer is also acceptable if it sufficiently explains how the L2 can track inclusivity with the L1 (e.g., an extra bit per L2 line to indicate it is shared with the L1).
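The filtering rules from Problem 4.A can be summarized in a small decision function. The policy strings and the function name are illustrative, and the non-inclusive, non-exclusive case assumes no extra tracking bits in the L2.

```python
def must_probe_l1(policy, l2_hit):
    """Return True if a snooped coherence request must also probe the L1."""
    if policy == "inclusive":
        # Every L1 line is also in the L2, so an L2 miss rules out the L1.
        return l2_hit
    if policy == "exclusive":
        # A line is never in both levels, so an L2 hit rules out the L1.
        return not l2_hit
    if policy == "non-inclusive-non-exclusive":
        # The L2 lookup says nothing about the L1 contents.
        return True
    raise ValueError(f"unknown policy: {policy}")


for policy in ("inclusive", "exclusive", "non-inclusive-non-exclusive"):
    print(policy, must_probe_l1(policy, l2_hit=True), must_probe_l1(policy, l2_hit=False))
```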
Problem 4.B: False Sharing (4 points)
In the following table, indicate which memory operations experience a hit, true sharing miss, or false sharing miss under an MSI coherence protocol. Assume that x1 and x2 reside in the same cache line, and both words are read by both processors P1 and P2 before this sequence. The first row has been completed for you.
Time  P1  P2  Hit  True Sharing Miss  False Sharing Miss
1     write x1  X
2     write x2  X
3     read x1   X
4     read x1   X
5     write x2  X
Problem 4.C: Directory-Based Coherence (6+6 points)
The following questions explore the directory-based coherence protocol described in Appendix A (same as Handout #6 from Problem Set 5) in more detail.
As before, assume that message passing maintains FIFO order: All messages between the same source and destination are always received in the same order that they were sent. Also assume that each site has sufficient queuing capacity to buffer all incoming messages without drops.
(i) Consider the situation where a cache is sent an InvReq message for a given cache line. This occurs only if the directory state indicates that the site is a current sharer of the memory block, and the directory intends to invalidate the copy in the cache before granting exclusive access to another cache.
Typically, one expects the line to be in the C-shared state when the InvReq arrives. How is it also possible for the cache to receive the InvReq message while it has the line in the C-pending state (row #22 in Table H12-1 of Appendix A) – in other words, when the line is not actually present? Why does ignoring InvReq work out correctly in this case?
Assume that the home directory state is initially R(id’), indicating that the block is shared by the cache at site id’. Consider the following scenario:
While possible, this is not necessarily problematic. Although these operations can overlap in real time, the reads are treated as logically preceding the write in the global memory order. Coherence does not require that writes be “immediately” visible. All other caches will eventually receive an InvReq, and subsequent reads will trigger a ShReq that returns the updated data, so the second coherence invariant continues to be upheld.
Conversely, it can be argued that this proposed optimization is unsafe from a consistency perspective since the store-load reordering described above may lead to a violation of a stricter memory consistency model. This is potentially the case when multiple directory sites are involved (multi-bank last-level caches).
Problem 5.A: Load/Store Queues (4+4+4 points)
Consider a multiprocessor with out-of-order cores that implement conservative out-of-order load/store execution (loads wait for memory addresses to be fully checked/disambiguated).
Table 2.1 shows the current state of the store queue in one of the cores. Stores are kept in the store queue until they commit. The instruction number indicates the order of the instructions in the program, with lower numbers being earlier in program order.
Table 2.2 shows the values present in the non-blocking data cache of the same core. Loads following cache misses can read from the data cache on a hit if the memory consistency model is not violated.
Tables 2.3, 2.4, and 2.5 show the current state of the load queue. Assume that all loads and stores access the full 32-bit word.
Table 2.1: Store Queue
Instruction #  Address  Value
5              0x100    0x
7              0x200    unknown
11             0x300    0xABCDABCD
13             0x200    0x
17             unknown  unknown
Table 2.2: Data Cache
Valid?  Address  Value
Y       0x100    0xFFFFFFFF
Y       0x200    0x1234ABCD
Y       0x300    0x
N       0x400    unknown
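As a rough illustration of conservative load/store disambiguation, the sketch below decides what a load may do given the store queue and cache contents. It models a single core only and ignores the memory consistency model. The load instruction numbers in the usage lines are hypothetical, and the placeholder strings stand in for store values that are truncated in Table 2.1. The 0x300 cache entry is omitted because its value is also truncated.

```python
UNKNOWN = None

# (instruction number, address, value), following Table 2.1.
STORE_QUEUE = [
    (5, 0x100, "store-5 value"),    # placeholder: value truncated in the table
    (7, 0x200, UNKNOWN),
    (11, 0x300, 0xABCDABCD),
    (13, 0x200, "store-13 value"),  # placeholder: value truncated in the table
    (17, UNKNOWN, UNKNOWN),
]

# Valid cache lines with fully legible values, following Table 2.2.
CACHE = {0x100: 0xFFFFFFFF, 0x200: 0x1234ABCD}


def resolve_load(load_num, load_addr, store_queue=STORE_QUEUE, cache=CACHE):
    """Return (action, value) for a load under conservative disambiguation."""
    older = [s for s in store_queue if s[0] < load_num]
    # Conservative rule: every older store address must be known first.
    if any(addr is UNKNOWN for _, addr, _ in older):
        return ("wait", None)
    matches = [s for s in older if s[1] == load_addr]
    if matches:
        # Forward from the youngest older store to the same address.
        _, _, value = max(matches, key=lambda s: s[0])
        return ("wait", None) if value is UNKNOWN else ("forward", value)
    if load_addr in cache:
        return ("cache", cache[load_addr])
    return ("miss", None)


print(resolve_load(3, 0x200))   # ('cache', 305441741), i.e. 0x1234ABCD
print(resolve_load(6, 0x100))   # ('forward', 'store-5 value')
print(resolve_load(9, 0x200))   # ('wait', None): store 7 has no data yet
print(resolve_load(18, 0x100))  # ('wait', None): store 17 address unknown
```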