Analyzing the Performance of Loops on Out-of-Order Processors and Vector Machines, Study notes of Architecture

This document compares the performance of loops on out-of-order processors and vector machines. The out-of-order processor characteristics include up to 6 instructions per cycle, a 128-entry ROB, 96-entry unified physical register file, 2 integer ALUs, 2 load/store units, 1 floating-point adder, and 1 floating-point multiplier. The vector machine characteristics include 16 elements per vector register, 4 vector lanes, 1 load/store unit per lane, 1 floating-point adder per lane, and 1 floating-point multiplier per lane. The document also discusses virtual memory management and page table updates in the context of virtualization.

CS 152 Computer Architecture and Engineering
Final Exam
SOLUTIONS
May 12, 2020
Professor Krste Asanović
Name:______________________
SID:______________________
I am taking CS152 / CS252
(circle one)
180 Minutes, 27 pages.
Notes:
Not all questions are of equal difficulty, so look over the entire exam!
Please carefully state any assumptions you make.
Please write your name on every page in the exam.
Do not discuss the exam with other students who haven’t taken the exam.
If you have inadvertently been exposed to an exam prior to taking it, you
must tell the instructor or TA.
You will receive no credit for selecting multiple-choice answers without
giving explanations if the instructions ask you to explain your choice.
Question   Topic                Point Value
1          Parallelism          32
2          Virtual Memory       21
3          Branch Prediction    28
4          Cache Coherence      24
5          Memory Consistency   30
6          Synchronization      25
TOTAL                           160

Problem 1: Parallelism (32 points)

In this problem, we will explore how out-of-order processors, VLIW machines, and vector machines extract parallelism from the following code:

    // Assume that N is large
    for (i = 0; i < N; i++) {
        a = A[i];
        b = B[i];
        C[i] = (a * a) + (b * b);
    }

Problem 1.A: Out-of-Order Execution (8 points)

This loop is translated into the following scalar code:

    # a0 points to A
    # a1 points to B
    # a2 points to C
    # a3 points to C+N
    loop:
        fld    f1, 0(a0)
        fld    f2, 0(a1)
        fmul.d f1, f1, f1
        fmul.d f2, f2, f2
        fadd.d f1, f1, f2
        fsd    f1, 0(a2)
        addi   a0, a0, 8
        addi   a1, a1, 8
        addi   a2, a2, 8
        bltu   a2, a3, loop

Consider an out-of-order processor with the following characteristics:

  • Up to 6 instructions can be dispatched, issued, and committed per cycle
  • 128-entry ROB
  • 96-entry unified physical register file
  • 2 integer ALUs, 1-cycle latency
  • 2 load/store units, 2-cycle latency (assume all loads and stores hit in the cache)
  • 1 floating-point adder, 2-cycle latency
  • 1 floating-point multiplier, 3-cycle latency
  • All functional units are fully pipelined
  • Scheduler always selects the oldest ready instructions to issue
  • Assume perfect branch prediction and memory disambiguation

Note: Not all rows may be needed.

Unit   Row 1                          Row 2
ALU0   addi a0,a0,8 [iter i+4]        addi a1,a1,8 [iter i+4]
ALU1   addi a2,a2,8 [iter i+4]        bltu a2,a3,loop [iter i+4]
MEM0   fld f1,0(a0) [iter i+4]        fsd f5,-40(15) [iter i]
MEM1   fld f2,0(a1) [iter i+4]
FADD   fadd.d f5,f3,f4 [iter i+1]
FMUL   fmul.d f3,f1,f1 [iter i+3]     fmul.d f4,f2,f2 [iter i+3]

FLOPs per cycle:

3/2 FLOPs per cycle
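The 3/2 figure can be cross-checked with a resource-bound calculation. The sketch below is an illustration rather than part of the exam solution: it assumes the branch issues on an integer ALU (as the schedule shows) and that the steady-state rate is limited only by functional-unit counts and issue width. The names `ops_per_iter`, `units`, and `cycles_per_iteration` are hypothetical.

```python
from fractions import Fraction

# Instructions per loop iteration, grouped by the functional unit they need
# (from the scalar loop: 3 addi + 1 bltu, 2 fld + 1 fsd, 2 fmul.d, 1 fadd.d).
ops_per_iter = {"ALU": 4, "MEM": 3, "FMUL": 2, "FADD": 1}
units = {"ALU": 2, "MEM": 2, "FMUL": 1, "FADD": 1}
ISSUE_WIDTH = 6
FLOPS_PER_ITER = 3  # two multiplies and one add


def cycles_per_iteration():
    """Lower bound on steady-state cycles per iteration from resource limits."""
    bounds = [Fraction(ops_per_iter[u], units[u]) for u in ops_per_iter]
    # The front end can handle at most ISSUE_WIDTH instructions per cycle.
    bounds.append(Fraction(sum(ops_per_iter.values()), ISSUE_WIDTH))
    return max(bounds)


cpi = cycles_per_iteration()
flops_per_cycle = Fraction(FLOPS_PER_ITER) / cpi
print(cpi, flops_per_cycle)  # prints: 2 3/2
```

Under these assumptions the integer ALUs and the single multiplier each need 2 cycles per iteration, which matches the two-row repeating schedule.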

The VLIW machine has the same mix of functional units as the OoO processor from Part 1.A, and so static scheduling should yield the same performance as dynamic scheduling under these conditions.

Problem 1.C: Vector Machines (8 points)

The loop is translated into the following vector assembly code:

    # a0 points to A
    # a1 points to B
    # a2 points to C
    # a3 holds N
    loop:
        vsetvli  t0, a3, e64
        vle.v    v0, 0(a0)
        vfmul.vv v1, v0, v0
        vle.v    v2, 0(a1)
        vfmul.vv v3, v2, v2
        vfadd.vv v4, v1, v3
        vse.v    v4, 0(a2)
        sub      a3, a3, t0
        slli     t0, t0, 3
        add      a0, a0, t0
        add      a1, a1, t0
        add      a2, a2, t0
        bnez     a3, loop

For this question, consider a vector machine with the following characteristics:

  • 16 elements per vector register
  • 4 vector lanes
  • 1 load/store unit per lane, 2-cycle latency
  • 1 floating-point adder per lane, 2-cycle latency
  • 1 floating-point multiplier per lane, 3-cycle latency
  • All functional units are fully pipelined
  • All functional units have dedicated read/write ports into the vector register file
  • No dead time between vector instructions
  • Vector instructions execute in order
  • Scalar instructions execute separately on a decoupled control processor

Problem 1.D: Impact on Performance (4+4 points)

For each of the processors (OoO, VLIW, vector) introduced in Parts 1.A to 1.C, discuss how the following hardware changes would impact performance on the given loop. Assume that all other design parameters remain unchanged.

(i) Doubling the register file size (i.e., doubling the size of the unified physical register file for OoO, doubling the number of architectural registers for VLIW, doubling the length of the vector registers). Does taking advantage of the expanded register file capacity require changing the code?

OoO: FLOPs/cycle is unchanged; performance is bottlenecked by structural hazards with the functional units, not lack of registers. Increasing the number of physical registers beyond the number of ROB entries also serves no useful purpose. No code changes are needed since renaming automatically exploits the larger physical register file.

VLIW: FLOPs/cycle is unchanged for the same reasons as the OoO case. The code must be rewritten to use the extra architectural registers now provided by the ISA.

Vector: FLOPs/cycle improves if chaining is used. Chaining longer vectors increases the proportion of cycles in which the adder and multiplier are simultaneously utilized. No code changes are needed if the code is written to be vector-length-agnostic.
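To illustrate why vector-length-agnostic code needs no changes, here is a minimal Python model of the strip-mined loop. It is a sketch under stated assumptions: `vector_loop` and `vlmax` are hypothetical names, and the vector-length setting is simplified to `min(remaining, vlmax)`.

```python
def vector_loop(A, B, vlmax):
    """Model of the strip-mined loop; vlmax is the hardware vector length."""
    n = len(A)
    C = [0.0] * n
    base = 0        # plays the role of the a0/a1/a2 pointers
    remaining = n   # plays the role of a3
    strips = 0
    while remaining != 0:
        vl = min(remaining, vlmax)        # simplified model of the value left in t0
        for i in range(base, base + vl):  # one pass of the vector instructions
            C[i] = A[i] * A[i] + B[i] * B[i]
        remaining -= vl
        base += vl
        strips += 1
    return C, strips


A = [float(i) for i in range(40)]
B = [float(2 * i) for i in range(40)]
c16, strips16 = vector_loop(A, B, vlmax=16)
c32, strips32 = vector_loop(A, B, vlmax=32)
```

The same code produces identical results for both vector lengths; only the number of loop trips changes (3 strips at 16 elements, 2 strips at 32 for this 40-element input).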

(ii) Adding another floating-point multiplier. Does taking advantage of this new functional unit require changing the code?

OoO: FLOPs/cycle is unchanged. Performance is still bottlenecked to the same degree by structural hazards with the ALUs and load/store units. No code changes are needed since the issue logic can automatically exploit the extra multiplier.

VLIW: FLOPs/cycle is unchanged for the same reasons as the OoO case. (Some improvement is possible if the instruction scheduling is initially suboptimal.) The code must be rewritten to use the extra VLIW slot.

Vector: FLOPs/cycle is unchanged. The lack of a second load unit in each lane prevents the two multipliers from being utilized simultaneously. No code changes are needed.

Problem 2: Virtual Memory and Virtualization (21 points)

A virtual machine monitor (VMM) runs several guest OSs on a single host machine. The guest OSs run in user (unprivileged) mode, whereas the VMM runs in supervisor (privileged) mode. The OS in each guest virtual machine manages its own set of page tables, which reflect the mapping of the guest virtual address space to the guest physical address space (“virtual-to-real”). The guest physical addresses must then be mapped to host physical addresses.

To reuse the hardware TLB, the VMM maintains a set of shadow page tables that map directly from the guest virtual address space to the host physical address space (“virtual-to-physical”). When running the guest OS in user mode, the VMM sets the hardware page table base pointer to point to the shadow page table. The TLB works as if there were no virtualization.

Problem 2.A: TLB Miss Latency (3 points)

Suppose that the guest and host machines both use three-level page tables. The host has a hardware-refilled TLB. When running the guest OS, what is the TLB miss latency if the TLB access takes 1 cycle and the memory latency is 50 cycles per access?

1 + 50 + 50 + 50 = 151 cycles

A hardware TLB refill involves a page table walk of the shadow page tables in memory, same as if there were no virtualization in effect. Note that the guest page tables are not directly involved.

The TLB miss latency must also include the cost of the initial TLB lookup (1 cycle).
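The arithmetic generalizes to any page table depth. The helper below is a small sketch (the function name and parameters are hypothetical) that assumes one memory access per page table level and no page-walk caching.

```python
def tlb_miss_latency(levels, tlb_cycles=1, mem_cycles=50):
    """Cycles for a TLB lookup followed by a hardware walk of `levels` tables."""
    return tlb_cycles + levels * mem_cycles


latency = tlb_miss_latency(levels=3)
print(latency)  # prints: 151
```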

Problem 2.D: Page Table Updates (5 points)

From the perspective of the guest OS, the guest page tables live in guest physical memory. How does the VMM ensure that the shadow page table is updated when the corresponding guest page table is modified by the guest OS?

In the shadow page table, the VMM write-protects the pages holding the guest page tables so that any attempt to modify the guest page tables causes a trap into the VMM. On a trap, the VMM mirrors the changes to the guest page table entry in the shadow page table. The VMM must translate the guest physical addresses presented by the guest OS to host physical addresses; this typically requires the VMM to maintain a separate table with the real-to-physical mappings.
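The trap-and-mirror step can be sketched as follows. This is a deliberately simplified model, not the exam's reference design: page tables are flat dictionaries rather than three-level trees, and the class and method names (`VMM`, `on_guest_pte_write`) are hypothetical.

```python
class VMM:
    """Toy VMM that keeps a shadow page table in sync with a guest page table."""

    def __init__(self, real_to_physical):
        # Guest physical page number -> host physical page number.
        self.real_to_physical = dict(real_to_physical)
        self.guest_pt = {}   # guest virtual page -> guest physical page
        self.shadow_pt = {}  # guest virtual page -> host physical page

    def on_guest_pte_write(self, vpn, guest_ppn):
        """Trap handler: runs when the guest writes a write-protected PTE page."""
        # Emulate the guest's store to its own page table.
        self.guest_pt[vpn] = guest_ppn
        # Mirror the change, translating real to physical along the way.
        self.shadow_pt[vpn] = self.real_to_physical[guest_ppn]


vmm = VMM({0x10: 0x7A, 0x11: 0x3C})
vmm.on_guest_pte_write(vpn=0x400, guest_ppn=0x11)
```

After the trap, `vmm.shadow_pt[0x400]` holds the host page `0x3C`, so the hardware walker never sees a guest physical address.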

Problem 2.E: Different Page Sizes (5 points)

Describe how it is possible to support a guest virtual machine with an 8 KiB page size on a host machine with a 4 KiB page size.

Each guest page is mapped to two host pages; a leaf entry in the guest page table corresponds to two consecutive entries in the shadow page table with the same permissions. To properly support I/O (i.e., DMA requests), the two host pages should be physically contiguous.
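A minimal sketch of the mapping, assuming the two host pages are physically contiguous and using a flat dictionary as the shadow table (the helper names `shadow_entries` and `translate` are hypothetical):

```python
GUEST_PAGE = 8192  # 8 KiB guest page size
HOST_PAGE = 4096   # 4 KiB host page size


def shadow_entries(guest_vpn, host_base_ppn, perms):
    """Expand one guest leaf PTE into consecutive shadow PTEs.

    host_base_ppn is the host page number of the first of the contiguous
    host pages backing the guest page (an assumption of this sketch).
    """
    ratio = GUEST_PAGE // HOST_PAGE
    first_host_vpn = guest_vpn * ratio
    return {first_host_vpn + k: (host_base_ppn + k, perms) for k in range(ratio)}


def translate(shadow, gva):
    """Translate a guest virtual address through the shadow entries."""
    vpn, offset = divmod(gva, HOST_PAGE)
    ppn, _perms = shadow[vpn]
    return ppn * HOST_PAGE + offset


shadow = shadow_entries(guest_vpn=3, host_base_ppn=100, perms="rw")
hpa = translate(shadow, 3 * GUEST_PAGE + 5000)
```

Guest page 3 becomes host virtual pages 6 and 7 with identical permissions, and an offset of 5000 within the guest page lands in the second host page.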

Problem 3: Branch Prediction (28 points)

The following loop iterates through two arrays of integers and compares their elements. The code contains four branches labeled B1, B2, B3, and B4. Assume that the arrays X and Y are populated with uniformly random values.

    c = 0;
    for (i = 0; i < N; i++) {  // B4
        x = X[i];
        y = Y[i];
        if (x == 0)            // B1
            c++;
        if (y == 0)            // B2
            c--;
        if (x != y)            // B3
            c += (x - y);
    }

        la   x1, X
        la   x2, Y
        li   x3, N
        li   x4, 0          # c
    loop:
        lw   x5, (x1)       # x
        lw   x6, (x2)       # y
        bnez x5, skip1      # B1
        addi x4, x4, 1
    skip1:
        bnez x6, skip2      # B2
        addi x4, x4, -1
    skip2:
        beq  x5, x6, skip3  # B3
        sub  x5, x5, x6
        add  x4, x4, x5
    skip3:
        addi x1, x1, 4
        addi x2, x2, 4
        addi x3, x3, -1
        bnez x3, loop       # B4

Problem 3.A: Branch Correlation (2+2 points)

In contrast to spatial correlation, a branch may also demonstrate temporal correlation, in which the present outcome of the branch is related to previous outcomes of the same branch.

(i) For the code above, briefly explain which branches exhibit spatial correlation, if any.

B3 is correlated with B1 and B2. For example, if B1 and B2 are both not taken (x = 0 and y = 0), then B3 will always be taken.

(ii) For the code above, briefly explain which branches exhibit temporal correlation, if any.

Only B4. B1, B2, and B3 are not temporally correlated since the value of each element is independent of the others.

Let X = { 0, 1, 2, … } and Y = { 2, 0, 2, … }.

Iteration  Branch  History  Way 00  Way 01  Way 10  Way 11  Predicted  Actual
0          B1      00       01      01      01      01      NT         NT
           B2      00       01      01      01      01      NT         T
           B3      01       01      01      01      01      NT         NT
           B4      10       01      01      01      01      NT         T
1          B1      01       00                              NT         T
           B2      11       10                              NT         NT
           B3      10               00                      NT         NT
           B4      00                       10              NT         T
2          B1      01               10                      T          T
           B2      11                               00      NT         T
           B3      11                       00              NT         T
           B4      11       10                              NT         T
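The table above can be replayed with a short simulation. The predictor model here is inferred from the table, since the problem statement for this part is not reproduced: each branch has its own four 2-bit counters ("ways") indexed by a 2-bit global history, every counter starts at 01, the history starts at 00, and a counter value of 10 or 11 predicts taken. The function name `simulate` is hypothetical, and B4 is assumed taken because N exceeds the simulated window.

```python
def simulate(X, Y):
    """Replay B1..B4 per iteration; return (branch, history, predicted, actual)."""
    counters = {b: [1, 1, 1, 1] for b in ("B1", "B2", "B3", "B4")}
    history = 0  # 2-bit global history, most recent outcome in the low bit
    rows = []
    for x, y in zip(X, Y):
        outcomes = [
            ("B1", x != 0),  # bnez x5, skip1
            ("B2", y != 0),  # bnez x6, skip2
            ("B3", x == y),  # beq x5, x6, skip3
            ("B4", True),    # bnez x3, loop (assumed taken in this window)
        ]
        for name, taken in outcomes:
            ctr = counters[name][history]
            predicted = "T" if ctr >= 2 else "NT"
            rows.append((name, format(history, "02b"), predicted,
                         "T" if taken else "NT"))
            # Saturating 2-bit counter update, then shift the outcome into history.
            counters[name][history] = min(3, ctr + 1) if taken else max(0, ctr - 1)
            history = ((history << 1) | int(taken)) & 0b11
    return rows


rows = simulate([0, 1, 2], [2, 0, 2])
for row in rows:
    print(*row)
```

Under this reading, the single counter value shown in each row of iterations 1 and 2 is the entry that the same branch updated in the previous iteration.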

Problem 3.C: Expected Accuracy (6 points)

Suppose that the elements in arrays X and Y are randomly and uniformly distributed over the set of integers 0, 1, and 2 (each possibility is equally likely). With the two-level predictor, what is the expected accuracy in predicting branch B3 correctly (beq x5, x6, skip3) as the loop approaches an infinite number of iterations? Show your work.

Hint: Consider the combined outcomes of branches B1 and B2, their probabilities, and how they contribute to the bimodal counters for B3. It may be helpful to separate the cases as such:

Conditions        Possibilities of (x, y)
x ≠ 0, y ≠ 0      (1, 1), (1, 2), (2, 1), (2, 2)
x = 0, y ≠ 0      (0, 1), (0, 2)
x ≠ 0, y = 0      (1, 0), (2, 0)
x = 0, y = 0      (0, 0)

For ¾ of the cases, the global history perfectly determines the outcome of B3 with 100% accuracy. In the remaining case (x ≠ 0 and y ≠ 0), both outcomes are equally likely, so the counter increments and decrements cancel out on average. Thus, the predictor does not learn anything useful and performs no better than a random guess, resulting in only 50% accuracy.

B1   B2   P(B1, B2)   P(B3 = T)   Prediction
T    T    4/9         0.5         50% accurate on average
NT   T    2/9         0.0         Always NT; counter converges to 00
T    NT   2/9         0.0         Always NT; counter converges to 00
NT   NT   1/9         1.0         Always T; counter converges to 11

The expected prediction accuracy is therefore: (1/9 + 2/9 + 2/9)(100%) + (4/9)(50%) = 7/9 ≈ 77.8%
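The same result can be obtained by enumerating all nine (x, y) pairs. This is an illustrative check rather than part of the original solution; it assumes the counters have converged and handles only the outcome probabilities that arise here (0, 1/2, and 1). The function name `expected_b3_accuracy` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def expected_b3_accuracy(values=(0, 1, 2)):
    """Steady-state accuracy for B3, grouped by the (B1, B2) global history."""
    groups = {}
    for x, y in product(values, repeat=2):
        key = (x != 0, y != 0)  # outcomes of B1 and B2 (bnez is taken when nonzero)
        taken, total = groups.get(key, (0, 0))
        groups[key] = (taken + int(x == y), total + 1)

    n = len(values) ** 2
    accuracy = Fraction(0)
    for taken, total in groups.values():
        p_taken = Fraction(taken, total)
        assert p_taken in (0, Fraction(1, 2), 1)  # only these cases are modeled
        # A certain outcome is always predicted correctly once the counter
        # saturates; a 50/50 outcome is right half the time.
        hit_rate = Fraction(1) if p_taken in (0, 1) else Fraction(1, 2)
        accuracy += Fraction(total, n) * hit_rate
    return accuracy


accuracy = expected_b3_accuracy()
print(accuracy)  # prints: 7/9
```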

Problem 3.D: Trace Scheduling (5 points)

Now consider a different microarchitecture without a dynamic branch predictor. The processor statically predicts that branches are never taken, and taken branches incur a multi-cycle penalty.

Although originally conceived in a VLIW context, trace scheduling is a general compiler technique for removing control hazards that can also be applied to conventional scalar architectures. Assuming the contents of arrays X and Y follow the same uniform distribution as Part 3.C (all elements are equally likely to be either 0, 1, or 2), reschedule the assembly code to minimize the branch penalty along the most frequently executed code path.

        la   x1, X
        la   x2, Y
        li   x3, N
        li   x4, 0          # c
    loop:
        lw   x5, (x1)       # x
        lw   x6, (x2)       # y
        beqz x5, B1         # B1
    cont1:
        beqz x6, B2         # B2
    cont2:
        beq  x5, x6, skip3  # B3
        sub  x5, x5, x6
        add  x4, x4, x5
    skip3:
        addi x1, x1, 4
        addi x2, x2, 4
        addi x3, x3, -1
        bnez x3, loop       # B4
    B1:
        addi x4, x4, 1
        j    cont1
    B2:
        addi x4, x4, -1
        j    cont2

Problem 4: Cache Coherence (24 points)

Problem 4.A: Inclusion Policy (4+4 points)

In lecture, it was mentioned that an inclusive L2 cache can act as a filter to reduce the amount of L1 coherence traffic in a snoopy cache-coherence protocol. If a coherence request misses in the L2 cache, there is no need to probe the L1 cache for the given line.

(i) Explain how a strictly exclusive L2 cache can also be used to optimize snooping by the L1 cache.

If a coherence request hits in the strictly exclusive L2 cache, the given line cannot be in the L1 cache, so there is no need to probe the L1.

Another acceptable answer is that coherence requests to upgrade permission for lines already present in the L1 (e.g., S to M) do not need to be broadcast to the L2.

(ii) Could a non-inclusive, non-exclusive L2 cache (i.e., neither strictly inclusive nor strictly exclusive) be similarly used to optimize snooping by the L1 cache? Explain.

No, hits and misses at the L2 cache reveal no information about what the L1 cache contains, so the L1 must snoop every coherence request.

A “yes” answer is also acceptable if it sufficiently explains how the L2 can track inclusivity with the L1 (e.g., an extra bit per L2 line to indicate it is shared with the L1).

Problem 4.B: False Sharing (4 points)

In the following table, indicate which memory operations experience a hit, true sharing miss, or false sharing miss under an MSI coherence protocol. Assume that x1 and x2 reside in the same cache line, and both words are read by both processors P1 and P2 before this sequence. The first row has been completed for you.

Time  P1  P2  Hit  True Sharing Miss  False Sharing Miss
1     write x1     X
2     write x2     X
3     read x1      X
4     read x1      X
5     write x2     X

Problem 4.C: Directory-Based Coherence (6+6 points)

The following questions explore the directory-based coherence protocol described in Appendix A (same as Handout #6 from Problem Set 5) in more detail.

As before, assume that message passing maintains FIFO order: All messages between the same source and destination are always received in the same order that they were sent. Also assume that each site has sufficient queuing capacity to buffer all incoming messages without drops.

(i) Consider the situation where a cache is sent an InvReq message for a given cache line. This occurs only if the directory state indicates that the site is a current sharer of the memory block, and the directory intends to invalidate the copy in the cache before granting exclusive access to another cache.

Typically, one expects the line to be in the C-shared state when the InvReq arrives. How is it also possible for the cache to receive the InvReq message while it has the line in the C-pending state (row #22 in Table H12-1 of Appendix A) – in other words, when the line is not actually present? Why does ignoring InvReq work out correctly in this case?

Assume that the home directory state is initially R(id’), indicating that the block is shared by the cache at site id’. Consider the following scenario:

  1. The directory receives an ExReq from a site other than id’. The directory sends an InvReq to site id’. The home directory state becomes Tw(id’).
  2. Before the InvReq arrives at site id’, the cache performs a voluntary invalidation to evict the cache line. The cache line state moves from C-shared to C-nothing.
  3. The processor then issues a load or store to the evicted line, causing the cache to send a ShReq or ExReq. The cache line state becomes C-pending.
  4. The InvReq associated with the first ExReq eventually arrives at site id’.
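The cache-side view of this scenario can be traced with a small state machine. It is a sketch only: it models the three cache states named above and the requests the cache sends, and it leaves out the directory and any reply messages defined in Appendix A. The class and method names are hypothetical.

```python
class CacheLine:
    """Cache-side states for one line at site id'."""

    def __init__(self):
        self.state = "C-shared"
        self.sent = []  # requests this cache has sent to the home directory

    def voluntary_invalidate(self):
        # Step 2: the line is evicted before the InvReq arrives.
        assert self.state == "C-shared"
        self.state = "C-nothing"

    def processor_access(self, kind):
        # Step 3: a load or store misses and a request goes to the directory.
        if self.state == "C-nothing":
            self.sent.append("ShReq" if kind == "load" else "ExReq")
            self.state = "C-pending"

    def receive_inv_req(self):
        # Step 4: the copy the InvReq targets is already gone, so a line in
        # C-pending ignores it and keeps waiting for its own reply.
        if self.state == "C-shared":
            self.state = "C-nothing"


line = CacheLine()
line.voluntary_invalidate()
line.processor_access("load")
state_before = line.state
line.receive_inv_req()
```

The line remains in C-pending after the stale InvReq, which is the behavior row #22 describes.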

While possible, this is not necessarily problematic. Although these operations can overlap in real time, the reads are treated as logically preceding the write in the global memory order. Coherence does not require that writes be “immediately” visible. All other caches will eventually receive an InvReq, and subsequent reads will trigger a ShReq that returns the updated data, so the second coherence invariant continues to be upheld.

Conversely, it can be argued that this proposed optimization is unsafe from a consistency perspective since the store-load reordering described above may lead to a violation of a stricter memory consistency model. This is potentially the case when multiple directory sites are involved (multi-bank last-level caches).

Problem 5: Memory Consistency (30 points)

Problem 5.A: Load/Store Queues (4+4+4 points)

Consider a multiprocessor with out-of-order cores that implement conservative out-of-order load/store execution (loads wait for memory addresses to be fully checked/disambiguated).

Table 2.1 shows the current state of the store queue in one of the cores. Stores are kept in the store queue until they commit. The instruction number indicates the order of the instructions in the program, with lower numbers being earlier in program order.

Table 2.2 shows the values present in the non-blocking data cache of the same core. Loads following cache misses can read from the data cache on a hit if the memory consistency model is not violated.

Tables 2.3, 2.4, and 2.5 show the current state of the load queue. Assume that all loads and stores access the full 32-bit word.

Table 2.1: Store Queue

Instruction #   Address   Value
5               0x100     0x
7               0x200     unknown
11              0x300     0xABCDABCD
13              0x200     0x
17              unknown   unknown

Table 2.2: Data Cache

Valid?   Address   Value
Y        0x100     0xFFFFFFFF
Y        0x200     0x1234ABCD
Y        0x300     0x
N        0x400     unknown