Computer Architecture Final Exam - COE 501 - Fall 2015

King Fahd University of Petroleum and Minerals. Computer Engineering Department. COE 501 – Fall 2015. Computer Architecture. Final Exam.

King Fahd University of Petroleum and Minerals
Computer Engineering Department
COE 501 – Fall 2015
Computer Architecture
Final Exam
Thursday, December 24, 2015
7 pm – 9:30 pm
Prepared by: Dr. Muhamed Mudawar
Student Name:
Student ID:
Q1 / 15 Q2 / 10
Q3 / 15 Q4 / 25
Q5 / 15 Q6 / 20
Total / 100

Q1 (15 pts) The transpose of a matrix interchanges the rows and columns. Here is the code:

for (i=0; i<N; i++)
  for (j=0; j<N; j++)
    transpose[j][i] = matrix[i][j];

The matrix and its transpose are stored in row-major order. Assume that you are executing matrix transpose on a processor with 256 KB D-Cache, which is 4-way set-associative and with 64-byte blocks. Each element is 8 bytes in the matrix and its transpose.
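As a rough aid for the cache arithmetic, the following Python sketch derives the number of sets and a capacity-only bound on N from the parameters stated above. The function names are illustrative and not part of the exam. The bound assumes that both arrays are block-aligned and that nothing else occupies the D-Cache.

```python
import math


def num_sets(cache_bytes, ways, block_bytes):
    """Number of sets = (cache size / block size) / associativity."""
    blocks = cache_bytes // block_bytes
    return blocks // ways


def max_square_dim(cache_bytes, element_bytes, num_matrices=2):
    """Largest N such that num_matrices N x N arrays fit in the cache (capacity only)."""
    elements_per_matrix = cache_bytes // (element_bytes * num_matrices)
    return math.isqrt(elements_per_matrix)


CACHE_BYTES = 256 * 1024
print(num_sets(CACHE_BYTES, ways=4, block_bytes=64))   # 1024
print(max_square_dim(CACHE_BYTES, element_bytes=8))    # 128
```

With N = 128, each array occupies 128 KB, so the two arrays together fill the 256 KB cache exactly.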

a) (6 pts) How many sets exist in the D-Cache? What is the maximum value of N such that the matrix and its transpose can both fit in the 256 KB D-Cache?

b) (9 pts) Loop interchange does not help matrix transpose, because either the matrix or its transpose will be traversed by column, which is non-ideal. Transform the above code to perform matrix transpose using B × B blocks (block parameter B).
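One possible shape of the blocked loop nest is sketched here in Python, mirroring the structure of the loop given in the question. The outer index names `ii` and `jj` and the function name are assumptions made for illustration. The `min` bounds handle the case where N is not a multiple of B.

```python
def blocked_transpose(matrix, block):
    """Transpose a square matrix by visiting it in block x block tiles."""
    n = len(matrix)
    transpose = [[0] * n for _ in range(n)]
    for ii in range(0, n, block):          # first row of the current tile
        for jj in range(0, n, block):      # first column of the current tile
            for i in range(ii, min(ii + block, n)):
                for j in range(jj, min(jj + block, n)):
                    transpose[j][i] = matrix[i][j]
    return transpose


m = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
print(blocked_transpose(m, 2))  # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
```

Within one tile, only B rows of the matrix and B rows of its transpose are touched, which is what lets both working sets stay in the cache when B is chosen small enough.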

Q3 (15 pts) Consider the following three processors (X, Y, and Z) that are fabricated on a constant silicon area of 16 A. The performance of a core increases with the square root of its area, so a large core (area = 4 A) has double the performance of a small core (area = A).

The serial execution time of a given program is T on one core of processor X (area = A). The fraction of the execution time that must be executed serially on one core is S. The remaining fraction of the execution time can be executed in parallel on many cores.

a) (3 pts) What is the parallel execution time of the program on all 16 cores of Processor X? (Assume that the parallel load is balanced perfectly on all 16 cores).

b) (4 pts) What is the parallel execution time of the program on all 4 cores of Processor Y? (Assume that the parallel load is balanced perfectly on all 4 cores).

c) (4 pts) What is the parallel execution time of the program on Processor Z? Assume that the serial fraction of the program is executed on the large core, and the parallel fraction is balanced perfectly on the 12 small cores.

d) (4 pts) If the serial fraction S = 10%, what are the parallel execution times on X, Y, and Z? Which processor has the best performance?

Figure: core layouts of the three processors, each occupying a total area of 16 A.

Processor X: 16 small cores, core area = A
Processor Y: 4 large cores, core area = 4 A
Processor Z: 12 small cores (area = A each) and 1 large core (area = 4 A)
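The three configurations can be compared with a small Amdahl-style model. The following Python sketch is one way to set it up. The function and parameter names are assumptions, and the model takes the question's statements at face value: core speed scales with the square root of area, the parallel part is perfectly balanced, and on Processor Z the large core is idle during the parallel part.

```python
def parallel_time(T, S, serial_speed, parallel_cores, core_speed):
    """Execution time = serial part on one core + parallel part spread over cores.

    Speeds are relative to a small core of area A (speed 1).
    """
    serial = S * T / serial_speed
    parallel = (1 - S) * T / (parallel_cores * core_speed)
    return serial + parallel


def time_x(T, S):
    return parallel_time(T, S, serial_speed=1, parallel_cores=16, core_speed=1)


def time_y(T, S):
    return parallel_time(T, S, serial_speed=2, parallel_cores=4, core_speed=2)


def time_z(T, S):
    return parallel_time(T, S, serial_speed=2, parallel_cores=12, core_speed=1)


for name, f in (("X", time_x), ("Y", time_y), ("Z", time_z)):
    print(name, f(1.0, 0.10))
```

Under these assumptions and S = 10%, the times come out to about 0.156 T for X, 0.163 T for Y, and 0.125 T for Z, so Z is the fastest of the three.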

Q4 (25 pts) The following code computes Y[i] = a*X[i] + Y[i] for an array of length 100 doubles. Initially, R1 = address of X, R2 = address of Y, and F0 contains a.

      ADDIU  R3, R1, 800    ; R3 = address of X + 100 doubles
loop: L.D    F2, 0(R1)      ; load F2 = X[i]
      MUL.D  F4, F2, F0     ; F4 = a * X[i]
      L.D    F6, 0(R2)      ; load F6 = Y[i]
      ADD.D  F6, F4, F6     ; F6 = a * X[i] + Y[i]
      S.D    F6, 0(R2)      ; store F6
      ADDIU  R1, R1, 8      ; increment X pointer
      ADDIU  R2, R2, 8      ; increment Y pointer
      BNE    R1, R3, loop   ; loop if (R1 != R3)

Assume the following functional unit latencies (given as stall cycles before use). The loop branch is predicted to be always taken with zero delay.

Instruction producing result   Instruction using result   Stall cycles before use
FP operation                   Another FP operation       3
FP operation                   FP store                   2
FP load                        FP operation               2
Integer ALU operation          Another ALU operation      0

a) (4 pts) Assume a single-issue pipeline. Show the loop code including the stall cycles. What is the number of cycles per iteration to compute one element of the array Y?

b) (5 pts) Show how the compiler would schedule the loop to minimize stall cycles. What is the number of cycles per iteration to compute one element of the array Y?

c) (8 pts) Unroll the loop four times and schedule it to reduce stall cycles, collapsing the loop overhead instructions. Show the instruction schedule. What is the average number of cycles to compute one element of the array Y?

d) (8 pts) Consider a VLIW processor with long instructions that contain three operation slots. Slot 0 is reserved for Memory operations, Slot 1 is reserved for FPU operations, while Slot 2 can be used by ALU or branch operations. Using the unrolled loop of part c (unrolled four times), show a schedule of the VLIW instructions + cycle time that minimizes the clock cycles. What percent of the operation slots are used? What is the average number of cycles to compute one element of the array Y?
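For parts a and b, the cycle counts can be checked with a small in-order, single-issue model. The Python sketch that follows is a simplified illustration, not a full pipeline simulator. It assumes that producer/consumer pairs not listed in the stall table cost 0 stall cycles, that the branch costs one issue slot with zero delay, and it counts only the cycles of one iteration. The reordered program is one possible schedule, in which the store uses offset -8(R2) because R2 is incremented before it.

```python
# Stall cycles between a producer and a dependent consumer (from the table).
STALLS = {
    ("fp_op", "fp_op"): 3,
    ("fp_op", "fp_store"): 2,
    ("fp_load", "fp_op"): 2,
    ("alu", "alu"): 0,
}


def cycles_per_iteration(program):
    """Issue instructions in order, one per cycle, inserting required stalls.

    program: list of (kind, destination register or None, source registers).
    Returns the cycle in which the last instruction issues.
    """
    produced = {}  # register -> (issue cycle, kind of the producing instruction)
    cycle = 0
    for kind, dest, sources in program:
        issue = cycle + 1
        for reg in sources:
            if reg in produced:
                when, producer = produced[reg]
                # Unlisted pairs are ASSUMED to need 0 stall cycles.
                issue = max(issue, when + STALLS.get((producer, kind), 0) + 1)
        if dest is not None:
            produced[dest] = (issue, kind)
        cycle = issue
    return cycle


original = [
    ("fp_load", "F2", ["R1"]),         # L.D   F2, 0(R1)
    ("fp_op", "F4", ["F2", "F0"]),     # MUL.D F4, F2, F0
    ("fp_load", "F6", ["R2"]),         # L.D   F6, 0(R2)
    ("fp_op", "F6", ["F4", "F6"]),     # ADD.D F6, F4, F6
    ("fp_store", None, ["F6", "R2"]),  # S.D   F6, 0(R2)
    ("alu", "R1", ["R1"]),             # ADDIU R1, R1, 8
    ("alu", "R2", ["R2"]),             # ADDIU R2, R2, 8
    ("branch", None, ["R1", "R3"]),    # BNE   R1, R3, loop
]

scheduled = [
    ("fp_load", "F2", ["R1"]),         # L.D   F2, 0(R1)
    ("fp_load", "F6", ["R2"]),         # L.D   F6, 0(R2)
    ("alu", "R1", ["R1"]),             # ADDIU R1, R1, 8
    ("fp_op", "F4", ["F2", "F0"]),     # MUL.D F4, F2, F0
    ("alu", "R2", ["R2"]),             # ADDIU R2, R2, 8
    ("fp_op", "F6", ["F4", "F6"]),     # ADD.D F6, F4, F6
    ("fp_store", None, ["F6", "R2"]),  # S.D   F6, -8(R2)
    ("branch", None, ["R1", "R3"]),    # BNE   R1, R3, loop
]

print(cycles_per_iteration(original))   # 14
print(cycles_per_iteration(scheduled))  # 12
```

In this model the dependence chain L.D, MUL.D, ADD.D, S.D, BNE fixes a lower bound of 12 cycles for a single iteration, which is why unrolling (part c) is needed to hide the remaining stalls.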


Q5 (15 pts) The bus-based dual-core multiprocessor shown below represents a symmetric shared memory architecture. Each processor has a private L1 write-back cache. Coherence is maintained using the MESI write-invalidate snooping protocol. For simplicity, each cache is direct-mapped with four blocks indexed from 0 to 3. Each block holds two words (8 bytes). For clarity, the tag contains the full address in hexadecimal, while the data is shown in decimal.

Each part of this exercise specifies a memory operation. Treat each operation as independently applied to the initial state given in Figure 1. For each operation, give the value returned by a read, the bus transaction (if any), and the resulting state, tag, and value in the caches and memory. Show the content of the relevant updated cache blocks.

a) P0 reads address 0x
b) P1 reads address 0x
c) P1 writes address 0x100 ← 28
d) P1 writes address 0x130 ← 16
e) P0 reads address 0x

Figure 1: Bus-based dual-core multiprocessor

Memory
addr    block data
0x100   32  15
0x108    5  17
0x110    0  10
0x118    3   1
0x120    6   8
0x128   10  25
0x130   27   9

Processor P
state   tag     block data
S       0x100   32  15
M       0x128    7   2
E       0x130   27   9
S       0x118    3   1

Processor P
state   tag     block data
S       0x100   32  15
M       0x108    5  12
E       0x110    0  10
S       0x118    3   1
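The MESI transitions needed for this exercise can be summarized in a small table-driven sketch. The Python code that follows models a single block in two caches. The bus transaction names `BusRd`, `BusRdX`, and `BusUpgr` are conventional textbook labels and are an assumption here, since the exam does not name them. Write-backs and data values are not modeled.

```python
# Requesting cache: (state, operation) -> (bus transaction or None, next state).
# A read miss from state I is handled separately because the next state
# depends on whether the other cache holds a copy.
LOCAL = {
    ("M", "read"): (None, "M"),
    ("M", "write"): (None, "M"),
    ("E", "read"): (None, "E"),
    ("E", "write"): (None, "M"),       # silent upgrade, no bus transaction
    ("S", "read"): (None, "S"),
    ("S", "write"): ("BusUpgr", "M"),  # invalidate the other sharers
    ("I", "write"): ("BusRdX", "M"),
}

# Snooping cache: (state, observed bus transaction) -> next state.
SNOOP = {
    ("M", "BusRd"): "S", ("E", "BusRd"): "S", ("S", "BusRd"): "S",
    ("M", "BusRdX"): "I", ("E", "BusRdX"): "I", ("S", "BusRdX"): "I",
    ("S", "BusUpgr"): "I",
}


def access(local_state, remote_state, op):
    """Return (bus transaction, new local state, new remote state) for one block."""
    if (local_state, op) == ("I", "read"):
        bus = "BusRd"
        new_local = "S" if remote_state != "I" else "E"
    else:
        bus, new_local = LOCAL[(local_state, op)]
    new_remote = SNOOP.get((remote_state, bus), remote_state) if bus else remote_state
    return bus, new_local, new_remote


print(access("S", "S", "write"))  # ('BusUpgr', 'M', 'I')
print(access("I", "M", "read"))   # ('BusRd', 'S', 'S')
```

Both caches in Figure 1 hold block 0x100 in state S, so the write in part c corresponds to `access("S", "S", "write")`: the writer moves to M and the other copy is invalidated.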

Q6 (20 pts) Consider the following MIPS loop:

loop: LD     R2, 0(R1)      ; Load R2 = Memory(R1)
      L.D    F2, 0(R2)      ; Load F2 = Memory(R2)
      ADD.D  F2, F2, F4     ; F2 = F2 + F4
      S.D    F2, 0(R2)      ; Store Memory(R2) = F2
      ADD    R1, R1, 4      ; R1 = R1 + 4
      BNE    R1, R5, loop   ; Branch if (R1 != R5)

a) (10 pts) Consider the execution of the above loop on an out-of-order pipeline with a reorder buffer. The pipeline has an instruction fetch (IF) stage, an instruction decode (ID) stage, and an issue (IS) stage with a sufficient number of reservation stations. There are three independent function units: an ALU (EX stage for integer and branch instructions), an FP unit for floating-point instructions (4 pipeline stages: FP1, FP2, FP3, and FP4), and a memory unit for load and store instructions (2 pipeline stages: A for address calculation and M for data cache access). The write-back (WB) stage writes the result of an instruction into the reorder buffer (ROB). Finally, the commit (C) stage commits the instruction at the front of the reorder buffer by updating the register file (or memory in the case of a store instruction).

Once the address of a store instruction has been computed by the address unit (A stage), the store instruction waits in a Store Buffer (SB) until it is committed. A load instruction is allowed to bypass a previous store waiting in the store buffer if the load address is different from the store address.

Draw a timing diagram showing the execution of the first two iterations of the above loop on the out-of-order pipeline. Draw an arrow indicating the forwarding of a data result on the common data bus to a reservation station or to the store buffer. Also show conflicting writes on the common data bus. Assume that the loop branch is predicted to be always taken.

b) (10 pts) Consider a superscalar pipeline that fetches two instructions, decodes two instructions, issues at most three instructions because there are three function units, writes two results into the ROB, and commits the results of at most two instructions each cycle. Redraw the timing diagram of the first two iterations of the above loop on the superscalar processor, given that the loop branch is predicted to be always taken.
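The store-buffer rule stated in part a can be illustrated with a short check. This Python sketch is a simplification under stated assumptions: every store in the buffer already has its address computed, addresses are compared for exact equality, and the function name and the example addresses are hypothetical.

```python
def load_may_bypass(load_addr, store_buffer):
    """True if no uncommitted store in the buffer targets the load's address."""
    return all(store_addr != load_addr for store_addr in store_buffer)


# Hypothetical addresses of stores waiting in the store buffer for commit.
pending_stores = [0x2000, 0x2008]

print(load_may_bypass(0x3000, pending_stores))  # True: different address, load may proceed
print(load_may_bypass(0x2008, pending_stores))  # False: same address, load must not bypass
```

In the loop of this question, the second iteration's L.D reads through a pointer loaded from memory, so whether it may bypass the first iteration's S.D depends on whether the two pointers are equal.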

Q6 Solution

Part a: [Blank timing diagram: columns for clock cycles 1 to 27, one row per instruction for two iterations of the loop (LD, L.D, ADD.D, S.D, ADD, BNE).]

Part b: [Blank timing diagram with the same layout: clock cycles 1 to 27, two iterations of the loop (LD, L.D, ADD.D, S.D, ADD, BNE).]