Exercises Computation II 5EIB0: Answers and Solutions, Exercises of Computer Science

Computer Architecture Questions

Typology: Exercises

2018/2019

Uploaded on 06/24/2019

kristikapllani
Exercises Computation II 5EIB0
Answers
Answer 1
               N_instructions   CPI   T_cycle   T_execution
------------------------------------------------------------------
Single cycle   10000            1     2.0 ns    20000 ns
Multi cycle    10000            4     0.4 ns    16000 ns
Pipelined      10000            1     0.4 ns    4000 ns
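The last column follows from T_execution = N_instructions * CPI * T_cycle. A minimal Python sketch that recomputes it from the other three columns; the function name execution_time_ns is illustrative and not part of the exercise, and pipeline fill time is ignored, as in the table:

```python
def execution_time_ns(n_instructions, cpi, t_cycle_ns):
    """Execution time in ns: instruction count * cycles per instruction * cycle time."""
    return n_instructions * cpi * t_cycle_ns


# (N_instructions, CPI, T_cycle in ns) taken from the table in Answer 1.
designs = {
    "Single cycle": (10000, 1, 2.0),
    "Multi cycle": (10000, 4, 0.4),
    "Pipelined": (10000, 1, 0.4),
}

results = {name: execution_time_ns(*params) for name, params in designs.items()}
for name, t_ns in results.items():
    print(f"{name}: {t_ns:.0f} ns")
```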
Answer 2
a. CPI_ideal = 1/3
b. CPI_branch = CPI_ideal + f_branch * f_wrong * BranchPenalty
= 0.33 + 0.15 * 0.05 * 19
= 0.476
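As a quick check of part b, the formula can be evaluated directly. This is only a sketch; the function name cpi_with_branches is an assumption made for illustration, and CPI_ideal is taken as exactly 1/3 rather than the rounded 0.33:

```python
def cpi_with_branches(cpi_ideal, f_branch, f_wrong, branch_penalty):
    """CPI including the average cost of mispredicted branches."""
    return cpi_ideal + f_branch * f_wrong * branch_penalty


# Values from Answer 2: 15% branches, 5% mispredicted, 19-cycle penalty.
cpi_branch = cpi_with_branches(1 / 3, 0.15, 0.05, 19)
print(f"CPI_branch = {cpi_branch:.3f}")
```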
Answer 3
6 stall cycles.
lw $t0, 0($t2)
2 stall cycles
lw $t1, 4($t0)
2 stall cycles
sub $s5, $t1, $t2
2 stall cycles
sw $s5, 4($t0)
Answer 4
An extra (third) read port is needed.
Check the MIPS pipelined data path figures!

Answer 5

CPI = CPI_ideal + f_inst * I_missrate * I_misspenalty

                + f_data * D_missrate * D_misspenalty

    = 2 + 1 * 0.05 * 20 + 0.3 * 0.1 * 20 = 3.6 cycles

Slowdown is T_new / T_old

= (N_instr_new * CPI_new * T_cycle_new) / (N_instr_old * CPI_old * T_cycle_old)

= CPI_new / CPI_ideal = 3.6 / 2.0 = 1.8

(So if the program would take 1000 cycles with an ideal cache, it takes 1800 cycles with the real one: an 80% slowdown.)

Note that N_instr and T_cycle do not change.
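The same calculation as a small Python sketch. The function and parameter names are illustrative assumptions; the numbers are the ones used in Answer 5:

```python
def cpi_with_cache(cpi_ideal, f_inst, i_missrate, i_misspenalty,
                   f_data, d_missrate, d_misspenalty):
    """Ideal CPI plus the average stall cycles from instruction and data misses."""
    instruction_stalls = f_inst * i_missrate * i_misspenalty
    data_stalls = f_data * d_missrate * d_misspenalty
    return cpi_ideal + instruction_stalls + data_stalls


cpi_ideal = 2.0
cpi_real = cpi_with_cache(cpi_ideal, 1.0, 0.05, 20, 0.3, 0.1, 20)

# N_instr and T_cycle cancel out, so the slowdown is the ratio of the CPIs.
slowdown = cpi_real / cpi_ideal
print(f"CPI = {cpi_real:.1f}, slowdown = {slowdown:.1f}")
```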

Answer 6

Tag bits = 32 - index - word_offset - byte_offset = 32 - 8 - 1 - 2 = 21 bits

Cache size = 4 * 2^8 * (valid bit + tag bits + block bits)

= 2^10 * (1 + 21 + 64)

= 86 kbit
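A hedged sketch of the same bookkeeping in Python. It assumes what the formula above implies: 4 ways, 2^8 sets, 32-bit words (2 byte-offset bits) and one valid bit per line. The function name cache_size_bits is made up for this example:

```python
def cache_size_bits(address_bits, index_bits, word_offset_bits,
                    byte_offset_bits, ways, word_bits=32):
    """Return (tag_bits, total_bits) for a set-associative cache."""
    tag_bits = address_bits - index_bits - word_offset_bits - byte_offset_bits
    block_bits = (1 << word_offset_bits) * word_bits  # data bits per line
    bits_per_line = 1 + tag_bits + block_bits         # valid bit + tag + data
    total_bits = ways * (1 << index_bits) * bits_per_line
    return tag_bits, total_bits


tag_bits, total_bits = cache_size_bits(32, 8, 1, 2, ways=4)
print(f"tag bits = {tag_bits}, cache size = {total_bits // 1024} kbit")
```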

Answer 7

The data memory access pattern is: 100, 108, 104, 112, 108, 116, 112, 120

This maps to words: 0, 2, 1, 3, 2, 4, 3, 5

1-word block: M M M M H M H M -> 25% hit rate

2-word block: M M H H H M H H -> 62.5% hit rate

4-word block: M H H H H M H H -> 75% hit rate

Note that there are no capacity or conflict misses. (Making the cache smaller and/or the offset of the second load bigger can introduce these misses.)
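The hit/miss patterns can be reproduced with a very small simulation. This sketch assumes a cache large enough that only cold (compulsory) misses occur, as noted above, and works on the word numbers rather than the byte addresses; hit_miss_pattern is a hypothetical helper name:

```python
def hit_miss_pattern(word_indices, words_per_block):
    """Return a list of 'H'/'M' for each access, with only cold misses modeled."""
    cached_blocks = set()
    pattern = []
    for word in word_indices:
        block = word // words_per_block  # block that contains this word
        if block in cached_blocks:
            pattern.append("H")
        else:
            pattern.append("M")
            cached_blocks.add(block)
    return pattern


words = [0, 2, 1, 3, 2, 4, 3, 5]
for words_per_block in (1, 2, 4):
    pattern = hit_miss_pattern(words, words_per_block)
    hit_rate = pattern.count("H") / len(pattern)
    print(f"{words_per_block}-word block: {' '.join(pattern)} -> {hit_rate:.1%}")
```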

The peak DDR3 bandwidth =

#Partitions * #bytes/transfer * #transfers/clock * #clocks/sec =

8 * 8 * 2 * 1G = 128 GB/sec

a. Total DDR3 RAM size = 8 * 256 MB = 2048 MB

Single-precision (SP) floating-point values are 32 bits (4 bytes) each.

So, if we want to store 3 n*n SP matrices, the maximum n follows from

3 * n^2 * 4 <= 2048 * 1024 * 1024

n_max = 13377

b. For each element of the result, we need n multiply-adds.

For each row of the result, we need n * n multiply-adds.

For the entire result matrix, we need n * n * n multiply-adds.

Thus, 2393 GFlop in total (13377^3 is about 2393 * 10^9 multiply-adds).

Per multiply-add we need to load 2 source operands, 4 bytes each.

Now a discussion is needed about the bottleneck: either processing or memory bandwidth. Without caching, memory clearly determines the execution speed; with optimal caching (using tiling), it is the processing.

b.a. Assuming cache: loading of 2 matrices and storing of 1 to the graphics memory. That is 3 * n^2

= 512 GB of data =>

t_memory = 512 / 128 = 4 seconds.

t_processing = 2393 / 192 = 12.46 seconds

t_total = 16.46 seconds.

b.b. No cache: 2393 GFlop require 2393 * 2 * 4 Gbytes = 19144 Gbytes (note: storing the result can be neglected in this case) =>

t_memory = 19144 / 128 = 149.6 seconds

t_total = 149.6 + 12.5 = 162.1 seconds
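The peak bandwidth, n_max and the no-cache timing can be recomputed with a short Python sketch. Assumptions are labeled in the comments: 1G is taken as 10^9, and the processing rate of 192 G multiply-adds per second is inferred from the division 2393 / 192 above rather than stated in the text. Results differ slightly from the rounded figures because n^3 is not rounded down to 2393 G here:

```python
import math

PEAK_BANDWIDTH = 8 * 8 * 2 * 1e9          # bytes/s: partitions * bytes * transfers/clock * clocks/s
MEMORY_BYTES = 8 * 256 * 1024 * 1024      # 8 partitions of 256 MB
BYTES_PER_ELEMENT = 4                     # single precision
PEAK_MULTIPLY_ADDS_PER_SEC = 192e9        # assumption: inferred from "2393 / 192"

# Largest n such that three n*n SP matrices fit: 3 * n^2 * 4 <= MEMORY_BYTES.
n_max = math.isqrt(MEMORY_BYTES // (3 * BYTES_PER_ELEMENT))

multiply_adds = n_max ** 3
t_processing = multiply_adds / PEAK_MULTIPLY_ADDS_PER_SEC

# No cache: every multiply-add loads 2 operands of 4 bytes from memory.
t_memory_no_cache = multiply_adds * 2 * BYTES_PER_ELEMENT / PEAK_BANDWIDTH

print(f"n_max = {n_max}")
print(f"t_processing = {t_processing:.2f} s")
print(f"t_memory (no cache) = {t_memory_no_cache:.1f} s")
```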

Answer 13

2D grid/mesh: n^2 nodes, n = 4 in picture

Diameter: n = 4

Nodal degree: 4 (assuming unidirectional links)

Network Bandwidth: 2 * P * B = 2 * n^2 * B = 2 * 16 * B = 32 * B

Bisection Bandwidth: 2 * n * B = 8 * B

n-cube tree: 2^n nodes, n = 3 in picture

Diameter: n = 3

Nodal degree: 2 * n = 2 * 3 = 6 (bidirectional links)

Network Bandwidth: N_links * B = 2^n * 2n / 2 * B = 24 * B

Bisection Bandwidth: 2^n * 2 / 2 * B = 2^n * B = 8 * B
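The cube figures can be cross-checked by enumerating the graph. The sketch below builds an n-cube in which nodes are numbered 0 .. 2^n - 1 and neighbors differ in exactly one bit. It counts every bidirectional link as two unidirectional ones, which is an assumption chosen to match the link counts of 24 and 8 given above. The function names are illustrative:

```python
from collections import deque


def hypercube_neighbors(node, n):
    """Neighbors of a node in an n-cube: flip each of the n address bits."""
    return [node ^ (1 << bit) for bit in range(n)]


def hypercube_metrics(n):
    """Return (diameter, unidirectional link count, unidirectional bisection links)."""
    nodes = range(2 ** n)

    # Breadth-first search from node 0; by symmetry its eccentricity is the diameter.
    distance = {0: 0}
    pending = deque([0])
    while pending:
        current = pending.popleft()
        for neighbor in hypercube_neighbors(current, n):
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                pending.append(neighbor)
    diameter = max(distance.values())

    # Each node has n outgoing links.
    links = sum(len(hypercube_neighbors(node, n)) for node in nodes)

    # Cut the cube in two halves on the highest address bit and count crossing links.
    top_bit = 1 << (n - 1)
    bisection = sum(
        1
        for node in nodes
        for neighbor in hypercube_neighbors(node, n)
        if (node & top_bit) != (neighbor & top_bit)
    )
    return diameter, links, bisection


print(hypercube_metrics(3))
```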

Scalability

  • pro mesh: constant nodal degree (so cheap), easier to layout in 2 dimensions
  • pro cube: short diameter

Answer 14

a. In a shared memory system: by using (regular) loads and stores.

In a message passing system: by sending and receiving messages.
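The two communication styles can be contrasted in a few lines of Python. This is only an analogy under stated assumptions: Python threads always share one address space, so queue.Queue merely stands in for a message channel between nodes, and all names are invented for the example:

```python
import queue
import threading

# Shared memory style: the producer stores to a shared location,
# and the consumer later reads it with an ordinary load.
shared = {"value": 0}
lock = threading.Lock()


def shared_memory_producer():
    with lock:  # synchronization is the programmer's job
        shared["value"] = 42


producer = threading.Thread(target=shared_memory_producer)
producer.start()
producer.join()
shared_result = shared["value"]

# Message passing style: the producer sends a message,
# and the consumer blocks until it receives one.
channel = queue.Queue()


def message_producer():
    channel.put(42)


sender = threading.Thread(target=message_producer)
sender.start()
message_result = channel.get()
sender.join()

print(shared_result, message_result)
```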

b. Yes, the address space can be fully shared while the memories are physically distributed. This means that loads and stores can address all locations, including those in other cores.

c. Pros of shared memory:

  • well-known programming model;
  • large data structures can be shared (and passed by reference)
  • no memory fragmentation losses

Cons of shared memory:

  • synchronization, coherence and consistency issues have to be addressed