








































































Includes solution to textbook by Patterson
Chapter 1 Solutions
1.1 Personal computer (includes workstation and laptop): Personal computers emphasize delivery of good performance to single users at low cost and usually execute third-party software.
Personal mobile device (PMD, includes tablets): PMDs are battery operated with wireless connectivity to the Internet and typically cost hundreds of dollars, and, like PCs, users can download software (“apps”) to run on them. Unlike PCs, they no longer have a keyboard and mouse, and are more likely to rely on a touch-sensitive screen or even speech input.
Server: Computer used to run large problems and usually accessed via a network.
Warehouse scale computer: Thousands of processors forming a large cluster.
Supercomputer: Computer composed of hundreds to thousands of processors and terabytes of memory.
Embedded computer: Computer designed to run one application or one set of related applications and integrated into a single system.
1.2
a. Performance via Pipelining
b. Dependability via Redundancy
c. Performance via Prediction
d. Make the Common Case Fast
e. Hierarchy of Memories
f. Performance via Parallelism
g. Design for Moore’s Law
h. Use Abstraction to Simplify Design
1.3 The program is compiled into an assembly language program, which is then assembled into a machine language program.
a. 1280 × 1024 pixels = 1,310,720 pixels => 1,310,720 × 3 = 3,932,160 bytes/frame.
b. 3,932,160 bytes × (8 bits/byte) / 100E6 bits/second = 0.31 seconds
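The arithmetic in parts a and b can be checked with a short Python sketch. The function names are illustrative only; the 3 bytes per pixel and the 100E6 bits/second link rate are the figures used in the answer above.

```python
def frame_bytes(width, height, bytes_per_pixel=3):
    """Size of one frame buffer in bytes (3 bytes per pixel, as in part a)."""
    return width * height * bytes_per_pixel


def transfer_seconds(n_bytes, bits_per_second):
    """Time to send n_bytes over a link of the given bit rate."""
    return n_bytes * 8 / bits_per_second


size = frame_bytes(1280, 1024)
print(size)                                      # 3932160
print(round(transfer_seconds(size, 100e6), 2))   # 0.31
```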
a. performance of P1 (instructions/sec) = 3 × 10^9 / 1.5 = 2 × 10^9
performance of P2 (instructions/sec) = 2.5 × 10^9 / 1.0 = 2.5 × 10^9
performance of P3 (instructions/sec) = 4 × 10^9 / 2.2 = 1.8 × 10^9
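As a quick check of part a, the following Python sketch computes instructions per second as clock rate divided by CPI. The dictionary layout and function name are illustrative; the clock rates and CPI values are those appearing in the answer above.

```python
# (clock rate in Hz, CPI) for each processor, taken from the answer above
processors = {"P1": (3e9, 1.5), "P2": (2.5e9, 1.0), "P3": (4e9, 2.2)}


def instructions_per_second(clock_hz, cpi):
    """Performance in instructions/sec = clock rate / CPI."""
    return clock_hz / cpi


perf = {name: instructions_per_second(hz, cpi)
        for name, (hz, cpi) in processors.items()}
fastest = max(perf, key=perf.get)

for name, value in perf.items():
    print(f"{name}: {value:.2e} instructions/sec")
print("highest performance:", fastest)
```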
1.8.1 Pentium 4: C = 3.2E-8 F
Core i5 Ivy Bridge: C = 2.9E-8 F
1.8.2 Pentium 4: 10/100 = 10%
Core i5 Ivy Bridge: 30/70 = 42.9%
1.8.3 (S_new + D_new)/(S_old + D_old) = 0.90
D_new = C × V_new^2 × F
S_old = V_old × I
S_new = V_new × I
Therefore:
V_new = [D_new/(C × F)]^(1/2)
D_new = 0.90 × (S_old + D_old) - S_new
S_new = V_new × (S_old/V_old)
Pentium 4:
S_new = V_new × (10/1.25) = V_new × 8
D_new = 0.90 × 100 - V_new × 8 = 90 - V_new × 8
V_new = [(90 - V_new × 8)/(3.2E-8 × 3.6E9)]^(1/2)
V_new = 0.85 V
Core i5:
S_new = V_new × (30/0.9) = V_new × 33.3
D_new = 0.90 × 70 - V_new × 33.3 = 63 - V_new × 33.3
V_new = [(63 - V_new × 33.3)/(2.9E-8 × 3.4E9)]^(1/2)
V_new = 0.64 V
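The voltage equation in 1.8.3 has V_new on both sides. One way to solve it, sketched here in Python, is to rewrite it as the quadratic C × F × V^2 + (S_old/V_old) × V - 0.90 × (S_old + D_old) = 0 and take the positive root. The function name and parameter names are illustrative; the numeric inputs are the ones quoted in the answer, and the closed-form approach is a substitute for whatever iteration the original solution used.

```python
import math


def new_voltage(c_farads, f_hz, s_old, v_old, total_old, ratio=0.90):
    """Positive root of C*F*V^2 + (S_old/V_old)*V - ratio*total_old = 0."""
    a = c_farads * f_hz          # coefficient of V^2 (dynamic power term)
    b = s_old / v_old            # coefficient of V (static power term)
    target = ratio * total_old   # allowed total power after the reduction
    return (-b + math.sqrt(b * b + 4 * a * target)) / (2 * a)


v_p4 = new_voltage(3.2e-8, 3.6e9, s_old=10, v_old=1.25, total_old=100)
v_i5 = new_voltage(2.9e-8, 3.4e9, s_old=30, v_old=0.9, total_old=70)
print(f"Pentium 4: {v_p4:.3f} V, Core i5: {v_i5:.3f} V")
```

The Pentium 4 root rounds to the 0.85 V quoted above. The Core i5 root comes out within about 0.01 V of the quoted 0.64 V; the small gap is presumably due to rounding in the hand calculation.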
1.9.1
p   # arith inst.   # L/S inst.   # branch inst.   cycles    ex. time   speedup
1   2.56E9          1.28E9        2.56E8           7.94E10   39.7       1
2   1.83E9          9.14E8        2.56E8           5.67E10   28.3       1.
4   9.12E8          4.57E8        2.56E8           2.83E10   14.2       2.
8   4.57E8          2.29E8        2.56E8           1.42E10   7.10       5.
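The execution time and speedup columns follow from the cycle counts. A minimal Python sketch, assuming a 2 GHz clock (an inference from the table itself, since 7.94E10 cycles corresponds to 39.7 s):

```python
CLOCK_HZ = 2e9  # assumption: inferred from cycles / ex. time in the table

# cycle counts per processor count p, copied from the table
cycles = {1: 7.94e10, 2: 5.67e10, 4: 2.83e10, 8: 1.42e10}

# execution time = cycles / clock rate
times = {p: c / CLOCK_HZ for p, c in cycles.items()}
# speedup is measured relative to the single-processor time
speedups = {p: times[1] / t for p, t in times.items()}

for p in cycles:
    print(f"p={p}: time={times[p]:.2f} s, speedup={speedups[p]:.2f}")
```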
1.9.2
p   ex. time
1   41.
2   29.
4   14.
8   7.
1.9.3 3
1.10.1 die area_15cm = wafer area/dies per wafer = pi × 7.5^2 / 84 = 2.10 cm^2
yield_15cm = 1/(1 + (0.020 × 2.10/2))^2 = 0.9593
die area_20cm = wafer area/dies per wafer = pi × 10^2 / 100 = 3.14 cm^2
yield_20cm = 1/(1 + (0.031 × 3.14/2))^2 = 0.9093
1.10.2 cost/die_15cm = 12/(84 × 0.9593) = 0.
cost/die_20cm = 15/(100 × 0.9093) = 0.
1.10.3 die area_15cm = wafer area/dies per wafer = pi × 7.5^2 / (84 × 1.1) = 1.91 cm^2
yield_15cm = 1/(1 + (0.020 × 1.15 × 1.91/2))^2 = 0.
die area_20cm = wafer area/dies per wafer = pi × 10^2 / (100 × 1.1) = 2.86 cm^2
yield_20cm = 1/(1 + (0.031 × 1.15 × 2.86/2))^2 = 0.
1.10.4 defects per area_0.92 = (1 - y^.5)/(y^.5 × die_area/2) = (1 - 0.92^.5)/(0.92^.5 × 2/2) = 0.043 defects/cm^2
defects per area_0.95 = (1 - y^.5)/(y^.5 × die_area/2) = (1 - 0.95^.5)/(0.95^.5 × 2/2) = 0.026 defects/cm^2
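The yield and cost formulas used in 1.10.1 through 1.10.4 can be evaluated with the following Python sketch. Function names are illustrative; the wafer diameters, die counts, defect densities and wafer costs are the ones appearing in the expressions above. Because the sketch does not round the die area to 2.10 or 3.14 first, the last digit may differ slightly from the hand-computed values.

```python
import math


def die_area(wafer_diameter_cm, dies_per_wafer):
    """Die area = wafer area / dies per wafer."""
    return math.pi * (wafer_diameter_cm / 2) ** 2 / dies_per_wafer


def die_yield(defects_per_cm2, area_cm2):
    """yield = 1 / (1 + defects * area / 2)^2"""
    return 1 / (1 + defects_per_cm2 * area_cm2 / 2) ** 2


def cost_per_die(wafer_cost, dies_per_wafer, y):
    """cost/die = wafer cost / (dies per wafer * yield)"""
    return wafer_cost / (dies_per_wafer * y)


def defect_density(y, area_cm2):
    """Inverse of die_yield: defects/cm^2 needed for a target yield."""
    root = math.sqrt(y)
    return (1 - root) / (root * area_cm2 / 2)


y15 = die_yield(0.020, die_area(15, 84))
y20 = die_yield(0.031, die_area(20, 100))
print(f"yield 15cm: {y15:.4f}, yield 20cm: {y20:.4f}")
print(f"cost/die 15cm: {cost_per_die(12, 84, y15):.4f}")
print(f"cost/die 20cm: {cost_per_die(15, 100, y20):.4f}")
print(f"defects for 0.92 yield: {defect_density(0.92, 2):.3f}")
print(f"defects for 0.95 yield: {defect_density(0.95, 2):.3f}")
```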
1.11.1 CPI = clock rate × CPU time/instr. count
clock rate = 1/cycle time = 3 GHz
CPI(bzip2) = 3 × 10^9 × 750/(2389 × 10^9) = 0.
1.11.2 SPEC ratio = ref. time/execution time
SPEC ratio(bzip2) = 9650/750 = 12.
1.11.3 CPU time = No. instr. × CPI/clock rate
If CPI and clock rate do not change, the CPU time increase is equal to the increase in the number of instructions, that is 10%.
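A short Python sketch of the three relations used in 1.11.1 through 1.11.3. The helper names are illustrative; the inputs (3 GHz, 750 s, 2389 × 10^9 instructions, 9650 s reference time) are the figures in the expressions above.

```python
def cpi(clock_hz, cpu_time_s, instr_count):
    """CPI = clock rate * CPU time / instruction count."""
    return clock_hz * cpu_time_s / instr_count


def spec_ratio(ref_time_s, exec_time_s):
    """SPEC ratio = reference time / execution time."""
    return ref_time_s / exec_time_s


def cpu_time(instr_count, cpi_value, clock_hz):
    """CPU time = instruction count * CPI / clock rate."""
    return instr_count * cpi_value / clock_hz


bzip2_cpi = cpi(3e9, 750, 2389e9)
bzip2_ratio = spec_ratio(9650, 750)
# 10% more instructions at unchanged CPI and clock rate
slowdown = cpu_time(1.1 * 2389e9, bzip2_cpi, 3e9) / cpu_time(2389e9, bzip2_cpi, 3e9)

print(f"CPI(bzip2) = {bzip2_cpi:.3f}")
print(f"SPEC ratio(bzip2) = {bzip2_ratio:.2f}")
print(f"CPU time grows by a factor of {slowdown:.2f}")
```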
MIPS(P1) > MIPS(P2), performance(P1) < performance(P2) (from 11a)
1.12.4 MFLOPS = No. FP operations × 10^-6 / T
MFLOPS(P1) = .4 × 5E9 × 1E-6/1.125 = 1.78E3
MFLOPS(P2) = .4 × 1E9 × 1E-6/.25 = 1.60E3
MFLOPS(P1) > MFLOPS(P2), performance(P1) < performance(P2) (from 11a)
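The MFLOPS comparison in 1.12.4 can be reproduced with a few lines of Python. The function name is illustrative; the FP fraction (.4), instruction counts and execution times are those in the expressions above.

```python
def mflops(fp_fraction, instr_count, exec_time_s):
    """MFLOPS = number of FP operations * 1e-6 / execution time."""
    return fp_fraction * instr_count * 1e-6 / exec_time_s


p1 = mflops(0.4, 5e9, 1.125)
p2 = mflops(0.4, 1e9, 0.25)
print(f"MFLOPS(P1) = {p1:.0f}, MFLOPS(P2) = {p2:.0f}")
# P1 has the higher MFLOPS rating even though P2 finishes sooner
# (0.25 s versus 1.125 s), which is the point of the exercise.
print("P1 rated higher:", p1 > p2)
```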
1.13.1 T_fp = 70 × 0.8 = 56 s. T_new = 56 + 85 + 55 + 40 = 236 s. Reduction: 5.6%
1.13.2 T_new = 250 × 0.8 = 200 s, T_fp + T_l/s + T_branch = 165 s, T_int = 35 s. Reduction time INT: 58.8%
1.13.3 T_new = 250 × 0.8 = 200 s, T_fp + T_int + T_l/s = 210 s. NO
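The three parts of 1.13 all reduce to subtracting the unchanged components from a target total. The Python sketch below restates that. The assignment of 85 s to INT and 55 s to L/S is an assumption inferred from the sums 165 and 210 in the answer; the variable names are illustrative.

```python
# Component times in seconds; the 85/55 split between INT and L/S is inferred
# from the sums quoted in the answer (165 = 70 + 55 + 40, 210 = 70 + 85 + 55).
times = {"fp": 70, "int": 85, "ls": 55, "branch": 40}
total = sum(times.values())  # 250

# 1.13.1: FP time reduced by 20%
new_total = total - times["fp"] + 0.8 * times["fp"]
reduction_total = (total - new_total) / total

# 1.13.2: total must drop by 20%; only INT time may change
target = 0.8 * total
needed_int = target - (total - times["int"])
reduction_int = 1 - needed_int / times["int"]

# 1.13.3: same target, but only branch time may change
needed_branch = target - (total - times["branch"])

print(f"1.13.1: {new_total:.0f} s, reduction {reduction_total:.1%}")
print(f"1.13.2: T_int = {needed_int:.0f} s, reduction {reduction_int:.1%}")
print(f"1.13.3: branch time would have to be {needed_branch:.0f} s")
```

A negative required branch time is what the answer "NO" in 1.13.3 expresses.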
1.14.1 Clock cycles = CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr.
T_CPU = clock cycles/clock rate = clock cycles/(2 × 10^9)
clock cycles = 512 × 10^6; T_CPU = 0.256 s
To halve the number of clock cycles by improving the CPI of FP instructions:
CPI_improved fp × No. FP instr. + CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr. = clock cycles/2
CPI_improved fp = (clock cycles/2 - (CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr.)) / No. FP instr.
CPI_improved fp = (256 - 462)/50 < 0 ==> not possible
1.14.2 Using the clock cycle data from a.
To halve the number of clock cycles by improving the CPI of L/S instructions:
CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_improved l/s × No. L/S instr. + CPI_branch × No. branch instr. = clock cycles/2
CPI_improved l/s = (clock cycles/2 - (CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_branch × No. branch instr.)) / No. L/S instr.
CPI_improved l/s = (256 - 198)/80 = 0.
1.14.3 Clock cycles = CPI_fp × No. FP instr. + CPI_int × No. INT instr. + CPI_l/s × No. L/S instr. + CPI_branch × No. branch instr.
T_CPU = clock cycles/clock rate = clock cycles/(2 × 10^9)
CPI_int = 0.6 × 1 = 0.6; CPI_fp = 0.6 × 1 = 0.6; CPI_l/s = 0.7 × 4 = 2.8; CPI_branch = 0.7 × 2 = 1.4
T_CPU (before improv.) = 0.256 s; T_CPU (after improv.) = 0.171 s
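The "improved CPI" expressions in 1.14.1 and 1.14.2 share one pattern: subtract the cycles of the untouched instruction classes from the target cycle count, then divide by the instruction count of the class being improved. A minimal Python sketch, using only the numbers quoted in the answer (512, 462, 198, 50 and 80, all in the same units as the text); the function name is illustrative.

```python
def improved_cpi(total_cycles, other_cycles, n_instr, speedup=2):
    """CPI a class needs so that total cycles shrink by `speedup`.

    other_cycles is the cycle count of all classes that stay unchanged.
    A result <= 0 means the target cannot be reached.
    """
    return (total_cycles / speedup - other_cycles) / n_instr


cpi_fp = improved_cpi(512, 462, 50)   # FP instructions (1.14.1)
cpi_ls = improved_cpi(512, 198, 80)   # L/S instructions (1.14.2)

print(f"required CPI_fp  = {cpi_fp:.3f}")
print(f"required CPI_l/s = {cpi_ls:.3f}")
```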
processors   exec. time/processor   time w/overhead   speedup               actual speedup/ideal speedup
1            100
2            50                     54                100/54 = 1.85         1.85/2 = .
4            25                     29                100/29 = 3.44         3.44/4 = 0.
8            12.5                   16.5              100/16.5 = 6.06       6.06/8 = 0.
16           6.25                   10.25             100/10.25 = 9.76      9.76/16 = 0.
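The table rows can be regenerated with a short Python sketch. The 4 s overhead is an assumption inferred from the table (54 - 50, 29 - 25, and so on), and the function names are illustrative.

```python
BASE_TIME = 100   # single-processor execution time from the table
OVERHEAD = 4      # assumption: inferred from "time w/overhead" minus time/processor


def time_with_overhead(p):
    """Per-processor time plus the fixed overhead."""
    return BASE_TIME / p + OVERHEAD


def speedup(p):
    """Speedup relative to the single-processor run."""
    return BASE_TIME / time_with_overhead(p)


def efficiency(p):
    """Actual speedup divided by the ideal speedup p."""
    return speedup(p) / p


for p in (2, 4, 8, 16):
    print(f"p={p}: time={time_with_overhead(p)}, "
          f"speedup={speedup(p):.2f}, efficiency={efficiency(p):.2f}")
```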
Chapter 2 Solutions
2.1
    addi f, h, -5    (note, no subi)
    add  f, f, g
2.2 f = g + h + i
2.3
    sub $t0, $s3, $s
    add $t0, $s6, $t
    lw  $t1, 16($t0)
    sw  $t1, 32($s7)
2.4 B[g] = A[f] + A[1+f];
2.5
    add $t0, $s6, $s
    add $t1, $s7, $s
    lw  $s0, 0($t0)
    lw  $t0, 4($t0)
    add $t0, $t0, $s
    sw  $t0, 0($t1)
2.6.1
    temp = Array[0];
    temp2 = Array[1];
    Array[0] = Array[4];
    Array[1] = temp;
    Array[4] = Array[3];
    Array[3] = temp2;
2.6.2
    lw $t0, 0($s6)
    lw $t1, 4($s6)
    lw $t2, 16($s6)
    sw $t2, 0($s6)
    sw $t0, 4($s6)
    lw $t0, 12($s6)
    sw $t0, 16($s6)
    sw $t1, 12($s6)
2.7
Little-Endian        Big-Endian
Address  Data        Address  Data
12       ab          12       12
8        cd          8        ef
4        ef          4        cd
0        12          0        ab
2.8 2882400018
2.9
    sll  $t0, $s1, 2      # $t0 <-- 4*g
    add  $t0, $t0, $s7    # $t0 <-- Addr(B[g])
    lw   $t0, 0($t0)      # $t0 <-- B[g]
    addi $t0, $t0, 1      # $t0 <-- B[g]+1
    sll  $t0, $t0, 2      # $t0 <-- 4*(B[g]+1) = Addr(A[B[g]+1])
    lw   $s0, 0($t0)      # f <-- A[B[g]+1]
2.10 f = 2*(&A);
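The byte-order table and the decimal value in 2.8 both describe the same 32-bit word, 0xabcdef12. A small Python sketch using the built-in `int.to_bytes` shows the two orderings, lowest address first; the variable names are illustrative.

```python
value = 0xABCDEF12

# Decimal value of the word, as given in 2.8
print(value)  # 2882400018

# Bytes in increasing address order for each convention
little = value.to_bytes(4, "little")
big = value.to_bytes(4, "big")
print("little-endian:", [f"{b:02x}" for b in little])  # ['12', 'ef', 'cd', 'ab']
print("big-endian:   ", [f"{b:02x}" for b in big])     # ['ab', 'cd', 'ef', '12']
```

In both cases the list starts at the lowest address, which matches reading the table above from its bottom row upward.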
2.11
instruction           type     opcode   rs   rt   rd   immed
addi $t0, $s6, 4      I-type   8        22   8         4
add  $t1, $s6, $0     R-type   0        22   0    9
sw   $t1, 0($t0)      I-type   43       8    9         0
lw   $t0, 0($t0)      I-type   35       8    8         0
add  $s0, $t1, $t0    R-type   0        9    8    16
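To see how the fields in the table combine into 32-bit machine words, here is a Python sketch that packs them using the standard MIPS I-type and R-type layouts. The helper names are illustrative, and the `add` function code (32) is not listed in the table; it is supplied here from the MIPS instruction set as an assumption the table does not state.

```python
def encode_i(opcode, rs, rt, imm):
    """I-type: opcode(6) | rs(5) | rt(5) | immediate(16)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def encode_r(rs, rt, rd, funct, shamt=0, opcode=0):
    """R-type: opcode(6) | rs(5) | rt(5) | rd(5) | shamt(5) | funct(6)."""
    return ((opcode << 26) | (rs << 21) | (rt << 16)
            | (rd << 11) | (shamt << 6) | funct)


ADD_FUNCT = 32  # funct field of `add`; not shown in the table above

words = {
    "addi $t0, $s6, 4":   encode_i(8, 22, 8, 4),
    "add $t1, $s6, $0":   encode_r(22, 0, 9, ADD_FUNCT),
    "sw $t1, 0($t0)":     encode_i(43, 8, 9, 0),
    "lw $t0, 0($t0)":     encode_i(35, 8, 8, 0),
    "add $s0, $t1, $t0":  encode_r(9, 8, 16, ADD_FUNCT),
}
for text, word in words.items():
    print(f"{text:<20} 0x{word:08X}")
```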
2.12.1 50000000
2.12.2 overflow
2.12.3 B
2.12.4 no overflow
2.12.5 D
2.12.6 overflow
2.13.1 128 + x > 2^31 - 1, x > 2^31 - 129 and 128 + x < -2^31, x < -2^31 - 128 (impossible)
2.13.2 128 - x > 2^31 - 1, x < -2^31 + 129 and 128 - x < -2^31, x > 2^31 + 128 (impossible)
2.13.3 x - 128 < -2^31, x < -2^31 + 128 and x - 128 > 2^31 - 1, x > 2^31 + 127 (impossible)
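The overflow boundaries in 2.13 can be checked by modelling 32-bit two's-complement limits directly. This Python sketch is only a range check on unbounded integers, not a simulation of the hardware; the function names are illustrative.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1


def add_overflows(a, b):
    """True if a + b falls outside the signed 32-bit range."""
    return not (INT_MIN <= a + b <= INT_MAX)


def sub_overflows(a, b):
    """True if a - b falls outside the signed 32-bit range."""
    return not (INT_MIN <= a - b <= INT_MAX)


# 2.13.1: 128 + x overflows for x > 2^31 - 129
print(add_overflows(128, 2**31 - 129), add_overflows(128, 2**31 - 128))
# 2.13.2: 128 - x overflows for x < -2^31 + 129
print(sub_overflows(128, -2**31 + 129), sub_overflows(128, -2**31 + 128))
# 2.13.3: x - 128 overflows for x < -2^31 + 128
print(sub_overflows(-2**31 + 128, 128), sub_overflows(-2**31 + 127, 128))
```

Each printed pair shows the last non-overflowing value followed by the first overflowing one.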
2.25.1 I-type
2.25.2
    addi $t2, $t2, -1
    beq  $t2, $0, loop
2.26.1 20
2.26.2
    i = 10;
    do {
        B += 2;
        i = i - 1;
    } while (i > 0);
2.26.3 5*N
2.27
           addi $t0, $0, 0
           beq  $0, $0, TEST1
    LOOP1: addi $t1, $0, 0
           beq  $0, $0, TEST2
    LOOP2: add  $t3, $t0, $t
           sll  $t2, $t1, 4
           add  $t2, $t2, $s
           sw   $t3, ($t2)
           addi $t1, $t1, 1
    TEST2: slt  $t2, $t1, $s
           bne  $t2, $0, LOOP2
           addi $t0, $t0, 1
    TEST1: slt  $t2, $t0, $s
           bne  $t2, $0, LOOP1
2.28 14 instructions to implement and 158 instructions executed
2.29
    for (i=0; i<100; i++) {
        result += MemArray[s0];
        s0 = s0 + 4;
    }
2.30
          addi $t1, $s0, 400
    LOOP: lw   $s1, 0($t1)
          add  $s2, $s2, $s
          addi $t1, $t1, -
          bne  $t1, $s0, LOOP
2.31
    fib:   addi $sp, $sp, -12     # make room on stack
           sw   $ra, 8($sp)       # push $ra
           sw   $s0, 4($sp)       # push $s0
           sw   $a0, 0($sp)       # push $a0 (N)
           bgt  $a0, $0, test2    # if n>0, test if n=1
           add  $v0, $0, $0       # else fib(0) = 0
           j    rtn               #
    test2: addi $t0, $0, 1        #
           bne  $t0, $a0, gen     # if n>1, gen
           add  $v0, $0, $t0      # else fib(1) = 1
           j    rtn
    gen:   addi $a0, $a0, -1      # n-1 (MIPS has no subi)
           jal  fib               # call fib(n-1)
           add  $s0, $v0, $0      # copy fib(n-1)
           addi $a0, $a0, -1      # n-2
           jal  fib               # call fib(n-2)
           add  $v0, $v0, $s0     # fib(n-1)+fib(n-2)
    rtn:   lw   $a0, 0($sp)       # pop $a0
           lw   $s0, 4($sp)       # pop $s0
           lw   $ra, 8($sp)       # pop $ra
           addi $sp, $sp, 12      # restore sp
           jr   $ra
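For readers who want to check the control flow of the assembly against a high-level version, the following Python sketch mirrors it branch for branch: the two base cases, then the two recursive calls whose results are added. It is a plain restatement for comparison, not part of the original solution.

```python
def fib(n):
    """Recursive Fibonacci with the same structure as the MIPS routine."""
    if n <= 0:          # corresponds to falling through the bgt: fib(0) = 0
        return 0
    if n == 1:          # corresponds to the test2 block: fib(1) = 1
        return 1
    first = fib(n - 1)  # first jal fib, result kept in $s0
    second = fib(n - 2) # second jal fib, result left in $v0
    return first + second


print([fib(n) for n in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```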
2.32 Due to the recursive nature of the code, it is not possible for the compiler to in-line the function call.
2.33 after calling function fib:
    old $sp ->   0x7ffffffc   ???
                 -4           contents of register $ra for fib(N)
                 -8           contents of register $s0 for fib(N)
    $sp ->       -12          contents of register $a0 for fib(N)
there will be N-1 copies of $ra, $s0 and $a0
    DONE: add  $v0, $s0, $
          lw   $ra, ($sp)
          addi $sp, $sp, 4
          jr   $ra
2.38 0x
2.39 Generally, all solutions are similar:
    lui $t1, top_16_bits
    ori $t1, $t1, bottom_16_bits
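The `lui`/`ori` pair works because any 32-bit constant splits into an upper and a lower 16-bit half. A minimal Python sketch of that split follows; the function names and the sample constant are illustrative, not taken from the exercise.

```python
def split_constant(value):
    """Return (top_16_bits, bottom_16_bits) of a 32-bit constant."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF


def lui_ori(top, bottom):
    """Model lui (load upper half, zero lower half) followed by ori."""
    reg = (top << 16) & 0xFFFFFFFF   # lui $t1, top
    return reg | bottom              # ori $t1, $t1, bottom


sample = 0x12345678  # hypothetical constant, chosen for illustration
top, bottom = split_constant(sample)
print(hex(top), hex(bottom))         # 0x1234 0x5678
print(hex(lui_ori(top, bottom)))     # 0x12345678
```

`ori` zero-extends its immediate, which is why it is the natural partner for `lui`: the lower half is OR-ed in without disturbing the upper half.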
2.40 No, jump can go up to 0x0FFFFFFC.
2.41 No, range is 0x604 + 0x1FFFC = 0x0002 0600 to 0x604 – 0x = 0xFFFE 0604.
2.42 Yes, range is 0x1FFFF004 + 0x1FFFC = 0x2001F000 to 0x1FFFF
2.43
    trylk: li   $t1,
           ll   $t0,0($a0)
           bnez $t0,trylk
           sc   $t1,0($a0)
           beqz $t1,trylk
           lw   $t2,0($a1)
           slt  $t3,$t2,$a
           bnez $t3,skip
           sw   $a2,0($a1)
    skip:  sw   $0,0($a0)
2.44
    try:   ll   $t0,0($a1)
           slt  $t1,$t0,$a
           bnez $t1,skip
           mov  $t0,$a
           sc   $t0,0($a1)
           beqz $t0,try
    skip:
2.45 It is possible for one or both processors to complete this code without ever reaching the SC instruction. If only one executes SC, it completes successfully. If both reach SC, they do so in the same cycle, but one SC completes first and then the other detects this and fails.
2.46.1 Answer is no in all cases. Slows down the computer.
CCT = clock cycle time
ICa = instruction count (arithmetic)
ICls = instruction count (load/store)
ICb = instruction count (branch)
new CPU time = 0.75 × old ICa × CPIa × 1.1 × oldCCT + oldICls × CPIls × 1.1 × oldCCT + oldICb × CPIb × 1.1 × oldCCT
The extra clock cycle time adds sufficiently to the new CPU time such that it is not quicker than the old execution time in all cases.
2.46.2 107.04%, 113.43%
2.47.1 2. 2.47.2 0. 2.47.3 0.
Chapter 3 Solutions
The attraction is that each hex digit contains one of 16 different characters (0–9, A–F). Since with 4 binary bits you can represent 16 different patterns, in hex each digit requires exactly 4 binary bits. And bytes are by definition 8 bits long, so two hex digits are all that are required to represent the contents of 1 byte.
3.4 753
3.5 7777 (-3777)
3.6 Neither (63)
3.7 Neither (65)
3.8 Overflow (result = -179, which does not fit into an SM 8-bit format)
3.9 -105 - 42 = -128 (-147)
3.10 -105 + 42 = -63
3.11 151 + 214 = 255 (365)
3.12 62 × 12
Step  Action            Multiplier  Multiplicand     Product
0     Initial Vals      001 010     000 000 110 010  000 000 000 000
1     lsb=0, no op      001 010     000 000 110 010  000 000 000 000
      Lshift Mcand      001 010     000 001 100 100  000 000 000 000
      Rshift Mplier     000 101     000 001 100 100  000 000 000 000
2     Prod=Prod+Mcand   000 101     000 001 100 100  000 001 100 100
      Lshift Mcand      000 101     000 011 001 000  000 001 100 100
      Rshift Mplier     000 010     000 011 001 000  000 001 100 100
3     lsb=0, no op      000 010     000 011 001 000  000 001 100 100
      Lshift Mcand      000 010     000 110 010 000  000 001 100 100
      Rshift Mplier     000 001     000 110 010 000  000 001 100 100
4     Prod=Prod+Mcand   000 001     000 110 010 000  000 111 110 100
      Lshift Mcand      000 001     001 100 100 000  000 111 110 100
      Rshift Mplier     000 000     001 100 100 000  000 111 110 100
5     lsb=0, no op      000 000     001 100 100 000  000 111 110 100
      Lshift Mcand      000 000     011 001 000 000  000 111 110 100
      Rshift Mplier     000 000     011 001 000 000  000 111 110 100
6     lsb=0, no op      000 000     011 001 000 000  000 111 110 100
      Lshift Mcand      000 000     110 010 000 000  000 111 110 100
      Rshift Mplier     000 000     110 010 000 000  000 111 110 100
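The multiplication trace for 3.12 follows the usual shift-and-add procedure: test the multiplier's least significant bit, add the multiplicand to the product if it is 1, then shift the multiplicand left and the multiplier right. A minimal Python sketch of that loop, using the octal operands 62 and 12 from the exercise; the function name and the 6-bit default width are illustrative choices.

```python
def shift_add_multiply(multiplicand, multiplier, bits=6):
    """Unsigned shift-and-add multiplication, one step per multiplier bit."""
    product = 0
    for _ in range(bits):
        if multiplier & 1:          # lsb = 1: Prod = Prod + Mcand
            product += multiplicand
        multiplicand <<= 1          # Lshift Mcand
        multiplier >>= 1            # Rshift Mplier
    return product


result = shift_add_multiply(0o62, 0o12)
print(oct(result), format(result, "012b"))  # 0o764 000111110100
```

The binary form of the result matches the final Product column of the trace.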