University of California, Berkeley – College of Engineering
Department of Electrical Engineering and Computer Sciences
Fall 2006 Instructor: Dan Garcia 2006-12-14
CS61C Final Exam
Last Name ANSWER
First Name KEY
Student ID Number
Login cs61c-
Login First Letter (please circle) a b c d e f g
Login Second Letter (please circle) a b c d e f g h i j k l m
n o p q r s t u v w x y z
The name of your LAB TA (please circle) Scott Aaron David P. Sameer David J.
Name of the person to your Left
Name of the person to your Right
All the work is my own. I have no prior knowledge of the exam contents nor will I share the contents with others in CS61C
who have not taken it yet. (please sign)
Instructions (Read Me!)
- This booklet contains 9 numbered pages including the cover page. Put all answers on these pages (feel
free to use the back of any page for scratch work); don’t hand in any stray pieces of paper.
- Please turn off all pagers, cell phones & beepers. Remove all hats & headphones. Place your
backpacks, laptops and jackets at the front. Sit in every other seat. Nothing may be placed in the “no fly
zone” spare seat/desk between students.
- Fill in the front of this page and put your name & login on every sheet of paper.
- You have 180 minutes to complete this exam. The exam is closed book, no computers, PDAs or
calculators. You may use two pages (US Letter, front and back) of notes, plus the green reference
sheet from COD 3/e.
- There may be partial credit for incomplete answers; write as much of the solution as you can. We will
deduct points if your solution is far more complicated than necessary. When we provide a blank, please
fit your answer within the space provided. “IEC format” refers to the mebi, tebi, etc prefixes. You have 3
hours...relax.
- You must complete ALL THE QUESTIONS , regardless of your score on the midterm.
Clobbering only works from the Final to the Midterm, not vice versa.
Problem  M1  M2  M3  Ms  F1  F2  F3  F4  Fs   Total
Minutes  20  20  20  60  30  30  30  30  120  180
Points   10  10  10  30  22  22  22  24  90   120
Score
Midterm Revisited
M1) “Son of a bits…” (10 pts, 20 min)
a) How many bits does it take to address N things?
(hint: you may use the floor or ceiling function)
ceiling(log2(N))
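As a quick check of the ceiling(log2(N)) answer, here is a minimal integer-only sketch. The function name bits_to_address is made up for illustration, and the sketch assumes N is at least 1 and no larger than 2^63.

```c
#include <assert.h>

/* Smallest number of bits b such that 2^b >= n, i.e. ceiling(log2(n)).
 * Hypothetical helper, not part of the exam. */
unsigned bits_to_address(unsigned long long n) {
    /* Assumption: 1 <= n <= 2^63, so the doubling below cannot overflow. */
    assert(n >= 1 && n <= (1ULL << 63));
    unsigned bits = 0;
    unsigned long long capacity = 1;  /* how many things `bits` bits can address */
    while (capacity < n) {
        capacity <<= 1;
        bits++;
    }
    return bits;
}
```

For example, bits_to_address(5) gives 3, because 2 bits address only 4 things while 3 bits address 8.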
Recall the quarter definition from midterm (skip this paragraph if you remember it):
Early processors had no hardware support for floating point numbers. Suppose you are a
game developer for the original 8-bit Nintendo Entertainment System (NES) and wish to
represent fractional numbers. You and your engineering team decide to create a variant on
IEEE floating point numbers you call a quarter (for quarter precision floats). It has all the
properties of IEEE 754 (including denorms, NaNs and ± ∞) just with different ranges, precision
& representations. A quarter is a single byte split into the following fields (1 sign, 3 exponent, 4
mantissa): SEEEMMMM. The bias of the exponent is 3, and the implicit exponent for denorms is -2.
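The quarter definition can be sketched as a decoder. This is one possible reading of the description above, assuming the usual IEEE 754 conventions carry over (an all-ones exponent means infinity or NaN, an all-zeros exponent means a denorm). The name quarter_decode is hypothetical.

```c
#include <math.h>   /* only the INFINITY and NAN macros are used */

/* Decode one SEEEMMMM byte: 1 sign bit, 3 exponent bits (bias 3),
 * 4 mantissa bits, denorms use an implicit exponent of -2. */
double quarter_decode(unsigned char b) {
    int sign = (b >> 7) & 0x1;
    int exp  = (b >> 4) & 0x7;
    int man  = b & 0xF;
    double value;

    if (exp == 7) {
        /* Assumed IEEE-style reserved exponent: infinity or NaN. */
        value = (man == 0) ? INFINITY : NAN;
    } else {
        /* Denorms have no implicit leading 1; normalized values do. */
        double significand = (exp == 0) ? man / 16.0 : 1.0 + man / 16.0;
        int e = (exp == 0) ? -2 : exp - 3;
        value = significand;
        for (; e > 0; e--) value *= 2.0;  /* scale by 2^e without libm */
        for (; e < 0; e++) value /= 2.0;
    }
    return sign ? -value : value;
}
```

Under these assumptions the largest finite quarter is 0x6F, which decodes to 1.9375 * 2^3 = 15.5, and the smallest positive denorm is 0x01, which decodes to 1/64.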
You’re also familiar with the Fixed-Point Representation , where the binary point is always in the same
place so there’s no need to store the exponent. E.g., you could imagine splitting the Nintendo’s byte
into two nibbles with the left nibble representing the unsigned whole number component (W), and the
right representing the fractional component (F): WWWW.FFFF. Thus the bit pattern 0xa8 would be
interpreted as the unsigned fixed-point value 0xa.8 = 0b1010.1000 = 10.5 (base 10). As a systems designer,
you could choose to interpret a byte any way you want, so you could change the point location (e.g.,
WW.FFFFFF or WWWWWWW.F) to suit your needs.
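The fixed-point interpretation amounts to dividing the raw byte by 2 raised to the number of fraction bits. A minimal sketch follows; fixed_decode and its frac_bits parameter are illustrative names, not something the exam defines.

```c
#include <assert.h>

/* Interpret an unsigned byte as fixed point with `frac_bits` bits to the
 * right of the binary point (4 for WWWW.FFFF, 6 for WW.FFFFFF, ...). */
double fixed_decode(unsigned char b, int frac_bits) {
    assert(frac_bits >= 0 && frac_bits <= 8);
    return (double)b / (double)(1u << frac_bits);
}
```

With frac_bits = 4 the pattern 0xa8 decodes to 168/16 = 10.5, matching the example in the text. The same byte read as WW.FFFFFF (frac_bits = 6) would be 2.625.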
b) One of your games involves velocities that always fall in the range of [10, 15), i.e., 10 ≤ v < 15.
If you only have a single NES byte , you’re asked to design a novel representation to encode
a velocity (assume the hardware can handle whatever you do). It should be better than a
quarter, fixed-point, and any 8-bit encoding we’ve discussed! What is “better”? You will be
judged on four criteria (check the box to the left of the ones you think you satisfy, listed in
decreasing priority order). Explain your decoding below on the left (how to go from a bit pattern
b to a velocity v), and on the right show the bit patterns that would result from encoding
numbers closest to 10, 12.5 and 15 as well as the velocity each bit pattern actually represents.
√ Most bit patterns encoding the most different numbers in [10, 15)
√ You have bit patterns that are as close as possible to 10, 12.5, and 15
√ Uniform spacing between numbers in [10, 15) is better than non-uniform
√ Simplicity
My scheme has 256 NES byte bit patterns representing the
range [10, 15). Here’s how I go from a bit pattern b to a velocity v:
(if you’d like, you may write it mathematically... v as a function of b).
v = 10 + (unsigned char) b * (5.0/256)
Number closest to...   ...and its bit pattern   ...and the velocity it represents
10                     0b 00000000              10
12.5                   0b 10000000              12.5
15                     0b 11111111
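The decoding formula in the answer, 10 + (unsigned char) b * (5.0/256), can be exercised directly. The sketch below wraps it in a function and adds a nearest-pattern encoder. The names velocity_decode and velocity_encode are made up here, and the encoder's clamping at the ends of the range is an assumption about how "closest to 15" would be handled.

```c
/* Decode: 256 evenly spaced velocities starting at 10 with step 5/256. */
double velocity_decode(unsigned char b) {
    return 10.0 + b * (5.0 / 256);
}

/* Encode: pick the bit pattern whose velocity is nearest to v.
 * Values at or beyond the ends of [10, 15) clamp to 0x00 or 0xFF (assumed). */
unsigned char velocity_encode(double v) {
    double steps = (v - 10.0) * (256 / 5.0);  /* distance from 10 in units of 5/256 */
    if (steps <= 0.0) return 0;
    if (steps >= 255.0) return 255;
    return (unsigned char)(steps + 0.5);      /* round to nearest step */
}
```

Under this scheme 0b10000000 decodes to exactly 12.5, and the pattern closest to 15, 0b11111111, decodes to 10 + 255 * 5/256 = 14.98046875.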
M3) “What’s the MIPS is going on here?” (10 pts, 20 min)
a) Given the MIPS code below, write the equivalent (from a functional point of view) C function
below in the structure we’ve provided. When you’re writing the C code, you can assume
that Mystery will be called fewer than 100 times. (Later questions ask what happens when
it’s called more times.) Feel free to add comments to help your disassembly. You may
assume la will always be expanded into a lui/ori pair that fills up (clobbers) the nop.
Mystery: la    $t0, Mystery
         nop
         lw    $t1, 20($t0)
         addiu $t1, $t1, 1
         sw    $t1, 20($t0)
         addiu $v0, $0, 0
         jr    $ra
b) In one sentence, explain what this MIPS code does.
It returns the number of times Mystery has been called (up to a point, see below).
c) What is the most times this function can be called so that it still does what you described in
part (b)? (It can be left as an expression)
d) What will it return (exactly, but it may be left as an expression) if it is called one more time?
It will return the negative number (0xFFFF8000 = -2^15).
e) What will happen if it is called twice as many times as in (d)? Will it crash? Hang forever?
What’s returned, if anything? Describe the effect from the caller’s standpoint; be explicit.
Since it now modifies $v1 instead of $v0 (and $v0 doesn’t get touched) the caller will get a “random” =
“garbage” return value, probably whatever was left in $v0 from the return value of the most recent
function that was called.
// Mystery called < 100 times
short timesCalled = 0;
short Mystery() {
    return ++timesCalled;
}
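The answers to (d) and (e) follow from the fact that the lw/sw pair at offset 20 increments the encoded word of addiu $v0, $0, 0 itself. The sketch below models only that word. It assumes the standard MIPS I-type layout (opcode 9 for addiu, rt in bits 20-16, a 16-bit immediate that is sign-extended) and that the modified instruction is the one actually fetched. The names after_calls and addiu_fields are hypothetical.

```c
#include <stdint.h>

/* Encoding of addiu $v0, $0, 0: opcode 9, rs = 0, rt = 2 ($v0), imm = 0. */
#define ADDIU_V0_ZERO_0 0x24020000u

typedef struct {
    unsigned rt;   /* destination register number */
    int32_t  imm;  /* sign-extended 16-bit immediate */
} addiu_fields;

/* Fields of the self-modified instruction after `calls` calls to Mystery. */
addiu_fields after_calls(uint32_t calls) {
    uint32_t word = ADDIU_V0_ZERO_0 + calls;  /* each call adds 1 to the word */
    addiu_fields f;
    f.rt  = (word >> 16) & 0x1F;
    f.imm = (int32_t)(word & 0xFFFF);
    if (f.imm >= 0x8000) f.imm -= 0x10000;    /* sign-extend the immediate */
    return f;
}
```

In this model the immediate reads -2^15 on call number 2^15, and on call number 2^16 the carry out of the immediate bumps rt from 2 ($v0) to 3 ($v1) with an immediate of 0, which is the behavior the answers to (d) and (e) describe.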
Post-Midterm Questions
F1) “The Datapath less traveled…” (22 pts, 30 min)
On the right is the single-cycle
MIPS datapath presented
during lecture. Your job is to
modify the diagram to
accommodate a new MIPS
instruction. Your modification
may use simple adders,
shifters, mux chips, wires, and
new control signals. If
necessary, you may replace
original labels.
We want to add a new MIPS
instruction so that the following
C statement (p is a pointer to an
int, and CONSTANT is small and
can be negative) could be performed in one I-type MAL MIPS instruction: *p = CONSTANT;
a) Make up the syntax for the I-type MAL MIPS instruction (call it sc for “store constant”) that
does it (show an example if the pointer lives in $v0 and the CONSTANT is 42 ). On the right,
show the register transfer language (RTL) description of sc.
Syntax: sc 42($v0) RTL: Mem[R[rs]] <- SignExtImm; PC <- PC+4
b) For a larger CONSTANT (say 0xFAB5BEEF), to what exact TAL instructions would the MAL
above expand?
lui $at, 0xFAB5; ori $at, $at, 0xBEEF; sw $at, 0($v0)
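The lui/ori expansion works because lui places 16 bits in the upper half of the register and ori fills the lower half with a zero-extended immediate. A small sketch of that split, with made-up helper names:

```c
#include <stdint.h>

/* Immediate for lui: the upper 16 bits of the constant. */
uint16_t upper_half(uint32_t constant) {
    return (uint16_t)(constant >> 16);
}

/* Immediate for ori: the lower 16 bits of the constant. */
uint16_t lower_half(uint32_t constant) {
    return (uint16_t)(constant & 0xFFFF);
}

/* Register value after "lui r, upper" followed by "ori r, r, lower".
 * ori zero-extends its immediate, so the upper half is left untouched. */
uint32_t lui_then_ori(uint16_t upper, uint16_t lower) {
    uint32_t reg = (uint32_t)upper << 16;  /* lui: lower 16 bits become 0 */
    return reg | lower;                    /* ori */
}
```

For 0xFAB5BEEF the halves are 0xFAB5 and 0xBEEF, and recombining them reproduces the original constant.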
c) Modify the picture above and list your changes below. You may not need all the boxes.
Please write them in “pipeline stage order” (i.e., changes affecting IF first, MEM next, etc)
(i) Add a 2-input mux (dataIn={busB,Extender}) to select the memory Data In. ...OR... Take the ALUSrc mux and
give it a second output (tied to Data In) that carries the opposite of what ALUSrc chooses. When ALUSrc is 1, Data In gets busB and the normal output gets the imm16 for the offset (as in sw). When ALUSrc is 0, Data In gets imm16 (as in *p = CONSTANT) and the normal output is fed to the ALU, whose result is ignored by the following mux. You can also have the control always set RT to 0 (regardless of what it is in the instruction).
(ii) Add a 2-input mux (memAdr={ALU,busA}) to select the memory Adr with busA on input 1 ... OR ...
Rewrite the ALU so that it can take an ALUSrc command to “pass A unchanged”
(iii)
(iv)
d) We now want to set all the control lines appropriately. List what each signal should be
(an intuitive name or {0, 1, x = don't care}). Include any new control signals you added.
RegDst RegWr nPC_sel ExtOp ALUSrc ALUctr MemWr MemtoReg memAdr memDataIn
Rs Rt
F2) “Don’t let me fault” (22 pts, 30 min)
The specs for a MIPS machine’s memory system that has one level of cache and virtual memory are:
o 1MiB of Physical Address Space
o 4GiB of Virtual Address Space
o 4KiB page size
o 16KiB 8-way set-associative write-through cache, LRU replacement
o 1KiB Cache Block Size
o 2-entry TLB, LRU replacement
The following code is run on the system, which has no other users and process switching turned off.
#define NUM_INTS 8192 // This many ints...
int *A = (int *)malloc(NUM_INTS * sizeof(int)); // malloc returns address 0x
int i, total = 0;
for(i = 0; i < NUM_INTS; i += 128) A[i] = i;
for(i = 0; i < NUM_INTS; i += 128) total += A[i]; // SPECIAL
a) What is the T:I:O bit breakup for the cache (assuming byte addressing)? 9 : 1 : 10
b) What is the VPN : PO bit breakup for VM (assuming byte addressing)? 20 : 12
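The two breakups follow from the specs: the offset is log2 of the block (or page) size, the index is log2 of the number of sets, and the tag (or VPN) is whatever remains of the address. A minimal sketch, assuming every size is a power of two. The struct and function names are illustrative only.

```c
#include <assert.h>

/* log2 of a power of two; asserts that x really is one. */
static unsigned log2_exact(unsigned long long x) {
    assert(x != 0 && (x & (x - 1)) == 0);
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

typedef struct { unsigned tag, index, offset; } tio_bits;
typedef struct { unsigned vpn, page_offset; } vm_bits;

/* Tag:Index:Offset widths for a set-associative cache indexed by
 * physical addresses. */
tio_bits cache_breakup(unsigned long long phys_space_bytes,
                       unsigned long long cache_bytes,
                       unsigned ways,
                       unsigned long long block_bytes) {
    tio_bits b;
    unsigned long long sets = cache_bytes / block_bytes / ways;
    b.offset = log2_exact(block_bytes);
    b.index  = log2_exact(sets);
    b.tag    = log2_exact(phys_space_bytes) - b.index - b.offset;
    return b;
}

/* VPN : page-offset widths for a virtual address. */
vm_bits vm_breakup(unsigned long long virt_space_bytes,
                   unsigned long long page_bytes) {
    vm_bits b;
    b.page_offset = log2_exact(page_bytes);
    b.vpn         = log2_exact(virt_space_bytes) - b.page_offset;
    return b;
}
```

With the exam's numbers, the 16KiB cache holds 16 blocks of 1KiB in 2 sets of 8, giving 1 index bit and 10 offset bits out of a 20-bit physical address, so 9 tag bits. A 32-bit virtual address with 4KiB pages splits 20 : 12.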
For the following questions, only consider the line marked “ SPECIAL”. Your answer can be a fraction.
c) Calculate the hit percentage for the cache ½ = 50%
d) Calculate the hit percentage for the TLB 7/8 = 87.5%
e) Calculate the page hit percentage for the page table 100%
Show all your work below...
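One way to arrive at the 1/2 and 7/8 figures is to count accesses per cache block and per page on the SPECIAL line. The sketch below makes several assumptions that the text does not state outright: sizeof(int) is 4, so the stride is 128 * 4 = 512 bytes; A starts on a block and page boundary; and nothing useful survives from the first loop, since the 32KiB array is twice the cache size and spans 8 pages against a 2-entry TLB, both with LRU. The names are hypothetical.

```c
#include <assert.h>

typedef struct { unsigned hits, accesses; } hit_ratio;

/* For a strided sweep where the first touch of each unit (cache block or
 * page) misses and every later touch of the same unit hits. */
hit_ratio per_unit_hits(unsigned unit_bytes, unsigned stride_bytes) {
    assert(stride_bytes > 0 && unit_bytes % stride_bytes == 0);
    hit_ratio r;
    r.accesses = unit_bytes / stride_bytes;  /* touches that land in one unit */
    r.hits     = r.accesses - 1;             /* all but the first touch */
    return r;
}
```

Under those assumptions a 1KiB block sees 2 accesses (1 hit, so 1/2) and a 4KiB page sees 8 accesses (7 hits, so 7/8). Every page was already touched by the first loop, which is consistent with the 100% page table hit answer.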
F3) “These Pipes are Clean…” (22 pts, 30 min)
Consider a processor with the following specification:
o Standard five (5) stage (F, D, E, M, W) pipeline.
o No forwarding.
o Stalls on all data and control hazards.
o Non-delayed branches
o Branch comparison occurs during the second stage.
o Instructions are not fetched until branch comparison is done.
o Memory CAN be read/written on same clock cycle.
o The same register CAN be read & written on the same clock cycle.
o No out-of-order execution
o “Dumb” control that does not optimize for “always-branch” conditional branches
a) Count how many cycles will be needed to execute the code below and write out each
instruction’s progress through the pipeline by filling in the table below with pipeline stages
(F, D, E, M, W).
add $t1, $t2, $t
xor $t1, $t4, $t
lw  $t3, 0($t1)
beq $t3, $t3, 1
lw  $t5, 0($t3)
xor $t4, $t5, $t
add $t5, $t5, $t
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Inst 1 F^2 3 D E M W 1
Inst 2 F^4 5 D E M W 1
Inst 3 F^ F^ F^1 D E M W 3
Inst 4 F^ F^ F^ F^ F^3 D = E M W
Inst 5 F^56 D E M W 4
Inst 6 F^ F^ F^4 D E M W 5
b) Considering the following three changes , fill in the table again:
o Our processor now forwards values
o Interlocks on load hazards
o “Intelligent” control that optimizes for “always-branch” conditional branches
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Inst 1 F^2 3 D E M W 1
Inst 2 F^4 5 D E 1 M W 1
#define ITERATIONS 96
int s = serial(); // 40 cycles to complete
for (int n = 0; n < ITERATIONS; n++)
    parallel(s,n); // 10 cycles per loop
F4) The CS61C Variety Pack… (24 pts, 30 min)
The table on the right is only used for
questions (a)-(c). Given the following
instruction mix and CPI_i for each instruction_i:
a) What is the overall CPI? 3.0
b) If Stores were free (CPI=0), how many times faster would the CPU be? 6/5 = 1.2
c) If you could make one instruction type twice as fast, what should it be? Branch
d) What problem prevents us from easily transitioning to quad-, 8-,
or more-core processing? (The proverbial fly in the ointment) Cache coherency
e) What RAID # should be used, if you want to maximize hard drive
read speed, want the most space possible, and can use never-fail disks? 0
A large computing task is at hand, but
thankfully, we’ve got a cluster of
computers at our disposal. Assume that
the for loop is fully parallelizable, but
serial() is not. We run this code:
f) How many times faster are we if we parallelize the code over many machines? 1000/50 = 20
g) Match the following items. Some items on the right will not be used, or may be used
more than once.
Instruction_i  Frequency_i  CPI_i  Scratch space
ALU            25%          1      0.25
Load           35%          3      1.05
Store          10%          5      0.50
Branch         30%          4      1.20
G Makes more efficient use of available disk area
J The basis of network abstraction
Q This guarantees delivery over a network
M Work per unit time
F Time to complete a single task
I Bigger blocks take advantage of this
B All caches take advantage of this
R “It’s getting harder to build a new chip fab plant!”
A) LRU
B) Temporal Locality
C) Synchronization
D) Write-back
E) Full-duplex
F) Latency
G) Constant Bit Density
H) Amdahl’s Law
I) Spatial Locality
J) Encapsulation
K) Fragmentation
L) Synchronization
M) Throughput
N) Parallelization
O) AMAT
P) Constant angular velocity
Q) Ack
R) Rock’s law
S) Superscalar
T) Pipelining
U) Superparamagnetism
V) Polling (not David)
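The arithmetic behind F4 (a), (b) and (f) can be checked with a short sketch. It keeps everything in integers (frequencies in percent, so CPI comes out multiplied by 100) to avoid floating-point rounding. For (f) it assumes "many machines" means at least one machine per loop iteration, so the parallel part costs a single 10-cycle iteration. All names here are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { unsigned percent; unsigned cpi; } instr_class;

/* The exam's mix: ALU, Load, Store, Branch. */
static const instr_class EXAM_MIX[4] = {
    {25, 1}, {35, 3}, {10, 5}, {30, 4}
};
/* Same mix with Store's CPI set to 0, as in part (b). */
static const instr_class FREE_STORE_MIX[4] = {
    {25, 1}, {35, 3}, {10, 0}, {30, 4}
};

/* Overall CPI times 100: the sum of percent_i * CPI_i. */
unsigned cpi_x100(const instr_class *mix, size_t n) {
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += mix[i].percent * mix[i].cpi;
    return total;
}

/* Cycles for serial() plus the loop split evenly across `machines`. */
unsigned cluster_cycles(unsigned serial_cycles, unsigned cycles_per_iter,
                        unsigned iterations, unsigned machines) {
    assert(machines >= 1);
    unsigned per_machine = (iterations + machines - 1) / machines;  /* ceiling */
    return serial_cycles + cycles_per_iter * per_machine;
}
```

cpi_x100(EXAM_MIX, 4) is 300, so the CPI is 3.0, and with free stores it drops to 250, a 300/250 = 6/5 speedup. Branch contributes the largest share (120 of the 300). cluster_cycles(40, 10, 96, 1) is 1000 and cluster_cycles(40, 10, 96, 96) is 50, giving the 1000/50 = 20 answer. The serial 40 cycles bound the speedup no matter how many machines are added, which is Amdahl's Law at work.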