Computer Architecture and Design, Lecture notes of Advanced Computer Architecture

Duke University Advanced Computer Architecture

computer architecture and design

Typology: Lecture notes

2014/2015

Uploaded on 10/07/2015

joe_jacob 🇺🇸

2.4:

MIPS word is 32 bits
two's complement
- leading 0's = +, leading 1's = -
- 1000...000 is -2^31
- 1000...001 is -2^31 + 1
- 1111...111 is -1
- 0111...111 is 2^31 - 1
- MSB is sign bit in two's complement
  - it's multiplied by -2^31 and the other bits are multiplied by 2^30, etc.
- negate number by flipping each bit and adding 1
sign extend to place correct representation of number within a wider register
- repeatedly copy sign bit to the front of the number (if signed)
- otherwise, repeatedly copy 0's
one's complement
- 100...000 is most negative (-2^31 + 1)
- 011...111 is most positive (2^31 - 1)
- 000...000 is positive 0
- 111...111 is negative 0
- opposite sign number is found by simply inverting each bit
- not good because there are two zeros and adders need an extra step to subtract (inefficient)
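A small Python sketch of the two's complement rules in this section. The helper names and the default 32-bit width are assumptions made for illustration; Python integers are unbounded, so every result is masked to the chosen width.

```python
def to_signed(bits, width=32):
    """Interpret an unsigned bit pattern as a two's complement value."""
    sign = 1 << (width - 1)
    # The MSB contributes -2^(width-1); the remaining bits contribute normally.
    return (bits & (sign - 1)) - (bits & sign)


def negate(bits, width=32):
    """Negate by flipping every bit and adding 1, wrapping to the width."""
    mask = (1 << width) - 1
    return ((bits ^ mask) + 1) & mask


def sign_extend(bits, from_width, to_width=32):
    """Copy the sign bit of a from_width-bit value into the upper bits."""
    sign = 1 << (from_width - 1)
    if bits & sign:
        bits |= ((1 << to_width) - 1) & ~((1 << from_width) - 1)
    return bits


print(to_signed(0x80000000))          # most negative 32-bit value
print(hex(negate(5)))                 # bit pattern of -5
print(hex(sign_extend(0xFFFE, 16)))   # 16-bit -2 widened to 32 bits
```

For unsigned values the upper bits would simply be left as 0, which is the "repeatedly copy 0's" case.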

3.2:

overflow causes an exception or interrupt
- EPC contains address of instruction that caused exception
  - mfc0 is used to copy EPC into general register ($k0 or $k1)

3.5:

floating point representation: (-1)^S X F X 2^E
MIPS uses IEEE 754 floating point
- single precision: 1 sign bit, 8 exponent bits, 23 fraction bits
- double precision: 1 sign bit, 11 exponent bits, 52 fraction bits
biased notation: (-1)^S X (1 + fraction) X 2^(exponent - bias)
- bias is 127 for single precision and 1023 for double precision
HOW DO I CONVERT BINARY TO FLOATING POINT AND VICE VERSA
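One way to see the conversion is to split a 32-bit pattern into its fields and apply the biased formula. This is a minimal sketch for normalized single-precision values only (exponent field 1 to 254); the function names are made up for illustration, and `struct` is used only to get the bit pattern of a Python float.

```python
import struct


def float_to_bits(x):
    """Return the 32-bit single-precision bit pattern of x as an int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def decode_single(bits):
    """Split a 32-bit pattern into (sign, exponent, fraction, value)."""
    sign = bits >> 31                 # 1 sign bit
    exponent = (bits >> 23) & 0xFF    # 8 exponent bits (biased by 127)
    fraction = bits & 0x7FFFFF        # 23 fraction bits
    # (-1)^S x (1 + fraction) x 2^(exponent - bias), normalized numbers only
    value = (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)
    return sign, exponent, fraction, value


bits = float_to_bits(-0.75)
print(hex(bits))
print(decode_single(bits))
```

Here -0.75 is -1.1 (binary) x 2^-1, so the sign is 1, the stored exponent is -1 + 127 = 126, and the fraction field holds the bits after the leading 1.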

2.3:

the MIPS load instruction transfers data from memory to a register
- lw $t0, 32($s3)
  - loads element 8 (A[8]) of the array whose base address is in $s3 into $t0
  - $s3 is base register, 32 is the byte offset (8 words X 4 bytes)
byte-addressed means that the addresses of sequential words differ by 4
- 4 bytes in a 32-bit word
MIPS is big-endian, which means that it uses the address of the leftmost byte as the word address versus the rightmost (little endian)
the MIPS store instruction transfers data from a register to memory
- sw $t0, 48($s3)
  - stores the value from $t0 to element 12 (A[12]) of the array whose base address is in $s3

2.5:

$t0 to $t7 = $8 to $15
$s0 to

R-type: 6 bits (opcode) - 5 bits (1st src, rs) - 5 bits (2nd src, rt) - 5 bits (dest, rd) - 5 bits (shift amount, shamt -> used by shift instructions such as sll, e.g. to scale an array index) - 6 bits (function code, funct)
I-type: 6 bits (opcode) - 5 bits (src, rs) - 5 bits (dest, rt) - 16 bits (immediate)
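The two layouts can be checked by packing the fields with shifts. This is a sketch: `encode_r` and `encode_i` are hypothetical helper names, and the register numbers and opcodes used below ($t0 = 8, $s1 = 17, $s2 = 18, $s3 = 19, add funct = 32, lw opcode = 35) are the standard MIPS values.

```python
def encode_r(rs, rt, rd, shamt, funct, opcode=0):
    """Pack an R-type instruction: opcode | rs | rt | rd | shamt | funct."""
    return ((opcode << 26) | (rs << 21) | (rt << 16)
            | (rd << 11) | (shamt << 6) | funct)


def encode_i(opcode, rs, rt, imm):
    """Pack an I-type instruction: opcode | rs | rt | 16-bit immediate."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


# add $t0, $s1, $s2  -> rs = $s1, rt = $s2, rd = $t0
print(hex(encode_r(rs=17, rt=18, rd=8, shamt=0, funct=32)))
# lw $t0, 32($s3)    -> rs = $s3 (base), rt = $t0 (dest), imm = 32
print(hex(encode_i(opcode=35, rs=19, rt=8, imm=32)))
```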

2.6:

logical operators
- sll (shift left) - move current bits to left
  - utilizes shamt bits by dictating shift amount
- srl (shift right) - move current bits to right
- and, andi (bit-by-bit AND)
- or, ori (bit-by-bit OR)
  - use andi/ori to do a bitmask and manipulate the bits you want
- nor (bit-by-bit NOR; with $zero as one operand it acts as NOT)
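A rough Python model of these operators on 32-bit values, to show the shift-and-mask idiom. The function names mirror the MIPS mnemonics but are just illustrative helpers; results are masked because Python integers are not limited to 32 bits.

```python
MASK32 = 0xFFFFFFFF


def sll(x, n):
    """Shift left logical, dropping bits that fall off the 32-bit word."""
    return (x << n) & MASK32


def srl(x, n):
    """Shift right logical, filling with zeros from the left."""
    return (x & MASK32) >> n


def nor(a, b):
    """Bit-by-bit NOR; nor(a, 0) is a plain NOT of a."""
    return ~(a | b) & MASK32


word = 0x02324020                 # example instruction word
rs_field = srl(word, 21) & 0x1F   # shift down, then mask 5 bits (like andi)
flagged = word | 0x1              # set the lowest bit (like ori)
print(rs_field, hex(flagged), hex(nor(0x0000FFFF, 0)))
```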

2.7:

conditional branches:
- bne reg1, reg2, L - if reg1 != reg2, go to L
- beq reg1, reg2, L - if reg1 == reg2, go to L
unconditional branch: j L
- kind of like the "else" statement to a conditional branch
- it will automatically go to L2 because L1 wasn't gone to
- it's a jump
slt $t0, $s3, $s4
- set $t0 to 1 if $s3 < $s4, otherwise set $t0 to 0
- slti is the immediate form ($s4 would be an immediate)

A.2:

symbol table
- a record of each label and its associated memory address
object file
- header: size and position of other pieces of file
- text segment: machine language code for routines in source file
- data segment: binary representation of data in source file
- relocation information: identifies instructions and data words that depend on absolute addresses
- symbol table: associates addresses with external labels in source file
- debugging info: description of way program was compiled
.macro name_of_macro($arg) ...... .end_macro
- this is located in the text segment and basically you can call it by writing: name_of_macro($any_register)
- in the macro $arg will be replaced by $any_register in all instances

A.10:

c0 (coprocessor 0) handles exceptions/interrupts
c1 (coprocessor 1) handles floating-point

jr $ra
- end of the procedure; return to the caller
$t0 - $t9: temp regs that are not preserved by callee on procedure call
$s0 - $s7: saved regs that must be preserved on procedure call (if used, callee saves and restores them)
- in the example above, the sw and lw for $t0 and $t1 are unnecessary
for recursion, you save the return address and arg on the stack each time (adjusting the stack pointer, $sp)
- in the base case, perform the base calculation, pop the address and arg off the stack, and call jr $ra
- this keeps going until the original call procedure finishes
- you restore the arg and the return address and pop the two items from the stack
- save the final result in $v0 and jr $ra to caller

lb $t0, 0($s1)
- load a byte from the memory address in $s1 into $t0 (a byte can be an ascii value!)
sb $t0, 0($s1)
- store the low byte of $t0 to the memory address in $s1 (this can also be an ascii value)

2.10:

lui $s0, 61
- loads 61 (in binary) in the upper 16 bits of $s0 (the lower 16 bits become 0)
- to manipulate the lower 16 bits of $s0, you ori with your desired bits --> this is how you load a 32-bit constant into a reg
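A minimal sketch of the lui + ori pattern, modelling the register as a Python int. The helper names are illustrative, and the lower-half constant 2304 is an assumed example value chosen so that the result is 4,000,000.

```python
def lui(imm16):
    """Load upper immediate: imm16 goes in bits 31..16, low half is zero."""
    return (imm16 & 0xFFFF) << 16


def ori(reg, imm16):
    """OR a zero-extended 16-bit immediate into the register value."""
    return reg | (imm16 & 0xFFFF)


s0 = lui(61)          # lui $s0, 61
s0 = ori(s0, 2304)    # ori $s0, $s0, 2304
print(s0, hex(s0))
```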
J-type: 6 bits (opcode) - 26 bits (address field)
for I-type branches, the 16-bit field counts words, not bytes, so it covers an 18-bit byte offset (16 bits shifted left by 2); in the same way the 26-bit J-type field covers a 28-bit byte address
I DONT UNDERSTAND THIS
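The word-addressing point may be easier to see with numbers. The sketch below assumes the usual MIPS rules (branch offsets are relative to PC+4, and a jump keeps the upper 4 bits of PC+4); the function names and example addresses are made up for illustration.

```python
def sign_extend16(imm16):
    """Treat a 16-bit field as a signed two's complement number."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def branch_target(pc, imm16):
    """PC+4 plus the word offset converted to bytes (shifted left by 2)."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


def jump_target(pc, addr26):
    """Upper 4 bits of PC+4, then the 26-bit field, then two 0 bits."""
    return ((pc + 4) & 0xF0000000) | ((addr26 & 0x3FFFFFF) << 2)


# A 16-bit word offset spans 2^16 words = 2^18 bytes around PC+4.
print(hex(branch_target(0x00400000, 3)))          # 3 instructions past PC+4
print(hex(branch_target(0x00400010, 0xFFFF)))     # offset -1: the branch itself
print(hex(jump_target(0x00400000, 0x0100004)))    # 26 bits -> 28-bit address
```

Because every instruction address ends in 00, those two bits never need to be stored, which is where the extra 2 bits of reach come from.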

C program -> compiler -> assembly -> assembler -> object (machine language) -> linker -> executable -> loader -> memory
compiler converts C to assembly
assembler converts assembly into machine code
a linker combines all the independently assembled machine language programs into an executable
- this lets you avoid recompiling the whole program when only one line of code changes

ARM uses 32-bit instructions with a 12-bit immediate field

B.2:

gate can have multiple inputs
any logical function can be constructed using AND/OR gates and inversion

B.3:

a decoder takes an n-bit input and has 2^n outputs, where each output is associated with a unique input combination

a multiplexor (mux) has three inputs: two values and a selector
- the selector determines which input becomes the output
- can be C = A·S' + B·S, where S' is NOT S (output is either A or B)
- if there are n inputs, there will need to be ceiling(log2(n)) selector inputs (still one output)
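The decoder and mux can be written as tiny Python functions to check the rules above. This is only a sketch on 0/1 values, and all three function names are assumptions.

```python
def decoder(value, n):
    """n-bit input -> list of 2^n outputs, exactly one of which is 1."""
    return [1 if i == value else 0 for i in range(2 ** n)]


def mux2(a, b, s):
    """Two-input mux on single bits: C = A·S' + B·S."""
    return (a & (s ^ 1)) | (b & s)


def selector_bits(n_inputs):
    """Number of select lines for an n-input mux: ceiling(log2(n))."""
    return (n_inputs - 1).bit_length()


print(decoder(5, 3))
print(mux2(1, 0, 0), mux2(1, 0, 1))
print(selector_bits(8), selector_bits(5))
```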
programmable logic array (PLA)
- implements your sum of products directly (an AND plane feeding an OR plane), like wiring it up in Logisim; if there are multiple outputs for the same set of inputs, you can output more than one signal
(IMAGE 250-1)

B.5:

For a 1-bit adder, we have three inputs (the two bits and carry_in) and two outputs (the sum and carry_out)
you simply make the truth table for all values of inputs and proceed
this is a smaller part of an ALU that can use a mux for different operations
(IMAGE 250-2)
A ripple carry adder is a linking of 1-bit adders (the carry_out becomes the carry_in of the next adder and so on)
In order to support subtraction, invert the bits of the second operand (b) and set carry_in of the first adder to 1; an extra mux picks b or inverted b depending on whether it's addition or subtraction
(IMAGE 250-3)
HOW DOES THE +1 FOR NEGATING TWO'S COMPLEMENT WORK?
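One way to see where the +1 comes from: a - b = a + (NOT b) + 1, and the carry_in of the least significant 1-bit adder is otherwise unused, so setting it to 1 adds the 1 for free. The sketch below chains 1-bit full adders in Python; the function names and the 32-bit default are illustrative assumptions.

```python
def full_adder(a, b, carry_in):
    """1-bit adder: returns (sum, carry_out)."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)
    return s, carry_out


def ripple_add(a, b, subtract=False, width=32):
    """Ripple carry adder; for subtract, invert b's bits and set carry_in=1."""
    carry = 1 if subtract else 0      # this carry_in is the "+1"
    result = 0
    for i in range(width):
        a_bit = (a >> i) & 1
        b_bit = (b >> i) & 1
        if subtract:
            b_bit ^= 1                # the mux selects the inverted b
        s, carry = full_adder(a_bit, b_bit, carry)
        result |= s << i
    return result, carry


print(ripple_add(3, 4))
print(ripple_add(7, 5, subtract=True))
print(hex(ripple_add(5, 7, subtract=True)[0]))   # bit pattern of -2
```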
To implement slt (set less than), the upper 31 bits of the 32-bit result are always 0, and the LSB is either 0 or 1 based on the comparison
- this requires another input, Less, which is set to 0 for the 31 upper 1-bit ALUs, so they output 0 for the result
- if a < b, a - b is negative and the output is 1, otherwise output is 0
- using this, subtract the two numbers and take the sign bit from the result --> that's your LSB of slt
- you have to add an extra output (Set) to the MSB ALU so that its adder output goes straight to Less in the LSB ALU
MSB ALU requires overflow detection (overflow if the carry into the MSB differs from the carry_out of the MSB)
To see if two 32-bit values are equal, subtract them and OR all the output bits to see if it's 0 (0 if all 0, 1 otherwise --> negate output)
Full ALU (IMAGE 250-4)

B.7:

edge-triggered clocking is when all state changes occur on a clock edge
- this determines when the state elements are updated
- state element can be both an input and output (therefore an element can be read and written in the same clock cycle)
clock cycle has two portions: high and low

B.8:

simplest type of memory elements are unclocked
unclocked latch = S-R latch (set-reset latch)
- built from cross-coupled NOR gates: each gate's output feeds back into the other's input, so the stored value is held while S and R are both 0
- (IMAGE 250-5)
in flip-flops and latches, the output is equal to the stored state
- in a clocked latch, state is changed if inputs change and clock is asserted
- in a flip-flop, state is changed only on clock edge

- (IMAGE 250-12)

data memory unit:
- 2 inputs (32-bit address and 32-bit data to write)
- 2 control signals (MemWrite, which tells if data is being written, and MemRead, which says if data is being read from the input address)
  - only one can be turned on in a given clock
- 1 output (32-bit data read from address)
for beq, the 16-bit immediate is the offset you're supposed to add to the PC if the two registers are equal (this computes branch target address)
- the 16-bit offset is sign-extended to 32 bits, shifted left 2 bits, and added to the incremented PC (PC+4)
- the shift is there because the offset counts words: instructions are 4 bytes and word-aligned, so the lower 2 bits of any instruction address are always 00 and don't need to be stored
(IMAGE 250-13)
PC and 32-bit shifted offset have their own adder to get new branch target
branches are delayed (the instruction after the branch is always executed)
- if branch condition is true, the instruction after the branch is executed before going to the branch target address
- otherwise, execution goes normally
jump replaces lower 28 bits of PC with lower 26 bits of instruction shifted left by 2

4.4:

ALU control lines
- 0000 AND
- 0001 OR
- 0010 add
- 0110 subtract
- 0111 set on less than
- 1100 NOR
- these are all possible operations the ALU can perform
  - for R-type, ALU op is chosen by 6-bit funct field
  - for branch equal, subtract must be used
  - lw and sw use addition to compute memory address (offset + address)
ALU control
- 2 inputs (2-bit control signal [ALUOp] and 6-bit funct field)
- 1 output (4-bit ALU control input)
- (IMAGE 250-14)
- know the ALUOps (cheat sheet)
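A sketch of the ALU control logic as a lookup. It assumes the usual textbook ALUOp encoding (00 for lw/sw, 01 for beq, 10 for R-type) and the standard MIPS funct codes; check both against the cheat sheet, since they are not spelled out in these notes.

```python
# funct field -> 4-bit ALU control lines (standard MIPS funct codes)
FUNCT_TO_ALU = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # sub
    0b100100: 0b0000,  # and
    0b100101: 0b0001,  # or
    0b101010: 0b0111,  # slt
    0b100111: 0b1100,  # nor
}


def alu_control(alu_op, funct):
    """Map (ALUOp, funct) to the 4-bit ALU control input."""
    if alu_op == 0b00:      # lw / sw: add base and offset
        return 0b0010
    if alu_op == 0b01:      # beq: subtract and check Zero
        return 0b0110
    return FUNCT_TO_ALU[funct]   # R-type: decided by the funct field


print(format(alu_control(0b00, 0), "04b"))
print(format(alu_control(0b01, 0), "04b"))
print(format(alu_control(0b10, 0b101010), "04b"))
```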
All R-type instructions have opcodes of 0
Need a mux to select which bits will be the destination register: the rt field for lw, the rd field for R-type (the fields are in different positions)
Each mux requires a single control line
Control signals (deasserted -> mux selects 0, asserted -> mux selects 1)
RegDst, deasserted: reg dest # for Write reg comes from rt field, asserted: reg dest # for Write reg comes from rd field
RegWr, deasserted: none, asserted: Reg on Write reg input is written with value in Write data input
ALUSrc, deasserted: 2nd ALU operand comes from 2nd reg file output (read data 2), asserted: 2nd ALU operand is sign-extended, lower 16 bits of instruction
PCSrc, deasserted: PC is replaced by output of adder (PC+4), asserted: PC is replaced by output of adder that computes branch target
MemRd, deasserted: none, asserted: Data mem contents designated by address input are put on read data output

MemWr, deasserted: none, asserted: Data mem contents designated by address input are replaced by value on Write data input
MemtoReg, deasserted: value fed to Write data input comes from ALU, asserted: value fed to Write data input comes from data memory
These control signals (except PCSrc) can be set by looking at opcode
PCSrc is set by AND'ing the Zero output of the ALU with the Branch control signal
Branch control signal = asserted if instr is beq
(IMAGE 250-15)
shows control signals for certain types of instructions
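Since the image is not reproduced here, the table below is a sketch of the usual single-cycle control settings for four instruction classes, using the signal names from these notes. It assumes the standard textbook datapath that implements beq; `None` marks a don't-care.

```python
CONTROL = {
    "R-type": dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWr=1,
                   MemRd=0, MemWr=0, Branch=0, ALUOp=0b10),
    "lw":     dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWr=1,
                   MemRd=1, MemWr=0, Branch=0, ALUOp=0b00),
    "sw":     dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWr=0,
                   MemRd=0, MemWr=1, Branch=0, ALUOp=0b00),
    "beq":    dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWr=0,
                   MemRd=0, MemWr=0, Branch=1, ALUOp=0b01),
}


def pc_src(instr, zero):
    """PCSrc = Branch AND Zero: take the branch target only for an equal beq."""
    return CONTROL[instr]["Branch"] & zero


print(CONTROL["lw"])
print(pc_src("beq", zero=1), pc_src("beq", zero=0), pc_src("lw", zero=1))
```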
R-type instruction flow:
- Instruction is fetched, PC incremented
- 2 registers read from reg file, main control unit computes control line values
- ALU operates on data read from reg file using func code
- Result from ALU written into reg file (rd)
lw instruction flow:
- Instruction is fetched, PC incremented
- Reg value read from reg file (rs)
- ALU computes sum of value from reg file and sign-extended 32-bit immediate
- Sum from ALU is used as address for data memory
- Data from memory is written into reg file (rt)
beq instruction flow:
- Instruction is fetched, PC incremented
- Two regs read from reg file
- ALU performs subtract on data values from reg file; value of PC+4 is added to sign-extended 32-bit immediate (shifted left by 2 already) to get branch target address
- Zero result from ALU is used to decide which adder result to store in PC
implement a jump by concatenating upper 4 bits of PC+4, 26-bit immediate field of instruction, bits 00
- control signal, Jump, is added
- and mux is added to choose between PC+4, branch PC, or jump PC

principle of locality states that programs access a small portion of their address space at any instant
temporal locality (time): if you recently accessed something, you'll likely access it again soon
- loops show a lot of this
spatial locality (space): items near a recently referenced item will likely be referenced soon
- instruction access sequences show a lot of this
- data accesses show this too (elements in an array)
memory hierarchy is a structure that uses multiple levels of memories (as distance from proc increases, size of memory and access time increase)
- data is only copied between two adjacent levels at a time
- a block (line) is the minimum unit of information that can be present or absent in cache
- if the data requested is in some block in the upper level, it's a HIT (otherwise a MISS)
- hit rate (hit ratio) is hits/total_accesses
- miss rate (1 - hit_rate) is misses/total_accesses
- hit time is time needed to access the upper level, including determining if access is hit or miss

with larger blocks, miss penalty increases because the time to transfer a large block to the cache is longer
processing of cache miss creates pipeline stall (not an interrupt)
handling a miss:
- Send PC-4 to memory
- Instruct memory to perform read and wait for memory to complete access
- Write cache entry, copying memory data into cache, writing tag, and turning valid bit on
- restart instruction at first step (refetches instruction to find it in cache)
write-through updates both cache and next lower level of memory
write-back updates data only in cache
- lower level memory is updated when this cache block is replaced
write-allocate fetches the block from memory into the cache on a write miss and then writes into this new cache block
no-write-allocate doesn't fetch the block on a write miss; it updates only the memory, not the cache

Read-stall cycles = (reads/program)(read miss rate)(read miss penalty)
Write-stall cycles = (writes/program)(write miss rate)(write miss penalty) + write buffer stalls
- write buffer is a queue that holds data while data is being written to memory
- write data into cache and write buffer, continue execution, write into memory
memory-stall clock cycles = (mem accesses/prgm)(miss rate)(miss penalty) = (instr/prgm)(misses/instr)(miss penalty)
avg memory access time (AMAT) = t_hit + (miss rate)(miss penalty)
THIS IS IMPORTANT
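A quick numeric check of these formulas. The numbers below (1-cycle hit time, 5% miss rate, 20-cycle miss penalty, and so on) are made-up example values, and the function names are illustrative.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty


def memory_stall_cycles(accesses, miss_rate, miss_penalty):
    """Memory-stall cycles = accesses x miss rate x miss penalty."""
    return accesses * miss_rate * miss_penalty


# 1-cycle hit, 5% misses, 20-cycle penalty -> about 2 cycles per access
print(amat(hit_time=1, miss_rate=0.05, miss_penalty=20))
# 1000 accesses, 2% misses, 100-cycle penalty -> about 2000 stall cycles
print(memory_stall_cycles(accesses=1000, miss_rate=0.02, miss_penalty=100))
```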
fully associative
- a block can be placed in any location in cache
- to find a block, all entries in cache are searched
  - all tags of all blocks must be searched
set associative cache
- fixed # of locations each block can be placed
  - n locations for a block = n-way set associative
- contains x sets with each fitting n blocks
- each block maps to a unique SET (given by index)
  - it can be placed anywhere in the set
  - (block_#) mod (# of sets) --> same as direct-mapped (DM) but picks a SET
  - all tags of all elements in the set must be searched
- increasing associativity usually decreases miss rate
  - might increase hit time
  - it also has the costs of extra comparators and delay imposed by searching within set
- LRU replacement scheme
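A toy model of an n-way set associative cache with LRU replacement, to compare miss counts for the same access pattern. The class name, the use of `OrderedDict` to track recency, and the block sequence 0, 8, 0, 6, 8 are choices made for this sketch; with 1 way it behaves as direct-mapped, and with 1 set it is fully associative.

```python
from collections import OrderedDict


class SetAssociativeCache:
    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # One ordered dict per set: keys are tags, oldest (LRU) first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, block_number):
        """Return True on a hit, False on a miss (and fill the block)."""
        index = block_number % self.num_sets    # which set
        tag = block_number // self.num_sets     # identifies block in set
        entries = self.sets[index]
        if tag in entries:
            entries.move_to_end(tag)            # mark as most recently used
            return True
        if len(entries) >= self.ways:
            entries.popitem(last=False)         # evict least recently used
        entries[tag] = True
        return False


def count_misses(num_sets, ways, blocks):
    cache = SetAssociativeCache(num_sets, ways)
    return sum(0 if cache.access(b) else 1 for b in blocks)


pattern = [0, 8, 0, 6, 8]
print(count_misses(4, 1, pattern))   # direct-mapped, 4 blocks
print(count_misses(2, 2, pattern))   # 2-way set associative, 4 blocks
print(count_misses(1, 4, pattern))   # fully associative, 4 blocks
```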
We can optimize our code by being mindful of how arrays (ND) are stored in memory
We can take advantage of spatial locality

5.7:

Virtual memory is a technique that uses main memory as a "cache" for secondary storage
allows efficient/safe sharing of memory between programs
Each program has own address space, a range of mem locs accessible only to program
Translate program's address space to physical addresses (address in main mem)
remove programming burdens of limited main memory
Memory hierarchy of VM and physical memory
VM block is a page
- usually 4KiB to 16KiB
VM page miss called page fault
- not present in main memory (it's on the disk)
- usually reduce these by fully associative placement of pages in memory
- LRU is used here too
  - reference (use) bit set to 1 whenever page is accessed
  - periodically, OS collects bits to find LRU
  - too expensive to constantly update
- data page faults are difficult (occur in middle of instruction)
  - exception must be handled and then instruction restarted
Processor produces virtual address
- address translation from virtual to physical address
Relocation allows a program to be loaded anywhere in main memory
Each program is allocated a set of pages
Virtual address = virtual page number, page offset
- Page offsets are same for virtual and physical addresses
- (IMAGE 250-16)
write-through isn't used in VM (too long); only write-back is used
- dirty bit (in page table) tracks whether page has been written since it was brought into memory
  - indicates if the copy on disk must be updated when page is replaced
segmentation
- variable-size address mapping scheme where address = segment # (mapped to a physical base address) and offset, which is added to that base to find the actual physical address
page table indexes memory for pages (with fully associative placement a page could be anywhere, so searching for it would be too slow)
- each program has own page table
- page table is indexed with page # from VA to find physical page #
- page table register (hardware) points to the start of the current process's page table
- each entry has a valid bit
- no tags are required for page table because it contains a mapping of all virtual pages
  - VPN is all you need
OS creates structure to store location of each virtual page on disk and also structure that tracks which processes/VAs use each physical page
# of PTEs = 2^(VA bits) / 2^(page offset bits) = 2^(VA bits - page offset bits)
page table size = (# of PTEs)(bytes per PTE)
can be expensive; solutions:
grow page table as process needs more VAs
multi-level page tables (each level reads certain high-order bits)
less memory is wasted --> not a bunch of empty space
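A sketch of address translation and the page table size formula. It assumes 4 KiB pages (12 offset bits) and models the page table as a Python dict from virtual page number to physical page number; a missing entry stands in for an invalid entry. `PageFault` and the function names are hypothetical.

```python
PAGE_OFFSET_BITS = 12   # assumed 4 KiB pages


class PageFault(Exception):
    """Raised when the virtual page is not present in main memory."""


def split_virtual_address(va, offset_bits=PAGE_OFFSET_BITS):
    """Return (virtual page number, page offset)."""
    return va >> offset_bits, va & ((1 << offset_bits) - 1)


def translate(va, page_table, offset_bits=PAGE_OFFSET_BITS):
    """Look up the VPN; the page offset is copied over unchanged."""
    vpn, offset = split_virtual_address(va, offset_bits)
    ppn = page_table.get(vpn)
    if ppn is None:
        raise PageFault(f"virtual page {vpn} is not in memory")
    return (ppn << offset_bits) | offset


def page_table_bytes(va_bits, offset_bits, pte_bytes):
    """(# of PTEs) x (bytes per PTE), with # of PTEs = 2^(VA - offset bits)."""
    return (1 << (va_bits - offset_bits)) * pte_bytes


print(hex(translate(0x00003ABC, {3: 7})))     # VPN 3 -> physical page 7
print(page_table_bytes(32, 12, 4))            # 2^20 entries x 4 bytes
```

With a 32-bit virtual address this single-level table already takes 4 MiB per process, which is why the table is described as expensive.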

- ready bit (bit 0)
  - keep polling until ready
receiver data 0xffff
- read lower 8 bits (input char)
trans control 0xffff
- ready bit (bit 0)
  - poll until ready
trans data 0xffff000c
- transfer lower 8 bits (output char)
(IMAGE 250-18)

pipeline stages (IF/ID/EX/MEM/WB)
- data path is broken into 5 stages
- (IMAGE 250-19)
- pipeline registers separate each stage
  - IF/ID 64 bits, ID/EX 128 bits, EX/MEM 97 bits, MEM/WB 64 bits
lw
- IF: instr read, placed in IF/ID reg, PC incremented, new PC saved in IF/ID reg
- ID: IF/ID reg supplies the 16-bit immediate (sign-extended to 32) and the 2 reg numbers to read. All three values stored to ID/EX reg with PC
- EX: reg value (base address) and 32-bit immediate are added; result placed in EX/MEM reg
- MEM: Read DMEM using address from EX/MEM reg and load data into MEM/WB reg
- WB: Read data from MEM/WB and write into reg file
sw same as lw, but in MEM you store the read reg 2 data at the address held in EX/MEM (nothing happens in WB)
Set control values during each pipeline stage
- 5 different groups of control lines (1 for each stage)
- (IMAGE 250-20)
- IF and ID are same every clock

data hazard -> an instr needs a register value that an earlier instr still in the pipeline hasn't written back yet
- data forwarding is the solution: pass the result from the EX/MEM or MEM/WB pipeline register straight to the ALU inputs
- otherwise, stall the instruction until the value is ready
- only need to forward if RegWrite control is asserted for the earlier instr
add muxes to use registers from other stages
- (IMAGE 250-21)
- You have a forwarding unit to assert control signals for muxes
hazards
- EX hazard: the prev instr writes a reg that is a source of the current instr -> forward the ALU result from EX/MEM
- MEM hazard: the instr two before writes a reg that is a source of the current instr -> forward from MEM/WB
hazard detection unit in addition to forwarding unit
- operates during ID to insert stall between load and use
  - the dependent instr is held in ID and a nop (bubble) goes down the pipeline until the data is ready
  - stall is changing EX/MEM/WB controls of ID/EX to 0
  - ASK TAVO TO EXPLAIN
- for lw, the data isn't available until it has been read from the dmem (it's read during MEM), so it can't be forwarded to the very next instr in time
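The forwarding and load-use checks can be written as small predicates over pipeline-register fields. This sketch follows the usual textbook conditions (forward only if the earlier instruction writes a register other than $zero, and prefer the most recent result); the parameter names stand for pipeline-register fields and are assumptions, and only the first ALU operand (rs) is shown.

```python
def forward_a(id_ex_rs, ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    """Select the first ALU operand: 10 = EX/MEM, 01 = MEM/WB, 00 = reg file."""
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return 0b10     # EX hazard: result of the previous instruction
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return 0b01     # MEM hazard: result of the instruction before that
    return 0b00         # no hazard: use the value read in ID


def load_use_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """Stall if the instr in EX is a load whose dest is a source of the instr in ID."""
    return bool(id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt))


# add $t0,...  followed by  sub ...,$t0,...  -> forward from EX/MEM
print(forward_a(id_ex_rs=8, ex_mem_regwrite=1, ex_mem_rd=8,
                mem_wb_regwrite=0, mem_wb_rd=0))
# lw $t0,...  followed by  add ...,$t0,...  -> one stall is needed
print(load_use_stall(id_ex_memread=1, id_ex_rt=8, if_id_rs=8, if_id_rt=9))
```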

stalling is slow --> predict branch isn't taken and continue execution
- if branch is taken, flush instructions by changing original controls to 0s and discarding the instructions in IF, ID, and EX
move branch execution earlier in pipeline (ID stage)
- compute both target address and branch decision there
- can still cause data hazards
  - ALU instruction before branch instruction produces result for one of branch operands
  - load before branch instruction
- to flush, add control line in IF stage
dynamic branch prediction sees if branch was taken last time instruction was executed
- branch prediction buffer (small memory indexed by lower portion of branch instruction address that contains bits saying whether branch was taken or not)
- 2-bit scheme: a prediction must be wrong twice before it is changed
branch delay slot
- MIPS fills it with an instruction not affected by the branch
branch target buffer caches destination PC for a branch
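A sketch of a branch prediction buffer with 2-bit saturating counters. The table size, the initial counter value (0, strongly not taken), the indexing by word address, and the example PC are all assumptions made for illustration.

```python
class TwoBitPredictor:
    def __init__(self, entries=16):
        self.entries = entries
        self.counters = [0] * entries   # 0,1 = predict not taken; 2,3 = taken

    def _index(self, pc):
        # Lower bits of the word address pick the buffer entry.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_correct(outcomes, pc=0x00400020):
    """Run one branch through the predictor and count correct predictions."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct


# A loop branch: taken 4 times, falls through once, then taken 4 more times.
loop = [True] * 4 + [False] + [True] * 4
print(count_correct(loop), "of", len(loop))
```

After the single not-taken outcome the counter only drops from 3 to 2, so the next iteration is still predicted taken; a 1-bit scheme would mispredict there.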

only overflow and undefined instr exceptions are handled right now
- save instr address in EPC and let OS handle it
- reason for exception in Cause register
- vectored interrupt has the control transferred to an address determined by cause of exception
to handle exceptions:
- EPC 32-bit reg added, 32-bit Cause register added
- (IMAGE 250-22)
imprecise exceptions: in a pipeline, the exception is not associated with the exact instr that caused it
- makes the hardware simpler

cache coherence problem is when the caches of a multicore processor hold different values for the same memory location
memory system coherent if:
- a read of X by P after a write of X by P, with no other processor writing X in between, returns the value P wrote
- a read of X by P after Q wrote X returns the value Q wrote (if the read and write are far enough apart in time)
- writes to same location are kept in order (serialization)
  - all processors see them in the same order
migration moves data to local cache
- reduces latency
replication: when shared data is read simultaneously, cache makes copy of item in local cache
- reduces latency of access
snooping
- every cache that has a copy of data from a block of phys mem also has a copy of its sharing status