Computer Architecture and Design, Lecture notes of Advanced Computer Architecture

Typology: Lecture notes

2014/2015

Uploaded on 10/07/2015

joe_jacob

2.4:

  • MIPS word is 32 bits
  • two's complement
    • leading 0 = positive, leading 1 = negative
    • 1000...000 is -2^31
    • 1000...001 is -2^31 + 1
    • 1111...111 is -1
    • 0111...111 is 2^31 - 1
    • MSB is sign bit in two's complement
      • it's multiplied by -2^31 and the other bits are multiplied by 2^30, 2^29, etc.
    • negate number by flipping each bit and adding 1
  • sign extend to place correct representation of number within register
    • repeatedly copy sign bit to the front of the number (if signed)
    • otherwise, repeatedly copy 0's
  • one's complement
    • 100...000 is most negative (-2^31 + 1)
    • 011...111 is most positive (2^31 - 1)
    • 000...000 is positive 0
    • 111...111 is negative 0
    • opposite sign number is simply inverting each bit
    • not good because there are two zeros and adders need an extra step to subtract (inefficient)
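The two's complement rules above can be tried out with a few lines of Python. This is only a sketch: the helper names (`to_signed`, `negate`, `sign_extend`) are made up for illustration, and Python integers are masked to 32 bits to imitate a MIPS register.

```python
WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1  # 0xFFFFFFFF, keeps values inside one 32-bit word


def to_signed(pattern: int) -> int:
    """Read a 32-bit pattern as two's complement: the MSB weighs -2^31."""
    pattern &= MASK
    sign = pattern >> (WORD_BITS - 1)
    return pattern - (sign << WORD_BITS)


def negate(pattern: int) -> int:
    """Flip every bit, then add 1 (result wrapped back to 32 bits)."""
    return ((pattern ^ MASK) + 1) & MASK


def sign_extend(value: int, from_bits: int) -> int:
    """Copy the sign bit of a from_bits-wide value into the upper bits."""
    value &= (1 << from_bits) - 1
    if value >> (from_bits - 1):
        value |= MASK & ~((1 << from_bits) - 1)
    return value


print(to_signed(0x80000000))         # -2147483648, i.e. -2^31
print(to_signed(0xFFFFFFFF))         # -1
print(hex(negate(5)))                # 0xfffffffb
print(hex(sign_extend(0xFFFE, 16)))  # 0xfffffffe (still -2)
```
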
3.2:

  • overflow causes an exception or interrupt
    • EPC contains address of instruction that caused exception
      • mfc0 is used to copy EPC into general register ($k0 or $k1)
3.5:

  • MIPS single-precision floating point: 1 sign bit, 8 exponent bits, 23 fraction bits
    • (-1)^S x F x 2^E
  • MIPS double-precision floating point: 1 sign bit, 11 exponent bits, 52 fraction bits
  • both follow IEEE 754, which uses biased notation: (-1)^S x (1 + fraction) x 2^(exponent - bias)
    • bias is 127 for single precision and 1023 for double precision
  • HOW DO I CONVERT BINARY TO FLOATING POINT AND VICE VERSA
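One way to approach the conversion question is to pull the three fields out of the 32-bit pattern and plug them into the biased formula. The sketch below does that for single precision; the function names are illustrative, and `decode_single` only handles normal numbers (it ignores zero, denormals, infinity and NaN).

```python
import struct


def float_to_bits(x: float) -> int:
    """Return the 32-bit single-precision pattern of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def decode_single(bits: int) -> float:
    """Apply (-1)^S x (1 + fraction) x 2^(exponent - 127) to a bit pattern."""
    sign = (bits >> 31) & 1          # 1 sign bit
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits (biased by 127)
    fraction = bits & 0x7FFFFF       # 23 fraction bits
    return (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)


bits = float_to_bits(-0.75)
print(hex(bits))            # 0xbf400000
print(decode_single(bits))  # -0.75
```

Going the other way by hand: -0.75 is -1.1 (binary) x 2^-1, so S = 1, exponent = -1 + 127 = 126, and the fraction is 100...0.
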
2.3:

  • the MIPS load instruction transfers data from memory to a register
    • lw $t0, 32($s3)
      • loads element 8 (byte offset 32 = 8 x 4) of the array whose base address is in $s3 into $t0
      • $s3 is base register, 32 is the offset in bytes
  • byte-addressed means that the addresses of sequential words differ by 4
    • 4 bytes in a 32-bit word
  • MIPS is big-endian, which means that it uses the address of the leftmost byte as the word address versus the rightmost (little endian)
  • the MIPS store instruction transfers data from a register to memory
    • sw $t0, 48($s3)
      • stores the value from $t0 to element 12 (byte offset 48 = 12 x 4) of the array whose base address is in $s3
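The relationship between array index and byte offset, and the difference between big- and little-endian byte order, can be checked with a small sketch. The helper names are hypothetical and the word value 0x12345678 is just an example.

```python
WORD_BYTES = 4  # a 32-bit MIPS word is 4 bytes


def byte_offset(index: int) -> int:
    """Byte offset to write in lw/sw for a given word index."""
    return index * WORD_BYTES


def element_index(offset: int) -> int:
    """Word index addressed by a byte offset (must be word-aligned)."""
    if offset % WORD_BYTES:
        raise ValueError("lw/sw offsets must be multiples of 4")
    return offset // WORD_BYTES


print(element_index(32))  # 8  -> lw $t0, 32($s3) reads element 8
print(element_index(48))  # 12 -> sw $t0, 48($s3) writes element 12

word = 0x12345678
print(word.to_bytes(4, "big").hex())     # 12345678: leftmost byte at the word address
print(word.to_bytes(4, "little").hex())  # 78563412: rightmost byte at the word address
```
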
2.5:

  • $t0 to $t7 = $8 to $15
  • $s0 to


  • R-type: 6 bits (opcode) - 5 bits (rs, 1st src) - 5 bits (rt, 2nd src) - 5 bits (rd, dest) - 5 bits (shamt, shift amount --> used by sll/srl, e.g. to scale an array index) - 6 bits (funct, function code)
  • I-type: 6 bits (opcode) - 5 bits (rs, src) - 5 bits (rt, dest for loads and immediates) - 16 bits (immediate)
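The field layouts above can be packed into 32-bit words with shifts and ORs. This is a sketch with made-up function names; the register numbers ($t0 = 8, $s1 = 17, $s2 = 18, $s3 = 19) and the opcode/funct values (add: funct 0x20, lw: opcode 35) are the standard MIPS ones.

```python
def encode_r(opcode: int, rs: int, rt: int, rd: int, shamt: int, funct: int) -> int:
    """R-type: opcode(6) | rs(5) | rt(5) | rd(5) | shamt(5) | funct(6)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i(opcode: int, rs: int, rt: int, imm: int) -> int:
    """I-type: opcode(6) | rs(5) | rt(5) | immediate(16)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


# add $t0, $s1, $s2 -> rs = $s1, rt = $s2, rd = $t0
print(hex(encode_r(0, 17, 18, 8, 0, 0x20)))  # 0x2324020

# lw $t0, 32($s3) -> rs = $s3 (base), rt = $t0 (dest), imm = 32
print(hex(encode_i(35, 19, 8, 32)))          # 0x8e680020
```
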

2.6:

  • logical operators
    • sll (shift left) - move current bits to left
      • utilizes shamt bits by dictating shift amount
    • srl (shift right) - move current bits to right
    • and, andi (bit-by-bit AND)
    • or, ori (bit-by-bit OR)
      • use andi/ori to do a bitmask and manipulate the bits you want
    • nor (bit-by-bit NOR; nor with $zero gives NOT)
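A quick sketch of what these operations do to a 32-bit value. The function names mirror the MIPS mnemonics but are plain Python helpers written for illustration, and the sample value is arbitrary.

```python
MASK32 = 0xFFFFFFFF  # keep results inside a 32-bit register


def sll(x: int, shamt: int) -> int:
    return (x << shamt) & MASK32


def srl(x: int, shamt: int) -> int:
    return (x & MASK32) >> shamt


def nor(a: int, b: int) -> int:
    return ~(a | b) & MASK32


value = 0x12345678
print(hex(value & 0xFF))   # 0x78: an AND mask keeps only the low byte
print(hex(value | 0xF))    # 0x1234567f: an OR mask forces the low 4 bits to 1
print(hex(nor(value, 0)))  # 0xedcba987: nor with zero acts as NOT
print(hex(sll(9, 4)))      # 0x90: shifting left by 4 multiplies by 16
print(hex(srl(0x90, 4)))   # 0x9
```
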

2.7:

  • conditional branches:
    • beq reg1, reg2, L - if reg1 == reg2, go to L
    • bne reg1, reg2, L - if reg1 != reg2, go to L
  • unconditional branch: j L
    • kind of like the "else" part of a conditional branch: when the branch to L1 isn't taken, execution reaches a j that always goes to L2 - it's a jump
  • slt $t0, $s3, $s4
    • set $t0 to 1 if $s3 < $s4, otherwise set $t0 to 0
    • slti is the immediate form ($s4 would be an immediate)

A.2:

  • symbol table
    • a record of each label and its associated memory address
  • object file
    • header: size and position of other pieces of file
    • text segment: machine language code for routines in source file
    • data segment: binary representation of data in source file
    • relocation information: identifies instructions and data words that depend on absolute addresses
    • symbol table: associates addresses with external labels in source file
    • debugging info: description of way program was compiled
  • .macro name_of_macro($arg) ...... .end_macro
    • this is located in the text segment and basically you can call it by writing: name_of_macro($any_register)
    • in the macro, $arg will be replaced by $any_register in all instances

A.10:

  • c0 (coprocessor 0) handles exceptions/interrupts
  • c1 (coprocessor 1) handles floating-point

jr $ra

  • end of the procedure; return to the caller
  • $t0 - $t9: temp regs that are not preserved by callee on procedure call
  • $s0 - $s7: saved regs that must be preserved on procedure call (if used, callee saves and restores them)
  • in the example above, the sw and lw for $t0 and $t1 are unnecessary
  • for recursion, you save the return address and arg on the stack each time (adjusting the stack pointer, $sp)
  • in the base case, perform the base calculation, pop the address and arg off the stack, and call jr $ra
  • this keeps going until the original call procedure finishes
  • you restore the arg and the return address and pop the two items from the stack
  • save the final result in $v0 and jr $ra to caller
  • lb $t0, 0($s1)
    • load a byte from the address in $s1 into $t0 (a byte can be an ascii value!)
  • sb $t0, 0($s1)
    • store the low byte of $t0 to the address in $s1 (this can also be an ascii value)

2.10:

  • lui $s0, 61
    • loads 61 (in binary) in the upper 16 bits of $s0
    • to manipulate the lower 16 bits of $s0, you ori with your desired bits --> this is how you load a 32-bit constant into a reg
  • J-type: 6 bits (opcode) - 26 bits (address field)
  • for I-type branches, the 16-bit field is a word offset, so byte offsets of 18 bits can be reached relative to PC+4; for J-type, the 26-bit field is a word address, so byte addresses of 28 bits can be reached (the lower 2 bits are always 00)
  • I DONT UNDERSTAND THIS
  • C program --> compiler --> assembly --> assembler --> object file (machine language) --> linker --> executable --> loader --> memory
  • compiler converts to assembly
  • assembler converts assembly into machine code
  • a linker combines all the independently assembled machine language programs into an executable
  • this lets you avoid recompiling the whole program for only changing one line of code
  • ARM uses 32-bit instructions and a 12-bit immediate field
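For the lui/ori trick and the 18-bit and 28-bit address ranges noted earlier in this section, a small sketch may help. The helpers `lui` and `ori` are Python stand-ins for the instructions, and the lower half 2304 is an assumed example value chosen so that the result is a round number.

```python
def lui(imm16: int) -> int:
    """Place a 16-bit immediate in the upper half; the lower half becomes 0."""
    return (imm16 & 0xFFFF) << 16


def ori(reg: int, imm16: int) -> int:
    """OR a zero-extended 16-bit immediate into the lower half."""
    return reg | (imm16 & 0xFFFF)


s0 = ori(lui(61), 2304)  # lui $s0, 61 then ori $s0, $s0, 2304
print(s0)                # 4000000

# Word addressing: the hardware appends two 0 bits to the field,
# so a 16-bit field spans 18 bits of byte offset and a 26-bit field 28 bits.
branch_span_bytes = 1 << (16 + 2)  # signed, so roughly +/- 128 KiB around PC+4
jump_span_bytes = 1 << (26 + 2)    # one 256 MiB region chosen by the top 4 PC bits
print(branch_span_bytes)           # 262144
print(jump_span_bytes // 2**20)    # 256
```
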

B.2:

  • gate can have multiple inputs
  • any logical function can be constructed using AND/OR gates and inversion

B.3:

  • a decoder takes an n-bit input and has 2^n outputs, where each output is associated with a unique input combination
  • a multiplexor (mux) has three inputs: two values and a selector
    • the selector determines which input becomes the output
    • can be C = (A AND NOT S) OR (B AND S) (output is either A or B)
    • if there are n inputs, there will need to be ceiling(log2(n)) selector inputs (still one output)
  • programmable logic array (PLA)
    • implements your sum of products directly (an array of AND gates feeding an array of OR gates), and if there are multiple outputs for the same set of inputs, you can output more than one signal
    • (IMAGE 250-1)
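The decoder and mux descriptions above translate directly into a few lines of Python. This is only a behavioural sketch with illustrative names, working on single bits (0 or 1).

```python
def decoder(value: int, n: int) -> list:
    """n-bit input -> 2^n outputs, exactly one of which is 1."""
    return [1 if i == value else 0 for i in range(2 ** n)]


def mux2(a: int, b: int, s: int) -> int:
    """C = (A AND NOT S) OR (B AND S) for 1-bit a, b, s."""
    return (a & (1 - s)) | (b & s)


def select_lines(n_inputs: int) -> int:
    """ceiling(log2(n)) selector bits, computed without floating point."""
    return (n_inputs - 1).bit_length()


print(decoder(2, 2))    # [0, 0, 1, 0]
print(mux2(1, 0, 0))    # 1 (S = 0 picks A)
print(mux2(1, 0, 1))    # 0 (S = 1 picks B)
print(select_lines(8))  # 3
print(select_lines(5))  # 3
```
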

B.5:

  • For a 1-bit adder, we have three inputs (the two bits and carry_in) and two outputs (the sum and carryout)
  • you simply make the truth table for all values of inputs and proceed
  • this is a smaller part of an ALU that can use a mux for different operations
  • (IMAGE 250-2)
  • A ripple carry adder is a linking of 1-bit adders (the carry_out becomes the carry_in of the next adder and so on)
  • In order to support subtraction, invert each bit of the second operand (b) and set carry_in of the LSB adder to 1 (since -b = NOT b + 1); the inversion is selected by an extra mux that dictates if it's addition or subtraction
  • (IMAGE 250-3)
  • HOW DOES THE +1 FOR NEGATING TWO'S COMPLEMENT WORK?
  • To implement slt (set less than), the upper 31 bits of the 32-bit ALU result are set to 0, and the LSB is either 0 or 1 based on the comparison
  • this requires another input, Less, which is tied to 0 for the 31 upper bits of the ALU, so they output 0 for the result
  • if a < b, a-b is negative and the output is 1, otherwise output is 0
  • using this, subtract the two numbers and take the sign bit from the result --> that's your LSB of slt
  • you have to add an extra output (Set) to the MSB ALU so its adder output goes straight to Less in the LSB ALU
  • MSB ALU requires overflow detection (overflow if the carry into the MSB differs from the carry_out of the MSB)
  • To see if two 32-bit values are equal, subtract them and OR all the output bits (0 if all 0, 1 otherwise), then negate that output to get Zero
  • Full ALU (IMAGE 250-4)
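The adder, subtraction, slt and equality ideas from this section can be simulated bit by bit, which also shows where the "+1" of two's complement negation comes from: it is just the carry_in of the least significant adder. This is a behavioural sketch with made-up names, not the gate-level design in the images. One deliberate departure from the notes: `slt` XORs the sign bit with the overflow flag so that the comparison stays correct when a - b overflows.

```python
def full_adder(a: int, b: int, carry_in: int):
    """1-bit adder: returns (sum, carry_out)."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)
    return s, carry_out


def ripple_add(a: int, b: int, subtract: bool = False, bits: int = 32):
    """Chain 1-bit adders; for subtraction invert b and feed carry_in = 1."""
    carry = 1 if subtract else 0  # this is the "+1" of flip-and-add-1
    result = 0
    carry_into_msb = 0
    for i in range(bits):
        a_bit = (a >> i) & 1
        b_bit = (b >> i) & 1
        if subtract:
            b_bit ^= 1            # the mux selects the inverted b
        if i == bits - 1:
            carry_into_msb = carry
        s, carry = full_adder(a_bit, b_bit, carry)
        result |= s << i
    overflow = carry_into_msb ^ carry  # carry into MSB != carry out of MSB
    return result, overflow


def slt(a: int, b: int, bits: int = 32) -> int:
    """1 if a < b as signed values, else 0 (sign bit corrected by overflow)."""
    diff, overflow = ripple_add(a, b, subtract=True, bits=bits)
    return ((diff >> (bits - 1)) & 1) ^ overflow


def zero(a: int, b: int, bits: int = 32) -> int:
    """1 if a == b: subtract, OR all result bits, then negate."""
    diff, _ = ripple_add(a, b, subtract=True, bits=bits)
    return 0 if diff else 1


print(ripple_add(7, 5, subtract=True))  # (2, 0)
print(ripple_add(0x7FFFFFFF, 1))        # (2147483648, 1): signed overflow
print(slt(3, 5), slt(5, 3))             # 1 0
print(zero(9, 9), zero(9, 8))           # 1 0
```
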

B.7:

  • edge-triggered clocking is when all state changes occur on a clock edge
    • this determines when the state elements are updated
    • state element can be both an input and output (therefore an element can be read and written in the same clock cycle)
  • clock cycle has two portions: high and low

B.8:

  • simplest type of memory elements are unclocked
  • unclocked latch = S-R latch (set-reset latch)
    • built from two cross-coupled NOR gates: each gate's output feeds back into the other gate, so the stored value is held while S and R are both 0
    • (IMAGE 250-5)
  • in flip-flops and latches, the output is equal to the stored state
    • in a clocked latch, state is changed if inputs change and clock is asserted
    • in a flip-flop, state is changed only on clock edge

- (IMAGE 250-12)

  • 2 inputs (32-bit address and 32-bit data to write)
  • 2 control signals (MemWrite, which tells if data is being written, and MemRead, which says if data is being read from input address)
    • only one can be turned on in a given clock
  • 1 output (32-bit data read from address)
  • for beq, a 16-bit immediate is the offset you're supposed to add to the PC if the two registers are equal (this computes branch target address)
  • the 16-bit offset is sign-extended to 32 bits and added to PC+4
  • the 32-bit sign-extended offset is shifted left 2 bits because the offset counts words, not bytes: instructions are 4 bytes long and word-aligned, so the lower 2 bits of every instruction address are always 00
  • (IMAGE 250-13)
  • PC and 32-bit shifted offset have their own adder to get new branch target
  • branches are delayed (instruction after branch is immediately executed)
  • if branch condition is true, the instruction after the branch is immediately executed before going to branch target address
  • otherwise, execution goes normally
  • jump replaces lower 28 bits of PC with lower 26 bits of instruction shifted left by 2
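The branch and jump target calculations above come down to a sign extension, a shift and an add (or a concatenation). The sketch below uses hypothetical function names and example addresses, and assumes the target is computed from PC+4 as in the single-cycle datapath.

```python
def sign_extend16(imm: int) -> int:
    """Interpret a 16-bit field as a signed number."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc: int, imm16: int) -> int:
    """PC+4 plus the sign-extended word offset shifted left by 2."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


def jump_target(pc: int, addr26: int) -> int:
    """Upper 4 bits of PC+4, then the 26-bit field, then 00."""
    return ((pc + 4) & 0xF0000000) | ((addr26 & 0x03FFFFFF) << 2)


print(hex(branch_target(0x00400000, 3)))       # 0x400010: 3 instructions past PC+4
print(hex(branch_target(0x00400010, 0xFFFC)))  # 0x400004: offset -4 words
print(hex(jump_target(0x00400000, 0x0100004))) # 0x400010
```
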

4.4:

  • ALU control lines
    • 0000 AND
    • 0001 OR
    • 0010 add
    • 0110 subtract
    • 0111 set on less than
    • 1100 NOR
  • these are all possible operations the ALU can perform
  • for R-type, ALU op is chosen by 6-bit funct field
  • for branch equal, subtract must be used
  • lw and sw use addition to compute memory address (offset + address)
  • ALU control unit
    • 2 inputs (2-bit control signal [ALUOp] and 6-bit funct field)
    • 1 output (4-bit ALU control input)
    • (IMAGE 250-14)
    • know the ALUOps (cheat sheet)
  • All R-type instructions have opcodes of 0
  • Need a mux to select which bits will be destination register: rt (bits 20-16) for lw, rd (bits 15-11) for R-type (they are in different fields)
  • Each mux requires a single control line
  • Control signals (deasserted -> mux selects 0, asserted -> mux selects 1)
  • RegDst, deasserted: reg dest # for Write reg comes from rt field, asserted: reg dest # for Write reg comes from rd field
  • RegWr, deasserted: none, asserted: Reg on Write reg input is written with value in Write data input
  • ALUSrc, deasserted: 2nd ALU operand comes from 2nd reg file output (read data 2), asserted: 2nd ALU operand is sign-extended, lower 16 bits of instruction
  • PCSrc, deasserted: PC is replaced by output of adder (PC+4), asserted: PC is replaced by output of adder that computes branch target
  • MemRd, deasserted: none, asserted: Data mem contents designated by address input are put on read data output
  • MemWr, deasserted: none, asserted: Data mem contents designated by address input are replaced by value on Write data input
  • MemtoReg, deasserted: value fed to Write data input comes from ALU, asserted: value fed to Write data input comes from data memory
  • These control signals (except PCSrc) can be set by looking at opcode
  • PCSrc is set by AND'ing Zero output of ALU and the Branch control signal
  • Branch control signal = asserted if instr is beq
  • (IMAGE 250-15)
  • shows control signals for certain types of instructions
  • R-type instruction flow:
  • Instruction is fetched, PC incremented
  • 2 registers read from reg file, main control unit computes control line values
  • ALU operates on data read from reg file using func code
  • Result from ALU written into reg file (rd)
  • lw
  • Instruction is fetched, PC incremented
  • Reg value read from reg file (rs)
  • ALU computes sum of value from reg file and sign-extended 32-bit immediate
  • Sum from ALU is used as address for data memory
  • Data from memory is written into reg file (rt)
  • beq
  • Instruction is fetched, PC incremented
  • Two regs read from reg file
  • ALU performs subtract on data values from reg file, value of PC+4 is added to sign-extended 32-bit immediate (shifted left by 2 already) to receive branch target address
  • Zero result from ALU is used to decide which adder result to store in PC
  • implement a jump by concatenating upper 4 bits of PC+4, 26-bit immediate field of instruction, bits 00
  • control signal, Jump, is added
  • and mux is added to choose between PC+4, branch PC, or jump PC
  • principle of locality states that programs access a small portion of their address space at any instant
  • temporal locality (time) exploits that if you recently accessed something, you'll access it again soon
  • loops show a lot of this
  • spatial locality (space) exploits that for your recently referenced item, items near it will be referenced soon
  • instruction access sequences show a lot of this
  • data accesses show this too (elements in an array)
  • memory hierarchy is a structure that uses multiple levels of memories (distance from proc increases, size of memory and access time increase)
  • data is only copied between two adjacent levels at a time
  • a block (line) is the minimum unit of information that is present or not present in cache
  • if data requested is in some block in upper level, it's a HIT (otherwise a MISS)
  • hit rate (hit ratio) is hits/total_accesses
  • miss rate (1 - hit_rate) is misses/total_accesses
  • hit time is time needed to access the upper level, including determining if access is hit or miss
  • with larger blocks, miss penalty increases because the time to transfer a large block to the cache is longer
  • processing of cache miss creates pipeline stall (not an interrupt)
  • handling a miss:
  • Send original PC value (current PC - 4) to memory
  • Instruct memory to perform read and wait for memory to complete access
  • Write cache entry, copying memory data into cache, writing tag, and turning valid bit on
  • restart instruction at first step (refetches instruction to find it in cache)
  • write-through updates both cache and next lower level of memory
  • write-back updates data only in cache
  • lower level memory is updated when this cache block is replaced
  • write-allocate fetches the block from memory into the cache on a write miss and then writes the new data into that cache block
  • no-write-allocate updates only the memory on a write miss; the block is not brought into the cache
  • Read-stall cycles = (read/program)(read miss rate)(read miss penalty)
  • Write-stall cycles = (writes/program)(write MR)(write miss penalty) + write buffer stalls
  • write buffer is a queue that holds data while data is being written to memory
  • write data into cache and write buffer, continue execution, write into memory
  • memory-stall clock cycles = (mem accesses/prgm)(miss rate)(miss penalty) = (instr/ prgm)(misses/instr)(miss penalty)
  • avg memory access time (AMAT) = t_hit + (miss rate)(miss penalty)
  • THIS IS IMPORTANT
  • fully associative
  • a block can be placed in any location in cache
  • to find a block, all entries in cache are searched
  • all tags of all blocks must be searched
  • set associative cache
  • fixed # of locations each block can be placed
  • n locations for a block = n-way set associative
  • contains x sets with each fitting n blocks
  • each block maps to a unique SET (given by index)
  • it can be placed anywhere in the set
  • (block_#)mod(# of sets) --> same as DM but SETS
  • all tags of any element in the set must be searched
  • increasing associativity usually decreases miss rate
  • might increase hit time
  • it also has the costs of extra comparators and delay imposed by searching within set
  • LRU replacement scheme
  • We can optimize our code by being mindful of how multidimensional (N-D) arrays are stored in memory
  • We can take advantage of spatial locality
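The cache formulas and the set-mapping rule noted earlier in this section can be checked numerically. The sketch below uses hypothetical function names, and the numbers (1-cycle hit, 5% miss rate, 20-cycle penalty, an 8-block cache) are assumed example values, not figures from the notes.

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time = t_hit + (miss rate)(miss penalty)."""
    return hit_time + miss_rate * miss_penalty


def memory_stall_cycles(instructions: int, misses_per_instr: float,
                        miss_penalty: float) -> float:
    """(instr/prgm)(misses/instr)(miss penalty)."""
    return instructions * misses_per_instr * miss_penalty


def set_index(block_number: int, num_blocks: int, ways: int) -> int:
    """(block #) mod (# of sets); ways = 1 is direct mapped,
    ways = num_blocks is fully associative (a single set)."""
    num_sets = num_blocks // ways
    return block_number % num_sets


print(amat(1, 0.05, 20))                      # 2.0 cycles
print(memory_stall_cycles(1000, 0.02, 100))   # 2000.0 cycles
print(set_index(12, num_blocks=8, ways=1))    # 4 (direct mapped)
print(set_index(12, num_blocks=8, ways=2))    # 0 (4 sets of 2 blocks)
print(set_index(12, num_blocks=8, ways=8))    # 0 (fully associative)
```
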

5.7:

  • Virtual memory is a technique that uses main memory as a "cache" for 2ndary storage
  • allows efficient/safe sharing of memory between programs
  • Each program has own address space, a range of mem locs accessible only to program
  • Translate program's address space to physical addresses (address in main mem)
  • remove programming burdens of limited main memory
  • Memory hierarchy of VM and physical memory
  • VM block is a page
    • usually 4KiB to 16KiB
  • VM page miss called page fault
    • not present in main memory (it's on the disk)
    • usually reduce these by fully associative placement of pages in memory
    • LRU is used here too
      • reference (use) bit set to 1 whenever page is accessed
      • periodically, OS collects bits to find LRU
      • too expensive to constantly update
    • data page faults are difficult (occur in middle of instruction)
      • exception must be handled and then instruction restarted
  • Processor produces virtual address
    • address translation from virtual to physical address
  • Relocation allows program loaded anywhere in main memory
  • Each program is allocated a set of pages
  • Virtual address = virtual page number, page offset
    • Page offsets are same for virtual and physical addresses
    • (IMAGE 250-16)
  • write-through isn't used in VM (writes to disk take too long); only write-back is used
    • dirty bit (in page table) tracks whether page has been written since it was read into memory
    • indicates if the copy on disk must be updated when page is replaced
  • segmentation
    • variable-size address mapping scheme where address = segment # (mapped to a physical base address) and offset, which is added to the segment's base address to find actual physical address
  • page table indexes memory for pages (fully associative scheme makes it difficult to find pages in memory originally)
  • each program has own page table
  • page table is indexed with page # from VA to find physical page #
  • page table register (hardware) points to the start of the current process's page table
  • valid bit is found here
  • no tags are required for page table because it contains a mapping of all virtual pages
  • VPN is all you need
  • OS creates structure to store location of each virtual page on disk and also structure that tracks which processes/VAs use each physical page
  • # of PTEs = 2^(VA bits) / 2^(page offset bits) = 2^(VA bits - page offset bits)
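The address split and the page table sizing can be sketched as follows. The function names, the dictionary used as a page table, and the example numbers (32-bit virtual addresses, 4 KiB pages, 4 bytes per entry) are assumptions for illustration; a missing dictionary key stands in for a page fault.

```python
def split_virtual_address(va: int, page_offset_bits: int):
    """Virtual address = virtual page number followed by page offset."""
    vpn = va >> page_offset_bits
    offset = va & ((1 << page_offset_bits) - 1)
    return vpn, offset


def translate(va: int, page_table: dict, page_offset_bits: int) -> int:
    """Look up the physical page number; the offset is copied unchanged."""
    vpn, offset = split_virtual_address(va, page_offset_bits)
    ppn = page_table[vpn]  # KeyError here models a page fault
    return (ppn << page_offset_bits) | offset


def num_ptes(va_bits: int, page_offset_bits: int) -> int:
    """One entry per virtual page."""
    return 1 << (va_bits - page_offset_bits)


print(split_virtual_address(0x00003ABC, 12))   # (3, 2748): VPN 3, offset 0xABC
print(hex(translate(0x00003ABC, {3: 0x7}, 12)))  # 0x7abc
entries = num_ptes(32, 12)
print(entries)                                 # 1048576 entries
print(entries * 4 // 2**20)                    # 4 MiB at 4 bytes per entry
```
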

  • page table size = (# of PTEs)(bytes per PTE)
  • can be expensive; solutions:
  • grow page table as process needs more VAs
  • multi-level page tables (each level reads certain high-order bits)
  • less memory is wasted --> not a bunch of empty space
  • ready bit (bit 0)
    • keep polling until ready
  • receiver data 0xffff
    • read lower 8 bits (input char)
  • transmitter control 0xffff
    • ready bit (bit 0)
    • poll until ready
  • transmitter data 0xffff000c
    • transfer lower 8 bits (output char)
  • (IMAGE 250-18)
  • pipeline stages (IF/ID/EX/MEM/WB)
    • data path is broken into 5 stages
    • (IMAGE 250-19)
    • pipeline registers separate each stage
      • IF/ID 64 bits, ID/EX 128 bits, EX/MEM 97 bits, MEM/WB 64 bits
  • lw
    • IF: instr read, placed in IF/ID reg, PC incremented, new PC saved in IF/ID reg
    • ID: IF/ID reg supplies 16-bit immediate (sign-extended to 32) and the 2 reg numbers to read. All three values stored to ID/EX reg with PC
    • EX: reg value and 32-bit immediate are added, result placed in EX/MEM reg
    • MEM: Read DMEM using address from EX/MEM reg and load data into MEM/WB reg
    • WB: Read data from MEM/WB and write into reg file
  • sw same as lw but in MEM you store the read reg 2 data at the address held in EX/MEM
  • Set control values during each pipeline stage
    • 5 different groups of control lines (1 for each stage)
    • (IMAGE 250-20)
    • IF and ID are same every clock
  • register read and written in same clock -> data hazard
    • data forwarding from MEM stage is solution (MEM and ID can be concurrent)
  • stall instruction until MEM stage is ready
  • Need to stall if RegWrite control is asserted
  • add muxes to use registers from other stages
    • (IMAGE 250-21)
    • You have a forwarding unit to assert control signals for muxes
  • hazards
    • EX: if the prev instr is going to write a reg that this instr reads, forward its ALU result from EX/MEM to the ALU input
    • MEM: if the instr two before is going to write a reg that this instr reads, forward its result from MEM/WB to the ALU input
  • hazard detection unit in addition to forwarding unit
    • operates during ID to insert stall between load and use
      • restart instr until ready (nop)
      • stall is changing EX/MEM/WB controls of ID/EX to 0
      • ASK TAVO TO EXPLAIN
    • for lw, data has to be read from the dmem before it can be used (it's read during MEM), so it can't be forwarded to the very next instr in time
  • stalling is slow --> predict branch isn't taken and continue execution
    • if branch is taken, flush instructions by changing original controls to 0s, discarding the instructions in IF, ID, and EX
  • move branch execution earlier in pipeline (ID stage)
  • target address and branch decision
  • can still cause data hazards
  • ALU instruction before branch instruction produces result for one of branch operands
  • load before branch instruction
  • to flush, add control line in IF stage
  • dynamic branch prediction sees if branch was taken last time instruction was executed
  • branch prediction buffer (small memory indexed by lower portion of branch instruction address that contains bits saying whether branch was taken or not)
  • 2-bit schemes: prediction must be wrong twice before it is changed
  • branch delay slot
    • in MIPS, the slot after the branch is filled with an instruction not affected by branch
  • branch target buffer caches destination PC for a branch
  • overflow and undefined instr exceptions right now
    • save instr address in EPC and let OS handle it
    • reason for exception in Cause register
    • vectored interrupt has the control transferred to an address determined by cause of exception
  • to handle exceptions:
  • EPC 32-bit reg added, 32-bit Cause register added
  • (IMAGE 250-22)
  • imprecise exceptions in pipelines are not associated with exact instr that caused exception
  • makes things simpler
  • cache coherence problem is when caches of different processors in a multicore hold different values for the same memory location
  • memory system coherent if:
  • A read of X by P after a write to X by P, with no other processor writing in between, returns the value written by P
  • A read of X by P after Q wrote X returns the value written by Q (if the read and write are sufficiently separated in time)
  • Writes to same location are kept in order (serialization)
  • all processors see the writes in same order
  • migration moves data to local cache
  • reduces latency
  • replication: when shared data is read simultaneously, each cache makes a copy of item in local cache
  • reduces latency of access
  • snooping
  • every cache that has a copy of data from a block of phys mem also has a copy of its sharing status