








These notes analyze the design and implementation of an unpipelined processor. They cover the role of the arithmetic/logic unit (ALU) and on-chip memory storage (registers), the two-bus architecture, register file implementation, and the memory interface for instruction fetch and load/store operations. They also discuss the importance of avoiding hazards and the use of a RISC design for efficient implementation.









The heart of the processor is the arithmetic/logic unit (ALU) and the on-chip memory storage usually called registers. The hardware that implements the arithmetic/logic operations typically requires three operands: two sources and one destination. Putting any of these operands in memory violates our RISC design principles. Since all of the operands must be in registers (an R-format instruction type), a simple three-bus architecture is implied. Three busses allow two inputs to and one output from the ALU in the same clock period (see fig. 5.7, p. 295). We will name these busses as shown.
The timing of the data transfers is shown in the following timing diagram.
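[Figure: three-bus data path with the A-bus, B-bus, D-bus, CL, and Registers, and its timing diagram for Bus A, Bus B, and Bus D marking the read delay, a further delay, and the write delay.]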
Shortly after the clock period begins for the execution of an instruction, the A-bus and B-bus are driven by the registers specified in the instruction. Once the ALU has correct input data, it drives the result of the selected function onto the D-bus after a further delay. Finally, the register file stores the value from the D-bus.
Note that we finish writing the register data from one instruction before reading register data for the next instruction. If we try to read too soon, the register data might be incorrect if it is still being written. This is called a data hazard. Hazards will be discussed in more detail when the processor is pipelined. For now, the hazard is avoided by requiring that reading and writing to/from the register file be done at different times.
Since writing is done at a different time than reading, it is no longer necessary to have a bus dedicated to writing. Instead, reading and writing can use the same bus at different times. This leads to the two-bus architecture.
But this brings up two problems.
[Figure: two-bus data path with the A-bus, B-bus, and Registers.]
The 2-bus timing diagram is
Note:
In this way, the 2-bus scheme is really not slower than the 3-bus scheme. We have just used the latches to save the ALU input and allowed the same busses to be used to transfer the ALU output back to the register file. The extra latches and tri-state buffers are much less expensive than adding a third bus.
[Figure: 2-bus timing diagram with Read and Write phases.]
We will later employ pipelining techniques to remove the objections raised in the last two notes.
Register File Model
The 2-port memory that is most commonly used to implement the “register” file really has a timing characteristic more like a latch. The timing requirements of the latches make the register file model more consistent with a realistic register file based on a static RAM design. The operation of the bi-directional bbus on the register file is controlled by the R and W control lines as follows.
Whenever W is high, the selected “register” is written (writing uses only the B-select and B-data to avoid the possibility of writing two things into the same register). Whenever R is high, the “register” selected by the A-select lines drives the A-data bus and the “register” selected by the B-select drives the B-data bus.
Individual “registers” are implemented as latches connected to the data busses through tri-state buffers such that only one “register” is connected to a bus at any time. The B-select
R  W  function
0  0  bbus not used by register file
0  1  write bbus into register selected by Bselect
1  0  read register selected by Bselect onto bbus
1  1  not allowed
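The R/W behavior in the table can be sketched as a small behavioral model. This is only an illustration, not the notes' hardware design: the class name `RegisterFile`, the method `cycle`, and the use of `None` to stand for an undriven (tri-stated) bus are all assumptions made for the example.

```python
class RegisterFile:
    """Behavioral sketch of the two-port register file (names are hypothetical)."""

    def __init__(self, num_regs=32):
        self.regs = [0] * num_regs

    def cycle(self, r, w, a_select, b_select, bbus_in=None):
        """Apply the R and W control lines once.

        Returns (abus, bbus) as driven by the register file;
        None models a bus that the register file is not driving.
        """
        if r and w:
            # R=1, W=1 is the "not allowed" row of the table.
            raise ValueError("R and W must not both be high")
        if w:
            # Writing uses only Bselect and the bbus data.
            self.regs[b_select] = bbus_in
            return None, None
        if r:
            # Reading drives abus from Aselect and bbus from Bselect.
            return self.regs[a_select], self.regs[b_select]
        # R=0, W=0: the busses are not used by the register file.
        return None, None
```

For example, writing 42 into register 3 and then reading it back on both ports returns `(42, 42)`.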
The select control lines for the register file come from the instruction code. The currently executing instruction is decoded as follows so that the appropriate control signals can be generated.
Avoiding Setup and Hold Time Violations. The system as designed does not work because the tri-state buffers turn on and off at the same time that data is latched into the A and B latches and the register file. This may violate the hold time requirement for the latches.
An easy solution to this problem is to delay the clock signal controlling the tri-state buffers (the R control on the register file controls the tri-state buffers inside the register file).
[Figure: data path showing the reg file (Aselect, Bselect, W), the abus and bbus, the A latch, the B latch, and the ALU.]
We can use any logic gate to provide the necessary delay, but it is sufficient to use an odd number of inverter gates.
Note: you must use structural Verilog to force these gates in. The synthesizer will eliminate the inverter pairs if you give a behavioral description.
Processor-Memory Interface
In a RISC design, only the load and store instructions transfer data into or out of memory. Because of their high instruction frequency, loads and stores must be implemented as efficiently as possible. To use the memory, the processor must provide an address bus to specify the memory location, a dataout bus for loads and a datain bus for stores.
Some or all of these busses are usually multiplexed together to reduce cost in unpipelined processors. However, pipelined processors work best with separate busses, so we will leave them unmultiplexed to make it easier to pipeline the processor later.
Using memory design techniques which will be covered later, the memory interface can be designed as if the memory is inside the same chip as the processor. This makes the mem-
[Figure: memory block with address, datain, dataout, and write enable connections.]
The instruction code comes directly out of the read data bus of the instruction memory. The program counter (PC) register provides the address to the instruction memory address bus. In the absence of branching or jumping, the PC is incremented by 4 bytes (NOT 1 byte) to point to the next 32-bit instruction code.
Note that there is no write into the instruction memory since self-modifying code is not allowed. We will discuss later how instructions get into instruction memory.
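The fetch step can be sketched as follows. This is a minimal model under stated assumptions: the instruction memory is represented as a Python list of 32-bit words indexed by `pc // 4`, and the function name `fetch` is hypothetical.

```python
def fetch(instruction_memory, pc):
    """Read the instruction at byte address pc and return (instruction, next_pc).

    Assumption: instruction_memory is a list of 32-bit words, so the word
    index is pc // 4. The memory is only read, never written.
    """
    if pc % 4 != 0:
        raise ValueError("PC must be aligned to a 4-byte instruction boundary")
    instruction = instruction_memory[pc // 4]
    # Without a branch or jump, the PC advances by 4 bytes, not 1.
    return instruction, pc + 4
```

Calling `fetch` repeatedly with the returned `next_pc` walks through the program one 32-bit instruction at a time.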
Branches. Just as for loads and stores, implementing branches requires additional hardware beyond the ALU because we must do the following.
In the MIPS instruction set, the branch condition is whether or not the two register operands are equal. The ALU is used to subtract the two operands and then the ALU result is checked. If the ALU output is zero (all 0’s), then the operands are equal, else the operands are not equal. A zero checker is just a large AND gate with inverted inputs (a NOR gate).
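The equality test can be sketched bit by bit. The function names below are hypothetical, and the 32-bit width is an assumption based on the 32-bit instruction codes mentioned earlier; the loop simply mirrors a wide NOR gate.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit data path


def alu_subtract(a, b):
    """Return a - b truncated to 32 bits, as the ALU output would be."""
    return (a - b) & MASK32


def is_zero(value, width=32):
    """Model the zero checker: a NOR over all ALU output bits."""
    any_bit_set = 0
    for i in range(width):
        any_bit_set |= (value >> i) & 1  # OR of every output bit
    return any_bit_set == 0              # NOR: true only if all bits are 0


def branch_equal(a, b):
    """The branch condition holds when the subtraction result is all 0's."""
    return is_zero(alu_subtract(a, b))
```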
Since the ALU is busy comparing the register operands, it is necessary to add an extra adder to do the branch address calculation (see fig. 5.9, p. 297).
Note: a permanent shift left of 2 is accomplished by simply moving the bus wires left two positions.
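The work of the extra adder can be sketched as below. This assumes the usual MIPS convention, which the text here does not spell out, that the 16-bit branch offset is sign extended, shifted left by 2, and added to PC + 4. The function names are hypothetical.

```python
def sign_extend_16(imm16):
    """Sign extend a 16-bit immediate field to a Python integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def branch_target(pc, imm16):
    """Compute the branch address with the extra adder.

    Assumption: target = (PC + 4) + (sign-extended offset << 2).
    The shift by 2 converts a word offset into a byte offset.
    """
    return (pc + 4 + (sign_extend_16(imm16) << 2)) & 0xFFFFFFFF
```

With `pc = 0x100`, an offset of 3 gives `0x110`, and an offset of `0xFFFF` (that is, -1) gives `0x100`, a branch back to the branch instruction itself.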
The extra adder is the price paid for determining the branch condition as part of the branch instruction. Alternatively, the branch could use the results of previous instructions to determine the branch condition, but this would require saving the results somewhere (flag register or status register). The MIPS does not use a flag or status register to make it easier to pipeline.
Complete Data Path. Common data path hardware for the register ALU instructions and the load and store instructions can be obtained by multiplexing one of the ALU inputs and multiplexing the register write as shown in fig. 5.10, p. 299.
This design has the added advantage that it also implements immediate addressing if the ALU input MUX selects the sign extender and the register write MUX selects the ALU output.
The instruction memory interface and the hardware to support branches are added as shown in fig. 5.11, p. 300.
Control Unit. There are several control inputs to the data path hardware that are necessary to make the hardware perform the appropriate action for the instruction. A control unit is necessary to provide these control inputs. The control unit determines a value for each control line based on the instruction code, external signals (for example, system reset), and internal signals (for example, signed overflow, V).
First, the control lines must be identified as in fig. 5.15, p. 305.
Note that the register codes from the instruction into the register file are control lines.
Since we are doing a single cycle implementation, the control lines should be set once the instruction is fetched from instruction memory and they should remain set until the end of the clock period. This can be accomplished with a combinational logic circuit. No memory storage or state machines are necessary for the control unit. The control unit simply decodes the instruction and sets the appropriate control lines as shown in fig. 5.17, p. 307.
The control unit function is defined by making a table, similar to fig. 5.18, p. 308, that lists how the control lines should be set for each instruction (or instruction type).
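Such a table maps directly onto a lookup with no stored state. The sketch below is an assumption-laden illustration: the control line names (RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp) and their values follow the common MIPS textbook convention rather than anything confirmed in these notes, and `None` stands for a don't-care entry.

```python
# Assumed control table, keyed by the 6-bit MIPS opcode.
# Line names and values follow the usual textbook convention (an assumption).
CONTROL_TABLE = {
    0b000000: dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                   MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),   # R-format
    0b100011: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                   MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),   # lw
    0b101011: dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
                   MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),   # sw
    0b000100: dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
                   MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),   # beq
}


def decode(instruction):
    """Purely combinational decode: the opcode selects one table row."""
    opcode = (instruction >> 26) & 0x3F  # bits 31..26 of the instruction
    if opcode not in CONTROL_TABLE:
        raise ValueError(f"unsupported opcode {opcode:#08b}")
    return CONTROL_TABLE[opcode]
```

Because `decode` depends only on the current instruction word, it matches the point made above that a single cycle control unit needs no memory storage or state machine.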
The view that we will take is to group the control unit along with the decoders in the register file into the instruction decoder. This will make pipelining easier later.
Jump type instructions are added in fig. 5.24, p. 314.
Performance. Recall that our performance measure is
For the single cycle implementation,
[Figure: the Instruction Memory feeds the Instruction Decoder, which supplies Aselect and Bselect to the Register File and control lines to other data path units.]
Microprogrammed Control. Microprogrammed control is an alternative way to implement the finite state machine which is the controller for an unpipelined processor.
The micro program ROM implements the combinational logic of the controller and the micro instruction register implements the state register.
The controller is designed as a sequence of micro instructions rather than a set of logic equations. This made it much easier for programmers to design. It also allowed for easy changes to the control logic since only the ROM need be reprogrammed.
The disadvantage of microprogrammed control is that it works best for a sequence of different micro instructions in each clock period. This makes it unsuitable for pipeline control, as we will see. Also, the ROM takes more hardware than equivalent logic gates.
Performance. Let’s assume that the multicycle CPU is designed with a controller as specified by the state diagram in fig. 5.38, p. 339. The number of clock cycles for each instruction type is as follows.
Loads: 5
Stores: 4
ALU: 4
Branches/Jumps: 3
After running benchmark programs, the average dynamic instruction frequencies are as follows.
Loads: 20%
Stores: 12%
ALU: 52%
Branches/Jumps: 16%
Then the average CPI is
CPI = (0.20)(5) + (0.12)(4) + (0.52)(4) + (0.16)(3) = 4.04
[Figure: microprogram ROM (addr, data) with inputs and outputs, and the micro instruction register clocked by clk.]
Even though a CPI of 4.04 is worse than the single cycle implementation, the clock period can be much smaller than the clock period for the single cycle implementation. The constraint on T is now
T > max { (I-mem delay), (I-decode delay), (Reg File delay), (ALU delay), (D-mem delay) }
which is usually much smaller than the clock period of the single cycle implementation.
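The CPI figure and the constraint on T can be checked with a short calculation. The cycle counts and frequencies below come from the lists above; the unit delays in `DELAYS_NS` are invented purely for illustration and are not from these notes, and the function names are hypothetical.

```python
# Cycle counts and dynamic instruction frequencies from the text.
CYCLES = {"loads": 5, "stores": 4, "alu": 4, "branches_jumps": 3}
FREQUENCY = {"loads": 0.20, "stores": 0.12, "alu": 0.52, "branches_jumps": 0.16}

# HYPOTHETICAL unit delays in nanoseconds (assumed values, illustration only).
DELAYS_NS = {"i_mem": 2.0, "i_decode": 1.0, "reg_file": 1.0, "alu": 2.0, "d_mem": 2.0}


def average_cpi(cycles, frequency):
    """Weight each instruction type's cycle count by its frequency."""
    return sum(cycles[kind] * frequency[kind] for kind in cycles)


def multicycle_min_period(delays):
    """Lower bound on T: the slowest single unit sets the clock period."""
    return max(delays.values())


def time_per_instruction(cpi, period):
    """Average time per instruction is CPI times the clock period."""
    return cpi * period


cpi = average_cpi(CYCLES, FREQUENCY)
period = multicycle_min_period(DELAYS_NS)
print(f"average CPI = {cpi:.2f}")
print(f"T must exceed {period} ns under the assumed delays")
print(f"average time per instruction = {time_per_instruction(cpi, period):.2f} ns")
```

Whether the multicycle design comes out ahead depends entirely on the real unit delays, so the numbers printed here should not be read as a result.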