Unpipelined Processor: Design and Implementation - Prof. Louis Johnson, Study notes of Computer Architecture and Organization

These notes analyze the design and implementation of an unpipelined processor. They cover the role of the arithmetic/logic unit (ALU) and on-chip memory storage (registers), the two-bus architecture, register file implementation, and the memory interface for instruction fetch and load/store operations. They also discuss the importance of avoiding hazards and the use of a RISC design for efficient implementation.

Uploaded on 11/08/2009

ECEN 4243 Computer Architecture
Unpipelined Processor January 30, 2008 page 1 of 14
Unpipelined Processor

The heart of the processor is the arithmetic/logic unit (ALU) and the on-chip memory storage usually called registers. The hardware to implement the arithmetic/logic operations typically requires three operands, two sources and one destination. Putting any of these operands in memory violates our RISC design principles. Since all of the operands must be in registers (an R-format instruction type), a simple three-bus architecture is then implied. Three busses allow two inputs and one output to/from the ALU all in the same clock period (see fig. 5.7, p. 295). We will name these busses as shown.

[Figure: three-bus data path connecting the registers and the ALU through the A-bus, B-bus, and D-bus, clocked by CL]

The timing of the data transfers is shown in the following timing diagram.

[Figure: timing diagram of CL, Bus A, Bus B, and Bus D over one clock period, marking the read delay, ALU delay, and write delay]

Shortly after the clock period begins for the execution of an instruction, the A-bus and B-bus are driven by the registers specified in the instruction. Once the ALU has correct input data, it computes the selected function and, after a further delay, drives the result onto the D-bus. Finally, the register file stores the value from the D-bus.

Note that we finish writing the register data from one instruction before reading register data for the next instruction. If we try to read too soon, the register data might be incorrect if it is still being written. This is called a data hazard. Hazards will be discussed in more detail when the processor is pipelined. For now, the hazard is avoided by requiring that reading and writing to/from the register file be done at different times.

Since writing is done at a different time than reading, it is no longer necessary to have a bus dedicated to writing. Instead, reading and writing can use the same bus at different times. This leads to the two-bus architecture.

This raises two problems.

  1. How do we control when to read and when to write?
  2. How do we avoid mixing up read and write data on the same bus?

[Figure: two-bus data path connecting the registers and the ALU through the A-bus and B-bus]

The 2-bus timing diagram is shown below.

[Figure: 2-bus timing diagram of CL, the A and B data busses, the A and B latch outputs, and the ALU output. Each clock period is divided into a read delay, an ALU delay, and a write delay; the busses carry the ALU input during the read phase and the ALU output during the write phase.]

Note:

  1. For both the 2-bus and 3-bus architectures, (clock cycle time) > (read delay) + (ALU delay) + (write delay).
  2. Since we do not read and write into the register file at the same time, we do not need the 3-bus architecture. In this way, the 2-bus scheme is really not slower than the 3-bus scheme. We have just used the latches to save the ALU input and allowed the same busses to be used to transfer the ALU output back to the register file. The extra latches and tri-state buffers are much less expensive than adding a third bus.
  3. The ALU is not doing anything useful during the read/write delays.
  4. The register file is not doing anything useful during the ALU delay.

We will later employ pipelining techniques to remove the objections raised in the last two notes.
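The clock-period constraint from the first note, and the idle time behind the last two notes, can be made concrete with a small calculation. This is only a sketch: the delay values are hypothetical (the notes give no numbers), and the function names are chosen here for illustration.

```python
# Hypothetical delays in nanoseconds; the notes give no numeric values.
READ_DELAY_NS = 2.0
ALU_DELAY_NS = 5.0
WRITE_DELAY_NS = 2.0


def min_clock_period(read_delay, alu_delay, write_delay):
    """Lower bound on the clock period: the three delays occur in sequence."""
    return read_delay + alu_delay + write_delay


def alu_utilization(read_delay, alu_delay, write_delay):
    """Fraction of the minimum clock period during which the ALU is computing."""
    return alu_delay / min_clock_period(read_delay, alu_delay, write_delay)


period = min_clock_period(READ_DELAY_NS, ALU_DELAY_NS, WRITE_DELAY_NS)
busy = alu_utilization(READ_DELAY_NS, ALU_DELAY_NS, WRITE_DELAY_NS)
print(f"clock period must exceed {period} ns")
print(f"ALU is busy for {busy:.0%} of that period")
```

With these assumed numbers the ALU sits idle for a little under half of every clock period, which is the inefficiency that pipelining is meant to remove.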

Register File Model

The 2-port memory that is most commonly used to implement the “register” file really has a timing characteristic more like a latch. Modeling the registers with the timing requirements of latches makes the register file model more consistent with a realistic register file based on a static RAM design. The operation of the bi-directional bbus on the register file is controlled by the R and W control lines as follows.

Whenever W is high, the selected “register” is written (writing uses only the B-select and B-data to avoid the possibility of writing two things into the same register). Whenever R is high, the “register” selected by the A-select lines drives the A-data bus and the “register” selected by the B-select drives the B-data bus.

Individual “registers” are implemented as latches connected to the data busses through tri-state buffers such that only one “register” is connected to a bus at any time. The B-select

R  W  Function
0  0  bbus not used by register file
0  1  write bbus into register selected by Bselect
1  0  read register selected by Bselect onto bbus
1  1  not allowed
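The R/W behavior described above can be sketched as a small behavioral model. This is an illustration only, not the course's implementation: the class name, the `access` method, the use of `None` for an undriven bus, and the choice of 32 registers are all assumptions made here.

```python
class RegisterFile:
    """Behavioral model of the 2-port register file with R and W controls."""

    def __init__(self, num_regs=32):  # 32 registers is an assumption
        self.regs = [0] * num_regs

    def access(self, r, w, a_select, b_select, bbus_in=None):
        """Return the (abus, bbus) values driven by the register file.

        None means the register file is not driving that bus.
        """
        if r and w:
            raise ValueError("R = W = 1 is not allowed")
        if w:
            # Writing uses only B-select and the B data bus.
            self.regs[b_select] = bbus_in
            return None, None
        if r:
            # Reading drives both busses from the selected registers.
            return self.regs[a_select], self.regs[b_select]
        return None, None  # busses not used by the register file


rf = RegisterFile()
rf.access(r=0, w=1, a_select=0, b_select=3, bbus_in=42)  # write 42 into register 3
print(rf.access(r=1, w=0, a_select=3, b_select=3))        # (42, 42)
```

The model ignores timing entirely; it only captures which bus is driven or sampled for each R/W combination.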

[Figure: register file block with A-select, A-data, B-select, B-data, R, and W connections]

The select control lines for the register file come from the instruction code. The currently executing instruction is decoded as follows so that the appropriate control signals can be generated.
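As a rough sketch of this decoding step, the code below splits a 32-bit R-format word into its fields and models each binary decoder as a one-hot select value. The bit positions follow the standard MIPS R-format layout, which is an assumption here since the notes do not list them; the function and key names are also illustrative, including the reading that the destination field drives the B-bus write select.

```python
def one_hot(index):
    """Binary decoder: assert exactly one select line, numbered `index`."""
    return 1 << index


def decode_r_format(instruction):
    """Split a 32-bit R-format word into the opcode and register select lines."""
    op = (instruction >> 26) & 0x3F
    reg_a = (instruction >> 21) & 0x1F  # source Reg A (MIPS rs field)
    reg_b = (instruction >> 16) & 0x1F  # source Reg B (MIPS rt field)
    reg_d = (instruction >> 11) & 0x1F  # dest Reg D (MIPS rd field)
    return {
        "op": op,
        "a_select": one_hot(reg_a),        # A-bus select
        "b_read_select": one_hot(reg_b),   # B-bus select while reading
        "b_write_select": one_hot(reg_d),  # B-bus select while writing
    }


# add $3, $1, $2 encodes as 0x00221820 in MIPS.
print(decode_r_format(0x00221820))
```

For this instruction the decoder asserts select line 1 on the A-bus, line 2 for the B-bus read, and line 3 for the B-bus write.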

Avoiding Setup and Hold Time Violations. The system as designed does not work because the tri-state buffers turn on and off at the same time that data is latched into the A and B latches and the register file. This may violate the hold time requirement for the latches.

An easy solution to this problem is to delay the clock signal controlling the tri-state buffers (the R control on the register file controls the tri-state buffers inside the register file).

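[Figure: instruction decoding. The instruction code is split into the OP, source Reg A, source Reg B, and dest Reg D fields; binary decoders turn the register fields into the A-bus select, B-bus read select, and B-bus write select lines.]

[Figure: data path with the delayed clock, showing the register file (R, W, Aselect, Bselect), the A latch and B latch, the ALU (S, Cin), abus and bbus, and the clocks CL and CL_D]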

We can use any logic gate to provide the necessary delay, but it is sufficient to use an odd number of inverter gates.

Note: you must use structural Verilog to force these gates in. The synthesizer will eliminate the inverter pairs if you give a behavioral description.

Processor-Memory Interface

In a RISC design, only the load and store instructions transfer data into or out of memory. Because of their high instruction frequency, loads and stores must be implemented as efficiently as possible. To use the memory, the processor must provide an address bus to specify the memory location, a dataout bus for loads and a datain bus for stores.

Some or all of these busses are usually multiplexed together to reduce cost in unpipelined processors. However, it is best to leave them unmultiplexed for pipelined processors, so we will leave them unmultiplexed to make it easier to pipeline the processor later.

Using memory design techniques which will be covered later, the memory interface can be designed as if the memory is inside the same chip as the processor. This makes the mem-

[Figure: clock CL and delayed clock CL_D, with timing diagram]

[Figure: memory block with address, datain, and dataout busses and a write enable input]

The instruction code comes directly out of the read data bus of the instruction memory. The program counter (PC) register provides the address to the instruction memory address bus. In the absence of branching or jumping, the PC gets incremented by 4 bytes (NOT 1 byte) to point to the next 32-bit instruction code.

Note that there is no write into the instruction memory since self-modifying code is not allowed. We will discuss later how instructions get into instruction memory.

Branches. Just as for loads and stores, implementing branches requires additional hardware beyond the ALU because we must do the following.

  1. Determine whether the branch condition is true.
  2. Calculate the branch address.

In the MIPS instruction set, the branch condition is whether or not the two register operands are equal. The ALU subtracts the two operands and the ALU result is then checked. If the ALU output is zero (all 0’s), the operands are equal; otherwise they are not. A zero checker is just a large AND gate with inverted inputs (a NOR gate).

Since the ALU is busy comparing the register operands, it is necessary to add an extra adder to do the branch address calculation (see fig. 5.9, p. 297).

Note: a permanent shift left of 2 is accomplished by simply moving the bus wires left two positions.
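The branch hardware described above can be sketched in a few lines. This is a behavioral illustration under stated assumptions: the 16-bit signed offset and the use of PC + 4 as the base of the branch address follow the usual MIPS convention rather than anything spelled out in these notes, and all function names are hypothetical.

```python
def sign_extend_16(value):
    """Interpret the low 16 bits of `value` as a signed integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def is_zero(alu_output):
    """Zero checker: true only when all 32 output bits are 0 (a wide NOR)."""
    return (alu_output & 0xFFFFFFFF) == 0


def next_pc(pc, reg_a, reg_b, offset16, is_branch):
    """Return the next PC for a sequential instruction or a branch-on-equal."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    # The ALU is busy subtracting the operands to test for equality.
    alu_output = (reg_a - reg_b) & 0xFFFFFFFF
    # The extra adder computes the branch address; << 2 models the wired shift.
    target = (pc_plus_4 + (sign_extend_16(offset16) << 2)) & 0xFFFFFFFF
    return target if is_branch and is_zero(alu_output) else pc_plus_4


print(hex(next_pc(0x100, 5, 5, 3, True)))  # 0x110, branch taken
print(hex(next_pc(0x100, 5, 6, 3, True)))  # 0x104, branch not taken
```

Both the comparison and the address calculation happen in the same clock period, which is why the ALU alone is not enough.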

The extra adder is the price paid for determining the branch condition as part of the branch instruction. Alternatively, the branch could use the results of previous instructions to determine the branch condition, but this would require saving the results somewhere (flag register or status register). MIPS avoids a flag or status register, which makes the processor easier to pipeline.

Complete Data Path. Common data path hardware for the register ALU instructions and the load, store instructions can be obtained by multiplexing one of the ALU inputs and multiplexing the register write as shown in fig. 5.10, p. 299.

[Figure: ALU with a zero output]

This design has the added advantage that it also implements immediate addressing if the ALU input MUX selects the sign extender and the register write MUX selects the ALU output.

The instruction memory interface and the hardware to support branches are added as shown in fig. 5.11, p. 300.

Control Unit. There are several control inputs to the data path hardware that are necessary to make the hardware perform the appropriate action for the instruction. A control unit is necessary to provide these control inputs. The control unit determines a value for each control line based on the instruction code, external signals (for example, system reset), and internal signals (for example, signed overflow, V).

First, the control lines must be identified as in fig. 5.15, p. 305.

Note that the register codes from the instruction into the register file are control lines.

Since we are doing a single cycle implementation, the control lines should be set once the instruction is fetched from instruction memory and they should remain set until the end of the clock period. This can be accomplished with a combinational logic circuit. No memory storage or state machines are necessary for the control unit. The control unit simply decodes the instruction and sets the appropriate control lines as shown in fig. 5.17, p. 307.

The control unit function is defined by making a table, similar to fig. 5.18, p. 308, that lists how the control lines should be set for each instruction (or instruction type).
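Because the single cycle control unit is purely combinational, such a table can be modeled as a plain lookup with no stored state. The sketch below is modeled on the usual single-cycle MIPS control table; the control line names and values are assumptions for illustration and are not copied from these notes or from the cited figure. `None` marks a don't-care.

```python
CONTROL_LINES = (
    "RegDst", "ALUSrc", "MemtoReg", "RegWrite",
    "MemRead", "MemWrite", "Branch", "ALUOp",
)

# One row per instruction type; None marks a don't-care entry.
CONTROL_TABLE = {
    "R-format": (1, 0, 0, 1, 0, 0, 0, 0b10),
    "lw": (0, 1, 1, 1, 1, 0, 0, 0b00),
    "sw": (None, 1, None, 0, 0, 1, 0, 0b00),
    "beq": (None, 0, None, 0, 0, 0, 1, 0b01),
}


def control_unit(instruction_type):
    """Combinational decode: the same input always yields the same outputs."""
    return dict(zip(CONTROL_LINES, CONTROL_TABLE[instruction_type]))


print(control_unit("lw"))
```

A lookup like this holds its outputs for as long as the instruction type is presented, which matches the requirement that the control lines stay set until the end of the clock period.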

The view that we will take is to group the control unit along with the decoders in the register file into the instruction decoder. This will make pipelining easier later.

Jump type instructions are added in fig. 5.24, p. 314.

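[Figure: the instruction memory feeds the instruction decoder, which drives the Aselect and Bselect inputs of the register file and the control lines to the other data path units]

Performance. Recall that our performance measure is

$$\text{CPUtime} = (IC)\,(CPI)\,(T)$$

where IC is the instruction count, CPI is the average number of clock cycles per instruction, and T is the clock period.

For the single cycle implementation, every instruction completes in one clock period, so CPI = 1.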

Microprogrammed Control. Microprogrammed control is an alternative way to implement the finite state machine which is the controller for an unpipelined processor.

The micro program ROM implements the combinational logic of the controller and the micro instruction register implements the state register.

The controller is designed as a sequence of micro instructions rather than a set of logic equations. This made it much easier for programmers to design. It also allowed for easy changes to the control logic since only the ROM need be reprogrammed.

The disadvantage of microprogrammed control is that it works best for a sequence of different micro instructions in each clock period. This makes it unsuitable for pipeline control, as we will see. Also, the ROM takes more hardware than equivalent logic gates.
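The ROM-plus-register structure can be sketched as a table of micro instructions that is stepped once per clock period. Everything specific here is hypothetical: the ROM addresses, the dispatch table, and the control descriptions are invented for illustration and do not come from the notes.

```python
# Hypothetical microprogram: each ROM word holds (control outputs, next address).
# A next address of None means "dispatch on the instruction type".
FETCH = 0
DISPATCH = {"load": 2, "store": 5, "alu": 7, "branch": 9}

MICROPROGRAM_ROM = {
    0: ("fetch instruction", 1),
    1: ("decode and read registers", None),
    2: ("compute address", 3),
    3: ("read data memory", 4),
    4: ("write register", FETCH),
    5: ("compute address", 6),
    6: ("write data memory", FETCH),
    7: ("ALU operation", 8),
    8: ("write register", FETCH),
    9: ("compare and branch", FETCH),
}


def run_instruction(kind):
    """Step the micro instruction register until control returns to fetch."""
    address = FETCH
    trace = []
    while True:
        # The ROM lookup plays the role of the controller's combinational logic.
        outputs, next_address = MICROPROGRAM_ROM[address]
        # The word latched here models the micro instruction register.
        trace.append(outputs)
        if next_address is None:
            next_address = DISPATCH[kind]
        if next_address == FETCH:
            return trace
        address = next_address


for kind in ("load", "store", "alu", "branch"):
    print(kind, len(run_instruction(kind)), "clock periods")
```

Changing the controller means editing the ROM contents only, which is the flexibility the notes attribute to microprogramming. With this assumed microprogram a load takes five clock periods, a store or ALU instruction four, and a branch three.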

Performance. Let’s assume that the multicycle CPU is designed with a controller as specified by the state diagram in fig. 5.38, p. 339. The number of clock cycles for each instruction type is as follows.

Loads: 5, Stores: 4, ALU: 4, Branches/Jumps: 3

After running benchmark programs, the average dynamic instruction frequencies are as follows.

Loads: 20%, Stores: 12%, ALU: 52%, Branches/Jumps: 16%

Then the average CPI is

$$CPI = \sum_i \left(\frac{IC_i}{IC}\right)\,(CPI_i)$$

Even though a CPI of 4.04 is worse than the single cycle implementation, the clock period can be much smaller than the clock period for the single cycle implementation. The constraint on T is now

$$T > \max\{\,\text{(I-mem delay)},\ \text{(I-decode delay)},\ \text{(Reg File delay)},\ \text{(ALU delay)},\ \text{(D-mem delay)}\,\}$$

which is usually much smaller than the single cycle clock period.

[Figure: microprogrammed controller. The microprogram ROM (addr, data) takes the controller inputs and produces the outputs; the micro instruction register is clocked by clk.]
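The weighted-average CPI can be checked numerically from the cycle counts and instruction frequencies given in the text. The dictionary keys and function names below are illustrative choices, not part of the notes.

```python
# Cycle counts and dynamic instruction frequencies as given in the notes.
CYCLES = {"loads": 5, "stores": 4, "alu": 4, "branches_jumps": 3}
FREQUENCY = {"loads": 0.20, "stores": 0.12, "alu": 0.52, "branches_jumps": 0.16}


def average_cpi(cycles, frequency):
    """Frequency-weighted average of the per-type cycle counts."""
    return sum(frequency[kind] * cycles[kind] for kind in cycles)


def cpu_time(instruction_count, cpi, clock_period):
    """CPUtime = (IC)(CPI)(T)."""
    return instruction_count * cpi * clock_period


cpi = average_cpi(CYCLES, FREQUENCY)
print(f"average CPI = {cpi:.2f}")  # average CPI = 4.04
```

Taking CPI = 1 for the single cycle design, the multicycle design gives a lower CPU time for the same program only when its clock period is less than the single cycle clock period divided by 4.04.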