Unpipelined Processor: Design and Implementation - Prof. Louis Johnson, Study notes of Computer Architecture and Organization

Oklahoma State University (OSU) - Stillwater Computer Architecture and Organization

Prof. Louis Johnson

An in-depth analysis of the design and implementation of an unpipelined processor. It covers the role of the arithmetic/logic unit (ALU) and on-chip memory storage (registers), the two-bus architecture, register file implementation, and the memory interface for instruction fetch and load/store operations. The document also discusses how hazards are avoided and the use of a RISC design for efficient implementation.

Typology: Study notes

Pre 2010

Uploaded on 11/08/2009


ECEN 4243 Computer Architecture

Unpipelined Processor January 30, 2008 page 1 of 14


Unpipelined Processor

The heart of the processor is the arithmetic/logic unit (ALU) and the on-chip memory storage usually called registers. The hardware to implement the arithmetic/logic operations typically requires three operands, two sources and one destination. Putting any of these operands in memory violates our RISC design principles. Since all of the operands must be in registers (an R-format instruction type), a simple three-bus architecture is then implied. Three busses allow two inputs and one output to/from the ALU all in the same clock period (see fig. 5.7, p. 295). We will name these busses as shown.

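[Figure: three-bus architecture. The registers drive the A-bus and B-bus into the ALU, and the ALU output returns to the registers on the D-bus.]

The timing of the data transfers is shown in the following timing diagram.

[Figure: timing diagram for one period of the clock CL. Bus A and Bus B become valid after the read delay, Bus D becomes valid after the ALU delay, and the register write completes after the write delay.]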

Shortly after the clock period begins for the execution of an instruction, the A-bus and B-bus will be driven by the registers specified in the instruction. Once the ALU has correct input data, it computes the selected function and, after a further delay, drives the result onto the D-bus. Finally, the register file stores the value from the D-bus.

Note that we finish writing the register data from one instruction before reading register data for the next instruction. If we try to read too soon, the register data might be incorrect if it is still being written. This is called a data hazard. Hazards will be discussed in more detail when the processor is pipelined. For now, the hazard is avoided by requiring that reading and writing to/from the register file be done at different times.
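The read, compute, write ordering described above can be sketched in a few lines of Python. This is only an illustrative model, not the course's implementation: the register count, the 32-bit word size, and the operation names are assumptions.

```python
# Minimal sketch of one R-format instruction on the three-bus datapath.
# Register count, word size, and operation names are illustrative assumptions.

MASK = 0xFFFFFFFF  # assume 32-bit registers

ALU_OPS = {
    "add": lambda a, b: (a + b) & MASK,
    "sub": lambda a, b: (a - b) & MASK,
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
}


def execute_r_format(regs, op, src_a, src_b, dest):
    """Model one clock period: read both sources, compute, then write."""
    a_bus = regs[src_a]                 # read delay: register drives the A-bus
    b_bus = regs[src_b]                 # read delay: register drives the B-bus
    d_bus = ALU_OPS[op](a_bus, b_bus)   # ALU delay: result appears on the D-bus
    regs[dest] = d_bus                  # write delay: register file stores the D-bus
    return d_bus


regs = [0] * 32
regs[1], regs[2] = 7, 5
execute_r_format(regs, "add", 1, 2, 3)  # r3 = r1 + r2
execute_r_format(regs, "sub", 3, 1, 4)  # r4 = r3 - r1, read only after r3 was written
```

Because each call finishes its write before the next call reads, the second instruction sees the updated r3, which is the behavior the hazard discussion requires.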

Since writing is done at a different time than reading, it is no longer necessary to have a bus dedicated to writing. Instead, reading and writing can use the same bus at different times. This leads to the two bus architecture.

But, this brings up two problems.

How to control when reading and when writing?
How to avoid mixing up read and write data on the same bus?

[Figure: two-bus architecture. The registers and the ALU are connected by the A-bus and the B-bus only.]

The 2-bus timing diagram is

[Figure: 2-bus timing diagram versus time t. Traces: clock CL; A, B data bus, which carries the ALU input during Read and the ALU output during Write; A, B latch output; ALU output. The read delay, ALU delay, and write delay are marked in sequence.]

Note:

For both the 2-bus and 3-bus architectures, (clock cycle time) > (read delay) + (ALU delay) + (write delay).
Since we do not read and write into the register file at the same time, we do not need the 3-bus architecture.

In this way, the 2-bus scheme is really not slower than the 3-bus scheme. We have just used the latches to save the ALU input and allowed the same busses to be used to transfer the ALU output back to the register file. The extra latches and tri-state buffers are much less expensive than adding a third bus.

The ALU is not doing anything useful during the read/write delays.
The register file is not doing anything useful during the ALU delay.

We will later employ pipelining techniques to remove the objections raised in the last two notes.
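To make the cycle-time constraint and the two idle-time notes concrete, here is a small calculation. The delay values are hypothetical (the notes give no numbers), so only the structure of the calculation should be taken from it.

```python
# Hypothetical delays in nanoseconds; the notes do not give numeric values.
READ_DELAY = 2.0
ALU_DELAY = 5.0
WRITE_DELAY = 2.0


def min_cycle_time(read, alu, write):
    """Lower bound on the clock cycle time for the 2-bus or 3-bus design."""
    return read + alu + write


def utilization(busy_time, cycle_time):
    """Fraction of the clock period during which a unit does useful work."""
    return busy_time / cycle_time


cycle = min_cycle_time(READ_DELAY, ALU_DELAY, WRITE_DELAY)
alu_util = utilization(ALU_DELAY, cycle)
regfile_util = utilization(READ_DELAY + WRITE_DELAY, cycle)

print(f"cycle time must exceed {cycle} ns")
print(f"ALU busy {alu_util:.0%}, register file busy {regfile_util:.0%}")
```

With these assumed numbers neither unit is busy for the whole period, which is the inefficiency that pipelining is meant to remove.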

Register File Model

The 2-port memory that is most commonly used to implement the “register” file really has a timing characteristic more like a latch. The timing requirements of the latches make the register file model more consistent with a realistic register file based on a static RAM design. The operation of the bi-directional bbus on the register file is controlled by the R and W control lines as follows.

Whenever W is high, the selected “register” is written (writing uses only the B-select and B-data to avoid the possibility of writing two things into the same register). Whenever R is high, the “register” selected by the A-select lines drives the A-data bus and the “register” selected by the B-select drives the B-data bus.

Individual “registers” are implemented as latches connected to the data busses through tri-state buffers such that only one “register” is connected to a bus at any time. The B-select

R  W  function
0  0  bbus not used by register file
0  1  write bbus into register selected by Bselect
1  0  read register selected by Bselect onto bbus
1  1  not allowed

[Figure: register file (Reg File) block with A-select and B-select inputs, R and W control lines, and A-data and B-data bus connections.]
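The R/W behavior of the register file model can be mimicked in software. The sketch below is a behavioral approximation under stated assumptions: the class and method names are invented for illustration, `None` stands for an undriven bus, and the ALU is fixed to addition.

```python
class RegisterFile:
    """Behavioral sketch of the register file with R and W control lines."""

    def __init__(self, count=32):
        self.regs = [0] * count

    def access(self, r, w, a_select, b_select, bbus_in=None):
        """Return (abus, bbus) as driven by the register file; None means not driven."""
        if r and w:
            raise ValueError("R=1, W=1 is not allowed")
        if w:
            # Write bbus into the register selected by Bselect.
            self.regs[b_select] = bbus_in
            return None, None
        if r:
            # Selected registers drive the A-data and B-data busses.
            return self.regs[a_select], self.regs[b_select]
        return None, None  # busses not used by the register file


def two_bus_add(rf, src_a, src_b, dest):
    """One instruction on the 2-bus design: read phase, ALU, write phase."""
    a_latch, b_latch = rf.access(r=1, w=0, a_select=src_a, b_select=src_b)
    result = (a_latch + b_latch) & 0xFFFFFFFF  # ALU works from the latched inputs
    rf.access(r=0, w=1, a_select=0, b_select=dest, bbus_in=result)
    return result


rf = RegisterFile()
rf.regs[1], rf.regs[2] = 3, 4
total = two_bus_add(rf, 1, 2, 5)  # r5 = r1 + r2, written back over the bbus
```

Keeping the read and write in separate calls mirrors the rule that the shared bus is used for reading and writing at different times.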

The select control lines for the register file come from the instruction code. The currently executing instruction is decoded as follows so that the appropriate control signals can be generated.

[Figure: instruction decoding. The instruction code fields OP, source Reg A, source Reg B, and dest Reg D feed binary decoders that generate the A-bus select, B-bus read select, and B-bus write select lines.]

Avoiding Setup and Hold Time Violations. The system as designed does not work because the tri-state buffers turn on and off at the same time that data is latched into the A and B latches and the register file. This may violate the hold time requirement for the latches.

An easy solution to this problem is to delay the clock signal controlling the tri-state buffers (the R control on the register file controls the tri-state buffers inside the register file).

[Figure: data path with the delayed clock. The reg file, A latch, B latch, and ALU are connected by abus and bbus. Signals shown: CL, CL_D, R, W, Aselect, Bselect, S, Cin.]

We can use any logic gate to provide the necessary delay, but it is sufficient to use an even number of inverter gates, so that the delayed clock keeps the same polarity as the original.

Note: you must use structural Verilog to force these gates in. The synthesizer will eliminate the inverter pairs if you give a behavioral description.

Processor-Memory Interface

In a RISC design, only the load and store instructions transfer data into or out of memory. Because of their high instruction frequency, loads and stores must be implemented as efficiently as possible. To use the memory, the processor must provide an address bus to specify the memory location, a dataout bus for loads and a datain bus for stores.

Some or all of these busses are usually multiplexed together to reduce cost in unpipelined processors. However, pipelined processors work best with separate busses, so we will leave them unmultiplexed to make it easier to pipeline the processor later.

Using memory design techniques which will be covered later, the memory interface can be designed as if the memory is inside the same chip as the processor. This makes the mem-

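[Figure: timing of the clock CL and the delayed clock CL_D versus time t.]

[Figure: memory block with address, datain, and dataout busses and a write enable control.]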

The instruction code comes directly out of the read data bus of the instruction memory. The program counter (PC) register provides the address to the instruction memory address bus. In the absence of branching or jumping, the PC is incremented by 4 bytes (NOT 1 byte) to point to the next 32-bit instruction code.
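As a rough illustration of sequential fetch, the sketch below models the instruction memory as a Python list of 32-bit words indexed by byte address divided by 4. The function name and the memory contents are made up for the example.

```python
INSTRUCTION_BYTES = 4  # one 32-bit instruction code occupies 4 bytes


def fetch(instruction_memory, pc):
    """Return the instruction word at byte address pc and the next sequential PC."""
    word = instruction_memory[pc // INSTRUCTION_BYTES]
    return word, pc + INSTRUCTION_BYTES


# Arbitrary example words standing in for instruction codes.
imem = [0x00221820, 0x00612022, 0x00000000]

pc = 0
fetched = []
for _ in range(len(imem)):
    word, pc = fetch(imem, pc)
    fetched.append(word)
```

The PC advances 0, 4, 8, 12 while the word index advances 0, 1, 2, which is why the increment is 4 rather than 1.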

Note that there is no write into the instruction memory since self modifying code is not allowed. We will discuss later how instructions get into instruction memory.

Branches. Just as for loads and stores, implementing branches requires additional hardware beyond the ALU because we must do the following.

Determine whether the branch condition is true.
Calculate the branch address.

In the MIPS instruction set, the branch condition is whether or not the two register operands are equal. The ALU is used to subtract the two operands and then the ALU result is checked. If the ALU output is zero (all 0’s), then the operands are equal, else the operands are not equal. A zero checker is just a large AND gate with inverted inputs (a NOR gate).

Since the ALU is busy comparing the register operands, it is necessary to add an extra adder to do the branch address calculation (see fig. 5.9, p. 297).

Note: a permanent shift left of 2 is accomplished by simply moving the bus wires left two positions.
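The two branch tasks can be sketched as follows. The helper names are illustrative, and the target calculation assumes the usual MIPS convention (a 16-bit word offset, sign extended and added to PC + 4), which these notes do not spell out.

```python
MASK = 0xFFFFFFFF  # assume a 32-bit data path


def is_zero(value):
    """Zero checker: a NOR over all 32 bits of the ALU output."""
    any_bit_set = False
    for bit in range(32):
        any_bit_set = any_bit_set or bool((value >> bit) & 1)
    return not any_bit_set


def sign_extend16(imm):
    """Sign extend a 16-bit field to a Python integer."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_taken(reg_a, reg_b):
    """The ALU subtracts the operands; they are equal if the result is all 0's."""
    return is_zero((reg_a - reg_b) & MASK)


def branch_target(pc, imm16):
    """Extra adder: PC + 4 plus the offset shifted left by 2 (assumed MIPS convention)."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & MASK
```

For example, `branch_target(0x100, 3)` skips three instructions past the one after the branch, and an offset of `0xFFFF` (that is, -1) branches back to the branch instruction itself.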

The extra adder is the price paid for determining the branch condition as part of the branch instruction. Alternatively, the branch could use the results of previous instructions to determine the branch condition, but this would require saving the results somewhere (flag register or status register). MIPS does not use a flag or status register, which makes it easier to pipeline.

Complete Data Path. Common data path hardware for the register ALU instructions and the load and store instructions can be obtained by multiplexing one of the ALU inputs and multiplexing the register write as shown in fig. 5.10, p. 299.


This design has the added advantage that it also implements immediate addressing if the ALU input MUX selects the sign extender and the register write MUX selects the ALU output.
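The effect of the two multiplexers can be sketched as a single function. This is a simplified model, not the figure itself: the control names `alu_src` and `mem_to_reg` are assumed labels, the ALU is fixed to addition, and data memory is a dictionary keyed by address.

```python
MASK = 0xFFFFFFFF  # assume a 32-bit data path


def sign_extend16(imm):
    """Sign extend a 16-bit immediate field."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def mux(select, in0, in1):
    """Two-input multiplexer."""
    return in1 if select else in0


def datapath_step(regs, data_memory, rs, rt, dest, imm16,
                  alu_src, mem_to_reg, reg_write, mem_write):
    """One instruction through the shared data path (ALU fixed to add)."""
    a = regs[rs]
    b = mux(alu_src, regs[rt], sign_extend16(imm16))   # ALU input MUX
    alu_out = (a + b) & MASK
    if mem_write:
        data_memory[alu_out] = regs[rt]                # store: ALU output is the address
    mem_out = data_memory.get(alu_out, 0)
    if reg_write:
        regs[dest] = mux(mem_to_reg, alu_out, mem_out)  # register write MUX
    return alu_out


regs = [0] * 32
memory = {}
regs[1], regs[3] = 10, 99
# Immediate addressing: ALU input MUX picks the sign extender, write MUX picks the ALU.
datapath_step(regs, memory, rs=1, rt=0, dest=2, imm16=5,
              alu_src=1, mem_to_reg=0, reg_write=1, mem_write=0)
# Store r3 at address r1 + 4, then load it back into r4.
datapath_step(regs, memory, rs=1, rt=3, dest=0, imm16=4,
              alu_src=1, mem_to_reg=0, reg_write=0, mem_write=1)
datapath_step(regs, memory, rs=1, rt=0, dest=4, imm16=4,
              alu_src=1, mem_to_reg=1, reg_write=1, mem_write=0)
```

The same hardware serves all three instruction types; only the multiplexer selects and write enables change.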

The instruction memory interface and the hardware to support branches are added as shown in fig. 5.11, p. 300.

Control Unit. There are several control inputs to the data path hardware that are necessary to make the hardware perform the appropriate action for the instruction. A control unit is necessary to provide these control inputs. The control unit determines a value for each control line based on the instruction code, external signals (for example, system reset), and internal signals (for example, signed overflow, V).

First, the control lines must be identified as in fig. 5.15, p. 305.

Note that the register codes from the instruction into the register file are control lines.

Since we are doing a single-cycle implementation, the control lines should be set once the instruction is fetched from instruction memory and they should remain set until the end of the clock period. This can be accomplished with a combinational logic circuit. No memory storage or state machines are necessary for the control unit. The control unit simply decodes the instruction and sets the appropriate control lines as shown in fig. 5.17, p. 307.

The control unit function is defined by making a table, similar to fig. 5.18, p. 308, that lists how the control lines should be set for each instruction (or instruction type).
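Because the control unit is purely combinational, it can be modeled as a lookup from opcode to control-line values. The sketch below assumes the control-line names and settings of the standard textbook MIPS single-cycle design and the standard MIPS opcodes; the table in the referenced figure may differ, and `None` marks a don't-care.

```python
from typing import NamedTuple, Optional


class Control(NamedTuple):
    """Control lines (names assumed from the textbook MIPS single-cycle design)."""
    reg_dst: Optional[int]
    alu_src: int
    mem_to_reg: Optional[int]
    reg_write: int
    mem_read: int
    mem_write: int
    branch: int
    alu_op: int


# One row per instruction type, keyed by the 6-bit MIPS opcode.
CONTROL_TABLE = {
    0x00: Control(1, 0, 0, 1, 0, 0, 0, 0b10),        # R-format
    0x23: Control(0, 1, 1, 1, 1, 0, 0, 0b00),        # lw
    0x2B: Control(None, 1, None, 0, 0, 1, 0, 0b00),  # sw
    0x04: Control(None, 0, None, 0, 0, 0, 1, 0b01),  # beq
}


def decode(instruction):
    """Combinational decode: no state, the output depends only on the opcode field."""
    opcode = (instruction >> 26) & 0x3F
    return CONTROL_TABLE[opcode]


lw_control = decode(0x8C220004)   # opcode field 0x23
add_control = decode(0x00221820)  # opcode field 0x00
```

Calling `decode` twice with the same instruction always gives the same result, which is the software analogue of needing no memory storage or state machine.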

The view that we will take is to group the control unit along with the decoders in the register file into the instruction decoder. This will make pipelining easier later.

Jump type instructions are added in fig. 5.24, p. 314.

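[Figure: the instruction memory feeds the instruction decoder, which drives Aselect and Bselect on the register file and the control lines to other data path units.]

Performance. Recall that our performance measure is

CPUtime = (IC)(CPI)(T)

where IC is the instruction count, CPI is the average number of clock cycles per instruction, and T is the clock period.

For the single cycle implementation,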

Microprogrammed Control. Microprogrammed control is an alternative way to implement the finite state machine which is the controller for an unpipelined processor.

The micro program ROM implements the combinational logic of the controller and the micro instruction register implements the state register.

The controller is designed as a sequence of micro instructions rather than a set of logic equations. This made it much easier for programmers to design. It also allowed for easy changes to the control logic since only the ROM need be reprogrammed.

The disadvantage of microprogrammed control is that it works best for a sequence of different micro instructions in each clock period. This makes it unsuitable for pipeline control, as we will see. Also, the ROM takes more hardware than equivalent logic gates.
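A microprogrammed controller can be sketched as a ROM lookup plus a register. The ROM contents, the control-line names, and the four-step sequence below are hypothetical, and the instruction-dependent inputs that would normally steer the next address are left out to keep the example short.

```python
# Each ROM word holds (control outputs, next ROM address). Contents are hypothetical.
MICROPROGRAM_ROM = [
    ({"mem_read": 1, "ir_write": 1}, 1),  # address 0: fetch
    ({"reg_read": 1}, 2),                 # address 1: decode and register read
    ({"alu_enable": 1}, 3),               # address 2: execute
    ({"reg_write": 1}, 0),                # address 3: write back, return to fetch
]


class MicroprogrammedController:
    """The ROM plays the combinational logic; the micro instruction register is the state."""

    def __init__(self, rom):
        self.rom = rom
        self.micro_ir = rom[0]  # micro instruction register holds the current word

    def clock(self):
        """Emit this period's control outputs, then load the next micro instruction."""
        outputs, next_addr = self.micro_ir
        self.micro_ir = self.rom[next_addr]
        return outputs


controller = MicroprogrammedController(MICROPROGRAM_ROM)
sequence = [controller.clock() for _ in range(5)]
```

Changing the control behavior means editing the ROM table only, which reflects the ease of modification mentioned above.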

Performance. Let’s assume that the multicycle CPU is designed with a controller as specified by the state diagram in fig. 5.38, p. 339. The number of clock cycles for each instruction type is as follows.

Loads: 5
Stores: 4
ALU: 4
Branches/Jumps: 3

After running benchmark programs, the average dynamic instruction frequencies are as follows.

Loads: 20%
Stores: 12%
ALU: 52%
Branches/Jumps: 16%

Then the average CPI is

CPI = (0.20)(5) + (0.12)(4) + (0.52)(4) + (0.16)(3) = 4.04


Even though a CPI of 4.04 is worse than the single cycle implementation, the clock period can be much smaller than the clock period for the single cycle implementation. The constraint on T is now

T > max { (I-mem delay), (I-decode delay), (Reg File delay) , (ALU delay), (D-mem delay) }
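The CPI figure and the clock constraint can be checked with a short calculation. The cycle counts and frequencies come from the lists above; the unit delays are hypothetical, and the single-cycle period is assumed here to be the sum of all unit delays with a CPI of 1, so the comparison is only a sketch.

```python
CYCLES = {"loads": 5, "stores": 4, "alu": 4, "branches_jumps": 3}
FREQ_PERCENT = {"loads": 20, "stores": 12, "alu": 52, "branches_jumps": 16}


def average_cpi(cycles, freq_percent):
    """Weighted average of cycles per instruction type."""
    return sum(cycles[k] * freq_percent[k] for k in cycles) / 100


# Hypothetical unit delays in nanoseconds; the notes give no numeric values.
DELAYS = {"imem": 2.0, "decode": 1.0, "regfile": 1.0, "alu": 2.0, "dmem": 2.0}

multicycle_cpi = average_cpi(CYCLES, FREQ_PERCENT)
multicycle_t = max(DELAYS.values())    # T must exceed the slowest single unit
single_cycle_t = sum(DELAYS.values())  # assumed: every unit in series in one period

# Average time per instruction is CPI * T for each design.
multicycle_time = multicycle_cpi * multicycle_t
single_cycle_time = 1 * single_cycle_t

print(f"multicycle CPI = {multicycle_cpi}")
print(f"multicycle: {multicycle_time:.2f} ns/instruction, "
      f"single cycle: {single_cycle_time:.2f} ns/instruction")
```

Which design comes out ahead depends entirely on the delay values chosen, so it is worth rerunning the calculation with measured delays.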
