Carry Save Adder Implementation - Lecture Notes | EE 457, Study notes of Computer Architecture and Organization

Carry save adder Material Type: Notes; Professor: Puvvada; Class: Computer Systems Organization; Subject: Electrical Engineering; University: University of Southern California; Term: Fall 2010;

ECEN 6263 Advanced VLSI Design
Carry Save Adder Implementation December 11, 2004 page 1 of 10
Carry Save Adder Implementation

Now that we have seen that carry save adder trees are most efficiently implemented by putting together (3,2) blocks, we must still address the issue of how to implement the (3,2) block (carry save adder) efficiently. Functionally, the carry save adder is identical to the full adder. The full adder is usually implemented with a reduced delay from Cin to Cout because the carry chain is the critical delay path in adders. Unfortunately, there is no single carry chain in the carry save adder trees in multipliers. Thus, it does not pay to make the delay shorter for one input by sacrificing delay on the other inputs. Instead, carry save adders are normally implemented by treating the 3 inputs equally and trying to minimize the delay from each input to the outputs. We have

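[Figure: gate-level (3,2) block with inputs A, B and C]

$$S = A \oplus B \oplus C = \overline{A}\,\overline{B}\,C + \overline{A}\,B\,\overline{C} + A\,\overline{B}\,\overline{C} + ABC$$

$$C = AB + AC + BC$$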
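The two output equations can be checked exhaustively. The following is a minimal sketch, not part of the original notes; the function names are hypothetical. It confirms that the expanded sum-of-products form equals the three-input XOR, and that the block behaves as a (3,2) counter, i.e. the three input bits add up to twice the carry plus the sum. In the carry equation, the C on the left is the carry output and the C on the right is the third input; the code names the output `carry` to keep the two apart.

```python
from itertools import product


def carry_save_adder(a, b, c):
    """(3,2) block: return (sum, carry) for three 1-bit inputs."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def sum_minterms(a, b, c):
    """Expanded sum-of-products form of the sum; needs both polarities of every input."""
    na, nb, nc = 1 - a, 1 - b, 1 - c
    return (na & nb & c) | (na & b & nc) | (a & nb & nc) | (a & b & c)


for a, b, c in product((0, 1), repeat=3):
    s, carry = carry_save_adder(a, b, c)
    assert s == sum_minterms(a, b, c)   # XOR form equals the expanded form
    assert a + b + c == 2 * carry + s   # three bits compressed into two
```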

As the expanded form of the exclusive-or function for the sum, $S$, shows, both the uncomplemented and the complemented form of each input are required (there is a transmission gate XOR circuit that does not require the complemented inputs, but we won't consider it here). If we want to avoid putting extra inverters in our carry paths to produce the complemented inputs, the best thing to do is to have each carry save adder produce both uncomplemented and complemented outputs, which can then be used as inputs by the next stage of carry save adders. Due to symmetries in the logic functions for $C$ and $S$, producing $C$, $\overline{C}$, $S$ and $\overline{S}$ does not take as much circuitry as one might think. The idea is to find common sub-functions for which we may use the same transistors to implement parts of more than one output function.

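$$S = A \oplus B \oplus C = A \oplus (B \oplus C) = \overline{A}\,(B \oplus C) + A\,\overline{(B \oplus C)}$$

$$\overline{S} = \overline{A \oplus B \oplus C} = \overline{A} \oplus (B \oplus C) = A\,(B \oplus C) + \overline{A}\,\overline{(B \oplus C)}$$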

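$$\begin{aligned}
C &= AB + BC + AC \\
  &= AB + (A + \overline{A})\,BC + A\,(B + \overline{B})\,C \\
  &= AB + \overline{A}BC + A\overline{B}C \\
  &= AB + (\overline{A}B + A\overline{B})\,C
\end{aligned}$$

$$\begin{aligned}
\overline{C} &= \overline{AB + BC + AC} \\
  &= (\overline{A} + \overline{B})(\overline{B} + \overline{C})(\overline{A} + \overline{C}) \\
  &= \overline{A}\,\overline{B}\,\overline{A} + \overline{A}\,\overline{B}\,\overline{C} + \overline{A}\,\overline{C}\,\overline{A} + \overline{A}\,\overline{C}\,\overline{C} + \overline{B}\,\overline{B}\,\overline{A} + \overline{B}\,\overline{B}\,\overline{C} + \overline{B}\,\overline{C}\,\overline{A} + \overline{B}\,\overline{C}\,\overline{C} \\
  &= \overline{A}\,\overline{B} + \overline{A}\,\overline{C} + \overline{B}\,\overline{C} \\
  &= \overline{A}\,\overline{B} + \overline{A}\,(B + \overline{B})\,\overline{C} + (A + \overline{A})\,\overline{B}\,\overline{C} \\
  &= \overline{A}\,\overline{B} + \overline{A}B\overline{C} + A\overline{B}\,\overline{C} \\
  &= \overline{A}\,\overline{B} + (\overline{A}B + A\overline{B})\,\overline{C}
\end{aligned}$$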

In both cases, we see the functions have

  1. Common sub-functions
  2. The common part is gated by a complementary input.
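The factored forms above can be checked by brute force. This is a sketch with hypothetical function names, not something given in the notes. It builds each output pair from one shared sub-function gated by an input and its complement, then confirms that the two rails are always complementary and match the defining functions.

```python
from itertools import product


def inv(x):
    """Complement of a 1-bit value."""
    return 1 - x


def carry_pair(a, b, c):
    """Carry and its complement; both reuse the sub-function (A'B + AB')."""
    common = (inv(a) & b) | (a & inv(b))            # shared block
    carry = (a & b) | (common & c)                  # gated by C
    carry_n = (inv(a) & inv(b)) | (common & inv(c)) # gated by C'
    return carry, carry_n


def sum_pair(a, b, c):
    """Sum and its complement; both reuse (B xor C) and its complement."""
    x = b ^ c                                       # shared block
    s = (inv(a) & x) | (a & inv(x))
    s_n = (a & x) | (inv(a) & inv(x))
    return s, s_n


for a, b, c in product((0, 1), repeat=3):
    carry, carry_n = carry_pair(a, b, c)
    s, s_n = sum_pair(a, b, c)
    assert carry == (a & b) | (b & c) | (a & c)
    assert carry_n == inv(carry)
    assert s == a ^ b ^ c
    assert s_n == inv(s)
```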

These two properties allow the transistors for the common part to be shared. Consider full CMOS gates for $f$ and $\overline{f}$ with a common part, C, which is gated by I.

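[Figure: full CMOS gates for $f$ and $\overline{f}$ sharing the common block C and its dual C′, gated by $I$ and $\overline{I}$, alongside the uncommon parts of each gate. C′ = dual of C.]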

Here it is obvious why the gating signals $I$, $\overline{I}$ must be complementary: to avoid shorting $f$ to $\overline{f}$!

The transistor count may be further reduced by using logic gate design styles that eliminate the pMOS pull-up block, which is made possible when synthesizing both $f$ and $\overline{f}$. Common blocks in $f$ and $\overline{f}$ may still be shared as above.

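[Figure: CVSL [2] and CPL [1] gate structures, each built from an $f$ block and an $\overline{f}$ block and producing both $f$ and $\overline{f}$]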

In both cases, the $f$ and $\overline{f}$ blocks are synthesized with nMOSFETs only (no pMOSFETs). CVSL eliminates having to duplicate $f$ and $\overline{f}$ with pMOSFETs by using cross-coupled pMOSFETs, which force $f$ and $\overline{f}$ to opposite values. The problem is that the cross-couple is slow, as we saw last semester. Consider switching $f$ from high to low. At the beginning of the switching transient the pMOS cross-couple has not yet switched, so we have

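[Figure: CVSL gate at the start of the transient as $f$ switches from high to low; device labels "off", "on" and "open", with the note "f pull up turns off"]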

The nMOSFET that just turned on must fight the pMOSFET that is still turned on to bring the $f$ output low enough to turn on the other pMOSFET, which then causes the first pMOSFET to turn off. This can take a considerable amount of time, so typical CVSL gates are not much faster than full CMOS gates even though the input gate load is 1/ that of full CMOS.

The Complementary Pass Logic (CPL) method overcomes the speed problem by using inverters as level detectors for the two nMOS pass transistor blocks. There is no cross-couple circuit and no fighting of logic levels. However, an nMOS pass circuit is notoriously slow at passing high logic levels. This can be compensated by adjusting the inverter crossover voltage, $V_{inv}$, to a lower than usual value, as discussed for partial swing logic last semester. In fact, CPL is just non-full-swing pass transistor logic where both a logic function and its complement are implemented simultaneously. This is very useful for arithmetic circuits such as multipliers and adders. CPL gates, as originally presented in [1], can be improved somewhat. The AND/NAND gate should be changed as follows.

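[Figure: original AND/NAND, where pass FETs gated by $B$ and $\overline{B}$ pass $A$, $B$, $\overline{B}$ and $\overline{A}$ to produce $AB$ and $\overline{AB}$, and improved AND/NAND, where the pass FETs pass $A$, 0, 1 and $\overline{A}$ instead]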
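The two AND/NAND variants can be compared with an idealized switch-level model. This sketch is not from the notes: it assumes the reading of the figure in which the original gate passes $B$ and $\overline{B}$ through the FETs gated by $\overline{B}$, while the improved gate passes the constants 0 and 1 there. The function names are hypothetical, and threshold drops and delays are ignored.

```python
from itertools import product


def pass_pair(select, when_high, when_low):
    """Ideal nMOS pass pair: one FET gated by `select`, the other by its complement."""
    return when_high if select else when_low


def and_nand_original(a, b):
    """Assumed original form: B drives FET gates and is also passed as data."""
    and_out = pass_pair(b, a, b)            # B = 0 passes B itself (a 0)
    nand_out = pass_pair(b, 1 - a, 1 - b)   # B = 0 passes B' (a 1)
    return and_out, nand_out


def and_nand_improved(a, b):
    """Assumed improved form: B drives FET gates only; constants are passed."""
    and_out = pass_pair(b, a, 0)
    nand_out = pass_pair(b, 1 - a, 1)
    return and_out, nand_out


for a, b in product((0, 1), repeat=2):
    assert and_nand_original(a, b) == (a & b, 1 - (a & b))
    assert and_nand_improved(a, b) == (a & b, 1 - (a & b))
```

Both forms compute the same logic. In the improved form the $B$ and $\overline{B}$ signals only drive transistor gates and no longer have to source current through a pass FET.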

The revised form has a much smaller load on the B input, and is much faster. As usual, the inverters do not need to be included in every gate; they are inserted where needed to prevent the delay from growing as $n^2$ through $n$ transistors in series. For example, two 2-input XOR gates can be cascaded to make a 3-input XOR gate, and an inverter need not be inserted between the two XOR gates.

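[Figure: CPL three-input XOR made from two cascaded 2-input XOR stages, with inputs $A$, $\overline{A}$, $B$, $\overline{B}$, $C$, $\overline{C}$ and outputs $S$, $\overline{S}$]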
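The quadratic delay through series pass transistors mentioned above follows from a simple RC ladder (Elmore) estimate. The sketch below is an illustration under stated assumptions, not a model from the notes: every pass FET is an identical resistance `r` with capacitance `c` at its output node, each inserted inverter costs a fixed delay `t_inv` and fully isolates the segments, and all values are normalized and hypothetical.

```python
def elmore_delay_chain(n, r=1.0, c=1.0):
    """Elmore delay of n identical series pass FETs: r*c*(1 + 2 + ... + n)."""
    return sum(i * r * c for i in range(1, n + 1))


def buffered_delay(n, segment, t_inv, r=1.0, c=1.0):
    """Delay when the chain is cut into segments, each followed by an inverter."""
    full, rest = divmod(n, segment)
    delay = full * (elmore_delay_chain(segment, r, c) + t_inv)
    if rest:
        delay += elmore_delay_chain(rest, r, c) + t_inv
    return delay


# Unbuffered delay grows roughly as n^2 / 2; buffering makes it linear in n.
print(elmore_delay_chain(4), elmore_delay_chain(8))   # 10.0 36.0
print(buffered_delay(8, segment=2, t_inv=1.0))        # 16.0
```

With these assumed numbers a two-FET segment (as in two cascaded XOR stages) costs little before an inverter is needed, which is consistent with leaving out the inverter between the two XOR gates.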

The three-input XOR gate can be used to produce the sum output for the carry save adder. The CPL three-input XOR gate has the same number of transistors as the folded CVSL three-input XOR gate [2]. The structure of the circuits is almost the same which can be

[Figure: carry circuits in CPL and in folded CVSL, with inputs $A$, $\overline{A}$, $B$, $\overline{B}$, $C$, $\overline{C}$ and outputs $C_{out}$, $\overline{C_{out}}$]

It is interesting to note that the folded CVSL carry circuit from [2], which has only 6 pass FETs in it, cannot be made into a CPL circuit. When A = B = 1 in the CVSL circuit, a parallel combination of pass FETs controlled by $C$ and $\overline{C}$ gives a valid logic 0, but in CPL it does not.

[Figure: parallel pass FETs controlled by $C$ and $\overline{C}$; the output is invalid in CPL but a valid 0 in CVSL]

The above circuits are optimized implementations for the (3,2) carry save adder building block cell. It is also possible to optimize other building block cells, for example the (4,2) compressor. The (4,2) compressor has 4 explicit inputs plus one hidden carry, for a total of 5 inputs. The sum bit output of the (4,2) compressor is the exclusive-or of all 5 inputs. If the (4,2) compressor is made from two (3,2) blocks, then the 5-input XOR gets implemented by four 2-input XOR gates in series. A tree of XOR gates would be faster [3].

Similarly, a tree of gates can be found for the other (4,2) outputs which would be faster than those obtained from two cascaded (3,2) circuits.

[Figure: 5-input XOR of A, B, C, D, E built from cascaded (3,2) compressors, compared with the 5-input XOR built as an optimized tree]
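The claims about the (4,2) compressor can be checked with a small behavioural model. This is a sketch, not the circuit from [3]: the compressor is built from two cascaded (3,2) blocks, the names `compressor_4_2`, `cin` and `cout` are hypothetical, and XOR depth is counted in 2-input gate levels only.

```python
from itertools import product
from math import ceil, log2


def csa(a, b, c):
    """(3,2) block: return (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(a, b, c, d, cin):
    """(4,2) compressor from two cascaded (3,2) blocks; cin is the hidden carry."""
    s1, cout = csa(a, b, c)      # cout goes sideways to the next column
    s, carry = csa(s1, d, cin)   # second block produces the two outputs
    return s, carry, cout


def xor_chain_depth(n):
    """2-input XOR levels when n inputs are combined one after another."""
    return n - 1


def xor_tree_depth(n):
    """2-input XOR levels when n inputs are combined in a balanced tree."""
    return ceil(log2(n))


for bits in product((0, 1), repeat=5):
    a, b, c, d, cin = bits
    s, carry, cout = compressor_4_2(*bits)
    assert s == a ^ b ^ c ^ d ^ cin             # sum is the XOR of all 5 inputs
    assert sum(bits) == s + 2 * (carry + cout)  # arithmetic value is preserved

print(xor_chain_depth(5), xor_tree_depth(5))    # 4 3
```

In this model `cout` depends only on a, b and c, so the hidden carry does not ripple along the row of compressors.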

The CPL gate for the XOR tree might look like the following. Note that it is necessary to add the inverters before the internal XOR outputs can be used to control the gate of a pass FET.

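[Figure: CPL 5-input XOR tree with inputs $A$, $\overline{A}$, $B$, $\overline{B}$, $C$, $\overline{C}$, $D$, $\overline{D}$, $E$, $\overline{E}$ and outputs $S$, $\overline{S}$]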

$V_{Tp} = V_{Tn} + V_{Tn}(\text{pass})$. Such a large $V_{Tp}$ would make the p-channel devices very slow.

There is another way to make $V_{Tp} = V_{Tn} + V_{Tn}(\text{pass})$: make the nMOS pass FETs differently from the regular nMOSFETs in the inverter. A "native" nMOSFET is easy to fabricate with a threshold $V_{Tn}' \approx 0$. If the native nMOSFET is used for the pass transistors, body effect increases the threshold to only a few tenths of a volt. Thus, it is possible to satisfy

$$V_{Tp} = V_{Tn} + V_{Tn}'(\text{pass})$$

without increasing $V_{Tp}$ very much.

[1] K. Yano et al., "A 3.8-ns 16×16-b Multiplier Using Complementary Pass-Transistor Logic," IEEE J. Solid-State Circuits, vol. 25, pp. 388-394, Apr. 1990.

[2] P. Song and G. De Micheli, "Circuit and Architecture Trade-offs for High-Speed Multiplication," IEEE J. Solid-State Circuits, vol. 26, pp. 1184-1198, Sep. 1991.

[3] N. Nagamatsu et al., "A 15-ns 32×32-b CMOS Multiplier with an Improved Parallel Structure," IEEE J. Solid-State Circuits, vol. 25, pp. 494-497, Apr. 1990.