SRT Division Algorithm and the Pentium Division Bug - Prof. Gabriel Loh, Study notes of Computer Science

An explanation of the SRT division algorithm, which is used for performing division in base 2. The notes also discuss a bug in Intel's Pentium implementation of this algorithm, which returns incorrect quotient digits for certain inputs. They cover the algorithm itself, the steps for performing a division with it, and an analysis of the Pentium division bug.



Multiplication, Division

Prof. Loh

CS3220 - Processor Design - Fall 2008

November 3, 2008

1 Multiplication

Multiplication by hand in base 10 involves repeatedly multiplying one number by each digit of the second number, and then adding all of the results together. In base 2, multiplication by a single bit is simplified by the fact that a bit has only two values (both of which are trivial to multiply by!). Multiplying an $n$-bit number by a single bit simply involves $n$ AND gates, as illustrated in Figure 1. To multiply two $n$-bit numbers together, we perform $n$ different $n \times 1$-bit multiplications in parallel, shift the partial results properly, and add them all together. This is illustrated in Figure 2. The total gate delay is $O(1)$ for the 1-bit multiplies, zero for shifting (each shift is by a constant amount, so only wires are involved), and $O(\log(n + \lg n)) \approx O(\log n)$ gate delays for adding together $n$ different $O(n)$-bit numbers (if, for example, a tree of carry-save adders is used). Notice that the final output of an $n$-bit by $n$-bit multiply is $2n$ bits wide.

There are other ways to perform multiplication by using repeated iterative steps. The naive approach is to have a single 1-bit multiplier and, on each cycle, generate one additional partial product. After $n$ such steps, all of the partial products will have been generated. In parallel, the partial products can be added as they are generated with an accumulator. This uses considerably less hardware, but takes much longer to complete the calculation ($O(n)$).

With either iterative addition of partial products or a Wallace tree, the number of partial products is largely what determines how fast the multiplication can be performed. To reduce the number of partial products, the Booth algorithm can be used. This is a simple trick that involves recoding the binary numbers using 0’s, 1’s and −1’s. For example, the number $0011110_2$ is the same as $01000\bar{1}0_2$, where $\bar{1}$ means $-1$. Any partial product that corresponds to a digit equal to zero can be skipped. This doesn’t help much when a tree of adders is used, but can save many iterations when an iterative method is used.

To perform the encoding, start from the least significant bit. Each time a block of zeros ends and a block of ones starts, a $\bar{1}$ is written down. Each time a block of ones ends and a block of zeros starts, a 1 is written down. If neither condition holds, a zero is written. The following is an example:

$00011000011111\ [0]$ ← implicit zero
$0010\bar{1}00010000\bar{1}$
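The recoding rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `booth_recode` and `digits_value` are hypothetical, the input is an MSB-first bit string, and the input is assumed to have a leading 0 (otherwise the recoded digits represent the two's complement value rather than the unsigned one).

```python
def booth_recode(bits):
    """Recode an MSB-first binary string into digits from {-1, 0, 1}.

    Each output digit is (bit to the right) - (current bit), with an
    implicit zero to the right of the least significant bit.
    """
    padded = bits + "0"  # the implicit zero
    digits = []
    for i in range(len(bits)):
        current = int(padded[i])
        right = int(padded[i + 1])
        # zeros end / ones start (seen from the LSB): 0 - 1 = -1
        # ones end / zeros start (seen from the LSB): 1 - 0 = +1
        digits.append(right - current)
    return digits


def digits_value(digits):
    """Evaluate an MSB-first list of base-2 signed digits."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


print(booth_recode("0011110"))
print(digits_value(booth_recode("00011000011111")), int("00011000011111", 2))
```

Only the nonzero digits need a partial product, which is where the savings for an iterative multiplier come from.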

[Figure 1 diagram: each input bit $x_{n-1}, x_{n-2}, \ldots, x_3, x_2, x_1, x_0$ is ANDed with $y_i$ to produce $x_{n-1} \cdot y_i, x_{n-2} \cdot y_i, \ldots, x_0 \cdot y_i$.]

Figure 1: An $n$-bit by 1-bit multiply is achieved by using AND gates.

[Figure 2 diagram: inputs $x_7 \ldots x_0$ and $y_0 \ldots y_7$ feed eight 1-bit multiplies; a Wallace tree followed by an LCA combines the partial products into the 16-bit product $x \times y$.]

Figure 2: The $n$ partial products can be computed by $n$ separate 1-bit multiplies. The partial products are then combined with a Wallace Tree and LCA.
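As a rough software model of the structure in Figures 1 and 2, the sketch below builds the $n$ shifted partial products and adds them. The names `partial_products` and `multiply` are hypothetical, and Python's `sum` only stands in for the value that the Wallace tree and final adder would produce; it does not model their gate-level structure or delay.

```python
def partial_products(x, y, n):
    """Return the n shifted partial products of two n-bit unsigned integers."""
    rows = []
    for i in range(n):
        y_i = (y >> i) & 1
        # ANDing every bit of x with y_i gives either x or 0 (Figure 1).
        row = x if y_i else 0
        # The shift by i is a fixed wiring offset in hardware.
        rows.append(row << i)
    return rows


def multiply(x, y, n):
    """Add all partial products; the result fits in 2n bits."""
    return sum(partial_products(x, y, n))


print(partial_products(0b1101, 0b1011, 4))
print(multiply(0b1101, 0b1011, 4))
```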

As before, we start with $R_0 = R$. Then we repeatedly choose $q_i$ such that $B(R_i - Dq_i) \in [q_\bot.q_\bot q_\bot q_\bot\ldots,\ q_\top.q_\top q_\top q_\top\ldots]$. In base 10 long division, $S = \{0, 1, \ldots, 9\}$, and so the range limitation is simply that $10(R_i - Dq_i) \in [0.000\ldots,\ 9.999\ldots]$.

For all of our divisions, we assume that both operands have been normalized such that the first bit is a one (i.e. $1.xxxx$). Renormalization can be performed at the end of the division to adjust the answer to the correct order of magnitude.

At the very minimum, each step of the long division requires a comparison ($O(\lg n)$), a subtraction ($O(\lg n)$), and a shift ($O(1)$). With $n$ iterations of this, there will be a delay of at least $\Omega(n \lg n)$ with this approach. We have also not discussed how the operation of choosing $q_i$ to meet the specified requirements can be implemented efficiently. In any case, this approach for performing division takes longer than our target of $O(n)$ time.
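The digit-by-digit recurrence can be sketched with exact rational arithmetic. This is a minimal illustration, not the hardware method: the name `long_division_digits` is hypothetical, the digit is chosen by a floor division (one simple way to satisfy the range condition for the non-redundant digit set $\{0, \ldots, B-1\}$), and the sketch keeps each remainder in $[0, B \cdot D)$, which is an assumption about how the range condition scales with the divisor.

```python
from fractions import Fraction


def long_division_digits(r, d, base=10, steps=8):
    """Return digits q_0, q_1, ... of r/d, assuming 0 <= r < base * d and d > 0.

    q_0 is the units digit; q_i for i >= 1 is the i-th fractional digit.
    """
    remainder = Fraction(r)
    divisor = Fraction(d)
    digits = []
    for _ in range(steps):
        q = int(remainder // divisor)  # largest digit with q * D <= R_i
        digits.append(q)
        # R_{i+1} = B * (R_i - D * q_i): subtract, then shift one digit.
        remainder = base * (remainder - q * divisor)
    return digits


print(long_division_digits(1, 7, steps=7))
print(long_division_digits(22, 7, steps=7))
```

The floor division hides exactly the comparison and subtraction costs discussed above, which is what SRT division is designed to avoid.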

2.2 SRT Division

SRT Division is named for the people who invented the algorithm. At about the same time, D. Sweeney (IBM), J.E. Robertson (University of Illinois) and K.D. Tocher (Imperial College London) all independently discovered the algorithm. Before getting into the details, we will first look at the concept of negative digits. In base 10, we normally use the digits $S = \{0, 1, \ldots, 9\}$ to represent numbers, and in this system each number has a unique representation. Alternatively, we can use negative digits as well, such that $S = \{\bar{9}, \bar{8}, \ldots, \bar{1}, 0, 1, \ldots, 8, 9\}$, where $\bar{x} = -x$. Then a number such as 16 can be represented as either $16$ or $2\bar{4}$ ($20 + (-4) = 16$). This is a redundant representation, which allows for more than one way to represent a number.
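A small sketch makes the redundancy concrete. The helper name `signed_digit_value` is hypothetical; it simply evaluates a most-significant-first list of digits, where negative list entries play the role of barred digits.

```python
def signed_digit_value(digits, base=10):
    """Evaluate a most-significant-first list of signed digits."""
    value = 0
    for d in digits:
        if abs(d) >= base:
            raise ValueError("digit out of range for this base")
        value = value * base + d
    return value


# 16 written three different ways with digits from {-9, ..., 9}.
print(signed_digit_value([1, 6]))
print(signed_digit_value([2, -4]))
print(signed_digit_value([1, -8, -4]))
```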

Radix 4 SRT Division

We use base 4 division (which allows us to process two bits per step), and we use the set of digits $S = \{\bar{2}, \bar{1}, 0, 1, 2\}$.

Algorithm:

    $R_0 := R$
    for $k = 0, 1, \ldots$
        determine $q_k \in S$ s.t. $R_{k+1} := 4(R_k - q_k D)$ and $|R_{k+1}| \le \frac{8}{3} D$
    end for

$$q = \frac{R}{D} = \sum_{i=0}^{\infty} \frac{q_i}{4^i}$$

This algorithm, as presented, should look pretty much like the abstract division algorithm presented earlier, except with the appropriate constants substituted in.
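The loop can be tried out with exact fractions. This is a sketch under stated assumptions, not the hardware algorithm: the names `srt_radix4` and `quotient_value` are hypothetical, and the digit is chosen by rounding $R_k / D$ to the nearest member of $S$, which is just one selection that keeps $|R_{k+1}| \le \frac{8}{3} D$.

```python
from fractions import Fraction


def srt_radix4(r, d, steps=12):
    """Run `steps` iterations of the radix-4 recurrence.

    Assumes d > 0 and |r| <= (8/3) * d. Returns (digits, final_remainder).
    """
    r_k = Fraction(r)
    d = Fraction(d)
    bound = Fraction(8, 3) * d
    if d <= 0 or abs(r_k) > bound:
        raise ValueError("need d > 0 and |r| <= (8/3) d")
    digits = []
    for _ in range(steps):
        # Assumed selection rule: nearest digit in {-2, ..., 2}.
        q_k = max(-2, min(2, round(r_k / d)))
        r_k = 4 * (r_k - q_k * d)
        if abs(r_k) > bound:
            raise AssertionError("remainder bound violated")
        digits.append(q_k)
    return digits, r_k


def quotient_value(digits):
    """Evaluate sum(q_i / 4**i) for the digits produced so far."""
    return sum(Fraction(q, 4 ** i) for i, q in enumerate(digits))


digits, remainder = srt_radix4(Fraction(3, 2), Fraction(5, 4))
print(digits)
print(float(quotient_value(digits)), 1.5 / 1.25)
```

After $k$ steps the partial quotient differs from $R/D$ by exactly $R_k / (4^k D)$, so each iteration contributes two more correct bits.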

Theorem: The Radix 4 SRT Algorithm computes $q = \frac{R}{D} = \sum_{i=0}^{\infty} \frac{q_i}{4^i}$.

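Proof:

$$
\begin{aligned}
R_{k+1} &= 4(R_k - q_k D) && \text{by definition} && (1)\\
\frac{R_{k+1}}{D} &= \frac{4R_k}{D} - 4q_k && \text{divide (1) by } D && (2)\\
\frac{4R_k}{D} &= \frac{R_{k+1}}{D} + 4q_k && \text{rearrange terms} && (3)\\
\frac{R_k}{D} &= \frac{R_{k+1}}{4D} + q_k && \text{divide by 4} && (4)
\end{aligned}
$$

$$
\begin{aligned}
\frac{R}{D} &= \frac{R_0}{D} && \text{by definition}\\
&= \frac{R_1}{4D} + q_0 && \text{by (4)}\\
&= \frac{1}{4}\left(\frac{R_2}{4D} + q_1\right) + q_0 && \text{by (4)}\\
&= \frac{R_2}{4^2 D} + \frac{q_1}{4} + q_0 && \text{expansion of terms}\\
&\;\;\vdots\\
&= \frac{1}{4^k} \cdot \frac{R_k}{D} + \left(\frac{q_{k-1}}{4^{k-1}} + \cdots + \frac{q_1}{4} + q_0\right) && && (5)
\end{aligned}
$$

Since $-\frac{8}{3} D \le R_k \le \frac{8}{3} D$, then:

$$
\left|\lim_{k \to \infty} \frac{1}{4^k} \cdot \frac{R_k}{D}\right| \le \lim_{k \to \infty} \frac{1}{4^k} \cdot \frac{8}{3} = 0 \qquad \text{substitute the bound on } R_k \qquad (6)
$$

$$
\frac{R}{D} = \lim_{k \to \infty} \frac{1}{4^k} \cdot \frac{R_k}{D} + \sum_{i=0}^{\infty} \frac{q_i}{4^i} = \sum_{i=0}^{\infty} \frac{q_i}{4^i} \qquad \text{Q.E.D.} \qquad (7)
$$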

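First question: why in the world do we use a limit of $\frac{8}{3}$?

Recall that in base 10, we required $R_{k+1} < q_\top.q_\top q_\top q_\top\ldots < 10$ (for $q_\top = 9$). In our radix 4 SRT division, $q_\top = 2$. So we have the condition that

$$R_{k+1} < 2.2222\ldots \ (\text{base } 4) = 2 \sum_{i=0}^{\infty} \left(\frac{1}{4}\right)^i = 2 \cdot \frac{1}{1 - \frac{1}{4}} = \frac{8}{3}$$

Similarly for the lower bound of $-\frac{8}{3}$ using $q_\bot = \bar{2}$.

Second question: how do we choose the $q_k$’s?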

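[Figure 3 diagram: number line of $R_k / D$ with ticks at $-\frac{8}{3}, -\frac{5}{3}, -\frac{4}{3}, -\frac{2}{3}, -\frac{1}{3}, 0$ (and their positive counterparts), covered by overlapping intervals labeled $q = -2$, $q = -1$, $q = 0$, $q = 1$, $q = 2$.]

Figure 3: The redundant representation of numbers allows for more than one choice for the quotient digits.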
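The interval endpoints in Figure 3 can be recovered from the recurrence and the remainder bound given in the algorithm. The following short derivation is a sketch that assumes $D > 0$ (which holds for normalized operands):

```latex
\begin{aligned}
|R_{k+1}| \le \tfrac{8}{3} D
  &\iff |4(R_k - q_k D)| \le \tfrac{8}{3} D \\
  &\iff \left|\frac{R_k}{D} - q_k\right| \le \frac{2}{3} \\[4pt]
q_k = -2 &: \quad \tfrac{R_k}{D} \in \left[-\tfrac{8}{3},\, -\tfrac{4}{3}\right] \\
q_k = -1 &: \quad \tfrac{R_k}{D} \in \left[-\tfrac{5}{3},\, -\tfrac{1}{3}\right] \\
q_k = 0  &: \quad \tfrac{R_k}{D} \in \left[-\tfrac{2}{3},\, \tfrac{2}{3}\right] \\
q_k = 1  &: \quad \tfrac{R_k}{D} \in \left[\tfrac{1}{3},\, \tfrac{5}{3}\right] \\
q_k = 2  &: \quad \tfrac{R_k}{D} \in \left[\tfrac{4}{3},\, \tfrac{8}{3}\right]
\end{aligned}
```

Adjacent intervals overlap in a band of width $\frac{1}{3}$ (for example $[\frac{1}{3}, \frac{2}{3}]$ for $q = 0$ and $q = 1$), and that band is the slack the digit selection can exploit.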

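[Figure 4 diagram: lookup table with $D$ on the horizontal axis (from $1.000\ldots$ to $1.111\ldots$) and $R_k$ on the vertical axis. The lines $R_k = \pm\frac{1}{3} D$, $\pm\frac{2}{3} D$, $\pm\frac{4}{3} D$, $\pm\frac{5}{3} D$ and $\pm\frac{8}{3} D$ bound the regions labeled $q = 2$, $q = 1$, $q = 0$, $q = -1$, $q = -2$. The area beyond $\pm\frac{8}{3} D$ is marked "not reachable so long as the $q_k$’s are properly selected"; the area outside the range of $D$ is marked "not reachable due to normalization".]

Figure 4: A lookup table for $R_k$ and $D$ can make choosing a suitable $q_k$ only require $O(1)$ time. Certain regions of the table can never be reached.

[Figure 5 diagram: number line showing the $q = 0$ / $q = 1$ overlap region from $\frac{1}{3}$ to $\frac{2}{3}$, with its midpoint $\frac{1}{2}$.]

Figure 5: We will choose the midpoint in the overlap region to decide what value to return for $q_k$.

[Figure 6 diagram: number line with marks at $\frac{1}{3}$, $\frac{1}{2}$ and $\frac{2}{3}$, showing the actual $\frac{R_k}{D}$, the approximation $\frac{\tilde{R}_k}{\tilde{D}}$, and an error of $\frac{1}{6}$.]

Figure 6: For a cutoff of $\frac{1}{2}$, an error greater than $\frac{1}{6}$ can cause erroneous quotient digits to be returned.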

... entries that are less than $1.000\ldots$ or greater than $1.111\ldots$. With such a table, the choice for a particular $q_k$ can be made in $O(1)$ time. The regions that allow for more than one choice for $q_k$ are darker than the others. An obvious problem with this approach is that this table would be of infinite size, and so it is not anywhere near realistic.

The solution to this problem is to take advantage of the “slack” in what number we choose for $q_k$. In Figure 4, there are regions where we have a choice of what value for $q_k$ we return. Let us choose the midpoint of the overlap region for deciding what to return (Figure 5). For example, for the $q = 0$ / $q = 1$ overlap region, if $\frac{R_k}{D} > \frac{1}{2}$, then we’ll return $q_k = 1$; otherwise, if $\frac{R_k}{D} \le \frac{1}{2}$, we’ll return $q_k = 0$.

Now, suppose we simply approximate the value of $\frac{R_k}{D}$ by $\frac{\tilde{R}_k}{\tilde{D}}$. If $\frac{\tilde{R}_k}{\tilde{D}}$ is less than $\frac{1}{6}$ away from $\frac{R_k}{D}$, we’ll still return a correct value of $q_k$. If the error is greater than $\frac{1}{6}$, erroneous values may be returned. An example is illustrated in Figure 6, where the actual value of $\frac{R_k}{D}$ dictates that $q_k = 0$ must be returned (i.e. it’s not in the overlap region), but the approximation has a sufficiently large error to result in $q_k = 1$ being returned.

Our approximation for $R_k$ and $D$ is to simply use the first 8 bits of $R_k$ and the first 5 bits of $D$. This results in an error that is less than $\frac{1}{6}$. At the same time, because $\tilde{R}_k$ and $\tilde{D}$ are of a constant size, we can now create a lookup table for all of the possible $\approx 2^5 \cdot 2^8$ inputs. In the Pentium, the cutoff in the overlap regions is not symmetrically located (i.e. not at the midpoint), which allows their implementation to use only 7 bits for $\tilde{R}_k$.

The encoding of $q \in S = \{\bar{2}, \bar{1}, 0, 1, 2\}$ requires three bits. Instead, we will use two numbers to keep the positive and negative portions separate. In base 10, instead of $2\bar{4}$, we’ll store $20$ and $04$ as two separate numbers. At the very end, these numbers can be combined by subtracting the negative portion from the positive part.

The other trick that we will use is that all addition/subtraction through each iteration will be carried out in carry-save form, so we will never have to pay the price for a carry propagation. The only exception is when we need to compute $\tilde{R}_k$ and $\tilde{D}$: we will have to perform an addition to get the non-carry-save form of the numbers, but both of these have fixed widths that are independent of the number of bits $n$ in our arguments, and so take $O(1)$ time.
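The midpoint rule and its error tolerance can be checked with a short sketch. Several things here are assumptions made for illustration: the names `select_digit` and `is_valid_digit` are hypothetical, the cutoffs $\pm\frac{1}{2}$ and $\pm\frac{3}{2}$ extend the stated $q = 0$ / $q = 1$ midpoint to the other overlap regions by symmetry, and a digit counts as valid when $|R_k / D - q_k| \le \frac{2}{3}$, which is the condition implied by the remainder bound.

```python
from fractions import Fraction

# (cutoff, digit) pairs, checked from the largest cutoff downwards.
CUTOFFS = [
    (Fraction(3, 2), 2),
    (Fraction(1, 2), 1),
    (Fraction(-1, 2), 0),
    (Fraction(-3, 2), -1),
]


def select_digit(ratio_estimate):
    """Pick q_k from an estimate of R_k / D using midpoint cutoffs."""
    for cutoff, digit in CUTOFFS:
        if ratio_estimate > cutoff:
            return digit
    return -2


def is_valid_digit(ratio, q):
    """A digit keeps |R_{k+1}| <= (8/3) D exactly when |R_k/D - q| <= 2/3."""
    return abs(ratio - q) <= Fraction(2, 3)


# An estimate that is off by less than 1/6 still gives a usable digit.
true_ratio = Fraction(7, 20)
print(select_digit(true_ratio + Fraction(3, 20)))
# The Figure 6 situation: the true ratio requires q = 0, the error exceeds 1/6.
print(is_valid_digit(Fraction(3, 10), select_digit(Fraction(3, 10) + Fraction(13, 60))))
```

A real table would be indexed by the truncated bits of $R_k$ and $D$ rather than by an exact ratio; the sketch only demonstrates why an error below $\frac{1}{6}$ is harmless.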

B  Compute $-q_k D = -D$. Same as before, negate and add 1 in the carry.

C  Add $R + (-qD)$:

    previous partial sum = 1110.1100000001000110011000
    previous carry       = 0010.0000000000000001000100
    −qD                  = 1110.1110000000000000001111

    partial sum          = 0010.0010000001000111010011
    carry                = 1101.1000000000000000011001   (the final 1 is from −qD)

D  Multiply by 4 (shift bits by 2):

    partial sum          = 0010.0010000001000111010011
    carry                = 1101.1000000000000000011001
    ⇒ new partial sum    = 1000.1000000100011101001100
      new carry          = 0110.0000000000000001100100
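Steps C and D of this iteration can be reproduced with a small carry-save sketch. The helper names (`parse_fixed`, `carry_save_add`, `times_four`) are hypothetical, and the 26-bit width (4 integer bits plus 22 fraction bits) is read off the numbers in the example; the bit strings themselves are copied from the text.

```python
WIDTH = 26  # 4 integer bits + 22 fraction bits, as in the worked example
MASK = (1 << WIDTH) - 1


def parse_fixed(text):
    """Turn a string such as '1110.1100...' into an integer bit pattern."""
    return int(text.replace(".", "").replace(" ", ""), 2)


def carry_save_add(a, b, c, carry_in=0):
    """3:2 compression: a + b + c + carry_in == partial_sum + carry (mod 2**WIDTH)."""
    partial_sum = a ^ b ^ c
    majority = (a & b) | (a & c) | (b & c)
    # The shifted carry word has a free LSB, used for the +1 of a negation.
    carry = ((majority << 1) | carry_in) & MASK
    return partial_sum, carry


def times_four(partial_sum, carry):
    """Multiply a carry-save pair by 4 by shifting both words left by 2."""
    return (partial_sum << 2) & MASK, (carry << 2) & MASK


prev_sum = parse_fixed("1110.1100000001000110011000")
prev_carry = parse_fixed("0010.0000000000000001000100")
inverted_d = parse_fixed("1110.1110000000000000001111")  # bitwise NOT of D

s, c = carry_save_add(prev_sum, prev_carry, inverted_d, carry_in=1)
new_s, new_c = times_four(s, c)
print(format(s, "026b"), format(c, "026b"))
print(format(new_s, "026b"), format(new_c, "026b"))
```

No carry ever propagates across the word here, which is why each of these steps costs $O(1)$ regardless of the operand width.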

Iteration 3:

A  $\tilde{R}_k = 1110.1000$, $\tilde{D}$ is the same, and so $q_2 = \bar{1}$. So far, we have $q_{\text{so far}} = 1.1\bar{1}$.

Recall that we stated that the digits of $q$ would be stored as two separate numbers, one for the positive digits and one for the negative digits. So $1.1\bar{1}$ is stored as

                             q_0   q_1  q_2
    q+ = positive digits:    01  .  01   00
    q− = negative digits:    00  .  00   01
    q  = difference:         01  .  00   11
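The two-word storage of the quotient can be sketched as follows. The function name `split_signed_digits` is hypothetical; each radix-4 digit occupies two bits, and the digit list `[1, 1, -1]` is the $q_0, q_1, q_2$ of the example.

```python
from fractions import Fraction


def split_signed_digits(digits):
    """Pack radix-4 signed digits into separate positive and negative words."""
    q_plus = 0
    q_minus = 0
    for d in digits:
        q_plus = (q_plus << 2) | (d if d > 0 else 0)
        q_minus = (q_minus << 2) | (-d if d < 0 else 0)
    return q_plus, q_minus


q_plus, q_minus = split_signed_digits([1, 1, -1])
difference = q_plus - q_minus  # the single subtraction done at the very end
print(format(q_plus, "06b"), format(q_minus, "06b"), format(difference, "06b"))
# Four fraction bits follow q_0, so the difference is scaled by 1/16.
print(Fraction(difference, 16))
```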

We will stop our example at this point since this becomes very tedious very quickly. Let us now analyze the cost for each iteration.

A  Compute $\tilde{R}_k$ and perform lookup: $O(1)$

B  Compute $-q_k D$:

  • $q = 2$: shift to get $2D$, invert and add 1 (in the carry) to get $-2D$: $O(1)$
  • $q = 1$: invert and add 1 (in the carry) to get $-D$: $O(1)$
  • $q = 0$: multiply by zero gives all zeros: $O(1)$
  • $q = -1$: $-(-1)D = D$, do nothing to $D$: $O(1)$
  • $q = -2$: $-(-2)D = 2D$, shift $D$: $O(1)$

C  Use a carry-save add to sum $R_k + (-q_k D)$: $O(1)$

D  Multiply by 4 (shift left by 2 bits): $O(1)$
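Step B can be sketched as a selection between a handful of wire-level operations. The function name `negated_multiple`, the 8-bit width and the sample divisor are hypothetical choices for illustration; the returned `carry_in` is the "+1" that is injected into the carry word rather than added with a real adder.

```python
def negated_multiple(q, d, width):
    """Return (operand, carry_in) with operand + carry_in == -q * d (mod 2**width)."""
    mask = (1 << width) - 1
    if q == 2:
        return ~(d << 1) & mask, 1  # shift, invert, +1 in the carry
    if q == 1:
        return ~d & mask, 1         # invert, +1 in the carry
    if q == 0:
        return 0, 0                 # all zeros
    if q == -1:
        return d & mask, 0          # -(-1)D = D, unchanged
    if q == -2:
        return (d << 1) & mask, 0   # -(-2)D = 2D, shift only
    raise ValueError("q must be in {-2, -1, 0, 1, 2}")


d = 0b00010010  # hypothetical 8-bit divisor pattern
for q in (2, 1, 0, -1, -2):
    operand, carry_in = negated_multiple(q, d, 8)
    print(q, format(operand, "08b"), carry_in)
```

Every branch is an inversion, a fixed shift, or a constant, so none of them depends on the operand width.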

So the total time per iteration is simply $O(1)$. For an $n$-bit answer, we’ll have to run through the loop $\frac{n}{2} = O(n)$ times (recall that two bits are generated per iteration because we’re using a radix 4 division). Now we must add up all of the other steps that occur before and after the main loop:

  • Normalization of arguments at the start: $O(\lg n)$
  • $O(n)$ iterations of the loop at $O(1)$ per iteration: $O(n)$
  • Final subtraction of $q^+ - q^-$ using LCA: $O(\lg n)$
  • Renormalize: $O(\lg n)$

So the total time to perform an $n$-bit division is $O(n)$ when using the SRT algorithm.

[Figure 7 diagram: the same lookup table layout as Figure 4 ($D$ from $1.000\ldots$ to $1.111\ldots$ on the horizontal axis, $R_k$ on the vertical axis, regions $q = 2$ through $q = -2$), with five marked cells.]

Figure 7: The Pentium’s SRT lookup table contains five locations, marked in the figure, that would erroneously return 0 instead of 2.

2.3 The Pentium Division Bug

SRT division is used in Intel’s Pentium processor (as well as most other processors that support division). The problem with the Pentium’s implementation of SRT division is that the lookup table contains a few cells that return incorrect values. The approximate locations of the cells are illustrated in Figure 7. These are all located along the $\tilde{R}_k = \frac{8}{3} \tilde{D}$ line. Because there are many cells in the table that can never be reached, no space is actually allocated for those entries. The table is simply hardwired to return a zero if any of those locations are ever accessed. Apparently, someone at Intel thought that five of the cells would never be accessed and removed them, thus allowing some further optimizations of the table. It turns out that under very special circumstances, these cells can be accessed. At this point, the table should return $q_k = 2$, but a zero is returned instead. Tim Coe (Vitesse Semiconductor Corporation) and Ping Tak Tang (Argonne National Laboratory) published a paper titled “It Takes Six Ones To Reach a Flaw”. In the paper, they provide a proof that shows that the divisor must