






These notes explain the SRT division algorithm, which is used to divide binary numbers in hardware. They also discuss a bug in Intel's Pentium processor implementation of the algorithm, which returns incorrect quotient digits for certain inputs. The notes cover the algorithm in detail, the steps for performing a division with it, and an analysis of the Pentium division bug.
Typology: Study notes







Multiplication by hand in base 10 involves repeatedly multiplying one number by each digit of the second number, and then adding all of the results together. In base 2, multiplication by a single bit is simplified by the fact that a bit has only two values (both of which are trivial to multiply by!). Multiplying an n-bit number by a single bit simply involves n AND gates, as illustrated in Figure 1. To multiply two n-bit numbers together, we perform n different n × 1-bit multiplications in parallel, shift the partial results properly, and add them all together. This is illustrated in Figure 2. The total gate delay is O(1) for the 1-bit multiplies, zero for shifting (each shift is by a constant amount, so only wires are involved), and O(log(n + lg n)) ≈ O(log n) gate delays for adding together n different O(n)-bit numbers (if, for example, a tree of carry-save adders is used). Notice that the final output of an n-bit by n-bit multiply is 2n bits wide.

There are other ways to perform multiplication by using repeated iterative steps. The naive approach is to have a single 1-bit multiplier and, on each cycle, generate one additional partial product. After n such steps, all of the partial products will have been generated. In parallel, the partial products can be added as they are generated with an accumulator. This uses considerably less hardware, but takes much longer to complete the calculation (O(n)).

Whether the partial products are added iteratively or with a Wallace Tree, the number of partial products is largely what determines how fast the multiplication can be performed. To reduce the number of partial products, the Booth algorithm can be used. This is a simple trick that involves recoding the binary number using the digits 0, 1 and $\bar{1}$, where $\bar{1}$ means -1. For example, the number $0011110_2$ is the same as $01000\bar{1}0_2$. Any partial product that corresponds to a digit equal to zero can be skipped. This doesn't help much when a tree of adders is used, but can save many iterations when an iterative method is used.

To perform the encoding, start from the least significant bit. Each time a block of zeros ends and a block of ones starts, a $\bar{1}$ is written down. Each time a block of ones ends and a block of zeros starts, a 1 is written down. If neither condition holds, a zero is written. The following is an example:

\[
\begin{array}{l}
00011000011111\,(0) \quad \leftarrow \text{implicit zero} \\
0010\bar{1}00010000\bar{1}
\end{array}
\]
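The recoding rule can be checked with a short sketch. The function names `booth_recode` and `signed_value` are ours, not from the notes, and the sketch assumes the input has a leading 0 bit so that the recoded digits represent the same unsigned value.

```python
def booth_recode(bits: str) -> list[int]:
    """Recode an MSB-first binary string into digits from {-1, 0, 1}.

    Assumption: the leading bit is 0 (pad with a 0 if needed).
    """
    b = [int(ch) for ch in bits]
    digits = []
    for i, cur in enumerate(b):
        # Bit to the right of position i; an implicit 0 sits right of the LSB.
        right = b[i + 1] if i + 1 < len(b) else 0
        # A run of ones starts here: right=0, cur=1 gives -1.
        # A run of ones ended just to the right: right=1, cur=0 gives +1.
        digits.append(right - cur)
    return digits


def signed_value(digits: list[int]) -> int:
    """Evaluate MSB-first signed binary digits."""
    total = 0
    for d in digits:
        total = 2 * total + d
    return total


recoded = booth_recode("00011000011111")
print(recoded)
print(signed_value(recoded) == int("00011000011111", 2))
```

Only four of the fourteen recoded digits are nonzero, so an iterative multiplier would need four partial products instead of seven.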
Figure 1: A n-bit by 1-bit multiply is achieved by using AND gates.
Figure 2: The n partial products can be computed by n separate 1-bit multiplies. The partial products are then combined with a Wallace Tree and LCA.
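As a rough software analogue of Figures 1 and 2, the sketch below forms each partial product with a bitwise AND and a constant shift. The function name is hypothetical, and Python's `sum` merely stands in for the Wallace Tree and final adder; it says nothing about gate delay.

```python
def multiply_by_partial_products(x: int, y: int, n: int) -> int:
    """Multiply two unsigned n-bit integers via n single-bit partial products."""
    all_ones = (1 << n) - 1
    partials = []
    for i in range(n):
        y_bit = (y >> i) & 1
        # n AND gates: every bit of x is ANDed with the single bit y_i.
        row = x & (all_ones if y_bit else 0)
        # Shifting by a constant amount is only wiring in hardware.
        partials.append(row << i)
    # Stand-in for the Wallace Tree and the final carry-propagating add.
    return sum(partials)


print(multiply_by_partial_products(13, 11, 4))
```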
As before, we start with $R_0 = R$. Then we repeatedly choose $q_i$ such that $R_{i+1} = B(R_i - Dq_i) \in [q_\bot.q_\bot q_\bot q_\bot\ldots,\ q_\top.q_\top q_\top q_\top\ldots]$, where $q_\bot$ and $q_\top$ are the smallest and largest digits in the digit set S. In base 10 long division, $S = \{0, 1, \ldots, 9\}$, and so the range limitation is simply that $10(R_i - Dq_i) \in [0.000\ldots,\ 9.999\ldots]$. For all of our divisions, we are going to assume that both operands have been normalized such that the first bit is a one (i.e. 1.xxxx). Renormalization can be performed at the end of the division to adjust the answer to the correct order of magnitude.

At the very minimum, each step of the long division requires a comparison (O(lg n)), a subtraction (O(lg n)), and a shift (O(1)). With n iterations of this, there will be a delay of at least Ω(n lg n) with this approach. We have also not discussed how the operation of choosing $q_i$ to meet the specified requirements can be implemented efficiently. In any case, this approach to division takes longer than our target of O(n) time.
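To make the iteration concrete, here is a minimal sketch of ordinary base-10 long division written in the same "choose a digit, subtract, multiply by the base" form. It uses exact rational arithmetic, and the selection rule (the largest digit that leaves a non-negative remainder) and the function name are our assumptions for illustration.

```python
from fractions import Fraction


def long_division_digits(R, D, base=10, steps=6):
    """Return the digits q_0, q_1, ... of R/D, read as q_0.q_1 q_2 ..."""
    r, d = Fraction(R), Fraction(D)
    digits = []
    for _ in range(steps):
        q = r // d  # largest digit with r - q*d >= 0
        assert 0 <= q < base, "operands must be scaled so each digit fits"
        digits.append(int(q))
        r = base * (r - q * d)  # R_{i+1} = B(R_i - D*q_i)
    return digits


print(long_division_digits(1, 7))
```

Each pass through the loop needs a comparison (hidden inside `//`) and a full-width subtraction, which is where the per-step O(lg n) cost comes from.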
2.2 SRT Division
SRT Division is named for the people who invented the algorithm. At about the same time, D. Sweeney (IBM), J.E. Robertson (University of Illinois) and T.D. Tocher (Imperial College of London) all independently discovered the algorithm. Before getting into the details, we will first look at the concept of negative digits. In base 10, we normally use the digits $S = \{0, 1, \ldots, 9\}$ to represent numbers, and in this system each number has a unique representation. Alternatively, we can use negative digits as well, such that $S = \{\bar{9}, \bar{8}, \ldots, \bar{1}, 0, 1, \ldots, 8, 9\}$, where $\bar{x} = -x$. Then a number such as 16 can be represented as either $16$ or $2\bar{4}$ (20 + (-4) = 16). This is a redundant representation, which allows for more than one way to represent a number.
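The redundancy is easy to see by enumeration. The sketch below (helper names are ours) evaluates a signed-digit string and lists every two-digit base-10 representation of a value when digits may range from -9 to 9.

```python
def signed_digit_value(digits, base=10):
    """Evaluate MSB-first digits, where each digit may be negative."""
    total = 0
    for d in digits:
        total = total * base + d
    return total


def two_digit_representations(value, base=10):
    """All (high, low) digit pairs in [-(base-1), base-1] that equal value."""
    reps = []
    for hi in range(-(base - 1), base):
        lo = value - hi * base
        if -(base - 1) <= lo <= base - 1:
            reps.append((hi, lo))
    return reps


print(signed_digit_value([2, -4]))
print(two_digit_representations(16))
```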
Radix 4 SRT Division
We use base 4 division (which allows us to process two bits per step), and we use the set of digits $S = \{\bar{2}, \bar{1}, 0, 1, 2\}$.
Algorithm:

    $R_0 := R$
    for $k = 0, 1, \ldots$
        determine $q_k \in S$ such that $R_{k+1} := 4(R_k - q_k D)$ and $|R_{k+1}| \le \frac{8}{3} D$
    end for

\[
q = \frac{R}{D} = \sum_{i=0}^{\infty} \frac{q_i}{4^i}
\]
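The loop can be exercised with exact rational arithmetic. This is only a sketch: the notes have not yet said how $q_k$ is chosen, so the code assumes a simple rule of its own (round $R_k/D$ to the nearest integer and clamp to [-2, 2]), which always satisfies the bound on $R_{k+1}$. The function names are hypothetical.

```python
from fractions import Fraction


def srt_radix4(R, D, steps):
    """Run `steps` iterations; return the digits q_k and the final remainder."""
    r, d = Fraction(R), Fraction(D)
    bound = Fraction(8, 3) * d
    assert d > 0 and abs(r) <= bound
    digits = []
    for _ in range(steps):
        # Assumed selection rule (not the lookup table from the notes):
        # nearest integer to r/d, clamped to the digit set {-2, ..., 2}.
        q = max(-2, min(2, round(r / d)))
        digits.append(q)
        r = 4 * (r - q * d)  # R_{k+1} = 4(R_k - q_k D)
        assert abs(r) <= bound
    return digits, r


def quotient_from_digits(digits):
    """Sum of q_i / 4^i over the digits produced so far."""
    return sum(Fraction(q, 4 ** i) for i, q in enumerate(digits))


R, D = Fraction(3, 2), Fraction(5, 4)  # 1.1 and 1.01 in binary
digits, remainder = srt_radix4(R, D, 12)
print(digits)
print(float(quotient_from_digits(digits)), float(R / D))
```

After k steps the partial sum differs from R/D by exactly $R_k / (4^k D)$, so the error shrinks by a factor of four per iteration.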
This algorithm, as presented, should look pretty much like the abstract division algorithm presented earlier, except with the appropriate constants substituted in.
Theorem: The Radix 4 SRT Algorithm computes $q = \frac{R}{D} = \sum_{i=0}^{\infty} \frac{q_i}{4^i}$.
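Proof:

\begin{align*}
R_{k+1} &= 4(R_k - q_k D) && \text{by definition} \tag{1} \\
\frac{R_{k+1}}{D} &= \frac{4R_k}{D} - 4q_k && \text{divide (1) by } D \tag{2} \\
\frac{R_k}{D} &= q_k + \frac{1}{4}\,\frac{R_{k+1}}{D} && \text{rearrange (2)} \tag{3} \\
\frac{R}{D} &= \frac{R_0}{D} && \text{by definition} \tag{4} \\
&= q_0 + \frac{1}{4}\,\frac{R_1}{D} = q_0 + \frac{q_1}{4} + \frac{1}{4^2}\,\frac{R_2}{D} && \text{expansion of terms} \\
&\ \ \vdots \\
&= \frac{1}{4^k}\,\frac{R_k}{D} + \left( \frac{q_{k-1}}{4^{k-1}} + \cdots + \frac{q_1}{4} + q_0 \right) \tag{5}
\end{align*}

Since $-\frac{8}{3} D \le R_k \le \frac{8}{3} D$, then:

\begin{align*}
\lim_{k \to \infty} \left| \frac{1}{4^k}\,\frac{R_k}{D} \right| &\le \lim_{k \to \infty} \frac{1}{4^k} \cdot \frac{8}{3} = 0 && \text{substitute the bound on } R_k \tag{6} \\
\frac{R}{D} &= \lim_{k \to \infty} \left( \frac{1}{4^k}\,\frac{R_k}{D} + \sum_{i=0}^{k-1} \frac{q_i}{4^i} \right) = \sum_{i=0}^{\infty} \frac{q_i}{4^i}
\end{align*}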
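First question: why in the world do we use a limit of $\frac{8}{3}$?

Recall that in base 10, we required $R_{k+1} < q_\top.q_\top q_\top q_\top\ldots < 10$ (for $q_\top = 9$). In our radix 4 SRT division, $q_\top = 2$. So we have the condition that

\[
R_{k+1} < 2.2222\ldots_4 = 2 \sum_{i=0}^{\infty} \left( \frac{1}{4} \right)^i = 2 \cdot \frac{4}{3} = \frac{8}{3}
\]

Similarly for the lower bound of $-\frac{8}{3}$ using $q_\bot = \bar{2}$.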
Second question: how do we choose the qk’s?
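[Figure: number line of $R_k/D$ with ticks at $-\frac{8}{3}, -\frac{5}{3}, -\frac{4}{3}, -\frac{2}{3}, -\frac{1}{3}, 0, \ldots$, showing the overlapping ranges in which q = -2, q = -1, q = 0, q = 1 and q = 2 may be chosen.]
Figure 3: The redundant representation of numbers allows for more than one choice for the quotient digits.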
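The ranges in Figure 3 follow from the bound in the algorithm: $|4(R_k - q_k D)| \le \frac{8}{3} D$ is the same as $|R_k/D - q_k| \le \frac{2}{3}$. A small sketch (the helper name is ours) lists the digits that are allowed for a given ratio:

```python
from fractions import Fraction


def valid_digits(ratio):
    """Digits q for which the next remainder stays within (8/3) D."""
    ratio = Fraction(ratio)
    return [q for q in (-2, -1, 0, 1, 2) if abs(ratio - q) <= Fraction(2, 3)]


for x in (Fraction(0), Fraction(1, 2), Fraction(1), Fraction(3, 2), Fraction(-5, 3)):
    print(x, valid_digits(x))
```

Ratios between 1/3 and 2/3, for instance, accept either q = 0 or q = 1, which is the slack that the table lookup exploits.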
[Figure: lookup table with axes $D$ (from 1.000... to 1.111...) and $R_k$, divided by the lines $R_k = \pm\frac{1}{3}D$, $\pm\frac{2}{3}D$, $\pm\frac{4}{3}D$, $\pm\frac{5}{3}D$ and $\pm\frac{8}{3}D$ into regions labelled q = 2, q = 1, q = 0, q = -1 and q = -2. Annotations: "Not reachable so long as the $q_k$'s are properly selected" and "Not reachable due to normalization".]
Figure 4: A lookup table for $R_k$ and $D$ can make choosing a suitable $q_k$ only require O(1) time. Certain regions of the table can never be reached.
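[Figure: the interval of $R_k/D$ from $\frac{1}{3}$ to $\frac{2}{3}$, where both q = 0 and q = 1 are allowed, with the cutoff at $\frac{1}{2}$.]
Figure 5: We will choose the midpoint in the overlap region to decide what value to return for $q_k$.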
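[Figure: number line with ticks at $\frac{1}{3}$, $\frac{1}{2}$ and $\frac{2}{3}$, showing the actual $R_k/D$, the approximation $\tilde{R}_k/\tilde{D}$, and an error of $\frac{1}{6}$ between them.]
Figure 6: For a cutoff of $\frac{1}{2}$, an error greater than $\frac{1}{6}$ can cause erroneous quotient digits to be returned.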
entries that are less than 1.000... or greater than 1.111.... With such a table, the choice for a particular $q_k$ can be made in O(1) time. The regions that allow for more than one choice for $q_k$ are darker than the others.

An obvious problem with this approach is that the table would be of infinite size, so it is nowhere near realistic. The solution is to take advantage of the "slack" in what number we choose for $q_k$. In Figure 4, there are regions where we have a choice of which value of $q_k$ to return. Let us choose the midpoint of the overlap region for deciding what to return (Figure 5). For example, for the q = 0/q = 1 overlap region, if $R_k/D > \frac{1}{2}$, then we return $q_k = 1$; otherwise, if $R_k/D \le \frac{1}{2}$, we return $q_k = 0$.

Now, suppose we simply approximate the value of $R_k/D$ by $\tilde{R}_k/\tilde{D}$. If $\tilde{R}_k/\tilde{D}$ is less than $\frac{1}{6}$ away from $R_k/D$, we will still return a correct value of $q_k$. If the error is greater than $\frac{1}{6}$, erroneous values may be returned. An example is illustrated in Figure 6, where the actual value of $R_k/D$ dictates that $q_k = 0$ must be returned (i.e. it is not in the overlap region), but the approximation has a sufficiently large error to result in $q_k = 1$ being returned.

Our approximation for $R_k$ and $D$ is to simply use the first 8 bits of $R_k$ and the first 5 bits of $D$. This results in an error that is less than $\frac{1}{6}$. At the same time, because $\tilde{R}_k$ and $\tilde{D}$ are of a constant size, we can now create a lookup table for all of the possible $\approx 2^5 \cdot 2^8$ inputs. In the Pentium, the cutoff in the overlap regions is not symmetrically located (i.e. not at the midpoint), which allows their implementation to use only 7 bits for $\tilde{R}_k$.

The encoding of a digit $q \in S = \{\bar{2}, \bar{1}, 0, 1, 2\}$ requires three bits. Instead, we will use two numbers to keep the positive and negative portions separate. In base 10, instead of $2\bar{4}$, we would store 20 and 04 as two separate numbers. At the very end, these numbers can be combined by subtracting the negative portion from the positive portion.

The other trick that we will use is that all addition/subtraction through each iteration will be carried out in carry-save form, so we never have to pay the price of a carry propagation. The only exception is when we need to compute $\tilde{R}_k$ and $\tilde{D}$: there we have to perform an addition to get the non-carry-save form of the numbers, but both of these have fixed widths that are independent of the number of bits n in our arguments, and so this takes O(1) time.
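The 1/6 error margin can be checked by brute force. In the sketch below, the cutoffs at ±1/2 come from the midpoint rule described above, while the cutoffs at ±3/2 are our assumption that the same rule is applied to the outer overlap regions. All names are hypothetical.

```python
from fractions import Fraction


def select_digit(estimate):
    """Pick q_k from an estimate of R_k/D using midpoint cutoffs."""
    if estimate > Fraction(3, 2):
        return 2
    if estimate > Fraction(1, 2):
        return 1
    if estimate >= Fraction(-1, 2):
        return 0
    if estimate >= Fraction(-3, 2):
        return -1
    return -2


def is_valid(ratio, q):
    """True if choosing q keeps the next remainder within (8/3) D."""
    return abs(ratio - q) <= Fraction(2, 3)


def small_errors_are_safe(step=Fraction(1, 48)):
    """Check every grid ratio in [-8/3, 8/3] with errors strictly below 1/6."""
    max_err = Fraction(1, 6)
    ratio = Fraction(-8, 3)
    while ratio <= Fraction(8, 3):
        err = -max_err + step
        while err < max_err:
            if not is_valid(ratio, select_digit(ratio + err)):
                return False
            err += step
        ratio += step
    return True


print(small_errors_are_safe())
# A case like Figure 6: the true ratio is below 1/3, the estimate is above 1/2.
true_ratio, error = Fraction(3, 10), Fraction(21, 100)
print(is_valid(true_ratio, select_digit(true_ratio + error)))
```

The grid check is only evidence, not a proof, but it matches the argument in the text: with the cutoff at the midpoint of a region of width 1/3, an estimate that is off by less than 1/6 cannot push the choice outside the region where it is allowed.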
B Compute $-q_k D = -D$. Same as before, negate and add 1 in the carry.

C Add $R + (-qD)$:

    previous partial sum = 1110.1100000001000110011000
    previous carry       = 0010.0000000000000001000100
    -qD                  = 1110.1110000000000000001111

    partial sum          = 0010.0010000001000111010011
    carry                = 1101.1000000000000000011001   (the trailing 1 is from -qD)

D Multiply by 4 (shift bits by 2):

    partial sum          = 0010.0010000001000111010011
    carry                = 1101.1000000000000000011001
    ⇒ new partial sum    = 1000.1000000100011101001100
      new carry          = 0110.0000000000000001100100
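Steps C and D can be replayed in a few lines. This sketch treats each value as a 26-bit two's-complement integer (4 integer bits plus 22 fraction bits, matching the example) and models the "add 1 in the carry" as a carry-in bit; the helper names are ours.

```python
WIDTH = 26  # 4 integer bits + 22 fraction bits, as in the example
MASK = (1 << WIDTH) - 1


def parse(bits: str) -> int:
    """Read a fixed-point bit string such as '1110.1100...' as an integer."""
    return int(bits.replace(".", "").replace(" ", ""), 2)


def carry_save_add(a: int, b: int, c: int, carry_in: int = 0):
    """3-to-2 compression: no carry propagates between bit positions."""
    partial_sum = a ^ b ^ c
    majority = (a & b) | (a & c) | (b & c)
    carry = ((majority << 1) | carry_in) & MASK
    return partial_sum, carry


def times_four(x: int) -> int:
    """Multiply by 4 by shifting two bits, discarding overflow bits."""
    return (x << 2) & MASK


prev_sum = parse("1110.1100000001000110011000")
prev_carry = parse("0010.0000000000000001000100")
minus_qd = parse("1110.1110000000000000001111")  # bitwise complement of D

s, c = carry_save_add(prev_sum, prev_carry, minus_qd, carry_in=1)
new_s, new_c = times_four(s), times_four(c)
print(format(new_s, "026b"), format(new_c, "026b"))

# 8-bit estimate for the next lookup: add the top 8 bits of each word.
estimate = ((new_s >> (WIDTH - 8)) + (new_c >> (WIDTH - 8))) & 0xFF
print(format(estimate, "08b"))
```

Only the 8-bit estimate needs a real carry-propagating addition, which is why its cost does not depend on n.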
Iteration 3:

A $\tilde{R}_k = 1110.1000$, $\tilde{D}$ is the same, and so $q_2 = \bar{1}$. So far, we have $q_{\text{so far}} = 1.1\bar{1}$.

Recall that we stated that the digits of q would be stored as two separate numbers, one for the positive digits and one for the negative digits. So $1.1\bar{1}$ is stored as

                               q0   q1 q2
    q+ = positive digits:      01 . 01 00
    q- = negative digits:      00 . 00 01
    q  = difference:           01 . 00 11
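The two-register encoding can be sketched as follows, packing each radix 4 digit into two bits. The function name is hypothetical.

```python
def split_digits(digits):
    """Pack signed radix-4 digits into separate positive and negative words."""
    q_plus = q_minus = 0
    for q in digits:
        q_plus = (q_plus << 2) | max(q, 0)     # magnitude if the digit is positive
        q_minus = (q_minus << 2) | max(-q, 0)  # magnitude if the digit is negative
    return q_plus, q_minus


q_plus, q_minus = split_digits([1, 1, -1])  # the digits 1.1(-1) from the example
q = q_plus - q_minus                        # one ordinary subtraction at the end
print(format(q_plus, "06b"), format(q_minus, "06b"), format(q, "06b"))
print(q / 4 ** 2)  # two radix-4 digits follow the radix point
```

The final subtraction is the only place where a full-width carry propagation is needed for the quotient.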
We will stop our example at this point since this becomes very tedious very quickly. Let us now analyze the cost for each iteration.
A Compute $\tilde{R}_k$ and perform lookup: O(1)

B Compute $-q_k D$: O(1)

C Use a carry-save add to sum $R_k + (-q_k D)$: O(1)

D Multiply by 4 (shift by 2 bits): O(1)

So the total time per iteration is simply O(1). For an n-bit answer, we will have to run through the loop $\frac{n}{2} = O(n)$ times (recall that two bits are generated per iteration because we are using a radix 4 division). Now we must add up all of the other steps that occur before and after the main loop:
[Figure: the same lookup table over $D$ and $R_k$ as in Figure 4, with the lines $R_k = \pm\frac{1}{3}D$ through $\pm\frac{8}{3}D$ and the regions q = 2, 1, 0, -1, -2, and with the flawed entries marked.]
Figure 7: The Pentium's SRT lookup table contains five locations, marked in the figure, that would erroneously return 0 instead of 2.
So the total time to perform an n-bit division is O(n) when using the SRT algorithm.
2.3 The Pentium Division Bug
SRT division is used in Intel's Pentium processor (as well as most other processors that support division). The problem with the Pentium's implementation of SRT division is that the lookup table contains a few cells that return incorrect values. The approximate locations of the cells are illustrated in Figure 7. These are all located along the $\tilde{R}_k = \frac{8}{3}\tilde{D}$ line. Because there are many cells in the table that can never be reached, no space is actually allocated for those entries. The table is simply hardwired to return a zero if any of those locations is ever accessed. Apparently, someone at Intel thought that five of the cells would never be accessed, and removed them, thus allowing some further optimizations of the table. It turns out that under very special circumstances, these cells can be accessed. At this point, the table should return $q_k = 2$, but a zero is returned instead. Tim Coe (Vitesse Semiconductor Corporation) and Ping Tak Tang (Argonne National Laboratory) published a paper titled "It Takes Six Ones To Reach a Flaw". In the paper, they provide a proof that shows that the divisor must