




























































































Some concepts of Parallel Processing are Anatomy, Cache Access Time, Instruction Formats, Multidimensional Meshes, Network Processors, Snooping Protocol. Main points of this lecture are: Circuit-Level Parallelism, Most Realistic Parallel Computation Model, Sorting and Selection Networks, Search Acceleration Circuits, Arithmetic and Counting Circuits, Fourier Transform Circuits, Stand-Alone Systems, Acceleration Units, Comparisons





























































































Topics in This Part
Chapter 7 Sorting and Selection Networks
Chapter 8A Search Acceleration Circuits
Chapter 8B Arithmetic and Counting Circuits
Chapter 8C Fourier Transform Circuits
Circuit-level specs: most realistic parallel computation model
7.1 What is a Sorting Network?
Fig. 7.1 An n-input sorting network or an n-sorter.
[Figure: an n-sorter block with inputs $x_0, x_1, x_2, \ldots, x_{n-1}$ and outputs $y_0, y_1, y_2, \ldots, y_{n-1}$.]
The outputs are a permutation of the inputs satisfying $y_0 \le y_1 \le \cdots \le y_{n-1}$ (non-descending)
Building Blocks for Sorting Networks
Fig. 7.2 Block diagram and four different schematic representations for a 2-sorter.
[Figure: block diagram of a 2-sorter, with input 0 and input 1 on the left and min and max on the right, alongside alternate schematic representations.]
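Fig. 7.3 Parallel and bit-serial hardware realizations of a 2-sorter.
[Figure, implementation with bit-parallel inputs: a comparator tests "b < a?" on the k-bit inputs a and b, and its result controls two multiplexers that output min(a, b) and max(a, b).]
[Figure, implementation with bit-serial inputs: a and b arrive as MSB-first serial inputs; two RS flip-flops, cleared by Reset, record "a < b?" and "b < a?" and control the multiplexers that output min(a, b) and max(a, b).]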
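A minimal software sketch of the bit-serial 2-sorter may help make the MSB-first idea concrete. It is a behavioral model only, not the circuit of Fig. 7.3: the function names (`to_bits`, `from_bits`, `bit_serial_two_sorter`) and the 4-bit word width in the example are assumptions made for illustration. The two Boolean flags play the role of the "a < b?" and "b < a?" state bits, which latch at the first bit position where the inputs differ.

```python
def to_bits(value, k):
    """Return the k-bit representation of value, most significant bit first."""
    return [(value >> i) & 1 for i in reversed(range(k))]


def from_bits(bits):
    """Convert an MSB-first bit list back to an integer."""
    result = 0
    for bit in bits:
        result = (result << 1) | bit
    return result


def bit_serial_two_sorter(a_bits, b_bits):
    """Stream two MSB-first bit sequences through a 2-sorter, one bit per step."""
    a_less = False  # models the state bit recording "a < b?"
    b_less = False  # models the state bit recording "b < a?"
    min_bits, max_bits = [], []
    for a_bit, b_bit in zip(a_bits, b_bits):
        # Until the first differing bit is seen, neither flag is set.
        if not (a_less or b_less):
            if a_bit < b_bit:
                a_less = True
            elif b_bit < a_bit:
                b_less = True
        # Route the current bits: swap only if b is known to be smaller.
        if b_less:
            min_bits.append(b_bit)
            max_bits.append(a_bit)
        else:
            min_bits.append(a_bit)
            max_bits.append(b_bit)
    return min_bits, max_bits


lo, hi = bit_serial_two_sorter(to_bits(12, 4), to_bits(9, 4))
print(from_bits(lo), from_bits(hi))  # prints: 9 12
```

While the inputs agree, the two bits are equal, so either routing is correct; the decision only matters from the first differing bit onward.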
Elaboration on the Zero-One Principle
Deriving a 0-1 sequence that is not correctly sorted, given an arbitrary sequence that is not correctly sorted:
Let outputs $y_i$ and $y_{i+1}$ be out of order, that is, $y_i > y_{i+1}$
Replace inputs that are strictly less than $y_i$ with 0s and all others with 1s
The resulting 0-1 sequence will not be correctly sorted either
[Figure: a 6-sorter example whose incorrectly sorted output is marked "Invalid".]
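The derivation above can be sketched in a few lines. The code below is an illustration under stated assumptions: a network is modeled as a list of `(i, j)` comparator pairs with `i < j`, and the deliberately faulty 3-input network `FAULTY` is a made-up example, not one from the slides. `zero_one_witness` applies the thresholding step to any input that the network fails to sort.

```python
import itertools


def apply_network(comparators, values):
    """Run values through the comparators; each (i, j) puts the min on line i."""
    v = list(values)
    for i, j in comparators:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v


def zero_one_witness(comparators, values):
    """If values is mis-sorted, return the derived 0-1 input; otherwise None."""
    out = apply_network(comparators, values)
    for i in range(len(out) - 1):
        if out[i] > out[i + 1]:
            threshold = out[i]
            # Inputs strictly less than y_i become 0, all others become 1.
            return [0 if x < threshold else 1 for x in values]
    return None


def is_sorting_network(comparators, n):
    """Zero-one principle: testing all 2**n binary inputs is sufficient."""
    return all(
        apply_network(comparators, bits) == sorted(bits)
        for bits in itertools.product([0, 1], repeat=n)
    )


FAULTY = [(0, 1), (1, 2)]  # hypothetical 3-input network missing a comparator
witness = zero_one_witness(FAULTY, [3, 2, 1])
print(witness, apply_network(FAULTY, witness))  # prints: [1, 1, 0] [1, 0, 1]
print(is_sorting_network(FAULTY, 3))            # prints: False
```

The thresholding works because a comparator network commutes with any monotone mapping of its inputs, and the 0/1 threshold at $y_i$ is such a mapping.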
7.2 Figures of Merit for Sorting Networks
Cost: Number of comparators
In the following example, we have 5 comparators
Delay: Number of levels
The following 4-sorter has 3 comparator levels on its critical path
Cost × Delay
The cost-delay product for this example is 15
Fig. 7.4 Block diagram and schematic representation of a 4-sorter.
[Figure: a 4-sorter with inputs $x_0, x_1, x_2, x_3$ and outputs $y_0, y_1, y_2, y_3$, annotated with a sample trace whose value columns read 3 2 5 1, then 2 3 1 5, then 1 3 2 5, and finally 1 2 3 5.]
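As a rough check of the three figures of merit, the sketch below models a 4-sorter as a list of levels, each holding comparators that act on disjoint lines. The particular comparator layout in `LEVELS` is an assumption (one standard 5-comparator, 3-level arrangement); the slides give only the counts, and the input `[3, 2, 5, 1]` is a sample chosen for illustration.

```python
import itertools

# Assumed layout: comparators grouped by level; (i, j) puts the min on line i.
LEVELS = [
    [(0, 1), (2, 3)],
    [(0, 2), (1, 3)],
    [(1, 2)],
]


def run(levels, values):
    """Apply the network level by level and return the output lines."""
    v = list(values)
    for level in levels:
        for i, j in level:
            if v[i] > v[j]:
                v[i], v[j] = v[j], v[i]
    return v


cost = sum(len(level) for level in LEVELS)  # number of comparators
delay = len(LEVELS)                         # number of levels
print(cost, delay, cost * delay)            # prints: 5 3 15
print(run(LEVELS, [3, 2, 5, 1]))            # prints: [1, 2, 3, 5]

# Zero-one check that the assumed layout really is a 4-sorter.
assert all(
    run(LEVELS, bits) == sorted(bits)
    for bits in itertools.product([0, 1], repeat=4)
)
```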
Delay as a Figure of Merit
Fig. 7.6 Some fast sorting networks.
n = 6, 12 modules, 5 levels
n = 9, 25 modules, 8 levels
n = 10, 31 modules, 7 levels
n = 12, 40 modules, 8 levels
n = 16, 61 modules, 9 levels
Figure annotation: "These 3 comparators constitute one level"
Optimal delay is known for n = 1 to 10: 0, 1, 3, 3, 5, 5, 6, 6, 7, 7
Cost-Delay Product as a Figure of Merit
Fast 10-sorter from Fig. 7. (n = 10, 31 modules, 7 levels): Cost × Delay = 31 × 7 = 217
Low-cost 10-sorter from Fig. 7. (n = 10, 29 modules, 9 levels): Cost × Delay = 29 × 9 = 261
Also shown among the low-cost designs: n = 16, 60 modules, 10 levels
The most cost-effective n -sorter may be neither the fastest design, nor the lowest-cost design
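The comparison can be reproduced with a few lines of arithmetic. The module and level counts below are the ones quoted on the slides; the dictionary layout and names are only an illustrative choice.

```python
# (modules, levels) for designs quoted on the slides
designs = {
    "fast 10-sorter": (31, 7),
    "low-cost 10-sorter": (29, 9),
    "fast 16-sorter": (61, 9),
    "low-cost 16-sorter": (60, 10),
}

products = {name: cost * delay for name, (cost, delay) in designs.items()}
for name, product in products.items():
    print(f"{name}: cost x delay = {product}")
```

For n = 10 the fast design has the smaller product (217 versus 261) despite using two more comparators, which shows that the lowest-cost network need not be the most cost-effective one.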
Insertion Sort and Selection Sort
Fig. 7.8 Sorting network based on insertion sort or selection sort.
[Figure, insertion sort: an (n–1)-sorter on inputs $x_0, \ldots, x_{n-2}$, followed by a chain of comparators that inserts $x_{n-1}$ into the sorted outputs.]
[Figure, selection sort: a chain of comparators that moves the largest input to $y_{n-1}$, followed by an (n–1)-sorter that produces $y_0, \ldots, y_{n-2}$.]
Parallel insertion sort = Parallel selection sort = Parallel bubble sort!
$C(n) = n(n - 1)/2$
$D(n) = 2n - 3$
Cost × Delay $= \Theta(n^3)$
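The cost and delay formulas can be checked by building the network explicitly. This is a sketch under assumptions: the network is generated as an insertion-sort comparator sequence on adjacent lines, and the delay is obtained by scheduling each comparator as early as its two lines allow. The function names are hypothetical.

```python
import itertools


def insertion_sort_network(n):
    """Comparators (j, j+1) that insert line i into the already sorted lines 0..i-1."""
    comparators = []
    for i in range(1, n):
        for j in range(i - 1, -1, -1):
            comparators.append((j, j + 1))
    return comparators


def network_depth(comparators, n):
    """Number of levels when every comparator fires as early as possible."""
    ready = [0] * n  # level after which each line is next available
    for i, j in comparators:
        level = max(ready[i], ready[j]) + 1
        ready[i] = ready[j] = level
    return max(ready, default=0)


def apply_network(comparators, values):
    v = list(values)
    for i, j in comparators:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v


for n in range(2, 9):
    net = insertion_sort_network(n)
    assert len(net) == n * (n - 1) // 2       # C(n) = n(n - 1)/2
    assert network_depth(net, n) == 2 * n - 3  # D(n) = 2n - 3

# Zero-one check for n = 6.
net6 = insertion_sort_network(6)
assert all(
    apply_network(net6, bits) == sorted(bits)
    for bits in itertools.product([0, 1], repeat=6)
)
```

The as-early-as-possible schedule is what turns the sequential insertion sort into the parallel, triangular network with $2n - 3$ levels.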
Theoretically Optimal Sorting Networks
AKS sorting network (Ajtai, Komlos, Szemeredi: 1983)
$O(\log n)$ depth
$O(n \log n)$ size
Unfortunately, AKS networks are not practical owing to the large (4-digit) constant factors involved; improvements since 1983 have not been enough
Note that even for these optimal networks, the delay-cost product is suboptimal; but this is the best we can do
Existing sorting networks have $O(\log^2 n)$ latency and $O(n \log^2 n)$ cost
Given that $\log_2 n$ is only 20 for n = 1 000 000, the latter are more practical
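A back-of-the-envelope comparison shows why a large constant factor matters more than the extra logarithmic factor at realistic sizes. The sketch below is illustrative only: `HYPOTHETICAL_C = 1000` is an assumed stand-in for a "4-digit" constant, not a measured property of any AKS construction, and the practical-network depth uses the formula $\log_2 n (\log_2 n + 1)/2$ that these slides give for Batcher's networks.

```python
import math

HYPOTHETICAL_C = 1000  # assumed stand-in for a "4-digit" constant factor


def log_squared_depth(k):
    """Depth log2(n) * (log2(n) + 1) / 2 for n = 2**k inputs."""
    return k * (k + 1) // 2


def log_depth_with_constant(k, c):
    """Depth c * log2(n) for n = 2**k inputs, with constant factor c."""
    return c * k


k = math.ceil(math.log2(1_000_000))
print(k)                                           # prints: 20
print(log_squared_depth(k))                        # prints: 210
print(log_depth_with_constant(k, HYPOTHETICAL_C))  # prints: 20000

# Smallest k at which the log-squared depth exceeds the c * log depth.
crossover = 1
while log_squared_depth(crossover) <= log_depth_with_constant(crossover, HYPOTHETICAL_C):
    crossover += 1
print(crossover)                                   # prints: 2000
```

Under this assumed constant, the asymptotically better depth would only win for more than $2^{2000}$ inputs, far beyond any practical sorting problem.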
Proof of Batcher’s Even-Odd Merge
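[Figure: a (4, 7)-merger. The first sorted sequence is $x = x_0, \ldots, x_3$ and the second sorted sequence is $y = y_0, \ldots, y_6$. A (2, 4)-merger combines the even-indexed elements into $v_0, \ldots, v_5$, and a (2, 3)-merger combines the odd-indexed elements into $w_0, \ldots, w_4$.]
Use the zero-one principle
Assume: $x$ has $k$ 0s, $y$ has $k'$ 0s
$v$ has $k_{\text{even}} = \lceil k/2 \rceil + \lceil k'/2 \rceil$ 0s
$w$ has $k_{\text{odd}} = \lfloor k/2 \rfloor + \lfloor k'/2 \rfloor$ 0s
Case a: $k_{\text{even}} = k_{\text{odd}}$
v 0 0 0 0 0 0 1 1 1 1 1 1
w 0 0 0 0 0 0 1 1 1 1 1
Case b: $k_{\text{even}} = k_{\text{odd}} + 1$
v 0 0 0 0 0 0 0 1 1 1 1 1
w 0 0 0 0 0 0 1 1 1 1 1
Case c: $k_{\text{even}} = k_{\text{odd}} + 2$
v 0 0 0 0 0 0 0 0 1 1 1 1
w 0 0 0 0 0 0 1 1 1 1 1
Out of order: only in case c does the interleaved sequence $v_0, w_0, v_1, w_1, \ldots$ contain a misplaced pair (a 1 from $w$ just ahead of a 0 from $v$), and the final level of comparators between $w_i$ and $v_{i+1}$ corrects it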
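The merge step and its zero-one proof can be exercised directly. The following is a sketch, assuming a list-based recursive formulation in which the even-indexed elements are merged into `v`, the odd-indexed elements into `w`, and a final level compares each `w[i]` with `v[i + 1]`; the function name `even_odd_merge` and the integer example are illustrative choices.

```python
def even_odd_merge(x, y):
    """Merge two sorted lists using only recursive merges and compare-exchanges."""
    if not x:
        return list(y)
    if not y:
        return list(x)
    if len(x) == 1 and len(y) == 1:
        return [min(x[0], y[0]), max(x[0], y[0])]  # a single 2-sorter
    v = even_odd_merge(x[0::2], y[0::2])  # even-indexed elements
    w = even_odd_merge(x[1::2], y[1::2])  # odd-indexed elements
    out = [v[0]]
    for i in range(len(w)):
        if i + 1 < len(v):
            # Final level: compare-exchange w[i] with v[i + 1].
            out.append(min(w[i], v[i + 1]))
            out.append(max(w[i], v[i + 1]))
        else:
            out.append(w[i])
    if len(v) == len(w) + 2:
        out.append(v[-1])  # v has two extra elements when both lengths are odd
    return out


# Zero-one check for the (4, 7) case: every pair of sorted 0-1 inputs.
for k in range(5):
    for k2 in range(8):
        x = [0] * k + [1] * (4 - k)
        y = [0] * k2 + [1] * (7 - k2)
        assert even_odd_merge(x, y) == sorted(x + y)

print(even_odd_merge([1, 4, 6, 9], [0, 2, 3, 5, 7, 8, 10]))
```

By the zero-one principle, passing all sorted 0-1 input pairs is enough to conclude that the same comparison pattern merges arbitrary sorted sequences of these lengths.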
Batcher’s Even-Odd Merge Sorting
Batcher's $(m, m)$ even-odd merger, for $m$ a power of 2:
$C(m) = 2C(m/2) + m - 1 = (m - 1) + 2(m/2 - 1) + 4(m/4 - 1) + \cdots = m \log_2 m + 1$
$D(m) = D(m/2) + 1 = \log_2 m + 1$
Cost × Delay $= \Theta(m \log^2 m)$
Batcher sorting networks based on the even-odd merge technique:
$C(n) = 2C(n/2) + (n/2)\log_2(n/2) + 1 \approx n(\log_2 n)^2/4$
$D(n) = D(n/2) + \log_2(n/2) + 1 = D(n/2) + \log_2 n = \log_2 n\,(\log_2 n + 1)/2$
Cost × Delay $= \Theta(n \log^4 n)$
Fig. 7.10 The recursive structure of Batcher's even-odd merge sorting network.
[Figure: two n/2-sorters whose outputs feed an (n/2, n/2)-merger.]
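The recursive structure can be turned into a comparator list and measured. The sketch below uses a widely known index-based formulation of Batcher's construction for $n$ a power of 2; the function names are illustrative. The exact count used in the check, $n k (k - 1)/4 + n - 1$ with $k = \log_2 n$, is obtained here by unrolling the cost recurrence with $C(1) = 0$ and is not stated on the slides.

```python
import itertools


def even_odd_merge(lo, hi, r):
    """Comparators merging lines lo..hi (inclusive) taken with stride r."""
    step = r * 2
    if step < hi - lo:
        yield from even_odd_merge(lo, hi, step)
        yield from even_odd_merge(lo + r, hi, step)
        for i in range(lo + r, hi - r, step):
            yield (i, i + r)
    else:
        yield (lo, lo + r)


def even_odd_merge_sort(lo, hi):
    """Comparators sorting lines lo..hi (inclusive); the size is a power of 2."""
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        yield from even_odd_merge_sort(lo, mid)      # first n/2-sorter
        yield from even_odd_merge_sort(mid + 1, hi)  # second n/2-sorter
        yield from even_odd_merge(lo, hi, 1)         # (n/2, n/2)-merger


def network_depth(comparators, n):
    """Number of levels when every comparator fires as early as possible."""
    ready = [0] * n
    for i, j in comparators:
        ready[i] = ready[j] = max(ready[i], ready[j]) + 1
    return max(ready, default=0)


def apply_network(comparators, values):
    v = list(values)
    for i, j in comparators:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v


for k in range(1, 5):
    n = 2 ** k
    net = list(even_odd_merge_sort(0, n - 1))
    print(n, len(net), network_depth(net, n))
    assert len(net) == n * k * (k - 1) // 4 + n - 1

# Zero-one check for n = 8.
net8 = list(even_odd_merge_sort(0, 7))
assert all(
    apply_network(net8, bits) == sorted(bits)
    for bits in itertools.product([0, 1], repeat=8)
)
```

For n = 8 this yields 19 comparators in 6 levels, matching $D(n) = \log_2 n (\log_2 n + 1)/2$.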
Bitonic-Sequence Sorter
Bitonic sequence: one that rises, then falls (for example, 1 3 3 4 6 6 6 2 2 1 0 0); or falls, then rises; or is a rotation of such a sequence (the slide's third example is the previous sequence, right-rotated by 2)
Fig. 14.2 Sorting a bitonic sequence on a linear array.
Shift right half of data to left half (superimpose the two halves)
In each position, keep the smaller value of each pair and ship the larger value to the right
Each half is a bitonic sequence that can be sorted independently
[Figure: a bitonic sequence on positions 0 to n – 1, with its right half shifted onto positions 0 to n/2 – 1.]
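The three steps translate into a short recursive routine. This is a sketch on a Python list rather than a linear array, and it assumes the length is a power of 2 so that every half can be split again; the name `sort_bitonic` and the 8-element input are illustrative, not taken from the slides.

```python
def sort_bitonic(seq):
    """Sort a bitonic sequence whose length is a power of 2."""
    n = len(seq)
    if n <= 1:
        return list(seq)
    half = n // 2
    # Superimpose the two halves; keep the smaller of each pair on the left
    # and send the larger to the right.
    smaller = [min(seq[i], seq[i + half]) for i in range(half)]
    larger = [max(seq[i], seq[i + half]) for i in range(half)]
    # Each half is again bitonic and no element of `smaller` exceeds any
    # element of `larger`, so the halves can be sorted independently.
    return sort_bitonic(smaller) + sort_bitonic(larger)


print(sort_bitonic([1, 3, 4, 6, 6, 2, 1, 0]))  # prints: [0, 1, 1, 2, 3, 4, 6, 6]
```

Rotations of a bitonic sequence are handled by the same routine, since the compare-and-split step does not depend on where the peak sits.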
Batcher’s Bitonic Sorting Networks
Fig. 7.12 The recursive structure of Batcher's bitonic sorting network.
[Figure: two n/2-sorters produce a bitonic sequence, which feeds an n-input bitonic-sequence sorter.]
Fig. 7.13 Batcher's bitonic sorting network for eight inputs.
[Figure: a stage of 2-input sorters, followed by 4-input bitonic-sequence sorters, followed by an 8-input bitonic-sequence sorter.]
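The recursive structure of Fig. 7.12 can be sketched as an in-place routine that also records every compare-exchange it performs. Assumptions in this sketch: the input length is a power of 2, one half is sorted ascending and the other descending so that together they form a bitonic sequence, and the names `bitonic_sort` and `bitonic_merge` are illustrative.

```python
import itertools


def bitonic_merge(a, lo, n, ascending, ops):
    """Sort the bitonic run a[lo:lo+n] in the requested direction."""
    if n > 1:
        half = n // 2
        for i in range(lo, lo + half):
            ops.append((i, i + half))  # record one compare-exchange
            if (a[i] > a[i + half]) == ascending:
                a[i], a[i + half] = a[i + half], a[i]
        bitonic_merge(a, lo, half, ascending, ops)
        bitonic_merge(a, lo + half, half, ascending, ops)


def bitonic_sort(a, lo, n, ascending, ops):
    """Sort a[lo:lo+n]; n must be a power of 2."""
    if n > 1:
        half = n // 2
        bitonic_sort(a, lo, half, True, ops)          # rising half
        bitonic_sort(a, lo + half, half, False, ops)  # falling half
        bitonic_merge(a, lo, n, ascending, ops)       # bitonic-sequence sorter


data = [5, 2, 7, 1, 8, 3, 6, 4]
ops = []
bitonic_sort(data, 0, len(data), True, ops)
print(data)
print(len(ops))  # prints: 24

# Zero-one check for eight inputs.
for bits in itertools.product([0, 1], repeat=8):
    trial = list(bits)
    bitonic_sort(trial, 0, 8, True, [])
    assert trial == sorted(bits)
```

For eight inputs the routine performs 24 compare-exchanges, that is, 6 levels of 4 comparators each, corresponding to the 2-input, 4-input, and 8-input stages of Fig. 7.13.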