Circuit-Level Parallelism - Parallel Processing - Lecture Slides, Slides of Parallel Computing and Programming

Some concepts of Parallel Processing are Anatomy, Cache Access Time, Instruction Formats, Multidimensional Meshes, Network Processors, Snooping Protocol. Main points of this lecture are: Circuit-Level Parallelism, Most Realistic Parallel Computation Model, Sorting and Selection Networks, Search Acceleration Circuits, Arithmetic and Counting Circuits, Fourier Transform Circuits, Stand-Alone Systems, Acceleration Units, Comparisons

Typology: Slides

2012/2013

Uploaded on 04/30/2013

devank
Part II
Circuit-Level Parallelism


Topics in This Part

  • Chapter 7: Sorting and Selection Networks
  • Chapter 8A: Search Acceleration Circuits
  • Chapter 8B: Arithmetic and Counting Circuits
  • Chapter 8C: Fourier Transform Circuits

Circuit-level specs: most realistic parallel computation model

  • Concrete circuit model; incorporates hardware details
  • Allows realistic speed and cost comparisons
  • Useful for stand-alone systems or acceleration units

7.1 What is a Sorting Network?

Fig. 7.1 An n-input sorting network or an n-sorter.

[Figure: inputs $x_0, x_1, x_2, \dots, x_{n-1}$ enter the n-sorter; outputs $y_0, y_1, y_2, \dots, y_{n-1}$ leave it.]

The outputs are a permutation of the inputs satisfying $y_0 \le y_1 \le \dots \le y_{n-1}$ (non-descending).

Fig. 7.2 Block diagram and four different schematic representations for a 2-sorter.

[Figure: block diagram of a 2-sorter with input 0 and input 1 on one side and the min and max outputs on the other, followed by four alternate schematic representations.]

Building Blocks for Sorting Networks


Fig. 7.3 Parallel and bit-serial hardware realizations of a 2-sorter.

  • Implementation with bit-parallel inputs: a comparator tests b < a? on the k-bit inputs a and b, and its result steers two multiplexers that deliver min(a, b) and max(a, b).
  • Implementation with bit-serial inputs: a and b arrive as MSB-first serial inputs; two RS flip-flops, cleared by Reset, record a < b? and b < a? and steer the multiplexers that deliver min(a, b) and max(a, b).
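The bit-serial realization can be sketched in Python. This is only a behavioral model, assuming both operands arrive MSB-first as equal-length bit lists; the two Boolean flags stand in for the flip-flops that record a < b? and b < a?, and the function names and list representation are illustrative choices, not part of the slides.

```python
def bit_serial_2sorter(a_bits, b_bits):
    """Simulate a bit-serial 2-sorter on MSB-first bit lists of equal length.

    Returns (min_bits, max_bits), both MSB-first.
    """
    if len(a_bits) != len(b_bits):
        raise ValueError("operands must have the same number of bits")
    a_less = False  # set once a is known to be smaller (a < b?)
    b_less = False  # set once b is known to be smaller (b < a?)
    min_bits, max_bits = [], []
    for a, b in zip(a_bits, b_bits):
        # The first position where the bits differ decides the order.
        if not a_less and not b_less and a != b:
            if a < b:
                a_less = True
            else:
                b_less = True
        # While undecided the two bits are equal, so either routing is correct.
        if b_less:
            min_bits.append(b)
            max_bits.append(a)
        else:
            min_bits.append(a)
            max_bits.append(b)
    return min_bits, max_bits


def to_bits(value, width):
    """Return the MSB-first bit list of a non-negative integer."""
    return [(value >> i) & 1 for i in reversed(range(width))]


def from_bits(bits):
    """Return the integer encoded by an MSB-first bit list."""
    result = 0
    for bit in bits:
        result = (result << 1) | bit
    return result
```

For example, `bit_serial_2sorter(to_bits(6, 4), to_bits(3, 4))` yields the bit lists of 3 and 6. The design point is that the order is fixed at the first differing bit, so no later bit can change the routing.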

Elaboration on the Zero-One Principle

Deriving a 0-1 sequence that is not correctly sorted, given an arbitrary sequence that is not correctly sorted.

Let outputs $y_i$ and $y_{i+1}$ be out of order, that is, $y_i > y_{i+1}$.

Replace inputs that are strictly less than $y_i$ with 0s and all others with 1s.

The resulting 0-1 sequence will not be correctly sorted either

[Figure: a 6-sorter example whose output ordering is marked invalid.]
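The derivation above can be tried out in a few lines of Python. This is a minimal sketch, assuming a network is given as a list of comparators (i, j) with i < j that place the smaller value on wire i; the `broken` network and the helper names are hypothetical and chosen only for illustration.

```python
from itertools import product


def apply_network(network, values):
    """Run a comparator network; each (i, j) with i < j puts the min on wire i."""
    wires = list(values)
    for i, j in network:
        if wires[i] > wires[j]:
            wires[i], wires[j] = wires[j], wires[i]
    return wires


def is_sorted(seq):
    return all(seq[k] <= seq[k + 1] for k in range(len(seq) - 1))


def zero_one_counterexample(network, values):
    """Turn an arbitrary failing input into a failing 0-1 input, or return None."""
    out = apply_network(network, values)
    for k in range(len(out) - 1):
        if out[k] > out[k + 1]:
            threshold = out[k]  # this is y_i in the argument above
            return [0 if v < threshold else 1 for v in values]
    return None


def sorts_all_zero_one_inputs(network, n):
    """Exhaustive check over all 2**n inputs made of 0s and 1s."""
    return all(is_sorted(apply_network(network, bits))
               for bits in product((0, 1), repeat=n))


# Hypothetical 4-input network with one comparator missing, so it is not a sorter.
broken = [(0, 1), (2, 3), (0, 2), (1, 3)]
```

With the input 2, 3, 1, 5 the broken network leaves 3 ahead of 2, and `zero_one_counterexample` returns the 0-1 input 0, 1, 0, 1, which the network also fails to sort. Appending the comparator (1, 2) makes the exhaustive 0-1 check pass.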

7.2 Figures of Merit for Sorting Networks

Delay: Number of levels

Cost: Number of comparators

Cost × Delay

[Figure: a 4-sorter with inputs $x_0$ to $x_3$ and outputs $y_0$ to $y_3$, annotated with sample values that leave the network as 1, 2, 3, 5.]

The 4-sorter in this example has 5 comparators and 3 comparator levels on its critical path, so its cost-delay product is 5 × 3 = 15.

Fig. 7.4 Block diagram and schematic representation of a 4-sorter.
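The two figures of merit can be computed mechanically from a comparator list. The sketch below assumes a standard 5-comparator wiring for the 4-sorter, which may not match the drawing in Fig. 7.4 wire for wire; `cost_and_delay` is an illustrative helper name.

```python
def cost_and_delay(network, n):
    """Return (number of comparators, number of levels) of a comparator network."""
    level_of_wire = [0] * n  # level of the last comparator touching each wire
    for i, j in network:
        # A comparator sits one level after the later of its two inputs.
        level = max(level_of_wire[i], level_of_wire[j]) + 1
        level_of_wire[i] = level_of_wire[j] = level
    return len(network), max(level_of_wire, default=0)


# Assumed wiring: a standard 5-comparator 4-sorter; Fig. 7.4 may be drawn differently.
four_sorter = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]
cost, delay = cost_and_delay(four_sorter, 4)
print(cost, delay, cost * delay)  # 5 3 15
```

Counting levels per wire, rather than per comparator in list order, is what lets independent comparators share a level.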

Delay as a Figure of Merit

Fig. 7.6 Some fast sorting networks.

  • n = 6: 12 modules, 5 levels
  • n = 9: 25 modules, 8 levels
  • n = 10: 31 modules, 7 levels
  • n = 12: 40 modules, 8 levels
  • n = 16: 61 modules, 9 levels

Optimal delay is known for n = 1 to 10: 0, 1, 3, 3, 5, 5, 6, 6, 7, 7

These 3 comparators constitute one level

Cost-Delay Product as a Figure of Merit

Low-cost sorting networks:

  • n = 10: 29 modules, 9 levels
  • n = 16: 60 modules, 10 levels

Fast sorting networks (Fig. 7.6):

  • n = 10: 31 modules, 7 levels
  • n = 16: 61 modules, 9 levels

Low-cost 10-sorter from Fig. 7. (29 modules, 9 levels): Cost × Delay = 29 × 9 = 261

Fast 10-sorter from Fig. 7.6 (31 modules, 7 levels): Cost × Delay = 31 × 7 = 217

The most cost-effective n -sorter may be neither the fastest design, nor the lowest-cost design
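The comparison is plain arithmetic and can be tabulated as follows. The module and level counts are the ones quoted on the slides for the 10- and 16-input designs; the labels and the `designs` list are just an illustrative way of organizing them.

```python
# (label, n, modules, levels) as quoted on the slides
designs = [
    ("low-cost", 10, 29, 9),
    ("fast", 10, 31, 7),
    ("low-cost", 16, 60, 10),
    ("fast", 16, 61, 9),
]


def cost_delay(modules, levels):
    return modules * levels


best = {}  # n -> (label, smallest cost-delay product seen so far)
for label, n, modules, levels in designs:
    product = cost_delay(modules, levels)
    print(f"n={n:2d} {label:8s} {modules} x {levels} = {product}")
    if n not in best or product < best[n][1]:
        best[n] = (label, product)
```

For these two sizes the fast design happens to have the smaller product (217 versus 261, and 549 versus 600), even though it uses more comparators.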

Insertion Sort and Selection Sort

Fig. 7.8 Sorting network based on insertion sort or selection sort.

[Figure: two recursive constructions. For insertion sort, an (n–1)-sorter handles inputs $x_0$ to $x_{n-2}$ and a chain of comparators then inserts $x_{n-1}$. For selection sort, a chain of comparators first moves the largest input to output $y_{n-1}$ and an (n–1)-sorter handles the remaining values.]

Parallel insertion sort = Parallel selection sort = Parallel bubble sort!

$C(n) = n(n-1)/2$

$D(n) = 2n - 3$

Cost × Delay = $\Theta(n^3)$
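The cost and delay formulas can be checked by generating the network and scheduling each comparator as early as possible. This is a sketch under the assumption that the insertion-sort network is unrolled into adjacent-wire comparators; both function names are illustrative.

```python
def insertion_sort_network(n):
    """Comparators (j-1, j) of the insertion-sort network on n wires."""
    network = []
    for last in range(1, n):
        # Insert input `last` into the sorted prefix by bubbling it upward.
        for j in range(last, 0, -1):
            network.append((j - 1, j))
    return network


def cost_and_delay(network, n):
    """Return (comparator count, level count) with as-early-as-possible levels."""
    level_of_wire = [0] * n
    for i, j in network:
        level = max(level_of_wire[i], level_of_wire[j]) + 1
        level_of_wire[i] = level_of_wire[j] = level
    return len(network), max(level_of_wire, default=0)


for n in range(2, 9):
    cost, delay = cost_and_delay(insertion_sort_network(n), n)
    print(n, cost, delay)  # cost = n(n-1)/2, delay = 2n-3
```

Packing the comparators into levels is what turns the sequential insertion sort into the parallel network whose delay is 2n – 3 rather than n(n – 1)/2.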

Theoretically Optimal Sorting Networks

AKS sorting network (Ajtai, Komlos, Szemeredi: 1983)


O(log n ) depth

O( n log n ) size

Unfortunately, AKS networks are not practical owing to the large (4-digit) constant factors involved; improvements since 1983 have not been enough

Note that even for these optimal networks, delay-cost product is suboptimal; but this is the best we can do

Existing sorting networks have $O(\log^2 n)$ latency and $O(n \log^2 n)$ cost

Given that $\log_2 n$ is only 20 for n = 1 000 000, the latter are more practical
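A rough calculation shows why the extra logarithmic factor matters less than the constant. The sketch below assumes a constant factor of 1000 for the AKS depth, chosen only because it is the smallest 4-digit number (the slides do not give a value), and uses the Batcher depth $\log_2 n(\log_2 n + 1)/2$ that appears later in these slides.

```python
import math

ASSUMED_AKS_CONSTANT = 1000  # illustrative only: the smallest "4-digit" factor


def aks_depth_estimate(n, c=ASSUMED_AKS_CONSTANT):
    """Depth c * log2(n) under the assumed constant c."""
    return c * math.log2(n)


def batcher_depth(n):
    """Depth k(k+1)/2 with k = ceil(log2 n)."""
    k = math.ceil(math.log2(n))
    return k * (k + 1) // 2


n = 1_000_000
print(batcher_depth(n))       # 210 levels for k = 20
print(aks_depth_estimate(n))  # roughly 20 000 under the assumed constant
```

Under this assumption the logarithmic-depth network would need about a hundred times as many levels at n = 1 000 000.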

Proof of Batcher’s Even-Odd Merge

[Figure: the first sorted sequence x ($x_0$ to $x_3$) and the second sorted sequence y ($y_0$ to $y_6$) are merged. A (2, 4)-merger combines the even-indexed elements into $v_0$ to $v_5$, and a (2, 3)-merger combines the odd-indexed elements into $w_0$ to $w_4$.]

Use the zero-one principle

Assume: x has k 0s and y has k′ 0s

v has $k_{even} = \lceil k/2 \rceil + \lceil k'/2 \rceil$ 0s

w has $k_{odd} = \lfloor k/2 \rfloor + \lfloor k'/2 \rfloor$ 0s

Case a: $k_{even} = k_{odd}$
  v  0 0 0 0 0 0 1 1 1 1 1 1
  w  0 0 0 0 0 0 1 1 1 1 1

Case b: $k_{even} = k_{odd} + 1$
  v  0 0 0 0 0 0 0 1 1 1 1 1
  w  0 0 0 0 0 0 1 1 1 1 1

Case c: $k_{even} = k_{odd} + 2$ (interleaving v and w leaves one pair out of order)
  v  0 0 0 0 0 0 0 0 1 1 1 1
  w  0 0 0 0 0 0 1 1 1 1 1
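The merge and its zero-one argument can be exercised with a short recursive sketch. It assumes the usual final level of the even-odd merger, in which $w_i$ is compared with $v_{i+1}$ after interleaving; the function name and the handling of unequal lengths are illustrative choices rather than something spelled out on the slides.

```python
def even_odd_merge(x, y):
    """Merge two sorted lists with Batcher's even-odd scheme."""
    if not x:
        return list(y)
    if not y:
        return list(x)
    if len(x) == 1 and len(y) == 1:
        return [min(x[0], y[0]), max(x[0], y[0])]
    v = even_odd_merge(x[0::2], y[0::2])  # even-indexed elements
    w = even_odd_merge(x[1::2], y[1::2])  # odd-indexed elements
    out = [v[0]]
    pairs = min(len(w), len(v) - 1)
    for i in range(pairs):
        # Final level: one comparator between w[i] and v[i + 1].
        lo, hi = sorted((w[i], v[i + 1]))
        out += [lo, hi]
    # At most one element is left over; it is already in its final place.
    out += w[pairs:] + v[pairs + 1:]
    return out


# All sorted 0-1 inputs for the (4, 7) case drawn in the figure.
for k in range(5):
    for k_prime in range(8):
        x = [0] * k + [1] * (4 - k)
        y = [0] * k_prime + [1] * (7 - k_prime)
        assert even_odd_merge(x, y) == sorted(x + y)
```

The single comparator level after interleaving is enough because, as cases a to c show, v can hold at most two more 0s than w.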

Batcher’s Even-Odd Merge Sorting

Batcher’s (m, m) even-odd merger, for m a power of 2:

$C(m) = 2C(m/2) + m - 1 = (m - 1) + 2(m/2 - 1) + 4(m/4 - 1) + \dots = m \log_2 m + 1$

$D(m) = D(m/2) + 1 = \log_2 m + 1$

Cost × Delay = $\Theta(m \log^2 m)$

Batcher sorting networks based on the even-odd merge technique:

$C(n) = 2C(n/2) + (n/2)\log_2(n/2) + 1 \approx n(\log_2 n)^2/4$

$D(n) = D(n/2) + \log_2(n/2) + 1 = D(n/2) + \log_2 n = \log_2 n\,(\log_2 n + 1)/2$

Cost × Delay = $\Theta(n \log^4 n)$
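The recurrences can be evaluated directly to see the exact comparator and level counts for small sizes. The base cases are assumptions made for this sketch: a (1, 1)-merger is taken to be one comparator with one level, and a 1-sorter to cost nothing.

```python
def merger_cost(m):
    """C(m) for an (m, m) even-odd merger, m a power of 2."""
    return 1 if m == 1 else 2 * merger_cost(m // 2) + m - 1


def merger_delay(m):
    """D(m) for an (m, m) even-odd merger, m a power of 2."""
    return 1 if m == 1 else merger_delay(m // 2) + 1


def sorter_cost(n):
    """C(n) for the even-odd merge sorter: two n/2-sorters plus one merger."""
    return 0 if n == 1 else 2 * sorter_cost(n // 2) + merger_cost(n // 2)


def sorter_delay(n):
    """D(n) for the even-odd merge sorter, n a power of 2."""
    return 0 if n == 1 else sorter_delay(n // 2) + merger_delay(n // 2)


for k in range(1, 6):
    n = 2 ** k
    print(n, sorter_cost(n), sorter_delay(n))
```

For n = 4 this gives 5 comparators in 3 levels, matching the 4-sorter used earlier as the figures-of-merit example.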

[Figure: two n/2-sorters feed an (n/2, n/2)-merger.]

Fig. 7.10 The recursive structure of Batcher’s even-odd merge sorting network.

Bitonic-Sequence Sorter

Fig. 14.2 Sorting a bitonic sequence on a linear array.

  • Shift the right half of the data to the left half (superimpose the two halves)
  • In each position, keep the smaller value of each pair and ship the larger value to the right
  • Each half is a bitonic sequence that can be sorted independently

[Figure: a bitonic sequence in positions 0 to n – 1 and, below it, the same array with the right half (from position n/2) shifted onto the left half.]

Bitonic sequence: a sequence that rises, then falls (for example 1 3 3 4 6 6 6 2 2 1 0 0), or falls, then rises, or is a rotation of such a sequence (for example, a falls-then-rises sequence right-rotated by 2).
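The three steps for sorting a bitonic sequence translate into a short recursive sketch. It assumes the length is a power of 2, so that the two halves can always be superimposed; the 8-element sequence in the usage note is an illustrative example, not one taken from the slides.

```python
def sort_bitonic(seq):
    """Sort a bitonic sequence whose length is a power of 2 (ascending)."""
    n = len(seq)
    if n <= 1:
        return list(seq)
    half = n // 2
    left, right = [], []
    # Superimpose the halves: keep the smaller value, ship the larger right.
    for a, b in zip(seq[:half], seq[half:]):
        left.append(min(a, b))
        right.append(max(a, b))
    # Each half is again bitonic and can be sorted independently.
    return sort_bitonic(left) + sort_bitonic(right)
```

For example, `sort_bitonic([1, 3, 4, 6, 6, 2, 1, 0])` returns `[0, 1, 1, 2, 3, 4, 6, 6]`, and any rotation of that input gives the same result.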

Batcher’s Bitonic Sorting Networks

Fig. 7.12 The recursive structure of Batcher’s bitonic sorting network.

[Figure: two n/2-sorters produce a bitonic sequence, which feeds an n-input bitonic-sequence sorter.]

Fig. 7.13 Batcher’s bitonic sorting network for eight inputs.

[Figure labels: 2-input sorters, 4-input bitonic-sequence sorters, 8-input bitonic-sequence sorter.]
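The recursive structure of Fig. 7.12 can be sketched as follows. One assumption differs from the drawing: instead of wiring the second n/2-sorter's outputs in reverse order, this sketch sorts the second half in descending order, which produces the same bitonic sequence. The function names are illustrative.

```python
def bitonic_merge(seq, ascending):
    """Sort a bitonic sequence (length a power of 2) in the given direction."""
    n = len(seq)
    if n <= 1:
        return list(seq)
    half = n // 2
    low = [min(a, b) for a, b in zip(seq[:half], seq[half:])]
    high = [max(a, b) for a, b in zip(seq[:half], seq[half:])]
    if ascending:
        return bitonic_merge(low, True) + bitonic_merge(high, True)
    return bitonic_merge(high, False) + bitonic_merge(low, False)


def bitonic_sort(seq, ascending=True):
    """Batcher-style bitonic sort for inputs whose length is a power of 2."""
    n = len(seq)
    if n <= 1:
        return list(seq)
    if n & (n - 1):
        raise ValueError("length must be a power of 2")
    half = n // 2
    first = bitonic_sort(seq[:half], True)    # rising half
    second = bitonic_sort(seq[half:], False)  # falling half
    # first + second rises, then falls, so it is bitonic.
    return bitonic_merge(first + second, ascending)
```

For eight inputs the recursion unfolds into the stages of Fig. 7.13: 2-input sorters, then 4-input bitonic-sequence sorters, then one 8-input bitonic-sequence sorter.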