




These notes cover sorting on GPUs, focusing on the parallel bitonic sort algorithm and on hashing with stencil routing. They review the background of sorting algorithms, their classification, and specific algorithms such as insertion sort, merge sort, bubble sort, heap sort, and radix sort. They then turn to sorting networks and parallel methods, including bitonic sort and odd-even transposition sort, and conclude with the mapping of bitonic sort to GPUs and hashing on the GPU using stencil routing.





Sorting is a fundamental algorithmic building block, and ordering a list of items efficiently is one of the most studied problems in computer science. Buck and Purcell showed how the parallel bitonic merge sort algorithm can exploit many of the parallel features of the GPU's SIMD architecture. Efficient sorting matters in practice for optimizing algorithms that require sorted lists to work correctly. Moreover, reading data back from the GPU to the CPU is expensive, so sorting the data directly on the GPU is preferable for many graphics algorithms.
Computer graphics needs sorting in many places. For example, a renderer may sort objects by depth value so that they are drawn in the correct order on the screen. The so-called painter’s algorithm sorts the objects or polygons from back to front and rasterizes them in that order. In Lecture 3 we studied “depth peeling”, a way of extracting layers from a scene in depth-sorted order. Particle systems also need to be sorted by distance from the viewer: the sorting data textures contain the particle-viewer distance and the index of the particle, so that the particles can be rendered in the correct order. In physical simulation, sorting is needed to insert the participating objects into spatial structures for collision detection.
Sorting algorithms can be divided into two categories: data-driven and data-independent. First we briefly review the standard data-driven algorithms:
2.1 Insertion Sort Insertion sort works the way many people sort a hand of playing cards: as the name suggests, it inserts each item into its proper place in the final list. Insertion sort has a worst-case running time of O(n^2).
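The card-sorting idea can be sketched as follows. This is a minimal illustration, not taken from the notes; the function name `insertion_sort` is an assumption, and the sketch returns a new list rather than sorting in place.

```python
def insertion_sort(items):
    """Return a sorted copy of items using insertion sort (illustrative sketch)."""
    result = list(items)
    for i in range(1, len(result)):
        key = result[i]
        j = i - 1
        # Shift every larger element one slot to the right,
        # opening a gap at the position where key belongs.
        while j >= 0 and result[j] > key:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = key
    return result
```

The nested loops are where the quadratic cost comes from: on a reversed input, each new item is shifted past every item already placed.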
2.2 Merge Sort Merge sort splits the list to be sorted into two equal halves and places them in separate arrays. Each array is recursively sorted, and the two are then merged back together to form the final sorted list. The algorithm follows the divide-and-conquer approach: divide the problem into several subproblems that are similar to the original problem but smaller in size, solve the subproblems recursively, and then combine these solutions to create a solution to the original problem.
We can derive the following recurrence for the worst-case running time T(n) of merge sort:
$$T(n) = \begin{cases} \Theta(1) & \text{if } n = 1 \\ 2\,T(n/2) + \Theta(n) & \text{if } n > 1 \end{cases}$$
Using the second case of the Master theorem, we find that merge sort has an algorithmic complexity of O(n log n). A major drawback of merge sort is that it needs at least twice the memory of the other sorts: unlike insertion sort or quicksort, it does not sort in place.
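A minimal sketch of the split, recurse, and merge steps, assuming a list input; the name `merge_sort` is illustrative. The temporary lists `left`, `right`, and `merged` are the extra memory mentioned above.

```python
def merge_sort(items):
    """Return a sorted copy of items using top-down merge sort (illustrative sketch)."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid])    # sort each half recursively
    right = merge_sort(items[mid:])
    merged = []
    i = j = 0
    # Merge: repeatedly take the smaller head element of the two halves.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])           # one of these two tails is empty
    merged.extend(right[j:])
    return merged
```

Each level of recursion does linear merging work, and there are about log n levels, which is what the recurrence expresses.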
2.5 Quick Sort
Quicksort sorts by employing a divide and conquer strategy to divide a list into two sublists.
The base cases of the recursion are lists of size one, which are always sorted. The algorithm always terminates because it puts at least one element in its final place on each iteration. The most crucial concern in a quicksort implementation is choosing a good pivot element. A naïve implementation, for example one that always picks the first element as the pivot, will be terribly inefficient for certain inputs. If the input is already sorted, a common practical case, such an implementation degenerates into a selection sort with O(n^2) running time. A common choice is to pick the pivot index at random, typically using a pseudorandom number generator. If the numbers are truly random, it can be proven that the resulting algorithm, called randomized quicksort, runs in an expected time of O(n log n).
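The random pivot choice can be sketched as below. This is an illustrative version that builds new lists instead of partitioning in place, so it trades memory for clarity; the function name and the `rng` parameter are assumptions made for this example.

```python
import random


def randomized_quicksort(items, rng=random):
    """Return a sorted copy of items, choosing each pivot at random (sketch)."""
    if len(items) <= 1:
        return list(items)            # base case: already sorted
    pivot = items[rng.randrange(len(items))]
    # Three-way split so that duplicates of the pivot are handled in one step.
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return randomized_quicksort(smaller, rng) + equal + randomized_quicksort(larger, rng)
```

Because the pivot no longer depends on the input order, an already sorted list is no worse than any other input on average.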
2.6 Radix Sort
Radix sort is a fast stable sorting algorithm which can be used to sort items that are identified by unique keys. Every key is a string or number, and radix sort sorts these keys in some particular lexicographic-like order. The algorithm operates in O( n · k ) time, where n is the number of items, and k is the average key length.
Example (170, 45, 75, 90, 2, 24, 802, 66):
sorting by least significant digit (1s place) gives: 170, 90, 2, 802, 24, 45, 75, 66
sorting by next digit (10s place) gives: 2, 802, 24, 45, 66, 170, 75, 90
sorting by most significant digit (100s place) gives: 2, 24, 45, 66, 75, 90, 170, 802
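The passes in this example can be reproduced with a small least-significant-digit radix sort. The sketch assumes non-negative integer keys and base 10; the generator name `radix_sort_passes` is illustrative.

```python
def radix_sort_passes(keys, base=10):
    """Yield the key list after each least-significant-digit pass (sketch).

    Assumes non-negative integer keys.
    """
    keys = list(keys)
    if not keys:
        return
    place = 1
    while max(keys) // place > 0:
        buckets = [[] for _ in range(base)]
        for key in keys:
            # Appending preserves the previous order within a bucket,
            # which is what makes each pass stable.
            buckets[(key // place) % base].append(key)
        keys = [key for bucket in buckets for key in bucket]
        yield list(keys)
        place *= base
```

Calling `list(radix_sort_passes([170, 45, 75, 90, 2, 24, 802, 66]))` gives the three intermediate lists of the example, one per digit. Stability is essential: the 10s pass only works because the 1s pass order is kept among keys with equal 10s digits.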
Data-independent methods do not exhibit the input-dependent variation in running time that data-dependent methods do. They can be represented as a sorting network.
3.1 Bitonic sequence A 0-1-sequence is called bitonic if it contains at most two changes between 0 and 1, i.e. if there exist subsequence lengths k, m ∈ {1, ..., n} such that
$a_0, \dots, a_{k-1} = 0,\quad a_k, \dots, a_{m-1} = 1,\quad a_m, \dots, a_{n-1} = 0$
or
$a_0, \dots, a_{k-1} = 1,\quad a_k, \dots, a_{m-1} = 0,\quad a_m, \dots, a_{n-1} = 1$
Examples: 00000 , 111111 , 0001110000, 111000111
More generally, a sequence of numbers is a bitonic sequence if it has at most one local maximum or one local minimum.
Examples:
1, 2, 3, 4, 5
10, 6, 5, 3, 1
3, 7, 9, 8, 6, 5, 4, 1
10, 8, 6, 9, 12, 15
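The definition can be checked mechanically by counting how often the sequence switches between rising and falling. The sketch below uses the non-cyclic definition given in these notes (some texts also accept cyclic rotations); the name `is_bitonic` is illustrative.

```python
def is_bitonic(seq):
    """Return True if seq rises then falls, or falls then rises, at most once (sketch)."""
    # Direction of each step between neighbours, ignoring equal neighbours.
    signs = [1 if b > a else -1 for a, b in zip(seq, seq[1:]) if a != b]
    # A bitonic sequence changes direction at most once.
    changes = sum(1 for s, t in zip(signs, signs[1:]) if s != t)
    return changes <= 1
```

A monotone sequence has zero direction changes and therefore also counts as bitonic, as in the first two examples.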
3.2 Bitonic sort The bitonic sort takes a bitonic sequence as its input. It forms a binary split of the elements: each element in the first half is compared with its partner in the second half, and the two values are exchanged if necessary. It continues this process recursively on each half and combines the results at the end.
Example: 24 20 15 9 4 2 5 8 10 11 12 13 22 30 32 45
Result after Binary-split: 10 11 12 9 4 2 5 8 24 20 15 13 22 30 32 45
Notice that: a) Each element in the first half is smaller than each element in the second half b) Each half is a bitonic list of length n/2.
If you keep applying the BINARY-SPLIT to each half repeatedly:
10 11 12 9 | 4 2 5 8 || 24 20 15 13 | 22 30 32 45
4 2 | 5 8 | 10 11 | 12 9 || 22 20 | 15 13 | 24 30 | 32 45
4 2 | 5 8 | 10 9 | 12 11 || 15 13 | 22 20 | 24 30 | 32 45
Sorted: 2 4 | 5 8 | 9 10 | 11 12 || 13 15 | 20 22 | 24 30 | 32 45
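The repeated binary split can be sketched recursively. The sketch assumes the input is already bitonic and that its length is a power of two; the names `binary_split` and `bitonic_merge` are illustrative.

```python
def binary_split(seq):
    """Compare element i with element i + n/2; return (smaller values, larger values)."""
    half = len(seq) // 2
    low = [min(seq[i], seq[i + half]) for i in range(half)]
    high = [max(seq[i], seq[i + half]) for i in range(half)]
    return low, high


def bitonic_merge(seq):
    """Sort a bitonic sequence whose length is a power of two (sketch)."""
    if len(seq) <= 1:
        return list(seq)
    low, high = binary_split(seq)
    # Both halves are bitonic again, and every element of low is
    # no larger than every element of high, so sort them independently.
    return bitonic_merge(low) + bitonic_merge(high)
```

Applied to the 16-element example, `binary_split` returns the two halves shown in the first result row, and `bitonic_merge` returns the fully sorted list. The comparisons made do not depend on the data values, only on positions, which is what makes the method data-independent.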
By rotating the normal bitonic sort network 90 degrees to the right, we obtain the layout shown in the accompanying figure for sorting a 1D array of n keys. Observe that in each pass there are always groups of items that are treated alike, and that neighboring groups are strongly related: they sometimes have parameters with opposite values. We therefore draw several quads per sorting pass, each exactly covering a pair of groups, so that together they cover the entire buffer. The right side of each quad performs the operation opposite to that of the left side; the parameters specified in the vertex program are linearly interpolated by the rasterizer over the fragments. If we want to sort many items, it is efficient to store them in a 2D texture. The two remaining tasks are to decide which compare operation to use and to locate the partner item to compare with.
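The two per-item decisions, which compare operation to use and where the partner sits, can be illustrated with a CPU simulation of the sorting passes. This is a sketch under stated assumptions, not the GPU Gems implementation: it works on a 1D list whose length is a power of two, uses the common XOR formulation for the partner index, and models each pass as a read from the previous buffer, much as a fragment program reads from a source texture. The name `bitonic_sort_network` is illustrative.

```python
def bitonic_sort_network(keys):
    """Sort keys with a bitonic sorting network, one full pass at a time (sketch)."""
    data = list(keys)
    n = len(data)
    if n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    k = 2                               # size of the sequences being merged
    while k <= n:
        j = k // 2                      # distance to the partner in this pass
        while j >= 1:
            prev = list(data)           # previous pass output (source buffer)
            for i in range(n):          # every item is handled independently
                partner = i ^ j         # locate the partner item
                ascending = (i & k) == 0
                # The lower index keeps the minimum in an ascending group
                # and the maximum in a descending group.
                take_min = (i < partner) == ascending
                a, b = prev[i], prev[partner]
                data[i] = min(a, b) if take_min else max(a, b)
            j //= 2
        k *= 2
    return data
```

Each iteration of the inner `for` loop depends only on the previous buffer, so all n items of a pass could be processed in parallel. Unlike the recursive merge, this version accepts arbitrary input because the early passes build the bitonic sequences that later passes merge.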
The final optimization is to generalize the sorter to work on key/index pairs. Because the GPU processes four-vectors, we can pack two key/index pairs into one fragment. This optimization halves the row width and thus halves the number of fragments as well.
GPU Gems Table 46-1 : Performance of the CPU and GPU sorting Algorithms
GPU Gems Figure 46-4: Grouping keys for the Bitonic merge sort
In Photon Mapping on Programmable Graphics Hardware, Purcell et al. devised a method of organizing photons into grid cells. Normally this is a many-to-one routing problem, as there may be multiple photons to store in each cell. However, if there is a limit on the maximum number of photons stored per cell, we can preallocate the storage for each cell. By knowing the texture footprint of each cell in advance, we reduce the problem to a variant of one-to-one routing.
Purcell’s main idea was to draw each photon as a large, fat point (glPointSize in OpenGL) covering the entire footprint of its destination cell. The stencil buffer then routes each photon to a unique destination within that footprint. Each set of m × m pixels contains at most m × m photons. The stencil buffer controls where each photon is written within its grid cell by allowing at most one of the m × m fragments to pass for each drawn photon. The stencil buffer is initialized so that each grid cell region contains an increasing pattern. The stencil test is set to write on equal to m^2 - 1, and to always increment. Each time a photon is drawn, the stencil buffer lets only one fragment pass, every stencil value in the region covering the grid cell is incremented, and the next photon is written to a different location in the grid cell. This allows efficient routing of up to the first m^2 photons to each grid cell, and the method can be completed in a single pass.
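The routing rule can be illustrated with a CPU simulation of a single grid cell. This is a sketch under assumptions that go beyond the notes: the increasing pattern is taken to be 0 to m^2 - 1, stencil values are plain integers that never wrap or clamp, and the name `stencil_route` is hypothetical.

```python
def stencil_route(photons, m):
    """Simulate stencil routing of photons into one grid cell of m * m pixels (sketch)."""
    size = m * m
    stencil = list(range(size))         # assumed increasing pattern 0 .. m^2 - 1
    pixels = [None] * size              # storage preallocated for the cell
    for photon in photons:              # each photon covers the whole footprint
        for p in range(size):
            if stencil[p] == size - 1:  # stencil test: pass on equal to m^2 - 1
                pixels[p] = photon      # exactly one fragment is written
            stencil[p] += 1             # stencil op: always increment
    return pixels
```

With m = 2, the first photon lands in the last pixel, the next one in the pixel before it, and so on; once four photons have been stored, no stencil value equals 3 any more, so further photons are discarded.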
(note that a lot of the diagrams and specific algorithms above come from GPU Gems and CLRS)
Purcell Figure 5: Building the photon map with stencil routing. For this example, grid cells can hold up to four photons, and photons are rendered as 2 × 2 points. Photons are transformed by a vertex program to the proper grid cell. In (a), a photon is rendered to a grid cell, but because there is no stencil masking the fragment write, it is stored in all entries in the grid cell. In (b) and (c), the stencil buffer controls the destination written to by each photon.