Shared-Memory Sample Sort Algorithm: Implementing Efficient Parallel Sorting on XMT

This assignment covers the shared-memory sample sort algorithm, a parallel sorting technique designed for shared memory machines. The goal is to provide a randomized sorting algorithm that runs efficiently on XMT. The handout contains the problem statement, an overview of the algorithm, and hints for implementation. Students are required to write both a parallel and a serial sorting program for the XMT architecture.


HW3: Shared-Memory Sample Sort

Course: CMSC751/ENEE759, Spring 2009
Title: Shared-Memory Sample Sort
Date Assigned: March 10th, 2009
Date Due: March 24th, 2009
Contact: Alex Tzannes - [email protected]

1 Assignment Goal

The goal of this assignment is to provide a randomized sorting algorithm that runs efficiently on XMT. While you are allowed some flexibility as to what serial sorting algorithms to use for different steps of the parallel algorithm, you should try to find and select the most efficient one for each case. The Sample Sort algorithm follows a "decomposition first" pattern and is widely used on multiprocessor architectures. Being a randomized algorithm, its running time depends on the output of a random number generator. Sample Sort performs well on very large arrays, with high probability.

In this assignment, we propose implementing a variation of the Sample Sort algorithm that performs well on shared memory parallel architectures such as XMT.

2 Problem Statement

The Shared Memory Sample Sort algorithm is an implementation of Sample Sort for shared memory machines. The idea behind Sample Sort is to find a set of $p-1$ elements from the array, called splitters, which partition the $N$ input elements into $p$ groups $\mathit{set}_0, \ldots, \mathit{set}_{p-1}$. In particular, every element in $\mathit{set}_i$ is smaller than every element in $\mathit{set}_{i+1}$. The partitioned sets are then sorted independently.

The input is an unsorted array A. The output is returned in array Result. Let p be the number of processors. We will assume, without loss of generality, that N is divisible by p. An overview of the Shared Memory Sample Sort algorithm is as follows:

Step 1. In parallel, a set $S$ of $s \times p$ random elements from the original array $A$ is collected, where $p$ is the number of TCUs available and $s$ is called the oversampling ratio. Sort the array $S$, using an algorithm that performs well for the size of $S$. Select a set of $p-1$ evenly spaced elements from it into $S'$:

$$S' = \{S[s], S[2s], \ldots, S[(p-1) \times s]\}$$

These elements are the splitters that are used below to partition the elements of $A$ into $p$ sets (or partitions) $\mathit{set}_i$, $0 \le i < p$. Since $S'$ holds $p-1$ splitters, indexed $0$ to $p-2$, the sets are $\mathit{set}_0 = \{A[i] \mid A[i] < S'[0]\}$, $\mathit{set}_1 = \{A[i] \mid S'[0] \le A[i] < S'[1]\}$, $\ldots$, $\mathit{set}_{p-1} = \{A[i] \mid S'[p-2] \le A[i]\}$.
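As a rough illustration of Step 1, here is a minimal serial sketch in plain C (not XMTC). The function name select_splitters and its parameters are hypothetical and not part of the handout. It assumes the pregenerated random integers are non-negative, and it uses the standard library qsort to sort the sample, which is only one possible choice of algorithm.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparison callback for qsort: ascending order of ints. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/*
 * Hypothetical helper sketching Step 1 serially.
 * Draws s*p samples from A using pregenerated random integers, sorts them,
 * and stores the p-1 splitters S[s], S[2s], ..., S[(p-1)*s] in splitters[].
 * sample[] needs room for s*p ints, splitters[] for p-1 ints, and rnd[]
 * must hold at least s*p non-negative values (an assumption of this sketch).
 */
void select_splitters(const int *A, int n, int p, int s,
                      const int *rnd, int *sample, int *splitters)
{
    int m = s * p;
    assert(n > 0 && p >= 2 && s >= 1);

    for (int k = 0; k < m; k++)          /* independent picks; parallel on XMT */
        sample[k] = A[rnd[k] % n];

    qsort(sample, (size_t)m, sizeof sample[0], cmp_int);

    for (int k = 1; k < p; k++)          /* every s-th sample becomes a splitter */
        splitters[k - 1] = sample[k * s];
}
```

A larger oversampling ratio s makes the splitters more representative of A, at the price of a larger sample to sort.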

Step 2. Consider the input array $A$ divided into $p$ subarrays, $B_0 = A[0, \ldots, (N/p)-1]$, $B_1 = A[N/p, \ldots, 2(N/p)-1]$, etc. The $i$th TCU iterates through subarray $B_i$ and for each element executes a binary search on the array of splitters $S'$, for a total of $N/p$ binary searches per TCU. The following quantities are computed:

Figure 1: The C matrix built in Step 2.

  • $c_{i,j}$ - the number of elements from $B_i$ that belong in partition $\mathit{set}_j$. The $c_{i,j}$ make up the matrix $C$ shown in Figure 1.
  • $\mathit{partition}_k$ - the partition (i.e. the $\mathit{set}_i$) in which element $A[k]$ belongs. Each element is tagged with such an index.
  • $\mathit{serial}_k$ - the number of elements in $B_i$ that belong in $\mathit{set}_{\mathit{partition}_k}$ but are located before $A[k]$ in $B_i$.

For example, if $B_0 = [105, 101, 99, 205, 75, 14]$ and we have $S' = [100, 150, \ldots]$ as splitters, we will have $c_{0,0} = 3$, $c_{0,1} = 2$ etc., $\mathit{partition}_0 = 1$, $\mathit{partition}_2 = 0$ etc. and $\mathit{serial}_0 = 0$, $\mathit{serial}_1 = 1$, $\mathit{serial}_5 = 2$.
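The work of a single TCU in Step 2 can be sketched in plain C as follows. The names find_partition and classify_subarray are hypothetical, and the code is serial C rather than XMTC. The binary search returns the number of splitters that are less than or equal to the value, which matches the set definitions of Step 1.

```c
#include <assert.h>

/*
 * Returns the index of the partition that value v belongs to: the number of
 * splitters that are <= v. With nsplit = p-1 splitters the result is in 0..p-1.
 */
int find_partition(const int *splitters, int nsplit, int v)
{
    int lo = 0, hi = nsplit;             /* the answer lies in [lo, hi] */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (splitters[mid] <= v)
            lo = mid + 1;                /* v is at or past this splitter */
        else
            hi = mid;
    }
    return lo;
}

/*
 * Hypothetical helper: the work of one TCU for a subarray B of length len.
 * c_row[j] receives c_{i,j}, partition[k] the set index of B[k], and
 * serial[k] the number of earlier elements of B that fall in the same set.
 */
void classify_subarray(const int *B, int len, const int *splitters, int p,
                       int *c_row, int *partition, int *serial)
{
    assert(p >= 1);
    for (int j = 0; j < p; j++)
        c_row[j] = 0;
    for (int k = 0; k < len; k++) {
        int j = find_partition(splitters, p - 1, B[k]);
        partition[k] = j;
        serial[k] = c_row[j]++;          /* the running count doubles as the rank */
    }
}
```

Running classify_subarray on the example subarray with only the two splitters 100 and 150 (that is, assuming p = 3) reproduces the values quoted above: c_row starts with 3 and 2, partition[0] is 1, partition[2] is 0, and serial[5] is 2.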

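Step 3. Compute prefix-sums $\mathit{ps}_{i,j}$ for each column of the matrix $C$. For example, $\mathit{ps}_{0,j}, \mathit{ps}_{1,j}, \ldots, \mathit{ps}_{p-1,j}$ are the prefix-sums of $c_{0,j}, c_{1,j}, \ldots, c_{p-1,j}$. The prefix-sums are exclusive: $\mathit{ps}_{i,j}$ counts the elements of $\mathit{set}_j$ located in $B_0, \ldots, B_{i-1}$, which is what the position formula in Step 4 requires. Also compute the sum of column $i$, which is stored in $\mathit{sum}_i$. Compute the (exclusive) prefix sums of $\mathit{sum}_0, \ldots, \mathit{sum}_{p-1}$ into $\mathit{global\_ps}_0, \ldots, \mathit{global\_ps}_{p-1}$ and the total of all $\mathit{sum}_i$ into $\mathit{global\_ps}_p$. This definition of $\mathit{global\_ps}$ turns out to be a programming convenience.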
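A minimal serial sketch of Step 3 in plain C is given below. The function name column_prefix_sums and the row-major storage of the matrices are assumptions made for illustration. The prefix sums are computed as exclusive sums, which is the reading that makes the position formula of Step 4 and the segment bounds of Step 5 work out.

```c
#include <assert.h>

/*
 * Hypothetical helper sketching Step 3 serially.
 * C and ps are p-by-p matrices stored row-major (entry (i,j) at i*p + j);
 * global_ps has p+1 entries.
 */
void column_prefix_sums(const int *C, int p, int *ps, int *global_ps)
{
    assert(p >= 1);
    global_ps[0] = 0;
    for (int j = 0; j < p; j++) {
        int running = 0;                 /* elements of set_j seen so far */
        for (int i = 0; i < p; i++) {
            ps[i * p + j] = running;     /* count in B_0 .. B_{i-1} only */
            running += C[i * p + j];
        }
        /* running now equals sum_j, the total size of set_j */
        global_ps[j + 1] = global_ps[j] + running;
    }
}
```

With this layout, global_ps[j] is the index in Result where set j starts and global_ps[p] equals N.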

Step 4. Each TCU $i$ computes, for each element $A[j]$ in segment $B_i$, $i \cdot \frac{N}{p} \le j < (i+1) \cdot \frac{N}{p}$:

$$\mathit{pos}_j = \mathit{global\_ps}_{\mathit{partition}_j} + \mathit{ps}_{i,\mathit{partition}_j} + \mathit{serial}_j$$

Copy $\mathit{Result}[\mathit{pos}_j] = A[j]$.
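The position computation of Step 4 can be sketched serially in plain C as shown here. The function name scatter is hypothetical, ps is assumed to be stored row-major, and partition and serial are assumed to be indexed by the global position j in A. The outer loop stands in for the p TCUs.

```c
#include <assert.h>

/*
 * Hypothetical helper sketching Step 4 serially.
 * ps is a p-by-p matrix stored row-major; global_ps has p+1 entries;
 * partition[] and serial[] have one entry per element of A.
 */
void scatter(const int *A, int n, int p, const int *partition,
             const int *serial, const int *ps, const int *global_ps,
             int *Result)
{
    assert(p >= 1 && n % p == 0);
    int len = n / p;
    for (int i = 0; i < p; i++) {                    /* one TCU per segment */
        for (int j = i * len; j < (i + 1) * len; j++) {
            int part = partition[j];
            /* start of the set + elements from earlier segments + rank */
            int pos = global_ps[part] + ps[i * p + part] + serial[j];
            Result[pos] = A[j];
        }
    }
}
```

Because every element receives a distinct position, the writes to Result never collide, so the iterations need no synchronization.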

Step 5. TCU $i$ executes a (serial) sorting algorithm on the elements of $\mathit{set}_i$, which are now stored in $\mathit{Result}[\mathit{global\_ps}_i, \ldots, \mathit{global\_ps}_{i+1} - 1]$.

At the end of Step 5, the elements of A are stored in sorted order in Result.
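For Step 5, one possible serial sorting routine is heapsort, sketched here in plain C. The names sift_down, heapsort_range and sort_partitions are hypothetical. The sketch uses helper functions for readability; in actual XMTC parallel code the sorting logic may have to be written inline.

```c
#include <assert.h>

/* Restores the max-heap property below index root in a[0..n-1]. */
static void sift_down(int *a, int root, int n)
{
    for (;;) {
        int child = 2 * root + 1;
        if (child >= n)
            return;
        if (child + 1 < n && a[child + 1] > a[child])
            child++;                     /* pick the larger child */
        if (a[root] >= a[child])
            return;
        int t = a[root]; a[root] = a[child]; a[child] = t;
        root = child;
    }
}

/* In-place heapsort of a[0..n-1]; iterative, so it needs no recursion. */
void heapsort_range(int *a, int n)
{
    for (int i = n / 2 - 1; i >= 0; i--)        /* build a max-heap */
        sift_down(a, i, n);
    for (int end = n - 1; end > 0; end--) {
        int t = a[0]; a[0] = a[end]; a[end] = t; /* move the max into place */
        sift_down(a, 0, end);
    }
}

/* Hypothetical helper: sorts each set in Result; global_ps has p+1 entries. */
void sort_partitions(int *Result, int p, const int *global_ps)
{
    assert(p >= 1);
    for (int i = 0; i < p; i++)                  /* one TCU per set on XMT */
        heapsort_range(Result + global_ps[i],
                       global_ps[i + 1] - global_ps[i]);
}
```

Since the sets occupy disjoint ranges of Result, the p sorts are independent of each other.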

4.2 Input Format

The input is provided as an array of integers A.

#define N           The number of elements to sort.
int A[N]            The array to sort.
int s               The oversampling ratio.
#define NTCU        The number of TCUs to be used for sorting.
#define NRAND       The number of random values in the RANDOM array.
int RANDOM[NRAND]   An array with pregenerated random integers.
int result[N]       To store the result of the sorting.

You can declare any number of global arrays and variables in your program as needed. The number of elements in the arrays (N) is declared as a constant in each dataset, and you can use it to declare auxiliary arrays. For example, this is valid XMTC code:

#define N 16384

int temp1[16384];
int temp2[2*N];
int pointer;

int main() {
    //...
}

4.3 Data sets

Run all your programs (serial and parallel) using the data files given in the following table. You can directly include the header file into your XMTC code with #include or you can include the header file with the compiler option -include.

Dataset   N      NTCU   Header File        Binary file
d1        256    8      data/d1/ssort.h    data/d1/ssort.xbo
d2        4096   8      data/d2/ssort.h    data/d2/ssort.xbo
d3        128k   64     data/d3/ssort.h    data/d3/ssort.xbo

4.4 Compiling and Executing

You can compile the parallel program using the following command line for the small dataset (d1):

xmtcc -include ../data/d1/ssort.h ../data/d1/ssort.xbo ssort.p.c -o ssort.p

If the program compiles correctly, a file called ssort.p.b will be created. This is the binary executable you will run on the FPGA using the following command:

xmtfpga ssort.p.b

5 Output

The array has to be sorted in increasing order. The array result should hold the array of sorted values.

Prepare and fill in the following table in a text file named table.txt in the doc directory. Remove any printf statements from your code while taking these measurements: printf statements increase the clock count, so measurements taken with them may not reflect the actual time and work done.

Dataset                      d1    d2    d3
Parallel sort clock cycles
Serial sort clock cycles

Note that part of the grading criteria is the performance of your parallel implementation on the largest dataset (d3), so you should try to obtain the fastest running parallel program. As a guideline, for the largest dataset (d3) our Serial Sort runs in 45526102 cycles and our Parallel Sample Sort runs in 9047152 cycles (speedup ∼5x) on the FPGA computer.

5.1 Submission

Submission through the make utility (make submit) is required. Make sure that you have the correct files at the correct locations (src and doc directories) using the make submitcheck command. Run the following commands to submit the assignment:

$ make submitcheck
$ make submit

5.2 Discussion about Serial Sorting Algorithms

In this assignment you need a serial sorting algorithm in three different places: first, when you implement the serial sort itself to compare against your parallel implementation; second, within the sample sort algorithm to sort the array of samples S; and third, to sort the p partitions in parallel. Choosing the right serial sorting algorithm is therefore very important. The discussion below should guide you and limit your search space when looking for the best serial algorithms to use with sample sort.

Table 1: Cycle counts for different serial sorting algorithms and for sample sort using different sorting algorithms

Dataset               d1       d2          d3
Serial(QS)            50302    1002756     45526102
Serial(HS)            64376    1562200     103129058
Serial(BS)            327350   96523199    timeout
Serial(BS+check)      340158   100349982   timeout
Sample Sort (QS/HS)   59011    1501593     9047152
Sample Sort (HS/HS)   59819    1502359     9101561
Sample Sort (QS/BS)   150381   83490620    timeout

Table 1 compares the performance of four serial sorting algorithms, as well as the performance of sample sort using some combinations of these algorithms. The serial algorithms are quicksort (QS), heapsort (HS), bubble sort (BS) and bubble sort with termination check (BS+check)¹. The notation “Sample Sort (XX/YY)” indicates the parallel sample sort algorithm using the serial sorting algorithm XX in Step 1 to sort array S and the serial sorting algorithm YY in Step 5. The table shows that the fastest serial algorithm of the ones compared is quicksort, heapsort comes second, and bubble sort is too slow to get a cycle count for the largest dataset. Quicksort, however, is naturally implemented using recursive function calls. For that reason it was not used for Step 5 (the QS/QS configuration was not implemented), since function calls are currently not supported in parallel code. Students have been able to implement a non-recursive version of quicksort for Step 5, which gave improved performance.
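One way to write quicksort without recursion is to keep the pending subranges on an explicit stack. The sketch below is plain C and is not the version used for the measurements in Table 1. The function name quicksort_iterative, the middle-element pivot and the fixed stack size are all assumptions made for illustration.

```c
#include <assert.h>

/* Sorts a[0..n-1] in increasing order using an explicit stack of ranges. */
void quicksort_iterative(int *a, int n)
{
    int lo_stack[64], hi_stack[64];  /* depth stays below log2(n), see below */
    int top = 0;

    assert(n >= 0);
    if (n < 2)
        return;
    lo_stack[top] = 0;
    hi_stack[top] = n - 1;
    top++;

    while (top > 0) {
        top--;
        int lo = lo_stack[top], hi = hi_stack[top];
        while (lo < hi) {
            int pivot = a[lo + (hi - lo) / 2];
            int i = lo, j = hi;
            while (i <= j) {                 /* partition around the pivot */
                while (a[i] < pivot) i++;
                while (a[j] > pivot) j--;
                if (i <= j) {
                    int t = a[i]; a[i] = a[j]; a[j] = t;
                    i++;
                    j--;
                }
            }
            /* Now a[lo..j] <= pivot <= a[i..hi]. Defer the larger part and
               continue with the smaller one, which bounds the stack depth. */
            if (j - lo < hi - i) {
                if (i < hi) { lo_stack[top] = i; hi_stack[top] = hi; top++; }
                hi = j;
            } else {
                if (lo < j) { lo_stack[top] = lo; hi_stack[top] = j; top++; }
                lo = i;
            }
        }
    }
}
```

Always continuing with the smaller part means each deferred range is less than half the size of the one before it, so 64 stack entries are more than enough for any array whose length fits in an int.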

¹ After each of the N passes over the input array A[N], the algorithm checks whether any swaps occurred. If not, it terminates early.
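The termination check described in the footnote can be sketched in plain C as follows. The function name bubble_sort_checked and its return value (the number of passes performed) are hypothetical additions for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Bubble sort with a termination check: stops after the first pass that
 * performs no swaps. Returns the number of passes performed.
 */
int bubble_sort_checked(int *a, int n)
{
    int passes = 0;
    assert(n >= 0);
    for (int pass = 0; pass < n; pass++) {
        bool swapped = false;
        /* after each pass the largest remaining element is in place */
        for (int k = 0; k + 1 < n - pass; k++) {
            if (a[k] > a[k + 1]) {
                int t = a[k]; a[k] = a[k + 1]; a[k + 1] = t;
                swapped = true;
            }
        }
        passes++;
        if (!swapped)
            break;                       /* already sorted: stop early */
    }
    return passes;
}
```

On an already sorted array the function returns after a single pass. On inputs far from sorted the check saves little, which is consistent with BS+check being slightly slower than BS in Table 1.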