External Sorting
Jingke Li
Portland State University
Jingke Li (Portland State University) CS 415/515 External Sorting 1 / 35
External Sorting
Initial data and final results are both stored on disks.
External sorting algorithms typically consist of two phases:
- The first phase produces a set of files containing half-processed data.
- The second phase processes these files to produce a totally ordered permutation of the input data.
Two General Categories
- Distribution-Based Sorting:
  - The first phase partitions the input data into k disjoint buckets such that the elements in each bucket precede the elements in all later buckets.
  - In the second phase, each bucket is sorted independently.
- Merge-Based Sorting:
  - The first phase partitions the input data into chunks of approximately equal size, sorts each chunk in main memory, and writes the resulting “runs” to disk.
  - The second phase merges the runs in main memory and writes the sorted output to disk.
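The merge-based scheme can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the records are plain integers, runs are stored as text files in the system temporary directory, and the names `external_merge_sort` and `run_size` are made up for this sketch.

```python
import heapq
import os
import tempfile


def external_merge_sort(values, run_size):
    """Yield the integers in `values` in sorted order using bounded memory.

    Phase 1 cuts the input into chunks of at most `run_size` items, sorts
    each chunk in memory, and writes it to a temporary file as a run.
    Phase 2 streams the runs back and merges them lazily.
    """
    run_paths = []
    chunk = []

    def flush():
        # Sort the in-memory chunk and write it to disk as one run.
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in chunk)
        run_paths.append(path)
        chunk.clear()

    for v in values:
        chunk.append(v)
        if len(chunk) == run_size:
            flush()
    if chunk:
        flush()

    files = [open(p) for p in run_paths]
    try:
        streams = [(int(line) for line in f) for f in files]
        # heapq.merge holds only one pending item per run in memory.
        yield from heapq.merge(*streams)
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
```

For example, `list(external_merge_sort([5, 3, 9, 1, 7, 2, 8], run_size=3))` creates three runs and merges them into one sorted list.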
Performance Issues
- Disk I/O bandwidth: often the performance bottleneck. Try to minimize the number of passes over the data. (For large data sets, 2 is the minimum.)
  A simple rule: make each Phase 1 partition small enough to fit in main memory, so that a two-pass algorithm is possible.
- Balancing the time needed by the two phases: smaller run sizes make phase 1 run faster, but may slow down phase 2.
- Maximizing overlap between I/O and computation: use disk prefetching, etc.
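A rough way to check whether two passes suffice for a merge-based sort: if phase 1 produces memory-sized runs and phase 2 needs one I/O buffer per run plus one output buffer, then the number of runs that can be merged at once bounds the input size. The function below is a back-of-the-envelope sketch under exactly those assumptions; the name `max_two_pass_bytes` and its parameters are illustrative.

```python
def max_two_pass_bytes(memory_bytes, buffer_bytes):
    """Estimate the largest input a two-pass merge sort can handle.

    Assumes each run is as large as main memory and that the merge phase
    keeps one input buffer per run plus one output buffer in memory.
    """
    fan_in = memory_bytes // buffer_bytes - 1  # one buffer reserved for output
    return max(fan_in, 0) * memory_bytes


MB = 2 ** 20
capacity = max_two_pass_bytes(256 * MB, 1 * MB)
print(capacity // MB, "MB")  # 255 runs of 256 MB each
```

With 256MB of memory and 1MB buffers, the estimate is 255 runs of 256MB, i.e. about 64GB of input. Smaller buffers raise the limit but make each disk transfer less efficient.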
Datamation Winners
year  system         #cpu  #disk  time(s)
1986  Tandem         2     2      3600
1987  Tandem         3     6      980
1988  Cray           1     1+ 1   28
1990  Sequent        8     4      83
1992  Intel iPSC/2   32    32     58
1993  DEC Alpha      1     16     9.
1994  DEC Alpha      3     28     7.
1996  SGI Challenge  12    96     4.
1997  SUN Sparc      32    64     2.
1999  DELL NT        16x2  ?      1.
2000  HP Xeon        4     32     .
2001  Intel P3       32x2  32x5   .
Case Study: AlphaSort
(Sorting on Shared-Everything SMPs)
Hardware: An SMP with commodity components
- 1-3 DEC Alpha AXP 3000/ processors
- 256MB memory (memory bus 640MB/s)
- 10-28 disks (I/O bus 4 x 100MB/s)
Case Study: AlphaSort (cont.)
Software:
- DEC OpenVMS
- Threads over shared memory:
  - threads allow overlapping between I/O and computing
  - locks are used to ensure proper access to shared data
Algorithms:
- One-Node Version — Focusing on overlapping between I/O and computing
- Parallel Version — Enhanced from the one-node version
One-Node AlphaSort Algorithm Sketch
- Divide one million records into 30-50 groups.
- Read groups one by one from disk.
- As each group becomes available, use a separate thread to sort it into a run using quicksort.
- Merge the runs using a replacement-selection tree.
- Gather records into contiguous buffer and write to disk.
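The overall shape of these steps can be sketched with a thread pool. This is a toy model, not AlphaSort itself: the "disk" is an in-memory list, Python's built-in `sorted` stands in for quicksort, `heapq.merge` stands in for the replacement-selection tree, and the names `read_groups` and `one_node_sort` are invented for the sketch.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def read_groups(records, group_size):
    """Stand-in for the disk reader: yield the input one group at a time."""
    for start in range(0, len(records), group_size):
        yield records[start:start + group_size]


def one_node_sort(records, group_size, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Hand each group to a worker thread as soon as the reader produces
        # it, so reading the next group can overlap with sorting earlier ones.
        futures = [pool.submit(sorted, group)
                   for group in read_groups(records, group_size)]
        runs = [f.result() for f in futures]
    # Merge the sorted runs and gather the output into one contiguous list.
    return list(heapq.merge(*runs))
```

For example, `one_node_sort([9, 4, 7, 1, 8, 2, 6, 3, 5], group_size=3)` sorts three groups of three records and merges them. In CPython the threads mainly illustrate the structure; real overlap comes from blocking I/O in the reader.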
Key Optimizations:
- Multiple threads — overlapping I/O and internal sorting.
- Disk striping — to allow parallel reads and writes. (hardware: 3 controllers and 28 disks; software: a file striping layer)
- Lots of memory — 256-384MB for sorting 95MB data; this affords quicksort for internal sorting
Parallel AlphaSort Algorithm Sketch
- Each thread requests affinity to a processor.
- Master thread responsible for all file operations; worker threads do sorting and memory-intensive operations.
- Master reads records; workers quicksort the data groups.
- Master merges the runs; workers gather records into contiguous buffers.
- Master writes sorted buffers to disk.
Programming Issues:
- Balancing workload:
  - Time to read 1MB = Time to sort 1MB / # workers?
  - Time to merge 1MB = Time to gather records / # workers?
- Scalability: the master/slaves model is adequate only for small-scale parallelism.
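The two balance questions have the same form: the master's time per MB should match the workers' time per MB divided by the number of workers. Solving for the worker count gives a quick sizing rule. The timings in the example are hypothetical, chosen only to show the arithmetic.

```python
import math


def workers_to_balance(master_ms_per_mb, worker_ms_per_mb):
    """Smallest worker count for which the workers keep up with the master.

    Solves master_time >= worker_time / workers for an integer `workers`.
    """
    return math.ceil(worker_ms_per_mb / master_ms_per_mb)


# Hypothetical timings: the master reads 1MB in 10 ms, one worker sorts 1MB in 25 ms.
print(workers_to_balance(10, 25))  # 3
```

If the answer exceeds the number of available processors, the workers become the bottleneck; if it is far below, the single master does.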
Case Study: NOW-Sort
(Sorting on Shared-Nothing Clusters)
Hardware: A cluster of commodity workstations (up to 64 Sun UltraSPARC I workstations connected by a Myrinet switch (160MB/s))
NOW-Sort (cont.)
Each workstation has 64-128MB memory and 2-4 SCSI disks
NOW-Sort (cont.)
Software:
- GLUnix + N copies of Solaris
  - supports a simple parallel environment (process start-up, job control, etc.), but no dynamic scheduling
- Shared-nothing with explicit communication
- Active messages (essentially a restricted, lightweight remote procedure call)
  - move keys + records between nodes
  - 10 μs latency, 35 MB/s bandwidth
Algorithms:
Assumption about input data: Input data is assumed to be drawn from a uniform distribution, and is evenly partitioned across processors’ local disks.
Internal Sort Options
- Quicksort: Quicksort the (key-prefix, pointer) pairs.
- Bucket Sort + Quicksort: While reading records into memory, simultaneously examine the high-order b bits of the keys and place the keys into the appropriate buckets; then sort each bucket individually with quicksort.
  - Each bucket entry contains the most significant 32 bits of a key after the top b bits, and a pointer to the full record.
  - The number of buckets (hence the value of b) is chosen so that the average number of keys per bucket fits into the L cache.
Internal Sort Options (cont.)
- Bucket Sort + Partial Radix-Sort: Distribute keys into buckets as in the previous algorithm, but sort each bucket with a partial radix sort instead of quicksort. Two passes are performed over the keys, each with a radix size of 11 bits. Since the two passes together examine only 22 bits of the (80-b)-bit keys, a final clean-up phase is performed, in which keys that tie in their top 22+b bits are bubble-sorted.
Experimental data show that the last one has the best performance.
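The third option can be sketched as follows. Several simplifications are assumptions of this sketch rather than details from NOW-Sort: keys are held as 80-bit Python integers instead of (key-prefix, pointer) entries, each radix pass uses plain lists as bins, and an insertion sort stands in for the bubble-sort clean-up. The function names are made up.

```python
KEY_BITS = 80     # 10-byte keys
RADIX_BITS = 11   # radix size per pass


def partial_radix_sort_bucket(keys, b):
    """Sort keys that all share the same top b bits (one bucket)."""
    field_shift = KEY_BITS - b - 2 * RADIX_BITS  # bits below the 22-bit field
    mask = (1 << RADIX_BITS) - 1
    # LSD radix sort over the 22 bits that follow the top b bits:
    # low 11 bits of the field first, then the high 11 bits.
    for shift in (field_shift, field_shift + RADIX_BITS):
        bins = [[] for _ in range(1 << RADIX_BITS)]
        for k in keys:
            bins[(k >> shift) & mask].append(k)  # appending keeps the pass stable
        keys = [k for bin_ in bins for k in bin_]
    # Clean-up: keys that tie in the examined bits may still be out of order.
    # The list is nearly sorted, so a simple insertion sort finishes quickly.
    for i in range(1, len(keys)):
        k, j = keys[i], i - 1
        while j >= 0 and keys[j] > k:
            keys[j + 1] = keys[j]
            j -= 1
        keys[j + 1] = k
    return keys


def bucket_radix_sort(keys, b=4):
    buckets = [[] for _ in range(1 << b)]
    for k in keys:
        buckets[k >> (KEY_BITS - b)].append(k)  # high-order b bits pick the bucket
    out = []
    for bucket in buckets:
        out.extend(partial_radix_sort_bucket(bucket, b))
    return out
```

With uniformly distributed keys, ties in the top 22+b bits are rare, so the clean-up loop does almost no work; that is what makes stopping the radix sort after two passes worthwhile.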
Exploring Overlap
- Synchronous (i.e. no overlapping): Each processor reads, communicates, sorts, and then writes, with no overlap across the four steps.
- Interleaved: A single thread alternates between reading and communicating. Input data are saved in (small) send buffers, one for each destination. As soon as a buffer is full, its data is sent out. (Active messages are key to making this work.)
- Threaded: One I/O thread and one communication thread.
Experimental data show:
- With two disks per node, both the interleaved and the threaded versions are about 10% better than the synchronous version.
- With more disks, the three versions perform nearly identically.
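The per-destination send buffers of the interleaved variant can be sketched as below. The `send` callback is a hypothetical stand-in for an active-message send, and `destination` assumes uniformly distributed keys so that the high-order bits of a key select the receiving node; neither name comes from NOW-Sort.

```python
def destination(key, num_nodes, key_bits=80):
    """Map a key to a node by its high-order bits (assumes uniform keys)."""
    return (key * num_nodes) >> key_bits


class InterleavedSender:
    """Collect records in small per-destination buffers; send each when full."""

    def __init__(self, num_nodes, buffer_records, send):
        self.buffers = [[] for _ in range(num_nodes)]
        self.buffer_records = buffer_records
        self.send = send  # hypothetical callback: send(dest, list_of_records)

    def add(self, dest, record):
        buf = self.buffers[dest]
        buf.append(record)
        if len(buf) == self.buffer_records:
            self.send(dest, buf)
            self.buffers[dest] = []

    def flush(self):
        # Send whatever is left once the input has been read completely.
        for dest, buf in enumerate(self.buffers):
            if buf:
                self.send(dest, buf)
                self.buffers[dest] = []
```

A reader loop would call `add(destination(key, num_nodes), record)` for each record it reads and `flush()` at end of input, so communication is spread across the whole read phase instead of following it.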
NOW-Sort Optimization Summary
- User-level software striping.
- Use mmap instead of read, to avoid the system double-buffering problem.
- Use (key-prefix, pointer) pairs in internal sorting.
- Overlap I/O with CPU computation by using buckets.
- Overlap I/O with communication.
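Two of these optimizations, mmap-based input and (key-prefix, pointer) pairs, can be illustrated together. The sketch assumes a non-empty file of 100-byte records whose first 10 bytes are the key, and it uses a 4-byte prefix; the record layout, the constants, and the function name are assumptions for illustration only.

```python
import mmap

RECORD_SIZE = 100  # assumed record length in bytes
KEY_SIZE = 10      # assumed: the key is the first 10 bytes of a record
PREFIX_SIZE = 4    # assumed key-prefix length


def sort_file_by_key(path):
    """Return the records of a non-empty file in key order, read via mmap."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as data:
        count = len(data) // RECORD_SIZE
        # Build compact (key-prefix, pointer) pairs; the pointer is a byte offset.
        pairs = [(data[off:off + PREFIX_SIZE], off)
                 for off in range(0, count * RECORD_SIZE, RECORD_SIZE)]
        # Order by prefix, using the full key as a tie-breaker.
        pairs.sort(key=lambda p: (p[0], data[p[1]:p[1] + KEY_SIZE]))
        # Gather the full records in sorted order.
        return [data[off:off + RECORD_SIZE] for _, off in pairs]
```

Sorting the small pairs instead of the 100-byte records keeps the data touched during sorting compact; the full records are only moved once, in the final gather.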
PennySort Benchmark Winners
Sort as much as possible for less than a penny.
Price-performance metric — calculate the 5-year cost of the hardware and software, then prorate it for the elapsed time.
year  name                   system              category      GB
1998  PostManSort            Intel/NT            Daytona       1.
1998  NTsort                 Intel/NT            Indy          1.
1999  HMsort                                     Indy&Daytona  2.
2000  HMsort                                     Indy&Daytona  4.
2002  DMsort                                     Indy          12
2004  THsort                 AMD/Linux           Daytona       10
2005  PostManSort            Intel/Windows       Daytona       15
2005  SheenkSort             AMD/Linux           Indy          40
2006  Byte-Split-Index Sort  AMD/WindowsXP       Daytona       32
2006  GpuTeraSort            Intel P4/WindowsXP  Indy          55
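The price-performance metric translates into a time budget: the cheaper the system, the longer it may run for one penny. The sketch below follows the 5-year proration stated above and assumes 365-day years; the $1,000 system price is a hypothetical example.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def penny_time_budget(system_price_dollars, years=5, budget_dollars=0.01):
    """Seconds of run time that `budget_dollars` buys on the given system."""
    cost_per_second = system_price_dollars / (years * SECONDS_PER_YEAR)
    return budget_dollars / cost_per_second


# Hypothetical $1,000 system: roughly 1,576.8 seconds of sorting for one penny.
print(round(penny_time_budget(1000), 1))
```

Doubling the system price halves the time budget, so a PennySort entry has to trade hardware cost against sorting speed.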
PennySort Notes
- PostManSort: A commercial sort program for Unix. It is basically a top-down radix sort (i.e. it sorts from MSB to LSB).
- NTsort: A commercial sort program included in Windows NT 5. and Windows 2000. It uses a simple two-pass algorithm: partition the input and quicksort each partition into a run, then merge all the runs with a tournament tree. No I/O overlapping is used.
- HMsort: A sort program similar to NTsort, but with I/O overlapping.
PennySort Notes (cont.)
- DMsort: Uses a top-down radix sort algorithm. In the first phase, the data is read into memory one segment at a time and records are placed into “bins” based on their key values. In the second phase, each bin is read into memory and sorted by a hybrid of radix sort and quicksort.
- THsort, SheenkSort, Byte-Split-Index Sort: Three sorts developed by the same group; all are loosely based on the radix sort algorithm, although no details are given for any of them.
- GpuTeraSort: Uses a GPU as a co-processor and implements a bitonic merge sort algorithm.
TeraByteSort Benchmark Winners
Sort 10 billion 100-byte records.
year  name             system           category      time(m)
1998  Nsort            SGI Origin2000   Indy&Daytona  151
2000  Tandem FastSort  Compaq Tandem    Daytona       49
2000  SPsort           1952 SP cluster  Indy          18
GPU Properties
- Data parallelism: multi-way execution of the same instruction
- Instruction-level parallelism: concurrent execution of different instructions
- Dedicated memory interface: 64-byte interface to on-board video memory
- Low memory latency: prefetching and pipelining are more effective since memory stalls are less likely
A Comparison:
cpu vs. gpu              #cmp/cl  peak         mem bdwt
high-end dual-core P4    4        25.6 GFlops  6.4 GB/s
Nvidia GeForce 7800 GTX  96       313 GFlops   56 GB/s
GpuTeraSort Algorithm
Phase One:
- Reader (Disks → RAM): Asynchronously reads the input file into a (≈ 100MB) main memory buffer.
- Key-Generator (RAM → Disks → RAM): Computes the (key, rec-ptr) pairs from the input buffer.
- Sorter (RAM ↔ GPU ⇔ Video RAM): Reads and sorts the key-pointer pairs. The core algorithm: bitonic sorting (which fits well on a GPU).
- Reorder (RAM → Disks → RAM): Rearranges the input buffer based on the sorted key-pointer pairs to generate a sorted output buffer (a run).
- Write (RAM → Disks): Asynchronously writes the run to the disk.
Phase Two: Merges the runs with a common algorithm.
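Bitonic sorting suits a GPU because its compare-and-swap pattern is fixed in advance and does not depend on the data. The following is a plain sequential Python sketch of the textbook bitonic sorting network for power-of-two input sizes, not the GpuTeraSort implementation; on a GPU, each loop of independent compare-and-swap steps would run in parallel.

```python
def bitonic_merge(values, ascending):
    """Turn a bitonic sequence into a sorted one."""
    n = len(values)
    if n <= 1:
        return list(values)
    half = n // 2
    values = list(values)
    # These compare-and-swap steps are independent of each other,
    # which is what makes the network easy to parallelize.
    for i in range(half):
        if (values[i] > values[i + half]) == ascending:
            values[i], values[i + half] = values[i + half], values[i]
    return (bitonic_merge(values[:half], ascending)
            + bitonic_merge(values[half:], ascending))


def bitonic_sort(values, ascending=True):
    """Sort a list whose length is a power of two."""
    n = len(values)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    if n <= 1:
        return list(values)
    half = n // 2
    # Sort the halves in opposite directions to form a bitonic sequence.
    first = bitonic_sort(values[:half], True)
    second = bitonic_sort(values[half:], False)
    return bitonic_merge(first + second, ascending)
```

For example, `bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5])` returns the eight values in ascending order. The network performs more comparisons than quicksort, but every step has the same shape, which is the trade a data-parallel processor favors.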
Case Study: SAN Cluster Sort
(Sorting on a Cluster with a Storage-Area Network)
Hardware —
- 40 dual-processor Itanium2 nodes, each with 4GB of RAM, three 2Gbps fibre channel adapters (to connect to the SAN), a single gigabit Ethernet interface (to connect to other nodes), and a local SCSI drive (for OS use).
- 60 IBM TotalStorage DS4300 RAID subsystems, each with 42 73GB 10kRPM fibre channel disk drives.
- 3 Brocade Silkworm 24,000 fibre channel switches.
- A Force10 model E1200 Ethernet switch.
In total, the cluster contains 160GB of RAM and 140TB of formatted disk storage capacity (2,520 disk drives).
Software — Linux with IBM GPFS file system, C with MPI
SCS Algorithm — First Pass
- The 1TB of data is partitioned into 80 partitions (only pointers are calculated, no data movement is involved); each partition is used as input for one processor.
- Each processor reads records from its input partition and distributes them into slices (i.e. bins) of approximately equal size based on key values; the slice size is chosen so that a whole slice fits in main memory (with multiple buffers), which turns out to be 768MB. (Therefore, there are 1280 slices in total.)
- Each processor writes each of its 1280 slices into a separate per-node slice file. (So there are 80 files for each slice range.) These files are automatically distributed to different disks by the GPFS file system.
- The processors exchange information about slice sizes using MPI Allgather.
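The first pass can be modelled in a few lines. This sketch makes assumptions that the slides do not state: keys are 80-bit integers, slice boundaries split the key range evenly, and the slice-size exchange is simulated with plain lists instead of MPI. All function names are hypothetical.

```python
NUM_SLICES = 1280  # total number of slices
KEY_BITS = 80      # assumed: 10-byte keys


def slice_of(key, num_slices=NUM_SLICES, key_bits=KEY_BITS):
    """Map a key to a slice by splitting the key range evenly."""
    return (key * num_slices) >> key_bits


def first_pass(partition, num_slices=NUM_SLICES, key_bits=KEY_BITS):
    """Distribute one processor's input partition into its per-node slices."""
    slices = [[] for _ in range(num_slices)]
    for key in partition:
        slices[slice_of(key, num_slices, key_bits)].append(key)
    return slices


def allgather_sizes(per_processor_slices):
    """Stand-in for the allgather step: every processor learns all slice sizes."""
    return [[len(s) for s in slices] for slices in per_processor_slices]
```

After the exchange, summing column j of the size table gives the total size of slice j across all processors, which tells whoever sorts that slice in the next pass how much data to expect.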