External Sorting - Parallel Programming | CS 415, Study notes of Computer Science

Material Type: Notes; Professor: Li; Class: PARALLEL PROGRAMMING; Subject: Computer Science; University: Portland State University; Term: Unknown 1989;

Uploaded on 08/19/2009

External Sorting

Jingke Li

Portland State University

Jingke Li (Portland State University) CS 415/515 External Sorting 1 / 35

External Sorting

Initial data and final results are both stored on disks.

External sorting algorithms typically consist of two phases:

  • The first phase produces a set of files containing half-processed data.
  • The second phase processes these files to produce a totally ordered permutation of the input data.

Two General Categories

  • Distribution-Based Sorting:
    • The first phase partitions the input data into k disjoint buckets such that every element in one bucket precedes every element in the buckets that follow.
    • In the second phase, each bucket is sorted independently.
  • Merge-Based Sorting:
    • The first phase partitions the input data into chunks of approximately equal size, sorts each chunk in main memory, and writes the resulting “runs” to disk.
    • The second phase merges the runs in main memory and writes the sorted output to disk.
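
The merge-based scheme above can be sketched in a few lines of Python. This is a minimal illustration, not any particular system's implementation: records are assumed to be newline-free strings, and the names `write_runs`, `read_run`, and `external_sort` are hypothetical.

```python
import heapq
import os
import tempfile
from itertools import islice

def write_runs(records, run_size, tmpdir):
    """Phase 1: cut the input into chunks, sort each in memory, write each as a run."""
    paths = []
    it = iter(records)
    while True:
        chunk = list(islice(it, run_size))
        if not chunk:
            break
        chunk.sort()
        path = os.path.join(tmpdir, f"run{len(paths)}.txt")
        with open(path, "w") as f:
            f.writelines(rec + "\n" for rec in chunk)
        paths.append(path)
    return paths

def read_run(path):
    """Stream one sorted run back from disk, one record at a time."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

def external_sort(records, run_size):
    """Phase 2: k-way merge of all runs (assumes records contain no newlines)."""
    with tempfile.TemporaryDirectory() as tmpdir:
        paths = write_runs(records, run_size, tmpdir)
        return list(heapq.merge(*(read_run(p) for p in paths)))
```

For example, `external_sort(["pear", "apple", "fig", "kiwi", "date"], 2)` writes three runs and merges them into one sorted list. Only `run_size` records are held in memory during phase 1.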

Performance Issues

  • Disk I/O bandwidth: This is often the performance bottleneck, so the number of passes over the data should be minimized. (For a large data set, 2 is the minimum.) A simple rule: make the phase 1 partition size small enough to fit in main memory, so that a two-pass algorithm is possible.
  • Balancing the time needed by the two phases: Smaller run sizes make phase 1 run faster, but may slow down phase 2.
  • Maximizing overlap between I/O and computation: Use disk prefetching, etc.
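
The run-size trade-off can be made concrete with a rough model. The sketch below assumes the merge phase needs one input buffer per run plus one output buffer; `merge_plan` and its parameters are hypothetical names, and the numbers in the usage note are illustrative only.

```python
def merge_plan(data_bytes, run_bytes, memory_bytes, buffer_bytes):
    """Return (number of runs, merge fan-in, whether one merge pass suffices)."""
    if run_bytes > memory_bytes:
        raise ValueError("a run must fit in main memory")
    num_runs = -(-data_bytes // run_bytes)      # ceiling division
    fan_in = memory_bytes // buffer_bytes - 1   # reserve one buffer for output
    return num_runs, fan_in, num_runs <= fan_in
```

With 1 GB of memory and 1 MB buffers, 100 GB of data in 1 GB runs gives 100 runs against a fan-in of 999, so two passes suffice. Shrinking the runs to 50 MB gives 2000 runs, which no longer fit in a single merge pass under this model.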

Datamation Winners

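| year | system | #cpu | #disk | time (s) |
|---|---|---|---|---|
| 1986 | Tandem | 2 | 2 | 3600 |
| 1987 | Tandem | 3 | 6 | 980 |
| 1988 | Cray | 1 | 1+ 1 | 28 |
| 1990 | Sequent | 8 | 4 | 83 |
| 1992 | Intel iPSC/2 | 32 | 32 | 58 |
| 1993 | DEC Alpha | 1 | 16 | 9. |
| 1994 | DEC Alpha | 3 | 28 | 7. |
| 1996 | SGI Challenge | 12 | 96 | 4. |
| 1997 | SUN Sparc | 32 | 64 | 2. |
| 1999 | DELL NT | 16x2 | ? | 1. |
| 2000 | HP Xeon | 4 | 32 | . |
| 2001 | Intel P3 | 32x2 | 32x5 | . |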

Case Study: AlphaSort

(Sorting on Shared-Everything SMPs)

Hardware: An SMP with commodity components

  • 1-3 DEC Alpha AXP 3000/ processors
  • 256MB memory (memory bus 640MB/s)
  • 10-28 disks (I/O bus 4 x 100MB/s)

Case Study: AlphaSort (cont.)

Software:

  • DEC OpenVMS
  • Threads over shared memory:
    • threads allow overlapping between I/O and computing
    • locks are used to ensure proper access to shared data

Algorithms:

  • One-Node Version — Focusing on overlapping between I/O and computing
  • Parallel Version — Enhanced from the one-node version

One-Node AlphaSort Algorithm Sketch

  • Divide one million records into 30-50 groups.
  • Read groups one by one from disk.
  • As each group becomes available, use a separate thread to sort it into a run using quicksort.
  • Merge the runs using a replacement-selection tree.
  • Gather records into contiguous buffer and write to disk.
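
The steps above can be sketched as follows. This is a structural illustration only, under several assumptions: an in-memory list stands in for the disk, Python's built-in `sorted` stands in for quicksort, and `heapq.merge` stands in for the replacement-selection tree. The names `read_groups` and `one_node_sort` are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def read_groups(records, group_size):
    """Stand-in for reading groups one by one from disk."""
    for i in range(0, len(records), group_size):
        yield records[i:i + group_size]

def one_node_sort(records, group_size, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Hand each group to a worker thread as soon as it has been "read",
        # so that later reads can overlap with earlier sorts.
        futures = [pool.submit(sorted, group)
                   for group in read_groups(records, group_size)]
        runs = [f.result() for f in futures]
    # Merge the sorted runs and gather the result into one contiguous list.
    return list(heapq.merge(*runs))
```

In CPython the global interpreter lock limits how much the sorts themselves run in parallel, so the sketch shows the control flow rather than the speedup.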

Key Optimizations:

  • Multiple threads: overlap I/O and internal sorting.
  • Disk striping: allows parallel reads and writes (hardware: 3 controllers and 28 disks; software: a file striping layer).
  • Lots of memory: 256-384MB for sorting 95MB of data; with this much memory, quicksort can be used for internal sorting.

Parallel AlphaSort Algorithm Sketch

  • Each thread requests affinity to a processor.
  • Master thread responsible for all file operations; worker threads do sorting and memory-intensive operations.
  • Master reads records; workers quicksort the data groups.
  • Master merges the runs; workers gather records into contiguous buffers.
  • Master writes sorted buffers to disk.

Programming Issues:

  • Balancing workload:
    • Time to read 1MB = Time to sort 1MB / # workers?
    • Time to merge 1MB = Time to gather records / # workers?
  • Scalability: The master/slaves model is adequate only for small-scale parallelism.
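
The first balancing question can be turned into a small calculation: if the master reads at some rate and each worker sorts at some rate, the number of workers needed to keep up is the ratio of the two, rounded up. The function name and the rates below are hypothetical, chosen only to illustrate the arithmetic.

```python
import math

def workers_to_match_reader(read_mb_per_s, sort_mb_per_s_per_worker):
    """Smallest worker count for which sorting 1MB takes no longer than reading 1MB."""
    return math.ceil(read_mb_per_s / sort_mb_per_s_per_worker)
```

For instance, a master reading 100 MB/s with workers that each sort 30 MB/s would need 4 workers under this model.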

Case Study: NOW-Sort

(Sorting on Shared-Nothing Clusters)

Hardware: A cluster of commodity workstations (up to 64 Sun UltraSPARC I workstations connected with a Myrinet switch (160MB/s))

NOW-Sort (cont.)

Each workstation has 64-128MB memory and 2-4 SCSI disks

NOW-Sort (cont.)

Software:

  • GLUnix + N copies of Solaris
    • supports a simple parallel environment (process start-up, job control, etc.), but no dynamic scheduling
  • Shared-nothing with explicit communication
    • one process per node
  • Active messages (essentially a restricted, lightweight remote procedure call)
    • move keys + records between nodes
    • 10 μs latency, 35 MB/s bandwidth

Algorithms:

  • Quicksort + bucket sort

Assumption about input data: Input data is assumed to be drawn from a uniform distribution, and is evenly partitioned across processors’ local disks.

Internal Sort Options

  • Quicksort: Quicksort the (key-prefix, pointer) pairs.
  • Bucket Sort + Quicksort: While reading records into memory, simultaneously examine the high-order b bits of the keys and place keys into the appropriate buckets; then each bucket is sorted individually with quicksort.
    • Each bucket entry contains the most significant 32 bits of a key after the top b bits, and a pointer to the full record.
    • The number of buckets (hence the value of b) is determined such that the average number of keys per bucket fits into the L cache.
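
The bucket option can be sketched as below. Several things here are assumptions for illustration: records are `bytes` objects whose first 10 bytes are the key, pointers are list indices, Python's built-in sort stands in for quicksort, and `bucket_sort_records` is a hypothetical name. Ties in the 32-bit prefix are resolved by following the pointer to the full key.

```python
def bucket_sort_records(records, b=4, key_len=10):
    """Sort records by key using 2**b buckets of (key-prefix, pointer) entries."""
    key_bits = key_len * 8
    buckets = [[] for _ in range(1 << b)]
    for ptr, rec in enumerate(records):
        key = int.from_bytes(rec[:key_len], "big")
        bucket = key >> (key_bits - b)                        # high-order b bits
        prefix = (key >> (key_bits - b - 32)) & 0xFFFFFFFF    # next 32 bits
        buckets[bucket].append((prefix, ptr))
    out = []
    for entries in buckets:
        # Compare prefixes first; fall back to the full key only on a tie.
        entries.sort(key=lambda e: (e[0], records[e[1]][:key_len]))
        out.extend(records[ptr] for _, ptr in entries)
    return out
```

Because the buckets partition the key space by their top b bits, concatenating the sorted buckets in order yields the fully sorted output.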

Internal Sort Options (cont.)

  • Bucket Sort + Partial Radix-Sort: Distribute keys into buckets as in the previous algorithm, but use a partial radix-sort instead of quicksort to sort each bucket. Two passes are performed over the keys, each with a radix size of 11 bits. Since the two passes together examine only 22 bits of the (80-b)-bit keys, a final clean-up phase is performed, where keys with ties in the top 22 + b bits are bubble-sorted.

Experimental data show that the last one has the best performance.
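
A sketch of the partial radix-sort for one bucket follows. It assumes keys are Python integers of `key_bits` bits (the part left after removing the top b bits), and the function names are hypothetical. The clean-up pass moves each key backward by adjacent swaps; since only keys tied in the 22 examined bits can be out of order after the two passes, few swaps are expected.

```python
RADIX_BITS = 11

def stable_pass(keys, digit):
    """One stable distribution pass on an 11-bit digit."""
    bins = [[] for _ in range(1 << RADIX_BITS)]
    for k in keys:
        bins[digit(k)].append(k)
    return [k for bin_ in bins for k in bin_]

def partial_radix_sort(keys, key_bits):
    """Sort integer keys of key_bits bits (key_bits >= 22)."""
    shift = key_bits - 2 * RADIX_BITS
    mask = (1 << RADIX_BITS) - 1
    # Pass 1: lower 11 bits of the top 22; pass 2: upper 11 bits.
    keys = stable_pass(keys, lambda k: (k >> shift) & mask)
    keys = stable_pass(keys, lambda k: (k >> (shift + RADIX_BITS)) & mask)
    # Clean-up: resolve ties in the top 22 bits using the full keys.
    for i in range(1, len(keys)):
        j = i
        while j > 0 and keys[j - 1] > keys[j]:
            keys[j - 1], keys[j] = keys[j], keys[j - 1]
            j -= 1
    return keys
```

The least-significant-digit-first order of the two passes relies on each pass being stable, which the per-bin append order guarantees.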

Exploring Overlap

  • Synchronous (i.e. no overlapping): Each processor reads, communicates, sorts, and then writes, with no overlap across the four steps.
  • Interleaved: A single thread alternates between reading and communicating. Input data are saved in (small) send buffers, one for each destination. As soon as a buffer is full, its data is sent out. (Active messages are key to making this work.)
  • Threaded: One I/O thread and one communication thread.

Experimental data show:

  • With two disks per node, both the interleaved and the threaded versions are about 10% better than the synchronous version.
  • With more disks, the three versions perform nearly identically.
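
The interleaved variant can be sketched as a single loop over per-destination send buffers. The callbacks `dest_of` and `send` are hypothetical stand-ins for the key-to-node mapping and the active-message send; nothing here reflects the real message layer.

```python
def interleaved_distribute(records, num_nodes, buffer_size, dest_of, send):
    """Read records one at a time and send each buffer as soon as it fills."""
    buffers = [[] for _ in range(num_nodes)]
    for rec in records:
        d = dest_of(rec)
        buffers[d].append(rec)
        if len(buffers[d]) == buffer_size:
            send(d, buffers[d])     # communicate, then go back to reading
            buffers[d] = []
    # Flush whatever is left once the input is exhausted.
    for d, buf in enumerate(buffers):
        if buf:
            send(d, buf)
```

Small buffers keep the alternation between reading and sending fine-grained, which is why a low-overhead send matters for this scheme.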

NOW-Sort Optimization Summary

  • User-level software striping.
  • Use mmap instead of read, to avoid the system double-buffering problem.
  • Use (key-prefix, pointer) pairs in internal sorting.
  • Overlap I/O with CPU computation by using buckets.
  • Overlap I/O with communication.

PennySort Benchmark Winners

Sort as much as possible for less than a penny.

Price-performance metric — calculate the 5-year cost of the hardware and software, then prorate it for the elapsed time.
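
The prorating can be worked through as a small calculation: a penny buys the fraction 0.01 / cost of the system's 5-year life. The sketch assumes 365-day years, and the function name and the $1000 example are hypothetical.

```python
SECONDS_IN_5_YEARS = 5 * 365 * 24 * 3600   # assumes 365-day years

def penny_time_budget(system_cost_dollars):
    """Seconds of machine time that one penny buys on a system of the given cost."""
    return 0.01 * SECONDS_IN_5_YEARS / system_cost_dollars
```

Under these assumptions a $1000 system gets roughly 1577 seconds, so cheaper systems are allowed proportionally more time.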

| year | name | system | category | GB |
|---|---|---|---|---|
| 1998 | PostManSort | Intel/NT | Daytona | 1. |
| 1998 | NTsort | Intel/NT | Indy | 1. |
| 1999 | HMsort | | Indy&Daytona | 2. |
| 2000 | HMsort | | Indy&Daytona | 4. |
| 2002 | DMsort | | Indy | 12 |
| 2004 | THsort | AMD/Linux | Daytona | 10 |
| 2005 | PostManSort | Intel/Windows | Daytona | 15 |
| 2005 | SheenkSort | AMD/Linux | Indy | 40 |
| 2006 | Byte-Split-Index Sort | AMD/WindowsXP | Daytona | 32 |
| 2006 | GpuTeraSort | Intel P4/WindowsXP | Indy | 55 |

PennySort Notes

  • PostManSort: A commercial sort program for Unix. It is basically a top-down radix sort (i.e. it sorts from MSB to LSB).
  • NTsort: A commercial sort program included in Windows NT 5. and Windows 2000. It uses a simple two-pass algorithm: partition the input and quicksort each partition into a run, then merge all the runs with a tournament tree. No I/O overlapping is used.
  • HMsort: A sort program similar to NTsort, but with I/O overlapping.

PennySort Notes (cont.)

  • DMsort: It uses a top-down radix sort algorithm. In the first phase, the data is read into memory one segment at a time and records are placed into “bins” based on their key values. In the second phase, each bin is read into memory and sorted by a hybrid of radix sort and quicksort.
  • THsort, SheenkSort, Byte-Split-Index Sort: Three sorts developed by the same group; all are loosely based on the radix sort algorithm, although no details are given for any of them.
  • GpuTeraSort: Sorts with a GPU co-processor; it implements a bitonic merge sort algorithm.

TeraByteSort Benchmark Winners

Sort 10 billion 100-byte records.

| year | name | system | category | time (m) |
|---|---|---|---|---|
| 1998 | Nsort | SGI Origin2000 | Indy&Daytona | 151 |
| 2000 | Tandem FastSort | Compaq Tandem | Daytona | 49 |
| 2000 | SPsort | 1952 SP cluster | Indy | 18 |

GPU Properties

  • Data parallelism: multi-way execution of the same instruction
  • Instruction-level parallelism: concurrent execution of different instructions
  • Dedicated memory interface: 64-byte interface to on-board video memory
  • Low memory latency: prefetching and pipelining are more effective since memory stalls are less likely

A Comparison:

| cpu vs. gpu | comparisons/clock | peak | memory bandwidth |
|---|---|---|---|
| high-end dual-core P4 | 4 | 25.6 GFlops | 6.4 GB/s |
| Nvidia GeForce 7800 GTX | 96 | 313 GFlops | 56 GB/s |

GpuTeraSort Algorithm

Phase One:

  • Reader (Disks → RAM): Asynchronously reads the input file into a (≈ 100MB) main memory buffer.
  • Key-Generator (RAM → Disks → RAM): Computes the (key, rec-ptr) pairs from the input buffer.
  • Sorter (RAM ↔ GPU ⇔ Video RAM): Reads and sorts the key-pointer pairs. The core algorithm is bitonic sorting, which fits well on a GPU.
  • Reorder (RAM → Disks → RAM): Rearranges the input buffer based on the sorted key-pointer pairs to generate a sorted output buffer (a run).
  • Write (RAM → Disks): Asynchronously writes the run to disk.
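
The Sorter stage's core algorithm, bitonic sorting, can be sketched on the CPU as a fixed network of compare-and-swap steps. This is a plain Python illustration, not GPU code; it assumes the input length is a power of two, and `bitonic_sort` is a hypothetical name.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network."""
    a = list(values)
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:                 # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:             # distance between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

Every iteration of the inner `for` loop is independent of the others for a given (k, j), and the comparison pattern does not depend on the data, which is what makes the network a good fit for data-parallel hardware.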

Phase Two: Merges the runs with a common algorithm.

Case Study: SAN Cluster Sort

(Sorting on a Cluster with a Storage-Area Network)

Hardware —

  • 40 dual-processor Itanium2 nodes, each with 4GB of RAM, three 2Gbps fibre channel adapters (to connect to the SAN), a single gigabit Ethernet interface (to connect to other nodes), and a local SCSI drive (for OS use).
  • 60 IBM TotalStorage DS4300 RAID subsystems, each with 42 73GB 10kRPM fibre channel disk drives.
  • 3 Brocade SilkWorm 24000 fibre channel switches.
  • A Force10 model E1200 Ethernet switch.

In total, the cluster contains 160GB of RAM and 140TB formatted disk storage capacity (2,560 disk drives).

Software — Linux with IBM GPFS file system, C with MPI

SCS Algorithm — First Pass

  • The 1TB of data is partitioned into 80 partitions (only pointers are calculated; no data movement is involved); each partition is used as input for one processor.
  • Each processor reads records from its input partition and distributes them into slices (i.e. bins) of approximately equal size based on key values; the slice size is chosen so that a whole slice fits in main memory (with multiple buffers), which turns out to be 768MB. (Therefore, there are 1280 slices in total.)
  • Each processor writes each of its 1280 slices into a separate per-node slice file. (So there are 80 files for each slice range.) These files are automatically distributed to different disks by the GPFS file system.
  • The processors exchange information about slice sizes using MPI_Allgather.