External Sorting - Parallel Programming | CS 415, Study notes of Computer Science

Portland State University (PSU), Computer Science

Prof. Jingke Li

Material Type: Notes; Professor: Li; Class: PARALLEL PROGRAMMING; Subject: Computer Science; University: Portland State University; Term: Unknown 1989;

Typology: Study notes

Pre 2010

Uploaded on 08/19/2009

External Sorting

Jingke Li

Portland State University

Jingke Li (Portland State University) CS 415/515 External Sorting 1 / 35

External Sorting

Initial data and final results are both stored on disks.

External sorting algorithms typically consist of two phases:

- The first phase produces a set of files containing half-processed data.
- The second phase processes these files to produce a totally ordered permutation of the input data.

Two General Categories

Distribution-Based Sorting:
- The first phase partitions the input data into k disjoint, ordered buckets such that every element in one bucket precedes every element in the buckets that follow it.
- In the second phase, each bucket is sorted independently.
Merge-Based Sorting:
- The first phase partitions the input data into chunks of approximately equal size, sorts these chunks in main memory, and writes the resulting “runs” to disk.
- The second phase merges the runs in main memory and writes the sorted output to disk.
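
The distribution-based category can be sketched in a few lines. This is a minimal illustration, not code from the course: in-memory lists stand in for disk files, and the function name and the `splitters` parameter (the k-1 bucket boundaries) are assumptions made for the example.

```python
import bisect

def distribution_sort(data, splitters):
    """Phase 1: partition into len(splitters) + 1 ordered buckets.
    Phase 2: sort each bucket independently and concatenate."""
    buckets = [[] for _ in range(len(splitters) + 1)]
    for x in data:
        # bisect_right returns the index of the bucket whose key range holds x
        buckets[bisect.bisect_right(splitters, x)].append(x)
    out = []
    for bucket in buckets:
        out.extend(sorted(bucket))  # buckets are disjoint, so no merge is needed
    return out
```

Because the buckets are already ordered relative to each other, the second phase needs no merge step; this is the main structural difference from the merge-based category.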

Performance Issues

Disk I/O bandwidth: Often the performance bottleneck. Try to minimize the number of passes over the data. (For a large data set, 2 is the minimum.) A simple rule: make the Phase 1 partition size small enough to fit in main memory, so that a two-pass algorithm is possible.
Balancing the time needed by the two phases: Smaller run sizes make phase 1 run faster, but may slow down phase 2.
Maximizing overlap between I/O and computation: Use disk prefetching, etc.
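
A back-of-envelope check of the two-pass rule, as a sketch under assumptions the slides do not state: phase 1 writes runs about as large as main memory, and phase 2 needs one input buffer per run (the output buffer is ignored). The function name and the buffer size are illustrative.

```python
def max_two_pass_input(memory_bytes, buffer_bytes):
    """Largest input (in bytes) that two passes can handle under the
    assumptions above: runs of size memory_bytes, one buffer per run."""
    runs_mergeable = memory_bytes // buffer_bytes  # runs that fit in the merge phase
    return runs_mergeable * memory_bytes           # each run holds one memory-load

MB = 2**20
# Example: 256MB of memory with 1MB merge buffers
capacity = max_two_pass_input(256 * MB, 1 * MB)
```

Under these assumptions the capacity grows with the square of the memory size, which is why a modest amount of memory is usually enough to stay at two passes.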

Datamation Winners

year system #cpu #disk time(s)
1986 Tandem 2 2 3600
1987 Tandem 3 6 980
1988 Cray 1 1+ 1 28
1990 Sequent 8 4 83
1992 Intel iPSC/2 32 32 58
1993 DEC Alpha 1 16 9.
1994 DEC Alpha 3 28 7.
1996 SGI Challenge 12 96 4.
1997 SUN Sparc 32 64 2.
1999 DELL NT 16x2? 1.
2000 HP Xeon 4 32.
2001 Intel P3 32x2 32x5.

Case Study: AlphaSort

(Sorting on Shared-Everything SMPs)

Hardware: An SMP with commodity components

1-3 DEC Alpha AXP 3000/ processors
256MB memory (memory bus 640MB/s)
10-28 disks (I/O bus 4 x 100MB/s)

Case Study: AlphaSort (cont.)

Software:

DEC OpenVMS
Threads over shared memory:
- threads allow overlapping between I/O and computing
- locks are used to ensure proper access to shared data

Algorithms:

One-Node Version — Focusing on overlapping between I/O and computing
Parallel Version — Enhanced from the one-node version

One-Node AlphaSort Algorithm Sketch

Divide one million records into 30-50 groups.
Read groups one by one from disk.
As each group becomes available, use a separate thread to sort it into a run using quicksort.
Merge the runs using a replacement-selection tree.
Gather records into contiguous buffer and write to disk.
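
The steps above can be sketched as follows. This is an illustration under stated assumptions, not AlphaSort's implementation: an in-memory list stands in for the disk file, Python's built-in `sorted` (Timsort) replaces quicksort, and `heapq.merge` replaces the replacement-selection tree. The function name and `num_groups` default are made up for the example, and CPython threads will not speed up the sorting itself.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def one_node_sort(records, num_groups=4):
    """Split records into groups, sort each group in its own thread
    to form runs, then merge the runs into one sorted list."""
    size = max(1, -(-len(records) // num_groups))  # ceiling division
    groups = [records[i:i + size] for i in range(0, len(records), size)]
    with ThreadPoolExecutor() as pool:
        # each group becomes a sorted run, produced by a worker thread
        runs = list(pool.map(sorted, groups))
    # k-way merge of the runs (stand-in for the replacement-selection tree)
    return list(heapq.merge(*runs))
```

In the real system the point of the threads is that sorting one group proceeds while the next group is still being read from disk.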

Key Optimizations:

Multiple threads — overlapping I/O and internal sorting.
Disk striping — to allow parallel reads and writes. (hardware: 3 controllers and 28 disks; software: a file striping layer)
Lots of memory — 256-384MB for sorting 95MB data; this affords quicksort for internal sorting

Parallel AlphaSort Algorithm Sketch

Each thread requests affinity to a processor.
Master thread responsible for all file operations; worker threads do sorting and memory-intensive operations.
Master reads records; workers quicksort the data groups.
Master merges the runs; workers gather records into contiguous buffers.
Master writes sorted buffers to disk.

Programming Issues:

Balancing workload:
- Time to read 1MB = Time to sort 1MB / # workers?
- Time to merge 1MB = Time to gather records / # workers?
Scalability: The master/worker model is adequate only for small-scale parallelism.
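
The two balance questions amount to asking how many workers keep pace with the master. A minimal sketch, with a hypothetical function name and made-up per-MB timings (the slides give no measurements):

```python
import math

def workers_to_balance(master_seconds_per_mb, worker_seconds_per_mb):
    """Smallest worker count for which the workers' combined rate
    matches the master's rate on the same amount of data."""
    return math.ceil(worker_seconds_per_mb / master_seconds_per_mb)

# Hypothetical timings: master reads 1MB in 0.25s, one worker sorts 1MB in 1.0s
needed = workers_to_balance(0.25, 1.0)
```

With fewer workers than this the master waits on sorting; with more, the workers wait on I/O, which is one reason the model stops scaling.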

Case Study: NOW-Sort

(Sorting on Shared-Nothing Clusters)

Hardware: A cluster of commodity workstations (up to 64 Sun UltraSPARC I workstations connected with a Myrinet switch (160MB/s))

NOW-Sort (cont.)

Each workstation has 64-128MB memory and 2-4 SCSI disks

NOW-Sort (cont.)

Software:

GLUnix + N copies of Solaris
- supports a simple parallel environment (process start-up, job control, etc.), but no dynamic scheduling
Shared-nothing with explicit communication
- one process per node
Active messages (essentially a restricted, lightweight remote procedure call)
- move keys + records between nodes
- 10 μs latency, 35 MB/s bandwidth

Algorithms:

Quicksort + bucket sort

Assumption about input data: The input data is assumed to be drawn from a uniform distribution and to be evenly partitioned across the processors’ local disks.

Internal Sort Options

Quicksort: Quicksort the (key-prefix, pointer) pairs.
Bucket Sort + Quicksort: While reading records into memory, simultaneously examine the high-order b bits of the keys and place the keys into the appropriate buckets; then sort each bucket individually with quicksort.
- Each bucket entry contains the most significant 32 bits of a key after the top b bits, and a pointer to the full record.
- The number of buckets (hence the value of b) is determined such that the average number of keys per bucket fits into the L cache.
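
A minimal sketch of the bucket sort + quicksort option, under assumptions not taken from the slides: records are `bytes` objects whose first 10 bytes are the key, a list index plays the role of the pointer, Python's built-in sort (Timsort) stands in for quicksort, and ties on the 32-bit prefix are resolved by following the pointer to the full key. The function name and the default `b=4` are illustrative.

```python
def bucket_quicksort(records, b=4, key_bits=80):
    """Distribute (key-prefix, pointer) pairs into 2**b buckets by the
    high-order b bits of the key, then sort each bucket on its own."""
    key_bytes = key_bits // 8
    buckets = [[] for _ in range(1 << b)]
    for ptr, rec in enumerate(records):
        k = int.from_bytes(rec[:key_bytes], "big")
        top = k >> (key_bits - b)                          # top b bits pick the bucket
        prefix = (k >> (key_bits - b - 32)) & 0xFFFFFFFF   # next 32 bits of the key
        buckets[top].append((prefix, ptr))
    out = []
    for bucket in buckets:
        # compare the prefix first; fall back to the full key only on a tie
        bucket.sort(key=lambda e: (e[0], records[e[1]][:key_bytes]))
        out.extend(records[ptr] for _, ptr in bucket)
    return out
```

Sorting small (prefix, pointer) pairs instead of whole records keeps each bucket's working set compact, which is what the cache-sizing rule for b is aiming at.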

Internal Sort Options (cont.)

Bucket Sort + Partial Radix-Sort: Distribute keys into buckets as in the previous algorithm, but use a partial radix sort instead of quicksort to sort each bucket. Two passes are performed over the keys, each with a radix size of 11 bits. Since the two passes together examine only 22 bits of the (80-b)-bit keys, a final clean-up phase is performed, in which keys that still tie on the examined bits are bubble-sorted.

Experimental data show that the last one has the best performance.
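
The partial radix sort applied within one bucket can be sketched as below. This is an illustration only: keys are plain non-negative integers, the 32-bit default key width is an assumption chosen to keep the example small (the slides describe (80-b)-bit keys), and the clean-up uses adjacent swaps in the spirit of the bubble sort mentioned above.

```python
def partial_radix_sort(keys, key_bits=32, radix_bits=11, passes=2):
    """LSD radix sort over only the top radix_bits * passes bits of each
    key, followed by a clean-up pass that orders keys tying on those bits."""
    shift0 = key_bits - radix_bits * passes  # low-order bits the passes skip
    mask = (1 << radix_bits) - 1
    for p in range(passes):
        shift = shift0 + p * radix_bits      # lower digit first, then upper digit
        bins = [[] for _ in range(1 << radix_bits)]
        for k in keys:
            bins[(k >> shift) & mask].append(k)  # stable distribution
        keys = [k for bin_ in bins for k in bin_]
    # Clean-up: remaining inversions exist only among keys that tie on the
    # examined bits, so adjacent swaps move each key a short distance.
    for i in range(1, len(keys)):
        j = i
        while j > 0 and keys[j - 1] > keys[j]:
            keys[j - 1], keys[j] = keys[j], keys[j - 1]
            j -= 1
    return keys
```

With uniformly distributed keys, few keys share the same examined bits, so the clean-up pass does little work.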

Exploring Overlap

Synchronous (i.e., no overlapping): Each processor reads, communicates, sorts, and then writes, with no overlap across the four steps.
Interleaved: A single thread alternates reading and communicating. Input data are saved in (small) send buffers, one for each destination. As soon as a buffer is full, its data is sent out. (Active messages are key to making this work.)
Threaded: One I/O thread and one communication thread.

Experimental data show:

With two disks per node, both the interleaved and the threaded versions are about 10% better than the synchronous version.
With more disks, the three versions perform nearly identically.
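
The interleaved variant can be sketched as a single loop that fills one send buffer per destination and flushes a buffer the moment it is full. Everything named here is hypothetical: `send` is a caller-supplied callback standing in for an active message, and the destination is chosen from the first key byte on the assumption of uniformly distributed keys.

```python
def interleaved_distribute(records, num_nodes, buffer_capacity, send, key_space=256):
    """Alternate 'reading' a record and 'communicating' full buffers."""
    buffers = [[] for _ in range(num_nodes)]
    for rec in records:                          # reading step
        dest = rec[0] * num_nodes // key_space   # first key byte picks the node
        buffers[dest].append(rec)
        if len(buffers[dest]) == buffer_capacity:
            send(dest, buffers[dest])            # communicating step
            buffers[dest] = []
    for dest, buf in enumerate(buffers):         # flush partially filled buffers
        if buf:
            send(dest, buf)

# Usage: collect what each node would receive
received = {}
def record_send(dest, batch):
    received.setdefault(dest, []).extend(batch)

keys = [0, 200, 100, 255, 64, 128]
interleaved_distribute([bytes([v]) for v in keys], 4, 2, record_send)
```

Small buffers keep the sends frequent, so communication is spread across the read phase instead of piling up at the end.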

NOW-Sort Optimization Summary

User-level software striping.
Use mmap instead of read, to avoid the system double-buffering problem.
Use (key-prefix, pointer) pairs in internal sorting.
Overlap I/O with CPU computation by using buckets.
Overlap I/O with communication.
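
Two of these optimizations, mapping the input file and sorting (key-prefix, pointer) pairs, can be illustrated with Python's standard `mmap` module. This is a sketch under assumptions: 100-byte records with a 10-byte key at the start, a file offset used as the pointer, and a hypothetical function name. It does not reproduce NOW-Sort's code.

```python
import mmap
import os
import tempfile

def read_keys_mmap(path, record_size=100, key_size=10):
    """Return (key-prefix, offset) pairs taken from a memory-mapped file,
    so record data is not copied into a separate read() buffer first."""
    pairs = []
    with open(path, "rb") as f:
        # length 0 maps the whole file; an empty file would raise ValueError
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for off in range(0, len(m), record_size):
                pairs.append((m[off:off + key_size], off))
    return pairs

# Usage: two 100-byte records written out of order, then sorted by key prefix
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"B" * 100 + b"A" * 100)
pairs = sorted(read_keys_mmap(tmp.name))
os.remove(tmp.name)
```

Only the short pairs are moved during sorting; the full records stay where they are until the output is gathered.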

PennySort Benchmark Winners

Sort as much as possible for less than a penny.

Price-performance metric — calculate the 5-year cost of the hardware and software, then prorate it for the elapsed time.
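
The metric can be turned around to give the time a system may run for one penny. A small sketch, assuming 365-day years; the function name and the $1000 example price are made up for illustration.

```python
def penny_time_budget(system_cost_dollars, years=5, budget_dollars=0.01):
    """Seconds of run time that budget_dollars buys when the system cost
    is spread evenly over `years` years of continuous use."""
    lifetime_seconds = years * 365 * 24 * 3600
    return budget_dollars * lifetime_seconds / system_cost_dollars

# A hypothetical $1000 system gets roughly 1577 seconds per penny
budget = penny_time_budget(1000)
```

Cheaper hardware buys proportionally more time, which is why the benchmark favors commodity machines over fast but expensive ones.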

year name system category GB
1998 PostManSort Intel/NT Daytona 1.
1998 NTsort Intel/NT Indy 1.
1999 HMsort Indy&Daytona 2.
2000 HMsort Indy&Daytona 4.
2002 DMsort Indy 12
2004 THsort AMD/Linux Daytona 10
2005 PostManSort Intel/Windows Daytona 15
2005 SheenkSort AMD/Linux Indy 40
2006 Byte-Split-Index Sort AMD/WindowsXP Daytona 32
2006 GpuTeraSort Intel P4/WindowsXP Indy 55

PennySort Notes

PostManSort: A commercial sort program for Unix. It is basically a top-down radix sort (i.e., it sorts from MSB to LSB).
NTsort: A commercial sort program included in Windows NT 5. and Windows 2000. It uses a simple two-pass algorithm: partition the input and quicksort each partition into a run, then merge all the runs with a tournament tree. No I/O overlapping is used.
HMsort: A sort program similar to NTsort, but with I/O overlapping.

PennySort Notes (cont.)

DMsort: It uses a top-down radix sort algorithm. In the first phase, the data is read into memory a segment at a time and records are placed into “bins” based on their key values. In the second phase, each bin is read into memory and sorted by a hybrid of radix sort and quicksort.
THsort, SheenkSort, Byte-Split-Index Sort: Three sorts developed by the same group; all are loosely based on the radix sort algorithm, although no details are given for any of them.
GpuTeraSort: Sorts with a GPU co-processor and implements a bitonic merge sort algorithm.

TeraByteSort Benchmark Winners

Sort 10 billion 100-byte records.

year name system category time(m)
1998 Nsort SGI Origin2000 Indy&Daytona 151
2000 Tandem FastSort Compaq Tandem Daytona 49
2000 SPsort 1952 SP cluster Indy 18

GPU Properties

Data parallelism: multi-way execution of the same instruction
Instruction-level parallelism: concurrent execution of different instructions
Dedicated memory interface: 64-byte interface to on-board video memory
Low memory latency: prefetching and pipelining are more effective since memory stalls are less likely

A Comparison:

cpu vs. gpu #cmp/cl peak mem bdwt
high-end dual-core P4 4 25.6 GFlops 6.4 GB
Nvidia GeForce 7800 GTX 96 313 GFlops 56 GB

GpuTeraSort Algorithm

Phase One:

Reader: Disks → RAM
- Asynchronously reads the input file into a (≈ 100MB) main memory buffer.
Key-Generator: RAM → Disks → RAM
- Computes the (key, rec-ptr) pairs from the input buffer.
Sorter: RAM ↔ GPU ⇔ Video RAM
- Reads and sorts the key-pointer pairs.
- The core algorithm: bitonic sorting (which fits well on a GPU)
Reorder: RAM → Disks → RAM
- Rearranges the input buffer based on the sorted key-pointer pairs to generate a sorted output buffer (a run).
Write: RAM → Disks
- Asynchronously writes the run to the disk.

Phase Two: Merges the runs with a common algorithm.
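
The bitonic sorting used by the Sorter stage can be sketched sequentially as below. This is the textbook bitonic sorting network, not GpuTeraSort's GPU code; the function name is illustrative and the input length is assumed to be a power of two.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network."""
    a = list(values)
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # distance between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    out_of_order = a[i] > a[partner] if ascending else a[i] < a[partner]
                    if out_of_order:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

For fixed k and j, every compare-exchange in the inner loop touches a different pair of elements, so all of them can run at once. That fixed, data-independent pattern is what makes the algorithm a good match for GPU data parallelism.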

Case Study: SAN Cluster Sort

(Sorting on a Cluster with a Storage-Area Network)

Hardware —

40 dual-processor Itanium2 nodes, each with 4GB of RAM, three 2GBps fibre channel adapters (to connect to the SAN), a single gigabit Ethernet interface (to connect to other nodes), and a local SCSI drive (for OS use).
60 IBM TotalStorage DS4300 RAID subsystems, each with 42 73GB 10kRPM fibre channel disk drives.
3 Brocade Silkworm 24,000 fibre channel switches.
A Force10 model E1200 Ethernet switch.

In total, the cluster contains 160GB of RAM and 140TB formatted disk storage capacity (2,560 disk drives).

Software — Linux with IBM GPFS file system, C with MPI

SCS Algorithm — First Pass

The 1TB of data is partitioned into 80 partitions (only pointers are calculated; no data movement is involved); each partition is used as input for one processor.
Each processor reads records from its input partition and distributes them into slices (i.e., bins) of approximately equal size based on key values; the slice size is chosen so that a whole slice fits in main memory (with multiple buffers), which turns out to be 768MB. (Therefore, there are 1280 slices in total.)
Each processor writes each of its 1280 slices into a separate per-node slice file. (So there are 80 files for each slice range.) These files are automatically distributed to different disks by the GPFS file system.
The processors exchange information about slice sizes using MPI_Allgather.
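
The first pass can be mimicked in a single process to show the data flow. This is a simulation under assumptions, not the MPI program: keys are small integers drawn from a known `key_space`, nested lists stand in for the slice files, and summing the per-processor counts stands in for the result every processor would hold after the all-gather. All names are hypothetical.

```python
def slice_index(key, num_slices, key_space):
    """Map a key to a slice by equal-width key ranges."""
    return key * num_slices // key_space

def first_pass(partitions, num_slices, key_space):
    """partitions: one list of integer keys per processor.
    Returns the per-processor slice 'files' and the total size of each slice."""
    slice_files = []  # slice_files[p][s] = keys processor p wrote for slice s
    for part in partitions:
        files = [[] for _ in range(num_slices)]
        for key in part:
            files[slice_index(key, num_slices, key_space)].append(key)
        slice_files.append(files)
    # Stand-in for the all-gather: combine every processor's per-slice counts
    counts = [[len(f) for f in files] for files in slice_files]
    totals = [sum(col) for col in zip(*counts)]
    return slice_files, totals

files, totals = first_pass([[1, 9, 5], [0, 8, 14, 15]], num_slices=4, key_space=16)
```

Knowing every slice's total size lets each processor work out where sorted slices belong in the output without moving any data during this pass.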
