Applications: Parallel Sorting - Lecture Notes | CSE 160, Study notes of Computer Science

Material Type: Notes; Class: Intro to Parallel Computing; Subject: Computer Science & Engineering; University: University of California - San Diego; Term: Spring 2005;

Uploaded on 03/28/2010


CSE 160 Chien, Spring 2005 Lecture #7, Slide 1

Applications: Parallel Sorting

  • Last Time

» Caches and Performance

» Threads and Caches

» Parallel Machine Structure

  • Today

» Parallel Cluster Communication

» Parallel Sorting

  • Reminders/Announcements

» Homework #2 due Thursday (April 21)

» CSE 160 Midterm, Thursday April 28, In Class

The HPVM System

  • Goals

» Enable tightly coupled and distributed clusters with high efficiency and low effort (integrated solution)

» Provide usable access through convenient standard parallel interfaces

» Deliver highest possible performance and simple programming model

Fast Messages Design

Elements

  • User-level network access
  • Lightweight protocols

» flow control, reliable delivery

» tightly-coupled link, buffer, and I/O bus management

  • Poll-based notification
  • Streaming API for efficient composition
  • Many generations 1994-

» [IEEE Concurrency, 6/97]

» [Supercomputing ’95, 12/95]

  • Related efforts: UCB AM, RWCP PM, Princeton VMMC/Shrimp, Lyon BIP => VIA standard

Improved Bandwidth

  • 20 MB/s -> 200+ MB/s (10x)

» Much of the advance is software structure: APIs and implementation

» Deliver all of the underlying hardware performance

[Figure: Performance (megabytes/sec) by year, 1995 to 1999; vertical axis 0 to 250 MB/s]

HPVM Communication Performance

  • Delivers underlying performance for small messages, endpoints are the limits
  • 100MB/s at 1K vs 60MB/s at 1000K

» >1500x improvement

[Figure: bandwidth (MB/s, 0 to 120) vs. message size (bytes, 4 to about 16K); series: FM on Myrinet, MPI on FM-Myrinet]

  • N1/2 ~ 400 Bytes
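
To make the N1/2 figure concrete, here is a minimal sketch that reads N1/2 as the usual half-power point: the message size at which delivered bandwidth reaches half of its peak. It assumes a simple linear cost model (a fixed startup time plus bytes divided by peak rate), which the slides do not state; the peak value `PEAK` is a round illustrative number, not a measurement.

```python
def transfer_time(n_bytes, startup_s, peak_bytes_per_s):
    """Time for one message under the assumed linear model: t0 + n / r_peak."""
    return startup_s + n_bytes / peak_bytes_per_s


def bandwidth(n_bytes, startup_s, peak_bytes_per_s):
    """Delivered bandwidth (bytes/s) for a message of n_bytes."""
    return n_bytes / transfer_time(n_bytes, startup_s, peak_bytes_per_s)


def half_power_point(startup_s, peak_bytes_per_s):
    """Message size at which bandwidth equals half of peak: N1/2 = t0 * r_peak."""
    return startup_s * peak_bytes_per_s


PEAK = 100e6             # assumed peak rate in bytes/s (illustrative only)
N_HALF = 400             # bytes, the N1/2 value quoted on the slide
STARTUP = N_HALF / PEAK  # startup time implied by the model (4 microseconds)

for n in (4, 400, 4096, 16384):
    mb_per_s = bandwidth(n, STARTUP, PEAK) / 1e6
    print(f"{n:6d} bytes -> {mb_per_s:6.1f} MB/s")
```

Under this model a small N1/2 means short messages already get a large share of the peak rate, which is why the slides report it alongside peak bandwidth.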

HPVM/FM on VIA

  • FM Protocol/techniques portable to Giganet VIA
  • Slightly lower performance, comparable N1/2

[Figure: bandwidth (MB/s, 0 to 90) vs. message size (bytes, 4 to about 16K); series: FM on Giganet VIA, MPI-FM on Giganet VIA]

  • N1/2 ~ 400 Bytes

Supercomputer Performance

Characteristics (11/99)

Machine                 MF/Proc      Flops/Byte   Flops/NetworkRT
Cray T3E                1200         ~2           ~2,
SGI Origin2000          500          ~0.5         ~1,
HPVM NT Supercluster    600          ~8           ~12,
IBM SP2 (4 or 8-way)    2.6-5.2GF    ~12-25       ~150-300K
Beowulf (100Mbit)       600          ~50          ~200,

Sorting

  • Postcondition: all items in order, data is distributed!
  • Movement of data can be within a node, or across nodes

Before (unsorted): 13 14 10 8 1 11 3 7 9 2 4 5 12 6

After (sorted): 1 2 3 4 5 6 7 8 9 10 11 12 13 14
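
The postcondition can be written down as a small check. This is a minimal sketch, assuming each node's data is modeled as a Python list and the nodes are listed in rank order; the function name and the three-way split of the example keys are hypothetical, since the slide does not show how the keys are placed on nodes.

```python
def is_globally_sorted(partitions):
    """True if reading the nodes' lists in rank order yields a nondecreasing sequence."""
    last = None
    for part in partitions:
        for key in part:
            if last is not None and key < last:
                return False
            last = key
    return True


# Hypothetical placement of the slide's keys on three nodes.
before = [[13, 14, 10, 8, 1], [11, 3, 7, 9, 2], [4, 5, 12, 6]]
after = [[1, 2, 3, 4, 5], [6, 7, 8, 9], [10, 11, 12, 13, 14]]

print(is_globally_sorted(before))
print(is_globally_sorted(after))
```

Note that the check is stronger than "every node is sorted locally": the largest key on one node must not exceed the smallest key on the next.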

Bucket Sort Steps

  • Divide Keys
  • Each Worker divides his/her keys and sends/receives the keys to the appropriate worker
  • Each Worker waits until it has received all of its keys (synchronization)
  • Each Worker sorts its keys locally; the concatenated sections are then in order
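
The steps above can be sketched as a single-process simulation. This is a minimal sketch, not the lecture's implementation: it assumes integer keys in a known range `[lo, hi]`, equal-width key ranges per worker, and plain lists standing in for each worker's memory. The names `bucket_of` and `parallel_bucket_sort` are hypothetical.

```python
def bucket_of(key, lo, hi, num_workers):
    """Map a key in [lo, hi] to a worker index using equal-width key ranges."""
    width = (hi - lo + 1) / num_workers
    return min(int((key - lo) / width), num_workers - 1)


def parallel_bucket_sort(local_keys, lo, hi):
    """Simulate the steps; local_keys[i] holds the keys that start on worker i."""
    p = len(local_keys)
    inbox = [[] for _ in range(p)]

    # Divide and send: each worker routes every key to the worker owning its range.
    for keys in local_keys:
        for k in keys:
            inbox[bucket_of(k, lo, hi, p)].append(k)

    # Synchronization is implicit here: the loop above has finished, so every
    # inbox is complete. On a real cluster this is the hard part.

    # Local sort: each worker sorts only the keys it received.
    return [sorted(bucket) for bucket in inbox]


# The slide's example keys, with a hypothetical initial placement on 3 workers.
start = [[13, 14, 10, 8, 1], [11, 3, 7, 9, 2], [4, 5, 12, 6]]
result = parallel_bucket_sort(start, lo=1, hi=14)
print(result)
```

Equal-width ranges balance the load only when keys are roughly uniform; with skewed keys, choosing the ranges becomes the difficult part of "Divide Keys".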

Implementing Bucket Sort

  • Divide Keys (easy, but often hard)
  • Each Worker divides his/her keys (easy) and sends/receives the keys to the appropriate worker (HARD)
  • Each Worker, receive keys
  • Wait until it has received all of its keys (synchronization) (HARD)
  • Each Worker sorts its keys locally (easy)

Cross-worker interaction

  • Threads and Locks
  • Nodes and Messages

Synchronization

  • How to know when a worker has all of its keys
  • Synchronization Messages or Flags
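
One way to answer "how does a worker know it has all of its keys" is a synchronization message: every sender finishes by sending an end-of-stream marker to every receiver, and a receiver stops once it has counted one marker per sender. The sketch below illustrates that idea with Python threads and thread-safe queues standing in for nodes and messages. It is an assumption-laden illustration, not the HPVM code; `DONE`, `worker`, and `threaded_bucket_sort` are hypothetical names.

```python
import queue
import threading

DONE = object()  # synchronization message: "this sender has no more keys for you"


def worker(rank, my_keys, inboxes, lo, hi, results):
    p = len(inboxes)
    width = (hi - lo + 1) / p

    # Send phase: route each key to the inbox of the worker owning its range.
    for k in my_keys:
        dest = min(int((k - lo) / width), p - 1)
        inboxes[dest].put(k)
    # Tell every worker (including ourselves) that we are finished sending.
    for q in inboxes:
        q.put(DONE)

    # Receive phase: all keys have arrived once one DONE per sender was seen.
    received, done_seen = [], 0
    while done_seen < p:
        item = inboxes[rank].get()
        if item is DONE:
            done_seen += 1
        else:
            received.append(item)

    results[rank] = sorted(received)  # local sort


def threaded_bucket_sort(local_keys, lo, hi):
    p = len(local_keys)
    inboxes = [queue.Queue() for _ in range(p)]  # unbounded, so sends never block
    results = [None] * p
    threads = [
        threading.Thread(target=worker, args=(r, local_keys[r], inboxes, lo, hi, results))
        for r in range(p)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(threaded_bucket_sort([[13, 14, 10, 8, 1], [11, 3, 7, 9, 2], [4, 5, 12, 6]], 1, 14))
```

The marker works because each queue preserves the order of one sender's messages, so a sender's keys always arrive before its `DONE`. The queue's internal lock plays the "Threads and Locks" role; on a cluster the same protocol would use messages between nodes.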

One more thing…

  • Data Starts on Disk and Ends on Disk (many gigabytes!)
  • Read Data from Disk
  • Divide Keys (easy)
  • Each Worker divides his/her keys (easy) and sends/receives the keys to the appropriate worker (HARD)

  • Each Worker, receive keys
  • Wait until it has received all of its keys (synchronization) (HARD)
  • Each Worker sorts its keys locally (easy)
  • Write Data to Disk

One Pass MinuteSort

(Luis Rivera UIUC, Xianan Zhang UCSD)

  • 32 HP Kayaks: 3Ware Controllers, 4 x 20GB IDE disks
  • 32 HP Netservers: 2 x 16GB SCSI disks
  • HPVM & 1Gbps Myrinet

Myrinet

  • 32 Kayaks

» 300 MHz Pentium II

» Dual Processors

» 384 MB Memory

» Four 20 GB 7200 RPM disks connected to a 3Ware Controller

  • 32 Netservers

» 400 MHz Pentium II

» Dual Processors

» 1 GB Memory

» Two SCSI 18.2 GB 7200 RPM disks

Modeling Performance for Scaleup!

  • Kayak Read Rate: 50 MB/s
  • Kayak Communication

» Send Rate: 100 MB/s

» Receive Rate: 70 MB/s

  • NetServer Communication

» Send/Receive Rate: 100 MB/s

  • In-core Sorting Rate: 189 MB/s
  • Total Writing Rate: 60 MB/s

» Kayak: 40 MB/s

» Netserver: 20 MB/s

  • Launch Time: 5 - 10 seconds
  • Total Time: < 52 s

[Figure: execution-time timeline with four phases: Launch Time; Read and Distribution Time; In-core Sort Time; Write Time and Send Results to Kayaks Time. Durations shown: 19.2 s, 10 s, 5 s, 16 s]
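
As a quick arithmetic check on the model, the four durations on the execution-time figure (19.2 s, 10 s, 5 s, 16 s) can be summed and compared with the stated total of under 52 s. The sketch assumes the phases run one after another with no overlap; which duration belongs to which phase is not recoverable from the extracted slide, so the values are left unlabeled.

```python
PHASE_SECONDS = [19.2, 10.0, 5.0, 16.0]  # durations from the figure, unlabeled
BUDGET_SECONDS = 52.0                    # "Total Time: < 52 s" from the slide

total = sum(PHASE_SECONDS)
slack = BUDGET_SECONDS - total

print(f"total = {total:.1f} s, slack = {slack:.1f} s")
print("within budget" if total < BUDGET_SECONDS else "over budget")
```

The same structure extends to a scaleup model: replace each constant with data size divided by the relevant rate from the list above, once the data size per node is known.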

Read and Distribute (hard)

[Figure labels: Kayak 1, Kayak 2, Netserver 1, Netserver 2; Read Thread, Double Buffer, Communication Thread; Myrinet Network]
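
The read thread, communication thread, and double buffer named in the figure suggest a two-stage pipeline: while one buffer is being sent, the other is being filled from disk. The following is a minimal sketch of that pattern under stated assumptions: lists stand in for buffers, an in-memory list of chunks stands in for the disk, and appending to `sent` stands in for a network send. All names are hypothetical.

```python
import queue
import threading


def read_thread(source_chunks, free, full):
    """Fill empty buffers with data and hand them to the communication thread."""
    for chunk in source_chunks:
        buf = free.get()   # block until one of the two buffers is empty
        buf[:] = chunk     # stand-in for reading a block from disk
        full.put(buf)
    full.put(None)         # end-of-input marker


def comm_thread(free, full, sent):
    """Drain filled buffers and return them to the reader for reuse."""
    while True:
        buf = full.get()
        if buf is None:
            break
        sent.extend(buf)   # stand-in for sending the keys over the network
        buf.clear()
        free.put(buf)


def run_pipeline(source_chunks):
    free, full = queue.Queue(), queue.Queue()
    for _ in range(2):     # exactly two buffers: the "double buffer"
        free.put([])
    sent = []
    reader = threading.Thread(target=read_thread, args=(source_chunks, free, full))
    sender = threading.Thread(target=comm_thread, args=(free, full, sent))
    reader.start()
    sender.start()
    reader.join()
    sender.join()
    return sent


print(run_pipeline([[3, 1], [2], [5, 4]]))
```

With only two buffers the reader can run at most one buffer ahead of the sender, so disk reads and network sends overlap without unbounded memory use.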

Perspective

  • Local Sort in a Small Fraction of the time
  • Global Sort

» Sending + Receiving => Sorting amongst the buckets!

» + Disk I/O

» => 90% of the time

  • Communication is expensive
  • Variable numbers of senders and receivers are expensive
  • Synchronization is expensive

Summary

  • Clusters and High Performance Computing
  • Heroic Sorting

» Gigabytes disk to disk in a minute

» Easy Parts

  • Divide the Work
  • Read and Write Data (but slow)

» Hard Parts

  • Communication, Many to Many
  • Synchronization
  • Initial Exposure: Scaling and Speedup

» Constant time Benchmark