CSE 160 Chien, Spring 2005 Lecture #7, Slide 1
Applications: Parallel Sorting
» Caches and Performance
» Threads and Caches
» Parallel Machine Structure
» Parallel Cluster Communication
» Parallel Sorting
» Homework #2 due Thursday (April 21)
» CSE 160 Midterm, Thursday April 28, In Class
The HPVM System
- Goals
» Enable tightly coupled and distributed clusters with high efficiency and low effort (integrated solution)
» Provide usable access through convenient standard parallel interfaces
» Deliver the highest possible performance and a simple programming model
Fast Messages Design
Elements
- User-level network access
- Lightweight protocols
» flow control, reliable delivery
» tightly-coupled link, buffer, and I/O bus management
- Poll-based notification
- Streaming API for efficient composition
- Many generations 1994-
» [IEEE Concurrency, 6/97]
» [Supercomputing ’95, 12/95]
- Related efforts: UCB AM, RWCP PM, Princeton VMMC/Shrimp, Lyon BIP => VIA standard
Improved Bandwidth
- 20 MB/s -> 200+ MB/s (10x)
» Much of the advance is software structure: APIs and implementation
» Deliver all of the underlying hardware performance
[Figure: performance (MB/s, axis 0 to 250) by year, 1995 to 1999]
HPVM Communication Performance
- Delivers underlying performance for small messages; endpoints are the limits
- 100 MB/s at 1K vs 60 MB/s at 1000K
» >1500x improvement
[Figure: bandwidth (MB/s, axis 0 to 120) vs. message size (bytes); series: FM on Myrinet, MPI on FM-Myrinet]
HPVM/FM on VIA
- FM Protocol/techniques portable to Giganet VIA
- Slightly lower performance, comparable N_1/2 (half-power message size)
[Figure: bandwidth (MB/s, axis 0 to 90) vs. message size (bytes); series: FM on Giganet VIA, MPI-FM on Giganet VIA]
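The half-power point referred to above is the message size at which delivered bandwidth reaches half of its peak. A minimal sketch of how it falls out of a simple linear cost model follows. The model (a fixed per-message overhead plus a per-byte cost) and the example numbers are assumptions for illustration, not measurements from these slides.

```python
import math

def bandwidth(n_bytes, overhead_s, peak_bytes_per_s):
    """Delivered bandwidth for one message under an assumed linear model:
    time = fixed overhead + n_bytes / peak bandwidth."""
    return n_bytes / (overhead_s + n_bytes / peak_bytes_per_s)

def half_power_size(overhead_s, peak_bytes_per_s):
    """Message size at which bandwidth() equals half of the peak.
    Solving n / (t0 + n / r) = r / 2 for n gives n = t0 * r."""
    return overhead_s * peak_bytes_per_s

# Hypothetical parameters: 10 microseconds of overhead, 100 MB/s peak.
t0, r = 10e-6, 100e6
n_half = half_power_size(t0, r)  # roughly 1000 bytes
half_peak_reached = math.isclose(bandwidth(n_half, t0, r), r / 2)
```

In this model, lowering the per-message overhead is the only way to shrink the half-power size, which is why short-message performance depends so heavily on the software path.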
Supercomputer Performance Characteristics (11/99)
System                  MF/Proc     Flops/Byte   Flops/NetworkRT
Cray T3E                1200        ~2           ~2,
SGI Origin2000          500         ~0.5         ~1,
HPVM NT Supercluster    600         ~8           ~12,
IBM SP2 (4 or 8-way)    2.6-5.2GF   ~12-25       ~150-300K
Beowulf (100Mbit)       600         ~50          ~200,
Sorting
- Postcondition: all items in order, data is distributed!
- Movement of data can be within a node, or across nodes
Before: 13 14 10 8 1 11 3 7 9 2 4 5 12 6
After: 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Bucket Sort Steps
- Divide Keys
- Each Worker divides its keys and sends them to (and receives keys from) the appropriate worker
- Each Worker waits until it has received all of its keys (synchronization)
- Each Worker sorts its keys locally; the concatenated sections are then in order
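The steps above can be sketched in a few lines. This is a sequential simulation, not a parallel implementation: the function name, the `splitters` parameter, the three-worker layout, and the splitter values are all assumptions chosen for illustration. The keys are the fourteen from the Sorting slide.

```python
from bisect import bisect_left

def bucket_sort(per_worker_keys, splitters):
    """Simulate bucket sort for len(splitters) + 1 workers.

    per_worker_keys: one list of unsorted keys per worker (initial layout).
    splitters: ascending boundaries; worker i owns every key k with
               splitters[i-1] < k <= splitters[i] (assumed convention).
    Returns one sorted list per worker.
    """
    num_workers = len(splitters) + 1
    inbox = [[] for _ in range(num_workers)]
    # Divide keys: each worker routes every key to the worker that owns it.
    for keys in per_worker_keys:
        for k in keys:
            inbox[bisect_left(splitters, k)].append(k)
    # Synchronization is implicit here: the loop above has finished,
    # so every worker already holds all of its keys.
    # Local sort: each worker sorts only its own bucket.
    return [sorted(bucket) for bucket in inbox]

# Hypothetical initial layout of the slide's keys across three workers.
initial = [[13, 14, 10, 8, 1], [11, 3, 7, 9, 2], [4, 5, 12, 6]]
buckets = bucket_sort(initial, splitters=[5, 10])
# buckets == [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14]]
```

Concatenating the buckets in worker order yields the fully sorted sequence without any merge step, because the splitters guarantee that every key on worker i is no larger than any key on worker i+1.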
Implementing Bucket Sort
- Divide Keys (easy, but often hard)
- Each Worker divides his/her keys (easy) and sends/receives the keys to the appropriate worker (HARD)
- Each Worker, receive keys
- Wait until it has received all of its keys (synchronization) (HARD)
- Each Worker sorts its keys locally (easy)
Cross-worker interaction
- Threads and Locks
Synchronization
- How to know when a worker has all of its keys
- Synchronization Messages or Flags
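One way to handle the two hard parts with threads, locks, and flags is sketched below. It is an illustration under assumptions, not the course's implementation: each worker is a thread, cross-worker delivery is a method call guarded by a lock, and a hypothetical `DONE` flag from every sender tells a receiver that it has all of its keys. The names `Worker`, `DONE`, and `parallel_bucket_sort` are invented for this sketch.

```python
import threading
from bisect import bisect_left

DONE = object()  # hypothetical end-of-stream flag sent by every worker

class Worker:
    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.keys = []
        self.done_count = 0
        self.cond = threading.Condition()  # a lock plus wait/notify

    def receive(self, item):
        """Called by sender threads; the lock protects keys and done_count."""
        with self.cond:
            if item is DONE:
                self.done_count += 1
                self.cond.notify_all()
            else:
                self.keys.append(item)

    def wait_and_sort(self):
        """Block until every worker has sent DONE, then sort locally."""
        with self.cond:
            self.cond.wait_for(lambda: self.done_count == self.num_workers)
            return sorted(self.keys)

def parallel_bucket_sort(per_worker_keys, splitters):
    """One thread per worker; expects len(splitters) + 1 key lists."""
    n = len(per_worker_keys)
    workers = [Worker(n) for _ in range(n)]
    results = [None] * n

    def run(i):
        for k in per_worker_keys[i]:          # divide and send
            workers[bisect_left(splitters, k)].receive(k)
        for w in workers:                     # flag: no more keys from worker i
            w.receive(DONE)
        results[i] = workers[i].wait_and_sort()

    threads = [threading.Thread(target=run, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each sender delivers its flag only after all of its keys, a receiver that has counted one flag per worker cannot be missing a key. On a real cluster the method call would be a network message, but the counting argument stays the same.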
One more thing…
- Data Starts on Disk and Ends on Disk (many gigabytes!)
- Read Data from Disk
- Divide Keys (easy)
- Each Worker divides his/her keys (easy) and sends/receives the keys to the appropriate worker (HARD)
- Each Worker, receive keys
- Wait until it has received all of its keys (synchronization) (HARD)
- Each Worker sorts its keys locally (easy)
- Write Data to Disk
One Pass MinuteSort
(Luis Rivera UIUC, Xianan Zhang UCSD)
[Figure: Kayak and Netserver nodes]
- 32 HP Kayaks: 3Ware controllers, 4 x 20GB IDE disks
- 32 HP Netservers: 2 x 16GB SCSI disks
- HPVM & 1Gbps Myrinet
Myrinet
- 32 Kayaks
» 300 MHz Pentium II
» Dual processors
» 384 MB memory
» Four 20 GB 7200 RPM disks connected to a 3Ware controller
- 32 Netservers
» 400 MHz Pentium II
» Dual processors
» 1 GB memory
» Two 18.2 GB 7200 RPM SCSI disks
Modeling Performance for Scaleup!
- Kayak read rate: 50 MB/s
- Kayak communication
» Send rate: 100 MB/s
» Receive rate: 70 MB/s
- Netserver communication
» Send/receive rate: 100 MB/s
- In-core sorting rate: 189 MB/s
- Total writing rate: 60 MB/s
» Kayak: 40 MB/s
» Netserver: 20 MB/s
- Launch time: 5-10 seconds
- Total time: < 52 s
[Figure: execution time broken into phases (launch; read and distribution; in-core sort; write and send results to Kayaks); bar labels: 19.2s, 10s, 5s, 16s]
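As a quick consistency check on the model, the four phase durations labeled in the execution-time figure can be added up and compared with the stated total. The sketch assumes the phases run back to back with no overlap. It uses only the sum, because the extracted text does not show which duration belongs to which phase.

```python
def total_time(phase_seconds):
    """Total run time, assuming the phases run back to back with no overlap."""
    return sum(phase_seconds)

# Durations labeled in the figure; their mapping to phases is not assumed.
phases = [19.2, 10.0, 5.0, 16.0]
budget = 52.0  # "Total Time: < 52 s" from the model

total = total_time(phases)        # about 50.2 seconds
within_budget = total < budget    # True
```

The sum comes out under the 52 s bound, and both sit within the one-minute limit implied by the MinuteSort name.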
Read and Distribute (hard)
[Figure: Kayak 1, Kayak 2, Netserver 1, and Netserver 2 on the Myrinet network; labeled components: read thread, double buffer, communication thread]
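The read thread, communication thread, and double buffer named in the figure can be sketched as a two-buffer producer/consumer pair. This is a minimal illustration under assumptions: `source` is any object with a `readinto` method (an in-memory stream stands in for the disk), and `send` is a hypothetical callback standing in for the network send. Neither name is an HPVM or Fast Messages API.

```python
import io
import queue
import threading

def read_and_distribute(source, send, chunk_size=4):
    """Overlap reading and sending using exactly two buffers."""
    free = queue.Queue()   # buffers ready to be filled by the read thread
    full = queue.Queue()   # filled buffers waiting for the communication thread
    for _ in range(2):     # two buffers in total: the "double buffer"
        free.put(bytearray(chunk_size))

    def read_thread():
        while True:
            buf = free.get()              # blocks while both buffers are in use
            n = source.readinto(buf)
            if not n:                     # end of data
                full.put(None)
                return
            full.put((buf, n))

    def communication_thread():
        while True:
            item = full.get()
            if item is None:
                return
            buf, n = item
            send(bytes(buf[:n]))          # copy out, then recycle the buffer
            free.put(buf)

    threads = [threading.Thread(target=read_thread),
               threading.Thread(target=communication_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Usage with an in-memory stream in place of a disk file.
sent = []
read_and_distribute(io.BytesIO(b"abcdefghij"), sent.append, chunk_size=4)
# sent == [b"abcd", b"efgh", b"ij"]
```

While the communication thread sends one buffer, the read thread can fill the other, so disk and network stay busy at the same time. With only two buffers, the faster side simply blocks until the slower side catches up.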
Perspective
- Local sort takes a small fraction of the time
» Sending + Receiving => Sorting amongst the buckets!
» + Disk I/O
» => 90% of the time
- Communication is expensive
- Variable numbers of senders and receivers are expensive
- Synchronization is expensive
Summary
- Clusters and High Performance Computing
- Heroic Sorting
» Gigabytes disk to disk in a minute
» Easy Parts
- Divide the Work
- Read and Write Data (but slow)
» Hard Parts
- Communication, Many to Many
- Synchronization
- Initial Exposure: Scaling and Speedup
» Constant time Benchmark