MPI vs. OpenMP: Parallel Programming with MPI and OpenMP in CMSC 714 - Prof. Alan L. Sussm, Study notes of Computer Science

These notes cover the use of the Message Passing Interface (MPI) and OpenMP (Open Multi-Processing) in parallel programming, with examples from the CMSC 714 course at the University of Maryland. They describe the differences between MPI and OpenMP, the advantages of using both, and specific applications in hydrology, computational chemistry, linear algebra, seismic processing, and computational fluid dynamics.

Typology: Study notes

Pre 2010

Uploaded on 02/13/2009

CMSC 714, Lecture 5

MPI vs. OpenMP and Titanium

Alan Sussman

CMSC 714, Fall05 - Alan Sussman & Jeffrey K. Hollingsworth

Notes

• First programming assignment coming soon
  • Slight change from original plan: you’ll write one program, first using either OpenMP or MPI, then the other

• Lorin Hochstein will talk at the end of class today on his study of all of you writing parallel programs

• First, questions on OpenMP and UPC
  • Directives vs. language extensions


OpenMP + MPI

• Some applications can take advantage of both message passing and threads
  • The question is how to obtain the best overall performance without too much programming difficulty
  • Choices are all MPI, all OpenMP, or both
    • For both, a common option is to parallelize the outer loop with message passing and the inner loop with directives that generate threads

• Applications studied:
  • Hydrology – CGWAVE
  • Computational chemistry – GAMESS
  • Linear algebra – matrix multiplication and QR factorization
  • Seismic processing – SPECseis
  • Computational fluid dynamics – TLNS3D
  • Computational physics – CRETIN
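
The common hybrid option above (outer loop split by message passing, inner loop split by threads) can be pictured as a two-level index decomposition. The following is a minimal plain-Python sketch of that idea only; it makes no MPI or OpenMP calls, and the names `block_range` and `hybrid_owner_map` are hypothetical helpers introduced for illustration. It assumes near-equal contiguous blocks at both levels.

```python
def block_range(n, parts, index):
    """Half-open range [lo, hi) of block `index` when n items are cut into `parts` near-equal blocks."""
    base, extra = divmod(n, parts)
    lo = index * base + min(index, extra)
    hi = lo + base + (1 if index < extra else 0)
    return lo, hi


def hybrid_owner_map(n_outer, n_inner, n_ranks, n_threads):
    """Map every (i, j) iteration to the (rank, thread) pair that would run it.

    Outer iterations are split across ranks (the message-passing level);
    inner iterations are split across threads inside each rank.
    """
    owners = {}
    for rank in range(n_ranks):
        o_lo, o_hi = block_range(n_outer, n_ranks, rank)
        for thread in range(n_threads):
            i_lo, i_hi = block_range(n_inner, n_threads, thread)
            for i in range(o_lo, o_hi):
                for j in range(i_lo, i_hi):
                    owners[(i, j)] = (rank, thread)
    return owners


if __name__ == "__main__":
    owners = hybrid_owner_map(n_outer=6, n_inner=8, n_ranks=3, n_threads=4)
    print(owners[(0, 0)], owners[(5, 7)])
```

Every iteration gets exactly one owner, so the total work is covered once; in a real hybrid code the rank would come from the MPI runtime and the inner split from an OpenMP loop directive.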


Types of parallelism in the codes

• Message passing parallelism (MPI)
  • Parametric – coarse-grained outer loop, essentially task parallel
  • Structured domains – domain decomposition with local operations – structured and unstructured grids
  • Direct solvers – linear algebra, lots of communication and load balancing required – message passing works well for large systems of equations

• Shared memory parallelism (OpenMP)
  • Statically scheduled parallel loops – one large, or several smaller loops, non-nested parallel
  • Parallel regions – merge loops into one parallel region to reduce overhead of directives
  • Dynamic load balanced – when static scheduling leads to load imbalance from irregular task sizes
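
To see why irregular task sizes favor dynamic scheduling, the sketch below compares the finish time of a statically scheduled loop with a dynamically scheduled one. It is a simplified plain-Python model, not OpenMP code: it assumes static scheduling hands each thread one contiguous block of near-equal iteration count, that dynamic scheduling gives the next iteration to whichever thread is idle first, and that scheduling overhead is zero. The function names and the cost list are made up for illustration.

```python
import heapq


def static_makespan(costs, n_threads):
    """Finish time when iterations are cut into contiguous blocks of near-equal count."""
    base, extra = divmod(len(costs), n_threads)
    finish, start = 0, 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        finish = max(finish, sum(costs[start:start + size]))
        start += size
    return finish


def dynamic_makespan(costs, n_threads):
    """Finish time when each idle thread grabs the next iteration in loop order."""
    idle_at = [0] * n_threads          # time at which each thread becomes idle
    heapq.heapify(idle_at)
    for c in costs:
        t = heapq.heappop(idle_at)     # earliest-idle thread takes the iteration
        heapq.heappush(idle_at, t + c)
    return max(idle_at)


if __name__ == "__main__":
    irregular = [8, 1, 1, 1, 1, 1, 1, 1]   # one expensive iteration, seven cheap ones
    print(static_makespan(irregular, 2), dynamic_makespan(irregular, 2))
```

With uniform costs the two schedules finish at the same time in this model, which is why static scheduling is the cheaper default when iterations are regular.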


CGWAVE

• Finite elements: MPI parameter space evaluation in the outer loop, OpenMP sparse linear equation solver in the inner loops

• Speedup from using 2 levels of parallelism makes it possible to model larger bodies of water in a reasonable amount of time

• Master-worker strategy for dynamic load balancing in the MPI part/component

• Solver for each component solves a large sparse linear system, parallelized with OpenMP

• On SGI Origin 2000 (distributed shared memory machine), use first touch rule to migrate data for each component to the processor that uses it

• Performance results show that the best performance is obtained using both MPI and OpenMP, with a combination of MPI workers and OpenMP threads that depends on the problem/grid size
  • And for load balancing, a lot fewer MPI workers than components
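
The master-worker strategy above can be sketched as a shared work queue from which workers pull components on demand. This is a hedged plain-Python illustration in which threads stand in for MPI workers; it is not CGWAVE code, and `run_master_worker` and `toy_solve` are hypothetical names. Using far fewer workers than components, as the slide notes, is what lets the queue even out uneven solve times.

```python
import queue
import threading


def toy_solve(k):
    """Stand-in for a per-component solver; cost grows with k."""
    return sum(i * i for i in range(k))


def run_master_worker(components, n_workers, solve):
    """Hand out components on demand and return {component: result}."""
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()
            if item is None:              # sentinel: no more work for this worker
                return
            value = solve(item)
            with lock:                    # protect the shared results dictionary
                results[item] = value

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for c in components:                  # the master enqueues all components
        tasks.put(c)
    for _ in threads:                     # one sentinel per worker
        tasks.put(None)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_master_worker(list(range(1, 9)), 3, toy_solve))
```

Each component is solved exactly once regardless of which worker picks it up, so the result does not depend on the timing of the workers.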


Notes

• First programming assignment is on web page
  • Still a work in progress
  • You will get email telling you which parallel version to do first, OpenMP or MPI
  • Still some issues with MPI on the cluster, but we’re working on them

• Not here on Tuesday
  • Jeff Hollingsworth will teach the class, on HPF
  • No questions for next time, since you’ve already sent me HPF questions …

• Need volunteers to present papers
  • Starting with Sisal programming language paper, 1 week from today


GAMESS

• Computational chemistry (molecular dynamics): MPI across the cluster, OpenMP within each node

• Built on top of the Global Arrays package, for distributed array operations
  • Which in turn uses MPI (paper says PVM) and OpenMP

• Linear algebra solvers mainly use OpenMP for dynamic scheduling and load balancing

• MPI versions of parts of the code are complex, but can provide higher performance for large problems

• Performance results on a “medium” sized problem from SPEC (Standard Performance Evaluation Corp.) are for a small system (four 8-processor Alpha machines) connected by Memory Channel


Linear algebra

• Hybrid parallelism with MPI for scalability and OpenMP for load balancing, for matrix multiplication (MM) and QR factorization

• On IBM SP system with multiple 4-processor nodes

• Studies tradeoffs of the hybrid approach for linear algebra algorithms vs. only using MPI (running 4 MPI processes per node)

• Use OpenMP for load balancing and decreasing communication costs within a node

• Also helps to hide communication latency behind other operations, which is important for overall performance

• QR factorization results on “medium” sized matrices show that adaptive load balancing is better than dynamic loop scheduling within a node


SPECseis

• For gas and oil exploration
  • Uses FFTs and finite-difference solvers

• Original message passing version (in PVM) is SPMD; the OpenMP version starts serial, then starts an SPMD parallel section
  • In OpenMP version, shared data is only boundaries, everything else local (like PVM version)
  • OpenMP calls are all in Fortran (no C OpenMP compiler), which caused difficulties for privatizing C global data, and thread issues (binding to processors, OS calls)

• Code scales equally well for PVM and OpenMP, on SGI Power Challenge (a DSM machine)
  • This is a weak argument, because of likely poor PVM message passing performance (in general, and especially on DSM systems)
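
The point that only boundary data is shared while everything else stays local can be illustrated with a one-dimensional finite-difference sweep split into partitions. The sketch below is a toy plain-Python model, not SPECseis code: the 3-point averaging stencil, the fixed end points, and the names `smooth_global` and `smooth_decomposed` are assumptions chosen for illustration, and the input is assumed to have at least two points. Each partition reads one ghost value from each neighbor and otherwise touches only its own slice.

```python
def smooth_global(u):
    """One 3-point averaging sweep over the whole array; end points are held fixed."""
    interior = [(u[i - 1] + u[i] + u[i + 1]) / 3 for i in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]


def smooth_decomposed(u, n_parts):
    """Same sweep, computed partition by partition with one ghost value per side."""
    n = len(u)
    bounds = [n * p // n_parts for p in range(n_parts + 1)]
    out = []
    for p in range(n_parts):
        lo, hi = bounds[p], bounds[p + 1]
        local = u[lo:hi]                               # data private to this partition
        left_ghost = u[lo - 1] if lo > 0 else None     # boundary value from left neighbor
        right_ghost = u[hi] if hi < n else None        # boundary value from right neighbor
        for k, x in enumerate(local):
            left = local[k - 1] if k > 0 else left_ghost
            right = local[k + 1] if k + 1 < len(local) else right_ghost
            if left is None or right is None:
                out.append(x)                          # global end point: held fixed
            else:
                out.append((left + x + right) / 3)
    return out


if __name__ == "__main__":
    u = [0.0, 3.0, 6.0, 3.0, 0.0, 9.0, 3.0]
    print(smooth_decomposed(u, 3) == smooth_global(u))
```

The decomposed sweep reproduces the global one, and the only values that cross partition boundaries are the ghost cells, which is the data that would be shared (OpenMP) or sent as messages (PVM/MPI).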


TLNS3D

• CFD in Fortran77, uses MPI across grids and OpenMP to parallelize each grid

• Multiple, non-overlapping grids/blocks that exchange data at boundaries periodically

• Static block assignment to processors: divide blocks into groups of about equal number of grid points for each processor

• Master-worker execution model for MPI level, then parallelize 3D loops for each block with OpenMP
  • Many loops, so need to be careful about affinity of data objects to processors across loops

• Hard to balance MPI workers vs. OpenMP threads per block: tradeoff between minimizing load imbalance and communication and synchronization cost

• Seems to work best on DSMs, but can be done well on distributed memory systems

• No performance results!
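
Static block assignment by grid-point count can be approximated with a simple greedy heuristic. The notes do not say which method TLNS3D uses, so the sketch below is an assumption: it places the largest block first, always on the currently least-loaded processor. The block names, point counts, and the function `assign_blocks` are hypothetical.

```python
import heapq


def assign_blocks(block_points, n_procs):
    """Greedy static assignment: largest block first, to the least-loaded processor.

    block_points maps block name -> number of grid points.
    Returns (groups, loads): the blocks per processor and the grid points per processor.
    """
    heap = [(0, p) for p in range(n_procs)]        # (current load, processor id)
    groups = [[] for _ in range(n_procs)]
    for block, points in sorted(block_points.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)              # least-loaded processor so far
        groups[p].append(block)
        heapq.heappush(heap, (load + points, p))
    loads = [0] * n_procs
    for load, p in heap:
        loads[p] = load
    return groups, loads


if __name__ == "__main__":
    blocks = {"A": 900, "B": 700, "C": 500, "D": 400, "E": 300, "F": 200}
    print(assign_blocks(blocks, 3))
```

Because the assignment is computed once up front, it balances grid points but cannot react to blocks whose cost per point differs, which is one reason the choice of MPI workers vs. OpenMP threads per block is hard to tune.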

CRETIN

• Physics application with multiple levels of message passing and thread parallelism

• Ported onto both a distributed memory system (1464 4-processor nodes) and a DSM (large SGI Origin 2000)

• Complex structure, with 2 parts discussed
  • Atomic kinetics – multiple zones with lots of computation per zone – maps to either MPI or OpenMP
    • Load balancing across zones is the problem; requires a complex dynamic algorithm that benefits both versions
  • Radiation transport – mesh sweep across multiple zones, suitable for both MPI and OpenMP
    • Two MPI options to parallelize, which one works best depends on problem size; one needs a transpose operation for the MPI version

• No performance results


Applications

• AMR3D
  • Adaptive mesh refinement Poisson solver
  • Multiple nested grids cover the domain, and can change over time
  • Need dynamic load balancing to keep number of points per process near equal

• EM3D
  • Kernel from application that models EM wave propagation through objects in 3D
  • Turns into a bipartite grid/graph of electric and magnetic field nodes
  • Each point is computed as a function of its neighbors, alternating E and M nodes

• Overall performance of Titanium code is comparable to C/C++/Fortran code versions, for small problems on Sparc and Pentium machines
  • They don’t do a good job of comparing against other systems because of the weaknesses of the other compilers
  • EM3D shows perfect speedup, and AMR3D shows some parallel speedup, on 8-way Sparc SMP; absolute times are not shown (probably worse for parallel versions only running on 1 processor compared to sequential versions)
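
The alternating E/M update in EM3D can be sketched on a tiny bipartite graph. This is a plain-Python illustration rather than Titanium code, and the update rule (each node subtracts a weighted sum of its neighbors' values) is an assumption chosen for the example; the notes only say that each point is a function of its neighbors. The name `em3d_step` and the small graph are hypothetical.

```python
def em3d_step(e_values, m_values, e_deps, m_deps):
    """One alternating sweep over a bipartite graph of E and M nodes.

    e_deps[i] lists (m_index, weight) pairs feeding E node i;
    m_deps[j] lists (e_index, weight) pairs feeding M node j.
    E nodes are updated from the old M values, then M nodes from the new E values.
    """
    new_e = [e - sum(w * m_values[j] for j, w in deps)
             for e, deps in zip(e_values, e_deps)]
    new_m = [m - sum(w * new_e[i] for i, w in deps)
             for m, deps in zip(m_values, m_deps)]
    return new_e, new_m


if __name__ == "__main__":
    e = [1.0, 2.0]
    m = [0.5, 1.5]
    e_deps = [[(0, 0.5)], [(0, 1.0), (1, 2.0)]]
    m_deps = [[(0, 2.0)], [(1, 1.0)]]
    print(em3d_step(e, m, e_deps, m_deps))
```

Within each half-step every node reads only nodes of the other kind, so all E updates (and then all M updates) are independent of one another, which is what makes the kernel easy to parallelize.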