Measuring Machine Performance in Parallel Computing: Focus on CSE 160 Chien's Lecture

A set of lecture slides from CSE 160 (Chien), Spring 2005, a course on parallel computing. The slides cover benchmarks used to measure machine performance, specifically on parallel machines. They discuss benchmarks such as the Livermore Loops, Linpack, and the NAS Parallel Benchmarks, and explain why benchmarks matter: they measure how many aspects of a machine perform together, including its clock rate, CPU structure, ALUs, caches, memory, networks, storage, and compilers. The slides also describe how Linpack is used to rank the Top 500 supercomputer sites list.

CSE 160 Chien, Spring 2005 Lecture #11, Slide 1

Understanding and Measuring Speedup

  • Last Time » Midterm Summary » Definitions of Speedup » Amdahl’s Law » Gustafson’s Scaled and Fixed Time Speedup
  • Today » Benchmarks » Measuring Parallelism » Machine and Application examples
  • Reminders/Announcements » Homework #3 will be out tomorrow; Due date is Monday, May 16th in Section » Homework #2 grading is still in progress

Using Programs to Measure Machine Performance

  • Speedup measures performance of an individual program on a particular machine » Speedup cannot be used to - Compare different algorithms on the same computer - Compare the same algorithm on different computers
  • Benchmarks are representative programs which can be used to compare performance of machines » Integrate many aspects of the machines - Clock rate, CPU structure, ALUs, caches, memory, networks, storage, compilers, etc.
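A worked example of why speedup alone cannot compare machines: speedup is taken relative to each machine's own serial time, so two machines can report the same speedup while one finishes twice as fast. The timings below are hypothetical, picked only to keep the arithmetic simple.

```python
def speedup(t_serial, t_parallel):
    """Speedup S(p) = T(1) / T(p), relative to the same machine's serial time."""
    return t_serial / t_parallel

# Hypothetical timings (seconds) for one program on two 8-processor machines.
machine_a = {"t1": 100.0, "t8": 20.0}
machine_b = {"t1": 50.0, "t8": 10.0}

s_a = speedup(machine_a["t1"], machine_a["t8"])
s_b = speedup(machine_b["t1"], machine_b["t8"])
print(f"speedup A = {s_a}, speedup B = {s_b}")  # both 5.0

# Equal speedups, yet B finishes in half the time. Comparing absolute run
# times of the same program on both machines (what a benchmark does)
# exposes the difference that speedup hides.
print(f"B is {machine_a['t8'] / machine_b['t8']}x faster than A on 8 processors")
```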

Benchmarks Used for Parallel Machines

  • The Livermore Loops
  • The “PACKS” (Linpack, LAPACK, ScaLAPACK, etc.)
  • The NAS Parallel Benchmarks
  • The Perfect Club
  • ParkBENCH
  • SLALOM, HINT

The Livermore Loops

  • Set of 24 Fortran DO loops extracted from operational codes at LLNL » Each one is literally a single loop nest (multiple loops inside of each other) » Different structures of iteration (1D, 2D, 3D arrays) » Different structures of dependences (all parallel, forward with distance 1, forward with distance 2, forward with distance 8, etc.) » All different types of data reference (and therefore locality) structures
  • => Originated the use of MFLOP/s for performance » Performance statistics reported: arithmetic, harmonic, geometric means, …
  • http://www.netlib.org/benchmark/livermore
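Because results are reported as several different means of per-loop MFLOP/s rates, the choice of mean matters. This sketch uses made-up rates for four loops; it shows that the three means differ, always in the order harmonic ≤ geometric ≤ arithmetic, and that when every loop does the same number of flops the harmonic mean equals total flops divided by total time.

```python
import math

def arithmetic_mean(rates):
    return sum(rates) / len(rates)

def harmonic_mean(rates):
    return len(rates) / sum(1.0 / r for r in rates)

def geometric_mean(rates):
    return math.exp(sum(math.log(r) for r in rates) / len(rates))

# Hypothetical MFLOP/s rates for four loops; not real Livermore results.
rates = [400.0, 100.0, 50.0, 200.0]

print(f"arithmetic {arithmetic_mean(rates):.1f}")
print(f"geometric  {geometric_mean(rates):.1f}")
print(f"harmonic   {harmonic_mean(rates):.1f}")

# Assume every loop performs the same number of flops: the aggregate rate
# (total flops / total time) then equals the harmonic mean.
flops_per_loop = 1e6
total_time = sum(flops_per_loop / (r * 1e6) for r in rates)  # seconds
print(f"aggregate  {len(rates) * flops_per_loop / total_time / 1e6:.1f}")
```

The arithmetic mean is pulled up by the fastest loop, while the harmonic mean is pulled down by the slowest one. That is why a single "average MFLOP/s" figure needs its mean named.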

LinPack

  • Linear algebra routines available in both C and Fortran
  • The benchmarks solve a system of linear equations using Gaussian elimination and report performance in MFLOP/s, GFLOP/s, or TFLOP/s
  • The core of Linpack is a subroutine ("saxpy" in the single-precision version, "daxpy" in the double-precision version) that performs the inner loop of frequent matrix operations:

y(i) = y(i) + a * x(i)

  • The standard version operates on 100x100 matrices; there are also versions for 300x300 and 1000x1000 matrices, with different optimization rules. The largest runs are huge problems: the recent record holder solved a 933,887 x 933,887 system.
  • Originator: Jack Dongarra, Univ. of Tennessee
  • http://www.netlib.org/benchmark/hpl/
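The following is a minimal Python sketch, offered as an illustration and not as the Linpack or HPL code. It performs Gaussian elimination with partial pivoting, with every row update expressed as the daxpy operation y(i) = y(i) + a * x(i). The rate uses the operation count conventionally quoted for the Linpack benchmark, 2/3 n^3 + 2 n^2. The helper names `daxpy`, `solve`, and `mflops` are hypothetical. The printed MFLOP/s mostly measures the Python interpreter, not the hardware.

```python
import random
import time

def daxpy(a, x, y):
    """In place: y(i) = y(i) + a * x(i) for every i (the Linpack inner loop)."""
    for i in range(len(y)):
        y[i] += a * x[i]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.
    A (list of row lists) and b (list) are overwritten."""
    n = len(A)
    for k in range(n):
        # Partial pivoting: swap the row with the largest |A[i][k]| into row k.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            # Eliminate below the pivot: row_i(k:) += (-m) * row_k(k:), a daxpy.
            row = A[i][k:]
            daxpy(-m, A[k][k:], row)
            A[i][k:] = row
            b[i] -= m * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / A[i][i]
    return x

def mflops(n, seconds):
    """Rate using the conventional Linpack operation count 2/3 n^3 + 2 n^2."""
    return (2.0 / 3.0 * n**3 + 2.0 * n**2) / seconds / 1e6

if __name__ == "__main__":
    n = 100  # the standard Linpack problem size
    random.seed(1)
    A = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [sum(row) for row in A]  # chosen so the exact solution is all ones
    start = time.perf_counter()
    x = solve([row[:] for row in A], b[:])
    elapsed = time.perf_counter() - start
    print(f"max error {max(abs(xi - 1.0) for xi in x):.2e}")
    print(f"{mflops(n, elapsed):.1f} MFLOP/s")
```

Building b from the row sums makes the correct answer known in advance (all ones). Real benchmark runs rely on the same idea: they check the residual before a rate is accepted.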

Optimizing Linpack

  • Linpack is easily vectorizable on many systems.
  • Easy to exploit a multiply-add operation
  • Some compilers have "daxpy recognizers" to substitute hand-optimized code!
  • Lots of special optimization for parallel machines
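Loop unrolling is one classic hand optimization of the daxpy loop. The Netlib reference BLAS daxpy, for instance, first handles n mod 4 elements in a short cleanup loop and then processes unit-stride vectors four elements per iteration, which exposes independent multiply-adds to the hardware. Below is a Python sketch of that loop structure. Python itself gains nothing from unrolling, so this only shows the transformation. `daxpy_unrolled` and `daxpy_simple` are hypothetical names.

```python
def daxpy_unrolled(a, x, y):
    """y(i) = y(i) + a * x(i), unrolled by 4 with a cleanup loop first."""
    n = len(y)
    m = n % 4
    # Cleanup: handle the n mod 4 leading elements one at a time.
    for i in range(m):
        y[i] += a * x[i]
    # Main loop: four independent multiply-adds per iteration.
    for i in range(m, n, 4):
        y[i] += a * x[i]
        y[i + 1] += a * x[i + 1]
        y[i + 2] += a * x[i + 2]
        y[i + 3] += a * x[i + 3]

def daxpy_simple(a, x, y):
    """Reference version: one multiply-add per iteration."""
    for i in range(len(y)):
        y[i] += a * x[i]

if __name__ == "__main__":
    x = [float(i) for i in range(10)]
    y1 = [1.0] * 10
    y2 = [1.0] * 10
    daxpy_simple(2.5, x, y1)
    daxpy_unrolled(2.5, x, y2)
    print(y1 == y2)
```

Each element still gets exactly one multiply-add, so the two versions produce bit-identical results. Only the loop overhead and the available instruction-level parallelism change, which is what a compiler or a "daxpy recognizer" exploits.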

Top 500 List

  • Linpack is the basis of the “Top 500 Supercomputer Sites” list » http://www.top500.org/
  • Overview of the market of high-performance systems
  • A list of the 500 most powerful installed systems, published twice a year » June and November each year
  • Linpack benchmark used to rank systems
  • Annual report analyzes developments in HW, SW and the market
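Ranking by Linpack amounts to sorting systems by their best measured Linpack rate. The Top 500 list reports this rate as Rmax, alongside the theoretical peak Rpeak. The sketch below ranks a few made-up systems, including a laptop-class machine, and also prints Rmax/Rpeak, a common efficiency figure. The system names and numbers are hypothetical. Peak is estimated here as processors × clock × flops per cycle, a simplifying assumption that ignores details of real machines.

```python
from dataclasses import dataclass

@dataclass
class System:
    name: str
    rmax_gflops: float    # measured Linpack rate
    processors: int
    clock_ghz: float
    flops_per_cycle: int  # per processor, e.g. 2 for one fused multiply-add unit

    @property
    def rpeak_gflops(self) -> float:
        # Theoretical peak: every processor retires flops_per_cycle every cycle.
        return self.processors * self.clock_ghz * self.flops_per_cycle

    @property
    def efficiency(self) -> float:
        return self.rmax_gflops / self.rpeak_gflops

# Hypothetical systems; not taken from any actual Top 500 list.
systems = [
    System("cluster-a", rmax_gflops=9000.0, processors=4096, clock_ghz=2.0, flops_per_cycle=2),
    System("vector-b", rmax_gflops=7000.0, processors=512, clock_ghz=1.0, flops_per_cycle=16),
    System("laptop-c", rmax_gflops=3.0, processors=1, clock_ghz=2.0, flops_per_cycle=2),
]

ranked = sorted(systems, key=lambda entry: entry.rmax_gflops, reverse=True)
for rank, s in enumerate(ranked, 1):
    print(f"{rank}. {s.name:10s} Rmax {s.rmax_gflops:8.1f}  "
          f"Rpeak {s.rpeak_gflops:8.1f}  eff {s.efficiency:.0%}")
```

Note that rank follows Rmax alone. A system with a lower rank can still be the more efficient design, as vector-b is here.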

Want to see where your laptop rates?

NAS Parallel Benchmarks (NPB)

  • Benchmarks from CFD (computational fluid dynamics) codes » Fortran and C versions available. » NPB are kernels and compact pseudo-applications, not full applications
  • Each program is defined algorithmically, and a sequential implementation of each algorithm is provided
  • Users write their own tuned parallel implementations
  • NPB are widely used
  • http://www.nas.nasa.gov/Software/NPB/

NAS Parallel Benchmarks (cont.)

  • EP – Embarrassingly Parallel
  • MG – MultiGrid Kernel
  • CG – Conjugate Gradient (CFD applications)
  • FT – PDE solver, many FFTs
  • IS – Integer Sort
  • LU – Sparse Linear Solver
  • SP – Scalar Pentadiagonal Linear Systems
  • BT – Block Tridiagonal Linear Systems
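EP is the easiest of these kernels to picture. It generates pairs of Gaussian random deviates from uniform pairs by acceptance-rejection and tallies them in square annuli. Each batch of pairs is independent, so processors only need to communicate at the end to sum their tallies. The sketch below follows that pattern under stated assumptions: it uses Python's `random` module instead of the NPB generator, the Marsaglia polar method for the Gaussian pairs, 10 annuli, a tiny problem size, and a loop over batches standing in for processors. `ep_batch` and `ep` are hypothetical names.

```python
import math
import random

NQ = 10  # number of square annuli tallied (assumed to match NPB EP)

def ep_batch(seed, pairs):
    """One independent batch: Gaussian pairs via the Marsaglia polar method,
    tallied by l = floor(max(|X|, |Y|))."""
    rng = random.Random(seed)  # private generator: no shared state
    counts = [0] * NQ
    sx = sy = 0.0
    for _ in range(pairs):
        x = 2.0 * rng.random() - 1.0
        y = 2.0 * rng.random() - 1.0
        t = x * x + y * y
        if 0.0 < t <= 1.0:  # accept only points inside the unit disk
            f = math.sqrt(-2.0 * math.log(t) / t)
            gx, gy = x * f, y * f
            l = int(max(abs(gx), abs(gy)))
            if l < NQ:
                counts[l] += 1
            sx += gx
            sy += gy
    return counts, sx, sy

def ep(workers, pairs_per_worker):
    # Each "worker" runs with no communication; only this reduction is shared.
    results = [ep_batch(seed, pairs_per_worker) for seed in range(workers)]
    counts = [sum(r[0][l] for r in results) for l in range(NQ)]
    return counts, sum(r[1] for r in results), sum(r[2] for r in results)

if __name__ == "__main__":
    counts, sx, sy = ep(workers=4, pairs_per_worker=50_000)
    print("accepted pairs:", sum(counts))  # roughly pi/4 of 200,000 candidates
    print("annulus counts:", counts)
    print(f"sums: {sx:.2f} {sy:.2f}")
```

In a real parallel run, the list comprehension in `ep` would be replaced by one batch per processor followed by a single reduction. That lone reduction is why EP scales almost perfectly and mainly measures raw floating-point speed.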

NAS Parallel Benchmarks (cont.)

The Perfect Club

  • Developed at the University of Illinois around 1987 » Center for Supercomputing Research and Development (CSRD) » Founded by David Kuck, a pioneer and leading researcher in the area of optimizing compilers » See Structure of Computers and Computations, David Kuck, John Wiley and Sons, 1978. » See also Kuck and Associates (KAI)
  • Goal: standard set of benchmarks to measure the progress of parallel computer advances » Applications characterized by their algorithmic behavior » Allow users to get meaningful predictions for their own applications

Large Sorting Benchmarks

  • Look again: the 2005 results are out! http://research.microsoft.com/barc/SortBenchmark/
  • Minute Sort (all the records you can sort in a minute!) » 125M, 116GB, 2005 » 340M, 32GB, 2004 » ~120M, 12GB, 2000
  • Terabyte Sort » 7.25 minutes, 2005 » ~30 minutes, 2000-2003…
  • Penny Sort: as many records as you can sort for a penny! » 163M, 15GB, 2005 » 105M, 10GB, 200-
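The Terabyte Sort times on this slide convert to sustained throughput with simple arithmetic. The sketch assumes a terabyte means 10^12 bytes and takes the approximate 2000-2003 figure of 30 minutes at face value.

```python
TERABYTE = 10**12  # bytes; assumes the decimal definition of a terabyte

def throughput_gb_per_s(bytes_sorted, minutes):
    """Average sort rate in GB/s (10^9 bytes per second)."""
    return bytes_sorted / (minutes * 60.0) / 1e9

rate_2005 = throughput_gb_per_s(TERABYTE, 7.25)   # 2005 result
rate_2000s = throughput_gb_per_s(TERABYTE, 30.0)  # ~2000-2003 result
print(f"2005: {rate_2005:.2f} GB/s, 2000-2003: {rate_2000s:.2f} GB/s, "
      f"improvement {rate_2005 / rate_2000s:.1f}x")
```

Every record must be read from disk and written back at least once, so the 2005 run implies an aggregate I/O rate of at least twice that figure, about 4.6 GB/s. A sort benchmark therefore tests storage and network bandwidth at least as much as the CPUs.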

Limitations and Pitfalls of Benchmarks

  • Benchmarks do not answer questions you did not ask: you only see what you measure
  • Specific application benchmarks will not tell you about the performance of other applications without proper analysis => extrapolation is not straightforward
  • General benchmarks will not tell you about the details of your specific application
  • You must understand the benchmark itself to understand what it tells you
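A small hypothetical illustrates the extrapolation pitfall. Suppose two machines excel on different kernels. Then the faster machine depends on how much of each kernel an application contains, so no single kernel's benchmark result predicts the ranking. All rates and work mixes below are made up, and the model assumes run time is simply the sum over kernels of work divided by rate, ignoring overlap and communication.

```python
# Hypothetical sustained rates (MFLOP/s) on two kernels.
RATES = {
    "machine_x": {"dense": 800.0, "sparse": 100.0},
    "machine_y": {"dense": 400.0, "sparse": 150.0},
}

def run_time(machine, work):
    """Seconds to execute `work` (MFLOP per kernel), assuming kernels run back to back."""
    return sum(mflop / RATES[machine][kernel] for kernel, mflop in work.items())

# Two hypothetical applications with different kernel mixes (MFLOP).
apps = {
    "app_dense": {"dense": 9000.0, "sparse": 1000.0},
    "app_sparse": {"dense": 1000.0, "sparse": 9000.0},
}

for app, work in apps.items():
    times = {m: run_time(m, work) for m in RATES}
    best = min(times, key=times.get)
    print(app, {m: round(t, 2) for m, t in times.items()}, "faster:", best)
```

A dense-kernel benchmark would favor machine_x and a sparse-kernel benchmark would favor machine_y. Each is right only for applications that resemble it, which is why you must understand what a benchmark exercises before extrapolating from it.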

Benefits of Benchmarks

  • Popular benchmarks keep vendors attuned to applications => by tuning for benchmarks that approximate real applications, vendors improve application performance
  • Benchmarks can give useful information about the performance of systems on particular kinds of programs => users compare systems
  • Benchmarks help in exposing performance bottlenecks of systems at the technical and applications level

Summary

  • Benchmark Types » Fixed Work » Scalable » Rate Based » Top500 List
  • Purposes and Uses of Benchmarks
  • Download and try some on FWGrid!