Linear Algebra and Parallel Computing Lecture Notes, Lecture notes of Linear Algebra

These lecture notes cover topics such as LAPACK and BLAS, Gaussian elimination as factorization, and parallel computing concepts. The notes provide information on how to install LAPACK and BLAS libraries and how to use them in programs. Additionally, the notes provide references for further reading and tutorials on parallel computing. The notes are likely useful for students studying topics such as linear algebra, numerical methods, and parallel computing.

Typology: Lecture notes

2010/2011

Uploaded on 05/11/2023

sheela_98
sheela_98 🇺🇸

4.2

(12)

234 documents

1 / 11

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
AMath 483/583 Lecture 12 April 22, 2011
Today:
LAPACK and BLAS
Parallel computing concepts
Monday:
OpenMP
Read: Class notes and references
R.J. LeVeque,University of Washington AMath 483/583, Lecture 12, April 22, 2011
Notes:
R.J. LeVeque,University of Washington AMath 483/583, Lecture 12, April 22, 2011
The BLAS
Basic Linear Algebra Subroutines
Core routines used by LAPACK (Linear Algebra Package)
and elsewhere.
Generally optimized for particular machine architectures, cache
hierarchy.
Can create optimized BLAS using
ATLAS (Automatically Tuned Linear Algebra Software)
See notes and http://www.netlib.org/blas/faq.html
Level 1: Scalar and vector operations
Level 2: Matrix-vector operations
Level 3: Matrix-matrix operations
R.J. LeVeque,University of Washington AMath 483/583, Lecture 12, April 22, 2011
Notes:
R.J. LeVeque,University of Washington AMath 483/583, Lecture 12, April 22, 2011
The BLAS
Subroutine names start with:
S: single precision
D: double precision
C: single precision complex
Z: double precision complex
Examples:
DDOT: dot product of two vectors
DGEMV: matrix-vector multiply, general matrices
DGEMM: matrix-matrix multiply, general matrices
DSYMM: matrix-matrix multiply, symmetric matrices
R.J. LeVeque,University of Washington AMath 483/583, Lecture 12, April 22, 2011
Notes:
R.J. LeVeque,University of Washington AMath 483/583, Lecture 12, April 22, 2011
pf3
pf4
pf5
pf8
pf9
pfa

Partial preview of the text

Download Linear Algebra and Parallel Computing Lecture Notes and more Lecture notes Linear Algebra in PDF only on Docsity!

AMath 483/583 — Lecture 12 — April 22, 2011

Today:

  • (^) LAPACK and BLAS
  • Parallel computing concepts

Monday:

  • (^) OpenMP

Read: Class notes and references

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Notes:

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

The BLAS

Basic Linear Algebra Subroutines

Core routines used by LAPACK (Linear Algebra Package) and elsewhere.

Generally optimized for particular machine architectures, cache hierarchy.

Can create optimized BLAS using ATLAS (Automatically Tuned Linear Algebra Software)

See notes and http://www.netlib.org/blas/faq.html

  • Level 1: Scalar and vector operations
  • (^) Level 2: Matrix-vector operations
  • Level 3: Matrix-matrix operations

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Notes:

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

The BLAS

Subroutine names start with:

  • S: single precision
  • D: double precision
  • (^) C: single precision complex
  • Z: double precision complex

Examples:

  • (^) DDOT: dot product of two vectors
  • DGEMV: matrix-vector multiply, general matrices
  • DGEMM: matrix-matrix multiply, general matrices
  • (^) DSYMM: matrix-matrix multiply, symmetric matrices

Notes:

LAPACK

Many routines for linear algebra.

Typical name: XYYZZZ

X is precision

YY is type of matrix, e.g. GE (general), BD (bidiagonal),

ZZZ is type of operation, e.g. SV (solve system), EV (eigenvalues, vectors), SVD (singular values, vectors)

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Notes:

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Installing LAPACK

On Virtual Machine or other Debian or Ubuntu Linux:

$ sudo apt-get install liblapack-dev

This will include BLAS (but not optimized for your system).

Alternatively can download tar files and compile.

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Notes:

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Using libraries

If program.f90 uses BLAS routines...

$ gfortran -c program.f $ gfortran -lblas program.o

or can combine as

$ gfortran -lblas program.f

When linking together .o files, will look for a file called libblas.a (probably in /usr/lib).

This is a archived static library.

Notes:

DGESV — Solves a general linear system

http://www.netlib.org/lapack/double/dgesv.f SUBROUTINE DGESV( N, NRHS, A, LDA, IPIV, $ B, LDB, INFO ) N = size of system (square)

A = matrix on input, L,U factors on output, dimension(LDA,N)

LDA = leading dimension of A (number of columns in declaration of A) real(kind=8) dimension(100,500) :: a ! fill a(1:20, 1:20) with 20x20 matrix n = 20 lda = 100 Need this to index into a(i,j) = (j-1)*lda + i (stored by columns) R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Notes:

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

DGESV — Solves a general linear system

SUBROUTINE DGESV( N, NRHS, A, LDA, IPIV,

$ B, LDB, INFO )

NRHS = number of right hand sides

IPIV = Returns pivot vector (permutation of rows) integer, dimension(N) Row I was interchanged with row IPIV(I).

B = matrix whose columns are right hand side(s) on input solution vector(s) on output.

LDB = leading dimension of B.

INFO = integer returning 0 if successful.

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Notes:

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Gaussian elimination as factorization

If A is nonsingular it can be factored as

P A = LU

where

P is a permutation matrix (rows of identity permuted),

L is lower triangular with 1’s on diagonal,

U is upper triangular.

After returning from dgesv: A contains L and U (without the diagonal of L), IPIV gives ordering of rows in P.

Notes:

Gaussian elimination as factorization

Example:

A =

IPIV = (2,3,1)

and A ends up as  

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Notes:

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

dgesv examples

See $CLASSHG/codes/lapack/random.

randomsys1.f90 is with static array allocation.

randomsys2.f90 is with dynamic array allocation.

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Notes:

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Parallel Computing

  • Basic concepts
  • Shared vs. distributed memory
  • (^) OpenMP (shared)
  • (^) MPI (shared or distributed)

Notes:

Multi-thread computing

For example, multi-threaded program on dual-core computer.

Thread:

A thread of control: program code, program counter, call stack, small amount of thread-specific data (registers, L1 cache).

Shared memory and file system.

Threads may be spawned and destroyed as computation proceeds.

Languages like OpenMP.

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Notes:

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

POSIX Threads

Portable Operating System Intefrace

Standardized C language threads programming interface

For UNIX systems, this interface has been specified by the IEEE POSIX 1003.1c standard (1995).

Implementations adhering to this standard are referred to as POSIX threads, or Pthreads.

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Notes:

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Multi-thread computing

Some issues:

Limited to modest number of cores when memory is shared.

Multiple threads have access to same data — convenient and fast.

Contention: But, need to make sure they don’t conflict (e.g. two threads should not write to same location at same time).

Dependencies, synchronization: Need to make sure some operations are done in proper order!

May need cache coherence: If Thread 1 changes x in its private cache, other threads might need to see changed value.

Notes:

Multi-process computing

A process is a thread that also has its own private address space.

Multiple processes are often running on a single computer (e.g. different independent programs).

For distributed memory parallel computers, a single computation must be tackled with multiple processes because of memory layout.

Larger cost in creating and destroying processes.

Greater latency in sharing data.

Processes communicate by passing messages.

Languages like MPI — Message Passing Interface.

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Notes:

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Multi-process computing with distributed memory

Some issues:

Often more complicated to program.

High cost of data communication between processes. Want to maximize processing on local data relative to communication with other processes.

Often need to partition problem domain into subdomains, (e.g. domain decomposition for PDEs)

Generally requires coarse grain parallelism.

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Notes:

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Amdahl’s Law

Typically only part of a computation can be parallelized.

Suppose 50% of the computation is inherently sequential, and the other 50% can be parallelized.

Question: How much faster could the computation potentially run on many processors?

Answer: At most a factor of 2, no matter how many processors.

The sequential part is taking half the time and that time is still required even if the parallel part is reduced to zero time.

Notes:

Speedup

The ratio TS /TP of time on a sequential machine to time running in parallel is the speedup.

This is generally less than P for P processors. Perhaps much less.

Amdahl’s Law plus overhead costs of starting processes/threads, communication, etc.

Caveat: May (rarely) see speedup greater than P ... For example, if data doesn’t all fit in one cache but does fit in the combined caches of multiple processors.

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Notes:

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Scaling

Some algorithms scale better than others as the number of processors increases.

Typically interested on how well algorithms work for large problems requiring lots of time, e.g. Particle methods for n particles, algorithms for solving systems of n equations, algorithms for solving PDEs on n × n × n grid in 3D,

For large n, there may be lots of inherent parallelism.

But depends on many factors: dependencies between calculations, communication as well as flops, nature of problem and algorithm chosen.

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Notes:

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Scaling

Typically interested on how well algorithms work for large problems requiring lots of time.

Strong scaling: How does the algorithm perform as the number of processors P increases for a fixed problem size n?

Any algorithm will eventually break down (consider P > n)

Weak scaling: How does the algorithm perform when the problem size increases with the number of processors?

E.g. If we double the number of processors can we solve a problem “twice as large” in the same time?

Notes:

Weak scaling

What does “twice as large” mean?

Depends on how algorithm complexity scales with n.

Example: Solving linear system with Gaussian elimination requires O(n^3 ) flops.

Doubling n requires 8 times as many operations.

Problem is “twice as large” if we increase n by a factor of 21 /^3 ≈ 1. 26.

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Notes:

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Weak scaling

Solving steady state heat equation on n × n × n grid.

n^3 grid points =⇒ linear system with this many unknowns.

If we used Gaussian elimination (very bad idea!) we would require ∼ (n^3 )^3 = n^9 flops.

Doubling n would require 29 = 512 times more flops.

Good iterative methods can do the job in O(n^3 ) log 2 (n) work or less. (e.g. multigrid).

Developing better algorithms is as important as better hardware!!

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Notes:

R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011

Speedup for problems like steady state heat equation

Source: SIAM Review

Notes: