






Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
These lecture notes cover topics such as LAPACK and BLAS, Gaussian elimination as factorization, and parallel computing concepts. The notes provide information on how to install LAPACK and BLAS libraries and how to use them in programs. Additionally, the notes provide references for further reading and tutorials on parallel computing. The notes are likely useful for students studying topics such as linear algebra, numerical methods, and parallel computing.
Typology: Lecture notes
1 / 11
This page cannot be seen from the preview
Don't miss anything!







Today:
Monday:
Read: Class notes and references
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
Basic Linear Algebra Subroutines
Core routines used by LAPACK (Linear Algebra Package) and elsewhere.
Generally optimized for particular machine architectures, cache hierarchy.
Can create optimized BLAS using ATLAS (Automatically Tuned Linear Algebra Software)
See notes and http://www.netlib.org/blas/faq.html
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
Subroutine names start with:
Examples:
Many routines for linear algebra.
Typical name: XYYZZZ
X is precision
YY is type of matrix, e.g. GE (general), BD (bidiagonal),
ZZZ is type of operation, e.g. SV (solve system), EV (eigenvalues, vectors), SVD (singular values, vectors)
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
On Virtual Machine or other Debian or Ubuntu Linux:
$ sudo apt-get install liblapack-dev
This will include BLAS (but not optimized for your system).
Alternatively can download tar files and compile.
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
If program.f90 uses BLAS routines...
$ gfortran -c program.f $ gfortran -lblas program.o
or can combine as
$ gfortran -lblas program.f
When linking together .o files, will look for a file called libblas.a (probably in /usr/lib).
This is a archived static library.
http://www.netlib.org/lapack/double/dgesv.f SUBROUTINE DGESV( N, NRHS, A, LDA, IPIV, $ B, LDB, INFO ) N = size of system (square)
A = matrix on input, L,U factors on output, dimension(LDA,N)
LDA = leading dimension of A (number of columns in declaration of A) real(kind=8) dimension(100,500) :: a ! fill a(1:20, 1:20) with 20x20 matrix n = 20 lda = 100 Need this to index into a(i,j) = (j-1)*lda + i (stored by columns) R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
NRHS = number of right hand sides
IPIV = Returns pivot vector (permutation of rows) integer, dimension(N) Row I was interchanged with row IPIV(I).
B = matrix whose columns are right hand side(s) on input solution vector(s) on output.
LDB = leading dimension of B.
INFO = integer returning 0 if successful.
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
If A is nonsingular it can be factored as
P A = LU
where
P is a permutation matrix (rows of identity permuted),
L is lower triangular with 1’s on diagonal,
U is upper triangular.
After returning from dgesv: A contains L and U (without the diagonal of L), IPIV gives ordering of rows in P.
Example:
A =
and A ends up as
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
See $CLASSHG/codes/lapack/random.
randomsys1.f90 is with static array allocation.
randomsys2.f90 is with dynamic array allocation.
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
For example, multi-threaded program on dual-core computer.
Thread:
A thread of control: program code, program counter, call stack, small amount of thread-specific data (registers, L1 cache).
Shared memory and file system.
Threads may be spawned and destroyed as computation proceeds.
Languages like OpenMP.
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
Portable Operating System Intefrace
Standardized C language threads programming interface
For UNIX systems, this interface has been specified by the IEEE POSIX 1003.1c standard (1995).
Implementations adhering to this standard are referred to as POSIX threads, or Pthreads.
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
Some issues:
Limited to modest number of cores when memory is shared.
Multiple threads have access to same data — convenient and fast.
Contention: But, need to make sure they don’t conflict (e.g. two threads should not write to same location at same time).
Dependencies, synchronization: Need to make sure some operations are done in proper order!
May need cache coherence: If Thread 1 changes x in its private cache, other threads might need to see changed value.
A process is a thread that also has its own private address space.
Multiple processes are often running on a single computer (e.g. different independent programs).
For distributed memory parallel computers, a single computation must be tackled with multiple processes because of memory layout.
Larger cost in creating and destroying processes.
Greater latency in sharing data.
Processes communicate by passing messages.
Languages like MPI — Message Passing Interface.
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
Some issues:
Often more complicated to program.
High cost of data communication between processes. Want to maximize processing on local data relative to communication with other processes.
Often need to partition problem domain into subdomains, (e.g. domain decomposition for PDEs)
Generally requires coarse grain parallelism.
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
Typically only part of a computation can be parallelized.
Suppose 50% of the computation is inherently sequential, and the other 50% can be parallelized.
Question: How much faster could the computation potentially run on many processors?
Answer: At most a factor of 2, no matter how many processors.
The sequential part is taking half the time and that time is still required even if the parallel part is reduced to zero time.
The ratio TS /TP of time on a sequential machine to time running in parallel is the speedup.
This is generally less than P for P processors. Perhaps much less.
Amdahl’s Law plus overhead costs of starting processes/threads, communication, etc.
Caveat: May (rarely) see speedup greater than P ... For example, if data doesn’t all fit in one cache but does fit in the combined caches of multiple processors.
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
Some algorithms scale better than others as the number of processors increases.
Typically interested on how well algorithms work for large problems requiring lots of time, e.g. Particle methods for n particles, algorithms for solving systems of n equations, algorithms for solving PDEs on n × n × n grid in 3D,
For large n, there may be lots of inherent parallelism.
But depends on many factors: dependencies between calculations, communication as well as flops, nature of problem and algorithm chosen.
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
Typically interested on how well algorithms work for large problems requiring lots of time.
Strong scaling: How does the algorithm perform as the number of processors P increases for a fixed problem size n?
Any algorithm will eventually break down (consider P > n)
Weak scaling: How does the algorithm perform when the problem size increases with the number of processors?
E.g. If we double the number of processors can we solve a problem “twice as large” in the same time?
What does “twice as large” mean?
Depends on how algorithm complexity scales with n.
Example: Solving linear system with Gaussian elimination requires O(n^3 ) flops.
Doubling n requires 8 times as many operations.
Problem is “twice as large” if we increase n by a factor of 21 /^3 ≈ 1. 26.
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
Solving steady state heat equation on n × n × n grid.
n^3 grid points =⇒ linear system with this many unknowns.
If we used Gaussian elimination (very bad idea!) we would require ∼ (n^3 )^3 = n^9 flops.
Doubling n would require 29 = 512 times more flops.
Good iterative methods can do the job in O(n^3 ) log 2 (n) work or less. (e.g. multigrid).
Developing better algorithms is as important as better hardware!!
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
R.J. LeVeque, University of Washington AMath 483/583, Lecture 12, April 22, 2011
Source: SIAM Review