Parallel Numerical Libraries-Computer Science-Project Report, Study Guides, Projects, Research of Computer Science

This is project report related to Computer science degree. Project was supervised by Dr. Niharika Raj at Acharya Nagarjuna University. Its main points are: Cluster, Computer, Scope, Performance, Parallel, Comparsion, Computing, Matlab, Programming, Fortran, Linux

Typology: Study Guides, Projects, Research

2011/2012

Uploaded on 07/16/2012

sanjeev
sanjeev 🇮🇳

5

(1)

19 documents

1 / 79

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
1
Chapter 1. Introduction
The performance of a system is an important characteristic but it is difficult to
characterize with a single number. The best way of capturing performance is to
identify what the system must do to fulfill its mission and then identify the values that
allow one to characterize performance for that mission. If the analysis of the workload
shows that one or more standard benchmarks is a close approximation to the
workload, then one may use the relative performance of different systems on the
relevant benchmark(s) to provide a good predictor of performance on the company‟s
workload. If no benchmarks fit well, then one can compare performance only by
running tests on the real workload.
1.1 Scope
Parallel computers can run some types of programs far faster than traditional single
processor computers. That is why there is an emerging trend in the use of parallel
computing.
Most of the people do not want to run all the available programs related to their
problem in order to find the best performing program. This project will help them to
see some performance comparison of different parallel programs so that they might be
able to start their initial task easily.
When writing programs for parallel computer architectures, programmers
explicitly specify how to decompose the computing work between all available nodes.
After this project, performance of certain libraries will be available and by analyzing
the decomposition of tasks of these well performing libraries, programmers can have
an idea about the decomposition rules of the program that can give better
performance.
1.2 Applications
Parallelism finds applications in very diverse application domains for different
motivating reasons. The use of multi-processor machines or parallel computing has
made a great impact in a variety of areas, especially in computational simulations for
scientific and engineering applications. These range from improved application
performance to cost considerations. Some of the particular fields are,
docsity.com
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e
pf1f
pf20
pf21
pf22
pf23
pf24
pf25
pf26
pf27
pf28
pf29
pf2a
pf2b
pf2c
pf2d
pf2e
pf2f
pf30
pf31
pf32
pf33
pf34
pf35
pf36
pf37
pf38
pf39
pf3a
pf3b
pf3c
pf3d
pf3e
pf3f
pf40
pf41
pf42
pf43
pf44
pf45
pf46
pf47
pf48
pf49
pf4a
pf4b
pf4c
pf4d
pf4e
pf4f

Partial preview of the text

Download Parallel Numerical Libraries-Computer Science-Project Report and more Study Guides, Projects, Research Computer Science in PDF only on Docsity!

Chapter 1. Introduction

The performance of a system is an important characteristic but it is difficult to characterize with a single number. The best way of capturing performance is to identify what the system must do to fulfill its mission and then identify the values that allow one to characterize performance for that mission. If the analysis of the workload shows that one or more standard benchmarks is a close approximation to the workload, then one may use the relative performance of different systems on the relevant benchmark(s) to provide a good predictor of performance on the company‟s workload. If no benchmarks fit well, then one can compare performance only by running tests on the real workload.

1.1 Scope

Parallel computers can run some types of programs far faster than traditional single processor computers. That is why there is an emerging trend in the use of parallel computing. Most of the people do not want to run all the available programs related to their problem in order to find the best performing program. This project will help them to see some performance comparison of different parallel programs so that they might be able to start their initial task easily. When writing programs for parallel computer architectures, programmers explicitly specify how to decompose the computing work between all available nodes. After this project, performance of certain libraries will be available and by analyzing the decomposition of tasks of these well performing libraries, programmers can have an idea about the decomposition rules of the program that can give better performance.

1.2 Applications

Parallelism finds applications in very diverse application domains for different motivating reasons. The use of multi-processor machines or parallel computing has made a great impact in a variety of areas, especially in computational simulations for scientific and engineering applications. These range from improved application performance to cost considerations. Some of the particular fields are,

 Earth Quakes  Economics  Fission  Fusion  Material Science  Medical Imaging

1.3 Thesis Layout

In the first chapter, Introduction , we have discussed the scope and applications of this project. Second chapter, Parallel Numerical Libraries , gives detail description of the parallel numerical libraries selected for analysis purposes. Namely these libraries include PETSc, ScaLAPACK and MatlabMPI. Chapter 3, Underlying Architectures , discusses the hardware using which we implemented the examples of parallel numerical libraries. Chapter 4, Performance Metrics , discusses the performance metrics used to evaluate the performance of parallel libraries. These performance metrics include parallel execution time, speed-up and efficiency. Finally, the results after analyzing the libraries are discussed in chapter 5.

The components below the line, labeled “LOCAL” are called on a single processor. Whereas the components above the line, labeled “GLOBAL” are making use of parallel routines.

2.1.2 Naming Convention of ScaLAPACK Routines

All routines have the name of the form PXYYZZZ , where the seventh character is left blank for some routines. The first letter „P‟ indicates that it is the parallel version of LAPACK routine. The second letter „X‟ shows the data type that can be real(S), double (D), complex(C) or double complex (Z). The next two letters “YY” indicate the type of the matrix. Matrix type can be tri- diagonal (DT), symmetric (SY), general (GE) etc. The last three letters “ZZZ” indicate the computation performed.

2.2 PETSc

The PETSc (Portable, Extensible Toolkit for Scientific Computation) is a suite of data structures and routines that provide the building blocks for the implementation of large scale application codes on parallel (and serial) computers. PETSc uses the MPI standard for all message-passing communication. PETSc includes an expanding suite of parallel linear, nonlinear equation solvers and time integrators that may be used in application codes written in Fortran, C, and C++[ 2 ]. PETSc provides many of the mechanisms needed within parallel application codes, such as parallel matrix and vector assembly routines. The library is organized hierarchically, enabling users to employ the level of abstraction that is most appropriate for a particular problem. By using techniques of object-oriented programming, PETSc provides enormous flexibility for users. Some of the PETSc modules deal with  Vectors  Matrices (generally sparse)  Distributed arrays (useful for parallelizing regular grid-based problems) PETSc has been used for modeling in the areas such as Acoustics, Combustion, Concrete, Aerodynamics, Bone Fractures, Brain Surgery, Cancer Surgery, Cancer Cure, Carbon Sequestration, Cardiology, Air Pollution, Corrosion, Data Mining, Earth Quakes, and Surface Water Flow [ 3 ].

2.2.1 External Libraries

PETSc makes use of a variety of external libraries. During installation process, user can select the external libraries of his own choice. These external libraries include BLAS, LAPACK, LINPACK, MINPACK, SPARSPAK etc. [ 2 ].

2.2.2 PETSc Objects

PETSc library has its own objects. These objects are particularly optimized for vector and matrix operations. Moreover these objects can be specified for sparse vectors and matrices [ 2 ]. Some PETSc objects include Vec, Mat, etc. For viewing the vectors and matrices, PETSc has PETSc Viewer objects.

2.3 MatlabMPI

Matlab is the leading programming language for executing numerical computations. It is extensively used for algorithm development, simulation, data reduction, and testing and system evaluation by many engineers and scientists [ 4 ]. Many of these computations could be implemented in less time on a parallel computer. Many attempts were made to provide efficient ways of executing its programs. But none of these has received widespread acceptance. Message Passing Interface (MPI) is the de- facto standard for implementing programs on multiple processors, in the world of parallel computing. MPI provides C and FORTRAN language routines for communication in a parallel program. MPI is widely used in the world's most demanding applications (weather modeling, weapons simulation, aircraft design, etc.). MatlabMPI is set of Matlab scripts that allow us to execute any Matlab program on a parallel computer. Thus, MatlabMPI will run on any combination of computers that Matlab supports. Most of the routines provide by MatlabMPI are almost the same as MPI [ 5 ].

Table 3-2 shows the hardware specifications of Grid Lab‟s cluster. It has one 8-core system, five 4-core systems and two 2-core systems. So, it has a total of 32 processors. It is using Scientific LINUX [ 6 ] operating system. The system with 8- cores acts as a master node in the cluster.

Chapter 4. Performance Metrics

In order to analyze the performance of a system, we need some performance evaluation metrics. By using these metrics, we can perform comparison of various types of systems. In parallel computing, the criteria for evaluating performance can include speedup, efficiency and scalability.

4.1 Parallel Execution Time

The parallel execution time is the measure of time from the moment a parallel computation starts to the moment the last processing element finishes execution[ 7 ]. Serial time is denoted by TS while parallel execution time is denoted by TP.

4.2 Speedup

Speedup tells us that how much a parallel algorithm is faster than a corresponding sequential algorithm. It is defined as

Where Sp is the speed up, TS is the execution time for sequential algorithm, TP is the execution time for parallel algorithm and p is the number of processors. There are three cases for speedup linear, sub-linear and super-linear as shown in figure 4-1.

Figure 4-1 Speedup vs. Number of Processors[ 8 ]

Chapter 5. Results and Discussions

In this section, all the results for this project are discussed. Results are displayed in the form of graphs. A conclusion is adopted after discussion on these results. This conclusion could assist user in choosing a particular parallel numerical library that suits him the most.

5.1 PETSc’s Performance Analysis

Following are the examples implemented using PETSc library.  Matrix Operations o Matrix-Vector Multiplication o Matrix-Matrix Multiplication  System of Linear Equations o Jacobi o SOR In all these examples, the dimensions of the matrix and vector are n x n and n x 1 respectively. Vector, b = [ 1,1,...,1]T^ and the n x n matrix A has 3 on the main diagonal, -1 on the super and sub diagonal, and 0.5 on the (i,n+1-i) position of all i=1,…,n except for i=n/2 and n/2+1. All of these examples were executed on Virtu and Cluster. The specifications of these underlying architectures are discussed earlier in this report. The performance of PETSc is best at Virtu. These examples were also executed on only the master node of cluster at grid lab. During this time all other computers of cluster were switched off. These examples were executed with different dimensions of matrices. Dimensions were varied from 50 x 50 to 10,000 x 10,000. Three different underlying architectures are used for these examples. In the figures 5-1 to 5-44, the following Graph Legends are used, Virtu: Multicore machine with 8 processors. For more information see section-3. Grid: A cluster consisting of 8 computers and a total of 32 physical processors. For more information see section-3.2. Grid 8n: Master node of the cluster described in section-3.2. It is multicore machine with 8 physical processors.

5.1.1 Matrix-Vector Multiplication (Parallel Execution Time)

Figures 5-1 to 5-6 show the graphs of parallel execution time for matrix vector multiplication. X-axis indicates the time in seconds and Y-axis indicates the number of processors. It is clear from the below timing graphs that Virtu gives the better performance. Although the cluster at Grid Lab has a total of 32 processors but there come the bottle neck of communication through the LAN. With matrix of dimensions 50 x 50, the parallel execution time is much more than the serial time. This is due to the fact of communication overhead. After that as we increase the dimensions of matrix, the parallel time go on decreasing as compared to the sequential time. Great decrease in parallel execution time can be seen when the dimensions of matrix are 1000 x 1000 and above. Moreover the parallel time keeps on decreasing until the number of processors is less than 8. It is due to the fact that Virtu has 8 processors. Above 8 processors the performance becomes worse.

Figure 5-1 Parallel Execution Time Comparison for Matrix-Vector Multiplication operation of Dimension 50 x 50

(^00 20 40 60 80 100 120 )

0.25 Parallel Time Comparison Size 50

Number of Processors

Time (seconds)

Virtu Grid Grid 8n

Figure 5-4 Parallel Execution Time Comparison for Matrix-Vector Multiplication with Dimension 1000 x 1000

Figure 5-5 Parallel Execution Time Comparison for Matrix-Vector Multiplication with Dimension 5000 x 5000

Figure 5-6 Parallel Execution Time Comparison for Matrix-Vector Multiplication with Dimension 10,000 x 10,

5.1.2 Matrix-Vector Multiplication (Speed Up & Efficiency)

Figures 5-7 to 5-12 show the graphs of speed up for matrix vector multiplication and Figures 5-13 to 5-18 show the graphs of efficiency for matrix vector multiplication. Speed up tells us that how much the parallel algorithm is better than the serial algorithm. Theoretically, if value of speed up is less than one then we say that it is useless to parallelize the algorithm. For problems of matrix dimension 50 x 50 and 100 x 100, we can see that the speed up is less than one. And for problems of matrix dimension 500 x 500 and above, we can see that the speed up is greater than one. Hence we can conclude that it is useless to use the PETSc‟s for problem size less than 500 x 500. Efficiency tells us that how much time or resources are wasted in communication overhead, synchronization etc. Ideally, efficiency should be equal to one. From the graphs of efficiency, we can see that efficiency approaches to one when our problem size is very large i.e. 5000 and above. Moreover, we can notice that efficiency approaches to one when number of processes ranges from 4 to 8.

(^00 20 40 60 80 100 120 )

Parallel Time Comparison Size 10,

Number of Processors

Time (seconds)

Virtu Grid Grid 8n

Figure 5-9 Speed Up for Matrix Vector Multiplication of Problem Size 500 x 500

Figure 5-10 Speed Up for Matrix Vector Multiplication of Problem Size 1000 x 1000

(^00 20 40 60 80 100 120 )

1

1.5 Speed Up for Problem Size 500 x 500

Number of Processors

Speed UP

Virtu Grid Grid 8n

(^00 20 40 60 80 100 120 )

1

2

3 Speed Up for Problem Size 1000 x 1000

Number of Processors

Speed UP

Virtu Grid Grid 8n

Figure 5-11 Speed Up for Matrix Vector Multiplication of Problem Size 5000 x 5000

Figure 5-12 Speed Up for Matrix Vector Multiplication of Problem Size 10,000 x 10,

(^00 20 40 60 80 100 120 )

1

2

3

4

5

6

7

8 Speed Up for Problem Size 5000 x 5000

Number of Processors

Speed UP

Virtu Grid Grid 8n

(^00 20 40 60 80 100 120 )

2

4

6

8

10

12

14 Speed Up for Problem Size 10,000 x 10,

Number of Processors

Speed UP

Virtu Grid Grid 8n

Figure 5-15 Efficiency for Matrix Vector Multiplication of Problem Size 500 x 500

Figure 5-16 Efficiency for Matrix Vector Multiplication of Problem Size 1000 x 1000

(^00 20 40 60 80 100 120 )

0.7 Problem Size 500 x 500

Number of Processors

Efficiency

Virtu Grid Grid 8n

(^00 20 40 60 80 100 120 )

0.9 Problem Size 1000 x 1000

Number of Processors

Efficiency

Virtu Grid Grid 8n

Figure 5-17 Efficiency for Matrix Vector Multiplication of Problem Size 5000 x 5000

Figure 5-18 Efficiency for Matrix Vector Multiplication of Problem Size 10,000 x 10,

(^00 20 40 60 80 100 120 )

1 Problem Size 5000 x 5000

Number of Processors

Efficiency

Virtu Grid Grid 8n

(^00 20 40 60 80 100 120 )

1 Problem Size 10,000 x 10,

Number of Processors

Efficiency

Virtu Grid Grid 8n