







































































Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
This is project report related to Computer science degree. Project was supervised by Dr. Niharika Raj at Acharya Nagarjuna University. Its main points are: Cluster, Computer, Scope, Performance, Parallel, Comparsion, Computing, Matlab, Programming, Fortran, Linux
Typology: Study Guides, Projects, Research
1 / 79
This page cannot be seen from the preview
Don't miss anything!








































































The performance of a system is an important characteristic but it is difficult to characterize with a single number. The best way of capturing performance is to identify what the system must do to fulfill its mission and then identify the values that allow one to characterize performance for that mission. If the analysis of the workload shows that one or more standard benchmarks is a close approximation to the workload, then one may use the relative performance of different systems on the relevant benchmark(s) to provide a good predictor of performance on the company‟s workload. If no benchmarks fit well, then one can compare performance only by running tests on the real workload.
Parallel computers can run some types of programs far faster than traditional single processor computers. That is why there is an emerging trend in the use of parallel computing. Most of the people do not want to run all the available programs related to their problem in order to find the best performing program. This project will help them to see some performance comparison of different parallel programs so that they might be able to start their initial task easily. When writing programs for parallel computer architectures, programmers explicitly specify how to decompose the computing work between all available nodes. After this project, performance of certain libraries will be available and by analyzing the decomposition of tasks of these well performing libraries, programmers can have an idea about the decomposition rules of the program that can give better performance.
Parallelism finds applications in very diverse application domains for different motivating reasons. The use of multi-processor machines or parallel computing has made a great impact in a variety of areas, especially in computational simulations for scientific and engineering applications. These range from improved application performance to cost considerations. Some of the particular fields are,
Earth Quakes Economics Fission Fusion Material Science Medical Imaging
In the first chapter, Introduction , we have discussed the scope and applications of this project. Second chapter, Parallel Numerical Libraries , gives detail description of the parallel numerical libraries selected for analysis purposes. Namely these libraries include PETSc, ScaLAPACK and MatlabMPI. Chapter 3, Underlying Architectures , discusses the hardware using which we implemented the examples of parallel numerical libraries. Chapter 4, Performance Metrics , discusses the performance metrics used to evaluate the performance of parallel libraries. These performance metrics include parallel execution time, speed-up and efficiency. Finally, the results after analyzing the libraries are discussed in chapter 5.
The components below the line, labeled “LOCAL” are called on a single processor. Whereas the components above the line, labeled “GLOBAL” are making use of parallel routines.
All routines have the name of the form PXYYZZZ , where the seventh character is left blank for some routines. The first letter „P‟ indicates that it is the parallel version of LAPACK routine. The second letter „X‟ shows the data type that can be real(S), double (D), complex(C) or double complex (Z). The next two letters “YY” indicate the type of the matrix. Matrix type can be tri- diagonal (DT), symmetric (SY), general (GE) etc. The last three letters “ZZZ” indicate the computation performed.
The PETSc (Portable, Extensible Toolkit for Scientific Computation) is a suite of data structures and routines that provide the building blocks for the implementation of large scale application codes on parallel (and serial) computers. PETSc uses the MPI standard for all message-passing communication. PETSc includes an expanding suite of parallel linear, nonlinear equation solvers and time integrators that may be used in application codes written in Fortran, C, and C++[ 2 ]. PETSc provides many of the mechanisms needed within parallel application codes, such as parallel matrix and vector assembly routines. The library is organized hierarchically, enabling users to employ the level of abstraction that is most appropriate for a particular problem. By using techniques of object-oriented programming, PETSc provides enormous flexibility for users. Some of the PETSc modules deal with Vectors Matrices (generally sparse) Distributed arrays (useful for parallelizing regular grid-based problems) PETSc has been used for modeling in the areas such as Acoustics, Combustion, Concrete, Aerodynamics, Bone Fractures, Brain Surgery, Cancer Surgery, Cancer Cure, Carbon Sequestration, Cardiology, Air Pollution, Corrosion, Data Mining, Earth Quakes, and Surface Water Flow [ 3 ].
PETSc makes use of a variety of external libraries. During installation process, user can select the external libraries of his own choice. These external libraries include BLAS, LAPACK, LINPACK, MINPACK, SPARSPAK etc. [ 2 ].
PETSc library has its own objects. These objects are particularly optimized for vector and matrix operations. Moreover these objects can be specified for sparse vectors and matrices [ 2 ]. Some PETSc objects include Vec, Mat, etc. For viewing the vectors and matrices, PETSc has PETSc Viewer objects.
Matlab is the leading programming language for executing numerical computations. It is extensively used for algorithm development, simulation, data reduction, and testing and system evaluation by many engineers and scientists [ 4 ]. Many of these computations could be implemented in less time on a parallel computer. Many attempts were made to provide efficient ways of executing its programs. But none of these has received widespread acceptance. Message Passing Interface (MPI) is the de- facto standard for implementing programs on multiple processors, in the world of parallel computing. MPI provides C and FORTRAN language routines for communication in a parallel program. MPI is widely used in the world's most demanding applications (weather modeling, weapons simulation, aircraft design, etc.). MatlabMPI is set of Matlab scripts that allow us to execute any Matlab program on a parallel computer. Thus, MatlabMPI will run on any combination of computers that Matlab supports. Most of the routines provide by MatlabMPI are almost the same as MPI [ 5 ].
Table 3-2 shows the hardware specifications of Grid Lab‟s cluster. It has one 8-core system, five 4-core systems and two 2-core systems. So, it has a total of 32 processors. It is using Scientific LINUX [ 6 ] operating system. The system with 8- cores acts as a master node in the cluster.
In order to analyze the performance of a system, we need some performance evaluation metrics. By using these metrics, we can perform comparison of various types of systems. In parallel computing, the criteria for evaluating performance can include speedup, efficiency and scalability.
The parallel execution time is the measure of time from the moment a parallel computation starts to the moment the last processing element finishes execution[ 7 ]. Serial time is denoted by TS while parallel execution time is denoted by TP.
Speedup tells us that how much a parallel algorithm is faster than a corresponding sequential algorithm. It is defined as
Where Sp is the speed up, TS is the execution time for sequential algorithm, TP is the execution time for parallel algorithm and p is the number of processors. There are three cases for speedup linear, sub-linear and super-linear as shown in figure 4-1.
Figure 4-1 Speedup vs. Number of Processors[ 8 ]
In this section, all the results for this project are discussed. Results are displayed in the form of graphs. A conclusion is adopted after discussion on these results. This conclusion could assist user in choosing a particular parallel numerical library that suits him the most.
Following are the examples implemented using PETSc library. Matrix Operations o Matrix-Vector Multiplication o Matrix-Matrix Multiplication System of Linear Equations o Jacobi o SOR In all these examples, the dimensions of the matrix and vector are n x n and n x 1 respectively. Vector, b = [ 1,1,...,1]T^ and the n x n matrix A has 3 on the main diagonal, -1 on the super and sub diagonal, and 0.5 on the (i,n+1-i) position of all i=1,…,n except for i=n/2 and n/2+1. All of these examples were executed on Virtu and Cluster. The specifications of these underlying architectures are discussed earlier in this report. The performance of PETSc is best at Virtu. These examples were also executed on only the master node of cluster at grid lab. During this time all other computers of cluster were switched off. These examples were executed with different dimensions of matrices. Dimensions were varied from 50 x 50 to 10,000 x 10,000. Three different underlying architectures are used for these examples. In the figures 5-1 to 5-44, the following Graph Legends are used, Virtu: Multicore machine with 8 processors. For more information see section-3. Grid: A cluster consisting of 8 computers and a total of 32 physical processors. For more information see section-3.2. Grid 8n: Master node of the cluster described in section-3.2. It is multicore machine with 8 physical processors.
Figures 5-1 to 5-6 show the graphs of parallel execution time for matrix vector multiplication. X-axis indicates the time in seconds and Y-axis indicates the number of processors. It is clear from the below timing graphs that Virtu gives the better performance. Although the cluster at Grid Lab has a total of 32 processors but there come the bottle neck of communication through the LAN. With matrix of dimensions 50 x 50, the parallel execution time is much more than the serial time. This is due to the fact of communication overhead. After that as we increase the dimensions of matrix, the parallel time go on decreasing as compared to the sequential time. Great decrease in parallel execution time can be seen when the dimensions of matrix are 1000 x 1000 and above. Moreover the parallel time keeps on decreasing until the number of processors is less than 8. It is due to the fact that Virtu has 8 processors. Above 8 processors the performance becomes worse.
Figure 5-1 Parallel Execution Time Comparison for Matrix-Vector Multiplication operation of Dimension 50 x 50
(^00 20 40 60 80 100 120 )
0.25 Parallel Time Comparison Size 50
Number of Processors
Time (seconds)
Virtu Grid Grid 8n
Figure 5-4 Parallel Execution Time Comparison for Matrix-Vector Multiplication with Dimension 1000 x 1000
Figure 5-5 Parallel Execution Time Comparison for Matrix-Vector Multiplication with Dimension 5000 x 5000
Figure 5-6 Parallel Execution Time Comparison for Matrix-Vector Multiplication with Dimension 10,000 x 10,
Figures 5-7 to 5-12 show the graphs of speed up for matrix vector multiplication and Figures 5-13 to 5-18 show the graphs of efficiency for matrix vector multiplication. Speed up tells us that how much the parallel algorithm is better than the serial algorithm. Theoretically, if value of speed up is less than one then we say that it is useless to parallelize the algorithm. For problems of matrix dimension 50 x 50 and 100 x 100, we can see that the speed up is less than one. And for problems of matrix dimension 500 x 500 and above, we can see that the speed up is greater than one. Hence we can conclude that it is useless to use the PETSc‟s for problem size less than 500 x 500. Efficiency tells us that how much time or resources are wasted in communication overhead, synchronization etc. Ideally, efficiency should be equal to one. From the graphs of efficiency, we can see that efficiency approaches to one when our problem size is very large i.e. 5000 and above. Moreover, we can notice that efficiency approaches to one when number of processes ranges from 4 to 8.
(^00 20 40 60 80 100 120 )
Parallel Time Comparison Size 10,
Number of Processors
Time (seconds)
Virtu Grid Grid 8n
Figure 5-9 Speed Up for Matrix Vector Multiplication of Problem Size 500 x 500
Figure 5-10 Speed Up for Matrix Vector Multiplication of Problem Size 1000 x 1000
(^00 20 40 60 80 100 120 )
1
1.5 Speed Up for Problem Size 500 x 500
Number of Processors
Speed UP
Virtu Grid Grid 8n
(^00 20 40 60 80 100 120 )
1
2
3 Speed Up for Problem Size 1000 x 1000
Number of Processors
Speed UP
Virtu Grid Grid 8n
Figure 5-11 Speed Up for Matrix Vector Multiplication of Problem Size 5000 x 5000
Figure 5-12 Speed Up for Matrix Vector Multiplication of Problem Size 10,000 x 10,
(^00 20 40 60 80 100 120 )
1
2
3
4
5
6
7
8 Speed Up for Problem Size 5000 x 5000
Number of Processors
Speed UP
Virtu Grid Grid 8n
(^00 20 40 60 80 100 120 )
2
4
6
8
10
12
14 Speed Up for Problem Size 10,000 x 10,
Number of Processors
Speed UP
Virtu Grid Grid 8n
Figure 5-15 Efficiency for Matrix Vector Multiplication of Problem Size 500 x 500
Figure 5-16 Efficiency for Matrix Vector Multiplication of Problem Size 1000 x 1000
(^00 20 40 60 80 100 120 )
0.7 Problem Size 500 x 500
Number of Processors
Efficiency
Virtu Grid Grid 8n
(^00 20 40 60 80 100 120 )
0.9 Problem Size 1000 x 1000
Number of Processors
Efficiency
Virtu Grid Grid 8n
Figure 5-17 Efficiency for Matrix Vector Multiplication of Problem Size 5000 x 5000
Figure 5-18 Efficiency for Matrix Vector Multiplication of Problem Size 10,000 x 10,
(^00 20 40 60 80 100 120 )
1 Problem Size 5000 x 5000
Number of Processors
Efficiency
Virtu Grid Grid 8n
(^00 20 40 60 80 100 120 )
1 Problem Size 10,000 x 10,
Number of Processors
Efficiency
Virtu Grid Grid 8n