CS 441/632/732 Parallel Computing - HW3: Matrix Multiplication with 2-D Data Distribution

Requirements for Homework 3 in the Fall 2005 CS 441/632/732 Parallel Computing course. Students implement matrix-matrix multiplication with 2-D data distribution using Cannon's algorithm and Fox's algorithm, analyze the performance of both algorithms, and plot the speedup for different process grid layouts and matrix sizes. Each process should allocate and initialize only the parts of the matrices it requires, and the BLAS routine dgemm should be used for local matrix-matrix multiplication.


Uploaded on 04/12/2010

Fall 2005 CS 441/632/732 Parallel Computing
Homework-3
10/11/2005
Individual work only. 150 points. Due Oct 25, 2005.
1. Implement matrix-matrix multiplication with 2-D data distribution using (a) Cannon’s
algorithm (section 11.2.4, page 348) and (b) Fox’s algorithm (handout given in class).
Measure the time taken for matrix sizes 5000x5000 and 10000x10000 with the following
process grid layouts: 1x1, 2x2, and 4x4. Analyze the performance of Cannon’s algorithm vs.
Fox’s algorithm and plot the speedup for the different process grid layouts and the two
matrix sizes. To reduce the overall execution time of the program, allocate and initialize only
the required parts of the matrices in each process. Do not include the time spent in allocation
and initialization when measuring the time taken for matrix-matrix multiplication. Use the BLAS
routine dgemm to perform the local matrix-matrix multiplication, and use an input file to
provide the matrix size and process grid layout (the name of the input file must be specified as
a command line argument). Implementation of Cannon’s algorithm is optional for
undergraduate students.
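As a way to check the block bookkeeping before writing any MPI code, the two algorithms can be simulated serially. The sketch below is only an illustration, not the required implementation: it uses pure Python instead of MPI and dgemm, models shifts and broadcasts by re-indexing a q x q grid of blocks, and all function names (matmul_add, split, join, cannon, fox, reference) are hypothetical.

```python
# Serial simulation of Cannon's and Fox's algorithms on a q x q grid of blocks.
# Assumption: no MPI is used; shifts and broadcasts are modelled by re-indexing.

def matmul_add(c, a, b):
    """c += a * b for dense blocks stored as lists of rows (stand-in for dgemm)."""
    for i in range(len(a)):
        for k in range(len(b)):
            aik = a[i][k]
            for j in range(len(b[0])):
                c[i][j] += aik * b[k][j]

def split(mat, q):
    """Cut an n x n matrix into a q x q grid of (n/q) x (n/q) blocks."""
    n = len(mat)
    assert n % q == 0, "matrix size must be divisible by the grid dimension"
    s = n // q
    return [[[row[j * s:(j + 1) * s] for row in mat[i * s:(i + 1) * s]]
             for j in range(q)] for i in range(q)]

def join(blocks):
    """Reassemble a grid of blocks into one matrix (inverse of split)."""
    out = []
    for block_row in blocks:
        for r in range(len(block_row[0])):
            row = []
            for blk in block_row:
                row.extend(blk[r])
            out.append(row)
    return out

def zero_grid(q, s):
    """q x q grid of s x s zero blocks for the result matrix C."""
    return [[[[0.0] * s for _ in range(s)] for _ in range(q)] for _ in range(q)]

def cannon(a, b, q):
    A, B, C = split(a, q), split(b, q), zero_grid(q, len(a) // q)
    # Initial alignment: block row i of A shifts left by i,
    # block column j of B shifts up by j.
    A = [[A[i][(j + i) % q] for j in range(q)] for i in range(q)]
    B = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                matmul_add(C[i][j], A[i][j], B[i][j])
        # Shift A left by one block and B up by one block (with wraparound).
        A = [[A[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        B = [[B[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return join(C)

def fox(a, b, q):
    A, B, C = split(a, q), split(b, q), zero_grid(q, len(a) // q)
    for step in range(q):
        for i in range(q):
            # Block A[i][(i + step) % q] is "broadcast" along grid row i.
            a_bcast = A[i][(i + step) % q]
            for j in range(q):
                matmul_add(C[i][j], a_bcast, B[i][j])
        # Shift B up by one block (with wraparound).
        B = [[B[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return join(C)

def reference(a, b):
    """Plain triple-loop product used to check the block algorithms."""
    c = [[0.0] * len(b[0]) for _ in range(len(a))]
    matmul_add(c, a, b)
    return c
```

In an MPI version, each list re-indexing step would become a communication call and matmul_add would become a dgemm call on the local blocks; comparing against reference on a small matrix is one way to validate the block indexing.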
Note that the problem sizes and process grid layouts were chosen such that each process
performs multiplication of square matrices. Also, in the speedup measurements above, the
problem size is fixed and the number of processes is varied. In addition, measure the time
taken by Cannon’s algorithm and Fox’s algorithm when the matrix size is increased as the
number of processes is increased. Assign each process matrices of size 1000x1000 and use the
following process grid layouts: 1x1, 2x2, 3x3, and 4x4 (the corresponding global problem
sizes will be 1000x1000, 2000x2000, 3000x3000, and 4000x4000). Plot the scaled speedup
for both algorithms. Use the example programs provided with the MPI tutorial as a
starting point and use MPI point-to-point and collective communication primitives for
message passing. Report the timing measurements in the following tables:
                  5000x5000                         10000x10000
# of Processes    1x1 = 1   2x2 = 4   4x4 = 16      1x1 = 1   2x2 = 4   4x4 = 16
Cannon
Fox

# of Processes    1x1 = 1      2x2 = 4      3x3 = 9      4x4 = 16
Problem size      1000x1000    2000x2000    3000x3000    4000x4000
Cannon
Fox
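
Once the tables above are filled in, the quantities to plot can be derived from the measured times. The helpers below are a minimal sketch with hypothetical names. The scaled speedup formula is an assumption, not something the assignment specifies: it estimates the one-process time for the larger problem by assuming serial matrix multiplication costs O(n^3), so a problem q times larger per dimension would take q^3 times the 1x1 time. Use the definition given in class if it differs.

```python
def speedup(t_serial, t_parallel):
    """Fixed-size speedup: time on the 1x1 grid divided by time on the larger grid."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, num_procs):
    """Speedup divided by the number of processes."""
    return speedup(t_serial, t_parallel) / num_procs

def scaled_speedup(t_base, t_parallel, q):
    """Scaled speedup for a q x q grid with a fixed per-process block size.

    Assumption: the serial time for the q-times-larger matrix is estimated
    as q**3 * t_base, where t_base is the measured 1x1 time.
    """
    return (q ** 3) * t_base / t_parallel
```

For example, speedup(t_1x1, t_4x4) with the two 5000x5000 timings for one algorithm gives one point on that algorithm's speedup curve.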
General Comments:
You must implement and test these programs on the CIS cluster (Everest) and use MPI for
communication. Instructions for using the CIS cluster and submitting jobs to SGE can be found
at: http://www.cis.uab.edu/ccl/resources/everest/EverestGridNodeUserGuide.php. When
submitting to the queue, you must request # of processors = # of processes; for example, for the
process grid layout 3x3, total # of processors requested = 9.
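
The processor count to request and the local block size per process follow directly from the grid layout and the global matrix size. The helper below is a small sketch; the function name grid_plan and the "RxC" layout string are illustrative assumptions, since the assignment does not fix an input file format.

```python
def grid_plan(layout, n):
    """Return the process count and local block shape for a grid layout.

    layout: string such as "3x3" (hypothetical format), n: global matrix size.
    """
    rows, cols = (int(part) for part in layout.lower().split("x"))
    if n % rows != 0 or n % cols != 0:
        raise ValueError("matrix size must be divisible by the grid dimensions")
    return {
        "processes": rows * cols,   # also the number of processors to request
        "local_rows": n // rows,
        "local_cols": n // cols,
    }
```

For the 3x3 layout with a 3000x3000 global matrix, this gives 9 processes, each holding 1000x1000 blocks, which matches the example above.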

Submission: Email the source code, along with any Makefile and scripts, as a single tar file attachment to [email protected] with the subject “CS 441/632/732 Homework-3.” Turn in a printed report in class using the format provided at http://www.cis.uab.edu/cs441/report.html. After submission, do not make any changes to your source code on Everest: you will be asked to demonstrate your program on Everest, and the timestamps of the files will be used to determine late submissions.

Grading:

Correct implementation and testing of the programs (including collecting timing information for the tables above): 115 points

Performance analysis: 25 points

Lab report format/presentation: 10 points