Midterm Exam Solutions - High-Performance Computer | CSI 702, Exams of Computer Science

Material Type: Exam; Professor: Wallin; Class: High-Performance Comput; Subject: Computational Sci& Informatics; University: George Mason University; Term: Fall 2007;


Uploaded on 12/09/2008


CSI 702 2007 Midterm Exam SOLUTION

Answers should be brief - most require only one paragraph. Please use additional paper to answer these questions, and write your name at the top of each page. Good luck!

Honor Code Certification

Name:

I certify that I have abided by the GMU honor code in taking this examination. The work on this exam is my own. I have received no assistance from other persons in completing this exam.

Signature:

1. (35 pts) Optimization and Modern CPUs. For this problem, use the table, plots, and profile on the next page.

(a) (5 pts) Explain why branches within loops generally reduce performance.

Branches break the instruction pipeline. The CPU cannot easily pre-load instructions if it does not know which instructions will execute next.

(b) (5 pts) Explain why unrolling loops generally improves performance.

Unrolling a loop exposes a longer straight-line sequence of instructions, so the compiler can tell what will come next. This improves the numerical and instruction pipelining in the system. It also cuts down the loop overhead, since the counter update and exit test run fewer times.

(c) (5 pts) Give a short pseudo-code example of how a "sentinel value" can be used to improve performance.

Instead of testing for the end of an array AND a target value, we can simplify the loop to a single test.

    x(n+1) = target_value
    i = 1
    while (x(i) /= target_value)
      i = i + 1
    endwhile
    if (i == n+1) then
      target_index = not_found
    else
      target_index = i
    endif
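The sentinel pseudo-code for part (c) can also be sketched in C. This is only an illustration, not part of the original solution: the names `sentinel_search` and `NOT_FOUND` are hypothetical, indices are zero-based, and the caller is assumed to allocate one spare slot at `x[n]` that the routine overwrites.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "not found" marker, playing the role of not_found above. */
#define NOT_FOUND ((size_t)-1)

/* Search x[0..n-1] for target with a single test per iteration.
   Assumption: x has room for n + 1 elements; x[n] is overwritten. */
size_t sentinel_search(int *x, size_t n, int target)
{
    size_t i = 0;

    assert(x != NULL);
    x[n] = target;            /* sentinel: guarantees the loop stops */
    while (x[i] != target) {  /* no separate end-of-array test */
        i++;
    }
    return (i == n) ? NOT_FOUND : i;
}
```

Calling `sentinel_search(data, 4, 9)` on an array holding 7, 3, 9, 4 plus one spare slot returns index 2; searching for a value that is absent stops at the sentinel and returns `NOT_FOUND`.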

(d) (5 pts) The table and graph below show the CPU and memory requirements as a function of run size for a particular code using the workstations in Research I room 249 (aka the COS cluster). What is the maximum run size practical on this workstation and why?

The big issue in this run is memory. The 256k run takes 100 Mbytes, so increasing the run size by a factor of 20 needs about 2 Gbytes and we run out of memory. CPU time, by contrast, is not the limit: a run 20 times larger would take only about 20 hours to complete. Although this would annoy most of the users on this machine, it would be possible. Output files are also an issue, since we need to consider how much data will be stored.

(e) (5 pts) Assume you needed to do twenty runs with a size of 100 million. Estimate the computational requirements and CPU configuration for the run. Justify this estimate, and explain changes that might be needed to run the code on this machine.

Assuming that the runs scale linearly as the graph suggests, we would need about 416 hours of CPU time and about 41.6 Gbytes of memory. The memory is the big issue, since we can't easily fit this on a single workstation. Getting this run to complete would involve using a cluster of machines or a shared-memory machine. We would probably need to spend a great deal of time converting the code to MPI or OpenMP, and we would need extra time for the runs because of the inherently lower computational efficiency of a parallel run. It would probably be good to build in an extra factor of two or three on both memory and CPU.

(f) (5 pts) Estimate how long it will take before a 100 million particle run could be executed on a typical home computer.

We need a computer that has a lot more memory for this to work. Memory is increasing rapidly, with a doubling time of a little less than 18 months. We currently have about 2 Gbytes. To reach the roughly 41.6 Gbytes estimated in part (e), we need about

$$ 18\ \text{months} \times \frac{\log(41.6)}{\log(2)} = (5.37 \times 18)\ \text{months} \approx 8\ \text{years} \qquad (1) $$

Equation (1) counts doublings from 1 Gbyte; counting from the current 2 Gbytes removes one doubling and gives about 6.5 years.
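As a coarse cross-check on the doubling-time estimate in part (f), the short C sketch below counts whole doublings instead of evaluating the fractional logarithm, so it rounds the answer up. The function names are hypothetical, and the 18-month doubling time is the figure assumed in the answer above.

```c
#include <assert.h>

/* Number of whole doublings needed to grow from current to target. */
int doublings_needed(double current, double target)
{
    int n = 0;

    assert(current > 0.0);
    while (current < target) {
        current *= 2.0;
        n++;
    }
    return n;
}

/* Years to reach target, assuming a fixed doubling time in months. */
double years_needed(double current, double target, double doubling_months)
{
    return doublings_needed(current, target) * doubling_months / 12.0;
}
```

Starting from 2 Gbytes, `years_needed(2.0, 41.6, 18.0)` counts five doublings (2, 4, 8, 16, 32, 64) and returns 7.5 years, which is in line with the estimate of roughly 8 years.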

Profile (excerpt):

    % time   cumulative s   self s   calls   self/call   total/call   name
      1.04        3964.36    42.80     366        0.00         0.00   build_module_mp_build_recursive_
      0.48        3984.03    19.68     366        0.00         0.00   build_module_mp_quadrupole_recurs
      0.26        3994.59    10.56     366        0.00         0.00   build_module_mp_build_master_
      0.23        4004.20     9.60     364        0.00         0.00   verlet_module_mp_predict_verlet_i
      0.21        4012.74     8.55     366        0.00         0.00   build_module_mp_build_tree_
      0.21        4021.22     8.47     366        0.00         0.00   build_module_mp_set_mac_bmax_

2. (30 pts) Threads, Processes and Sockets

For this problem, use the code on the next page.

(a) (5 pts) What is the most likely value of the "final result"? Explain your conclusion.

The most likely value is zero. The main program continues execution after the threads are created. Since we are not joining the threads, main most likely reaches the final print before any thread has updated the counter.

(b) (5 pts) What is the most likely output from the "Hello World" line? Be as specific as possible and explain your conclusion.

You will get the lines in a random order, for example:

    Hello World! It's me, thread #4 counter = 4
    Hello World! It's me, thread #3 counter = 3
    Hello World! It's me, thread #1 counter = 1
    Hello World! It's me, thread #5 counter = 5
    Hello World! It's me, thread #0 counter = 0
    Hello World! It's me, thread #6 counter = 6
    Hello World! It's me, thread #7 counter = 7

In each case, the thread number equals the counter, since each thread reads the counter at the beginning of the routine, typically before any other thread has updated it.

(c) (10 pts) Neatly annotate the code below with changes that would be needed so that the final summation is equal to 28 (0+1+2+3+4+5+6+7 for 8 processors).

There are two main changes needed.

  1. A join needs to be added inside the main program for each thread that is spawned. These joins must precede the final printout.
  2. We need to remove the initial counter read at the beginning of the routine and put mutex locks around the update section of the routine.

(d) (10 pts) Explain the essential differences between threads and processes. Explain why we would use sockets and processes to implement MPI instead of threads.

Threads share memory and resources. They are not independent of each other, and can update shared memory. Processes are independent units of program execution with their own resources and memory. You cannot easily access the memory of a different process. One of the best ways to communicate between processes is sockets. Using sockets, we can pass messages and data between different processes that could be on different machines. MPI was designed to work on clusters of workstations, and we can't split threads across different boxes in the same way we split processes across boxes.

    #define NUM_THREADS 8
    int counter = 0;

    void *test_mutex(void *tid) {
      int *tt;
      int thread_id;
      int rnd;
      int count_start;

      tt = tid;
      thread_id = (int) *tt;
      count_start = counter;
      rnd = rand() % 5;
      sleep(rnd);
      counter = count_start + thread_id;
      printf("Hello World! It's me, thread #%d counter = %d!\n", thread_id, counter);
    }

    int main(int argc, char *argv[]){
      pthread_t thread1[NUM_THREADS];
      int t;
      int ec;
      int thread_ids[NUM_THREADS];
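The two changes described in part (c) can be sketched as follows. The body of `main` is cut off in the listing above, so this is an assumption about its shape rather than the original code: `run_counter_demo` is a hypothetical driver standing in for `main`, `add_thread_id` stands in for `test_mutex`, and the random `sleep` is omitted to keep the sketch short.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 8

static int counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Change 2: no early read of counter; the update sits inside a mutex. */
static void *add_thread_id(void *arg)
{
    int thread_id = *(int *)arg;

    pthread_mutex_lock(&counter_lock);
    counter += thread_id;  /* read-modify-write happens under the lock */
    printf("Hello World! It's me, thread #%d counter = %d!\n",
           thread_id, counter);
    pthread_mutex_unlock(&counter_lock);
    return NULL;
}

/* Hypothetical driver: spawn the threads, join them, return the sum. */
int run_counter_demo(void)
{
    pthread_t threads[NUM_THREADS];
    int thread_ids[NUM_THREADS];
    int t;

    counter = 0;
    for (t = 0; t < NUM_THREADS; t++) {
        thread_ids[t] = t;
        int ec = pthread_create(&threads[t], NULL, add_thread_id,
                                &thread_ids[t]);
        assert(ec == 0);
    }
    /* Change 1: join every thread before the final result is used. */
    for (t = 0; t < NUM_THREADS; t++) {
        pthread_join(threads[t], NULL);
    }
    return counter;
}
```

With both changes in place the returned sum is 0+1+...+7 = 28 regardless of the order in which the threads run, although the printed counter values then depend on that order.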

(c) (25 pts) Using MPI, write a routine that broadcasts data from a central CPU to all the other CPUs in a COMM group. You may assume we are broadcasting only from node 0 for this problem and that we have a processor group of size 2^n where n is an integer. However, the routine should be efficiently implemented with a minimum number of communications. The routine should only use MPI_SEND, MPI_RECV, MPI_COMM_RANK, and MPI_COMM_SIZE, and should NOT use the broadcast commands. The logic and algorithm are more important than the syntax for this problem.

To implement this efficiently, we need to use a fan-out routine.

    ! the pattern will be
    !
    ! iteration   sending   receiving
    !     1       0         1
    !     2       01        23
    !     3       0123      4567
    ! etc...

    ! status is assumed to be declared as: integer status(MPI_STATUS_SIZE)
    call MPI_INIT(ierr)

    call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, mysize, ierr)

    ! figure out the number of iterations needed for the send
    ncomm = int( log(real(mysize)) / log(2.0) + 0.00001 )

    ! loop over the iterations
    do i = 1, ncomm

      ! define boundaries between who is sending and who is receiving
      idivide = 2**(i-1)
      iend = 2**i

      ! if i am below the divide point and there is someone to send a message to,
      ! send a message
      if (myrank < idivide) then
        target = myrank + idivide
        if (target < mysize) then
          call MPI_SEND(value, size, MPI_TYPE, target, tag, MPI_COMM_WORLD, ierr)
        endif
      endif

      ! if i am above the divide point and below the limit of this iteration,
      ! receive a message
      if (myrank >= idivide .AND. myrank < iend) then
        source = myrank - idivide
        call MPI_RECV(value, size, MPI_TYPE, source, tag, MPI_COMM_WORLD, status, ierr)
      endif

    enddo

    call MPI_FINALIZE(ierr)
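To check the fan-out pattern without an MPI installation, the schedule can be simulated serially. The C sketch below is an illustration only and makes no MPI calls: `simulate_fanout` and `MAX_RANKS` are hypothetical names, and the array `has_value` stands in for the data held by each rank.

```c
#include <assert.h>

#define MAX_RANKS 1024  /* arbitrary limit for this sketch */

/* Simulate the fan-out schedule for nprocs ranks (a power of two).
   Returns the number of rounds; stores the message count in *messages. */
int simulate_fanout(int nprocs, int *messages)
{
    int has_value[MAX_RANKS] = {0};
    int rounds = 0;
    int idivide, rank;

    assert(nprocs >= 1 && nprocs <= MAX_RANKS);
    assert((nprocs & (nprocs - 1)) == 0);  /* power-of-two check */

    has_value[0] = 1;  /* rank 0 is the root and starts with the data */
    *messages = 0;

    /* idivide doubles each round: 1, 2, 4, ... */
    for (idivide = 1; idivide < nprocs; idivide *= 2) {
        for (rank = 0; rank < idivide; rank++) {
            int target = rank + idivide;

            assert(has_value[rank]);  /* a sender must already hold data */
            if (target < nprocs) {
                has_value[target] = 1;  /* stands in for a send/recv pair */
                (*messages)++;
            }
        }
        rounds++;
    }

    /* every rank must have received the data by the end */
    for (rank = 0; rank < nprocs; rank++) {
        assert(has_value[rank]);
    }
    return rounds;
}
```

For 8 ranks the simulation finishes in 3 rounds using 7 messages, compared with 7 sequential sends if rank 0 contacted every other rank itself.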