




Material Type: Exam; Professor: Wallin; Class: High-Performance Comput; Subject: Computational Sci& Informatics; University: George Mason University; Term: Fall 2007;
Typology: Exams





CSI 702 2007 Midterm Exam SOLUTION
Answers should be brief - most require only one paragraph. Please use additional paper to answer these questions, and write your name at the top of each page. Good luck!
Honor Code Certification Name :
I certify that I have abided by the GMU honor code in taking this examination. The work on this exam is my own. I have received no assistance from other persons in completing this exam.
Signature:
(a) (5 pts) Explain why branches within loops generally reduce performance.
Branches break the instruction pipeline. The processor can't easily pre-load instructions if it doesn't know which instructions will be executed next.

(b) (5 pts) Explain why unrolling loops generally improves performance.
Unrolling the loops helps the compiler figure out which instruction will be next in a sequence. This improves numerical and instruction pipelining. It also cuts down the amount of loop overhead.

(c) (5 pts) Give a short pseudo-code example of how a "sentinel value" can be used to improve performance.
Instead of testing for the end of an array AND a target value, we can simplify the loop to a single test.

    x(n+1) = target_value
    i = 1
    while (x(i) /= target_value)
       i = i + 1
    endwhile
    if (i == n+1) then
       target_index = not_found
    else
       target_index = i
    endif
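
The sentinel idea in (c) can be sketched in C as follows. This is a minimal illustration, not part of the original solution: the function name `find_with_sentinel` and the convention of returning -1 for "not found" are assumptions, and the caller is assumed to provide one spare slot at the end of the array for the sentinel.

```c
#include <assert.h>
#include <stddef.h>

/* Linear search with a sentinel (hypothetical helper, not from the exam).
 * x must have room for n + 1 elements; slot x[n] is overwritten with the
 * sentinel. Returns the 0-based index of target, or -1 if it is absent. */
long find_with_sentinel(int *x, size_t n, int target)
{
    assert(x != NULL);

    x[n] = target;              /* sentinel: the loop is guaranteed to stop */

    size_t i = 0;
    while (x[i] != target)      /* single test per iteration, no bounds check */
        i++;

    /* If we only stopped at the sentinel slot, the target was not present. */
    return (i == n) ? -1 : (long)i;
}
```

The loop body performs one comparison per element instead of two, which is the saving the answer describes; the cost is one extra array slot and one extra store before the loop.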
(d) (5 pts) The table and graph below show the CPU and memory requirements as a function of run size for a particular code using the workstations in Research I room 249 (aka the COS cluster). What is the maximum run size practical on this workstation and why?
The big issue in this run is memory. The 256k run takes 100 MB. If we increase the run size by a factor of 20, we run out of memory. By contrast, a run size 20 times larger will only take about 20 hours to complete. Although this would annoy most of the users on this machine, it would be possible. Output files are also an issue, since we need to consider how much data will be stored.

(e) (5 pts) Assume you needed to do twenty runs with a size of 100 million. Estimate the computational requirements and CPU configuration for the run. Justify this estimate, and explain changes that might be needed to run the code on this machine.
Assuming that the runs scale linearly as the graph suggests, we would need about 416 hours of CPU time and a memory of about 41.6 Gbytes. The memory space is the big issue, since we can't easily fit this on a single workstation. Getting this run to complete would involve using a cluster of machines or a shared memory machine. We would probably need to spend a great deal of time converting the code to MPI or OpenMP, and we would need extra time for the runs because of the inherently lower computational efficiency of the parallel run. It would probably be good to build in an extra factor of two or three on both memory and CPU.

(f) (5 pts) Estimate how long it will take before a 100 million particle run could be executed on a typical home computer.
We need a computer that has a lot more memory for this to work. Memory is increasing rapidly with a doubling time of a little less than 18 months. We currently have a size of about 2 Gbytes. To increase this to 48 Gbytes,

    18 months × (log(41.6) / log(2)) = (5.37 × 18) months ≈ 8 years    (1)
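
The arithmetic in equation (1) can be written out as a general doubling-time relation. This is a sketch under the stated assumption of steady exponential growth; the symbols $T_d$ (doubling time), $M_0$ (memory available now) and $M$ (memory required) are introduced here for illustration and do not appear in the original solution.

```latex
t \;=\; T_d \,\log_2\!\left(\frac{M}{M_0}\right)
  \;=\; 18~\text{months} \times \frac{\log(41.6)}{\log(2)}
  \;\approx\; 18~\text{months} \times 5.38
  \;\approx\; 97~\text{months}
  \;\approx\; 8~\text{years}
```

Equation (1) uses a growth factor of $M/M_0 = 41.6$. If the 2 Gbyte starting point quoted in the text is used instead, the factor is $41.6/2 = 20.8$, so $\log_2(20.8) \approx 4.38$ doublings, or roughly 79 months (about 6.6 years); the estimate is sensitive to the assumed baseline.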
% time  cumulative s  self s  calls  self/call  total/call  name
  1.04       3964.36   42.80    366       0.00        0.00  build_module_mp_build_recursive_
  0.48       3984.03   19.68    366       0.00        0.00  build_module_mp_quadrupole_recurs
  0.26       3994.59   10.56    366       0.00        0.00  build_module_mp_build_master_
  0.23       4004.20    9.60    364       0.00        0.00  verlet_module_mp_predict_verlet_i
  0.21       4012.74    8.55    366       0.00        0.00  build_module_mp_build_tree_
  0.21       4021.22    8.47    366       0.00        0.00  build_module_mp_set_mac_bmax_
For this problem, use the code on the next page.
(a) (5 pts) What is the most likely value of the "final result"? Explain your conclusion.
The most likely value is zero. The original program will continue execution after the threads are created. Since we are not joining the threads, the updates have not happened by the time the main program finishes.

(b) (5 pts) What is the most likely output from the "Hello World" line? Be as specific as possible and explain your conclusion.
You will get a randomly ordered list of outputs like

    Hello World! It's me, thread #4 counter = 4
    Hello World! It's me, thread #3 counter = 3
    Hello World! It's me, thread #1 counter = 1
    Hello World! It's me, thread #5 counter = 5
    Hello World! It's me, thread #0 counter = 0
    Hello World! It's me, thread #6 counter = 6
    Hello World! It's me, thread #7 counter = 7

In each case, the thread number will equal the counter, since the counter variable is read at the beginning of the thread, before any thread has updated it.

(c) (10 pts) Neatly annotate the code below with changes that would be needed so that the final summation is equal to 28 (0+1+2+3+4+5+6+7 for 8 processors).
There are two main changes needed.
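
The list of changes is not preserved in this copy of the solution. As a hedged sketch of one way to reach 28, assuming the two changes are (1) protecting the read-modify-write of `counter` with a mutex and (2) joining the threads before reading the result, the code could look like the following. The names `add_thread_id`, `sum_thread_ids` and `counter_lock` are illustrative and not from the exam, and the `sleep` and `printf` calls of the exam code are left out to keep the example short.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 8

static int counter = 0;
/* Assumed change 1: a mutex guarding every update of counter. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *add_thread_id(void *arg)
{
    int thread_id = *(int *)arg;

    /* Read and update counter as one critical section, so no other
     * thread can change it between the read and the write. */
    pthread_mutex_lock(&counter_lock);
    counter = counter + thread_id;
    pthread_mutex_unlock(&counter_lock);

    return NULL;
}

/* Creates the threads, waits for all of them, and returns the final sum. */
int sum_thread_ids(void)
{
    pthread_t threads[NUM_THREADS];
    int thread_ids[NUM_THREADS];

    counter = 0;
    for (int t = 0; t < NUM_THREADS; t++) {
        thread_ids[t] = t;   /* each thread gets its own id slot */
        int ec = pthread_create(&threads[t], NULL, add_thread_id, &thread_ids[t]);
        assert(ec == 0);
    }

    /* Assumed change 2: join every thread before using the result. */
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    return counter;
}
```

Calling `sum_thread_ids()` from `main` and printing its return value plays the role of the "final result" line. Without the joins the main thread may read `counter` before any update; without the mutex two threads can read the same starting value and one update is lost.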
Threads share memory and resources. They are not independent of each other, and can update shared memory. Processes are independent units of program execution with their own resources and memory. You cannot easily access the memory of a different process. One of the best ways to communicate between processes is sockets. Using sockets, we can pass messages and data between different processes that could be on different machines. MPI was designed to work on clusters of workstations, and we can't split threads across different boxes in the same way we split processes across boxes.
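
The claim that a process cannot simply update another process's memory can be illustrated with a small POSIX sketch. This example is an addition, not part of the original answer; the function name `parent_value_after_child_write` and the variable `process_local_value` are hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int process_local_value = 0;

/* Forks a child that overwrites the variable, waits for the child to exit,
 * and returns the value the parent sees afterwards.
 * Returns -1 if fork() fails. */
int parent_value_after_child_write(void)
{
    process_local_value = 0;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: this write lands in the child's own copy of memory. */
        process_local_value = 42;
        _exit(0);
    }

    /* Parent: wait for the child, then read our own, unchanged copy. */
    waitpid(pid, NULL, 0);
    return process_local_value;
}
```

The parent still sees 0 after the child has written 42, because each process has its own address space. A thread doing the same write would change the value seen by every other thread, which is why threads need mutexes and processes need explicit communication such as sockets or MPI messages.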
#define NUM_THREADS 8
int counter = 0;

void *test_mutex(void *tid) {
    int *tt;
    int thread_id;
    int rnd;
    int count_start;

    tt = tid;
    thread_id = (int) *tt;
    count_start = counter;
    rnd = rand() % 5;
    sleep(rnd);
    counter = count_start + thread_id;
    printf("Hello World! It's me, thread #%d counter = %d!\n", thread_id, counter);
}

int main(int argc, char *argv[]){
    pthread_t thread1[NUM_THREADS];
    int t;
    int ec;
    int thread_ids[NUM_THREADS];
(c) (25 pts) Using MPI, write a routine that broadcasts data from a central CPU to all the other CPUs in a COMM group. You may assume we are broadcasting only from node 0 for this problem and that we have a processor group of size 2^n where n is an integer. However, the routine should be efficiently implemented with a minimum number of communications. The routine should only use MPI_SEND, MPI_RECV, MPI_COMM_RANK, and MPI_COMM_SIZE, and should NOT use the broadcast commands. The logic and algorithm are more important than the syntax for this problem.
To implement this efficiently, we need to use a fan-out routine.
! the pattern will be
!
! iteration   sending   receiving
!     1       0         1
!     2       01        23
!     3       0123      4567
! etc...

call MPI_INIT(ierr)

call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, mysize, ierr)

! figure out the number of iterations needed for the send
ncomm = int( log(real(mysize)) / log(2.0) + 0.00001 )

! loop over the iterations
do i = 1, ncomm

   ! define boundaries between who is sending and who is receiving
   idivide = 2**(i-1)
   iend = 2**i

   ! if I am below the divide point and there is someone to send a message to,
   ! send a message
   if (myrank < idivide) then
      target = myrank + idivide
      if (target < mysize) then
         call MPI_SEND(value, size, MPI_TYPE, target, tag, MPI_COMM_WORLD, ierr)
      endif
   endif

   ! if I am at or above the divide point and below the limit of this iteration,
   ! receive a message (status is an integer array of size MPI_STATUS_SIZE)
   if (myrank >= idivide .AND. myrank < iend) then
      source = myrank - idivide
      call MPI_RECV(value, size, MPI_TYPE, source, tag, MPI_COMM_WORLD, status, ierr)
   endif

enddo

call MPI_FINALIZE(ierr)
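
The fan-out schedule can be checked without an MPI installation by simulating it in plain C. The sketch below makes no MPI calls; it only replays the send/receive pattern (in round i, every rank below 2^(i-1) sends to rank + 2^(i-1)) and counts rounds and messages. The function name `simulate_fanout` and the `MAX_RANKS` limit are assumptions made for this illustration.

```c
#include <assert.h>
#include <string.h>

#define MAX_RANKS 1024   /* arbitrary cap for this illustration */

/* Replays the fan-out broadcast from rank 0 over nranks ranks.
 * Stores the number of point-to-point messages in *messages.
 * Returns the number of rounds, or -1 if some rank never got the data. */
int simulate_fanout(int nranks, int *messages)
{
    int has_data[MAX_RANKS];

    assert(nranks >= 1 && nranks <= MAX_RANKS);
    memset(has_data, 0, sizeof has_data);
    has_data[0] = 1;             /* only the root holds the value at first */
    *messages = 0;

    int rounds = 0;
    /* idivide takes the values 1, 2, 4, ... as in the routine above. */
    for (int idivide = 1; idivide < nranks; idivide *= 2) {
        for (int rank = 0; rank < idivide; rank++) {
            int target = rank + idivide;
            if (target < nranks && has_data[rank]) {
                has_data[target] = 1;    /* stands in for one send/recv pair */
                (*messages)++;
            }
        }
        rounds++;
    }

    for (int rank = 0; rank < nranks; rank++)
        if (!has_data[rank])
            return -1;
    return rounds;
}
```

For 8 ranks the simulation finishes in 3 rounds with 7 messages. Every rank other than the root must receive the value exactly once, so 7 messages is the minimum for 8 ranks; the benefit of the fan-out is that they complete in log2(8) = 3 parallel steps rather than 7 sequential sends from rank 0.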