Midterm Exam Solutions - High-Performance Computer | CSI 702, Exams of Computer Science

George Mason University (GMU), Computer Science

Prof. John Wallin

Material Type: Exam; Professor: Wallin; Class: High-Performance Comput; Subject: Computational Sci& Informatics; University: George Mason University; Term: Fall 2007;

Typology: Exams

Pre 2010

Uploaded on 12/09/2008


CSI 702 2007 Midterm Exam SOLUTION

Answers should be brief - most require only one paragraph. Please use additional paper to answer these questions, and write your name at the top of each page. Good luck!

Honor Code Certification

Name :

I certify that I have abided by the GMU honor code in taking this examination. The work on this exam is my own. I have received no assistance from other persons in completing this exam.

Signature:

1. (35 pts) Optimization and Modern CPUs. For this problem, use the table, plots, and profile on the next page.

(a) (5 pts) Explain why branches within loops generally reduce performance.

Branches break the instruction pipeline. The processor cannot easily pre-load instructions when it does not know which instructions will execute next.

(b) (5 pts) Explain why unrolling loops generally improves performance.

Unrolling a loop helps the compiler determine which instructions come next in the sequence, which improves numerical and instruction pipelining. It also cuts down the loop overhead, since fewer counter updates and end-of-loop tests are executed.

(c) (5 pts) Give a short pseudo-code example of how a "sentinel value" can be used to improve performance.

Instead of testing for both the end of the array AND the target value on every iteration, we can store the target in an extra element at the end of the array and simplify the loop to a single test.

    x(n+1) = target_value
    i = 1
    while (x(i) /= target_value)
        i = i + 1
    end while
    if (i == n+1) then
        target_index = not_found
    else
        target_index = i
    endif
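The pseudo-code above can be sketched in C as follows. This is a minimal illustration, not part of the original solution: the function name `sentinel_search` and the convention of returning -1 for "not found" are assumptions, and the caller is assumed to have allocated one spare slot past the n real elements.

```c
#include <stddef.h>

/* Sentinel linear search (illustrative sketch).
 * ASSUMPTION: x has room for n + 1 elements, so x[n] is free to hold
 * the sentinel. Returns the index of the first match among the first
 * n elements, or -1 if the target is not present. */
long sentinel_search(int *x, size_t n, int target)
{
    size_t i = 0;

    x[n] = target;           /* plant the sentinel one past the real data */
    while (x[i] != target)   /* one test per iteration, no bounds check */
        i++;

    return (i == n) ? -1 : (long)i;
}
```

Because the sentinel guarantees that the loop terminates, the body contains one comparison and one branch per element instead of two.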

(d) (5 pts) The table and graph below show the CPU and memory requirements as a function of run size for a particular code using the workstations in Research I room 249 (aka the COS cluster). What is the maximum run size practical on this workstation and why?

The big issue in this run is memory. The 256k run takes 100 Mbytes. If we increase the run size by a factor of 20, we run out of memory. CPU time is less of a constraint: a run 20 times larger would take only about 20 hours to complete. Although this would annoy most of the users on this machine, it would be possible. Output files are also an issue, since we need to consider how much data will be stored.

(e) (5 pts) Assume you needed to do twenty runs with a size of 100 million. Estimate the computational requirements and CPU configuration for the run. Justify this estimate, and explain changes that might be needed to run the code on this machine.

Assuming that the runs scale linearly as the graph suggests, we would need about 416 hours of CPU time and about 41.6 Gbytes of memory. The memory space is the big issue, since we can't easily fit this on a single workstation. Getting this run to complete would involve using a cluster of machines or a shared memory machine. We would probably need to spend a great deal of time converting the code to MPI or OpenMP, and we would need extra time for the runs because of the inherently lower computational efficiency of a parallel run. It would probably be good to build in an extra factor of two or three on both memory and CPU.

(f) (5 pts) Estimate how long it will take before a 100 million particle run could be executed on a typical home computer.

We need a computer that has a lot more memory for this to work. Memory is increasing rapidly, with a doubling time of a little less than 18 months. We currently have a size of about 2 Gbytes. To increase this to 48 Gbytes,

$$18\ \text{months} \times \frac{\log(41.6)}{\log(2)} = (5.37 \times 18)\ \text{months} \approx 8\ \text{years} \qquad (1)$$
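As a worked check on equation (1), the general doubling-time relation can be written out explicitly. The symbols $T_d$ (doubling time), $M_{\text{now}}$, and $M_{\text{target}}$ are notation introduced here for illustration and do not appear in the original solution. Equation (1) corresponds to a growth factor of 41.6; if the growth is instead measured from the 2 Gbyte baseline quoted in the text, the same relation gives a somewhat shorter estimate.

```latex
\begin{align*}
t &= T_d \,\log_2\!\left(\frac{M_{\text{target}}}{M_{\text{now}}}\right) \\[4pt]
\text{growth factor } 41.6:\quad
t &= 18 \times \log_2(41.6) \approx 18 \times 5.38 \approx 97\ \text{months} \approx 8\ \text{years} \\[4pt]
\text{from a 2 Gbyte baseline:}\quad
t &= 18 \times \log_2\!\left(\frac{41.6}{2}\right) \approx 18 \times 4.38 \approx 79\ \text{months} \approx 6.6\ \text{years}
\end{align*}
```

Either way the estimate is on the order of six to eight years, which is the level of precision the question asks for.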

1.04  3964.36  42.80  366  0.00  0.00  build_module_mp_build_recursive_
0.48  3984.03  19.68  366  0.00  0.00  build_module_mp_quadrupole_recurs
0.26  3994.59  10.56  366  0.00  0.00  build_module_mp_build_master_
0.23  4004.20   9.60  364  0.00  0.00  verlet_module_mp_predict_verlet_i
0.21  4012.74   8.55  366  0.00  0.00  build_module_mp_build_tree_
0.21  4021.22   8.47  366  0.00  0.00  build_module_mp_set_mac_bmax_

2. (30 pts) Threads, Processes and Sockets

For this problem, use the code on the next page.

(a) (5 pts) What is the most likely value of the "final result"? Explain your conclusion.

The most likely value is zero. The main program continues execution after the threads are created. Since we are not joining the threads, their updates have not yet occurred when the main program prints the result and exits.

(b) (5 pts) What is the most likely output from the "Hello World" line? Be as specific as possible and explain your conclusion.

You will get a randomly ordered list of outputs like

    Hello World! It's me, thread #4 counter = 4
    Hello World! It's me, thread #3 counter = 3
    Hello World! It's me, thread #1 counter = 1
    Hello World! It's me, thread #5 counter = 5
    Hello World! It's me, thread #0 counter = 0
    Hello World! It's me, thread #6 counter = 6
    Hello World! It's me, thread #7 counter = 7

In each case, the counter will equal the thread number, since the shared variable is read at the beginning of the thread (while it is still zero) and each thread then overwrites it with that stale value plus its own id.

(c) (10 pts) Neatly annotate the code below with changes that would be needed so that the final summation is equal to 28 (0+1+2+3+4+5+6+7 for 8 processors).

There are two main changes needed.

1. A join needs to be added inside the main program for each thread that is spawned. These join calls must precede the final printout.

2. We need to remove the initial counter read at the beginning of the routine and put mutex locks around the update section of the routine.

(d) (10 pts) Explain the essential differences between threads and processes. Explain why we would use sockets and processes to implement MPI instead of threads.

Threads share memory and resources. They are not independent of each other, and they can update shared memory. Processes are independent units of program execution with their own resources and memory. You cannot easily access the memory of a different process. One of the best ways to communicate between processes is sockets. Using sockets, we can pass messages and data between different processes that could be on different machines. MPI was designed to work on clusters of workstations, and we can't split threads across different boxes in the same way we split processes across boxes.

    #define NUM_THREADS 8
    int counter = 0;

    void *test_mutex(void *tid)
    {
        int *tt;
        int thread_id;
        int rnd;
        int count_start;

        tt = tid;
        thread_id = (int) *tt;
        count_start = counter;
        rnd = rand() % 5;
        sleep(rnd);
        counter = count_start + thread_id;
        printf("Hello World! It's me, thread #%d counter = %d!\n", thread_id, counter);
    }

    int main(int argc, char *argv[]){
        pthread_t thread1[NUM_THREADS];
        int t;
        int ec;
        int thread_ids[NUM_THREADS];
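The two changes described in part (c) of the threads problem can be sketched as below. This is a minimal illustration under stated assumptions, not the instructor's annotated listing: the names `add_thread_id`, `run_threads`, and `counter_lock` are hypothetical, the random sleep is dropped for brevity, and the driver is written as a function that returns the final sum rather than completing the truncated `main` above.

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 8

static int counter = 0;
/* ASSUMPTION: a single global mutex protects the shared counter. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread adds its own id to the shared counter under the mutex,
 * with no early read of the counter outside the lock. */
static void *add_thread_id(void *arg)
{
    int thread_id = *(int *)arg;

    pthread_mutex_lock(&counter_lock);
    counter += thread_id;   /* read-modify-write happens inside the lock */
    printf("Hello World! It's me, thread #%d counter = %d\n",
           thread_id, counter);
    pthread_mutex_unlock(&counter_lock);
    return NULL;
}

/* Spawns NUM_THREADS threads and joins every one that was created
 * before reading the result. Returns the final sum, or -1 if any
 * thread could not be created. */
int run_threads(void)
{
    pthread_t threads[NUM_THREADS];
    int thread_ids[NUM_THREADS];
    int created = 0;
    int t;

    counter = 0;
    for (t = 0; t < NUM_THREADS; t++) {
        thread_ids[t] = t;
        if (pthread_create(&threads[t], NULL, add_thread_id,
                           &thread_ids[t]) != 0)
            break;
        created++;
    }
    for (t = 0; t < created; t++)
        pthread_join(threads[t], NULL);   /* wait before using counter */

    return (created == NUM_THREADS) ? counter : -1;
}
```

With the joins in place the sum is read only after every thread has finished, and with the mutex no update is lost, so the result is 28 regardless of the order in which the threads run. The per-thread counter values printed along the way still depend on scheduling.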

(c) (25 pts) Using MPI, write a routine that broadcasts data from a central CPU to all the other CPUs in a COMM group. You may assume we are broadcasting only from node 0 for this problem and that we have a processor group of size 2^n where n is an integer. However, the routine should be efficiently implemented with a minimum number of communications. The routine should only use MPI_SEND, MPI_RECV, MPI_COMM_RANK, and MPI_COMM_SIZE, and should NOT use the broadcast commands. The logic and algorithm are more important than the syntax for this problem.

To implement this efficiently, we need to use a fan-out routine.

    ! the pattern will be
    !
    !   iteration   sending   receiving
    !       1       0         1
    !       2       0 1       2 3
    !       3       0 1 2 3   4 5 6 7
    !   etc...

    call MPI_INIT(ierr)

    call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, mysize, ierr)

    ! figure out the number of iterations needed for the send
    ncomm = int( log(real(mysize)) / log(2.0) + 0.00001 )

    ! loop over the iterations
    do i = 1, ncomm

       ! define boundaries between who is sending and who is receiving
       idivide = 2**(i-1)
       iend = 2**i

       ! if I am below the half way point and there is someone to send a message to,
       ! send a message
       if (myrank < idivide) then
          target = myrank + idivide
          if (target < mysize) then
             call MPI_SEND(value, size, MPI_TYPE, target, tag, MPI_COMM_WORLD, ierr)
          endif
       endif

       ! if I am at or above the divide point and below the limit of this iteration,
       ! receive a message
       if (myrank >= idivide .AND. myrank < iend) then
          source = myrank - idivide
          call MPI_RECV(value, size, MPI_TYPE, source, tag, MPI_COMM_WORLD, status, ierr)
       endif

    enddo

    call MPI_FINALIZE(ierr)
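The fan-out schedule can be checked without an MPI installation by simulating which ranks hold the value after each round. The sketch below is an illustration only: `simulate_fanout`, `has_data`, and `MAX_RANKS` are hypothetical names, no real messages are sent, and the function simply confirms that every sender already holds the value and that all ranks are reached in log2(P) rounds.

```c
#include <stdbool.h>

#define MAX_RANKS 1024   /* ASSUMPTION: arbitrary cap for this sketch */

/* Simulates the fan-out broadcast schedule without MPI.
 * has_data[r] records whether simulated rank r holds the value.
 * Returns the number of rounds used, or -1 if nranks is not a power
 * of two in [1, MAX_RANKS] or the schedule is inconsistent. */
int simulate_fanout(int nranks)
{
    bool has_data[MAX_RANKS] = { false };
    int rounds = 0;
    int idivide, rank;

    if (nranks < 1 || nranks > MAX_RANKS || (nranks & (nranks - 1)) != 0)
        return -1;

    has_data[0] = true;                    /* rank 0 is the root */
    for (idivide = 1; idivide < nranks; idivide *= 2) {
        /* ranks 0..idivide-1 each send to rank + idivide */
        for (rank = 0; rank < idivide; rank++) {
            int target = rank + idivide;
            if (!has_data[rank])
                return -1;                 /* a sender must hold the value */
            has_data[target] = true;
        }
        rounds++;
    }

    for (rank = 0; rank < nranks; rank++)
        if (!has_data[rank])
            return -1;                     /* some rank was never reached */
    return rounds;
}
```

For 8 ranks the simulation finishes in 3 rounds, compared with the 7 sequential sends a naive loop from rank 0 would need, because the number of ranks holding the value doubles each round.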
