Solved Homework 3 | High Performance Computing Systems | CS 1645, Assignments of Computer Science

Material Type: Assignment; Class: INTRO HIGH PERF COMPTNG SYSTMS; Subject: Computer Science; University: University of Pittsburgh; Term: Fall 2007;

CS 1645/2045 – INTRODUCTION TO HIGH PERFORMANCE
COMPUTING SYSTEMS
HOMEWORK 3 - SOLUTIONS
October 2, 2007
Socrates Dimitriadis (TA)
Problem 1: fork/join method
Creating new threads is time consuming and should be considered as a trade-off in the design.
The current implementation creates and joins threads on each iteration of the algorithm, resulting in
one large fork-and-join loop. Therefore, performance degradation is expected. In addition,
partitioning the algorithm across more threads degrades performance further, because
thread-handling overheads become more noticeable as the problem size gets smaller.
[Plot: Speed-up = f(#threads). x-axis: Number of Threads (1, 2, 4, 8); y-axis: Speed-up (0 to 1.2); series: N=32, N=64, N=128]
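The cost described in Problem 1 can be observed in isolation with a minimal sketch like the one below. It is not part of the original solution: the names `tiny_work`, `now_seconds`, and `fork_join_cost` are hypothetical, and the worker does almost no work so that the measured time is dominated by thread creation and joining.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/time.h>

#define MAX_THREADS 32

/* Hypothetical worker: stands in for one thread's share of an iteration. */
static void *tiny_work(void *arg)
{
    int *counter = arg;
    (*counter)++;
    return NULL;
}

/* Wall-clock time in seconds, using gettimeofday as the listings do. */
static double now_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;
}

/* Forks and joins num_threads threads on every iteration.
   Returns the number of worker invocations that ran; if elapsed is
   non-NULL, the elapsed wall-clock time is stored in *elapsed. */
long fork_join_cost(int iterations, int num_threads, double *elapsed)
{
    pthread_t tid[MAX_THREADS];
    int counter[MAX_THREADS] = {0};
    long total = 0;
    double start;
    int k, i;

    assert(num_threads >= 1 && num_threads <= MAX_THREADS);
    start = now_seconds();
    for (k = 0; k < iterations; k++) {
        for (i = 0; i < num_threads; i++) {
            int rc = pthread_create(&tid[i], NULL, tiny_work, &counter[i]);
            assert(rc == 0);
            (void)rc;
        }
        /* Join before the next iteration, as a fork/join loop does. */
        for (i = 0; i < num_threads; i++)
            pthread_join(tid[i], NULL);
    }
    if (elapsed != NULL)
        *elapsed = now_seconds() - start;
    for (i = 0; i < num_threads; i++)
        total += counter[i];
    return total;
}
```

Calling `fork_join_cost(15000, 4, &t)` and comparing `t` against the time of a plain loop that increments a counter 60000 times gives a rough feel for the per-iteration overhead; the actual numbers depend on the machine and the thread library.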
The next plot presents the speedup of the threaded implementation relative to the standard sequential
implementation.
[Plot: Speed-up = f(#threads). x-axis: Number of Threads (seq, 1, 2, 4, 8); y-axis: Speed-up (0 to 1.2); series: N=32, N=64, N=128]
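The solution does not spell out how the plotted speedup is computed. Assuming the conventional definition (reference run time divided by threaded run time), it could be obtained from two timings as sketched below; `speedup` and `efficiency` are hypothetical helper names.

```c
#include <assert.h>

/* Speed-up of a threaded run relative to a reference run.
   A value below 1.0 means the threaded version is slower. */
double speedup(double t_ref, double t_par)
{
    assert(t_ref > 0.0 && t_par > 0.0);
    return t_ref / t_par;
}

/* Parallel efficiency: speed-up divided by the number of threads. */
double efficiency(double t_ref, double t_par, int num_threads)
{
    assert(num_threads >= 1);
    return speedup(t_ref, t_par) / num_threads;
}
```

Under this definition, a threaded run that takes twice as long as the reference run has a speedup of 0.5, which is the kind of value the fork/join plots show.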

Fork_join.c

#include <pthread.h>
#include <stdio.h>      /* printf, scanf */
#include <stdlib.h>
#include <sys/time.h>
#include <math.h>

#define MAX_THREADS 32
#define n 128

void *laplace(void *);

struct arg_to_thread { int id; };
float x[n+2][n+2];
float xn[n+2][n+2];
int conv[MAX_THREADS];
int conv_total;
int num_threads;

int main(void)
{
    int i, j, k;
    pthread_t p_threads[MAX_THREADS];
    pthread_attr_t attr;
    double time_start, time_end;
    struct timeval tv;
    struct timezone tz;
    struct arg_to_thread my_arg[MAX_THREADS];

    printf("Enter number of threads: ");
    scanf("%d", &num_threads);

    /* Boundary values */
    for (k = 1; k <= n; k++) {
        x[0][k] = 0;
        x[k][0] = 0;
        x[n+1][k] = k;
        x[k][n+1] = k;
    }

    /* Interior points start at zero */
    for (i = 1; i <= n; i++)
        for (j = 1; j <= n; j++)
            x[i][j] = 0;

    /* Start the timer */
    gettimeofday(&tv, &tz);
    time_start = (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;

    /* Iterations with forking/joining */
    for (k = 1; k < 15000; k++) {
        pthread_attr_init(&attr);
        pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

        conv_total = 0;

        /* Create threads, compute for k */
        for (i = 0; i < num_threads; i++) {
            my_arg[i].id = i;
            pthread_create(&p_threads[i], &attr, laplace, (void *)&my_arg[i]);

Problem 2

Here, we create 2-8 threads once, and each thread performs all the necessary iterations of the
algorithm. Of course, synchronization issues must be considered carefully.

For small problem sizes, increasing the number of threads can degrade performance, since
thread switching and synchronization overheads outweigh the benefit of partitioning the computation.
As the problem size grows, computation accounts for a larger share of the run time,
and therefore, by using more threads/CPUs, we might be able to increase the overall performance
of the algorithm.

[Plot: Speed-up = f(#threads). x-axis: Number of Threads; y-axis: Speed-up; one series per problem size N]

The next plot presents the speedup of the threaded implementation relative to the standard sequential
implementation.

[Plot: Speed-up = f(#threads). x-axis: Number of Threads (seq, 1, 2, 4, 8); y-axis: Speed-up; one series per problem size N]
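The partitioning discussed in Problem 2 gives each thread a block of consecutive rows. The barrier.c listing assigns exactly n/num_threads rows per thread, which covers every row only when n is divisible by the number of threads. The sketch below is a possible generalization, not the original solution: `row_range` and `rows_for_thread` are hypothetical names, and the leftover rows are spread over the lowest-numbered threads.

```c
#include <assert.h>

/* Inclusive range [first, last] of interior rows owned by one thread. */
struct row_range {
    int first;
    int last;
};

/* Splits interior rows 1..n into num_threads contiguous blocks.
   The first (n % num_threads) threads each take one extra row. */
struct row_range rows_for_thread(int lid, int num_threads, int n)
{
    struct row_range r;
    int base, extra;

    assert(num_threads >= 1 && lid >= 0 && lid < num_threads && n >= 0);
    base = n / num_threads;
    extra = n % num_threads;
    r.first = lid * base + (lid < extra ? lid : extra) + 1;
    r.last = r.first + base + (lid < extra ? 1 : 0) - 1;
    return r;
}
```

When n is a multiple of num_threads, this yields the same ranges as the listing: for n = 128 and 8 threads, thread 0 owns rows 1 to 16 and thread 7 owns rows 113 to 128.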

barrier.c

#include <pthread.h>
#include <stdlib.h>
#include <sys/time.h>
#include <math.h>

#define MAX_THREADS 32
#define n 128

typedef struct {
    pthread_mutex_t x_lock;
    pthread_cond_t barrier;
    int count;
} my_barrier_t;

void *laplace(void *);
void init_my_barrier(my_barrier_t *);
void my_barrier(my_barrier_t *, int);

struct arg_to_thread { int id; };
float x[n+2][n+2];
float xn[n+2][n+2];

int num_threads;
int conv = 0;
my_barrier_t br;
pthread_mutex_t conv_lock;

void init_my_barrier(my_barrier_t *b)
{
    b->count = 0;
    pthread_mutex_init(&(b->x_lock), NULL);
    pthread_cond_init(&(b->barrier), NULL);
}

/* The last thread to arrive resets the counter and wakes the others.
   Note: the wait is not re-checked in a loop, so a spurious wakeup
   is not guarded against. */
void my_barrier(my_barrier_t *b, int num_threads)
{
    pthread_mutex_lock(&(b->x_lock));
    b->count++;
    if (b->count == num_threads) {
        b->count = 0;
        pthread_cond_broadcast(&(b->barrier));
    } else
        pthread_cond_wait(&(b->barrier), &(b->x_lock));
    pthread_mutex_unlock(&(b->x_lock));
}

void *laplace(void *s)
{
    struct arg_to_thread *local_arg;
    int k, i, j, lid, local_conv;
    float error;

    local_arg = s;
    lid = (*local_arg).id;

    /* Each thread performs iterations until all the points have converged */
    for (k = 1; k < 15000; k++) {
        local_conv = 0;
        for (i = lid*(n/num_threads)+1; i <= (lid+1)*(n/num_threads); i++)
            for (j = 1; j <= n; j++) {
                xn[i][j] = 0.25 * (x[i-1][j] + x[i+1][j] + x[i][j-1] + x[i][j+1]);
                if (xn[i][j] <= x[i][j])
                    error = x[i][j] - xn[i][j];
                else
                    error = xn[i][j] - x[i][j];
                if (error <= 0.001)
                    local_conv = local_conv + 1;
            }

        /* Access to a global variable */
        pthread_mutex_lock(&conv_lock);
        conv = conv + local_conv;
        pthread_mutex_unlock(&conv_lock);

        /* barrier - wait for all the threads to reach this point */
        my_barrier(&br, num_threads);

        if (conv == n*n)
            break;              /* all threads break once all points have converged */
        else {
            if (lid == 0)
                conv = 0;       /* this is ok since there is another barrier later */
        }

        /* Update the values in matrix x */
        for (i = lid*(n/num_threads)+1; i <= (lid+1)*(n/num_threads); i++)
            for (j = 1; j <= n; j++)
                x[i][j] = xn[i][j];

        /* barrier - wait for all the threads to reach this point */
        my_barrier(&br, num_threads);
    }
    pthread_exit(0);
}
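POSIX allows pthread_cond_wait to return without a matching signal, so a condition wait is normally wrapped in a loop that re-checks a predicate. The sketch below shows one common way to do that for a reusable barrier, using a generation counter. It is a variation on the barrier above, not the original solution: `gen_barrier_t`, `gen_barrier_init`, `gen_barrier_wait`, and the `barrier_demo` harness are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int count;            /* threads that have arrived in this round */
    unsigned generation;  /* incremented each time the barrier opens */
} gen_barrier_t;

void gen_barrier_init(gen_barrier_t *b)
{
    b->count = 0;
    b->generation = 0;
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->cond, NULL);
}

void gen_barrier_wait(gen_barrier_t *b, int num_threads)
{
    unsigned my_gen;

    pthread_mutex_lock(&b->lock);
    my_gen = b->generation;
    b->count++;
    if (b->count == num_threads) {
        /* Last arrival: open the barrier and start a new round. */
        b->count = 0;
        b->generation++;
        pthread_cond_broadcast(&b->cond);
    } else {
        /* Re-check the predicate so a spurious wakeup just waits again. */
        while (my_gen == b->generation)
            pthread_cond_wait(&b->cond, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}

/* Demo harness (hypothetical): write, barrier, read, barrier. */
#define DEMO_MAX_THREADS 8
static gen_barrier_t demo_barrier;
static int demo_slot[DEMO_MAX_THREADS];
static int demo_errors[DEMO_MAX_THREADS];
static int demo_threads, demo_rounds;

static void *demo_worker(void *arg)
{
    int id = *(int *)arg;
    int r, i;

    for (r = 1; r <= demo_rounds; r++) {
        demo_slot[id] = r;                      /* phase 1: write own slot */
        gen_barrier_wait(&demo_barrier, demo_threads);
        for (i = 0; i < demo_threads; i++)      /* phase 2: read all slots */
            if (demo_slot[i] != r)
                demo_errors[id]++;
        gen_barrier_wait(&demo_barrier, demo_threads);
    }
    return NULL;
}

/* Returns the number of stale reads observed; 0 means the barrier held. */
int barrier_demo(int num_threads, int rounds)
{
    pthread_t tid[DEMO_MAX_THREADS];
    int ids[DEMO_MAX_THREADS];
    int i, errors = 0;

    assert(num_threads >= 1 && num_threads <= DEMO_MAX_THREADS);
    demo_threads = num_threads;
    demo_rounds = rounds;
    gen_barrier_init(&demo_barrier);
    for (i = 0; i < num_threads; i++) {
        ids[i] = i;
        demo_slot[i] = 0;
        demo_errors[i] = 0;
        pthread_create(&tid[i], NULL, demo_worker, &ids[i]);
    }
    for (i = 0; i < num_threads; i++) {
        pthread_join(tid[i], NULL);
        errors += demo_errors[i];
    }
    pthread_mutex_destroy(&demo_barrier.lock);
    pthread_cond_destroy(&demo_barrier.cond);
    return errors;
}
```

The demo mirrors the two-barrier structure of the Laplace loop: the first barrier separates the write phase from the read phase, and the second keeps a fast thread from starting the next round's writes while others are still reading.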