Parallel Processing Computer Architectures - Handout 2 | ENEE 759, Lab Reports of Electrical and Electronics Engineering

Material Type: Lab; Subject: Electrical & Computer Engineering; University: University of Maryland; Term: Spring 2006;


Uploaded on 07/30/2009

ENEE 759A: Parallel Processing Computer Architectures
Spring 2006 Handout #2
Laboratory Information
During the lab portions of ENEE 759A, you will be exposed to a number of computational
tools. This document provides insight on how to use these tools.
Contents

1 General Information
    1.1 Accessing the Necessary Software
    1.2 Personal Access to Course Binaries
2 Parallel C Reference Manual
    2.1 Introduction
    2.2 Machine Model
    2.3 Include Files
    2.4 Memory Allocation
    2.5 Distributed Arrays
    2.6 Thread Management
        2.6.1 Thread Spawning
        2.6.2 Thread Wait Queues
    2.7 Synchronization
        2.7.1 Spin Locks
        2.7.2 Semaphores
        2.7.3 Condition Variables
        2.7.4 Counting Semaphores
        2.7.5 Barriers
        2.7.6 Reductions
    2.8 Message Passing Primitives

1 General Information

This document contains information necessary to complete the homework assignments in ENEE 759A. Immediately following this introductory section, the document contains 4 parts:

  • Section 2 describes the Parallel C programming language, which will be used for Homeworks 1, 2, and 4.
  • Sections 3 and 4 describe NWO (the Alewife Simulator), and its statistics-gathering capabilities, which will be used for Homeworks 1, 2, and 4.
  • Section 5 discusses use of a network simulator that will be used for Homework 4.
  • Section 6 describes a cache-directory simulator that will be used for Homework 3.

This document is meant to serve as a stand-alone reference for the software tools you will be using in the course. Most of this document has been adapted from a handout prepared by Victor Lee from the MIT Alewife group. Many thanks go to Victor for his efforts.

1.1 Accessing the Necessary Software

The programming portions of the homework assignments for the course will be performed using GLUE SUN workstations. The software for all the assignments can be found in the course directory: /software/stradivari4/class/759a. The first thing you should do is copy the environment files from the course directory into your home directory. There are four files that you will need: nwo, init.t, tea19.el, and .emacs. These files are available from /software/stradivari4/class/759a/envfiles. They are necessary to run the Parallel C compiler and the Alewife simulator, NWO. Follow these directions:

  • Create a “bin” sub-directory in your home directory (if it doesn’t already exist), and include it in your default path. Copy the file “nwo” into your “bin” sub-directory.
  • Create a “t” sub-directory in your home directory (if it doesn’t already exist). Copy the file “init.t” into your “t” sub-directory.
  • Create an “emacs” sub-directory in your home directory (if it doesn’t already exist). Copy the file “tea19.el” into your “emacs” sub-directory.
  • If you do not already have a “.emacs” file, copy the “.emacs” file provided in the course directory into your home directory (at the topmost level). If you already have a “.emacs” file, append the contents of the “.emacs” file provided in the course directory to the end of your own “.emacs” file.
  • Include the directory /software/stradivari4/class/759a/tools/sun4 in your default path.
  • Add the line “setenv LD_LIBRARY_PATH /usr/local/X11R6.3/lib” to your “.cshrc” file.

Further instructions on the use of Parallel C and NWO will be provided in Sections 2 and 3.

1.2 Personal Access to Course Binaries

It is recommended that students access course software directly from the course directory, /software/stradivari4/class/759a. In some cases, students may want to access course software from a local machine (for instance, to work at home or in another lab). This is not possible for Parallel C and NWO. These are extremely large pieces of software that are intimately tied to the SPARC ISA (i.e., they cannot run on x86). To run Parallel C and NWO, students must use the GLUE SUN workstations.

The network simulator and the directory simulator, used in homeworks 3 and 4, are much smaller software packages and can be easily installed on other machines. You can find the network simulator in /software/stradivari4/class/759a/netwrap, and the directory simulator in /software/stradivari4/class/759a/dirsim. You are welcome to copy the “.c” source files from these directories and the “Makefile” to install the software on other machines.

is performed to initialize processor-local variables to useful values.

void DISTRIBUTE_DATA()

Copy all global variables on the local processor to all other processors.

Due to the potentially large number of global variables in the program (including variables in all the linked library files), DISTRIBUTE_DATA can take about 100,000 cycles to execute, which is a long time if you’re running on the simulator.

2.3 Include Files

To use the Alewife-specific parallel library, you need to include <parallel.h>. To use the p4 shared-memory monitor functions, you need to include <p4.h>. <p4.h> automatically includes <parallel.h>.

2.4 Memory Allocation

char *shmalloc(unsigned nbytes)

Allocate a memory block of size nbytes in the global address space. The memory block is allocated from the portion of global memory that is local to the calling processor.

void shfree(char *ptr)

Free memory block in the global address space pointed to by ptr. This function currently does not do anything.

2.5 Distributed Arrays

Like regular arrays, distributed arrays in Parallel C are logically contiguous in shared memory. However, distributed arrays are physically distributed across the memories of multiple nodes.

In a Parallel C program, distributed arrays have the same logical behavior as regular arrays, but are useful for large arrays that may not fit on one node, or to avoid contention on one node.

void vm_init_pagemap(int n_pages, int pagesize, int nprocs, int pages_per_proc)

Initialize the pagemap for a total of n_pages of memory, each of size pagesize. This needs to be called exactly once before any calls to vm_alloc(). The address space is mapped across nprocs processors, and blocks of pages_per_proc consecutive pages are mapped to a single processor. nprocs should be equal to the number of processors. pagesize is fixed by the compiler to be 256 bytes. (To change the page size, use --DPAGESIZE=n).
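One way to picture this mapping is as blocks of pages dealt out to processors in turn. The sketch below is only a model in plain C, not library code: page_owner and page_of_offset are hypothetical helpers, and the round-robin wraparound once every processor has received a block is an assumption that the description above does not spell out.

```c
#include <assert.h>

/* Hypothetical helper: which processor holds a given page, ASSUMING blocks
 * of pages_per_proc consecutive pages are dealt out round-robin. */
unsigned page_owner(unsigned page, unsigned nprocs, unsigned pages_per_proc)
{
    assert(nprocs > 0 && pages_per_proc > 0);
    return (page / pages_per_proc) % nprocs;
}

/* Hypothetical helper: index of the page containing a byte offset. */
unsigned page_of_offset(unsigned offset, unsigned pagesize)
{
    assert(pagesize > 0);
    return offset / pagesize;
}
```

Under these assumptions, with 4 processors, 2 pages per block, and 256-byte pages, byte offset 1024 falls in page 4, which belongs to the third block and therefore to processor 2.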

char *vm_alloc(unsigned nbytes)

Allocate a memory block of size nbytes in the global address space. The memory block is striped across the processing nodes.

Pointers to distributed arrays must be declared as darray. For example, darray int pointer[] declares pointer to be a pointer to a distributed array of integers. In general, darray .

Restrictions: i) No pointer arithmetic on darray pointers. ii) For arrays of structures, the size of the structure needs to be a power of two bytes. This prevents a structure from spanning more than one node; otherwise, addressing the fields of a structure will not work.
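The power-of-two restriction can be checked before a structure is used as a distributed array element. The following is a minimal sketch in standard C; struct particle, is_power_of_two, and require_darray_element_size are hypothetical names chosen for illustration and are not part of the Parallel C library.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 when n is a nonzero power of two, 0 otherwise. */
int is_power_of_two(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Hypothetical element type: three ints would give 12 bytes with 4-byte
 * ints, so one explicit padding field rounds the size up to 16 bytes. */
struct particle {
    int x, y, z;
    int pad;    /* padding only; keeps sizeof a power of two */
};

/* Hypothetical guard to call before allocating a distributed array. */
void require_darray_element_size(size_t nbytes)
{
    assert(is_power_of_two(nbytes));
}
```

Calling require_darray_element_size(sizeof(struct particle)) before allocating the array would catch a structure whose size has drifted away from a power of two.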

2.6 Thread Management

2.6.1 Thread Spawning

Currently, only producer-oriented thread spawning is supported. No thread migration is available, so threads have to be statically spawned on remote processors.

void thread_on(int pid, void f(), ...)

Forks a thread on processor pid that calls the function f with the arguments in the rest of the argument list. Arguments are evaluated on the local processor and passed by value to the forked thread. Returns immediately without waiting for the forked thread to complete.

The following functions spawn threads across the entire machine and are useful for SPMD style computations.

void do_in_parallel(void f(), ...)

Forks a thread on each processor to call the function f with the arguments in the rest of the argument list. Arguments are evaluated on the local processor and passed by value to the forked threads. Returns after all threads have terminated.

void do_in_parallel_no_synch(void f(), ...)

Forks a thread on each processor to call the function f with the arguments in the rest of the argument list. Arguments are evaluated on the local processor and passed by value to the forked threads. Returns immediately without waiting for the forked threads to complete.
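In an SPMD computation, every thread typically runs the same function and picks its share of the work from its processor id. The sketch below shows one common way to split n items into contiguous blocks. spmd_block_range is a hypothetical helper, not a library function; in a real program the pid and nprocs values would presumably come from my_pid() and NPROCS.

```c
#include <assert.h>

/* Hypothetical helper: half-open range [*lo, *hi) of the n items that the
 * thread on processor pid handles when work is split into blocks.
 * The wide intermediate type avoids overflow in pid * n. */
void spmd_block_range(unsigned pid, unsigned nprocs, unsigned n,
                      unsigned *lo, unsigned *hi)
{
    assert(nprocs > 0 && pid < nprocs);
    *lo = (unsigned)(((unsigned long long)pid * n) / nprocs);
    *hi = (unsigned)(((unsigned long long)(pid + 1) * n) / nprocs);
}
```

The ranges are contiguous and cover every item exactly once, even when n is not a multiple of nprocs.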

void spin_unlock(unsigned val, unsigned addr)

Sets the contents of memory location addr to val and unlocks the memory location.

2.7.2 Semaphores

sem_p make_semaphore()

Create and return an unlocked semaphore.

sem_p make_locked_semaphore()

Create and return a locked semaphore.

void SEMAPHORE_P(sem_p sem)

Lock semaphore sem.

void SEMAPHORE_V(sem_p sem)

Unlock semaphore sem.

int semaphore_conditional_p(sem_p sem)

Returns TRUE if it succeeded in “P”-ing the semaphore, FALSE otherwise.

The following semaphore functions allow an integer to be associated with a semaphore object. This integer can then be manipulated atomically. This has the advantage of allowing one to combine the manipulation of the lock bit together with the data.

sem_p make_semaphore(int initval)

Create and return an unlocked semaphore. The semaphore’s value is set to initval.

void init_semaphore(sem_p sem, int initval)

Non-atomically initializes the semaphore’s value to initval.

int semaphore_take(sem_p sem)

Lock semaphore sem and return its value.

void semaphore_put(sem_p sem, int val)

Unlock semaphore sem and set its value to val.

int semaphore_conditional_take(sem_p sem, int *got_sem)

Returns semaphore sem’s value and sets *got_sem to TRUE if successful in locking the semaphore. Returns 0 and sets *got_sem to FALSE otherwise.
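The take/put pair behaves like a lock that carries a value with it: taking the semaphore hands you the value, and putting it back stores a new one. The toy model below captures only that contract in single-threaded standard C. The model_ names are hypothetical, and the real sem_p type and its atomicity come from the Alewife library, not from this code.

```c
#include <assert.h>

/* Toy, single-threaded model of a value-carrying semaphore. */
struct model_sem {
    int locked;   /* 1 while some caller holds the semaphore */
    int value;    /* integer associated with the semaphore */
};

/* Models semaphore_conditional_take: lock if free and report the value. */
int model_conditional_take(struct model_sem *s, int *got_sem)
{
    if (s->locked) {
        *got_sem = 0;      /* FALSE: someone else holds it */
        return 0;
    }
    s->locked = 1;
    *got_sem = 1;          /* TRUE: caller now holds it */
    return s->value;
}

/* Models semaphore_put: store the new value and unlock. */
void model_put(struct model_sem *s, int val)
{
    assert(s->locked);     /* only a holder may put */
    s->value = val;
    s->locked = 0;
}
```

An atomic increment then reads as: take the value, add one, put the result back.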

2.7.3 Condition Variables

condvar_p make_condvar()

Creates and returns an initialized condition variable.

void init_condvar(condvar_p cv)

Initialize the condition variable cv.

void condvar_wait(sem_p sem, condvar_p cv)

Release the semaphore sem and block on condition variable cv; reacquire the semaphore when we unblock (wake up).

void condvar_broadcast(condvar_p cv)

Wake up all threads currently blocked on condition variable cv.

2.7.4 Counting Semaphores

csem_p make_c_semaphore(int initval)

Creates and returns a counting semaphore initialized to initval.

void init_c_semaphore(csem_p cs, int initval)

Initialize counting semaphore cs to initval.

void c_semaphore_p(csem_p cs)

Waits until the value of counting semaphore cs is positive, then decrements the value of cs and returns.

void c_semaphore_v(csem_p cs)

Increment value of counting semaphore cs.

void c_semaphore_wait(csem_p cs)

Wait for counting semaphore cs to attain a positive value.
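The difference between the "p" and "wait" operations is that "p" consumes one unit of the count while "wait" only observes it. The sketch below models these semantics with C11 atomics. It is not the Alewife implementation; every model_ name is hypothetical, and spinning is used here purely for brevity.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct { atomic_int value; } model_csem;

void model_csem_init(model_csem *cs, int initval)
{
    assert(initval >= 0);
    atomic_init(&cs->value, initval);
}

/* Non-blocking attempt: decrement only if the value is positive. */
int model_csem_try_p(model_csem *cs)
{
    int v = atomic_load(&cs->value);
    while (v > 0) {
        /* On failure, v is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&cs->value, &v, v - 1))
            return 1;
    }
    return 0;
}

/* Models c_semaphore_p: spin until a unit can be consumed. */
void model_csem_p(model_csem *cs)
{
    while (!model_csem_try_p(cs))
        ;
}

/* Models c_semaphore_v: add one unit. */
void model_csem_v(model_csem *cs)
{
    atomic_fetch_add(&cs->value, 1);
}

/* Models c_semaphore_wait: spin until positive, without decrementing. */
void model_csem_wait(model_csem *cs)
{
    while (atomic_load(&cs->value) <= 0)
        ;
}
```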

2.7.5 Barriers

The following are centralized implementations of barriers and thus are not scalable. For large numbers of processors, tree barriers should be used. (See Section 2.7.6).

sm_barrier_p make_sm_barrier(int n)

Create and return a barrier initialized for n participants.

void init_sm_barrier(sm_barrier_p b, int n)

Initialize barrier b for n participants.
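To see why a centralized barrier does not scale, it helps to look at a typical design: every participant decrements one shared counter and then spins on one shared flag, so all traffic lands on the same memory locations. The sense-reversing barrier below is a generic textbook sketch using C11 atomics. It is not the sm barrier implementation, and all model_ names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    atomic_int count;   /* arrivals still missing in this episode */
    atomic_int sense;   /* flips each time the barrier opens */
    int n;              /* number of participants */
} model_barrier;

void model_barrier_init(model_barrier *b, int n)
{
    assert(n > 0);
    atomic_init(&b->count, n);
    atomic_init(&b->sense, 0);
    b->n = n;
}

void model_barrier_wait(model_barrier *b)
{
    /* The flag cannot flip before this caller arrives, so reading it
     * here yields the sense of the current episode. */
    int my_sense = !atomic_load(&b->sense);

    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* Last arrival: re-arm the counter, then release the others. */
        atomic_store(&b->count, b->n);
        atomic_store(&b->sense, my_sense);
    } else {
        while (atomic_load(&b->sense) != my_sense)
            ;   /* every waiter spins on the same location */
    }
}
```

Because the flag alternates between episodes, the same barrier object can be reused immediately without a separate reset step.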

2.8 Message Passing Primitives

Alewife supports message-passing functions based on the Active Message model: upon receipt of a message, the receiving processor is trapped and the message handler is executed.

The following calls send an active message to a processor:

void do_on(unsigned pid, handler (*handler_fn)(), ...)

Send a message to processor pid to call handler handler_fn with the arguments in the rest of the argument list. Arguments are restricted to 32-bit values.

void do_on_dma(unsigned pid, handler (*handler_fn)(), ..., dest_base, base, dwords)

Send a message to processor pid to call DMA handler handler_fn with the arguments in the rest of the argument list, minus the last two arguments. Arguments are restricted to 32-bit values, and there must be an even number of them. The last two arguments, base and dwords, specify a block of dwords double words starting at address base to be appended to the end of the message. This block will be automatically stored back at the destination processor starting at address dest_base (the third-to-last argument).

Message handlers are declared as type handler and dma_handler.

handler handler_fn(...)

Defines the handler to be invoked on receipt of a message sent via do_on.

dma_handler handler_fn(...)

Defines the DMA handler to be invoked on receipt of a message sent via do_on_dma. Call wait_for_storeback() within the handler to ensure that the block at the end of the message has been successfully copied to the specified destination.

Once a handler is invoked on a processor, it runs atomically until it exits. In other words, subsequent incoming messages to the processor do not interrupt the processor until the currently running handler exits.
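A consequence of run-to-completion execution is that a handler can perform a read-modify-write on the receiving processor's private data without taking a lock. The single-threaded model below illustrates the idea with a queue of pending messages that is drained one handler at a time. It is only a sketch in standard C: every model_ name is hypothetical, and the real do_on delivers messages through the Alewife network rather than a local array.

```c
#include <assert.h>

#define MODEL_QUEUE_CAP 16

typedef void (*model_handler)(int arg);

struct model_msg { model_handler fn; int arg; };

/* Model of one receiving processor with its pending messages. */
struct model_node {
    struct model_msg queue[MODEL_QUEUE_CAP];
    int head, tail;
};

static int model_counter;    /* private data of the receiving node */
static int model_last_old;   /* value seen by the most recent handler */

/* Fetch-and-add style handler: no lock needed, since it runs atomically. */
void model_fetch_add_handler(int delta)
{
    model_last_old = model_counter;
    model_counter += delta;
}

/* Model of sending a message: enqueue the handler and its argument. */
void model_do_on(struct model_node *node, model_handler fn, int arg)
{
    assert(node->tail < MODEL_QUEUE_CAP);
    node->queue[node->tail].fn = fn;
    node->queue[node->tail].arg = arg;
    node->tail++;
}

/* Run each pending handler to completion, one after another. */
int model_drain(struct model_node *node)
{
    int ran = 0;
    while (node->head < node->tail) {
        struct model_msg m = node->queue[node->head++];
        m.fn(m.arg);
        ran++;
    }
    return ran;
}
```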

Typically, a handler will only access data in a processor’s private memory. On occasion, it may be necessary to access data in shared memory from a handler. For various reasons beyond the scope of this document, it is highly risky to access shared memory from a handler that is executing atomically; doing so can lead to deadlock. Therefore, before accessing shared memory, a handler must transition out of atomic execution. To do this, the handler should execute the following function:

void user_active_global()

Transition out of atomic execution.

After calling user_active_global(), it is safe for the handler to access shared memory. However, the currently running handler will lose its atomic execution properties; therefore, another incoming handler can interrupt the currently running handler. Any subsequent handlers that interrupt the current handler will execute atomically (unless they themselves call user_active_global()), and the original handler will not resume execution until all other atomic handlers have completed.

See Section 2.11 for an example of a fetch-and-add operation implemented with active messages.

2.9 Useful Functions and Variables

unsigned NPROCS

Number of processors in machine (read-only, set by (set-n-processors N) in NWO).

unsigned my_pid()

Processor-id of the local processor (0 ... NPROCS-1).

unsigned CReg->CycleCount

Returns the current value of the hardware cycle counter (incremented once each cycle).
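A common use of the cycle counter is to sample it before and after a region of code and subtract. The helpers below sketch that arithmetic in standard C. The names are hypothetical, and the assumption that the counter fits in an unsigned value and wraps around at most once between samples is ours, not something the handout states.

```c
#include <assert.h>
#include <limits.h>

/* Cycles elapsed between two samples. Unsigned subtraction is performed
 * modulo UINT_MAX + 1, so the result is still right if the counter
 * wrapped once between the samples (ASSUMED counter behavior). */
unsigned elapsed_cycles(unsigned start, unsigned end)
{
    return end - start;
}

/* Average cost per item for a timed loop over `items` iterations. */
unsigned cycles_per_item(unsigned start, unsigned end, unsigned items)
{
    assert(items > 0);
    return elapsed_cycles(start, end) / items;
}
```

In a Parallel C program, the two samples would presumably be read from CReg->CycleCount immediately before and after the code being measured.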

2.10 p4

The following p4 functions are supported:

    char *p4_shmalloc(int n)
    VOID p4_shfree(char *p)

    int p4_moninit(p4_monitor_t *m, int i)
    VOID p4_menter(p4_monitor_t *m)
    VOID p4_mexit(p4_monitor_t *m)
    VOID p4_mcontinue(p4_monitor_t *m, int i)
    VOID p4_mdelay(p4_monitor_t *m, int i)

    VOID p4_lock_init(p4_lock_t *l)
    VOID p4_lock(p4_lock_t *l)
    VOID p4_unlock(p4_lock_t *l)

    int p4_getsub_init(p4_getsub_monitor_t *gs)
    VOID p4_getsub(p4_getsub_monitor_t *gs, int *s, int max, int nprocs)
    VOID p4_getsubs(p4_getsub_monitor_t *gs, int *s, int max, int nprocs, int stride)

This program demonstrates how to call do_in_parallel and use its associated spin barrier.

    #include <stdio.h>
    #include <parallel.h>

    #define NBAR 4
    #define NSPIN 4

    /* Runs on every processor; all threads meet at the spin barrier. */
    void worker()
    {
        int i;
        for (i = 0; i < NSPIN; i++) {
            mp_spin_barrier();
            if (my_pid() == 0)
                printf(" *** Spin barrier ***\n");
        }
    }

    main()
    {
        int i;
        for (i = 0; i < NBAR; i++) {
            /* Returns only after every worker has terminated. */
            do_in_parallel(worker);
            printf("*** Barrier ***\n");
        }
    }

This program demonstrates how to use the reduction functions, and how to pass arguments to threads spawned via do_in_parallel.

    #include <stdio.h>
    #include <parallel.h>

    /* Combining operator applied by the reduction. */
    double op(double x, double y)
    {
        return MAX(x,y);
    }

    /* Each processor contributes its own pid to the reduction. */
    void worker(red_tree_p redtree, int *totals)
    {
        int pid = my_pid();
        totals[pid] = (int)REDUCE(pid, redtree, op, (double)pid);
    }

    main ()
    {
        red_tree_p redtree;
        int *totals;

        redtree = make_global_reduction_tree();
        totals = (int *) shmalloc(sizeof(int) * NPROCS);

        do_in_parallel(worker, redtree, totals);

        return totals[0];
    }
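The combining pattern behind a reduction tree can be modeled on an ordinary array: values are merged pairwise in rounds, so n contributions take about log2(n) rounds rather than n - 1 sequential steps. The sketch below is a sequential model in standard C, not the library's REDUCE; tree_reduce, max_op, and reduce_op are hypothetical names.

```c
#include <assert.h>

typedef double (*reduce_op)(double, double);

/* Example combining operator: the larger of two values. */
double max_op(double x, double y)
{
    return x > y ? x : y;
}

/* Combine vals[0..n-1] pairwise, doubling the stride each round.
 * The array is overwritten; the final result ends up in vals[0]. */
double tree_reduce(double *vals, int n, reduce_op op)
{
    int stride, i;
    assert(n > 0);
    for (stride = 1; stride < n; stride *= 2) {
        for (i = 0; i + stride < n; i += 2 * stride)
            vals[i] = op(vals[i], vals[i + stride]);
    }
    return vals[0];
}
```

If each of n processors contributes its own id, as the worker function above does, a maximum reduction over this model yields n - 1.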

This program demonstrates the use of p4 monitor functions and how to distribute values of global variables via DISTRIBUTE_DATA.

    #include <p4.h>

    #define ALLOC(type) ((type *) shmalloc(sizeof(type)))

    #define NITERS_I 4
    #define NITERS_J 4

    p4_monitor_t *mon;
    p4_barrier_monitor_t *bar;
    int *count;

    void worker();

    main(int argc, char **argv)
    {
        p4_initenv(&argc, argv);

        mon = ALLOC(p4_monitor_t);
        p4_moninit(mon, NPROCS);

        bar = ALLOC(p4_barrier_monitor_t);
        p4_barrier_init(bar);

        count = ALLOC(int);

        printf("Distributing variables ...\n");
        DISTRIBUTE_DATA();    /* replicate global variables before forking */
        printf(" done.\n");

        do_in_parallel(worker);

        return(*count);       /* should equal NITERS_I*NITERS_J*NPROCS */
    }

    void worker()
    {
        int i,j;

        printf("%d, t = %d: Executing worker\n", p4_get_my_id(), p4_clock());
        for (i = 0; i < NITERS_I; i++) {
            for (j = 0; j < NITERS_J; j++) {
                p4_menter(mon);    /* the monitor serializes the update */
                *count += 1;
                p4_mexit(mon);
            }
            p4_barrier(bar, NPROCS);
        }
    }

2.12 Scalable Synchronization Library

Recent research has designed scalable algorithms that perform well under high contention (at the price of a higher latency under low contention). The scalable synchronization library implements some of these algorithms on Alewife. It also includes reactive synchronization algorithms that dynamically select the protocol to use to implement the synchronization operation. Alewife currently provides reactive algorithms for spin locks and fetch-and-op.

The reactive spin lock selects between using test-and-test-and-set with backoff and the MCS queue lock protocols. The reactive fetch-and-op selects between test-and-test-and-set, queuing, and combining tree protocols.

Note: Due to Alewife’s non-preemptive scheduling, a spin-waiting thread can hog the processor and run into deadlock. Thus, either the programmer should take care to avoid deadlock scenarios, or spawn off only as many threads as there are processors/hardware contexts to run the threads.
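For readers unfamiliar with the test-and-test-and-set protocol, the sketch below shows its general shape using C11 atomics: waiters first read the lock word and attempt the atomic exchange only when the lock looks free, and they back off exponentially after each failure. This is a generic illustration and not the library's code. The model_ names and the backoff cap of 1024 are arbitrary choices for the example.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct { atomic_int held; } model_lock;

void model_lock_init(model_lock *l)
{
    atomic_init(&l->held, 0);
}

/* One attempt: plain read first ("test"), then the atomic exchange
 * ("test-and-set") only if the lock appeared to be free. */
int model_try_acquire(model_lock *l)
{
    if (atomic_load(&l->held))
        return 0;
    return atomic_exchange(&l->held, 1) == 0;
}

/* Spin with exponential backoff between failed attempts. */
void model_acquire(model_lock *l)
{
    unsigned delay = 1;
    while (!model_try_acquire(l)) {
        volatile unsigned i;
        for (i = 0; i < delay; i++)
            ;                       /* idle for `delay` iterations */
        if (delay < 1024)
            delay *= 2;             /* arbitrary cap for this sketch */
    }
}

void model_release(model_lock *l)
{
    assert(atomic_load(&l->held));  /* releasing a free lock is a bug */
    atomic_store(&l->held, 0);
}
```

The plain read keeps waiters from issuing atomic exchanges while the lock is held, and the backoff spreads out retries once it is released.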

2.12.1 Spin Locks

lock_t *make_lock()

Allocate a spin lock.

void init_lock(lock_t *L)

Initialize spin lock *L.

void acquire_lock(lock_t *L)

Acquire spin lock *L.

void release_lock(lock_t *L)

Release spin lock *L.

In order to use these functions, you need to use:

#include <synch/locks/spin-backoff.h> for the test-and-set with backoff protocol.

#include <synch/locks/mcslock.h> for the MCS queue lock protocol.

#include <synch/locks/reactive.h> for the reactive spin lock algorithm.

For p4 users: In order to substitute the default p4 lock algorithm with any of the above algorithms, add #include <synch/p4.h> after #include <p4.h>, then include one of the three include files above. This redefines p4_lock_t, p4_lock_init, p4_lock and p4_unlock to one of the algorithms above.