














A collection of questions from a final exam in principles of parallel computing. The exam covers various topics related to parallel computing, including serial and parallel programs, cache, virtual memory, multithreading, and multicore systems. The questions are presented as true or false statements, multiple choice questions, and programming exercises.
Typology: Exams















multicore integrated circuits
integrated circuits with multiple conventional processors on a single chip.

serial programs
These types of programs are written for a conventional single-core processor.

True
True or False: Translation programs are usually unable to completely parallelize serial programs.

parallel programs
These types of programs can make use of multiple cores.

coordinate
When we write parallel programs, we usually need to __________ the work of the cores.

communication among the cores, load balancing, and synchronization of the cores
Coordinating the work of the cores in a parallel program can involve these three things.

True
True or False: MPI is used for programming distributed-memory systems.

False (they are used for programming shared-memory systems)
True or False: Pthreads and OpenMP are used for programming distributed-memory systems.

distributed-memory systems
In these types of systems, the cores have their own private memories.

shared-memory systems
In these types of systems, it's possible for each core to access each memory location.

concurrent programs
These kinds of programs can have multiple tasks in progress at any instant.

True
True or False: Parallel and distributed programs usually have tasks that can execute simultaneously.

serial hardware and software
This kind of hardware and software generally runs a single job at a time.
main memory, a central-processing unit (CPU) or processor or core, and an interconnection between the memory and the CPU
The classical von Neumann architecture consists of these three things.

registers
Data in the CPU and information about the state of an executing program are stored in this special, very fast kind of storage.

program counter
This register, in the control unit, stores the address of the next instruction to be executed.

bus
An interconnect that transfers instructions and data between the CPU and memory, consisting of parallel wires and some hardware controlling access to the wires.

False (they are said to be "fetched" or "read" from memory)
True or False: When data or instructions are transferred from memory to the CPU, they are said to be "written" or "stored" into the CPU.

True
True or False: When data is transferred from the CPU to memory, it is said to be "written to memory" or "stored."

The control unit is responsible for deciding which instructions in a program should be executed, and the ALU is responsible for executing the actual instructions.
In this diagram of the von Neumann architecture, briefly describe the role of the arithmetic and logical unit, and the control unit.

von Neumann bottleneck
The separation of memory and the CPU is often called this, since the interconnect determines the rate at which instructions and data can be accessed.

operating system
a major piece of software whose purpose is to manage hardware and software resources on a computer.

process
an instance of a computer program that is being executed, created by the operating system when a user runs a program.

True
True or False: A process consists of the executable machine language program.
threading
a mechanism for programmers to divide their programs into more or less independent tasks, with the property that when one thread is blocked another thread can be run.

forks
When a thread is started, it _____ off the process.

joins
When a thread terminates, it _____ the process.

cache
a collection of memory locations that can be accessed in less time than some other memory locations.

CPU cache
a collection of memory locations that the CPU can access more quickly than it can access main memory.

locality
Consider this loop:
float z[1000]; sum = 0.0; for (i = 0; i < 1000; i++) { sum += z[i]; }
This is an example of what principle?

locality
the principle that an access of one location is followed by an access of a nearby location.

spatial locality
a program accessing a memory location and then accessing another one nearby.

temporal locality
a program accessing a memory location and then accessing the same location again shortly afterward.

cache blocks (also cache lines)
blocks of data and instructions that a memory access operates on.

levels
A cache is usually divided into these: the first one is the smallest and the fastest, and subsequent ones are larger and slower.

cache hit (also hit)
When a cache is checked for information and information is available.

cache miss (also miss)
When a cache is checked for information and information is not available.

True
True or False: The memory access terms "read" and "write" are also used for caches.

inconsistent
If the value in the cache and the value in main memory are different they are ____________.

write-through caches
In these types of caches, the line is written to main memory when it is written to the cache.

write-back caches
In these types of caches, inconsistent data in the cache is marked dirty when it is updated, and when the cache line is replaced by a new cache line from memory, the dirty line is written to memory.

fully associative cache
a cache in which a new line can be placed at any location in the cache.

direct mapped cache
a cache in which each cache line has a unique location in the cache to which it will be assigned.

n-way set associative cache
a cache in which each cache line can be placed in one of n different locations in the cache.

0, 1, 2, or 3
Consider a main memory that consists of 16 lines and a cache that consists of four lines. If this were a fully associative cache, which cache line(s) could line 3 of the main memory be assigned to?

3
/∗First pair of loops∗/ for(i= 0;i
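The cache-placement cards above can be illustrated with a minimal sketch. It assumes the common modulo mapping (memory line number modulo the number of cache lines or sets), which the cards themselves do not state; the function names are hypothetical.

```c
/* Sketch only: assumes modulo placement, a common scheme but not the only one. */

/* Direct mapped: a memory line has exactly one possible cache line. */
unsigned direct_mapped_slot(unsigned mem_line, unsigned cache_lines) {
    return mem_line % cache_lines;
}

/* n-way set associative: the cache is split into cache_lines / n_ways sets,
   and a memory line may occupy any of the n_ways lines of its set. */
unsigned set_index(unsigned mem_line, unsigned cache_lines, unsigned n_ways) {
    unsigned num_sets = cache_lines / n_ways;
    return mem_line % num_sets;
}

/* First cache line belonging to a set; the set covers
   first .. first + n_ways - 1. */
unsigned set_first_slot(unsigned set, unsigned n_ways) {
    return set * n_ways;
}
```

With 16 memory lines and a four-line cache, `direct_mapped_slot(3, 4)` gives cache line 3, while a two-way set associative cache places memory line 3 in set 1, that is, cache line 2 or 3. A fully associative cache is the special case with a single set spanning all four lines.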
hardware multithreading
This kind of multithreading provides a means for systems to continue doing useful work when the task being currently executed has stalled, for example, if the current task has to wait for data to be loaded from memory.

fine-grained multithreading
In this kind of multithreading, the processor switches between threads after each instruction, skipping threads that are stalled.

coarse-grained multithreading
This kind of multithreading only switches threads that are stalled waiting for a time-consuming operation to complete (e.g. a load from main memory), in an attempt to avoid the potential problem of a thread that's ready to execute a long sequence of instructions having to wait to execute every instruction.

simultaneous multithreading
a variation of fine-grained multithreading that attempts to exploit superscalar processors by allowing multiple threads to make use of multiple functional units.

Flynn's Taxonomy
This is used to classify computer architectures; it classifies a system according to the number of instruction streams and the number of data streams it can simultaneously manage.

single instruction stream, single data stream (SISD)
A classical von Neumann system is said to be this since it executes a single instruction at a time and it can fetch or store one item of data at a time.

single instruction, multiple data (SIMD) systems
These kinds of parallel systems operate on multiple data streams by applying the same instruction to multiple data items; therefore, they have a single control unit and multiple ALUs.

into the ith ALU; the ALU can then add y[i] to x[i] and then store the result to x[i]
Consider the following: for (i=0; i
for (i=0; i 0.0) x[i] += y[i];
What should the SIMD system do to determine whether y[i] is positive?

data-parallelism
Parallelism that's obtained by dividing data among the processors and having the processors all apply (more or less) the same instructions to their subsets of the data.

vector processors
processors that can operate on arrays or vectors of data.

vector registers
registers capable of storing a vector of operands and operating simultaneously on their contents.

vectorized and pipelined functional units
In a vector system, these are applied to each pair of corresponding elements in the two vectors.

vector instructions
instructions that operate on vectors in a vector system.

interleaved memory
the memory system of a vector system, which consists of multiple "banks" of memory that can be accessed more or less independently.

strided memory access
In this kind of memory access, the program accesses elements of a vector located at fixed intervals.

hardware scatter/gather
In this kind of memory access, the program writes or reads elements of a vector located at irregular intervals.

scalability
the ability of a system to handle ever larger problems; vector systems appear to have a finite limit here, and they do not handle irregular data structures as well as other parallel architectures.

graphics processing pipeline
clusters
distributed-memory systems that are composed of a collection of commodity systems.

nodes
the individual computational units of a system joined together by the communication networks.

hybrid systems
clusters with shared-memory nodes.

grid
the infrastructure necessary to turn large networks of geographically distributed computers into a unified distributed-memory system.

switched interconnects
These kinds of interconnects use switches to control the routing of data among the connected devices.

crossbar
a switched interconnect in which bidirectional communication links connect cores or memory modules through switches.

direct interconnect
In this kind of interconnect, each switch is directly connected to a processor-memory pair, and the switches are connected to each other.

ring
This is superior to a simple bus because it allows multiple simultaneous communications.

toroidal mesh
In this direct interconnect, each switch must support five links, and if there are p processors, there are 2p switch-to-switch links (3p if the link from each switch to its processor is also counted).

bisection width
a measure of the number of simultaneous communications, or connectivity: the minimum number of links that must be removed to divide a parallel system into two halves, each containing half of the processors or nodes.

bisection bandwidth
the sum of the bandwidths of the links that connect the two halves of a system.

fully connected network
A direct interconnect in which each switch is directly connected to every other switch.

hypercube
a highly connected direct interconnect with 2^d nodes, d being the number of dimensions.

indirect interconnects
In these kinds of interconnects, the switches may not be directly connected to a processor.

indirect
The crossbar and the omega networks are examples of ________ networks.

crossbar
A distributed-memory ________ has unidirectional links and as long as two processors don't attempt to communicate with the same processor, all the processors can simultaneously communicate with another processor.

omega network
an indirect network, more complex but less expensive than a crossbar, in which not all communications can occur simultaneously.

(1/2)p x log2(p)
formula for the number of 2 x 2 crossbar switches the omega network uses.

2p x log2(p)
the formula for the total number of switches an omega network uses.

p^2
formula for the total number of switches a crossbar uses.

latency
the time that elapses between the source's beginning to transmit the data and the destination's starting to receive the first byte.

bandwidth
the rate at which the destination receives data after it has started to receive the first byte.

latency (sec.) + (number of bytes / bandwidth (bytes per sec.))
formula for the message transmission time.

cache coherence problem
the problem that caches designed for single-processor systems provide no mechanism for ensuring that, when the caches of multiple processors store the same variable, an update by one processor to the cached variable is "seen" by the other processors, that is, that the cached value stored by the other processors is also updated.
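The interconnect formulas in the cards above can be checked with a short sketch. The function names are hypothetical, and the switch counts assume that p is a power of two.

```c
/* Sketch of the formulas from the cards; names are illustrative only. */

/* Message transmission time = latency + bytes / bandwidth. */
double transmission_time(double latency_sec, double n_bytes,
                         double bandwidth_bytes_per_sec) {
    return latency_sec + n_bytes / bandwidth_bytes_per_sec;
}

/* Integer log base 2; exact when p is a power of two (assumed here). */
unsigned ilog2(unsigned p) {
    unsigned d = 0;
    while (p > 1) {
        p >>= 1;
        d++;
    }
    return d;
}

/* A crossbar connecting p processors uses p^2 switches. */
unsigned long crossbar_switches(unsigned p) {
    return (unsigned long)p * p;
}

/* An omega network uses (1/2) p log2(p) of the 2 x 2 crossbar switches... */
unsigned long omega_2x2_switches(unsigned p) {
    return (unsigned long)(p / 2) * ilog2(p);
}

/* ...which is 2 p log2(p) individual switches in total. */
unsigned long omega_total_switches(unsigned p) {
    return 2UL * p * ilog2(p);
}
```

For p = 8 this gives 12 of the 2 x 2 crossbar switches and 48 switches in total for the omega network, against 64 for the crossbar, which is the sense in which the omega network is the less expensive of the two.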
parallelization
the process of converting a serial program or algorithm into a parallel program.

embarrassingly parallel
Programs that can be parallelized by simply dividing the work among processes/threads are sometimes said to be ______________ ________.
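How much a parallelized program can gain depends on how much of it stays serial. The sketch below uses the usual formulation of Amdahl's Law, speedup = 1 / (s + (1 - s)/p) for serial fraction s on p cores, together with the efficiency example used elsewhere in this set (Tserial = n, Tparallel = n/p + 1). The function names are hypothetical, and the Amdahl formula is the standard textbook form rather than something quoted from the cards.

```c
/* Sketch only: standard textbook formulas, illustrative names. */

/* Amdahl's Law: speedup on p cores when a fraction of the run time
   (serial_fraction) cannot be parallelized. */
double amdahl_speedup(double serial_fraction, double p) {
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p);
}

/* Efficiency E = Tserial / (p * Tparallel), with Tserial = n and
   Tparallel = n/p + 1, which simplifies to n / (n + p). */
double efficiency(double n, double p) {
    double t_serial = n;
    double t_parallel = n / p + 1.0;
    return t_serial / (p * t_parallel);
}
```

With a 10% serial fraction, `amdahl_speedup(0.1, p)` stays below 10 no matter how large p becomes. In the efficiency example, doubling both n and p leaves E unchanged (for instance `efficiency(100, 100)` and `efficiency(200, 200)` are both 0.5), which is what makes that program scalable.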
mutual exclusion lock (also mutex, lock)
Consider the following:
myVal = Compute_val(my_rank);
Lock(&add_my_val_lock);
x += myVal;
Unlock(&add_my_val_lock);
What kind of object is being used in the Lock() and Unlock() function?

serialization
A mutex enforces _____________ of a critical section, by giving only one thread at a time access to it.

busy-waiting
In ____ _______, a thread enters a loop whose sole purpose is to test a condition.

busy-waiting
Consider the following:
myVal = Compute_val(my_rank);
if (my_rank == 1) while (!ok_for_1); /* busy-wait loop */
x += myVal; /* critical section */
if (my_rank == 0) ok_for_1 = true;
The above is an example of what?

semaphores
These are similar to mutexes, but the details of their behavior are slightly different, and they can implement some types of thread synchronization more easily.

monitor
an object whose methods can only be executed by one thread at a time, providing mutual exclusion at a high level.

transactional memory
This kind of memory treats critical sections in shared-memory programs as transactions.

transaction
an access to a database that the system treats as a single unit.
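The Lock()/Unlock() pseudocode in the mutex card above could be written with Pthreads roughly as follows. This is a sketch under assumptions: `compute_val`, `thread_work`, `run_sum`, and the thread count are hypothetical stand-ins, and only the `pthread_*` calls are real library functions.

```c
#include <pthread.h>
#include <stdint.h>

#define NUM_THREADS 4

static long x = 0; /* shared sum, protected by the mutex below */
static pthread_mutex_t add_my_val_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical stand-in for Compute_val: thread r contributes r + 1. */
static long compute_val(long my_rank) {
    return my_rank + 1;
}

static void *thread_work(void *arg) {
    long my_rank = (long)(intptr_t)arg;
    long my_val = compute_val(my_rank);

    pthread_mutex_lock(&add_my_val_lock);   /* enter critical section */
    x += my_val;                            /* one thread at a time */
    pthread_mutex_unlock(&add_my_val_lock); /* leave critical section */
    return NULL;
}

/* Forks NUM_THREADS threads, joins them, and returns the shared sum,
   or -1 if a thread could not be created. */
long run_sum(void) {
    pthread_t threads[NUM_THREADS];
    long started = 0;

    x = 0;
    for (long rank = 0; rank < NUM_THREADS; rank++) {
        if (pthread_create(&threads[rank], NULL, thread_work,
                           (void *)(intptr_t)rank) != 0)
            break;
        started++;
    }
    for (long rank = 0; rank < started; rank++)
        pthread_join(threads[rank], NULL);
    return started == NUM_THREADS ? x : -1;
}
```

Calling `run_sum()` with four threads returns 1 + 2 + 3 + 4 = 10 regardless of the order in which the threads reach the critical section. Without the lock, the unsynchronized `x += my_val` updates could overwrite one another.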
Amdahl's Law
law that says that unless virtually all of a serial program is parallelized, the possible speedup is going to be very limited, regardless of the number of cores available.

Gustafson's Law
law that states that, if we increase the problem size, the "inherently serial" fraction of the program decreases in size.

scalable
If we can find a corresponding rate of increase in the problem size so that the program always has efficiency E, then the program is ________.

E = n / [p(n/p + 1)] = n / (n + p)
If Tserial = n, and Tparallel = n/p + 1, what does E equal?

scalable
Consider the following: E = n / (n + p) = xn / (xn + kp). If the above statement regarding a program's efficiency is true, the program is ________.

resolution
the unit of measurement on a timer variable.

barrier
a function that approximately synchronizes all of the threads.

approximately synchronizes all of the threads
Consider the following:
/* Synchronize all processes/threads */
Barrier();
mystart = Getcurrenttime();
/* Code that we want to time */
...
myfinish = Getcurrenttime();
myelapsed = myfinish - mystart;
/* Find the max across all processes/threads */
globalelapsed = Globalmax(myelapsed);
if (myrank == 0) printf("The elapsed time = %e seconds\n", globalelapsed);
What does the Barrier() function above do?

#pragma
Pragmas in C and C++ start with this.

fopenmp
To compile an OpenMP program with gcc, we need to include the -_______ option.

long strtol(const char* number_p /* in */, char** end_p /* out */, int base /* in */);
The syntax of the strtol function is ...

#pragma omp
OpenMP pragmas always begin with this.

structured block
a C statement or a compound C statement with one point of entry and one point of exit, although calls to the function "exit" are allowed.

thread of execution
"Thread" is short for this.

clause
text that modifies a directive in OpenMP.

team
In OpenMP, this is the collection of threads executing the parallel block: the original thread and the new threads.

master
the original thread in an OpenMP program.

slaves
the threads that are additional to the original thread in an OpenMP program.

implicit barrier
If a thread that has completed the block of code waits for all the other threads in the team to complete the block, there's an ________ _______.

error checking
Consider the following:
#ifdef _OPENMP
#endif
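The `_OPENMP` guard in the last card is typically used so that the same source still compiles when the compiler has no OpenMP support. A minimal sketch, assuming the hypothetical helper names `hello` and `run_hello`; `omp_get_thread_num` and `omp_get_num_threads` are the real OpenMP library routines.

```c
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h> /* only available when OpenMP is enabled */
#endif

/* Prints a greeting from the calling thread. Falls back to
   "thread 0 of 1" when compiled without OpenMP. */
void hello(void) {
#ifdef _OPENMP
    int my_rank = omp_get_thread_num();
    int thread_count = omp_get_num_threads();
#else
    int my_rank = 0;
    int thread_count = 1;
#endif
    printf("Hello from thread %d of %d\n", my_rank, thread_count);
}

/* Runs hello() on every thread of the team and returns how many
   threads took part. */
int run_hello(void) {
    int greeted = 0;
    /* Structured block: each thread in the team executes it once, and an
       implicit barrier follows the block. */
#pragma omp parallel reduction(+ : greeted)
    {
        hello();
        greeted += 1;
    }
    return greeted;
}
```

Compiled with `gcc -fopenmp`, `run_hello()` returns the size of the team. Compiled without that option, the pragma is ignored and the function runs on the single original thread, returning 1.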