






Material Type: Exam; Professor: Back; Class: Computer Systems; Subject: Computer Science; University: Virginia Polytechnic Institute And State University; Term: Fall 2009;
Solutions are shown in this style. This exam was given in Fall 2009.
The following questions relate to how programs are compiled for IA32.
a) (8 pts) In lecture, we discussed how each function obtains its own activation record, or stack frame, every time it is called. The stack frame is used for several purposes, including holding the values of arguments passed to a function and the values of local variables that cannot be kept in registers. Typically, accesses to these arguments and variables involve loads or stores that use relative addressing with the %ebp register as a base. Recent versions of gcc support an optimization option ‘-fomit-frame-pointer’ that organizes accesses to local variables differently. Instead of using the base/frame pointer register %ebp, the stack pointer register %esp is used to access local variables and arguments passed to a function. As a result, %ebp is available for other uses.
i. (6 pts) Explain why and how this would work! Why is the base pointer, apparently, redundant?
The base pointer is always at a known offset from the stack pointer, so any access that uses addressing relative to %ebp can be replaced with an access that uses addressing relative to %esp.
ii. (2 pts) Consider the example of accessing the first argument, which is traditionally accessed using 8(%ebp). How would code compiled with -fomit-frame-pointer access this argument?
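Let SFSIZE = |%ebp - %esp| be the current stack frame size; then any access to disp(%ebp) can be replaced with disp+SFSIZE(%esp). For example, 8(%ebp) would become 8+SFSIZE(%esp). Consider:

    int sum(int x, int y)
    {
        char localarray[16];
        return x + y;
    }

When compiled with -fomit-frame-pointer (but without other optimizations), the code shows:

    sum:
        subl $16, %esp
        movl 24(%esp), %eax
        addl 20(%esp), %eax
        addl $16, %esp
        ret

Here x is at 20(%esp) and y at 24(%esp): the 16 bytes of locals and the 4-byte return address lie between %esp and the first argument. Because %ebp is never pushed, these offsets are 4 bytes smaller than in a frame that still saves %ebp.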
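The offsets in the listing can be checked with a little arithmetic. The sketch below is illustrative only: the function names and the 4-byte word size are assumptions for IA32 and do not come from the exam.

```c
#include <assert.h>

enum { WORD = 4 };  /* assumed IA32 word size in bytes */

/* Offset of argument arg_index (0 = first) from %ebp in a traditional
   frame: saved %ebp at 0, return address at 4, arguments from 8 upward. */
int arg_offset_from_ebp(int arg_index)
{
    assert(arg_index >= 0);
    return 2 * WORD + arg_index * WORD;
}

/* Offset of the same argument from %esp when no frame pointer is saved:
   locals_bytes of locals, then the return address, then the arguments.
   Only valid while %esp has not been moved by further pushes. */
int arg_offset_from_esp(int arg_index, int locals_bytes)
{
    assert(arg_index >= 0 && locals_bytes >= 0);
    return locals_bytes + WORD + arg_index * WORD;
}
```

For sum(), arg_offset_from_esp(0, 16) and arg_offset_from_esp(1, 16) evaluate to 20 and 24, matching 20(%esp) and 24(%esp) in the listing.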
b) (12 pts) Consider the following assembly code, which was produced by gcc for a function ‘g()’. The first listing shows the result of compiling at the first level of optimization (-O1), the second listing shows the result of compiling at the second optimization level (-O2).

IA32 code, compiled with -O1:

    g:
        pushl %ebp
        movl %esp, %ebp
        subl $8, %esp
        movl 8(%ebp), %eax
        movl 12(%ebp), %edx
        cmpl %edx, %eax
        je .L
        cmpl $1, %eax
        je .L
        cmpl $1, %edx
        jne .L
    .L6:
        movl $1, %eax
        jmp .L
    .L4:
        cmpl %edx, %eax
        jge .L
        movl %eax, 4(%esp)
        subl %eax, %edx
        movl %edx, (%esp)
        call g
        jmp .L
    .L7:
        movl %edx, 4(%esp)
        subl %edx, %eax
        movl %eax, (%esp)
        call g
    .L2:
        leave
        ret

IA32 code, compiled with -O2:

    g:
        pushl %ebp
        movl %esp, %ebp
        movl 8(%ebp), %edx
        movl 12(%ebp), %ecx
        cmpl %ecx, %edx
        je .L
    .L15:
        cmpl $1, %edx
        je .L
        cmpl $1, %ecx
        je .L
        cmpl %ecx, %edx
        jge .L
        movl %ecx, %eax
        movl %edx, %ecx
        subl %edx, %eax
        movl %eax, %edx
    .L7:
        cmpl %edx, %ecx
        jne .L
    .L3:
        popl %ebp
        movl %ecx, %eax
        ret
    .L5:
        popl %ebp
        movl $1, %eax
        ret
    .L9:
        subl %ecx, %edx
        jmp .L

i. (9 pts) Provide a C version of function g()! Hint: ‘g’ implements a well-known, classic mathematical algorithm!
‘g’ implements Euclid’s algorithm for finding the greatest common divisor:
int g(int m, int n) {
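A minimal sketch of such a function, reconstructed from the -O1 listing rather than taken from the original solution. The parameter names follow the signature above; the positive-input assertion is an added assumption, since the subtraction-based algorithm does not terminate for zero or negative arguments.

```c
#include <assert.h>

/* Subtraction-based Euclid, mirroring the -O1 control flow:
   equal arguments return immediately, a 1 short-circuits to 1,
   otherwise recurse on (larger - smaller, smaller). */
int g(int m, int n)
{
    assert(m > 0 && n > 0);   /* assumption: positive inputs only */
    if (m == n)
        return m;
    if (m == 1 || n == 1)
        return 1;
    if (m < n)
        return g(n - m, m);   /* args pushed as (n - m, m) in the listing */
    return g(m - n, n);       /* args pushed as (m - n, n) in the listing */
}
```

The -O2 listing computes the same function, but gcc has turned the tail calls into a loop, which is why it contains no call instruction.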
You repeat the compilation and, in fact, the error goes away:
    $ gcc -c a.c b.c main.c
    $ gcc a.o b.o main.o
    $
(2 pts) Why did the linker not report an error this time?
As an uninitialized global variable, global_shared_variable becomes a weak symbol, hence the linker will not report an error for multiple definitions.
static int global_shared_variable = -1;
You repeat the compilation and, in fact, the error is gone:
    $ gcc -c a.c b.c main.c
    $ gcc a.o b.o main.o
    $
(2 pts) Why did the linker not report an error this time?
global_shared_variable has become 2 distinct local symbols in a.o and b.o that happen to have the same name, hence there is no conflict for the linker to report.
(2 pts) Explain why this solution is not a good one!
It would create 2 copies of this variable with distinct memory locations holding potentially different values; updates to one would not affect the other. This is likely not what the programmer intended when placing the definition into shared.h.
shared.h:

    extern int global_shared_variable;

a.c:

    #include "shared.h"
    int global_shared_variable = -1;

b.c:

    #include "shared.h"

main.c:

    #include "shared.h"
    int main() { }
Alternatively, the definition could be contained in b.c or main.c. It’s also possible to omit the ‘extern’, in which case the linker rule applies that a single strong definition in a.o overrides the weak definition in b.o. However, this is not good practice (-Wl,--warn-common would flag it). Some suggested placing ‘extern int global_shared_variable’ in b.c. This would compile and link, but is generally not considered sound programming practice.
b) (4 pts) A “fence” is a technique that is sometimes used to detect out-of-bounds memory accesses. The idea is to place some ‘fence’ values that rarely occur during normal execution before and after each array. Then, out-of-bounds accesses can be detected by checking whether the fence values were changed. Complete the program below to implement this idea to protect array ‘a’, which is passed to a buggy update routine that contains out-of-bounds accesses.
    void buggy_update_array(int *array, int n, int delta)
    {
        int i;
        for (i = 0; i <= n; i++) {
            array[i-1] = array[i] + delta;
        }
    }

    int a[10];

    int main()
    {
        buggy_update_array(a, 10, 1);
    }
Knowing that the linker will allocate variables of the same storage class consecutively in memory, the program can be completed as follows:
    void buggy_update_array(int *array, int n, int delta)
    {
        int i;
        for (i = 0; i <= n; i++) {
            array[i-1] = array[i] + delta;
        }
    }

    int leftfence;
    int a[10];
    int rightfence;

    #define MAGIC 0xdeadbeef

    int main()
    {
        leftfence = rightfence = MAGIC;
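A self-contained sketch of the fence check, under stated assumptions. Instead of relying on the linker to place three separate globals consecutively, it puts the fences and the array into one enclosing array (the hypothetical `guarded`), so the stray accesses stay inside a single object and the behavior is well defined. The function `run_with_fences` and its return flags are illustrative names, not part of the exam.

```c
#include <assert.h>

#define MAGIC ((int)0xdeadbeefu)  /* fence value from the exam, converted to int */
#define N 10

static void buggy_update_array(int *array, int n, int delta)
{
    int i;
    for (i = 0; i <= n; i++) {
        array[i - 1] = array[i] + delta;  /* writes array[-1], reads array[n] */
    }
}

/* Assumed layout: guarded[0] is the left fence, guarded[1..N] is the
   protected array, guarded[N+1] is the right fence. */
static int guarded[N + 2];

enum { LEFT_FENCE_HIT = 1, RIGHT_FENCE_HIT = 2 };

/* Arms both fences, runs the buggy routine, and reports which fences changed. */
int run_with_fences(void)
{
    int *a = guarded + 1;
    int result = 0;
    int i;

    guarded[0] = guarded[N + 1] = MAGIC;
    for (i = 0; i < N; i++)
        a[i] = 0;
    assert(guarded[0] == MAGIC && guarded[N + 1] == MAGIC);

    buggy_update_array(a, N, 1);

    if (guarded[0] != MAGIC)
        result |= LEFT_FENCE_HIT;
    if (guarded[N + 1] != MAGIC)
        result |= RIGHT_FENCE_HIT;
    return result;
}
```

In this sketch only the left fence is reported: the routine writes array[-1] but merely reads array[n], and a fence can detect out-of-bounds writes, not reads.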
i. (2 pts) With respect to code
Yes, it uses loops whose instructions are executed many times.
ii. (3 pts) With respect to data
No, each matrix element is accessed once and only once. (Though less relevant, I also accepted ‘yes’ if you pointed out that there is reuse of ‘i’ and ‘j’, but not ‘tmp’.)
b) (5 pts) Does this algorithm exhibit spatial locality? Briefly say why or why not!
i. (2 pts) With respect to code
Yes – the executed code is contained in a contiguous section of instructions. (The fact that each backward branch in the loop causes a non-contiguous control transfer notwithstanding.)
ii. (3 pts) With respect to data
If the matrix is stored in row-major order, as in C, the accesses to matrix[i][j] exhibit spatial locality, but the accesses to matrix[j][i] do not.
c) (5 pts) Assume a memory hierarchy with just one level of caching and a cache line size of 64 bytes, which can hold 16 ints. How many cache misses would you expect per inner loop iteration?
Each loop iteration accesses both matrix[i][j] and matrix[j][i]. If the matrix is large enough (so that the distance between &matrix[k][x] and &matrix[k+1][x] is large), matrix[j][i] would miss every time, and matrix[i][j] every 16th time (once per cache line). Thus we would expect 1 + 1/16 = 1.0625 cache misses per iteration.
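The estimate can be cross-checked with a toy cache model. Everything in this sketch is an assumption made for illustration: the loop shape (j running from i+1 to n-1), 4-byte ints, a matrix starting at address 0, and a tiny fully associative LRU cache of 8 lines, chosen so that column lines are evicted before they can be reused.

```c
#include <assert.h>

#define LINE_BYTES  64   /* cache line size from the question */
#define INT_BYTES   4    /* 16 ints per line, as stated in the question */
#define CACHE_LINES 8    /* assumed: a deliberately small cache */

typedef struct {
    long tag[CACHE_LINES];       /* line number held in each slot, -1 if empty */
    long last_use[CACHE_LINES];  /* timestamp of the most recent access */
    long clock;
    long misses;
} Cache;

static void cache_init(Cache *c)
{
    for (int k = 0; k < CACHE_LINES; k++) {
        c->tag[k] = -1;
        c->last_use[k] = 0;
    }
    c->clock = 0;
    c->misses = 0;
}

/* Looks up the line containing byte_addr; on a miss, replaces the
   least recently used slot. */
static void cache_access(Cache *c, long byte_addr)
{
    long line = byte_addr / LINE_BYTES;
    int victim = 0;
    c->clock++;
    for (int k = 0; k < CACHE_LINES; k++) {
        if (c->tag[k] == line) {
            c->last_use[k] = c->clock;
            return;
        }
        if (c->last_use[k] < c->last_use[victim])
            victim = k;
    }
    c->misses++;
    c->tag[victim] = line;
    c->last_use[victim] = c->clock;
}

/* Replays the address stream of an in-place transpose of an n x n int
   matrix and returns the average number of misses per inner iteration. */
double misses_per_iteration(int n)
{
    Cache c;
    long iterations = 0;
    assert(n > 1);
    cache_init(&c);
    for (long i = 0; i < n; i++) {
        for (long j = i + 1; j < n; j++) {
            cache_access(&c, (i * n + j) * INT_BYTES);  /* matrix[i][j] */
            cache_access(&c, (j * n + i) * INT_BYTES);  /* matrix[j][i] */
            iterations++;
        }
    }
    return (double)c.misses / (double)iterations;
}
```

For n = 256 the model lands close to the 1 + 1/16 estimate; small deviations come from partially used lines at the start of each row segment and from a few reuses near the last rows.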
d) (5 pts) In lecture we discussed blocking as a method to speed up dense matrix multiplication. Could blocking be applied to speed up in-place matrix transposition? Briefly justify your answer!
Yes. Divide the matrix into small square blocks that fit in the cache, and transpose the elements of each block using a temporary buffer. If the temporary buffer, the source block, and the destination block all fit into the cache, there is no penalty for the lack of spatial locality, because the cache line fetched when accessing b[k][] is still in the cache when b[k+1][] is accessed. This description is simplified: in practice, one needs to worry about conflict misses as well. Blocking avoids the cache misses caused by the lack of spatial locality when accessing neighboring columns; it does not introduce temporal locality.
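A minimal sketch of a blocked in-place transpose. It departs from the description above in one respect: rather than copying through a temporary buffer, it swaps each element directly with its mirror element, tile by tile. The function name, the flat row-major layout, and the `block` parameter are assumptions for illustration.

```c
#include <assert.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* In-place transpose of an n x n row-major matrix, processed in
   block x block tiles. Each tile on or above the diagonal is paired
   with its mirror tile and the elements are swapped. */
void transpose_blocked(int *m, int n, int block)
{
    assert(n >= 0 && block > 0);
    for (int ii = 0; ii < n; ii += block) {
        for (int jj = ii; jj < n; jj += block) {
            int i_end = min_int(ii + block, n);
            int j_end = min_int(jj + block, n);
            for (int i = ii; i < i_end; i++) {
                /* In a diagonal tile, visit only elements above the diagonal
                   so that no pair is swapped twice. */
                int j_start = (ii == jj) ? i + 1 : jj;
                for (int j = j_start; j < j_end; j++) {
                    int tmp = m[i * n + j];
                    m[i * n + j] = m[j * n + i];
                    m[j * n + i] = tmp;
                }
            }
        }
    }
}
```

The block size would be chosen so that two tiles fit in the cache together; the sketch also handles matrix sizes that are not a multiple of the block size.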
a) (4 pts) Consider the following C code
    void matrix_vector_multiply(int * y, int M[2][2], int * x)
    {
        y[0] = M[0][0] * x[0] + M[0][1] * x[1];
        y[1] = M[1][0] * x[0] + M[1][1] * x[1];
    }
Suppose you have an infinitely sophisticated compiler and you are using a machine with plenty of registers such as x86_64. How many memory load instructions and how many memory store instructions would the body of this function contain? (Not counting any accesses needed for stack frame management or saving callee-saved registers.)
Because x and y could refer to the same vector, we need 8 loads and 2 stores. Loads are for M[0][0], M[0][1], x[0], x[1], M[1][0], M[1][1], x[0], and x[1], stores for y[0] and y[1]. For example, here is the x86_64 code:
    matrix_vector_multiply:
        movl 4(%rdx), %ecx      # load x[1]
        movl (%rdx), %eax       # load x[0]
        imull 4(%rsi), %ecx     # load M[0][1]
        imull (%rsi), %eax      # load M[0][0]
        addl %eax, %ecx
        movl %ecx, (%rdi)       # store y[0]
        movl 4(%rdx), %ecx      # load x[1]
        movl (%rdx), %eax       # load x[0]
        imull 12(%rsi), %ecx    # load M[1][1]
        imull 8(%rsi), %eax     # load M[1][0]
        addl %eax, %ecx
        movl %ecx, 4(%rdi)      # store y[1]
        ret
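Why the reloads are required can be seen by calling the function with y and x pointing at the same vector. In the sketch below, `matrix_vector_multiply_cached` and `aliasing_demo` are hypothetical names introduced for illustration; the cached variant reads x only once, which is what a compiler could do only if it knew that y and x never overlap.

```c
#include <assert.h>

void matrix_vector_multiply(int *y, int M[2][2], int *x)
{
    y[0] = M[0][0] * x[0] + M[0][1] * x[1];
    y[1] = M[1][0] * x[0] + M[1][1] * x[1];
}

/* Hypothetical variant: x is read once into locals before any store. */
void matrix_vector_multiply_cached(int *y, int M[2][2], int *x)
{
    int x0 = x[0], x1 = x[1];
    y[0] = M[0][0] * x0 + M[0][1] * x1;
    y[1] = M[1][0] * x0 + M[1][1] * x1;
}

void aliasing_demo(void)
{
    int M[2][2] = { { 1, 2 }, { 3, 4 } };

    /* y and x alias: the store to y[0] changes x[0] before the second row. */
    int v[2] = { 1, 2 };
    matrix_vector_multiply(v, M, v);
    assert(v[0] == 5 && v[1] == 23);   /* 3*5 + 4*2, using the updated x[0] */

    /* The cached variant gives a different answer for the same call. */
    int w[2] = { 1, 2 };
    matrix_vector_multiply_cached(w, M, w);
    assert(w[0] == 5 && w[1] == 11);   /* 3*1 + 4*2, using the original x[0] */
}
```

Because the two variants disagree when the arguments alias, the compiler cannot substitute the cached form on its own and must emit the second pair of loads of x[0] and x[1].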
b) (4 pts) Now consider this C function, which is almost identical to the one above, except that the matrix M is no longer a nested array:
    void matrix_vector_multiply2(int * y, int *M[], int * x)
    {
        y[0] = M[0][0] * x[0] + M[0][1] * x[1];
        y[1] = M[1][0] * x[0] + M[1][1] * x[1];
    }
How many memory load and store instructions would a compiler emit for this function?
        imull 4(%rsi), %eax     # load M[0][1]
        imull (%rsi), %edx      # load M[0][0]
        addl %edx, %eax
        movl %eax, (%rdi)       # store y[0]
        imull 12(%rsi), %r8d    # load M[1][1]
        imull 8(%rsi), %ecx     # load M[1][0]
        addl %ecx, %r8d
        movl %r8d, 4(%rdi)      # store y[1]
        ret
d) (6 pts) Assuming an optimizing compiler, is there always a performance cost for declaring many local variables within one function? If yes, say why. If not, explain precisely when there is a cost and when there isn’t!
No, not always. Optimizing compilers perform register allocation. The number of declared local variables does not matter unless the lifetimes of these local variables overlap. Each register can hold only one local variable at a time. If more local variables are alive at some point in a function than there are registers, spilling occurs and a performance penalty is paid.
a) (14 pts) Consider the following example programs. List all legal outputs each program may produce when executed on a Unix system. The output consists of strings made up of multiple letters.

    // included in both programs
    #include <unistd.h>
    #include <sys/wait.h>
    // W(A) means write(1, "A", sizeof "A")
    #define W(x) write(1, #x, sizeof #x)
i) 6 pts

    int main()
    {
        W(A);
        fork();
        W(B);
        fork();
        W(C);
    }

Possible outputs are: ABBCCCC ABCBCCC ABCCBCC

ii) 8 pts

    int main()
    {
        W(A);
        int child = fork();
        W(B);
        if (child)
            wait(NULL);
        W(C);
    }

Possible outputs are: ABCBC ABBCC
There is a bug in the program: it should be write(1, #x, sizeof #x - 1). As written, the program also outputs a ‘\0’ character after each letter, which however does not appear on the terminal.
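The listed outputs can be cross-checked by enumerating every ordering of the write calls that respects the happens-before constraints: a write follows the earlier writes of the same process, a child's writes follow the writes its parent made before the fork, and a write after wait() follows all writes of the awaited child. The sketch below is an illustrative model, not part of the exam; the `Program` encoding and all function names are assumptions.

```c
#include <assert.h>
#include <string.h>

#define MAX_EVENTS  8
#define MAX_OUTPUTS 64

typedef struct {
    int n;                        /* number of write events */
    char letter[MAX_EVENTS];      /* letter written by each event */
    unsigned before[MAX_EVENTS];  /* bitmask of events that must come first */
} Program;

typedef struct {
    int count;
    char text[MAX_OUTPUTS][MAX_EVENTS + 1];
} Outputs;

/* Stores s unless an identical string was already recorded. */
static void record(Outputs *out, const char *s)
{
    for (int k = 0; k < out->count; k++)
        if (strcmp(out->text[k], s) == 0)
            return;
    if (out->count < MAX_OUTPUTS)
        strcpy(out->text[out->count++], s);
}

/* Depth-first enumeration of all orderings consistent with `before`. */
static void extend(const Program *p, unsigned done, int len, char *buf, Outputs *out)
{
    if (len == p->n) {
        buf[len] = '\0';
        record(out, buf);
        return;
    }
    for (int e = 0; e < p->n; e++) {
        if (done & (1u << e))
            continue;                              /* already emitted */
        if ((p->before[e] & done) != p->before[e])
            continue;                              /* a predecessor is missing */
        buf[len] = p->letter[e];
        extend(p, done | (1u << e), len + 1, buf, out);
    }
}

int count_outputs(const Program *p, Outputs *out)
{
    char buf[MAX_EVENTS + 1];
    assert(p->n >= 0 && p->n <= MAX_EVENTS);
    out->count = 0;
    extend(p, 0u, 0, buf, out);
    return out->count;
}

/* Program i: events A, B(parent), B(child), then two C's after each B. */
Program fork_twice_program(void)
{
    Program p = { 7, "ABBCCCC",
                  { 0u, 1u << 0, 1u << 0, 1u << 1, 1u << 1, 1u << 2, 1u << 2 } };
    return p;
}

/* Program ii: events A, B(parent), B(child), C(child), C(parent);
   the parent's C waits for its own B and for the child's C. */
Program fork_and_wait_program(void)
{
    Program p = { 5, "ABBCC",
                  { 0u, 1u << 0, 1u << 0, 1u << 2, (1u << 1) | (1u << 3) } };
    return p;
}
```

Calling count_outputs on the two models yields three distinct strings for program i and two for program ii, in agreement with the lists above.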
b) (6 pts) Consider the following two programs. Below each program is shown the output sent to the terminal when the program is run:
Left program:

    int main() { if (fork()) *(int *)0 = 42; }

Output:

    $ ./crash
    Segmentation fault
    $

Right program:

    int main() { if (!fork()) *(int *)0 = 42; }

Output:

    $ ./crash
    $
Why is the message “Segmentation fault” displayed for the program on the left, but not for the program on the right?
The segmentation fault message is displayed by the shell if a child process is terminated with signal 11, SIGSEGV. On the left, the faulting store runs where fork() returns nonzero, that is, in the process the shell started, so the shell’s child is terminated. On the right, the process that is terminated is the child created by fork(), which is a grandchild of the shell. ‘wait()’ does not allow the shell to wait for grandchildren, hence the shell cannot learn that the process terminated with a fault, hence no message. Note that this behavior occurs regardless of whether the scheduler runs the parent or the child first after the fork (on a single processor system).
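How a parent such as the shell learns about the fault can be sketched with waitpid() and the status macros from <sys/wait.h>. This is an illustration under assumptions, not the shell's actual code: the function name is hypothetical, and raise(SIGSEGV) stands in for the null-pointer store so that the sketch does not depend on undefined behavior.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that dies from SIGSEGV and returns the signal number
   the parent observes, 0 if the child exited normally, -1 on error. */
int child_termination_signal(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        signal(SIGSEGV, SIG_DFL);  /* make sure the default action applies */
        raise(SIGSEGV);            /* stand-in for *(int *)0 = 42 */
        _exit(0);                  /* not reached if the signal terminates us */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

A shell makes the same kind of check on its direct children and prints a message when the terminating signal is SIGSEGV. A grandchild's status is never delivered to it, which is why the right-hand program produces no message.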