Introduction to Computer Systems: Abstraction, Reality, and Performance Optimization, Lecture notes of Computer Networks

Lecture notes from the first lecture of the course Introduction to Computer Systems at Carnegie Mellon. The lecture covers topics such as code security, memory referencing errors, and memory system performance. The notes include code examples and assembly code. The document could be useful as study notes or a summary for students taking the course or a similar course in computer systems.

Carnegie Mellon
Introduction to Computer Systems
15-213/18-243, spring 2009
1st Lecture, Jan. 12th
Instructors:
Gregory Kesden and Markus Püschel
The course that gives CMU its “Zip”!

Overview

 Course theme

 Five realities

 How the course fits into the CS/ECE curriculum

 Logistics

Great Reality #1:

Int’s are not Integers, Float’s are not Reals

 Example 1: Is x^2 ≥ 0?

 Float’s: Yes!

 Int’s: Not always; the product can overflow (e.g. 50000 * 50000 exceeds the range of a 32-bit int)

 Example 2: Is (x + y) + z = x + (y + z)?

 Unsigned & Signed Int’s: Yes!

 Float’s:

 (1e20 + -1e20) + 3.14 --> 3.14

 1e20 + (-1e20 + 3.14) --> ??

Code Security Example

 Similar to code found in FreeBSD’s implementation of getpeername

 There are legions of smart people trying to find vulnerabilities in programs

/* Kernel memory region holding user-accessible data */
#define KSIZE 1024
char kbuf[KSIZE];

/* Copy at most maxlen bytes from kernel region to user buffer */
int copy_from_kernel(void *user_dest, int maxlen) {
    /* Byte count len is minimum of buffer size and maxlen */
    int len = KSIZE < maxlen ? KSIZE : maxlen;
    memcpy(user_dest, kbuf, len);
    return len;
}

Malicious Usage

#define MSIZE 528

void getstuff() {
    char mybuf[MSIZE];
    copy_from_kernel(mybuf, -MSIZE);
    ...
}

Computer Arithmetic

 Does not generate random values

 Arithmetic operations have important mathematical properties

 Cannot assume all “usual” mathematical properties

 Due to finiteness of representations

 Integer operations satisfy “ring” properties

 Commutativity, associativity, distributivity

 Floating point operations satisfy “ordering” properties

 Monotonicity, values of signs

 Observation

 Need to understand which abstractions apply in which contexts

 Important issues for compiler writers and serious application programmers

Assembly Code Example

 Time Stamp Counter

 Special 64-bit register in Intel-compatible machines

 Incremented every clock cycle

 Read with rdtsc instruction

 Application

 Measure time (in clock cycles) required by procedure

double t;
start_counter();
P();
t = get_counter();
printf("P required %f clock cycles\n", t);

Code to Read Counter

 Write small amount of assembly code using GCC’s asm facility

 Inserts assembly code into machine code generated by compiler

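static unsigned cyc_hi = 0;
static unsigned cyc_lo = 0;

/* Set *hi and *lo to the high and low order bits
   of the cycle counter.
*/
void access_counter(unsigned *hi, unsigned *lo)
{
    /* rdtsc leaves the high 32 bits in %edx and the low 32 bits in %eax */
    asm("rdtsc; movl %%edx,%0; movl %%eax,%1"
        : "=r" (*hi), "=r" (*lo)   /* outputs */
        :                          /* no inputs */
        : "%edx", "%eax");         /* clobbered registers */
}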

Memory Referencing Bug Example

double fun(int i)
{
    volatile double d[1] = {3.14};
    volatile long int a[2];
    a[i] = 1073741824; /* Possibly out of bounds */
    return d[0];
}

fun(0) --> 3.14
fun(1) --> 3.14
fun(2) --> 3.1399998664856
fun(3) --> 2.00000061035156
fun(4) --> 3.14, then segmentation fault

Explanation:

Stack contents          Location accessed by fun(i)
Saved State             4
d7 … d4                 3
d3 … d0                 2
a[1]                    1
a[0]                    0

Memory System Performance Example

 Hierarchical memory organization

 Performance depends on access patterns

 Including how you step through a multi-dimensional array

void copyij(int src[2048][2048], int dst[2048][2048])
{
    int i, j;
    /* Inner loop walks along a row: consecutive addresses */
    for (i = 0; i < 2048; i++)
        for (j = 0; j < 2048; j++)
            dst[i][j] = src[i][j];
}

void copyji(int src[2048][2048], int dst[2048][2048])
{
    int i, j;
    /* Inner loop walks down a column: stride of 2048 ints */
    for (j = 0; j < 2048; j++)
        for (i = 0; i < 2048; i++)
            dst[i][j] = src[i][j];
}

copyji is 21 times slower than copyij (Pentium 4)

The Memory Mountain

[Figure: The Memory Mountain. Read throughput (MB/s, 0 to 1200) plotted against stride (words, s1 to s15) and working set size (bytes, 2k to 8m), with regions labeled L1, L2, and Mem.]

Pentium III Xeon, 550 MHz: 16 KB on-chip L1 d-cache, 16 KB on-chip L1 i-cache, 512 KB off-chip unified L2 cache

Example Matrix Multiplication

 Standard desktop computer, vendor compiler, using optimization flags

 Both implementations have exactly the same operations count (2n^3)

 What is going on?

[Figure: Matrix-Matrix Multiplication (MMM) on 2 x Core 2 Duo 3 GHz (double precision). Performance in Gflop/s (0 to 50) versus matrix size (0 to 9,000). Two curves: Triple loop and Best code (K. Goto); the best code is 160x faster.]

MMM Plot: Analysis

[Figure: Matrix-Matrix Multiplication (MMM) on 2 x Core 2 Duo 3 GHz. Performance in Gflop/s (0 to 50) versus matrix size (0 to 9,000), with the gap between the curves annotated as follows.]

Memory hierarchy and other optimizations: 20x

Vector instructions: 4x

Multiple threads: 4x

 Reason for 20x: Blocking or tiling, loop unrolling, array scalarization, instruction scheduling, search to find best choice

 Effect: fewer register spills, fewer L1/L2 cache misses, fewer TLB misses