Performance Programming and Static Analysis for Security: Lecture 28 by Sean Peisert, Slides of Computer Science

Kent State University (KSU) - Ashtabula Campus Computer Science

A transcript of Lecture 28 from the Performance Programming and Static Analysis for Security material taught by Sean Peisert in ECS 142, Spring 2009, partially taken from slides given at Lawrence Livermore National Labs by Larry Carter and Sean Peisert. The lecture covers the Power3's capabilities and how to harness them, the FLOP to MemOp ratio, pipeline latency, Fortran vs C, and security vulnerabilities such as buffer overflows and TOCTTOU.

Typology: Slides

2019/2020

Uploaded on 06/15/2020


Performance Programming & Static Analysis for Security

Lecture 28

(partially taken from slides given at Lawrence Livermore National Labs by Larry Carter and Sean Peisert)

Dr. Sean Peisert – ECS 142 – Spring 2009

Status

- Project 3 back sometime in the middle of this week.
- Project 4 due this Friday, 11:55pm

Power3’s power … and limits

- Eight pipelined functional units
  - 2 floating point
  - 2 load/store
  - 2 single-cycle integer
  - 1 multi-cycle integer
  - 1 branch
- Powerful operations
  - Fused multiply-add (FMA)
  - Load (or Store) update
- Launch 4 ops per cycle
- Can’t launch 2 stores/cycle
- FMA pipe 3-4 cycles long
- Memory hierarchy (Tues)

Can its power be harnessed?

for (j=0; j<n; j+=4){
    p00 += a[j+0]*a[j+2];
    m00 -= a[j+0]*a[j+2];
    p01 += a[j+1]*a[j+3];
    m01 -= a[j+1]*a[j+3];
    p10 += a[j+0]*a[j+3];
    m10 -= a[j+0]*a[j+3];
    p11 += a[j+1]*a[j+2];
    m11 -= a[j+1]*a[j+2];
}

- 8 FMAs
- 4 Loads
- Runs at 4.6 cycles/iteration (= 772 MFLOP/S)

CL.6:
    FMA   fp31=fp31,fp2,fp0,fcr
    LFL   fp1=()double(gr3,16)
    FNMS  fp30=fp30,fp2,fp0,fcr
    LFDU  fp3,gr3=()double(gr3,32)
    FMA   fp24=fp24,fp0,fp1,fcr
    FNMS  fp25=fp25,fp0,fp1,fcr
    LFL   fp0=()double(gr3,24)
    FMA   fp27=fp27,fp2,fp3,fcr
    FNMS  fp26=fp26,fp2,fp3,fcr
    LFL   fp2=()double(gr3,8)
    FMA   fp29=fp29,fp1,fp3,fcr
    FNMS  fp28=fp28,fp1,fp3,fcr
    BCT   ctr=CL.6,
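The 772 MFLOP/S figure can be sanity-checked with a little arithmetic. The sketch below assumes the usual convention that one FMA counts as two floating-point operations (a multiply and an add). The function names are illustrative and do not come from the slides. The clock rate is not stated on the slide, so it is derived from the two quoted numbers rather than assumed.

```c
/* FLOPs per loop iteration, counting each FMA as 2 FLOPs (multiply + add). */
double flops_per_iteration(int fma_count) {
    return 2.0 * fma_count;
}

/* Sustained FLOPs per clock cycle for a loop body. */
double flops_per_cycle(int fma_count, double cycles_per_iter) {
    return flops_per_iteration(fma_count) / cycles_per_iter;
}

/* Clock rate in MHz implied by a measured MFLOP/s rate:
 * MFLOP/s = (FLOPs per cycle) * MHz, so MHz = MFLOP/s / (FLOPs per cycle). */
double implied_clock_mhz(double mflops, int fma_count, double cycles_per_iter) {
    return mflops / flops_per_cycle(fma_count, cycles_per_iter);
}
```

With the slide's numbers (8 FMAs, 4.6 cycles/iteration, 772 MFLOP/S), `flops_per_cycle(8, 4.6)` is about 3.48 and `implied_clock_mhz(772.0, 8, 4.6)` is about 222 MHz. If each of the two floating-point units can start one FMA per cycle (an assumption, giving a peak of 4 FLOPs per cycle), the loop runs at roughly 87% of peak.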

FLOP to MemOp ratio

Most scientific programs have at most one FMA per MemOp
- Matrix-vector product: (K+1) loads, K FMAs
- FFT butterfly: 8 MemOps, 10 floats (but 5 or 6 FMA)
- DAXPY: 2 Loads, 1 Store, 1 FMA
- DDOT: 2 Loads, 1 FMA

A few have more (use ESSL!)
- Matrix multiply (well-tuned): 2 FMA per load
- Radix-8 FFT

Performance is limited by Memory Operations!
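To make the per-element counts for DAXPY and DDOT concrete, here is a minimal sketch of both kernels in plain C. These are straightforward reference loops, not the tuned ESSL routines, and the signatures are simplified assumptions (no stride arguments).

```c
#include <stddef.h>

/* DAXPY: y[i] = alpha * x[i] + y[i].
 * Per element: 2 loads (x[i], y[i]), 1 FMA, 1 store (y[i]). */
void daxpy(size_t n, double alpha, const double *x, double *y) {
    for (size_t i = 0; i < n; i++) {
        y[i] = alpha * x[i] + y[i];
    }
}

/* DDOT: returns the sum of x[i] * y[i].
 * Per element: 2 loads (x[i], y[i]), 1 FMA, and no store,
 * because the running sum can stay in a register. */
double ddot(size_t n, const double *x, const double *y) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

In both loops the memory operations outnumber the FMAs, so the two load/store units saturate before the two floating-point units do.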

The effect of pipeline latency

for (i=0; i<size; i++) {
    sum = a[i] + sum;
}

3.86 cycles/addition

Next add can’t start until previous is finished (3 to 4 cycles later)

for (i=0; i<size; i+=4) {
    sum0 += a[i];
    sum1 += a[i+1];
    sum2 += a[i+2];
    sum3 += a[i+3];
}
sum = sum0+sum1+sum2+sum3;

1.1 cycles/addition
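The two loops above can be wrapped into complete functions for experimentation. This is a sketch: the function names are illustrative, and the cleanup loop for sizes that are not a multiple of 4 is an addition not shown on the slide.

```c
#include <stddef.h>

/* One accumulator: every add depends on the previous one,
 * so each add waits out the full floating-point pipeline latency. */
double sum_single(const double *a, size_t size) {
    double sum = 0.0;
    for (size_t i = 0; i < size; i++) {
        sum = a[i] + sum;
    }
    return sum;
}

/* Four independent accumulators: up to four adds can be in flight
 * at once, which hides most of the pipeline latency. */
double sum_four(const double *a, size_t size) {
    double sum0 = 0.0, sum1 = 0.0, sum2 = 0.0, sum3 = 0.0;
    size_t i = 0;
    for (; i + 3 < size; i += 4) {
        sum0 += a[i];
        sum1 += a[i + 1];
        sum2 += a[i + 2];
        sum3 += a[i + 3];
    }
    /* Cleanup for the last 0 to 3 elements. */
    for (; i < size; i++) {
        sum0 += a[i];
    }
    return sum0 + sum1 + sum2 + sum3;
}
```

The second version adds the elements in a different order, so with general floating-point data the two results can differ in the last bits. That reassociation is why a compiler will usually not make this transformation on its own unless told that it is allowed.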

What’s so great about Fortran??

DO I = 1, N
    A(I) = B(I)
ENDDO

CL.8:
    L4A   gr0=b(gr5,4)
    L4A   gr6=b(gr5,8)
    L4A   gr7=b(gr5,12)
    L4AU  gr8,gr5=b(gr5,16)
    ST4A  a(gr4,8)=gr
    ST4A  a(gr4,4)=gr
    ST4A  a(gr4,12)=gr
    ST4U  gr4,a(gr4,16)=gr
    BCT   ctr=CL.8,

for (i=0; i<N; i++) {
    b[i] = a[i];
}

CL.6:
    ST4U  gr4,()int(gr4,4)=gr
    L4AU  gr24,gr3=()int(gr3,4)
    BCT   ctr=CL.6,

Fortran vs C - what’s going on??

- C prevents compiler from unrolling code
  - A feature, not a bug!
  - User may want b[0] and a[1] to be same location
    - tricky way to set a[n] = ... = a[1] = a[0]
  - Most C compilers don’t try to prove non-aliasing
    - a and b were malloc-ed in this example
- Fortran doesn’t allow arrays to be aliased
  - Unless explicit, e.g. via EQUIVALENCE
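The aliasing point can be illustrated in C itself. The sketch below shows the "tricky" overlapping copy, and then the C99 `restrict` qualifier, which lets the programmer give the compiler the same no-aliasing guarantee that Fortran provides by default. The function names are illustrative; `restrict` is not mentioned on the slides and is offered here as one standard way to express the promise.

```c
#include <stddef.h>

/* b and a may overlap, so the compiler must keep the
 * load/store order of the source loop. */
void copy_may_alias(int *b, const int *a, size_t n) {
    for (size_t i = 0; i < n; i++) {
        b[i] = a[i];
    }
}

/* restrict promises that b and a never overlap. The compiler is then
 * free to batch loads ahead of stores, as in the Fortran listing.
 * Calling this with overlapping arrays is undefined behavior. */
void copy_no_alias(int *restrict b, const int *restrict a, size_t n) {
    for (size_t i = 0; i < n; i++) {
        b[i] = a[i];
    }
}
```

Calling `copy_may_alias(v + 1, v, 3)` on `int v[4] = {7, 1, 2, 3}` makes `b[0]` and `a[1]` the same location, so each store feeds the next load and every element ends up as 7. An unrolled version that loaded four values before storing any would give a different answer, which is why the compiler cannot unroll that way without proof of non-aliasing.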

Decreasing MemOp to FLOP Ratio

for (i=1; i<N; i++)
    for (j=1; j<N; j++)
        b[i][j] = 0.25 * (a[i-1][j] + a[i+1][j] +
                          a[i][j-1] + a[i][j+1]);

3 loads, 4 floats, 1 store

for (i=1; i<N-2; i+=3) {
    for (j=1; j<N; j++) {
        b[i+0][j] = ... ;
        b[i+1][j] = ... ;
        b[i+2][j] = ... ;
    }
}

5 loads, 12 floats, 3 stores
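The elided bodies of the unrolled loop can be filled in as follows. This is a sketch under stated assumptions: the grid size `N`, the `grid` typedef, the function names, and the cleanup loop for leftover rows are all hypothetical additions, and the stencil is taken to be the standard four-neighbor average.

```c
#define N 8  /* hypothetical grid size; interior points are 1..N-1 */

typedef double grid[N + 1][N + 1];

/* Straightforward version: one output row per pass over j. */
void stencil_simple(grid b, grid a) {
    for (int i = 1; i < N; i++)
        for (int j = 1; j < N; j++)
            b[i][j] = 0.25 * (a[i-1][j] + a[i+1][j] +
                              a[i][j-1] + a[i][j+1]);
}

/* Outer loop unrolled by 3: rows i, i+1, i+2 are updated together,
 * so a[i][j] and a[i+1][j] and a[i+2][j] are shared between updates. */
void stencil_unrolled3(grid b, grid a) {
    int i;
    for (i = 1; i < N - 2; i += 3) {
        for (int j = 1; j < N; j++) {
            b[i+0][j] = 0.25 * (a[i-1][j] + a[i+1][j] +
                                a[i+0][j-1] + a[i+0][j+1]);
            b[i+1][j] = 0.25 * (a[i+0][j] + a[i+2][j] +
                                a[i+1][j-1] + a[i+1][j+1]);
            b[i+2][j] = 0.25 * (a[i+1][j] + a[i+3][j] +
                                a[i+2][j-1] + a[i+2][j+1]);
        }
    }
    /* Cleanup: rows left over when N-1 is not a multiple of 3. */
    for (; i < N; i++)
        for (int j = 1; j < N; j++)
            b[i][j] = 0.25 * (a[i-1][j] + a[i+1][j] +
                              a[i][j-1] + a[i][j+1]);
}
```

The slide's count of 5 loads per inner iteration assumes that values already loaded in earlier `j` iterations stay in registers, leaving only `a[i-1][j]`, `a[i+3][j]`, and the three `a[..][j+1]` values to fetch. Whether a given compiler achieves that depends on its register allocation and on whether it can rule out `b` aliasing `a`.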

Compilers: Topics Not Covered

- Lambda Calculus
  - Foundation for Programming Languages
- Instruction Scheduling
- Compiling Exceptions

Vulnerable Program?

#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    char buffer[500];
    strcpy(buffer, argv[1]);  /* no length check on argv[1] */
    printf("Safe program?");
    return 0;
}

Shellcode

\x31\xc0\x50\x68\x2\x2\x73\x68\x68\x2\x62\x 69\x6\x89\xe3\x50\x53\x50\x54\x53\xb0\x3\x 0\xcd\x

Buffer Overflows

Impact of Buffer Overflows

- Buffer overflows (including those on the stack, the heap, and also integer overflows) have dominated exploitable programming flaws for years. (http://www.sans.org/top20/)
  - E.g., worms: Blaster, Morris, Slammer, Witty, etc.
- It would be great if programmers just wrote better code and did their own bounds checking. But in practice, even good coders make mistakes.
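As an illustration of what "their own bounds checking" looks like for the vulnerable program shown earlier, here is a minimal sketch of a length-checked copy. The helper name `copy_bounded` and its return convention are hypothetical, chosen for this example only.

```c
#include <stddef.h>
#include <string.h>

/* Copies src into dst only if it fits, including the terminating NUL.
 * Returns 0 on success, or -1 if an argument is invalid or src is
 * too long. On failure dst is left unchanged. */
int copy_bounded(char *dst, size_t dst_size, const char *src) {
    if (dst == NULL || src == NULL || dst_size == 0) {
        return -1;
    }
    size_t len = strlen(src);
    if (len >= dst_size) {
        return -1;  /* would overflow: refuse instead of truncating */
    }
    memcpy(dst, src, len + 1);  /* len + 1 copies the NUL as well */
    return 0;
}
```

In the vulnerable program, `strcpy(buffer, argv[1])` would become a check that `argc > 1` followed by `copy_bounded(buffer, sizeof buffer, argv[1])`, with the program exiting on a nonzero return. The check is easy to write and just as easy to forget, which is the slide's point about relying on programmers alone.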
