Data Flow Analysis in CMSC 631: Understanding Program Analysis - Prof. William Pugh, Papers of Computer Science

An overview of data flow analysis, a technique used in program analysis during compilation. It covers the transformation of source code into abstract syntax trees (ASTs) and control flow graphs (CFGs), the concept of data flow facts, and the classification of data flow analyses as forward or backward. It also discusses the role of lattices in data flow analysis.

Data Flow Analysis
CMSC 631: Program Analysis and Understanding
Spring 2009

Compiler Structure

• Source code parsed to produce AST
• AST transformed to CFG
• Data flow analysis operates on control flow graph (and other intermediate representations)

[Figure: Source Code → Abstract Syntax Tree → Control Flow Graph → Object Code]

Abstract Syntax Tree (AST)

• Programs are written in text
  - I.e., sequences of characters
  - Awkward to work with
• First step: Convert to structured representation
  - Use lexer (like flex) to recognize tokens
    - Sequences of characters that make words in the language
  - Use parser (like bison) to group words structurally
    - And, often, to produce AST

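Abstract Syntax Tree Example

    x := a + b;
    y := a * b;
    while (y > a) {
        a := a + 1;
        x := a + b
    }

[Figure: AST for the program above. The Program root has the children := (x, + (a, b)), ..., and while; the while node has the children > (y, a) and Block; the Block has the children := (a, + (a, 1)) and ...]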

ASTs

• ASTs are abstract
  - They don't contain all information in the program
    - E.g., spacing, comments, brackets, parentheses
  - Any ambiguity has been resolved
    - E.g., a + b + c produces the same AST as (a + b) + c
• For more info, see CMSC 430
  - In this class, we will generally begin at the AST level
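
The claim that parentheses and spacing disappear from the AST can be observed with any real parser. The sketch below uses Python's standard `ast` module as a stand-in for the course's toy language (that substitution, and the helper name `shape`, are assumptions made for illustration).

```python
import ast

def shape(source):
    """Return a textual dump of the expression's AST, without positions."""
    return ast.dump(ast.parse(source, mode="eval"))

# Left-associative grouping is already encoded in the tree, so the explicit
# parentheses and the extra spacing leave no trace.
same_tree = shape("a + b + c") == shape("(a  +  b) + c")

# A different grouping is a different tree.
different_tree = shape("a + b + c") != shape("a + (b + c)")
```

Both `same_tree` and `different_tree` evaluate to `True`: the AST records structure, not the characters used to express it.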

Disadvantages of ASTs

• AST has many similar forms
  - E.g., for, while, repeat...until
  - E.g., if, ?:, switch
• Expressions in AST may be complex, nested
  - (42 * y) + (z > 5 ? 12 * z : z + 20)
• Want simpler representation for analysis
  - ...at least, for dataflow analysis

Control-Flow Graph (CFG)

• A directed graph where
  - Each node represents a statement
  - Edges represent control flow
• Statements may be
  - Assignments x := y op z or x := op z
  - Copy statements x := y
  - Branches goto L or if x relop y goto L
  - etc.

Control-Flow Graph Example

    x := a + b;
    y := a * b;
    while (y > a) {
        a := a + 1;
        x := a + b
    }

[Figure: CFG for the program above, one node per statement: x := a + b, y := a * b, the test y > a, and the loop body a := a + 1 followed by x := a + b, which flows back to the test]
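
One possible encoding of the example CFG is a successor map, from which predecessors can be derived. This is only a sketch: the node numbers, the `"exit"` node, and the names `SUCC` and `predecessors` are illustrative choices, not part of the slides.

```python
# Nodes 1..5 are the statements of the example in program order.
# "exit" is a hypothetical node standing for the point after the loop.
STATEMENTS = {
    1: "x := a + b",
    2: "y := a * b",
    3: "y > a",
    4: "a := a + 1",
    5: "x := a + b",
}
SUCC = {1: [2], 2: [3], 3: [4, "exit"], 4: [5], 5: [3], "exit": []}

def predecessors(succ):
    """Invert a successor map into a predecessor map."""
    pred = {node: [] for node in succ}
    for node, targets in succ.items():
        for target in targets:
            pred[target].append(node)
    return pred

PRED = predecessors(SUCC)
```

The loop test (node 3) is the only node with two predecessors, which makes it the one place in this program where facts from different paths have to be combined.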

Data Flow Analysis

• A framework for proving facts about programs
• Reasons about lots of little facts
• Little or no interaction between facts
  - Works best on properties about how program computes
• Based on all paths through program
  - Including infeasible paths

Available Expressions

• An expression e is available at program point p if
  - e is computed on every path to p, and
  - the value of e has not changed since the last time e was computed on the paths to p
• Optimization
  - If an expression is available, it need not be recomputed
    - (At least, if it's still in a register somewhere)

Data Flow Facts

• Is expression e available?
• Facts:
  - a + b is available
  - a * b is available
  - a + 1 is available

[Figure: CFG of the example with nodes entry, x := a + b, y := a * b, y > a, a := a + 1, x := a + b, exit]

Gen and Kill

• What is the effect of each statement on the set of facts?

Stmt       | Gen   | Kill
x := a + b | a + b |
y := a * b | a * b |
a := a + 1 |       | a + 1, a + b, a * b

[Figure: CFG of the example with nodes entry, x := a + b, y := a * b, y > a, a := a + 1, x := a + b, exit]
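
The Gen and Kill sets can be derived mechanically from each statement. The sketch below assumes expressions are encoded as `(left, op, right)` triples and that the fact universe is just the three expressions of the example; the function name `gen_kill` is hypothetical.

```python
A_PLUS_B = ("a", "+", "b")
A_TIMES_B = ("a", "*", "b")
A_PLUS_1 = ("a", "+", "1")
ALL_EXPRS = {A_PLUS_B, A_TIMES_B, A_PLUS_1}

def gen_kill(target, expr):
    """Gen and Kill for `target := expr` (use None, None for a pure test).

    Kill: every expression that mentions the assigned variable.
    Gen:  the computed expression, unless the assignment itself kills it.
    """
    kill = {e for e in ALL_EXPRS if target in (e[0], e[2])}
    gen = ({expr} & ALL_EXPRS) - kill if expr else set()
    return gen, kill

table = {
    "x := a + b": gen_kill("x", A_PLUS_B),
    "y := a * b": gen_kill("y", A_TIMES_B),
    "a := a + 1": gen_kill("a", A_PLUS_1),
    "y > a": gen_kill(None, None),
}
```

Note that `a := a + 1` generates nothing: it computes `a + 1`, but then overwrites `a`, so the value just computed no longer equals `a + 1`.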

Computing Available Expressions

[Figure: CFG of the example (entry, x := a + b, y := a * b, y > a, a := a + 1, x := a + b, exit) annotated with the set of available expressions at each program point; the sets shown are Ø, {a + b}, and {a + b, a * b}]

Terminology

• A join point is a program point where two branches meet
• Available expressions is a forward must problem
  - Forward = Data flow from in to out
  - Must = At join point, property must hold on all paths that are joined

Data Flow Equations

• Let s be a statement
  - succ(s) = { immediate successor statements of s }
  - pred(s) = { immediate predecessor statements of s }
  - In(s) = program point just before executing s
  - Out(s) = program point just after executing s
• $\mathrm{In}(s) = \bigcap_{s' \in \mathrm{pred}(s)} \mathrm{Out}(s')$
• $\mathrm{Out}(s) = \mathrm{Gen}(s) \cup (\mathrm{In}(s) - \mathrm{Kill}(s))$
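
As a worked instance, consider the loop test y > a in the running example. Its predecessors are y := a * b and the x := a + b at the end of the loop body. Assuming the Gen and Kill sets for available expressions (the test itself has empty Gen and Kill), the equations unfold as follows once the solution has stabilized:

```latex
\begin{align*}
\mathrm{Out}(a := a + 1) &= \emptyset \cup (\mathrm{In}(a := a + 1) - \{a+1,\ a+b,\ a*b\}) = \emptyset \\
\mathrm{Out}(x := a + b)_{\text{loop body}} &= \{a+b\} \cup (\emptyset - \emptyset) = \{a+b\} \\
\mathrm{Out}(y := a * b) &= \{a+b,\ a*b\} \\
\mathrm{In}(y > a) &= \{a+b,\ a*b\} \cap \{a+b\} = \{a+b\} \\
\mathrm{Out}(y > a) &= \emptyset \cup (\{a+b\} - \emptyset) = \{a+b\}
\end{align*}
```

Only a + b survives the join, because a * b is not recomputed on the path around the loop.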

Liveness Analysis

• A variable v is live at program point p if
  - v will be used on some execution path originating from p...
  - before v is overwritten
• Optimization
  - If a variable is not live, no need to keep it in a register
  - If variable is dead at assignment, can eliminate assignment
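
Liveness can be computed on the running example with a simple backward iteration. The sketch below rests on several assumptions that are not in the slides: nodes are numbered 1..5 in program order, the Use/Def sets were written out by hand, and no variable is live at the loop exit.

```python
# Successors within the example; node 3 (the test y > a) also leaves the loop,
# but the exit contributes no live variables under the assumption above.
SUCC = {1: [2], 2: [3], 3: [4], 4: [5], 5: [3]}
USE = {1: {"a", "b"}, 2: {"a", "b"}, 3: {"y", "a"}, 4: {"a"}, 5: {"a", "b"}}
DEF = {1: {"x"}, 2: {"y"}, 3: set(), 4: {"a"}, 5: {"x"}}

def liveness(succ, use, define):
    """Backward may analysis: iterate until In/Out stop changing."""
    live_in = {s: set() for s in succ}
    live_out = {s: set() for s in succ}
    changed = True
    while changed:
        changed = False
        for s in succ:
            # Out(s) is the union of In over the successors of s.
            out = set().union(*(live_in[t] for t in succ[s]))
            # In(s) = Use(s) ∪ (Out(s) - Def(s))
            new_in = use[s] | (out - define[s])
            if out != live_out[s] or new_in != live_in[s]:
                live_out[s], live_in[s] = out, new_in
                changed = True
    return live_in, live_out

LIVE_IN, LIVE_OUT = liveness(SUCC, USE, DEF)
```

Under these assumptions `x` is not in `LIVE_OUT[1]` or `LIVE_OUT[5]`, so both assignments to `x` are dead, which is exactly the situation the second optimization bullet describes.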

Reaching Definitions

• A definition of a variable v is an assignment to v
• A definition of variable v reaches point p if
  - There is no intervening assignment to v
• Also called def-use information
• What kind of problem?
  - Forward or backward? Forward
  - May or must? May

Space of Data Flow Analyses

• Most data flow analyses can be classified this way
  - A few don't fit: bidirectional analysis
• Lots of literature on data flow analysis

         | May                  | Must
Forward  | Reaching definitions | Available expressions
Backward | Live variables       | Very busy expressions

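Data Flow Facts and Lattices

• Typically, data flow facts form a lattice
  - Example: Available expressions

[Figure: lattice of the subsets of {a+b, a*b, a+1}, ordered by inclusion. "top" is {a+b, a*b, a+1}; below it are {a+b, a*b}, {a+b, a+1}, and {a*b, a+1}; below those are {a+b}, {a*b}, and {a+1}; "bottom" is (none)]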

Partial Orders

• A partial order is a pair $(P, \le)$ such that
  - $\le \subseteq P \times P$
  - $\le$ is reflexive: $x \le x$
  - $\le$ is anti-symmetric: $x \le y$ and $y \le x \Rightarrow x = y$
  - $\le$ is transitive: $x \le y$ and $y \le z \Rightarrow x \le z$

Lattices

• A partial order is a lattice if $\sqcap$ and $\sqcup$ are defined on any set:
  - $\sqcap$ is the meet or greatest lower bound operation:
    - $x \sqcap y \le x$ and $x \sqcap y \le y$
    - if $z \le x$ and $z \le y$, then $z \le x \sqcap y$
  - $\sqcup$ is the join or least upper bound operation:
    - $x \le x \sqcup y$ and $y \le x \sqcup y$
    - if $x \le z$ and $y \le z$, then $x \sqcup y \le z$

Lattices (cont'd)

• A finite partial order is a lattice if meet and join exist for every pair of elements
• A lattice has unique elements $\bot$ and $\top$ such that
  - $x \sqcap \bot = \bot$ and $x \sqcup \bot = x$
  - $x \sqcap \top = x$ and $x \sqcup \top = \top$
• In a lattice,
  - $x \le y$ iff $x \sqcap y = x$
  - $x \le y$ iff $x \sqcup y = y$
• A partial order is a complete lattice if meet and join are defined on any set $S \subseteq P$
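
These identities can be checked exhaustively on a small lattice. The sketch below assumes the available-expressions lattice from the earlier example: subsets of three expressions ordered by inclusion, with meet as intersection and join as union. The names `P`, `leq`, `meet`, `join`, and `lattice_laws_hold` are illustrative.

```python
from itertools import combinations

EXPRS = ("a + b", "a * b", "a + 1")
# P is the powerset of the three expressions: 8 elements in total.
P = [frozenset(c) for r in range(len(EXPRS) + 1) for c in combinations(EXPRS, r)]
TOP, BOTTOM = frozenset(EXPRS), frozenset()

def leq(x, y):
    return x.issubset(y)

def meet(x, y):
    return x & y

def join(x, y):
    return x | y

def lattice_laws_hold():
    """Check the top/bottom identities and the order/meet/join equivalences."""
    for x in P:
        if meet(x, BOTTOM) != BOTTOM or join(x, BOTTOM) != x:
            return False
        if meet(x, TOP) != x or join(x, TOP) != TOP:
            return False
        for y in P:
            if leq(x, y) != (meet(x, y) == x):
                return False
            if leq(x, y) != (join(x, y) == y):
                return False
            # meet is a lower bound, join is an upper bound
            if not (leq(meet(x, y), x) and leq(meet(x, y), y)):
                return False
            if not (leq(x, join(x, y)) and leq(y, join(x, y))):
                return False
    return True
```

A check like this is not a proof, but on a finite lattice it covers every pair of elements.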

Forward Must Data Flow Algorithm

    Out(s) := Top for all statements s
    // Slight acceleration: Could set Out(s) := Gen(s) ∪ (Top - Kill(s))
    W := { all statements }    (worklist)
    repeat
        Take s from W
        In(s) := ∩ { Out(s') | s' ∈ pred(s) }
        temp := Gen(s) ∪ (In(s) - Kill(s))
        if (temp != Out(s)) {
            Out(s) := temp
            W := W ∪ succ(s)
        }
    until W = ∅
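
A runnable sketch of this worklist algorithm, applied to available expressions on the running example, might look as follows. Several details are assumptions rather than part of the slides: node 0 models the entry (it kills everything, so nothing is available at the start), the exit node is omitted, and the Gen/Kill sets are written out by hand.

```python
from collections import deque

A_PLUS_B, A_TIMES_B, A_PLUS_1 = "a + b", "a * b", "a + 1"
TOP = frozenset({A_PLUS_B, A_TIMES_B, A_PLUS_1})

# 0: entry, 1: x := a + b, 2: y := a * b, 3: y > a, 4: a := a + 1, 5: x := a + b
SUCC = {0: [1], 1: [2], 2: [3], 3: [4], 4: [5], 5: [3]}
GEN = {0: set(), 1: {A_PLUS_B}, 2: {A_TIMES_B}, 3: set(), 4: set(), 5: {A_PLUS_B}}
KILL = {0: set(TOP), 1: set(), 2: set(), 3: set(), 4: set(TOP), 5: set()}

def available_expressions():
    pred = {s: [] for s in SUCC}
    for s, targets in SUCC.items():
        for t in targets:
            pred[t].append(s)

    out = {s: set(TOP) for s in SUCC}      # Out(s) := Top for all s
    in_ = {s: set() for s in SUCC}
    work = deque(SUCC)                      # W := { all statements }
    while work:
        s = work.popleft()
        # Meet (intersection) over predecessors; the entry node has none.
        in_[s] = set.intersection(*(out[p] for p in pred[s])) if pred[s] else set()
        temp = GEN[s] | (in_[s] - KILL[s])
        if temp != out[s]:
            out[s] = temp
            work.extend(SUCC[s])            # W := W ∪ succ(s)
    return in_, out

IN, OUT = available_expressions()
```

On the first visit to the loop test, `IN[3]` is `{a + b, a * b}` because the back edge still carries Top; it is revised down to `{a + b}` once the loop body has been processed.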

Monotonicity

• A function $f$ on a partial order is monotonic if $x \le y \Rightarrow f(x) \le f(y)$
• Easy to check that operations to compute In and Out are monotonic
  - $\mathrm{In}(s) := \bigcap_{s' \in \mathrm{pred}(s)} \mathrm{Out}(s')$
  - $\mathit{temp} := \mathrm{Gen}(s) \cup (\mathrm{In}(s) - \mathrm{Kill}(s))$, a function $f_s(\mathrm{In}(s))$
• Putting these two together,
  - $\mathit{temp} := f_s(\bigsqcap_{s' \in \mathrm{pred}(s)} \mathrm{Out}(s'))$
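
For a small fact universe, monotonicity of a transfer function can be checked by brute force. The following is a sketch under the assumption that facts are subsets of three expressions ordered by inclusion; `transfer`, `is_monotonic`, and `complement` are hypothetical helper names.

```python
from itertools import combinations

UNIVERSE = frozenset({"a + b", "a * b", "a + 1"})

def powerset(universe):
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]

def transfer(gen, kill):
    """Build f_s(x) = Gen ∪ (x - Kill)."""
    gen, kill = frozenset(gen), frozenset(kill)
    return lambda x: gen | (x - kill)

def is_monotonic(f):
    """Check x ≤ y ⇒ f(x) ≤ f(y) for every pair in the powerset."""
    elems = powerset(UNIVERSE)
    return all(f(x).issubset(f(y)) for x in elems for y in elems if x.issubset(y))

f_s = transfer({"a + b"}, {"a + 1"})

def complement(x):
    # Not a gen/kill function: growing the input shrinks the output.
    return UNIVERSE - x
```

Any gen/kill function passes the check, while `complement` fails it, which is why a function of that shape could not be used as a transfer function in this framework.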

Fixpoints

• We always start with Top
  - Every expression is available, no defns reach this point
  - Most optimistic assumption
  - Strongest possible hypothesis
    - = true of fewest number of states
• Revise as we encounter contradictions
  - Always move down in the lattice (with meet)
• Result: A greatest fixpoint

Lattices (P, ≤), cont'd

• Live variables
  - P = sets of variables
  - $S_1 \sqcap S_2 = S_1 \cup S_2$
  - Top = empty set
• Very busy expressions
  - P = sets of expressions
  - $S_1 \sqcap S_2 = S_1 \cap S_2$
  - Top = set of all expressions

Forward vs. Backward

Forward:

    Out(s) := Top for all s
    W := { all statements }
    repeat
        Take s from W
        temp := f_s(⊓ { Out(s') | s' ∈ pred(s) })
        if (temp != Out(s)) {
            Out(s) := temp
            W := W ∪ succ(s)
        }
    until W = ∅

Backward:

    In(s) := Top for all s
    W := { all statements }
    repeat
        Take s from W
        temp := f_s(⊓ { In(s') | s' ∈ succ(s) })
        if (temp != In(s)) {
            In(s) := temp
            W := W ∪ pred(s)
        }
    until W = ∅

Termination Revisited

• How many times can we apply this step:

      temp := f_s(⊓ { Out(s') | s' ∈ pred(s) })
      if (temp != Out(s)) { ... }

• Claim: Out(s) only shrinks
  - Proof: Out(s) starts out as Top
    - So temp must be ≤ Top after the first step
  - Assume Out(s') shrinks for all predecessors s' of s
  - Then $\bigsqcap_{s' \in \mathrm{pred}(s)} \mathrm{Out}(s')$ shrinks
  - Since $f_s$ is monotonic, $f_s(\bigsqcap_{s' \in \mathrm{pred}(s)} \mathrm{Out}(s'))$ shrinks

Termination Revisited (cont'd)

• A descending chain in a lattice is a sequence
  - $x_0 > x_1 > x_2 > \dots$
• The height of a lattice is the length of the longest descending chain in the lattice
• Then, dataflow must terminate in O(nk) time
  - n = # of statements in program
  - k = height of lattice
  - assumes meet operation takes O(1) time

Least vs. Greatest Fixpoints

• Dataflow tradition: Start with Top, use meet
  - To do this, we need a meet semilattice with top
    - complete meet semilattice = meets defined for any set
    - finite height ensures termination
  - Computes greatest fixpoint
• Denotational semantics tradition: Start with Bottom, use join
  - Computes least fixpoint

Distributive Data Flow Problems

• By monotonicity, we also have $f(x \sqcap y) \le f(x) \sqcap f(y)$
• A function $f$ is distributive if $f(x \sqcap y) = f(x) \sqcap f(y)$
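
The gap between the two conditions can be seen on a small example. The sketch below assumes facts are subsets of three expressions with meet as intersection; `gen_kill_transfer` is a typical gen/kill function, and `nonempty_flag` is a contrived function invented here purely to show something monotonic but not distributive.

```python
from itertools import combinations

UNIVERSE = frozenset({"a + b", "a * b", "a + 1"})

def powerset(universe):
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]

def is_distributive(f):
    """Check f(x ⊓ y) = f(x) ⊓ f(y) for every pair, with ⊓ = intersection."""
    elems = powerset(UNIVERSE)
    return all(f(x & y) == f(x) & f(y) for x in elems for y in elems)

def gen_kill_transfer(x):
    # Gen = {a + b}, Kill = {a + 1}
    return frozenset({"a + b"}) | (x - frozenset({"a + 1"}))

def nonempty_flag(x):
    # Monotonic, but depends on the input as a whole rather than element-wise.
    return frozenset({"a + b"}) if x else frozenset()
```

For `nonempty_flag`, taking `x = {a + b}` and `y = {a * b}` gives `f(x ⊓ y) = ∅` but `f(x) ⊓ f(y) = {a + b}`, so only the inequality from monotonicity holds.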

Benefit of Distributivity

• Joins lose no information

[Figure: CFG in which nodes f and g both flow into node h, which flows into node k]

$k(h(f(\top) \sqcap g(\top))) = k(h(f(\top)) \sqcap h(g(\top))) = k(h(f(\top))) \sqcap k(h(g(\top)))$

Basic Blocks

• A basic block is a sequence of statements s.t.
  - No statement except the last is a branch
  - There are no branches to any statement in the block except the first
• In practical data flow implementations,
  - Compute Gen/Kill for each basic block
    - Compose transfer functions
  - Store only In/Out for each basic block
  - Typical basic block ~5 statements
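
Composing transfer functions for a block amounts to combining Gen/Kill pairs. The sketch below assumes each statement is summarized as a `(gen, kill)` pair of frozensets and uses the loop body of the running example as the block; `compose`, `block_summary`, and `apply_summary` are hypothetical names.

```python
from functools import reduce

ALL = frozenset({"a + b", "a * b", "a + 1"})

def compose(first, second):
    """Summary of running `first` and then `second`.

    Out = G2 ∪ ((G1 ∪ (In - K1)) - K2)
        = (G2 ∪ (G1 - K2)) ∪ (In - (K1 ∪ K2))
    """
    g1, k1 = first
    g2, k2 = second
    return (g2 | (g1 - k2), k1 | k2)

def block_summary(statements):
    """Fold the per-statement (gen, kill) pairs into one pair for the block."""
    identity = (frozenset(), frozenset())
    return reduce(compose, statements, identity)

def apply_summary(summary, facts):
    gen, kill = summary
    return gen | (facts - kill)

# Loop body of the example: a := a + 1; x := a + b
LOOP_BODY = [
    (frozenset(), ALL),                      # a := a + 1 kills every expression
    (frozenset({"a + b"}), frozenset()),     # x := a + b generates a + b
]
BODY_SUMMARY = block_summary(LOOP_BODY)
```

Applying `BODY_SUMMARY` to any incoming set yields `{a + b}`, the same result as applying the two statement-level functions in sequence, but with one function application per block instead of one per statement.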

Order Matters

• Assume forward data flow problem
  - Let G = (V, E) be the CFG
  - Let k be the height of the lattice
• If G acyclic, visit in topological order
  - Visit head before tail of edge
• Running time O(|E|)
  - No matter what size the lattice

Order Matters: Cycles

• If G has cycles, visit in reverse postorder
  - Order from depth-first search
• Let Q = max # back edges on cycle-free path
  - Nesting depth
  - Back edge is from node to ancestor on DFS tree
• Then if $\forall x.\ f(x) \le x$ (sufficient, but not necessary)
  - Running time is $O((Q + 1)|E|)$
    - Note direction of req't depends on top vs. bottom
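
Reverse postorder and the back edges both fall out of a single depth-first search. The sketch below runs on an assumed encoding of the running example's CFG (node 0 is the entry, node 6 the exit); the function name and return shape are illustrative.

```python
# 0: entry, 1: x := a + b, 2: y := a * b, 3: y > a,
# 4: a := a + 1, 5: x := a + b, 6: exit
SUCC = {0: [1], 1: [2], 2: [3], 3: [4, 6], 4: [5], 5: [3], 6: []}

def reverse_postorder(succ, entry):
    """Return (nodes in reverse postorder, back edges found by the DFS)."""
    seen, on_path, post, back_edges = set(), set(), [], []

    def visit(node):
        seen.add(node)
        on_path.add(node)
        for target in succ.get(node, []):
            if target in on_path:
                # Edge to an ancestor on the current DFS path: a back edge.
                back_edges.append((node, target))
            elif target not in seen:
                visit(target)
        on_path.discard(node)
        post.append(node)

    visit(entry)
    return post[::-1], back_edges

ORDER, BACK_EDGES = reverse_postorder(SUCC, 0)
```

Here the only back edge is the one from the end of the loop body to the test, so Q = 1 for this program. The recursive DFS is fine for a sketch; a large CFG would call for an explicit stack.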

Flow-Sensitivity

• Data flow analysis is flow-sensitive
  - The order of statements is taken into account
  - I.e., we keep track of facts per program point
• Alternative: Flow-insensitive analysis
  - Analysis the same regardless of statement order
  - Standard example: types
    - /* x : int */ x := ... /* x : int */

Terminology Review

• Must vs. May
  - (Not always followed in literature)
• Forwards vs. Backwards
• Flow-sensitive vs. Flow-insensitive
• Distributive vs. Non-distributive

Another Approach: Elimination

• Recall in practice, one transfer function per basic block
• Why not generalize this idea beyond a basic block?
  - "Collapse" larger constructs into smaller ones, combining data flow equations
  - Eventually program collapsed into a single node!
  - "Expand out" back to original constructs, rebuilding information

Lattices of Functions

• Let $(P, \le)$ be a lattice
• Let M be the set of monotonic functions on P
• Define $f \le_f g$ if for all x, $f(x) \le g(x)$
• Define the function $f \sqcap g$ as
  - $(f \sqcap g)(x) = f(x) \sqcap g(x)$
• Claim: $(M, \le_f)$ forms a lattice

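Elimination Methods: Conditionals

[Figure: an If node branching to Then and Else nodes, which both flow into node z, is collapsed into a single IfThenElse node flowing into z]

$f_{ite} = (f_{then} \circ f_{if}) \sqcap (f_{else} \circ f_{if})$

$\mathrm{Out}(if) = f_{if}(\mathrm{In}(ite))$
$\mathrm{Out}(then) = (f_{then} \circ f_{if})(\mathrm{In}(ite))$
$\mathrm{Out}(else) = (f_{else} \circ f_{if})(\mathrm{In}(ite))$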
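
The collapse step for a conditional can be sketched with transfer functions represented as plain Python closures. This assumes a must problem (meet is intersection) and uses invented branch contents: the then-branch computes a + b, the else-branch computes both a + b and a * b, and the test itself changes nothing. All helper names are hypothetical.

```python
def transfer(gen, kill):
    """Build f(x) = Gen ∪ (x - Kill) over frozensets of facts."""
    gen, kill = frozenset(gen), frozenset(kill)
    return lambda facts: gen | (frozenset(facts) - kill)

def compose(outer, inner):
    """(outer ∘ inner)(x) = outer(inner(x))"""
    return lambda facts: outer(inner(facts))

def meet(f, g):
    """Pointwise meet of two functions: (f ⊓ g)(x) = f(x) ∩ g(x)."""
    return lambda facts: f(facts) & g(facts)

def collapse_if_then_else(f_if, f_then, f_else):
    """f_ite = (f_then ∘ f_if) ⊓ (f_else ∘ f_if)"""
    return meet(compose(f_then, f_if), compose(f_else, f_if))

f_if = transfer(set(), set())                    # the test has empty Gen and Kill
f_then = transfer({"a + b"}, set())              # hypothetical then-branch
f_else = transfer({"a + b", "a * b"}, set())     # hypothetical else-branch
f_ite = collapse_if_then_else(f_if, f_then, f_else)
```

With nothing available on entry, `f_ite(frozenset())` is `{a + b}`: only the expression computed on both branches survives. The "expand out" phase would then recover Out(then) and Out(else) by applying the two composed functions separately to In(ite).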

Comments

• Can also do backwards elimination
  - Not quite as nice (regions are usually single entry but often not single exit)
• For bit-vector problems, elimination efficient
  - Easy to compose functions, compute meet, etc.
• Elimination originally seemed like it might be faster than iteration
  - Not really the case

Data Flow Analysis and Functions

• What happens at a function call?
  - Lots of proposed solutions in data flow analysis literature
• In practice, only analyze one procedure at a time
• Consequences
  - Call to function kills all data flow facts
  - May be able to improve depending on language, e.g., function call may not affect locals

More Terminology

• An analysis that models only a single function at a time is intraprocedural
• An analysis that takes multiple functions into account is interprocedural
• An analysis that takes the whole program into account is...guess?
• Note: global analysis means "more than one basic block," but still within a function

Data Flow Analysis and The Heap

• Data Flow is good at analyzing local variables
  - But what about values stored in the heap?
  - Not modeled in traditional data flow
• In practice: *x := e
  - Assume all data flow facts killed (!)
  - Or, assume write through x may affect any variable whose address has been taken
• In general, hard to analyze pointers

Data Flow Analysis and Optimization

• Moore's Law: Hardware advances double computing power every 18 months.
• Proebsting's Law: Compiler advances double computing power every 18 years.
  - Not so much bang for the buck!

DF Analysis and Defect Detection

• LCLint - Evans et al. (UVa)
• METAL - Engler et al. (Stanford, now Coverity)
• ESP - Das et al. (MSR)
• FindBugs - Hovemeyer, Pugh (Maryland)
  - For Java. The first three are for C.
• Many other one-shot projects
  - Memory leak detection
  - Security vulnerability checking (tainting, info. leaks)