
Improving Data-flow Analysis with Path Profiles*

Glenn Ammons    James R. Larus†


Department of Computer Sciences

University of Wisconsin-Madison

1210 West Dayton St.

Madison, WI 53706

Abstract

Data-flow analysis computes its solutions over the paths in a control-flow graph. These paths, whether feasible or infeasible, heavily or rarely executed, contribute equally to a solution. However, programs execute only a small fraction of their potential paths and, moreover, programs’ execution time and cost are concentrated in a far smaller subset of hot paths.

This paper describes a new approach to analyzing and optimizing programs, which improves the precision of data-flow analysis along hot paths. Our technique identifies and duplicates hot paths, creating a hot path graph in which these paths are isolated. After flow analysis, the graph is reduced to eliminate unnecessary duplicates of unprofitable paths. In experiments on SPEC95 benchmarks, path qualification identified 2-112 times more non-local constants (weighted dynamically) than the Wegman-Zadek conditional constant algorithm, which translated into 1-7% more dynamic instructions with constant results.

1 Introduction

Data-flow analysis computes its solutions over the paths in a control-flow graph. The well-known meet-over-all-paths formulation produces safe, precise solutions for general data-flow problems. All paths, whether feasible or infeasible, heavily or rarely executed, contribute equally to a solution. This egalitarian approach, unfortunately, is at odds with the realities of program behavior. Even moderately large programs execute only a few tens of thousands of paths (out of a universe of billions of acyclic paths) and, moreover, programs’ execution time and cost are concentrated in a far smaller subset of hot paths [BL96, ABL97].

* This research supported by: NSF NYI Award CCR-9357779, with support from Sun Microsystems and Intel, and NSF Grant MIP-9625558.

† On sabbatical at Microsoft Research.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGPLAN ’98 Montreal, Canada. © 1998 ACM 0-89791-987-4/98/0006...$5.00

This paper presents a new data-flow analysis technique that attempts to compute more precise solutions along the hot paths in a program. Improved analysis along these paths can aid a compiler in optimizing these heavily executed portions of a program. Path-qualified data-flow analysis consists of the following steps:

  1. Identify hot paths by profiling a program. We use a Ball-Larus path profile [BL96] to determine how often acyclic paths in a program execute.

  2. Identify and isolate the hot paths in the program’s control-flow graph (CFG). This step produces a new CFG in which each hot path is duplicated. Since a hot path is separated from other paths, data-flow facts along the path do not merge with facts from other, overlapping paths. Moreover, as programs do not execute many hot paths, this hot-path graph (HPG) is not much larger than the original graph.

  3. Perform data-flow analysis on the HPG. The solutions found by this technique are conservative in the hot path graph, not in the original control-flow graph.

  4. Reduce the graph to preserve only valuable solutions. The HPG duplicates code for paths whose solutions did not improve. Extra code both increases the cost of subsequent compiler analyses and adversely affects a processor’s instruction cache and branch predictor. Reduction uses results from the data-flow analysis and frequencies from the path profile to decide which paths to preserve in the reduced hot-path graph (rHPG).

  5. Translate the original path profile into a path profile for the rHPG, so profiling information is available for subsequent analyses and optimizations. Ball-Larus path profiles are determined by a set of recording edges, which start and end paths. The algorithm that produces an HPG also identifies recording edges in the HPG, which allows interpretation of the original path profile as a path profile of the HPG. The reduction step properly maintains these recording edges.

The technique can be applied to any data-flow problem, although this paper focuses on constant propagation. In experiments on SPEC95 benchmarks, path qualification identified 2-112 times more non-local constants (weighted dynamically) than the Wegman-Zadek conditional constant algorithm, which translated into 1-7% more dynamic instructions with constant results. Moreover, the technique is practical. With the exception of the go benchmark, the hot-path graphs were 3-32% larger and the reduced hot-path graphs were only 1-7% larger than the original CFG. On go, the hot-path graphs were 184% larger and the reduced hot-path graphs 70% larger.

1.1 Qualified Flow Analysis

Our implementation is based on Holley and Rosen’s qualified flow analysis technique [HR81]. A qualified data-flow problem is a conventional data-flow problem together with a deterministic finite automaton, A, whose transitions are labelled by the edges of the control-flow graph, G. A encodes additional information about the program; in path-qualified analysis, it recognizes hot paths. Data-flow analysis answers questions of the form “What can be said about the data-flow value at vertex v?” Qualified data-flow analysis answers questions of the form “What can be said about the data-flow value at vertex v given that A is in state q?”

Holley and Rosen used qualified data-flow analysis to identify infeasible paths and exclude them from analysis. They created an automaton in which infeasible paths ended in a failure state. The best solution at v, given that A is not in the failure state, is the meet over the non-failure states of A. For path qualification, we use the Aho-Corasick algorithm [Aho94] to construct an automaton that recognizes hot paths in a path profile.

Holley and Rosen describe two techniques for solving qualified problems, data-flow tracing and context tupling. This paper uses data-flow tracing, which constructs a new graph $G_A$ whose vertices encode the vertex from G and the state from A. The qualified problem is then solved as a conventional data-flow problem over $G_A$: qualified solutions in G have become true solutions in $G_A$. $G_A$ is, of course, our hot-path graph.

The qualified solution is never lower in the lattice than the unqualified solution. To see why, consider the solution at vertex v of G. If P is the set of all paths from routine entry to v and $l_p$ is the data-flow value from path $p \in P$, then the meet-over-all-paths solution $l_v$ at v is given by

$$l_v = \bigwedge_{p \in P} l_p.$$

Now partition P by the state of A. If Q is the set of states of A, $P_q \subseteq P$ is the set of paths in P that drive A to state $q \in Q$. It is clear that

$$\bigwedge_{p \in P} l_p = \bigwedge_{q \in Q} \bigwedge_{p \in P_q} l_p.$$

Or, omitting the outer meet on the right hand side and converting the equality to an inequality, for all $q \in Q$,

$$\bigwedge_{p \in P} l_p \le \bigwedge_{p \in P_q} l_p.$$

The inequality is not strict, so the qualified solution is not necessarily sharper than the meet-over-all-paths solution. However, when it is sharper, it is doubly beneficial to find this increased precision in heavily executed code.
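The inequality can be checked on a small instance. The sketch below is only an illustration, assuming a flat constant-propagation lattice; the names `TOP`, `BOTTOM`, `meet_all`, `leq`, and the state labels `q_hot` and `q_other` are hypothetical and do not come from the paper. It groups per-path values by automaton state and compares the per-state meets with the meet over all paths.

```python
from functools import reduce

# Flat lattice (an assumption of this sketch): TOP above every constant,
# BOTTOM ("not a constant") below every constant.
TOP, BOTTOM = "TOP", "BOTTOM"

def meet(a, b):
    """Meet of two lattice values."""
    if a == TOP:
        return b
    if b == TOP:
        return a
    return a if a == b else BOTTOM

def meet_all(values):
    """Meet over a collection of values; the empty meet is TOP."""
    return reduce(meet, values, TOP)

def leq(a, b):
    """a <= b in the lattice iff meet(a, b) == a."""
    return meet(a, b) == a

# Hypothetical per-path values l_p at one vertex, grouped by the state q
# that each path drives the automaton to (the sets P_q).
values_by_state = {
    "q_hot": [6, 6],    # paths reaching the hot state agree on 6
    "q_other": [4, 5],  # the remaining paths disagree
}

# Meet over all paths (left-hand side) and per-state meets (right-hand side).
unqualified = meet_all(v for vs in values_by_state.values() for v in vs)
qualified = {q: meet_all(vs) for q, vs in values_by_state.items()}
```

Here the unqualified solution is `BOTTOM`, while the solution qualified by `q_hot` keeps the constant 6; the solution for `q_other` is no sharper than the unqualified one, which is the non-strict case.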

1.2 Contributions

This paper makes four contributions:

• It shows how path profiles can improve the precision of data-flow analysis through guided code duplication.

• It describes how to reduce the hot-path graph, by eliminating paths that prove unnecessary or unprofitable.

• It shows how to preserve path-profiling information through the CFG transformations.

• It applies path qualification to constant propagation and demonstrates a significant improvement over the widely-used Wegman-Zadek technique.

1.3 Outline of the Paper

This paper is structured as follows. Section 2 sketches the theoretical groundwork and formalizes path profiles. Section 3 describes the automaton that recognizes the hot paths in the profile. Section 4 shows how data-flow tracing constructs a new control-flow graph with duplicated hot paths. Section 5 shows how to reduce the traced graph. Section 6 presents the results of our experiments on SPEC95 benchmarks. Section 7 discusses related work.

2 Preliminaries

This section states definitions and theorems used in the rest of this paper.

2.1 Data Flow Problems

We begin with standard definitions of data-flow problems and their solutions.

Definition 1 A monotonic data-flow problem D is a tuple $(L, \wedge, F, G, r, l_r, M)$ where:

• $L$ is a complete semilattice with meet operation $\wedge$.

• $F$ is a set of monotonic functions from $L$ to $L$.

• $G = (V, E)$ is a control-flow graph with entry vertex $r$.

• $l_r \in L$ is the data-flow fact associated with $r$.

• $M : E \to F$ maps the edges of $G$ to functions in $F$.

M can be extended to map every path $p = [e_0, e_1, \ldots, e_k]$ in G to a function $f : L \to L$:

$$f = M(p) = M(e_k) \circ M(e_{k-1}) \circ \cdots \circ M(e_0)$$
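The extension of M to paths is just function composition in path order. The sketch below illustrates it under assumptions of our own: facts are modelled as dictionaries from variable names to known constants, and the helpers `compose_path`, `assign`, and `add` are hypothetical names rather than anything defined in the paper.

```python
from functools import reduce

def compose_path(edge_functions):
    """Return M(p) for edge functions [M(e_0), ..., M(e_k)] given in path order."""
    def path_function(fact):
        # M(e_0) is applied first and M(e_k) last, i.e. M(e_k) o ... o M(e_0).
        return reduce(lambda acc, f: f(acc), edge_functions, fact)
    return path_function

def assign(var, value):
    """Transfer function for 'var = value'."""
    return lambda fact: {**fact, var: value}

def add(target, x, y):
    """Transfer function for 'target = x + y'."""
    def transfer(fact):
        if x in fact and y in fact:
            return {**fact, target: fact[x] + fact[y]}
        # An operand is not a known constant, so neither is the target.
        return {k: v for k, v in fact.items() if k != target}
    return transfer

# A hypothetical three-edge path: a = 2; b = 4; c = a + b.
path = [assign("a", 2), assign("b", 4), add("c", "a", "b")]
result = compose_path(path)({})
```

For the empty path the composed function is the identity, which matches the convention that the fact at the entry vertex is $l_r$ itself.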

The next three definitions come from Holley and Rosen [HR81].

Definition 2 A solution I of D is a map $I : V \to L$ such that, for any path p from r to a vertex v, $I(v) \le (M(p))(l_r)$.

Definition 3 A fixpoint J of D is a map $J : V \to L$ such that $J(r) \le l_r$ and, if e is an edge from vertex u to vertex v, $J(v) \le (M(e))(J(u))$.

Definition 4 A good solution I of D is a solution of D such that, for any fixpoint J, $J(v) \le I(v)$ for all $v \in V$.

Figure 2: A path profile for the example.

(5 times) [Entry, A, B, D, E, F, H] • [B, D, E, G, H] • [B, D, E, F, H, I, Exit]
(25 times) [Entry, A, B, D, E, F, H] • [B, D, E, G, H] • [B, D, E, F, H, I, Exit]

the path profile is shown in Figure 2.

3 Creating the Automaton

This section describes an algorithm to construct a deterministic finite automaton that recognizes hot paths. The algorithm is an application of the Aho-Corasick algorithm for matching keywords in a string [Aho94]. In our case, the keywords are hot Ball-Larus paths. The constructed DFA is used as the qualification automaton for data-flow tracing.

The Aho-Corasick algorithm begins by constructing a retrieval tree (also known as a trie) from a set of keywords. A retrieval tree is a tree with edges labelled by letters from the alphabet, which satisfies two properties. First, each path from the root of the tree to a node corresponds to a prefix of a keyword from the set. Second, every prefix of every keyword has a unique path from the root that is labelled by letters of the prefix. Given a set of keywords, constructing its retrieval tree takes time proportional to the sum of the lengths of the keywords.

In our case, the alphabet is edges in a CFG and keywords are hot paths. Assuming that all paths in Figure 2 are hot, Figure 3 shows the retrieval tree. Our algorithm for constructing the retrieval tree consists of the following steps:

  1. Identify the hot paths. In our experiments, we selected the minimal set of paths that executed a fixed fraction $c_A$ (e.g., 97%) of the dynamic instructions in a training run. Hot paths were selected by considering each path, ordered by the number of instructions executed along the path (length times frequency), and marking paths hot until $c_A$ of the dynamic instructions were covered.

  2. Trim the final recording edge from each hot path. The constructed automaton will recognize these trimmed paths. Trimming paths ensures that the automaton returns to the same state after any recording edge.

  3. Construct the retrieval tree for the set of trimmed hot paths.
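The three steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes that a path's length in edges stands in for its instruction count, and the function names `select_hot_paths` and `build_trie`, the edge labels, and the profile numbers are all hypothetical.

```python
def select_hot_paths(profile, coverage):
    """Pick the heaviest paths until `coverage` of the total weight is reached.

    profile maps each path (a tuple of edge labels) to its execution count.
    Weight is length times frequency, with length measured in edges here.
    """
    weights = {path: len(path) * count for path, count in profile.items()}
    total = sum(weights.values())
    hot, covered = [], 0
    for path in sorted(weights, key=weights.get, reverse=True):
        if covered >= coverage * total:
            break
        hot.append(path)
        covered += weights[path]
    return hot

def build_trie(keywords):
    """Retrieval tree as a list of dicts: trie[state][letter] -> next state."""
    trie = [{}]  # state 0 is the root
    for word in keywords:
        state = 0
        for letter in word:
            if letter not in trie[state]:
                trie.append({})
                trie[state][letter] = len(trie) - 1
            state = trie[state][letter]
    return trie

# Hypothetical profile: each path starts with "•" and ends with a recording edge.
profile = {
    ("•", "a", "b", "r1"): 90,
    ("•", "a", "c", "r1"): 9,
    ("•", "d", "r2"): 1,
}
hot = select_hot_paths(profile, 0.97)
trimmed = [path[:-1] for path in hot]  # step 2: drop the final recording edge
trie = build_trie(trimmed)             # step 3
```

Building the trie touches each letter of each keyword once, which is the linear construction time mentioned in the text.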

Note that only one edge in the retrieval tree is labelled by •. In general, we make this definition:

Definition 9 $q_\bullet$ is the target of the retrieval tree edge labelled by •.

Figure 3: A retrieval tree for the path profile in our example.

In Figure 3, $q_\bullet = q_0$.

In Aho-Corasick, pattern matching steps through an input string while making transitions between vertices of the retrieval tree. At each step, if an edge from the current vertex in the tree is labelled by the next letter in the string, that edge is followed. If a leaf of the tree is reached, a match has been found. If no edge from the current vertex is labelled by the next letter (a), the current vertex (q) is reset by consulting a failure function, h(q, a).

The failure function avoids rescanning the input string by resuming scanning in the retrieval tree state corresponding to the longest keyword prefix that could lead to a match. If this prefix is nonempty, it must consist of a proper suffix of the match that just failed followed by a.

A Ball-Larus path p starts with a • representing a recording edge and ends with another recording edge. No other edges in p are recording edges, by definition. Thus, no paths start with a substring from the middle of another path, so the failure function always resets the automaton. The following theorem shows that the failure function becomes trivial.

Theorem 2 Say $q_u$ is the retrieval tree vertex representing the keyword prefix u. For any Aho-Corasick recognizer produced from a set of trimmed Ball-Larus paths, $h(q_u, a) = q_\epsilon$ if a is not a recording edge and $h(q_u, a) = q_\bullet$ if a is a recording edge.

Proof: Suppose v is the longest proper suffix of u that is also a prefix of some trimmed path in the profile. v must start with a •, which represents a recording edge. But no proper suffix of u contains a recording edge, so $|v| = 0$. If a is not a recording edge, it cannot begin a Ball-Larus path and $h(q_u, a) = q_\epsilon$. If a is a recording edge, then it is equivalent to • and so $h(q_u, a) = q_\bullet$.

Since the failure function is trivial, our implementation only stores retrieval tree edges, which greatly reduces its size. If the automaton is in state q and sees the input a, the next state is found by checking:

• If there exists a retrieval tree edge from q labelled by a, the next state is the target of the edge.

• If a is a recording edge, the next state is $q_\bullet$.

• Otherwise, the next state is $q_\epsilon$.
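The three checks translate directly into a transition function. The sketch below is illustrative only: the trie layout (a list of dictionaries), the state numbers, the edge names, and the helper `run` are assumptions of this example, and the numbering is arbitrary rather than that of Figure 3.

```python
def next_state(trie, q, a, recording_edges, q_eps, q_dot):
    """One step of the hot-path automaton with the trivial failure function."""
    if a in trie[q]:           # a retrieval tree edge from q is labelled by a
        return trie[q][a]
    if a in recording_edges:   # any recording edge may start a new path
        return q_dot
    return q_eps               # otherwise fall back to the start state

# Hypothetical automaton for a single trimmed hot path [•, (A,B), (B,D)].
Q_EPS, Q_DOT = 0, 1
TRIE = [{"•": Q_DOT}, {("A", "B"): 2}, {("B", "D"): 3}, {}]
RECORDING = {("Entry", "A"), ("H", "B")}

def run(edges):
    """Feed a sequence of CFG edges to the automaton; return the states visited."""
    q, visited = Q_EPS, []
    for a in edges:
        q = next_state(TRIE, q, a, RECORDING, Q_EPS, Q_DOT)
        visited.append(q)
    return visited

states = run([("Entry", "A"), ("A", "B"), ("B", "D"), ("D", "E"), ("H", "B")])
```

Because only the retrieval tree edges are stored, the table has one entry per letter of the trimmed hot paths; the two fallback rules need no storage beyond the set of recording edges.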

4 Building the HPG

This section explains how a hot path graph (HPG) is constructed. The algorithm both produces a graph for data flow analysis and also identifies recording edges in that graph, so that path profiling information can be carried over to later stages of compilation.

Section 4.1 applies Holley and Rosen’s data-flow tracing algorithm to the original graph and the path qualification automaton from Section 3. The output of the tracing algorithm is a hot path graph without recording edges: $G_A$ from Definition 6. In this HPG, every path from entry represents both a path in the original CFG and a path in the automaton. Moreover, two paths from entry end at the same vertex iff the corresponding paths in the CFG end at the same vertex and the corresponding paths in the automaton end in the same state. Thus, data-flow solutions over the HPG do not merge values from paths that reach different automaton states.

Section 4.2 explains how our algorithm also identifies the recording edges in the HPG, so that the path profile information can be correctly interpreted in the modified CFG.

Holley and Rosen discuss two qualification methods, one of which is data-flow tracing. The other method is context tupling. Section 4.3 explains why we use data-flow tracing instead of context tupling for path qualification.

4.1 Tracing the HPG

Figure 4 presents Holley and Rosen’s algorithm for data-flow tracing, extended to identify recording edges (discussed in the next subsection). The algorithm is a worklist algorithm that finds all pairs of CFG vertices and automaton states reachable from the entry of the CFG ($r$) and the initial state of the automaton ($q_\epsilon$). The vertices of the HPG are these pairs.

Initially, the worklist holds $(r, q_\epsilon)$. Each iteration of the While loop removes a pair (v, q) from the worklist. The algorithm iterates over each pair (v’, q’) reachable in one step from (v, q). If (v’, q’) is not in the HPG, it is added to the HPG and the worklist. In any case, an edge is added from (v, q) to (v’, q’). The algorithm terminates when the worklist is exhausted, at which point all possible pairs have been added to the HPG.

The constructed HPG fits the definition of $G_A$ in Definition 6. The following theorem, together with Theorem 1, justifies performing data-flow analysis on the HPG.

Theorem 3 When the algorithm in Figure 4 completes,

i) $(v, q) \in V_A$ iff there exists a path p in G from r to v that drives A from its start state $q_\epsilon$ to q.

ii) $((v_0, q_0), (v_1, q_1)) \in E_A$ iff there exists an edge $(v_0, v_1) \in E$ and a transition in A from $q_0$ to $q_1$ on $(v_0, v_1)$.

G = (V, E) is a control-flow graph.
A is a qualification automaton.
Q is the set of states of A.
q_ε is the start state of A.
T is the set of transitions in A.
R ⊆ E is the set of recording edges.
W is a worklist of pairs (v, q), where v ∈ V and q ∈ Q.
G_A = (V_A, E_A) is the hot path graph.
R_A ⊆ E_A is the new set of recording edges.

V_A ← {(r, q_ε)}
E_A ← ∅
R_A ← ∅
W ← {(r, q_ε)}
While W ≠ ∅
    (v, q) ← Take(W)
    Foreach edge (v, v’) ∈ E
        Let q’ be the state with (q, (v, v’), q’) ∈ T (it is unique)
        If (v’, q’) ∉ V_A
            V_A ← V_A ∪ {(v’, q’)}
            Put(W, (v’, q’))
        E_A ← E_A ∪ {((v, q), (v’, q’))}
        If (v, v’) ∈ R
            R_A ← R_A ∪ {((v, q), (v’, q’))}

Figure 4: An algorithm for data-flow tracing. The original Holley-Rosen algorithm has been extended to mark recording edges in the traced graph.
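A runnable sketch of the worklist algorithm in Figure 4 follows. It is an illustration under assumptions of this example: the automaton is passed as a function `delta(q, edge)`, and the small diamond-shaped CFG, the single hot path, the state numbers, and the names `trace`, `TRIE`, and `RECORDING` are hypothetical.

```python
from collections import deque

def trace(edges, entry, start_state, delta, recording_edges):
    """Build the traced graph (V_A, E_A, R_A) from a CFG and an automaton."""
    succs = {}
    for v, v2 in edges:
        succs.setdefault(v, []).append(v2)
    root = (entry, start_state)
    V_A, E_A, R_A = {root}, set(), set()
    worklist = deque([root])
    while worklist:
        v, q = worklist.popleft()
        for v2 in succs.get(v, []):
            q2 = delta(q, (v, v2))              # the unique transition on (v, v2)
            if (v2, q2) not in V_A:
                V_A.add((v2, q2))
                worklist.append((v2, q2))
            E_A.add(((v, q), (v2, q2)))
            if (v, v2) in recording_edges:      # carry recording edges over
                R_A.add(((v, q), (v2, q2)))
    return V_A, E_A, R_A

# Hypothetical automaton: state 0 is the start state, state 1 follows any
# recording edge, and states 2 and 3 track the hot path [•, (A,B), (B,D)].
TRIE = [{"•": 1}, {("A", "B"): 2}, {("B", "D"): 3}, {}]
RECORDING = {("Entry", "A"), ("D", "Exit")}

def delta(q, edge):
    if edge in TRIE[q]:
        return TRIE[q][edge]
    return 1 if edge in RECORDING else 0

CFG = [("Entry", "A"), ("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "Exit")]
V_A, E_A, R_A = trace(CFG, "Entry", 0, delta, RECORDING)
```

In this example the join vertex D is split into `("D", 3)` for the hot path and `("D", 0)` for the other path, so facts arriving through B no longer merge with facts arriving through C.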

Proof: The “only-if” direction of both requirements is obvious. The “if” direction follows by induction on the length of the paths p. For paths of length 0, both requirements are trivially true. If i) holds for all paths up to some length n and ii) holds for all edges reachable along such paths, then all final nodes of such paths must have been added to the worklist at some point in the algorithm. After these nodes are processed, the requirements hold for all paths up to length n + 1.

Figure 5 shows the example after data-flow tracing. The automaton is in state $q_\bullet$ at shaded vertices and state $q_\epsilon$ at vertices filled with diagonal lines. Only these vertices are targeted by multiple edges, as $q_\epsilon$ and $q_\bullet$ are the only states in the automaton reached by multiple transitions. The original graph had no constant results other than simple assignments, but the graph in Figure 5 has several new constant results: a + b is always 6 at H14, 5 at H12 and H15, and 4 at H13; i++ is 1 at H14 and H15; and n is always 1 at I17.

Unfortunately, although the original flow graph in Figure 1 is reducible, the HPG in Figure 5 is not. For example, the edge (Hε, B0) is a retreating edge but not a backedge in a natural loop, since B0 does not dominate Hε. Because of this problem, tracing should only be used with data-flow solvers that can handle irreducible graphs.

4.2 Identifying Recording Edges in the HPG

The algorithm in Figure 4 makes an HPG edge $((v_0, q_0), (v_1, q_1))$ a recording edge iff $(v_0, v_1)$ is a recording edge in the original graph. The next two lemmas show

Figure 7: The distribution of dynamic executions of constant instructions by basic block in selected SPEC95 benchmarks. (Axis label: Number of blocks.)

In this paper, we describe the reduction algorithm in terms of constant propagation. However, the algorithm is not restricted to constant propagation. All that is necessary is a way to assign a benefit to duplicating a vertex in the HPG. The reduction algorithm is a heuristic algorithm with the following steps:

  1. Identify the hot vertices. First, the vertices are ordered by the number of dynamic constants they execute, as computed from the path profile. In our experiments, we chose a fixed fraction $c_R$ of the (nonlocal) dynamic constants as a goal. Vertices are added to the set of hot vertices until $c_R$ is reached. In our running example, H12 weighs 30, H13 weighs 100, H14 weighs 140, H15 weighs 60, and I17 weighs 1. All the other vertices have weight 0. For the sake of the example, suppose $c_R$ is chosen such that H13 and H14 are the only hot vertices.
  2. For each vertex v in the original graph, partition the vertices (v, q) in the HPG into sets of vertices that are compatible. This is the heuristic step of the algorithm. Call the partition $\Pi$. At this stage, two vertices are compatible if neither vertex is hot or, if one or both is hot, lowering both solutions to the meet of their lattice values does not destroy any constants in a hot vertex.

Compatibility is not an equivalence relation (it is not transitive), so $\Pi$ cannot be found by looking for equivalence classes. Instead, $\Pi$ is formed greedily by “throwing in” vertices one at a time. As each vertex (v, q) is thrown in, it merges with the first $S \in \Pi$ for which adding (v, q) to S does not destroy constants in a hot vertex. If there is no such S, (v, q) starts a new set. Our implementation tries to keep hot vertices together by considering the vertices in descending order by weight.

In the example, since H13 and H14 are the only hot vertices, $\Pi$ is

{Entryε}, {A0}, {B0, B1}, {Cε, C3}, {D2, D4}, {Eε, E5, E6, E7}, {Fε, F8, F10, F11}, {Gε, G9}, {Hε, H12, H15}, {H13}, {H14}, {Iε, I16, I17}, {Exit0}

  3. Use the standard DFA minimization algorithm [Gri73] to produce a partition $\Pi'$ which respects the data-flow solutions. The complexity of this algorithm is $O(n \log n)$. Why is this algorithm applicable? The HPG can be thought of as a finite automaton with edges labelled by the edges of the original graph. The elements of $\Pi$ can be thought of as equivalence classes of final states of an automaton that recognizes several different kinds of tokens. The only way to lower the solution over a set $S \in \Pi$ is to cause some new path p from entry to reach a vertex in S. Viewing p as a string, that would mean that p was not recognized as a token of type S before minimization but is recognized as such a token after minimization. This cannot happen. In our example, the minimized partition $\Pi'$ is

{Entryε}, {A0}, {B0}, {B1}, {Cε}, {C3}, {D2}, {D4}, {Eε, E7}, {E5}, {E6}, {Fε, F8, F11}, {F10}, {Gε}, {G9}, {Hε, H12, H15}, {H13}, {H14}, {Iε, I16, I17}, {Exit0}

  4. Replace the vertices in each set in $\Pi'$ with a representative and produce a new set of recording edges. If $S_0, S_1 \in \Pi'$ have representatives $s_0$ and $s_1$, respectively, an edge $(s_0, s_1)$ exists iff an edge $((v_0, q_0), (v_1, q_1))$ exists in the HPG, where $(v_0, q_0) \in S_0$ and $(v_1, q_1) \in S_1$. If $((v_0, q_0), (v_1, q_1))$ is a recording edge, $(s_0, s_1)$ is a recording edge. This is well-defined: for $(v_0, q_0), (v_0, q_0') \in S_0$ and $(v_1, q_1), (v_1, q_1') \in S_1$, $((v_0, q_0), (v_1, q_1))$ is a recording edge iff $(v_0, v_1)$ is a recording edge in the original graph and the same for $((v_0, q_0'), (v_1, q_1'))$, so $((v_0, q_0), (v_1, q_1))$ is a recording edge iff $((v_0, q_0'), (v_1, q_1'))$ is a recording edge.

Figure 8 shows the reduced hot path graph for our running example.
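The greedy formation of the partition in step 2 can be sketched as follows for the copies of vertex H in the running example. This is a simplified illustration, not the paper's implementation: each vertex's solution is modelled as a dictionary of the expressions known to be constant there, "destroying a constant" means losing any entry of a hot vertex, and the names `meet`, `greedy_partition`, and `facts` are hypothetical. The weights and constant values are the ones quoted in the text.

```python
def meet(f, g):
    """Keep only the constants on which both solutions agree."""
    return {k: v for k, v in f.items() if k in g and g[k] == v}

def greedy_partition(vertices, facts, weight, hot):
    """Throw vertices in by descending weight; join the first compatible set."""
    partition = []  # list of (members, meet of the members' solutions)
    for x in sorted(vertices, key=lambda x: weight.get(x, 0), reverse=True):
        for i, (members, current) in enumerate(partition):
            merged = meet(current, facts[x])
            candidates = members + [x]
            # Merging is allowed only if every hot member keeps all its constants.
            if all(meet(facts[h], merged) == facts[h] for h in candidates if h in hot):
                partition[i] = (candidates, merged)
                break
        else:
            partition.append(([x], facts[x]))
    return [members for members, _ in partition]

# Copies of H in the traced graph, with the constants reported for Figure 5.
facts = {
    "Hε": {},
    "H12": {"a+b": 5},
    "H13": {"a+b": 4},
    "H14": {"a+b": 6, "i++": 1},
    "H15": {"a+b": 5, "i++": 1},
}
weight = {"H12": 30, "H13": 100, "H14": 140, "H15": 60}
hot = {"H13", "H14"}
sets = greedy_partition(list(facts), facts, weight, hot)
```

Under these assumptions the result groups Hε, H12, and H15 together and leaves H13 and H14 alone, which agrees with the sets for H listed in the example partition.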

6 Experimental Results

This section presents measurements of the benefits and costs of using path-qualified flow analysis for constant propagation. We implemented the analysis as two new passes in the SUIF compiler [WFW+]. The first pass, PP, instrumented a C program for path profiling. The other pass, PW, used a path profile to perform path-qualified constant propagation.

The first step was to produce a path profile for each routine in the program. In this stage, SUIF compiled a C program into its low-SUIF intermediate form. The PP pass instrumented this intermediate code for path profiling. We did not run SUIF’s optimization passes. The SUIF-to-C converter transformed PP’s output into C code, which was compiled by GCC into an instrumented program. When run, this program produced a path profile.

The next step was to optimize programs. The program was again compiled by SUIF. This time, the SUIF code was fed to PW together with the previously obtained path profile. PW used the path profile to construct a hot path graph, discover constants, produce a reduced hot path graph, and finally generate optimized code. The output of PW was converted to C code, which was compiled by GCC (-O2) into an optimized executable.

As SUIF did not directly generate assembly or machine code, our evaluations are in terms of the SUIF intermediate code. In this paper, by “instruction” we always mean SUIF instructions, not machine instructions.

The constant propagator in PW uses Wegman and Zadek’s Conditional Constant algorithm [WZ91]. This algorithm is a worklist algorithm that symbolically executes a routine, starting at its entry node and propagating values only across the legs of branches that can execute, given the current assignment of values to variables. Our implementation is conservative, as it does not track pointers or constants manipulated through pointers or structures, assumes that calls and assignments through pointers write to all aliased variables, and initially sets all variables to $\bot$. Since we ran the constant propagator immediately after SUIF’s front end, the constant propagator saw code that was very close to the original C.

We ran PP and PW on seven of the C SPEC95 benchmarks on a Sun UltraSPARC SMP. In all cases, we used an input data set from the SPEC train data to produce the path profile that drove the flow analysis. A different and larger input from the ref data set produced a path profile used to evaluate the effectiveness of the constant propagator. The path profile of the reference input did not affect the optimization; it was only used to compute the dynamic number of constants discovered by the propagator.

Path-qualified analysis becomes more expensive as the number of hot paths increases. On the other hand, considering more paths can improve a solution, as it increases the portion of the program’s execution covered by an analyzed path. To quantify this tradeoff, we ran the path-qualified analysis several times, varying path coverage (the $c_A$ parameter in Section 3). That is, the analysis was first run on the minimum set of paths that covered three quarters of the program’s execution, then on paths that covered seven eighths of the execution, and so forth.

The other parameter in our analysis is $c_R$, the benefit cutoff for the graph reduction algorithm. In the experiments, we set $c_R$ to 0.95, so reduction preserved approximately 95% of the nontrivial constants discovered by constant propagation. This value was arrived at empirically.

Table 1 lists basic information about the benchmarks. Most of the analysis time for perl was spent in two huge routines, yylex and eval, for which the non-linear running time of constant propagation became a problem.

| Program | Nodes | Paths | Hot Paths | Compile Time | Anal. Time |
|---------|-------|-------|-----------|--------------|------------|
| vortex | 21190 | 1729 | 152 | 1042 | 163 |

Table 1: General information about the benchmarks. Nodes is the total number of CFG nodes in the original program. Paths is the number of Ball-Larus paths executed in the training run. Hot Paths is the number of paths needed to cover 97% of a training run’s dynamic instructions. Compile Time is the total compile time (seconds) without constant propagation. Anal. Time is the total time (seconds) required for constant propagation with $c_A = 0$.

Figure 8: The control-flow graph after reduction. State numbers have been dropped from all merged vertices.

6.1 Benefit of Path Qualification

Figure 9 shows that the number of executed instructions with statically constant results increased as the hot path coverage increased. At full coverage (CA = 1), the improvement ranged from 7% for m88ksim and vortex to 0.6% for perl. In all benchmarks, most of the benefit of path qualification was attained before full coverage was reached, typically somewhere above 90% coverage. ijpeg attained most of its benefit at CA = 0.75 (the lowest non-zero value tested), but all benchmarks saw virtually all of their benefit by CA = 0.97. In two cases, the improvement degraded slightly at high coverage, because of heuristics in the reduction algorithm. These results confirm earlier path profiling

Variable: These instructions are found to be constant by the qualified analysis, but have different values at different sites in the reduced graph. For example, in Figure 8, the value of a + b is 6 at H14 and is 4 at H13. Only duplication reveals these constants; a meet-over-all-paths solution will not find them.

Unknowable: Instructions in this category either are not constant or cannot be identified as constant because of other limitations of the analyses. Our analyses do not track pointers, values stored in memory, or the results of calls. Therefore, instructions that consume these values will never be found constant. We estimated this set by counting the number of values produced within a basic block, yet found equal to ⊥.
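The Variable category can be illustrated with a small sketch of the constant-propagation lattice. The names `TOP`, `BOT`, `meet`, and `add`, and the individual values of a and b, are assumptions made for illustration; only the sums 6 and 4 echo the Figure 8 example. The sketch contrasts one shared copy of a block, whose inputs are merged, with one copy per hot path.

```python
TOP = "TOP"   # no value seen yet
BOT = "BOT"   # not a constant (the role of ⊥ in the text)

def meet(x, y):
    """Combine the values reaching a join from two predecessors."""
    if x == TOP:
        return y
    if y == TOP:
        return x
    return x if x == y else BOT

def add(x, y):
    """Abstract addition over the constant lattice."""
    if x == BOT or y == BOT:
        return BOT
    if x == TOP or y == TOP:
        return TOP
    return x + y

# Hypothetical values on two hot paths that reach the same block.
path_1 = {"a": 2, "b": 4}
path_2 = {"a": 1, "b": 3}

# One shared copy of the block: the incoming values are merged first.
merged = {v: meet(path_1[v], path_2[v]) for v in path_1}
shared = add(merged["a"], merged["b"])

# One copy of the block per path: each copy sees a single set of values.
copy_1 = add(path_1["a"], path_1["b"])
copy_2 = add(path_2["a"], path_2["b"])

print(shared, copy_1, copy_2)  # BOT 6 4
```

In the shared block, a + b is not constant, while each duplicated copy yields its own constant. That is the sense in which the instruction is constant at every site but with a different value at each.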

Figure 10 divides instructions (dynamically weighted) into these categories. Most instructions in each benchmark fall in the Unknowable or Local categories. Path qualification does not affect these categories.

The other part of Figure 10 focuses on the instructions targeted by constant propagation algorithms. Our technique found many (2-112) times more knowable and nonlocally constant instructions. Interestingly, most instructions found constant by qualified analysis were neither Identical nor Variable. These instructions had one constant value at one or more sites and were also unknown at one or more sites. The exceptions are vortex and go, both of which contained a significant, but small, number of Variable constants. Other techniques, which do not duplicate paths, will not find these constants.

Although the direct improvement from our technique is large, the instructions it finds constant still make up a small percentage of all dynamic instructions. This further explains why we did not see speedups for most of the benchmarks.

In the above discussion, we assumed that the MOP is not attainable for constant propagation. This is true for the non-distributive Wegman-Zadeck formulation. Recently, Bodík and Anik published a distributive formulation of constant propagation [BA98]. It would be interesting to compare path-qualified analysis against this formulation.
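The non-distributivity mentioned above can be seen in a standard textbook example, sketched here under assumed names (`meet_env`, `transfer`) and made-up values; it is not drawn from the benchmarks. Two paths assign different constants to a and b, yet a + b is the same on both. Meeting the per-path results (the MOP view) keeps the constant, whereas merging the inputs before applying the transfer function loses it.

```python
TOP, BOT = "TOP", "BOT"   # undetermined / not a constant

def meet(x, y):
    """Lattice meet of two constant-propagation values."""
    if x == TOP:
        return y
    if y == TOP:
        return x
    return x if x == y else BOT

def meet_env(e1, e2):
    """Pointwise meet of two environments over the same variables."""
    return {v: meet(e1[v], e2[v]) for v in e1}

def transfer(env):
    """Transfer function of the statement c = a + b."""
    a, b = env["a"], env["b"]
    if a == BOT or b == BOT:
        c = BOT
    elif a == TOP or b == TOP:
        c = TOP
    else:
        c = a + b
    return {**env, "c": c}

# Made-up environments at the end of two paths into the same block.
path_1 = {"a": 1, "b": 2, "c": TOP}
path_2 = {"a": 2, "b": 1, "c": TOP}

mop = meet_env(transfer(path_1), transfer(path_2))   # transfer per path, then meet
merged = transfer(meet_env(path_1, path_2))          # meet first, then transfer

print(mop["c"], merged["c"])  # 3 BOT
```

Because the two orders disagree, an iterative solver that merges at joins can be strictly less precise than the meet over all paths for this problem.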

6.3 Cost of Path Qualification

This section examines the cost of path qualification.

6.3.1 Cost of Duplication

Figure 11 shows that CFG size only increased significantly for go and that the reduction algorithm successfully controlled the increase in CFG size. The cost of data-flow analysis is proportional to the CFG’s size before reduction. For go, the maximum increase was 722%, and for the other programs the maximum increase was 80%. However, Figure 9 showed that 100% coverage offers little benefit. 97% coverage achieves almost all of the benefit, and limits CFG growth to 184% for go and 32% for the other programs. The CFG’s size after reduction is an indirect measure of the spatial locality of the constants found. Our experiments show that this locality is high: with CR = 0.95, only go grew by more than 10% at any level of coverage. go grew by 77% at full coverage, but, again, full coverage is unnecessary: at CA = 0.97, its increase was 70%. The cost of subsequent

Figure 13: A Venn diagram classifying a program’s dynamic instructions.

analysis and the running time of the program may degrade as the CFG grows, but these increases seem manageable. Why was go exceptional? Table 1 shows that go executed many more paths than other programs and also required more paths to reach high coverage levels. Further experiments are necessary to see whether go’s distribution is atypical or not.

6.3.2 Analysis Time

Path-qualified data-flow analysis increases analysis time, both by adding three new steps (building the qualification automaton, tracing, and reduction) and by running the data-flow solver on larger graphs. Figure 12 shows the relative increase in analysis time as CA is increased. Once again, go was exceptional. For the other benchmarks, the increase was less than 61% at almost full coverage. Figure 9 shows that most of the benefit is gained before full coverage, so these increases are reasonable. For go, analysis time increased sixfold at CA = 0.97. The observed analysis time seems to grow a bit faster than linearly with the size of the hot path graph.

7 Related Work

Feasible path analysis attempts to identify and eliminate infeasible paths. Holley and Rosen introduced qualified data-flow analysis to separate known infeasible paths from the remaining paths, some of which might be feasible [HR81]. Goldberg et al. applied theorem proving techniques to identify infeasible paths in testing a program’s path coverage [GWZ94]. Bodík et al. used a weaker (but less expensive) decision technique to determine if all paths between a definition and use were infeasible, and therefore the def-use pair actually did not exist [BGS97b]. Our work differs from these, as we focus on directly improving the precision of

Figure 10: Fraction of dynamic instructions that fall into categories in Figure 13, plotted per benchmark: (a) Local and unknowable; (b) Other categories. The qualified analysis was done at full coverage (CA = 1).

program analysis along a subset of important paths, rather than improving analysis everywhere by eliminating spurious paths. However, the two techniques are certainly complementary, as our technique would work well in a CFG from which infeasible paths were eliminated.

Paths have long been used in program analysis and optimization. Fisher’s trace scheduling technique heavily optimized the hot paths (called traces) in a CFG [Fis81]. Trace scheduling did not duplicate paths; instead, it introduced fixup code along control flow edges into or out of the middle of a trace. More recently, Hwu et al. eliminated this fixup code by duplicating paths to form superblocks, which are collections of traces without control flow into the middle of a trace [mWHMC+93]. Our approach differs from both techniques. First, it is a technique for improving program analysis, not a technique for optimization and instruction scheduling. Second, although it duplicates paths, like superblocks, its duplication is guided by path profiles. Finally, both scheduling techniques attempted to maximize the size of traces. This work evaluates the improvement from duplication, and eliminates duplicated blocks that provide little or no improvement.

Mueller and Whalley used an ad-hoc framework and code duplication to eliminate certain partially redundant branches [MW95]. Mueller and Whalley’s code duplication algorithm can be seen as a qualification algorithm in which states in the qualification automaton encode information about the direction of the partially redundant branches. Bodík et al. used a limited form of interprocedural analysis to detect redundant branches along interprocedural paths [BGS97a]. This work differs by incorporating paths into a more precise and general framework, by using paths to derive more precise data-flow analyses, and by using path frequencies to overcome the costs of exploiting increased precision (code duplication).

Bodík et al. presented an algorithm for complete partial redundancy elimination using both code motion and code duplication [BGS98]. Their technique also used profiles (either edge or path) to drive code duplication. Our paper is not directly comparable with their paper, as their paper used duplication to carry out an optimization while our paper uses duplication to improve analysis. However, there is a difference in philosophy between the two papers. They first analyzed the original control flow graph to identify vertices for which duplication would enable better code motion. Using a profile, their algorithm decides which of these candidates should be duplicated. Our work takes the other tack: a profile guides an initial round of duplication. Analysis of the duplicated flow graph, together with the profile, identifies blocks that should not have been duplicated. By contrast, their approach starts and performs analysis over a smaller graph. Our approach, however, can find solutions not found by a meet-over-all-paths analysis.

Ramalingam combined data-flow analysis with program frequency information by associating probabilities with data-flow values and developing a data-flow framework for combining these pairs of values [Ram96]. Our goal differs. Instead of incorporating frequencies into the meet-over-all-paths framework, we use frequency information to improve analysis precision in heavily executed code.

8 Conclusion

This paper describes a new approach to analyzing and optimizing programs. Our technique starts with a path profile that identifies the hot paths that incur most of the program’s cost. This information provides the basis for a hot path graph, in which hot paths are isolated in order to compute data-flow values more precisely. After analysis, the hot path graph is reduced to eliminate unnecessary or unprofitable paths. We applied this technique to constant propagation

Figure 12: Time required for qualified flow analysis versus path coverage (CA). The baseline is the time required at CA = 0. Panel (b) shows the other benchmarks; the horizontal axis is path coverage.

[Aho94] Alfred V. Aho. Algorithms for finding patterns in strings. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, volume A, chapter 5, pages 255-300. MIT Press, 1994.

[BA98] Rastislav Bodík and Sadun Anik. Path-sensitive value-flow analysis. In Proceedings of the SIGPLAN ’98 Symposium on Principles of Programming Languages (POPL), January 1998.

[BGS97a] Rastislav Bodík, Rajiv Gupta, and Mary Lou Soffa. Interprocedural conditional branch elimination. In Proceedings of the SIGPLAN ’97 Conference on Programming Language Design and Implementation (PLDI), pages 146-158, June 1997.

[BGS97b] Rastislav Bodík, Rajiv Gupta, and Mary Lou Soffa. Refining data flow information using infeasible paths. In Fifth ACM SIGSOFT Symposium on Foundations of Software Engineering and Sixth European Software Engineering Conference, September 1997.

[BGS98] Rastislav Bodík, Rajiv Gupta, and Mary Lou Soffa. Complete removal of redundant computations. In Proceedings of the SIGPLAN ’98 Conference on Programming Language Design and Implementation (PLDI), June 1998. To appear.

[BL96] T. Ball and J. R. Larus. Efficient path profiling. In Proceedings of MICRO 96, pages 46-57, December

[Fis81] Joseph A. Fisher. Trace scheduling: A technique for global microcode compaction. IEEE Transactions on Computers, C-30(7):478-490, July 1981.

[Gri73] David Gries. Describing an algorithm by Hopcroft. Acta Informatica, 2:97-109, 1973.

[GWZ94] Allen Goldberg, T. C. Wang, and David Zimmerman. Applications of feasible path analysis to program testing. In International Symposium on Software Testing and Analysis. ACM SIGSOFT, August

[HR81] L. Howard Holley and Barry K. Rosen. Qualified data flow problems. IEEE Transactions on Software Engineering, SE-7(1):60-78, January 1981.

[MW95] Frank Mueller and David B. Whalley. Avoiding conditional branches by code replication. In Proceedings of the SIGPLAN ’95 Conference on Programming Language Design and Implementation (PLDI), pages 56-66, June 1995.

[mWHMC+93] Wen-mei W. Hwu, Scott A. Mahlke, William Y. Chen, Pohua P. Chang, Nancy J. Warter, Roger A. Bringmann, Roland G. Ouellette, Richard E. Hank, Tokuzo Kiyohara, Grant E. Haab, John G. Holm, and Daniel M. Lavery. The superblock: An effective technique for VLIW and superscalar compilation. The Journal of Supercomputing, 7(1-2):229-248, May 1993.

[Ram96] G. Ramalingam. Data flow frequency analysis. In Proceedings of the SIGPLAN ’96 Conference on Programming Language Design and Implementation, pages 267-277, May 1996.

[WFW+] Robert P. Wilson, Robert S. French, Christopher S. Wilson, Saman P. Amarasinghe, Jennifer M. Anderson, Steve W. K. Tjiang, Shih-Wei Liao, Chau-Wen Tseng, Mary W. Hall, Monica S. Lam, and John L. Hennessy. An overview of the SUIF compiler system. Published on the World Wide Web at http://suif.stanford.edu/suif/suifl/suif-overview/suif.html.

[WZ91] Mark N. Wegman and F. Kenneth Zadeck. Constant propagation with conditional branches. ACM Transactions on Programming Languages and Systems, 13(2):181-210, April 1991.