


















This paper presents a novel method for solving flow problems on wide-area networks that are managed with software-defined traffic engineering schemes. The method solves a simpler problem on a contraction of the network and a set of sub-problems in parallel on disjoint clusters within the network. The results show that, in the median case, the method nearly matches the solution quality of currently deployed solutions but is 8× faster and requires 6× fewer FIB entries. The paper also discusses the value-add from using a faster solver to track changing demands and react to faults.



















Abstract– Many enterprises today manage traffic on their wide-area networks using software-defined traffic engineering schemes, which scale poorly with network size; the solver runtimes and number of forwarding entries needed at switches increase to untenable levels. We describe a novel method, which, instead of solving a multi-commodity flow problem on the network, solves (1) a simpler problem on a contraction of the network, and (2) a set of sub-problems in parallel on disjoint clusters within the network. Our results on the topology and demands from a large enterprise, as well as on publicly available topologies, show that, in the median case, our method nearly matches the solution quality of currently deployed solutions, but is 8× faster and requires 6× fewer FIB entries. We also show the value-add from using a faster solver to track changing demands and to react to faults.
Wide-area networks (WANs), which connect locations across the globe with high-capacity optical fiber, are an expensive resource [7, 35, 36, 38]. Hence, enterprises seek to carefully manage the traffic on their WANs to offer low latency and jitter for customer-facing applications [28, 62, 69] and fast response times for bulk data transfers [46, 56]. The state-of-the-art approach used in several enterprises today [35, 36, 38] is to compute optimal routing schemes for the current demand by solving global multi-commodity flow problems [7, 35, 36, 38]; the global flow problems are re-solved periodically, since demands may change or links may fail, and the computed routes are encoded into switch forwarding tables using software-defined networking techniques [7].
As network sizes grow, solving multi-commodity flow problems on the entire network becomes practically intractable. As noted in [36], the “algorithm run time increased super-linearly with the site count,” which led to “extended periods of traffic blackholing during data plane failures, ultimately violating our availability targets,” as well as “scaling pressure on limited space in switch forwarding tables.” This problem is unlikely to go away: anecdotal reports indicate that WAN footprints today are already over 10× larger than the few tens of sites that were considered in prior work [35, 36], since enterprises have built more sites to move closer to users.
Figure 1: NCFlow’s workflow. [Diagram labels: Network; Demand History; Contract network (occasionally); Clusters; Paths; Demands; Allocate flow on contracted network (periodically; e.g., every few min); Flow Allocation.]
Figure 2: The original network on the left is divided into clusters, shown with different background colors. The contracted network is on the right.
In this paper, we seek to retain the benefits of global traffic management for large WAN networks without requiring excessively many forwarding entries at switches or prohibitively long solver runtimes. Also, by using a faster solver, WAN operators can reduce loss when faults occur and carry more traffic on the network by tracking demand changes.
Our solution is motivated by the observation that WAN topologies and demands are concentrated: the topology typically has well-connected portions separated by a few, lower-capacity edges, and more demand is between nearby datacenters. This is likely due to multiple operational considerations: (1) submarine cables have become shared choke points for connectivity between continents (see Figure 3), (2) the connectivity over land follows the road or rail networks along which fiber is typically laid out, and (3) enterprises build datacenters close to users, then steer traffic to nearby datacenters [12, 62, 69]. Therefore, more capacity and demand are available between nearby nodes; an analysis of data from a large enterprise WAN in §2 supports this observation.
USENIX Association 18th USENIX Symposium on Networked Systems Design and Implementation 175
Figure 3: Submarine cables serve as choke points in WAN topologies; figure is excerpted from [63].
We leverage this concentration of capacity and demand to decompose the global flow problem into several smaller problems, many of which can be solved in parallel. As shown in Figure 2, we divide the network into multiple connected components, which we refer to as clusters. We then solve modified flow problems on each cluster, as well as on the contracted network, where nodes are clusters and edges connect clusters that have connected nodes. Prior work [4, 9, 15] notes that Google and other map providers use different contractions to compute shortest paths on road network graphs. Our goal is to closely match the multi-commodity max flow solution in quality (i.e., carry nearly as much total flow), while reducing the solver runtime and number of required forwarding entries. We discuss related work in §7; to our knowledge, we are the first to demonstrate a practical technique for multi-commodity flow problems on large WAN topologies.
Solving flow problems on the contracted network poses two key challenges:
Our solution NCFlow^1 achieves a high-quality flow allocation with a low runtime and space complexity by addressing each of these challenges in turn. First, we contract the network using well-studied algorithms such as modularity-based clustering [25] and spectral clustering [53], which are designed to identify the choke-point edges in a network. Second, we bundle demands whose sources and/or targets are in the same cluster, treating them as a single demand. In Figure 2, for example, the yellow cluster considers as one bundled demand all traffic from source nodes in the red cluster to target nodes in the green cluster. Doing so can lead to inconsistent flow allocations between clusters (which we explain in §3.1.1), and we devise careful heuristics to provably avoid them (§3.2). Finally, we reduce the forwarding entries needed at switches by reusing pathlets within clusters and traffic splitting rules across multiple demands (§3.5).
(^1) Short for Network Contractions for Flow problems.
Figure 4: On the left, we plot the L2 norm of the change in the demands between successive 5-minute periods divided by the L2 norm of the traffic matrix at a time. On the right, we show the CDF of this change ratio. We also show a CDF of the fraction of demand that is unsatisfied if using the allocation computed for the previous period. [Axes: left panel, Normalized Change vs. Time (mins); right panel, CDF vs. Metric, with series Norm. Change and Fract. demand unmet.]
Figure 1 shows the workflow for NCFlow. First, we choose appropriate clusters and paths using an offline procedure over historical traffic; these choices are pushed into the switch forwarding entries. This step happens infrequently, such as when the topology and/or traffic changes substantially. Then, online (e.g., once every few minutes), NCFlow computes how best to route the traffic over the clusters and paths, similar to deployed solutions [35, 36, 38]. Overall, our key contributions are:
We have open-sourced an anonymous version of NCFlow [2], and are in the early stages of integrating NCFlow into production use at a large enterprise.
We analyze the changes in topology and traffic on a large enterprise WAN over a several-month period. As Figure 4
Problem | Maximization term | Additional constraints | Used in | Known best complexity
MaxFlow | $\sum_{k \in D} f_k$ | none | [35, 38] | $O(M^2 \epsilon^{-2} \log^{O(1)} M)$ [27]
MaxFlow with Cost Budget | $\sum_{k \in D} f_k$ | $\sum_{k} \sum_{p \in P_k} \sum_{e \in p} f_k^p \, \mathrm{Cost}_e \le \mathrm{Budget}$ | - | $O(\epsilon^{-2} M \log M (M + N \log N) \log^{O(1)} M)$ [27]
Max Concurrent Flow | $\alpha$ | $d_k \alpha \le f_k, \ \forall k \in D$ | [19, 39, 40] | $O(\epsilon^{-2} (M^2 + KN) \log^{O(1)} M)$ [41]
Table 1: We illustrate a few different multi-commodity flow problems, all of which find feasible flows but optimize for different objectives and can have additional constraints; see notation in Table 2. Equation 6 fleshes out the problem completely for the case of maximizing flow. More problems are discussed in [11].
Term | Meaning
$V, E, D, P$ | Sets of nodes, edges, demands, and paths
$N, M, K$ | The numbers of nodes, edges, and demands, i.e., $N = |V|$, $M = |E|$, $K = |D|$
$e, c_e, p$ | Edge $e$ has capacity $c_e$; path $p$ is a set of connected edges
$(s_k, t_k, d_k)$ | Each demand $k$ in $D$ has source and target nodes ($s_k, t_k \in V$) and a non-negative volume ($d_k$)
$f, f_k^p$ | Flow assignment vector for a set of demands, and the flow for demand $k$ on path $p$
Table 2: Notation for framing multi-commodity flow problems.
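For reference, the MaxFlow row of Table 1 can be written out in the notation of Table 2 as a path-based linear program. This is the standard textbook formulation, given here as a sketch; it is assumed, not confirmed, to correspond to the paper's Equation 6, and $P_k$ denotes the pre-chosen paths available to demand $k$.

```latex
\begin{aligned}
\max_{f}\quad & \sum_{k \in D} f_k \\
\text{s.t.}\quad & f_k = \sum_{p \in P_k} f_k^p, && \forall k \in D \\
& f_k \le d_k, && \forall k \in D \\
& \sum_{k \in D} \; \sum_{p \in P_k :\, e \in p} f_k^p \le c_e, && \forall e \in E \\
& f_k^p \ge 0, && \forall k \in D,\ p \in P_k
\end{aligned}
```

The first constraint ties each demand's total flow to its per-path flows, the second caps flow at the requested volume, and the third enforces edge capacities.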
Term | Meaning
$V_{agg}, E_{agg}, D_{agg}, P_{agg}$ | Nodes, edges, demands, and paths in the aggregated graph
$V_x, E_x, D_x, P_x$ | Subscript denotes entities in the restricted graph for cluster $x$
$x, h$ | Each cluster $x$ is a strongly connected set of nodes and $h$ is the number of clusters
$k, K_{xy}, K_{sy}, K_{xt}$ | An actual demand ($k$); the rest are bundled demands from one source ($s$) or all nodes in a cluster ($x$) to a target ($t$) or to all nodes in a cluster ($y$)
Table 3: Additional notation specific to NCFlow.
SDN-based traffic engineering schemes [35, 38], in addition to repeatedly solving global optimizations, must maintain an up-to-date view of the topology, gather desired volumes for demands, and update traffic splits at switches based on the result of the optimization. Our production experience is that most of these repetitive steps have a latency of a few RTTs (round trip times), and so solving the optimization dominates, especially on large topologies. Moreover, demands are limited to their allocated rates in software at the source servers, and thus allocating less than the full desired rate need not result in packet loss [35]. Finally, applications that contribute a large fraction of the bytes moving between datacenters are elastic in short timescales; e.g., large dataset transfers for data analytics. That is, these apps seek a fast completion time but do not need a large rate in every optimization epoch. Some other applications, such as video streams of varying quality [43], have a decreasing marginal utility as their rate allocation increases. Today’s SDN-based TE solutions [35, 38] use multiple priority classes to maximize allocations for elastic traffic without affecting the latency-sensitive traffic.
In this section, we describe NCFlow. Our steps are as shown in Figure 1. Offline, based on historical demands, we divide the network into clusters and determine paths. Further details are in §3.4. Online, we allocate flow to the current demands by solving a carefully constructed set of simpler sub-problems, some of which can be solved independently and in parallel. We describe these sub-problems in §3.1. Although they can be solved quickly, disagreements between independent solutions can lead to infeasible allocations; we present a simple heuristic in §3.2 that provably leads to feasible flow allocations. In §3.3, we discuss extensions that increase the total flow allocated by NCFlow. We also show sufficient conditions under which NCFlow is optimal and matches the flow allocated by MaxFlow. Finally, in §3.5, we discuss how NCFlow uses fewer forwarding entries by reusing pathlets within clusters and splitting rules for different demands.
MaxAggFlow: $f^1 \triangleq \mathrm{MaxFlow}(V_{agg}, E_{agg}, D_{agg}, P_{agg})$
MaxClusterFlow: $\forall$ clusters $x$, $f_x^2 \triangleq \mathrm{MaxFlow}(V_x, E_x, D_x, P_x)$ s.t. NoMoreFlowThruCluster$(f, f^1, x)$ (see §D)
MinPathE2E: $f^3 \triangleq f_k, \ \forall k \in D_{agg}$ s.t. NoMoreAlongPaths$(f, f^2)$ (see §D)
SrcTargetMax: maximize $\sum_{k \in K_{xy}} f_k$ s.t. $\sum_{k \in K_{sy}} f_k \le f^2_{x,K_{sy}}, \ \forall s \in x$; $\sum_{k \in K_{xt}} f_k \le f^2_{y,K_{xt}}, \ \forall t \in y$; $\sum_{k \in K_{xy}} f_k \le f^3_{K_{xy}}$; $f_k \le d_k, \ \forall k \in K_{xy}$
Figure 6: The basic flow allocation algorithm used by NCFlow; notation used here is defined in Table 3.
We begin by describing a simple (but incomplete) version of NCFlow’s flow allocation algorithm; the pseudocode is in Figure 6. We continue using Figure 2 as a running example. The basic algorithm proceeds in four steps.
In the first step, we allocate flow on the aggregated graph, as shown in MaxAggFlow in Figure 6. In the aggregated graph, an example of which is in Figure 2 (right), nodes are clusters and the edges are bundled edges from the original graph: the edge between the red and yellow clusters corresponds to the five edges between these clusters on the actual graph. Similarly, we bundle demands on the aggregated graph: the demand $K_{xy}$ between the clusters $x$ and $y$ corresponds to all of the demands whose sources are in cluster $x$ and targets are in cluster $y$. The resulting flow allocation ($f^1$) accounts for bottlenecks on the edges between clusters. However, this flow may not be feasible, since there may be bottlenecks within the clusters.
Figure 7: An example illustrating how the flow allocated in MaxAggFlow translates to constraints on the flow to be allocated in MaxClusterFlow.
In the second step, we refine the allocation from step 1 to account for intra-cluster demands and constraints. Specifically, we allocate flow for the demands whose sources and targets are within the cluster. We also allocate no more flow than was allocated in $f^1$ for the inter-cluster flows. MaxClusterFlow in Figure 6 shows code for this step. We note a few details:
In the third step, we reconcile end-to-end; that is, we find the largest flow that can be carried along each path on the aggregate graph. As shown by MinPathE2E in Figure 6, for each bundle of demands and each path, we take the minimum flow allocated ($f_x^2$) at each cluster on the path.
Problem | # of Nodes | # of Edges | # of Demands
MaxFlow | $N$ | $M$ | $K$
MaxAggFlow | $h$ | $\min(M, h^2)$ | $\min(K, h^2)$
MaxClusterFlow | $\sim N/h + h$ | $\sim M/h + 2h$ | $\sim K/h^2 + 2N/h + h^2$
Table 4: Sizes of the problems in Figure 6 using notation from Tables 2 and 3. Just verifying that flow is feasible (i.e., FeasibleFlow in Eq. 1) uses O(# nodes × # edges) equations and variables. NCFlow has one instance of MaxAggFlow and executes the $h$ instances of MaxClusterFlow in parallel. MinPathE2E and SrcTargetMax are relatively insignificant.
Figure 8: Illustrating how disagreements in flow allocation can occur in the basic flow allocation algorithm; see §3.1.1.
(a) Disagreement arising from bundling edges: as shown on the right, the algorithm in Figure 6 will allocate 2 units of flow but only $5\epsilon$ units can be carried.
(b) Disagreement arising from bundling demands: as shown on the right, the algorithm in Figure 6 will allocate 2 units of flow, but only $2\epsilon$ units can be carried.
The flow allocation for the demands in a cluster $x$ can be read directly from the $f_x^2$ solution of MaxClusterFlow. For demands that span clusters, however, more work remains because the steps thus far do not directly compute their flow. In particular, $f^3$ allocates flow for cluster bundles, e.g., for all the demands whose sources are in cluster $x$ and targets are in cluster $y$. The corresponding per-cluster flow allocations, $f_x^2$ and $f_y^2$, allocate flow from a given source node and to a given target node, respectively. Thus, in the final step, SrcTargetMax, we assign the maximal flow to each inter-cluster demand that respects all previous allocations.
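The bundling used in the first step and the reconciliation used in the third step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `contract` and `min_path_e2e` are hypothetical, the dictionary-based graph encoding is an assumption, and summing the capacities of the edges in a bundle is an assumed (though natural) way to set the capacity of an aggregated edge.

```python
from collections import defaultdict


def contract(edges, demands, cluster_of):
    """Bundle inter-cluster edges and demands onto the aggregated graph.

    edges:      {(u, v): capacity} on the original directed graph
    demands:    {(s, t): volume}
    cluster_of: {node: cluster id}
    """
    agg_edges = defaultdict(float)    # (x, y) -> capacity of the bundled edge (assumed: sum)
    agg_demands = defaultdict(float)  # (x, y) -> volume of the bundled demand K_xy
    for (u, v), cap in edges.items():
        x, y = cluster_of[u], cluster_of[v]
        if x != y:  # intra-cluster edges do not appear in the aggregated graph
            agg_edges[(x, y)] += cap
    for (s, t), vol in demands.items():
        x, y = cluster_of[s], cluster_of[t]
        if x != y:  # intra-cluster demands are handled per cluster
            agg_demands[(x, y)] += vol
    return dict(agg_edges), dict(agg_demands)


def min_path_e2e(cluster_path, flow_in_cluster):
    """Flow a bundle can carry along one aggregated path: the minimum over its clusters."""
    return min(flow_in_cluster[x] for x in cluster_path)


cluster_of = {"a": 0, "b": 0, "c": 1, "d": 1}
edges = {("a", "b"): 10.0, ("a", "c"): 2.0, ("b", "d"): 3.0, ("c", "d"): 10.0}
demands = {("a", "d"): 4.0, ("b", "c"): 1.0, ("a", "b"): 5.0}
agg_edges, agg_demands = contract(edges, demands, cluster_of)
bottleneck = min_path_e2e([0, 1, 2], {0: 3.0, 1: 1.5, 2: 2.0})
```

In this toy input, the two crossing edges collapse into one aggregated edge and the two inter-cluster demands into one bundle, while the intra-cluster demand from a to b is left for the per-cluster problem.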
3.1.1 Properties of Basic Flow Allocation
Solver runtime: The numbers of equations and variables in the sub-problems are shown in Table 4. If the number of clusters $h$ is 1, note that there is exactly one per-cluster problem, MaxClusterFlow, which matches the original problem from Eqn. 2. When using a few tens of clusters, we will show in § that all of the sub-problems are substantially smaller than the original problem (MaxFlow).
Feasibility: The flow allocated by Figure 6 satisfies demand and capacity constraints; we will prove this formally in §B.1. For demands whose source and target are in different clusters, however, disagreements may ensue since the different problem instances assign flow to different bundles of edges and demands. We illustrate two such examples in Figure 8; both have 1 unit of demand from $s_1$ to $t_1$ and from $s_2$ to $t_2$. The dashed edges have a capacity of $\epsilon \ll 1$ and all of the other edges have a very large capacity.
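To make the solver-runtime argument concrete, the approximate sizes in Table 4 can be evaluated for a given topology. The sketch below is illustrative only: `problem_sizes` is a hypothetical helper, the MaxClusterFlow expressions follow our reading of Table 4, and the assumption that every ordered node pair carries a demand (K = N(N - 1)) is ours.

```python
def problem_sizes(N, M, K, h):
    """Approximate (nodes, edges, demands) per sub-problem, following Table 4."""
    return {
        "MaxFlow": (N, M, K),
        "MaxAggFlow": (h, min(M, h * h), min(K, h * h)),
        "MaxClusterFlow": (N / h + h, M / h + 2 * h, K / h**2 + 2 * N / h + h**2),
    }


# Kdl from Table 5: 754 nodes, 1790 edges, 81 clusters; all-pairs demands assumed.
N, M, h = 754, 1790, 81
sizes = problem_sizes(N, M, N * (N - 1), h)
for name, (nodes, edges_, demands_) in sizes.items():
    print(f"{name}: nodes={nodes:.0f} edges={edges_:.0f} demands={demands_:.0f}")
```

Under these assumptions, every sub-problem has far fewer demands than the original MaxFlow instance, which is where most of the saving comes from.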
Proof. By optimal, we mean that the total allocated flow must be as large as an instance of Equation 6 wherein any path can be used. The proof is in §B.3. Intuitively, when the number of clusters is 1 and any paths can be used, a single instance of MaxClusterFlow is identical to the optimal problem in Equation 6. Similarly, when the number of clusters equals the number of nodes, MaxAggFlow is identical to the optimal problem. Furthermore, the conditions listed lead to optimality because the optimal flow allocation can be transformed into an allocation that can be output by the algorithm in Figure 6.
Even though the listed conditions appear restrictive, note that the topology within clusters can be arbitrary. We will show in §5 that NCFlow offers nearly optimal flow allocations even when the above conditions do not hold.
The choice of clusters and paths affects both the solution quality and runtime of NCFlow. We cast cluster choice as a graph partitioning problem [5, 21, 65] with these objectives:
We empirically observe, based on experiments with many WANs and different types of demands, that:
$\sqrt{N}$ clusters, irrespective of the clustering technique, leads to the best result, i.e., smallest runtime and fewest forwarding entries while allocating nearly the largest amount of flow possible; see Figure 13.
Thus, the optimal clustering choice for a WAN is unclear; it is possible that hand-tuning or using a learning technique may lead to better-performing clusters. Nevertheless, any of the three simple clustering schemes discussed below already suffice for NCFlow to improve substantially over baselines. We consider the following clustering choices because they are simple and fast; unless otherwise noted, results in this paper use FMPartitioning.
apply modularity-based clustering with edge weights set to their capacity.
Some other clustering techniques [5, 42, 65] can balance cluster sizes or trade off between concentration and balance but are more complex computationally; it is possible that using such schemes can further improve NCFlow.
Path choice in NCFlow: On the aggregated graph and on each cluster graph, we pre-compute offline a small number of paths between every pair of nodes. We consider the following different path choices and pick paths that lead to the largest flow allocation on historical demands:
NCFlow also pre-computes offline (1) a pseudo-random choice of which edges to use between a pair of connected clusters in each iteration and (2) which path on the aggregated graph to use for each cluster bundled demand in each iteration.
NCFlow uses many fewer switch forwarding entries than prior works due to the following reasons.
First, the paths along which NCFlow allocates flow can be thought of as a sequence of pathlets [32, 47, 68] in each cluster connected by crossing edges between clusters. Figures 9 and 10 illustrate such paths on the right. This observation is crucial because a pathlet can be reused by multiple demands. For example, in Figure 9, the flow from any source in the red cluster to any target in the grey cluster would use the same pathlets shown in the yellow, green, and blue clusters. Prior work [35, 36], on the other hand, establishes paths for each demand. Using pathlets has two advantages. The number of pathlets used by NCFlow is about $h$ times less than the number of paths used by prior works.^2 Furthermore, a typical pathlet has fewer hops than a typical end-to-end path. Thus, NCFlow uses many fewer rules to encode paths in switches.
(^2) More precisely, the number reduces from $P N (N - 1)$ to $\sum_x P N_x (N_x - 1)$, where $P$ is the number of paths per node pair, the $N$ nodes are divided into $h$ clusters, and cluster $x$ has $N_x$ nodes. If clusters are evenly sized, $N_x = N/h$, and the ratio of these terms is $\sim h$.
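The counting argument in the footnote can be checked numerically. The sketch below evaluates both expressions for evenly sized clusters; the helper names and the example sizes (100 nodes, 10 clusters, 4 paths per node pair) are illustrative assumptions, not values from the paper.

```python
def path_entries(num_nodes, paths_per_pair):
    """End-to-end paths when each ordered node pair gets its own paths: P * N * (N - 1)."""
    return paths_per_pair * num_nodes * (num_nodes - 1)


def pathlet_entries(cluster_sizes, paths_per_pair):
    """Pathlets when paths are only kept within clusters: sum over x of P * N_x * (N_x - 1)."""
    return sum(paths_per_pair * n * (n - 1) for n in cluster_sizes)


cluster_sizes = [10] * 10  # N = 100 nodes split evenly into h = 10 clusters
ratio = path_entries(sum(cluster_sizes), 4) / pathlet_entries(cluster_sizes, 4)
print(ratio)  # 11.0, i.e., roughly h
```

The ratio is (N - 1) / (N/h - 1), which approaches h as clusters grow, matching the footnote's claim for evenly sized clusters.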
Next, whenever NCFlow allocates flow at the granularity of cluster bundles, all of the demands in a bundle take the same paths and are split in the same way across paths. Hence, NCFlow uses one traffic splitting rule for all demands in such bundles. For instance, the demands from source $s$ in the red cluster in Figure 9 to any target in the grey cluster are split with the same ratio across the same pathlets in all clusters (except the grey cluster, where they take different pathlets to reach their different targets). Thus, with NCFlow, the number of splitting rules at a source decreases by a factor of $\sqrt{N}/2$.^3
The paths and splitting rules to push into switch forwarding tables are determined by the offline component of NCFlow and only change occasionally. After each allocation, only the splitting ratios change. More details on the data plane of NCFlow, such as how to compute the total flow that can be sent by each demand and the splitting ratios, as well as how to move packets from one pathlet to the next, are in Appendix C. In §5, we measure the numbers of rules used by NCFlow.
Our current prototype of NCFlow is about 5K lines of Python code, which invokes Gurobi [33] v8.1.1 to solve all of the optimization problems. For clustering WAN topologies, we adapt [26] to find clusters that maximize modularity; we also use our own implementation of NJW spectral clustering [53]. We use a grid search over the number of clusters ($h$) and the above clustering techniques to identify the best performing choice for each topology on a set of historical traffic matrices. To compare with state-of-the-art techniques, we customize the public implementations of SMORE [44, 45] and TEAVAR [19]. We have also implemented Fleischer’s algorithm [27]; our implementation is about 10× faster than public implementations [8, 37] since we carefully optimize a key bottleneck in Fleischer’s algorithm. All of these code artefacts are available on GitHub [2].
We evaluate NCFlow on several WAN topologies, traffic matri- ces, and failure scenarios to answer the following questions:
$\sqrt{N}$, the ratio of these terms is $\sqrt{N}/2$.
Topology | # Nodes | # Edges | # Clusters
PrivateLarge | ~1000s | ~1000s | 31
Kdl | 754 | 1790 | 81
PrivateSmall | ~100s | ~1000s | 42
Cogentco | 197 | 486 | 42
UsCarrier | 158 | 378 | 36
Colt | 153 | 354 | 36
GtsCe | 149 | 386 | 36
TataNld | 145 | 372 | 36
DialtelecomCz | 138 | 302 | 33
Ion | 125 | 292 | 33
Deltacom | 113 | 322 | 30
Interoute | 110 | 294 | 20
Uninett2010 | 74 | 202 | 24
Table 5: Some of the WAN topologies used in our evaluation; see §5.1.
Here, we describe our methodology: the topologies, traffic, baselines, and metrics used in our evaluation.
Topologies: We use two real topologies from a large enterprise: PrivateSmall is a production internet-facing WAN with hundreds of sites, and PrivateLarge is a larger WAN that contains many more sites. We also use several topologies from the Internet Topology Zoo [6] and reuse topologies used by prior works [19, 38]. Table 5 shows details for some of the used topologies; note that the topologies shown are 10× to 100× larger than those considered by prior work [19, 35, 38, 44, 49].
Traffic Matrices (TMs): We benchmark NCFlow on traffic traces from PrivateSmall, which contain the total traffic between node pairs at 5-minute intervals. We also generate the following kinds of synthetic traffic matrices for all topologies:
• Poisson($\lambda, \delta$) models demands with varying concentration; the demand between nodes $s$ and $t$ is a Poisson random variable with mean $\lambda \delta^{d_{st}}$, where $d_{st}$ is the hop length of the shortest path between $s$ and $t$ and $\delta \in [0, 1)$ is a decay factor. We choose $\delta$ close to 0 or to 1 to model strongly and weakly concentrated demands, respectively.
• Gravity($v$) [14, 60]: The total traffic leaving a node is proportional to the total capacity on the node’s outgoing links (parameterized by $v$); this traffic is divided among other nodes proportional to the total capacity on their incoming links.
• Uniform$[0, a)$: The traffic between any pair of nodes is chosen uniformly at random, between 0 and $a$.
• Bimodal($[0, a), [b, c), p$) [14]: A $p$ fraction of the node pairs, chosen uniformly at random, receive demands from Uniform$[b, c)$ while the rest receive demands from Uniform$[0, a)$. We use $p = 0.2$.
For each above model, we select parameters such that fully satisfying the traffic matrix leads to a maximum link utiliza-
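Three of the synthetic traffic models listed above can be sketched with the Python standard library. This is an illustrative sketch, not the generator used in the evaluation: the function names and the dictionary encoding are assumptions, the bimodal model flips an independent coin per node pair (so a p fraction is selected only in expectation), and the gravity model takes the scaling parameter v to be the constant of proportionality. The Poisson model is omitted because it additionally needs shortest-path hop counts.

```python
import random
from collections import defaultdict


def uniform_tm(nodes, a, rng):
    """Demand between every ordered pair drawn uniformly between 0 and a."""
    return {(s, t): rng.uniform(0.0, a) for s in nodes for t in nodes if s != t}


def bimodal_tm(nodes, a, b, c, p, rng):
    """Each pair draws from [b, c) with probability p, else from [0, a)."""
    tm = {}
    for s in nodes:
        for t in nodes:
            if s != t:
                lo, hi = (b, c) if rng.random() < p else (0.0, a)
                tm[(s, t)] = rng.uniform(lo, hi)
    return tm


def gravity_tm(edges, v):
    """Traffic out of s is v * out_cap[s], split by the incoming capacity of each target."""
    out_cap, in_cap = defaultdict(float), defaultdict(float)
    for (u, w), cap in edges.items():
        out_cap[u] += cap
        in_cap[w] += cap
    nodes = set(out_cap) | set(in_cap)
    tm = {}
    for s in nodes:
        others = sum(in_cap[t] for t in nodes if t != s)
        for t in nodes:
            if t != s and others > 0:
                tm[(s, t)] = v * out_cap[s] * in_cap[t] / others
    return tm


rng = random.Random(0)  # fixed seed so the sketch is repeatable
nodes = ["a", "b", "c"]
edges = {("a", "b"): 2.0, ("b", "a"): 2.0, ("a", "c"): 2.0, ("c", "a"): 2.0}
uni = uniform_tm(nodes, 10.0, rng)
bim = bimodal_tm(nodes, 1.0, 5.0, 6.0, 0.2, rng)
grav = gravity_tm(edges, 1.0)
```

In the gravity example, node a has 4 units of outgoing capacity, so its outgoing demand totals 4 and is split evenly between b and c, which have equal incoming capacity.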
Figure 13: NCFlow’s performance when using different numbers of clusters on PrivateLarge. The speedup ratio is plotted on the right y-axis in log scale; the other metrics use the left y-axis.
Both effects are because SMORE* allocates flow on Räcke’s RRTs instead of k-shortest paths. The edge and path variants of Fleischer’s, shown using purple and red lines in the figures, perform similarly; since they are approximate algorithms, they allocate less flow than PF4 in roughly 50% of cases, but are also faster than PF4 in slightly less than 50% of cases. We conclude that these approximate algorithms are not practically better than PF4.
In contrast, NCFlow, shown with dark blue lines in the figures, almost always allocates at least 80% of PF4’s total flow, while achieving large speedups. In the median case, NCFlow achieves 98% of the flow and is over 8× faster. These improvements accrue from NCFlow solving smaller optimization problems than PF4.
Figures 18 and 19 tease apart the above results by load, traffic type, and topology. Figures 23–27 show results for alternate path choices. Taken together, these results indicate that NCFlow’s improvements hold across a variety of scenarios.
For the same experiments considered above, Figure 12 shows the number of switch forwarding entries used in different topologies. (A full set of results is in Table 6.) The bottom plot is the total number of forwarding entries across all switches, while the top shows the maximum for any switch. Note that both the x and y axes are in log scale. NCFlow consistently uses fewer forwarding entries; using NCFlow offers a greater amount of relative savings than switching from all edges to just a handful of paths per demand. The savings from NCFlow also increase with topology size. The reason, as noted in §3.5, is that NCFlow reuses pathlets and traffic splitting rules for many different demands.
Figure 13 shows how NCFlow’s performance varies with the numbers of clusters used on PrivateLarge. While NCFlow allocates roughly the same amount of total flow, using about 30 clusters improves runtime and reduces forwarding entries. Figure 21 compares NCFlow’s performance when using different clustering techniques; more details are in §G.2.
Recall from §3.3 that NCFlow uses multiple iterations of Figure 6. In the above experiments, the first iteration alone accounts for 75% of the runtime and for roughly 90% of the flow that is allocated by NCFlow. Later iterations are faster, perhaps because they have less traffic to consider. Breaking down the runtime by the steps in Figure 6, we see cases where MaxClusterFlow accounts for over 70% of NCFlow’s runtime, perhaps because the largest cluster contains a large fraction of the nodes. Better cluster choice or recursively dividing the largest clusters can further lower runtime.
Figure 14: Allocated flow and speedup relative to PF4 on a sequence of production TMs from PrivateSmall. In half of the cases, NCFlow allocates at least 98.5% of the flow and is at least 8.5× faster.
Here, we experiment with a sequence of traffic traces collected on the PrivateSmall WAN. Figure 14 plots the moving average (over 5 windows) of the total flow and speedup relative to PF4 for two schemes: NCFlow in blue and PF4w in light blue. The figure shows that PF4w’s warm start yields a median speedup of 1.66×. NCFlow achieves a consistently higher speedup (8.5× in the median case), and the flow allocation is nearly optimal: the median total relative flow is 98.5%, and NCFlow always allocates more than 93%.
Here, we evaluate the impact of a technique’s runtime on its ability to stay on track with changing demands. Specifically, on the PrivateLarge topology, we use a time-series of traffic matrices, wherein a new TM arrives every five minutes and the change from one TM to the next is consistent with the findings in Figure 4 (more details are in Figure 20). At each time-step, all techniques have the opportunity to compute a new allocation for the current TM or to continue computing the allocation for an earlier TM if they have not yet finished; in the latter case, their most recently computed allocation will be used for the current TM. For example, a technique that requires five minutes to compute a new allocation will always be one window behind, i.e., each TM will receive the allocation that was computed for the previous TM. Figure 15 shows the fraction of demand that is satisfied by three different schemes; we also show the value for an instantaneous scheme, which is not penalized for its runtime.
Figure 15: When demands change, how solver runtimes affect flow allocation on PrivateLarge: Due to the slow runtime, PF4 and PF4w carry only 62% of the traffic that can be satisfied by Instant PF4, a (hypothetical) scheme which has zero runtime. NCFlow carries 87% of the traffic since its faster runtime compensates for its sub-optimality.
PF4’s average runtime here is over 15 minutes; hence, as the orange dashed line shows, PF4 is able to compute a new allocation only for every third or fourth TM. This leads to substantial demand being unsatisfied: for node pairs whose current demand is larger than before, PF4 will not allocate enough flow. On the other hand, node pairs whose current demand is less than their earlier demand will be unable to fully use PF4’s allocation. As the figure shows, PF4 only satisfies 53% of the changing demand on average, whereas Instant PF4 satisfies 87% of the demand.
PF4w (the dash-dot light blue line), where the solver warm starts using the previous allocation, is modestly faster than PF4 on average. As the figure shows, the average demand satisfied by PF4w is only slightly larger than PF4 (about 54%).
In contrast, NCFlow (the solid dark blue line) finishes well within five minutes, which allows allocations to change along with the changing demands. We find that on average NCFlow satisfies 75% of the demands; its smaller runtime more than makes up for sub-optimality, allowing NCFlow to carry more flow than PF4 when demands change.
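The effect of solver runtime on staleness can be reproduced with a small simulation. The sketch below is a simplified model of the experiment described above, not the evaluation harness: it assumes a solver that starts on the newest TM whenever it is idle at a TM arrival, and that a stale allocation caps each node pair independently. The function names are hypothetical.

```python
def allocation_in_use(num_tms, runtime, period=5.0):
    """For each TM index, return the index of the TM whose allocation is in use.

    None means no allocation has finished yet. Times are in minutes.
    """
    in_use = []
    solving, ready_at = None, 0.0  # TM currently being solved and when it completes
    latest = None                  # TM index of the most recently finished allocation
    for i in range(num_tms):
        start, end = i * period, (i + 1) * period
        if solving is not None and ready_at <= start:
            latest, solving = solving, None         # finished before this TM arrived
        if solving is None:
            solving, ready_at = i, start + runtime  # idle: start on the current TM
        if ready_at < end:
            latest, solving = solving, None         # finishes within this window
        in_use.append(latest)
    return in_use


def satisfied_fraction(allocation, demand):
    """Fraction of current demand met when each pair is capped by its allocated rate."""
    total = sum(demand.values())
    met = sum(min(allocation.get(pair, 0.0), d) for pair, d in demand.items())
    return met / total if total else 1.0


print(allocation_in_use(4, 5.0))   # a 5-minute solver is always one window behind
print(allocation_in_use(7, 15.0))  # a 15-minute solver refreshes every third TM
stale = satisfied_fraction({("a", "b"): 4.0, ("a", "c"): 2.0},
                           {("a", "b"): 2.0, ("a", "c"): 4.0})
```

The last call illustrates both failure modes from the text: the pair whose demand grew is under-allocated, and the pair whose demand shrank cannot use its full allocation, so only two thirds of the current demand is met.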
Here, we evaluate the effect of link failures. As we note in §F, TEAVAR* did not finish within several days on any of the topologies listed in Table 5 because, when all possible 2-link failure scenarios are considered, the number of equations and variables in the optimization problem increases from $O(N^2)$ for MaxFlow to $O(M^2 N^2)$ for TEAVAR [19], where N and M are the numbers of nodes and edges, respectively. Hence, we report results on the 12-node, 38-edge WAN topology from B4 [38]. We generate synthetic traffic matrices as noted in §5.1. Using link failure probabilities from TEAVAR [3], we generate several hundred failure scenarios and, for each TM, we measure the flow carried by NCFlow and TEAVAR* before the fault, immediately after the fault, and after recovery.

[Figure 16: Comparing failure response of NCFlow with prior work.
(a) CDFs of the flow loss before faults, immediately after faults, and after recovery (B4 topology, many traffic matrices and faults; see §5.6). x-axis: Loss = 1 − (flow carried by scheme / flow carried by PF4 when no fault); y-axis: CDF (over faults). Curves: NCFlow before fault, after fault, and after recompute; TEAVAR* before fault, after fault, and after re-balance.
(b) Timelapse of when a fault occurs (B4 topology, Uniform traffic matrix, β = 0.99). x-axis: Time; y-axis: total flow, relative to PF4. Curves: NCFlow, TEAVAR*, TEAVAR. Marked events: fault happens, tunnels rebalance, NCFlow recomputes.
(c) NCFlow's time to recompute after a fault. x-axis: Recompute Time (ms); y-axis: CDF (over faults).]

A key difference in fault recovery between NCFlow and TEAVAR* is that TEAVAR* requires sources to rebalance the traffic splits when a failure happens; doing so takes about one RTT on the WAN. Given a parameter β, TEAVAR* guarantees that there will be no flow loss after the tunnels re-balance with a probability of 1 − β. See §F for more details. We use β = 0.99, as recommended in [19]. NCFlow, on the other hand, recomputes flow allocations taking into account the links that have failed; doing so takes one execution of NCFlow and some RTTs to change the traffic splits at switches; more details are in §E. Figure 16c shows that the recomputation time is well within one RTT on the WAN.

Figure 16b shows a timelapse of the flow carried on the network before the fault, immediately after the fault, and after recovery. As the figure shows, TEAVAR* can have a smaller loss and for a shorter duration (i.e., until sources rebalance traffic), while NCFlow can carry more flow before the fault and after recovery; moreover, the fast solver time can reduce the duration of loss.

Figure 16a shows CDFs over many faults and traffic matrices for NCFlow and TEAVAR*. We record the flow loss at three stages: before the fault, immediately after the fault, and after recovery. As the figure shows, NCFlow's ability to carry more flow before the fault and after recovery more than compensates for the slightly larger loss it may accrue in between.
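The loss metric plotted in Figure 16a can be written down directly. The sketch below is only an illustration of that definition; the function name and the three sample flow values are hypothetical and do not come from the experiments.

```python
def flow_loss(flow_carried, pf4_flow_no_fault):
    """Loss = 1 - (flow carried by scheme / flow carried by PF4 when no fault)."""
    if pf4_flow_no_fault <= 0:
        raise ValueError("reference flow must be positive")
    return 1.0 - flow_carried / pf4_flow_no_fault


# Hypothetical flows for one scheme at the three measured stages.
reference = 100.0  # flow carried by PF4 without any fault
stages = {"before fault": 97.0, "after fault": 80.0, "after recovery": 92.0}
for stage, carried in stages.items():
    print(stage, round(flow_loss(carried, reference), 3))
```

Because the metric is normalized by a fixed reference, it can in principle be negative whenever a scheme carries more flow than the reference does.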
Extending beyond MaxFlow: FeasibleFlow is a common constraint for many objectives beyond MaxFlow (see Table 1). Since the algorithm in §3.1 and the heuristic in §3.2 guarantee feasibility, NCFlow can apply to objectives beyond MaxFlow;
USENIX Association 18th USENIX Symposium on Networked Systems Design and Implementation 185
[1] Capacity planning for the Google backbone network. https://bit.ly/2lViJ4t.
[2] Code for NCFlow and Baselines. https://github.com/netcontract/ncflow.
[3] Code for TEAVAR. https://github.com/manyaghobadi/teavar.
[4] Contraction Hierarchies Path Finding Algorithm. https://bit.ly/3eaiqtg.
[5] GAP: Generalizable Approximate Graph Partitioning Framework. https://arxiv.org/pdf/1903.00614.pdf.
[6] Internet Topology Zoo. http://www.topology-zoo.org/.
[7] Market Trends: SD-WAN and NFV for Enterprise Network Services. https://gtnr.it/3c8hNyA.
[8] Cristinel Ababei. Code for Karakostas. https://bit.ly/2woSloP.
[9] Ittai Abraham, Daniel Delling, Andrew V. Goldberg, and Renato F. Werneck. A Hub-Based Labeling Algorithm for Shortest Paths in Road Networks. In Experimental Algorithms, 2011.
[10] Firas Abuzaid, Srikanth Kandula, Behnaz Arzani, Ishai Menache, Matei Zaharia, and Peter Bailis. Solving Flow Problems Quickly by Contracting Wide-area Network Topologies: Extended Version. https://bit.ly/35oyQdU.
[11] Ravindra Ahuja, Thomas Magnanti, and James Orlin. Network Flows. Theory, Algorithms, and Applications. Prentice Hall.
[12] Muthukaruppan Annamalai et al. Sharding the Shards: Managing Datastore Locality at Scale with Akkio. In OSDI, 2018.
[13] D. Applegate, L. Breslau, and E. Cohen. Coping with Network Failures: Routing Strategies for Optimal Demand Oblivious Restoration. In SIGMETRICS, 2004.
[14] David Applegate and Edith Cohen. Making Intra-Domain Routing Robust to Changing and Uncertain Traffic Demands. In SIGCOMM, 2003.
[15] Hannah Bast, Daniel Delling, Andrew V. Goldberg, Matthias Müller-Hannemann, Thomas Pajor, Peter Sanders, Dorothea Wagner, and Renato F. Werneck. Route Planning in Transportation Networks. CoRR,
[16] Dimitris Bertsimas and John N Tsitsiklis. Introduction to linear optimization, volume 6. Athena Scientific Belmont, MA, 1997.
[17] Daniel Bienstock. Potential function methods for approximately solving linear programming problems: theory and practice, volume 53. Springer Science & Business Media, 2002.
[18] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks, 2008.
[19] Jeremy Bogle et al. TEAVAR: striking the right utilization-availability balance in WAN traffic engineering. In SIGCOMM, 2019.
[20] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[21] P. Brucker. On the complexity of clustering problems. In Optimizing and Operations Research, Berlin, West Germany, 1977. Springer-Verlag.
[22] Yiyang Chang, Sanjay Rao, and Mohit Tawarmalani. Robust validation of network designs under uncertain demands and failures. In NSDI, 2017.
[23] P. Chardaire and A. Lisser. Simplex and Interior Point Specialized Algorithms for Solving Nonoriented Multicommodity Flow Problems. Operations Research, 2002.
[24] David Chou, Tianyin Xu, Kaushik Veeraraghavan, Andrew Newell, Sonia Margulis, Lin Xiao, Pol Mauri Ruiz, Justin Meza, Kiryong Ha, Shruti Padmanabha, et al. Taiji: Managing Global User Traffic for Large-Scale Internet Services at the Edge. In OSDI, 2019.
[25] A. Clauset, M.E.J. Newman, and C. Moore. Finding community structure in very large networks. Phys. Rev.,
[26] Aaron Clauset. Fast Modularity Community Structure Inference Algorithm. https://bit.ly/3aAVGQH.
[27] Lisa K. Fleischer. Approximating Fractional Multicommodity Flow Independent of the Number of Commodities. SIAM J. Discret. Math., 2000.
[28] Ken Florance. How Netflix Works With ISPs Around the Globe to Deliver a Great Viewing Experience. https://bit.ly/2RYYrEM, 2016.
[29] B. Fortz and Mikkel Thorup. Internet Traffic Engineering by Optimizing OSPF Weights in a Changing World. In INFOCOM, 2000.
[30] Naveen Garg and Jochen Könemann. Faster and Simpler Algorithms for Multicommodity Flow and Fractional Packing Problems. SIAM J. Comput., 2007.
[31] A. M. Geoffrion and G. W. Graves. Multicommodity Distribution System Design by Benders Decomposition. Management Science, 1974.
[32] P. Brighten Godfrey, Igor Ganichev, Scott Shenker, and Ion Stoica. Pathlet routing. In SIGCOMM, 2009.
[33] Zonghao Gu, Edward Rothberg, and Robert Bixby. Gurobi optimizer reference manual, version 5.0. Gurobi Optimization Inc., Houston, USA, 2012.
[34] Jeff Hartline and Alexa Sharp. Hierarchical Flow. Technical Report 2004-09-29, Cornell University, 2004.
[35] Chi-Yao Hong, Srikanth Kandula, Ratul Mahajan, Ming Zhang, Vijay Gill, Mohan Nanduri, and Roger Wattenhofer. Achieving high utilization with software-driven WAN. In SIGCOMM, 2013.
[36] Chi-Yao Hong, Subhasree Mandal, Mohammad Al-Fares, Min Zhu, Richard Alimi, Chandan Bhagat, Sourabh Jain, Jay Kaimal, Shiyu Liang, Kirill Mendelev, et al. B4 and after: managing hierarchy, partitioning, and asymmetry for availability and scale in Google's software-defined WAN. In SIGCOMM, 2018.
[37] Yuanfang Hu, Yi Zhu, Hongyu Chen, Ronald L. Graham, and Chung-Kuan Cheng. Communication latency aware low power NoC synthesis. In DAC, 2006.
[38] Sushant Jain, Alok Kumar, Subhasree Mandal, Joon Ong, Leon Poutievski, Arjun Singh, Subbaiah Venkata, Jim Wanderer, Junlan Zhou, and Min Zhu. B4: Experience with a globally-deployed software defined WAN. In SIGCOMM, 2013.
[39] Virajith Jalaparti, Ivan Bliznets, Srikanth Kandula, Brendan Lucier, and Ishai Menache. Dynamic pricing and traffic engineering for timely inter-datacenter transfers. In SIGCOMM, 2016.
[40] S. Kandula, I. Menache, R. Schwartz, and S. R. Babbula. Calendaring for Wide Area Networks. In SIGCOMM,
[41] George Karakostas. Faster Approximation Schemes for Fractional Multicommodity Flow Problems. ACM Trans. Algorithms, 2008.
[42] Robert Krauthgamer, Joseph (Seffi) Naor, and Roy Schwartz. Partitioning Graphs into Balanced Components. In SODA, 2009.
[43] Alok Kumar et al. BwE: Flexible, hierarchical bandwidth allocation for WAN distributed computing. In SIGCOMM,
[44] Praveen Kumar et al. Semi-Oblivious Traffic Engineering: The Road Not Taken. In NSDI, 2018.
[45] Praveen Kumar, Chris Yu, Yang Yuan, Nate Foster, Robert Kleinberg, and Robert Soulé. YATES: Rapid Prototyping for Traffic Engineering Systems. In SOSR,
[46] Nikolaos Laoutaris, Michael Sirivianos, Xiaoyuan Yang, and Pablo Rodriguez. Inter-datacenter Bulk Transfers with NetStitcher. In SIGCOMM, 2011.
[47] David Lebrun, Mathieu Jadin, François Clad, Clarence Filsfils, and Olivier Bonaventure. Software Resolved Networks: Rethinking Enterprise Networks with IPv6 Segment Routing. In SOSR, 2018.
[48] Chansook Lim, S. Bohacek, Joao Hespanha, and Katia Obraczka. Hierarchical Max-Flow Routing. In Globecom, 2005.
[49] Hongqiang Harry Liu, Srikanth Kandula, Ratul Mahajan, Ming Zhang, and David Gelernter. Traffic engineering with forward fault correction. In SIGCOMM, 2014.
[50] Richard McBride. Progress Made in Solving the Multicommodity Flow Problem. SIAM Journal on Optimization, 1998.
[51] Srinivas Narayana, Joe Jiang, Jennifer Rexford, and Mung Chiang. Distributed Wide-Area Traffic Management for Cloud Services. In SIGMETRICS, 2012.
[52] NetworkX. Edge Disjoint Paths. https://bit.ly/37VJ71k.
[53] Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In NIPS,
[54] Murat Oguz, Tolga Bektas, and Julia A. Bennell. Multicommodity flows and Benders decomposition for restricted continuous location problems. European Journal of Operational Research, 2017.
[55] James Orlin. A polynomial time primal network simplex algorithm for minimum cost flows. Math. Programming,
[56] Qifan Pu, Ganesh Ananthanarayanan, Peter Bodik, Srikanth Kandula, Aditya Akella, Victor Bahl, and Ion Stoica. Low Latency Geo-distributed Data Analytics. In SIGCOMM, 2015.
[57] H Racke. Optimal Hierarchical Decompositions for Congestion Minimization in Networks. In STOC, 2008.
[58] R Tyrrell Rockafellar and Stanislav Uryasev. Conditional Value-at-Risk for General Loss Distributions. Journal of Banking & Finance, 26(7):1443–1471, 2002.
middle) and the two instances of MaxClusterFlow corresponding to the yellow and green clusters on the right. The later steps in Figure 6 do not increase flow and so we conclude that capacity constraints are satisfiable for all non-local edges.
Here, we prove Theorem 1. First, note that the heuristic in §3.2, which only restricts the edges between clusters and the paths on the aggregate graph that can be used by some demands, does not affect the proof in §B.1; that is, edges still receive flow less than their capacity and demand constraints hold.

We now prove that the heuristic will satisfy flow conservation; that is, at any node in the network, for any demand which neither originates nor ends at this node, the net flow is zero, i.e., incoming flow to the node equals the flow leaving that node.

It is easy to see that flow conservation holds for demands whose source and target are in the same cluster even without the heuristic in §3.2 because: (1) Only the instance of MaxClusterFlow for that cluster assigns flow to such a demand. (2) Since MaxClusterFlow invokes FeasibleFlow in Equation 1, the flow is allocated along paths which start and end at the source and target of that demand respectively. (3) Thus, every node that is neither the source nor the target will have incoming flow equal to the outgoing flow.

We now consider the remaining demands, that is, those whose source and target are in different clusters. It is easy to see that for such demands, flow conservation holds at all nodes that do not have edges to or from other clusters by logic that is similar to the above. The MaxClusterFlow instance of the cluster containing such a node would allocate flow to some bundle of demands on paths in this cluster that neither start nor end at such a node.

The only case left is nodes which have edges to and from other clusters. Suppose by contradiction that some demand k violates flow conservation at such a node u. The heuristic in §3.2 allocates flow for demand k along only one path in the aggregated graph and on only one edge between connected clusters.
If the cluster containing u is not on the chosen path or none of the chosen edges are incident on u, then the net flow allocated for k over all edges incident on u will be zero. Let e be that one chosen crossing edge incident on u which can receive non-zero flow for demand k. Observe that all of the other demands whose source and target are in the same clusters as k would also be allocated flow on the same path and edges as k. Thus, all the flow allocated for these demands entering or leaving node u, as the case may be, would be on edge e. Two instances of MaxClusterFlow, one corresponding to the cluster that contains u and another corresponding to the other side of edge e, will assign possibly different flow values for this bundle of demands on edge e. To conclude our proof, note that MinPathE2E takes the minimum flow assigned along all such crossing edges e on the chosen path through the aggregated graph and that SrcTargetMax further breaks open the bundle to assign feasible flow for each actual demand contained in the bundle.

If more than one crossing edge or more than one path on the aggregate graph is used for a demand, it is easy to see how the above proof will break. The two instances of MaxClusterFlow that correspond to the clusters on either side of a crossing edge will be forced by MinPathE2E to only agree on the total volume for the cluster bundle of demands for all edges between the pair of clusters; that is, these instances may allocate different flow on different edges or allocate different flow to individual demands in the bundle. Figure 8 shows simple examples of such disagreement.
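The flow-conservation property used in this proof is easy to check mechanically. The following is a minimal sketch under assumed inputs (a per-demand dictionary of directed edge flows); the helper name and the toy graphs are hypothetical and are not part of NCFlow's code.

```python
from collections import defaultdict


def conservation_violations(edge_flow, source, target, tol=1e-9):
    """Return {node: net inflow} for nodes that violate flow conservation.

    edge_flow maps a directed edge (u, v) to the flow of ONE demand on it.
    The demand's own source and target are exempt from the check.
    """
    net = defaultdict(float)
    for (u, v), flow in edge_flow.items():
        net[u] -= flow  # flow leaving u
        net[v] += flow  # flow entering v
    return {n: x for n, x in net.items()
            if n not in (source, target) and abs(x) > tol}


# Both sides of the crossing edge (u, w) agree on its flow: no violation.
agree = {("s", "u"): 3.0, ("u", "w"): 3.0, ("w", "t"): 3.0}
print(conservation_violations(agree, "s", "t"))

# The two clusters disagree on what leaves u: node u keeps 1 unit of flow.
disagree = {("s", "u"): 3.0, ("u", "w"): 2.0, ("w", "t"): 2.0}
print(conservation_violations(disagree, "s", "t"))
```

The second case mirrors the disagreement that the minimum taken by MinPathE2E is meant to rule out: once both sides are capped to the same value on the single crossing edge, the net flow at u returns to zero.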
Here, we prove Theorem 2. We already discussed in §3.3 the case where the number of clusters, h, is 1 or N , the number of nodes in the graph. To prove optimality for the other sufficient conditions, we posit a helper theorem.
there exists a clustering of nodes into clusters such that any
the method in Figure 6 over those clusters.
Proof. The claim is trivially true by using N clusters, where each node is in a cluster by itself. We show that it is possible to
repetitions allowed), just v but not u, u → v (no repetitions of u or v anywhere else in the path) or v → u. Coalescing each
If Gagg is a tree and there is at most one edge between any
consist of contiguous segments that are contained within each cluster. Thus, per the above theorem, any flow allocated by MaxEdgeFlow (Equation 6) can also be allocated by the method in Figure 6. The only difference then between the global optimization and the method in Figure 6 is that whereas the former is a single optimization call, the latter is a sequence of optimizations. Since demands are satisfiable, however, all of the steps in Figure 6 will allocate the entirety of demand and hence will allocate the maximum amount of flow. Note, in particular, that for the sufficient conditions listed in Theorem 2, a single iteration of the steps in Figure 6 suffices. In §H, we show some counter-examples where NCFlow can lead to sub-optimal allocations when any of these sufficient conditions does not hold.
The SDN controller for NCFlow computes total flow per demand and some splitting ratios after each allocation.

Total Flow: The flow assigned to a demand whose source and target are in different clusters is read off SrcTargetMax, i.e., $f^{4}_{k}$. For intra-cluster demands, their flow is read off MaxClusterFlow, i.e., $f^{2}_{x,k}$ at the cluster x that contains the source and target of demand k. These flow values are summed up over all the iterations used by NCFlow.
Splitting ratios at sources: At source s of cluster x , we have two cases depending on whether the target of the demand is within the cluster x or in some other cluster y.
demand k; the splitting ratio for each path p in the set is $f^{2,p}_{x,k}$ summed up over all iterations, divided by the total flow assigned to demand k above. Here, $f^{2,p}_{x,k}$ is the flow assigned to demand k on path p by the MaxClusterFlow instance for cluster x.
For the latter case, let $z_i$ be the next cluster on the one path that can receive flow in iteration i for all traffic going to targets in cluster y. The splitting ratio for path p in the path set is the value of $\sum_{r \in K_{sy}} f^{2,p}_{x,r}$ summed up over all iterations, where $K_{sy}$ is the set of all demands from source s to targets in cluster y, divided by the total value for all such paths.
Uniquely, note that each source s has a splitting ratio per target t within the same cluster or per target cluster y.
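The splitting-ratio computation at a source can be sketched as follows. This is a simplified illustration rather than the controller's implementation: it assumes the per-iteration path flows for one (source, target) or (source, target cluster) entry are already available as dictionaries, and the function and path names are hypothetical.

```python
def splitting_ratios(per_iteration_path_flows):
    """Sum each path's flow over all iterations, then normalize by the total.

    per_iteration_path_flows: list with one dict per iteration, mapping a
    path identifier to the flow assigned to that path in that iteration.
    """
    totals = {}
    for flows in per_iteration_path_flows:
        for path, flow in flows.items():
            totals[path] = totals.get(path, 0.0) + flow
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {path: 0.0 for path in totals}
    return {path: flow / grand_total for path, flow in totals.items()}


# Hypothetical flows for two paths over two iterations.
iterations = [{"p1": 6.0, "p2": 2.0}, {"p1": 0.0, "p2": 2.0}]
print(splitting_ratios(iterations))  # p1 gets 6/10 of the traffic, p2 gets 4/10
```

Normalizing after summing over iterations, rather than per iteration, yields a single ratio per path, which is what a switch needs to install.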
We call a subset of nodes ingresses if they have at least one edge to a node in another cluster that is chosen by the offline component of NCFlow in §3.4 as a crossing edge.
Splitting ratios at ingresses are computed in a similar way to the splitting ratios at sources. At each ingress node w of cluster y for traffic from cluster x, there are two cases depending on whether the target is some node t in the same cluster as the ingress (y) or in some other cluster z.
For the former case, in iteration i , the splitting ratio for path
by the total over all such paths. As above, Kxt is the set of demands from sources in cluster x to target t.
For the latter case, in iteration i, let zi be the next cluster on the path to targets in z; the splitting ratio for path p in the set
all such paths. As above, Kxz is the set of all demands from sources in cluster x to targets in cluster z.
Note that an ingress node w has splitting ratios only for demands whose chosen path at an iteration contains w's cluster (y) and whose chosen edge enters y at w.
Pathlets: NCFlow sets up label-switched paths (LSPs) between each pair of nodes in each cluster. Which paths to set up is pre-determined by the offline component in §3.4.

Splitting rules: A source s in cluster x has a splitting rule for each other node in the same cluster and for each other cluster. The splitting ratios are as computed in §C.1. In each iteration, at each cluster, at most one ingress node is active per pair of other clusters. This is because the bundle of demands for a given pair of clusters has at most one crossing edge entering a cluster. The active ingress node at a cluster x for the bundle of demands from cluster y to cluster z has one splitting rule when z ≠ x and one splitting rule per target in cluster x when z = x.

Packet content: The LSP (which pathlet to use) is encoded in the L2 header [59]. Additionally, NCFlow has the following tuple in each packet: (x, y, i, e), where x and y are the source and target cluster ids, i is the iteration number of the flow allocation that the packets have been assigned to, and e is the edge to leave the current cluster on. The bits needed are $2\log_2 h + \log_2 I + \log_2(\text{node degree})$.^5 We note that 16 bits of header space suffice for all the WAN topologies and experiments considered in this paper; that is, h ≤ 64 clusters, I ≤ 8 iterations, and up to 2 edges to nodes in other clusters being used per egress node by NCFlow.

Data path actions:
In the flow vector computed by MaxClusterFlow at a cluster x, $f^{2}_{x}$, we use the subscript k to denote a bundle that may include (1) transit demands through cluster x (i.e., from all sources in some other cluster w to targets in some other cluster z), (2) leaving demands (i.e., from a source in cluster x to

^5 The edge id must suffice to distinguish at an egress node between the edges to a particular next cluster; so node degree is an overestimate.
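As a quick check of the header-size estimate for the per-packet tuple (x, y, i, e), the sketch below counts the bits needed for two cluster ids, one iteration number, and one edge id. It assumes each field is encoded in the smallest whole number of bits; the function name is hypothetical.

```python
def header_bits(num_clusters, num_iterations, edges_per_egress):
    """Bits for the tuple (x, y, i, e): two cluster ids, an iteration, an edge id."""
    def bits(n):
        # ceil(log2(n)) for n >= 1, computed exactly with integer arithmetic
        return (n - 1).bit_length()

    return 2 * bits(num_clusters) + bits(num_iterations) + bits(edges_per_egress)


# Limits stated for the experiments: up to 64 clusters, 8 iterations,
# and 2 crossing edges used per egress node.
print(header_bits(64, 8, 2))  # 2*6 + 3 + 1 = 16
```

With these limits the tuple fits exactly in 16 bits, which matches the header budget stated in the packet-content description.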
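$$
\begin{aligned}
\text{TEAVAR*}(V, E, D, P, \beta, S) \triangleq \arg\min_{f}\quad & \alpha + \frac{1}{1-\beta}\Big(\sum_{i \in S} \beta_i\,\mathrm{Excess}_i - \big(1 - \sum_{i \in S} \beta_i\big)\,\alpha\Big) && (5) \\
\text{s.t.}\quad & f \in \mathrm{FeasibleFlow}(V, E, D, P) && \text{(Eqn. 1)} \\
& L_{i,k} \ge 0, \quad \forall i, k && \text{(loss is non-negative)} \\
& L_{i,k} \ge \ldots - \sum_{p \in P_k} f_k^p\,\mathrm{Active}_{p,i}, \quad \forall i, k && \text{(loss)} \\
& \alpha \ge 0 && \text{(loss cutoff)} \\
& \mathrm{Excess}_i \ge 0, \quad \forall i && \text{(excess loss in scenario } i\text{)} \\
& \mathrm{Excess}_i \ge \sum_{k \in D} L_{i,k} - \alpha, \quad \forall i && \text{(excess loss)}
\end{aligned}
$$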
The formulation for TEAVAR* is in Equation 5. Recall that $f_k^p$ is the flow assigned to demand k on path p. $\mathrm{Active}_{p,i}$ is an indicator denoting whether path p is active in fault scenario i. Thus, the allocation for demand k in scenario i will be $\sum_{p \in P_k} f_k^p\,\mathrm{Active}_{p,i}$. When the allocation is below the required
the flow loss for demand k in scenario i.
The flow allocation resulting from the above formulation cannot be promised to the demands; in particular, more flow will be assigned on some paths to account for possible failures on other paths. After solving the above LP, we compute the flow allocation for a demand k as follows: (1) sort the per-
0, add up the probability of each scenario until the running sum is at least β; let $i_\beta$ be the unique crossing index; (3) Set
at the crossing index.
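The post-processing steps above (sort, accumulate scenario probabilities until the running sum reaches β, read off the value at the crossing index) can be sketched as follows. This is a hedged illustration: the sort order (largest per-scenario allocation first) is an assumption of this sketch, and the function name and sample numbers are hypothetical.

```python
def allocation_at_assurance(scenario_allocs, scenario_probs, beta):
    """Flow for one demand that is available with probability at least beta.

    scenario_allocs[i]: allocation the demand receives in fault scenario i.
    scenario_probs[i]:  probability of fault scenario i.
    Assumption: scenarios are visited from the largest allocation (smallest
    loss) to the smallest allocation.
    """
    order = sorted(range(len(scenario_allocs)),
                   key=lambda i: scenario_allocs[i], reverse=True)
    running = 0.0
    for i in order:
        running += scenario_probs[i]
        if running >= beta:
            return scenario_allocs[i]  # value at the crossing index
    return 0.0  # the considered scenarios do not cover probability beta


# Hypothetical demand with three scenarios.
allocs = [10.0, 8.0, 3.0]
probs = [0.90, 0.08, 0.02]
print(allocation_at_assurance(allocs, probs, 0.95))  # 8.0
```

In the example, the no-fault scenario alone covers only 0.90 of the probability mass, so the promised flow drops to the allocation of the next scenario, where the running sum first reaches 0.95.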
Choosing the fault scenarios to use in TEAVAR*:
Figure 18: Breaking down the NCFlow results from Figure 11b into four separate CDFs based on relative total flow.
Observe that the number of scenarios affects the complexity of the TEAVAR* optimization; specifically, the number of
number of nodes. The appropriate choice of fault scenarios to consider, as discussed above, depends on the size of the topology, the failure probability of edges, and the required assurance level β. Suppose one considers all 2-edge failure
Hence, the increase in equations and variables exceeds $N^2 M^2$. Note that MaxFlow is substantially simpler, having at most $O(N^2)$ variables and constraints (Equation 1).

On the topologies listed in Table 5, our implementation of TEAVAR* never ran to completion even after several days because the optimization problem becomes intractably large. We ran with β = 0.99 and link failure probability set to 0.004; both of these are the default values used in [3]. TEAVAR behaves similarly [19]. We conclude that probabilistic fault protection using this methodology is infeasible on large topologies and for non-trivial fault assurance levels such as when considering multiple link failures.

We also note that we are unable to simultaneously achieve the solution quality and the runtimes that are reported in TEAVAR [19] using their code [3]. Specifically, achieving the assurance levels reported in their experiments requires many scenarios to be considered. The runtimes reported in [19] appear to have been measured when considering only single link failures.
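A small calculation illustrates why multi-link failure scenarios are needed at this assurance level. The sketch below assumes that edges fail independently with the same probability, which is a simplification made only for this example; the function names are hypothetical. It uses the 38-edge B4 topology, failure probability 0.004, and β = 0.99 mentioned in the text.

```python
from math import comb


def num_scenarios(num_edges, max_failures):
    """Number of fault scenarios with at most max_failures failed edges."""
    return sum(comb(num_edges, k) for k in range(max_failures + 1))


def coverage(num_edges, p_fail, max_failures):
    """Probability that at most max_failures edges fail (independent failures)."""
    return sum(comb(num_edges, k) * p_fail**k * (1 - p_fail)**(num_edges - k)
               for k in range(max_failures + 1))


def min_failures_for_assurance(num_edges, p_fail, beta):
    """Smallest k such that scenarios with <= k failures cover probability beta."""
    for k in range(num_edges + 1):
        if coverage(num_edges, p_fail, k) >= beta:
            return k
    return num_edges


print(num_scenarios(38, 2))                         # 1 + 38 + 703 = 742
print(min_failures_for_assurance(38, 0.004, 0.99))  # 2
```

Under these assumptions, scenarios with at most one failed edge cover slightly less than 0.99 of the probability mass, so reaching β = 0.99 already requires 2-edge failures even on a 38-edge topology, and the scenario count grows quadratically in the number of edges.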
To further understand the performance of NCFlow, Figure 18 breaks down the results in Figure 11 into four ranges based on total relative flow. We plot CDFs of the speedup ratio per range. The solid blue and dashed green lines, which correspond to relative flow above 0.99 and in [0.8, 0.99) respectively, account for 49% and 46% of all experiments. The figure shows that NCFlow achieves sizable speedups while allocating large
Topology         Edge-Based     Räcke         KSP           NCFlow

Total # FIB Entries
PrivateLarge     945,038,502    52,515,090    22,483,244    1,694,
Kdl              427,524,786    76,794,001    30,199,751    1,876,
PrivateSmall     7,684,182      1,232,866     625,282       139,
Cogentco         7,567,952      2,054,323     915,207       139,
UsCarrier        3,894,542      1,520,821     510,894       82,
Colt             3,534,912      1,048,779     346,905       67,
GtsCe            3,263,696      1,077,350     535,135       101,
TataNld          3,006,720      1,062,629     540,088       93,
DialtelecomCz    2,590,122      1,427,780     529,663       83,
Ion              1,922,000      886,414       418,362       71,
Deltacom         1,417,472      459,159       246,811       53,
Interoute        1,306,910      483,960       249,979       32,
Uninett2010      394,346        133,742       57,428        21,

Maximum # FIB Entries
PrivateLarge     962,361        828,397       313,850       18,
Kdl              567,009        576,274       309,575       16,
PrivateSmall     38,809         49,663        21,796        3,
Cogentco         38,416         60,676        30,601        3,
UsCarrier        24,649         41,897        17,822        2,
Colt             23,104         47,077        17,344        3,
GtsCe            21,904         36,070        15,477        2,
TataNld          20,736         24,776        13,179        2,
DialtelecomCz    18,769         34,014        11,084        1,
Ion              15,376         25,261        12,954        1,
Deltacom         12,544         25,135        13,029        1,
Interoute        11,881         14,182        8,346         710
Uninett2010      5,329          8,891         3,626         868

Table 6: Number of FIB entries for NCFlow vs. edge-based formulations (e.g., Fleischer-Edge), path-based formulations using Räcke Randomized Routing Trees (SMORE), and path-based formulations using k-shortest paths (PF4, Fleischer-Path, TEAVAR) on every topology.
amounts of flow. Figure 19 further breaks down the aggregate results from Figure 11 across various aspects of interest.

In the two left-most columns, we break down the results by different settings of α, which illustrates how NCFlow performs on both under-subscribed (α = {1, 8}) and over-subscribed (α = {32, 64, 128}) traffic matrices. In the former case, NCFlow is typically able to fully satisfy the TM's requested demand, thereby matching the total flow allocated by the other methods. At the same time, NCFlow is strictly faster on all TMs, except for those belonging to smaller topologies (e.g., Uninett2010), which we discuss later on. As α increases, so, too, does NCFlow's runtime advantage; however, this does come at the cost of the total flow allocated. For example, when α = 32, we see many instances where NCFlow is > 100× faster than PF4, but allocates 75% of PF4's total flow in the worst case. This effect becomes more evident for the largest settings of α: here, the speedups are > 1000×, but more flow is sacrificed for some TMs. This behavior occurs perhaps because, as the traffic volume increases and the topology becomes more congested, paths that are not allowed by NCFlow's scheme become more critical for maximizing the total flow.

In the middle two columns, we break down the results by traffic model. NCFlow tends to perform best when demands are highly concentrated within clusters. In the bottom middle plot (Poisson, δ → 0), we see that NCFlow allocates > 90% of PF4's total flow for almost every TM, while still achieving speedups > 100×. Recall that as δ → 0 in the Poisson traffic model, the traffic volume between clusters decreases, thus generating concentrated demands. In contrast, when δ → 1, demands are less concentrated, which leads to worse performance for NCFlow in terms of total flow, but not in terms of runtime.

Finally, in the two right-most columns, we break down the results by topology size. On Uninett2010, the smallest topology in our evaluation set, NCFlow's trade-off between total flow and runtime is not much better than the other baselines, particularly Fleischer-Edge. As the topology size increases, NCFlow's advantage becomes more apparent. On Colt, NCFlow offers faster runtimes and sacrifices little flow, no more than 10% less than PF4. On PrivateSmall and Kdl, NCFlow's speedup increases even more: > 100× faster than PF4 on the majority of cases on Kdl. But flow is sacrificed, particularly for large values of α. However, NCFlow's trade-off is still favorable compared to other methods: for Kdl, we see multiple instances where NCFlow achieves 1,000× speedups at only a 20% reduction in flow.

For PrivateLarge, we see both the biggest speedups and the smallest fraction of total flow relative to PF4. As previously discussed, the outlier coincides with a highly over-subscribed TM (α = 128). When we move to other regimes on PrivateLarge, NCFlow's performance improves: on 31 of the 400 TMs with α ∈ {32, 64}, NCFlow is > 1,000× faster than PF4 while achieving > 80% of PF4's total flow.

In summary, we can see in this panel of CDF plots where NCFlow's strengths lie: on (1) large topologies, and (2) TMs with moderate demand volumes that are highly concentrated within the topology.
For each topology, we evaluate the three different clustering techniques mentioned in §3.4; on each topology we ask each technique to compute the number of clusters listed in Table 5. Figure 21 shows CDFs of the ratio of total flow and latency speed-up of a clustering technique relative to that achieved by using FMPartitioning; thus values to the left of x = 1 indicate worse performance compared to FMPartitioning while those on the right indicate better performance. The figure shows that clusters discovered by FMPartitioning almost always let NCFlow carry more flow (red lines); using either spectral clustering or leader election leads to a noticeably smaller allocation in about 20% and 40% of the cases, respectively. The figure shows a less clear-cut separation on latency speed-up; clusters discovered by leader election offer more speedup in over 30% of the experiments. Overall, we see that FMPartitioning performs better on average but not in all cases.
Figure 22 shows a CDF of the normalized path latency for de-