
















This document describes the use of in-network aggregation to accelerate distributed machine learning workloads. The approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. It discusses the challenges of building an in-network aggregation primitive using programmable switches and how SwitchML addresses these challenges. The document also shows that it is possible for a programmable network device to perform in-network aggregation at line rate.
Training machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design the switch processing with the end-host protocols and ML frameworks to provide an efficient solution that speeds up training by up to 5.5× for a number of real-world benchmark models.
The remarkable success of today’s machine learning (ML) solutions derives from the ability to build increasingly sophisticated models on increasingly large data sets. To cope with the resulting increase in training time, ML practitioners use distributed training [1, 22]. Large-scale clusters use hundreds of nodes, each equipped with multiple GPUs or other hardware accelerators (e.g., TPUs [48]), to run training jobs on tens of workers that take many hours or days.

Distributed training is increasingly a network-bound workload. To be clear, it remains computationally intensive. But the last seven years have brought a 62× improvement in compute performance [64, 78], thanks to GPUs [74] and other hardware accelerators [11, 34, 35, 48]. Cloud network deployments have found this pace hard to match, skewing the ratio of computation to communication towards the latter. Since parallelization techniques like mini-batch stochastic gradient descent (SGD) training [37, 43] alternate computation with synchronous model updates among workers, network performance now has a substantial impact on training time.

Can a new type of accelerator in the network alleviate the network bottleneck? We demonstrate that an in-network
∗Equal contribution. Amedeo Sapio is affiliated with Barefoot Networks, but was at KAUST during much of this work.
aggregation primitive can accelerate distributed ML workloads, and can be implemented using programmable switch hardware [5, 10]. Aggregation reduces the amount of data transmitted during synchronization phases, which increases throughput, reduces latency, and speeds up training.

Building an in-network aggregation primitive using programmable switches presents many challenges. First, the per-packet processing capabilities are limited, and so is on-chip memory. We must limit our resource usage so that the switch can perform its primary function of conveying packets. Second, the computing units inside a programmable switch operate on integer values, whereas ML frameworks and models operate on floating-point values. Finally, the in-network aggregation primitive is an all-to-all primitive, unlike traditional unicast or multicast communication patterns. As a result, in-network aggregation requires mechanisms for synchronizing workers and detecting and recovering from packet loss.

We address these challenges in SwitchML, showing that it is indeed possible for a programmable network device to perform in-network aggregation at line rate. SwitchML is a co-design of in-switch processing with an end-host transport layer and ML frameworks. It leverages the following insights. First, aggregation involves a simple arithmetic operation, making it amenable to parallelization and pipelined execution on programmable network devices. We decompose the parameter updates into appropriately-sized chunks that can be individually processed by the switch pipeline. Second, aggregation for SGD can be applied separately on different portions of the input data, disregarding order, without affecting the correctness of the final result.
We tolerate packet loss through the use of a light-weight switch scoreboard mechanism and a retransmission mechanism driven solely by end hosts, which together ensure that workers operate in lock-step without any decrease in switch aggregation throughput. Third, ML training is robust to modest approximations in its compute operations. We address the lack of floating-point support in switch dataplanes by having the workers scale and convert floating-point values to fixed-point using an adaptive scaling factor with negligible approximation loss.
SwitchML integrates with distributed ML frameworks, such as PyTorch and TensorFlow, to accelerate their communication, and enable efficient training of deep neural networks (DNNs). Our initial prototype targets a rack-scale architecture, where a single switch centrally aggregates parameter updates from serviced workers. Though the single switch limits scalability, we note that commercially-available programmable switches can service up to 64 nodes at 100 Gbps or 256 at 25 Gbps. As each worker is typically equipped with multiple GPUs, this scale is sufficiently large to push the statistical limits of SGD [32, 43, 50, 98]. We show that SwitchML’s in-network aggregation yields end-to-end improvements in training performance of up to
In the distributed setting, ML training yields a high-performance networking problem, which we highlight below after reviewing the traditional ML training process.
Supervised ML problems, including logistic regression, support vector machines and deep learning, are typically solved by iterative algorithms such as stochastic gradient descent (SGD) or one of its many variants (e.g., using momentum, mini-batching, importance sampling, preconditioning, variance reduction) [72, 73, 83, 90]. A common approach to scaling to large models and datasets is data-parallelism, where the input data is partitioned across workers.^1 Training in a data-parallel, synchronized fashion on $n$ workers can be seen as learning a model $x \in \mathbb{R}^d$ over input/training data $D$ by performing iterations of the form $x^{t+1} = x^t + \sum_{i=1}^{n} \Delta(x^t, D_i^t)$, where $x^t$ is a vector of model parameters^2 at iteration $t$, $\Delta(\cdot, \cdot)$ is the model update function^3 and $D_i^t$ is the data subset used
(^1) In this paper, we do not consider model-parallel training [28, 82], although that approach also requires efficient networking. Further, we focus exclusively on widely-used distributed synchronous SGD [1, 37].
(^2) In applications, $x$ is typically a 1, 2, or 3 dimensional tensor. To simplify notation, we assume its entries are vectorized into one $d$ dimensional vector.
(^3) We abstract learning rate (step size) and model averaging inside $\Delta$.
at worker $i$ during that iteration. The key to data parallelism is that each worker $i$, in parallel, locally computes the update $\Delta(x^t, D_i^t)$ to the model parameters based on the current model $x^t$ and a mini-batch, i.e., a subset of the local data $D_i^t$. Typically, a model update contributed by worker $i$ is a multiple of the stochastic gradient of the loss function with respect to the current model parameters $x^t$ computed across a mini-batch of training data, $D_i^t$. Subsequently, workers communicate their updates, which are aggregated ($\sum$) and added to $x^t$ to form the model parameters of the next iteration. Importantly, each iteration acts only on a mini-batch of the training data. It requires many iterations to progress through the entire dataset, which constitutes a training epoch. A typical training job requires multiple epochs, reprocessing the full training data set, until the model achieves acceptable error on a validation set.

From a networking perspective, the challenge is that data-parallel SGD requires computing the sum of model updates across all workers after every iteration. Each model update has as many parameters as the model itself, so updates are often in the range of hundreds of MB to GBs. And their size is growing exponentially: today’s largest models exceed 32 GB [84]. These aggregations need to be performed frequently, as increasing the mini-batch size hurts convergence [66]. Today’s ML toolkits implement this communication phase in one of two ways:

The parameter server (PS) approach. In this approach, workers compute model updates and send them to parameter servers [45, 56, 64]. These servers, usually dedicated machines, aggregate updates to compute and distribute the new model parameters. To prevent the PS from becoming a bottleneck, the model is sharded over multiple PS nodes.

The all-reduce approach.
An alternate approach uses the workers to run an all-reduce algorithm – a collective communication technique common in high-performance computing – to combine model updates. The workers communicate over an overlay network. A ring topology [6], where each worker communicates to the next neighboring worker on the ring, is common because it is bandwidth-optimal (though its latency grows with the number of workers) [79]. Halving and doubling uses a binary tree topology [93] instead.
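To make the ring pattern concrete, the following is a minimal single-process sketch of ring all-reduce (a reduce-scatter phase followed by an all-gather phase). It is an illustration only, not the implementation used by any library cited here: the function name `ring_all_reduce` is hypothetical, workers are simulated as Python lists, and the vector length is assumed to be a multiple of the number of workers.

```python
from typing import List


def ring_all_reduce(vectors: List[List[float]]) -> List[List[float]]:
    """Simulate ring all-reduce (sum) over n workers in one process.

    Hypothetical helper for illustration. vectors[i] is worker i's model
    update; all updates have the same length, assumed divisible by n.
    Returns one buffer per worker, each holding the element-wise sum.
    """
    n = len(vectors)
    size = len(vectors[0])
    assert size % n == 0, "sketch assumes the update splits into n equal chunks"
    c = size // n
    bufs = [list(v) for v in vectors]  # work on copies of the inputs

    def chunk(j: int) -> range:
        return range(j * c, (j + 1) * c)

    # Phase 1, reduce-scatter: in each of n-1 steps, worker i sends one chunk
    # to its right neighbor, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        msgs = []
        for i in range(n):
            j = (i - step) % n
            msgs.append(((i + 1) % n, j, [bufs[i][x] for x in chunk(j)]))
        for dst, j, data in msgs:  # deliver after all sends are captured
            for x, v in zip(chunk(j), data):
                bufs[dst][x] += v

    # Now worker i holds the fully reduced chunk (i + 1) % n.
    # Phase 2, all-gather: circulate the reduced chunks around the ring.
    for step in range(n - 1):
        msgs = []
        for i in range(n):
            j = (i + 1 - step) % n
            msgs.append(((i + 1) % n, j, [bufs[i][x] for x in chunk(j)]))
        for dst, j, data in msgs:
            for x, v in zip(chunk(j), data):
                bufs[dst][x] = v
    return bufs
```

In this sketch each worker sends 2(n − 1) chunks of size 1/n of the update, so the per-worker traffic stays below twice the update size regardless of n, while the number of sequential steps grows with n. This matches the bandwidth and latency trade-off noted above.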
Fundamentally, training alternates compute-intensive phases with communication-intensive model update synchronization. Workers produce intense bursts of traffic to communicate their model updates, whether it is done through a parameter server or all-reduce, and training stalls until it is complete.

Recent studies have shown that the performance bottleneck in distributed training is increasingly shifting from compute to communication [64]. This shift comes from two sources. The first is a result of advances in GPUs and other compute accelerators. For example, the recently released NVIDIA A100 offers 10× and 20× performance improvements for floating-point
[Figure 1 panels: (b) Received 2 aggregates; 2 more aggregates in-flight; last 2 pieces sent. (c) Model update fully aggregated.]
Figure 1: Example of in-network aggregation of model updates. Ui is the model update computed by worker i. Workers stream pieces of model updates in a coordinated fashion. In the example, each worker can have at most 4 outstanding packets at any time to match the slots in the switch. The switch aggregates updates and multicasts back the values, which are collected into the aggregated model update Ai, then used to form the model parameters of the next iteration.
ring all-reduce performance. INA is consistently superior to RAR. For communication-bound models (the four models in the 100 Gbps case), INA is up to 80% and up to 67% faster at 10 and 100 Gbps, respectively. Note that this analysis reflects a theoretically optimal implementation of RAR. The measured speedups (§6) of our real INA implementation are higher, because real RAR implementations do not achieve optimal performance; it is difficult to fully exploit all available bandwidth and avoid system overheads.
We also note that our profiling environment uses NVIDIA P100 devices. These GPU accelerators are currently two generations old. We investigate with real benchmarks in §6 the impact of faster GPUs, which increases the relative impact of communication overheads.
Alternative: gradient compression. Another way to reduce communication costs is to reduce the data volume of model updates using lossy compression. Proposed approaches include reducing the bit-width of gradient elements (quantization) or transmitting only a subset of elements (sparsification). These approaches come with tradeoffs: too much compression loss can impact the resulting model accuracy.
We adopt the results of a recent survey of gradient compression methods [96] to emulate the behavior of Top-k [3] and QSGD [4] as two representative sparsification and quantization compressors, respectively. We use data from that study to identify the compression overhead and data reduction achieved. Our synthetic communication time, then, includes both the computational cost of compression and the communication cost of the all-gather operation used to exchange model updates (following their implementation [96]).

Model        INA     QSGD 64   QSGD 256   Top-k 1%         Top-k 10%
10 Gbps
DeepLight    1.80    1.27      0.97       9.24 (-1.1%)     1.05 (-0.9%)
LSTM         1.77    1.27      0.97       7.49             1.
NCF          1.54    1.22      0.96       4.07             1.05 (-2.2%)
BERT         1.54    1.20      0.98       3.45 (†)         1.04 (†)
VGG19        1.60    1.22      0.97       2.13 (-10.4%)    1.04 (-3.3%)
UGATIT       1.22    1.12      0.99       1.58             1.
ResNet-50    1.05    1.07      0.95       1.15 (-1.7%)     1.02 (+0.2%)
SSD          1.01    1.00      1.00       1.01 (-2.4%)     1.00 (-0.6%)
100 Gbps
DeepLight    ⁄1.67   0.93      0.78       2.96 (-1.1%)     0.47 (-0.9%)
LSTM         ⁄1.20   0.98      0.84       1.37             0.
NCF          1.22    1.00      0.85       1.22             0.65 (-2.2%)
BERT         ⁄1.14   0.98      0.92       1.27 (†)         0.74 (†)
† The BERT task is fine-tuning from a pre-trained model, for which compression does not have a noticeable impact. The impact during pretraining is analyzed in Appendix E.
Table 2: Analysis of batch processing speedup relative to ring all-reduce based on synthetic communication. For Top-k compression, impact on model quality is shown in parentheses. Accuracy penalties greater than 1% are shaded in gray; red indicates failure to converge. At 100 Gbps, only the models that are network bottlenecked are shown. ⁄ indicates 100 Gbps cases where SwitchML achieves a higher batch processing speedup due to practical system overheads.

We observe (Table 2) that, although gradient compression decreases data volume, it is not necessarily superior to INA. In general, the computational cost of compression and decompression is non-negligible [58, 96]; in some cases, it outweighs the communication-reduction benefits. In particular, INA outperforms QSGD on all workloads for both the 64 and 256 levels (6 and 8 bits). Similarly, Top-k underperforms INA at the 10% compression level, and even reduces performance relative to RAR in the 100 Gbps setting. These observations agree with recent work [58, 96]. In particular, Li et al. [58] proposed additional hardware offloading, using an FPGA at every worker, to mitigate compression costs.
As this requires additional hardware, our analysis does not consider it.

Gradient compression does outperform INA when it can achieve high compression ratios, as with Top-k at 1%. However, in many cases, this level of compression either requires more training iterations to converge, or hurts the accuracy of the resulting model [96]. For example, the NCF model achieves 95.8% hit rate without compression after 20 epochs of training, while with Top-k compression at 10% it achieves 93.6%. It fails to converge at 1% compression. We report convergence comparisons for various models in Appendix D.
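As a rough illustration of what sparsification does to a gradient, the sketch below keeps only the largest-magnitude fraction of elements and sends them as (index, value) pairs. It is a generic Top-k selection written for this example, not the compressor implementations from [3] or [96]; the names `top_k_sparsify` and `densify` and the `ratio` parameter are assumptions.

```python
import heapq
from typing import List, Tuple


def top_k_sparsify(grad: List[float], ratio: float) -> List[Tuple[int, float]]:
    """Keep the fraction `ratio` of elements with the largest magnitude.

    Hypothetical helper: returns (index, value) pairs sorted by index,
    which is what a worker would transmit instead of the dense gradient.
    """
    k = max(1, int(len(grad) * ratio))
    keep = heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i]))
    return sorted((i, grad[i]) for i in keep)


def densify(pairs: List[Tuple[int, float]], size: int) -> List[float]:
    """Rebuild a dense gradient, treating every dropped element as zero."""
    out = [0.0] * size
    for i, v in pairs:
        out[i] = v
    return out
```

The dropped elements are where the accuracy trade-off discussed above comes from: at a 1% ratio, 99% of each update is replaced by zeros in this sketch, and the selection step itself costs compute time on every iteration.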
Our system, SwitchML, implements the aggregation primitive in a programmable dataplane switch. Such switches are now
commercially available, with only a small cost premium compared to fixed-function switches [5]. In-network aggregation is conceptually straightforward, but implementing it inside a programmable switch is challenging. Although programmable switches allow placing computation into the network path, their limited computation and storage capabilities impose constraints on implementing gradient aggregation. The system must also tolerate packet loss, which, although uncommon in the rack-scale cluster environment, is nevertheless possible for long-running DNN training jobs. SwitchML addresses these challenges by appropriately dividing the functionality between the hosts and the switches, resulting in an efficient and reliable streaming aggregation protocol.
Limited computation. Mathematically, gradient aggregation is the average over a set of floating-point vectors. While a seemingly simple operation, it exceeds the capabilities of today’s programmable switches. As they must maintain line rate processing, the number of operations they can perform on each packet is limited. Further, the operations themselves can only be simple integer arithmetic/logic operations; neither floating-point nor integer division operations are possible.
Limited storage. Model updates are large. In each iteration, each worker may supply hundreds of megabytes of gradient values. This volume far exceeds the on-switch storage capacity, which is limited to a few tens of MB and must be shared with forwarding tables and other core switch functions. This limitation is unlikely to change in the future [10], given that speed considerations require dataplane-accessible storage to be implemented using on-die SRAM.
Packet loss. SwitchML must be resilient to packet loss, without impact on efficiency or correctness (e.g., discarding part of an update or applying it twice because of retransmission).
SwitchML aims to alleviate communication bottlenecks for distributed ML training applications using in-network aggregation, in a practical cluster setting.^6 SwitchML uses the following techniques to reduce communication costs while meeting the above challenges.
Combined switch-host architecture. SwitchML carefully partitions computation between end-hosts and switches to circumvent the restrictions of the limited computational power at switches. The switch performs integer aggregation, while end-hosts are responsible for managing reliability and performing more complex computations.
(^6) For simplicity, we assume dedicated bandwidth for the training jobs. We also assume that worker, link or switch failures are handled by the ML framework, as is common in practice [1, 56].
Algorithm 1 Switch logic.
  Initialize State:
    n = number of workers
    pool[s], count[s] := {0}
  upon receive p(idx, off, vector)
    pool[p.idx] ← pool[p.idx] + p.vector    {+ is the vector addition}
    count[p.idx]++
    if count[p.idx] = n then
      p.vector ← pool[p.idx]
      pool[p.idx] ← 0; count[p.idx] ← 0
      multicast p
    else
      drop p

Pool-based streaming aggregation. A complete model update far exceeds the storage capacity of a switch, so it cannot aggregate entire vectors at once. SwitchML instead streams aggregation through the switch: it processes the aggregation function on a limited number of vector elements at once. The abstraction that makes this possible is a pool of integer aggregators. In SwitchML, end hosts handle the management of aggregators in a pool – determining when they can be used, reused, or need more complex failure handling – leaving the switch dataplane with a simple design.

Fault tolerant protocols. We develop lightweight schemes to recover from packet loss with minimal overheads and adopt traditional mechanisms to solve worker or network failures.

Quantized integer-based aggregation. Floating-point operations exceed the computational power of today’s switches. We instead convert floating-point values to 32-bit integers using a block floating-point-like approach [25], which is done efficiently at end hosts without impacting training accuracy.

We now describe each of these components in turn. To ease the presentation, we describe a version of the system in which packet losses do not occur. We remove this restriction later.
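Algorithm 1 and the pool abstraction can be exercised with a small loss-free simulation. The sketch below is a plain Python model, not switch dataplane code: the `Switch` class mirrors Algorithm 1, while the `aggregate` driver is an assumption made for this example. It stands in for the workers by sending one chunk per slot, then sending the next chunk for a slot (at offset `off + k * s`) once that slot's result comes back. The update length is assumed to be a multiple of `k`.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    idx: int           # pool slot index
    off: int           # offset of this chunk within the model update
    vector: List[int]  # k integers to aggregate


class Switch:
    """Plain-Python model of Algorithm 1 (loss-free case)."""

    def __init__(self, n: int, s: int, k: int) -> None:
        self.n, self.k = n, k
        self.pool = [[0] * k for _ in range(s)]
        self.count = [0] * s

    def receive(self, p: Packet) -> Optional[Packet]:
        slot = self.pool[p.idx]
        for j, v in enumerate(p.vector):  # vector addition into the slot
            slot[j] += v
        self.count[p.idx] += 1
        if self.count[p.idx] == self.n:
            result = Packet(p.idx, p.off, list(slot))
            self.pool[p.idx] = [0] * self.k  # release the slot for reuse
            self.count[p.idx] = 0
            return result                    # stands for "multicast p"
        return None                          # stands for "drop p"


def aggregate(updates: List[List[int]], s: int, k: int) -> List[int]:
    """Hypothetical driver: stream all workers' updates through s slots."""
    size = len(updates[0])
    assert size % k == 0, "sketch assumes the update is a multiple of k"
    switch = Switch(len(updates), s, k)
    result = [0] * size
    wire = deque()  # packets in flight towards the switch
    for i in range(s):  # initial window: one chunk per slot
        if k * i < size:
            wire.extend(Packet(i, k * i, u[k * i:k * i + k]) for u in updates)
    while wire:
        out = switch.receive(wire.popleft())
        if out is None:
            continue
        result[out.off:out.off + k] = out.vector
        nxt = out.off + k * s  # next chunk that maps to the same slot
        if nxt < size:
            wire.extend(Packet(out.idx, nxt, u[nxt:nxt + k]) for u in updates)
    return result
```

The switch state is `s * k` integers no matter how large the update is, which is the point of streaming: for example, `aggregate(updates, s=2, k=2)` sums 8-element updates while only ever holding 4 partial sums.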
We begin by describing the core network primitive provided by SwitchML: in-switch integer aggregation. A SwitchML switch provides a pool of s integer aggregators, addressable by index. Each slot in the pool aggregates a vector of k integers, which are delivered all at the same time in one update packet. The aggregation function is the addition operator, which is commutative and associative – meaning that the result does not depend on the order of packet arrivals. Note that addition is a simpler form of aggregation than ultimately desired: model updates need to be averaged. As with the all-reduce approach, we leave the final division step to the end hosts, as the switch cannot efficiently perform this.

Algorithm 1 illustrates the behavior of the aggregation primitive. A packet p carries a pool index, identifying the particular aggregator to be used, and contains a vector of k integers to be aggregated. Upon receiving a packet, the switch aggregates the packet’s vector (p.vector) into the slot addressed by the packet’s pool index (p.idx). Once the slot has
Algorithm 3 Switch logic with packet loss recovery.
  Initialize State:
    n = number of workers
    pool[2, s], count[2, s], seen[2, s, n] := {0}
  upon receive p(wid, ver, idx, off, vector)
    if seen[p.ver, p.idx, p.wid] = 0 then
      seen[p.ver, p.idx, p.wid] ← 1
      seen[(p.ver+1)%2, p.idx, p.wid] ← 0
      count[p.ver, p.idx] ← (count[p.ver, p.idx]+1)%n
      if count[p.ver, p.idx] = 1 then
        pool[p.ver, p.idx] ← p.vector
      else
        pool[p.ver, p.idx] ← pool[p.ver, p.idx] + p.vector
      if count[p.ver, p.idx] = 0 then
        p.vector ← pool[p.ver, p.idx]
        multicast p
      else
        drop p
    else
      if count[p.ver, p.idx] = 0 then
        p.vector ← pool[p.ver, p.idx]
        forward p to p.wid
      else
        drop p
order to keep switch dataplane complexity low, packet loss detection is done by the workers if they do not receive a response packet from the switch in a timely manner. However, naïve retransmission creates its own problems. If a worker retransmits a packet that was actually delivered to the switch, it can cause a model update to be applied twice to the aggregator. On the other hand, if a worker retransmits a packet for a slot that was actually already fully aggregated (e.g., because the response was lost), the model update can be applied to the wrong data because the slot could have already been reused by other workers who received the response correctly. Thus, the challenges are (1) to be able to differentiate packets that are lost on the upward paths versus the downward ones; and (2) to be able to retransmit an aggregated response that is lost on the way back to a worker.
We modify the algorithms to address these issues by keeping two additional pieces of switch state. First, we explicitly maintain information as to which workers have already contributed updates to a given slot. This makes it possible to ignore duplicate transmissions. Second, we maintain a shadow copy of the previous result for each slot. That is, we have two copies or versions of each slot, organized in two pools; workers alternate between these two copies to aggregate successive chunks that are assigned to the same slot. The shadow copy allows the switch to retransmit a dropped result packet for a slot even when the switch has started reusing the slot for the next chunk.
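The two pieces of state, the per-worker `seen` record and the two-version pool, can be modeled in a few lines. The sketch below follows Algorithm 3 step by step in plain Python; it is not dataplane code, the class name `LossTolerantSwitch` and the tuple return values are choices made for this example, and it assumes at least two workers.

```python
from typing import List, Optional, Tuple

Action = Tuple[str, Optional[List[int]]]


class LossTolerantSwitch:
    """Plain-Python model of Algorithm 3 (assumes n >= 2 workers)."""

    def __init__(self, n: int, s: int, k: int) -> None:
        assert n >= 2, "sketch follows the modulo-n counter, which needs n >= 2"
        self.n = n
        # Two versions of every slot: the active copy and the shadow copy.
        self.pool = [[[0] * k for _ in range(s)] for _ in range(2)]
        self.count = [[0] * s for _ in range(2)]
        self.seen = [[[False] * n for _ in range(s)] for _ in range(2)]

    def receive(self, wid: int, ver: int, idx: int, vector: List[int]) -> Action:
        """Return ("multicast", v), ("unicast", v) or ("drop", None)."""
        if not self.seen[ver][idx][wid]:
            # First copy of this update: record it and clear the worker's
            # mark in the other version so that version can be reused later.
            self.seen[ver][idx][wid] = True
            self.seen[1 - ver][idx][wid] = False
            self.count[ver][idx] = (self.count[ver][idx] + 1) % self.n
            if self.count[ver][idx] == 1:
                self.pool[ver][idx] = list(vector)  # overwrite the stale result
            else:
                self.pool[ver][idx] = [
                    a + b for a, b in zip(self.pool[ver][idx], vector)
                ]
            if self.count[ver][idx] == 0:  # wrapped around: all n arrived
                return ("multicast", list(self.pool[ver][idx]))
            return ("drop", None)
        # Duplicate: either the slot is still filling (ignore it) or the
        # result exists and the worker missed it (resend to that worker only).
        if self.count[ver][idx] == 0:
            return ("unicast", list(self.pool[ver][idx]))
        return ("drop", None)
```

A retransmitted update is therefore never added twice, and a worker that lost a response can still fetch the old result from the shadow copy after other workers have moved on to the other version of the slot.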
The key insight behind this approach’s correctness is that, even in the presence of packet losses, our self-clocking strategy ensures that no worker node can ever lag more than one chunk behind any of the others for a particular slot. This invariant holds because the switch will not release a slot to be reused, by sending a response, until it has received an update packet from every worker for that slot. Furthermore, a worker will not send the next chunk for a slot until it has received the response packet for the slot’s previous chunk, preventing the system from moving ahead further. As a result, it is sufficient to keep only one shadow copy. Besides obviating the need for more than one shadow copy, this has a secondary benefit: the switch does not need to track full phase numbers (or offsets); a single bit is enough to distinguish the two active phases for any slot.

Algorithm 4 Worker logic with packet loss recovery.
  for i in 0 : s do
    p.wid ← Worker ID
    p.ver ← 0
    p.idx ← i
    p.off ← k · i
    p.vector ← U[p.off : p.off + k]
    send p
    start_timer(p)
  repeat
    receive p(wid, ver, idx, off, vector)
    cancel_timer(p)
    A[p.off : p.off + k] ← p.vector
    p.off ← p.off + k · s
    if p.off < size(U) then
      p.ver ← (p.ver+1)%2
      p.vector ← U[p.off : p.off + k]
      send p
      start_timer(p)
  until A is complete

  upon timeout p    /* Timeout Handler */
    send p
    start_timer(p)

In keeping with our principle of leaving protocol complexity to end hosts, the shadow copies are kept in the switch but managed entirely by the workers. The switch simply exposes the two pools to the workers, and the packets specify which slot acts as the active copy and which as the shadow copy by indicating a single-bit pool version (ver) field in each update packet. The pool version starts at 0 and alternates each time a slot with the same idx is reused. Algorithms 3 and 4 show the details of how this is done. An example illustration is in Appendix A.

In the common case, when no losses occur, the switch receives updates for slot idx, pool ver from all workers. When workers receive the response packet from the switch, they change the pool by flipping the ver field – making the old copy the shadow copy
reuse only when there is the certainty that all the workers have received the slot’s aggregated result. Slot reuse happens when all the workers have sent their updates to the same slot of the other pool, signaling that they have all moved forward. Note this scheme works because the completion of aggregation for a slot idx in one pool safely and unambiguously confirms that the previous aggregation result in the shadow copy of slot idx has indeed been received by every worker.

This mechanism’s main cost is switch memory usage: keeping a shadow copy doubles the memory requirement, and tracking the seen bitmask adds additional cost. This may appear problematic, as on-switch memory is a scarce resource. In practice, however, the total number of slots needed – tuned based on the network bandwidth-delay product (Appendix C)
DNN training commonly uses floating-point numbers, but current programmable switches do not natively support them. We explored two approaches to bridging this gap.

Floating-point numbers are already an approximation. SGD and similar algorithms are defined over real numbers. Floating-point numbers approximate real numbers by trading off range, precision, and computational overhead to provide a numerical representation that can be broadly applied to applications with widely different properties. However, many other approximations are possible. An approximation designed for a specific application can obtain acceptable accuracy with lower overhead than standard floating-point offers.

In recent years, the community has explored many specialized numerical representations for DNNs. These representations exploit the properties of the DNN application domain to reduce the cost of communication and computation. For instance, NVIDIA Volta and Ampere GPUs [17, 74] include mixed-precision (16-/32-bit) Tensor Cores that can train with accuracy matching full-precision approaches. Other work has focused on gradient exchange for SGD, using fixed-point quantization, dithering, or sparsification to reduce both the number of bits and the gradient elements transmitted [7, 8, 60, 69, 88, 95, 99]. Further, others have explored block floating-point representations [25, 53], where a single exponent is shared by multiple tensor elements, reducing the amount of computation required to perform tensor operations. This innovation will continue (as work [40, 70] that builds upon our architecture demonstrates); our goal is not to propose new representations but to demonstrate that techniques like those in the literature are practical with programmable switches.

We use a numeric representation, inspired by block floating-point, that combines 32-bit fixed-point addition in the switch with adaptive scaling on the workers.
This representation is used only when aggregating gradients; all other data (weights, activations) remain in 32-bit floating-point representation.
[Figure 2 plot: test accuracy versus epoch (0 to 40) for SwitchML and Baseline.]
Figure 2: Test accuracy of ResNet-110 on CIFAR10. SwitchML achieves similar accuracy to the baseline.

To implement our representation, we scale gradient values using a per-packet scaling factor $f$, which is automatically determined for each use of an aggregator slot in the switch. The scaling factor is set so that the maximum aggregated floating point value within a block of $k$ gradients is still representable as a 32-bit fixed point value. Namely, let $h$ be the largest absolute value of a block of gradients; $f$ is set to $(2^{31} - 1)/(n \cdot 2^m)$, where $m$ is the exponent of $h$ rounded up to a power of 2 and $n$ is the number of workers. Appendix E formally analyzes the precision of this representation.

To realize this quantization of floating-point values, workers need to agree on a global value of $m$ prior to sending the corresponding block of gradients. We devise a simple look-ahead strategy: when workers send the $j$-th block to slot $i$, they include their local block $j+1$’s maximum gradient (rounded up to a power of 2). The switch identifies the global maximum $m$ and piggy-backs that value when sending the aggregated gradients of the $j$-th block.

We verify experimentally that this communication quantization allows training to similar accuracy in a similar number of iterations as an unquantized network. We illustrate the convergence behavior by training a ResNet-110 model on the CIFAR10 dataset for 64,000 steps (about 41 epochs) using 8 workers. Figure 2 shows the test accuracy over time. The accuracy obtained by SwitchML (about 91-93% in the last 5 points) is similar to that obtained by training with TensorFlow on the same worker setup, and it matches prior results [38] with the same hyperparameter settings. The training loss curves (not shown) show the same similarity. In Appendix E, we further give a detailed convergence analysis for the aforementioned representation on models in Table 1.
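The scaling rule can be tried out with a short sketch. The code below applies $f = (2^{31} - 1)/(n \cdot 2^m)$ to one block of gradients per worker, sums the resulting integers as the switch would, and converts the sum back to floating point. Several details are assumptions made for this example rather than taken from the text: the helper names, the use of `math.frexp` to obtain an exponent $m$ with $h < 2^m$, truncation toward zero when converting to integers (chosen so that the sum of $n$ values cannot exceed $2^{31} - 1$), and computing the global $m$ directly instead of through the look-ahead exchange.

```python
import math
from typing import List

INT32_MAX = 2**31 - 1


def block_exponent(block: List[float]) -> int:
    """Exponent m such that every |g| in the block is below 2**m."""
    h = max((abs(g) for g in block), default=0.0)
    if h == 0.0:
        return 0
    _, e = math.frexp(h)  # h = mantissa * 2**e with 0.5 <= mantissa < 1
    return e


def scaling_factor(n: int, m: int) -> float:
    """f = (2^31 - 1) / (n * 2^m), as in the text."""
    return INT32_MAX / (n * 2.0**m)


def quantize(block: List[float], f: float) -> List[int]:
    # Truncation toward zero is an assumption of this sketch.
    return [int(g * f) for g in block]


def aggregate_block(blocks: List[List[float]]) -> List[float]:
    """Quantize each worker's block, sum as integers, and scale back."""
    n = len(blocks)
    m = max(block_exponent(b) for b in blocks)  # global exponent
    f = scaling_factor(n, m)
    total = [0] * len(blocks[0])
    for b in blocks:
        for j, q in enumerate(quantize(b, f)):
            total[j] += q  # integer addition, as performed in the switch
    assert all(abs(t) <= INT32_MAX for t in total), "sum must fit in 32 bits"
    return [t / f for t in total]
```

Dividing by $n$ inside $f$ is what leaves headroom for the sum: each worker's scaled value has magnitude at most $(2^{31} - 1)/n$, so adding $n$ of them stays within the 32-bit range.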
While the above representation is used in the remainder of the paper, we also explored the implementation of a restricted form of 16-bit floating-point. In this version, the switch converts each 16-bit floating-point value in the incoming packets into a fixed-point value and then performs aggregation. When generating responses, the switch converts fixed-point values back into floating-point values. Due to resource limitations in the switch, we were only able to support half the dynamic range of the 16-bit floating-point format; we expect this to lead to poor convergence during training. Conversely, our 32-bit integer format uses minimal switch resources, provides good dynamic range, and has a minimal overhead on workers. A 16-bit format would provide a bandwidth benefit (§6.3).
[Figure 3 plots: ATE/s (×10^6) versus number of workers (4, 8, 16) in two panels, 10 Gbps and 100 Gbps. Series: SwitchML RDMA 256, SwitchML RDMA 64, SwitchML DPDK 256, SwitchML DPDK 64, NCCL-RDMA, NCCL-TCP, Gloo-TCP, Colocated PS, Dedicated PS, Max RDMA goodput.]
Figure 3: Microbenchmarks showing aggregated tensor elements per second on a 10 (top) and 100 (bottom) Gbps network as workers increase.

which we use to analyze performance by replaying profile traces. This throughput metric includes communication and computation costs, but excludes the time to load data.

Benchmarks. We evaluate SwitchML by training with 8 DNNs introduced in Table 1. The detailed configuration of the benchmarks is in Table 3 in Appendix B. Half of the benchmarks execute on PyTorch and half on TensorFlow.

Setup. As a baseline, we run both PyTorch with native distributed data-parallel module and TensorFlow with Horovod. By default, we use NCCL as the communication library, and use both TCP and RDMA as the transport protocol. Our default setup is to run experiments on 8 workers.
To illustrate SwitchML’s efficiency in comparison to other communication strategies, we devise a communication-only microbenchmark that performs continuous tensor aggregations, without any gradient computation on the GPU. We verify that the tensors – initially, all ones – are aggregated correctly. We test with various tensor sizes from 50 MB to 1.5 GB. We observe that the number of aggregated tensor elements per time unit (ATE/s) is not very sensitive to the tensor size. Thus, we report results for 100 MB tensors only.

For these experiments, we benchmark SwitchML against the popular all-reduce communication libraries (Gloo [31] and NCCL [77]). We further compare against a parameter server-like scenario, i.e., a set of worker-based processes that assist with the aggregation. To this end, we build a DPDK-based program that implements streaming aggregation as in Algorithm 1. To capture the range of possible PS performance, we consider two scenarios: (1) when the PS processes run on dedicated machines, effectively doubling the cluster size, and (2) when a PS process is co-located with every worker. We choose to run as many PS processes (each using 8 cores) as workers so that the tensor aggregation workload is equally spread among all machines (uniformly sharded) and avoids introducing an obvious performance bottleneck due to oversubscribed bandwidth, which is the case when the ratio of workers to PS nodes is greater than one.

Figure 3 shows the results at 10 and 100 Gbps on three cluster sizes. The results demonstrate the efficiency of SwitchML:
its highest-performing variant, which uses RDMA with 256-value (1024-byte payload) packets, is within 2% of the maximum achievable goodput. Using smaller packets (k = 64 instead of 256) has a noticeable performance impact, underscoring the importance of our multi-pipeline design. The DPDK implementation has additional host-side overhead that prevents it from achieving full link utilization at 100 Gbps. In spite of this, SwitchML can still outperform the best current all-reduce system, NCCL, even when it uses RDMA and SwitchML does not. Moreover, SwitchML always maintains a predictable rate of ATE/s regardless of the number of workers. This trend should continue with larger clusters. The Dedicated PS approach (with 256 values per packet)
Figure 4: Training batch processing speedup at 100 Gbps considering a P100 GPU and a 10× faster GPU. [Bar charts of speedup per model: DeepLight, LSTM, BERT, VGG19, UGATIT, NCF, SSD, ResNet. Series: SwitchML/NCCL-TCP (P100), SwitchML/NCCL-TCP (GPU 10x), SwitchML/NCCL-RDMA (P100), SwitchML/NCCL-RDMA (GPU 10x).]
Figure 5: Training performance speedup normalized to NCCL with TCP and RDMA transport protocols. [Bar chart of speedup for DeepLight, LSTM, BERT, NCF. Series: SwitchML/NCCL-TCP (100 Gbps), SwitchML/NCCL-RDMA (100 Gbps).]
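As a rough aid for reading the ATE/s numbers in the microbenchmarks, the sketch below computes the metric from an element count and a measured aggregation time, together with a header-free upper bound for a given link speed. The function names are illustrative and not part of SwitchML; the 4-byte element size assumes 32-bit values, and the bound ignores packet headers, so the achievable goodput is somewhat lower.

```python
def ate_per_second(num_elements, seconds):
    """Aggregated tensor elements per second for one aggregation."""
    return num_elements / seconds


def line_rate_bound(link_gbps, bytes_per_element=4):
    """Upper bound on ATE/s if every bit on the link carried tensor data.

    Assumption: 4-byte (32-bit) elements and no header overhead.
    """
    return link_gbps * 1e9 / 8 / bytes_per_element


def all_ones_result_ok(result, num_workers):
    """Summing all-ones tensors from n workers must give n in every element."""
    return all(x == num_workers for x in result)
```

Under these assumptions, `line_rate_bound(100)` evaluates to 3.125e9 elements per second and `line_rate_bound(10)` to 3.125e8.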
We analyze training performance on eight DNN benchmarks. We normalize results to NCCL as the underlying communication backend of PyTorch and TensorFlow.
Figure 4 reports the speedup for processing a training batch for SwitchML compared to NCCL at 100 Gbps. SwitchML uses the DPDK implementation with 256-value packets. These results replay the profile traces collected on our cluster (§2.2), allowing us to report both the speedup on our testbed GPUs (P100s, which are two generations old) and hypothetical GPUs that are 10× faster (by uniformly scaling the traces by this factor). This emulation lets us evaluate the setting where a fast network is paired with fast GPUs. While it is hard to predict the future evolution of GPU speed vs. network bandwidth, we reason that this scaling factor currently corresponds to the span of two to three GPU generations (the A100 benchmarks at 4.2× the V100 [75], which in turn is 1.4-2.2× faster than our P100 [97]) and represents a likely bound on the real-world speedups achievable, which are anyway dependent on the model (e.g., the ResNet50 model sees a nearly 10× speedup from an NVIDIA V100 GPU compared to a K GPU [71]) and the other infrastructure specifics.
As expected, SwitchML accelerates batch processing especially for the larger DNNs. The speedup over NCCL-RDMA is at most 2.1×, which is in line with the fundamental 2× advantage of INA over RAR (§3). In most cases, the measured speedup is higher than the emulated communication results (Table 2) predict, because NCCL’s RAR implementation does not achieve the theoretical maximum efficiency. The speedup relative to NCCL-TCP is larger (up to one order of magnitude), which is attributable primarily to DPDK’s kernel-bypass advantage.
SwitchML provides significant benefits for many, but not all, real-world DNNs, even with 100 Gbps networks. For example, DeepLight and LSTM enjoy major improvements. BERT sees a somewhat lower speedup, in part because its gradient consists of many relatively small (∼60 MB) tensors. Similarly, NCF, a relatively small model, has a modest speedup. Other models, like UGATIT, SSD, and ResNet, are simply not network-bound at 100 Gbps. SSD is a particularly challenging case: not only is it a small model that would require an α = 15.2× faster GPU to become network-bound (Table 1), it also makes many aggregation invocations for small gradients. The overheads of starting an aggregation are not well amortized, especially in the 10× scaled scenario.
Finally, we consider the end-to-end speedup on a complete training run with 16 workers. We focus on the four models that are network-bottlenecked at 100 Gbps. Figure 5 shows the training performance speedup compared to NCCL using RDMA and TCP. These measurements use SwitchML’s DPDK implementation, with 256-value packets; we expect a larger speedup once SwitchML’s RDMA implementation is integrated with the training framework. Even so, SwitchML’s speedups range between 1.13-2.27× over NCCL-RDMA and 2.05-5.55× over NCCL-TCP. The results are not directly comparable to Figure 4, because (1) they use a larger 16-node cluster, and (2) they report total end-to-end iteration time, which also includes data loading time. Our deployment does not use any optimized techniques for data loading, an orthogonal problem being addressed by other work (e.g., DALI [76]).
Figure 6: Inflation of TAT due to packet loss and recovery. Results are normalized to a baseline scenario where no loss occurs and the worker implementation does not incur any timer-management overhead. [Plot of TAT inflation against loss rate (0.01%, 0.1%, 1%). Series: SwitchML, Gloo, NCCL.]
Figure 7: Timeline of packets sent per 10 ms during an aggregation with 0%, 0.01% and 1% packet loss probability. Horizontal bars denote the TAT in each case. [Plot of packets per 10 ms against time in ms. Series: ideal packet rate, 0.01% resent, 1% resent. TAT bar labels: 132 ms, 138 ms, 424 ms.]
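To see why the speedup in Figure 4 grows when the traces are scaled to a 10× faster GPU, a deliberately simplified model can help. The sketch below is not the paper's trace-replay method: it assumes that communication is fully exposed (no overlap with computation), and all timings are hypothetical values picked only to show the trend. The halved communication time stands in for the 2× advantage of INA over RAR mentioned above.

```python
def batch_time(compute_s, comm_s, alpha=1.0):
    """Time per batch when the GPU is alpha times faster.

    Assumption: communication is fully exposed (no overlap with compute).
    """
    return compute_s / alpha + comm_s


def speedup(compute_s, comm_base_s, comm_ina_s, alpha=1.0):
    """Batch-time ratio of a baseline over a faster aggregation scheme."""
    return batch_time(compute_s, comm_base_s, alpha) / batch_time(
        compute_s, comm_ina_s, alpha
    )


# Hypothetical timings in seconds, chosen only to show the trend.
COMPUTE_S, COMM_BASE_S, COMM_INA_S = 0.100, 0.100, 0.050
for alpha in (1, 10):
    print(alpha, round(speedup(COMPUTE_S, COMM_BASE_S, COMM_INA_S, alpha), 2))
```

With these made-up inputs the ratio rises from about 1.33 at α = 1 to about 1.83 at α = 10, approaching but never exceeding the 2× ratio of the communication times.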
Packet loss recovery. We study how packet loss affects TAT. To quantify the change in TAT due to packet loss, we experiment with a uniform random loss probability between 0.01% and 1% applied on every link. The retransmission timeout is set to 1 ms. We run microbenchmark experiments in similar scenarios as §6.1. We report a few representative runs.
Figure 6 measures the inflation in TAT with different loss probabilities. SwitchML completes tensor aggregation significantly faster than Gloo or NCCL when the loss is 0.1% or higher. A loss probability of 0.01% minimally affects TAT in either case. To better illustrate the behavior of SwitchML, we show in Figure 7 the evolution of packets sent per 10 ms at a representative worker for 0.01% and 1% loss. We observe that SwitchML generally maintains a high sending rate – relatively close to the ideal rate – and quickly recovers by retransmitting dropped packets. The slowdown past the 150 ms mark with 1% loss occurs because some slots are unevenly affected by random losses and SwitchML does not apply any form of work-stealing to rebalance the load among aggregators. This presents a further opportunity for optimization.
Tensor scaling and type conversion. We analyze whether any performance overheads arise due to the tensor scaling operations (i.e., multiply updates by f and divide aggregates by f) and the necessary data type conversions: float32-to-int32 → htonl → ntohl → int32-to-float32. To quantify overheads, we use int32 as the native data type while running the microbenchmarks. This emulates a native float32 scenario with no scaling and conversion operations. We also illustrate the potential improvement of quantization to half-precision (float16) tensors, which halves the volume of data to be sent to the network. (We include a conversion from/to float32.) This setting is enabled by the ability to perform in-switch type conversion (float16 ↔ int32) at line rate, which we verified with the switch chip vendor. However, for this experiment, we emulate this by halving the tensor size. We find that these overheads are negligible at 10 Gbps.
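The conversion chain float32-to-int32 → htonl → ntohl → int32-to-float32 can be illustrated with a small standard-library sketch. This is not SwitchML's implementation, which uses SSE/AVX instructions on the workers: the function names, the saturation to the int32 range, and the scaling factor f = 2^16 are assumptions made for the example, and `switch_aggregate` is only a stand-in for the integer sum performed in the switch.

```python
import struct

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def quantize(values, f):
    """Multiply each update by f and round to an integer in the int32 range."""
    return [max(INT32_MIN, min(INT32_MAX, round(v * f))) for v in values]


def to_wire(ints):
    """Pack int32 values in network (big-endian) byte order, as htonl would."""
    return struct.pack(f"!{len(ints)}i", *ints)


def from_wire(payload):
    """Unpack network-order int32 values, as ntohl would."""
    return list(struct.unpack(f"!{len(payload) // 4}i", payload))


def switch_aggregate(payloads):
    """Stand-in for the switch: element-wise integer sum of all payloads."""
    vectors = [from_wire(p) for p in payloads]
    return to_wire([sum(column) for column in zip(*vectors)])


def dequantize(ints, f):
    """Divide the aggregated integers by f to recover floating-point values."""
    return [i / f for i in ints]


f = 2**16  # hypothetical scaling factor
updates = [[0.5, -1.25], [0.25, 2.0]]  # one small update per worker
wire = switch_aggregate([to_wire(quantize(u, f)) for u in updates])
result = dequantize(from_wire(wire), f)
```

In this example every scaled value is an exact integer, so `result` equals the exact sum `[0.75, 0.75]`; in general the rounding step introduces a quantization error that depends on f.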
the idea of partitioning aggregation functionality between a switch (for performance) and a server (for capacity) so as to seamlessly support multi-job scenarios.
Encrypted traffic. Given the cluster setting and workloads we consider, we do not consider it necessary to accommodate for encrypted traffic. Appendix F expands on this issue.
In-network computation trends. The trend towards programmable data planes has sparked a surge of proposals [20, 21, 46, 47, 55, 61] to offload, when appropriate [80], application-specific primitives into network devices.
In-network aggregation. We are not the first to propose aggregating data in the network. Targeting partition-aggregate and big data (MapReduce) applications, NetAgg [65] and CamDoop [18] demonstrated significant performance advantages, by performing application-specific data aggregation at switch-attached high-performance middleboxes or at servers in a direct-connect network topology, respectively. Parameter Hub [64] does the same with a rack-scale parameter server. Historically, some specialized supercomputer networks [2, 26] offloaded MPI collective operators (e.g., all-reduce) to the network. SwitchML differs from all of these approaches in that it performs in-network data reduction using a streaming aggregation protocol.
The closest work to ours is DAIET [86], which was our initial but incomplete proposal of in-network aggregation for minimizing the overhead of exchanging ML model updates.
Mellanox’s Scalable Hierarchical Aggregation Protocol (SHARP) is a proprietary in-network aggregation scheme available in certain InfiniBand switches [33]. SHARP uses dedicated on-chip FPUs for collective offloading. The most recent version, SHARPv2 [68], uses streaming aggregation analogous to ours. A key difference is that SHARP builds on InfiniBand where it can leverage link-layer flow control and lossless guarantees, whereas SwitchML runs on standard Ethernet^8 with an unmodified network architecture, necessitating a new packet recovery protocol. More fundamentally, SwitchML builds on programmable network hardware rather than SHARP’s fixed-function FPUs, which offers two benefits. First, operators can deploy a single switch model either for SwitchML or traditional networking without waste: the ALUs used for aggregation can be repurposed for other tasks. Second, it allows the system design to evolve to support new ML training approaches. For example, we are currently experimenting with new floating-point representations and protocols for sparse vector aggregations [27]. With a fixed-function approach, these would require new hardware, just as moving from single HPC reductions (SHARPv1) to streaming ML reductions (SHARPv2) required a new ASIC generation.
^8 Although SwitchML uses RDMA, it uses only unreliable connections, and so does not require any of the “lossless Ethernet” features of RoCE.
Concurrently, Li et al. [57] explored the idea of in-switch acceleration for Reinforcement Learning (RL). Their design (iSwitch) differs from ours in two fundamental ways. First, while their FPGA-based implementation supports more complex computation (e.g., native floating point), it operates at much lower bandwidth (4×10 Gbps). Second, it stores an entire gradient vector during aggregation; for RL workloads with small models, this works, but it does not scale for large DNN models. Our work targets both large models and high throughput – a challenging combination given the limited on-chip memory in high-speed networking ASICs. SwitchML’s software/hardware co-design approach, using a self-clocking streaming protocol, provides 40× higher throughput than iSwitch, while supporting arbitrarily large models.
Finally, targeting NVIDIA’s intra-GPU NVLink network, Klenk et al. [52] proposed in-network aggregation in the context of a distributed shared-memory fabric supporting multi-GPU systems where accelerators are directly attached to the fabric. While this work is orthogonal to ours, their push reduction design resembles the SwitchML protocol, suggesting that streaming in-network aggregation has broader applicability than discussed in this paper.
Accelerating DNN training. A large body of work has proposed improvements to hardware and software systems, as well as algorithmic advances for faster DNN training. We only discuss a few relevant prior approaches. Improving training performance via data or model parallelism has been explored by numerous deep learning systems [1, 13, 15, 22, 56, 64, 94]. While data parallelism is most common, it can be advantageous to combine the two approaches. Recent work even shows how to automatically find a fast parallelization strategy for a specific parallel machine [44]. Underpinning any distributed training strategy lies parameter synchronization. Gibiansky was among the first to research [30] using fast collective algorithms in lieu of the traditional parameter server approach. Many platforms have now adopted this approach [30, 37, 41, 87, 89]. We view SwitchML as a further advancement on this line of work – one that pushes the boundary by co-designing networking functions with ML applications.
SwitchML speeds up DNN training by minimizing communication overheads at single-rack scale. SwitchML uses in-network aggregation to efficiently synchronize model updates at each training iteration among distributed workers executing in parallel. We evaluated SwitchML with eight real-world DNN benchmarks on a GPU cluster with 10 Gbps and 100 Gbps networks; we showed that SwitchML achieves training throughput speedups up to 5.5× and is generally better than state-of-the-art collective communications libraries. We are in the process of integrating SwitchML-RDMA in various ML frameworks.
We are grateful to Ibrahim Abdelaziz for his work on an initial proof-of-concept of SwitchML and to Omar Alama for contributing with performance improvements and with the code release. We thank Tom Barbette and Georgios Katsikas for their help with DPDK issues. This work benefited from feedback and comments by Steven Hand, Petros Maniatis, KyoungSoo Park, and Amin Vahdat. We thank our shepherd, Sujata Banerjee, and the anonymous reviewers for their helpful feedback. For computer time, this research used the resources of the Supercomputing Laboratory at KAUST.
[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, 2016. [2] N. Adiga, G. Almasi, G. Almasi, Y. Aridor, R. Barik, D. Beece, R. Bellofatto, G. Bhanot, R. Bickford, M. Blumrich, A. Bright, J. Brunheroto, C. Ca¸scaval, J. Castaños, W. Chan, L. Ceze, P. Coteus, S. Chatter- jee, D. Chen, G. Chiu, T. Cipolla, P. Crumley, K. De- sai, A. Deutsch, T. Domany, M. Dombrowa, W. Donath, M. Eleftheriou, C. Erway, J. Esch, B. Fitch, J. Gagliano, A. Gara, R. Garg, R. Germain, M. Giampapa, B. Gopal- samy, J. Gunnels, M. Gupta, F. Gustavson, S. Hall, R. Haring, D. Heidel, P. Heidelberger, L. Herger, D. Hoenicke, R. Jackson, T. Jamal-Eddine, G. Kopcsay, E. Krevat, M. Kurhekar, A. Lanzetta, D. Lieber, L. Liu, M. Lu, M. Mendell, A. Misra, Y. Moatti, L. Mok, J. Mor- eira, B. Nathanson, M. Newton, M. Ohmacht, A. Oliner, V. Pandit, R. Pudota, R. Rand, R. Regan, B. Rubin, A. Ruehli, S. Rus, R. Sahoo, A. Sanomiya, E. Schen- feld, M. Sharma, E. Shmueli, S. Singh, P. Song, V. Srini- vasan, B. Steinmacher-Burow, K. Strauss, C. Surovic, R. Swetz, T. Takken, R. Tremaine, M. Tsao, A. Umama- heshwaran, P. Verma, P. Vranas, T. Ward, M. Wazlowski, W. Barrett, C. Engel, B. Drehmel, B. Hilgart, D. Hill, F. Kasemkhani, D. Krolak, C. Li, T. Liebsch, J. Marcella, A. Muff, A. Okomo, M. Rouse, A. Schram, M. Tubbs, G. Ulsh, C. Wait, J. Wittrup, M. Bae, K. Dockser, L. Kissel, M. Seager, J. Vetter, and K. Yates. An Overview of the BlueGene/L Supercomputer. In SC,
[3] A. F. Aji and K. Heafield. Sparse Communication for Distributed Gradient Descent. In EMNLP, 2017. [4] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vo- jnovic. QSGD: Communication-Efficient SGD via Ran- domized Quantization. In NIPS, 2017.
[5] Barefoot Networks. Tofino. https://barefootnetworks.com/products/brief-tofino/.
[6] M. Barnett, L. Shuler, R. van de Geijn, S. Gupta, D. G. Payne, and J. Watts. Interprocessor collective communi- cation library (InterCom). In SHPCC, 1994.
[7] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar. signSGD: Compressed Optimisation for Non-Convex Problems. In ICML, 2018.
[8] J. Bernstein, J. Zhao, K. Azizzadenesheli, and A. Anand- kumar. signSGD with Majority Vote is Communica- tion Efficient And Byzantine Fault Tolerant. arXiv 1810.05291, 2018.
[9] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKe- own, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese, and D. Walker. P4: Programming Protocol- independent Packet Processors. SIGCOMM Comput. Commun. Rev., 44(3), July 2014.
[10] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McK- eown, M. Izzard, F. Mujica, and M. Horowitz. Forward- ing Metamorphosis: Fast Programmable Match-action Processing in Hardware for SDN. In SIGCOMM, 2013.
[11] Cerebras. https://www.cerebras.net.
[12] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson. One Billion Word Bench- mark for Measuring Progress in Statistical Language Modeling. arXiv 1312.3005, 2013.
[13] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. In Workshop on Machine Learning Systems, 2016.
[14] J. H. Cheon, A. Kim, M. Kim, and Y. Song. Homomor- phic Encryption for Arithmetic of Approximate Num- bers. In ASIACRYPT, 2017.
[15] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanara- man. Project Adam: Building an Efficient and Scalable Deep Learning Training System. In OSDI, 2014.
[16] S. Chole, A. Fingerhut, S. Ma, A. Sivaraman, S. Var- gaftik, A. Berger, G. Mendelson, M. Alizadeh, S.-T. Chuang, I. Keslassy, A. Orda, and T. Edsall. dRMT: Dis- aggregated Programmable Switching. In SIGCOMM,
[17] J. Choquette, O. Giroux, and D. Foley. Volta: Perfor- mance and Programmability. IEEE Micro, 38(2), 2018.
[45] Y. Jiang, Y. Zhu, C. Lan, B. Yi, Y. Cui, and C. Guo. A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters. In OSDI,
[46] X. Jin, X. Li, H. Zhang, N. Foster, J. Lee, R. Soulé, C. Kim, and I. Stoica. NetChain: Scale-Free Sub-RTT Coordination. In NSDI, 2018.
[47] X. Jin, X. Li, H. Zhang, R. Soulé, J. Lee, N. Foster, C. Kim, and I. Stoica. NetCache: Balancing Key-Value Stores with Fast In-Network Caching. In SOSP, 2017.
[48] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hag- mann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Nor- rie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon. In- Datacenter Performance Analysis of a Tensor Processing Unit. In ISCA, 2017.
[49] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the Limits of Language Modeling. arXiv 1602.02410, 2016.
[50] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. In ICLR, 2017.
[51] J. Kim, M. Kim, H. Kang, and K. H. Lee. U-GAT- IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to- Image Translation. In ICLR, 2020.
[52] B. Klenk, N. Jiang, G. Thorson, and L. Dennison. An In- Network Architecture for Accelerating Shared-Memory Multiprocessor Collectives. In ISCA, 2020.
[53] U. Köster, T. J. Webb, X. Wang, M. Nassar, A. K. Bansal, W. H. Constable, O. H. Elibol, S. Gray, S. Hall, L. Hornof, A. Khosrowshahi, C. Kloss, R. J. Pai, and N. Rao. Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks. In NIPS,
[54] C. Lao, Y. Le, K. Mahajan, Y. Chen, W. Wu, A. Akella, and M. Swift. ATP: In-network Aggregation for Multi- tenant Learning. In NSDI, 2021.
[55] J. Li, E. Michael, and D. R. K. Ports. Eris: Coordination- Free Consistent Transactions Using In-Network Concur- rency Control. In SOSP, 2017.
[56] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling Distributed Machine Learning with the Parameter Server. In OSDI, 2014.
[57] Y. Li, I.-J. Liu, Y. Yuan, D. Chen, A. Schwing, and J. Huang. Accelerating Distributed Reinforcement Learning with In-Switch Computing. In ISCA, 2019.
[58] Y. Li, J. Park, M. Alian, Y. Yuan, Z. Qu, P. Pan, R. Wang, A. Gerhard Schwing, H. Esmaeilzadeh, and N. Sung Kim. A Network-Centric Hardware/Algorithm Co-Design to Accelerate Distributed Training of Deep Neural Networks. In MICRO, 2018.
[59] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common Objects in Context. In ECCV, 2014.
[60] Y. Lin, S. Han, H. Mao, Y. Wang, and B. Dally. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. In ICLR, 2018.
[61] M. Liu, L. Luo, J. Nelson, L. Ceze, A. Krishnamurthy, and K. Atreya. IncBricks: Toward In-Network Compu- tation with an In-Network Cache. In ASPLOS, 2017.
[62] S. Liu, Q. Wang, J. Zhang, Q. Lin, Y. Liu, M. Xu, R. C. Chueng, and J. He. NetReduce: RDMA-Compatible In-Network Reduction for Distributed DNN Training Acceleration. arXiv 2009.09736, 2020.
[63] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single Shot MultiBox Detector. In ECCV, 2016.
[64] L. Luo, J. Nelson, L. Ceze, A. Phanishayee, and A. Kr- ishnamurthy. PHub: Rack-Scale Parameter Server for Distributed Deep Neural Network Training. In SoCC,
[65] L. Mai, L. Rupprecht, A. Alim, P. Costa, M. Migliavacca, P. Pietzuch, and A. L. Wolf. NetAgg: Using Middle- boxes for Application-Specific On-path Aggregation in Data Centres. In CoNEXT, 2014.
[66] D. Masters and C. Luschi. Revisiting small batch train- ing for deep neural networks. arXiv 1804.07612, 2018.
[67] Mellanox RDMA Aware Networks Programming User Manual. https://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf.
[68] Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP). https://www.mellanox.com/products/sharp.
[69] K. Mishchenko, E. Gorbunov, M. Takáč, and P. Richtárik. Distributed Learning with Compressed Gradient Differences. arXiv 1901.09269, 2019. http://arxiv.org/abs/1901.09269.
[70] K. Mishchenko, B. Wang, D. Kovalev, and P. Richtárik. IntSGD: Floatless Compression of Stochastic Gradients. arXiv 2102.08374, 2021. https://arxiv.org/abs/2102.08374.
[71] D. Narayanan, K. Santhanam, F. Kazhamiaka, A. Phan- ishayee, and M. Zaharia. Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads. In OSDI, 2020.
[72] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Ro- bust Stochastic Approximation Approach to Stochastic Programming. SIAM J. Optim., 19(4), 2009.
[73] A. Nemirovski and D. B. Yudin. Problem complex- ity and method efficiency in optimization. Wiley Inter- science, 1983.
[74] NVIDIA Ampere Architecture In-Depth. https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/.
[75] NVIDIA’s Ampere A100 GPU Is Unstoppable, Breaks 16 AI Performance Records, Up To 4.2x Faster Than Volta V100. https://wccftech.com/nvidia-ampere-a100-fastest-ai-gpu-up-to-4-times-faster-than-volta-v100/.
[76] NVIDIA Data Loading Library (DALI). https://developer.nvidia.com/DALI.
[77] NVIDIA Collective Communication Library (NCCL). https://developer.nvidia.com/nccl.
[78] NVIDIA Data Center Deep Learning Product Performance. https://developer.nvidia.com/deep-learning-performance-training-inference.
[79] P. Patarasuk and X. Yuan. Bandwidth Optimal All- reduce Algorithms for Clusters of Workstations. Journal of Parallel and Distributed Computing, 69(2), 2009.
[80] D. R. K. Ports and J. Nelson. When Should The Network Be The Computer? In HotOS, 2019.
[81] P. Rajpurkar, R. Jia, and P. Liang. Know What You Don’t Know: Unanswerable Questions for SQuAD. In ACL, 2018.
[82] P. Richtárik and M. Takáč. Distributed Coordinate Descent Method for Learning with Big Data. Journal of Machine Learning Research, 17(1), 2016.
[83] H. Robbins and S. Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3),
[84] C. Rosset. Turing-NLG: A 17-billion-parameter language model by Microsoft. https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/.
[85] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Vi- sual Recognition Challenge. International Journal of Computer Vision, 115(3), 2015.
[86] A. Sapio, I. Abdelaziz, A. Aldilaijan, M. Canini, and P. Kalnis. In-Network Computation is a Dumb Idea Whose Time Has Come. In HotNets, 2017.
[87] F. Seide and A. Agarwal. CNTK: Microsoft’s Open- Source Deep-Learning Toolkit. In KDD, 2016.
[88] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-Bit Stochastic Gradient Descent and Application to Data- Parallel Distributed Training of Speech DNNs. In IN- TERSPEECH, 2014.
[89] A. Sergeev and M. D. Balso. Horovod: fast and easy dis- tributed deep learning in TensorFlow. arXiv 1802.05799,
[90] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cam- bridge University Press, 2014.
[91] K. Simonyan and A. Zisserman. Very Deep Convolu- tional Networks for Large-Scale Image Recognition. In ICLR, 2015.
[92] A. Sivaraman, A. Cheung, M. Budiu, C. Kim, M. Al- izadeh, H. Balakrishnan, G. Varghese, N. McKeown, and S. Licking. Packet Transactions: High-Level Program- ming for Line-Rate Switches. In SIGCOMM, 2016.
[93] R. Thakur, R. Rabenseifner, and W. Gropp. Opti- mization of Collective Communication Operations in MPICH. Int. J. High Perform. Comput. Appl., 19(1), Feb. 2005.
Figure 8: An example execution of a SwitchML switch interacting with three workers. The figure illustrates how a slot with index x is used during the different phases (shown in different colors) that alternate between the two pools. [Timeline diagram: workers w1 to w3 exchange packets with the switch at times t0 to t14, including retransmissions (marked "re") and ignored duplicates. The slot alternates between pool 0 and pool 1 through Aggregation and Result Distribution phases at offsets off, off + k*s, off + 2ks and off + 3ks.]
on the hardware pipeline and will be rejected by the compiler. Moreover, the number of memory accesses per-stage is inherently limited by the maximum per-packet latency; a switch may be able to parse more data from a packet than it is able to store into the switch memory during that packet’s time in the switch.
We make a number of design trade-offs to fit within the switch constraints. First, our P4 program makes the most use of the limited memory operations by performing the widest register accesses possible (64 bits). We then use the upper and lower part of each register for alternate pools. These parts can execute different operations simultaneously; for example, when used for the received work bitmap, we can set a bit for one pool and clear a bit for the alternate pool in one operation. Second, we minimize dependencies (e.g., branches) in our Algorithm 3 in order to process 64 elements per packet within a single ingress pipeline. We confine all processing to the ingress pipeline; when the aggregation is complete, the traffic manager duplicates the packet containing the aggregated result and performs a multicast. In a first version of our program, we used both ingress and egress pipelines for the aggregation, but that required packet recirculation to duplicate the packets. This caused additional dependencies that required more stages, preventing the processing of more than 64 elements per packet. Moreover, this design experienced unaccounted packet losses between the two pipelines and during recirculation, which led us to search for a better, single-pipeline program.
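The first trade-off, packing the bitmaps of both pools into one 64-bit register so that a single access sets a bit for one pool and clears it for the other, can be sketched in software as follows. This is a Python illustration rather than the P4 program: the layout (pool 0 in the lower 32 bits, pool 1 in the upper 32 bits) and the function names are assumptions made for the example.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def update_bitmap(reg, worker, pool):
    """Return the register after one combined update.

    Sets `worker`'s bit in the half belonging to `pool` and clears the same
    bit in the half belonging to the alternate pool.
    Assumed layout: pool 0 in bits 0-31, pool 1 in bits 32-63.
    """
    bit = 1 << worker
    set_mask = bit << (32 * pool)
    clear_mask = bit << (32 * (1 - pool))
    return (reg | set_mask) & ~clear_mask & MASK64


def seen(reg, worker, pool):
    """True if `worker` has already contributed to `pool`."""
    return bool((reg >> (32 * pool)) & (1 << worker))


reg = update_bitmap(0, worker=2, pool=0)    # worker 2 contributes to pool 0
reg = update_bitmap(reg, worker=2, pool=1)  # later it contributes to pool 1
```

After the second call, worker 2 is marked in pool 1 and its pool 0 bit is already cleared, so the pool 0 slot can be reused without a separate reset access.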
Worker component. Our goal for implementing the worker component is to achieve high I/O performance for aggregating model updates. At the same time, we want to support existing ML frameworks without modifications.
In existing ML frameworks, a DNN model update U comprises a set of tensors T, each carrying a subset of the gradients. This is because the model consists of many layers; most existing frameworks emit a gradient tensor per layer and reduce each layer’s tensors independently. Back-propagation produces the gradients starting from the output layer and moving towards the input layer. Thus, communication can start on the output layer’s gradients while the other gradients are still being computed, partially overlapping communication with computation. This implies that for each iteration, there are as many aggregation tasks as the number of tensors (e.g., 152 for ResNet50).
Our implementation exposes the same synchronous all-reduce interface as Gloo. However, rather than treating each tensor as an independent reduction and resetting switch state for each one, our implementation is efficient in that it treats the set of tensors virtually as a single, continuous stream of data across iterations. Upon invocation, our API passes the input tensor to a virtual stream buffer manager which streams the tensor to the switch, breaking it into the small chunks the switch expects. Multiple threads may call SwitchML’s all-reduce, with the requirement that each worker machine’s tensor reduction calls must occur in the same order; the stream buffer manager then performs the reductions and steers results to the correct requesting thread.
One CPU core is sufficient to do reduction at line rate on a 10 Gbps network. However, to be able to scale beyond 10 Gbps, we use multiple CPU cores at each worker and use the Flow Director technology (implemented in hardware on modern NICs) to uniformly distribute incoming traffic across the NIC RX queues, one for each core. Every CPU core runs an I/O loop that processes every batch of packets in a run-to-completion fashion and uses a disjoint set of aggregation slots. Packets are batched in groups of 32 to reduce per-packet transmission overhead. We use x86 SSE/AVX instructions to scale the model updates and convert between types. We are careful to ensure all processing is NUMA aware.
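The idea of treating a sequence of tensors as one continuous stream, cut into the fixed-size chunks the switch expects, can be sketched as below. This is a toy model of the stream buffer manager, not its actual code: the names are hypothetical, tensors are plain Python lists, and letting a chunk straddle a tensor boundary is an assumption of this sketch.

```python
def stream_chunks(tensors, k):
    """Yield (stream_offset, chunk) pairs of at most k elements each.

    The tensors are concatenated in call order, which is why every worker
    must issue its reductions in the same order.
    """
    stream = [x for tensor in tensors for x in tensor]
    for off in range(0, len(stream), k):
        yield off, stream[off:off + k]


def split_results(aggregated, lengths):
    """Cut the aggregated stream back into per-tensor results."""
    results, pos = [], 0
    for n in lengths:
        results.append(aggregated[pos:pos + n])
        pos += n
    return results


tensors = [[1, 2, 3], [4, 5]]            # two small gradient tensors
chunks = list(stream_chunks(tensors, k=2))
```

Here `chunks` is `[(0, [1, 2]), (2, [3, 4]), (4, [5])]`, and `split_results` uses the original tensor lengths to hand each requesting thread its own slice of the aggregated stream.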
RDMA implementation details. We found that the cost of processing individual SwitchML packets, even using DPDK with 256-element packets and multiple cores, was too high to achieve line rate. Other aggregation libraries use RDMA to offload packet processing to the NIC. In RDMA-based sys- tems NICs implement packetization, flow control, congestion control, and reliable delivery. In normal usage, clients use RDMA to send and receive messages of up to a gigabyte; the NIC turns them into packets and ensures they are delivered re- liably. Furthermore, clients may register memory regions with the NIC, allowing other clients to remotely read and write them without CPU involvement. This reduces or eliminates
work done on the clients’ CPUs to complete the transfer. Turning a Tofino switch into a fully-featured RDMA end- point is not the solution. Implementing timeouts and retrans- mission in a way that is compatible with existing poorly- documented existing RDMA NICs would be complex. Fur- thermore, such an implementation would not be an good fit for SwitchML: the RDMA protocols are largely designed for point-to-point communication, whereas SwitchML’s protocol is designed for collective communication. Fortunately, RDMA NICs implement multiple protocols with different properties. The standard Reliable Connected (RC) mode ensures reliable delivery and supports CPU- bypassing remote reads and writes (as well as sends and receives) of up to 1 GB. The UDP-like Unreliable Datagram (UD) mode supports just sends and receives of up the net- work MTU. Finally, the Unreliable Connected (UC) mode fits somewhere in between. It supports packetization, allowing for sends, receives, and writes of up to 1 GB. It also generates and checks sequence numbers, allowing it to detect packet drops, but it does not retransmit: instead, if a gap in sequence numbers is detected, incoming packets are silently dropped until the first packet of a new message arrives. Then, the se- quence number counter is reset to the sequence number of that packet, and normal reception continues. We use RDMA UC to implement a RDMA-capable variant of SwitchML using a subset of the RoCE v2 protocol [42]. Its operation is very similar to what is described in Section 4, with three main differences. First, where base SwitchML sends and receives slot-sized packets, SwitchML RDMA sends multi-slot messages. Each packet of a message is treated largely as it is in the base protocol by the switch, but the pool index for each packet is computed as an offset from the base index provided with the first packet of the message. 
Timeouts are tracked just as they are in the base protocol, but when a packet drop is detected, the client retransmits the entire message rather than just the dropped packet. This makes retransmissions more expensive, but it also drastically lowers the cost of sending packets in the common case of no packet drops; since packet drops are rare within a datacenter, the benefit is large.

Second, SwitchML consumes and generates sequence numbers on the switch. In order to allow messages with multiple packets to aggregate concurrently, each in-flight message is allocated its own queue pair, with its own sequence number register. This allows clients to be notified when a write message from the switch has arrived with no drops; it also allows the switch to ignore packets in messages received out of sequence. However, the same per-slot bitmap used in the base protocol is still used to ensure that duplicate packets from a retransmission of a partially received message are not re-applied. Packets are transmitted as soon as the individual slots addressed by a message complete. This means that the packets from multiple messages may interleave on the wire, but since each is on a separate queue pair with its own sequence number space, the NIC will reassemble them successfully.

Figure 9: Effect of pool size on overall tensor aggregation time (TAT) and per-packet RTT (right y-axis) at 100 Gbps. [Plot omitted. x-axis: pool size, 32 to 8192; left y-axis: TAT [ms]; right y-axis: RTT [µs]; series: SwitchML, RTT, TAT at line rate.]

Third, SwitchML RDMA uses RDMA Write Immediate messages for all communication. This allows clients to send data directly from GPU memory, and the switch to write directly into GPU memory (if the host is GPU Direct-capable). Byte order conversion and scaling are done on the GPU; the CPU is responsible only for issuing writes when data from the GPU is ready, detecting completions and timeouts, and issuing retransmissions when necessary.

Necessary metadata for the SwitchML protocol is encoded in fields of the RDMA header; the RDMA RKey and Address fields are used to encode the destination slot and the address to write the response to. The Immediate field is used to carry up to four scaling factors. At job setup time, each client communicates with the switch and gives it the client's queue pair numbers, initial sequence numbers, and an RKey for the client's switch-writable memory region. The switch uses these to form RDMA Write Immediate messages with appropriate sequence numbers, destination addresses, and immediate values, of the same size as the messages sent from the clients to the switch.

Finally, it is important to note that SwitchML RDMA does not require lossless Ethernet to be configured, as is common in RoCE deployments. Enabling lossless Ethernet would reduce the probability of packet drops, but would add complexity to the network deployment. SwitchML’s reliability protocol makes this unnecessary.

DNN workloads. Table 3 details the models, datasets and ML toolkits used in the experiments.
As mentioned, the pool size s affects performance and reliability. We now analyze how to tune this parameter. Two factors affect s. First, because s defines the number of in-flight packets in the system that originate from a worker, to avoid wasting each worker’s network bandwidth, s should be no less than the bandwidth-delay product (BDP) of each worker. Note that the delay here refers to the end-to-end delay, including the end-host processing time, which can be easily measured in a given deployment. Let b be the packet size, which is constant in our setting. To sustain line rate transmission, the stream of response packets must arrive at line rate, and this is possible when s · b matches the BDP. A significantly higher value of s, when used as the initial window size, will unnecessarily increase queuing time within the workers.
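As a rough worked example of the condition that s · b should match the BDP, the sketch below computes the smallest pool size whose in-flight bytes cover a given bandwidth-delay product. The function name and the input numbers are illustrative assumptions, not measurements from the paper: a 100 Gbps link, a 50 µs end-to-end delay, and b = 1024 bytes (256 elements per packet, assuming 4-byte elements and ignoring headers).

```python
def min_pool_size(bandwidth_bps: int, delay_us: int, packet_bytes: int) -> int:
    """Smallest s such that s * b is at least the bandwidth-delay product."""
    bdp_numerator = bandwidth_bps * delay_us       # bits/s * microseconds
    denominator = 8 * 1_000_000 * packet_bytes     # bits/byte * us/s * b
    return -(-bdp_numerator // denominator)        # integer ceiling division


# Illustrative numbers only (assumptions, not measured values).
s = min_pool_size(100_000_000_000, 50, 1024)
print(s)  # 611
```

Integer arithmetic is used to avoid floating-point rounding at the ceiling step. In a real deployment the delay would be measured end to end, including end-host processing time, as the text notes.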