Scaling Distributed Machine Learning with In-Network Aggregation

Amedeo Sapio∗

KAUST

Marco Canini∗

KAUST

Chen-Yu Ho

KAUST

Jacob Nelson

Microsoft

Panos Kalnis

KAUST

Changhoon Kim

Barefoot Networks

Arvind Krishnamurthy

University of Washington

Masoud Moshref

Barefoot Networks

Dan R. K. Ports

Microsoft

Peter Richtárik

KAUST

Abstract

Training machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design the switch processing with the end-host protocols and ML frameworks to provide an efficient solution that speeds up training by up to 5.5× for a number of real-world benchmark models.

1 Introduction

The remarkable success of today’s machine learning (ML) solutions derives from the ability to build increasingly sophisticated models on increasingly large data sets. To cope with the resulting increase in training time, ML practitioners use distributed training [1,22]. Large-scale clusters use hundreds of nodes, each equipped with multiple GPUs or other hardware accelerators (e.g., TPUs [48]), to run training jobs on tens of workers that take many hours or days.

Distributed training is increasingly a network-bound workload. To be clear, it remains computationally intensive. But the last seven years have brought a 62× improvement in compute performance [64,78], thanks to GPUs [74] and other hardware accelerators [11,34,35,48]. Cloud network deployments have found this pace hard to match, skewing the ratio of computation to communication towards the latter. Since parallelization techniques like mini-batch stochastic gradient descent (SGD) training [37,43] alternate computation with synchronous model updates among workers, network performance now has a substantial impact on training time.

∗ Equal contribution. Amedeo Sapio is affiliated with Barefoot Networks, but was at KAUST during much of this work.

Can a new type of accelerator in the network alleviate the network bottleneck? We demonstrate that an in-network aggregation primitive can accelerate distributed ML workloads, and can be implemented using programmable switch hardware [5,10]. Aggregation reduces the amount of data transmitted during synchronization phases, which increases throughput, reduces latency, and shortens training time.
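To make the data-reduction argument concrete, the following is a minimal sketch rather than the SwitchML implementation. It models aggregation as an element-wise sum of the workers' update vectors and counts how many values each worker receives per synchronization step. The function names and the baseline (every worker receives the raw update of every other worker and sums locally) are assumptions made for illustration only.

```python
def aggregate(updates):
    """Element-wise sum of equally sized update vectors."""
    return [sum(column) for column in zip(*updates)]


def values_received_per_worker(n_workers, update_len, in_network):
    """Count the values one worker receives in a synchronization step.

    Assumed baseline (hypothetical, for illustration): without in-network
    aggregation, each worker receives the raw update of every other worker.
    With in-network aggregation, it receives one already-summed vector.
    """
    if in_network:
        return update_len
    return (n_workers - 1) * update_len


# Three workers, each holding a three-element model update.
updates = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
print(aggregate(updates))
print(values_received_per_worker(8, 1000, in_network=False))
print(values_received_per_worker(8, 1000, in_network=True))
```

Under this assumed baseline, the received volume grows with the number of workers, whereas with aggregation in the network it stays at the size of a single update.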

Building an in-network aggregation primitive using programmable switches presents many challenges. First, the per-packet processing capabilities are limited, and so is on-chip memory. We must limit our resource usage so that the switch can perform its primary function of conveying packets. Second, the computing units inside a programmable switch operate on integer values, whereas ML frameworks and models operate on floating-point values. Finally, the in-network aggregation primitive is an all-to-all primitive, unlike traditional unicast or multicast communication patterns. As a result, in-network aggregation requires mechanisms for synchronizing workers and detecting and recovering from packet loss.

We address these challenges in SwitchML, showing that it is indeed possible for a programmable network device to perform in-network aggregation at line rate. SwitchML is a co-design of in-switch processing with an end-host transport layer and ML frameworks. It leverages the following insights. First, aggregation involves a simple arithmetic operation, making it amenable to parallelization and pipelined execution on programmable network devices. We decompose the parameter updates into appropriately sized chunks that can be individually processed by the switch pipeline. Second, aggregation for SGD can be applied separately on different portions of the input data, disregarding order, without affecting the correctness of the final result. We tolerate packet loss through the use of a lightweight switch scoreboard mechanism and a retransmission mechanism driven solely by end hosts, which together ensure that workers operate in lock-step without any decrease in switch aggregation throughput. Third, ML training is robust to modest approximations in its compute operations. We address the lack of floating-point support in switch dataplanes by having the workers scale and convert floating-point values to fixed-point using an adaptive scaling factor with negligible approximation loss.
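The three insights can be illustrated together with a short sketch. It is a simplified model under stated assumptions, not the SwitchML protocol. Workers scale floating-point updates to integers, split them into chunks, and a stand-in for the switch adds the chunks using integer arithmetic only, in arbitrary arrival order. All names are hypothetical. The rule in `choose_scale` is an illustrative assumption and not the adaptive scheme of the paper. Packet loss, the scoreboard, and retransmission are not modeled, and the "switch" keeps one integer slot per element rather than a bounded pool.

```python
import random


def choose_scale(updates, bits=32):
    """Hypothetical scaling rule (an assumption, not the paper's scheme):
    keep the sum of all workers' scaled values within half of the signed
    `bits`-bit range, which leaves headroom for rounding."""
    n_workers = len(updates)
    max_abs = max(abs(v) for update in updates for v in update)
    if max_abs == 0:
        return 1.0
    return 2 ** (bits - 2) / (n_workers * max_abs)


def to_fixed(values, scale):
    """Worker side: convert floats to integers by scaling and rounding."""
    return [round(v * scale) for v in values]


def to_packets(fixed, chunk_size):
    """Worker side: split an update into (offset, chunk) packets."""
    return [(off, fixed[off:off + chunk_size])
            for off in range(0, len(fixed), chunk_size)]


def switch_aggregate(packets, n_elems):
    """Switch stand-in: integer-only addition, one packet at a time."""
    slots = [0] * n_elems
    for offset, chunk in packets:
        for j, value in enumerate(chunk):
            slots[offset + j] += value
    return slots


def all_reduce(updates, chunk_size, rng):
    """Sum the workers' updates via fixed-point, chunked aggregation."""
    scale = choose_scale(updates)
    packets = []
    for update in updates:
        packets.extend(to_packets(to_fixed(update, scale), chunk_size))
    rng.shuffle(packets)  # packets may arrive in any order
    totals = switch_aggregate(packets, len(updates[0]))
    return [t / scale for t in totals]  # workers convert back to floats


data_rng = random.Random(0)
updates = [[data_rng.uniform(-1, 1) for _ in range(10)] for _ in range(4)]
approx = all_reduce(updates, chunk_size=4, rng=random.Random(1))
exact = [sum(column) for column in zip(*updates)]
print(max(abs(a - e) for a, e in zip(approx, exact)))
```

Because the stand-in switch adds integers, the result does not depend on the order in which chunks arrive, and the only approximation comes from rounding each value once during the float-to-integer conversion.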

