CS 598: Communication Cost Analysis of Algorithms
Lecture 29: Network interconnect topologies
Edgar Solomonik
University of Illinois at Urbana-Champaign
December 7, 2016

Direct network topologies: Introduction

Network interconnects

For the duration of the course, we have focused on communication on 'fully-connected' networks, implicitly assuming that
(1) any pair of processors can exchange messages at the same speed
(2) messages between distinct pairs do not affect one another
  - when they do, the effect is known as network contention
- we did this consciously, aiming for a general analysis of algorithms
  - there are no widely-used generic network models for algorithms
  - the connectivity structure of different networks can differ drastically
  - on real systems, algorithms execute on a subset of a network (the only existing large-scale architectures on which this subset is structured are the BlueGene and K-computer torus networks)
- we will first discuss torus networks and topology-aware collectives
- then we will study a few other topologies based on general metrics

Direct network topologies: Torus networks

Mesh and torus networks

A simple direct $n$-node network topology is a $k$-dimensional grid
- a torus is distinguished from a mesh by wrap-around links
- the simplest torus is a ring
- tori are generally advantageous as all nodes are 'created equal'
- Q: how could a ring network be constructed so each link is the same physical length?
- when $P = 2^k$ a mesh topology is a hypercube
- 3D torus topologies have been popular in HPC because many applications map nicely to them
- larger $k$ implies higher bisection bandwidth, which scales with $n^{(k-1)/k}$, so 5D torus networks have been more popular recently
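The wrap-around links and the $n^{(k-1)/k}$ bisection scaling can be checked on small cases. The following is a minimal sketch, assuming a torus with the same (even) side length in every dimension and a balanced cut that halves one axis; the function names `torus_neighbors` and `bisection_links` are hypothetical and chosen only for illustration.

```python
from itertools import product

def torus_neighbors(coord, dims):
    """Return the set of neighbors of `coord` in a torus with side lengths `dims`."""
    nbrs = set()
    for axis, side in enumerate(dims):
        for step in (-1, 1):
            nxt = list(coord)
            nxt[axis] = (coord[axis] + step) % side  # modulo gives the wrap-around link
            if tuple(nxt) != tuple(coord):
                nbrs.add(tuple(nxt))
    return nbrs

def bisection_links(side, k):
    """Count links cut when a side^k torus is split in half along axis 0 (side even)."""
    dims = (side,) * k
    half = side // 2
    cut = set()
    for node in product(range(side), repeat=k):
        for nbr in torus_neighbors(node, dims):
            # a link crosses the cut if its endpoints lie in different halves
            if (node[0] < half) != (nbr[0] < half):
                cut.add(frozenset((node, nbr)))
    return len(cut)
```

For side length 4, this cut severs 8 links in 2D and 32 links in 3D, i.e. $2 \cdot 4^{k-1}$, which grows like $n^{(k-1)/k}$ for $n = 4^k$ nodes.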


Collective communication on torus networks

As a case study, consider broadcast on a torus with full injection bandwidth
- so injection bandwidth is $2k$ times link bandwidth
- BlueGene tori have full injection bandwidth, but not Cray XT/XE series or K computer
- optimal broadcast protocols are given by edge-disjoint spanning trees

[Figure: a 2D 4x4 torus with the root marked; panels show spanning trees 1 to 4 and all 4 trees combined]

- $n/(2k)$ data is sent along each spanning tree (here $n$ is the size of the broadcast message)


All-to-all on torus networks

One-to-all and all-to-one collectives like broadcast, reduce, scatter, gather are done efficiently by rectangular algorithms like broadcast
- all-to-all is noticeably more difficult
  - let each processor send a total of $s$ data, $s/P$ to each processor
  - $\Omega(sP)$ data needs to cross any balanced cut, while bisection bandwidth is $P^{(k-1)/k}$ with respect to link bandwidth
  - so $\Omega(sP^{1/k})$ data must cross some link
- in practice, randomized algorithms are used for all-to-all
- a more concrete approach is to perform all-to-all along rings in each dimension (in sequence or a distinct subset of the all-to-all data along disjoint dimensional orderings)
- Q: how can we perform an all-to-all along a ring communicator?
- A: pass data going $m$ hops away for $m = P^{1/k} - 1$ down to $m = 1$; summing the per-step costs over $i = 1, \dots, P^{1/k} - 1$ gives a total communication cost of $O(sP^{1/k})$
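The ring all-to-all bound can be illustrated by counting how much data crosses each link. This is a minimal sketch under stated assumptions, not the lecture's exact protocol: the ring is unidirectional with $p = P^{1/k}$ processors, each processor holds a block of size $s/p$ for every ring member, and every block travels hop by hop toward its destination. The function name `ring_all_to_all_link_load` is hypothetical.

```python
def ring_all_to_all_link_load(p, s):
    """Data crossing each link of a unidirectional p-processor ring in an all-to-all.

    Assumption: each processor sends a block of size s/p to every other
    ring member, always forwarding in the same direction.
    """
    block = s / p
    load = [0.0] * p  # load[l] is the data crossing the link from l to (l + 1) % p
    for src in range(p):
        for hops in range(1, p):       # destination is `hops` steps along the ring
            for step in range(hops):   # the block crosses `hops` consecutive links
                load[(src + step) % p] += block
    return load
```

Under these assumptions every link carries $(s/p)\sum_{i=1}^{p-1} i = s(p-1)/2$ data, which is $O(sP^{1/k})$ and matches the lower bound up to a constant.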

Direct network topologies: Network routing

General routing strategies

Specialized collective routines can be designed to be very efficient on tori, but a network topology must permit any communication pattern
- even a broadcast for a random subset of nodes poses a difficulty
- bandwidth-efficiency is achieved for messages via wormhole routing
  - each message is subdivided into packets
- a message can be stalled due to head-of-line blocking
  - multiple packets from two input links can follow the same output link
  - one line of packets must wait in a buffer
- deadlock can occur if route 1 goes through link $a$ then through link $b$ while route 2 goes through link $b$ then through link $a$
- Q: can deadlock occur if no pair of routes behaves like this?
- A: yes, consider route 1: $\{a, b\}$, route 2: $\{b, c\}$, route 3: $\{c, a\}$
- to prevent deadlock, the routing protocol graph should not contain cycles
  - the graph has an edge from link $a$ to link $b$ if some route follows this consecutive pair of links
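The acyclicity condition on the routing protocol graph can be checked mechanically. The sketch below assumes each route is given as an ordered list of link names; it builds the graph with an edge from each link to the next link on the same route and looks for a directed cycle by depth-first search. The name `has_deadlock_risk` is hypothetical.

```python
def has_deadlock_risk(routes):
    """Return True if the link dependency graph induced by `routes` has a cycle."""
    graph = {}
    for route in routes:
        for link in route:
            graph.setdefault(link, set())
        for a, b in zip(route, route[1:]):
            graph[a].add(b)  # some route holds link a while requesting link b

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = {link: WHITE for link in graph}

    def visit(link):
        color[link] = GRAY
        for nxt in graph[link]:
            if color[nxt] == GRAY:  # back edge: a directed cycle exists
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[link] = BLACK
        return False

    return any(color[link] == WHITE and visit(link) for link in graph)
```

For example, `has_deadlock_risk([["a", "b"], ["b", "c"], ["c", "a"]])` reports a cycle even though no pair of routes uses the same two links in opposite order.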

Short pause

Indirect network topologies: Fat-tree network topology

Tree network topologies

Indirect topologies leverage routers that are not associated with any node
- typically each node is connected to a single router
- a network connects these routers, possibly using a hierarchy of routers
- Ex: a butterfly network has $P$ nodes and $P(\log_2 P - 1)$ routers
- tree networks are a natural hierarchical construction
- Q: if the network is a binary tree with uniform link bandwidth, what is its bisection bandwidth?
- A: its bisection bandwidth is bounded by the links at the root and is equal to the link bandwidth
- fat-trees (Leiserson 1985) solve this problem by increasing link bandwidth exponentially from leaves to root


Fat-tree bisection bandwidth

Fat-trees can be specified differently depending on the desired properties
- to achieve maximal bisection bandwidth, we can increase link bandwidth by a factor of 2 at each level from leaves to root
- Q: what factor of increase do we need if we have $P$ leaves and want bisection bandwidth to be $P^k$ times link bandwidth for $k < 1$?
- A: we need a factor of $f$ so that $f^{\log_2(P)} = P^k$, so $f = 2^k$
- to be able to construct a fat-tree efficiently, it makes sense to choose $k = 2/3$ and $f = 4^{1/3}$
  - this choice enables the fat-tree to be embedded into 3D space (the bisection bandwidth then scales like that of a 3D torus)
- the construction is universal: no network can be constructed with the same number of components that is faster by more than a polylogarithmic factor
  - key idea: use a decomposition tree to subdivide physical space and simulate any communication pattern via the fat-tree
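The relation $f^{\log_2(P)} = P^k$ can be checked numerically. This is a small sketch assuming a binary fat-tree with $\log_2 P$ levels whose leaf links have unit bandwidth; `level_factor` and `root_bandwidth` are hypothetical helper names.

```python
import math

def level_factor(k):
    """Per-level bandwidth growth f = 2^k that yields root bandwidth P^k."""
    return 2.0 ** k

def root_bandwidth(P, f):
    """Bandwidth at the root, relative to a leaf link, after log2(P) levels of growth f."""
    levels = math.log2(P)
    return f ** levels
```

With $P = 64$ and $k = 2/3$, the factor is $4^{1/3} \approx 1.587$ and the root bandwidth is $64^{2/3} = 16$ times the leaf link bandwidth, compared with 64 for the full-bandwidth choice $f = 2$.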


Universality of fat-trees

We sketch the analysis of Leiserson 1985, "Fat-Trees: universal networks for hardware-efficient supercomputing"
- consider routing a message set $M \subseteq [1, P] \times [1, P]$ (all possible interchanges of datums of unit size between processors)
- for each link $l$, define
  - $\mathrm{load}(M, l)$ to be the number of messages passing through $l$
  - $\mathrm{cap}(l)$ to be the number of messages that can pass through $l$ simultaneously (effective bandwidth)
- the load factor for $l$ is

$\lambda(M, l) = \frac{\mathrm{load}(M, l)}{\mathrm{cap}(l)}$

- the load factor for the whole tree, $\lambda(M)$, is the maximum of $\lambda(M, l)$ over all links $l$
- given any definition of $\mathrm{cap}(l)$, it is possible to decompose any $M$ into $M = \bigcup_{i=1}^{d} M_i$ such that $\lambda(M_i) = 1$ and $d = O(\lambda(M) \log(P))$
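The load factor can be computed directly on a small tree. The sketch below makes several assumptions that are not in the source: leaves are numbered from 0, the tree is a complete binary tree, a link is identified by the pair (level, index) of the subtree it connects to its parent (level 0 being the leaves), messages in both directions count toward the same link, and `cap` is a function of the level only.

```python
from collections import Counter

def link_loads(messages):
    """Count messages crossing each tree link, keyed by (level, subtree index)."""
    load = Counter()
    for a, b in messages:
        h = 0
        # climb until both endpoints fall into the same subtree
        while (a >> h) != (b >> h):
            load[(h, a >> h)] += 1  # link above the subtree containing a
            load[(h, b >> h)] += 1  # link above the subtree containing b
            h += 1
    return load

def load_factor(messages, cap):
    """Maximum over links of load / cap, where cap(level) gives the link capacity."""
    loads = link_loads(messages)
    return max((n / cap(h) for (h, _), n in loads.items()), default=0.0)
```

For the message set `[(0, 3), (1, 2)]` on 4 leaves, uniform capacity gives a load factor of 2 (both messages cross the links at the root), while doubling capacity per level, `cap = lambda h: 2 ** h`, brings it down to 1.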

Indirect network topologies: Slim-fly network topology

Low-diameter network topologies

Fat-trees are optimal within polylogarithmic factors, but hardware is designed with consideration even for small constant factors
- the current trend is towards topologies that have a low diameter (maximum path length)
- measure the diameter just in terms of hops between routers, and assume there are $O(P)$ base routers (connected to nodes)
- the latest Cray architectures leverage the Dragonfly topology [Kim, Dally, Scott, Abts, ISCA 2008]
  - define densely connected groups (cliques) of routers
  - connect a pair of routers between each pair of groups
  - the resulting topology has diameter 3
- one of the latest innovations is the Slim-Fly topology [Besta, Hoefler 2014], which has diameter 2 and satisfies some optimality properties
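The diameter-3 claim for Dragonfly can be checked on a toy instance. This is a simplified sketch, not the configuration of any real system: it assumes $g$ groups of $g - 1$ routers, each group a clique, and exactly one global link per router, arranged so that every pair of groups shares one link. The wiring rule and the names `dragonfly` and `diameter` are illustrative assumptions.

```python
from collections import deque

def dragonfly(g):
    """Adjacency sets for g groups of g-1 routers with one global link per group pair."""
    a = g - 1
    adj = {(grp, r): set() for grp in range(g) for r in range(a)}
    for grp in range(g):
        for r in range(a):
            for other in range(a):      # local links: each group is a clique
                if other != r:
                    adj[(grp, r)].add((grp, other))
            target = (grp + r + 1) % g  # group reached by the global link of router r
            peer = (target, g - 2 - r)  # the router in `target` whose global link points back
            adj[(grp, r)].add(peer)
            adj[peer].add((grp, r))
    return adj

def diameter(adj):
    """Largest shortest-path distance, found by breadth-first search from every router."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst
```

A worst-case route takes a local hop to the router that owns the right global link, the global hop, and a final local hop inside the destination group.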


Motivation for slim-fly

Q: what is the simplest diameter 2 topology you can think of?
- define a 2D grid of nodes $\Pi = [1, \sqrt{P}] \times [1, \sqrt{P}]$
- connect each $(i, j) \in \Pi$ to $(i, k), (k, j) \in \Pi$ for each $k$
- this requires $2\sqrt{P}$ incoming and outgoing links per node
- Q: how many 2-hop routes are there between each pair of nodes?
- A: there are 2, which suggests that we may be able to construct a network with fewer links
- it is possible to use fewer links by relaxing the assumption that each link is bidirectional, but this is undesirable in hardware terms
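The count of 2-hop routes in this grid topology can be verified by enumeration. The sketch below assumes 0-based coordinates on a `side` by `side` grid (so `side` plays the role of $\sqrt{P}$) and treats two nodes as linked when they share a row or a column; `two_hop_routes` and `links_per_node` are hypothetical names.

```python
from itertools import product

def linked(u, v):
    """Two distinct grid nodes are linked iff they share a row or a column."""
    return u != v and (u[0] == v[0] or u[1] == v[1])

def two_hop_routes(src, dst, side):
    """Count intermediate nodes that are linked to both src and dst."""
    return sum(
        1
        for w in product(range(side), repeat=2)
        if linked(src, w) and linked(w, dst)
    )

def links_per_node(side):
    """Number of neighbors of any node: the rest of its row plus the rest of its column."""
    return sum(1 for w in product(range(side), repeat=2) if linked((0, 0), w))
```

For two nodes that differ in both coordinates, such as (0, 0) and (1, 1), the only intermediates are the two "corner" nodes (0, 1) and (1, 0), which is the redundancy the slide points to.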


Slim-fly construction

source: Besta, Hoefler 2014. Slim Fly: A Cost Effective Low-Diameter Network
- a network of size $2q^2$ is constructed where $q$ is (almost any) prime
- there are two $q \times q$ grids $A, B$; each node is connected to some nodes in its column and some nodes in the other grid
- given a node $(x, y) \in A$ in the first grid and $(m, c) \in B$ in the second grid, they are connected iff $mx + c - y \equiv 0 \pmod{q}$

these links suffice to connect any pair of nodes in two columns of the same grid!
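The claim about pairs of nodes in two columns can be checked by brute force for a small prime. This sketch assumes the inter-grid rule reads $mx + c - y \equiv 0 \pmod{q}$ with 0-based coordinates, and it ignores the intra-column links entirely; `common_routers` is a hypothetical name.

```python
from itertools import product

def common_routers(u, v, q):
    """Nodes (m, c) of grid B that are linked to both u and v of grid A."""
    return [
        (m, c)
        for m, c in product(range(q), repeat=2)
        if (m * u[0] + c - u[1]) % q == 0 and (m * v[0] + c - v[1]) % q == 0
    ]
```

Under this rule, any two nodes of grid A that lie in different columns share exactly one neighbor in grid B, so they are two hops apart; two nodes in the same column share none, which is why the intra-column links are needed as well.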


Slim-fly routing

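Given $(x, y)$ and $(x', y')$ with $x \neq x'$, there must exist $(m, c)$ such that

$mx + c - y \equiv mx' + c - y' \equiv 0 \pmod{q}$

- to route we need to determine $m, c$ given $x, x', y, y'$
- we can do some modular arithmetic to determine $m, c$

$mx + c - y \equiv_q mx' + c - y' \;\Rightarrow\; m(x - x') \equiv_q y - y' \;\Rightarrow\; m \equiv_q (x - x')^{-1}(y - y')$

- $c$ then follows from $c \equiv_q y - mx$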

- where we need to find the modular multiplicative inverse (this is one of the reasons $q$ needs to be prime)
- we also need to connect $(x, y)$ to $(m, c)$ when they are not directly linked, by finding an intermediate node in the same column as one of them that is connected to the other, e.g. $(x, y')$ with $mx + c - y' \equiv_q 0$; this defines how nodes should be connected in columns
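The modular arithmetic for finding the intermediate node can be written out directly. This is a minimal sketch assuming the inter-grid rule $mx + c - y \equiv 0 \pmod{q}$, a prime $q$, and Python 3.8 or later for `pow` with exponent `-1`; `connecting_router` is a hypothetical name.

```python
def connecting_router(u, v, q):
    """Return (m, c) in grid B linked to both u = (x, y) and v = (x', y') in grid A."""
    (x, y), (xp, yp) = u, v
    dx = (x - xp) % q
    if dx == 0:
        raise ValueError("nodes must lie in different columns")
    # m = (x - x')^{-1} (y - y') mod q; the inverse exists because q is prime
    m = pow(dx, -1, q) * (y - yp) % q
    # c follows from the link condition m*x + c - y = 0 (mod q)
    c = (y - m * x) % q
    return m, c
```

For $q = 5$, the nodes $(1, 2)$ and $(3, 4)$ give $m = 1$, $c = 1$, and indeed $1 \cdot 1 + 1 - 2 \equiv 0$ and $1 \cdot 3 + 1 - 4 \equiv 0 \pmod 5$.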