













Edgar Solomonik. University of Illinois at Urbana-Champaign. December 7, 2016 ... we will first discuss torus networks and topology-aware collectives.














Lecture 29: Network interconnect topologies
Edgar Solomonik
University of Illinois at Urbana-Champaign
December 7, 2016
Direct network topologies Introduction
For the duration of the course, we have focused on communication on ‘fully-connected’ networks, implicitly assuming that
(1) any pair of processors can exchange messages at the same speed, and
(2) messages between distinct pairs do not affect one another (the effect of such interference is known as network contention).
We did this consciously, aiming for a general analysis of algorithms:
there are no widely-used generic network models for algorithms;
the connectivity structure of different networks can differ drastically;
on real systems, algorithms execute on a subset of a network (the only existing large-scale architectures on which this subset is structured are the BlueGene and K-computer torus networks).
We will first discuss torus networks and topology-aware collectives, then we will study a few other topologies based on general metrics.
Direct network topologies Torus networks
A simple direct n-node network topology is a k-dimensional grid.
A torus is distinguished from a mesh by wrap-around links.
The simplest torus is a ring.
Tori are generally advantageous as all nodes are ‘created equal’.
Q: how could a ring network be constructed so each link is the same physical length?
When $P = 2^k$, a mesh topology is a hypercube.
3D torus topologies have been popular in HPC because many applications map nicely to them.
Larger k implies higher bisection bandwidth, which scales with $n^{(k-1)/k}$, so 5D torus networks have been more popular recently.
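The $n^{(k-1)/k}$ scaling of bisection bandwidth can be checked by brute force on small tori. The sketch below is only an illustration under stated assumptions: every dimension has the same even side length d >= 4 (so n = d^k), the cut halves dimension 0, and the function name `torus_cut_links` is hypothetical rather than anything from the lecture.

```python
from itertools import product

def torus_cut_links(d, k):
    """Count links severed when dimension 0 of a d^k torus is cut in half.

    Assumes d is even and d >= 4, so each +1 neighbor is a distinct link.
    """
    half = d // 2
    crossing = 0
    for node in product(range(d), repeat=k):
        for dim in range(k):
            # Visit only the +1 neighbor so each undirected link is counted once.
            nbr = list(node)
            nbr[dim] = (node[dim] + 1) % d
            # A link is severed if its endpoints lie in different halves.
            if (node[0] < half) != (nbr[0] < half):
                crossing += 1
    return crossing

if __name__ == "__main__":
    for d, k in [(4, 2), (4, 3), (6, 2)]:
        n = d ** k
        print(d, k, torus_cut_links(d, k), 2 * n // d)  # 2 * n^((k-1)/k)
```

Under these assumptions the count equals 2 d^(k-1) = 2 n^((k-1)/k): the factor of 2 comes from the wrap-around links, which cross the cut a second time.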
Direct network topologies Torus networks
As a case study, consider broadcast on a torus with full injection bandwidth, so injection bandwidth is 2k times link bandwidth.
BlueGene tori have full injection bandwidth, but not the Cray XT/XE series or the K computer.
Optimal broadcast protocols are given by edge-disjoint spanning trees.
[Figure: a 2D 4x4 torus with its root, spanning trees 1 to 4, and all 4 trees combined]
$n/(2k)$ data is sent along each spanning tree, where n here is the size of the broadcast message.
Direct network topologies Torus networks
One-to-all and all-to-one collectives like broadcast, reduce, scatter, and gather are done efficiently by rectangular algorithms similar to the broadcast.
All-to-all is noticeably more difficult.
Let each processor send a total of s data, s/P to each processor.
$\Omega(sP)$ data needs to cross any balanced cut, while bisection bandwidth is $P^{(k-1)/k}$ with respect to link bandwidth, so $\Omega(sP^{1/k})$ data must cross some link.
In practice, randomized algorithms are used for all-to-all.
A more concrete approach is to perform all-to-all along rings in each dimension (in sequence, or a distinct subset of the all-to-all data along disjoint dimensional orderings).
Q: how can we perform an all-to-all along a ring communicator?
A: pass data going m hops away for $m = P^{1/k} - 1$ down to $m = 1$; summing the cost of these steps over $i = 1, \ldots, P^{1/k} - 1$ gives a total communication cost of $O(sP^{1/k})$.
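To see why the cost of an all-to-all on a ring grows linearly with the ring length, one can count how much data crosses each link. The sketch below is not the step-by-step protocol from the slide. It assumes a ring of p processors with one link per direction between neighbors, each processor sending s/p to every other processor along a shortest path (ties broken clockwise). The function name is hypothetical.

```python
def ring_all_to_all_link_load(p, s=1.0):
    """Maximum data crossing any directed link of a p-node ring when every
    processor sends s/p to every other processor along a shortest path."""
    cw = [0.0] * p    # cw[i]: load on the link i -> (i + 1) % p
    ccw = [0.0] * p   # ccw[i]: load on the link i -> (i - 1) % p
    for src in range(p):
        for dst in range(p):
            if src == dst:
                continue
            hops_cw = (dst - src) % p
            node = src
            if hops_cw <= p - hops_cw:
                # Clockwise is a shortest path: charge every link on it.
                for _ in range(hops_cw):
                    cw[node] += s / p
                    node = (node + 1) % p
            else:
                # Counter-clockwise is shorter.
                for _ in range(p - hops_cw):
                    ccw[node] += s / p
                    node = (node - 1) % p
    return max(cw + ccw)

if __name__ == "__main__":
    for p in (5, 9, 17):
        # With s = p every message has unit size.
        print(p, ring_all_to_all_link_load(p, float(p)))
```

With unit-size messages the per-link load is 1 + 2 + ... + (p-1)/2 for odd p, which is quadratic in p. Since each message has size s/p, the data per link is linear in s p, in line with the $O(sP^{1/k})$ cost for a ring of length $P^{1/k}$.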
Direct network topologies Network routing
Specialized collective routines can be designed to be very efficient on tori, but a network topology must permit any communication pattern.
Even a broadcast for a random subset of nodes poses a difficulty.
Bandwidth efficiency is achieved for messages via wormhole routing: each message is subdivided into packets.
A message can be stalled due to head-of-line blocking: multiple packets from two input links can follow the same output link, so one line of packets must wait in a buffer.
Deadlock can occur if route 1 goes through link a then through link b, while route 2 goes through link b then through link a.
Q: can deadlock occur if no pair of routes behaves like this?
A: yes, consider route 1: {a, b}, route 2: {b, c}, route 3: {c, a}.
To prevent deadlock, the routing protocol graph should not contain cycles; this graph has an edge from link a to link b if some route follows this consecutive pair of links.
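The acyclicity condition on the routing protocol graph can be checked mechanically. The following is a minimal sketch, assuming each route is given as an ordered list of link names. The function name `has_dependency_cycle` and the input format are illustrative assumptions.

```python
def has_dependency_cycle(routes):
    """Return True if the graph with an edge a -> b for every consecutive
    pair of links (a, b) on some route contains a directed cycle."""
    graph = {}
    for route in routes:
        for u, v in zip(route, route[1:]):
            graph.setdefault(u, set()).add(v)
            graph.setdefault(v, set())

    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on the DFS stack, finished
    color = {link: WHITE for link in graph}

    def visit(u):
        color[u] = GRAY
        for v in graph[u]:
            if color[v] == GRAY:            # back edge: cycle found
                return True
            if color[v] == WHITE and visit(v):
                return True
        color[u] = BLACK
        return False

    return any(color[u] == WHITE and visit(u) for u in graph)

if __name__ == "__main__":
    print(has_dependency_cycle([["a", "b"], ["b", "a"]]))              # pairwise conflict
    print(has_dependency_cycle([["a", "b"], ["b", "c"], ["c", "a"]]))  # three-route cycle
    print(has_dependency_cycle([["a", "b"], ["b", "c"]]))              # acyclic
```

The second call is the three-route example from the answer above: no pair of routes conflicts directly, yet the dependency graph still contains the cycle a, b, c.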
Indirect network topologies Fat-tree network topology
Indirect topologies leverage routers that are not associated with any node.
Typically each node is connected to a single router, and a network connects these routers, possibly using a hierarchy of routers.
Ex: a butterfly network has P nodes and $P(\log_2 P - 1)$ routers.
Tree networks are a natural hierarchical construction.
Q: if the network is a binary tree with uniform link bandwidth, what is its bisection bandwidth?
A: its bisection bandwidth is bounded by the root and is equal to the link bandwidth.
Fat-trees (Leiserson 1985) solve this problem by increasing link bandwidth exponentially from leaves to root.
Indirect network topologies Fat-tree network topology
Fat-trees can be specified differently depending on the desired properties.
To achieve maximal bisection bandwidth, we can increase link bandwidth by a factor of 2 at each level from leaves to root.
Q: what factor of increase do we need if we have P leaves and want bisection bandwidth to be $P^k$ times link bandwidth for $k < 1$?
A: we need a factor f such that $f^{\log_2(P)} = P^k$, so $f = 2^k$.
To be able to construct a fat-tree efficiently, it makes sense to choose $k = 2/3$ and $f = 4^{1/3}$.
This choice enables the fat-tree to be embedded into 3D space (bisection bandwidth is like that of a 3D torus).
The construction is universal: no network can be constructed with the same number of components that is faster by more than a polylogarithmic factor.
Key idea: use a decomposition tree to subdivide physical space and simulate any communication pattern via the fat-tree.
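The relation between the per-level factor f and the target bisection bandwidth can be verified numerically. This is a small sketch, assuming a binary fat-tree with P a power of two and leaf links of unit bandwidth. The function name is hypothetical.

```python
import math

def fat_tree_root_bandwidth(P, k):
    """Bandwidth at the root, in units of leaf-link bandwidth, when each of
    the log2(P) levels multiplies link bandwidth by f = 2**k."""
    levels = int(math.log2(P))   # assumes P is a power of two
    f = 2.0 ** k
    return f ** levels           # equals P**k up to rounding

if __name__ == "__main__":
    P = 64
    for k in (1.0, 2.0 / 3.0):
        print(k, fat_tree_root_bandwidth(P, k), P ** k)
```

For P = 64 and k = 2/3 the per-level factor is 4^(1/3), about 1.587, and six levels give a root bandwidth of 16 = 64^(2/3) leaf links.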
Indirect network topologies Fat-tree network topology
We sketch the analysis of Leiserson 1985, “Fat-Trees: universal networks for hardware-efficient supercomputing”.
Consider routing a message set $M \subseteq [1, P] \times [1, P]$ (all possible interchanges of datums of unit size between processors).
For each link l, define
load(M, l) to be the number of messages passing through l,
cap(l) to be the number of messages that can pass through l simultaneously (effective bandwidth).
The load factor for l is
$\lambda(M, l) = \mathrm{load}(M, l) / \mathrm{cap}(l)$
The load factor for the whole tree, $\lambda(M)$, is the maximum of $\lambda(M, l)$ over all links l.
Given any definition of cap(l), it is possible to decompose any M into
$M = \bigcup_{i=1}^{d} M_i$ such that $\lambda(M_i) = 1$ and $d = O(\lambda(M) \log(P))$
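The load factor definition can be made concrete with a short computation. The sketch below rests on several assumptions that are not in the lecture: the fat-tree is a complete binary tree whose leaves are processors numbered from 0, a link is named (h, j) when it joins node j at height h to its parent, traffic in both directions counts against the same link, and capacity depends only on height through a caller-supplied function `cap`.

```python
from collections import Counter

def load_factor(messages, cap):
    """lambda(M): the maximum over links l of load(M, l) / cap(l).

    messages: iterable of (src, dst) leaf indices, one unit-size message each.
    cap: function mapping a link's height h (0 = leaf links) to its capacity.
    """
    load = Counter()
    for src, dst in messages:
        a, b, h = src, dst, 0
        while a != b:               # climb both ends until they meet at the LCA
            load[(h, a)] += 1       # link used on the way up from the source
            load[(h, b)] += 1       # link used on the way down to the destination
            a, b, h = a // 2, b // 2, h + 1
    return max((n / cap(h) for (h, _), n in load.items()), default=0.0)

if __name__ == "__main__":
    M = [(0, 2), (1, 3)]                     # both messages cross the root of a 4-leaf tree
    print(load_factor(M, lambda h: 1))       # uniform links
    print(load_factor(M, lambda h: 2 ** h))  # link bandwidth doubles per level
```

In this example the two messages share the links just below the root, so uniform capacities give a load factor of 2, while doubling capacity per level brings it down to 1.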
Indirect network topologies Slim-fly network topology
Fat-trees are optimal within polylogarithmic factors, but hardware is designed with consideration even for small constant factors.
The current trend is towards topologies that have a low diameter (maximum path length).
Consider a definition of the diameter just in terms of the number of routers, and assume there are O(P) base routers (connected to nodes).
The latest Cray architectures leverage the Dragonfly topology [Kim, Dally, Scott, Abts, ISCA 2008]: define densely connected groups (cliques) of routers and connect a pair of routers between each pair of groups; the resulting topology has diameter 3.
One of the latest innovations is the Slim-Fly topology [Besta, Hoefler 2014], which has diameter 2 and satisfies some optimality properties.
Indirect network topologies Slim-fly network topology
Q: what is the simplest diameter 2 topology you can think of?
Define a 2D grid of nodes $\Pi = [1, \sqrt{P}] \times [1, \sqrt{P}]$.
Connect each $(i, j) \in \Pi$ to $(i, k), (k, j) \in \Pi$ for each k.
This requires $2\sqrt{P}$ incoming and outgoing links per node.
Q: how many 2-hop routes are there between each pair of nodes?
A: there are 2, which suggests that we may be able to construct a network with fewer links.
It is possible to use fewer links by relaxing the assumption that each link is bidirectional, but this is undesirable in hardware terms.
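The row-and-column construction is easy to check on a small grid. The sketch below assumes an r x r grid with 0-based coordinates (so P = r^2) and uses hypothetical function names. It confirms that the diameter is at most 2 and counts the 2-hop routes between two nodes.

```python
from itertools import product

def grid_neighbors(i, j, r):
    """Neighbors of (i, j): all other nodes in the same row or column."""
    row = {(i, k) for k in range(r) if k != j}
    col = {(k, j) for k in range(r) if k != i}
    return row | col

def two_hop_routes(a, b, r):
    """Number of intermediate nodes adjacent to both a and b."""
    return len(grid_neighbors(*a, r) & grid_neighbors(*b, r))

def diameter_at_most_two(r):
    """True if every pair of distinct nodes is adjacent or shares a neighbor."""
    nodes = list(product(range(r), repeat=2))
    return all(
        b in grid_neighbors(*a, r) or two_hop_routes(a, b, r) > 0
        for a in nodes for b in nodes if a != b
    )

if __name__ == "__main__":
    r = 4
    print(len(grid_neighbors(0, 0, r)))      # 2 * (r - 1) links per node
    print(two_hop_routes((0, 0), (1, 1), r))
    print(diameter_at_most_two(r))
```

Two nodes that differ in both coordinates share exactly two neighbors, (i, j') and (i', j), which are the 2 routes mentioned in the answer. The exact link count per node is 2(r - 1), which the slide rounds to $2\sqrt{P}$.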
Indirect network topologies Slim-fly network topology
Source: Besta, Hoefler 2014. Slim Fly: A Cost Effective Low-Diameter Network
A network of size $2q^2$ is constructed, where q is (almost any) prime.
There are two $q \times q$ grids A, B; each node is connected to some nodes in its column and some nodes in the other grid.
Given a node $(x, y) \in A$ in the first grid and $(m, c) \in B$ in the second grid, they are connected iff
$mx + c - y \equiv 0 \mod q$
These links suffice to connect any pair of nodes in two columns of the same grid!
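The claim that the inter-grid links connect nodes in different columns can be checked exhaustively for a small prime. The sketch below assumes 0-based coordinates in [0, q), treats a column of grid A as the set of nodes sharing the same x, and uses hypothetical function names. It models only the links between A and B, not the links inside columns.

```python
def b_neighbors(x, y, q):
    """Nodes (m, c) of grid B linked to (x, y) in grid A,
    i.e. those with m*x + c - y ≡ 0 (mod q)."""
    return {(m, (y - m * x) % q) for m in range(q)}

def common_b_neighbors(a, b, q):
    """Nodes of grid B adjacent to both a and b (each a node of grid A)."""
    return b_neighbors(*a, q) & b_neighbors(*b, q)

if __name__ == "__main__":
    q = 5
    print(sorted(b_neighbors(1, 2, q)))
    print(common_b_neighbors((1, 2), (3, 4), q))   # different columns
    print(common_b_neighbors((1, 2), (1, 4), q))   # same column
```

For a prime q, two nodes of A with different x share exactly one neighbor in B, which gives a 2-hop route. Two nodes with the same x share none, which is why links within columns are also needed.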
Indirect network topologies Slim-fly network topology
Given $(x, y)$ and $(x', y')$ with $x \neq x'$, there must exist $(m, c)$ such that
$mx + c - y \equiv mx' + c - y' \equiv 0 \mod q$
To route, we need to determine m, c given x, x', y, y'. We can do some modular arithmetic to determine m, c:
$mx + c - y \equiv_q mx' + c - y'$
$m(x - x') \equiv_q y - y'$
$m \equiv_q (x - x')^{-1}(y - y')$
where we need to find the modular multiplicative inverse (this is one of the reasons q needs to be prime); c then follows from $c \equiv_q y - mx$.
We also need to connect $(x, y)$ to $(m, c)$ by finding $(x', y)$ and $(m', c)$ that are connected, so $m'x' + c - y \equiv_q 0$, which defines how nodes should be connected in columns.
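The modular arithmetic above translates directly into a routing helper. This is a minimal sketch, assuming q is prime, coordinates are 0-based in [0, q), and x differs from x'. The function name `intermediate_router` is hypothetical, and the modular inverse uses Python's three-argument `pow` with exponent -1 (Python 3.8 or later).

```python
def intermediate_router(x, y, xp, yp, q):
    """Return the node (m, c) of grid B adjacent to both (x, y) and (xp, yp)
    of grid A, i.e. m*x + c - y ≡ m*xp + c - yp ≡ 0 (mod q).

    Requires q prime and x != xp (mod q) so that (x - xp) is invertible.
    """
    if (x - xp) % q == 0:
        raise ValueError("nodes in the same column have no common B neighbor")
    inv = pow((x - xp) % q, -1, q)   # modular multiplicative inverse
    m = (inv * (y - yp)) % q         # m ≡ (x - x')^{-1} (y - y')
    c = (y - m * x) % q              # from m*x + c - y ≡ 0
    return m, c

if __name__ == "__main__":
    q = 7
    x, y, xp, yp = 0, 3, 2, 0
    m, c = intermediate_router(x, y, xp, yp, q)
    print((m, c))
    # Both connection conditions should hold for the returned node.
    print((m * x + c - y) % q, (m * xp + c - yp) % q)
```

The inverse exists for every nonzero difference x - x' only because q is prime, which is the reason given above for the primality requirement.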