Data Center Network Topologies:
FatTree
Hakim Weatherspoon
Assistant Professor, Dept of Computer Science
CS 5413: High Performance Systems and Networking
September 22, 2014
Slides used and adapted judiciously from Networking Problems in Cloud Computing EECS 395/495 at Northwestern University
Goals for Today
- A Scalable, Commodity Data Center Network Architecture
- M. Al-Fares, A. Loukissas, A. Vahdat. ACM SIGCOMM Computer Communication Review (CCR), Volume 38, Issue 4 (October 2008), pages 63-74.
- Main Goal: addressing the limitations of today’s data center
network architecture
- single point of failure
- oversubscription of links higher up in the topology
- trade-offs between cost and provisioning
- Key Design Considerations/Goals
- Allows host communication at line speed
- no matter where they are located!
- Backwards compatible with existing infrastructure
- no changes in application & support of layer 2 (Ethernet)
- Cost effective
- cheap infrastructure
- and low power consumption & heat emission
Topology:
- 2 layers: 5K to 8K hosts
- 3 layers: >25K hosts
Switches:
- Leaves: N GigE ports (48-288) + N 10 GigE uplinks to one or more layers of network elements
- Higher levels: N 10 GigE ports (32-128)
Multi-path Routing:
- Ex. ECMP
- Without it, the largest cluster = 1,280 nodes
- Performs static load splitting among flows
- Can lead to oversubscription for simple communication patterns
- Routing table entries grow multiplicatively with the number of paths, increasing cost and lookup latency
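The static load splitting that ECMP performs can be sketched as a hash over a flow's header fields. This is a minimal illustration, not how any particular switch implements it: the 5-tuple layout, the `ecmp_path` name, and the use of CRC32 as the hash are all assumptions made for the example.

```python
import zlib


def ecmp_path(flow, num_paths):
    """Pick an uplink index for a flow by hashing its header fields.

    `flow` is a hypothetical 5-tuple:
    (src_ip, dst_ip, src_port, dst_port, protocol).
    CRC32 stands in for whatever hash real hardware uses.
    """
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_paths


flow_a = ("10.0.1.2", "10.2.0.3", 40000, 80, "tcp")
flow_b = ("10.0.1.3", "10.2.0.2", 40001, 80, "tcp")

# Every packet of a flow hashes to the same path, so the flow stays in order.
assert ecmp_path(flow_a, 4) == ecmp_path(flow_a, 4)

# The choice ignores flow size and current load: two large flows may
# land on the same uplink even when other uplinks are idle.
print(ecmp_path(flow_a, 4), ecmp_path(flow_b, 4))
```

Because the mapping depends only on header fields, the split is static: it cannot react when several heavy flows hash to the same link.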
Background
[Figure: servers attach to access-layer (Layer-2) switches, which connect to aggregation-layer (Layer-2/3) switches, which connect to core (Layer-3) routers linking the data center to the Internet.]
Common Data Center Topology
Leverages specialized hardware and communication protocols, such as InfiniBand, Myrinet.
- These solutions can scale to clusters of thousands of nodes with high bandwidth
- Expensive infrastructure, incompatible with TCP/IP applications
Leverages commodity Ethernet switches and routers to interconnect cluster machines
- Backwards compatible with existing infrastructures, low-cost
- Aggregate cluster bandwidth scales poorly with cluster size, and achieving the highest levels of bandwidth incurs non-linear cost increase with cluster size
Current Data Center Network Architectures
- Single point of failure
- Oversubscription of links higher up in the topology
- Trade-off between cost and provisioning
Problems with common DC Topology
Properties of the solution
- Backwards compatible with existing infrastructure
- No changes in application
- Support of layer 2 (Ethernet)
- Cost effective
- Low power consumption & heat emission
- Cheap infrastructure
- Allows host communication at line speed
Clos Networks/Fat-Trees
- Adopt a special instance of a Clos topology
- Similar trends in telephone switches led to
designing a topology with high bandwidth by interconnecting smaller commodity switches.
- Why Fat-Tree?
- A fat tree has identical bandwidth at any bisection
- Each layer has the same aggregated bandwidth
- Can be built using cheap devices with uniform capacity
- Each port supports same speed as end host
- All devices can transmit at line speed if packets are distributed uniformly along available paths
- Great scalability: k-port switches support k^3/4 servers
Fat tree network with K = 6 supporting 54 hosts
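The k^3/4 scaling figure can be checked with a few lines of arithmetic. The sketch below assumes the three-layer construction from the Al-Fares et al. paper cited at the start (k pods, each with k/2 edge and k/2 aggregation switches, plus (k/2)^2 core switches); the per-layer counts come from that paper rather than from the slide itself, and `fat_tree_size` is a name chosen for this example.

```python
def fat_tree_size(k):
    """Return element counts for a fat-tree built from k-port switches."""
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even integer >= 2")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,         # k/2 per pod
        "aggregation_switches": k * half,  # k/2 per pod
        "core_switches": half * half,      # (k/2)^2
        "hosts": k * half * half,          # k/2 hosts per edge switch = k^3/4
    }


print(fat_tree_size(6)["hosts"])   # 54
print(fat_tree_size(48)["hosts"])  # 27648
```

With k = 6 this reproduces the 54 hosts in the figure; with 48-port switches the same formula gives 27,648 hosts.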
FatTree-based DC Architecture
Is using a fat-tree topology to interconnect racks of
servers sufficient in itself?
- What routing protocols should we run on these
switches?
- Layer 2 switch algorithm: data plane flooding!
- Layer 3 IP routing:
- shortest path IP routing will typically use only one path
despite the path diversity in the topology
- if equal-cost multi-path routing is used at each switch
independently and blindly, packet re-ordering may occur;
furthermore, load may not necessarily be well balanced
- Aside: control plane flooding!
FatTree Topology is great, But…
Enforce a special (IP) addressing scheme in DC
unused.PodNumber.switchnumber.Endhost
Allows hosts attached to the same switch to route only
through that switch
Allows intra-pod traffic to stay within the pod
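A short sketch shows how the address alone reveals locality. It assumes the first ("unused") octet is 10 and that host IDs start at 2, as in the Al-Fares et al. paper; both are assumptions beyond what the slide states, and the function names are made up for this example.

```python
def host_addresses(k, first_octet=10):
    """Yield an address for every host in a k-ary fat-tree.

    Format: first_octet.pod.switch.host_id
    Assumptions: first_octet = 10, host IDs run from 2 to k/2 + 1.
    """
    half = k // 2
    for pod in range(k):
        for switch in range(half):              # edge switches in the pod
            for host_id in range(2, half + 2):
                yield f"{first_octet}.{pod}.{switch}.{host_id}"


def locality(addr_a, addr_b):
    """Classify a host pair using only the pod and switch octets."""
    a, b = addr_a.split("."), addr_b.split(".")
    if a[1] != b[1]:
        return "inter-pod"      # must go up to a core switch
    if a[2] != b[2]:
        return "same pod"       # stays within the pod
    return "same switch"        # handled by the shared edge switch


print(locality("10.0.0.2", "10.0.0.3"))  # same switch
print(locality("10.0.0.2", "10.0.1.2"))  # same pod
print(locality("10.0.0.2", "10.3.1.2"))  # inter-pod
```

A switch can therefore decide how far up the tree a packet has to travel by comparing octets, with no per-host routing state.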
FatTree Modified
- Use two level look-ups to distribute traffic
and maintain packet ordering
- First level is prefix lookup
- used to route down the topology to servers
- Second level is a suffix lookup
- used to route up towards core
- maintain packet ordering by using same ports for same server
- Diffuses and spreads out traffic
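The two-level lookup can be sketched as a prefix table whose catch-all entry falls through to a suffix table keyed on the host ID (the last octet). The table contents below are hypothetical entries for an aggregation switch in pod 2 of a k = 4 fat-tree, with 10 assumed as the first octet; the class and method names are illustrative and not taken from the paper's implementation.

```python
import ipaddress


class TwoLevelTable:
    """Prefix lookup first; a prefix with no port defers to a suffix lookup."""

    def __init__(self):
        self.prefixes = []  # list of (network, port or None)
        self.suffixes = {}  # host ID (last octet) -> port

    def add_prefix(self, cidr, port=None):
        self.prefixes.append((ipaddress.ip_network(cidr), port))
        # Keep the longest prefixes first so the first match is the best one.
        self.prefixes.sort(key=lambda entry: entry[0].prefixlen, reverse=True)

    def add_suffix(self, host_id, port):
        self.suffixes[host_id] = port

    def lookup(self, dst):
        addr = ipaddress.ip_address(dst)
        for network, port in self.prefixes:
            if addr in network:
                if port is not None:
                    return port                      # route down to servers
                return self.suffixes[int(addr) & 0xFF]  # route up to core
        raise KeyError(dst)


table = TwoLevelTable()
table.add_prefix("10.2.0.0/24", port=0)  # down to edge switch 0
table.add_prefix("10.2.1.0/24", port=1)  # down to edge switch 1
table.add_prefix("0.0.0.0/0")            # everything else: use suffixes
table.add_suffix(2, port=2)              # host ID 2 -> uplink on port 2
table.add_suffix(3, port=3)              # host ID 3 -> uplink on port 3

print(table.lookup("10.2.0.3"))  # 0 (intra-pod, routed down)
print(table.lookup("10.3.0.2"))  # 2 (inter-pod, routed up)
print(table.lookup("10.0.1.3"))  # 3 (inter-pod, routed up)
```

Because the suffix match depends only on the destination host ID, all packets for a given server leave on the same uplink, which preserves ordering, while different host IDs are spread across the uplinks.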
FatTree Modified
- Flow scheduling: pay attention to routing large flows. Edge switches detect any outgoing flow whose size grows above a predefined threshold and then send a notification to a central scheduler. The central scheduler tries to assign non-conflicting paths to these large flows.
- Eliminates global congestion
- Prevent long lived flows from sharing the same links
- Assign long lived flows to different links
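The scheduling idea can be sketched with a greedy central scheduler. This is a simplified model under stated assumptions: a path is identified only by the core switch it crosses, each pod has one link to each core switch, and the threshold value, class name, and function names are hypothetical.

```python
THRESHOLD_BYTES = 10 * 1024 * 1024  # hypothetical "large flow" threshold


class CentralScheduler:
    """Greedily give each large flow a core switch with free links."""

    def __init__(self, num_core):
        self.num_core = num_core
        self.reserved = set()  # links already carrying a large flow
        self.placement = {}    # flow id -> chosen core switch

    def place(self, flow_id, src_pod, dst_pod):
        if flow_id in self.placement:
            return self.placement[flow_id]
        for core in range(self.num_core):
            up = ("up", src_pod, core)      # src pod -> core link
            down = ("down", dst_pod, core)  # core -> dst pod link
            if up not in self.reserved and down not in self.reserved:
                self.reserved.update((up, down))
                self.placement[flow_id] = core
                return core
        return None  # no conflict-free path; flow keeps its default route


def edge_report(scheduler, flow_id, bytes_sent, src_pod, dst_pod):
    """Edge switch side: notify the scheduler only for large flows."""
    if bytes_sent > THRESHOLD_BYTES:
        return scheduler.place(flow_id, src_pod, dst_pod)
    return None


sched = CentralScheduler(num_core=4)
print(edge_report(sched, "f1", 50 * 1024 * 1024, 0, 1))  # 0
print(edge_report(sched, "f2", 50 * 1024 * 1024, 0, 2))  # 1
print(edge_report(sched, "f3", 1024, 0, 3))              # None (small flow)
```

Flows `f1` and `f2` both leave pod 0, so the second is steered to a different core switch rather than sharing the first flow's uplink.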
FatTree Modified
- In this scheme, each switch in the network maintains a BFD (Bidirectional Forwarding Detection) session with each of its neighbors to determine when a link or neighboring switch fails
Failure between upper layer and core switches
- Outgoing inter-pod traffic: the local routing table marks the affected link as unavailable and chooses another core switch
- Incoming inter-pod traffic: the core switch broadcasts a tag to the upper-layer switches directly connected to it, signifying its inability to carry traffic to that entire pod; the upper-layer switches then avoid that core switch when assigning flows destined to that pod
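The two failure cases can be sketched as state kept by an upper-layer switch. This is an illustrative model only: the BFD session itself is not implemented (a failure is simply delivered as a callback), and the class and method names are assumptions for the example.

```python
class UpperSwitch:
    """Tracks which core switches are usable for a given destination pod."""

    def __init__(self, core_ids):
        self.core_ids = list(core_ids)
        self.link_down = set()  # cores whose local link failed (seen via BFD)
        self.unreachable = {}   # core id -> pods that core cannot reach

    def on_link_failure(self, core):
        """Outgoing case: mark the local link to this core as unavailable."""
        self.link_down.add(core)

    def on_core_tag(self, core, pod):
        """Incoming case: the core announced it cannot reach this pod."""
        self.unreachable.setdefault(core, set()).add(pod)

    def usable_cores(self, dst_pod):
        return [
            core for core in self.core_ids
            if core not in self.link_down
            and dst_pod not in self.unreachable.get(core, set())
        ]


sw = UpperSwitch(core_ids=[0, 1])
sw.on_link_failure(0)      # local link to core 0 went down
print(sw.usable_cores(3))  # [1]
sw.on_core_tag(1, 3)       # core 1 lost its link into pod 3
print(sw.usable_cores(3))  # []
print(sw.usable_cores(2))  # [1]
```

The tag only removes a core switch for the affected pod, so traffic to other pods can keep using it.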
Fault Tolerance