Scarica Warehouse-Scale Computing: Architectures, Storage, and Virtualization - Prof. Roveri e più Schemi e mappe concettuali in PDF di Architettura Dei Calcolatori solo su Docsity!
COMPUTING INFRASTRUCTURES
Data center
A data center is a building hosting many servers and communication devices , placed together for:
- Environmental needs (cooling, humidity)
- Physical security
- Easier maintenance Traditional data centers run many small/medium-sized applications, each application uses its own dedicated and isolated hardware with no interaction. Often shared by different organizational units or companies
Warehouse-Scale Computer
To support large-scale internet services (like Gmail, Maps, Google Search), a new architecture was developed: the Warehouse-Scale Computer. A WSC:
- Runs a few very large applications
- Uses a homogeneous hardware/software platform
- Belongs to a single organization
- Shares a central resource management layer
- Is seen as one logical computing unit, even though it consists of thousands of servers Availability and Geographic Distribution Because services running on WSCs need to be available almost all the time (typically 99.99% uptime, meaning less than 1 hour of downtime per year), the architecture is designed to be fault-tolerant and geographically distributed. To achieve this, WSCs are deployed across multiple locations, organized in a hierarchy:
- Geographic Areas (GAs): Defined by legal or geopolitical borders, each containing at least two computing regions
- Computing Regions (CRs): Groups of data centers close enough (within ~2ms latency) to support coordinated activity and disaster recovery
- Availability Zones (AZs): Independent data centers within a region that offer synchronous replication and redundancy for fault tolerance (usually 3 per region to allow for majority-based decision making) HW (^) INFRASTRUCTURES SYSTEM LEVEL (before)
In a Warehouse-Scale Computer, a node is the smallest operational unit, basically a server that handles computation, storage, and network operations. Each node typically includes: A motherboard that holds: One or more CPU sockets Multiple DIMM slots for RAM (up to 192 slots) PCIe slots for GPUs or network cards SATA/SAS ports for local disks RAM : often hundreds of GB, ECC-protected Storage : from a few to 24 HDDs or SSDs, either SATA or faster SAS Network Interface Cards (NICs): often 10 Gbps or faster Power Supply Unit (PSU) and cooling systems (fans, airflow guides) Think of it as a “super-powered PC” mounted to work efficiently with others in a rack. Servers are mounted in standardized racks (typically 19-inch width, 48 inch depth ), flat and stackable units. Each rack:
- Hosts multiple servers (1U, 2U, etc.)
- Shares power and cooling
- Has Top-of-Rack (ToR) switches for network aggregation
- Supports modular cable management and airflow optimization Cooling system Racks are arranged into alternating corridors: Cold aisle → where server fronts face each other (cold air intake) Hot aisle → where server backs face each other (hot air exhaust) Cold air is pushed into the front of the racks, flows through the servers to cool components, and hot air is expelled out the back. This approach is called cold aisle / hot aisle containment, and it’s key for energy-efficient cooling in large-scale datacenters. NODE LEVEL SERVERS
Others techniques:
- Raised floor airflow
- Liquid cooling (e.g., Google TPUs)
- Evaporative cooling for energy savings Power Infrastructure Inside the Datacenter In addition to cooling and server management, datacenters must ensure a stable and redundant power supply:
- Rack-level PDUs (Power Distribution Units)
- UPS (Uninterruptible Power Supplies) for short-term backup
- Battery systems integrated with racks or rows
- Diesel generators and redundant power feeds for long outages Power systems are deeply integrated into the rack and corridor structure, making the datacenter a cohesive, engineered environment. Some datacenters occupy more than 100,000 m². Power consumption may exceed 150 MW (equivalent to a small city). A typical goal is 99.99% availability → less than 1 hour downtime per year Design priorities: Redundancy (N+1, 2N) Modularity (easy to replace parts) Maintainability Energy efficiency Hardware accelerators As AI workloads exploded and Moore’s Law slowed down, WSCs increasingly adopted hardware accelerators to handle the rising compute demands efficiently. Main Types:
- GPUs : Ideal for parallel workloads like ML training; widely supported (CUDA, TensorFlow)
- TPUs (Google): Custom ASICs optimized for TensorFlow; much faster but vendor-locked
- FPGAs : Reconfigurable for specific logic; efficient but harder to program Accelerators are plugged via PCIe or custom interconnects like NVLink, depending on the workload.
To optimize storage, modern disks use:
- Zoned Bit Recording: More sectors on outer tracks → faster sequential reads
- Skewing: Slight delays between adjacent tracks to minimize access time Access Time Components 1 Seek Time : move the head to the correct track → includes acceleration, coasting, deceleration, settling → average seek time ≈ T_seek_avg = T_max / 3 2 Rotational Latency : wait for the sector to spin under the head → T_rotation = 60s / RPM → average latency = T_rotation_avg = T_rotation / 2 3 Transfer Time : time to read/write actual bytes Ttransfer = data_size / transfer_rate 4 Controller Overhead : small delays in handling the request Performance Depends on Locality
- High locality: Data read in one seek/rotation → fast
- Low locality: Each block requires its own seek/rotation → very slow
2. SSDs - Solid State Drives
A Solid-State Drive (SSD) is a storage device based on NAND flash memory, with no moving parts. Unlike HDDs, which rely on spinning platters and mechanical arms, SSDs are purely electronic, making them:
- Much faster, especially for random access
- More reliable, with no mechanical failures
- Silent and energy-efficient
- Better suited for data-intensive, high-throughput workloads common in WSCs SSDs are built from:
- Flash memory chips (organized in pages and blocks)
- A controller to manage operations
- An internal firmware layer called the Flash Translation Layer (FTL) T_IO = Tseek + Trotation + Ttransfer + Toverhead T_response = Tqueue + T_IO FCFS SSTF SCAN C-SCAN C-LOOK · (1-locality )
arrival closest (^) - to correc S L
S
SSDs organize their memory as follows:
- Page : the smallest unit that can be read or written → Typically 4 KB
- Block : the smallest unit that can be erased → Typically consists of 64 or 128 pages (e.g., 256 KB–512 KB) Important constraint: You can read/write pages erase entire blocks. Only empty pages can be written. To free a page, you must erase its entire block. The Write Amplification Problem When you update a page (e.g., 4 KB), you can’t just overwrite it in-place. Instead, the SSD must: 1 Read the whole block into cache 2 Modify the relevant page(s) 3 Erase the block 4 Write back the updated version of the entire block → This means that writing 4 KB may involve reading and writing hundreds of KB. This overhead is called write amplification — you're writing much more than the original data size. Consequences: - Slower write performance • Increased wear on memory cells • Shorter SSD lifespan The Flash Translation Layer (FTL) To manage the complex nature of flash memory, SSDs rely on the FTL, a key firmware layer that provides: a. Logical to Physical Mapping
- Maps OS-level LBAs (logical block addresses) to internal physical pages
- Enables overwriting without needing immediate erase b. Garbage Collection (GC)
- Finds blocks full of dirty pages
- Copies valid pages to a new block
- Erases the old block to reclaim space GC introduces latency and write amplification, especially when triggered during heavy I/O. c. Wear Leveling
- Flash blocks have limited erase/write cycles (typically 3,000–100,000)
- FTL spreads writes across all blocks to ensure uniform wear
- Prevents premature failure of frequently written areas Delete and TRIM Command File deletion is not sufficient in SSDs because OS typically just removes file metadata. → The TRIM command allows the OS to explicitly inform the SSD about which blocks are no longer in use. TRIM helps the SSD:
- Identify garbage early. • Reduce unnecessary copying during G. • Improve long-term performance
RAID Architectures
RAID (Redundant Array of Independent Disks) is a technique used to:
- Combine multiple physical disks into a single logical storage unit
- Improve performance, reliability, or both
- Offer transparency to the OS, which sees one logical volume regardless of underlying complexity Why Use RAID in WSCs?
- Disks are prone to failure → risk increases as more disks are added
- RAID adds redundancy and/or parallelism
- Used in databases, storage backends, and critical data systems Striping : distributes data across disks to improve speed Redundancy : introduces duplication or parity to tolerate failures RAID 0 – Striping Only Data is evenly split across disks ( the capacity is fully employed for storage space) No redundancy — any disk failure = total data loss Best for temporary, non-critical data like scratch space RAID 1 – Mirroring Only Every disk has an exact copy (mirror) ( Only 50% of disk capacity is usable) High fault tolerance Ideal for small, critical datasets or system partitions RAID 0+1: Stripe first, then mirror Stripe first: creates 2 groups of RAID 0 Then mirror: creates a copy (RAID 1) Second failure in a stripe group disables the whole group RAID 1+0 (RAID 10): Mirror first, then stripe Mirrors pairs of disks, then stripes across them Can tolerate multiple failures (as long as mirrors survive)
RAID 10 is preferred in databases and high-performance workloads Mirror Stripe Stripe Stripe Mirror Mirror c (^) capacity) ES : N =^ G disk C = (^) 1TB storage capacity^?
N.^ C^ =^ GTB^ storage
- C = (^) 1TB (^) storage (col (^5) copie S -v & = 3 S -v O
RAID 5 – Striping with Single Parity One disk is used for recovery in case of failure: parity block -> allows recovery from one disk failure Write operation requires: (slower)
- Reading old data + old parity
- Computing and writing new parity Good balance of performance, capacity, and reliability RAID 6 – Double Parity Two disks are used for recovery in case of failure: 2 parity block -> allows recovery from two simultaneous failures Even slower write (and higher cost) Used in archival systems, WSCs with large disk arrays Parity Block A = 00110011 Block B = 10100011 Parity = A ⊕ B = 10010000 Now suppose A is lost: → Recovered A = Parity ⊕ B = 10010000 ⊕ 10100011 = 00110011 Il disco di parità non conserva i dati originali, ma sufficiente informazione ridondante (parity) calcolata su tutti gli altri, per ricostruire i dati persi se un disco fallisce. Formule: MTTF RAID 0 =
MTTF
#disks 1 disk MTTF (^) RAID 1 = Prob (DL) = Pr( 1°fail) • Pr(2°fail) Pr( 1°fail) = n/ MTTF Pr(2°fail) = MTTR/ MTTF MTTF (^) RAID 10= Prob (DL)
Pr( 1°fail) • Pr(2°fail) Pr( 1°fail) = n/ MTTF Pr(2°fail) = MTTR/ MTTF during repair MTTF (^) RAID 01= (^) Prob (DL) =^ Pr( 1°fail) • Pr(2°fail) Pr( 1°fail) = n/ MTTF (^) Pr(2°fail) = MTTR • n / 2 MTTF MTTF (^) RAID 5 = Prob (DL)
Pr( 1°fail) • Pr(2°fail) Pr( 1°fail) = n/ MTTF Pr(2°fail) = MTTR • (n-1) / MTTF MTTF (^) RAID 6 = Pr( 1°fail) • Pr(2°fail) • Pr(3°fail) Pr( 1°fail) = n/ MTTF Pr(2°fail) = MTTR • (n-1) / MTTF Pr(3°fail) = MTTR • (n-2) / 2 MTTF ② (N- 1) - C =^ 5TB Disk (^) D D1 D2 D O ⑧ 1 Po ? Pl us 5 S (N -^ 2)^.^ C^ =^ GTB Disk D D1 D2 D3^ D O ⑧ Po^ QU Q (^) usefull when ? (^) i (^) - S (^) MITR is P3 Q3 big 2 2 I 11 2 2 11 I
Warehouse-Scale Computers (WSCs) include thousands of interconnected servers running distributed applications(e.g., microservices, machine learning, replication, etc.). As compute power increases, network interconnects must scale accordingly. A good Intra-Datacenter Network (DCN) must provide:
- High throughput
- Low latency
- Fault tolerance
- Load balancing Key Concepts 1. East-West vs North-South Traffic East-West = server-to-server (internal), e.g., service calls, replication North-South = client-to-datacenter (external), e.g., user web requests → In modern DCs, ~76% of traffic is East-West 2. Bisection Bandwidth The total bandwidth between two halves of the datacenter. It must scale linearly with the number of servers to avoid bottlenecks. DCN Architecture Types Traditional DCN architecture : 3-Tier Model Layers: 1 Access (ToR switches) 2 Aggregation (connects ToRs) 3 Core (connects to Internet or other DCs) Limitations:
- Limited bisection bandwidth
- Costly and hard to scale beyond a few thousand servers ToR vs EoR switches:
- ToR: 1 per rack, easy cabling, harder to manage
- EoR: 1 per corridor, centralized, harder cabling Oversubscription: Bandwidth to upper layers is often oversubscribed (e.g., 4:1), which may hurt latency-sensitive apps. NETWORKING
3 unicast^ (1-1)^ ,^ multycast^ (1-may
,^ incast^ (many -^ 1)
~ internal bandwidth is far more important
Modern DCN architectures Modern designs aim to:
- Use commodity switches
- Maximize parallelism and path diversity
- Achieve scalability without bottlenecks Clos / Spine-and-Leaf
- Two layers: leaf switches (ToR) and spine switches (backbone)
- Each leaf connects to all spines
- All paths are equal-cost (→ ECMP load balancing) Benefits:
- Scalable, uniform latency
- High bisection bandwidth
- Easy to expand (just add more switches)
Used in practice in hyperscale datacenters like Google, Facebook, Azure Fat-Tree A specific Clos variant optimized for commodity switches. Built around Pods, each with servers and two switch layers. Let k = number of switch ports → Supports up to 2k³ servers using 5k² switches Advantages:
- Modular and symmetric
- Full bandwidth between any two servers
- Perfect for scale-out designs Recent innovations:
- Optical Circuit Switching (OCS) for direct, high-bandwidth connections
- Central SDN control for traffic engineering, fault response, and live reconfiguration V
Two-Loop vs Three-Loop Three-loop: Adds chillers to further cool water before reuse More efficient, but also more expensive and complex Two-loop: First loop: Air circulates within datacenter (via CRACs) Second loop: Liquid coolant removes heat to rooftop exchangers Advanced Cooling Techniques As servers get more powerful (especially with GPUs/TPUs), we need cooling closer to the source. In-Rack Cooling
- Heat exchangers are placed inside the racks, right at the hot air exit
- Very effective and fast cooling In-Row Cooling
- Cooling units placed between server racks
- Easier to maintain, and allows flexible datacenter design Liquid Cooling
- Uses cold plates directly on high-heat components
- Heat is transferred via liquid to a nearby exchanger
- Efficient, but not all components can be cooled this way (some remain air-cooled) Container-Based Datacenters These are modular server systems built inside shipping containers (6–12 meters long).
- Include integrated power and cooling
- Can be deployed rapidly, even in remote locations
- Ideal for edge computing or fast scaling
Energy consumption and sustainability
Datacenters consume huge amounts of energy:
- Cooling alone may account for ~50% of the total power
- Globally, datacenters: use ~3% of electricity (more than the UK), emit ~2% of global CO₂ (similar to aviation) This makes energy efficiency and sustainability a major concern in datacenter design. raffreddo i^ server · (^) vi-raffreddo l'aria aggiunge un^ secondo^ ri-raffreddamento D (^) ML and Al
PUE – Power Usage Effectiveness PUE is the standard metric for datacenter efficiency: PUE = Inversely, DCiE (Data Center infrastructure Efficiency) is: 1 / PUE Total power used (including cooling, lights, etc.) Power used by IT equipment only -> Ideal PUE = 1.0 (impossible in practice): The closer to 1, the more efficient the datacenter [ In 2012, Google achieved PUE < 1.1 ] Data Center Tiers Availability and reliability are standardized in Tier levels (from 1 to 4):
- Tier I: Basic capacity, no redundancy
- Tier II: Redundant capacity components
- Tier III: Concurrent maintainability (no shutdowns during maintenance)
- Tier IV: Fault-tolerant infrastructure Higher tier = higher cost, higher reliability
Reliability
1. System VMs Emulate the entire machine, including the ISA and hardware Support full operating systems Managed by a Virtual Machine Monitor (VMM), also known as a Hypervisor Type 1 (Bare-metal): runs directly on hardware Type 2 (Hosted): runs inside an OS 2. Process VMs Focus only on a single application Translate code into instructions for the host OS Examples: JVM, .NET CLR Virtualization Methods
- Multi-Programming Not true virtualization, but OS-based abstraction:
- The OS gives each process its own address space and file system
- Feels like each user has their own machine
- Emulation Used when the guest ISA ≠ host ISA:
- Software translates instructions from one ISA to another
- Useful for running legacy systems (e.g., old games or other architectures)
- High-Level Language VMs
- Focused on portability across platforms
- Run applications in sandboxed environments Example: JVM runs Java bytecode the same way on any OS Properties of Virtualization Virtualization offers key features for cloud systems: Partitioning: multiple OSs on one physical server Isolation: failure in one VM doesn’t affect others Encapsulation: entire VM state can be saved/restored or migrated live Hardware Independence: move VMs between servers easily TYPE & TYPE 2 &
The Hypervisor A Hypervisor (aka VMM) is the software layer that manages virtual machines.
- Allocates CPU, memory, and I/O
- Handles privileged instructions
- Ensures security and performance isolation Types of Hypervisors Type 1 – Bare-Metal
- Installed directly on hardware
- More efficient and secure
- Two types: Monolithic: all drivers inside the hypervisor Microkernel: uses a separate service VM for drivers Type 2 – Hosted
- Runs inside a traditional OS (e.g., VirtualBox, VMware Workstation)
- Easier to set up and manage, but with slightly higher overhead Hardware-Level VMM sits between hardware and OS Entire OSs can run as guests Application-Level VMM sits between OS and applications Apps run inside sandboxed containers or language VMs System-Level VMM allows secondary OSs to run atop a host OS Paravirtualization The guest OS is modified to be "aware" it's running on a VM Uses hypercalls instead of slow system traps High performance. Requires OS modification Full Virtualization Guest OS runs unmodified The hypervisor traps and emulates privileged instructions Compatible with all OSs. Higher overhead, may require hardware support Containers vs Virtual Machines Containers are another virtualization method—but at the OS level.
- Share the same kernel across containers
- Contain all code, dependencies, configs needed to run the app
- Ideal for microservices, CI/CD, and cloud-native development & 3