System Architecture - Computer Systems Architecture - Lecture Slides, Slides of Computer Architecture and Organization

Some concepts in Computer Systems Architecture are Acyclic Graph, Advanced Micro Devices, Basic Grid Architecture, Control Flow Prediction, Desktop Processor Architecture, and Message-Driven Processor. Main points of this lecture are: System Architecture, Hydrodynamics, Quantum Chemistry, Molecular Dynamics, Climate Modeling, Financial Modeling, Peak Performance, Low Latency, Bandwidth Networks, Point Operations.

Typology: Slides

2012/2013

Uploaded on 04/27/2013

jutt

The IBM Blue Gene/L System

Architecture

What is Blue Gene/L?

  • Blue Gene is an IBM Research project dedicated to exploring the frontiers in supercomputing.
  • In November 2004, the IBM Blue Gene computer became the fastest supercomputer in the world.
  • This project is designed to scale to 65,536 dual-processor nodes, with a peak performance of 360 TeraFLOPS.
  • Example usage:
    • hydrodynamics
    • quantum chemistry
    • molecular dynamics
    • climate modeling
    • financial modeling

Main Design Principles for Blue Gene/L

  • Some science & engineering applications scale up to and beyond 10,000 parallel processes.
  • Improve computing capability while holding total system cost constant.
  • Reduce cost/FLOP.
  • Reduce complexity and size.
    • ~25 kW/rack is the maximum for air cooling in a standard room.
    • Need to improve performance/power ratio.
    • 700 MHz PowerPC 440 for ASIC has excellent FLOP/Watt.
  • Maximize Integration:
    • On chip: ASIC with everything except main memory.
    • Off chip: Maximize number of nodes in a rack.
  • Large systems require excellent reliability, availability, serviceability (RAS).

Main Design Principles (cont’d)

  • Make cost/performance trade-offs considering the end use:
    • Applications <> Architecture <> Packaging
    • Examples:
      • 1 or 2 differential signals per torus link, i.e. 1.4 or 2.8 Gb/s.
      • Maximum of 3 or 4 neighbors on collective network, i.e. depth of network and thus global latency.
  • Maximize the overall system efficiency:
    • Small team designed all of Blue Gene/L.
    • Example: Chose ASIC die and chip pin-out to ease circuit card routing.
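
The link between neighbor count and network depth can be sketched with a little arithmetic. The snippet below is only an illustration under stated assumptions: it treats the collective network as a complete tree, assumes one of each node's links goes to its parent (so 3 neighbors means 2 children and 4 neighbors means 3 children), and uses the 65,536-node figure from the slides. The function name `tree_levels` is hypothetical.

```python
def tree_levels(num_nodes: int, fan_out: int) -> int:
    """Smallest number of levels of a complete fan_out-ary tree holding num_nodes."""
    levels = 0
    capacity = 0      # nodes that fit in the levels counted so far
    level_width = 1   # the root level holds a single node
    while capacity < num_nodes:
        capacity += level_width
        level_width *= fan_out
        levels += 1
    return levels


NODES = 65_536  # full-scale system size from the slides

for neighbors in (3, 4):
    fan_out = neighbors - 1  # assumption: one link is the parent link
    print(f"{neighbors} neighbors -> {tree_levels(NODES, fan_out)} levels")
```

Under these assumptions the extra child link cuts the tree from 17 levels to 11, which is the kind of latency-versus-wiring trade-off the slide alludes to.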

Blue Gene/L Architecture

  • Up to 32×32×64 = 65,536 nodes (3D torus).
  • Max 360 teraFLOPS computation power.
  • Each processor can perform 4 floating point operations per cycle (in the form of two 64-bit floating point multiply-adds per cycle).
  • 5 networks connect nodes to each other and to the world.
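
The peak figure can be roughly cross-checked from numbers quoted elsewhere in these slides (two processors per node, 700 MHz clock). This is a back-of-the-envelope sketch, not an official derivation; the constant names are chosen here for illustration.

```python
NODES = 32 * 32 * 64          # 3D torus dimensions
PROCESSORS_PER_NODE = 2       # from the node architecture slide
FLOPS_PER_CYCLE = 4           # two 64-bit multiply-adds per cycle
CLOCK_HZ = 700e6              # 700 MHz PowerPC 440

peak_flops = NODES * PROCESSORS_PER_NODE * FLOPS_PER_CYCLE * CLOCK_HZ
print(f"{NODES} nodes, about {peak_flops / 1e12:.0f} TFLOPS peak")
```

The product comes out at roughly 367 TFLOPS, in line with the rounded 360 TFLOPS figure quoted in the slides.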

Node Architecture

  • IBM PowerPC embedded CMOS processors, embedded DRAM, and system-on-a-chip techniques are used.
  • 11.1-mm square die size, allowing for a very high density of processing.
  • The ASIC uses IBM CMOS CU-11 0.13 micron technology.
  • 700 MHz processor speed is close to memory speed.
  • Two processors per node.
  • Second processor is intended primarily for handling message passing operations.

[Figure: BlueGene/L node diagram]

Link ASIC

  • In addition to the compute ASIC, there is a “link” ASIC.
  • When crossing a midplane boundary, BG/L’s torus, global combining tree, and global interrupt signals pass through the BG/L link ASIC.
  • It redrives signals over the cables between BG/L midplanes.
  • The link ASIC can redirect signals between its different ports.
    • This enables BG/L to be partitioned into multiple, logically separate systems with no traffic interference between them.

The FP2 core (cont’d)

  • This enhanced set goes beyond the capabilities of traditional SIMD architectures.
  • A single instruction can initiate a different but related operation on different data.
  • Single Instruction Multiple Operation Multiple Data (SIMOMD).
  • Either of the sides can access data from the other side’s register file.
  • This saves a lot of swapping when working purely on complex arithmetic operations.
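
A complex multiply shows why cross-side register access and "different but related" operations help. The sketch below is a plain Python model, not the FP2 instruction set: the `Pair` type and `complex_multiply` function are hypothetical, and it simply assumes the real part lives on the primary side and the imaginary part on the secondary side.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    primary: float    # value held on the primary side (real part here)
    secondary: float  # value held on the secondary side (imaginary part here)


def complex_multiply(x: Pair, y: Pair) -> Pair:
    """Compute (a + bi)(c + di) in two paired steps."""
    # Step 1: both sides multiply by x.primary (a): same operation, different data.
    acc = Pair(x.primary * y.primary, x.primary * y.secondary)   # (ac, ad)
    # Step 2: each side reads y from the OTHER side, and the two sides apply
    # different operations (subtract vs. add) in what would be one instruction.
    return Pair(acc.primary - x.secondary * y.secondary,         # ac - bd
                acc.secondary + x.secondary * y.primary)         # ad + bc


result = complex_multiply(Pair(1.0, 2.0), Pair(3.0, 4.0))
print(result)
```

Without cross-side access, step 2 would first need the two halves of `y` swapped between register files, which is the swapping the slide says is avoided.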

Memory System

  • It is designed for high-bandwidth, low-latency memory and cache accesses.
  • An L2 hit returns in 6 to 10 processor cycles.
  • An L3 hit in about 25 cycles.
  • An L3 miss in about 75 cycles.
  • System has a 16-byte interface to nine 256 Mb SDRAM-DDR devices.
  • It operates at one half or one third of the processor speed.
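
To put the cycle counts in wall-clock terms, they can be converted using the 700 MHz processor clock quoted on the node architecture slide. This is a simple unit conversion offered as a sketch; the helper name `cycles_to_ns` is made up for this example.

```python
CLOCK_HZ = 700e6  # processor clock from the node architecture slide


def cycles_to_ns(cycles: float) -> float:
    """Convert processor cycles to nanoseconds at CLOCK_HZ."""
    return cycles / CLOCK_HZ * 1e9


latencies = {
    "L2 hit (best)": 6,
    "L2 hit (worst)": 10,
    "L3 hit": 25,
    "L3 miss": 75,
}

for name, cycles in latencies.items():
    print(f"{name}: {cycles} cycles = {cycles_to_ns(cycles):.1f} ns")
```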

Torus Network (cont’d)

  • Class Routing Capability (Deadlock-free Hardware Multicast)
  • Packets can be deposited along route to specified destination.
  • Allows for efficient one-to-many communication in some instances.
  • Active messages allow for fast transposes as required in FFTs.
  • Independent on-chip network interfaces enable concurrent access.
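
The "deposit along route" idea can be pictured with a toy model: one packet travels along a single torus dimension and every node it passes keeps a copy. The sketch below assumes shortest-direction routing around the ring and a route confined to the x dimension; both function names are hypothetical and nothing here reflects the actual hardware routing logic.

```python
def line_route(src: int, dst: int, size: int) -> list[int]:
    """Coordinates visited from src to dst along one torus dimension,
    going the shorter way around the ring (ties go in the + direction)."""
    forward = (dst - src) % size
    step = 1 if forward <= size - forward else -1
    hops = forward if step == 1 else size - forward
    return [(src + step * i) % size for i in range(1, hops + 1)]


def deposit_along_route(src, dst, dims):
    """Nodes receiving a copy when a packet goes from src to dst along x.
    Assumption: src and dst differ only in their x coordinate."""
    assert src[1:] == dst[1:], "sketch handles a single-dimension route only"
    return [(x, src[1], src[2]) for x in line_route(src[0], dst[0], dims[0])]


DIMS = (32, 32, 64)  # torus size from the architecture slide
print(deposit_along_route((0, 5, 7), (3, 5, 7), DIMS))
print(deposit_along_route((0, 5, 7), (30, 5, 7), DIMS))  # wraps around the ring
```

One injected packet reaches every node on the line, which is why the slide describes this as an efficient one-to-many mechanism in some instances.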

Other Networks

  • A global combining/broadcast tree for collective operations.
  • A Gigabit Ethernet network for connection to other systems, such as hosts and file systems.
  • A global barrier and interrupt network.
  • And another Gigabit Ethernet to JTAG network for machine control.

Gb Ethernet Disk/Host I/O Network

  • IO nodes are leaves on collective network.
  • Compute and IO nodes use same ASIC, but:
    • IO node has Ethernet, not torus: provides IO separation from the application.
    • Compute node has torus, not Ethernet: No need for 65536 cables.
  • Configurable ratio of IO to compute nodes: 1:8, 1:16, 1:32, 1:64, or 1:128.
  • Application runs on compute nodes, not IO nodes.
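
The configurable ratio translates directly into an I/O node count. The sketch below assumes the 65,536 figure counts compute nodes only, with I/O nodes added on top; the function name `io_nodes_needed` is hypothetical.

```python
def io_nodes_needed(compute_nodes: int, compute_per_io: int) -> int:
    """I/O nodes required when each serves compute_per_io compute nodes."""
    # Ceiling division so a partial group still gets its own I/O node.
    return -(-compute_nodes // compute_per_io)


COMPUTE_NODES = 65_536  # assumption: full-scale system, compute nodes only

for ratio in (8, 16, 32, 64, 128):
    print(f"1:{ratio} -> {io_nodes_needed(COMPUTE_NODES, ratio)} I/O nodes")
```

Even at the densest 1:8 setting this needs 8,192 Ethernet connections rather than 65,536, which matches the slide's point about avoiding one cable per compute node.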

Fast Barrier/Interrupt Network

  • Four Independent Barrier or Interrupt Channels
    • Independently Configurable as "or" or "and"
  • Asynchronous Propagation
    • Halt operation quickly (current estimate is 1.3 µs worst-case round trip).
    • 3/4 of this delay is time-of-flight.
  • Sticky bit operation
    • Allows global barriers with a single channel.
  • User Space Accessible
    • System selectable
  • It is partitioned along the same boundaries as the Tree and Torus.
    • Each user partition contains its own set of barrier/interrupt signals.
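
The "and"/"or" configuration and the sticky behaviour can be illustrated with a toy software model. This is only a conceptual sketch: the `BarrierChannel` class is hypothetical, and it assumes "sticky" means a node's raised bit stays raised until the channel is explicitly reset. The real network does this in asynchronous hardware.

```python
class BarrierChannel:
    """Toy model of one global channel with sticky per-node inputs."""

    def __init__(self, num_nodes: int):
        self.raised = [False] * num_nodes

    def signal(self, node: int) -> None:
        # Sticky: once raised, the bit stays raised until reset().
        self.raised[node] = True

    def global_and(self) -> bool:
        # "and" mode: true only when every node has signalled (a barrier).
        return all(self.raised)

    def global_or(self) -> bool:
        # "or" mode: true as soon as any node has signalled (an interrupt).
        return any(self.raised)

    def reset(self) -> None:
        self.raised = [False] * len(self.raised)


channel = BarrierChannel(num_nodes=4)
channel.signal(0)
print(channel.global_or(), channel.global_and())   # one node so far
for node in (1, 2, 3):
    channel.signal(node)
print(channel.global_and())                        # all nodes have arrived
```

Because each input holds its value, nodes can arrive at different times and a single channel is enough to detect that all of them have reached the barrier.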