Chip Multiprocessor - Computer Systems Organization | EE 457, Study notes of Computer Architecture and Organization

chip multiprocessor Material Type: Notes; Professor: Puvvada; Class: Computer Systems Organization; Subject: Electrical Engineering; University: University of Southern California; Term: Fall 2010;

Typology: Study notes

Pre 2010

Uploaded on 12/12/2010

An Oracle White Paper
October 2010
Oracle's SPARC T3-1, SPARC T3-2,
SPARC T3-4 and SPARC T3-1B Server
Architecture

Oracle's SPARC T3-1, SPARC T3-2, SPARC T3-4, and SPARC T3-1B Server Architecture

  • Introduction
  • The Evolution of Chip Multithreading
    • Business Challenges for Enterprise Applications
    • Rule-Changing Chip Multithreading Technology
    • SPARC T3-1, T3-2, T3-4, and T3-1B Servers
  • SPARC T3 Processor
    • The World's First Sixteen-Core Massively Threaded System-on-a-Chip
    • Taking Chip Multithreaded Design to the Next Level
    • SPARC T3 Processor Architecture
  • Server Architecture
    • System-Level Architecture
    • Chassis Design Innovations
    • Oracle's SPARC T3-1 Server Overview
    • Oracle's SPARC T3-2 Server Overview
    • Oracle's SPARC T3-4 Server Overview
    • Oracle's SPARC T3-1B Server Overview
  • Enterprise-Class Management and Software
    • System Management Technology
    • Scalability and Support for Chip Multithreading Technology
    • Fault Management and Predictive Self-Healing
    • Chip Multithreading Tools: Performance and Rapid Time to Market
  • Conclusion

The Evolution of Chip Multithreading

Oracle’s UltraSPARC processors have led the industry for years: first with the introduction of the multithreaded, multicore chip design in the first-generation UltraSPARC T1 processor in 2005, and now with the fourth-generation SPARC T3 processor. By any measure, the first-generation chip multithreading (CMT) processors were an unprecedented success. Delivering up to five times the throughput in a quarter of the space and power, systems using these processors were rapidly adopted. Now fourth-generation CMT technology is evolving rapidly to meet the constantly changing demands of a wide range of enterprise data center applications.

Business Challenges for Enterprise Applications

Organizations across many industries hope to address larger markets, reduce costs, and gain better insights into their customers. At the same time, an increasingly broad array of wired and wireless client devices are bringing network computing into the everyday lives of millions of people. This strong demand has a “pull-through” effect on the IT services that must be satisfied in the data center. These trends are redefining data center scalability and capacity requirements, even as they collide with fundamental real estate, power, and cooling constraints.

Driving Data Center Virtualization and Eco-Efficiency

Coincident with the need to scale services, many data centers recognize the advantages of deploying fewer standard platforms to run a mixture of commercial and technical workloads. This process involves consolidating underused and sprawling server infrastructures with effective virtualization solutions that serve to enhance business agility, improve disaster recovery, and reduce operating costs. This focus can help reduce energy costs and break through data center capacity constraints by improving the amount of realized performance for each watt of power the data center consumes.

Eco-efficiency provides tangible benefits, improving ecology by reducing the carbon footprint to meet legislative and corporate social responsibility goals, even as it improves the economy of the organization paying the electric bill. As systems are consolidated onto more dense and capable computing infrastructure, demand for data center real estate is also reduced. With careful planning, this approach can also improve service uptime and reliability by reducing hardware failures resulting from excess heat load. Servers with high levels of standard reliability, availability, and serviceability (RAS) are now considered a requirement.

Building Out for Web-scale Applications

Web-scale applications engender a new pace and urgency to infrastructure deployment. Organizations must accelerate time to market and time to service, while delivering scalable high-quality and high-performance applications and services. Many need to be able to start small with the ability to scale very quickly, with new customers and innovative new Web services often implying a doubling of capacity in months rather than years.

At the same time, organizations must reduce their environmental impact by working within the power, cooling, and space available in their current data centers. Operational costs too are receiving new scrutiny, along with system administrative costs that can account for up to 40 percent of an IT budget. Simplicity and speed are paramount, giving organizations the ability to respond quickly to dynamic business conditions. Organizations are also striving to eliminate vendor lock-in as they look to preserve previous, current, and future investments. Open platforms built around open standards help provide maximum flexibility while reducing costs of both entry and exit.

Securing the Enterprise at Speed

Organizations are increasingly interested in securing all communications with their customers and partners. Given the risks, end-to-end encryption is essential to inspire confidence in security and confidentiality. Encryption is also increasingly important for storage, helping to secure stored and archived data even as it provides a mechanism to detect tampering and data corruption.

Unfortunately, the computational costs of increased encryption can increase the burden on already overtaxed computational resources. Security also needs to take place at line speed, without introducing bottlenecks that can impact the customer experience or slow transactions. Solutions must help to ensure security and privacy for clients and bring business compliance for the organization, all without impacting performance or increasing costs.

Rule-Changing Chip Multithreading Technology

Addressing these challenges has outstripped the capabilities of traditional processors and systems, and required a fundamentally new approach.

Moore’s Law and the Diminishing Returns of Traditional Processor Design

The oft-quoted tenet of Moore’s law states that the number of transistors that will fit in a square inch of integrated circuitry will approximately double every two years. For more than three decades the pace of Moore’s law has held, driving processor performance to new heights. Processor manufacturers have long exploited these gains in chip real estate to build increasingly complex processors, with instruction-level parallelism (ILP) as a goal. These traditional processors employ very high frequencies along with a variety of sophisticated tactics to accelerate a single instruction pipeline, including:

  • Large caches
  • Superscalar designs
  • Out-of-order execution
  • Very high clock rates
  • Sophisticated branching techniques
  • Deep pipelines
  • Speculative prefetches

First introduced with the UltraSPARC T1 processor, CMT takes advantage of chip multiprocessing (CMP) advances, but adds a critical capability: the ability to scale with threads rather than frequency. Unlike traditional single-threaded processors and even most current multicore processors, hardware multithreaded processor cores allow rapid switching between active threads as other threads stall for memory. Figure 1 illustrates the difference between CMP, fine-grained hardware multithreading (FG-MT), and CMT. The key to this approach is that each core in a CMT processor is designed to switch between multiple threads on each clock cycle. As a result, the processor’s execution pipeline stays busy doing useful work, even as memory operations for stalled threads continue in parallel.

Figure 1. CMT combines CMP and fine-grained hardware multithreading.

CMT provides real value because it increases the ability of the execution pipeline to do actual work on any given clock cycle. Use of the processor pipeline is greatly enhanced because a number of execution threads now share its resources. The negative effects of memory latency are effectively masked, because the memory subsystem services stalled threads in parallel while the execution pipeline works on other threads. Since these individual processor cores implement much simpler pipelines that focus on scaling with threads rather than frequency (emphasizing thread-level parallelism, TLP, over ILP), they are also substantially cooler and require significantly less electrical energy to operate. This approach results in a unique processor technology: multiple physical instruction execution pipelines (one for each core), with multiple active thread contexts per core. In addition, SPARC T3 processors feature two execution pipelines per core to further boost scalability.
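The latency-masking argument above can be illustrated with a toy model. The sketch below is not taken from the white paper: it assumes every thread alternates between a fixed burst of compute cycles and a fixed memory stall, and that the core can always switch to any ready thread at no cost. The function name and the cycle counts are hypothetical.

```python
def pipeline_utilization(threads, compute_cycles, stall_cycles):
    """Fraction of cycles a single pipeline does useful work (toy model).

    Assumption: each thread computes for `compute_cycles`, then stalls on
    memory for `stall_cycles`, and thread switching is free.
    """
    if threads < 1 or compute_cycles < 1 or stall_cycles < 0:
        raise ValueError("threads and compute_cycles must be >= 1, stall_cycles >= 0")
    # Each thread can keep the pipeline busy for this fraction of the time.
    per_thread_demand = compute_cycles / (compute_cycles + stall_cycles)
    # The pipeline cannot be more than 100 percent busy.
    return min(1.0, threads * per_thread_demand)


if __name__ == "__main__":
    # Hypothetical workload: 10 cycles of work per 70-cycle memory stall.
    for n in (1, 2, 4, 8):
        print(f"{n} thread(s): {pipeline_utilization(n, 10, 70):.3f}")
```

Under these assumed numbers a single thread keeps the pipeline busy only one cycle in eight, while eight threads saturate it. Real workloads have variable stall times and contend for shared resources, so this is only an upper bound on the benefit.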

The SPARC T3 Processor

Unlike complex single-threaded processors, CMT processors use the available transistor budget to implement multiple hardware multithreaded processor cores on a chip die. SPARC T3 processors take the CMT model to the next level, providing up to 16 cores per processor, with each core supporting up to eight threads via two independent pipelines, effectively doubling the throughput of UltraSPARC T2 and T2 Plus processors with only minor increases in clock frequency. In addition, these processors use the increased transistor budget resulting from 40 nm silicon technology to implement the industry’s first massively threaded system-on-a-chip (SoC), with a single processor die hosting:

  • Up to 128 threads per processor (up to sixteen cores supporting eight threads per core)
  • On-chip Level 1 and Level 2 caches
  • Newly designed floating point pipeline per core
  • Per core cryptographic acceleration of 12 different ciphers
  • Two on-chip 10 Gigabit Ethernet (GbE) interfaces
  • Two on-chip PCI Express Generation 2 (PCIe Gen2) interfaces
  • Six on-chip cache coherency links and logic

Through SoC design, the SPARC T3 processor significantly enhances the general-purpose nature of the CPU by building in 16 newly designed floating-point units (one per core). Enhanced floating-point capabilities further open the SPARC T3 to the world of compute-intensive applications as well as the traditionally CMT-friendly data center throughput applications. No-cost security and cryptographic acceleration is provided by the on-chip, per-core streaming accelerators. In addition, the ability to move data in and out of the SPARC T3 processor is significantly aided by two integrated PCIe Generation 2 interfaces and dual 10 GbE interfaces. The SPARC T3 processor also implements cache coherency logic and links on the processor silicon that facilitate a multisocket, glueless system design.

SPARC T3-1, T3-2, T3-4, and T3-1B Servers

Oracle's SPARC T3-1, SPARC T3-2, SPARC T3-4, and SPARC T3-1B servers are all designed to leverage the considerable resources of the SPARC T3 processor in the form of cost-effective, general-purpose platforms (Figure 2). SPARC T3-based servers deliver up to twice the throughput of their predecessors, while leading competitors in terms of performance, performance per watt, and SWaP performance (as evaluated by the Space, Watts, and Performance metric detailed later in this section). SPARC T3-2 servers extend this scalability by adding dual sockets for SPARC T3 processors and considerably larger memory support. The quad-socket SPARC T3-4 server extends this scalability further. All these systems extend the benefits of CMT from multithreaded commercial workloads into technical workloads oriented towards floating-point operations.

Figure 2. Oracle's SPARC T3-1, SPARC T3-2, SPARC T3-4, and SPARC T3-1B servers are designed to leverage the considerable resources of the SPARC T3 processor.

directly by the SPARC T3 processor. This approach provides leading levels of performance and scalability with extremely high levels of power, heat, and space efficiency. SPARC T3-2 servers extend this breakthrough compute and memory density, delivering up to 256 threads in a single system, while typically consuming less power than an equivalently configured previous-generation system. SPARC T3-2 servers deliver twice the I/O bandwidth of Sun SPARC Enterprise T5120 and T5220 servers by providing two PCIe root complexes associated with each SPARC T3 processor.

  • Accelerated time to market. SPARC T3-4 servers running Oracle Solaris provide full binary compatibility with earlier SPARC systems, preserving investments and speeding time to market. The Cool Tools for SPARC help accelerate application selection, profiling, testing, tuning, debugging, and deployment of key applications on CMT systems. This functionality has been integrated into the Oracle Solaris Studio 12 release.
  • Industry-leading tools for virtualization and consolidation. Oracle’s chip multithreading (CMT) technology is ideal for consolidation, providing low-level multithreading support for virtualization at every layer of the technology stack. Oracle’s Virtual Machine Server for SPARC (OVMSS) technology exploits the SPARC T3 processor’s up to 128 threads per socket, offering multiple guest operating system instances. In addition, Oracle Solaris Containers provide virtualization within a single Oracle Solaris instance. The advanced Oracle Solaris ZFS file system provides storage virtualization and considerable scalability.
  • System and data center reliability. Reliability is key to keeping applications available and costs down. With the greater levels of integration provided by an SoC design, SPARC T3-1, T3-2, T3-4, and T3-1B servers provide commensurately higher levels of reliability, availability, and serviceability (RAS). Lower power consumption and higher performance per watt greatly reduce generated heat loads and the associated issues they cause. Technologies such as Oracle's Solaris Predictive Self Healing are integrated with the hardware, and help keep systems available.
  • A tradition of leading eco-efficiency. Oracle's Sun Fire and Sun SPARC Enterprise T1000 and Oracle's Sun SPARC Enterprise T2000 servers were the industry's first eco-responsible servers. Oracle's Sun SPARC Enterprise T5120, T5220, T5140, T5240, and T5440 servers continued this tradition by offering the best performance and performance per watt across a wide range of commercial and technical workloads. In addition, Oracle's UltraSPARC T2 and UltraSPARC T2 Plus processors were the first processors to incorporate unique power management features at both core and memory levels of the processor. Oracle's SPARC T3-1, T3-2, T3-4, and T3-1B servers, which use the SPARC T3 processor, take this functionality to an even higher level by achieving higher levels of integration at the processor level.
  • Zero-cost security. Providing secure communications and data protection has never been more important, with attempted electronic intrusion and theft at an all-time high. With up to 16 integrated cryptographic accelerators (one per core) on each SPARC T3 processor, there simply is no need to send plain text on the network or store plain text in storage systems. SPARC T3-1, T3-2, T3-4, and T3-1B servers support many more cryptographic operations per second than competitive systems with dedicated cryptographic accelerator cards, all with minimal impact on system overhead.

  • Simplified management. Each SPARC T3-1, T3-2, T3-4, and T3-1B server provides an Integrated Lights Out Manager (ILOM 3.0) service processor, compatible with Oracle’s x64 servers. Integrated Lights Out Manager provides a command-line interface (CLI), a Web-based graphical user interface (GUI), and Intelligent Platform Management Interface (IPMI) functionality to aid out-of-band monitoring and administration. Integrated Lights Out Manager no longer provides an Advanced Lights Out Management (ALOM) backward-compatibility mode for administrators: the user setting cli_mode=alom is no longer offered and, at best, provided only a subset of full ALOM capabilities.

Innovative System Design

Beyond the capabilities of individual systems, Oracle understands that data centers have unique and pressing needs that require attention on the part of system designers. Density, performance, and scalability are all essential considerations, but systems must also be serviceable and fit in with modern data center strategies that consider power, cooling, and serviceability. SPARC T3-1, T3-2, T3-4, and T3-1B servers share an innovative design philosophy that extends across Oracle’s volume x64 and SPARC server platforms. Principles of this philosophy include the following.

  • Maximum compute density. Oracle’s volume servers provide leading density in terms of CPU cores, memory, storage and I/O. This focus on density often lets Oracle’s 2RU rack mount servers replace competitive 3RU servers, for a 33 percent space savings.
  • Continued investment protection. Oracle designs for maximum investment protection. Even with breakthrough technology such as chip multithreaded processors, applications simply run without modification.
  • Leading storage capacity. Oracle’s volume servers provide leading density and flexible RAID options. Smaller disk drives and innovations in structure, airway, and carrier design allow more disk capacity in smaller spaces, while enhancing system airflow.
  • Common, shared management. SPARC T3-1, T3-2, T3-4, and T3-1B servers are designed for ease of management and serviceability with service processors shared by other Oracle volume server platforms. Systems and components are designed for easy identification, and hot-swap components facilitate online replacement.

Table 1 compares the SPARC T3-1, T3-2, T3-4, and T3-1B servers.

TABLE 1. SPARC T3-1/T3-2/T3-4/T3-1B SERVER FEATURES

| FEATURE | SPARC T3-1 SERVER | SPARC T3-2 SERVER | SPARC T3-4 SERVER | SPARC T3-1B BLADE SERVER |
| --- | --- | --- | --- | --- |
| CPUs | 16-core 1.65 GHz SPARC T3 processor | 16-core 1.65 GHz SPARC T3 processor (Dual) | 16-core 1.65 GHz SPARC T3 processor (Dual or Quad) | 8- or 16-core 1.65 GHz SPARC T3 processor |
| Threads | Up to 128 | Up to 256 | Up to 512 | Up to 128 |
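The thread counts in Table 1 follow directly from sockets × cores × threads per core. A minimal sketch of that arithmetic, using the per-processor figures given earlier (16 cores, eight threads per core); the function name and the dictionary of socket counts are illustrative only:

```python
CORES_PER_PROCESSOR = 16   # maximum cores per SPARC T3 processor
THREADS_PER_CORE = 8       # hardware threads per core


def max_threads(sockets, cores=CORES_PER_PROCESSOR, threads_per_core=THREADS_PER_CORE):
    """Maximum hardware threads in a system with the given socket count."""
    return sockets * cores * threads_per_core


# Maximum socket counts per server model, as listed in Table 1.
MAX_SOCKETS = {"T3-1": 1, "T3-2": 2, "T3-4": 4, "T3-1B": 1}

if __name__ == "__main__":
    for model, sockets in MAX_SOCKETS.items():
        print(f"SPARC {model}: up to {max_threads(sockets)} threads")
```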

protection, along with redundant hot-swap disks, power supplies, and fans. The following design elements in the SPARC T3-1, T3-2, T3-4, and T3-1B servers are key to improving the dependability of IT services.

  • Processor thread and core off-lining and built-in RAID capabilities
  • Redundancy and hot-swap components
  • Parity protection and error correction capabilities
  • System monitoring
  • Integrated Lights Out Manager service processor
  • Superior energy efficiency
  • Robust virtualization technology
  • Comprehensive fault management

Space, Watts, and Performance: The SWaP Metric

SPARC T3-1, T3-2, T3-4, and T3-1B servers deliver leading performance across a range of multithreaded workloads and benchmarks. However, with energy and real estate costs and pressures, it is not enough to measure performance in isolation. Delivering the required level of throughput in a fixed space and power envelope is critical. Traditional system-to-system benchmarks are valuable as a way of comparing one system to another, but are limited when it comes to understanding the power and density attributes of the systems being compared. For this reason, Oracle has developed the space, watts, and performance (SWaP) metric. Designed to provide a simple and transparent measure of overall server efficiency, SWaP is calculated using the following formula:

SWaP = Performance / (Space * Power Consumption)

where

  • Performance is measured by industry-standard benchmarks
  • Space refers to the height of the server in rack units
  • Power is measured by watts used by the system, taken during actual benchmark runs or from vendor site planning guides
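As a small worked example, the SWaP formula can be computed as below. The two servers and their numbers are invented purely for illustration and are not benchmark results from the white paper.

```python
def swap_metric(performance, rack_units, watts):
    """SWaP = Performance / (Space * Power Consumption).

    performance: score from an industry-standard benchmark
    rack_units:  height of the server in rack units (RU)
    watts:       power drawn by the system during the benchmark run
    """
    if rack_units <= 0 or watts <= 0:
        raise ValueError("rack_units and watts must be positive")
    return performance / (rack_units * watts)


if __name__ == "__main__":
    # Hypothetical systems: (benchmark score, rack units, watts).
    servers = {
        "server_a": (1000.0, 2, 600.0),
        "server_b": (1500.0, 4, 1000.0),
    }
    for name, (perf, ru, w) in servers.items():
        print(f"{name}: SWaP = {swap_metric(perf, ru, w):.4f}")
```

In this made-up comparison the second server has the higher raw score but the lower SWaP, because it needs twice the rack space and more power to deliver it.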

SPARC T3 Processor

The SPARC T3 processor is the industry’s most highly integrated system-on-a-chip, supplying the most cores and threads of any general-purpose processor available, and integrating all key system functions.

The World's First Sixteen-Core Massively Threaded System-on-a-Chip

The SPARC T3 processor eliminates the need for expensive custom hardware and software development by integrating computing, security, and I/O onto a single chip. The processor is binary compatible with earlier SPARC processors, and no other processor delivers so much performance in so little space and with such small power requirements, letting organizations rapidly scale the delivery of new network services with maximum efficiency and predictability. The SPARC T3 processor is shown in Figure 3.

Figure 3. The SPARC T3 processor allows organizations to rapidly scale the delivery of new network services as well as increasingly compute-intensive workloads with maximum efficiency and predictability.

Table 2 provides a comparison between the SPARC T3 and UltraSPARC T2 and T2 Plus processors.

TABLE 2. SPARC T3, ULTRASPARC T2, AND ULTRASPARC T2 PLUS PROCESSOR FEATURES

| FEATURE | SPARC T3 PROCESSOR | ULTRASPARC T2 PROCESSOR | ULTRASPARC T2 PLUS PROCESSOR |
| --- | --- | --- | --- |
| Cores/Processor | Up to 16 | Up to 8 | Up to 8 |
| Threads/Core | 8 | 8 | 8 |
| Threads/Processor | 128 | 64 | 64 |
| Hypervisor | Yes | Yes | Yes |
| Sockets Supported | 1, 2, or 4 | 1 | 2 or 4* |
| Memory | Two memory controllers; up to 16 DDR3 DIMMs | Four memory controllers; up to 16 FB-DIMMs | Two memory controllers; up to 16 or 32 FB-DIMMs |
| Caches | 16 KB instruction cache; 8 KB data cache; 6 MB L2 cache (16 banks, 24-way associative) | 16 KB instruction cache; 8 KB data cache; 4 MB L2 cache (8 banks, 16-way associative) | 16 KB instruction cache; 8 KB data cache; 4 MB L2 cache (8 banks, 16-way associative) |
| Technology | 40 nm technology | 65 nm technology | 65 nm technology |
| Floating Point | 1 FPU with Mul/Add per core; 16 FPUs per chip | 1 FPU per core; 8 FPUs per chip | 1 FPU per core; 8 FPUs per chip |
| Integer Resources | 2 integer execution units/core | 2 integer execution units/core | 2 integer execution units/core |
| Cryptography | Stream processing unit/core; 12 most popular ciphers | Stream processing unit/core; 10 most popular ciphers | Stream processing unit/core; 10 most popular ciphers |

Additional On-chip • Dual PCIe interface (x8) • Dual 10 GbE interfaces • PCIe interface (x8)

Figure 4. A single 16-core SPARC T3 processor supports up to 128 threads, with up to 2 threads running in each core simultaneously.

SPARC T3 Processor Architecture

The SPARC T3 processor extends Oracle’s CMT initiative with an elegant and robust architecture that delivers real performance to applications. Figure 5 provides a block-level diagram of the SPARC T3 processor.

Figure 5. The SPARC T3 processor provides six coherence links to connect to up to three other processors.

The SPARC T3 has coherence link interfaces that allow communication between up to four SPARC T3 processors in a system without requiring any external hub chip. There are six coherence links, each 14 bits wide in each direction, with each bit lane running at 9.6 Gbps. Each frame has 168 bits, so the maximum frame rate is 800M frames per second. The SPARC T3 has two coherence link controllers. Each includes two Coherence and Ordering Units (COUs), three Link Framing Units (LFUs), and a crossbar (CLX) between the COUs and LFUs. Each COU interfaces to two L2 bank pairs. The coherence links run a snoopy cache coherence protocol over an FB-DIMM-like physical interface. The memory link speed of the SPARC T3 was increased to 6.4 Gb/sec, up from 4.8 Gb/sec on the UltraSPARC T2 Plus processor and 4.0 Gb/sec on the UltraSPARC T2 processor.
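The 800M frames per second figure can be checked from the link parameters quoted above: 14 bit lanes per direction at 9.6 Gbps each, and 168 bits per frame. The sketch below reproduces only that arithmetic; the function names are illustrative, and the result is a raw signalling rate that ignores any encoding or protocol overhead not described here.

```python
LANES_PER_DIRECTION = 14          # bits transferred in parallel, per direction
LANE_RATE_BPS = 9_600_000_000     # 9.6 Gbps per bit lane
FRAME_BITS = 168                  # bits per coherence frame


def link_bandwidth_bps(lanes=LANES_PER_DIRECTION, lane_rate=LANE_RATE_BPS):
    """Raw bandwidth of one coherence link in one direction, in bits/second."""
    return lanes * lane_rate


def frames_per_second(lanes=LANES_PER_DIRECTION, lane_rate=LANE_RATE_BPS,
                      frame_bits=FRAME_BITS):
    """Maximum frame rate of one coherence link in one direction."""
    return link_bandwidth_bps(lanes, lane_rate) // frame_bits


if __name__ == "__main__":
    print(f"Per-link bandwidth: {link_bandwidth_bps() / 1e9:.1f} Gb/s per direction")
    print(f"Maximum frame rate: {frames_per_second() / 1e6:.0f}M frames/s")
```

Each frame therefore takes 168 / 14 = 12 bit times on the link, which is where the 9.6 GHz / 12 = 800 MHz frame rate comes from.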

The SPARC T3 processor can support one-, two-, and four-socket implementations. A typical two-socket implementation is shown in Figure 6. Dual-socket SPARC T3 implementations interconnect the processors’ six coherence links; no additional circuitry is required.

Figure 7. Block-level diagram of a core of the SPARC T3 processor.

Components implemented in each core include the following.

  • Trap logic unit. The trap logic unit (TLU) updates the machine state as well as handling exceptions and interrupts.
  • Instruction fetch unit. The instruction fetch unit (IFU) includes a 16 KB instruction cache (32-byte lines, 8-way set associative) and a 64-entry fully associative instruction translation lookaside buffer (ITLB).
  • Integer execution unit. Dual integer execution units (EXUs) are provided per core with four threads sharing each unit. Eight register windows are provided per thread, with 160 integer register file (IRF) entries per thread.
  • Floating point/graphics unit. A floating point/graphics unit (FGU) is provided within each core and it is shared by all eight threads assigned to the core. Thirty-two floating-point register file entries are provided per thread. A fused floating point Mul/Add instruction is implemented.
  • Stream processing unit. Each core contains a stream processing unit (SPU) that provides cryptographic co-processing.
  • Memory management unit. The memory management unit (MMU) provides a hardware table walk (HWTW) and supports 8 KB, 64 KB, 4 MB, and 256 MB pages.
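The cache and page parameters in the list above imply specific address-bit splits. The following sketch derives them with standard set-associative cache arithmetic. It assumes power-of-two sizes, and the helper names are hypothetical rather than anything defined by the processor documentation.

```python
KB = 1024
MB = 1024 * KB


def cache_geometry(size_bytes, line_bytes, ways):
    """Return (sets, line-offset bits, set-index bits) for a set-associative cache.

    Assumes size_bytes, line_bytes, and the resulting set count are powers of two.
    """
    sets = size_bytes // (line_bytes * ways)
    offset_bits = line_bytes.bit_length() - 1   # log2(line size)
    index_bits = sets.bit_length() - 1          # log2(number of sets)
    return sets, offset_bits, index_bits


def page_offset_bits(page_bytes):
    """Number of low address bits that are not translated for a page size."""
    return page_bytes.bit_length() - 1          # log2(page size)


if __name__ == "__main__":
    # 16 KB instruction cache, 32-byte lines, 8-way set associative.
    sets, offset_bits, index_bits = cache_geometry(16 * KB, 32, 8)
    print(f"I-cache: {sets} sets, {offset_bits} offset bits, {index_bits} index bits")

    # Page sizes supported by the MMU.
    for name, size in (("8 KB", 8 * KB), ("64 KB", 64 * KB),
                       ("4 MB", 4 * MB), ("256 MB", 256 * MB)):
        print(f"{name} page: {page_offset_bits(size)} offset bits")
```

Larger pages leave more bits untranslated, so a single TLB entry covers more memory; a 256 MB page maps 32,768 times as much address space as an 8 KB page.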

An eight-stage integer pipeline and a new nine-stage floating-point pipeline are provided by the SPARC T3 processor core (Figure 8). A pick pipeline stage exists to choose two threads (out of the eight possible per core) to execute each cycle.

Figure 8. An 8-stage integer pipeline and a 9-stage floating-point pipeline are provided by the SPARC T3 processor core.

To illustrate how the dual integer pipelines function, Figure 9 depicts the integer pipeline with the load store unit (LSU). The instruction cache is shared by all eight threads within the core. A least-recently-fetched algorithm is used to select the next thread to fetch. Fetched instructions for each thread are written into a thread-specific instruction buffer (IB), and each of the eight threads is statically assigned to one of two thread groups within the core.

Figure 9. Threads are interleaved between pipeline stages with very few restrictions (integer pipeline shown, letters depict pipeline stages, numbers depict different scheduled threads)

The pick stage chooses one thread each cycle within each thread group. Picking within each thread group is independent of the other, and a least-recently-picked algorithm is used to select the next thread to execute. The decode stage resolves resource conflicts that are not handled during the pick stage. As shown in the illustration, threads are interleaved between pipeline stages with very few restrictions. Any thread can be at the fetch or cache stage, before being split into either of the two thread groups. Load/store and floating-point units are shared between all eight threads. Only one thread from either thread group can be scheduled on such a shared unit.
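The pick behavior described above can be sketched as a small simulation. This is a software illustration of a least-recently-picked policy, not a description of the actual hardware logic. The split of threads 0-3 and 4-7 into the two groups, the `ready` sets, and all class and function names are assumptions made for the example.

```python
from collections import deque


class ThreadGroup:
    """One thread group with a least-recently-picked (LRP) selection policy."""

    def __init__(self, thread_ids):
        # Front of the deque is the least recently picked thread.
        self.order = deque(thread_ids)

    def pick(self, ready):
        """Pick the least recently picked thread that is ready, or None."""
        tid = next((t for t in self.order if t in ready), None)
        if tid is not None:
            # Move the picked thread to the back: it is now most recently picked.
            self.order.remove(tid)
            self.order.append(tid)
        return tid


def simulate(cycles, ready_fn):
    """Return one (group0_pick, group1_pick) tuple per cycle.

    ready_fn(cycle) returns the set of thread ids that are not stalled.
    The static assignment 0-3 / 4-7 is an assumption for illustration.
    """
    groups = [ThreadGroup(range(0, 4)), ThreadGroup(range(4, 8))]
    trace = []
    for cycle in range(cycles):
        ready = ready_fn(cycle)
        # Each group picks independently of the other.
        trace.append(tuple(group.pick(ready) for group in groups))
    return trace


if __name__ == "__main__":
    all_ready = set(range(8))
    print(simulate(5, lambda cycle: all_ready))
    # Thread 1 stalled (for example, waiting on memory): it is skipped.
    print(simulate(3, lambda cycle: all_ready - {1}))
```

With every thread ready, each group simply rotates through its four threads. When a thread stalls, the group skips it and issues from the next least recently picked ready thread, so the pipeline slot is not wasted.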

Integrated Networking

By providing integrated on-chip networking, the SPARC T3 processor is able to provide better networking performance. All network data is supplied directly from and to main memory. Placing networking so close to memory reduces latency, provides higher memory bandwidth, and eliminates inherent inefficiencies of I/O protocol translation. The SPARC T3 processor provides two 10 Gigabit Ethernet ports with integrated serializer/deserializer (SerDes), offering line-rate packet classification at up to 30 million packets/second (based on layers 1-4 of the protocol stack). Multiple DMA engines ( transmit and 16 receive DMA channels) match DMAs to individual threads, providing binding flexibility between ports and threads. Virtualization support includes provisions for eight partitions, and interrupts may be bound to different hardware threads.
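For context on the 30 million packets/second figure, the sketch below computes the theoretical packet rate of 10 Gigabit Ethernet with minimum-size frames. The white paper does not say how its figure was derived. The calculation assumes standard Ethernet framing (64-byte minimum frame, 8-byte preamble and start-of-frame delimiter, 12-byte interframe gap), and the names are illustrative.

```python
LINK_RATE_BPS = 10_000_000_000   # 10 Gigabit Ethernet
MIN_FRAME_BYTES = 64             # minimum Ethernet frame, including FCS
PREAMBLE_SFD_BYTES = 8           # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12        # mandatory idle time between frames


def max_packets_per_second(link_rate_bps=LINK_RATE_BPS, frame_bytes=MIN_FRAME_BYTES):
    """Theoretical line-rate packet rate for one port at a given frame size."""
    bits_on_wire = (frame_bytes + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8
    return link_rate_bps / bits_on_wire


if __name__ == "__main__":
    per_port = max_packets_per_second()
    print(f"One port:  {per_port / 1e6:.2f} Mpps")
    print(f"Two ports: {2 * per_port / 1e6:.2f} Mpps")
```

Two ports at minimum frame size come to just under 30 million packets per second, which is consistent with the classification rate quoted above if that figure refers to both ports combined.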

Stream Processing Unit

The SPU on each core runs in parallel with the core at the same frequency. The cipher/hash unit supports the RC4, DES/3DES, and AES-128/192/256 ciphers and the MD5, SHA-1, and SHA-256 hash functions. Added to the SPARC T3 processor are SHA-384/SHA-512, Kasumi Bulk Cipher, and Galois Field Operations. The SPU is designed to achieve wire-speed encryption and decryption on the processor’s 10 GbE ports.