







































chip multiprocessor Material Type: Notes; Professor: Puvvada; Class: Computer Systems Organization; Subject: Electrical Engineering; University: University of Southern California; Term: Fall 2010;








































The Evolution of Chip Multithreading
Oracle’s UltraSPARC processors have led the industry for years: first with the introduction of the multithreaded, multicore chip design in the first-generation UltraSPARC T1 processor in 2005, and now with the fourth-generation SPARC T3 processor. By any measure, the first-generation CMT processors were an unprecedented success. Delivering up to five times the throughput in a quarter of the space and power, systems using these processors have rapidly been welcomed and accepted. Now fourth-generation CMT technology is evolving rapidly to meet the constantly changing demands of a wide range of enterprise data center applications.
Organizations across many industries hope to address larger markets, reduce costs, and gain better insights into their customers. At the same time, an increasingly broad array of wired and wireless client devices is bringing network computing into the everyday lives of millions of people. This strong demand has a “pull-through” effect on the IT services that must be satisfied in the data center. These trends are redefining data center scalability and capacity requirements, even as they collide with fundamental real estate, power, and cooling constraints.
Driving Data Center Virtualization and Eco-Efficiency
Coincident with the need to scale services, many data centers recognize the advantages of deploying fewer standard platforms to run a mixture of commercial and technical workloads. This process involves consolidating underused and sprawling server infrastructures with effective virtualization solutions that serve to enhance business agility, improve disaster recovery, and reduce operating costs. This focus can help reduce energy costs and break through data center capacity constraints by improving the amount of realized performance for each watt of power the data center consumes.
Eco-efficiency provides tangible benefits, improving ecology by reducing the carbon footprint to meet legislative and corporate social responsibility goals, even as it improves the economy of the organization paying the electric bill. As systems are consolidated onto more dense and capable computing infrastructure, demand for data center real estate is also reduced. With careful planning, this approach can also improve service uptime and reliability by reducing hardware failures resulting from excess heat load. Servers with high levels of standard reliability, availability, and serviceability (RAS) are now considered a requirement.
Building Out for Web-scale Applications
Web-scale applications engender a new pace and urgency to infrastructure deployment. Organizations must accelerate time to market and time to service, while delivering scalable high-quality and high-performance applications and services. Many need to be able to start small with the ability to scale very quickly, with new customers and innovative new Web services often implying a doubling of capacity in months rather than years.
At the same time, organizations must reduce their environmental impact by working within the power, cooling, and space available in their current data centers. Operational costs too are receiving new scrutiny, along with system administrative costs that can account for up to 40 percent of an IT budget. Simplicity and speed are paramount, giving organizations the ability to respond quickly to dynamic business conditions. Organizations are also striving to eliminate vendor lock-in as they look to preserve previous, current, and future investments. Open platforms built around open standards help provide maximum flexibility while reducing costs of both entry and exit.
Securing the Enterprise at Speed
Organizations are increasingly interested in securing all communications with their customers and partners. Given the risks, end-to-end encryption is essential to inspire confidence in security and confidentiality. Encryption is also increasingly important for storage, helping to secure stored and archived data even as it provides a mechanism to detect tampering and data corruption.
Unfortunately, the computational costs of increased encryption can increase the burden on already overtaxed computational resources. Security also needs to take place at line speed, without introducing bottlenecks that can impact the customer experience or slow transactions. Solutions must help to ensure security and privacy for clients and bring business compliance for the organization, all without impacting performance or increasing costs.
Addressing these challenges has outstripped the capabilities of traditional processors and systems, and required a fundamentally new approach.
Moore’s Law and the Diminishing Returns of Traditional Processor Design
The oft-quoted tenet of Moore’s law states that the number of transistors that will fit in a square inch of integrated circuitry will approximately double every two years. For more than three decades the pace of Moore’s law has held, driving processor performance to new heights. Processor manufacturers have long exploited these gains in chip real estate to build increasingly complex processors, with instruction-level parallelism (ILP) as a goal. These traditional processors employ very high frequencies along with a variety of sophisticated tactics to accelerate a single instruction pipeline, including
First introduced with the UltraSPARC T1 processor, CMT takes advantage of CMP advances, but adds a critical capability: the ability to scale with threads rather than frequency. Unlike traditional single-threaded processors and even most current multicore processors, hardware multithreaded processor cores allow rapid switching between active threads as other threads stall for memory. Figure 1 illustrates the difference between CMP, fine-grained hardware multithreading (FG-MT), and CMT. The key to this approach is that each core in a CMT processor is designed to switch between multiple threads on each clock cycle. As a result, the processor’s execution pipeline remains active doing useful work, even as memory operations for stalled threads continue in parallel.
Figure 1. CMT combines CMP and fine-grained hardware multithreading.
CMT provides real value since it increases the ability of the execution pipeline to do actual work on any given clock cycle. Use of the processor pipeline is greatly enhanced because a number of execution threads now share its resources. The negative effects of memory latency are effectively masked, because the processor and memory subsystems remain active in parallel to the processor execution pipeline. Since these individual processor cores implement much-simpler pipelines that focus on scaling with threads rather than frequency (emphasizing TLP over ILP), they are also substantially cooler and require significantly less electrical energy to operate. This innovative approach results in a unique processor technology—multiple physical instruction execution pipelines (one for each core), with multiple active thread contexts per core. In addition, SPARC T3 processors feature two execution pipelines per core to further boost scalability.
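The latency-masking effect described above can be illustrated with a toy model. The following Python sketch is not a model of the SPARC T3 pipeline; it assumes a single pipeline that issues one instruction per cycle from any ready thread, and threads that each issue a fixed number of instructions and then stall on memory for a fixed number of cycles. The function name, the parameters, and the 7-cycle stall are hypothetical values chosen only for illustration.

```python
def pipeline_utilization(num_threads, compute_cycles, stall_cycles,
                         total_cycles=10_000):
    """Fraction of cycles in which a toy pipeline issues an instruction.

    Assumptions (illustrative only): each thread issues `compute_cycles`
    instructions, one per cycle in which it is picked, then stalls on
    memory for `stall_cycles` cycles. Every cycle the pipeline picks the
    ready thread that was picked least recently.
    """
    ready_at = [0] * num_threads            # cycle at which a thread is ready
    remaining = [compute_cycles] * num_threads
    last_picked = [-1] * num_threads
    busy = 0

    for cycle in range(total_cycles):
        ready = [t for t in range(num_threads) if ready_at[t] <= cycle]
        if not ready:
            continue                        # every thread is stalled: idle cycle
        thread = min(ready, key=lambda t: (last_picked[t], t))
        last_picked[thread] = cycle
        busy += 1
        remaining[thread] -= 1
        if remaining[thread] == 0:          # thread now waits on memory
            remaining[thread] = compute_cycles
            ready_at[thread] = cycle + 1 + stall_cycles

    return busy / total_cycles


for n in (1, 2, 4, 8):
    print(n, "threads:", pipeline_utilization(n, compute_cycles=1, stall_cycles=7))
```

Under these assumptions a single thread keeps the pipeline busy one cycle in eight, while eight threads are enough to fill every cycle, which is the sense in which additional thread contexts hide memory latency.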
The SPARC T3 Processor
Unlike complex single-threaded processors, CMT processors use the available transistor budget to implement multiple hardware multithreaded processor cores on a chip die. SPARC T3 processors take the CMT model to the next level, providing up to 16 cores per processor, with each core supporting up to eight threads via two independent pipelines, effectively doubling the throughput of UltraSPARC T2 and T2 Plus processors with minor increases in the clock frequency. In addition, these processors use the increased transistor budget resulting from the use of a 40 nm silicon technology to implement the industry’s first massively threaded system-on-a-chip (SoC), with a single processor die hosting:
Through SoC design, the SPARC T3 processor significantly enhances the general-purpose nature of the CPU by building in 16 newly designed floating-point units (one per core). Enhanced floating-point capabilities further open the SPARC T3 to the world of compute-intensive applications as well as the traditionally CMT-friendly data center throughput applications. No-cost security and cryptographic acceleration is provided by the on-chip, per-core streaming accelerators. In addition, the ability to move data in and out of the SPARC T3 processor is significantly aided by two integrated PCIe Generation 2 interfaces and dual 10 GbE interfaces. The SPARC T3 processor also implements cache coherency logic and links on the processor silicon that facilitate a multisocket, glueless system design.
Oracle's SPARC T3-1, SPARC T3-2, SPARC T3-4, and SPARC T3-1B servers are all designed to leverage the considerable resources of the SPARC T3 processors in the form of cost-effective, general-purpose platforms (Figure 2). SPARC T3-based servers deliver up to twice the throughput of their predecessors, while leading competitors in terms of performance, performance per watt, and SWaP performance (as evaluated by the Space, Watts, and Performance metric detailed later in this section). SPARC T3-2 servers extend this scalability by adding dual sockets for SPARC T3 processors and considerably larger memory support. Further extending this scalability is the quad-socket SPARC T3-4 server. All these systems extend the benefits of CMT from multithreaded commercial workloads into technical workloads oriented towards floating-point operations.
Figure 2. Oracle's SPARC T3-1, SPARC T3-2, SPARC T3-4, and SPARC T3-1B servers are designed to leverage the considerable resources of the SPARC T3 processor.
directly by the SPARC T3 processor. This approach provides leading levels of performance and scalability with extremely high levels of power, heat, and space efficiency. SPARC T3-2 servers extend this breakthrough compute and memory density, delivering up to 256 threads in a single system, while typically consuming less power than an equivalently configured previous-generation system. SPARC T3-2 servers deliver twice the I/O bandwidth of Sun SPARC T5120 and T5220 servers by providing two PCIe root complexes associated with each SPARC T3 processor.
servers support many more cryptographic operations per second than competitive systems with dedicated cryptographic accelerator cards—all with minimal impact to system overhead.
Innovative System Design
Beyond the capabilities of individual systems, Oracle understands that data centers have unique and pressing needs that require attention on the part of system designers. Density, performance, and scalability are all essential considerations, but systems must also be serviceable and fit in with modern data center strategies that consider power, cooling, and serviceability. SPARC T3-1, T3-2, T3-4, and T3-1B servers share an innovative design philosophy that extends across Oracle’s volume x64 and SPARC server platforms. Principles of this philosophy include the following.
Table 1 compares the SPARC T3-1, T3-2, T3-4, and T3-1B servers.
TABLE 1. SPARC T3-1/T3-2/T3-4/T3-1B SERVER FEATURES
FEATURE | SPARC T3-1 SERVER | SPARC T3-2 SERVER | SPARC T3-4 SERVER | SPARC T3-1B BLADE SERVER
CPUs | 16-core 1.65 GHz SPARC T3 processor | | |
Threads | Up to 128 | Up to 256 | Up to 512 | Up to 128
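The thread counts in Table 1 follow from the per-processor figures given earlier (up to 16 cores, each supporting up to eight threads) multiplied by the number of sockets. The short Python sketch below reproduces that arithmetic. The socket counts for the SPARC T3-2 and T3-4 come from the text; treating the SPARC T3-1 and T3-1B as single-socket systems is an inference from their 128-thread entries, and the names used in the code are illustrative.

```python
CORES_PER_PROCESSOR = 16   # "up to 16 cores per processor"
THREADS_PER_CORE = 8       # "each core supporting up to eight threads"

# Sockets per server. The values for the T3-1 and T3-1B are inferred from
# the 128-thread entries in Table 1 rather than stated directly.
SOCKETS_PER_SERVER = {
    "SPARC T3-1": 1,
    "SPARC T3-2": 2,
    "SPARC T3-4": 4,
    "SPARC T3-1B": 1,
}


def max_threads(sockets):
    """Maximum hardware threads for a server with `sockets` T3 processors."""
    return sockets * CORES_PER_PROCESSOR * THREADS_PER_CORE


thread_counts = {name: max_threads(n) for name, n in SOCKETS_PER_SERVER.items()}
for name, count in thread_counts.items():
    print(f"{name}: up to {count} threads")
```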
protection, along with redundant hot-swap disks, power supplies, and fans. The following design elements in the SPARC T3-1, T3-2, T3-4, and T3-1B servers are key to improving the dependability of IT services.
Space, Watts, and Performance: The SWaP Metric
SPARC T3-1, T3-2, T3-4, and T3-1B servers deliver leading performance across a range of multithreaded workloads and benchmarks. However, with energy and real estate costs and pressures, it is not enough to measure performance in isolation. Delivering the required level of throughput in a fixed space and power envelope is critical. Traditional system-to-system benchmarks are valuable as a way of comparing one system to another, but are limited when it comes to understanding the power and density attributes of the systems being compared. For this reason, Oracle has developed the space, watts, and performance (SWaP) metric. Designed to provide a simple and transparent measure of overall server efficiency, SWaP is calculated using the following formula:
SWaP = Performance / (Space * Power Consumption)
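The SWaP calculation can be sketched in a few lines of Python. The sketch assumes that space is expressed in rack units and power consumption in watts, units that this section does not spell out. The function name and the two sample servers are hypothetical and are not measurements of any real system.

```python
def swap_score(performance, space_rack_units, power_watts):
    """SWaP = Performance / (Space * Power Consumption).

    Units are an assumption of this sketch: `performance` is any benchmark
    score, `space_rack_units` is the chassis height in rack units, and
    `power_watts` is the power drawn while running the benchmark.
    """
    if space_rack_units <= 0 or power_watts <= 0:
        raise ValueError("space and power must be positive")
    return performance / (space_rack_units * power_watts)


# Hypothetical servers, for illustration only.
server_a = swap_score(performance=1000, space_rack_units=2, power_watts=500)
server_b = swap_score(performance=1200, space_rack_units=4, power_watts=800)
print(f"Server A SWaP: {server_a:.3f}")
print(f"Server B SWaP: {server_b:.3f}")
```

In this made-up comparison, server B has the higher raw benchmark score, but server A has the higher SWaP because it delivers its throughput in less space and with less power.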
SPARC T3 Processor
The SPARC T3 processor is the industry’s most highly integrated system-on-a-chip, supplying the most cores and threads of any general-purpose processor available, and integrating all key system functions.
The SPARC T3 processor eliminates the need for expensive custom hardware and software development by integrating computing, security, and I/O onto a single chip. The SPARC T3 processor is binary compatible with earlier SPARC processors, and no other processor delivers so much performance in so little space and with such small power requirements, letting organizations rapidly scale the delivery of new network services with maximum efficiency and predictability. The SPARC T3 processor is shown in Figure 3.
Figure 3. The SPARC T3 processor allows organizations to rapidly scale the delivery of new network services as well as increasingly compute-intensive workloads with maximum efficiency and predictability.
Table 2 provides a comparison between the SPARC T3 and UltraSPARC T2 and T2 Plus processors.
TABLE 2. SPARC T3, ULTRASPARC T2, AND ULTRASPARC T2 PLUS PROCESSOR FEATURES
FEATURE | SPARC T3 PROCESSOR | ULTRASPARC T2 PROCESSOR | ULTRASPARC T2 PLUS PROCESSOR
Cores/Processor | Up to 16 | Up to 8 | Up to 8
Threads/Core | | |
Threads/Processor | | |
Hypervisor | | |
Figure 4. A single 16-core SPARC T3 processor supports up to 128 threads, with up to 2 threads running in each core simultaneously.
The SPARC T3 processor extends Oracle’s CMT initiative with an elegant and robust architecture that delivers real performance to applications. Figure 5 provides a block-level diagram of the SPARC T3 processor.
Figure 5. The SPARC T3 processor provides six coherence links to connect to up to three other processors.
The SPARC T3 has coherence link interfaces to allow communication between up to four SPARC T3 processors in a system without requiring any external hub chip. There are six coherence links, each with 14 bits in each direction running at 9.6 Gbps. Each frame has 168 bits, so the maximum frame rate is 800M frames per second. The SPARC T3 has two coherence link controllers. Each includes two Coherence and Ordering Units (COU), three Link Framing Units (LFU), and a crossbar (CLX) between the COUs and LFUs. Each COU interfaces to two L2 bank pairs. The coherence links run a cache coherence (snoopy) protocol over an FB-DIMM-like physical interface. The memory link speed of the SPARC T3 is 6.4 Gb/sec, up from 4.8 Gb/sec on the UltraSPARC T2 Plus processor and 4.0 Gb/sec on the UltraSPARC T2 processor.
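The 800M frames-per-second figure follows directly from the link parameters quoted above. The Python sketch below works through the arithmetic, reading "14 bits in each direction running at 9.6 Gbps" as 14 lanes per direction at 9.6 Gbps each; that reading, and the constant and function names, are assumptions of the sketch.

```python
LANES_PER_DIRECTION = 14          # "14 bits in each direction"
LANE_RATE_BPS = 9_600_000_000     # 9.6 Gbps per lane (assumed per-lane rate)
FRAME_BITS = 168                  # bits per coherence frame


def max_frame_rate(lanes, lane_rate_bps, frame_bits):
    """Frames per second, per direction, on one coherence link."""
    link_bandwidth_bps = lanes * lane_rate_bps   # 134.4 Gbps per direction
    return link_bandwidth_bps / frame_bits


frames_per_second = max_frame_rate(LANES_PER_DIRECTION, LANE_RATE_BPS, FRAME_BITS)
bit_times_per_frame = FRAME_BITS // LANES_PER_DIRECTION

print(f"{frames_per_second / 1e6:.0f}M frames per second")
print(f"{bit_times_per_frame} bit times per frame on each lane")
```

Under this reading, a 168-bit frame occupies 12 bit times on each of the 14 lanes, and 9.6 Gbps divided by 12 gives the 800M frames per second stated in the text.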
The SPARC T3 processor can support one-, two-, and four-socket implementations. A typical two-socket implementation is shown in Figure 6. Dual-socket SPARC T3 implementations interconnect the processors’ six coherence links; no additional circuitry is required.
Figure 7. Block-level diagram of a core of the SPARC T3 processor.
Components implemented in each core include the following.
An 8-stage integer pipeline and a new 9-stage floating-point pipeline are provided by the SPARC T3 processor core (Figure 8). A pick pipeline stage exists to choose two threads (out of the eight possible per core) to execute each cycle.
Figure 8. An 8-stage integer pipeline and a 9-stage floating-point pipeline are provided by the SPARC T3 processor core.
To illustrate how the dual integer pipelines function, Figure 9 depicts the integer pipeline with the load store unit (LSU). The instruction cache is shared by all eight threads within the core. A least-recently-fetched algorithm is used to select the next thread to fetch. Each thread's fetched instructions are written into a thread-specific instruction buffer (IB), and each of the eight threads is statically assigned to one of two thread groups within the core.
Figure 9. Threads are interleaved between pipeline stages with very few restrictions (integer pipeline shown, letters depict pipeline stages, numbers depict different scheduled threads)
The pick stage chooses one thread each cycle within each thread group. Picking within each thread group is independent of the other, and a least-recently-picked algorithm is used to select the next thread to execute. The decode stage resolves resource conflicts that are not handled during the pick stage. As shown in the illustration, threads are interleaved between pipeline stages with very few restrictions. Any thread can be at the fetch or cache stage, before being split into either of the two thread groups. Load/store and floating-point units are shared between all eight threads. Only one thread from either thread group can be scheduled on such a shared unit.
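The pick and decode behavior described above can be sketched as a small Python model. It is a simplification under stated assumptions, not a description of the actual hardware: the mapping of threads 0 to 3 and 4 to 7 onto the two thread groups, the rule that the lower-numbered group wins a shared-unit conflict, and all class and function names are hypothetical. The text says only that threads are statically assigned to groups and that one thread at a time may use a shared unit.

```python
THREADS_PER_CORE = 8
THREAD_GROUPS = 2


def thread_group(thread_id):
    """Static thread-to-group mapping (assumed: 0-3 -> group 0, 4-7 -> group 1)."""
    return thread_id // (THREADS_PER_CORE // THREAD_GROUPS)


class PickStage:
    """Picks at most one ready thread per group, least recently picked first."""

    def __init__(self):
        self.last_picked = [-1] * THREADS_PER_CORE
        self.cycle = 0

    def pick(self, ready_threads):
        picks = []
        for group in range(THREAD_GROUPS):
            candidates = [t for t in ready_threads if thread_group(t) == group]
            if candidates:
                # Least recently picked wins; ties go to the lower thread id.
                chosen = min(candidates, key=lambda t: (self.last_picked[t], t))
                self.last_picked[chosen] = self.cycle
                picks.append(chosen)
        self.cycle += 1
        return picks


def decode(picks, wants_shared_unit):
    """Resolve shared-unit conflicts: only one picked thread may use the
    load/store or floating-point unit in a cycle. Assumed policy: the
    thread from the lower-numbered group proceeds and the other is held."""
    issued, held = [], []
    unit_taken = False
    for t in picks:
        if t in wants_shared_unit and unit_taken:
            held.append(t)
        else:
            issued.append(t)
            unit_taken = unit_taken or t in wants_shared_unit
    return issued, held


stage = PickStage()
ready = {0, 1, 4, 5}
print(stage.pick(ready))   # one thread from each group
print(stage.pick(ready))   # the threads not picked last cycle
print(decode([0, 4], wants_shared_unit={0, 4}))
```

With threads 0, 1, 4, and 5 ready, successive cycles alternate between threads 0 and 1 in one group and between threads 4 and 5 in the other, and the decode step lets only one of two picked threads reach a shared unit.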
Integrated Networking
Integrated on-chip networking lets the SPARC T3 processor deliver better networking performance. All network data is supplied directly from and to main memory. Placing networking so close to memory reduces latency, provides higher memory bandwidth, and eliminates inherent inefficiencies of I/O protocol translation. The SPARC T3 processor provides two 10 Gigabit Ethernet ports with integrated serializer/deserializer (SerDes), offering line-rate packet classification at up to 30 million packets/second (based on layers 1 to 4 of the protocol stack). Multiple DMA engines (transmit and 16 receive DMA channels) match DMAs to individual threads, providing binding flexibility between ports and threads. Virtualization support includes provisions for eight partitions, and interrupts may be bound to different hardware threads.
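The text does not derive the 30 million packets/second figure. One plausible reading, offered here as an assumption, is that it corresponds to minimum-size Ethernet frames arriving at line rate on both 10 GbE ports. The Python sketch below works through that arithmetic using the standard Ethernet framing overheads (64-byte minimum frame, 8 bytes of preamble and start-of-frame delimiter, 12-byte inter-frame gap); the constant and function names are illustrative.

```python
LINE_RATE_BPS = 10_000_000_000   # one 10 GbE port
PORTS = 2                        # the SPARC T3 provides two 10 GbE ports

MIN_FRAME_BYTES = 64             # minimum Ethernet frame, including FCS
PREAMBLE_SFD_BYTES = 8           # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12        # minimum idle time between frames


def max_packets_per_second(line_rate_bps, frame_bytes):
    """Packets per second on one port when every frame is `frame_bytes` long."""
    wire_bits = (frame_bytes + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8
    return line_rate_bps / wire_bits


per_port = max_packets_per_second(LINE_RATE_BPS, MIN_FRAME_BYTES)
both_ports = PORTS * per_port

print(f"Per port:   {per_port / 1e6:.2f} million packets/second")
print(f"Both ports: {both_ports / 1e6:.2f} million packets/second")
```

Under these assumptions each port carries just under 15 million minimum-size packets per second, so two ports together come to just under 30 million, which is consistent with the figure quoted above.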
Stream Processing Unit
The stream processing unit (SPU) on each core runs in parallel with the core at the same frequency. The cipher/hash unit supports the RC4, DES/3DES, and AES-128/192/256 ciphers and the MD5, SHA-1, and SHA-256 hash functions. New in the SPARC T3 processor are SHA-384/SHA-512, the Kasumi bulk cipher, and Galois field operations. The SPU is designed to achieve wire-speed encryption and decryption on the processor’s 10 GbE ports.