













Abstract: The UC Berkeley Network of Workstations (NOW) project demonstrates a new approach to large-scale system design enabled by technology advances that provide inexpensive, low latency, high bandwidth, scalable interconnection networks. This paper provides an overview of the hardware and software architecture of NOW and reports on the performance obtained at each layer of the system: Active Messages, MPI message passing, and benchmark parallel applications.
In the early 1990s it was often said that the “Killer Micro” had attacked the supercomputer market, much as it had the minicomputer and mainframe markets earlier. This attack came in the form of massively parallel processors (MPPs) which repackaged the single-chip microprocessor, cache, DRAM, and system chip-set of workstations and PCs in a dense configuration to construct very large parallel computing systems. However, another technological revolution was brewing in these MPP systems – the single-chip switch – which enabled building inexpensive, low latency, high bandwidth, scalable interconnection networks. As with other important technologies, this “killer switch” has taken on a role far beyond its initial conception. Emerging from the esoteric confines of MPP backplanes, it has become available in a form that is readily deployed with commodity workstations and PCs. This switch is the basis for system area networks, which have the performance and scalability of the MPP interconnects and the flexibility of a local area network, but operate on a somewhat restricted physical scale.
The Berkeley NOW project seeks to demonstrate that it is viable to build large parallel computing systems that are fast, inexpensive, and highly available, by simply snapping these switches together with the latest commodity components. Such cost-effective, incrementally scalable systems provide a basis for traditional parallel computing, but also for novel applications, such as internet services[Brew96].
This paper provides an overview of the Berkeley NOW as a parallel computing system. Section 2 gives a description of the NOW hardware configuration and its layered software architecture. In the following sections, the layers are described from the bottom up. Section 3 describes the Active Message layer and compares its performance to what has been achieved on MPPs. Section 4 shows the performance achieved through MPI, built on top of Active Messages. Section 5 illustrates the application performance of NOW using the NAS Parallel Benchmarks in MPI. Section 6 provides a more detailed discussion of the world’s leading disk-to-disk sort, which brings out a very important property of this class of system: the ability to concurrently perform I/O to disks on every node.
The hardware configuration of the Berkeley NOW system consists of one hundred and five Sun Ultra 170 workstations, connected by a large Myricom network[Bode95], and packaged into 19-inch racks. Each workstation contains a 167 MHz Ultra1 microprocessor with 512 KB level-2 cache, 128 MB of memory, two 2. GB disks, Ethernet, and a Myricom “Lanai” network interface card (NIC) on the SBus. The NIC has a 37.5 MHz embedded processor and three DMA engines, which compete for bandwidth to 256 KB of embedded SRAM. The node architecture is shown in Figure 1.
The network uses multiple stages of Myricom switches, each with eight 160 MB/s bidirectional ports, in a variant of a fat-tree topology.
We encountered a number of interesting engineering issues in assembling a cluster of this size that are not so apparent in smaller clusters, such as our earlier 32-node prototype. This rack-and-stack style of packaging is extremely scalable, both in the number of nodes and the ability to upgrade nodes over time. However, structured cable management is critical. In tightly packaged systems the interconnect is hidden in the center of the machine. When multiple systems are placed in a machine room, all the interconnect is hidden under the floor in an indecipherable mess. However, in clusters, the interconnect is a clearly exposed part of the design (a bit like the service conduits in deconstructionist buildings). Having the interconnect exposed is valuable for working on the system, but it must stay orderly and well structured, or it becomes both unsightly and difficult to manage.
The Berkeley NOW has four distinct interconnection networks. First, the Myrinet provides high-speed communication within the cluster. We discuss this in detail below. Second, switched Ethernet into an ATM backbone provides scalable external access to the cluster. The need for an external network that scales with the size of the cluster was not apparent when we began the project, but the traffic between the cluster and other servers, especially file servers, is an important design consideration. Third, a terminal concentrator provides direct console access to all the nodes via the serial port. This is needed only in situations when the node cannot be rebooted through the network, or during system development and debugging. Fourth, conventional AC lines provide a power distribution network. As clusters transition to the commercial mainstream, one engineering element will be to consolidate these layers of interconnect into a clean modular design. Figure 2 shows a picture of the NOW system.
The Myrinet switches that form the high-speed interconnect use source routing and can be configured in arbitrary topologies. The NOW automatic mapping software can handle arbitrary interconnect[Mai*97]; however, we wire the machine as a variant of a fat-tree to create a system with more uniform bandwidth between nodes, thereby minimizing the impact of process placement. The topology is constrained by the use of 8-port (bidirectional) switches and wiring density concerns. Initially we planned to run cables from all the nodes to a central rack of switches; however, the cable cross-sectional area near the switches became unmanageable as a result of bulky, heavily-shielded copper network cables. Using fiber-optic cables that are now available, the cable density may be reduced enough to centrally locate the switches.
Figure 1. NOW Node Configuration. [Diagram labels: UltraSparc with L2 cache and memory on the S-bus (25 MHz); Myricom Lanai NIC (37.5 MHz proc, 256 KB sram, 3 dma units) with s dma, r dma, and host dma engines, bus interface, and link interface; 160 MB/s bidirectional links; 8-port wormhole switches.]
Figure 2. NOW System
exchanges pages with remote page daemons. The other provides similar operation on specially mapped regions using only signals.
Active Messages are the basic communication primitives in NOW. This work continues our investigation of implementation trade-offs for fast communication layers[vE92,Gol96,Kri*96]; on NOW we have sought to generalize the approach and take full advantage of the complete OS on every node. The segment driver and device driver interface is used to provide applications with direct, protected user-level access to the network. Active Messages map to simple operations on queues and buffers that are shared between the user process and the communication firmware, which is executed on a dedicated processor embedded in the network interface card.
We have built two Active Message layers. The first, Generic Active Messages (gam), is oriented toward the traditional single-parallel-program-at-a-time style of parallel machines, and provides exactly the same API across a wide range of platforms[Cul*95]. This serves as a valuable basis for comparison.
The newer AM layer[Main95], AM-II, provides a much more general purpose communication environment, which allows many simultaneous parallel programs, as well as client/server and system use. It is closely integrated with POSIX threads. The AM implementation is extremely versatile. It provides error detection and retry at the NIC-to-NIC level and allows the network to be reconfigured in a running system. A privileged mapper daemon explores the physical interconnection, derives deadlock-free routes, and distributes routes periodically[Mai*97]. AM-II provides a clean return-to-sender error model to support highly available applications.
The Active Messages communication model is essentially a simplified remote procedure call that can be implemented efficiently on a wide range of hardware. Three classes of messages are supported. Short messages pass eight 32-bit arguments to a handler on a destination node, which executes with the message data as arguments. Medium messages treat one of the arguments as a pointer to a 128 byte to 8 KB data buffer and invoke the handler with a pointer to a temporary data buffer at the destination. Bulk messages perform a memory-to-memory copy before invoking the handler. A request handler issues replies to the source node.
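As a rough illustration of the request and reply pattern described above, the following Python sketch simulates short Active Messages between two in-process nodes. The `Node` class, its methods, and the handler names are hypothetical and are not the gam or AM-II API; only the limit of eight 32-bit arguments per short message comes from the text.

```python
from collections import deque

MAX_SHORT_ARGS = 8        # short messages carry up to eight arguments (from the text)
WORD_MASK = 0xFFFFFFFF    # each argument is treated as a 32-bit word


class Node:
    """Hypothetical in-process stand-in for a cluster node."""

    def __init__(self, node_id, network):
        self.node_id = node_id
        self.network = network    # dict: node id -> Node
        self.inbox = deque()      # message queue filled by other nodes
        self.handlers = {}        # handler name -> function
        self.results = []
        network[node_id] = self

    def send_short(self, dest, handler, *args):
        if len(args) > MAX_SHORT_ARGS:
            raise ValueError("a short message takes at most 8 arguments")
        words = tuple(a & WORD_MASK for a in args)
        self.network[dest].inbox.append((self.node_id, handler, words))

    def poll(self):
        """Run the named handler for every queued message."""
        while self.inbox:
            src, handler, words = self.inbox.popleft()
            self.handlers[handler](self, src, *words)


def add_request(node, src, a, b):
    # Request handler: compute, then reply to the source node.
    node.send_short(src, "add_reply", a + b)


def add_reply(node, src, total):
    # Reply handler: record the result on the requesting node.
    node.results.append(total)


def demo():
    network = {}
    n0, n1 = Node(0, network), Node(1, network)
    for n in (n0, n1):
        n.handlers["add_request"] = add_request
        n.handlers["add_reply"] = add_reply
    n0.send_short(1, "add_request", 2, 3)
    n1.poll()   # node 1 runs the request handler and replies
    n0.poll()   # node 0 runs the reply handler
    return n0.results
```

The sketch models only the control flow: the message names a handler, the handler runs with the message words as its arguments, and a request handler answers by sending a reply to the source.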
We have developed a microbenchmarking tool to characterize empirically the performance of Active Messages in terms of the LogP model[Cul93, Cul95]. Figure 5 compares the gam short message LogP parameters on NOW with the best implementations on a range of parallel machines. The bars on the left show the one-way message time broken down into three components: send overhead (o_s), receive overhead (o_r), and the remaining latency (L). The bars on the right show the time per message (g = 1/MessageRate) for a sequence of messages. NOW obtains competitive or superior communication performance to the more tightly integrated, albeit older, designs.
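The relationship between these LogP parameters can be made concrete with a small Python sketch. The function names are hypothetical, any numbers passed in are placeholders rather than measurements from Figure 5, and the burst formula is the usual LogP pipelining argument, not something stated in this paper.

```python
def one_way_time(o_s, latency, o_r):
    """One-way message time: send overhead + latency + receive overhead."""
    return o_s + latency + o_r


def message_rate(g):
    """Messages per unit time for a long sequence, since g = 1 / MessageRate."""
    return 1.0 / g


def burst_time(n, o_s, latency, o_r, g):
    """Time until the last of n back-to-back messages has been received.

    Assumption: successive sends are spaced by max(g, o_s), so the last
    message starts (n - 1) * max(g, o_s) after the first one.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) * max(g, o_s) + one_way_time(o_s, latency, o_r)
```

With all parameters in microseconds, `one_way_time` corresponds to the height of a left-hand bar in the figure and `g` to a right-hand bar.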
The overhead on NOW is dominated by the time to write and read data across the I/O bus. The Paragon has a dedicated message processor and network interface on the memory bus; however, there is considerable overhead in the processor-to-processor transfer due to the cache coherence protocol and the latency is large because the message processors must write the data to the NI and read it from the NI. The actual time on the wire is quite small. The Meiko has a dedicated message processor on the memory bus with a direct connection to the network, but the overhead is dominated by the exchange instruction that queues a message descriptor for the message processor and the latency is dominated by the slow message processor accessing the data from host memory. Medium and bulk messages achieve 38 MB/s on NOW, limited primarily by the SBus.
Traditional communication APIs and programming models are built upon the Active Message layer. We have built a version of the MPI message passing standard for parallel programs in this fashion, as well as a version of the Berkeley Sockets API, called Fast Sockets[Rod97]. A shared address space parallel C, called Split-C[Cul93], compiles directly to Active Messages, whereas HPF[PGI] compiles down to the MPI layer.
Figure 5. Active Messages LogP Performance
Our implementation of MPI is based on the MPICH reference implementation, but realizes the abstract device interface (ADI) through Active Message operations. This approach achieves good performance and yet is portable across Active Message platforms. The MPI communicator and related information occupy a full short message. Thus, a zero-byte control message is implemented as a single small-message request-response, with the handler performing the match operation against a receive table. The one-way time for an echo test is 15 μs. MPI messages of less than 8 KB use an adaptive protocol implemented with medium Active Messages. Each node maintains a temporary input buffer for each sender and senders keep track of whether their buffers are available on the destination nodes. If the buffer is available, the send issues the data without handshaking. Buffer availability is conveyed back to the source through the response, if the match succeeds, or via a request issued by the later matching receive. Large messages perform a handshake to do the tag match and convey the destination address to the source. A bulk operation moves the message data directly into the user buffer.
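The size-based protocol choice described above can be sketched in Python as follows. This is an illustration only: the `Sender` class and the protocol labels are hypothetical, and the behaviour when the destination buffer is busy (falling back to a handshake) is an assumption, since the text only states what happens when the buffer is available.

```python
MEDIUM_LIMIT = 8 * 1024   # bytes; smaller messages use medium Active Messages


class Sender:
    """Hypothetical per-node send state for the adaptive protocol."""

    def __init__(self):
        # Destinations whose temporary input buffer we have already filled.
        self.buffer_busy = set()

    def send(self, dest, nbytes):
        """Return a label for the protocol that would carry this message."""
        if nbytes == 0:
            return "short request-response"
        if nbytes < MEDIUM_LIMIT:
            if dest not in self.buffer_busy:
                self.buffer_busy.add(dest)   # buffer is in use until released
                return "medium, eager"
            # Assumption: without a free buffer the sender must handshake.
            return "medium, after handshake"
        return "bulk, after handshake"

    def buffer_released(self, dest):
        """Called when a response or later request reports the buffer free."""
        self.buffer_busy.discard(dest)
```

Tracking availability on the sender side is what lets the common case skip the handshake: the sender already knows whether its buffer at the destination is free.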
Figure 6 shows the bandwidth obtained as a function of message size using Dongarra’s echo test on NOW and on recent MPP platforms[DoDu95]. The NOW version has lower start-up cost than the other distributed memory platforms and has intermediate peak bandwidth. The T3D/pvm version does well for small messages, but has trouble with cache effects. Newer MPI implementations on the T3D should perform better than the T3D/pvm in the figure, but data is not available in the Dongarra report.
An application-level comparison of NOW with recent parallel machines on traditional scientific codes can be obtained with the NAS MPI-based parallel benchmarks in the NPB2 suite[NPB]. We report briefly on two applications. The LU benchmark solves a finite difference discretization of the 3-D compressible Navier-Stokes equations. A 2-D partitioning of the 3-D data grid onto a power-of-two number of processors is obtained by halving the grid repeatedly in the first two dimensions, alternating between x and y, resulting in vertical pencil-like grid partitions. The ordering of point based operations constituting the SSOR procedure proceeds on diagonals which progressively sweep from one corner on a given z plane to the opposite corner of the same z plane, thereupon proceeding to the next z plane. This constitutes a diagonal pipelining method and is called a “wavefront” method by its authors[Bar*93]. The software pipeline spends relatively little time filling and emptying and is perfectly load-balanced. Communication of partition boundary data occurs after completion of computation on all diagonals that contact an adjacent partition.
The BT algorithm solves three sets of uncoupled systems of equations, first in the x, then in the y, and finally in the z direction. These systems are block tridiagonal with 5x5 blocks and are solved using a multi-partition scheme[Bru88]. The multi-partition approach provides good load-balance and uses coarse-grained communication. Each processor is responsible for several disjoint sub-blocks of points (“cells”) in the grid. The cells are arranged such that for each direction in the line-solve phase, the cells belonging to a certain processor are evenly distributed along the direction of solution. This allows each processor to perform useful work throughout a line-solve, instead of being forced to wait for the partial solution to a line from another processor before beginning work. Additionally, the information from a cell is not sent to the next processor until all sections of linear equation systems handled in this cell have been solved. Therefore the granularity of communications is kept large and fewer messages are sent. The BT code requires a square number of processors.
Figure 6. MPI bandwidth versus message size (bytes)
parallel sort of N records on P processors is a generalized bucket sort and contains four steps:
from its local disks into memory.
At the end of the algorithm, the data is sorted across the disks of the workstations, with the lowest-valued keys on processor 0 and the highest-valued keys on processor P – 1. The number of records per node will only be approximately equal, and depends upon the actual distribution of key values.
A key advantage of a NOW is that the performance of each node can be studied and optimized in isolation before considering the interactions between nodes. For NOW-Sort, we needed to understand how best to utilize the disks, the memory, and the operating system substrate of each node.
To fully utilize the aggregate bandwidth of multiple disks per machine, we implemented a user-level library for file striping on top of each local Solaris file system. We have two disk configurations to consider: two 5400 rpm disks on a fast-narrow SCSI bus and an additional two 7200 rpm disks on a fast-wide SCSI bus. Table 2 shows the performance of the striped file system for several configurations.
In the first two rows, we see that the two 5400 rpm disks saturate the fast-narrow SCSI bus, which has a peak bandwidth of 10 MB/s. We measure 8.3 MB/s from two disks capable of a total of 11 MB/s. The full NOW cluster has two disks per node, providing a potential file I/O bandwidth of 830 MB/s on 100 nodes.
A subset of the nodes have additional external disks. The second two rows indicate that the (external) fast-wide SCSI bus adequately handles the two faster disks. Finally, the last row shows we achieve 20.5 MB/s, or 96% of the peak aggregate bandwidth from the four disks. To harness this bandwidth in the sort, we need to adjust the blocking factor on the two kinds of disks to balance the transfer times.
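One way to read "adjust the blocking factor to balance the transfer times" is to give each disk a share of every stripe in proportion to its bandwidth. The Python sketch below illustrates that idea under stated assumptions: the function names are hypothetical, the per-disk bandwidths passed in are placeholders rather than values from Table 2, and this is not necessarily how the striping library chooses its block sizes.

```python
def balanced_block_sizes(bandwidths_mb_s, stripe_kb):
    """Split one stripe across disks in proportion to each disk's bandwidth."""
    total = sum(bandwidths_mb_s)
    return [stripe_kb * bw / total for bw in bandwidths_mb_s]


def transfer_times_ms(blocks_kb, bandwidths_mb_s):
    """Transfer time per block in milliseconds (taking 1 MB = 1000 KB)."""
    return [kb / bw for kb, bw in zip(blocks_kb, bandwidths_mb_s)]


def example():
    # Placeholder rates: two slower disks and two faster disks.
    rates = [4.0, 4.0, 6.0, 6.0]
    blocks = balanced_block_sizes(rates, stripe_kb=200.0)
    return blocks, transfer_times_ms(blocks, rates)
```

With proportional blocks every disk finishes its part of a stripe at the same time, so the faster disks are never left idle waiting for the slower pair.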
Given a near peak bandwidth local file system, the sorting performance of each node depends critically on how effectively memory is used. With a general purpose OS, there is no simple way for the application to determine how much memory is actually available to it. Depending upon the system interface used to read data from disk, the application may or may not be able to effectively control its memory usage.
We compare two approaches for reading records from disk: read and mmap with madvise. For demonstration purposes, we use a simple implementation of the sorting algorithm using one UltraSparc with one disk and 64 MB of DRAM. It quicksorts all of the keys in memory. The upper graph of Figure 9 shows that when the application uses the read call to read records into memory from disk, the total execution time increases severely when more than 20 MB of records are sorted, even though 64 MB of physical memory are in the machine. The operating system uses roughly 20 MB and the file system performs its own buffering, which effectively doubles the application footprint. This performance degradation occurs because the system starts paging out the sorting data.
TABLE 2. Bandwidths of disk configurations
Columns: Seagate Disks, Bus, Read (MB/s), Write (MB/s)
1 5400 rpm Hawk / 2 5400 rpm Hawk: narrow, 5.
1 7200 rpm Barracuda / 2 7200 rpm Barracuda: wide, 6.
2 of each / 2 of each (peak): both, 20.
To avoid double-buffering while leveraging the convenience of the file system, we use memory mapped files by opening the file, calling mmap to bind the file into a memory segment of the address space, and accessing the memory region as desired. The auxiliary system call, madvise, informs the operating system of the intended access pattern. For example, one call to madvise notifies the kernel that a region will be accessed sequentially, thus allowing the OS to fetch ahead of the current page and to throw away pages that have already been accessed. The lower graph of Figure 9 shows that with mmap and madvise the sorting program has linear performance up to roughly 40 MB, when it has used all the memory that the OS makes available. For larger data sets, the two pass algorithm is used, which places greater demand on the file system and requires multiple threads to maintain the disk bandwidth[Arp*97].
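The mmap and madvise access pattern described above can be illustrated with Python's standard `mmap` module, which wraps the same system calls. This is a minimal sketch rather than the NOW-Sort code: `read_keys_mmap` is a hypothetical helper, the 100-byte record size comes from the text, and the 10-byte key at the start of each record is an assumption. The `madvise` call is guarded because it is only available on some platforms and Python versions.

```python
import mmap
import os

RECORD_SIZE = 100   # bytes per record (from the text)
KEY_SIZE = 10       # assumption: the key is the first 10 bytes of a record


def read_keys_mmap(path):
    """Return the key of every whole record in the file, read through mmap."""
    keys = []
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0:
            return keys   # an empty file cannot be mapped
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Hint that the region is read front to back, so the OS may
            # prefetch ahead and drop pages that were already visited.
            if hasattr(mm, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                mm.madvise(mmap.MADV_SEQUENTIAL)
            for off in range(0, size - RECORD_SIZE + 1, RECORD_SIZE):
                keys.append(mm[off:off + KEY_SIZE])
    return keys
```

Only the keys are copied out of the mapping; the record bodies stay in the mapped file pages, so the application does not hold a second buffered copy of the data.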
The optimizations for striping data across local disks and using operating system interfaces for memory management apply when running on a single node or multiple nodes. The impact of parallelization is isolated to the communication phase.
After each node has memory-mapped its local input files, it calculates, for each key, the processor which should contain that key in the final sorted order. Using the assumption that the keys are from a uniform distribution, we determine the destination processor with a simple bucket function (i.e., the top bits of each key) and copy the record from the input file to a 4 KB send-buffer allocated for each destination processor. When a send-buffer is filled, it is sent to its destination processor.
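The bucket function and per-destination send-buffers described above can be sketched in Python. Several details here are assumptions made for illustration: keys are modelled as 32-bit unsigned integers, the number of processors is a power of two, and `send` is a hypothetical callback standing in for the Active Message send. The 4 KB buffer and 100-byte record sizes come from the text.

```python
KEY_BITS = 32              # assumption: keys modelled as 32-bit unsigned ints
SEND_BUFFER_BYTES = 4096   # 4 KB send-buffer per destination (from the text)
RECORD_SIZE = 100          # bytes per record (from the text)


def destination(key, num_procs):
    """Destination processor = top log2(P) bits of the key."""
    if num_procs < 1 or num_procs & (num_procs - 1):
        raise ValueError("num_procs must be a power of two")
    bits = num_procs.bit_length() - 1
    return key >> (KEY_BITS - bits) if bits else 0


def distribute(records, num_procs, send):
    """Copy (key, record) pairs into per-destination buffers.

    A buffer is handed to send(dest, buffer) as soon as it is full, and
    whatever remains is flushed at the end.
    """
    per_buffer = SEND_BUFFER_BYTES // RECORD_SIZE   # whole records per buffer
    buffers = [[] for _ in range(num_procs)]
    for key, record in records:
        dest = destination(key, num_procs)
        buffers[dest].append((key, record))
        if len(buffers[dest]) == per_buffer:
            send(dest, buffers[dest])
            buffers[dest] = []
    for dest, buf in enumerate(buffers):
        if buf:
            send(dest, buf)
```

Because the destination depends only on the high-order bits, processor 0 ends up with the lowest-valued keys and the last processor with the highest, provided the keys really are close to uniformly distributed.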
Upon arrival of a message, an Active Message handler is executed. The handler moves the full record into main memory and copies a portion of the key into one of the buckets based on the high-order bits of the key. The number of buckets is calculated at run-time such that the average number of keys per bucket fits into the second-level cache.
The read and communication phases are easily overlapped due to the interfaces provided by mmap and Active Messages. Copying keys into send-buffers is completely hidden under the disk transfer time. Obtaining this same performance with the read system call would require more programming complexity; because the cost of issuing each read is high, threads must be used to prefetch data in large chunks.
Measurements on a cluster with two disks per workstation show that the communication time is mostly hidden by the read time. However, with four disks per workstation, very little communication is overlapped with reading, and the algorithm is actually slower than with only two disks. This penalty occurs because the UltraSPARC I/O bus, the SBus, saturates long before its theoretical peak of 80 MB/s. Since almost all records are sent to another processor, the SBus must transfer three times the I/O rate: once for reading, once for sending, and once for receiving.
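The factor of three can be checked with a short calculation. In the Python sketch below the function names are hypothetical, the disk read rates used in the usage note (8.3 MB/s with two disks, 20.5 MB/s with four) are the figures quoted earlier in the text, and the remote fraction (P - 1)/P is an assumption that follows from uniformly distributed keys.

```python
def remote_fraction(num_procs):
    """Assumed share of records bound for another node under uniform keys."""
    return (num_procs - 1) / num_procs


def sbus_traffic(read_mb_s, remote=1.0):
    """Bus traffic: one transfer to read, plus a send and a receive
    for the share of records that leaves the node."""
    return read_mb_s * (1.0 + 2.0 * remote)
```

For example, `sbus_traffic(8.3)` is about 25 MB/s while `sbus_traffic(20.5)` is about 61.5 MB/s, which is already a large share of the 80 MB/s theoretical peak and is consistent with the four-disk configuration being limited by the bus.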
The sort and write phases on each node are straightforward. After synchronizing across processors to ensure that all records have been sent and received, each node performs a partial-radix sort on the set of keys in each bucket, very similar to the approach described in [Agar96]. The partial-radix sort orders keys using the top 22 bits after stripping off the bits used to determine the destination processor and bucket. At this point, with high probability, most keys are correctly sorted, and a simple bubble-sort cleans up the misordered keys. A pointer is kept to the full 100-byte record so that only keys and pointers are swapped. The sorted records are then gathered and written to local disk using the write interface.
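The two-stage idea, a radix pass on the leading bits followed by a cheap cleanup pass, can be sketched in Python. This is an illustration under stated assumptions, not the NOW-Sort implementation: keys are modelled as unsigned integers, the radix pass is written as a counting sort, and the function names are hypothetical. Each entry is a (key, pointer) pair, so only keys and pointers move.

```python
def partial_radix_sort(entries, key_bits, radix_bits):
    """Counting sort of (key, pointer) pairs on the top radix_bits of the key."""
    shift = key_bits - radix_bits
    counts = [0] * ((1 << radix_bits) + 1)
    for key, _ in entries:
        counts[(key >> shift) + 1] += 1
    # Prefix sums: counts[d] becomes the first output slot for digit d.
    for i in range(1, len(counts)):
        counts[i] += counts[i - 1]
    out = [None] * len(entries)
    for entry in entries:
        digit = entry[0] >> shift
        out[counts[digit]] = entry
        counts[digit] += 1
    return out


def bubble_cleanup(entries):
    """Bubble sort in place; cheap when only a few entries are misordered."""
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(entries) - 1):
            if entries[i][0] > entries[i + 1][0]:
                entries[i], entries[i + 1] = entries[i + 1], entries[i]
                swapped = True
    return entries
```

After the radix pass, entries can only be out of order relative to others that share the same leading bits, so the cleanup pass moves each entry a short distance at most. For small experiments a call such as `bubble_cleanup(partial_radix_sort(entries, key_bits=16, radix_bits=8))` keeps the counting array small.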
Our performance on the Datamation benchmark is shown in Figure 10 for two NOW configurations. For
Figure 9. Sensitivity to OS Interface. [Two plots of time (seconds) versus size (MB); the read() plot shows Total Time, Write Time, Sort Time, and Read Time.]