WHITE PAPER

With Its New PowerXCell 8i Product Line, IBM Intends to Take Accelerated Processing into the HPC Mainstream


Sponsored by: IBM

Richard Walsh, Earl C. Joseph, Ph.D., Steve Conway, Jie Wu

August 2008

Global Headquarters: 5 Speen Street, Framingham, MA 01701 USA   P.508.872.8200   F.508.935.4015   www.idc.com

IDC OPINION

Fifteen years ago, the high-performance computing (HPC) market started to abandon its data-parallel, vector architectural lineage and turned to commodity-priced scalar processors. One by one, the other custom components of HPC systems have been pushed aside in favor of cheaper, standards-based alternatives. With some notable exceptions, most HPC system component technologies have been mainstreamed, a change driven by the price-performance advantages offered by standards-based components engineered to serve volume markets. Nothing reflects this more strongly than the fact that standards-based cluster sales based on x86 microprocessors were responsible for over 65% of the revenue generated in the HPC market in 2007, up from just a 20% share in early 2003.

However, blood is thicker than water, and the HPC user community has not forgotten where it came from or the fundamental data intensity of most HPC workloads. The mainstream x86 instruction set architecture (ISA) was not designed with HPC data-parallel requirements in mind and has therefore limited the sustained performance of many HPC applications. While processor clock speeds have recently stopped climbing, processor cores have multiplied, exacerbating this sustained performance shortcoming. There is little to suggest that HPC buyers will abandon the x86 mainstream and return to purchasing large numbers of custom data-parallel or vector systems to improve their applications' sustained performance. The aim of IBM's PowerXCell 8i product line, with its single instruction, multiple data (SIMD) ISA and memory flow controller (MFC), is instead to bring data-parallel computing back to HPC and deliver higher sustained performance and power efficiency to HPC workloads with a processing engine supported by volume economics. In IDC's opinion, IBM's new PowerXCell 8i processor and its go-to-market strategy have the potential to stimulate the return and mainstreaming of data-parallel processing in HPC. The key features of IBM's new PowerXCell 8i product line and its market strategy include:

- A single-chip, MFC-controlled, high memory bandwidth, shared memory design
- A double data rate (DDR2) memory subsystem and fully IEEE-compliant double-precision floating-point capabilities
- A broad range of PowerXCell 8i based products configurable at a variety of scales

- A multitiered programming model with strong support among IBM's customers and partners
- Its use in the world's fastest computer, Los Alamos National Laboratory's (LANL) petascale supercomputer, Roadrunner
- A well-defined road map supported by volume economics from the gaming industry

IN THIS WHITE PAPER

In this white paper, IDC reviews the state of the HPC market, its recent five years of very strong growth, the rise of standards-based clusters, and the growing importance of blades and custom-engineered enclosures. HPC buyer "pain points" related to memory bandwidth shortages, parallel programming of multicore processors, and power consumption are discussed, as is their potential to stimulate the more mainstream use of accelerators and data-parallel programming. Finally, the paper reviews IBM's PowerXCell 8i product line, multitiered programming environment, and some of its parallel programming software partnerships.

SITUATION OVERVIEW

HPC's Strong Market Growth

The HPC market has shown rapid growth in the five years since 2002, especially when compared with the background rate of IT spending generally. HPC revenue had three years of double-digit growth between 2003 and 2005, followed by a still-impressive 9% year-over-year growth between 2005 and 2006. In 2007, despite a slowing economy, HPC revenue growth over 2006 was 15.5%, exceeding IDC estimates. Table 1 shows revenue growth over this period by competitive segment.

TABLE 1
Worldwide HPC Market Revenue by Competitive Segment, 2003-2007 ($M)

Competitive Segment        Price Range           2003    2004    2005    2006    2007   CAGR (%)
Supercomputer              >$500,000             2,401   2,631   2,881   2,567      -       -
Technical divisional       $250,000-499,999          -       -       -   1,420      -       -
Technical departmental     $100,000-249,999          -       -   2,561   3,323      -       -
Technical workgroup        $0-99,999             1,806   2,668   2,568   2,744      -       -
Total                                            5,698   7,393   9,208  10,055      -       -

Source: IDC, 2008

Given the growing global interest in HPC technology as an essential component in national economic and technology strategies and the robust competition in the market, which continues to produce rapid innovation, IDC sees few major threats to a continued pattern of high growth in 2008 and beyond. In IDC's view, even a softening world economy should not greatly alter this forecast market growth because HPC's heavy R&D focus and longer buying cycles have largely insulated it historically from short-term economic downturns. IDC projects that HPC server revenue will increase at around a 9% CAGR through 2012 to reach almost $18 billion, up from under $6 billion in 2003 (see Table 2).

TABLE 2
Worldwide HPC Market Revenue Forecast by Competitive Segment, 2008-2012 ($M)

Competitive Segment        Price Range           2008    2009    2010    2011    2012   CAGR (%)
Supercomputer              >$500,000             3,035   3,247   3,463   3,682      -       -
Technical divisional       $250,000-499,999      2,102   2,427   2,755   3,086      -       -
Technical departmental     $100,000-249,999      4,801   5,400   5,990   6,570      -       -
Technical workgroup        $0-99,999             2,784   2,959   3,131   3,301      -       -
Total                                           12,723  14,033  15,339  16,639      -       -

Source: IDC, 2008

HPC Clusters Fuel Market Growth

IDC's data show that the surge in HPC revenue has been fueled primarily by purchases of x86-based, Linux cluster systems priced below $500,000 (especially those priced under $250,000). This growth was sustained by MPI, a maturing, message-based parallel programming model. HPC workloads with largely partitionable data structures, already parallelized for custom massively parallel processing (MPP) and constellation systems, could be moved easily to clusters. Once there, input data sets could be grown to match the memory and bandwidth provided on the additional cluster nodes and allow for further scaling (so-called weak scaling). Even less scalable workloads benefited because more jobs with distinct inputs could be run simultaneously, increasing throughput, increasing the research and development iteration rate, and reducing time to solution. This process pushed the HPC price-performance curve sharply downward, creating a zero-gravity sensation and the expectation that performance should more than double in a technological generation while costing no more. This price-performance advantage and the other advantages that HPC buyers associate with clusters are presented in Figure 1.

FIGURE 1
Cluster Drivers: Top Reasons to Purchase HPC Clusters
(Reasons surveyed: better price/performance, greater system throughput, ability to do new more/better science, ability to run larger problems, total cost of ownership (TCO), improved capacity management, to improve competitiveness, and other; the chart shows the number of responses for each.)
Source: IDC, 2008

As recently as early 2003, clusters accounted for just 20% of overall HPC server revenue. The dramatic penetration of the HPC market by clusters and their replacement of custom HPC systems through 4Q07 is shown in Figure 2. By the end of 2007, clusters had attained a 65% share of HPC server revenue. IDC sees clusters eventually topping out at about 80% of the HPC market, with the other 20% made up of systems that do not qualify as clusters, such as single-node servers, systems with symmetric multiprocessing (SMP) architectures, and MPP systems such as the IBM Blue Gene, Cray XT, and the SiCortex SC5832 that have too much custom content to fit the standards-based cluster definition.

FIGURE 2
Worldwide High-Performance Computing Revenue Share by Server Type, 1Q03-4Q07 (%)
(Quarterly cluster versus noncluster revenue share, 1Q03 through 4Q07.)
Source: IDC, 2008

IDC sees cluster revenue growth and market penetration continuing and pushing down into entry-level systems. "Ease-of-everything" cluster offerings designed for the technical workgroup (systems selling for under $100,000) at smaller firms and in back-office locations are expected to show particularly strong growth in 2008 and beyond. However, the rapid acceptance of HPC cluster computing systems (separate compute nodes built from standard component technologies: x86 processors, commodity motherboards, standards-based networking technology, and primarily the Linux OS) will continue to cause disruptive changes in the HPC market. Such changes, challenges, new market requirements, and buyer "pain points" also define new market opportunities. The HPC market's growing interest in data-level parallelism (DLP) acceleration technology is just such an HPC market opportunity.

The Challenges of HPC Clusters, Buyer "Pain Points," and IBM's PowerXCell 8i Solution

It is perhaps stating the obvious that the overarching elements potentially missing from a cluster system assembled à la carte from commodity hardware and software components are integration and a balanced system design. Custom-built HPC systems are balanced to suit the HPC task and integrated to simplify its completion. Because of this, custom HPC systems have generally been able to achieve higher sustained performance on individual jobs and better overall utilization rates. As clusters have scaled out to very large node counts and scaled in to "fatter" nodes with much more processing power per rack unit, the intangibles of integration and balance have been deemphasized. The price-per-peak-performance and capital cost advantages of HPC clusters have, until recently, overwhelmed their operational drawbacks:

system component imbalance and complexity, which limit sustained performance and lower overall system utilization rates. Lastly, the cluster revolution has placed cluster systems in many new environments, and their low cost has led to substantial growth in average node counts (a sixfold increase between 2004 and 2006 alone, according to IDC data). A consequence is that supplying basic operational inputs such as power, cooling, space, and support has become an important concern for HPC buyers. Table 3 summarizes these and other HPC cluster buyer "pain points" and also indirectly presents the market requirements that support buyer interest in integrated, blade-based, DLP acceleration technology of the type that IBM now offers with its new PowerXCell 8i product line and its QS22 blade in particular.

TABLE 3
HPC Cluster Buyer/User "Pain Points"

Managing HPC cluster complexity:
- System installation, monitoring, upgrades
- System administration, middleware
- User and application support

Power, cooling, and space requirements:
- Cluster price-performance drives down capital costs but drives up operating costs and resource use

Multicore, multisocket, multinode issues:
- Scheduling and programming complexity
- Memory size and bandwidth inadequacy
- Interconnect bandwidth, message rate mismatches

Server interconnect performance:
- Latency, bandwidth, message rates, collectives performance

Storage system performance, data management:
- Total storage, file size, file number
- Bandwidth, IOPs, reliability
- Data staging, archiving

Parallel application coding and scaling issues:
- Multicore, multisocket, multinode, accelerators
- Limited parallel price-performance, scaling

Third-party software costs:
- Licensing models

Extremely large-scale systems require new approaches:
- Better reliability, availability, and serviceability (RAS)

New production and operational environments require new approaches:
- New buyer requirements
- "Ease-of-everything" needs of new buyers

Source: IDC, 2008

While the QS22 (and IBM's other PowerXCell 8i based products) is presented in more detail below, it is important to note how its basic features respond to some key challenges facing today's HPC cluster buyers and users.

Blade-Based Design

First among these is the QS22's compact, integrated blade-based design. IBM and other HPC vendors with strong engineering skills have addressed the dilemma of providing integrated solutions while still using standards-based components by engineering dense, form-factor blades and their companion integrated enclosures. Blade sales are growing as a percentage of overall cluster sales. Blades and their enclosures provide vendors with the scope to engineer in value. This reduces cluster operating expenses and complexity while allowing the continued use of standards-based components that exploit volume-driven price-performance curves. The QS22, like other blade systems, reduces cluster management complexity and lowers power, cooling, and space requirements.

Fully IEEE-Compliant Double Precision

The feature of the QS22's PowerXCell 8i processor that most clearly stands out against competition from graphics processing unit (GPU) accelerators is its pipelined, fully IEEE-compliant, double-precision processing capability. The QS22 contains two tightly coupled PowerXCell 8i processors that provide 2 x (1 + 8) = 18 cores. Sixteen of these are DLP, SIMD processors. Vector and other data-parallel architectures are known to be both bandwidth and power efficient, and IBM has exploited this principle and engineered its new PowerXCell 8i double-precision processors with a surprisingly small transistor count.

Low Latency and High Bandwidth Memory Access

The PowerXCell 8i's MFC and on-chip DDR2 memory controller make its memory large in size, low in first-byte latency, and high in bandwidth. The PowerXCell 8i's MFC supports DMA and blocked or vector-like memory operations among all the cores and main memory. These features relieve cluster buyer pain in the categories of power and cooling (the QS22 delivers large numbers of FLOPS per watt), multicore and multisocket bandwidth inadequacy (both its SIMD instruction set and sustained per-processor bandwidth help here), and even in the area of parallel application scalability, where the multicore, multisocket QS22 allows more parallel work to be done per node.

Reliability and Cost Effectiveness

Other problem areas faced by cluster buyers on certain applications that the QS22 could potentially address include high application licensing costs and improved reliability, availability, and serviceability (RAS). The more efficient parallel performance afforded by a DLP processor has the potential to reduce the number of application licenses required, and the highly integrated QS22 blade with 18 cores in a single form factor reduces the operating temperature per FLOP and the number of independent parts that could fail.
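To make the MFC and SIMD behavior described above concrete, the following is a minimal, hypothetical SPU-side kernel, assuming the Cell SDK's spu_intrinsics.h and spu_mfcio.h interfaces. The block size, the passing of effective addresses through the program arguments (a real program would more typically DMA in a control block), and the scale-and-bias computation are illustrative choices, not IBM's code.

    /* Hypothetical SPU-side kernel: stream one 16 KB block of doubles from
     * main memory into the 256 KB local store via MFC DMA, scale it with
     * SIMD fused multiply-adds, and write it back. */
    #include <spu_intrinsics.h>
    #include <spu_mfcio.h>

    #define BLOCK 2048                        /* doubles per DMA block = 16 KB */
    static double buf[BLOCK] __attribute__((aligned(128)));

    int main(unsigned long long spe_id, unsigned long long ea_in,
             unsigned long long ea_out)
    {
        const unsigned int tag = 1;

        /* Pull the block from main memory (effective address) into local store. */
        mfc_get((void *)buf, ea_in, BLOCK * sizeof(double), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();            /* wait for the DMA to complete */

        /* SIMD double precision: two 64-bit results per 128-bit register. */
        vector double scale = spu_splats(2.0);
        vector double bias  = spu_splats(1.0);
        vector double *v = (vector double *)buf;
        for (int i = 0; i < BLOCK / 2; i++)
            v[i] = spu_madd(v[i], scale, bias);   /* v = v * scale + bias */

        /* Push the results back to main memory and wait for completion. */
        mfc_put((void *)buf, ea_out, BLOCK * sizeof(double), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();
        return 0;
    }

A production kernel would double buffer, issuing the next mfc_get while computing on the current block, so that the MFC's asynchronous DMA hides memory latency behind SPE computation, which is the usage pattern the text above describes.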

IBM's PowerXCell 8i and QS22 blade are not entirely HPC cluster "pain point" positive. As with many other acceleration technologies, the PowerXCell 8i and QS22 introduce an additional layer of programming complexity because the Power Processor Element (PPE) and the Synergistic Processor Element (SPE) instruction sets are not x86 based and the programming model is not single binary. This issue has not been ignored by IBM and is a focal point of its effort to mainstream PowerXCell 8i acceleration technology. IBM's PowerXCell 8i programming models, its Software Development Kit (SDK) for multicore acceleration, and its application development partnerships in both the government and commercial sectors are intended to address programmability and are considered in more detail below.

The Promise and Challenge of Accelerators

In the high-dimensional space (e.g., line width, clock speed, instruction set architecture, memory, and cache subsystem) that defines HPC processor microarchitecture, design themes have generally had a limited life span, and alternatives have always persisted on the sidelines in service to particular application classes or special-purpose requirements (e.g., custom MPP and vector architectures). Changes in HPC market economics, technological breakthroughs, or barriers governing processor design can push such alternatives to the forefront and current approaches to the side. The HPC market's sharp change of course away from vector architectures to MPP systems in the mid-1990s is one example of this, and as presented earlier, the rapid replacement of these custom HPC MPP architectures by standards-based cluster systems is another.

As the era of clock-driven, superscalar, instruction-level parallelism (ILP) processor design has waned, the HPC market has entered another period of transition. Power dissipation considerations have forced chip designers to look at alternative forms of on-chip parallelism that provide performance acceleration without requiring so much power. Both thread-level parallelism (TLP) and DLP processor designs are being explored. They have been dubbed accelerators because in many cases they augment general-purpose performance from a separate bus or because they are simply not integral to the general-purpose processor instruction set.

Today, the HPC market has multiple accelerator approaches to consider, offered in multiple implementations. In addition to IBM's PowerXCell 8i processor, which is our focus here, the accelerator category includes FPGAs, GPUs, multicore and many-core processors, vector processors, many-threaded processors, and application-specific integrated circuits (ASICs). While the variety of approaches in this category is large today, collectively they suggest an abstract or future architecture that includes many, probably simpler, mixed-type processing elements, perhaps with field-programmable features (perhaps the on-chip interconnect, if not the cores themselves), and instructions that move streams (or vectors) of data onto the chip in a single issue. The common elements of these alternatives and the great incentive to unify and simplify the parallel programming model used to drive accelerator performance have stimulated investment in parallel programming software for accelerators, including IBM's investment for the PowerXCell 8i. This growth in investment and the potential future convergence of accelerator microarchitectures suggest a future of much improved price-, power-, and productivity-performance for HPC.

IDC has been examining the accelerator category through market surveys, market forecasts, and technology analyses. With this analysis as a backdrop, the promise and challenges of accelerators are reviewed here, as are the specific concerns of potential buyers. This is provided as context within which to consider IBM's new PowerXCell 8i hardware and software product offerings.

The Promise of Accelerators

Crucial among all the factors that support the future use of accelerators in the era of HPC clusters is that today most of the alternatives are backed by volume economics. Intel's and AMD's multicore and future many-core processors obviously are. IBM's PowerXCell 8i is an HPC-specific modification of IBM's first-generation Cell Broadband Engine (Cell/B.E.) processor designed for the computer gaming market and the Sony PlayStation. GPUs have similar volume market support from the gaming industry. FPGAs are supported by volume purchases in the embedded signal processing space. Of the alternatives listed earlier, only vector processors and kernel-specific ASICs are without current volume economic support. Accelerator technologies that meet HPC's volume economic price-performance requirements have the best chance for success.

Accelerators, both TLP and DLP designs, also offer the prospect of improved memory bandwidth use and higher sustained performance: the former by hiding load latency underneath processor-ready work in other threads, and the latter by parallel pipelining of data streams from memory into the processor and back. Vector or DLP designs, such as the PowerXCell 8i, have a particular advantage for HPC workloads because of their natural data intensity.

Another advantage that accelerators with heterogeneous cores or field-programmable cores offer over general-purpose processors is workload-specific functionality. They can be designed with only those functional units and/or the precision required by a particular class of HPC applications, or even that of an individual application kernel in the case of an ASIC. Heterogeneous core chips, also called "chips with personality" (or programmability in the case of FPGAs), provide high-use functionality and eliminate the general-purpose circuitry that consumes extra space and power. The scalar and vector processors that remain part of the Cray microarchitecture are perhaps the original examples of processors with personality. The heterogeneous design of the PowerXCell 8i is another example, with a first-order division of labor and function (scalar and parallel) between the PPE and SPE cores on the chip. IDC expects that as line widths drop and as the number of cores per chip increases, the additional cores will offer an increasing variety of special-purpose functions.

Accelerators also have appeal because they can offer HPC datacenters efficiencies that deliver operational savings. Other things being equal, parallel systems, whether TLP or DLP, require less power to achieve the same level of performance and therefore run cooler and can be more densely packed. This allows fewer rack-mounted units to provide the same performance using less power and leads to operational benefits in the current regime of scaled-out clusters.

Finally, the interest in acceleration technology in all its forms has stimulated community thinking about the parallel programming abstraction and promises more universal parallel programming language concepts and compilers that can produce code for the full variety of back-end parallel acceleration microarchitectures.

IBM's investment in its SDK and its partnerships in the parallel software industry are significant efforts that take HPC in this direction. IBM also supports centers of expertise in academia (at the Barcelona Supercomputing Center, Georgia Tech, and the University of Maryland) to ensure that graduates in computer science and electrical engineering are exposed to current trends in computational science. The advantages presented earlier transfer in total to accelerators as a class, but only in part to each particular type of accelerator.

The Challenge of Accelerators

Substantial barriers remain to be overcome to mainstream acceleration technology, and Table 4 reminds the reader of these barriers. It also makes clear that while some challenges are general across the class, others apply only to specific types of accelerators. While all accelerators require extra programming effort to use, and IDC surveys place programming difficulty at the top of the list of accelerator challenges, FPGAs stand out as the most difficult to program, while single-object vector processors are perhaps the easiest. IBM's PowerXCell 8i falls somewhere in between.

Most HPC workloads require or prefer double-precision floating point, but many of the alternatives today fall short in this category. FPGAs can be programmed with full IEEE 754 double-precision floating-point units, but these units consume large numbers of transistors, limiting the maximum performance per chip. Some GPU microarchitectures support the IEEE 754 double-precision format and meet some of its functional requirements; however, GPU vendors have avoided providing full double-precision capability because of its potential effect on performance. At this time, the Cray X2 vector processor and now the PowerXCell 8i heterogeneous multicore processor are the only fully IEEE 754 double-precision floating-point compatible HPC acceleration technologies available.

Continuing to work through Table 4, we note that those acceleration technologies designed as discrete components and accessed via an external bus must manage bandwidth limitations to the card and often have less memory than is available to the general-purpose processor on the motherboard. This is typically the case with GPUs and FPGAs. The PowerXCell 8i and custom vector processors both have the advantage of being able to address a unified, board-local memory space directly.

Accelerators are typically less flexible than x86 architectures. GPUs can now handle more conditional data-parallel operations but still have weaker integer performance. As noted earlier, FPGA floating-point capability is limited by the transistor count required to build these units. Limited scalar processing power is often another issue: the scalar processors of both the Cray vector systems and the IBM PowerXCell 8i cannot match a fast x86 or POWER6 core. GPUs are known to consume a lot of power, although not necessarily per peak FLOP. The growth in use of blades limits the number of practical accelerator choices, as accelerator products have not yet generally accommodated the increased use of blades (IBM's PowerXCell 8i QS22 blade is an exception). With respect to volume-price requirements, custom ASICs and other custom accelerated processing technologies with compelling performance features still do not meet broad HPC market price requirements.

TABLE 4
Accelerator User "Pain Points"

Programming difficulties:
- More difficult to program (especially FPGAs)
- Adds another parallel programming layer
- Requires dual object compiles
- Requires algorithmic adjustments
- Programming skill shortages

Insufficient precision, reliability:
- Single precision only, non-IEEE conformant
- No ECC in bus or memory

Continued bandwidth limitations:
- Performance limited by external bus speeds
- Card-local memory size limitations
- Adds a layer to the memory hierarchy
- Poor instruction set support for memory operations

Inflexible architecture:
- Inability to handle loop conditionals or asymmetric TLP
- Lockstep parallelism/threads
- Poor scalar (or integer or floating-point) performance

Limited portability, high risk:
- Too many programming models for ISV support
- Investment in climbing the learning curve could be lost

Consume too much power:
- GPUs have high absolute power requirements

Wrong form factors:
- Need blade-ready form factors

Too expensive:
- Vector, ASIC, or too much custom content

Source: IDC, 2008

As noted earlier, the barriers to the widespread adoption of accelerator technology are significant. Some are generic to the entire category, such as programming difficulty, and others are specific to individual accelerator types. The number of alternatives available is good news for the HPC market and gives buyers with specific needs choices. Many members of the HPC community are optimistic about accelerators in the longer term. One-third of those surveyed by IDC expected that accelerators would be very useful within a two- to three-year time frame, and another third believed that they would be at least somewhat useful. To quote one individual directly:

"These barriers are largely removable. The issue is the business case. Improvements will be gated by the providers' view of the size of the market opportunity and the rate at which providers of commodity microprocessors improve their product's performance for HPC workloads."

IDC expects that as milestones on the various accelerator road maps are reached (as they have been recently with the PowerXCell 8i processor from IBM), these barriers will be lowered.

The PowerXCell 8i Lowers Accelerator Barriers

Walking backwards through the list of accelerator pain points, we can evaluate the PowerXCell 8i's features with respect to each. IBM and its partners are offering the PowerXCell 8i in a greater variety of forms and at several more price points than its predecessor. Some have been designed and priced to compete with GPU accelerator card offerings. The IBM QS22 blade improves on the QS21 blade in that it contains dual PowerXCell 8i processors and is among the first acceleration technologies available in dense blade form. The QS22 and the composite Triblade (which includes the QS22 blade) in LANL's Roadrunner system are somewhat more power efficient than their QS21 predecessor, both in an absolute sense and on a per double-precision MFLOPS basis, and compare well with the competition.

The PowerXCell 8i has programming difficulty, investment risk, and portability issues similar to those of other accelerator technologies. IBM's investments in the PowerXCell 8i programming environment to further reduce this barrier have continued since the release of the original Cell/B.E. With respect to flexibility, the PowerXCell 8i has some advantages. It offers fast integer and single- and double-precision floating-point performance, and the relative independence of its SPEs gives the PowerXCell 8i the ability to handle data-parallel conditionals as independent threads. As noted, the PowerXCell 8i adds high-speed double precision to the single-precision speed of its predecessor. Both are IEEE 754 format compatible, although single-precision operations are not fully compliant with every element of the standard. All memory and buses on the PowerXCell 8i include ECC to meet HPC reliability standards, which is not the case with some accelerator alternatives.

Finally, the PowerXCell 8i's memory bandwidth, type, and size improvements make it much better suited to HPC workloads than the original Cell/B.E. Its DDR2 memory is potentially large and directly addressable from the chip, avoiding some of the memory-related issues of bus-based accelerators. Its MFC unit extends the PowerXCell 8i's data-parallel design out to memory with its DMA and blocked memory reference capabilities. All in all, the incremental improvements of the QS22 and PowerXCell 8i validate the optimism expressed by the HPC user in the preceding quote on the prospects for accelerators in HPC. While hurdles remain to be overcome before accelerators are fully integrated in the HPC mainstream, much has been done to make the QS22 and PowerXCell 8i more HPC friendly.

IBM's New HPC Acceleration Products: The PowerXCell 8i Processor, the PowerXCell 8i PXCAB Card, and IBM's BladeCenter QS22

With the release of its PowerXCell 8i processor (65nm, SOI) and associated blades, accelerator cards, and systems, IBM offers the HPC market a range of third-generation PowerXCell 8i based products, all with features that should significantly expand Cell/B.E.'s breadth of applicability in HPC and elsewhere. Important HPC-related improvements to its microarchitecture, additional form factors and features, improvements to its software development kit, and additional system offerings contribute to the PowerXCell 8i's expanded potential in HPC. This development at IBM is part of a broader pattern of change in the HPC market that has DLP acceleration technology (both hardware and software), supported by volume economics, potentially finding a place in the HPC mainstream.

New PowerXCell 8i Processor Retooled for HPC

While the PowerXCell 8i's lineage is clearly derived from the original graphics-oriented Cell/B.E. processor, its microarchitectural differences make it a new, HPC-specific branch off of that original Cell/B.E. line, still supported by the volume economics of Sony PlayStation sales but tactically augmented for HPC. Like its predecessor, the PowerXCell 8i has one PPE and eight SIMD stream SPEs, giving the chip nine processors in all (see Figure 3). IBM's road map indicates that a PowerXCell 8i follow-on is planned for the 2010 time frame that will double the number of PPEs and quadruple its SPEs to 32 in a 45nm SOI process.

FIGURE 3
IBM's Third-Generation PowerXCell 8i Heterogeneous Multicore Processor
Source: IBM, 2008

First among the several important HPC-specific features designed into the new PowerXCell 8i is its enhanced double-precision (eDP) capability and performance. The double-precision units on earlier generation Cell/B.E. SPEs were not fully pipelined. On the PowerXCell 8i they are, and therefore each 3.2 GHz SPE delivers double-precision floating-point results seven times faster (one result per cycle) than its predecessor, at a rate of 12.8 GFLOPS (3.2 GHz x 2 64-bit floating-point words x 2 64-bit floating-point operations [fused multiply-add]). This gives the eight SPEs per chip a combined double-precision peak performance of 102.4 GFLOPS, or exactly one-half the chip's single-precision performance (~204.8 GFLOPS), since twice as many 32-bit, single-precision words (four versus two) fit in the SPE's 128-bit floating-point registers. IDC expects IBM to focus on the potential advantage in sustained performance per watt the PowerXCell 8i may have due to its single-chip architecture and unified, MFC-supported memory space. It is worth noting that this increase in double-precision performance comes without a substantial increase in transistor count, chip size, or thermal design power (TDP), which is listed at 92 watts for the PowerXCell 8i.

Like the double-precision functional units in the first- and second-generation Cell/B.E. processors, the new double-precision functional units are fully IEEE 754 compatible in both format and function. The PowerXCell 8i's high-speed, single-precision floating-point units (designed more for graphics than for HPC applications) remain less than fully IEEE 754 floating-point compliant in function. Fully compliant single-precision results can, however, be generated by truncating double-precision runs, but these complete at double-precision rates, which are half the native "graphics" single-precision rate. Double-precision floating-point capability is now also available from other acceleration technologies, but typically without full IEEE compliance. This feature of the PowerXCell 8i is one of several that distinguish it from other accelerators.

Equal in importance to the PowerXCell 8i's eDP capability is its redesigned on-chip memory controller, which addresses a larger, more standard DDR2-based memory subsystem. The previous-generation Cell/B.E. processor is based on a Rambus XDR memory architecture, which is bandwidth rich but limited in per-board memory capacity to values that are substantially lower than typical HPC applications require. The PowerXCell 8i is designed to preserve the memory bandwidth of the older Cell/B.E. (25.6 GBps per chip, or 0.25 bytes per double-precision FLOP) while offering greater memory capacity. A consequence is that the dual, 128-bit (plus parity) memory buses of the new DDR2 memory controller increase the pin count of the PowerXCell 8i processor package, making it pin incompatible with older-generation Cell/B.E. processors. The result is that the PowerXCell 8i supports four DIMM slots and up to 16 GB of memory (more with future higher-density DIMMs) compared with the Cell/B.E.'s maximum of 1 GB. In addition, the PowerXCell 8i memory and memory bus subsystems are fully error corrected.

Most of the remaining features of the PowerXCell 8i microarchitecture match those of the earlier Cell/B.E. version of the chip, but we remind the reader of the 256 KB local store associated with each SPE. This is a DMA-enabled, memory-mapped local memory with none of the transistor-demanding features of a full-blown cache; with the help of each SPE's MFC, the PowerXCell 8i SPEs can asynchronously pipeline data between it, main memory, and other SPE local stores. The local store's size, 16- and 128-byte blocked loads, and large outstanding memory reference queue are key features in the PowerXCell 8i's bandwidth profile.
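The per-SPE and per-chip peak figures quoted above follow directly from the clock rate and the SIMD register width; a worked version of that arithmetic, using only numbers given in the text:

\[
\begin{aligned}
\text{Per SPE (double precision):}\quad & 3.2\ \text{GHz} \times 2\ \text{words per register} \times 2\ \text{FLOPs (fused multiply-add)} = 12.8\ \text{GFLOPS} \\
\text{Per chip, 8 SPEs (double precision):}\quad & 8 \times 12.8\ \text{GFLOPS} = 102.4\ \text{GFLOPS} \\
\text{Per chip, 8 SPEs (single precision):}\quad & 8 \times (3.2 \times 4 \times 2)\ \text{GFLOPS} = 204.8\ \text{GFLOPS} \\
\text{Memory bandwidth per DP FLOP:}\quad & 25.6\ \text{GBps} \div 102.4\ \text{GFLOPS} = 0.25\ \text{bytes per FLOP}
\end{aligned}
\]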

PowerXCell 8i in a PCIe Card Form Factor: IBM's PXCAB Card

Positioned and priced to compete with GPUs offered in standard PCIe form factors, IBM's PXCAB card is a double-wide, PCIe 16x card offered with custom packaging and labeling to OEMs for use in rack-mounted units that might also accept GPU accelerator cards from NVIDIA or ATI. The PXCAB card includes one PowerXCell 8i processor, up to 8 GB of DDR2 memory on card, and two 1 Gigabit Ethernet ports. It functions more as a standalone component than a typical GPU accelerator. It runs the Linux operating system and communicates with the board's general-purpose processor via the PCIe bus using Ethernet emulation. This compact card retains the same advantages as IBM's other PowerXCell 8i products, including a large, directly addressable, error-corrected memory; good double-precision performance per watt; and support for the components in IBM's SDK.

IBM's QS22 Brings PowerXCell 8i Performance to the Cluster

IBM's BladeCenter QS22 uses the same form factor as the older QS21 and the other blade-based offerings from IBM (see Figure 4). IBM's BladeCenter H chassis accepts 14 of the QS22 blades (or QS21 or other IBM blades), and sites with QS21 blades can add or upgrade to the QS22. The QS22 is a full-height blade and includes two 3.2 GHz PowerXCell 8i processors coherently connected with IBM's BIF interface; up to 16 GB of DDR2 memory per processor; two BladeCenter, midplane-facilitated Gigabit Ethernet ports; room for an InfiniBand adapter, a SAS adapter, and I/O buffer memory; and support for IBM's SDK. Peak single-precision performance per blade is 460 GFLOPS and peak double-precision performance per blade is 217 GFLOPS (in both cases from two chips, each with one PPE and eight SPEs). This works out to 3.04 TFLOPS per chassis for double precision and 6.44 TFLOPS per chassis for single precision, with correspondingly higher totals per rack. Linpack performance per QS22 blade has been measured at around 170 double-precision GFLOPS, which is about 80% of peak performance per blade.
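The blade and chassis peaks quoted above are consistent with two PowerXCell 8i chips per blade plus a modest contribution from each PPE. The per-PPE figures used below (roughly 6.4 GFLOPS double precision and 25.6 GFLOPS single precision) are an assumption for illustration, not numbers stated in this paper:

\[
\begin{aligned}
\text{Per chip (DP):}\quad & 102.4\ \text{(8 SPEs)} + 6.4\ \text{(PPE, assumed)} = 108.8\ \text{GFLOPS} \\
\text{Per blade, 2 chips:}\quad & 2 \times 108.8 \approx 217\ \text{GFLOPS DP};\qquad 2 \times (204.8 + 25.6) \approx 460\ \text{GFLOPS SP} \\
\text{Per chassis, 14 blades:}\quad & 14 \times 217.6 \approx 3.04\ \text{TFLOPS DP};\qquad 14 \times 460.8 \approx 6.44\ \text{TFLOPS SP} \\
\text{Linpack efficiency per blade:}\quad & 170 / 217 \approx 0.78,\ \text{i.e., the roughly 80\% of peak cited above}
\end{aligned}
\]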

FIGURE 4
IBM's QS22 Blade
Source: IBM, 2008

An examination of the QS22's power efficiency shows that a single blade consumes about 250 watts while running Linpack. A complete QS22 cluster running Linpack has been measured at 488 MFLOPS per watt. This heterogeneous multicore, data-parallel SIMD processor has very good MFLOPS-per-watt specs when compared with most general-purpose microprocessors, which typically have measured values under 300 MFLOPS per watt. GPU power efficiency is generally quoted with respect to the power consumed only by the card, and GPUs come out somewhat ahead of the QS22 when this is done; however, when the power consumed by the board supporting the GPU card is included, the results are much closer to equal. The deciding factor for efficiency for a particular application will be the sustained performance observed. IBM believes that the QS22's directly addressable memory with 2 x 25.6 GBps bandwidth, MFC-supported DMA engines, full IEEE compatibility, and coherent interchip interface will give it a double-precision, sustained-performance advantage over its competitors.

Like the QS21's processors, the QS22's PowerXCell 8i processors function as standalone, multicore processors, two to a board and coherently linked in a manner not dissimilar to a dual-socket Opteron board linked by HyperTransport. The Linux OS runs independently on the PPE core of each processor and manages the use of its eight SPEs. In this sense, a BladeCenter H enclosure fitted with QS22 blades is not a bus-accelerated cluster like those that add GPUs to a standard x86-based cluster system, but a cluster of tightly coupled heterogeneous, multicore, cc-NUMA PowerXCell 8i based nodes.
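Relating the blade-level numbers above to the cluster-level measurement is straightforward arithmetic; the reading that the gap reflects the chassis, network, and other system power drawn by a complete cluster is an inference, not a figure from the paper:

\[
\frac{170{,}000\ \text{MFLOPS (sustained Linpack per blade)}}{250\ \text{W per blade}} \approx 680\ \text{MFLOPS/W at the blade level, versus the 488 MFLOPS/W measured for a full QS22 cluster.}
\]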

For scalar work, the performance of the PowerXCell 8i's PPE core does not equal the performance of the latest Intel or AMD x86 scalar cores. Yet the QS22's tightly coupled architecture and large mixed-core count promise better sustained performance than bus-accelerated cluster systems on certain HPC applications. While the QS22 offers a cc-NUMA, dual-socket architecture with 18 cores, the Triblade in the Roadrunner system IBM built for LANL has a bus-based design similar to that of GPU-based accelerators.

Roadrunner, IBM's HPC Hybrid System for LANL: A Milestone in Design and Performance

The announcement by IBM on June 10, 2008, that the PowerXCell 8i based supercomputer (LANL's Roadrunner) it had assembled at its Poughkeepsie, New York, facility had become the first computer to run the industry's standard Linpack benchmark at a sustained petaflop was an HPC milestone. While newswire attention has focused on reaching the petaflop goal (a quadrillion double-precision floating-point operations per second), from IDC's perspective, the milestone is really defined by several other important features of this achievement.

The Meaning of the Petaflop Milestone

The first is that a system based on components that are standards based and largely volume priced is now at the top of HPC's TOP500 list for the first time. The components include AMD Opteron dual-core processors, 4x DDR 20 Gbps InfiniBand interconnect, DDR2 memory, and the PowerXCell 8i heterogeneous, multicore, DLP acceleration engine. Quibbling about whether the PowerXCell 8i is standards based is acceptable (its Triblade is a custom enclosure), but its presence on the scene is clearly driven by volume economic trends and early investment by Sony, Toshiba, and IBM in a processing engine designed not for HPC but for game consoles, in this case the Sony PlayStation. As one might expect of such a high-end system, its standards-based components are custom integrated, but architecturally, it is an InfiniBand-switched cluster with acceleration technology supported by volume economics.

The acceleration technology is the second important feature of the announcement. The fastest computer in the world is now accelerator based, and the acceleration technology has not just augmented the performance of its general-purpose microprocessors. It is the primary engine behind Roadrunner's sustained Linpack petaflop. The system's PowerXCell 8i processors offer 1,332 TFLOPS compared with only about 50 TFLOPS from the dual-core Opterons. It is also noteworthy that acceleration technology did not merely put the system into the top 10 or 20 places of the TOP500 list, but rather put it at the very top.

Finally, LANL's Roadrunner and IBM's other PowerXCell 8i based products bring HPC and its highest-performing system back to its data-parallel roots. Linpack, a benchmark with significant cache-reuse potential, runs at 78% efficiency on Roadrunner, which has no L2 cache and only a modestly sized, user-programmed, 256 KB local memory. This is a reminder of how well vector and data-parallel microarchitectures suit typically data-intensive HPC workloads (and also perhaps that the Cray-2 had a similarly sized local memory). The PowerXCell 8i's simplified, in-order data-parallel design also offers the added benefit of a reduced transistor count and therefore lower power consumption per FLOP.

The PowerXCell 8i has only 250 million transistors on its 65nm die. Intel's quad-core Harpertown has 410 million; AMD's quad-core Barcelona has 463 million; and NVIDIA's Tesla GPU has 681 million. On the Linpack benchmark, Roadrunner achieves about 437 MFLOPS per watt even while carrying the power consumed by the AMD Opteron part of the system's Triblade (as we saw earlier, the QS22 blade alone is still more power efficient, at 488 MFLOPS per watt). This makes Roadrunner over 65 MFLOPS per watt more efficient than even IBM's Blue Gene/P system, which was at the top of the Green500 list in February 2008, and almost 200 MFLOPS per watt more efficient than the best unaccelerated, x86-based cluster system. When taken together, the features of Roadrunner discussed here and of IBM's other PowerXCell 8i based products send a powerful message: HPC clusters designed around standards-based components, but in custom enclosures that use DLP acceleration technology augmented by blocked memory reference hardware (the PowerXCell 8i's MFC), can provide both industry-leading double-precision performance and power efficiency. IDC expects that the HPC community is paying close attention to this message.

Roadrunner's Design Elements

Unlike the QS22 cluster described earlier, LANL's Roadrunner features a bus-based architecture that places the PowerXCell 8i blades under the control of a dual-socket Opteron node, accessible through a HyperTransport-to-PCI Express (HT-to-PCIe) bridge bus. This design is the basis for IBM's Triblade, which fits three to a standard IBM BladeCenter H chassis. The four-slot Triblade (see Figure 5) is currently available only as part of IBM's QS22/LS21-based Roadrunner system at LANL. It includes two QS22 accelerator blades, each with dual-socket, 3.2 GHz PowerXCell 8i boards and four slots for their own directly controlled, DDR2 board-local memory. The QS22 blades are connected to a single dual-socket, dual-core 1.8 GHz Opteron-based master node-blade called the LS21. They are connected through an HT-to-PCIe bridge-blade, which is sandwiched between them and gives the Triblade its quad-blade appearance. The two 2 x 8x PCIe-to-16x HT links carry the traffic from each QS22 PowerXCell 8i socket to the LS21. Both the LS21 and QS22 have four DIMM slots per socket. Currently, Roadrunner is configured with 8 GB of memory per Opteron socket and 4 GB per PowerXCell 8i socket, for (8 x 2) + (4 x 4) = 32 GB of memory in total per Triblade and 80 TB for the entire system. Memory within each board (LS21 and QS22) is cc-NUMA integrated. From a programming perspective, the dual-processor Opteron LS21 functions as the programmable node for MPI message passing, while the two QS22 boards are programmed at a lower level using one of the components of IBM's SDK.

FIGURE 5
Roadrunner's Custom Integrated Triblade
Source: IBM, 2008

Stepping back and looking at the larger design features, we see that LANL's Roadrunner combines 180 of these Triblade nodes and 12 I/O blades with a 288-port DDR InfiniBand switch into "connected units" (CUs). There are 17 CUs in the entire system, giving it 3,060 Triblade nodes for computation; a total of 6,120 dual-core Opteron chips (50 TFLOPS peak); and 12,240 PowerXCell 8i chips (1.33 PFLOPS peak). There is one Opteron core for each PowerXCell 8i chip. When its I/O and management nodes are included, Roadrunner contains 130,464 computational cores. Roadrunner uses a two-tier fat tree topology supported by standard 288-port, 20 Gbps DDR InfiniBand switches and network adapters. There is full bisection bandwidth within each CU and half bisection bandwidth among the CUs. All of Roadrunner's interconnect cables are optical, and its bisection bandwidth is uniformly 3.5 TBps. Its 216 I/O nodes support an aggregate bandwidth of 432 GB per second to LANL's 2-plus petabyte high-performance global file system from Panasas. A schematic diagram of Roadrunner's two-tiered DDR InfiniBand-based fat tree and its interconnected CUs is presented in Figure 6.
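A worked tally of the node, chip, and core counts given above; how the remaining cores are split between I/O and management nodes is not stated in the paper:

\[
\begin{aligned}
17\ \text{CUs} \times 180\ \text{Triblades} &= 3{,}060\ \text{compute nodes} \\
3{,}060 \times 2\ \text{Opteron chips} &= 6{,}120\ \text{dual-core Opterons};\qquad 3{,}060 \times 4\ \text{Cell chips} = 12{,}240\ \text{PowerXCell 8i chips} \\
\text{Compute cores} &= 6{,}120 \times 2 + 12{,}240 \times (1 + 8) = 12{,}240 + 110{,}160 = 122{,}400
\end{aligned}
\]

The cited total of 130,464 cores adds the cores in the system's I/O and management nodes to this compute-node tally.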

FIGURE 6
Roadrunner's Two-Tiered DDR InfiniBand Fat Tree
(Connected Unit (CU) cluster: 180 compute nodes with Cells and 12 I/O nodes with PCIe-attached Cell blades per CU; 17 cluster units and 3,060 compute nodes in all; 12,240 Cell eDP chips and 6,120 dual-core Opterons; 288-port IB 4x DDR switches, with 12 links per CU to each of eight second-stage 288-port IB 4x DDR switches; 296 racks; 3.9 MW.)
Source: LANL, 2008

While our attention naturally turns to the details of Roadrunner's design as presented earlier and to its sheer scale (it consumes 2.3 MW of power, has over 130,000 cores, weighs 500,000 pounds, and will take 21 trucks to deliver), the applications that will be run on it and the new science they make possible should be our focus. As noted earlier, HPC users feel that the primary barrier to generating new science on accelerators, including IBM's PowerXCell 8i product line, will be their programmability. IBM, Los Alamos National Laboratory, RapidMind, Gedae, and others are investing heavily in the development of the PowerXCell 8i's programming environment.

Investing in PowerXCell 8i Programmability

The major challenge in improving the performance, productivity, and portability (the three Ps of HPC programming) of today's HPC applications is the ubiquity, variety, and growth of parallelism in HPC system architectures. At the high end, government labs are adapting or rewriting their key applications to take maximum advantage of ultraparallel HPC systems that now contain as many as 100,000 independent computational cores. At the low end, smaller businesses and corporate departments engaged in computational science and engineering can no longer count on clock-period performance improvements and must adopt and improve the parallel performance of their applications on multicore, multisocket servers and clusters to achieve the productivity that will keep them competitive.

Investment by government, business, and venture capital firms in technologies to improve the three Ps of HPC application programming has grown to respond to this challenge. Latency and bandwidth limitations, the multitiered memory hierarchy, synchronization bottlenecks, load balancing challenges, and recovery from failure are among the many factors that make today's ultraparallel programming problem a very difficult one, even when every processing element runs the same instruction set.

With all their potential benefits, the heterogeneous or hybrid, multi-instruction set computing models that come along with most HPC acceleration technologies make this problem only more challenging. As a heterogeneous chip multiprocessor (CMP), IBM's QS22 PowerXCell 8i hybrid architecture is positioned between the bus- or network-divided, dual-ISA approach of GPU acceleration and the vector-scalar, functional-unit-integrated, single-ISA approach of Cray vector accelerators (and probably future designs from Intel and AMD). The PowerXCell 8i's designers adopted the view that accelerators (and their instruction sets) should be allowed to evolve independently from their supporting general-purpose scalar cores but should be placed on the same chip and tightly coupled through a high-bandwidth interconnect and memory management system. This stems from IBM's long-term view that data-parallel acceleration is just an initial phase in a process in which accelerator cores will become more workload specific.

The PowerXCell 8i's parallel model is based on SIMD threads that are spawned from the general-purpose PPE onto the SPEs. These threads are supported with data delivered asynchronously by each SPE's DMA-enabled MFC unit, which is capable of as many as 128 outstanding 128-byte blocked, simultaneous memory references. IBM's on-chip division of labor separates the requirement of compiling and running threads for the distinct, general-purpose core from that for the accelerated cores, while offering very high-bandwidth intercore communication and a shared memory space to connect them. As implemented, this approach has several advantages:

- This thread-based model is readily supported in the Linux kernel, and it gives IBM the flexibility to present a software-integrated programming environment while developing its acceleration and general-purpose hardware independently.

- It pushes the functions of the on-chip interconnect to center stage (especially in its support of distributed DMAs and thread synchronization) and has forced IBM to think hard about on-chip interconnect requirements that will need to be addressed in HPC's probable many-core future. The PowerXCell 8i's division-of-labor design also maps naturally to the variety of parallel abstraction and programming models being developed within IBM and by external partners. This has already stimulated the development of multiple programming model alternatives for the PowerXCell 8i and should ease the burden of porting PowerXCell 8i programs to other accelerated platforms.

- While it is more difficult to single-source compile to two distinct processors and instruction sets mediated by an interconnect, IBM's intention to produce such a compiler is visible in its working prototype and is supported by successes already achieved in other contexts, such as Partitioned Global Address Space (PGAS) compiler development, in which locality and parallel extensions have been added to standard programming languages and subroutine libraries act as a mediation layer.

IBM and its PowerXCell 8i software development partners are working along these lines to lower the accelerated, parallel programming barrier that IDC has noted is the primary difficulty limiting the adoption of acceleration technology in HPC.
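As a minimal sketch of the thread model described above, the following hypothetical PPE-side launcher creates one Linux thread per SPE context using the SDK's libspe2 interface. The embedded SPU program handle simd_kernel is a placeholder, and a real application would pass each SPE a pointer to a control block describing its share of the data rather than a null argument.

    /* Hypothetical PPE-side launcher: one Linux pthread per SPE context,
     * following the PPE-spawns-SPE-threads model described above.
     * Build against IBM's SDK with -lspe2 -lpthread. */
    #include <libspe2.h>
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_SPES 8

    extern spe_program_handle_t simd_kernel;   /* embedded SPU binary (placeholder) */

    static void *run_spe(void *arg)
    {
        spe_context_ptr_t ctx = spe_context_create(0, NULL);
        unsigned int entry = SPE_DEFAULT_ENTRY;

        spe_program_load(ctx, &simd_kernel);
        /* Blocks until the SPU program exits; arg would normally point at a
         * control block in main memory that the SPE fetches via MFC DMA. */
        spe_context_run(ctx, &entry, 0, arg, NULL, NULL);
        spe_context_destroy(ctx);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_SPES];

        for (int i = 0; i < NUM_SPES; i++)
            pthread_create(&threads[i], NULL, run_spe, NULL);
        for (int i = 0; i < NUM_SPES; i++)
            pthread_join(threads[i], NULL);

        puts("all SPE threads finished");
        return 0;
    }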


More information

Experts in Application Acceleration Synective Labs AB

Experts in Application Acceleration Synective Labs AB Experts in Application Acceleration 1 2009 Synective Labs AB Magnus Peterson Synective Labs Synective Labs quick facts Expert company within software acceleration Based in Sweden with offices in Gothenburg

More information

BlueGene/L. Computer Science, University of Warwick. Source: IBM

BlueGene/L. Computer Science, University of Warwick. Source: IBM BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours

More information

Performance of the AMD Opteron LS21 for IBM BladeCenter

Performance of the AMD Opteron LS21 for IBM BladeCenter August 26 Performance Analysis Performance of the AMD Opteron LS21 for IBM BladeCenter Douglas M. Pase and Matthew A. Eckl IBM Systems and Technology Group Page 2 Abstract In this paper we examine the

More information

HPC and Accelerators. Ken Rozendal Chief Architect, IBM Linux Technology Cener. November, 2008

HPC and Accelerators. Ken Rozendal Chief Architect, IBM Linux Technology Cener. November, 2008 HPC and Accelerators Ken Rozendal Chief Architect, Linux Technology Cener November, 2008 All statements regarding future directions and intent are subject to change or withdrawal without notice and represent

More information

HPC Architectures. Types of resource currently in use

HPC Architectures. Types of resource currently in use HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Cray XD1 Supercomputer Release 1.3 CRAY XD1 DATASHEET

Cray XD1 Supercomputer Release 1.3 CRAY XD1 DATASHEET CRAY XD1 DATASHEET Cray XD1 Supercomputer Release 1.3 Purpose-built for HPC delivers exceptional application performance Affordable power designed for a broad range of HPC workloads and budgets Linux,

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

The Use of Cloud Computing Resources in an HPC Environment

The Use of Cloud Computing Resources in an HPC Environment The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes

More information

CellSs Making it easier to program the Cell Broadband Engine processor

CellSs Making it easier to program the Cell Broadband Engine processor Perez, Bellens, Badia, and Labarta CellSs Making it easier to program the Cell Broadband Engine processor Presented by: Mujahed Eleyat Outline Motivation Architecture of the cell processor Challenges of

More information

All About the Cell Processor

All About the Cell Processor All About the Cell H. Peter Hofstee, Ph. D. IBM Systems and Technology Group SCEI/Sony Toshiba IBM Design Center Austin, Texas Acknowledgements Cell is the result of a deep partnership between SCEI/Sony,

More information

The Stampede is Coming: A New Petascale Resource for the Open Science Community

The Stampede is Coming: A New Petascale Resource for the Open Science Community The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation

More information

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola 1. Microprocessor Architectures 1.1 Intel 1.2 Motorola 1.1 Intel The Early Intel Microprocessors The first microprocessor to appear in the market was the Intel 4004, a 4-bit data bus device. This device

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP INTRODUCTION or With the exponential increase in computational power of todays hardware, the complexity of the problem

More information

New Approach to Unstructured Data

New Approach to Unstructured Data Innovations in All-Flash Storage Deliver a New Approach to Unstructured Data Table of Contents Developing a new approach to unstructured data...2 Designing a new storage architecture...2 Understanding

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

Six-Core AMD Opteron Processor

Six-Core AMD Opteron Processor What s you should know about the Six-Core AMD Opteron Processor (Codenamed Istanbul ) Six-Core AMD Opteron Processor Versatility Six-Core Opteron processors offer an optimal mix of performance, energy

More information

Customer Success Story Los Alamos National Laboratory

Customer Success Story Los Alamos National Laboratory Customer Success Story Los Alamos National Laboratory Panasas High Performance Storage Powers the First Petaflop Supercomputer at Los Alamos National Laboratory Case Study June 2010 Highlights First Petaflop

More information

Paving the Road to Exascale

Paving the Road to Exascale Paving the Road to Exascale Gilad Shainer August 2015, MVAPICH User Group (MUG) Meeting The Ever Growing Demand for Performance Performance Terascale Petascale Exascale 1 st Roadrunner 2000 2005 2010 2015

More information

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications September 2013 Navigating between ever-higher performance targets and strict limits

More information

The Cray Rainier System: Integrated Scalar/Vector Computing

The Cray Rainier System: Integrated Scalar/Vector Computing THE SUPERCOMPUTER COMPANY The Cray Rainier System: Integrated Scalar/Vector Computing Per Nyberg 11 th ECMWF Workshop on HPC in Meteorology Topics Current Product Overview Cray Technology Strengths Rainier

More information

Intel Core i7 Processor

Intel Core i7 Processor Intel Core i7 Processor Vishwas Raja 1, Mr. Danish Ather 2 BSc (Hons.) C.S., CCSIT, TMU, Moradabad 1 Assistant Professor, CCSIT, TMU, Moradabad 2 1 vishwasraja007@gmail.com 2 danishather@gmail.com Abstract--The

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected

More information

Global Headquarters: 5 Speen Street Framingham, MA USA P F

Global Headquarters: 5 Speen Street Framingham, MA USA P F Global Headquarters: 5 Speen Street Framingham, MA 01701 USA P.508.872.8200 F.508.935.4015 www.idc.com WHITE PAPER A New Strategic Approach To HPC: IBM's Blue Gene Sponsored by: IBM Christopher G. Willard,

More information

The Mont-Blanc approach towards Exascale

The Mont-Blanc approach towards Exascale http://www.montblanc-project.eu The Mont-Blanc approach towards Exascale Alex Ramirez Barcelona Supercomputing Center Disclaimer: Not only I speak for myself... All references to unavailable products are

More information

How to Write Fast Code , spring th Lecture, Mar. 31 st

How to Write Fast Code , spring th Lecture, Mar. 31 st How to Write Fast Code 18-645, spring 2008 20 th Lecture, Mar. 31 st Instructor: Markus Püschel TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay (Fred) Introduction Parallelism: definition Carrying

More information

COSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors

COSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors COSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors Edgar Gabriel Fall 2018 References Intel Larrabee: [1] L. Seiler, D. Carmean, E.

More information

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems 1 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ 10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems To enhance system performance and, in some cases, to increase

More information

Fundamentals of Quantitative Design and Analysis

Fundamentals of Quantitative Design and Analysis Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature

More information

The Future of Computing: AMD Vision

The Future of Computing: AMD Vision The Future of Computing: AMD Vision Tommy Toles AMD Business Development Executive thomas.toles@amd.com 512-327-5389 Agenda Celebrating Momentum Years of Leadership & Innovation Current Opportunity To

More information

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. Cluster Networks Introduction Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. As usual, the driver is performance

More information

Lowering Cost per Bit With 40G ATCA

Lowering Cost per Bit With 40G ATCA White Paper Lowering Cost per Bit With 40G ATCA Prepared by Simon Stanley Analyst at Large, Heavy Reading www.heavyreading.com sponsored by www.radisys.com June 2012 Executive Summary Expanding network

More information

HPC Technology Trends

HPC Technology Trends HPC Technology Trends High Performance Embedded Computing Conference September 18, 2007 David S Scott, Ph.D. Petascale Product Line Architect Digital Enterprise Group Risk Factors Today s s presentations

More information

Who says world-class high performance computing (HPC) should be reserved for large research centers? The Cray CX1 supercomputer makes HPC performance

Who says world-class high performance computing (HPC) should be reserved for large research centers? The Cray CX1 supercomputer makes HPC performance Who says world-class high performance computing (HPC) should be reserved for large research centers? The Cray CX1 supercomputer makes HPC performance available to everyone, combining the power of a high

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Complexity and Advanced Algorithms. Introduction to Parallel Algorithms

Complexity and Advanced Algorithms. Introduction to Parallel Algorithms Complexity and Advanced Algorithms Introduction to Parallel Algorithms Why Parallel Computing? Save time, resources, memory,... Who is using it? Academia Industry Government Individuals? Two practical

More information

Cisco Unified Computing System Delivering on Cisco's Unified Computing Vision

Cisco Unified Computing System Delivering on Cisco's Unified Computing Vision Cisco Unified Computing System Delivering on Cisco's Unified Computing Vision At-A-Glance Unified Computing Realized Today, IT organizations assemble their data center environments from individual components.

More information

Gen-Z Memory-Driven Computing

Gen-Z Memory-Driven Computing Gen-Z Memory-Driven Computing Our vision for the future of computing Patrick Demichel Distinguished Technologist Explosive growth of data More Data Need answers FAST! Value of Analyzed Data 2005 0.1ZB

More information

Exascale: Parallelism gone wild!

Exascale: Parallelism gone wild! IPDPS TCPP meeting, April 2010 Exascale: Parallelism gone wild! Craig Stunkel, Outline Why are we talking about Exascale? Why will it be fundamentally different? How will we attack the challenges? In particular,

More information

Computer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley

Computer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley Computer Systems Architecture I CSE 560M Lecture 19 Prof. Patrick Crowley Plan for Today Announcement No lecture next Wednesday (Thanksgiving holiday) Take Home Final Exam Available Dec 7 Due via email

More information

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved. Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE

More information

Top 4 considerations for choosing a converged infrastructure for private clouds

Top 4 considerations for choosing a converged infrastructure for private clouds Top 4 considerations for choosing a converged infrastructure for private clouds Organizations are increasingly turning to private clouds to improve efficiencies, lower costs, enhance agility and address

More information

Building supercomputers from embedded technologies

Building supercomputers from embedded technologies http://www.montblanc-project.eu Building supercomputers from embedded technologies Alex Ramirez Barcelona Supercomputing Center Technical Coordinator This project and the research leading to these results

More information

White Paper. Technical Advances in the SGI. UV Architecture

White Paper. Technical Advances in the SGI. UV Architecture White Paper Technical Advances in the SGI UV Architecture TABLE OF CONTENTS 1. Introduction 1 2. The SGI UV Architecture 2 2.1. SGI UV Compute Blade 3 2.1.1. UV_Hub ASIC Functionality 4 2.1.1.1. Global

More information

The Stampede is Coming Welcome to Stampede Introductory Training. Dan Stanzione Texas Advanced Computing Center

The Stampede is Coming Welcome to Stampede Introductory Training. Dan Stanzione Texas Advanced Computing Center The Stampede is Coming Welcome to Stampede Introductory Training Dan Stanzione Texas Advanced Computing Center dan@tacc.utexas.edu Thanks for Coming! Stampede is an exciting new system of incredible power.

More information

HP ProLiant BladeSystem Gen9 vs Gen8 and G7 Server Blades on Data Warehouse Workloads

HP ProLiant BladeSystem Gen9 vs Gen8 and G7 Server Blades on Data Warehouse Workloads HP ProLiant BladeSystem Gen9 vs Gen8 and G7 Server Blades on Data Warehouse Workloads Gen9 server blades give more performance per dollar for your investment. Executive Summary Information Technology (IT)

More information

Performance of Variant Memory Configurations for Cray XT Systems

Performance of Variant Memory Configurations for Cray XT Systems Performance of Variant Memory Configurations for Cray XT Systems Wayne Joubert, Oak Ridge National Laboratory ABSTRACT: In late 29 NICS will upgrade its 832 socket Cray XT from Barcelona (4 cores/socket)

More information

AMD EPYC Empowers Single-Socket Servers

AMD EPYC Empowers Single-Socket Servers Whitepaper Sponsored by AMD May 16, 2017 This paper examines AMD EPYC, AMD s upcoming server system-on-chip (SoC). Many IT customers purchase dual-socket (2S) servers to acquire more I/O or memory capacity

More information

Intel High-Performance Computing. Technologies for Engineering

Intel High-Performance Computing. Technologies for Engineering 6. LS-DYNA Anwenderforum, Frankenthal 2007 Keynote-Vorträge II Intel High-Performance Computing Technologies for Engineering H. Cornelius Intel GmbH A - II - 29 Keynote-Vorträge II 6. LS-DYNA Anwenderforum,

More information

IBM Virtual Fabric Architecture

IBM Virtual Fabric Architecture IBM Virtual Fabric Architecture Seppo Kemivirta Product Manager Finland IBM System x & BladeCenter 2007 IBM Corporation Five Years of Durable Infrastructure Foundation for Success BladeCenter Announced

More information

Parallelism and Concurrency. COS 326 David Walker Princeton University

Parallelism and Concurrency. COS 326 David Walker Princeton University Parallelism and Concurrency COS 326 David Walker Princeton University Parallelism What is it? Today's technology trends. How can we take advantage of it? Why is it so much harder to program? Some preliminary

More information

SAS Enterprise Miner Performance on IBM System p 570. Jan, Hsian-Fen Tsao Brian Porter Harry Seifert. IBM Corporation

SAS Enterprise Miner Performance on IBM System p 570. Jan, Hsian-Fen Tsao Brian Porter Harry Seifert. IBM Corporation SAS Enterprise Miner Performance on IBM System p 570 Jan, 2008 Hsian-Fen Tsao Brian Porter Harry Seifert IBM Corporation Copyright IBM Corporation, 2008. All Rights Reserved. TABLE OF CONTENTS ABSTRACT...3

More information

Using Graphics Chips for General Purpose Computation

Using Graphics Chips for General Purpose Computation White Paper Using Graphics Chips for General Purpose Computation Document Version 0.1 May 12, 2010 442 Northlake Blvd. Altamonte Springs, FL 32701 (407) 262-7100 TABLE OF CONTENTS 1. INTRODUCTION....1

More information

GPUs and Emerging Architectures

GPUs and Emerging Architectures GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs

More information

Outline Marquette University

Outline Marquette University COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations

More information

Fundamentals of Computer Design

Fundamentals of Computer Design Fundamentals of Computer Design Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University

More information

W H I T E P A P E R I B M S y s t e m X 4 : D e l i v e r i n g High Value Through Scale Up

W H I T E P A P E R I B M S y s t e m X 4 : D e l i v e r i n g High Value Through Scale Up W H I T E P A P E R I B M S y s t e m X 4 : D e l i v e r i n g High Value Through Scale Up Sponsored by: IBM Kenneth Cayton January 2008 Jed Scaramella EXECUTIVE SUMMARY Global Headquarters: 5 Speen Street

More information

2008 International ANSYS Conference

2008 International ANSYS Conference 2008 International ANSYS Conference Maximizing Productivity With InfiniBand-Based Clusters Gilad Shainer Director of Technical Marketing Mellanox Technologies 2008 ANSYS, Inc. All rights reserved. 1 ANSYS,

More information

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA STATE OF THE ART 2012 18,688 Tesla K20X GPUs 27 PetaFLOPS FLAGSHIP SCIENTIFIC APPLICATIONS

More information

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically

More information

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications

More information

Introduction: PURPOSE BUILT HARDWARE. ARISTA WHITE PAPER HPC Deployment Scenarios

Introduction: PURPOSE BUILT HARDWARE. ARISTA WHITE PAPER HPC Deployment Scenarios HPC Deployment Scenarios Introduction: Private and public High Performance Computing systems are continually increasing in size, density, power requirements, storage, and performance. As these systems

More information

High Performance Computing in Europe and USA: A Comparison

High Performance Computing in Europe and USA: A Comparison High Performance Computing in Europe and USA: A Comparison Erich Strohmaier 1 and Hans W. Meuer 2 1 NERSC, Lawrence Berkeley National Laboratory, USA 2 University of Mannheim, Germany 1 Introduction In

More information

Enhancing Analysis-Based Design with Quad-Core Intel Xeon Processor-Based Workstations

Enhancing Analysis-Based Design with Quad-Core Intel Xeon Processor-Based Workstations Performance Brief Quad-Core Workstation Enhancing Analysis-Based Design with Quad-Core Intel Xeon Processor-Based Workstations With eight cores and up to 80 GFLOPS of peak performance at your fingertips,

More information

Global Headquarters: 5 Speen Street Framingham, MA USA P F

Global Headquarters: 5 Speen Street Framingham, MA USA P F WHITE PAPER SSDs: The Other Primary Storage Alternative Sponsored by: Samsung Jeff Janukowicz January 2008 Dave Reinsel IN THIS WHITE PAPER Global Headquarters: 5 Speen Street Framingham, MA 01701 USA

More information

Fujitsu s Approach to Application Centric Petascale Computing

Fujitsu s Approach to Application Centric Petascale Computing Fujitsu s Approach to Application Centric Petascale Computing 2 nd Nov. 2010 Motoi Okuda Fujitsu Ltd. Agenda Japanese Next-Generation Supercomputer, K Computer Project Overview Design Targets System Overview

More information

It s Time to Move Your Critical Data to SSDs Introduction

It s Time to Move Your Critical Data to SSDs Introduction It s Time to Move Your Critical Data to SSDs Introduction by the Northamber Storage Specialist Today s IT professionals are well aware that users expect fast, reliable access to ever-growing amounts of

More information

Broadcast-Quality, High-Density HEVC Encoding with AMD EPYC Processors

Broadcast-Quality, High-Density HEVC Encoding with AMD EPYC Processors Solution Brief December, 2018 2018 Broadcast-Quality, High-Density HEVC Encoding with AMD EPYC Processors HIGHLIGHTS o The AMD EPYC SoC brings a new balance to the datacenter. Utilizing an x86-architecture,

More information

Text Messaging Helps Your Small Business Perform Big

Text Messaging Helps Your Small Business Perform Big White Paper Text Messaging Helps Your Small Business Perform Big Sponsored by: AT&T Denise Lund August 2017 IN THIS WHITE PAPER This white paper introduces small businesses to the benefits of communicating

More information

Best Practices for Setting BIOS Parameters for Performance

Best Practices for Setting BIOS Parameters for Performance White Paper Best Practices for Setting BIOS Parameters for Performance Cisco UCS E5-based M3 Servers May 2013 2014 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public. Page

More information

Cell Processor and Playstation 3

Cell Processor and Playstation 3 Cell Processor and Playstation 3 Guillem Borrell i Nogueras February 24, 2009 Cell systems Bad news More bad news Good news Q&A IBM Blades QS21 Cell BE based. 8 SPE 460 Gflops Float 20 GFLops Double QS22

More information

InfiniBand Strengthens Leadership as The High-Speed Interconnect Of Choice

InfiniBand Strengthens Leadership as The High-Speed Interconnect Of Choice InfiniBand Strengthens Leadership as The High-Speed Interconnect Of Choice Providing the Best Return on Investment by Delivering the Highest System Efficiency and Utilization Top500 Supercomputers June

More information

Preparing GPU-Accelerated Applications for the Summit Supercomputer

Preparing GPU-Accelerated Applications for the Summit Supercomputer Preparing GPU-Accelerated Applications for the Summit Supercomputer Fernanda Foertter HPC User Assistance Group Training Lead foertterfs@ornl.gov This research used resources of the Oak Ridge Leadership

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology

More information

IBM System x3455 AMD Opteron SMP 1 U server features Xcelerated Memory Technology to meet the needs of HPC environments

IBM System x3455 AMD Opteron SMP 1 U server features Xcelerated Memory Technology to meet the needs of HPC environments IBM Europe Announcement ZG07-0492, dated July 17, 2007 IBM System x3455 AMD Opteron SMP 1 U server features Xcelerated Memory Technology to meet the needs of HPC environments Key prerequisites...2 Description...3

More information

Technology challenges and trends over the next decade (A look through a 2030 crystal ball) Al Gara Intel Fellow & Chief HPC System Architect

Technology challenges and trends over the next decade (A look through a 2030 crystal ball) Al Gara Intel Fellow & Chief HPC System Architect Technology challenges and trends over the next decade (A look through a 2030 crystal ball) Al Gara Intel Fellow & Chief HPC System Architect Today s Focus Areas For Discussion Will look at various technologies

More information

Fundamentals of Computers Design

Fundamentals of Computers Design Computer Architecture J. Daniel Garcia Computer Architecture Group. Universidad Carlos III de Madrid Last update: September 8, 2014 Computer Architecture ARCOS Group. 1/45 Introduction 1 Introduction 2

More information