WHITE PAPER

With Its New PowerXCell 8i Product Line, IBM Intends to Take Accelerated Processing into the HPC Mainstream


Sponsored by: IBM

Richard Walsh, Earl C. Joseph, Ph.D., Steve Conway, Jie Wu

August 2008

Global Headquarters: 5 Speen Street, Framingham, MA 01701 USA   P.508.872.8200   F.508.935.4015   www.idc.com

IDC OPINION

Fifteen years ago, the high-performance computing (HPC) market started to abandon its data-parallel, vector architectural lineage and turned to commodity-priced scalar processors. One by one, the other custom components of HPC systems have been pushed aside in favor of cheaper, standards-based alternatives. With some notable exceptions, most HPC system component technologies have been mainstreamed, a change driven by the price-performance advantages offered by standards-based components engineered to serve volume markets. Nothing reflects this more strongly than the fact that standards-based cluster sales based on x86 microprocessors were responsible for over 65% of the revenue generated in the HPC market in 2007, up from just a 20% share in early 2003.

However, blood is thicker than water, and the HPC user community has not forgotten where it came from or the fundamental data intensity of most HPC workloads. The mainstream x86 instruction set architecture (ISA) was not designed with HPC data-parallel requirements in mind and has therefore limited the sustained performance of many HPC applications. While processor clock speeds have recently stopped climbing, processor cores have multiplied, exacerbating this sustained performance shortcoming. There is little to suggest that HPC buyers will abandon the x86 mainstream and return to purchasing large numbers of custom data-parallel or vector systems to improve their applications' sustained performance. The aim of IBM's PowerXCell 8i product line, with its single instruction, multiple data (SIMD) ISA and memory flow controller (MFC), is instead to bring data-parallel computing back to HPC and deliver higher sustained performance and power efficiency to HPC workloads with a processing engine supported by volume economics. In IDC's opinion, IBM's new PowerXCell 8i processor and its go-to-market strategy have the potential to stimulate the return and mainstreaming of data-parallel processing in HPC. The key features of IBM's new PowerXCell 8i product line and its market strategy include:

- A single-chip, MFC-controlled, high memory bandwidth, shared memory design
- A double data rate (DDR2) memory subsystem and fully IEEE-compliant double-precision floating-point capabilities
- A broad range of PowerXCell 8i based products configurable at a variety of scales

- A multitiered programming model with strong support among IBM's customers and partners
- Its use in the world's fastest computer, Los Alamos National Laboratory's (LANL) petascale supercomputer, Roadrunner
- A well-defined road map supported by volume economics from the gaming industry

IN THIS WHITE PAPER

In this white paper, IDC reviews the state of the HPC market, its recent five years of very strong growth, the rise of standards-based clusters, and the growing importance of blades and custom-engineered enclosures. HPC buyer "pain points" related to memory bandwidth shortages, parallel programming of multicore processors, and power consumption are discussed, as is their potential to stimulate the more mainstream use of accelerators and data-parallel programming. Finally, the paper reviews IBM's PowerXCell 8i product line, multitiered programming environment, and some of its parallel programming software partnerships.

SITUATION OVERVIEW

HPC's Strong Market Growth

The HPC market has shown rapid growth in the five years since 2002, especially when compared with the background rate of IT spending generally. HPC revenue had three years of double-digit growth between 2003 and 2005, followed by a still-impressive 9% year-over-year growth between 2005 and 2006. In 2007, despite a slowing economy, HPC revenue growth over 2006 was 15.5%, exceeding IDC estimates. Table 1 shows revenue growth over this period by competitive segment.

TABLE 1
Worldwide HPC Market Revenue by Competitive Segment, 2003-2007 ($M)

Competitive Segment        Price Range           2003    2004    2005    2006    2007   CAGR (%)
Supercomputer              >$500,000             2,401   2,631   2,881   2,567      -       -
Technical divisional       $250,000-499,999          -       -       -   1,420      -       -
Technical departmental     $100,000-249,999          -       -   2,561   3,323      -       -
Technical workgroup        $0-99,999             1,806   2,668   2,568   2,744      -       -
Total                                            5,698   7,393   9,208  10,055      -       -

Source: IDC, 2008

Given the growing global interest in HPC technology as an essential component in national economic and technology strategies and the robust competition in the market, which continues to produce rapid innovation, IDC sees few major threats to a continued pattern of high growth in 2008 and beyond. In IDC's view, even a softening world economy should not greatly alter this forecast market growth because HPC's heavy R&D focus and longer buying cycles have largely insulated it historically from short-term economic downturns. IDC projects that HPC server revenue will increase at around a 9% CAGR through 2012 to reach almost $18 billion, up from under $6 billion in 2003 (see Table 2).

TABLE 2
Worldwide HPC Market Revenue Forecast by Competitive Segment, 2008-2012 ($M)

Competitive Segment        Price Range           2008    2009    2010    2011    2012   CAGR (%)
Supercomputer              >$500,000             3,035   3,247   3,463   3,682      -       -
Technical divisional       $250,000-499,999      2,102   2,427   2,755   3,086      -       -
Technical departmental     $100,000-249,999      4,801   5,400   5,990   6,570      -       -
Technical workgroup        $0-99,999             2,784   2,959   3,131   3,301      -       -
Total                                           12,723  14,033  15,339  16,639      -       -

Source: IDC, 2008

HPC Clusters Fuel Market Growth

IDC's data show that the surge in HPC revenue has been fueled primarily by purchases of x86-based, Linux cluster systems priced below $500,000 (especially those priced under $250,000). This growth was sustained by MPI, a maturing, message-based parallel programming model. HPC workloads with largely partitionable data structures, already parallelized for custom massively parallel processing (MPP) and constellation systems, could be moved easily to clusters. Once there, input data sets could be grown to match the memory and bandwidth provided on the additional cluster nodes and allow for further scaling (so-called weak scaling). Even less scalable workloads benefited because more jobs with distinct inputs could be run simultaneously, increasing throughput, increasing the research and development iteration rate, and reducing time to solution. This process pushed the HPC price-performance curve sharply downward, creating a zero-gravity sensation and the expectation that performance should more than double in a technological generation while costing no more. This price-performance advantage and the other advantages that HPC buyers associate with clusters are presented in Figure 1.

FIGURE 1
Cluster Drivers: Top Reasons to Purchase HPC Clusters
(Reasons surveyed: better price/performance, greater system throughput, ability to do new more/better science, ability to run larger problems, total cost of ownership (TCO), improved capacity management, to improve competitiveness, and other; the chart shows the number of responses for each.)
Source: IDC, 2008

As recently as early 2003, clusters accounted for just 20% of overall HPC server revenue. The dramatic penetration of the HPC market by clusters and their replacement of custom HPC systems through 4Q07 is shown in Figure 2. By the end of 2007, clusters had attained a 65% share of HPC server revenue. IDC sees clusters eventually topping out at about 80% of the HPC market, with the other 20% made up of systems that do not qualify as clusters, such as single-node servers, systems with symmetric multiprocessing (SMP) architectures, and MPP systems such as the IBM Blue Gene, Cray XT, and the SiCortex SC5832 that have too much custom content to fit the standards-based cluster definition.

FIGURE 2
Worldwide High-Performance Computing Revenue Share by Server Type, 1Q03-4Q07 (%)
(Quarterly cluster versus noncluster revenue share, 1Q03 through 4Q07.)
Source: IDC, 2008

IDC sees cluster revenue growth and market penetration continuing and pushing down into entry-level systems. "Ease-of-everything" cluster offerings designed for the technical workgroup (systems selling for under $100,000) at smaller firms and in back-office locations are expected to show particularly strong growth in 2008 and beyond. However, the rapid acceptance of HPC cluster computing systems (separate compute nodes built from standard component technologies: x86 processors, commodity motherboards, standards-based networking technology, and primarily the Linux OS) will continue to cause disruptive changes in the HPC market. Such changes, challenges, new market requirements, and buyer "pain points" also define new market opportunities. The HPC market's growing interest in data-level parallelism (DLP) acceleration technology is just such an HPC market opportunity.

The Challenges of HPC Clusters, Buyer "Pain Points," and IBM's PowerXCell 8i Solution

It is perhaps stating the obvious that the overarching elements potentially missing from a cluster system assembled à la carte from commodity hardware and software components are integration and a balanced system design. Custom-built HPC systems are balanced to suit the HPC task and integrated to simplify its completion. Because of this, custom HPC systems have generally been able to achieve higher sustained performance on individual jobs and better overall utilization rates. As clusters have scaled out to very large node counts and scaled in to "fatter" nodes with much more processing power per rack unit, the intangibles of integration and balance have been deemphasized. The price-per-peak-performance and capital cost advantages of HPC clusters have, until recently, overwhelmed their operational drawbacks:

system component imbalance and complexity, which limit sustained performance and lower overall system utilization rates. Lastly, the cluster revolution has placed cluster systems in many new environments, and their low cost has led to substantial growth in average node counts (a sixfold increase between 2004 and 2006 alone, according to IDC data). A consequence is that supplying basic operational inputs such as power, cooling, space, and support has become an important concern for HPC buyers. Table 3 summarizes these and other HPC cluster buyer "pain points" and also indirectly presents the market requirements that support buyer interest in integrated, blade-based, DLP acceleration technology of the type that IBM now offers with its new PowerXCell 8i product line and its QS22 blade in particular.

TABLE 3
HPC Cluster Buyer/User "Pain Points"

Managing HPC cluster complexity:
- System installation, monitoring, upgrades
- System administration, middleware
- User and application support

Power, cooling, and space requirements:
- Cluster price-performance drives down capital costs but drives up operating costs and resource use

Multicore, multisocket, multinode issues:
- Scheduling and programming complexity
- Memory size and bandwidth inadequacy
- Interconnect bandwidth, message rate mismatches

Server interconnect performance:
- Latency, bandwidth, message rates, collectives performance

Storage system performance, data management:
- Total storage, file size, file number
- Bandwidth, IOPs, reliability
- Data staging, archiving

Parallel application coding and scaling issues:
- Multicore, multisocket, multinode, accelerators
- Limited parallel price-performance, scaling

Third-party software costs:
- Licensing models

Extremely large-scale systems require new approaches:
- Better reliability, availability, and serviceability (RAS)

New production and operational environments require new approaches:
- New buyer requirements
- "Ease-of-everything" needs of new buyers

Source: IDC, 2008

While the QS22 (and IBM's other PowerXCell 8i based products) is presented in more detail below, it is important to note how its basic features respond to some key challenges facing today's HPC cluster buyers and users.

Blade-Based Design

First among these is the QS22's compact, integrated blade-based design. IBM and other HPC vendors with strong engineering skills have addressed the dilemma of providing integrated solutions while still using standards-based components by engineering dense, form-factor blades and their companion integrated enclosures. Blade sales are growing as a percentage of overall cluster sales. Blades and their enclosures provide vendors with the scope to engineer in value. This reduces cluster operating expenses and complexity while allowing the continued use of standards-based components that exploit volume-driven price-performance curves. The QS22, like other blade systems, reduces cluster management complexity and lowers power, cooling, and space requirements.

Fully IEEE-Compliant Double Precision

The feature of the QS22's PowerXCell 8i processor that most clearly stands out against competition from graphics processing unit (GPU) accelerators is its pipelined, fully IEEE-compliant, double-precision processing capability. The QS22 contains two tightly coupled PowerXCell 8i processors that provide 2 x (1 + 8) = 18 cores. Sixteen of these are DLP, SIMD processors. Vector and other data-parallel architectures are known to be both bandwidth and power efficient, and IBM has exploited this principle and engineered its new PowerXCell 8i double-precision processors with a surprisingly small transistor count.

Low Latency and High Bandwidth Memory Access

The PowerXCell 8i's MFC and on-chip DDR2 memory controller make its memory large in size, low in first-byte latency, and high in bandwidth. The PowerXCell 8i's MFC supports DMA and blocked or vector-like memory operations among all the cores and main memory. These features relieve cluster buyer pain in the categories of power and cooling (the QS22 delivers large numbers of FLOPS per watt), multicore and multisocket bandwidth inadequacy (both its SIMD instruction set and sustained per-processor bandwidth help here), and even in the area of parallel application scalability, where the multicore, multisocket QS22 allows more parallel work to be done per node.

Reliability and Cost Effectiveness

Other problem areas faced by cluster buyers on certain applications that the QS22 could potentially address include high application licensing costs and improved reliability, availability, and serviceability (RAS). The more efficient parallel performance afforded by a DLP processor has the potential to reduce the number of application licenses required, and the highly integrated QS22 blade with 18 cores in a single form factor reduces the operating temperature per FLOP and the number of independent parts that could fail.
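To make the MFC and SIMD behavior described above concrete, the following is a minimal, hypothetical SPU-side kernel, assuming the Cell SDK's spu_intrinsics.h and spu_mfcio.h interfaces. The block size, the passing of effective addresses through the program arguments (a real program would more typically DMA in a control block), and the scale-and-bias computation are illustrative choices, not IBM's code.

    /* Hypothetical SPU-side kernel: stream one 16 KB block of doubles from
     * main memory into the 256 KB local store via MFC DMA, scale it with
     * SIMD fused multiply-adds, and write it back. */
    #include <spu_intrinsics.h>
    #include <spu_mfcio.h>

    #define BLOCK 2048                        /* doubles per DMA block = 16 KB */
    static double buf[BLOCK] __attribute__((aligned(128)));

    int main(unsigned long long spe_id, unsigned long long ea_in,
             unsigned long long ea_out)
    {
        const unsigned int tag = 1;

        /* Pull the block from main memory (effective address) into local store. */
        mfc_get((void *)buf, ea_in, BLOCK * sizeof(double), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();            /* wait for the DMA to complete */

        /* SIMD double precision: two 64-bit results per 128-bit register. */
        vector double scale = spu_splats(2.0);
        vector double bias  = spu_splats(1.0);
        vector double *v = (vector double *)buf;
        for (int i = 0; i < BLOCK / 2; i++)
            v[i] = spu_madd(v[i], scale, bias);   /* v = v * scale + bias */

        /* Push the results back to main memory and wait for completion. */
        mfc_put((void *)buf, ea_out, BLOCK * sizeof(double), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();
        return 0;
    }

A production kernel would double buffer, issuing the next mfc_get while computing on the current block, so that the MFC's asynchronous DMA hides memory latency behind SPE computation, which is the usage pattern the text above describes.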

IBM's PowerXCell 8i and QS22 blade are not entirely HPC cluster "pain point" positive. As with many other acceleration technologies, the PowerXCell 8i and QS22 introduce an additional layer of programming complexity because the Power Processor Element (PPE) and the Synergistic Processor Element (SPE) instruction sets are not x86 based and the programming model is not single binary. This issue has not been ignored by IBM and is a focal point of its effort to mainstream PowerXCell 8i acceleration technology. IBM's PowerXCell 8i programming models, its Software Development Kit (SDK) for multicore acceleration, and its application development partnerships in both the government and commercial sectors are intended to address programmability and are considered in more detail below.

The Promise and Challenge of Accelerators

In the high-dimensional space (e.g., line width, clock speed, instruction set architecture, memory, and cache subsystem) that defines HPC processor microarchitecture, design themes have generally had a limited life span, and alternatives have always persisted on the sidelines in service to particular application classes or special-purpose requirements (e.g., custom MPP and vector architectures). Changes in HPC market economics, technological breakthroughs, or barriers governing processor design can push such alternatives to the forefront and current approaches to the side. The HPC market's sharp change of course away from vector architectures to MPP systems in the mid-1990s is one example of this, and as presented earlier, the rapid replacement of these custom HPC MPP architectures by standards-based cluster systems is another.

As the era of clock-driven, superscalar, instruction-level parallelism (ILP) processor design has waned, the HPC market has entered another period of transition. Power dissipation considerations have forced chip designers to look at alternative forms of on-chip parallelism that provide performance acceleration without requiring so much power. Both thread-level parallelism (TLP) and DLP processor designs are being explored. They have been dubbed accelerators because in many cases they augment general-purpose performance from a separate bus or because they are simply not integral to the general-purpose processor instruction set.

Today, the HPC market has multiple accelerator approaches to consider, offered in multiple implementations. In addition to IBM's PowerXCell 8i processor, which is our focus here, the accelerator category includes FPGAs, GPUs, multicore and many-core processors, vector processors, many-threaded processors, and application-specific integrated circuits (ASICs). While the variety of approaches in this category is large today, collectively they suggest an abstract or future architecture that includes many, probably simpler, mixed-type processing elements, perhaps with field-programmable features (perhaps the on-chip interconnect, if not the cores themselves), and instructions that move streams (or vectors) of data onto the chip in a single issue. The common elements of these alternatives and the great incentive to unify and simplify the parallel programming model used to drive accelerator performance have stimulated investment in parallel programming software for accelerators, including IBM's investment for the PowerXCell 8i. This growth in investment and the potential future convergence of accelerator microarchitectures suggest a future of much improved price-, power-, and productivity-performance for HPC.

IDC has been examining the accelerator category through market surveys, market forecasts, and technology analyses. With this analysis as a backdrop, the promise and challenges of accelerators are reviewed here, as are the specific concerns of potential buyers. This is provided as context within which to consider IBM's new PowerXCell 8i hardware and software product offerings.

The Promise of Accelerators

Crucial among all the factors that support the future use of accelerators in the era of HPC clusters is that today most of the alternatives are backed by volume economics. Intel's and AMD's multicore and future many-core processors obviously are. IBM's PowerXCell 8i is an HPC-specific modification of IBM's first-generation Cell Broadband Engine (Cell/B.E.) processor designed for the computer gaming market and the Sony PlayStation. GPUs have similar volume market support from the gaming industry. FPGAs are supported by volume purchases in the embedded signal processing space. Of the alternatives listed earlier, only vector processors and kernel-specific ASICs are without current volume economic support. Accelerator technologies that meet HPC's volume economic price-performance requirements have the best chance for success.

Accelerators, both TLP and DLP designs, also offer the prospect of improved memory bandwidth use and higher sustained performance: the former by hiding load latency underneath processor-ready work in other threads, and the latter by parallel pipelining of data streams from memory into the processor and back. Vector or DLP designs, such as the PowerXCell 8i, have a particular advantage for HPC workloads because of their natural data intensity.

Another advantage that accelerators with heterogeneous cores or field-programmable cores offer over general-purpose processors is workload-specific functionality. They can be designed with only those functional units and/or the precision required by a particular class of HPC applications, or even that of an individual application kernel in the case of an ASIC. Heterogeneous core chips, also called "chips with personality" (or programmability in the case of FPGAs), provide high-use functionality and eliminate the general-purpose circuitry that consumes extra space and power. The scalar and vector processors that remain part of the Cray microarchitecture are perhaps the original examples of processors with personality. The heterogeneous design of the PowerXCell 8i is another example, with a first-order division of labor and function (scalar and parallel) between the PPE and SPE cores on the chip. IDC expects that as line widths drop and as the number of cores per chip increases, the additional cores will offer an increasing variety of special-purpose functions.

Accelerators also have appeal because they can offer HPC datacenters efficiencies that deliver operational savings. Other things being equal, parallel systems, whether TLP or DLP, require less power to achieve the same level of performance and therefore run cooler and can be more densely packed. This allows fewer rack-mounted units to provide the same performance using less power and leads to operational benefits in the current regime of scaled-out clusters.

Finally, the interest in acceleration technology in all its forms has stimulated community thinking about the parallel programming abstraction and promises more universal parallel programming language concepts and compilers that can produce code for the full variety of back-end parallel acceleration microarchitectures.

IBM's investment in its SDK and its partnerships in the parallel software industry are significant efforts that take HPC in this direction. IBM also supports centers of expertise in academia (at the Barcelona Supercomputing Center, Georgia Tech, and the University of Maryland) to ensure that graduates in computer science and electrical engineering are exposed to current trends in computational science. The advantages presented earlier transfer in total to accelerators as a class, but only in part to each particular type of accelerator.

The Challenge of Accelerators

Substantial barriers remain to be overcome to mainstream acceleration technology, and Table 4 reminds the reader of these barriers. It also makes clear that while some challenges are general across the class, others apply only to specific types of accelerators. While all accelerators require extra programming effort to use, and IDC surveys place programming difficulty at the top of the list of accelerator challenges, FPGAs stand out as the most difficult to program, while single-object vector processors are perhaps the easiest. IBM's PowerXCell 8i falls somewhere in between.

Most HPC workloads require or prefer double-precision floating point, but many of the alternatives today fall short in this category. FPGAs can be programmed with full IEEE 754 double-precision floating-point units, but these units consume large numbers of transistors, limiting the maximum performance per chip. Some GPU microarchitectures support the IEEE 754 double-precision format and meet some of its functional requirements; however, GPU vendors have avoided providing full double-precision capability because of its potential effect on performance. At this time, the Cray X2 vector processor and now the PowerXCell 8i heterogeneous multicore processor are the only fully IEEE 754 double-precision floating-point compatible HPC acceleration technologies available.

Continuing to work through Table 4, we note that those acceleration technologies designed as discrete components and accessed via an external bus must manage bandwidth limitations to the card and often have less memory than is available to the general-purpose processor on the motherboard. This is typically the case with GPUs and FPGAs. The PowerXCell 8i and custom vector processors both have the advantage of being able to address a unified, board-local memory space directly.

Accelerators are typically less flexible than x86 architectures. GPUs can now handle more conditional data-parallel operations but still have weaker integer performance. As noted earlier, FPGA floating-point capability is limited by the transistor count required to build these units. Limited scalar processing power is often another issue: the scalar processors of both the Cray vector systems and the IBM PowerXCell 8i cannot match a fast x86 or POWER6 core. GPUs are known to consume a lot of power, although not necessarily per peak FLOP. The growth in use of blades limits the number of practical accelerator choices, as accelerator products have not yet generally accommodated the increased use of blades (IBM's PowerXCell 8i QS22 blade is an exception). With respect to volume-price requirements, custom ASICs and other custom accelerated processing technologies with compelling performance features still do not meet broad HPC market price requirements.

TABLE 4
Accelerator User "Pain Points"

Programming difficulties:
- More difficult to program (especially FPGAs)
- Adds another parallel programming layer
- Requires dual object compiles
- Requires algorithmic adjustments
- Programming skill shortages

Insufficient precision, reliability:
- Single precision only, non-IEEE conformant
- No ECC in bus or memory

Continued bandwidth limitations:
- Performance limited by external bus speeds
- Card-local memory size limitations
- Adds a layer to the memory hierarchy
- Poor instruction set support for memory operations

Inflexible architecture:
- Inability to handle loop conditionals or asymmetric TLP
- Lockstep parallelism/threads
- Poor scalar (or integer or floating-point) performance

Limited portability, high risk:
- Too many programming models for ISV support
- Investment in climbing the learning curve could be lost

Consume too much power:
- GPUs have high absolute power requirements

Wrong form factors:
- Need blade-ready form factors

Too expensive:
- Vector, ASIC, or too much custom content

Source: IDC, 2008

As noted earlier, the barriers to the widespread adoption of accelerator technology are significant. Some are generic to the entire category, such as programming difficulty, and others are specific to individual accelerator types. The number of alternatives available is good news for the HPC market and gives buyers with specific needs choices. Many members of the HPC community are optimistic about accelerators in the longer term. One-third of those surveyed by IDC expected that accelerators would be very useful within a two- to three-year time frame, and another third believed that they would be at least somewhat useful. To quote one individual directly:

"These barriers are largely removable. The issue is the business case. Improvements will be gated by the providers' view of the size of the market opportunity and the rate at which providers of commodity microprocessors improve their product's performance for HPC workloads."

IDC expects that as milestones on the various accelerator road maps are reached (as they have been recently with the PowerXCell 8i processor from IBM), these barriers will be lowered.

The PowerXCell 8i Lowers Accelerator Barriers

Walking backwards through the list of accelerator pain points, we can evaluate the PowerXCell 8i's features with respect to each. IBM and its partners are offering the PowerXCell 8i in a greater variety of forms and at several more price points than its predecessor. Some have been designed and priced to compete with GPU accelerator card offerings. The IBM QS22 blade improves on the QS21 blade in that it contains dual PowerXCell 8i processors and is among the first acceleration technologies available in dense blade form. The QS22 and the composite Triblade (which includes the QS22 blade) in LANL's Roadrunner system are somewhat more power efficient than their QS21 predecessor, both in an absolute sense and on a per double-precision MFLOPS basis, and compare well with the competition.

The PowerXCell 8i has programming difficulty, investment risk, and portability issues similar to those of other accelerator technologies. IBM's investments in the PowerXCell 8i programming environment to further reduce this barrier have continued since the release of the original Cell/B.E. With respect to flexibility, the PowerXCell 8i has some advantages. It offers fast integer and single- and double-precision floating-point performance, and the relative independence of its SPEs gives the PowerXCell 8i the ability to handle data-parallel conditionals as independent threads. As noted, the PowerXCell 8i adds high-speed double precision to the single-precision speed of its predecessor. Both are IEEE 754 format compatible, although single-precision operations are not fully compliant with every element of the standard. All memory and buses on the PowerXCell 8i include ECC to meet HPC reliability standards, which is not the case with some accelerator alternatives.

Finally, the PowerXCell 8i's memory bandwidth, type, and size improvements make it much better suited to HPC workloads than the original Cell/B.E. Its DDR2 memory is potentially large and directly addressable from the chip, avoiding some of the memory-related issues of bus-based accelerators. Its MFC unit extends the PowerXCell 8i's data-parallel design out to memory with its DMA and blocked memory reference capabilities. All in all, the incremental improvements of the QS22 and PowerXCell 8i validate the optimism expressed by the HPC user in the preceding quote on the prospects for accelerators in HPC. While hurdles remain to be overcome before accelerators are fully integrated in the HPC mainstream, much has been done to make the QS22 and PowerXCell 8i more HPC friendly.

IBM's New HPC Acceleration Products: The PowerXCell 8i Processor, the PowerXCell 8i PXCAB Card, and IBM's BladeCenter QS22

With the release of its PowerXCell 8i processor (65nm, SOI) and associated blades, accelerator cards, and systems, IBM offers the HPC market a range of third-generation PowerXCell 8i based products, all with features that should significantly expand Cell/B.E.'s breadth of applicability in HPC and elsewhere. Important HPC-related improvements to its microarchitecture, additional form factors and features, improvements to its software development kit, and additional system offerings contribute to the PowerXCell 8i's expanded potential in HPC. This development at IBM is part of a broader pattern of change in the HPC market that has DLP acceleration technology (both hardware and software), supported by volume economics, potentially finding a place in the HPC mainstream.

New PowerXCell 8i Processor Retooled for HPC

While the PowerXCell 8i's lineage is clearly derived from the original graphics-oriented Cell/B.E. processor, its microarchitectural differences make it a new, HPC-specific branch off of that original Cell/B.E. line, still supported by the volume economics of Sony PlayStation sales but tactically augmented for HPC. Like its predecessor, the PowerXCell 8i has one PPE and eight SIMD stream SPEs, giving the chip nine processors in all (see Figure 3). IBM's road map indicates that a PowerXCell 8i follow-on is planned for the 2010 time frame that will double the number of PPEs and quadruple its SPEs to 32 in a 45nm SOI process.

FIGURE 3
IBM's Third-Generation PowerXCell 8i Heterogeneous Multicore Processor
Source: IBM, 2008

First among the several important HPC-specific features designed into the new PowerXCell 8i is its enhanced double-precision (eDP) capability and performance. The double-precision units on earlier generation Cell/B.E. SPEs were not fully pipelined. On the PowerXCell 8i they are, and therefore each 3.2 GHz SPE delivers double-precision floating-point results seven times faster (one result per cycle) than its predecessor, at a rate of 12.8 GFLOPS (3.2 GHz x 2 64-bit floating-point words x 2 64-bit floating-point operations [fused multiply-add]). This gives the eight SPEs per chip a combined double-precision peak performance of 102.4 GFLOPS, or exactly one-half the chip's single-precision performance (~204.8 GFLOPS), since twice as many 32-bit, single-precision words (four versus two) fit in the SPE's 128-bit floating-point registers. IDC expects IBM to focus on the potential advantage in sustained performance per watt the PowerXCell 8i may have due to its single-chip architecture and unified, MFC-supported memory space. It is worth noting that this increase in double-precision performance comes without a substantial increase in transistor count, chip size, or thermal design power (TDP), which is listed at 92 watts for the PowerXCell 8i.

Like the double-precision functional units in the first- and second-generation Cell/B.E. processors, the new double-precision functional units are fully IEEE 754 compatible in both format and function. The PowerXCell 8i's high-speed, single-precision floating-point units (designed more for graphics than for HPC applications) remain less than fully IEEE 754 floating-point compliant in function. Fully compliant single-precision results can, however, be generated by truncating double-precision runs, but these complete at double-precision rates, which are half the native "graphics" single-precision rate. Double-precision floating-point capability is now also available from other acceleration technologies, but typically without full IEEE compliance. This feature of the PowerXCell 8i is one of several that distinguish it from other accelerators.

Equal in importance to the PowerXCell 8i's eDP capability is its redesigned on-chip memory controller, which addresses a larger, more standard DDR2-based memory subsystem. The previous-generation Cell/B.E. processor is based on a Rambus XDR memory architecture, which is bandwidth rich but limited in per-board memory capacity to values that are substantially lower than typical HPC applications require. The PowerXCell 8i is designed to preserve the memory bandwidth of the older Cell/B.E. (25.6 GBps per chip, or 0.25 bytes per double-precision FLOP) while offering greater memory capacity. A consequence is that the dual, 128-bit (plus parity) memory buses of the new DDR2 memory controller increase the pin count of the PowerXCell 8i processor package, making it pin incompatible with older-generation Cell/B.E. processors. The result is that the PowerXCell 8i supports four DIMM slots and up to 16 GB of memory (more with future higher-density DIMMs) compared with the Cell/B.E.'s maximum of 1 GB. In addition, the PowerXCell 8i memory and memory bus subsystems are fully error corrected.

Most of the remaining features of the PowerXCell 8i microarchitecture match those of the earlier Cell/B.E. version of the chip, but we remind the reader of the 256 KB local store associated with each SPE. This is a DMA-enabled, memory-mapped local memory with none of the transistor-demanding features of a full-blown cache; with the help of each SPE's MFC, the PowerXCell 8i SPEs can asynchronously pipeline data between it, main memory, and other SPE local stores. The local store's size, 16- and 128-byte blocked loads, and large outstanding memory reference queue are key features in the PowerXCell 8i's bandwidth profile.
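The per-SPE and per-chip peak figures quoted above follow directly from the clock rate and the SIMD register width; a worked version of that arithmetic, using only numbers given in the text:

\[
\begin{aligned}
\text{Per SPE (double precision):}\quad & 3.2\ \text{GHz} \times 2\ \text{words per register} \times 2\ \text{FLOPs (fused multiply-add)} = 12.8\ \text{GFLOPS} \\
\text{Per chip, 8 SPEs (double precision):}\quad & 8 \times 12.8\ \text{GFLOPS} = 102.4\ \text{GFLOPS} \\
\text{Per chip, 8 SPEs (single precision):}\quad & 8 \times (3.2 \times 4 \times 2)\ \text{GFLOPS} = 204.8\ \text{GFLOPS} \\
\text{Memory bandwidth per DP FLOP:}\quad & 25.6\ \text{GBps} \div 102.4\ \text{GFLOPS} = 0.25\ \text{bytes per FLOP}
\end{aligned}
\]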

PowerXCell 8i in a PCIe Card Form Factor: IBM's PXCAB Card

Positioned and priced to compete with GPUs offered in standard PCIe form factors, IBM's PXCAB card is a double-wide, PCIe 16x card offered with custom packaging and labeling to OEMs for use in rack-mounted units that might also accept GPU accelerator cards from NVIDIA or ATI. The PXCAB card includes one PowerXCell 8i processor, up to 8 GB of DDR2 memory on card, and two 1 Gigabit Ethernet ports. It functions more as a standalone component than a typical GPU accelerator. It runs the Linux operating system and communicates with the board's general-purpose processor via the PCIe bus using Ethernet emulation. This compact card retains the same advantages as IBM's other PowerXCell 8i products, including a large, directly addressable, error-corrected memory; good double-precision performance per watt; and support for the components in IBM's SDK.

IBM's QS22 Brings PowerXCell 8i Performance to the Cluster

IBM's BladeCenter QS22 uses the same form factor as the older QS21 and the other blade-based offerings from IBM (see Figure 4). IBM's BladeCenter H chassis accepts 14 of the QS22 blades (or QS21 or other IBM blades), and sites with QS21 blades can add or upgrade to the QS22. The QS22 is a full-height blade and includes two 3.2 GHz PowerXCell 8i processors coherently connected with IBM's BIF interface; up to 16 GB of DDR2 memory per processor; two BladeCenter, midplane-facilitated Gigabit Ethernet ports; room for an InfiniBand adapter, a SAS adapter, and I/O buffer memory; and support for IBM's SDK. Peak single-precision performance per blade is 460 GFLOPS and peak double-precision performance per blade is 217 GFLOPS (in both cases from two chips, each with one PPE and eight SPEs). This works out to 3.04 TFLOPS per chassis for double precision and 6.44 TFLOPS per chassis for single precision, with correspondingly higher totals per rack. Linpack performance per QS22 blade has been measured at around 170 double-precision GFLOPS, which is about 80% of peak performance per blade.
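The blade and chassis peaks quoted above are consistent with two PowerXCell 8i chips per blade plus a modest contribution from each PPE. The per-PPE figures used below (roughly 6.4 GFLOPS double precision and 25.6 GFLOPS single precision) are an assumption for illustration, not numbers stated in this paper:

\[
\begin{aligned}
\text{Per chip (DP):}\quad & 102.4\ \text{(8 SPEs)} + 6.4\ \text{(PPE, assumed)} = 108.8\ \text{GFLOPS} \\
\text{Per blade, 2 chips:}\quad & 2 \times 108.8 \approx 217\ \text{GFLOPS DP};\qquad 2 \times (204.8 + 25.6) \approx 460\ \text{GFLOPS SP} \\
\text{Per chassis, 14 blades:}\quad & 14 \times 217.6 \approx 3.04\ \text{TFLOPS DP};\qquad 14 \times 460.8 \approx 6.44\ \text{TFLOPS SP} \\
\text{Linpack efficiency per blade:}\quad & 170 / 217 \approx 0.78,\ \text{i.e., the roughly 80\% of peak cited above}
\end{aligned}
\]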

FIGURE 4
IBM's QS22 Blade
Source: IBM, 2008

An examination of the QS22's power efficiency shows that a single blade consumes about 250 watts while running Linpack. A complete QS22 cluster running Linpack has been measured at 488 MFLOPS per watt. This heterogeneous multicore, data-parallel SIMD processor has very good MFLOPS-per-watt specs when compared with most general-purpose microprocessors, which typically have measured values under 300 MFLOPS per watt. GPU power efficiency is generally quoted with respect to the power consumed only by the card, and GPUs come out somewhat ahead of the QS22 when this is done; however, when the power consumed by the board supporting the GPU card is included, the results are much closer to equal. The deciding factor for efficiency for a particular application will be the sustained performance observed. IBM believes that the QS22's directly addressable memory with 2 x 25.6 GBps bandwidth, MFC-supported DMA engines, full IEEE compatibility, and coherent interchip interface will give it a double-precision, sustained-performance advantage over its competitors.

Like the QS21's processors, the QS22's PowerXCell 8i processors function as standalone, multicore processors, two to a board and coherently linked in a manner not dissimilar to a dual-socket Opteron board linked by HyperTransport. The Linux OS runs independently on the PPE core of each processor and manages the use of its eight SPEs. In this sense, a BladeCenter H enclosure fitted with QS22 blades is not a bus-accelerated cluster like those that add GPUs to a standard x86-based cluster system, but a cluster of tightly coupled heterogeneous, multicore, cc-NUMA PowerXCell 8i based nodes.
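Relating the blade-level numbers above to the cluster-level measurement is straightforward arithmetic; the reading that the gap reflects the chassis, network, and other system power drawn by a complete cluster is an inference, not a figure from the paper:

\[
\frac{170{,}000\ \text{MFLOPS (sustained Linpack per blade)}}{250\ \text{W per blade}} \approx 680\ \text{MFLOPS/W at the blade level, versus the 488 MFLOPS/W measured for a full QS22 cluster.}
\]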

For scalar work, the performance of the PowerXCell 8i's PPE core does not equal the performance of the latest Intel or AMD x86 scalar cores. Yet the QS22's tightly coupled architecture and large mixed-core count promise better sustained performance than bus-accelerated cluster systems on certain HPC applications. While the QS22 offers a cc-NUMA, dual-socket architecture with 18 cores, the Triblade in the Roadrunner system IBM built for LANL has a bus-based design similar to that of GPU-based accelerators.

Roadrunner, IBM's HPC Hybrid System for LANL: A Milestone in Design and Performance

The announcement by IBM on June 10, 2008, that the PowerXCell 8i based supercomputer (LANL's Roadrunner) it had assembled at its Poughkeepsie, New York, facility had become the first computer to run the industry's standard Linpack benchmark at a sustained petaflop was an HPC milestone. While newswire attention has focused on reaching the petaflop goal (a quadrillion double-precision floating-point operations per second), from IDC's perspective, the milestone is really defined by several other important features of this achievement.

The Meaning of the Petaflop Milestone

The first is that a system based on components that are standards based and largely volume priced is now at the top of HPC's TOP500 list for the first time. The components include AMD Opteron dual-core processors, 4x DDR 20 Gbps InfiniBand interconnect, DDR2 memory, and the PowerXCell 8i heterogeneous, multicore, DLP acceleration engine. Quibbling about whether the PowerXCell 8i is standards based is acceptable (its Triblade is a custom enclosure), but its presence on the scene is clearly driven by volume economic trends and early investment by Sony, Toshiba, and IBM in a processing engine designed not for HPC but for game consoles, in this case the Sony PlayStation. As one might expect of such a high-end system, its standards-based components are custom integrated, but architecturally, it is an InfiniBand-switched cluster with acceleration technology supported by volume economics.

The acceleration technology is the second important feature of the announcement. The fastest computer in the world is now accelerator based, and the acceleration technology has not just augmented the performance of its general-purpose microprocessors. It is the primary engine behind Roadrunner's sustained Linpack petaflop. The system's PowerXCell 8i processors offer 1,332 TFLOPS compared with only about 50 TFLOPS from the dual-core Opterons. It is also noteworthy that acceleration technology did not merely put the system into the top 10 or 20 places of the TOP500 list, but rather put it at the very top.

Finally, LANL's Roadrunner and IBM's other PowerXCell 8i based products bring HPC and its highest-performing system back to its data-parallel roots. Linpack, a benchmark with significant cache-reuse potential, runs at 78% efficiency on Roadrunner, which has no L2 cache and only a modestly sized, user-programmed, 256 KB local memory. This is a reminder of how well vector and data-parallel microarchitectures suit typically data-intensive HPC workloads (and also perhaps that the Cray-2 had a similarly sized local memory). The PowerXCell 8i's simplified, in-order data-parallel design also offers the added benefit of a reduced transistor count and therefore lower power consumption per FLOP.

The PowerXCell 8i has only 250 million transistors on its 65nm die. Intel's quad-core Harpertown has 410 million; AMD's quad-core Barcelona has 463 million; and NVIDIA's Tesla GPU has 681 million. On the Linpack benchmark, Roadrunner achieves about 437 MFLOPS per watt even while carrying the power consumed by the AMD Opteron part of the system's Triblade (as we saw earlier, the QS22 blade alone is still more power efficient, at 488 MFLOPS per watt). This makes Roadrunner over 65 MFLOPS per watt more efficient than even IBM's Blue Gene/P system, which was at the top of the Green500 list in February 2008, and almost 200 MFLOPS per watt more efficient than the best unaccelerated, x86-based cluster system. When taken together, the features of Roadrunner discussed here and of IBM's other PowerXCell 8i based products send a powerful message: HPC clusters designed around standards-based components, but in custom enclosures that use DLP acceleration technology augmented by blocked memory reference hardware (the PowerXCell 8i's MFC), can provide both industry-leading double-precision performance and power efficiency. IDC expects that the HPC community is paying close attention to this message.

Roadrunner's Design Elements

Unlike the QS22 cluster described earlier, LANL's Roadrunner features a bus-based architecture that places the PowerXCell 8i blades under the control of a dual-socket Opteron node, accessible through a HyperTransport-to-PCI Express (HT-to-PCIe) bridge bus. This design is the basis for IBM's Triblade, which fits three to a standard IBM BladeCenter H chassis. The four-slot Triblade (see Figure 5) is currently available only as part of IBM's QS22/LS21-based Roadrunner system at LANL. It includes two QS22 accelerator blades, each with dual-socket, 3.2 GHz PowerXCell 8i boards and four slots for their own directly controlled, DDR2 board-local memory. The QS22 blades are connected to a single dual-socket, dual-core 1.8 GHz Opteron-based master node-blade called the LS21. They are connected through an HT-to-PCIe bridge-blade, which is sandwiched between them and gives the Triblade its quad-blade appearance. The two 2 x 8x PCIe-to-16x HT links carry the traffic from each QS22 PowerXCell 8i socket to the LS21. Both the LS21 and QS22 have four DIMM slots per socket. Currently, Roadrunner is configured with 8 GB of memory per Opteron socket and 4 GB per PowerXCell 8i socket, for (8 x 2) + (4 x 4) = 32 GB of memory in total per Triblade and 80 TB for the entire system. Memory within each board (LS21 and QS22) is cc-NUMA integrated. From a programming perspective, the dual-processor Opteron LS21 functions as the programmable node for MPI message passing, while the two QS22 boards are programmed at a lower level using one of the components of IBM's SDK.

FIGURE 5
Roadrunner's Custom Integrated Triblade
Source: IBM, 2008

Stepping back and looking at the larger design features, we see that LANL's Roadrunner combines 180 of these Triblade nodes and 12 I/O blades with a 288-port DDR InfiniBand switch into "connected units" (CUs). There are 17 CUs in the entire system, giving it 3,060 Triblade nodes for computation; a total of 6,120 dual-core Opteron chips (50 TFLOPS peak); and 12,240 PowerXCell 8i chips (1.33 PFLOPS peak). There is one Opteron core for each PowerXCell 8i chip. When its I/O and management nodes are included, Roadrunner contains 130,464 computational cores. Roadrunner uses a two-tier fat tree topology supported by standard 288-port, 20 Gbps DDR InfiniBand switches and network adapters. There is full bisection bandwidth within each CU and half bisection bandwidth among the CUs. All of Roadrunner's interconnect cables are optical, and its bisection bandwidth is uniformly 3.5 TBps. Its 216 I/O nodes support an aggregate bandwidth of 432 GB per second to LANL's 2-plus petabyte high-performance global file system from Panasas. A schematic diagram of Roadrunner's two-tiered DDR InfiniBand-based fat tree and its interconnected CUs is presented in Figure 6.
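A worked tally of the node, chip, and core counts given above; how the remaining cores are split between I/O and management nodes is not stated in the paper:

\[
\begin{aligned}
17\ \text{CUs} \times 180\ \text{Triblades} &= 3{,}060\ \text{compute nodes} \\
3{,}060 \times 2\ \text{Opteron chips} &= 6{,}120\ \text{dual-core Opterons};\qquad 3{,}060 \times 4\ \text{Cell chips} = 12{,}240\ \text{PowerXCell 8i chips} \\
\text{Compute cores} &= 6{,}120 \times 2 + 12{,}240 \times (1 + 8) = 12{,}240 + 110{,}160 = 122{,}400
\end{aligned}
\]

The cited total of 130,464 cores adds the cores in the system's I/O and management nodes to this compute-node tally.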

FIGURE 6
Roadrunner's Two-Tiered DDR InfiniBand Fat Tree
(Connected Unit (CU) cluster: 180 compute nodes with Cells and 12 I/O nodes with PCIe-attached Cell blades per CU; 17 cluster units and 3,060 compute nodes in all; 12,240 Cell eDP chips and 6,120 dual-core Opterons; 288-port IB 4x DDR switches, with 12 links per CU to each of eight second-stage 288-port IB 4x DDR switches; 296 racks; 3.9 MW.)
Source: LANL, 2008

While our attention naturally turns to the details of Roadrunner's design as presented earlier and to its sheer scale (it consumes 2.3 MW of power, has over 130,000 cores, weighs 500,000 pounds, and will take 21 trucks to deliver), the applications that will be run on it and the new science they make possible should be our focus. As noted earlier, HPC users feel that the primary barrier to generating new science on accelerators, including IBM's PowerXCell 8i product line, will be their programmability. IBM, Los Alamos National Laboratory, RapidMind, Gedae, and others are investing heavily in the development of the PowerXCell 8i's programming environment.

Investing in PowerXCell 8i Programmability

The major challenge in improving the performance, productivity, and portability (the three Ps of HPC programming) of today's HPC applications is the ubiquity, variety, and growth of parallelism in HPC system architectures. At the high end, government labs are adapting or rewriting their key applications to take maximum advantage of ultraparallel HPC systems that now contain as many as 100,000 independent computational cores. At the low end, smaller businesses and corporate departments engaged in computational science and engineering can no longer count on clock-period performance improvements and must adopt and improve the parallel performance of their applications on multicore, multisocket servers and clusters to achieve the productivity that will keep them competitive.

Investment by government, business, and venture capital firms in technologies to improve the three Ps of HPC application programming has grown to respond to this challenge. Latency and bandwidth limitations, the multitiered memory hierarchy, synchronization bottlenecks, load balancing challenges, and recovery from failure are among the many factors that make today's ultraparallel programming problem a very difficult one, even when every processing element runs the same instruction set.

With all their potential benefits, the heterogeneous or hybrid, multi-instruction set computing models that come along with most HPC acceleration technologies make this problem only more challenging. As a heterogeneous chip multiprocessor (CMP), IBM's QS22 PowerXCell 8i hybrid architecture is positioned between the bus- or network-divided, dual-ISA approach of GPU acceleration and the vector-scalar, functional-unit-integrated, single-ISA approach of Cray vector accelerators (and probably future designs from Intel and AMD). The PowerXCell 8i's designers adopted the view that accelerators (and their instruction sets) should be allowed to evolve independently from their supporting general-purpose scalar cores but should be placed on the same chip and tightly coupled through a high-bandwidth interconnect and memory management system. This stems from IBM's long-term view that data-parallel acceleration is just an initial phase in a process in which accelerator cores will become more workload specific.

The PowerXCell 8i's parallel model is based on SIMD threads that are spawned from the general-purpose PPE onto the SPEs. These threads are supported with data delivered asynchronously by each SPE's DMA-enabled MFC unit, which is capable of as many as 128 outstanding 128-byte blocked, simultaneous memory references. IBM's on-chip division of labor separates the requirement of compiling and running threads for the distinct, general-purpose core from that for the accelerated cores, while offering very high-bandwidth intercore communication and a shared memory space to connect them. As implemented, this approach has several advantages:

- This thread-based model is readily supported in the Linux kernel, and it gives IBM the flexibility to present a software-integrated programming environment while developing its acceleration and general-purpose hardware independently.

- It pushes the functions of the on-chip interconnect to center stage (especially in its support of distributed DMAs and thread synchronization) and has forced IBM to think hard about on-chip interconnect requirements that will need to be addressed in HPC's probable many-core future. The PowerXCell 8i's division-of-labor design also maps naturally to the variety of parallel abstraction and programming models being developed within IBM and by external partners. This has already stimulated the development of multiple programming model alternatives for the PowerXCell 8i and should ease the burden of porting PowerXCell 8i programs to other accelerated platforms.

- While it is more difficult to single-source compile to two distinct processors and instruction sets mediated by an interconnect, IBM's intention to produce such a compiler is visible in its working prototype and is supported by successes already achieved in other contexts, such as Partitioned Global Address Space (PGAS) compiler development, in which locality and parallel extensions have been added to standard programming languages and subroutine libraries act as a mediation layer.

IBM and its PowerXCell 8i software development partners are working along these lines to lower the accelerated, parallel programming barrier that IDC has noted is the primary difficulty limiting the adoption of acceleration technology in HPC.
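As a minimal sketch of the thread model described above, the following hypothetical PPE-side launcher creates one Linux thread per SPE context using the SDK's libspe2 interface. The embedded SPU program handle simd_kernel is a placeholder, and a real application would pass each SPE a pointer to a control block describing its share of the data rather than a null argument.

    /* Hypothetical PPE-side launcher: one Linux pthread per SPE context,
     * following the PPE-spawns-SPE-threads model described above.
     * Build against IBM's SDK with -lspe2 -lpthread. */
    #include <libspe2.h>
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_SPES 8

    extern spe_program_handle_t simd_kernel;   /* embedded SPU binary (placeholder) */

    static void *run_spe(void *arg)
    {
        spe_context_ptr_t ctx = spe_context_create(0, NULL);
        unsigned int entry = SPE_DEFAULT_ENTRY;

        spe_program_load(ctx, &simd_kernel);
        /* Blocks until the SPU program exits; arg would normally point at a
         * control block in main memory that the SPE fetches via MFC DMA. */
        spe_context_run(ctx, &entry, 0, arg, NULL, NULL);
        spe_context_destroy(ctx);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_SPES];

        for (int i = 0; i < NUM_SPES; i++)
            pthread_create(&threads[i], NULL, run_spe, NULL);
        for (int i = 0; i < NUM_SPES; i++)
            pthread_join(threads[i], NULL);

        puts("all SPE threads finished");
        return 0;
    }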


More information

Experts in Application Acceleration Synective Labs AB

Experts in Application Acceleration Synective Labs AB Experts in Application Acceleration 1 2009 Synective Labs AB Magnus Peterson Synective Labs Synective Labs quick facts Expert company within software acceleration Based in Sweden with offices in Gothenburg

More information

BlueGene/L. Computer Science, University of Warwick. Source: IBM

BlueGene/L. Computer Science, University of Warwick. Source: IBM BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours

More information

Performance of the AMD Opteron LS21 for IBM BladeCenter

Performance of the AMD Opteron LS21 for IBM BladeCenter August 26 Performance Analysis Performance of the AMD Opteron LS21 for IBM BladeCenter Douglas M. Pase and Matthew A. Eckl IBM Systems and Technology Group Page 2 Abstract In this paper we examine the

More information

HPC and Accelerators. Ken Rozendal Chief Architect, IBM Linux Technology Cener. November, 2008

HPC and Accelerators. Ken Rozendal Chief Architect, IBM Linux Technology Cener. November, 2008 HPC and Accelerators Ken Rozendal Chief Architect, Linux Technology Cener November, 2008 All statements regarding future directions and intent are subject to change or withdrawal without notice and represent

More information

HPC Architectures. Types of resource currently in use

HPC Architectures. Types of resource currently in use HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Cray XD1 Supercomputer Release 1.3 CRAY XD1 DATASHEET

Cray XD1 Supercomputer Release 1.3 CRAY XD1 DATASHEET CRAY XD1 DATASHEET Cray XD1 Supercomputer Release 1.3 Purpose-built for HPC delivers exceptional application performance Affordable power designed for a broad range of HPC workloads and budgets Linux,

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

The Use of Cloud Computing Resources in an HPC Environment

The Use of Cloud Computing Resources in an HPC Environment The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes

More information

CellSs Making it easier to program the Cell Broadband Engine processor

CellSs Making it easier to program the Cell Broadband Engine processor Perez, Bellens, Badia, and Labarta CellSs Making it easier to program the Cell Broadband Engine processor Presented by: Mujahed Eleyat Outline Motivation Architecture of the cell processor Challenges of

More information

All About the Cell Processor

All About the Cell Processor All About the Cell H. Peter Hofstee, Ph. D. IBM Systems and Technology Group SCEI/Sony Toshiba IBM Design Center Austin, Texas Acknowledgements Cell is the result of a deep partnership between SCEI/Sony,

More information

The Stampede is Coming: A New Petascale Resource for the Open Science Community

The Stampede is Coming: A New Petascale Resource for the Open Science Community The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation

More information

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola 1. Microprocessor Architectures 1.1 Intel 1.2 Motorola 1.1 Intel The Early Intel Microprocessors The first microprocessor to appear in the market was the Intel 4004, a 4-bit data bus device. This device

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP INTRODUCTION or With the exponential increase in computational power of todays hardware, the complexity of the problem

More information

New Approach to Unstructured Data

New Approach to Unstructured Data Innovations in All-Flash Storage Deliver a New Approach to Unstructured Data Table of Contents Developing a new approach to unstructured data...2 Designing a new storage architecture...2 Understanding

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

Six-Core AMD Opteron Processor

Six-Core AMD Opteron Processor What s you should know about the Six-Core AMD Opteron Processor (Codenamed Istanbul ) Six-Core AMD Opteron Processor Versatility Six-Core Opteron processors offer an optimal mix of performance, energy

More information

Customer Success Story Los Alamos National Laboratory

Customer Success Story Los Alamos National Laboratory Customer Success Story Los Alamos National Laboratory Panasas High Performance Storage Powers the First Petaflop Supercomputer at Los Alamos National Laboratory Case Study June 2010 Highlights First Petaflop

More information

Paving the Road to Exascale

Paving the Road to Exascale Paving the Road to Exascale Gilad Shainer August 2015, MVAPICH User Group (MUG) Meeting The Ever Growing Demand for Performance Performance Terascale Petascale Exascale 1 st Roadrunner 2000 2005 2010 2015

More information

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications September 2013 Navigating between ever-higher performance targets and strict limits

More information

The Cray Rainier System: Integrated Scalar/Vector Computing

The Cray Rainier System: Integrated Scalar/Vector Computing THE SUPERCOMPUTER COMPANY The Cray Rainier System: Integrated Scalar/Vector Computing Per Nyberg 11 th ECMWF Workshop on HPC in Meteorology Topics Current Product Overview Cray Technology Strengths Rainier

More information

Intel Core i7 Processor

Intel Core i7 Processor Intel Core i7 Processor Vishwas Raja 1, Mr. Danish Ather 2 BSc (Hons.) C.S., CCSIT, TMU, Moradabad 1 Assistant Professor, CCSIT, TMU, Moradabad 2 1 vishwasraja007@gmail.com 2 danishather@gmail.com Abstract--The

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected

More information

Global Headquarters: 5 Speen Street Framingham, MA USA P F

Global Headquarters: 5 Speen Street Framingham, MA USA P F Global Headquarters: 5 Speen Street Framingham, MA 01701 USA P.508.872.8200 F.508.935.4015 www.idc.com WHITE PAPER A New Strategic Approach To HPC: IBM's Blue Gene Sponsored by: IBM Christopher G. Willard,

More information

The Mont-Blanc approach towards Exascale

The Mont-Blanc approach towards Exascale http://www.montblanc-project.eu The Mont-Blanc approach towards Exascale Alex Ramirez Barcelona Supercomputing Center Disclaimer: Not only I speak for myself... All references to unavailable products are

More information

How to Write Fast Code , spring th Lecture, Mar. 31 st

How to Write Fast Code , spring th Lecture, Mar. 31 st How to Write Fast Code 18-645, spring 2008 20 th Lecture, Mar. 31 st Instructor: Markus Püschel TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay (Fred) Introduction Parallelism: definition Carrying

More information

COSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors

COSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors COSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors Edgar Gabriel Fall 2018 References Intel Larrabee: [1] L. Seiler, D. Carmean, E.

More information

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems 1 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ 10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems To enhance system performance and, in some cases, to increase

More information

Fundamentals of Quantitative Design and Analysis

Fundamentals of Quantitative Design and Analysis Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature

More information

The Future of Computing: AMD Vision

The Future of Computing: AMD Vision The Future of Computing: AMD Vision Tommy Toles AMD Business Development Executive thomas.toles@amd.com 512-327-5389 Agenda Celebrating Momentum Years of Leadership & Innovation Current Opportunity To

More information

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. Cluster Networks Introduction Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. As usual, the driver is performance

More information

Lowering Cost per Bit With 40G ATCA

Lowering Cost per Bit With 40G ATCA White Paper Lowering Cost per Bit With 40G ATCA Prepared by Simon Stanley Analyst at Large, Heavy Reading www.heavyreading.com sponsored by www.radisys.com June 2012 Executive Summary Expanding network

More information

HPC Technology Trends

HPC Technology Trends HPC Technology Trends High Performance Embedded Computing Conference September 18, 2007 David S Scott, Ph.D. Petascale Product Line Architect Digital Enterprise Group Risk Factors Today s s presentations

More information

Who says world-class high performance computing (HPC) should be reserved for large research centers? The Cray CX1 supercomputer makes HPC performance

Who says world-class high performance computing (HPC) should be reserved for large research centers? The Cray CX1 supercomputer makes HPC performance Who says world-class high performance computing (HPC) should be reserved for large research centers? The Cray CX1 supercomputer makes HPC performance available to everyone, combining the power of a high

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Complexity and Advanced Algorithms. Introduction to Parallel Algorithms

Complexity and Advanced Algorithms. Introduction to Parallel Algorithms Complexity and Advanced Algorithms Introduction to Parallel Algorithms Why Parallel Computing? Save time, resources, memory,... Who is using it? Academia Industry Government Individuals? Two practical

More information

Cisco Unified Computing System Delivering on Cisco's Unified Computing Vision

Cisco Unified Computing System Delivering on Cisco's Unified Computing Vision Cisco Unified Computing System Delivering on Cisco's Unified Computing Vision At-A-Glance Unified Computing Realized Today, IT organizations assemble their data center environments from individual components.

More information

Gen-Z Memory-Driven Computing

Gen-Z Memory-Driven Computing Gen-Z Memory-Driven Computing Our vision for the future of computing Patrick Demichel Distinguished Technologist Explosive growth of data More Data Need answers FAST! Value of Analyzed Data 2005 0.1ZB

More information

Exascale: Parallelism gone wild!

Exascale: Parallelism gone wild! IPDPS TCPP meeting, April 2010 Exascale: Parallelism gone wild! Craig Stunkel, Outline Why are we talking about Exascale? Why will it be fundamentally different? How will we attack the challenges? In particular,

More information

Computer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley

Computer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley Computer Systems Architecture I CSE 560M Lecture 19 Prof. Patrick Crowley Plan for Today Announcement No lecture next Wednesday (Thanksgiving holiday) Take Home Final Exam Available Dec 7 Due via email

More information

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved. Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE

More information

Top 4 considerations for choosing a converged infrastructure for private clouds

Top 4 considerations for choosing a converged infrastructure for private clouds Top 4 considerations for choosing a converged infrastructure for private clouds Organizations are increasingly turning to private clouds to improve efficiencies, lower costs, enhance agility and address

More information

Building supercomputers from embedded technologies

Building supercomputers from embedded technologies http://www.montblanc-project.eu Building supercomputers from embedded technologies Alex Ramirez Barcelona Supercomputing Center Technical Coordinator This project and the research leading to these results

More information

White Paper. Technical Advances in the SGI. UV Architecture

White Paper. Technical Advances in the SGI. UV Architecture White Paper Technical Advances in the SGI UV Architecture TABLE OF CONTENTS 1. Introduction 1 2. The SGI UV Architecture 2 2.1. SGI UV Compute Blade 3 2.1.1. UV_Hub ASIC Functionality 4 2.1.1.1. Global

More information

The Stampede is Coming Welcome to Stampede Introductory Training. Dan Stanzione Texas Advanced Computing Center

The Stampede is Coming Welcome to Stampede Introductory Training. Dan Stanzione Texas Advanced Computing Center The Stampede is Coming Welcome to Stampede Introductory Training Dan Stanzione Texas Advanced Computing Center dan@tacc.utexas.edu Thanks for Coming! Stampede is an exciting new system of incredible power.

More information

HP ProLiant BladeSystem Gen9 vs Gen8 and G7 Server Blades on Data Warehouse Workloads

HP ProLiant BladeSystem Gen9 vs Gen8 and G7 Server Blades on Data Warehouse Workloads HP ProLiant BladeSystem Gen9 vs Gen8 and G7 Server Blades on Data Warehouse Workloads Gen9 server blades give more performance per dollar for your investment. Executive Summary Information Technology (IT)

More information

Performance of Variant Memory Configurations for Cray XT Systems

Performance of Variant Memory Configurations for Cray XT Systems Performance of Variant Memory Configurations for Cray XT Systems Wayne Joubert, Oak Ridge National Laboratory ABSTRACT: In late 29 NICS will upgrade its 832 socket Cray XT from Barcelona (4 cores/socket)

More information

AMD EPYC Empowers Single-Socket Servers

AMD EPYC Empowers Single-Socket Servers Whitepaper Sponsored by AMD May 16, 2017 This paper examines AMD EPYC, AMD s upcoming server system-on-chip (SoC). Many IT customers purchase dual-socket (2S) servers to acquire more I/O or memory capacity

More information

Intel High-Performance Computing. Technologies for Engineering

Intel High-Performance Computing. Technologies for Engineering 6. LS-DYNA Anwenderforum, Frankenthal 2007 Keynote-Vorträge II Intel High-Performance Computing Technologies for Engineering H. Cornelius Intel GmbH A - II - 29 Keynote-Vorträge II 6. LS-DYNA Anwenderforum,

More information

IBM Virtual Fabric Architecture

IBM Virtual Fabric Architecture IBM Virtual Fabric Architecture Seppo Kemivirta Product Manager Finland IBM System x & BladeCenter 2007 IBM Corporation Five Years of Durable Infrastructure Foundation for Success BladeCenter Announced

More information

Parallelism and Concurrency. COS 326 David Walker Princeton University

Parallelism and Concurrency. COS 326 David Walker Princeton University Parallelism and Concurrency COS 326 David Walker Princeton University Parallelism What is it? Today's technology trends. How can we take advantage of it? Why is it so much harder to program? Some preliminary

More information

SAS Enterprise Miner Performance on IBM System p 570. Jan, Hsian-Fen Tsao Brian Porter Harry Seifert. IBM Corporation

SAS Enterprise Miner Performance on IBM System p 570. Jan, Hsian-Fen Tsao Brian Porter Harry Seifert. IBM Corporation SAS Enterprise Miner Performance on IBM System p 570 Jan, 2008 Hsian-Fen Tsao Brian Porter Harry Seifert IBM Corporation Copyright IBM Corporation, 2008. All Rights Reserved. TABLE OF CONTENTS ABSTRACT...3

More information

Using Graphics Chips for General Purpose Computation

Using Graphics Chips for General Purpose Computation White Paper Using Graphics Chips for General Purpose Computation Document Version 0.1 May 12, 2010 442 Northlake Blvd. Altamonte Springs, FL 32701 (407) 262-7100 TABLE OF CONTENTS 1. INTRODUCTION....1

More information

GPUs and Emerging Architectures

GPUs and Emerging Architectures GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs

More information

Outline Marquette University

Outline Marquette University COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations

More information

Fundamentals of Computer Design

Fundamentals of Computer Design Fundamentals of Computer Design Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University

More information

W H I T E P A P E R I B M S y s t e m X 4 : D e l i v e r i n g High Value Through Scale Up

W H I T E P A P E R I B M S y s t e m X 4 : D e l i v e r i n g High Value Through Scale Up W H I T E P A P E R I B M S y s t e m X 4 : D e l i v e r i n g High Value Through Scale Up Sponsored by: IBM Kenneth Cayton January 2008 Jed Scaramella EXECUTIVE SUMMARY Global Headquarters: 5 Speen Street

More information

2008 International ANSYS Conference

2008 International ANSYS Conference 2008 International ANSYS Conference Maximizing Productivity With InfiniBand-Based Clusters Gilad Shainer Director of Technical Marketing Mellanox Technologies 2008 ANSYS, Inc. All rights reserved. 1 ANSYS,

More information

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA STATE OF THE ART 2012 18,688 Tesla K20X GPUs 27 PetaFLOPS FLAGSHIP SCIENTIFIC APPLICATIONS

More information

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically

More information

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications

More information

Introduction: PURPOSE BUILT HARDWARE. ARISTA WHITE PAPER HPC Deployment Scenarios

Introduction: PURPOSE BUILT HARDWARE. ARISTA WHITE PAPER HPC Deployment Scenarios HPC Deployment Scenarios Introduction: Private and public High Performance Computing systems are continually increasing in size, density, power requirements, storage, and performance. As these systems

More information

High Performance Computing in Europe and USA: A Comparison

High Performance Computing in Europe and USA: A Comparison High Performance Computing in Europe and USA: A Comparison Erich Strohmaier 1 and Hans W. Meuer 2 1 NERSC, Lawrence Berkeley National Laboratory, USA 2 University of Mannheim, Germany 1 Introduction In

More information

Enhancing Analysis-Based Design with Quad-Core Intel Xeon Processor-Based Workstations

Enhancing Analysis-Based Design with Quad-Core Intel Xeon Processor-Based Workstations Performance Brief Quad-Core Workstation Enhancing Analysis-Based Design with Quad-Core Intel Xeon Processor-Based Workstations With eight cores and up to 80 GFLOPS of peak performance at your fingertips,

More information

Global Headquarters: 5 Speen Street Framingham, MA USA P F

Global Headquarters: 5 Speen Street Framingham, MA USA P F WHITE PAPER SSDs: The Other Primary Storage Alternative Sponsored by: Samsung Jeff Janukowicz January 2008 Dave Reinsel IN THIS WHITE PAPER Global Headquarters: 5 Speen Street Framingham, MA 01701 USA

More information

Fujitsu s Approach to Application Centric Petascale Computing

Fujitsu s Approach to Application Centric Petascale Computing Fujitsu s Approach to Application Centric Petascale Computing 2 nd Nov. 2010 Motoi Okuda Fujitsu Ltd. Agenda Japanese Next-Generation Supercomputer, K Computer Project Overview Design Targets System Overview

More information

It s Time to Move Your Critical Data to SSDs Introduction

It s Time to Move Your Critical Data to SSDs Introduction It s Time to Move Your Critical Data to SSDs Introduction by the Northamber Storage Specialist Today s IT professionals are well aware that users expect fast, reliable access to ever-growing amounts of

More information

Broadcast-Quality, High-Density HEVC Encoding with AMD EPYC Processors

Broadcast-Quality, High-Density HEVC Encoding with AMD EPYC Processors Solution Brief December, 2018 2018 Broadcast-Quality, High-Density HEVC Encoding with AMD EPYC Processors HIGHLIGHTS o The AMD EPYC SoC brings a new balance to the datacenter. Utilizing an x86-architecture,

More information

Text Messaging Helps Your Small Business Perform Big

Text Messaging Helps Your Small Business Perform Big White Paper Text Messaging Helps Your Small Business Perform Big Sponsored by: AT&T Denise Lund August 2017 IN THIS WHITE PAPER This white paper introduces small businesses to the benefits of communicating

More information

Best Practices for Setting BIOS Parameters for Performance

Best Practices for Setting BIOS Parameters for Performance White Paper Best Practices for Setting BIOS Parameters for Performance Cisco UCS E5-based M3 Servers May 2013 2014 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public. Page

More information

Cell Processor and Playstation 3

Cell Processor and Playstation 3 Cell Processor and Playstation 3 Guillem Borrell i Nogueras February 24, 2009 Cell systems Bad news More bad news Good news Q&A IBM Blades QS21 Cell BE based. 8 SPE 460 Gflops Float 20 GFLops Double QS22

More information

InfiniBand Strengthens Leadership as The High-Speed Interconnect Of Choice

InfiniBand Strengthens Leadership as The High-Speed Interconnect Of Choice InfiniBand Strengthens Leadership as The High-Speed Interconnect Of Choice Providing the Best Return on Investment by Delivering the Highest System Efficiency and Utilization Top500 Supercomputers June

More information

Preparing GPU-Accelerated Applications for the Summit Supercomputer

Preparing GPU-Accelerated Applications for the Summit Supercomputer Preparing GPU-Accelerated Applications for the Summit Supercomputer Fernanda Foertter HPC User Assistance Group Training Lead foertterfs@ornl.gov This research used resources of the Oak Ridge Leadership

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology

More information

IBM System x3455 AMD Opteron SMP 1 U server features Xcelerated Memory Technology to meet the needs of HPC environments

IBM System x3455 AMD Opteron SMP 1 U server features Xcelerated Memory Technology to meet the needs of HPC environments IBM Europe Announcement ZG07-0492, dated July 17, 2007 IBM System x3455 AMD Opteron SMP 1 U server features Xcelerated Memory Technology to meet the needs of HPC environments Key prerequisites...2 Description...3

More information

Technology challenges and trends over the next decade (A look through a 2030 crystal ball) Al Gara Intel Fellow & Chief HPC System Architect

Technology challenges and trends over the next decade (A look through a 2030 crystal ball) Al Gara Intel Fellow & Chief HPC System Architect Technology challenges and trends over the next decade (A look through a 2030 crystal ball) Al Gara Intel Fellow & Chief HPC System Architect Today s Focus Areas For Discussion Will look at various technologies

More information

Fundamentals of Computers Design

Fundamentals of Computers Design Computer Architecture J. Daniel Garcia Computer Architecture Group. Universidad Carlos III de Madrid Last update: September 8, 2014 Computer Architecture ARCOS Group. 1/45 Introduction 1 Introduction 2

More information