HPCS HPCchallenge Benchmark Suite


HPCS HPCchallenge Benchmark Suite. David Koester, The MITRE Corporation. Email Address: dkoester@mitre.org. Jack Dongarra and Piotr Luszczek, ICL/UT. Email Addresses: {dongarra, luszczek}@cs.utk.edu. Abstract: The Defense Advanced Research Projects Agency (DARPA) High Productivity Computing Systems (HPCS) HPCchallenge Benchmarks examine the performance of High Performance Computing (HPC) architectures using kernels with more challenging memory access patterns than just the High Performance LINPACK (HPL) benchmark used in the Top500 list. The benchmarks build on the HPL framework and augment the Top500 list by providing benchmarks that bound the performance of many real applications as a function of memory access locality characteristics. The real utility of the HPCchallenge benchmarks is that architectures can be described with a wider range of metrics than just the Flop/s of HPL. Even a small percentage of random memory accesses in a real application can significantly reduce its overall performance on architectures not designed to minimize or hide memory latency. The suite therefore includes a new metric, Giga UPdates per Second (GUPS), and a new benchmark, RandomAccess, to measure the ability of an architecture to access memory randomly, i.e., with no locality. When looking only at HPL performance and the Top500 List, inexpensive build-your-own clusters appear to be much more cost effective than more sophisticated HPC architectures; the HPCchallenge Benchmarks provide users with additional information to justify policy and purchasing decisions. In the presentation and paper, we will compare the measured HPCchallenge Benchmark performance on various HPC architectures, from Cray X1s to Beowulf clusters.
Additional information on the HPCchallenge Benchmarks can be found at http://icl.cs.utk.edu/hpcc/

Introduction

At SC2003 in Phoenix (15-21 November 2003), Jack Dongarra (ICL/UT) announced the release of a new benchmark suite, the HPCchallenge Benchmarks, which examines the performance of HPC architectures using kernels with more challenging memory access patterns than the High Performance Linpack (HPL) benchmark used in the Top500 list. The benchmarks are designed to complement the Top500 list and to provide benchmarks that bound the performance of many real applications as a function of memory access characteristics, e.g., spatial and temporal locality. Development of the suite is funded by the Defense Advanced Research Projects Agency (DARPA) High Productivity Computing Systems (HPCS) Program.

The HPCchallenge Benchmark Kernels

Local:
- DGEMM (matrix x matrix multiply)
- STREAM (COPY, SCALE, ADD, TRIAD)
- RandomAccess
- 1D FFT

Global:
- High Performance LINPACK (HPL)
- PTRANS (parallel matrix transpose)
- (MPI)RandomAccess
- 1D FFT
- b_eff (effective bandwidth benchmark)

Report Documentation Page (Standard Form 298 (Rev. 8-98), prescribed by ANSI Std Z39-18; Form Approved OMB No. 0704-0188)
1. REPORT DATE: 01 FEB 2005
2. REPORT TYPE: N/A
3. DATES COVERED: -
4. TITLE AND SUBTITLE: HPCS HPCchallenge Benchmark Suite
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): The MITRE Corporation
12. DISTRIBUTION/AVAILABILITY STATEMENT: Approved for public release, distribution unlimited
13. SUPPLEMENTARY NOTES: See also ADM00001742, HPEC-7 Volume 1, Proceedings of the Eighth Annual High Performance Embedded Computing (HPEC) Workshops, 28-30 September 2004, Volume 1. The original document contains color images.
16. SECURITY CLASSIFICATION OF report, abstract, and this page: unclassified
17. LIMITATION OF ABSTRACT: UU
18. NUMBER OF PAGES: 28

Additional information on the HPCchallenge Benchmarks can be found at http://icl.cs.utk.edu/hpcc/.

Flop/s

The Flop/s metric from HPL has been the de facto standard for comparing High Performance Computers for many years. HPL works well on all architectures, even cache-based, distributed memory multiprocessors, but the measured performance may not be representative of a wide range of real user applications, such as the adaptive multi-physics simulations used in weapons and vehicle design, weather and climate models, and defense applications. HPL is more compute friendly than these applications because it has more extensive memory reuse in its Level 3 BLAS-based calculations.

Memory Performance

There is a need for benchmarks that test memory performance. When looking only at HPL performance and the Top500 List, inexpensive build-your-own clusters appear to be much more cost effective than more sophisticated HPC architectures, yet HPL has high spatial and temporal locality characteristics shared by few real user applications. The HPCchallenge benchmarks provide users with additional information to justify policy and purchasing decisions. Not only does the Japanese Earth Simulator outperform the top American systems on the HPL benchmark (Tflop/s); the differences in bandwidth performance on John McCalpin's STREAM TRIAD benchmark (Level 1 BLAS) show an even greater disparity. The Earth Simulator outperforms the ASCI Q by a factor of 4.64 on HPL, while the higher bandwidth memory and interconnect systems of the Earth Simulator are clearly evident as it outperforms ASCI Q by a factor of 36.25 on STREAM TRIAD.
In the presentation and paper, we will compare the measured HPCchallenge Benchmark performance on various HPC architectures, from Cray X1s to Beowulf clusters, using the updated results at http://icl.cs.utk.edu/hpcc/hpcc_results.cgi.

Even a small percentage of random memory accesses in a real application can significantly affect its overall performance on architectures not designed to minimize or hide memory latency. Memory latency has not kept up with Moore's Law: Moore's Law hypothesizes a 60% compound growth rate per year for microprocessor performance, while memory latency has been improving at a compound rate of only 7% per year. As a result, the memory-processor performance gap has been growing at a rate of over 50% per year since 1980.

The HPCchallenge suite includes a new metric, Giga UPdates per Second (GUPS), and a new benchmark, RandomAccess, to measure the ability of an architecture to access memory randomly, i.e., with no locality. GUPS is calculated by identifying the number of memory locations that can be randomly updated in one second, divided by 1 billion (1e9). "Randomly" means that there is little relationship between one address to be updated and the next, except that they occur within a space of 1/2 the total system memory. An update is a read-modify-write operation on a table of 64-bit words: an address is generated, the value at that address is read from memory, modified by an integer operation (add, and, or, xor) with a literal value, and the new value is written back to memory.
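The read-modify-write loop just described can be sketched in a few lines of Python. This is an illustrative toy only (the function name, table size, and update count are invented for the example); the real RandomAccess benchmark generates its address stream with a pseudo-random polynomial and uses a table occupying half of system memory.

```python
import random
import time

def random_access_sketch(log2_table_size=20, updates=1 << 20):
    """Toy RandomAccess loop: each update is a read-modify-write (here XOR)
    of a 64-bit table word at an address chosen with no locality."""
    mask = (1 << log2_table_size) - 1
    table = list(range(1 << log2_table_size))   # stand-in for the 64-bit table
    rng = random.Random(1)
    start = time.perf_counter()
    for _ in range(updates):
        a = rng.getrandbits(64)   # generated address/data value
        table[a & mask] ^= a      # read, modify by XOR, write back
    elapsed = time.perf_counter() - start
    return updates / elapsed / 1e9   # GUPS = updates per second / 1e9
```

On a commodity machine this toy loop reaches only a tiny fraction of a GUPS, which is exactly the point of the metric: random updates expose memory latency that Flop/s numbers hide.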

HPCS HPCchallenge Benchmark Suite. David Koester, Ph.D. (MITRE), Jack Dongarra (UTK), Piotr Luszczek (ICL/UT). 28 September 2004. Slide-1

Outline Brief DARPA HPCS Overview Architecture/Application Characterization Preliminary Results Summary Slide-2

High Productivity Computing Systems

Create a new generation of economically viable computing systems and a procurement methodology for the security/industrial community (2007-2010).

Impact:
- Performance (time-to-solution): speed up critical national security applications by a factor of 10X to 40X
- Programmability (idea-to-first-solution): reduce cost and time of developing application solutions
- Portability (transparency): insulate research and operational application software from the system
- Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, and programming errors

HPCS Program Focus Areas. Applications:
- Intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling, and biotechnology

Fill the critical technology and capability gap: from today (late 80's HPC technology) to the future (Quantum/Bio Computing). Slide-3

High Productivity Computing Systems: Program Overview

Create a new generation of economically viable computing systems and a procurement methodology for the security/industrial community (2007-2010).

Program phases: Phase 1, Concept Study; Phase 2 (2003-2005), Advanced Design & Prototypes, with a Technology Assessment Review at the program's half-way point and a New Evaluation Framework from the Productivity Team; Phase 3 (2006-2010), Full Scale Development by the vendors, leading to Petascale/s systems, a Validated Procurement Evaluation Methodology, and a Test Evaluation Framework. Slide-4

HPCS Program Goals

HPCS overall productivity goals:
- Execution (sustained performance): 1 Petaflop/sec (scalable to greater than 4 Petaflop/sec). Reference: Production workflow.
- Development: 10X over today's systems. Reference: Lone researcher and Enterprise workflows.

Productivity Framework:
- Baselined for today's systems
- Successfully used to evaluate the vendors' emerging productivity techniques
- Provides a solid reference for evaluation of vendors' proposed Phase III designs

Subsystem Performance Indicators:
1) 2+ PF/s LINPACK
2) 6.5 PB/sec data STREAM bandwidth
3) 3.2 PB/sec bisection bandwidth
4) 64,000 GUPS

Bob Graybill (DARPA/IPTO) (emphasis added). Slide-5

Outline Brief DARPA HPCS Overview Architecture/Application Characterization Preliminary Results Summary Slide-6

Processor-Memory Performance Gap

[Figure: log-scale performance vs. year, 1980-2000. CPU performance ("Moore's Law") grows at 60%/yr while DRAM performance grows at only 7%/yr, so the processor-memory performance gap grows about 50% per year.]

A full cache miss on an Alpha 21264 costs 180 ns / 1.7 ns = 108 clocks x 4, or 432 instructions executed. The caches in the Pentium Pro take 64% of the area and 88% of the transistors. (Taken from a Patterson-Keeton talk to SigMod.) Slide-7
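The roughly 50%-per-year gap quoted on this slide follows directly from compounding the two growth rates. A quick arithmetic check (rates taken from the slide; function names are ours):

```python
def gap_growth_per_year(cpu=0.60, dram=0.07):
    """Annual growth of the processor-DRAM performance gap implied by
    60%/yr CPU improvement against 7%/yr DRAM improvement."""
    return (1 + cpu) / (1 + dram) - 1   # (1.60 / 1.07) - 1 ~= 0.495

def gap_after(years, cpu=0.60, dram=0.07):
    """Cumulative processor-DRAM gap after the given number of years."""
    return ((1 + cpu) / (1 + dram)) ** years
```

Compounded over a decade, a ~49.5%/yr gap means processors pull ahead of memory by more than a factor of 50, which is why latency-tolerant architectures matter so much for low-locality workloads.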

Processing vs. Memory Access

Doesn't cache solve this problem? It depends: with small amounts of contiguous data, usually; with large amounts of non-contiguous data, usually not. In most computers the programmer has no control over the cache. Often a few Bytes/FLOP is considered OK. However, consider operations on the transpose of a matrix (e.g., for adjoint problems): Xa = b versus X^T a = b. If X is big enough, 100% cache misses are guaranteed, and we need at least 8 Bytes/FLOP (assuming a and b can be held in cache). Latency and limited bandwidth of processor-memory and node-node communications are major limiters of performance for scientific computation. Slide-8
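The transpose problem above comes down to access stride. A small sketch (illustrative only; row-major storage assumed) shows why walking X by columns, as X^T a effectively does, defeats the cache:

```python
def access_offsets(n, by_columns=False):
    """Element offsets touched when streaming through an n x n row-major
    matrix. Row order (as in X a = b) gives stride 1, which is cache
    friendly; column order (as in X^T a = b) gives stride n, so for
    large n every access lands on a fresh cache line."""
    if by_columns:
        return [i * n + j for j in range(n) for i in range(n)]  # stride n
    return [i * n + j for i in range(n) for j in range(n)]      # stride 1
```

Both orders touch exactly the same set of elements; only the order, and hence the cache behavior, differs.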

Processing vs. Memory Access: High Performance LINPACK

Consider another benchmark, Linpack: solve the linear equation A x = b for the vector x, where A is a known matrix and b is a known vector. Linpack uses the BLAS routines, which divide A into blocks. On average, Linpack requires 1 memory reference for every 2 FLOPs, or 4 Bytes/FLOP, and many of these can be cache references. Slide-9
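The 4 Bytes/FLOP figure is what blocking buys. A back-of-the-envelope sketch (our own illustration, not from the slide; 64-bit words assumed, with per-block traffic counted as loading A, B, and C and storing C):

```python
def blas3_bytes_per_flop(b, word_bytes=8):
    """Main-memory bytes moved per FLOP for one b x b blocked update
    C += A @ B: about 4*b^2 words of traffic against 2*b^3 FLOPs,
    i.e. 16/b bytes per FLOP with 64-bit words."""
    return (4 * b * b * word_bytes) / (2 * b ** 3)
```

Even a tiny block size b = 4 already reaches the 4 Bytes/FLOP on the slide, and larger blocks push the main-memory traffic lower still, which is why HPL is so cache friendly.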

Processing vs. Memory Access: STREAM TRIAD

Consider the simple benchmark STREAM TRIAD: a(i) = b(i) + q * c(i), where a(i), b(i), and c(i) are vectors and q is a scalar. The vector length is chosen to be much longer than the cache size. Each iteration includes 2 memory loads and 1 memory store against 2 FLOPs, i.e., 12 Bytes/FLOP (assuming 64-bit precision). No computer has enough memory bandwidth to reference 12 Bytes for each FLOP! Slide-10
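A minimal TRIAD sketch, together with the Bytes/FLOP arithmetic from the slide (64-bit words assumed; function names are ours):

```python
def stream_triad(b, c, q):
    """STREAM TRIAD kernel: a(i) = b(i) + q * c(i)."""
    return [bi + q * ci for bi, ci in zip(b, c)]

def triad_bytes_per_flop(word_bytes=8):
    """2 loads + 1 store per element against 2 FLOPs (one multiply,
    one add): 3 * 8 / 2 = 12 Bytes/FLOP with 64-bit words."""
    return 3 * word_bytes / 2
```

Because every element is touched exactly once, caches cannot help, and measured TRIAD rates reflect sustainable memory bandwidth rather than peak Flop/s.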

Processing vs. Memory Access: RandomAccess

A table T of 2^n 64-bit words is updated by a data-driven stream {A_i} of length 2^(n+2). Each 64-bit value a_i defines both the address and the data: the index is the highest n bits, k = a_i<63, 64-n>, and the update is a bit-level exclusive-or, T[k] = T[k] XOR a_i. The expected number of accesses per memory location is E[T[k]] = 2^(n+2) / 2^n = 4. Because XOR is commutative and associative, updates may be processed in any order. Acceptable error: 1%. Look-ahead and storage: 1024 per node. Slide-11
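The expected-accesses formula E[T[k]] = 2^(n+2) / 2^n = 4 can be checked by simulating the address stream. A sketch (Python's generator stands in for the benchmark's pseudo-random polynomial; the function name is ours):

```python
import random

def mean_accesses_per_location(n, seed=0):
    """Run 2^(n+2) updates against a 2^n-entry table, indexing by the
    highest n bits of each 64-bit value, and return the mean number of
    accesses per table location."""
    counts = [0] * (1 << n)
    rng = random.Random(seed)
    for _ in range(1 << (n + 2)):
        a = rng.getrandbits(64)
        counts[a >> (64 - n)] += 1   # k = highest n bits of a_i
    return sum(counts) / len(counts)   # exactly 2^(n+2) / 2^n = 4
```

The mean is exactly 4 by construction (total updates divided by table size); what varies randomly is the per-location distribution around that mean.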

Bounding Mission Partner Applications

[Figure: spatial-vs-temporal locality quadrant chart. HPL sits at high spatial and high temporal locality; PTRANS and STREAM at high spatial but low temporal locality; FFT at high temporal but low spatial locality; RandomAccess at low spatial and low temporal locality. Mission partner applications fall in the interior of the chart, and the HPCS productivity design points target the low-locality corner.]

Slide-12

Outline Brief DARPA HPCS Overview Architecture/Application Characterization Preliminary Results Summary Slide-13

HPCS HPCchallenge Benchmarks

Being developed by Jack Dongarra (ICL/UT) and funded by the DARPA High Productivity Computing Systems (HPCS) program (Bob Graybill, DARPA/IPTO), to examine the performance of High Performance Computer (HPC) architectures using kernels with more challenging memory access patterns than High Performance Linpack (HPL). Slide-14

HPCchallenge Goals

To examine the performance of HPC architectures using kernels with more challenging memory access patterns than HPL. HPL works well on all architectures, even cache-based, distributed memory multiprocessors, due to:
1. Extensive memory reuse
2. Scalability with respect to the amount of computation
3. Scalability with respect to the communication volume
4. Extensive optimization of the software

To complement the Top500 list. To provide benchmarks that bound the performance of many real applications as a function of memory access characteristics, e.g., spatial and temporal locality. Slide-15

Local:
- DGEMM (matrix x matrix multiply)
- STREAM (COPY, SCALE, ADD, TRIAD)
- EP-RandomAccess
- 1D FFT

Global:
- High Performance LINPACK (HPL)
- PTRANS (parallel matrix transpose)
- G-RandomAccess
- 1D FFT
- b_eff (interprocessor bandwidth and latency)

HPCchallenge pushes spatial and temporal boundaries and sets performance bounds. Available for download at http://icl.cs.utk.edu/hpcc/ Slide-16

Web Site: http://icl.cs.utk.edu/hpcc/ (Home, Rules, News, Download, FAQ, Links, Collaborators, Sponsors, Upload, Results) Slide-17

Outline Brief DARPA HPCS Overview Architecture/Application Characterization Preliminary Results Summary Slide-18

Preliminary Results: Machine List (1 of 2)

Affiliation | System | Processor Type | Procs
U Tenn | Atipa Conquest cluster | AMD Opteron | 128
AHPCRC | Cray X1 | Cray X1 MSP | 124
AHPCRC | Cray X1 | Cray X1 MSP | 124
AHPCRC | Cray X1 | Cray X1 MSP | 124
ERDC | Cray X1 | Cray X1 MSP | 60
ERDC | Cray X1 | Cray X1 MSP | 60
ORNL | Cray X1 | Cray X1 MSP | 252
ORNL | Cray X1 | Cray X1 MSP | 252
AHPCRC | Cray X1 | Cray X1 MSP | 120
ORNL | Cray X1 | Cray X1 MSP | 64
AHPCRC | Cray T3E | Alpha 21164 | 1024
ORNL | HP Integrity zx6000 | Intel Itanium 2 | 128
PSC | HP AlphaServer SC45 | Alpha 21264B | 128
ERDC | HP AlphaServer SC45 | Alpha 21264B | 484

Slide-19

Preliminary Results: Machine List (2 of 2)

Affiliation | System | Processor Type | Procs
IBM | eServer pSeries 655 | IBM Power4+ | 64
IBM | eServer pSeries 655 | IBM Power4+ | 128
IBM | eServer pSeries 655 | IBM Power4+ | 256
NAVO | IBM p690 | IBM Power4 | 504
ARL | IBM RS/6000 SP | IBM Power3 | 512
ORNL | IBM p690 | IBM Power4 | 256
ORNL | IBM p690 | IBM Power4 | 64
ARL | Linux Networx Powell cluster | Intel Xeon | 256
U Manchester | SGI Altix 3700 | Intel Itanium 2 | 32
ORNL | SGI Altix | Intel Itanium 2 | 128
U Tenn | SGI Altix | Intel Itanium 2 | 32
U Tenn | SGI Altix | Intel Itanium 2 | 32
U Tenn | SGI Altix | Intel Itanium 2 | 32
U Tenn | SGI Altix | Intel Itanium 2 | 32
NASA ASC | SGI Origin 3900 | SGI MIPS R16000 | 256
U Aachen/RWTH | Sun Fire 15K/6800 SMP-Cluster | Sun UltraSparc III | 128
OSC | Voltaire Pinnacle 2X200 Cluster | Intel Xeon | 128

Slide-20

STREAM TRIAD vs HPL, 120-128 Processors

[Bar chart: EP-STREAM TRIAD (TFlop/s) and HPL (TFlop/s), 0 to 2.5 TFlop/s, for: Cray X1 124 procs (three runs), Cray X1 120 procs, SunFire 15K 128 procs, SGI Altix Itanium 2 128 procs, IBM 655 Power4+ 128 procs, HP AlphaServer SC45 128 procs, HP zx6000 Itanium 2 128 procs, Atipa Cluster AMD 128 procs, and Voltaire Cluster Xeon 128 procs. STREAM TRIAD: a(i) = b(i) + q * c(i); HPL: A x = b.]

Slide-21

STREAM TRIAD vs HPL, >=252 Processors

[Bar chart: EP-STREAM TRIAD (TFlop/s) and HPL (TFlop/s), 0 to 2.5 TFlop/s, for: SGI Origin 3900 R16K 256 procs, Linux Networx Xeon 256 procs, IBM p690 Power4 256 procs, IBM 655 Power4+ 256 procs, Cray X1 252 procs (two runs), Cray T3E 1024 procs, IBM SP Power3 512 procs, IBM p690 Power4 504 procs, and HP AlphaServer SC45 484 procs. STREAM TRIAD: a(i) = b(i) + q * c(i); HPL: A x = b.]

Slide-22

STREAM ADD vs PTRANS, 60-128 Processors

[Bar chart, log scale 0.1 to 10,000 GB/s: PTRANS (GB/s) and EP-STREAM ADD (GB/s) for: Voltaire Cluster Xeon 128 procs, SunFire 15K 128 procs, SGI Altix Itanium 2 128 procs, IBM 655 Power4+ 128 procs, HP AlphaServer SC45 128 procs, HP zx6000 Itanium 2 128 procs, Atipa Cluster AMD 128 procs, Cray X1 124 procs (three runs), Cray X1 120 procs, IBM p690 Power4 64 procs, IBM 655 Power4+ 64 procs, Cray X1 64 procs, and Cray X1 60 procs (two runs). STREAM ADD: a(i) = b(i) + c(i); PTRANS: A = A + B^T.]

Slide-23

STREAM ADD vs PTRANS, >=252 Processors

[Bar chart, log scale 0.1 to 10,000 GB/s: PTRANS (GB/s) and EP-STREAM ADD (GB/s) for: SGI Origin 3900 R16K 256 procs, Linux Networx Xeon 256 procs, IBM p690 Power4 256 procs, IBM 655 Power4+ 256 procs, Cray X1 252 procs (two runs), Cray T3E 1024 procs, IBM SP Power3 512 procs, IBM p690 Power4 504 procs, and HP AlphaServer SC45 484 procs. STREAM ADD: a(i) = b(i) + c(i); PTRANS: A = A + B^T.]

Slide-24

Outline Brief DARPA HPCS Overview Architecture/Application Characterization Preliminary Results Summary Slide-25

Summary

DARPA HPCS Subsystem Performance Indicators:
- 2+ PF/s LINPACK
- 6.5 PB/sec data STREAM bandwidth
- 3.2 PB/sec bisection bandwidth
- 64,000 GUPS

It is important to understand architecture/application characterization: where did all the lost Moore's Law performance go? Visit http://icl.cs.utk.edu/hpcc/, peruse the results, and contribute! [Figure: the spatial/temporal locality quadrant chart from Slide-12, showing HPL, PTRANS, STREAM, FFT, RandomAccess, the HPCS productivity design points, and mission partner applications.] Slide-26