PetaFlop+ Supercomputing. Eric Kronstadt IBM TJ Watson Research Center Yorktown Heights, NY IBM Corporation



1 PetaFlop+ Supercomputing Eric Kronstadt IBM TJ Watson Research Center Yorktown Heights, NY

2 Multiple PetaFlops - Why should one care?
President's Information Technology Advisory Committee (PITAC) report to President Bush on Computational Science, June 2005:
Computational science is now indispensable to the solution of complex problems in every sector, from traditional science and engineering domains to such key areas as national security, public health, and economic innovation. Computational science has become the third pillar of the scientific enterprise, a peer alongside theory and physical experiment.

3 Petascale Computing: The Billion Dollar Race
- Japan: effort to regain the number one position
  - PetaFlop in 2008
  - 3-4 PetaFlops in 2010 (official target)
  - 10 PF (unofficial target)
  - 100s of PF in [year missing in source]
- US:
  - DARPA HPCS program: develop a multi-PetaFlops, highly productive system in 2010-2011 (Cray, IBM, Sun)
  - NSF: announced a program to procure a >PF sustained system in 2011
  - Large DOE labs are looking for something before 2010
- Net: there will be supercomputers with peak performance of 1 PFlop before the end of the decade, and 10s-100s of PFlops by the second half of the next decade

4 The Tyranny of Large Numbers
- Each extra dollar per GigaFlop/s costs $1,000,000
- Each extra Watt per GigaFlop/s costs 1 Megawatt, or $1M in annual electricity costs
- An additional cubic inch per GigaFlop/s costs 580 cubic feet
  - 82 sq ft of floor space = ~45 TFlops of BlueGene/L
- Every penny per megabyte in DRAM memory cost will cost $5,000,000-$10,000,000, assuming a balanced system
  - DRAM prices today range from $[...] to $0.15
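The figures on this slide are consistent with a machine of 1 PetaFlop/s, that is, one million GigaFlop/s. The slide does not state that scale, so it is an assumption here. The following is a minimal Python sketch of the arithmetic. All function names are illustrative, the electricity price of about $0.114 per kWh is an assumption chosen only so that 1 MW comes to roughly $1M per year, and the 0.5-1 PB memory size is inferred from the slide's $5M-$10M range.

```python
# Sketch of the slide's arithmetic, assuming a 1 PetaFlop/s machine.
# Names are illustrative; the electricity price is an assumption.

GFLOPS_PER_PFLOPS = 1_000_000          # 1 PetaFlop/s = 10^6 GigaFlop/s
HOURS_PER_YEAR = 24 * 365
CUBIC_INCHES_PER_CUBIC_FOOT = 12 ** 3  # 1728


def extra_dollars(dollars_per_gflops, pflops=1.0):
    """System cost added by a per-GigaFlop/s cost increase."""
    return dollars_per_gflops * pflops * GFLOPS_PER_PFLOPS


def extra_megawatts(watts_per_gflops, pflops=1.0):
    """System power added by a per-GigaFlop/s power increase."""
    return watts_per_gflops * pflops * GFLOPS_PER_PFLOPS / 1e6


def annual_electricity_dollars(megawatts, dollars_per_kwh=0.114):
    """Yearly electricity bill for a constant load (assumed price)."""
    return megawatts * 1000 * HOURS_PER_YEAR * dollars_per_kwh


def extra_cubic_feet(cubic_inches_per_gflops, pflops=1.0):
    """System volume added by a per-GigaFlop/s volume increase."""
    return (cubic_inches_per_gflops * pflops * GFLOPS_PER_PFLOPS
            / CUBIC_INCHES_PER_CUBIC_FOOT)


def dram_dollars_per_cent_per_mb(memory_petabytes):
    """Cost of one extra cent per megabyte, with 1 PB = 10^9 MB."""
    return memory_petabytes * 1e9 * 0.01


if __name__ == "__main__":
    print(extra_dollars(1.0))
    print(extra_megawatts(1.0), annual_electricity_dollars(1.0))
    print(extra_cubic_feet(1.0))
    print(dram_dollars_per_cent_per_mb(0.5), dram_dollars_per_cent_per_mb(1.0))
```

Under these assumptions the volume figure comes out near 579 cubic feet, which the slide rounds to 580.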

5 The Tyranny of Large Numbers: itsy bitsy performance losses
- As you increase the number of processing elements from N to N+1, performance should increase by (N+1)/N
- Suppose at each step the best you could achieve was [factor missing in source] * (N+1)/N performance improvement
- Then you'd have to stop designing after you had added the 100,000 processing elements [second figure garbled in source: ",000 Processing elements"]
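The loss factor on this slide did not survive extraction, so the sketch below uses a hypothetical per-step factor f = 0.99999, chosen only because it reproduces the 100,000-element limit the slide mentions. If each added element yields f * (N+1)/N instead of (N+1)/N, then relative performance is P(N) = N * f^(N-1), and adding an element stops helping once f * (N+1)/N drops to 1, which happens at N = f / (1 - f).

```python
# Sketch of the "itsy bitsy losses" argument.
# The factor f is an assumption; the slide's actual value is missing.

def relative_performance(n, f):
    """Performance of n elements relative to one element, when every
    added element delivers f * (k+1)/k instead of the ideal (k+1)/k."""
    return n * f ** (n - 1)


def scaling_limit(f):
    """Element count beyond which adding one more no longer helps.

    Growth continues while f * (n+1)/n > 1, i.e. while n < f / (1 - f).
    """
    if not 0.0 < f < 1.0:
        raise ValueError("f must lie strictly between 0 and 1")
    return f / (1.0 - f)


if __name__ == "__main__":
    f = 0.99999  # hypothetical loss factor
    print(scaling_limit(f))
    for n in (50_000, 100_000, 200_000):
        print(n, relative_performance(n, f))
```

With this assumed factor, the best achievable speedup at the limit is only about 37% of the ideal 100,000x, since f^(N-1) is close to 1/e there.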

6 Tyranny of Large Numbers, continued: Reliability
- If hardware fails on any node roughly once every 5 years:
  - a node failure would be expected on a 10,000-20,000 node system every 2-5 hours
  - a node failure would be expected on a 100,000 node system every 27 minutes
- If software (or anything else) fails on any node (independently) once in a month:
  - a node failure would be expected on a 10,000-20,000 node system every 2-5 minutes
  - a node failure would be expected on a 100,000 node system every 27 seconds
- Real solutions are likely to require anywhere from 100,000 to 1,000,000 compute nodes
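The reliability figures follow from dividing the per-node mean time between failures by the node count. That division assumes independent failures at a constant rate, which is an assumption of this sketch rather than something the slide spells out. A minimal Python version, with illustrative names and a 30-day month:

```python
# Sketch: expected time between failures anywhere in the system,
# assuming independent node failures at a constant rate.

HOURS_PER_YEAR = 24 * 365
HOURS_PER_MONTH = 24 * 30  # 30-day month, an assumption


def system_mtbf_hours(node_mtbf_hours, nodes):
    """Mean time between failures for the whole system, in hours."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return node_mtbf_hours / nodes


if __name__ == "__main__":
    hardware = 5 * HOURS_PER_YEAR
    for nodes in (10_000, 20_000, 100_000):
        hours = system_mtbf_hours(hardware, nodes)
        print(f"hardware, {nodes} nodes: {hours:.2f} h ({hours * 60:.1f} min)")
    for nodes in (10_000, 20_000, 100_000):
        hours = system_mtbf_hours(HOURS_PER_MONTH, nodes)
        print(f"software, {nodes} nodes: {hours * 60:.2f} min ({hours * 3600:.1f} s)")
```

Under these assumptions the 100,000-node results come out at roughly 26 minutes and 26 seconds, close to the slide's rounded 27.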

7 Semiconductor technology issues: We're down to atoms
[Figure: field effect transistor (gate, source, drain) with 1.2 nm oxynitride; thick gate oxide vs. scaled gate oxide]
- Consider the gate oxide in a CMOS transistor (the smallest dimension today)
- Assume only 1-atom-high defects on each surrounding silicon layer
- For a modern scaled oxide, 6 atoms thick, 33% variability is induced (2 atoms out of 6)
- Result: single-atom defects can cause local current leakage [...]x higher than average

8 Power is THE problem
- Single-thread focus has resulted in power-inefficient design
[Figure label: Steam Iron 5 W/cm2?]

9 Power vs Performance Trade Offs
[Figure: relative power (y axis) vs. relative performance (x axis)]

10 How scaling helped us in the past
[Figure: two plots of relative power vs. relative performance]

11 Power vs Performance Trade Offs
[Figure: relative power vs. relative performance]

12 Observation 1
Although the shape of the curve and the relative positioning of the limiting lines are different, the same reasoning seems to apply if you replace the y axis with relative cost.
[Figure: relative cost vs. relative performance]

13 BlueGene/L System Buildup
- Chip: 2 processors; 5.6 GF/s; 4 MB
- Compute Card: 2 chips, 1x2x1; 11.2 GF/s; [...] GB
- Node Card: 32 chips, 4x4x2; 16 compute cards, 0-2 IO cards; 180 GF/s; [...] GB
- Rack: 32 node cards; 5.6 TF/s; [...] GB
- System: 64 racks, 64x32x32; 360 TF/s; 32-64 TB
- Rankings: 3 in Top10 (#1 and #2); 9 in Top20; 15 in Top50; 25 overall in Top[...]
- [...]x less Watts/Flop
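The packaging hierarchy can be checked by multiplying the per-chip peak up through each level. The sketch below does that in Python, taking the slide's 5.6 GF/s per chip and its chip, card, and rack counts as inputs. The level names and the function are illustrative, and the computed rack and system peaks (about 5.7 TF/s and 367 TF/s in decimal units) come out slightly above the slide's 5.6 TF/s and 360 TF/s, which appear to be rounded differently.

```python
# Sketch: roll the per-chip peak up through the BlueGene/L packaging levels.
# Counts and the 5.6 GF/s chip peak are taken from the slide; names are illustrative.

CHIP_PEAK_GFLOPS = 5.6

# (level name, number of units of the previous level it contains)
LEVELS = [
    ("compute card", 2),   # 2 chips
    ("node card", 16),     # 16 compute cards = 32 chips
    ("rack", 32),          # 32 node cards
    ("system", 64),        # 64 racks
]


def buildup(chip_peak_gflops=CHIP_PEAK_GFLOPS, levels=LEVELS):
    """Return {level: (chips, peak GF/s)} for every packaging level."""
    result = {"chip": (1, chip_peak_gflops)}
    chips = 1
    for name, multiplier in levels:
        chips *= multiplier
        result[name] = (chips, chips * chip_peak_gflops)
    return result


if __name__ == "__main__":
    for name, (chips, gflops) in buildup().items():
        print(f"{name:>12}: {chips:>6} chips, {gflops:,.1f} GF/s")
```

The system-level chip count of 65,536 matches the 64x32x32 arrangement listed on the slide.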

14 Blue Gene Interconnection Networks
- 3-Dimensional Torus
  - Interconnects all compute nodes (65,536)
  - Virtual cut-through hardware routing
  - 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
  - Communications backbone for computations
  - 0.7/1.4 TB/s bisection bandwidth, 67 TB/s total bandwidth
- Global Collective Network
  - One-to-all broadcast functionality
  - Reduction operations functionality
  - 2.8 Gb/s of bandwidth per link; latency of tree traversal 2.5 µs
  - ~23 TB/s total binary tree bandwidth (64k machine)
  - Interconnects all compute and I/O nodes (1024)
- Low Latency Global Barrier and Interrupt
  - Round trip latency 1.3 µs
- Control Network
  - Boot, monitoring and diagnostics
- Ethernet
  - Incorporated into every node ASIC
  - Active in the I/O nodes (1:64)
  - All external communication (file I/O, control, user interaction, etc.)
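To make the torus description concrete, the Python sketch below models a 64x32x32 torus with wraparound links, lists the six neighbors of a node, and computes hop distance. It also reproduces the per-node figure of 2.1 GB/s from 12 links at 1.4 Gb/s, reading "12 node links" as six incoming plus six outgoing, which is an interpretation and not stated on the slide. Function names are illustrative and the code says nothing about the real routing hardware.

```python
# Sketch of 3D torus addressing; dimensions from the slide, names illustrative.

DIMS = (64, 32, 32)


def neighbors(node, dims=DIMS):
    """Six nearest neighbors of a node, with wraparound in each dimension."""
    result = []
    for axis in range(len(dims)):
        for step in (-1, 1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % dims[axis]
            result.append(tuple(coord))
    return result


def torus_hops(a, b, dims=DIMS):
    """Minimum hop count between two nodes (Manhattan distance on a torus)."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def node_bandwidth_gbytes(links=12, gbits_per_link=1.4):
    """Aggregate link bandwidth of one node in GB/s (8 bits per byte)."""
    return links * gbits_per_link / 8


if __name__ == "__main__":
    print(neighbors((0, 0, 0)))
    print(torus_hops((0, 0, 0), (63, 31, 31)))  # wraparound keeps this short
    print(torus_hops((0, 0, 0), (32, 16, 16)))  # farthest node from the origin
    print(node_bandwidth_gbytes())
```

Because of the wraparound links, the farthest node is half the machine away in each dimension, so no message needs more than 32 + 16 + 16 = 64 hops in this model.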

15 Petascale Computing System Software
Goals
- Scalability / Performance: scaling to tens of thousands of processors
- Reliability: an independent failure on a compute node once a year would translate into a failure every 8 minutes on a 64K node system
- Familiarity: enable familiar programming models and programming environments for end-users
Our approach
- Simplicity
  - Avoid features not absolutely necessary for high performance computing
  - Using simplicity to achieve both efficiency and reliability
- New organization of familiar functionality
  - Hierarchical organization
  - Same interface, new implementation
- Message passing provides the foundation
  - Research on higher level programming models using that base
[Figure: Measured MPI Send Bandwidth & Latency; bandwidth at 700 MHz vs. message size (bytes); latency vs. Manhattan distance for 1 to 6 neighbors, with midplane hops marked]
[Figure: Lines of Code (Thousands)]
[Figure: Noise measurements (from Adolfy Hoisie)]
[Figure: Blue Gene Porting Experience, Person-Days to Port; most codes ported in one person-week]

16 QBox: [...] TF sustained
- Nature, Oct 7, 2004 (cover picture): Hydrogen Under High Pressure
- Transition from Molecular Solid to Quantum Liquid
- A possible new state of matter

17 [Figure panels: Rhodopsin, Dark Ensemble; Light-adapted Rhodopsin; Rhodopsin in 2:2:1 SDPC/SDPE/CHOL; Lysozyme Misfolding]
Pitman, M. C., Grossfield, A., Suits, F. & Feller, S. E. J. Am. Chem. Soc. 127, (2005).

18 Biology of Transcriptional Regulation (Boston University Department of Bioinformatics)
- Transcription Factors (TFs) bind DNA upstream of a gene and promote or inhibit RNA transcription
- Genes bound by the same TF can be co-regulated
- Goal:
  - Identify both the TFs and the places they bind (i.e. the genes they regulate)
  - Identify sets of genes regulated by the same TF
Why sequence the maize genome? (Srinivas Aluru)
- An economically important crop in the US (and Iowa)
- Best studied model organism for the cereal crops
- Just as the human genome project will intensify upcoming medical advances, cereal genomes (rice and maize) will help improve worldwide food production
- Maize genome is comparable in size to the human genome (2.5 GB) but is highly repetitive (65-80%); less than 10-15% is gene space
[Panel credit: Dr. Yuan-Ping Pang, Mayo Clinic]

19 Application examples
- LOFAR (LOw Frequency ARray): Blue Gene enables LOFAR to provide higher resolution and sensitivity than any other low-frequency radio telescope. LOFAR digitizes [...] MHz signals from an array of simple omni-directional antennas and processes the data on a central computer system to emulate a conventional dish antenna.
- CPMD and Material Sciences: simulations from first principles to understand the physics and chemistry of current technology and guide the design of next-generation materials
  - Dielectric constants
  - Electronic properties
  - Interfaces: structure & chemistry
  - Formation & process dependence of physical properties
  - High-k on Silicon
- Hercules Earthquake Simulation
  [Figure: pipeline stages Physical model, Meshing, Partitioner, Solver, Visualization, Interpretation; plot of velocity (m/s) vs. time]

20 Observations
2. The premium is on SYSTEM design
  - Power
  - Reliability / Availability
  - Interconnect
  - Programmability
  - Tuning & Debugging
  - Cost
  - Efficiency vs cost (not vs peak)
3. It's about the science and applications, not the benchmarks
4. Ultra scalability works and is here to stay
  - Performance
  - Cost/Performance
  - Power efficiency
  - Ease of use
  - Ease of porting
  - Applicable to a large (and growing) class of applications

21 NSF Procurement Issues
Example: Cost / Balance tradeoffs
- 1 PetaFlop/s sustained over a wide range of applications suggests peak performance targets of 3-10 PetaFlops
- Assuming ASCI ratios, 5 PetaFlop/s peak performance implies:
  - 2.5 PB of DRAM memory
  - 100 PB of disk storage
  - 5 TB/s bandwidth to disk
- DRAM price estimates for 2010 range from 44 to 124 per [unit missing in source]
  - 2.5 PB of DRAM costs $100M
  - We expect memory costs to be twice this
  - Form factor and quality affect price
- 5 TB/s storage bandwidth in 2010 is estimated to require ~150,000 drives
  - Bandwidth requirement dominates the capacity requirement (100 PB)
  - Costs could be in the $60-$100M range
  - Potential floor space and power implications
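The balance figures above can be read as ratios against peak performance. The sketch below back-derives those ratios from the slide's single 5 PetaFlop/s data point: 0.5 bytes of DRAM per Flop/s, disk capacity 40 times DRAM, and 1 TB/s of disk bandwidth per PetaFlop/s. They are assumptions of this example and are not taken from an ASCI definition. It also works out what the drive count and DRAM cost imply per drive and per gigabyte, in decimal units.

```python
# Sketch: balance ratios back-derived from the slide's 5 PF/s example.
# The default ratios are assumptions, not official ASCI figures.

def balanced_system(peak_pflops, bytes_per_flop=0.5, disk_to_memory=40.0,
                    disk_tb_per_s_per_pflops=1.0):
    """Memory, disk capacity, and disk bandwidth for a given peak."""
    dram_pb = peak_pflops * bytes_per_flop
    return {
        "dram_pb": dram_pb,
        "disk_pb": dram_pb * disk_to_memory,
        "disk_tb_per_s": peak_pflops * disk_tb_per_s_per_pflops,
    }


def per_drive_mb_per_s(total_tb_per_s, drives):
    """Sustained bandwidth each drive must deliver (1 TB = 10^6 MB)."""
    return total_tb_per_s * 1e6 / drives


def dollars_per_gb(total_dollars, petabytes):
    """Implied price per gigabyte (1 PB = 10^6 GB)."""
    return total_dollars / (petabytes * 1e6)


if __name__ == "__main__":
    print(balanced_system(5.0))
    print(per_drive_mb_per_s(5.0, 150_000))
    print(dollars_per_gb(100e6, 2.5))
```

Read this way, the slide implies roughly 33 MB/s sustained per drive and about $40 per GB of DRAM, which shows why the drive count is set by bandwidth and not by the 100 PB capacity.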

22 Observations
5. We need to rethink our definition of a balanced system and of system efficiency
  - FLOPs are cheap. Why do we value them so much?
  - Memory, storage, and interconnect (and the bandwidths associated with them) are expensive and valuable
  - We need to optimize these and be prepared to throw FLOPs at them
    - In general purpose computing we have historically thrown MIPS at software layers to improve functionality, usability, productivity, etc.
  - Even more valuable are skilled human resources
    - How do we program these beasts?

23 Another approach: special hardware based (accelerator) architectures
- Lots of transistors on microprocessors are there for non-computational functionality
  - Do you have to replicate all of these?
- Examples:
  - Cell
  - ClearSpeed
  - FPGAs
  - Mixtures of the above
- Can give exceptional power/performance

24 Another approach: special hardware based (accelerator) architectures
- So far we have boards and chips
  - What are the correct system structures for these?
- Programmability / usability / practicality is the big question
- Memory, storage and interconnect (and their bandwidths) will still dominate costs
- Ultimately we will want to scale these out too

25 Combinations and Convergence
- Hybrid systems
  - What makes sense?
  - Swiss Army Knife model vs. integrated system
- System level accelerators
  - Ultrascale or accelerator based SYSTEMS as an accessory to a more general purpose system that orchestrates, coordinates, and initiates computation

26 System Level Accelerators
[Diagram labels: Cluster Node, Cluster Node; Optional 1G Ethernet; InfiniBand Dual 4x Fabric; Service LAN; Cell, Blue Gene or other Systems]
- Base System
  - Parts of the workload that require a large or shared memory footprint, database functions, visualization
  - Clusters, workstations: laptop; Power SMP; Intel, AMD clusters
- Interconnect
  - Fast networking and clever algorithms that maximize network bandwidth
  - Ethernet / InfiniBand have emerged as standards here
- Accelerator
  - Offload / accelerate compute-intensive components of the workload
  - Linux cluster, Blue Gene, Cell Blade Center

27 Summary
- Considerations of cost and power are driving very high end design choices
  - Ultra scalability: it works!
  - Special purpose or accelerator hardware: potential to be determined
- Raw Flops will become increasingly less expensive relative to memory, storage, interconnect (and their associated bandwidth)
- Continued focus on total system design
- Value needs to be measured by real results on real problems vs. the cost of the solution
  - Including programming costs

28 Thank You
