Aim High Intel. Exa, Zetta, Yotta - Not Just Goofy Words. Oklahoma Supercomputing Symposium, October 3, 2007. Stephen R. Wheat, Ph.D.


1 Aim High Intel. Exa, Zetta, Yotta - Not Just Goofy Words. Oklahoma Supercomputing Symposium, October 3, 2007. Intel Corporation. Stephen R. Wheat, Ph.D., Director, HPC, Digital Enterprise Group

2 Risk Factors. Today's presentations contain forward-looking statements. All statements made that are not historical facts are subject to a number of risks and uncertainties, and actual results may differ materially. Please refer to our most recent Earnings Release and our most recent Form 10-Q or 10-K filing available on our website for more information on the risk factors that could cause actual results to differ. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations. Rev. 7/19/06

3 Silicon Future. A new Intel technology generation every 2 years, and Intel R&D technologies drive this pace well into the next decade (roadmap: 90nm, 65nm, 45nm, 32nm; research: 22nm and smaller nodes extending to ~2017).

4 45nm Advantage.
- Intel Xeon 5300 processor (Clovertown), 65nm: 143 mm2 die*, 582M transistors, 8 MB cache
- Intel Xeon 5400 processor (Harpertown), 45nm Hi-k: 107 mm2 die*, 820M transistors, 12 MB cache
Millions of quad-core processors shipped. *Source: Intel. Note: die picture sizes are approximate.

5 High Performance Computing: technically motivated computing where performance matters more than cost or ease of use, from the desktop to the highest-end supercomputing: - Scientific Discovery - Engineering Innovation - Finance and Decision Support - Geo-Economics and Societal Complexities - Knowledge Discovery

6 HPC and Mainstream Computing. Historically, most hardware and software architectural innovations have come through high-end computing. Today, innovation moves up from the bottom (e.g., low-power processing) and down from the top (e.g., parallel computing), but the high end is still a dominant source of new ideas. What is in today's supercomputer will be in tomorrow's desktop and next week's embedded platform.

7 Size does matter: Peta Mac, Exa Mac, Zetta Mac, Yotta Mac

8 Yesterday, Today and Tomorrow in HPC.
- ENIAC: 20 numbers in main memory
- CDC 6600: first successful supercomputer, 9 MFlops
- ASCI Red: first teraflop computer, 9,298 Intel Pentium II Xeon processors (world's fastest on the Top500 from Jan 1997 to Nov 2000; remained on the list until Nov 2005)
- Intel ENDEAVOR: 464 Intel Xeon 5100-series processors, 6.85 teraflop MP Linpack, #68 on the Top500
- ~2008 and beyond: petascale platforms for climate, astrophysics, and cell-based community simulation
Yesterday's high-end supercomputing is today's personal computing.

9 Just a Few Highly Parallel Apps: game physics, fluid/structural simulation, portfolio management, text mining, Semantic Web, signal/image processing primitives, derivative pricing suites, stochastic optimization suites, non-linear crash models, neural networks for medicine and AI.

10 Exploding Demand for Data Processing. Example: HPC in medical imaging. At ~2 MB per image, image counts per exam have exploded: single-slice CT in '96, 100 images; 64-slice in '05, 3,000 images; 256-slice in '07, 10,000 images. A full-body 256-slice CT at 10,000 images is a 20 GB file. Drivers: 3-D renderings of the images, computer-aided diagnostic algorithms, fusions of images from different modalities (MRI, CT, PET, and SPECT), and emerging real-time applications. Source: Intel Digital Health.

11 Today's Science Demands Petascale. Example: HPC in climate computing. Some believe that global warming will produce more extreme weather (drought/flooding). Current models are too coarse and inaccurate to be reliable globally, much less for predicting climate change at the national level. To predict regional climate change, the community climate model resolution goal is 10 km. Today's fastest computer simulates ~150 days/day at 10 km using NCAR/Sandia SEAM, and a typical climate simulation covers 100 years. To simulate 100 years of climate: 1.6 sustained PFLOPS is about a month of computing; 50 PFLOPS is a day.
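A rough consistency check of those figures (a sketch; the ~0.2 sustained PFLOPS baseline for 2007's fastest machine is an assumption, not a number from the slide):

```latex
% 100 years of simulated climate at ~150 simulated days per wall-clock day:
\frac{100 \times 365\ \text{days}}{150\ \text{days/day}} \approx 243\ \text{wall-clock days at today's rate}
% Finishing in ~30 days needs ~8x today's sustained rate; in 1 day, ~243x:
8 \times 0.2\ \text{PFLOPS} \approx 1.6\ \text{PFLOPS}, \qquad
243 \times 0.2\ \text{PFLOPS} \approx 50\ \text{PFLOPS}
```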

12 Modeling the Brain (chart: estimated compute on a log FLOPS scale from mega to zetta): a single neuron sits near the bottom of the scale, a worm brain around giga, a bird brain far higher, and a human brain approaching zetta-FLOPS.

13 The Top500: Reaching Petascale. Aggregate PF across the whole Top500 was reached in 2004; 1 PF on a single system arrives at ~2008; sustained PF follows later (chart milestones: Intel ASCI Red, NEC, Intel Columbia, IBM BG-L). The 1 TF barrier to entry has since been surpassed. It takes merely 8 years to move from #1 to being off the list! Source: top500.org

14 Driving to Petascale & Exascale (chart: Top500 #1 and SUM trend lines projected on a log scale from 100 MFlops through 1 ZFlops, with a disruption point marked at Intel many-core). Source: Dr. Steve Chen, "The Growing HPC Momentum in China", June 30th, 2006, Dresden, Germany.

15 Driving to Petascale & Exascale for science and engineering (same chart, annotated with real-world challenges: full modeling of an aircraft in all conditions, green airplanes, genetically tailored medicine, understanding the origin of the universe, synthetic fuels everywhere, accurate extreme weather prediction, understanding human intelligence). Compute requirements marked on the curve: Aerodynamic Analysis, 1 PetaFLOPS; Laser Optics, 10 PetaFLOPS; Molecular Dynamics in Biology, 20 PetaFLOPS; Aerodynamic Design, 1 ExaFLOPS; Computational Cosmology, 10 ExaFLOPS; Turbulence in Physics, 100 ExaFLOPS; Computational Chemistry, 1 ZettaFLOPS. Source: Dr. Steve Chen, "The Growing HPC Momentum in China", June 30th, 2006, Dresden, Germany.

16 Challenges to Success at the Petascale: programmability and scalability, processor speed, memory performance, network performance, power, reliability, manageability, purchase and ownership cost.

17 Processor Performance (chart, log scale from 1.E+08 to 1.E+15 FLOPS): measured performance from the 486 through the Pentium, Pentium II, Pentium III, and Pentium 4 architectures to the Intel Core uarch, with projected multi/many-core performance climbing through tera toward peta. Sustaining petascale with ~6,000 processors. Source: Intel.

18 Performance within the Power Envelope. P ~ (1/2) ω C V_min^2, and V_min ~ V_0 ω, so P ~ (1/2) C V_0^2 ω^3. As we shrink features, new challenges arise: error management, leakage power. Moore's Law continues, but frequency increase is becoming more challenging.

19 Performance within the Power Envelope. P ~ (1/2) ω C V_min^2, and V_min ~ V_0 ω, so P ~ (1/2) C V_0^2 ω^3. Dual-core example (each core with its own cache): a single core at Voltage = 1, Freq = 1 gives Power = 1, Perf = 1; two cores at Voltage = -20%, Freq = -20% give Power = 1, Perf = ~1.7.
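A quick check of the dual-core claim under the slide's own P ∝ V²ω model (a sketch; the exact uncore assumptions behind "~1.7" are not given):

```latex
% Per-core power with 20% lower voltage and frequency:
P_{core} \propto V^2 \omega = (0.8)^2 \times 0.8 = 0.512
% Two such cores: total power \approx 1.02, about the single-core budget,
% while throughput is 2 \times 0.8 = 1.6, close to the slide's ~1.7.
```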

20 A Sample Many Core System (10 mm die, 1 V, 3 GHz at every node; at 65nm, core logic is 6 MT and cache 44 MT per core, roughly a 50/50 core/cache area split). The scaling sketch follows this list.
- 65nm: 4 cores, 5 mm each, 200M transistors total
- 45nm: 8 cores, 3.5 mm each, 400M transistors total
- 32nm: 16 cores, 2.5 mm each, 800M transistors total
- 22nm: 32 cores, 1.8 mm each, 1.6B transistors total
- 16nm: 64 cores, 1.3 mm each, 3.2B transistors total
Note: the above pictures don't represent any current or future Intel products. Research challenge: asymmetric vs. symmetric, homogeneous vs. heterogeneous - what kinds of applications will benefit?
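The doubling pattern in that table is constant-die-area Moore's Law scaling (a sketch of the arithmetic, not an Intel projection):

```latex
% Each process generation shrinks linear dimensions by ~0.7x, so area per
% transistor halves and a fixed 10 mm x 10 mm die holds twice as many:
T(g) \approx T_0 \cdot 2^{g} = 200\,\text{M} \cdot 2^{g},
\quad g = 0\ (65\text{nm}),\, 1\ (45\text{nm}),\, \dots,\, 4\ (16\text{nm})
% Core counts follow suit (4, 8, 16, 32, 64), with the per-core edge
% shrinking ~0.7x per generation: 5, 3.5, 2.5, 1.8, 1.3 mm.
```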

21 Teraflops Research Chip: 100 million transistors, 80 tiles, 275 mm². First tera-scale programmable silicon: teraflops performance, tile design approach, on-die mesh network, novel clocking, power-aware capability, supports 3D memory. Not designed for IA or product.

22 Tera-scale Introduction. Represents a significant Intel transition from large cores to 32+ low-power, highly-threaded IA cores per die. Motivations for a new architecture: enable emerging workloads and new use-models; low-power IA cores provide 4-5X greater performance/power efficiency; scaling beyond the limits of instruction-level parallelism and single-core power. Tera-scale is NOT simply SMP-on-die: it will require complete platform and software enabling.
Parameter | SMP      | Tera-scale | Improvement | Optimizations
Bandwidth | 12 GB/s  | ~1.2 TB/s  | ~100X       | Massive bandwidth between cores
Latency   | 400 cycles | 20 cycles | ~20X        | Ultra-fast synchronization

23 Tiled Design & Mesh Network. Repeated tile method: compute + router; modular, scalable; small design teams; short design cycle. Mesh interconnect (network-on-a-chip): cores networked in a grid allow super-high-bandwidth communication in and between cores; 5-port, 80 GB/s* routers; low latency (1.25 ns*); each router links to its north, south, east, and west neighbors, the tile's compute element, and future stacked memory. Future: connect IA and/or special-purpose cores. *When operating at a nominal speed of 4 GHz.
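To make the mesh idea concrete, here is a minimal sketch of dimension-order (XY) routing, the simplest deadlock-free policy for a grid like this. It is illustrative only, not the research chip's actual router logic:

```c
#include <stdio.h>

/* Illustrative XY routing on a 2D mesh NoC: a packet first corrects its
 * X coordinate, then its Y coordinate. Each 5-port router needs only
 * local coordinate comparisons, and XY order makes the mesh deadlock-free. */
typedef enum { LOCAL, EAST, WEST, NORTH, SOUTH } port_t;

static port_t next_port(int x, int y, int dx, int dy) {
    if (x < dx) return EAST;   /* correct X first ... */
    if (x > dx) return WEST;
    if (y < dy) return NORTH;  /* ... then correct Y */
    if (y > dy) return SOUTH;
    return LOCAL;              /* arrived: deliver to the tile's compute element */
}

int main(void) {
    /* Walk a packet across an 8x10 (80-tile) mesh from (0,0) to (7,5). */
    int x = 0, y = 0;
    const int dx = 7, dy = 5;
    const char *names[] = { "LOCAL", "EAST", "WEST", "NORTH", "SOUTH" };
    for (;;) {
        port_t p = next_port(x, y, dx, dy);
        printf("(%d,%d) -> %s\n", x, y, names[p]);
        if (p == LOCAL) break;
        if (p == EAST) x++; else if (p == WEST) x--;
        else if (p == NORTH) y++; else y--;
    }
    return 0;
}
```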

24 Fine Grain Power Management. A novel, modular clocking scheme saves power over a global clock, and new instructions make any core sleep or wake as apps demand. Chip-wide voltage and frequency control (0-5.8 GHz). Dynamic sleep: STANDBY (memory retains data) saves 50% power per tile; FULL SLEEP (memories fully off) saves 80% power per tile. 21 sleep regions per tile (not all shown): data memory sleeping, 57% less power; instruction memory sleeping, 56% less power; router sleeping, 10% less power (stays on to pass traffic); each FP engine sleeping, 90% less power. Industry-leading energy efficiency of 16 gigaflops/watt.

25 Research Data Summary:
Frequency | Voltage | Power | Bisection Bandwidth | Performance
3.16 GHz  | 0.95 V  | 62 W  | 1.62 Terabits/s     | 1.01 Teraflops
5.1 GHz   | 1.2 V   | 175 W | 2.61 Terabits/s     | 1.63 Teraflops
5.7 GHz   | 1.35 V  | 265 W | 2.92 Terabits/s     | 1.81 Teraflops
Headline: 1.01 Teraflops at 62 Watts.
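Those rows also explain the 16 gigaflops/watt headline on the previous slide (simple division, shown here as a sanity check):

```latex
\frac{1.01\ \text{Teraflops}}{62\ \text{W}} \approx 16.3\ \text{GFlops/W},
\qquad
\frac{1.81\ \text{Teraflops}}{265\ \text{W}} \approx 6.8\ \text{GFlops/W}
% Efficiency peaks at the low-voltage operating point, not at peak speed.
```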

26 More than the Cores (chart: performance increase vs. number of cores). Just adding cores sets the baseline; new instructions, cache improvements, and HW thread scheduling supply the rest of the value of tera-scale research.

27 Intra-chip Interconnect: a bus for a future many-core chip? Issues: slow; shared, with limited scalability. Benefits: power(?); simpler cache coherency. A traditional bus is not a good interconnect option.

28 Intra-chip Interconnect Options to Evaluate: bandwidth, link bandwidth, and power (charts: topology effect on normalized bandwidth vs. number of nodes for crossbar, mesh, and ring; interconnect area at iso-bandwidth relative to compute area; and energy per bit at iso-bandwidth vs. number of cores for ring, clustered ring, mesh, and crossbar).
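The qualitative shape of those curves falls out of a simple bisection-bandwidth model (a sketch under textbook assumptions, not the slide's actual methodology):

```latex
% Bisection bandwidth for N nodes with per-link bandwidth b:
B_{ring} \approx 2b\ \text{(constant)}, \qquad
B_{mesh} \approx \sqrt{N}\, b, \qquad
B_{xbar} \approx \tfrac{N}{2}\, b
% Normalized per node, the ring falls off as 1/N, the mesh as 1/\sqrt{N},
% and the crossbar stays flat -- at the price of O(N^2) switch area.
```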

29 How Do We Feed the Machine? (Chart: memory bandwidth vs. computation for RMS workloads, on the order of 1 TFLOP with 150 GB/s.) Bytes-per-flop demands vary widely across kernels: ray tracing (Soda and Beetle scenes, 1-4 MP), ~1 B/Flop; computer vision (FB_Est in video surveillance and body tracking), ~0.45 B/Flop; physical simulation (CFD Splash, 75x50x50 up to 150x100x100), ~0.17 B/Flop; financial analytics (small and large 6-asset ALM portfolios), ~8.5 B/Flop. Memory bandwidth and processor performance need to keep pace. Source: Intel Labs.

30 What If Moore's Law Could Be Applied to the Airline Industry?

31 What If Moore's Law Could Be Applied to the Airline Industry?

32 Memory Performance for Balanced Computing (chart: x86 bytes per FLOP over time, on a scale up to 0.7). The byte:flop ratio has been consistent. Source: Intel.

33 When we have tera-op processors, we'll need TB/s memory systems. The most common arithmetic operation, X <= A*X + B, requires that 16 bytes be read into the functional unit and that 8 bytes be written; in addition, an 8-byte instruction must be processed.
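Putting numbers on that claim (a sketch; counting the multiply-add as 2 flops is an assumption about how the slide tallies work):

```latex
% Traffic per update: 16 B read + 8 B written + 8 B instruction = 32 B
% Work per update: one multiply + one add = 2 flops
\frac{32\ \text{B}}{2\ \text{flops}} = 16\ \text{B/flop}
\quad\Rightarrow\quad
10^{12}\ \text{flops/s} \times 16\ \text{B/flop} = 16\ \text{TB/s}
```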

34 When we have tera-op processors, we'll need TB/s memory systems (diagram: arithmetic unit and registers fed through L1 I$, L1 D$, L2$, the memory controller, and DRAM). This neglects multi-core and coherency issues.

35 Issues with today's memory system options. A big processor socket has O(2K) pins. DDR achieves relatively good BW and relatively low power at amazingly low cost (slow links, lots of them), but there are simply not enough pins to achieve traditional B/F ratios. FB-DIMM technology offers higher BW with many fewer pins, but much higher power (~2X), longer latencies, and questionable cost.

36 Memory system solutions for terascale processors: buffer-on-board, DRAM on package, stacked DRAM, silicon photonics.

37 Memory Bandwidth options: augment the on-package MC with very fast links to an on-board MC or to smart buffers (diagrams: CPU package with off-package memory controller / smart buffers).

38 Memory Bandwidth options: DRAM on package (diagram: DRAM and CPU integrated in one package).

39 Memory Bandwidth futures: 3D die stacking (diagram: heat sink atop a CPU stacked with a thin DRAM die in the package). Power and I/O signals go through the DRAM to the CPU via through-DRAM vias. DRAM, voltage regulators, and high-voltage I/O all sit on the 3D integrated die.

40 Silicon Photonics Future I/O Vision: HPC and data center fabrics, chip-to-chip interconnects, backplane and display interconnects, chemical analysis, medical lasers. Research challenge: intra-chip and inter-chip I/O architecture and topology options.

41 Nehalem Based System Architecture (diagrams: 2- and 4-socket Nehalem configurations linked by the Intel QuickPath Interconnect to I/O hubs, ICH, DMI, and PCI Express*). 2, 4, or 8 cores; 4, 8, or 16 threads; Intel QuickPath Architecture; integrated memory controller; buffered or un-buffered memory; optional integrated graphics.

42 Interconnect. The bigger the system, the more critical the interconnect: Blue Gene* has nearly 3X the peak speed of XT-3/Red Storm*, yet Red Storm outperforms Blue Gene on most apps. Why? It's about the interconnect.

43 Interconnect. For large systems, meshes/tori with fat links are best. Rule of thumb: I/O speed (B/s) ~ processor speed (Flops). For a terascale processor the rule of thumb cannot hold: signaling at more than 10 Gbits/s per wire pair is currently a grand challenge, and pin counts, power, and space budgets are limited. Optics speeds are also limited: by cost, power, and electrical input.
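Why the rule of thumb breaks at terascale: a sketch of the pin arithmetic implied by the slide's own numbers:

```latex
% Rule of thumb: 1 Tflop/s wants ~1 TB/s = 8 Tbit/s of I/O.
\frac{8 \times 10^{12}\ \text{bit/s}}{10 \times 10^{9}\ \text{bit/s per wire pair}}
  = 800\ \text{wire pairs (per direction)}
% -- far beyond what an O(2K)-pin socket can spare for I/O alone.
```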

44 Well-designed MPI codes scale on well-designed interconnects. Example: Scalable Neutronics (Los Alamos), an MPI-based code; run time stays constant only on a well-balanced system.

45 PCIe 2.0: 2X PCIe 1.0 bandwidth. Nine cards from seven vendors working with Intel's Stoakley platform at 5 GHz; broad IHV support. Intel Xeon 5400 chipset and X38 Express chipset in 2H'07.

46 PCIe 2.0 (2X PCIe 1.0 bandwidth; broad IHV support; Intel Xeon 5400 chipset and X38 Express chipset in 2H'07) and PCIe 3.0 (2X PCIe 2.0 bandwidth; data reuse, dynamic power management, atomic operations; industry-standard attach for accelerators; specifications in 2009, products expected in 2010). Expanding momentum and innovation.

47 Programmability. Nearly all HPC apps today are written for the x86 architecture, and nearly all use MPI for remote memory access. OpenMP is successful on SMP nodes, and most people are anxious to avoid hybrid programming. So, what do we do about a million MPI threads?
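For reference, the "hybrid" style the slide says people want to avoid looks roughly like this minimal MPI + OpenMP sketch (illustrative, not from the deck; assumes an MPI library with MPI_THREAD_FUNNELED support and an OpenMP-capable compiler):

```c
/* Hybrid parallelism: MPI ranks across nodes, OpenMP threads within each
 * rank. Build with something like: mpicc -fopenmp hybrid.c */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, nranks;
    /* Ask for an MPI library that tolerates OpenMP threads inside a rank. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local = 0.0;
    /* Threads share the node's memory; MPI moves data between nodes. */
    #pragma omp parallel reduction(+:local)
    local += 1.0;   /* stand-in for per-thread work */

    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("ranks=%d, total threads counted=%g\n", nranks, global);
    MPI_Finalize();
    return 0;
}
```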

48 Programmability (biased opinion). x86 is a huge advantage, and MPI is not going away. Many new apps would be written using GAS (global address space) models if the support were there. Multi-threading/dataflow languages are interesting but face a huge barrier to adoption. The problem is less programming complexity than achieving performance (jitter, load balance, reliability, ...). Transactional memory will aid on-socket shared-memory parallel programming, and fractal computing models will become important.

49 Shared Memory Scalability on Many Core systems (chart: speedup vs. number of cores, up to 64, across kernels: ray tracing (avg.), mod. levelset, FB_Est, body tracker, Gauss-Seidel, sparse matrix (avg.), dense matrix (avg.), k-means, SVD, SVM classification, Cholesky (watson, mod2), forward solvers (pds-10, world), backward solvers (pds-10, watson)). The best kernels reach 48x-57x speedup on 64 cores; others land nearer ~33x. Source: Intel Labs.
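Read through Amdahl's law, those speedups imply an impressively small serial fraction (a back-of-envelope inference, not a figure from the slide):

```latex
S(n) = \frac{1}{(1-p) + p/n}
\quad\Rightarrow\quad
S(64) = 57 \;\;\text{requires}\;\; p \approx 0.998
% i.e., only ~0.2% of the work can be serial to sustain 57x on 64 cores.
```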

50 Power. Speed costs power; memory bandwidth costs power; memory size costs power; power delivery costs power; cooling costs power; optics costs power; electrical signaling costs even more power. The power envelope and density drive cooling issues and green policies. There is no single magic bullet.

51 System Power/Cooling Efficiency (spanning transistors, silicon, packages, heat sinks, systems, and facilities). Silicon: Moore's law, strained silicon, transistor leakage control techniques, clock gating, using more Si to replace faster Si. Processor: policy-based power allocation, multi-threaded cores. System power delivery: fine-grain and ultra-fine-grain power management, high-efficiency power converters. Facilities: air cooling and liquid cooling options, vertical integration of cooling solutions. Research challenge: zero-overhead HW & SW solutions for system- and facility-level power management.

52 Reliability: Reliable Answers From Unreliable Components. Petascale-to-exascale systems are huge: with current technology, a ~3-5X increase in part count over today's biggest systems and 3-4X the number of SW instances, with exquisite efforts needed to keep parts cool. Engineered-in reliability is not an option; it is an imperative. SW complexity and brittleness will remain the largest cause of unreliability in HPC. A 20-hour MTBI is a realizable goal for multi-peta-op systems.
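To see why 20 hours is ambitious, consider the standard serial-reliability estimate (an illustration with assumed part counts and part MTBFs, not figures from the slide):

```latex
% N independent parts, each with MTBF m and exponential failures:
\text{MTBI}_{system} \approx \frac{m}{N}
% e.g., N = 10^5 parts at m = 10^6 hours each gives only ~10 hours,
% so a 20-hour MTBI at this scale must be engineered in, not inherited.
```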

53 Reliable Systems with Unreliable Components: architectural techniques to detect, correct, log, and signal the errors. Micro solutions: parity, SECDED ECC, π bit, device parameter tuning, circuit techniques, process techniques, rad-hard cell creation, state-of-the-art processes. Macro solutions: lockstepping, redundant multithreading (RMT), redundant multi-core CPU. Research challenge: from software (apps, OS, VMM, etc.) to hardware, a reliable petascale HPC system needs management top-down.

54 Summing it up: Intel will pave the road to Exascale. Intel processors will provide a step-function advantage in programmability, performance and reliability, and cost of performance. Intel will bring its world-leading technologies to bear: Si process, many-core, memory technologies, optics, power reduction and management, productivity through SW. Intel will architect fastest-in-the-world systems. SRW (not Intel) prediction: Red River Shootout 2007, Sooners 23, Longhorns 10.

55 Intel
