Challenges in High Performance Computing. William Gropp


1 Challenges in High Performance Computing William Gropp

2  What is HPC?
- High Performance Computing is the use of computing to solve challenging problems that require significant computing resources
  - Supercomputers
  - Clusters (large and small)
  - Accelerator-equipped workstations
  - Custom hardware (e.g., Anton)
- Traditionally FLOPS
  - But could be memory, data, bandwidth, real-time, ...

3  What Sort of Problems are Solved with HPC?
- Engineering: everything from consumer products to spacecraft
  - Getting Pringles into a can (3-D, time-dependent CFD)
  - Optimizing bottles for minimum weight with the required robustness
  - Fuel-efficient aircraft with novel materials
- Insights in science
  - Formation of galaxies; effects of different theories of dark matter and dark energy
  - Weather and climate forecasting
  - Formation (and points of attack) of viruses such as HIV

4  Advancing Science and Engineering
- Advances in a broad range of science and engineering disciplines will be enabled by sustained petascale computers:
  - Molecular Science
  - Weather & Climate Forecasting
  - Astronomy
  - Earth Science
  - Health

5  Messages
- Big is big
  - Data-driven is an important area, but not all data-driven problems are big data (despite current hype). The distinction is important.
  - There are different measures of big, but a TB of data that can be processed by a linear algorithm is not big.
- A key feature of an extreme computing system is a fast interconnect
  - Low latency, high link bandwidth, high bisection bandwidth
  - Provides fast access to data everywhere in the system, particularly with one-sided access models
  - Think of a map(r1, r2, ...) function that requires more than one record, where the specific input records are unpredictable (e.g., data dependent on a previous result); a sketch follows below
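As an illustration of the data-dependent, one-sided access the last bullet describes, here is a minimal MPI-3 RMA sketch. It is not from the slides: the Record layout, NRECORDS_LOCAL, and the pointer-chasing loop are hypothetical, chosen only to show why a fast interconnect with one-sided access matters when the next record to fetch is not known in advance.

```c
/* Hypothetical sketch (not from the slides): chasing data-dependent records
 * across ranks with MPI-3 one-sided access. Record, NRECORDS_LOCAL, and the
 * traversal are illustrative. */
#include <mpi.h>
#include <stdlib.h>

typedef struct { double value; int next_owner; int next_index; } Record;
#define NRECORDS_LOCAL 1000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    Record *local = calloc(NRECORDS_LOCAL, sizeof(Record));
    MPI_Win win;
    /* Expose this rank's records; displacements are in units of one Record. */
    MPI_Win_create(local, NRECORDS_LOCAL * sizeof(Record), sizeof(Record),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_lock_all(0, win);
    Record r = local[0];
    for (int hop = 0; hop < 10; hop++) {
        Record next;
        /* Which rank and index we read next depends on the previous record,
         * so the access pattern cannot be predicted or batched in advance. */
        MPI_Get(&next, sizeof(Record), MPI_BYTE,
                r.next_owner, r.next_index, sizeof(Record), MPI_BYTE, win);
        MPI_Win_flush(r.next_owner, win);   /* complete this get before use */
        r = next;
    }
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    free(local);
    MPI_Finalize();
    return 0;
}
```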

6  HPC and Clouds and Big Data
- Clouds can provide some HPC services
  - Especially where each experiment runs on one node
  - Single nodes can do a lot!
- But clouds are very poor at applications that require tightly coordinated computations across tens or hundreds of thousands of nodes
  - The one thing that makes a supercomputer "super" today is a high-bandwidth, low-latency interconnect, and the software to match
- Big Data also requires big compute
  - Some applications only need independent computation (e.g., clouds)

7  Extrapolation is Risky
- 1989 (24 years ago)
  - Intel introduces the 486DX
  - Eugene Brooks writes "Attack of the Killer Micros"
  - 4 years before the TOP500
  - Top systems at about 2 GF peak
- 1999 (14 years ago)
  - NVIDIA introduces its GPU (GeForce 256)
    - Programming GPUs is still a challenge 14 years later
  - Top system is ASCI Red: 9,632 cores, 3.2 TF peak (about 3 GPUs in 2013)
  - MPI is 7 years old

8  HPC Today
- High(est)-end systems
  - 1 PF (10^15 ops/s) achieved on a few "peak friendly" applications by 2011; Blue Waters demonstrated > 1 PF end-to-end on a larger set this year
  - Much worry about scalability and how we're going to get to an ExaFLOPS
- Systems are all oversubscribed
  - DOE INCITE awarded almost 900M processor hours in 2009, 1,600M-1,700M hours in later years, and over 5B hours in 2013
  - NSF PRAC awards for Blue Waters are similarly competitive
- Widespread use of clusters, many with accelerators; cloud computing services
  - These are transforming the low and midrange
  - Laptops are (far) more powerful than the supercomputers I used as a graduate student

9  HPC in 2011
- Sustained-PF systems
  - K Computer (Fujitsu) at RIKEN, Kobe, Japan (2011)
  - Sequoia Blue Gene/Q at LLNL
  - NSF Track 1 Blue Waters at Illinois
  - Milky Way-2 (TH-2) in China (applications yet to be shown)
- Still programmed with MPI and MPI+other (e.g., MPI+OpenMP, MPI+OpenCL/CUDA, or MPI+OpenACC); a minimal hybrid sketch follows below
  - But in many cases using toolkits, libraries, etc.
  - And "not so bad": applications will be able to run when the system is turned on
- Replacing MPI will require some compromise, e.g., domain-specific (higher-level but less general) approaches
  - Lots of evidence that fully automatic solutions won't work
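The MPI+X combinations named above can be as simple as an MPI reduction across nodes wrapped around an OpenMP loop within each node. A minimal, hedged sketch; the loop body and counts are illustrative, not code from the talk:

```c
/* A minimal MPI+OpenMP hybrid sketch (illustrative only): MPI between nodes,
 * OpenMP threads within a node, i.e., the "MPI+X" pattern the slide refers to. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;
    /* Request thread support for threads that do not call MPI concurrently. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; i++)
        local += 1.0 / (1.0 + i);          /* node-local work across threads */

    double global;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```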

10  Blue Waters and Sequoia Computing Systems

System Attribute                        Blue Waters (NCSA)    Sequoia (LLNL)
Vendors                                 Cray/AMD/NVIDIA       IBM
Processors                              Interlagos/Kepler     PowerPC A2 variant
Total peak performance (PF)
Total peak performance, CPU/GPU (PF)    7.6/                  /0.0
Number of CPU chips (8, 16 cores/chip)  48,576                98,304
Number of GPU chips                     3,072                 0
CPU memory (TB)                         1,510                 1,572
Interconnect                            3D Torus              5D Torus
On-line disk storage (PB)               26                    50(?)
Sustained disk transfer (TB/s)          >1
Archival storage                        >300                  ?
Sustained tape transfer (GB/s)          >100                  ?

11  Blue Waters and MilkyWay-2 Computing Systems

System Attribute                        Blue Waters (NCSA)          Milky Way-2 (NUDT)
Vendors                                 Cray/AMD/NVIDIA             NUDT/Inspur
Processors                              Interlagos/Kepler           Ivy Bridge/Phi
Total peak performance (PF)
Total peak performance, CPU/GPU (PF)    7.6/                        /48.1
Number of CPU chips                     48,576 (8, 16 cores/chip)   32,000 (12 cores/chip)
Number of GPU/accelerator chips         3,072                       48,000
CPU memory (TB)                         1,510                       1,404
Interconnect                            3D Torus                    Fat Tree
On-line disk storage (PB)               26                          12(?)
Sustained disk transfer (TB/s)          >1                          ?
Archival storage                        >300                        ?
Sustained tape transfer (GB/s)          >100                        ?

12  HPC in ...
- Exascale systems are likely to have:
  - Extreme power constraints, leading to
    - Clock rates similar to today's systems
    - A wide diversity of simple computing elements (simple for hardware but complex for software)
    - Much smaller memory per core and per FLOP
    - Expensive data movement (in time and power)
  - Faults that will need to be detected and managed
    - Some detection may be the job of the programmer, as hardware detection takes power
  - Extreme scalability and performance irregularity
    - Performance will require enormous concurrency
    - Performance is likely to be variable; simple, static decompositions will not scale
  - A need for latency-tolerant algorithms and programming
    - Memory and processors will be 100s to 10,000s of cycles away; waiting for operations to complete will cripple performance

13  Algorithms and Applications Will Change
- Applications need to become more dynamic and more integrated
- System software must work with the application:
  - Code complexity (autotuning)
  - Dynamic resources (no simple PGAS)
  - Latency hiding (nonblocking algorithms, interfaces, futures); see the sketch below
  - Resource sharing (more performance information, performance asserts, runtime coordination)
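For the latency-hiding bullet above, here is a minimal sketch using an MPI-3 nonblocking collective. The dot-product reduction and the overlapped local work are illustrative assumptions, not code from the talk.

```c
/* Sketch of latency hiding with a nonblocking collective (MPI-3): start the
 * reduction, do independent work while it is in flight, then wait. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    double local_dot = 1.0, global_dot;
    MPI_Request req;

    /* Start the reduction, but do not wait for it yet. */
    MPI_Iallreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* ... independent local computation goes here while the reduction is in
     * flight, hiding the interconnect latency ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* result is needed from here on */
    printf("global dot = %f\n", global_dot);

    MPI_Finalize();
    return 0;
}
```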

14  How Do We Make Effective Use of These Systems?
- Better use of our existing systems
  - Blue Waters provides a sustained PF, but that typically requires ~10 PF peak
- Improve node performance
  - Make the compiler better
  - Give better code to the compiler (an example follows below)
  - Get realistic with algorithms/data structures
- Improve parallel performance/scalability
- Improve productivity of applications
  - Better tools and interoperable languages, not a (single) new programming language
- Improve algorithms
  - Optimize for the real issues: data movement, power, resilience, ...
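One common instance of "giving better code to the compiler" is removing aliasing ambiguity so a loop can be vectorized. The sketch below is illustrative only; the function name, sizes, and data are made up, not taken from the talk.

```c
/* Illustrative only: the restrict qualifiers and simple unit-stride loop
 * assert that x and y do not alias, which lets most compilers vectorize
 * the sweep without runtime checks. */
#include <stdio.h>
#include <stdlib.h>

static void axpy(long n, double a,
                 const double * restrict x, double * restrict y) {
    for (long i = 0; i < n; i++)
        y[i] += a * x[i];          /* unit stride, no aliasing: vectorizable */
}

int main(void) {
    const long n = 1000000;
    double *x = malloc(n * sizeof(double));
    double *y = malloc(n * sizeof(double));
    for (long i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    axpy(n, 0.5, x, y);
    printf("y[0] = %f\n", y[0]);   /* expect 2.5 */

    free(x);
    free(y);
    return 0;
}
```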

15  Common Themes
- Multiple operations must be pending at any time
  - Asynchronous I/O, communication, even computation
  - Split computations and communication
- Complex systems require adaptive approaches
  - Autotuning for likely choices, runtime optimization
- Operations must be on aggregates
  - CPU: vectors (GPU: gangs/workers/vectors)
  - I/O: collective, parallel I/O
  - Example: parallel collective I/O for a distributed data structure, such as a mesh distributed across all nodes (a sketch follows below)
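As a concrete, hedged illustration of the collective I/O example above, the sketch below writes a block-distributed array to a single shared file with MPI-IO. The file name "mesh.out" and the block size are made up; the collective *_all call is what lets the MPI implementation aggregate requests across processes.

```c
/* Sketch: collective, parallel I/O for a block-distributed 1D array.
 * Each process writes its block to one shared file with a single
 * collective call, so the library can merge and schedule the requests. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nlocal = 1 << 20;                 /* elements owned by this rank */
    double *block = malloc(nlocal * sizeof(double));
    for (int i = 0; i < nlocal; i++) block[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "mesh.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank's data lands at its own offset in the shared file. */
    MPI_Offset offset = (MPI_Offset)rank * nlocal * sizeof(double);
    MPI_File_write_at_all(fh, offset, block, nlocal, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(block);
    MPI_Finalize();
    return 0;
}
```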

16  Four Levels of Collective I/O
[Figure: I/O access patterns for Level 0 through Level 3 collective I/O, shown across processes]

17  Distributed Array Access: Write Bandwidth
[Figure: write bandwidth for a 512 x 512 x 512 array (1 GB of data) at 32, 128, and 256 processes; note the log scale]
Thanks to Weikuan Yu, Wei-keng Liao, Bill Loewe, and Anthony Chan for these results.

18  What's Different at Peta/Exascale
- Performance focus
  - Only a little: basically, the resource is expensive, so a premium is placed on making good use of it
  - Quite a bit: the node is more complex and has more features that must be exploited, and the interconnect performs operations
- Scalability
  - Solutions that work at modest scale are often inefficient at 100,000-way parallelism
  - Some algorithms scale well: explicit time marching in 3D
  - Some don't: direct implicit methods
  - Some scale well for a while: FFTs (communication volume in Alltoall)
  - Load balance and latency are critical issues
- Fault tolerance is becoming important
  - Now: reduce time spent in checkpoints
  - Soon: lightweight recovery from transient errors

19  Preparing for the Next Generation of HPC Systems
- Better use of existing resources
  - Performance-oriented programming
  - Dynamic management of resources at all levels
  - Embrace hybrid programming models (you already have if you use SSE/VSX/OpenMP/...)
  - Focus on results
    - Adapt to available network bandwidth and latency
    - Exploit I/O capability (available space grew faster than processor performance!)
- Prepare for the future
  - Fault tolerance
  - Hybrid processor architectures
  - Latency-tolerant algorithms
  - Data-driven systems

20  Research Directions
- Integrated, interoperable, component-oriented languages
  - Generalization of so-called domain-specific languages, which are really data-structure-specific languages
- Performance modeling and tuning
  - Performance information in the language; performance considered as part of correctness
- Fault tolerance at the high end
  - Fault tolerance features in the language, working with hardware and algorithms
- Correctness
  - Correctness features for testing in the language
  - Support for special cases (e.g., provably deterministic expression of deterministic algorithms)

21  Recommended Reading
- Alan Karp, "Bit Reversal on Uniprocessors," SIAM Review, 1996.
- W. K. Anderson, W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith, "Achieving High Sustained Performance in an Unstructured Mesh CFD Application," Proceedings of Supercomputing, 1999.
- Catherine McGeoch, "Experimental Analysis of Algorithms," Notices of the American Mathematical Society, March 2001.
- Sally McKee, "Reflections on the Memory Wall," ACM Conference on Computing Frontiers, 2004.

22  Still open: a once-a-year opportunity to work with high-end networks
- sc13.supercomputing.org/content/scinet-network-research-program

23  Six Questions
1. What is the appropriate balance between HPC needs at the extreme scale (fundamental research that can be done in no other way) and the needs of the long tail (research that needs more than a desktop computer)? How do you support all computing needs for research?
2. Applications have needs and wants, and these may not be the same. For example, applications want to use their existing algorithms and code, but it may not be possible to run those fast enough; the application may need to change approach. How do you get application scientists to separate their needs from their wants?
3. HPC is about performance. How do you support both the research and especially the engineering work needed to make codes efficient? If you don't do this, where do you find the funds to buy the additional computational capacity required to meet the extra demand created by running less-than-optimized codes?

24  Six Questions (cont'd)
4. Data and HPC should be closely connected. Truly big data (much greater than 10 PB) requires significant compute, for example. How do you change the perception that big data and big compute are antagonistic? What big data problems would be best solved on big compute platforms such as Blue Waters?
5. Current computer architecture is reaching its limits. Where are the new architectures? How do you solve the chicken-and-egg problem: do new architectures require demand from new applications, while applications require a well-established, dependable architecture? Should NSF only consume architectures created by others (whether industry or other agencies), or should it have some control of its core computational technology?
6. How do we get past the usual suspects of applications (computational fluid dynamics, n-body problems, etc.)? How do we extend the use of HPC into new areas in science, engineering, and the humanities?

25  Six Questions: The Short Form
1. What is the right balance between HPC and other computing infrastructure?
2. How can we encourage applications to try new approaches?
3. How do we ensure applications make efficient use of infrastructure?
4. What big data problems need HPC?
5. How can we support innovative computer architecture research?
6. How do we bring computing to new areas of science?
