CSE5351: Parallel Processing. Part 1B. UTA Copyright (c) Slide No 1

Slide No 1 CSE5351: Parallel Processing Part 1B

Slide No 2 State of the Art in Supercomputing. Several of the following slides (some modified) are courtesy of Dr. Jack Dongarra, a distinguished professor of Computer Science at the University of Tennessee.

Slide No 3 Look at the Fastest Computers. Strategic importance of supercomputing:
- Essential for scientific discovery
- Critical for national security
- Fundamental contributor to the economy and competitiveness through use in engineering and manufacturing
Supercomputers are the tool for solving the most challenging problems through simulations.

Slide No 4 The TOP500 list. H. Meuer, H. Simon, E. Strohmaier, & JD
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP benchmark (Ax = b, dense problem)
- Updated twice a year: SC'xy in the States in November, meeting in Germany in June
- All data available from www.top500.org
(Inset chart: rate vs. size, TPP performance.)
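As a concrete illustration of the yardstick, here is a minimal sketch of how an Rmax figure follows from problem size and run time, assuming the standard HPL operation count of 2/3·n³ + 2·n² flops for the dense solve; the problem size and timing below are hypothetical, not measurements from any listed system.

```python
def hpl_flops(n):
    """Standard HPL operation count for solving a dense n x n system Ax = b."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2

def rmax_gflops(n, seconds):
    """Achieved rate in Gflop/s for a run of size n completed in 'seconds'."""
    return hpl_flops(n) / seconds / 1e9

# Hypothetical run: n = 100,000 finishing in one hour -> about 185 Gflop/s.
print(f"{rmax_gflops(100_000, 3600.0):.1f} Gflop/s")
```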

Slide No 5 Performance Development of HPC over the Last 23 Years (from the TOP500). Log-scale chart, 100 Mflop/s to 1 Eflop/s, 1994-2015: the sum of all 500 systems has grown from 1.17 TFlop/s to 362 PFlop/s; N=1 from 59.7 GFlop/s to 33.9 PFlop/s; N=500 from 400 MFlop/s to 166 TFlop/s. For scale: my laptop is about 70 Gflop/s, my iPhone about 4 Gflop/s.

Slide No 6 State of Supercomputing in 2015
- Pflops computing fully established with 67 systems.
- Three technology architecture possibilities, or "swim lanes," are thriving: commodity (e.g. Intel); commodity + accelerator, e.g. GPUs (88 systems); special-purpose lightweight cores (e.g. IBM BG, Knights Landing).
- Interest in supercomputing is now worldwide and growing in many new markets (over 50% of Top500 computers are in industry).
- Exascale projects exist in many countries and regions.
- Intel processors have the largest share, 86%, followed by AMD at 4%.

Slide No 7 July 2015: The TOP 10 Systems (columns: Rank; Site; Computer; Country; Cores; Rmax [Pflops]; % of Peak; Power [MW]; MFlops/Watt)
1. National Super Computer Center in Guangzhou; Tianhe-2, NUDT, Xeon 12C 2.2 GHz + Intel Xeon Phi (57c) + Custom; China; 3,120,000; 33.9; 62; 17.8; 1905
2. DOE / OS Oak Ridge Nat Lab; Titan, Cray XK7, AMD (16C) + Nvidia Kepler GPU (14c) + Custom; USA; 560,640; 17.6; 65; 8.3; 2120
3. DOE / NNSA Livermore Nat Lab; Sequoia, BlueGene/Q (16c) + Custom; USA; 1,572,864; 17.2; 85; 7.9; 2063
4. RIKEN Advanced Inst for Comp Sci; K computer, Fujitsu SPARC64 VIIIfx (8c) + Custom; Japan; 705,024; 10.5; 93; 12.7; 827
5. DOE / OS Argonne Nat Lab; Mira, BlueGene/Q (16c) + Custom; USA; 786,432; 8.16; 85; 3.95; 2066
6. Swiss CSCS; Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler (14c) + Custom; Switzerland; 115,984; 6.27; 81; 2.3; 2726
7. KAUST; Shaheen II, Cray XC30, Xeon 16C + Custom; Saudi Arabia; 196,608; 5.54; 77; 4.5; 1146
8. Texas Advanced Computing Center; Stampede, Dell Intel (8c) + Intel Xeon Phi (61c) + IB; USA; 204,900; 5.17; 61; 4.5; 1489
9. Forschungszentrum Juelich (FZJ); JuQUEEN, BlueGene/Q, Power BQC 16C 1.6 GHz + Custom; Germany; 458,752; 5.01; 85; 2.30; 2178
10. DOE / NNSA Livermore Nat Lab; Vulcan, BlueGene/Q, Power BQC 16C 1.6 GHz + Custom; USA; 393,216; 4.29; 85; 1.97; 2177
500 (422). Software Comp; HP Cluster; USA; 18,896; 0.309; 48
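The MFlops/Watt column is simply achieved Rmax divided by power; a small sketch reproducing it for the #1 entry, using only the values quoted in the table above.

```python
def mflops_per_watt(rmax_pflops, power_mw):
    """Energy efficiency: Rmax over power, expressed in MFlops per Watt."""
    flops_per_s = rmax_pflops * 1e15   # Pflop/s -> flop/s
    watts = power_mw * 1e6             # MW -> W
    return flops_per_s / watts / 1e6   # flop/s per W -> MFlops/W

# Tianhe-2: 33.9 Pflop/s at 17.8 MW -> roughly 1900 MFlops/W, as in the table.
print(f"{mflops_per_watt(33.9, 17.8):.0f} MFlops/W")
```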

Slide No 8 Seven Top500 Systems in Australia (columns: Rank; Name; Computer; Site; Manufacturer; Total Cores; Rmax; Rpeak)
15. C01N; SuperBlade SBI-7127RG-E / SGI ICE X, Intel Xeon E5-2695v2 12C 2.4GHz, Infiniband FDR, Intel Xeon Phi 7120P / NVIDIA M2090; Tulip Trading; Supermicro/SGI; 265,440; 3,521,000; 4,470,408
57. Magnus; Cray XC40, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect; Pawsey SC Centre, WA; Cray Inc.; 35,712; 1,097,558; 1,485,619
71. Fujitsu PRIMERGY CX250 S1, Xeon E5-2670 8C 2.600GHz, Infiniband FDR; ANU; Fujitsu; 53,504; 978,600; 1,112,883
100. Avoca; BlueGene/Q, Power BQC 16C 1.60GHz, Custom; Victorian Life Sci Comp Initiative; IBM; 65,536; 715,551; 838,861
199. CSIRO GPU Cluster; Nitro G16 3GPU, Xeon E5-2650 8C 2GHz, Infiniband FDR, Nvidia K20m; CSIRO; Xenon Systems; 6,875; 335,300; 472,497.5
406. Sukuriputo Okane; SGI ICE X, Intel Xeon E5-2695v2 12C 2.4GHz, Infiniband FDR, NVIDIA 2090; C01N; SGI; 46,400; 192,370; 1,305,928
408. Galaxy; Cray XC30, Intel Xeon E5-2692v2 10C 3.000GHz, Aries interconnect; Pawsey SC Centre, WA; Cray Inc.; 9,440; 192,100; 226,560

Slide No 9 Accelerators (53 systems). Chart: number of Top500 systems with accelerators, 2006-2013, rising to the 53 systems on the current list. Breakdown: Intel MIC (13), Clearspeed CSX600 (0), ATI GPU (2), IBM PowerXCell 8i (0), NVIDIA 2070 (4), NVIDIA 2050 (7), NVIDIA 2090 (11), NVIDIA K20 (16). By country: 19 US, 9 China, 6 Japan, 4 Russia, 2 France, 2 Germany, 2 India, 1 Italy, 1 Poland, 1 Australia, 2 Brazil, 1 Saudi Arabia, 1 South Korea, 1 Spain, 2 Switzerland, 1 UK.

Slide No 10 Processors / Systems. Pie chart of processor family share across the Top500. Legend: Intel SandyBridge, Intel Nehalem, AMD x86_64, PowerPC, Power, Intel Core, Sparc, Others. Shares shown: 55%, 23%, 10%, 4%, 4%, 2%, 1%, 1%.

Slide No 11 Vendors / System Share: HP 196 systems (39%), IBM 164 (33%), Cray Inc. 48 (9%), SGI 17 (3%), Bull 14 (3%), Fujitsu 8 (2%), Dell 8 (2%), NUDT 4 (1%), Hitachi 4 (1%), NEC 4 (1%), Others 33 (6%).

Slide No 12 Countries Share (absolute counts): US: 267, China: 63, Japan: 28, UK: 23, France: 22, Germany:

Slide No 13 Performance Development in Top500. Log-scale chart (100 Mflop/s up to 1 Eflop/s and beyond), 1994-2020, extrapolating the N=1 and N=500 trend lines toward exascale.

Slide No 14 Today's #1 System. Comparison of the 2014-15 #1 system (Tianhe-2) with a projected 2020-2022 exascale system; the last value in each row is the difference between today and exa.
- System peak: 55 Pflop/s vs. 1 Eflop/s (~20x)
- Power: 18 MW (3 Gflops/W) vs. ~20 MW (50 Gflops/W) (O(1), ~15x)
- System memory: 1.4 PB (1.024 PB CPU + .384 PB CoP) vs. 32-64 PB (~50x)
- Node performance: 3.43 TF/s (.4 CPU + 3 CoP) vs. 1.2 or 15 TF/s (O(1))
- Node concurrency: 24 cores CPU + 171 cores CoP vs. O(1k) or 10k (~5x - ~50x)
- Node interconnect BW: 6.36 GB/s vs. 200-400 GB/s (~40x)
- System size (nodes): 16,000 vs. O(100,000) or O(1M) (~6x - ~60x)
- Total concurrency: 3.12 M (12.48M threads, 4/core) vs. O(billion) (~100x)
- MTTF: few / day vs. O(<1 day) (O(?))

Slide No 15 Exascale System Architecture with a cap of $200M and 20MW. Same comparison of the 2014-15 #1 system (Tianhe-2) with a 2020-2022 exascale system; the last value in each row is the difference between today and exa.
- System peak: 55 Pflop/s vs. 1 Eflop/s (~20x)
- Power: 18 MW (3 Gflops/W) vs. ~20 MW (50 Gflops/W) (O(1), ~15x)
- System memory: 1.4 PB (1.024 PB CPU + .384 PB CoP) vs. 32-64 PB (~50x)
- Node performance: 3.43 TF/s (.4 CPU + 3 CoP) vs. 1.2 or 15 TF/s (O(1))
- Node concurrency: 24 cores CPU + 171 cores CoP vs. O(1k) or 10k (~5x - ~50x)
- Node interconnect BW: 6.36 GB/s vs. 200-400 GB/s (~40x)
- System size (nodes): 16,000 vs. O(100,000) or O(1M) (~6x - ~60x)
- Total concurrency: 3.12 M (12.48M threads, 4/core) vs. O(billion) (~100x)
- MTTF: few / day vs. many / day (O(?))

Slide No 16 ORNL's Titan Hybrid System: Cray XK7 with AMD Opteron and NVIDIA Tesla processors. Footprint: 4,352 ft² (404 m²).
SYSTEM SPECIFICATIONS:
- Peak performance of 27 PF (24.5 Pflop/s GPU + 2.6 Pflop/s AMD)
- 18,688 compute nodes, each with a 16-core AMD Opteron CPU, an NVIDIA Tesla K20x GPU, and 32 + 6 GB memory
- 512 service and I/O nodes
- 200 cabinets
- 710 TB total system memory
- Cray Gemini 3D torus interconnect
- 9 MW peak power
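As a rough consistency check of the quoted peak, the sketch below rebuilds it from the node count, taking the K20X at roughly 1.31 Tflop/s DP and assuming a 2.2 GHz clock with 4 flops/cycle per Opteron core; the CPU clock and flops/cycle figures are illustrative assumptions, not from the slide.

```python
NODES = 18_688                       # compute nodes, from the slide

gpu_tflops = 1.31                    # NVIDIA K20X peak, Tflop/s DP
cpu_tflops = 16 * 2.2e9 * 4 / 1e12   # 16 cores x 2.2 GHz x 4 flops/cycle (assumed), Tflop/s DP

print(f"GPU total:    {NODES * gpu_tflops / 1000:.1f} Pflop/s")                 # ~24.5
print(f"CPU total:    {NODES * cpu_tflops / 1000:.1f} Pflop/s")                 # ~2.6
print(f"System peak:  {NODES * (gpu_tflops + cpu_tflops) / 1000:.1f} Pflop/s")  # ~27
```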

Slide No 17 Summary. Major challenges are ahead for extreme computing: parallelism, hybrid architectures, fault tolerance, power, and many others not discussed here. We will need completely new approaches and technologies to reach the exascale level.

Slide No 19 To be published in the January 2011 issue of The International Journal of High Performance Computing Applications. "We can only see a short distance ahead, but we can see plenty there that needs to be done." Alan Turing (1912-1954). www.exascale.org

Slide No 20 Technology Trends: Microprocessor Capacity. Gordon Moore (co-founder of Intel), Electronics Magazine, 1965: the number of devices per chip doubles every 18 months, i.e. 2X transistors/chip every 1.5 years; this came to be called "Moore's Law." Microprocessors have become smaller, denser, and more powerful. Not just processors: bandwidth, storage, etc. 2X memory and processor speed, and ½ the size, cost, and power, every 18 months.

Slide No 21 Moore's Law is Alive and Well. Chart: transistor counts (in thousands) per microprocessor, 1970-2010, on a log scale, continuing to grow exponentially. Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović. Slide from Kathy Yelick.

Slide No 22 But Clock Frequency Scaling Has Been Replaced by Scaling Cores / Chip. Chart: transistors (in thousands), clock frequency (MHz), and cores per chip, 1970-2010; 15 years of exponential frequency growth (~2x per year) has ended. Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović. Slide from Kathy Yelick.

Slide No 23 Performance Has Also Slowed, Along with Power. Chart: transistors (in thousands), frequency (MHz), power (W), and cores, 1970-2010. Power is the root cause of all this; a hardware issue just became a software problem. Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović. Slide from Kathy Yelick.

Slide No 24 Power Cost of Frequency. Power ≈ Voltage² × Frequency (V²F); Frequency ∝ Voltage; therefore Power ∝ Frequency³.

Slide No 25 Power Cost of Frequency. Power ≈ Voltage² × Frequency (V²F); Frequency ∝ Voltage; therefore Power ∝ Frequency³.
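A minimal numeric sketch of the relationship above: dynamic power scales roughly as V²F, and since attainable frequency tracks voltage, power grows roughly with the cube of frequency; the flip side, which motivates multicore, is that two slower cores can match one fast core's throughput for less power. The scaling factors below are illustrative only.

```python
def relative_power(freq_scale):
    """Relative dynamic power when frequency is scaled by freq_scale.

    P ~ V^2 * F and attainable F ~ V, so P ~ F^3.
    """
    return freq_scale ** 3

# Raising the clock by 20% costs roughly 73% more power:
print(f"1.2x frequency -> {relative_power(1.2):.2f}x power")

# Two cores at 0.75x clock deliver 2 * 0.75 = 1.5x the throughput of one
# full-speed core, for about 2 * 0.75^3 = 0.84x of its power.
print(f"2 cores @ 0.75x: throughput {2 * 0.75:.2f}x, power {2 * relative_power(0.75):.2f}x")
```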

Slide No 26 Looking at the Gordon Bell Prize (recognizes outstanding achievement in high-performance computing applications and encourages development of parallel processing):
- 1 GFlop/s; 1988; Cray Y-MP; 8 processors. Static finite element analysis.
- 1 TFlop/s; 1998; Cray T3E; 1,024 processors. Modeling of metallic magnet atoms, using a variation of the locally self-consistent multiple scattering method.
- 1 PFlop/s; 2008; Cray XT5; 1.5x10^5 processors. Superconductive materials.
- 1 EFlop/s; ~2018; ?; 1x10^7 processors (10^9 threads).

Slide No 28 Hardware and System Software Scalability
Barriers:
- Fundamental assumptions of system software architecture did not anticipate exponential growth in parallelism.
- The number of components and MTBF changes the game.
Technical focus areas: system hardware scalability, system software scalability, applications scalability.
Technical gap: 1000x improvement in system software scaling; 100x improvement in system software reliability.
Chart: average number of cores per supercomputer (Top20 of the Top500), rising toward 100,000.

Slide No 31 Commodity plus Accelerator (today 88 of the Top500 systems)
- Commodity CPU: Intel Xeon, 8 cores, 3 GHz, 8 x 4 ops/cycle = 96 Gflop/s (DP).
- Accelerator (GPU): Nvidia K20X Kepler, 2688 CUDA cores (192 CUDA cores/SMX), 0.732 GHz, 2688 x 2/3 ops/cycle = 1.31 Tflop/s (DP), 6 GB memory.
- Interconnect: PCI-e Gen2/3, 16 lanes, 64 Gb/s (8 GB/s), 1 GW/s.
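The peak figures above come straight from cores × clock × flops per cycle; a short sketch reproducing both numbers (the 2/3 ops/cycle factor for the K20X is the slide's accounting for DP throughput across its CUDA cores).

```python
def peak_gflops(cores, ghz, flops_per_cycle):
    """Peak rate = number of cores x clock rate (GHz) x flops issued per cycle per core."""
    return cores * ghz * flops_per_cycle

# Commodity CPU: 8 cores x 3 GHz x 4 flops/cycle = 96 Gflop/s DP (as on the slide).
print(f"Xeon: {peak_gflops(8, 3.0, 4):.0f} Gflop/s DP")

# NVIDIA K20X: 2688 CUDA cores x 0.732 GHz x 2/3 flops/cycle ~ 1.31 Tflop/s DP.
print(f"K20X: {peak_gflops(2688, 0.732, 2 / 3) / 1000:.2f} Tflop/s DP")
```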

Slide No 32 Recent Developments
- US DOE is planning to deploy O(100) Pflop/s systems in 2017-2018 ($525M in hardware). Oak Ridge Lab and Lawrence Livermore Lab will receive IBM- and Nvidia-based systems; Argonne Lab will receive an Intel-based system. After this, exaflops.
- The US Dept of Commerce is preventing some Chinese groups from receiving Intel technology, citing concerns about nuclear research being done with the systems (February 2015). On the blockade list: National SC Center Guangzhou (site of Tianhe-2), National SC Center Tianjin (site of Tianhe-1A), National University for Defense Technology (developer), National SC Center Changsha (location of NUDT).
- For the first time, fewer than 50% of the Top500 systems are in the U.S.: 231 of the systems are U.S.-based; China is #3 with 37.

Slide No 33 Today's Multicores: All of the Top500 Systems Are Based on Multicore Processors
- Intel Haswell (18 cores)
- Intel Xeon Phi (60 cores)
- IBM Power 8 (12 cores)
- AMD Interlagos (16 cores)
- Nvidia Kepler (2688 CUDA cores)
- IBM BG/Q (18 cores)
- Fujitsu Venus (16 cores)
- ShenWei (16 cores)

Slide No 34 Problem with Processors. As we put more processing power on the multicore chip, one of the problems is getting the data to the cores. The next generation will be more integrated: a 3D design with a photonic network.

Slide No 35 Peak Performance - Per Core ("We are here"): floating-point operations per cycle per core. Most recent processors have FMA (fused multiply-add), i.e. x ← x + y*z in one cycle.
- Intel Xeon (earlier models) and AMD Opteron with SSE2: 2 flops/cycle DP, 4 flops/cycle SP
- Intel Xeon Nehalem ('09) and Westmere ('10) with SSE4: 4 flops/cycle DP, 8 flops/cycle SP
- Intel Xeon Sandy Bridge ('11) and Ivy Bridge ('12) with AVX: 8 flops/cycle DP, 16 flops/cycle SP
- Intel Xeon Haswell ('13) and Broadwell ('14) with AVX2: 16 flops/cycle DP, 32 flops/cycle SP
- Xeon Phi (per core): 16 flops/cycle DP, 32 flops/cycle SP
- Intel Xeon Skylake ('15): 32 flops/cycle DP, 64 flops/cycle SP
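To turn the per-core figures into socket peaks, multiply flops/cycle by clock rate and core count. The sketch below does this with the DP values listed above; the 12-core, 2.5 GHz socket is an illustrative assumption, not a specific product.

```python
# DP flops/cycle/core by processor generation, as listed on the slide.
FLOPS_PER_CYCLE_DP = {
    "SSE2 (early Xeon / Opteron)": 2,
    "SSE4 (Nehalem / Westmere)": 4,
    "AVX (Sandy Bridge / Ivy Bridge)": 8,
    "AVX2 (Haswell / Broadwell, Xeon Phi)": 16,
    "Skylake": 32,
}

def socket_peak_gflops(cores, ghz, flops_per_cycle):
    """Per-socket DP peak = cores x clock (GHz) x flops/cycle/core."""
    return cores * ghz * flops_per_cycle

# Hypothetical 12-core, 2.5 GHz socket at each generation:
for gen, fpc in FLOPS_PER_CYCLE_DP.items():
    print(f"{gen:38s} {socket_peak_gflops(12, 2.5, fpc):6.0f} Gflop/s DP")
```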

Slide No 36 Memory Transfer (It's All About Data Movement). Example on my laptop, with one level of memory:
- CPU: Intel Core i7 4850HQ (Haswell), 2.3 GHz (Turbo Boost 3.5 GHz), 56 GFLOP/s per core x 2 cores
- Cache: 6 MB
- Main memory: 8 GB at 25.6 GB/s
(Latency is omitted here.) The model IS simplified (see next slide), but it provides an upper bound on performance: we will never go faster than what the model predicts. (And, of course, we can go slower.)
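A minimal sketch of the bound this one-level memory model implies, in the style of a roofline: attainable performance is the smaller of the compute peak and bandwidth × arithmetic intensity. The peak and bandwidth come from the slide; the two arithmetic-intensity values are illustrative.

```python
PEAK_GFLOPS = 56.0 * 2   # 56 Gflop/s per core x 2 cores, from the slide
BW_GBYTES_S = 25.6       # main-memory bandwidth in GB/s, from the slide

def attainable_gflops(flops_per_byte):
    """Upper bound from the one-level memory model:
    min(compute peak, memory bandwidth x arithmetic intensity)."""
    return min(PEAK_GFLOPS, BW_GBYTES_S * flops_per_byte)

# A daxpy-like kernel moves ~24 bytes per 2 flops -> severely memory bound.
print(f"daxpy-like:   {attainable_gflops(2 / 24):.1f} Gflop/s")
# A well-blocked matrix multiply can reach high intensity -> compute bound.
print(f"blocked GEMM: {attainable_gflops(8.0):.1f} Gflop/s")
```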