Lecture 26: The Future of High-Performance Computing
Parallel Computer Architecture and Programming
CMU 15-418/15-618, Spring 2017

Comparing Two Large-Scale Systems
Oak Ridge Titan: a monolithic supercomputer (3rd fastest in the world), designed for compute-intensive applications.
Google Data Center: servers to support millions of customers, designed for data collection, storage, and analysis.

Computing Landscape
(Chart of data intensity vs. computational intensity.)
Internet-Scale Computing (Google Data Center): web search, mapping/directions, language translation, video streaming.
Cloud Services. Personal Computing.
Traditional Supercomputing (Oak Ridge Titan): modeling & simulation-driven science & engineering.

Supercomputing Landscape
(Chart of data intensity vs. computational intensity, highlighting Oak Ridge Titan: modeling & simulation-driven science & engineering.)

Supercomputer Applications
Simulation-based modeling in science, industrial products, and public health: take the system structure + initial conditions + transition behavior, discretize time and space, and run the simulation to see what happens.
Requirements: the model accurately reflects the actual system, and the simulation faithfully captures the model.

Titan Hardware
18,688 nodes connected by a local network; each node pairs a CPU with a GPU.
Each node: AMD 16-core processor, NVIDIA Graphics Processing Unit, 38 GB DRAM, no disk drive.
Overall: 7 MW, $200M.

Titan Node Structure: CPU
16 cores sharing a common DRAM memory; supports multithreaded programming.
~0.16 x 10^12 floating-point operations per second (FLOPS) peak performance.

Titan Node Structure: GPU
Kepler GPU with 14 multiprocessors, each with 12 groups of 16 stream processors (14 x 12 x 16 = 2,688).
Single-Instruction, Multiple-Data parallelism: a single instruction controls all processors in a group.
4.0 x 10^12 FLOPS peak performance.
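
The peak figure can be roughly reconstructed from the stream-processor count; a quick sketch, where the ~732 MHz clock and 2 single-precision FLOPs per stream processor per cycle (one fused multiply-add) are assumptions about the K20X, not stated on the slide:

    # Rough arithmetic behind the GPU peak figure above.
    stream_processors = 14 * 12 * 16          # 2,688 stream processors
    clock_hz = 732e6                          # assumed core clock
    flops_per_sp_per_cycle = 2                # assumed: one fused multiply-add
    print(stream_processors * clock_hz * flops_per_sp_per_cycle)   # ~3.9e12 FLOPS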

Titan Programming: Principle
Solving a problem over a grid (e.g., a finite-element system), simulating its operation over time.
Bulk synchronous model: partition the grid into p regions for a p-node machine, and map one region per processor.

Titan Programming: Principle (cont.)
Bulk synchronous model: map one region per processor, then alternate between two phases:
all nodes compute the behavior of their region (performed on the GPUs), and
all nodes communicate values at the region boundaries.
(Diagram: processors P1..P5 alternating compute and communicate phases.)
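
A minimal sketch of this compute/communicate alternation, assuming mpi4py and a toy 1D region with one ghost cell on each side (the stencil and sizes are illustrative; real Titan codes run the compute phase on the GPU over 3D domains). Launch under mpiexec with several ranks:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    left, right = rank - 1, rank + 1

    u = np.random.rand(1002)                  # this node's region, plus 2 ghost cells

    for step in range(100):
        # Compute phase: every node updates its own region (on Titan, on the GPU).
        u[1:-1] = 0.5 * (u[:-2] + u[2:])
        # Communicate phase: exchange values at the region boundaries.
        if left >= 0:
            comm.Sendrecv(sendbuf=u[1:2], dest=left, recvbuf=u[0:1], source=left)
        if right < size:
            comm.Sendrecv(sendbuf=u[-2:-1], dest=right, recvbuf=u[-1:], source=right)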

Bulk Synchronous Performance
(Diagram: processors P1..P5 alternating compute and communicate phases.)
Performance is limited by the slowest processor, so strive to keep the machine perfectly balanced:
engineer the hardware to be highly reliable, tune the software to be as regular as possible, and
eliminate noise (operating system events, extraneous network activity).

Titan Programming: Reality
System level: the Message-Passing Interface (MPI) supports node computation, synchronization, and communication.
Node level: OpenMP supports thread-level operation of the node CPU; CUDA is the programming environment for the GPUs.
Performance degrades quickly without perfect balance among memories and processors.
Result: a single program is a complex combination of multiple programming paradigms, and tends to be optimized for a specific hardware configuration.

MPI Fault Tolerance
(Diagram: processors P1..P5 checkpoint, compute & communicate, then restore after a failure; the computation since the last checkpoint is wasted.)
Checkpoint: periodically store the state of all processes (significant I/O traffic).
Restore: when a failure occurs, reset the state to that of the last checkpoint; all intervening computation is wasted.
Performance scaling: very sensitive to the number of failing components.
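
A sketch of the checkpoint/restart pattern described above; the file name, interval, and stand-in compute step are illustrative (real MPI codes write per-rank checkpoints to a parallel file system, which is the I/O cost noted on the slide):

    import os
    import numpy as np

    CKPT = "rank0_checkpoint.npz"            # hypothetical per-rank checkpoint file

    def compute_and_communicate(u):
        # Stand-in for one bulk-synchronous iteration (see the earlier sketch).
        return 0.99 * u

    # Restore: if a checkpoint exists, reset state to it; otherwise start fresh.
    if os.path.exists(CKPT):
        saved = np.load(CKPT)
        u, start = saved["u"], int(saved["step"]) + 1
    else:
        u, start = np.ones(1000), 0

    for step in range(start, 1000):
        u = compute_and_communicate(u)
        if step % 100 == 0:                  # Checkpoint: periodically store all state
            np.savez(CKPT, u=u, step=step)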

Supercomputer Programming Model
(Stack diagram: application programs and software packages sit on a machine-dependent programming model, directly on the hardware.)
Programs are written on top of the bare hardware.
Performance: low-level programming to maximize node performance; keep everything globally synchronized and balanced.
Reliability: a single failure causes major delay; engineer the hardware to minimize failures.

Data-Intensive Computing Landscape
(Chart of data intensity vs. computational intensity: internet-scale computing at the Google Data Center, i.e. web search, mapping/directions, language translation, video streaming, plus cloud services and personal computing.)

Internet Computing
Web search: aggregate text data from across the WWW; no definition of correct operation; does not need real-time updating.
Mapping services: huge amount of (relatively) static data; each customer requires individualized computation.
Online documents: must be stored reliably; must support real-time updating; (relatively) small data volumes.

Other Data-Intensive Computing Applications
Wal-Mart: 267 million items/day sold at 6,000 stores; HP built them a 4 PB data warehouse; the data is mined to manage the supply chain, understand market trends, and formulate pricing strategies.
LSST: a Chilean telescope that will scan the entire sky every 3 days with a 3.2-gigapixel digital camera, generating 30 TB/day of image data.

Data-Intensive Application Characteristics
Diverse classes of data: structured & unstructured; high & low integrity requirements.
Diverse computing needs: localized & global processing; numerical & non-numerical; real-time & batch processing.

Google Data Centers
The Dalles, Oregon: hydroelectric power at ~2¢/kWh; 50 megawatts, enough to power 60,000 homes.
Engineered for low cost, modularity & power efficiency.
Container: 1,160 server nodes, 250 kW.

Google Cluster
Typically 1,000-2,000 nodes connected by a local network.
Each node contains 2 multicore CPUs, 2 disk drives, and DRAM.

Hadoop Project
File system with files distributed across the nodes of the cluster:
multiple copies of each file are stored (typically 3), so if one node fails the data is still available;
logically, any node has access to any file, though it may need to fetch it across the network.
Map/Reduce programming environment: software manages the execution of tasks on the nodes.

Map/Reduce Operation
(Diagram: a stream of map and reduce tasks.)
Characteristics: computation is broken into many short-lived tasks (mapping, reducing); tasks are mapped onto processors dynamically; disk storage holds intermediate results.
Strengths: flexibility in placement, scheduling, and load balancing; can access large data sets.
Weaknesses: higher overhead; lower raw performance.
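
To make the map/reduce style concrete, a minimal in-process word-count sketch in plain Python; this illustrates the programming pattern, not Hadoop's actual API, and the shuffle/group step that Hadoop performs across nodes is done here with a dictionary:

    from collections import defaultdict
    from itertools import chain

    def map_task(chunk):
        # Map: emit (word, 1) pairs for one chunk of input text.
        return [(word, 1) for word in chunk.split()]

    def reduce_task(word, counts):
        # Reduce: combine all counts emitted for one word.
        return word, sum(counts)

    chunks = ["the quick brown fox", "the lazy dog", "the end"]

    groups = defaultdict(list)                # shuffle: group intermediate pairs by key
    for word, count in chain.from_iterable(map_task(c) for c in chunks):
        groups[word].append(count)

    print(dict(reduce_task(w, c) for w, c in groups.items()))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1, 'end': 1}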

Map/Reduce Fault Tolerance
Data integrity: multiple copies of each file are stored, including the intermediate results of each map/reduce step, giving continuous checkpointing.
Recovering from failure: simply recompute the lost result; the effect is localized, and the dynamic scheduler keeps all processors busy.
Software is used to build a reliable system on top of unreliable hardware.

Cluster Programming Model
(Stack diagram: application programs on a machine-independent programming model, on a runtime system, on the hardware.)
Application programs are written in terms of high-level operations on data; the runtime system controls scheduling, load balancing, etc.
Scaling challenges: the centralized scheduler forms a bottleneck; copying to/from disk is very costly; it is hard to limit data movement, which is a significant performance factor.

Recent Programming Systems
Spark: a project at U.C. Berkeley that has grown to have a large open-source community.
GraphLab: started as a project at CMU by Carlos Guestrin; an environment for describing machine-learning algorithms, where sparse matrix structure is described by a graph and computation is based on updating node values.
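
A minimal PySpark word count, as a sketch of the high-level, data-centric style these systems provide (assumes a local Spark installation; the tiny in-memory input is illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "wordcount")
    lines = sc.parallelize(["the quick brown fox", "the lazy dog"])
    counts = (lines.flatMap(lambda line: line.split())   # map: emit words
                   .map(lambda word: (word, 1))          # key each word
                   .reduceByKey(lambda a, b: a + b))     # reduce: sum counts per word
    print(counts.collect())                              # [('the', 2), ('quick', 1), ...]
    sc.stop()

The runtime system, not the programmer, decides where each task runs and how the data is partitioned, which is exactly the scheduling/load-balancing point made above.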

Computing Landscape Trends
(Chart of data intensity vs. computational intensity: traditional supercomputing, i.e. modeling & simulation-driven science & engineering, is moving toward mixing simulation with data analysis.)

Combining Simulation with Real Data
Limitations: with simulation alone, it is hard to know if the model is correct; with data alone, it is hard to understand causality and "what if" questions.
Combination: check and adjust the model during simulation.

Real-Time Analytics
Millennium XXL simulation (2010): 3 x 10^9 particles; a simulation run of 9.3 days on 12,228 cores; 700 TB of total data generated, saved at only 4 time points (70 TB).
Large-scale simulations generate large data sets. What if data analysis could be performed while the simulation is running, with the simulation engine feeding an analytic engine?

Computing Landscape Trends
(Chart of data intensity vs. computational intensity: internet-scale computing at the Google Data Center is moving toward sophisticated data analysis.)

Example Analytic Applications
Microsoft Project Adam: image -> classifier -> description.
Machine translation: English text -> transducer -> German text.

Data Analysis with Deep Neural Networks
Task: compute a classification of a set of input signals.
Training: use many training samples of the form input / desired output; compute weights that minimize the classification error.
Operation: propagate signals from input to output.
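
A minimal two-layer network in NumPy to make "propagate signals from input to output" and "compute weights that minimize classification error" concrete; the layer sizes, sigmoid units, and squared-error loss are illustrative choices, not the architecture of Project Adam or DeepFace:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(64, 8))                           # 64 training inputs, 8 features each
    y = (X.sum(axis=1, keepdims=True) > 0).astype(float)   # toy desired outputs

    W1 = rng.normal(size=(8, 16))
    W2 = rng.normal(size=(16, 1))
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    for step in range(1000):
        # Operation: propagate signals from input to output.
        h = sigmoid(X @ W1)
        out = sigmoid(h @ W2)
        # Training: gradient descent on the classification error.
        delta2 = (out - y) * out * (1 - out)
        delta1 = (delta2 @ W2.T) * h * (1 - h)
        W2 -= 0.01 * (h.T @ delta2)
        W1 -= 0.01 * (X.T @ delta1)

    out = sigmoid(sigmoid(X @ W1) @ W2)                    # final forward pass
    print(np.mean((out > 0.5) == (y > 0.5)))               # accuracy on the toy training set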

DNN Application Example
Facebook DeepFace architecture (figure).

Training DNNs
(Charts: growth in model size, training data, and training effort.)
Characteristics: iterative numerical algorithm; regular data organization.
Project Adam training: 2B connections, 15M images, 62 machines, 10 days.

Computing Landscape Trends: Convergence?
(Chart of data intensity vs. computational intensity: internet-scale computing at the Google Data Center is moving toward sophisticated data analysis, while traditional supercomputing, i.e. modeling & simulation-driven science & engineering, is moving toward mixing simulation with real-world data; the two appear to be converging.)

Challenges for Convergence
Hardware: supercomputers are customized and optimized for reliability; data center clusters are consumer grade and optimized for low cost.
Run-time system: on supercomputers it is a source of noise and uses static scheduling; on clusters it provides reliability and dynamic allocation.
Application programming: supercomputers use a low-level, processor-centric model; clusters use a high-level, data-centric model.

Summary: Computation/Data Convergence
Two important classes of large-scale computing: computationally intensive supercomputing, and data-intensive processing (internet companies plus many other applications).
They followed different evolutionary paths: supercomputers aim to get maximum performance from the available hardware, while data center clusters maximize cost/performance over a variety of data-centric tasks. This yielded different approaches to hardware, runtime systems, and application programming.
A convergence would have important benefits for both computational and data-intensive applications, but it is not clear how to achieve it.

GETTING TO EXASCALE

World's Fastest Machines
Top500 ranking: the High-Performance LINPACK (HPL) benchmark. Solve an N x N linear system with some variant of Gaussian elimination, requiring (2/3)N^3 + O(N^2) operations; the vendor can choose N to give the best performance (in FLOPS).
Alternative: the High-Performance Conjugate Gradient (HPCG) benchmark. Solve a sparse linear system (~27 nonzeros/row) with an iterative method, which has a higher communication/compute ratio.
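
How an HPL score follows from the (2/3)N^3 operation count above; the N and run time here are made-up illustrative values, not any machine's actual benchmark configuration:

    N = 10_000_000                   # problem size chosen by the vendor
    time_seconds = 8 * 3600          # wall-clock time of the benchmark run
    hpl_flops = (2 / 3) * N**3 / time_seconds
    print(f"{hpl_flops:.2e} FLOPS")  # ~2.3e16, i.e. roughly 23 petaflops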

Sunway TaihuLight (Wuxi, China; operational 2016)
Machine: 40,960 processor chips; each chip contains 256 compute cores + 4 management cores; each core has a 4-wide SIMD vector unit (8 FLOPS/clock cycle).
Performance: HPL 93.0 PF (world's top); HPCG 0.37 PF; 15.4 MW; 1.31 PB DRAM.
Ratios (big is better): GigaFLOPS/Watt 6.0; Bytes/FLOP 0.014.
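
The two ratios follow directly from the numbers on this slide; a quick check:

    hpl_flops   = 93.0e15                  # HPL: 93.0 PF
    power_watts = 15.4e6                   # 15.4 MW
    dram_bytes  = 1.31e15                  # 1.31 PB
    print(hpl_flops / power_watts / 1e9)   # ~6.0 GigaFLOPS/Watt
    print(dram_bytes / hpl_flops)          # ~0.014 Bytes/FLOP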

Tianhe-2 (Guangzhou, China; operational 2013)
Machine: 16,000 nodes, each with 2 Intel Xeons + 3 Intel Xeon Phis.
Performance: HPL 33.9 PF; HPCG 0.58 PF (world's best); 17.8 MW; 1.02 PB DRAM.
Ratios (big is better): GigaFLOPS/Watt 1.9; Bytes/FLOP 0.030.

Titan (Oak Ridge, TN; operational 2012)
Machine: 18,688 nodes, each with a 16-core Opteron + Tesla K20X GPU.
Performance: HPL 17.6 PF; HPCG 0.32 PF; 8.2 MW; 0.71 PB DRAM.
Ratios (big is better): GigaFLOPS/Watt 2.2; Bytes/FLOP 0.040.

How Powerful is a Titan Node?
Titan node:
CPU: Opteron 6274 (Nov. 2011, 32 nm technology), 2.2 GHz, 16 cores (no hyperthreading), 16 MB L3 cache, 32 GB DRAM.
GPU: Kepler K20X (Feb. 2013, 28 nm), CUDA capability 3.5, 3.9 TF peak (SP).
GHC machine:
CPU: Xeon E5-1660 (June 2016, 14 nm technology), 3.2 GHz, 8 cores (2x hyperthreaded), 20 MB L3 cache, 32 GB DRAM.
GPU: GeForce GTX 1080 (May 2016, 16 nm), CUDA capability 6.0, 8.2 TF peak (SP).

Performance of Top 500 Machines
(Chart from a presentation by Jack Dongarra.) Machines fall far off their peak when running HPCG.

What Lies Ahead: DOE CORAL Program
Announced Nov. 2014, with delivery in 2018.
Vendor #1: IBM + NVIDIA + Mellanox; 3,400 nodes; 10 MW; 150-300 PF peak.
Vendor #2: Intel + Cray; ~50,000 nodes (Xeon Phis); 13 MW; >180 PF peak.

TECHNOLOGY CHALLENGES

Moore's Law
The basis for ever-increasing computer power; we've come to expect that it will continue.

Challenges to Moore's Law: Technical
By 2022, transistors are projected to have 4 nm feature sizes, while the Si lattice spacing is 0.54 nm: feature sizes must continue to shrink, approaching the atomic scale.
Difficulties: lithography at such small dimensions; statistical variations among devices.

Challenges to Moore's Law: Economic
Growing capital costs: a state-of-the-art fab line costs ~$20B, so very high volumes are needed to amortize the investment; this has led to major consolidations.

Dennard Scaling
Due to Robert Dennard, IBM, 1974; quantifies the benefits of Moore's Law.
How to shrink an IC process: reduce horizontal and vertical dimensions by k; reduce voltage by k.
Outcomes: devices/chip increase by k^2; clock frequency increases by k; power/chip stays constant.
Significance: increased capacity and performance with no increase in power.
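
A quick check of the "power/chip constant" outcome, using the standard dynamic-power model P = C * V^2 * f per device (the model itself is an assumption of this sketch, not stated on the slide):

    k = 1.4                                    # example scaling factor for one generation
    C, V, f, devices = 1.0, 1.0, 1.0, 1.0      # normalized pre-shrink values

    C2, V2, f2 = C / k, V / k, f * k           # dimensions and voltage shrink by k, clock rises by k
    devices2 = devices * k**2                  # k^2 more devices fit per chip

    print(devices  * C  * V**2  * f)           # 1.0
    print(devices2 * C2 * V2**2 * f2)          # 1.0 -> power per chip unchanged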

End of Dennard Scaling
What happened? Voltage can't drop below ~1 V, and the limit of power/chip was reached in 2004. There is more logic on chip (Moore's Law), but it can't be made to run faster. The response has been to increase cores/chip.

Research Challenges
Supercomputers: can they be made more dynamic and adaptive (a requirement for future scalability)? Can they be made easier to program, with abstract, machine-independent programming models?
Data-intensive computing: can it be adapted to provide better computational performance? Can it make better use of data locality (the performance- and power-limiting factor)?
Technology / economics: what will we do when Moore's Law comes to an end for CMOS? How can we ensure a stable manufacturing environment?