Steve Scott, Tesla CTO, SC11, November 15, 2011


What goal do these products have in common? Performance / W

Exaflop Expectations. [Chart: extrapolation from the CM5 (~200 KW) and the K Computer (~10 MW) toward the first exaflop computer. Note: systems are not constant in size, cost, or power.]

The Road Ahead is Steep.

In the Good Old Days, leakage was not important and voltage scaled with feature size. Halving the feature size L gave classic Dennard scaling:

    L' = L/2,  V' = V/2
    E' = C'V'² = E/8   (energy per operation)
    f' = 2f            (clock frequency)
    D' = 1/L'² = 4D    (transistor density)
    P' = P             (total chip power)

Halve L and get 4x the transistors and 8x the capability for the same power! That carried us from megaflops to gigaflops to teraflops, and almost to petaflops. Technology was giving us 68% per year in perf/W; processors realized ~50% per year, spending the difference on single-thread performance.

The New Reality: leakage has put a floor under threshold voltage, largely ending voltage scaling. Now halving L yields only 2x the capability for the same power, and technology will give us only 19% per year in perf/W.
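The slide's arithmetic can be checked directly; a minimal sketch in C, taking the scaling rules above as givens (capacitance scales with L, energy per op is CV²):

    #include <stdio.h>

    int main(void) {
        /* Classic Dennard scaling: halve feature size L, halve voltage V. */
        double L = 0.5;               /* L' = L/2                           */
        double V = 0.5;               /* V' = V/2                           */
        double C = L;                 /* capacitance scales with L          */
        double E = C * V * V;         /* energy/op: E' = C'V'^2 = 1/8       */
        double f = 1.0 / L;           /* frequency: f' = 2f                 */
        double D = 1.0 / (L * L);     /* density: D' = 1/L'^2 = 4D          */
        double P = D * E * f;         /* power: 4 * (1/8) * 2 = 1, so P'=P  */

        printf("energy/op %gx, frequency %gx, density %gx, power %gx\n",
               E, f, D, P);
        printf("capability (density x frequency): %gx\n", D * f);  /* 8x */
        return 0;
    }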

The Future Belongs to the Efficient. Chips have become power (not area) constrained: as feature size shrinks, transistor density increases quadratically while energy per operation decreases only linearly.

Which Takes More Energy? Performing a 64-bit floating-point FMA:

    893,500.288914668 × 43.90230564772498 = 39,226,722.78026233027699
    39,226,722.78026233027699 + 2.02789331400154 = 39,226,724.80815564

or moving the three 64-bit operands 18 mm across the die? The data movement takes over 4.2x the energy (at 40nm), and it is getting worse: at 10nm the relative cost will be 15x. Loading the data from off-chip takes >>100x the energy. Flops are cheap. Communication is expensive.
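A fused multiply-add computes a*b + c with a single rounding; the slide's arithmetic can be reproduced with C's standard fma() (a minimal sketch; the operand values are the ones on the slide):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* The three 64-bit operands from the slide. */
        double a = 893500.288914668;
        double b = 43.90230564772498;
        double c = 2.02789331400154;

        /* One fused multiply-add: a*b + c, rounded once. */
        printf("fma(a, b, c) = %.8f\n", fma(a, b, c));
        return 0;
    }

(Link with -lm.) The point of the slide is that this single operation costs far less energy than fetching its 24 bytes of operands from across the die, let alone from off-chip memory.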

Achieving Energy Efficiency Reduce overhead. Minimize data motion.

Multi-core CPUs. Industry has gone multi-core as a first response to power issues: performance through parallelism, dialed-back complexity and clock rate, and exploited locality. But CPUs are fundamentally designed for single-thread performance rather than energy efficiency: fast clock rates with deep pipelines, data and instruction caches optimized for latency, superscalar issue with out-of-order execution, dynamic conflict detection, and lots of prediction and speculative execution. All of that means lots of instruction overhead per operation; less than 2% of chip power today goes to the flops.

CPU (Westmere, 32nm): 1690 pJ/flop, optimized for latency, with caches. GPU (Fermi, 40nm): 225 pJ/flop, optimized for throughput, with explicit management of on-chip memory.

Growing Momentum for GPUs in Supercomputing. Tesla GPUs power 3 of the top 5 systems (November 2011):

#1: K Computer, 88K Fujitsu SPARC CPUs, 10.5 PFLOPS
#2: Tianhe-1A, 7168 Tesla GPUs, 2.6 PFLOPS
#3: Jaguar, 36K AMD Opteron CPUs, 1.8 PFLOPS
#4: Nebulae, 4650 Tesla GPUs, 1.3 PFLOPS
#5: Tsubame 2.0, 4224 Tesla GPUs, 1.2 PFLOPS (the most efficient petaflop system)

Coming: Titan, 18,000 Tesla GPUs, >20 PFLOPS.

NVIDIA GPU Computing Uptake: >300,000,000 compute-capable NVIDIA GPUs; >500,000 NVIDIA software development toolkit downloads; >100,000 active NVIDIA GPU computing developers; >450 universities teaching GPU computing. Widespread adoption by HPC OEMs.

ORNL Adopts GPUs for Next-Gen Supercomputer. Titan, a Cray XK6 with 18,000 Tesla GPUs and 20+ petaflops (~90% of flops from the GPUs), will be 2x faster and 3x more energy efficient (and much smaller!) than the current #1, the K Computer.

NCSA Mixes GPUs into Blue Waters. "By incorporating a future version of the XK6 system, Blue Waters will provide a bridge to the future of scientific computing." (NCSA Director Thom Dunning). The system contains over 30 Cray XK cabinets with over 3000 NVIDIA Tesla GPUs.

What About Programming?

GPU Libraries: Plug In & Play. Examples: cuBLAS for dense linear algebra, libraries of parallel algorithms, and QUDA for lattice QCD.
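As an illustration of the plug-in model, here is a hedged sketch (not from the talk) that offloads a double-precision AXPY, y = alpha*x + y, to the GPU through the cuBLAS host API; error checking is omitted for brevity:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main(void) {
        const int n = 1 << 20;
        const double alpha = 2.0;
        double *x = (double *)malloc(n * sizeof *x);
        double *y = (double *)malloc(n * sizeof *y);
        for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

        /* Stage the operands in device memory. */
        double *dx, *dy;
        cudaMalloc((void **)&dx, n * sizeof *dx);
        cudaMalloc((void **)&dy, n * sizeof *dy);
        cudaMemcpy(dx, x, n * sizeof *dx, cudaMemcpyHostToDevice);
        cudaMemcpy(dy, y, n * sizeof *dy, cudaMemcpyHostToDevice);

        /* The library call replaces a hand-written GPU kernel. */
        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasDaxpy(handle, n, &alpha, dx, 1, dy, 1);

        cudaMemcpy(y, dy, n * sizeof *dy, cudaMemcpyDeviceToHost);
        printf("y[0] = %f\n", y[0]);  /* expect 4.0 */

        cublasDestroy(handle);
        cudaFree(dx); cudaFree(dy);
        free(x); free(y);
        return 0;
    }

Compiled with nvcc (or a host compiler linked against -lcublas -lcudart), the application writes no kernel code at all.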

Directives: Ease of Programming and Portability. Available from PGI, CAPS, and soon Cray. The same pi-by-quadrature loop, written once with OpenMP directives for the CPU and once with Cray's accelerator directives for the CPU + GPU:

OpenMP (CPU):

    #include <stdio.h>

    int main(void) {
        const long n = 1000000;
        double pi = 0.0;
        long i;
    #pragma omp parallel for reduction(+:pi)
        for (i = 0; i < n; i++) {
            double t = (double)((i + 0.5) / n);
            pi += 4.0 / (1.0 + t * t);
        }
        printf("pi = %f\n", pi / n);
        return 0;
    }

Cray accelerator directives (CPU + GPU):

    #include <stdio.h>

    int main(void) {
        const long n = 1000000;
        double pi = 0.0;
        long i;
    #pragma omp acc_region_loop
    #pragma omp parallel for reduction(+:pi)
        for (i = 0; i < n; i++) {
            double t = (double)((i + 0.5) / n);
            pi += 4.0 / (1.0 + t * t);
        }
    #pragma omp end acc_region_loop
        printf("pi = %f\n", pi / n);
        return 0;
    }

OpenACC: Open Parallel Programming Standard. Easy, fast, portable. "OpenACC will enable programmers to easily develop portable applications that maximize the performance and power efficiency benefits of the hybrid CPU/GPU architecture of Titan." (Buddy Bland, Titan Project Director, Oak Ridge National Lab). "OpenACC is a technically impressive initiative brought together by members of the OpenMP Working Group on Accelerators, as well as many others. We look forward to releasing a version of this proposal in the next release of OpenMP." (Michael Wong, CEO, OpenMP Architecture Review Board).
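For comparison with the Cray syntax above, the same pi loop in what became standard OpenACC; a minimal sketch (the directive spelling follows the OpenACC 1.0 specification announced alongside this talk, not a slide from it):

    #include <stdio.h>

    int main(void) {
        const long n = 1000000;
        double pi = 0.0;
        long i;

    /* One directive asks the compiler to offload and parallelize the loop. */
    #pragma acc parallel loop reduction(+:pi)
        for (i = 0; i < n; i++) {
            double t = (i + 0.5) / (double)n;
            pi += 4.0 / (1.0 + t * t);
        }
        printf("pi = %f\n", pi / n);
        return 0;
    }

Built with an OpenACC compiler (e.g., pgcc -acc), the loop runs on the GPU; the same source still compiles and runs serially under any plain C compiler, which is the portability argument the slides are making.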

Focus on Exposing Parallelism. With directives, tuning work focuses on exposing parallelism, which makes codes inherently better. Example: application tuning work using directives for the new Titan system at ORNL.

S3D (researching more efficient combustion with next-generation fuels): tuning the top 3 kernels (90% of runtime) yielded 3 to 6x speedups on CPU+GPU vs. CPU+CPU, and also improved the all-CPU version by 50%.

CAM-SE (answering questions about specific climate change adaptation and mitigation scenarios): tuning the top kernel (50% of runtime) yielded a 6.5x speedup on CPU+GPU vs. CPU+CPU, and improved the CPU version's performance by 100%.

The Future of HPC is Green. We are constrained by power, and you can't simultaneously optimize for single-thread performance and power efficiency. Most work must be done by cores designed for throughput and efficiency, and locality is key: explicit control of the memory hierarchy. GPUs are the path to our tightly-coupled, energy-efficient, hybrid processor future.

Thank You. Questions?