Accelerating High Performance Computing.


1 Accelerating High Performance Computing

2 Computing: The 3rd Pillar of Science
Drug Design: Molecular Dynamics
Seismic Imaging: Reverse Time Migration
Automotive Design: Computational Fluid Dynamics
Medical Imaging: Computed Tomography
Astrophysics: n-body
Options Pricing: Monte Carlo
Product Development: Finite Difference Time Domain
Weather Forecasting: Atmospheric Physics

3 GPU Computing: Bridging the CPU Wall
Source: David Patterson, Director, Parallel Computing Research Laboratory, U.C. Berkeley

4 GPUs = Higher Flops and Memory Bandwidth
Chart series (flops): Double Precision: NVIDIA GPU; Double Precision: x86 CPU
Chart series (memory bandwidth): NVIDIA GPU (ECC off); x86 CPU

5 Tesla: 2-3x Faster GPU Every 2 Years
Chart: DP GFLOPS per Watt by GPU generation (T10, Fermi, Kepler, Maxwell); surviving axis ticks: 2, 4, 16

6 Add GPUs: Accelerate x86 Applications
Diagram labels: CPU, GPU

7 GPUs Accelerate Science
146X: Medical Imaging (U of Utah)
36X: Molecular Dynamics (U of Illinois, Urbana)
18X: Video Transcoding (Elemental Tech)
50X: Matlab Computing (AccelerEyes)
100X: Astrophysics (RIKEN)
149X: Financial Simulation (Oxford)
47X: Linear Algebra (Universidad Jaime)
20X: 3D Ultrasound (Techniscan)
130X: Quantum Chemistry (U of Illinois, Urbana)
30X: Gene Sequencing (U of Maryland)

8 CPU Pizza Delivery
Process: The delivery truck delivers one pizza and then moves to the next house.
Original idea by Jedox

9 NVIDIA GPU Pizza Delivery
Process: Many deliveries are made to many houses at the same time.
Original idea by Jedox

10 The Era of Accelerated Computing is Here
Timeline labels: Era of Vector Computing, Era of Distributed Computing, Era of Accelerated Computing

11 Titan: World's Fastest Supercomputer (2012)
18,688 Tesla K20X GPUs
27 Petaflops Peak: 90% of Performance from GPUs
Petaflops Sustained Performance on Linpack
#3 on the Green 500 list

12 Two Supercomputers Built at the Same Time
Tsubame 2.0: 4,224 Tesla GPUs + 2,816 x86 CPUs; 1.4 Megawatts (2,060 Homes in Japan); World's Greenest Petaflop Supercomputer (2011)
Hopper (NERSC): 12,784 x86 CPUs; 4.0 Megawatts (5,860 Homes in Japan)
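The power figures on this slide can be compared directly. The sketch below only uses the megawatt and homes-equivalent numbers quoted above; the variable names are illustrative, and because the slide gives no per-machine performance figure here, this compares power draw only, not performance per watt.

```python
# Figures quoted on the slide for two machines built at the same time.
tsubame_mw = 1.4       # Tsubame 2.0: 4,224 Tesla GPUs + 2,816 x86 CPUs
hopper_mw = 4.0        # Hopper (NERSC): 12,784 x86 CPUs
tsubame_homes = 2060   # "Homes in Japan" equivalent from the slide
hopper_homes = 5860

# How much more power the CPU-only system draws.
power_ratio = hopper_mw / tsubame_mw
homes_ratio = hopper_homes / tsubame_homes

# Per-home draw implied by the slide's own conversion (kW per home).
kw_per_home = tsubame_mw * 1000 / tsubame_homes

print(f"Hopper draws {power_ratio:.2f}x the power of Tsubame 2.0")
print(f"Homes-equivalent ratio: {homes_ratio:.2f}x")
print(f"Implied draw per home: {kw_per_home:.2f} kW")
```

Both ratios come out close to 2.85, which suggests the slide used the same per-home conversion for both systems.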

13 GPU Supercomputers: More Power Efficient
Chart: Performance and Power for a GPU-CPU Supercomputer versus a CPU-only Supercomputer

14 GPUs are Mainstream
Sectors: Oil & Gas, Edu/Research, Government, Life Sciences, Finance, Manufacturing
Named organizations: Chinese Academy of Sciences, Air Force Research Laboratory, Mass General Hospital, Naval Research Laboratory, Max Planck Institute

15 Explosive Growth of GPU Computing
CUDA Downloads: 150K, then 1.5M
Supercomputers: 1, then 50
University Courses: 60, then 630
Academic Papers: 4,000, then 22,500

16 CUDA Apps Grows 60%, Accelerating Key Apps
Chart: # of Apps by year (2010, 2011, 2012), axis from 0 to 200, annotated with a 40% Increase and a 61% Increase
For App Catalog: com/teslaapps/
Top Supercomputing Apps
Computational Chemistry: AMBER, CHARMM, GROMACS, LAMMPS, NAMD, DL_POLY
Material Science: QMCPACK, Quantum Espresso, GAMESS, Gaussian, NWChem, VASP
Climate & Weather: COSMO, GEOS-5, CAM-SE, NIM, WRF
Physics: Chroma, Denovo, GTC, GTS, ENZO, MILC
CAE: ANSYS Mechanical, MSC Nastran, SIMULIA Abaqus, ANSYS Fluent, OpenFOAM, LS-DYNA
Legend: Accelerated, In Development

17 Accelerated computing

18 NVIDIA GPU Accelerates Computing
Choose the Right Processor for the Right Task
CPU: Several sequential cores
NVIDIA GPU: Hundreds of parallel cores

19 Low Latency or High Throughput?
CPU
  Optimized for low-latency access to cached data sets
  Control logic for out-of-order and speculative execution
GPU
  Optimized for data-parallel, throughput computation
  Architecture tolerant of memory latency
  More transistors dedicated to computation

20 Low Latency or High Throughput?
CPU architecture must minimize latency within each thread.
GPU architecture hides latency with computation from other thread warps.
Diagram: GPU Streaming Multiprocessor (High Throughput Processor) interleaving warps W1 to W4; CPU core (Low Latency Processor) running threads T1 to T4, up to Tn
Legend: Computation Thread/Warp; Processing; Waiting for data; Ready to be processed; Context switch
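The latency-hiding idea on this slide can be illustrated with a very rough round-robin model. This is a sketch, not a description of the real warp scheduler: it assumes every warp alternates a fixed burst of computation with a fixed memory wait, and the cycle counts (20 and 400) are made-up illustrative values, not figures from the slides. The function names are hypothetical.

```python
import math

def utilization(num_warps: int, compute_cycles: int, latency_cycles: int) -> float:
    """Fraction of cycles the multiprocessor is busy in a simple round-robin model.

    Each warp computes for compute_cycles, then waits latency_cycles for data.
    While one warp waits, the others can use the execution units.
    """
    period = compute_cycles + latency_cycles
    return min(1.0, num_warps * compute_cycles / period)

def warps_to_hide_latency(compute_cycles: int, latency_cycles: int) -> int:
    """Smallest warp count that keeps the execution units busy in this model."""
    return math.ceil((compute_cycles + latency_cycles) / compute_cycles)

# Illustrative numbers only (assumptions, not taken from the slides).
compute, latency = 20, 400

for n in (1, 4, 8, 21):
    print(f"{n:2d} warps -> {utilization(n, compute, latency):.0%} busy")

print("warps needed:", warps_to_hide_latency(compute, latency))
```

With a single warp the unit is idle most of the time, which is the situation a CPU core avoids with caches and speculative execution; with enough resident warps the same memory wait is covered by other warps' computation.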

21-23 Processing Flow (across the PCIe Bus)
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
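The three steps can be mimicked in plain Python to make the data movement explicit. This is only a stand-in: no GPU or CUDA library is involved, the "device" is just a separate list, and every function name here (copy_to_device, run_kernel, copy_to_host, scale_and_shift) is hypothetical and chosen for illustration.

```python
def copy_to_device(host_data):
    """Step 1: stand-in for the CPU-to-GPU copy over the PCIe bus."""
    return list(host_data)  # a separate list models separate GPU memory

def run_kernel(device_in, kernel):
    """Step 2: apply the same function to every element, as GPU threads would."""
    return [kernel(x) for x in device_in]

def copy_to_host(device_data):
    """Step 3: stand-in for the GPU-to-CPU copy of the results."""
    return list(device_data)

def scale_and_shift(x):
    """Per-element work; on a GPU each thread would handle one element."""
    return 2.0 * x + 1.0

host_input = [0.0, 1.0, 2.0, 3.0]
device_input = copy_to_device(host_input)                   # step 1
device_output = run_kernel(device_input, scale_and_shift)   # step 2
host_output = copy_to_host(device_output)                   # step 3
print(host_output)
```

The point of the sketch is that steps 1 and 3 are pure data movement, so the work done in step 2 has to be large enough to justify the two copies.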

24 GPU Architecture

25 GPU Architecture: Two Main Components
Global memory
  Analogous to RAM in a CPU server
  Accessible by both GPU and CPU
  Currently up to 6 GB
  Bandwidth currently up to 150 GB/s for Quadro and Tesla products
  ECC on/off option for Quadro and Tesla products
Streaming Multiprocessors (SMs)
  Perform the actual computations
  Each SM has its own: control units, registers, execution pipelines, caches
Diagram labels: DRAM I/F (x6), Giga Thread, HOST I/F, L2

26 GPU Architecture Fermi: Streaming Multiprocessor (SM)
32 CUDA cores per SM
  32 fp32 ops/clock
  16 fp64 ops/clock
  32 int32 ops/clock
2 warp schedulers
  Up to 1536 threads concurrently
4 special-function units
64KB shared mem + L1 cache
32K 32-bit registers
Diagram labels: Instruction Cache; Scheduler and Dispatch (x2); Register File; Load/Store Units x 16; Special Func Units x 4; Interconnect Network; 64K Configurable Cache/Shared Mem; Uniform Cache
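A few derived quantities follow from the per-SM figures on this slide. The sketch below uses only those figures; the one assumption is that "32K" registers means 32 x 1024, and the constant names are illustrative.

```python
# Per-SM figures as listed on the slide.
CUDA_CORES_PER_SM = 32
FP32_OPS_PER_CLOCK = 32
FP64_OPS_PER_CLOCK = 16
MAX_THREADS_PER_SM = 1536
REGISTERS_PER_SM = 32 * 1024   # "32K 32-bit registers" (assuming K = 1024)
BYTES_PER_REGISTER = 4         # 32 bits

# Size of the register file in KB.
register_file_kb = REGISTERS_PER_SM * BYTES_PER_REGISTER // 1024

# Registers available to each thread when all 1536 threads are resident.
regs_per_thread_full = REGISTERS_PER_SM // MAX_THREADS_PER_SM

# Double precision runs at this fraction of the single-precision rate.
fp64_to_fp32 = FP64_OPS_PER_CLOCK / FP32_OPS_PER_CLOCK

# Resident threads per core: the oversubscription that lets the SM hide latency.
threads_per_core = MAX_THREADS_PER_SM // CUDA_CORES_PER_SM

print(f"register file: {register_file_kb} KB")
print(f"registers per thread at full occupancy: {regs_per_thread_full}")
print(f"fp64/fp32 throughput ratio: {fp64_to_fp32}")
print(f"resident threads per CUDA core: {threads_per_core}")
```

The register budget per thread is the practical takeaway: a kernel that needs many more registers per thread than this cannot keep all 1536 threads resident on one SM.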

27 GPU Architecture Fermi: CUDA Core
Floating point & Integer unit
  IEEE 754-2008 floating-point standard
  Fused multiply-add (FMA) instruction for both single and double precision
Logic unit
Move, compare unit
Branch unit
Diagram labels (CUDA core): Dispatch Port, Operand Collector, FP Unit, INT Unit, Result Queue
Diagram labels (SM): Instruction Cache; Scheduler and Dispatch (x2); Register File; Load/Store Units x 16; Special Func Units x 4; Interconnect Network; 64K Configurable Cache/Shared Mem; Uniform Cache

28 GPU Architecture Fermi: Memory System
L1
  16 or 48KB / SM, can be chosen by the program
  Hardware-managed
  Aggregate bandwidth per GPU: 1.03 TB/s
Shared memory
  User-managed scratch-pad
  Hardware will not evict until threads overwrite
  16 or 48KB / SM (64KB total is split between Shared and L1)
  Aggregate bandwidth per GPU: 1.03 TB/s
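The 64KB split between shared memory and L1 affects how many thread blocks can share one SM. The sketch below enumerates the two splits named on the slide and counts how many blocks fit by shared memory alone. The 12 KB per-block requirement is an illustrative assumption, the helper name is hypothetical, and other limits (registers, thread count) are deliberately not modeled.

```python
TOTAL_KB = 64
# The two splits described on the slide, as (L1 KB, shared KB).
SPLITS = [(16, 48), (48, 16)]

def blocks_by_shared_memory(shared_kb_per_block: int) -> dict:
    """Map each (L1, shared) split to the number of blocks that fit in shared memory."""
    return {
        (l1_kb, shared_kb): shared_kb // shared_kb_per_block
        for l1_kb, shared_kb in SPLITS
    }

# Every split must account for the full 64KB.
assert all(l1 + shared == TOTAL_KB for l1, shared in SPLITS)

# Illustrative kernel that needs 12 KB of shared memory per block (assumption).
fit = blocks_by_shared_memory(12)
for (l1_kb, shared_kb), blocks in fit.items():
    print(f"L1 {l1_kb} KB / shared {shared_kb} KB -> {blocks} block(s) per SM")
```

A kernel that leans on the user-managed scratch-pad benefits from the 48KB shared setting, while a kernel with little shared-memory use may prefer the larger hardware-managed L1.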

29 GPU Architecture Fermi: Memory System
Unified L2 cache (768 KB)
  Fast, coherent data sharing across all cores in the GPU
ECC protection
  DRAM: ECC supported for GDDR5 memory
  All major internal memories are ECC protected: register file, L1 cache, L2 cache

30 Kepler

31 Kepler: Fastest, Most Efficient HPC Architecture Ever
SMX
Hyper-Q
Dynamic Parallelism

32 Kepler: Fast & Efficient
SM (M2090): CONTROL LOGIC, 32 cores
SMX (K20): CONTROL LOGIC, 192 cores
3x Perf / Watt

33 Kepler GK110 Block Diagram
Architecture
  7.1B Transistors
  15 SMX units
  > 1 TFLOP FP
  MB L2 Cache
  384-bit GDDR5
  PCI Express Gen3

34 Hyper-Q: CPU Cores Simultaneously Run Tasks on Kepler
FERMI: 1 MPI Task at a Time
KEPLER: 32 Simultaneous MPI Tasks

35 Hyper-Q: Max GPU Utilization, Slashes CPU Idle Time
Charts (two panels): GPU Utilization % over Time

36 Dynamic Parallelism: GPU Adapts to Data, Dynamically Launches New Threads
Diagram: CPU with Fermi GPU versus CPU with Kepler GPU

37 Dynamic Parallelism: Makes GPU Computing Easier & Broadens Reach
Diagram labels: Too coarse, Too fine, Just right

38 Tesla products by market segment (timeline: Q2, Q3, Q4)
Tesla M2090 and Tesla M2075 (Fermi), then Tesla K20 (Kepler GK110)
  Supercomputing: Weather / Climate Modeling, Molecular Dynamics, Computational Physics
  Life Sciences: Biochemistry, Bioinformatics, Material Science
  Manufacturing: Structural Mechanics, Comp Fluid Dynamics (CFD), Electromagnetics
Tesla K10 (Kepler GK104)
  Defense / Govt: Signal Processing, Image Processing, Video Analytics
  Oil and Gas: Reverse Time Migration, Kirchhoff Time Migration

39 Tesla K20 Family: 3x Faster Than Fermi
                         Tesla K20X   Tesla K20
# CUDA cores
Peak Double Precision    1.32 TF      1.17 TF
Peak DGEMM               1.22 TF      1.10 TF
Peak Single Precision    3.95 TF      3.52 TF
Peak SGEMM               2.90 TF      2.61 TF
Memory Bandwidth         250 GB/s     208 GB/s
Memory size              6 GB         5 GB
Total Board Power        235W         225W
Chart: Double Precision FLOPS (DGEMM), in TFLOPS, for Xeon E, Tesla M2090, and Tesla K20X; legible bar labels: .43 TFLOPS, 1.22 TFLOPS
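The peak and measured figures on this slide can be turned into a few ratios. The sketch below assumes the flattened table reads as two columns, Tesla K20X first and Tesla K20 second, with each "peak" value followed by its GEMM value; that column assignment is a reading of the slide, and the dictionary keys and the derived helper are illustrative names.

```python
# Values as read from the slide (TF = TFLOPS, bandwidth in GB/s, power in W).
specs = {
    "Tesla K20X": {"dp_peak": 1.32, "dgemm": 1.22, "sp_peak": 3.95,
                   "sgemm": 2.90, "bandwidth": 250, "watts": 235},
    "Tesla K20":  {"dp_peak": 1.17, "dgemm": 1.10, "sp_peak": 3.52,
                   "sgemm": 2.61, "bandwidth": 208, "watts": 225},
}

def derived(s: dict) -> dict:
    """Ratios computed from one board's slide figures."""
    return {
        # Fraction of peak that the GEMM benchmarks reach.
        "dgemm_efficiency": s["dgemm"] / s["dp_peak"],
        "sgemm_efficiency": s["sgemm"] / s["sp_peak"],
        # Peak double-precision GFLOPS per watt of board power.
        "dp_gflops_per_watt": s["dp_peak"] * 1000 / s["watts"],
        # Memory bytes available per peak double-precision flop.
        "bytes_per_dp_flop": s["bandwidth"] / (s["dp_peak"] * 1000),
    }

results = {name: derived(s) for name, s in specs.items()}
for name, r in results.items():
    print(name)
    for key, value in r.items():
        print(f"  {key}: {value:.3f}")
```

Under this reading, DGEMM reaches over 90% of peak on both boards while SGEMM reaches roughly three quarters of peak, and both boards offer well under one byte of memory bandwidth per double-precision flop.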

40 Whitepaper:

NVIDIA GPU Computing: A Revolution in High Performance Computing
GPUs and the Future of Accelerated Computing
Emerging Technology Conference 2014, University of Manchester
John Ashley, Senior Solutions Architect