High Performance Computing with Accelerators


Volodymyr Kindratenko
Innovative Systems Laboratory @ NCSA
Institute for Advanced Computing Applications and Technologies (IACAT)
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign

Presentation Outline
- Recent trends in application accelerators: FPGAs, Cell, GPUs, accelerator clusters
- Accelerator clusters at NCSA: cluster management, programmability issues
- Production codes running on accelerator clusters: cosmology, molecular dynamics, electronic structure, quantum chromodynamics

HPC on Special-purpose Processors
- Field-Programmable Gate Arrays (FPGAs): digital signal processing, embedded computing
- Graphics Processing Units (GPUs): desktop graphics accelerators
- Physics Processing Units (PPUs): desktop game accelerators
- Sony/Toshiba/IBM Cell Broadband Engine: game consoles and digital content delivery systems
- ClearSpeed accelerator: floating-point accelerator board for compute-intensive applications

GPU Performance Trends: FLOPS
[Chart: peak GFLOP/s from 9/2002 to 3/2008. NVIDIA GPUs (GeForce 5800, 5950 Ultra, 6800 Ultra, 7800 GTX, 7900 GTX, 8800 GTX, 8800 Ultra, GTX 280, GTX 285) climb past 1000 GFLOP/s, while Intel CPUs (up to a 3 GHz quad-core Xeon) remain far below. Courtesy of NVIDIA.]

GPU Performance Trends: Memory Bandwidth
[Chart: memory bandwidth in GB/s over the same period. NVIDIA GPUs rise from the GeForce 5800 through the 6800 Ultra, 7800/7900 GTX, and 8800 GTX/Ultra to roughly 160 GB/s on the GTX 280/285. Courtesy of NVIDIA.]

NVIDIA Tesla T10 GPU Architecture
[Block diagram: 10 thread processing clusters (TPCs), each with a geometry controller, SM controller, texture units with L1 texture cache, and three streaming multiprocessors (each SM: instruction cache, MT issue, constant cache, shared memory); an input assembler and thread execution manager feed the TPCs; ROPs and L2 connect through a 512-bit memory interconnect to eight DRAM partitions; a PCIe interface connects to the host.]
- 240 streaming processors arranged as 30 streaming multiprocessors
- At 1.3 GHz this provides ~1 TFLOPS single precision and 86.4 GFLOPS double precision
- 512-bit interface to off-chip GDDR3 memory, 102 GB/s bandwidth
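The figures above can be confirmed at runtime; a minimal sketch (not from the talk) using the standard CUDA runtime device-query API:

// Query the per-device properties discussed above.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        // On a Tesla T10 this reports 30 multiprocessors (240 SPs),
        // a ~1.3 GHz clock, and a 512-bit memory bus.
        printf("device %d: %s, %d SMs, %.2f GHz, %d-bit bus\n",
               d, p.name, p.multiProcessorCount,
               p.clockRate / 1e6, p.memoryBusWidth);
    }
    return 0;
}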

NVIDIA Tesla S1070 GPU Computing Server
[Block diagram: a 1U enclosure with four Tesla T10 GPUs, each with 4 GB of GDDR3 SDRAM, paired through two NVIDIA switches onto two PCIe x16 host connections; system monitoring, thermal management, and the power supply are integrated.]

GPU Clusters at NCSA
- Lincoln: production system, available through the standard NCSA/TeraGrid HPC allocation process
- AC: experimental system, available to anybody interested in exploring GPU computing

Intel 64 Tesla Linux Cluster Lincoln
- Dell PowerEdge 1955 servers: dual-socket quad-core Intel 64 (Harpertown) at 2.33 GHz, 16 GB DDR2, InfiniBand SDR
- Tesla S1070 1U GPU Computing Servers: 1.3 GHz Tesla T10 processors, 4x4 GB GDDR3 SDRAM, attached via PCIe x8
- Cluster totals: 192 servers, 1536 CPU cores, 96 accelerator units, 384 GPUs
[Diagram: two PowerEdge 1955 servers share one Tesla S1070, each over a PCIe x8 link; servers connect to the cluster fabric over InfiniBand SDR.]

AMD Opteron Tesla Linux Cluster AC
- HP xw9400 workstations: dual-socket dual-core AMD Opteron 2216 at 2.4 GHz, 8 GB DDR2, InfiniBand QDR
- Tesla S1070 1U GPU Computing Servers: 1.3 GHz Tesla T10 processors, 4x4 GB GDDR3 SDRAM, attached via PCIe x16
- Nallatech H101 FPGA card, attached via PCI-X
- Cluster totals: 32 servers, 128 CPU cores, 32 accelerator units, 128 GPUs
[Diagram: each xw9400 workstation connects to a full Tesla S1070 over two PCIe x16 links and to the fabric over InfiniBand QDR.]

AC Cluster: 128 TFLOPS (single precision: 128 GPUs at ~1 TFLOPS each)

NCSA GPU Cluster Management Software
- CUDA SDK wrapper
  - Enables true GPU resource allocation by the scheduler; e.g., qsub -l nodes=1:ppn=1 allocates 1 CPU core and 1 GPU, fencing the node's other GPUs from access
  - Virtualizes GPU devices
  - Sets thread affinity to the CPU core nearest the GPU device
- GPU memory scrubber: cleans memory between runs
- GPU memory test utility: tests memory for manufacturing defects and soft errors, and tests GPUs for entering an erroneous state
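The wrapper's internals are not shown in the talk; the following is a minimal sketch of the device-binding idea only, assuming (hypothetically) that the scheduler exports the allocated physical device index in an environment variable. The real wrapper instead interposes on the CUDA API so that jobs always see their allocated GPUs as devices 0..n-1.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    // Hypothetical variable name, for illustration only.
    const char *alloc = getenv("NCSA_ALLOCATED_GPU");
    int physical = alloc ? atoi(alloc) : 0;
    cudaError_t err = cudaSetDevice(physical);
    if (err != cudaSuccess) {
        fprintf(stderr, "cannot bind GPU %d: %s\n",
                physical, cudaGetErrorString(err));
        return 1;
    }
    printf("job bound to physical GPU %d\n", physical);
    return 0;
}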

GPU Node Pre/Post Allocation Sequence
Pre-job (minimized for rapid device acquisition):
- Assemble the detected-device file unless it already exists
- Sanity-check the results
- Check out the requested GPU devices from that file
- Initialize the CUDA wrapper shared memory segment with a unique key for the user (allows the user to ssh to the node outside of the job environment and see the same GPU devices)
Post-job:
- Run a quick memory test to verify healthy GPU state
- If a bad state is detected, mark the node offline if other jobs are present on the node; if no other jobs, reload the kernel module to heal the node (works around a CUDA 2.2 bug)
- Run the memscrubber utility to clear GPU device memory
- Notify of any failure events with job details
- Terminate the wrapper shared memory segment
- Check the GPUs back into the global file of detected devices
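The memory test and scrubber sources are not listed in the talk; below is a minimal sketch of the core idea, assuming a single fixed-pattern pass is enough to illustrate it (the production utility checks for manufacturing defects and soft errors far more thoroughly).

// Write a pattern across most of device memory, read it back, verify.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);
    size_t n = free_b / 2;              // leave headroom for the runtime
    unsigned char *d = NULL;
    if (cudaMalloc((void **)&d, n) != cudaSuccess) return 1;
    cudaMemset(d, 0xAA, n);             // scrub / pattern pass

    unsigned char *h = (unsigned char *)malloc(n);
    cudaMemcpy(h, d, n, cudaMemcpyDeviceToHost);
    size_t bad = 0;
    for (size_t i = 0; i < n; ++i)
        if (h[i] != 0xAA) ++bad;
    printf("%zu of %zu bytes failed verification\n", bad, n);

    cudaFree(d);
    free(h);
    return bad ? 1 : 0;
}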

GPU Programming
Third-party programming tools:
- CUDA C 2.2 SDK
- OpenCL 1.2 SDK
- PGI x64+GPU Fortran & C99 compilers
IACAT's ongoing work:
- CUDA-auto, CUDA-lite, GMAC, CUDA-tune, MCUDA/OpenMP

[Diagram: a stack of programming approaches at decreasing levels of abstraction, targeting NVIDIA GPUs via the NVIDIA SDK and IA multi-core/Larrabee via MCUDA/OpenMP.]
- CUDA-auto: implicitly parallel programming with data-structure and function-property annotations to enable auto-parallelization
- CUDA-lite: locality-annotation programming to eliminate the need for explicit management of memory types and data transfers
- GMAC: unified CPU/GPU address space runtime (next slide)
- CUDA-tune: parameterized CUDA programming using auto-tuning and optimization-space pruning
- NVIDIA SDK: first-generation CUDA programming with explicit, hardwired thread organizations and explicit management of memory types and data transfers (a minimal example follows below); MCUDA/OpenMP retargets such code to IA multi-core and Larrabee
Courtesy of Wen-mei Hwu, UIUC
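For reference, a minimal sketch (not from the talk) of the first-generation style named above, with an explicit thread organization and explicit host/device transfers:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Explicit thread organization: each thread handles one element.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *hx = (float *)malloc(bytes), *hy = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    // Explicit management of memory types and transfers.
    float *dx, *dy;
    cudaMalloc((void **)&dx, bytes);
    cudaMalloc((void **)&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    // Hardwired launch configuration: 256 threads per block.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);

    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);  // expect 4.0
    cudaFree(dx); cudaFree(dy); free(hx); free(hy);
    return 0;
}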

GMAC
- Designed to lower the barrier to accelerator use
- Unified CPU/GPU address space: the same address is valid on both the CPU and the GPU
- Customizable implicit data transfers: transfer everything (safe mode), transfer dirty data before kernel execution, or transfer data as it is produced (default)
- Multi-process / multi-thread support
- CUDA compatible
Courtesy of Wen-mei Hwu, UIUC
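To show what the unified address space buys, here is the previous saxpy host code rewritten as a sketch under GMAC's model. The gmacMalloc/gmacFree/gmacThreadSynchronize names follow the published GMAC work, but treat the exact signatures as assumptions rather than a verbatim API reference.

#include <cstdio>
#include <gmac.h>   // assumption: GMAC's public header

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    // One allocation per array; the same pointer is valid on CPU and GPU,
    // so the explicit cudaMemcpy calls disappear.
    gmacMalloc((void **)&x, n * sizeof(float));
    gmacMalloc((void **)&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Dirty data is transferred implicitly before the kernel runs.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    gmacThreadSynchronize();

    printf("y[0] = %f\n", y[0]);  // results readable directly on the host
    gmacFree(x); gmacFree(y);
    return 0;
}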

GPU Cluster Applications
- TPACF: cosmology code used to study how matter is distributed in the Universe
- NAMD: molecular dynamics code used to run large-scale MD simulations
- DSCF: quantum chemistry code for energy calculations (NSF CyberChem project)
- WRF: weather modeling code
- MILC: quantum chromodynamics code; work just started

TPACF on AC
[Charts: (1) QP GPU cluster scaling: execution time in seconds vs. number of MPI threads, falling near-linearly from 567.5 s through 289.0, 147.5, 79.9, 46.3, and 29.3 s to 23.7 s; (2) per-accelerator speedups relative to the reference CPU implementation for the Quadro FX 5600, GeForce GTX 280 (SP), Cell B.E. (SP), GeForce GTX 280 (DP), PowerXCell 8i (DP), and Nallatech H101 (DP), with speedups of 6.8x, 30.9x, 45.5x, 59.1x, 88.2x, and 154.1x across these devices.]

NAMD on Lincoln (8 cores and 2 GPUs per node, very early results)
[Chart: STMV benchmark, seconds per step vs. CPU core count (4 to 128), comparing CPU-only runs at 8, 4, and 2 processes per node with GPU-accelerated runs at 4:1, 2:1, and 1:1 CPU-core-to-GPU ratios; 2 GPUs match roughly 24 CPU cores and 8 GPUs roughly 96 CPU cores, for speedups of ~2.8x to ~5.6x.]
Courtesy of James Phillips, UIUC

DSCF on Lincoln
- Bovine pancreatic trypsin inhibitor (BPTI): 3-21G basis, 875 atoms, 4893 basis functions
- MPI timings and scalability
Courtesy of Ivan Ufimtsev, Stanford

GPU Clusters: Open Issues / Future Work
- GPU cluster configuration: CPU/GPU ratio, host/device memory ratio, cluster interconnect requirements
- Cluster management: resource allocation, monitoring
- CUDA/OpenCL SDK for HPC
- Programming models for accelerator clusters: CUDA/OpenCL combined with pthreads, MPI, OpenMP, Charm++, ... (a common MPI+CUDA pattern is sketched below)
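A minimal sketch of the MPI+CUDA combination from the last bullet, assuming one MPI rank per GPU; the node-local rank is obtained by splitting the communicator (an MPI-3 feature; codes of this era typically hashed hostnames instead).

#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Group ranks that share a node, then map each node-local rank
    // to one of the node's GPUs.
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node);
    int local_rank, ngpus;
    MPI_Comm_rank(node, &local_rank);
    cudaGetDeviceCount(&ngpus);
    cudaSetDevice(local_rank % ngpus);

    printf("rank %d -> GPU %d\n", rank, local_rank % ngpus);
    MPI_Comm_free(&node);
    MPI_Finalize();
    return 0;
}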