High Performance Computing with Accelerators
Volodymyr Kindratenko
Innovative Systems Laboratory @ NCSA
Institute for Advanced Computing Applications and Technologies (IACAT)
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
Presentation Outline
Recent trends in application accelerators: FPGAs, Cell, GPUs, accelerator clusters
Accelerator clusters at NCSA: cluster management, programmability issues
Production codes running on accelerator clusters: cosmology, molecular dynamics, electronic structure, quantum chromodynamics
HPC on Special-Purpose Processors
Field-Programmable Gate Arrays (FPGAs): digital signal processing, embedded computing
Graphics Processing Units (GPUs): desktop graphics accelerators
Physics Processing Units (PPUs): desktop game accelerators
Sony/Toshiba/IBM Cell Broadband Engine: game consoles and digital content delivery systems
ClearSpeed accelerator: floating-point accelerator board for compute-intensive applications
GPU Performance Trends: FLOPS
[Figure: peak GFLOP/s vs. release date (2002-2008) for NVIDIA GPUs (GeForce FX 5800, 5950 Ultra, 6800 Ultra, 7800 GTX, 7900 GTX, 8800 GTX, 8800 Ultra, GTX 280, GTX 285) compared with Intel CPUs (quad-core 3 GHz Xeon); GPU peak approaches 1 TFLOP/s by the GTX 280/285 generation.]
Courtesy of NVIDIA
GPU Performance Trends: Memory Bandwidth
[Figure: memory bandwidth in GB/s vs. release date (2002-2008) for NVIDIA GPUs (GeForce FX 5800, 5950 Ultra, 6800 Ultra, 7800 GTX, 7900 GTX, 8800 GTX, 8800 Ultra, GTX 280, GTX 285), reaching roughly 160 GB/s with the GTX 285.]
Courtesy of NVIDIA
NVIDIA Tesla T10 GPU Architecture
[Block diagram: 10 thread processing clusters (TPCs), each with a geometry controller, SM controller, texture units with L1 cache, and 3 streaming multiprocessors (each SM: instruction cache, MT issue, constant cache, shared memory); ROPs and L2 caches on a 512-bit interconnect feeding 8 DRAM partitions; thread execution manager, input assembler, and PCIe interface on the host side.]
240 streaming processors arranged as 30 streaming multiprocessors
At 1.3 GHz this provides about 1 TFLOPS single precision and 86.4 GFLOPS double precision
512-bit interface to off-chip GDDR3 memory, 102 GB/s bandwidth
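These peak figures can be sanity-checked from the parameters on the slide. A back-of-envelope estimate, assuming 3 single-precision flops per SP per cycle (dual-issued MAD + MUL), one double-precision unit per SM at 2 flops per cycle, and GDDR3 at 1.6 GT/s per pin; note that the 86.4 GFLOPS figure corresponds to the faster 1.44 GHz T10 clock option rather than 1.3 GHz:

```latex
\text{SP peak} \approx 240~\text{SPs} \times 1.3~\text{GHz} \times 3~\tfrac{\text{flops}}{\text{cycle}} \approx 0.94~\text{TFLOPS}\\
\text{DP peak} \approx 30~\text{DP units} \times 1.44~\text{GHz} \times 2~\tfrac{\text{flops}}{\text{cycle}} \approx 86.4~\text{GFLOPS}\\
\text{Memory bandwidth} \approx \tfrac{512~\text{bit}}{8} \times 1.6~\tfrac{\text{GT}}{\text{s}} \approx 102~\text{GB/s}
```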
NVIDIA Tesla S1070 GPU Computing Server
4 Tesla T10 GPUs, each with 4 GB of GDDR3 SDRAM
Two PCIe x16 host connections, each shared by two GPUs through an NVIDIA switch
Integrated system monitoring, thermal management, and power supply
GPU Clusters at NCSA
Lincoln: production system available through the standard NCSA/TeraGrid HPC allocation process
AC: experimental system available to anybody interested in exploring GPU computing
Intel 64 Tesla Linux Cluster Lincoln
Dell PowerEdge 1955 servers: dual-socket quad-core Intel 64 (Harpertown) 2.33 GHz, 16 GB DDR2, InfiniBand SDR
S1070 1U GPU computing servers: 1.3 GHz Tesla T10 processors, 4x4 GB GDDR3 SDRAM; each S1070 is shared by two host servers over PCIe x8 links
Cluster totals: 192 servers, 1536 CPU cores, 96 accelerator units (S1070), 384 GPUs
AMD Opteron Tesla Linux Cluster AC
HP xw9400 workstations: dual-socket dual-core AMD Opteron 2216, 2.4 GHz, 8 GB DDR2, InfiniBand QDR
S1070 1U GPU computing servers: 1.3 GHz Tesla T10 processors, 4x4 GB GDDR3 SDRAM; each host attaches to one S1070 over two PCIe x16 links
Nallatech H101 FPGA card attached via PCI-X
Cluster totals: 32 servers, 128 CPU cores, 32 accelerator units (S1070), 128 GPUs
AC Cluster: 128 TFLOPS
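A quick consistency check of this headline number against the cluster composition on the previous slide, assuming roughly 1 TFLOPS single-precision peak per T10 GPU:

```latex
32~\text{S1070 units} \times 4~\text{T10 GPUs each} \times \sim\!1~\text{TFLOPS} \approx 128~\text{TFLOPS (single precision)}
```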
NCSA GPU Cluster Management Software
CUDA SDK wrapper
  Enables true GPU resource allocation by the scheduler; e.g., qsub -l nodes=1:ppn=1 allocates 1 CPU core and 1 GPU, fencing off the other GPUs in the node
  Virtualizes GPU devices
  Sets thread affinity to the CPU core nearest the allocated GPU device (see the sketch after this slide)
GPU memory scrubber: cleans device memory between runs
GPU memory test utility
  Tests memory for manufacturing defects
  Tests memory for soft errors
  Tests GPUs for entering an erroneous state
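The affinity point is easiest to see in a small sketch. This is not the NCSA wrapper source; the function name and GPU-to-core mapping below are illustrative assumptions (a real mapping would be derived from the node's PCIe/NUMA topology):

```c
/* Sketch only: how a wrapper might pin a job's thread to the CPU core nearest
 * its allocated GPU before any CUDA work starts. Not the NCSA wrapper source;
 * the near_core[] table is a placeholder assumption. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <cuda_runtime.h>

static const int near_core[4] = {0, 1, 2, 3};  /* assumed GPU-to-core map */

int bind_and_select_gpu(int phys_gpu)
{
    if (phys_gpu < 0 || phys_gpu > 3) return -1;

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(near_core[phys_gpu], &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {  /* pin calling thread */
        perror("sched_setaffinity");
        return -1;
    }
    if (cudaSetDevice(phys_gpu) != cudaSuccess) {          /* claim the allocated GPU */
        fprintf(stderr, "cudaSetDevice(%d) failed\n", phys_gpu);
        return -1;
    }
    return 0;
}

int main(void) { return bind_and_select_gpu(0) ? 1 : 0; }
```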
GPU Node Pre/Post Allocation Sequence
Pre-job (minimized for rapid device acquisition)
  Assemble the detected-device file unless it already exists
  Sanity-check the results
  Check out the requested GPU devices from that file
  Initialize the CUDA wrapper shared memory segment with a unique key for the user (allows the user to ssh to the node outside of the job environment and see the same GPU devices)
Post-job
  Run a quick memory test to verify a healthy GPU state (a simplified illustration of the memory test idea follows this slide)
  If a bad state is detected, mark the node offline if other jobs are present on it; if no other jobs are present, reload the kernel module to heal the node (works around a CUDA 2.2 bug)
  Run the memory scrubber utility to clear GPU device memory
  Notify of any failure events with job details
  Terminate the wrapper shared memory segment
  Check the GPUs back into the global file of detected devices
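For illustration, a minimal sketch of what such a memory test pass does. This is not the production utility; it only shows the fill-and-verify idea with a handful of fixed patterns:

```c
/* Sketch only: fill device memory with known patterns, verify on the GPU,
 * and count mismatched words. The production NCSA utility uses many more
 * patterns and also targets soft errors. */
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void fill(unsigned int *buf, size_t n, unsigned int pattern)
{
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        buf[i] = pattern;                                /* write the test pattern */
}

__global__ void check(const unsigned int *buf, size_t n, unsigned int pattern,
                      unsigned int *errors)
{
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        if (buf[i] != pattern) atomicAdd(errors, 1u);    /* count bad words */
}

int main(void)
{
    const size_t n = 64 * 1024 * 1024;                   /* 256 MB of 32-bit words */
    unsigned int *buf, *d_err, h_err = 0;
    cudaMalloc((void **)&buf, n * sizeof(unsigned int));
    cudaMalloc((void **)&d_err, sizeof(unsigned int));
    cudaMemset(d_err, 0, sizeof(unsigned int));

    const unsigned int patterns[] = {0x00000000u, 0xFFFFFFFFu,
                                     0xAAAAAAAAu, 0x55555555u};
    for (int p = 0; p < 4; ++p) {
        fill<<<1024, 256>>>(buf, n, patterns[p]);
        check<<<1024, 256>>>(buf, n, patterns[p], d_err);
    }
    cudaMemcpy(&h_err, d_err, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    printf("mismatched words: %u\n", h_err);

    cudaFree(buf);
    cudaFree(d_err);
    return h_err != 0;
}
```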
GPU Programming
Third-party programming tools
  CUDA C 2.2 SDK
  OpenCL 1.2 SDK
  PGI x64+GPU Fortran & C99 compilers
IACAT's ongoing work
  CUDA-auto
  CUDA-lite
  GMAC
  CUDA-tune
  MCUDA/OpenMP
CUDA-auto: implicitly parallel programming with data structure and function property annotations to enable auto-parallelization
CUDA-lite, GMAC: locality annotation programming to eliminate the need for explicit management of memory types and data transfers
CUDA-tune: parameterized CUDA programming using auto-tuning and optimization space pruning
NVIDIA SDK: 1st-generation CUDA programming with explicit, hardwired thread organizations and explicit management of memory types and data transfers (see the sketch after this slide)
MCUDA/OpenMP targets IA multi-core and Larrabee; the NVIDIA SDK targets NVIDIA GPUs
Courtesy of Wen-mei Hwu, UIUC
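To make the bottom layer of this stack concrete, here is a minimal example of the 1st-generation explicit CUDA style that the higher-level tools aim to hide: the programmer hardwires the thread organization and manages device memory and transfers by hand. A generic vector addition, not taken from any of the codes in this talk:

```c
/* Minimal example of explicit CUDA programming: hardwired thread organization,
 * manual device allocation, and explicit host/device transfers. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* explicit thread indexing */
    if (i < n) c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;                               /* explicit device allocations */
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  /* explicit transfers */
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);  /* explicit launch config */
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```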
GMAC
Designed to lower the barrier to accelerator use
Unified CPU/GPU address space: the same address is valid on CPU and GPU
Customizable implicit data transfers
  Transfer everything (safe mode)
  Transfer dirty data before kernel execution
  Transfer data as it is produced (default)
Multi-process / multi-thread support
CUDA compatible
Courtesy of Wen-mei Hwu, UIUC
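A hedged sketch of what GMAC-style code looks like. The header name and the entry points gmacMalloc, gmacFree, and gmacThreadSynchronize (and their exact signatures) are assumptions based on the GMAC publications; the point is only that one pointer is valid on both CPU and GPU and no explicit cudaMemcpy calls appear:

```c
/* Sketch only: GMAC-style unified CPU/GPU pointer with implicit transfers.
 * API names and signatures below are assumptions, not verified against the
 * actual GMAC release. */
#include <stdio.h>
#include <gmac.h>                                  /* assumed header name */

__global__ void scale(float *v, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= s;
}

int main(void)
{
    const int n = 1 << 20;
    float *v;
    gmacMalloc((void **)&v, n * sizeof(float));    /* one allocation, one address space */

    for (int i = 0; i < n; ++i) v[i] = 1.0f;       /* CPU writes through the same pointer */

    scale<<<(n + 255) / 256, 256>>>(v, n, 2.0f);   /* no explicit cudaMemcpy needed */
    gmacThreadSynchronize();                       /* wait for kernel; data made coherent */

    printf("v[0] = %f\n", v[0]);                   /* CPU reads the updated data directly */
    gmacFree(v);
    return 0;
}
```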
GPU Cluster Applications
TPACF: cosmology code used to study how matter is distributed in the Universe
NAMD: molecular dynamics code used to run large-scale MD simulations
DSCF: quantum chemistry code for energy calculations (NSF CyberChem project)
WRF: weather modeling code
MILC: quantum chromodynamics code (work just started)
TPACF on AC
[Figures: kernel execution time in seconds (log scale) vs. number of MPI threads, showing GPU cluster scaling on QP, and speedup over the reference CPU for each accelerator: Quadro FX 5600, GeForce GTX 280 (SP), Cell B.E. (SP), GeForce GTX 280 (DP), PowerXCell 8i (DP), and Nallatech H101 (DP).]
NAMD on Lincoln (8 cores and 2 GPUs per node, very early results)
[Figure: STMV seconds/step vs. number of CPU cores (4-128) for CPU-only runs at 8, 4, and 2 processes per node and GPU-accelerated runs at 4:1, 2:1, and 1:1 core:GPU ratios, using 2 to 16 GPUs.]
2 GPUs perform like roughly 24 CPU cores; 8 GPUs perform like roughly 96 CPU cores
Courtesy of James Phillips, UIUC
DSCF on Lincoln
Bovine pancreatic trypsin inhibitor (BPTI): 3-21G basis set, 875 atoms, 4893 basis functions
MPI timings and scalability
Courtesy of Ivan Ufimtsev, Stanford
GPU Clusters: Open Issues / Future Work
GPU cluster configuration: CPU/GPU ratio, host/device memory ratio, cluster interconnect requirements
Cluster management: resource allocation, monitoring
CUDA/OpenCL SDK for HPC
Programming models for accelerator clusters: CUDA/OpenCL + pthreads/MPI/OpenMP/Charm++/...