Accelerating Biomolecular Modeling with CUDA and GPU Clusters

Size: px

Start display at page:

Download "Accelerating Biomolecular Modeling with CUDA and GPU Clusters"

Dale Carroll
6 years ago
Views:

1 Accelerating Biomolecular Modeling with CUDA and GPU Clusters James Phillips John Stone Klaus Schulten Research/gpu/

2 Beckman Institute University o Illinois at Urbana-Champaign Theoretical and Computational Biophysics Group

3 National Center or Supercomputing Applications

4 Computational Microscopy Ribosome: synthesizes proteins rom genetic inormation, target or antibiotics Silicon nanopore: bionanodevice or sequencing DNA eiciently

NAMD: Practical Supercomputing 35,000 users can t all be computer experts.

User experience is the same on all platorms. No change in input, output, or coniguration iles.

Desktops and laptops setup and testing x86 and x86-64 Windows, and Macintosh Allow both shared-memory and

5 NAMD: Practical Supercomputing 35,000 users can t all be computer experts. 18% are NIH-unded; many in other countries have downloaded more than one version. User experience is the same on all platorms. No change in input, output, or coniguration iles. Run any simulation on any number o processors. Precompiled binaries available when possible. Desktops and laptops setup and testing x86 and x86-64 Windows, and Macintosh Allow both shared-memory and network-based parallelism. Linux clusters aordable workhorses x86, x86-64, and Itanium processors Gigabit ethernet, Myrinet, IniniBand, Quadrics, Altix, etc Phillips et al., J. Comp. Chem. 26: , 2005.

6 Our Goal: Practical Acceleration Broadly applicable to scientiic computing Programmable by domain scientists Scalable rom small to large machines Broadly available to researchers Price driven by commodity market Low burden on system administration Sustainable perormance advantage Perormance driven by Moore s law Stable market and supply chain

contact with company) Limited memory and

7 Acceleration Options or NAMD Outlook in : FPGA reconigurable computing (with NCSA) Diicult to program, slow loating point, expensive Cell processor (NCSA hardware) Relatively easy to program, expensive ClearSpeed (direct contact with company) Limited memory and memory bandwidth, expensive MDGRAPE Inlexible and expensive Graphics processor (GPU) Program must be expressed as graphics operations

CUDA: Practical Perormance November 2006: NVIDIA announces CUDA or G80 GPU.

No masquerading as graphics rendering. New shared memory and synchronization.

Resource and collaborators make it useul: Experience rom VMD development David Kirk

8 CUDA: Practical Perormance November 2006: NVIDIA announces CUDA or G80 GPU. CUDA makes GPU acceleration usable: Developed and supported by NVIDIA. No masquerading as graphics rendering. New shared memory and synchronization. No OpenGL or display device hassles. Multiple processes per card (or vice versa). Resource and collaborators make it useul: Experience rom VMD development David Kirk (Chie Scientist, NVIDIA) Wen-mei Hwu (ECE Proessor, UIUC) Fun to program (and drive) Stone et al., J. Comp. Chem. 28: , 2007.

volumetric data, quantum chemistry simulations, particle

9 VMD Visual Molecular Dynamics Visualization and analysis o molecular dynamics simulations, sequence data, volumetric data, quantum chemistry simulations, particle systems, User extensible with scripting and plugins Research/vmd/

10 CUDA Acceleration in VMD Electrostatic ield calculation, ion placement 20x to 44x aster Molecular orbital calculation and display 100x to 120x aster Imaging o gas migration pathways in proteins with implicit ligand sampling 20x to 30x aster

Apples to Oranges Perormance Results: Molecular Orbital Kernels Kernel Intel QX6700 CPU ICC-SSE (SSE intrinsics) Intel Core2 Duo CPU OpenCL scalar Intel QX6700 CPU ICC-SSE (SSE intrinsics) Intel

11 Apples to Oranges Perormance Results: Molecular Orbital Kernels Kernel Intel QX6700 CPU ICC-SSE (SSE intrinsics) Intel Core2 Duo CPU OpenCL scalar Intel QX6700 CPU ICC-SSE (SSE intrinsics) Intel Core2 Duo CPU OpenCL vec4 Cell OpenCL vec4*** no constant Radeon 4870 OpenCL scalar Radeon 4870 OpenCL vec4 GeForce GTX 285 OpenCL vec4 GeForce GTX 285 CUDA 2.1 scalar GeForce GTX 285 OpenCL scalar GeForce GTX 285 CUDA 2.0 scalar Cores Runtime (s) Speedup CUDA results demonstrate perormance variability with compiler revisions, and that with vendor eort, OpenCL has the potential to match the perormance o other APIs.

12 NAMD Hybrid Decomposition Kale et al., J. Comp. Phys. 151: , Spatially decompose data and communication. Separate but related work decomposition. Compute objects acilitate iterative, measurement-based load balancing system.

13 NAMD Code is Message-Driven No receive calls as in message passing Messages sent to object entry points Incoming messages placed in queue Priorities are necessary or perormance Execution generates new messages Implemented in Charm++ on top o MPI Can be emulated in MPI alone Charm++ provides tools and idioms Parallel Programming Lab:

14 System Noise Example Timeline rom Charm++ tool Projections

15 NAMD Overlapping Execution Phillips et al., SC2002. Oload to GPU Objects are assigned to processors and queued as data arrives.

16 Message-Driven CUDA? No, CUDA is too coarse-grained. CPU needs ine-grained work to interleave and pipeline. GPU needs large numbers o tasks submitted all at once. No, CUDA lacks priorities. FIFO isn t enough. Perhaps in a uture interace: Stream data to GPU. Append blocks to a running kernel invocation. Stream data out as blocks complete. Fermi looks very promising!

17 Nonbonded Forces on CUDA GPU Start with most expensive calculation: direct nonbonded interactions. Decompose work into pairs o patches, identical to NAMD structure. GPU hardware assigns patch-pairs to multiprocessors dynamically. Force computation on single multiprocessor (GeForce 8800 GTX has 16) Texture Unit Force Table Interpolation 8kB cache 16kB Shared Memory Patch A Coordinates & Parameters 32-way SIMD Multiprocessor multiplexed threads 32kB Registers Patch B Coords, Params, & Forces Constants Exclusions 8kB cache 768 MB Main Memory, no cache, 300+ cycle latency Stone et al., J. Comp. Chem. 28: , 2007.

18 texture<loat4> orce_table; constant unsigned int exclusions[]; shared atom jatom[]; atom iatom; // per-thread atom, stored in registers loat4 iorce; // per-thread orce, stored in registers or ( int j = 0; j < jatom_count; ++j ) { loat dx = jatom[j].x - iatom.x; loat dy = jatom[j].y - iatom.y; loat dz = jatom[j].z - iatom.z; loat r2 = dx*dx + dy*dy + dz*dz; i ( r2 < cuto2 ) { } loat4 t = texetch(orce_table, 1./sqrt(r2)); bool excluded = alse; int indexdi = iatom.index - jatom[j].index; i ( abs(indexdi) <= (int) jatom[j].excl_maxdi ) { indexdi += jatom[j].excl_index; excluded = ((exclusions[indexdi>>5] & (1<<(indexdi&31)))!= 0); } loat = iatom.hal_sigma + jatom[j].hal_sigma; // sigma *= *; // sigma^3 *= ; // sigma^6 *= ( * t.x + t.y ); // sigma^12 * i.x - sigma^6 * i.y *= iatom.sqrt_epsilon * jatom[j].sqrt_epsilon; loat qq = iatom.charge * jatom[j].charge; i ( excluded ) { = qq * t.w; } // PME correction else { += qq * t.z; } // Coulomb iorce.x += dx * ; iorce.y += dy * ; iorce.z += dz * ; iorce.w += 1.; // interaction count or energy Nonbonded Forces CUDA Code } Stone et al., J. Comp. Chem. 28: , Force Interpolation Exclusions Parameters Accumulation

19 Overlapping GPU and CPU with Communication GPU x Remote Force Local Force CPU Remote Local Local Update x Other Nodes/Processes x One Timestep

20 Remote Forces Forces on atoms in a local patch are local Forces on atoms in a remote patch are remote Calculate remote orces irst to overlap orce communication with local orce calculation Not enough work to overlap with position communication Remote Patch Local Patch Local Patch Remote Patch Remote Patch Remote Patch Work done by one processor

21 Actual Timelines rom NAMD Generated using Charm++ tool Projections CPU GPU x Remote Force Local Force x x

22 NCSA 4+4 QuadroPlex Cluster aster 2.4 GHz Opteron + Quadro FX 5600

23 Current GPU Clusters at NCSA Lincoln Production system available via the standard NCSA/TeraGrid HPC allocation AC Experimental system available or exploring GPU computing

24 CUDA/OpenCL Wrapper Library Basic operation principle: Use /etc/ld.so.preload to overload (intercept) a subset o CUDA/OpenCL unctions, e.g. {cu cuda}{get Set}Device, clgetdeviceids, etc Purpose: Enables controlled GPU device visibility and access, extending resource allocation to the workload manager Prove or disprove eature useulness, with the hope o eventual uptake or reimplementation o proven eatures by the vendor Provides a platorm or rapid implementation and testing o HPC relevant eatures not available in NVIDIA APIs Features: NUMA Ainity mapping Sets thread ainity to CPU core nearest the gpu device Shared host, multi-gpu device encing Only GPUs allocated by scheduler are visible or accessible to user GPU device numbers are virtualized, with a ixed mapping to a physical device per user environment User always sees allocated GPU devices indexed rom 0

25 CUDA/OpenCL Wrapper Library Features (cont d): Device Rotation (deprecated) Virtual to Physical device mapping rotated or each process accessing a GPU device Allowed or common execution parameters (e.g. Target gpu0 with 4 processes, each one gets separate gpu, assuming 4 gpus available) CUDA 2.2 introduced compute-exclusive device mode, which includes allback to next device. Device rotation eature may no longer needed Memory Scrubber Independent utility rom wrapper, but packaged with it Linux kernel does no management o GPU device memory Must run between user jobs to ensure security between users Availability NCSA/UoI Open Source License

26 CUDA Memtest 4GB o Tesla GPU memory is not ECC protected Hunt or sot error Features Full re-implementation o every test included in memtest86 Random and ixed test patterns, error reports, error addresses, test speciication notiication Includes additional stress test or sotware and hardware errors Usage scenarios Hardware test or deective GPU memory chips CUDA API/driver sotware bugs detection Hardware test or detecting sot errors due to non-ecc memory No sot error detected in 2 years x 4 gig o cumulative runtime Availability NCSA/UoI Open Source License

27 GPU Node Pre/Post Allocation Sequence Pre-Job (minimized or rapid device acquisition) Assemble detected device ile unless it exists Sanity check results Checkout requested GPU devices rom that ile Initialize CUDA wrapper shared memory segment with unique key or user (allows user to ssh to node outside o job environment and have same gpu devices visible) Post-Job Use quick memtest run to veriy healthy GPU state I bad state detected, mark node oline i other jobs present on node I no other jobs, reload kernel module to heal node (or CUDA 2.2 driver bug) Run memscrubber utility to clear gpu device memory Notiy o any ailure events with job details Terminate wrapper shared memory segment Check-in GPUs back to global ile o detected devices

AMD Opteron Tesla Linux Cluster AC HP xw9400 workstation 2216 AMD

4 GHz dual socket dual core 8 GB DDR2 Ininiband QDR Tesla S1070 1U 4-GPU

3 GHz Tesla T10 processors 4x4 GB GDDR3 SDRAM Cluster Servers: 32

28 AMD Opteron Tesla Linux Cluster AC HP xw9400 workstation 2216 AMD Opteron 2.4 GHz dual socket dual core 8 GB DDR2 Ininiband QDR Tesla S1070 1U 4-GPU Server 1.3 GHz Tesla T10 processors 4x4 GB GDDR3 SDRAM Cluster Servers: 32 Accelerator Units: 32 (128 GPUS, 128 TF SP, 10 TF DP) PCIe x16 PCIe interace T10 DRA M QDR IB T10 IB HP xw9400 workstation DRA M T10 Tesla S1070 PCIe interace DRA M T10 DRA M PCI-X PCIe x16 Nallatech H101 FPGA card

Intel 64 Tesla Linux Cluster Lincoln Dell PowerEdge 1955 server Intel 64 (Harpertown) 2.33 GHz dual socket quad core 16 GB DDR2 Ininiband SDR Tesla S1070 1U 4-GPU Server 1.

29 Intel 64 Tesla Linux Cluster Lincoln Dell PowerEdge 1955 server Intel 64 (Harpertown) 2.33 GHz dual socket quad core 16 GB DDR2 Ininiband SDR Tesla S1070 1U 4-GPU Server 1.3 GHz Tesla T10 processors 4x4 GB GDDR3 SDRAM Cluster Servers: 192 Accelerator Units: 96 (384 GPUs, 384 TF SP, 32 TF DP) Dell PowerEdge 1955 server PCIe x8 PCIe interace T10 DRAM SDR IB T10 DRAM IB Tesla S1070 Dell PowerEdge 1955 server PCIe interace T10 DRAM SDR IB PCIe x8 T10 DRAM

30 NCSA 8+2 Lincoln Cluster How to share a GPU among 4 CPU cores? Send all GPU work to one process? Coordinate via messages to avoid conlict? Or just hope or the best?

31 NCSA Lincoln Cluster Perormance (8 Intel cores and 2 NVIDIA Telsa GPUs per node) STMV (1M atoms) s/step ~2.8 CPU cores 2 GPUs = 24 cores 4 GPUs GPUs 16 GPUs

32 NCSA Lincoln Cluster Perormance (8 cores and 2 GPUs per node) STMV s/step ~5.6 ~2.8 2 GPUs = 24 cores 4 GPUs 8 GPUs 16 GPUs 8 GPUs = 96 CPU cores CPU cores

33 No GPU Sharing (Ideal World) GPU 1 x Remote Force Local Force GPU 2 x Remote Force Local Force

34 GPU Sharing (Desired) x Remote Force Local Force Client 1 x Remote Force Local Force Client 2

35 GPU Sharing (Feared) x Remote Force Local Force Client 1 x Remote Force Local Force Client 2

36 GPU Sharing (Observed) x Remote Force Local Force Client 1 x Remote Force Local Force Client 2

37 GPU Sharing (Explained) CUDA is behaving reasonably, but Force calculation is actually two kernels Longer kernel writes to multiple arrays Shorter kernel combines output Possible solutions: Modiy CUDA to be less air (please!) Use locks (atomics) to merge kernels (not G80) Explicit inter-client coordination

38 Inter-client Communication First identiy which processes share a GPU Need to know physical node or each process GPU-assignment must reveal real device ID Threads don t eliminate the problem Production code can t make assumptions Token-passing is simple and predictable Rotate clients in ixed order High-priority, yield, low-priority, yield,

39 Token-Passing GPU-Sharing GPU1 GPU2 Local Remote Local Remote

40 GPU-Sharing with PME Local Remote Local Rem

41 Weakness o Token-Passing GPU is idle while token is being passed Busy client delays itsel and others Next strategy requires threads: One process per GPU, one thread per core Funnel CUDA calls through a single stream No local work until all remote work is queued Typically unnels MPI as well

42 Recent NAMD GPU Developments Production eatures in 2.7b2 release: Full electrostatics with PME 1-4 exclusions Constant-pressure simulation Improved orce accuracy: Patch-centered atom coordinates Increased precision o orce interpolation Perormance enhancements (in progress): Recursive bisection within patch on 32-atom boundaries Block-based pairlists based on sorted atoms Sort blocks in order o decreasing work

43 GPU-Accelerated NAMD Plans Serial perormance Target NVIDIA Fermi architecture Revisit GPU kernel design decisions made in 2007 Improve perormance o remaining CPU code Parallel scaling Target NSF Track 2D Keeneland cluster at ORNL Finer-grained work units on GPU (eature o Fermi) One process per GPU, one thread per CPU core Dynamic load balancing o GPU work Wider range o simulation options and eatures

44 Conclusions and Outlook CUDA today is suicient or Single-GPU acceleration (the mass market) Coarse-grained multi-gpu parallelism Enough work per call to spin up all multiprocessors Improvements in CUDA are needed or Assigning GPUs to processes Sharing GPUs between processes Fine-grained multi-gpu parallelism Fewer blocks per call than chip has multiprocessors Moving data between GPUs (same or dierent node) Eager to test Fermi architecture and eatures!

45 Acknowledgements Theoretical and Computational Biophysics Group, University o Illinois at Urbana-Champaign Pro. Wen-mei Hwu, Chris Rodrigues, IMPACT Group, University o Illinois at Urbana-Champaign Mike Showerman, Jeremy Enos, NCSA David Kirk, Massimiliano Fatica, others at NVIDIA UIUC NVIDIA CUDA Center o Excellence NIH support: P41-RR05969 Research/gpu/

Experience with NAMD on GPU-Accelerated Clusters

Experience with NAMD on GPU-Accelerated Clusters James Phillips John Stone Klaus Schulten Research/gpu/ Outline Motivational images o NAMD simulations Why all the uss about GPUs? What is message-driven