Simulating Life at the Atomic Scale

1 Simulating Life at the Atomic Scale James Phillips Beckman Institute, University of Illinois Research/namd/

2 Beckman Institute University of Illinois at Urbana-Champaign Theoretical and Computational Biophysics Group

3 NAMD: Scalable Molecular Dynamics 2002 Gordon Bell Award 37,000 Users, 1700 Citations ATP synthase PSC Lemieux Computational Biophysics Summer School Blue Waters Target Application GPU Acceleration Illinois Petascale Computing Facility NVIDIA Tesla NCSA Lincoln

4 Computational Microscopy Ribosome: synthesizes proteins from genetic information, target for antibiotics Silicon nanopore: bionanodevice for sequencing DNA efficiently

5 Molecular Mechanics Force Field

6 Classical Molecular Dynamics The energy function U(r_1, ..., r_N) is used to determine the force on each atom: m_i d^2 r_i/dt^2 = F_i = -dU/dr_i. Newton's equation represents a set of N second-order differential equations, which are solved numerically via the Verlet integrator at discrete time steps to determine the trajectory of each atom. Small terms are added to control temperature and pressure.
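As a rough illustration of the integration step described on this slide, the sketch below advances a single particle in one dimension with the velocity Verlet scheme, a common variant of the Verlet integrator. It is not NAMD's implementation. The harmonic force, unit mass, and the function names are assumptions chosen only to keep the example self-contained.

```python
import math

def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Advance one particle in 1-D using the velocity Verlet scheme."""
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
    return x, v

def harmonic_force(x, k=1.0):
    """Hypothetical test force: F = -dU/dx for U = k x^2 / 2."""
    return -k * x

def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy of the test oscillator."""
    return 0.5 * mass * v * v + 0.5 * k * x * x

# Start at x = 1 with zero velocity and integrate to t = 10 (arbitrary units).
x1, v1 = velocity_verlet(1.0, 0.0, harmonic_force, mass=1.0, dt=0.01, n_steps=1000)
print(x1, math.cos(10.0), total_energy(x1, v1))
```

For this test problem the exact trajectory is x(t) = cos(t), so the printed position can be compared directly with the analytic value, and the total energy should stay close to its initial value of 0.5.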

7 Biomolecular Time Scales

Motion                              Time scale (sec)
Bond stretching                     10^-14 to 10^-13
Elastic vibrations                  10^-12 to 10^-11
Rotations of surface sidechains     10^-11 to 10^-10
Hinge bending                       to 10^-7
Rotation of buried side chains      10^-4 to 1
Allosteric transitions              10^-5 to 1
Local denaturations                 10^-5 to 10

Max timestep: 1 fs
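To make the gap between the 1 fs timestep and these time scales concrete, the following sketch counts how many integration steps are needed to cover a given span of simulated time. The helper name and the use of base-10 exponents are assumptions made for illustration.

```python
TIMESTEP_EXPONENT = -15  # 1 fs = 10^-15 s, the maximum timestep on the slide

def steps_for(time_exponent, timestep_exponent=TIMESTEP_EXPONENT):
    """Number of timesteps needed to simulate 10**time_exponent seconds."""
    return 10 ** (time_exponent - timestep_exponent)

# Upper bounds taken from the table above.
bond_stretching_steps = steps_for(-13)  # 10^-13 s
hinge_bending_steps = steps_for(-7)     # 10^-7 s
one_second_steps = steps_for(0)         # 1 s

print(bond_stretching_steps, hinge_bending_steps, one_second_steps)
```

Under these assumptions, a bond vibration spans about a hundred steps, while the slowest motions in the table would need on the order of 10^15 steps.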

8 Sizes of Simulations Over Time BPTI 3K atoms Estrogen Receptor 36K atoms (1996) ATP Synthase 327K atoms (2001)

9 Our Solution: Parallel Computing HP 735 cluster 14 processors (1994) SGI Origin processors (1997) PSC Lemieux AlphaServer SC 3000 processors (2002)

10 NAMD Parallel Scaling Snapshot [Plot: ns/day versus number of cores for ApoA1 (92K atoms) and STMV (1M atoms).]

11 Parallel Programming Lab University of Illinois at Urbana-Champaign Siebel Center for Computer Science

12 Quantum Chemistry (QM/MM) Develop abstractions in context of full-scale applications NAMD: Molecular Dynamics Protein Folding Computational Cosmology STM virus simulation Parallel Objects, Adaptive Runtime System Libraries and Tools Crack Propagation Rocket Simulation Dendritic Growth The enabling CS technology of parallel objects and intelligent runtime systems has led to several collaborative applications in CSE Space-time meshes

13 TCBG Experimental Collaborations Nearly every collaboration relies on NAMD. High-end simulations push scaling efforts. Try to anticipate needs: Million-atom virus just worked. Innovative simulations generate feature requests: What is science goal? Existing features usable? Find a scalable method. Make it general purpose.

14 Adaptability Through Scripting Tcl customizations are portable Top-level protocols: Minimize, heat, equilibrate Simulated annealing Replica exchange (two modifications) Long-range forces on selected atoms Torques and other steering forces Adaptive bias free energy perturbation Coupling to external coarse-grain model Special boundary forces Applies potentially to every atom Several design iterations for efficiency Shrinking phantom pore for DNA

15 NAMD: Practical Supercomputing 37,000 users can't all be computer experts. [number missing] have downloaded more than one version. [number missing] citations of NAMD reference papers. One program for all platforms: desktops and laptops (setup and testing); Linux clusters (affordable local workhorses); supercomputers (free allocations on TeraGrid); Blue Waters (sustained petaflop/s performance). User knowledge is preserved: no change in input or output files; run any simulation on any number of cores. Available free of charge to all. Phillips et al., J. Comp. Chem. 26: , 2005.

16 Our Goal: Practical Acceleration Broadly applicable to scientific computing: programmable by domain scientists; scalable from small to large machines. Broadly available to researchers: price driven by commodity market; low burden on system administration. Sustainable performance advantage: performance driven by Moore's law; stable market and supply chain.

17 Acceleration Options for NAMD Outlook in : FPGA reconfigurable computing (with NCSA): difficult to program, slow floating point, expensive. Cell processor (NCSA hardware): relatively easy to program, expensive. ClearSpeed (direct contact with company): limited memory and memory bandwidth, expensive. MDGRAPE: inflexible and expensive. Graphics processor (GPU): program must be expressed as graphics operations.

18 GPU vs CPU: Raw Performance Calculation: 450 GFLOPS vs 32 GFLOPS. Memory bandwidth: 80 GB/s vs 8.4 GB/s. [Chart: GFLOPS by GPU generation.] G80 = GeForce 8800 GTX; G71 = GeForce 7900 GTX; G70 = GeForce 7800 GTX; NV40 = GeForce 6800 Ultra; NV35 = GeForce FX 5950 Ultra; NV30 = GeForce FX 5800.

19 CUDA: Practical Performance November 2006: NVIDIA announces CUDA for G80 GPU. CUDA makes GPU acceleration usable: Developed and supported by NVIDIA. No masquerading as graphics rendering. New shared memory and synchronization. No OpenGL or display device hassles. Multiple processes per card (or vice versa). TCBG and collaborators make it useful: Experience from VMD development David Kirk (Chief Scientist, NVIDIA) Wen-mei Hwu (ECE Professor, UIUC) Fun to program (and drive) Stone et al., J. Comp. Chem. 28: , 2007.

20 Typical CPU Architecture [Diagram labels: L2 cache, L3 cache, L1 I, L1 D, dispatch/retire, FPU, FPU, ALU, memory controller.]

21 Minimize the Processor No large caches or multiple execution units. Do integer arithmetic on FPU. [Diagram labels: L1 I, L1 D, dispatch/retire, FPU, memory controller.]

22 Maximize Floating Point 8 FP pipelines per SIMD unit. Shared data cache. Single instruction stream. One thread per FPU allows branches and gather/scatter. [Diagram labels: L1 I, L1 D, dispatch/retire, 8 FPUs, memory controller.]

23 Add More Threads Pipeline 4 threads per FPU to hide 4-cycle instruction latency. All 32 threads in a warp execute the same instruction. Divergent branches allowed through predication. [Diagram: 8 FPUs.]

24 Add Even More Threads Multiple warps in a block hide main memory latency and can synchronize to share data. [Diagram: 8 FPUs.]

25 Add More Threads Again Multiple blocks on a single multiprocessor hide both memory and synchronization latency. All blocks execute a kernel function independently without synchronization or memory coherency. [Diagram: 8 FPUs.]

26 Add Cores to Suit Customer Kernel is invoked on a grid of uniform blocks. Blocks are dynamically assigned to available multiprocessors and run to completion. Synchronization occurs when all blocks complete.
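The execution model on this slide can be mimicked on a CPU with a small sketch: a kernel function is invoked once per (block, thread) pair over a grid of uniform blocks, blocks are visited in an arbitrary order, and the launch returning stands in for the synchronization that occurs when all blocks complete. This is a plain Python emulation, not CUDA code. The names `launch` and `scale_kernel` are hypothetical.

```python
import random

def launch(kernel, grid_dim, block_dim, *args):
    """Run `kernel` for every thread of every block, visiting blocks in random order."""
    block_ids = list(range(grid_dim))
    random.shuffle(block_ids)  # blocks are independent, so any order must work
    for block_idx in block_ids:
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)
    # Returning here plays the role of the end-of-kernel synchronization.

def scale_kernel(block_idx, thread_idx, block_dim, data, out, factor):
    """Each emulated thread scales one element of `data`."""
    i = block_idx * block_dim + thread_idx  # global element index
    if i < len(data):                       # guard the partially filled last block
        out[i] = factor * data[i]

data = [float(i) for i in range(10)]
out = [0.0] * len(data)
block_dim = 4
grid_dim = (len(data) + block_dim - 1) // block_dim  # ceiling division
launch(scale_kernel, grid_dim, block_dim, data, out, 2.0)
print(out)
```

Because no block reads what another block writes, the result does not depend on the order in which blocks run, which is what lets the hardware assign blocks to whichever multiprocessors are free.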

27 Support Fine-Grained Parallelism Threads are cheap but desperately needed. How many can you give? 512 threads will keep all 128 FPUs busy. [number missing] threads will hide some memory latency. 12,288 threads can run simultaneously. Up to [number missing] threads per kernel invocation.
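The thread counts quoted on these slides follow from a few multiplications, sketched below. The per-FPU and per-SIMD-unit figures come from the slides. The limit of 768 resident threads per multiprocessor is an assumption based on the G80 generation and is not stated in the slides.

```python
FPUS_PER_MULTIPROCESSOR = 8        # slides: 8 FP pipelines per SIMD unit
THREADS_PER_FPU = 4                # slides: 4 threads pipelined per FPU
TOTAL_FPUS = 128                   # slides: 128 FPUs on the device
MAX_RESIDENT_THREADS_PER_MP = 768  # assumption: G80-era limit, not from the slides

warp_size = FPUS_PER_MULTIPROCESSOR * THREADS_PER_FPU
multiprocessors = TOTAL_FPUS // FPUS_PER_MULTIPROCESSOR
threads_to_fill_fpus = TOTAL_FPUS * THREADS_PER_FPU
max_simultaneous_threads = multiprocessors * MAX_RESIDENT_THREADS_PER_MP

print(warp_size, multiprocessors, threads_to_fill_fpus, max_simultaneous_threads)
```

With these inputs the warp size, the 512-thread figure, and the 12,288-thread figure from the slides are all reproduced.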

28 VMD Visual Molecular Dynamics Visualization and analysis of molecular dynamics simulations, sequence data, volumetric data, quantum chemistry simulations, and particle systems. User extensible with scripting and plugins. Research/vmd/

29 Molecular Modeling: Ion Placement Model structures are initially constructed in vacuum Solvent (water) and ions are added as necessary to reproduce the required biological conditions Computational requirements scale with the size of the simulated structure

30 Electrostatic Potential Maps Electrostatic potentials evaluated on a 3-D lattice. Applications include: ion placement for structure building; time-averaged potentials for simulation; visualization and analysis. [Image: isoleucine tRNA synthetase.]

31 Direct Summation Algorithm Each lattice point accumulates the electrostatic potential contribution from all atoms: potential[j] += atom[i].charge / r_ij, where r_ij is the distance from lattice[j] to atom[i]. [Diagram: lattice point j being evaluated, and atom[i].]
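The accumulation on this slide can be written out as a minimal CPU sketch. It follows the slide in omitting physical constants, and it assumes no lattice point coincides with an atom. The tuple layout and the function name are assumptions for illustration. This is not VMD's CUDA kernel.

```python
import math

def direct_sum_potential(lattice_points, atoms):
    """Accumulate charge / r_ij at every lattice point.

    lattice_points: list of (x, y, z) tuples
    atoms: list of (x, y, z, charge) tuples
    """
    potential = [0.0] * len(lattice_points)
    for j, (px, py, pz) in enumerate(lattice_points):
        for ax, ay, az, charge in atoms:
            # Distance from lattice point j to atom i.
            r_ij = math.sqrt((px - ax) ** 2 + (py - ay) ** 2 + (pz - az) ** 2)
            potential[j] += charge / r_ij
    return potential

# Two opposite unit charges and two sample lattice points (arbitrary units).
atoms = [(0.0, 0.0, 0.0, 1.0), (2.0, 0.0, 0.0, -1.0)]
lattice = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
potential = direct_sum_potential(lattice, atoms)
print(potential)
```

The cost is proportional to the number of lattice points times the number of atoms, and every lattice point is computed independently, which is why the calculation maps well onto many GPU threads.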

32 Ion Placement via Direct Sum Satellite Tobacco Mosaic Virus (STMV) ion placement: 110 CPU-hours on Altix; 1.35 hours on GPU; 27 minutes on three GPUs.
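The ratios implied by these timings can be worked out directly. The sketch below treats the 110 CPU-hours as the time a single CPU core would need, which is an assumption, since the slide does not say how many cores were used.

```python
cpu_hours = 110.0           # slide: CPU-hours on the Altix
one_gpu_hours = 1.35        # slide: wall-clock hours on one GPU
three_gpu_hours = 27 / 60   # slide: 27 minutes on three GPUs

# Speedup of one GPU over one CPU core (assumes CPU-hours means single-core time).
speedup_one_gpu = cpu_hours / one_gpu_hours

# How much faster three GPUs are than one, and the resulting parallel efficiency.
scaling_three_gpus = one_gpu_hours / three_gpu_hours
efficiency = scaling_three_gpus / 3

print(round(speedup_one_gpu, 1), round(scaling_three_gpus, 2), round(efficiency, 2))
```

Under that assumption a single GPU is roughly 80 times faster than a CPU core, and going from one to three GPUs scales almost linearly.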

33 CUDA Acceleration in VMD Electrostatic field calculation, ion placement: 20x to 44x faster. Molecular orbital calculation and display: 100x to 120x faster. Imaging of gas migration pathways in proteins with implicit ligand sampling: 20x to 30x faster.

34 NAMD Lincoln Cluster Performance (8 Intel cores and 2 NVIDIA Tesla GPUs per node) STMV (1M atoms). [Plot: s/step versus number of CPU cores, with curves for 2, 4, 8, and 16 GPUs; annotations: ~2.8 and "2 GPUs = 24 cores".]

35 NAMD Petascale Preparations

36 Blue Waters Architecture IBM Power 7 Peak Perf ~10 PF Sustained ~1 PF 300,000+ cores 1.2+ PB Memory 18+ PB Disc 8 cores/chip 4 chips/mcm 8 MCMs/Drawer 4 Drawers/SuperNode 1024 cores/supernode Linux OS
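The packaging hierarchy on this slide can be checked with a short calculation. The per-level counts are taken from the slide. The supernode estimate uses the quoted lower bound of 300,000 cores, so it is only a rough figure.

```python
CORES_PER_CHIP = 8
CHIPS_PER_MCM = 4
MCMS_PER_DRAWER = 8
DRAWERS_PER_SUPERNODE = 4
TOTAL_CORES_LOWER_BOUND = 300_000  # slide: "300,000+ cores"

cores_per_mcm = CORES_PER_CHIP * CHIPS_PER_MCM
cores_per_drawer = cores_per_mcm * MCMS_PER_DRAWER
cores_per_supernode = cores_per_drawer * DRAWERS_PER_SUPERNODE

# Ceiling division: supernodes needed to hold at least the quoted core count.
supernodes_needed = -(-TOTAL_CORES_LOWER_BOUND // cores_per_supernode)

print(cores_per_mcm, cores_per_drawer, cores_per_supernode, supernodes_needed)
```

The product matches the 1024 cores per supernode stated on the slide, and the quoted core count corresponds to a little under 300 supernodes.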

37 Challenges and Opportunities Support systems >= 100 Million atoms Performance requirements for 100 Million atom Scale to over 300,000 cores Power 7 Hardware PPC architecture Wide node at least 32 cores with 128 HT threads Blue Waters Torrent interconnect Doing research under NDA

38 Planned Petascale Simulations

39 Thanks to NIH, NSF, DOE, and 15 years of NAMD and Charm++ developers and users. James Phillips Beckman Institute, University of Illinois Research/namd/

Graphics Processor Acceleration and YOU James Phillips Research/gpu/ Goals of Lecture After this talk the audience will: Understand how GPUs differ from CPUs Understand the limits of GPU acceleration Have