Lecture 5: Input Binning


PASI Summer School: Advanced Algorithmic Techniques for GPUs
Lecture 5: Input Binning

Objective
- To understand how data scalability problems in gather-style parallel execution motivate input binning
- To learn basic input binning techniques
- To understand common tradeoffs in input binning

Scatter to Gather Transformation
[Figure: the same computation restructured so that each thread (Thread 1, Thread 2) gathers its own inputs from in to produce one element of out, instead of scattering contributions into out.]

However...
- Input tends to be much less regular than output
- It may be difficult for each thread to efficiently locate all inputs relevant to its output, or to efficiently exclude all inputs irrelevant to its output
- In a naive arrangement, all threads may have to process all inputs to decide whether each input is relevant to their output
- This makes execution time scale poorly with data set size: the data scalability problem
- Especially a problem for many-cores designed to process large data sets

DCS Algorithm for Electrostatic Potentials Revisited
- At each grid point, sum the electrostatic potential from all atoms
- All threads read all inputs
- Highly data-parallel, but quadratic complexity: (number of grid points) × (number of atoms), both proportional to volume
- Poor data scalability in terms of volume

[Figure: Scalability and performance of different algorithms for calculating an electrostatic potential map. Direct summation is accurate but has poor data scalability; the cutoff implementations all share the same, better scalability.]

Algorithm for Electrostatic Potentials With a Cutoff
- Ignore atoms beyond a cutoff distance r_c, typically 8Å to 12Å
- The long-range potential may be computed separately
- The number of atoms within the cutoff distance is roughly constant (uniform atom density): 200 to 700 atoms within an 8Å-12Å cutoff sphere for typical biomolecular structures
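The "roughly constant" atom count above is just the cutoff sphere's volume times the atom density; a quick sketch (the ~0.1 atoms/Å³ density figure is an assumption typical of biomolecular systems, not stated on this slide):

```c
/* Expected number of atoms inside a cutoff sphere of radius rc (Å),
 * for a roughly uniform atom density (atoms per cubic Å). */
float atoms_in_cutoff(float rc, float density) {
    return (4.0f / 3.0f) * 3.14159265f * rc * rc * rc * density;
}
```

With density 0.1 atoms/Å³ this gives roughly 214 atoms at an 8Å cutoff and roughly 724 at 12Å, in line with the 200-700 range quoted above.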

Cut-off Summation
With fixed partial charges q_i, the electrostatic potential V at position r over all N atoms is

    V(r) = sum over i = 1..N of q_i * w(|r - r_i|)

where w(d) = 1/d - g(d) for d < r_c and w(d) = 0 otherwise; g is the smoothing polynomial that the kernel later evaluates as GC0 + GC1*d^2 + GC2*d^4.

Implementation Challenge
- For each tile of grid points, we need to identify the set of atoms that must be examined
- One could naively examine all atoms and use only the ones within the given range, but this examination still takes time and brings the complexity right back to (number of atoms) × (number of grid points)
- Each thread needs to avoid examining the atoms outside the range of its grid point(s)

Binning
- A process that groups data items into chunks called bins
- Each bin collectively represents a property of the data it contains
- Helps problem solving through data coarsening
- Variants: uniform bin arrays, variable bins, k-d trees, ...

Binning for Cut-Off Potential
- Divide the simulation volume into non-overlapping uniform cubes
- Every atom in the simulation volume falls into a cube based on its spatial location: bins represent the location property of atoms
- After binning, each cube has a unique index in the simulation space for easy parallel access
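The unique-index mapping can be sketched as follows (an illustrative helper; the names and the clamping policy are assumptions, not from the lecture):

```c
/* Map an atom's coordinates to the flat index of its uniform cube.
 * binlen is the cube edge length in Å; nx, ny, nz are the bin-grid
 * dimensions. Coordinates are clamped to the simulation volume. */
int bin_index(float x, float y, float z, float binlen,
              int nx, int ny, int nz) {
    int bx = (int)(x / binlen);
    int by = (int)(y / binlen);
    int bz = (int)(z / binlen);
    if (bx < 0) bx = 0; if (bx >= nx) bx = nx - 1;
    if (by < 0) by = 0; if (by >= ny) by = ny - 1;
    if (bz < 0) bz = 0; if (bz >= nz) bz = nz - 1;
    return (bz * ny + by) * nx + bx;   /* unique index per cube */
}
```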

Spatial Sorting Using Binning
- Presort atoms into bins by their location in space; each bin holds several atoms
- The cutoff potential only uses bins within r_c, yielding a linear-complexity cutoff potential algorithm
- Some atoms will be examined by a thread but not used

Terminology
- Bin size: the edge length of the bin cubes that partition the simulation volume. The bigger the bin size, the more atoms fall into each bin.
- Bin capacity: the number of atoms that can be accommodated by each bin in the array implementation.

Bin Capacity Considerations
- The capacity of atom bins needs to be balanced: too large, and bins contain many dummy atoms; too small, and some atoms will not fit into bins
- Target a bin capacity that covers more than 95% of atoms
- Place all atoms that do not fit into bins into an overflow bin, and use a sequential CPU algorithm to calculate their contributions to the energy grid lattice points
- The CPU and GPU can then do potential calculations in parallel
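The fixed-capacity-plus-overflow scheme can be sketched like this (names and the capacity value are illustrative assumptions):

```c
#define BIN_CAPACITY 8   /* assumed capacity; the real value is tuned */

typedef struct { float x, y, z, q; } BinAtom;

typedef struct {
    BinAtom slots[BIN_CAPACITY];
    int     count;
} Bin;

/* Place an atom in its bin if there is room; otherwise append it to
 * the overflow list handled by the sequential CPU path.
 * Returns 1 if the atom went into the bin, 0 if it overflowed. */
int place_atom(Bin *bin, BinAtom a, BinAtom *overflow, int *n_overflow) {
    if (bin->count < BIN_CAPACITY) {
        bin->slots[bin->count++] = a;
        return 1;
    }
    overflow[(*n_overflow)++] = a;
    return 0;
}
```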

Bin Design
- Uniformly sized, fixed-capacity bins allow an array implementation and the relative-offset-list approach
- Bin capacity should be big enough to contain all the atoms that fall into a bin
- The cutoff test will screen away atoms that need not be processed, but there is a performance penalty if too many are screened away

Going from the DCS Kernel to a Large-Bin Cutoff Kernel
- Adapts techniques from the direct Coulomb summation (DCS) kernel to a cutoff kernel
- Atoms are stored in constant memory, as with the DCS kernel
- The CPU loops over potential map regions that are (24Å)³ in volume (a cube containing the cutoff sphere)
- Each map region requires only one large bin of atoms
- For each map region, atoms in the corresponding large bin are appended to the constant memory atom buffer until it is full, then the GPU kernel is launched
- The host continues to reload constant memory and launch GPU kernels until all atoms in the corresponding large bin are consumed

Large Bin Design Concept
- Map regions are (24Å)³ in volume
- Regions are sized large enough to give the GPU enough work in a single kernel launch: (48 lattice points)³ for a lattice with 0.5Å spacing
- (20 atoms)³ on average for each large bin
- Bin size and capacity are designed to let each kernel launch cover enough lattice points to justify the kernel launch overhead and fully utilize the GPU hardware

Large Bin Cut-off Kernel Code

    static __constant__ float4 atominfo[maxatoms];

    __global__ static void mgpot_shortrng_energy( ... ) {
        [ ... ]
        for (n = 0; n < natoms; n++) {
            float dx = coorx - atominfo[n].x;
            float dy = coory - atominfo[n].y;
            float dz = coorz - atominfo[n].z;
            float q = atominfo[n].w;
            float dxdy2 = dx*dx + dy*dy;
            float r2 = dxdy2 + dz*dz;
            if (r2 < CUTOFF2) {
                float gr2 = GC0 + r2*(GC1 + r2*GC2);
                float r_1 = 1.f/sqrtf(r2);
                accum_energy_z0 += q * (r_1 - gr2);
            }
        }
        [ ... ]
    }

Large-Bin Cutoff Kernel Evaluation
- 6× speedup relative to a fast CPU version
- Work-inefficient: with coarse spatial hashing into (24Å)³ bins, only 6.5% of the atoms a thread tests are within the cutoff distance
- Small-bin designs improve work efficiency but require more sophisticated kernel code

Small-Bin Kernels: Improving Work Efficiency
- Each thread block examines atom bins up to the cutoff distance, using a sphere of bins
- All threads in a block scan the same bins and atoms: there is no hardware penalty for multiple simultaneous reads of the same address, and fetching of data is simplified
- The sphere has to be big enough to cover all grid points at the corners
- There will be a small level of divergence: not all grid points processed by a thread block relate to all atoms in a bin the same way (e.g., atom A is within the cutoff distance of grid point N but outside the cutoff of grid point M)

The Neighborhood Is a Volume
- Calculating and specifying all bin indexes of the sphere can be quite complex
- Rough approximations reduce efficiency

Neighborhood Offset List (Pre-calculated)
- A list of relative offsets enumerating the bins located within the cutoff distance of a given location in the simulation volume
- Detection of surrounding atoms becomes practical for output grid points: visit the bins in the neighborhood offset list and iterate over the atoms they contain
[Figure: 2D example around the center bin (0, 0), showing the cutoff distance; a bin such as (-1, -1) is in the neighborhood list, while a bin such as (1, 2) is not included.]

Large-Sized Bin Design
- Each bin covers the simulation volume for all grid points processed by a thread block, so all potentially needed input atoms are present
- It also needs to have sufficient capacity
- The work efficiency is quite low: about 6.5% in CP

Slide credit: Wen-mei W. Hwu and David Kirk/NVIDIA, Urbana, Illinois, August 2-5, 2010

Pseudo Code of an Implementation

    // 1. Binning (CPU)
    for each atom in the simulation volume:
        index_of_bin := atom.addr / BIN_SIZE
        bin[index_of_bin] += atom

    // 2. Generate the neighborhood offset list (CPU)
    for each offset c from -cutoff to cutoff:
        if distance(0, c) < cutoff:
            nlist += c

    // 3. Do the computation (GPU)
    for each point in the output grid:
        index_of_bin := point.addr / BIN_SIZE
        for each offset in nlist:
            for each atom in bin[index_of_bin + offset]:
                point.potential += atom.charge / (distance from point to atom)

Performance
- O(M × N), where M is the number of output grid points and N is the number of atoms in the neighborhood offset list
- In general, N is small compared to the total number of atoms
- Works well if the distribution of atoms is uniform

Details of the Small-Bin Design
- For 0.5Å lattice spacing, a (4Å)³ cube of the potential map is computed by each thread block
- 8×8×8 potential map points; 128 threads per block (4 points/thread)
- 34% of examined atoms are within the cutoff distance

More Design Considerations for the Cutoff Kernel
- High memory throughput to atom data is essential: group threads together for locality, fetch bins of data into shared memory, and structure atom data to allow efficient fetching
- After taking care of memory demand, optimize to reduce instruction count (loop and instruction-level optimization)

Tiling Atom Data
- Shared memory is used to reduce global memory bandwidth consumption
- Threads in a thread block collectively load one bin at a time into shared memory
- Once a bin is loaded, threads scan its atoms in shared memory
- Reuse: each loaded bin is used 128 times
- Execution cycle of a thread block: threads individually compute potentials using the bin in shared memory, then collectively load the next bin; the block suspends while data returns from global memory (another thread block runs while this one waits), then writes the bin to shared memory and is ready again

Coalesced Global Memory Access to Atom Data
- Full global memory bandwidth only with 64-byte, 64-byte-aligned memory accesses
- Each bin is exactly 128 bytes; bins are stored in a 3D array
- 32 threads in each block load one bin into shared memory, which is then processed by all threads in the block
- 128 bytes = 8 atoms (x, y, z, q)
- Nearly uniform density of atoms in typical systems: 1 atom per 10 Å³; bins hold atoms from (4Å)³ of space (example)
- The number of atoms in a bin varies: for water test systems, 5.35 atoms per bin on average, and some bins are overfull
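A host-side mirror of this layout, as a sketch: each atom is a 16-byte (x, y, z, q) record matching CUDA's float4, and 8 of them make one 128-byte bin (the type names here are assumptions):

```c
/* 16-byte atom record, matching CUDA's float4: q is stored in w.
 * Empty slots carry q = 0 so the kernel's inner loop can exit early. */
typedef struct { float x, y, z, q; } float4_t;

#define BIN_DEPTH 8   /* 8 atom slots per bin */

/* One bin = 8 float4 records = exactly 128 bytes, so a thread block
 * can load a bin with two 64-byte-aligned coalesced accesses. */
typedef struct { float4_t atom[BIN_DEPTH]; } Bin128;
```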

Handling Overfull Bins
- In typical use, 2.6% of atoms exceed bin capacity
- Spatial sorting puts these into a list of extra atoms
- Extra atoms are processed by the CPU with a CPU-optimized algorithm, taking about 66% as long as the GPU computation
- Overlapping GPU and CPU computation yields additional speedup
- The CPU performs the final integration of grid data

CPU Grid Data Integration
- The effects of overflow atoms are added to the master energygrid array on the CPU
- Slices of grid point values calculated by the GPU are added into the master energygrid array while the padded elements are removed
[Figure: GPU-computed slices (0,0), (0,1), (1,0), (1,1) merged into the master grid.]

GPU Thread Coarsening
- Each thread computes potentials at four potential map points
- Reuse the x and z components of the distance calculation
- Check the x and z components against the cutoff distance (cylinder test)
- Exit the inner loop early upon encountering the first empty slot in a bin

GPU Thread Inner Loop
- Exit when an empty atom bin entry is encountered
- Cylinder test, then cutoff test and potential value calculation

    for (i = 0; i < BIN_DEPTH; i++) {
        aq = AtomBinCache[i].w;
        if (aq == 0) break;              // empty bin entry: exit
        dx = AtomBinCache[i].x - x;
        dz = AtomBinCache[i].z - z;
        dxdz2 = dx*dx + dz*dz;
        if (dxdz2 >= cutoff2) continue;  // cylinder test
        dy = AtomBinCache[i].y - y;
        r2 = dy*dy + dxdz2;
        if (r2 < cutoff2)                // cutoff test
            poten0 += aq * rsqrtf(r2);   // simplified example
        dy = dy - 2 * grid_spacing;
        /* Repeat three more times for the other grid points */
    }

Cutoff Summation Runtime
[Figure: runtime versus structure size, 50k to 1M atoms. The GPU cutoff kernel with CPU overlap is 12×-21× faster than a CPU core.]

Summary
- Large bins allow reuse of all-input kernels with little code change, but work efficiency can be very low
- Small bins require more sophisticated kernel code to traverse the list of bins, but give much higher work efficiency
- Small bins also serve as tiles for locality
- The CPU processes overflow atoms from the fixed-capacity bins