Efficient Computation of Radial Distribution Function on GPUs

Similar documents
Computing Spatial Distance Histograms for Large Scientific Datasets On-the-Fly

Distance Histogram Computation Based on Spatiotemporal Uniformity in Scientific Data

Approximate Algorithms for Computing Spatial Distance Histograms with Accuracy Guarantees

Warped parallel nearest neighbor searches using kd-trees

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

A Comparative Study of Dual-tree Algorithms for Computing Spatial Distance Histogram

A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC

Lecture 2: CUDA Programming

Accelerating Molecular Modeling Applications with Graphics Processors

A Comparative Study of Dual-tree Algorithm Implementations for Computing 2-Body Statistics in Spatial Data

A Parallel Access Method for Spatial Data Using GPU

Visual Analysis of Lagrangian Particle Data from Combustion Simulations

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

N-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Using GPUs to compute the multilevel summation of electrostatic forces

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

Efficient Range Query Processing on Uncertain Data

3D Registration based on Normalized Mutual Information

Scalable Multi Agent Simulation on the GPU. Avi Bleiweiss NVIDIA Corporation San Jose, 2009

Parallelising Pipelined Wavefront Computations on the GPU

Bandwidth Avoiding Stencil Computations

An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm. Martin Burtscher Department of Computer Science Texas State University-San Marcos

Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs

CME 213 S PRING Eric Darve

Optimization solutions for the segmented sum algorithmic function

Parallel Computing: Parallel Architectures Jin, Hai

Warps and Reduction Algorithms

GPU Implementation of a Multiobjective Search Algorithm

Multi Agent Navigation on GPU. Avi Bleiweiss

Max-Count Aggregation Estimation for Moving Points

Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CS4961 Parallel Programming. Lecture 14: Reasoning about Performance 10/07/2010. Administrative: What s Coming. Mary Hall October 7, 2010

Alternative GPU friendly assignment algorithms. Paul Richmond and Peter Heywood Department of Computer Science The University of Sheffield

Benchmarking the Memory Hierarchy of Modern GPUs

Accelerating Correlation Power Analysis Using Graphics Processing Units (GPUs)

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

Online Document Clustering Using the GPU

Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures. Adam McLaughlin, Duane Merrill, Michael Garland, and David A.

CS 475: Parallel Programming Introduction

Accelerating MapReduce on a Coupled CPU-GPU Architecture

Face Detection using GPU-based Convolutional Neural Networks

EE/CSCI 451 Midterm 1

Lecture 24: Image Retrieval: Part II. Visual Computing Systems CMU , Fall 2013

NVIDIA. Interacting with Particle Simulation in Maya using CUDA & Maximus. Wil Braithwaite NVIDIA Applied Engineering Digital Film

arxiv: v1 [cs.dc] 24 Feb 2010

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27

Unrolling parallel loops

Massively Parallel GPU-friendly Algorithms for PET. Szirmay-Kalos László, Budapest, University of Technology and Economics

When MPPDB Meets GPU:

High-Performance Holistic XML Twig Filtering Using GPUs. Ildar Absalyamov, Roger Moussalli, Walid Najjar and Vassilis Tsotras

GPU-accelerated data expansion for the Marching Cubes algorithm

Announcements. Written Assignment2 is out, due March 8 Graded Programming Assignment2 next Tuesday

Implementing a Speech Recognition System on a GPU using CUDA. Presented by Omid Talakoub Astrid Yi

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

GPU Optimized Monte Carlo

Architecture-Conscious Database Systems

Real-Time Reyes: Programmable Pipelines and Research Challenges. Anjul Patney University of California, Davis

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

Accelerating CFD with Graphics Hardware

Variants of Mersenne Twister Suitable for Graphic Processors

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross

A GPU-based Approximate SVD Algorithm Blake Foster, Sridhar Mahadevan, Rui Wang

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS

ADMS/VLDB, August 27 th 2018, Rio de Janeiro, Brazil OPTIMIZING GROUP-BY AND AGGREGATION USING GPU-CPU CO-PROCESSING

Efficient Lists Intersection by CPU- GPU Cooperative Computing

Preparing seismic codes for GPUs and other

Optimisation Myths and Facts as Seen in Statistical Physics

Data Modeling and Databases Ch 9: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

INFOGR Computer Graphics. J. Bikker - April-July Lecture 11: Acceleration. Welcome!

Simulation in Computer Graphics Space Subdivision. Matthias Teschner

CuSha: Vertex-Centric Graph Processing on GPUs

Splotch: High Performance Visualization using MPI, OpenMP and CUDA

implementation using GPU architecture is implemented only from the viewpoint of frame level parallel encoding [6]. However, it is obvious that the mot

Accelerating Mean Shift Segmentation Algorithm on Hybrid CPU/GPU Platforms

Hash Joins for Multi-core CPUs. Benjamin Wagner

Compression in Molecular Simulation Datasets

Adaptive Parallel Compressed Event Matching

TUNING CUDA APPLICATIONS FOR MAXWELL

Deformable and Fracturing Objects

Automatic Scaling Iterative Computations. Aug. 7 th, 2012

Multidimensional Indexes [14]

Performance Metrics of a Parallel Three Dimensional Two-Phase DSMC Method for Particle-Laden Flows

Numerical Simulation on the GPU

GPU Data Mining in Neuroimaging Genomics

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA

Datenbanksysteme II: Modern Hardware. Stefan Sprenger November 23, 2016

Advanced and parallel architectures. Part B. Prof. A. Massini. June 13, Exercise 1a (3 points) Exercise 1b (3 points) Exercise 2 (8 points)

DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

School of Computing & Singapore-MIT Alliance, National University of Singapore {cuibin, lindan,

Collision Detection based on Spatial Partitioning

TUNING CUDA APPLICATIONS FOR MAXWELL

PFAC Library: GPU-Based String Matching Algorithm

Transcription:

Efficient Computation of Radial Distribution Function on GPUs Yi-Cheng Tu * and Anand Kumar Department of Computer Science and Engineering University of South Florida, Tampa, Florida

2 Overview Introduction and Motivation Spatial Distance Histogram (SDH) GPU Architecture Brute-force SDH Algorithm on GPU Advanced SDH Algorithms on GPU Conclusions and Future Work

3 Introduction and Motivation Scientific databases: very large amount of data CERN Hadron Collider generates 15 PB every year Next Generation Genome Sequencer generates 100s of GB in few days Molecular / Particle simulations Study physical systems through simulations System state is stored at many time instances Each instance is named a frame Frame constitutes system state Measurement of particle properties Physical location, velocity, charge, mass etc. Thousands of frames are generated in a simulation

4 Introduction and Motivation Millions of particles / atoms are simulated Analytical query processing in MS data Center of mass System density, energy Mean square displacement Radial distribution function (RDF) Spatial Distance Histogram (SDH) Key step for computing RDF and other queries Interesting and tough as compared to others - O(N 2 ) Simulation Space (diagonal 4w) Histogram buckets (of width w) 0 w 2w 3w 4w

5 Overview Introduction and Motivation Spatial Distance Histogram (SDH) GPU Architecture Brute-force SDH Algorithm on GPU Advanced SDH Algorithms on GPU Conclusions and Future Work

6 SDH Algorithms on CPUs Naïve algorithm in simulation software (e.g., GROMACS) Space partitioning trees (kd-tree) [NIPS00] Process group of particles as one unit Density-Map SDH (DM-SDH) based on Quad/Oct-trees [ICDE09,VLDBJ11,EDBT12,TKDE13,TKDE14] Runs in O(N (2d-1)/d ) for d-dimensional data Approximate variants with running time independent of N Modern multicore hardware posses tremendous power Utilize the power for On-the-fly SDH computation All aforementioned algorithms are easily parallelizable

7 Overview Introduction and Motivation Spatial Distance Histogram (SDH) GPU Architecture Brute-force SDH Algorithm on GPU Advanced SDH Algorithms on GPU Conclusions and Future Work

8 GPU Architecture GPU host thousands of cores Hierarchy of memory with different access latency Process data in SIMD fashion Multiple threads access memory in parallel GPU Device Multi- Processor Multi- Processor Multi- Processor Multi-Processor Thread Thread Block Block (0, 0) Thread Level Private Memory Block Level Shared Memory Instruction Cache Register File Block (0, 1) Multiprocessor Blocks Block (1, 0) Block (1, 1) Device Level Global Memory Block (2, 0) Block (2, 1) Global Memory Core Core Core Core Host Core Core Core Core CPU Main Memory Shared Memory L1 Cache L2 Cache

9 Overview Introduction and Motivation Spatial Distance Histogram (SDH) GPU Architecture Brute-force SDH Algorithm on GPU Advanced SDH Algorithms on GPU Conclusions and Future Work

10 Brute-force SDH on GPU Naïve Method Load all N points into GPU global memory Compute all distances and generate histogram!for each point p!!!compute distance with all other N-1 points! Read 4-byte word per thread Device Level Global Memory Hisotgram Buckets Each thread computes distance of one point to other points on right Multiprocessor 1 Multiprocessor 2 Multiprocessor 3 Multiprocessor 4

11 SDH on GPU Shared Memory Method To deal with: High access latency of global memory Histogram buckets are updated by all threads -> serialized writes into buckets Multiprocessor Inter segment distances Intra segment distances Hisotgram Buckets Multiprocessor Level Shared Memory Hisotgram Buckets Device Level Global Memory

12 Experimental Results GeForce GTX TITAN CUDA Driver Version / Runtime Version: 6.0 / 5.0 CUDA Capability Major/Minor version number: 3.5 (14) Multiprocessors x (192) CUDA Cores/MP: 2688 CUDA Cores Total amount of global memory: 6144 Mbytes Total amount of shared memory per block: 49152 bytes SDH of various number of data points Histogram is assumed to fit in portion of shared memory All data is assumed to fit in global memory

13 Experimental Results Time GPU running time Comparison Data Points CPU Main Memory Shared Memory 100,000 424.7 (7m) 1.54 3.40 300,000 3812 (1h) 11.39 26.42 Global Memory 800,000 27142 (7h) 77.20 177.92 1,600,000 1.5 Day 316.01 699.16 3,200,000 Days 1317.21 2771.21 6,400,000 Days 5335.71 11036.78

14 Experimental Results Speedup GPU Speedup Comparison with respect to CPU Data Points Shared Memory Global Memory 100,000 275.7 124.9 300,000 334.6 144.2 800,000 351.5 152.5 1,600,000 410.1 185.3 3,200,000 6,400,000

15 Overview Introduction and Motivation Spatial Distance Histogram (SDH) GPU Architecture Brute-force SDH Algorithm on GPU Advanced SDH Algorithms on GPU Conclusions and Future Work

16 Background of DM-SDH Main idea: avoid the calculation of pairwise distances Observation: two groups of points can be processed in one shot if the range of all inter-group distances falls into a histogram bucket We say these two groups are resolved into that bucket 17 29 Histogram[i] += 17*29; 21 bucket i

17 Background of DM-SDH The simulation space Density Map DM 0 Density Map - DM 1 Density Map - DM 2 17 29 2 7 3 10 5 12 6 1 42 21 15 8 6 9 6 13 4 2 Min distance Distances = 17*21 Max distance 17 29 x% y% z% 21 bucket i bucket i+1 bucket i+2

18 Spatial Uniformity Closed from PDF of distances approximate PDF is derived Monte-Carlo simulation is another approximation to PDF A Compute Exact Proportions to Buckets B x% y% z% Uniform cells Monte-Carlo Simulations bucket i bucket i+1 bucket i+2 Theorem: Number of distinct Monte-Carlo simulations performed for all pair of cells in density map of M cells is O(M)

19 DM-SDH on GPUs Load all density maps on global memory. Shared memory is loaded by threads in a CUDA block Each thread loads information about one cell Each thread processes a pair of cells Replacing only one cell in loop until all cell are processed Next a new cell is loaded and paired with distinct other cells

20 DM-SDH Algorithm on GPUs Global Memory Load DM cells G i G j G k slide Histogram Buckets Density Map (DM) Tree Intra-group cell pairs Inter-group cell pairs Histogram Buckets Shared Memory

21 DM-SDH Algorithm on GPUs Each CUDA thread Processes a pair of cells from DM If pair of cells contain uniformly distributed points Threads in each CUDA block perform Monte-Carlo simulations The probability distribution function for histogram buckets is hashed Now for each pair of cells If they have uniformly distributed points Retrieve information from hash and update histogram buckets Else Either compute pair-wise distance of all pairs of points in both cells by naïve way Or use some heuristics

22 Experimental Results Performance MM: CPU version, GM: GPU version

23 Experimental Results Performance MM: CPU version, GM: GPU Global Memory, SM: GPU Shared Memory

24 Experimental Results - Energy Consumption

25 DM-SDH vs. Brute-force Brute-force algorithm more suitable for GPU deployment Main problems for DM-SDH on GPUs: Code divergence (i.e., tree traverse) Size of a cell is much larger than size of a point

26 Overview Introduction and Motivation Spatial Distance Histogram (SDH) GPU Architecture Brute-force SDH Algorithm on GPU Advanced SDH Algorithms on GPU Conclusions and Future Work

27 Conclusions and Future Work Computing power of GPUs can be harnessed for SDH processing Speedup varies (3X 400X) with the algorithm implemented and problem parameter Take advantage of new GPU/CUDA features Shuffle instruction Large register pool Read-only cache Extension to other correlation functions

28 Relevant Publications [TKDE14] A. Kumar, V. Grupcev, Y. Yuan, Y. Tu, Jin Huang, and G. Shen. Computing Spatial Distance Histograms for Large Scientific Datasets On-the-fly. To appear in IEEE Transactions on Knowledge and Data Engineering (TKDE). [TKDE13] V. Grupcev, Y. Yuan, Y. Tu, Jin Huang, S. Chen, S. Pandit, and M. Weng. Approximate Algorithms for Computing Distance Histograms with Accuracy Guarantees. IEEE Transactions on Knowledge and Data Engineering (TKDE) 25(9):1982-1996, September 2013. [VLDBJ11] S. Chen, Y. Tu, and Y. Xia. Performance Analysis of A Dual-Tree Algorithm for Computing Spatial Distance Histograms. The VLDB Journal. 20(4): 471-494, August 2011. [EDBT12] A. Kumar, V. Grupcev, Y. Tu, Y. Yuan, and G. Shen. Distance Histogram Computation Based on Spatiotemporal Uniformity in Scientific Data. In Procs. of 15th IEEE International Conference on Extending Database Technology (EDBT). pp.288-299, Berlin, Germany, March 26-30, 2012. [ICDE09] Y. Tu, S. Chen, and S. Pandit. Computing Distance Histograms Efficiently in Scientific Databases. In Procs. of 25th International Conference on Data Engineering (ICDE), pp. 796-807, Shanghai, China, March 2009. [NIPS00] Gray, A.G., Moore, A.W. N-body problems in statistical learning. In Advances in Neural Information Processing Systems (NIPS), 1408 pp. 521 527, MIT Press (2000)

29 THANK YOU Questions?

30 Background of DM-SDH The Approximate Algorithm (ADM-SDH) Organize all data into a Quad- tree (2D data) or Oct- tree (3D data). Cache the atoms counts of each tree node (cell) Density map: all counts in one tree level start from one proper density map M 0 FOR every pair of cells A and B in M 0 resolve A and B IF A and B are not resolvable THEN IF at desired density map level THEN distribute distances proportionally into overlapping buckets ELSE FOR each child cell A of A FOR each child cell B of B resolve A and B

31 SDH on GPU Using Registers GPU with compute capability 3.0 and up support the _shift instruction. Load large number of registers with distinct point Compute the SDH by shifting one set of register While keeping another set constant All pairs of distances between the two sets will be computed

32 SDH on GPU Using Registers Shift left after one iteration of pair-wise computation Pair-wise distance If N is odd N/2 shifts If N is even (N-1)/2 shifts Keep moving selection Points in global memory of device