Nostradamus Grand Cross - N-body simulation

Transcription:

Nostradamus Grand Cross - N-body simulation
Presenters: Li Dong, Hui Lin
Date: May 7, 2015

Outline
- Brief introduction to N-body problem
- Grand Cross evolution
  - Demo
  - How we did it?!
- Sequential implementation
  - Brute force scheme vs. Barnes Hut algorithm
  - Implementation of Seq. Barnes Hut algorithm
- Parallelization with CUDA
  - Brute force scheme vs. Barnes Hut algorithm
  - Speedup chart

Brief introduction to N-body problem
- Two-body -> Three-body -> N-body
- Origin of chaos theory
- Chaos: When the present determines the future, but the approximate present does not approximately determine the future.
Ref:
- Three body: http://www.3body.net/wp-content/uploads/2013/06/3body_problem_figure2.gif
- Newton portrait: http://www.newton.cam.ac.uk/art/portrait.html

Grand Cross evolution
- Demo

Grand Cross evolution cont.
- How we did it?!
  - Solve a constrained minimization problem
  - Cheat!
  - Reverse velocity direction
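The "reverse velocity direction" trick can be sketched as follows. This is a minimal illustration, not the presenters' code: it assumes a kick-drift-kick leapfrog integrator (the slides do not name the integrator), G = 1, and a hypothetical five-body cross with made-up masses and velocities. The idea is to start from the desired cross, integrate forward, flip every velocity, and use that state as the initial condition. Because leapfrog is time-reversible, the bodies then retrace their paths back into the cross.

```python
import math

def accelerations(pos, mass, eps=0.01):
    """Softened pairwise accelerations in 2D with G = 1 (O(N^2))."""
    n = len(pos)
    acc = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            inv_r3 = (dx * dx + dy * dy + eps * eps) ** -1.5
            acc[i][0] += mass[j] * dx * inv_r3
            acc[i][1] += mass[j] * dy * inv_r3
    return acc

def leapfrog(pos, vel, mass, dt, nstep):
    """Kick-drift-kick leapfrog; returns new (pos, vel), inputs untouched."""
    pos = [p[:] for p in pos]
    vel = [v[:] for v in vel]
    acc = accelerations(pos, mass)
    for _ in range(nstep):
        for i in range(len(pos)):
            vel[i][0] += 0.5 * dt * acc[i][0]   # half kick
            vel[i][1] += 0.5 * dt * acc[i][1]
            pos[i][0] += dt * vel[i][0]         # drift
            pos[i][1] += dt * vel[i][1]
        acc = accelerations(pos, mass)
        for i in range(len(pos)):
            vel[i][0] += 0.5 * dt * acc[i][0]   # second half kick
            vel[i][1] += 0.5 * dt * acc[i][1]
    return pos, vel

def reversed_start(target_pos, target_vel, mass, dt, nstep):
    """Initial state that evolves into target_pos after nstep steps."""
    pos, vel = leapfrog(target_pos, target_vel, mass, dt, nstep)
    return pos, [[-vx, -vy] for vx, vy in vel]

# Hypothetical cross: one heavy central body and four light arms.
cross = [[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
vel0 = [[0.0, 0.0], [0.0, 0.8], [0.0, -0.8], [-0.8, 0.0], [0.8, 0.0]]
mass = [1.0, 0.01, 0.01, 0.01, 0.01]

start_pos, start_vel = reversed_start(cross, vel0, mass, dt=0.001, nstep=2000)
end_pos, _ = leapfrog(start_pos, start_vel, mass, dt=0.001, nstep=2000)
max_err = max(math.hypot(e[0] - c[0], e[1] - c[1])
              for e, c in zip(end_pos, cross))
```

Here `max_err` is the largest distance between a body's final position and its slot in the cross; with a reversible integrator it is limited only by floating-point round-off.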

Sequential implementation
- Brute force scheme
  - Almost trivial, O(N^2)
- Barnes Hut algorithm
  - Quad tree (2D) / Oct tree (3D)
  - O(N log N)
Ref:
- Quad tree: http://www.cs.princeton.edu/courses/archive/fall03/cs126/assignments/barnes-hut.html
- Barnes Hut algo: Barnes, Josh, and Piet Hut. "A hierarchical O(N log N) force-calculation algorithm." Nature 324 (1986): 446-449.
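A minimal sketch of the brute force scheme, assuming 2D positions, G = 1, and an optional softening length `eps` (the function name and argument layout are illustrative, not taken from the slides). Every body visits every other body, which is where the O(N^2) cost comes from.

```python
def brute_force_accelerations(pos, mass, eps=0.0):
    """Acceleration on every body from every other body (G = 1)."""
    n = len(pos)
    acc = []
    for i in range(n):
        ax = ay = 0.0
        for j in range(n):                 # N - 1 interactions per body
            if j == i:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            inv_r3 = (dx * dx + dy * dy + eps * eps) ** -1.5
            ax += mass[j] * dx * inv_r3
            ay += mass[j] * dy * inv_r3
        acc.append((ax, ay))
    return acc

# Two bodies 2 units apart: a = m_other / r^2, directed at the other body.
acc = brute_force_accelerations([(0.0, 0.0), (2.0, 0.0)], [1.0, 4.0])
```

For the two-body example, the lighter body is pulled with acceleration 4 / 2^2 = 1 and the heavier one with 1 / 2^2 = 0.25, in opposite directions.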

Sequential implementation cont.
- Implementation of Seq. Barnes Hut algorithm
  - For simplification: the bounding box is fixed; planets that fly beyond the boundary are considered escaped and are removed from the tree.
  - Accuracy control: Theta = s / d = 0.025, where s is the width of the region represented by the internal node and d is the distance between the body and the node's center of mass. A node whose s / d falls below Theta is treated as a single body at its center of mass.
  - No sorting of nodes, softening factor = 0.01 for close planets, nodemax = 16, fixed time step dt = 0.001.
- Pseudo code
    Initialize tree;
    for 1 : nstep {
        calculate center of mass for each node;
        calculate interactive attractions among nodes;
        update positions;
        migrate nodes if necessary;
    }
Ref: http://charm.cs.uiuc.edu/cs498lvk/projects/bhatele/code/sequential/
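The tree build and the Theta test can be sketched as below. This is an illustrative quad tree, not the referenced implementation: it assumes G = 1, one body per leaf (the slides' nodemax = 16 is not modelled), distinct body positions, positive masses, and a square box centred on the origin. All class and function names are hypothetical.

```python
class Node:
    """Square region: centre (cx, cy), half-width half."""
    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half
        self.count = 0
        self.mass = self.mx = self.my = 0.0   # total mass, mass-weighted sums
        self.body = None                      # (x, y, m) while a single leaf
        self.children = None

    def _child_for(self, x, y):
        return self.children[2 * (x >= self.cx) + (y >= self.cy)]

    def insert(self, x, y, m):
        if self.count == 0:                   # empty leaf: store the body
            self.body = (x, y, m)
        else:
            if self.children is None:         # occupied leaf: split in four
                h = self.half / 2.0
                self.children = [Node(self.cx + sx * h, self.cy + sy * h, h)
                                 for sx in (-1, 1) for sy in (-1, 1)]
                bx, by, bm = self.body
                self.body = None
                self._child_for(bx, by).insert(bx, by, bm)
            self._child_for(x, y).insert(x, y, m)
        self.count += 1
        self.mass += m
        self.mx += m * x
        self.my += m * y

def build_tree(bodies, half=10.0):
    """Fixed bounding box; bodies outside it count as escaped and are skipped."""
    root = Node(0.0, 0.0, half)
    for x, y, m in bodies:
        if abs(x) < half and abs(y) < half:
            root.insert(x, y, m)
    return root

def accel_on(node, x, y, theta=0.5, eps=0.01):
    """Acceleration at (x, y) from all bodies under node."""
    if node.count == 0:
        return 0.0, 0.0
    if node.children is None:                 # leaf: exact body position
        px, py, m = node.body
    else:                                     # internal: centre of mass
        px, py, m = node.mx / node.mass, node.my / node.mass, node.mass
    dx, dy = px - x, py - y
    d2 = dx * dx + dy * dy
    s = 2.0 * node.half
    if node.children is None or s * s < theta * theta * d2:   # s / d < theta
        if d2 == 0.0:                         # the body itself
            return 0.0, 0.0
        f = m * (d2 + eps * eps) ** -1.5
        return f * dx, f * dy
    ax = ay = 0.0
    for child in node.children:               # node too close: open it
        cax, cay = accel_on(child, x, y, theta, eps)
        ax += cax
        ay += cay
    return ax, ay
```

With `theta=0.0` no internal node is ever accepted, so the traversal reduces to the exact pairwise sum; larger values trade accuracy for fewer interactions.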

Parallel implementation
- Brute force scheme
  - Moved the force calculations to the kernel
  - The kernel code computes the force between a body and itself to eliminate an if statement
- Barnes Hut algorithm
  - Parallelism mainly exists in the following for-each loops [1]:
      for each body x:
          search for the node set N1 in the quad tree that acts on the body
          for each node y in N1:
              calculate the force on x from y
  - Grouping the bodies by spatial distance before the force calculation greatly improved the performance
Ref: [1] Martin Burtscher, Keshav Pingali. "An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-body Algorithm." GPU Computing Gems Emerald Edition (2011).
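Why the self-interaction can replace the if statement is easiest to see in a small stand-in for the kernel's inner loop. The sketch below is plain Python rather than CUDA, and it assumes a nonzero softening length `eps`, which the slides mention only for the Barnes Hut code. For j == i the separation is zero, so the term is mass * 0 * (eps^2)^(-3/2) = 0 and adding it changes nothing, while every thread runs the same branch-free loop body.

```python
def accel_with_branch(pos, mass, i, eps=0.01):
    """Reference version: skips the self term with an explicit test."""
    xi, yi = pos[i]
    ax = ay = 0.0
    for j, (xj, yj) in enumerate(pos):
        if j == i:
            continue
        dx, dy = xj - xi, yj - yi
        inv_r3 = (dx * dx + dy * dy + eps * eps) ** -1.5
        ax += mass[j] * dx * inv_r3
        ay += mass[j] * dy * inv_r3
    return ax, ay

def accel_branch_free(pos, mass, i, eps=0.01):
    """Includes j == i; needs eps > 0 so the self term stays finite."""
    xi, yi = pos[i]
    ax = ay = 0.0
    for j, (xj, yj) in enumerate(pos):
        dx, dy = xj - xi, yj - yi          # both are 0.0 when j == i
        inv_r3 = (dx * dx + dy * dy + eps * eps) ** -1.5
        ax += mass[j] * dx * inv_r3        # the self term adds exactly 0.0
        ay += mass[j] * dy * inv_r3
    return ax, ay

pos = [(0.0, 0.0), (1.0, 0.5), (-2.0, 1.5), (0.3, -0.7)]
mass = [1.0, 2.0, 0.5, 1.5]
same = all(accel_with_branch(pos, mass, i) == accel_branch_free(pos, mass, i)
           for i in range(len(pos)))
```

With `eps = 0` the branch-free version would divide by zero on the self term, so the trick depends on softening (or an equivalent guard).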

Performance Evaluation
- Systems
  - Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz
  - Tesla K20Xm GPU
- Compilers
  - Sequential Brute Force (gcc 4.9.2 -O3)
  - Sequential Barnes Hut (gcc 4.9.2 -O2)
  - CUDA Brute Force (nvcc 7.0 -O3 -arch=sm_20)
  - CUDA Barnes Hut (nvcc 7.0 -O3 -arch=sm_20)
- Inputs
  - 10 to 10^6 bodies
  - Best runtime of experiments, excluding I/O

Performance Evaluation cont.
- The benefit is lower with small N since the amount of parallelism is lower
- The cost of building the tree is too high when N is small
- Because Barnes Hut is O(N log N), its advantage over the O(N^2) algorithm increases rapidly with larger N
[Figure: Total runtime. Runtime per time step (s) vs. number of bodies (1 to 10^6) for Brute Force Seq, Brute Force CUDA, Barnes Hut Seq, and Barnes Hut CUDA.]

Performance Evaluation cont.
[Figure: Performance of each step. Seq. BH vs. CUDA BH for 10 to 10^6 bodies, in four panels: tree, force, update, total.]
- Force calculation > Building tree, Updating new positions
- The Spatial Grouping function greatly improved the performance of the force calculation
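The slides do not say how the spatial grouping was done. One common approach, shown here purely as an assumption, is to sort the bodies along a Morton (Z-order) curve so that bodies stored next to each other are also close in space; on a GPU, neighbouring threads then traverse similar parts of the tree. The function names and the bounding box `[lo, hi)` are hypothetical.

```python
def interleave_bits(v):
    """Spread the low 16 bits of v so a zero bit follows each original bit."""
    v &= 0xFFFF
    v = (v | (v << 8)) & 0x00FF00FF
    v = (v | (v << 4)) & 0x0F0F0F0F
    v = (v | (v << 2)) & 0x33333333
    v = (v | (v << 1)) & 0x55555555
    return v

def morton_key(x, y, lo, hi, bits=16):
    """Z-order key of a point inside the square [lo, hi) x [lo, hi)."""
    scale = (1 << bits) - 1
    ix = min(max(int((x - lo) / (hi - lo) * scale), 0), scale)
    iy = min(max(int((y - lo) / (hi - lo) * scale), 0), scale)
    return interleave_bits(ix) | (interleave_bits(iy) << 1)

def group_spatially(bodies, lo=-10.0, hi=10.0):
    """Return bodies (x, y, m) reordered along the Z-order curve."""
    return sorted(bodies, key=lambda b: morton_key(b[0], b[1], lo, hi))

# One body per quadrant, deliberately given in scrambled order.
bodies = [(5.0, 5.0, 1.0), (-5.0, 5.0, 1.0), (5.0, -5.0, 1.0), (-5.0, -5.0, 1.0)]
ordered = group_spatially(bodies)
```

The four-body example comes out in the Z pattern: lower-left, lower-right, upper-left, upper-right.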

Conclusions
- Cross shape exact recovery
- Implemented the entire Brute Force and Barnes Hut algorithms on CPU and GPU
- Number of bodies matters
  - Building tree cost
- Spatial grouping greatly improves the performance