From Biological Cells to Populations of Individuals: Complex Systems Simulations with CUDA (S5133)

Dr Paul Richmond, Research Fellow, University of Sheffield (NVIDIA CUDA Research Centre)

Overview
- Complex Systems
- A Framework for Modelling Agents
- Degrees of Parallelisation
- Agent Communication
- Putting it all together

Complex Systems
- Many individuals
- Interact and behave according to simple rules
- System-level behaviour emerges

Agent Based Modelling
- A method for specification and simulation of a complex system
- Model is a set of autonomous communicating agents
- Simulation helps to understand complex systems
  - Interventions and prediction
- Presents a computational challenge!
  - Especially for real time or faster

Difficulties in Applying GPUs
- Agents are heterogeneous
  - i.e. they diverge
- Agents are born and agents die
  - Leads to sparse populations and non-coalesced access
- Agents communicate
  - No global mechanism for GPU thread communication
- Agents don't stay still
  - Acceleration structures used for simulation need to be rebuilt

A Formal Model of an Agent
- Abstract the underlying architecture
  - Let modellers write models, not parallel programs
- Describe agents as a form of state machine (X-Machine)
  - Minimises divergence
- Describe state transition functions (agent functions) using high-level script
- Describe communication as message dependencies between agent functions
  - Results in a Directed Acyclic Graph
  - Identifies synchronisation points for scheduling

FLAME GPU: A Code Generation Framework
- XML Model File: describe agents and communication (messages) as a model in XML
- XSLT Templates: code generate a simulation API from agent descriptions
- Scripted Behaviour: scripted behaviour links with the dynamic simulation API
- Simulation Program: loads initial data and provides I/O or interactive visualisation

Code Generation using XSLT
- Powerful technique for code generation from a declarative XML model
- A full functional programming language

XML model (fragment):

    <xagents>
      <gpu:xagent>
        <name>circle</name>
        ...
        <memory>
          <gpu:variable> <type>int</type> <name>id</name> </gpu:variable>
          <gpu:variable> <type>float</type> <name>x</name> </gpu:variable>
          <gpu:variable> <type>float</type> <name>y</name> </gpu:variable>
          <gpu:variable> <type>float</type> <name>z</name> </gpu:variable>
          <gpu:variable> <type>float</type> <name>fx</name> </gpu:variable>
          <gpu:variable> <type>float</type> <name>fy</name> </gpu:variable>
        </memory>

XSLT template:

    <xsl:for-each select="xagents/gpu:xagent">
    struct __align__(16) xmachine_memory_<xsl:value-of select="name"/>
    {<xsl:for-each select="memory/gpu:variable">
      <xsl:value-of select="type"/><xsl:text> </xsl:text><xsl:if test="arraylength">*</xsl:if><xsl:value-of select="name"/>;
    </xsl:for-each>
    };
    </xsl:for-each>

Generated code:

    struct __align__(16) xmachine_memory_circle
    {
      int id;
      float x;
      float y;
      float z;
      float fx;
      float fy;
    };
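
The template step can be hard to follow in XSLT alone, so here is the same transformation as a minimal C++ sketch. It is not part of FLAME GPU: `generate_agent_struct` is a hypothetical helper, and it ignores the `arraylength` pointer case that the template handles. `__align__(16)` is the CUDA alignment specifier that appears in the generated struct.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: emits the text of an agent memory struct from
// (type, name) pairs, mirroring what the XSLT template does per agent.
std::string generate_agent_struct(
    const std::string& agent_name,
    const std::vector<std::pair<std::string, std::string>>& variables) {
    std::string out =
        "struct __align__(16) xmachine_memory_" + agent_name + "\n{\n";
    for (const auto& v : variables) {
        // v.first is the variable type, v.second is the variable name.
        out += "    " + v.first + " " + v.second + ";\n";
    }
    out += "};\n";
    return out;
}
```

Calling `generate_agent_struct("circle", {{"int", "id"}, {"float", "x"}})` returns the text of a two-member `xmachine_memory_circle` struct; the real framework produces a full simulation API this way rather than a single struct.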

Mapping an Agent to the GPU
- Each agent function corresponds to a single GPU kernel
  - Each CUDA thread represents a single agent instance
- Agent functions use a dynamically generated API
- Agent data is transparently loaded from structures of arrays

Structure of arrays:

    typedef struct agent_list{ float x[N]; float y[N]; } xm_memory_agent_list;

Array of structures:

    typedef struct agent{ float x; float y; } xm_memory_agent_list[N];

Example agent function:

    FLAME_GPU_FUNC int read_locations(
        xmachine_memory_bird* xmemory,
        xmachine_message_location_list* location_messages)
    {
        /* Get the first message */
        xmachine_message_location* location_message =
            get_first_location_message(location_messages);

        /* Repeat until there are no more messages */
        while (location_message)
        {
            /* Process the message */
            if (distance_check(xmemory, location_message))
            {
                updatesteervelocity(xmemory, location_message);
            }

            /* Get the next message */
            location_message =
                get_next_location_message(location_message, location_messages);
        }

        /* Update any other xmemory variables */
        xmemory->x += xmemory->vel_x*time_step;
        ...

        return 0;
    }
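
The difference between the two layouts can be made concrete with a small host-side C++ sketch. The names `AgentAoS`, `AgentListSoA`, `to_soa`, `x_stride_aos` and `x_stride_soa` are illustrative assumptions, not FLAME GPU identifiers. The point it shows is that in the structure-of-arrays layout the `x` values of neighbouring agents are adjacent in memory, which is what lets neighbouring CUDA threads coalesce their loads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structures: one struct per agent.
struct AgentAoS { float x; float y; };

// Structure of arrays: one contiguous array per agent variable.
struct AgentListSoA {
    std::vector<float> x;
    std::vector<float> y;
};

// Convert an AoS population into the SoA layout.
AgentListSoA to_soa(const std::vector<AgentAoS>& agents) {
    AgentListSoA out;
    out.x.reserve(agents.size());
    out.y.reserve(agents.size());
    for (const AgentAoS& a : agents) {
        out.x.push_back(a.x);
        out.y.push_back(a.y);
    }
    return out;
}

// Byte distance between the x values of agent i and agent i + 1.
std::size_t x_stride_aos() { return sizeof(AgentAoS); }  // skips over y
std::size_t x_stride_soa() { return sizeof(float); }     // adjacent values
```

With only two variables the stride differs by a factor of two; for an agent with many variables the gap grows, while the SoA stride stays at one element.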

Agent Divergence and Sparsity
- Divergence: must group agents (threads)
- Good news: agents are already grouped by state
- Bad news: agents change states, so we are left with sparse lists
- Avoid sparse lists by using parallel compaction
  - Thrust C++ library
[Figure: Agent Function, Agent List, Prefix Sum, Compact, New Agent List]
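
Compaction by prefix sum can be sketched sequentially in C++ as follows. This is an illustration of the idea only: the slide uses the Thrust library on the GPU, whereas this sketch runs on the host, and `prefix_sum_exclusive` and `compact` are hypothetical names. Each agent carries a 0/1 flag saying whether it stays in the list; the exclusive prefix sum of the flags gives every surviving agent its slot in the new, dense list.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum: out[i] = flags[0] + ... + flags[i - 1].
std::vector<int> prefix_sum_exclusive(const std::vector<int>& flags) {
    std::vector<int> out(flags.size());
    int running = 0;
    for (std::size_t i = 0; i < flags.size(); ++i) {
        out[i] = running;
        running += flags[i];
    }
    return out;
}

// Keep only agents whose flag is 1, preserving their order.
// Flags are assumed to be 0 or 1.
std::vector<float> compact(const std::vector<float>& agents,
                           const std::vector<int>& keep) {
    assert(agents.size() == keep.size());
    const std::vector<int> pos = prefix_sum_exclusive(keep);
    const std::size_t count =
        keep.empty() ? 0 : static_cast<std::size_t>(pos.back() + keep.back());
    std::vector<float> out(count);
    for (std::size_t i = 0; i < agents.size(); ++i) {
        if (keep[i]) {
            out[pos[i]] = agents[i];  // scatter to the scanned position
        }
    }
    return out;
}
```

On the GPU the scan and the scatter each run in parallel, with one thread per agent; the scatter needs no synchronisation because the scan guarantees that every surviving agent writes to a distinct slot.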

Parallelism within the model
- Behaviour consists of function layers
- Each layer is a synchronisation barrier
- Synchronisation between agents is only required when a dependency exists (communication or agent memory)
- This creates parallelism within the function layers of the model
- CUDA Streams can be used to execute independent functions
[Figure: Layer 1, Layer 2, Layer 3]
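
One way function layers could be derived from a dependency graph is sketched below in C++. This is an assumption for illustration, not a description of how FLAME GPU builds its layers: `assign_layers` is a hypothetical function, agent functions are numbered from 0, and each dependency is a (producer, consumer) pair. Each function is placed one layer after the deepest function it depends on, so functions that share a layer have no dependency between them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <queue>
#include <stdexcept>
#include <utility>
#include <vector>

// Returns the layer index of each of the n functions.
// deps holds (producer, consumer) pairs; the graph must be acyclic.
std::vector<int> assign_layers(std::size_t n,
                               const std::vector<std::pair<int, int>>& deps) {
    std::vector<std::vector<int>> consumers(n);
    std::vector<int> pending(n, 0);  // unprocessed producers per function
    for (const auto& d : deps) {
        consumers[d.first].push_back(d.second);
        ++pending[d.second];
    }
    std::vector<int> layer(n, 0);
    std::queue<int> ready;
    for (std::size_t f = 0; f < n; ++f) {
        if (pending[f] == 0) ready.push(static_cast<int>(f));
    }
    std::size_t visited = 0;
    while (!ready.empty()) {
        const int f = ready.front();
        ready.pop();
        ++visited;
        for (int c : consumers[f]) {
            layer[c] = std::max(layer[c], layer[f] + 1);
            if (--pending[c] == 0) ready.push(c);
        }
    }
    if (visited != n) throw std::invalid_argument("dependency cycle");
    return layer;
}
```

For a chain of three dependent functions plus one independent function, the independent function lands in layer 0 alongside the first function of the chain, which is the case where separate CUDA streams could run the two concurrently.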

High Divergence Example
- Single agent cell type
  - 5 types of cell within
- Single message type
- Advantages
  - Large population counts (good utilisation)
  - Simple modelling (but complicated agent transition functions)
- Disadvantages
  - Lots of code divergence
  - Unnecessary message reading

Low Divergence Example
- Multiple agent types
  - Different agent type for each cell type
- Distinction between message types
- Advantages
  - Less divergent code
  - More parallelism within the model
  - Less message reading
- Disadvantages
  - Complex dependencies
  - More complex (looking) model
  - Smaller population sizes

Parallelism within the model: performance
[Figure: Average iteration time of cell behaviour; Time (ms) against Population Size (500 to 512000); High Divergence vs Low Divergence]
[Figure: Simulation Speedup; Speedup against Population Size (500 to 512000)]

Agent Communication
- Brute Force Messaging (N-body problem)
  - Tile messages into shared memory
- Spatially Distributed Agents
  - Build data structure to bin agents (CUDA Particles)
  - Use counting sort to improve performance
- Discrete Space, Limited Range (Cellular Automaton)
  - Cache results via texture cache (good locality)

Spatially Distributed Communication
- Radix Sorting
  - Hash Message
  - Sort using Thrust (Sort by Key): sort keys
  - Reorder: scatter messages, build partition matrix
- Count Sort
  - Hash Message: atomic add to bin
  - Prefix Sum: global index of bins
  - Reorder: scatter messages
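
The three count sort steps (hash, prefix sum, reorder) can be sketched on the host in C++. This is a sequential illustration under assumptions, not the FLAME GPU implementation: `Message`, `BinnedMessages`, `bin_index` and `counting_sort_messages` are hypothetical names, the grid is 2D and uniform, and the plain `count[b]++` stands in for the atomic add that a CUDA kernel would use.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Message { float x; float y; };

struct BinnedMessages {
    std::vector<Message> sorted;  // messages grouped by bin
    std::vector<int> bin_start;   // bin b occupies [bin_start[b], bin_start[b + 1])
};

// Hash a position to a bin of a grid_w x grid_h uniform grid (clamped).
int bin_index(const Message& m, float cell_size, int grid_w, int grid_h) {
    int cx = static_cast<int>(std::floor(m.x / cell_size));
    int cy = static_cast<int>(std::floor(m.y / cell_size));
    cx = std::min(std::max(cx, 0), grid_w - 1);
    cy = std::min(std::max(cy, 0), grid_h - 1);
    return cy * grid_w + cx;
}

BinnedMessages counting_sort_messages(const std::vector<Message>& msgs,
                                      float cell_size, int grid_w, int grid_h) {
    assert(cell_size > 0.0f && grid_w > 0 && grid_h > 0);
    const std::size_t bins = static_cast<std::size_t>(grid_w) * grid_h;
    std::vector<int> count(bins, 0), bin(msgs.size()), local(msgs.size());

    // 1. Hash: count messages per bin; the old count is the message's
    //    offset inside its bin (the value an atomic add would return).
    for (std::size_t i = 0; i < msgs.size(); ++i) {
        bin[i] = bin_index(msgs[i], cell_size, grid_w, grid_h);
        local[i] = count[bin[i]]++;
    }

    // 2. Prefix sum: global start index of every bin.
    BinnedMessages out;
    out.bin_start.assign(bins + 1, 0);
    for (std::size_t b = 0; b < bins; ++b) {
        out.bin_start[b + 1] = out.bin_start[b] + count[b];
    }

    // 3. Reorder: scatter each message to bin start plus local offset.
    out.sorted.resize(msgs.size());
    for (std::size_t i = 0; i < msgs.size(); ++i) {
        out.sorted[out.bin_start[bin[i]] + local[i]] = msgs[i];
    }
    return out;
}
```

An agent reading messages then only has to visit the ranges `bin_start[b]` to `bin_start[b + 1]` for its own bin and the neighbouring bins. Unlike the sequential loop here, the order of messages inside a bin is not deterministic when the offsets come from atomic adds.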

Counting Sort Performance Study
[Figure: Sorting Performance (1M elements); Time (ms) against Element range; Thrust Sort vs Counting Sort; one panel each for Tesla K20, Tesla K40 and GTX 980]

[Figure: Performance Breakdown for 16k agents; Time (ms) for send_locations, read_locations and move; Count Sort vs Thrust Sort]
[Figure: Performance Improvement using Count Sort (GTX980); Speedup against Population Size]
[Figure: Performance Breakdown for 4M agents; Time (ms) for send_locations, read_locations and move; Count Sort vs Thrust Sort]
- Counting sort is best suited to smaller population sizes
- Message reading is the bottleneck

Spatially Distributed Communication Benchmark
[Figure: Time (ms) against Population Size (4096 to 4194304); GTX 980, K40 and FLAME CPU]
- 27k times faster than FLAME on CPU with 50k agents (apples != oranges)
- 700x faster than FLAME II with 50k agents on 16 cores (using MPI, vector splitting)

Pedestrian Dynamics
- Pedestrian agents
  - Social Repulsion (Social Forces)
  - Reynolds steering forces
  - Reciprocal Velocity Obstacles
- Navigation agents
  - Global Vector Field
  - Navigation Graph
- Environment and goals are calculated as a weighted influence
- An extension: navigation graphs

Conclusions
- Agent based modelling can be used to represent complex systems at differing biological scales
- FLAME GPU is a framework for model description and CUDA code generation
- Using a state-based representation avoids divergence and allows parallelism within a model to be exploited
- Counting sort is helpful for highly divergent populations
- Visualisation is extremely cheap

Thank You
- Get the code for free from: http://www.flamegpu.com and www.github.com/flamegpu
- Contact me: p.richmond@sheffield.ac.uk
- http://www.paulrichmond.staff.shef.ac.uk