GPU-Based Simulation of Spiking Neural Networks with Real-Time Performance & High Accuracy
|
|
- Anastasia Joseph
- 5 years ago
- Views:
Transcription
1 GPU-Based Simulation of Spiking Neural Networks with Real-Time Performance & High Accuracy Dmitri Yudanov, Muhammad Shaaban, Roy Melton, Leon Reznik Department of Computer Engineering Rochester Institute of Technology United States WCCI 2010, IJCNN, July 23
2 Agenda Motivation Neural network models Simulation systems of neural networks Parker-Sochacki numerical integration method CUDA GPU architecture Implementation: software architecture, computation phases Verification Results Conclusion and future work Q&A
3 Motivation Other works: accuracy and verification problem J. Nageswaran, N. Dutt, J. Krichmar, A. Nicolau, and A. Veidenbaum, "A configurable simulation environment for the efficient simulation of large-scale spiking neural networks on graphics processors," Neural Networks, Jul A. K. Fidjeland, E. B. Roesch, M. P. Shanahan, and W. Luk, "NeMo: A Platform for Neural Modelling of Spiking Neurons Using GPUs," Application-Specific Systems, Architectures and Processors, IEEE International Conference on, vol. 0, pp , J.-P. Tiesel and A. S. Maida, "Using parallel GPU architecture for simulation of planar I/F networks," in, 2009, pp To provide scalable accuracy To perform direct verification Based on: R. Stewart and W. Bair, "Spiking neural network simulation: numerical integration with the Parker-Sochacki method," Journal of Computational Neuroscience, vol. 27, no. 1, pp , Aug
4 Neuron Models: IF, HH, IZ IF HH IZ IF: simple, but has poor spiking response HH: has reach response, but complex IZ: simple, has reach response, but phenomenological
5 System Modeling: Synchronous Systems Aligned events good for parallel computing Time quantization error introduced by dt Smaller dt more precise, but computation hangry May result in missing events STDP unfriendly Order of computation per second of simulated time N network size F - average firing rate of a neuron p average target neurons per spike source R. Brette, et al.
6 System Modeling: Asynchronous Systems Small computation order Events are unique in time no quantization error more accurate, STDP friendly Events are processed sequentially More computation per unit-time Spike predictor-corrector excessive re-computation Assumes analytical solution Order of computation per second of simulated time N network size F - average firing rate of a neuron p average target neurons per spike source R. Brette, et al.
7 System Modeling: Hybrid Systems Refreshes every dt more structured than event-driven good for parallel computing Events are unique in time no quantization error more accurate, STDP friendly Doesn t require analytical solution Events are processed sequentially Largest possible dt is limited by minimum delay and highest possible transient Order of computation per second of simulated time N network size, F - average firing rate of a neuron, p average target neurons per spike source R. Brette, et al.
8 Choice of Numerical Integration Method Motivation: need to solve an IVP Euler: compute next y based on tangent to current y Modified Euler: predict with Euler, correct with average slope Runge-Kutta 4 th Order: evaluate and average Bulirsch Stoer: modified midpoint method with evaluation and error tolerance check using extrapolation with rational functions. Adaptive order. Generally more suited for smooth functions. Parker-Sochacki: express IVP as power series. Adaptive order
9 Parcker-Sochacki Method A typical IVP: Assume that solution function can be represented with power series. Therefore, its derivative based on Maclaurin series properties is
10 Parcker-Sochacki Method If is linear: Shift it to eliminate constant term: As a result, the equation becomes: With finite order N: LLP Parallel reduction
11 Parcker-Sochacki Method If is quadratic: Shift it to eliminate constant term: As a result, the equation becomes: Quadratic term can be converted with series multiplication:
12 Parcker-Sochacki Method and the equation becomes: With finite order N: Loop-carried circular dependence on d Only partial parallelism possible
13 Parcker-Sochacki Method Local Lipschitz constant determines the number of iterations for achieving certain error tolerance: Power series representation adaptive order error tolerance control Limitations: Cauchy product reduces parallelism
14 CUDA: SW Kernel: code separate, task division Thread Block (1D, 2D, 3D) Grid (1D, 2D) Divide computation based on IDs Granularity: bit level (after warp bcast access)
15 CUDA: HW Scheduling Scheduling: parallel and sequential Scalability requirement for blocks to be independent Warp Warp = 32 threads Warp divergence Warp level synchronization Active blocks and threads: Active threads / SM: maximum1024 Goal: full occupancy = 1024 threads
16 Software Architecture
17 Update Phase Stewart and Bair Adaptive order p according to required error tolerance Can be processed in parallel for each neuron
18 Propagation Phase Translate spikes to synaptic events: global communication is required Encoded spikes are written to the global memory: bit mask + time values A propagation block reads and filters all spikes, decodes, fetches synaptic data and distributes into time slots
19 Sorting Phase Satish et al.
20 Software Architecture
21 Results: Verification Input Conditions Random parameter allocation Random connectivity Zero PS error tolerance GPU Device: GTX symmetric multiprocessors Shared memory size, 16 KB / SM Global memory size, 938 MB Clock rate, 1.3 GHz CPU Device: AMD Opteron 285 Dual core L2 cache size, 1 KB / core RAM size, 4 GB Clock rate, 2.6 GHz Output Membrane potential traces Passed test for equality
22 Simulation Time, sec. Results: Simulation Time vs. Network Size % 4% 8% 16% 2% 4% 8% 16% Network size, 1000 x neurons Conditions: 80% excitatory / 20% inhibitory synapses, zero tolerance, 10 sec of simulation, initially excited by pa current. Results: GPU simulation 8-9 times faster, RT performance for 2-4% - connected networks with size neurons. Major limiting factors: shared memory, number of SM
23 Simulation Time, sec. Results: Simulation Time vs. Event Throughput % 4% 8% 16% 2% 4% 8% 16% Mean Event Throughput, 1000 x events/(sec. x neuron) Conditions: increasing excitatory / inhibitory ratio from 0.8/0.2 to0.98/0.02, network of 4096 neurons, zero tolerance, 10 sec of simulation, initially excited by pa current. Results: GPU simulation 6-9 times faster, up to 10,000 events per sec per neuron. RT performance for 0-2% - connected networks with size of Major limiting factors: shared memory, number of SM
24 Results: Comparison with Other Works Metric This Work Other works Increase in speed Network Size Connectivity per neuron 6 9, RT 2K - 8K 10 35, RT 16K - 200K K 100 1K Reason GPU device, complexity of computation, numerical integration methods, simulation type, time scale Accuracy Full single precision FP Undefined Numerical integration method Verification Direct Indirect
25 Conclusion Implemented high-accurate PS-based hybrid system of spiking neural network with IZ neurons on GPU Directly verified implementation Future Work Add accurate STDP implementation Characterize accuracy in relation to signal processing, network size, network speed, learning Provide an example of application Port to Open CL Further optimization
26 Q&A Essential Bibliography R. Brette, et al., "Simulation of networks of spiking neurons: A review of tools and strategies," Journal of Computational Neurscience, vol. 23, no. 3, pp , R. Stewart and W. Bair, "Spiking neural network simulation: numerical integration with the Parker-Sochacki method," Journal of Computational Neuroscience, vol. 27, no. 1, pp , Aug G. E. Parker and J. S. Sochacki, "Implementing the Picard iteration," Neural, Parallel Sci. Comput., vol. 4, pp , E. M. Izhikevich, "Simple model of spiking neurons," Neural Networks, IEEE Transactions on, vol. 14, pp , N. Satish, M. Harris, and M. Garland, "Designing efficient sorting algorithms for manycore GPUs," in, 2009, pp (2010, Apr.) CUDA Data Parallel Primitives Library. [Accessed online 04/30/2010]. (2008) NVIDIA CUDA Programming Guide 2.3. [Accessed online 04/30/2010]. Other works J. Nageswaran, N. Dutt, J. Krichmar, A. Nicolau, and A. Veidenbaum, "A configurable simulation environment for the efficient simulation of large-scale spiking neural networks on graphics processors," Neural Networks, Jul A. K. Fidjeland, E. B. Roesch, M. P. Shanahan, and W. Luk, "NeMo: A Platform for Neural Modelling of Spiking Neurons Using GPUs," Application-Specific Systems, Architectures and Processors, IEEE International Conference on, vol. 0, pp , J.-P. Tiesel and A. S. Maida, "Using parallel GPU architecture for simulation of planar I/F networks," in, 2009, pp
CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni
CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3
More informationAccelerated Simulation of Spiking Neural Networks Using GPUs
WCCI 2010 IEEE World Congress on Computational Intelligence July, 18-23, 2010 - CCIB, Barcelona, Spain IJCNN Accelerated Simulation of Spiking Neural Networks Using GPUs Andreas K. Fidjeland and Murray
More informationWarps and Reduction Algorithms
Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum
More informationImplementation of configurable and multipurpose spiking neural networks on GPUs
WCCI 2012 IEEE World Congress on Computational Intelligence June, 10-15, 2012 - Brisbane, Australia IJCNN Implementation of configurable and multipurpose spiking neural networks on GPUs Antonio Arista-Jalife
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationGPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27
1 / 27 GPU Programming Lecture 1: Introduction Miaoqing Huang University of Arkansas 2 / 27 Outline Course Introduction GPUs as Parallel Computers Trend and Design Philosophies Programming and Execution
More informationA TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE
A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE Abu Asaduzzaman, Deepthi Gummadi, and Chok M. Yip Department of Electrical Engineering and Computer Science Wichita State University Wichita, Kansas, USA
More informationAnalysis Report. Number of Multiprocessors 3 Multiprocessor Clock Rate Concurrent Kernel Max IPC 6 Threads per Warp 32 Global Memory Bandwidth
Analysis Report v3 Duration 932.612 µs Grid Size [ 1024,1,1 ] Block Size [ 1024,1,1 ] Registers/Thread 32 Shared Memory/Block 28 KiB Shared Memory Requested 64 KiB Shared Memory Executed 64 KiB Shared
More informationData Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More informationEfficient Simulation of Large-Scale Spiking Neural Networks Using CUDA Graphics Processors
Efficient Simulation of Large-Scale Spiking Neural Networks Using CUDA Graphics Processors Jayram Moorkanikara Nageswaran, Nikil Dutt, Jeffrey L Krichmar, Alex Nicolau, Alex Veidenbaum Center for Embedded
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationBiologically Sound Neural Networks for Embedded Systems Using OpenCL
This is the author s version of the work. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purpose or for creating new collective
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationHigh Performance Comparison-Based Sorting Algorithm on Many-Core GPUs
High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs Xiaochun Ye, Dongrui Fan, Wei Lin, Nan Yuan, and Paolo Ienne Key Laboratory of Computer System and Architecture ICT, CAS, China Outline
More information3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA
3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires
More informationIntroduction to GPU programming with CUDA
Introduction to GPU programming with CUDA Dr. Juan C Zuniga University of Saskatchewan, WestGrid UBC Summer School, Vancouver. June 12th, 2018 Outline 1 Overview of GPU computing a. what is a GPU? b. GPU
More informationCUDA OPTIMIZATIONS ISC 2011 Tutorial
CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationRT 3D FDTD Simulation of LF and MF Room Acoustics
RT 3D FDTD Simulation of LF and MF Room Acoustics ANDREA EMANUELE GRECO Id. 749612 andreaemanuele.greco@mail.polimi.it ADVANCED COMPUTER ARCHITECTURES (A.A. 2010/11) Prof.Ing. Cristina Silvano Dr.Ing.
More informationGPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten
GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,
More informationDense matching GPU implementation
Dense matching GPU implementation Author: Hailong Fu. Supervisor: Prof. Dr.-Ing. Norbert Haala, Dipl. -Ing. Mathias Rothermel. Universität Stuttgart 1. Introduction Correspondence problem is an important
More informationGPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA
GPGPU LAB Case study: Finite-Difference Time- Domain Method on CUDA Ana Balevic IPVS 1 Finite-Difference Time-Domain Method Numerical computation of solutions to partial differential equations Explicit
More informationCUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA
CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD
More informationParallelising Pipelined Wavefront Computations on the GPU
Parallelising Pipelined Wavefront Computations on the GPU S.J. Pennycook G.R. Mudalige, S.D. Hammond, and S.A. Jarvis. High Performance Systems Group Department of Computer Science University of Warwick
More informationIntroduction to Multicore architecture. Tao Zhang Oct. 21, 2010
Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)
More informationPOST-SIEVING ON GPUs
POST-SIEVING ON GPUs Andrea Miele 1, Joppe W Bos 2, Thorsten Kleinjung 1, Arjen K Lenstra 1 1 LACAL, EPFL, Lausanne, Switzerland 2 NXP Semiconductors, Leuven, Belgium 1/18 NUMBER FIELD SIEVE (NFS) Asymptotically
More informationIdentifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011
Identifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011 Performance Optimization Process Use appropriate performance metric for each kernel For example, Gflops/s don t make sense for
More informationNumerical Simulation on the GPU
Numerical Simulation on the GPU Roadmap Part 1: GPU architecture and programming concepts Part 2: An introduction to GPU programming using CUDA Part 3: Numerical simulation techniques (grid and particle
More informationMulti Agent Navigation on GPU. Avi Bleiweiss
Multi Agent Navigation on GPU Avi Bleiweiss Reasoning Explicit Implicit Script, storytelling State machine, serial Compute intensive Fits SIMT architecture well Navigation planning Collision avoidance
More informationPerformance potential for simulating spin models on GPU
Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationApplications of Berkeley s Dwarfs on Nvidia GPUs
Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse
More informationCUDA Performance Optimization. Patrick Legresley
CUDA Performance Optimization Patrick Legresley Optimizations Kernel optimizations Maximizing global memory throughput Efficient use of shared memory Minimizing divergent warps Intrinsic instructions Optimizations
More informationCUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav
CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture
More informationGPU Accelerated Solvers for ODEs Describing Cardiac Membrane Equations
GPU Accelerated Solvers for ODEs Describing Cardiac Membrane Equations Fred Lionetti @ CSE Andrew McCulloch @ Bioeng Scott Baden @ CSE University of California, San Diego What is heart modeling? Bioengineer
More informationMathematical computations with GPUs
Master Educational Program Information technology in applications Mathematical computations with GPUs GPU architecture Alexey A. Romanenko arom@ccfit.nsu.ru Novosibirsk State University GPU Graphical Processing
More informationDebunking the 100X GPU vs CPU Myth: An Evaluation of Throughput Computing on CPU and GPU
Debunking the 100X GPU vs CPU Myth: An Evaluation of Throughput Computing on CPU and GPU The myth 10x-1000x speed up on GPU vs CPU Papers supporting the myth: Microsoft: N. K. Govindaraju, B. Lloyd, Y.
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationKernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow
Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization
More informationIntroduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware
More informationLecture 2: CUDA Programming
CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:
More informationImplementation of the finite-difference method for solving Maxwell`s equations in MATLAB language on a GPU
Implementation of the finite-difference method for solving Maxwell`s equations in MATLAB language on a GPU 1 1 Samara National Research University, Moskovskoe Shosse 34, Samara, Russia, 443086 Abstract.
More informationCS 179: GPU Computing
CS 179: GPU Computing LECTURE 2: INTRO TO THE SIMD LIFESTYLE AND GPU INTERNALS Recap Can use GPU to solve highly parallelizable problems Straightforward extension to C++ Separate CUDA code into.cu and.cuh
More informationCUDA. Matthew Joyner, Jeremy Williams
CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel
More informationGPUs and Einstein s Equations
GPUs and Einstein s Equations Tim Dewey Advisor: Dr. Manuel Tiglio AMSC Scientific Computing University of Maryland May 5, 2011 Outline 1 Project Summary 2 Evolving Einstein s Equations 3 Implementation
More informationGPUfs: Integrating a file system with GPUs
GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications OS CPU
More informationExploiting graphical processing units for data-parallel scientific applications
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2009; 21:2400 2437 Published online 20 July 2009 in Wiley InterScience (www.interscience.wiley.com)..1462 Exploiting
More informationIntroduction to Parallel Computing with CUDA. Oswald Haan
Introduction to Parallel Computing with CUDA Oswald Haan ohaan@gwdg.de Schedule Introduction to Parallel Computing with CUDA Using CUDA CUDA Application Examples Using Multiple GPUs CUDA Application Libraries
More informationFundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA
Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU
More informationCUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer
CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer Outline We ll be focussing on optimizing global memory throughput on Fermi-class GPUs
More informationB. Tech. Project Second Stage Report on
B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic
More informationCUDA Experiences: Over-Optimization and Future HPC
CUDA Experiences: Over-Optimization and Future HPC Carl Pearson 1, Simon Garcia De Gonzalo 2 Ph.D. candidates, Electrical and Computer Engineering 1 / Computer Science 2, University of Illinois Urbana-Champaign
More informationScalable Multi Agent Simulation on the GPU. Avi Bleiweiss NVIDIA Corporation San Jose, 2009
Scalable Multi Agent Simulation on the GPU Avi Bleiweiss NVIDIA Corporation San Jose, 2009 Reasoning Explicit State machine, serial Implicit Compute intensive Fits SIMT well Collision avoidance Motivation
More informationGPU ARCHITECTURE Chris Schultz, June 2017
GPU ARCHITECTURE Chris Schultz, June 2017 MISC All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA 2 OUTLINE CPU versus GPU Why are they different? CUDA
More informationFirst Swedish Workshop on Multi-Core Computing MCC 2008 Ronneby: On Sorting and Load Balancing on Graphics Processors
First Swedish Workshop on Multi-Core Computing MCC 2008 Ronneby: On Sorting and Load Balancing on Graphics Processors Daniel Cederman and Philippas Tsigas Distributed Computing Systems Chalmers University
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationScientific Computing on GPUs: GPU Architecture Overview
Scientific Computing on GPUs: GPU Architecture Overview Dominik Göddeke, Jakub Kurzak, Jan-Philipp Weiß, André Heidekrüger and Tim Schröder PPAM 2011 Tutorial Toruń, Poland, September 11 http://gpgpu.org/ppam11
More informationEfficient Tridiagonal Solvers for ADI methods and Fluid Simulation
Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular
More informationParallel Methods for Verifying the Consistency of Weakly-Ordered Architectures. Adam McLaughlin, Duane Merrill, Michael Garland, and David A.
Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader Challenges of Design Verification Contemporary hardware
More informationCUDA Programming Model
CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming
More informationCS427 Multicore Architecture and Parallel Computing
CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:
More informationParallel Processing SIMD, Vector and GPU s cont.
Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP
More informationCUB. collective software primitives. Duane Merrill. NVIDIA Research
CUB collective software primitives Duane Merrill NVIDIA Research What is CUB?. A design model for collective primitives How to make reusable SIMT software constructs. A library of collective primitives
More informationDirected Optimization On Stencil-based Computational Fluid Dynamics Application(s)
Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2
More informationS WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018
S8630 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS Jakob Progsch, Mathias Wagner GTC 2018 1. Know your hardware BEFORE YOU START What are the target machines, how many nodes? Machine-specific
More informationAdministrivia. HW0 scores, HW1 peer-review assignments out. If you re having Cython trouble with HW2, let us know.
Administrivia HW0 scores, HW1 peer-review assignments out. HW2 out, due Nov. 2. If you re having Cython trouble with HW2, let us know. Review on Wednesday: Post questions on Piazza Introduction to GPUs
More informationDIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka
USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods
More informationNvidia Jetson TX2 and its Software Toolset. João Fernandes 2017/2018
Nvidia Jetson TX2 and its Software Toolset João Fernandes 2017/2018 In this presentation Nvidia Jetson TX2: Hardware Nvidia Jetson TX2: Software Machine Learning: Neural Networks Convolutional Neural Networks
More informationEE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III)
EE382 (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III) Mattan Erez The University of Texas at Austin EE382: Principles of Computer Architecture, Fall 2011 -- Lecture
More informationOn the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters
1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk
More informationCS377P Programming for Performance GPU Programming - II
CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline
More informationMattan Erez. The University of Texas at Austin
EE382V (17325): Principles in Computer Architecture Parallelism and Locality Fall 2007 Lecture 12 GPU Architecture (NVIDIA G80) Mattan Erez The University of Texas at Austin Outline 3D graphics recap and
More informationXIV International PhD Workshop OWD 2012, October Optimal structure of face detection algorithm using GPU architecture
XIV International PhD Workshop OWD 2012, 20 23 October 2012 Optimal structure of face detection algorithm using GPU architecture Dmitry Pertsau, Belarusian State University of Informatics and Radioelectronics
More informationIBM Education Assistance for z/os V2R2
IBM Education Assistance for z/os V2R2 Item: RSM Scalability Element/Component: Real Storage Manager Material current as of May 2015 IBM Presentation Template Full Version Agenda Trademarks Presentation
More informationCS 179 Lecture 4. GPU Compute Architecture
CS 179 Lecture 4 GPU Compute Architecture 1 This is my first lecture ever Tell me if I m not speaking loud enough, going too fast/slow, etc. Also feel free to give me lecture feedback over email or at
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationCSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationAccelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies
Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies John C. Linford John Michalakes Manish Vachharajani Adrian Sandu IMAGe TOY 2009 Workshop 2 Virginia
More informationUniversity of Zurich. Computing spike-based convolutions on GPUs. Zurich Open Repository and Archive. Nageswaran, J M; Dutt, N; Wang, Y; Delbrueck, T
University of Zurich Zurich Open Repository and Archive Winterthurerstr. 190 CH-8057 Zurich http://www.zora.uzh.ch Year: 2009 Computing spike-based convolutions on GPUs Nageswaran, J M; Dutt, N; Wang,
More informationGPUfs: Integrating a file system with GPUs
ASPLOS 2013 GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications
More informationPerformance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX
Performance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX Wen-mei Hwu with David Kirk, Shane Ryoo, Christopher Rodrigues, John Stratton, Kuangwei Huang Overview
More informationFacial Recognition Using Neural Networks over GPGPU
Facial Recognition Using Neural Networks over GPGPU V Latin American Symposium on High Performance Computing Juan Pablo Balarini, Martín Rodríguez and Sergio Nesmachnow Centro de Cálculo, Facultad de Ingeniería
More informationHardware/Software Co-Design
1 / 27 Hardware/Software Co-Design Miaoqing Huang University of Arkansas Fall 2011 2 / 27 Outline 1 2 3 3 / 27 Outline 1 2 3 CSCE 5013-002 Speical Topic in Hardware/Software Co-Design Instructor Miaoqing
More informationFundamental Optimizations
Fundamental Optimizations Paulius Micikevicius NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationODEs occur quite often in physics and astrophysics: Wave Equation in 1-D stellar structure equations hydrostatic equation in atmospheres orbits
Solving ODEs General Stuff ODEs occur quite often in physics and astrophysics: Wave Equation in 1-D stellar structure equations hydrostatic equation in atmospheres orbits need workhorse solvers to deal
More informationSupercomputing, Tutorial S03 New Orleans, Nov 14, 2010
Fundamental Optimizations Paulius Micikevicius NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access
More informationSpiNNaker - a million core ARM-powered neural HPC
The Advanced Processor Technologies Group SpiNNaker - a million core ARM-powered neural HPC Cameron Patterson cameron.patterson@cs.man.ac.uk School of Computer Science, The University of Manchester, UK
More informationAdvanced CUDA Optimizing to Get 20x Performance. Brent Oster
Advanced CUDA Optimizing to Get 20x Performance Brent Oster Outline Motivation for optimizing in CUDA Demo performance increases Tesla 10-series architecture details Optimization case studies Particle
More informationX10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management
X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management Hideyuki Shamoto, Tatsuhiro Chiba, Mikio Takeuchi Tokyo Institute of Technology IBM Research Tokyo Programming for large
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory
More informationAn Execution Strategy and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs
An Execution Strategy and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs Xin Huo, Vignesh T. Ravi, Wenjing Ma and Gagan Agrawal Department of Computer Science and Engineering
More informationACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS
ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS Ferdinando Alessi Annalisa Massini Roberto Basili INGV Introduction The simulation of wave propagation
More informationCache Capacity Aware Thread Scheduling for Irregular Memory Access on Many-Core GPGPUs
Cache Capacity Aware Thread Scheduling for Irregular Memory Access on Many-Core GPGPUs Hsien-Kai Kuo, Ta-Kan Yen, Bo-Cheng Charles Lai and Jing-Yang Jou Department of Electronics Engineering National Chiao
More information