GPU-Based Simulation of Spiking Neural Networks with Real-Time Performance & High Accuracy

Size: px

Start display at page:

Download "GPU-Based Simulation of Spiking Neural Networks with Real-Time Performance & High Accuracy"

Anastasia Joseph
5 years ago
Views:

1 GPU-Based Simulation of Spiking Neural Networks with Real-Time Performance & High Accuracy Dmitri Yudanov, Muhammad Shaaban, Roy Melton, Leon Reznik Department of Computer Engineering Rochester Institute of Technology United States WCCI 2010, IJCNN, July 23

2 Agenda Motivation Neural network models Simulation systems of neural networks Parker-Sochacki numerical integration method CUDA GPU architecture Implementation: software architecture, computation phases Verification Results Conclusion and future work Q&A

3 Motivation Other works: accuracy and verification problem J. Nageswaran, N. Dutt, J. Krichmar, A. Nicolau, and A. Veidenbaum, "A configurable simulation environment for the efficient simulation of large-scale spiking neural networks on graphics processors," Neural Networks, Jul A. K. Fidjeland, E. B. Roesch, M. P. Shanahan, and W. Luk, "NeMo: A Platform for Neural Modelling of Spiking Neurons Using GPUs," Application-Specific Systems, Architectures and Processors, IEEE International Conference on, vol. 0, pp , J.-P. Tiesel and A. S. Maida, "Using parallel GPU architecture for simulation of planar I/F networks," in, 2009, pp To provide scalable accuracy To perform direct verification Based on: R. Stewart and W. Bair, "Spiking neural network simulation: numerical integration with the Parker-Sochacki method," Journal of Computational Neuroscience, vol. 27, no. 1, pp , Aug

4 Neuron Models: IF, HH, IZ IF HH IZ IF: simple, but has poor spiking response HH: has reach response, but complex IZ: simple, has reach response, but phenomenological

System Modeling: Synchronous Systems Aligned events good for

Smaller dt more precise, but computation hangry May result in

of simulated time N network size F - average firing rate of a

5 System Modeling: Synchronous Systems Aligned events good for parallel computing Time quantization error introduced by dt Smaller dt more precise, but computation hangry May result in missing events STDP unfriendly Order of computation per second of simulated time N network size F - average firing rate of a neuron p average target neurons per spike source R. Brette, et al.

System Modeling: Asynchronous Systems Small computation order Events are

are processed sequentially More computation per unit-time Spike

Order of computation per second of simulated time N network size F -

6 System Modeling: Asynchronous Systems Small computation order Events are unique in time no quantization error more accurate, STDP friendly Events are processed sequentially More computation per unit-time Spike predictor-corrector excessive re-computation Assumes analytical solution Order of computation per second of simulated time N network size F - average firing rate of a neuron p average target neurons per spike source R. Brette, et al.

System Modeling: Hybrid Systems Refreshes every dt more structured than event-driven good for parallel computing Events are unique in time no quantization error more accurate, STDP friendly Doesn t

7 System Modeling: Hybrid Systems Refreshes every dt more structured than event-driven good for parallel computing Events are unique in time no quantization error more accurate, STDP friendly Doesn t require analytical solution Events are processed sequentially Largest possible dt is limited by minimum delay and highest possible transient Order of computation per second of simulated time N network size, F - average firing rate of a neuron, p average target neurons per spike source R. Brette, et al.

Choice of Numerical Integration Method Motivation: need to solve an IVP Euler: compute next y based on tangent to current y Modified Euler: predict with Euler, correct with average slope Runge-Kutta

8 Choice of Numerical Integration Method Motivation: need to solve an IVP Euler: compute next y based on tangent to current y Modified Euler: predict with Euler, correct with average slope Runge-Kutta 4 th Order: evaluate and average Bulirsch Stoer: modified midpoint method with evaluation and error tolerance check using extrapolation with rational functions. Adaptive order. Generally more suited for smooth functions. Parker-Sochacki: express IVP as power series. Adaptive order

9 Parcker-Sochacki Method A typical IVP: Assume that solution function can be represented with power series. Therefore, its derivative based on Maclaurin series properties is

10 Parcker-Sochacki Method If is linear: Shift it to eliminate constant term: As a result, the equation becomes: With finite order N: LLP Parallel reduction

11 Parcker-Sochacki Method If is quadratic: Shift it to eliminate constant term: As a result, the equation becomes: Quadratic term can be converted with series multiplication:

12 Parcker-Sochacki Method and the equation becomes: With finite order N: Loop-carried circular dependence on d Only partial parallelism possible

13 Parcker-Sochacki Method Local Lipschitz constant determines the number of iterations for achieving certain error tolerance: Power series representation adaptive order error tolerance control Limitations: Cauchy product reduces parallelism

14 CUDA: SW Kernel: code separate, task division Thread Block (1D, 2D, 3D) Grid (1D, 2D) Divide computation based on IDs Granularity: bit level (after warp bcast access)

15 CUDA: HW Scheduling Scheduling: parallel and sequential Scalability requirement for blocks to be independent Warp Warp = 32 threads Warp divergence Warp level synchronization Active blocks and threads: Active threads / SM: maximum1024 Goal: full occupancy = 1024 threads

16 Software Architecture

17 Update Phase Stewart and Bair Adaptive order p according to required error tolerance Can be processed in parallel for each neuron

18 Propagation Phase Translate spikes to synaptic events: global communication is required Encoded spikes are written to the global memory: bit mask + time values A propagation block reads and filters all spikes, decodes, fetches synaptic data and distributes into time slots

19 Sorting Phase Satish et al.

20 Software Architecture

21 Results: Verification Input Conditions Random parameter allocation Random connectivity Zero PS error tolerance GPU Device: GTX symmetric multiprocessors Shared memory size, 16 KB / SM Global memory size, 938 MB Clock rate, 1.3 GHz CPU Device: AMD Opteron 285 Dual core L2 cache size, 1 KB / core RAM size, 4 GB Clock rate, 2.6 GHz Output Membrane potential traces Passed test for equality

22 Simulation Time, sec. Results: Simulation Time vs. Network Size % 4% 8% 16% 2% 4% 8% 16% Network size, 1000 x neurons Conditions: 80% excitatory / 20% inhibitory synapses, zero tolerance, 10 sec of simulation, initially excited by pa current. Results: GPU simulation 8-9 times faster, RT performance for 2-4% - connected networks with size neurons. Major limiting factors: shared memory, number of SM

23 Simulation Time, sec. Results: Simulation Time vs. Event Throughput % 4% 8% 16% 2% 4% 8% 16% Mean Event Throughput, 1000 x events/(sec. x neuron) Conditions: increasing excitatory / inhibitory ratio from 0.8/0.2 to0.98/0.02, network of 4096 neurons, zero tolerance, 10 sec of simulation, initially excited by pa current. Results: GPU simulation 6-9 times faster, up to 10,000 events per sec per neuron. RT performance for 0-2% - connected networks with size of Major limiting factors: shared memory, number of SM

24 Results: Comparison with Other Works Metric This Work Other works Increase in speed Network Size Connectivity per neuron 6 9, RT 2K - 8K 10 35, RT 16K - 200K K 100 1K Reason GPU device, complexity of computation, numerical integration methods, simulation type, time scale Accuracy Full single precision FP Undefined Numerical integration method Verification Direct Indirect

25 Conclusion Implemented high-accurate PS-based hybrid system of spiking neural network with IZ neurons on GPU Directly verified implementation Future Work Add accurate STDP implementation Characterize accuracy in relation to signal processing, network size, network speed, learning Provide an example of application Port to Open CL Further optimization

Q&A Essential Bibliography R. Brette, et al., "Simulation of networks of spiking neurons: A review of tools and strategies," Journal of Computational Neurscience, vol. 23, no. 3, pp. 349-398, 2007. R. Stewart and W.

26 Q&A Essential Bibliography R. Brette, et al., "Simulation of networks of spiking neurons: A review of tools and strategies," Journal of Computational Neurscience, vol. 23, no. 3, pp , R. Stewart and W. Bair, "Spiking neural network simulation: numerical integration with the Parker-Sochacki method," Journal of Computational Neuroscience, vol. 27, no. 1, pp , Aug G. E. Parker and J. S. Sochacki, "Implementing the Picard iteration," Neural, Parallel Sci. Comput., vol. 4, pp , E. M. Izhikevich, "Simple model of spiking neurons," Neural Networks, IEEE Transactions on, vol. 14, pp , N. Satish, M. Harris, and M. Garland, "Designing efficient sorting algorithms for manycore GPUs," in, 2009, pp (2010, Apr.) CUDA Data Parallel Primitives Library. [Accessed online 04/30/2010]. (2008) NVIDIA CUDA Programming Guide 2.3. [Accessed online 04/30/2010]. Other works J. Nageswaran, N. Dutt, J. Krichmar, A. Nicolau, and A. Veidenbaum, "A configurable simulation environment for the efficient simulation of large-scale spiking neural networks on graphics processors," Neural Networks, Jul A. K. Fidjeland, E. B. Roesch, M. P. Shanahan, and W. Luk, "NeMo: A Platform for Neural Modelling of Spiking Neurons Using GPUs," Application-Specific Systems, Architectures and Processors, IEEE International Conference on, vol. 0, pp , J.-P. Tiesel and A. S. Maida, "Using parallel GPU architecture for simulation of planar I/F networks," in, 2009, pp

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3