Developing a Data Driven System for Computational Neuroscience

Ross Snider and Yongming Zhu
Montana State University, Bozeman, MT 59717, USA

M. Bubak et al. (Eds.): ICCS 2004, LNCS 3038, pp. 822-826, 2004. © Springer-Verlag Berlin Heidelberg 2004

Abstract. A data driven system implies the need to integrate data acquisition and signal processing into the same system that will interact with this information. This can be done with general purpose processors (PCs), digital signal processors (DSPs), or, more recently, with field programmable gate arrays (FPGAs). In a computational neuroscience system that interacts with neural data recorded in real time, classifying action potentials, commonly referred to as spike sorting, is an important step. A comparison was made between a PC, DSPs, and FPGAs for training a spike sorting system based on Gaussian mixture models. The results show that FPGAs can significantly outperform PCs and DSPs by embedding algorithms directly in hardware.

1 Introduction

A data driven system is being developed for computational neuroscience that will be able to process an arbitrary number of real-time data streams and provide a platform for neural modeling that can interact with these streams. The platform will be used to aid the discovery of the neural encoding schemes through which sensory information is represented and transmitted within a nervous system. The goal of the system is to enable real-time decoding of neural information streams and to allow neuronal models to interact with living simple nervous systems. This will enable the integration of experimental and theoretical neuroscience. Allowing experimental perturbation of neural signals while in transit between peripheral and central processing stages will provide an unprecedented degree of interactive control in the analysis of neural function, and could lead to major insights into the biological basis of neural computation.

Integrating data acquisition, signal processing, and neural modeling requires an examination of the system architecture most suitable for this endeavor. This examination was done in the context of spike sorting, which is a necessary step in the acquisition and processing of neural signals. The spike sorting algorithm used for this study was based on Gaussian mixture models, which have been found useful in signal processing applications such as image processing, speech signal processing, and pattern recognition [1,2]. The hardware platforms compared were a desktop PC, a parallel digital signal processor (DSP) implementation, and field programmable gate arrays (FPGAs).

2 Gaussian Mixture Model

Spike sorting is the classification of neural action potential waveforms. The probability that an action potential x belongs to neuron o_i is given by Bayes' rule:

    p(o_i \mid x) = \frac{p(x \mid o_i)\, p(o_i)}{\sum_{k=1}^{M} p(x \mid o_k)\, p(o_k)}        (1)

where the Gaussian component density p(x | o_i) is given by

    p(x \mid o_i) = \frac{1}{(2\pi)^{R/2} |\Sigma_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right)        (2)

and \mu_i is the mean waveform vector and \Sigma_i is the covariance matrix of the Gaussian model o_i.

2.1 Expectation Maximization Algorithm

The parameters of the Gaussian mixture model were found via the Expectation-Maximization (EM) algorithm. The EM algorithm is a general method for finding the maximum-likelihood estimate of the parameters of an underlying distribution from a given data set. Details of this iterative parameter estimation technique can be found in [3]. A log version of the EM algorithm was used to deal with underflow problems.
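
For illustration, the short Matlab sketch below evaluates Eqs. (1) and (2) in the log domain, the same device used to avoid underflow in the EM implementation. It is only a sketch of the classification step, not the authors' code; the function name and the log-sum-exp step are our own choices.

    % Sketch of the classification step: log-domain evaluation of
    % Eqs. (1) and (2) for a single waveform x.  mu{i} is an R-by-1 mean
    % vector, Sigma{i} an R-by-R covariance matrix, prior(i) = p(o_i).
    % Illustrative only; not the authors' implementation.
    function post = gmm_posterior(x, mu, Sigma, prior)
        M = numel(prior);
        R = numel(x);
        logp = zeros(M, 1);
        for i = 1:M
            d = x(:) - mu{i};
            % log of Eq. (2); a Cholesky factor would be more robust than det
            logp(i) = log(prior(i)) - 0.5 * (R*log(2*pi) ...
                      + log(det(Sigma{i})) + d' * (Sigma{i} \ d));
        end
        % log-sum-exp over components: the denominator of Eq. (1) without underflow
        m    = max(logp);
        logZ = m + log(sum(exp(logp - m)));
        post = exp(logp - logZ);   % p(o_i | x), Eq. (1); sums to 1
    end

With M = 5 components and R = 160 samples per waveform, a call such as post = gmm_posterior(x, mu, Sigma, prior) returns the posterior probabilities of Eq. (1) for a single action potential x.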

3 PC Implementation

For comparison purposes we implemented the training of the EM algorithm in Matlab on various PCs. We used Matlab version 6.5 (Release 13), since it significantly improves computational speed compared with prior versions. The spike sorting process can be divided into two phases: the model training phase, which is computationally expensive, and the spike classification phase, which is much faster. The test results for estimating the parameters of 5 Gaussian components from 720 waveform vectors of dimension 160 are shown in Table 1.

Table 1. Training time of the EM algorithm using Matlab on PCs. Times are in seconds.

                   Pentium III   AMD           AMD Dual     Pentium IV    Average   Best
                   1.2 GHz (1)   1.37 GHz (1)  1.2 GHz (2)  3.3 GHz (3)
    Training time  8.33          3.42          1.45         1.31          3.22      1.31

    (1) The Pentium III 1.2 GHz and AMD 1.37 GHz systems both had 512 MB DDRAM.
    (2) The dual AMD 1.2 GHz system had 4 GB DDRAM.
    (3) The Pentium IV 3.3 GHz system had 1 GB DDRAM.
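
The benchmark above measures the training phase. As a point of reference, the following minimal sketch of a log-domain EM loop uses the same problem size (720 waveforms of 160 samples, 5 components). It is an illustrative reconstruction in current Matlab syntax, not the benchmarked code, and it uses synthetic data and a fixed iteration count instead of a convergence test.

    % Minimal log-domain EM training sketch at the Table 1 problem size
    % (N = 720 waveforms, R = 160 samples, M = 5 components).
    % Synthetic data and a fixed iteration count; illustrative only.
    M = 5;
    X = randn(720, 160);                     % placeholder for recorded waveforms
    [N, R] = size(X);

    mu    = X(randperm(N, M), :);            % initial means: M random waveforms
    Sigma = repmat(cov(X), [1 1 M]);         % initial covariances
    w     = ones(M, 1) / M;                  % initial mixture weights

    tic
    for iter = 1:50
        % E-step: log responsibilities with a log-sum-exp for stability
        logr = zeros(N, M);
        for i = 1:M
            D = X - mu(i, :);                       % deviations from the mean
            L = chol(Sigma(:, :, i) + 1e-6*eye(R), 'lower');
            S = L \ D';                             % whitened deviations
            logr(:, i) = log(w(i)) - 0.5*(R*log(2*pi) ...
                         + 2*sum(log(diag(L))) + sum(S.^2, 1)');
        end
        m = max(logr, [], 2);
        r = exp(logr - (m + log(sum(exp(logr - m), 2))));   % posteriors, N-by-M

        % M-step: re-estimate weights, means, and covariances
        Nk = sum(r, 1)';
        w  = Nk / N;
        for i = 1:M
            mu(i, :) = (r(:, i)' * X) / Nk(i);
            D = X - mu(i, :);
            Sigma(:, :, i) = (D' * (D .* r(:, i))) / Nk(i);
        end
    end
    toc                                          % compare against Table 1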

The best performance for training was 1.31 seconds, on the 3.3 GHz Pentium IV system. Classification performance was 0.05 seconds per waveform vector, which meets real-time performance since a typical neural spike duration is around 2 ms. However, the training process ran much more slowly than the classification process and needs to be accelerated to minimize delays in an experimental setup.

4 Parallel DSP Implementation

To speed up the training, the first approach we tried was to implement the EM algorithm on a parallel DSP system consisting of 4 floating-point DSPs. Digital signal processors have become more powerful with increases in clock speed and on-chip memory, and some modern DSPs are optimized for multiprocessing. The Analog Devices ADSP-21160M has two types of integrated multiprocessing support: link ports and a cluster bus. We used Bittware's Hammerhead PCI board with 4 ADSP-21160M DSPs to speed up the training algorithm.

We used the profile command in Matlab to analyze the EM algorithm and found that 70% of the execution time was spent in two areas: calculating p(x | o_i) and updating the means and covariance matrices.
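
A minimal sketch of this profiling step is shown below; em_train is a hypothetical wrapper around a training loop such as the one sketched in Section 3, while profile itself is the standard Matlab profiler.

    % Sketch: using the Matlab profiler to find the EM hot spots.
    % em_train is a hypothetical wrapper around the training loop above.
    profile on
    em_train(X, 5);            % one training run under the profiler
    profile viewer             % interactive report: time per function/line
    stats = profile('info');   % or collect the same data programmatically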

We wrote a parallel version of the EM algorithm to take advantage of the multiprocessing capability of the board. The results are shown in Table 2.

Table 2. Training time of the EM algorithm using the parallel DSP board. Times are in seconds.

    Data Transfer Method   Single DSP   Four DSPs
    No semaphores          N/A          stopped at 7000
    Semaphores             14.5         13.5
    DMA transfer           15.9         3.4

Not using semaphores in data transfers resulted in bus contention that significantly lowered performance. Using semaphores to transfer data eliminated the bus contention, but there was hardly any performance gain in going from one DSP to four DSPs. The code was then rewritten to take advantage of the on-board DMA controllers, and this resulted in a nearly linear speedup. The 4-DSP solution appears to be more than 4 times faster than the single-DSP solution; this is because with 4 DSPs all the data could be stored in internal memory, which was not possible with a single DSP.

The parallel DSP implementation still ended up slower than the best PC implementation, because the clock speed of the DSP was much lower than that of the high-end PC. The ADSP-21160M ran at 80 MHz in order to get single-cycle multiplies, while the heavily pipelined Pentium IV ran at 3.3 GHz. Thus the high-end PC was about 41 times faster than the DSP in terms of clock speed, yet the DSP implementation was only 11 times slower, highlighting the architectural advantage that DSPs have for multiply-accumulate operations.
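
The paper does not include the DSP source code. As a rough, hypothetical illustration of the data partitioning behind the parallel version, the sketch below splits the waveform set from the Section 3 example across four processing elements and combines their partial E-step statistics; estep_responsibilities is a hypothetical helper, and on the actual board each chunk would run on its own DSP, with DMA transfers combining the results.

    % Rough illustration of the data-parallel split (not the DSP code):
    % the N waveforms from the Section 3 sketch are divided among four
    % processing elements, each computes partial E-step statistics, and
    % the partial results are combined for the M-step.
    % estep_responsibilities is a hypothetical helper returning the
    % (N/4)-by-M posterior matrix for its chunk.
    nProc  = 4;
    chunks = mat2cell(X, repmat(N/nProc, nProc, 1), R);  % assumes mod(N,4) == 0
    partNk = zeros(M, nProc);
    partMu = zeros(M, R, nProc);
    for p = 1:nProc                 % on the board, each p runs on its own DSP
        r = estep_responsibilities(chunks{p}, mu, Sigma, w);
        partNk(:, p)    = sum(r, 1)';
        partMu(:, :, p) = r' * chunks{p};
    end
    Nk = sum(partNk, 2);            % combining the partial sums corresponds to
    mu = sum(partMu, 3) ./ Nk;      % the DMA transfers between DSPs on the board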

5 FPGA Implementation

Field Programmable Gate Arrays (FPGAs) are reconfigurable devices in which algorithms can be embedded directly in hardware. The advantage of FPGAs is that hundreds of embedded multipliers can run in parallel, tied to custom logic. We targeted Xilinx's Virtex-II FPGA, which had the resources shown in Table 3.

Table 3. Virtex-II XC2V3000 resources.

    System Gates   CLB Slices   Multipliers   Block RAMs
    3M             14336        96            96

One drawback of using FPGAs to implement complex algorithms such as the EM algorithm is the long time it can take to design and optimize the algorithms for internal FPGA resources. We used a recently developed tool, AccelFPGA [4], to shorten the design time. AccelFPGA compiles Matlab code directly into VHDL. By using AccelFPGA we could focus on optimizing the algorithm in Matlab without dealing with low-level hardware details inside the FPGA, which significantly reduces the design cycle. An implementation produced through AccelFPGA is not as optimal as hand-coding the algorithms in VHDL; however, for a prototype design, AccelFPGA can provide a time-efficient FPGA solution with reasonably good performance. The performance of the FPGA implementation is shown in Table 4.

Table 4. Training time of the EM algorithm using the FPGA (AccelFPGA v1.6 and Xilinx ISE v5.2.02i).

    Execution Time (sec)   Speedup relative to best PC   Speedup relative to best DSP implementation
    0.08                   16.4                          42.5

The FPGA implementation was 16.4 times faster than the fastest PC and 42.5 times faster than the best parallel DSP implementation. It should be noted that FPGAs are best suited to fixed-point processing and that the embedded multipliers are 18x18 bits; optimal use of the FPGA therefore requires conversion from floating point to fixed point. This is ideally done in Matlab using the fixed-point toolbox, where simulations can examine the effects of reduced precision.
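
As an illustration of this conversion step, the sketch below quantizes a waveform vector with fi objects from the Matlab fixed-point toolbox (now Fixed-Point Designer). The 18-bit word length matches the embedded multipliers; the fraction length and the data are illustrative choices, not values from the paper.

    % Sketch: examining fixed-point quantization with fi objects.
    % The 18-bit word length matches the embedded 18x18 multipliers;
    % the fraction length and the data are illustrative only.
    x_float = randn(160, 1);                  % one waveform vector (placeholder)
    x_fixed = fi(x_float, 1, 18, 12);         % signed, 18-bit word, 12 fraction bits

    quant_err = double(x_fixed) - x_float;    % effect of reduced precision
    fprintf('max quantization error: %g\n', max(abs(quant_err)));

    y = x_fixed .* fi(0.5, 1, 18, 17);        % an 18x18 product fits one multiplier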

Given that the algorithm can be mapped effectively to fixed point, the next step is to add compiler directives for AccelFPGA. These directives specify the use of embedded FPGA hardware resources such as BlockRAM and the embedded multiplier blocks. The third step is the creation of the RTL model in VHDL, together with the associated testbenches, both of which are generated automatically by the AccelFPGA compiler.

The fourth step is to synthesize the VHDL models using a logic synthesis tool. The gate-level netlist is then simulated to ensure functional correctness against the system specification. Finally, the gate-level netlist is placed and routed by the place-and-route tools appropriate for the FPGA used in the implementation. The design used 42 of the 96 embedded multipliers (42%), 58 of the 96 BlockRAMs (60%), and 6378 of the 14336 logic slices (44%).

One advantage of AccelFPGA is that bit-true simulations can be done directly in the Matlab environment. Testbenches are generated automatically along with the simulations; these testbenches can later be used in the hardware simulation and the results compared with the Matlab simulation, which makes it straightforward to verify that the hardware implementation is correct. Since Matlab is a high-level language, the compiled implementation will typically not be as good as direct VHDL coding. However, by using the proper directives, the parallelism of the algorithm can be specified and the usage of the on-chip hardware resources maximized, so reasonably good performance can still be obtained.

The structure of the FPGA is optimized for high-speed fixed-point addition, subtraction, and multiplication. However, a number of other math operations, such as division and exponentiation, also have to be implemented; this was done via lookup tables stored in BlockRAMs.

6 Conclusion

It appears that FPGAs are ideal for use in data driven systems. Not only are they suitable for digital signal processing applications where incoming data streams need to be processed in real time, they can also be used to implement sophisticated algorithms such as the EM algorithm.

References

1. Yang, M.H., Ahuja, N.: Gaussian Mixture Model for Human Skin Color and Its Application in Image and Video Databases. Proc. of the SPIE 3635 (1999)
2. Reynolds, D.A.: Speaker Identification and Verification Using Gaussian Mixture Speaker Models. Speech Communication 17 (1995)
3. Redner, R.A., Walker, H.F.: Mixture Densities, Maximum Likelihood and the EM Algorithm. SIAM Review 26 (1984) 195-239
4. AccelFPGA User's Manual, v1.7. www.accelchip.com (2003)