Multi-Channel Neural Spike Detection and Alignment on GiDEL PROCStar IV 530 FPGA Platform


UNIVERSITY OF CALIFORNIA, LOS ANGELES
Multi-Channel Neural Spike Detection and Alignment on GiDEL PROCStar IV 530 FPGA Platform
Aria Sarraf (SID: )
12/8/2014

Abstract

In this report I present a prototype design for the GiDEL PROCStar IV 530 FPGA platform that is capable of processing 192 simultaneous channels of neural data for spike detection and alignment. The emphasis of this design is to support the maximum number of neural channels while attaining high overall throughput. This report explains the factors that limit the number of channels and how a single-channel throughput of 32 Mbps is achieved in a 192-channel configuration. It also details the contributing factors in latency and the methodologies for minimizing it. Furthermore, it demonstrates how this system provides a framework for implementing a complete spike-sorting algorithm.

Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Project Goals
  1.3 GiDEL PROCStar IV 530 FPGA Platform
2 Functionality of Current Prototype
3 Implementation
  3.1 Software
  3.2 Hardware
  3.3 Number of Channels
  3.4 Memory Multiplexor
  3.5 Memory Control
  3.6 Packet Size
  3.7 Detection
  3.8 Alignment
  3.9 Scaling the Number of Channels
4 Results
  4.1 Accuracy
  4.2 Performance
  4.3 Utilization
5 Future Work
6 Conclusion
References

1 Introduction

1.1 Motivation

The detection of neural spikes is a technical challenge that is essential for analyzing many types of brain function. Neural spike detection begins by extracting neural activity via an electrode, which measures the activity as a voltage. This voltage is then sampled at rates in the kHz range and quantized to a fixed number of bits per sample. However, these signals often contain a large amount of background noise, which makes it difficult to accurately identify the neural spikes. Therefore, digital signal-processing algorithms are required to overcome this difficulty.

With high sampling rates and resolutions, neural recordings produce a tremendous amount of data, which places a high demand on computational resources. While it is possible to process this data in software on a general-purpose CPU, the processing rate is slow (0.94 Mbps [1]). By instead using digital processors designed for neural data, namely field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs), computation times can be reduced by up to three orders of magnitude [1].

1.2 Project Goals

The goal of this report is to demonstrate the performance capabilities of the PROCStar IV 530 for multi-channel neural spike detection and alignment. The emphasis of the current design prototype is to support the maximum number of neural channels while achieving high performance in the following procedures:

- transferring data from the host to the FPGA platform
- memory allocation during FPGA run-time
- transferring data back from the FPGA platform to the host

This design includes a spike detection block and an alignment block as a framework for future development. Choosing a sophisticated algorithm for each of these blocks is not the intent of this project; they are primarily used to model the latencies and verify proper functionality of the system.

1.3 GiDEL PROCStar IV 530 FPGA Platform

Using a high-capacity, high-speed FPGA platform is paramount to reducing the computation time of spike detection and alignment. The platform used for this project is the GiDEL PROCStar IV 530. This platform is an 8-lane PCIe hosted system with 4 Altera Stratix IV 530 FPGAs, capable of running at system speeds of up to 300 MHz. The system is equipped with a total of 18 GB of DDR2 memory (16 GB of external memory and 2 GB of on-board memory), which is more than sufficient for the needs of this application.

Figure 1: System overview of the FPGA platform

GiDEL also offers a hardware-software integration application called ProcWizard, which simplifies project development. This software allows for automatic generation of PCI drivers and internal buses. Every item in the design, such as memories, registers, and modules, can be defined in the application, which then automatically generates a template with the corresponding hardware and software design files. These hardware and software files make up an interface unit that handles the host communication protocol for the system. As shown in Figure 1, all text and arrows in blue are elements implemented by the developer; GiDEL ProcWizard automatically configures all other elements.

ProcWizard also offers several GiDEL IP cores that use on-board memory to create large delay lines, advanced memory controllers, and controllers for transferring data between sub-designs. These IP cores include ProcMegaDelay, ProcMultiPort, and ProcMegaFIFO. ProcMultiPort was used extensively in this project and provides efficient usage of the memory banks: it effectively converts on-board memory into a true multi-port memory, allowing up to 16 ports (each with a different port width) to run simultaneously on each memory bank.

2 Functionality of Current Prototype

The input data requirements for this design are based on Sarah Gibson's report [1]. The current prototype requires each sample to be a 16-bit word (8 integer bits and 8 fractional bits), stored in a binary file. The integer bits are signed and can represent values from -128 to 127, while the fractional bits are unsigned and provide a resolution of 1/256.
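To make the sample format concrete, the following minimal C++ sketch shows one way a signed Q8.8 value could be encoded and decoded. The function names are hypothetical; the actual quantization in this project is performed by the MATLAB script described below.

#include <cstdint>
#include <cmath>

// Hypothetical helpers illustrating the 16-bit fixed-point sample format
// (signed Q8.8: 8 integer bits, 8 fractional bits). Not part of the
// prototype itself, which receives data already quantized by MATLAB.

int16_t encode_q8_8(double v) {
    // Scale by 2^8 and round; saturate to the representable range.
    double scaled = std::round(v * 256.0);
    if (scaled > 32767.0)  scaled = 32767.0;   // +127.996...
    if (scaled < -32768.0) scaled = -32768.0;  // -128.0
    return static_cast<int16_t>(scaled);
}

double decode_q8_8(int16_t s) {
    return static_cast<double>(s) / 256.0;     // resolution of 1/256
}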

Each binary file is treated as a separate channel. A MATLAB script provided by Sarah Gibson quantizes, filters, and converts the data to this format. After the data is processed in MATLAB, the resulting binary files are placed in the input directory of the FPGA spike detection/alignment platform. The number of files in the directory corresponds to the number of channels that the system will be configured for. The user then configures two parameters for the spike detection and alignment algorithm: the spike threshold and the spike width. An additional parameter selects whether the output is written to a single file or split into multiple files. After the program executes, the resulting data from the FPGA is transferred to a .dat file on the host PC. This data contains the timestamp of each aligned spike, the spike's peak value, and the channel from which the spike came.

3 Implementation

3.1 Software

The software for this prototype is developed in C++. GiDEL provides the necessary libraries for C++ applications to communicate with the FPGA via the PCIe bus. These libraries contain the application programming interface (API) to perform hardware initialization, load FPGA designs from the host's hard disk drive, and configure registers on the FPGA. The ProcWizard application generates a header file containing the offsets for the user-defined registers and memories, which can be used in conjunction with the GiDEL libraries.

The most critical feature of the C++ application is direct memory access (DMA). GiDEL provides two methods for DMA: simple mode and effective mode, each with its own tradeoffs. Simple mode executes an entire DMA operation in a single function call; it is called simple because it needs no preparation stages or additional variables. The cost of this simplicity is software overhead in the allocation and release of OS resources, which degrades the transfer rate. Effective mode is faster than simple mode but requires manually creating handles and buffers, performing the transfer, and releasing the resources. Since effective mode enables higher throughput, it is used in this prototype.
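The structure of an effective-mode transfer can be sketched as follows. This is only an illustration of the prepare/transfer/release pattern described above; every gidel_* identifier is a hypothetical placeholder (stubbed so the sketch is self-contained), not GiDEL's actual API, which is documented in the PROCStar IV data book [2].

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical stand-ins for the vendor DMA API.
struct gidel_dma_handle { void* buf; size_t len; };

gidel_dma_handle* gidel_dma_prepare(void* buf, size_t len) {
    // Real effective-mode setup would pin the buffer and allocate
    // OS resources once, up front.
    return new gidel_dma_handle{buf, len};
}
void gidel_dma_write(gidel_dma_handle* h, uint64_t fpga_addr) {
    std::printf("DMA %zu bytes -> FPGA @ 0x%llx\n", h->len,
                static_cast<unsigned long long>(fpga_addr));
}
void gidel_dma_release(gidel_dma_handle* h) { delete h; }

void send_packet(uint8_t* packet, size_t bytes, uint64_t bank_addr, int reps) {
    gidel_dma_handle* h = gidel_dma_prepare(packet, bytes);  // pay setup once
    for (int i = 0; i < reps; ++i)
        gidel_dma_write(h, bank_addr);   // per-transfer OS overhead stays low,
                                         // unlike simple mode's one-call-does-all
    gidel_dma_release(h);
}

The point of the pattern is that the preparation and release costs are amortized over many transfers, which is exactly the tradeoff against simple mode described above.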

3.2 Hardware

Figure 2: Block diagram of the processing elements on the FPGA

The current prototype operates as several distinct blocks, as shown in Figure 2. While the signals on the block diagram refer to a single channel, they extend to a multi-channel configuration as well. The design blocks start after the DMA is complete and GLBL_EN is toggled from 0 to 1. The memory multiplexor first combines all of the single-channel data into one block of multi-channel data in memory. After this step, the memory control sends raw data to the detector. When the detector finds a spike, the aligner is triggered to start. After processing the next spike_width samples, the aligner sends the spike's timestamp and peak value to the memory control block, which stores this data in the SRAM. Once all of the samples are processed, the host PC retrieves the data from the SRAM via DMA.

3.3 Number of Channels

One of the vital goals for this prototype is to simultaneously process as many channels of neural data as possible. Understanding the limitations on the maximum number of channels is necessary when designing a sound architecture for the program. The maximum number of channels is limited by the throughput of the memory banks. The throughput of each memory bank depends on the memory clock frequency, the word size of the memory bank, and the access rate efficiency. The access rate efficiency accounts for the hardware overhead associated with the data transfer and is reported in the GiDEL data book [2]. The throughput is calculated as:

  Throughput = maximum memory clock frequency × word width × access rate efficiency
             = 666 MHz × 8 bytes × 75% = 3.996 GB/s

Once the throughput of the memory banks is known, the operating clock frequency of the prototype and the word size of the neural data can be used to calculate the number of channels. Given that each channel occupies 16 bits of data and the configured clock frequency is 125 MHz, the maximum number of channels that a memory bank can process is 16.
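This arithmetic can be reproduced in a few lines of C++ (a sketch only; the constants are the ones stated above, with the bank count taken from the next paragraph):

#include <cstdio>

int main() {
    // Per-bank throughput from above, rounded to 4 GB/s as in the report.
    const double bank_bps = 4e9 * 8;      // 32 Gb/s per memory bank
    // Each channel consumes one 16-bit sample per 125 MHz clock cycle.
    const double chan_bps = 125e6 * 16;   // 2 Gb/s per channel
    const int    banks    = 12;           // memory banks on the platform
    const int per_bank = static_cast<int>(bank_bps / chan_bps);
    std::printf("%d channels per bank, %d channels total\n",
                per_bank, per_bank * banks);  // prints: 16, 192
    return 0;
}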

Since each memory bank can support 16 channels and there are 12 memory banks, the platform can support a total of 192 channels:

  Maximum # of channels = Memory throughput / (clock frequency × input word size) × (# of memory banks)
                        = 4 GB/s / (125 MHz × 16 bits) × 12 = 192 channels

3.4 Memory Multiplexor

One of the significant challenges in this design is how to route the binary data from the host PC to the FPGA and back, with minimal cost in latency, while supporting 192 channels of data. How the data is managed is primarily dictated by the capabilities of ProcMultiPort and the bandwidth of the SRAM. As mentioned earlier, ProcMultiPort enables each memory bank to support up to 16 ports. These ports can be constructed to access the memory either randomly or sequentially. Random access causes major degradation of the overall throughput [2], so all configurations in the prototype use sequential access mode. Note that each port on the host PC can be configured up to 128 bits wide, while each port on the FPGA can be configured up to 256 bits wide.

The following paragraphs compare three ProcMultiPort configurations. The first two have varying latencies and channel counts, and their shortcomings motivate the third configuration, which was used for the prototype design. Assuming the timestamp of each spike is 32 bits, all configurations use a 64-bit write port on the FPGA that carries the spike timestamp, the spike value at that timestamp, and the channel of the timestamp. Since more than one aligned spike can arrive on any given clock cycle, the data must be buffered to share that port; otherwise, the system would need sixteen 64-bit write ports to handle the case where every channel produces an aligned spike on the same clock cycle. This is not feasible in this system, which is why buffering with a single 64-bit write port is used. Assuming the binary data is transferred directly from the host PC without additional processing, one possible configuration, shown below, uses the 16 available ports:

Configuration #1

  System   Port size (bits)   Read/Write   # of ports   Function
  Host     128                Read         1            DMA
  Host     128                Write        1            DMA
  FPGA     64                 Write        1            Detection/Alignment
  FPGA     16                 Read         13           Detection/Alignment

  ProcMultiPort utilization (used/total): 16/16

This configuration allows for a maximum of 13 channels per memory bank, which falls short of the 16-channel target described earlier. Another way to maximize the number of channels is to group, or multiplex, the input data before sending it to the FPGA. This effectively turns multiple single-channel buffers into one multi-channel buffer in memory. By grouping the data, the FPGA can use one 256-bit read port to read all of the channels simultaneously. This leads to the configuration shown below:

Configuration #2

  System   Port size (bits)   Read/Write   # of ports   Function
  Host     128                Read         1            DMA
  Host     128                Write        1            DMA
  FPGA     64                 Write        1            Detection/Alignment
  FPGA     256                Read         1            Detection/Alignment

  ProcMultiPort utilization (used/total): 4/16

While this configuration appears reasonable and intuitive, it adds a major latency cost when the host PC multiplexes the data. This strategy requires additional memory allocation on the host PC and extra processing time to move the data from multiple single-channel buffers into one multi-channel buffer. The added latency for the C++ program to multiplex 16 channels of 8 MB each was measured at 737 ms. When this process is repeated for all 12 memory banks on the FPGA, the added latency is 11.8 seconds, which reduces the effective throughput of a single channel to 5.36 Mbps and severely bottlenecks the system.

The approach used in the current prototype combines the two configurations discussed above. Instead of the host PC multiplexing the data, the FPGA implements this process. Since the FPGA processes data much faster than the host PC, latency is dramatically reduced compared to the second approach, and the 16-channel goal per memory bank is still met. This design requires ProcMultiPort to be configured in the following way:

Configuration #3

  System   Port size (bits)   Read/Write   # of ports   Function
  Host     128                Read         1            DMA
  Host     128                Write        1            DMA
  FPGA     64                 Write        1            Detection/Alignment
  FPGA     128                Read         2            Detection/Alignment
  FPGA     128                Write        1            Multiplexor
  FPGA     16                 Read         8            Multiplexor

  ProcMultiPort utilization (used/total): 14/16
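Both the host-side grouping of Configuration #2 and the FPGA multiplexor of Configuration #3 perform the same underlying word-level interleave. The following minimal C++ sketch models that reordering on the host side (the version measured at 737 ms per bank above); the function name and use of std::vector are illustrative only:

#include <cstdint>
#include <vector>

// Interleave N single-channel sample buffers into one multi-channel buffer:
// output word i*N + c holds sample i of channel c, so one wide sequential
// read on the FPGA side returns one sample from every channel at once.
std::vector<int16_t> interleave(const std::vector<std::vector<int16_t>>& chans) {
    const size_t n_chan = chans.size();
    const size_t n_samp = chans.empty() ? 0 : chans[0].size();
    std::vector<int16_t> out(n_chan * n_samp);
    for (size_t i = 0; i < n_samp; ++i)        // for each sample index...
        for (size_t c = 0; c < n_chan; ++c)    // ...gather across channels
            out[i * n_chan + c] = chans[c][i];
    return out;
}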

The main difference in this configuration is the set of ports used for multiplexing data on the FPGA. The two 128-bit read ports are functionally equivalent to one 256-bit port but are configured this way because of the design of the multiplexor. Ideally, the port containing the multiplexed data would be 256 bits wide, but that would require sixteen 16-bit read ports, for a total of 21 ports, exceeding the 16-port maximum. Instead, the 128-bit write port and the eight 16-bit read ports are run twice over separate addresses in the FPGA memory. This process is shown in Figure 3.

Figure 3: Data flow in the memory multiplexor

The additional latency added by multiplexing the data on the FPGA for 192 channels, each with 8 MB, was measured to be 160 ms. Thus, the effective throughput of a single channel during this process is 400 Mbps. Compared to the same process in software (Configuration #2), the proposed configuration achieves a 74x improvement in latency.

3.5 Memory Control

The purpose of the memory control block is to read new raw data into the system and to write the spike timestamp, spike value, and channel number out to FPGA memory. It also synchronizes all other blocks so that the multi-channel data have consistent relative timestamps.

One of the main challenges overcome in the current prototype was how to write out all of the data during multi-channel operation. As mentioned earlier, it is not feasible to dedicate a write port to every channel, so a single 64-bit write port with buffering is used. Buffering the output requires an extra FIFO on each channel, which benefits the system in two major ways. First, it reduces the number of utilized ports in each memory bank, since multiple channels can share the same port. Second, it reduces the overall latency of the system. Because the multi-channel data is read in at the maximum bandwidth of the memory bank, the reader must pause whenever a write operation is performed, which adds latency. If no buffer were used, and a one-port-per-channel scheme were possible, the read port would constantly be halting, with a latency that varies depending on how spread apart the spikes are from channel to channel; if all the spikes arrived at the same time, the read port would only pause for one clock cycle, since the write ports need only one clock cycle to run. Buffering operates on the same principle: once the buffers are filled, the write port can push out all of the data while pausing the reader for only one clock cycle.

Each channel in the memory control block has a dedicated buffer of fifteen 48-bit registers, each storing a spike timestamp (32 bits) and a spike peak value (16 bits). On every clock cycle, the memory control block waits for any one of the aligners to complete an execution. After receiving the done signal, it stores the aligned data in the corresponding channel's buffer and increments that buffer's array counter. Once any buffer becomes full, the memory control block pauses the read port and transfers the data from all of the buffers to the memory banks. After the array counter on every buffer is reset, the read port resumes.
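The flush-on-full policy can be modeled in software as follows. This is a behavioral sketch, not the Verilog implementation; the entry count and field widths follow the text above, and the class and type names are illustrative.

#include <cstdint>
#include <array>

// Behavioral model of the per-channel output buffers in the memory control
// block: fifteen 48-bit entries (32-bit timestamp + 16-bit peak), drained
// through one shared write port whenever any channel's buffer fills.
struct SpikeEntry { uint32_t timestamp; int16_t peak; };

template <int N_CHANNELS>
class MemoryControlModel {
    std::array<std::array<SpikeEntry, 15>, N_CHANNELS> buf_{};
    std::array<int, N_CHANNELS> count_{};   // per-buffer array counter
public:
    // Called when an aligner asserts its done signal.
    void on_spike(int ch, uint32_t ts, int16_t peak) {
        buf_[ch][count_[ch]++] = SpikeEntry{ts, peak};
        if (count_[ch] == 15) flush_all();  // any full buffer triggers a flush
    }
private:
    void flush_all() {
        // Models pausing the read port, draining every buffer through the
        // single 64-bit write port, and resetting the array counters.
        for (int ch = 0; ch < N_CHANNELS; ++ch) {
            for (int i = 0; i < count_[ch]; ++i) {
                /* write {timestamp, peak, ch} to the memory bank */
            }
            count_[ch] = 0;
        }
    }
};

For example, MemoryControlModel<16> would model the controller for one fully populated memory bank.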

3.6 Packet Size

The upper limit on the packet size of each channel is determined by the capacity of the smallest memory bank, the number of channels, and the design of the memory multiplexor block. In the context of this project, a packet is defined as the amount of data per channel that the host transfers to the FPGA before the FPGA processes it; each packet contains data pertaining only to that channel. The upper limit on the packet size of this system is:

  Packet size = SRAM size / (# of channels × 2) = 512 MB / (16 × 2) = 8 MB

The packet size is limited by the size of the smallest memory bank so that all channels occupy the memory space equally on each bank. One consequence of the memory multiplexor block is that it occupies the top half of each memory bank; the data packets of each channel are therefore placed in the bottom half of the bank.

3.7 Detection

The purpose of the detection block is to determine whether the current sample is a spike. A simple algorithm is used: if the current sample is greater than the user-defined threshold, a spike is detected. While the algorithm is simple, a separate Verilog module is dedicated to it, decoupling it from the other blocks. This makes it easy to incorporate other detection schemes into the design.
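In software terms, the detector reduces to a single predicate. The C++ model below is illustrative only (the deployed detector is a Verilog module); wrapping the test behind a function object mirrors how the dedicated module lets other detection schemes be swapped in:

#include <cstdint>
#include <functional>

// Model of the detection block: a sample is a spike candidate when it
// exceeds the user-defined threshold. A different Detector can be
// substituted without touching the rest of the pipeline.
using Detector = std::function<bool(int16_t)>;

Detector make_threshold_detector(int16_t threshold) {
    return [threshold](int16_t sample) { return sample > threshold; };
}

For example, make_threshold_detector(30) reproduces the threshold setting used in the tests of Section 4.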

3.8 Alignment

Figure 4: Single-channel raw data showing which samples the detector/aligner begin evaluating

The purpose of the aligner is to report a timestamp indicating the position of a common characteristic of a spike. In the current prototype, the aligner aligns each spike to its maximum value. The aligner works on a window of samples after the detector registers a threshold crossing. Figure 4 shows an example plot of the raw data; the bold red overlay indicates the points that the aligner will evaluate. The aligner operates according to the following scheme:

1. Start if a spike is detected and the aligner is not already evaluating a spike.
2. Store the first timestamp in initial_timestamp.
3. Store the first value and timestamp in spike_peak_pos and spike_timestamp, respectively.
4. Receive a new sample and increment the sample timestamp.
5. Check whether the new sample is greater than spike_peak_pos. If it is, store the new sample value and the corresponding timestamp.
6. Repeat from step 4 until (timestamp - initial_timestamp) == spike_width.
7. Set spike_valid high so the memory control block can grab the peak value and timestamp.
8. Repeat from step 1.

One of the advantages of this alignment method is that it adds no latency to the system: every sample sent to the aligner is processed on the same clock cycle. However, this method requires that the raw data contain no overlapping spikes. If two spikes appear within spike_width samples, the aligner will choose the spike with the larger peak value. Therefore, to avoid misinterpreting any spikes, it is necessary to choose a spike width setting most representative of the data.
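The scheme above translates almost directly into a per-sample software model. The following C++ sketch is an illustrative rendering of steps 1 through 8, folding in the threshold test from Section 3.7; it is a model, not the Verilog source:

#include <cstdint>

// Per-sample model of the aligner: one sample in per call, peak tracked
// over a spike_width window after a threshold crossing.
class AlignerModel {
    bool     active_ = false;    // currently evaluating a spike?
    uint32_t initial_ts_ = 0;    // timestamp of the threshold crossing
    uint32_t ts_ = 0;            // running sample timestamp
    int16_t  spike_peak_pos_ = 0;
    uint32_t spike_timestamp_ = 0;
public:
    // Returns true (spike_valid) when a peak/timestamp pair is ready.
    bool step(int16_t sample, int16_t threshold, uint32_t spike_width,
              uint32_t* out_ts, int16_t* out_peak) {
        ++ts_;                                      // step 4: new sample arrives
        if (!active_) {
            if (sample > threshold) {               // step 1: detector fires
                active_ = true;
                initial_ts_ = ts_;                  // step 2
                spike_peak_pos_ = sample;           // step 3
                spike_timestamp_ = ts_;
            }
            return false;
        }
        if (sample > spike_peak_pos_) {             // step 5: track the maximum
            spike_peak_pos_ = sample;
            spike_timestamp_ = ts_;
        }
        if (ts_ - initial_ts_ == spike_width) {     // step 6: window complete
            *out_ts = spike_timestamp_;             // step 7: spike_valid
            *out_peak = spike_peak_pos_;
            active_ = false;                        // step 8: rearm
            return true;
        }
        return false;
    }
};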

3.9 Scaling the Number of Channels

Scaling from 1 channel to 192 channels was possible because of the architectural design of the memory multiplexor and memory control blocks. After validating proper functionality of a single channel, the number of channels was increased to 16, the maximum per memory bank. Each memory bank has one memory multiplexor, one memory controller, 16 detector modules, and 16 aligner modules. The number of channels was then increased to 48, the maximum supported on each IC. Finally, the design was duplicated on every IC, for a total of 192 channels.

4 Results

The input to the following tests consisted of 16-bit samples recorded at a kHz-range sampling rate. There were 192 input files, each 8 MB in size. The detector threshold and the aligner spike width were set to fixed values (the threshold was 30, as noted in Section 4.1).

4.1 Accuracy

To determine whether the program finds spikes properly, I manually recorded where the peaks were before running the program, by visually inspecting the spikes on a plot, and compared them against the output file generated by the prototype. In all cases where a spike exceeded the threshold value of 30, the system accurately reported the timestamp of the maximum value of the peak.

4.2 Performance

The metrics for this prototype involved measuring latency and calculating throughput. The following latencies and throughputs were measured:

- TX latency: time to transfer the sample data from the host's RAM to the FPGA's RAM
- FPGA processing latency: time for the FPGA to read from its memory banks, detect and align the spikes, and write back to its memory banks
- RX latency: time to transfer the spike data from the FPGA's RAM to the host's RAM
- Total latency: the sum of the TX, FPGA processing, and RX latencies
- Throughput of each channel: the per-channel throughput in the corresponding multi-channel configuration
- Throughput of all channels: the aggregate throughput of all channels
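As a sanity check on the per-channel figures reported in Table 1 below, the throughputs follow directly from the packet size and the total latency (a C++ sketch; the latencies are taken from the table):

#include <cstdio>

int main() {
    const double packet_bits = 8.0 * 8e6;    // 8 MB per channel, in bits
    // Total latencies from Table 1 (48- and 192-channel configurations).
    const double total_s[2]  = {0.659, 2.027};
    const int    channels[2] = {48, 192};
    for (int i = 0; i < 2; ++i) {
        double per_chan_mbps = packet_bits / total_s[i] / 1e6;  // ~97 and ~32
        std::printf("%3d ch: %.1f Mbps/channel, %.0f Mbps aggregate\n",
                    channels[i], per_chan_mbps, per_chan_mbps * channels[i]);
    }
    return 0;
}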

Table 1: Performance metrics

  Number of channels            48          192
  File size / channel           8 MB        8 MB
  Packet size / channel         8 MB        8 MB
  TX latency                    447 ms      1789 ms
  FPGA processing latency       202 ms      203 ms
  RX latency                    10 ms       35 ms
  Total latency                 659 ms      2027 ms
  Throughput of each channel    97 Mbps     32 Mbps
  Throughput of all channels    4662 Mbps   6062 Mbps

Table 1 shows the performance of the prototype against these metrics. The bottleneck of the system is clearly the DMA transfer from the host PC to the FPGA: in the 192-channel configuration, TX accounts for 88% of the total latency. While this may seem significant, the data rate during the transfer was approximately 6.86 Gb/s, which is impressive by today's standards and roughly matches the benchmark reported in GiDEL's data book [2]. The bus for the DMA transfer is PCI Express with 8 active lanes.

The per-channel throughput was benchmarked at 97 Mbps for the 48-channel configuration and 31.6 Mbps for the 192-channel configuration. Since the input data rate at which each channel's samples are recorded is 0.44 Mbps, this prototype processes data at 71x the input rate when all 192 channels are used and at 220x the input rate when 48 channels are used.

4.3 Utilization

This design occupies less than a fifth of the resources available on the FPGA: 14% logic utilization, 8% combinational adaptive look-up tables (ALUTs), under 1% memory ALUTs, and 11% dedicated logic registers. The remaining resources allow for the design of more complex detection and alignment schemes. Moreover, future work on spike-sorting algorithms can take advantage of the memory ALUTs when creating multiple clusters.

Table 2: FPGA resource utilization

  Logic utilization           14%
  Combinational ALUTs         34,304 / 424,960 (8%)
  Memory ALUTs                816 / 212,480 (<1%)
  Dedicated logic registers   48,310 / 424,960 (11%)

5 Future Work

Figure 5: Block diagram of the processing elements for a spike-sorter system

Achieving multi-channel spike sorting is essential for uniquely identifying each spike. Because each neuron's signal may be shared across multiple channels, the current prototype will detect multiple spikes from the same neuron. This prototype was designed with the intention that a spike-sorting module will be developed on it in the future. Figure 5 shows a block diagram of such a system. It is similar to Figure 2, except that a spike sorter block is added between the aligner and the memory control block. The spike sorter will retrieve multi-channel spike timestamps and peak values and determine which cluster each spike came from. This information will be sent to the memory control block and, eventually, to the host.

6 Conclusion

This report has demonstrated that the GiDEL PROCStar IV 530 is capable of processing data at 71x the input data rate when all 192 channels are used and at 220x the input data rate when 48 channels are used. It has explained why 192 neural channels was chosen as the target for the design and how that target was achieved. It has also detailed the challenges in the design process, particularly in memory allocation, and how they were overcome. While the current prototype focuses on multi-channel neural processing and its performance, the main ambition for this system is to support spike sorting, and the prototype provides the necessary framework to do so. Essentially, the system can be treated as a black box: it provides the necessary inputs for a spike sorter (spike timestamp and peak value) and a means of transferring the resulting spike-sorting outputs to the host.

References

[1] Sarah Gibson, Jack W. Judy, and Dejan Marković, "An FPGA-based platform for accelerated offline spike sorting," Journal of Neuroscience Methods, Volume 215, Issue 1, 30 April 2013, Pages 1-11.

[2] GiDEL, PROCStar IV Data Book.

[3] Michael Lewicki, "A review of methods for spike sorting: the detection and classification of neural action potentials," Network: Computation in Neural Systems, Volume 9, 1998.
