Processing Rat Brain Neuronal Signals Using a Hadoop Computing Cluster


1 Processing Rat Brain Neuronal Signals Using a Hadoop Computing Cluster. Jadin C. Jackson, PhD, Biology, University of St. Thomas, jadincjackson@stthomas.edu. Bradley S. Rubin, PhD, Graduate Programs in Software, University of St. Thomas, bsrubin@stthomas.edu

2 Introduction: Hippocampus as a memory structure. Human lesions: HM. Medial temporal lobe resection for epilepsy. IQ = 112. Unable to form lasting declarative memory. Corkin et al. (1997) J. Neurosci., 17(10):

3 Hippocampus (figure: human and rat; Paxinos & Watson 1997)

4 Introduction: Navigational Memory. Rat lesions. Neural correlates: the place cell. (Figure label: R TT04-01 OF)

5 Recording neural ensembles. Left: Hyperdrive. 12 tetrodes allow for simultaneous recording of up to 100 neurons, or more.

6 Tetrodes. Neurons generate electrical signals we can record. Tetrodes let us distinguish these signals based on their spatial distribution. (Figure panels: A, B, C)

7 Neural signals. Neurons are highly organized: connectivity, spatial distribution. Organized connectivity and distribution: summation of currents. Buzsaki (2004)

8 Signals we can get from tetrodes. 600 Hz to 6 kHz (scale bar: ~1 msec): action potentials (spikes). 1 Hz to 475 Hz (scale bar: ~100 msec): local field potentials (LFPs) or EEG. Note: Electroencephalogram (EEG) usually refers to signals recorded from outside the brain.

9 Hippocampal Place Field

10 Analyzing LFP data

11 Filtering Hippocampal LFP (figure; scale bar: 500 ms)

12 Filtering Hippocampal LFP. Theta (θ) = 6-10 Hz. Gamma: γ Slow = 35-80 Hz, γ Fast = Hz. (Figure scale bar: 500 ms)

13 Convolution (time domain). f and g are two real-valued functions, t is time, τ is a dummy variable, n is the sample index, k is a dummy index. http://en.wikipedia.org/wiki/convolution
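
The equations on this slide did not survive, but with the symbols it defines the discrete form is conventionally written (f * g)[n] = Σ_k f[k] g[n − k]. The following is a minimal Python sketch of that sum, assuming finite sequences that are zero outside their range. The function name `convolve_direct` and the sample values are illustrative and do not come from the talk.

```python
def convolve_direct(f, g):
    """Full discrete convolution: out[n] = sum over k of f[k] * g[n - k]."""
    out = [0.0] * (len(f) + len(g) - 1)
    for n in range(len(out)):
        for k in range(len(f)):
            # g is treated as zero outside its index range
            if 0 <= n - k < len(g):
                out[n] += f[k] * g[n - k]
    return out


signal = [1.0, 2.0, 3.0]   # made-up samples
kernel = [0.5, 0.5]        # two-point moving average
smoothed = convolve_direct(signal, kernel)
print(smoothed)
```

The direct sum costs on the order of len(f) × len(g) operations, which is why long recordings are usually convolved in the frequency domain instead.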

14 Continuous Wavelet Transform. x is the signal, ψ is the wavelet, t is time, a is the scaling factor, b is the shift. (Figure: wavelet ψ)

15 Continuous Wavelet Transform. x is the signal, ψ is the wavelet, t is time, a is the scaling factor, b is the shift. (Figure: wavelet ψ, a = small)

16 Continuous Wavelet Transform. x is the signal, ψ is the wavelet, t is time, a is the scaling factor, b is the shift. (Figure: wavelet ψ, a = large)
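
The transform itself is missing from these slides. For reference, a common textbook definition using the same symbols is given below. The 1/√|a| normalization is an assumption, since the talk may have used a different convention.

```latex
X(a, b) = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt
```

Here ψ* is the complex conjugate of the wavelet. A small a compresses ψ, so it responds to high frequencies. A large a stretches ψ, so it responds to low frequencies. The shift b slides the wavelet along the time axis.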

18 (Figure: time-frequency plot; axes: Frequency, Time)

19 (Figure: Signal and its Wavelet Transform; Samar et al., 2002)

20 Convolution (frequency domain). $\mathcal{F}\{\cdot\}$ is the Fourier Transform: $\mathcal{F}\{f(t) * g(t)\} = \mathcal{F}\{f(t)\} \cdot \mathcal{F}\{g(t)\}$. f and g are two real-valued functions, t is time. Bottom line: multiplying the frequency responses of two functions gives the frequency response of the convolution of those functions. Take the FFT of two functions, multiply the results, and convert back to the time domain.
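
The recipe on this slide (FFT both functions, multiply, convert back) can be sketched in plain Python. This is an illustration only: the talk used the JTransforms library in Java, whereas the small recursive radix-2 `fft` below is a stand-in written for this example, and all names are hypothetical.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fft_convolve(f, g):
    """Linear convolution: zero-pad, FFT both inputs, multiply, inverse FFT."""
    out_len = len(f) + len(g) - 1
    size = 1
    while size < out_len:          # pad to a power of two
        size *= 2
    F = fft(list(f) + [0.0] * (size - len(f)))
    G = fft(list(g) + [0.0] * (size - len(g)))
    product = [a * b for a, b in zip(F, G)]
    result = fft(product, inverse=True)
    # the inverse above is unnormalized, so divide by the transform size
    return [(v / size).real for v in result[:out_len]]


result = fft_convolve([1.0, 2.0, 3.0], [0.5, 0.5])
print([round(v, 6) for v in result])
```

Zero-padding to at least len(f) + len(g) − 1 samples keeps the circular convolution computed by the FFT from wrapping around, so the result matches the direct time-domain sum.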

21 Channel Averaging (figure; axis: Frequency (Hz); panel: Average)

22 Event-triggered Analysis: subsetting (figure; axis: Frequency (Hz); window from t_i to t_i + τ)
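
Subsetting around events amounts to keeping, for each event time t_i, only the samples that fall between t_i and t_i + τ. The sketch below assumes sorted sample times and a half-open window; neither detail is stated in the talk, and the function name and data are made up.

```python
from bisect import bisect_left


def event_windows(times, values, event_times, tau):
    """For each event t_i, return the values whose time t satisfies t_i <= t < t_i + tau.

    `times` must be sorted in ascending order and aligned with `values`.
    """
    windows = []
    for t_i in event_times:
        start = bisect_left(times, t_i)
        stop = bisect_left(times, t_i + tau)
        windows.append(values[start:stop])
    return windows


times = [0, 10, 20, 30, 40, 50]            # made-up sample times
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]    # made-up sample values
windows = event_windows(times, values, event_times=[10, 30], tau=20)
print(windows)
```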

23 Applying the Wavelet Transform (figure; axes: Frequency (Hz), Time; bands: Theta (θ), γ Slow, γ Fast)

24 What is phase locking? Imagine we have 2 to 3 distinct oscillations being generated by local neural circuitry, summed together (plus noise). If they are all coordinated with each other, then they would be phase-locked.

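25 Computing phase locking (figure; phase axis from 0° to 360°)

26 Computing phase locking (figure; phase axis from 0° to 360°)

27 Computing phase locking (figure; phase axis from 0° to 360°)

28 Computing phase locking (figure; three panels, each with a phase axis from 0° to 360°). Average by phase.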
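
The slides show "average by phase" only as figures. One plausible reading, sketched below, is to bin one quantity (for example gamma amplitude) by the phase of another oscillation (for example theta) and take the mean in each bin. The bin count, the function name, and the data are assumptions made for this illustration.

```python
def average_by_phase(phases_deg, amplitudes, n_bins=12):
    """Mean amplitude per phase bin; phases in degrees are wrapped to [0, 360)."""
    width = 360.0 / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for phase, amp in zip(phases_deg, amplitudes):
        b = int((phase % 360.0) // width)
        b = min(b, n_bins - 1)      # guard against floating-point edge cases
        sums[b] += amp
        counts[b] += 1
    # bins that received no samples are reported as None
    return [s / c if c else None for s, c in zip(sums, counts)]


profile = average_by_phase(
    [10, 20, 100, 190, 370],        # made-up phases in degrees
    [1.0, 3.0, 5.0, 2.0, 2.0],      # made-up amplitudes
    n_bins=4,
)
print(profile)
```

A flat profile would suggest no phase preference, while a peak in one bin would suggest that the amplitude is locked to that part of the cycle.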

29 Gamma. Fast gamma and slow gamma are on separate theta cycles. Fast gamma = medial entorhinal cortex. Slow gamma = CA3. Bottom line: different gamma = different inputs.

30 Decisions: Behaviors at the Choice Point. What inputs is the hippocampal network processing during this decision?

31 GPS Hadoop Cluster. 24 nodes (1 master, 23 slaves), running Ubuntu Server. Nodes are Sun Fire X2200M2 with 2x AMD Opteron 2214 dual-core 2.2 GHz processors. Master: 18 GB RAM, 2x 1 TB drives (RAID 1 mirrored); hosts the NameNode, SecondaryNameNode, JobTracker, MySQL for the Hive metastore, and user local home directories. Slaves: 12 GB RAM, 250 GB + 1 TB drives; each runs a DataNode and a TaskTracker, with 2 CPU core slots for mappers and reducers.

32 Single Rat Run Processing Flow. Input: 6.6 million records/channel, 15 channels, 99 million records total, 1.3 GB total. Kernels: 196 records, 950 KB total. Convolution output: 23.5 GB/channel, 353 GB total.

33 Convolution Step. Each HDFS file contains pairs of time and voltage values for a single rat run channel. We set the input to non-splittable, so the default 64 MB HDFS block size is ignored and each file is sent in its entirety to a single mapper. In the mapper, we read the input file into a buffer (buffer size = power of 2). Once the buffer is completely loaded, in the close() method, we perform the FFT, complex number multiplication, and inverse FFT for kernel frequencies from Hz in 1-Hz increments. There is no reduce step, which helped with output file naming. This approach does not work well with the percentage-complete job monitoring!
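
Radix-2 FFTs work on power-of-two lengths, which is presumably why the buffer size is a power of 2. The sketch below shows how such a size could be chosen and how a channel could be zero-padded to it. The exact buffer size used in the talk is not stated, so the number computed here is only what the 6.6 million records per channel quoted in the talk would imply.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (returns 1 for n <= 1)."""
    size = 1
    while size < n:
        size *= 2
    return size


def pad_to_power_of_two(samples):
    """Zero-pad a sequence of samples to a power-of-two length."""
    size = next_power_of_two(len(samples))
    return list(samples) + [0.0] * (size - len(samples))


records_per_channel = 6_600_000   # approximate figure quoted in the talk
buffer_size = next_power_of_two(records_per_channel)
print(buffer_size)                # 2**23
```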

34 Directory Structure. We write the convolution output data using this directory structure so that we can leverage Hive partitioning: /neuro/output/rats, with partition subdirectories such as rat=r184, dt=, channel=1a, channel=2a, ...

35 Hive External Table Creation. Hive's external table creation capability allows SQL queries over HDFS-resident output files:

    CREATE EXTERNAL TABLE rats (time INT, frequency SMALLINT, convolution FLOAT)
    PARTITIONED BY (rat STRING, dt STRING, channel STRING)
    ROW FORMAT SERDE 'neurohadoop.ratserde'
    STORED AS SEQUENCEFILE
    LOCATION '/neuro/output/rats';
    ALTER TABLE rats ADD PARTITION (rat='R184', dt=' ', channel='1a');
    ...

Hive's partitioning capability limits the data traversal to only the subdirectories needed to answer the query, yielding a major performance boost for most subsequent operations.

36 Hive Serde. It is most convenient to use text files for Hadoop and Hive input and output. Because we have a large amount of data to pass between Hadoop output and Hive input, we wanted to use a binary data format, called a SequenceFile. Also, Hive ignores keys in SequenceFiles and only treats a value as a Hive column, so we needed to create a custom Java serde (serializer/deserializer) for this complex value object consisting of an int, a short, and a float to map to the three columns. This is not documented well, so here is our code (http://pastebin.com/xuy36kxg). Also, we used Snappy block-level compression. Result: Bytes/convolution output record >
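
To see why a binary record of an int, a short, and a float is compact, the sketch below packs one with Python's `struct` module and compares it with a tab-separated text line. It illustrates the value payload only. It does not reproduce the authors' Java serde, the SequenceFile container, or Snappy compression, and the sample values are made up.

```python
import struct

# Big-endian int (4 bytes), short (2 bytes), float (4 bytes), no padding.
# Big-endian matches Java's DataOutput byte order.
RECORD_FORMAT = ">ihf"


def pack_record(time, frequency, convolution):
    """Pack one (time, frequency, convolution) row into 10 bytes."""
    return struct.pack(RECORD_FORMAT, time, frequency, convolution)


def unpack_record(data):
    """Inverse of pack_record; returns a (time, frequency, convolution) tuple."""
    return struct.unpack(RECORD_FORMAT, data)


binary = pack_record(1234567, 150, 0.25)                 # made-up values
text = "1234567\t150\t0.25\n".encode("ascii")            # same row as text
sizes = (len(binary), len(text))
print(sizes)
```

The binary payload has a fixed width regardless of the values, while the text form grows with the number of digits written.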

37 Result (figure; axes: Amplitude, Frequency)

38 Result (figure; axes: Frequency, Phase)

39 Performance Overview. Run time grows sub-linearly within each group of 3 rat runs, because only after 3 runs do we use up all mapper slots (15 mappers/rat run * 3 rat runs = 45 slots, and we have 46 available).
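
The slot arithmetic can be checked with a small calculation. The sketch below assumes that every mapper takes roughly the same time, so that run time scales with the number of scheduling "waves"; that is a simplification rather than a measurement from the talk. The 46 slots follow from 23 slaves with 2 slots each.

```python
import math

SLAVES = 23            # from the cluster description
SLOTS_PER_SLAVE = 2    # from the cluster description
CHANNELS_PER_RUN = 15  # one mapper per channel file


def mapper_waves(rat_runs):
    """Scheduling waves needed if all mappers take about the same time."""
    slots = SLAVES * SLOTS_PER_SLAVE
    mappers = CHANNELS_PER_RUN * rat_runs
    return math.ceil(mappers / slots)


waves = [mapper_waves(n) for n in range(1, 8)]
print(waves)
```

Under this model, runs 1 to 3 fit in a single wave, runs 4 to 6 need two, and the seventh run starts a third wave.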

40 Performance Detail (figure; panels: Channel Averaging, Convolution)

41 Convolution Performance. Performance results in milliseconds:
Load Kernel: 326 | Load Data: | Signal FFT: 1081
Kernel FFT: 604 | Product: 203 | Inverse FFT: 1070 | Output Data:
Kernel FFT: 418 | Product: 85 | Inverse FFT: 755 | Output Data:
Kernel FFT: 554 | Product: 88 | Inverse FFT: 859 | Output Data:
All 196 kernels are loaded from the distributed cache, then all the signal data for a channel is loaded and FFT processed. Then, for each kernel in turn, the kernel is FFT processed and complex multiplied with the signal FFT, the result is processed with an inverse FFT, and the output data is written to HDFS. Average Java/Hadoop time/convolution = 37 sec. Average Matlab time/convolution = 11 sec. We used the JTransforms FFT library: https://sites.google.com/site/piotrwendykier/software/jtransforms

42 Overall Comparison with Matlab. An apples-to-apples comparison is difficult because single-workstation memory limitations dictate a non-optimal processing flow for the Matlab approach. The Hadoop approach does the processing in this order: squaring, averaging the 12 channels, computing the mean and standard deviation of the average channel values, subsetting the time intervals, Z-scoring. This takes about 4 hours to process using 46 computing slots. The Matlab approach does the processing in this order: squaring, Z-scoring, subsetting the time intervals, averaging the 12 channels. This takes about 10 hours to process, using a small number of processor cores.
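
The Hadoop ordering can be sketched on toy data to make the sequence of steps concrete. This is not the authors' MapReduce or Hive code. The use of the population standard deviation, the index-based intervals, and all names are assumptions made for this illustration.

```python
from statistics import mean, pstdev


def hadoop_order(channels, intervals):
    """Square, average across channels, compute mean/std, subset, then z-score.

    channels:  equal-length lists of convolution values, one list per channel.
    intervals: (start, stop) index pairs selecting the samples to keep.
    """
    squared = [[v * v for v in ch] for ch in channels]
    averaged = [mean(col) for col in zip(*squared)]        # average channel
    mu, sigma = mean(averaged), pstdev(averaged)           # over the whole run
    subset = [v for start, stop in intervals for v in averaged[start:stop]]
    return [(v - mu) / sigma for v in subset]              # sigma must be nonzero


channels = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]    # two toy channels
z = hadoop_order(channels, intervals=[(0, 2)])
print([round(v, 4) for v in z])
```

The point of this ordering is that the mean and standard deviation come from the complete averaged channel before any subsetting, which requires holding the full intermediate data set.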

43 Qualitative Hadoop/Hive Benefits vs. Matlab. With Hadoop: We were able to implement the optimal processing path, unconstrained by single-workstation memory. All input and output data is online. All intermediate data is online and available for subsequent ad hoc processing. Easy total batch processing consisting of a sequence of Java MapReduce and Hive steps: dump data into HDFS, issue command, come back when all done.

44 Amazon Elastic MapReduce. We ported most of our application to run on Amazon's Elastic MapReduce service. The only Java code change needed was changing file references to use an s3n://bucketname prefix. AWS allows data in/out directly from persistent S3 storage or transient HDFS storage (we did the former). Performance was comparable to our cluster, and the cost was in the low tens of dollars.

45 Amazon Elastic MapReduce Surprises. Hive metadata is transient unless a MySQL instance is configured, so tables created in one process step disappear in subsequent steps (http://docs.amazonwebservices.com/elasticmapreduce/latest/developerguide/UsingEMR_Hive.html#emr-dev-create-metastore-outside). S3 has a 5 GB limit, so we needed to enable multipart upload (http://docs.amazonwebservices.com/elasticmapreduce/latest/developerguide/Config_Multipart.html?r=3305). An 8-core instance has 7 slots available for MR processing. Small errors consume seconds, but billing rounds up to the nearest hour.
