BIG Data: how to handle it
Mark Holton, College of Engineering, Swansea University, m.d.holton@swansea.ac.uk
The usual: what I'm going to talk about
- The source of the data: some tag (data logger) history
- Selection criteria
- Some numbers
- Hardware / storage
- Preparation of the data
- Visualisations
- Where BIG data processing is heading
Selection criteria: to get what we need/want
- Weight
- Durability
Choosing the right tag
Tags vary greatly in their abilities:
- Range of logging frequencies, 1 Hz to 800 Hz depending on need
- Different on-board sensors, subsampled relative to the others
- Sensors from different manufacturers have different accuracies/sensitivities
- Potentially 10 or more channels of data:
  - Accelerometer X, Y, Z
  - Magnetometer X, Y, Z
  - Temperature
  - Pressure
  - Light level
  - Battery
  - Speed
  - Humidity
  - GPS / DGPS
  - Differential pressure (pitot probe)
  - Feeding: inter-mandibular angle sensor (IMASEN)
Tag development
Some numbers
For 10 channels sampled at 40 Hz:
- 400 pieces of data per second
- 24,000 pieces of data per minute
- 1,440,000 pieces of data per hour
- 34,560,000 pieces of data per day
- 1,036,800,000 pieces of data per month (30 days)
And this is just the RAW data; on top of that come:
- Smoothed channels
- Various metrics
- Marker / sync channels (GPS etc.)
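These figures follow directly from the sampling arithmetic; a quick sketch (the 2 bytes per sample used for the storage estimate is an assumption, as actual tag word sizes vary):

```python
# Data-volume arithmetic for a multi-channel tag.
CHANNELS = 10
RATE_HZ = 40
BYTES_PER_SAMPLE = 2  # assumed word size; real tags vary

per_second = CHANNELS * RATE_HZ  # 400
for label, seconds in [("minute", 60), ("hour", 3_600),
                       ("day", 86_400), ("month (30 d)", 2_592_000)]:
    n = per_second * seconds
    print(f"per {label}: {n:,} samples ({n * BYTES_PER_SAMPLE / 1e9:.2f} GB)")
```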
Hardware: what is needed, actually really needed
- i5/i7 Intel processor
- Lots of memory (16-32 GB)
- Lots of local hard-drive space
- Networked storage
Pre-preparing the data
Raw data can be difficult to understand; some data preparation can help a lot with the interpretation of results.
Simple preparation:
- Channel smoothing
- Median filters
- Threshold filters
- Histograms
- Time filters, i.e. sectioning by the hour, day etc.
Slightly harder preparation:
- Synchronisation of data to GPS data
- 3D normalisation
- Draw scaling and colour scaling
Hard (multiple-channel, processing-heavy) preparation:
- Complex 3D plots of multi-channel data (the user has the option to plot any combination of available channels)
A minimal sketch of the simple preparation steps follows this list.
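As an illustration of the simple preparation steps (the channel, window lengths and threshold here are assumptions), smoothing and filtering a noisy accelerometer channel might look like:

```python
import numpy as np
from scipy.signal import medfilt

# Hypothetical 40 Hz accelerometer channel with sensor noise.
rng = np.random.default_rng(0)
acc_x = np.sin(np.linspace(0, 20, 4000)) + 0.3 * rng.standard_normal(4000)

# Median filter: suppresses single-sample spikes (window must be odd).
acc_x = medfilt(acc_x, kernel_size=5)

# Channel smoothing: moving average over a 1-second (40-sample) window.
acc_x = np.convolve(acc_x, np.ones(40) / 40, mode="same")

# Threshold filter: mark samples exceeding a chosen level.
active = np.abs(acc_x) > 0.5
```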
Visualisations
Fourier Transforms for frequency analysis
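A sketch of the idea (the sampling rate and the 3 Hz test component are assumptions): a discrete Fourier transform of an acceleration channel reveals dominant stroke or beat frequencies.

```python
import numpy as np

FS = 40                                       # sampling rate in Hz (assumed)
t = np.arange(0, 60, 1 / FS)                  # one minute of data
signal = np.sin(2 * np.pi * 3.0 * t)          # e.g. a 3 Hz limb-beat component
signal += 0.5 * np.random.default_rng(0).standard_normal(t.size)

# Real FFT gives the amplitude spectrum up to the 20 Hz Nyquist limit.
spectrum = np.abs(np.fft.rfft(signal)) / t.size
freqs = np.fft.rfftfreq(t.size, d=1 / FS)
peak = freqs[np.argmax(spectrum[1:]) + 1]     # skip the DC bin
print(f"dominant frequency: {peak:.2f} Hz")   # ~3.00 Hz
```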
Interpretation of acceleration traces as body pitch and sway
[Figures: acceleration traces annotated with body pitch]
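One common way to derive pitch from the static (gravity) component of acceleration, sketched here with an assumed axis and sign convention:

```python
import numpy as np

def pitch_degrees(ax, ay, az):
    """Body pitch estimated from the smoothed (static) acceleration
    components in g. Axis and sign conventions are assumptions here."""
    return np.degrees(np.arctan2(ax, np.sqrt(ay**2 + az**2)))

print(pitch_degrees(0.5, 0.0, 0.87))  # roughly 30 degrees nose-up
```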
Look more at the bulk data
[Figures: acceleration XYZ; acceleration XYZ coloured by pressure; acceleration XYZ coloured by pressure, with the point radius modulated by pressure]
Spherical histograms, modulated by another channel
Each block of the spherical histogram is the sum of one particular attribute of all points that fall within that area of the sphere; a sketch of the binning step follows.
[Figures: acceleration XYZ coloured by pressure; spherical histogram of that distribution; spherical histogram showing the distribution of a different attribute; logarithmic spherical histogram]
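A minimal sketch of the underlying binning (the bin counts and the summed attribute are assumptions): normalise each acceleration vector onto the unit sphere, convert to spherical angles, and accumulate an attribute per angular bin.

```python
import numpy as np

def spherical_histogram(acc, attribute, n_theta=18, n_phi=36):
    """Sum `attribute` (e.g. pressure) over direction bins of the unit
    sphere. acc: (N, 3) acceleration vectors; attribute: (N,) values."""
    unit = acc / np.linalg.norm(acc, axis=1, keepdims=True)
    theta = np.arccos(np.clip(unit[:, 2], -1.0, 1.0))   # polar angle, 0..pi
    phi = np.arctan2(unit[:, 1], unit[:, 0]) + np.pi    # azimuth, 0..2*pi
    hist, _, _ = np.histogram2d(
        theta, phi, bins=(n_theta, n_phi),
        range=[[0, np.pi], [0, 2 * np.pi]], weights=attribute)
    return hist  # (n_theta, n_phi) grid of summed attribute values
```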
Acceleration data separated into layers by magnitude
Determining mean angular separation of clusters of data
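The separation between two clusters of acceleration vectors can be measured as the angle between their mean directions; a sketch (cluster contents assumed):

```python
import numpy as np

def mean_angular_separation(cluster_a, cluster_b):
    """Angle in degrees between the mean directions of two clusters
    of 3-axis acceleration vectors, each of shape (N, 3)."""
    mean_a = cluster_a.mean(axis=0)
    mean_b = cluster_b.mean(axis=0)
    cos_angle = np.dot(mean_a, mean_b) / (
        np.linalg.norm(mean_a) * np.linalg.norm(mean_b))
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
```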
2D histograms to reveal clustering, or frequent behaviour, across multiple axes
2D histograms of various channel data aid in looking for behaviours that appear as unusual/unique clusters.
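For example (the channel pairing is an assumption), binning one channel against another exposes the dense regions that often correspond to repeated behaviours:

```python
import numpy as np

# Hypothetical channels: smoothed acceleration magnitude vs. pressure.
rng = np.random.default_rng(1)
acc_mag = np.concatenate([rng.normal(1.0, 0.05, 5000),   # resting cluster
                          rng.normal(2.5, 0.30, 1000)])  # active cluster
pressure = np.concatenate([rng.normal(5, 1, 5000),
                           rng.normal(30, 5, 1000)])

counts, acc_edges, p_edges = np.histogram2d(acc_mag, pressure, bins=50)
i, j = np.unravel_index(counts.argmax(), counts.shape)
print(f"densest cell: acc ~{acc_edges[i]:.2f} g, pressure ~{p_edges[j]:.1f}")
```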
Incorporating other systems' data with sensor data: GPS
[Figures: GPS trace with associated behaviour/environment-coloured traces; GPS track coloured by one of many attributes, including pre-calculated behaviour values; GPS track coloured by one of many attributes, with additional duplicate environmental and behavioural tracks stacked; a different view of a section of GPS data]
Searching data using thresholds
Picking out interesting data (e.g. surfacing events) using a basic thresholding search algorithm; a sketch follows.
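A minimal sketch of such a search (the channel, threshold and minimum duration are assumptions): find contiguous runs where the pressure channel sits below a near-surface threshold.

```python
import numpy as np

def threshold_events(channel, threshold, min_len=40):
    """Return (start, end) sample-index pairs for contiguous runs where
    `channel` is below `threshold`, ignoring runs shorter than min_len."""
    below = np.concatenate(([False], channel < threshold, [False]))
    edges = np.flatnonzero(np.diff(below.astype(int)))
    starts, ends = edges[::2], edges[1::2]
    keep = (ends - starts) >= min_len
    return list(zip(starts[keep], ends[keep]))

# Example: surfacing = pressure below 1 dbar for at least one second at 40 Hz.
pressure = np.abs(np.sin(np.linspace(0, 4 * np.pi, 8000))) * 20
events = threshold_events(pressure, threshold=1.0, min_len=40)
print(f"{len(events)} surfacing events found")
```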
Everyone wants something different
One large difficulty with developing a system to look at this type of data is that EVERYONE wants something different:
- Correlation, to compare channels against each other
- Cross-correlation, to compare waveform shapes against a database to determine the best fit
- Various tests to determine similarity, convergence etc.
For now:
- Make the raw and processed data as accessible as possible to the user, for post-processing in existing statistical/graphing packages such as Origin, MATLAB etc.
- Provide an advanced expression builder for searching and tagging sections of data
A cross-correlation sketch follows this list.
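As a sketch of the cross-correlation idea (the template waveform and channel here are assumptions), a stored waveform shape can be slid along a channel to find the best-fitting offset:

```python
import numpy as np

rng = np.random.default_rng(2)
channel = rng.standard_normal(2000)
template = np.sin(np.linspace(0, 2 * np.pi, 80))  # stored waveform shape
channel[600:680] += 2 * template                  # embed one occurrence

# Cross-correlation: high values mark template-like sections of the channel.
xcorr = np.correlate(channel, template, mode="valid")
print(f"best match starts at sample {np.argmax(xcorr)}")  # ~600
```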
The future of BIG data processing
It's easy for the human eye to see patterns in large, often complex, data sets, but data sets containing multiple variables can very quickly become a task beyond standard mathematics and will often require something new.
Analysis of data using neural networks has been around for decades:
- Create a series of input neurons, an "input layer"
- Create a "hidden layer" to accept outputs from the input layer (or not; it is optional)
- Create an "output layer" that effectively summarises the results from the hidden layer
The network is then trained with data sets: the weightings that link the input layer to the hidden layer, and the hidden layer to the output layer, are mathematically adjusted to achieve a known result based on the known input data.
There are many different algorithms around today, with varying strengths and weaknesses depending on the type of data set. The oldest and most typical is the feed-forward neural network, which relies on the back-propagation of error; such a network, if trained correctly for certain types of data set, will be able to identify a previously unseen data set to a degree of probability. More complex networks include delays and loop-backs to earlier layers (recurrent networks) to allow a sort of memory of previous input sets. A minimal feed-forward sketch follows.
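A minimal feed-forward/back-propagation sketch (the layer sizes, learning rate and XOR toy task are illustrative assumptions, standing in for a behaviour-classification problem):

```python
import numpy as np

# Toy task: XOR, as a stand-in for classifying sensor-derived features.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((2, 4)), np.zeros(4)  # input -> hidden
W2, b2 = rng.standard_normal((4, 1)), np.zeros(1)  # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(10_000):
    # Forward pass through the layers.
    hidden = sigmoid(X @ W1 + b1)
    out = sigmoid(hidden @ W2 + b2)
    # Back-propagation of error: adjust each weight layer in turn.
    d_out = (out - y) * out * (1 - out)
    d_hidden = (d_out @ W2.T) * hidden * (1 - hidden)
    W2 -= 0.5 * hidden.T @ d_out
    b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_hidden
    b1 -= 0.5 * d_hidden.sum(axis=0)

# Outputs should approach [0, 1, 1, 0] (convergence depends on the seed).
print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).ravel(), 2))
```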
Thank you