Estimating Noise and Dimensionality in BCI Data Sets: Towards Illiteracy Comprehension


Claudia Sannelli, Mikio Braun, Michael Tangermann, Klaus-Robert Müller
Machine Learning Laboratory, Dept. Computer Science, Berlin University of Technology, Berlin, Germany
Intelligent Data Analysis Group, Fraunhofer FIRST, Berlin, Germany
claudia@cs.tu-berlin.de

Abstract

About one third of all BCI subjects cannot communicate via BCI, a phenomenon known as BCI illiteracy. New methods aiming at an early prediction of illiteracy would be very helpful for understanding this phenomenon and for sparing many subjects a hard BCI training. In this paper, the first application to electroencephalogram (EEG) data of a newly developed machine learning tool, Relevant Dimension Estimation (RDE), is presented. By detecting the label-relevant information present in a data set, RDE estimates the intrinsic noise and the complexity of the learning problem. Applied to EEG data collected during motor imagery paradigms, RDE delivers interesting insights into the illiteracy phenomenon. In particular, RDE demonstrates that illiteracy is mostly not due to the non-stationarity or high dimensionality present in the data, but rather due to a high intrinsic noise in the label-related information. Moreover, this paper shows how to detect individual BCI-illiterate subjects in a very reliable way, based on a combination of several features extracted by RDE.

1 Introduction

Rehabilitation and communication for amyotrophic lateral sclerosis (ALS) patients are the most important motivations and long-term goals for Brain Computer Interfaces (BCI), a research area which has enjoyed growing interest in the last decade. In contrast, most BCI studies are performed on healthy subjects and concentrate on improving existing algorithms for the classification of mental states from the electroencephalogram (EEG). Still, about one third of BCI users are not able to communicate with the machine. Even a healthy subject can become very frustrated during an experiment upon realizing that he is a so-called BCI illiterate, and very few patients are willing to experience this situation. BCI medical applications would find larger acceptance if the ratio of BCI-illiterate users could be reduced to a very small percentage. A robust prediction of BCI illiteracy would also help to avoid false hopes and to reduce the effort needed to train a patient for communication by BCI. For this purpose, new methods for EEG data set exploration and new features describing EEG data sets are needed that can serve as predictors of BCI illiteracy.

Relevant Dimension Estimation (RDE) is an algorithm proposed in [1] which uses kernel PCA (Principal Component Analysis) in feature space together with label information in order to assess the actual class-related information contained in a data set. In particular, RDE estimates two quantities: (1) the dimension of the subspace in kernel space containing the relevant information, and (2) the noise contained in the labels. Both numbers measure the interaction between the data set and a chosen kernel and, in particular, give an accurate picture of the complexity of the learning problem and of the amount of noise it contains.

In this study, a first application of RDE to EEG data is presented. Using Gaussian kernels of different widths, the dimensionality of the data set and the amount of noise are estimated at different scales. Our hypothesis is that a data set recorded from an illiterate subject is intrinsically high dimensional and therefore not well classifiable with features generated by the Common Spatial Patterns (CSP) method. To test this hypothesis, the features extracted by RDE are compared with the CSP features in terms of classification performance.

2 Experimental setup

A data set of 8 BCI sessions from 0 healthy subjects has been investigated. The data was recorded with the Berlin BCI (BBCI) during classical motor imagery BCI experiments (see e.g. [2, 3]). In the calibration session, the subjects were asked to perform 00-00 trials of motor imagery for the left or right hand and for the foot. Two classes were then chosen for the feedback session, depending on the offline classification performance of a linear classifier trained on CSP features [4]. In the feedback session, targets and feedback of the classifier output were presented visually.

3 Methods

3.1 Preprocessing

Within this study, several preprocessing parameter settings have been used. The preprocessing steps for each setting are as follows: (1) low-pass filtering at 00 Hz, (2) cutting the continuous EEG into epochs in a specific time interval after stimulus presentation, (3) optional channel selection, (4) rejecting bad trials and channels by variance-based artifact rejection, (5) selecting the trials belonging to the two classes already chosen for the online feedback, (6) filtering in a setting-specific frequency band and (7) calculating band power. The settings differ in steps 2, 3, 6 and 7; see the overview in Table 1.

Setting name              Band [Hz]   Time [ms]   Channels   Feature
calib-power-all           0.-         0-000       all        band power
calib-power-sel           sel         sel         sel        band power
calib-power-cench         0.-         0-000       C*         band power
calib-power-cench-alpha   8-          0-000       C*         band power
calib-csp-feat            sel         sel         sel        CSP features

Table 1: Preprocessing parameter settings.

In the Channels column, "all" means that all channels remaining after step 4 are used for calculating the features, i.e. step 3 was skipped. In contrast, "sel" means that a further channel selection has been applied. In particular, the channel subset was determined by a heuristic applied on the day of the experiment in order to maximize CSP performance. The same convention for "sel" holds for the Band and Time columns. Finally, "C*" means that all central channels (according to the 10-20 EEG system) were used.
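For concreteness, the following is a minimal sketch of steps (6) and (7) for one setting, assuming the epoched data is already available as a numpy array. The function name, the filter order and the 100 Hz sampling rate in the usage line are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of preprocessing steps (6) band-pass filtering and
# (7) band-power computation. Variable names and parameter values
# are illustrative assumptions.
import numpy as np
from scipy.signal import butter, filtfilt

def band_power_features(epochs, fs, band):
    """epochs: array (n_trials, n_channels, n_samples); band: (low, high) in Hz.
    Returns log band power per trial and channel, shape (n_trials, n_channels)."""
    low, high = band
    b, a = butter(5, [low / (fs / 2), high / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, epochs, axis=-1)   # step (6): setting-specific band
    power = np.var(filtered, axis=-1)            # step (7): band power as variance
    return np.log(power)                         # log scaling, common for EEG

# usage (assumed values): X = band_power_features(epochs, fs=100, band=(8, 15))
```

The variance of a band-pass filtered signal is the standard band-power estimate in motor imagery pipelines, which is why step (7) reduces to a per-trial variance here.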

3.2 Relevant Dimension Estimation

RDE has been applied to each data set, using a Gaussian RBF (radial basis function) kernel. Two parameters had to be selected: the kernel width and the dimension, i.e. the number of leading kernel PCA components. The range for the kernel width γ was between 0 and 0, and the range for the dimension d was [1, N/2], where N is the number of available trials. For each kernel width γ, the kernel matrix K(γ) and its sorted eigenvectors E(γ) have been calculated. In order to estimate the kernel width and the dimension of each data set, both methods described in [1] have been used. The first method finds the kernel width γ and the dimension d which minimize the negative log-likelihood function L(γ, d), defined as

    L(\gamma, d) = \frac{d}{n} \log \sigma_1^2 + \frac{n-d}{n} \log \sigma_2^2    (1)

with

    \sigma_1^2 = \frac{1}{d} \sum_{i=1}^{d} s_i^2 \quad\text{and}\quad \sigma_2^2 = \frac{1}{n-d} \sum_{i=d+1}^{n} s_i^2.    (2)

Here, s_i = u_i^T Y are the contributions of the kernel PCA components to the labels, and the u_i are the eigenvectors in E(γ). In the second method, the label predictions are calculated for each parameter combination by projecting the labels onto the leading kernel PCA components, using S(d, \gamma) = \sum_{i=1}^{d} u_i u_i^T. The best kernel width γ and the best dimension d are then chosen by minimizing the leave-one-out cross-validation error as computed in [5].

3.3 Noise Estimation

The noise present in a data set is calculated by RDE as the mean squared error of the label predictions obtained with the estimated best number of kernel components and the best kernel width:

    \text{Noise} = \frac{1}{N} \sum_{i=1}^{N} \big( (SY)_i - Y_i \big)^2    (3)

In addition, the variance of the negative log-likelihoods over all kernel widths and kernel PCA dimensions has been calculated as a feature, in order to capture the intrinsic noise. The smoothness of the log-likelihood function has also been calculated, as the distance of the function from a smooth surface modelled by a fifth-degree polynomial fitted to the original function scaled between 0 and 1.
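Before turning to the results, here is a minimal numpy sketch of Eqs. (1)-(3), assuming a data matrix X (trials x features) and a label vector Y (e.g. +/-1). Kernel centering, the (γ, d) grid and the leave-one-out selection of the second method are simplified or omitted, and all names are illustrative rather than the authors' implementation.

```python
# Sketch of RDE (cf. [1]): kernel PCA contributions, the negative
# log-likelihood of Eqs. (1)-(2), and the noise of Eq. (3).
# Note: gamma below is the inverse squared width exp(-gamma * ||x - x'||^2);
# the paper parameterizes by the width itself. Kernel centering, as done in
# standard kernel PCA, is omitted for brevity.
import numpy as np

def rbf_kernel(X, gamma):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def rde_neg_log_likelihood(X, Y, gamma, d):
    K = rbf_kernel(X, gamma)
    _, U = np.linalg.eigh(K)           # eigenvectors, ascending eigenvalues
    U = U[:, ::-1]                     # leading components first
    s = U.T @ Y                        # contributions s_i = u_i^T Y
    n = len(Y)
    var1 = np.mean(s[:d] ** 2)         # sigma_1^2, Eq. (2)
    var2 = np.mean(s[d:] ** 2)         # sigma_2^2, Eq. (2)
    return d / n * np.log(var1) + (n - d) / n * np.log(var2)   # Eq. (1)

def rde_noise(X, Y, gamma, d):
    K = rbf_kernel(X, gamma)
    _, U = np.linalg.eigh(K)
    U = U[:, ::-1]
    S = U[:, :d] @ U[:, :d].T          # projection S(d, gamma)
    return np.mean((S @ Y - Y) ** 2)   # Eq. (3)

# Method 1: pick (gamma, d) minimizing the surface, e.g. (assumed grid)
#   gammas = np.logspace(-5, 5, 21)
#   best = min(((g, d) for g in gammas for d in range(1, len(Y) // 2)),
#              key=lambda gd: rde_neg_log_likelihood(X, Y, *gd))
```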

4 Results

4.1 Subject Specific Analysis

In Figure 1, the negative log-likelihood functions calculated as in Equation (1) are shown for three different preprocessing settings (calib-power-allch, calib-power-selch and calib-cspfeat; cf. Table 1). Results from a subject with very good BCI performance (CSP calibration error = .0) are visualized in the top row, while results from a subject with bad BCI performance, probably an illiterate (CSP calibration error = .0), are shown in the bottom row. An evident difference can be seen between the functions of the two subjects, even with the first preprocessing setting, where no subject-dependent selection of frequency band, channels or time interval has been applied.

Looking at the negative log-likelihood functions, it can be hypothesized that the first method described in Section 3.2 will fail in finding the best kernel width and dimensionality, due to the extremely noisy function with many local minima. In fact, the results obtained by taking the minimum of the function turned out not to be robust against small changes in the preprocessing settings, especially for subjects with bad BCI performance. For this reason, the second method has been chosen to robustly estimate the best kernel width and the best dimension. Still, the negative log-likelihood function as shown in Figure 1 is extremely informative regarding the noise present in a data set in feature space, and it is independent of the method chosen for parameter selection. The log-likelihood function for bad subjects is not just much less smooth, its range is also much smaller than for good subjects. For this reason, as described in Section 3.3, the smoothness and the variance of the log-likelihood function have been calculated as additional features indicating the noise in the data set. No significant improvement can be seen with the other preprocessing settings, even with subject-specific parameter selection, as shown in the center column of Figure 1. When RDE is applied to CSP features, which are computed from at most 6 channels, the feature space becomes particularly noise-free and low dimensional, so that the log-likelihood function is very smooth, as shown on the right side of Figure 1. On the contrary, the extension of the surface is still much smaller for BCI subjects with poor performance, so that the variance is indeed a good feature to analyze.

[Figure 1: Negative log-likelihood function for all kernel widths and dimensions. Top: good performing BCI subject (CSP calibration error = .0). Bottom: bad performing BCI subject (CSP calibration error = .0). From left to right, three preprocessing settings: calib-power-allch, calib-power-selch, calib-cspfeat.]

In Figure 2, the negative log-likelihood function for the best kernel width is shown, with the contributions of each kernel PCA component, calculated as in Section 3.2, visualized in the background. Also in this case, a strong difference between the two subjects can be observed. In particular, when less noise is present, the first kernel PCA components are much more informative, so that one can ideally separate the model into two parts as in Equation (2), the first containing the relevant information essential for label prediction and the second containing mainly noise. In a noisy data set like the second one, no structure can be seen in the contributions, since the noise is distributed over all components.

[Figure 2: Negative log-likelihood function for the best kernel width, with the projections of the labels onto the kernel PCA components in the background. Preprocessing setting: calib-power-selch. Left: good BCI subject. Right: bad BCI subject.]
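For illustration, the following sketch computes the two surface features used above: the variance of the negative log-likelihood over the (γ, d) grid and the smoothness feature of Section 3.3. The monomial basis, the residual norm, the unit grid and the sign convention are assumptions, since the paper only specifies a fifth-degree polynomial fit on the [0, 1]-scaled function.

```python
# Sketch of the two surface features: variance of the negative log-likelihood
# over the (gamma, d) grid, and "smoothness" as the residual distance from a
# 5th-degree polynomial surface fitted to the [0, 1]-scaled function. The
# paper's exact fit and sign convention are not specified; details assumed.
import numpy as np

def surface_features(L):
    """L: array (n_widths, n_dims) of negative log-likelihood values."""
    var_logf = L.var()
    Ls = (L - L.min()) / (L.max() - L.min())        # scale to [0, 1]
    g, d = np.meshgrid(np.linspace(0, 1, L.shape[0]),
                       np.linspace(0, 1, L.shape[1]), indexing="ij")
    # design matrix of 2-D monomials g^p * d^q with total degree <= 5
    A = np.column_stack([g.ravel() ** p * d.ravel() ** q
                         for p in range(6) for q in range(6 - p)])
    coef, *_ = np.linalg.lstsq(A, Ls.ravel(), rcond=None)
    residual = Ls.ravel() - A @ coef
    smoothness = np.sqrt(np.mean(residual ** 2))    # distance from smooth fit
    return var_logf, smoothness
```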

4.2 Group Analysis

In order to confirm how much the RDE features correlate with subject performance, we investigated the correlation between the features extracted by RDE with the simplest preprocessing setting, calib-power-allch, and the CSP performance on the same calibration data set, i.e. the CSP offline error. The results are shown in Figure 3: for each subject, in each subplot, one feature is plotted against the CSP offline error. Correlation and significance values are given in the subplot titles. Already with the simplest setting, a strong correlation with subject performance can be observed for all features, and it becomes even stronger for the calib-power-selch setting (not shown for lack of space). Subjects with a CSP offline classification error greater than 0% are represented by circles, while crosses are used for the others; the two groups are divided by a vertical line. In particular, the subjects with the worst performance lie close together and can be grouped as points having the following properties: (1) high noise, (2) small dimensionality, (3) small kernel width, (4) small variance of the negative log-likelihood function, (5) small smoothness of the negative log-likelihood function.

[Figure 3: Correlation between RDE features and CSP offline performance. Setting: calib-power-allch.]

A bigger challenge is to gain additional information about a subject using RDE, and to try to predict his future online performance from the calibration data. For this reason, we investigated the correlation between the RDE features for the calib-power-selch setting and the CSP online error obtained from the feedback data. Results are shown in Figure 4, where the same conventions as in Figure 3 apply. In addition, the correlation between CSP offline error and CSP online error is shown in the last subplot. Even if the correlation between CSP offline error and CSP online error is slightly better than for the RDE features, subjects with poor performance (CSP online error >= 0%) can still be well characterized by (1) high noise, (2) small kernel width, (3) small dimension, without using the CSP algorithm.

[Figure 4: Correlation between RDE features and CSP online performance. Setting: calib-power-selch.]
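The group analysis itself reduces to per-feature Pearson correlations across subjects. A sketch with synthetic placeholder data follows; the feature names and the number of subjects are illustrative, and in the study each array would hold one value per recorded session.

```python
# Sketch of the group analysis of Figures 3 and 4: Pearson correlation between
# each RDE-derived feature and the CSP error across subjects. The data below is
# synthetic placeholder data, not values from the study.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 20                                   # number of subjects; assumed value
features = {
    "noise": rng.random(n),
    "dimension": rng.random(n),
    "kernel_width": rng.random(n),
    "var_logf": rng.random(n),
    "smoothness_logf": rng.random(n),
}
csp_error = rng.random(n)                # CSP offline (or online) error

for name, values in features.items():
    r, p = pearsonr(values, csp_error)
    print(f"{name}: r = {r:.2f}, p = {p:.3g}")
```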

5 Discussion

In contrast to the hypothesis about the high dimensionality of BCI-illiterate data sets, RDE chooses very few kernel components for the feature subspace containing the label-relevant information. This happens because the noise in the data set is so high that the relevant information is distributed over all components, as revealed by the structure of the projections in Figure 2. In fact, the high noise prevents RDE from choosing more components and forces it to choose a small kernel width. As explained in [1], a particularly noise-free data set can, on the contrary, have very high dimensionality and a very large kernel width, exactly the opposite of BCI illiterates. This also means that illiteracy is not due to the non-stationarity present in the data, but rather due to a high intrinsic noise in the label information: the class membership cannot be predicted well from the features over the whole range of possible scales. Finally, some subjects not included in the illiterate group, which exhibit only moderately high noise together with a relatively high dimension and kernel width, would probably benefit from more training examples.

6 Conclusion

This study was motivated by the necessity of finding new features that can predict the BCI performance of a subject, with a focus on early illiteracy detection. For this purpose, the RDE algorithm has been applied to EEG data for the first time. The results show how RDE can be used on labeled data to understand the structure of the information contained in the data. In particular, RDE can be used to easily recognize illiterate subjects. It has been shown that the interaction among the three parameters is valuable for understanding whether a poor BCI classification performance is due to the intrinsic noise present in the data or to a lack of training examples. Finally, the hypothesis of a too high dimensionality of BCI-illiterate data sets has been rejected.

Acknowledgements: this study was supported by the DFG (Deutsche Forschungsgemeinschaft) MU 98/-.

References

[1] M. Braun, J. Buchmann, and K.-R. Müller. Denoising and dimension reduction in feature space. Advances in Neural Information Processing Systems 19 (NIPS 2006), pages 185-192, 2007.

[2] B. Blankertz, G. Curio, and K.-R. Müller. Classifying single trial EEG: Towards brain computer interfacing. Advances in Neural Information Processing Systems 14 (NIPS 2001), pages 157-164, 2002.

[3] B. Blankertz, G. Dornhege, M. Krauledat, K.-R. Müller, and G. Curio. The non-invasive Berlin Brain-Computer Interface: Fast acquisition of effective performance in untrained subjects. NeuroImage, 37(2):539-550, 2007.

[4] H. Ramoser, J. Müller-Gerking, and G. Pfurtscheller. Optimal spatial filtering of single trial EEG during imagined hand movement. IEEE Trans. Rehab. Eng., 8(4):441-446, 2000.

[5] G. Wahba. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, 1990.