Estimating Noise and Dimensionality in BCI Data Sets: Towards Illiteracy Comprehension


Claudia Sannelli, Mikio Braun, Michael Tangermann, Klaus-Robert Müller
Machine Learning Laboratory, Dept. Computer Science, Berlin University of Technology, Berlin, Germany
Intelligent Data Analysis Group, Fraunhofer FIRST, Berlin, Germany
claudia@cs.tu-berlin.de

Abstract

About one third of all BCI subjects cannot communicate via BCI, a phenomenon known as BCI illiteracy. New methods aiming at an early prediction of illiteracy would be very helpful for understanding this phenomenon and for sparing many subjects a hard BCI training. In this paper, the first application to electroencephalogram (EEG) data of a newly developed machine learning tool, Relevant Dimension Estimation (RDE), is presented. By detecting the label-relevant information present in a data set, RDE estimates the intrinsic noise and the complexity of the learning problem. Applied to EEG data collected during motor imagery paradigms, RDE delivers interesting insights into the illiteracy phenomenon. In particular, RDE demonstrates that illiteracy is mostly not due to the non-stationarity or high dimensionality present in the data, but rather due to a high intrinsic noise in the label-related information. Moreover, this paper shows how to detect individual BCI-illiterate subjects in a very reliable way, based on a combination of several features extracted by RDE.

1 Introduction

Rehabilitation and communication for amyotrophic lateral sclerosis (ALS) patients are the most important motivations and long-term goals for Brain Computer Interfaces (BCI), a research area which has enjoyed growing interest in the last decade. In contrast, most BCI studies are performed on healthy subjects and concentrate on improving existing algorithms for the classification of mental states from the electroencephalogram (EEG). Still, about one third of BCI users are not able to communicate with the machine. Even a healthy subject can become very frustrated during an experiment upon realizing that he is a so-called BCI illiterate, and very few patients are willing to experience this situation. BCI medical applications would find larger acceptance if the ratio of BCI-illiterate users could be reduced to a very small percentage. A robust prediction of BCI illiteracy would also help to avoid false hopes and to reduce the effort needed to train a patient for communication by BCI. For this purpose, new methods for EEG data set exploration and new features describing EEG data sets are needed that can serve as predictors of BCI illiteracy.

Relevant Dimension Estimation (RDE) is an algorithm proposed in [1] which uses kernel PCA (Principal Component Analysis) in feature space together with label information in order to assess the actual class-related information contained in a data set. In particular, RDE estimates two quantities: (1) the dimension of the subspace in kernel space containing the relevant information, and (2) the noise contained in the labels. Both numbers measure the interaction between the data set and a chosen kernel and, in particular, give an accurate picture of the complexity of the learning problem and of the amount of noise it contains.

In this study, a first application of RDE to EEG data is presented. Using Gaussian kernels of different widths, the dimensionality of the data set and the amount of noise are estimated at different scales. Our hypothesis is that a data set recorded from an illiterate subject is intrinsically high dimensional and therefore not well classifiable with features generated by the Common Spatial Patterns (CSP) method. To test this hypothesis, the features extracted by RDE are compared with the CSP features in terms of classification performance.

2 Experimental setup

A data set of 8 BCI sessions from 0 healthy subjects has been investigated. The data was recorded with the Berlin BCI (BBCI) during classical motor imagery BCI experiments (see e.g. [2, 3]). In the calibration session, the subjects were asked to perform 00-00 trials of motor imagery for the left or right hand and for the foot. Two classes were then chosen for the feedback session, depending on the offline classification performance of a linear classifier trained on CSP features [4]. In the feedback session, targets and feedback of the classifier output were presented visually.

3 Methods

3.1 Preprocessing

Within this study, several preprocessing parameter settings have been used. The preprocessing steps for each setting are as follows: (1) low-pass filtering at 00 Hz, (2) cutting the continuous EEG into epochs in a specific time interval after stimulus presentation, (3) optional channel selection, (4) rejecting bad trials and channels by variance-based artifact rejection, (5) selecting the trials belonging to the two classes already chosen for the online feedback, (6) filtering in a setting-specific frequency band and (7) calculating band power. The settings differ in steps 2, 3, 6 and 7; see the overview in Table 1.

Setting name              Band [Hz]   Time [ms]   Channels   Feature
calib-power-all           0.-         0-000       all        band power
calib-power-sel           sel         sel         sel        band power
calib-power-cench         0.-         0-000       C*         band power
calib-power-cench-alpha   8-          0-000       C*         band power
calib-csp-feat            sel         sel         sel        CSP features

Table 1: Preprocessing parameter settings.

In the Channels column, "all" means that all channels remaining after step 4 are used for calculating the features, i.e. step 3 was skipped. In contrast, "sel" means that a further channel selection has been applied. In particular, the channel subset was determined by a heuristic applied on the day of the experiment in order to maximize CSP performance. The same convention for "sel" holds for the Band and Time columns. Finally, "C*" means that all central channels (according to the 10-20 EEG system) were used.
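For concreteness, the following is a minimal sketch of steps (6) and (7) for one setting, assuming the epoched data is already available as a numpy array. The function name, the filter order and the 100 Hz sampling rate in the usage line are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of preprocessing steps (6) band-pass filtering and
# (7) band-power computation. Variable names and parameter values
# are illustrative assumptions.
import numpy as np
from scipy.signal import butter, filtfilt

def band_power_features(epochs, fs, band):
    """epochs: array (n_trials, n_channels, n_samples); band: (low, high) in Hz.
    Returns log band power per trial and channel, shape (n_trials, n_channels)."""
    low, high = band
    b, a = butter(5, [low / (fs / 2), high / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, epochs, axis=-1)   # step (6): setting-specific band
    power = np.var(filtered, axis=-1)            # step (7): band power as variance
    return np.log(power)                         # log scaling, common for EEG

# usage (assumed values): X = band_power_features(epochs, fs=100, band=(8, 15))
```

The variance of a band-pass filtered signal is the standard band-power estimate in motor imagery pipelines, which is why step (7) reduces to a per-trial variance here.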

3.2 Relevant Dimension Estimation

RDE has been applied to each data set, using a Gaussian RBF (radial basis function) kernel. Two parameters had to be selected: the kernel width and the dimension, i.e. the number of leading kernel PCA components. The range for the kernel width γ was between 0 and 0, and the range for the dimension d was [1, N/2], where N is the number of available trials. For each kernel width γ, the kernel matrix K(γ) and its sorted eigenvectors E(γ) have been calculated. In order to estimate the kernel width and the dimension of each data set, both methods described in [1] have been used. The first method finds the kernel width γ and the dimension d which minimize the negative log-likelihood function L(γ, d), defined as

    L(\gamma, d) = \frac{d}{n} \log \sigma_1^2 + \frac{n-d}{n} \log \sigma_2^2    (1)

with

    \sigma_1^2 = \frac{1}{d} \sum_{i=1}^{d} s_i^2 \quad\text{and}\quad \sigma_2^2 = \frac{1}{n-d} \sum_{i=d+1}^{n} s_i^2.    (2)

Here, s_i = u_i^T Y are the contributions of the kernel PCA components to the labels, and the u_i are the eigenvectors in E(γ). In the second method, the label predictions are calculated for each parameter combination by projecting the labels onto the leading kernel PCA components, using S(d, \gamma) = \sum_{i=1}^{d} u_i u_i^T. The best kernel width γ and the best dimension d are then chosen by minimizing the leave-one-out cross-validation error as computed in [5].

3.3 Noise Estimation

The noise present in a data set is calculated by RDE as the mean squared error of the label predictions obtained with the estimated best number of kernel components and the best kernel width:

    \text{Noise} = \frac{1}{N} \sum_{i=1}^{N} \big( (SY)_i - Y_i \big)^2    (3)

In addition, the variance of the negative log-likelihoods over all kernel widths and kernel PCA dimensions has been calculated as a feature, in order to capture the intrinsic noise. The smoothness of the log-likelihood function has also been calculated, as the distance of the function from a smooth surface modelled by a fifth-degree polynomial fitted to the original function scaled between 0 and 1.
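Before turning to the results, here is a minimal numpy sketch of Eqs. (1)-(3), assuming a data matrix X (trials x features) and a label vector Y (e.g. +/-1). Kernel centering, the (γ, d) grid and the leave-one-out selection of the second method are simplified or omitted, and all names are illustrative rather than the authors' implementation.

```python
# Sketch of RDE (cf. [1]): kernel PCA contributions, the negative
# log-likelihood of Eqs. (1)-(2), and the noise of Eq. (3).
# Note: gamma below is the inverse squared width exp(-gamma * ||x - x'||^2);
# the paper parameterizes by the width itself. Kernel centering, as done in
# standard kernel PCA, is omitted for brevity.
import numpy as np

def rbf_kernel(X, gamma):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def rde_neg_log_likelihood(X, Y, gamma, d):
    K = rbf_kernel(X, gamma)
    _, U = np.linalg.eigh(K)           # eigenvectors, ascending eigenvalues
    U = U[:, ::-1]                     # leading components first
    s = U.T @ Y                        # contributions s_i = u_i^T Y
    n = len(Y)
    var1 = np.mean(s[:d] ** 2)         # sigma_1^2, Eq. (2)
    var2 = np.mean(s[d:] ** 2)         # sigma_2^2, Eq. (2)
    return d / n * np.log(var1) + (n - d) / n * np.log(var2)   # Eq. (1)

def rde_noise(X, Y, gamma, d):
    K = rbf_kernel(X, gamma)
    _, U = np.linalg.eigh(K)
    U = U[:, ::-1]
    S = U[:, :d] @ U[:, :d].T          # projection S(d, gamma)
    return np.mean((S @ Y - Y) ** 2)   # Eq. (3)

# Method 1: pick (gamma, d) minimizing the surface, e.g. (assumed grid)
#   gammas = np.logspace(-5, 5, 21)
#   best = min(((g, d) for g in gammas for d in range(1, len(Y) // 2)),
#              key=lambda gd: rde_neg_log_likelihood(X, Y, *gd))
```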

4 Results

4.1 Subject Specific Analysis

In Figure 1, the negative log-likelihood functions calculated as in Equation (1) are shown for three different preprocessing settings (calib-power-allch, calib-power-selch and calib-cspfeat; cf. Table 1). Results from a subject with very good BCI performance (CSP calibration error = .0) are visualized in the top row, while results from a subject with bad BCI performance, probably an illiterate (CSP calibration error = .0), are shown in the bottom row. An evident difference can be seen between the functions of the two subjects, even with the first preprocessing setting, where no subject-dependent selection of frequency band, channels or time interval has been applied.

Looking at the negative log-likelihood functions, it can be hypothesized that the first method described in Section 3.2 will fail in finding the best kernel width and dimensionality, due to the extremely noisy function with many local minima. In fact, the results obtained by taking the minimum of the function turned out not to be robust against small changes in the preprocessing settings, especially for subjects with bad BCI performance. For this reason, the second method has been chosen to robustly estimate the best kernel width and the best dimension. Still, the negative log-likelihood function as shown in Figure 1 is extremely informative regarding the noise present in a data set in feature space, and it is independent of the method chosen for parameter selection. The log-likelihood function for bad subjects is not just much less smooth, its range is also much smaller than for good subjects. For this reason, as described in Section 3.3, the smoothness and the variance of the log-likelihood function have been calculated as additional features indicating the noise in the data set. No significant improvement can be seen with the other preprocessing settings, even with subject-specific parameter selection, as shown in the center column of Figure 1. When RDE is applied to CSP features, which are computed from at most 6 channels, the feature space becomes particularly noise-free and low dimensional, so that the log-likelihood function is very smooth, as shown on the right side of Figure 1. On the contrary, the extension of the surface is still much smaller for BCI subjects with poor performance, so that the variance is indeed a good feature to analyze.

[Figure 1: Negative log-likelihood function for all kernel widths and dimensions. Top: good performing BCI subject (CSP calibration error = .0). Bottom: bad performing BCI subject (CSP calibration error = .0). From left to right, three preprocessing settings: calib-power-allch, calib-power-selch, calib-cspfeat.]

In Figure 2, the negative log-likelihood function for the best kernel width is shown, with the contributions of each kernel PCA component, calculated as in Section 3.2, visualized in the background. Also in this case, a strong difference between the two subjects can be observed. In particular, when less noise is present, the first kernel PCA components are much more informative, so that one can ideally separate the model into two parts as in Equation (2), the first containing the relevant information essential for label prediction and the second containing mainly noise. In a noisy data set like the second one, no structure can be seen in the contributions, since the noise is distributed over all components.

[Figure 2: Negative log-likelihood function for the best kernel width, with the projections of the labels onto the kernel PCA components in the background. Preprocessing setting: calib-power-selch. Left: good BCI subject. Right: bad BCI subject.]
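For illustration, the following sketch computes the two surface features used above: the variance of the negative log-likelihood over the (γ, d) grid and the smoothness feature of Section 3.3. The monomial basis, the residual norm, the unit grid and the sign convention are assumptions, since the paper only specifies a fifth-degree polynomial fit on the [0, 1]-scaled function.

```python
# Sketch of the two surface features: variance of the negative log-likelihood
# over the (gamma, d) grid, and "smoothness" as the residual distance from a
# 5th-degree polynomial surface fitted to the [0, 1]-scaled function. The
# paper's exact fit and sign convention are not specified; details assumed.
import numpy as np

def surface_features(L):
    """L: array (n_widths, n_dims) of negative log-likelihood values."""
    var_logf = L.var()
    Ls = (L - L.min()) / (L.max() - L.min())        # scale to [0, 1]
    g, d = np.meshgrid(np.linspace(0, 1, L.shape[0]),
                       np.linspace(0, 1, L.shape[1]), indexing="ij")
    # design matrix of 2-D monomials g^p * d^q with total degree <= 5
    A = np.column_stack([g.ravel() ** p * d.ravel() ** q
                         for p in range(6) for q in range(6 - p)])
    coef, *_ = np.linalg.lstsq(A, Ls.ravel(), rcond=None)
    residual = Ls.ravel() - A @ coef
    smoothness = np.sqrt(np.mean(residual ** 2))    # distance from smooth fit
    return var_logf, smoothness
```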

4.2 Group Analysis

In order to confirm how much the RDE features correlate with subject performance, we investigated the correlation between the features extracted by RDE with the simplest preprocessing setting, calib-power-allch, and the CSP performance on the same calibration data set, i.e. the CSP offline error. The results are shown in Figure 3: for each subject, in each subplot, one feature is plotted against the CSP offline error. Correlation and significance values are given in the subplot titles. Already with the simplest setting, a strong correlation with subject performance can be observed for all features, and it becomes even stronger for the calib-power-selch setting (not shown for lack of space). Subjects with a CSP offline classification error greater than 0% are represented by circles, while crosses are used for the others; the two groups are divided by a vertical line. In particular, the subjects with the worst performance lie close together and can be grouped as points having the following properties: (1) high noise, (2) small dimensionality, (3) small kernel width, (4) small variance of the negative log-likelihood function, (5) small smoothness of the negative log-likelihood function.

[Figure 3: Correlation between RDE features and CSP offline performance. Setting: calib-power-allch.]

A bigger challenge is to gain additional information about a subject using RDE, and to try to predict his future online performance from the calibration data. For this reason, we investigated the correlation between the RDE features for the calib-power-selch setting and the CSP online error obtained from the feedback data. Results are shown in Figure 4, where the same conventions as in Figure 3 apply. In addition, the correlation between CSP offline error and CSP online error is shown in the last subplot. Even if the correlation between CSP offline error and CSP online error is slightly better than for the RDE features, subjects with poor performance (CSP online error >= 0%) can still be well characterized by (1) high noise, (2) small kernel width, (3) small dimension, without using the CSP algorithm.

[Figure 4: Correlation between RDE features and CSP online performance. Setting: calib-power-selch.]
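The group analysis itself reduces to per-feature Pearson correlations across subjects. A sketch with synthetic placeholder data follows; the feature names and the number of subjects are illustrative, and in the study each array would hold one value per recorded session.

```python
# Sketch of the group analysis of Figures 3 and 4: Pearson correlation between
# each RDE-derived feature and the CSP error across subjects. The data below is
# synthetic placeholder data, not values from the study.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 20                                   # number of subjects; assumed value
features = {
    "noise": rng.random(n),
    "dimension": rng.random(n),
    "kernel_width": rng.random(n),
    "var_logf": rng.random(n),
    "smoothness_logf": rng.random(n),
}
csp_error = rng.random(n)                # CSP offline (or online) error

for name, values in features.items():
    r, p = pearsonr(values, csp_error)
    print(f"{name}: r = {r:.2f}, p = {p:.3g}")
```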

5 Discussion

In contrast to the hypothesis about the high dimensionality of BCI-illiterate data sets, RDE chooses very few kernel components for the feature subspace containing the label-relevant information. This happens because the noise in the data set is so high that the relevant information is distributed over all components, as revealed by the structure of the projections in Figure 2. In fact, the high noise prevents RDE from choosing more components and forces it to choose a small kernel width. As explained in [1], a particularly noise-free data set can, on the contrary, have very high dimensionality and a very large kernel width, exactly the opposite of BCI illiterates. This also means that illiteracy is not due to the non-stationarity present in the data, but rather due to a high intrinsic noise in the label information: the class membership cannot be predicted well from the features over the whole range of possible scales. Finally, some subjects not included in the illiterate group, which exhibit only moderately high noise together with a relatively high dimension and kernel width, would probably benefit from more training examples.

6 Conclusion

This study was motivated by the necessity of finding new features that can predict the BCI performance of a subject, with a focus on early illiteracy detection. For this purpose, the RDE algorithm has been applied to EEG data for the first time. The results show how RDE can be used on labeled data to understand the structure of the information contained in the data. In particular, RDE can be used to easily recognize illiterate subjects. It has been shown that the interaction among the three parameters is valuable for understanding whether a poor BCI classification performance is due to the intrinsic noise present in the data or to a lack of training examples. Finally, the hypothesis of a too high dimensionality of BCI-illiterate data sets has been rejected.

Acknowledgements: this study was supported by the DFG (Deutsche Forschungsgemeinschaft) MU 98/-.

References

[1] M. Braun, J. Buchmann, and K.-R. Müller. Denoising and dimension reduction in feature space. Advances in Neural Information Processing Systems 19 (NIPS 2006), pages 185-192, 2007.

[2] B. Blankertz, G. Curio, and K.-R. Müller. Classifying single trial EEG: Towards brain computer interfacing. Advances in Neural Information Processing Systems 14 (NIPS 2001), pages 157-164, 2002.

[3] B. Blankertz, G. Dornhege, M. Krauledat, K.-R. Müller, and G. Curio. The non-invasive Berlin Brain-Computer Interface: Fast acquisition of effective performance in untrained subjects. NeuroImage, 37(2):539-550, 2007.

[4] H. Ramoser, J. Müller-Gerking, and G. Pfurtscheller. Optimal spatial filtering of single trial EEG during imagined hand movement. IEEE Trans. Rehab. Eng., 8(4):441-446, 2000.

[5] G. Wahba. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, 1990.