Stay Alert! The Ford Challenge
Louis Fourrier, Fabien Gaie, Thomas Rolf
CS 229: Machine Learning, Fall 2012, Final report

1. Problem description

a. Goal

Our final project is a recent Kaggle competition, submitted by Ford, whose goal is to design a binary classifier that detects whether a driver is alert or not. Driving while distracted, fatigued, or drowsy may lead to accidents, so solving this problem may eventually save lives and is thus a great application of machine learning.

b. Dataset overview

Ford provided a combination of vehicular, environmental, and driver physiological data acquired in real time while driving. The features carry no names indicating what they represent, which adds to the difficulty of the problem since we cannot rely on physical intuition. The dataset consists of a number of "trials", each representing about 2 minutes of sequential data recorded every 100 ms during a driving session. The trials are sampled from some 500 drivers, and 30 different features are recorded, so the overall dataset is a 600,000 x 30 matrix with the associated labels.

2. Methodology

a. Data analysis

We started by plotting the probability density function of each feature in both the alert and non-alert cases to quickly get a grasp of its impact and of the correlations involved. We completed this first analysis with some useful metrics:
- the mean and variance of each feature distribution;
- a rough estimate of how far each distribution is from a Gaussian (based on kurtosis and skewness);
- the mutual information between each feature and the output labels.

To run our algorithms, we split the data into a training set (~75%) and a test set (~25%). It is worth noting that the provided data are unbalanced: there are more positive labels than negative ones. This is important because it often affects the behavior of learning algorithms.
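The authors computed these diagnostics in Matlab; purely as an illustrative sketch (not their code), the same metrics could be obtained in Python roughly as follows, with toy stand-in data in place of the Ford dataset:

    import numpy as np
    from scipy.stats import kurtosis, skew
    from sklearn.feature_selection import mutual_info_classif

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 30))    # toy stand-in, not the 600,000 x 30 Ford data
    y = rng.integers(0, 2, size=1000)  # toy alert / non-alert labels

    # Mean and variance of each feature distribution.
    means, variances = X.mean(axis=0), X.var(axis=0)

    # Rough distance from a Gaussian: excess kurtosis and skewness are both
    # zero for a Gaussian, so their magnitudes score non-Gaussianity.
    non_gaussianity = np.abs(kurtosis(X, axis=0)) + np.abs(skew(X, axis=0))

    # Mutual information between each feature and the labels.
    mi = mutual_info_classif(X, y, random_state=0)
    print(np.argsort(mi)[::-1][:5])    # five most informative feature indices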

b. Generation of new features

Need for new features: our first attempts relied on the 30 features of the original dataset, and our algorithms performed very poorly. We therefore decided to generate new features, in particular for algorithms such as Naive Bayes and logistic regression, which do not accommodate the kernel trick easily.

Method to generate new features: we combined existing features into new ones and applied mathematical operations to them (a sketch appears at the end of this subsection):
- added the inverse, the square, and the cube of every feature (1/x_i, x_i^2, x_i^3);
- added all pairwise combinations of two columns;
- generated time-interval variables;
- suppressed 3 original features that proved useless since they remain constant on every row.

This resulted in a total of approximately 600 features from the initial 30-feature dataset, and it enabled us to capture more patterns when we ran our algorithms on this enhanced feature set.

Shuffling our data or not: we initially considered randomly shuffling our dataset. However, this proved dangerous for our models, because it placed data from the same driver in both the training and test sets, artificially inflating the accuracy and AUC; the time sequencing of the data was also lost. We finally decided to first split the drivers into a training set and a test set, then compute the time-varying features, and only then shuffle the data before running our algorithms.
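As an illustration of this driver-level split (a hedged sketch, not the authors' Matlab pipeline; driver_ids is a hypothetical per-row driver identifier):

    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 30))              # toy stand-in data
    y = rng.integers(0, 2, size=1000)
    driver_ids = rng.integers(0, 50, size=1000)  # hypothetical per-row driver id

    # Hold out whole drivers so no driver appears in both sets (~75% / ~25%).
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
    train_idx, test_idx = next(splitter.split(X, y, groups=driver_ids))

    # Time-varying features would be computed within each split here, and
    # only then is the training set shuffled row-wise.
    train_idx = rng.permutation(train_idx)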

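The exact pairwise operation did not survive this transcription; the sketch below assumes pairwise products, which is one common choice, and is again illustrative rather than the authors' implementation:

    import numpy as np
    from itertools import combinations

    def expand_features(X, eps=1e-8):
        """Inverse, square, and cube of each feature plus all pairwise
        products (the pairwise operation is an assumption; eps guards
        against division by zero and is not in the report)."""
        blocks = [X, 1.0 / (X + eps), X ** 2, X ** 3]
        pairs = [X[:, i] * X[:, j]
                 for i, j in combinations(range(X.shape[1]), 2)]
        return np.column_stack(blocks + pairs)

    # 30 columns -> 4*30 + C(30, 2) = 120 + 435 = 555 columns, on the order
    # of the ~600 reported once time-interval features are added and the
    # 3 constant columns are dropped.
    X_big = expand_features(np.random.default_rng(0).normal(size=(100, 30)))
    print(X_big.shape)  # (100, 555)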
c. Evaluation criteria

To assess our algorithms we mainly used two evaluation criteria:
- the accuracy of the algorithm (equivalently, the training and testing errors);
- the Area Under the Curve (AUC), computed as the area under the ROC (receiver operating characteristic) curve. This criterion is insensitive to unbalanced datasets and does not require fixing a specific threshold for the decision boundary. It is closely related to the true positive rate (TPR) and false positive rate (FPR).

Figure 1: ROC curve obtained by logistic regression. The Area Under the Curve is 0.776.

Throughout the project we ran our algorithms with both evaluation criteria. We mainly report the accuracy, since it gave satisfying overall results and allowed us to compare the performance of all our algorithms. We kept track of the AUC as a control metric, and both the accuracy and the AUC improved as our algorithms got better.

3. Feature selection

Given the initial results, and after generating new features, we had to filter out the best features to reduce noise and implement more efficient algorithms.

a. Mutual information

Our first method for selecting features was to apply mutual information to the initial features. The five best features were all vehicular or environmental: we were surprised that none of the physiological features was selected, so we decided to implement forward searches on different algorithms.

b. Forward search

We performed two different forward search procedures, one maximizing each of our evaluation criteria, the accuracy and the AUC. The selected features varied with both the algorithm and the evaluation criterion.

Figure 2: Five best features obtained by forward search for the Naive Bayes, logistic regression, and SVM algorithms, under each set of allowed features (table).

Since the best features selected by the forward search varied with the algorithm, there is no single ultimate feature set. Some features already identified by mutual information appear several times in the table, and checking their distributions confirms that they do help separate the data into the two labels.

Figure 3: Distribution of one selected feature in both the alert and non-alert cases.

4. Learning algorithms

a. Naive Bayes

We implemented Naive Bayes with two different underlying models, in both cases assuming the features conditionally independent given the alertness of the driver.

Multinomial Naive Bayes with Laplace smoothing: we discretized all input data by dividing each feature into a predetermined number of buckets before computing the multinomial probabilities with Laplace smoothing. The number of buckets had a large impact on the performance of the algorithm, with a maximum around 150 buckets; too many buckets proved inefficient, since the distribution could no longer be captured properly and the probabilities became essentially random.

Naive Bayes with Gaussian distribution fitting: we modeled each feature distribution as a Gaussian. This is a strong assumption for some features, but it gave surprisingly good results thanks to the forward search procedure; these good results can be explained by the fact that some of the feature distributions are close to Gaussian.

b. Logistic regression

To begin with, we implemented logistic regression with both batch and stochastic gradient descent. While batch gradient descent always converged, the stochastic variant sometimes did not. We then leveraged Matlab's optimized logistic regression for speed, after ensuring that its results were consistent with those of our own implementation.

The logistic regression model also let us compute the AUC easily: since it outputs the probability of the label being 1 or 0, we could sample the ROC curve by varying the decision threshold. We used this method to optimize the AUC in one of our forward search experiments.
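As an illustrative sketch of this approach (not the authors' Matlab code), batch gradient ascent for logistic regression plus a threshold sweep over the predicted probabilities might look like:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_regression_batch_gd(X, y, lr=0.1, n_iters=500):
        """Batch gradient ascent on the log-likelihood (bias folded into X)."""
        Xb = np.column_stack([np.ones(len(X)), X])
        theta = np.zeros(Xb.shape[1])
        for _ in range(n_iters):
            theta += lr * Xb.T @ (y - sigmoid(Xb @ theta)) / len(y)
        return theta

    def roc_by_threshold_sweep(probs, y, n_thresholds=100):
        """Sample (FPR, TPR) points by sweeping the decision threshold."""
        pts = [(np.mean(probs[y == 0] >= t), np.mean(probs[y == 1] >= t))
               for t in np.linspace(1.0, 0.0, n_thresholds)]
        return np.array(pts)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    y = rng.integers(0, 2, size=500)
    theta = logistic_regression_batch_gd(X, y)
    probs = sigmoid(np.column_stack([np.ones(len(X)), X]) @ theta)
    curve = roc_by_threshold_sweep(probs, y)
    print(np.trapz(curve[:, 1], curve[:, 0]))  # AUC by the trapezoidal rule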

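Returning to the multinomial Naive Bayes of section 4a, here is a minimal sketch of the bucketing-plus-Laplace-smoothing scheme; quantile bucket edges are an assumption, since the report does not say how the buckets were chosen:

    import numpy as np

    def fit_bucketed_nb(X, y, n_buckets=150, alpha=1.0):
        """Discretize each feature into buckets, then fit multinomial NB
        class-conditional probabilities with Laplace smoothing."""
        edges = [np.quantile(X[:, j], np.linspace(0, 1, n_buckets + 1)[1:-1])
                 for j in range(X.shape[1])]
        B = np.column_stack([np.digitize(X[:, j], edges[j])
                             for j in range(X.shape[1])])
        log_prior = np.log(np.bincount(y, minlength=2) / len(y))
        log_lik = []
        for c in (0, 1):
            counts = np.stack([np.bincount(B[y == c, j], minlength=n_buckets)
                               for j in range(X.shape[1])])
            log_lik.append(np.log((counts + alpha) /
                                  (counts.sum(1, keepdims=True) + alpha * n_buckets)))
        return edges, log_prior, log_lik

    def predict_bucketed_nb(X, edges, log_prior, log_lik):
        B = np.column_stack([np.digitize(X[:, j], edges[j])
                             for j in range(X.shape[1])])
        cols = np.arange(X.shape[1])
        scores = np.stack([log_prior[c] + log_lik[c][cols, B].sum(axis=1)
                           for c in (0, 1)], axis=1)
        return scores.argmax(axis=1)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5)); y = rng.integers(0, 2, size=1000)
    preds = predict_bucketed_nb(X, *fit_bucketed_nb(X, y, n_buckets=150))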
c. SVM

We implemented SVMs with both linear and nonlinear kernels, using the liblinear and libsvm libraries for Matlab. First, the SVM with a linear kernel applied to the 30 initial features gave a coherent accuracy of 64%. In order to work in a higher-dimensional space, we implemented an SVM with a Gaussian kernel, but it did not converge in reasonable time; the linear kernel on the enhanced dataset with our additional features hit the same issue. We therefore used forward search to limit the number of features and keep the algorithm from failing to converge. It turned out that choosing more than 10 features did not bring significant improvement. We maximized the accuracy while varying the SVM parameter C between 0.1 and 100, noticed that C had little impact on the accuracy, and finally set C = 1.

Figure 4: Influence of the SVM parameter C on the error.
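The authors called liblinear/libsvm from Matlab; as a rough Python analogue (an illustrative sketch on toy data, not their setup; scikit-learn's LinearSVC wraps liblinear), the C sweep could look like:

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    X_train, y_train = rng.normal(size=(800, 10)), rng.integers(0, 2, size=800)
    X_test, y_test = rng.normal(size=(200, 10)), rng.integers(0, 2, size=200)

    # Sweep C over the range explored in the report (0.1 to 100).
    for C in (0.1, 1.0, 10.0, 100.0):
        clf = LinearSVC(C=C, max_iter=10000).fit(X_train, y_train)
        acc = accuracy_score(y_test, clf.predict(X_test))
        print(f"C={C}: test accuracy {acc:.3f}")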

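Since the results below are dominated by forward-selected feature sets, here is a minimal sketch of the greedy forward search of section 3b, shown wrapping scikit-learn's Gaussian Naive Bayes and maximizing accuracy; the authors ran the analogous loop in Matlab for each algorithm and each criterion:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score

    def forward_search(X_tr, y_tr, X_te, y_te, max_feats=10):
        """Greedily add the feature that most improves test accuracy."""
        selected, remaining = [], list(range(X_tr.shape[1]))
        while remaining and len(selected) < max_feats:
            scores = [accuracy_score(
                          y_te,
                          GaussianNB().fit(X_tr[:, selected + [j]], y_tr)
                                      .predict(X_te[:, selected + [j]]))
                      for j in remaining]
            selected.append(remaining.pop(int(np.argmax(scores))))
        return selected

    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 20)); y = rng.integers(0, 2, size=400)
    print(forward_search(X[:300], y[:300], X[300:], y[300:]))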
d. Results

Results with the basic 30 features: our first step was to run our three algorithms on the initial dataset of 30 features.

Figure 5: Performance (accuracy, TPR, FPR, AUC) of multinomial Naive Bayes, Gaussian Naive Bayes, logistic regression, and the SVM on the initial dataset (table).

These insufficient results motivated both the generation of additional features, to capture better correlations, and the selection of the best features from the enhanced dataset.

Results with the mutual information features:

Figure 6: Performance (accuracy and, depending on the algorithm, AUC) of the algorithms applied to the five best features from mutual information (table).

These performances were a good start given the small number of features, but they encouraged us to implement forward search.

Results with the best features selected by forward search: the forward search procedure run with each algorithm gave us our best results, using the features described in the previous section.

Figure 7: Accuracy (in %) against the number of features during the forward search (plot), and final performance (accuracy, TPR, FPR, AUC) of Naive Bayes, logistic regression, and the SVM with a linear kernel (table).

Of the algorithms we implemented, Naive Bayes with Gaussian distribution fitting is clearly the most accurate; the multinomial Naive Bayes and the SVM perform worse. The unsatisfying SVM result is due to the restriction to a linear kernel. From an initial dataset of 30 features we created around 600, then selected only about ten of them while significantly increasing the accuracy: a few well-chosen features therefore provide good predictive power.

5. Conclusion

The best features strongly depend on the algorithm used to predict the alertness of the driver. The selected set is always a combination of vehicular, physiological, and environmental features, which seems reasonable for capturing the maximum information about the driving situation. Our best results were obtained with a Naive Bayes algorithm and such a feature set. To further improve these results, we are considering a random forest algorithm.
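As a closing illustration of that future direction (a hypothetical sketch on toy data, not an experiment from the report), a random forest could be tried along these lines on the forward-selected features:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X_train, y_train = rng.normal(size=(800, 10)), rng.integers(0, 2, size=800)
    X_test, y_test = rng.normal(size=(200, 10)), rng.integers(0, 2, size=200)

    # Toy stand-in columns for the ~10 forward-selected features.
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(X_train, y_train)
    print(f"test accuracy: {rf.score(X_test, y_test):.3f}")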