CS 229: Machine Learning Final Report: Identifying Driving Behavior from Data

Robert F. Karol. Project Suggester: Danny Goodman from MetroMile. December 13th, 2013.

Problem Description

For my project, I am looking into how driving behavior influences the fuel efficiency of different vehicles. Danny Goodman, of MetroMile, has provided me with a dataset of accelerometer, heading, speed, and gas usage information, which I am using to determine how efficient a given driving style is. The goal of this project is to identify techniques people could use to save fuel, which has become a significant cost in the lives of many Americans.

Data Parsing

Initial work was needed to parse the data into a usable format in MATLAB. Since each set of data came in a different file, each file had to be parsed, imported into MATLAB, and manipulated into an easy-to-access form. In particular, each file contains months' worth of trips, and each trip needs to be separated out in order to determine when a car stopped and restarted. To do this, I looked through the time stamps of the data points; whenever there was a gap greater than 6 seconds between data entries, which should arrive at a rate of one per second, I recorded it as a new trip. Then, any trip lasting less than a minute was removed: because gas consumption is reported as an instantaneous quantity, short trips can introduce too much variability, since coasting can create an artificially high efficiency. All of the data came from small units attached to the on-board car computer, which is where the measurements of gas consumption come from. Since every car is different, the measurements coming from each vehicle have slightly different noise characteristics.
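The author's MATLAB parsing scripts are not included with the report; as a rough Python sketch of the gap-based trip segmentation described above (the 6-second gap and one-minute minimum come from the text; the function and variable names are my own):

```python
import numpy as np

def split_trips(timestamps, gap_threshold=6.0, min_duration=60.0):
    """Split a stream of roughly once-per-second samples into trips.

    A new trip starts whenever consecutive timestamps are more than
    `gap_threshold` seconds apart; trips shorter than `min_duration`
    seconds are discarded, since coasting on a short trip can make the
    instantaneous-consumption average look artificially efficient.
    """
    timestamps = np.asarray(timestamps, dtype=float)
    # Sample indices where the gap to the previous sample exceeds the threshold.
    breaks = np.where(np.diff(timestamps) > gap_threshold)[0] + 1
    trips = np.split(np.arange(len(timestamps)), breaks)
    # Keep only trips that span at least `min_duration` seconds.
    return [t for t in trips if timestamps[t[-1]] - timestamps[t[0]] >= min_duration]

# Example: 1 Hz samples with a ~100 s gap splitting two trips,
# the second of which is too short to keep.
ts = list(range(0, 120)) + list(range(220, 240))
trips = split_trips(ts)
```

Each returned trip is an index array into the original sample stream, so the same split can be applied to the speed, heading, and gas channels.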
The unit has a built-in accelerometer and magnetometer, so while the data might be noisy, or off by a scale factor with a bias, the data from each vehicle should be consistent. In total, there is data from 8 different cars, each of which collected data over a period of months on a number of different trips. The data collected does have a few major problems which need to be addressed. One of the biggest is that the accelerometer is misaligned in every vehicle. This means that the data must be transformed from the reference frame of the accelerometer to the reference frame of the car before it can be used. Additionally, the data was collected at a rate of 1 Hz. While this is not a problem for the heading data, it makes the accelerometer data extremely noisy. This might skew the final values calculated for some quantities of interest; however, it shouldn't be a problem for spotting the important fuel consumption trends. A second problem, for which there is no correction, is the presence of multiple drivers in a single vehicle. While we are able to separate out the data obtained from each vehicle, there is no information available about who is driving, or what the road conditions happen to be. This could also become quite a serious issue if different fuels are being used. One example would be the different energy densities of the summer and winter

fuel mixtures that exist in California.

I wrote a series of MATLAB scripts which import the data and put it in a more useful form. First, I wrote a script which parsed through the csv files and saved all of the data into a .mat file as given in the dataset. This way, the data is much more accessible, since it doesn't have to be re-imported for every use. A second script was written to take the instantaneous data and integrate it over time. This way I was able to get estimates of the total gas consumption and distance traveled for each trip the car took. Using this data I was able to work out the average fuel consumption for a given trip with less error than is achieved by averaging the raw data. Another function was written to plot the path of the car by combining the heading information with the instantaneous speed. This helps show how the car was moving over a given trip, adding to the intuition of what is going on. The most difficult data parsing task has been working with the accelerometer data. This is some of the most important data collected, since acceleration is a huge part of fuel consumption; however, the misalignment problem is very difficult to deal with. To determine the correct orientation of the sensor in the car, I found the times when the car was stopped, and used the average accelerometer values to determine what the gravity vector was. Using this information I was able to reconstruct the rotation matrices in roll and pitch necessary to move the gravitational vector to point straight down. While this orients the gravity vector and determines two directional components, it cannot determine which direction the car is facing relative to the accelerometer. In other words, we are unable to determine yaw from this information. To determine a value for yaw, I had to use my knowledge of what the x and y accelerations should be.
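The report builds roll and pitch rotations that send the stationary-average accelerometer reading to straight down. A hedged Python sketch of the same idea, using a single Rodrigues rotation in place of separate roll and pitch matrices (the function name and the sample reading are hypothetical):

```python
import numpy as np

def rotation_to_z(g):
    """Rotation matrix R such that R @ g is parallel to [0, 0, |g|].

    `g` is the average accelerometer reading while the car is stopped,
    i.e. an estimate of the gravity direction in the sensor frame.
    This fixes roll and pitch; yaw about the gravity axis is still free.
    Assumes `g` is not pointing opposite to +z (true for a roughly
    upright sensor).
    """
    u = np.asarray(g, dtype=float)
    u = u / np.linalg.norm(u)
    z = np.array([0.0, 0.0, 1.0])
    v = np.cross(u, z)           # rotation axis (unnormalized)
    c = np.dot(u, z)             # cosine of the rotation angle
    if np.isclose(c, 1.0):       # already aligned
        return np.eye(3)
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    # Rodrigues' formula for the rotation taking u onto z.
    return np.eye(3) + K + K @ K / (1.0 + c)

g = [0.3, -0.2, 9.7]             # hypothetical stationary reading
R = rotation_to_z(g)
aligned = R @ np.array(g, dtype=float)
```

After this rotation the gravity component sits entirely on the z axis, leaving only the yaw ambiguity the report resolves next.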
I found the parts of the data where the car was moving relatively straight, and used the fact that accelerations there should be concentrated in the x and z directions, where positive z points down, x points out the front of the vehicle, and y out the right side. By choosing the yaw value so that the acceleration is concentrated in the x direction while the heading is not changing, I was able to estimate the overall vehicle orientation.

Sample Trip

Figure 1 shows some sample data from a typical trip: the speed, distance traveled, gas consumed, and an estimated path along which the car traveled. This gives a decent idea of how useful the data might be, as well as which features might be worth looking into in order to save gas. Over the course of the project I plotted a number of other features, some of which I thought were highly relevant to fuel efficiency, and others which should have no effect.

[Figure 1: Example of a typical trip (speed, distance traveled, gas consumed, and estimated x-y path). Figure 2: Dividing trips by efficiency, with unused trips, high-efficiency trips, and low-efficiency trips marked.]
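The yaw search described above (pick the yaw that concentrates straight-driving acceleration in x) could be sketched as follows; the grid search, the restricted angle range, and the synthetic data are my own assumptions, not the author's method:

```python
import numpy as np

def estimate_yaw(ax, ay):
    """Pick the yaw angle that concentrates acceleration in x.

    `ax`, `ay` are level-frame accelerations taken from straight-driving
    segments (heading not changing). We grid-search the rotation about z
    that minimizes the energy left in the y (lateral) component.
    The range is limited to (-90, 90) degrees to sidestep the
    forward/backward ambiguity of this criterion.
    """
    ax, ay = np.asarray(ax, float), np.asarray(ay, float)
    best, best_cost = 0.0, np.inf
    for yaw in np.linspace(-np.pi / 2, np.pi / 2, 361):   # 0.5 degree grid
        # Lateral component after rotating the frame by `yaw`.
        y_rot = -np.sin(yaw) * ax + np.cos(yaw) * ay
        cost = np.sum(y_rot ** 2)
        if cost < best_cost:
            best, best_cost = yaw, cost
    return best

# Synthetic check: acceleration along a direction 30 degrees off the
# sensor x-axis should be recovered as a yaw of pi/6.
t = np.linspace(0, 1, 50)
a = np.sin(2 * np.pi * t)            # braking/accelerating profile
yaw = estimate_yaw(a * np.cos(np.pi / 6), a * np.sin(np.pi / 6))
```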

Defining Efficiency

The first step in identifying features of high- or low-efficiency driving is to determine what constitutes a high- or low-efficiency trip. I defined the efficiency of a trip as the total distance traveled divided by the amount of gas consumed. I then used k-means clustering to find two natural groupings, which nicely divide the trips into lower- and higher-efficiency categories. By using k-means, rather than sorting the trip efficiencies and dividing at the median or at a parameter like half the maximum efficiency, we get more natural groupings which are less biased by outliers. Additionally, there is no reason for a fixed percentage of the trips to be considered efficient. Due to the nature of the data collection, some trips may have been cut off after a long period of coasting or slowing down. Another possibility is that a trip only includes the car accelerating, if the data dropped out before cruising or slowing down. As can be seen in Figure 2, certain trials result in unrealistic efficiencies; in particular, those with efficiencies of hundreds of miles per gallon, or near zero, are probably not the trips to focus on. To remove some of these anomalies, only the second and third quartiles of the data in each of the two categories are used, removing the outliers. In Figure 2, the red points indicate trips categorized as high efficiency, the blue points indicate low efficiency, and the remaining black points represent trips considered outliers in either category.

Features

Since the goal of this project is to determine relevant features which a user can track in order to maximize fuel efficiency, I used the data provided to come up with 19 potential features which a driver could keep track of with relative ease, should they turn out to be relevant to gas consumption.
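A minimal sketch of this trip-labeling step, assuming a one-dimensional two-cluster k-means on miles-per-gallon followed by trimming each cluster to its middle two quartiles (the author presumably used MATLAB's kmeans; this hand-rolled Lloyd's loop is a stand-in, and the sample efficiencies are made up):

```python
import numpy as np

def label_trips(mpg, n_iter=50):
    """Split trip efficiencies into low/high groups with 1-D 2-means,
    then keep only the middle two quartiles of each group.

    Returns (labels, keep): labels[i] is 0 for the low-efficiency
    cluster and 1 for the high one; keep[i] marks non-outlier trips.
    """
    mpg = np.asarray(mpg, dtype=float)
    centers = np.array([mpg.min(), mpg.max()])       # simple initialization
    for _ in range(n_iter):                          # Lloyd's algorithm in 1-D
        labels = (np.abs(mpg - centers[0]) > np.abs(mpg - centers[1])).astype(int)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = mpg[labels == k].mean()
    keep = np.zeros(len(mpg), dtype=bool)
    for k in (0, 1):
        vals = mpg[labels == k]
        lo, hi = np.percentile(vals, [25, 75])       # middle two quartiles
        keep[labels == k] = (vals >= lo) & (vals <= hi)
    return labels, keep

mpg = np.array([10, 12, 14, 16, 18, 30, 32, 34, 36, 38], dtype=float)
labels, keep = label_trips(mpg)
```

Trips with `keep` set to False play the role of the black outlier points in Figure 2.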
Using the data provided, I came up with a list of 19 features to examine for their relative importance. In particular, I looked at the maximum and mean values of various quantities, as these give a decent estimate of how the magnitude of each quantity directly affects fuel efficiency. By looking at the standard deviation of various quantities, I can get an idea of the effect that stable quantities have as opposed to changing quantities. Finally, by looking for peaks in the Fourier transform of a variable, I can see whether any natural cycling occurs in that variable and determine how important such cycles could be. The following features were chosen for this analysis:

- Total acceleration, √(x-acceleration² + y-acceleration² + z-acceleration²): maximum, mean, standard deviation
- Planar acceleration, √(x-acceleration² + y-acceleration²): maximum, mean, standard deviation
- x, y, and z accelerations: standard deviation
- Speed: maximum, mean, standard deviation
- Heading: maximum, mean
- Total distance: maximum
- Gas used: maximum, mean, standard deviation
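Per-trip features of this kind can be computed along the following lines; the exact feature set, names, and the FFT-peak definition here are illustrative sketches, not the author's code:

```python
import numpy as np

def trip_features(speed, ax, ay, az, fs=1.0):
    """Per-trip summary features, assuming samples at `fs` Hz.

    Mirrors the kinds of quantities listed above: magnitudes of the
    total and planar acceleration, summary statistics, and the height
    and frequency of the dominant (non-DC) peak in the speed's Fourier
    transform as a crude measure of periodic speed-up/slow-down cycles.
    """
    speed = np.asarray(speed, float)
    ax, ay, az = (np.asarray(v, float) for v in (ax, ay, az))
    total = np.sqrt(ax**2 + ay**2 + az**2)
    planar = np.sqrt(ax**2 + ay**2)
    spectrum = np.abs(np.fft.rfft(speed - speed.mean()))  # mean removed => no DC peak
    peak_bin = int(np.argmax(spectrum))
    return {
        "speed_max": speed.max(),
        "speed_mean": speed.mean(),
        "speed_std": speed.std(),
        "total_acc_mean": total.mean(),
        "planar_acc_mean": planar.mean(),
        "planar_acc_std": planar.std(),
        "speed_fft_peak": spectrum[peak_bin],                        # cycle strength
        "speed_fft_peak_hz": np.fft.rfftfreq(len(speed), d=1 / fs)[peak_bin],
    }

# Synthetic trip: 20 m/s mean speed with a 5 m/s oscillation at 0.125 Hz.
t = np.arange(64)
feats = trip_features(20 + 5 * np.sin(2 * np.pi * 8 * t / 64),
                      np.zeros(64), np.zeros(64), 9.8 * np.ones(64))
```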

PCA

One of the first tests I ran was a principal component analysis, to determine which combinations of features contributed the largest variation across trials. While a linear combination of features may not be the best way to give drivers a concise, easy-to-remember set of instructions for reducing fuel consumption, I thought it could provide useful insight into the data itself, as well as into the particular features I chose to use. Examining the results of the principal component analysis, 88% of the variation in the data is due primarily to the maximum and mean of the heading information. This makes intuitive sense, but heading is completely independent of fuel efficiency, since the same trip can be driven at any orientation with respect to north. Including the maximum and mean speed, as well as the total distance, explains 98% of all variation; once again, though, total distance traveled is completely irrelevant to overall efficiency. From these results I determined that, while there might be better separation along a reoriented coordinate system, for feature selection it would be better to choose among the initial features rather than use a variance-maximizing linear combination.

Feature Detection

The most important part of this project is the feature selection algorithm, which determines which features contribute the most to predicting high- or low-efficiency driving. To accomplish this, I used a support vector machine with a linear kernel to determine which features, when removed, would affect how the data was separated the least. I developed a cross-validation scheme which takes a matrix containing all the trials, trains a support vector machine using a linear kernel on 90% of the data, and tests it against the remaining 10% to determine the number of classification errors. This is repeated 10 times, each time holding out a different 10%.
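The cross-validation scheme might look like the following sketch. The report trains a linear-kernel SVM; to keep this example dependency-free I plug in a nearest-centroid stand-in classifier, so the classifier here is my substitution, not the author's:

```python
import numpy as np

def cv_error(X, y, fit, predict, n_folds=10, seed=0):
    """10-fold cross-validation error: train on 90% of the trips,
    test on the held-out 10%, repeating with a different 10% each time."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    errors = 0
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = fit(X[train], y[train])
        errors += np.sum(predict(model, X[test]) != y[test])
    return errors / len(y)

# Stand-in classifier (the report uses a linear-kernel SVM; a
# nearest-centroid rule keeps this sketch self-contained).
def fit_centroids(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_centroids(model, X):
    classes = sorted(model)
    d = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.array(classes)[np.argmin(d, axis=0)]

# Two well-separated clusters of trips -> near-zero CV error.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(3, 0.1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
err = cv_error(X, y, fit_centroids, predict_centroids)
```

Swapping `fit`/`predict` for an SVM implementation recovers the scheme the report describes.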
Using the cross-validation scheme described above, I used the following feature selection algorithm. First, I ran the cross-validation scheme on the full matrix of trials and features, then once for each feature, using the matrix with that single feature removed. By comparing the cross-validation error in each of these cases and selecting the one with minimal error, we can find which feature is least important to separating the data (the support vector machine was capable of perfectly separating the data using all 19 features). This procedure was run repeatedly to find the next least important feature to remove, resulting in a list of features ranked from least to most important for separating the data. In the end, this method indicated the following ranking of the features which contribute the most to determining high- and low-efficiency driving:

1. Average speed
2. Average gas used
3. Standard deviation of gas used
4. Standard deviation of the planar acceleration
5. Average of the planar acceleration
6. Standard deviation of the y-acceleration
7. Maximum of the planar acceleration
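The greedy backward elimination can be sketched as below; the toy error function and feature names are hypothetical, standing in for the SVM cross-validation error and the real feature set:

```python
import numpy as np

def backward_elimination(X, y, error_fn, names):
    """Greedy backward feature elimination.

    Repeatedly drops the feature whose removal increases the
    cross-validation error the least, returning feature names ordered
    from least to most important (the last survivor matters most).
    """
    remaining = list(range(X.shape[1]))
    order = []
    while len(remaining) > 1:
        scores = []
        for f in remaining:
            cols = [c for c in remaining if c != f]
            scores.append((error_fn(X[:, cols], y), f))
        _, worst = min(scores)            # cheapest feature to lose
        remaining.remove(worst)
        order.append(names[worst])
    order.append(names[remaining[0]])
    return order

def toy_error(Xs, y):
    # Stand-in for the SVM cross-validation error: one minus the best
    # absolute correlation of any remaining column with the labels.
    cors = [abs(np.corrcoef(Xs[:, j], y)[0, 1]) for j in range(Xs.shape[1])]
    return 1.0 - max(cors)

rng = np.random.default_rng(0)
y = np.array([0] * 20 + [1] * 20)
X = np.column_stack([y.astype(float),       # informative feature
                     rng.normal(size=40),   # noise
                     rng.normal(size=40)])  # noise
order = backward_elimination(X, y, toy_error, ["avg_speed", "noise_a", "noise_b"])
```

With an informative column and two noise columns, the noise features are eliminated first and the informative one survives to the end of the ranking.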

While this feature selection algorithm is not perfect, it gives a good idea of what driving techniques might cause higher gas usage. What this method does not do, however, is distinguish which of these features come from driving habits and which come from road conditions. For example, highway driving and stop-and-go traffic are very different road conditions which are completely out of the control of the driver.

SVM

I have included a plot from the final run of the support vector machine cross-validation to show how well separated the data is when using only the two most important features identified by the feature selection algorithm.

[Figure 3: Separation using only the most important features (average gas used vs. average speed, with support vectors marked).]

Using these two features alone, prediction errors are significantly lower than 10%.

Conclusion

In the end, the goal of this algorithm was to determine a list of features, all easy to compute in real time, which do a good job of predicting how efficient a driving experience will be. More importantly, this algorithm has identified some ways that drivers might be able to reduce their fuel consumption. Average speed and the mean and standard deviation of the planar acceleration are among the most significant features the driver has control over. In particular, a driver can save fuel by reducing their average speed. While this is not practical in all situations, it suggests that following the speed limit, rather than exceeding it by several miles per hour, is not only safer but could help you save on gas. Additionally, in stop-and-go traffic it is best to accelerate slowly, so the average speed is reduced, and smoothly, so the standard deviation of the acceleration is reduced. This makes intuitive sense, which lends some credibility to the algorithm itself.
While this also provided a good method of predicting efficiency, that is of less practical use, since an individual driver may not have control over these variables at all times.