A Deep Learning Approach to Vehicle Speed Estimation

Benjamin Penchas (bpenchas@stanford.edu), Tobin Bell (tbell@stanford.edu), Marco Monteiro (marcorm@stanford.edu)

ABSTRACT

Given car dashboard video footage, we aimed to estimate the speed of a car using a deep neural network. We saw this problem as a small but important part of building an autonomous vehicle. The problem is by its nature underdetermined (we have no absolute reference for scale or distance), so we treated it as a classification task in which we bucketed speeds into 4 mph intervals. We designed our own network architecture after studying common optical flow networks and optimized the hyperparameters by experimentation. After training on cloud GPUs for several epochs, we were able to achieve 91% accuracy on unseen test frames.

I. INTRODUCTION

The ultimate goal of this research was to estimate the speed of a car in real time given a mounted dashboard video stream, a somewhat novel application of deep CNNs. Our dataset was a dashboard video taken by driving around the Bay Area; it contains footage from a variety of road types and speeds. We took our data from the autonomous vehicle startup Comma AI's speed detection challenge [1]. With the video, Comma AI provides a text file with the ground truth speed of the car (in m/s) for every frame. We separated this video into 20,400 image frames (an example of one such frame is given below), scaled each image to 224 × 224, and coupled every pair of consecutive frames. We then randomly shuffled these pairs and partitioned 80% into a training set and 20% into a test set.

[1] twitter.com/comma_ai/status/854488327797448704
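As a concrete illustration of this preprocessing, the sketch below extracts and resizes frames with OpenCV, pairs consecutive frames, converts the ground-truth speeds from m/s to mph, and buckets them into 4 mph intervals. The file names, the choice of which frame's speed labels a pair, and the helper structure are our assumptions; only the 224 × 224 size, the pairing, the bucket width, and the shuffled 80/20 split come from the report.

```python
# Hypothetical preprocessing sketch; see the caveats in the text above.
import cv2
import numpy as np

MPS_TO_MPH = 2.23694  # meters/second -> miles/hour
BUCKET_MPH = 4        # width of each speed class

def load_frames(video_path):
    """Read every frame of the video and resize it to 224x224."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (224, 224)))
    cap.release()
    return np.array(frames)

frames = load_frames("drive.mp4")              # assumed file name
speeds = np.loadtxt("train.txt") * MPS_TO_MPH  # per-frame ground truth, m/s -> mph
labels = (speeds // BUCKET_MPH).astype(int)    # discretize into 4 mph buckets

# Couple consecutive frames; labeling each pair with its second frame's
# speed is our assumption.
pairs = np.stack([frames[:-1], frames[1:]], axis=1)
pair_labels = labels[1:]

# Shuffle the pairs and split 80/20 into train and test sets.
rng = np.random.default_rng(0)
idx = rng.permutation(len(pairs))
split = int(0.8 * len(pairs))
train_idx, test_idx = idx[:split], idx[split:]
```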

Figure 1. Example video frame from our car video dataset.

II. STUDYING OPTICAL FLOW

We began by studying current state-of-the-art optical flow systems. We saw the problem of dense optical flow as very similar, though not identical, to our video-to-speed problem. Rather than use a pre-trained optical flow model such as Flownet, a very large network, we wanted to design and train our own lightweight model specific to this problem. To experiment with optical flow, we got a pre-trained optical flow network working and tried using it as a feature extractor in front of our network. Sample output from that network on one pair of image frames is shown below.

Figure 2. Flownet output (right) for subsequent frames of driving footage (left, blended).

However, calculating the dense optical flow of every pair of images in our dataset would have had extremely high overhead, and we were not convinced we needed the full dense flow to estimate speed; we hypothesized that a custom network might learn the optical flow of the white lines in the road and use this flow without looking at the flow of every pixel in the image. Studying the architecture of such optical flow networks, however, convinced us that the two-stream feature extractor at the beginning of the network was important.
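For quick experimentation, a classical dense optical flow method produces a similar flow field far more cheaply than a full network. The sketch below uses OpenCV's Farnebäck algorithm as a stand-in; it is not the Flownet or pyflow code referenced in this report, just an illustration of what computing a dense flow field for a frame pair looks like.

```python
# Stand-in sketch using OpenCV's Farneback dense optical flow, NOT the
# Flownet/pyflow models referenced in this report. The result is an
# (H, W, 2) array of per-pixel (dx, dy) displacements between two frames.
import cv2
import numpy as np

def dense_flow(frame_a, frame_b):
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    # Common OpenCV default parameters, not tuned values.
    return cv2.calcOpticalFlowFarneback(
        gray_a, gray_b, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

# Average flow magnitude correlates loosely with vehicle speed, which is
# why dense flow looked promising (if expensive) as a feature.
cap = cv2.VideoCapture("drive.mp4")  # assumed file name, as above
_, frame_a = cap.read()
_, frame_b = cap.read()
cap.release()
mean_magnitude = np.linalg.norm(dense_flow(frame_a, frame_b), axis=-1).mean()
```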

Figure 3. Flownet architecture. Notice the two-stream feature extractor.

III. FEATURE EXTRACTION WITH VGG-19

In keeping with this architecture, we first tried a simple CNN with a two-stream feature extractor at the beginning. We convinced ourselves that pre-training on ImageNet would be valuable because scale is an important part of the problem, i.e., how big cars, trees, and signs typically are. Hence we initially used the first few layers of a pre-trained VGG-19 network as a two-stream feature extractor, mimicking the Flownet architecture and hoping to extract features like scale. We fed each image through a pre-trained VGG-19 model, up to the third block, and then concatenated the results (see the sketch after Figure 4).

Figure 4. VGG-19 architecture. The arrow marks the cutoff point for our architecture.
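A minimal Keras sketch of this two-stream extractor, assuming that "up to the third block" corresponds to VGG-19's block3_pool layer (the exact cutoff layer is our guess):

```python
# Two-stream VGG-19 feature extractor sketch. The block3_pool cutoff is
# our assumption for "up to the third block".
import tensorflow as tf

vgg = tf.keras.applications.VGG19(include_top=False,
                                  weights="imagenet",
                                  input_shape=(224, 224, 3))
extractor = tf.keras.Model(vgg.input, vgg.get_layer("block3_pool").output)
extractor.trainable = False  # frozen, pre-trained features

frame_t = tf.keras.Input(shape=(224, 224, 3))
frame_t1 = tf.keras.Input(shape=(224, 224, 3))

# Run both frames through the same frozen extractor and concatenate the
# feature volumes along the channel axis.
features = tf.keras.layers.Concatenate(axis=-1)(
    [extractor(frame_t), extractor(frame_t1)])
two_stream = tf.keras.Model([frame_t, frame_t1], features)
```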

We then fed the resulting volume through our trainable custom CNN. However, this approach failed for two reasons:

1. VGG-19 is an image classifier, so it has been trained to extract semantic understanding. Since subsequent frames are semantically very similar, feature extraction with VGG-19 actually discarded crucial spatial information, leaving a model that could not distinguish between visually similar images such as subsequent frames of driving video.

2. VGG-19 downsamples rapidly through several max-pool layers, causing a loss of spatial resolution. For optical flow, and for this video-to-speed problem, high spatial resolution is very important.

Thus we decided not to use a pre-trained feature extractor and to let the network learn everything on its own.

IV. NETWORK ARCHITECTURE

A graphic of our final architecture is included below. We begin by concatenating the volumes representing both images and feeding the result into several convolutional layers followed by several fully connected layers. The convolutional layers use 64 3x3 filters with a stride of 2 and the ReLU activation function. The first three fully connected layers have 4096 neurons each and also use ReLU. The last fully connected layer has 15 neurons and no activation function; we apply the softmax function to its output to get the class probabilities across the 15 possible classes (discretized speeds).

Figure 5. Our final video-to-speed CNN architecture.
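The report does not state how many convolutional layers the network has or exactly how the two images are combined, so the Keras sketch below is an approximation: it stacks the two frames along the channel axis and uses three convolutional layers as a placeholder depth.

```python
# Approximate sketch of the final architecture. The 64 3x3 stride-2 ReLU
# convolutions, three 4096-unit ReLU FC layers, and the 15-way softmax
# come from the report; the conv depth (three) and the channel-stacked
# input are our assumptions.
import tensorflow as tf

inputs = tf.keras.Input(shape=(224, 224, 6))  # two RGB frames, channel-stacked
x = inputs
for _ in range(3):  # assumed depth; the report says "several" conv layers
    x = tf.keras.layers.Conv2D(64, 3, strides=2, activation="relu")(x)
x = tf.keras.layers.Flatten()(x)
for _ in range(3):
    x = tf.keras.layers.Dense(4096, activation="relu")(x)
logits = tf.keras.layers.Dense(15)(x)       # no activation on the last layer
probs = tf.keras.layers.Softmax()(logits)   # softmax applied to its output

model = tf.keras.Model(inputs, probs)
```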

V. TRAINING

We used FloydHub, a cloud GPU provider, to train our model quickly. We used the TensorFlow Estimator interface to make checkpointing easy, since we wanted to be able to start and stop on FloydHub without losing training progress. We used the cross-entropy loss between the true bucket (discretized during preprocessing) and our predicted bucket, optimized with gradient descent at a learning rate of 0.001; larger learning rates caused divergence. We trained for over 9,000 batches on FloydHub (about 15 epochs), used TensorBoard to monitor progress and output the loss/accuracy graphs included below, and checkpointed the model every 60 seconds. We attempted to train locally, but doing so took on the order of seconds per batch, which was unacceptably slow.
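A sketch of this training setup against the TF 1.x Estimator API the report names. The learning rate of 0.001, plain gradient descent, cross-entropy over speed buckets, 9,000 training steps, and 60-second checkpointing come from the report; the model_dir name and the placeholder network and input pipeline are assumptions for illustration only.

```python
# TF 1.x Estimator training sketch; see the caveats in the text above.
import tensorflow as tf  # TensorFlow 1.x API

def model_fn(features, labels, mode):
    # Placeholder network standing in for the Section IV CNN.
    logits = tf.layers.dense(tf.layers.flatten(features["pair"]), 15)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

def train_input_fn():
    # Placeholder input pipeline; the real one fed shuffled 224x224 frame pairs.
    data = tf.data.Dataset.from_tensor_slices(
        ({"pair": tf.zeros([8, 224, 224, 6])}, tf.zeros([8], tf.int64)))
    return data.batch(4).repeat()

config = tf.estimator.RunConfig(save_checkpoints_secs=60)  # checkpoint every 60 s
estimator = tf.estimator.Estimator(model_fn, model_dir="ckpts", config=config)
estimator.train(input_fn=train_input_fn, steps=9000)  # ~15 epochs in the report
```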

Figure 6. Plot of training set loss as a function of training iterations.

Figure 7. Plot of training set accuracy as a function of training iterations.

As expected with mini-batch training, our accuracy on each batch varies but trends upward. By the end of the 15 epochs, we were able to completely fit the training set. We then evaluated on our unseen test set, where our trained model achieved 91% accuracy. Given the varied nature of the roads in the train/test sets, we feel confident the network would generalize to unseen road environments. We did not get to try retraining with regularization (such as dropout); in our next iteration we would like to try regularizing the network.

VI. INTERPRETING RESULTS

To better understand our results, we took the gradient of the loss with respect to each pixel of an input image. We then visualized the gradients by coloring each pixel by the norm of its gradient (brighter means a larger gradient). In other words, we are left with a saliency map of which pixels the network focuses on to make its decision (a sketch of this computation follows below).

Figure 8. Gradient of each pixel with respect to the network prediction, highlighting where the network is looking.

As the figure shows, the network learned to ignore the road and the sky and to focus on the white lines in the road (something we hypothesized it would). Surprisingly, it also focuses on the hood of the car, the dashboard, and the top of the windshield. After investigating this phenomenon further, we believe the car itself shakes more at higher speeds, and the network has learned to factor in this shaking. How neat!
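A minimal sketch of the saliency computation, written against the modern tf.GradientTape API as a reconstruction (not the original TF 1.x code) and assuming a model like the one sketched in Section IV:

```python
# Saliency-map sketch: gradient of the loss with respect to the input
# pixels, visualized via the per-pixel gradient norm.
import tensorflow as tf

def saliency_map(model, image_pair, label):
    """image_pair: (224, 224, 6) float array; label: true speed bucket."""
    x = tf.convert_to_tensor(image_pair[None, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)  # inputs are not variables, so watch them explicitly
        probs = model(x)
        loss = tf.keras.losses.sparse_categorical_crossentropy(
            tf.constant([label]), probs)
    grads = tape.gradient(loss, x)[0]  # same shape as the input pair
    # Norm across channels leaves one brightness value per pixel.
    return tf.norm(grads, axis=-1).numpy()
```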

VII. NEXT STEPS

We have not yet analyzed the throughput of the model. We want to confirm the model can make predictions quickly enough to practically output a car's speed in real time; if not, we will modify the model to improve its throughput. We would also like to apply regularization during training, such as dropout, and see how this impacts performance on the training and test sets. Lastly, we would like to test our trained model on footage taken in diverse driving conditions to see how well it generalizes.

VIII. REFERENCES

E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, 2017.

P. Fischer, A. Dosovitskiy, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In ICCV, 2015.

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

Websites: github.com/pathak22/pyflow, comma.ai