CAP 6412 Advanced Computer Vision


CAP 6412 Advanced Computer Vision http://www.cs.ucf.edu/~bgong/cap6412.html Boqing Gong April 21st, 2016

Today
Administrivia
Free parameters in an approach, model, or algorithm?
Egocentric videos by Aisha

Project II due: next Wednesday (04/27, 5 PM)
Final Project Presentation: 04/28, 1-3:50 PM
Late submissions: https://docs.google.com/spreadsheets/d/1unpfusdnw5xfziv- PrQTo9xTWKfv7s-OPyuV_zZw9Fc/edit?usp=sharing


Free parameters (hyper-parameters)
In Project 2, when you train the CNNs: learning rate, momentum, weight decay, dropout rate, early stopping, etc.; network architecture, nonlinear functions, strides, etc.
In linear regression: $\min_{w} \sum_{m=1}^{M} (y_m - x_m^T w)^2 + \lambda \|w\|_2^2$
In SVM: $\min_{w,\ \xi_m,\ m=1,\dots,M} \sum_{m=1}^{M} \xi_m + \lambda \|w\|_2^2$ s.t. $y_m (x_m^T w) \ge 1 - \xi_m$ and $\xi_m \ge 0\ \forall m$

Free parameters (hyper-parameters)
In K-means clustering: K, the number of clusters
In K-nearest neighbors classifier: K, the number of neighbors
In Canny edge detection: Gaussian filter, thresholds
In R-CNN: threshold of selective search; number of layers, filter size, stride, where to apply max pooling; padding or not, learning rate, momentum, weight decay, number of iterations
Trade-off parameter
Feature selection for regression
Batch size

Free parameters (hyper-parameters)
Free parameters vs. model parameters
$\min_{w} \sum_{m=1}^{M} (y_m - x_m^T w)^2 + \lambda \|w\|_2^2$
Model parameters are often sought by optimization: gradient descent (GD), coordinate descent, Newton's method, stochastic GD, etc.
How to choose the free parameters?
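
To make the distinction concrete, here is a minimal sketch (not part of the original slides) of regularized linear regression. The trade-off weight `lam` is a free parameter fixed by hand before training, while the model parameters `w` are found by plain gradient descent. The function name `ridge_gd`, the step size, the iteration count, and the toy data are illustrative assumptions.

```python
def ridge_gd(xs, ys, lam, lr=0.01, steps=2000):
    """Minimize sum_m (y_m - x_m^T w)^2 + lam * ||w||_2^2 by gradient descent.

    lam, lr, and steps are free parameters; w holds the model parameters.
    """
    dim = len(xs[0])
    w = [0.0] * dim
    for _ in range(steps):
        # Gradient of the regularizer: 2 * lam * w
        grad = [2.0 * lam * wj for wj in w]
        # Gradient of the squared residuals: -2 * (y - x^T w) * x, summed over samples
        for x, y in zip(xs, ys):
            residual = y - sum(xj * wj for xj, wj in zip(x, w))
            for j in range(dim):
                grad[j] -= 2.0 * residual * x[j]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# Toy data generated by w = [1, 2] without noise (hypothetical example).
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [1.0, 2.0, 3.0]
w_plain = ridge_gd(xs, ys, lam=0.0)    # no regularization
w_shrunk = ridge_gd(xs, ys, lam=10.0)  # strong regularization shrinks w toward zero
```

Changing `lam` changes which `w` the optimizer returns, which is why the free parameter has to be chosen by some criterion outside the optimization itself.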

How to choose the free parameters
Smallest error rate on the test set? On the validation set?
Smallest expected error rate on the entire population.
In practice, however, we have access to only a finite set of examples!
Approximate the expected error rate.
Choose the free parameters that minimize the approximate error.
How to approximate the expected error?

Weak approximation of the expected error! Rarely used in practice.

Popular for small data.

Popular for small data.

1. Divide the data into training, validation, and test sets.
2. Select free parameters (e.g., network layers, number of hidden states, nonlinear functions, etc.).
3. Train the model using the training set.
4. Evaluate the model using the validation set.
5. Repeat steps 2-4 using different free parameters → different models.
6. Select the best model (and its associated free parameters).
7. Train the model (with the associated free parameters) using both training and validation sets.
8. Assess this final model using the test set.
Popular for big data. Skip step 7 for big data.
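
The eight steps can be sketched in a few lines. This is an illustrative example only, assuming a one-dimensional regularized regression whose single free parameter is the trade-off weight `lam`. The helper names, the candidate values, and the synthetic data are hypothetical and not taken from the lecture.

```python
import random


def fit_ridge_1d(points, lam):
    """Closed-form minimizer of sum (y - w*x)^2 + lam * w^2 for scalar x."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + lam)


def mean_squared_error(points, w):
    return sum((y - w * x) ** 2 for x, y in points) / len(points)


def select_and_assess(train, val, test, candidates):
    # Steps 2-5: train one model per candidate free parameter, score it on validation.
    val_errors = {lam: mean_squared_error(val, fit_ridge_1d(train, lam))
                  for lam in candidates}
    # Step 6: keep the free parameter with the smallest validation error.
    best_lam = min(val_errors, key=val_errors.get)
    # Step 7: retrain on training + validation data (skipped for big data).
    final_w = fit_ridge_1d(train + val, best_lam)
    # Step 8: the test set is touched exactly once, for the final assessment.
    return best_lam, final_w, mean_squared_error(test, final_w)


# Step 1: synthetic data y = 3x + noise, split into train / validation / test.
rng = random.Random(0)
inputs = [rng.uniform(-1.0, 1.0) for _ in range(300)]
data = [(x, 3.0 * x + rng.gauss(0.0, 0.1)) for x in inputs]
train, val, test = data[:200], data[200:250], data[250:]

best_lam, final_w, test_error = select_and_assess(
    train, val, test, candidates=[0.0, 0.1, 1.0, 10.0, 100.0])
```

The key design point is that the test set never influences the choice of `lam`, so the final test error remains an honest approximation of the expected error.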



Hand Detection in Egocentric Videos
Aisha Urooj
Course Instructor: Dr. Boqing Gong
Advanced Computer Vision

Motivation
Emergence of new wearable technologies: action cameras, smart glasses, and so on.
These devices capture videos from a first-person perspective and record the user's experiences.
Image source: [1]

An overview of First Person Vision

Image Credits: [1]

A hierarchical structure, starting from the raw video sequence (bottom) to the desired objectives (top) Image Credits: [1]


Related Datasets [1]

Motivation
Hands are very common in egocentric videos.
The appearance and pose of hands give important cues about a person's actions and attention, which matter for activity recognition, user-machine interaction, and so on.
Most egocentric computer vision problems, from object detection to activity recognition, require accurate hand detection.

Challenges in hand detection
Hands are highly deformable objects.
Occlusion
Cluttered background
Dynamic background
Inconsistent lighting
Poor imaging conditions
Highly dynamic camera motion
And so on.

Lending a Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions
Sven Bambach, Stefan Lee, David J. Crandall, Chen Yu
Indiana University

Outline
Paper's contributions
Dataset details
Approach
Results
Possible future directions

Paper's Contributions
A deep model for hand detection and classification in egocentric video, including fast domain-specific region proposals.
A new technique for pixel-wise hand segmentation.
A quantitative analysis of how hand location and pose can be useful for accurate activity recognition.
A large dataset of egocentric interactions with fine-grained ground truth.

Overview
Image source: http://vision.soic.indiana.edu/projects/lending-a-hand/

Ground truth hand segmentation masks on sample frames from dataset.

A random subset of cropped hands according to ground truth.

Dataset details
4 participants, 4 activities, 3 different locations (office, home, courtyard).
48 unique videos in total.
Recorded with Google Glass at 720x1280, 30 fps.
2 people in each video, each wearing Google Glass. (Video pairs were synchronized and cut to 90 seconds.)
Pixel-level ground truth for over 15,000 hand instances.
Manual annotation of 100 frames per video, i.e., 4,800 frames of ground truth.
Main split: 36 training, 4 validation, 8 test videos.

Hand Detection: Approach
Candidate window generation
Window classification using CNNs

Window Proposals Generation
Probability that an object O appears in a region R of an image I.
The proposed approach for candidate window generation combines spatial biases and appearance models.

Window Proposals Generation (contd.)
P(O): object occurrence probability.
P(R | O): probability that a certain region R (a bounding box) contains a specific hand (O).
P(I | R, O): a pixel-level skin classifier that estimates the probability that the central pixel of R is skin.
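
One way to read these three terms together is as an unnormalized score for a candidate window, assuming they combine by Bayes' rule as P(O | R, I) ∝ P(I | R, O) · P(R | O) · P(O). The sketch below is a hedged illustration of that product, not the paper's implementation: the Gaussian stand-in for P(R | O), the function names, and the toy skin-probability map are all assumptions.

```python
import math


def spatial_prior(box, mean_center, sigma):
    """Stand-in for P(R | O): isotropic Gaussian over the box center (an assumption)."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    d2 = (cx - mean_center[0]) ** 2 + (cy - mean_center[1]) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2))


def proposal_score(box, skin_prob, occurrence_prob, mean_center, sigma):
    """Unnormalized P(O | R, I) as P(I | R, O) * P(R | O) * P(O).

    box is (x1, y1, x2, y2); skin_prob is a 2-D grid indexed as [row][column].
    """
    cx = int((box[0] + box[2]) / 2)
    cy = int((box[1] + box[3]) / 2)
    appearance = skin_prob[cy][cx]  # skin probability at the central pixel of R
    return appearance * spatial_prior(box, mean_center, sigma) * occurrence_prob


# Toy 4x4 skin-probability map with a skin-colored blob in the middle (hypothetical).
skin_prob = [
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.9, 0.1],
    [0.1, 0.9, 0.9, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]
common = dict(skin_prob=skin_prob, occurrence_prob=0.7,
              mean_center=(1.5, 1.5), sigma=1.0)
score_a = proposal_score((1, 1, 2, 2), **common)  # skin, at the expected location
score_b = proposal_score((2, 2, 3, 3), **common)  # skin, away from the expected location
score_c = proposal_score((0, 0, 1, 1), **common)  # not skin, away from the expected location
```

Candidate windows would then be sampled or ranked according to such scores, so that proposals concentrate where hands are both likely to occur and look like skin.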

Coverage Results for Different Proposal Methods

Window classification
A standard CNN classification framework is used: CaffeNet from the Caffe software package, a slight variation of AlexNet.
Each training batch contains an equal number of samples from each class.
Horizontal and vertical flipping of sample images is disabled in Caffe so that left and right hands can be differentiated.

Window classification (contd.)
The CNN weights are initialized from CaffeNet, except the final fully connected layer, which is initialized from a zero-mean Gaussian.
Fine-tuning uses SGD with learning rate = 0.001 and momentum = 0.999.
Pipeline for each test frame: input → generate spatially sampled window proposals → classify window crops using the fine-tuned CNN → perform non-maximum suppression.
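
The last stage of this pipeline, non-maximum suppression, can be sketched as a greedy procedure: keep the highest-scoring window, drop every remaining window that overlaps it too much, and repeat. The overlap threshold of 0.3 and the function names below are illustrative assumptions; the slides do not state which threshold was used.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(detections, overlap_threshold=0.3):
    """Greedy NMS over a list of (score, box) pairs; returns the kept pairs."""
    kept = []
    # Visit detections from highest to lowest score.
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(box, kept_box) <= overlap_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


detections = [
    (0.9, (10, 10, 50, 50)),
    (0.8, (12, 12, 52, 52)),      # near-duplicate of the first window
    (0.7, (100, 100, 140, 140)),  # a separate hand elsewhere in the frame
]
kept = non_maximum_suppression(detections)
```

In this toy frame the near-duplicate window is suppressed and the two distinct windows survive.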

Hand Detection
Two cases: detect hands of any type; detect hands of a specific type (own left, your right, etc.).
The PASCAL VOC criterion for scoring detections is used: the intersection over union between the ground-truth and detected bounding boxes should be > 0.5.
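
A minimal sketch of how detections might be scored under this criterion follows. It assumes the usual PASCAL VOC convention that detections are visited in order of decreasing score and that a second detection of an already matched ground-truth box counts as a false positive; the slides do not spell this out, and the function names and boxes are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def score_detections(detections, ground_truth, min_iou=0.5):
    """Return (true_positives, false_positives, missed) for one frame."""
    matched = set()
    tp = fp = 0
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Find the ground-truth box that overlaps this detection the most.
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truth):
            overlap = box_iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_iou > min_iou and best_idx not in matched:
            matched.add(best_idx)
            tp += 1
        else:
            fp += 1  # poor overlap, or a duplicate of an already matched hand
    return tp, fp, len(ground_truth) - len(matched)


ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [
    (0.9, (0, 0, 10, 8)),      # IoU 0.8 with the first hand
    (0.8, (0, 0, 10, 9)),      # duplicate detection of the same hand
    (0.6, (50, 50, 60, 60)),   # overlaps nothing
]
tp, fp, missed = score_detections(detections, ground_truth)
precision = tp / (tp + fp)
recall = tp / (tp + missed)
```

Sweeping a score threshold over such counts is what produces a precision-recall curve.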

Precision-Recall curves for Hand detection

Qualitative Results for Hand Detection

Quantitative Results for Hand Detection

Hands Segmentation
Pixelwise hand segmentation is useful for hand pose recognition, in-hand object detection, and so on.
Goal: label each pixel as either background or a specific hand class.
A semi-supervised segmentation algorithm, GrabCut, is applied.
Given an approximate foreground mask, GrabCut iteratively refines the foreground and background pixels, relabeling them using a Markov random field.

Hands Segmentation
For each detected hand bounding box, an initial foreground estimate is computed using the same color skin model.
The estimate is thresholded: each pixel within the box is marked as foreground unless its skin probability is very low.
GrabCut is run on the bounding box, including a padded region.
The final segmentation is the union of the output masks for all detected bounding boxes.

Quantitative Results for Hand Segmentation

Two modes of possible failure
Failure to properly detect hand bounding boxes.
Inaccuracy in distinguishing hand pixels from background.
Applying the segmentation algorithm to ground-truth bounding boxes raises the average to 0.73.
Taking the output of the hand detector but using ground-truth segmentation masks raises the average to 0.76.

Qualitative Results for Hand Segmentation

Hand-based Activity Recognition
All non-hand background information is masked out using ground-truth hand segmentations.
A CNN is fine-tuned to classify whole frames as one of the four activities.
Training: 900 frames per activity from 36 videos.
Validation: 100 frames per activity from four videos.
Classification accuracy: 66.4% per frame.

Hand-based Activity Recognition (contd.)
Incorporating temporal constraints with a simple voting-based approach:
Classify each individual frame in the context of a fixed-size temporal window centered on the frame.
Scores are summed across the window.
The frame is labeled as the highest-scoring class.
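
The voting scheme can be sketched as follows. This is an illustrative example, assuming per-frame class scores are already available and that the window is simply truncated at the start and end of the video; the function name, the window size, and the toy scores are hypothetical.

```python
def smooth_labels(frame_scores, window=5):
    """Label each frame by summing class scores over a window centered on it.

    frame_scores is a list with one list of class scores per frame.
    """
    half = window // 2
    num_classes = len(frame_scores[0])
    labels = []
    for t in range(len(frame_scores)):
        # Window centered on frame t, truncated at the sequence boundaries.
        lo, hi = max(0, t - half), min(len(frame_scores), t + half + 1)
        summed = [sum(frame_scores[i][c] for i in range(lo, hi))
                  for c in range(num_classes)]
        # The frame takes the class with the highest summed score.
        labels.append(max(range(num_classes), key=summed.__getitem__))
    return labels


# Two activities, five frames; frame 2 is misclassified when taken on its own.
scores = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.9, 0.1], [0.8, 0.2]]
per_frame = smooth_labels(scores, window=1)  # no temporal context
smoothed = smooth_labels(scores, window=3)   # neighbors outvote the outlier
```

With a window of three frames, the isolated error in the middle frame is corrected by its neighbors.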

Hand-based Activity Recognition

Some sample hand poses not present in their dataset

Related work on Egocentric Hands Detection
Work by A. Betancourt, University of Genoa, Italy:
1) Hand Segmentation and tracking in FPV
2) A Sequential Classifier for Hand Detection in the Framework of Egocentric Vision. CVPR 2014
3) The Evolution of First Person Vision Methods: A Survey.
Observations:
Hands of other people are missed in many frames.
Results show false positives in many frames.
Hands shown in videos playing within a video are not detected.
Segmentation is not efficient.
At times both hands are detected as either left or right.
The full arm is treated as the hand.

Possible Future Directions
Improve the segmentation technique.
Build an unbiased dataset.
Use an efficient tracking approach to incorporate temporal information.
Improve the hand classifier.

References
[1] The Evolution of First Person Vision Methods: A Survey. A. Betancourt, P. Morerio, C. S. Regazzoni, and M. Rauterberg. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 25, Issue 5.

THANK YOU!