Learning video saliency from human gaze using candidate selection

Learning video saliency from human gaze using candidate selection Rudoy, Goldman, Shechtman, Zelnik-Manor CVPR 2013 Paper presentation by Ashish Bora

Outline What is saliency? Image vs video saliency Candidates: motivation Candidate extraction Gaze dynamics: model and learning Evaluation Discussion

What is saliency? Captures where people look. A distribution over all the pixels in the image or video frame. Color, high contrast, and human subjects are known factors. Image credits: http://www.businessinsider.com/eye-tracking-heatmaps-2014-7 http://www.celebrityendorsementads.com

Image vs video saliency Image (viewed ~3 sec) vs video: shorter viewing time per frame, so typically a single most salient point (sparsity); continuity across frames; motion cues. Image credit: Rudoy et al

How to use this? Saliency in video is sparse, so it is redundant to compute saliency at all pixels. Solution: inspect a few promising candidates. Gaze is continuous across frames, so use preceding frames to model gaze transitions.

Candidate requirements Salient. Diffused: a salient area rather than a point, represented as a Gaussian blob (mean, covariance matrix). Versatile: incorporate a broad range of factors that cause saliency. Static: local contrast or uniqueness. Motion: inter-frame dependence. Semantic: arises from what is important for humans. Sparse: few per frame.

Candidate extraction pipeline: static. Frame → GBVS → sample many points → mean-shift clustering → fit Gaussian blobs → candidates. Image credits: http://www.fast-lab.org/resources/meanshift-blk-sm.png http://www.vision.caltech.edu/~harel/share/gbvs.php
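
A minimal sketch of this static step, assuming the GBVS output is already available as a non-negative NumPy array (any per-pixel saliency map would do); the function name and the small-cluster cutoff are my own choices, not the paper's:

import numpy as np
from sklearn.cluster import MeanShift

def static_candidates(saliency_map, n_samples=2000, bandwidth=20):
    # Sample pixel locations in proportion to the saliency map, cluster them
    # with mean shift, and fit one Gaussian blob (mean, covariance) per cluster.
    h, w = saliency_map.shape
    p = saliency_map.ravel() / saliency_map.sum()
    idx = np.random.choice(h * w, size=n_samples, p=p)
    pts = np.column_stack(np.unravel_index(idx, (h, w))).astype(float)  # (row, col)

    labels = MeanShift(bandwidth=bandwidth).fit_predict(pts)

    blobs = []
    for k in np.unique(labels):
        cluster = pts[labels == k]
        if len(cluster) < 10:          # ignore tiny clusters
            continue
        blobs.append((cluster.mean(axis=0), np.cov(cluster, rowvar=False)))
    return blobs                       # list of (mean, covariance) pairs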

Static candidates : example Image credit : Rudoy et al

Discussion Why not fit a mixture of Gaussians directly? Rationale in the paper: sampling followed by mean-shift fitting gives more importance to capturing the peaks. Is this because more points are sampled near the peaks and we weight each point equally?

Candidate extraction pipeline: motion. Consecutive frames → optical flow → magnitude and threshold → DoG filtering → sample many points → mean-shift clustering → fit Gaussian blobs → candidates. Images cropped from: http://cs.brown.edu/courses/csci1290/2011/results/final/psastras/images/sequence0/save_0.png http://www.liden.cc/visionary/images/difference_of_gaussians.gif
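
A rough sketch of the motion cue, using OpenCV's Farneback optical flow as a stand-in for whatever flow method the paper uses; the threshold and filter sigmas are illustrative. Its output can be fed to the same sample / mean-shift / Gaussian-fitting step shown above for the static cue:

import cv2
import numpy as np

def motion_saliency(prev_gray, cur_gray, mag_thresh=1.0, sigma_small=3, sigma_large=9):
    # Dense optical flow between consecutive grayscale frames.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Keep only pixels with significant motion.
    mag = np.linalg.norm(flow, axis=2)
    mag[mag < mag_thresh] = 0.0
    # Difference of Gaussians highlights compact moving regions.
    dog = (cv2.GaussianBlur(mag, (0, 0), sigma_small)
           - cv2.GaussianBlur(mag, (0, 0), sigma_large))
    return np.clip(dog, 0.0, None)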

Motion candidates : example Image credit : Rudoy et al

Candidate extraction pipeline: semantic. Frame → face detector and poselet detector → candidates, plus a fixed centre blob.
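
A sketch of the semantic cue: a fixed centre blob plus one Gaussian blob per detected face. OpenCV's Haar-cascade detector is used purely for illustration; the paper's face detector differs, and its poselet-based person detector is omitted here:

import cv2
import numpy as np

def semantic_candidates(frame_bgr):
    h, w = frame_bgr.shape[:2]
    # Fixed centre blob (covariance chosen by hand for illustration).
    blobs = [(np.array([h / 2.0, w / 2.0]),
              np.diag([(h / 4.0) ** 2, (w / 4.0) ** 2]))]

    # One blob per detected face, sized to its detection box.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    for (x, y, fw, fh) in cascade.detectMultiScale(gray, 1.1, 5):
        mean = np.array([y + fh / 2.0, x + fw / 2.0])        # (row, col)
        cov = np.diag([(fh / 2.0) ** 2, (fw / 2.0) ** 2])
        blobs.append((mean, cov))
    return blobs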

Semantic candidates : example Image credit : Rudoy et al

Modeling gaze dynamics s_i = source location, d = destination candidate. Learn the transition probability P(d | s_i). Image credit: Rudoy et al

Modeling gaze dynamics Use P(s_i) as a prior to get P(d) = Σ_i P(d | s_i) P(s_i), then combine the destination Gaussians, weighted by P(d), into the predicted saliency map. Image credit: http://i.stack.imgur.com/tyvjd.png Equation credit: Rudoy et al
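
A small sketch of how the final map could be rendered once P(d) is known for every destination candidate: sum the candidate Gaussians, each weighted by its probability (names and normalization are my own):

import numpy as np
from scipy.stats import multivariate_normal

def render_saliency(shape, dest_blobs, p_dest):
    # dest_blobs: list of (mean, covariance); p_dest: P(d) per candidate.
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    grid = np.dstack([rows, cols]).astype(float)
    smap = np.zeros(shape)
    for (mean, cov), p in zip(dest_blobs, p_dest):
        smap += p * multivariate_normal(mean, cov).pdf(grid)
    return smap / smap.max()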

Learning P(d | s_i): features Only destination and inter-frame features are used. Local neighborhood contrast of the destination candidate (defined by an equation in the paper). Equation credit: Rudoy et al

Learning P(d | s_i): features (contd) Only destination and inter-frame features are used: mean GBVS of the candidate neighborhood; mean Difference-of-Gaussians (DoG) of the vertical component, the horizontal component, and the magnitude of the optical flow in the local neighborhood of the destination candidate; face and person detection scores; discrete labels: motion, saliency (?), face, body, center, and size (?); Euclidean distance from the location of d to the center of the frame.

Discussion: unclear points It seems that no feature depends on the source location. In that case P(d | s_i) would be independent of s_i, which would mean P(d) is independent of P(s_i). This is like modeling each frame independently with optical-flow features. The discrete labels for saliency and size are also unclear.

Discussion Non-human semantic candidates? Not handled. Extra features that could be useful. General: color and depth, SIFT, HOG, CNN features. Task-specific: non-human semantic candidates (for example text, animals), activity-based candidates, memorability of image regions.

Learning P(d | s_i): dataset DIEM (Dynamic Images and Eye Movements) dataset [1]: 84 videos with gaze tracks of about 50 participants per video. [1] https://thediemproject.wordpress.com/

Learning P(d | s_i): get relevant frames (Potentially) positive samples: find all the scene cuts; the source frame is the frame just before the cut and the destination is 15 frames later. Negative samples: pairs of frames from the middle of every scene, 15 frames apart.
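
The paper does not spell out its cut detector, so the colour-histogram test below is only a stand-in to show how the frame pairs could be assembled:

import cv2

def find_scene_cuts(frames, thresh=0.5):
    # Flag a cut wherever the colour-histogram distance between
    # consecutive frames exceeds a threshold.
    cuts, prev_hist = [], None
    for i, frame in enumerate(frames):
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None and \
           cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > thresh:
            cuts.append(i)             # frame i starts a new shot
        prev_hist = hist
    return cuts

# Positive pairs: (cut - 1, cut + 15) for each detected cut.
# Negative pairs: frames 15 apart from the middle of each shot.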

Learning P(d | s_i): get source locations Ground-truth human fixations → smoothing → thresholding (keep top 3%) → find centres (foci) → source locations. Image credit: Rudoy et al
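
A minimal sketch of this step with SciPy, assuming the ground-truth fixations are given as a 2-D fixation count map; the smoothing sigma is illustrative:

import numpy as np
from scipy import ndimage

def source_locations(fixation_map, sigma=15, keep_top=0.03):
    # Smooth the fixation map, keep the top 3% of values, and take the
    # centroid of each remaining connected region as a focus.
    smooth = ndimage.gaussian_filter(fixation_map.astype(float), sigma)
    mask = smooth >= np.quantile(smooth, 1.0 - keep_top)
    labels, n = ndimage.label(mask)
    return ndimage.center_of_mass(smooth, labels, range(1, n + 1))  # (row, col) foci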

Learning P(d | s_i) Take all pairs of source locations and destination candidates for the training set. Positive labels: pairs where the centre of d is near a focus of the destination frame. Negative labels: the centre of d is far from every focus of the destination frame. Train a Random Forest classifier.
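
A sketch of the training step with scikit-learn, assuming the pairwise feature matrix and labels have been built as described above (the file names and hyperparameters are hypothetical, not the paper's):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# One row per (source location, destination candidate) pair; label 1 if the
# candidate's centre is near a focus of the destination frame, 0 otherwise.
X = np.load("pair_features.npy")   # hypothetical file, shape (n_pairs, n_features)
y = np.load("pair_labels.npy")     # hypothetical file, shape (n_pairs,)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)

# At test time, clf.predict_proba(features)[:, 1] serves as P(d | s_i).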

Labeling : example Image credit : Rudoy et al

Discussion Why Random Forest? No discussion in the paper. Other classifiers/models that could be used: XGBoost; an LSTM to model long-term dependencies.

Results : video

Experiments : How good are the candidates? Candidates cover most human fixations Image credit : Rudoy et al

Experiments : How good are the candidates? Image credit : Rudoy et al

Experiments: saliency metrics ROC AUC to compute the similarity between human fixations and the predicted saliency map. Chi-squared distance between histograms. Equation credit: http://mathoverflow.net/questions/103115/distance-metric-between-two-sample-distributions-histograms Image credit: https://upload.wikimedia.org/wikipedia/commons/6/6b/roccurves.png
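
Small sketches of the two metrics. The chi-squared distance is the standard histogram form; the AUC variant below treats fixated pixels as positives and all other pixels as negatives, which is one common protocol but may not match the paper's exact computation (see the later discussion slide):

import numpy as np
from sklearn.metrics import roc_auc_score

def chi2_distance(h1, h2, eps=1e-10):
    # Chi-squared distance between two normalised histograms.
    h1, h2 = h1 / h1.sum(), h2 / h2.sum()
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def fixation_auc(saliency_map, fixation_mask):
    # Each pixel is a sample: fixated pixels are positives, the rest negatives,
    # scored by the predicted saliency value.
    return roc_auc_score(fixation_mask.ravel().astype(int),
                         saliency_map.ravel())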

Results Image credit : Rudoy et al

Discussion In the paper, the authors mention that AUC considers the saliency results only at the locations of the ground-truth fixation points. This gives only true positives and false negatives; ROC AUC needs true negatives and false positives as well. How is the AUC computed without them?

Ablation results Dropping static or semantic cues results in a big drop in performance. Results snapshot: Rudoy et al

More discussion points Why 15 frames? This parameter is based on the typical time taken by human subjects to adjust their gaze on a new image. Across scene cuts the content can change arbitrarily; why not use in-shot transitions instead? The model needs a dataset with both video and human gaze tracks to train. Why does dense estimation (without candidate selection) give lower accuracy? Not clearly explained in the paper. A possible reason: the candidate-based model is able to model the transition probabilities better, while the dense model gets confused by the large number of candidates.

More discussion points How can we capture gaze transitions within a shot? Relation between saliency and memorability: we can reasonably expect the two to be correlated. What is the breakdown of failure cases for this model? Besides DIEM and CRCNS, are there other datasets that could be used to experiment with video saliency? (See http://saliency.mit.edu/datasets.html.) Could saliency be used to evaluate memorability?