CAP 6412 Advanced Computer Vision http://www.cs.ucf.edu/~bgong/cap6412.html Boqing Gong April 5th, 2016
Today Administrivia LSTM Attributes in computer vision, by Abdullah and Samer
Project II posted, due Tuesday 04/26, 11:59pm http://www.cs.ucf.edu/~bgong/cap6412/proj2.pdf Today: last day to acquire permission for taking option 2
Next week Tuesday (04/12) Javier Lores Thursday (04/14) Fareeha Irfan
Today Administrivia LSTM Attributes in computer vision, by Abdullah and Samer
A Plain RNN
Three time steps and beyond
Expressive in modeling sequences
Training by backpropagation: unstable (vanishing & exploding gradients); troublesome in learning long-term dependencies
Training by other methods? Alternatives exist, but are hard to use
Image credits: Richard Socher
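The plain RNN step and its gradient trouble can be sketched as follows; this is a minimal illustration with assumed names and sizes, not code from the lecture. Backpropagating through many steps repeatedly multiplies by the recurrent Jacobian, so gradients shrink or blow up exponentially.

```python
import numpy as np

# Plain RNN step (illustrative): h_t = tanh(W_hh h_{t-1} + W_xh x_t)
rng = np.random.default_rng(0)
H, D = 4, 3                                   # hidden size, input size (assumed)
W_hh = rng.normal(scale=0.1, size=(H, H))     # recurrent weights
W_xh = rng.normal(scale=0.1, size=(H, D))     # input weights

def rnn_step(h_prev, x_t):
    return np.tanh(W_hh @ h_prev + W_xh @ x_t)

# Unrolling T steps: the backward pass multiplies the gradient by
# W_hh^T diag(1 - h^2) once per step, which can vanish or explode.
h = np.zeros(H)
for t in range(10):
    h = rnn_step(h, rng.normal(size=D))
```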
LSTM (Long Short-Term Memory)
RNN: overwrite the hidden states → multiplicative gradients
LSTM: add to the cell states → additive gradients
Image credits: http://colah.github.io/posts/2015-08-understanding-lstms/
LSTM step by step
Memory cell & gates
Logistic function: σ(x) = 1 / (1 + exp(−x))
Image credits: http://colah.github.io/posts/2015-08-understanding-lstms/
LSTM step by step
Additive update to the cell states
f_t: forget gate; i_t: input gate
Image credits: http://colah.github.io/posts/2015-08-understanding-lstms/
LSTM step by step
Forget gate: forget/remember some information from time step (t-1)
Controlled jointly by the current input and the previous hidden states
Sometimes also controlled by the previous cell state C_{t-1}
Image credits: http://colah.github.io/posts/2015-08-understanding-lstms/
LSTM step by step
Input gate & candidate cell states: together they determine the new information to be stored
Image credits: http://colah.github.io/posts/2015-08-understanding-lstms/
LSTM step by step
Output gate & hidden states:
Hidden states depend on cell states
Hidden states (and input) are not part of the LSTM unit itself
Image credits: http://colah.github.io/posts/2015-08-understanding-lstms/
LSTM step by step
Output depends on hidden states: y_t = σ(W_yh h_t + b_y)
Image credits: http://colah.github.io/posts/2015-08-understanding-lstms/
LSTM in a nutshell
An LSTM contains:
- Forget gate
- Additive operations → additive gradients
- Input gate
- Output gate
- Memory cell
It does not contain:
- Input x
- Hidden states
- Output y
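The step-by-step slides above can be condensed into one function. This is a minimal sketch of a single LSTM step; the weight names, the fused input z = [h_{t-1}; x_t], and the sizes are assumptions for illustration, not the lecture's notation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    z = np.concatenate([h_prev, x_t])     # fused input (assumed layout)
    f_t = sigmoid(W_f @ z + b_f)          # forget gate
    i_t = sigmoid(W_i @ z + b_i)          # input gate
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde    # additive cell-state update
    o_t = sigmoid(W_o @ z + b_o)          # output gate
    h_t = o_t * np.tanh(c_t)              # hidden state depends on cell state
    return h_t, c_t

rng = np.random.default_rng(0)
H, D = 4, 3                               # hidden and input sizes (assumed)
Ws = [rng.normal(scale=0.1, size=(H, H + D)) for _ in range(4)]
bs = [np.zeros(H) for _ in range(4)]
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c, *Ws, *bs)
```

Note the additive update c_t = f_t * c_prev + i_t * c_tilde: gradients flow back through the cell state by addition rather than repeated matrix multiplication, which is the point of the "additive gradients" bullet.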
Today Administrivia LSTM Attributes in computer vision, by Abdullah and Samer
Attribute Learning By Abdullah Jamal
Outline What is attribute learning? A Unified Multiplicative Framework for Attribute Learning, Kongming Liang, Hong Chang, Shiguang Shan, Xilin Chen, ICCV 2015 Motivation of the research Main Contribution Approach Outline Details of the Proposed Approach Experiments Conclusion Future Directions
Attribute? An inherent characteristic of an object: color, shape, pattern, texture.
What are visual attributes? Attributes are properties observable in images that have human-designated names, such as orange, striped, or furry.
Attributes-based Recognition
Example attribute labels from the slide figure: dog: furry, white; chimpanzee: black, big; tiger: striped, yellow, black, white, big.
Attributes provide a mode of communication between humans and machines!
Datasets
Animals with Attributes: 85 numeric attribute values for each of the 50 animal classes; 30,475 images; the minimum and maximum numbers of images per category are 92 and 1,168, respectively.
aPascal/aYahoo: 64 binary attributes annotated for each object sample of the aPascal train and test sets and the aYahoo test set; 20 categories for aPascal (12,695 images) and 12 classes for aYahoo (2,644 images).
The CUB-200-2011 Birds (CUB): 200 categories of bird species with 11,788 images; 312 binary attributes per image.
SUN Attribute dataset: 102 scene attributes defined for each of the 14,340 scene images; 717 scene categories.
Clothing Attribute dataset: 26 ground-truth clothing attributes with 1,856 clothing images.
ImageNet Attributes (INA): 9,600 images from 384 categories; each image is annotated with 25 attributes.
Attributes in Videos Attributes in video can be used in: Human action recognition Social activities of a group of people (e.g. YouTube video of a wedding reception). Surveillance
Datasets
Attributes on UIUC Dataset: 22 action attributes are manually defined for each of the 14 human action classes (e.g., walk, hand-clap, jump-forward, jump-jack); 532 videos. Example attributes: standing with arm motion, torso translation with arm motion, leg fold and unfold motion.
Attributes on Mixed Action Dataset: 34 action attributes are manually defined for each of the 21 human action classes; 2,910 videos from the mixed UIUC Action, Weizmann (10 classes, 100 videos), and KTH (6 classes, 2,300 videos) datasets.
Attributes on Olympic Sports Dataset: 39 action attributes are manually defined for each of the 16 human action classes (high-jump, long-jump, triple-jump, pole-vault, basketball lay-up, bowling, tennis-serve, platform diving, discus throw, hammer throw, javelin throw, shot put, springboard diving, snatch (weightlifting), clean and jerk (weightlifting), and gymnastic vault); 781 videos.
A Unified Multiplicative Framework for Attribute Learning, ICCV 2015
Motivation
Traditionally, computer vision has focused on object recognition, classification, segmentation, retrieval, and so on.
Recent research shows that visual attributes can benefit traditional learning problems (image search, object recognition, etc.).
But attribute learning is still a challenging problem because:
Attributes are not always predictable directly from input images.
The variation of visual attributes is sometimes large across categories.
Limitations in previous methods
Correlations between attributes are ignored. Naturally, attributes, as properties of objects, are correlated with each other; it is therefore more appropriate to learn all the attributes jointly, e.g., by sharing attribute-specific parameters or common semantic representations.
Some attributes are hard or even impossible to predict from visual appearance alone. For example, it is impossible to infer color-relevant attributes from a grayscale image, or to predict whether an animal is fast or slow from a still image.
Negative attribute correlation between object and scene. In weakly supervised attribute learning, the input image contains both object and scene, and the scene sometimes has attributes that are negatively related to the object's attributes. For example, a traditional attribute classifier may predict a polar bear swimming in the ocean to have the blue attribute.
The visual appearance of an attribute varies across categories.
Main Contribution Propose a unified multiplicative framework for attribute learning to tackle all the discussed limitations.
Approach Outline The image and category vectors in the unified common space interact multiplicatively to predict the attributes.
Details of Proposed Approach
N labeled training images, where x_i ∈ R^D denotes the D-dimensional image feature vector, a_i ∈ {0, 1}^T indicates the absence or presence of all T binary attributes, and y_i ∈ {0, 1}^C is the label vector, where C is the number of classes.
The training images can be expressed in matrix form as X = [x_1, x_2, ..., x_N]; similarly for the attribute matrix A ∈ {0, 1}^{T×N} and the class label matrix Y ∈ {0, 1}^{C×N}.
Multiplicative Attribute Learning
Transform training images and labels into a shared feature space. Images X and labels Y are parameterized by W and U: Wx_i and Uy_i represent the feature representation of image x_i and its class information, respectively.
In the multi-task learning framework, the t-th (t = 1, ..., T) task is the binary classifier for learning the t-th attribute.
The discriminative function of the t-th attribute of an object in image x_i is f_t(x_i, y_i) = v_t^T (Wx_i ∘ Uy_i), where v_t denotes the parameters for the t-th classifier in the latent space and ∘ is the element-wise (Hadamard) product.
Wx_i learns a better visual representation of image x_i to facilitate attribute classification. The component Uy_i is used as a gate on the attribute classifier v_t to transfer knowledge from category information. During training, all the parameters are learned to automatically decide how to leverage image, attribute, and category information.
They use logistic regression to jointly learn all the attributes. The loss function is defined as the negative log-likelihood: L = −Σ_i Σ_t [a_ti log g(f_t(x_i, y_i)) + (1 − a_ti) log(1 − g(f_t(x_i, y_i)))], where W and U are shared across all images and tasks, a_ti represents the absence or presence of the t-th attribute in image x_i, and g(x) is the sigmoid function.
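The multiplicative predictor can be sketched in a few lines; all names, shapes, and the latent dimension below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# f_t(x_i, y_i) = v_t^T (W x_i ∘ U y_i), for all T attributes at once.
rng = np.random.default_rng(0)
D, C, K, T = 8, 5, 5, 3                      # feature dim, #classes, latent dim, #attributes
W = rng.normal(scale=0.1, size=(K, D))       # image projection
U = rng.normal(scale=0.1, size=(K, C))       # category projection (the "gate")
V = rng.normal(scale=0.1, size=(K, T))       # column v_t per attribute classifier

x_i = rng.normal(size=D)                     # image feature
y_i = np.zeros(C); y_i[2] = 1.0              # one-hot category label

scores = V.T @ (W @ x_i * (U @ y_i))         # element-wise (Hadamard) interaction
probs = sigmoid(scores)                      # g(f_t): attribute probabilities
```

The elementwise product is what makes the framework multiplicative: the category term Uy_i rescales each latent coordinate of the image representation before the attribute classifiers see it.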
Objective function is defined as
Category-Specific Attribute Classifier
The discriminative function can be expressed with U_j, the j-th column of U, and y_ji, the binary category label indicating whether image x_i belongs to category j.
Train a multi-class softmax classifier by minimizing its loss function. At the test stage, the category can be estimated as the most probable class under the softmax.
With the estimated category information, they also predict the attributes of x by marginalizing over the category label, where e_j denotes the one-hot vector with its single nonzero coordinate, of value 1, in the j-th position.
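Marginalizing over categories can be sketched as below: weight each category-specific attribute prediction by the softmax category posterior. All names, shapes, and the separate softmax parameters are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(1)
D, C, K, T = 8, 5, 5, 3
W = rng.normal(scale=0.1, size=(K, D))
U = rng.normal(scale=0.1, size=(K, C))
V = rng.normal(scale=0.1, size=(K, T))
W_cls = rng.normal(scale=0.1, size=(C, D))   # assumed multi-class softmax classifier

x = rng.normal(size=D)
p_cat = softmax(W_cls @ x)                   # p(y = j | x)

# U @ e_j is just the j-th column of U, so marginalize column by column:
# p(a_t = 1 | x) = sum_j p(y = j | x) * g(v_t^T (W x ∘ U_j))
p_attr = np.zeros(T)
for j in range(C):
    p_attr += p_cat[j] * sigmoid(V.T @ (W @ x * U[:, j]))
```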
Instance-specific attribute classifier Jointly train the multiclass classification model and attribute classifiers. After joint training, we obtain instance-specific attribute classifiers for x i :
The instance-specific classifier is a linear combination of all the category-specific attribute classifiers. For zero-shot learning, the instance-specific attribute classifier for an image from an unseen category can be estimated from the category-specific attribute classifiers of all the seen categories.
Optimization
Traditional multiplicative models are optimized with alternating optimization algorithms: convert the main problem into sub-problems and optimize one parameter in each sub-problem with the others fixed. This process is alternated until it converges to a local minimum.
They also use alternating optimization to minimize their objective function.
The parameters W and V are initialized using an SVD of the logistic regression classifier parameters. The derivatives of the objective function w.r.t. the parameters involve the Hadamard product, denoted ∘. To estimate the optimal value of one matrix with the other two fixed, they use the L-BFGS algorithm.
Enhancing Category Information
Attributes are usually hard to define and costly to acquire. To counter the small-scale attribute dataset problem, they boost attribute learning by enhancing category information. Suppose there are two types of training data, X and X_a: the former has both attribute and category labels, while the latter has only category labels. The objective function is then rewritten to combine both.
Experiments Datasets Animal with Attributes apascal/ayahoo CUB (Caltech-UCSD-Bird) ImageNet Attributes
For category-level attribute definition, they use Animals with Attributes and CUB. For instance-level attribute definition, aPascal/aYahoo and ImageNet Attributes are used. For attribute prediction, each dataset is randomly split into training, validation, and testing sets. The dimension of the latent space is set to the minimum of the number of categories and the number of attributes.
They use 4096-D DeCAF features extracted from a CNN. Metrics are mean area under the curve and mean classification accuracy. For zero-shot learning, they use the specified seen and unseen classes of AwA; the CUB dataset is split into 150 seen and 50 unseen classes. Performance is measured by normalized multi-class accuracy.
Category-level Attribute Prediction
Instance-level Attribute Prediction Enhancing Instance-level Attribute Prediction:
Category-Sensitive Attribute Prediction
Zero-Shot Learning
Recognizing images from unseen classes based on transferred attribute concepts is referred to as zero-shot learning. Assume K seen classes {y_1, y_2, ..., y_K} and L unseen classes {z_1, z_2, ..., z_L}. Attribute classifiers are learned on the K seen classes. During testing, the unseen category of an image x is determined by the posterior probability.
The class prior p(z_l) is assumed identical for all classes. Attribute priors are defined on the slide, along with the attribute-predictive probability of their method.
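The zero-shot decision rule can be sketched in the DAP style implied by the slides: with a uniform class prior, pick the unseen class whose binary attribute signature best matches the predicted attribute probabilities. The function and data below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def zero_shot_predict(p_attr, signatures, eps=1e-12):
    """Return the index of the most probable unseen class.

    p_attr:     (T,) predicted probabilities p(a_t = 1 | x)
    signatures: (L, T) binary attribute signatures of the L unseen classes
    """
    # Log-posterior per class under independent attributes and a uniform prior.
    log_post = (signatures * np.log(p_attr + eps)
                + (1 - signatures) * np.log(1 - p_attr + eps)).sum(axis=1)
    return int(np.argmax(log_post))

# Toy example: class 0 expects attributes (1, 0, 1), class 1 expects (0, 1, 0).
p_attr = np.array([0.9, 0.1, 0.8])
signatures = np.array([[1, 0, 1],
                       [0, 1, 0]])
pred = zero_shot_predict(p_attr, signatures)  # → 0, matching class 0's signature
```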
Conclusion Model explicitly captures the relationship among image, attribute and category in a multiplicative way in the latent feature space. Achieves better performance on four datasets. Reduces the effort of instance-level attribute annotation. Improves the accuracy of zero-shot learning.
Future Work Scene Recognition Image Retrieval Object Classification Precise image descriptions for human interpretation