Deep Deformable Patch Metric Learning for Person Re-identification

Sławomir Bak, Peter Carr
Disney Research, Pittsburgh, PA, USA, 15213
{slawomir.bak,peter.carr}@disneyresearch.com

Abstract—The methodology for finding the same individual in a network of cameras must deal with significant changes in appearance caused by variations in illumination, viewing angle and a person's pose. Re-identification requires solving two fundamental problems: (1) determining a distance measure between features extracted from different cameras that copes with illumination changes (metric learning); and (2) ensuring that matched features refer to the same body part (correspondence). Most metric learning approaches focus on finding a robust distance measure between bounding box images, neglecting the alignment aspects. In this paper, we propose to learn appearance measures for patches that are combined using deformable models. Learning metrics for patches avoids strong dimensionality reduction, thus keeping more information. Additionally, we allow the patches to change their locations, directly addressing the correspondence problem. As patches from different locations may share the same metric, our method effectively multiplies the amount of training data and allows metrics to be learned from smaller amounts of labeled images. Different metric learning approaches (KISSME, XQDA, LSSL) together with different deformable models (spring constraints, one-to-one matching constraints) are investigated and compared. For describing patches, we propose to learn a deep feature representation with Convolutional Neural Networks (CNNs), thus obtaining highly effective features for re-identification. We demonstrate that our approach significantly outperforms state-of-the-art methods on multiple datasets.

Index Terms—metric learning, deformable models.

1 INTRODUCTION
PERSON re-identification is the problem of recognizing the same individual across a network of cameras. In a real-world scenario, the transition time between cameras may significantly decrease the search space, but temporal information alone is usually not sufficient to solve the problem. As a result, visual appearance models have received a lot of attention in computer vision research [], [2], [26], [29], [], [41], [47]. The underlying challenge for visual appearance is that the models must work under significant appearance changes caused by variations in illumination, viewing angle and a person's pose.

Metric learning approaches often achieve the best performance in re-identification. These methods learn a distance function between features from different cameras such that relevant dimensions are emphasized while irrelevant ones are ignored. Many metric learning approaches [10], [19], [24] divide a bounding box pedestrian image into a fixed grid of regions and extract descriptors which are then concatenated into a high-dimensional feature vector. Afterwards, dimensionality reduction is applied, and then metric learning is performed on the reduced subspace of differences between feature vectors. To avoid overfitting, the dimensionality must be significantly reduced. In practice, the subspace dimensionality is about three orders of magnitude smaller than the original. Such strong dimensionality reduction might result in a loss of discriminative information.

Copyright © 2017 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.

Fig. 1: Full bounding box metric learning (ML) vs. deformable patch metric learning (DPML).
The corresponding patches in the grid (highlighted in red) do not correspond to the same body part because of a pose change. Information from such misaligned features might be lost during the metric learning step. Instead, our DPML deforms to maximize similarity using metrics learned on a patch level.

Additionally, features extracted on a fixed grid (see Fig. 1) may not correspond even though it is the same person (e.g. due to a pose change). Metric learning is unable to recover this lost information. In this paper, instead of learning a metric for concatenated features extracted from full bounding boxes from different cameras, we propose to learn metrics for 2D patches. Learning metrics for patches is less prone to overfitting (because of the lower dimensionality) and it requires less compression. As a result it keeps more information.

Furthermore, we do not assume the patches must be located on a fixed grid. Our model allows patches to perturb their locations when computing the similarity between two images (see Fig. 1). This model is inspired by part-based object detection [12], [43], which decomposes an appearance model into local templates with geometric constraints (conceptualized as springs). Our main contributions are:
- We propose to learn metrics locally, on feature vectors extracted from patches. These patch metrics can be combined into a unified distance measure.
- We introduce two deformable patch-based models for accommodating pose changes and occlusions: (1) an unsupervised deformable model that introduces a global one-to-one matching constraint solved by a linear assignment problem, and (2) a supervised deformable model that combines an appearance term with a deformation cost that controls the relative placement of patches.
- For describing patches, we propose to learn a deep feature representation with Convolutional Neural Networks (CNNs). The CNN is learned through a challenging multi-class identification task: we force the CNN to recognize not only the person identity from which the patch has been extracted but also the patch location. This results in a highly effective representation, significantly improving re-identification accuracy.
Our experiments illustrate the merits of patch-based techniques and achieve new state-of-the-art performance on multiple datasets, outperforming existing approaches by large margins.

2 RELATED WORK
Person re-identification approaches can be divided into two groups: feature modeling [4], [11], which designs descriptors (usually handcrafted) that are robust to changes in imaging conditions, and metric learning [1], [10], [19], [23], [24], [42], [50], which searches for effective distance functions to compare features from different cameras.

Robust features can be modeled by adopting perceptual principles of symmetry and asymmetry of the human body [11]. The correspondence problem can be approached by locating body parts [4], [8] and extracting local descriptors (color histograms [8], color invariants [21], covariances [4], CNN features [29]). However, to find a proper descriptor, we need to look for a trade-off between its discriminative power and its invariance between cameras. This task can be considered a metric learning problem that maximizes the inter-class variation while minimizing the intra-class variation. Many different machine learning algorithms have been considered for learning a robust similarity function. Gray et al. employed Adaboost for feature selection and weighting [14], and Prosser et al. defined person re-identification as a ranking problem and used an ensemble of RankSVMs [32]. Recently, features learned from deep convolutional neural networks have been investigated [1], [7], [23], [3], [37], [40], [48]. However, the most common choice for learning a metric remains the family of Mahalanobis distance functions. These include Large Margin Nearest Neighbor Learning (LMNN) [39], Information Theoretic Metric Learning (ITML) [9] and Logistic Discriminant Metric Learning (LDML) [15]. These methods usually aim at improving k-NN classification by iteratively adapting the metric. In contrast to these iterative methods, Köstinger [19] proposed the KISS metric, which uses a statistical inference based on a likelihood-ratio test of two Gaussian distributions modeling positive and negative pairwise differences between features. Owing to its effectiveness and efficiency, the KISS metric is a popular baseline that has been extended with linear [25], [] and nonlinear [29], [41] subspace embeddings. Most of these approaches learn a Mahalanobis distance function for feature vectors extracted from full bounding box images.
Integration of feature learning directly into the metric learning approach has been proposed in [38]: a Mahalanobis-like function together with the feature representation is learned in a novel end-to-end framework throughout a triplet embedding. Recently, a trend of learning similarity measures for patches [2], [3], [33], [34], [51] has emerged. Operating on patches allows to directly address person pose variations and camera viewpoint changes. Shen et al. [33] learn a correspondence structure that captures spatial correspondence patterns across camera viewpoints. The correspondence structure is represented by patch-wise matching probabilities learned using a boosting-like approach. Patch-wise correspondence is also introduced in [34]: first the body is divided into upper and lower body parts, and then trees are independently constructed to find the correspondence. Zheng et al. [51] show that introducing a patch-level matching model based on a sparse representation can help in handling inaccurate person detectors as well as large amounts of occlusion.

This paper is based on our previous work [2], where we proposed to learn dissimilarity functions for patches in bounding boxes and then combine their scores into a robust distance measure. We have shown that our approach has clear advantages over existing algorithms. In this paper we continue our analysis by evaluating additional parameters (e.g. the size of patches and their layouts) and by employing novel metric learning approaches. We additionally investigate an unsupervised deformable model based on a one-to-one matching constraint. Finally, we propose to learn patch features directly from data through a challenging multi-class identification task employing a CNN model [40]. This results in a highly effective representation that brings a significant improvement in performance. Compared to state-of-the-art methods, our approach yields significantly higher recognition accuracy.

3 METHOD
Often the dissimilarity Ψ(i, j) between two bounding box images i and j taken from different cameras is defined as a Mahalanobis metric. The Mahalanobis metric measures the squared distance between the feature vectors extracted from these bounding box images, x_i and x_j:

    Ψ(i, j) = d²(x_i, x_j) = (x_i − x_j)ᵀ M (x_i − x_j),   (1)

where M is a matrix encoding the basis for comparison. M is usually learned in two stages: dimensionality reduction is first applied on x_i and x_j (e.g. principal component analysis, PCA), and then metric learning (e.g. the KISS metric [19]) is performed on the reduced subspace. To avoid overfitting, the dimensionality must be significantly reduced to keep the number of free parameters low [16], [25]. In practice, x_i and x_j are high-dimensional feature vectors and their reduced dimensionality is usually about three orders of magnitude smaller than the original [19], [25], [29]. Such strong dimensionality reduction might lose discriminative information, especially in the case of misaligned features in x_i and x_j (e.g. the highlighted patches in Fig. 1).
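To make this baseline concrete, the following is a minimal sketch of the two-stage pipeline (PCA followed by the squared Mahalanobis distance of Eq. (1)), written in Python/NumPy; the function names are hypothetical and the matrix M is assumed to come from some metric learner:

    import numpy as np

    def fit_pca(X, dim):
        # X: (n_samples, d) concatenated bounding-box features.
        mu = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
        return mu, Vt[:dim].T            # mean and (d, dim) projection

    def mahalanobis_sq(x_i, x_j, M):
        # Eq. (1): squared Mahalanobis distance under metric M.
        d = x_i - x_j
        return float(d @ M @ d)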

We propose to learn a metric for matching patches in a bounding box. We perform dimensionality reduction on features extracted from each patch. The reduced dimensionality is usually only one order of magnitude smaller than the original one, thus keeping more information (see Section 4). In Section 3.1 we offer a patch-based metric learning. Section 3.2 introduces three state-of-the-art Mahalanobis-like metric learning approaches: KISSME [19], XQDA [25] and LSSL [42]. In Section 3.3 we propose two methods for integrating patch metrics into a single similarity measure, and in Section 3.4 we show how to learn a very effective patch representation with Convolutional Neural Networks.

3.1 Patch-based Metric Learning
We divide the bounding box image i into a dense grid of overlapping rectangular patches. From each patch location k we extract a feature vector p_i^k. We represent the bounding box image i as an ordered set of patch features X_i = {p_i^1, p_i^2, …, p_i^K}, where K is the number of patches. Usually, in standard metric learning approaches [19], [26], [29], [40], these patch descriptors are further concatenated into a single high-dimensional feature vector (e.g. x_i = [p_i^1 p_i^2 … p_i^K]) and metric learning together with dimensionality reduction is then performed. Instead, we learn a dissimilarity function Φ for feature vectors extracted from patches. Patch dissimilarities are further combined into a unified dissimilarity by an integration function Z (see Section 3.3). We define the dissimilarity between two images i and j as

    Ψ(i, j) = Z_{k,l ∈ 1…K}( Φ(p_i^k, p_j^l; θ(k)) ),   (2)

where p_i^k and p_j^l are feature vectors extracted from patches at locations k and l, respectively, in bounding box images i and j. Images i and j are assumed to come from different cameras. The set of parameters θ determines the function Φ and is learned using a given metric learning approach (see Section 3.2). Notice that Ψ(i, j) is defined as an asymmetric dissimilarity measure due to the k dependency. If symmetry is a concern, one can redefine the final dissimilarity as a function of both Ψ(i, j) and Ψ(j, i), e.g. Ψ′(i, j) = min(Ψ(i, j), Ψ(j, i)).

Although it is possible to learn one metric for each patch location k, this might be too many degrees of freedom. In practice, multiple locations might share a common metric, and in the extreme case a single θ could be learned for all locations. We investigated re-identification performance with different numbers of patch metrics (see Section 4.2.1) and found that in some cases multiple metrics might perform better than a single one. Regions with statistically different amounts of background noise should have different metrics (e.g. patches close to the head contain more background noise than patches close to the torso). However, we also found that recognition performance is a function of the available training data (see Section 4.2.1), which limits the number of metrics that can be learned efficiently. In the standard approach, a pair of bounding boxes corresponds to a single training example. Breaking a bounding box into a set of patches increases the amount of training data if a reduced number of metrics is learned (i.e. some locations k share the same metric/parameters θ). When a single θ is learned, the amount of training data increases by combining patches from all K locations into a single set (K times more positive examples for learning a metric compared to the standard approach). In the experiments we show that this can significantly boost performance when the training dataset is small (e.g. the i-LIDS dataset).

3.2 Metric learning (Φ)
Given pairs of sample bounding boxes (i, j), we introduce the space of pairwise differences p_ij^k = p_i^k − p_j^k and partition the training data into p_ij^{k+} when i and j are bounding boxes containing the same person, and p_ij^{k−} otherwise. Note that for learning we use differences of patches from the same location k only.
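A small sketch of this difference space, assuming patch features are stored as NumPy arrays and same_id is a boolean mask over pairs (names hypothetical):

    import numpy as np

    def difference_space(P_a, P_b, same_id):
        # P_a, P_b: (n_pairs, K, d) patch features from the two cameras;
        # same_id: (n_pairs,) boolean array, True when a pair shows one person.
        diffs = P_a - P_b                       # p_ij^k for every location k
        return diffs[same_id], diffs[~same_id]  # positive / negative differences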
3.2.1 KISS metric learning
Köstinger et al. [19] proposed an effective and efficient way of learning a Mahalanobis metric by assuming a Gaussian structure of the difference space (i.e. of p_ij^k). When employing KISSME, our dissimilarity measure becomes

    Φ(p_i^k, p_j^k; θ(k)) = (p_i^k − p_j^k)ᵀ M^(k) (p_i^k − p_j^k),   (3)

thus θ(k) = {M^(k)}. To learn M^(k) we follow Köstinger [19]: we assume a zero-mean Gaussian structure on the difference space and employ a log-likelihood ratio test. This results in

    M^(k) = Σ_{k+}⁻¹ − Σ_{k−}⁻¹,   (4)

where Σ_{k+} and Σ_{k−} are the covariance matrices of p_ij^{k+} and p_ij^{k−}, respectively:

    Σ_{k+} = Σ (p_ij^{k+})(p_ij^{k+})ᵀ,   (5)
    Σ_{k−} = Σ (p_ij^{k−})(p_ij^{k−})ᵀ.   (6)

Computing Eq. (4) requires inverting two covariance matrices. In practice, as the p_i^k are still relatively high-dimensional (see Sec. 4.2.3), Σ_{k+} is often singular, so Σ_{k+}⁻¹ cannot be computed. As a result, dimensionality reduction on p_i^k is usually applied (e.g. PCA), which allows to invert Σ_{k+}. Keeping the dimensionality low also avoids overfitting. One can find the optimal number of principal components using cross-validation techniques. Liao et al. [25] proposed an alternative solution, referred to as XQDA, that simultaneously learns the metric M and a low-dimensional subspace W.

3.2.2 XQDA metric learning
Using XQDA [25], the dissimilarity can be written as

    Φ(p_i^k, p_j^k; θ(k)) = (p_i^k − p_j^k)ᵀ W^(k) M^(k) (W^(k))ᵀ (p_i^k − p_j^k),   (7)

where

    M^(k) = ((W^(k))ᵀ Σ_{k+} W^(k))⁻¹ − ((W^(k))ᵀ Σ_{k−} W^(k))⁻¹.   (8)

The original feature dimension d of p_i^k is reduced by the subspace W^(k) ∈ R^{d×r}. W^(k) is learned using a Generalized Rayleigh Quotient objective, solved by a generalized eigenvalue decomposition problem similar to LDA [25]. Notice that Σ_{k+} is computed in the original d-dimensional space, so its singularity remains a problem. Liao et al. propose to add a small regularizer to the diagonal elements of Σ_{k+}, which is a common trick in LDA-like problems. This makes the estimation of Σ_{k+} smoother and more robust. As a result, the learning parameters become θ(k) = {W^(k), M^(k)}.
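A minimal sketch of KISS metric learning for one patch location (Eqs. (4)–(6)), assuming PCA has already been applied so that the positive covariance is invertible; normalizing the empirical covariances by the number of pairs is an assumption consistent with the usual KISSME implementation:

    import numpy as np

    def kiss_metric(pos_diffs, neg_diffs):
        # pos_diffs, neg_diffs: (n, d) difference vectors p_ij^k for one
        # location k. Returns M^(k) = inv(Sigma_k+) - inv(Sigma_k-).
        sigma_pos = pos_diffs.T @ pos_diffs / len(pos_diffs)
        sigma_neg = neg_diffs.T @ neg_diffs / len(neg_diffs)
        return np.linalg.inv(sigma_pos) - np.linalg.inv(sigma_neg)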

3.2.3 LSSL metric learning
Yang et al. [42] introduced large scale similarity learning (LSSL), which combines the feature difference (p_ij^k = p_i^k − p_j^k) with the commonness (q_ij^k = p_i^k + p_j^k), thus producing a more discriminative measure. The main idea comes from insights found in a 2-dimensional Euclidean space. Consider an l2-normalized 2-dimensional feature space. Notice that for similar vectors (i and j containing the same person) p_ij^k is expected to be small while q_ij^k should be very large; on the contrary, for dissimilar vectors (i and j containing different people) p_ij^k is expected to be large while q_ij^k should be relatively small. Therefore, by combining difference and commonness we can expect a more discriminative metric compared to metric learning methods that only employ the differences p_ij^k. The dissimilarity measure then becomes

    Φ(p_i^k, p_j^k; θ(k)) = (p_ij^k)ᵀ M_p^(k) p_ij^k − λ (q_ij^k)ᵀ M_q^(k) q_ij^k,   (9)

where both M_p^(k) and M_q^(k) can be inferred analogically to KISS metric learning [19]. Yang et al. [42] show further that, based on a pair-constrained Gaussian assumption, the covariances for pairs containing different people (of p_ij^k and q_ij^k) can be directly deduced from image pairs containing the same person (for details see [42]). The parameter λ balances the difference and commonness of feature vectors. Similarly to [42], we set λ = 1.5 in all experiments. As a result, the learning parameters become θ(k) = {M_p^(k), M_q^(k)}, and PCA is applied on p_i^k to avoid the covariance singularity problem.

3.3 Integrated dissimilarities for images (Z)
To compute the total dissimilarity between two bounding box images i and j, we propose several strategies for aggregating the metrics learned for patches. First, we introduce a rigid model (PML) to illustrate that learning metrics for patches keeps more information by avoiding strong dimensionality reduction (Section 3.3.1). Additionally, learning metrics on a patch level might effectively multiply the amount of training data, yielding a significant boost in recognition performance for smaller datasets.

Pose changes and different camera viewpoints make re-identification more difficult, as features extracted on a fixed grid may not correspond even though it is the same person. Breaking a bounding box image into patches allows us to introduce deformable models that can effectively cope with pose changes, enabling patches in one bounding box to perturb their locations (deform) when matching another bounding box. Independently of the metric learning, our task is to find a strategy that can perturb patch locations to simulate pose changes. We investigate two deformable models: (1) an unsupervised deformable model with a one-to-one matching constraint (HPML) that does not require any additional training apart from metric learning (Section 3.3.2), and (2) a supervised deformable model with geometric constraints (conceptualized as springs) (DPML) that we train by introducing an optimization problem as a relative distance comparison of triplets (Section 3.3.3).

3.3.1 Rigid model (PML)
We combine the patch dissimilarity scores by summing over all K patches:

    Z_PML = Σ_{k=1…K} Φ(p_i^k, p_j^k; θ(k)).   (10)

Fig. 2: Deformable models: the K×K cost matrix, which is used as input to the Hungarian algorithm for finding the optimal one-to-one correspondence. The dissimilarity between two patches becomes ∞ if the distance η(·, ·) between their spatial locations is greater than an assumed threshold δ.

Compared to the standard approach (e.g. in the case of the KISS metric), this is equivalent to learning a block-diagonal matrix:

    Z_PML = [p_ij^1ᵀ, p_ij^2ᵀ, …, p_ij^Kᵀ] diag(M_1, M_2, …, M_K) [p_ij^1; p_ij^2; …; p_ij^K],   (11)

where all M^(k) are learned independently. We refer to this formulation as PML.
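The rigid combination of Eq. (10) is then a short loop; a sketch, assuming the per-location matrices M^(k) have already been learned (several locations may share one matrix):

    import numpy as np

    def z_pml(P_i, P_j, metrics):
        # P_i, P_j: (K, d) patch features of two bounding boxes;
        # metrics[k]: (d, d) matrix M^(k). Implements Eq. (10).
        diffs = P_i - P_j
        return sum(float(d @ M @ d) for d, M in zip(diffs, metrics))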
3.3.2 Unsupervised Deformable Model (HPML)
Patch-based methods [2], [34] often allow patches to adjust their locations when comparing two bounding box images. Sheng et al. [34] assumed the correspondence structure to be fixed and learned it using a boosting-like approach. Instead, we define the correspondence task as a linear assignment problem. Given K patches from bounding box image i and K patches from bounding box image j, we create a K×K cost matrix that contains patch similarity scores within a fixed neighborhood (see Fig. 2). To avoid patches freely changing their location, we introduce a global one-to-one matching constraint and solve the linear assignment problem

    Ω*_ij = argmin_{Ω_ij} Σ_{k=1…K} [ Φ(p_i^{Ω_ij(k)}, p_j^k; θ(k)) + Δ(Ω_ij(k), k) ],
    s.t. Δ(Ω_ij(k), k) = ∞ if η(Ω_ij(k), k) > δ, and 0 otherwise,   (12)

where Ω_ij is a permutation vector mapping patches p_i^{Ω_ij(k)} to patches p_j^k, Ω_ij(k) and k determine patch locations, Δ(·, ·) is a spatial regularization term that constrains the search neighborhood, η corresponds to the distance between two patch locations, and the threshold δ determines the allowed displacement (different δ's are evaluated in Fig. 15(a)). We find the optimal assignment Ω*_ij (the patch correspondence) using the Kuhn-Munkres (Hungarian) algorithm [20]. This yields the total dissimilarity

    Z_HPML = Σ_{k=1…K} Φ(p_i^{Ω*_ij(k)}, p_j^k; θ(k)).   (13)

We refer to this formulation as HPML.
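Eqs. (12)–(13) map directly onto an off-the-shelf linear assignment solver. A sketch using SciPy's Hungarian implementation, with a large finite constant standing in for the infinite cost:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def z_hpml(P_i, P_j, locs, metrics, delta):
        # P_i, P_j: (K, d) patch features; locs: (K, 2) patch coordinates;
        # delta: allowed displacement. Builds the cost matrix of Fig. 2.
        K = len(P_i)
        cost = np.full((K, K), 1e12)      # "infinite" outside the neighborhood
        for k in range(K):                # target patch p_j^k
            for a in range(K):            # candidate patch p_i^a
                if np.linalg.norm(locs[a] - locs[k]) <= delta:
                    d = P_i[a] - P_j[k]
                    cost[a, k] = float(d @ metrics[k] @ d)
        row, col = linear_sum_assignment(cost)   # Kuhn-Munkres (Hungarian)
        return float(cost[row, col].sum())       # Eq. (13)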

3.3.3 Supervised Deformable Model (DPML)
We employ a model which approximates continuous non-affine warps by translating 2D templates [12], [43] (see Fig. 1). We use a spring model to limit the displacement of patches. The deformable dissimilarity score for matching the patch at location k in bounding box i with bounding box j is defined as

    ψ(p_i^k, j) = min_l [ Φ(p_i^k, p_j^l; θ(k)) + α_k Δ(k, l) ],   (14)

where the feature p_j^l is extracted from bounding box j at location l; the appearance term Φ(p_i^k, p_j^l; θ(k)) computes the feature dissimilarity between patches, and the deformation cost α_k Δ(k, l) refers to a spring model that controls the relative placement of patches k and l. Δ(k, l) is the squared distance between the patch locations. α_k encodes the rigidity of the spring: α_k = ∞ corresponds to a rigid model, while α_k = 0 allows a patch to change its location freely. Notice the difference with HPML, whose definition allows us to perform discrete optimization (Ω* stands for the optimal global one-to-one assignment). For DPML we define a continuous function: we first optimize the alignment locally (ψ(p_i^k, j)) and then combine these deformable dissimilarity scores into a unified dissimilarity measure

    Z_DPML = Σ_{k=1…K} w_k ψ(p_i^k, j) = ⟨w, ψ_ij⟩,   (15)

where w is a vector of weights and ψ_ij corresponds to a vector of patch dissimilarity scores.

Learning α_k and w: Similarly to [29], we define the optimization problem as a relative distance comparison of triplets {i, j, z} such that ⟨w, ψ_iz⟩ > ⟨w, ψ_ij⟩ for all i, j, z, where i and j correspond to bounding boxes extracted from different cameras containing the same person, and i and z are bounding boxes from different cameras containing different people. Unfortunately, Eq. 14 is non-convex and we cannot guarantee avoiding local minima. In practice, we use a limited number of unique spring constants α_k and apply a two-step optimization. First, we optimize α_k with w = 1 by performing an exhaustive grid search (see Section 4.3) while maximizing the Rank-1 recognition rate. Second, we fix α_k and determine the best w using structural SVMs [18]. This approach is referred to as DPML.
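A sketch of the deformable score of Eqs. (14)–(15), assuming the spring constants alpha[k] and weights w have been learned as described above:

    import numpy as np

    def z_dpml(P_i, P_j, locs, metrics, alpha, w):
        # Each patch k of image i matches its best location l in image j,
        # paying the spring penalty alpha[k] * ||loc_k - loc_l||^2 (Eq. (14));
        # the K local scores are combined with the weights w (Eq. (15)).
        K = len(P_i)
        psi = np.empty(K)
        for k in range(K):
            diffs = P_i[k] - P_j                               # against all l
            appearance = np.einsum('ld,de,le->l', diffs, metrics[k], diffs)
            spring = alpha[k] * ((locs - locs[k]) ** 2).sum(axis=1)
            psi[k] = (appearance + spring).min()
        return float(w @ psi)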
3.4 Deep patches
It is common practice in person re-identification to combine handcrafted color and texture descriptors for describing image regions, and to then let metric learning discover the relevant features and discard the irrelevant ones. Often, color histograms in different color spaces together with SIFT-like features are concatenated into high-dimensional feature vectors [2]. Xiao et al. [40] showed that CNN models can also be applied effectively to person re-identification despite insufficient data. They proposed to train a CNN jointly with data from multiple datasets and to then fine-tune the model to a given camera pair using a domain-guided dropout strategy. In this work we adopt the CNN model from [40], but instead of training it on whole images, we train it on patches to obtain a highly robust patch feature representation. This model learns a set of high-level feature representations through a challenging multi-class identification task, i.e., classifying a training image into one of C identities. As the generalization capabilities of the learned features increase with the number of classes predicted during training [36], we need C to be relatively large (e.g. several thousand).

Fig. 3: Deep patch feature learning with the CNN (fc7 followed by a soft-max layer): each image is divided into a set of 8 non-overlapping patches. The identity of each patch is extended by its patch location. As a result, the CNN is forced to recognize not only the person identity but also the patch location.

While training the CNN on patches, we modify the training strategy. First, each image is divided into a set of 8 non-overlapping patches of size height/4 × width/2, and then each patch (although it comes from the same image, but from a different location) gets assigned a new identity. As a result, the CNN model is forced to determine not only the person identity from which the patch has been extracted but also the patch location. Given a dataset of images with M identities, the task becomes to classify patches into C = 8M identities. When the CNN is trained to classify a large number of identities and is configured to keep the dimension of the last hidden layer relatively low (e.g., setting the number of dimensions of fc7 to 256 [40]), it forms compact and highly robust feature representations for re-identification. We found that the learned deep feature representation is very effective, and combined with metric learning approaches it significantly outperforms state-of-the-art techniques. Figure 3 explains the training of our deep features.
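The relabeling is simple to state in code; a sketch (the 4×2 split and the class indexing are assumptions consistent with the description above):

    def split_patches(img, rows=4, cols=2):
        # Divide an image array (H, W, C) into 8 non-overlapping patches
        # of size H/4 x W/2, as used for deep feature training.
        H, W = img.shape[:2]
        ph, pw = H // rows, W // cols
        return [img[r*ph:(r+1)*ph, c*pw:(c+1)*pw]
                for r in range(rows) for c in range(cols)]

    def patch_class(person_id, patch_idx, num_patches=8):
        # Each (identity, location) pair becomes its own class: C = 8M.
        return person_id * num_patches + patch_idx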

4 EXPERIMENTS
We carry out experiments on four challenging datasets: VIPeR [13], i-LIDS [49], CUHK01 [22] and CUHK03 [23]. The results are analyzed in terms of recognition rate, using the cumulative matching characteristic (CMC) [13] curve and its rank-1 accuracy. The CMC curve represents the expectation of finding the correct match in the top r matches. The curve can be characterized by a scalar value computed by normalizing the area under the curve, referred to as the nAUC value. Section 4.1 describes the benchmark datasets used in the experiments. We explore our rigid patch metric model (PML) together with its parameters, including different metric learning approaches (θ), in Section 4.2. The deformable models (HPML and DPML) are discussed in Section 4.3. Finally, in Section 4.4, we compare our performance with other state-of-the-art methods.

4.1 Datasets
VIPeR [13] is one of the most popular person re-identification datasets. It contains 632 image pairs of pedestrians captured by two outdoor cameras. The images contain large variations in lighting conditions, background, viewpoint, and image quality (see Fig. 4). Each bounding box is cropped and scaled to 128×48 pixels. We follow the common evaluation protocol for this database: randomly dividing the 632 image pairs into 316 image pairs for training and 316 image pairs for testing. We repeat this procedure 10 times and compute average CMC curves to obtain reliable statistics.

Fig. 4: Sample images from the VIPeR dataset. Top and bottom rows correspond to images from different cameras. Columns illustrate the same person.

i-LIDS [49] consists of 119 individuals with 476 images. This dataset is very challenging since there are many occlusions. Often only the top part of a person is visible, and usually there is a significant scale or viewpoint change as well (see Fig. 5). We follow the evaluation protocol of [29]: the dataset is randomly divided into 60 image pairs used for training, and the remaining 59 image pairs are used for testing. This procedure is repeated 10 times to obtain averaged CMC curves.

Fig. 5: Sample images from the i-LIDS dataset. Top and bottom rows correspond to images from different cameras. Columns illustrate the same person.

CUHK01 [22] contains 971 persons captured with two cameras. For each person, 2 images per camera are provided. The images in this dataset are of better quality and higher resolution than in the two previous datasets. Each bounding box is scaled to 160×60 pixels. The first camera captures the side view of pedestrians and the second camera captures the frontal view or the back view (see Fig. 6). We follow the common evaluation setting: the persons are split into 485 for training and 486 for testing. We repeat this procedure 10 times for computing averaged CMC curves.

Fig. 6: Sample images from the CUHK01 dataset. Top and bottom rows correspond to images from different cameras. Columns illustrate the same person.

CUHK03 [23] is one of the largest published person re-identification datasets. It contains 1467 persons, where each person has 4.8 images on average. The dataset provides both manually cropped bounding box images and automatically detected bounding box images from a pedestrian detector [12]. For evaluation we follow the testing protocol of [23]: the identities are randomly divided into non-overlapping training and test sets. The training set consists of 1367 persons and the test set consists of 100 persons. For testing we only use the automatically detected pedestrians, while training is performed employing both manually cropped and automatically detected images. We follow the single-shot setting.

Training deep features: To learn our deep patch representation we used two datasets: CUHK03 [23] and PRID2011 [17]. From CUHK03 we used the 1367 identities that were randomly selected for training [40]. PRID2011 contains 200 individuals appearing in two cameras; additionally, it contains 185 identities that appear in the first camera but do not reappear in the second one, and 549 identities that appear only in the second camera, in total 934 identities. Merging both datasets, we have M = 2301 identities, and by further dividing the images into a set of 8 non-overlapping patches, the CNN is forced to perform multi-class identification with C = 8 × 2301 = 18408 identities. The dimensionality of the last hidden layer is kept low (256), which stands for our deep patch feature representation. The architecture and training parameters are kept the same as in [40]. Unlike [40], we do not perform any fine-tuning of the deep feature representation on the test datasets. Instead, we propose to perform Mahalanobis metric learning to adjust the metric to particular camera-pair variations.
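Before turning to the results, a sketch of how a CMC curve and its nAUC can be computed from a query-gallery distance matrix (single-shot; the correct gallery match of query q is assumed to sit at index q, and the mean-based nAUC is an approximation):

    import numpy as np

    def cmc_curve(dist, max_rank=50):
        # dist: (n, n) distance matrix; entry [q, q] is the correct match.
        order = np.argsort(dist, axis=1)                  # best match first
        hits = order == np.arange(len(dist))[:, None]
        ranks = np.argmax(hits, axis=1)                   # rank of true match
        return np.array([(ranks < r).mean() for r in range(1, max_rank + 1)])

    def nauc(curve):
        # Normalized area under the CMC curve (approximated by the mean).
        return float(curve.mean())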
4.2 Rigid Patch Metric Learning (PML)
In this section, we first compare our rigid model (PML) with the standard full bounding box approach (BBOX). BBOX is equivalent to the method presented in [19]. Each bounding box of size w × h is divided into a grid of K = 60 overlapping patches of size w/4 × w/2 with stride w/8 × w/4, resulting in a 20×3 layout (different layouts are discussed in Section 4.2.2). In this experiment, patches are represented by concatenated histograms in Lab and HSV color space together with color SIFT (see details on different patch representations in Section 4.2.3). For the full bounding box case, we concatenate the extracted patch feature vectors into one high-dimensional feature vector. PCA is applied to obtain a 62-dimensional feature space (where the optimal dimensionality is found by cross-validation). Then, the KISS metric [19] is learned in the 62-dimensional PCA subspace. For PML, instead of learning a metric for the concatenated feature vector, we learn metrics for the patch features. In this way, we avoid undesirable compression. The dimensionality of the patch feature vector is reduced by PCA (also found by cross-validation; it is only about one order of magnitude smaller than the original) and metrics are learned independently for each patch location. Fig. 7 illustrates the comparison on three datasets. It is apparent that PML significantly improves re-identification performance by keeping a higher number of degrees of freedom when learning the dissimilarity function.
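A sketch of this dense overlapping grid; the concrete sizes in the comment are assumptions consistent with the stated patch size and stride on a 128×48 bounding box:

    import numpy as np

    def patch_grid(img, ph, pw, sh, sw):
        # Patches of size (ph, pw) sampled with stride (sh, sw); returns
        # patches and their top-left coordinates. For a 128x48 box,
        # ph, pw = 12, 24 (w/4 x w/2) and sh, sw = 6, 12 (w/8 x w/4)
        # yield 20 rows x 3 columns = 60 patches.
        H, W = img.shape[:2]
        patches, locs = [], []
        for y in range(0, H - ph + 1, sh):
            for x in range(0, W - pw + 1, sw):
                patches.append(img[y:y+ph, x:x+pw])
                locs.append((y, x))
        return patches, np.array(locs)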

Fig. 7: Performance comparison of Patch-based Metric Learning (PML) vs. full bounding box metric learning (BBOX) on VIPeR, CUHK01 and i-LIDS. Rank-1 identification rates, as well as nAUC values provided in brackets, are shown in the legend next to the method name.

Fig. 8: Performance comparison w.r.t. the number of metrics m. Using different metrics for different image regions yields better performance.

Fig. 9: Dividing image regions into several metrics: (a) nAUC values w.r.t. the location of a learned metric; (b) clustering results for different numbers of clusters m.

4.2.1 Number of Patch Metrics
As mentioned earlier, our formulation allows θ to be learned per patch location. In practice, there may be insufficient training data for this many degrees of freedom. We evaluate two extremes: learning m = 60 independent KISS metrics (one per location) and learning a single KISS metric for all patches (m = 1); see Fig. 8. The results indicate that multiple metrics lead to significantly better recognition accuracy. To understand the variability in the learned metrics, we set up the following experiment: learn a metric for a particular location k, and then apply this metric to compute dissimilarity scores for all other locations. We plot the nAUC values w.r.t. the location of the learned metric in Fig. 9(a). It is apparent that metrics learned at different locations yield different performances. Surprisingly, higher performance is obtained by metrics learned on patches at lower locations in the bounding box (corresponding to leg regions). We believe that this is due to a significant number of images in the VIPeR dataset having dark and cluttered backgrounds in the upper regions (see the last 3 top images in Fig. 4). Lower parts of the bounding boxes usually have a more coherent background from sidewalks. Additionally, we cluster the patch locations spatially using hierarchical clustering (bottom-up), where the similarity between regions is computed using nAUC values. Fig. 9(b) illustrates the clustering results w.r.t. the number of clusters m. Next, we learn metrics for each cluster of locations. These metrics are then used for computing similarity in the corresponding image regions. Recall from Fig. 8 that the best performance was achieved with m = 60. In this circumstance, there appears to be sufficient data to train an independent metric for each patch location. We test this hypothesis by reducing the amount of training data and evaluating the optimal number of metrics when fewer training examples are available. Fig. 10 illustrates that the patch-based approach achieves high performance much faster than full bounding box metric learning. Interestingly, for a small number of positive pairs (less than 100), a reduced number of metrics gives better performance. When a common metric is learned for multiple locations, the amount of training data is effectively increased, because features from multiple patches can be used as examples for learning the same metric (Section 3.1).

4.2.2 Patch layout
Our patch model consists of a set of rectangular patches extracted on a grid layout. The patch size and the grid density (determined by a stride) define the total number of patches K and the level of patch overlap. To investigate the impact of these parameters on re-identification performance, we evaluate PML with the KISS metric for different patch layouts.

Fig. 10: Rank-1 recognition rate with varying size of the training dataset.

Fig. 11(a) shows the results for different K defined by different patch sizes and different strides (e.g. K = 60 is the result of size w/4 × w/2 with stride w/8 × w/4, and for K = 8 the stride dimensions are equal to the patch size, which corresponds to a configuration with non-overlapping patches). From Fig. 11(a) we can notice that having small patches (e.g. K = 140 with size w/4 × w/4) might slightly decrease performance, and in general keeping patches larger (see K = 39 and K = 60) yields better recognition accuracy. The results also indicate that overlapping patches (when the stride is smaller than the patch size) perform significantly better than non-overlapping patches (K = 8 and K = 20 in Fig. 11(a) correspond to layouts with non-overlapping patches). Similarly, using overlapping deep patch features yields better performance compared to non-overlapping patches (see Fig. 11(b)). As a result, in further performance evaluations we select layouts that consist of overlapping patches for both handcrafted features and deep features. For handcrafted features we select K = 60, the best performing configuration in Fig. 11(a). For deep patches we also select a configuration with overlapping patches, but with K = 39 to match the patch size used during deep training (see Section 3.4). Notice that we train the deep features using non-overlapping patches, which we found to perform slightly better.

4.2.3 Patch representation
It is common practice in person re-identification to combine color and texture descriptors for describing an image. We evaluated the performance of different combinations of patch representations, including Lab, RGB and HSV histograms, each with 30 bins per channel. Texture information was captured by color SIFT, which is a SIFT descriptor extracted for each Lab channel and then concatenated. In our previous work [2], we selected the combination of Lab, HSV and color SIFT as the best descriptor. The dimensionality of the concatenated HSV, Lab and color SIFT is 564 (30·3 + 30·3 + 128·3 = 564). In this work, instead of handcrafting the representation, we propose to learn features directly from data with CNNs through a multi-class identification task (Sec. 3.4). As a result, each deep patch is represented by a 256-dimensional feature vector (fc7), and we reduce its dimensionality by PCA before running KISS metric learning. Fig. 12(a) illustrates the averaged CMC curves for the VIPeR data set. It is clear that the proposed deep patch representation (CNN) outperforms all handcrafted representations by a large margin. When learning the patch representation, we propose to force the CNN to recognize not only the person identity but also the patch location. To evaluate the effectiveness of our approach we also trained a CNN on patches while neglecting the patch locations (CNN (-k)). Fig. 12(b) illustrates that including information on patch locations allows us to learn more effective features: the CNN learned with patch locations performs significantly better for both L2 and KISS metric learning.

TABLE 1: Performance comparison of different metric learning approaches using handcrafted features (HSV+Lab+ColorSIFT), denoted by H-C, and deep patches (CNN). CMC rank-1 accuracies are reported.

                 VIPeR          CUHK01         i-LIDS
  METHOD         H-C    CNN     H-C    CNN     H-C    CNN
  PML, KISS      33.5   43.2    37.4   61.0    54.4   74.8
  PML, XQDA      29.3   37.6    39.2   49.2    57.8   73.2
  PML, LSSL      37.1   47.0    42.7   71.3    50.7   78.3

4.2.4 Patch metric learning
In this section we evaluate our PML model while employing the previously discussed metric learning approaches: KISS metric learning [19], XQDA metric learning [25] and LSSL metric learning [42]. We investigate the performance while employing both handcrafted features (HSV+Lab+ColorSIFT) and deep patch features.
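For concreteness, a sketch of the color-histogram part of the handcrafted H-C descriptor of Section 4.2.3 (30 bins per channel over HSV and Lab; the L1 normalization is an assumption, and the color SIFT part, 128 dimensions per Lab channel, is omitted):

    import numpy as np
    import cv2

    def color_histograms(patch_bgr, bins=30):
        # HSV and Lab histograms, 30 bins per channel, concatenated:
        # 2 color spaces x 3 channels x 30 bins = 180 dims (564 with color SIFT).
        feats = []
        for code in (cv2.COLOR_BGR2HSV, cv2.COLOR_BGR2Lab):
            conv = cv2.cvtColor(patch_bgr, code)
            for ch in cv2.split(conv):
                hist, _ = np.histogram(ch, bins=bins, range=(0, 256))
                feats.append(hist / max(hist.sum(), 1))
        return np.concatenate(feats)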
As the i-LIDS dataset contains a relatively small number of training samples (only 60 subjects are available for training), and as indicated by our previous analysis (Section 4.2.1), we learn a single θ for all patches, thus increasing the amount of training examples (m = 1). From Fig. 13 it is apparent that deep patch features significantly improve recognition accuracy on all datasets. It is also clear that LSSL metric learning consistently achieves the best performance among all metric learning approaches. Surprisingly, XQDA often performs worse than standard KISS metric learning, especially when using deep patches. This discrepancy might be due to the fact that deep patches are already highly discriminative, and applying additional discriminative objectives (the Generalized Rayleigh Quotient) may decrease performance. Table 1 summarizes the results. Additionally, we evaluated the performance of the CNN model learned on whole images (equivalent to the JSTL model [40]) combined with metric learning approaches. Recall that we only use the CUHK03 and PRID2011 datasets for training deep features. Each image is represented by a 256-dimensional feature vector, and we reduce its dimensionality by PCA before running KISS and LSSL metric learning (the dimensionality is found by cross-validation). From Fig. 14 it is apparent that metric learning improves recognition accuracy with global deep features (compare with JSTL, L2). However, it is also clear that the best performance of JSTL combined with metric learning is far behind the proposed patch-based learning combined with our deep patch representation.

4.3 Deformable Patch Metric Learning
4.3.1 Unsupervised Deformable Model (HPML)
Fig. 15(a) illustrates the impact of our unsupervised deformable model on recognition accuracy. We also compare the effectiveness of different neighborhoods on the overall accuracy.

Fig. 11: Performance comparison on the VIPeR dataset w.r.t. different patch layouts: (a) using handcrafted features (HSV+Lab+ColorSIFT); (b) using deep patches (CNN). Overlapping patches yield better performance.

Fig. 12: Performance comparison of different patch descriptors on the VIPeR dataset: (a) the best performance is achieved by our deep patch representation (CNN); (b) forcing the CNN to determine patch locations along with person identities increases the effectiveness of the deep features: CNN (-k) corresponds to the CNN trained without patch locations, while CNN was learned with patch locations.

In Eq. (12), we constrain the displacement of patches to δ_horizontal × δ_vertical pixels. Interestingly, allowing patches to move vertically (δ_vertical > 0) generally decreases performance. We believe that this is due to the fact that the images in all these datasets were annotated manually, and the vertical alignment (from head to feet) of people in these images is usually correct. Allowing patches to move horizontally consistently improves performance on all datasets. The highest gain in accuracy is obtained on the i-LIDS dataset (3%), which contains inaccurate detections and a large amount of occlusion. This indicates that our linear assignment approach provides a reliable solution for pose changes.

4.3.2 Supervised Deformable Model (DPML)
We simplify Eq. 14 by restricting the number of unique spring constants. Two parameters α_1, α_2 are assigned to locations obtained by hierarchical clustering with m = 2 clusters (see Fig. 15(b)). α_k encodes the rigidity of patches at particular locations. We perform an exhaustive grid search iterating through α_1 and α_2 while maximizing the Rank-1 recognition rate. Fig. 15(b) illustrates the recognition rate map as a function of both coefficients. Interestingly, rigidity (high spring constants) is useful for lower patches (the dark red region in the left-bottom corner of the map) but not so for patches in the upper locations of the bounding box. This might be related to the fact that metrics learned on lower locations have higher performance (compare the nAUC values in Fig. 9). Fig. 16 illustrates the performance comparison of the different integration functions Z. We employ LSSL metric learning together with deep patch features. The results clearly show that introducing deformable models consistently improves recognition accuracy on all datasets, and that the best performance is obtained by the supervised spring model DPML.
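The first step of the two-step optimization (Section 3.3.3) is a plain grid search; a sketch, where evaluate_rank1 is a hypothetical callback that runs DPML on validation data with the given spring constants and w = 1, and the grid range is an assumption:

    import itertools
    import numpy as np

    def search_spring_constants(evaluate_rank1, grid=np.linspace(0.0, 1.0, 11)):
        # Exhaustive search over (alpha_1, alpha_2), keeping the pair
        # that maximizes the Rank-1 recognition rate.
        return max(itertools.product(grid, grid),
                   key=lambda ab: evaluate_rank1(*ab))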

Fig. 13: Performance comparison of different metric learning approaches on VIPeR, CUHK01 and i-LIDS, using deep patches (CNN, top row) and handcrafted features (HSV+Lab+ColorSIFT, bottom row). LSSL metric learning performs best among all metric learning techniques.

Fig. 14: Performance comparison of global deep features (the JSTL model) combined with metric learning approaches vs. Patch-based Metric Learning (PML) based on the proposed deep patch features. The deep patch features significantly outperform global deep features.

Computational complexity: Although the rigid model (PML) does not perform as well as the deformable models, it is less computationally expensive: it requires only K patch similarities to be computed to compare two images. HPML requires solving the Hungarian algorithm (Eq. (12)), but in practice the K×K matrix (see Fig. 2) can be relatively sparse (compare the performance of different neighborhoods in Fig. 15(a)). Given τ non-infinite entries in this matrix, we employed the QuickMatch algorithm [28], which runs in linear time O(τ). As a result, deep patch feature extraction is the slowest part, and it depends on the GPU architecture (e.g. on a Tesla K40, most of the runtime is spent on deep feature extraction). DPML is the slowest model; the same experiment takes several minutes.

4.4 Comparison with Other Methods
Table 2 reports the performance comparison of our patch-based methods with state-of-the-art approaches across 4 datasets: VIPeR, CUHK01, i-LIDS and CUHK03-detected. Our methods outperform all state-of-the-art techniques on all datasets. The maximum improvement is achieved on the i-LIDS dataset: we improve the state-of-the-art rank-1 accuracy (64.6%) by almost 18% (82.2%). This dataset contains a relatively small number of training samples (we use only 60 subjects for training). Driven by our previous analysis (Section 4.2.1), we learn a single θ for all patches, thus increasing the training set. As a result, PML, HPML and DPML significantly outperform state-of-the-art approaches. There are three aspects that make our approach more effective on i-LIDS: (1) we are able to generate a significantly larger training set using m = 1; (2) occlusions in images pollute only a few patch scores in our similarity measure, while in the case of full-image based metric learning they might have a global impact on the final dissimilarity measure; (3) misaligned features can be corrected by our deformable models. Notice that our simplest aggregation technique (PML) already achieves very competitive results. This highlights the effectiveness of combining patch-driven LSSL metric learning with our deep patch representation.

Fig. 15: Deformable model parameters: (a) HPML comparison of different allowable neighborhoods (horizontal × vertical) when applying the Hungarian algorithm for matching patches; (b) DPML exhaustive grid search over the α_1 and α_2 coefficients. α_1 and α_2 correspond to patch locations w.r.t. the left image. The grid-search map illustrates the Rank-1 recognition rate as a function of (α_1, α_2). The white dot highlights the optimal operating point.

Fig. 16: Performance comparison of Patch-based Metric Learning (PML) with our deformable models: unsupervised HPML and supervised DPML. Rank-1 identification rates, as well as nAUC values provided in brackets, are shown in the legend next to the method name.

TABLE 2: Performance comparison on VIPeR, CUHK01, i-LIDS and CUHK03-detected; CMC rank-1 accuracies are reported. The compared methods include eSDC [46], SDALF [11], TL [31], Dropout [40], KISSME [19], LOMO+XQDA [25], Mirror [6], Ensembles [29], MidLevel [47], kLFDA [41], DeepNN [1], Null Space [44], Null Space (fusion) [44], Triplet Loss [7], Gaussian+XQDA [27], Joint CNN [37] and Sample-Specific SVM [45]; the best of these reaches 64.6% rank-1 on i-LIDS. Our approach significantly outperforms the best state-of-the-art approaches.

  METHOD        VIPeR   CUHK01   i-LIDS   CUHK03
  DPML-CNN      51.7    75.9     82.2     84.0
  HPML-CNN      48.2    72.8     81.3     82.1
  PML-CNN       47.0    71.3     78.3     80.6
  DPML [2]      41.4    –        –        –
  PML [2]       33.5    –        –        –

5 SUMMARY
Re-identification must deal with appearance differences arising from changes in illumination, viewpoint and a person's pose.
TODO 482 36 31 - small pc Kernel-based experiment we were 484 training pairs models ona different 479 48 based approach achieves high performance much quicker Ours p, PCA, metric learning, pooling -2 28% erarchical clusters m = 2 483 37 We positive pairs from a training set and testing on316 Baseline - KISSME - crossvalidation ( 62 pc) 32 4 carry out experiments show evolution per.1.4 Performance w.r.t. amount training data state art,.2. m =Comparison 2). TODO [24] 484 38 pairs from -3a testing set. Figure 8 illustrates that -In (figure XQDA[17] Kernel-based 33 481 formance training image pairs. this 27% CUHK01 ilids 48 39 approachwe achieves high performance much quicker Ours -Baseline p, PCA, metric learning, pooling 34 We carrybased out experiments show evolution per482 KISSME small pc experiment were training models on different (a) HPML (b) DPML.2. Comparison state art 3 formance training image pairs. In this 483 positive pairs from a training set and testing on 316 Baseline - KISSME - crossvalidation ( 62 pc) 1: model (a) HPML comparison different allowable neighborhoods - KISSME - small[24] pc (horizontal vertical) when 36 experiment we Deformable were models on different 484 Fig. pairs from training a testing set.parameters: Figure 8 illustrates that XQDA[17] Kernel-based Baseline applying for matching es; (b) DPMLBaseline exhaustive grid search over α1 and α2 coefficients 48 positive pairshungarian from a algorithm training andperformance testing on much 316 - KISSME crossvalidation (pooling 62 pc) for. α1 37 based approach achievesset high quicker Ours - p,- PCA, metric learning, and aα2testing correspond es locations w.r.t. left image.xqda[17] Grid searchkernel-based map illustrates[24] Rank-1 recognition rate as a function 38 pairs from set. Figure 8 illustrates that (α1, α2 ). The white dot optimal 39 based approach achieves highhighlights performance muchoperating quickerpoint. Ours - p, PCA, metric learning, pooling 0 Gain rank-1 accuracy(%) 0.4 0.3 0.3 0.2 0.2 0.1 0 9 CUHK01 ilids 98 96 7 6 1.79%(0.986) DPML, LSSL 48.29%(0.983) HPML, LSSL 47.01%(0.983) PML, LSSL 0 1 9 8 8 7.93%(0.997) DPML, LSSL 72.84%(0.996) HPML, LSSL 71.34%(0.99) PML, LSSL 7 2 1 94 92 88 86 84 82.%(0.9) DPML, LSSL 81.36%(0.974) HPML, LSSL 78.31%(0.9) PML, LSSL 82 2 1 2 Fig. 16: Performance comparison Patch based Metric Learning (PML) our deformable models: unsupervised HPML and supervised DPML. Rank-1 identification rates as well as nau C values provided in brackets are shown in legend next method name. CUHK01 ilids CUHK03 DPML-CNN HPML-CNN PML-CNN DPML [2] PML [2] 1.7 48.2 47.0 41.4 33. 7.9 72.8 71.3 3.8.6 82.2 81.3 78.3 7.6 1.6 84.0 82.1.6 - esdc [46] SDALF [11] TL [31] Dropout [] KISSME [19] LOMO+XQDA [2] Mirror [6] Ensembles [29] MidLevel [47] kldfa [41] DeepNN [1] Null Space [44] Null Space (fusion) [44] Triplet Loss [7] Gaussian+XQDA [27] Joint CNN [37] Sample-Specific SVM [4] 26.7 19.9 34.1 38.6 19.6.0 42.9 4.9 29.1 32.8 34.8 42.2 1.1 47.8 49.7 3.7 42.6 1.1 9.9 32.1 66.6 16.4 63.2.4 3.4 34.3 47. 64.9 69.0 3.7 7.8 71.8 6.9 36.8 41.7 0.3 64.6 28.4 46.3 0.3.3.4-7.3 - M ETHOD 62.1 4.0 3.7 4.7 2.2 7.0 TABLE 2: Performance comparison on, CUHK01, ilids and CUHK03-detected; CMC rank-1 accuracies are reported. The best scores are shown in red. The second best scores are highlighted in blue. Our approach significantly outperforms best state art approaches. S UMMARY Re-identification must deal appearance differences arising from changes in illumination, viewpoint and a person s pose. 
SUMMARY
Re-identification must deal with appearance differences arising from changes in illumination, viewpoint and a person's pose. Traditional metric learning approaches do not address registration errors and instead only focus on feature vectors extracted from bounding boxes. In contrast, we propose a patch-based approach. Operating on patches has several advantages: extracted feature vectors have lower dimensionality and do not have to be subjected to the same levels of compression as feature vectors extracted from the entire bounding box, and multiple patch locations can share the same metric, which effectively increases the amount of training data. We also allow patches to adjust their locations when comparing two bounding boxes. The idea is similar to part-based models used in object detection. As a result, we directly address registration errors while simultaneously evaluating appearance consistency. Learning deep features directly from data and forcing the CNN to also determine the patch location results in a highly effective representation.

Our experiments illustrate how these advantages lead to new state-of-the-art performance on well-established, challenging re-identification datasets.

REFERENCES

[1] E. Ahmed, M. Jones, and T. K. Marks. An improved deep learning architecture for person re-identification. In CVPR, 2015.
[2] S. Bak and P. Carr. Person re-identification using deformable patch metric learning. In WACV, 2016.
[3] S. Bak and P. Carr. One-shot metric learning for person re-identification. In CVPR, 2017.
[4] S. Bak, E. Corvee, F. Bremond, and M. Thonnat. Person re-identification using spatial covariance regions of human body parts. In AVSS, 2010.
[5] D. Chen, Z. Yuan, G. Hua, N. Zheng, and J. Wang. Similarity learning on an explicit polynomial kernel feature map for person re-identification. In CVPR, 2015.
[6] Y.-C. Chen, W.-S. Zheng, and J. Lai. Mirror representation for modeling view-specific transform in person re-identification. In IJCAI, 2015.
[7] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng. Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In CVPR, 2016.
[8] D. S. Cheng, M. Cristani, M. Stoppa, L. Bazzani, and V. Murino. Custom pictorial structures for re-identification. In BMVC, pages 68.1-68.11, 2011.
[9] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In ICML, 2007.
[10] M. Dikmen, E. Akbas, T. S. Huang, and N. Ahuja. Pedestrian recognition with a learned metric. In ACCV, pages 501-512, 2010.
[11] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani. Person re-identification by symmetry-driven accumulation of local features. In CVPR, 2010.
[12] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 2010.
[13] D. Gray, S. Brennan, and H. Tao. Evaluating appearance models for recognition, reacquisition, and tracking. PETS, 2007.
[14] D. Gray and H. Tao. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In ECCV, 2008.
[15] M. Guillaumin, J. Verbeek, and C. Schmid. Is that you? Metric learning approaches for face identification. In ICCV, 2009.
[16] M. Guillaumin, J. Verbeek, and C. Schmid. Multiple instance metric learning from automatically labeled bags of faces. In ECCV, 2010.
[17] M. Hirzer, C. Beleznai, P. M. Roth, and H. Bischof. Person re-identification by descriptive and discriminative classification. In SCIA, pages 91-102, 2011.
[18] T. Joachims, T. Finley, and C.-N. Yu. Cutting-plane training of structural SVMs. Machine Learning, 2009.
[19] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof. Large scale metric learning from equivalence constraints. In CVPR, 2012.
[20] H. W. Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2), 1955.
[21] I. Kviatkovsky, A. Adam, and E. Rivlin. Color invariants for person reidentification. TPAMI, 2013.
[22] W. Li, R. Zhao, and X. Wang. Human reidentification with transferred metric learning. In ACCV, 2012.
[23] W. Li, R. Zhao, T. Xiao, and X. Wang. DeepReID: Deep filter pairing neural network for person re-identification. In CVPR, 2014.
[24] Z. Li, S. Chang, F. Liang, T. Huang, L. Cao, and J. Smith. Learning locally-adaptive decision functions for person verification. In CVPR, 2013.
[25] S. Liao, Y. Hu, X. Zhu, and S. Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In CVPR, 2015.
[26] N. Martinel, C. Micheloni, and G. Foresti. Saliency weighted features for person re-identification. In ECCV Workshops, 2014.
[27] T. Matsukawa, T. Okabe, E. Suzuki, and Y. Sato. Hierarchical gaussian descriptor for person re-identification. In CVPR, 2016.
[28] J. B. Orlin and Y. Lee. QuickMatch: A very fast algorithm for the assignment problem. Technical Report WP 3547-93, Massachusetts Institute of Technology, 1993.
[29] S. Paisitkriangkrai, C. Shen, and A. van den Hengel. Learning to rank in person re-identification with metric ensembles. In CVPR, 2015.
[30] S. Pedagadi, J. Orwell, S. A. Velastin, and B. A. Boghossian. Local fisher discriminant analysis for pedestrian re-identification. In CVPR, 2013.
[31] P. Peng, T. Xiang, Y. Wang, M. Pontil, S. Gong, T. Huang, and Y. Tian. Unsupervised cross-dataset transfer learning for person re-identification. In CVPR, 2016.
[32] B. Prosser, W.-S. Zheng, S. Gong, and T. Xiang. Person re-identification by support vector ranking. In BMVC, 2010.
[33] Y. Shen, W. Lin, J. Yan, M. Xu, J. Wu, and J. Wang. Person re-identification with correspondence structure learning. In ICCV, 2015.
[34] H. Sheng, Y. Huang, Y. Zheng, J. Chen, and Z. Xiong. Person re-identification via learning visual similarity on corresponding patch pairs. In KSEM, 2015.
[35] H. Shi, Y. Yang, X. Zhu, S. Liao, Z. Lei, W. Zheng, and S. Z. Li. Embedding deep metric for person re-identification: A study against large variations. In ECCV, 2016.
[36] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In CVPR, 2014.
[37] F. Wang, W. Zuo, L. Lin, D. Zhang, and L. Zhang. Joint learning of single-image and cross-image representations for person re-identification. In CVPR, 2016.
[38] G. Wang, L. Lin, S. Ding, Y. Li, and Q. Wang. DARI: Distance metric and representation integration for person verification. In AAAI, 2016.
[39] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS, 2006.
[40] T. Xiao, H. Li, W. Ouyang, and X. Wang. Learning deep feature representations with domain guided dropout for person re-identification. In CVPR, 2016.
[41] F. Xiong, M. Gou, O. Camps, and M. Sznaier. Person re-identification using kernel-based metric learning methods. In ECCV, 2014.
[42] Y. Yang, S. Liao, Z. Lei, and S. Z. Li. Large scale similarity learning using similar pairs for person verification. In AAAI, 2016.
[43] Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. TPAMI, 2013.
[44] L. Zhang, T. Xiang, and S. Gong. Learning a discriminative null space for person re-identification. In CVPR, 2016.
[45] Y. Zhang, B. Li, H. Lu, A. Irie, and X. Ruan. Sample-specific SVM learning for person re-identification. In CVPR, 2016.
[46] R. Zhao, W. Ouyang, and X. Wang. Unsupervised salience learning for person re-identification. In CVPR, 2013.
[47] R. Zhao, W. Ouyang, and X. Wang. Learning mid-level filters for person re-identification. In CVPR, 2014.
[48] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian. MARS: A video benchmark for large-scale person re-identification. In ECCV, 2016.
[49] W.-S. Zheng, S. Gong, and T. Xiang. Associating groups of people. In BMVC, 2009.
[50] W.-S. Zheng, S. Gong, and T. Xiang. Person re-identification by probabilistic relative distance comparison. In CVPR, 2011.
[51] W.-S. Zheng, X. Li, T. Xiang, S. Liao, J. Lai, and S. Gong. Partial person re-identification. In ICCV, 2015.

Sławomir Bak is an Associate Research Scientist at Disney Research Pittsburgh. His research focuses on recognition (person re-identification), visual tracking, Riemannian manifolds and metric learning.
Before joining Disney Research, he spent several years as a Postdoctoral Fellow and PhD student at INRIA Sophia Antipolis. He received his PhD from INRIA, University of Nice, in 2012 for a thesis on person re-identification. Sławomir obtained his Master's Degree in Computer Science (Intelligent Decision Support Systems) from Poznan University of Technology in 2008.

Peter Carr is a Research Scientist at Disney Research, Pittsburgh. His research interests lie at the intersection of computer vision, machine learning and robotics. Peter joined Disney Research as a Postdoctoral Researcher. Prior to Disney, Peter received his PhD from the Australian National University under the supervision of Prof. Richard Hartley. Peter received a Master's Degree in Physics from the Centre for Vision Research at York University in Toronto, Canada, and a Bachelor of Applied Science (Engineering Physics) from Queen's University in Kingston, Canada.