Coupling-and-Decoupling: A Hierarchical Model for Occlusion-Free Car Detection

Bo Li 1,2,3, Tianfu Wu 2,3, Wenze Hu 3,4 and Mingtao Pei 1

1 Beijing Lab of Intelligent Information, School of Computer Science and Technology, Beijing Institute of Technology, Beijing, P.R. China
2 BUPT-Seesoft Joint Lab of Visual Computing and Image Communication, Beijing University of Posts and Telecommunications (BUPT), Beijing, P.R. China
3 Lotus Hill Research Institute, EZhou, P.R. China
4 Department of Statistics, University of California, Los Angeles
{boli.lhi, tfwu.lhi, wzhu.lhi}@gmail.com, peimt@bit.edu.cn

Abstract. Handling occlusions in object detection is a long-standing problem. This paper addresses the problem of X-to-X-occlusion-free object detection (e.g., car-to-car occlusions in our experiments) by utilizing an intuitive coupling-and-decoupling strategy. In the coupling stage, we model the pair of occluding X's (e.g., car pairs) directly to account for their statistically strong co-occurrence (i.e., coupling). We then learn a hierarchical And-Or directed acyclic graph (AOG) model under the latent structural SVM (LSSVM) framework. The learned AOG consists of, from top to bottom, (i) a root Or-node representing different compositions of occluding X pairs, (ii) a set of And-nodes, each of which represents a specific composition of occluding X pairs, (iii) another set of And-nodes representing single X's decomposed from the occluding X pairs, and (iv) a set of terminal-nodes which represent the appearance templates for the X pairs, the single X's and the latent parts of the single X's, respectively. The part appearance templates can also be shared among different single X's. In detection, a dynamic programming (DP) algorithm is used, and as a natural consequence the two single X's are decoupled from the X-to-X occluding pairs. In experiments, we test our method on roadside cars collected by ourselves from a real traffic video surveillance environment. We compare our model with the state-of-the-art deformable part-based model (DPM) and obtain better detection performance.

1 Introduction

In the literature of object detection, handling occlusions is very challenging and remains a long-standing problem. There are two main reasons. (i) The gap between training and testing. When training an object detector, unoccluded object instances are often collected and used purposely. In testing, however, occlusions are inevitable in real scenarios. As a result, detection performance degrades significantly as occlusions become severe. (ii) The lack of common occlusion models. Generally and statistically speaking, it is very difficult to capture and predict occlusions because, in the wildest situation, they can be treated as being uniformly distributed.

To some extent, this explains, in turn, why the gap between training and testing exists. To address the occlusion problem, among others, hierarchical modeling (e.g., deformable part-based models [5]) has been widely used and shows performance improvements; a 2-layer model is often adopted for modeling single objects, which can tackle small occlusions implicitly.

Fig. 1. Some examples of roadside cars. There are different types of car-to-car occlusions which challenge state-of-the-art detectors trained for single cars.

In this paper, we distinguish between two types of occlusions: X-to-X and X-to-Y occlusions, where X and Y represent different object categories (e.g., X represents car and Y person), and then present a coupling-and-decoupling method for X-to-X-occlusion-free object detection without modeling occlusions explicitly. As the running example, we use roadside cars, which are often parked along the curb, leading to X-to-X occlusions. Occlusion-free roadside car detection can facilitate many important applications in computer vision and intelligent transportation, such as parking violation capturing, license plate detection and parking management. Figure 1 shows some examples of car-to-car occlusions in a real traffic video surveillance environment. In the sequel, we concretely use car instead of X to present the formulation (but note that the proposed method is not limited to cars). Our method consists of the following two stages.

(i) The coupling stage in modeling and learning. Instead of training a single object detector, we learn a hierarchical And-Or directed acyclic graph (AOG) model for the car-to-car occluding pairs directly, to account for their statistically strong coupling. The learned AOG consists of, from top to bottom, (i) a root Or-node representing different compositions of occluding car-to-car pairs, (ii) a set of And-nodes, each of which represents a specific composition of occluding car pairs, (iii) another set of And-nodes representing single cars decomposed from occluding car pairs, and (iv) a set of terminal-nodes which represent the appearance templates for the car pairs, the single cars and the latent parts of the single cars, respectively. The part appearance templates can also be shared among different single cars. We adopt Histograms of Oriented Gradients (HOG) [2] as the appearance feature, as done in the DPM [5]. Figure 3 shows the learned AOG for car-to-car pairs (where, for clarity, only a portion is drawn). We formulate the learning of the AOG under the latent structural SVM (LSSVM) framework [13, 14, 16].

In the training dataset, bounding boxes of car pairs and of the corresponding two single cars are annotated, and the parts of the single cars are treated as latent variables.

(ii) The decoupling stage in detection. Our AOG is directed and acyclic, so we can utilize the DP algorithm in inference. For detected car pairs, the back-traced bounding boxes of the two single cars are obtained, i.e., decoupled from the car pair. Since the locations and sizes of the bounding boxes of the single cars are annotated when jointly training the AOG model, the back-traced ones are the optimal solutions for the two single cars.

Fig. 2. Top-left: The population ratios in the testing set of roadside cars used in this paper. Bottom: Some examples of cropped car-to-car occluding pairs. The occlusion ratio is measured for the back car in the car pairs. Top-right: The plots of detection rate vs. occlusion ratio, where the blue dashed curve is for the state-of-the-art DPM [5] and the red curve is for the proposed method. See text for details.

To illustrate the necessity and the advantage of the proposed method, the top-left panel of Fig. 2 shows the population ratios of car-to-car pairs with different degrees of occlusion in the testing dataset collected by ourselves from a real traffic video surveillance environment. Some cropped image examples are shown at the bottom. The top-right panel shows the detection rates against the occlusion ratio for the proposed method (red curve) and the state-of-the-art DPM [5] (blue dashed curve). We can observe that: (i) the population ratio of car pairs with occlusion ratio equal to or greater than 0.2 is greater than 0.5 (i.e., occlusions become a statistically major factor); (ii) at the same time, the detection performance of the DPM drops significantly when occlusions go beyond 0.2, while our method obtains much better performance; (iii) the detection performance of our method goes up significantly when occlusions become severe.

This is because, with those severe occlusions, even if the DPM could recall the two single cars, their bounding boxes overlap by more than the threshold normally used (e.g., 0.7), and then the one with the lower score is excluded by non-maximum suppression (NMS) (see the DPM detection results in Fig. 5). Our method can, however, detect those cars correctly by decoupling them from the detected car pairs. More results and the final performance comparison are shown in Fig. 5 and Fig. 4, respectively.

In the literature of computer vision, car detection for traffic monitoring systems is addressed mainly in single, unoccluded situations, such as car type classification [12, 8], multiple-view car detection [9, 7], or shadow removal from suspicious car regions in images [10]. [1] proposed a method to detect and track multiple cars simultaneously, but did not address the occlusion problem.

Fig. 3. Our AOG model. First layer: illustration of car pair And-nodes and their corresponding appearance features. Second layer: illustration of single car And-nodes and their corresponding appearance features. Third layer: illustration of car part terminal-nodes and their corresponding appearance and deformation features. Parts are shared. For clarity, we only show the parts of two single cars.

2 The Model

2.1 The AOG

In this section, we specify the hierarchical AOG model used in this paper, which is a directed acyclic graph facilitating the DP algorithm in detection. The learning of the AOG will be given in Sec. 4. Following the framework in [17], our AOG embeds the occluding car pair detection grammars, which are embodied by defining three types of nodes:

(i) The root Or-node $O$ represents compositional alternatives of the occluding car pairs (e.g., car pairs from different viewpoints or with different degrees of occlusion). The Or-node $O$ has a branching variable, denoted by $\omega(O)$, indicating which child And-node is selected; $\omega(O)$ will be inferred on-the-fly in detection.

(ii) A set of And-nodes $V_{And}$. There are two types of And-nodes: car pairs and single cars. Each car pair And-node represents the decomposition of a specific type of occluding car pair into two single cars (e.g., a frontal-view car pair with the back car occluded by roughly 30%), and each single car And-node represents the decomposition of a single car into a small number of parts.

(iii) A set of terminal-nodes $V_T$. First, the And-nodes defined above can themselves terminate directly, creating terminal-nodes, when the resolution is low (relative to their own decomposed parts). Second, each part is represented by a terminal-node linking to the image data. In the model, each terminal-node $t \in V_T$ has its own location, denoted by $l_t$, which will also be inferred on-the-fly in detection. The location for placing an And-node is the same as that of the terminal-node directly terminated from it.

In the AOG, terminal-nodes link the object detection grammars to image data by evaluating the appearance features, And-nodes account for the geometric deformations between their child nodes, and Or-nodes select the best solution (i.e., the one with maximal score) among their child nodes. So, the scoring function of the AOG consists of two terms: appearance (i.e., the data term) and deformation (i.e., the relation term). Formally, an AOG is specified by a 5-tuple,

$$\mathcal{G} = (O, V_{And}, V_T, \Theta^{app}, \Theta^{def}) \quad (1)$$

where $\Theta^{app}$ are the parameters of the appearance scoring function for placing terminal-nodes in images, and $\Theta^{def}$ are the parameters of the deformation cost of a placed terminal-node with respect to its anchor location. They will be learned jointly by LSSVM.

Part-sharing in the AOG. Among the child single car And-nodes decomposed from the car pair And-nodes, some are often of the same type (such as side-view or frontal-view cars) but with different occlusions. So, they can share part appearance templates, although they may have different deformation models. Part-sharing supplies more data for training the part appearance parameters, and also reduces run-time in detection.

2.2 The scoring function of an AOG

Let $\Lambda$ be the image lattice and $I_\Lambda$ an image defined on $\Lambda$. In detection, we need to search over scales to detect objects of different sizes. In practice, a feature pyramid of $I_\Lambda$ is generated, denoted by $H$ (e.g., the HOG feature pyramid used in the DPM [5] and in our method).
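To make this step concrete, below is a minimal sketch of building such a multi-scale feature pyramid. The cell-averaging compute_features function is only a stand-in for the HOG features actually used, and the pyramid parameters (scales per octave, minimum size) are illustrative assumptions rather than the authors' settings.

```python
import numpy as np
from scipy.ndimage import zoom  # simple image rescaling

def compute_features(image, cell_size=8):
    """Placeholder feature extractor standing in for HOG: averages pixels
    over cell_size x cell_size cells (HOG would produce a histogram per cell)."""
    h, w = image.shape[:2]
    h_c, w_c = h // cell_size, w // cell_size
    cells = image[:h_c * cell_size, :w_c * cell_size]
    cells = cells.reshape(h_c, cell_size, w_c, cell_size)
    return cells.mean(axis=(1, 3))

def build_feature_pyramid(image, scales_per_octave=5, min_size=32):
    """Return a list of (scale, feature_map) pairs over a geometric scale range."""
    step = 2.0 ** (1.0 / scales_per_octave)
    pyramid, scale = [], 1.0
    while min(image.shape[0] * scale, image.shape[1] * scale) >= min_size:
        resized = zoom(image, scale, order=1)
        pyramid.append((scale, compute_features(resized)))
        scale /= step
    return pyramid

# Usage: every pyramid level is matched against the same fixed-size templates,
# so one car-pair template can detect pairs of different sizes.
img = np.random.rand(480, 640)
levels = build_feature_pyramid(img)
```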

When placing an AOG in $I_\Lambda$ at a location $u \in \Lambda$, we have:

(i) The scoring function for evaluating an Or-node $O$ at $u$ is defined by

$$\mathrm{Score}(O, u) = \max_{A \in ch(O)} \mathrm{Score}(A, u) \quad (2)$$

where $ch(O) \subseteq V_{And}$ is the set of child And-nodes of the Or-node $O$. We can then assign the branching variable $\omega(O) = \arg\max_{A \in ch(O)} \mathrm{Score}(A, u)$.

(ii) The scoring function for computing an And-node $A$ with respect to a placed Or-node $O$ at $u$ is defined by

$$\mathrm{Score}(A, u \mid O, u) = \langle \theta^{app}_{t_A}, \Phi^{app}(H, A, u) \rangle + \sum_{c \in ch(A)} \mathrm{Score}(c \mid A, u) \quad (3)$$

where the first term is the appearance score of the terminal-node $t_A$ terminated directly from the And-node $A$, $\theta^{app}_{t_A} \in \Theta^{app}$ are the corresponding appearance parameters, $\Phi^{app}(H, A, u)$ are the features extracted from the feature pyramid, and $ch(A) \subseteq V_{And} \cup V_T$ is the set of child nodes of $A$.

(iii) The scoring function for computing an And-node $A_1$ with respect to a placed And-node $A$ at $u$ is defined by

$$\mathrm{Score}(A_1 \mid A, u) = \max_{v \in \Lambda} \Big( \langle \theta^{app}_{t_{A_1}}, \Phi^{app}(H, A_1, v) \rangle - \langle \theta^{def}_{A_1 \mid A}, \Phi^{def}_{A_1 \mid A}(v, u) \rangle + \sum_{t \in ch(A_1)} \mathrm{Score}(t \mid A_1, v) \Big) \quad (4)$$

where $\theta^{def} \in \Theta^{def}$ is the deformation parameter of a node (such as $A_1$) with respect to its parent node (such as $A$), and $\Phi^{def}(v, u)$ is the deformation feature, for which we adopt the same quadratic function as used in the DPM [5]: $\Phi^{def}(v, u) = [dx^2, dx, dy^2, dy]$, where $(dx, dy)$ is the displacement between $v$ and $u$. The best placement of node $A_1$ is retrieved by taking the argmax over $v \in \Lambda$ in Eqn. 4.

(iv) For computing a part terminal-node $t$ with respect to a placed parent And-node $A$ at $u$, the scoring function is defined by

$$\mathrm{Score}(t \mid A, u) = \max_{v \in \Lambda} \big( \langle \theta^{app}_{t}, \Phi^{app}(H, t, v) \rangle - \langle \theta^{def}_{t \mid A}, \Phi^{def}_{t \mid A}(v, u) \rangle \big) \quad (5)$$

where, in practice, we often place node $t$ at twice the spatial resolution relative to node $A$ to capture more detailed information.
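As an illustration of Eqns. 4-5, the following is a brute-force sketch of scoring a part terminal-node against its anchor: it maximizes a precomputed appearance response minus the quadratic deformation cost over a local search window. In practice this maximization is carried out for all anchors at once with the generalized distance transform [6]; the array names and search radius here are assumptions for illustration only.

```python
import numpy as np

def deformation_feature(dx, dy):
    """Quadratic deformation feature [dx^2, dx, dy^2, dy] as in Eq. (4)."""
    return np.array([dx * dx, dx, dy * dy, dy], dtype=float)

def score_part(appearance_map, theta_def, anchor, search_radius=8):
    """Brute-force version of Eq. (5): maximize, over placements v near the anchor,
    the appearance score minus the deformation cost.

    appearance_map[y, x] is assumed to hold <theta_app_t, Phi_app(H, t, (y, x))>,
    i.e. the part filter response precomputed over the feature map.
    Returns (best_score, best_location)."""
    ay, ax = anchor
    h, w = appearance_map.shape
    best_score, best_loc = -np.inf, anchor
    for dy in range(-search_radius, search_radius + 1):
        for dx in range(-search_radius, search_radius + 1):
            y, x = ay + dy, ax + dx
            if 0 <= y < h and 0 <= x < w:
                s = appearance_map[y, x] - theta_def @ deformation_feature(dx, dy)
                if s > best_score:
                    best_score, best_loc = s, (y, x)
    return best_score, best_loc

# Usage: one part filter response map and a learned 4-d deformation parameter.
resp = np.random.randn(40, 60)
theta = np.array([0.05, 0.0, 0.05, 0.0])   # penalizes large displacements
score, loc = score_part(resp, theta, anchor=(20, 30))
```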

3 The DP Algorithm for Detection

In detection, we first find all the locations in the image pyramid where the scores of the placed AOGs are higher than an estimated threshold $\tau$; for example, at the original resolution we keep $\{u : \mathrm{Score}(O, u) > \tau, u \in \Lambda\}$, and then utilize NMS to obtain the final detection results. Since the AOG is directed and acyclic, its scoring function is evaluated in two phases by the DP algorithm: (i) a bottom-up phase that computes all the appearance score maps of the terminal-nodes, as well as their transformed maps for the different parent nodes, computed with the efficient generalized distance transform [6]; and (ii) a top-down phase that retrieves the configurations (i.e., the locations of the car pair, the single cars and the parts) at all locations whose scores are greater than the threshold $\tau$, followed in practice by a post-processing NMS step. We omit the standard details of the DP algorithm due to limited space and refer the reader to [5]. By the top-down back-tracing, we obtain the decoupled single cars from the detected occluding car pairs. Note that we may obtain two inferred locations for a single car that is shared by two adjacent car pairs, when the single car appears in the middle of a line of multiple occluding cars. In that case, we take as the final detection result for that single car the location decoupled from the detected car pair with the higher score.

4 Learning the AOG by Latent Structural SVM

In this section, we formulate the learning of the AOG under the latent structural SVM (LSSVM) framework [13, 14, 16], which has been widely used in the literature of object detection and machine learning.

Training data. We collect roadside cars from a real traffic video surveillance environment and annotate the bounding boxes of both the occluding car pairs and the corresponding two single cars. When labeling occluded single cars, we annotate their whole bounding boxes. Note that some cars may be used twice, in two adjacent car pairs, when they appear in the middle of a line of multiple occluding cars. These duplicated cars can be treated as bootstrapped examples when learning the appearance parameters for single cars and parts.

4.1 Latent variables in the AOG

Given the training data specified above, the AOG defined in Sec. 2.1 has the following latent variables.

The branches of the root Or-node, i.e., the mixture components of the occluding car pairs. Based on the labeled bounding boxes, we initialize them using k-means clustering (k = 3 clusters in our experiment) on the concatenated features: the aspect ratios of the three annotated bounding boxes, and the displacement between the centers of the two single cars relative to the center of the car pair (normalized by the size of the car pair bounding box). The aspect ratios of the single cars roughly indicate viewpoints, the displacement gives a clue about the configuration of the car pair, and the aspect ratio of the car pair reflects the degree of occlusion. In training, we also incorporate left-right flipped examples, as done in [4]. So, we have 6 car pair models in total.
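The following is a minimal sketch of this mixture-component initialization under stated assumptions: each annotated car pair is described by the aspect ratios of its three boxes plus the normalized displacement between the two car centers, and the resulting vectors are clustered with k-means. The exact feature layout and the use of scikit-learn's KMeans are illustrative choices, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def pair_feature(pair_box, car1_box, car2_box):
    """Concatenate the aspect ratios of the three boxes with the displacement
    between the two single-car centers, normalized by the pair box size.
    Boxes are (x1, y1, x2, y2)."""
    def aspect(b):
        return (b[2] - b[0]) / float(b[3] - b[1])
    def center(b):
        return np.array([(b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0])
    pw, ph = pair_box[2] - pair_box[0], pair_box[3] - pair_box[1]
    disp = (center(car1_box) - center(car2_box)) / np.array([pw, ph], dtype=float)
    return np.array([aspect(pair_box), aspect(car1_box), aspect(car2_box), disp[0], disp[1]])

def init_mixture_components(annotations, k=3, seed=0):
    """annotations: list of (pair_box, car1_box, car2_box) tuples.
    Returns one cluster index per annotated car pair."""
    feats = np.stack([pair_feature(*a) for a in annotations])
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(feats)

# Usage with dummy boxes: each label picks one of the 3 car-pair And-nodes;
# left-right flipping of the training data then doubles this to 6 components.
rng = np.random.default_rng(0)
anns = []
for _ in range(30):
    w = rng.uniform(150, 250)
    anns.append(((0, 0, w, 80), (0, 0, 0.6 * w, 80), (0.3 * w, 5, w, 80)))
labels = init_mixture_components(anns)
```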

We train the initial AOG (consisting of the root Or-node, the six car pair And-nodes, the twelve single car And-nodes, and the terminal-nodes corresponding to these And-nodes) under the LSSVM framework by treating the locations and sizes of the car pairs and single cars as hidden variables anchored at the annotated bounding boxes. At each step of re-labeling the positive examples (i.e., assigning the latent variables) in learning, we force the assignment of car pair terminal-nodes to overlap with the ground truth by more than 0.7, and that of single car terminal-nodes by more than 0.8.

The part configurations of single cars and part-sharing. After the initial AOG is trained, we initialize the part configurations of the single cars based on the learned single car templates, similar to the greedy pursuit method used in the DPM [5]. We use 8 rectangular parts of equal size for each single car. For part-sharing, we use a method similar to that in [11], resulting in 30 part terminal-nodes in total.

4.2 Learning by LSSVM

Denote the set of positive training images by $D_+ = \{(I_1, y_1, z_1), \ldots, (I_n, y_n, z_n)\}$, where $y_i = 1$ and $z_i = (\omega_i, B_i, P_i)$ consists of (i) the Or-node branching variable $\omega_i$ (i.e., the mixture component index); (ii) the three labeled bounding boxes $B_i$ of the car pair and the two single cars, respectively; and (iii) the bounding boxes $P_i$ of the parts of the single cars. The $z_i$'s are treated as latent variables during learning, with different initializations: $\omega_i$ is initialized by the k-means clustering stated above, $B_i$ by the annotated bounding boxes, and $P_i$ by the greedy pursuit and part-sharing strategy stated above. Let $D_- = \{(I_{n+1}, y_{n+1}), \ldots, (I_N, y_N)\}$ be a set of negative training images (i.e., images without cars), where $y_i = -1$. We first train the initial AOG using $z_i = (\omega_i, B_i)$, and then initialize $P_i$ and learn the full AOG using $z_i = (\omega_i, B_i, P_i)$. Both are done under the LSSVM framework.

Given $z$, the scoring function is a linear function,

$$\mathrm{Score}(I, y, z; \Theta) = \langle \Theta, \Phi(I, y, z) \rangle \quad (6)$$

where $\Theta = (\Theta^{app}, \Theta^{def})$ and $\Phi(I, y, z) = (\Phi^{app}(I, y, z), \Phi^{def}(y, z))$ are specified by Eqn. 3, Eqn. 4 and Eqn. 5. Under the LSSVM framework, we learn $\Theta$ by minimizing the following surrogate loss [14, 16],

$$\min_\Theta \; \frac{1}{2}\|\Theta\|_2^2 + \frac{C}{N}\sum_{i=1}^{N}\Big[\max_{y,z}\big(\mathrm{Score}(I_i, y, z; \Theta) + \Delta(y_i, y, z)\big) - \max_{z}\mathrm{Score}(I_i, y_i, z; \Theta)\Big] \quad (7)$$

where the loss $\Delta(y_i, y, z) = 1$ if $y \neq y_i$ and $0$ otherwise, and $C$ is the trade-off parameter balancing the regularization term and the surrogate loss term.
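To make the structure of Eqn. 7 concrete, here is a schematic sketch of the alternating optimization commonly used for such latent-variable objectives (the paper's actual procedure, the CCCP, is detailed next): latent variables of the positives are imputed with the current model, and a convex structural-SVM-style subgradient update is then applied. The routines infer_latent, loss_augmented_inference and feature are placeholders for the DP-based inference and the joint feature map of Eqn. 6; this is not the authors' implementation.

```python
import numpy as np

def train_lssvm(theta, positives, negatives, feature, infer_latent,
                loss_augmented_inference, C=0.002, lr=1e-3,
                outer_iters=5, inner_iters=100):
    """Schematic latent structural SVM training (cf. Eqn. 7).

    positives/negatives: lists of images with labels y = +1 / -1.
    feature(I, y, z)          -> joint feature vector Phi(I, y, z)
                                 (assumed to return a zero vector for y = -1, z = None)
    infer_latent(theta, I, y) -> argmax_z Score(I, y, z; theta)  (DP inference)
    loss_augmented_inference(theta, I, y_true) -> (y_hat, z_hat) maximizing Score + Delta
    All three are placeholders for the model-specific routines."""
    data = [(I, +1) for I in positives] + [(I, -1) for I in negatives]
    N = len(data)
    for _ in range(outer_iters):
        # Step 1: impute latent variables of the positives with the current model
        # (this fixes the concave part of the objective, cf. the CCCP upper bound).
        imputed = {i: infer_latent(theta, I, y)
                   for i, (I, y) in enumerate(data) if y == +1}
        # Step 2: approximately solve the resulting convex problem by subgradient descent.
        for _ in range(inner_iters):
            grad = np.copy(theta)  # gradient of the 0.5 * ||theta||^2 term
            for i, (I, y) in enumerate(data):
                y_hat, z_hat = loss_augmented_inference(theta, I, y)
                z_true = imputed[i] if y == +1 else None
                violation = (theta @ feature(I, y_hat, z_hat) + float(y_hat != y)) \
                            - (theta @ feature(I, y, z_true))
                if violation > 0:
                    grad += (C / N) * (feature(I, y_hat, z_hat) - feature(I, y, z_true))
            theta -= lr * grad
    return theta
```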

The objective function in Eqn. 7 is non-convex, and the concave-convex procedure (CCCP) [15, 14, 16] is used to obtain a local optimum. First, Eqn. 7 can be re-written as

$$\min_\Theta \; \underbrace{\frac{1}{2}\|\Theta\|_2^2 + \frac{C}{N}\sum_{i=1}^{N}\max_{y,z}\big(\mathrm{Score}(I_i, y, z; \Theta) + \Delta(y_i, y, z)\big)}_{f(\Theta),\ \text{convex function}} \; \underbrace{- \; \frac{C}{N}\sum_{i=1}^{N}\max_{z}\mathrm{Score}(I_i, y_i, z; \Theta)}_{g(\Theta),\ \text{concave function}} \quad (8)$$

Then, at step $t$, based on the current solution $\Theta_t$, the CCCP solves the problem with the following two steps.

(i) Bounding $g(\Theta)$ from above (since it is concave), i.e., finding a hyperplane $p_t$ such that
$$g(\Theta) \leq g(\Theta_t) + (\Theta - \Theta_t) \cdot p_t.$$
To do so, we first obtain the best latent variable assignment for each positive example by solving $z_i^{*} = \arg\max_{z_i} \mathrm{Score}(I_i, y_i, z_i; \Theta_t)$ using the DP algorithm. Then $p_t$ is constructed by
$$p_t = -\frac{C}{N}\sum_{i=1}^{N}\Phi(I_i, y_i, z_i^{*}).$$

(ii) Updating the solution by $\Theta_{t+1} = \arg\min_\Theta \big(f(\Theta) + \Theta \cdot p_t\big)$. This step amounts to a standard structural SVM, which can be solved with different off-the-shelf solvers such as the cutting plane method; the details are referred to [13, 16].

Figure 3 shows a portion of our learned AOG model: the first layer corresponds to car pairs, the second layer to single cars, and the third layer to car parts. Beside each node in the AOG, we visualize the learned appearance and deformation templates.

5 Experiments

To evaluate the proposed method, we collected 482 car images from street-view scenes and annotated the bounding boxes of both car pairs and single cars. In detail, we obtained 1380 car pairs, 2760 occluded single cars and 702 unoccluded single cars. We randomly select 200 images for training and use the rest for testing. For the negative set, we use the negative training images of the PASCAL VOC 2007 database [3]. We also follow the VOC protocol for reporting results [3]: a putative bounding box is considered correct if the intersection of the bounding box with the ground-truth bounding box is greater than 50% of their union; multiple detections of the same ground truth are penalized. We compute Precision-Recall (PR) curves and report the average precision (AP) over our test set.
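For reference, a minimal sketch of this intersection-over-union criterion is given below; the (x1, y1, x2, y2) box format and the function names are illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_correct(detection, ground_truth, threshold=0.5):
    """VOC-style criterion: a detection counts as correct if IoU >= 0.5."""
    return iou(detection, ground_truth) >= threshold

# Example: a detection shifted by 20 pixels against a 100x50 ground-truth box.
print(is_correct((20, 0, 120, 50), (0, 0, 100, 50)))  # True (IoU = 2/3)
```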

In experiments, we compare our AOG with two baseline DPMs. Baseline 1: a DPM trained using the occluded single cars in the training set. Baseline 2: a DPM trained using all the single cars in the training set. Fig. 4 shows the PR curves of the three methods, where the proposed method outperforms the two baseline DPMs significantly (by 9.5% and 12.3% AP, respectively).

Fig. 4. Precision-Recall curves of our model and the baseline methods.

Figure 5 shows detection results of both the DPM car model (Baseline 1 is used since it is better than Baseline 2 according to the PR curves) and our AOG model. Figure 6 shows some examples of the layered detection results of our AOG model: on the top, we show the detection results of the car pair model in our AOG; on the bottom, the detection results of single cars obtained with the full AOG model are shown. Here we can see that our model also lends itself to fast coarse-to-fine detection, which we will further investigate in ongoing work.

6 Conclusion

In this paper, we proposed a hierarchical And-Or directed acyclic graph (AOG) model to address the problem of X-to-X-occlusion-free object detection. The model is a grammar model. It consists of (i) a root Or-node representing a mixture of different types of occluding X pairs, (ii) a set of And-nodes representing the different types of occluding X pairs, (iii) another set of And-nodes representing the occluding single X's decomposed from the X pairs, and (iv) a set of terminal-nodes representing the appearance templates for the X pairs, the single X's and the latent parts of the single X's.

Fig. 5. Comparison of the DPM car model and our hierarchical AOG model. The first row and the third row show the detection results (blue bounding boxes) of the DPM car detector; the second and the fourth rows show the detection results (red bounding boxes) of the proposed AOG model. Best viewed in color.

Fig. 6. Layered detections of our AOG model. Top: detection results of the car pair module by coupling in the first layer. Bottom: detection results of the single car module by decoupling in the second layer.

The part appearance templates can also be shared among different And-nodes of single X's. The model is learned by the latent structural SVM (LSSVM), and the DP algorithm is used for inference. Our model is a general one: although we only use cars as the running example in this paper, it can potentially be applied to other objects.

Acknowledgement. We thank the three anonymous reviewers for their helpful comments. This work is supported by the China 973 Program under Grant No. 2012CB316300 and by the Natural Science Foundation of China.

References

1. Choi, J.Y., Sung, K.S., Yang, Y.K.: Multiple Vehicles Detection and Tracking based on Scale-Invariant Feature Transform. In: ITSC (2007)
2. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. In: CVPR (2005)
3. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. (2007)
4. Felzenszwalb, P.F., Girshick, R.B., McAllester, D.: Discriminatively Trained Deformable Part Models, Release 4. (pff/latentrelease4/) (2010)
5. Felzenszwalb, P.F., Girshick, R.B., McAllester, D.A., Ramanan, D.: Object Detection with Discriminatively Trained Part-Based Models. TPAMI 32 (2010)
6. Felzenszwalb, P.F., Huttenlocher, D.P.: Distance Transforms of Sampled Functions. Technical report, Cornell University CIS (2004)
7. Gupte, S., Masoud, O., Martin, R.F.K., Papanikolopoulos, N.P.: Detection and Classification of Vehicles. TITS 3 (2002)
8. Lai, A.H.S., Fung, G.S.K., Yung, N.H.C.: Vehicle Type Classification from Visual-based Dimension Estimation. In: ITSC (2001)
9. Leotta, M.J., Mundy, J.L.: Vehicle Surveillance with a Generic, Adaptive, 3D Vehicle Model. TPAMI 33 (2011)
10. Liu, X., Dai, B., He, H.: Real-Time On-Road Vehicle Detection Combining Specific Shadow Segmentation and SVM Classification. In: ICDMA (2011)
11. Ott, P., Everingham, M.: Shared Parts for Deformable Part-based Models. In: CVPR (2011)
12. Petrovic, V.S., Cootes, T.F.: Analysis of Features for Rigid Structure Vehicle Type Recognition. In: BMVC (2004)
13. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large Margin Methods for Structured and Interdependent Output Variables. JMLR 6 (2005)
14. Yu, C.N.J., Joachims, T.: Learning Structural SVMs with Latent Variables. In: ICML (2009)
15. Yuille, A.L., Rangarajan, A.: The Concave-Convex Procedure (CCCP). In: NIPS (2001)
16. Zhu, L., Chen, Y., Yuille, A.L., Freeman, W.T.: Latent Hierarchical Structural Learning for Object Detection. In: CVPR (2010)
17. Zhu, S.C., Mumford, D.: A Stochastic Grammar of Images. FTCGV 2 (2006)
