Object Detection TA : Young-geun Kim Biostatistics Lab., Seoul National University March-June, 2018 Seoul National University Deep Learning March-June, 2018 1 / 57
Index
1. Introduction
2. R-CNN
3. YOLO
4. Evaluation
Introduction
Introduction
In this session, we will learn about:
The Object Detection problem.
Regions with CNN features (R-CNN), a region proposal-based model.
You-Only-Look-Once (YOLO), a unified model.
Evaluation metrics for detection models.
Introduction: What is Object Detection?
Object Detection is the task of finding where objects are and what they are (Localization + Classification).
It is an integral part of various vision applications such as automated driving systems, face detection, and object counting.
Figure: from https://youtu.be/mpu2histivi (YOLO v3 clip).
Introduction: What is Object Detection? (Cont.)
For a given image, the task-taker must output the predicted regions together with their class confidences.
A region is expressed as a rectangle called a bounding box.
The number of objects is not provided.
Figure: from Ren et al., 2015.
Introduction: What is Object Detection? (Cont.)
The exact region (or bounding box) of each object is called the Ground-Truth (GT) box: the minimal rectangle containing the whole object.
A region is parameterized by (x, y, w, h), where (x, y) is the coordinate of the top-left (or center) point, w is the width, and h is the height of the bounding box.
Intersection over Union (IoU), the ratio of the intersection area to the union area between a predicted region and a GT box, measures localization accuracy.
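As a minimal sketch of the IoU definition above, the following function assumes boxes are given as (x, y, w, h) with (x, y) the top-left corner; the function name and the example boxes are illustrative only.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x, y, w, h), with (x, y) the top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


gt_box = (0, 0, 10, 10)     # hypothetical GT box
pred_box = (5, 0, 10, 10)   # hypothetical prediction shifted right by 5
print(iou(gt_box, pred_box))
```

Here the two boxes share a 5 x 10 strip, so the intersection is 50 and the union is 150, giving an IoU of one third.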
Introduction: What is Object Detection? (Cont.)
There are various types of objects. For example, the VOC challenge requires detecting the following 20 object classes.
Middle Level | Low Level
Person | person
Animal | bird, cat, cow, dog, horse, sheep
Vehicle | aeroplane, bicycle, boat, bus, car, motorbike, train
Indoor | bottle, chair, diningtable, potted plant, sofa, tv/monitor
Mean Average Precision (mAP), an estimator of the area under the precision-recall curve (AUCPR), usually measures classification accuracy.
Introduction: What is Object Detection? (Cont.)
An object is considered detected if the task-taker selects any region whose predicted label satisfies the following conditions.
Condition 1: The region highly overlaps the GT box of the object.
Condition 2: The object is correctly classified.
Introduction: Challenges
Infinitely Imbalanced Structure: Background (BG) is the majority class. There are few positive regions (GT) and infinitely many negative regions (BG).
True \ Predicted | N | P | Total
N | TN | FP | # of BG
P | FN | TP | # of GT
Table: The confusion matrix of object detection.
In this structure, accuracy on the positive class suffers severely. This means that finding an object's position is difficult in itself.
Introduction: Challenges (Cont.)
Dynamic Scale: Objects vary in shape. Some are tiny or huge, and some are horizontally or vertically elongated.
Figure: from VOC2012.
This means that our model should recognize regions at various scales.
Introduction: Challenges (Cont.)
Multi-task: Finding the object position (Localization) and classifying the object (Classification) are each difficult. Object Detection requires performing both tasks simultaneously.
In practice, the test time of a detection model should be short, which is challenging given the difficulty of the task.
Introduction: Approaches
Pre-deep learning approaches (not covered here; see 50 years of object recognition: Directions forward).
Regions with CNN features (R-CNN), a region proposal-based model.
You-Only-Look-Once (YOLO), a unified model.
R-CNN
R-CNN: Regions with CNN features
Regions with CNN features (R-CNN) is a region proposal-based model (Girshick et al., 2014).
R-CNN selects regions using Selective Search (Uijlings et al., 2013), warps them to the same scale, and extracts features on which class-specific SVMs are trained.
Figure: from Girshick et al., 2014.
R-CNN: Selective Search
Selective Search (SS) is a hierarchical grouping algorithm that operates on a set of regions.
Given an initial set of regions, SS greedily merges the most similar regions.
The similarity measure is a partial sum of colour, texture, size, and fill similarities.
Figure: from Uijlings et al., 2013.
R-CNN: Selective Search (Cont.)
By considering various features starting from fine-level regions, SS distinguishes objects and captures their hierarchical structure.
Initialization is based on a graph-based segmentation algorithm (Felzenszwalb and Huttenlocher, 2004) whose time complexity is nearly linear in the number of pixels.
R-CNN: Detection Network
Proposed regions pass through a CNN, followed by a classifier and a bounding box (bbox) regressor.
AlexNet (Krizhevsky et al., 2012) is used, with the final FC layer replaced to match the number of classes including BG.
After fine-tuning AlexNet, class-specific linear SVMs and a bbox regressor (Felzenszwalb et al., 2010) are trained on the extracted features.
The bbox regressor predicts (x, y, w, h) of the GT box and uses it to adjust the proposed regions.
R-CNN: Limitation of R-CNN
R-CNN requires fine-tuning a CNN and then training multiple SVMs and a bbox regressor (a multi-stage pipeline).
Training the SVMs and the bbox regressor requires feedforwarding all regions in all images and saving all extracted features.
For the same reason, test time is too long: it takes 47 seconds to perform detection on a single image.
R-CNN: Spatial Pyramid Pooling Network
Feedforwarding all proposed regions is time-consuming.
Spatial Pyramid Pooling (SPP; He et al., 2014) models the spatial connectivity and maps regions of various sizes to a fixed size.
For usual CNNs, $\text{warp} \circ \text{conv} \neq \text{conv} \circ \text{warp}$, since there is no spatial connectivity between the raw image and the extracted feature.
Figure: from He et al., 2014.
R-CNN: Spatial Pyramid Pooling Network (Cont.)
The SPP Network learns the spatial connectivity while still capturing semantic content.
SPP reduces the computation cost, but the SPP Network is still a multi-stage pipeline.
Figure: from He et al., 2014.
R-CNN: Fast R-CNN
Fast R-CNN (Girshick, 2015) is a variation of R-CNN that applies Region of Interest (RoI) pooling, a kind of SPP.
Training is single-stage, using a multi-task loss. The multi-task loss enables us to update all weights simultaneously.
Figure: from Girshick, 2015.
R-CNN: Region of Interest Pooling
RoI pooling connects a region in the raw image to the corresponding part of the final extracted feature before the FC layers.
The RoI feature vector is passed to two sibling FC layers.
Figure: from Girshick, 2015.
R-CNN: Region of Interest Pooling (Cont.)
In contrast to usual max-pooling, RoI pooling has a dynamic filter size.
Backpropagation through RoI pooling requires the activated positions for each region.
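The dynamic filter size can be illustrated with a small pure-Python sketch. It assumes the RoI is already expressed in feature-map coordinates as (row0, col0, row1, col1) with exclusive end indices, and that bins are delimited by floor and ceil of the fractional boundaries; these conventions and the function name are assumptions for illustration, not the exact scheme of any particular implementation.

```python
def roi_max_pool(feature, roi, out_h, out_w):
    """Max-pool the RoI of a 2D feature map (list of lists) to out_h x out_w."""
    r0, c0, r1, c1 = roi
    h, w = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_h):
        # Row range of bin i: the bin height depends on the RoI height.
        rs = r0 + (i * h) // out_h
        re = r0 + -(-((i + 1) * h) // out_h)  # ceiling division
        row = []
        for j in range(out_w):
            cs = c0 + (j * w) // out_w
            ce = c0 + -(-((j + 1) * w) // out_w)
            row.append(max(feature[r][c] for r in range(rs, re) for c in range(cs, ce)))
        pooled.append(row)
    return pooled


# A 4 x 4 feature map with values 0..15.
feature_map = [[4 * r + c for c in range(4)] for r in range(4)]
print(roi_max_pool(feature_map, (0, 0, 4, 4), 2, 2))
print(roi_max_pool(feature_map, (0, 0, 2, 4), 2, 2))
```

Both calls return a 2 x 2 output, but the first uses 2 x 2 bins while the second uses 1 x 2 bins, which is what "dynamic filter size" refers to. Recording which position attained each maximum would give the activated positions needed for backpropagation.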
R-CNN: Multi-task Loss
For a given region $(x_r, y_r, w_r, h_r)$ in an image, Fast R-CNN calculates $p$ and $t^k = (t^k_x, t^k_y, t^k_w, t^k_h)$, the predicted probability vector and the predicted location for class $k$, parameterized as follows.
$t^k_x = (x^k - x_r)/w_r$
$t^k_y = (y^k - y_r)/h_r$
$t^k_w = \log(w^k / w_r)$
$t^k_h = \log(h^k / h_r)$
Let $u$ and $v$ be the true class and the location of the corresponding GT box for the given region. $v$ is parameterized in the same way, by substituting the $(x, y, w, h)$ of the GT box.
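The parameterization above can be sketched as a pair of functions, one mapping a box to its targets relative to a region and one inverting the mapping. The names `encode` and `decode` and the example numbers are hypothetical.

```python
import math


def encode(box, region):
    """Map a box (x, y, w, h) to targets (t_x, t_y, t_w, t_h) relative to a region."""
    x, y, w, h = box
    xr, yr, wr, hr = region
    return ((x - xr) / wr, (y - yr) / hr, math.log(w / wr), math.log(h / hr))


def decode(t, region):
    """Invert encode: recover (x, y, w, h) from targets and the region."""
    tx, ty, tw, th = t
    xr, yr, wr, hr = region
    return (xr + tx * wr, yr + ty * hr, wr * math.exp(tw), hr * math.exp(th))


region = (10.0, 20.0, 40.0, 80.0)   # hypothetical proposed region
gt_box = (14.0, 12.0, 80.0, 40.0)   # hypothetical GT box
v = encode(gt_box, region)
print(v)
print(decode(v, region))
```

The offsets are divided by the region size and the scales are compared on a log scale, so the targets do not depend on the absolute position or size of the region.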
R-CNN: Multi-task Loss (Cont.)
To train the multi-task model, the loss function is designed as
$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \geq 1] L_{loc}(t^u, v)$
where $L_{cls}(p, u) = -\log p_u$ is the log loss for the true class $u$ and $L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \text{huber}(t^u_i - v_i)$.
The hyper-parameter $\lambda$ controls the balance between the classification loss and the regression loss.
For $u = 0$, a background region, $L_{loc}$ does not play any role.
$L_{loc}$ is a function of $t^u_i - v_i$, so $L$ is invariant to translation, flipping, and rescaling.
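A minimal sketch of this loss for a single region follows. It assumes the Huber function with threshold 1 (the smooth L1 form) and a default of lambda = 1; both are assumptions for illustration, as are the function names.

```python
import math


def huber(d):
    """Huber loss with threshold 1: quadratic near zero, linear in the tails."""
    a = abs(d)
    return 0.5 * a * a if a < 1.0 else a - 0.5


def multitask_loss(p, u, t_u, v, lam=1.0):
    """L = L_cls + lam * [u >= 1] * L_loc for one region."""
    l_cls = -math.log(p[u])
    # The localization term is switched off for background regions (u = 0).
    l_loc = sum(huber(ti - vi) for ti, vi in zip(t_u, v)) if u >= 1 else 0.0
    return l_cls + lam * l_loc


p = [0.5, 0.5]                       # predicted probabilities for BG and one class
t_pred = (0.0, 0.0, 0.0, 0.0)        # predicted targets for class 1
v_true = (0.5, 2.0, 0.0, 0.0)        # GT targets
print(multitask_loss(p, 0, t_pred, v_true))  # background: classification only
print(multitask_loss(p, 1, t_pred, v_true))  # object: classification + localization
```

In the second call the localization term is 0.125 + 1.5 = 1.625, added to the classification term log 2.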
R-CNN: Limitation of Fast R-CNN (Cont.)
Compared to R-CNN, Fast R-CNN achieves slightly higher accuracy with a nearly 100 times shorter test time, but the test time is still long.
On the VOC 2007 test task, Fast R-CNN takes 1830 ms per image.
The region proposal step, SS, accounts for most of this, consuming 1510 ms per image.
R-CNN: Faster R-CNN
Faster R-CNN (Ren et al., 2015) is a variation of Fast R-CNN that uses a Region Proposal Network (RPN).
Roughly speaking, Faster R-CNN = RPN + Fast R-CNN.
In contrast to SS, RPN has learnable weights trained with a multi-task loss.
R-CNN: Region Proposal Network
For a given point in an image, RPN classifies the objectness of several regions centered on that point and regresses their exact locations.
The pre-determined points in each image are called anchors.
Figure: adjusted from VOC2012.
R-CNN: Region Proposal Network (Cont.)
1. For a selected anchor, view a small region near the anchor at the level of the extracted feature.
2. Determine the objectness of k regions centered on the corresponding anchor in the raw image.
Figure: from Ren et al., 2015.
R-CNN: Region Proposal Network (Cont.)
3. For all regions classified as positive, adjust them using the reg layer.
Figure: from Ren et al., 2015.
R-CNN: Multi-task loss for RPN
RPN uses a multi-task loss similar to that of Fast R-CNN. The exact formula is
$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$
where $i$ is the index of an anchor.
Here, $p_i$ is the predicted probability of anchor $i$ being an object and $p_i^*$ is the ground-truth label. $t$ is the same as in Fast R-CNN.
(Opinion) This data has a multi-label structure. Note that the input domain of the loss is a single anchor box, not the set of anchor boxes sharing a center. This design relaxes the issue of class correlation between anchor boxes.
R-CNN: Training Faster R-CNN
RPN and Fast R-CNN share the feature extractor.
This shared structure reduces test time, which is the origin of the name Faster R-CNN.
The sharing is implemented by the following sequence.
Phase | Feature Extractor | Region Proposal
1. Train RPN | Initialized from ImageNet model | -
2. Train Fast R-CNN | Initialized from ImageNet model | RPN from phase 1
3. Tune RPN | Frozen from phase 2 | -
4. Tune Fast R-CNN | Frozen from phase 2 | RPN from phase 3
R-CNN: Summary of R-CNN variations
All the models use a bbox regressor to adjust the proposed regions.
R-CNN uses SVMs and the others use a softmax classifier.
Model | Region Proposal Method | Region Scaling Method
R-CNN | SS | Warping
Fast R-CNN | SS | RoI pooling
Faster R-CNN | RPN | RoI pooling
Table: Key methodologies.
Model | mAP (%) | Test time (ms/image)
R-CNN | 66.0 | > 10^4
Fast R-CNN | 66.9 | 1830
Faster R-CNN | 69.9 | 198
Table: Evaluation on the VOC 2007 test set, adjusted from Girshick, 2015 and Ren et al., 2015.
YOLO
YOLO: You-Only-Look-Once
You-Only-Look-Once (YOLO; Redmon et al., 2016) is a unified model.
YOLO uses a single CNN to solve both the localization and the classification problem.
From the introduction of the paper: "Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact."
For a given image, YOLO feedforwards only once, remarkably reducing test time.
All the figures, tables, and equations in this section come from Redmon et al., 2016.
YOLO: Terminology
An image is divided into $S \times S$ grid cells.
If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
YOLO: Terminology (Cont.)
Each grid cell predicts $B$ bounding boxes and the corresponding objectness confidences.
Each bounding box is parameterized by (x, y, w, h), the same as in R-CNN.
The objectness confidence is $\Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}}$.
YOLO: Terminology (Cont.)
All bounding boxes sharing a grid cell have the same conditional class probability, formally $\Pr(\text{Class}_i \mid \text{Object})$.
At test time, the class-specific confidence, $\Pr(\text{Class}_i) \times \text{IOU}^{\text{truth}}_{\text{pred}}$, is predicted by multiplying the predicted conditional class probability and the objectness confidence.
YOLO: Architecture
For a given image, YOLO predicts (x, y, w, h) and the objectness confidence for all bounding boxes, and the conditional class probabilities for all grid cells.
Considering its spatial meaning, we can view the output as an $S \times S \times (B \times 5 + C)$ tensor.
In the VOC competition, $S = 7$, $B = 2$, and $C = 20$.
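The output size and the notion of a responsible grid cell can be sketched as follows. The function names, the 448 x 448 image size, and the example object center are assumptions for illustration.

```python
def output_shape(S, B, C):
    """Each of the S x S cells holds B boxes (x, y, w, h, confidence) plus C class probabilities."""
    return (S, S, B * 5 + C)


def responsible_cell(cx, cy, img_w, img_h, S):
    """Return (row, col) of the grid cell containing the object center (cx, cy)."""
    col = min(int(cx / img_w * S), S - 1)  # clamp centers lying on the right edge
    row = min(int(cy / img_h * S), S - 1)  # clamp centers lying on the bottom edge
    return row, col


shape = output_shape(S=7, B=2, C=20)
print(shape)                                    # (7, 7, 30)
print(shape[0] * shape[1] * shape[2])           # 1470 output values per image
print(responsible_cell(224, 100, 448, 448, 7))  # (1, 3)
```

With the VOC settings, each cell carries 2 x 5 + 20 = 30 values, so the network outputs 7 x 7 x 30 = 1470 numbers per image.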
YOLO: Architecture (Cont.)
For a given image, the convolution layers extract features, and the final FC layer predicts the bounding box parameters, the objectness confidences, and the conditional class probabilities.
Figure: the architecture of YOLO.
YOLO: Loss
The loss function of YOLO consists of five terms.
The first two terms are about bbox regression.
The next two terms are about objectness classification, and the last term is about class classification.
Here, $\mathbb{1}^{\text{obj}}_{i}$ and $\mathbb{1}^{\text{obj}}_{ij}$ are indicators of the responsibility of the $i$th grid cell and of its $j$th bounding box, respectively.
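For reference, the loss in Redmon et al., 2016 can be written as below. The symbols $\lambda_{\text{coord}}$, $\lambda_{\text{noobj}}$, $C_i$ (objectness confidence), and $p_i(c)$ (conditional class probability) follow that paper's notation and are not defined elsewhere in these slides; hatted symbols denote predictions.

```latex
\begin{aligned}
L ={} & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij}
        \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
      & + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij}
        \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
      & + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij} \left( C_i - \hat{C}_i \right)^2 \\
      & + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{noobj}}_{ij} \left( C_i - \hat{C}_i \right)^2 \\
      & + \sum_{i=0}^{S^2} \mathbb{1}^{\text{obj}}_{i} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The square roots on width and height make an error of a given size count more for small boxes than for large ones.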
YOLO: Performance
YOLO is the first deep-learning model in the context of real-time detection with state-of-the-art accuracy.
Real-time: 30 frames per second or better.
When a car travels at 60 km/h, it moves about 0.56 m between detections.
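The distance figure follows from a short unit conversion, sketched here with illustrative variable names.

```python
speed_kmh = 60   # car speed in km/h
fps = 30         # detections per second at the real-time threshold

metres_per_second = speed_kmh * 1000 / 3600
distance_m = metres_per_second / fps   # distance covered between two detections
print(round(distance_m, 3))            # 0.556
```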
YOLO: Performance (Cont.)
Compared with Fast R-CNN, YOLO has a higher localization error and a lower background error.
Correct: correct class and IoU > 0.5.
Loc: correct class, 0.1 < IoU < 0.5.
Sim: class is similar, IoU > 0.1.
Other: class is wrong, IoU > 0.1.
Background: IoU < 0.1 for any object.
Evaluation
Evaluation: Non Maximum Suppression
Some of the predicted regions overlap severely.
In object detection, multiple detections of a single GT box are penalized.
Figure: from https://kr.mathworks.com/help/vision/ref/selectstrongestbbox.html
Evaluation: Non Maximum Suppression (Cont.)
Non Maximum Suppression (NMS) is a preprocessing step for evaluation that removes overlapping regions using their confidences.
Choose the most confident bounding box and remove all other boxes that have a high IoU with it.
Repeat with the remaining boxes until none are left.
NMS is applied to both the RPN and the detection network.
Figure: from https://kr.mathworks.com/help/vision/ref/selectstrongestbbox.html.
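The NMS procedure above can be sketched in a few lines. Boxes are assumed to be (x, y, w, h) with a top-left corner, and the IoU threshold of 0.5 is an assumed default, not a value taken from the slides.

```python
def iou(a, b):
    """IoU of two boxes given as (x, y, w, h), with (x, y) the top-left corner."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept, most confident first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # most confident remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the chosen one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 10, 10)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box has an IoU of 81/119 (about 0.68) with the first and is suppressed, while the third box does not overlap and is kept.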
Evaluation: Evaluation measures
In the Infinitely Imbalanced Structure, performance measures that use TN may be unsuitable.
True \ Predicted | N | P | Total
N | TN | FP | # of BG
P | FN | TP | # of GT
Table: The confusion matrix of object detection.
Detecting all objects is easy in itself: just classify every region as an object.
What would be the value of TN? If the model is reasonable, TN should be effectively infinite, since there are infinitely many BG regions.
Evaluation: Evaluation measures (Cont.)
The main evaluation measures in object detection are based on Precision and Recall.
Precision: the proportion of TP among regions labeled positive, TP/(TP+FP).
Recall: the proportion of TP among actual positives, TP/(TP+FN).
F1 score: the harmonic mean of precision and recall.
AUCPR: the area under the precision-recall curve.
A commonly used estimator of AUCPR in object detection is Average Precision (AP).
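These definitions translate directly into code. The counts in the example are made up for illustration, and returning 0 when a denominator is zero is an assumed convention.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts (TN is never used)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts: 6 correct detections, 2 false alarms, 4 missed GT boxes.
precision, recall, f1 = precision_recall_f1(tp=6, fp=2, fn=4)
print(precision, recall, f1)
```

None of the three measures involves TN, which is why they remain meaningful when the number of background regions is unbounded.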
Evaluation: Average Precision
Let $c$ be the confidence threshold. Then AUCPR can be expressed as
$\text{AUCPR} = \int \text{Precision}(c) \, d\text{Recall}(c)$
where $\text{Precision}(c)$ and $\text{Recall}(c)$ are the precision and recall at threshold level $c$, respectively.
By plugging in the empirical precision and recall, $\widehat{\text{Precision}}(c)$ and $\widehat{\text{Recall}}(c)$, we get an estimator of AUCPR,
$\widehat{\text{AUCPR}} = \int \widehat{\text{Precision}}(c) \, d\widehat{\text{Recall}}(c)$.
Evaluation: Average Precision (Cont.)
Here, by the definition of the Riemann-Stieltjes integral,
$\widehat{\text{AUCPR}} = \int \widehat{\text{Precision}}(c) \, d\widehat{\text{Recall}}(c) = \sum_{c \in \{\text{conf}_i \mid i \in P\}} \widehat{\text{Precision}}(c) \times \frac{\#\text{ of } P \text{ with confidence equal to } c}{\#\text{ of } P}$.
This is the Average Precision (AP), a weighted average of the precisions at each confidence level of the GT boxes.
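The sum above can be sketched for a single class as follows. The sketch assumes each detection has already been matched against the GT boxes and is given as a pair (confidence, is_true_positive), and that GT boxes with no matching detection contribute nothing to the sum; the function name and the example data are hypothetical.

```python
def average_precision(detections, n_gt):
    """AP as a weighted average of precisions at the confidences of detected GT boxes.

    detections: list of (confidence, is_true_positive); n_gt: number of GT boxes.
    """
    ap = 0.0
    for c in sorted({conf for conf, is_tp in detections if is_tp}):
        # Empirical precision at threshold c: TP among detections with confidence >= c.
        above = [is_tp for conf, is_tp in detections if conf >= c]
        precision = sum(above) / len(above)
        # Weight: share of GT boxes detected with confidence exactly c.
        matched_at_c = sum(1 for conf, is_tp in detections if is_tp and conf == c)
        ap += precision * matched_at_c / n_gt
    return ap


detections = [(0.9, True), (0.8, False), (0.7, True), (0.6, True)]
ap = average_precision(detections, n_gt=4)
print(ap)
```

In this example the precisions at thresholds 0.9, 0.7, and 0.6 are 1, 2/3, and 3/4, each weighted by 1/4, and the one undetected GT box lowers the total to 29/48.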
Evaluation: Average Precision (Cont.)
To account for the various classes, the mean of the per-class APs is used. This is called mean Average Precision (mAP).
In practice, Interpolated AP is used because of the wiggles in the precision-recall curve. Unlike the ROC curve, the precision-recall curve need not be monotonic.
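One common definition of interpolation, used in the VOC challenge, replaces the precision at each recall level by the maximum precision attained at that recall or any higher one. The sketch below assumes the precisions are listed in order of increasing recall; the function names and numbers are illustrative.

```python
def interpolate_precision(precisions):
    """Replace each precision by the maximum precision at the same or a higher recall."""
    out = list(precisions)
    for i in range(len(out) - 2, -1, -1):
        out[i] = max(out[i], out[i + 1])
    return out


def mean_average_precision(ap_per_class):
    """mAP: the plain mean of the per-class APs."""
    return sum(ap_per_class) / len(ap_per_class)


wiggly = [1.0, 0.5, 0.67, 0.5]         # precision dips, then recovers
print(interpolate_precision(wiggly))   # [1.0, 0.67, 0.67, 0.5]
print(mean_average_precision([0.5, 0.7, 0.9]))
```

After interpolation the curve is non-increasing in recall, which removes the wiggles before the area is estimated.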
References
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 580-587).
Uijlings, Jasper R. R., et al. Selective search for object recognition. International Journal of Computer Vision 104.2 (2013): 154-171.
Felzenszwalb, Pedro F., and Daniel P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision 59.2 (2004): 167-181.
Felzenszwalb, Pedro F., et al. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32.9 (2010): 1627-1645.
A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
He, Kaiming, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition. European Conference on Computer Vision. Springer, Cham, 2014.
Girshick, Ross. Fast R-CNN. arXiv preprint arXiv:1504.08083 (2015).
Simonyan, Karen, and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
Ren, Shaoqing, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems. 2015.
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 779-788).
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303-338.
Boyd, Kendrick, Kevin H. Eng, and C. David Page. Area under the precision-recall curve: Point estimates and confidence intervals. Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, Berlin, Heidelberg, 2013.
Introduction to modern information retrieval