Abandoned Luggage Detection

1 Abandoned Luggage Detection
Using deep neural networks to detect and track abandoned luggage in video
By Evan Miller and Eli Selkin
For CS519, instructed by Dr. Hao Ji, at California State Polytechnic University, Pomona

2 Presentation Outline
Introduction to automated video surveillance
Object detection and localization from video using SSD
Object tracking
Threat detection
Issues and potential solutions
Future work

3 History of Video Surveillance
Originates with analog closed-circuit television (CCTV) monitoring
Analog video became more affordable through the 1970s-80s
Mid-to-late 80s - Charge-Coupled Device (CCD) cameras
  Allowed for filming in low light and at night, digital
Nanny cams pushed cameras to be extremely small so they could be hidden anywhere
90s - Digital multiplexing, digital storage, portability and affordability
Video cameras installed on New York City Municipal Building near City Hall
Post 9/11 - Push for computer-enhanced surveillance
May - U.S. Parks Service installs facial recognition at the Statue of Liberty and on Ellis Island
Use of new and old technologies for computer-assisted surveillance
  Push for complete autonomy, neural networks, etc.
Strong motivations for automated abandoned luggage detection
  Prevent bombings/other attacks in airports, subways, etc. [1]

4 Cornerstones of Automated Video Surveillance
Object Detection and Localization
  Identifying and locating persons and other objects of interest
Object Tracking
  Maintaining object movement over time and establishing object permanence between frames
Threat Detection
  Identifying situations that are threats to security
Perpetrator Identification

5 Object Detection and Localization
What is it?
  Objects' features must be learned so that the components can be recognized.
Past: detection using sliding windows, recognizing edges, pyramids.
  Limitations
Present: detection using neural networks (NN).
  Learning weights.
How do neural networks localize?
VGG based (or any network really):
  1. R-CNN - SVMs from CNN [2], 0.1 fps
  2. Fast R-CNN - RoI [3], 5 fps
  3. Faster R-CNN - RPN [4], 7 fps
  4. SSD multibox [5], 46 fps
"Non-classifier" based:
  1. YOLO [6], 45 fps
  2. YOLOv2 detects in each box, 67 fps

6 Object Detection Has Been Done
Hasn't it surpassed human ability?
Yes, error rates have surpassed human ability for general detection. But localization and segmentation are still open areas of study, especially when they must run in real time.
Why do you need real time?
Real time is needed for analysis of time-sensitive data, such as surveillance video or the detection of defects in a large number of parts on a moving production line.

7 Why SSD?
Positive aspects of SSD vs Faster R-CNN
  Stability
    Built on trusted components
    Peer reviewed
  Speed
    Near real time
    Lower computational complexity
Why not YOLOv2?
  Implementation: C only, from darknet (no complete official Python wrapper)
  Lack of a peer-reviewed source. Several authors report much lower mAP (mean average precision) for YOLOv2 than the authors of YOLOv2 suggest.

8 SSD - VGG base + New Convolutional Layers
Modified VGG-16 + SSD (K: kernel size, S: stride, P: padding), listed from input to output:
INPUT 300x300
2x conv K:3, Filters: 64, then MAX POOL k:2 s:2
2x conv K:3, Filters: 128, then MAX POOL k:2 s:2
3x conv K:3, Filters: 256, then MAX POOL k:2 s:2
3x conv K:3, Filters: 512 (4_3), then MAX POOL k:2 s:2
3x conv K:3, Filters: 512, then MAX POOL k:3 s:1
conv K:3, P:6, Filters: 1024
conv K:1, P:0, Filters: 1024
conv K:1, Filters: 256
conv K:3, S:2, P:1, Filters: 512
conv K:1, S:1, P:0, Filters: 128
conv K:3, S:2, P:1, Filters: 256
conv K:1, S:1, P:0, Filters: 128
conv K:3, S:1, P:0, Filters: 256
conv K:1, S:1, P:0, Filters: 128
conv K:3, S:1, P:0, Filters: 256
Six of these layers feed predictions (P).
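To make the kernel, stride and padding values on this slide concrete, here is a minimal sketch that walks the extra SSD layers and reports the spatial size of each feature map. The starting size of 19x19 and the `convN_1` layer names are assumptions for illustration; only the `convN_2` names and the K/S/P values come from the slides.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution, using floor rounding."""
    return (size + 2 * pad - kernel) // stride + 1

# (name, kernel, stride, pad) for the extra layers after the VGG base.
# The convN_1 names are assumed labels for the 1x1 layers on the slide.
EXTRA_LAYERS = [
    ("conv6_1", 1, 1, 0), ("conv6_2", 3, 2, 1),
    ("conv7_1", 1, 1, 0), ("conv7_2", 3, 2, 1),
    ("conv8_1", 1, 1, 0), ("conv8_2", 3, 1, 0),
    ("conv9_1", 1, 1, 0), ("conv9_2", 3, 1, 0),
]

def extra_layer_sizes(start_size=19):
    """Feature map size after each extra layer (start_size is assumed)."""
    sizes = {}
    size = start_size
    for name, kernel, stride, pad in EXTRA_LAYERS:
        size = conv_out(size, kernel, stride, pad)
        sizes[name] = size
    return sizes

if __name__ == "__main__":
    sizes = extra_layer_sizes()
    print({name: s for name, s in sizes.items() if name.endswith("_2")})
    # {'conv6_2': 10, 'conv7_2': 5, 'conv8_2': 3, 'conv9_2': 1}
```

The shrinking maps are what let one network predict boxes at several scales: large maps cover small objects, and the final 1x1 map covers objects that fill the frame.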

9 What do you do with all those "predictions"? Well, they aren't probabilities yet...
Prediction sources: conv4_3, fc7, conv6_2, conv7_2, conv8_2, conv9_2
Confidences: conv + permute + flatten for each source, then concatenate the flat confidences
  Softmax on confidences: now they are probabilities
Locations: multibox conv + permute + flatten for each source, then concatenate the flat multibox outputs
PriorBoxes: one set per source, concatenated
  PriorBox with variances converts the multibox outputs to boxes
NMS based on confidences
SSD [4]
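The last two steps of this pipeline, softmax over the confidences and non-maximum suppression (NMS) on the resulting scores, can be sketched in plain Python. This is an illustrative sketch, not the code used in the project; the function names and the 0.45 IoU threshold are assumptions.

```python
import math

def softmax(scores):
    """Turn raw class scores for one box into probabilities."""
    peak = max(scores)                      # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def iou(a, b):
    """Intersection over union of two [xmin, ymin, xmax, ymax] boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: keep a box only if it does not overlap a better kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

if __name__ == "__main__":
    boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))  # [0, 2]
```

In the example, the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the distant third box survives.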

10 Training a Model for Detection and Localization
Data, data, data, data
Just like in training a detection model, you need a lot of data for the model to be able to learn good weights. But beyond pairing each image with a single one-hot vector, the labels for localization are a little more complex. They contain a label tag as well as potentially two other sets of data: 1) the coordinates of a bounding box, 2) the coordinates for segmentation.
Data Sources
For the current demonstration we are using a model trained on the MS-COCO dataset. However, it was trained on the 2014 data, which has 30,000 fewer images than the current set. We're working on training on a smaller subsample of the 2017 COCO dataset, but we'll discuss that under future work.

11 Example ImageNet image and annotation (target)
<annotation>
  <folder>n00007846</folder>
  <filename>n00007846_123688</filename>
  <source>
    <database>imagenet database</database>
  </source>
  <size>
    <width>500</width>
    <height>333</height>
    <depth>3</depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>n </name>
    <pose>unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>87</xmin>
      <ymin>0</ymin>
      <xmax>499</xmax>
      <ymax>332</ymax>
    </bndbox>
  </object>
</annotation>
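An annotation in this XML layout can be read with Python's standard library. The sketch below is one way to pull out the image size and bounding boxes; the `parse_annotation` helper is hypothetical, and the `<name>` value in the sample string is an assumption because that field is garbled in the slide.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the slide's annotation. The <name> value is assumed.
ANNOTATION = """<annotation>
  <folder>n00007846</folder>
  <filename>n00007846_123688</filename>
  <size><width>500</width><height>333</height><depth>3</depth></size>
  <object>
    <name>n00007846</name>
    <bndbox><xmin>87</xmin><ymin>0</ymin><xmax>499</xmax><ymax>332</ymax></bndbox>
  </object>
</annotation>"""

def parse_annotation(xml_text):
    """Return image size and a list of labelled boxes from one annotation."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    width = int(size.findtext("width"))
    height = int(size.findtext("height"))
    objects = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        coords = [int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
        objects.append({"name": obj.findtext("name"), "box": coords})
    return {"width": width, "height": height, "objects": objects}

if __name__ == "__main__":
    target = parse_annotation(ANNOTATION)
    print(target["width"], target["height"])  # 500 333
    print(target["objects"][0]["box"])        # [87, 0, 499, 332]
```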

12 Example COCO image and annotation (target)
[{"bbox": [0.0, 139.06, 76.57, 245.87], "category_id": 1, "segmentation": [[5.32, 176.11, 0.0, 176.11, 17.67, 166.61, 11.97, 153.31, 10.07, 149.51, ...]]}, ...]
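COCO stores each `bbox` as `[x, y, width, height]` and each segmentation polygon as a flat list of alternating x and y values, which differs from the corner format of the XML annotation on the previous slide. A minimal conversion sketch follows; the helper names are hypothetical, and the sample values are the first annotation shown on this slide.

```python
def coco_to_corners(bbox):
    """Convert a COCO [x, y, width, height] box to [xmin, ymin, xmax, ymax]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]

def polygon_points(flat):
    """Group a flat [x1, y1, x2, y2, ...] polygon into (x, y) pairs."""
    return list(zip(flat[0::2], flat[1::2]))

if __name__ == "__main__":
    annotation = {
        "bbox": [0.0, 139.06, 76.57, 245.87],
        "category_id": 1,
        "segmentation": [[5.32, 176.11, 0.0, 176.11, 17.67, 166.61]],
    }
    xmin, ymin, xmax, ymax = coco_to_corners(annotation["bbox"])
    print(round(xmax, 2), round(ymax, 2))
    print(polygon_points(annotation["segmentation"][0])[:2])
```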

13 How Training Works
Loss
In order to train, you need a measurement of loss. The loss is a weighted sum of two calculated loss values:
  One for the loss over confidences on the image
  Another for the localization losses between the ground truth and the detected boxes
N is the number of matches (the x_ij, each a matched pair of detection and ground truth). [4]
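For reference, the weighted sum described here is written in the SSD paper (Liu et al.) in the following form, where c are the class confidences, l the predicted boxes, g the ground truth boxes, and α the weight on the localization term. The equation on the original slide did not survive extraction, so this is a restatement from the paper rather than a copy of the slide.

```latex
L(x, c, l, g) = \frac{1}{N} \left( L_{conf}(x, c) + \alpha \, L_{loc}(x, l, g) \right)
```

When N = 0 (no matches), the loss is set to 0.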

14 Computing loss over localizations:
For each positive match, sum the smoothL1 of the difference between the predicted box's offset (its difference from the default bounding box) and the true box's offset (its difference from the default bounding box).
smoothL1 is a piecewise function that gives a smaller positive value from the difference (Girshick). [2] [4]
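The localization term can be sketched as follows. Boxes are assumed to be given as (cx, cy, w, h), the offsets follow the center/size encoding from the SSD paper, and the variance scaling used in some implementations is left out. The function names are hypothetical.

```python
import math

def smooth_l1(x):
    """Quadratic near zero, linear elsewhere, so outliers are penalised less."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def encode(box, default):
    """Offsets of a (cx, cy, w, h) box relative to a default box."""
    cx, cy, w, h = box
    dcx, dcy, dw, dh = default
    return ((cx - dcx) / dw, (cy - dcy) / dh, math.log(w / dw), math.log(h / dh))

def localization_loss(predicted_offsets, truth_boxes, default_boxes):
    """Sum smooth L1 over all positive matches, given as aligned lists."""
    total = 0.0
    for pred, truth, default in zip(predicted_offsets, truth_boxes, default_boxes):
        target = encode(truth, default)
        total += sum(smooth_l1(p - t) for p, t in zip(pred, target))
    return total

if __name__ == "__main__":
    default = (50.0, 50.0, 20.0, 20.0)
    truth = (54.0, 50.0, 20.0, 20.0)   # shifted right by 4 pixels
    perfect = encode(truth, default)
    print(localization_loss([perfect], [truth], [default]))               # 0.0
    print(localization_loss([(0.0, 0.0, 0.0, 0.0)], [truth], [default]))
```

A prediction that matches the encoded ground truth exactly gives zero loss; the second call predicts "no offset" and is penalised only for the missed horizontal shift.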

15 Test Data

16 How Our Tracking System Works
Neural network detects objects (people, baggage) on scene
Each frame, detected bag bounding boxes are assigned to persistent bag instances based on their previous location histories
Bags are deemed immobile if after X frames they haven't moved a mean distance of Y
Once immobile, the nearest person is marked as the owner
The owner is tracked with its own instance and its own location history
Once the owner has moved some distance away from the bag, the bag is marked abandoned
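The steps above can be sketched as a small per-frame update. This is a simplified illustration under stated assumptions, not the project's code: all class names, parameter names and threshold values (`max_jump`, `window`, `still`, `leave`) are hypothetical, boxes are [xmin, ymin, xmax, ymax], and frames with no detected people are simply skipped.

```python
import math
from dataclasses import dataclass, field
from typing import Optional

def center(box):
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

@dataclass
class Track:
    history: list = field(default_factory=list)   # center point per frame

    def last(self):
        return self.history[-1]

    def mean_step(self, window):
        """Mean distance moved per frame over the last `window` positions."""
        pts = self.history[-window:]
        steps = [dist(p, q) for p, q in zip(pts, pts[1:])]
        return sum(steps) / len(steps) if steps else float("inf")

@dataclass
class Bag:
    track: Track
    owner: Optional[Track] = None
    abandoned: bool = False

def update(bags, bag_boxes, person_boxes,
           max_jump=50.0, window=30, still=2.0, leave=150.0):
    """Advance all bag tracks by one frame of detections."""
    people = [center(b) for b in person_boxes]
    for box in bag_boxes:
        c = center(box)
        # Assign the detection to the closest existing bag, else start a new one.
        near = [b for b in bags if dist(b.track.last(), c) <= max_jump]
        if near:
            bag = min(near, key=lambda b: dist(b.track.last(), c))
        else:
            bag = Bag(Track())
            bags.append(bag)
        bag.track.history.append(c)
    for bag in bags:
        immobile = (len(bag.track.history) >= window
                    and bag.track.mean_step(window) < still)
        if not immobile or not people:
            continue
        if bag.owner is None:
            # Nearest person at the moment the bag becomes immobile is the owner.
            bag.owner = Track([min(people, key=lambda p: dist(p, bag.track.last()))])
        else:
            # Follow the owner by nearest position to their last known location.
            bag.owner.history.append(min(people, key=lambda p: dist(p, bag.owner.last())))
        if dist(bag.owner.last(), bag.track.last()) > leave:
            bag.abandoned = True
    return bags
```

A usage example: calling `update(bags, [[95, 95, 105, 105]], [[105, 90, 115, 110]], window=3, leave=50.0)` three times makes the bag immobile and assigns the nearby person as owner; a fourth call with the person at `[195, 90, 205, 110]` marks the bag abandoned. Matching purely on position is what makes the thresholds sensitive to camera angle, as discussed in the later slides.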

17 Results

18 Issues - Object Permanence
Computers can't tell that a person (or other object) in the current frame is the same one seen in a previous frame
Humans are great at identifying uniqueness for object permanence (thanks, short-term memory!):
  Colors
  Size
  Position/trajectory
  Clothing
  Body structure
  Face
  Movement patterns (stride, posture)
Some of this is difficult to implement or impossible to execute in real time
Our solution: pick the most significant measure - position/trajectory

19 Issues - Perspective
Only one perspective has many limitations
  Occlusion
    Complete - lose track of objects
    Partial - misidentification of objects, inaccurate bounding boxes
  Size and absolute position of objects in scene are hard to estimate
  Varying movement paths (down a hallway vs. across a room)
  Don't know internal camera parameters, no calibration
Certain camera positions make object detection difficult
  High angles, far distances give low detection rates
  Neural network likely trained with images that are side-on
As a result, our implementation relies on many hard-coded thresholds
  Thresholds work well with some camera angles but not with others

20 Ideal Solution
Stereoscopic vision solves many of these issues
  Allows for more accurate reconstruction of the scene
  Absolute positions easier to identify
  Occlusion is reduced, especially with remote camera placements in the same scene
  Accurate distances from abandoned bags
  Object permanence more consistently tracked
  Calibration and knowing internal camera parameters help accuracy significantly
Object detection can be more accurate
  More angles = more opportunities for the neural network to detect objects
  Can consistently improve detection model by training specifically on images from established camera angles

21 Future Work for Improving Detection
Training a new model
  Training on COCO '17 with limited categories (6: person, cat, dog, backpack, handbag, suitcase)
  Augmenting data: although SSD takes data and provides some internal augmentation, it does not do full augmentation with rotation and mirroring.
Modifying thresholds, NMS, and other hyperparameters
  There are many hyperparameters. We feel that some of the hyperparameters used in training the pre-trained models assumed a large number of categories, so they limited the total number of suggestions. Since we are going to have 6 categories instead of 80, we can increase the number of proposals kept.
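Both training ideas can be illustrated on a COCO-style dictionary. The sketch below filters a dataset down to the six categories named above and mirrors a bounding box horizontally; the helper names and the toy dataset are hypothetical, and real COCO files contain more fields than shown here.

```python
KEEP = {"person", "cat", "dog", "backpack", "handbag", "suitcase"}

def filter_categories(dataset, keep_names=KEEP):
    """Keep only the wanted categories, their annotations, and their images."""
    categories = [c for c in dataset["categories"] if c["name"] in keep_names]
    keep_ids = {c["id"] for c in categories}
    annotations = [a for a in dataset["annotations"] if a["category_id"] in keep_ids]
    image_ids = {a["image_id"] for a in annotations}
    images = [i for i in dataset["images"] if i["id"] in image_ids]
    return {"images": images, "annotations": annotations, "categories": categories}

def hflip_bbox(bbox, image_width):
    """Mirror a COCO [x, y, width, height] box around the vertical axis."""
    x, y, w, h = bbox
    return [image_width - x - w, y, w, h]

if __name__ == "__main__":
    # Toy dataset for illustration only.
    toy = {
        "images": [{"id": 1, "width": 500}, {"id": 2, "width": 640}],
        "categories": [
            {"id": 1, "name": "person"},
            {"id": 2, "name": "bicycle"},
            {"id": 33, "name": "suitcase"},
        ],
        "annotations": [
            {"image_id": 1, "category_id": 1, "bbox": [0.0, 139.06, 76.57, 245.87]},
            {"image_id": 1, "category_id": 33, "bbox": [100.0, 300.0, 40.0, 60.0]},
            {"image_id": 2, "category_id": 2, "bbox": [10.0, 10.0, 50.0, 50.0]},
        ],
    }
    small = filter_categories(toy)
    print(len(small["images"]), len(small["annotations"]))  # 1 2
    print(hflip_bbox([100.0, 300.0, 40.0, 60.0], 500))      # [360.0, 300.0, 40.0, 60.0]
```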

22 Future Verification of Video Testing
There is no openly available automated verification process for training a system to improve its ability to detect objects over time in a video. ImageNet has a video object detection challenge, but the website is unresponsive and the related Kaggle challenge under ImageNet is empty. As a result, verification is painstakingly done by hand, to determine whether an object that needs to be identified was in fact identified at a certain time.
One might imagine that adding some sort of memory, like an LSTM, to the model and training on images that were taken in sequence would improve the ability of a system to repeatedly classify an object in a scene.

23 Future Work for Improving Tracking
Assign bag owners immediately on scene by matching up similar trajectories
  Solves the issue of misidentifying owners once the bag stops moving if there are many people near the bag
Calculate distance from the bag based on the closest edges of their bounding boxes, or by finding the intersection of previous bounding boxes
  Solves issues with partial occlusion making bounding box center points inaccurate
Allow owners to return to their bags within some distance without marking the bag as abandoned
Identify owners based on histogram or face recognition
  Allows owners to leave the scene and be re-identified when they return
  May help maintain tracking when the scene is very busy or is prone to occlusion
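The closest-edge distance mentioned above can be computed directly from two boxes. This is a minimal sketch assuming [xmin, ymin, xmax, ymax] boxes in pixel coordinates; the function name is hypothetical.

```python
import math

def edge_distance(a, b):
    """Shortest distance between the edges of two boxes (0 if they overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)   # horizontal gap, if any
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)   # vertical gap, if any
    return math.hypot(dx, dy)

if __name__ == "__main__":
    bag = [0, 0, 10, 10]
    person = [13, 14, 20, 20]
    print(edge_distance(bag, person))            # 5.0
    print(edge_distance(bag, [5, 5, 15, 15]))    # 0.0
```

Unlike a center-to-center distance, this value does not change when occlusion cuts off the far side of a box, which is the failure mode the slide describes.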

24 References
1. History of Video Surveillance and CCTV. [Online]. Available: [Accessed: 15-Nov-2017].
2. R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, arXiv: [cs], Nov.
3. R. Girshick, Fast R-CNN, arXiv: [cs], Apr.
4. S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. , Jun.
5. W. Liu et al., SSD: Single Shot MultiBox Detector, arXiv: [cs], vol. 9905, pp. ,
6. J. Redmon and A. Farhadi, YOLO9000: Better, Faster, Stronger, arXiv: [cs], Dec.

25 Discussion & Link to Code
Any questions?
