Person detection in RGB/IR images, 8/11/2017


Outline: Person detection in RGB/IR images
Kristof Van Beeck
How pedestrian detection works
Comparing different methodologies
Challenges of IR images
DPM & ACF
o Methodology
o Integrating IR images & benefits
o Results on use case FLIR
Deep learning
o Methodology
o Results on use case FLIR / Alphatronics / Port of Antwerp
Conclusions

Pedestrian detection: how?
Pipeline: input image, extract features, evaluate model, perform NMS (non-maximum suppression), detection output.
How is a detection model learned?
o Use hundreds (or thousands) of images
o Images containing pedestrians (positives)
o Images containing background (negatives)
o Machine learning (e.g. SVM, AdaBoost, neural networks, ...)
What about multiple sizes?
o Sliding window in scale-space
o Feature pyramid: time consuming
Detection scores = probability measure
o Low threshold = find many pedestrians, but make mistakes
o High threshold = find fewer pedestrians, with fewer or no mistakes

Comparing different methodologies
Comparing CV algorithms:
o Loop over multiple score thresholds
o Compare with ground truth (manual annotations)
o Count correct detections (TP), wrong detections (FP) and annotations not found (FN)
o Plot these measures in a graph, two types: PR (precision-recall) and miss rate vs. FPPI (false positives per image)
On the curves, one direction means making fewer mistakes and the other finding more pedestrians; the optimal point combines both.
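The threshold loop behind both plot types can be sketched as follows. This is a minimal illustration, not the evaluation code used for the slides: the function name `sweep_thresholds` and the input layout are assumptions, and each detection is assumed to have already been matched against the manual annotations and flagged as TP or FP.

```python
def sweep_thresholds(detections, num_annotations, num_images, thresholds):
    """Compute one point of the PR and miss-rate/FPPI curves per threshold.

    detections: list of (score, is_true_positive) pairs (hypothetical layout).
    num_annotations: number of ground-truth pedestrians.
    num_images: number of evaluated images.
    """
    points = []
    for t in thresholds:
        # Keep only the detections that pass the current score threshold.
        kept = [is_tp for score, is_tp in detections if score >= t]
        tp = sum(kept)                # correct detections
        fp = len(kept) - tp           # wrong detections
        fn = num_annotations - tp     # annotations not found
        points.append({
            "threshold": t,
            # Convention assumed here: precision is 1.0 when nothing is detected.
            "precision": tp / (tp + fp) if kept else 1.0,
            "recall": tp / num_annotations,
            "miss_rate": fn / num_annotations,
            "fppi": fp / num_images,
        })
    return points


detections = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.3, False)]
curve = sweep_thresholds(detections, num_annotations=4, num_images=2,
                         thresholds=[0.5, 0.85])
```

Raising the threshold from 0.5 to 0.85 in this toy input moves precision from 0.75 to 1.0 while recall drops from 0.75 to 0.25, which is the trade-off the two curves visualize.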

Challenges of LWIR images
Detection of pedestrians in LWIR images is difficult:
o Advantage LWIR: able to visualize people in low light (night, mist) conditions, dusty environments, rain; inherent privacy
o Disadvantage LWIR: less discriminative information (only a grayscale image, no color information)
o Hard for person re-identification (gait analysis?)
Often a combination of both RGB and LWIR is used.

DPM - methodology
Deformable part models [Felzenszwalb, CVPR2008]
o Main gradient model
o Allows deformation of parts relative to the root model
o Good for highly deformable objects / viewpoints
o Higher computational complexity
Two examples from EAVISE applications:
o Active blind spot safety detection system (DPM)
o Combining multiple detectors to increase the accuracy (DPM, HOG, ICF)

ACF - methodology
ACF: Aggregate Channel Features [Dollár, PAMI2014]
o State-of-the-art detector in 2014: accurate and fast, even on CPU
o Uses both gradient and color information
o Feature values are calculated as the sum of pixel values in rectangles
o Approximation of the features at most scales
o Learn decision trees on weak features

ACF: integrating LWIR images
How to integrate LWIR information?
o Relatively simple: include additional IR channels
o Problem? Need for training data (expensive!)
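The ACF statement that feature values are sums of pixel values in rectangles can be illustrated with a small block-sum routine. This is only a sketch: the function name `aggregate_channel` and the block size are assumptions, and a real detector would apply this to every channel (color, gradient magnitude, gradient orientations) before training decision trees on the resulting values.

```python
def aggregate_channel(channel, block=4):
    """Sum pixel values in non-overlapping block x block rectangles.

    channel: 2D list of numbers (one feature channel of the image).
    Rows/columns that do not fill a complete block are ignored.
    """
    rows = len(channel) // block
    cols = len(channel[0]) // block
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows * block):
        for c in range(cols * block):
            # Each pixel contributes to exactly one aggregated feature.
            out[r // block][c // block] += channel[r][c]
    return out


channel = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
features = aggregate_channel(channel, block=2)  # [[14, 22], [46, 54]]
```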

ACF: integrating LWIR images
Solution: KAIST multispectral pedestrian dataset
o Publicly available
o VGA (640 x 480, 20 Hz) image pairs (color + LWIR)
o Annotations and unique pedestrians

ACF: integrating LWIR images
We internally developed an ACF framework:
o Detection code
o Training code
o Extended with LWIR input
ACF color channels: 3 color channels (LUV), 6 gradient orientations, 1 gradient magnitude
LWIR channels: 1 intensity, 6 gradient orientations, 1 gradient magnitude

ACF: integrating LWIR images
Tested many influences (LWIR only, RGB only, combination):
o ACF/ACF+, amount of training data, model size, resolution of LWIR images
Details on experiments: reports on the VIPER website

Acquired dataset from FLIR
o LWIR only, train station Brugge, Belgium
o Goal: detect abnormal behavior (e.g. people who cross the train tracks)
Specs:
o # of videos: combined: 27, crossing: 26, humans: 37
o Frame rate of 7 FPS, resolution of 640x512
Detecting this behavior is composed of two main parts:
o Perform accurate pedestrian detection and generate tracks (this presentation)
o Analyze detections/tracks for abnormal behavior (AI team after coffee break)

First naive approach
The camera is fixed: perform background subtraction and use the blobs as detections. Will this work?
o Works sometimes (MOG technique)!
o However, many problems: blobs merge, contrast, passing trains, # parameters, ...
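The naive fixed-camera approach can be sketched with a running-average background model. Note that this is a deliberately simpler stand-in for the MOG (mixture of Gaussians) technique named on the slide, and the names `update_background`, `foreground_mask`, `alpha` and `threshold` are illustrative assumptions.

```python
def update_background(background, frame, alpha=0.05):
    """Blend the new frame into the background estimate (running average)."""
    return [[(1 - alpha) * b + alpha * f for b, f in zip(brow, frow)]
            for brow, frow in zip(background, frame)]


def foreground_mask(background, frame, threshold=25):
    """Mark pixels that differ strongly from the background as foreground."""
    return [[abs(f - b) > threshold for b, f in zip(brow, frow)]
            for brow, frow in zip(background, frame)]


background = [[10, 10], [10, 10]]
frame = [[10, 200], [12, 10]]       # one warm pixel enters the scene
mask = foreground_mask(background, frame)
background = update_background(background, frame, alpha=0.5)
```

Connected foreground pixels would then be grouped into blobs and used as detections. The single global `threshold` already hints at the problems listed above: low contrast falls below it, and a passing train exceeds it everywhere.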

Train appearance-based model!
Annotated all data using the VATIC tool (available on website)
o 63 videos remained
o Ordered from more to less important (crossings)
o 27 videos were labeled:
  Crossings: 0,1,3,4,5,6,7,8,9,10,11,12,13,14
  Combined (almost no crossings): 15,16,17,18,19,20,21,22,23,24,25,26,27
Label results:
o Total of frames, labels (4523 occluded)
o Total of 79 unique track IDs

Trained an ACF model with FLIR data
o The labeled videos are split into a training pool and a testing pool
o Every 5th image
o 2921 annotations (2429 images)
o 4212 pos., neg
o Compared against the model trained with KAIST data
At a precision of 90%, recall of 75% (AP of 78.9%). Not bad! (video in a few slides)

Improvements:
o Add scene constraints
o Add tracking

Scene constraints:
o Assume flat ground plane
o After calibration (based on annotations) each height at a specific position is known
o Reject detections which deviate from this constraint
Applied on this scenario:
o Divided in two regions
o Extract annotations
o Fit first order function (plane)
o Eliminate detections which diverge (based on percentage)
Pruning too much or too little both hurt; there is an optimal point. Compared with the original FLIR model, for fixed recall (75%): ~5% improvement in precision!

Final improvement: tracking
o Predict future position
o Match with new detection, keep prediction if none found
o Kalman filter, constant velocity motion model
Significant increase in recall! If TTL is too high, precision drops (FP are tracked).
Result video: compare the initial KAIST model (red) with our final best result (green).
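The scene constraint (expected pedestrian height as a first-order function of image position, with a percentage tolerance) can be sketched as below. As a simplifying assumption, this sketch fits a line in the foot row coordinate only, whereas the slides fit a plane per region; `fit_height_model`, `satisfies_constraint` and `tolerance` are hypothetical names.

```python
def fit_height_model(samples):
    """Least-squares fit of height = slope * foot_y + intercept.

    samples: list of (foot_y, height) pairs taken from annotations.
    """
    n = len(samples)
    mean_y = sum(y for y, _ in samples) / n
    mean_h = sum(h for _, h in samples) / n
    var_y = sum((y - mean_y) ** 2 for y, _ in samples)
    cov = sum((y - mean_y) * (h - mean_h) for y, h in samples)
    slope = cov / var_y
    return slope, mean_h - slope * mean_y


def satisfies_constraint(foot_y, height, model, tolerance=0.3):
    """Accept a detection if its height is within tolerance (a fraction)
    of the height expected at that position."""
    slope, intercept = model
    expected = slope * foot_y + intercept
    return abs(height - expected) <= tolerance * expected


# Annotated pedestrians appear taller lower in the image (larger foot_y).
model = fit_height_model([(100, 50), (200, 100), (300, 150)])
keep = satisfies_constraint(200, 110, model)    # close to the expected 100 px
reject = satisfies_constraint(200, 40, model)   # far too small for this row
```

The `tolerance` parameter plays the role of the pruning percentage: a small value prunes too much (recall drops), a large value prunes too little (false positives remain).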

Remaining challenges?
o Small pedestrians
o Low contrast
o Reflections
Improve results with deep learning!

Recent trend in computer vision: deep learning
Achieves excellent accuracy results, easily surpassing previous methodologies
Not that new:
o Yann LeCun: A theoretical framework for back-propagation (1988)
o Around 2012: breakthrough: enough datasets, architectures, but most of all affordable GPU hardware
(Image: NVIDIA)

Deep learning: what?
Step away from manual feature development (e.g. ACF), and let the algorithm determine the important features
o Feed images through a convolutional neural network (CNN): a cascade of convolution layers, max pooling layers and fully connected layers
o The convolution and pooling layers perform feature extraction; the fully connected layer performs classification
o Many architectures exist (classification, detection, segmentation) with more and more layers
o Top accuracy, detects many classes at once
Two phases: training and inference
o Problem: an enormous amount of weights and interconnections need to be learned, so a vast amount of training data is needed
o Uses back propagation and gradient descent, # iterations
o Special hardware needed (expensive GPUs)
o Training takes weeks and requires hundreds of thousands of images (Google, Facebook AI research, ...)
o Inference is fast on high-end GPUs

Network: YoloV
Worth a look! Many different publicly available deep learning frameworks:
o Caffe, Torch, Tensorflow, ...
We're using darknet (i.e. YOLOv2: You Only Look Once [1])
Example video, taken with webcam, real-time processing
Speed limited to 30 FPS due to webcam (normal FPS)
[1] You Only Look Once: Unified, Real-Time Object Detection, J. Redmon et al., CVPR
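The cascade of convolution, max pooling and fully connected layers can be illustrated on a toy image. This is a hedged, framework-free sketch with hand-picked weights; in a real CNN the kernel and the fully connected weights are learned by back propagation, and the function names here are illustrative only.

```python
def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation, as in most CNN frameworks)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def max_pool(fmap, size=2):
    """Keep the maximum of every non-overlapping size x size window."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]


def fully_connected(fmap, weights, bias):
    """Flatten the feature map and compute one weighted sum (a class score)."""
    flat = [v for row in fmap for v in row]
    return sum(v * w for v, w in zip(flat, weights)) + bias


# 5x5 image with a vertical edge between column 1 and column 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
edge_kernel = [[-1, 1], [-1, 1]]          # responds to dark-to-bright edges
feature_map = conv2d(image, edge_kernel)  # 4x4 map, response 2 at the edge
pooled = max_pool(feature_map)            # 2x2 map
score = fully_connected(pooled, weights=[0.5, 0.0, 0.5, 0.0], bias=-1.0)
```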

How does YOLO perform on our LWIR dataset without retraining (trained on COCO, 80 classes, ~50-60 FPS)?
Better results: retrain YOLO. How? Need lots of data?

Solution: transfer learning!
The network consists of feature extraction layers followed by a fully connected layer for classification.
o Do not retrain the full network: assume that most weights of the feature layers can be reused, and start from these
o Only set empty weights for the later layers in the network
o Significantly fewer training images needed

For the FLIR set:
o Videos: crossings 0,1,3,4,5,6,7,8,9,10,11,12,13,14; combined (almost no crossings) 15,16,17,18,19,20,21,22,23,24,25,26,27; split into a training pool and a testing pool
o Keep the first 23 convolution layers
o Used ~1500 images for training with ~2400 annotations
o Trained on Gigabyte NVIDIA Geforce GTX1080 (Mid 2016, euro)
o Training takes about 42 min for every 1000 iterations
o Best model after 14000 iterations (~10 hours of training)
o Already a good model after 2000 iterations!

Resulting detections on video:
o Speed: NVIDIA GTX 1080: 60-70 FPS (80 FPS without vis.)
o Excellent detection results!

Quantitative results:
o Very good! Precision of 90%, recall of 86%!
o YOLO is even better than the curves show: able to detect even occluded people, which were not annotated
o Note: localization of Yolo is slightly worse due to grid proposals
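The transfer learning idea (reuse the weights of the early feature layers as a starting point, and initialize only the later layers from scratch) can be sketched without any framework. This is not the darknet procedure itself: the function `init_for_transfer`, the dictionary layout and the tiny layer sizes are all hypothetical.

```python
import random


def init_for_transfer(pretrained_layers, num_reused, new_layer_sizes, seed=0):
    """Build initial weights for retraining.

    pretrained_layers: list of per-layer weight lists from the original model.
    num_reused: number of early layers whose weights are copied.
    new_layer_sizes: weight counts of the later layers, initialized from scratch.
    """
    rng = random.Random(seed)
    # Early feature layers start from the pretrained values and are then
    # refined further during training.
    layers = [{"weights": list(w), "pretrained": True}
              for w in pretrained_layers[:num_reused]]
    # Later layers get small random values because they must be learned anew.
    for size in new_layer_sizes:
        layers.append({"weights": [rng.gauss(0.0, 0.01) for _ in range(size)],
                       "pretrained": False})
    return layers


pretrained = [[0.1, 0.2], [0.3], [0.4, 0.5]]   # toy stand-in for a COCO model
model = init_for_transfer(pretrained, num_reused=2, new_layer_sizes=[3])
```

For the FLIR set, `num_reused` would correspond to the 23 convolution layers that were kept.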

Include ground plane constraint and tracking: even better?
o Not that many FP visible
o Ground plane: negligible improvement
o Tracking: higher recall, lower precision (FP are tracked)
Example video (7 FPS, 1min30sec in real-time), now about 9 seconds.

What about detection speed?
~60 FPS on Geforce GTX1080 (180W)
o 2560 CUDA cores (Pascal), 8.3 TFLOPS, 8 GB GDDR5X memory
Embedded: Jetson TX1 & TX2!
o ~5 FPS on Jetson TX1
o Supercomputer, quad ARM core, credit card size (10W)
o 256 CUDA cores, 1 TFLOPS (TX1), 8GB (TX2) on-board memory
o TX1/TX2: 599 USD
TinyYolo (15 layers), 6 hours training:
o ~70-90 FPS on Geforce GTX1080 (180W)
o ~12 FPS on Jetson TX1
See demo during coffee break!

Alphatronics Testcase - Goal
Patient monitoring in healthcare
o Patients tend to fall out of bed
o Develop an automatic monitoring system which detects when the patient lies in bed
o For privacy reasons, use only an LWIR camera
o Challenges: ghost images due to remaining heat, blankets

Alphatronics Testcase - Experiments
Recorded dataset consisting of 4 sequences, ~1100 images
o Evaluated ACF (LWIR), Yolo (RGB), Yolo (LWIR)
o Yolo seems to achieve excellent results!
o Problems with difficult poses & blanket
Quantitative results:
o Yolo RGB: very good: precision 90%, recall 95%
o Yolo LWIR (overfitting?)
o ACF (LWIR)
OpenPose: see after coffee break

Port of Antwerp Testcase - Goal
Safety system for lifting bridges
o Detect if a person/cyclist is on the bridge
o Combine both thermal and RGB images
Recorded dataset of the Siberia bridge
o FLIR Trafione camera
o 28 thermal-visible video sequences
o 2 camera viewpoints
o 47 annotated person/cyclist tracks
o 839 training images (60%), 564 testing images (40%)

Port of Antwerp Testcase - Experiments
Integrate thermal in Yolo detector
o Default only 3 channels (R, G, B)
o Try different combinations to integrate thermal images:
  RGB: TGR, BTR, BGT
  LUV: LTV, LUT
  HSV: HTV, TSV
Video on Yolo - LTV

Conclusions
We presented work of the CV team:
o An introduction to pedestrian detection
o Methodologies: ACF, DPM, Deep Learning
o Extensions to LWIR images
o Three use cases with extensive validation: FLIR, Alphatronics, Port of Antwerp
Thank you for your attention! Questions?
E-mails: kristof.vanbeeck@kuleuven.be, toon.goedeme@kuleuven.be
Thanks for contributions: Toon Goedemé, Kristof Van Beeck, Floris De Smedt, Andy Warrens, Steven Puttemans, Timothy Callemein, Maarten Vandersteegen, Wiebe Van Ranst
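The channel combinations tried in the Port of Antwerp experiments (for example BTR or LTV) keep the detector input at 3 channels by swapping one color channel for the thermal value. A minimal per-pixel sketch, assuming the layout names list the channels in storage order with "T" standing for thermal; `fuse_pixel` and `fuse_image` are hypothetical helpers, not part of the Yolo code.

```python
def fuse_pixel(channels, thermal, layout):
    """Build a 3-channel pixel according to a layout string such as "BTR".

    channels: dict mapping channel letters to values, e.g. {"B": 10, ...}.
    thermal: thermal intensity of the aligned LWIR pixel.
    layout: three letters; "T" is replaced by the thermal value.
    """
    return tuple(thermal if name == "T" else channels[name] for name in layout)


def fuse_image(color_image, thermal_image, names, layout):
    """Apply fuse_pixel to every pixel of two aligned images.

    names: channel letters of the color image in storage order, e.g. "BGR".
    """
    return [[fuse_pixel(dict(zip(names, px)), t, layout)
             for px, t in zip(color_row, thermal_row)]
            for color_row, thermal_row in zip(color_image, thermal_image)]


color = [[(10, 20, 30), (40, 50, 60)]]     # one row of BGR pixels
thermal = [[99, 77]]                       # aligned thermal intensities
btr = fuse_image(color, thermal, names="BGR", layout="BTR")
ltv = fuse_pixel({"L": 1, "U": 2, "V": 3}, thermal=9, layout="LTV")
```

Because the result still has 3 channels, the network architecture does not need to change; only the meaning of one input channel does.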
