Progress in Computer Vision in the Last Decade & Open Problems: People Detection & Human Pose Estimation

1 Progress in Computer Vision in the Last Decade & Open Problems: People Detection & Human Pose Estimation
Bernt Schiele
Max Planck Institute for Informatics & Saarland University, Saarland Informatics Campus, Saarbrücken

2 Overview
People Detection & Tracking
- progress since
- today's state-of-the-art
- open problems
Human Pose Estimation
- progress since [Felzenszwalb,Huttenlocher@ijcv05]
- today's state-of-the-art
- open problems
Some More Open Problems
- some recent progress on open problems

3 People Detection
Ten years of pedestrian detection, what have we learned? R. Benenson, M. Omran, J. Hosang and B. Schiele, ECCV workshop 14
How Far are We from Solving Pedestrian Detection? S. Zhang, R. Benenson, M. Omran, J. Hosang and B. Schiele, CVPR 16
Learning Non-Maximum Suppression. J. Hosang, R. Benenson, and B. Schiele, CVPR 17
CityPersons: A Diverse Dataset for Pedestrian Detection. S. Zhang, R. Benenson and B. Schiele, CVPR 17
Occluded Pedestrian Detection Through Guided Attention in CNNs. S. Zhang, J. Yang and B. Schiele, CVPR 18
Shanshan Zhang, Rodrigo Benenson, Jan Hosang, Mohamed Omran, Bernt Schiele

4 Caltech Pedestrian Benchmark [Dollar,Wojek,Perona,Schiele@PAMI 12]
Features of the Pedestrian Dataset:
11h of normal driving in urban environment (greater LA area)
annotation:
- 250,000 frames (~137 min) annotated with 350,000 labeled bounding boxes of 2,300 unique pedestrians
- occlusion annotation: 2 bounding boxes for entire pedestrian & visible region
- difference between single person and groups of people
Caltech-USA currently the most active dataset

5 Great Progress in Pedestrian Detection Within 10+ Years
[Figure: detector results over time, lower is better; current deep networks highlighted]

6 What is Driving the Detection Quality? [Benenson,Omran,Hosang,Schiele@ECCV workshop 14]
Multi-scale models?
Additional (test time) data (e.g. flow, stereo)?
Exploiting context?
Training data?
Better features?

7 What is Driving the Detection Quality? [Benenson,Omran,Hosang,Schiele@ECCV workshop 14]
Multi-scale models - not much
- helps a bit (1-2%), but not key for quality in Caltech-USA
Additional (test time) data (e.g. flow, stereo) - a bit
- using more frames (flow or stereo) helps
Exploiting context - a bit more
- expect 2-5% improvement when using context
Training data
Better features

8 What is Driving the Detection Quality? [Benenson,Omran,Hosang,Schiele@ECCV workshop 14]
Multi-scale models - not much
Additional (test time) data (e.g. flow, stereo) - a bit
Exploiting context - a bit more
Training data?
Better features

9 Training Data Matters (you knew this already) [Benenson,Omran,Hosang,Schiele@ECCV workshop 14]

10 Features Alone Can Explain (Almost) 10 Years of Progress [Benenson,Omran,Hosang,Schiele@ECCV workshop 14]

11 What have we learned? [Benenson,Omran,Hosang,Schiele@ECCV workshop 14]
The key role of data
- More training data improves performance
- Data collection (& annotation) is key requirement for good performance
- question: how to obtain sufficient data?
The key role of features
- Features alone explain a decade of people detection progress
- Flow, context, and strong features are complementary (currently)
What about deep learning???

12 Leading Methods on Caltech
All Leading Methods are CNN-based:
SAF R-CNN: R-CNN (jointly) trained for 2 scales ~9%
- Scale-Aware Fast R-CNN for Pedestrian Detection by J. Li, X. Liang, S. Shen, T. Xu, J. Feng & S. TransMultimedia'17
F-DNN: Cascade of multiple CNNs ~9%
- Fused DNN: A Deep Neural Network Fusion Approach to Fast and Robust Pedestrian Detection by X. Du, M. El-Khamy, J. Lee & L. wacv 17
SA-FastRCNN: FasterRCNN trained better ~10%
- Is Faster R-CNN Doing Well for Pedestrian Detection? by L. Zhang, L. Lin, Z. Liang, K. eccv 16
Faster-RCNN: FasterRCNN trained better ~10%
- CityPersons: A Diverse Dataset for Pedestrian Detection by S. Zhang, R. Benenson & B. Schiele, cvpr 17

13 How Far are We from Solving Pedestrian Detection?
S. Zhang, R. Benenson, M. Omran, J. Hosang and B. Schiele, CVPR 16

14 How Good are Today's Detectors? [Zhang,Benenson,Omran,Hosang,Schiele@CVPR 16]
Human baseline
- Annotators: pedestrian detection experts
- Single frame observation
- Upper bound of performance
State-of-the-art detectors
- Far behind human baseline
- A lot of room for improvement: false positives and false negatives
[Figure: detector results compared with the human baseline]

15 False Positive: Background Errors [Zhang,Benenson,Omran,Hosang,Schiele@CVPR 16]
Vertical structures, lights, car parts, tree leaves, other background

16 False Negative Sources [Zhang,Benenson,Omran,Hosang,Schiele@CVPR 16]
Small scales, partial occlusion, side view, annotation errors, cyclists, others
[Figure: False Negative Statistics]

17 ConvNets and Pedestrian Detection Failure Cases
False positives:
- ConvNets reduce background errors, but these remain a significant problem
- ConvNets still have problems with localization (double detection, etc.)
- an attempt to overcome this problem: "Learning Non-Maximum Suppression" [Hosang,Benenson,Schiele@cvpr17]
False negatives:
- ConvNets should improve for certain cases (side views, cyclists, etc.) provided sufficient training data for such cases
- today's ConvNets have issues with small scale, partial occlusion, etc.
Annotation quality appears crucial for best performance, at least for today's ConvNets
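To make the "double detection" issue concrete, here is a minimal sketch of classic greedy non-maximum suppression, the hand-designed post-processing step that a learned alternative would replace. It is not the method of the cited paper; the box format `(x1, y1, x2, y2)`, the function names, and the 0.5 IoU threshold are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first (threshold is an assumption)."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate detections of one pedestrian plus one separate pedestrian.
boxes = [(10, 10, 50, 100), (12, 12, 52, 102), (200, 20, 240, 110)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```

With a fixed threshold, two real pedestrians standing close together can be merged just as easily as a true double detection, which is one motivation for learning this step instead.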

18 Multi-Person Tracking Based on Multicut & Subgraph Partitioning
Subgraph Decomposition for Multi-Object Tracking. S. Tang, B. Andres, M. Andriluka and B. Schiele, CVPR 15
Multi-Person Tracking by Multicuts and Deep Matching. S. Tang, B. Andres, M. Andriluka and B. Schiele, ECCV 16 workshop
- winner of the Multi Object Tracking (MOT) 2016 challenge
Multi-Person Tracking by Lifted Multicut and Multi-Person Re-Identification. S. Tang, M. Andriluka, B. Andres and B. Schiele, CVPR 17
- winner of the Multi Object Tracking (MOT) 2017 challenge
Siyu Tang, Björn Andres, Micha Andriluka, Bernt Schiele

19 Multicut Tracking - Motivation [Tang,Andriluka,Andres,Schiele@CVPR 17]
Tracking as global association problem
[Figure: frame 10, frame 30, frame 50]
Typically addressed as disjoint paths problem

20 Multicut Tracking - Motivation [Tang,Andriluka,Andres,Schiele@CVPR 17]
Subgraph decomposition for multi-object tracking
[Figure: frame 10, frame 30, frame 50]
Desired property of tracking by graph decomposition
- joint spatial-temporal association resulting in robust tracking results

21 The Underlying Graph in Space-Time Domain: Visualizing Disjoint Paths Solution
[Figure: space-time plot with axes x, y, time]
Red dots: detection hypotheses
Black dots and linking lines: the disjoint paths of each person
Disjoint Paths: Associations are brittle and fragile.

22 The Underlying Graph in Space-Time Domain: Visualizing Multicut Solution
[Figure: space-time plot with axes x, y, time]
Black dots and lines: the decomposition (cluster) for each person.
Subgraph Decompositions (clusters): Associations more robust
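The difference from disjoint paths can be illustrated with a toy clustering of detections. The sketch below is a greedy stand-in, not a multicut solver and not the authors' method: it merges detections along every edge with positive weight using union-find and ignores negative edges entirely. The edge weights and the function name are hypothetical.

```python
def cluster_detections(num_nodes, edges):
    """Greedy stand-in for graph decomposition: merge along attractive edges.

    edges: iterable of (i, j, weight); weight > 0 suggests "same person",
    weight < 0 suggests "different persons". A real multicut solver would
    trade attractive against repulsive edges; this sketch does not.
    """
    parent = list(range(num_nodes))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j, w in sorted(edges, key=lambda e: -e[2]):
        if w > 0:
            parent[find(i)] = find(j)

    # Relabel roots as consecutive cluster ids in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(num_nodes)]


# Nodes 0 and 1: two detections of person A in the same frame; node 2: person A later.
# Nodes 3 and 4: person B in two frames.
edges = [(0, 1, 2.0), (0, 2, 1.5), (1, 2, 1.0), (3, 4, 1.8), (0, 3, -2.0), (2, 4, -1.0)]
clusters = cluster_detections(5, edges)
```

Note that nodes 0 and 1 end up in one cluster even though they come from the same frame. A disjoint-paths formulation allows at most one detection per frame on each path, so it cannot express this kind of joint spatial and temporal association.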

23 Multicut Tracking - Results [Tang,Andriluka,Andres,Schiele@CVPR 15]
[Figure: Detection Hypotheses, Tracklet Hypotheses, Multicut Decomposition, Final Tracks]
Dotted rectangles are interpolated tracks.

24 Multicut Tracking - Results [Tang,Andriluka,Andres,Schiele@CVPR 15]
[Figure: Decompositions (clusters), Tracks]
Dotted rectangles are interpolated tracks.

25 Results on MOT 16 [Tang,Andriluka,Andres,Schiele@CVPR 17]
[Table: MOT 16 results; labeled entries: Lifted Multicut, SenseTime]

26 People Detection & Tracking: Take Home Messages / Open Problems
Training data matters
- how to obtain sufficient training data?
- quality of annotations matters too, at least using today's CNNs and loss functions
- open problem: how to train with weaker or no supervision?
- open problem: how to generate relevant training data, e.g. for rare cases?
Key role of features
- CNNs / deep networks promise to help
- open problem: classifiers for background (both false negatives & false positives)
- open problem: small scale & partial occlusion are problematic for today's CNNs
Modeling of structure often crucial
- briefly discussed for multi-person tracking, similar for people detection
- open problem: what is the right balance between (manual) structure and end-to-end trained models?

27 Overview
People Detection
- progress since
- today's state-of-the-art
- open problems
Human Pose Estimation
- progress since [Felzenszwalb,Huttenlocher@ijcv05]
- today's state-of-the-art
- open problems
Some More Open Problems

28 Human Pose Estimation
Discriminative Appearance Models for Pictorial Structures. M. Andriluka, S. Roth, and B. Schiele, IJCV 11
2D Human Pose Estimation: New Benchmark and State-of-the-Art Analysis. M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, CVPR 14
DeepCut: Joint Subset Partition and Labeling for Multi-Person Pose Estimation. L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele, CVPR 16
DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model. E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele, ECCV 16
ArtTrack: Articulated Multi-Person Tracking in the Wild. E. Insafutdinov, L. Pishchulin, M. Andriluka, and B. Schiele, CVPR 17
Leonid Pishchulin, Eldar Insafutdinov, Siyu Tang, Björn Andres, Micha Andriluka, Peter Gehler, Bernt Schiele

29 Human Pose Estimation - what happened so far
Single Person Pose Estimation - two phases
Phase 1: pictorial structures models
- e.g. [Felzenszwalb&Huttenlocher@ijcv05], [Andriluka&al@ijcv11], [Yang&Ramanan@pami13], [Pishchulin&al@iccv13], ...
Phase 2: using deep learning
- e.g. [Toshev,Szegedy@cvpr14], [Tompson&al@nips14], [Chen&Yuille@nips14], [Carreira&al@cvpr16], [Hu&Ramanan@cvpr16], [Wei&al@cvpr16], [Newell&al@cvpr16], ...

30 MPII Human Pose Dataset [Andriluka,Pishchulin,Gehler,Schiele@CVPR 14]
Dataset demo
410 human activities (after merging similar activities)
over 40,000 annotated poses
over 1.5M video frames
[Figure: Activity Categories, Activities, Images]

31 Analysis - overall performance
[Figure: PCKh total, MPII Single Person; Best Method as of ICCV 13 compared with Best Methods now: deep learning takes over]
since CVPR 14, dataset has become de-facto standard benchmark
large training set facilitated development of deep learning methods
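For readers unfamiliar with the PCKh score named on this slide: a predicted joint counts as correct if it lies within a fraction of the person's head segment length from the annotated joint. The sketch below assumes the commonly used fraction of 0.5 and a simple list-based data layout; both are illustrative assumptions, not the benchmark's official evaluation code.

```python
import math


def pckh(pred, gt, head_sizes, alpha=0.5):
    """Fraction of annotated joints predicted within alpha * head size.

    pred, gt: one list of (x, y) joints per person; a ground-truth joint may
    be None when it is not annotated. head_sizes: head segment length per person.
    """
    correct = total = 0
    for p_joints, g_joints, head in zip(pred, gt, head_sizes):
        for p, g in zip(p_joints, g_joints):
            if g is None:
                continue  # unannotated joints do not count
            total += 1
            if math.dist(p, g) <= alpha * head:
                correct += 1
    return correct / total if total else 0.0


gt = [[(100, 100), (120, 150), None]]
pred = [[(104, 103), (150, 150), (0, 0)]]
score = pckh(pred, gt, head_sizes=[20])
```

Normalizing by head size rather than by image or box size makes the threshold roughly independent of how large the person appears in the image.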

32 Human Pose Estimation
Single Person Pose Estimation - two phases
Phase 1: pictorial structures models
- e.g. [Felzenszwalb&Huttenlocher@ijcv05], [Andriluka&al@ijcv11], [Yang&Ramanan@pami13], [Pishchulin&al@iccv13], ...
Phase 2: using deep learning
- e.g. [Toshev,Szegedy@cvpr14], [Tompson&al@nips14], [Chen&Yuille@nips14], [Carreira&al@cvpr16], [Hu&Ramanan@cvpr16], [Wei&al@cvpr16], [Newell&al@cvpr16], ...
Multi Person Pose Estimation - far fewer publications
- in particular: [Eichner&Ferrari@eccv10], [Ghiasi&al@cvpr14], [Chen&Yuille@cvpr15], ...

33 Deep(er)Cut: Joint Subset Partition & Labeling for Multi Person Pose Estimation
DeepCut: Joint Subset Partition and Labeling for Multi-Person Pose Estimation. L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele, CVPR 16
DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model. E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele, ECCV 16
ArtTrack: Articulated Multi-Person Tracking in the Wild. E. Insafutdinov, L. Pishchulin, M. Andriluka, and B. Schiele, CVPR 17

34 Multi Person Pose Estimation
Standard 2-stage approach:
1. stage: person detection to estimate bounding box
2. stage: single person pose estimation per bounding box
- cannot recover from errors of 1. stage
- occlusion reasoning across people

35 Multi Person Pose Estimation
Our approach - DeepCut [cvpr 16] & DeeperCut [eccv 16]
joint formulation for person detection and pose estimation
formulated as subset partitioning and labeling problem
jointly estimates: number of people and their pose, partial occlusion & truncation, ...

36 Deep(er)Cut - Overview
[Figure: pipeline overview]
I: detection candidates
II: dense graph
III: joint subset partitioning and body part labeling, giving person clusters and labeled body parts

37 I: Detection Candidates: Unary probabilities
[Figure: sliding CNN (VGG / ResNet) producing conv maps, followed by FCs]
loss functions: cross-entropy, self-regression, pairwise regression
dense outputs: scoremaps, unary vector fields, pairwise vector fields
(unary) body part detections

38 I: Detection Candidates: Unary probabilities
[Figure: sliding CNN (VGG / ResNet) producing conv maps, followed by FCs]
loss functions: cross-entropy, self-regression, pairwise regression
dense outputs: scoremaps, unary vector fields, pairwise vector fields
sampling top detection candidates, e.g. sho 0.9, elb 0.5, wri 0.1, ank 0.0

39 II: Graph Construction
Pairwise Probabilities:
same person & same body part (clustering of part detections)
- green: high probability
- red: low probability
same person & different body parts (kinematic relations)
- pairwise regression to other body parts
- logistic regression from both spatial relation and appearance (image conditioned pairwise probabilities)
different persons (multi-person & occlusion reasoning)
- logistic regression from both spatial relation and appearance
[Figure: example graphs over detections d1, d2, d3, d4 and persons 1, 2]
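As a rough illustration of how a pairwise probability from logistic regression can feed a partitioning problem, the sketch below scores a pair of part detections and converts the probability into a signed cost using the log-odds. The feature choice, the weights, and the sign convention (negative cost favors joining) are assumptions made for this example and are not taken from the slides.

```python
import math


def pairwise_probability(features, weights, bias):
    """Logistic regression: probability that two detections belong together."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def cut_cost(p, eps=1e-6):
    """Log-odds cost: negative when joining is likely, positive when unlikely."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0) and division by zero
    return math.log((1.0 - p) / p)


# Hypothetical features: absolute x and y offset between two detections (pixels).
weights = [-0.05, -0.05]  # larger offsets lower the probability (made-up values)
bias = 2.0
p_near = pairwise_probability([5.0, 3.0], weights, bias)
p_far = pairwise_probability([120.0, 80.0], weights, bias)
cost_near, cost_far = cut_cost(p_near), cut_cost(p_far)
```

A probability of exactly 0.5 maps to a cost of zero, so the partitioning objective is indifferent about that pair, while confident predictions in either direction contribute large attractive or repulsive terms.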

40 Deep(er)Cut - Overview
[Figure: pipeline overview]
I: detection candidates
II: dense graph
III: joint subset partitioning and body part labeling, giving person clusters and labeled body parts

41 Qualitative Results: MPII-Multi Person
[Figure: 2-stage approach compared with DeepCut]

42 ArtTrack [cvpr 17]
Key idea: extend DeepCut to temporal domain
joint inference performed in spatio-temporal graph
Figure 2: Top: Body joint detection hypotheses shown for three frames. Middle: Spatio-temporal graph with spatial edges (blue) and temporal edges for head (red) and neck (yellow). We only show a subset of the edges. Bottom: Estimated poses for all persons in the video. Each color corresponds to a unique person identity.

43 ArtTrack: Qualitative Results [cvpr 17]

44 Human Pose Estimation: Take Home Messages / Open Problems
Training data matters
- large datasets like MPII Human Pose push performance
Single Person Pose Estimation - impressive performance already
- off-the-shelf CNNs like ResNet obtain top-performance
- iterative refinements (Stacked Hourglass, Convolutional Pose Machines) also obtain top-performance
Multi Person Pose Estimation - not working as well
- underrepresented in literature & more work needed
- training data lacking (can we collect sufficient training data?)
- structured models & reasoning important
- problem complexity higher

45 Overview
People Detection & Tracking
- progress since
- today's state-of-the-art
- open problems
Human Pose Estimation
- progress since [Felzenszwalb,Huttenlocher@ijcv05]
- today's state-of-the-art
- open problems
Some More Open Problems

46 Take Home Messages & Some Open Problems
Machine learning has been and will continue to be a driver
- Lots of data (internet, storage, ...)
- Fast processing (CPU and GPU clusters, ...)
- Powerful machine learning models (deep learning, graphical models, ...)
BUT we should not get carried away by current successes of deep learning!
- current successes largely depend on fully supervised training using large datasets

47 Some Open Problems (1/3)
It is impossible to get sufficient training data for everything!
Then: How to train from insufficient training data?
lack of annotation (costly)
- how to train with NO or WEAKER supervision
- unsupervised & semi-supervised learning

48 Simple Does It: Weakly Supervised (Instance and) Semantic Segmentation
CVPR 2017
Anna Khoreva, Rodrigo Benenson, Jan Hosang, Matthias Hein, Bernt Schiele

49 Weakly Supervised Semantic Segmentation
Question: Is it possible to obtain high-quality segmentations with weak supervision such as bounding box annotations?
Full supervision: time-consuming.
Weak supervision: only 2 clicks per object (bounding box annotations)
[Figure: example image with Person and Horse labels under both kinds of supervision]

50 Generation of Annotations
classic methods
Bounding boxes contain info about the object:
1. Background
2. Object extent
3. Objectness
Segment proposals [MCG, Pont-Tuset et al.]
GrabCut [Rother et al.]
Non-consensus regions are set to ignore labels.
[Figure: Ground truth, Image + Boxes, consensus and non-consensus regions, Input for CNN]
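The consensus idea can be sketched in a few lines: two classic segmentations of the boxed object are compared pixel by pixel, agreement becomes a training label, and disagreement becomes an ignore label. The ignore value 255, the treatment of agreed background inside the box, and all names below are assumptions for illustration, not the paper's exact recipe.

```python
IGNORE = 255  # assumed ignore value, skipped by the training loss


def consensus_labels(box_mask, seg_a, seg_b, class_id):
    """Combine two binary segmentations into training labels.

    All inputs are 2D lists of 0/1 with the same shape. Pixels outside the
    box are background; inside the box, agreement gives a label and
    disagreement gives IGNORE.
    """
    labels = []
    for row_box, row_a, row_b in zip(box_mask, seg_a, seg_b):
        row = []
        for in_box, a, b in zip(row_box, row_a, row_b):
            if not in_box:
                row.append(0)          # outside the box: background
            elif a and b:
                row.append(class_id)   # both methods say object
            elif not a and not b:
                row.append(0)          # both say background (assumed rule)
            else:
                row.append(IGNORE)     # non-consensus region
        labels.append(row)
    return labels


box_mask = [[0, 1, 1, 1]]
seg_a = [[0, 1, 1, 0]]   # e.g. output of one classic method
seg_b = [[0, 1, 0, 0]]   # e.g. output of another classic method
labels = consensus_labels(box_mask, seg_a, seg_b, class_id=15)
```

Marking disputed pixels as ignore keeps likely label noise out of the loss, at the price of training on fewer pixels.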

51 Quantitative Results: DeepLab ResNet
Main result: With only box supervision we achieve 95% quality of the fully supervised model.
[Figure: mIoU for weak vs. full supervision; classic methods turn Image + Boxes into the input for the CNN]
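The mIoU score in the plot is the per-class intersection over union averaged over classes. A minimal sketch on flattened label lists follows, assuming an ignore value of 255 and skipping classes that appear in neither prediction nor ground truth; both choices are conventions assumed here, not stated on the slide.

```python
def mean_iou(pred, gt, num_classes, ignore=255):
    """Mean intersection over union over classes present in pred or gt."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, g in zip(pred, gt):
        if g == ignore:
            continue  # ignored pixels do not affect the score
        if p == g:
            inter[g] += 1
            union[g] += 1
        else:
            union[p] += 1
            union[g] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 0, 1, 1, 1]
gt = [0, 1, 1, 1, 255]
score = mean_iou(pred, gt, num_classes=2)  # class 0: 1/2, class 1: 2/3
```

Under this metric, "95% quality" would mean that the weakly supervised model's mIoU is about 0.95 times that of the fully supervised one.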

52 [Figure: qualitative comparison with columns Image, Ground truth, Weak supervision, Full supervision]

53 Learning Video Object Segmentation from Static Images
CVPR 2017
Anna Khoreva *, Federico Perazzi *, Rodrigo Benenson, Bernt Schiele, Alexander Sorkine-Hornung

54 Video Object Segmentation
Goal: Separating a specific foreground object from background in a video given its 1st frame mask annotation.
[Figure: Object 1 and Object 2, from the 1st frame through time t; DAVIS 2016 [Perazzi et al. 16]]

55 MaskTrack - Proposed Approach
we process video per-frame, using guidance from previous frame
MaskTrack: the Frame t-1 output mask and the Frame t input go into DeepLab [Chen et al., ICLR 15], which produces the Frame t output mask
we can train using static images only
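The per-frame scheme can be sketched as a simple loop in which each frame is segmented with the previous output mask as guidance. In the sketch below `refine` is a hypothetical stand-in for the network, `toy_refine` is a made-up rule on 1D toy frames, and `simulate_previous_mask` hints at why static images can suffice for training: a "previous frame" mask can be faked by perturbing the ground-truth mask. None of these names come from the paper.

```python
def propagate_masks(frames, first_mask, refine):
    """Segment a video frame by frame, guided by the previous output mask."""
    masks = [first_mask]
    for frame in frames[1:]:
        masks.append(refine(frame, masks[-1]))
    return masks


def simulate_previous_mask(mask, shift):
    """Fake a previous-frame mask from a static image by shifting a 1D mask right."""
    shift = max(0, min(shift, len(mask)))
    return [0] * shift + list(mask[:len(mask) - shift])


def toy_refine(frame, prev_mask):
    """Toy stand-in for the network: bright pixels next to the previous mask."""
    n = len(frame)
    out = []
    for i in range(n):
        near = any(prev_mask[j] for j in (i - 1, i, i + 1) if 0 <= j < n)
        out.append(1 if frame[i] > 5 and near else 0)
    return out


# A bright 1D "object" moving one pixel to the right per frame.
frames = [[0, 9, 0, 0], [0, 0, 9, 0], [0, 0, 0, 9]]
tracked = propagate_masks(frames, [0, 1, 0, 0], toy_refine)
```

Because every step depends only on the current frame and one mask, a training pair can be built from a single annotated image plus a perturbed copy of its mask, for example `simulate_previous_mask(mask, 1)`.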

56 Qualitative Results

57 Some Open Problems (1/3)
It is impossible to get sufficient training data for everything!
Then: How to train from insufficient training data?
lack of annotation (costly)
- how to train with NO or WEAKER supervision
- unsupervised & semi-supervised learning
lack of necessary data for training (with and without labels):
- e.g. autonomous driving: child running across street
- data generation methods (computer graphics & generative adversarial networks)
real data distributions skewed & exponential decrease of samples
- most classes have close to no training data (how to train from very few samples)

58 Generative Adversarial Network for Features (F-GAN)
Idea
- learn to generate features for any object class
- generation is conditioned on some class embedding c(y)
- c(y) can be class attributes, sentences, Word2Vec, etc.
[Figure: f-CLSWGAN with a generator taking z ~ N(0, 1) and class attributes (Head color: brown, Belly color: yellow, Bill shape: pointy), CNN features of real images, and a discriminator]
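The conditioning idea can be shown with a deliberately tiny generator: noise z drawn from N(0, 1) is concatenated with the class embedding c(y) and mapped to a feature vector. The sketch uses a single untrained linear layer with ReLU in pure Python; the layer sizes, the weight initialization, and the function names are assumptions, and the discriminator and the training loop are omitted.

```python
import random


def make_generator(noise_dim, embed_dim, feat_dim, seed=0):
    """Build a toy conditional generator with fixed random (untrained) weights."""
    rng = random.Random(seed)
    in_dim = noise_dim + embed_dim
    weights = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(feat_dim)]

    def generate(class_embedding, noise_rng):
        # Sample z ~ N(0, 1) and condition on the class embedding c(y).
        z = [noise_rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
        x = z + list(class_embedding)
        # One linear layer followed by ReLU.
        return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

    return generate


# Hypothetical 3-dimensional attribute embedding, e.g. (brown head, yellow belly, pointy bill).
c_y = [1.0, 0.0, 1.0]
generator = make_generator(noise_dim=4, embed_dim=3, feat_dim=5)
feature = generator(c_y, random.Random(1))
```

Since only c(y) identifies the class, a trained generator of this form could be asked for features of classes that have class embeddings but few or no training images.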

59 Some Open Problems (2/3)
How to integrate prior knowledge & end-to-end learning
linked to lack of necessary & sufficient data
prior knowledge helps whenever training data is limited
- examples: tracking, pose estimation, scene understanding
manual modeling vs. automatic learning
- manual modeling vs. automatic mining (e.g. from wikipedia) of prior knowledge
- what is the right (structural) model to integrate prior knowledge? (e.g. for scene understanding)
- what to learn: prior knowledge? model structure itself? just parameters? something else?

60 3D Scene Understanding [Wojek,Roth,Schindler,Schiele@PAMI 13]
3D scene analysis for mobile platforms (i.e. robots, cars)
- mobile observer aims to understand its 3D mobile environment, i.e. traffic, people, etc.
Application scenarios
- Traffic safety and driver assistance
- Autonomous vehicles
- Robotics

61 A state-of-the-art Approach (monocular camera) [Wojek,Roth,Schindler,Schiele@PAMI 13]
[Figure: image sequence (T-1, T, T+1) and a Bayesian 3D Scene Model, with object detections, scene tracklets, semantic scene labels, and prior information (camera, objects)]

62 11] Sample Result including Occlusion Reasoning

63 System sample video (different types of vehicles) [Wojek,Roth,Schindler,Schiele@PAMI 13]
Message
- modeling and reasoning in 3D is powerful: e.g. for occlusion reasoning, inclusion of prior information

64 Some Open Problems (3/3)
Understanding and introspection of deep-learning results
status-quo
- neural networks often considered black boxes
- lots of trial-and-error (for network structure, training, etc.)
- rather unclear why certain architectures / learning regimes work better
ideally
- we should understand which architectures work when, and why
- deep architectures should allow for introspection: why a certain (in)correct result was obtained, explain its own reasoning

65 Current Example from our Group [Park, Hendricks, Akata, Schiele, Darrell, Rohrbach; CVPR18]
Visual and Textual Explanation Interface: VQA
