Multimodal detection and recognition of persons with a static robot

Jaldert Rombouts (rombouts@ai.rug.nl)
Internal advisors: prof. dr. L.R.B. Schomaker and drs. T. van der Zant, Artificial Intelligence, University of Groningen
External advisor: dr. P. E. Rybski, Robotics Institute, Carnegie Mellon University

Overview
- Introduction
- Background and Approach
- Experiments and Results
- Discussion
- Questions

Introduction
- SnackBot (Lee et al., 2009)
- Human-Robot Interaction (HRI)
- Vending machine
- Topic: person detection and recognition

Introduction
- Solutions: ID cards, biometrics (Jain et al., 2004b)
- Disadvantage: close proximity, conscious user effort
- More natural solution?
- Based on soft biometrics (Jain et al., 2004a)
- Color, gait, shape (combinations)
- Passive (e.g. camera)

Introduction
- Implemented soft-biometric system(s) based on related work
- Evaluated performance: multiple poses, various distances

Background and Approach
- Segmentation
- Feature extraction
- Data set
- First: robot and sensors

Robot and Sensors
[Figure: plot of X (meters) against Y (meters); legend: Model, Foreground, COG]

Segmentation
Implemented two methods:
1. Background-modeling based (Horprasert et al., 1999)
2. Stereo based (Darrell et al., 2000; Zhao and Thorpe, 2000)
Combined with a laser-based leg detector
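The background-modeling idea can be illustrated for a single pixel. The following is a minimal sketch in the spirit of Horprasert et al. (1999), not the implementation used in this work: it omits the per-pixel normalization of the original paper, and the function names and threshold values (`cd_max`, `alpha_lo`, `shadow_below`, `highlight_above`) are assumptions chosen for illustration.

```python
import math


def distortions(pixel, mean, std):
    """Return (alpha, cd) for one RGB pixel against its background model.

    alpha is the brightness distortion: the scale factor that brings the
    expected background color closest to the observed pixel.
    cd is the chromaticity distortion: the remaining distance after scaling.
    Every entry of std must be non-zero.
    """
    num = sum(p * m / (s * s) for p, m, s in zip(pixel, mean, std))
    den = sum((m / s) ** 2 for m, s in zip(mean, std))
    alpha = num / den
    cd = math.sqrt(sum(((p - alpha * m) / s) ** 2
                       for p, m, s in zip(pixel, mean, std)))
    return alpha, cd


def classify_pixel(pixel, mean, std, cd_max=10.0, alpha_lo=0.5,
                   shadow_below=0.9, highlight_above=1.1):
    """Label a pixel; all thresholds are illustrative assumptions."""
    alpha, cd = distortions(pixel, mean, std)
    if cd > cd_max or alpha < alpha_lo:
        return "foreground"   # color changed, or far too dark for a shadow
    if alpha < shadow_below:
        return "shadow"       # same chromaticity, darker
    if alpha > highlight_above:
        return "highlight"    # same chromaticity, brighter
    return "background"


# Hypothetical background model for one pixel: mean and std per channel.
mean, std = (100.0, 100.0, 100.0), (2.0, 2.0, 2.0)
labels = [classify_pixel(p, mean, std)
          for p in [(100, 100, 100), (70, 70, 70), (200, 50, 50)]]
```

Separating brightness from chromaticity is what lets a model of this kind treat shadows as background rather than as part of the person.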

Feature extraction
- Color: (HS)V, nrgb, Y(CrCb), CIE-L(ab)
- Torso or head + torso (spatial histogram)
1. Mean + standard deviation
2. 1D and 2D chromaticity histograms (4, 8, 16, 32 bins)
- Person height: from stereo
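As an illustration of the 2D chromaticity histogram, the sketch below bins the hue and saturation of each pixel in a segmented region and drops the value channel, which is one reading of the "(HS)V" notation. The function name `hs_histogram`, the 8-bit RGB input format and the flattened, normalized output are assumptions, not details taken from this work.

```python
import colorsys


def hs_histogram(pixels, bins=32):
    """Return a normalized 2D hue/saturation histogram, flattened row by row.

    pixels is a non-empty list of (r, g, b) tuples with values in 0..255.
    Entry [h_bin * bins + s_bin] holds the fraction of pixels in that cell.
    """
    if not pixels:
        raise ValueError("need at least one pixel")
    counts = [0] * (bins * bins)
    for r, g, b in pixels:
        # colorsys works on floats in [0, 1]; the value channel is ignored.
        h, s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        h_bin = min(int(h * bins), bins - 1)  # clamp the h == 1.0 edge case
        s_bin = min(int(s * bins), bins - 1)  # clamp the s == 1.0 edge case
        counts[h_bin * bins + s_bin] += 1
    total = len(pixels)
    return [c / total for c in counts]


# Toy "torso" region: half saturated red, half saturated green.
torso = [(255, 0, 0), (255, 0, 0), (0, 255, 0), (0, 255, 0)]
feature_vector = hs_histogram(torso, bins=4)
```

With 32 bins per axis the feature vector has 1024 entries, most of them empty for any one person, which is the kind of large, low-information vector the classifiers have to cope with.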

Data set
- 30 persons (fairly large w.r.t. related work)
- 2 environments, 9 positions, 4 poses
- Repeated recording for validation

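[Figure: floor plans of the two recording environments with the numbered positions 1-9; labeled objects: low office dividers, high office dividers, desk, table, chairs, drawers, boxes, opening, experimenter; dimension labels: 510 meter, 710 meter; legend: point in cluttered scene, point in sparse scene]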

Data set - Poses

Recognition
Main questions:
- What is the best set of features for recognition?
- Robustness against pose and distance?
- Difference between segmentation methods?
- Difference between environments?

Testing method
Classifiers:
- K-Nearest Neighbor (knn)
- Support Vector Machine (SVM)
- Random Forest (RF) (Breiman, 2001): good performance on large feature vectors with low-information features

Testing method
- Cross-validation
- Classification accuracy (CA), averaged over environments

Recognition
Feature selection:
1. Color space
2. Color feature
3. Combining height
Detailed experiments: environment, location and pose

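Color features (1+2)
- The HSV 2D 32-bin histogram was best
- Features extracted from the torso were slightly better than from head + torso
- CA of approximately 0.55 for DS (stereo-based segmentation) and 0.64 for BGM (background modeling); baseline ~0.03, consistent with chance level for 30 persons (1/30)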

Bin size vs. CA
[Figure: CA (axis from 0.46 to 0.64) against bin size (4, 8, 16, 32) for knn, RF and SVM]

Combining height
- Not trivial: knn and SVM use distances in feature space
- A single feature has only a small impact
- Idea: make height more important by scaling its axis
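The effect of scaling the height axis on a distance-based classifier can be shown with a toy nearest-neighbor example. The two-person gallery, the feature values and the helper names are hypothetical; only the idea of multiplying the height feature by a scaling factor comes from the slide.

```python
def scaled(vector, factor):
    """Multiply the last element (height) by factor; color features unchanged."""
    return list(vector[:-1]) + [vector[-1] * factor]


def nearest_label(gallery, query, factor=1.0):
    """Return the gallery label closest to query after scaling the height axis."""
    q = scaled(query, factor)

    def dist2(vector):
        return sum((a - b) ** 2 for a, b in zip(scaled(vector, factor), q))

    return min(gallery, key=lambda label: dist2(gallery[label]))


# Each vector: two color features followed by height in meters.
gallery = {
    "A": [0.9, 0.1, 1.60],
    "B": [0.1, 0.9, 1.85],
}
# Color is somewhat closer to A, but height clearly matches B.
query = [0.6, 0.4, 1.84]

unscaled = nearest_label(gallery, query, factor=1.0)
rescaled = nearest_label(gallery, query, factor=10.0)
```

Without scaling, the color features dominate the distance and the query is matched to A; with a factor of 10 the height difference dominates and the match flips to B. A tree-based classifier such as RF is unaffected by this kind of monotonic rescaling of one feature, since it splits on thresholds rather than distances.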

[Figure: CA (axis from 0.55 to 0.8) against height scaling factor (0 to 80) for SVM-DS, SVM-BGM, knn-DS, knn-BGM, RF-DS and RF-BGM]

Combining height
- SVM and knn profit from height (0.15 in CA)
- RF only marginally (approximately 0.01): overfitting
- Height does not seem important at the training location, but gains importance with distance
- RF cannot make use of the designer's domain knowledge

Detailed experiments
1. Environment
2. Position
3. Pose

Position
Clear influence of distance:
- Close: 0.98-1.0 (DS) and 0.95-1.0 (BGM)
- Medium: 0.65-0.85 (DS) and 0.77-0.88 (BGM)
- Far: 0.42-0.59 (DS) and 0.58-0.70 (BGM)
Scores for BGM and DS are very similar at locations 1-6; BGM is better at locations 7-9

Pose
- Robustness to varying pose
- Train on a single pose (e.g. front), test on all poses, per location (all four poses)
- Training on 1 pose vs. 4 poses: approximately 0.10 drop in CA when averaged over all locations

Summary
- Simple features yield good performance
- BGM better than DS (especially at the farther locations)
- Little influence of environment
- Clear influence of position (distance)
- Reasonably robust to pose

Discussion
- Careful with generalization: e.g. HSV might be best in our experimental circumstances, but worse in others
- Fitted to subjects/environments?
- Try more environments/subjects

Discussion
- Intra-day recognition only (Darrell et al., 2000; Harville, 2005)
- Combine with e.g. face, voice
- No unsupervised enrollment of users

Questions?

References
Breiman, L. (2001). Random forests. Machine Learning, 45(1):5-32.
Darrell, T., Gordon, G., Harville, M., and Woodfill, J. (2000). Integrated Person Tracking Using Stereo, Color, and Pattern Detection. International Journal of Computer Vision, 37(2):175-185.
Demsar, J., Zupan, B., Leban, G., and Curk, T. (2004). Orange: From Ex-
Harville, M. (2005). Stereo person tracking with short and long term plan-view appearance models of shape and color. In IEEE, editor, IEEE Conference on Advanced Video and Signal Based Surveillance, 2005. AVSS 2005, pages 522-527.
Heikkilä, J. and Silvén, O. (2004). A real-time system for monitoring of cyclists
Jain, A., Dass, S., and Nandakumar, K. (2004a). Can soft biometric traits assist user recognition? In Jain, A. K. and Ratha, N. K., editors, Biometric Technology for Human Identification, volume 5404, pages 561-572. SPIE.
Jain, A., Ross, A., and Prabhakar, S. (2004b). An introduction to biometric recognition. Circuits and Systems for Video Technology, IEEE Transactions on, 14(1):4-20.

References
Horprasert, T., Harwood, D., and Davis, L. S. (1999). A statistical approach for real-time robust background subtraction and shadow detection. In Proc. IEEE ICCV, volume 99, pages 1-19.
Lee, M., Forlizzi, J., Rybski, P., Crabbe, F., Chung, W., Finkle, J., Glaser, E., and Kiesler, S. (2009). The snackbot: documenting the design of a robot for long-term human-robot interaction. In Proceedings of the 4th ACM/IEEE international conference on Human robot interaction, pages 7-14. ACM New York, NY, USA.
Zhao, L. and Thorpe, C. E. (2000). Stereo- and neural network-based pedestrian detection. Intelligent Transportation Systems, IEEE Transactions on, 1(3):148-154.