Accurate 3D Face and Body Modeling from a Single Fixed Kinect


Ruizhe Wang*, Matthias Hernandez*, Jongmoo Choi, Gérard Medioni
Computer Vision Lab, IRIS, University of Southern California

Abstract

In this paper, we address the problem of both face and body modeling using a single fixed low-cost 3D camera (e.g. Kinect). Unlike other scanning technologies, which either set up multiple sensors around a static subject or scan the static subject with a hand-held moving 3D camera, our method allows the subject to move in front of a single fixed 3D sensor. This opens the door to a wide range of applications where scanning is conveniently performed at home, alone. While partial range scans of the face are aligned rigidly, the body modeling is performed through articulated registration. At the surface reconstruction stage, we utilize a cylindrical representation for smoothing, hole filling and blending. Experimental results demonstrate the effectiveness of our modeling technique.

Keywords: Kinect, face modeling, body modeling, rigid registration, articulated registration, cylindrical representation

1. Introduction

3D face and body modeling is of interest to both computer vision and computer graphics. Precise 3D face and body models are necessary in many applications, such as animation, virtual reality, and human-computer interaction. However, obtaining such an accurate model is not an easy task. Early systems are based on either laser scanning or structured light. While these systems can provide very accurate models, they are expensive. In this paper, we propose a user-friendly scanning system based on a low-cost structured-light 3D sensor. The complete setup of our home scanning system is illustrated in Fig. 1. The camera is mounted vertically to maximize the field of view so that a subject can stand as close as possible. The subject first scans the body. The required initial pose of the subject is also shown in Fig. 1.
As the subject turns from the initial pose, he/she is required to stay static for 1 second at approximately every 1/4 circle. The subject can turn naturally as long as both arms stay in the torso's plane and do not occlude the legs. After scanning the body, the user comes closer to the sensor and scans his/her face. The system is easy to set up, the instructions are easy to follow, and the whole data-recording process takes no more than 20 seconds.

Fig. 1. System setup

* Equal contribution to the paper. Contact: ruizhewa@usc.edu; (626) 390-8301

The first key component of our scanning technology is the registration between point clouds. For face scanning, we perform rigid registration. We set a frontal depth image as the reference, and then align each subsequent cloud of 3D points to the reference using a GPU (Graphics Processing Unit)

implementation of a robust variant of the ICP (Iterative Closest Point) algorithm, which enables real-time scanning. This registration, combined with our online surface reconstruction method, allows us to reject poor alignments due to facial expressions, occlusions, or a poor estimation of the transformation, by thresholding the distance between the current frame and our online model. For body modeling, we carry out articulated registration, since the subject turns 360 degrees in front of the sensor during scanning. Instead of registering all consecutive frames, we sample key frames out of the complete turning sequence and align them in a top-bottom-top manner. The second major component is surface reconstruction. For face modeling, we utilize an online cylindrical representation which enables us to perform 3D operations in the unwrapped 2D image space. For example, interpolation in the 2D image amounts to hole filling on the 3D surface, and filtering the image acts as spatial smoothing in 3D. We initialize the cylindrical model with the reference point cloud and update the model with reliable new frames. Following the same idea, we employ a part-based cylindrical representation for the body modeling and solve the problem of blending between two overlapping cylinders. After separately scanning the face and body, we register and blend them to generate a single model. Experimental results demonstrate the effectiveness of our modeling technique. For both face and body modeling, we quantitatively compare our models to commercial laser scans. For the face, an average error of 1mm is achieved, while for the body an average error of 5mm is observed. The degradation of accuracy for the body modeling is mainly due to the sensor's increasing quantization step at longer distances. With a more accurate 3D sensing device, better accuracy could be achieved. The rest of the paper is organized as follows. Section 2 covers our face and body modeling methods in detail.
Section 3 describes the experimental results. Section 4 concludes and discusses future work.

2. Method

2.1. Face modeling

Reconstructing an accurate 3-D model from the 3-D camera is not an easy problem. The depth map computation uses a triangulation principle, making the depth data near boundaries very noisy. As a result, simple averaging over time does not suffice. We propose to accumulate and register face images from several poses. Then, we perform both temporal integration and spatial smoothing to remove the noise. A cylindrical representation [3,4] is used to improve performance. Fig. 2 shows a flowchart of our approach. This approach was presented in [5].

Fig. 2. Flowchart for the face modeling

After segmenting the face region, the images are aligned using a point cloud registration technique, namely the Expectation-Maximization Iterative Closest Point (EM-ICP) algorithm [1,2]. The first image is arbitrarily set as a reference and assumed to be a frontal face. The subsequent images are

registered to the reference coordinate system. Note that other pose estimation algorithms could be used instead, as long as they provide sufficient accuracy. The registered images are merged in an unwrapped cylindrical 2-D image, which can represent star-shaped objects such as faces [3,4]. Practically, a cylinder is set around the face in the reference frame. Its axis is the vertical axis going through the middle of the head. The axis location does not need to be very accurate and is loosely estimated by taking the center of mass of the head in the reference frame. For each 3-D point of coordinates (x, y, z), the cylindrical coordinates (r, θ, y) can be computed with the equations:

r = √(x² + z²), θ = atan2(z, x)

The 3-D geometry of the face can be stored in an unwrapped cylindrical image in which the value at pixel (θ, y) is the distance r from this point to the cylinder axis (Fig. 3). This representation is simple and enables us to apply 2-D image filtering techniques to smooth the 3-D data, hence improving performance. Also, the mesh can be computed very simply by linking neighboring pixels together.

Fig. 3. Cylindrical representation. A cylinder is set around the face (left). Every 3-D point (x, y, z) is converted to cylindrical coordinates (r, θ, y), where r is the distance from the surface point to the cylinder axis (middle), giving the intensity value at pixel (θ, y) in the unwrapped cylindrical map (right).

2.2. Body modeling
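The unwrapped cylindrical mapping of Sec. 2.1, which is reused per body part in this section, can be sketched as follows. This is a minimal illustration, not the authors' code: the image resolution, the height range, and the rule for resolving pixel collisions are assumptions.

```python
import numpy as np

def unwrap_to_cylinder(points, axis_x, axis_z, height=128, width=256,
                       y_min=-0.15, y_max=0.15):
    """Project 3D points onto an unwrapped cylindrical range image.

    points: (N, 3) array of (x, y, z) coordinates in meters; the cylinder
    axis is vertical and passes through (axis_x, axis_z). All sizes and
    ranges here are illustrative, not the paper's parameters.
    """
    x = points[:, 0] - axis_x
    z = points[:, 2] - axis_z
    y = points[:, 1]

    r = np.sqrt(x**2 + z**2)      # distance to the cylinder axis
    theta = np.arctan2(z, x)      # angle around the axis, in (-pi, pi]

    # Quantize (theta, y) to pixel coordinates in the unwrapped image.
    u = ((theta + np.pi) / (2 * np.pi) * (width - 1)).astype(int)
    v = ((y - y_min) / (y_max - y_min) * (height - 1)).astype(int)

    image = np.full((height, width), np.nan)   # NaN marks holes
    inside = (v >= 0) & (v < height)
    # On collision, keep the point farthest from the axis (outer surface).
    for ui, vi, ri in zip(u[inside], v[inside], r[inside]):
        if np.isnan(image[vi, ui]) or ri > image[vi, ui]:
            image[vi, ui] = ri
    return image
```

Interpolating the NaN pixels of the returned image performs hole filling on the 3D surface, and filtering the image smooths the geometry, as described in Sec. 2.1.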

Fig. 4. Flowchart for the body modeling

The general pipeline of the body modeling system is shown in Figure 4. First, we detect 4 key poses out of the whole depth video sequence: front (the reference pose), back, and the two profiles. They provide sufficient information to describe the main body shape. The 4 key frames are registered in a top-bottom-top manner instead of using other well-developed non-rigid registration methods. Top-bottom means that registration goes from the root node to all leaf nodes in the tree model of the human body, as shown in Figure 1(c), while bottom-top means the opposite. Top-bottom registration first aligns the torso or the whole body and then aligns the succeeding rigid body parts. Bottom-top registration first refines the alignment of the rigid body parts and then propagates the refinement all the way to the root node, i.e. the torso. After registering the 4 key frames, an articulated part-based cylindrical body model, which supports a set of operations, is used to process the rough and noisy registered point cloud of the body. The key here is the 2D part-based unwrapped cylindrical representation, which enables computationally efficient 2D interpolation and 2D filtering. More details can be found in [6].

2.3. Blending face and body models

The performance of the Kinect sensor decreases quadratically as the distance increases. At full-body scanning range, i.e. approximately 2 meters, most details are lost and only the global shape is captured. Among the lost details, those of the face matter most. Hence, it is necessary to blend the body model and the face model to take advantage of the close scanning distance of the face. After the rigid and articulated registration stages, we obtain the raw point clouds of the face and body, respectively. We rigidly align the face point cloud with the points belonging to the head area of the body.
Then we remove all points that are closer than a threshold to the facial point cloud from

the body. This ensures that, at the surface reconstruction stage, only the face point cloud is used for the final model's facial area. The gap between the face and body point clouds is filled automatically by our cylindrical surface reconstruction technique.

3. Experimental Results

3.1. Face models and evaluation

The 3-D face models can be reconstructed online and in real time. Also, they are visually accurate (Fig. 5). The accuracy was quantified by comparing our models to commercial laser scans, and we obtain an accuracy of 1mm on average. Note that smooth areas such as the cheeks are very close to the ground truth, while high-curvature areas have a larger error, as shown in Fig. 6.

Fig. 5. Face models of different people

Fig. 6. (a) Our models (b) Laser scans (c) Heat map

3.2. Body models and evaluation

Fig. 7 shows the models of 4 people scanned by our system. When they turn in front of the depth camera, we observe obvious articulated motion. In Fig. 7, each model is shown from 4 different views. The models show clear, smooth body shapes overall and capture personalized features such as knees, hips and clothes. Holes on the body are interpolated and the joints between rigid body parts are well blended. The average computing time of the whole process with an Intel Core i7 processor at 2.0 GHz is around 3 minutes. Although finer details of the body shape (e.g. lips, eyes)

cannot be extracted due to the noisy nature of the Kinect data, we believe that the current system can generate models accurate enough for applications such as online shopping and gaming. We believe that the model can be further refined by using a more complex model, adding more frames, and taking advantage of the corresponding RGB image.

Fig. 7. Body models of different people

Besides the qualitative analysis, we also present a quantitative comparison between our model and the laser-scanned result. Due to the articulation between these two models, it is hard to compare them as a whole. Instead, we compare the segmented rigid body parts. The heat map of the torso is shown in Fig. 8. We generate the point-wise error by mapping each point of our body part to the cylindrical image generated by the laser-scanned body part and looking up the closest pixel or interpolating the 4 neighboring pixels. The median absolute error on the torso is 5.84mm.
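The point-wise error computation described above (project each point of our model into the laser scan's cylindrical image, then interpolate the 4 neighboring pixels) can be sketched as follows. The function name, image layout, and parameters are illustrative assumptions, not the authors' exact code.

```python
import numpy as np

def pointwise_error(points, ref_image, axis_x, axis_z, y_min, y_max):
    """Point-wise error of `points` against a reference cylindrical range image.

    ref_image[v, u] holds the reference distance-to-axis r at angle index u
    and height index v, as produced by a cylindrical unwrapping.
    """
    height, width = ref_image.shape
    x = points[:, 0] - axis_x
    z = points[:, 2] - axis_z
    r = np.sqrt(x**2 + z**2)
    theta = np.arctan2(z, x)

    # Continuous pixel coordinates in the reference image.
    u = (theta + np.pi) / (2 * np.pi) * (width - 1)
    v = (points[:, 1] - y_min) / (y_max - y_min) * (height - 1)

    # Bilinear interpolation of the 4 neighboring pixels.
    u0 = np.clip(np.floor(u).astype(int), 0, width - 2)
    v0 = np.clip(np.floor(v).astype(int), 0, height - 2)
    du, dv = u - u0, v - v0
    ref_r = (ref_image[v0, u0] * (1 - du) * (1 - dv)
             + ref_image[v0, u0 + 1] * du * (1 - dv)
             + ref_image[v0 + 1, u0] * (1 - du) * dv
             + ref_image[v0 + 1, u0 + 1] * du * dv)
    return np.abs(r - ref_r)
```

Taking `np.median` of the returned errors yields the summary statistic reported above for the torso.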

Fig. 8. Heat map of torso

3.3. Blending face and body models

We manually segment the head point cloud from the body to register it with the face point cloud (Sec. 2.3). The final reconstruction is shown in Fig. 9. We capture the global shape of the body as well as the fine details of the face.

Fig. 9. Final reconstruction model

4. Conclusion and Future Work

In this paper, we address the problem of both face and body modeling with a single fixed 3D sensor. Our scanning system generates high-quality face and body models quickly. In the future, we plan to work on refining details of the model from the intensity stream.

References

[1] Chen, Y., Medioni, G.: "Object modelling by registration of multiple range images." Image and Vision Computing, 1991.
[2] Tamaki, T., Abe, M., Raytchev, B., Kaneda, K.: "Softassign and EM-ICP on GPU." CVPR, 2010.
[3] Lin, Y., Medioni, G., Choi, J.: "Accurate 3D face reconstruction from weakly calibrated wide baseline images with profile contours." CVPR, 2010.
[4] Williams, L.: "Performance-Driven Facial Animation." Computer Graphics, 24(4), 1990.
[5] Hernandez, M., Choi, J., Medioni, G.: "Laser scan quality 3-D face modeling using a low-cost depth camera." European Signal Processing Conference (EUSIPCO), 2012.
[6] Wang, R., Choi, J., Medioni, G.: "Accurate Full Body Scanning from a Single Fixed 3D Camera." 3DIMPVT 2012, pp. 432-439.