
A study of a multi-kinect system for human body scanning

A Seminar Report
Submitted in partial fulfillment of requirements for the degree of
Master of Technology

by
Shashwat Rohilla
Roll No: 133050009

under the guidance of
Prof. Parag Chaudhuri

Department of Computer Science and Engineering
IIT Bombay
April, 2014

Contents

1 Introduction
  1.1 Motivation
  1.2 What is Kinect?
2 Earlier Work
  2.1 Stages of Modeling
  2.2 ICP for Point Cloud Registration
  2.3 Four-Kinect system
3 Calibration
  3.1 Self Calibration
  3.2 Mapping depth to RGB camera
  3.3 Calibration between multiple Kinects
4 Conclusion and Future Scope
  4.1 Limitations
  4.2 Future Work
    4.2.1 Code
    4.2.2 Kinect Stand
    4.2.3 Multi-Kinect Calibration using Hand-Waving
5 Acknowledgement

Abstract

3D models are useful in many areas such as gaming, animation, simulation and virtual worlds. Simple objects can be created manually using tools like Blender, but to model a person we must collect data by scanning one. Scanning is not an easy process. 3D scanning techniques such as laser scanning are fast and accurate, but they are costly, bulky and require expert knowledge. Computer vision techniques based on ordinary 2D cameras, such as structure from motion and shape from shading, are not robust. We present a method to scan a person using Microsoft's Kinect; the main reasons for choosing the Kinect are its portability and low cost. Currently, we are able to generate a 3D model of a person: the person stands in front of the cameras and turns around gradually so that the full body is captured. We obtain reasonable output with some noise, but the whole procedure is quite tedious and is not feasible for a layman to use. This report aims at refining the previous work.

Chapter 1
Introduction

1.1 Motivation

We aim to build a system that can create a 3D model of a person. Creating such a model in a computer from scratch is a tough task; instead, we capture structural data of the person (by taking images) and then process it to obtain the model. There are several techniques for this.

One of them is 3D laser scanning [9]. An emitter emits some kind of light, and a sensor detects its reflection or the radiation passing through. One technique in this family is time-of-flight, in which the round-trip time of the laser pulse is measured; knowing the speed of light, c, and the time, one can compute the distance. It is fast, accurate and can measure distances on the order of kilometres, but at a high price. We do not need such capabilities to model a humanoid.

Passive techniques [9] are those which do not involve any emission of radiation; they rely only on the natural incident light, and an ordinary 2D camera can be used to capture the images. Stereoscopic systems require two images taken from cameras slightly apart from each other; the depth is calculated by exploiting the slight difference between the two images. Our eyes work on the same principle. Photometric systems use a single camera but multiple lighting conditions to determine shape from shading. These techniques are very sensitive to lighting conditions and require a lot of computation.

To overcome these problems, we propose to use Microsoft's Kinect. The Kinect is a 3D camera which calculates depth using the principle of structured light: the IR projector projects a pattern, and the IR camera observes how it deviates from a known reference pattern to calculate the disparity and hence the depth.

1.2 What is Kinect?

The Kinect (Figure 1.1) is a 3D camera that uses infra-red (IR) illumination to obtain depth data, colour images and sound. It was developed by Microsoft for the Xbox 360. Later, PrimeSense [6] (the company which designed the Kinect) also released open source drivers.

The system can measure distance with 1 cm accuracy at 2 m and has a resolution of 3 mm at 2 m. The default RGB video stream uses 8-bit VGA resolution (640×480 pixels), but the hardware is capable of resolutions up to 1280×1024 (at a lower frame rate). The monochrome depth-sensing video stream is in VGA resolution (640×480 pixels) with 11-bit depth, which provides 2,048 levels of sensitivity.

The Kinect has an IR camera, an RGB camera and a laser-based IR projector, as shown in the figure below. The IR camera and the IR projector form a stereo pair with a baseline of about 7.5 cm. The IR projector projects a known pattern which is read by the IR camera, and the change in the pattern is observed to calculate the depth of each pixel. Since the IR camera does not interfere with the visual spectrum, texture images can be captured simultaneously. We are using the PCL library [5] with the libfreenect [8] open source drivers for the Kinect.

Figure 1.1: Source: http://pavankumarvasu.files.wordpress.com/2013/03/sp-kinect-img.png

In the next chapter, we will briefly discuss the work done by [2]: the process of gathering point clouds and creating a model from them. Later, we will talk about calibration, because that is where considerable improvement is possible. Then, in Chapter 4, we will discuss the limitations of the current system and some ideas to overcome them.

Chapter 2
Earlier Work

Currently, we are able to model a person with good accuracy [1] [2] [3]. The following are the stages of the pipeline that we go through.

2.1 Stages of Modeling

The process of scanning and modeling a human using the Kinect has the following stages (a small code sketch of the point-cloud construction and segmentation steps follows the list):

Calibrate: The cameras are first calibrated to obtain their positions and orientations with respect to each other. This is explained later.

Capture: The person stands in front of the cameras and 3D images are taken. It is made sure that the whole body is covered so that there are no holes when the point cloud is generated.

Construct Point Cloud: A point cloud is the set of 3D points captured in the previous stage. To construct it, we back-project every pixel in the image; since we know the depth at each pixel, we know the exact 3D position of the point.

Segmentation: The point cloud constructed above contains not only the person but also the background. The background is assumed to be far enough away that it can be segmented out on the basis of its depth value. To make this more robust, the person is assumed to have depth values within a certain band.

Remove Noise: Depth sensors such as the Kinect are prone to measurement errors that lead to sparse outliers, which may corrupt the registration process. These outliers must therefore be removed from the point clouds before registration.

Register Point Clouds: We now have many point clouds of the same person captured from different viewpoints. These point clouds are registered using the ICP algorithm.

Meshing and Texture Mapping: Meshes are created out of the registered point clouds to obtain a smooth surface. After that, texture is mapped from the original point clouds.
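To make the point-cloud construction and segmentation stages concrete, the following is a minimal sketch (not the project's actual code) of back-projecting a depth image with PCL and keeping only points inside an assumed depth band around the person; the intrinsics fx, fy, cx, cy, the depth band and the function name are illustrative assumptions.

```cpp
// Minimal sketch: back-project a depth image into a PCL point cloud and keep
// only points inside an assumed depth band around the person.
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <vector>

pcl::PointCloud<pcl::PointXYZ>::Ptr backProjectAndSegment(
    const std::vector<float>& depth,   // depth in metres, row-major, width*height
    int width, int height)
{
    const float fx = 585.0f, fy = 585.0f;      // assumed focal lengths in pixels
    const float cx = 320.0f, cy = 240.0f;      // assumed principal point
    const float zMin = 1.0f, zMax = 2.5f;      // assumed depth band containing the person

    pcl::PointCloud<pcl::PointXYZ>::Ptr cloud(new pcl::PointCloud<pcl::PointXYZ>);
    for (int v = 0; v < height; ++v) {
        for (int u = 0; u < width; ++u) {
            float z = depth[v * width + u];
            if (z < zMin || z > zMax) continue;   // segment out the background
            pcl::PointXYZ p;
            p.z = z;
            p.x = (u - cx) * z / fx;              // inverse of u = fx*X/Z + cx
            p.y = (v - cy) * z / fy;              // inverse of v = fy*Y/Z + cy
            cloud->push_back(p);
        }
    }
    return cloud;
}
```

The noise-removal stage can then be approximated with a statistical outlier filter such as PCL's pcl::StatisticalOutlierRemoval, which discards points whose mean distance to their neighbours is unusually large.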

2.2 ICP for Point Cloud Registration

Iterative Closest Point (ICP) is an algorithm employed to minimize the difference between two point clouds. ICP is often used to reconstruct 2D or 3D surfaces from different scans, to localize robots and achieve optimal path planning (especially when wheel odometry is unreliable due to slippery terrain), to co-register bone models, and so on. In the algorithm, one point cloud (the reference, or target) is kept fixed, while the other (the source) is transformed to best match the reference. The algorithm iteratively revises the transformation (a combination of translation and rotation) needed to minimize the distance from the source to the reference point cloud.

Inputs: the reference and source point clouds, an (optional) initial estimate of the transformation that aligns the source to the reference, and criteria for stopping the iterations. Output: the refined transformation.

Essentially, the algorithm steps are (a minimal PCL sketch is given after Section 2.3):
1. For each point in the source point cloud, find the closest point in the reference point cloud.
2. Estimate the combination of rotation and translation, using a mean squared error cost function, that best aligns each source point to its match found in the previous step.
3. Transform the source points using the obtained transformation.
4. Iterate (re-associate the points, and so on).

Source: http://en.wikipedia.org/wiki/Iterative_closest_point

2.3 Four-Kinect system

In the beginning, the idea was to use a single Kinect and move it around the person; in fact, the person can be asked to rotate if the scanning process is not continuous. ICP is commonly used to estimate the rigid-body transformation between two clouds, but the problem with this approach is that it fails to give a complete, closed model because errors accumulate over successive frames. ICP is also time-consuming. Suppose, however, that we roughly know by how much the person has rotated between successive frames; then we can obtain a coarse initial estimate of the pose of each point cloud. This method gave reasonable results.

The problem with a single Kinect is that the point clouds are blurry and of low quality. Hence, we switched to a multi-Kinect system [2]. We used four Kinect cameras, arranged as two pairs on opposite sides. This four-Kinect system (Figure 2.1) gives more consistent and smoother models than the single-Kinect approach. The setup we currently have uses four Kinects, as shown in Figure 2.1.
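Returning to the registration step of Section 2.2: it maps directly onto PCL's pcl::IterativeClosestPoint class. The following is a minimal sketch (not the pipeline's actual code); the convergence parameters and the function name are assumptions.

```cpp
// Minimal sketch of registering one captured cloud (source) to a reference
// cloud (target) with PCL's ICP implementation.
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/registration/icp.h>
#include <Eigen/Core>
#include <iostream>

Eigen::Matrix4f registerClouds(pcl::PointCloud<pcl::PointXYZ>::Ptr source,
                               pcl::PointCloud<pcl::PointXYZ>::Ptr target)
{
    pcl::IterativeClosestPoint<pcl::PointXYZ, pcl::PointXYZ> icp;
    icp.setInputSource(source);                 // cloud that will be moved
    icp.setInputTarget(target);                 // cloud that stays fixed
    icp.setMaximumIterations(50);               // stopping criterion (assumed value)
    icp.setMaxCorrespondenceDistance(0.05);     // ignore matches farther than 5 cm (assumed)

    pcl::PointCloud<pcl::PointXYZ> aligned;
    icp.align(aligned);                         // runs the iterate / re-associate loop

    std::cout << "converged: " << icp.hasConverged()
              << "  fitness: " << icp.getFitnessScore() << std::endl;
    return icp.getFinalTransformation();        // refined rigid transformation
}
```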

Figure 2.1: Four-Kinect System [2]

In our multi-Kinect system, one big problem is calibration. We are getting accurate results, but the process of calibration is quite complex. In the next chapter, we will discuss the different types of calibration, the related problems and ideas to overcome them.

Chapter 3
Calibration

In the last chapter, we briefly discussed the whole process of human modeling. We observed that the calibration process can be improved a lot. Here, we discuss the various types of calibration and shed some light on the improvements that can be made.

3.1 Self Calibration

After capturing an image, we back-project the pixels into 3D space to obtain a point cloud. The back-projection ray is known only if we know the intrinsic parameters of the camera. The intrinsic parameters of the camera include:

- Focal length f
- Image origin o_x, o_y
- Scale factors s_x, s_y
- Skew factor
- Lens distortion

Here, we show the calculation of only f, o_x, o_y, s_x and s_y. Consider Figure 3.1, where O is the pinhole, and p(x, y, z) and P(X, Y, Z) are the corresponding points on the image plane and in 3D respectively. Clearly,

    x = f X / Z,   y = f Y / Z,    ...(1)

where every quantity is in metres. Next, the camera coordinates (x and y, in metres) are mapped to pixel coordinates (x_p and y_p, in pixels) by dividing by the scaling factors (s_x, s_y), in metres per pixel. The origin of the camera coordinate system may not coincide with the image centre; this offset is denoted by (o_x, o_y), in pixels. The pixel coordinates then become

    x_p = (f / s_x) (X / Z) + o_x,   y_p = (f / s_y) (Y / Z) + o_y.

Figure 3.1: Representation of a pinhole camera (http://en.wikipedia.org/wiki/Pinhole_camera_model)

In matrix form, this can be written as (x_p', y_p', z_p')^T = M (X, Y, Z, 1)^T, where M is the projection matrix

    M = [ f/s_x    0     o_x   0 ]
        [   0    f/s_y   o_y   0 ]
        [   0      0      1    0 ]

and the pixel coordinates are then recovered as x_p = x_p' / z_p' and y_p = y_p' / z_p'.

M is solved for using least squares, given at least six corresponding feature points. Manually marking those points is a tedious task, and it leads to high reprojection errors (the difference between the actual and the computed pixel positions) because humans are not that accurate: the reprojection error is around 10-20 pixels. There is a utility which automatically detects good feature points on a checkerboard; in this case, the good feature points are the corners of its cells. With it, the reprojection error drops to around 1-2 pixels.

We intend to calibrate both the RGB and the depth camera simultaneously. The RGB camera can distinguish between black and white, which is why the utility can detect the corners of the cells, but to a depth camera the checkerboard looks like a blank sheet. An idea [10] is to use a checkerboard with some hollow cells in between; the change in depth at the corners can then be used to identify them as good feature points, so both the RGB and the depth camera can be calibrated at once. A sketch of checkerboard-based intrinsic calibration is given below.
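The corner-detection utility mentioned above behaves roughly like the following OpenCV-based sketch (an illustration, not the utility's actual code); the board dimensions, square size, image handling and function name are assumptions.

```cpp
// Minimal sketch: detect checkerboard corners in a set of RGB images and
// estimate the camera intrinsics with OpenCV.
#include <opencv2/opencv.hpp>
#include <vector>

cv::Mat calibrateFromCheckerboard(const std::vector<cv::Mat>& images)
{
    const cv::Size boardSize(8, 6);        // assumed inner-corner grid of the checkerboard
    const float squareSize = 0.025f;       // assumed cell size in metres

    // The same 3D corner layout (on the Z = 0 plane) is used for every view.
    std::vector<cv::Point3f> boardPoints;
    for (int r = 0; r < boardSize.height; ++r)
        for (int c = 0; c < boardSize.width; ++c)
            boardPoints.push_back(cv::Point3f(c * squareSize, r * squareSize, 0.0f));

    std::vector<std::vector<cv::Point3f> > objectPoints;
    std::vector<std::vector<cv::Point2f> > imagePoints;
    for (const cv::Mat& img : images) {
        std::vector<cv::Point2f> corners;
        if (cv::findChessboardCorners(img, boardSize, corners)) {  // automatic corner detection
            imagePoints.push_back(corners);
            objectPoints.push_back(boardPoints);
        }
    }

    cv::Mat K, distCoeffs;                 // intrinsic matrix and lens distortion
    std::vector<cv::Mat> rvecs, tvecs;     // per-view extrinsics
    double rms = cv::calibrateCamera(objectPoints, imagePoints, images[0].size(),
                                     K, distCoeffs, rvecs, tvecs);
    // rms is the reprojection error in pixels; a value around 1-2 px indicates
    // a calibration of the quality quoted above.
    (void)rms;
    return K;
}
```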

3.2 Mapping depth to RGB camera

The IR and RGB cameras are separated by a small baseline, so a stereo calibration algorithm is used to determine the transformation between them. We can use a semi-transparent checkerboard so that the shared corners of the cells are visible to the depth camera as well. To do this, we first calibrate the individual cameras; then, with all the internal camera parameters fixed (including the focal length), we calibrate the external transform between the two. Typical translation values are (0.0254, 0.00013, 0.00218): the measured distance between the IR and RGB lens centres is about 2.5 cm, and the Y and Z offsets are very small. The rotation component of the transform is also very small; typical offsets are about 0.5 degrees, which translates to a 1 cm offset at 1.5 m.

3.3 Calibration between multiple Kinects

Multi-Kinect calibration means calculating the transformations that map the coordinate system of one camera to the coordinate system of any other camera. This can be done by placing an object so that it is visible to every camera; a checkerboard is a good choice, since its features can be detected accurately and automatically. After placing the object, an arbitrary shared coordinate system is chosen, with respect to which the transformation of each camera is calculated; it is simplest to choose a corner (say the bottom-left) of the object. Let T_ar,i denote this transformation for the i-th camera. The transformation from the i-th camera to the j-th camera is then obtained by composing T_ar,i and T_ar,j: bring the i-th camera into the shared coordinate system and then apply T_ar,j. A small sketch of composing such pairwise transformations is given at the end of this chapter.

Our case is a bit more complex, because there is no position for the checkerboard that is visible to all four cameras. So we first calibrate camera-1 with camera-2, then camera-3 with camera-4, and then camera-2 with camera-4. Note that these three pairs have different shared coordinate systems. This generates good feature points. To map, say, camera-1 to camera-3, we first map camera-1 to camera-2, then to camera-4 and finally to camera-3. The transformations are refined by repeating the same algorithm: point clouds of a person are captured for all three pairs above, camera-2 and camera-4 are calibrated by placing some objects in a small container, and this calibration is done using ICP. The results are good, but this step makes the calibration cumbersome. One alternative is to calibrate by waving the hand.

The first two types of calibration are not as critical: they concern the camera intrinsics, which do not change unless we change the camera, so that process needs to be performed only once per camera.

So far, we have discussed in brief the work that has been done. We studied the calibration processes and realized that our four-Kinect calibration gives good results but is cumbersome. In the next chapter, we will discuss the limitations of the whole process and some ways to overcome them, including the hand-waving calibration technique.
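Before moving on, here is a small sketch of the transform chaining described in Section 3.3: mapping a point from camera-1's frame to camera-3's frame via camera-2 and camera-4 amounts to multiplying 4x4 homogeneous matrices. The matrices and the point below are placeholders; in practice they come from the checkerboard and ICP calibration.

```cpp
// Minimal sketch: chain pairwise rigid transforms to map camera-1 -> camera-3.
#include <Eigen/Dense>
#include <iostream>

int main()
{
    // Pairwise rigid transforms as 4x4 homogeneous matrices (placeholder identities).
    Eigen::Matrix4f T_1to2 = Eigen::Matrix4f::Identity();
    Eigen::Matrix4f T_2to4 = Eigen::Matrix4f::Identity();
    Eigen::Matrix4f T_4to3 = Eigen::Matrix4f::Identity();

    // Composition: apply 1->2 first, then 2->4, then 4->3.
    Eigen::Matrix4f T_1to3 = T_4to3 * T_2to4 * T_1to2;

    // Transform a point measured in camera-1's coordinate system (assumed value).
    Eigen::Vector4f p_cam1(0.1f, 0.2f, 1.5f, 1.0f);
    Eigen::Vector4f p_cam3 = T_1to3 * p_cam1;

    std::cout << "Point in camera-3 frame: " << p_cam3.head<3>().transpose() << std::endl;
    return 0;
}
```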

Chapter 4
Conclusion and Future Scope

We studied the work done by [2], focusing mainly on calibration, and briefly reviewed the process of creating a human model using the Kinect. We get good results, but the whole procedure is quite complex. The limitations are listed below.

4.1 Limitations

- We are able to get point clouds with good accuracy, but the method is still not robust. If the person is wearing loose clothes, the reconstructed point cloud may not be realistic in some parts.

- The whole process takes a considerable amount of time. The performance is shown below.

      Process            Average computing time (min)
      Scanning            2
      Pre-processing     10
      Registration       10
      Post-processing     2

  Table 4.1: The average computing time of each step on an Intel Core 2 Duo processor at 2 GHz

  This could be reduced considerably if the hardware allowed all the Kinects to be switched on simultaneously. It might also reduce blurring of the mesh: frames from different cameras that are nominally captured at the same time actually have a slight delay between them, because the hardware permits only one frame capture at a time.

- The system assumes that the person does not move while the point clouds are being captured. This should be made more robust, because it is not easy for someone to stand still for a long time.

- Manual editing is required to register the different point clouds.

- The calibration procedure becomes tedious when finer transformations are needed. We will discuss this further in the next section.

4.2 Future Work

4.2.1 Code

As of now, the whole process is not meant for a layman user: the instructions a user has to follow to get scanned are fairly complex, and there are several executables involved.

openni grabber openni: Captures the point clouds to be used for calibration and stores them in the location provided as an argument. Some keys start and stop the streaming.

calibrate2: Takes the above point clouds and aligns them. After that, some manual editing is done to refine the alignment, and the transformation parameters are then hard-coded in a function call.

openni grabber openni simul fork: Grabs the actual point clouds of the user for scanning. The user rotates from 0 to 90 degrees in 4 steps (effectively 5 frames); at every step, all 4 Kinects scan the person, capturing 20 point clouds in total.

At this point the task of gathering point clouds is over; the subsequent code manipulates them to produce the final mesh. The whole process can be merged into a bash file: the user would just run the file (provided with the path), supply point cloud data whenever the program asks, and at the end have his model ready in the system. In some places, key events are not required because their task can be automated; for the rest, the key events could be replaced by speech commands if needed. Some minor changes were already made to the code (such as removing some useless infinite loops and changing the hard-coded paths in the makefile). A small sketch of the kind of capture code involved is given below.
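For illustration, capturing point clouds from a Kinect with PCL's OpenNI grabber interface looks roughly like the sketch below. This is not the project's actual source; the output path, frame count, point type and callback wiring are assumptions.

```cpp
// Minimal sketch: grab a few point clouds from a Kinect via PCL's OpenNI
// grabber and save them to disk as PCD files.
#include <pcl/io/openni_grabber.h>
#include <pcl/io/pcd_io.h>
#include <pcl/point_types.h>
#include <boost/function.hpp>
#include <boost/thread/thread.hpp>
#include <sstream>

static int savedFrames = 0;   // a real tool would synchronize this counter

void cloudCallback(const pcl::PointCloud<pcl::PointXYZRGBA>::ConstPtr& cloud)
{
    if (savedFrames >= 5) return;                        // keep only a few frames (assumed)
    std::ostringstream name;
    name << "/tmp/capture_" << savedFrames++ << ".pcd";  // assumed output location
    pcl::io::savePCDFileBinary(name.str(), *cloud);
}

int main()
{
    pcl::Grabber* grabber = new pcl::OpenNIGrabber();    // opens the first Kinect found
    boost::function<void(const pcl::PointCloud<pcl::PointXYZRGBA>::ConstPtr&)> f = &cloudCallback;
    grabber->registerCallback(f);                        // called for every new frame
    grabber->start();
    while (savedFrames < 5)
        boost::this_thread::sleep(boost::posix_time::seconds(1)); // wait for the callback thread
    grabber->stop();
    delete grabber;
    return 0;
}
```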

4.2.2 Kinect Stand

Figure 4.1: Our Setup

(a) Kinect stand with sliding bases. (b) A scale can be attached to the stand. (c) Alternative base: stable and lighter, but does not allow the camera to go much lower.

Figure 4.2: Stand Design

The whole apparatus (Figure 4.1) used to scan the person requires that the Kinects stay steady in their positions. Even the slightest movement can distort the output a lot, and if any of the cameras is displaced, the whole calibration process has to be repeated, which takes a lot of time. Currently, the cameras are only somewhat stable. We have planned a steel stand (Figure 4.2) up to a maximum height of 7 ft. It would hold 2 Kinects which can slide vertically; knobs at their back can be tightened to fix their positions, and a vertical metre scale would measure the distance between the Kinects. Two such stands are needed. The main advantage of this design is that the two Kinects on a stand are always rigid with respect to each other and the distance between them is known. Another advantage is portability: we can put the stand anywhere. There would also be a proper arrangement of wires.

4.2.3 Multi-Kinect Calibration using Hand-Waving

Our calibration procedure is tedious: first the cameras are calibrated using a checkerboard, and then point clouds are captured to refine the transformations. An alternative idea is to wave an object that is tracked by all the cameras. The paper [7] describes multi-camera calibration by moving two LED markers that are rigid with respect to each other; it assumes that the cameras share a viewing volume at least pairwise.

The markers are tracked over time to collect corresponding 3D points. Since this gives pairwise transformations, global calibration is done by constructing a vision graph, which shows the connectivity between the cameras. To calculate the transformation between the i-th and the j-th camera, we pick the series of pairwise transformations along the shortest path between them (computed with Dijkstra's algorithm). The weight assigned to each edge is inversely proportional to the number of corresponding points the two cameras share; if there are no common points, there is no edge between the cameras. For this algorithm to work, the graph needs to be connected. A small sketch of this shortest-path composition is given below.

The accuracy of this method depends on many factors, one of which is the accuracy of the marker detection algorithm. The same idea could be used to track the user's hand. The main issue that might occur is that, in our case, the cameras are not synchronized: the user's hand position might change while different cameras are capturing point clouds of the same time instance. On the other hand, there is a large volume that is visible to all four cameras, which should give some advantage.
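To make the vision-graph idea concrete, the following sketch (hypothetical cameras, weights and transforms; not code from [7] or from this project) runs Dijkstra's algorithm on a small camera graph whose edge weights are the reciprocals of the numbers of shared points, composing the pairwise transforms along the chosen path.

```cpp
// Sketch: pick the best chain of pairwise calibrations between two cameras by
// running Dijkstra on a "vision graph" (edge weight = 1 / shared points) and
// composing the pairwise transforms along the way.
#include <Eigen/Dense>
#include <vector>
#include <queue>
#include <limits>
#include <functional>
#include <utility>

// Dynamic-size matrices avoid Eigen's fixed-size alignment caveat in STL containers.
struct Edge { int to; double weight; Eigen::MatrixXf T; };  // T: transform from this edge's owner camera into camera `to`

Eigen::MatrixXf chainTransform(const std::vector<std::vector<Edge> >& graph, int src, int dst)
{
    const int n = static_cast<int>(graph.size());
    std::vector<double> dist(n, std::numeric_limits<double>::infinity());
    std::vector<Eigen::MatrixXf> toCam(n, Eigen::MatrixXf::Identity(4, 4)); // accumulated src -> i transform

    typedef std::pair<double, int> QItem;                   // (distance, camera index)
    std::priority_queue<QItem, std::vector<QItem>, std::greater<QItem> > pq;
    dist[src] = 0.0;
    pq.push(QItem(0.0, src));

    while (!pq.empty()) {
        QItem top = pq.top(); pq.pop();
        int u = top.second;
        if (top.first > dist[u]) continue;                  // stale queue entry
        for (size_t k = 0; k < graph[u].size(); ++k) {
            const Edge& e = graph[u][k];
            if (dist[u] + e.weight < dist[e.to]) {
                dist[e.to] = dist[u] + e.weight;
                toCam[e.to] = e.T * toCam[u];               // extend chain: src->u, then u->e.to
                pq.push(QItem(dist[e.to], e.to));
            }
        }
    }
    return toCam[dst];                                      // identity if dst is unreachable
}

int main()
{
    std::vector<std::vector<Edge> > g(4);                   // cameras 0..3
    Eigen::MatrixXf I = Eigen::MatrixXf::Identity(4, 4);    // placeholder transforms
    // Edges exist only where two cameras were calibrated together (shared-point counts assumed).
    g[0].push_back(Edge{1, 1.0 / 120, I}); g[1].push_back(Edge{0, 1.0 / 120, I}); // camera-1 <-> camera-2
    g[2].push_back(Edge{3, 1.0 / 150, I}); g[3].push_back(Edge{2, 1.0 / 150, I}); // camera-3 <-> camera-4
    g[1].push_back(Edge{3, 1.0 /  80, I}); g[3].push_back(Edge{1, 1.0 /  80, I}); // camera-2 <-> camera-4

    Eigen::MatrixXf T_cam1_to_cam3 = chainTransform(g, 0, 2);  // goes via cameras 2 and 4
    (void)T_cam1_to_cam3;
    return 0;
}
```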

Chapter 5
Acknowledgement

I thank Prof. Parag Chaudhuri for very helpful discussions and insights into the project, as well as for constructive feedback.

Bibliography

[1] Niket Bagwe, A study of the Microsoft's Kinect camera. Technical report, IIT Bombay, 2011.
[2] Niket Bagwe, 3D Reconstruction of an Animatable Humanoid using Microsoft Kinect. IIT Bombay, 2012.
[3] Jaai Mashalkar, Personalized Animatable Avatars from Depth Data. IIT Bombay, 2013.
[4] Microsoft's Kinect, http://en.wikipedia.org/wiki/Kinect
[5] Point Cloud Library (PCL), http://www.pointclouds.org
[6] PrimeSense, http://www.primesense.com/
[7] Gregorij Kurillo, Zeyu Li, Ruzena Bajcsy, Wide-area external multi-camera calibration using vision graphs and virtual calibration object. University of California, Berkeley.
[8] Kinect drivers, https://github.com/OpenKinect/libfreenect
[9] Laser scanning, http://en.wikipedia.org/wiki/3D_scanner#Non-contact_active
[10] Calibration using a semi-transparent checkerboard, http://doc-ok.org/?p=289
[11] Multi-Kinect calibration, http://doc-ok.org/?p=295