
Institutionen för systemteknik
Department of Electrical Engineering

Examensarbete (Master's thesis)

Autonomous Morphometrics using Depth Cameras for Object Classification and Identification

Master's thesis carried out in Computer Vision at the Institute of Technology, Linköping University, by

Felix Björkeson

LiTH-ISY-EX--13/4680--SE
Linköping 2013

Department of Electrical Engineering, Linköpings universitet, Linköping, Sweden


Autonomous Morphometrics using Depth Cameras for Object Classification and Identification

Master's thesis carried out in Computer Vision at the Institute of Technology, Linköping University, by

Felix Björkeson

LiTH-ISY-EX--13/4680--SE

Supervisors: M.Sc. Kristoffer Öfjäll, ISY, Linköpings universitet
             Dr. Daniel Ljunggren, Optronic
Examiner:    Dr. Lars-Inge Alfredsson, ISY, Linköpings universitet

Linköping, 10 June 2013


Avdelning, Institution / Division, Department: Computer Vision Laboratory, Department of Electrical Engineering, Linköping
Språk / Language: Engelska / English
Rapporttyp / Report category: Examensarbete
ISRN: LiTH-ISY-EX--13/4680--SE

Titel / Title:
Autonom Morphometri med Djupkameror för Objektklassificering och Identifiering
Autonomous Morphometrics using Depth Cameras for Object Classification and Identification

Författare / Author: Felix Björkeson

Sammanfattning / Abstract:
Identification of individuals has been solved with many different solutions around the world, either using biometric data or external means of verification such as ID cards or RFID tags. The advantage of using biometric measurements is that they are directly tied to the individual and are usually unalterable. Acquiring dependable measurements is however challenging when the individuals are uncooperative. A dependable system should be able to deal with this and produce reliable identifications. The system proposed in this thesis can autonomously classify uncooperative specimens from depth data. The data is acquired from a depth camera mounted in an uncontrolled environment, where it was allowed to record continuously for two weeks. This requires stable data extraction and normalization algorithms to produce good representations of the specimens. Robust descriptors can therefore be extracted from each sample of a specimen and, together with different classification algorithms, be used to train or validate the system. Even with as many as 138 different classes, the system achieves high recognition rates. The classification algorithms are inspired by the research field of face recognition; the best of them, the method of fisherfaces, was able to correctly recognize 99.6% of the validation samples, followed by two variations of the method of eigenfaces, which achieved recognition rates of 98.8% and 97.9%. These results affirm that the capabilities of the system are adequate for a commercial implementation.

Nyckelord / Keywords: Depth Cameras, Classification, Morphometrics, Homography, B-Spline, Eigenfaces, Fisherfaces, Local Binary Pattern Histograms, Neural Network


Abstract

Identification of individuals has been solved with many different solutions around the world, either using biometric data or external means of verification such as ID cards or RFID tags. The advantage of using biometric measurements is that they are directly tied to the individual and are usually unalterable. Acquiring dependable measurements is however challenging when the individuals are uncooperative. A dependable system should be able to deal with this and produce reliable identifications. The system proposed in this thesis can autonomously classify uncooperative specimens from depth data. The data is acquired from a depth camera mounted in an uncontrolled environment, where it was allowed to record continuously for two weeks. This requires stable data extraction and normalization algorithms to produce good representations of the specimens. Robust descriptors can therefore be extracted from each sample of a specimen and, together with different classification algorithms, be used to train or validate the system. Even with as many as 138 different classes, the system achieves high recognition rates. The classification algorithms are inspired by the research field of face recognition; the best of them, the method of fisherfaces, was able to correctly recognize 99.6% of the validation samples, followed by two variations of the method of eigenfaces, which achieved recognition rates of 98.8% and 97.9%. These results affirm that the capabilities of the system are adequate for a commercial implementation.


Preface

This master's thesis was the final work towards achieving a Master of Science degree. I was given the opportunity to complete my thesis at the company Optronic, and throughout the first five months of 2013 I set out to complete my goal. Due to confidentiality, certain phrases are not mentioned in this thesis report, to prevent search engines from recognizing sensitive words. You as a reader will however easily discern what the report is really about. I therefore ask you not to be annoyed by the use of vague words, since the included figures clearly reveal the true nature of the specimens mentioned throughout the report.

Stockholm, June 2013
Felix Björkeson


Notation

Abbreviations

Abbreviation   Meaning
1-D            One dimensional
2-D            Two dimensional
3-D            Three dimensional
PCL            Point Cloud Library
SVD            Singular value decomposition
PCA            Principal component analysis
LDA            Linear discriminant analysis
TOF            Time of flight

Clarifications

Word         Meaning
Specimen     A particular individual in a set of individuals
Sample       Set of data points representing an instance of an object
Descriptor   Smaller set of data extracted from a sample, describing the specimen
Class        Defines the set of objects that belong to the same specimen
Object       An instance of a class, i.e. a sample of an individual


Contents

Notation

1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Structure of Thesis

2 Theory
  2.1 3-D Imaging
    2.1.1 Time of Flight Imaging
    2.1.2 Structured Light Imaging
    2.1.3 Time of Flight Imaging Versus Structured Light Imaging
    2.1.4 Pinhole Camera Model
    2.1.5 Occlusion
  2.2 Object Classification
  2.3 Morphometrics
  2.4 Morphological Landmarks
  2.5 Face Recognition

3 Algorithms for Image Processing
  3.1 Flood Fill
  3.2 Principal Component Analysis
  3.3 Homography
  3.4 Hartley Normalization
  3.5 3-D Plane Fitting
  3.6 Face Recognition Methods
    3.6.1 Eigenfaces
    3.6.2 Fisherfaces
    3.6.3 Local Binary Pattern Histograms
  3.7 Neural Network
  3.8 B-splines

4 Implementation
  4.1 Overview
  4.2 Data Acquisition
    Point Cloud
    A Good Camera Angle
    Choosing a Frame of Interest
  4.3 Data Extraction
    Manipulating the Viewing Angle
    Axis Threshold Filter
    Foreground Extraction with Flood Fill
    Reducing Dimensionality
    Create a Height Map
    Rectification Using Homography
  Data Processing
    Ridge Detection
    Ridge Suppression
    Protrusion Detection
    Surface Descriptor Extraction
  Descriptor Analysis
    Eigenfaces
    Fisherfaces
    Local Binary Pattern Histograms
    Neural Network

5 Results
  Evaluation
  Descriptor Extraction
  Training Samples Per Class
  Number of Different Classes
  Performance Decline Over Time

6 Discussion
  Speed and Performance
  Outliers
  Prediction Uncertainty
  Neural Network Parameters
  Descriptor Size
  Improvements
    Different Descriptor
    Harder Outlier Control
    Color Data
    Reiteration
  Frequency Analysis

7 Conclusions

Bibliography

1 Introduction

1.1 Background

The need for autonomously working systems within surveillance, robotics, machine control, machine vision and similar fields is increasing. Here, state-of-the-art solutions in optical sensor technology and image processing can help to overcome problems associated with identification and classification of objects. Image sensors have traditionally output two-dimensional (2-D), or even one-dimensional (1-D), data, but sensors capable of outputting three-dimensional (3-D) depth data are now increasingly available. Future applications are believed to benefit strongly from utilizing this additional dimension, in similarity with the superior stereoscopic method that human vision and the brain use for navigation and identification. The human abilities are recurrently used as the benchmark reference, a limit which we initially hope to equal and ultimately surpass.

Optronic has a long history in optical metrology and in the development and manufacturing of depth cameras. Through its sister company Fotonic, they are interested in further exploring applications which would benefit from using 3-D technology. One application is identification (and classification, or recognition) of general objects using features available in image data such as surface structure, shape, pose, color and reflectivity. As one specific example, we are interested in the possibilities of identifying living objects, such as people and animals, from their shape and pattern. Such objects are usually structurally similar to other objects of the same group (species or part of body), and subtle textural or curvature features are often the only thing differentiating them. Morphometrics, a scientific area defined as the study of the shape and form of organisms, encapsulates the core of this thesis.

1.2 Problem

The goal was to be able to distinguish a set of specimens from each other based on data acquired from a depth camera. To complete this task, range data captured in an uncontrolled environment, where specimens pass freely beneath the camera at their convenience, is available. The camera is positioned at a narrow gate which all specimens go through several times a day. Since the specimens themselves choose when to advance through this location, combined with the general layout around the gate, disorderly behavior is common. This increases the risk of unfavorable data being acquired. Another problematic characteristic of the data is that cross-specimen variations are small, meaning that the shape and form of one individual is very similar to that of another. Nor is the internal variation necessarily low, since the specimens are living beings that move and deform. What further complicates the task is that the captured data spans a wide time range, over two weeks, making it possible for the specimens to have changed noticeably during this time.

The solution used today to identify the different subjects is RFID tags attached to each specimen. As previously mentioned, the specimens tend to behave quite unruly, which consequently results in the possibility of losing an RFID tag or two. Specimens walking around with unknown identity will cause a lot of problems. It is therefore interesting to explore the possibility of a completely unintrusive system that does not rely on mechanical components that might break, such as RFID tags.

1.3 Structure of Thesis

Chapter 2 begins by explaining some fundamental theory, starting with 3-D imaging. The following sections are closely related to each other and set out to explain the principles behind classification. The first section touches the subject on a broad scale, while the rest more or less explain special cases of classification. Chapter 3 continues with theory, but is more particularly aimed at explaining well-known algorithms. These algorithms are some of the cornerstones of the system, and how they work together to form the final solution is explained in chapter 4. The system is then evaluated in chapter 5, followed by chapter 6, where some notable aspects are discussed along with some proposed improvements. At the end of the report, conclusions are presented in chapter 7.

2 Theory

In this chapter some fundamental theories are mentioned or explained, beginning with the principles of three-dimensional depth imaging and then continuing with different aspects of classification.

2.1 3-D Imaging

The basic goal in the field of three-dimensional depth imaging is to produce accurate depth estimations of the surroundings. There are a number of techniques to achieve this, and below two of the most common will be briefly described. These two technologies are based on temporal and spatial information, respectively. The first technology is called Time of Flight (TOF) and estimates the time of flight of light rays. Some common commercially available models are the MESA Imaging SR4000 and the Fotonic E-SERIES. Then cameras utilizing structured light as a technology to create 3-D data will be explained. In this thesis, data from the latter type of camera is what has been used during development. The most referred-to 3-D camera that uses this kind of technology is the Microsoft Kinect [1], developed by PrimeSense. In this thesis, however, a Fotonic P70 [2] was used. The book Time-of-Flight Cameras and Microsoft Kinect by C. Dal Mutto et al. [3] reviews both technologies and their applications very well. Below, the fundamentals are briefly summarized.

2.1.1 Time of Flight Imaging

The idea and purpose of a 3-D camera is to achieve accurate depth values for each pixel in a matrix sensor, resulting in an image containing the distance to the objects in front of the camera.

The basic principle of a TOF camera is to acquire temporal information by measuring the time t it takes for light to move back and forth between the camera and the environment. Since the speed of light c is constant in the same medium, the distance d from the camera to the object is given by

$$ d = \frac{c\,t}{2}. \qquad (2.1) $$

There are several techniques to measure this time, but the one used in most TOF cameras is to measure the phase difference ϕ of the radiated light wave compared to the reflected wave, see figure 2.1. This works by modulating the outgoing light at around 40 MHz. Using simple wave theory the time can be calculated as

$$ t = \frac{\varphi}{2\pi f_{\text{mod}}}, \qquad (2.2) $$

where f_mod is the frequency of the light modulation. The radiated light is usually generated from a constantly modulated light source emitting light near the infrared spectrum, making it invisible to the human eye. Current cameras use common LEDs to emit this light with a wavelength of about 850 nm.

Figure 2.1: Phase difference of a signal (in blue) and its reflection (in red).

The phase difference is computed using

$$ \varphi = \arctan\frac{Q_3 - Q_4}{Q_1 - Q_2}, \qquad (2.3) $$

where the Q_i represent the amounts of electrical charge received in different time intervals at π/2 radian phase delays from each other. This electrical charge is the result of a matrix of CCD/CMOS lock-in pixels [4] that convert the light energy into electricity. These lock-in pixels, structured in a matrix, form the actual sensor chip, where every pixel can sample the light independently of the others.
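As a small illustration of equations (2.1)-(2.3), the sketch below converts the four charge samples Q_1..Q_4 and the modulation frequency into a distance. The function name and argument layout are assumptions made for the example, not taken from a specific sensor interface.

```cpp
#include <cmath>

// Minimal sketch of the TOF range computation in equations (2.1)-(2.3),
// assuming the four charge samples Q1..Q4 and a modulation frequency f_mod [Hz].
double tofDistance(double Q1, double Q2, double Q3, double Q4, double f_mod)
{
    const double c  = 299792458.0;                // speed of light [m/s]
    const double pi = 3.14159265358979323846;
    double phi = std::atan2(Q3 - Q4, Q1 - Q2);    // phase difference, equation (2.3)
    if (phi < 0.0) phi += 2.0 * pi;               // keep the phase in [0, 2*pi)
    double t = phi / (2.0 * pi * f_mod);          // time of flight, equation (2.2)
    return c * t / 2.0;                           // distance, equation (2.1)
}
```

With f_mod = 40 MHz the unambiguous range of this expression is c/(2 f_mod) ≈ 3.75 m, which is the phase-wrapping limit discussed below.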

The final distance to the camera can then be calculated by combining equations 2.1 and 2.2, resulting in the distance

$$ d = \frac{c}{2}\,\frac{\varphi}{2\pi f_{\text{mod}}}. \qquad (2.4) $$

Since the light is periodic, there is an upper limit to the time that can be measured before the next period arrives. This causes errors, limiting the measurable range and producing discontinuities in the range data, and is called phase wrapping. The consequence is that there is only a limited interval within which the depth value can be estimated correctly. There are several techniques to deal with this problem, such as phase unwrapping, but none of these will be dealt with in detail here. Readers are encouraged to read the report by M. Hansard et al. [5].

2.1.2 Structured Light Imaging

The second type of 3-D camera uses structured light to create 3-D data from spatial information by triangulation. The basic principle behind this kind of camera is that a projector projects a pattern upon the environment that is detectable by a camera. Due to variation in the environment the pattern will be distorted, and this distortion is the key to proper depth estimation. The camera can locate points in the pattern and, using a known baseline between the projector and the camera, a three-dimensional point may be triangulated. The pattern that is projected must have certain properties enabling the camera to unambiguously locate positions in the pattern, so that correct correspondences can be appointed. Figure 2.2 shows the basic setup of a projector and a camera. The Fotonic P70 uses a point-like pattern modulated three times in both axes, see figure 2.3 for a visualization of the projection pattern.

To triangulate a 3-D point with a calibrated and rectified stereo setup, consider the corresponding points p_R = (u_R, v_R) and p_L = (u_L, v_L), where p_R is the coordinates of a 3-D point P = (x, y, z) reflected and projected into the camera, and p_L is the equivalent reference coordinates of a light ray emitted from the projector. Since the setup should be a rectified stereo setup, v_R = v_L and u_R = u_L − d should hold true, where d is the difference in horizontal coordinates, called the disparity. The disparity is then inversely proportional to the depth value z through

$$ z = \frac{b f}{d}, \qquad (2.5) $$

where b is the baseline between the camera and the projector and f the focal length. The x and y coordinates can then be found with the aid of the camera's intrinsic camera matrix, see section 2.1.4 for more information.
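A minimal sketch of equation (2.5), assuming a rectified projector-camera pair; the baseline and focal length would come from the camera calibration.

```cpp
// Depth from disparity, equation (2.5): baseline b in metres, focal length f
// and disparity d in pixels. A larger disparity means the point is closer.
double depthFromDisparity(double b, double f, double d)
{
    return b * f / d;
}
```

Since d appears in the denominator, a fixed disparity error translates into a depth error that grows with distance, which is the accuracy behaviour described under the Distance Dependent heading in section 2.1.3.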

Figure 2.2: The setup of a structured light camera, where A is the projector and B the camera. The 3-D point coordinates x, y and z are projected into the image coordinates u and v. Since the pattern has a special coding, the projector knows the projection coordinates of the corresponding image coordinates.

Figure 2.3: The structured light emitted from the Microsoft Kinect projector.

2.1.3 Time of Flight Imaging Versus Structured Light Imaging

The question is then: which technology is best? There is no straight answer, since both technologies have advantages and disadvantages. They are appropriate for different situations.

Resolution and Speed

Since structured light imaging uses the ordinary CMOS sensor chips found in regular cameras, the maximal possible resolution correlates with the resolution of the chip. On modern chips this resolution is very high, which makes very high resolution depth images possible. The high resolution does however bring a side effect. Since the amount of calculation needed to compute the depth values is proportional to the resolution of the frame, structured light technology tends to be slow. The bottleneck is therefore the processing power of the camera, and resolution will have to be balanced against frame rate in a way that is not needed with a TOF camera. While TOF cameras employ relatively immature technology, it is possible to capture frames at a very high rate due to the structure of the chip, which enables independent parallel measurements in every pixel. The resolution is however significantly lower due to this structure. Since structured light imaging sensors cannot read every pixel simultaneously, the frame rate is slowed down further. Pixels are read sequentially row-wise, a so-called rolling shutter, which also produces another set of problems such as motion distortion.

A TOF camera can be constructed very compactly. Ideally, every pixel should have its own light emitter positioned as close to it as possible. This is however not feasible, since it would demand a very large and sparse sensor and emitter chip. One way to simulate the effect of having a single emitter in the center of the sensor chip is to position several emitters evenly distributed around the chip. Because of this geometry the TOF camera can be made much more compact than a structured light based camera, which needs a baseline between the projector and the sensor. The range of such a camera is directly dependent on the size of the baseline, which means that with a fixed baseline only a specific range can be accurately measured. For example, the Microsoft Kinect has a baseline of approximately 7.5 cm and an optimal operation range of 0.8 m to 6 m.

Distance Dependent

Another advantage of the TOF camera is that, in a sense, it is not distance dependent. It is able to produce an accurate depth measurement within its whole current range. Structured light, on the other hand, measures the disparity of pixels, and this measurement depends on the distance to the camera. A fixed displacement of the structured light in 3-D space appears as a larger disparity close to the camera than further away from it. The result is a much lower accuracy far away from the camera than close to it.

Reflectivity Ambiguity

One of the major problems with a TOF camera is caused by the inability to know the reflectivity of a surface. A black surface reflects significantly fewer photons than a white surface, resulting in a weaker signal and hampering the accuracy of the measurement, because the signal is then weak compared to the background light. The consequence is usually different depth estimations for bright surfaces compared to dark surfaces, even if both are at the same depth. Bouncing light waves also cause artifacts. If an emitted light wave bounces on more than one surface before reaching back to the camera, its time of flight is longer, resulting in overestimated depth values.

This effect is called the multi-path phenomenon, and there is currently no known compensation to remedy it.

Edge Artifacts

Structured light has a tendency to create artifact points at edges. If there is a discontinuity in depth, as there usually is at the edge of an object, points tend to be estimated between the object's edge and the background, creating a drape-like effect. TOF cameras are also plagued by this, but not to the extent that structured light cameras are.

2.1.4 Pinhole Camera Model

The pinhole camera model is a simple, approximate mathematical model of a camera that also applies to depth cameras. The principle of the pinhole camera is to allow light from the environment to enter only through a small aperture, a pinhole. The light is then projected upon a flat surface, creating an image. This can be described with a series of transformations converting a point in world coordinates into pixel coordinates. Figure 2.4 shows the basic setup of a pinhole camera where a point is projected into the image plane. The camera center is denoted O, with the principal axis crossing the principal plane at the point R. The 3-D point P has the coordinates (X, Y, Z) and the projected point has the image coordinates (u, v). The focal length is denoted f. Note that the image plane is somewhat rotated to avoid some negative signs later on.

Before projection is possible, the point has to be described in the camera coordinate system. This can be done with a transformation matrix T describing the camera's position in 3-D space, which contains the camera's extrinsic parameters,

$$ T = \begin{pmatrix} R & t \end{pmatrix}, \qquad (2.6) $$

where R, a 3 × 3 matrix, represents the camera rotation and t, a 3 × 1 vector, the camera translation. With the 3-D point transformed into camera coordinates it can be projected into the image plane. Looking at figure 2.4, it is clear with some trigonometry that the image coordinates u and v can be calculated with the following equations:

$$ u = f\,\frac{X}{Z} \qquad (2.7) $$

$$ v = f\,\frac{Y}{Z} \qquad (2.8) $$

The last step is to transform the image coordinates into pixel coordinates. This can be as simple as translating the coordinates so that the origin is in the top left corner.

Figure 2.4: The geometrical setup of a pinhole camera.

The projection and image coordinate transformation can be represented with a single transformation matrix, the intrinsic camera matrix K, which can be constructed as

$$ K = \begin{pmatrix} f & 0 & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad (2.9) $$

where u_0 and v_0 are the pixel coordinates of the principal point, ideally half of the image resolution. A 3-D point with coordinates (X, Y, Z) can then ultimately be transformed into pixel coordinates (u, v) through

$$ c \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = K\,T \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}, \qquad (2.10) $$

with c as a homogeneous scaling factor. Note that some additional parameters, such as radial distortion and skewing, are usually included in the model, see [6] for more details. It was however found that the simple model was adequate for this application.
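A minimal sketch of equations (2.6)-(2.10): a world point is expressed in camera coordinates, divided by its depth and offset by the principal point. The calibration values f, u0 and v0 as well as the extrinsic R and t are assumed inputs for the example, not values from the thesis.

```cpp
#include <opencv2/core.hpp>

// Project a 3-D world point into pixel coordinates with the pinhole model:
// world -> camera coordinates via [R | t] (2.6), perspective division (2.7)-(2.8),
// and the principal point offset from the intrinsic matrix K (2.9)-(2.10).
cv::Point2d projectPoint(const cv::Matx33d& R, const cv::Vec3d& t,
                         double f, double u0, double v0, const cv::Vec3d& P)
{
    cv::Vec3d pc = R * P + t;              // point expressed in camera coordinates
    double u = f * pc[0] / pc[2] + u0;     // equation (2.7) plus principal point
    double v = f * pc[1] / pc[2] + v0;     // equation (2.8) plus principal point
    return cv::Point2d(u, v);
}
```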

2.1.5 Occlusion

It is important to consider the effect of occlusion. The 3-D data acquired from a 3-D camera will not be complete and viewable from every angle; surfaces not seen from the camera will not be present in the 3-D data. It is therefore crucial to find a good angle when recording the data, ensuring that the relevant surfaces will be included. When regarding the 3-D camera as a single unit, there are three types of occlusion one has to consider. The first type will be called object occlusion. This occurs when an object is blocking the view of another object, see figure 2.5a. It is usually not so impeding, because object occlusion normally only affects the background. The second type of occlusion will be called far side occlusion; it affects surfaces on the opposite side of the object from the camera's point of view, see figure 2.5b. The last type of occlusion is self occlusion, and it is the type of occlusion that has to be considered the most, because it usually occurs on the surface that is studied. It is also very dependent on the viewing angle, which is possible to modify, see figure 2.5c. It is therefore crucial to find a good angle when recording the data.

Figure 2.5: Three different types of occlusion. The points on the floor are occluded behind the box in (a). The backside of the sofa in (b) is not visible from the camera's point of view. Local variations in the basket in (c) obstruct proper 3-D point generation. Note that all three types are often present in the same point cloud. (a) Object occlusion. (b) Far side occlusion. (c) Self occlusion.

Note that these three kinds of occlusion to some extent represent the same problem, but they are considered separately here. Also note that these types of occlusion are not the types of occlusion you normally hear about when dealing with 3-D sensors. Usually two types of occlusion called camera and light occlusion are mentioned. Camera occlusion would, in the case of structured light, be the parts where surfaces are blocking the camera's view of the structured light. On the contrary, light occlusion would occur on the surfaces the camera is able to see but the light cannot reach. These terms are however more technical and should mainly be regarded when studying or developing the actual sensor, not when operating it, since as an operator you cannot control the effects of these occlusions without modifying the sensor. The time-of-flight technology does not even have equivalent occlusion effects.

2.2 Object Classification

Classification is the act of differentiating classes from each other. There are essentially four steps [7] that should be taken into consideration when attempting classification:

1. Produce training and validation sets - The real reason for creating a classification system is to be able to automatically classify unknown samples. It is therefore reasonable to have two separate sets of data, one used for training and one used for validation. The validation set should then simulate a set of unknown data that is to be classified. The training set should be able to represent all the possible outcomes with a wide variety; the system might not be able to recognize a set of validation data if no similar data occurred in the training data. By providing good validation data, overfitting can be avoided, i.e. preventing the system from trying to describe only the training data rather than the generalized structure.

2. Extract features - The data used for classification commonly contains a lot of information, usually too much, where the majority might be irrelevant for differentiating the classes. An example could be that you are about to classify a set of cars according to their brand. Here a lot of information will not aid you in figuring out the brand, e.g. all cars have four wheels, one engine, headlights and taillights, etc. Carefully selected features should instead be examined, features which are more or less unique for each brand. In the world of cars the easiest way to find out the brand is to look at the logo, a small feature that with utmost accuracy will be able to distinguish the car. The data should therefore be reduced to a smaller set of independent features capable of separating the classes. Such a set of features can then be called a descriptor, as it describes the class, or more specifically, the sample.

3. Create and train a classifier - The descriptors are usually paired with a corresponding class label, a key identifying the class. A classifier can then take a set of training descriptors with their corresponding labels and train itself to recognize and appoint labels to new descriptors. How this is done depends on the algorithm used. Note that different algorithms work well with different kinds of descriptors; there is seldom one perfect descriptor and one perfect classifier. Most classifier algorithms can be changed and tweaked with a number of parameters, changing their behavior. The algorithms mentioned in section 2.5 are all different classifiers.

4. Evaluate the classifier - The final step is to evaluate the system. This is done by inputting the training data and comparing the class label the system outputs with the true class label. If the system is not able to assign the correct class labels for a majority of the training data, it is not very good. The remedy could be to tweak the parameters of the classifier, change the classifier algorithm, or find a better descriptor.

2.3 Morphometrics

Morphology is, in the field of biology, defined as the study of shape and dimensions. Morphometrics is a sub-field of morphology, defined as the quantification and comparison of shape [8], [9]. Traditional morphometrics may examine metric measurements, angles, masses etc. These measurements are however usually correlated, and the ratio of the height and width of an object usually stays the same as it grows. The result is a small number of independent variables despite a large number of measurements. By making the variables independent, the scale information is often removed. However, if the aim of the analysis is to find absolute differences between subjects, the scale information can still be useful. For example, Yakubu, A., et al. [10] use traditional morphometrics to distinguish between two different species of fish, namely Oreochromis niloticus and Lates niloticus, by analyzing seven morphometric measurements (body weight, standard length, total length, head length, body depth, dorsal fin length and caudal fin length). Traditional morphometrics does however only deal with outline measurements and not the internal shape variations occurring across a specimen. Morphological landmarks are then an extension that enables data to be gathered over the whole specimen, expanding the spatial information available to describe the specimen, i.e. the features used to form a descriptor as mentioned in the previous section.

2.4 Morphological Landmarks

Morphological landmarks are distinguishable features on an object that can be unambiguously detected on all specimens [11], i.e. they are said to be homologous. According to I. L. Dryden and Kanti V. Mardia [12] there are essentially three different classes of landmarks: anatomical landmarks, mathematical landmarks and pseudo-landmarks. An anatomical landmark is a landmark with anatomical correspondence between subjects. It is usually assigned by an expert and has to hold a meaning in the current context. A mathematical landmark holds certain mathematical properties, for example a local maximum or minimum. The pseudo-landmarks do not have any distinguishable attributes in themselves; they exist between anatomical or mathematical landmarks to enrich the amount of measurements. These samples are then usually bundled together to form an actual descriptor.

2.5 Face Recognition

The research field of face recognition is a well established and quickly evolving field with a wide variety of algorithms available. Recognition based purely on biometric data extracted from the face would be a step closer to the human's superior ability to recognize faces. The main incentive is from a security and surveillance viewpoint. Systems today can quite easily be bypassed by forging or acquiring the data necessary for access. A standard system is the use of an ID card together with a pass code.

None of these are impossible to come by, and you can therefore gain illegitimate access to areas or information. Biometric data is however much harder to forge. A common biometric measurement is a fingerprint. Acquiring a fingerprint does however require that the subject's finger is placed upon a device for scanning, a task which is time consuming and might not always be possible. Scanning a face from a distance is then a sound alternative. It is achievable on many subjects at the same time and it is completely non-intrusive. Note that the face is often used because it contains a lot of information, but any area which might provide enough information is applicable. This suggests that the field of face recognition has a lot to offer for our problem, and well established methods should be examined and exploited if possible.

One of the first algorithms to really break through was the method of eigenfaces, developed by L. Sirovich and M. Kirby [13] and used by Matthew Turk and Alex Pentland [14] for face classification. It tries to maximize the variance between the faces by creating a set of basis vectors from an eigendecomposition, as shown in section 3.6.1, hence the name eigenfaces. Note that it is called eigenfaces due to its primary application to images of faces, but any image can be used. Even though this implementation has since been outperformed many times, it still forms the basis for many modern algorithms. It is also frequently used as a baseline method when comparing the performance of other systems. Eigenfaces is essentially a way to represent faces and may not be such a good way to classify them. Belhumeur, Hespanha and Kriegman recognized this and developed a method called fisherfaces [15] that greatly outperformed eigenfaces in classifying faces. Fisherfaces performs a linear discriminant analysis (LDA), invented by Sir R. A. Fisher, who successfully used it to classify flowers [16]. LDA tries to cluster the same classes together by maximizing the ratio of external and internal class differences.

Both eigenfaces and fisherfaces suffer from the necessity of requiring a lot of training samples acquired under different conditions to be able to accurately recognize faces in somewhat uncontrolled conditions. They are also holistic methods that use every pixel when processing the data. This requires a near perfect alignment of each face, which is usually only possible in controlled conditions. Many different variations have been developed to try to deal with these drawbacks, and some have succeeded better than others. Hu Han et al. [17] review several different techniques of illumination preprocessing. The aim of these techniques is to suppress the variations caused by different lighting conditions in each frame. However, with the use of 3-D data the illumination problem would be completely eliminated. This once again implies that having accurate 3-D data is a great advantage, as evident in a survey by Andrea F. Abate et al. [18] comparing face recognition methods based on 2-D and 3-D imaging. A 2-D method that tries to deal with the illumination variation is Local Binary Pattern Histograms [19]. The basic principle of this algorithm is to evaluate the relative local structure around pixels, enabling a robust descriptor of a face. The three algorithms mentioned in this section will be further explained in section 3.6.

3 Algorithms for Image Processing

There exist many algorithms in the world of image processing, and in this chapter the ones used will be explained. Not all of the following algorithms are explicitly image processing algorithms, but all of them can be used for image processing and are therefore included in this chapter.

3.1 Flood Fill

Flood fill is an image processing tool used for finding connected pixels. It is very good at segmenting a region from the rest of an image, if the region has proper edges. A seed pixel is first selected inside the region; the algorithm then grows from this pixel, evaluating whether neighboring pixels are connected. A pixel is connected if

$$ Z(x') - \Delta_- \le Z(x) \le Z(x') + \Delta_+, \qquad (3.1) $$

where Z(x) is the current image value for a pixel with image coordinates x, which might be connected to the seed pixel. This is done for all neighboring seed pixels, which have the coordinates x'. The two thresholds Δ− and Δ+ are selected manually depending on the scale and variance of the data. Figure 3.1 shows a simple example of the flood fill algorithm. The seed pixels can only expand to pixels connected through 4-connectivity in this illustration, which means that only horizontally and vertically neighboring pixels are considered as neighborhood pixels.
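As a sketch of how this can be realized with OpenCV (which the implementation in chapter 4 builds on), the function below segments a region of a height map with cv::floodFill; the loDiff and upDiff arguments play the role of the thresholds Δ− and Δ+ in equation (3.1), and the seed point is assumed to lie inside the region of interest.

```cpp
#include <opencv2/imgproc.hpp>

// Threshold-based flood fill on a single-channel height map. Returns a mask
// where the pixels connected to the seed (in the sense of equation (3.1))
// are set to 255.
cv::Mat segmentRegion(const cv::Mat& heightMap, cv::Point seed,
                      double loDiff, double upDiff)
{
    cv::Mat img = heightMap.clone();   // keep the caller's height map untouched
    cv::Mat mask = cv::Mat::zeros(img.rows + 2, img.cols + 2, CV_8UC1);
    int flags = 4                          // 4-connectivity, as in figure 3.1
              | cv::FLOODFILL_MASK_ONLY    // only mark the mask, keep the image
              | (255 << 8);                // value written into the mask
    cv::floodFill(img, mask, seed, cv::Scalar(), nullptr,
                  cv::Scalar(loDiff), cv::Scalar(upDiff), flags);
    return mask(cv::Rect(1, 1, img.cols, img.rows)).clone();
}
```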

Figure 3.1: The flood fill algorithm searches from the current pixel (green) for possible connected pixels in its neighborhood. With Δ− = Δ+ = 1, the pixel with value 4 is connected; the pixels with values 3 and 7 are not, since the difference is too large.

3.2 Principal Component Analysis

Principal component analysis is one of the most commonly used tools within data analysis. Its purpose is to transform a set of data to make it linearly uncorrelated. The transformation is defined by a number of components, called principal components. The first principal component indicates the direction of maximal variance of the data, and each following component indicates the maximal possible variance while fulfilling orthogonality. Mathematically, the principal components are calculated from the covariance matrix. The covariance matrix C contains the variances of all the components of a vector. It can be estimated from realizations of a random vector X and its mean µ as

$$ C = E\{(X - \mu)(X - \mu)^T\}. \qquad (3.2) $$

The eigenvectors of the covariance matrix then correspond to the principal components, and the corresponding eigenvalues are proportional to the variance along these vectors.

Singular Value Decomposition

With all realizations of X put column-wise into a matrix

$$ Y = \{X_1 - \mu,\; X_2 - \mu,\; \ldots,\; X_N - \mu\}, \qquad (3.3) $$

the covariance matrix is C = Y Y^T. The principal components can however be calculated by doing a singular value decomposition of the matrix Y directly. The matrix Y is then decomposed as

$$ Y = U S V^T, \qquad (3.4) $$

where the columns of U are the left hand singular vectors of Y and correspond to the eigenvectors of Y Y^T, i.e. the principal components. The columns of V are the right hand singular vectors and correspond to the eigenvectors of Y^T Y. The matrix S contains the corresponding singular values on its diagonal, representing the square roots of the non-zero eigenvalues. If Y is an M × N matrix, then U is a unitary M × M matrix, V is N × N and S is an M × N rectangular diagonal matrix.
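A minimal sketch of this SVD route to the principal components with OpenCV's cv::SVD; the data layout (one observation per column, as in equation (3.3)) and a floating point sample type are assumptions of the example.

```cpp
#include <opencv2/core.hpp>

// Compute the principal components of a set of samples stored one per column
// (type CV_32F or CV_64F). The mean is removed to form Y (3.3) and the left
// singular vectors of Y (3.4) are returned as columns of `components`.
void principalComponents(const cv::Mat& samples, cv::Mat& mean, cv::Mat& components)
{
    cv::reduce(samples, mean, 1, cv::REDUCE_AVG);       // mean over all columns
    cv::Mat Y;
    cv::repeat(mean, 1, samples.cols, Y);
    Y = samples - Y;                                    // remove the mean
    cv::SVD svd(Y, cv::SVD::MODIFY_A);                  // Y = U S V^T
    components = svd.u;                                 // principal components
}
```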

3.3 Homography

A homography is a projective transformation with eight degrees of freedom, which enables it to perform all affine transformations, i.e. translation, rotation, scaling and skewing. Together those require six degrees of freedom, and the last two enable a homography to change the perspective of a set of points on a plane, i.e. points in an image. See figure 3.2 for an example of a homography transforming four points into a square, which can be seen as changing the perspective of a rectangular plane from a slight angle to an angle perfectly from above.

Figure 3.2: A homography is capable of transforming the set of points X to the set of points X'.

A homography matrix can be calculated to transform the vector of 2-D homogeneous point coordinates X into another set of point coordinates x, fulfilling

$$ x = HX. \qquad (3.5) $$

The homography matrix H can be estimated using the homogeneous estimation method.

By reshaping

$$ H = \begin{pmatrix} H_{11} & H_{12} & H_{13} \\ H_{21} & H_{22} & H_{23} \\ H_{31} & H_{32} & H_{33} \end{pmatrix} \qquad (3.6) $$

into a vector h according to

$$ h = (H_{11}, H_{12}, H_{13}, H_{21}, H_{22}, H_{23}, H_{31}, H_{32}, H_{33})^T, \qquad (3.7) $$

the homogeneous estimation method finds the solution to

$$ Ah = 0 \qquad (3.8) $$

for h, where the matrix A is constructed from the point correspondences as

$$ A = \begin{pmatrix}
x_1 & y_1 & 1 & 0 & 0 & 0 & -x_1 X_1 & -y_1 X_1 & -X_1 \\
0 & 0 & 0 & x_1 & y_1 & 1 & -x_1 Y_1 & -y_1 Y_1 & -Y_1 \\
x_2 & y_2 & 1 & 0 & 0 & 0 & -x_2 X_2 & -y_2 X_2 & -X_2 \\
0 & 0 & 0 & x_2 & y_2 & 1 & -x_2 Y_2 & -y_2 Y_2 & -Y_2 \\
\vdots & & & & & & & & \vdots \\
x_n & y_n & 1 & 0 & 0 & 0 & -x_n X_n & -y_n X_n & -X_n \\
0 & 0 & 0 & x_n & y_n & 1 & -x_n Y_n & -y_n Y_n & -Y_n
\end{pmatrix}. \qquad (3.9) $$

The vector h is then the vector which minimizes ‖Ah‖, subject to ‖h‖ = 1, and is given as the eigenvector corresponding to the smallest eigenvalue of A^T A, which is equal to the right hand singular vector obtained from a singular value decomposition of A, see section 3.2. This singular value decomposition always exists for any matrix, as opposed to an eigenvalue decomposition.
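A sketch of this estimation with OpenCV's cv::SVD: the matrix A of equation (3.9) is built from correspondences (x_i, y_i) -> (X_i, Y_i), and the right singular vector belonging to the smallest singular value is reshaped into H. The naming and correspondence convention are assumptions for the example; in practice the points would first be Hartley-normalized as described in the next section.

```cpp
#include <opencv2/core.hpp>
#include <vector>

// Direct estimation of a homography from point correspondences using the
// homogeneous estimation method: build A (3.9), solve Ah = 0 (3.8) with an SVD
// and reshape h back into the 3x3 matrix H (3.6).
cv::Matx33d estimateHomography(const std::vector<cv::Point2d>& p,   // (x_i, y_i)
                               const std::vector<cv::Point2d>& P)   // (X_i, Y_i)
{
    cv::Mat A(2 * (int)p.size(), 9, CV_64F);
    for (size_t i = 0; i < p.size(); ++i) {
        double x = p[i].x, y = p[i].y, X = P[i].x, Y = P[i].y;
        double r1[9] = { x, y, 1, 0, 0, 0, -x * X, -y * X, -X };
        double r2[9] = { 0, 0, 0, x, y, 1, -x * Y, -y * Y, -Y };
        cv::Mat(1, 9, CV_64F, r1).copyTo(A.row(2 * (int)i));
        cv::Mat(1, 9, CV_64F, r2).copyTo(A.row(2 * (int)i + 1));
    }
    cv::SVD svd(A, cv::SVD::FULL_UV);
    const double* h = svd.vt.ptr<double>(8);   // singular values sorted, last = smallest
    return cv::Matx33d(h[0], h[1], h[2],
                       h[3], h[4], h[5],
                       h[6], h[7], h[8]);
}
```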

3.4 Hartley Normalization

There might be a problem when doing a singular value decomposition of a matrix to determine a homography. If the matrix is ill-conditioned, the resulting singular values might not have appropriate values, namely one zero value and the rest non-zero. The problem with this is that the solution might not be accurate enough to be useful. The cause of this problem is that the homogeneous coordinates have a bad distribution. The remedy is to transform the points into a coordinate system where they have an optimal distribution; this transformation is called a Hartley normalization [20]. The points' mean distance to the origin should be equal to √2 and their mean position should be at the origin. The mean position m can trivially be calculated as

$$ m = \frac{1}{n} \sum_{i=1}^{n} x_i. \qquad (3.10) $$

The mean distance to the origin s can be calculated with

$$ s = \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - m \rVert. \qquad (3.11) $$

The normalized points x̃ are then transformed from x as

$$ \tilde{x} = T^{-1} x, \qquad (3.12) $$

where the Hartley normalization matrix T is constructed with the values from equations 3.10 and 3.11 as

$$ T = \begin{pmatrix} \frac{s}{\sqrt{2}} & 0 & m_x \\ 0 & \frac{s}{\sqrt{2}} & m_y \\ 0 & 0 & 1 \end{pmatrix}. \qquad (3.13) $$

With the two sets of corresponding points x and X used to estimate a homography, two normalization matrices T_x and T_X are constructed, respectively. With equation 3.12, these sets of points are normalized, resulting in two sets of normalized points x̃ and X̃. A homography matrix H̃ can be calculated with these points according to section 3.3. The homography matrix H, describing the homography between the points X and x, is then calculated as

$$ H = T_X \tilde{H} T_x^{-1}. \qquad (3.14) $$

This solution is more stable compared to calculating H directly from x and X.
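A small sketch of constructing T from equations (3.10), (3.11) and (3.13); applying T.inv() to the homogeneous points then performs the normalization of equation (3.12). Function and variable names are choices made for the example.

```cpp
#include <opencv2/core.hpp>
#include <vector>
#include <cmath>

// Build the Hartley normalization matrix T (3.13) for a set of 2-D points:
// m is the mean position (3.10) and s the mean distance to the mean (3.11).
cv::Matx33d hartleyMatrix(const std::vector<cv::Point2d>& pts)
{
    cv::Point2d m(0.0, 0.0);
    for (const auto& p : pts) m += p;
    m.x /= pts.size();                              // mean position, equation (3.10)
    m.y /= pts.size();

    double s = 0.0;
    for (const auto& p : pts) {
        double dx = p.x - m.x, dy = p.y - m.y;
        s += std::sqrt(dx * dx + dy * dy);
    }
    s /= pts.size();                                // mean distance, equation (3.11)

    double k = s / std::sqrt(2.0);
    return cv::Matx33d(k, 0, m.x,
                       0, k, m.y,
                       0, 0, 1);                    // normalize with T.inv(), (3.12)
}
```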

3.5 3-D Plane Fitting

Fitting a plane to data can be done to see if the data is linearly dependent. Used with 3-D data, planar surfaces can be localized, which is useful for example when searching for floors or walls. When fitting a plane to a set of points you try to minimize some kind of error; different methods minimize different measurements. The linear regression method minimizes the distance of all the points to the plane along the z-axis. Modifying the algorithm so that the shortest distance, i.e. the perpendicular distance, to the plane is minimized is called orthogonal regression. The implicit equation for a 3-D plane is

$$ ax + by + cz + d = 0, \qquad (3.15) $$

where (x, y, z) are the coordinates of a point on the plane. With p = (a, b, c, d)^T defining the plane coefficients and x = (x, y, z, 1)^T a 3-D point expressed in homogeneous coordinates, the perpendicular distance can be defined by rewriting equation 3.15 as

$$ D = x \cdot p. \qquad (3.16) $$

Minimizing this equation for several points is equal to finding the right hand singular vector corresponding to the smallest singular value of the matrix A, i.e. finding the p minimizing ‖Ap‖, where

$$ A = \begin{pmatrix} x_1 & y_1 & z_1 & 1 \\ x_2 & y_2 & z_2 & 1 \\ \vdots & & & \vdots \\ x_n & y_n & z_n & 1 \end{pmatrix}. \qquad (3.17) $$

The right hand singular vectors of a matrix can be found with a singular value decomposition. Note that the mean of the points should be removed from the points, so that the condition ‖p‖ = 1 can properly be added to prevent the trivial solution p = 0 from being chosen. Removing the mean of the points results in d = 0 for equation 3.15, i.e. the plane passes through the origin. To have a plane with the correct distance from the origin, d can be set to the orthogonal distance from the plane to the mean point, i.e. d = m · n, where n = (a, b, c)^T.

3.6 Face Recognition Methods

Three different face recognition methods were mentioned in section 2.5, and here the mathematical theory behind them will be explained.

3.6.1 Eigenfaces

As mentioned in section 2.5, Eigenfaces is considered to be the first functional face recognition method. It employs principal component analysis to capture the variation in a set of images. The result is a basis of Eigenfaces representing the set of images in an effective way. Due to the often large sizes of the images, a trick is exploited to be able to calculate this basis. To calculate the Eigenfaces you start by arranging the N images in vectors, from the top left corner and then row-wise, resulting in a set of vectors

$$ \{I_1, I_2, \ldots, I_N\}. \qquad (3.18) $$

Calculate the mean vector Π, which when reshaped can be seen as the mean image. Then concatenate all the vectors, minus the mean, into a large matrix A such that

$$ A = \{I_1 - \Pi,\; I_2 - \Pi,\; \ldots,\; I_N - \Pi\}. \qquad (3.19) $$

An image vector has as many components as the image has pixels, so with N images the matrix A has that many rows and N columns. Since PCA of the matrix A employs the eigenvectors of AA^T, which is now a very large matrix, the cost of calculating its eigenvectors is unreasonable. To circumvent this, the N × N inner product matrix A^T A is used instead. The eigenvectors υ_i and eigenvalues λ_i of AA^T are what was originally needed, defined as

$$ AA^T \upsilon_i = \lambda_i \upsilon_i. \qquad (3.20) $$

But now with A^T A we get the following eigendecomposition:

$$ A^T A \omega_i = \mu_i \omega_i. \qquad (3.21) $$

However, by multiplying with A from the left we get

$$ AA^T (A\omega_i) = \mu_i (A\omega_i), \qquad (3.22) $$

which implies that υ_i = Aω_i / ‖Aω_i‖, since the norm of the eigenvectors is equal to one. It can actually be shown that υ_i = λ_i^{-0.5} Aω_i for non-zero eigenvalues, but since the vectors are normalized it isn't relevant.
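A minimal sketch of this trick with OpenCV matrices: the eigenvectors ω_i of the small N × N matrix A^T A are computed and mapped back to normalized eigenfaces υ_i = Aω_i / ‖Aω_i‖. The layout (one mean-subtracted image vector per column of A, floating point type) is an assumption of the example.

```cpp
#include <opencv2/core.hpp>

// Compute the eigenfaces from the mean-subtracted image matrix A (one image
// vector per column) via the small inner-product matrix A^T A, as in
// equations (3.20)-(3.22). The eigenfaces are returned one per column.
cv::Mat eigenfacesFromA(const cv::Mat& A)
{
    cv::Mat ATA = A.t() * A;                 // N x N instead of pixels x pixels
    cv::Mat eigenvalues, omega;
    cv::eigen(ATA, eigenvalues, omega);      // rows of omega are the eigenvectors w_i
    cv::Mat faces(A.rows, A.cols, A.type());
    for (int i = 0; i < omega.rows; ++i) {
        cv::Mat v = A * omega.row(i).t();    // upsilon_i = A * w_i (unnormalized)
        cv::normalize(v, v);                 // scale to unit norm
        v.copyTo(faces.col(i));
    }
    return faces;
}
```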

The drawback of using A^T A is that a maximum of N Eigenfaces can be calculated, but that is usually enough. With a large number of images, only the eigenvectors corresponding to the largest eigenvalues are kept; the rest do not contribute enough variance information and are consequently discarded.

The classification works by finding the closest training samples in the newly created Eigenface subspace. With a constructed basis, where the eigenvectors are the basis vectors, an image can be projected into this subspace, resulting in a set of coordinates. Beginning by removing the mean Π previously calculated, the coordinates can then be calculated as

$$ c = V (I - \Pi), \qquad (3.23) $$

where V is a matrix containing the used eigenvectors υ_i as rows. Creating and storing a set of coordinates {c_1, c_2, ..., c_N} for all training images enables a fast comparison to new images. By finding the shortest distance from the current coordinates c to one of the training coordinates c_i, a corresponding class can be found.

3.6.2 Fisherfaces

The principle behind Fisherfaces is linear discriminant analysis. The result of the LDA is a subspace spanned by a set of vectors called Fisherfaces. The goal of Fisherfaces is to find a set of basis vectors where the internal class distances are minimized while the external class differences are maximized. The internal class distances are represented by a scatter matrix S_I, calculated as

$$ S_I = \sum_{c \in C} \sum_{I_k \in c} (I_k - \Pi_c)(I_k - \Pi_c)^T, \qquad (3.24) $$

where C is the set of classes and Π_c is the mean of the images in class c. The external class differences are represented by the scatter matrix S_E, calculated as

$$ S_E = \sum_{c \in C} N_c (\Pi_c - \Pi)(\Pi_c - \Pi)^T, \qquad (3.25) $$

with N_c as the number of samples in class c and Π as the total mean of all samples. To minimize S_I and maximize S_E, the optimal basis vectors contained in the matrix V_opt should fulfill

$$ V_{opt} = \arg\max_{V} \frac{V^T S_E V}{V^T S_I V}. \qquad (3.26) $$

The resulting basis vectors are the set of generalized eigenvectors of S_E and S_I corresponding to the largest eigenvalues, i.e. the basis is given by

$$ S_E V_{opt} = S_I V_{opt} \Lambda, \qquad (3.27) $$

where Λ is the diagonal matrix containing the corresponding eigenvalues. Note that there are some technicalities that need to be solved when S_I is singular, read [15] for details.
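A small sketch of how the two scatter matrices of equations (3.24) and (3.25) can be accumulated with OpenCV matrices; the samples are assumed to be stored one per row, of floating point type, and already projected to a lower-dimensional space (e.g. by PCA) so that S_I is not hopelessly singular.

```cpp
#include <opencv2/core.hpp>
#include <map>
#include <vector>

// Accumulate the internal scatter matrix S_I (3.24) and the external scatter
// matrix S_E (3.25) from row-vector samples and their integer class labels.
void scatterMatrices(const cv::Mat& samples, const std::vector<int>& labels,
                     cv::Mat& S_I, cv::Mat& S_E)
{
    const int dim = samples.cols;
    cv::Mat total;
    cv::reduce(samples, total, 0, cv::REDUCE_AVG);            // total mean Pi
    S_I = cv::Mat::zeros(dim, dim, samples.type());
    S_E = cv::Mat::zeros(dim, dim, samples.type());

    std::map<int, std::vector<int>> byClass;                  // sample indices per class
    for (int i = 0; i < samples.rows; ++i) byClass[labels[i]].push_back(i);

    for (const auto& cls : byClass) {
        cv::Mat mean_c = cv::Mat::zeros(1, dim, samples.type());
        for (int idx : cls.second) mean_c += samples.row(idx);
        mean_c /= (double)cls.second.size();                  // class mean Pi_c

        for (int idx : cls.second) {                          // internal scatter (3.24)
            cv::Mat d = samples.row(idx) - mean_c;
            S_I += d.t() * d;
        }
        cv::Mat e = mean_c - total;
        cv::Mat outer = e.t() * e;                            // external scatter (3.25)
        S_E += outer * (double)cls.second.size();
    }
}
```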

The resulting eigenvectors can, as in the case of Eigenfaces, be reshaped into images, called Fisherfaces. The actual class prediction is similar to how Eigenfaces classifies: a sample is projected into the created subspace and the closest class is then chosen.

3.6.3 Local Binary Pattern Histograms

The principle behind the local binary pattern histograms algorithm is to encode each pixel with a binary number. This number tries to describe the local structure around the pixel by evaluating whether the pixel is larger or smaller than its neighbors. Figure 3.3 shows a small example of how to encode a pixel using a basic 3 × 3 local binary pattern.

Figure 3.3: Example of a pixel (red circle) being encoded as a local binary pattern. The thresholds are put in series starting from the top left, going around clockwise, resulting in a binary string called the local binary pattern.

There are some additional extensions applied to extract the patterns for the whole image, which you can read about in [19]. The final step to actually create a full descriptor is to divide the image of local binary patterns into pieces and evaluate a histogram for each part. These histograms are then concatenated to create the local binary pattern histogram. Prediction of a class works by comparing the histogram of the current sample with the histograms of all the training samples, choosing the one with the smallest Chi-Square distance

$$ d = \sum_{I} \frac{(H_1(I) - H_2(I))^2}{H_1(I)}, \qquad (3.28) $$

where H_1 is the local binary pattern histogram of the sample that is to be predicted, and H_2 is the local binary pattern histogram of one of the training samples.
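A sketch of the basic 3 × 3 encoding of figure 3.3: each neighbour is thresholded against the centre pixel and the eight results are concatenated clockwise from the top-left neighbour into one byte. The image is assumed to be 8-bit single-channel and the caller is assumed to stay one pixel away from the border.

```cpp
#include <opencv2/core.hpp>

// Encode one pixel of an 8-bit single-channel image as a basic 3x3
// local binary pattern.
unsigned char lbpCode(const cv::Mat& img, int r, int c)
{
    static const int dr[8] = { -1, -1, -1,  0,  1,  1,  1,  0 };   // clockwise from
    static const int dc[8] = { -1,  0,  1,  1,  1,  0, -1, -1 };   // the top left
    const unsigned char centre = img.at<unsigned char>(r, c);
    unsigned char code = 0;
    for (int i = 0; i < 8; ++i) {
        bool bit = img.at<unsigned char>(r + dr[i], c + dc[i]) >= centre;
        code = (unsigned char)((code << 1) | (bit ? 1 : 0));
    }
    return code;
}
```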

3.7 Neural Network

An artificial neural network has the possibility of solving complex nonlinear problems. The network employs a system of nodes, called neurons. Each neuron implements a simple mathematical function, and together they are able to find complex relationships between input and output. The network is structured as several layers in series. At least one input layer and one output layer are needed to form a proper network; in addition to those, an arbitrary number of so-called hidden layers can be added in between. Each layer contains a number of nodes that communicate with the nodes in the next layer through weighted connections. By changing these weights the network may be trained to a set of training data, see figure 3.4 for a simple setup with one hidden layer. As mentioned, the number of hidden layers and nodes can be chosen arbitrarily. Each node contains an activation function which takes the sum of the weighted outputs of the previous nodes as input. The activation function is usually a sigmoid function, i.e.

$$ f(x) = \beta\,\frac{1 - e^{-\alpha x}}{1 + e^{-\alpha x}}, \qquad (3.29) $$

where α and β are usually specified according to the range of the input.

Figure 3.4: A basic neural network with two input nodes in the input layer (a) with the inputs x_1 and x_2. The output layer (c) consists of two nodes with the outputs y_1 and y_2. The network has one hidden layer (b) which contains four hidden nodes. It is the weights w and w' between the layers that are optimized, granting the network its "learning" abilities.

There are different algorithms to train a network; the classical algorithm is called random sequential back-propagation. It works by sending input data with a known output through the system, then letting the error propagate back through the network while adjusting the weights accordingly through a simple gradient descent [21].
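As a small illustration, the function below evaluates one neuron: the weighted sum of the previous layer's outputs is passed through the sigmoid of equation (3.29). The function name and the choice of α and β are left to the user, as stated above.

```cpp
#include <cmath>
#include <vector>

// Output of a single neuron: weighted sum of the inputs followed by the
// symmetric sigmoid activation of equation (3.29).
double neuronOutput(const std::vector<double>& inputs,
                    const std::vector<double>& weights,
                    double alpha, double beta)
{
    double sum = 0.0;
    for (size_t i = 0; i < inputs.size(); ++i)
        sum += weights[i] * inputs[i];
    double e = std::exp(-alpha * sum);
    return beta * (1.0 - e) / (1.0 + e);
}
```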

3.8 B-splines

Splines can, through a set of control points, interpolate continuous data. With only a few control points, splines can represent complex shapes or data structures. B-splines are splines that do not go through their control points, but are merely supported by them. A B-spline is both C1 and C2 continuous, meaning that its first and second derivatives are continuous. A number of control points, or coefficients, P_k define the B-spline and construct it as

$$ X(t) = \sum_{k=0}^{N} P_k B_k(t), \qquad (3.30) $$

where the blending functions B(t) are piecewise defined for different t. The first blending function B_0(t) is defined over an interval of length four as

$$ B_0(t) = \frac{1}{6}
\begin{cases}
(t + 1)^3, & -1 \le t < 0 \\
-3t^3 + 3t^2 + 3t + 1, & 0 \le t < 1 \\
3t^3 - 15t^2 + 21t - 5, & 1 \le t < 2 \\
(3 - t)^3, & 2 \le t < 3
\end{cases} \qquad (3.31) $$

The following blending functions are then defined in the same way as equation 3.31, but with a shifted t interval. This means that each segment of the B-spline is the sum of four different blending functions, see figure 3.5 for an illustration. A segment of the B-spline can then be defined on the interval [0, 1) with equation 3.32, expressing the blending functions in matrix form:

$$ X(t) = \frac{1}{6}
\begin{pmatrix} P_{k-1} & P_k & P_{k+1} & P_{k+2} \end{pmatrix}
\begin{pmatrix} -1 & 3 & -3 & 1 \\ 3 & -6 & 0 & 4 \\ -3 & 3 & 3 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}
\begin{pmatrix} t^3 \\ t^2 \\ t \\ 1 \end{pmatrix} \qquad (3.32) $$

Figure 3.5: A B-spline function (red) created from six blending functions (blue) with coefficients P_k = [2, 5, 1, 3, 2, 5]. Note that this function is only defined for 0 ≤ t < 7, and the additional dashed functions are helping functions used to fulfill the boundary conditions X(0) = P_0 = 2 and X(7) = P_6 = 5.
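A small sketch of evaluating one segment with the matrix form of equation (3.32); the four control points are the ones supporting the segment, and t is assumed to lie in [0, 1).

```cpp
#include <opencv2/core.hpp>

// Evaluate one segment of a uniform cubic B-spline, equation (3.32),
// from the four control points that support it, for t in [0, 1).
double bsplineSegment(double pm1, double p0, double p1, double p2, double t)
{
    const cv::Matx44d M(-1,  3, -3, 1,
                         3, -6,  0, 4,
                        -3,  3,  3, 1,
                         1,  0,  0, 0);
    const cv::Vec4d P(pm1, p0, p1, p2);            // P_{k-1}, P_k, P_{k+1}, P_{k+2}
    const cv::Vec4d T(t * t * t, t * t, t, 1.0);   // powers of t
    return (1.0 / 6.0) * P.dot(M * T);             // X(t)
}
```

At t = 0 this reduces to (P_{k-1} + 4 P_k + P_{k+1}) / 6, which is the value the blending functions in equation (3.31) give at a knot.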

4 Implementation

The whole program was written in C++. Some of the algorithms and principles were realized using existing functions from the open source libraries OpenCV and the Point Cloud Library. The sections below are in more or less chronological order.

4.1 Overview

The system uses data from a 3-D camera as input, then processes it using different methods to extract a robust descriptor. This descriptor is then used to train classifiers or to predict the corresponding class of each input. More precisely, the amount of data from the raw sensor output is first reduced by choosing the appropriate data to process further. That single frame is then filtered to extract the relevant data, and a rectification process is used to normalize the data in the sense of making it more or less invariant to all affine transformations, i.e. rotation, translation and scale. Distinct points are marked as reference points for the descriptor extraction. Here is a short overview of the steps performed:

1. Acquire the relevant data from the sensor
2. Filter the data by extracting the foreground
3. Get and rectify a height map from the foreground data
4. Find and remove ridge portions of the height map
5. Localize specific protrusions
6. Create a descriptor
7. Train and/or validate classifiers

4.2 Data Acquisition

The camera is capable of delivering data at 30 frames per second. The depth precision with the current setup is about 5 mm at the relevant distance from the camera. The camera outputs four arrays which contain the three coordinates and the RGB color, see figure 4.1 for an example of a typical output. The coordinates are stored as signed numbers represented with 16 bits, giving them a limited range of possible values. Since the depth range is limited, combined with the finite precision, only a portion of the data range is used.

Figure 4.1: A typical output of a Fotonic P70, where the x, y and z axes are displayed in the JET color space. (a) The depth (z-axis) output. (b) The color output. (c) The length (y-axis) output. (d) The width (x-axis) output. Note that the color output appears to be black and white; it is actually not. A weak RGB camera combined with a color-deprived scene results in an almost black and white image.

4.2.1 Point Cloud

The camera delivers data in three dimensions; by combining the values from each dimension, a set of points in 3-D space can be assembled. This set can then be called a point cloud. The Point Cloud Library (PCL) can be used to store and process these point clouds, which enables the use of PCL's internal functions, including a powerful visualizer. The output arrays from the camera can easily be converted into a point cloud by traversing the arrays, extracting the respective coordinates and turning them into points. By paying attention to where you are in the arrays, it is possible to convey an important attribute to the point cloud, namely giving it a constant height and width. If this attribute is preserved the point cloud is said to be organized, which is a crucial attribute if one were to attempt to convert the point cloud back into the respective 2-D arrays. An organized point cloud differs from a normal point cloud in that it knows which points belong to a point's neighborhood directly from their memory positions, contrary to a normal point cloud, which needs to search for possible neighbors in a local neighborhood defined by the Euclidean distance. If the points have been properly saved, the point cloud keeps the full resolution of the sensor, and every position is a potentially valid point. As explained in sections 2.1.1 and 2.1.2, a 3-D camera only has a limited range where the depth of a point may be accurately estimated. So, depending on the environment, not all points may have a valid set of coordinates; invalid points are then represented as Not-A-Number, NaN. Figure 4.2 shows the point cloud converted from the arrays in figure 4.1, rendered with the PCL visualizer.

Figure 4.2: A point cloud rendered using PCL's internal visualizer.
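A minimal sketch of this conversion, assuming the camera's coordinate arrays are row-major floats of size width × height; keeping width and height on the cloud is what makes it organized.

```cpp
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

// Convert the camera's three coordinate arrays into an organized PCL point
// cloud. Invalid positions are expected to already contain NaN values.
pcl::PointCloud<pcl::PointXYZ>::Ptr buildCloud(const float* x, const float* y,
                                               const float* z,
                                               int width, int height)
{
    pcl::PointCloud<pcl::PointXYZ>::Ptr cloud(new pcl::PointCloud<pcl::PointXYZ>);
    cloud->width    = width;          // constant width and height -> organized cloud
    cloud->height   = height;
    cloud->is_dense = false;          // the cloud may contain NaN (invalid) points
    cloud->points.resize(static_cast<size_t>(width) * height);
    for (int i = 0; i < width * height; ++i) {
        cloud->points[i].x = x[i];
        cloud->points[i].y = y[i];
        cloud->points[i].z = z[i];
    }
    return cloud;
}
```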

A Good Camera Angle

As mentioned in section 2.1.5, occlusion plays a vital part in the quality of the data. A good camera position relative to the specimens therefore needs to be set up prior to the start of data recording. The main question is which parts of the specimens should be included in each frame, which in turn is a question of where enough information can be extracted from each specimen with good repeatability. Parts that move a lot will be difficult to use, since their variation will be extremely large from frame to frame. As can be seen in figure 4.1, the final perspective was chosen to be from above, looking down on the specimens at a slight angle of about 20 degrees relative to the ground. Increasing this angle would enable a view of the backside of the specimen, but would also lead to self occlusion on the top. Since the top is the most stable region, having the best possible data of it was favored.

Choosing a Frame of Interest

The 3-D camera provides a continuous stream of 3-D data, and processing every frame would not only be extremely resource demanding but also unnecessary. Depending on the environmental setup, where the camera is positioned and how the objects move in front of the camera, only a fraction of the frames might be viable for processing. The primary concern is whether the points of interest are fully within the field of view of the camera. More or less complex solutions can be used to solve this problem. Since the data rate is about 65 GB/hour (each frame is 604 kB on disk), and with limited storage capacity and processing power, a simple and effective test is needed. Two columns in the data were designated, one to the right and one to the left. The specimen moves through one of the columns when it enters the frame, and through the other as it exits. A simple boolean expression can then be assembled to determine whether a specimen is within the frame.
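A minimal sketch of such a boolean test is given below. The column positions and the depth cut-off are illustrative assumptions; in practice the test would be combined with a check that a specimen is present at all.

```cpp
// Minimal sketch of the frame-of-interest test: the specimen should cover
// neither the designated entry column nor the exit column.
#include <opencv2/core.hpp>

bool columnContainsForeground(const cv::Mat& depth, int col, float maxDepth)
{
    for (int r = 0; r < depth.rows; ++r) {
        const float d = depth.at<float>(r, col);
        if (d > 0.0f && d < maxDepth)
            return true;   // something closer than the background in this column
    }
    return false;
}

bool frameOfInterest(const cv::Mat& depth)        // CV_32F depth map in metres
{
    const int   leftCol  = 10;                    // assumed entry column
    const int   rightCol = depth.cols - 10;       // assumed exit column
    const float maxDepth = 2.5f;                  // assumed foreground cut-off [m]
    // The specimen is fully inside the view when both border columns are empty.
    return !columnContainsForeground(depth, leftCol,  maxDepth) &&
           !columnContainsForeground(depth, rightCol, maxDepth);
}
```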

4.3 Data Extraction

After the appropriate frame has been selected, some pre-processing has to be performed to remove unwanted data. The set of points belonging to the background is typically data that should be removed. Points that might belong to the object but do not contribute any useful information will only disturb the later steps and should also be removed. The primary method for extracting the wanted data, i.e. the foreground, is the flood fill method applied to the height map. How the height map is obtained, and its relevant implications, is described in section 4.4. Depending on the quality of the captured data some additional pre-processing might be needed. Quality in this sense means a good recording angle and fair positions of the specimens, so that they are more or less centered in the view. If the recording angle is too tilted, the whole data set can be rotated, consequently changing the viewing angle; this can help to improve the results of the flood fill method.

Manipulating the Viewing Angle

One of the greatest benefits of 3-D data over 2-D data is the possibility to observe the data from another angle. This is essentially a change of basis, and changing the viewing angle is then a simple linear transformation. For 3-D points represented with homogeneous coordinates, a rotation is a multiplication from the left with a rotation matrix. The rotation matrix can be written as in equation 4.1, where α, β and γ are the rotation angles around each axis of the Cartesian coordinate system.

R = \begin{bmatrix}
\cos\alpha\cos\beta & \cos\alpha\sin\beta\sin\gamma - \sin\alpha\cos\gamma & \cos\alpha\sin\beta\cos\gamma + \sin\alpha\sin\gamma & 0 \\
\sin\alpha\cos\beta & \sin\alpha\sin\beta\sin\gamma + \cos\alpha\cos\gamma & \sin\alpha\sin\beta\cos\gamma - \cos\alpha\sin\gamma & 0 \\
-\sin\beta & \cos\beta\sin\gamma & \cos\beta\cos\gamma & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}   (4.1)

A good reference is usually the floor; if the floor is properly aligned to the camera, the rest of the data is usually fine as well. The floor should therefore be parallel to the viewing plane, i.e. have a more or less constant depth value. There are several ways to find the rotation matrix that aligns the floor with the viewing plane. Since the camera has a fixed position in all of the data, the best option is to find one constant rotation matrix and apply it to all data, instead of using a dynamic algorithm that estimates a matrix for each instance. An easy and sufficient method is to use PCL's viewer to manually rotate the point cloud until the floor looks flat, and then save the matrix that describes how the camera has moved. The inverse of this matrix can then be used to properly rotate the data.

Axis Threshold Filter

With the viewing angle and position aligned to the background, a simple and effective filter is preferably applied to remove as many points as possible with a low number of calculations. The simplest possible filter, with only one inequality check per point, is to compare one of the three coordinates to a constant threshold value. This is done a number of times with different axes and thresholds depending on the situation. The points to remove are then

\{ P \mid l_x < P_x < u_x,\; l_y < P_y < u_y,\; l_z < P_z < u_z \},   (4.2)

where l and u are the lower and upper limits for the different axes. For example, removing the floor is then as simple as removing points with a depth value larger than a certain threshold, provided the floor is properly aligned.
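The two steps above, applying the fixed, manually estimated rotation and then thresholding along one axis, could be realized with PCL roughly as in the sketch below. The transformation matrix and the depth limits are placeholders, not the values used in the thesis.

```cpp
// Minimal sketch: rotate the cloud with a fixed matrix, then keep only
// points whose depth lies inside an interval (which, among other things,
// removes the floor once it is properly aligned).
#include <pcl/common/transforms.h>
#include <pcl/filters/passthrough.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

pcl::PointCloud<pcl::PointXYZ>::Ptr preprocess(
    const pcl::PointCloud<pcl::PointXYZ>::Ptr& cloud,
    const Eigen::Matrix4f& alignFloor)   // inverse of the saved viewer matrix
{
    auto rotated = pcl::PointCloud<pcl::PointXYZ>::Ptr(
        new pcl::PointCloud<pcl::PointXYZ>);
    pcl::transformPointCloud(*cloud, *rotated, alignFloor);

    pcl::PassThrough<pcl::PointXYZ> pass;
    pass.setInputCloud(rotated);
    pass.setFilterFieldName("z");
    pass.setFilterLimits(0.5f, 2.2f);     // illustrative limits, in metres
    pass.setKeepOrganized(true);          // removed points become NaN, structure kept
    auto filtered = pcl::PointCloud<pcl::PointXYZ>::Ptr(
        new pcl::PointCloud<pcl::PointXYZ>);
    pass.filter(*filtered);
    return filtered;
}
```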

Foreground Extraction with Flood Fill

Section 3.1 describes how the flood fill method works. Since the flood fill method is designed to work on 2-D images, all of the 3-D data cannot be used; only the height data is used. The goal is to extract the foreground from the background. To do this a proper seed pixel has to be picked, and this pixel needs to belong to the foreground. If the data acquisition is good enough at choosing frames where the specimen is in the center of the image, an easy option is to pick the center pixel as seed pixel. Depending on the range of the data, the upper and lower thresholds for connectivity have to be set appropriately. After converting the height values to a metric coordinate system, both thresholds were set to approximately 5 cm. This means that neighboring pixels with less than a 5 cm depth difference are considered to belong to the same surface. Figure 4.3 shows the result of the flood fill algorithm applied to a typical frame. Note that some lower parts of the specimen were deemed background. This is fine, because those parts were not wanted as foreground anyway; they would have created an unwanted bias due to the higher number of points on that side.

Figure 4.3: Flood fill algorithm used to extract the foreground: (a) original depth map, slightly rotated to adjust for the recording angle, (b) the extracted foreground marked in red. The seed pixel was picked as the center pixel of the image, which almost always belongs to the foreground.
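A minimal sketch of this foreground extraction with OpenCV's flood fill is shown below, assuming the height map is a single-channel floating-point image in metres so that the 5 cm thresholds apply directly.

```cpp
// Minimal sketch: flood fill from the centre pixel with ~5 cm connectivity
// thresholds; the filled region is returned as a binary foreground mask.
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

cv::Mat extractForeground(cv::Mat heightMap)   // CV_32F, metres
{
    // floodFill requires a mask two pixels larger than the image.
    cv::Mat mask = cv::Mat::zeros(heightMap.rows + 2, heightMap.cols + 2, CV_8U);
    const cv::Point seed(heightMap.cols / 2, heightMap.rows / 2);  // centre pixel
    const float tol = 0.05f;   // 5 cm difference allowed between neighbours

    cv::floodFill(heightMap, mask, seed, cv::Scalar(),
                  nullptr, cv::Scalar(tol), cv::Scalar(tol),
                  4 | cv::FLOODFILL_MASK_ONLY | (255 << 8));

    // Strip the one-pixel border so the mask matches the height map size.
    return mask(cv::Rect(1, 1, heightMap.cols, heightMap.rows)).clone();
}
```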

4.4 Reducing Dimensionality

As mentioned earlier, there are benefits to working with 2-D data instead of 3-D data. Many algorithms only work on 2-D data, and the question when using 3-D data is how, and which parts of it, to use.

Create a Height Map

The height map, or depth map as some would call it, is what defines a depth camera. A height map is essentially the same as the depth map shown in figure 4.1, usually the only difference being opposite signs for the distance values, and is basically what a depth camera provides through its 2-D sensor. What is really interesting is that this map can be manipulated to enhance certain features while preserving the array structure. This is because the 3-D point cloud preserves its organized property after filters or transformations have been applied. Formally, the reason is that transformations and filters only manipulate the values of the data, not the location of the data itself, i.e. the memory location. It is therefore possible to apply all kinds of operations to the 3-D data and then trivially copy the memory information into an image without problems. The possible result is an image with a resolution of pixels, where the intensity value of each pixel corresponds to the depth value of the corresponding point. This concept of manipulation gives rise to a whole new dimension of 2-D processing, so to speak. See figure 4.4 for an illustration of the same 3-D points viewed from different angles.

Figure 4.4: The same 3-D points rotated, resulting in different representations: (a) original viewing angle, (b) rotated 25 degrees to the left, (c) rotated 25 degrees to the right. Note that the background has already been removed as described in section 4.3, and the data has been properly rectified as described in the following section.

This concept might even hold some advantages over the perhaps more obvious method of projecting the 3-D points into the camera to calculate their corresponding image coordinates, which is necessary when dealing with a "normal" point cloud with no apparent structure, produced by methods like Simultaneous Localization and Mapping (SLAM) [22] or CAD modeling. If one were to project data from a depth sensor, the main disadvantage is that the projected points would be unevenly distributed due to perspective, as shown in figure 4.5. Severe resampling and interpolation would have to be applied to fill in missing data or remove data with sub-pixel coordinates. Self occlusion would also become apparent when the viewing angle changes, and data would have to be interpolated if missing data is unacceptable.

Rectification Using Homography

Observing the specimens in figures 4.1 and 4.3, it is clear that their position inside the frame is not very fixed. The specimens can be located all over the frame and with an almost arbitrary rotation. This is unwanted. Another unfavorable effect concerns perspective. The way the depth map is created results in different scale depending on which region of the frame is considered. Since the camera is tilted, objects positioned further away from the camera will appear smaller in the recorded height map than those closer. To fix, or at least minimize, both these problems, a homography is estimated from the 3-D data. The camera is calibrated to internally rectify the depth values by providing the length and width displacements, which can be seen in figure 4.1. The following method incorporates this information to create a dynamic rectification process. The following five steps are performed to rectify an image:

1. Extract principal components from the segmented 3-D data, see section
2. Position four points with the principal components as base
3. Project the points into image coordinates
4. Estimate a homography, see section
5. Apply the homography to the image

Figure 4.5: Unevenly distributed points caused by perspective. The point density is larger at the bottom of the object compared to the top.

The second step above is to find four points in 3-D space that will mark the borders of the specimen. This is achieved by applying principal component analysis to the point cloud, resulting in three principal axes, which usually correlate with the specimen's length, width and height, since those dimensions contain the most variance. The data matrix X in equation 3.2 is simply all n 3-D points arranged in an n × 3 matrix. Once again the recording angle can cause some problems. Due to occlusion the data might tend to be biased, with perhaps more points on one side of the specimen than on the other. The consequence is that the resulting axes from the principal component analysis will be skewed. If this effect recurs on all specimens, a manual adjustment can be applied by rotating the axes slightly. Even though these principal components do not always align perfectly, resulting in slightly rotated images, the program can still manage. This is not a crucial stage and the following algorithms can still succeed.
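The first step, extracting the principal components, can be sketched as follows: the valid foreground points are arranged as an n × 3 data matrix, as in equation 3.2, and analyzed with OpenCV's PCA. This is only an illustration of the idea, not the thesis's exact routine.

```cpp
// Minimal sketch: PCA on the segmented 3-D points arranged as an n x 3 matrix.
#include <cmath>
#include <vector>
#include <opencv2/core.hpp>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

cv::PCA principalAxes(const pcl::PointCloud<pcl::PointXYZ>& foreground)
{
    std::vector<cv::Point3f> pts;
    for (const auto& p : foreground.points)
        if (std::isfinite(p.z))                  // skip NaN points
            pts.emplace_back(p.x, p.y, p.z);

    cv::Mat X = cv::Mat(pts).reshape(1);         // n x 3, CV_32F
    // pca.eigenvectors holds the three principal axes as rows, ordered by
    // decreasing variance; pca.mean is the centroid of the specimen.
    return cv::PCA(X, cv::Mat(), cv::PCA::DATA_AS_ROW);
}
```

The four border points of step two can then be positioned at fixed coordinates along these axes around the centroid.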

Since the principal components are orthogonal, they can be used directly as a new orthogonal base. With this new base, four points can be positioned at fixed coordinates around the specimen. These 3-D points are then projected into the same image coordinate system that all pixels lie in. Equations 2.7 and 2.8 can be used to find the image coordinates if the focal length is already known from the camera settings. If not, it can be estimated using the same formulas, only with f as the unknown and the 3-D points and their corresponding image coordinates as known variables. The correspondences are found directly through their memory positions. As many focal lengths as there are valid points are calculated this way, and the final focal length used is the mean. One might expect the value of f to change considerably between values estimated close to the center and values estimated close to the borders, due to radial distortion. The effect is however sufficiently small to be neglected for this application. Figure 4.6 illustrates the two first principal components together with the projected points.

Figure 4.6: The two first principal components are marked in black. They should be somewhat aligned with the specimen's length and width. The four white points are the projected 3-D points, positioned in 3-D space with the principal components as base. Note that this image is an illustration and is not numerically correct; some of the white points are usually projected outside the image borders and would therefore not serve as good examples.

These four projected points then form the vector X, and the four image corners form the vector x in equation 3.5. The image corner coordinates are in this case, with an image resolution of 320 × 240 pixels, the following points: (0, 0), (320, 0), (320, 240), (0, 240). With a homography matrix estimated, the easiest way to rectify the image is to traverse each pixel in the new rectified image and find out where to sample values from the original image. By transforming the coordinates with the inverse of the homography matrix, the corresponding coordinates in the original image can be found. These coordinates will usually have sub-pixel precision, meaning they fall between pixels. The value used is then interpolated from the surrounding pixels; bi-linear interpolation is the most commonly used method, which is simply linear interpolation in two dimensions. Figure 4.7 shows some images before and after rectification.
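Steps four and five can be sketched with OpenCV as below: with exactly four correspondences the homography is exactly determined, and cv::warpPerspective performs the inverse mapping and bilinear interpolation just described. The 320 × 240 target size follows from the corner coordinates above.

```cpp
// Minimal sketch: map the four projected border points onto the image
// corners and resample the height map with bilinear interpolation.
#include <vector>
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

cv::Mat rectify(const cv::Mat& heightMap,
                const std::vector<cv::Point2f>& projectedCorners)  // 4 points
{
    const std::vector<cv::Point2f> target = {
        {0.f, 0.f}, {320.f, 0.f}, {320.f, 240.f}, {0.f, 240.f} };

    // Exactly four correspondences: the homography is exactly determined.
    cv::Mat H = cv::getPerspectiveTransform(projectedCorners, target);

    cv::Mat rectified;
    cv::warpPerspective(heightMap, rectified, H, cv::Size(320, 240),
                        cv::INTER_LINEAR);
    return rectified;
}
```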

Figure 4.7: Four images before (bottom) and after (top) rectification. Beware that the bottom images have been flipped vertically to simplify comparison with their corresponding rectified images. It is actually the homography that performs the mirroring of the rectified images; it simplifies some code implementations later on.

Note that exactly four points are used when estimating the homography, meaning that equation 3.8 is exactly determined, so it would be sufficient to solve \bar{A}x = \bar{b}, where \bar{A} is the first eight columns of A and \bar{b} is the last column. The same can be said about the Hartley normalization: since only four points are used, an exact solution can be found, rendering the normalization superfluous. This is no real problem, as the normalized solutions are not inferior in any way; on the contrary, it is nice to have a more general implementation ready.

4.5 Data Processing

The corrected data has now been extracted and properly aligned, and the next step is to find features that can be located in all specimens with adequate precision. As explained in section 2.4, morphological landmarks are locations with dependable attributes and are therefore searched for. By examining the anatomy of the subject and observing subjects in motion, four landmarks were found to possess excellent properties. In figure 4.8 these four landmarks are marked in white; they appear protruded from the surface. These landmarks are anatomical landmarks, but with some processing they can acquire the attributes of mathematical landmarks, and it is due to this mathematical property that they can be located autonomously. A series of continuous landmarks can also be found within the black borders. These are also anatomical landmarks and are already, for the most part, mathematical landmarks. The reason for not using them in the descriptor is that they vary heavily with the pose of the specimen. They do however disturb the localization of the white landmarks and should therefore be suppressed. All these landmarks were localized manually and are prone to high uncertainty, since there is no single correct position for them.

Figure 4.8: The anatomical landmarks manually marked on a specimen.

Their positions can be defined in multiple ways, and the challenge is to be able to find them with adequate repeatability over multiple samples.

Ridge Detection

To find the landmarks within the black regions in figure 4.8, we have developed a simple algorithm here called Ridge Walk. This algorithm results in a preliminary vector of connected pixels tracing the largest values along the specimen, subject to some additional conditions. To smooth and parametrize this vector, a B-spline curve is fitted to it, resulting in six coefficients describing the B-spline. Note that this is a continuous line and does not always correspond to an anatomical landmark; some parts of it can therefore be defined as pseudo-landmarks. The four steps needed to create the curve are essentially:

1. Filter the height image
2. Find a good starting point in this new image
3. Use Ridge Walk from the starting point
4. Fit a B-spline to the result of the Ridge Walk

Height Image Filtering

This method utilizes the values of a processed variant of the height map. The goal of the algorithm is to find the local maximum points along the specimen. To achieve this, a derivative, or gradient, map is first created, estimated in the y-direction using a Sobel filter. Derivatives are very noise sensitive, so it is a good idea to remove noise from the height image before approximating the gradient. Noise is easily removed from an image by removing the high frequencies with a low-pass filter. A Sobel filter approximates the gradient of an image by applying a Sobel kernel over every pixel of the image, i.e. a convolution,

I_Y = S_Y ∗ I.   (4.3)
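A minimal sketch of equation 4.3, low-pass filtering followed by a y-direction Sobel derivative, is given below; the Gaussian kernel size is an illustrative assumption.

```cpp
// Minimal sketch of equation 4.3: smooth the height image, then approximate
// the y-direction gradient with a 3x3 Sobel kernel.
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

cv::Mat gradientY(const cv::Mat& heightMap)   // CV_32F height image
{
    cv::Mat smoothed, dy;
    // Suppress noise before differentiating (kernel size is an assumption).
    cv::GaussianBlur(heightMap, smoothed, cv::Size(9, 9), 0.0);
    // First order derivative in the y-direction.
    cv::Sobel(smoothed, dy, CV_32F, 0, 1, 3);
    return dy;
}
```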

Figure 4.9: The gradient images created with a Sobel filter, displayed in log scale: (a) first order derivative image S_Y, (b) second order derivative image S_YY, (c) final difference image S_diff.

The kernel S_Y for the y-direction gradient is constructed as a 3 × 3 matrix:

S_Y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}   (4.4)

A second image is created containing the second order derivatives of the low-pass filtered height image. This image can be created either by applying the same Sobel kernel to the first order derivative image, or by convolving the Sobel kernel with itself, creating a second, larger kernel that is then used in the convolution. A final image is created from the difference of the two derivative images as

S_diff = S_Y − S_YY.   (4.5)

Observing the images in figure 4.9, it is clear that the final image S_diff can be used to find the wanted line. It is true that the first gradient S_Y is very similar to S_diff, due to the small influence of S_YY; the final result does however improve if the difference image is used.

Ridge Walk

To create the vector of coordinates, a start pixel is required. It is picked as the pixel with the largest value within an interval of the first column of S_diff, i.e. among the pixels to the far left. The interval is chosen to lie in the middle of the foreground; its size is however fixed. The walk then proceeds to the right from the selected start pixel, constantly incrementing the x-coordinate, while the y-coordinate is chosen as the pixel with the largest value within an interval relative to the last y-coordinate; see figure 4.10 for an illustration. A result of the Ridge Walk method can be seen in figure 4.11a, drawn in cyan.
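A minimal sketch of the Ridge Walk itself is given below. It operates on the difference image S_diff and assumes the start row has already been chosen as described; the ±1 search window matches the illustration in figure 4.10 but is otherwise an assumption.

```cpp
// Minimal sketch of the Ridge Walk: march one column to the right at a time,
// picking the strongest response within a small window around the previous row.
#include <algorithm>
#include <limits>
#include <vector>
#include <opencv2/core.hpp>

std::vector<int> ridgeWalk(const cv::Mat& sdiff, int startRow)  // S_diff, CV_32F
{
    const int window = 1;                        // search +/- 1 row per step
    std::vector<int> ridge(sdiff.cols);
    int row = startRow;
    for (int col = 0; col < sdiff.cols; ++col) {
        int   bestRow = row;
        float bestVal = -std::numeric_limits<float>::max();
        for (int r = std::max(0, row - window);
             r <= std::min(sdiff.rows - 1, row + window); ++r) {
            const float v = sdiff.at<float>(r, col);
            if (v > bestVal) { bestVal = v; bestRow = r; }
        }
        row = bestRow;
        ridge[col] = row;                        // y-coordinate of the ridge
    }
    return ridge;
}
```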

Figure 4.10: The next coordinate should always contain the largest value. The possible search interval is in this case between plus and minus one y-coordinate. The next step will therefore be down to the right, i.e. plus one x-coordinate as always, and minus one y-coordinate.

B-spline Fitting

The noise and irregularities that the Ridge Walk curve might contain will create problems later on. The curve is therefore parametrized as a uniform cubic B-spline, which consequently also smooths it. Six coefficients define the spline; see section 3.8 for how B-splines are defined. To actually fit a B-spline to the curve, the Levenberg-Marquardt [23] method is used to minimize the difference between the curve's y-coordinates and the B-spline's values, i.e. the coefficients are optimized using the error function

P = \underset{P_k}{\operatorname{argmin}} \sum_{i=1}^{N} \left( X_{P_k}[i] - Y[i] \right)^2,   (4.6)

where X_{P_k}[i] is a discretized B-spline and Y[i] is a vector containing the y-coordinates obtained from the Ridge Walk method. In this case N is equal to 320, due to the width of the image and consequently the length of the curve. This solution is faster and more stable than trying to optimize the B-spline over the height image directly. The error function could in that case be defined as

P = \underset{P_k}{\operatorname{argmin}} \sum_{i=1}^{N} \left( I_{max} - I(i, X_{P_k}[i]) \right)^2,   (4.7)

where I is the height map. The problem with this approach is that the height map is discrete. The Levenberg-Marquardt method changes the variables P_k slightly while evaluating the resulting error; if these steps are smaller than a full pixel, the error function returns the same value. When the same value has been returned several times, the optimization stops, since it cannot find a better solution. Figure 4.11a shows the final B-spline when optimized to the corresponding ridge walk line. The reason for using a B-spline was to get a smoother curve.

A simple low-pass filter would indeed manage that, but another reason for using a B-spline is that it makes it easy to validate a good ridge detection. By observing the differences between the coefficient values, a validity measure can be calculated: if two contiguous coefficients differ too much, the ridge is unreasonably contorted, and the conclusion is a failed ridge localization.

Figure 4.11: The B-spline function fitted to a specimen to suppress some dominant features: (a) a B-spline (blue) and the ridge walk line (green) used as reference, (b) a number of Gauss functions applied along a B-spline.

Ridge Suppression

As mentioned before, the purpose of the ridge localization method is to suppress the ridge. To achieve that, the spline is extended into two dimensions so that it can actually interact with the original image. One-dimensional Gauss curves along the y-axis are therefore formed along the spline, creating a full image that can be subtracted from the height image. A Gauss curve is defined as

G(x) = a e^{-\frac{(x-b)^2}{2c^2}},   (4.8)

where a, b and c are constants defining the shape of the curve. The height of the curve is defined by a, the center by b, and the width by c, which corresponds to the standard deviation of the curve. The full image containing all discrete Gauss curves along the B-spline is then constructed as

f(x, y) = e^{-\frac{(y - X[x])^2}{2 \cdot 10^2}}.   (4.9)

Here the height a = 1, the center is equal to the spline's position, and the width has been set to a constant 10. With the B-spline shown in figure 4.11a, the Gauss ridge in figure 4.11b is created.
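Equation 4.9 translates almost directly into code; the sketch below builds the Gauss ridge image column by column from the discretized B-spline values X[x].

```cpp
// Minimal sketch of equation 4.9: a one-dimensional Gauss curve is placed
// along the y-axis in every column, centred on the B-spline value X[x],
// with unit height and a fixed width of 10 pixels.
#include <cmath>
#include <vector>
#include <opencv2/core.hpp>

cv::Mat gaussRidge(const std::vector<float>& spline, int rows)  // X[x] per column
{
    const float c = 10.0f;                       // standard deviation (width)
    cv::Mat ridge(rows, static_cast<int>(spline.size()), CV_32F);
    for (int y = 0; y < rows; ++y)
        for (int x = 0; x < ridge.cols; ++x) {
            const float d = y - spline[x];
            ridge.at<float>(y, x) = std::exp(-(d * d) / (2.0f * c * c));
        }
    return ridge;   // subtracted from the high-passed height image afterwards
}
```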

Protrusion Detection

What is characteristic about the white landmarks in figure 4.8 is that they appear as small bumps. This means that they contain higher frequencies than the rest of the surface. Figure 4.12a shows a frequency high-passed variant of the height image, created by subtracting a frequency low-passed variant of the original image. The ridge does however seem to interfere with the two landmarks to the right. This is where the Gauss ridge comes in: by subtracting it from the height image we get the result in figure 4.12b. Observing this image, it is clear that the wanted points can be located at some of the local maxima in the image. The local maxima are easily found by checking whether a pixel value is the largest in its neighborhood. This operation is very noise sensitive, so a small low-pass filter should be applied first. Note that the whole procedure can ultimately be seen as a band-pass filter of the original height image, followed by the Gauss ridge subtraction.

Figure 4.12: The height image filtered with a high-pass filter, then with the B-spline used to suppress parts of it: (a) high-pass filter applied to the height image, (b) the Gauss ridge image subtracted from the high-passed height image.

Only four of the points are wanted, and the next challenge is to choose the correct ones. Depending on the data, the correct points might be more or less difficult to extract. Figure 4.13 shows the resulting local maxima for an image in different colors. The first points to be removed are those whose values are smaller than the mean of the height image, marked in blue. Then the points that do not belong to the convex hull of the point set are removed, marked in cyan. If more than four points remain, a last filtering is performed: points missing a corresponding pair point are removed, then the two points farthest to the right are chosen, and lastly the two widest remaining points are selected, marked in red. Note that this scheme is very data dependent and relies heavily on a proper normalization.
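The local maximum search can be sketched as below, using the common dilate-and-compare trick: a pixel is kept if it equals the maximum of its 3 × 3 neighborhood after a small low-pass filter. The kernel sizes are assumptions, and the subsequent pruning of the candidates follows the rules above.

```cpp
// Minimal sketch: local maxima of the band-passed, ridge-suppressed image.
#include <vector>
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

std::vector<cv::Point> localMaxima(const cv::Mat& bandPassed)   // CV_32F
{
    cv::Mat smooth, dilated;
    cv::GaussianBlur(bandPassed, smooth, cv::Size(5, 5), 0.0);
    cv::dilate(smooth, dilated, cv::Mat());      // 3x3 neighbourhood maximum

    std::vector<cv::Point> maxima;
    for (int y = 0; y < smooth.rows; ++y)
        for (int x = 0; x < smooth.cols; ++x)
            if (smooth.at<float>(y, x) >= dilated.at<float>(y, x))
                maxima.push_back(cv::Point(x, y));
    return maxima;   // candidates are then pruned as described above
}
```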

Figure 4.13: All the local maxima found in a high-passed, B-spline suppressed image. In this case figure 4.12b is used (these are not the local maxima of the currently displayed image). The points are removed in the following order: blue, cyan, yellow.

Surface Descriptor Extraction

With four proper points, a closed region is defined. This region should contain enough data to represent the specimen to a sufficient extent. The region is extracted as a rectangular image using a homography as previously explained, but with the four landmarks as corner points. The size was chosen as pixels, which is much smaller than the actual region, so down-sampling is automatically applied by the homography. Before extracting the image, the data should be normalized. This is done by fitting a plane to the four landmarks in 3-D space, as described in section 3.5. A rotation matrix can then be estimated by aligning the plane's normal vector with the depth axis, making it possible to rotate the point cloud before creating the descriptor image. This rotation matrix is easily constructed by finding a new orthonormal basis (b_1, b_2, b_3) where b_3 corresponds to the depth axis. With the old basis defined as B_1 = (1, 0, 0)^T, B_2 = (0, 1, 0)^T and B_3 = (0, 0, 1)^T, the new basis is calculated as follows:

b_1 = B_2 × n
b_2 = n × b_1
b_3 = n.   (4.10)

Inserting the new basis into a matrix creates the sought transformation matrix,

T = \begin{bmatrix} b_{1x} & b_{1y} & b_{1z} & 0 \\ b_{2x} & b_{2y} & b_{2z} & 0 \\ b_{3x} & b_{3y} & b_{3z} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.   (4.11)

Figure 4.14a shows a sample before normalization, and figure 4.14b shows the same sample after normalization. As expected, the four corners of the descriptor hold approximately the same values.
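Equations 4.10 and 4.11 can be realized with a few lines of Eigen, as sketched below; the only addition is a normalization of b_1, since B_2 × n is not of unit length unless n is orthogonal to B_2.

```cpp
// Minimal sketch of equations 4.10 and 4.11: build an orthonormal basis from
// the plane normal n and pack it into a homogeneous transformation matrix.
#include <Eigen/Dense>

Eigen::Matrix4f alignToPlane(const Eigen::Vector3f& n)   // unit plane normal
{
    const Eigen::Vector3f B2(0.0f, 1.0f, 0.0f);
    Eigen::Vector3f b1 = B2.cross(n).normalized();   // b1 = B2 x n (normalized)
    Eigen::Vector3f b2 = n.cross(b1);                // b2 = n  x b1
    Eigen::Vector3f b3 = n;                          // b3 = n

    Eigen::Matrix4f T = Eigen::Matrix4f::Identity();
    T.block<1, 3>(0, 0) = b1.transpose();
    T.block<1, 3>(1, 0) = b2.transpose();
    T.block<1, 3>(2, 0) = b3.transpose();
    return T;   // applied to the point cloud before cutting the descriptor
}
```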

Figure 4.14: The four points define the edges of the descriptor; to normalize the descriptor, the data is rotated so that all four points have the same depth value: (a) the points marked on the specimen, (b) the normalized sample where all points have the same depth value.

With the data rotated, the descriptor is cut out and rectified, resulting in the final descriptor, as shown in figure 4.15. Examining such descriptors from four different classes, it is possible to see some differences between the classes while simultaneously detecting similarities within the same class. It is primarily the right half of the descriptor that reveals some kind of distinction; it is however also the region that varies the most within the same class. This is probably mainly caused by slight variations in the positions of the landmarks, and by the fact that the region contains a portion of one of the least rigid body parts of the specimen.

4.6 Descriptor Analysis

With the data properly processed and normalized into descriptors, the next step is to analyze these descriptor images. A lot of different methods exist, and a few of them will be examined and evaluated here. Since the descriptor images hold properties similar to faces, sections 3.6.1, 3.6.2 and 3.6.3 describe methods often used for face recognition. The fourth method is a neural network, described in section 3.7. It is a flexible model that can find complex relationships between input and output, but usually demands well-adjusted parameters to give adequate results.

Eigenfaces

The first method to be used for classification is the eigenfaces method. Two different variants will be used: the first is a slight tweak of the method described in section 3.6.1. The second variant exploits the fact that eigenfaces is essentially a method to represent images, not to classify them.
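To make the idea concrete, the sketch below shows a generic eigenfaces-style baseline built on OpenCV's PCA: descriptor images of equal size are flattened to rows, projected onto the leading eigenvectors, and classified with a nearest-neighbour rule. This is only an illustration of the general approach, not the two variants evaluated in the thesis.

```cpp
// A generic eigenfaces-style baseline for the descriptor images (illustrative).
#include <limits>
#include <vector>
#include <opencv2/core.hpp>

struct EigenModel {
    cv::PCA pca;
    std::vector<cv::Mat> projections;   // projected training samples
    std::vector<int>     labels;
};

EigenModel train(const std::vector<cv::Mat>& descriptors,
                 const std::vector<int>& labels, int components)
{
    cv::Mat data;                                   // all descriptors must share size
    for (const auto& d : descriptors) {
        cv::Mat row;
        d.convertTo(row, CV_32F);
        data.push_back(row.reshape(1, 1));          // flatten to a single row
    }
    EigenModel m;
    m.pca = cv::PCA(data, cv::Mat(), cv::PCA::DATA_AS_ROW, components);
    for (int i = 0; i < data.rows; ++i)
        m.projections.push_back(m.pca.project(data.row(i)));
    m.labels = labels;
    return m;
}

int predict(const EigenModel& m, const cv::Mat& descriptor)
{
    cv::Mat row;
    descriptor.convertTo(row, CV_32F);
    cv::Mat q = m.pca.project(row.reshape(1, 1));
    int best = -1;
    double bestDist = std::numeric_limits<double>::max();
    for (size_t i = 0; i < m.projections.size(); ++i) {
        const double d = cv::norm(q, m.projections[i], cv::NORM_L2);
        if (d < bestDist) { bestDist = d; best = m.labels[i]; }
    }
    return best;   // label of the nearest training sample in eigenspace
}
```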

calibrated coordinates Linear transformation pixel coordinates

calibrated coordinates Linear transformation pixel coordinates 1 calibrated coordinates Linear transformation pixel coordinates 2 Calibration with a rig Uncalibrated epipolar geometry Ambiguities in image formation Stratified reconstruction Autocalibration with partial

More information

Three-Dimensional Sensors Lecture 2: Projected-Light Depth Cameras

Three-Dimensional Sensors Lecture 2: Projected-Light Depth Cameras Three-Dimensional Sensors Lecture 2: Projected-Light Depth Cameras Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inria.fr http://perception.inrialpes.fr/ Outline The geometry of active stereo.

More information

Stereo CSE 576. Ali Farhadi. Several slides from Larry Zitnick and Steve Seitz

Stereo CSE 576. Ali Farhadi. Several slides from Larry Zitnick and Steve Seitz Stereo CSE 576 Ali Farhadi Several slides from Larry Zitnick and Steve Seitz Why do we perceive depth? What do humans use as depth cues? Motion Convergence When watching an object close to us, our eyes

More information

Outline. ETN-FPI Training School on Plenoptic Sensing

Outline. ETN-FPI Training School on Plenoptic Sensing Outline Introduction Part I: Basics of Mathematical Optimization Linear Least Squares Nonlinear Optimization Part II: Basics of Computer Vision Camera Model Multi-Camera Model Multi-Camera Calibration

More information

COMPARATIVE STUDY OF DIFFERENT APPROACHES FOR EFFICIENT RECTIFICATION UNDER GENERAL MOTION

COMPARATIVE STUDY OF DIFFERENT APPROACHES FOR EFFICIENT RECTIFICATION UNDER GENERAL MOTION COMPARATIVE STUDY OF DIFFERENT APPROACHES FOR EFFICIENT RECTIFICATION UNDER GENERAL MOTION Mr.V.SRINIVASA RAO 1 Prof.A.SATYA KALYAN 2 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING PRASAD V POTLURI SIDDHARTHA

More information

EXAM SOLUTIONS. Image Processing and Computer Vision Course 2D1421 Monday, 13 th of March 2006,

EXAM SOLUTIONS. Image Processing and Computer Vision Course 2D1421 Monday, 13 th of March 2006, School of Computer Science and Communication, KTH Danica Kragic EXAM SOLUTIONS Image Processing and Computer Vision Course 2D1421 Monday, 13 th of March 2006, 14.00 19.00 Grade table 0-25 U 26-35 3 36-45

More information

Rectification and Distortion Correction

Rectification and Distortion Correction Rectification and Distortion Correction Hagen Spies March 12, 2003 Computer Vision Laboratory Department of Electrical Engineering Linköping University, Sweden Contents Distortion Correction Rectification

More information

Stereo Vision. MAN-522 Computer Vision

Stereo Vision. MAN-522 Computer Vision Stereo Vision MAN-522 Computer Vision What is the goal of stereo vision? The recovery of the 3D structure of a scene using two or more images of the 3D scene, each acquired from a different viewpoint in

More information

CHAPTER 3 PRINCIPAL COMPONENT ANALYSIS AND FISHER LINEAR DISCRIMINANT ANALYSIS

CHAPTER 3 PRINCIPAL COMPONENT ANALYSIS AND FISHER LINEAR DISCRIMINANT ANALYSIS 38 CHAPTER 3 PRINCIPAL COMPONENT ANALYSIS AND FISHER LINEAR DISCRIMINANT ANALYSIS 3.1 PRINCIPAL COMPONENT ANALYSIS (PCA) 3.1.1 Introduction In the previous chapter, a brief literature review on conventional

More information

Laser sensors. Transmitter. Receiver. Basilio Bona ROBOTICA 03CFIOR

Laser sensors. Transmitter. Receiver. Basilio Bona ROBOTICA 03CFIOR Mobile & Service Robotics Sensors for Robotics 3 Laser sensors Rays are transmitted and received coaxially The target is illuminated by collimated rays The receiver measures the time of flight (back and

More information

AUTOMATED 4 AXIS ADAYfIVE SCANNING WITH THE DIGIBOTICS LASER DIGITIZER

AUTOMATED 4 AXIS ADAYfIVE SCANNING WITH THE DIGIBOTICS LASER DIGITIZER AUTOMATED 4 AXIS ADAYfIVE SCANNING WITH THE DIGIBOTICS LASER DIGITIZER INTRODUCTION The DIGIBOT 3D Laser Digitizer is a high performance 3D input device which combines laser ranging technology, personal

More information

Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation

Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation Obviously, this is a very slow process and not suitable for dynamic scenes. To speed things up, we can use a laser that projects a vertical line of light onto the scene. This laser rotates around its vertical

More information

Stereo and Epipolar geometry

Stereo and Epipolar geometry Previously Image Primitives (feature points, lines, contours) Today: Stereo and Epipolar geometry How to match primitives between two (multiple) views) Goals: 3D reconstruction, recognition Jana Kosecka

More information

55:148 Digital Image Processing Chapter 11 3D Vision, Geometry

55:148 Digital Image Processing Chapter 11 3D Vision, Geometry 55:148 Digital Image Processing Chapter 11 3D Vision, Geometry Topics: Basics of projective geometry Points and hyperplanes in projective space Homography Estimating homography from point correspondence

More information

Hand-Eye Calibration from Image Derivatives

Hand-Eye Calibration from Image Derivatives Hand-Eye Calibration from Image Derivatives Abstract In this paper it is shown how to perform hand-eye calibration using only the normal flow field and knowledge about the motion of the hand. The proposed

More information

Image-Based Face Recognition using Global Features

Image-Based Face Recognition using Global Features Image-Based Face Recognition using Global Features Xiaoyin xu Research Centre for Integrated Microsystems Electrical and Computer Engineering University of Windsor Supervisors: Dr. Ahmadi May 13, 2005

More information

Lecture 3: Camera Calibration, DLT, SVD

Lecture 3: Camera Calibration, DLT, SVD Computer Vision Lecture 3 23--28 Lecture 3: Camera Calibration, DL, SVD he Inner Parameters In this section we will introduce the inner parameters of the cameras Recall from the camera equations λx = P

More information

Institutionen för systemteknik

Institutionen för systemteknik Institutionen för systemteknik Department of Electrical Engineering Examensarbete Machine Learning for detection of barcodes and OCR Examensarbete utfört i Datorseende vid Tekniska högskolan vid Linköpings

More information

Index. 3D reconstruction, point algorithm, point algorithm, point algorithm, point algorithm, 263

Index. 3D reconstruction, point algorithm, point algorithm, point algorithm, point algorithm, 263 Index 3D reconstruction, 125 5+1-point algorithm, 284 5-point algorithm, 270 7-point algorithm, 265 8-point algorithm, 263 affine point, 45 affine transformation, 57 affine transformation group, 57 affine

More information

Index. 3D reconstruction, point algorithm, point algorithm, point algorithm, point algorithm, 253

Index. 3D reconstruction, point algorithm, point algorithm, point algorithm, point algorithm, 253 Index 3D reconstruction, 123 5+1-point algorithm, 274 5-point algorithm, 260 7-point algorithm, 255 8-point algorithm, 253 affine point, 43 affine transformation, 55 affine transformation group, 55 affine

More information

Camera Calibration. Schedule. Jesus J Caban. Note: You have until next Monday to let me know. ! Today:! Camera calibration

Camera Calibration. Schedule. Jesus J Caban. Note: You have until next Monday to let me know. ! Today:! Camera calibration Camera Calibration Jesus J Caban Schedule! Today:! Camera calibration! Wednesday:! Lecture: Motion & Optical Flow! Monday:! Lecture: Medical Imaging! Final presentations:! Nov 29 th : W. Griffin! Dec 1

More information

Geometric camera models and calibration

Geometric camera models and calibration Geometric camera models and calibration http://graphics.cs.cmu.edu/courses/15-463 15-463, 15-663, 15-862 Computational Photography Fall 2018, Lecture 13 Course announcements Homework 3 is out. - Due October

More information

There are many cues in monocular vision which suggests that vision in stereo starts very early from two similar 2D images. Lets see a few...

There are many cues in monocular vision which suggests that vision in stereo starts very early from two similar 2D images. Lets see a few... STEREO VISION The slides are from several sources through James Hays (Brown); Srinivasa Narasimhan (CMU); Silvio Savarese (U. of Michigan); Bill Freeman and Antonio Torralba (MIT), including their own

More information

BIL Computer Vision Apr 16, 2014

BIL Computer Vision Apr 16, 2014 BIL 719 - Computer Vision Apr 16, 2014 Binocular Stereo (cont d.), Structure from Motion Aykut Erdem Dept. of Computer Engineering Hacettepe University Slide credit: S. Lazebnik Basic stereo matching algorithm

More information

Prof. Fanny Ficuciello Robotics for Bioengineering Visual Servoing

Prof. Fanny Ficuciello Robotics for Bioengineering Visual Servoing Visual servoing vision allows a robotic system to obtain geometrical and qualitative information on the surrounding environment high level control motion planning (look-and-move visual grasping) low level

More information

Stereo II CSE 576. Ali Farhadi. Several slides from Larry Zitnick and Steve Seitz

Stereo II CSE 576. Ali Farhadi. Several slides from Larry Zitnick and Steve Seitz Stereo II CSE 576 Ali Farhadi Several slides from Larry Zitnick and Steve Seitz Camera parameters A camera is described by several parameters Translation T of the optical center from the origin of world

More information

Dimension Reduction CS534

Dimension Reduction CS534 Dimension Reduction CS534 Why dimension reduction? High dimensionality large number of features E.g., documents represented by thousands of words, millions of bigrams Images represented by thousands of

More information

Computer Vision Lecture 17

Computer Vision Lecture 17 Computer Vision Lecture 17 Epipolar Geometry & Stereo Basics 13.01.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Announcements Seminar in the summer semester

More information

Vision Review: Image Formation. Course web page:

Vision Review: Image Formation. Course web page: Vision Review: Image Formation Course web page: www.cis.udel.edu/~cer/arv September 10, 2002 Announcements Lecture on Thursday will be about Matlab; next Tuesday will be Image Processing The dates some

More information

Face Recognition At-a-Distance Based on Sparse-Stereo Reconstruction

Face Recognition At-a-Distance Based on Sparse-Stereo Reconstruction Face Recognition At-a-Distance Based on Sparse-Stereo Reconstruction Ham Rara, Shireen Elhabian, Asem Ali University of Louisville Louisville, KY {hmrara01,syelha01,amali003}@louisville.edu Mike Miller,

More information

Computer Vision Lecture 17

Computer Vision Lecture 17 Announcements Computer Vision Lecture 17 Epipolar Geometry & Stereo Basics Seminar in the summer semester Current Topics in Computer Vision and Machine Learning Block seminar, presentations in 1 st week

More information

Robotics Programming Laboratory

Robotics Programming Laboratory Chair of Software Engineering Robotics Programming Laboratory Bertrand Meyer Jiwon Shin Lecture 8: Robot Perception Perception http://pascallin.ecs.soton.ac.uk/challenges/voc/databases.html#caltech car

More information

Time-of-flight basics

Time-of-flight basics Contents 1. Introduction... 2 2. Glossary of Terms... 3 3. Recovering phase from cross-correlation... 4 4. Time-of-flight operating principle: the lock-in amplifier... 6 5. The time-of-flight sensor pixel...

More information

Unsupervised learning in Vision

Unsupervised learning in Vision Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual

More information

Recognition: Face Recognition. Linda Shapiro EE/CSE 576

Recognition: Face Recognition. Linda Shapiro EE/CSE 576 Recognition: Face Recognition Linda Shapiro EE/CSE 576 1 Face recognition: once you ve detected and cropped a face, try to recognize it Detection Recognition Sally 2 Face recognition: overview Typical

More information

ELEC Dr Reji Mathew Electrical Engineering UNSW

ELEC Dr Reji Mathew Electrical Engineering UNSW ELEC 4622 Dr Reji Mathew Electrical Engineering UNSW Review of Motion Modelling and Estimation Introduction to Motion Modelling & Estimation Forward Motion Backward Motion Block Motion Estimation Motion

More information

Reminder: Lecture 20: The Eight-Point Algorithm. Essential/Fundamental Matrix. E/F Matrix Summary. Computing F. Computing F from Point Matches

Reminder: Lecture 20: The Eight-Point Algorithm. Essential/Fundamental Matrix. E/F Matrix Summary. Computing F. Computing F from Point Matches Reminder: Lecture 20: The Eight-Point Algorithm F = -0.00310695-0.0025646 2.96584-0.028094-0.00771621 56.3813 13.1905-29.2007-9999.79 Readings T&V 7.3 and 7.4 Essential/Fundamental Matrix E/F Matrix Summary

More information

CHAPTER 3. Single-view Geometry. 1. Consequences of Projection

CHAPTER 3. Single-view Geometry. 1. Consequences of Projection CHAPTER 3 Single-view Geometry When we open an eye or take a photograph, we see only a flattened, two-dimensional projection of the physical underlying scene. The consequences are numerous and startling.

More information

CV: 3D to 2D mathematics. Perspective transformation; camera calibration; stereo computation; and more

CV: 3D to 2D mathematics. Perspective transformation; camera calibration; stereo computation; and more CV: 3D to 2D mathematics Perspective transformation; camera calibration; stereo computation; and more Roadmap of topics n Review perspective transformation n Camera calibration n Stereo methods n Structured

More information

Learning to Recognize Faces in Realistic Conditions

Learning to Recognize Faces in Realistic Conditions 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Camera model and multiple view geometry

Camera model and multiple view geometry Chapter Camera model and multiple view geometry Before discussing how D information can be obtained from images it is important to know how images are formed First the camera model is introduced and then

More information

Short on camera geometry and camera calibration

Short on camera geometry and camera calibration Short on camera geometry and camera calibration Maria Magnusson, maria.magnusson@liu.se Computer Vision Laboratory, Department of Electrical Engineering, Linköping University, Sweden Report No: LiTH-ISY-R-3070

More information

CS231A Course Notes 4: Stereo Systems and Structure from Motion

CS231A Course Notes 4: Stereo Systems and Structure from Motion CS231A Course Notes 4: Stereo Systems and Structure from Motion Kenji Hata and Silvio Savarese 1 Introduction In the previous notes, we covered how adding additional viewpoints of a scene can greatly enhance

More information

3D Sensing. 3D Shape from X. Perspective Geometry. Camera Model. Camera Calibration. General Stereo Triangulation.

3D Sensing. 3D Shape from X. Perspective Geometry. Camera Model. Camera Calibration. General Stereo Triangulation. 3D Sensing 3D Shape from X Perspective Geometry Camera Model Camera Calibration General Stereo Triangulation 3D Reconstruction 3D Shape from X shading silhouette texture stereo light striping motion mainly

More information

Chapter 3 Image Registration. Chapter 3 Image Registration

Chapter 3 Image Registration. Chapter 3 Image Registration Chapter 3 Image Registration Distributed Algorithms for Introduction (1) Definition: Image Registration Input: 2 images of the same scene but taken from different perspectives Goal: Identify transformation

More information

EE795: Computer Vision and Intelligent Systems

EE795: Computer Vision and Intelligent Systems EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 FDH 204 Lecture 14 130307 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Stereo Dense Motion Estimation Translational

More information

1 (5 max) 2 (10 max) 3 (20 max) 4 (30 max) 5 (10 max) 6 (15 extra max) total (75 max + 15 extra)

1 (5 max) 2 (10 max) 3 (20 max) 4 (30 max) 5 (10 max) 6 (15 extra max) total (75 max + 15 extra) Mierm Exam CS223b Stanford CS223b Computer Vision, Winter 2004 Feb. 18, 2004 Full Name: Email: This exam has 7 pages. Make sure your exam is not missing any sheets, and write your name on every page. The

More information

Two-view geometry Computer Vision Spring 2018, Lecture 10

Two-view geometry Computer Vision Spring 2018, Lecture 10 Two-view geometry http://www.cs.cmu.edu/~16385/ 16-385 Computer Vision Spring 2018, Lecture 10 Course announcements Homework 2 is due on February 23 rd. - Any questions about the homework? - How many of

More information

3D Sensing and Reconstruction Readings: Ch 12: , Ch 13: ,

3D Sensing and Reconstruction Readings: Ch 12: , Ch 13: , 3D Sensing and Reconstruction Readings: Ch 12: 12.5-6, Ch 13: 13.1-3, 13.9.4 Perspective Geometry Camera Model Stereo Triangulation 3D Reconstruction by Space Carving 3D Shape from X means getting 3D coordinates

More information

Stereo Image Rectification for Simple Panoramic Image Generation

Stereo Image Rectification for Simple Panoramic Image Generation Stereo Image Rectification for Simple Panoramic Image Generation Yun-Suk Kang and Yo-Sung Ho Gwangju Institute of Science and Technology (GIST) 261 Cheomdan-gwagiro, Buk-gu, Gwangju 500-712 Korea Email:{yunsuk,

More information

Final Exam Study Guide

Final Exam Study Guide Final Exam Study Guide Exam Window: 28th April, 12:00am EST to 30th April, 11:59pm EST Description As indicated in class the goal of the exam is to encourage you to review the material from the course.

More information

Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong)

Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong) Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong) References: [1] http://homepages.inf.ed.ac.uk/rbf/hipr2/index.htm [2] http://www.cs.wisc.edu/~dyer/cs540/notes/vision.html

More information

arxiv: v1 [cs.cv] 28 Sep 2018

arxiv: v1 [cs.cv] 28 Sep 2018 Camera Pose Estimation from Sequence of Calibrated Images arxiv:1809.11066v1 [cs.cv] 28 Sep 2018 Jacek Komorowski 1 and Przemyslaw Rokita 2 1 Maria Curie-Sklodowska University, Institute of Computer Science,

More information

Terrain Rendering using Multiple Optimally Adapting Meshes (MOAM)

Terrain Rendering using Multiple Optimally Adapting Meshes (MOAM) Examensarbete LITH-ITN-MT-EX--04/018--SE Terrain Rendering using Multiple Optimally Adapting Meshes (MOAM) Mårten Larsson 2004-02-23 Department of Science and Technology Linköpings Universitet SE-601 74

More information

An idea which can be used once is a trick. If it can be used more than once it becomes a method

An idea which can be used once is a trick. If it can be used more than once it becomes a method An idea which can be used once is a trick. If it can be used more than once it becomes a method - George Polya and Gabor Szego University of Texas at Arlington Rigid Body Transformations & Generalized

More information

CS201 Computer Vision Camera Geometry

CS201 Computer Vision Camera Geometry CS201 Computer Vision Camera Geometry John Magee 25 November, 2014 Slides Courtesy of: Diane H. Theriault (deht@bu.edu) Question of the Day: How can we represent the relationships between cameras and the

More information

A Low Power, High Throughput, Fully Event-Based Stereo System: Supplementary Documentation

A Low Power, High Throughput, Fully Event-Based Stereo System: Supplementary Documentation A Low Power, High Throughput, Fully Event-Based Stereo System: Supplementary Documentation Alexander Andreopoulos, Hirak J. Kashyap, Tapan K. Nayak, Arnon Amir, Myron D. Flickner IBM Research March 25,

More information

MERGING POINT CLOUDS FROM MULTIPLE KINECTS. Nishant Rai 13th July, 2016 CARIS Lab University of British Columbia

MERGING POINT CLOUDS FROM MULTIPLE KINECTS. Nishant Rai 13th July, 2016 CARIS Lab University of British Columbia MERGING POINT CLOUDS FROM MULTIPLE KINECTS Nishant Rai 13th July, 2016 CARIS Lab University of British Columbia Introduction What do we want to do? : Use information (point clouds) from multiple (2+) Kinects

More information

Outline 7/2/201011/6/

Outline 7/2/201011/6/ Outline Pattern recognition in computer vision Background on the development of SIFT SIFT algorithm and some of its variations Computational considerations (SURF) Potential improvement Summary 01 2 Pattern

More information

3D Perception. CS 4495 Computer Vision K. Hawkins. CS 4495 Computer Vision. 3D Perception. Kelsey Hawkins Robotics

3D Perception. CS 4495 Computer Vision K. Hawkins. CS 4495 Computer Vision. 3D Perception. Kelsey Hawkins Robotics CS 4495 Computer Vision Kelsey Hawkins Robotics Motivation What do animals, people, and robots want to do with vision? Detect and recognize objects/landmarks Find location of objects with respect to themselves

More information

Applications Video Surveillance (On-line or off-line)

Applications Video Surveillance (On-line or off-line) Face Face Recognition: Dimensionality Reduction Biometrics CSE 190-a Lecture 12 CSE190a Fall 06 CSE190a Fall 06 Face Recognition Face is the most common biometric used by humans Applications range from

More information

Principal Component Analysis (PCA) is a most practicable. statistical technique. Its application plays a major role in many

Principal Component Analysis (PCA) is a most practicable. statistical technique. Its application plays a major role in many CHAPTER 3 PRINCIPAL COMPONENT ANALYSIS ON EIGENFACES 2D AND 3D MODEL 3.1 INTRODUCTION Principal Component Analysis (PCA) is a most practicable statistical technique. Its application plays a major role

More information

Visual Recognition: Image Formation

Visual Recognition: Image Formation Visual Recognition: Image Formation Raquel Urtasun TTI Chicago Jan 5, 2012 Raquel Urtasun (TTI-C) Visual Recognition Jan 5, 2012 1 / 61 Today s lecture... Fundamentals of image formation You should know

More information

Fundamentals of Stereo Vision Michael Bleyer LVA Stereo Vision

Fundamentals of Stereo Vision Michael Bleyer LVA Stereo Vision Fundamentals of Stereo Vision Michael Bleyer LVA Stereo Vision What Happened Last Time? Human 3D perception (3D cinema) Computational stereo Intuitive explanation of what is meant by disparity Stereo matching

More information

Miniature faking. In close-up photo, the depth of field is limited.

Miniature faking. In close-up photo, the depth of field is limited. Miniature faking In close-up photo, the depth of field is limited. http://en.wikipedia.org/wiki/file:jodhpur_tilt_shift.jpg Miniature faking Miniature faking http://en.wikipedia.org/wiki/file:oregon_state_beavers_tilt-shift_miniature_greg_keene.jpg

More information

Announcements. Recognition I. Gradient Space (p,q) What is the reflectance map?

Announcements. Recognition I. Gradient Space (p,q) What is the reflectance map? Announcements I HW 3 due 12 noon, tomorrow. HW 4 to be posted soon recognition Lecture plan recognition for next two lectures, then video and motion. Introduction to Computer Vision CSE 152 Lecture 17

More information

Depth. Common Classification Tasks. Example: AlexNet. Another Example: Inception. Another Example: Inception. Depth

Depth. Common Classification Tasks. Example: AlexNet. Another Example: Inception. Another Example: Inception. Depth Common Classification Tasks Recognition of individual objects/faces Analyze object-specific features (e.g., key points) Train with images from different viewing angles Recognition of object classes Analyze

More information

Linear Discriminant Analysis in Ottoman Alphabet Character Recognition

Linear Discriminant Analysis in Ottoman Alphabet Character Recognition Linear Discriminant Analysis in Ottoman Alphabet Character Recognition ZEYNEB KURT, H. IREM TURKMEN, M. ELIF KARSLIGIL Department of Computer Engineering, Yildiz Technical University, 34349 Besiktas /

More information

3D Geometry and Camera Calibration

3D Geometry and Camera Calibration 3D Geometry and Camera Calibration 3D Coordinate Systems Right-handed vs. left-handed x x y z z y 2D Coordinate Systems 3D Geometry Basics y axis up vs. y axis down Origin at center vs. corner Will often

More information

Structure from motion

Structure from motion Structure from motion Structure from motion Given a set of corresponding points in two or more images, compute the camera parameters and the 3D point coordinates?? R 1,t 1 R 2,t R 2 3,t 3 Camera 1 Camera

More information

Dense 3D Reconstruction. Christiano Gava

Dense 3D Reconstruction. Christiano Gava Dense 3D Reconstruction Christiano Gava christiano.gava@dfki.de Outline Previous lecture: structure and motion II Structure and motion loop Triangulation Today: dense 3D reconstruction The matching problem

More information

Recognition of Non-symmetric Faces Using Principal Component Analysis

Recognition of Non-symmetric Faces Using Principal Component Analysis Recognition of Non-symmetric Faces Using Principal Component Analysis N. Krishnan Centre for Information Technology & Engineering Manonmaniam Sundaranar University, Tirunelveli-627012, India Krishnan17563@yahoo.com

More information

Robust Pose Estimation using the SwissRanger SR-3000 Camera

Robust Pose Estimation using the SwissRanger SR-3000 Camera Robust Pose Estimation using the SwissRanger SR- Camera Sigurjón Árni Guðmundsson, Rasmus Larsen and Bjarne K. Ersbøll Technical University of Denmark, Informatics and Mathematical Modelling. Building,

More information

AN EXAMINING FACE RECOGNITION BY LOCAL DIRECTIONAL NUMBER PATTERN (Image Processing)

AN EXAMINING FACE RECOGNITION BY LOCAL DIRECTIONAL NUMBER PATTERN (Image Processing) AN EXAMINING FACE RECOGNITION BY LOCAL DIRECTIONAL NUMBER PATTERN (Image Processing) J.Nithya 1, P.Sathyasutha2 1,2 Assistant Professor,Gnanamani College of Engineering, Namakkal, Tamil Nadu, India ABSTRACT

More information

Model Based Perspective Inversion

Model Based Perspective Inversion Model Based Perspective Inversion A. D. Worrall, K. D. Baker & G. D. Sullivan Intelligent Systems Group, Department of Computer Science, University of Reading, RG6 2AX, UK. Anthony.Worrall@reading.ac.uk

More information

5LSH0 Advanced Topics Video & Analysis

5LSH0 Advanced Topics Video & Analysis 1 Multiview 3D video / Outline 2 Advanced Topics Multimedia Video (5LSH0), Module 02 3D Geometry, 3D Multiview Video Coding & Rendering Peter H.N. de With, Sveta Zinger & Y. Morvan ( p.h.n.de.with@tue.nl

More information

Dense 3D Reconstruction. Christiano Gava

Dense 3D Reconstruction. Christiano Gava Dense 3D Reconstruction Christiano Gava christiano.gava@dfki.de Outline Previous lecture: structure and motion II Structure and motion loop Triangulation Wide baseline matching (SIFT) Today: dense 3D reconstruction

More information

Recognition, SVD, and PCA

Recognition, SVD, and PCA Recognition, SVD, and PCA Recognition Suppose you want to find a face in an image One possibility: look for something that looks sort of like a face (oval, dark band near top, dark band near bottom) Another

More information

Dr. Enrique Cabello Pardos July

Dr. Enrique Cabello Pardos July Dr. Enrique Cabello Pardos July 20 2011 Dr. Enrique Cabello Pardos July 20 2011 ONCE UPON A TIME, AT THE LABORATORY Research Center Contract Make it possible. (as fast as possible) Use the best equipment.

More information

Institutionen för systemteknik

Institutionen för systemteknik Institutionen för systemteknik Department of Electrical Engineering Examensarbete 3D Position Estimation of a Person of Interest in Multiple Video Sequences: Person of Interest Recognition Examensarbete

More information

Modern Medical Image Analysis 8DC00 Exam

Modern Medical Image Analysis 8DC00 Exam Parts of answers are inside square brackets [... ]. These parts are optional. Answers can be written in Dutch or in English, as you prefer. You can use drawings and diagrams to support your textual answers.

More information

Epipolar Geometry and Stereo Vision

Epipolar Geometry and Stereo Vision Epipolar Geometry and Stereo Vision Computer Vision Jia-Bin Huang, Virginia Tech Many slides from S. Seitz and D. Hoiem Last class: Image Stitching Two images with rotation/zoom but no translation. X x

More information

DTU M.SC. - COURSE EXAM Revised Edition

DTU M.SC. - COURSE EXAM Revised Edition Written test, 16 th of December 1999. Course name : 04250 - Digital Image Analysis Aids allowed : All usual aids Weighting : All questions are equally weighed. Name :...................................................

More information

MAPI Computer Vision. Multiple View Geometry

MAPI Computer Vision. Multiple View Geometry MAPI Computer Vision Multiple View Geometry Geometry o Multiple Views 2- and 3- view geometry p p Kpˆ [ K R t]p Geometry o Multiple Views 2- and 3- view geometry Epipolar Geometry The epipolar geometry

More information

FACE RECOGNITION USING SUPPORT VECTOR MACHINES

FACE RECOGNITION USING SUPPORT VECTOR MACHINES FACE RECOGNITION USING SUPPORT VECTOR MACHINES Ashwin Swaminathan ashwins@umd.edu ENEE633: Statistical and Neural Pattern Recognition Instructor : Prof. Rama Chellappa Project 2, Part (b) 1. INTRODUCTION

More information

Real-Time Human Detection using Relational Depth Similarity Features

Real-Time Human Detection using Relational Depth Similarity Features Real-Time Human Detection using Relational Depth Similarity Features Sho Ikemura, Hironobu Fujiyoshi Dept. of Computer Science, Chubu University. Matsumoto 1200, Kasugai, Aichi, 487-8501 Japan. si@vision.cs.chubu.ac.jp,

More information

Visual Perception Sensors

Visual Perception Sensors G. Glaser Visual Perception Sensors 1 / 27 MIN Faculty Department of Informatics Visual Perception Sensors Depth Determination Gerrit Glaser University of Hamburg Faculty of Mathematics, Informatics and

More information

Institutionen för systemteknik

Institutionen för systemteknik Institutionen för systemteknik Department of Electrical Engineering Examensarbete Design and Implementation of a DMA Controller for Digital Signal Processor Examensarbete utfört i Datorteknik vid Tekniska

More information

Calibration of a Different Field-of-view Stereo Camera System using an Embedded Checkerboard Pattern

Calibration of a Different Field-of-view Stereo Camera System using an Embedded Checkerboard Pattern Calibration of a Different Field-of-view Stereo Camera System using an Embedded Checkerboard Pattern Pathum Rathnayaka, Seung-Hae Baek and Soon-Yong Park School of Computer Science and Engineering, Kyungpook

More information

ECE 470: Homework 5. Due Tuesday, October 27 in Seth Hutchinson. Luke A. Wendt

ECE 470: Homework 5. Due Tuesday, October 27 in Seth Hutchinson. Luke A. Wendt ECE 47: Homework 5 Due Tuesday, October 7 in class @:3pm Seth Hutchinson Luke A Wendt ECE 47 : Homework 5 Consider a camera with focal length λ = Suppose the optical axis of the camera is aligned with

More information

Unit 3 Multiple View Geometry

Unit 3 Multiple View Geometry Unit 3 Multiple View Geometry Relations between images of a scene Recovering the cameras Recovering the scene structure http://www.robots.ox.ac.uk/~vgg/hzbook/hzbook1.html 3D structure from images Recover

More information

All human beings desire to know. [...] sight, more than any other senses, gives us knowledge of things and clarifies many differences among them.

All human beings desire to know. [...] sight, more than any other senses, gives us knowledge of things and clarifies many differences among them. All human beings desire to know. [...] sight, more than any other senses, gives us knowledge of things and clarifies many differences among them. - Aristotle University of Texas at Arlington Introduction

More information

Face Recognition using Laplacianfaces

Face Recognition using Laplacianfaces Journal homepage: www.mjret.in ISSN:2348-6953 Kunal kawale Face Recognition using Laplacianfaces Chinmay Gadgil Mohanish Khunte Ajinkya Bhuruk Prof. Ranjana M.Kedar Abstract Security of a system is an

More information

Image Based Feature Extraction Technique For Multiple Face Detection and Recognition in Color Images

Image Based Feature Extraction Technique For Multiple Face Detection and Recognition in Color Images Image Based Feature Extraction Technique For Multiple Face Detection and Recognition in Color Images 1 Anusha Nandigam, 2 A.N. Lakshmipathi 1 Dept. of CSE, Sir C R Reddy College of Engineering, Eluru,

More information

Image Formation. Antonino Furnari. Image Processing Lab Dipartimento di Matematica e Informatica Università degli Studi di Catania

Image Formation. Antonino Furnari. Image Processing Lab Dipartimento di Matematica e Informatica Università degli Studi di Catania Image Formation Antonino Furnari Image Processing Lab Dipartimento di Matematica e Informatica Università degli Studi di Catania furnari@dmi.unict.it 18/03/2014 Outline Introduction; Geometric Primitives

More information

The Curse of Dimensionality

The Curse of Dimensionality The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more

More information

METRIC PLANE RECTIFICATION USING SYMMETRIC VANISHING POINTS

METRIC PLANE RECTIFICATION USING SYMMETRIC VANISHING POINTS METRIC PLANE RECTIFICATION USING SYMMETRIC VANISHING POINTS M. Lefler, H. Hel-Or Dept. of CS, University of Haifa, Israel Y. Hel-Or School of CS, IDC, Herzliya, Israel ABSTRACT Video analysis often requires

More information

SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS

SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS Cognitive Robotics Original: David G. Lowe, 004 Summary: Coen van Leeuwen, s1460919 Abstract: This article presents a method to extract

More information

Announcements. Stereo

Announcements. Stereo Announcements Stereo Homework 2 is due today, 11:59 PM Homework 3 will be assigned today Reading: Chapter 7: Stereopsis CSE 152 Lecture 8 Binocular Stereopsis: Mars Given two images of a scene where relative

More information

Last week. Multi-Frame Structure from Motion: Multi-View Stereo. Unknown camera viewpoints

Last week. Multi-Frame Structure from Motion: Multi-View Stereo. Unknown camera viewpoints Last week Multi-Frame Structure from Motion: Multi-View Stereo Unknown camera viewpoints Last week PCA Today Recognition Today Recognition Recognition problems What is it? Object detection Who is it? Recognizing

More information