Jaroslav Moravec. Object recognition using 3D convolutional neural networks


BACHELOR THESIS

Jaroslav Moravec

Object recognition using 3D convolutional neural networks

Department of Software Engineering

Supervisor of the bachelor thesis: RNDr. Jakub Lokoč, Ph.D.
Study programme: Computer Science
Study branch: ISDI

Prague 2017

I declare that I carried out this bachelor thesis independently, and only with the cited sources, literature and other professional sources. I understand that my work relates to the rights and obligations under the Act No. 121/2000 Sb., the Copyright Act, as amended, in particular the fact that Charles University has the right to conclude a license agreement on the use of this work as a school work pursuant to Section 60 subsection 1 of the Copyright Act.

In... date... signature of the author

Title: Object recognition using 3D convolutional neural networks

Author: Jaroslav Moravec (Computer Science), Department of Software Engineering

Supervisor: RNDr. Jakub Lokoč, Ph.D., Department of Software Engineering

Abstract: With the fast development of laser and sensor technologies, it has become easy to scan a real-world object and save it in a digital format into a persistent database. With the rising number of scanned 3D objects, data management and retrieval methods become necessary, and for various retrieval tasks, effective retrieval models are required. In our work, we focus on effective classification and similarity search. The investigated approach is based on convolutional neural networks, a machine learning method that has boomed in recent years. We have designed and trained several architectures of 3D convolutional neural networks and tested them on state-of-the-art benchmark 3D datasets for 3D object recognition and retrieval. We were also able to show that features trained on one dataset can be used to predict class labels on another 3D dataset.

Keywords: Object recognition, 3D convolution, neural networks

Contents

1 Introduction
2 3D Datasets
3 Search in 3D Datasets
4 Classification and Search Using DCNN
  4.1 DCNN
    4.1.1 Artificial Neural Networks
    4.1.2 Convolutional Layer
    4.1.3 Pooling Layer
    4.1.4 Local Response Normalization
    4.1.5 Fully Connected Layer
    4.1.6 Dropout
  4.2 Transformation of Model for DCNN
  4.3 Used CNN Architectures
  4.4 Object retrieval
    4.4.1 Using Our Classifier
    4.4.2 Using Similarity Search
5 Learning of DCNN
  5.1 Motivation
  5.2 Gradient Descent Optimization
    5.2.1 Gradient Descent Variants
    5.2.2 Gradient Descent Optimization Algorithms
6 Experiments
  6.1 Object Recognition and Retrieval (SHREC16)
  6.2 Object Retrieval (SHREC15)
  6.3 Object Recognition and Retrieval (ModelNet10)
7 Conclusion and Future work
Bibliography
List of Figures
List of Tables

List of Abbreviations
Attachments

1. Introduction

As mankind has always wanted to create technology that would help with work that is hard or even impossible for a single person, the fields of robotics and computer vision were established and have developed over the last decades. Nowadays, these fields are developing new autonomous robots, machines and cars for the 3D world. We therefore also need to develop methods to teach computers (i.e., machine "brains") to understand real-world objects, environments and situations. The goal of this thesis is to describe and implement one of these methods, the 3D convolutional neural network, and compare its results on various datasets with state-of-the-art methods.

During the last decades, many types of classifiers were designed to address a classification problem Obj → C, where Obj is a set of 3D objects and C is a finite set of classes. Methods can be based on the appearance of the object, e.g. edge, gradient or grayscale matching. Such methods mostly rely on information gathered from pre-computed projections of an object and thus do not take the 3D geometric shape of the object into consideration. These are probably the oldest approaches and their results cannot compete with present feature-based methods. Feature-based methods extract features from pre-captured views of the object (e.g. corners, surface patches) and therefore also take the 3D shape of the object into account. These feature vectors are then matched to decide to which class the object belongs. In this thesis, we describe and implement one of the feature-based methods for object recognition: convolutional neural networks (CNN).

The thesis is organized as follows. In chapter 2, we describe the 3D datasets that were used for training our convolutional neural network architectures in chapter 6. In chapter 3, we discuss object retrieval approaches and important definitions; these are helpful in the following chapters, especially in section 4.4 and chapter 6. Chapter 4 contains all definitions and algorithms necessary to understand convolutional neural networks; we tried to describe even the concepts that are hard to grasp so that they are easily readable for anyone interested in this topic. The insight from this chapter is important for understanding chapter 5 and thus the whole concept of learning of convolutional neural networks and why these black-box algorithms really work; it also includes the theory behind chapter 6, which refers back to it. Chapter 5 contains important information about the learning of convolutional neural networks: it describes the most common type of optimizer, gradient descent, and some of its variants; some of these optimizers are then used in chapter 6. Chapter 6, probably the most important one, contains the parameters and hyperparameters of our architectures and their results, which are described and discussed in detail. Conclusions and future work are presented in chapter 7.

2. 3D Datasets

With the development of new sensors, it is becoming easier to scan a real-world 3D model in everyday life and store it in a digital format. With the rising number of stored 3D models, novel methods for 3D data management and retrieval are also required. This is the purpose of the SHREC competition, which is held by well-known universities every year. In this work, we will use the ModelNet10 dataset and two datasets from the competitions SHREC 16 and SHREC 15. These datasets are used for the shape recognition described in chapter 3 and chapter 4. Since the structures of the datasets differ, we describe them separately.

According to the web page of the organizer of SHREC 15, the dataset (SHREC15 [2015]) contains 229 labeled objects from nine classes and other unclassified objects from several publicly available shape collections. After obtaining the first large-scale set of shapes, the organizers applied a careful post-processing step in order to repair non-manifold objects and merge objects with more than one connected component. Their final dataset only contains manifold objects with one connected component. This pre-processing step guarantees that most of the current approaches work with the dataset.

The SHREC 16 competition uses the ShapeNetCore subset of ShapeNet (SHREC16 [2016]), which contains about 51,300 3D models from 55 common categories. For the competition, the dataset was divided into train, validation and test parts in a ratio of 70/10/20. The competition has two levels of difficulty: the normal and the perturbed data. The normal data are consistently aligned with respect to the Cartesian axes, while the perturbed data are randomly rotated. In this work, we trained all our classifiers and retrieval models using the normal dataset.

As described on the web page ModelNet [2016], ModelNet (Wu et al. [2014]) is a project of Princeton University whose goal is to provide a comprehensive, clean collection of 3D CAD models of objects. The organizers chose 10 common categories of objects and collected models belonging to each category using online search engines by querying for each object category term. They then hired human workers to manually decide whether each CAD model belongs to the specified category. Furthermore, they manually aligned the orientation of the CAD models for this 10-class subset as well. We will use this dataset for our experiments; it contains models from 10 classes split into a training part and a test part of 908 models. The comparison of our results with the state of the art is discussed in chapter 6.

3. Search in 3D Datasets

In this section, we will follow the explanations presented in Bustos et al. [2005]. The problem of efficient and effective search in databases of 3D objects arises in many domains, for example:

Medical domain: 3D shape retrieval can be used for the detection of organ deformations and thus for diagnostic purposes.

Molecular biology: 3D retrieval approaches are used for structural classification, where molecules and proteins are modeled as 3D objects.

Meteorology: Similarity search in 3D data has been used to warn people allergic to different kinds of pollen. A confocal laser scan from a microscope gives 3D volumetric data of the pollen, from which its structure can be extracted. Based on this structure we can build a classifier for different pollen types.

Computer aided design: Retrieval in 3D databases can be used to support CAD tools, which are frequently used in manufacturing. When a new product is designed, it can be built from smaller 3D objects that are already in the database. Or, if some part of a 3D object needs to be substituted, e.g. to reduce costs, it can be replaced by a similar part from the database.

Army: 3D shape retrieval can be used for the classical friend/foe detection problem. The shape of an unidentified object is compared to shapes in the database; based on the result, we can say whether the object is a friend or a foe.

Movies and video games: Producers make heavy use of 3D models to enhance realism. Similarity search can be used on existing databases for the adaptation and reuse of 3D objects.

As we can see, there are diverse fields of usage for shape retrieval in 3D data, and so there are also different approaches to 3D data representation, manipulation and presentation. A complex 3D object can be represented as a set of smaller primitives that are combined into one. 3D acquisition devices usually produce voxelized object approximations or 3D point clouds, but other representations, like 3D grammars, also exist. Probably the most widely used representation is the approximation of a 3D object with a mesh of polygons (usually triangles). Basically, all mentioned representations can be used for 3D shape retrieval or can be converted to another representation suitable for similarity retrieval.

For decades, the similarity search of 3D shapes and their description was studied in the fields of computer vision, shape analysis and computational geometry. In computer vision, we usually try to segment a 3D object into 2D images and then match these segments to a set of a priori known reference 2D objects. Problems can obviously arise with invariance of the input (lighting conditions, view perspective, clutter, occlusion). But the decision problem itself is also difficult: What is the similarity notion?

What is the similarity threshold? How much tolerance is sustainable in a given application context, and which answer set sizes are required? A key part of the object retrieval task is also its efficiency, because we want to be able to search large databases quickly.

Feature vector paradigm

The feature vector paradigm is a standard method for multimedia retrieval when we do not know how to compare two objects directly. As complex and unstructured objects (like 3D models) from a universe Obj cannot be directly compared to each other, a simplified descriptor universe U is defined, consisting only of extracted (and potentially aggregated) important features of the objects, Lokoč [2010]. Assuming we have defined certain aspects of our 3D object, all these aspects are used to form a feature vector (descriptor) of this object, usually of very high dimension. Note that feature vectors can be indexed for more efficient retrieval. The resulting feature vectors should describe important characteristics of the modeled 3D object, which are determined by the utilized extraction method. Extraction methods can consider:

- properties of the 3D object's bounding box
- the distribution of normal vectors or curvature
- the Fourier transform of some spherical functions that characterize the object

It is naturally hard to find the right extraction method for a given similarity search task because no approach is suitable for all tasks at once. Every extraction method captures a different characteristic of the 3D object and so gives different results.

Definition 3.1. An extraction function e : Obj → U transforms a multimedia object from the database universe Obj into a descriptor in the descriptor universe U.

We usually do not work with the whole object universe Obj, but with a small subset X ⊆ Obj. Similarly, we define a descriptor subset S with respect to the original database X as S ⊆ U. After we choose an appropriate extraction method, the feature vector of every object in the database has to be evaluated. If we want to decide how similar one 3D object is to another, we only need to use a suitable distance function on the feature vectors of those two 3D objects. We can then produce a ranking of all database objects in ascending order with respect to their distance to a query object.

Definition 3.2. The distance measure of two 3D objects defined by their descriptors is a non-negative real number. Generally:

δ : U × U → R⁺₀

Smaller values of δ for two objects denote higher similarity. There are then two types of similarity queries in the descriptor database S:

Figure 3.1: The query relevant and retrieved objects visualization (the collection, the set A of retrieved objects and the set R of relevant objects)

Range queries: A range query range(o, r) returns, for some value r, all objects (descriptors) that are within distance r from o:

range(o, r) = {u ∈ S : δ(u, o) ≤ r}

k-nearest neighbors (k-NN) queries: Returns the k most similar objects (descriptors) from S to o, i.e. it returns the set kNN(o) = C such that:

C ⊆ S, |C| = k, ∀c ∈ C, ∀u ∈ S \ C : δ(o, c) ≤ δ(o, u)

An important family of similarity functions in vector spaces is the Minkowski family L_s, defined as:

L_s(v¹, v²) = (Σ_i |v¹_i − v²_i|^s)^{1/s}

where v¹ and v² are feature vectors from R^d and s ≥ 1. The most used functions from this family are the Manhattan distance L_1, the Euclidean distance L_2 and the maximum distance L_∞ = max_{1≤i≤d} |v¹_i − v²_i|.

Let us now introduce the notation according to fig. 3.1:

R ... the set of relevant objects
A ... the set of retrieved objects
R ∩ A ... the set of all retrieved objects that are relevant

Wesley [2010]

With respect to fig. 3.1, we can state important definitions for similarity search, which will be used in the following chapters.

Definition 3.3. Precision is the fraction of the number of retrieved objects that are relevant to the number of all retrieved objects:

precision = |R ∩ A| / |A|

Definition 3.4. Recall is the fraction of the number of retrieved objects that are relevant to the number of all relevant objects:

recall = |R ∩ A| / |R|
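The two query types and the Minkowski distances above are easy to make concrete. The following minimal Python sketch (the thesis's implementation language; the helper names and the toy descriptor set are illustrative, not taken from the thesis) evaluates a range query and a k-NN query over a small descriptor set:

```python
import numpy as np

def minkowski(v1, v2, s=2):
    """Minkowski distance L_s between two feature vectors (s = 2 is Euclidean)."""
    return np.sum(np.abs(v1 - v2) ** s) ** (1.0 / s)

def range_query(query, descriptors, r, s=2):
    """Return indices of all descriptors within distance r of the query."""
    return [i for i, u in enumerate(descriptors) if minkowski(query, u, s) <= r]

def knn_query(query, descriptors, k, s=2):
    """Return indices of the k descriptors closest to the query."""
    dists = [minkowski(query, u, s) for u in descriptors]
    return list(np.argsort(dists)[:k])

# Toy example: five 2-dimensional descriptors and one query descriptor.
S = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [0.5, 0.5]])
q = np.array([0.2, 0.2])
print(range_query(q, S, r=1.0))   # [0, 1, 4]
print(knn_query(q, S, k=2))       # the two nearest descriptors: [0, 4]
```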

Figure 3.2: The precision-recall curve, from Stanford [b]

Definition 3.5. The accuracy of an information retrieval system is the fraction of classifications that are correct:

accuracy = (|R ∩ A| + |R̄ ∩ Ā|) / (|R ∩ A| + |R ∩ Ā| + |R̄ ∩ A| + |R̄ ∩ Ā|)

Manning et al. [2008]

When we compute the accuracy of a classifier on some database, it is thus the fraction of all objects in the database whose classification is correct. We can express precision as a function p(r) of recall (visualized in fig. 3.2) and then define, following Su et al. [2015]:

Definition 3.6. (Average precision) computes the average value of p(r) over the interval from r = 0 to r = 1:

AveP = ∫₀¹ p(r) dr

Definition 3.7. Mean average precision is the mean of the average precision over all queries in Q:

MAP = (Σ_{q∈Q} AveP(q)) / |Q|

Beitzel et al. [2009]
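A small sketch of how these retrieval measures can be computed from a ranked result list. This is a discrete approximation of definitions 3.3-3.7 (the function and variable names are ours, not from the thesis):

```python
import numpy as np

def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set A against a relevant set R."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

def average_precision(ranking, relevant):
    """Discrete approximation of AveP: average the precision at each relevant hit."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for rank, obj in enumerate(ranking, start=1):
        if obj in relevant:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(rankings, relevants):
    """MAP over a set of queries, each given by its ranking and its relevant set."""
    return float(np.mean([average_precision(r, rel)
                          for r, rel in zip(rankings, relevants)]))

# A query whose relevant objects are {1, 3}; ranking returned by some system.
print(average_precision([3, 7, 1, 9], relevant=[1, 3]))  # (1/1 + 2/3) / 2 = 0.833...
```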

4. Classification and Search Using DCNN

4.1 DCNN

In this section, we present deep convolutional neural networks, including their:

- main building block (the neuron)
- standard layer-wise organization
- forward propagation and backpropagation algorithms

The layers of convolutional neural networks are defined in the rest of the section.

4.1.1 Artificial Neural Networks

Motivation

As we can find in CS231n [2017a], artificial neural networks were developed for modeling biological neural systems. Their basic computational unit was named after its equivalent in the brain: the neuron. We will describe both systems with respect to fig. 4.1. Each neuron receives input signals from its dendrites and creates an output signal, which is then transmitted through its axon. The axon branches out and connects to the dendrites of other neurons. If the sum of all input signals is greater than a threshold, the neuron fires and sends a signal along its axon.

In the computational model used by artificial neural networks, the signal (x_i) travels from one neuron through a connection with a specific strength (w_i) to a second neuron. The multiplication of the connection strength w_i (called the weight of the connection) and the signal x_i gives one of the inputs to the second neuron. The sum of all input signals Σ_i x_i w_i plus a bias value b of the neuron is not compared with a threshold (as in the biological case); instead, we apply some activation function to it. The result of the activation function is the output of the neuron. The network is taught features of the input (by changing the weights of the connections) to make its prediction closer to the desired output. CS231n [2017a]

Neuron

We will use similar notation and some definitions from Schmid [2011].

Definition 4.1. A neuron is a triple (f, w, b), where:

f : R → R is an activation function (e.g. sigmoid, tanh, ReLU, described below in this section 4.1.1)
w ∈ R^n is a vector of weights
b ∈ R is a bias

(a) The biological neuron, from CS231n [2017c] (b) The computational model, from CS231n [2017d]
Figure 4.1: The neuron

For a neuron input x ∈ R^n, its output y ∈ R is computed as:

y = f(x^T w + b)   (4.1)

Neurons can be connected with weighted links, thus creating an artificial neural network.

Definition 4.2. An artificial neural network is a pair (N, C), where:

N is a set of neurons
C ⊆ N × N is a set of oriented connections

Layer-wise organization

As defined in the previous subsection, an artificial neural network is a collection of neurons connected to one another. Primarily, neural networks are also organized into distinct layers:

Definition 4.3. A layer l is a subset of the neurons in N: l ⊆ N

Definition 4.4. A layer-wise organized artificial neural network with c layers is an artificial neural network in which:

- N is a set of neurons
- C ⊆ N × N is an acyclic set of oriented connections
- L = {l_0, l_1, ..., l_{c−1}} is a set of neural network layers where
  l_0 ∪ l_1 ∪ ... ∪ l_{c−1} = N
  ∀i ≠ j : l_i ∩ l_j = ∅
  (n_1, n_2) ∈ C ⟹ ∃i : (n_1 ∈ l_i ∧ n_2 ∈ l_{i+1})
- the first layer is called the input layer and the last layer is called the output layer
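Definition 4.1 and eq. (4.1) translate directly into a few lines of Python (numpy). This is an illustrative sketch with made-up weights, not code from the thesis:

```python
import numpy as np

def neuron(x, w, b, f=lambda z: max(0.0, z)):
    """Output of a single neuron (f, w, b) for input x: y = f(x^T w + b)."""
    return f(float(np.dot(x, w)) + b)

print(neuron(np.array([0.5, 0.3]), np.array([0.2, 0.4]), b=0.6))  # ReLU(0.82) = 0.82
```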

Figure 4.2: The layer-wise organization (a 2-layer network with inputs x_0, x_1, weighted sums z^l_i, activations a^l_i and outputs y_0, y_1)

Definition 4.5. An artificial neural network with c layers is called a (c − 1)-layer neural network.

Definition 4.6. The cardinality of a layer l (|l|) is equal to the number of neurons in this layer.

The most commonly used layer type is a fully-connected layer, where all neurons from one layer are connected with every neuron in the adjacent layer (there is no connection between neurons in the same layer). In fig. 4.2, there is a 2-layer neural network example with one input layer, one hidden layer, one output layer, two inputs and two outputs.

Remark. In the following derivations, we will treat the bias in a special way. The reason is that for a specific layer the bias can be simulated as a new neuron with output 1 that is connected to all neurons in the layer. The weights of these connections can be modified so that each connected neuron gets a different input from the bias neuron. J. Matas [2015]

Forward propagation

We will use the notation suggested in the presentation of prof. J. Matas [2015] and visualized in fig. 4.2:

w^l_{i,j} is the weight of the connection between the i-th neuron of the (l − 1)-th layer and the j-th neuron of the l-th layer
z^l_i = Σ_j a^{l−1}_j w^l_{j,i} is the weighted sum of inputs into the i-th neuron in the l-th layer
f : R → R is an activation function
a^l_i = f(z^l_i) is the activation of the i-th neuron in the l-th layer

Let us assume we need to compute the input z^k_j of the j-th neuron in the k-th layer:

z^k_j = Σ_{i ∈ {0, ..., |l_{k−1}| − 1}} w^k_{i,j} a^{k−1}_i   (4.2)

To compute the output a^k_j of the j-th neuron in the k-th layer, we use the following equation:

a^k_j = f(z^k_j)   (4.3)

Algorithm 4.1. (Forward propagation algorithm) Let x ∈ R^n be the input of a c-layer artificial neural network, where n is equal to |l_0|. We compute the output y ∈ R^m, where m is equal to |l_c|, of the network as follows:

1. for h = 0 to n − 1:
2.     a^0_h = x_h
3. for k = 1 to c:
4.     for h = 0 to |l_k| − 1:
5.         compute the input z^k_h using eq. (4.2)
6.         compute the output a^k_h using eq. (4.3)
7. for h = 0 to m − 1:
8.     y_h = a^c_h
9. return y

J. Matas [2015]

In fig. 4.3, we show an example of the forward propagation on the same network architecture as in fig. 4.2, with the input x = (0.5, 0.3), the ReLU activation function f(x) = max(0, x) (section 4.1.1) and the weights:

w^1_{0,0} = 0.2, w^1_{1,0} = 0.4, w^1_{2,0} = 0.6
w^1_{0,1} = 0.1, w^1_{1,1} = 0.3, w^1_{2,1} = 0.5
w^2_{0,0} = 0.4, w^2_{1,0} = 0.6, w^2_{2,0} = 0.8
w^2_{0,1} = 0.7, w^2_{1,1} = 0.9, w^2_{2,1} = 0.1

Now we proceed according to the algorithm:

1. Assign the input:
   a^0_0 = x_0 = 0.5
   a^0_1 = x_1 = 0.3
2. Compute the input and output in the hidden layer:
   z^1_0 = w^1_{0,0} a^0_0 + w^1_{1,0} a^0_1 + w^1_{2,0} · 1 = 0.2 · 0.5 + 0.4 · 0.3 + 0.6 = 0.82
   a^1_0 = f(z^1_0) = 0.82
   z^1_1 = w^1_{0,1} a^0_0 + w^1_{1,1} a^0_1 + w^1_{2,1} · 1 = 0.1 · 0.5 + 0.3 · 0.3 + 0.5 = 0.64
   a^1_1 = f(z^1_1) = 0.64
3. Compute the input and output in the output layer:
   z^2_0 = w^2_{0,0} a^1_0 + w^2_{1,0} a^1_1 + w^2_{2,0} · 1 = 0.4 · 0.82 + 0.6 · 0.64 + 0.8 = 1.512
   a^2_0 = f(z^2_0) = 1.512
   z^2_1 = w^2_{0,1} a^1_0 + w^2_{1,1} a^1_1 + w^2_{2,1} · 1 = 0.7 · 0.82 + 0.9 · 0.64 + 0.1 = 1.25
   a^2_1 = f(z^2_1) = 1.25
4. Assign the output:
   y_0 = a^2_0
   y_1 = a^2_1
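The worked example above can be reproduced with a short numpy sketch of algorithm 4.1. The weights are stored per layer as matrices, with the bias folded in as the extra input 1 exactly as in the remark above; this is our illustrative code, not the thesis implementation:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights, f=relu):
    """Algorithm 4.1: propagate input x through layers given by weight matrices.
    weights[l][i, j] = w^l_{i,j}; the last row of each matrix holds the bias weights."""
    a = np.asarray(x, dtype=float)
    for W in weights:
        a_ext = np.append(a, 1.0)      # simulate the bias neuron with output 1
        z = a_ext @ W                  # z^l_j = sum_i w^l_{i,j} * a^{l-1}_i
        a = f(z)                       # a^l_j = f(z^l_j)
    return a

W1 = np.array([[0.2, 0.1],   # w^1_{0,0}, w^1_{0,1}
               [0.4, 0.3],   # w^1_{1,0}, w^1_{1,1}
               [0.6, 0.5]])  # bias weights w^1_{2,0}, w^1_{2,1}
W2 = np.array([[0.4, 0.7],
               [0.6, 0.9],
               [0.8, 0.1]])
print(forward([0.5, 0.3], [W1, W2]))  # [1.512, 1.25], as in the worked example
```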

Backpropagation algorithm

Figure 4.3: The forward propagation

We will derive the backpropagation algorithm following Makin [2006] and intersperse the derivation with the example from fig. 4.3. Every training algorithm for neural networks tries to change the weights of the connections in the network so that the predicted outputs for a set of inputs are close to the real ones. This closeness is defined by an error function E.

Definition 4.7. The input of a c-layer artificial neural network is x ∈ R^n, where n is equal to |l_0|.

Definition 4.8. The output of a c-layer artificial neural network is y ∈ R^m, where m is equal to |l_c|.

Definition 4.9. The target output of a c-layer artificial neural network is t ∈ R^m, where m is equal to |l_c|; t is the presumed output for the input x.

Definition 4.10. The training set T is a set of ordered pairs (x, t), where x is the input and t is the target output.

The algorithm iterates over every pair in the training set ((x, t) ∈ T) and performs four consecutive steps:

- Use forward propagation for the input x to compute the predicted output y
- Compute the error E from the predicted output y and the target output t
- Backpropagate the error signal and compute the partial derivatives of the parameters based on it
- Adapt the weights

Let T be a training set and (x, t) ∈ T be an input and a target output of a c-layer artificial neural network. The predicted output y of this neural network is the result of the forward propagation algorithm, defined in section 4.1.1, with the input x. We can define the error function:

E = (1/2) Σ_i (y_i − t_i)²   (4.4)

In this method, the weights are moved in the opposite direction of their derivative:

Δw^l_{i,j} = −α ∂E/∂w^l_{i,j}   (4.5)

The parameter α is called the learning rate and scales the step size. We can expand the partial derivative with the chain rule as follows:

∂E/∂w^l_{i,j} = ∂E/∂a^l_j · ∂a^l_j/∂z^l_j · ∂z^l_j/∂w^l_{i,j}   (4.6)

In the following derivations, we will use the first two fractions of the previous equation as a single quantity (the error term):

δ^l_j = ∂E/∂a^l_j · ∂a^l_j/∂z^l_j   (4.7)

We will consider three situations:

- the computation of the error signal on the output layer
- the computation of the derivative of the weights between the last hidden layer and the output layer
- the computation of the error signal on a hidden layer and the derivative of the weights between other layers

The error signal on the last layer

In the case that l is the output layer, this quantity can be computed as the derivative of eq. (4.4):

∂E/∂a^l_j = −(t_j − a^l_j)   (4.8)

since a^l_j = y_j from algorithm 4.1.

Example: We can now compute the error on the last layer for our example in fig. 4.4. Let us consider that the target output is t = (0.3, 0.5):

∂E/∂a^2_0 = (1.512 − 0.3) = 1.212,  ∂E/∂a^2_1 = (1.25 − 0.5) = 0.75

The derivative of weights between the last hidden and output layers

We now use eq. (4.6) to compute the derivative of the weights. Let us split the computation to make it clearer. We already derived the error signal on the output layer in eq. (4.8) above:

∂E/∂a^l_j = −(t_j − a^l_j)

As a^l_j = f(z^l_j), the derivative ∂a^l_j/∂z^l_j is only the derivative of the activation function:

∂a^l_j/∂z^l_j = f'(z^l_j)

Figure 4.4: The error signal on the last layer

We know that z^l_j is the weighted sum of the inputs into the j-th neuron of the l-th layer and hence:

∂z^l_j/∂w^l_{i,j} = a^{l−1}_i

Now we only combine it together:

∂E/∂w^l_{i,j} = −(t_j − a^l_j) · f'(z^l_j) · a^{l−1}_i   (4.9)

In convolutional neural networks, the most common activation function is the ReLU function, which is explained in more detail later in this section. Its derivative is equal to 1 if the input is greater than 0 and equal to 0 otherwise.

Example: We can now compute the derivative of the weights between the hidden layer and the last layer for our example in fig. 4.5. Consider that we want to compute ∂E/∂w^2_{0,0} and ∂E/∂w^2_{0,1}:

∂E/∂w^2_{0,0} = 1.212 · f'(1.512) · 0.82 ≈ 0.994
∂E/∂w^2_{0,1} = 0.75 · f'(1.25) · 0.82 = 0.615

Let us now divide the last situation into two separate tasks.

Figure 4.5: The derivative of weights between the hidden layer and the last layer

The error signal on a hidden layer

If we now suppose that layer l is a hidden layer, ∂E/∂a^l_j is harder to compute. We need to consider how the error from a^l_j was propagated to the activations of the next layer l + 1:

∂E/∂a^l_j = Σ_i ∂E/∂a^{l+1}_i · ∂a^{l+1}_i/∂z^{l+1}_i · ∂z^{l+1}_i/∂a^l_j   (4.10)

The first two derivatives form the error term of the next layer:

∂E/∂a^{l+1}_i · ∂a^{l+1}_i/∂z^{l+1}_i = δ^{l+1}_i

And as z^{l+1}_i = Σ_j a^l_j w^{l+1}_{j,i}, the last derivative is equal to:

∂z^{l+1}_i/∂a^l_j = w^{l+1}_{j,i}

We can rewrite eq. (4.10):

∂E/∂a^l_j = Σ_i δ^{l+1}_i w^{l+1}_{j,i}   (4.11)

Example: Now we compute the error signal on the activation of a neuron in the hidden layer, fig. 4.6. Let us consider that we want to compute ∂E/∂a^1_0:

∂E/∂a^1_0 = 1.212 · f'(1.512) · 0.4 + 0.75 · f'(1.25) · 0.7 ≈ 1.010

Figure 4.6: Computation of the error signal of one neuron in the hidden layer

The derivative of weights between other layers

We already know the error signal of the hidden layer from the derivations above:

∂E/∂a^l_j = Σ_i δ^{l+1}_i w^{l+1}_{j,i}

And we also derived the following derivative above:

∂a^l_j/∂z^l_j = f'(z^l_j)

Now we combine everything. Since ∂z^l_j/∂w^l_{i,j} = a^{l−1}_i, we get:

∂E/∂w^l_{k,j} = (Σ_i δ^{l+1}_i w^{l+1}_{j,i}) · f'(z^l_j) · a^{l−1}_k   (4.12)

Example: We can finally compute one of the derivatives of the error with respect to a weight between the input layer and the hidden layer. For example, in fig. 4.7 we want to compute ∂E/∂w^1_{0,0}:

∂E/∂w^1_{0,0} = (1.212 · f'(1.512) · 0.4 + 0.75 · f'(1.25) · 0.7) · f'(0.82) · 0.5 ≈ 0.505

Figure 4.7: The derivative of one weight between the input layer and the hidden layer

As we can see, especially from fig. 4.7, we first need to compute for each hidden layer all the error terms of the adjacent layer. So the algorithm starts from the output layer and iterates backward over all layers and weights. After this step, all weights are updated as in eq. (4.5).

Now we argue that the backpropagation algorithm is linear in the number of weights. Let k be the number of all weights in the neural network. The forward propagation is done in linear time (O(k)), because this pass iterates over all layers, neurons and weights, while every weight is used only once. The error of the prediction is computed in linear time with respect to the number of neurons in the last layer (O(m)). When backpropagating the error signal and computing the partial derivatives based on it, we also use every weight exactly once, so this step is again linear (O(k)) in the number of weights. The adaptation of weights iterates over all weights and is thus also linear in the number of weights (O(k)).
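The gradients computed by hand above can be checked with a small numpy sketch of the backward pass for this 2-layer network (same weight layout as the forward-pass sketch earlier; again illustrative code under our own naming, not the thesis implementation):

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)
relu_grad = lambda z: (z > 0).astype(float)

def backprop(x, t, weights):
    """Return dE/dW for each layer of a fully connected ReLU network,
    with E = 0.5 * sum((y - t)^2) and the bias folded in as an extra input 1."""
    # Forward pass, remembering the extended inputs and the weighted sums of every layer.
    a, acts, zs = np.asarray(x, float), [], []
    for W in weights:
        a_ext = np.append(a, 1.0)
        z = a_ext @ W
        acts.append(a_ext)
        zs.append(z)
        a = relu(z)
    grads, dE_da = [], a - np.asarray(t, float)       # dE/da on the output layer
    for W, a_ext, z in zip(weights[::-1], acts[::-1], zs[::-1]):
        delta = dE_da * relu_grad(z)                  # error term delta^l_j
        grads.insert(0, np.outer(a_ext, delta))       # dE/dw^l_{i,j} = delta^l_j * a^{l-1}_i
        dE_da = W[:-1] @ delta                        # eq. (4.11), without the bias row
    return grads

W1 = np.array([[0.2, 0.1], [0.4, 0.3], [0.6, 0.5]])
W2 = np.array([[0.4, 0.7], [0.6, 0.9], [0.8, 0.1]])
g1, g2 = backprop([0.5, 0.3], [0.3, 0.5], [W1, W2])
print(round(g2[0, 0], 3), round(g1[0, 0], 3))  # 0.994 and 0.505, as derived above
```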

Rectified linear unit

We will use the explanations from Krizhevsky et al. [2012] and the Stanford course CS231n [2017a]. The rectified linear unit computes the activation function f(x) = max(0, x). The derivative of the ReLU, which we need for backpropagation, is:

f'(x) = 1 if x > 0, and 0 otherwise.

It was found that the ReLU considerably accelerates the convergence of stochastic gradient descent compared to the sigmoid or tanh functions. Moreover, unlike the sigmoid or tanh functions, it is computationally less demanding. Unfortunately, ReLU units can be fragile during training: for example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron never activates on any input again.

4.1.2 Convolutional Layer

This topic is well described in one of the courses available online from Stanford University, CS231n [2017b]. We will use this source to explain the concept of convolutional neural networks and their parameters and hyperparameters. To derive the forward pass and the backpropagation algorithm, we will use two online sources, Gibianski [2014] and Kafunah [2016], but keep the notation from the previous section.

Motivation

Convolutional layers are the main building block of convolutional neural networks. The parameter of this layer is a set of learnable filters that are not large spatially but extend over all input channels. During the forward pass, we convolve (see the proper definition below) each filter across the width and the height of the input volume and compute dot products between the filter entries and the input at every position. Sliding the filter over the width and the height of the input volume produces a 2-dimensional activation map that gives the filter responses at every spatial position. Intuitively, the network makes filters activate whenever they see some type of visual feature. Each filter creates one 2D activation map; the number of output channels is therefore equal to the number of used filters. CS231n [2017b] In the following derivations, we describe the concept of a general d-dimensional convolutional layer, while in the rest of the thesis we use three-dimensional convolutional layers.

Input, parameters, hyperparameters and output

Let us suppose we have a d-dimensional convolutional layer that accepts an input volume E_1 × E_2 × ... × E_d × C, where C is the number of channels. The layer uses four hyperparameters:

K ... the number of kernels
F ... the spatial size of the kernels in all dimensions
S ... the stride (by how many steps the region's position is advanced in all dimensions)

P ... the amount of zero padding (the input volume is padded with zeros around the border)

The spatial size of the kernels F satisfies F ≤ min(E_1, ..., E_d). All kernels have the same number of channels C, i.e. the same as the input. The output then has the following volume: E'_1 × E'_2 × ... × E'_d × K, where:

E'_i = (E_i − F + 2P) / S + 1

It would be better if each input region had its own kernel, i.e. a feature learned for that specific position, but this would bring an overwhelming need for memory, so we use the concept of parameter sharing: the whole input uses K filters, which are the same for each region. In addition, the convolutional layer also uses K biases (one for each filter). CS231n [2017b]

Forward propagation algorithm

Definition 4.11. Let d be the number of dimensions of the filters in a layer, D = {D_1, D_2, ..., D_d} be the dimensions of the filters and E = {E_1, E_2, ..., E_d} be the dimensions of the output of the previous layer such that ∀i : D_i ≤ E_i, and let C be the number of channels of both the filters and the output. Then let A ∈ R^{E_1 × ... × E_d × C} be the output of the neurons in the previous layer, W ∈ R^{D_1 × ... × D_d × C} be a filter and b ∈ R be the bias of the kernel W. The input to a convolutional layer at position p = (p_1, ..., p_d) is Z_p:

Z_p = b + Σ_{c=0}^{C−1} Σ_{d_1=0}^{D_1−1} ... Σ_{d_d=0}^{D_d−1} A_{p_1+d_1, ..., p_d+d_d, c} · W_{d_1, ..., d_d, c}   (4.13)

The forward propagation for the output A^{l−1} of the previous layer then consists of the convolution from definition 4.11 at each position with every filter. As we can see, the output Z of the convolution with one filter is a d-dimensional activation map with one channel. We can then concatenate the activation maps of all filters in the layer and call the result the input Z^l, which has d dimensions and K channels. The output of the convolutional layer is then computed as:

A^l = f(Z^l)

where the activation function f is applied to each element. Let us now consider an example of an input into a convolutional layer with S = 1, P = 1, K = 1, C = 3 and F = 3. The forward pass can be seen in fig. 4.8. We can compute, e.g., Z_{0,0,0} by multiplying the filter element-wise with the zero-padded input region at position (0, 0, 0) over all three channels, summing the products and adding the bias.

Figure 4.8: The convolution in a convolutional layer (the output A^{l−1} of the previous layer, the filter W, the zero-padded input Z^l and the bias b)
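A direct, unoptimized numpy sketch of the forward pass from definition 4.11 for the three-dimensional case used in this thesis; the function name and the stride/padding handling are our own illustrative choices:

```python
import numpy as np

def conv3d_forward(A, W, b, stride=1, pad=1):
    """Naive 3D convolution of one filter W (F x F x F x C) with bias b over
    an input volume A (E1 x E2 x E3 x C), following eq. (4.13)."""
    F = W.shape[0]
    A = np.pad(A, [(pad, pad)] * 3 + [(0, 0)])           # zero padding P
    E1, E2, E3, _ = A.shape
    out = np.zeros(((E1 - F) // stride + 1,
                    (E2 - F) // stride + 1,
                    (E3 - F) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                region = A[i * stride:i * stride + F,
                           j * stride:j * stride + F,
                           k * stride:k * stride + F, :]
                out[i, j, k] = b + np.sum(region * W)     # Z_p = b + sum A * W
    return out

A = np.random.rand(7, 7, 7, 3)                            # toy input, C = 3
W = np.random.rand(3, 3, 3, 3)                            # one 3x3x3 filter with C = 3
print(conv3d_forward(A, W, b=0.1, stride=1, pad=1).shape)  # (7, 7, 7)
```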

Backpropagation

In the previous section, we introduced the forward pass in convolutional layers; now we need to understand how to compute the error signal for the previous layer and how to update the parameters of the layer. We will follow the explanations from Gibianski [2014] and Stanford [a]. Let us suppose the adjacent layer was a pooling layer or a convolutional layer, so we have the error function E and the error signal ∂E/∂A^l on this layer (both layer types that we allow as the adjacent layer route the error signal back to the previous layer, see below).

Now assume we want to compute the gradient of one kernel W at a position p = (p_1, ..., p_d) (let us consider only the one-channel case). We use the chain rule, as we already did for backpropagation in the fully connected layer:

∂E/∂W_{p_1,...,p_d} = Σ_{p'_1=0}^{E_1−D_1} ... Σ_{p'_d=0}^{E_d−D_d} ∂E/∂A^l_{p'_1,...,p'_d} · ∂A^l_{p'_1,...,p'_d}/∂Z^l_{p'_1,...,p'_d} · ∂Z^l_{p'_1,...,p'_d}/∂W_{p_1,...,p_d}   (4.14)

Let us split the computation into three parts, one for each derivative:

- From eq. (4.13) (forward propagation), the derivative of the input with respect to the kernel is equal to the output of the previous layer: ∂Z^l_{p'_1,...,p'_d}/∂W_{p_1,...,p_d} = A^{l−1}_{p'_1+p_1,...,p'_d+p_d}
- The derivative of the output with respect to the input is only the derivative of the activation function: ∂A^l_{p'_1,...,p'_d}/∂Z^l_{p'_1,...,p'_d} = f'(Z^l_{p'_1,...,p'_d})
- The error signal for this layer was already computed by the adjacent layer, thus ∂E/∂A^l_{p'_1,...,p'_d} is already known.

We can then combine these three parts again:

∂E/∂W_{p_1,...,p_d} = Σ_{p'_1=0}^{E_1−D_1} ... Σ_{p'_d=0}^{E_d−D_d} ∂E/∂A^l_{p'_1,...,p'_d} · f'(Z^l_{p'_1,...,p'_d}) · A^{l−1}_{p'_1+p_1,...,p'_d+p_d}   (4.15)

We will now derive the gradient for the bias b. This is a little easier than in the case of the kernels, because the bias is only added to the input at each position and is not weighted by anything, thus the derivative of the error with respect to the bias is:

∂E/∂b = Σ_{p'_1=0}^{E_1−D_1} ... Σ_{p'_d=0}^{E_d−D_d} ∂E/∂A^l_{p'_1,...,p'_d} · f'(Z^l_{p'_1,...,p'_d})   (4.16)

The last task needed to complete the backpropagation algorithm in convolutional layers is to compute the error signal of the previous layer:

∂E/∂A^{l−1}_{p_1,...,p_d} = Σ_{p'_1=0}^{D_1−1} ... Σ_{p'_d=0}^{D_d−1} ∂E/∂Z^l_{p_1−p'_1,...,p_d−p'_d} · ∂Z^l_{p_1−p'_1,...,p_d−p'_d}/∂A^{l−1}_{p_1,...,p_d} = Σ_{p'_1=0}^{D_1−1} ... Σ_{p'_d=0}^{D_d−1} ∂E/∂Z^l_{p_1−p'_1,...,p_d−p'_d} · W_{p'_1,...,p'_d}   (4.17)

4.1.3 Pooling Layer

We will explain the concept of pooling layers following the Stanford University course CS231n [2017b].

Motivation

This type of layer is used to progressively reduce the spatial size of the representation and also the number of parameters; hence it helps to control overfitting. The layer operates independently in every channel and resizes the spatial dimensions with an operation. The operation is applied to every region of the chosen spatial size and its output is used as a representative of this region. Pooling can use, e.g.:

Max pooling: choose the maximum value of the region
Min pooling: choose the minimum value of the region
Average pooling: compute the average over the region
L_2-norm pooling: compute the L_2 norm of the region

Input, parameters, hyperparameters and output

The 2D pooling layer accepts a volume of size W_1 × H_1 × C_1, where W_1 is the width, H_1 is the height and C_1 is the number of channels. It needs two hyperparameters:

F ... the spatial size of the used regions in both dimensions
S ... the stride (by how many steps the region's position is advanced in all dimensions)

The output has volume W_2 × H_2 × C_2, where:

W_2 = (W_1 − F)/S + 1
H_2 = (H_1 − F)/S + 1
C_2 = C_1

The 3D pooling layer additionally accepts the depth D_1 (W_1 × H_1 × D_1 × C_1); the output is then W_2 × H_2 × D_2 × C_2, where D_2 = (D_1 − F)/S + 1. This layer does not use any parameters.

Backpropagation

As we mentioned in the motivation part of this subsection, this layer applies an operation to regions of the input and for each region chooses a representative with respect to the used operation. That is the forward propagation step, demonstrated in fig. 4.9.

Figure 4.9: The max-pool (2 × 2) layer forward pass

As this type of layer does not use any weights, there is nothing to update, but we still want to send the error to the previous layer. We describe backpropagation only for the max-pooling layer. The backward pass for a max(x, y) operation has a simple interpretation: it only routes the gradient to the input that had the highest value in the forward pass. So during the forward pass we need to remember the index of the maximum value in each region, and during the backward pass the error signal of the region is passed only to this index. Any other input element of the region gets an error signal equal to zero. With this, we have computed the error signal ∂E/∂A^{l−1} for the previous layer. CS231n [2017b]
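A compact numpy sketch of 2 × 2 max pooling in 2D with the gradient routing described above (the argmax index of each region is remembered during the forward pass; names and shapes are illustrative only, not the thesis code):

```python
import numpy as np

def maxpool_forward(A, k=2):
    """Forward pass: for each k x k region keep the maximum and remember its index."""
    H, W = A.shape
    out = np.zeros((H // k, W // k))
    argmax = {}
    for i in range(H // k):
        for j in range(W // k):
            region = A[i * k:(i + 1) * k, j * k:(j + 1) * k]
            idx = np.unravel_index(np.argmax(region), region.shape)
            argmax[(i, j)] = (i * k + idx[0], j * k + idx[1])
            out[i, j] = region[idx]
    return out, argmax

def maxpool_backward(dE_dout, argmax, input_shape):
    """Backward pass: route each region's error only to the remembered index."""
    dE_dA = np.zeros(input_shape)
    for (i, j), (r, c) in argmax.items():
        dE_dA[r, c] = dE_dout[i, j]
    return dE_dA

A = np.array([[1., 3., 2., 1.],
              [4., 2., 0., 5.],
              [0., 1., 1., 0.],
              [2., 0., 0., 1.]])
out, amax = maxpool_forward(A)
print(out)                                     # [[4., 5.], [2., 1.]]
print(maxpool_backward(np.ones((2, 2)), amax, A.shape))
```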

4.1.4 Local Response Normalization

The concept of this type of layer is well described by Joshi [2016]; for the forward pass we follow the explanation of Krizhevsky et al. [2012].

Motivation

In neurobiology, there is a concept called lateral inhibition, which is the capacity of an excited neuron to decrease the activity of its neighbors. This creates one significant peak, a local maximum. A local response normalization layer does the same in a convolutional neural network architecture. Joshi [2016] Today this type of layer is not so common anymore, because its contribution has been shown to be minimal, if any. Nowadays we have better training algorithms, regularization techniques and, e.g., normalized datasets; all of this helps the performance much more than LRN layers. So we describe only one of the implementation approaches (Krizhevsky's) on a 2D CNN.

Forward pass

ReLUs have the desirable property that they do not require any input normalization to prevent them from saturating. If at least some training examples produce a positive input to a ReLU, some learning will happen in that neuron. However, we still find that the following local normalization scheme aids generalization. Let a^i_{x,y} be the activity of a neuron computed by applying kernel i at position (x, y) and then applying the ReLU nonlinearity; the response-normalized activity b^i_{x,y} is given by:

b^i_{x,y} = a^i_{x,y} / (k + α Σ_{j=max(0, i−n/2)}^{min(N−1, i+n/2)} (a^j_{x,y})²)^β   (4.18)

where the sum runs over n adjacent kernel maps at the same spatial position, and N is the total number of kernels in the layer. The ordering of the kernel maps is, of course, arbitrary and determined before the training begins. This sort of response normalization implements a form of lateral inhibition inspired by the type found in real neurons, creating a competition for big activities among neuron outputs computed with different kernels. Krizhevsky et al. [2012]

4.1.5 Fully Connected Layer

This layer corresponds to the normal fully connected layer, as described in the section on artificial neural networks above. The 2D or 3D output of the previous layer is linearized and used as the input to this layer. The backpropagation algorithm is the same as described above.

4.1.6 Dropout

As stated in Krizhevsky et al. [2012], it would be much better if we had several deep convolutional neural networks whose results were combined into one. This approach is very successful in reducing test errors but appears to be too expensive. There is, however, a technique that does approximately the same thing and is very efficient: it is called dropout. This technique consists of setting the output of each neuron in the layer to zero with a specified probability; the most common choice is probability 0.5. The dropped-out neurons do not contribute to the forward pass and do not participate in backpropagation at all. So every time an input is presented to our CNN, the network chooses one of the architectures that share weights. If we did not use dropout in our fully connected layers, our network would exhibit substantial overfitting.

4.2 Transformation of Model for DCNN

The input of a 3D convolutional neural network, as described in the previous section, is a three-dimensional matrix (width × height × depth) with one channel. If we had information about, e.g., the RGB color of every point, the input would have three channels. Now let us assume we have a classic 3D object defined by its vertices and faces. An example of this type of model, as visualized in MeshLab, is shown in fig. 4.10a. We need to transform this model so that it corresponds to the expected input of a 3D CNN.

Definition 4.12. A 3D occupancy grid is a 3D map of cubes, where each cube carries information about its occupancy.

Let X be a set of 3D points and F be a set of triangular faces formed by three points of X. Let f ∈ F and a_f, b_f, c_f be the lengths of its sides.

(a) A model (b) Random points on faces (c) An occupancy grid of voxels
Figure 4.10: An example of the transformation

In our work, we assume that faces with longer sides are more important for the overall shape of the object, so we first select only those:

F' = {f ∈ F | a_f ≥ t ∧ b_f ≥ t ∧ c_f ≥ t}   (4.19)

where t is a threshold. It would surely be easier to choose the vertices of each selected face as its representative points, but for faces with a big area this approach would be very sparse. Thus, we choose random points on the selected faces in F'. We follow the approach of Osada et al. [2002]. Let x, y, z ∈ R³ be the vertices of a triangle and r_1, r_2 ~ U[0, 1]. Then we choose the point:

p = (1 − √r_1) x + √r_1 (1 − r_2) y + √r_1 r_2 z   (4.20)

Intuitively, r_1 sets the percentage from vertex x to the opposing edge, while r_2 represents the percentage along that edge. From each face in F' we create k points using eq. (4.20) and add them to the set P. Let min(P_i) be the least element in the i-th dimension over all points in P, max(P) be the greatest value over all dimensions of all points in P, and x[i], where x ∈ P and i ∈ {0, 1, 2}, be the element in the i-th dimension of the point x. We want to create a normalized set of points P'. For each point x ∈ P, we create a new point y ∈ P', where:

∀i ∈ {0, 1, 2} : y[i] = (x[i] − min(P_i)) / max(P)

We can see an example of P' in fig. 4.10b. With this normalized set of points P', we create an occupancy grid of size n³ using the following algorithm.

Algorithm 4.2. Let n be the size of all dimensions of an occupancy grid and P' a set of 3D points that we want to register into the grid. The algorithm returns the occupancy grid of size n³ with one channel:

0. Occ ← zeros(n, n, n, 1)
1. foreach p ∈ P':
2.     x ← round(p[0] · (n − 1)), y ← round(p[1] · (n − 1)), z ← round(p[2] · (n − 1))
3.     Occ[x, y, z, 0] ← 1
4. return Occ

The occupancy grid created by algorithm 4.2 is then used as the input to the convolutional neural network. With respect to our example, it is visualized in fig. 4.10c.
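A Python sketch of the whole transformation pipeline of section 4.2: sampling points on a triangle with eq. (4.20), a min-max style normalization (slightly simplified so the result always lies in [0, 1]), and the voxelization of algorithm 4.2. The function names, the uniform per-face k and the grid size n = 30 are illustrative assumptions, not values from the thesis:

```python
import numpy as np

def sample_on_triangle(x, y, z, k):
    """k random points on the triangle (x, y, z) using eq. (4.20)."""
    r1, r2 = np.random.rand(k, 1), np.random.rand(k, 1)
    s1 = np.sqrt(r1)
    return (1 - s1) * x + s1 * (1 - r2) * y + s1 * r2 * z

def normalize(P):
    """Shift by the per-dimension minimum and scale into [0, 1] by the global maximum."""
    P = P - P.min(axis=0)
    return P / P.max()

def occupancy_grid(P_norm, n=30):
    """Algorithm 4.2: register normalized points into an n^3 grid with one channel."""
    occ = np.zeros((n, n, n, 1), dtype=np.float32)
    idx = np.round(P_norm * (n - 1)).astype(int)
    occ[idx[:, 0], idx[:, 1], idx[:, 2], 0] = 1.0
    return occ

# Toy model: two triangles, 100 sampled points each, registered into a 30^3 grid.
tri1 = (np.array([0., 0., 0.]), np.array([1., 0., 0.]), np.array([0., 1., 0.]))
tri2 = (np.array([0., 0., 1.]), np.array([1., 1., 1.]), np.array([0., 1., 0.]))
P = np.vstack([sample_on_triangle(*tri1, k=100), sample_on_triangle(*tri2, k=100)])
grid = occupancy_grid(normalize(P))
print(grid.shape, int(grid.sum()))   # (30, 30, 30, 1) and the number of occupied voxels
```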

4.3 Used CNN Architectures

We trained many architectures with different parameters, but only some of them were able to pass 85% accuracy on the validation dataset of SHREC16. In the following chapter 6, we will consider only the architectures shown in table 4.1. Our architectures use the following notation:

INPUT(W × H × D × C): an input layer with a W × H × D × C output volume
CONV(K, F, S): a convolutional layer with K filters of size F × F × F and stride S
POOL(k, method): a pooling layer over regions of spatial size k with the specified method
FC(k): a fully connected layer with k neurons
Dropout(p): a dropout layer, as defined above, with probability p
OUTPUT(k): an output layer with k neurons

Table 4.1: Used CNN architectures
ID 0: INPUT(n × n × n × 1) ⇒ CONV(8, 3, 2) ⇒ CONV(16, 3, 1) ⇒ POOL(2, MAX) ⇒ FC(256) ⇒ Dropout(0.5) ⇒ OUTPUT(55)
ID 1: INPUT(n × n × n × 1) ⇒ CONV(8, 5, 2) ⇒ CONV(16, 3, 1) ⇒ POOL(2, MAX) ⇒ FC(256) ⇒ Dropout(0.5) ⇒ OUTPUT(55)
ID 2: INPUT(n × n × n × 1) ⇒ CONV(16, 5, 2) ⇒ CONV(16, 3, 1) ⇒ POOL(2, MAX) ⇒ FC(256) ⇒ Dropout(0.5) ⇒ OUTPUT(55)
ID 3: INPUT(n × n × n × 1) ⇒ CONV(8, 5, 1) ⇒ POOL(2, MAX) ⇒ CONV(16, 3, 1) ⇒ POOL(2, MAX) ⇒ FC(256) ⇒ Dropout(0.5) ⇒ OUTPUT(55)
ID 4: INPUT(n × n × n × 1) ⇒ CONV(16, 5, 1) ⇒ POOL(2, MAX) ⇒ CONV(16, 3, 1) ⇒ POOL(2, MAX) ⇒ FC(256) ⇒ Dropout(0.5) ⇒ OUTPUT(55)
ID 5: INPUT(n × n × n × 1) ⇒ CONV(16, 5, 1) ⇒ POOL(2, MAX) ⇒ CONV(32, 3, 1) ⇒ POOL(2, MAX) ⇒ FC(256) ⇒ Dropout(0.5) ⇒ OUTPUT(55)
ID 6: INPUT(n × n × n × 1) ⇒ CONV(16, 7, 1) ⇒ POOL(2, MAX) ⇒ CONV(32, 5, 1) ⇒ POOL(2, MAX) ⇒ FC(256) ⇒ Dropout(0.5) ⇒ OUTPUT(55)
ID 7: INPUT(n × n × n × 1) ⇒ CONV(16, 5, 1) ⇒ POOL(2, MAX) ⇒ CONV(16, 5, 1) ⇒ POOL(2, MAX) ⇒ FC(256) ⇒ Dropout(0.5) ⇒ OUTPUT(55)

In fig. 4.11 there is an example of a convolutional neural network. It has one convolutional layer with K = 32 filters of size 5 × 5 × 3, stride S = 1, C = 1 and zero padding P = 2. It is followed by a max-pooling layer with spatial size 2, then another convolutional layer with kernels of spatial size 5 × 5 but with 32 channels, K = 48 filters, stride S = 1 and P = 2, and another max-pooling layer with spatial size 2. Its output is flattened and goes into two fully connected layers (one with 768 neurons and the other with 256 neurons), where the latter is fully connected to the output layer neurons.

Figure 4.11: The CNN architecture (this figure is generated by adapting the code from gwding/draw_convnet)
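The thesis implementation used TensorFlow directly. Purely as an illustration of the notation above, architecture 3 from table 4.1 could be written in today's tf.keras API roughly as follows; the 30³ input resolution, the padding choice and the optimizer settings are assumptions of this sketch, not values taken from the thesis:

```python
import tensorflow as tf

def build_architecture_3(n=30, num_classes=55):
    """INPUT(n,n,n,1) => CONV(8,5,1) => POOL(2,MAX) => CONV(16,3,1) => POOL(2,MAX)
       => FC(256) => Dropout(0.5) => OUTPUT(55), as in table 4.1 (ID 3)."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n, n, n, 1)),
        tf.keras.layers.Conv3D(8, kernel_size=5, strides=1, padding="same",
                               activation="relu"),
        tf.keras.layers.MaxPool3D(pool_size=2),
        tf.keras.layers.Conv3D(16, kernel_size=3, strides=1, padding="same",
                               activation="relu"),
        tf.keras.layers.MaxPool3D(pool_size=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_architecture_3()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```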

4.4 Object retrieval

We described the object retrieval task and its approaches in chapter 3. Now we introduce our methods, which will be used in chapter 6.

4.4.1 Using Our Classifier

Based on the good results of our convolutional neural networks on object recognition (see chapter 6), we decided to use our classifier for shape retrieval as well. Let us assume we have a database of objects D and we want to compute the mean average precision on this database using our classifier. A forward pass through our trained convolutional neural network gives a vector whose length equals the number of classes; each element of this vector gives the likelihood that the object o belongs to the corresponding class. So for each class we can create a list of all objects, ordered descending by their likelihood of belonging to this class. We can think of this approach as a kind of hashing function: whenever we want to compute the average precision of a queried object, we use our CNN to classify the object and choose the prepared list of the class it was assigned to. For the computation of the average precision we use definition 3.6.

4.4.2 Using Similarity Search

Our other method is a classic similarity search approach. Let us assume that out_o is the output of the forward pass of the CNN for the object o, which we can use as a feature vector. Suppose we want to compute the average precision for a specific query object q in our database D. For each object o ∈ D we compute:

DIS(q, o) = L_2(out_q, out_o)

We can then retrieve all objects from D, ordered ascending by this similarity function, and evaluate the average precision for the query object q on this ranking.
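Section 4.4.2 amounts to a few lines once the CNN outputs are available. A hedged sketch with our own helper names (`model` stands for any trained classifier with a `predict` method, e.g. the tf.keras sketch above):

```python
import numpy as np

def rank_by_similarity(query_grid, database_grids, model):
    """Rank database objects by L2 distance between CNN output vectors (sec. 4.4.2)."""
    out_q = model.predict(query_grid[np.newaxis, ...])[0]   # 55-dimensional output vector
    out_db = model.predict(np.stack(database_grids))        # one vector per database object
    dists = np.linalg.norm(out_db - out_q, axis=1)          # DIS(q, o) = L2(out_q, out_o)
    return np.argsort(dists)                                # ascending, most similar first

# ranking = rank_by_similarity(query, database, model)
# average_precision(list(ranking), relevant_ids)   # reuse the metric sketch from chapter 3
```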

5. Learning of DCNN

For our explanation of learning approaches for DCNN weights, we follow Ruder [2016].

5.1 Motivation

In the previous chapter, in eq. (4.5), we already introduced the way to update weights to make the prediction of our convolutional neural network closer to the real label. In this chapter, we describe more sophisticated methods for updating the parameters of our networks. We have chosen to use algorithms based on gradient descent for the optimization of our learning, which is by far the most common choice in this field. In most cases, learning frameworks already contain implementations of these optimizers, so the approaches tend to be used only as black boxes. Because of this, we explain them in this chapter in more detail, to make it easier to understand our experiments in chapter 6. Gradient descent is a way to minimize the error function E by updating the parameters w of a network in the opposite direction of the gradient ∂E/∂w. The learning rate α determines the size of the steps necessary to reach a (local) minimum.

5.2 Gradient Descent Optimization

5.2.1 Gradient Descent Variants

There are three variants of the gradient descent algorithm, which differ only in the amount of data from the dataset that is given to the network for one update.

Batch gradient descent

Computes the gradient of the error function with respect to the weights for the entire dataset. As we need to calculate the gradients for the whole dataset to perform only one update, batch gradient descent can be very slow. This variant also does not allow us to update our model online, i.e. with new examples on-the-fly.

Stochastic gradient descent

This approach, in contrast, performs a parameter update for each training example (x, t). Batch gradient descent performs redundant computations for a large dataset because it recomputes gradients for similar inputs without any update; SGD does not have this redundancy as it updates the weights each time. It is usually much faster and can be used for online learning. The problem is that SGD can complicate convergence to the exact minimum of the error function, as it keeps jumping between new local minima. However, it has been shown that for lower learning rates SGD has the same convergence behavior as the batch gradient descent method.

Mini-batch gradient descent

The most common approach lies between the previous two methods and performs an update for every mini-batch of n examples. This way it:

- reduces the variance of the parameter updates, which can lead to more stable convergence
- can make use of highly optimized matrix operations common in state-of-the-art deep learning libraries, which make computing the gradient with respect to a mini-batch very efficient

The common size of a mini-batch is in the range (50, 256) but can vary with respect to the application.

5.2.2 Gradient Descent Optimization Algorithms

In this subsection we outline the optimization algorithms that we will use in our experiments in the chapter below.

Momentum

SGD has trouble with areas where the surface curves much more steeply in one dimension than in others; however, such areas are very common around local optima. In these scenarios, SGD oscillates across the slopes of the ravine while making only hesitant progress towards the local optimum. Momentum is a method that helps accelerate SGD in the relevant direction. It does this by adding a fraction of the update vector of the previous step to the current update vector:

v_s = γ v_{s−1} + α ∂E/∂w
w = w − v_s

where s is the current time step and γ is usually set to 0.9 or a similar value. The momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation.
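A minimal numpy sketch of the momentum update rule above; the gradient function, the parameter vector and the hyperparameter values are placeholders of ours, not taken from the thesis:

```python
import numpy as np

def momentum_step(w, grad, v, alpha=0.01, gamma=0.9):
    """One momentum update: v_s = gamma * v_{s-1} + alpha * dE/dw; w = w - v_s."""
    v = gamma * v + alpha * grad(w)
    return w - v, v

# Toy problem: minimize E(w) = 0.5 * ||w||^2, whose gradient is w itself.
w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    w, v = momentum_step(w, lambda w: w, v)
print(w)   # close to the minimum at (0, 0)
```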

Adagrad

Adagrad is a gradient descent algorithm that adapts the learning rate to the parameters, performing larger updates for infrequent parameters and smaller updates for frequent parameters. Up to now, we have performed updates with the same learning rate for all weights, but Adagrad uses a different learning rate for each weight. Let us consider some weight w_i, and let g_{s,i} be the gradient of the error function with respect to w_i at time step s:

g_{s,i} = ∂E/∂w_i

The SGD update for every parameter w_i at each time step s then becomes:

w_{s+1,i} = w_{s,i} − α g_{s,i}

In its update rule, Adagrad modifies the general learning rate α at each time step s for every parameter w_i, based on the previous gradients:

w_{s+1,i} = w_{s,i} − (α / √(G_{s,ii} + ε)) g_{s,i}

where G_s ∈ R^{d×d} is a diagonal matrix in which each diagonal element (i, i) is the sum of the squares of the gradients with respect to w_i up to time step s, and ε is a smoothing term that avoids division by zero. Adagrad's main benefit is that it eliminates the need to tune the learning rate manually. On the other hand, there is a problem with the accumulation of squared gradients in the denominator: since all added terms are positive, the accumulated sum keeps growing during training. As a result, the learning rate becomes very small during training and in the end the network cannot learn any additional knowledge.

Adadelta

Adadelta is an extension of Adagrad that mitigates its biggest problem, the aggressive decrease of the learning rate. Adadelta accumulates only the k previous gradients, where k is a fixed size. Instead of inefficiently storing all k previous squared gradients, the sum of gradients is computed recursively as a decaying average of all previous squared gradients. Let C[g²]_{s,i} be the running average of squared gradients at time step s for w_i; it then depends only on the previous average and the current gradient:

C[g²]_{s,i} = γ C[g²]_{s−1,i} + (1 − γ) g²_{s,i}

where γ can be set to 0.9 or a close value, as in the case of momentum above. For clarity, we now rewrite the vanilla SGD update step for the parameter:

Δw_{s,i} = −α g_{s,i}
w_{s+1,i} = w_{s,i} + Δw_{s,i}

The Adadelta learning rate decay is derived from Adagrad:

Δw_{s,i} = −(α / √(G_{s,ii} + ε)) g_{s,i}

So we only replace the diagonal matrix G with the average over the previous squared gradients:

Δw_{s,i} = −(α / √(C[g²]_{s,i} + ε)) g_{s,i}

RMSprop

RMSprop is an unpublished adaptive learning rate method proposed by Geoff Hinton. RMSprop and Adadelta were developed independently from the need to resolve Adagrad's main problem. In fact, RMSprop is identical to the first update of Adadelta that we derived above:

C[g²]_{s,i} = 0.9 C[g²]_{s−1,i} + 0.1 g²_{s,i}
Δw_{s,i} = −(α / √(C[g²]_{s,i} + ε)) g_{s,i}

RMSprop thus also divides the learning rate by an exponentially decaying average of squared gradients.

Adam

Adam (ADAptive Moment estimation) is another method that computes adaptive learning rates for each parameter. Besides storing an exponentially decaying average of previous squared gradients, it also keeps an exponentially decaying average of past gradients, similar to momentum. Now, let:

m_{s,i} = β_1 m_{s−1,i} + (1 − β_1) g_{s,i}
v_{s,i} = β_2 v_{s−1,i} + (1 − β_2) g²_{s,i}

m_{s,i} estimates the first moment (mean) and v_{s,i} estimates the second moment (variance) of the gradients. As all m_{s,i} and v_{s,i} are initialized to 0, according to the authors of this method they are biased towards zero, especially during the initial steps. The authors counteract these biases by computing the first and second bias-corrected moments:

m̂_{s,i} = m_{s,i} / (1 − β_1^s)
v̂_{s,i} = v_{s,i} / (1 − β_2^s)

They then use these to update the parameters just as we have seen in Adadelta and RMSprop:

w_{s+1,i} = w_{s,i} − (α / (√(v̂_{s,i}) + ε)) m̂_{s,i}

where the default values are β_1 = 0.9, β_2 = 0.999 and ε = 10⁻⁸.
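And a corresponding numpy sketch of a single Adam step with the default values quoted above; again a self-contained toy, not the TensorFlow optimizer used in the experiments:

```python
import numpy as np

def adam_step(w, g, m, v, s, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter vector w given the gradient g at time step s."""
    m = beta1 * m + (1 - beta1) * g          # first moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # second moment estimate
    m_hat = m / (1 - beta1 ** s)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** s)             # bias-corrected second moment
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Same toy quadratic as in the momentum sketch: E(w) = 0.5 * ||w||^2.
w, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for s in range(1, 5001):
    w, m, v = adam_step(w, w.copy(), m, v, s)
print(w)   # close to the minimum at (0, 0)
```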

6. Experiments

All the mentioned experiments were done on a personal computer (DELL Inspiron 15) running Windows 10, with the following components:

Intel Core i7 6700HQ
16 GB RAM
NVIDIA GeForce GTX 960M

Our program uses the TensorFlow framework and is written in the Python programming language. The architectures and their numbering correspond to table 4.1 in section 4.3. All used datasets (already described in chapter 2) were preprocessed according to section 4.2. First, from each model we choose the faces with sufficiently long sides following eq. (4.19), where the threshold t is chosen with respect to the dataset: for SHREC16 we use t = 0, for SHREC15 a small constant threshold, and for ModelNet10 a threshold that depends on the number of faces of the object (t = 5 for models with many faces, smaller otherwise). On these faces, we choose k points according to eq. (4.20). The constant k also differs depending on the used dataset and is chosen piecewise with respect to the area S of the face: for SHREC16 and SHREC15, k = 100 for the largest faces, k = 10 for medium-sized faces and fewer points for the smallest ones, while for ModelNet10 up to k = 450 points are sampled from the largest faces. The parameters need to be chosen for each dataset separately because of the different normalization of the vertices. All points created from the faces are then normalized, and finally the occupancy grid is created from them as described in algorithm 4.2.

Since it would be too computationally expensive to create the occupancy grid again and again for each model in every learning epoch, we decided to do the preprocessing part only once. The occupancy grid was then saved as a four-dimensional numpy array stored in a binary file.
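The caching step can be as simple as the following sketch; the file name and the 30³ grid size are placeholders of ours, and the real pipeline stores one binary file per preprocessed model:

```python
import numpy as np

# Stand-in for the preprocessing of section 4.2 (in the real pipeline this is
# the occupancy grid computed once per model).
grid = np.zeros((30, 30, 30, 1), dtype=np.float32)

np.save("model_0001.npy", grid)        # cache the four-dimensional array as a binary file

# In every training epoch only the cached binary file is read, which is much
# faster than re-parsing the .obj/.off file and re-voxelizing the model.
grid = np.load("model_0001.npy")
batch = grid[np.newaxis, ...]          # add a batch dimension before feeding the CNN
```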

Reading the numpy binary file is much faster than reading the standard .obj or .off format (in our case more than 10 times faster for reading one model). For ModelNet10 and SHREC16, we provide a confusion matrix represented as a picture. Each row of pixels in the picture gives information about the distribution of guesses for a specific class; in other words, if all guesses were right, the picture would show only a diagonal line from (0, 0) to (#classes − 1, #classes − 1).

6.1 Object Recognition and Retrieval (SHREC16)

6.1.1 Method

As presented in chapter 2, this dataset consists of around 51 thousand models from 55 distinct classes and is used in a shape retrieval competition. All models were first preprocessed as discussed in the section above. We then trained several CNN architectures according to table 4.1 to recognize models in the validation and test parts of the dataset. During training, we used the validation dataset to remember the best model: after each epoch we computed the accuracy on the validation dataset and, if the model was the best so far, we saved it. The accuracy on the test data is then computed with the model that was best on the validation dataset. Every network was trained for 50 epochs. After that, we were able to perform a similarity search as described in section 4.4.2.

6.1.2 Results and Discussion

Object recognition

The results of all selected architectures with additional parameters are in table 6.1. The accuracy evaluation on the test dataset was done with the model from the epoch with the best validation accuracy; the accuracy on the validation dataset in our table, on the other hand, was computed with the model from the last training epoch. In the beginning, we tried to train networks with a batch size of only 10, using architectures with two convolutional layers where a pooling layer follows only after the second one. In this approach we can see the problem described in chapter 5: the convolutional neural network weights were updated too often and took big steps towards a local minimum for the small batch, so convergence to the real minimum was very slow or even impossible. In that case, experimenting with larger hyperparameters (the number of filters and their spatial size) could not help, because it would only lead to a bigger overfitting issue. Because of this problem we decided to change our approach and increase the number of objects in the batch. With a batch size of 50 the results were much better, as we can see in table 6.1.

Table 6.1: Results of object recognition on SHREC16. For each architecture the table lists the architecture id, batch size, learning rate, optimizer (Adam in all cases), the accuracy on the validation data and the accuracy on the test data; the best architectures reach 85.3% and 85.6% on the validation data.

We can also see that architectures with a bigger spatial size of the filters performed noticeably better on the test dataset. Since a further increase of the batch size did not help much anymore, we added another pooling layer after the first convolutional layer (architectures 3, 4 and 5). As table 6.1 shows, these architectures were able to pass 85% accuracy on the validation data and, with a higher batch size, even more than 74% on the test data. The last row of the table also indicates that a higher spatial size and number of filters would not help; it would only lead to a bigger overfitting issue.

Table 6.2: Labels of objects in SHREC16: 0 airplane, 1 trash can, 2 bag, 3 basket, 4 bathtub, 5 bed, 6 bench, 7 birdhouse, 8 bookshelf, 9 bottle, 10 bowl, 11 bus, 12 cabinet, 13 camera, 14 can, 15 cap, 16 car, 17 cellphone, 18 chair, 19 clock, 20 keyboard, 21 dishwasher, 22 display, 23 earphone, 24 faucet, 25 file, 26 guitar, 27 helmet, 28 jar, 29 knife, 30 lamp, 31 laptop, 32 speaker, 33 mailbox, 34 microphone, 35 microwave, 36 motorcycle, 37 mug, 38 piano, 39 pillow, 40 pistol, 41 pot, 42 printer, 43 remote control, 44 rifle, 45 rocket, 46 skateboard, 47 sofa, 48 stove, 49 table, 50 telephone, 51 tower, 52 train, 53 vessel, 54 washer.

Fig. 6.1 shows two confusion matrices on the SHREC16 dataset. We created them for the two architectures that had the best results on the validation and on the test data according to table 6.1. The labels of the dataset groups are listed in table 6.2. Our approach was, for example, very good at recognizing airplanes, guitars, rifles or motorcycles. This is mainly because their shape is very different from other objects, but also because they have a lot of representatives in both splits of the dataset. On the other hand, other objects are often mistakenly labeled as members of these big groups. It is easy to see that we did much better on the validation dataset, but on both datasets we had problems distinguishing, for example, a microphone from a lamp; this is understandable, because the shapes of these two objects are very similar.
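The confusion matrices in fig. 6.1 (and later fig. 6.4) are rendered as pictures in the way described at the beginning of this chapter. The sketch below shows one possible way to produce such a picture with numpy and matplotlib; the color map and function names are our own choices for illustration, not the exact rendering code used in the thesis.

import numpy as np
import matplotlib.pyplot as plt

def confusion_matrix(true_labels, predicted_labels, n_classes):
    # row i holds the distribution of predictions for objects of true class i
    m = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(true_labels, predicted_labels):
        m[t, p] += 1
    return m

def save_confusion_image(m, path):
    # a perfect classifier produces a single diagonal line
    # from (0, 0) to (#classes - 1, #classes - 1)
    plt.imshow(m, cmap="gray_r", interpolation="nearest")
    plt.xlabel("predicted class")
    plt.ylabel("true class")
    plt.savefig(path)
    plt.close()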

Figure 6.1: Confusion matrices on SHREC16 ((a) validation dataset, (b) test dataset).

Our approach is also not very good at recognizing birdhouses, because this class does not have enough examples in the training database (only 73 in all three parts of the SHREC16 dataset); the same issue probably affects cameras or cellphones (in the test dataset there is only one example).

Object retrieval

Table 6.3: Results of object retrieval on SHREC16 (for each architecture: architecture id, batch size, learning rate, optimizer (Adam in all cases), MAP on the validation data and MAP on the test data).

The MAP results on SHREC16 for the architectures already seen above are in table 6.3. We tested the shape retrieval as described in section 4.4. The results correlate with those of the object recognition: the three architectures that were best on the test split are still the best in this task, but the results are not as good as we would expect from our method.
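For completeness, the sketch below shows how the MAP values reported in table 6.3 can in principle be computed from a ranked result list; it is a generic illustration, not the exact evaluation code used in our experiments or by the SHREC16 organizers.

def average_precision(ranked_labels, query_label):
    # ranked_labels are the class labels of the retrieved objects,
    # ordered by increasing distance to the query
    hits, precision_sum = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(all_ranked_labels, query_labels):
    aps = [average_precision(r, q) for r, q in zip(all_ranked_labels, query_labels)]
    return sum(aps) / len(aps)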

The problem is probably visible in the confusion matrix in fig. 6.1b: the smaller classes are often mistaken for the bigger ones, and so our method, which relies on good object recognition, does not work as well on the test split as on the validation split, where the accuracy was much higher. The SHREC16 competition also used a somewhat different type of MAP evaluation, computed on only the first 1000 retrieved objects. Our method therefore does not yet seem able to compete with the best methods, although, as discussed in chapter 7, there is a lot of room for improvement of our approach.

6.2 Object Retrieval (SHREC15)

6.2.1 Method

We decided to try our trained CNNs also on a different dataset to find out whether they would still be able to retrieve objects correctly. The features learned on objects from one dataset should be detectable also in objects from another dataset, as has been shown with 2D CNNs. We cannot use the same classifier-based approach as before (section 4.4.1), because the task of this competition is to provide a distance matrix in which, for each query object, the distances to all other objects in the database are given. The distance matrix is a matrix F ∈ R^{q×p}, where q is the number of labeled query objects and p is the number of all other objects in the database (which contains many distractors in addition to the 229 labeled objects), and the element F_{i,j} is the distance of the i-th query object from the j-th object in the database.

We therefore use the similarity search defined in section 4.4.2. For each object in the database, we perform forward propagation through our selected convolutional neural network (trained on SHREC16, see above). The output of this CNN is a vector with 55 elements. This vector is then normalized and used as the feature vector of the object. Let f_i denote the normalized feature vector of the i-th object in the database. As defined in chapter 3, the element F_{i,j} is then computed as

F_{i,j} = L_2(f_i − f_j),

i.e. as the Euclidean distance between the two normalized feature vectors.

6.2.2 Results and Discussion

For this experiment we used the best architectures learned on SHREC16; the results can be seen in table 6.4. For the evaluation of our distance matrix we used the code provided by the organizers of SHREC15. This evaluator was also able to print the values of the precision-recall function, which is plotted in fig. 6.2. The results of the competition are published in Godil et al. [2015] (figure 4 therein). As we can see, our approach is even better than any other method from the competition in finding the nearest neighbor (NN). Our average precision is also better than that of most of the approaches, so we can say that the features learned on a completely different dataset were also found here and the objects were retrieved accurately (especially considering how many distractors there were).
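The construction of the distance matrix described in the method above can be sketched as follows. The feature extraction (the forward pass of the trained network) is represented here by already computed network outputs, and the function names are our own.

import numpy as np

def feature_vector(cnn_output):
    # normalize the 55-dimensional network output to unit length
    v = np.asarray(cnn_output, dtype=np.float64)
    return v / np.linalg.norm(v)

def distance_matrix(query_features, database_features):
    # F[i, j] is the L2 distance between query object i and database object j
    F = np.zeros((len(query_features), len(database_features)))
    for i, fi in enumerate(query_features):
        for j, fj in enumerate(database_features):
            F[i, j] = np.linalg.norm(fi - fj)
    return F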

Figure 6.2: The precision-recall graph on SHREC15.

Table 6.4: Results of the object retrieval on SHREC15 using architectures learned on SHREC16 (for each of the four tested architectures: architecture id, batch size, learning rate, optimizer (Adam in all cases), MAP and NN).

6.3 Object Recognition and Retrieval (ModelNet10)

6.3.1 Method

As mentioned in chapter 2, this dataset consists of around 5 thousand models from 10 classes. All models were first preprocessed as already discussed in the section above. We then trained several CNN architectures according to table 4.1 to recognize the models in the test dataset. Since this dataset has only a test split and no validation split, we could not use the best-model selection from the previous dataset and instead trained each network for a fixed number of epochs; the accuracy on the test split was then evaluated with the model from the last epoch. After that we were able to perform the similarity search using our trained classifier, as described in section 4.4.1.

6.3.2 Results and Discussion

Object recognition

Table 6.5 presents the results of our convolutional neural networks on the ModelNet10 dataset. We can see that on this dataset it was much better to use the RMSProp optimizer. It was also beneficial to use a lower learning rate and to train for a lower number of epochs.

Table 6.5: Results of object recognition on ModelNet10 (for each architecture: architecture id, number of epochs, batch size, learning rate, optimizer (Adam or RMSProp) and accuracy on the test data; the best RMSProp configurations reach 90.3%, 90.4% and 90.5%).

Fig. 6.3 plots the accuracy (on the y axis) after each epoch (on the x axis), and it also makes visible a problem shared by all architectures: even with the smaller learning rate the learning process is not very smooth, and the accuracy oscillates around 89%. The problem is probably caused by the size of the dataset: it is so small and imbalanced that the bigger groups have a much bigger impact on the learning and pull the gradients faster in their direction.

Figure 6.3: The accuracy in each epoch ((a) first four architectures, (b) last six architectures).

We also created a confusion matrix for ModelNet10 with the best trained architecture (see table 6.5). The labels are described in table 6.6 and the confusion matrix is shown in fig. 6.4. The model was right in most cases, but it often made mistakes when distinguishing between a desk and a table, or between a night stand and a dresser; both cases are understandable. We can say that our approach is very good at recognizing every group, but the results are not as good as those of other approaches in the ModelNet competition.
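The optimizer comparison in table 6.5 amounts to swapping a single operation in the training graph. The snippet below shows how this choice can be expressed with the TensorFlow 1.x API; it is a simplified illustration, not a copy of our training code.

import tensorflow as tf  # TensorFlow 1.x API

def build_train_op(loss, optimizer_name, learning_rate):
    # selects between the two optimizers compared in table 6.5
    if optimizer_name == "rmsprop":
        optimizer = tf.train.RMSPropOptimizer(learning_rate)
    elif optimizer_name == "adam":
        optimizer = tf.train.AdamOptimizer(learning_rate)
    else:
        raise ValueError("unknown optimizer: " + optimizer_name)
    return optimizer.minimize(loss)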

Even though we missed the first place by only 6.9%, we finished sixth from the last place (eleventh overall).

Table 6.6: The ModelNet10 labels: 0 bathtub, 1 bed, 2 chair, 3 desk, 4 dresser, 5 monitor, 6 night stand, 7 sofa, 8 table, 9 toilet.

Figure 6.4: The confusion matrix on ModelNet10.

Object retrieval

Table 6.7 contains the results of the object retrieval. The results correlate with those of the object recognition, since the three architectures listed before the last one are still the best. The table also shows that our object retrieval using the learned classifier is a very efficient approach on this dataset, as it achieves even better results than our recognition approach. Our results are much better than those in the competition, but the best approaches did not provide their MAP.
