MATTI TUHOLA WIRELESS ACCESS POINT QUALITY ASSESSMENT USING CONVOLUTIONAL NEURAL NETWORKS. Bachelor of Science Thesis


MATTI TUHOLA
WIRELESS ACCESS POINT QUALITY ASSESSMENT USING CONVOLUTIONAL NEURAL NETWORKS
Bachelor of Science Thesis

Examiner: Heikki Huttunen
Submitted: April 29, 2016

ABSTRACT

TAMPERE UNIVERSITY OF TECHNOLOGY
Degree Programme in Information Technology
MATTI TUHOLA: Wireless Access Point Quality Assessment Using Convolutional Neural Networks
Bachelor of Science Thesis, 22 pages
April 2016
Major: Signal Processing
Examiner: Heikki Huttunen
Keywords: Convolutional Neural Networks, Deep Learning, Machine Learning, Positioning

Positioning has many applications in today's world and there are many methods to improve its accuracy. Using wireless access points is a common way to improve positioning accuracy, but not all access points are reliable enough for this purpose. In this thesis, machine learning methods are used to assess the quality of wireless access points.

The problem is treated as a supervised learning problem. The data is annotated and pre-processed, and a machine learning model is used to predict the labels created in the annotation process. Convolutional neural networks are used as the principal model and a logistic regression model is used for comparison.

The results indicate that machine learning can be used for this purpose. Convolutional neural networks perform better than logistic regression, but not by a large margin.

PREFACE

This thesis was made in collaboration with HERE for the Bachelor's Seminar in Signal Processing in the spring of 2016. My personal motivation for this work arose from my interest in machine learning. This work was a great opportunity to increase my understanding of machine learning, in particular convolutional neural networks, and apply it to a real-world problem.

I would like to thank Heikki Huttunen for his valuable feedback and for examining this work, the people at HERE for providing the topic and the data, and my friends and family for their support.

Tampere, April 29, 2016
Matti Tuhola

CONTENTS

1. Introduction
2. Theoretical Background
   2.1 Neural Networks and Deep Learning
       Multi-Layer Perceptrons
       Convolutional Neural Networks
       Network Training
   2.2 Model Evaluation
   2.3 Overfitting and Regularization
3. Implementation
   3.1 The Dataset
   3.2 Annotation
   3.3 Data Pre-processing
   3.4 Models and Learning Algorithms
4. Results
Discussion
Visualizing the Weights and the Feature Maps
Conclusions

LIST OF ABBREVIATIONS AND SYMBOLS

CNN     Convolutional neural network
GPS     Global Positioning System
GPU     Graphics processing unit
MAE     Mean absolute error
MLP     Multi-layer perceptron
MSE     Mean squared error
ReLU    Rectified linear unit
SGD     Stochastic gradient descent
WLAN    Wireless local area network

$b$               The bias value of a neuron.
$J(\theta)$       A cost function that calculates a scalar cost.
$p$               Dropout rate, the probability of a unit in a neural network to be dropped.
$w$               The weight vector of a neuron.
$X$               A complete dataset.
$x^{(i)}$         The $i$th sample in a dataset.
$y$               The set of labels for a dataset.
$\hat{y}$         The predictions made by a machine learning model.
$\eta$            The learning rate, a parameter of the optimization algorithm that controls the rate of change on each iteration.
$\theta$          The set of parameters of a machine learning model.
$\sigma(\cdot)$   The activation function applied to the output of a neuron.
$\omega$          The convolutional kernel used in convolutional neural networks.

1. INTRODUCTION

The use of positioning systems in different fields has proliferated in the early 21st century. Positioning systems are used in many products, including vehicles and mobile devices, for a vast range of applications such as navigation, mapping, and various location-based services. Emerging technologies such as self-driving cars also heavily depend on positioning. All of these applications benefit from accuracy and a fast signal acquisition time. Accordingly, improving them has been of interest to researchers for decades.

Positioning systems rely on global navigation satellite systems, such as the Global Positioning System (GPS), as their foundation. These systems have their limitations: they generally cannot operate indoors, and in some situations, acquiring a signal may take a long time. For this reason, information from additional sources, such as cellular sites and wireless local area network (WLAN) access points, is used. The information from these sources can be used to determine which satellites are in range, or to approximate the distance to nearby access points.

Data about the WLAN access points can be crowd-sourced from users' mobile devices and later be used for positioning. However, the incoming data is raw and may contain inaccuracies due to the heterogeneity of the devices, software bugs, and environmental and other factors.

The goal of this thesis is to use machine learning to determine which access points provide data that is reliable enough for positioning. The problem will be treated as a supervised learning problem. In Chapter 2, the theoretical background behind the machine learning techniques used in this thesis will be discussed.

The implementation will be discussed in Chapter 3. The machine learning task can be split into two parts. First, the raw data is annotated. A numeric label for each sample is set based on the input from a human expert. The label represents the quality of the given access point.
In the second part, the data is pre-processed and fed into a machine learning model that attempts to predict the labels. Experiments are run using multiple convolutional neural network models and a logistic regression model. The results will be discussed in Chapter 4.

2. THEORETICAL BACKGROUND

Machine learning algorithms are algorithms that are able to learn from data. The starting point for any machine learning application is thus to have a dataset $X$ that consists of $m$ samples, $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$. A machine learning model learns its parameters from the data.

There are two primary categories of learning, namely unsupervised and supervised learning. In unsupervised learning, the model attempts to discover patterns or other insights from unlabeled data. In supervised learning there is, in addition to the dataset $X$, a set of corresponding labels $y = \{y^{(1)}, y^{(2)}, \ldots, y^{(m)}\}$. A supervised learning model approximates a function that predicts the labels based on an input $x$. Supervised learning can be further divided into regression and classification tasks. In regression, the predicted outputs are continuous numeric values, whereas in classification, the predicted outputs consist of discrete class labels.

Machine learning models vary in terms of complexity. The more complex models have a greater representational capacity, meaning that they can represent more intricate functions. The most appropriate function of all the functions that a model can represent is chosen by training the parameters. The models are typically trained using gradient-based methods that change model parameters iteratively. The representational capacity and the training of neural networks, a family of machine learning models, are discussed in Section 2.1.

In order to be useful in the real world, a machine learning model needs to be able to generalize to new data that it has not seen before. To properly evaluate the model, it is conventional to randomly split the dataset into two or three subsets, namely a training set, a test set, and often a validation set [4, p. 222]. The training set is used to train the parameters of the model. The validation set is used to find the best model out of many.
It is also used to tune the hyperparameters, or manually chosen settings that control the behavior of the learning algorithm. The purpose of the test set is to measure the model's ability to generalize to new, unseen data. Model evaluation and splitting the dataset are discussed in Section 2.2. A model is said to overfit if it performs well on the training data and poorly on the test data. Techniques to combat overfitting are discussed in Section 2.3.
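The random split into training, validation, and test sets described above can be sketched as follows. This is a minimal illustration using NumPy, not the code used in this thesis; the function name `split_dataset`, the 20 % fractions, and the toy data are assumptions made for the example.

```python
import numpy as np


def split_dataset(X, y, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Randomly split samples and labels into training, validation and test subsets."""
    m = len(X)
    rng = np.random.default_rng(seed)
    order = rng.permutation(m)  # random order of the sample indices
    n_test = int(round(m * test_fraction))
    n_val = int(round(m * val_fraction))
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))


# Toy data (hypothetical): 50 samples with two features each.
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 2
train, val, test = split_dataset(X, y)
print(len(train[0]), len(val[0]), len(test[0]))
```

Because the indices come from a single permutation, the three subsets are disjoint, which is what keeps the test set unseen during training and model selection.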

2.1 Neural Networks and Deep Learning

Deep learning is a branch of machine learning where neural network models with multiple computational layers are used to learn multiple levels of abstraction from the data. In recent years, deep learning has seen a surge in popularity. It is considered to be the state-of-the-art technology in many fields, including computer vision and speech recognition [10]. Deep learning has also proved successful in tasks that were previously unattainable for computers, such as beating human champions in the game of Go [3].

Many machine learning models benefit from having the data represented as a set of hand-crafted features. The representation of the data has a heavy impact on the performance of the model. Choosing the right features for a given task is difficult and often requires substantial domain knowledge. One of the advantages of deep learning models is that they can not only learn to make predictions from existing features but also learn the features themselves, with little data pre-processing or feature extraction.

Artificial neural networks, or simply neural networks, are a central concept in deep learning. Neural networks are machine learning models that consist of layers of interconnected units. The units are often referred to as neurons. Neuroscience has been a source of inspiration for artificial neural networks. This link should not, however, be overemphasized. Today, neuroscience is not predominantly used to guide deep learning research, and the goal of deep learning is not to learn to simulate the human brain [5].

Neural networks have a large number of parameters $\theta \in \mathbb{R}^N$, which allows them to represent very complex functions. In some recent models, $N$ has been as large as 144 million [13].
The parameters can be learned from data using methods such as the backpropagation algorithm and stochastic gradient descent, discussed later in this section under network training.

In the following sections, two types of artificial neural networks will be discussed, namely multi-layer perceptrons and convolutional neural networks. They are examples of feedforward neural networks, where the computational layers are connected sequentially and information flows in one direction without feedback connections.

In terms of theoretical advances, few things about deep learning are new. Neural networks have existed for decades; many central ideas such as the backpropagation algorithm and convolutional neural networks were already known in the 1980s and the 1990s [8, 9]. The reason for the newfound success in using neural networks has been in large part due to the availability of large datasets and the growing

computational capacity. Computational capacity has allowed for models that may have on the order of hundreds of layers. For comparison, neural networks in the 1990s typically only had two or three layers. The large number of layers, or depth, is where deep learning gets its name from.

Multi-Layer Perceptrons

Multi-layer perceptrons (MLPs) are the quintessential example of a feedforward neural network. They are also an example of a supervised learning model. The parameters of the network have to be separately trained in order to use the network for prediction. Multi-layer perceptrons consist of many layers of neurons. They derive their name from the 1950s neuron model, the perceptron [12]. MLPs are discussed in more detail in the work of Goodfellow et al. [5].

The neurons used in MLPs are simple computational units that are loosely inspired by neurons in the brain. The neuron calculates the inner product of the inputs and the weights and passes the result through an activation function. The structure of a neuron is illustrated in Figure 2.1.

Figure 2.1 A neuron in a multi-layer perceptron. The neuron calculates the inner product of the input vector $x$ and the weight vector $w$, adds the bias value and passes the result through an activation function $\sigma(\cdot)$.

The inputs to the neuron can be represented as a vector $x = [x_1 \; x_2 \; \cdots \; x_n]^T$. The input vectors in MLPs are one-dimensional. Inputs of two or higher dimensions must be vectorized before feeding them into an MLP. The parameters of the neuron consist of a bias $b$ and a weight vector $w = [w_1 \; w_2 \; \cdots \; w_n]^T$, where the values are the weights for the respective inputs. The output of a single neuron is defined by

$$a = \sigma\left(\sum_{i=1}^{n} w_i x_i + b\right) = \sigma(w^T x + b), \tag{2.1}$$

where $\sigma(\cdot)$ is the activation function. The purpose of the activation function is to produce a non-linear decision boundary. Non-linearity is essential because a multi-layer perceptron with linear activation functions could be expressed with a single layer. Common activation functions for multi-layer perceptrons include the hyperbolic tangent function $\sigma(x) = \tanh(x)$ and the logistic sigmoid function $\sigma(x) = (1 + e^{-x})^{-1}$. The bias term allows the network to shift the activation function to the left or to the right.

Multi-layer perceptrons consist of an input layer, one or more hidden layers and an output layer. The only purpose of the input layer is to feed the input vector to the next layer. The units in the other layers are neurons. Hidden layers are computational layers inside the network, whereas the output layer defines the output of the network. In classification, the number of neurons in the output layer is equal to the number of classes; in regression, there is just one neuron in the output layer. Figure 2.2 illustrates a multi-layer perceptron with two hidden layers.

Figure 2.2 A multi-layer perceptron with two hidden layers. The units in the hidden layers and in the output layer, depicted as circles, are neurons.

The layers in MLPs are fully connected. This means that every neuron is connected to all of the units in the preceding layer and the subsequent layer. The neurons get their input from the units in the previous layer and pass their output to the units in the next layer. They are not connected to other neurons in the same layer. The network defines a function $\hat{y} = f(x; \theta)$, where $\theta$ consists of the biases and the weights of the neurons and $\hat{y}$ is the predicted output.

One of the problems with fully connected neural networks is that the number of connections grows quickly as the size of the input grows, because every neuron in the first hidden layer needs one weight per input element, which makes them

impractical for large inputs. In addition, they are very sensitive to even small changes in the input. For example, translating or rotating an input image may completely change the predicted output.

Convolutional Neural Networks

Convolutional neural networks (CNNs) are variations of multi-layer perceptrons. Their structure draws inspiration from the visual cortex and they are specifically designed for data that can be represented in a grid-like topology, such as images [5, p. 334]. In this section, convolutional neural networks will be discussed as they are typically applied to image data.

Convolutional layers are the core building blocks of CNNs. The parameters of a convolutional layer consist of a set of convolutional kernels. In convolutional layers, the input is not limited to one-dimensional vectors. Input images are represented as 2D or 3D matrices, depending on whether or not there are multiple color channels.

The neurons in a convolutional layer are arranged on a 2D plane called a feature map. The neurons in a feature map are locally connected. The output neuron is only connected to a limited area in the input, the size of which is determined by the convolutional kernel. Each neuron is connected to a different area in the input but all neurons use the same convolutional kernel. The use of shared kernels allows the network to express large models with a fairly small number of parameters. Figure 2.3 illustrates a convolutional layer.

Figure 2.3 A convolutional layer convolves the input with a convolutional kernel, producing a feature map.
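To make the effect of shared kernels concrete, the following sketch counts the parameters of a convolutional layer and of a fully connected layer that produces the same number of outputs. The layer sizes (a 64 x 64 grayscale input, 32 kernels of size 3 x 3, no zero padding) are illustrative assumptions, as is the choice of one bias per kernel and per neuron; the helper names are hypothetical.

```python
def dense_params(n_inputs, n_units):
    """One weight per input for every neuron, plus one bias per neuron."""
    return n_inputs * n_units + n_units


def conv_params(n_kernels, kernel_size, in_channels=1):
    """Each kernel is shared by all neurons of its feature map (plus one bias)."""
    return n_kernels * (kernel_size * kernel_size * in_channels + 1)


n_inputs = 64 * 64                      # pixels in the input image
map_side = 64 - 3 + 1                   # feature map side without zero padding
n_outputs = 32 * map_side * map_side    # neurons in all 32 feature maps

conv = conv_params(n_kernels=32, kernel_size=3)
dense = dense_params(n_inputs, n_outputs)
print("convolutional layer:", conv)
print("fully connected layer with the same number of outputs:", dense)
```

Under these assumptions the convolutional layer needs a few hundred parameters, whereas the fully connected counterpart needs hundreds of millions.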

Consider a simple case with an $N \times N$ grayscale image as the input $x$, and an $M \times M$ convolutional kernel $\omega$. The output of a single neuron is given by

$$a_{i,j} = \sum_{u=0}^{M-1} \sum_{v=0}^{M-1} \omega_{u,v} \, x_{i+u,\,j+v}, \tag{2.2}$$

where $a_{i,j}$ is the neuron in the $i$th column and the $j$th row of the feature map. In this case, the feature map would be of size $(N - M + 1) \times (N - M + 1)$. If the dimensions of the input were to be retained, the input would have to be padded with zeros.

Each convolutional layer typically comprises multiple convolutional kernels. As a result, there will be multiple feature maps, and the output of the convolutional layer will be a 3D matrix. The kernel $\omega$ can also be three-dimensional and extend through a 3D input volume. The kernel size typically depends on the size of the input. For example, $3 \times 3$ and $5 \times 5$ are common kernel sizes for small images, whereas larger kernels might be used for fairly large images.

The output from a convolutional layer is passed through a non-linear activation function. The most common activation function for CNNs is the rectified linear unit (ReLU), $\sigma(x) = \max(0, x)$. ReLUs are preferred to the hyperbolic tangent or the logistic sigmoid function because they help convolutional neural networks converge faster [7]. In addition, they are computationally cheap and they help avoid some problems that are present when training with other activation functions.

Pooling layers are often used between successive convolutional layers. Pooling layers reduce the spatial size of the feature maps by partitioning them into $k \times k$, most commonly $2 \times 2$, non-overlapping tiles and reducing each tile to a single pixel. The most common type of pooling is max pooling, which takes the maximum of each tile. Pooling makes the network more invariant to small translations in the input, which is useful if the presence of some feature is more important than its location [5]. Another benefit of pooling is that it increases the computational efficiency of the network.
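The three operations described above (the convolution of Equation 2.2 without zero padding, the ReLU activation, and max pooling over non-overlapping tiles) can be sketched in NumPy as follows. This is a didactic sketch with explicit loops, not how a deep learning library implements them; the function names and the example input and kernel are assumptions.

```python
import numpy as np


def conv2d_valid(x, kernel):
    """Equation (2.2): apply an M x M kernel to an N x N input without padding."""
    n, m = x.shape[0], kernel.shape[0]
    out = np.zeros((n - m + 1, n - m + 1))
    for i in range(n - m + 1):
        for j in range(n - m + 1):
            # Sum over u, v of kernel[u, v] * x[i + u, j + v].
            out[i, j] = np.sum(kernel * x[i:i + m, j:j + m])
    return out


def relu(a):
    """Rectified linear unit, max(0, a), applied element-wise."""
    return np.maximum(0, a)


def max_pool(a, k=2):
    """Reduce each non-overlapping k x k tile to its maximum."""
    h, w = a.shape[0] // k, a.shape[1] // k
    tiles = a[:h * k, :w * k].reshape(h, k, w, k)
    return tiles.max(axis=(1, 3))


# Example input (hypothetical): values increase by one from left to right.
x = np.arange(36, dtype=float).reshape(6, 6)
# A simple kernel that responds to horizontal intensity changes.
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

feature_map = relu(conv2d_valid(x, kernel))   # shape (6 - 3 + 1, 6 - 3 + 1)
pooled = max_pool(feature_map)                # shape (2, 2)
print(feature_map.shape, pooled.shape)
```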
Figure 2.4 depicts the structure of a typical convolutional neural network. The structure follows a repeating pattern, where a convolutional layer with a non-linear activation function is followed by a pooling layer. This structure allows the network to learn features hierarchically. The first convolutional layers typically learn low-level features such as edges, whereas later layers will learn features that are higher-level abstractions. The last layers in the network are typically fully connected like

the layers in MLPs. Their purpose is to learn from the high-level features produced by the last convolutional layer, and to produce the final output.

Figure 2.4 A convolutional neural network consists of convolutional layers with non-linear activation functions, pooling layers, and fully connected layers.

Network Training

So far, the focus has been on the representational capacity of neural networks. The process described earlier, where a neural network with parameters $\theta$ gets an input $x$ and produces a predicted output $\hat{y}$, is called forward propagation. A neural network model defines a family of functions. By learning the parameters, the most appropriate function to solve a given problem can be selected.

In order to learn the parameters, the network needs to be trained on a dataset. In the training phase, the first step is to perform forward propagation on an input $x$ to calculate the predicted output $\hat{y}$. Then, based on the difference between $\hat{y}$ and the label $y$, a cost function $J(\theta)$ is used to calculate a scalar cost that measures how well the predicted outputs match the labels. The backpropagation algorithm is applied to compute the gradient of the cost with respect to the parameters of the network. Another algorithm, stochastic gradient descent (SGD), is commonly used to learn, i.e. to minimize the cost function by updating the parameters using the gradient.

The choice of cost function typically depends on the kind of problem that is being solved. For regression problems, simple cost functions such as the mean squared error (MSE), given by

$$\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2, \tag{2.3}$$

or the mean absolute error (MAE), given by

$$\mathrm{MAE} = \frac{1}{m} \sum_{i=1}^{m} \left|\hat{y}^{(i)} - y^{(i)}\right|, \tag{2.4}$$

are usually suitable. MAE has the advantage of being more robust to outliers. For classification problems, using a different cost function such as logarithmic loss can be beneficial.

The gradient of the cost with respect to the parameters of the network is calculated using backpropagation. Backpropagation starts with the scalar cost calculated by the cost function and uses the chain rule of calculus to calculate the gradient of the cost with respect to the outputs of all the preceding layers. These gradients indicate how much the parameters of each unit contributed to the cost. The parameters can then be updated accordingly.

Minimizing the cost function is an optimization problem. Optimization algorithms used in machine learning are typically based on the concept of gradient descent. Gradient descent starts with some initial parameters and then iteratively moves towards the minimum based on the gradient of the cost function with respect to the parameters. The cost functions in neural networks are almost always non-convex, which means that there can be many local minima. An optimization algorithm is not guaranteed to converge to the global minimum. Gradient descent is sensitive to the initial values of the parameters. Typically, small random values are used for initialization [5, p. 176].

The training is done in epochs. In an epoch, all the samples in the training set are presented to the network once. The number of epochs used to train a neural network can be on the order of hundreds. In vanilla gradient descent, the gradient is computed for the entire dataset. In other words, the parameters are only updated after a complete epoch. In stochastic gradient descent, the gradient is estimated based on one sample or a mini-batch of $n$ samples of the data $\{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$ with corresponding labels $\{y^{(1)}, y^{(2)}, \ldots, y^{(n)}\}$.
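As a minimal sketch of training in epochs with mini-batches, the following NumPy code fits a single linear neuron by stochastic gradient descent on the MSE cost of Equation 2.3. A linear model is assumed here so that the gradient can be written out by hand instead of being computed by backpropagation; the function name, the step size `learning_rate`, the batch size, and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def sgd_linear_regression(X, y, learning_rate=0.1, batch_size=8, epochs=200, seed=0):
    """Fit y ~ X @ w + b by mini-batch stochastic gradient descent on the MSE."""
    rng = np.random.default_rng(seed)
    m, n_features = X.shape
    w = rng.normal(scale=0.01, size=n_features)  # small random initial weights
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(m)  # present the samples in a new random order
        for start in range(0, m, batch_size):
            batch = order[start:start + batch_size]
            error = X[batch] @ w + b - y[batch]             # y_hat - y on the batch
            grad_w = 2.0 * X[batch].T @ error / len(batch)  # d(MSE)/dw
            grad_b = 2.0 * error.mean()                     # d(MSE)/db
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Synthetic, noise-free data (hypothetical) generated by a known linear function.
data_rng = np.random.default_rng(1)
X = data_rng.uniform(-1, 1, size=(64, 2))
y = X @ np.array([2.0, -3.0]) + 0.5

w, b = sgd_linear_regression(X, y)
mse = np.mean((X @ w + b - y) ** 2)
print(w, b, mse)
```

With 64 samples and a batch size of 8, the parameters are updated eight times per epoch, whereas vanilla gradient descent would update them once.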
The order in which the samples are presented to the network affects the outcome and is therefore typically randomized.

Stochastic gradient descent updates the parameters based on the gradient calculated by backpropagation. The parameter update is given by

$$\theta \leftarrow \theta - \eta \nabla_{\theta} J(\theta),$$

where $J(\theta)$ is the cost function that calculates a scalar cost for the $n$ samples in a mini-batch and $\eta$ is the learning rate that controls the rate of change on each iteration.

Choosing the right learning rate can be difficult. If it is too small, the convergence will be slow; if it is too large, fluctuations or even divergence may occur. This is among the reasons that many alternative optimization algorithms, such as AdaGrad, RMSprop, and Adam [6], have been proposed. These alternative optimization algorithms are still based on the concept of stochastic gradient descent but they typically have an adaptive learning rate or some other characteristics that may hasten the convergence.

Training a neural network is a computationally intensive task. Today, neural networks are most commonly trained on graphics processing units (GPUs). GPUs provide a high memory bandwidth and a high degree of parallelism. These features can be utilized to a great extent when training neural networks [5]. The success of deep learning can, to some extent, be attributed to the efficient use of GPUs [10].

2.2 Model Evaluation

The ultimate purpose of a machine learning model is to be used in the real world on new, unseen data. For this reason, it is important to evaluate the performance of the model. There are two separate problems to consider when evaluating the performance of machine learning models, namely model selection, choosing the best model out of many, and model assessment, estimating the model's ability to generalize to new data [4, p. 222].

In an ideal situation, where data is plentiful, the dataset should be split into three subsets: a training set, a validation set, and a test set. The model should then be trained on the training set, and any modifications to its hyperparameters or the model itself should be evaluated on the validation set.
The generalization performance of the final model should be assessed on the test set, but only once no further changes will be made to the model or its hyperparameters.

The reasoning behind the split is manifold. The performance of the model on the data it was trained on does not reflect its ability to generalize to new data. For this reason, it is important that at least the test set always be kept separate, even if data is scarce. Furthermore, using a validation set is necessary to make sure that assessing the model on the test set provides an unbiased error rate. If the model or its hyperparameters are chosen so that they minimize the error on the test set, the

model may learn to slightly overfit the test set.

In many cases, the size of the dataset is not large enough to be split into three separate datasets. One of the most common ways to solve this problem is to use k-fold cross-validation [4]. In k-fold cross-validation, the training set is split into $k$ subsets of equal size. One subset is used for validation, and the other $k - 1$ subsets are used for training the model. This is repeated $k$ times, so that each subset is used for validation, and the results are combined. One of the drawbacks of k-fold cross-validation is that it can be computationally expensive on large models.

2.3 Overfitting and Regularization

Neural network models have a large number of parameters, which makes them very expressive. It also makes them prone to a problem known as overfitting. Overfitting occurs when the model starts to learn the noise from the training data, decreasing the model's ability to generalize to data outside the training set.

There are many ways to reduce overfitting. One of the simplest methods is to have more training data. More data can be gathered and annotated, or the existing data can be augmented by adding slightly modified copies of the existing samples to the training set. Augmentation is especially applicable to image data; new images can be created by rotating, translating, and mirroring existing ones. Augmentation has been found to be an effective way to improve generalization performance. It cannot, however, be applied to all kinds of data.

Techniques that reduce overfitting are known as regularization techniques. Traditional regularization techniques include Tikhonov, $L_2$, and $L_1$ regularization. They add an additional regularization term to the cost function. This term penalizes large weights. Consequently, the network will learn to prefer small weights, which makes it more difficult for the network to learn noise from the training data.
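As a small illustration of a regularization term, the sketch below adds an $L_2$ penalty to the MSE cost. The penalty weight `strength` is a hypothetical hyperparameter name, and the example weights are made up; the sketch only shows how large weights raise the cost even when the predictions are identical.

```python
import numpy as np


def mse(y_hat, y):
    """Mean squared error between predictions and labels."""
    return np.mean((y_hat - y) ** 2)


def l2_regularized_cost(y_hat, y, weights, strength=0.01):
    """MSE plus an L2 penalty on the weights: MSE + strength * sum(w^2)."""
    return mse(y_hat, y) + strength * np.sum(weights ** 2)


def l2_penalty_gradient(weights, strength=0.01):
    """Gradient of the penalty term; it pulls every weight towards zero."""
    return 2.0 * strength * weights


y = np.array([1.0, 2.0])
y_hat = np.array([1.0, 2.0])            # perfect predictions in both cases
small_weights = np.array([0.1, -0.1])
large_weights = np.array([10.0, -10.0])

cost_small = l2_regularized_cost(y_hat, y, small_weights)
cost_large = l2_regularized_cost(y_hat, y, large_weights)
print(cost_small, cost_large)
```

During training, the gradient of the penalty is added to the gradient of the data term, so each update also shrinks the weights slightly.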
Dropout [14] is a method for reducing overfitting in neural networks. Instead of changing the cost function, dropout changes the network itself. The key idea is to randomly drop some units in the training phase. Dropping means that the output of the unit is set to zero, which effectively means that the unit has no effect on the output of the network. Each unit has a probability p of being dropped. The probability p is a hyperparameter of the model. Figure 2.5 illustrates a neural network with dropout applied to it.

Figure 2.5 The multi-layer perceptron from Figure 2.2 after applying dropout with a dropout rate of 0.5.

Dropout is performed in the training phase. When testing the model, no units are dropped. Dropout has been found to be a very effective method for reducing overfitting. Dropping units forces the remaining units in the network to learn more independently. Applying dropout is approximately equivalent to training many neural networks and averaging their output [14].
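Dropout on the outputs of one layer can be sketched as follows. The sketch assumes the common "inverted dropout" variant, in which the kept outputs are scaled by $1/(1-p)$ during training so that nothing needs to change at test time; this scaling is an implementation choice that is not described in the text above, and the function name and array sizes are illustrative.

```python
import numpy as np


def dropout(activations, p, training, rng):
    """Drop each unit with probability p while training; pass through otherwise."""
    if not training or p == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= p  # True with probability 1 - p
    # Assumed inverted-dropout scaling: keeps the expected output unchanged.
    return activations * keep_mask / (1.0 - p)


rng = np.random.default_rng(0)
layer_output = np.ones((1000, 100))

train_output = dropout(layer_output, p=0.5, training=True, rng=rng)
test_output = dropout(layer_output, p=0.5, training=False, rng=rng)
print("fraction dropped in training:", np.mean(train_output == 0.0))
print("outputs unchanged at test time:", np.array_equal(test_output, layer_output))
```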

3. IMPLEMENTATION

3.1 The Dataset

The dataset consists of signal strength measurements near WLAN access points. Each sample in the dataset is a point cloud with measurements from one access point. The number of measurements in a sample ranges from one to on the order of one thousand. Each measurement comprises three values, namely the latitude, the longitude, and the signal strength. The data has been crowd-sourced from mobile devices that have been in range of the access point and connected to GPS.

The raw data is unlabeled and contains inaccuracies. The objective is to create a model that can filter out the less accurate access points and thus improve positioning accuracy. To achieve this, 7,500 randomly picked samples from a larger dataset were annotated and pre-processed.

3.2 Annotation

In order to apply a supervised learning algorithm, the data needs to have labels. In this case, a label would be a score that tells how useful a particular sample of the data is. Annotating the data to create labels is often a laborious task that needs to be performed manually. In this particular case it would be difficult for the user to consistently determine a score by just looking at one sample of the data at a time.

To solve this problem, a different approach was taken to annotating the data. Instead of looking at one sample of the data at a time, two samples were compared to each other and the user had to choose the one they found to be better. Comparing all the possible pairs would be impractical, so the Swiss tournament system [11] was used to create a limited number of pairs.

The Swiss tournament system is a non-elimination system with a predetermined number of rounds, $n$. In the first round, the data is randomly split into pairs. The user determines the winners, each of which receives one point. In the next rounds, the pairs are randomly formed among samples with the same score. The process is repeated until all the comparisons have been made.
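The pairing procedure described above can be sketched in Python as follows. This is an illustration, not the annotation program used in this work: the callback `choose_better` stands in for the human annotator, and the handling of score groups with an odd number of samples (the unpaired sample is moved to the next lower score group, or sits out the round if none remains) is an assumption, since the text does not specify it.

```python
import random
from collections import defaultdict


def swiss_tournament(samples, n_rounds, choose_better, seed=0):
    """Run n_rounds of Swiss-style pairing and return the points of each sample."""
    rng = random.Random(seed)
    points = {s: 0 for s in samples}
    for _ in range(n_rounds):
        # Group the samples by their current score.
        groups = defaultdict(list)
        for s in samples:
            groups[points[s]].append(s)
        carried = []  # unpaired sample moved down from a higher score group
        for score in sorted(groups, reverse=True):
            pool = carried + groups[score]
            rng.shuffle(pool)  # random pairs within the score group
            carried = [pool.pop()] if len(pool) % 2 else []
            for a, b in zip(pool[0::2], pool[1::2]):
                points[choose_better(a, b)] += 1
    return points


# Simulated annotator (hypothetical): the sample with the larger id is better.
samples = list(range(16))
points = swiss_tournament(samples, n_rounds=3, choose_better=max)
print(points)
```

With 16 samples and three rounds, only 24 comparisons are needed instead of the 120 required to compare every possible pair. Dividing the points by the number of rounds gives scores between 0 and 1.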

For this purpose, a program for annotating the data was written in MATLAB. A screenshot of this program is shown in Figure 3.1. The program takes the raw data as input and plots two samples from it on the map at a time. The signal strength is represented by color: the warmer the color, the stronger the signal. The map is used as a contextual cue to help the user determine which of the two samples compared at a time is better. The user then chooses the better sample, and the annotation process unfolds as described earlier. The final output is a file with the file names and the scores of the respective samples, scaled between 0 and 1 with $n + 1$ discrete values.

Figure 3.1 A screenshot of the annotation software. The user chooses the better sample of the two displayed at a time. In the above example, the left sample is better, because the data points with the best signal strengths are clearly concentrated in a small area.

This method is not without its drawbacks. It relies on human intuition and understanding of what makes a good sample. This is likely to differ between people. In addition, the user may be forced to choose between two equally good or equally bad samples, which may skew the results. These problems could be alleviated by annotating the same data multiple times and combining the results, but in any case, they are limitations of the method that must be acknowledged.

3.3 Data Pre-processing

The data collected from the WLAN access points cannot be fed into a model as a raw object that consists of the data points and metadata; it needs to be pre-processed. The data was chosen to be represented as grayscale images. The size of these images was chosen to be $64 \times 64$ pixels with values ranging from 0 to 255. Each non-zero pixel represents a data point and the pixel value represents the signal strength. It was found that this size was sufficient to fit all the data points in more than 95 % of the samples. In many cases, the points that were left out were outliers. The median of the latitudes and the longitudes of all the points in a sample was chosen to be the center point of the image.

Figure 3.2 Samples from the dataset in the form they were fed to the learning model.

Figure 3.2 illustrates the input images. The 7,500 samples of data were randomly split into a training set of 80 % and a test set of 20 %. In practice, this means that there were 6,000 training samples and 1,500 test samples. Since the size of the training set can be considered small for a deep model, an augmented version of the training set was created. The input images were mirrored and rotated to produce in total 8 times as much data as there was originally. This was expected to increase the network's invariance to rotation and thus increase its ability to generalize.

3.4 Models and Learning Algorithms

The machine learning models were implemented using Keras [2], a Python library. Keras offers a high level of abstraction, making it relatively straightforward to build and experiment with different kinds of models. It needs another library, a so-called backend, to handle the low-level computations. Theano [1] was used as the backend. The input data was represented as a NumPy array [15].
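The pre-processing of Section 3.3 can be sketched in NumPy as follows: a point cloud is drawn onto a 64 x 64 grayscale image centered on the median position, and the image is then mirrored and rotated into eight variants. Several details are assumptions made only for this sketch, because they are not given above: the spatial resolution `degrees_per_pixel`, the signal strength range `rssi_range` used for scaling to 1..255, the rule that the strongest measurement wins when several points fall on the same pixel, and treating latitude and longitude as a flat grid.

```python
import numpy as np


def rasterize(points, size=64, degrees_per_pixel=1e-5, rssi_range=(-100.0, -30.0)):
    """Draw (latitude, longitude, signal strength) rows onto a size x size image."""
    points = np.asarray(points, dtype=float)
    lat, lon, rssi = points[:, 0], points[:, 1], points[:, 2]
    center_lat, center_lon = np.median(lat), np.median(lon)
    # North is up: larger latitudes map to smaller row indices.
    rows = np.floor((center_lat - lat) / degrees_per_pixel).astype(int) + size // 2
    cols = np.floor((lon - center_lon) / degrees_per_pixel).astype(int) + size // 2
    lo, hi = rssi_range
    # Assumed scaling: data points get values 1..255, so 0 means "no data".
    values = 1 + np.round(254 * np.clip((rssi - lo) / (hi - lo), 0, 1))
    image = np.zeros((size, size), dtype=np.uint8)
    inside = (rows >= 0) & (rows < size) & (cols >= 0) & (cols < size)
    for r, c, v in zip(rows[inside], cols[inside], values[inside]):
        image[r, c] = max(int(image[r, c]), int(v))  # assumed: strongest wins
    return image


def augment(image):
    """Return the 8 variants obtained by mirroring and 90-degree rotations."""
    variants = []
    for flipped in (image, np.fliplr(image)):
        for quarter_turns in range(4):
            variants.append(np.rot90(flipped, quarter_turns))
    return np.stack(variants)


# Hypothetical sample: two measurements at the same spot and one distant outlier.
points = [(60.0, 24.0, -30.0), (60.0, 24.0, -50.0), (61.0, 24.0, -40.0)]
image = rasterize(points)
batch = augment(image)
print(image.shape, batch.shape)
```

The outlier falls outside the image and is left out, which mirrors the observation above that the discarded points were often outliers.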
The training data array is four-dimensional: the first value of its shape refers to the number of samples, the second to the number of color channels, and the remaining two to the dimensions of the input image. A convolutional neural network with two convolutional layers written in Keras is shown in Program 3.1.
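Before turning to the model, the augmentation and the array layout described above can be illustrated with a short NumPy sketch. The helper names `augment_dihedral` and `to_channels_first` are hypothetical, and it is an assumption that the eightfold increase comes from the four right-angle rotations of each image and of its mirror image.

```python
import numpy as np


def augment_dihedral(images):
    """Return all 8 rotated and mirrored variants of each image.

    images: array of shape (n, height, width) holding square images.
    """
    variants = []
    for base in (images, images[:, :, ::-1]):  # original and mirrored
        for k in range(4):                     # 0, 90, 180, 270 degrees
            variants.append(np.rot90(base, k=k, axes=(1, 2)))
    return np.concatenate(variants, axis=0)


def to_channels_first(images):
    """Add a single grayscale channel axis: (n, h, w) -> (n, 1, h, w)."""
    return images[:, np.newaxis, :, :].astype("float32")
```

Applied to 6,000 training images of 64 × 64 pixels, this would yield an array of shape (48000, 1, 64, 64), which matches the sample count and the `input_shape=(1, 64, 64)` argument used in Program 3.1.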

    from keras.models import Sequential
    from keras.layers.core import Activation, Dense, Dropout, Flatten
    from keras.layers.convolutional import Convolution2D, MaxPooling2D

    model = Sequential()

    # Two convolutional layers with 32 kernels and a kernel size of 3x3.
    model.add(Convolution2D(32, 3, 3, border_mode='valid', input_shape=(1, 64, 64)))
    model.add(Activation('relu'))
    model.add(Dropout(0.25))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Convolution2D(32, 3, 3))
    model.add(Activation('relu'))
    model.add(Dropout(0.25))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    # Two fully connected (dense) layers.
    model.add(Flatten())
    model.add(Dense(128))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1))

    # Squash final output between zero and one.
    model.add(Activation('sigmoid'))

    model.compile(loss='mae', optimizer='rmsprop')

    # Train the model on the training data in batches of 128 samples,
    # iterating over the entire dataset 100 times. Validate on the test data.
    model.fit(x_train, y_train, 128, 100, verbose=1, validation_data=(x_test, y_test))

Program 3.1 A simple convolutional neural network written in Keras.

Multiple convolutional neural networks with a varying number of convolutional layers were built. After the convolutional layers, the models have two fully connected layers, the latter of which is the output layer and consists of only one neuron. Many of the hyperparameters, such as the number and the size of the convolutional kernels and the dropout rate, were chosen experimentally using 5-fold cross-validation on the training data. The models used 32 convolutional kernels in each layer, each of size 3 × 3. A dropout rate of 0.25 was used for the convolutional layers.

In addition to the convolutional neural networks, one logistic regression model was implemented for comparison. Logistic regression models [4] are among the simplest machine learning models. Despite its simplicity, logistic regression often provides adequate results. Logistic regression can also be interpreted as a neural network with a single layer and a logistic sigmoid activation function. The input images were vectorized for logistic regression.

RMSprop was used as the optimization algorithm. The cost function was chosen to be mean absolute error (MAE) because of its easy interpretability. The outputs of the models range between 0 and 1, and a mean absolute error of 0.1 means that the prediction is, on average, off by 0.1.
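The view of logistic regression as a single-layer network, and the interpretation of the MAE, can be made concrete with a few lines of NumPy. This is a minimal sketch with hypothetical function names; it covers only the forward pass and the error measure, not the RMSprop training.

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def logistic_regression_predict(images, weights, bias):
    """Single-layer network: vectorize, weighted sum, sigmoid.

    images:  array of shape (n, height, width).
    weights: vector with one weight per pixel (height * width entries).
    """
    x = images.reshape(len(images), -1)  # (n, h, w) -> (n, h * w)
    return sigmoid(x @ weights + bias)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))
```

For example, predictions of 0.4 and 0.6 for two samples that are both labeled 0.5 give an MAE of 0.1: each prediction is off by 0.1.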

4. RESULTS

The models were trained five times, each time with a new random split into training data and test data, and the results were averaged. This was done to compensate for the small size of the test set, which was just 1,500 samples.

The experiment was first run with the original training set of 6,000 samples. The test errors and the time to train the models are shown in Table 4.1. The number of convolutional layers is shown in parentheses after the name of the model.

Table 4.1 The test errors and training times of the models without data augmentation.
Model                  Error (MAE)    Time to train (100 epochs)
CNN (1)                               min
CNN (2)                               min
CNN (3)                               min
CNN (5)                               min
CNN (7)                               min
Logistic regression                   min

The same experiment was run with augmented training data. The augmented training set included rotated and mirrored copies of the input images. The size of the augmented training set was 48,000 samples. The results are shown in Table 4.2.

Table 4.2 The test errors and training times of the models with data augmentation.
Model                  Error (MAE)    Time to train (100 epochs)
CNN (1)                               min
CNN (2)                               h 17 min
CNN (3)                               h 45 min
CNN (5)                               h 20 min
CNN (7)                               h 28 min
Logistic regression                   min

All the models were trained on a single NVIDIA Tesla K40M GPU. The models were trained for 100 epochs with a mini-batch size of 64.
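The evaluation protocol of repeated random splits can be sketched as follows. The function `average_error_over_splits` and its `train_and_score` callback are hypothetical; the callback stands in for building a model, training it, and returning its test MAE.

```python
import numpy as np


def average_error_over_splits(x, y, train_and_score, n_repeats=5,
                              test_fraction=0.2, seed=0):
    """Average a test error over several random train/test splits.

    train_and_score: callable (x_train, y_train, x_test, y_test) -> error;
                     a placeholder for training and evaluating one model.
    Returns the mean and the standard deviation of the errors.
    """
    rng = np.random.default_rng(seed)
    n_test = int(round(test_fraction * len(x)))
    errors = []
    for _ in range(n_repeats):
        order = rng.permutation(len(x))  # a new random split each time
        test_idx, train_idx = order[:n_test], order[n_test:]
        errors.append(train_and_score(x[train_idx], y[train_idx],
                                      x[test_idx], y[test_idx]))
    return float(np.mean(errors)), float(np.std(errors))
```

Reporting the standard deviation next to the mean is an addition of this sketch; it gives a rough indication of how much of the difference between two models could be due to the choice of split.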

4.1 Discussion

Augmenting the data decreases the test error of all convolutional neural networks. The effect of augmentation on the logistic regression model is minute. Training with the larger dataset also comes at a cost, namely a significantly longer training time. In practice this is not necessarily an issue, since the model only needs to be trained once in order to be used.

Convolutional neural networks performed better than logistic regression in every test. The convolutional neural network architecture with five convolutional layers was found to have the lowest test error in all cases. The differences among the error rates of the CNN models were, however, small, and the significance of these differences can be questioned.

The results are slightly affected by noise. Sources of noise include the random initialization of the parameters and the randomized mini-batches. The fact that the results are averages from five separate training passes should decrease the effect of the noise.

Figure 4.1 The test error of the models over 100 epochs of training.

Figure 4.1 shows the test error of the different models over 100 epochs of training. The test errors in the figure are from one of the five training passes. The logistic regression model converges immediately and doesn't change in a meaningful way over the training epochs. The small convolutional neural network models also converge quickly, whereas the larger networks benefit from the large number of epochs. The test error of the CNN with one convolutional layer appears to increase over time, which implies overfitting and insufficient regularization.
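An error curve on held-out data that rises after an early minimum, as described for the one-layer CNN in Figure 4.1, can be quantified with a small helper. The function name `best_epoch` is hypothetical and the curve in the usage note is made up for illustration; it is not data from the experiments.

```python
import numpy as np


def best_epoch(test_errors):
    """Locate the minimum of a per-epoch test error curve.

    Returns the zero-based epoch index of the lowest test error and the
    amount by which the error has risen from that minimum by the final epoch.
    """
    errors = np.asarray(test_errors, dtype=float)
    best = int(np.argmin(errors))
    return best, float(errors[-1] - errors[best])
```

For a made-up curve such as `[0.30, 0.20, 0.25, 0.30]`, the minimum is at index 1 and the error has risen by about 0.1 by the last epoch, which would be a sign of overfitting.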

4.2 Visualizing the Weights and the Feature Maps

Neural networks have been described as black box models because their inner workings can be difficult to understand. Visualizing the weights or the outputs of the layers can provide insight into this.

[Figure panels: Input, Layer 1, Layer 2, Layer 3, Layer 4, Layer 5]
Figure 4.2 Examples of the feature maps produced by a convolutional neural network given the input image on the left. Other layers such as [...]

Figure 4.2 visualizes some of the feature maps produced by the convolutional neural network model with five layers. In the figure, the gray background color represents zeros, lighter shades represent positive values, and darker shades represent negative values. The feature maps are depicted before the ReLU non-linearity has been applied to them. The first convolutional layer appears to have responded to the edges, whereas the following convolutional layers appear to have learned increasingly high-level features of the sample.

For the logistic regression model, the input images were vectorized. Each input unit has a corresponding weight that dictates how much of an impact the input unit has on the output. The weights can be used to infer what kind of an input the model would consider ideal. The weights of the logistic regression model, reshaped into an image, are shown in Figure 4.3. The color coding is the same as it was for the feature maps.

Figure 4.3 The weights of the logistic regression model reshaped into an image.

The weights show that the logistic regression model heavily favors inputs that have strong pixel values at the center of the image or in close proximity to it. Values around this region are negative. In other words, inputs that have their data points concentrated near the center are preferred. The values outside of these regions appear to be very small. The reason for this is probably that most samples simply did not have any non-zero values in these regions.
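The two visualizations in this section can be approximated with plain NumPy. The sketch below is not the code used in the thesis: the helper names are hypothetical, the filter is applied as a cross-correlation (a true convolution would flip the kernel), and any kernel passed in is a stand-in for a learned one.

```python
import numpy as np


def filter2d_valid(image, kernel):
    """Slide a kernel over an image without padding ('valid' mode).

    The result corresponds to a feature map before the ReLU is applied,
    so it can contain both positive and negative values.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * image[i:i + out_h, j:j + out_w]
    return out


def weights_as_image(weights, size=64):
    """Fold a per-pixel weight vector back into the image grid."""
    return np.asarray(weights).reshape(size, size)
```

A 3 × 3 kernel applied to a 64 × 64 input yields a 62 × 62 feature map. Plotting either result with a diverging color map centered at zero would reproduce the color coding described above, with gray for zero, lighter shades for positive values, and darker shades for negative values.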

5. CONCLUSIONS

The motivation behind this thesis was to determine whether machine learning, and especially convolutional neural networks, could be used to determine whether a WLAN access point provides accurate data for the purposes of positioning.

The starting point was raw and unlabeled WLAN data. The data was annotated using a program specifically built for the purpose. The annotation method compared two samples at a time over multiple passes over the dataset. The data was pre-processed into images in order to feed it to a machine learning model. Five convolutional neural network models and a logistic regression model were used in the experiments. The experiments were run with the normal dataset and an artificially augmented dataset.

The results suggest that using machine learning to assess the quality of WLAN access points is feasible. The best results were achieved with a convolutional neural network architecture that had five convolutional layers. The differences between the models were, however, rather small. Furthermore, it was observed that logistic regression, a much simpler model, could also achieve results relatively close to those obtained with convolutional neural networks. Data augmentation was found to improve the error rate of the models at the expense of a longer training time.

In the future, more work should go towards the annotation process. As it is, the annotation process may produce very different labels for very similar samples of the data. This problem could be alleviated by annotating the same data many times using the current method, or by using a different method altogether. Some, but probably limited, improvements could also be made by further tuning the models and their hyperparameters. In addition, having more data could improve the performance of the model, or at least increase the confidence in the results.

BIBLIOGRAPHY

[1] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, "Theano: a CPU and GPU math expression compiler," in Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010.
[2] F. Chollet, "Keras."
[3] T. Chouard, "The Go Files: AI computer wraps up 4-1 victory against human champion," Nature. [Online]. Available: http: //dx.doi.org/ /nature
[4] T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning: data mining, inference and prediction, 2nd ed. Springer.
[5] I. Goodfellow, Y. Bengio, and A. Courville, "Deep learning," 2016, book in preparation for MIT Press. [Online]. Available: org
[6] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/ .
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012. [Online]. Available: imagenet-classification-with-deep-convolutional-neural-networks.pdf
[8] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Computation, vol. 1, no. 4, Winter 1989.
[9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, November.
[10] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553. [Online]. Available: nature14539
[11] N. Ponomarenko, V. Lukin, A. Zelensky, K. Egiazarian, J. Astola, M. Carli, and F. Battisti, "TID2008 - a database for evaluation of full-reference visual quality assessment metrics," Advances of Modern Radioelectronics, vol. 10.
[12] F. Rosenblatt, "The perceptron: a probabilistic model for information storage and organization in the brain," Psychological Review, vol. 65, no. 6.
[13] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/ .
[14] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, 2014.
[15] S. van der Walt, S. C. Colbert, and G. Varoquaux, "The NumPy array: A structure for efficient numerical computation," Computing in Science & Engineering, vol. 13, no. 2, March 2011.
