Machine Learning written examination



Dept. of Information Technology
Olle Gällmo, Lecturer
Address: Lägerhyddsvägen 2, Box 337, SE Uppsala, SWEDEN
Web site: user.it.uu.se/~crwth
Email: olle.gallmo@it.uu.se

Friday, June 10,
Allowed help material: Pen, paper and rubber, dictionary

Please answer (in Swedish or English) the following questions to the best of your ability! Any assumptions made, which are not already part of the problem formulation, must be stated clearly in your answer!

The maximum number of points is 40. To get the grade 3 (pass) a total of 20 points is required. The grade 4 requires 27 points and grade 5 requires 32 points.

I will not be able to drop in to answer questions during this exam but Pontus will, sometime between 9.00 and Otherwise, as always, if something is unclear, just state your assumptions in your answer and you will probably be all right.

In this exam, some concepts may be called by different names than the ones used in the book. Here is a list of useful synonyms and acronyms:

Perceptron = summation unit = SU = conventional neuron
Binary perceptron = perceptron with binary activation function
Multilayer perceptron = MLP = Feedforward network of summation units
Back propagation = Generalized delta rule
Standard Competitive Learning = LVQ-I without a neighbourhood function

Good luck!

7 students wrote the exam. Two failed, four passed with grade 3 and one student passed with grade 4.

1. Explain the two concepts pattern learning (= stochastic/online learning) and batch learning (= epoch/offline learning). From a theoretical (gradient descent) perspective, one is more correct than the other. Which? (3)

Both learn from a finite training set by computing an error for each presented pattern and then the amount Δw_ji by which the parameters of the learning system (e.g. the weights in a neural network) should be changed to reduce that error. In pattern learning, the parameters are updated directly after each pattern presentation. This requires a random order of presentation, which is why this form of learning is sometimes called "stochastic". In batch learning, the computed parameter changes are accumulated and the system is not updated until the whole training set has been presented once (one epoch). This is the more correct form of learning from a gradient descent perspective (it is gradient descent). Points were deducted if the accumulation of changes was not mentioned in the description of batch learning.

2. Write down the equation and draw a graph of the logistic sigmoid function, often used as activation function in neural networks! (2)

f(x) = 1/(1+exp(-λx)). The steepness parameter λ was not required for full credit. Points were deducted if the graph did not include ranges/axes.

3. Explain why a single binary perceptron cannot solve the XOR problem! (3)

Binary perceptrons classify by positioning a separating hyperplane between the classes. In the XOR problem the classes are not separable by a hyperplane (a line, in this 2D case). It requires at least two lines (see figure). Note that the word 'discriminant' does not imply anything about its shape. An ellipse is also a discriminant, for example, and in that case one discriminant would suffice to solve the XOR problem. So, the explanation here must state that the discriminant formed by a binary perceptron is a hyperplane.

4. Write down a general recipe for how to derive a gradient descent learning rule (such as the delta rule or back propagation), given a feedforward network F(x) and an error function (objective function), E. (4)

Most students misunderstood the question. The general recipe asked for here was discussed and handed out at lecture 4. Deriving backprop is an instantiation of that recipe (and gave partial credit, if done right).

5. Compare advantages and disadvantages of multilayer perceptrons (MLPs) and radial basis function networks (RBFs). The discussion must contain at least two examples where MLPs usually do better than RBFs and two examples where RBFs usually do better than MLPs. (4)

See handout from lecture 10. Note that I asked for advantages/disadvantages, not for differences in how MLPs/RBFs are defined.
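As an illustration of the distinction drawn in question 1, the following is a minimal sketch, not part of the exam. It assumes a single linear unit trained with the delta rule, and the names `delta_changes`, `train_pattern`, `train_batch`, the toy data and the learning rate are all hypothetical. The only difference between the two training loops is when the computed changes are applied.

```python
import random

def delta_changes(w, x, d, eta):
    """Delta-rule weight changes for one pattern on a linear unit y = w . x."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [eta * (d - y) * xi for xi in x]

def train_pattern(w, data, eta, epochs, rng):
    """Pattern (online) learning: weights change right after every pattern."""
    w = list(w)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)  # random order of presentation
        for x, d in data:
            dw = delta_changes(w, x, d, eta)
            w = [wi + dwi for wi, dwi in zip(w, dw)]
    return w

def train_batch(w, data, eta, epochs):
    """Batch learning: accumulate the changes, apply them once per epoch."""
    w = list(w)
    for _ in range(epochs):
        acc = [0.0] * len(w)
        for x, d in data:
            dw = delta_changes(w, x, d, eta)  # computed with unchanged weights
            acc = [a + dwi for a, dwi in zip(acc, dw)]
        w = [wi + a for wi, a in zip(w, acc)]
    return w

# Hypothetical toy data: d = 2*x - 1, with a constant second input as bias.
DATA = [([x, 1.0], 2.0 * x - 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]

w_online = train_pattern([0.0, 0.0], DATA, eta=0.1, epochs=200, rng=random.Random(0))
w_batch = train_batch([0.0, 0.0], DATA, eta=0.1, epochs=200)
```

On this noise-free toy problem both variants approach the same weights; the accumulation step in `train_batch` is what makes each epoch a single gradient descent step on the total error.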

6. Let's assume that you are given a multilayer perceptron which is a black box to you. You can see how many inputs and outputs there are, but you cannot peek inside to determine the number of hidden nodes, nor can you change any other internal properties of the network. Let's also assume that you have a set of data (i.e. a set of input vectors with corresponding desired output vectors) representing some unknown target function. The input and output vectors in this set have the same dimensionality as the number of inputs and outputs of the network, respectively. You are allowed to do what you wish with the given set of examples, but you will not be able to obtain more data.

a) Suggest a method to find out if the network is powerful enough (i.e. has enough hidden nodes) to find the target function! (2)

Train the network on all the data you have, for a long time, i.e. try to overtrain the network on purpose. If you can't get a low error even if allowed to do that, the network is too small (or there are serious errors in the data).

b) What bad effects can be expected if the network is too large? (2)

You may overtrain the network, i.e. lose the ability to generalize to unseen data. It also takes longer to train, of course, but that is not the main problem.

c) If the network is too large, how can you avoid or minimize the bad effects when, as in this case, you are not allowed to change the size of the network? (2)

Split the data into at least three subsets (a training set, a validation set and a test set) so that you can decide when to stop training ('early stopping'). An alternative is to use K-fold cross validation, but you still need to make sure that you don't train for too long. Noise injection also helps.

7. Unsupervised learning / Self organization

a) How is standard competitive learning related to the classical method K-means? (2)

They are equivalent, if SCL is trained in batch learning mode.
b) Explain the winner-takes-all problem which can occur in competitive learning! (2)

Competitive learning moves the closest node (codebook vector) towards the latest input vector, to make it more likely to win the next time the same input vector is presented. The winner-takes-all problem occurs when a few nodes, in the extreme case only one node, win all the time because they are closer to all the data than the other nodes (which never win and therefore never move). At best this leads to underutilization of the network; at worst all the data is assigned to the same class.

c) When a new node is to be inserted in the Growing Neural Gas algorithm, it is inserted between two already existing nodes. Which two nodes? (2)

Between the node x which has the greatest error and the node y, among its current neighbours, with the greatest error. The error here is proportional to how much the node has moved lately (as a winner), so a great error indicates that the node is in an area with too few nodes to cover the data well.
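The winner-takes-all problem from question 7b can be reproduced with a small sketch. This is an illustration only, assuming standard competitive learning in pattern mode with squared Euclidean distance; the function names, the data, the learning rate and the fixed presentation order (chosen for reproducibility) are all hypothetical.

```python
def nearest(nodes, x):
    """Index of the node (codebook vector) closest to input x."""
    dists = [sum((ni - xi) ** 2 for ni, xi in zip(n, x)) for n in nodes]
    return dists.index(min(dists))

def scl_train(nodes, data, eta=0.1, epochs=20):
    """Standard competitive learning: only the winning node moves."""
    nodes = [list(n) for n in nodes]
    wins = [0] * len(nodes)
    for _ in range(epochs):
        for x in data:  # fixed order here, only to keep the demo reproducible
            k = nearest(nodes, x)
            wins[k] += 1
            # Move the winner a fraction eta of the way towards the input.
            nodes[k] = [ni + eta * (xi - ni) for ni, xi in zip(nodes[k], x)]
    return nodes, wins

# Two small clusters, near (0, 0) and near (1, 1).
DATA = [(0.0, 0.1), (0.1, 0.0), (1.0, 0.9), (0.9, 1.0)]

# Reasonable initialization: each node starts near one cluster.
good_nodes, good_wins = scl_train([(0.2, 0.2), (0.8, 0.8)], DATA)

# Poor initialization: the second node starts far from all the data.
bad_nodes, bad_wins = scl_train([(0.5, 0.5), (10.0, 10.0)], DATA)
```

With the poor initialization the distant node never wins and therefore never moves, so a single node ends up representing both clusters.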

8. Use the state transition graph below to explain temporal difference learning.

[Figure: state transition graph showing a sequence of states s, s', s'', ..., the rewards r received on the transitions, and the state values V(s), V(s'), V(s'').]

a) Define values (V) recursively in terms of the following rewards and values! (2)

V(s) = r + γV(s'), where γ is a discount factor.

b) Show how values are updated, using the TD(0) learning rule! (2)

They are updated proportionally to the TD error, which is the difference between the two sides of the equation from the previous question:

V(s) := V(s) + η[r + γV(s') - V(s)]

c) TD(0) is a special case of a more general algorithm, called TD(λ), which includes eligibility traces. What is this? (2)

Instead of updating only the value of the previous state, V(s), in TD(λ) we update all state values, but proportionally to how recently they were visited, which is measured by a slowly decaying variable, local to each state. The rate of this decay is controlled by the (global) λ parameter. TD(0), from the previous question, is an extreme case where the trace decays immediately and therefore only the latest state value is updated.

9. Population methods

a) Explain the basic crossover mechanism used in Genetic Programming (GP). (2)

Swap subtrees of the two parse trees, i.e. subexpressions of the two programs. Some students had confused Genetic Programming and Genetic Algorithms and described one-point crossover for GA (which gave no credit).

b) Comparing the three Particle Swarm Optimization variants pbest, lbest and gbest, which one is most similar to random search? Which one is most likely to get stuck in local optima, due to premature convergence? (2)

In pbest, a particle only considers its personal best. It gets no information from the other particles, and pbest is therefore the most similar to random search (most chaotic). In gbest, particles strive for a weighted combination of their personal bests and the best position found by any particle in the swarm. This leads to a much more directed search with a tight swarm, where most particles strive in the same general direction. gbest is therefore the most likely to get stuck. lbest is a compromise between the two: a generalization of the algorithm which contains pbest and gbest as two extremes. Here, a particle strives for a weighted combination of its personal best and the best of its neighbours (usually not the whole flock), defined by a neighbourhood graph.
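The relation between the three variants in question 9b can be sketched as a choice of neighbourhood graph. The code below is an illustration under stated assumptions: minimisation, a ring neighbourhood for lbest, and a common inertia-weight velocity update. The names `neighbourhoods`, `social_target`, `new_velocity` and the parameter values `w`, `c1`, `c2` are hypothetical and not taken from the exam.

```python
def neighbourhoods(n, variant):
    """Neighbour index lists for n particles under each variant."""
    if variant == "pbest":   # a particle only sees itself
        return [[i] for i in range(n)]
    if variant == "lbest":   # assumed ring: itself plus one neighbour per side
        return [[(i - 1) % n, i, (i + 1) % n] for i in range(n)]
    if variant == "gbest":   # every particle sees the whole swarm
        return [list(range(n)) for _ in range(n)]
    raise ValueError(variant)

def social_target(i, pbest_pos, pbest_val, hood):
    """Best personal-best position among particle i's neighbours."""
    best = min(hood[i], key=lambda j: pbest_val[j])
    return pbest_pos[best]

def new_velocity(v, x, own_best, target, r1, r2, w=0.7, c1=1.5, c2=1.5):
    """Weighted pull towards the personal best and the neighbourhood best.

    r1 and r2 are the random factors in [0, 1], passed in explicitly here.
    """
    return [w * vi + c1 * r1 * (pi - xi) + c2 * r2 * (ti - xi)
            for vi, xi, pi, ti in zip(v, x, own_best, target)]

# Five particles in one dimension with their personal bests (lower is better).
PBEST_POS = [[0.0], [1.0], [2.0], [3.0], [4.0]]
PBEST_VAL = [5.0, 1.0, 4.0, 3.0, 2.0]

targets = {
    variant: social_target(3, PBEST_POS, PBEST_VAL, neighbourhoods(5, variant))
    for variant in ("pbest", "lbest", "gbest")
}
v3 = new_velocity([0.0], [3.0], PBEST_POS[3], targets["gbest"], r1=1.0, r2=1.0)
```

For particle 3, pbest points at its own best, lbest at the best of its ring neighbours and gbest at the best of the whole swarm, which is why gbest pulls every particle towards the same point.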

c) When Ant Colony Optimization is used in the travelling salesman problem, the probability that ant k moves from city i to city j at time t can be defined as follows:

$$p_{ij}^{k}(t) = \frac{[\tau_{ij}(t)]^{\alpha}\,[\eta_{ij}]^{\beta}}{\sum_{c \in C_i^k} [\tau_{ic}(t)]^{\alpha}\,[\eta_{ic}]^{\beta}}, \quad j \in C_i^k$$

Explain the right hand side of this equation! (2)

τ_ij(t) is the amount of pheromone on the trail from city i to city j (at time t; it decays over time). η_ij is the inverted distance from city i to city j. The transition probability is a trade-off between these two, weighted by the constants α and β. The sum over C_i^k is just for normalization (to make this a probability, i.e. so that it sums to 1). C_i^k is the set of feasible cities for ant k (i.e. the remaining cities to visit).
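The transition rule in question 9c can be evaluated directly. The following is a minimal sketch, assuming pheromone levels and distances are stored as nested lists indexed by city; the function name `transition_probabilities`, the default values of `alpha` and `beta`, and the example numbers are hypothetical.

```python
def transition_probabilities(i, feasible, tau, dist, alpha=1.0, beta=2.0):
    """Probability of moving from city i to each feasible city j.

    tau[i][j]  : pheromone on the trail from i to j
    dist[i][j] : distance from i to j (eta_ij is its inverse)
    feasible   : cities the ant has not visited yet
    """
    weights = {
        j: (tau[i][j] ** alpha) * ((1.0 / dist[i][j]) ** beta)
        for j in feasible
    }
    total = sum(weights.values())  # normalization over the feasible set
    return {j: w / total for j, w in weights.items()}

# Three cities; the ant is in city 0 and may still visit cities 1 and 2.
TAU = [[0.0, 1.0, 1.0],
       [1.0, 0.0, 1.0],
       [1.0, 1.0, 0.0]]
DIST = [[0.0, 1.0, 2.0],
        [1.0, 0.0, 1.0],
        [2.0, 1.0, 0.0]]

probs = transition_probabilities(0, [1, 2], TAU, DIST)
```

With equal pheromone on both trails, only the distances matter in this example: city 1 is twice as close as city 2 and, with `beta=2.0`, gets four times the weight.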
