Review: Final Exam CPSC Artificial Intelligence Michael M. Richter


Review: Final Exam

Model for a Learning Step [Diagram: the learner interacts with the environment and a teacher; the teacher supplies special information and learning criteria, controls and corrects the process, and compares the feedback; the learner is changed after learning.]

Learning by Example [Diagram: sources of examples (environment, teacher), types of examples (positive, negative), and modes of presentation to the learner (once, incrementally).]

Learning Data, Tests and Applications In a learning process we learn something from given examples for which we know the solution (e.g. the concept, the function, the decision tree, or whatever we want to learn). These examples are also called training data. The goal is to apply the learned object to future examples for which we do not have the solution: this is the application phase. In between lies the test phase, which uses further examples with known solutions, the test data.

Example: Learning Concepts Determine the concept on the basis of some given classified examples (experiences). [Diagram: experiences (positive and negative examples) are fed into a learning function, which produces a concept K(X).] The generated concepts are hypotheses.

Supervised and Unsupervised Learning Supervised: a teacher is involved, e.g. correcting the solution, presenting classified examples, telling the error. Unsupervised: no teacher; the consequence is self-organization. Examples: competitive learning, Kohonen networks, various clustering algorithms.

Clustering: Possible Outcomes 2-dimensional examples with real-valued attributes x_1, x_2. [Diagram: the same point set clustered in two different ways, cluster result 1 and cluster result 2.] For the Euclidean distance, cluster result 1 is better; for the weighted Euclidean distance with weights α_1 = 0 and α_2 = 1, cluster result 2 is better.

Examples: Realization of some Boolean Functions [Diagram: negation (NOT) realized by a neuron with input weight -1 and threshold -0.5; conjunction (AND) by inputs x_1, x_2 with weights 1, 1 and threshold 2; disjunction (OR) by weights 1, 1 and threshold 1; each producing output y.]

Network Topologies We have: each output of a neuron can lead to several inputs; an input, however, can only come from one neuron. A network of neurons can therefore be represented as a directed graph. This directed graph is called the topology. [Diagram: the output of one neuron used as input for two other neurons; the outputs of two neurons used as the two inputs x_1, x_2 of another neuron.]

What can one Neuron Do? One neuron can solve problems that are linearly separable, e.g. for m = 2 input signals. [Diagram: positive and negative examples in the (x, y) plane, separated by a straight line.]

Not Every Boolean Function can be Realized by one Neuron Example of a problem that is not linearly separable: this Boolean function cannot be represented by a single neuron.

x_1  x_2  xor
 0    0    0
 0    1    1
 1    0    1
 1    1    0

Therefore: combine neurons! Use networks!
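
The contrast can be made concrete in a few lines of code (a sketch; the particular weights, thresholds, and the brute-force parameter grid are illustrative choices, not from the slides):

```python
# A threshold neuron: fires iff the weighted input sum exceeds the threshold.
def neuron(x, w, threshold):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > threshold else 0

# AND is linearly separable: weights 1, 1 with threshold 1.5 realize it.
for x1 in (0, 1):
    for x2 in (0, 1):
        assert neuron((x1, x2), (1, 1), 1.5) == (x1 & x2)

# XOR is not: no single (w1, w2, threshold) reproduces its truth table.
def realizes_xor(w, threshold):
    return all(neuron((a, b), w, threshold) == (a ^ b)
               for a in (0, 1) for b in (0, 1))

# A brute-force scan over a grid of candidate parameters finds none.
grid = [i / 2 for i in range(-8, 9)]  # -4.0 ... 4.0 in steps of 0.5
assert not any(realizes_xor((w1, w2), t)
               for w1 in grid for w2 in grid for t in grid)
```

The grid scan is of course no proof, but it illustrates why a single linear threshold unit fails on XOR while a two-layer network succeeds.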

A More General View (3) Example: the output of the neuron is a real number, e.g. f(x) = 1 / (1 + e^(-x)). [Plot: the sigmoid rising from 0 through ½ to 1.]

Feed-Forward Nets Neurons are ordered in levels. Each neuron of some level can only be connected with neurons of the next higher level; there are no horizontal or backward connections. We distinguish the input level (receptors), the invisible levels (hidden layers), and the output level (effectors). If each neuron is connected with all neurons of the next higher level, the network is called a completely connected feed-forward net.

Learning Learning almost always means adaptation of weights. [Diagram: a neuron with input weights 1.7, -0.9, 3.1 before training and 1.9, -1.4, 3.1 after; training changes the weights over time.]

Delta Rule (Widrow-Hoff Rule) The expected result is called the teaching input T_n(t) for the desired activity. The teaching input is only defined for output neurons. Idea: change the weights proportionally to the error. Learning rule: Δc_{n',n} = η (T_n(t) - a_n(t)) O_{n'}(t). η is the learning rate: it determines the degree of the change.
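
A minimal sketch of one delta-rule update for a single output neuron (the function and variable names are illustrative):

```python
# Delta rule: Delta c_{n',n} = eta * (T_n(t) - a_n(t)) * O_{n'}(t)
def delta_rule_step(weights, inputs, activation, teaching, eta):
    """Return the updated weights into one output neuron."""
    error = teaching - activation  # T_n(t) - a_n(t)
    return [c + eta * error * o for c, o in zip(weights, inputs)]

w = delta_rule_step([0.5, -0.2], inputs=[1.0, 0.5],
                    activation=0.3, teaching=0.9, eta=0.5)
# error = 0.6, so the changes are 0.5*0.6*1.0 = 0.30 and 0.5*0.6*0.5 = 0.15
assert [round(c, 2) for c in w] == [0.8, -0.05]
```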

Delta Rule: Example Learning rate η = 1; O_n(t) = a_n(t). [Diagram: input neurons 1, 2, 3 with outputs 0.8, 0.9, 0.3; output neurons 4 and 5 with outputs 0.3 and 0.7 and teaching inputs T_4 = 0.9, T_5 = 0.7. Neuron 5 has no error, therefore no weight change; for neuron 4 the error is 0.6, giving weight changes +0.48, +0.54, +0.18.]

Application to Feed-forward Networks Problem: learning of the hidden layers: no teaching input is defined for them. Idea: learn the output layer with the delta rule, then propagate the errors backwards from the output layer over the hidden layers to the input layer (backpropagation). [Diagram: forward activation flow and backward error flow; at the output neurons the error is computed.]

Example [Diagram: a feed-forward net with an input level, several hidden layers, and an output level.]

Hidden Neurons (1) Go backwards: [Diagram: error contributions flow backwards over weights c_1 and c_2 from two downstream neurons into a hidden neuron.]

A Network Model for Backpropagation Activation rule: a_j(t+1) = F(Σ_i a_i(t) c_ij(t)), where F is a sigmoid function of the form F(x) = 1 / (1 + e^(-(x+θ)/T)). Error: E = ½ Σ_{n output neuron} (T_n(t) - a_n(t))². Topology: completely connected feed-forward net.

Backpropagation Phase Net input at neuron j: net_j = Σ_i c_ij a_i, where the sum runs over the neurons of the previous layer. For output neurons the teaching input T_j is given. The weights are changed recursively, backwards from the output layer to the input layer, according to Δc_ij(t) = η δ_j a_i(t), where η is the learning rate. If j is an output neuron (start of the recursion): δ_j = (T_j - a_j(t)) F'(net_j). If j is a hidden neuron (recursion step): δ_j = (Σ_k δ_k c_jk) F'(net_j), where the sum runs over the neurons of the next layer.
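
The recursion can be sketched for a single hidden layer, taking θ = 0 and T = 1 in the sigmoid so that F'(net) = a (1 - a); the weight-matrix layout and names are assumptions, not from the slides:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(x, target, W1, W2, eta=0.5):
    """One forward + backward pass for a net with one hidden layer.
    W1[j][i]: weight from input i to hidden neuron j;
    W2[k][j]: weight from hidden neuron j to output neuron k."""
    # forward phase: net inputs and activations, layer by layer
    net_h = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    a_h = [sigmoid(n) for n in net_h]
    net_o = [sum(w * ah for w, ah in zip(row, a_h)) for row in W2]
    a_o = [sigmoid(n) for n in net_o]
    # start of the recursion (output neurons): delta = (T - a) * F'(net)
    d_o = [(t - a) * a * (1.0 - a) for t, a in zip(target, a_o)]
    # recursion step (hidden neurons): delta_j = (sum_k delta_k c_jk) * F'(net_j)
    d_h = [sum(d_o[k] * W2[k][j] for k in range(len(W2))) * a_h[j] * (1.0 - a_h[j])
           for j in range(len(a_h))]
    # weight changes: Delta c_ij = eta * delta_j * a_i
    W2 = [[w + eta * d_o[k] * a_h[j] for j, w in enumerate(row)]
          for k, row in enumerate(W2)]
    W1 = [[w + eta * d_h[j] * x[i] for i, w in enumerate(row)]
          for j, row in enumerate(W1)]
    return W1, W2, a_o

# Repeated updates on a single pattern drive the output toward the target.
W1, W2 = [[0.1, 0.2], [-0.1, 0.3]], [[0.2, -0.3]]
for _ in range(200):
    W1, W2, out = backprop_step([1.0, 0.0], [1.0], W1, W2)
```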

Example [Diagram: a three-level net with input neurons 1, 2 (outputs O_1, O_2), hidden neurons 3, 4, 5 and output neurons 6, 7; weights c_13, c_14, c_15, c_23, c_24, c_25 connect input to hidden layer and c_36, c_37, c_46, c_47, c_56, c_57 connect hidden to output layer; teaching inputs T_6, T_7 at the output; each neuron j carries its net input net_j, activation a_j and error signal δ_j.]

Computation with the Sigmoid Function Suppose F(x) = 1 / (1 + e^(-(x+θ)/T)). Then F'(x) = (1/T) F(x) (1 - F(x)). Advantage: F'(net) is easy to compute using F(net), which was already computed in the forward phase.
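
A quick numeric check of this identity (the parameter values θ = 0.3, T = 2.0 are arbitrary):

```python
import math

def F(x, theta=0.0, T=1.0):
    """Sigmoid F(x) = 1 / (1 + e^(-(x+theta)/T))."""
    return 1.0 / (1.0 + math.exp(-(x + theta) / T))

def F_prime(x, theta=0.0, T=1.0):
    """Closed form from the slide: F'(x) = (1/T) F(x) (1 - F(x))."""
    return F(x, theta, T) * (1.0 - F(x, theta, T)) / T

# Compare the closed form against a central finite difference.
for x in (-2.0, 0.0, 1.5):
    h = 1e-6
    numeric = (F(x + h, 0.3, 2.0) - F(x - h, 0.3, 2.0)) / (2 * h)
    assert abs(F_prime(x, 0.3, 2.0) - numeric) < 1e-6
```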

Competitive Learning Assumptions: normalized weights, Σ_i w_ij = 1 with 0 < w_ij < 1; input and output values are binary. Two possibilities: either only one neuron is the winner and can learn ("winner takes all"), or several neurons can learn.

Competitive Learning The learning algorithm (with learning rate a):
for all training data In = (in_1, in_2, ..., in_m) do
    for all neurons j in the competition layer do compute Σ_i w_ij * in_i;
    determine the winning neuron S as the one for which Σ_i w_iS * in_i is maximal;
    out_S := 1; for all neurons j ≠ S do out_j := 0;
    for all i do w_iS := w_iS + a * (in_i / m - w_iS)
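
A sketch of one such learning step, assuming (as the analysis on the next slide suggests) that m in the update denotes the number of active inputs; all names are illustrative:

```python
def competitive_step(W, x, a=0.5):
    """One winner-takes-all step. W[j][i]: weight from input i to
    competition neuron j; x: binary input with at least one active bit."""
    m = sum(x)  # number of active inputs (assumption: m = sum_i in_i)
    # the winner S maximizes the weighted input sum
    S = max(range(len(W)), key=lambda j: sum(wj * xi for wj, xi in zip(W[j], x)))
    # only the winner learns: w_iS := w_iS + a * (in_i / m - w_iS)
    W[S] = [w + a * (xi / m - w) for w, xi in zip(W[S], x)]
    return S, W

# With normalized weights the winner's weight changes sum to zero,
# so the winner stays normalized.
W = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]]
S, W = competitive_step(W, [1, 0, 1])
```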

Competitive Learning Analysis: If in_i = 0 then w_iS decreases. If in_i = 1 and previously w_iS < 1/m then w_iS increases. Consequences: the weights w_iS are now closer to the in_i, and Σ_i w_iS = 1 still holds, because the weight changes sum to zero: Σ_i Δw_iS = Σ_i a (in_i / m - w_iS) = a (Σ_i in_i / m - Σ_i w_iS) = a (1 - 1) = 0.

A Geometric View Consider a training vector set whose vectors all have the same length, and suppose, without loss of generality, that this length is one. Recall that the squared length ||x||² of a vector x is given by ||x||² = Σ_i x_i². A vector set for which ||x||² = 1 for all x is said to be normalised. If the components are all positive or zero then this is approximately equivalent to the condition Σ_i x_i = 1. Since the vectors all have unit length, they may be represented by arrows from the origin to the surface of the unit (hyper)sphere.
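
In code, normalising a vector places it on the unit hypersphere (a small illustrative helper):

```python
import math

def normalize(x):
    """Scale x to unit length, so it lies on the unit hypersphere."""
    length = math.sqrt(sum(xi * xi for xi in x))
    return [xi / length for xi in x]

v = normalize([3.0, 4.0])
assert abs(sum(xi * xi for xi in v) - 1.0) < 1e-9  # ||v||^2 = 1
```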

Vectors on Unit Hypersphere(1)

Vectors on Unit Hypersphere(2) Suppose now that a competitive layer also has normalized weight vectors. Then these vectors may be represented on the same sphere. [Diagram: blue arrow is the input vector; the neuron that wins is the one whose weight vector is closest to the input vector, and it moves even closer to the input vector.]

More than one Winner Neuron Kohonen networks: Self-Organizing Feature Maps (SOFM). Two-layer architecture. Each Kohonen neuron has a neighborhood, which consists of its neighbors with respect to this architecture.

A General Aspect: Dimension Reduction In reality we encounter n-dimensional situations (e.g. n = 3). This means we have data v = (v_1, ..., v_n). These data lie in a topological space, and the neighborhoods contain dependencies. Wanted: a replacement by r = (r_1, ..., r_m), m < n.

Feature Maps The orientation tuning over the surface forms a kind of map, with similar tunings being found close to each other. These maps are called topographic feature maps. It is possible to train a network using methods based on activity competition in such a way as to create such maps automatically. These nets consist of a layer of nodes, each of which is connected to all the inputs and to some neighbourhood of surrounding nodes.

Kohonen Map (1)

Kohonen Map (2) The SOM defines a mapping from the input data space spanned by x_1 ... x_n onto a one- or two-dimensional array of nodes. The mapping is performed in such a way that the topological relationships in the n-dimensional input space are maintained when mapped to the SOM. In addition, the local density of data is also reflected by the map: areas of the input data space which are represented by more data are mapped to a larger area of the SOM.

Learning (1) Learning of Kohonen nets, as for competitive learning:
for all training data IN = (in_1, in_2, ...) do
    for all Kohonen neurons j do compute Σ_i (in_i - w_ij)²;
    determine the neuron S for which this value is minimal;
    for all Kohonen neurons j from the neighborhood of S do w_ij := w_ij + a * (in_i - w_ij)
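
One such step can be sketched as follows; the one-dimensional grid neighborhood (all neurons within a given radius of the winner) is an illustrative choice, not from the slides:

```python
def kohonen_step(W, x, a=0.3, radius=1):
    """One SOM learning step. W[j]: weight vector of Kohonen neuron j
    on a 1-D grid; x: input vector."""
    # winner S minimizes sum_i (in_i - w_ij)^2
    dists = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in W]
    S = dists.index(min(dists))
    # every neuron in the neighborhood of S moves toward the input
    for j in range(max(0, S - radius), min(len(W), S + radius + 1)):
        W[j] = [wi + a * (xi - wi) for wi, xi in zip(W[j], x)]
    return S, W

W = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
S, W = kohonen_step(W, [0.9, 0.8])
```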

Learning (2) Learning of Kohonen nets. This means: all Kohonen neurons from the chosen neighborhood of S adapt their weights in the direction of the learned vector IN. The size of the neighborhood varies; in general it will shrink over time. After successful learning there will be groups of Kohonen neurons that have similar weight vectors: the net finds categories of learning data.

Shrinking Neighborhoods (1) The black neuron in the center is the winner. The squares describe the shrinking neighborhoods In general: The forms of the neighborhoods may be arbitrary geometric objects.

Shrinking Neighborhoods (2) Decreasing the neighbourhood ensures that progressively finer features or differences are encoded. The gradual lowering of the learning rate ensures stability (otherwise the net may exhibit oscillation of weight vectors between two clusters). Both methods have to ensure that the learning process has some kind of convergence or stability.

Evolutionary Algorithms This is an umbrella term to describe problem-solving methods that make use of mechanisms of evolution.
EA Evolutionary Algorithms
GA Genetic Algorithms
EP Evolutionary Programming
ES Evolution Strategies
CFS Classifier Systems
GP Genetic Programming
GA: a model of machine learning that is derived from evolution in nature. A population of individuals (essentially bit strings) goes through a process of simulated evolution, using crossover and mutation to create offspring. EP: a stochastic optimization strategy similar to GA. The others are different extensions and variations of the basic ideas. We will concentrate on GA.

Genetic Algorithms The three most important aspects of using genetic algorithms are: (1) definition of the objective function, (2) definition and implementation of the genetic representation, and (3) definition and implementation of the genetic operators. Once these three have been defined, the generic genetic algorithm should work fairly well. Beyond that you can try many different variations to improve performance, find multiple optima (species - if they exist), or parallelize the algorithms.

Genetic Operators We suppose that the set I of individuals consists of vectors (x_1, x_2, ..., x_k, ..., x_n) of length n with entries from some data structure (often {0,1}). Mutation operators: a simple mutation operator picks some k and y with a certain probability; the result of the mutation is (x_1, x_2, ..., x_k', ..., x_n) where x_k' = y. Multivariate mutation performs this at several places. Crossover operators have two arguments (the parents): (x_1, x_2, ..., x_k, ..., x_n) and (y_1, y_2, ..., y_k, ..., y_n). The result is two children (x_1, x_2, ..., x_k, y_{k+1}, ..., y_n) and (y_1, y_2, ..., y_k, x_{k+1}, ..., x_n); again k is picked with some probability.
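
These two operators can be sketched for bit-string individuals (function names and parameter values are mine, not from the slides):

```python
import random

def mutate(x, p=0.05, rng=random):
    """Simple mutation: flip each position independently with probability p."""
    return [1 - b if rng.random() < p else b for b in x]

def crossover(x, y, rng=random):
    """One-point crossover: pick a cut point k and swap the tails,
    producing two children."""
    k = rng.randrange(1, len(x))
    return x[:k] + y[k:], y[:k] + x[k:]

parents = ([0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1])
c1, c2 = crossover(*parents)
# At every position the two children together carry exactly the parents' genes.
assert all({c1[i], c2[i]} == {0, 1} for i in range(6))
```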

Lists These are some sample list mutation operators. Notice that lists may be fixed or variable length.

List Crossover Operators

Fitness Function The most difficult and most important concept of genetic programming is the fitness function. The fitness function determines how well a program is able to solve the problem. It varies greatly from one type of program to the next. For example, if one were to create a genetic program to set the time of a clock, the fitness function would simply be the amount of time that the clock is wrong. Unfortunately, few problems have such an easy fitness function; most cases require a slight modification of the problem in order to find the fitness.
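
The clock example can be written down directly (a sketch; representing times as minutes on a 24-hour clock, with wrap-around at midnight, is my assumption):

```python
def clock_fitness(set_minutes, true_minutes):
    """Fitness = how far off the clock is, in minutes (lower is better),
    taking the shorter way around midnight."""
    diff = abs(set_minutes - true_minutes) % (24 * 60)
    return min(diff, 24 * 60 - diff)

assert clock_fitness(10, 10) == 0                 # exactly right
assert clock_fitness(23 * 60 + 50, 10) == 20      # 23:50 vs 00:10
```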

A Possible Control Algorithm [Flowchart: start with an initial population P and an initial evaluation, then loop: BREEDING (1. choose mating partners, 2. Po := set of individuals created by crossover/mutation); EVALUATION (1. assign fitness value, via a retrieval error calculation, 2. assign lifetime value); SELECTION (1. add Po to P, 2. increase age, 3. remove dead/unfit individuals).]

Flowchart (Executional Steps) of Genetic Programming Genetic programming is problem-independent in the sense that the flowchart specifying the basic sequence of executional steps is not modified for each new run or each new problem. There is usually no discretionary human intervention or interaction during a run of genetic programming (although a human user may exercise judgment as to whether to terminate a run). The figure below is a flowchart showing the executional steps of a run of genetic programming. The flowchart shows the genetic operations of crossover, reproduction, and mutation as well as the architecture-altering operations. This flowchart shows a two-offspring version of the crossover operation.

Creation of Initial Population of Computer Programs Genetic programming starts with a primordial ooze of thousands of randomly-generated computer programs. The set of functions that may appear at the internal points of a program tree may include ordinary arithmetic functions and conditional operators. The set of terminals appearing at the external points typically include the program's external inputs (such as the independent variables X and Y) and random constants (such as 3.2 and 0.4). The randomly created programs typically have different sizes and shapes.

Main Generational Loop of Genetic Programming The main generational loop of a run of genetic programming consists of the fitness evaluation, selection, and the genetic operations. Each individual program in the population is evaluated to determine how fit it is at solving the problem at hand. Programs are then probabilistically selected from the population based on their fitness to participate in the various genetic operations, with reselection allowed. While a more fit program has a better chance of being selected, even individuals known to be unfit are allocated some trials in a mathematically principled way. That is, genetic programming is not a purely greedy hill-climbing algorithm. The individuals in the initial random population and the offspring produced by each genetic operation are all syntactically valid executable programs. After many generations, a program may emerge that solves, or approximately solves, the problem at hand.
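
The generational loop can be sketched for bit strings rather than program trees; OneMax (the count of 1 bits) stands in for a real fitness function, and all parameter values are illustrative:

```python
import random

def onemax(x):
    return sum(x)

def evolve(pop, generations=60, pc=0.9, pm=0.02, rng=random):
    """Fitness-proportional selection with reselection allowed, then
    two-offspring crossover and pointwise mutation."""
    n = len(pop[0])
    for _ in range(generations):
        # probabilistic selection: even unfit individuals get some trials
        weights = [onemax(x) + 1 for x in pop]
        nxt = []
        while len(nxt) < len(pop):
            a, b = rng.choices(pop, weights=weights, k=2)
            if rng.random() < pc:                  # two-offspring crossover
                k = rng.randrange(1, n)
                a, b = a[:k] + b[k:], b[:k] + a[k:]
            for child in (a, b):                   # mutation, applied sparingly
                nxt.append([1 - g if rng.random() < pm else g for g in child])
        pop = nxt[:len(pop)]
    return max(pop, key=onemax)

rng = random.Random(0)
pop = [[rng.randrange(2) for _ in range(20)] for _ in range(30)]
best = evolve(pop, rng=rng)
```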

Mutation Operation In the mutation operation, a single parental program is probabilistically selected from the population based on fitness. A mutation point is randomly chosen, the subtree rooted at that point is deleted, and a new subtree is grown there using the same random growth process that was used to generate the initial population. This mutation operation is typically performed sparingly (with a low probability of, say, 1% during each generation of the run).

Crossover Operation In the crossover, or sexual recombination operation, two parental programs are probabilistically selected from the population based on fitness. The two parents participating in crossover are usually of different sizes and shapes. A crossover point is randomly chosen in the first parent and a crossover point is randomly chosen in the second parent. Then the subtree rooted at the crossover point of the first, or receiving, parent is deleted and replaced by the subtree from the second, or contributing, parent. Crossover is the predominant operation in genetic programming (and genetic algorithm) work and is performed with a high probability (say, 85% to 90%).

Reproduction Operation The reproduction operation copies a single individual, probabilistically selected based on fitness, into the next generation of the population.

Example (1)

Example (3)

Example (4)

Example (5)

Simplified Flow Diagram