Unsupervised Learning. CS 3793/5233 Artificial Intelligence


Clustering

In clustering, the target feature is not given. Goal: construct a natural classification that can be used to predict features of the data. The examples are partitioned into clusters or classes. Each class predicts values of the features for the examples in the class. In hard clustering, each example is placed definitively in a class. In soft clustering, each example has a probability of belonging to each class. The best clustering minimizes an error measure.

EM Algorithm

The EM (Expectation Maximization) algorithm is not a single algorithm, but an algorithm design technique. Start with a hypothesis space for classifying the data and a random hypothesis. Repeat until convergence:
E step: classify the examples using the current hypothesis.
M step: learn a new hypothesis from the examples using their current classification.
This can get stuck in local optima; different initializations can affect the result.

k-means Algorithm

The k-means algorithm is used for hard clustering.
Inputs: the training examples and the number of classes/clusters, k.
Outputs: an assignment of each example to one class, and the average/mean example of each class.
If example e = (x1, ..., xn) is assigned to class i with mean ui = (ui1, ..., uin), the error is
||e − ui||² = Σ_{j=1..n} (xj − uij)²

k-means Procedure

Procedure K-Means(E, k)
  Inputs: set of examples E and number of classes k
  Randomly assign each example to a class; let Ei be the examples in class i
  Repeat:
    M-Step: for each class i from 1 to k: u[i] ← Σ_{e ∈ Ei} e / |Ei|
    E-Step: for each example e in E: put e in class argmin_i ||u[i] − e||²
  until no changes in any Ei
  return the means u and the clusters Ei
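The following is a minimal NumPy sketch of this procedure. The random initial assignment and the alternating M/E steps mirror the pseudocode above; details such as the handling of empty clusters are assumptions.

```python
import numpy as np

def kmeans(E, k, rng=np.random.default_rng(0)):
    """Hard k-means as in the procedure above: random initial assignment,
    then alternate M-step (recompute means) and E-step (reassign to the
    closest mean) until the assignment stabilizes."""
    E = np.asarray(E, dtype=float)
    assign = rng.integers(0, k, size=len(E))        # random initial classes
    while True:
        # M-step: mean of each class (random example if a class is empty)
        u = np.array([E[assign == i].mean(axis=0) if np.any(assign == i)
                      else E[rng.integers(len(E))] for i in range(k)])
        # E-step: assign each example to the closest mean (squared distance)
        d2 = ((E[:, None, :] - u[None, :, :]) ** 2).sum(axis=2)
        new_assign = d2.argmin(axis=1)
        if np.array_equal(new_assign, assign):      # stable assignment: done
            return u, assign
        assign = new_assign
```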

Data / Random Assignment to Classes / Assign to Closest Mean / Assign to Closest Mean Again

(Four figure slides: a 2-D scatter plot of the example data, the random initial assignment to classes, and the assignments after the first and second E-steps.)

Properties of k-means

An assignment of examples to classes is stable if running both the M step and the E step does not change the assignment. The algorithm converges to a stable local minimum. It is not guaranteed to converge to a global minimum. It is sensitive to the relative scale of the dimensions. Increasing k can always decrease the error, until k is the number of distinct examples.

Soft k-means

To illustrate soft clustering, consider a soft k-means algorithm.
E-Step: for each example e, calculate a probability distribution over classes:
P(ci | e) ∝ exp(−||ui − e||²)
M-Step: for each class i, determine the mean probabilistically:
ui = Σ_{e ∈ E} P(ci | e) · e / Σ_{e ∈ E} P(ci | e)
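A minimal NumPy sketch of these two steps. Treating exp(−||u − e||²) as the unnormalized class probability follows the slide; the names and data layout are assumptions.

```python
import numpy as np

def soft_kmeans_step(E, u):
    """One E-step + M-step of soft k-means.
    E: (n, d) array of examples; u: (k, d) array of current class means."""
    # E-step: unnormalized P(c_i | e) = exp(-||u_i - e||^2)
    d2 = ((E[:, None, :] - u[None, :, :]) ** 2).sum(axis=2)   # (n, k)
    R = np.exp(-d2)
    R /= R.sum(axis=1, keepdims=True)          # normalize over classes
    # M-step: each mean is the responsibility-weighted average of examples
    u_new = (R.T @ E) / R.sum(axis=0)[:, None]
    return u_new, R
```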

Soft k-means Example

Class-membership probabilities for one class over three iterations:

e            P0(C|e)   P1(C|e)   P2(C|e)
(0.7, 5.1)   0.0       0.013     0.0
(1.5, 6.0)   1.0       0.764     0.0
(2.1, 4.5)   1.0       0.004     0.0
(2.4, 5.5)   0.0       0.453     0.0
(3.0, 4.4)   0.0       0.007     0.0
(3.5, 5.0)   1.0       0.215     0.0
(4.5, 1.5)   0.0       0.000     0.0
(5.2, 0.7)   0.0       0.000     0.0
(5.3, 1.8)   0.0       0.000     0.0
(6.2, 1.7)   0.0       0.000     0.0
(6.7, 2.5)   1.0       0.000     0.0
(8.5, 9.2)   1.0       1.000     1.0
(9.1, 9.7)   1.0       1.000     1.0
(9.5, 8.5)   0.0       1.000     1.0

Properties of Soft Clustering

Soft clustering often uses a parameterized probability model, e.g., means and standard deviations for a normal distribution. Initially, assign random probabilities to the examples: the probability of class i given example e. The M-step updates the values of the parameters from the probabilities. The E-step updates the probabilities of the examples from the probability model. A global minimum is not guaranteed.

Reinforcement Learning

What should an agent do given:
- Prior knowledge: the possible states of the world and the possible actions.
- Observations: the current state of the world and the immediate reward/punishment.
- Goal: act to maximize accumulated reward.
We assume there is a sequence of experiences: state, action, reward, state, action, reward, ... At any time the agent must decide whether to explore to gain more knowledge, or to exploit the knowledge it has already discovered.

Why is reinforcement learning hard?

The actions responsible for a reward may have occurred long before the reward was received. The long-term effect of an action depends on what the agent will do in the future. The explore-exploit dilemma: at each time, should the agent be greedy or inquisitive? The ε-greedy strategy is to select what looks like the best action 1 − ε of the time, and to select a random action ε of the time.

Temporal Differences

Suppose we have a sequence of values v1, v2, v3, ... Estimating the average with the first k values:
A_k = (v1 + ... + vk) / k
Separating out vk:
A_k = (v1 + ... + v_{k−1})/k + vk/k
Let α = 1/k; then
A_k = (1 − α) A_{k−1} + α vk = A_{k−1} + α (vk − A_{k−1})
The TD update is: A ← A + α (v − A)
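A small sketch of this derivation in code: with α = 1/k the TD update reproduces the exact running mean, while a fixed α weights recent values more heavily. The function name and test values are illustrative.

```python
import random

def td_average(values, alpha=None):
    """Running estimate via the TD update A <- A + alpha * (v - A).
    alpha=None uses alpha = 1/k, which gives the exact mean of v1..vk;
    a fixed alpha tracks a possibly drifting signal."""
    A = 0.0
    for k, v in enumerate(values, start=1):
        a = (1.0 / k) if alpha is None else alpha
        A += a * (v - A)
    return A

vals = [random.gauss(10, 1) for _ in range(1000)]
print(td_average(vals))             # ~10: exact sample mean
print(td_average(vals, alpha=0.1))  # weighted toward recent values
```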

Example: Robot Environment

Suppose a robot is in this grid environment. One terminal square has +1 reward (a recharge station). One terminal square has −1 reward (falling down the stairs). An action to stay put always succeeds. An action to move to a neighboring square succeeds with probability 0.8, stays in the same square with probability 0.1, and goes to another neighbor with probability 0.1. Should the robot try moving left or right?

(Figure: the grid world, showing the +1 and −1 terminal squares.)

Review of Q Values

A policy is a function from states to actions. For reward sequence r1, r2, ..., the discounted reward is
V = Σ_{i≥1} γ^{i−1} ri   (discount factor γ)
V(s) is the expected value of state s. Q(s, a) is the value of doing action a from state s. For the optimal policy:
V(s) = max_a Q(s, a)   (value of the best action)
Q(s, a) = Σ_{s'} P(s' | s, a) (R(s, a, s') + γ V(s'))
        = Σ_{s'} P(s' | s, a) (R(s, a, s') + γ max_{a'} Q(s', a'))
Learn the optimal policy by learning Q values. Use each experience ⟨s, a, r, s'⟩ to update Q[s, a].

Q-Learning

Procedure Q-learning(S, A, γ, α, ε)
  Inputs: states S, actions A, discount γ, step size α, exploration factor ε
  Initialize Q[S, A] to zeros
  Repeat for multiple episodes:
    s ← initial state
    Repeat until end of episode:
      Select action a using the ε-greedy strategy
      Do action a; observe reward r and next state s'
      Q[s, a] ← Q[s, a] + α (r + γ max_{a'} Q[s', a'] − Q[s, a])
      s ← s'
  return Q
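A compact Python sketch of this procedure. The environment interface (env.reset() returning a state, env.step(a) returning (next state, reward, done)) is an assumed Gym-style convention, not something specified on the slide.

```python
import random
from collections import defaultdict

def q_learning(env, actions, gamma=0.9, alpha=0.1, epsilon=0.1, episodes=1000):
    """Tabular Q-learning with an epsilon-greedy behavior policy.
    Assumes env.reset() -> s and env.step(a) -> (s_next, r, done)."""
    Q = defaultdict(float)                      # Q[(s, a)], zero-initialized

    def eps_greedy(s):
        if random.random() < epsilon:           # explore
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])   # exploit

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = eps_greedy(s)
            s2, r, done = env.step(a)
            best_next = max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```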

Q-Learning Update

Unpacking the Q-learning update:
Q[s, a] ← Q[s, a] + α (r + γ max_{a'} Q[s', a'] − Q[s, a])
- Q[s, a]: the value of doing action a from state s.
- r, s': the reward received and the next state.
- α, γ: the learning rate and the discount factor.
- max_{a'} Q[s', a']: with the current Q values, the value of the optimal action from state s'.
- r + γ max_{a'} Q[s', a']: with the current Q values, the discounted reward to be averaged in.

Robot Q-Learner, γ = 0.9, α = ε = 0.1

(Figure: the grid world annotated with the learned Q value for each state-action, with the +1 and −1 terminal squares marked.)

Problems with Q-Learning

Q-learning does off-policy learning: it learns the value of the optimal policy, but does not follow it. This is bad if exploration is dangerous. In the cliff-walking example below, Q-learning walks off the cliff too often. On-policy learning learns the value of the policy actually being followed. SARSA uses the experience ⟨s, a, r, s', a'⟩ to update Q[s, a].

SARSA Procedure

Procedure SARSA(S, A, γ, α, ε)
  Initialize Q[S, A] to zeros
  Repeat for multiple episodes:
    s ← initial state
    Select action a using the ε-greedy strategy
    Repeat until end of episode:
      Do action a; observe reward r and next state s'
      Select action a' for s' using the ε-greedy strategy
      Q[s, a] ← Q[s, a] + α (r + γ Q[s', a'] − Q[s, a])
      s ← s'; a ← a'
  return Q
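The same sketch adapted to SARSA. The only substantive change from the Q-learning code above is that the update uses the action actually selected for s' rather than the max; the env interface is the same assumed convention.

```python
import random
from collections import defaultdict

def sarsa(env, actions, gamma=0.9, alpha=0.1, epsilon=0.1, episodes=1000):
    """Tabular SARSA: on-policy TD control.
    Assumes env.reset() -> s and env.step(a) -> (s_next, r, done)."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2)                 # the action actually taken next
            Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2
    return Q
```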

SARSA on the Cliff

(Figure: reward per episode, on a scale from 0 down to −100, versus number of episodes, from 0 to 1000, for Q-learning and SARSA on the cliff-walking task.)

Learning with Features

Often we want to reason in terms of features, taking advantage of similarities between states. Each assignment to the features is a state. Idea: express Q as a function of the features, where the features encode both the state and the action:
(s, a) = (x1, x2, x3, ...)
Q(s, a) = w0 + w1 x1 + w2 x2 + w3 x3 + ...
δ = r + γ Q(s', a') − Q(s, a)
wi ← wi + α δ xi
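A sketch of this update with a linear function approximator. The helper phi(s, a), which produces the feature vector (including a constant 1 feature so that w0 acts as the bias), is an assumption.

```python
import numpy as np

def linear_q(phi, w, s, a):
    """Q(s, a) = w . phi(s, a); phi must include a constant 1 feature."""
    return w @ phi(s, a)

def sarsa_feature_update(phi, w, s, a, r, s2, a2, gamma=0.9, alpha=0.01):
    """One update w_i <- w_i + alpha * delta * x_i, with
    delta = r + gamma * Q(s', a') - Q(s, a). Returns the new weights."""
    x = phi(s, a)
    delta = r + gamma * linear_q(phi, w, s2, a2) - w @ x
    return w + alpha * delta * x
```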

Learning Bayesian Networks

You can learn each probability separately if you: know the structure of the network, observe all the variables, have many examples, and have no missing data.

(Diagram: Structure + Data → Probabilities.)

Learning Conditional Probabilities

Use counts for each conditional probability. For example:
P(E=t | A=t ∧ B=f) = (count(E=t ∧ A=t ∧ B=f) + c1) / (count(A=t ∧ B=f) + c)
where c1 and c encode prior (expert) knowledge (c1 ≤ c). When there are few examples or a node has many parents, there might be little data for the probability estimates: use supervised learning or noisy ORs/ANDs.
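A small sketch of this counting scheme with pseudocounts. The data layout (a list of dicts) and the choice c1 = 1, c = 2 (Laplace smoothing for a binary variable) are illustrative assumptions.

```python
def cond_prob(data, child, child_val, parents, c1=1.0, c=2.0):
    """P(child = child_val | parents) estimated from counts,
    smoothed with pseudocounts c1 (numerator) and c (denominator)."""
    match = [row for row in data
             if all(row[p] == v for p, v in parents.items())]
    hits = sum(1 for row in match if row[child] == child_val)
    return (hits + c1) / (len(match) + c)

# Example: P(E=True | A=True, B=False) from a toy dataset
data = [{"A": True, "B": False, "E": True},
        {"A": True, "B": False, "E": False},
        {"A": False, "B": False, "E": True}]
print(cond_prob(data, "E", True, {"A": True, "B": False}))  # (1+1)/(2+2) = 0.5
```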

Unobserved Variables

What if we had no observations of E? Use the EM algorithm with probabilistic inference. Randomly assign values to the probability tables that include E. Repeat:
E-step: calculate P(E | e) for each example e.
M-step: update the probability tables using the counts of P(E | e).

Learning Network Structures

P(M | D) = P(D | M) P(M) / P(D)
log P(M | D) = log P(D | M) + log P(M) − log P(D)

Here M is a Bayesian network and D is the data; log P(D) is the same for every model, so models can be compared by log P(D | M) + log P(M). Assume all variables are observed. A bigger network can achieve a higher P(D | M); P(M) can help control the size (e.g., using the description length). You can search over network structures looking for the most likely model.

Algorithm I

Search over total orderings of the variables. For each total ordering X1, ..., Xn, use supervised learning to learn P(Xi | X1, ..., Xi−1). Return the network model found with the minimum value of
−log P(D | M) − log P(M)
log P(D | M) can be obtained by calculation. −log P(M) can be approximated by m log(d + 1), where m = the number of parameters in M and d = the number of examples.
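A sketch of the score being minimized, under the slide's approximation −log P(M) ≈ m log(d + 1). The helpers supplying the log-likelihood and parameter count of a candidate network are assumed.

```python
import math

def mdl_score(log_likelihood, num_params, num_examples):
    """Score to minimize: -log P(D|M) - log P(M), with
    -log P(M) approximated as m * log(d + 1)."""
    return -log_likelihood + num_params * math.log(num_examples + 1)
```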

Algorithm II

Learn a tree-structured Bayesian network: compute the correlations between all pairs of variables, find a maximum spanning tree that maximizes the absolute values of the correlations, pick a variable to be the root of the tree, and then fill in the probabilities using the data.
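A minimal sketch of the tree-construction step using NumPy and SciPy. Using |correlation| as the edge weight follows the slide; running scipy's minimum_spanning_tree on negated weights to obtain a maximum spanning tree is an implementation choice.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def tree_structure(X):
    """X: (n_examples, n_vars) data matrix. Returns tree edges (i, j)
    of the maximum spanning tree over absolute pairwise correlations."""
    corr = np.corrcoef(X, rowvar=False)     # pairwise correlations
    weights = -np.abs(corr)                 # negate so min tree = max tree
    np.fill_diagonal(weights, 0)            # no self-edges
    mst = minimum_spanning_tree(weights)    # sparse matrix of chosen edges
    return list(zip(*mst.nonzero()))

# Usage: edges = tree_structure(data); then pick a root variable and
# estimate each P(child | parent) from counts as on the earlier slide.
```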

Example: Original Bayesian Network

Structure: A → B, A → C, B → D, C → D.

P(A) = 0.8

A   P(B|A)
t   0.5
f   0.2

A   P(C|A)
t   0.2
f   0.5

B  C   P(D|B,C)
t  t   0.5
t  f   0.8
f  t   0.2
f  f   0.4

Example: Counts from 1000 Examples

A B C D   Count      A B C D   Count
T T T T    39        F T T T     6
T T T F    37        F T T F     5
T T F T   300        F T F T    20
T T F F    69        F T F F    10
T F T T    15        F F T T    15
T F T F    67        F F T F    46
T F F T   117        F F F T    27
T F F F   189        F F F F    38

Example: Learned Probabilities

Structure: A → B, A → C, B → D, C → D (as in the original network).

P(A) = 0.83

A   P(B|A)
t   0.53
f   0.25

A   P(C|A)
t   0.19
f   0.43

B  C   P(D|B,C)
t  t   0.52
t  f   0.80
f  t   0.21
f  f   0.39

Example: EM for Hidden Variable

Structure: H → B, H → C, B → D, C → D, where H is hidden.

P(H) = 0.25

H   P(B|H)
t   0.83
f   0.37

H   P(C|H)
t   0.02
f   0.30

B  C   P(D|B,C)
t  t   0.5
t  f   0.8
f  t   0.2
f  f   0.4

Example: Learn Structure

Correlation table:
      B     C     D
A    .22   .21   .11
B          .12   .41
C                .23

The maximum spanning tree over absolute correlations gives the structure A → B, B → D, D → C.

P(A) = 0.83

A   P(B|A)
t   0.53
f   0.25

B   P(D|B)
t   0.75
f   0.34

D   P(C|D)
t   0.14
f   0.34