Chapter 2: Learning Basics and Linear Models
M1 Nakayama Sahoko (SP), 2017/7/7
Contents
2 Learning Basics and Linear Models
2.1 Supervised Learning and Parameterized Functions
2.2 Train, Test, and Validation Sets
2.3 Linear Models
  2.3.1 Binary Classification
  2.3.2 Log-Linear Binary Classification
  2.3.3 Multi-class Classification
2.4 Representations
2.5 One-Hot and Dense Vector Representations
2.6 Log-linear Multi-class Classification
Overview
This chapter provides:
- supervised machine learning terminology and practices
- linear and log-linear models for binary and multi-class classification
2.1 Supervised Learning and Parameterized Functions
Supervised Machine Learning
The creation of mechanisms that can look at examples and produce generalizations.
[Figure: an input (e.g. an email) is fed to a function f(x), which outputs a label: spam or not-spam.]
Parameterized Functions
Searching over the set of all possible functions is very hard. Instead, we restrict the search to a specific hypothesis class (a family of functions), injecting the learner with inductive bias, and search over the space of parameters.
One common hypothesis class is the linear model:
f(x) = x · W + b
x ∈ ℝ^{d_in}, W ∈ ℝ^{d_in × d_out}, b ∈ ℝ^{d_out}
x is the input; W and b are the parameters.
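As a minimal sketch (not from the slides), the linear hypothesis class in NumPy; the dimensions and the random initialization are arbitrary assumptions:

```python
import numpy as np

d_in, d_out = 784, 6  # e.g. bigram features in, language scores out

# Parameters of the model: learning searches over these.
W = np.random.randn(d_in, d_out) * 0.01
b = np.zeros(d_out)

def f(x):
    """Linear model f(x) = x.W + b, mapping R^{d_in} to R^{d_out}."""
    return x @ W + b

x = np.random.rand(d_in)  # a stand-in input vector
print(f(x).shape)         # (6,)
```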
2.2 Train, Test, and Validation Sets
How Do We Know the Function Is Good?
Our goal is to produce a function f(x) that correctly maps inputs x to outputs ŷ.
How do we know that the produced function f() is indeed a good one?
Leave-One-Out Cross-Validation
Given k training examples, train k functions f_{1:k}, each time
1. leaving out a different input example x_i, and
2. evaluating the resulting function f_i() on its ability to predict x_i.
Then train another function f() on the entire training set x_{1:k}.
https://www.slideshare.net/devonkbarrow/euro-2013-barrow-crone
Leave-One-Out: Pros and Cons
Good: a good approximation of the accuracy on new inputs.
Bad: very costly in computation time; used only in cases where k is very small.
https://www.slideshare.net/butest/an-introduction-to-machine-learning
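A minimal sketch of the procedure; the generic `train` callable and 0-1 accuracy are illustrative assumptions, not from the slides:

```python
def leave_one_out_accuracy(train, xs, ys):
    """Train k models, each with one example held out, and average accuracy."""
    k = len(xs)
    correct = 0
    for i in range(k):
        # Train f_i on every example except x_i.
        xs_i = [x for j, x in enumerate(xs) if j != i]
        ys_i = [y for j, y in enumerate(ys) if j != i]
        f_i = train(xs_i, ys_i)
        # Evaluate f_i on the single held-out example x_i.
        correct += int(f_i(xs[i]) == ys[i])
    return correct / k
```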
Held-Out Set
1. Randomly split all the data into two subsets (say 80%/20%): a training set and a held-out set.
2. Train a model on the training set.
3. Test its accuracy on the held-out set.
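A sketch of the random 80%/20% split, assuming xs and ys are NumPy arrays; the fixed seed is an arbitrary choice:

```python
import numpy as np

def train_heldout_split(xs, ys, train_frac=0.8, seed=0):
    """Randomly split the data into a training set and a held-out set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(xs))   # random order of example indices
    cut = int(train_frac * len(xs))
    train_idx, held_idx = idx[:cut], idx[cut:]
    return (xs[train_idx], ys[train_idx]), (xs[held_idx], ys[held_idx])
```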
A Three-Way Split
To compare several models and select the best one, use a three-way split of the data into a training set, a validation set (also called a development set), and a test set.
Training set + validation set: tweaks, error analysis, and model selection.
Test set (kept held out): a single run of the final model.
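The same idea extended to three parts; the 70%/15%/15% proportions are an assumption for illustration:

```python
import numpy as np

def three_way_split(xs, ys, fracs=(0.7, 0.15, 0.15), seed=0):
    """Split the data into train / validation / test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(xs))
    n_tr = int(fracs[0] * len(xs))
    n_va = int(fracs[1] * len(xs))
    tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
    return (xs[tr], ys[tr]), (xs[va], ys[va]), (xs[te], ys[te])
```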
2.3 Linear Models
Binary Classification
f(x) = x · w + b
Here d_out = 1, so w is a vector and b is a scalar.
ŷ = sign(f(x)) = sign(x · w + b)
The positive class: +1. The negative class: -1.
Binary Classification
ŷ = sign(f(x)) = sign(x · w + b) = sign(size · w₁ + price · w₂ + b)
[Figure: apartments plotted by size and price. Blue circles: Dupont Circle. Green crosses: Fairfax.]
ŷ = sign(f(x)) = sign(size · w₁ + price · w₂ + b)
If ŷ ≥ 0: Fairfax; else: Dupont Circle.
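A direct translation of the decision rule; the weight and bias values below are made up for illustration:

```python
import numpy as np

# Illustrative (made-up) parameters for the size/price example.
w = np.array([0.5, -0.2])  # w1 for size, w2 for price
b = -1.0

def predict_neighborhood(size, price):
    """sign(size*w1 + price*w2 + b): >= 0 means Fairfax, < 0 Dupont Circle."""
    score = np.array([size, price]) @ w + b
    return "Fairfax" if score >= 0 else "Dupont Circle"

print(predict_neighborhood(size=80, price=150))
```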
More Than Two Features
Normalized counts of the letter bigram "ab":
x_{ab} = #ab / |D|
#ab: the number of times the bigram "ab" appears in the document.
|D|: the total number of bigrams in the document (the document's length).
x ∈ ℝ^{784} (an alphabet of 28 letters gives 28 × 28 = 784 possible bigrams).
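A minimal sketch of computing the normalized bigram histogram; the particular 28-symbol alphabet (a-z, space, and one catch-all symbol) is an assumption:

```python
from collections import Counter

# 28-symbol alphabet: a-z, space, and '#' standing in for anything else.
ALPHABET = "abcdefghijklmnopqrstuvwxyz #"
INDEX = {c: i for i, c in enumerate(ALPHABET)}

def bigram_histogram(doc):
    """Return the 784-dim vector x with x[ab] = #ab / |D|."""
    chars = [c if c in INDEX else "#" for c in doc.lower()]
    bigrams = list(zip(chars, chars[1:]))
    x = [0.0] * (28 * 28)
    for (a, b), n in Counter(bigrams).items():
        x[INDEX[a] * 28 + INDEX[b]] = n / len(bigrams)
    return x
```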
[Figure: bigram histograms for several German and English texts.]
Given a new document, will it be grouped with the German texts or the English ones?
ŷ = sign(f(x)) = sign(x · w + b) = sign(x_{aa} · w_{aa} + x_{ab} · w_{ab} + x_{ac} · w_{ac} + … + b)
The document is considered English if f(x) ≥ 0 and German otherwise.
Log-Linear Binary Classification
To obtain the confidence of the decision, i.e. the probability that the classifier assigns to the class, push the output through a squashing function such as the sigmoid:
σ(x) = 1 / (1 + e^{-x})
ŷ = σ(f(x)) = 1 / (1 + e^{-(x · w + b)})
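A direct sketch of the log-linear binary classifier:

```python
import numpy as np

def sigmoid(z):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """Probability the log-linear model assigns to the positive class."""
    return sigmoid(x @ w + b)

# A score of 0 (on the decision boundary) maps to probability 0.5.
print(sigmoid(0.0))  # 0.5
```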
Multi-class Classification
Assign an example to one of k different classes, e.g. classify a document into one of six possible languages: English, French, German, Italian, Spanish, Other.
ŷ = f(x) = argmax_{L ∈ {En, Fr, Gr, It, Sp, O}} x · w^L + b^L
Stacking the per-language vectors w^L ∈ ℝ^{784} into a matrix W ∈ ℝ^{784×6} and the biases into a vector b ∈ ℝ^6, this can be rewritten as:
ŷ = f(x) = x · W + b
prediction = ŷ = argmax_i ŷ_[i]   (2.7)
For each language L ∈ {En, Fr, Gr, It, Sp, O}, the bigram-frequency vector x = (#aa/|D|, #ab/|D|, #ac/|D|, …, #zy/|D|, #zz/|D|) ∈ ℝ^{784} is multiplied by that language's weight vector and shifted by its bias: x · w^L + b^L = score^L.
ŷ = f(x) = x · W + b, prediction = ŷ = argmax_i ŷ_[i]
[Figure: the row vector x of bigram frequencies (#aa/|D|, #ab/|D|, …, #zz/|D|) multiplied by the matrix W, whose six columns correspond to En, Fr, Gr, It, Sp, O, plus the bias b, yields a vector of six language scores, e.g. ŷ = (4, 6, 2, 7, 5, 2).]
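The whole prediction rule in a few lines; the shapes follow the slides, but the parameter values are random placeholders:

```python
import numpy as np

LANGS = ["En", "Fr", "Gr", "It", "Sp", "O"]

def predict_language(x, W, b):
    """Compute all six language scores at once and return the argmax."""
    scores = x @ W + b  # y-hat: one score per language
    return LANGS[int(np.argmax(scores))]

W = np.random.randn(784, 6) * 0.01  # placeholder parameters
b = np.zeros(6)
x = np.random.rand(784)             # placeholder bigram histogram
print(predict_language(x, W, b))
```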
2.4 Representations
Representations
The vector x in ŷ = f(x) = x · W + b is a representation of the document.
2.5 One-Hot and Dense Vector Representations
One-Hot Vectors
x^{D_[i]} ∈ ℝ^{784}: a one-hot vector, where i is a particular position in the document and D_[i] is the bigram at that position.
All entries are zero except the single entry corresponding to the letter bigram D_[i], which is 1:
x^{D_[i]} = (0, 0, 0, 1, 0, …, 0)
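A one-hot vector in code:

```python
import numpy as np

def one_hot(index, dim=784):
    """All zeros except a single 1 at the entry for the given bigram."""
    x = np.zeros(dim)
    x[index] = 1.0
    return x

print(one_hot(3, dim=10))  # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```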
Bag of Words
x = (1/|D|) Σ_{i=1}^{|D|} x^{D_[i]}
The resulting vector x is commonly referred to as an averaged bag of bigrams (more generally, an averaged bag of words, or just a bag of words), e.g.:
x = (0, 0, 0, 2/|D|, 0, 0, 1/|D|, 0, 0, 0)
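The averaging, reusing the one_hot sketch from above:

```python
import numpy as np

def one_hot(index, dim=784):
    x = np.zeros(dim)
    x[index] = 1.0
    return x

def bag_of_bigrams(positions, dim=784):
    """Average the one-hot vectors of all bigram positions in the document."""
    return sum(one_hot(i, dim) for i in positions) / len(positions)
```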
Continuous Bag of Words
ŷ = (1/|D|) Σ_{i=1}^{|D|} W^{D_[i]}
This representation is called a continuous bag of words (CBOW), as it is composed of a sum (average) of word representations, the rows W^{D_[i]} of W. It follows by linearity:
y = x · W = ((1/|D|) Σ_i x^{D_[i]}) · W = (1/|D|) Σ_i (x^{D_[i]} · W) = (1/|D|) Σ_i W^{D_[i]}
[Figure: the rows W^{D_[i]} selected by each position are summed to produce y.]
Multiplying a one-hot vector by W simply selects the corresponding row of W: x^{D_[i]} · W = W^{D_[i]}.
[Figure: x^{D_[i]} = (0, 0, 1, 0, 0) times a 5 × 6 excerpt of W picks out its third row, (4, 2, 6, 5, 9, 7), with columns En, Fr, Gr, It, Sp, O.]
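A sketch checking the equivalence numerically; the matrix values and bigram indices are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.integers(0, 10, size=(5, 6)).astype(float)  # toy 5x6 weight matrix
doc = [2, 0, 2, 4]  # bigram index at each document position

# Route 1: averaged bag of one-hot vectors, then multiply by W.
x = np.zeros(5)
for i in doc:
    x[i] += 1.0
x /= len(doc)
y1 = x @ W

# Route 2: directly average the rows of W selected by each position.
y2 = W[doc].mean(axis=0)

print(np.allclose(y1, y2))  # True: x.W equals the average of W's rows
```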
2.6 Log-linear Multi-class Classification
Log-Linear Multi-class Classification
In the binary case we used the sigmoid function, resulting in a log-linear model. For the multi-class case, use the softmax function:
softmax(x)_[i] = e^{x_[i]} / Σ_j e^{x_[j]}
resulting in:
ŷ = softmax(x · W + b)
ŷ_[i] = e^{(x · W + b)_[i]} / Σ_j e^{(x · W + b)_[j]}
The softmax forces the values in ŷ to be positive and sum to 1, making them interpretable as a probability distribution.
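A sketch of the softmax; the max-subtraction is a standard numerical-stability trick, not something the slides mention, and the input reuses the example score vector from the multi-class slide:

```python
import numpy as np

def softmax(z):
    """Exponentiate and normalize: outputs are positive and sum to 1."""
    e = np.exp(z - z.max())  # subtracting the max avoids overflow
    return e / e.sum()

def predict_distribution(x, W, b):
    """Probability distribution over the k classes."""
    return softmax(x @ W + b)

scores = np.array([4.0, 6.0, 2.0, 7.0, 5.0, 2.0])
print(softmax(scores))        # all positive
print(softmax(scores).sum())  # 1.0
```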