Chapter 2 Learning Basics and Linear Models

Chapter 2 Learning Basics and Linear Models, M1 Nakayama Sahoko (SP), 2017/7/7

Contents (2 Learning Basics and Linear Models)
2.1 Supervised Learning and Parameterized Functions
2.2 Train, Test, and Validation Sets
2.3 Linear Models
  2.3.1 Binary Classification
  2.3.2 Log-Linear Binary Classification
  2.3.3 Multi-class Classification
2.4 Representations
2.5 One-Hot and Dense Vector Representations
2.6 Log-linear Multi-class Classification

Overview: this chapter provides supervised machine learning terminology and practices, and linear and log-linear models for binary and multi-class classification.

Supervised Machine Learning: the creation of mechanisms that can look at examples and produce generalizations. [Figure: input emails pass through a learned function F(x), whose output is spam or not-spam.]

Parameterized functions. Searching over the set of all possible functions is very hard, so we restrict the search to specific hypothesis classes (families of functions), injecting the learner with inductive bias; learning then becomes a search over the space of parameters. One common hypothesis class is the linear model: f(x) = x·W + b, with input x ∈ ℝ^{d_in} and parameters W ∈ ℝ^{d_in × d_out}, b ∈ ℝ^{d_out}.
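
A minimal numpy sketch of this hypothesis class; the dimensions and the random W and b are made up for illustration, since in practice learning searches over these parameters:

    import numpy as np

    d_in, d_out = 4, 3                      # input and output dimensions
    rng = np.random.default_rng(0)

    W = rng.normal(size=(d_in, d_out))      # hypothetical parameters; learning searches over W and b
    b = rng.normal(size=d_out)

    def f(x: np.ndarray) -> np.ndarray:
        """Linear model f(x) = x . W + b."""
        return x @ W + b

    print(f(rng.random(d_in)))              # a vector in R^{d_out}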

How to know the function is good: our goal is to produce a function f(x) that correctly maps inputs x to outputs ŷ. How do we know that the produced function f() is indeed a good one?

Leave-one-out cross-validation: train k functions f_{1:k}, each time (1) leaving out a different input example x_i and (2) evaluating the resulting function f_i() on its ability to predict x_i; then train another function f() on the entire training set x_{1:k}. https://www.slideshare.net/devonkbarrow/euro-2013-barrow-crone

Leave-one-out. Good: a good approximation of the accuracy on new inputs. Bad: very costly in computation time, so it is used only in cases where k is very small. https://www.slideshare.net/butest/an-introduction-to-machine-learning
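
A rough sketch of the leave-one-out loop; the 1-nearest-neighbor learner and the toy data are stand-ins, since the slides do not fix a particular learner:

    import numpy as np

    def train_1nn(X, y):
        """Stand-in learner: 1-nearest-neighbor 'trained' by memorizing the data."""
        def predict(x):
            return y[np.argmin(np.linalg.norm(X - x, axis=1))]
        return predict

    # Toy labeled data (k = 6 examples).
    X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
    y = np.array([-1, -1, -1, 1, 1, 1])

    # Leave-one-out: train k functions f_i, each on all examples except x_i, and test on x_i.
    correct = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        f_i = train_1nn(X[mask], y[mask])
        correct += int(f_i(X[i]) == y[i])
    print("LOO accuracy:", correct / len(X))

    # Finally, train f() on the entire training set x_{1:k}.
    f = train_1nn(X, y)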

Held-out set:
1. Randomly split all data into two subsets (say 80%/20%): a training set and a held-out set.
2. Train a model on the training set.
3. Test its accuracy on the held-out set.

A three-way split: to compare several models and select the best one, split the data three ways into a training set, a validation (development) set, and a test set. The training set is used for training, the validation set for tweaks, error analysis, and model selection, and the test set is held out for a single run of the final model.
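
A sketch of a random three-way split; the 80%/10%/10% proportions are an assumption, as the slides only specify 80%/20% for the two-way case:

    import numpy as np

    n = 1000                                   # pretend we have 1000 labeled examples
    rng = np.random.default_rng(0)
    idx = rng.permutation(n)                   # shuffle before splitting

    train_idx = idx[: int(0.8 * n)]            # training set: fit the model
    val_idx = idx[int(0.8 * n): int(0.9 * n)]  # validation set: tweaks, error analysis, model selection
    test_idx = idx[int(0.9 * n):]              # test set: a single run of the final model

    print(len(train_idx), len(val_idx), len(test_idx))   # 800 100 100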

Binary classification: f(x) = x·w + b with d_out = 1, so w is a vector (and b a scalar). ŷ = sign(f(x)) = sign(x·w + b), where the positive class is 1 and the negative class is -1.

Binary classification: ŷ = sign(f(x)) = sign(x·w + b) = sign(size·w_1 + price·w_2 + b). [Figure: scatter plot of size vs. price; blue circles are Dupont Circle, green crosses are Fairfax.]

Binary classification: ŷ = sign(f(x)) = sign(x·w + b) = sign(size·w_1 + price·w_2 + b). If ŷ ≥ 0, predict Fairfax; otherwise predict Dupont Circle.
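
A minimal sketch of this decision rule; the weights w_1, w_2 and the bias b are invented for illustration and would in practice be learned from labeled examples:

    import numpy as np

    # Hypothetical, untrained parameters chosen only for illustration; real values would be learned.
    w = np.array([0.4, -0.002])   # weights for (size, price)
    b = -1.0

    def predict(size, price):
        """Apply the decision rule y_hat = sign(x . w + b)."""
        y_hat = np.sign(np.array([size, price]) @ w + b)
        return "Fairfax" if y_hat >= 0 else "Dupont Circle"

    print(predict(size=80, price=200))   # 80*0.4 + 200*(-0.002) - 1 = 30.6 > 0 -> Fairfax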

More than two features. Counts of the letter bigram ab: x_{ab} = #ab / |D|, where #ab is the number of times the bigram ab appears in the document and |D| is the total number of bigrams in the document (the document's length). This gives x ∈ ℝ^{784} (an alphabet of 28 letters yields 28 × 28 = 784 possible bigrams).

More than two features. [Figure: bigram histograms for several German and English texts.]

More than two features. Given a new document, will it be considered as belonging to the German group or the English one? ŷ = sign(f(x)) = sign(x·w + b) = sign(x_{aa}·w_{aa} + x_{ab}·w_{ab} + x_{ac}·w_{ac} + … + b); the document is considered English if f(x) ≥ 0 and German otherwise.
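
A rough sketch of this pipeline, from raw text to a sign decision; the 28-symbol toy alphabet and the random weights are assumptions standing in for a learned model:

    import numpy as np
    from collections import Counter
    from itertools import product

    # Toy 28-symbol alphabet (a-z, '_' for space, '.' for everything else),
    # matching the slide's 28 letters -> 28*28 = 784 possible bigrams.
    ALPHABET = "abcdefghijklmnopqrstuvwxyz_."
    BIGRAMS = ["".join(p) for p in product(ALPHABET, repeat=2)]
    INDEX = {bg: i for i, bg in enumerate(BIGRAMS)}

    def bigram_features(text: str) -> np.ndarray:
        """x[ab] = #ab / |D|: normalized letter-bigram counts of the document."""
        norm = "".join(c if c in ALPHABET else ("_" if c.isspace() else ".") for c in text.lower())
        counts = Counter(norm[i:i + 2] for i in range(len(norm) - 1))
        x = np.zeros(len(BIGRAMS))
        for bg, c in counts.items():
            x[INDEX[bg]] = c
        return x / max(x.sum(), 1)

    # Hypothetical weights w and bias b; in practice they are learned from labeled documents.
    rng = np.random.default_rng(0)
    w, b = rng.normal(size=len(BIGRAMS)), 0.0

    x = bigram_features("the quick brown fox")
    print("English" if x @ w + b >= 0 else "German")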

Log-linear binary classification: to obtain the confidence of the decision, i.e., the probability that the classifier assigns to the class, push the output through a squashing function such as the sigmoid, σ(x) = 1 / (1 + e^{-x}), resulting in ŷ = σ(f(x)) = 1 / (1 + e^{-(x·w + b)}).
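
A small sketch of the squashing step; the score value is a made-up stand-in for x·w + b:

    import numpy as np

    def sigmoid(z):
        """Squashing function sigma(z) = 1 / (1 + exp(-z)), mapping scores to (0, 1)."""
        return 1.0 / (1.0 + np.exp(-z))

    score = 2.3                 # hypothetical score f(x) = x . w + b for one document
    print(sigmoid(score))       # ~0.909: the probability assigned to the positive class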

Multi-class classification: assign an example to one of k different classes, e.g., classify a document into one of six possible languages: English, French, German, Italian, Spanish, Other. ŷ = f(x) = argmax_{L ∈ {En, Fr, Gr, It, Sp, O}} x·w^L + b^L. Re-written with the vectors w^L ∈ ℝ^{784} stacked into a matrix W and the scalars b^L collected into a vector b: ŷ = f(x) = x·W + b, prediction = argmax_i ŷ[i]. (2.7)

Multi-class classification: ŷ = f(x) = argmax_{L ∈ {En, Fr, Gr, It, Sp, O}} x·w^L + b^L. [Figure: the 784-dimensional bigram vector x = (#aa/|D|, #ab/|D|, #ac/|D|, …, #zy/|D|, #zz/|D|) is multiplied by a weight vector w^L and added to b^L, giving score^L for language L.]

Multi-class classification: ŷ = f(x) = x·W + b, prediction = argmax_i ŷ[i]. [Figure: the bigram vector x = (#aa/|D|, #ab/|D|, …, #zz/|D|) is multiplied by the 784 × 6 matrix W and added to the bias vector b, yielding one score per language (En, Fr, Gr, It, Sp, O), here (4, 6, 2, 7, 5, 2); the prediction is the language with the highest score.]
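
A sketch of this matrix form; W, b, and the input histogram are random stand-ins for learned parameters and real bigram counts:

    import numpy as np

    LANGUAGES = ["En", "Fr", "Gr", "It", "Sp", "O"]
    D_IN, D_OUT = 784, 6                 # bigram features, number of languages

    rng = np.random.default_rng(0)
    W = rng.normal(size=(D_IN, D_OUT))   # hypothetical learned weights, one column per language
    b = rng.normal(size=D_OUT)           # hypothetical learned biases

    x = rng.random(D_IN)
    x = x / x.sum()                      # pretend bigram histogram (entries sum to 1)

    y_hat = x @ W + b                    # one score per language
    print(LANGUAGES[int(np.argmax(y_hat))])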

Representations: the vector ŷ = f(x) = x·W + b can be seen as a representation of the document, capturing its scores over the possible languages.

One-hot vector: x^{D[i]} ∈ ℝ^{784} is a one-hot vector, where i is a particular document position and D[i] is the bigram at that position. All entries are zero except the single entry corresponding to the letter bigram D[i], which is 1. [Figure: a vector of the form (0, 0, 0, 1, 0, …, 0).]

Bag of words: x = (1/|D|) Σ_{i=1}^{|D|} x^{D[i]}. The resulting vector x is commonly referred to as an averaged bag of bigrams (averaged bag of words, or just bag of words). [Figure: the averaged vector, mostly zeros with a few small fractions.]
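
A small sketch of one-hot vectors and their average; the tiny bigram vocabulary and the toy document are invented for illustration:

    import numpy as np

    # Tiny illustrative vocabulary of bigrams instead of all 784; indices are hypothetical.
    VOCAB = ["aa", "ab", "ac", "ba", "bb", "bc", "ca", "cb", "cc"]
    INDEX = {bg: i for i, bg in enumerate(VOCAB)}

    def one_hot(bigram: str) -> np.ndarray:
        """x^{D[i]}: all zeros except a 1 at the entry for the bigram at position i."""
        v = np.zeros(len(VOCAB))
        v[INDEX[bigram]] = 1.0
        return v

    doc = "abcab"                                            # bigrams: ab, bc, ca, ab
    bigrams = [doc[i:i + 2] for i in range(len(doc) - 1)]
    x = sum(one_hot(bg) for bg in bigrams) / len(bigrams)    # averaged bag of bigrams
    print(dict(zip(VOCAB, x)))                               # "ab" gets 0.5, "bc" and "ca" get 0.25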

Continuous bag of words: ŷ = (1/|D|) Σ_{i=1}^{|D|} W^{D[i]}. This representation is called a continuous bag of words (CBOW), as it is composed of a sum of word representations: y = x·W = ((1/|D|) Σ_i x^{D[i]})·W = (1/|D|) Σ_i (x^{D[i]}·W) = (1/|D|) Σ_i W^{D[i]}. [Figure: the rows W^{D[i]} are summed to give y.]

Continuous bag of words: y = x·W = ((1/|D|) Σ_i x^{D[i]})·W = (1/|D|) Σ_i (x^{D[i]}·W) = (1/|D|) Σ_i W^{D[i]}. [Figure: multiplying a one-hot vector x^{D[i]} by W selects the single row W^{D[i]} of W, here the row (4, 2, 6, 5, 9, 7) of scores over En, Fr, Gr, It, Sp, O.]
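
A sketch verifying this identity on toy data; the vocabulary size and the random W are assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB = ["aa", "ab", "ac", "ba", "bb"]                            # toy bigram vocabulary
    W = rng.integers(0, 10, size=(len(VOCAB), 3)).astype(float)       # hypothetical 3-class weight matrix

    def one_hot(i, n):
        v = np.zeros(n)
        v[i] = 1.0
        return v

    # A one-hot vector times W just selects the corresponding row of W.
    i = VOCAB.index("ac")
    assert np.allclose(one_hot(i, len(VOCAB)) @ W, W[i])

    # So the averaged bag of bigrams times W equals the average of the selected rows (CBOW).
    positions = [VOCAB.index(bg) for bg in ["ab", "ac", "ab"]]        # bigrams of a toy document
    x = np.mean([one_hot(i, len(VOCAB)) for i in positions], axis=0)
    assert np.allclose(x @ W, np.mean(W[positions], axis=0))
    print(x @ W)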

Log-linear multi-class classification: in the binary case we used the sigmoid function, resulting in a log-linear model; in the multi-class case we use the softmax function, softmax(x)[i] = e^{x[i]} / Σ_j e^{x[j]}, resulting in ŷ = softmax(x·W + b), ŷ[i] = e^{(x·W + b)[i]} / Σ_j e^{(x·W + b)[j]}. Softmax forces the values in ŷ to be positive and sum to 1, making them interpretable as a probability distribution.
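
A small sketch of the softmax step, reusing the six scores (4, 6, 2, 7, 5, 2) from the earlier figure; the max-shift inside the function is a standard numerical-stability trick not mentioned on the slide:

    import numpy as np

    def softmax(z: np.ndarray) -> np.ndarray:
        """softmax(z)[i] = exp(z[i]) / sum_j exp(z[j]); shifting by max(z) avoids overflow."""
        e = np.exp(z - np.max(z))
        return e / e.sum()

    scores = np.array([4.0, 6.0, 2.0, 7.0, 5.0, 2.0])   # scores x.W + b over En, Fr, Gr, It, Sp, O
    probs = softmax(scores)
    print(probs, probs.sum())   # positive values summing to 1, peaked on the highest-scoring class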