Supervised Learning: The Setup. Spring 2018

Homework 0 will be released today through Canvas. Due: Jan. 19 (next Friday), midnight. 2

Last lecture We saw: What is learning? Learning as generalization. The badges game. 3

This lecture More badges. Formalizing supervised learning: instance space and features, label space, hypothesis space. Some slides based on lectures from Tom Dietterich, Dan Roth. 4

The badges game 5

Let's play. Name / Label: Claire Cardie (-), Peter Bartlett (+), Eric Baum (+), Haym Hirsh (+), Shai Ben-David (+), Michael I. Jordan (-). 6

Let's play. Name / Label: Claire Cardie (-), Peter Bartlett (+), Eric Baum (+), Haym Hirsh (+), Shai Ben-David (+), Michael I. Jordan (-). What is the label for Peyton Manning? What about Eli Manning? 7

Let's play. Name / Label: Claire Cardie (-), Peter Bartlett (+), Eric Baum (+), Haym Hirsh (+), Shai Ben-David (+), Michael I. Jordan (-). What function can you learn? 8

Let's play. Name / Label: Claire Cardie (-), Peter Bartlett (+), Eric Baum (+), Haym Hirsh (+), Shai Ben-David (+), Michael I. Jordan (-). How were the labels generated? If length of first name <= 5, then + else -. 9
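A minimal Python sketch of the labeling rule stated above, assuming each name is given as a "First Last" string (the function name `badge_label` is just for illustration):

```python
def badge_label(name):
    """Label a name '+' or '-' using the rule from the badges game:
    if the first name has at most 5 letters, the label is '+', else '-'."""
    first_name = name.split()[0]
    return "+" if len(first_name) <= 5 else "-"

# The training names from the slide; this rule reproduces their labels.
for name in ["Claire Cardie", "Peter Bartlett", "Eric Baum",
             "Haym Hirsh", "Shai Ben-David", "Michael I. Jordan"]:
    print(name, badge_label(name))
```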

Questions Are you sure you got the correct function? Is this prediction or just modeling data? How did you know that you should look at the letters? All words have a length. Background knowledge. How did you arrive at it? What learning algorithm did you use? 10

What is supervised learning? 11

Instances and Labels Running example: Automatically tag news articles 12

Instances and Labels Running example: Automatically tag news articles An instance of a news article that needs to be classified 13

Instances and Labels Running example: Automatically tag news articles. An instance: a news article that needs to be classified. A label: Sports. 14

Instances and Labels Running example: Automatically tag news articles. An instance (a news article) is mapped by the classifier to a label: Sports, Business, Politics, Entertainment. Instance Space: All possible news articles. Label Space: All possible labels. 15

Instances and Labels X: Instance Space The set of examples that need to be classified Eg: The set of all possible names, documents, sentences, images, emails, etc. 16

Instances and Labels X: Instance Space The set of examples that need to be classified Y: Label Space The set of all possible labels Eg: The set of all possible names, documents, sentences, images, emails, etc Eg: {Spam, Not-Spam}, {+,-}, etc. 17

Instances and Labels X: Instance Space The set of examples that need to be classified Target function y = f(x) Y: Label Space The set of all possible labels Eg: The set of all possible names, documents, sentences, images, emails, etc Eg: {Spam, Not-Spam}, {+,-}, etc. 18

Instances and Labels X: Instance Space The set of examples that need to be classified Target function y = f(x) The goal of learning: Find this target function Learning is search over functions Y: Label Space The set of all possible labels E.g.: The set of all possible names, documents, sentences, images, emails, etc Eg: {Spam, Not-Spam}, {+,-}, etc. 19

Supervised learning X: Instance Space The set of examples Target function y = f(x) Y: Label Space The set of all possible labels Learning algorithm only sees examples of the function f in action 20

Supervised learning: Training X: Instance Space The set of examples. Target function y = f(x). Y: Label Space The set of all possible labels. The learning algorithm only sees examples of the function f in action: (x1, f(x1)), (x2, f(x2)), (x3, f(x3)), ..., (xN, f(xN)) Labeled training data 21

Supervised learning: Training X: Instance Space The set of examples. Target function y = f(x). Y: Label Space The set of all possible labels. The learning algorithm only sees examples of the function f in action: (x1, f(x1)), (x2, f(x2)), (x3, f(x3)), ..., (xN, f(xN)) Labeled training data Learning algorithm 22

Supervised learning: Training X: Instance Space The set of examples. Target function y = f(x). Y: Label Space The set of all possible labels. The learning algorithm only sees examples of the function f in action: (x1, f(x1)), (x2, f(x2)), (x3, f(x3)), ..., (xN, f(xN)) Labeled training data Learning algorithm A learned function g: X → Y 23

Supervised learning: Training X: Instance Space The set of examples. Target function y = f(x). Y: Label Space The set of all possible labels. The learning algorithm only sees examples of the function f in action: (x1, f(x1)), (x2, f(x2)), (x3, f(x3)), ..., (xN, f(xN)) Labeled training data Learning algorithm A learned function g: X → Y Can you think of other training protocols? 24

Supervised learning: Evaluation X: Instance Space The set of examples Target function y = f(x) Learned function y = g(x) Y: Label Space The set of all possible labels 25

Supervised learning: Evaluation X: Instance Space The set of examples. Target function y = f(x). Learned function y = g(x). Y: Label Space The set of all possible labels. Draw test example x ∈ X. f(x) vs. g(x): are they different? 26

Supervised learning: Evaluation X: Instance Space The set of examples. Target function y = f(x). Learned function y = g(x). Y: Label Space The set of all possible labels. Draw test example x ∈ X. f(x) vs. g(x): are they different? Apply the model to many test examples and compare to the target's prediction 27

Supervised learning: Evaluation X: Instance Space The set of examples. Target function y = f(x). Learned function y = g(x). Y: Label Space The set of all possible labels. Draw test example x ∈ X. f(x) vs. g(x): are they different? Apply the model to many test examples and compare to the target's prediction Can you use test examples during training? 28
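A hedged sketch of this evaluation protocol (the names `test_error`, `f`, `g`, and the toy test set are placeholders, not from the slides): apply both the target and the learned function to each test example and count disagreements.

```python
def test_error(f, g, test_examples):
    """Fraction of test examples x where the learned function g disagrees
    with the target function f, i.e., g(x) != f(x)."""
    disagreements = sum(1 for x in test_examples if g(x) != f(x))
    return disagreements / len(test_examples)

# Toy usage: the target is the badges rule; g is a deliberately poor guess.
f = lambda name: "+" if len(name.split()[0]) <= 5 else "-"
g = lambda name: "+"  # a learned function that always predicts '+'
print(test_error(f, g, ["Peyton Manning", "Eli Manning", "Claire Cardie"]))
```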

Supervised learning: General setting Given: Training examples of the form <x, f(x)>. The function f is an unknown function. Typically the input x is represented in a feature space. Example: x ∈ {0,1}^n or x ∈ R^n. A deterministic mapping from instances in your problem (eg: news articles) to features. For a training example x, the value of f(x) is called its label. Goal: Find a good approximation for f. The label determines the kind of problem we have: Binary classification: f(x) ∈ {-1, 1}. Multiclass classification: f(x) ∈ {1, 2, 3, ..., K}. Regression: f(x) ∈ R. Questions? 29

Nature of applications There is no human expert. Eg: Identify DNA binding sites. Humans can perform a task, but can't describe how they do it. Eg: Object detection in images. The desired function is hard to obtain in closed form. Eg: Stock market 30

On using supervised learning We should be able to decide: 1. What is our instance space? What are the inputs to the problem? What are the features? 2. What is our label space? What is the prediction task? 3. What is our hypothesis space? What functions should the learning algorithm search over? 4. What is our learning algorithm? How do we learn from the labeled data? 5. What is our loss function or evaluation metric? What is success? 31

1. The Instance Space X Can we directly use the raw instance/input for machine learning? No, computers can only process numbers. We need to use a set of numbers to represent the raw instances. These numbers are referred to as features. X: Instance Space The set of examples. Target function y = f(x). Y: Label Space The set of all possible labels. The goal of learning: Find this target function. Learning is search over functions. Eg: The set of all possible names, documents, sentences, images, emails, etc. Eg: {Spam, Not-Spam}, {+,-}, etc. 32

1. The Instance Space X Designing an appropriate feature representation is crucial. Instances x ∈ X are defined by features/attributes. Example: Boolean features: Does the email contain the word "free"? Example: Real-valued features: What is the height of the person? What was the stock price yesterday? X: Instance Space The set of examples. Target function y = f(x). Y: Label Space The set of all possible labels. The goal of learning: Find this target function. Learning is search over functions. Eg: The set of all possible names, documents, sentences, images, emails, etc. Eg: {Spam, Not-Spam}, {+,-}, etc. 33

Instances as feature vectors An input to the problem (Eg: emails, names, images) Feature function A feature vector 34

Instances as feature vectors An input to the problem (Eg: emails, names, images) Feature function A feature vector Feature functions a.k.a feature extractors: Deterministic (for the most part) Some exceptions, e.g., random hashing 35

Instances as feature vectors An input to the problem (Eg: emails, names, images) Feature function A feature vector An instance in ML algorithms Feature functions a.k.a feature extractors: Deterministic (for the most part) Some exceptions, e.g., random hashing 36
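The random-hashing exception mentioned above presumably refers to the feature-hashing trick, where feature names are hashed to indices of a fixed-size vector. A minimal sketch, with the function name `hashed_features` and the dimension chosen purely for illustration:

```python
def hashed_features(tokens, dim=32):
    """Map a bag of tokens to a fixed-length vector by hashing each token
    to an index. Collisions are possible, which is why this mapping is not
    the usual one-feature-per-dimension deterministic design."""
    vec = [0] * dim
    for tok in tokens:
        vec[hash(tok) % dim] += 1
    return vec

print(hashed_features(["free", "offer", "free"]))
```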

Instances as feature vectors The instance space X is an N-dimensional vector space (e.g., R^N or {0,1}^N). Each dimension is one feature. Each x ∈ X is a feature vector. Each x = [x1, x2, ..., xN] is a point in the vector space. 37

Instances as feature vectors The instance space X is an N-dimensional vector space (e.g., R^N or {0,1}^N). Each dimension is one feature. Each x ∈ X is a feature vector. Each x = [x1, x2, ..., xN] is a point in the vector space. (Figure: a point plotted in a two-dimensional feature space with axes x1 and x2.) 38


Let's brainstorm some features for the badges game 40

Feature functions produce feature vectors When designing feature functions, think of them as templates. Feature: The second letter of the name. Naoki → 'a' → [1 0 0 0 ...]; Abe → 'b' → [0 1 0 0 ...]; Manning → 'a' → [1 0 0 0 ...]; Scrooge → 'c' → [0 0 1 0 ...]. Feature: The length of the name. Naoki → 5; Abe → 3. 41

Feature functions produce feature vectors When designing feature functions, think of them as templates. Feature: The second letter of the name. Naoki → 'a' → [1 0 0 0 ...]; Abe → 'b' → [0 1 0 0 ...]; Manning → 'a' → [1 0 0 0 ...]; Scrooge → 'c' → [0 0 1 0 ...]. Feature: The length of the name. Naoki → 5; Abe → 3. Question: What is the length of this feature vector? 42

Feature functions produce feature vectors When designing feature functions, think of them as templates. Feature: The second letter of the name. Naoki → 'a' → [1 0 0 0 ...]; Abe → 'b' → [0 1 0 0 ...]; Manning → 'a' → [1 0 0 0 ...]; Scrooge → 'c' → [0 0 1 0 ...]. Feature: The length of the name. Naoki → 5; Abe → 3. Question: What is the length of this feature vector? 26 (One dimension per letter) 43
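A sketch of the two feature templates above, assuming names use lowercase ASCII letters (the function names are hypothetical, not from the slides):

```python
import string

def second_letter_onehot(name):
    """26-dimensional one-hot vector indicating the second letter of the name."""
    vec = [0] * 26
    vec[string.ascii_lowercase.index(name.lower()[1])] = 1
    return vec

def name_length(name):
    """A single real-valued feature: the length of the name."""
    return len(name)

print(second_letter_onehot("Naoki"))  # 1 in the 'a' position
print(name_length("Naoki"))           # 5
```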

Good features are essential Good features decide how well a task can be learned. Eg: A bad feature function for the badges game: Is there a day of the week that begins with the last letter of the first name? Much effort goes into designing features. Or maybe learning them. Will touch upon general principles for designing good features. But feature definition is largely domain specific. Comes with experience. 45

On using supervised learning ✓ What is our instance space? What are the inputs to the problem? What are the features? 2. What is our label space? What is the learning task? 3. What is our hypothesis space? What functions should the learning algorithm search over? 4. What is our learning algorithm? How do we learn from the labeled data? 5. What is our loss function or evaluation metric? What is success? 46

2. The Label Space Y Why should we care about label space? X: Instance Space The set of examples Target function y = f(x) Y: Label Space The set of all possible labels The goal of learning: Find this target function Learning is search over functions Eg: The set of all possible names, sentences, images, emails, etc Eg: {Spam, Not-Spam}, {+,-}, etc. 47

2. The Label Space Y Classification: The outputs are categorical Binary classification: Two possible labels We will see a lot of this Multiclass classification: K possible labels We may see a bit of this Structured classification: Graph valued outputs A different class Classification is the primary focus of this class 48

2. The Label Space Y The output space can be numerical Regression: Y is the set (or a subset) of real numbers Ranking Labels are ordinal That is, there is an ordering over the labels Eg: A Yelp 5-star review is only slightly different from a 4-star review, but very different from a 1-star review 49

On using supervised learning ✓ What is our instance space? What are the inputs to the problem? What are the features? ✓ What is our label space? What is the learning task? 3. What is our hypothesis space? What functions should the learning algorithm search over? 4. What is our learning algorithm? How do we learn from the labeled data? 5. What is our loss function or evaluation metric? What is success? 50

3. The Hypothesis Space X: Instance Space The set of examples Target function y = f(x) Y: Label Space The set of all possible labels The goal of learning: Find this target function Learning is search over functions Eg: The set of all possible names, sentences, images, emails, etc Eg: {Spam, Not-Spam}, {+,-}, etc. 51

3. The Hypothesis Space X: Instance Space The set of examples Target function y = f(x) Y: Label Space The set of all possible labels The goal of learning: Find this target function Learning is search over functions The hypothesis space is the set of functions we consider for this search 52

3. The Hypothesis Space X: Instance Space The set of examples Target function y = f(x) Y: Label Space The set of all possible labels The goal of learning: Find this target function Learning is search over functions The hypothesis space is the set of functions we consider for this search An important problem: How can we decide on the hypothesis space? 53

Example of search: Full space Unknown function y = f(x1, x2), with inputs x1 and x2.
x1 x2 | y
 0  0 | 0
 0  1 | 0
 1  0 | 0
 1  1 | 1
Can you learn this function? What is it? 54

Let us increase the number of features Inputs x1, x2, x3, x4; unknown function y = f(x1, x2, x3, x4). Can you learn this function? What is it? 55

To identify one function in our hypothesis space
x1 x2 x3 x4 | y
 0  0  0  0 | ?
 0  0  0  1 | ?
 0  0  1  0 | ?
 0  0  1  1 | ?
 0  1  0  0 | ?
 0  1  0  1 | ?
 0  1  1  0 | ?
 0  1  1  1 | ?
 1  0  0  0 | ?
 1  0  0  1 | ?
 1  0  1  0 | ?
 1  0  1  1 | ?
 1  1  0  0 | ?
 1  1  0  1 | ?
 1  1  1  0 | ?
 1  1  1  1 | ?
56

We cannot locate even one function in our hypothesis space! 57

What is the problem? Maybe you think we can solve this problem by using 16 training instances But how about 100 binary features? How many training examples will you need? How about 1000 binary features? 58

What is the problem? The hypothesis space is too large! There are 2^16 = 65536 possible Boolean functions over 4 inputs. Why? There are 16 possible inputs, and each way to fill these 16 output slots is a different function, giving 2^16 functions. We have seen only 7 examples, so we cannot even determine one Boolean function in our space. What is the size of our hypothesis space for n inputs? 59
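For reference, a quick way to compute this count follows the slide's argument: there are 2^n possible input rows, and each row can independently be assigned either output, giving 2^(2^n) Boolean functions.

```python
def num_boolean_functions(n):
    """Number of distinct Boolean functions over n binary inputs:
    2**n input rows, each of which can independently map to 0 or 1."""
    return 2 ** (2 ** n)

print(num_boolean_functions(4))   # 65536, matching the slide
print(num_boolean_functions(10))  # astronomically large
```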

Solution: Restrict the search space A hypothesis space is the set of possible functions we consider We were looking at the space of all Boolean functions Instead choose a hypothesis space that is smaller than the space of all functions Only simple conjunctions (with four variables, there are only 16 conjunctions without negations) Simple disjunctions m-of-n rules: Fix a set of n variables. At least m of them must be true Linear functions! 60

Example Hypothesis space 1 Simple conjunctions. Simple conjunctive rules are of the form g(x) = xi ∧ xj ∧ xk. 61

Example Hypothesis space 1 Simple conjunctions. Simple conjunctive rules are of the form g(x) = xi ∧ xj ∧ xk. What is the size of the simple-conjunction hypothesis space? 62

Example Hypothesis space 1 Simple conjunctions. There are only 16 simple conjunctive rules of the form g(x) = xi ∧ xj ∧ xk. 63

Example Hypothesis space 1 Simple conjunctions. There are only 16 simple conjunctive rules of the form g(x) = xi ∧ xj ∧ xk:
Rule | Counter-example
Always False | 1001
x1 | 1100
x2 | 0100
x3 | 0110
x4 | 0101
x1 ∧ x2 | 1100
x1 ∧ x3 | 0011
x1 ∧ x4 | 0011
x2 ∧ x3 | 0011
x2 ∧ x4 | 0011
x3 ∧ x4 | 1001
x1 ∧ x2 ∧ x3 | 0011
x1 ∧ x2 ∧ x4 | 0011
x1 ∧ x3 ∧ x4 | 0011
x2 ∧ x3 ∧ x4 | 0011
x1 ∧ x2 ∧ x3 ∧ x4 | 0011
64

Exercise: How many simple conjunctions are possible when there are n inputs instead of 4? 65

Is there a consistent hypothesis in this space? 66


No simple conjunction explains the data! Our hypothesis space is too small. 68
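A sketch of this search over the simple-conjunction hypothesis space. The labeled examples here are a hypothetical stand-in (the actual training set from the slides is not reproduced in this transcript), so the point is only the enumerate-and-check pattern:

```python
from itertools import combinations

def consistent_conjunction(examples, n=4):
    """Enumerate the 16 simple conjunctive rules over n=4 variables
    ('Always False' plus every non-empty conjunction of un-negated variables)
    and return the first rule consistent with every (x, y) training pair."""
    rules = ["always_false"]
    for size in range(1, n + 1):
        rules.extend(combinations(range(n), size))

    def predict(rule, x):
        if rule == "always_false":
            return 0
        return int(all(x[i] == 1 for i in rule))

    for rule in rules:
        if all(predict(rule, x) == y for x, y in examples):
            return rule
    return None  # no rule in this small hypothesis space fits the data

# Hypothetical training data: 4 binary features, binary label.
data = [((0, 0, 1, 1), 0), ((1, 1, 0, 0), 1), ((1, 0, 0, 1), 0)]
print(consistent_conjunction(data))  # e.g. (1,), i.e. the rule g(x) = x2
```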

Example Hypothesis space 2 m-of-n rules. Pick a subset with n variables. Y = 1 if at least m of them are 1. Example: If at least 2 of {x1, x3, x4} are 1, then the output is 1. Otherwise, the output is 0. Is there a consistent hypothesis in this space? Try to check if there is one. First, how many m-of-n rules are there for four variables? 69
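A sketch of an m-of-n rule as a predictor; the chosen variables and threshold follow the slide's example, while the function name is made up:

```python
def m_of_n_rule(x, variables=(0, 2, 3), m=2):
    """Output 1 if at least m of the chosen variables are 1, else 0.
    With variables = {x1, x3, x4} (0-based indices 0, 2, 3) and m = 2,
    this is the example rule from the slide."""
    return int(sum(x[i] for i in variables) >= m)

print(m_of_n_rule((1, 0, 1, 0)))  # x1 and x3 are 1 -> at least 2 of the set -> 1
print(m_of_n_rule((0, 1, 0, 1)))  # only x4 from the set is 1 -> 0
```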

Solution: Restrict the search space How do we pick a hypothesis space? Using some prior knowledge (or by guessing) What if the hypothesis space is so small that nothing in it agrees with the data? We need a hypothesis space that is flexible enough 70

Restricting the hypothesis space Our guess of the hypothesis space may be incorrect. General strategy: Pick an expressive hypothesis space capable of expressing concepts. Concept = the target classifier that is hidden from us. Sometimes we may even call it the oracle. Example hypothesis spaces: m-of-n functions, decision trees, linear functions, grammars, multi-layer deep networks, etc. Develop algorithms that find an element of the hypothesis space that fits the data well (or well enough). Hope that it generalizes. 71

Views of learning Learning is the removal of remaining uncertainty If we knew that the unknown function is a simple conjunction, we could use the training data to figure out which one it is Requires guessing a good, small hypothesis class And we could be wrong We could find a consistent hypothesis and still be incorrect with a new example! 72

On using supervised learning ✓ What is our instance space? What are the inputs to the problem? What are the features? ✓ What is our label space? What is the learning task? ✓ What is our hypothesis space? What functions should the learning algorithm search over? 4. What is our learning algorithm? How do we learn from the labeled data? 5. What is our loss function or evaluation metric? What is success? 73