Applying Machine Learning to Real Problems: Why is it Difficult? How Can Research Help?
Olivier Bousquet, Google, Zürich, June 4th, 2007

Outline
1. Introduction
2. Features
3. Minimax Revisited
4. Time Matters

Goal of this talk
- Entertain you (Gilles' request)
- Share experience; propose questions, not results
- Get feedback

Why this talk?
- The good thing about machine learning: many real-world problems can benefit from it almost directly
- So it should be easy to have a positive impact using ML algorithms
- Unfortunately, there are many obstacles when trying to do so
- So, what can be done?

Process Optimization
- Collect data on an industrial process
- Goal is to tune this process (reduce the scrap)
- Easy to put in sensors (collect a lot of possibly irrelevant data)
- Hard to make controlled tests (few examples, or poor exploration of the design space)
- Many practical constraints
- Requires a decision system (not just prediction)

Spam Filtering
- Consider incoming emails and classify them as spam or non-spam
- Not necessarily an absolute notion
- Large collection of instances (e.g. Gmail: tens of millions of users)
- Huge feature space (e.g. all possible n-grams)
- Training and testing time need to be small
- Should handle special cases as well as general ones

Lessons Learned
Real-world problems are often:
- of extreme scale (too few instances, or too many; high dimensional)
- structured (data does not come as vectors of numbers)
- complex (the ML component is only a tiny part of the system)
- ill-defined (the success criterion is not necessarily accuracy)
- mission-critical (require trust, validation, human intervention)
- buggy (data sources are often corrupted)

Understandability is Crucial
Introducing a system that can make decisions in an organization requires the system to be:
- understandable: the relationship between training data and the model/predictions should be clear
- readable/interpretable: the model should be readable and easy to interpret
- diagnosable: if something is wrong, e.g. in the data, it should be visible
- modifiable: it should be possible to modify the system (locally or globally) in a predictable way
- traceable: the decisions (e.g. for special cases) should be explainable
- predictable: the evolution over time should be explainable

What Does Matter?
- Experts should focus on their expertise and speak their own language
- No hidden assumptions, no meaningless parameters
- Take resource constraints into account
- Accuracy is good, understandability is better: understanding the behaviour of a system is more useful than being able to predict it

So, what would be helpful?
- Flexible ways to incorporate knowledge/expertise
  - Provide tools that allow prior knowledge to be formulated in a natural way
  - Look for other types of prior assumptions that occur in various problems (e.g. manifold structure, clusteredness, analogy, ...)
- Ability to understand what is found by the algorithm (need a language to interact with experts)
  - Investigate how to improve understandability (simpler models; separate models and language for interaction; ...)
  - Improve interaction (understand the user's intent)
- Computationally efficient algorithms
  - Scalability, anytime behaviour
  - Incorporate time complexity in the theoretical analysis (trade complexity for accuracy)

2. Features

Data and Features Matter More than Algorithms
- Statement: the time spent cleaning the data and engineering features may yield much larger improvements than the time spent fine-tuning the algorithm
- Example: spam filtering using the content of the message (humans are very good at it, but it would take a lot of data to learn this from scratch), versus using the fact that the sender IP is bad (this would filter 90% of the spam with no learning at all, just a lookup table)
- Issue: the choice and construction of features is an engineering problem and requires expertise

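The IP-reputation point above can be sketched in a few lines. The blocklist and the sample messages below are made-up illustration data, not anything from the talk:

```python
# A plain lookup table on one strong feature (sender IP reputation)
# can filter most spam with no learning at all.

BAD_IPS = {"203.0.113.7", "198.51.100.42"}  # hypothetical known-bad senders

def filter_by_ip(sender_ip: str) -> bool:
    """Return True if the message should be treated as spam."""
    return sender_ip in BAD_IPS

messages = [
    {"ip": "203.0.113.7", "text": "free printr cartriges!!!"},
    {"ip": "192.0.2.1",   "text": "meeting moved to 3pm"},
]
labels = [filter_by_ip(m["ip"]) for m in messages]
print(labels)  # [True, False]
```

No model, no training set: the engineering effort went entirely into finding (and maintaining) the right feature.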
Key question
- Instead of suggesting features, we can at least provide a way to assess their quality
- Given a feature X and a response Y, the question is: how are the two quantities X and Y related?
- Ideally: causality (an active research area); otherwise: correlation

Possible Approach
How can we quantify the relationship between two variables X and Y from a sample?
- First idea: estimate a quantity like the mutual information I(X; Y). However, this does not take the structure into account (two close X values corresponding to two close Y values).
- Second idea: consider inf_f E[l(f(X), Y)]. However, this works only if the class of functions is restricted.
- Third idea: choose a class F and consider inf_{f in F} E[l(f(X), Y)]. However, this does not take the sample size into account.
- Fourth idea: choose an algorithm producing f_n and consider E_n[E[l(f_n(X), Y)]]. Of course this highly depends on the algorithm, but this is exactly where we can specify prior assumptions.

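As a toy illustration of the fourth idea (scoring a feature by the held-out risk of a fixed algorithm), here is a sketch using 1-nearest-neighbour regression with squared loss. The choice of algorithm, the normalization by Var(Y), and the synthetic data are my own illustrative assumptions, not the speaker's:

```python
import numpy as np

rng = np.random.default_rng(0)

def dependence_score(x, y, train_frac=0.7):
    """Score how predictable Y is from X via the test risk of a fixed
    algorithm (1-nearest-neighbour regression, squared loss).
    Returns 1 - MSE/Var(Y): near 1 means a strong relationship,
    near 0 (or below) means none was detected."""
    idx = rng.permutation(len(x))
    cut = int(train_frac * len(x))
    tr, te = idx[:cut], idx[cut:]
    # predict each test point with the label of its closest training point
    nearest = np.abs(x[te][:, None] - x[tr][None, :]).argmin(axis=1)
    mse = np.mean((y[tr][nearest] - y[te]) ** 2)
    return 1.0 - mse / np.var(y[te])

x = rng.uniform(0, 1, 500)
y_related = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=500)
y_unrelated = rng.normal(size=500)

print(round(dependence_score(x, y_related), 2))    # high, close to 1
print(round(dependence_score(x, y_unrelated), 2))  # low, around 0 or below
```

Swapping in a different algorithm changes the score: that dependence on the algorithm is exactly where the prior assumption enters.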
Why look at subsets of features?
- Simplicity: quickly identify simple relationships
- Interpretability: combinations of few features are easier to interpret
- Exploration, correction of obvious mistakes, visualization
- Understanding correlations and, further, causality

3. Minimax Revisited

Priors
Algorithm design is composed of two steps:
- Choosing a preference. This first step is based on knowledge of the problem; this is where guidance (but no theory) is needed.
- Exploiting it for inference. This second step can possibly be formalized (optimality with respect to the assumptions). The main issue is computational cost.

Requirements
- Facing a problem, an expert should focus on his area of expertise
- Parameters should make sense or be self-tuned
- Choosing an algorithm should not be the expert's task (or the assumptions encoded into the algorithms should be made clear)

Key question
- We choose some objective measure of success and have some prior preference or expectation (e.g. we consider it preferable to use linear functions)
- Given this objective, which algorithm gets closest to it in all circumstances?
- Theory cannot tell us what the right assumption is, but it can tell us how to best exploit the assumptions

The Bayesian Way
- Assume something about how the data is generated
- Consider an algorithm specifically tuned to this property
- Prove that under this assumption the algorithm does well
- Most results go in this direction (sometimes in a subtle way): Bayesian algorithms, and most minimax results, which are of the form

    inf_{g_n} sup_{P in P} ( L(g_n) - inf_g L(g) )

- This seems reasonable and is useful for understanding, but it does not provide guarantees

The Worst Case Way
- Assume nothing about the data (distribution-free)
- Restrict your objectives
- Derive an algorithm that reaches this objective no matter how the data is generated:

    inf_{g_n} sup_P ( L(g_n) - inf_{g in G} L(g) )

- This gives guarantees

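For a finite class G, the kind of distribution-free guarantee meant here can be computed explicitly from Hoeffding's inequality plus a union bound. This is standard textbook material, not a formula from the slides:

```python
import math

def uniform_deviation_bound(class_size, n, delta=0.05):
    """With probability at least 1 - delta over n i.i.d. examples,
    simultaneously for every g in a finite class G of `class_size`
    classifiers, |L(g) - L_hat(g)| <= this value, whatever the
    distribution P (Hoeffding's inequality + union bound over G)."""
    return math.sqrt(math.log(2 * class_size / delta) / (2 * n))

# e.g. 1000 candidate classifiers, 10 000 examples, 95% confidence
print(round(uniform_deviation_bound(1000, 10_000), 3))  # 0.023
```

The bound holds no matter how the data is generated; the price is that the objective is restricted to competing with the best member of G.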
Going Further
- A prior preference can be more than a simple restriction
- Given a preference, what is the algorithm that does the best job?
- p(k) specifies how much regret one accepts when G_k turns out to be the best class:

    inf_{g_n} sup_P max_k ( L(g_n) - inf_{g in G_k} L(g) - p(k) )

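One minimal way to act on such a preference p(k) over nested classes is penalized empirical risk minimization. In the sketch below the classes G_k are polynomials of degree k; the data, the penalty shape, and its scale 0.05*sqrt(k+1) are illustrative assumptions, not choices from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-1, 1, n)
y = 1 + 2 * x - 1.5 * x**2 + rng.normal(0, 0.2, n)  # true degree is 2

def penalized_risk(k, pen_scale=0.05):
    """Empirical risk of the best degree-k polynomial plus the
    preference penalty p(k) = pen_scale * sqrt(k + 1)."""
    coefs = np.polyfit(x, y, k)
    mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    return mse + pen_scale * np.sqrt(k + 1)

best_k = min(range(6), key=penalized_risk)
print(best_k)  # 2, the true quadratic degree
```

Larger degrees fit the sample slightly better, but the penalty encodes the preference for simpler classes and the selection lands on the smallest class that explains the data.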
4. Time Matters

What are the constraints?
Real-world problems are resource-constrained (computation, memory, and data). Approaches to modelling this:
- Asymptotic results: no constraints
- PAC learning: polynomial time constraint (in n and 1/ε, 1/δ) to reach accuracy ε with probability 1 − δ, assuming the Bayes classifier is in the class
- Convergence rates: no computational constraint; best accuracy with a constrained sample size (data)
- Can we go further?

Key Question
For practical purposes, we need to answer the following question: given these computational resources, which algorithm gets closest to the objective under all circumstances?
- The question is not what accuracy you can reach with a given number of examples
- But rather what accuracy you can reach with a given set of resources (computation time/memory)

Possible Approach
- Decompose the error into three terms: approximation, estimation, optimization
- Assume an infinite supply of examples, but limited computation time
- An algorithm may choose to request more data or to further process the data it already has
- The goal is to be close to the Bayes classifier, not to the empirical minimizer

Formalization
With f* the Bayes classifier, f*_F the best function in class F, f_n the empirical minimizer, and f~_n the approximate solution found with optimization tolerance ρ:

    E(f~_n) - E(f*) = ( E(f*_F) - E(f*) )   (approximation)
                    + ( E(f_n) - E(f*_F) )  (estimation)
                    + ( E(f~_n) - E(f_n) )  (optimization)

    min_{F, ρ, n}  E_app + E_est + E_opt   subject to   T(F, ρ, n) <= T_max

Application
- Batch gradient: iteration cost N·d; iterations to reach optimization error ρ: O(log(1/ρ)); estimation error d/n; hence T = O( (d²/ε) log(1/ε) )
- Stochastic gradient: iteration cost d; iterations to reach optimization error ρ: O(1/ρ); hence T = O( d/ε² )
- Refinements depend on the loss function, noise conditions, ...

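The tradeoff can be seen empirically: under a fixed budget of single-example gradient evaluations, stochastic gradient typically ends much closer to the noise floor than batch gradient, even though each of its steps is far cruder. The least-squares instance and the step sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

budget = 20 * n  # total single-example gradient evaluations allowed

def batch_gd(lr=0.1):
    w = np.zeros(d)
    for _ in range(budget // n):      # one full pass costs n evaluations
        w -= lr * X.T @ (X @ w - y) / n
    return w

def sgd(lr=0.02):
    w = np.zeros(d)
    for t in range(budget):           # one step costs 1 evaluation
        i = t % n
        w -= lr / (1 + 1e-4 * t) * (X[i] @ w - y[i]) * X[i]
    return w

mse = lambda w: float(np.mean((X @ w - y) ** 2))
print("batch:", round(mse(batch_gd()), 3))
print("sgd:  ", round(mse(sgd()), 3))  # much closer to the 0.01 noise floor
```

With the same budget, batch gradient affords only 20 full passes, while stochastic gradient takes 100 000 cheap steps: trading optimization precision per step for more steps pays off, which is the point of the complexity comparison above.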
Take Home Messages
- Make assumptions explicit when estimating the relationship between features and response
- Assumptions should concern how we evaluate, not how reality is
- Once the measure is clear: what is the best algorithm (independently of how reality is)?
- Furthermore: what is the best algorithm given resource constraints?
More information1. Lecture notes on bipartite matching February 4th,
1. Lecture notes on bipartite matching February 4th, 2015 6 1.1.1 Hall s Theorem Hall s theorem gives a necessary and sufficient condition for a bipartite graph to have a matching which saturates (or matches)
More informationA Brief Look at Optimization
A Brief Look at Optimization CSC 412/2506 Tutorial David Madras January 18, 2018 Slides adapted from last year s version Overview Introduction Classes of optimization problems Linear programming Steepest
More informationDetecting Network Intrusions
Detecting Network Intrusions Naveen Krishnamurthi, Kevin Miller Stanford University, Computer Science {naveenk1, kmiller4}@stanford.edu Abstract The purpose of this project is to create a predictive model
More informationMissing Data. Where did it go?
Missing Data Where did it go? 1 Learning Objectives High-level discussion of some techniques Identify type of missingness Single vs Multiple Imputation My favourite technique 2 Problem Uh data are missing
More informationDeep Learning. Architecture Design for. Sargur N. Srihari
Architecture Design for Deep Learning Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation
More informationCSE 573: Artificial Intelligence Autumn 2010
CSE 573: Artificial Intelligence Autumn 2010 Lecture 16: Machine Learning Topics 12/7/2010 Luke Zettlemoyer Most slides over the course adapted from Dan Klein. 1 Announcements Syllabus revised Machine
More informationSimulation Calibration with Correlated Knowledge-Gradients
Simulation Calibration with Correlated Knowledge-Gradients Peter Frazier Warren Powell Hugo Simão Operations Research & Information Engineering, Cornell University Operations Research & Financial Engineering,
More informationChapter 2 Overview of the Design Methodology
Chapter 2 Overview of the Design Methodology This chapter presents an overview of the design methodology which is developed in this thesis, by identifying global abstraction levels at which a distributed
More informationJoint Entity Resolution
Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute
More informationRobust Signal-Structure Reconstruction
Robust Signal-Structure Reconstruction V. Chetty 1, D. Hayden 2, J. Gonçalves 2, and S. Warnick 1 1 Information and Decision Algorithms Laboratories, Brigham Young University 2 Control Group, Department
More informationSparse & Redundant Representations and Their Applications in Signal and Image Processing
Sparse & Redundant Representations and Their Applications in Signal and Image Processing Sparseland: An Estimation Point of View Michael Elad The Computer Science Department The Technion Israel Institute
More informationUsing Subspace Constraints to Improve Feature Tracking Presented by Bryan Poling. Based on work by Bryan Poling, Gilad Lerman, and Arthur Szlam
Presented by Based on work by, Gilad Lerman, and Arthur Szlam What is Tracking? Broad Definition Tracking, or Object tracking, is a general term for following some thing through multiple frames of a video
More informationInteraction Design. Task Analysis & Modelling
Interaction Design Task Analysis & Modelling This Lecture Conducting task analysis Constructing task models Understanding the shortcomings of task analysis Task Analysis for Interaction Design Find out
More informationCSEP 573: Artificial Intelligence
CSEP 573: Artificial Intelligence Machine Learning: Perceptron Ali Farhadi Many slides over the course adapted from Luke Zettlemoyer and Dan Klein. 1 Generative vs. Discriminative Generative classifiers:
More informationAdministrivia. Added 20 more so far. Software Process. Only one TA so far. CS169 Lecture 2. Start thinking about project proposal
Administrivia Software Process CS169 Lecture 2 Added 20 more so far Will limit enrollment to ~65 students Only one TA so far Start thinking about project proposal Bonus points for proposals that will be
More information1 Counting triangles and cliques
ITCSC-INC Winter School 2015 26 January 2014 notes by Andrej Bogdanov Today we will talk about randomness and some of the surprising roles it plays in the theory of computing and in coding theory. Let
More informationEnsemble methods in machine learning. Example. Neural networks. Neural networks
Ensemble methods in machine learning Bootstrap aggregating (bagging) train an ensemble of models based on randomly resampled versions of the training set, then take a majority vote Example What if you
More informationRandom projection for non-gaussian mixture models
Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,
More informationCombine the PA Algorithm with a Proximal Classifier
Combine the Passive and Aggressive Algorithm with a Proximal Classifier Yuh-Jye Lee Joint work with Y.-C. Tseng Dept. of Computer Science & Information Engineering TaiwanTech. Dept. of Statistics@NCKU
More informationDM6 Support Vector Machines
DM6 Support Vector Machines Outline Large margin linear classifier Linear separable Nonlinear separable Creating nonlinear classifiers: kernel trick Discussion on SVM Conclusion SVM: LARGE MARGIN LINEAR
More informationAlgorithms for convex optimization
Algorithms for convex optimization Michal Kočvara Institute of Information Theory and Automation Academy of Sciences of the Czech Republic and Czech Technical University kocvara@utia.cas.cz http://www.utia.cas.cz/kocvara
More informationFMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu
FMA901F: Machine Learning Lecture 3: Linear Models for Regression Cristian Sminchisescu Machine Learning: Frequentist vs. Bayesian In the frequentist setting, we seek a fixed parameter (vector), with value(s)
More informationLecture 17. Lower bound for variable-length source codes with error. Coding a sequence of symbols: Rates and scheme (Arithmetic code)
Lecture 17 Agenda for the lecture Lower bound for variable-length source codes with error Coding a sequence of symbols: Rates and scheme (Arithmetic code) Introduction to universal codes 17.1 variable-length
More information/ Approximation Algorithms Lecturer: Michael Dinitz Topic: Linear Programming Date: 2/24/15 Scribe: Runze Tang
600.469 / 600.669 Approximation Algorithms Lecturer: Michael Dinitz Topic: Linear Programming Date: 2/24/15 Scribe: Runze Tang 9.1 Linear Programming Suppose we are trying to approximate a minimization
More informationToday. Gradient descent for minimization of functions of real variables. Multi-dimensional scaling. Self-organizing maps
Today Gradient descent for minimization of functions of real variables. Multi-dimensional scaling Self-organizing maps Gradient Descent Derivatives Consider function f(x) : R R. The derivative w.r.t. x
More informationPrinciples of AI Planning. Principles of AI Planning. 7.1 How to obtain a heuristic. 7.2 Relaxed planning tasks. 7.1 How to obtain a heuristic
Principles of AI Planning June 8th, 2010 7. Planning as search: relaxed planning tasks Principles of AI Planning 7. Planning as search: relaxed planning tasks Malte Helmert and Bernhard Nebel 7.1 How to
More informationRobust Regression. Robust Data Mining Techniques By Boonyakorn Jantaranuson
Robust Regression Robust Data Mining Techniques By Boonyakorn Jantaranuson Outline Introduction OLS and important terminology Least Median of Squares (LMedS) M-estimator Penalized least squares What is
More informationCPSC 340: Machine Learning and Data Mining. Probabilistic Classification Fall 2017
CPSC 340: Machine Learning and Data Mining Probabilistic Classification Fall 2017 Admin Assignment 0 is due tonight: you should be almost done. 1 late day to hand it in Monday, 2 late days for Wednesday.
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 22, 2016 Course Information Website: http://www.stat.ucdavis.edu/~chohsieh/teaching/ ECS289G_Fall2016/main.html My office: Mathematical Sciences
More informationIntroduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/18/14
600.363 Introduction to Algorithms / 600.463 Algorithms I Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/18/14 23.1 Introduction We spent last week proving that for certain problems,
More informationRepresentation Learning for Clustering: A Statistical Framework
Representation Learning for Clustering: A Statistical Framework Hassan Ashtiani School of Computer Science University of Waterloo mhzokaei@uwaterloo.ca Shai Ben-David School of Computer Science University
More informationThe Comparative Study of Machine Learning Algorithms in Text Data Classification*
The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification
More informationSafety verification for deep neural networks
Safety verification for deep neural networks Marta Kwiatkowska Department of Computer Science, University of Oxford UC Berkeley, 8 th November 2016 Setting the scene Deep neural networks have achieved
More informationSlides for Data Mining by I. H. Witten and E. Frank
Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-
More informationCPSC 340: Machine Learning and Data Mining. More Regularization Fall 2017
CPSC 340: Machine Learning and Data Mining More Regularization Fall 2017 Assignment 3: Admin Out soon, due Friday of next week. Midterm: You can view your exam during instructor office hours or after class
More informationDiscrete Optimization. Lecture Notes 2
Discrete Optimization. Lecture Notes 2 Disjunctive Constraints Defining variables and formulating linear constraints can be straightforward or more sophisticated, depending on the problem structure. The
More informationBoosting Simple Model Selection Cross Validation Regularization. October 3 rd, 2007 Carlos Guestrin [Schapire, 1989]
Boosting Simple Model Selection Cross Validation Regularization Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 3 rd, 2007 1 Boosting [Schapire, 1989] Idea: given a weak
More informationAnnouncements. CS 188: Artificial Intelligence Spring Generative vs. Discriminative. Classification: Feature Vectors. Project 4: due Friday.
CS 188: Artificial Intelligence Spring 2011 Lecture 21: Perceptrons 4/13/2010 Announcements Project 4: due Friday. Final Contest: up and running! Project 5 out! Pieter Abbeel UC Berkeley Many slides adapted
More informationMetaheuristic Development Methodology. Fall 2009 Instructor: Dr. Masoud Yaghini
Metaheuristic Development Methodology Fall 2009 Instructor: Dr. Masoud Yaghini Phases and Steps Phases and Steps Phase 1: Understanding Problem Step 1: State the Problem Step 2: Review of Existing Solution
More informationCPSC 340: Machine Learning and Data Mining. Feature Selection Fall 2016
CPSC 34: Machine Learning and Data Mining Feature Selection Fall 26 Assignment 3: Admin Solutions will be posted after class Wednesday. Extra office hours Thursday: :3-2 and 4:3-6 in X836. Midterm Friday:
More informationECE521 Lecture 18 Graphical Models Hidden Markov Models
ECE521 Lecture 18 Graphical Models Hidden Markov Models Outline Graphical models Conditional independence Conditional independence after marginalization Sequence models hidden Markov models 2 Graphical
More information1 Achieving IND-CPA security
ISA 562: Information Security, Theory and Practice Lecture 2 1 Achieving IND-CPA security 1.1 Pseudorandom numbers, and stateful encryption As we saw last time, the OTP is perfectly secure, but it forces
More informationChapter S:II. II. Search Space Representation
Chapter S:II II. Search Space Representation Systematic Search Encoding of Problems State-Space Representation Problem-Reduction Representation Choosing a Representation S:II-1 Search Space Representation
More informationLimitations of Algorithmic Solvability In this Chapter we investigate the power of algorithms to solve problems Some can be solved algorithmically and
Computer Language Theory Chapter 4: Decidability 1 Limitations of Algorithmic Solvability In this Chapter we investigate the power of algorithms to solve problems Some can be solved algorithmically and
More informationKernels and Clustering
Kernels and Clustering Robert Platt Northeastern University All slides in this file are adapted from CS188 UC Berkeley Case-Based Learning Non-Separable Data Case-Based Reasoning Classification from similarity
More informationThe Bizarre Truth! Automating the Automation. Complicated & Confusing taxonomy of Model Based Testing approach A CONFORMIQ WHITEPAPER
The Bizarre Truth! Complicated & Confusing taxonomy of Model Based Testing approach A CONFORMIQ WHITEPAPER By Kimmo Nupponen 1 TABLE OF CONTENTS 1. The context Introduction 2. The approach Know the difference
More informationClassification: Linear Discriminant Functions
Classification: Linear Discriminant Functions CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Discriminant functions Linear Discriminant functions
More informationData Mining and Data Warehousing Classification-Lazy Learners
Motivation Data Mining and Data Warehousing Classification-Lazy Learners Lazy Learners are the most intuitive type of learners and are used in many practical scenarios. The reason of their popularity is
More informationWeighted Alternating Least Squares (WALS) for Movie Recommendations) Drew Hodun SCPD. Abstract
Weighted Alternating Least Squares (WALS) for Movie Recommendations) Drew Hodun SCPD Abstract There are two common main approaches to ML recommender systems, feedback-based systems and content-based systems.
More informationThe Encoding Complexity of Network Coding
The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network
More informationStructured Models in. Dan Huttenlocher. June 2010
Structured Models in Computer Vision i Dan Huttenlocher June 2010 Structured Models Problems where output variables are mutually dependent or constrained E.g., spatial or temporal relations Such dependencies
More informationEmpirical risk minimization (ERM) A first model of learning. The excess risk. Getting a uniform guarantee
A first model of learning Let s restrict our attention to binary classification our labels belong to (or ) Empirical risk minimization (ERM) Recall the definitions of risk/empirical risk We observe the
More informationChapter 10. Conclusion Discussion
Chapter 10 Conclusion 10.1 Discussion Question 1: Usually a dynamic system has delays and feedback. Can OMEGA handle systems with infinite delays, and with elastic delays? OMEGA handles those systems with
More informationEfficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 13, NO. 5, SEPTEMBER 2002 1225 Efficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms S. Sathiya Keerthi Abstract This paper
More informationParallel Build Visualization Diagnosing and Troubleshooting Common Pitfalls of Parallel Builds
Parallel Build Visualization Diagnosing and Troubleshooting Common Pitfalls of Parallel Builds John Graham-Cumming Chief Scientist Electric Cloud, Inc. February, 2006 Contents Parallel Build Visualization...
More informationReddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011
Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 1. Introduction Reddit is one of the most popular online social news websites with millions
More informationHigh Performance Computing
The Need for Parallelism High Performance Computing David McCaughan, HPC Analyst SHARCNET, University of Guelph dbm@sharcnet.ca Scientific investigation traditionally takes two forms theoretical empirical
More informationSystem Configuration. Paul Anderson. publications/oslo-2008a-talk.pdf I V E R S I U N T Y T H
E U N I V E R S I System Configuration T H O T Y H F G Paul Anderson E D I N B U R dcspaul@ed.ac.uk http://homepages.inf.ed.ac.uk/dcspaul/ publications/oslo-2008a-talk.pdf System Configuration What is
More informationlow bias high variance high bias low variance error test set training set high low Model Complexity Typical Behaviour Lecture 11:
Lecture 11: Overfitting and Capacity Control high bias low variance Typical Behaviour low bias high variance Sam Roweis error test set training set November 23, 4 low Model Complexity high Generalization,
More informationUsing Arithmetic of Real Numbers to Explore Limits and Continuity
Using Arithmetic of Real Numbers to Explore Limits and Continuity by Maria Terrell Cornell University Problem Let a =.898989... and b =.000000... (a) Find a + b. (b) Use your ideas about how to add a and
More informationConfidence sharing: an economic strategy for efficient information flows in animal groups 1
1 / 35 Confidence sharing: an economic strategy for efficient information flows in animal groups 1 Amos Korman 2 CNRS and University Paris Diderot 1 Appears in PLoS Computational Biology, Oct. 2014 2 Joint
More informationThe Interaction. Using Norman s model. Donald Norman s model of interaction. Human error - slips and mistakes. Seven stages
The Interaction Interaction models Ergonomics Interaction styles Donald Norman s model of interaction Seven stages execution user establishes the goal formulates intention specifies actions at interface
More informationAnnouncements. CS 188: Artificial Intelligence Spring Classification: Feature Vectors. Classification: Weights. Learning: Binary Perceptron
CS 188: Artificial Intelligence Spring 2010 Lecture 24: Perceptrons and More! 4/20/2010 Announcements W7 due Thursday [that s your last written for the semester!] Project 5 out Thursday Contest running
More information/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18
601.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18 22.1 Introduction We spent the last two lectures proving that for certain problems, we can
More informationOO Development and Maintenance Complexity. Daniel M. Berry based on a paper by Eliezer Kantorowitz
OO Development and Maintenance Complexity Daniel M. Berry based on a paper by Eliezer Kantorowitz Traditional Complexity Measures Traditionally, Time Complexity Space Complexity Both use theoretical measures,
More informationTheory and Algorithms Introduction: insertion sort, merge sort
Theory and Algorithms Introduction: insertion sort, merge sort Rafael Ramirez rafael@iua.upf.es Analysis of algorithms The theoretical study of computer-program performance and resource usage. What s also
More informationUser-Centered Design Data Entry
User-Centered Design Data Entry CS 4640 Programming Languages for Web Applications [The Design of Everyday Things, Don Norman, Ch 7] 1 Seven Principles for Making Hard Things Easy 1. Use knowledge in the
More informationA Computational Theory of Clustering
A Computational Theory of Clustering Avrim Blum Carnegie Mellon University Based on work joint with Nina Balcan, Anupam Gupta, and Santosh Vempala Point of this talk A new way to theoretically analyze
More information(Refer Slide Time: 1:27)
Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 1 Introduction to Data Structures and Algorithms Welcome to data
More informationKernel Methods & Support Vector Machines
& Support Vector Machines & Support Vector Machines Arvind Visvanathan CSCE 970 Pattern Recognition 1 & Support Vector Machines Question? Draw a single line to separate two classes? 2 & Support Vector
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 5 Inference
More informationBayesian Methods in Vision: MAP Estimation, MRFs, Optimization
Bayesian Methods in Vision: MAP Estimation, MRFs, Optimization CS 650: Computer Vision Bryan S. Morse Optimization Approaches to Vision / Image Processing Recurring theme: Cast vision problem as an optimization
More informationCMPT 882 Week 3 Summary
CMPT 882 Week 3 Summary! Artificial Neural Networks (ANNs) are networks of interconnected simple units that are based on a greatly simplified model of the brain. ANNs are useful learning tools by being
More informationApplication of Support Vector Machine Algorithm in Spam Filtering
Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification
More information14.1 Encoding for different models of computation
Lecture 14 Decidable languages In the previous lecture we discussed some examples of encoding schemes, through which various objects can be represented by strings over a given alphabet. We will begin this
More informationLecture #3: PageRank Algorithm The Mathematics of Google Search
Lecture #3: PageRank Algorithm The Mathematics of Google Search We live in a computer era. Internet is part of our everyday lives and information is only a click away. Just open your favorite search engine,
More informationChapter 9. Software Testing
Chapter 9. Software Testing Table of Contents Objectives... 1 Introduction to software testing... 1 The testers... 2 The developers... 2 An independent testing team... 2 The customer... 2 Principles of
More informationNaïve Bayes for text classification
Road Map Basic concepts Decision tree induction Evaluation of classifiers Rule induction Classification using association rules Naïve Bayesian classification Naïve Bayes for text classification Support
More informationCS 343: Artificial Intelligence
CS 343: Artificial Intelligence Kernels and Clustering Prof. Scott Niekum The University of Texas at Austin [These slides based on those of Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley.
More information