Learning Programs from Noisy Data

Slide 1: Learning Programs from Noisy Data. Veselin Raychev, Pavol Bielik, Martin Vechev, Andreas Krause (ETH Zurich)

Slides 2-6: Why learn programs from examples? Input/output examples are often easier to provide than a specification (e.g. in FlashFill). We want to learn a function p such that p(x_1) = y_1, p(x_2) = y_2, ... for the given examples. But the user may make a mistake in the examples. The actual goal is to produce the p that the user really wanted and tried to specify. Key problem: synthesis overfits and is not robust to noise.

Slides 7-13: Learning Programs from Data: Defining Dimensions

Dimension                      | Program synthesis (PL) | This paper           | Deep learning (ML)
Handling errors in the dataset | no                     | yes                  | yes
Number of examples             | tens                   | millions             | millions
Learned program complexity     | interesting programs   | interesting programs | simple, but unexplainable functions

This paper bridges a gap between ML and PL and advances both areas: it expands the capabilities of existing synthesizers and reaches new state-of-the-art precision for programming tasks.

Slide 14: In this paper: a general framework that handles errors in the training dataset, learns statistical models on data, and handles synthesis with millions of examples. It is instantiated with two synthesizers that generalize existing works.

Slides 15-16: Contributions (overview diagram):
- Handling noise: the input/output examples may contain incorrect examples.
- New probabilistic models: 1. synthesize a program p; 2. use a probabilistic model parametrized with p.
- Handling large datasets: a representative dataset sampler working in a loop with a program generator.

Slides 17-24: Synthesis with noise: usage model. The synthesizer takes the input/output examples, one of which may be incorrect (e.g. due to a typo), together with a domain-specific language, and produces a program p such that p(x_1) = y_1, p(x_2) = y_2, ... while flagging the example where p disagrees, p(x_i) != y_i. This is a new kind of feedback from the synthesizer: tell the user to remove the suspicious example, or ask for more examples.

Slide 25: Handling noise: problem statement. Let D be a dataset of input/output examples that may contain incorrect examples, and let P be the space of possible programs in the DSL. The naive formulation p_best = argmin_{p in P} errors(D, p) is flawed: a program that makes no errors, e.g. a long chain of if-return statements that hardcodes the inputs/outputs, may be the only solution to the minimization problem. Synthesis must penalize such answers. Our problem formulation adds regularization:

p_best = argmin_{p in P} errors(D, p) + λ·r(p)

where the regularizer r penalizes long programs and the regularization constant λ trades off error rate against program length.
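To make the objective concrete, here is a minimal Python sketch of the regularized score, assuming candidate programs are callables and a hypothetical size(p) returns the instruction count; it illustrates the formulation, not the paper's implementation.

```python
def errors(D, p):
    # number of input/output examples the candidate program gets wrong
    return sum(1 for x, y in D if p(x) != y)

def objective(D, p, size, lam=0.6):
    # errors(D, p) + lambda * r(p): a program that hardcodes every example
    # drives errors to 0 but pays for its length through the regularizer
    return errors(D, p) + lam * size(p)

def best_program(D, candidates, size, lam=0.6):
    # brute-force argmin over an enumerated candidate space (illustration only)
    return min(candidates, key=lambda p: objective(D, p, size, lam))
```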

Slides 26-35: Noisy synthesis using SMT. We solve p_best = argmin_{p in P} errors(D, p) + λ·r(p), where errors(D, p) + λ·r(p) is the total solution cost and r(p) is the number of instructions. Encoding: for each example (x_i, y_i), err_i = if p(x_i) = y_i then 0 else 1, and errors = err_1 + err_2 + err_3; the program p ranges over P_r, the programs with r instructions. The resulting formula Ψ is given to an SMT solver. We then ask a number of SMT queries in increasing order of solution cost; e.g. for λ = 0.6 a candidate with r instructions and e errors has cost e + 0.6·r. (The slides show a table of queries over r and the number of errors: the cheaper queries return UNSAT until the first SAT.) In the example, the best program has two instructions and makes one error.
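A sketch of the query schedule, assuming a callback check_sat(r, e) that builds the formula Ψ for programs with exactly r instructions and at most e errors and returns a model or None; the callback name and the bounds are hypothetical stand-ins for the Z3 encoding.

```python
def synthesize_with_noise(check_sat, max_instr, max_errors, lam=0.6):
    # enumerate (instructions, errors) bounds by increasing total cost e + lam*r
    schedule = sorted(
        (e + lam * r, r, e)
        for r in range(1, max_instr + 1)
        for e in range(max_errors + 1)
    )
    for cost, r, e in schedule:
        model = check_sat(r, e)  # SMT query: r instructions, at most e errors?
        if model is not None:
            # every cheaper bound was UNSAT, so this model minimizes the cost
            return model, cost
    return None, None
```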

Slide 36: Noisy synthesizer: example. We take an actual synthesizer and show that we can make it handle noise.

Slides 37-40: Implementation: BitSyn. It synthesizes short, loop-free BitStream programs using Z3, similar to Jha et al. [ICSE'10] and Gulwani et al. [PLDI'11]. Example synthesized program:

function check_if_power_of_2(int32 x) {
  var o = sub(x, 1)
  return bitwise_and(x, o)
}

Question: how well does our synthesizer discover noise in programs from prior work? (Plot annotation: the best area to be in; we empirically pick λ there.)
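For reference, the bit trick behind the example program, in plain Python: subtracting 1 from a power of two flips its single set bit, so the bitwise AND is zero exactly for powers of two.

```python
def check_if_power_of_2(x: int) -> bool:
    # x & (x - 1) clears the lowest set bit of x; the result is 0
    # exactly when x has a single set bit, i.e. x is a power of two
    return x > 0 and (x & (x - 1)) == 0

assert check_if_power_of_2(8) and not check_if_power_of_2(12)
```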

Slide 41: So far, handling noise: a problem statement with regularization, a synthesis procedure using SMT, and one implemented synthesizer. Handling noise enables us to solve new classes of problems beyond normal synthesis.

Slides 42-43: Contributions (recap of the overview diagram; next up: handling large datasets).

Slides 44-48: Fundamental problem: a large number of examples. We want p_best = argmin_{p in P} cost(D, p) where D contains millions of input/output examples, but computing cost(D, p) takes O(|D|) per candidate, which makes synthesis practically intractable. Key idea: iterative synthesis on a fraction of the examples.

Slides 49-50: Our solution: two components. (1) Program generator: a synthesizer for a small number of examples that, given a dataset d, finds the best program p_best = argmin_{p in P} cost(d, p). (2) Dataset sampler: picks a dataset d ⊆ D. We introduce a representative dataset sampler, which generalizes a user providing input/output examples.

Slides 51-58: In a loop. Start with a small random sample d ⊆ D, then iteratively generate programs and samples: the program generator produces p_1 on d; the representative dataset sampler picks a new d given p_1; the generator then produces p_2; the sampler re-samples given p_1, p_2; and so on, until the loop returns p_best. The algorithm generalizes synthesis-by-examples techniques; a sketch follows below.
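A minimal sketch of the loop, assuming a synthesize callback (the program generator on a small dataset) and a sample callback (the representative dataset sampler); the names and the fixed round count are assumptions for illustration.

```python
import random

def iterative_synthesis(D, synthesize, sample, init_size=100, rounds=10):
    d = random.sample(D, init_size)   # start with a small random sample d of D
    programs = []
    for _ in range(rounds):
        p = synthesize(d)             # best program on the current small dataset
        programs.append(p)
        d = sample(D, programs)       # re-pick d so p_1..p_n behave as on D
    return programs[-1]               # p_best
```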

Slides 59-64: Representative dataset sampler. Idea: pick a small dataset d on which the already generated programs p_1, ..., p_n behave like they do on the full dataset D:

d = argmin_{d ⊆ D} max_{i in 1..n} |cost(d, p_i) - cost(D, p_i)|

(The slides compare the costs of p_1 and p_2 on the small dataset d against their costs on the full dataset D.) Theorem: this sampler shrinks the candidate program search space. In the evaluation, it gives a significant speedup of synthesis.
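A randomized approximation of the sampler's argmin, for illustration: it scores candidate subsets by the worst-case cost gap and keeps the best of a number of random draws. The paper's actual sampler comes with guarantees; this sketch does not.

```python
import random

def representative_sample(D, programs, cost, size, trials=1000):
    def gap(d):
        # how differently the generated programs behave on d versus the full D
        return max(abs(cost(d, p) - cost(D, p)) for p in programs)
    subsets = (random.sample(D, size) for _ in range(trials))
    return min(subsets, key=gap)
```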

Slide 65: So far, handling large datasets: an iterative combination of synthesis and sampling (representative dataset sampler + program generator), giving a new way to perform approximate empirical risk minimization, with guarantees in the paper.

Slides 66-67: Contributions (recap; next up: new probabilistic models).

Slides 68-71: Statistical programming tools. A new breed of tools learns from large existing codebases (e.g. Big Code) to make predictions about programs: 1. train a machine learning model; 2. make predictions with the model. Problem: the model is hard-coded, leading to low precision.

Slides 72-81: Existing machine learning models. They essentially remember a mapping from a context in the training data to a prediction (with probabilities). Hindle et al. [ICSE'12] learn a mapping over tokens, e.g. "+ name." → slice: the model will predict slice whenever it sees it after "+ name.". This model comes from NLP. Raychev et al. [PLDI'14] learn a mapping over API calls, e.g. charAt → slice: the model will predict slice when it sees it after charAt. It relies on static analysis. A toy version of such a model follows below.
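A toy version of such a hard-coded model, with the context fixed to the last two tokens in the style of Hindle et al.; the data layout is an assumption made for illustration.

```python
from collections import Counter, defaultdict

model = defaultdict(Counter)

def train(token_streams):
    # remember how often each next token follows each two-token context
    for tokens in token_streams:
        for a, b, nxt in zip(tokens, tokens[1:], tokens[2:]):
            model[(a, b)][nxt] += 1

def predict(a, b):
    # most frequent continuation of the context, if seen during training
    seen = model[(a, b)]
    return seen.most_common(1)[0][0] if seen else None

train([["+", "name", ".", "slice"]])
assert predict("name", ".") == "slice"
```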

Slides 82-85: Problem of existing systems: precision. They rarely predict the next statement: Hindle et al. [ICSE'12] has very low precision, and Raychev et al. [PLDI'14] has low precision for JavaScript. Core problem: existing machine learning models are limited and not expressive enough.

Slides 86-90: Key idea: second-order learning. Learn a program that parametrizes a probabilistic model that makes predictions: 1. synthesize a program describing a model; 2. train the model, i.e. learn the mapping; 3. make predictions with this model. Prior models are described by simple hard-coded programs; our approach learns a better program.

Slides 91-97: Training and evaluation. Training example: for code completing to slice (the output), compute the context with program p on the input code and learn a mapping from that context (e.g. the token toUpperCase) to slice. Evaluation example: compute the context with program p on unseen code and predict the completion slice. A sketch follows below.
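The second-order variant in the same toy style: the context key is no longer hard-coded but computed by the synthesized conditioning program p. Here p is any function from the input code to a hashable context, a stand-in for programs in the paper's conditioning DSL.

```python
from collections import Counter, defaultdict

def train(examples, p):
    # examples: (code, completion) pairs; p maps code to a context key
    table = defaultdict(Counter)
    for code, completion in examples:
        table[p(code)][completion] += 1
    return table

def predict(table, p, code):
    # most probable completion for the context that p extracts from the code
    seen = table[p(code)]
    return seen.most_common(1)[0][0] if seen else None
```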

Slides 98-99: Observation: synthesis of the probabilistic model can be done with the same optimization problem as before, p_best = argmin_{p in P} errors(D, p) + λ·r(p), i.e. minimizing cost(D, p). Here D is the evaluation data of input/output examples, λ the regularization constant, and the regularizer r penalizes long programs.

Slide 100: So far: handling noise, synthesizing a model, and the representative dataset sampler; these techniques are generally applicable to program synthesis. Next, an application for Big Code called DeepSyn.

Slides 101-102: DeepSyn: training. DeepSyn is trained on JavaScript files from GitHub. The program synthesizer (the representative dataset sampler in a loop with the program generator) runs over D to find the best program p; the model is then trained on the full data D with p.

Slides 103-105: DeepSyn: evaluation on the API completion task, using evaluation files not used in training or synthesis:

Conditioning program p                                    | Accuracy
Last two tokens, Hindle et al. [ICSE'12]                  | 22.2%
Last two APIs per object, Raychev et al. [PLDI'14]        | 30.4%
This work: program synthesis with noise                   | 46.3%
This work: program synthesis with noise + dataset sampler | 50.4%

We can explain the best program: it looks at the APIs preceding the completion position and at the tokens prior to those APIs.

Slides 106-107: Q&A summary. Handling noise: extending synthesizers to handle incorrect input/output examples. Synthesis of probabilistic models: second-order learning, i.e. 1. synthesize p, 2. use a probabilistic model parametrized with p. Handling large datasets: scalability via the representative dataset sampler and the program generator. Together these bridge a gap between ML and PL and advance both areas.

Slide 108: What did we synthesize? The best conditioning program:

Left PrevActor WriteAction WriteValue PrevActor WriteAction PrevLeaf WriteValue PrevLeaf WriteValue
