Differentiable Data Structures (and POMDPs)
Yarin Gal & Rowan McAllister
February 11, 2016
Many thanks to Edward Grefenstette for graphics material; other sources include Wikimedia, licensed under CC BY-SA 3.0.
Motivation
Data structures are abstract data types that lie at the core of Computer Science:
- e.g. stacks, queues, heaps, binary trees, DAGs, etc.
- used in sorting algorithms, cycle detection, and many more
We'd like to teach computers to use data structures in solving tasks. For many tasks a data structure is a sensible choice, and allows for flexible models.
Motivation
Many are working on these ideas at DeepMind and Facebook (Neural Turing Machines, Memory Networks, etc.). Featured in the Future of Life Institute's "Top A.I. Breakthroughs of 2015".
Outline
- Motivation
- Data Structures Recap
- Differentiable Data Structures
- History
- Applications in Language Processing
- Future?
Data structures recap: Stack
A simple stack example (r is the top of the stack, the "peek").
We push elements to the top of the stack; we pop (u elements) from the top of the stack.
Push v1 → Push v2 → Push v3
Data structures recap: Stack
Push v1 → Push v2 → Pop (u = 1)
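The discrete push/pop operations can be sketched directly with a Python list (a toy illustration, not part of the original slides):

```python
# A classic (discrete) stack: push to the top, pop from the top.
stack = []

stack.append("v1")    # Push v1
stack.append("v2")    # Push v2
popped = stack.pop()  # Pop (u = 1): removes the top element, "v2"

r = stack[-1]         # peek r: the top of the stack, now "v1"
```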
Data structures recap: Queue
A simple queue example (r is the bottom of the queue, the "peek").
We enqueue elements at the top of the queue; we dequeue (u elements) from the bottom of the queue.
Enqueue v1 → Enqueue v2 → Enqueue v3
Data structures recap: Queue
Enqueue v1 → Enqueue v2 → Dequeue (u = 1)
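Likewise, the discrete queue can be sketched with `collections.deque` (a toy illustration, not part of the original slides):

```python
from collections import deque

# A classic (discrete) queue: enqueue at the top (back),
# dequeue from the bottom (front).
queue = deque()

queue.append("v1")          # Enqueue v1
queue.append("v2")          # Enqueue v2
dequeued = queue.popleft()  # Dequeue (u = 1): removes the bottom element, "v1"

r = queue[0]                # peek r: the bottom of the queue, now "v2"
```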
Outline
- Motivation
- Data Structures Recap
- Differentiable Data Structures
- History
- Applications in Language Processing
- Future?
Countless approaches
In the past 2 years:
- Neural Turing Machines (Graves et al., arXiv, 2014)
- Memory Networks (Weston et al., arXiv, 2014)
- End-To-End Memory Networks (Sukhbaatar et al., NIPS, 2015)
- Weakly Supervised Memory Networks (Sukhbaatar et al., 2015)
- Learning to Transduce with Unbounded Memory (Grefenstette et al., NIPS, 2015)
- Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets (Joulin and Mikolov, NIPS, 2015)
- Transition-Based Dependency Parsing with Stack Long Short-Term Memory (Dyer et al., ACL, 2015)
- Neural Programmer-Interpreters (Reed and de Freitas, ICLR, 2016)
- Neural Random-Access Machines (Kurach et al., ICLR, 2016)
- Neural GPUs Learn Algorithms (Kaiser and Sutskever, ICLR, 2016)
Continuous stack
Back to the previous stack example: let's make our stack continuous...¹
Let's push "half a v2" (d = 0.5)... what does that mean?
Define the stack peek r to be a mixture of the top elements whose strengths sum to 1.0.
Push v1 → Push v2 → Pop v2 → Push half a v2
¹ Learning to Transduce with Unbounded Memory, Grefenstette et al., NIPS, 2015
Continuous stack
Define stack pop (with weight u) to remove the top u elements (where u can be a fraction!). Example:
Push v1 (d = 0.8) → Pop (u = 0.1) → Push v2 (d = 0.5) → Pop (u = 0.9) → Push v3 (d = 0.9)
Continuous stack And in equations: 9 of 27
Continuous queue
Similarly, for the previous queue example, make our queue continuous...
Define enqueue (with weight d) to add an element at the top of the queue, and dequeue (with weight u) to remove the bottom u elements. Example (and exercise 1):
Enqueue v1 (d = 0.8) → Dequeue (u = 0.1) → Enqueue v2 (d = 0.5) → Dequeue (u = 0.8) → Enqueue v3 (d = 0.9)
Continuous queue (exercise)
Reminder: the stack's equations update the values V_t and strengths s_t, and read r_t from the top down.
Exercise 2: what's the equivalent for a continuous queue?
Continuous queue (solution)
The queue's equations (removal and read now run from the bottom of the queue upwards):

V_t[i] = V_{t-1}[i] for 1 <= i < t,    V_t[t] = v_t
s_t[i] = max(0, s_{t-1}[i] - max(0, u_t - sum_{j=1}^{i-1} s_{t-1}[j])) for 1 <= i < t,    s_t[t] = d_t
r_t = sum_{i=1}^{t} min(s_t[i], max(0, 1 - sum_{j=1}^{i-1} s_t[j])) * V_t[i]
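The continuous-queue semantics can also be sketched in plain Python (my own toy code with scalar values; names are illustrative, not from the paper). Dequeue and read work from the bottom of the queue upwards:

```python
# Continuous queue: each cell i holds a value V[i] with strength s[i] in [0, 1].

def dequeue(s, u):
    """Remove total strength u, starting from the bottom (front) of the queue."""
    s = s[:]
    for i in range(len(s)):
        removed = min(s[i], u)
        s[i] -= removed
        u -= removed
        if u <= 0.0:
            break
    return s

def enqueue(V, s, v, d):
    """Append value v with strength d at the top (back)."""
    return V + [v], s + [d]

def read(V, s):
    """Peek r: mixture of the bottom-most elements, total weight capped at 1."""
    r, budget = 0.0, 1.0
    for v, strength in zip(V, s):
        w = min(strength, budget)
        r += w * v
        budget -= w
    return r

# Replay the example from the slide with v1 = 1.0, v2 = 2.0, v3 = 3.0:
V, s = enqueue([], [], 1.0, 0.8)  # Enqueue v1 (d = 0.8)  -> s = [0.8]
s = dequeue(s, 0.1)               # Dequeue (u = 0.1)     -> s = [0.7]
V, s = enqueue(V, s, 2.0, 0.5)    # Enqueue v2 (d = 0.5)  -> s = [0.7, 0.5]
s = dequeue(s, 0.8)               # Dequeue (u = 0.8)     -> s = [0.0, 0.4]
V, s = enqueue(V, s, 3.0, 0.9)    # Enqueue v3 (d = 0.9)  -> s = [0.0, 0.4, 0.9]
r = read(V, s)                    # r = 0.0*v1 + 0.4*v2 + 0.6*v3 = 2.6
```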
Data structure as a recurrent unit
The equations can be seen as a single time-step update of a recurrent stack/queue unit. The unit takes an input and the previous state, and emits an output and the next state.
[Figure: the Neural Stack as a recurrent unit. Inputs: previous values V_{t-1}, previous strengths s_{t-1}, push strength d_t, pop strength u_t, value v_t. Outputs: next values V_t, next strengths s_t, output (read) r_t.]
Controller
Grefenstette et al. (2015) use an RNN to control the data structure.
[Figure: a single time step of the combined RNN and stack unit. The hybrid unit's input (i_t, r_{t-1}) feeds the RNN; the RNN state h_t produces the output o_t and the stack controls d_t, u_t, v_t; the Neural Stack maps (V_{t-1}, s_{t-1}, d_t, u_t, v_t) to (V_t, s_t, r_t), and r_t is fed back at the next step.]
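One combined time step can be sketched as follows (a schematic with random, untrained weights and made-up layer names; the actual controller in Grefenstette et al. (2015) is an LSTM, and everything here would be trained end-to-end by backpropagation):

```python
import numpy as np

rng = np.random.default_rng(0)
M, H = 4, 8  # value/input dimension M, controller hidden dimension H

# Illustrative controller parameters (random stand-ins for trained weights).
W_in = rng.normal(size=(H, 2 * M)) * 0.1  # consumes [i_t; r_{t-1}]
W_h = rng.normal(size=(H, H)) * 0.1       # recurrence over h_{t-1}
w_d = rng.normal(size=H) * 0.1            # -> push strength d_t
w_u = rng.normal(size=H) * 0.1            # -> pop strength u_t
W_v = rng.normal(size=(M, H)) * 0.1       # -> pushed value v_t
W_o = rng.normal(size=(M, H)) * 0.1       # -> emitted output o_t

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stack_step(V, s, v, d, u):
    """Neural stack: pop with weight u, push (v, d), then read r."""
    s = list(s)
    for i in reversed(range(len(s))):       # pop from the top down
        removed = min(s[i], u)
        s[i] -= removed
        u -= removed
    V, s = V + [v], s + [d]                 # push
    r, budget = np.zeros(M), 1.0
    for value, strength in zip(reversed(V), reversed(s)):
        w = min(strength, budget)           # read: mixture of top elements
        r = r + w * value
        budget -= w
    return V, s, r

def controller_step(i_t, r_prev, h_prev, V, s):
    """RNN controller: input and previous read in; output and stack controls out."""
    h_t = np.tanh(W_in @ np.concatenate([i_t, r_prev]) + W_h @ h_prev)
    d_t = sigmoid(w_d @ h_t)   # push strength in (0, 1)
    u_t = sigmoid(w_u @ h_t)   # pop strength in (0, 1)
    v_t = np.tanh(W_v @ h_t)   # value pushed onto the stack
    o_t = np.tanh(W_o @ h_t)   # network output
    V, s, r_t = stack_step(V, s, v_t, d_t, u_t)
    return o_t, r_t, h_t, V, s

# Run a few steps on random inputs.
V, s, h, r = [], [], np.zeros(H), np.zeros(M)
for t in range(3):
    o, r, h, V, s = controller_step(rng.normal(size=M), r, h, V, s)
```

Because d_t and u_t come from sigmoids, the push/pop amounts are smooth functions of the controller's parameters, which is what makes the whole unit trainable by gradient descent.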
Insights
Some insights:
- The stack adds no additional parameters
- Increased space complexity (a naive implementation is O(MT^2), with M the dimension of v_i and T the number of time steps)
- Space complexity can be reduced to O(MT) by working in place (personal communication)
Insights
The stack's gradients can vanish quickly unless initialised carefully.
Evaluation
The model was evaluated on:
- Sequence copying: ab → ab ✓, ab → a ✗
- Sequence reversal: ab → ba ✓, ab → a ✗
- Learning a grammar: svo → sov ✓, svo → ovs ✗
Most of the papers above give these anecdotal toy examples.
"Top A.I. Breakthroughs of 2015"? Let's learn some history.
Outline
- Motivation
- Data Structures Recap
- Differentiable Data Structures
- History
- Applications in Language Processing
- Future?
History
The idea goes back as far as 1989 (as far as I could trace):
- Higher Order Recurrent Networks and Grammatical Inference (Giles et al., NIPS, 1989). From the abstract: "A higher order single layer recursive network learns to simulate a deterministic finite state machine. When a [..] neural net state machine is connected through a common error term to an external analog stack memory, the combination can be interpreted as a neural net pushdown automata. [It is] given the primitives push and pop, and is able to read the top of the stack."
- Connectionist Pushdown Automata that Learn Context-free Grammars (Sun et al., IJCNN, 1990)
- Neural Networks with External Memory Stack that Learn Context-Free Grammars from Examples (Sun et al., CISP, 1990)
- Using Prior Knowledge in an NNPDA to Learn Context-Free Languages (Das et al., NIPS, 1992)
- The Neural Network Pushdown Automaton: Model, Stack and Learning Simulations (Sun et al., 1993)
... mostly showing (empirically) that networks can learn finite state automata.
History: NNPDA
A concrete example, the NNPDA (1992):²
[Figure: NNPDA architecture. State, input, and read neurons feed higher-order weights; the network outputs the next state and an action (push, pop, or no-op) on an external stack holding alphabet symbols, whose top is read back at the next step.]
² Cited by Grefenstette et al. (2015)
History: NNPDA
Experimental evaluation on tasks:
- Balanced parenthesis grammar: (())() ✓, ()( ✗
- Learning a grammar (1^n 0^n): 111000 ✓, 1110 ✗
- Sequence reversal: ab → ba ✓, ab → a ✗
History: NNPDA
"Top A.I. Breakthroughs of 1989"?
[Figure: NNPDA] [Figure: Neural stack]
Same ideas (although with a different motivation), similar structure, even the same evaluations.
History: NNPDA
But the NNPDA had limitations:
- Had to approximate derivatives through the stack
- Vanishing gradients
- Only keeps input symbols on the stack
- Coupled with the RNN controller
- Was built in the 90s...
Modern research:
- These issues were addressed in Grefenstette et al. (2015)
- Uses advances from recent years (stochastic optimisation, data sub-sampling, adaptive learning rates)
- More computational resources...
... and can use these models from the 90s in real-world applications.
Outline
- Motivation
- Data Structures Recap
- Differentiable Data Structures
- History
- Applications in Language Processing
- Future?
Transition-Based Dependency Parsing
Dependency grammar: a syntactic structure in which words are connected to each other by directed links. There are various representations of dependency grammars.
Transition-based dependency parsing: read words sequentially from a buffer, and combine them incrementally into syntactic structures.
[Example transitions]
This gives a projective tree in linear time.
Main challenge: which action should the parser take in each state?
Model
Dyer et al. (2015) use stack LSTMs. These follow a simpler formulation than Grefenstette et al. (2015): add a stack pointer that determines which LSTM cell to use at the next time step.
Model
They use three stack LSTMs:
- One to represent the input
- One to hold the partially constructed syntactic trees
- One to record the history of the parser actions
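The stack-pointer idea can be sketched as follows (my own toy code; the recurrent cell is a plain tanh stand-in for the LSTM cell of Dyer et al., and the weights are random):

```python
import numpy as np

rng = np.random.default_rng(0)
H, X = 8, 4  # hidden and input dimensions
W_h = rng.normal(size=(H, H)) * 0.1
W_x = rng.normal(size=(H, X)) * 0.1

def cell(h_prev, x):
    # Stand-in recurrent cell (the real model uses an LSTM cell here).
    return np.tanh(W_h @ h_prev + W_x @ x)

class StackLSTM:
    """An RNN whose hidden states sit on a stack with a TOP pointer.

    push continues the recurrence from the state at TOP and advances the
    pointer; pop simply moves the pointer back, restoring an earlier state.
    """
    def __init__(self):
        self.states = [np.zeros(H)]  # states[0]: empty-stack state
        self.top = 0                 # stack pointer

    def push(self, x):
        h = cell(self.states[self.top], x)
        # Discard any states above the pointer before pushing.
        self.states = self.states[: self.top + 1] + [h]
        self.top += 1

    def pop(self):
        self.top -= 1

    def summary(self):
        return self.states[self.top]  # encodes the current stack contents

# Popping b and then pushing c gives the same summary as never pushing b:
a, b, c = (rng.normal(size=X) for _ in range(3))
s1 = StackLSTM(); s1.push(a); s1.push(b); s1.pop(); s1.push(c)
s2 = StackLSTM(); s2.push(a); s2.push(c)
```

The final assertion-style comparison below is the key property: after a pop, the summary is computed as if the popped element had never been pushed.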
Results
Managed to improve on the best results to date (C&M, 2014).
Outline
- Motivation
- Data Structures Recap
- Differentiable Data Structures
- History
- Applications in Language Processing
- Future?
Future?
Exciting applications are starting to emerge, going beyond toy examples. More recent work is starting to combine traditional data structures with reinforcement learning:
- Reinforcement Learning Neural Turing Machines (Zaremba and Sutskever, arXiv, 2015)
We should start learning POMDPs...