Differentiable Data Structures (and POMDPs)


Differentiable Data Structures (and POMDPs). Yarin Gal & Rowan McAllister, February 11, 2016. Many thanks to Edward Grefenstette for graphics material; other sources include Wikimedia, licensed under CC BY-SA 3.0.

Motivation: Data structures (abstract data types) lie at the core of Computer Science, e.g. stacks, queues, heaps, binary trees, DAGs, etc. They are used in sorting algorithms, detecting cycles in DAGs, and much more. We'd like to teach computers to use data structures when solving tasks. For many tasks a data structure is a sensible choice, and it allows for flexible models for such tasks.

Motivation: Many groups are working on these ideas at DeepMind and Facebook (Neural Turing Machines, Memory Networks, etc.). The work was featured in the Future of Life Institute's Top A.I. Breakthroughs of 2015.

Outline: Motivation; Data Structures Recap; Differentiable Data Structures; History; Applications in Language Processing; Future?

Data structures recap, Stack: A simple stack example (r is the top-of-stack "peek"). We push elements onto the top of the stack, and we pop (u elements) off the top of the stack. For example: Push v1, Push v2, Push v3; or Push v1, Push v2, Pop (u = 1). A minimal code sketch of these operations follows.
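As a quick illustration of the discrete operations above, here is a minimal Python sketch (not from the talk; the class name DiscreteStack and the string values are purely illustrative):

    # Minimal discrete stack matching the slide's example trace.
    class DiscreteStack:
        def __init__(self):
            self.items = []

        def push(self, v):
            self.items.append(v)            # v becomes the new top

        def pop(self, u=1):
            # remove (up to) the top u elements, returned top-first
            return [self.items.pop() for _ in range(min(u, len(self.items)))]

        def peek(self):
            return self.items[-1] if self.items else None   # r: the top of the stack

    s = DiscreteStack()
    s.push("v1"); s.push("v2")
    s.pop(u=1)                              # removes v2
    print(s.peek())                         # -> v1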

Data structures recap, Queue: A simple queue example (r is the bottom-of-queue "peek"). We enqueue elements at the top (back) of the queue, and we dequeue (u elements) from the bottom (front) of the queue. For example: Enqueue v1, Enqueue v2, Enqueue v3; or Enqueue v1, Enqueue v2, Dequeue (u = 1). A matching sketch follows.
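Again purely for illustration (a hypothetical snippet, not the talk's material), the same trace with Python's collections.deque:

    # Minimal discrete queue matching the slide's example trace.
    from collections import deque

    q = deque()
    q.append("v1"); q.append("v2")          # enqueue at the back ("top")
    front = q.popleft()                     # dequeue u = 1 element from the front ("bottom") -> v1
    print(front, q[0])                      # r (peek) is now v2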

Outline: Motivation; Data Structures Recap; Differentiable Data Structures; History; Applications in Language Processing; Future?

Countless approaches in the past 2 years: Neural Turing Machines (Graves et al., arXiv, 2014); Memory Networks (Weston et al., arXiv, 2014); End-To-End Memory Networks (Sukhbaatar et al., NIPS, 2015); Weakly Supervised Memory Networks (Sukhbaatar et al., 2015); Learning to Transduce with Unbounded Memory (Grefenstette et al., NIPS, 2015); Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets (Joulin and Mikolov, NIPS, 2015); Transition-Based Dependency Parsing with Stack Long Short-Term Memory (Dyer et al., ACL, 2015); Neural Programmer-Interpreters (Reed and de Freitas, ICLR, 2016); Neural Random-Access Machines (Kurach et al., ICLR, 2016); Neural GPUs Learn Algorithms (Kaiser and Sutskever, ICLR, 2016).

Continuous stack: Take the previous stack example and make the stack continuous [1]. Let's push half a v2 (d = 0.5)... what does that mean? Define the stack peek to be a mixture of the top 1.0 (in total strength) elements. Example: Push v1, Push v2, Pop v2, Push half a v2. [1] Learning to Transduce with Unbounded Memory, Grefenstette et al., NIPS, 2015.

Continuous stack: Define stack pop (with weight u) to remove the top u elements' worth of strength (which can be a fraction!). Example: Push v1 (d = 0.8), Pop (u = 0.1), Push v2 (d = 0.5), Pop (u = 0.9), Push v3 (d = 0.9). A code sketch of this update follows.
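For concreteness, here is a small Python/NumPy sketch of one continuous-stack step in the spirit of Grefenstette et al. (2015); the function name neural_stack_step and the scalar "vectors" are illustrative assumptions, and pop-only steps are emulated by pushing a dummy value with strength 0:

    import numpy as np

    def neural_stack_step(V, s, v, d, u):
        """One step of a continuous (neural) stack, following Grefenstette et al. (2015).
        V: list of stored vectors, s: list of their strengths, v: vector to push,
        d: push strength, u: pop strength."""
        # Pop: remove u units of strength, starting from the top.
        new_s = []
        for i in range(len(s)):
            above = sum(s[i + 1:])                          # strength sitting above element i
            new_s.append(max(0.0, s[i] - max(0.0, u - above)))
        # Push: append the new value with strength d.
        V = V + [v]
        new_s.append(d)
        # Read: a mixture of the topmost 1.0 units of strength.
        r = np.zeros_like(v, dtype=float)
        for i in range(len(V)):
            above = sum(new_s[i + 1:])
            r = r + min(new_s[i], max(0.0, 1.0 - above)) * np.asarray(V[i], dtype=float)
        return V, new_s, r

    # Reproduce the start of the slide's trace with made-up scalar values v1=1, v2=2:
    V, s = [], []
    V, s, r = neural_stack_step(V, s, np.array([1.0]), d=0.8, u=0.0)   # Push v1 (d = 0.8)
    V, s, r = neural_stack_step(V, s, np.array([0.0]), d=0.0, u=0.1)   # Pop (u = 0.1)
    V, s, r = neural_stack_step(V, s, np.array([2.0]), d=0.5, u=0.0)   # Push v2 (d = 0.5)
    print(s, r)                                                        # strengths [0.7, 0.0, 0.5]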

Continuous stack: And in equations (reproduced below).
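The equations on this slide were an image in the original and did not survive transcription. For reference, the continuous-stack update as given in Grefenstette et al. (2015) has the following form (up to notation):

    V_t[i] = V_{t-1}[i] \ \text{for } 1 \le i < t, \qquad V_t[t] = v_t

    s_t[i] = \max\big(0,\ s_{t-1}[i] - \max\big(0,\ u_t - \textstyle\sum_{j=i+1}^{t-1} s_{t-1}[j]\big)\big) \ \text{for } 1 \le i < t, \qquad s_t[t] = d_t

    r_t = \textstyle\sum_{i=1}^{t} \min\big(s_t[i],\ \max\big(0,\ 1 - \sum_{j=i+1}^{t} s_t[j]\big)\big)\, V_t[i]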

Continuous queue: Similarly, take the previous queue example and make the queue continuous. Define enqueue (with weight d) to add an element at the top (back) of the queue, and dequeue (with weight u) to remove the bottom u elements' worth of strength. Example (and exercise 1): Enqueue v1 (d = 0.8), Dequeue (u = 0.1), Enqueue v2 (d = 0.5), Dequeue (u = 0.8), Enqueue v3 (d = 0.9).

Continuous queue (exercise): Reminder of the stack's equations above. Exercise 2: what's the equivalent for a continuous queue?

Continuous queue (solution): The queue's equations (see below).
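The solution slide's equations were likewise an image; assuming the formulation of Grefenstette et al. (2015), the only change from the stack is that strength is consumed from the bottom (oldest elements) rather than the top:

    s_t[i] = \max\big(0,\ s_{t-1}[i] - \max\big(0,\ u_t - \textstyle\sum_{j=1}^{i-1} s_{t-1}[j]\big)\big) \ \text{for } 1 \le i < t, \qquad s_t[t] = d_t

    r_t = \textstyle\sum_{i=1}^{t} \min\big(s_t[i],\ \max\big(0,\ 1 - \sum_{j=1}^{i-1} s_t[j]\big)\big)\, V_t[i]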

Data structure as a recurrent unit: The equations can be seen as a single time-step update of a recurrent stack / queue unit. The unit takes an input and the previous state, and emits an output and the next state. [Diagram: the Neural Stack unit maps the previous values V_{t-1} and strengths s_{t-1}, together with the inputs push d_t, pop u_t and value v_t, to the next values V_t, next strengths s_t and output r_t, via split and join operations.]

Controller: Grefenstette et al. (2015) use an RNN to control the data structure. [Diagram: a single time step of the combined RNN unit and stack unit. The hybrid unit's input splits into the RNN's input i_t and the previous stack read r_{t-1}; the RNN maps (i_t, r_{t-1}) and h_{t-1} to h_t, which produces the stack controls d_t, u_t, v_t and the output o_t; the neural stack maps (V_{t-1}, s_{t-1}, d_t, u_t, v_t) to (V_t, s_t, r_t). The previous state H_{t-1} and the next state H_t collect the RNN and stack states.]
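A rough sketch of one combined controller-plus-stack time step, reusing neural_stack_step from the sketch above; the parameter names, the vanilla-RNN controller and the sigmoid/tanh output heads are assumptions for illustration rather than a faithful reimplementation of the paper:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def controller_step(params, h_prev, i_t, r_prev, V, s):
        # Vanilla RNN controller (the paper also uses an LSTM controller).
        x = np.concatenate([i_t, r_prev])
        h = np.tanh(params["W_xh"] @ x + params["W_hh"] @ h_prev + params["b_h"])
        d = sigmoid(params["w_d"] @ h + params["b_d"])      # push strength in (0, 1)
        u = sigmoid(params["w_u"] @ h + params["b_u"])      # pop strength in (0, 1)
        v = np.tanh(params["W_v"] @ h + params["b_v"])      # value to push onto the stack
        o = np.tanh(params["W_o"] @ h + params["b_o"])      # output of the hybrid unit
        V, s, r = neural_stack_step(V, s, v, float(d), float(u))   # from the earlier sketch
        return h, o, r, V, s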

Insights: Some insights: the stack has no additional parameters. It increases space complexity (a naive implementation is O(MT^2), with M the dimensionality of v_i and T the number of time steps). The space complexity can be reduced to O(MT) by working in place (from personal communication).

Insights: The stack's gradients can vanish quickly unless the model is initialised carefully.

Evaluation: The model is evaluated on: sequence copying (ab → ab ✓, ab → a ✗), sequence reversal (ab → ba ✓, ab → a ✗), and learning a grammar (svo → sov ✓, svo → ovs ✗). Most of the papers above give these anecdotal toy examples. Top A.I. Breakthroughs of 2015? Let's learn some history.
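For concreteness, toy data for the copy and reversal tasks can be generated along these lines (a hypothetical snippet, not the papers' data pipeline; make_example and its parameters are illustrative):

    import random

    def make_example(task, alphabet="ab", max_len=8):
        seq = [random.choice(alphabet) for _ in range(random.randint(1, max_len))]
        if task == "copy":
            target = seq[:]             # ab -> ab
        elif task == "reverse":
            target = seq[::-1]          # ab -> ba
        else:
            raise ValueError(task)
        return "".join(seq), "".join(target)

    print(make_example("reverse"))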

Outline: Motivation; Data Structures Recap; Differentiable Data Structures; History; Applications in Language Processing; Future?

History: The idea goes back at least as far as 1989 (as far as I could trace): Higher Order Recurrent Networks and Grammatical Inference (Giles et al., NIPS, 1989). From the abstract: "A higher order single layer recursive network learns to simulate a deterministic finite state machine. When a [..] neural net state machine is connected through a common error term to an external analog stack memory, the combination can be interpreted as a neural net pushdown automata. [It is] given the primitives push and pop, and is able to read the top of the stack." Further work includes: Connectionist Pushdown Automata that Learn Context-free Grammars (Sun et al., IJCNN, 1990); Neural Networks with External Memory Stack that Learn Context-Free Grammars from Examples (Sun et al., CISP, 1990); Using Prior Knowledge in an NNPDA to Learn Context-Free Languages (Das et al., NIPS, 1992); The Neural Network Pushdown Automaton: Model, Stack and Learning Simulations (Sun et al., 1993), mostly showing (empirically) that networks can learn Finite State Automata.

History, NNPDA: A concrete example, the NNPDA (1992) [2]. [Diagram: state, input and read neurons feed higher-order weights; the network outputs the next state and an action (push, pop or no-op) applied to an external stack that stores alphabet symbols; the top of the stack is copied back into the read neurons at the next step.] [2] Cited by Grefenstette et al. (2015).

History, NNPDA: Experimental evaluation on tasks: the balanced-parenthesis grammar ((())() ✓, ()( ✗), learning the grammar 1^n 0^n (111000 ✓, 1110 ✗), and sequence reversal (ab → ba ✓, ab → a ✗).

History, NNPDA: Top A.I. Breakthroughs of 1989? [Figure: the NNPDA architecture shown side by side with the neural stack architecture.] Same ideas (although with a different motivation), similar structure, and even the same evaluations.

History, NNPDA: But the NNPDA had limitations: it had to approximate derivatives through the stack; it suffered from vanishing gradients; it only keeps input symbols on the stack; it is tightly coupled with the RNN controller; and it was built in the 90s... Modern research: these issues were addressed in Grefenstette et al. (2015), which uses advances from recent years (stochastic optimisation, data sub-sampling, adaptive learning rates) and more computational resources... and can use these models from the 90s in real-world applications.

Outline: Motivation; Data Structures Recap; Differentiable Data Structures; History; Applications in Language Processing; Future?

Transition Based Dependency Parsing: A dependency grammar is a syntactic structure where words are connected to each other by directed links (there are various representations of dependency grammars). Transition based dependency parsing reads words sequentially from a buffer and combines them incrementally into syntactic structures. [Example transitions were shown on the slide.] This gives a projective tree in linear time. Main challenge: what action should the parser take in each state? A small sketch of such a transition system follows.
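To make the transition system concrete, here is a tiny arc-standard style skeleton in Python (a hypothetical sketch; the function names and the toy policy are not from the talk, and choosing the action is exactly the part the neural model must learn):

    # Tiny arc-standard style parser skeleton (illustrative only).
    def shift(stack, buffer, arcs):
        stack.append(buffer.pop(0))

    def left_arc(stack, buffer, arcs):
        dep = stack.pop(-2)                 # second-from-top becomes a dependent of the top
        arcs.append((stack[-1], dep))

    def right_arc(stack, buffer, arcs):
        dep = stack.pop()                   # top becomes a dependent of the new top
        arcs.append((stack[-1], dep))

    def parse(words, choose_action):
        stack, buffer, arcs = [], list(words), []
        while buffer or len(stack) > 1:
            action = choose_action(stack, buffer)   # the learned part
            action(stack, buffer, arcs)
        return arcs

    # Toy run: shift everything, then attach each word to the one below it.
    def toy_policy(stack, buffer):
        return shift if buffer else right_arc

    print(parse("the cat sat".split(), toy_policy))   # [('cat', 'sat'), ('the', 'cat')]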

Model: Dyer et al. (2015) use stack LSTMs. These follow a simpler formulation than Grefenstette et al. (2015): add a stack pointer that determines which LSTM cell to use in the next time step.
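A rough sketch of the stack-pointer idea (illustrative only; the StackLSTM wrapper below assumes an abstract cell function rather than any particular library API):

    # Stack LSTM sketch: keep a list of (h, c) states plus a pointer to the current top.
    # Push runs the cell from the state at the pointer; pop just moves the pointer back.
    class StackLSTM:
        def __init__(self, lstm_cell, h0, c0):
            self.cell = lstm_cell              # any function: (x, (h, c)) -> (h, c)
            self.states = [(h0, c0)]
            self.top = 0                       # stack pointer into self.states

        def push(self, x):
            h, c = self.cell(x, self.states[self.top])
            self.states = self.states[: self.top + 1] + [(h, c)]
            self.top += 1

        def pop(self):
            self.top -= 1                      # continue from the earlier state on the next push

        def summary(self):
            return self.states[self.top][0]    # hidden state at the current top

    # Toy usage with a trivial "cell" that just remembers its last input:
    s = StackLSTM(lambda x, state: (x, x), h0=0, c0=0)
    s.push("a"); s.push("b"); s.pop()
    print(s.summary())                         # -> "a"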

Model: Use three stack LSTMs: one to represent the input, one to hold the partially constructed syntactic trees, and one to record the history of the parser actions.

Results: Managed to improve on the best results to date (C&M, 2014).

Outline: Motivation; Data Structures Recap; Differentiable Data Structures; History; Applications in Language Processing; Future?

Future? Exciting applications are starting to emerge, going beyond toy examples. More recent work is starting to combine traditional data structures with reinforcement learning: Reinforcement Learning Neural Turing Machines (Zaremba and Sutskever, arXiv, 2015). We should start learning POMDPs...