
Generating composite SQL queries from natural language questions using recurrent neural networks

Matthias De Groote

Supervisors: Prof. dr. ir. Joni Dambre, Prof. dr. Wesley De Neve
Counsellors: Ir. Fréderic Godin, Dr. ir. Thomas Demeester

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering

Department of Electronics and Information Systems
Chair: Prof. dr. ir. Koen De Bosschere
Faculty of Engineering and Architecture
Academic year



Foreword

Ever since following the course Machine Learning, I have been intrigued by the mathematical models, practical applications and societal impact of the domain. I started this project with the intention of doing research on chatbots, but after reading up on and immersing myself in the NLP domain, my interest shifted towards question answering and machine translation. Hence the final topic of this dissertation, which combines elements of both.

Before beginning the dissertation, I would like to thank everyone who has helped me throughout the year to complete it. First of all, I would like to express my gratitude to my supervisors, Prof. Dr. Ir. Joni Dambre and Prof. Dr. Wesley De Neve, for granting me the opportunity to conduct a year of research in the NLP domain. Second, I want to thank my counsellors, Ir. Fréderic Godin and Dr. Ir. Thomas Demeester, for their time and feedback throughout the year. Their fast and accurate support during and between our constructive meetings guided me throughout this work. I would also like to thank Fréderic for proofreading my dissertation. Finally, I want to thank my parents, who gave me the opportunity to pursue my studies and supported me during these years. Their encouragement and confidence have helped me a lot.

Matthias De Groote, June 2018

Permission of Use

The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In the case of any other use, the copyright terms have to be respected, in particular with regard to the obligation to state expressly the source when quoting results from this master dissertation.

Matthias De Groote, June 2018

Generating Composite SQL Queries from Natural Language Questions using Recurrent Neural Networks

Matthias De Groote

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering
Supervisors: Prof. Dr. Ir. J. Dambre, Prof. Dr. W. De Neve
Counsellors: Ir. F. Godin, Dr. Ir. T. Demeester
Faculty of Engineering and Architecture, Ghent University
Department of Electronics and Information Systems
Chair: Prof. Dr. Ir. K. De Bosschere

Summary

Relational databases store a vast amount of today's information and are becoming increasingly important in modern applications. Accessing these databases requires an understanding of SQL, which is not common knowledge. The active research on semantic parsers that translate natural language questions to SQL queries has recently shifted towards using neural networks. Although there are already good neural network approaches for simple SQL queries (i.e. queries on one table), these solutions cannot produce composite SQL queries (i.e. queries on multiple tables). This work introduces a new dataset containing composite SQL queries and offers Natural-Language-To-SQL (NL2SQL) solutions, trained and tested on this dataset.

Keywords: nl2sql, encoder-decoder model, machine translation, recurrent neural networks

Generating composite SQL queries from natural language questions using recurrent neural networks

Matthias De Groote

Supervisors: Prof. Dr. Ir. J. Dambre, Prof. Dr. W. De Neve, Ir. F. Godin, Dr. Ir. T. Demeester

Abstract: Relational databases store a vast amount of today's information and are becoming increasingly important in modern applications. Accessing these databases requires an understanding of SQL, which is not common knowledge. The active research on semantic parsers that translate natural language questions to SQL queries has recently shifted towards using neural networks. Although there are already good neural network approaches for simple SQL queries (i.e. queries on one table), these solutions cannot produce composite SQL queries (i.e. queries on multiple tables). This work introduces a new dataset containing composite SQL queries and offers Natural-Language-To-SQL (NL2SQL) solutions, trained and tested on this dataset.

Keywords: nl2sql, encoder-decoder model, machine translation, recurrent neural networks, machine learning

I. INTRODUCTION

The IT revolution of the past few decades has resulted in a large-scale digitization of data, making it accessible to millions of users in the form of databases. However, accessing these databases requires an understanding of query languages such as Structured Query Language (SQL), which, while powerful, is difficult to master and often beyond the programming expertise of a majority of end-users. Thus, building effective semantic parsers that can translate natural language questions into logical forms such as queries has been a long-standing goal [1], [2], [3].

Dong and Lapata [4] showed that recurrent neural networks with attention and copying mechanisms can be used effectively to build successful semantic parsers. Recent work by Zhong et al. [5] introduced the state-of-the-art Seq2SQL model for question-to-SQL translation in the supervised setting. In order to build this model, they published the WikiSQL dataset, which is an order of magnitude larger than previous semantic parsing datasets. Xu et al. [6] and Wang et al. [7] also published papers that improved the accuracy on WikiSQL. However, the SQL queries in the WikiSQL dataset lack complex operators such as JOIN.

This work focuses on generating composite SQL queries (i.e. queries on multiple tables) from natural language questions using recurrent neural networks with attention and copying mechanisms. While the most recent solutions mentioned above achieve good accuracy on the WikiSQL dataset, those models cannot predict the JOIN operator. Training in a supervised manner requires labeled examples of question-query pairs, hence this work also introduces a new dataset of question-query pairs, including both simple and composite SQL queries.

This work is organized as follows. First, a brief overview of the related work on the NL2SQL problem is given in Section II. Next, the construction of the dataset is discussed in Section III, followed by Section IV, which presents the proposed models. The experiments and their results are shown in Section V. Finally, conclusions are drawn in Section VI.

II. RELATED WORK

There are currently two approaches to solve the NL2SQL problem: semantic parsing and neural networks. Semantic parsing is an approach to translate text to a formal meaning representation such as logical forms or structured queries.
There have been many works considering parsing a natural language description into a logical form [1], [2], [8], [9]. Most of these previous systems rely on high-quality lexicons and domain- or representation-specific features, and may not generalize. This is why, in this work, the focus is on neural network approaches to handle the NL2SQL task, which require less feature engineering.

Recently, a new dataset on NL2SQL has been released by Salesforce: WikiSQL [5]. It is a corpus of 80,654 hand-annotated instances of natural language questions, SQL queries and SQL tables extracted from 24,241 HTML tables from Wikipedia. It is an order of magnitude larger than previous semantic parsing datasets, which makes it interesting for data-hungry neural networks. There are three papers that have competitive scores on the WikiSQL task: Seq2SQL [5], SQLNet [6] and Pointing out SQL queries from text [7]. They all offer a different solution for the NL2SQL problem: Seq2SQL used reinforcement learning to solve the order-matters problem in a sequence-to-sequence model; SQLNet wanted to avoid reinforcement learning and proposed a sketch-based approach, which only specifies the shape of the query; Pointing out SQL queries from text used a typed decoder that statically predicts the next token based on the type of the token. Despite the several advantages of the WikiSQL dataset over previous semantic parsing datasets, it only consists of simple SQL queries on one table. This work focuses on training an end-to-end encoder-decoder model on composite SQL queries on multiple tables.

III. DATASET

This work introduces a new dataset consisting of two sub-datasets: a dataset pairing natural language questions with simple SQL queries and one with composite SQL queries. The simple SQL queries have a similar syntax as the queries from the WikiSQL dataset, while the composite SQL queries also include the JOIN operator. The SQL queries are based on the IMDb database (the same database as in [3]).

To gather human questions, the SimpleQuestions dataset from the bAbI project [10] proves to be very practical. It consists of a total of 108,442 questions, written in natural language by human English-speaking annotators. Each of these questions is paired with a corresponding fact, formatted as (subject, relationship, object). The facts have been extracted from the knowledge base Freebase. Only the questions answerable by the IMDb database are extracted, lowercased and stripped of punctuation. The named entities (names of persons, titles of movies, ...) are replaced by general tags. After filtering out duplicate questions, this results in 1,540 unique template questions. A dataset can easily be created by replacing all the tags with e.g. ten random values from the IMDb database. The combination of the relationship and the tags determines the ground truth SQL query. There are 40 unique combinations of relationship and tags, which are hand-annotated by the author. The complete dataset is published and can be downloaded from Bitbucket.

IV. MODEL

This section introduces the multiple models that are offered as a solution. The first is a GloVe-based model that serves as a baseline, together with the more advanced SQLNet model. Next is the encoder-decoder model, discussed with two extensions: the attention and copy mechanism.

A. GloVe-based model

The GloVe-based model is built on the idea that similar questions should have similar queries. It uses GloVe [11] and cosine similarity for respectively the vector representation and the similarity measurement of the questions. We decided to use the Common Crawl embedding, which is trained on 42 billion tokens, has a vocabulary of 1.9 million tokens and embeds these tokens in a 300-dimensional vector space. All the words are lowercased and embedded into a vector using the pre-trained word embedding. The vector representation of a question is the sum of the vector representations of the words in the question. For each question in the test set, the cosine similarity between that question and all the questions from the training set is calculated, and the query accompanying the most similar training question is predicted. The results can be found in Section V.
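To make the baseline concrete, the sketch below shows one way such a nearest-neighbour prediction could be implemented. It is an illustration only, under the assumption that a GloVe lookup table (here called glove) mapping tokens to 300-dimensional vectors is available; it is not the exact code used for the reported results.

import numpy as np

# Illustrative sketch of the GloVe-based baseline (not the exact implementation).
def embed(question, glove):
    # sum of the GloVe vectors of all in-vocabulary words of the question
    vectors = [glove[w] for w in question.lower().split() if w in glove]
    return np.sum(vectors, axis=0) if vectors else np.zeros(300)

def predict(test_question, train_questions, train_queries, glove):
    q = embed(test_question, glove)
    similarities = []
    for train_q in train_questions:
        t = embed(train_q, glove)
        # cosine similarity between the test question and each training question
        similarities.append(np.dot(q, t) / (np.linalg.norm(q) * np.linalg.norm(t) + 1e-8))
    # return the query paired with the most similar training question
    return train_queries[int(np.argmax(similarities))]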
B. SQLNet

In order to check that the dataset is not trivially composed, we have trained SQLNet [6], mentioned in Section II, on our simple query dataset to compare the accuracy. SQLNet is the state-of-the-art model on the WikiSQL dataset and proposes a sketch-based approach to generate a SQL query. This means that different models are trained to predict each clause of the SQL query, such as the column name after SELECT or the condition after the WHERE clause. We parsed our dataset into the corresponding format of SQLNet to be able to train the model. We trained the model without column attention and with fixed word embeddings; the results can be found in Section V.

C. Encoder-decoder

In machine translation, input sequences and output sequences have different lengths. Google presented a general end-to-end approach to sequence learning in [12]. This sequence-to-sequence (seq2seq) network, or encoder-decoder network, is a model consisting of two RNNs called the encoder and the decoder. Figure 1 shows a high-level overview of the encoder-decoder network: the encoder reads an (embedded) input sequence x_0, ..., x_n and outputs a single vector h_n, while all the other outputs h_0, ..., h_{n-1} are discarded; the decoder reads h_n to produce an output sequence y_0, ..., y_k.

Figure 1: High-level overview of the encoder-decoder architecture.

The words from the input sentences and output queries are first embedded into a vector representation using the same GloVe embedding [11] as in the GloVe-based model. If an (as yet unknown) input word of the training set is not part of the GloVe vocabulary, we add a vector with random values uniformly sampled between -1 and 1. We assigned each such word a different vector (and not e.g. all zeroes or the <UNK> embedding) because there were quite a lot (1,233) of words that were not part of the GloVe vocabulary. If a test input word is not present in the mapping, it is mapped to the <UNK> (unknown) token.

1) Encoder: The input x_i, corresponding to a word, is first embedded using the word embedding explained above. A GRU takes as input this embedding x_i together with the previous hidden state s_{i-1}, and produces the encoder output h_i and a new hidden state s_i. The first hidden state s_0 is initialized to all zeroes. In the simple decoder, only the encoder's last output h_n is used, in contrast to the attention decoder, where all the encoder outputs h_0, ..., h_n are needed.

2) Decoder: In the simplest decoder, only the last output of the encoder, h_n, is used. This is called the context vector, because it encodes the context of the entire sequence. The first hidden state s_0 of the decoder is set equal to this context vector. Half of the time, the decoder uses teacher forcing, a method for quickly and efficiently training recurrent neural network models that uses the ground-truth output from a prior time step as input [13]. When it does not use teacher forcing, the input is the previously predicted token. The decoder embeds the input y_{i-1} (in case of no teacher forcing) in the same way as the encoder. A GRU takes as input this embedded vector together with the previous hidden state s_{i-1}, and produces an output vector z_i and the next hidden state s_i. This output vector z_i \in \mathbb{R}^{1 \times n}, with n the number of hidden nodes, is transformed into a distribution over the vocabulary through a feedforward layer followed by the LogSoftmax function. The negative log-likelihood loss is optimized.

3) Decoder with attention mechanism: If only the last encoder output (the context vector) is passed between the encoder and decoder, that single vector carries the burden of encoding the entire sentence. The attention mechanism [14] avoids this by encoding the whole input sequence based on the sequence of all the encoder outputs, as opposed to only the last encoder output. Our implementation of the attention mechanism first calculates a set of attention weights \alpha_{i,j}. Each weight \alpha_{i,j} is a normalized attention energy e_{i,j}:

\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_k \exp(e_{i,k})}    (1)

e_{i,j} = s_{i-1}^T W_a h_j    (2)

where each attention energy e_{i,j} is calculated as the dot product between the decoder hidden state s_{i-1} and a linear transformation of the corresponding encoder output h_j. This is the score function referred to as the "general" form among the global attention mechanisms proposed by [15]. These attention weights \alpha_{i,j} are multiplied by the encoder output vectors h_1, ..., h_n to create a weighted combination, the context vector c_i:

c_i = \sum_{j=1}^{n} \alpha_{i,j} h_j    (3)

The context vector c_i is concatenated with the decoder's input y_{i-1} and serves as the input of the GRU. A feedforward layer applies a linear transformation to the concatenation of the GRU output vector z_i and the context vector c_i. Passing through LogSoftmax, the log probabilities over the vocabulary are obtained.
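As a concrete illustration of Equations (1)-(3), the following sketch computes the attention weights and the context vector for a single decoder step. The hidden size of 256 matches the value used later in the experiments; the function and variable names are our own and do not correspond to the actual implementation.

import torch

hidden_size = 256
W_a = torch.nn.Linear(hidden_size, hidden_size, bias=False)   # the linear transform W_a

def attention(s_prev, encoder_outputs):
    # s_prev: (hidden_size,) decoder hidden state s_{i-1}
    # encoder_outputs: (seq_len, hidden_size) all encoder outputs h_j
    energies = W_a(encoder_outputs) @ s_prev      # e_{i,j} = s_{i-1}^T W_a h_j, Eq. (2)
    alphas = torch.softmax(energies, dim=0)       # normalized attention weights, Eq. (1)
    context = alphas @ encoder_outputs            # c_i = sum_j alpha_{i,j} h_j, Eq. (3)
    return context, alphas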
4) Copy mechanism: There are quite a lot of Out-Of-Vocabulary (OOV) words in the test set, such as actor names, movie titles and so on. To deal with these OOV words, we have designed a copy mechanism that works in combination with attention. It works in two steps. The first step consists of predicting whether a token should be copied from the input sequence; this is accomplished by including a <COPY> token in the vocabulary. The second step consists of replacing the <COPY> token with the input token that has the highest attention value.

The loss function is adapted to cope with the copy mechanism. If the target token is in the input sequence, the negative log-likelihood loss between the <COPY> token and the decoder output is added to the loss. Also, the attention values should align with the input tokens that should be copied, so the cross-entropy loss between the attention values and the index of the corresponding input token is added to the loss. If the target token is not in the input sequence, the negative log-likelihood loss between the target token and the decoder output is added to the loss.

V. EXPERIMENTS

This section starts by explaining the evaluation details. Next, it discusses the results of the experiments with the models from Section IV, comparing the baseline models and the encoder-decoder model on both simple and composite queries.

A. Evaluation details

We have chosen to use the deep learning framework PyTorch from Facebook, because SQLNet and Seq2SQL are written in PyTorch and open-sourced on GitHub, which can serve as inspiration. As discussed in Section III, the dataset consists of 1,537 template questions containing tags such as [@film] or [@actor]. The tags are replaced with 10 random, corresponding values from the IMDb database. The resulting 15,370 question-query pairs are split into (85%, 0%, 15%) and (70%, 15%, 15%) train, validation and test sets for respectively the GloVe-based model and the other models. The sets are separated such that a template question from one set is not present in the other sets.

The evaluation metric used for the baselines and models is the accuracy: the proportion of correct cases among the total number of cases examined. Because some parts of the SQL query are harder to predict than others, we calculate the accuracy of the different components of the SQL query, so that it becomes clear on which components of the query the model has to be improved. Table I shows the evaluation components. In case of multiple conditions, the conditions are sorted before comparison, because the conditions in a SQL query are commutative.
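A minimal sketch of how such a component-wise comparison could look is given below. It assumes queries have already been parsed into their clauses; the dictionary structure is an assumption for illustration and this is not the evaluation code behind the reported numbers.

# Illustrative component-wise comparison of a predicted and a ground-truth query.
# Both are assumed to be dicts with 'select', 'from', 'joins' and 'conds' fields.
def component_accuracy(pred, gold):
    select_ok = pred["select"] == gold["select"]
    from_ok = pred["from"] == gold["from"]
    joins_ok = sorted(pred["joins"]) == sorted(gold["joins"])
    # WHERE conditions are commutative, so sort them before comparing
    conds_ok = sorted(pred["conds"]) == sorted(gold["conds"])
    return {
        "select": select_ok,
        "from": from_ok,
        "joins": joins_ok,
        "conds_all_op": conds_ok,
        "all": select_ok and from_ok and joins_ok and conds_ok,
    }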

Table I: Evaluation components of the query.
Select: checks if the column name after SELECT is correct.
From: checks if the table name after FROM is correct.
Where, first operands (conds first op): checks if the column name (the first operand) after WHERE is correct.
Where, all operands (conds all op): checks if the whole condition after WHERE is correct.
Joins (only for composite queries): checks if the table names and conditions of the JOIN clause of the query are correct.
All: checks if the whole query is correct.

Table IV: Choice of hyperparameters.
Optimizer: Adam
Learning rate:
Dropout: 0.1
Number of hidden nodes: 256

B. Simple queries

The accuracy results of the GloVe-based model and SQLNet on the different components of the simple queries can be found in Table II. The table shows that the biggest room for improvement lies in predicting the WHERE condition (conds all op), while predicting the table name after FROM is already perfect. Also, the GloVe-based model predicts the column names after SELECT and as first operand of the WHERE condition (conds first op) with an accuracy of respectively 84.45% and 82.51%, which is already quite high.

Table II: Accuracy results of the GloVe-based model and SQLNet - simple queries.
                          GloVe-based   SQLNet
Select                    84.45%        71.7%
From                      100.0%        100%
Where (first operands)    82.51%        /
Where (all operands)      36.19%        46.8%
All                       29.48%        36.5%

Table III: Accuracy results of the encoder-decoder model - simple queries.
                          Simple    With attention   With attention and copy
Select                    95.47%    94.08%           93.87%
From                      100%      99.65%           99.87%
Where (first operands)    92.74%    90.04%           92.39%
Where (all operands)      65.99%    60.55%           67.77%
All                       64.42%    58.59%           66.29%

The accuracy results of the encoder-decoder model on the different components of the simple queries can be found in Table III. The choice of hyperparameters can be found in Table IV. The model is trained until the validation loss converges, which is after approximately 9 epochs. Note the difference with the training procedure of SQLNet, where multiple models, one per clause of the query, were trained independently of each other until their validation loss converged. As can be seen in Figure 2, the encoder-decoder model has the most impact on predicting the last operand of the WHERE clause, which results in a higher total accuracy. Clearly, the attention mechanism alone has no impact on the simple queries. However, in combination with the copy mechanism there is a 2% increase in overall accuracy compared to the simple decoder.

Figure 2: Accuracy results for the WHERE clause and the whole query - simple queries.

C. Composite queries

The accuracy results of the GloVe-based and encoder-decoder models are shown in Table V. The choice of hyperparameters is the same as with the simple queries and can be found in Table IV. The model is trained until the validation loss converges, which is after approximately 9 epochs. Analogous to the simple queries, the extensions with the attention mechanism and copy mechanism have a particular effect on predicting the third operand of the WHERE clause. Figure 3 zooms in on the prediction of the last operand and the correctness of the whole query. We notice again that the attention mechanism alone brings no improvement in the accuracy of the whole query. However, when it is combined with the copy mechanism, it brings an additional improvement of 8%. All three models score better than the GloVe-based model.

Figure 3: Accuracy results for the WHERE clause and the whole query - composite queries.
Table V: Accuracy results of the encoder-decoder model - composite queries.
                          GloVe     Simple    With attention   With attention & copy
Select                              96.46%    94.00%           96.28%
From                                99.60%    99.41%           99.96%
Where (first operands)              94.89%    91.76%           93.51%
Where (all operands)                65.69%                     73.94%
Joins                     80.74%              92.83%           94.49%
All                                 63.32%    63.32%           71.29%

An example of the attention mechanism is visualized in Figure 4. Here, the SELECT, FROM and JOIN clauses are determined by one word, "producer". The words "angele et tony" have weight 1 in the prediction of the last operand of the WHERE clause, where they are copied correctly.

Figure 4: Visualization of the attention mechanism for a question-query pair.

The copy mechanism works in two steps: predicting a <COPY> token and, if it should copy, choosing which token to copy. The <COPY> tokens are correctly predicted 88.71% of the time. Table VI shows three examples of question-query pairs and the results of the simple decoder and the decoder with the attention and copy mechanism.

The first example shows a case where the copy mechanism correctly predicts the tokens to copy, but the attention is wrongly aligned for the last token, causing it to copy the wrong word. The second example illustrates a common error of the copy mechanism: wrongly predicting the length of the condition. The third example illustrates a case where the copy mechanism is correct and the simple decoder is not.

VI. CONCLUSION

This work started with an introduction to the NL2SQL problem, exploring the state-of-the-art solutions. Although these scored an overall accuracy of approximately 70%, the neural network approaches were trained only on simple SQL queries. Hence the introduction of a new dataset containing composite SQL queries. The encoder-decoder model scored better on this dataset than the GloVe-based model and the state-of-the-art SQLNet, but could still improve in predicting the last operand of the WHERE clause. The extensions with the attention and copy mechanism helped to increase this accuracy, with promising results.

Future work should consist of generalizing the model to be able to predict queries on other databases. The current model will not be able to adapt seamlessly to a new database, and it is not feasible to generate a new dataset for each new database. The model should take as input, next to human questions, the database schema, and adapt its predicted queries to this schema. The WikiSQL dataset consists of 24,241 different tables, providing an ideal dataset to train such generalized solutions on.

REFERENCES

[1] Luke S. Zettlemoyer and Michael Collins. Online learning of relaxed CCG grammars for parsing to logical form. In Proceedings of EMNLP-CoNLL, 2007.
[2] Chris Quirk, Raymond Mooney, and Michel Galley. Language to Code: Learning Semantic Parsers for If-This-Then-That Recipes. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, 2015.
[3] Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig. SQLizer: Query Synthesis from Natural Language. Proceedings of the ACM on Programming Languages (OOPSLA), 2017.
[4] Li Dong and Mirella Lapata. Language to Logical Form with Neural Attention. In ACL, 2016.
[5] Victor Zhong, Caiming Xiong, and Richard Socher. Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. arXiv preprint, 2017.
[6] Xiaojun Xu, Chang Liu, and Dawn Song. SQLNet: Generating Structured Queries from Natural Language without Reinforcement Learning. arXiv preprint, 2017.
[7] Chenglong Wang, Marc Brockschmidt, and Rishabh Singh. Pointing Out SQL Queries from Text. arXiv preprint, 2017.
[8] Xinyun Chen, Chang Liu, Richard Shin, Dawn Song, and Mingcheng Chen. Latent Attention for If-Then Program Synthesis. In NIPS, 2016.
[9] J. M. Zelle and R. J. Mooney. Learning to Parse Database Queries using Inductive Logic Programming. 1996.
[10] Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. Large-scale Simple Question Answering with Memory Networks. arXiv preprint, 2015.
[11] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP), 2014.
[12] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to Sequence Learning with Neural Networks. In NIPS, 2014.
[13] Jason Brownlee. What is teacher forcing for recurrent neural networks? Blog post.
[14] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR, abs/1409.0473, 2014.
[15] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective Approaches to Attention-based Neural Machine Translation. In EMNLP, 2015.

Table VI: Example predictions by the different models. Q denotes the natural language question and G denotes the corresponding ground truth query. S and A&C denote respectively the queries produced by the simple decoder and the decoder with the copy and attention mechanism. Our models generally generate the table name twice, so that the column names can specify which table they come from, but this is left out of the table for the sake of brevity. In the original table, wrongly predicted words were indicated in bold.

Q1:  which iowan cinematographer produced the film Appunti inutili - Virgilio Giotti
S:   SELECT name FROM movie INNER JOIN made_by ON movie.mid = made_by.msid INNER JOIN producer ON made_by.pid = producer.pid WHERE movie.title = aquele querido mes de agosto
A&C: SELECT name FROM movie INNER JOIN made_by ON movie.mid = made_by.msid INNER JOIN producer ON made_by.pid = producer.pid WHERE movie.title = Appunti inutili - Virgilio Virgilio
G:   SELECT name FROM movie INNER JOIN made_by ON movie.mid = made_by.msid INNER JOIN producer ON made_by.pid = producer.pid WHERE movie.title = Appunti inutili - Virgilio Giotti

Q2:  where is the film Angry Samoans from
S:   SELECT country_code FROM movie INNER JOIN copyright ON movie.mid = copyright.msid INNER JOIN company ON copyright.cid = company.id WHERE movie.title = apron strings
A&C: SELECT country_code FROM movie INNER JOIN copyright ON movie.mid = copyright.msid INNER JOIN company ON copyright.cid = company.id WHERE movie.title = Angry
G:   SELECT country_code FROM movie INNER JOIN copyright ON movie.mid = copyright.msid INNER JOIN company ON copyright.cid = company.id WHERE movie.title = Angry Samoans

Q3:  what film is a part of the Crime film genre
S:   SELECT title FROM movie INNER JOIN classification ON movie.mid = classification.msid INNER JOIN genre genre ON classification.gid = genre.gid WHERE genre.genre = game-show
A&C: SELECT title FROM movie INNER JOIN classification ON movie.mid = classification.msid INNER JOIN genre genre ON classification.gid = genre.gid WHERE genre.genre = Crime
G:   SELECT title FROM movie INNER JOIN classification ON movie.mid = classification.msid INNER JOIN genre genre ON classification.gid = genre.gid WHERE genre.genre = Crime

Contents

Foreword
Permission of Use
Overview
Extended Abstract
Contents
List of Figures
List of Tables
List of abbreviations

1 Introduction
2 Related Literature
  2.1 SQL: syntax and usage
  2.2 Neural networks
    Optimizer
    Feedforward neural networks
    Recurrent neural networks
    Long Short-Term Memory
    Gated Recurrent Unit
    Sequence-to-sequence model
  2.3 Natural-language-to-SQL: NL2SQL
    Semantic parsing approaches
    Neural network approaches
  2.4 Conclusion

3 Dataset
4 Methodology
  GloVe-based model
    GloVe: Global Vectors for Word Representation
    Method
  SQLNet
  Encoder decoder
    Word embedding
    Encoder
    Decoder
    Decoder with attention mechanism
    Copy mechanism
  Conclusion
5 Experiments
  Experimental setup
    Deep learning framework
    Evaluation details
  Baselines
    GloVe-based model
    SQLNet
  Encoder-decoder model
    Simple queries
    Composite queries
  Hyperparameter optimization
    Optimizer
    Dropout
    Hidden nodes
  Conclusion
6 Conclusion
Bibliography

List of Figures

Example of translating a question to a query
SGD without momentum versus Adam with momentum
Example of a feedforward neural network with one hidden layer
Unfolding in time of an RNN, figure from Olah [2015]
Graphical representation of an LSTM unit, figure from Olah [2015]
Graphical representation of a GRU unit, figure from Olah [2015]
High-level representation of the encoder-decoder model
Schematic overview of SQLizer, figure from Yaghmazadeh et al. [2017]
Entity-relationship model of the IMDb database
Length distribution over the input sentences and queries
High-level overview of the encoder-decoder architecture
Encoder architecture
Simple decoder architecture
Attention mechanism. The attention weights are values in the range [0, 1], which are mapped to [black, white]. The encoder outputs are multiplied and changed accordingly
Attention decoder architecture
Accuracy of the GloVe model - simple queries
Accuracy of the GloVe model - composite queries
Accuracy results of the encoder-decoder model - simple queries
Accuracy results for the WHERE clause and the whole query - simple queries
Accuracy results of the encoder-decoder model - composite queries
Accuracy results for the WHERE clause and the whole query - composite queries

Visualization of attention for a question-query pair
Visualization of attention for a question-query pair
The negative log-likelihood loss for the SGD optimizer (orange) and the Adam optimizer (blue), as a function of the number of trained samples
The negative log-likelihood loss for the learning rates 0.1, 0.01 and ... for the Adam optimizer, as a function of the number of trained samples. The graphs are respectively red, blue and orange
The absolute accuracy of the whole query calculated on the validation set for the dropout values p of 0.1, 0.3 and 0.5, as a function of the number of trained samples. The graphs are respectively orange, blue and red
The negative log-likelihood loss for different numbers of hidden nodes

List of Tables

Example of a Freebase tuple and corresponding question
Distribution of the number of tags in a question
Question examples
Number of unique template questions per relationship
Question-query examples
Example of a template question and the generated questions
Split of the dataset into training, validation and test set for the different models
Evaluation components of the query
Accuracy results of the GloVe-based model and SQLNet - simple queries
Accuracy results of the encoder-decoder model - simple queries
Choice of hyperparameters - simple and composite queries
Accuracy results of the encoder-decoder model - composite queries
Accuracy of the first step of the copy mechanism and the frequency of the error cases
Example predictions by the different models (Q denotes the natural language question, G the ground truth query, and S and A&C the queries produced by the simple decoder and the decoder with copy and attention mechanism)

List of abbreviations

Adam: Adaptive Moment Estimation
GloVe: Global Vectors for Word Representation
GRU: Gated Recurrent Unit
IMDb: Internet Movie Database
LSTM: Long Short-Term Memory
NL2SQL: Natural-Language-To-SQL
OOV: Out Of Vocabulary
RNN: Recurrent Neural Network
SGD: Stochastic Gradient Descent
SQL: Structured Query Language

Chapter 1
Introduction

The IT revolution of the past few decades has resulted in a large-scale digitization of data, making it accessible to millions of users in the form of databases. However, accessing these databases requires an understanding of query languages such as Structured Query Language (SQL), which, while powerful, is difficult to master and often beyond the programming expertise of a majority of end-users. Thus, building effective semantic parsers that can translate natural language questions into logical forms such as queries has been a long-standing goal (Zettlemoyer and Collins [2007], Quirk et al. [2015], Yaghmazadeh et al. [2017]).

Dong and Lapata [2016] showed that recurrent neural networks with attention and copying mechanisms can be used effectively to build successful semantic parsers. Recent work by Zhong et al. [2017] introduced the state-of-the-art Seq2SQL model for question-to-SQL translation in the supervised setting. In order to build this model, they published the WikiSQL dataset, which is an order of magnitude larger than previous semantic parsing datasets. Xu et al. [2017] and Wang et al. [2017] also published papers that improved the accuracy on WikiSQL. However, the SQL queries in the WikiSQL dataset lack complex operators such as JOIN.

This work focuses on generating composite SQL queries (i.e. queries on multiple tables) from natural language questions using recurrent neural networks with attention and copying mechanisms. While the most recent solutions mentioned above score an accuracy of approximately 70% on the WikiSQL dataset, those models cannot predict the JOIN operator. Training in a supervised manner requires labeled examples of question-query pairs, hence this work also introduces a new dataset of question-query pairs, including both simple and composite SQL queries.

Figure 1.1 shows a use case of our solution. The non-technical end-user asks a natural language question, which is translated by our model into a SQL query. By executing this SQL query on a database, the answer can be returned to the end-user. For example, the question "what is a work directed by Martin Scorsese" is translated into

SELECT title
FROM movie movie
INNER JOIN directed_by directed_by ON movie.mid = directed_by.msid
INNER JOIN director director ON directed_by.did = director.did
WHERE director.name = Martin Scorsese

which, executed on the IMDb database, returns Taxi Driver, Goodfellas, Silence, ...

Figure 1.1: Example of translating a question to a query.

This dissertation is organized as follows. Chapter 2 first gives a brief overview of SQL syntax and neural networks, and discusses the related literature on the state-of-the-art of existing natural-language-to-SQL (NL2SQL) solutions. Chapter 3 discusses the construction and details of the dataset, followed by Chapter 4, which presents the solutions that tackle this problem. The experiments and their results are shown in Chapter 5. Finally, conclusions are drawn in Chapter 6.
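To make the last step of Figure 1.1 concrete, the sketch below executes a generated query against a relational database. It assumes an SQLite copy of the IMDb data in a file named imdb.db with the schema shown above; both the file name and the use of SQLite are assumptions for illustration only.

import sqlite3

# Illustrative only: assumes "imdb.db" is an SQLite copy of the IMDb database
# with the movie, directed_by and director tables from Figure 1.1.
generated_query = """
    SELECT title
    FROM movie
    INNER JOIN directed_by ON movie.mid = directed_by.msid
    INNER JOIN director ON directed_by.did = director.did
    WHERE director.name = ?
"""

connection = sqlite3.connect("imdb.db")
# the condition value is passed as a parameter here instead of a literal
answers = connection.execute(generated_query, ("Martin Scorsese",)).fetchall()
for (title,) in answers:
    print(title)          # e.g. Taxi Driver, Goodfellas, Silence, ...
connection.close()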

Chapter 2
Related Literature

Our system translates natural language questions to SQL queries using neural networks. This chapter explains the most essential concepts behind the system: SQL and neural networks. It starts by briefly explaining the SQL syntax, after which it considers feedforward and recurrent neural networks. After explaining these concepts, the third part zooms in on the related literature on the natural-language-to-SQL (NL2SQL) problem.

2.1 SQL: syntax and usage

Relational databases store a vast amount of today's information and are a component of many applications. Accessing these relational databases requires understanding query languages such as Structured Query Language (SQL). SQL supports four fundamental operations, referred to as CRUD: Create, Read, Update and Delete. In this work, we focus on retrieving information using SQL. The general form for retrieving information from one table is as follows:

SELECT column_names
FROM table_name
WHERE condition
(ORDER BY sort-order)

If we need to combine two or more tables to retrieve all the necessary columns, the JOIN keyword is added. More specifically, INNER JOIN selects records that have matching values in both tables. The general form is as follows:

SELECT column_names
FROM table_name1
INNER JOIN table_name2 ON table_name1.column_name1 = table_name2.column_name2
WHERE condition

In the remainder of this work, queries on one table and queries on multiple tables will be called respectively simple and composite queries.

2.2 Neural networks

Goldberg [2015] describes a neural network as a computational model that is inspired by the way biological neural networks in the human brain process information. The basic unit of computation in a neural network is the neuron (often called a node or unit), which receives input from other nodes or from an external source and computes an output. Each input has an associated weight, which is assigned based on its relative importance to other inputs. The node applies a non-linear function f (the activation function) to the weighted sum of its inputs. The output y of a single neuron can be calculated as follows:

y = f(W^T x + b) = f\left(\sum_{i=1}^{n} W_i x_i + b\right)    (2.1)

where W (= [w_1, ..., w_n]), x (= [x_1, ..., x_n]) and b are respectively the weight vector, the input vector and the bias. The main function of the bias is to provide every node with a trainable constant value. The activation functions that are mentioned in this dissertation are as follows:

Sigmoid: real-valued input, output between 0 and 1:

\sigma(x) = \frac{1}{1 + \exp(-x)}    (2.2)

Tanh: real-valued input, output between -1 and 1:

\tanh(x) = 2\sigma(2x) - 1    (2.3)

ReLU: real-valued input, negative values replaced with zero:

f(x) = \max(0, x)    (2.4)
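As a small illustration of Equation 2.1 and the three activation functions, the following PyTorch snippet computes the output of a single neuron; the numbers are arbitrary example values, not values from this work.

import torch

# A single artificial neuron: y = f(W^T x + b), with example values.
x = torch.tensor([1.0, 2.0, -1.0])      # input vector
W = torch.tensor([0.5, -0.3, 0.8])      # weight vector
b = torch.tensor(0.1)                   # bias

z = W @ x + b                           # weighted sum of the inputs
print(torch.sigmoid(z))                 # sigmoid activation, output in (0, 1)
print(torch.tanh(z))                    # tanh activation, output in (-1, 1)
print(torch.relu(z))                    # ReLU activation, negative values set to zero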

Optimizer

Neural networks use optimizers to minimize a loss function, which depends on the model's internal parameters. There are two popular gradient descent optimizers: stochastic gradient descent and Adaptive Moment Estimation. In the next paragraphs, the concept of gradient descent and the two optimizers are discussed.

Gradient descent is by far the most common way to optimize neural networks. It is a way to minimize a loss function L(\theta), parameterized by the model's parameters \theta \in \mathbb{R}^d, by updating the parameters in the opposite direction of the gradient \nabla_\theta L(\theta) of the loss function w.r.t. the parameters. The learning rate \eta determines the size of the steps we take to reach a (local) minimum. A learning rate that is too small leads to slow convergence, while a learning rate that is too large can hinder convergence (Ruder [2017]).

Stochastic Gradient Descent (SGD) performs a parameter update for each training example x^{(i)} and label y^{(i)}:

\theta = \theta - \eta \nabla_\theta L(\theta; x^{(i)}, y^{(i)})    (2.5)

where \eta is the learning rate, \theta are the model's parameters and L(\theta) is the loss function. SGD performs frequent updates with a high variance, which causes the loss function to fluctuate heavily. The algorithm has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another, which are common around local optima. Figure 2.1 shows SGD oscillating across the slopes of the ravine while only making hesitant progress along the bottom towards the local minimum.

Adaptive Moment Estimation (Adam) uses a few tricks to improve on SGD. One of these tricks is momentum, which solves the ravine problem. Momentum consists of adding some fraction of the previous update to the current update, so that repeated updates in a particular direction compound: momentum builds up and the parameters move faster and faster in the direction of the minimum. In the case of the ravine, momentum builds up in the direction of the minimum, since all updates have a component in that direction. Figure 2.1 shows the impact that momentum has on converging to a minimum. Another trick that Adam uses is to adaptively select a separate learning rate for each parameter. This speeds up learning in cases where the appropriate learning rates vary across parameters, and it makes tuning of the learning rate less important, because performance is less sensitive to it. Kingma and Ba [2014] show empirically that Adam works well in practice and compares favorably to other adaptive learning-method algorithms.
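The snippet below sketches how the two optimizers are selected and used in PyTorch, the framework used later in this work. The toy model, data and learning rates are illustrative values only, not the settings of the actual experiments.

import torch

# A toy model and loss, purely for illustration.
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()

# Either optimizer minimizes the loss w.r.t. the model parameters theta.
# optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

x = torch.randn(32, 10)
y = torch.randn(32, 1)
for step in range(100):
    optimizer.zero_grad()           # clear accumulated gradients
    loss = loss_fn(model(x), y)     # forward pass and loss L(theta)
    loss.backward()                 # backpropagation: compute grad_theta L(theta)
    optimizer.step()                # parameter update, e.g. theta <- theta - eta * grad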

Figure 2.1: SGD without momentum versus Adam with momentum.

Feedforward neural networks

The first and simplest type of neural network is the feedforward neural network, where the information flows in only one direction: forward, from the input nodes, through the hidden nodes (if any), to the output nodes. An example of a feedforward neural network with one hidden layer can be found in Figure 2.2. The input nodes x_1, ..., x_K are connected with the hidden nodes h_1, ..., h_N through the associated weights {w_{ki}}. These hidden nodes are connected with the output nodes y_1, ..., y_M through the associated weights {w_{ij}}.

Recurrent neural networks

When dealing with language data, it is very common to work with sequences of variable length, such as words (sequences of letters) and sentences (sequences of words). A traditional feedforward neural network assumes that all inputs (and outputs) are independent of each other. Recurrent Neural Networks (RNNs), on the other hand, are called recurrent because they perform the same task for every element of the sequence, which makes them better suited for language tasks. The parameters are shared across the steps.

Figure 2.2: Example of a feedforward neural network with one hidden layer.

As seen in Figure 2.3, the RNN can be written out for the complete sequence, also called unfolding or unrolling. The formulas that describe the computation in an RNN are as follows:

x_t is the input at time step t. For example, x_1 could be a one-hot vector corresponding to the second word of a sentence.

h_t is the hidden state at time step t, which can be seen as the memory of the network. The hidden state is calculated based on the previous hidden state and the input at the current step:

h_t = f(x_t, h_{t-1})    (2.6)

where f is typically one of the activation functions described in Section 2.2. The hidden state required to calculate the first hidden state h_0 is generally initialized to all zeroes.

o_t is the output at step t and is, in the simplest model, the Elman network by Elman [1990], calculated as a function of the memory at time t, h_t:

o_t = g(h_t)    (2.7)

where g is typically one of the activation functions described in Section 2.2.
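A minimal Elman-style RNN, written out explicitly to show the recurrence of Equations 2.6 and 2.7, could look as follows. This is a sketch that assumes tanh for both f and g; in practice PyTorch also offers a built-in torch.nn.RNN.

import torch

class ElmanRNN(torch.nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # one weight matrix applied to the concatenation of x_t and h_{t-1}
        self.in2hidden = torch.nn.Linear(input_size + hidden_size, hidden_size)
        self.hidden2out = torch.nn.Linear(hidden_size, output_size)

    def forward(self, inputs):
        # inputs: an iterable of tensors of shape (1, input_size)
        h = torch.zeros(1, self.in2hidden.out_features)   # h_0 initialized to all zeroes
        outputs = []
        for x_t in inputs:                                 # same parameters reused at every step
            h = torch.tanh(self.in2hidden(torch.cat([x_t, h], dim=1)))   # h_t = f(x_t, h_{t-1})
            outputs.append(torch.tanh(self.hidden2out(h)))               # o_t = g(h_t)
        return outputs, h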

Figure 2.3: Unfolding in time of an RNN, figure from Olah [2015].

A simple RNN is hard to train effectively because of the vanishing gradients problem. When error signals (gradients) are passed back through many time steps, they tend to diminish quickly in the backpropagation process, which makes it hard for the RNN to capture long-range dependencies. For example, when using an Elman RNN with a tanh activation function, the gradients are in the range (-1, 1). Backpropagation computes gradients by the chain rule, which has the effect of multiplying n of these small numbers to compute the gradient of the first unrolled layer for an input sequence of length n, resulting in a small gradient. Activation functions such as ReLU suffer less from the vanishing gradient problem, because they only saturate in one direction. In the following subsections, two extensions to the RNN architecture, the LSTM and the GRU, are discussed that solve this problem.

Long Short-Term Memory

The Long Short-Term Memory (LSTM) architecture was designed by Hochreiter and Schmidhuber [1997] to solve the vanishing gradients problem. LSTMs also have a chain-like structure, but the repeating module has a different structure: instead of having a single neural network layer, there are four layers interacting in a special way. The key to LSTMs is the cell state, the horizontal line running through the top of the diagram (see Figure 2.4). The LSTM has the ability to remove or add information to the cell state, carefully regulated by structures called gates.

The gating layers have a sigmoid activation function, which outputs numbers between zero and one. These outputs are multiplied by the components, such that a value of zero means "let nothing through", while a value of one means "let everything through". The first layer on the left is the forget gate layer f_t, which decides what information from the cell state is thrown away:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)    (2.8)

where \sigma is the sigmoid function, W_f is a trainable matrix and h_{t-1}, x_t and b_f are respectively the previous output, the input and the bias. Second, a \sigma-layer called the input gate layer i_t decides which values will be updated. Next, a tanh layer creates a vector of new candidate values, \tilde{C}_t, that could be added to the state:

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)    (2.9)

\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)    (2.10)

where \sigma is the sigmoid function, W_i and W_C are trainable matrices and h_{t-1}, x_t, b_i and b_C are respectively the previous output, the input and the biases. These two steps are combined to update the cell state C_t. The output gate layer o_t decides which parts will be output:

C_t = f_t * C_{t-1} + i_t * \tilde{C}_t    (2.11)

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)    (2.12)

h_t = o_t * \tanh(C_t)    (2.13)

where \sigma is the sigmoid function, W_o is a trainable matrix and h_{t-1}, x_t and b_o are respectively the previous output, the input and the bias.
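The equations above translate almost directly into code. The sketch below implements one LSTM step following Equations 2.8-2.13; it is illustrative only, and in practice one would use the built-in torch.nn.LSTM.

import torch

class LSTMCellSketch(torch.nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        concat = input_size + hidden_size
        self.W_f = torch.nn.Linear(concat, hidden_size)   # forget gate
        self.W_i = torch.nn.Linear(concat, hidden_size)   # input gate
        self.W_c = torch.nn.Linear(concat, hidden_size)   # candidate values
        self.W_o = torch.nn.Linear(concat, hidden_size)   # output gate

    def forward(self, x_t, h_prev, c_prev):
        hx = torch.cat([h_prev, x_t], dim=1)              # [h_{t-1}, x_t]
        f_t = torch.sigmoid(self.W_f(hx))                 # (2.8)
        i_t = torch.sigmoid(self.W_i(hx))                 # (2.9)
        c_tilde = torch.tanh(self.W_c(hx))                # (2.10)
        c_t = f_t * c_prev + i_t * c_tilde                # (2.11)
        o_t = torch.sigmoid(self.W_o(hx))                 # (2.12)
        h_t = o_t * torch.tanh(c_t)                       # (2.13)
        return h_t, c_t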

Gated Recurrent Unit

The Gated Recurrent Unit (GRU), proposed by Cho et al. [2014], is a variant of the LSTM that combines the forget and input gates f_t and i_t (Equations 2.8 and 2.9) into a single update gate z_t. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular. The architecture of a GRU cell can be found in Figure 2.5. The update and reset gates z_t and r_t are calculated as follows:

z_t = \sigma(W_z \cdot [h_{t-1}, x_t])    (2.14)

r_t = \sigma(W_r \cdot [h_{t-1}, x_t])    (2.15)

where W_z and W_r are trainable matrices and h_{t-1} and x_t are respectively the previous output and the input. The candidate activation \tilde{h}_t and the activation h_t are calculated as follows:

\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t])    (2.16)

h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t    (2.17)

where W is a trainable matrix, z_t and r_t are respectively the update and reset gate, and h_{t-1} and x_t are respectively the previous output and the input. The most prominent feature shared between the LSTM unit and the GRU unit is the additive component of their update from t to t+1, which is lacking in the traditional RNN. This has two advantages: it makes it easy for each unit to remember the existence of a specific feature in the input stream for a long series of steps, and it creates shortcut paths that bypass multiple temporal steps (Chung et al. [2014]).
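Analogously, one GRU step following Equations 2.14-2.17 can be sketched as follows (a sketch without biases, as in the equations above; PyTorch also provides a built-in torch.nn.GRU).

import torch

class GRUCellSketch(torch.nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        concat = input_size + hidden_size
        self.W_z = torch.nn.Linear(concat, hidden_size, bias=False)   # update gate
        self.W_r = torch.nn.Linear(concat, hidden_size, bias=False)   # reset gate
        self.W = torch.nn.Linear(concat, hidden_size, bias=False)     # candidate activation

    def forward(self, x_t, h_prev):
        hx = torch.cat([h_prev, x_t], dim=1)                            # [h_{t-1}, x_t]
        z_t = torch.sigmoid(self.W_z(hx))                               # (2.14)
        r_t = torch.sigmoid(self.W_r(hx))                               # (2.15)
        h_tilde = torch.tanh(self.W(torch.cat([r_t * h_prev, x_t], dim=1)))   # (2.16)
        return (1 - z_t) * h_prev + z_t * h_tilde                       # (2.17)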

Figure 2.5: Graphical representation of a GRU unit, figure from Olah [2015].

In the LSTM unit, the amount of memory content that is exposed is controlled by the output gate o_t, while the GRU exposes its full content without any control. Also, the LSTM unit controls the amount of new memory content being added to the memory cell independently from the forget gate, whereas in the GRU this control is tied to the update gate. In general, it is difficult to conclude which type of gating unit performs better. In this dissertation, the GRU is picked because Bahdanau et al. [2014] reported that these two units performed comparably to each other on machine translation, and because the GRU is in general more efficient due to its less complex structure.

Sequence-to-sequence model

In machine translation, input sequences and output sequences often have different lengths, and the entire input sequence is required before prediction of the target can start. Sutskever et al. [2014] from Google presented a general end-to-end approach to sequence-to-sequence learning. The idea is to use one RNN (the encoder) to read the input sequence, one timestep at a time, and obtain a large fixed-dimensional vector representation: the context vector. Another RNN (the decoder) is used to unfold this vector into a new sequence. Figure 2.6 shows a sequence-to-sequence network for translating French to English. The encoder outputs the context vector, which the decoder unfolds into a translated sequence. As can be seen in the figure, the decoder uses its own outputs as inputs.

Figure 2.6: High-level representation of the encoder-decoder model. The encoder reads the French input "Le chat est noir <EOS>" into a fixed-dimension context vector, which the decoder unfolds into the English output "The cat is black <EOS>".

Normally, the words in the sentence are first embedded, but this is omitted in the figure for the sake of simplicity. This model (and possible improvements, e.g. by using an attention mechanism) will be further discussed in Chapter 4.

2.3 Natural-language-to-SQL: NL2SQL

In this section, the existing state-of-the-art on the natural-language-to-SQL (NL2SQL) problem is discussed. The following subsections cover the two different approaches: semantic parsing and neural networks.

Semantic parsing approaches

The primary approach to solve NL2SQL problems is semantic parsing. Semantic parsing is an approach to translate text to a formal meaning representation such as logical forms or structured queries. There have been many works considering parsing a natural language description into a logical form, such as Zelle and Mooney [1996], Zettlemoyer and Collins [2007], Quirk et al. [2015] and Chen et al. [2016]. Most previous systems rely on high-quality lexicons, manually-built templates, and features which are either domain- or representation-specific. They would need to be fine-tuned to the specific domain of interest, and may not generalize. For these reasons, this work focuses on neural network approaches to handle the NL2SQL task, which require less feature engineering.

An example of a semantic parser is SQLizer by Yaghmazadeh et al. [2017].

It consists of an off-the-shelf parser that translates a natural language question into a sketch, which only specifies the shape (rather than the full content) of the query (e.g., join followed by selection followed by projection). Employing programming language techniques such as type-directed sketch completion and automatic repairing, their model iteratively refines the sketch into the final query. A schematic overview of their approach can be found in Figure 2.7.

Figure 2.7: Schematic overview of SQLizer, figure from Yaghmazadeh et al. [2017].

Neural network approaches

A second approach to solve NL2SQL problems uses neural networks, in particular the encoder-decoder architecture. Dong and Lapata [2016] created such an encoder-decoder architecture that performs competitively with the existing semantic parsers, without using hand-engineered features, and that is easy to adapt across domains and meaning representations.

Recently, a new dataset on NL2SQL has been released by Zhong et al. [2017] from Salesforce: WikiSQL. It is a corpus of 80,654 hand-annotated instances of natural language questions, SQL queries and SQL tables extracted from 24,241 HTML tables from Wikipedia. The following properties make it a desirable dataset:

- It is an order of magnitude larger than previous semantic parsing datasets, which makes it interesting for data-hungry neural networks.
- The natural language questions are created by human beings (employing crowdsourcing on Amazon Mechanical Turk).
- Synthesizing the SQL query does not rely on the table's content.

The three papers that have competitive scores on the WikiSQL task are discussed in the following subsections: Seq2SQL, SQLNet and Pointing out SQL queries from text (Zhong et al. [2017], Xu et al. [2017], Wang et al. [2017]).

Seq2SQL

Seq2SQL by Zhong et al. [2017] leverages the structure of SQL to reduce the output space of the generated query. The input sequence is the concatenation of all the column names, the question and the SQL vocabulary. Seq2SQL is composed of three parts that correspond to the aggregation operator, the SELECT column and the WHERE clause. The first two components use a cross-entropy loss; the last one uses policy gradient for training.

Aggregation operator. To compute the aggregation operation, a scalar attention score is calculated for each token t in the input sequence. This vector is normalized to produce a distribution over the input encodings. The input representation is the sum over the input encodings weighted by the normalized scores. The score over the aggregation operators (COUNT, MIN, MAX, ...) is obtained by applying a multilayer perceptron to this input representation.

SELECT column. First, each column name is encoded with an LSTM. The input representation is similar to the one used for the aggregation operation, but with untied weights. The score for each column j is obtained by applying a multilayer perceptron over the column representations, conditioned on the input representation.

WHERE clause. Using an encoder-decoder network, the decoder produces a scalar attention score for each position t of the input sequence. The input token with the highest score is chosen as the next token of the generated SQL query. To address the problem of wrongly penalizing queries whose execution results are correct despite not being an exact string match, reinforcement learning is applied: it learns a policy to directly optimize the expected correctness of the execution result.

SQLNet

Xu et al. [2017] want to avoid the necessity of reinforcement learning by avoiding the order-matters problem in a sequence-to-sequence model.

structure of a SQL query. A neural network is used to predict the content for each slot in the sketch.

To predict the WHERE clause, the following models are trained. First, the total number K of columns to be included is predicted. Using an upper bound N on the number of columns, the problem is cast as an (N + 1)-way classification problem. Afterwards, the columns with the highest $P_{\text{wherecol}}(\text{col} \mid Q)$ are picked, where col is a column name and Q is the natural language question. These probabilities are calculated as follows:

$P_{\text{wherecol}}(\text{col} \mid Q) = \sigma(u_c^\top E_{\text{col}} + u_q^\top E_Q)$   (2.18)

where $\sigma$ is the sigmoid function, $E_{\text{col}}$ and $E_Q$ are the embeddings of the column name and the natural language question, and $u_c$ and $u_q$ are two column vectors of trainable variables. They also introduce the column attention mechanism to compute $E_{Q|\text{col}}$ instead of $E_Q$. This mechanism ensures that the most relevant information in the natural language question is used when predicting on a particular column.

Second, predicting the operator slot (choosing from =, >, <) is a 3-way classification. Therefore we compute:

$P_{\text{op}}(i \mid Q, \text{col}) = \operatorname{softmax}\big(U_1^{\text{op}} \tanh(U_c^{\text{op}} E_{\text{col}} + U_q^{\text{op}} E_{Q|\text{col}})\big)$   (2.19)

where col is the column under consideration, $E_{\text{col}}$ and $E_{Q|\text{col}}$ are the embeddings of the column name and the natural language question using the column attention mechanism, and $U_1^{\text{op}}$, $U_c^{\text{op}}$, $U_q^{\text{op}}$ are trainable matrices of size $3 \times d$, $d \times d$ and $d \times d$ respectively.

Third, for the value slot, a substring of the natural language question is predicted. SQLNet employs a sequence-to-sequence structure where the encoder still uses a bidirectional LSTM and the decoder computes the distribution of the next token using a pointer network. The probability of the next token can be computed as:

$P_{\text{val}}(i \mid Q, \text{col}, h) = \operatorname{softmax}(a(h))$   (2.20)

$a(h)_i = (u_a^{\text{val}})^\top \tanh(U_1^{\text{val}} H_Q^i + U_2^{\text{val}} E_{\text{col}} + U_3^{\text{val}} h), \quad i \in \{1, \ldots, L\}$   (2.21)

where $u_a^{\text{val}}$ is a d-dimensional trainable vector, $U_1^{\text{val}}$, $U_2^{\text{val}}$, $U_3^{\text{val}}$ are three trainable matrices of size $d \times d$, L is the length of the natural language question, h is the

hidden state of the previously generated sequence, and $H_Q^i$ is the LSTM output for each token in the natural language question.

The prediction of the column name in the SELECT clause is quite similar to the prediction of the column names in the WHERE clause, with the restriction that only one column among all has to be selected. The aggregation operator is predicted similarly to the operator slot.

Pointing out SQL queries from text

Wang et al. [2017] published the third paper that designed and trained models on the WikiSQL dataset. Their model encodes the input with a bidirectional LSTM and then decodes the hidden state with a typed LSTM. Based on the type, the decoder either copies an output token from the input question using an attention-based copying mechanism or generates it from a fixed vocabulary. The SQL grammar from WikiSQL can be written in regular expression form as:

Select s c From t Where (c op v)

where s, c, t, v are respectively the aggregation operator, the column name, the table name and the value slot. This ensures that the type of the next output token is statically determined.

2.4 Conclusion

The existing state of the art on the NL2SQL problem follows two different approaches: semantic parsing and neural networks. Because most of the previous semantic parsing solutions relied on high-quality lexicons and feature engineering, this work focuses on neural network approaches without hand-engineered features. Currently, the largest dataset of question-query pairs is WikiSQL. The three papers that have competitive scores on the WikiSQL task are Seq2SQL, SQLNet and Pointing out SQL queries from text (Zhong et al. [2017], Xu et al. [2017], Wang et al. [2017]). The main differences between the papers are as follows:

- Zhong et al. [2017] used reinforcement learning to solve the order-matters problem in a sequence-to-sequence model.

- Xu et al. [2017] wanted to avoid reinforcement learning and proposed a sketch-based approach, which only specifies the shape of the query.
- Wang et al. [2017] used a typed decoder that predicts the next token based on its statically determined type.

Despite the several advantages of the WikiSQL dataset over previous semantic parsing datasets, it only consists of simple SQL queries on one table. This work will focus on training an end-to-end encoder-decoder model on composite SQL queries over multiple tables. The construction of the dataset consisting of these question-query pairs will be discussed in the next chapter.

Chapter 3
Dataset

This chapter discusses the construction of the dataset, consisting of question-query pairs. The NL2SQL solutions from the previous chapter used the WikiSQL dataset by Zhong et al. [2017]. Although this dataset offers several benefits, discussed in Section 2.3, it has also received criticism for its massive simplification of the SQL grammar. The dataset lacks any complex operator of the SQL grammar, e.g. JOIN or GROUP BY.

The new dataset will consist of two subdatasets: a dataset containing simple SQL queries and one with composite SQL queries. The simple SQL queries have a similar syntax to the queries from the WikiSQL dataset, while the composite SQL queries will also include the JOIN operator. The simple SQL queries are included so that the WikiSQL models can be run on them, which lets us make sure that the question-query pairs are not trivially composed. The dataset should consist of the following parts:

- Natural language questions, created by humans, in order to overcome the issue that a well-trained model may overfit on template-synthesized descriptions.
- Ground truth SQL queries that are paired with the natural language questions.

We have written the SQL queries based on the IMDb database (the same database as in Yaghmazadeh et al. [2017]; it can be found at goo.gl/dbubmm), consisting of the following tables: movie, actor, director, producer, writer, company and genre. Figure 3.1 shows these tables and their relations. The dark grey tables are the bridge tables that link the movie table with the other tables through two 1-to-many relations, because e.g. one movie can have multiple directors and one director can direct multiple movies. Using these bridge tables, the M:N relations between movie and director can be satisfied.

Figure 3.1: Entity-relationship model of the IMDb database

To gather human questions, the SimpleQuestions dataset from the bAbI project (Bordes et al. [2015]) proves to be very practical. It consists of a total of 108,442 questions, written in natural language by human English-speaking annotators. Each of these questions is paired with a corresponding fact, formatted as (subject, relationship, object). The facts have been extracted from the knowledge base Freebase. An example of a Freebase fact and corresponding question can be found in Table 3.1. The subject and object URLs are deprecated, but they are not necessary for the further construction of the dataset.

Table 3.1: Example of a Freebase tuple (subject, relationship, object) and corresponding question: "Peter Greenhalgh was the cinematographer for what film?" (the subject, relationship and object are Freebase URLs).

The following steps are followed in order to filter the questions:

1. All the questions that can be answered using the IMDb database are extracted, resulting in 8,605 questions. They are lowercased and punctuation is removed.
2. The named entities (names of persons, titles of movies) are replaced by general tags (e.g. Brad Pitt becomes [@actor]). The list of tags is as follows: [@actor], [@director], [@film], [@country], [@year], [@producer], [@production company], [@writer] and [@genre]. It is possible that some questions contain two or three tags, e.g. indicating an actor from a specific genre. The distribution of the amount of tags per question can be found in Table 3.2. Question examples can be found in Table 3.3. Luckily, Bordes et al. [2015] provided a text file entities.txt which contains almost all the entities. We added extra entities in order to replace all the names. The questions without entities are deleted.
3. Using 95 regular expressions to filter the duplicates, this results in 1,540 unique template questions. The distribution over the multiple relationships can be found in Table 3.4.
4. Using these template questions, a dataset can easily be created by replacing all the tags with e.g. ten random values from the IMDb database (a small sketch of this expansion step is given below).
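To make step 4 concrete, the following is a minimal sketch of the tag-replacement step, assuming the template question-query pairs and the per-tag value lists from the IMDb database have already been loaded; the function and variable names are illustrative and this is not the exact script used to build the dataset.

```python
import random

def expand_template(template_question, template_query, values_per_tag, n=10):
    """Generate n question-query pairs by replacing every [@tag] with a random value."""
    pairs = []
    for _ in range(n):
        question, query = template_question, template_query
        for tag, values in values_per_tag.items():
            if tag in question:
                value = random.choice(values)   # e.g. a random movie title for [@film]
                question = question.replace(tag, value)
                query = query.replace(tag, value)
        pairs.append((question, query))
    return pairs

# illustrative usage with a hypothetical template and value list
values_per_tag = {"[@film]": ["Deadpool 2", "Avengers: Infinity War"]}
pairs = expand_template("who wrote the movie [@film]",
                        "SELECT writer_name FROM films WHERE title = '[@film]'",
                        values_per_tag, n=2)
```

Because every tag occurrence in both the question and the query is replaced with the same value, the generated pair stays consistent by construction.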

Table 3.2: Distribution of the amount of tags in a question

Relationship | Questions | Tags
director/film | what film was [@director] the director for | [@director]
director/film | what film is [@director] known for directing | [@director]
director/film | what is a [@genre] film directed by [@director] | [@director], [@genre]
film/written by | who wrote for the film [@film] | [@film]
film/written by | who wrote the movie [@film] | [@film]
film/written by | who wrote the movie [@film] ([@year] film) | [@film], [@year]

Table 3.3: Question examples

Table 3.4: Amount of unique template questions per relationship (1,538 template questions in total)

The combination of the relationship and the tags determines the ground truth SQL query. There are 40 unique combinations of relationship and tags, which we hand-annotated with SQL queries. Examples of simple and composite question-query pairs can be found in Table 3.5; the complete dataset can be downloaded at nl2sql-dataset/overview. The distribution of the lengths of the input sentences and the SQL queries can be found in Figure 3.2.

Question: what film is Big Daddy known for directing
Simple query: SELECT title FROM films WHERE director_name = 'David Mackay'
Composite query: SELECT title FROM movie movie INNER JOIN directed_by directed_by ON movie.mid = directed_by.msid INNER JOIN director director ON directed_by.did = director.did WHERE director.name = 'David Mackay'

Question: who wrote the movie Angel Back
Simple query: SELECT writer_name FROM films WHERE title = 'Angel Back'
Composite query: SELECT name FROM movie movie INNER JOIN written_by written_by ON movie.mid = written_by.msid INNER JOIN writer writer ON written_by.wid = writer.wid WHERE movie.title = 'Angel Back'

Table 3.5: Question-query examples

Due to the lengthy JOIN operations, the composite SQL queries are quite a bit longer than the simple SQL queries and the input sentences. The length distribution of the composite SQL queries has two centers, depending on the number of JOIN operations. The lengths of the input sentences are centered around eight tokens, while the lengths of the simple queries lie in the range [10, 20].

Figure 3.2: Length distribution over the input sentences and queries

Chapter 4
Methodology

This chapter discusses the techniques that are used to build our models, while Chapter 5 will examine the results of these techniques. The first two sections introduce a simple model and a model designed for the WikiSQL dataset, which will serve as baselines on our dataset. Finally, the advanced encoder-decoder model and its extensions are examined.

4.1 GloVe-based model

The first baseline is based on the idea that similar questions should have similar queries. It uses GloVe and cosine similarity for, respectively, the vector representation and the similarity measurement of the questions, which are both explained in the first subsection. The second subsection discusses the method, which is applicable to both simple and composite queries.

4.1.1 GloVe: Global Vectors for Word Representation

GloVe by Pennington et al. [2014] is an unsupervised learning algorithm for obtaining vector representations for words. It is a count-based model, where training is performed on the non-zero entries of a global word-word co-occurrence matrix, which tabulates how frequently words co-occur with one another in a given corpus. The main intuition underlying the model is the simple observation that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning. Other word embeddings such as word2vec (by Mikolov et al. [2013]) are predictive models, which learn their predictive ability by optimizing the loss of predicting the target words from the context words, given the vector representations.

Pennington et al. [2014] published word vectors that are pre-trained on large corpora, which is convenient for us because the GloVe embedding would not work when trained on our small NL2SQL dataset. We decided to use the Common Crawl embedding, which is trained on 42 billion tokens, has a vocabulary of 1.9 million tokens and embeds these tokens in a 300-dimensional vector space.

Cosine similarity between two word vectors provides an effective method for measuring the linguistic or semantic similarity of the corresponding words. The cosine similarity between two vectors is a measure that calculates the cosine of the angle between them. The formula results from solving the equation of the dot product:

$\cos(x, y) = \dfrac{x \cdot y}{\|x\| \, \|y\|}$   (4.1)

where x and y are vectors.

4.1.2 Method

The GloVe-based model consists of the following steps:

1. All the words are lowercased and embedded into a vector using the pre-trained word embedding.
2. The vector representation of the question is the sum of all the vector representations of the words in the question.
3. For each question in the test set, the cosine similarity (described in Section 4.1.1) between that question and all the questions from the training set is calculated.
4. The query accompanying the most similar question from the training set is predicted.

A small code sketch of these steps is given below; the results of the GloVe-based model can be found in Section 5.2.
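The following is a minimal sketch of this retrieval baseline, assuming the pre-trained Common Crawl GloVe vectors have been loaded into a dictionary mapping words to 300-dimensional numpy arrays; the function and variable names are illustrative rather than taken from our actual implementation.

```python
import numpy as np

def embed_question(question, glove, dim=300):
    """Sum the GloVe vectors of all (lowercased) words in the question."""
    vec = np.zeros(dim)
    for word in question.lower().split():
        if word in glove:              # words outside the GloVe vocabulary are skipped here
            vec += glove[word]
    return vec

def cosine_similarity(x, y, eps=1e-8):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + eps)

def predict_query(test_question, train_pairs, glove):
    """Return the query paired with the most similar training question."""
    q_vec = embed_question(test_question, glove)
    best_query, best_sim = None, -1.0
    for train_question, train_query in train_pairs:
        sim = cosine_similarity(q_vec, embed_question(train_question, glove))
        if sim > best_sim:
            best_sim, best_query = sim, train_query
    return best_query
```

The baseline therefore never generates a query of its own: it can only return queries that already occur in the training set, which is exactly why it is useful as a sanity check on the dataset.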

4.2 SQLNet

In order to check that the dataset is not trivially composed, we have trained SQLNet (Xu et al. [2017]), described in Section 2.3.2, on our simple query dataset to compare the accuracy. SQLNet is the state-of-the-art model on the WikiSQL dataset and proposes a sketch-based approach to generate a SQL query. This means that different models are trained to predict each clause of the SQL query, such as the column name after SELECT or the condition after the WHERE clause. We parsed our dataset into the corresponding SQLNet format to be able to train the model. We trained the model without column attention and with fixed word embeddings; the results can be found in Section 5.2.2.

4.3 Encoder-decoder

In machine translation, input sequences and output sequences have different lengths. As already mentioned in Section 2.2.6, Sutskever et al. [2014] presented a general end-to-end approach to sequence learning. This sequence-to-sequence (seq2seq) network, or encoder-decoder network, is a model consisting of two RNNs called the encoder and the decoder. Figure 4.1 shows a high-level overview of the encoder-decoder network. The encoder reads an (embedded) input sequence $x_0, \ldots, x_n$ and outputs a single vector $h_n$, while all the other outputs $h_0, \ldots, h_{n-1}$ are discarded. The decoder reads $h_n$ to produce an output sequence $y_0, \ldots, y_k$. The combination of two RNNs ensures that the lengths of the input and output sequences can be different, as there is no explicit one-to-one relation between the input and output sequences. This is especially important in our case, because the SQL queries are most of the time longer than the questions, as can be seen in Figure 3.2.

The encoder inputs and decoder outputs are embedded with word embeddings, which will be discussed in the first subsection. The encoder and the simple decoder will be explained in the next two subsections. The fourth subsection will elaborate on a solution to the discarding of the other encoder outputs: the attention mechanism. The last subsection will examine the copy mechanism, which deals with rare words and Out-Of-Vocabulary (OOV) words.

Figure 4.1: High level overview of encoder-decoder architecture

4.3.1 Word embedding

The words from the input sentences and output queries are first embedded into a vector representation using the same GloVe embedding (Pennington et al. [2014]) as in the GloVe-based model, explained in Section 4.1.1. Both the mapping between the words and their indexes and the mapping between the indexes and their corresponding 300-dimensional vector representations are stored. The mapping of a (yet unknown) input word of the training set is as follows:

- The word is part of the GloVe vocabulary: the corresponding GloVe embedding is added.
- The word is not part of the GloVe vocabulary: a vector with random values uniformly sampled between -1 and 1 is added. We assigned each word a different vector (and not e.g. all zeroes or the <UNK> embedding) because there were quite a lot (1,233) of words that were not part of the GloVe vocabulary. This means that our new vocabulary, including the words that are not part of the GloVe vocabulary, is 29% larger than with only the GloVe vocabulary.

In both cases, the mapping between the words and their indexes is updated accordingly. The mapping of an input word of the test set, which is unknown at training time, is as follows:

- The word is present in the mapping: the corresponding vector is taken.

- The word is not present in the mapping: the GloVe embedding corresponding to the <UNK> (unknown) token is taken.

4.3.2 Encoder

The encoder architecture can be found in Figure 4.2. The input scalar $x_i$, corresponding to a word, is first embedded using the word embedding explained in Section 4.3.1. A GRU, explained in Section 2.2.5, takes as input this embedding $x_i$ together with the previous hidden state $s_{i-1}$, and produces the output $h_i$ and a new hidden state $s_i$. The first hidden state $s_0$ is initialized to all zeroes. In the simple decoder, only the encoder's last output $h_n$ is used, in contrast to the attention decoder, where all the encoder outputs $h_0, \ldots, h_n$ are needed.

4.3.3 Decoder

The decoder architecture can be found in Figure 4.3. In the simplest decoder, only the last output of the encoder, $h_n$, is used. This is called the context vector, because it encodes the context of the entire sequence. The first hidden state $s_0$ of the decoder is set equal to this context vector.

The input of the decoder depends on whether it uses teacher forcing. Teacher forcing is a method for quickly and efficiently training recurrent neural network models that use the output from a prior time step as input (Brownlee [2017]). It works by using the target token from the training dataset at the current time step, $y^{target}_{i-1}$, as input for the next time step $y_i$, rather than the output generated by the network. Using teacher forcing causes the network to converge faster, but it may also exhibit instability. PyTorch's autograd gives us the freedom (see the dynamic graph definition described in Section 5.1.1) to randomly choose whether to use teacher forcing. We decided to use a teacher forcing ratio of 0.5. So the input of the decoder is as follows:

- Teacher forcing: input $y_i$ = the previous target token $y^{target}_{i-1}$
- No teacher forcing: input $y_i$ = the previously predicted token $y_{i-1}$

The decoder embeds the input $y_{i-1}$ (in the case of no teacher forcing) in the same way as the encoder. A sketch of this random teacher-forcing choice is given below.
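As an illustration of the teacher-forcing choice described above, the following is a schematic sketch of one decoding pass; the decoder is assumed to be a module that maps an input token index and a hidden state to a log-probability vector and a new hidden state, and all names are illustrative rather than taken from our actual training code.

```python
import random
import torch

def run_decoder(decoder, context, target_tokens, sos_index, teacher_forcing_ratio=0.5):
    """Decode one target sequence, randomly choosing whether to use teacher forcing."""
    use_teacher_forcing = random.random() < teacher_forcing_ratio
    decoder_input = torch.tensor([[sos_index]])  # start-of-sequence token
    hidden = context                             # context vector h_n from the encoder
    outputs = []
    for target in target_tokens:
        log_probs, hidden = decoder(decoder_input, hidden)
        outputs.append(log_probs)
        if use_teacher_forcing:
            decoder_input = target.view(1, 1)                   # feed the ground-truth token
        else:
            decoder_input = log_probs.argmax(dim=1).view(1, 1)  # feed the predicted token
    return outputs
```

Because the choice is made once per sequence, roughly half of the training sequences are decoded with ground-truth inputs and the other half with the model's own predictions.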

Figure 4.2: Encoder architecture

A GRU takes as input this embedded vector together with the previous hidden state $s_{i-1}$, and produces an output vector $z_i$ and the next hidden state $s_i$. This output vector $z_i$ is transformed into a distribution over the vocabulary through the following operations:

1. out: a feedforward layer that applies a linear transformation to the output vector $z_i \in \mathbb{R}^{1 \times l}$, with l the number of hidden nodes:

$y_i = W_{out} z_i + b_{out}$   (4.2)

where $y_i$ is a vector in $\mathbb{R}^{1 \times m}$ with m the size of the vocabulary, $W_{out}$ is a trainable matrix and $b_{out}$ is the bias.

2. log_softmax: applies the LogSoftmax function to $y_i$ to obtain log-probabilities. The formula is as follows:

$\mathrm{LogSoftmax}(y_{i,k}) = \log\left(\dfrac{\exp(y_{i,k})}{\sum_j \exp(y_{i,j})}\right)$   (4.3)

where the $y_{i,k}$ are the elements of $y_i$.

The loss function used is the negative log-likelihood loss. By minimizing the negative log-likelihood loss function, the model is encouraged to assign higher probability values to the correct labels across training examples. The negative log-likelihood loss combined with the LogSoftmax is also called the cross-entropy loss.

Figure 4.3: Simple decoder architecture
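To make the encoder and simple decoder concrete, the following is a minimal PyTorch sketch of both modules, processing one token at a time with batch size 1. It is a simplified re-implementation under those assumptions, not our exact code: for brevity it uses a plain trainable nn.Embedding, whereas the actual model initializes the embeddings from the GloVe mapping of Section 4.3.1, and the default sizes merely mirror the values reported in Chapter 5.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim=300, hidden_size=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.gru = nn.GRU(embedding_dim, hidden_size)

    def forward(self, token_index, hidden):
        # token_index: tensor of shape (1, 1); hidden: (1, 1, hidden_size)
        embedded = self.embedding(token_index).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

class DecoderRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim=300, hidden_size=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.gru = nn.GRU(embedding_dim, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, token_index, hidden):
        embedded = self.embedding(token_index).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        # distribution over the output vocabulary (equations 4.2 and 4.3)
        log_probs = F.log_softmax(self.out(output[0]), dim=1)
        return log_probs, hidden
```

In use, the initial encoder hidden state is a zero tensor of shape (1, 1, hidden_size), and the decoder's first hidden state is set to the final encoder hidden state, which for this single-layer GRU equals the last encoder output $h_n$.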

4.3.4 Decoder with attention mechanism

As shown in Figure 4.1, the encoder-decoder model discards all the encoder outputs but the last one. If only the last encoder output (the context vector) is passed between the encoder and the decoder, that single vector carries the burden of encoding the entire sentence. The attention mechanism avoids this by encoding the whole input sequence based on the sequence of all the encoder outputs, as opposed to only the last encoder output. Attention allows the decoder network to focus on a different part of the encoder outputs for every step of the decoder's own outputs. Figure 4.4 shows the attention mechanism, where the attention weights are represented by a grayscale value and the encoder outputs as colors.

Figure 4.4: Attention mechanism. The attention weights are values in the range [0, 1], which are mapped to [black, white]. The encoder outputs are multiplied and changed accordingly.

Our implementation of the attention mechanism works as follows (a code sketch is given after Figure 4.5):

1. A set of attention weights $\alpha_{i,j}$ is calculated. Each weight $\alpha_{i,j}$ is a normalized attention energy $e_{i,j}$:

$\alpha_{i,j} = \dfrac{\exp(e_{i,j})}{\sum_k \exp(e_{i,k})}$   (4.4)

where each attention energy $e_{i,j}$ is calculated with a score function, using the last hidden state $s_{i-1}$ and the particular encoder output $h_j$:

$e_{i,j} = \mathrm{score}(s_{i-1}, h_j)$   (4.5)

Luong et al. [2015] proposed the following score functions:

$\mathrm{score}(s_{i-1}, h_j) = \begin{cases} s_{i-1}^\top h_j & \text{dot} \\ s_{i-1}^\top W_a h_j & \text{general} \\ v_a^\top \tanh(W_a [s_{i-1}; h_j]) & \text{concat} \end{cases}$   (4.6)

where $W_a$ and $v_a$ are respectively a trainable matrix and a trainable vector. We decided to use the general form, which is the dot product between the decoder hidden state $s_{i-1}$ and a linear transformation of the corresponding encoder output $h_j$.

2. These attention weights $\alpha_{i,j}$ are multiplied by the encoder output vectors $h_1, \ldots, h_n$ to create a weighted combination, the context vector $c_i$:

$c_i = \sum_{j=1}^{n} \alpha_{i,j} h_j$   (4.7)

3. The context vector $c_i$ is concatenated with the decoder's input $y_{i-1}$ and serves as the input of the GRU:

$z_i, s_i = f([c_i, y_{i-1}], s_{i-1})$   (4.8)

where f is the GRU, $s_i$ and $s_{i-1}$ are respectively the new and the previous hidden state, and $z_i$ is the output vector.

4. A feedforward layer applies a linear transformation to the concatenation of the GRU output vector $z_i$ and the context vector $c_i$:

$y_i = W_{out} [z_i, c_i] + b_{out}$   (4.9)

where $y_i$ is a vector in $\mathbb{R}^{1 \times m}$ with m the size of the vocabulary, $W_{out}$ is a trainable matrix and $b_{out}$ is the bias.

5. Finally, $y_i$ is passed through a LogSoftmax:

$\mathrm{LogSoftmax}(y_{i,k}) = \log\left(\dfrac{\exp(y_{i,k})}{\sum_j \exp(y_{i,j})}\right)$   (4.10)

where the $y_{i,k}$ are the elements of $y_i$.

The architecture of the attention decoder can be seen in Figure 4.5. Except for the attention mechanism and the additional layers, it is the same as the simple decoder architecture. For the sake of brevity, the embedding of the input is left out of the figure.

Figure 4.5: Attention decoder architecture
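The following is a minimal sketch of the attention decoder with the "general" score function, again assuming batch size 1 and one decoding step at a time; the class and parameter names are illustrative, the embedding caveat from the previous sketch applies here as well, and this is a simplified version rather than our exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoderRNN(nn.Module):
    """Decoder with Luong-style 'general' attention over the encoder outputs."""
    def __init__(self, vocab_size, embedding_dim=300, hidden_size=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.attn = nn.Linear(hidden_size, hidden_size, bias=False)  # W_a in equation 4.6
        self.gru = nn.GRU(embedding_dim + hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size * 2, vocab_size)            # W_out in equation 4.9

    def forward(self, token_index, hidden, encoder_outputs):
        # encoder_outputs: (input_length, hidden_size); hidden: (1, 1, hidden_size)
        embedded = self.embedding(token_index).view(1, 1, -1)

        # attention energies e_ij = s_{i-1}^T W_a h_j, normalized to weights (eq. 4.4-4.6)
        energies = torch.matmul(self.attn(encoder_outputs), hidden[0, 0])
        attn_weights = F.softmax(energies, dim=0)

        # context vector c_i = sum_j alpha_ij h_j (eq. 4.7)
        context = torch.matmul(attn_weights.unsqueeze(0), encoder_outputs)  # (1, hidden_size)

        # GRU over the concatenation of context vector and embedded input (eq. 4.8)
        rnn_input = torch.cat((context.view(1, 1, -1), embedded), dim=2)
        output, hidden = self.gru(rnn_input, hidden)

        # output layer over [z_i, c_i], followed by LogSoftmax (eq. 4.9-4.10)
        log_probs = F.log_softmax(self.out(torch.cat((output[0], context), dim=1)), dim=1)
        return log_probs, hidden, attn_weights
```

At every step the sketch also returns the attention weights, which is what makes the attention visualizations in Chapter 5 and the copy mechanism of the next subsection possible.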

4.3.5 Copy mechanism

In machine translation, it is often necessary that the decoder can copy tokens from the input sequence. Specifically, in our NL2SQL setting the decoder should be able to copy actor names, titles, etc. from the input sequence. These are often Out-Of-Vocabulary (OOV) words that are not present in the training set. This is why the major improvements over the encoder-decoder model with attention are in predicting the last operand of the WHERE condition, which is often a word from the input sequence. Our copy mechanism works in two steps:

1. The first step consists of predicting whether to copy a token from the input sequence or not. This is accomplished by including a <COPY> token in the vocabulary.
2. The second step consists of replacing the <COPY> token with the input token with the highest attention value.

The loss function is changed in order to work with the copy mechanism:

1. In the first step, the algorithm checks whether the target token is in the input sequence. If this is the case, the negative log-likelihood loss between the <COPY> token and the decoder output is added to the loss. If this is not the case, the negative log-likelihood loss between the target token and the decoder output is added to the loss.
2. In the second step, the attention values should align with the input tokens that should be copied. In the case of a <COPY> token, the cross-entropy loss between the attention values and the index of the corresponding input token is added to the loss.

4.4 Conclusion

This chapter started with explaining the two models that serve as baselines for our dataset: the GloVe-based model and SQLNet. The GloVe-based model searches for the best matching question in the training set and predicts the corresponding query, working both for simple and composite queries. SQLNet is an advanced model that works only on the simple queries, on one table, and is described in Section 2.3.2.

Our encoder-decoder model, which is explained in the last section, also works on composite queries. Because the simple decoder uses only the encoder's last output, the first extension consisted of using an attention mechanism. The attention mechanism encodes the whole input sequence based on the sequence of all encoder outputs, instead of only the encoder's last output. The second extension consisted of adding a copy mechanism, providing the algorithm with the possibility of copying unseen words from the question. The following chapter will discuss the results of the techniques explained in this chapter.

Chapter 5
Experiments

In this chapter, the models proposed in Chapter 4 are evaluated and compared. It begins with an overview of the experimental setup: the deep learning framework used and how our results are evaluated. Next, it briefly touches on the results of the baseline models. The third section handles the results of the encoder-decoder model, both on simple and composite queries. At the end, the choice of hyperparameters is explained.

5.1 Experimental setup

This section begins with clarifying which deep learning framework suits our models best. Next, it gives an overview of the multiple components of the SQL query that are evaluated.

5.1.1 Deep learning framework

Given the models proposed in Chapter 4, we chose to use the deep learning framework PyTorch. We picked this framework for the following reasons:

- Dynamic graph definition: PyTorch supports dynamic graph definition. Static graph definition, used by other frameworks, means that for training an RNN the input sequence length should stay fixed (so the sentence length is fixed to some maximum value and smaller sequences are padded with zeros).
- Pythonic way: PyTorch is deeply integrated into Python.

- Tutorial: SQLNet and Seq2SQL (Xu et al. [2017], Zhong et al. [2017]) are written in PyTorch and open-sourced on GitHub (Xu et al. [2017] implemented SQLNet and rebuilt Seq2SQL; the code can be found at github.com/xiaojunxu/SQLNet). These served as guidelines for writing our own solutions.

5.1.2 Evaluation details

As discussed in Chapter 3, the dataset consists of 1,537 template questions containing tags such as [@film] or [@actor]. The tags are replaced with 10 random, corresponding values from the IMDb database. Table 5.1 shows an example of how the [@film] tag is replaced by movie titles.

Template question: Who wrote the movie [@film]?
Generated question 1: Who wrote the movie Deadpool 2?
(...)
Generated question 10: Who wrote the movie Avengers: Infinity War?

Table 5.1: Example of a template question and the generated questions

The resulting 15,370 question-query pairs are split differently for the GloVe-based model and the other models, because the GloVe-based model does not use a validation set. The template questions of the training, validation and test set are separated, such that a template question from one set is not present in the other sets. Details can be found in Table 5.2.

Model | Training set | Validation set | Test set
GloVe-based model | 85% (13,060 pairs) | / | 15% (2,310 pairs)
SQLNet & encoder-decoder | 70% (10,750 pairs) | 15% (2,310 pairs) | 15% (2,310 pairs)

Table 5.2: Split of the dataset into training, validation and test set for the different models

The evaluation metric used for the baselines and models is the accuracy. The accuracy is the proportion of correct cases (both true positives and true negatives) among the total number of cases examined. Because some parts of the SQL query are harder to predict than

others, we calculate the accuracy of the different components of the SQL query, so that it becomes clear on which components of the query the model has to be improved. Table 5.3 shows the evaluation components. In the case of multiple conditions, the conditions are sorted before being compared, because the conditions in a SQL query are commutative (a small sketch of this comparison is given below). This means that e.g. the following conditions are equivalent:

WHERE actor.name = 'Brad Pitt' AND movie.title = 'Fight Club'
WHERE movie.title = 'Fight Club' AND actor.name = 'Brad Pitt'

Evaluation component | Description
Select | Checks if the column name after SELECT is correct
From | Checks if the table name after FROM is correct
Where (first operands) (= conds first op) | Checks if the column name (= first operand) after WHERE is correct
Where (all operands) (= conds all op) | Checks if the whole condition after WHERE is correct
Joins (only for composite queries) | Checks if the table names and conditions of the JOIN clause of the query are correct
All | Checks if the whole query is correct

Table 5.3: Evaluation components of the query
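As an illustration of this component-wise evaluation with order-insensitive WHERE conditions, the following is a small sketch, assuming the predicted and ground-truth queries have already been split into their clauses; the dictionary keys and function names are illustrative, not those of our actual evaluation script.

```python
def normalise_conditions(where_clause):
    """Sort the WHERE conditions so that their order does not matter."""
    conditions = [cond.strip() for cond in where_clause.split(" AND ")]
    return sorted(conditions)

def evaluate_components(predicted, gold):
    """Compare a predicted and a gold query, both given as dicts with the keys
    'select', 'from', 'where' and (for composite queries) 'joins'."""
    scores = {
        "select": predicted["select"] == gold["select"],
        "from": predicted["from"] == gold["from"],
        "conds_all_op": normalise_conditions(predicted["where"])
                        == normalise_conditions(gold["where"]),
    }
    if "joins" in gold:
        scores["joins"] = predicted.get("joins") == gold["joins"]
    scores["all"] = all(scores.values())
    return scores
```

The per-component accuracies reported below are then simply the fraction of test pairs for which the corresponding entry of this score dictionary is true.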

5.2 Baselines

This section describes the results of the first two models: the GloVe-based model and the more advanced SQLNet model from Xu et al. [2017].

5.2.1 GloVe-based model

The results on accuracy for simple and composite queries can be found in Figures 5.1 and 5.2 respectively. The figures show that the biggest room for improvement lies in predicting the WHERE condition (conds all op), while predicting the table name after FROM is already perfect. Also, the predictions of the column name after SELECT and of the first operand of the WHERE condition (conds first op) have an accuracy above 80%, which is quite high for a baseline.

Figure 5.1: Accuracy of the GloVe model - simple queries

Figure 5.2: Accuracy of the GloVe model - composite queries

5.2.2 SQLNet

The results of SQLNet on our dataset can be found in Table 5.4. SQLNet trains different prediction models for the different parts of the query, which means that the validation losses of these models can converge at different times. The SQLNet model was trained for 100 epochs, where the best selection and condition predictions on the validation set were achieved after 77 and 45 epochs respectively. The FROM accuracy is not calculated, because there is only one table. SQLNet is a more advanced model and scores better than the GloVe-based model on predicting the WHERE clause. The comparison can be consulted in Table 5.4. The WHERE (first operands) accuracy is not available for SQLNet, because its evaluation mechanism did not measure the accuracy of that component.

Simple queries | GloVe-based | SQLNet
Select | 84.45% | 71.7%
From | 100.0% | 100%
Where (first operands) | 82.51% | /
Where (all operands) | 36.19% | 46.8%
All | 29.48% | 36.5%

Table 5.4: Accuracy results of the GloVe-based model and SQLNet - simple queries

5.3 Encoder-decoder model

The first part of this section covers the results of the encoder-decoder model on the simple queries, while the second part contains the results on the composite queries. It will explore

the improvements of the attention and copy mechanisms and show some visualizations of these mechanisms.

5.3.1 Simple queries

The accuracy results of the encoder-decoder model on the different components can be found in Table 5.5 and Figure 5.3. The choice of hyperparameters can be found in Table 5.6 and is motivated in Section 5.4. The model is trained until the validation loss converges, which is after approximately 9 epochs. Note the difference with the training procedure of SQLNet, where there were multiple models, one for each clause of the query, which were trained independently of each other until their validation loss converged. Clearly, the attention mechanism has no impact on the simple queries. However, in combination with the copy mechanism there is a 2% increase in overall accuracy compared to the simple decoder.

Simple queries | Simple | With attention | With attention and copy
Select | 95.47% | 94.08% | 93.87%
From | 100% | 99.65% | 99.87%
Where (first operands) | 92.74% | 90.04% | 92.39%
Where (all operands) | 65.99% | 60.55% | 67.77%
All | 64.42% | 58.59% | 66.29%

Table 5.5: Accuracy results of the encoder-decoder model - simple queries

Hyperparameter | Choice
Optimizer | Adam
Learning rate | 0.001
Dropout | 0.1
Amount of hidden nodes | 256

Table 5.6: Choice of hyperparameters - simple and composite queries

Not all evaluation components score equally well, hence we compare the evaluation components that score worst. The prediction of the last operand and the correctness of the whole query are shown in Figure 5.4. We notice a decrease in accuracy when using attention, which is compensated for in combination with the copy mechanism. All three models score better than the GloVe-based model and SQLNet (see Table 5.4).

Figure 5.3: Accuracy results of the encoder-decoder model - simple queries

Figure 5.4: Accuracy results for the WHERE clause and the whole query - simple queries

5.3.2 Composite queries

The accuracy results of the encoder-decoder model are shown in Table 5.7 and Figure 5.5. The choice of hyperparameters is the same as for the simple queries; it can be found in Table 5.6 and is motivated in Section 5.4. The model is trained until the validation loss converges, which is after approximately 9 epochs. Analogous to the simple queries, the attention mechanism and copy mechanism extensions have a particularly strong effect on

predicting the third operand in the WHERE clause.

Composite queries | GloVe | Simple | With attention | With attention & copy
Select | | 96.46% | 94.00% | 96.28%
From | | 99.60% | 99.41% | 99.96%
Where (first operands) | | 94.89% | 91.76% | 93.51%
Where (all operands) | | 65.69% | | 73.94%
Joins | 80.74% | | 92.83% | 94.49%
All | | 63.32% | 63.32% | 71.29%

Table 5.7: Accuracy results of the encoder-decoder model - composite queries

Figure 5.5: Accuracy results of the encoder-decoder model - composite queries

Figure 5.6 zooms in on the prediction of the last operand and the correctness of the whole query. We notice again that the attention mechanism on its own brings no improvement in the accuracy of the whole query. However, when it is combined with the copy mechanism, it brings an additional improvement of 8%. All three models score better than the GloVe-based model (see Table 5.7).

The attention mechanism is visualized in Figures 5.7 and 5.8. Because it is used to weight specific encoder outputs of the input sequence, we can inspect where the network is focused most at each time step. In Figure 5.7, we notice that the word nation has the most impact on the translation. Because country_code is only present in the company table, nation has almost weight 1 in the whole SELECT, FROM and JOIN clause.

Figure 5.6: Accuracy results for the WHERE clause and the whole query - composite queries

The words another forever have weight 1 in the prediction of the last operand of the WHERE clause, where they are copied correctly. Figure 5.8 shows a second example question-query pair. Here, the SELECT, FROM and JOIN clauses are again determined by one word, producer. The last operand of the WHERE clause is also copied correctly.

Figure 5.7: Visualization of the attention - pair 1

Figure 5.8: Visualization of the attention - pair 2

The copy mechanism works in two steps: predicting a <COPY> token and, if it should copy,

choosing which token it should copy. The accuracy of the first step is shown in Table 5.8, where the percentages indicate in how many queries the <COPY> tokens are correctly or incorrectly predicted. The largest error case is when the decoder predicts a different length than the ground truth query. Also, the decoder is more likely to falsely predict a <COPY> token than to forget to predict one. The frequencies do not sum to 1, because it is possible that multiple error cases happen simultaneously.

Correct | 88.71%
Incorrect, different lengths | 9.58%
Incorrect, should have copied | 0.985%
Incorrect, should not have copied | 4.43%

Table 5.8: Accuracy of the first step of the copy mechanism and the frequency of the error cases

Table 5.9 shows three examples of question-query pairs and the results of the simple decoder and the decoder with the attention and copy mechanism. The first example shows a case where the copy mechanism correctly predicts the tokens to copy, but the attention is wrongly aligned for the last token, causing it to copy the wrong word. The second example is an illustration of a common error of the copy mechanism: wrongly predicting the length of the condition. The third example illustrates a case where the copy mechanism is correct and the simple decoder is not.

Q1: which iowan cinematographer produced the film Appunti inutili - Virgilio Giotti
S: SELECT name FROM movie INNER JOIN made_by ON movie.mid = made_by.msid INNER JOIN producer ON made_by.pid = producer.pid WHERE movie.title = *aquele querido mes de agosto*
A&C: SELECT name FROM movie INNER JOIN made_by ON movie.mid = made_by.msid INNER JOIN producer ON made_by.pid = producer.pid WHERE movie.title = Appunti inutili - Virgilio *Virgilio*
G: SELECT name FROM movie INNER JOIN made_by ON movie.mid = made_by.msid INNER JOIN producer ON made_by.pid = producer.pid WHERE movie.title = Appunti inutili - Virgilio Giotti

Q2: where is the film Angry Samoans form
S: SELECT country_code FROM movie INNER JOIN copyright ON movie.mid = copyright.msid INNER JOIN company ON copyright.cid = company.id WHERE movie.title = *apron strings*
A&C: SELECT country_code FROM movie INNER JOIN copyright ON movie.mid = copyright.msid INNER JOIN company ON copyright.cid = company.id WHERE movie.title = *Angry*
G: SELECT country_code FROM movie INNER JOIN copyright ON movie.mid = copyright.msid INNER JOIN company ON copyright.cid = company.id WHERE movie.title = Angry Samoans

Q3: what film is a part of the Crime film genre
S: SELECT title FROM movie INNER JOIN classification ON movie.mid = classification.msid INNER JOIN genre genre ON classification.gid = genre.gid WHERE genre.genre = *game-show*
A&C: SELECT title FROM movie INNER JOIN classification ON movie.mid = classification.msid INNER JOIN genre genre ON classification.gid = genre.gid WHERE genre.genre = Crime
G: SELECT title FROM movie INNER JOIN classification ON movie.mid = classification.msid INNER JOIN genre genre ON classification.gid = genre.gid WHERE genre.genre = Crime

Table 5.9: Example predictions by the different models. Q denotes the natural language question and G denotes the corresponding ground truth query. S and A&C denote respectively the queries produced by the simple decoder and the decoder with the copy and attention mechanism. Our models in general generate the table name twice, so that the column names can specify where they come from, but this is left out of the table for the sake of brevity. The words between asterisks indicate wrongly predicted words.

5.4 Hyperparameter optimization

This section explores the optimization of the hyperparameters in the model. The first subsection discusses the choice of optimizer and learning rate, which are used to minimize the loss. Afterwards, the impact of the regularization technique dropout is examined. Finally, we tune the amount of hidden nodes, which is the length of the context vector.

5.4.1 Optimizer

We have tried two different gradient descent optimizers from the torch.optim package: stochastic gradient descent (optim.SGD) and Adaptive Moment Estimation (optim.Adam), explained in Chapter 2. This section compares the results of both. Figure 5.9 shows the negative log-likelihood loss for the SGD optimizer and the Adam optimizer (both with learning rate 0.001) as a function of time. The experiment is conducted on the composite query dataset using the encoder-decoder model with attention. The following aspects can be deduced from the figure:

- The Adam optimizer (blue) converges faster than the SGD optimizer (orange).
- Due to the frequent updates with high variance, the loss function fluctuates heavily.

Figure 5.9: The negative log-likelihood loss for the SGD optimizer (orange) and the Adam optimizer (blue), as a function of the amount of trained samples.

Figure 5.10 shows the evolution of the negative log-likelihood loss for different learning rates, also on the composite query dataset with the encoder-decoder model with attention. The Adam optimizer with learning rate 0.1 fails to converge, while the Adam optimizer with learning rate 0.01 is not able to converge to a loss below 2.0. The default PyTorch learning rate of 0.001 converges faster than the others, which is why we picked 0.001 as the learning rate for our model.

Figure 5.10: The negative log-likelihood loss for the learning rates 0.1, 0.01 and 0.001 for the Adam optimizer, as a function of the amount of trained samples. The graphs are respectively red, blue and orange.

5.4.2 Dropout

Overfitting occurs when the model adapts to the training data too well, but does not generalize to new data. In our case, this would mean that the model predicts the template queries in the training set almost perfectly, but predicts the (different) template queries in the test set poorly. There is a direct trade-off between overfitting and model complexity. Neural networks are complex models, so additional countermeasures to prevent overfitting are taken. Dropout is a regularization technique for neural networks that tackles this problem (Srivastava et al. [2014]). The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much.

The dropout mechanism in PyTorch works as follows:

- During training, it randomly zeroes some of the elements of the input tensor with probability p. The outputs are scaled by a factor of $\frac{1}{1-p}$.
- During evaluation, the module simply computes an identity function.

This dropout reduces overfitting and gives major improvements over other regularization methods. The impact of the dropout parameter p on our model is shown in Figure 5.11. The absolute accuracy of the whole query is calculated on the validation set every 1,000 training samples, for p = 0.1, 0.3 and 0.5. The experiment is conducted with the decoder with the copy and attention mechanism. The overall accuracy seems to converge to the same level, which indicates that the dropout parameter does not have much effect on the accuracy of our model.

Figure 5.11: The absolute accuracy of the whole query calculated on the validation set for the dropout values p of 0.1, 0.3 and 0.5, as a function of the amount of trained samples. The graphs are respectively orange, blue and red.

5.4.3 Hidden nodes

The amount of hidden nodes, which is also the length of the context vector, has a smaller effect on the negative log-likelihood loss. Figure 5.12 shows the impact of the size of the context vector on the negative log-likelihood loss. More hidden nodes lead to a better reduction of the loss, but also to an increase in parameters, which leads to longer training times. This is why we picked 256 hidden nodes to train our model.
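The following is a small illustrative snippet of these settings in PyTorch, using the hyperparameters of Table 5.6; the module instances are stand-ins rather than our actual encoder and decoder, and the example only demonstrates how the chosen values would be wired together.

```python
import torch
import torch.nn as nn

# hyperparameters from Table 5.6 (the module classes below are placeholders)
hidden_size = 256
dropout_p = 0.1
learning_rate = 0.001

encoder = nn.GRU(input_size=300, hidden_size=hidden_size)
decoder = nn.GRU(input_size=300, hidden_size=hidden_size)
dropout = nn.Dropout(p=dropout_p)      # zeroes inputs with probability p during training

encoder_optimizer = torch.optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = torch.optim.Adam(decoder.parameters(), lr=learning_rate)
criterion = nn.NLLLoss()               # negative log-likelihood on the LogSoftmax outputs

# dropout behaves differently in training and evaluation mode:
x = torch.randn(1, 1, 300)
dropout.train()
y_train = dropout(x)                   # some elements zeroed, the rest scaled by 1/(1-p)
dropout.eval()
y_eval = dropout(x)                    # identity function at evaluation time
```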

Figure 5.12: The negative log-likelihood loss for different amounts of hidden nodes.

5.5 Conclusion

This chapter discussed the results of the techniques explained in Chapter 4. It began with a short overview of the reasons why we chose PyTorch as the deep learning framework. Because evaluating only the accuracy of the whole query would reveal too few details about our models, we evaluated the accuracy of the different parts of the SQL query. The second part of the first section listed these evaluation components of the SQL query.

The chapter continued with the first results of this dissertation, namely the experiments with the basic GloVe-based model and the more advanced SQLNet (by Xu et al. [2017]). These models already score quite well on the SELECT, FROM and JOIN clauses, but there are improvements to be made in predicting the third operand of the WHERE clause. This is where our encoder-decoder model boosts the accuracy results. On the simple queries, the encoder-decoder model scores better than the GloVe-based model and SQLNet, but the attention mechanism does not improve the total accuracy unless it is combined with the copy mechanism. On the composite queries, the simple decoder already scores better than the two baseline models, and the extensions enhance the accuracy further. The attention with copy mechanism yields a gain of 7.9% compared to the simple decoder, concluding with a total accuracy of 71.29%.


LSTM and its variants for visual recognition. Xiaodan Liang Sun Yat-sen University

LSTM and its variants for visual recognition. Xiaodan Liang Sun Yat-sen University LSTM and its variants for visual recognition Xiaodan Liang xdliang328@gmail.com Sun Yat-sen University Outline Context Modelling with CNN LSTM and its Variants LSTM Architecture Variants Application in

More information

arxiv: v1 [cs.cl] 13 Nov 2017

arxiv: v1 [cs.cl] 13 Nov 2017 SQLNet: GENERATING STRUCTURED QUERIES FROM NATURAL LANGUAGE WITHOUT REINFORCEMENT LEARNING Xiaojun Xu Shanghai Jiao Tong University Chang Liu, Dawn Song University of the California, Berkeley arxiv:1711.04436v1

More information

arxiv: v1 [cs.ai] 13 Nov 2018

arxiv: v1 [cs.ai] 13 Nov 2018 Translating Natural Language to SQL using Pointer-Generator Networks and How Decoding Order Matters Denis Lukovnikov 1, Nilesh Chakraborty 1, Jens Lehmann 1, and Asja Fischer 2 1 University of Bonn, Bonn,

More information

Deep Learning. Practical introduction with Keras JORDI TORRES 27/05/2018. Chapter 3 JORDI TORRES

Deep Learning. Practical introduction with Keras JORDI TORRES 27/05/2018. Chapter 3 JORDI TORRES Deep Learning Practical introduction with Keras Chapter 3 27/05/2018 Neuron A neural network is formed by neurons connected to each other; in turn, each connection of one neural network is associated

More information

Outline GF-RNN ReNet. Outline

Outline GF-RNN ReNet. Outline Outline Gated Feedback Recurrent Neural Networks. arxiv1502. Introduction: RNN & Gated RNN Gated Feedback Recurrent Neural Networks (GF-RNN) Experiments: Character-level Language Modeling & Python Program

More information

Image-to-Text Transduction with Spatial Self-Attention

Image-to-Text Transduction with Spatial Self-Attention Image-to-Text Transduction with Spatial Self-Attention Sebastian Springenberg, Egor Lakomkin, Cornelius Weber and Stefan Wermter University of Hamburg - Dept. of Informatics, Knowledge Technology Vogt-Ko

More information

DEEP LEARNING REVIEW. Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature Presented by Divya Chitimalla

DEEP LEARNING REVIEW. Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature Presented by Divya Chitimalla DEEP LEARNING REVIEW Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature 2015 -Presented by Divya Chitimalla What is deep learning Deep learning allows computational models that are composed of multiple

More information

Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features

Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features Xu SUN ( 孙栩 ) Peking University xusun@pku.edu.cn Motivation Neural networks -> Good Performance CNN, RNN, LSTM

More information

Natural Language Interface for Databases Using a Dual-Encoder Model

Natural Language Interface for Databases Using a Dual-Encoder Model Natural Language Interface for Databases Using a Dual-Encoder Model Ionel Hosu 1, Radu Iacob 1, Florin Brad 2, Stefan Ruseti 1, Traian Rebedea 1 1 University Politehnica of Bucharest, Romania 2 Bitdefender,

More information

Recurrent Neural Networks

Recurrent Neural Networks Recurrent Neural Networks Javier Béjar Deep Learning 2018/2019 Fall Master in Artificial Intelligence (FIB-UPC) Introduction Sequential data Many problems are described by sequences Time series Video/audio

More information

Domain-Aware Sentiment Classification with GRUs and CNNs

Domain-Aware Sentiment Classification with GRUs and CNNs Domain-Aware Sentiment Classification with GRUs and CNNs Guangyuan Piao 1(B) and John G. Breslin 2 1 Insight Centre for Data Analytics, Data Science Institute, National University of Ireland Galway, Galway,

More information

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention Show, Attend and Tell: Neural Image Caption Generation with Visual Attention Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio Presented

More information

Semantic image search using queries

Semantic image search using queries Semantic image search using queries Shabaz Basheer Patel, Anand Sampat Department of Electrical Engineering Stanford University CA 94305 shabaz@stanford.edu,asampat@stanford.edu Abstract Previous work,

More information

arxiv: v1 [cs.cl] 9 Jul 2018

arxiv: v1 [cs.cl] 9 Jul 2018 Chenglong Wang 1 Po-Sen Huang 2 Alex Polozov 2 Marc Brockschmidt 2 Rishabh Singh 3 arxiv:1807.03100v1 [cs.cl] 9 Jul 2018 Abstract We present a neural semantic parser that translates natural language questions

More information

A Quick Guide on Training a neural network using Keras.

A Quick Guide on Training a neural network using Keras. A Quick Guide on Training a neural network using Keras. TensorFlow and Keras Keras Open source High level, less flexible Easy to learn Perfect for quick implementations Starts by François Chollet from

More information

Machine Learning Classifiers and Boosting

Machine Learning Classifiers and Boosting Machine Learning Classifiers and Boosting Reading Ch 18.6-18.12, 20.1-20.3.2 Outline Different types of learning problems Different types of learning algorithms Supervised learning Decision trees Naïve

More information

Week 3: Perceptron and Multi-layer Perceptron

Week 3: Perceptron and Multi-layer Perceptron Week 3: Perceptron and Multi-layer Perceptron Phong Le, Willem Zuidema November 12, 2013 Last week we studied two famous biological neuron models, Fitzhugh-Nagumo model and Izhikevich model. This week,

More information

Logistic Regression and Gradient Ascent

Logistic Regression and Gradient Ascent Logistic Regression and Gradient Ascent CS 349-02 (Machine Learning) April 0, 207 The perceptron algorithm has a couple of issues: () the predictions have no probabilistic interpretation or confidence

More information

Machine Learning With Python. Bin Chen Nov. 7, 2017 Research Computing Center

Machine Learning With Python. Bin Chen Nov. 7, 2017 Research Computing Center Machine Learning With Python Bin Chen Nov. 7, 2017 Research Computing Center Outline Introduction to Machine Learning (ML) Introduction to Neural Network (NN) Introduction to Deep Learning NN Introduction

More information

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or

More information

LSTM with Working Memory

LSTM with Working Memory LSTM with Working Memory Andrew Pulver Department of Computer Science University at Albany Email: apulver@albany.edu Siwei Lyu Department of Computer Science University at Albany Email: slyu@albany.edu

More information

DCU-UvA Multimodal MT System Report

DCU-UvA Multimodal MT System Report DCU-UvA Multimodal MT System Report Iacer Calixto ADAPT Centre School of Computing Dublin City University Dublin, Ireland iacer.calixto@adaptcentre.ie Desmond Elliott ILLC University of Amsterdam Science

More information

Kyoto-NMT: a Neural Machine Translation implementation in Chainer

Kyoto-NMT: a Neural Machine Translation implementation in Chainer Kyoto-NMT: a Neural Machine Translation implementation in Chainer Fabien Cromières Japan Science and Technology Agency Kawaguchi-shi, Saitama 332-0012 fabien@pa.jst.jp Abstract We present Kyoto-NMT, an

More information

FastText. Jon Koss, Abhishek Jindal

FastText. Jon Koss, Abhishek Jindal FastText Jon Koss, Abhishek Jindal FastText FastText is on par with state-of-the-art deep learning classifiers in terms of accuracy But it is way faster: FastText can train on more than one billion words

More information

Image Captioning with Attention

Image Captioning with Attention ing with Attention Blaine Rister (blaine@stanford.edu), Dieterich Lawson (jdlawson@stanford.edu) 1. Introduction In the past few years, neural networks have fueled dramatic advances in image classication.

More information

Deep Learning. Deep Learning. Practical Application Automatically Adding Sounds To Silent Movies

Deep Learning. Deep Learning. Practical Application Automatically Adding Sounds To Silent Movies http://blog.csdn.net/zouxy09/article/details/8775360 Automatic Colorization of Black and White Images Automatically Adding Sounds To Silent Movies Traditionally this was done by hand with human effort

More information

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used. 1 4.12 Generalization In back-propagation learning, as many training examples as possible are typically used. It is hoped that the network so designed generalizes well. A network generalizes well when

More information

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning,

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning, A Acquisition function, 298, 301 Adam optimizer, 175 178 Anaconda navigator conda command, 3 Create button, 5 download and install, 1 installing packages, 8 Jupyter Notebook, 11 13 left navigation pane,

More information

Character Recognition Using Convolutional Neural Networks

Character Recognition Using Convolutional Neural Networks Character Recognition Using Convolutional Neural Networks David Bouchain Seminar Statistical Learning Theory University of Ulm, Germany Institute for Neural Information Processing Winter 2006/2007 Abstract

More information

Machine Learning for Natural Language Processing. Alice Oh January 17, 2018

Machine Learning for Natural Language Processing. Alice Oh January 17, 2018 Machine Learning for Natural Language Processing Alice Oh January 17, 2018 Overview Distributed representation Temporal neural networks RNN LSTM GRU Sequence-to-sequence models Machine translation Response

More information

CS 1674: Intro to Computer Vision. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh November 16, 2016

CS 1674: Intro to Computer Vision. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh November 16, 2016 CS 1674: Intro to Computer Vision Neural Networks Prof. Adriana Kovashka University of Pittsburgh November 16, 2016 Announcements Please watch the videos I sent you, if you haven t yet (that s your reading)

More information

Dialog System & Technology Challenge 6 Overview of Track 1 - End-to-End Goal-Oriented Dialog learning

Dialog System & Technology Challenge 6 Overview of Track 1 - End-to-End Goal-Oriented Dialog learning Dialog System & Technology Challenge 6 Overview of Track 1 - End-to-End Goal-Oriented Dialog learning Julien Perez 1 and Y-Lan Boureau 2 and Antoine Bordes 2 1 Naver Labs Europe, Grenoble, France 2 Facebook

More information

ABC-CNN: Attention Based CNN for Visual Question Answering

ABC-CNN: Attention Based CNN for Visual Question Answering ABC-CNN: Attention Based CNN for Visual Question Answering CIS 601 PRESENTED BY: MAYUR RUMALWALA GUIDED BY: DR. SUNNIE CHUNG AGENDA Ø Introduction Ø Understanding CNN Ø Framework of ABC-CNN Ø Datasets

More information

SEMANTIC COMPUTING. Lecture 9: Deep Learning: Recurrent Neural Networks (RNNs) TU Dresden, 21 December 2018

SEMANTIC COMPUTING. Lecture 9: Deep Learning: Recurrent Neural Networks (RNNs) TU Dresden, 21 December 2018 SEMANTIC COMPUTING Lecture 9: Deep Learning: Recurrent Neural Networks (RNNs) Dagmar Gromann International Center For Computational Logic TU Dresden, 21 December 2018 Overview Handling Overfitting Recurrent

More information

House Price Prediction Using LSTM

House Price Prediction Using LSTM House Price Prediction Using LSTM Xiaochen Chen Lai Wei The Hong Kong University of Science and Technology Jiaxin Xu ABSTRACT In this paper, we use the house price data ranging from January 2004 to October

More information

CAP 6412 Advanced Computer Vision

CAP 6412 Advanced Computer Vision CAP 6412 Advanced Computer Vision http://www.cs.ucf.edu/~bgong/cap6412.html Boqing Gong Feb 04, 2016 Today Administrivia Attention Modeling in Image Captioning, by Karan Neural networks & Backpropagation

More information

Research on Pruning Convolutional Neural Network, Autoencoder and Capsule Network

Research on Pruning Convolutional Neural Network, Autoencoder and Capsule Network Research on Pruning Convolutional Neural Network, Autoencoder and Capsule Network Tianyu Wang Australia National University, Colledge of Engineering and Computer Science u@anu.edu.au Abstract. Some tasks,

More information

Neural Network Joint Language Model: An Investigation and An Extension With Global Source Context

Neural Network Joint Language Model: An Investigation and An Extension With Global Source Context Neural Network Joint Language Model: An Investigation and An Extension With Global Source Context Ruizhongtai (Charles) Qi Department of Electrical Engineering, Stanford University rqi@stanford.edu Abstract

More information

Neural Network Weight Selection Using Genetic Algorithms

Neural Network Weight Selection Using Genetic Algorithms Neural Network Weight Selection Using Genetic Algorithms David Montana presented by: Carl Fink, Hongyi Chen, Jack Cheng, Xinglong Li, Bruce Lin, Chongjie Zhang April 12, 2005 1 Neural Networks Neural networks

More information

XES Tensorflow Process Prediction using the Tensorflow Deep-Learning Framework

XES Tensorflow Process Prediction using the Tensorflow Deep-Learning Framework XES Tensorflow Process Prediction using the Tensorflow Deep-Learning Framework Demo Paper Joerg Evermann 1, Jana-Rebecca Rehse 2,3, and Peter Fettke 2,3 1 Memorial University of Newfoundland 2 German Research

More information

Image-Sentence Multimodal Embedding with Instructive Objectives

Image-Sentence Multimodal Embedding with Instructive Objectives Image-Sentence Multimodal Embedding with Instructive Objectives Jianhao Wang Shunyu Yao IIIS, Tsinghua University {jh-wang15, yao-sy15}@mails.tsinghua.edu.cn Abstract To encode images and sentences into

More information

Deep Learning. Architecture Design for. Sargur N. Srihari

Deep Learning. Architecture Design for. Sargur N. Srihari Architecture Design for Deep Learning Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation

More information

CS839: Probabilistic Graphical Models. Lecture 22: The Attention Mechanism. Theo Rekatsinas

CS839: Probabilistic Graphical Models. Lecture 22: The Attention Mechanism. Theo Rekatsinas CS839: Probabilistic Graphical Models Lecture 22: The Attention Mechanism Theo Rekatsinas 1 Why Attention? Consider machine translation: We need to pay attention to the word we are currently translating.

More information

A Comparison of Sequence-Trained Deep Neural Networks and Recurrent Neural Networks Optical Modeling For Handwriting Recognition

A Comparison of Sequence-Trained Deep Neural Networks and Recurrent Neural Networks Optical Modeling For Handwriting Recognition A Comparison of Sequence-Trained Deep Neural Networks and Recurrent Neural Networks Optical Modeling For Handwriting Recognition Théodore Bluche, Hermann Ney, Christopher Kermorvant SLSP 14, Grenoble October

More information

arxiv: v3 [cs.cl] 13 Sep 2018

arxiv: v3 [cs.cl] 13 Sep 2018 Robust Text-to-SQL Generation with Execution-Guided Decoding Chenglong Wang, 1 * Kedar Tatwawadi, 2 * Marc Brockschmidt, 3 Po-Sen Huang, 3 Yi Mao, 3 Oleksandr Polozov, 3 Rishabh Singh 4 1 University of

More information

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group Deep Learning Vladimir Golkov Technical University of Munich Computer Vision Group 1D Input, 1D Output target input 2 2D Input, 1D Output: Data Distribution Complexity Imagine many dimensions (data occupies

More information

Recurrent Neural Networks

Recurrent Neural Networks Recurrent Neural Networks 11-785 / Fall 2018 / Recitation 7 Raphaël Olivier Recap : RNNs are magic They have infinite memory They handle all kinds of series They re the basis of recent NLP : Translation,

More information

Pointer Network. Oriol Vinyals. 박천음 강원대학교 Intelligent Software Lab.

Pointer Network. Oriol Vinyals. 박천음 강원대학교 Intelligent Software Lab. Pointer Network Oriol Vinyals 박천음 강원대학교 Intelligent Software Lab. Intelligent Software Lab. Pointer Network 1 Pointer Network 2 Intelligent Software Lab. 2 Sequence-to-Sequence Model Train 학습학습학습학습학습 Test

More information

CS 224n: Assignment #3

CS 224n: Assignment #3 CS 224n: Assignment #3 Due date: 2/27 11:59 PM PST (You are allowed to use 3 late days maximum for this assignment) These questions require thought, but do not require long answers. Please be as concise

More information

A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems

A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University

More information

ImageNet Classification with Deep Convolutional Neural Networks

ImageNet Classification with Deep Convolutional Neural Networks ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky Ilya Sutskever Geoffrey Hinton University of Toronto Canada Paper with same name to appear in NIPS 2012 Main idea Architecture

More information

End-To-End Spam Classification With Neural Networks

End-To-End Spam Classification With Neural Networks End-To-End Spam Classification With Neural Networks Christopher Lennan, Bastian Naber, Jan Reher, Leon Weber 1 Introduction A few years ago, the majority of the internet s network traffic was due to spam

More information

Text Modeling with the Trace Norm

Text Modeling with the Trace Norm Text Modeling with the Trace Norm Jason D. M. Rennie jrennie@gmail.com April 14, 2006 1 Introduction We have two goals: (1) to find a low-dimensional representation of text that allows generalization to

More information

Rationalizing Sentiment Analysis in Tensorflow

Rationalizing Sentiment Analysis in Tensorflow Rationalizing Sentiment Analysis in Tensorflow Alyson Kane Stanford University alykane@stanford.edu Henry Neeb Stanford University hneeb@stanford.edu Kevin Shaw Stanford University keshaw@stanford.edu

More information

CSE 250B Project Assignment 4

CSE 250B Project Assignment 4 CSE 250B Project Assignment 4 Hani Altwary haltwa@cs.ucsd.edu Kuen-Han Lin kul016@ucsd.edu Toshiro Yamada toyamada@ucsd.edu Abstract The goal of this project is to implement the Semi-Supervised Recursive

More information

A Hybrid Neural Model for Type Classification of Entity Mentions

A Hybrid Neural Model for Type Classification of Entity Mentions A Hybrid Neural Model for Type Classification of Entity Mentions Motivation Types group entities to categories Entity types are important for various NLP tasks Our task: predict an entity mention s type

More information

Advanced Search Algorithms

Advanced Search Algorithms CS11-747 Neural Networks for NLP Advanced Search Algorithms Daniel Clothiaux https://phontron.com/class/nn4nlp2017/ Why search? So far, decoding has mostly been greedy Chose the most likely output from

More information

Deep Character-Level Click-Through Rate Prediction for Sponsored Search

Deep Character-Level Click-Through Rate Prediction for Sponsored Search Deep Character-Level Click-Through Rate Prediction for Sponsored Search Bora Edizel - Phd Student UPF Amin Mantrach - Criteo Research Xiao Bai - Oath This work was done at Yahoo and will be presented as

More information

Deep Learning for Computer Vision II

Deep Learning for Computer Vision II IIIT Hyderabad Deep Learning for Computer Vision II C. V. Jawahar Paradigm Shift Feature Extraction (SIFT, HoG, ) Part Models / Encoding Classifier Sparrow Feature Learning Classifier Sparrow L 1 L 2 L

More information

16-785: Integrated Intelligence in Robotics: Vision, Language, and Planning. Spring 2018 Lecture 14. Image to Text

16-785: Integrated Intelligence in Robotics: Vision, Language, and Planning. Spring 2018 Lecture 14. Image to Text 16-785: Integrated Intelligence in Robotics: Vision, Language, and Planning Spring 2018 Lecture 14. Image to Text Input Output Classification tasks 4/1/18 CMU 16-785: Integrated Intelligence in Robotics

More information

Perceptron: This is convolution!

Perceptron: This is convolution! Perceptron: This is convolution! v v v Shared weights v Filter = local perceptron. Also called kernel. By pooling responses at different locations, we gain robustness to the exact spatial location of image

More information

Decentralized and Distributed Machine Learning Model Training with Actors

Decentralized and Distributed Machine Learning Model Training with Actors Decentralized and Distributed Machine Learning Model Training with Actors Travis Addair Stanford University taddair@stanford.edu Abstract Training a machine learning model with terabytes to petabytes of

More information

Pixel-level Generative Model

Pixel-level Generative Model Pixel-level Generative Model Generative Image Modeling Using Spatial LSTMs (2015NIPS) L. Theis and M. Bethge University of Tübingen, Germany Pixel Recurrent Neural Networks (2016ICML) A. van den Oord,

More information