
Generating composite SQL queries from natural language questions using recurrent neural networks

Matthias De Groote

Supervisors: Prof. dr. ir. Joni Dambre, Prof. dr. Wesley De Neve
Counsellors: Ir. Fréderic Godin, Dr. ir. Thomas Demeester

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering

Department of Electronics and Information Systems
Chair: Prof. dr. ir. Koen De Bosschere
Faculty of Engineering and Architecture
Academic year



Foreword

Ever since following the course Machine Learning, I have been intrigued by the mathematical models, practical applications and societal impact of the domain. I started this project with the intention of doing research on chatbots, but after reading up on and immersing myself in the NLP domain, my interest shifted towards question answering and machine translation. Hence the final topic of this dissertation, which combines elements of both.

Before beginning the dissertation, I would like to thank everyone who has helped me throughout the year to complete it. First of all, I would like to express my gratitude to my supervisors, Prof. Dr. Ir. Joni Dambre and Prof. Dr. Wesley De Neve, for granting me the opportunity to conduct a year of research in the NLP domain. Second, I want to thank my counsellors, Ir. Fréderic Godin and Dr. Ir. Thomas Demeester, for their time and feedback throughout the year. Their fast and accurate support during and between our constructive meetings guided me throughout this work. I would also like to thank Fréderic for proofreading my dissertation. Finally, I want to thank my parents, who gave me the opportunity to pursue my studies and supported me during these years. Their encouragement and confidence have helped me a lot.

Matthias De Groote, June 2018

Permission of Use

The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In the case of any other use, the copyright terms have to be respected, in particular with regard to the obligation to state expressly the source when quoting results from this master dissertation.

Matthias De Groote, June 2018

Generating Composite SQL Queries from Natural Language Questions using Recurrent Neural Networks

Matthias De Groote

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering
Supervisors: Prof. Dr. Ir. J. Dambre, Prof. Dr. W. De Neve
Counsellors: Ir. F. Godin, Dr. Ir. T. Demeester
Faculty of Engineering and Architecture, Ghent University
Department of Electronics and Information Systems
Chair: Prof. Dr. Ir. K. De Bosschere

Summary

Relational databases store a vast amount of today's information and are becoming increasingly important in modern applications. Accessing these databases requires an understanding of SQL, which is not common knowledge. The active research on semantic parsers that translate natural language questions to SQL queries has recently shifted towards using neural networks. Although there are already good neural network approaches for simple SQL queries (i.e. queries on one table), these solutions cannot produce composite SQL queries (i.e. queries on multiple tables). This work introduces a new dataset containing composite SQL queries and offers Natural-Language-To-SQL (NL2SQL) solutions, trained and tested on this dataset.

Keywords: nl2sql, encoder-decoder model, machine translation, recurrent neural networks

Generating composite SQL queries from natural language questions using recurrent neural networks

Matthias De Groote

Supervisors: Prof. Dr. Ir. J. Dambre, Prof. Dr. W. De Neve, Ir. F. Godin, Dr. Ir. T. Demeester

Abstract: Relational databases store a vast amount of today's information and are becoming increasingly important in modern applications. Accessing these databases requires an understanding of SQL, which is not common knowledge. The active research on semantic parsers that translate natural language questions to SQL queries has recently shifted towards using neural networks. Although there are already good neural network approaches for simple SQL queries (i.e. queries on one table), these solutions cannot produce composite SQL queries (i.e. queries on multiple tables). This work introduces a new dataset containing composite SQL queries and offers Natural-Language-To-SQL (NL2SQL) solutions, trained and tested on this dataset.

Keywords: nl2sql, encoder-decoder model, machine translation, recurrent neural networks, machine learning

I. INTRODUCTION

The IT revolution of the past few decades has resulted in a large-scale digitization of data, making it accessible to millions of users in the form of databases. However, accessing these databases requires an understanding of query languages such as Structured Query Language (SQL), which, while powerful, is difficult to master and often beyond the programming expertise of a majority of end-users. Thus, building effective semantic parsers that can translate natural language questions into logical forms such as queries has been a long-standing goal [1], [2], [3].

Dong and Lapata [4] showed that recurrent neural networks with attention and copying mechanisms can be used effectively to build successful semantic parsers. Recent work by Zhong et al. [5] introduced the state-of-the-art Seq2SQL model for question-to-SQL translation in the supervised setting. In order to build this model, they published the WikiSQL dataset, which is an order of magnitude larger than previous semantic parsing datasets. Xu et al. [6] and Wang et al. [7] also published papers that improved the accuracy on WikiSQL. However, the SQL queries in the WikiSQL dataset lack complex operators such as JOIN.

This work focuses on generating composite SQL queries (i.e. queries on multiple tables) from natural language questions using recurrent neural networks with attention and copying mechanisms. While the most recent solutions mentioned above achieve good accuracy on the WikiSQL dataset, those models cannot predict the JOIN operator. Training in a supervised manner requires labeled examples of question-query pairs, hence this work also introduces a new dataset of question-query pairs, including both simple and composite SQL queries.

This work is organized as follows. First, a brief overview of the related work on the NL2SQL problem is given in Section II. Next, the construction of the dataset is discussed in Section III, followed by Section IV, which presents the proposed models. The experiments and their results are shown in Section V. Finally, conclusions are drawn in Section VI.

II. RELATED WORK

There are currently two approaches to solve the NL2SQL problem: semantic parsing and neural networks. Semantic parsing is an approach to translate text to a formal meaning representation such as logical forms or structured queries.
There have been many works considering parsing a natural language description into a logical form [1], [2], [8], [9]. Most of these previous systems rely on high-quality lexicons and domain- or representation-specific features, and may not generalize. This is why, in this work, the focus is on neural network approaches to handle the NL2SQL task, which require less feature engineering.

Recently, a new dataset on NL2SQL has been released by Salesforce: WikiSQL [5]. It is a corpus of 80,654 hand-annotated instances of natural language questions, SQL queries and SQL tables extracted from 24,241 HTML tables from Wikipedia. It is an order of magnitude larger than previous semantic parsing datasets, which makes it interesting for data-hungry neural networks. There are three papers that have competitive scores on the WikiSQL task: Seq2SQL [5], SQLNet [6] and Pointing out SQL queries from text [7]. They all offer a different solution for the NL2SQL problem: Seq2SQL used reinforcement learning to solve the order-matters problem in a sequence-to-sequence model; SQLNet wanted to avoid reinforcement learning and proposed a sketch-based approach, which only specifies the shape of the query; Pointing out SQL queries from text used a typed decoder that statically predicts the next token based on the type of the token. Despite the several advantages of the WikiSQL dataset over previous semantic parsing datasets, it only consists of simple SQL queries on one table. This work focuses on training an end-to-end encoder-decoder model on composite SQL queries on multiple tables.

III. DATASET

This work introduces a new dataset consisting of two sub-datasets: a dataset pairing natural language questions with simple SQL queries and one with composite SQL queries. The simple SQL queries have a similar syntax as the queries from the WikiSQL dataset, while the composite SQL queries also include the JOIN operator. The SQL queries are based on the IMDb database (the same database as in [3]).

To gather human questions, the SimpleQuestions dataset from the bAbI project [10] proves to be very practical. It consists of a total of 108,442 questions, written in natural language by human English-speaking annotators. Each of these questions is paired with a corresponding fact, formatted as (subject, relationship, object). The facts have been extracted from the knowledge base Freebase. Only the questions answerable by the IMDb database are extracted, lowercased and stripped of punctuation. The named entities (names of persons, titles of movies, ...) are replaced by general tags. After filtering out duplicate questions, this results in 1,540 unique template questions. A dataset can easily be created by replacing all the tags with e.g. ten random values from the IMDb database. The combination of the relationship and the tags determines the ground truth SQL query. There are 40 unique combinations of relationship and tags, which are hand-annotated by the author. The complete dataset is published and can be downloaded from Bitbucket.

IV. MODEL

This section introduces the multiple models that are offered as a solution. The first is a GloVe-based model that serves as a baseline, together with the more advanced SQLNet model. Next is the encoder-decoder model, discussed with two extensions: the attention and copy mechanism.

A. GloVe-based model

The GloVe-based model is built on the idea that similar questions should have similar queries. It uses GloVe [11] and cosine similarity for respectively the vector representation and the similarity measurement of the questions. We decided to use the Common Crawl embedding, which is trained on 42 billion tokens, has a vocabulary of 1.9 million tokens and embeds these tokens in a 300-dimensional vector space. All the words are lowercased and embedded into a vector using the pre-trained word embedding. The vector representation of a question is the sum of the vector representations of the words in the question. For each question in the test set, the cosine similarity between that question and all the questions from the training set is calculated, and the query accompanying the most similar training question is predicted. The results can be found in Section V.
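To make the baseline concrete, the sketch below shows one way such a nearest-neighbour prediction could be implemented. It is an illustration only, under the assumption that a GloVe lookup table (here called glove) mapping tokens to 300-dimensional vectors is available; it is not the exact code used for the reported results.

import numpy as np

# Illustrative sketch of the GloVe-based baseline (not the exact implementation).
def embed(question, glove):
    # sum of the GloVe vectors of all in-vocabulary words of the question
    vectors = [glove[w] for w in question.lower().split() if w in glove]
    return np.sum(vectors, axis=0) if vectors else np.zeros(300)

def predict(test_question, train_questions, train_queries, glove):
    q = embed(test_question, glove)
    similarities = []
    for train_q in train_questions:
        t = embed(train_q, glove)
        # cosine similarity between the test question and each training question
        similarities.append(np.dot(q, t) / (np.linalg.norm(q) * np.linalg.norm(t) + 1e-8))
    # return the query paired with the most similar training question
    return train_queries[int(np.argmax(similarities))]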
B. SQLNet

In order to check that the dataset is not trivially composed, we have trained SQLNet [6], mentioned in Section II, on our simple query dataset to compare the accuracy. SQLNet is the state-of-the-art model on the WikiSQL dataset and proposes a sketch-based approach to generate a SQL query. This means that different models are trained to predict each clause of the SQL query, such as the column name after SELECT or the condition after the WHERE clause. We parsed our dataset into the corresponding format of SQLNet to be able to train the model. We trained the model without column attention and with fixed word embeddings; the results can be found in Section V.

C. Encoder-decoder

In machine translation, input sequences and output sequences have different lengths. Google presented a general end-to-end approach to sequence learning in [12]. This sequence-to-sequence (seq2seq) network, or encoder-decoder network, is a model consisting of two RNNs called the encoder and the decoder. Figure 1 shows a high-level overview of the encoder-decoder network: the encoder reads an (embedded) input sequence x_0, ..., x_n and outputs a single vector h_n, while all the other outputs h_0, ..., h_{n-1} are discarded; the decoder reads h_n to produce an output sequence y_0, ..., y_k.

Figure 1: High-level overview of the encoder-decoder architecture.

The words from the input sentences and output queries are first embedded into a vector representation using the same GloVe embedding [11] as in the GloVe-based model. If an (as yet unknown) input word of the training set is not part of the GloVe vocabulary, we add a vector with random values uniformly sampled between -1 and 1. We assigned each such word a different vector (and not e.g. all zeroes or the <UNK> embedding) because there were quite a lot (1,233) of words that were not part of the GloVe vocabulary. If a test input word is not present in the mapping, it is mapped to the <UNK> (unknown) token.

1) Encoder: The input x_i, corresponding to a word, is first embedded using the word embedding explained above. A GRU takes as input this embedding x_i together with the previous hidden state s_{i-1}, and produces the encoder output h_i and a new hidden state s_i. The first hidden state s_0 is initialized to all zeroes. In the simple decoder, only the encoder's last output h_n is used, in contrast to the attention decoder, where all the encoder outputs h_0, ..., h_n are needed.

2) Decoder: In the simplest decoder, only the last output of the encoder, h_n, is used. This is called the context vector, because it encodes the context of the entire sequence. The first hidden state s_0 of the decoder is set equal to this context vector. Half of the time, the decoder uses teacher forcing, a method for quickly and efficiently training recurrent neural network models that uses the ground-truth output from a prior time step as input [13]. When it does not use teacher forcing, the input is the previously predicted token. The decoder embeds the input y_{i-1} (in case of no teacher forcing) in the same way as the encoder. A GRU takes as input this embedded vector together with the previous hidden state s_{i-1}, and produces an output vector z_i and the next hidden state s_i. This output vector z_i \in \mathbb{R}^{1 \times n}, with n the number of hidden nodes, is transformed into a distribution over the vocabulary through a feedforward layer followed by the LogSoftmax function. The negative log-likelihood loss is optimized.

3) Decoder with attention mechanism: If only the last encoder output (the context vector) is passed between the encoder and decoder, that single vector carries the burden of encoding the entire sentence. The attention mechanism [14] avoids this by encoding the whole input sequence based on the sequence of all the encoder outputs, as opposed to only the last encoder output. Our implementation of the attention mechanism first calculates a set of attention weights \alpha_{i,j}. Each weight \alpha_{i,j} is a normalized attention energy e_{i,j}:

\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_k \exp(e_{i,k})}    (1)

e_{i,j} = s_{i-1}^T W_a h_j    (2)

where each attention energy e_{i,j} is calculated as the dot product between the decoder hidden state s_{i-1} and a linear transformation of the corresponding encoder output h_j. This is the score function referred to as the "general" form among the global attention mechanisms proposed by [15]. These attention weights \alpha_{i,j} are multiplied by the encoder output vectors h_1, ..., h_n to create a weighted combination, the context vector c_i:

c_i = \sum_{j=1}^{n} \alpha_{i,j} h_j    (3)

The context vector c_i is concatenated with the decoder's input y_{i-1} and serves as the input of the GRU. A feedforward layer applies a linear transformation to the concatenation of the GRU output vector z_i and the context vector c_i. Passing through LogSoftmax, the log probabilities over the vocabulary are obtained.
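As a concrete illustration of Equations (1)-(3), the following sketch computes the attention weights and the context vector for a single decoder step. The hidden size of 256 matches the value used later in the experiments; the function and variable names are our own and do not correspond to the actual implementation.

import torch

hidden_size = 256
W_a = torch.nn.Linear(hidden_size, hidden_size, bias=False)   # the linear transform W_a

def attention(s_prev, encoder_outputs):
    # s_prev: (hidden_size,) decoder hidden state s_{i-1}
    # encoder_outputs: (seq_len, hidden_size) all encoder outputs h_j
    energies = W_a(encoder_outputs) @ s_prev      # e_{i,j} = s_{i-1}^T W_a h_j, Eq. (2)
    alphas = torch.softmax(energies, dim=0)       # normalized attention weights, Eq. (1)
    context = alphas @ encoder_outputs            # c_i = sum_j alpha_{i,j} h_j, Eq. (3)
    return context, alphas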
4) Copy mechanism: There are quite a lot of Out-Of-Vocabulary (OOV) words in the test set, such as actor names, movie titles and so on. To deal with these OOV words, we have designed a copy mechanism that works in combination with attention. It works in two steps. The first step consists of predicting whether a token should be copied from the input sequence; this is accomplished by including a <COPY> token in the vocabulary. The second step consists of replacing the <COPY> token with the input token that has the highest attention value.

The loss function is adapted to cope with the copy mechanism. If the target token is in the input sequence, the negative log-likelihood loss between the <COPY> token and the decoder output is added to the loss. Also, the attention values should align with the input tokens that should be copied, so the cross-entropy loss between the attention values and the index of the corresponding input token is added to the loss. If the target token is not in the input sequence, the negative log-likelihood loss between the target token and the decoder output is added to the loss.

V. EXPERIMENTS

This section starts by explaining the evaluation details. Next, it discusses the results of the experiments with the models from Section IV, comparing the baseline models and the encoder-decoder model on both simple and composite queries.

A. Evaluation details

We have chosen to use the deep learning framework PyTorch from Facebook, because SQLNet and Seq2SQL are written in PyTorch and open-sourced on GitHub, which can serve as inspiration. As discussed in Section III, the dataset consists of 1,537 template questions containing tags such as [@film] or [@actor]. The tags are replaced with 10 random, corresponding values from the IMDb database. The resulting 15,370 question-query pairs are split into (85%, 0%, 15%) and (70%, 15%, 15%) train, validation and test sets for respectively the GloVe-based model and the other models. The sets are separated such that a template question from one set is not present in the other sets.

The evaluation metric used for the baselines and models is the accuracy: the proportion of correct cases among the total number of cases examined. Because some parts of the SQL query are harder to predict than others, we calculate the accuracy of the different components of the SQL query, so that it becomes clear on which components of the query the model has to be improved. Table I shows the evaluation components. In case of multiple conditions, the conditions are sorted before comparison, because the conditions in a SQL query are commutative.
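A minimal sketch of how such a component-wise comparison could look is given below. It assumes queries have already been parsed into their clauses; the dictionary structure is an assumption for illustration and this is not the evaluation code behind the reported numbers.

# Illustrative component-wise comparison of a predicted and a ground-truth query.
# Both are assumed to be dicts with 'select', 'from', 'joins' and 'conds' fields.
def component_accuracy(pred, gold):
    select_ok = pred["select"] == gold["select"]
    from_ok = pred["from"] == gold["from"]
    joins_ok = sorted(pred["joins"]) == sorted(gold["joins"])
    # WHERE conditions are commutative, so sort them before comparing
    conds_ok = sorted(pred["conds"]) == sorted(gold["conds"])
    return {
        "select": select_ok,
        "from": from_ok,
        "joins": joins_ok,
        "conds_all_op": conds_ok,
        "all": select_ok and from_ok and joins_ok and conds_ok,
    }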

Table I: Evaluation components of the query.
Select: checks if the column name after SELECT is correct.
From: checks if the table name after FROM is correct.
Where, first operands (conds first op): checks if the column name (the first operand) after WHERE is correct.
Where, all operands (conds all op): checks if the whole condition after WHERE is correct.
Joins (only for composite queries): checks if the table names and conditions of the JOIN clause of the query are correct.
All: checks if the whole query is correct.

Table IV: Choice of hyperparameters.
Optimizer: Adam
Learning rate:
Dropout: 0.1
Number of hidden nodes: 256

B. Simple queries

The accuracy results of the GloVe-based model and SQLNet on the different components of the simple queries can be found in Table II. The table shows that the biggest room for improvement lies in predicting the WHERE condition (conds all op), while predicting the table name after FROM is already perfect. Also, the GloVe-based model predicts the column names after SELECT and as first operand of the WHERE condition (conds first op) with an accuracy of respectively 84.45% and 82.51%, which is already quite high.

Table II: Accuracy results of the GloVe-based model and SQLNet - simple queries.
                          GloVe-based   SQLNet
Select                    84.45%        71.7%
From                      100.0%        100%
Where (first operands)    82.51%        /
Where (all operands)      36.19%        46.8%
All                       29.48%        36.5%

Table III: Accuracy results of the encoder-decoder model - simple queries.
                          Simple    With attention   With attention and copy
Select                    95.47%    94.08%           93.87%
From                      100%      99.65%           99.87%
Where (first operands)    92.74%    90.04%           92.39%
Where (all operands)      65.99%    60.55%           67.77%
All                       64.42%    58.59%           66.29%

The accuracy results of the encoder-decoder model on the different components of the simple queries can be found in Table III. The choice of hyperparameters can be found in Table IV. The model is trained until the validation loss converges, which is after approximately 9 epochs. Note the difference with the training procedure of SQLNet, where multiple models, one per clause of the query, were trained independently of each other until their validation loss converged. As can be seen in Figure 2, the encoder-decoder model has the most impact on predicting the last operand of the WHERE clause, which results in a higher total accuracy. Clearly, the attention mechanism alone has no impact on the simple queries. However, in combination with the copy mechanism there is a 2% increase in overall accuracy compared to the simple decoder.

Figure 2: Accuracy results for the WHERE clause and the whole query - simple queries.

C. Composite queries

The accuracy results of the GloVe-based and encoder-decoder models are shown in Table V. The choice of hyperparameters is the same as with the simple queries and can be found in Table IV. The model is trained until the validation loss converges, which is after approximately 9 epochs. Analogous to the simple queries, the extensions with the attention mechanism and copy mechanism have a particular effect on predicting the third operand of the WHERE clause. Figure 3 zooms in on the prediction of the last operand and the correctness of the whole query. We notice again that the attention mechanism alone brings no improvement in the accuracy of the whole query. However, when it is combined with the copy mechanism, it brings an additional improvement of 8%. All three models score better than the GloVe-based model.

Figure 3: Accuracy results for the WHERE clause and the whole query - composite queries.
Table V: Accuracy results of the encoder-decoder model - composite queries.
                          GloVe     Simple    With attention   With attention & copy
Select                              96.46%    94.00%           96.28%
From                                99.60%    99.41%           99.96%
Where (first operands)              94.89%    91.76%           93.51%
Where (all operands)                65.69%                     73.94%
Joins                     80.74%              92.83%           94.49%
All                                 63.32%    63.32%           71.29%

An example of the attention mechanism is visualized in Figure 4. Here, the SELECT, FROM and JOIN clauses are determined by one word, "producer". The words "angele et tony" have weight 1 in the prediction of the last operand of the WHERE clause, where they are copied correctly.

Figure 4: Visualization of the attention mechanism for a question-query pair.

The copy mechanism works in two steps: predicting a <COPY> token and, if it should copy, choosing which token to copy. The <COPY> tokens are correctly predicted 88.71% of the time. Table VI shows three examples of question-query pairs and the results of the simple decoder and the decoder with the attention and copy mechanism.

The first example shows a case where the copy mechanism correctly predicts the tokens to copy, but the attention is wrongly aligned for the last token, causing it to copy the wrong word. The second example illustrates a common error of the copy mechanism: wrongly predicting the length of the condition. The third example illustrates a case where the copy mechanism is correct and the simple decoder is not.

VI. CONCLUSION

This work started with an introduction to the NL2SQL problem, exploring the state-of-the-art solutions. Although these scored an overall accuracy of approximately 70%, the neural network approaches were trained only on simple SQL queries. Hence the introduction of a new dataset containing composite SQL queries. The encoder-decoder model scored better on this dataset than the GloVe-based model and the state-of-the-art SQLNet, but could still improve in predicting the last operand of the WHERE clause. The extensions with the attention and copy mechanism helped to increase this accuracy, with promising results.

Future work should consist of generalizing the model to be able to predict queries on other databases. The current model will not be able to adapt seamlessly to a new database, and it is not feasible to generate a new dataset for each new database. The model should take as input, next to human questions, the database schema, and adapt its predicted queries to this schema. The WikiSQL dataset consists of 24,241 different tables, providing an ideal dataset to train such generalized solutions on.

REFERENCES

[1] Luke S. Zettlemoyer and Michael Collins. Online learning of relaxed CCG grammars for parsing to logical form. In Proceedings of EMNLP-CoNLL, 2007.
[2] Chris Quirk, Raymond Mooney, and Michel Galley. Language to Code: Learning Semantic Parsers for If-This-Then-That Recipes. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, 2015.
[3] Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig. SQLizer: Query Synthesis from Natural Language. Proceedings of the ACM on Programming Languages (OOPSLA), 2017.
[4] Li Dong and Mirella Lapata. Language to Logical Form with Neural Attention. In ACL, 2016.
[5] Victor Zhong, Caiming Xiong, and Richard Socher. Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. arXiv preprint, 2017.
[6] Xiaojun Xu, Chang Liu, and Dawn Song. SQLNet: Generating Structured Queries from Natural Language without Reinforcement Learning. arXiv preprint, 2017.
[7] Chenglong Wang, Marc Brockschmidt, and Rishabh Singh. Pointing Out SQL Queries from Text. arXiv preprint, 2017.
[8] Xinyun Chen, Chang Liu, Richard Shin, Dawn Song, and Mingcheng Chen. Latent Attention for If-Then Program Synthesis. In NIPS, 2016.
[9] J. M. Zelle and R. J. Mooney. Learning to Parse Database Queries using Inductive Logic Programming. 1996.
[10] Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. Large-scale Simple Question Answering with Memory Networks. arXiv preprint, 2015.
[11] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP), 2014.
[12] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to Sequence Learning with Neural Networks. In NIPS, 2014.
[13] Jason Brownlee. What is teacher forcing for recurrent neural networks? Blog post.
[14] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR, abs/1409.0473, 2014.
[15] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective Approaches to Attention-based Neural Machine Translation. In EMNLP, 2015.

Table VI: Example predictions by the different models. Q denotes the natural language question and G denotes the corresponding ground truth query. S and A&C denote respectively the queries produced by the simple decoder and the decoder with the copy and attention mechanism. Our models generally generate the table name twice, so that the column names can specify which table they come from, but this is left out of the table for the sake of brevity. In the original table, wrongly predicted words were indicated in bold.

Q1:  which iowan cinematographer produced the film Appunti inutili - Virgilio Giotti
S:   SELECT name FROM movie INNER JOIN made_by ON movie.mid = made_by.msid INNER JOIN producer ON made_by.pid = producer.pid WHERE movie.title = aquele querido mes de agosto
A&C: SELECT name FROM movie INNER JOIN made_by ON movie.mid = made_by.msid INNER JOIN producer ON made_by.pid = producer.pid WHERE movie.title = Appunti inutili - Virgilio Virgilio
G:   SELECT name FROM movie INNER JOIN made_by ON movie.mid = made_by.msid INNER JOIN producer ON made_by.pid = producer.pid WHERE movie.title = Appunti inutili - Virgilio Giotti

Q2:  where is the film Angry Samoans from
S:   SELECT country_code FROM movie INNER JOIN copyright ON movie.mid = copyright.msid INNER JOIN company ON copyright.cid = company.id WHERE movie.title = apron strings
A&C: SELECT country_code FROM movie INNER JOIN copyright ON movie.mid = copyright.msid INNER JOIN company ON copyright.cid = company.id WHERE movie.title = Angry
G:   SELECT country_code FROM movie INNER JOIN copyright ON movie.mid = copyright.msid INNER JOIN company ON copyright.cid = company.id WHERE movie.title = Angry Samoans

Q3:  what film is a part of the Crime film genre
S:   SELECT title FROM movie INNER JOIN classification ON movie.mid = classification.msid INNER JOIN genre genre ON classification.gid = genre.gid WHERE genre.genre = game-show
A&C: SELECT title FROM movie INNER JOIN classification ON movie.mid = classification.msid INNER JOIN genre genre ON classification.gid = genre.gid WHERE genre.genre = Crime
G:   SELECT title FROM movie INNER JOIN classification ON movie.mid = classification.msid INNER JOIN genre genre ON classification.gid = genre.gid WHERE genre.genre = Crime

Contents

Foreword
Permission of Use
Overview
Extended Abstract
Contents
List of Figures
List of Tables
List of abbreviations

1 Introduction
2 Related Literature
  2.1 SQL: syntax and usage
  2.2 Neural networks
    Optimizer
    Feedforward neural networks
    Recurrent neural networks
    Long Short-Term Memory
    Gated Recurrent Unit
    Sequence-to-sequence model
  2.3 Natural-language-to-SQL: NL2SQL
    Semantic parsing approaches
    Neural network approaches
  2.4 Conclusion

3 Dataset
4 Methodology
  GloVe-based model
    GloVe: Global Vectors for Word Representation
    Method
  SQLNet
  Encoder decoder
    Word embedding
    Encoder
    Decoder
    Decoder with attention mechanism
    Copy mechanism
  Conclusion
5 Experiments
  Experimental setup
    Deep learning framework
    Evaluation details
  Baselines
    GloVe-based model
    SQLNet
  Encoder-decoder model
    Simple queries
    Composite queries
  Hyperparameter optimization
    Optimizer
    Dropout
    Hidden nodes
  Conclusion
6 Conclusion
Bibliography

List of Figures

Example of translating a question to a query
SGD without momentum versus Adam with momentum
Example of a feedforward neural network with one hidden layer
Unfolding in time of an RNN, figure from Olah [2015]
Graphical representation of an LSTM unit, figure from Olah [2015]
Graphical representation of a GRU unit, figure from Olah [2015]
High-level representation of the encoder-decoder model
Schematic overview of SQLizer, figure from Yaghmazadeh et al. [2017]
Entity-relationship model of the IMDb database
Length distribution over the input sentences and queries
High-level overview of the encoder-decoder architecture
Encoder architecture
Simple decoder architecture
Attention mechanism. The attention weights are values in the range [0, 1], which are mapped to [black, white]. The encoder outputs are multiplied and changed accordingly
Attention decoder architecture
Accuracy of the GloVe model - simple queries
Accuracy of the GloVe model - composite queries
Accuracy results of the encoder-decoder model - simple queries
Accuracy results for the WHERE clause and the whole query - simple queries
Accuracy results of the encoder-decoder model - composite queries
Accuracy results for the WHERE clause and the whole query - composite queries

Visualization of attention for a question-query pair
Visualization of attention for a question-query pair
The negative log-likelihood loss for the SGD optimizer (orange) and the Adam optimizer (blue), as a function of the number of trained samples
The negative log-likelihood loss for the learning rates 0.1, 0.01 and ... for the Adam optimizer, as a function of the number of trained samples. The graphs are respectively red, blue and orange
The absolute accuracy of the whole query calculated on the validation set for the dropout values p of 0.1, 0.3 and 0.5, as a function of the number of trained samples. The graphs are respectively orange, blue and red
The negative log-likelihood loss for different numbers of hidden nodes

List of Tables

Example of a Freebase tuple and corresponding question
Distribution of the number of tags in a question
Question examples
Number of unique template questions per relationship
Question-query examples
Example of a template question and the generated questions
Split of the dataset into training, validation and test set for the different models
Evaluation components of the query
Accuracy results of the GloVe-based model and SQLNet - simple queries
Accuracy results of the encoder-decoder model - simple queries
Choice of hyperparameters - simple and composite queries
Accuracy results of the encoder-decoder model - composite queries
Accuracy of the first step of the copy mechanism and the frequency of the error cases
Example predictions by the different models (Q denotes the natural language question, G the ground truth query, and S and A&C the queries produced by the simple decoder and the decoder with copy and attention mechanism)

List of abbreviations

Adam: Adaptive Moment Estimation
GloVe: Global Vectors for Word Representation
GRU: Gated Recurrent Unit
IMDb: Internet Movie Database
LSTM: Long Short-Term Memory
NL2SQL: Natural-Language-To-SQL
OOV: Out Of Vocabulary
RNN: Recurrent Neural Network
SGD: Stochastic Gradient Descent
SQL: Structured Query Language

Chapter 1
Introduction

The IT revolution of the past few decades has resulted in a large-scale digitization of data, making it accessible to millions of users in the form of databases. However, accessing these databases requires an understanding of query languages such as Structured Query Language (SQL), which, while powerful, is difficult to master and often beyond the programming expertise of a majority of end-users. Thus, building effective semantic parsers that can translate natural language questions into logical forms such as queries has been a long-standing goal (Zettlemoyer and Collins [2007], Quirk et al. [2015], Yaghmazadeh et al. [2017]).

Dong and Lapata [2016] showed that recurrent neural networks with attention and copying mechanisms can be used effectively to build successful semantic parsers. Recent work by Zhong et al. [2017] introduced the state-of-the-art Seq2SQL model for question-to-SQL translation in the supervised setting. In order to build this model, they published the WikiSQL dataset, which is an order of magnitude larger than previous semantic parsing datasets. Xu et al. [2017] and Wang et al. [2017] also published papers that improved the accuracy on WikiSQL. However, the SQL queries in the WikiSQL dataset lack complex operators such as JOIN.

This work focuses on generating composite SQL queries (i.e. queries on multiple tables) from natural language questions using recurrent neural networks with attention and copying mechanisms. While the most recent solutions mentioned above score an accuracy of approximately 70% on the WikiSQL dataset, those models cannot predict the JOIN operator. Training in a supervised manner requires labeled examples of question-query pairs, hence this work also introduces a new dataset of question-query pairs, including both simple and composite SQL queries.

Figure 1.1 shows a use case of our solution. The non-technical end-user asks a natural language question, which is translated by our model into a SQL query. By executing this SQL query on a database, the answer can be returned to the end-user. For example, the question "what is a work directed by Martin Scorsese" is translated into

SELECT title
FROM movie movie
INNER JOIN directed_by directed_by ON movie.mid = directed_by.msid
INNER JOIN director director ON directed_by.did = director.did
WHERE director.name = Martin Scorsese

which, executed on the IMDb database, returns Taxi Driver, Goodfellas, Silence, ...

Figure 1.1: Example of translating a question to a query.

This dissertation is organized as follows. Chapter 2 first gives a brief overview of SQL syntax and neural networks, and discusses the related literature on the state-of-the-art of existing natural-language-to-SQL (NL2SQL) solutions. Chapter 3 discusses the construction and details of the dataset, followed by Chapter 4, which presents the solutions that tackle this problem. The experiments and their results are shown in Chapter 5. Finally, conclusions are drawn in Chapter 6.
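To make the last step of Figure 1.1 concrete, the sketch below executes a generated query against a relational database. It assumes an SQLite copy of the IMDb data in a file named imdb.db with the schema shown above; both the file name and the use of SQLite are assumptions for illustration only.

import sqlite3

# Illustrative only: assumes "imdb.db" is an SQLite copy of the IMDb database
# with the movie, directed_by and director tables from Figure 1.1.
generated_query = """
    SELECT title
    FROM movie
    INNER JOIN directed_by ON movie.mid = directed_by.msid
    INNER JOIN director ON directed_by.did = director.did
    WHERE director.name = ?
"""

connection = sqlite3.connect("imdb.db")
# the condition value is passed as a parameter here instead of a literal
answers = connection.execute(generated_query, ("Martin Scorsese",)).fetchall()
for (title,) in answers:
    print(title)          # e.g. Taxi Driver, Goodfellas, Silence, ...
connection.close()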

Chapter 2
Related Literature

Our system translates natural language questions to SQL queries using neural networks. This chapter explains the most essential concepts behind the system: SQL and neural networks. It starts by briefly explaining the SQL syntax, after which it considers feedforward and recurrent neural networks. After explaining these concepts, the third part zooms in on the related literature on the natural-language-to-SQL (NL2SQL) problem.

2.1 SQL: syntax and usage

Relational databases store a vast amount of today's information and are a component of many applications. Accessing these relational databases requires understanding query languages such as Structured Query Language (SQL). SQL supports four fundamental operations, referred to as CRUD: Create, Read, Update and Delete. In this work, we focus on retrieving information using SQL. The general form for retrieving information from one table is as follows:

SELECT column_names
FROM table_name
WHERE condition
(ORDER BY sort-order)

If we need to combine two or more tables to retrieve all the necessary columns, the JOIN keyword is added. More specifically, INNER JOIN selects records that have matching values in both tables. The general form is as follows:

SELECT column_names
FROM table_name1
INNER JOIN table_name2 ON table_name1.column_name1 = table_name2.column_name2
WHERE condition

In the remainder of this work, queries on one table and queries on multiple tables will be called respectively simple and composite queries.

2.2 Neural networks

Goldberg [2015] describes a neural network as a computational model that is inspired by the way biological neural networks in the human brain process information. The basic unit of computation in a neural network is the neuron (often called a node or unit), which receives input from other nodes or from an external source and computes an output. Each input has an associated weight, which is assigned based on its relative importance to other inputs. The node applies a non-linear function f (the activation function) to the weighted sum of its inputs. The output y of a single neuron can be calculated as follows:

y = f(W^T x + b) = f\left(\sum_{i=1}^{n} W_i x_i + b\right)    (2.1)

where W (= [w_1, ..., w_n]), x (= [x_1, ..., x_n]) and b are respectively the weight vector, the input vector and the bias. The main function of the bias is to provide every node with a trainable constant value. The activation functions that are mentioned in this dissertation are as follows:

Sigmoid: real-valued input, output between 0 and 1:

\sigma(x) = \frac{1}{1 + \exp(-x)}    (2.2)

Tanh: real-valued input, output between -1 and 1:

\tanh(x) = 2\sigma(2x) - 1    (2.3)

ReLU: real-valued input, negative values replaced with zero:

f(x) = \max(0, x)    (2.4)
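As a small illustration of Equation 2.1 and the three activation functions, the following PyTorch snippet computes the output of a single neuron; the numbers are arbitrary example values, not values from this work.

import torch

# A single artificial neuron: y = f(W^T x + b), with example values.
x = torch.tensor([1.0, 2.0, -1.0])      # input vector
W = torch.tensor([0.5, -0.3, 0.8])      # weight vector
b = torch.tensor(0.1)                   # bias

z = W @ x + b                           # weighted sum of the inputs
print(torch.sigmoid(z))                 # sigmoid activation, output in (0, 1)
print(torch.tanh(z))                    # tanh activation, output in (-1, 1)
print(torch.relu(z))                    # ReLU activation, negative values set to zero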

Optimizer

Neural networks use optimizers to minimize a loss function, which depends on the model's internal parameters. There are two popular gradient descent optimizers: stochastic gradient descent and Adaptive Moment Estimation. In the next paragraphs, the concept of gradient descent and the two optimizers are discussed.

Gradient descent is by far the most common way to optimize neural networks. It is a way to minimize a loss function L(\theta), parameterized by the model's parameters \theta \in \mathbb{R}^d, by updating the parameters in the opposite direction of the gradient \nabla_\theta L(\theta) of the loss function w.r.t. the parameters. The learning rate \eta determines the size of the steps we take to reach a (local) minimum. A learning rate that is too small leads to slow convergence, while a learning rate that is too large can hinder convergence (Ruder [2017]).

Stochastic Gradient Descent (SGD) performs a parameter update for each training example x^{(i)} and label y^{(i)}:

\theta = \theta - \eta \nabla_\theta L(\theta; x^{(i)}, y^{(i)})    (2.5)

where \eta is the learning rate, \theta are the model's parameters and L(\theta) is the loss function. SGD performs frequent updates with a high variance, which causes the loss function to fluctuate heavily. The algorithm has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another, which are common around local optima. Figure 2.1 shows SGD oscillating across the slopes of the ravine while only making hesitant progress along the bottom towards the local minimum.

Adaptive Moment Estimation (Adam) uses a few tricks to improve on SGD. One of these tricks is momentum, which solves the ravine problem. Momentum consists of adding some fraction of the previous update to the current update, so that repeated updates in a particular direction compound: momentum builds up and the parameters move faster and faster in the direction of the minimum. In the case of the ravine, momentum builds up in the direction of the minimum, since all updates have a component in that direction. Figure 2.1 shows the impact that momentum has on converging to a minimum. Another trick that Adam uses is to adaptively select a separate learning rate for each parameter. This speeds up learning in cases where the appropriate learning rates vary across parameters, and it makes tuning of the learning rate less important, because performance is less sensitive to it. Kingma and Ba [2014] show empirically that Adam works well in practice and compares favorably to other adaptive learning-method algorithms.
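The snippet below sketches how the two optimizers are selected and used in PyTorch, the framework used later in this work. The toy model, data and learning rates are illustrative values only, not the settings of the actual experiments.

import torch

# A toy model and loss, purely for illustration.
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()

# Either optimizer minimizes the loss w.r.t. the model parameters theta.
# optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

x = torch.randn(32, 10)
y = torch.randn(32, 1)
for step in range(100):
    optimizer.zero_grad()           # clear accumulated gradients
    loss = loss_fn(model(x), y)     # forward pass and loss L(theta)
    loss.backward()                 # backpropagation: compute grad_theta L(theta)
    optimizer.step()                # parameter update, e.g. theta <- theta - eta * grad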

Figure 2.1: SGD without momentum versus Adam with momentum.

Feedforward neural networks

The first and simplest type of neural network is the feedforward neural network, where the information flows in only one direction: forward, from the input nodes, through the hidden nodes (if any), to the output nodes. An example of a feedforward neural network with one hidden layer can be found in Figure 2.2. The input nodes x_1, ..., x_K are connected with the hidden nodes h_1, ..., h_N through the associated weights {w_{ki}}. These hidden nodes are connected with the output nodes y_1, ..., y_M through the associated weights {w_{ij}}.

Recurrent neural networks

When dealing with language data, it is very common to work with sequences of variable length, such as words (sequences of letters) and sentences (sequences of words). A traditional feedforward neural network assumes that all inputs (and outputs) are independent of each other. Recurrent Neural Networks (RNNs), on the other hand, are called recurrent because they perform the same task for every element of the sequence, which makes them better suited for language tasks. The parameters are shared across the steps.

Figure 2.2: Example of a feedforward neural network with one hidden layer.

As seen in Figure 2.3, the RNN can be written out for the complete sequence, also called unfolding or unrolling. The formulas that describe the computation in an RNN are as follows:

x_t is the input at time step t. For example, x_1 could be a one-hot vector corresponding to the second word of a sentence.

h_t is the hidden state at time step t, which can be seen as the memory of the network. The hidden state is calculated based on the previous hidden state and the input at the current step:

h_t = f(x_t, h_{t-1})    (2.6)

where f is typically one of the activation functions described in Section 2.2. The hidden state required to calculate the first hidden state h_0 is generally initialized to all zeroes.

o_t is the output at step t and is, in the simplest model, the Elman network by Elman [1990], calculated as a function of the memory at time t, h_t:

o_t = g(h_t)    (2.7)

where g is typically one of the activation functions described in Section 2.2.
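A minimal Elman-style RNN, written out explicitly to show the recurrence of Equations 2.6 and 2.7, could look as follows. This is a sketch that assumes tanh for both f and g; in practice PyTorch also offers a built-in torch.nn.RNN.

import torch

class ElmanRNN(torch.nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # one weight matrix applied to the concatenation of x_t and h_{t-1}
        self.in2hidden = torch.nn.Linear(input_size + hidden_size, hidden_size)
        self.hidden2out = torch.nn.Linear(hidden_size, output_size)

    def forward(self, inputs):
        # inputs: an iterable of tensors of shape (1, input_size)
        h = torch.zeros(1, self.in2hidden.out_features)   # h_0 initialized to all zeroes
        outputs = []
        for x_t in inputs:                                 # same parameters reused at every step
            h = torch.tanh(self.in2hidden(torch.cat([x_t, h], dim=1)))   # h_t = f(x_t, h_{t-1})
            outputs.append(torch.tanh(self.hidden2out(h)))               # o_t = g(h_t)
        return outputs, h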

Figure 2.3: Unfolding in time of an RNN, figure from Olah [2015].

A simple RNN is hard to train effectively because of the vanishing gradients problem. When error signals (gradients) are passed back through many time steps, they tend to diminish quickly in the backpropagation process, which makes it hard for the RNN to capture long-range dependencies. For example, when using an Elman RNN with a tanh activation function, the gradients are in the range (-1, 1). Backpropagation computes gradients by the chain rule, which has the effect of multiplying n of these small numbers to compute the gradient of the first unrolled layer for an input sequence of length n, resulting in a small gradient. Activation functions such as ReLU suffer less from the vanishing gradient problem, because they only saturate in one direction. In the following subsections, two extensions to the RNN architecture, the LSTM and the GRU, are discussed that solve this problem.

Long Short-Term Memory

The Long Short-Term Memory (LSTM) architecture was designed by Hochreiter and Schmidhuber [1997] to solve the vanishing gradients problem. LSTMs also have a chain-like structure, but the repeating module has a different structure: instead of having a single neural network layer, there are four layers interacting in a special way. The key to LSTMs is the cell state, the horizontal line running through the top of the diagram (see Figure 2.4). The LSTM has the ability to remove or add information to the cell state, carefully regulated by structures called gates.

The gating layers have a sigmoid activation function, which outputs numbers between zero and one. These outputs are multiplied by the components, such that a value of zero means "let nothing through", while a value of one means "let everything through". The first layer on the left is the forget gate layer f_t, which decides what information from the cell state is thrown away:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)    (2.8)

where \sigma is the sigmoid function, W_f is a trainable matrix and h_{t-1}, x_t and b_f are respectively the previous output, the input and the bias. Second, a \sigma-layer called the input gate layer i_t decides which values will be updated. Next, a tanh layer creates a vector of new candidate values, \tilde{C}_t, that could be added to the state:

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)    (2.9)

\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)    (2.10)

where \sigma is the sigmoid function, W_i and W_C are trainable matrices and h_{t-1}, x_t, b_i and b_C are respectively the previous output, the input and the biases. These two steps are combined to update the cell state C_t. The output gate layer o_t decides which parts will be output:

C_t = f_t * C_{t-1} + i_t * \tilde{C}_t    (2.11)

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)    (2.12)

h_t = o_t * \tanh(C_t)    (2.13)

where \sigma is the sigmoid function, W_o is a trainable matrix and h_{t-1}, x_t and b_o are respectively the previous output, the input and the bias.
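The equations above translate almost directly into code. The sketch below implements one LSTM step following Equations 2.8-2.13; it is illustrative only, and in practice one would use the built-in torch.nn.LSTM.

import torch

class LSTMCellSketch(torch.nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        concat = input_size + hidden_size
        self.W_f = torch.nn.Linear(concat, hidden_size)   # forget gate
        self.W_i = torch.nn.Linear(concat, hidden_size)   # input gate
        self.W_c = torch.nn.Linear(concat, hidden_size)   # candidate values
        self.W_o = torch.nn.Linear(concat, hidden_size)   # output gate

    def forward(self, x_t, h_prev, c_prev):
        hx = torch.cat([h_prev, x_t], dim=1)              # [h_{t-1}, x_t]
        f_t = torch.sigmoid(self.W_f(hx))                 # (2.8)
        i_t = torch.sigmoid(self.W_i(hx))                 # (2.9)
        c_tilde = torch.tanh(self.W_c(hx))                # (2.10)
        c_t = f_t * c_prev + i_t * c_tilde                # (2.11)
        o_t = torch.sigmoid(self.W_o(hx))                 # (2.12)
        h_t = o_t * torch.tanh(c_t)                       # (2.13)
        return h_t, c_t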

Gated Recurrent Unit

The Gated Recurrent Unit (GRU), proposed by Cho et al. [2014], is a variant of the LSTM that combines the forget and input gates f_t and i_t (Equations 2.8 and 2.9) into a single update gate z_t. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular. The architecture of a GRU cell can be found in Figure 2.5. The update and reset gates z_t and r_t are calculated as follows:

z_t = \sigma(W_z \cdot [h_{t-1}, x_t])    (2.14)

r_t = \sigma(W_r \cdot [h_{t-1}, x_t])    (2.15)

where W_z and W_r are trainable matrices and h_{t-1} and x_t are respectively the previous output and the input. The candidate activation \tilde{h}_t and the activation h_t are calculated as follows:

\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t])    (2.16)

h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t    (2.17)

where W is a trainable matrix, z_t and r_t are respectively the update and reset gate, and h_{t-1} and x_t are respectively the previous output and the input. The most prominent feature shared between the LSTM unit and the GRU unit is the additive component of their update from t to t+1, which is lacking in the traditional RNN. This has two advantages: it makes it easy for each unit to remember the existence of a specific feature in the input stream for a long series of steps, and it creates shortcut paths that bypass multiple temporal steps (Chung et al. [2014]).
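Analogously, one GRU step following Equations 2.14-2.17 can be sketched as follows (a sketch without biases, as in the equations above; PyTorch also provides a built-in torch.nn.GRU).

import torch

class GRUCellSketch(torch.nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        concat = input_size + hidden_size
        self.W_z = torch.nn.Linear(concat, hidden_size, bias=False)   # update gate
        self.W_r = torch.nn.Linear(concat, hidden_size, bias=False)   # reset gate
        self.W = torch.nn.Linear(concat, hidden_size, bias=False)     # candidate activation

    def forward(self, x_t, h_prev):
        hx = torch.cat([h_prev, x_t], dim=1)                            # [h_{t-1}, x_t]
        z_t = torch.sigmoid(self.W_z(hx))                               # (2.14)
        r_t = torch.sigmoid(self.W_r(hx))                               # (2.15)
        h_tilde = torch.tanh(self.W(torch.cat([r_t * h_prev, x_t], dim=1)))   # (2.16)
        return (1 - z_t) * h_prev + z_t * h_tilde                       # (2.17)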

Figure 2.5: Graphical representation of a GRU unit, figure from Olah [2015].

In the LSTM unit, the amount of memory content that is exposed is controlled by the output gate o_t, while the GRU exposes its full content without any control. Also, the LSTM unit controls the amount of new memory content being added to the memory cell independently from the forget gate, whereas in the GRU this control is tied to the update gate. In general, it is difficult to conclude which type of gating unit performs better. In this dissertation, the GRU is picked because Bahdanau et al. [2014] reported that these two units performed comparably to each other on machine translation, and because the GRU is in general more efficient due to its less complex structure.

Sequence-to-sequence model

In machine translation, input sequences and output sequences often have different lengths, and the entire input sequence is required before prediction of the target can start. Sutskever et al. [2014] from Google presented a general end-to-end approach to sequence-to-sequence learning. The idea is to use one RNN (the encoder) to read the input sequence, one timestep at a time, and obtain a large fixed-dimensional vector representation: the context vector. Another RNN (the decoder) is used to unfold this vector into a new sequence. Figure 2.6 shows a sequence-to-sequence network for translating French to English. The encoder outputs the context vector, which the decoder unfolds into a translated sequence. As can be seen in the figure, the decoder uses its own outputs as inputs.

Figure 2.6: High-level representation of the encoder-decoder model. The encoder reads the French input "Le chat est noir <EOS>" into a fixed-dimension context vector, which the decoder unfolds into the English output "The cat is black <EOS>".

Normally, the words in the sentence are first embedded, but this is omitted in the figure for the sake of simplicity. This model (and possible improvements, e.g. by using an attention mechanism) will be further discussed in Chapter 4.

2.3 Natural-language-to-SQL: NL2SQL

In this section, the existing state-of-the-art on the natural-language-to-SQL (NL2SQL) problem is discussed. The following subsections cover the two different approaches: semantic parsing and neural networks.

Semantic parsing approaches

The primary approach to solve NL2SQL problems is semantic parsing. Semantic parsing is an approach to translate text to a formal meaning representation such as logical forms or structured queries. There have been many works considering parsing a natural language description into a logical form, such as Zelle and Mooney [1996], Zettlemoyer and Collins [2007], Quirk et al. [2015] and Chen et al. [2016]. Most previous systems rely on high-quality lexicons, manually-built templates, and features which are either domain- or representation-specific. They would need to be fine-tuned to the specific domain of interest, and may not generalize. For these reasons, this work focuses on neural network approaches to handle the NL2SQL task, which require less feature engineering.

An example of a semantic parser is SQLizer by Yaghmazadeh et al. [2017].

It consists of an off-the-shelf parser that translates a natural language question into a sketch, which only specifies the shape (rather than the full content) of the query (e.g., join followed by selection followed by projection). Employing programming language techniques such as type-directed sketch completion and automatic repairing, their model iteratively refines the sketch into the final query. A schematic overview of their approach can be found in Figure 2.7.

Figure 2.7: Schematic overview of SQLizer, figure from Yaghmazadeh et al. [2017].

Neural network approaches

A second approach to solve NL2SQL problems uses neural networks, in particular the encoder-decoder architecture. Dong and Lapata [2016] created such an encoder-decoder architecture that performs competitively with the existing semantic parsers, without using hand-engineered features, and that is easy to adapt across domains and meaning representations.

Recently, a new dataset on NL2SQL has been released by Zhong et al. [2017] from Salesforce: WikiSQL. It is a corpus of 80,654 hand-annotated instances of natural language questions, SQL queries and SQL tables extracted from 24,241 HTML tables from Wikipedia. The following properties make it a desirable dataset:

- It is an order of magnitude larger than previous semantic parsing datasets, which makes it interesting for data-hungry neural networks.
- The natural language questions are created by human beings (employing crowdsourcing on Amazon Mechanical Turk).
- Synthesizing the SQL query does not rely on the table's content.

The three papers that have competitive scores on the WikiSQL task are discussed in the following subsections: Seq2SQL, SQLNet and Pointing out SQL queries from text (Zhong et al. [2017], Xu et al. [2017], Wang et al. [2017]).

Seq2SQL

Seq2SQL by Zhong et al. [2017] leverages the structure of SQL to reduce the output space of the generated query. The input sequence is the concatenation of all the column names, the question and the SQL vocabulary. Seq2SQL is composed of three parts that correspond to the aggregation operator, the SELECT column and the WHERE clause. The first two components use a cross-entropy loss; the last one uses policy gradient for training.

Aggregation operator. To compute the aggregation operation, a scalar attention score is calculated for each token t in the input sequence. This vector is normalized to produce a distribution over the input encodings. The input representation is the sum over the input encodings weighted by the normalized scores. The score over the aggregation operators (COUNT, MIN, MAX, ...) is obtained by applying a multilayer perceptron to this input representation.

SELECT column. First, each column name is encoded with an LSTM. The input representation is similar to the one used for the aggregation operation, but with untied weights. The score for each column j is obtained by applying a multilayer perceptron over the column representations, conditioned on the input representation.

WHERE clause. Using an encoder-decoder network, the decoder produces a scalar attention score for each position t of the input sequence. The input token with the highest score is chosen as the next token of the generated SQL query. To address the problem of wrongly penalizing queries whose execution results are correct despite not being an exact string match, reinforcement learning is applied: it learns a policy to directly optimize the expected correctness of the execution result.

SQLNet

Xu et al. [2017] want to avoid the necessity of reinforcement learning by avoiding the order-matters problem in a sequence-to-sequence model.

structure of a SQL query. A neural network is used to predict the content for each slot in the sketch.

To predict the WHERE clause, the following models are trained. First, the total number K of columns to be included is predicted. Using an upper bound N on the number of columns, the problem is cast as an (N + 1)-way classification problem. Afterwards, the columns with the highest $P_{\text{wherecol}}(\text{col} \mid Q)$ are picked, where col is a column name and Q is the natural language question. These probabilities are calculated as follows:

$P_{\text{wherecol}}(\text{col} \mid Q) = \sigma(u_c^\top E_{\text{col}} + u_q^\top E_Q)$   (2.18)

where $\sigma$ is the sigmoid function, $E_{\text{col}}$ and $E_Q$ are the embeddings of the column name and the natural language question, and $u_c$ and $u_q$ are two column vectors of trainable variables. They also introduce the column attention mechanism to compute $E_{Q|\text{col}}$ instead of $E_Q$. This mechanism ensures that the most relevant information in the natural language question is used when predicting on a particular column.

Second, predicting the operator slot (choosing from =, >, <) is a 3-way classification. Therefore we compute:

$P_{\text{op}}(i \mid Q, \text{col}) = \operatorname{softmax}\big(U_1^{\text{op}} \tanh(U_c^{\text{op}} E_{\text{col}} + U_q^{\text{op}} E_{Q|\text{col}})\big)$   (2.19)

where col is the column under consideration, $E_{\text{col}}$ and $E_{Q|\text{col}}$ are the embeddings of the column name and the natural language question using the column attention mechanism, and $U_1^{\text{op}}$, $U_c^{\text{op}}$, $U_q^{\text{op}}$ are trainable matrices of size $3 \times d$, $d \times d$ and $d \times d$ respectively.

Third, for the value slot, a substring of the natural language question is predicted. SQLNet employs a sequence-to-sequence structure where the encoder still uses a bidirectional LSTM and the decoder computes the distribution of the next token using a pointer network. The probability of the next token can be computed as:

$P_{\text{val}}(i \mid Q, \text{col}, h) = \operatorname{softmax}(a(h))$   (2.20)

$a(h)_i = (u_a^{\text{val}})^\top \tanh(U_1^{\text{val}} H_Q^i + U_2^{\text{val}} E_{\text{col}} + U_3^{\text{val}} h), \quad i \in \{1, \ldots, L\}$   (2.21)

where $u_a^{\text{val}}$ is a d-dimensional trainable vector, $U_1^{\text{val}}$, $U_2^{\text{val}}$, $U_3^{\text{val}}$ are three trainable matrices of size $d \times d$, L is the length of the natural language question, h is the

hidden state of the previously generated sequence, and $H_Q^i$ is the LSTM output for each token in the natural language question.

The prediction of the column name in the SELECT clause is quite similar to the prediction of the column names in the WHERE clause, with the restriction that only one column among all has to be selected. The aggregation operator is predicted similarly to the operator slot.

Pointing out SQL queries from text

Wang et al. [2017] published the third paper that designed and trained models on the WikiSQL dataset. Their model encodes the input with a bidirectional LSTM and then decodes the hidden state with a typed LSTM. Based on the type, the decoder either copies an output token from the input question using an attention-based copying mechanism or generates it from a fixed vocabulary. The SQL grammar from WikiSQL can be written in regular expression form as:

Select s c From t Where (c op v)

where s, c, t, v are respectively the aggregation operator, the column name, the table name and the value slot. This ensures that the type of the next output token is statically determined.

2.4 Conclusion

The existing state of the art on the NL2SQL problem follows two different approaches: semantic parsing and neural networks. Because most of the previous semantic parsing solutions relied on high-quality lexicons and feature engineering, this work focuses on neural network approaches without hand-engineered features. Currently, the largest dataset of question-query pairs is WikiSQL. The three papers that have competitive scores on the WikiSQL task are Seq2SQL, SQLNet and Pointing out SQL queries from text (Zhong et al. [2017], Xu et al. [2017], Wang et al. [2017]). The main differences between the papers are as follows:

- Zhong et al. [2017] used reinforcement learning to solve the order-matters problem in a sequence-to-sequence model.

- Xu et al. [2017] wanted to avoid reinforcement learning and proposed a sketch-based approach, which only specifies the shape of the query.
- Wang et al. [2017] used a typed decoder that predicts the next token based on its statically determined type.

Despite the several advantages of the WikiSQL dataset over previous semantic parsing datasets, it only consists of simple SQL queries on one table. This work will focus on training an end-to-end encoder-decoder model on composite SQL queries over multiple tables. The construction of the dataset consisting of these question-query pairs will be discussed in the next chapter.

Chapter 3
Dataset

This chapter discusses the construction of the dataset, consisting of question-query pairs. The NL2SQL solutions from the previous chapter used the WikiSQL dataset by Zhong et al. [2017]. Although this dataset offers several benefits, discussed in Section 2.3, it has also received criticism for its massive simplification of the SQL grammar. The dataset lacks any complex operator of the SQL grammar, e.g. JOIN or GROUP BY.

The new dataset will consist of two subdatasets: a dataset containing simple SQL queries and one with composite SQL queries. The simple SQL queries have a similar syntax to the queries from the WikiSQL dataset, while the composite SQL queries will also include the JOIN operator. The simple SQL queries are included so that the WikiSQL models can be run on them, which lets us make sure that the question-query pairs are not trivially composed. The dataset should consist of the following parts:

- Natural language questions, created by humans, in order to overcome the issue that a well-trained model may overfit on template-synthesized descriptions.
- Ground truth SQL queries that are paired with the natural language questions.

We have written the SQL queries based on the IMDb database (the same database as in Yaghmazadeh et al. [2017]; it can be found at goo.gl/dbubmm), consisting of the following tables: movie, actor, director, producer, writer, company and genre. Figure 3.1 shows these tables and their relations. The dark grey tables are the bridge tables that link the movie table with the other tables through two 1-to-many relations, because e.g. one movie can have multiple directors and one director can direct multiple movies. Using these bridge tables, the M:N relations between movie and director can be satisfied.

Figure 3.1: Entity-relationship model of the IMDb database

To gather human questions, the SimpleQuestions dataset from the bAbI project (Bordes et al. [2015]) proves to be very practical. It consists of a total of 108,442 questions, written in natural language by human English-speaking annotators. Each of these questions is paired with a corresponding fact, formatted as (subject, relationship, object). The facts have been extracted from the knowledge base Freebase. An example of a Freebase fact and corresponding question can be found in Table 3.1. The subject and object URLs are deprecated, but they are not necessary for the further construction of the dataset.

Table 3.1: Example of a Freebase tuple (subject, relationship, object) and corresponding question: "Peter Greenhalgh was the cinematographer for what film?" (the subject, relationship and object are Freebase URLs).

The following steps are followed in order to filter the questions:

1. All the questions that can be answered using the IMDb database are extracted, resulting in 8,605 questions. They are lowercased and punctuation is removed.
2. The named entities (names of persons, titles of movies) are replaced by general tags (e.g. Brad Pitt becomes [@actor]). The list of tags is as follows: [@actor], [@director], [@film], [@country], [@year], [@producer], [@production company], [@writer] and [@genre]. It is possible that some questions contain two or three tags, e.g. indicating an actor from a specific genre. The distribution of the amount of tags per question can be found in Table 3.2. Question examples can be found in Table 3.3. Luckily, Bordes et al. [2015] provided a text file entities.txt which contains almost all the entities. We added extra entities in order to replace all the names. The questions without entities are deleted.
3. Using 95 regular expressions to filter the duplicates, this results in 1,540 unique template questions. The distribution over the multiple relationships can be found in Table 3.4.
4. Using these template questions, a dataset can easily be created by replacing all the tags with e.g. ten random values from the IMDb database (a small sketch of this expansion step is given below).
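To make step 4 concrete, the following is a minimal sketch of the tag-replacement step, assuming the template question-query pairs and the per-tag value lists from the IMDb database have already been loaded; the function and variable names are illustrative and this is not the exact script used to build the dataset.

```python
import random

def expand_template(template_question, template_query, values_per_tag, n=10):
    """Generate n question-query pairs by replacing every [@tag] with a random value."""
    pairs = []
    for _ in range(n):
        question, query = template_question, template_query
        for tag, values in values_per_tag.items():
            if tag in question:
                value = random.choice(values)   # e.g. a random movie title for [@film]
                question = question.replace(tag, value)
                query = query.replace(tag, value)
        pairs.append((question, query))
    return pairs

# illustrative usage with a hypothetical template and value list
values_per_tag = {"[@film]": ["Deadpool 2", "Avengers: Infinity War"]}
pairs = expand_template("who wrote the movie [@film]",
                        "SELECT writer_name FROM films WHERE title = '[@film]'",
                        values_per_tag, n=2)
```

Because every tag occurrence in both the question and the query is replaced with the same value, the generated pair stays consistent by construction.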

Table 3.2: Distribution of the amount of tags in a question

Relationship | Questions | Tags
director/film | what film was [@director] the director for | [@director]
director/film | what film is [@director] known for directing | [@director]
director/film | what is a [@genre] film directed by [@director] | [@director], [@genre]
film/written by | who wrote for the film [@film] | [@film]
film/written by | who wrote the movie [@film] | [@film]
film/written by | who wrote the movie [@film] ([@year] film) | [@film], [@year]

Table 3.3: Question examples

Table 3.4: Amount of unique template questions per relationship (1,538 template questions in total)

The combination of the relationship and the tags determines the ground truth SQL query. There are 40 unique combinations of relationship and tags, which we hand-annotated with SQL queries. Examples of simple and composite question-query pairs can be found in Table 3.5; the complete dataset can be downloaded at nl2sql-dataset/overview. The distribution of the lengths of the input sentences and the SQL queries can be found in Figure 3.2.

Question: what film is Big Daddy known for directing
Simple query: SELECT title FROM films WHERE director_name = 'David Mackay'
Composite query: SELECT title FROM movie movie INNER JOIN directed_by directed_by ON movie.mid = directed_by.msid INNER JOIN director director ON directed_by.did = director.did WHERE director.name = 'David Mackay'

Question: who wrote the movie Angel Back
Simple query: SELECT writer_name FROM films WHERE title = 'Angel Back'
Composite query: SELECT name FROM movie movie INNER JOIN written_by written_by ON movie.mid = written_by.msid INNER JOIN writer writer ON written_by.wid = writer.wid WHERE movie.title = 'Angel Back'

Table 3.5: Question-query examples

Due to the lengthy JOIN operations, the composite SQL queries are quite a bit longer than the simple SQL queries and the input sentences. The length distribution of the composite SQL queries has two centers, depending on the number of JOIN operations. The lengths of the input sentences are centered around eight tokens, while the lengths of the simple queries lie in the range [10, 20].

Figure 3.2: Length distribution over the input sentences and queries

Chapter 4
Methodology

This chapter discusses the techniques that are used to build our models, while Chapter 5 will examine the results of these techniques. The first two sections introduce a simple model and a model designed for the WikiSQL dataset, which will serve as baselines on our dataset. Finally, the advanced encoder-decoder model and its extensions are examined.

4.1 GloVe-based model

The first baseline is based on the idea that similar questions should have similar queries. It uses GloVe and cosine similarity for, respectively, the vector representation and the similarity measurement of the questions, which are both explained in the first subsection. The second subsection discusses the method, which is applicable to both simple and composite queries.

4.1.1 GloVe: Global Vectors for Word Representation

GloVe by Pennington et al. [2014] is an unsupervised learning algorithm for obtaining vector representations for words. It is a count-based model, where training is performed on the non-zero entries of a global word-word co-occurrence matrix, which tabulates how frequently words co-occur with one another in a given corpus. The main intuition underlying the model is the simple observation that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning. Other word embeddings such as word2vec (by Mikolov et al. [2013]) are predictive models, which learn their predictive ability by optimizing the loss of predicting the target words from the context words, given the vector representations.

Pennington et al. [2014] published word vectors that are pre-trained on large corpora, which is convenient for us because the GloVe embedding would not work when trained on our small NL2SQL dataset. We decided to use the Common Crawl embedding, which is trained on 42 billion tokens, has a vocabulary of 1.9 million tokens and embeds these tokens in a 300-dimensional vector space.

Cosine similarity between two word vectors provides an effective method for measuring the linguistic or semantic similarity of the corresponding words. The cosine similarity between two vectors is a measure that calculates the cosine of the angle between them. The formula results from solving the equation of the dot product:

$\cos(x, y) = \dfrac{x \cdot y}{\|x\| \, \|y\|}$   (4.1)

where x and y are vectors.

4.1.2 Method

The GloVe-based model consists of the following steps:

1. All the words are lowercased and embedded into a vector using the pre-trained word embedding.
2. The vector representation of the question is the sum of all the vector representations of the words in the question.
3. For each question in the test set, the cosine similarity (described in Section 4.1.1) between that question and all the questions from the training set is calculated.
4. The query accompanying the most similar question from the training set is predicted.

A small code sketch of these steps is given below; the results of the GloVe-based model can be found in Section 5.2.
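The following is a minimal sketch of this retrieval baseline, assuming the pre-trained Common Crawl GloVe vectors have been loaded into a dictionary mapping words to 300-dimensional numpy arrays; the function and variable names are illustrative rather than taken from our actual implementation.

```python
import numpy as np

def embed_question(question, glove, dim=300):
    """Sum the GloVe vectors of all (lowercased) words in the question."""
    vec = np.zeros(dim)
    for word in question.lower().split():
        if word in glove:              # words outside the GloVe vocabulary are skipped here
            vec += glove[word]
    return vec

def cosine_similarity(x, y, eps=1e-8):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + eps)

def predict_query(test_question, train_pairs, glove):
    """Return the query paired with the most similar training question."""
    q_vec = embed_question(test_question, glove)
    best_query, best_sim = None, -1.0
    for train_question, train_query in train_pairs:
        sim = cosine_similarity(q_vec, embed_question(train_question, glove))
        if sim > best_sim:
            best_sim, best_query = sim, train_query
    return best_query
```

The baseline therefore never generates a query of its own: it can only return queries that already occur in the training set, which is exactly why it is useful as a sanity check on the dataset.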

4.2 SQLNet

In order to check that the dataset is not trivially composed, we have trained SQLNet (Xu et al. [2017]), described in Section 2.3.2, on our simple query dataset to compare the accuracy. SQLNet is the state-of-the-art model on the WikiSQL dataset and proposes a sketch-based approach to generate a SQL query. This means that different models are trained to predict each clause of the SQL query, such as the column name after SELECT or the condition after the WHERE clause. We parsed our dataset into the corresponding SQLNet format to be able to train the model. We trained the model without column attention and with fixed word embeddings; the results can be found in Section 5.2.2.

4.3 Encoder-decoder

In machine translation, input sequences and output sequences have different lengths. As already mentioned in Section 2.2.6, Sutskever et al. [2014] presented a general end-to-end approach to sequence learning. This sequence-to-sequence (seq2seq) network, or encoder-decoder network, is a model consisting of two RNNs called the encoder and the decoder. Figure 4.1 shows a high-level overview of the encoder-decoder network. The encoder reads an (embedded) input sequence $x_0, \ldots, x_n$ and outputs a single vector $h_n$, while all the other outputs $h_0, \ldots, h_{n-1}$ are discarded. The decoder reads $h_n$ to produce an output sequence $y_0, \ldots, y_k$. The combination of two RNNs ensures that the lengths of the input and output sequences can be different, as there is no explicit one-to-one relation between the input and output sequences. This is especially important in our case, because the SQL queries are most of the time longer than the questions, as can be seen in Figure 3.2.

The encoder inputs and decoder outputs are embedded with word embeddings, which will be discussed in the first subsection. The encoder and the simple decoder will be explained in the next two subsections. The fourth subsection will elaborate on a solution to the discarding of the other encoder outputs: the attention mechanism. The last subsection will examine the copy mechanism, which deals with rare words and Out-Of-Vocabulary (OOV) words.

Figure 4.1: High level overview of encoder-decoder architecture

4.3.1 Word embedding

The words from the input sentences and output queries are first embedded into a vector representation using the same GloVe embedding (Pennington et al. [2014]) as in the GloVe-based model, explained in Section 4.1.1. Both the mapping between the words and their indexes and the mapping between the indexes and their corresponding 300-dimensional vector representations are stored. The mapping of a (yet unknown) input word of the training set is as follows:

- The word is part of the GloVe vocabulary: the corresponding GloVe embedding is added.
- The word is not part of the GloVe vocabulary: a vector with random values uniformly sampled between -1 and 1 is added. We assigned each word a different vector (and not e.g. all zeroes or the <UNK> embedding) because there were quite a lot (1,233) of words that were not part of the GloVe vocabulary. This means that our new vocabulary, including the words that are not part of the GloVe vocabulary, is 29% larger than with only the GloVe vocabulary.

In both cases, the mapping between the words and their indexes is updated accordingly. The mapping of an input word of the test set, which is unknown at training time, is as follows:

- The word is present in the mapping: the corresponding vector is taken.

- The word is not present in the mapping: the GloVe embedding corresponding to the <UNK> (unknown) token is taken.

4.3.2 Encoder

The encoder architecture can be found in Figure 4.2. The input scalar $x_i$, corresponding to a word, is first embedded using the word embedding explained in Section 4.3.1. A GRU, explained in Section 2.2.5, takes as input this embedding $x_i$ together with the previous hidden state $s_{i-1}$, and produces the output $h_i$ and a new hidden state $s_i$. The first hidden state $s_0$ is initialized to all zeroes. In the simple decoder, only the encoder's last output $h_n$ is used, in contrast to the attention decoder, where all the encoder outputs $h_0, \ldots, h_n$ are needed.

4.3.3 Decoder

The decoder architecture can be found in Figure 4.3. In the simplest decoder, only the last output of the encoder, $h_n$, is used. This is called the context vector, because it encodes the context of the entire sequence. The first hidden state $s_0$ of the decoder is set equal to this context vector.

The input of the decoder depends on whether it uses teacher forcing. Teacher forcing is a method for quickly and efficiently training recurrent neural network models that use the output from a prior time step as input (Brownlee [2017]). It works by using the target token from the training dataset at the current time step, $y^{target}_{i-1}$, as input for the next time step $y_i$, rather than the output generated by the network. Using teacher forcing causes the network to converge faster, but it may also exhibit instability. PyTorch's autograd gives us the freedom (see the dynamic graph definition described in Section 5.1.1) to randomly choose whether to use teacher forcing. We decided to use a teacher forcing ratio of 0.5. So the input of the decoder is as follows:

- Teacher forcing: input $y_i$ = the previous target token $y^{target}_{i-1}$
- No teacher forcing: input $y_i$ = the previously predicted token $y_{i-1}$

The decoder embeds the input $y_{i-1}$ (in the case of no teacher forcing) in the same way as the encoder. A sketch of this random teacher-forcing choice is given below.
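As an illustration of the teacher-forcing choice described above, the following is a schematic sketch of one decoding pass; the decoder is assumed to be a module that maps an input token index and a hidden state to a log-probability vector and a new hidden state, and all names are illustrative rather than taken from our actual training code.

```python
import random
import torch

def run_decoder(decoder, context, target_tokens, sos_index, teacher_forcing_ratio=0.5):
    """Decode one target sequence, randomly choosing whether to use teacher forcing."""
    use_teacher_forcing = random.random() < teacher_forcing_ratio
    decoder_input = torch.tensor([[sos_index]])  # start-of-sequence token
    hidden = context                             # context vector h_n from the encoder
    outputs = []
    for target in target_tokens:
        log_probs, hidden = decoder(decoder_input, hidden)
        outputs.append(log_probs)
        if use_teacher_forcing:
            decoder_input = target.view(1, 1)                   # feed the ground-truth token
        else:
            decoder_input = log_probs.argmax(dim=1).view(1, 1)  # feed the predicted token
    return outputs
```

Because the choice is made once per sequence, roughly half of the training sequences are decoded with ground-truth inputs and the other half with the model's own predictions.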

Figure 4.2: Encoder architecture

A GRU takes as input this embedded vector together with the previous hidden state $s_{i-1}$, and produces an output vector $z_i$ and the next hidden state $s_i$. This output vector $z_i$ is transformed into a distribution over the vocabulary through the following operations:

1. out: a feedforward layer that applies a linear transformation to the output vector $z_i \in \mathbb{R}^{1 \times l}$, with l the number of hidden nodes:

$y_i = W_{out} z_i + b_{out}$   (4.2)

where $y_i$ is a vector in $\mathbb{R}^{1 \times m}$ with m the size of the vocabulary, $W_{out}$ is a trainable matrix and $b_{out}$ is the bias.

2. log_softmax: applies the LogSoftmax function to $y_i$ to obtain log-probabilities. The formula is as follows:

$\mathrm{LogSoftmax}(y_{i,k}) = \log\left(\dfrac{\exp(y_{i,k})}{\sum_j \exp(y_{i,j})}\right)$   (4.3)

where the $y_{i,k}$ are the elements of $y_i$.

The loss function used is the negative log-likelihood loss. By minimizing the negative log-likelihood loss function, the model is encouraged to assign higher probability values to the correct labels across training examples. The negative log-likelihood loss combined with the LogSoftmax is also called the cross-entropy loss.

Figure 4.3: Simple decoder architecture
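To make the encoder and simple decoder concrete, the following is a minimal PyTorch sketch of both modules, processing one token at a time with batch size 1. It is a simplified re-implementation under those assumptions, not our exact code: for brevity it uses a plain trainable nn.Embedding, whereas the actual model initializes the embeddings from the GloVe mapping of Section 4.3.1, and the default sizes merely mirror the values reported in Chapter 5.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim=300, hidden_size=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.gru = nn.GRU(embedding_dim, hidden_size)

    def forward(self, token_index, hidden):
        # token_index: tensor of shape (1, 1); hidden: (1, 1, hidden_size)
        embedded = self.embedding(token_index).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

class DecoderRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim=300, hidden_size=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.gru = nn.GRU(embedding_dim, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, token_index, hidden):
        embedded = self.embedding(token_index).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        # distribution over the output vocabulary (equations 4.2 and 4.3)
        log_probs = F.log_softmax(self.out(output[0]), dim=1)
        return log_probs, hidden
```

In use, the initial encoder hidden state is a zero tensor of shape (1, 1, hidden_size), and the decoder's first hidden state is set to the final encoder hidden state, which for this single-layer GRU equals the last encoder output $h_n$.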

4.3.4 Decoder with attention mechanism

As shown in Figure 4.1, the encoder-decoder model discards all the encoder outputs but the last one. If only the last encoder output (the context vector) is passed between the encoder and the decoder, that single vector carries the burden of encoding the entire sentence. The attention mechanism avoids this by encoding the whole input sequence based on the sequence of all the encoder outputs, as opposed to only the last encoder output. Attention allows the decoder network to focus on a different part of the encoder outputs for every step of the decoder's own outputs. Figure 4.4 shows the attention mechanism, where the attention weights are represented by a grayscale value and the encoder outputs as colors.

Figure 4.4: Attention mechanism. The attention weights are values in the range [0, 1], which are mapped to [black, white]. The encoder outputs are multiplied and changed accordingly.

Our implementation of the attention mechanism works as follows (a code sketch is given after Figure 4.5):

1. A set of attention weights $\alpha_{i,j}$ is calculated. Each weight $\alpha_{i,j}$ is a normalized attention energy $e_{i,j}$:

$\alpha_{i,j} = \dfrac{\exp(e_{i,j})}{\sum_k \exp(e_{i,k})}$   (4.4)

where each attention energy $e_{i,j}$ is calculated with a score function, using the last hidden state $s_{i-1}$ and the particular encoder output $h_j$:

$e_{i,j} = \mathrm{score}(s_{i-1}, h_j)$   (4.5)

Luong et al. [2015] proposed the following score functions:

$\mathrm{score}(s_{i-1}, h_j) = \begin{cases} s_{i-1}^\top h_j & \text{dot} \\ s_{i-1}^\top W_a h_j & \text{general} \\ v_a^\top \tanh(W_a [s_{i-1}; h_j]) & \text{concat} \end{cases}$   (4.6)

where $W_a$ and $v_a$ are respectively a trainable matrix and a trainable vector. We decided to use the general form, which is the dot product between the decoder hidden state $s_{i-1}$ and a linear transformation of the corresponding encoder output $h_j$.

2. These attention weights $\alpha_{i,j}$ are multiplied by the encoder output vectors $h_1, \ldots, h_n$ to create a weighted combination, the context vector $c_i$:

$c_i = \sum_{j=1}^{n} \alpha_{i,j} h_j$   (4.7)

3. The context vector $c_i$ is concatenated with the decoder's input $y_{i-1}$ and serves as the input of the GRU:

$z_i, s_i = f([c_i, y_{i-1}], s_{i-1})$   (4.8)

where f is the GRU, $s_i$ and $s_{i-1}$ are respectively the new and the previous hidden state, and $z_i$ is the output vector.

4. A feedforward layer applies a linear transformation to the concatenation of the GRU output vector $z_i$ and the context vector $c_i$:

$y_i = W_{out} [z_i, c_i] + b_{out}$   (4.9)

where $y_i$ is a vector in $\mathbb{R}^{1 \times m}$ with m the size of the vocabulary, $W_{out}$ is a trainable matrix and $b_{out}$ is the bias.

5. Finally, $y_i$ is passed through a LogSoftmax:

$\mathrm{LogSoftmax}(y_{i,k}) = \log\left(\dfrac{\exp(y_{i,k})}{\sum_j \exp(y_{i,j})}\right)$   (4.10)

where the $y_{i,k}$ are the elements of $y_i$.

The architecture of the attention decoder can be seen in Figure 4.5. Except for the attention mechanism and the additional layers, it is the same as the simple decoder architecture. For the sake of brevity, the embedding of the input is left out of the figure.

Figure 4.5: Attention decoder architecture
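The following is a minimal sketch of the attention decoder with the "general" score function, again assuming batch size 1 and one decoding step at a time; the class and parameter names are illustrative, the embedding caveat from the previous sketch applies here as well, and this is a simplified version rather than our exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoderRNN(nn.Module):
    """Decoder with Luong-style 'general' attention over the encoder outputs."""
    def __init__(self, vocab_size, embedding_dim=300, hidden_size=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.attn = nn.Linear(hidden_size, hidden_size, bias=False)  # W_a in equation 4.6
        self.gru = nn.GRU(embedding_dim + hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size * 2, vocab_size)            # W_out in equation 4.9

    def forward(self, token_index, hidden, encoder_outputs):
        # encoder_outputs: (input_length, hidden_size); hidden: (1, 1, hidden_size)
        embedded = self.embedding(token_index).view(1, 1, -1)

        # attention energies e_ij = s_{i-1}^T W_a h_j, normalized to weights (eq. 4.4-4.6)
        energies = torch.matmul(self.attn(encoder_outputs), hidden[0, 0])
        attn_weights = F.softmax(energies, dim=0)

        # context vector c_i = sum_j alpha_ij h_j (eq. 4.7)
        context = torch.matmul(attn_weights.unsqueeze(0), encoder_outputs)  # (1, hidden_size)

        # GRU over the concatenation of context vector and embedded input (eq. 4.8)
        rnn_input = torch.cat((context.view(1, 1, -1), embedded), dim=2)
        output, hidden = self.gru(rnn_input, hidden)

        # output layer over [z_i, c_i], followed by LogSoftmax (eq. 4.9-4.10)
        log_probs = F.log_softmax(self.out(torch.cat((output[0], context), dim=1)), dim=1)
        return log_probs, hidden, attn_weights
```

At every step the sketch also returns the attention weights, which is what makes the attention visualizations in Chapter 5 and the copy mechanism of the next subsection possible.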

4.3.5 Copy mechanism

In machine translation, it is often necessary that the decoder can copy tokens from the input sequence. Specifically, in our NL2SQL setting the decoder should be able to copy actor names, titles, etc. from the input sequence. These are often Out-Of-Vocabulary (OOV) words that are not present in the training set. This is why the major improvements over the encoder-decoder model with attention are in predicting the last operand of the WHERE condition, which is often a word from the input sequence. Our copy mechanism works in two steps:

1. The first step consists of predicting whether to copy a token from the input sequence or not. This is accomplished by including a <COPY> token in the vocabulary.
2. The second step consists of replacing the <COPY> token with the input token with the highest attention value.

The loss function is changed in order to work with the copy mechanism:

1. In the first step, the algorithm checks whether the target token is in the input sequence. If this is the case, the negative log-likelihood loss between the <COPY> token and the decoder output is added to the loss. If this is not the case, the negative log-likelihood loss between the target token and the decoder output is added to the loss.
2. In the second step, the attention values should align with the input tokens that should be copied. In the case of a <COPY> token, the cross-entropy loss between the attention values and the index of the corresponding input token is added to the loss.

4.4 Conclusion

This chapter started with explaining the two models that serve as baselines for our dataset: the GloVe-based model and SQLNet. The GloVe-based model searches for the best matching question in the training set and predicts the corresponding query, working both for simple and composite queries. SQLNet is an advanced model that works only on the simple queries, on one table, and is described in Section 2.3.2.

Our encoder-decoder model, which is explained in the last section, also works on composite queries. Because the simple decoder uses only the encoder's last output, the first extension consisted of using an attention mechanism. The attention mechanism encodes the whole input sequence based on the sequence of all encoder outputs, instead of only the encoder's last output. The second extension consisted of adding a copy mechanism, providing the algorithm with the possibility of copying unseen words from the question. The following chapter will discuss the results of the techniques explained in this chapter.

Chapter 5
Experiments

In this chapter, the models proposed in Chapter 4 are evaluated and compared. It begins with an overview of the experimental setup: the deep learning framework used and how our results are evaluated. Next, it briefly touches on the results of the baseline models. The third section handles the results of the encoder-decoder model, both on simple and composite queries. At the end, the choice of hyperparameters is explained.

5.1 Experimental setup

This section begins with clarifying which deep learning framework suits our models best. Next, it gives an overview of the multiple components of the SQL query that are evaluated.

5.1.1 Deep learning framework

Given the models proposed in Chapter 4, we chose to use the deep learning framework PyTorch. We picked this framework for the following reasons:

- Dynamic graph definition: PyTorch supports dynamic graph definition. Static graph definition, used by other frameworks, means that for training an RNN the input sequence length should stay fixed (so the sentence length is fixed to some maximum value and smaller sequences are padded with zeros).
- Pythonic way: PyTorch is deeply integrated into Python.

- Tutorial: SQLNet and Seq2SQL (Xu et al. [2017], Zhong et al. [2017]) are written in PyTorch and open-sourced on GitHub (Xu et al. [2017] implemented SQLNet and rebuilt Seq2SQL; the code can be found at github.com/xiaojunxu/SQLNet). These served as guidelines for writing our own solutions.

5.1.2 Evaluation details

As discussed in Chapter 3, the dataset consists of 1,537 template questions containing tags such as [@film] or [@actor]. The tags are replaced with 10 random, corresponding values from the IMDb database. Table 5.1 shows an example of how the [@film] tag is replaced by movie titles.

Template question: Who wrote the movie [@film]?
Generated question 1: Who wrote the movie Deadpool 2?
(...)
Generated question 10: Who wrote the movie Avengers: Infinity War?

Table 5.1: Example of a template question and the generated questions

The resulting 15,370 question-query pairs are split differently for the GloVe-based model and the other models, because the GloVe-based model does not use a validation set. The template questions of the training, validation and test set are separated, such that a template question from one set is not present in the other sets. Details can be found in Table 5.2.

Model | Training set | Validation set | Test set
GloVe-based model | 85% (13,060 pairs) | / | 15% (2,310 pairs)
SQLNet & encoder-decoder | 70% (10,750 pairs) | 15% (2,310 pairs) | 15% (2,310 pairs)

Table 5.2: Split of the dataset into training, validation and test set for the different models

The evaluation metric used for the baselines and models is the accuracy. The accuracy is the proportion of correct cases (both true positives and true negatives) among the total number of cases examined. Because some parts of the SQL query are harder to predict than

others, we calculate the accuracy of the different components of the SQL query, so that it becomes clear on which components of the query the model has to be improved. Table 5.3 shows the evaluation components. In the case of multiple conditions, the conditions are sorted before being compared, because the conditions in a SQL query are commutative (a small sketch of this comparison is given below). This means that e.g. the following conditions are equivalent:

WHERE actor.name = 'Brad Pitt' AND movie.title = 'Fight Club'
WHERE movie.title = 'Fight Club' AND actor.name = 'Brad Pitt'

Evaluation component | Description
Select | Checks if the column name after SELECT is correct
From | Checks if the table name after FROM is correct
Where (first operands) (= conds first op) | Checks if the column name (= first operand) after WHERE is correct
Where (all operands) (= conds all op) | Checks if the whole condition after WHERE is correct
Joins (only for composite queries) | Checks if the table names and conditions of the JOIN clause of the query are correct
All | Checks if the whole query is correct

Table 5.3: Evaluation components of the query
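As an illustration of this component-wise evaluation with order-insensitive WHERE conditions, the following is a small sketch, assuming the predicted and ground-truth queries have already been split into their clauses; the dictionary keys and function names are illustrative, not those of our actual evaluation script.

```python
def normalise_conditions(where_clause):
    """Sort the WHERE conditions so that their order does not matter."""
    conditions = [cond.strip() for cond in where_clause.split(" AND ")]
    return sorted(conditions)

def evaluate_components(predicted, gold):
    """Compare a predicted and a gold query, both given as dicts with the keys
    'select', 'from', 'where' and (for composite queries) 'joins'."""
    scores = {
        "select": predicted["select"] == gold["select"],
        "from": predicted["from"] == gold["from"],
        "conds_all_op": normalise_conditions(predicted["where"])
                        == normalise_conditions(gold["where"]),
    }
    if "joins" in gold:
        scores["joins"] = predicted.get("joins") == gold["joins"]
    scores["all"] = all(scores.values())
    return scores
```

The per-component accuracies reported below are then simply the fraction of test pairs for which the corresponding entry of this score dictionary is true.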

5.2 Baselines

This section describes the results of the first two models: the GloVe-based model and the more advanced SQLNet model from Xu et al. [2017].

5.2.1 GloVe-based model

The results on accuracy for simple and composite queries can be found in Figures 5.1 and 5.2 respectively. The figures show that the biggest room for improvement lies in predicting the WHERE condition (conds all op), while predicting the table name after FROM is already perfect. Also, the predictions of the column name after SELECT and of the first operand of the WHERE condition (conds first op) have an accuracy above 80%, which is quite high for a baseline.

Figure 5.1: Accuracy of the GloVe model - simple queries

Figure 5.2: Accuracy of the GloVe model - composite queries

5.2.2 SQLNet

The results of SQLNet on our dataset can be found in Table 5.4. SQLNet trains different prediction models for the different parts of the query, which means that the validation losses of these models can converge at different times. The SQLNet model was trained for 100 epochs, where the best selection and condition predictions on the validation set were achieved after 77 and 45 epochs respectively. The FROM accuracy is not calculated, because there is only one table. SQLNet is a more advanced model and scores better than the GloVe-based model on predicting the WHERE clause. The comparison can be consulted in Table 5.4. The WHERE (first operands) accuracy is not available for SQLNet, because its evaluation mechanism did not measure the accuracy of that component.

Simple queries | GloVe-based | SQLNet
Select | 84.45% | 71.7%
From | 100.0% | 100%
Where (first operands) | 82.51% | /
Where (all operands) | 36.19% | 46.8%
All | 29.48% | 36.5%

Table 5.4: Accuracy results of the GloVe-based model and SQLNet - simple queries

5.3 Encoder-decoder model

The first part of this section covers the results of the encoder-decoder model on the simple queries, while the second part contains the results on the composite queries. It will explore

the improvements of the attention and copy mechanisms and show some visualizations of these mechanisms.

5.3.1 Simple queries

The accuracy results of the encoder-decoder model on the different components can be found in Table 5.5 and Figure 5.3. The choice of hyperparameters can be found in Table 5.6 and is motivated in Section 5.4. The model is trained until the validation loss converges, which is after approximately 9 epochs. Note the difference with the training procedure of SQLNet, where there were multiple models, one for each clause of the query, which were trained independently of each other until their validation loss converged. Clearly, the attention mechanism has no impact on the simple queries. However, in combination with the copy mechanism there is a 2% increase in overall accuracy compared to the simple decoder.

Simple queries | Simple | With attention | With attention and copy
Select | 95.47% | 94.08% | 93.87%
From | 100% | 99.65% | 99.87%
Where (first operands) | 92.74% | 90.04% | 92.39%
Where (all operands) | 65.99% | 60.55% | 67.77%
All | 64.42% | 58.59% | 66.29%

Table 5.5: Accuracy results of the encoder-decoder model - simple queries

Hyperparameter | Choice
Optimizer | Adam
Learning rate | 0.001
Dropout | 0.1
Amount of hidden nodes | 256

Table 5.6: Choice of hyperparameters - simple and composite queries

Not all evaluation components score equally well, hence we compare the evaluation components that score worst. The prediction of the last operand and the correctness of the whole query are shown in Figure 5.4. We notice a decrease in accuracy when using attention, which is compensated for in combination with the copy mechanism. All three models score better than the GloVe-based model and SQLNet (see Table 5.4).

Figure 5.3: Accuracy results of the encoder-decoder model - simple queries

Figure 5.4: Accuracy results for the WHERE clause and the whole query - simple queries

5.3.2 Composite queries

The accuracy results of the encoder-decoder model are shown in Table 5.7 and Figure 5.5. The choice of hyperparameters is the same as for the simple queries; it can be found in Table 5.6 and is motivated in Section 5.4. The model is trained until the validation loss converges, which is after approximately 9 epochs. Analogous to the simple queries, the attention mechanism and copy mechanism extensions have a particularly strong effect on

predicting the third operand in the WHERE clause.

Composite queries | GloVe | Simple | With attention | With attention & copy
Select | | 96.46% | 94.00% | 96.28%
From | | 99.60% | 99.41% | 99.96%
Where (first operands) | | 94.89% | 91.76% | 93.51%
Where (all operands) | | 65.69% | | 73.94%
Joins | 80.74% | | 92.83% | 94.49%
All | | 63.32% | 63.32% | 71.29%

Table 5.7: Accuracy results of the encoder-decoder model - composite queries

Figure 5.5: Accuracy results of the encoder-decoder model - composite queries

Figure 5.6 zooms in on the prediction of the last operand and the correctness of the whole query. We notice again that the attention mechanism on its own brings no improvement in the accuracy of the whole query. However, when it is combined with the copy mechanism, it brings an additional improvement of 8%. All three models score better than the GloVe-based model (see Table 5.7).

The attention mechanism is visualized in Figures 5.7 and 5.8. Because it is used to weight specific encoder outputs of the input sequence, we can inspect where the network is focused most at each time step. In Figure 5.7, we notice that the word nation has the most impact on the translation. Because country_code is only present in the company table, nation has almost weight 1 in the whole SELECT, FROM and JOIN clause.

Figure 5.6: Accuracy results for the WHERE clause and the whole query - composite queries

The words another forever have weight 1 in the prediction of the last operand of the WHERE clause, where they are copied correctly. Figure 5.8 shows a second example question-query pair. Here, the SELECT, FROM and JOIN clauses are again determined by one word, producer. The last operand of the WHERE clause is also copied correctly.

Figure 5.7: Visualization of the attention - pair 1

Figure 5.8: Visualization of the attention - pair 2

The copy mechanism works in two steps: predicting a <COPY> token and, if it should copy,

choosing which token it should copy. The accuracy of the first step is shown in Table 5.8, where the percentages indicate in how many queries the <COPY> tokens are correctly or incorrectly predicted. The largest error case is when the decoder predicts a different length than the ground truth query. Also, the decoder is more likely to falsely predict a <COPY> token than to forget to predict one. The frequencies do not sum to 1, because it is possible that multiple error cases happen simultaneously.

Correct | 88.71%
Incorrect, different lengths | 9.58%
Incorrect, should have copied | 0.985%
Incorrect, should not have copied | 4.43%

Table 5.8: Accuracy of the first step of the copy mechanism and the frequency of the error cases

Table 5.9 shows three examples of question-query pairs and the results of the simple decoder and the decoder with the attention and copy mechanism. The first example shows a case where the copy mechanism correctly predicts the tokens to copy, but the attention is wrongly aligned for the last token, causing it to copy the wrong word. The second example is an illustration of a common error of the copy mechanism: wrongly predicting the length of the condition. The third example illustrates a case where the copy mechanism is correct and the simple decoder is not.

Q1: which iowan cinematographer produced the film Appunti inutili - Virgilio Giotti
S: SELECT name FROM movie INNER JOIN made_by ON movie.mid = made_by.msid INNER JOIN producer ON made_by.pid = producer.pid WHERE movie.title = *aquele querido mes de agosto*
A&C: SELECT name FROM movie INNER JOIN made_by ON movie.mid = made_by.msid INNER JOIN producer ON made_by.pid = producer.pid WHERE movie.title = Appunti inutili - Virgilio *Virgilio*
G: SELECT name FROM movie INNER JOIN made_by ON movie.mid = made_by.msid INNER JOIN producer ON made_by.pid = producer.pid WHERE movie.title = Appunti inutili - Virgilio Giotti

Q2: where is the film Angry Samoans form
S: SELECT country_code FROM movie INNER JOIN copyright ON movie.mid = copyright.msid INNER JOIN company ON copyright.cid = company.id WHERE movie.title = *apron strings*
A&C: SELECT country_code FROM movie INNER JOIN copyright ON movie.mid = copyright.msid INNER JOIN company ON copyright.cid = company.id WHERE movie.title = *Angry*
G: SELECT country_code FROM movie INNER JOIN copyright ON movie.mid = copyright.msid INNER JOIN company ON copyright.cid = company.id WHERE movie.title = Angry Samoans

Q3: what film is a part of the Crime film genre
S: SELECT title FROM movie INNER JOIN classification ON movie.mid = classification.msid INNER JOIN genre genre ON classification.gid = genre.gid WHERE genre.genre = *game-show*
A&C: SELECT title FROM movie INNER JOIN classification ON movie.mid = classification.msid INNER JOIN genre genre ON classification.gid = genre.gid WHERE genre.genre = Crime
G: SELECT title FROM movie INNER JOIN classification ON movie.mid = classification.msid INNER JOIN genre genre ON classification.gid = genre.gid WHERE genre.genre = Crime

Table 5.9: Example predictions by the different models. Q denotes the natural language question and G denotes the corresponding ground truth query. S and A&C denote respectively the queries produced by the simple decoder and the decoder with the copy and attention mechanism. Our models in general generate the table name twice, so that the column names can specify where they come from, but this is left out of the table for the sake of brevity. The words between asterisks indicate wrongly predicted words.

5.4 Hyperparameter optimization

This section explores the optimization of the hyperparameters in the model. The first subsection discusses the choice of optimizer and learning rate, which are used to minimize the loss. Afterwards, the impact of the regularization technique dropout is examined. Finally, we tune the amount of hidden nodes, which is the length of the context vector.

5.4.1 Optimizer

We have tried two different gradient descent optimizers from the torch.optim package: stochastic gradient descent (optim.SGD) and Adaptive Moment Estimation (optim.Adam), explained in Chapter 2. This section compares the results of both. Figure 5.9 shows the negative log-likelihood loss for the SGD optimizer and the Adam optimizer (both with learning rate 0.001) as a function of time. The experiment is conducted on the composite query dataset using the encoder-decoder model with attention. The following aspects can be deduced from the figure:

- The Adam optimizer (blue) converges faster than the SGD optimizer (orange).
- Due to the frequent updates with high variance, the loss function fluctuates heavily.

Figure 5.9: The negative log-likelihood loss for the SGD optimizer (orange) and the Adam optimizer (blue), as a function of the amount of trained samples.

Figure 5.10 shows the evolution of the negative log-likelihood loss for different learning rates, also on the composite query dataset with the encoder-decoder model with attention. The Adam optimizer with learning rate 0.1 fails to converge, while the Adam optimizer with learning rate 0.01 is not able to converge to a loss below 2.0. The default PyTorch learning rate of 0.001 converges faster than the others, which is why we picked 0.001 as the learning rate for our model.

Figure 5.10: The negative log-likelihood loss for the learning rates 0.1, 0.01 and 0.001 for the Adam optimizer, as a function of the amount of trained samples. The graphs are respectively red, blue and orange.

5.4.2 Dropout

Overfitting occurs when the model adapts to the training data too well, but does not generalize to new data. In our case, this would mean that the model predicts the template queries in the training set almost perfectly, but predicts the (different) template queries in the test set poorly. There is a direct trade-off between overfitting and model complexity. Neural networks are complex models, so additional countermeasures to prevent overfitting are taken. Dropout is a regularization technique for neural networks that tackles this problem (Srivastava et al. [2014]). The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much.

The dropout mechanism in PyTorch works as follows:

- During training, it randomly zeroes some of the elements of the input tensor with probability p. The outputs are scaled by a factor of $\frac{1}{1-p}$.
- During evaluation, the module simply computes an identity function.

This dropout reduces overfitting and gives major improvements over other regularization methods. The impact of the dropout parameter p on our model is shown in Figure 5.11. The absolute accuracy of the whole query is calculated on the validation set every 1,000 training samples, for p = 0.1, 0.3 and 0.5. The experiment is conducted with the decoder with the copy and attention mechanism. The overall accuracy seems to converge to the same level, which indicates that the dropout parameter does not have much effect on the accuracy of our model.

Figure 5.11: The absolute accuracy of the whole query calculated on the validation set for the dropout values p of 0.1, 0.3 and 0.5, as a function of the amount of trained samples. The graphs are respectively orange, blue and red.

5.4.3 Hidden nodes

The amount of hidden nodes, which is also the length of the context vector, has a smaller effect on the negative log-likelihood loss. Figure 5.12 shows the impact of the size of the context vector on the negative log-likelihood loss. More hidden nodes lead to a better reduction of the loss, but also to an increase in parameters, which leads to longer training times. This is why we picked 256 hidden nodes to train our model.
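The following is a small illustrative snippet of these settings in PyTorch, using the hyperparameters of Table 5.6; the module instances are stand-ins rather than our actual encoder and decoder, and the example only demonstrates how the chosen values would be wired together.

```python
import torch
import torch.nn as nn

# hyperparameters from Table 5.6 (the module classes below are placeholders)
hidden_size = 256
dropout_p = 0.1
learning_rate = 0.001

encoder = nn.GRU(input_size=300, hidden_size=hidden_size)
decoder = nn.GRU(input_size=300, hidden_size=hidden_size)
dropout = nn.Dropout(p=dropout_p)      # zeroes inputs with probability p during training

encoder_optimizer = torch.optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = torch.optim.Adam(decoder.parameters(), lr=learning_rate)
criterion = nn.NLLLoss()               # negative log-likelihood on the LogSoftmax outputs

# dropout behaves differently in training and evaluation mode:
x = torch.randn(1, 1, 300)
dropout.train()
y_train = dropout(x)                   # some elements zeroed, the rest scaled by 1/(1-p)
dropout.eval()
y_eval = dropout(x)                    # identity function at evaluation time
```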

Figure 5.12: The negative log-likelihood loss for different amounts of hidden nodes.

5.5 Conclusion

This chapter discussed the results of the techniques explained in Chapter 4. It began with a short overview of the reasons why we chose PyTorch as the deep learning framework. Because evaluating only the accuracy of the whole query would reveal too few details about our models, we evaluated the accuracy of the different parts of the SQL query. The second part of the first section listed these evaluation components of the SQL query.

The chapter continued with the first results of this dissertation, namely the experiments with the basic GloVe-based model and the more advanced SQLNet (by Xu et al. [2017]). These models already score quite well on the SELECT, FROM and JOIN clauses, but there are improvements to be made in predicting the third operand of the WHERE clause. This is where our encoder-decoder model boosts the accuracy results. On the simple queries, the encoder-decoder model scores better than the GloVe-based model and SQLNet, but the attention mechanism does not improve the total accuracy unless it is combined with the copy mechanism. On the composite queries, the simple decoder already scores better than the two baseline models, and the extensions enhance the accuracy further. The attention with copy mechanism yields a gain of 7.9% compared to the simple decoder, concluding with a total accuracy of 71.29%.


LSTM and its variants for visual recognition. Xiaodan Liang Sun Yat-sen University

LSTM and its variants for visual recognition. Xiaodan Liang Sun Yat-sen University LSTM and its variants for visual recognition Xiaodan Liang xdliang328@gmail.com Sun Yat-sen University Outline Context Modelling with CNN LSTM and its Variants LSTM Architecture Variants Application in

More information

arxiv: v1 [cs.cl] 13 Nov 2017

arxiv: v1 [cs.cl] 13 Nov 2017 SQLNet: GENERATING STRUCTURED QUERIES FROM NATURAL LANGUAGE WITHOUT REINFORCEMENT LEARNING Xiaojun Xu Shanghai Jiao Tong University Chang Liu, Dawn Song University of the California, Berkeley arxiv:1711.04436v1

More information

arxiv: v1 [cs.ai] 13 Nov 2018

arxiv: v1 [cs.ai] 13 Nov 2018 Translating Natural Language to SQL using Pointer-Generator Networks and How Decoding Order Matters Denis Lukovnikov 1, Nilesh Chakraborty 1, Jens Lehmann 1, and Asja Fischer 2 1 University of Bonn, Bonn,

More information

Deep Learning. Practical introduction with Keras JORDI TORRES 27/05/2018. Chapter 3 JORDI TORRES

Deep Learning. Practical introduction with Keras JORDI TORRES 27/05/2018. Chapter 3 JORDI TORRES Deep Learning Practical introduction with Keras Chapter 3 27/05/2018 Neuron A neural network is formed by neurons connected to each other; in turn, each connection of one neural network is associated

More information

Outline GF-RNN ReNet. Outline

Outline GF-RNN ReNet. Outline Outline Gated Feedback Recurrent Neural Networks. arxiv1502. Introduction: RNN & Gated RNN Gated Feedback Recurrent Neural Networks (GF-RNN) Experiments: Character-level Language Modeling & Python Program

More information

Image-to-Text Transduction with Spatial Self-Attention

Image-to-Text Transduction with Spatial Self-Attention Image-to-Text Transduction with Spatial Self-Attention Sebastian Springenberg, Egor Lakomkin, Cornelius Weber and Stefan Wermter University of Hamburg - Dept. of Informatics, Knowledge Technology Vogt-Ko

More information

DEEP LEARNING REVIEW. Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature Presented by Divya Chitimalla

DEEP LEARNING REVIEW. Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature Presented by Divya Chitimalla DEEP LEARNING REVIEW Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature 2015 -Presented by Divya Chitimalla What is deep learning Deep learning allows computational models that are composed of multiple

More information

Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features

Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features Xu SUN ( 孙栩 ) Peking University xusun@pku.edu.cn Motivation Neural networks -> Good Performance CNN, RNN, LSTM

More information

Natural Language Interface for Databases Using a Dual-Encoder Model

Natural Language Interface for Databases Using a Dual-Encoder Model Natural Language Interface for Databases Using a Dual-Encoder Model Ionel Hosu 1, Radu Iacob 1, Florin Brad 2, Stefan Ruseti 1, Traian Rebedea 1 1 University Politehnica of Bucharest, Romania 2 Bitdefender,

More information

Recurrent Neural Networks

Recurrent Neural Networks Recurrent Neural Networks Javier Béjar Deep Learning 2018/2019 Fall Master in Artificial Intelligence (FIB-UPC) Introduction Sequential data Many problems are described by sequences Time series Video/audio

More information

Domain-Aware Sentiment Classification with GRUs and CNNs

Domain-Aware Sentiment Classification with GRUs and CNNs Domain-Aware Sentiment Classification with GRUs and CNNs Guangyuan Piao 1(B) and John G. Breslin 2 1 Insight Centre for Data Analytics, Data Science Institute, National University of Ireland Galway, Galway,

More information

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention Show, Attend and Tell: Neural Image Caption Generation with Visual Attention Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio Presented

More information

Semantic image search using queries

Semantic image search using queries Semantic image search using queries Shabaz Basheer Patel, Anand Sampat Department of Electrical Engineering Stanford University CA 94305 shabaz@stanford.edu,asampat@stanford.edu Abstract Previous work,

More information

arxiv: v1 [cs.cl] 9 Jul 2018

arxiv: v1 [cs.cl] 9 Jul 2018 Chenglong Wang 1 Po-Sen Huang 2 Alex Polozov 2 Marc Brockschmidt 2 Rishabh Singh 3 arxiv:1807.03100v1 [cs.cl] 9 Jul 2018 Abstract We present a neural semantic parser that translates natural language questions

More information

A Quick Guide on Training a neural network using Keras.

A Quick Guide on Training a neural network using Keras. A Quick Guide on Training a neural network using Keras. TensorFlow and Keras Keras Open source High level, less flexible Easy to learn Perfect for quick implementations Starts by François Chollet from

More information

Machine Learning Classifiers and Boosting

Machine Learning Classifiers and Boosting Machine Learning Classifiers and Boosting Reading Ch 18.6-18.12, 20.1-20.3.2 Outline Different types of learning problems Different types of learning algorithms Supervised learning Decision trees Naïve

More information

Week 3: Perceptron and Multi-layer Perceptron

Week 3: Perceptron and Multi-layer Perceptron Week 3: Perceptron and Multi-layer Perceptron Phong Le, Willem Zuidema November 12, 2013 Last week we studied two famous biological neuron models, Fitzhugh-Nagumo model and Izhikevich model. This week,

More information

Logistic Regression and Gradient Ascent

Logistic Regression and Gradient Ascent Logistic Regression and Gradient Ascent CS 349-02 (Machine Learning) April 0, 207 The perceptron algorithm has a couple of issues: () the predictions have no probabilistic interpretation or confidence

More information

Machine Learning With Python. Bin Chen Nov. 7, 2017 Research Computing Center

Machine Learning With Python. Bin Chen Nov. 7, 2017 Research Computing Center Machine Learning With Python Bin Chen Nov. 7, 2017 Research Computing Center Outline Introduction to Machine Learning (ML) Introduction to Neural Network (NN) Introduction to Deep Learning NN Introduction

More information

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or

More information

LSTM with Working Memory

LSTM with Working Memory LSTM with Working Memory Andrew Pulver Department of Computer Science University at Albany Email: apulver@albany.edu Siwei Lyu Department of Computer Science University at Albany Email: slyu@albany.edu

More information

DCU-UvA Multimodal MT System Report

DCU-UvA Multimodal MT System Report DCU-UvA Multimodal MT System Report Iacer Calixto ADAPT Centre School of Computing Dublin City University Dublin, Ireland iacer.calixto@adaptcentre.ie Desmond Elliott ILLC University of Amsterdam Science

More information

Kyoto-NMT: a Neural Machine Translation implementation in Chainer

Kyoto-NMT: a Neural Machine Translation implementation in Chainer Kyoto-NMT: a Neural Machine Translation implementation in Chainer Fabien Cromières Japan Science and Technology Agency Kawaguchi-shi, Saitama 332-0012 fabien@pa.jst.jp Abstract We present Kyoto-NMT, an

More information

FastText. Jon Koss, Abhishek Jindal

FastText. Jon Koss, Abhishek Jindal FastText Jon Koss, Abhishek Jindal FastText FastText is on par with state-of-the-art deep learning classifiers in terms of accuracy But it is way faster: FastText can train on more than one billion words

More information

Image Captioning with Attention

Image Captioning with Attention ing with Attention Blaine Rister (blaine@stanford.edu), Dieterich Lawson (jdlawson@stanford.edu) 1. Introduction In the past few years, neural networks have fueled dramatic advances in image classication.

More information

Deep Learning. Deep Learning. Practical Application Automatically Adding Sounds To Silent Movies

Deep Learning. Deep Learning. Practical Application Automatically Adding Sounds To Silent Movies http://blog.csdn.net/zouxy09/article/details/8775360 Automatic Colorization of Black and White Images Automatically Adding Sounds To Silent Movies Traditionally this was done by hand with human effort

More information

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used. 1 4.12 Generalization In back-propagation learning, as many training examples as possible are typically used. It is hoped that the network so designed generalizes well. A network generalizes well when

More information

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning,

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning, A Acquisition function, 298, 301 Adam optimizer, 175 178 Anaconda navigator conda command, 3 Create button, 5 download and install, 1 installing packages, 8 Jupyter Notebook, 11 13 left navigation pane,

More information

Character Recognition Using Convolutional Neural Networks

Character Recognition Using Convolutional Neural Networks Character Recognition Using Convolutional Neural Networks David Bouchain Seminar Statistical Learning Theory University of Ulm, Germany Institute for Neural Information Processing Winter 2006/2007 Abstract

More information

Machine Learning for Natural Language Processing. Alice Oh January 17, 2018

Machine Learning for Natural Language Processing. Alice Oh January 17, 2018 Machine Learning for Natural Language Processing Alice Oh January 17, 2018 Overview Distributed representation Temporal neural networks RNN LSTM GRU Sequence-to-sequence models Machine translation Response

More information

CS 1674: Intro to Computer Vision. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh November 16, 2016

CS 1674: Intro to Computer Vision. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh November 16, 2016 CS 1674: Intro to Computer Vision Neural Networks Prof. Adriana Kovashka University of Pittsburgh November 16, 2016 Announcements Please watch the videos I sent you, if you haven t yet (that s your reading)

More information

Dialog System & Technology Challenge 6 Overview of Track 1 - End-to-End Goal-Oriented Dialog learning

Dialog System & Technology Challenge 6 Overview of Track 1 - End-to-End Goal-Oriented Dialog learning Dialog System & Technology Challenge 6 Overview of Track 1 - End-to-End Goal-Oriented Dialog learning Julien Perez 1 and Y-Lan Boureau 2 and Antoine Bordes 2 1 Naver Labs Europe, Grenoble, France 2 Facebook

More information

ABC-CNN: Attention Based CNN for Visual Question Answering

ABC-CNN: Attention Based CNN for Visual Question Answering ABC-CNN: Attention Based CNN for Visual Question Answering CIS 601 PRESENTED BY: MAYUR RUMALWALA GUIDED BY: DR. SUNNIE CHUNG AGENDA Ø Introduction Ø Understanding CNN Ø Framework of ABC-CNN Ø Datasets

More information

SEMANTIC COMPUTING. Lecture 9: Deep Learning: Recurrent Neural Networks (RNNs) TU Dresden, 21 December 2018

SEMANTIC COMPUTING. Lecture 9: Deep Learning: Recurrent Neural Networks (RNNs) TU Dresden, 21 December 2018 SEMANTIC COMPUTING Lecture 9: Deep Learning: Recurrent Neural Networks (RNNs) Dagmar Gromann International Center For Computational Logic TU Dresden, 21 December 2018 Overview Handling Overfitting Recurrent

More information

House Price Prediction Using LSTM

House Price Prediction Using LSTM House Price Prediction Using LSTM Xiaochen Chen Lai Wei The Hong Kong University of Science and Technology Jiaxin Xu ABSTRACT In this paper, we use the house price data ranging from January 2004 to October

More information

CAP 6412 Advanced Computer Vision

CAP 6412 Advanced Computer Vision CAP 6412 Advanced Computer Vision http://www.cs.ucf.edu/~bgong/cap6412.html Boqing Gong Feb 04, 2016 Today Administrivia Attention Modeling in Image Captioning, by Karan Neural networks & Backpropagation

More information

Research on Pruning Convolutional Neural Network, Autoencoder and Capsule Network

Research on Pruning Convolutional Neural Network, Autoencoder and Capsule Network Research on Pruning Convolutional Neural Network, Autoencoder and Capsule Network Tianyu Wang Australia National University, Colledge of Engineering and Computer Science u@anu.edu.au Abstract. Some tasks,

More information

Neural Network Joint Language Model: An Investigation and An Extension With Global Source Context

Neural Network Joint Language Model: An Investigation and An Extension With Global Source Context Neural Network Joint Language Model: An Investigation and An Extension With Global Source Context Ruizhongtai (Charles) Qi Department of Electrical Engineering, Stanford University rqi@stanford.edu Abstract

More information

Neural Network Weight Selection Using Genetic Algorithms

Neural Network Weight Selection Using Genetic Algorithms Neural Network Weight Selection Using Genetic Algorithms David Montana presented by: Carl Fink, Hongyi Chen, Jack Cheng, Xinglong Li, Bruce Lin, Chongjie Zhang April 12, 2005 1 Neural Networks Neural networks

More information

XES Tensorflow Process Prediction using the Tensorflow Deep-Learning Framework

XES Tensorflow Process Prediction using the Tensorflow Deep-Learning Framework XES Tensorflow Process Prediction using the Tensorflow Deep-Learning Framework Demo Paper Joerg Evermann 1, Jana-Rebecca Rehse 2,3, and Peter Fettke 2,3 1 Memorial University of Newfoundland 2 German Research

More information

Image-Sentence Multimodal Embedding with Instructive Objectives

Image-Sentence Multimodal Embedding with Instructive Objectives Image-Sentence Multimodal Embedding with Instructive Objectives Jianhao Wang Shunyu Yao IIIS, Tsinghua University {jh-wang15, yao-sy15}@mails.tsinghua.edu.cn Abstract To encode images and sentences into

More information

Deep Learning. Architecture Design for. Sargur N. Srihari

Deep Learning. Architecture Design for. Sargur N. Srihari Architecture Design for Deep Learning Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation

More information

CS839: Probabilistic Graphical Models. Lecture 22: The Attention Mechanism. Theo Rekatsinas

CS839: Probabilistic Graphical Models. Lecture 22: The Attention Mechanism. Theo Rekatsinas CS839: Probabilistic Graphical Models Lecture 22: The Attention Mechanism Theo Rekatsinas 1 Why Attention? Consider machine translation: We need to pay attention to the word we are currently translating.

More information

A Comparison of Sequence-Trained Deep Neural Networks and Recurrent Neural Networks Optical Modeling For Handwriting Recognition

A Comparison of Sequence-Trained Deep Neural Networks and Recurrent Neural Networks Optical Modeling For Handwriting Recognition A Comparison of Sequence-Trained Deep Neural Networks and Recurrent Neural Networks Optical Modeling For Handwriting Recognition Théodore Bluche, Hermann Ney, Christopher Kermorvant SLSP 14, Grenoble October

More information

arxiv: v3 [cs.cl] 13 Sep 2018

arxiv: v3 [cs.cl] 13 Sep 2018 Robust Text-to-SQL Generation with Execution-Guided Decoding Chenglong Wang, 1 * Kedar Tatwawadi, 2 * Marc Brockschmidt, 3 Po-Sen Huang, 3 Yi Mao, 3 Oleksandr Polozov, 3 Rishabh Singh 4 1 University of

More information

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group Deep Learning Vladimir Golkov Technical University of Munich Computer Vision Group 1D Input, 1D Output target input 2 2D Input, 1D Output: Data Distribution Complexity Imagine many dimensions (data occupies

More information

Recurrent Neural Networks

Recurrent Neural Networks Recurrent Neural Networks 11-785 / Fall 2018 / Recitation 7 Raphaël Olivier Recap : RNNs are magic They have infinite memory They handle all kinds of series They re the basis of recent NLP : Translation,

More information

Pointer Network. Oriol Vinyals. 박천음 강원대학교 Intelligent Software Lab.

Pointer Network. Oriol Vinyals. 박천음 강원대학교 Intelligent Software Lab. Pointer Network Oriol Vinyals 박천음 강원대학교 Intelligent Software Lab. Intelligent Software Lab. Pointer Network 1 Pointer Network 2 Intelligent Software Lab. 2 Sequence-to-Sequence Model Train 학습학습학습학습학습 Test

More information

CS 224n: Assignment #3

CS 224n: Assignment #3 CS 224n: Assignment #3 Due date: 2/27 11:59 PM PST (You are allowed to use 3 late days maximum for this assignment) These questions require thought, but do not require long answers. Please be as concise

More information

A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems

A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University

More information

ImageNet Classification with Deep Convolutional Neural Networks

ImageNet Classification with Deep Convolutional Neural Networks ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky Ilya Sutskever Geoffrey Hinton University of Toronto Canada Paper with same name to appear in NIPS 2012 Main idea Architecture

More information

End-To-End Spam Classification With Neural Networks

End-To-End Spam Classification With Neural Networks End-To-End Spam Classification With Neural Networks Christopher Lennan, Bastian Naber, Jan Reher, Leon Weber 1 Introduction A few years ago, the majority of the internet s network traffic was due to spam

More information

Text Modeling with the Trace Norm

Text Modeling with the Trace Norm Text Modeling with the Trace Norm Jason D. M. Rennie jrennie@gmail.com April 14, 2006 1 Introduction We have two goals: (1) to find a low-dimensional representation of text that allows generalization to

More information

Rationalizing Sentiment Analysis in Tensorflow

Rationalizing Sentiment Analysis in Tensorflow Rationalizing Sentiment Analysis in Tensorflow Alyson Kane Stanford University alykane@stanford.edu Henry Neeb Stanford University hneeb@stanford.edu Kevin Shaw Stanford University keshaw@stanford.edu

More information

CSE 250B Project Assignment 4

CSE 250B Project Assignment 4 CSE 250B Project Assignment 4 Hani Altwary haltwa@cs.ucsd.edu Kuen-Han Lin kul016@ucsd.edu Toshiro Yamada toyamada@ucsd.edu Abstract The goal of this project is to implement the Semi-Supervised Recursive

More information

A Hybrid Neural Model for Type Classification of Entity Mentions

A Hybrid Neural Model for Type Classification of Entity Mentions A Hybrid Neural Model for Type Classification of Entity Mentions Motivation Types group entities to categories Entity types are important for various NLP tasks Our task: predict an entity mention s type

More information

Advanced Search Algorithms

Advanced Search Algorithms CS11-747 Neural Networks for NLP Advanced Search Algorithms Daniel Clothiaux https://phontron.com/class/nn4nlp2017/ Why search? So far, decoding has mostly been greedy Chose the most likely output from

More information

Deep Character-Level Click-Through Rate Prediction for Sponsored Search

Deep Character-Level Click-Through Rate Prediction for Sponsored Search Deep Character-Level Click-Through Rate Prediction for Sponsored Search Bora Edizel - Phd Student UPF Amin Mantrach - Criteo Research Xiao Bai - Oath This work was done at Yahoo and will be presented as

More information

Deep Learning for Computer Vision II

Deep Learning for Computer Vision II IIIT Hyderabad Deep Learning for Computer Vision II C. V. Jawahar Paradigm Shift Feature Extraction (SIFT, HoG, ) Part Models / Encoding Classifier Sparrow Feature Learning Classifier Sparrow L 1 L 2 L

More information

16-785: Integrated Intelligence in Robotics: Vision, Language, and Planning. Spring 2018 Lecture 14. Image to Text

16-785: Integrated Intelligence in Robotics: Vision, Language, and Planning. Spring 2018 Lecture 14. Image to Text 16-785: Integrated Intelligence in Robotics: Vision, Language, and Planning Spring 2018 Lecture 14. Image to Text Input Output Classification tasks 4/1/18 CMU 16-785: Integrated Intelligence in Robotics

More information

Perceptron: This is convolution!

Perceptron: This is convolution! Perceptron: This is convolution! v v v Shared weights v Filter = local perceptron. Also called kernel. By pooling responses at different locations, we gain robustness to the exact spatial location of image

More information

Decentralized and Distributed Machine Learning Model Training with Actors

Decentralized and Distributed Machine Learning Model Training with Actors Decentralized and Distributed Machine Learning Model Training with Actors Travis Addair Stanford University taddair@stanford.edu Abstract Training a machine learning model with terabytes to petabytes of

More information

Pixel-level Generative Model

Pixel-level Generative Model Pixel-level Generative Model Generative Image Modeling Using Spatial LSTMs (2015NIPS) L. Theis and M. Bethge University of Tübingen, Germany Pixel Recurrent Neural Networks (2016ICML) A. van den Oord,

More information