Learning Tag Embeddings and Tag-specific Composition Functions in Recursive Neural Network

Size: px

Start display at page:

Download "Learning Tag Embeddings and Tag-specific Composition Functions in Recursive Neural Network"

Janis Goodman
5 years ago
Views:

1 Learnng Tag Embeddngs and Tag-specfc Composton Functons n Recursve Neural Network Qao Qan, Bo Tan, Mnle Huang, Yang Lu*, Xuan Zhu*, Xaoyan Zhu State Key Lab. of Intellgent Technology and Systems, Natonal Lab. for Informaton Scence and Technology, Dept. of Computer Scence and Technology, Tsnghua Unversty, Bejng , PR Chna *Samsung R&D Insttute Bejng, Chna qanqaodecember29@126.com, smxtanbo@gmal.com ahuang@tsnghua.edu.cn, yang.lu@samsung.com xuan.zhu@samsung.com, zxy-dcs@tsnghua.edu.cn Abstract Recursve neural network s one of the most successful deep learnng models for natural language processng due to the compostonal nature of text. The model recursvely composes the vector of a parent phrase from those of chld words or phrases, wth a key component named composton functon. Although a varety of composton functons have been proposed, the syntactc nformaton has not been fully encoded n the composton process. We propose two models, Tag Guded RNN (TG- RNN for short) whch chooses a composton functon accordng to the part-ofspeech tag of a phrase, and Tag Embedded RNN/RNTN (TE-RNN/RNTN for short) whch learns tag embeddngs and then combnes tag and word embeddngs together. In the fne-graned sentment classfcaton, experment results show the proposed models obtan remarkable mprovement: TG-RNN/TE-RNN obtan remarkable mprovement over baselnes, TE-RNTN obtans the second best result among all the top performng models, and all the proposed models have much less parameters/complexty than ther counterparts. 1 Introducton Among a varety of deep learnng models for natural language processng, Recursve Neural Network (RNN) may be one of the most popular models. Thanks to the compostonal nature of natural text, recursve neural network utlzes the recursve structure of the nput such as a phrase or sentence, and has shown to be very effectve for many natural language processng tasks ncludng semantc relatonshp classfcaton (Socher et al., 2012), syntactc parsng (Socher et al., 2013a), sentment analyss (Socher et al., 2013b), and machne translaton (L et al., 2013). The key component of RNN and ts varants s the composton functon: how to compose the vector representaton for a longer text from the vector of ts chld words or phrases. For nstance, as shown n Fgure 2, the vector of s very nterestng can be composed from the vector of the left node s and that of the rght node very nterestng. It s worth to menton agan, the composton process s conducted wth the syntactc structure of the text, makng RNN more nterpretable than other deep learnng models. s... g very s very nterestng g very nterestng nterestng Fgure 1: The example process of vector composton n RNN. The vector of node very nterestng s composed from the vectors of node very and node nterestng. Smlarly, the node s very nterestng s composed from the phrase node very nterestng and the word node s. There are varous attempts to desgn the composton functon n RNN (or related models). In RNN (Socher et al., 2011), a global matrx s used to lnearly combne the elements of vectors. In RNTN (Socher et al., 2013b), a global tensor s used to compute the tensor products of dmensons to favor the assocaton between dfferent el Proceedngs of the 53rd Annual Meetng of the Assocaton for Computatonal Lngustcs and the 7th Internatonal Jont Conference on Natural Language Processng, pages , Bejng, Chna, July 26-31, c 2015 Assocaton for Computatonal Lngustcs

2 ements of the vectors. Sometmes t s challengng to fnd a sngle functon to model the composton process. As an alternatve, multple composton functons can be used. For nstance, n MV-RNN (Socher et al., 2012), dfferent matrces s desgned for dfferent words though the model s suffered from too much parameters. In AdaMC RNN/RNTN (Dong et al., 2014), a fxed number of composton functons s lnearly combned and the weght for each functon s adaptvely learned. In spte of the success of RNN and ts varants, the syntactc knowledge of the text s not yet fully employed n these models. Two deas are motvated by the example shown n Fgure 2: Frst, the composton functon for the noun phrase the move/np should be dfferent from that for the adjectve phrase very nterestng/adjp snce the two phrases are qute syntactcally dfferent. More specfcally to sentment analyss, a noun phrase s much less lkely to express sentment than an adjectve phrase. There are two notable works mentoned here: (Socher et al., 2013a) presented to combne the parsng and composton processes, but the purpose s for parsng; (Hermann and Blunsom, 2013) desgned composton functons accordng to the combnatory rules and categores n CCG grammar, however, only margnal mprovement aganst Nave Bayes was reported. Our proposed model, tag guded RNN (TG-RNN), s desgned to use the syntactc tag of the parent phrase to gude the composton process from the chld nodes. As an example, we desgn a functon for composng noun phrase (NP) and another one for adjectve phrase (ADJP). Ths smple strategy obtans remarkable mprovements aganst strong baselnes. the / DT the move / NP the move s very nteres/ng / S move / NN s / VBZ s very nteres/ng / VP very nteres/ng / ADJP very / RB nteres/ng / JJ Fgure 2: The parse tree for sentence The move s very nterestng bult by Stanford Parser. Second, when composng the adjectve phrase very nterestng/adjp from the left node very/rb and the rght node nterestng/jj, the rght node s obvously more mportant than the left one. Furthermore, the rght node nterestng/jj apparently contrbutes more to sentment expresson. To address ths ssue, we propose Tag embedded RNN/RNTN (TE-RNN/RNTN), to learn an embeddng vector for each word/phrase tag, and concatenate the tag vector wth the word/phrase vector as nput to the composton functon. For nstance, we have tag vectors for DT,NN,RB,JJ,ADJP,NP, etc. and the tag vectors are then used n composng the parent s vector. The proposed TE-RNTN obtan the second best result among all the top performng models but wth much less parameters and complexty. To the best of our knowledge, ths s the frst tme that tag embeddng s proposed. To summarze, the contrbutons of our work are as follows: We propose tag-guded composton functons n recursve neural network, TG-RNN. Tag-guded RNN allocates a composton functon for a phrase accordng to the partof-speech tag of the phrase. We propose to learn embeddng vectors for part-of-speech tags of words/phrases, and ntegrate the tag embeddngs n RNN and RNTN respectvely. The two models, TE- RNN and TE-RNTN, can leverage the syntactc nformaton of chld nodes when generatng the vector of parent nodes. The proposed models are effcent and effectve. The scale of the parameters s well controlled. Expermental results on the Stanford Sentment Treebank corpus show the effectveness of the models. TE-RNTN obtans the second best result among all publcly reported approaches, but wth much less parameters and complexty. The rest of the paper s structured as follows: n Secton 2, we survey related work. In Secton 3, we ntroduce the tradtonal recursve neural network as background. We present our deas n Secton 4. The experments are ntroduced n Secton 5. We summarze the work n Secton 6. 2 Related Work Dfferent knds of representatons are used n sentment analyss. Tradtonally, the bag-ofwords representatons are used for sentment analyss (Pang and Lee, 2008). To explot the relatonshp between words, word co-occurrence (Turney et al., 2010) and syntactc contexts (Padó 1366

3 and Lapata, 2007) are consdered. In order to dstngush antonyms wth smlar contexts, neural word vectors (Bengo et al., 2003) are proposed and can be learnt n an unsupervsed manner. Word2vec (Mkolov et al., 2013a) ntroduces a smpler network structure makng computaton more effcently and makes bllons of samples feasble for tranng. Semantc composton deals wth representng a longer text from ts shorter components, whch s extensvely studed recently. In many prevous works, a phrase vector s usually obtaned by average (Landauer and Dumas, 1997), addton, element-wse multplcaton (Mtchell and Lapata, 2008) or tensor product (Smolensky, 1990) of word vectors. In addton to usng vector representatons, matrces can also be used to represent phrases and the composton process can be done through matrx multplcaton (Rudolph and Gesbrecht, 2010; Yessenalna and Carde, 2011). Recursve neural models utlze the recursve structure (usually a parse tree) of a phrase or sentence for semantc composton. In Recursve Neural Network (Socher et al., 2011), the tree wth the least reconstructon error s bult and the vectors for nteror nodes s composed by a global matrx. Matrx-Vector Recursve Neural Network (MV-RNN) (Socher et al., 2012) assgns matrces for every words so that t could capture the relatonshp between two chldren. In Recursve Neural Tensor Networks (RNTN) (Socher et al., 2013b), the composton process s performed on a parse tree n whch every node s annotated wth fne-graned sentment labels, and a global tensor s used for composton. Adaptve Mult- Compostonalty (Dong et al., 2014) uses multple weghted composton matrces nstead of sharng a sngle matrx. The employment of syntactc nformaton n RNN s stll n ts nfant. In (Socher et al., 2013a), the part-of-speech tag of chld nodes s consdered n combnng the processes of both composton and parsng. The man purpose s for better parsng by employng RNN, but t s not desgned for sentment analyss. In (Hermann and Blunsom, 2013), the authors desgned composton functons accordng to the combnatory rules and categores n CCG grammar. However, only margnal mprovement aganst Nave Bayes was reported. Unlke (Hermann and Blunsom, 2013), our TG-RNN obtans remarkable mprovements aganst strong baselnes, and we are the frst to propose tag embedded RNTN whch obtans the second best result among all reported approaches. 3 Background: Recursve Neural Models In recursve neural models, the vector of a longer text (e.g., sentence) s composed from those of ts shorter components (e.g., words or phrases). To compose a sentence vector through word/phrase vectors, a bnary parse tree has to be bult wth a parser. The leaf nodes represent words and nteror nodes represent phrases. Vectors of nteror nodes are computed recursvely by composton of chld nodes vectors. Specally, the root vector s regarded as the sentence representaton. The composton process s shown n Fgure 1. More formally, vector v R d for node s calculated va: v = f(g(v l, v r )) (1) where v l and vr are chld vectors, g s a composton functon, and f s a nonlnearty functon, usually tanh. Dfferent recursve neural models manly dffer n composton functon. For example, the composton functon for RNN s as below: [ ] v g(v, l v r l ) = W + b (2) where W R d 2d s a composton matrx and b s a bas vector. And the composton functon for RNTN s as follows: [ ] [ ] [ ] v g(v, l v r l ) = v v r T [1:d] l v l v r + W v r + b (3) where W and b are defned n the prevous model and T [1:d] R 2d 2d d s the tensor that defnes multple blnear forms. The vectors are used as feature nputs to a softmax classfer. The posteror probablty over class labels on a node vector v s gven by v r y = softmax(w s v + b s ). (4) The parameters n these models nclude the word table L, a composton matrx W n RNN, and W and T [1:d] n RNTN, and the classfcaton matrx W s for the softmax classfer. 1367

4 4 Incorporatng Syntactc Knowledge nto Recursve Neural Model The central dea of the paper s nspred by the fact that words/phrases of dfferent part-of-speech tags play dfferent roles n semantc composton. As dscussed n the ntroducton, a noun phrase (e.g., a move/np) may be composed dfferent from a verb phrase (e.g., love move/vp). Furthermore, when composng the phrase a move/np, the two chld words, a/dt and move/nn, may play dfferent roles n the composton process. Unfortunately, the prevous RNN models neglect such syntactc nformaton, though the models do employ the parsng structure of a sentence. We have two approaches to mprove the composton process by leveragng tags on parent nodes and chld nodes. One approach s to use dfferent composton matrces for parent nodes wth dfferent tags so that the composton process could be guded by phrase type, for example, the matrx for NP s dfferent from that for VP. The other approach s to ntroduce tag embeddng for words and phrases, for example, to learn tag vectors for NP, VP, ADJP, etc., and then ntegrate the tag vectors wth the word/phrase vectors durng the composton process. 4.1 Tag Guded RNN (TG-RNN) We propose Tag Guded RNN (TG-RNN) to respect the tag of a parent phrase durng the composton process. The model chooses a composton functon accordng to the part-of-speech tag of a phrase. For example, the move has tag NP, very nterestng has tag ADJP, the two phrases have dfferent composton matrces. More formally, we desgn composton functons g wth a factor of the phrase tag of a parent node. The composton functon becomes g(t, v l, v r ) = g t (v l, v r ) = W t [ v l v r ] + b t (5) where t s the phrase tag for node, W t and b t are the parameters of functon g t, as defned n Equaton 2. In other words, phrase nodes wth varous tags have ther own composton functons such as g NP, g V P, and so on. There are totally k composton functon n ths model where k s the number of phrase tags. When composng chld vectors, a functon s chosen from the functon pool accordng to the tag of the parent node. The process s depcted n Fgure 3. We term ths model Tag guded RNN, TG-RNN for short.... s / VBZ very / RB s very nterestng / VP... g NP g ADJP g VP... g NP g ADJP g VP very nterestng / ADJP nterestng / JJ Fgure 3: The vector of phrase very nterestng s composed wth hghlghted g ADJP and s very nterestng wth g V P. But some tags have few occurrences n the corpus. It s hard and meanngless to tran composton functons for those nfrequent tags. So we smply choose top k frequent tags and tran k composton functons. A common composton functon s shared across phrases wth all nfrequent tags. The value of k depends on the sze of the tranng set and the occurrences of each tag. Specally, when k = 0, the model s the same as the tradtonal RNN. 4.2 Tag Embedded RNN and RNTN (TE-RNN/RNTN) In ths secton, we propose tag embedded RNN (TE-RNN) and tag embedded RNTN (TE-RNTN) to respect the part-of-speech tags of chld nodes durng composton. As mentoned above, tags of parent nodes have mpact on composton. However, some phrases wth the same tag should be composed n dfferent ways. For example, s nterestng and lke swmmng have the same tag VP. But t s not reasonable to compose the two phrases usng the prevous model because the partof-speech tags of ther chldren are qute dfferent. If we use dfferent composton functons for chldren wth dfferent tags lke TG-RNN, the number of tag pars wll amount to as many as k k, whch makes the models nfeasble due to too many parameters. In order to capture the compostonal effects of the tags of chld nodes, an embeddng e t R de s created for every tag t, where d e s the dmenson of tag vector. The tag vector and phrase vector are 1368

5 concatenated durng composton as llustrated n Fgure 4. Formally, the phrase vector s composed by the functon g(v, l e t l, v r, e t r ) = W v l e t l v r e t r + b (6) where t l and tr are tags of the left and the rght nodes respectvely, e t l and e t r are tag vectors, and W R d (2de+2d) s the composton matrx. We term ths model Tag embedded RNN, TE-RNN for short.... s / VBZ g very / RB s very nterestng / VP g very nterestng / ADJP nterestng / JJ Fgure 4: RNN wth tag embeddng. There s a tag embeddng table, storng vectors for RB, JJ, and ADJP, etc. Then we compose the phrase vector very nterestng from the vectors for very and nterestng, and the tag vectors for RB and JJ. Smlarly, ths dea can be appled to Recursve Neural Tensor Network (Socher et al., 2013b). In RNTN, the tag vector and the phrase vector can be nterweaved together through a tensor. More specfcally, the phrase vectors and tag vectors are multpled by the composed tensor. The composton functon changes to the followng: = g(v, l e t l, v r, e t r ) v l e t l v r e t r T [1:d] v l e t l v r e t r + W v l e t l v r e t r + b (7) where the varables are smlar to those defned n equaton 3 and equaton 7. We term ths model Tag embedded RNTN, TE-RNTN for short. The phrase vectors and tag vectors are used as nput to a softmax classfer, gvng the posteror probablty over labels va [ ] v y = softmax(w s + b s ) (8) 4.3 Model Tranng Let y be the target dstrbuton for node, ŷ be the predcted sentment dstrbuton. Our goal s to mnmze the cross-entropy error between y and ŷ for all nodes. The loss functon s defned as follows: E(θ) = y j log ŷ j + λ θ 2 (9) j where j s the label ndex, λ s a L 2 -regularzaton term, and θ s the parameter set. Smlar to RNN, the parameters for our models nclude word vector table L, the composton matrx W, and the sentment classfcaton matrx W s. Besdes, our models have some addtonal parameters, as dscussed below: TG-RNN: There are k composton matrces for top k frequent tags. They are defned as W t R k d 2d. The orgnal composton matrx W s for all nfrequent tags. As a result, the parameter set of TG-RNN s θ = (L, W, W t, W s ). TE-RNN: The parameters nclude the tag embeddng table E, whch contans all the embeddngs for part-of-speech tags for words and phrases. And the sze of matrx W R d (2d+2d e) and the softmax classfer W s R N (de+d). The parameter set of TE-RNN s θ = (L, E, W, W s ). TE-RNTN: Ths model has one more tensor T R (2d+2d e) (2d+2d e ) d than TE-RNN. The parameter set of TE-RNTN s θ = (L, E, W, T, W s ) 5 Experment e t 5.1 Dataset and Experment Settng We evaluate our models on Stanford Sentment Treebank whch contans fully labeled parse trees. It s bult upon 10,662 revews and each sentence has sentment labels on each node n the parse tree. The sentment label set s {0,1,2,3,4}, where the numbers mean very negatve, negatve, neutral, postve, and very postve, respectvely. We use standard splt (tran: 8,544 dev: 1,101, test: 2,210) on the corpus n our experments. In addton, we add the part-of-speech tag for each leaf node and phrase-type tag for each nteror node 1369

6 usng the latest verson of Stanford Parser. Because the newer parser generated trees dfferent from those provded n the datasets, 74/11/11 revews n tran/dev/test datasets are gnored. After removng the broken revews, our dataset contans revews (tran: 8,470, dev: 1,090, test: 2,199). The word vectors were pre-traned on an unlabeled corpus (about 100,000 move revews) by word2vec (Mkolov et al., 2013b) as ntal values and the other vectors s ntalzed by samplng from a unform dstrbuton U( ϵ, ϵ) where ϵ s 0.01 n our experments. The dmenson of word vectors s 25 for RNN models and 20 for RNTN models. Tanh s chosen as the nonlnearty functon. And after computng the output of node wth v = f(g(v l, vr )), we set v = v v so that the resultng vector has a lmted norm. Backpropagaton algorthm (Rumelhart et al., 1986) s used to compute gradents and we use mnbatch SGD wth momentum as the optmzaton method, mplemented wth Theano (Basten et al., 2012). We traned all our models usng stochastc gradent descent wth a batch sze of 30 examples, momentum of 0.9, L 2 -regularzaton weght of and a constant learnng rate of System Comparson We compare our models wth several methods whch are evaluated on the Sentment Treebank corpus. The baselne results are reported n (Dong et al., 2014) and (Km, 2014). We make comparson to the followng baselnes: SVM. A SVM model wth bag-of-words representaton (Pang and Lee, 2008). MNB/b-MNB. Multnomal Nave Bayes and ts bgram varant, adopted from (Wang and Mannng, 2012). RNN. The frst Recursve Neural Network model proposed by (Socher et al., 2011). MV-RNN. Matrx Vector Recursve Neural Network (Socher et al., 2012) represents each word and phrase wth a vector and a matrx. As reported, ths model suffers from too many parameters. RNTN. Recursve Neural Tenser Network (Socher et al., 2013b) employs a tensor Method Fne-graned Pos./Neg. SVM MNB b-mnb RNN MV-RNN RNTN AdaMC-RNN AdaMC-RNTN DRNN TG-RNN (ours) TE-RNN (ours) TE-RNTN (ours) CNN DCNN Para-Vec Table 1: Classfcaton accuray. Fne-graned stands for 5-class predcton and Pos./Neg. means bnary predcton whch gnores all neutral nstances. All the accuracy s at the sentence level (root). for composton functon whch could model the meanng of longer phrases and capture negaton rules. AdaMC. Adaptve Mult-Compostonalty for RNN and RNTN (Dong et al., 2014) trans more than one composton functons and adaptvely learns the weght for each functon. DCNN/CNN. Dynamc Convolutonal Neural Network (Kalchbrenner et al., 2014) and a smple Convolutonal Neural Network (Km, 2014), though these models are of dfferent genres to RNN, we nclude them here for far comparson snce they are among top performng approaches on ths task. Para-Vec. A word2vec varant (Le and Mkolov, 2014) that encodes paragraph nformaton nto word embeddng learnng. A smple but very compettve model. DRNN. Deep Recursve Neural Network (Irsoy and Carde, 2014) stacks multple recursve layers. The comparatve results are shown n Table 1. As llustrated, TG-RNN outperforms RNN, RNTN, MV-RNN, AdMC-RNN/RNTN. 1370

7 Compared wth RNN, the fne-graned accuracy and bnary accuracy of TG-RNN s mproved by 3.8% and 3.9% respectvely. When compared wth AdaMC-RNN, the accuracy of our method rses by 1.2% on the fne-graned predcton. The results show that the syntactc knowledge does facltate phrase vector composton n ths task. As for TE-RNN/RNTN, the fne-graned accuracy of TE-RNN s boosted by 4.8% compared wth RNN and the accuracy of TE-RNTN by 3.2% compared wth RNTN. TE-RNTN also beat the AdaMC-RNTN by 2.2% on the fne-graned classfcaton task. TE-RNN s comparable to CNN and DCNN, another lne of models for ths task. TE-RNTN s better than CNN, DCNN, and Para- Vec, whch are the top performng approaches on ths task. TE-RNTN s worse than DRNN, but the complexty of DRNN s much hgher than TE- RNTN, whch wll be dscussed n the next secton. Furthermore, TE-RNN s also better than TG-RNN. Ths mples that learnng the tag embeddngs for chld nodes s more effectve than smply usng the tags of parent phrases n composton. Note that the fne-graned accuracy s more convncble and relable to compare dfferent approaches due to the two facts: Frst, for the bnary classfcaton task, some approaches tran another bnary classfer for postve/negatve classfcaton whle other approaches, lke ours, drectly use the fne-graned classfer for ths purpose. Second, how the neutral nstances are processed s qute trcky and the detals are not reported n the lterature. In our work, we smply remove neural nstances from the test data before the evaluaton. Let the 5-dmenson vector y be the probabltes for each sentment label n a test nstance. The predcton wll be postve f arg max, 2 y s greater than 2, otherwse negatve, where {0, 1, 2, 3, 4} means very negatve, negatve, neutral, postve, very postve, respectvely. 5.3 Complexty Analyss To gan deeper understandng of the models presented n Table 1, we dscuss here about the parameter scale of the RNN/RNTN models snce the predcton power of neural network models s hghly correlated wth the number of parameters. The analyss s presented n Table 2 (the optmal values are adopted from the cted papers). The parameters for the word table have the same sze n d across all recursve neural models, where n s the number of words and d s the dmenson of word vector. Therefore, we gnore ths part but focus on the parameters of composton functons, termed model sze. Our models, TG-RNN/TE- RNN, have much less parameters than RNTN and AdMC-RNN/RNTN, but have much better performance. Although TE-RNTN s worse than DRNN, however, the parameters of DRNN are almost 9 tmes of ours. Ths ndcates that DRNN s much more complex, whch requres much more data and tme to tran. As a matter of a fact, our TE- RNTN only takes 20 epochs for tranng whch s 10 tmes less than DRNN. Method model sze # of parameters RNN 2d 2 1.8K RNTN 4d 3 108K AdaMC-RNN 2d 2 c 18.7K AdaMC-RNTN 4d 3 c 202K DRNN d h l +2h 2 l 451K TG-RNN (ours) 2d 2 (k + 1) 8.8K TE-RNN (ours) 2(d + d e ) d 1.7K TE-RNTN (ours) 4(d + d e ) 2 d 54K Table 2: The model sze. d s the dmenson of word/phrase vectors (the optmal value s 30 for RNN & RNTN, 25 for AdaMC-RNN, 15 for AdaMC-RNTN, 300 for DRNN). For AdaMC, c s the number of composton functons (15 s the optmal settng). For DRNN, l and h s the number of layers and the wdth for each layer (the optmal values l = 4, h = 174). For our methods, k s the number of unshared composton matrces and d e the dmenson of tag embeddng, for the optmal settng refer to Secton Parameter Analyss We have two key parameters to tune n our proposed models. For TG-RNN, the number of composton functons k s an mportant parameter, whch corresponds to the number of dstnct POS tags of phrases. Let s start from the corpus analyss. As shown n Table 3, the corpus contans 215,154 phrases but the dstrbuton of phrase tags s extremely mbalanced. For example, the phrase tag NP appears 60,239 tmes whle NAC appears only 10 tmes. Hence, t s mpossble to learn a compos- 1371

8 Phrase tag Frequency Phrase tag Frequency NP 60,239 ADVP 1,140 S 33,138 PRN 976 VP 26,956 FARG 792 PP 14,979 UCP 362 ADJP 7,912 SSINV 266 SBAR 5,308 others 1,102 Table 3: The dstrbuton of phrase-type tags n the tranng data. The top 6 frequency tags cover more than 95% phrases. ton functon for the nfrequent phrase tags. Each of the top k frequent phrase tags corresponds to a unque composton functon, whle all the other phrase tags share a same functon. We compare dfferent k for TG-RNN. The accuracy s shown n Fgure 5. Our model obtans the best performance when k s 6, whch s accordant wth the statstcs n Table 3. accuracy TG-RNN AdaMC-RNN RNN k Fgure 5: The accuracy for TG-RNN wth dfferent k. For TE-RNN/RNTN, the key parameter to tune s the dmenson of tag vectors. In the corpus, we have 70 types of tags for leaf nodes (words) and nteror nodes (phrases). Infrequent tags whose frequency s less than 1,000 are gnored. There are 30 tags left and we learn an embeddng for each of these frequent tags. We vares the dmenson of the embeddng d e from 0 to 30. Fgure 6 shows the accuracy for TE-RNN and TE-RNTN wth dfferent dmensons of d e. Our model obtans the best performance when d e s 8 for TE-RNN and 6 for TE-RNTN. The results show that too small dmensons may not be suffcent to encode the syntactc nformaton of tags and too large dmensons damage the performance. accuracy TE-RNTN TE-RNN AdaMC-RNTN AdaMC-RNN RNTN RNN d e Fgure 6: The accuracy for TE-RNN and TE- RNTN wth dfferent dmensons of d e. 5.5 Tag Vectors Analyss In order to prove tag vectors obtaned from tag embedded models are meanngful, we nspect the smlarty between vectors of tags. For each tag vector, we fnd the nearest neghbors based on Eucldean dstance, summarzed n Table 4. Tag Most Smlar Tags JJ (Adjectve) ADJP (Adjectve Phrase) VP (Verb Phrase) VBD (past tense) VBN (past partcple). (Dot) : (Colon) Table 4: Top 1 or 2 nearest neghborng tags wth defnton n brackets. Adjectves and verbs are of sgnfcant mportance n sentment analyss. Although JJ and ADJP are word and phrase tag respectvely, they have smlar tag vectors, because of playng the same role of Adjectve n sentences. VP, VBD and VBN wth smlar representatons all represent verbs. What s more nterestng s that the nearest neghbor of dot s colon, probably because both of them are punctuaton marks. Note that tag classfcaton s none of our tranng objectves and surprsngly the vectors of smlar tags are clustered together, whch can provdes addtonal nformaton durng sentence composton. 6 Concluson In ths paper, we present two ways to leverage syntactc knowledge n Recursve Neural Networks. 1372

9 The frst way s to use dfferent composton functons for phrases wth dfferent tags so that the composton processng s guded by phrase types (TG-RNN). The second way s to learn tag embeddngs and combne tag and word embeddngs durng composton (TE-RNN/RNTN). The proposed models are not only effectve (w.r.t competng performance) but also effcent (w.r.t wellcontrolled parameter scale). Experment results show that our models are among the top performng approaches up to date, but wth much less parameters and complexty. Acknowledgments Ths work was partly supported by the Natonal Basc Research Program (973 Program) under grant No.2012CB316301/2013CB329403, the Natonal Scence Foundaton of Chna under grant No / , and the Bejng Hgher Educaton Young Elte Teacher Project. The work was also supported by Tsnghua Unversty Bejng Samsung Telecom R&D Center Jont Laboratory for Intellgent Meda Computng. References Frédérc Basten, Pascal Lambln, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Ncolas Bouchard, and Yoshua Bengo Theano: new features and speed mprovements. Deep Learnng and Unsupervsed Feature Learnng NIPS 2012 Workshop. Yoshua Bengo, Réjean Ducharme, Pascal Vncent, and Chrstan Janvn A neural probablstc language model. The Journal of Machne Learnng Research, 3: L Dong, Furu We, Mng Zhou, and Ke Xu Adaptve mult-compostonalty for recursve neural models wth applcatons to sentment analyss. In AAAI. AAAI. Karl Mortz Hermann and Phl Blunsom The role of syntax n vector space models of compostonal semantcs. In ACL, pages Assocaton for Computer Lngustcs. Ozan Irsoy and Clare Carde Deep recursve neural networks for compostonalty n language. In NIPS, pages Nal Kalchbrenner, Edward Grefenstette, and Phl Blunsom A convolutonal neural network for modellng sentences. In ACL, pages Assocaton for Computer Lngustcs. Yoon Km Convolutonal neural networks for sentence classfcaton. In EMNLP, pages Assocaton for Computatonal Lngustcs. Thomas K Landauer and Susan T Dumas A soluton to plato s problem: The latent semantc analyss theory of acquston, nducton, and representaton of knowledge. Psychologcal Revew, 104(2):211. Quoc V Le and Tomas Mkolov Dstrbuted representatons of sentences and documents. In ICML, volume 32, pages Peng L, Yang Lu, and Maosong Sun Recursve autoencoders for ITG-based translaton. In EMNLP, pages Assocaton for Computer Lngustcs. Tomas Mkolov, Ka Chen, Greg Corrado, and Jeffrey Dean. 2013a. Effcent estmaton of word representatons n vector space. CoRR. Tomas Mkolov, Ilya Sutskever, Ka Chen, Greg S Corrado, and Jeff Dean. 2013b. Dstrbuted representatons of words and phrases and ther compostonalty. In NIPS, pages Jeff Mtchell and Mrella Lapata Vector-based models of semantc composton. In ACL, pages Sebastan Padó and Mrella Lapata Dependency-based constructon of semantc space models. Computatonal Lngustcs, 33(2): Bo Pang and Lllan Lee Opnon mnng and sentment analyss. Foundatons and Trends n Informaton Retreval, 2(1-2): Sebastan Rudolph and Eugene Gesbrecht Compostonal matrx-space models of language. In ACL, pages Assocaton for Computer Lngustcs. Davd E Rumelhart, Geoffrey E Hnton, and Ronald J Wllams Learnng representatons by backpropagatng errors. Nature, 323: Paul Smolensky Tensor product varable bndng and the representaton of symbolc structures n connectonst systems. Artfcal ntellgence, 46(1): Rchard Socher, Jeffrey Pennngton, Erc H Huang, Andrew Y Ng, and Chrstopher D Mannng Sem-supervsed recursve autoencoders for predctng sentment dstrbutons. In EMNLP, pages Assocaton for Computatonal Lngustcs. Rchard Socher, Brody Huval, Chrstopher D Mannng, and Andrew Y Ng Semantc compostonalty through recursve matrx-vector spaces. In EMNLP, pages Assocaton for Computatonal Lngustcs. 1373

10 Rchard Socher, John Bauer, Chrstopher D Mannng, and Andrew Y Ng. 2013a. Parsng wth compostonal vector grammars. In ACL, pages Assocaton for Computer Lngustcs. Rchard Socher, Alex Perelygn, Jean Y Wu, Jason Chuang, Chrstopher D Mannng, Andrew Y Ng, and Chrstopher Potts. 2013b. Recursve deep models for semantc compostonalty over a sentment treebank. In EMNLP, pages Assocaton for Computatonal Lngustcs. Peter D Turney, Patrck Pantel, et al From frequency to meanng: Vector space models of semantcs. Journal of Artfcal Intellgence Research, 37(1): Sda I Wang and Chrstopher D Mannng Baselnes and bgrams: Smple, good sentment and topc classfcaton. In ACL, pages Assocaton for Computatonal Lngustcs. Anur Yessenalna and Clare Carde Compostonal matrx-space models for sentment analyss. In EMNLP, pages Assocaton for Computer Lngustcs. 1374

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Learning the Kernel Parameters in Kernel Minimum Distance Classifier Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department