Multi-view Opinion Mining with Deep Learning

Neural Processing Letters manuscript No. (will be inserted by the editor)

Ping Huang · Xijiong Xie · Shiliang Sun

Received: date / Accepted: date

Abstract With the explosive growth of social media on the Internet, people are expressing an increasing number of opinions. For objectives such as business decision making and public opinion analysis, how to make the best of these precious opinionated words is a new challenge in the field of NLP. The field of opinion mining, or sentiment analysis, has become active in recent years. Since different kinds of deep neural networks differ in their structures, they probably extract different features. We investigated whether features generated by heterogeneous deep neural networks can be combined by multi-view learning to improve the overall performance. With document-level opinion mining as the objective, we implemented multi-view learning based on heterogeneous deep neural networks. Experiments show that multi-view learning utilizing these heterogeneous features outperforms single-view deep neural networks. Our framework makes better use of single-view data.

Keywords Multi-view learning · Opinion mining · Deep learning · Heterogeneous neural networks

Ping Huang
Department of Computer Science and Technology, East China Normal University, 3663 North Zhongshan Road, Shanghai, P.R. China
phuang95@outlook.com

Xijiong Xie
The School of Information Science and Engineering, Ningbo University, Zhejiang, China
xjxie11@gmail.com

Shiliang Sun (Corresponding Author)
Department of Computer Science and Technology, East China Normal University, 3663 North Zhongshan Road, Shanghai, P.R. China
slsun@cs.ecnu.edu.cn

Ping Huang and Xijiong Xie contributed equally to this work.

1 Introduction

With the explosive growth of Internet social media, people are expressing an increasing number of opinions. The field of opinion mining, which exploits these precious opinionated words, has accordingly become active in recent years. Typical tasks in this field aim to find subjective content in text and determine the sentiment polarities of the speakers. According to the definition in [1], an opinion can be defined as a quintuple $(e_i, a_{ij}, s_{ijkl}, h_k, t_l)$, where $e_i$ stands for the entity in the opinion, $a_{ij}$ stands for the aspect of the entity, $h_k$ stands for the opinion holder, $t_l$ stands for the time when the opinion was given and $s_{ijkl}$ stands for the opinion content and the opinion holder's sentiment polarity. For example, in the review "The screen of this mobile phone is good", the screen is an aspect of the entity mobile phone and a positive sentiment is expressed. With this definition, the task of opinion mining can be defined as determining the quintuple, or at least part of it.

Tasks in the opinion mining field can be categorized into three levels: document level, sentence level and fine-grained level [2]. Document-level opinion mining tasks determine an overall sentiment polarity for a whole document. Sentence-level tasks are similar, but they determine the sentiment polarities of sentences. Fine-grained opinion mining tasks, however, handle more detailed information, including entities, aspects, etc. Our research is conducted at the document level.

Deep learning, a popular field in machine learning, has also been used in opinion mining tasks. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are typical models used in these tasks. In 2014, Kalchbrenner et al. [3] described a convolutional architecture called dynamic convolutional neural networks (DCNNs), which uses dynamic k-max pooling over linear sequences.
Irsoy and Cardie [4] proposed deep recurrent neural networks (DRNNs) in 2014. They are constructed by stacking several layers of RNNs together. Each layer takes the outputs of the previous layer as its inputs and has its own representation of temporal information.

Considering that deep neural networks (DNNs) have the ability to automatically extract features from data [5], we are able to extract from the networks intermediate representations that are suitable to be used as features. Furthermore, due to the structural differences among kinds of DNNs, it is possible that they use different kinds of intermediate representations. Thus we try combining the deep features from heterogeneous DNNs to see whether they are complementary and whether combining them with multi-view learning methods improves the overall performance.

Based on these thoughts, we propose a new opinion mining framework that leverages heterogeneous DNNs in multi-view learning. The framework takes single-view features as inputs and processes them with heterogeneous DNNs to automatically extract multi-view features for multi-view learning methods. This framework applies multi-view methods to single-view data and attempts to exploit their potential.

We implemented the heterogeneous DNN feature based multi-view learning framework to handle document-level opinion mining tasks. First, abstract representations of words are learned from a massive amount of text. Then they are used by heterogeneous DNNs to learn abstract representations of documents. The sets of representations are regarded as distinct views of each document. Finally, a multi-view classifier processes the multi-view representations to determine the sentiment polarities of the documents, which completes the opinion mining task.

The remainder of the paper is organized as follows. In section 2, we introduce multi-view learning methods and how they were used in other works. In section 3, we explain how to choose intermediate representations from deep neural networks and give our reasons based on the architectures of these neural networks. In section 4, we describe our model, which is based on classical DNNs, from beginning to end. In section 5, the model is tested on small-scale data sets. In section 6, we conclude our work briefly.

2 Multi-view learning in other works

Multi-view data are quite common in real-world applications. Describing real-world entities with only one measuring method can hardly depict all sides of them, while multi-view data generated by multiple measuring methods cover more details. For pictures, pixel-level data and texture information can serve as two views. For videos, the frames and the audio can be two views of the data. Multi-view learning is an emerging field in machine learning that studies how to exploit multi-view data of a single object to improve overall performance. Recently, multi-view learning has been widely applied in numerous fields such as image retrieval and ranking [6-8], human pose recovery [9, 10] and so on. As a method that combines multiple data sets, it can also be seen as data fusion.
Based on how multi-view data are used, a high-level categorization [11] divides multi-view learning methods into three styles: co-training style algorithms, co-regularization style algorithms and margin consistency style algorithms. Co-training methods are essentially frameworks built on single-view machine learning methods and are mainly used in semi-supervised learning. In each iteration, unlabeled data in one view are labeled with the help of the predictions of classifiers in the other views. Co-regularization methods add a regularization term to their objective functions in order to improve consensus among the classifiers of the multiple views. Margin consistency methods involve maximum entropy discrimination (MED), and the margin refers to that of the MED classifiers. These methods are based on the assumption that the margins from the views are consistent, which means that the classification confidences from different views are identical [12]. Most multi-view learning methods seek a low-dimensional common subspace to represent multi-view data, then adopt a single-view algorithm for learning [13-15]. A typical algorithm is canonical correlation analysis (CCA) [15].

However, our method uses multi-view classifiers to combine heterogeneous representations extracted from single-view data.

Multi-view learning methods have also been seen in two other scenarios in opinion mining. However, their motivations and starting points are completely different from those of our method. One of them applies multi-view learning to social media that consist of text, pictures and videos and involve users' social activities. Niu et al. [16] gave a detailed introduction to methods in this scenario. Tang et al. [17] proposed a multi-view framework that exploits relations among views so that they help each other select relevant features for social media. The other scenario exploits co-training multi-view methods in cross-lingual problems. Wan [18] used co-training methods and machine translation to make use of unlabeled data in another language. Hajmohammadi et al. [19] sought human assistance to label a selection of the unlabeled corpus in order to find a balance between human cost and performance.

3 Feature extraction from deep neural network

In the following paragraphs, we separately introduce two typical DNNs and give our thoughts about feature extraction.

3.1 Convolutional neural network

Starting with several works around the 1990s [20, 21], convolutional neural networks are among the earliest types of DNN. They are well designed to discover suitable representations of the raw inputs, which is the key attribute of DNNs. A CNN usually consists of several convolutional layers, pooling layers and several fully connected layers at the end. Convolutional layers take a group of adjacent elements (e.g., adjacent pixels in a picture or adjacent words in a sentence) as inputs and compare them to a series of filters to find valuable patterns [20]. In pooling layers, max pooling, a typical pooling method, finds the maximums in local patches of the previous outputs.
This operation keeps the most significant patterns in local patches, because less matched patterns have smaller outputs after the convolutional layers. Furthermore, pooling layers improve overall tolerance to shifts and distortions [5]. The fully connected layers at the end work as traditional neural networks and leverage the features provided by the previous layers to complete the given task. It is worth pointing out that convolutional layers and pooling layers together reduce the dimension of the data and extract the intermediate representations required by subsequent computations. In conventional machine learning methods before DNNs, either features are carefully designed, or raw features that are less meaningful and less effective (e.g., pixel-level data for pictures) are used directly [22, 23]. In DNNs, however, representations are automatically learned to be suitable for the task during the training of the whole neural network. This attribute is especially notable in CNNs since their structures are designed for hierarchical feature extraction [5].
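The convolution and max pooling steps described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the network used in this paper: the names `convolve` and `max_pool`, the sizes `L`, `D`, `F`, `W` and the pooling patch size are chosen for the example, and random numbers stand in for trained filters.

```python
import numpy as np

def convolve(seq, filters):
    """Compare every window of adjacent elements with every filter.

    seq:     (L, D) sequence of L elements, each a D-dimensional vector
    filters: (F, W, D) bank of F filters covering W adjacent elements
    returns: (L - W + 1, F) sums of pointwise products
    """
    W = filters.shape[1]
    windows = np.stack([seq[i:i + W] for i in range(seq.shape[0] - W + 1)])
    return np.einsum("nwd,fwd->nf", windows, filters)

def max_pool(scores, size):
    """Non-overlapping max pooling over positions; a trailing partial patch is dropped."""
    usable = scores.shape[0] - scores.shape[0] % size
    return scores[:usable].reshape(-1, size, scores.shape[1]).max(axis=1)

# Illustrative sizes (assumptions, not values from the paper).
rng = np.random.default_rng(0)
L, D, F, W = 12, 4, 3, 3
seq = rng.normal(size=(L, D))
filters = rng.normal(size=(F, W, D))

scores = convolve(seq, filters)    # shape (10, 3)
pooled = max_pool(scores, size=5)  # shape (2, 3): one maximum per patch and filter
```

Each row of `pooled` keeps, for every filter, only the strongest match inside one local patch, which is why the result has far fewer dimensions than the input and changes little when a pattern moves within a patch.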

CNN deep features. In conventional classification approaches, features are directly fed to the neural networks, implying that the only job of the neural network is classification. Since the fully connected layers in CNNs act like conventional feed-forward neural networks, we believe the intermediate variables between the convolutional or pooling layers and the fully connected layers can be considered as the features that the CNN extracts from the input data.

3.2 Recurrent neural network

Recurrent neural networks, another type of DNN, take sequential data as inputs. Given a data point in the sequence, RNNs give an output based not only on that data point, but also on previous data points. In bidirectional RNNs [24], outputs even depend on data both before and after. An instance of RNN uses the same structure, called the cell, to calculate outputs for each data point in the input sequence. RNNs usually do not have cycles within a single time step, which resembles conventional feed-forward neural networks, but the calculations at different time steps pass information to each other.

Sequential data are not necessarily temporally sequential. For instance, gene sequence data are one kind of non-temporal sequence. However, a majority of sequence data are temporal sequences, so the concept of time is usually introduced to help describe sequence data [25].

Fig. 1 The structure of a typical RNN, and its temporally unfolded version. [5]

An input sequence can be defined as $(x_1, x_2, \ldots, x_T)$, where $T$ is the length of the sequence and each data point $x_t$ is a real-valued vector. Correspondingly, if the output of an RNN is sequential, it can be represented by $(o_1, o_2, \ldots, o_T)$ in a similar fashion. Many kinds of cells also pass an intermediate status that differs from their output. The status at time step $t$ is $s_t$. The structure of a typical RNN is shown in Fig. 1. At time step $t$, the cell processes not only the current data $x_t$, but also the previous intermediate status $s_{t-1}$.
For this time step, the cell needs to calculate an output $o_t$ and an intermediate status $s_t$, which are used in the next time step.
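The recurrence described above can be illustrated with a minimal sketch. It assumes a plain tanh cell in which the output $o_t$ coincides with the status $s_t$; this is a simplification for illustration, not one of the gated cells discussed in this paper, and the names `rnn_step`, `run_rnn`, `W_x`, `W_s` and `b` are hypothetical.

```python
import numpy as np

def rnn_step(x_t, s_prev, W_x, W_s, b):
    """One time step: the new status depends on the current input and the previous status."""
    return np.tanh(W_x @ x_t + W_s @ s_prev + b)

def run_rnn(xs, W_x, W_s, b):
    """Apply the same cell to every data point of the sequence, in order."""
    s_t = np.zeros(W_s.shape[0])  # initial status
    outputs = []
    for x_t in xs:
        s_t = rnn_step(x_t, s_t, W_x, W_s, b)
        outputs.append(s_t)  # in this simple cell the output o_t equals the status s_t
    return np.stack(outputs), s_t

# Illustrative sizes and random weights (assumptions for the example).
rng = np.random.default_rng(0)
T, D, H = 5, 4, 3
xs = rng.normal(size=(T, D))
W_x = rng.normal(size=(H, D)) * 0.1
W_s = rng.normal(size=(H, H)) * 0.1
b = np.zeros(H)

outputs, final_status = run_rnn(xs, W_x, W_s, b)  # outputs has shape (T, H)
```

The same weights are reused at every time step, and the only channel through which earlier inputs can influence later outputs is the status passed from step to step.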

One of the widely used cells is long short-term memory (LSTM), proposed in 1997 [26]. LSTM cells use the input $x_t$ to decide whether a part of the previous status $s_{t-1}$ is kept or discarded. In 2014, a simplified cell with similar functions was proposed in [27], and was later called the gated recurrent unit (GRU). The simplified structure shortens training time, but acts like LSTM when deciding how status information is handled. One apparent difference between GRU and LSTM is that GRU does not distinguish output from status, which means its only output also serves as the status passed to the next time step. Both of them learn how to make choices about status information, preventing it from propagating so far that gradients explode or vanish [25].

RNN deep features. Information in RNNs travels from time step to time step. RNNs based on LSTM cells or GRU cells are believed to create a memorizing feature to pass information between steps. We speculate that the status information passed between steps carries important features of the input sequence that differ from those of CNNs.

4 Our model

Our model consists of four parts. First, data are processed for better use in the later stages. Next, the CNN part and the RNN part run in parallel and act as typical DNNs in opinion mining tasks. Chosen intermediate representations are extracted from both DNNs and passed to the last stage. Multi-view classifiers at the end finish the opinion mining task. The overall flow chart of our model can be seen in Fig. 2.

4.1 Input processing

Documents are written in natural languages, and there are some detailed problems when representing text with real-valued data. In most natural language processing tasks, using word vectors instead of conventional one-hot features can directly improve the performance [28]. We use word2vec to convert words into $D$-dimensional word vectors. Each document consisting of words is therefore converted to a sequence of word vectors.
After word2vec finishes training, the results are saved as a lookup table, and each word in its vocabulary has a corresponding vector. Documents may contain rare words that are not in the vocabulary. These words are simply removed from the document. Besides, all documents are adjusted to have the same length $L$ so that data processing is easier. Then every document can be represented by a 2-dimensional $L \times D$ matrix, and this is the single-view data for a document. The structure is shown in Fig. 3.
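The conversion of a document into an $L \times D$ matrix can be sketched as follows. The text does not state how documents are adjusted to length $L$, so truncating long documents and zero-padding short ones is an assumption here; the tiny lookup table and the function name `document_to_matrix` are likewise hypothetical stand-ins for a trained word2vec vocabulary.

```python
import numpy as np

def document_to_matrix(tokens, lookup, length, dim):
    """Look up word vectors, drop out-of-vocabulary words, then truncate or zero-pad."""
    vectors = [lookup[tok] for tok in tokens if tok in lookup]
    vectors = vectors[:length]          # assumption: long documents are truncated
    matrix = np.zeros((length, dim))    # assumption: short documents are zero-padded
    if vectors:
        matrix[:len(vectors)] = np.stack(vectors)
    return matrix

# Toy lookup table with D = 3 (made-up values, not trained vectors).
lookup = {
    "the": np.array([0.1, 0.0, 0.2]),
    "screen": np.array([0.7, 0.3, 0.1]),
    "is": np.array([0.0, 0.1, 0.0]),
    "good": np.array([0.2, 0.9, 0.4]),
}
tokens = ["the", "screen", "of", "this", "phone", "is", "good"]
matrix = document_to_matrix(tokens, lookup, length=6, dim=3)
```

Here "of", "this" and "phone" are not in the toy vocabulary and are removed, so four word vectors fill the first four rows and the remaining two rows stay zero.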

Fig. 2 The overall flow chart.

Fig. 3 The structure of the word2vec part.

The single-view data are processed by a CNN and an RNN respectively. They resemble the networks used in conventional single-view opinion mining tasks, and each also outputs a value to predict the sentiment polarity.

4.2 The CNN part

The CNN used in our model is basically the one in [29]. In the CNN, calculations are done on several channels with different window sizes. Within each channel, data are cut into patches and then processed by a convolutional layer, a ReLU activation layer and a pooling layer. Each channel has its own window size $W$. Within the convolutional layers, $F$ trainable filters of size $W \times D$ are compared with each window of word vectors by summing the pointwise products. These $F$ values reflect to what extent the data match the filters. Then they are processed by the ReLU activation function. Within the pooling layer, max pooling finds the maximum per filter, describing how likely the pattern of this filter is to be seen in this document. Each channel outputs an $F$-dimensional vector. The results of the channels are concatenated and sent to a perceptron. The cross entropy between the output of the activation function in the perceptron and the observation is used as the loss function. The structure is shown in Fig. 4. The intermediate variables extracted to be one of the views are the concatenated outputs of the channels. The reason is twofold. First, the most important structures in the CNN have been finished, so the results here cover the most characteristic operations in the CNN, i.e. convolution and pooling.

Second, the calculation afterwards is linear classification by a perceptron, and linear classification is one of the most basic tasks in machine learning. If a perceptron handles this task well, it is likely that the intermediate variables at this stage are suitable for classification.

Fig. 4 The structure of the CNN part.

4.3 The RNN part

The RNN used in the model adopts an attention mechanism [30], and the cell within is GRU. At each time step, the RNN takes in one word vector of the document. At every time step, the GRU cells base their outputs on the current word vector as well as the status from the previous time step. After all time steps finish calculation, their outputs are collected by the attention layer. It calculates weights for every time step and outputs a weighted mean. The length of the output vector is configurable. The following parts are similar to those of the CNN. A perceptron handles the classification task, and the loss function is the cross entropy. For a similar reason, the deep features extracted from the RNN are the variables output by the attention layer. The structure is shown in Fig. 5.

Fig. 5 The structure of the RNN part.

4.4 Multi-view classification

The two sets of intermediate variables are treated as two views. Given two-view data of documents and observations of the sentiment polarities of the documents, a multi-view classifier (e.g., SVM-2K or MVMED) can be trained to fit them. Among the supervised multi-view classifiers based on the support vector machine (SVM), one is SVM-2K [31]. It trains SVMs on both views and regularizes

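them with a similarity constraint that uses an $\epsilon$-insensitive term:

$$\left| w_A^\top \phi_A(x_i) + b_A - w_B^\top \phi_B(x_i) - b_B \right| \le \eta_i + \epsilon, \tag{1}$$

where $w_A, b_A$ ($w_B, b_B$) are the weight and threshold of the first (second) SVM, $\phi_A$ and $\phi_B$ are the two feature functions, $x_i$ is the input and $\eta_i$ is the slack variable. SVM-2K solves the following optimization problem for the classifier parameters $w_A, b_A, w_B, b_B$:

$$\begin{aligned}
\min_{w_A, w_B} \quad & \frac{1}{2}\|w_A\|^2 + \frac{1}{2}\|w_B\|^2 + c_1 \sum_{i=1}^{n} q_{Ai} + c_2 \sum_{i=1}^{n} q_{Bi} + D \sum_{i=1}^{n} \eta_i \\
\text{s.t.} \quad & \left| w_A^\top \phi_A(x_i) + b_A - w_B^\top \phi_B(x_i) - b_B \right| \le \eta_i + \epsilon, \\
& y_i \left( w_A^\top \phi_A(x_i) + b_A \right) \ge 1 - q_{Ai}, \\
& y_i \left( w_B^\top \phi_B(x_i) + b_B \right) \ge 1 - q_{Bi}, \\
& q_{Ai} \ge 0, \quad q_{Bi} \ge 0, \quad \eta_i \ge 0, \quad \text{all for } 1 \le i \le n,
\end{aligned} \tag{2}$$

where $D, c_1, c_2, \epsilon$ are nonnegative parameters and $q_{Ai}, q_{Bi}$ are slack variables. Denote $\hat{w}_A, \hat{w}_B, \hat{b}_A, \hat{b}_B$ as the solution to this optimization problem. The SVM-2K decision function is then $f(x) = \frac{1}{2}\left( \hat{w}_A^\top \phi_A(x) + \hat{b}_A + \hat{w}_B^\top \phi_B(x) + \hat{b}_B \right)$. The dual problem of the above optimization problem is given as

$$\begin{aligned}
\max_{\xi_i^A, \xi_i^B, \alpha_i^A, \alpha_i^B} \quad & -\frac{1}{2} \sum_{i,j=1}^{n} \left( \xi_i^A \xi_j^A K_A(x_i, x_j) + \xi_i^B \xi_j^B K_B(x_i, x_j) \right) + \sum_{i=1}^{n} \left( \alpha_i^A + \alpha_i^B \right) \\
\text{s.t.} \quad & \xi_i^A = \alpha_i^A y_i - \beta_i^+ + \beta_i^-, \\
& \xi_i^B = \alpha_i^B y_i + \beta_i^+ - \beta_i^-, \\
& \sum_{i=1}^{n} \xi_i^A = \sum_{i=1}^{n} \xi_i^B = 0, \\
& 0 \le \beta_i^+, \ \beta_i^-, \quad \beta_i^+ + \beta_i^- \le D, \\
& 0 \le \alpha_i^{A/B} \le c_{1/2},
\end{aligned} \tag{3}$$

where $\alpha_i^A, \alpha_i^B, \beta_i^+, \beta_i^-$ are the Lagrange multipliers. Here we take $\epsilon = 0$. The prediction function for each view is given by

$$f_{A/B}(x) = \sum_{i=1}^{n} \xi_i^{A/B} K_{A/B}(x_i, x) + b_{A/B}. \tag{4}$$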
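To make the prediction step concrete, the sketch below evaluates the per-view prediction function and averages the two views as in the SVM-2K decision function. It does not solve the optimization problem: the dual coefficients, thresholds and training points are hand-picked toy values, a linear kernel is assumed, and all function names are hypothetical.

```python
import numpy as np

def linear_kernel(X_train, x):
    """K(x_i, x) for every training point x_i (linear kernel, an assumption)."""
    return X_train @ x

def view_prediction(xi, b, X_train, x, kernel=linear_kernel):
    """Per-view prediction: sum_i xi_i * K(x_i, x) + b."""
    return float(xi @ kernel(X_train, x) + b)

def svm2k_decision(xi_A, b_A, XA_train, xA, xi_B, b_B, XB_train, xB):
    """Average of the two per-view predictions; its sign gives the predicted label."""
    f_A = view_prediction(xi_A, b_A, XA_train, xA)
    f_B = view_prediction(xi_B, b_B, XB_train, xB)
    return 0.5 * (f_A + f_B)

# Toy values standing in for a trained model (not the result of any training).
XA_train = np.array([[1.0, 0.0], [0.0, 1.0]])  # view A of two training documents
XB_train = np.array([[1.0], [-1.0]])           # view B of the same documents
xi_A, b_A = np.array([1.0, -1.0]), 0.0
xi_B, b_B = np.array([0.5, -0.5]), 0.1

score = svm2k_decision(xi_A, b_A, XA_train, np.array([2.0, 0.0]),
                       xi_B, b_B, XB_train, np.array([1.0]))
label = int(np.sign(score))
```

In the framework of this paper, view A and view B would be the deep features taken from the CNN and the RNN for the same document.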

Multi-view maximum entropy discrimination (MVMED) [12] applies MED to two-view problems for the first time by assuming consistent margins from the two views, which decreases the difficulty of optimization. Several variations [32, 33] have been proposed since MVMED, making the problem less demanding. MVMED exploits multiple views in a margin consistency style. It enforces the margins from the two views to be the same, which means that the classification confidences from different views are considered to match each other exactly.

Suppose we are given a multi-view data set $\{(X_t^1, X_t^2, y_t)\}$, $t = 1, 2, \ldots, N$, where $X_t^1$ and $X_t^2$ denote the $t$th example from view 1 and view 2, respectively, and $y_t$ denotes the corresponding label. MVMED aims to seek a joint distribution $p(\Theta_1, \Theta_2)$ over the first-view classifier parameter $\Theta_1$ and the second-view classifier parameter $\Theta_2$. It uses the augmented joint distribution $p(\Theta_1, \Theta_2, \gamma)$ with the common margin $\gamma = \{\gamma_1, \ldots, \gamma_N\}$. The optimization problem of MVMED is formulated as

$$\begin{aligned}
\min_{p(\Theta_1, \Theta_2, \gamma)} \quad & \mathrm{KL}\left( p(\Theta_1, \Theta_2, \gamma) \,\|\, p_0(\Theta_1, \Theta_2, \gamma) \right) \\
\text{s.t.} \quad & \int p(\Theta_1, \Theta_2, \gamma) \left[ y_t L_1(X_t^1 | \Theta_1) - \gamma_t \right] d\Theta_1 \, d\Theta_2 \, d\gamma \ge 0, \\
& \int p(\Theta_1, \Theta_2, \gamma) \left[ y_t L_2(X_t^2 | \Theta_2) - \gamma_t \right] d\Theta_1 \, d\Theta_2 \, d\gamma \ge 0, \\
& 1 \le t \le N,
\end{aligned} \tag{5}$$

where $L_1(X_t^1 | \Theta_1)$ and $L_2(X_t^2 | \Theta_2)$ are discriminant functions from view 1 and view 2, respectively. The expected large margin constraints are enforced on both views. The solution to the MVMED problem is based on the following theorem.

Theorem 1 The solution to the MVMED problem has the following general form

$$p(\Theta_1, \Theta_2, \gamma) = \frac{1}{Z(\lambda_1, \lambda_2)} \, p_0(\Theta_1, \Theta_2, \gamma) \, \exp\left\{ \sum_{t=1}^{N} \lambda_{1t} \left[ y_t L_1(X_t^1 | \Theta_1) - \gamma_t \right] + \sum_{t=1}^{N} \lambda_{2t} \left[ y_t L_2(X_t^2 | \Theta_2) - \gamma_t \right] \right\}, \tag{6}$$

where $Z(\lambda_1, \lambda_2)$ is the normalization constant and $\lambda_1 = \{\lambda_{11}, \ldots, \lambda_{1N}\}$, $\lambda_2 = \{\lambda_{21}, \ldots, \lambda_{2N}\}$ define two sets of nonnegative Lagrange multipliers, one for each classification constraint. $\lambda_1$ and $\lambda_2$ are set by finding the unique maximum of the following jointly concave objective function:

$$J(\lambda_1, \lambda_2) = -\log Z(\lambda_1, \lambda_2). \tag{7}$$

That is,

$$\begin{aligned}
\max_{\lambda_{1t}, \lambda_{2t}} \quad & \sum_{t=1}^{N} \left( \lambda_{1t} + \lambda_{2t} + \log\left( 1 - \frac{\lambda_{1t} + \lambda_{2t}}{c} \right) \right) - \frac{1}{2} \sum_{t=1}^{N} \sum_{k=1}^{N} \lambda_{1t} \lambda_{1k} y_t y_k (X_t^1)^\top X_k^1 - \frac{1}{2} \sum_{t=1}^{N} \sum_{k=1}^{N} \lambda_{2t} \lambda_{2k} y_t y_k (X_t^2)^\top X_k^2 \\
\text{s.t.} \quad & \lambda_{1t} \ge 0, \quad \lambda_{2t} \ge 0, \quad \sum_{t=1}^{N} \lambda_{1t} y_t = 0, \quad \sum_{t=1}^{N} \lambda_{2t} y_t = 0, \quad t \in \{1, 2, \ldots, N\}.
\end{aligned} \tag{8}$$

After obtaining $\lambda_1$ and $\lambda_2$, the distribution $p(\Theta_1, \Theta_2, \gamma)$ is specified accordingly. By marginalizing out $\gamma$, we get the distribution $p(\Theta_1, \Theta_2)$ and predict the label of a new example $(X^1, X^2)$ from view 1 and view 2 with the following rule:

$$\hat{y} = \operatorname{sign}\left( \frac{1}{2} \int p(\Theta_1, \Theta_2) \left( L_1(X^1 | \Theta_1) + L_2(X^2 | \Theta_2) \right) d\Theta_1 \, d\Theta_2 \right). \tag{9}$$

5 Experiment

We implemented the model described above, with the multi-view classification part being MVMED, SVM-2K or CCA with kernel SVM. Experiments were conducted on three data sets: Large Movie Review Dataset, Stanford Sentiment Treebank and Amazon Fine Food Reviews. The split into training set, development set and test set follows the original split in Large Movie Review Dataset and Stanford Sentiment Treebank. As for Amazon Fine Food Reviews, 2.5% of the reviews are in the test set, and the remaining data are used as the training set in a 5-fold cross validation.

Binary labeling criteria vary from one data set to another. Labels in Large Movie Review Dataset follow the original criteria. In Stanford Sentiment Treebank, reviews with sentiment values over 0.6 are considered positive, and those whose values are below 0.4 are considered negative. In Amazon Fine Food Reviews, reviews scored 4 or 5 are positive, and those scored 1 or 2 are negative. All neutral reviews are ignored.

Hyperparameters in each experiment are optimized through grid search. Such hyperparameters include the unified length of documents, the window sizes and filter counts of each channel in the CNN, and the GRU cell counts and attention layer size in the RNN, as well as the hyperparameters of MVMED, SVM-2K and CCA with vanilla SVM.

Our framework focuses on boosting the performance of models when training data are scarce. From each data set, we randomly picked 1,000, 2,000 and 3,000 documents as training sets, while the test set remained untouched. After training, the two DNNs (TextCNN and Attn. GRU) are tested individually, and the three multi-view models built on their deep features are also tested. Results are shown in Table 1 and Fig. 6.

We can conclude from the data that all multi-view methods improve the classification accuracy. They achieve higher accuracies than either of the single-view methods. Specifically, CCA with SVM performs the best on Large Movie Review Dataset, while MVMED performs the best on Amazon Fine Food Reviews. SVM-2K does not work as well, probably because it is an instance of MVMED [12], which suggests its lower generalization ability. In addition, on each of the three data sets, accuracy increases as the training set grows. The amount of improvement is measured in the next experiment. Generally speaking, the framework leverages the complementarity of heterogeneous deep features and improves overall classification accuracy on document-level opinion mining.
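The binary labeling criteria and the random selection of small training sets described in this section can be sketched as follows. The thresholds come from the text; encoding the classes as +1 and -1, the function names and the fixed random seed are assumptions made for the example.

```python
import random

def sst_label(sentiment_value):
    """Stanford Sentiment Treebank: over 0.6 is positive, below 0.4 is negative."""
    if sentiment_value > 0.6:
        return 1
    if sentiment_value < 0.4:
        return -1
    return None  # neutral

def amazon_label(score):
    """Amazon Fine Food Reviews: 4 or 5 is positive, 1 or 2 is negative."""
    if score in (4, 5):
        return 1
    if score in (1, 2):
        return -1
    return None  # neutral

def binarize(records, labeler):
    """Turn (text, raw_value) records into (text, label) pairs, ignoring neutral ones."""
    labeled = []
    for text, raw in records:
        label = labeler(raw)
        if label is not None:
            labeled.append((text, label))
    return labeled

def subsample(pairs, size, seed=0):
    """Randomly pick `size` labeled documents as a training set."""
    return random.Random(seed).sample(pairs, size)

# Made-up reviews for illustration only.
reviews = [("great taste", 5), ("fine", 3), ("stale", 1), ("tasty", 4)]
pairs = binarize(reviews, amazon_label)
small_train = subsample(pairs, 2)
```

With the real data sets, `subsample` would be called with sizes 1,000, 2,000 and 3,000 while the test set is left unchanged.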

Table 1 Percentage accuracies of single-view methods and multi-view methods. Columns: ACL-IMDB, SST Polarity and Amazon Polarity, each at training set sizes 1,000, 2,000 and 3,000. Rows: TextCNN, Attn. GRU, CCA + SVM, SVM-2K and MVMED.

Fig. 6 Percentage accuracies of single-view methods and multi-view methods. Panels: (a) ACL-IMDB, (b) SST Polarity, (c) Amazon Food Polarity. Axes: training set size against accuracy (%). Curves: TextCNN, Attn. GRU, CCA + SVM, SVM-2K and MVMED.

Table 2 Percentage improvement from single-view methods to multi-view methods. Columns: 1,000, 2,000, 3,000 and Avg. Rows: ACL-IMDB, SST Polarity, Amazon Food Polarity and Avg.

In order to measure our model's ability to improve overall performance, we also calculated the percentage improvement in accuracy from the best single-view method to the best multi-view method in the test above. Results are shown in Table 2. Please note that overall accuracies on Large Movie Review Dataset are the lowest among the three data sets, while those on Amazon Fine Food Reviews are the highest. From the table, we find that the lower the single-view accuracies are, the more the framework improves them. So this framework helps more when applied to less reliable DNNs.

6 Conclusion

Analysis of the structures of CNNs and RNNs implies that they extract different views of deep features. We explored the possibility of leveraging different features of a single document at the same time by constructing a multi-view learning framework that combines features extracted from heterogeneous DNNs. Our experiments show that, for document-level opinion mining, multi-view techniques including CCA, SVM-2K and MVMED can combine and exploit

the heterogeneous deep features and outperform single-view DNNs. Besides, our framework not only performs better as the data set grows, but also provides greater improvement when applied to less reliable systems or to situations in which fewer data are given. Our framework is a creative way to apply multi-view methods to single-view problems. Further work can be done on applying multi-view methods to other conventional tasks, adapting DNNs to better fit multi-view learning, etc.

7 Acknowledgments

The first two authors, Ping Huang and Xijiong Xie, are joint first authors. This work is sponsored by Shanghai Sailing Program. The corresponding author Shiliang Sun would also like to acknowledge support from NSFC Projects and , and Shanghai Knowledge Service Platform Project (No. ZF1213). The work of Xijiong Xie was supported by the NSFC of Zhejiang Province under Project LQ18F

References

1. Liu B (2012) Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies 5(1):
2. Sun S, Luo C, Chen J (2017) A review of natural language processing techniques for opinion mining systems. Information Fusion 36:
3. Kalchbrenner N, Grefenstette E, Blunsom P (2014) A convolutional neural network for modelling sentences. arXiv preprint arXiv:
4. Irsoy O, Cardie C (2014) Opinion mining with deep recurrent neural networks. In Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp
5. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):
6. Yu J, Tao D, Wang M, Rui Y (2015) Learning to rank using user clicks and visual features for image retrieval. IEEE Transactions on Cybernetics 45:
7. Yu J, Rui Y, Tao D (2014) Click prediction for web image reranking using multimodal sparse coding. IEEE Transactions on Image Processing 23:
8. Yu J, Yang X, Gao F, Tao D (2016) Deep multimodal distance metric learning using click constraints for image ranking. IEEE Transactions on Cybernetics 47:
9. Tao D, Hong C, Yu J, Wan J, Wang M (2015) Multimodal deep autoencoder for human pose recovery. IEEE Transactions on Image Processing 24:
10. Hong C, Yu J, Tao D, Wang M (2015) Image-based three-dimensional human pose recovery by multiview locality-sensitive sparse retrieval. IEEE Transactions on Industrial Electronics 62:
11. Zhao J, Xie X, Xu X, Sun S (2017) Multi-view learning overview: Recent progress and new challenges. Information Fusion 38:
12. Sun S, Chao G (2013) Multi-view maximum entropy discrimination. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, pp
13. Yu J, Rui Y, Chen B (2013) Exploiting click constraints and multi-view features for image re-ranking. IEEE Transactions on Multimedia 16:
14. Yu J, Wang M, Tao D (2012) Semi-supervised multiview distance metric learning for cartoon synthesis. IEEE Transactions on Multimedia 21:
15. Hardoon DR, Szedmak SR, Shawe-Taylor JR (2004) Canonical correlation analysis: An overview with application to learning methods. Neural Computation 12:
16. Niu T, Zhu S, Pang L, El Saddik A (2016) Sentiment analysis on multi-view social data. In Proceedings of the International Conference on Multimedia Modeling, pp
17. Tang J, Hu X, Gao H, Liu H (2013) Unsupervised feature selection for multi-view data in social media. In Proceedings of the SIAM International Conference on Data Mining, SIAM, pp
18. Wan X (2009) Co-training for cross-lingual sentiment classification. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, Association for Computational Linguistics, pp
19. Hajmohammadi MS, Ibrahim R, Selamat A (2014) Bi-view semi-supervised active learning for cross-lingual sentiment classification. Information Processing & Management 50(5):
20. Le Cun Y, Jackel L, Boser B, Denker J, Graf H, Guyon I, Henderson D, Howard R, Hubbard W (1989) Handwritten digit recognition: Applications of neural network chips and automatic learning. IEEE Communications Magazine 27:
21. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):
22. Tian DP (2013) A review on image feature extraction and representation techniques. International Journal of Multimedia and Ubiquitous Engineering 8(4):
23. Egmont-Petersen M, de Ridder D, Handels H (2002) Image processing with neural networks: a review. Pattern Recognition 35(10):
24. Schuster M, Paliwal KK (1997) Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):
25. Lipton ZC, Berkowitz J, Elkan C (2015) A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:
26. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Computation 9(8)
27. Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:
28. Turian J, Ratinov L, Bengio Y (2010) Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the association for computational linguistics, Association for Computational Linguistics, pp
29. Kim Y (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:
30. Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:
31. Farquhar J, Hardoon D, Meng H, Shawe-Taylor JS, Szedmak S (2006) Two view learning: SVM-2K, theory and practice. Advances in Neural Information Processing Systems 18:
32. Chao G, Sun S (2016) Alternative multiview maximum entropy discrimination. IEEE Transactions on Neural Networks and Learning Systems 27:
33. Mao L, Sun S (2016) Soft margin consistency based scalable multi-view maximum entropy discrimination. In Proceedings of the 25th International Joint Conference on Artificial Intelligence, pp
34. Maas AL, Daly RE, Pham PT, Huang D, Ng AY, Potts C (2011) Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, Volume 1, Association for Computational Linguistics, pp
