Comparing Dropout Nets to Sum-Product Networks for Predicting Molecular Activity


Anonymous Author(s)

Abstract

Sum-product networks are a multi-layered architecture for computing the joint probabilities of a set of input features. These networks were recently proposed as a structure capable of performing inference efficiently, and they have been demonstrated to outperform deep belief networks in the domain of visual classification. This paper explains the principles that govern the use of sum-product networks, reviews some of their theoretical background, and proposes a method for testing the performance of the sum-product network in the novel domain of molecular activity prediction. The report concludes by describing the development of a port of an existing sum-product network implementation to the Python scripting language.

1 Introduction

In October of 2012, a group of students working under the supervision of Professor Hinton at the University of Toronto won the Merck Molecular Activity Challenge hosted on http://kaggle.com [1]. The goal of the challenge was to predict the activity of molecules in different contexts given numerical descriptors generated from their structure. In practice, this required participants to analyze 15 data sets, each containing thousands of training examples with tens of thousands of features. Some features are shared between data sets, but every data set also contains many features of its own.

George Dahl, one of the students involved in the winning entry, wrote an article after the competition explaining that a deep neural network trained with dropout was used for prediction [1]. The success of this algorithm over other solutions that employed more pre-processing of features is a demonstration of the power of deep learning techniques. However, using a neural network is not the only deep learning option available. In 2011, Poon and Domingos introduced a new architecture called a sum-product network (SPN), designed to perform inference efficiently even when the partition function is complex [2]. One of the results reported in the Poon and Domingos paper is that the SPN architecture outperformed the deep belief network (DBN) architecture used by Hinton et al. by a wide margin on visual classification and face completion tasks. The question that this paper addresses is whether the SPN architecture can be adapted from the visual recognition domain to other areas in which neural networks currently dominate.

The paper is structured into four main components. First, an introductory view of neural network and SPN architectures is presented with an emphasis on the motivation for their usage. Second, claims about the SPN architecture and its strengths relative to other architectures are discussed from a theoretical perspective. Third, a method for adapting the sum-product network to the Merck Molecular Activity Challenge prediction task is proposed. Finally, the implementation of a basic sum-product network is discussed.

1.1 Neural Networks

Artificial neural networks use layers of units to represent a functional mapping. In a typical feedforward network, the neurons in each layer can only connect to neurons in the next layer: if $a_{ij}$ is the $j$th neuron in layer $i$, it can connect only to neurons of the form $a_{(i+1)k}$. Connections between neurons are weighted, and all the incoming connections to a neuron are summed. The output of each neuron is determined by an activation function applied to the summed inputs.

By introducing hidden layers of neurons between the input and output layers, neural networks can compactly represent arbitrary functions. Neural networks with at least one hidden layer and continuous, bounded, and non-constant activation functions have been proven to be universal approximators [3]. The significance of this result is that neural networks can represent functions arbitrarily well given a sufficient number of neurons.

The neural networks employed by Hinton's group in their Merck Challenge entry used rectified linear activation functions, multiple hidden layers, and dropout regularization [1]. Dropout is a recently developed technique for increasing the robustness of neural networks that works by randomly omitting feature detectors during training rounds [4]. Omitting features during training prevents neurons from co-adapting to features and overfitting the data. This process is conceptually equivalent to averaging multiple models together, but it is more efficient computationally.

1.2 Sum-Product Networks

Sum-product networks are motivated by the difficulty of exact inference in graphical models. Consider a graphical model written in the form

$$P(X = x) = \frac{1}{Z} \prod_k \phi_k(x_{\{k\}})$$

where $x$ is a vector, $x_{\{k\}}$ is the subset of $x$ that forms the scope of the potential function $\phi_k$, and $Z$ is the partition function. Performing inference requires summing the product of exponentially many potential functions to obtain $Z = \sum_x \prod_k \phi_k(x_{\{k\}})$.

The sum-product network is based on network polynomials, which are an alternative representation of the potential function. The network polynomial is constructed by multiplying the probability of a state $x$, $p(x)$, with all of the indicator variables that have a value of one in that state. This operation is repeated for all states to obtain a set of products that are then summed together to yield the network polynomial (a small worked example is given below). The operations required to compute the network polynomial can be represented as a tree, with each product forming a node between the indicator variables and the summation. This representation suffers from the same problem as inference in graphical models in that the number of product nodes grows exponentially with the number of indicators. The root of the problem is that one product node is required for each possible state. The insight that allows sum-product networks to avoid this problem is that they add additional layers of sums and products, which enables computations to be shared between states.

A sum-product network is a directed acyclic graph formed from summation and product nodes. The leaf nodes in the graph are binary indicator variables together with the negations of all these indicators. The edges from summation nodes to their children are weighted with non-negative values, while edges from product nodes to their children are not associated with weights.
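
For a concrete illustration of the network polynomial, consider two binary variables $X_1$ and $X_2$ with indicators $x_1, \bar{x}_1, x_2, \bar{x}_2$. Enumerating the four states as described above gives

$$f(x_1, \bar{x}_1, x_2, \bar{x}_2) = P(X_1{=}1, X_2{=}1)\,x_1 x_2 + P(X_1{=}1, X_2{=}0)\,x_1 \bar{x}_2 + P(X_1{=}0, X_2{=}1)\,\bar{x}_1 x_2 + P(X_1{=}0, X_2{=}0)\,\bar{x}_1 \bar{x}_2 .$$

Setting the indicators to match a single state recovers that state's probability, while setting both an indicator and its negation to one sums over the corresponding variable, which is exactly the marginalization mechanism described below. With $n$ binary variables the flat polynomial has $2^n$ product terms, which is why the reuse of intermediate sums and products in an SPN matters.
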
Figure 1 shows an example of a sum-product network. The network in Figure 1 is a tree, but connections between nodes are not restricted to adjacent levels as long as all connected nodes alternate between product nodes and sum nodes.

Figure 1: A sum-product network with four independent binary values. Bars over variable names indicate negation.
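
To make the structure concrete, the following is a minimal Python sketch of indicator, sum, and product nodes together with the bottom-up evaluation and marginalization described below. The small network built at the end is an illustrative example over two independent binary variables with arbitrary weights, not a reproduction of Figure 1; all class and variable names are invented for this sketch.

```python
class Indicator:
    """Leaf node: a binary indicator for a variable or its negation."""
    def __init__(self, name):
        self.name = name
        self.value = 1.0  # set per query; setting both an indicator and its negation to 1 marginalizes the variable

    def eval(self):
        return self.value


class SumNode:
    """Sum node: weighted sum of its children (weights are non-negative)."""
    def __init__(self, children, weights):
        self.children = children
        self.weights = weights

    def eval(self):
        return sum(w * c.eval() for w, c in zip(self.weights, self.children))


class ProductNode:
    """Product node: unweighted product of its children."""
    def __init__(self, children):
        self.children = children

    def eval(self):
        result = 1.0
        for c in self.children:
            result *= c.eval()
        return result


# A tiny SPN over two independent binary variables X1 and X2 with
# illustrative weights P(X1=1) = 0.6 and P(X2=1) = 0.3.
x1, nx1 = Indicator("x1"), Indicator("not_x1")
x2, nx2 = Indicator("x2"), Indicator("not_x2")
s1 = SumNode([x1, nx1], [0.6, 0.4])   # mixture over X1
s2 = SumNode([x2, nx2], [0.3, 0.7])   # mixture over X2
root = ProductNode([s1, s2])          # independence: product of the two sums

# Joint probability P(X1=1, X2=0): set the matching indicators to 1, the rest to 0.
x1.value, nx1.value, x2.value, nx2.value = 1, 0, 0, 1
print(root.eval())                    # 0.6 * 0.7 = 0.42

# Marginal P(X1=1): sum out X2 by setting both of its indicators to 1.
x1.value, nx1.value, x2.value, nx2.value = 1, 0, 1, 1
print(root.eval())                    # 0.6 * (0.3 + 0.7) = 0.6
```

Because each sum node's weights sum to one in this sketch, the root value is already a normalized probability; with unnormalized weights it would be divided by the value obtained when all indicators are set to one.
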

The sum-product network can be used to find the joint probability of a set of variables by summing the weighted values at summation nodes and multiplying the values at product nodes. The joint probability is the value of the root node. Marginal probabilities can be computed by choosing a variable to sum out and setting both the indicator for that variable and the indicator for its negation to one.

2 Theoretical Properties of Sum-Product Networks

Sum-product networks are defined to be valid if and only if the value of the root node is always equal to the probability of observing the state indicated by the leaf nodes. This is an informal explanation and assumes that the state indicated by the leaf nodes is a valid state; in the event that both an indicator and its negation are set to one, the probability of the indicated state is zero, but as mentioned above the network will instead marginalize out the corresponding variable. The technical definition of this property is that the network is valid iff $S(e) = \Phi_S(e)$, where $e$ is an event, $S(e)$ is the value of the root node given the input $e$ to the indicators, and $\Phi_S(e)$ is the probability of the event.

Guaranteeing that constructed SPNs are valid is an important part of their power. As proven by Poon and Domingos [2], two properties need to be met for an SPN to be valid: the children of a sum node must all be functions of the same variables, and no product node may be a function of both a variable and its negation. These two conditions can easily be met when the network is constructed and are not affected by changing the weights of edges during training; thus sum-product networks can always guarantee that inference is possible by evaluating the values in the network.

Delalleau and Bengio showed that the depth of the network, measured as the maximum number of alternating sum and product layers, affects the representational ability of the network [5]. Their theoretical results focused on two particular classes of functions and proved that the number of hidden units in deep representations grows more slowly than the same quantity for shallow representations when the same functions are represented. They concluded that deep networks offer much more compact representations than shallow networks; however, their results do not cover all the functions that the SPN architecture is capable of representing.

3 Adapting an SPN for the Merck Challenge

A number of modifications need to be made to the form of the sum-product network presented in Section 1.2 for the network to be usable on the molecular activity prediction task described in Section 1. This section outlines these considerations and ends by detailing how the output of the system can be compared to the results of the Merck Molecular Activity Challenge.

3.1 Continuous Input

The features in the Merck data set have integer values, which is problematic since the sum-product networks described so far have used binary features exclusively. This limitation can be overcome by using integral nodes instead of sum nodes. The idea is to treat real-valued features as samples drawn from a multinomial distribution with an infinite number of values.
If each input feature is drawn from a multinomial distribution with infinitely many values, then the weighted sum of indicators becomes an integral over the probability distribution. In the original paper on sum-product networks, Poon and Domingos assumed that pixel values were drawn from a mixture of Gaussians model [2]. The procedure for converting a real-valued input into a continuous sample begins by normalizing the input features to have zero mean and unit variance. The input values for each feature are then divided into k equal-sized sets, and the mean value of each set is used as the mean of a Gaussian in the mixture of Gaussians model.
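
A rough sketch of this conversion is shown below. It illustrates the procedure just described rather than any released code; the function names, the choice of k, and the use of uniformly weighted, unit-variance Gaussian components are assumptions made for the example.

```python
import numpy as np

def gaussian_mixture_means(feature_values, k=4):
    """Normalize one feature to zero mean / unit variance, split the sorted
    values into k equal-sized sets, and return each set's mean. These means
    serve as the component means of a mixture-of-Gaussians leaf."""
    x = np.asarray(feature_values, dtype=float)
    x = (x - x.mean()) / x.std()
    parts = np.array_split(np.sort(x), k)
    return np.array([p.mean() for p in parts])

def leaf_density(value, means, sigma=1.0):
    """Evaluate a uniformly weighted mixture-of-Gaussians leaf at a normalized
    input value; sigma is an assumed common standard deviation."""
    comps = np.exp(-0.5 * ((value - means) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return comps.mean()

# Example: one integer-valued descriptor column standing in for Merck data.
column = np.array([3, 7, 1, 9, 4, 4, 6, 2, 8, 5])
means = gaussian_mixture_means(column, k=4)
print(means)
print(leaf_density(0.5, means))
```
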

3.2 Prediction

The value of the root node of the sum-product network naturally yields a probability. For prediction, however, the value of interest is not the probability of the inputs but the value of an output variable. Unlike a neural network, an SPN is not an input-output mapping, so to obtain an output prediction the output variable must first be added to the network using indicator variables. During training, the values of these indicator variables are set according to the sample result. During testing, the values of the indicator variables are set to the values that maximize the network value.

3.3 Overlapping Feature Sets

The presence of overlapping features in the Merck data set is one of the properties that allowed a deep neural network to outperform other solutions by exploiting the shared structure in the data. Capturing these shared relationships requires building one SPN that spans all of the data sets. Training this larger network differs from standard training because not all the indicator variables will have set values, since the same features are not present in every data set. A simple method for addressing this scenario is to train the network with the non-observed features marginalized out by setting their indicators and their negations to one.

3.4 Evaluation

The evaluation of the Merck Challenge was based on the R-squared metric

$$R^2 = \frac{\left(\sum_i (X_i - \bar{X})(Y_i - \bar{Y})\right)^2}{\sum_i (X_i - \bar{X})^2 \sum_i (Y_i - \bar{Y})^2}$$

where $X$ denotes the predicted activities and $Y$ the observed activities, i.e. the squared correlation between the two. The data set is divided with a temporal split, as the data for testing comes from later assays of molecules than the training data. At the time the Merck Challenge ended, the R-squared score to beat was 0.49410. Details about the confidence of the prediction and its robustness are unfortunately not available from this single metric, so evaluation can be supplemented by using a bootstrap method to obtain confidence intervals on the R-squared score (a short code sketch of this procedure is given at the end of Section 4). The bootstrap evaluation proceeds by drawing N samples with replacement from the testing samples and computing the R-squared metric over these samples to obtain $\hat{R}^2_1$. This process is repeated for k iterations to build the set $\{\hat{R}^2_1, \hat{R}^2_2, \ldots, \hat{R}^2_k\}$. Ordering this set and finding the 2.5 and 97.5 percentiles provides an estimate of a 95 percent confidence interval.

4 SPN Implementation

Poon and Domingos released a Java implementation of code for constructing and learning an SPN. Their code is intended to run on large distributed systems and as such depends on a message-passing protocol between many computing nodes. The limitations of running this system on personal computing hardware motivated the development of a Python-based implementation of the system. Python was chosen because it is platform-independent, gaining popularity in the machine learning community, and accompanied by support tools such as Theano that provide access to hardware acceleration [6].

The algorithm initializes by constructing a densely connected sum-product network with zero edge weights. From this point, the basic learning algorithm presented in [2] is followed by incrementing edge weights following inference with each data sample. The data samples are presented iteratively until the edge weights converge. Edges with zero weight are removed from the final graph.
The Python port is not yet capable of computing reasonable predictions for the Merck challenge due to the scale of the data set. Even with the increased efficiency of the sum-product network, learning a model of the Merck data set is only realistically feasible with GPU-accelerated code or distributed computing, neither of which has yet been implemented. However, the current Python code could serve as a pedagogical tool for explaining sum-product networks, and future work will address these performance issues.
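
The bootstrap evaluation described in Section 3.4 is straightforward to sketch in Python. The example below is an illustration rather than part of the port described above; it assumes that the challenge's R-squared metric is the squared Pearson correlation between predicted and observed activities, and the function and array names are invented for the example.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Squared Pearson correlation between observed and predicted activities."""
    x = y_pred - y_pred.mean()
    y = y_true - y_true.mean()
    return (x @ y) ** 2 / ((x @ x) * (y @ y))

def bootstrap_r2_interval(y_true, y_pred, k=1000, seed=0):
    """Draw k bootstrap resamples of the test set and return the 2.5/97.5
    percentiles of the R-squared scores as a 95 percent confidence interval."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)  # N test samples drawn with replacement
        scores.append(r_squared(y_true[idx], y_pred[idx]))
    return tuple(np.percentile(scores, [2.5, 97.5]))

# Example with synthetic data standing in for real assay results and SPN predictions.
rng = np.random.default_rng(1)
observed = rng.normal(size=200)
predicted = observed + rng.normal(scale=0.8, size=200)
print(r_squared(observed, predicted))
print(bootstrap_r2_interval(observed, predicted))
```
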

5 Conclusions and Contributions

This work has relied heavily on the original SPN paper published by Poon and Domingos, as it remains one of the only published resources about this new architecture. It is hoped that the development of an alternative Python implementation of sum-product network construction and training will spur additional interest in this line of research and in its applications outside of vision processing.

References

[1] Dahl, G. (2012). Deep Learning How I Did It: Merck 1st place interview. Online article available from http://blog.kaggle.com/2012/11/01/deep-learning-how-i-did-it-merck-1st-place-interview/

[2] Poon, H. & Domingos, P. (2011). Sum-product networks: A new deep architecture. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 689-690. IEEE.

[3] Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks 4(2):251-257.

[4] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

[5] Delalleau, O. & Bengio, Y. (2011). Shallow vs. deep sum-product networks. In Proceedings of the 25th Conference on Neural Information Processing Systems.

[6] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D. & Bengio, Y. (2010). Theano: A CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 30 - July 3, Austin, TX.