Intro to Learning in SD -1


1 Alessio Micheli Intro to Learning in SD -1 Alessio Micheli 1- Introduction to RecNN Apr 2018 DRAFT, please do not circulate! Dipartimento di Informatica Università di Pisa - Italy Computational Intelligence & Machine Learning Group 1

2 Alessio Micheli Learning in Structured Domain Plan in 2 lectures 1. Recurrent and Recursive Neural Networks Extensions of models for supervised and unsupervised learning in structured domains Motivation and examples The structured data RNN and RecNN Recursive Cascade Correlation and other recursive approaches Unsupervised recursive models 2. Moving to DPAG and graphs: the role of causality 2

3 3 Why structured data? Alessio Micheli Because data have relationships

4 4 Introduction: Motivation of ML for SD Alessio Micheli Most known ML methods are limited to flat and fixed forms of the data (vectors or sequences): fixed-length attribute-value vectors. Central: data representation. Graph: a very useful abstraction for real data. Labeled graphs = vector patterns + relationships; natural: for structured-domain richness; efficiency: repetitive nature inherent in the data. SD + ML = adaptive processing of structured information

5 Alessio Micheli 5 Introduction: Research Area General Aim: investigation of ML models for the adaptive processing of structured information (e.g. sequences, trees, graphs): Structured domain learning, Learning in Structured Domain, Relational Learning. Fractal tree: a recursive structure

6 Alessio Micheli Advancements from ML to SDL Learning in Structured Domains (SD) in Pisa/CIML: Pioneering since the 90s the development of: Theoretical analysis, New approaches, Applications, especially on the basis of recursive approaches. And for you? To build an advanced background for: Analysis/development of innovative models; Applications in the area of interdisciplinary projects (@ CIML). Practical: theses are possible for the design of new models/applications for the extension of the input domains toward: Extension of the applicative domain, Adaptivity and accuracy, Efficiency 6

7 Feedforward versus Recurrent (memento) Alessio Micheli Feedforward: direction: input → output. Recurrent neural networks: a different category of architecture, based on the addition of feedback (loop) connections in the network topology. The presence of self-loop connections provides the network with dynamical properties, retaining a memory of past computations in the model. This allows us to extend the representation capability of the model to the processing of sequences (and structured data). Recurrent neural networks will be the subject (further developed) of the ISPR/CNS courses (see later). 7

8 Alessio Micheli From flat to structured data Flat: vectors (as in the rest of AA1). Structured: sequences, trees, graphs, multi-relational data. (Figure: examples — strings, proteins, series/temporal streams l1 ... l5, small molecules, network data) 8

9 9 Alessio Micheli Introduction: Motivation of ML for SD - Examples Examples: Computer Science: AI: problem-solving techniques, theorem proving: controlling search on trees by heuristics; natural language processing: evaluation of parse trees; image analysis: structured representation of components; document processing: fuzzy evaluation of web documents. Natural Sciences: computational biology and chemistry: quite complex domains where managing relationships and structure is relevant to achieving suitable modeling of the solutions.

10 10 Example: logo recognition Alessio Micheli

11 Alessio Micheli 11 Example: Terms in 1 st order logic

12 Alessio Micheli 12 Example: language parsing (Figure: parse tree, with anchor and foot nodes, for the sentence "It has no bearing on our work force"; nodes S, VP, NP, PP, PRP, VBZ, DT, NN, IN, PRPS)

13 16 Example (graphs): Molecules Alessio Micheli A fundamental problem in Chemistry: correlate the chemical structure of molecules with their properties (e.g. physico-chemical properties, or biological activity of molecules) in order to be able to predict these properties for new molecules. Quantitative Structure-Property Relationship (QSPR); Quantitative Structure-Activity Relationship (QSAR). Property/Activity = T(Structure); T: Structure → Property value (regression) or Toxic yes/no (classification). Molecules are not vectors! Molecules can be more naturally represented by varying-size structures. Can we predict directly from structures? QSPR: correlate the chemical structure of molecules with their properties

14 17 Introduction: Adaptive processing of SD Alessio Micheli The problem: there has been no systematic way to extract features or metric relations between examples for SD. A representation learning instance (extended to SD)! Goal: to learn a mapping between a structured information domain (SD) and a discrete or continuous space (transduction). Property/Activity = T(Structure); T: Structure → Property value (regression) or Toxic yes/no (classification). Note: we have a set of trees/graphs in input. Task: for instance, classify each graph [starting from a training set of known pairs (graph_i, target_i)]

15 Alessio Micheli Introduction: Learning Model for SD What we mean by adaptive processing of SD: extraction of the topological information directly from data; H has to be able to represent hierarchical relationships; an adaptive measure of similarity on structures + an apt learning rule; efficient handling of structure variability. Classical: efficient learning, good generalization performance, knowledge extraction capabilities 18

16 19 SD Learning scenario Alessio Micheli
Data Type                                      | Symbolic                       | Connectionist                  | Probabilistic
STATIC (attribute/value, real vectors)         | Rule induction, Decision trees | NN, SVM                        | Mixture models, Naïve Bayes
SEQUENTIAL (serially ordered entities)         | Learning finite state automata | Recurrent NN                   | Hidden Markov Models
STRUCTURAL (relations among domain variables)  | Inductive logic programming    | Recursive NN (Kernels for SD)  | Recursive Markov models

17 20 Preview: The RecNN idea Alessio Micheli Recursive NN: Recursive and parametric realization of the transduction function Adaptive by Neural Networks Fractal tree: a recursive structure

18 Alessio Micheli Structures: just use a sequence? Can we process structures as if they were sequences? E.g. any tree can be converted into a sequence (no information loss), but: Sequences may be long: the number of vertices is exponential w.r.t. the height of the tree (conversely, tree paths are logarithmic in the number of nodes, so the dependencies along the tree are much shorter). Dependencies are blurred out (arbitrarily, depending on the visit). Example tree with vertices i, g, h, d, e, f, a, b, c serialized as i(g(d(a,b),e(c)),h(f)): children are far, distant relatives are close. 21
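As a small aside, here is a minimal sketch (with an assumed nested-tuple tree encoding and a hypothetical helper name, not from the lecture) of the serialization above: the tree is converted into a string without information loss, yet vertices that are adjacent in the tree can end up far apart in the sequence.

```python
# Sketch: serialize a tree (label, [children]) into the bracketed string used on the slide.
def to_sequence(node):
    label, children = node
    if not children:
        return label
    return label + "(" + ",".join(to_sequence(c) for c in children) + ")"

tree = ("i", [("g", [("d", [("a", []), ("b", [])]),
                     ("e", [("c", [])])]),
              ("h", [("f", [])])])
print(to_sequence(tree))  # i(g(d(a,b),e(c)),h(f))
# In the tree, 'i' and 'h' are parent and child (path length 1); in the string,
# 'h' only appears after the whole serialization of the first subtree of 'i'.
```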

19 Alessio Micheli Learning in Structured Domain Plan in 2 lectures 1. Recurrent and Recursive Neural Networks Extensions of models for supervised and unsupervised learning in structured domains Motivation and examples The structured data RNN and RecNN Recursive Cascade Correlation and other recursive approaches Unsupervised recursive models 2. Moving to DPAG and graphs: the role of causality 22

20 Structured Domains l(v) ∈ L: set of attribute vectors. Structure G: vertex labels L + topology (skeleton of G). Sequences, trees, DOAGs/DPAGs, graphs: Alessio Micheli G: labeled directed ordered/positional acyclic graphs with super-source. A total order (or the position) on the edges leaving each vertex. Super-source: a vertex s such that every vertex can be reached by a directed path starting from s. Bounded out-degree and in-degree (the number of edges leaving and entering a vertex v). DPAG: superclass of the DOAGs: besides ordering, a distinct positive integer can be associated to each edge, allowing some positions to be absent. Trees: labeled rooted ordered trees, or positional (K-ary) trees. Super-source: the root of the tree. Binary tree (K=2) mostly used in this lecture 23

21 24 K-ary Trees Alessio Micheli k-ary trees (trees in the following) are rooted positional trees with finite out-degree k. Given a node v in the tree T ∈ G: The children of v are the node successors of v, each with a position j = 1, ..., k; k is the maximum out-degree over G, i.e. the maximum number of children for each node; L(v) is the input label associated with v, and L_i(v) is the i-th element of the label; the subtree T^(j) is the tree rooted at the j-th child of v. (Figure: a tree with root label L and subtrees T^(1), ..., T^(k))
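A minimal sketch of the k-ary tree data structure assumed throughout these slides (the class and field names are illustrative, not from the lecture): each node carries a label vector and up to k positional children.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """A node of a k-ary (positional) tree: a label vector plus up to k ordered children."""
    label: List[float]                                               # L(v), the input label of the node
    children: List[Optional["Node"]] = field(default_factory=list)  # position j = 1..k; None = absent

# A small binary tree (k = 2): a root with two leaf children.
leaf_a = Node(label=[1.0, 0.0])
leaf_b = Node(label=[0.0, 1.0])
root = Node(label=[0.5, 0.5], children=[leaf_a, leaf_b])
```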

22 Alessio Micheli Data Domains G We consider sets of DPAGs: labeled directed positional acyclic graphs with super-source, bounded in-degree and out-degree (k). They include as sub-classes: DOAGs, k-ary trees, sequences, vectors. Notations: ch[v] is the set of successors of v; ch_j[v] is the j-th child of the node v 25

23 Alessio Micheli Structured Data: Examples Labeled sequences, trees, DOAGs/DPAGs, graphs. (Figure: a single labeled vertex; a sequence l1 ... l5; a rooted tree; a DPAG with super-source; an undirected graph) DPAG: labeled directed positional acyclic graphs with super-source, bounded in-degree and out-degree (k). 26

24 27 Trees and DPAGs Alessio Micheli (Figure: the same labeled structure drawn as a DPAG, with b a shared node, and as a tree) Exercise after the lecture SD-2

25 Alessio Micheli Positional versus Ordered DPAG: for each vertex v in vert(G), an injective function S_v : edg(v) → [1, 2, ..., K] is defined on the edges leaving from v. in_set(v) = {u | u ∈ V and u → v} (predecessors); out_set(v) = {u | u ∈ V and v → u} (successors) 28

26 Alessio Micheli Learning in Structured Domain Plan in 2 lectures 1. Recurrent and Recursive Neural Networks Extensions of models for supervised and unsupervised learning in structured domains Motivation and examples The structured data RNN and RecNN Recursive Cascade Correlation and other recursive approaches Unsupervised recursive models 2. Moving to DPAG and graphs: the role of causality 29

27 Alessio Micheli Neural Computing Approach NN are universal approximators (theorem of Cybenko). NN can learn from examples (automatic inference). NN can deal with noise and incomplete data. NN can handle continuous real and discrete data. Simple gradient descent technique for training. Successful model in ML due to the flexibility in applications. Domain → Neural Network: Static fixed-dim patterns (vectors, records, ...) → Feedforward; Dynamical patterns (temporal sequences, ...) → Recurrent; Structured patterns (DPAGs, trees, ...) → Recursive 30

28 31 Recurrent Neural Networks (resume) Alessio Micheli Up to now: internal state
x(t) = τ(x(t−1), l(t))
y(t) = g(x(t), l(t))
(Figure: recurrent unit with state x fed back through the delay operator q⁻¹, processing a sequence l1 l2 l3 l4 l5)
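A minimal sketch of the recurrent state transition recalled above, x(t) = τ(x(t−1), l(t)); the tanh non-linearity, the weight names and the dimensions are illustrative assumptions.

```python
import numpy as np

def recurrent_encode(labels, W_in, W_hat, theta):
    """Iterate x(t) = tanh(W_in l(t) + W_hat x(t-1) + theta) over a sequence of labels."""
    x = np.zeros(W_hat.shape[0])                   # x(0) = 0, the initial state
    for l in labels:                               # one step per element l(1), ..., l(T)
        x = np.tanh(W_in @ l + W_hat @ x + theta)
    return x                                       # the final state encodes the whole sequence

rng = np.random.default_rng(0)
n, m = 3, 4                                        # label size and state size (assumed)
W_in, W_hat, theta = rng.normal(size=(m, n)), rng.normal(size=(m, m)), np.zeros(m)
seq = [rng.normal(size=n) for _ in range(5)]       # l1 ... l5
print(recurrent_encode(seq, W_in, W_hat, theta))
```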

29 Alessio Micheli RNN training and properties resume BPTT/RTRL: see the CNS course. Unfolding (see ML lecture on RNN): Back-Propagation Through Time: backprop on this unrolled version. Causality: a system is causal if the output at time t0 (or vertex v) only depends on inputs at time t ≤ t0 (depends only on v and its descendants); necessary and sufficient for an internal state. Stationarity: time invariance, the state transition function is independent of the node v (the same at any time) 32

30 Recursive Neural Networks (overview) Alessio Micheli Now:
x(v) = τ(x(ch[v]), l(v))
y(v) = g(x(v), l(v))
x(ch[v]) = [x(ch_1[v]), ..., x(ch_k[v])]
State transition system. (Figure: a tree with vertices f, d, e, a, b, c and the recursive unit with input label l and feedbacks q_1⁻¹, ..., q_k⁻¹) # feedbacks = # children 33

31 Alessio Micheli Generalized Shift Operators Standard shift operator (time): q⁻¹ S_t = S_{t−1}. Generalized shift operators (structure): q_j⁻¹ G_v = G_{ch_j[v]}, where ch_j[v] is the j-th child of v. RecNN with q:
x(v) = τ(l(v), q_1⁻¹ x(v), ..., q_k⁻¹ x(v))
y(v) = g(x(v))
q_j⁻¹ x_i(v) = x_i(ch_j[v]) if ch_j[v] ≠ nil, x_0 otherwise
Used to extend the states to the children of the vertices of the tree 34

32 35 Overview of Domains Alessio Micheli Synthesis formula (instantiation of domains for the RecNN): transduction T: G → IR, with T = g ∘ τ_E, where τ_E: G → IR^m and g: IR^m → IR. G: e.g. labeled directed positional acyclic graphs with super-source, or labeled k-ary trees. IR^n: label space. IR^m: descriptor (or code) space. g: output function

33 36 As a transduction by recursion Alessio Micheli Key idea: a family of recursive functions for the processing of G. T_G : G → O is a functional graph transduction. T_G = g ∘ τ_E (encoding and output functions). Link between the state transition system and the recursive transduction: x(v) = τ_E(G_v), the state x(v) ∈ X associated to each v is the encoding of the subgraph rooted in v

34 37 Recursive Processing Alessio Micheli Recursive definition of τ_E (encoding function), for G with super-source s and subgraphs G^(1), ..., G^(k):
τ_E(G) = x_0 (= 0) if G is empty
τ_E(G) = τ_NN(L(root), τ_E(G^(1)), ..., τ_E(G^(k))) otherwise
τ_E: systematic visit of G; it guides the application of τ_NN to each node of the tree (bottom-up). Causality and stationarity assumptions.
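A minimal sketch of the recursive encoding above: τ_E visits the tree bottom-up and applies the same τ_NN at every node (stationarity), using x0 = 0 for missing subtrees. It reuses the Node class sketched earlier; realizing τ_NN as a single recursive layer with weights W, Ŵ_j, θ is an assumption consistent with the later slides.

```python
import numpy as np

def tau_NN(label, child_codes, W, W_hats, theta):
    """x(v) = tanh(W l(v) + sum_j W_hat_j x(ch_j[v]) + theta)."""
    s = W @ np.asarray(label) + theta
    for W_hat_j, x_j in zip(W_hats, child_codes):
        s = s + W_hat_j @ x_j
    return np.tanh(s)

def tau_E(node, k, W, W_hats, theta, m):
    """Recursive encoding: the empty tree maps to x0 = 0; otherwise recurse on the children first."""
    if node is None:
        return np.zeros(m)                             # x0 for missing subtrees
    kids = list(node.children) + [None] * (k - len(node.children))
    child_codes = [tau_E(c, k, W, W_hats, theta, m) for c in kids]   # bottom-up visit
    return tau_NN(node.label, child_codes, W, W_hats, theta)

rng = np.random.default_rng(0)
n, m, k = 2, 4, 2
W, W_hats, theta = rng.normal(size=(m, n)), [rng.normal(size=(m, m)) for _ in range(k)], np.zeros(m)
# code = tau_E(root, k, W, W_hats, theta, m)   # 'root' is the Node built in the earlier sketch
```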

35 38 Properties (I) Alessio Micheli Transduction: not algebraic: T_G depends not only on the information associated with the current node, but also on the context information encoded for the subgraphs. Extension of causality and stationarity as defined for sequence processing: A structural transduction is Causal if its output for a vertex v only depends on v and its descendants (induced subgraphs); Stationary if the state transition function (τ_NN) is independent of the vertex v. Recurrent/recursive NN transductions admit a recursive state representation with such properties. Adaptivity (NN learning alg.) + universal approximation over the tree domain [Hammer]

36 Properties (II) Alessio Micheli T_G is IO-isomorphic if the input graph and T_G(G) have the same skeleton (graph after removing labels). Supersource transductions map the input DOAG to a scalar value; also known as Structure-to-Element transductions. (Figure: an input DOAG mapped to an output DOAG with the same skeleton, and an input DOAG mapped to a scalar value) 39

37 40 Properties (III) Alessio Micheli IO-isomorphic causal transduction T_G

38 Alessio Micheli Unfolding and Enc. Network We will see the RecNN data flow process from two points of view: 1. Unfolding by walking on the structures (stationarity), according to the causal assumption (inverse topological order*): the model visits the structures. 2. Building an encoding network isomorphic to the input structure (same skeleton, inverted arrows, again with stationarity and the causal assumption): we build a different encoding network for each input structure. * see next slide 41

39 Alessio Micheli Topological Order A linear ordering of its nodes s.t. each node comes before all nodes to which it has edges. Every DAG has at least one topological sort, and may have many. A numbering of the vertices of a directed acyclic graph such that every edge from a vertex numbered i to a vertex numbered j satisfies i<j. According to a Partial order For RNN: Inverse topological order 42
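A minimal sketch of computing an inverse topological order for a DAG (every vertex listed after all the vertices it points to), which is the visit order used by the recursive models; the edge-list representation is an assumption.

```python
from collections import defaultdict

def inverse_topological_order(edges, vertices):
    """Return the vertices so that each one appears after all of its children (DAG assumed)."""
    children = defaultdict(list)
    for u, v in edges:                      # directed edge u -> v
        children[u].append(v)
    order, visited = [], set()
    def visit(u):
        if u in visited:
            return
        visited.add(u)
        for v in children[u]:
            visit(v)                        # children first ...
        order.append(u)                     # ... then the vertex itself
    for u in vertices:
        visit(u)
    return order

print(inverse_topological_order([("i", "g"), ("i", "h"), ("g", "d"), ("g", "e")],
                                ["i", "g", "h", "d", "e"]))
# ['d', 'e', 'g', 'h', 'i'] -- leaves first, super-source last
```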

40 44 Unfolding & Encoding Process: Unfolding view (1) Alessio Micheli Unfolding the encoding process through structures: a bottom-up process for visiting (Figure: tree with vertices f, d, e, a, b, c). We will see later how to do it with τ_NN (and hence by NN) for each step: building an encoding network

41 45 RecNN over different structures: unfolding 2 Alessio Micheli Examples on different trees for chemical compounds (figure: tree fragments with labels OH, O, C, CH2, CH3, the Start states at the leaves, and one output for each tree). Unfolding through structures: the same process applies to all the vertices of a tree and to all the trees in the data set

42 Alessio Micheli Unfolding (3) & Enc. Net. For Recurrent and Recursive NN (i.e. recursive unfolding for two different structures): τ_NN for each step (with weight sharing) yields the encoding network. Adaptive encoding via the free parameters of τ_NN. τ_NN (and weight) sharing: the units are the same for all the vertices of a tree and for all the trees in the data set!!! 46

43 47 RecNN: going to details for the domains Alessio Micheli X = IR^m: continuous state (code) space (encoded subgraph space). L = IR^n: vertex label space. O = IR^z or {0,1}^z. τ = τ_NN : IR^n × IR^m × ... × IR^m (k times) → IR^m. g: output function. x_0 = 0. τ and g are realized by NN with free parameters W; the RecNN realizes T_G = g ∘ τ_E

44 Alessio Micheli 48 Realization of τ_NN
τ_NN : IR^n × IR^m × ... × IR^m (k times) → IR^m
x = τ_NN(l, x^(1), ..., x^(k)) = σ( W l + Σ_{j=1}^{k} Ŵ_j x^(j) + θ )
Free parameters: W, Ŵ_j, θ. Recursive neuron (τ_NN with m = 1): processes a vertex; # feedbacks = # children (max k)

45 49 Equation for a single unit Alessio Micheli Single unit. X: generic encoded subgraph space; x is the output of a single RecNN unit:
X = τ_NN(L(v), X^(1), ..., X^(k)) = σ( Σ_{i=1}^{n} w_i L_i(v) + Σ_{j=1}^{k} ŵ_j X^(j) + θ ) = σ( W · L(v) + Σ_{j=1}^{k} ŵ_j X^(j) + θ )
where σ is a non-linear sigmoidal function, L(v) is the label of v, θ is the bias term, W is the weight vector associated with the label space, X^(j) is the code for the j-th subgraph (subtree) of v, and ŵ_j is the weight associated with the j-th subgraph space.
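A minimal sketch of the single recursive unit above, x = σ(Σ_i w_i L_i(v) + Σ_j ŵ_j X^(j) + θ); the logistic sigmoid and the example numbers are illustrative assumptions.

```python
import numpy as np

def recursive_unit(label, child_codes, w, w_hat, theta):
    """Single recursive unit: x = sigma(sum_i w_i L_i(v) + sum_j w_hat_j X^(j) + theta)."""
    sigma = lambda a: 1.0 / (1.0 + np.exp(-a))     # logistic sigmoid (assumed non-linearity)
    return sigma(np.dot(w, label) + np.dot(w_hat, child_codes) + theta)

# A vertex with a 3-dimensional label and k = 2 children whose codes were already computed bottom-up.
x = recursive_unit(label=[0.2, -1.0, 0.5],
                   child_codes=[0.7, 0.1],         # X^(1), X^(2) from the subtrees
                   w=[0.3, 0.3, 0.3], w_hat=[0.5, -0.2], theta=0.1)
print(x)
```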

46 50 Fully-connected RNN Alessio Micheli (Figure: standard neurons realize the output function g; recursive neurons realize τ_E; the state x is copied through q_1⁻¹ and q_2⁻¹ according to the graph topology; inputs are the vertex labels)

47 52 In details: Encoding Network (I) Alessio Micheli Start here! τ (and weight) sharing: the units are the same for all the vertices of a tree and for all the trees in the data set!!!

48 53 In details: Encoding Network (II) Alessio Micheli

49 54 Recap: State transition system Alessio Micheli
x(v) = τ(l(v), x(ch[v]))
y(v) = g(l(v), x(v))
x(ch[v]) = [x(ch_1[v]), ..., x(ch_k[v])]
With the generalized shift operator:
x(v) = τ(l(v), q_1⁻¹ x(v), ..., q_k⁻¹ x(v))
y(v) = g(x(v))
q_j⁻¹ x_i(v) = x_i(ch_j[v]) if ch_j[v] ≠ nil, x_0 otherwise
x(v) = τ_E(G_v); X: state (code) space. (Figure: graphical model of τ/g with input label l and feedbacks q_1⁻¹, ..., q_k⁻¹)

50 Alessio Micheli Recap: Unfolding 4. A different view by graphical models of the Encoding Networks for sequences and structures 55 Note that the use of graphical models makes uniform the cases of NN and generative approaches

51 56 Alessio Micheli Recent Applications: Recap (from ML): RecNN for NLP, recent exploitation Currently wide and successful application in NLP (e.g. by the Stanford NLP group), started with the Recursive NN works of Socher et al. (see the bibliography slides). Shown: the effectiveness of Recursive NN applied to tree representations of language (and image) data and tasks. E.g. Sentiment Treebank: sentiment labels (movie reviews) for 215,154 phrases in the parse trees of 11,855 sentences. Recursive NN pushes the state of the art in single-sentence positive/negative classification from 80% up to 85.4%.

52 Recap (from ML): Examples Alessio Micheli Polarity grade Human-gram annotation Other instances in the dataset 57

53 Alessio Micheli Learning System Supersource transductions: y = g(x(s)). IO-isomorphic transductions: the output shares the same skeleton as the input graph; an output is emitted for each vertex v ∈ G. Parametric: T_G depends on tunable parameters W. Adaptive: a learning algorithm is used to fit W to the data. H = {h | h(G) ≈ T_G(G), G ∈ 𝒢}. Tasks. Supervised: target values are known, (G, f(G)) (or f(v)). Unsupervised: TR = {G | G ∈ 𝒢} 58

54 Alessio Micheli RNN Learning Algorithms Backpropagation through structure: extension of BPTT, Goller & Küchler (1996). Simple to understand using the graphical formalism (backprop + weight sharing on the unfolded net): the notation is adapted to the case of deltas coming from the parents. RTRL: Sperduti & Starita (1997). Equations: see Chapter 19 in Kolen, Kremer, A Field Guide to Dynamical Recurrent Networks, IEEE Press 2001. RCC family based: next slides 59

55 Alessio Micheli Learning in Structured Domain Plan in 2 lectures 1. Recurrent and Recursive Neural Networks Extensions of model for supervised and unsupervised learning in structured domains Motivation and examples The structured data RNN and RecNN Recursive Cascade Correlation and other recursive approaches Unsupervised recursive models 2. Moving to DPAG and graphs: the role of causality 60

56 Alessio Micheli RCC (I) Architecture of Cascade Correlation for Structure (Recursive Cascade Correlation - RCC) We realize RNN by RCC: constructive approach m is automatically computed by the training algorithm. 61

57 62 RCC (II) Alessio Micheli Architecture of an RCC with 3 hidden units (m=3) and k=2. Recursive hidden units (shadowed) generate the code of the input graph (function τ_E). The hidden units are added to the network during the training. The box elements are used to store the output of the hidden units, i.e. the code x_i^(j) that represents the context according to the graph topology. The output unit realizes the function g and produces the final prediction value. E.g. A.M. Bianucci, A. Micheli, A. Sperduti, A. Starita. Application of Cascade Correlation Networks for Structures to Chemistry, Applied Intelligence Journal, 12 (1/2), 2000

58 Alessio Micheli RCC (III) Learning Gradient descent: interleaving LMS (output) and maximization of the correlation between the new hidden unit and the residual error. Main difference with CC: calculation of the following derivatives by recurrent equations (here for a hidden unit h with output x_h, label weights w_hi and recursive weights ŵ_hh^(j)):
∂x_h(v)/∂w_hi = f'(·) ( l_i(v) + Σ_{j=1}^{k} ŵ_hh^(j) ∂x_h(ch_j[v])/∂w_hi )
∂x_h(v)/∂ŵ_hh^(j') = f'(·) ( x_h(ch_{j'}[v]) + Σ_{j=1}^{k} ŵ_hh^(j) ∂x_h(ch_j[v])/∂ŵ_hh^(j') )
Note: a simplification (of the sum over the other units) due to the architecture compared to full RTRL! But, compared to the recurrent case, a summation over the children appears! 63

59 Alessio Micheli TreeESN Extend the applicability of the RC/ESN approach to tree structured data. Extremely efficient way of modeling RecNNs. Architectural and experimental performance baseline for trained RecNN models. The same recursive process of RecNN made by efficient RC approaches (figure: reservoir weights w1, w2 shared over the tree). C. Gallicchio, A. Micheli. Neurocomputing. More in a NEXT LECTURE 64

60 HTMM (2012) Alessio Micheli E.g. Bottom-up Hidden Tree Markov Models extend HMM to trees, exploiting the recursive approach as well. Generative process from the leaves to the root. Markov assumption (conditional dependence): Q_{ch_1(u)}, ..., Q_{ch_K(u)} → Q_u, a children-to-parent hidden state transition P(Q_u | Q_{ch_1(u)}, ..., Q_{ch_K(u)}). Issue: how to decompose this joint state transition? (see [9]). Bayesian network: unfolding the graphical model over the input trees; y: observed elements; Q: hidden state variables with discrete values. More in another LECTURE. Bacciu, Micheli, Sperduti. IEEE TNNLS

61 Alessio Micheli Learning in Structured Domain Plan in 2 lectures 1. Recurrent and Recursive Neural Networks Extensions of model for supervised and unsupervised learning in structured domains Motivation and examples The structured data RNN and RecNN Recursive Cascade Correlation and other recursive approaches Unsupervised recursive models Toward next lecture 2. Moving to DPAG and graphs: the role of causality 66

62 Unsupervised learning in SD A general framework extending SOM approach B. Hammer, A. Micheli, M. Strickert, A. Sperduti. A General Framework for Unsupervised Processing of Structured Data, Neurocomputing (Elsevier Science) Volume 57, Pages 3-35, March B. Hammer, A. Micheli, A. Sperduti, M. Strickert. Recursive Self-organizing Network Models. Neural Networks, Elsevier Science. Volume 17, Issues 8-9, Pages , October-November 2004.

63 Self-Organizing Map Dip. Informatica University of Pisa Unsupervised learning NN Clustering (find natural groupings in a set of data) data compression projection and exploration SOM: learn a map from input space to a lattice of neural-units topology-preserving map (neighboring units respond to similar input pattern) visualization properties of the SOM output feature map (2D) Neural map Alessio Micheli 68

64 SOM Dip. Informatica University of Pisa SOM Units in the map layer compete with each other to represent the current input pattern. The unit whose weight vector is closest to the input is the winner. Competitive stage: Winner units = units with minimal d(w,x) Cooperative stage: adjust w of winner and neighborhood units Alessio Micheli 69
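A minimal sketch of the two stages above for a standard (flat) SOM: the competitive stage selects the unit with minimal distance d(w, x), the cooperative stage moves the winner and its lattice neighbours toward the input. The Gaussian neighbourhood and the learning rate are illustrative choices.

```python
import numpy as np

def som_step(W, grid, x, eta=0.1, sigma=1.0):
    """One SOM update: competitive stage (find the winner) + cooperative stage (neighbourhood update).

    W: (N, d) unit weight vectors; grid: (N, 2) lattice coordinates of the units; x: (d,) input."""
    dists = np.linalg.norm(W - x, axis=1)               # competitive stage: d(w, x) for every unit
    win = int(np.argmin(dists))                          # winner = unit with minimal distance
    lattice_d2 = np.sum((grid - grid[win]) ** 2, axis=1)
    h = np.exp(-lattice_d2 / (2 * sigma ** 2))           # neighbourhood function on the 2D map
    W += eta * h[:, None] * (x - W)                      # cooperative stage: move weights toward x
    return win

rng = np.random.default_rng(0)
grid = np.array([(i, j) for i in range(4) for j in range(4)], dtype=float)  # 4x4 map
W = rng.normal(size=(16, 3))                             # 3-dimensional inputs (assumed)
winner = som_step(W, grid, rng.normal(size=3))
```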

65 Similarity Based Approaches Dip. Informatica University of Pisa Define an ad hoc similarity measure on structured data (sequences, trees, etc.), then apply a known unsupervised technique. Examples: SOM for non-vectorial data (e.g. Kohonen et al.): prototypes: mean vs generalized median; edit distance, graph matching, etc. The problem: there has been no systematic way to extract features or metric relations between examples for SD (general for every task) Alessio Micheli 70

66 Basic Idea Dip. Informatica University of Pisa Transfer recursive idea to unsupervised learning No prior metric/pre-processing (but still bias!!!) Evolution of the similarity measure through recursive comparison of sub-structures Iteratively compared via bottom-up encoding process c a b Alessio Micheli 72

67 SOM for Structured Domains Dip. Informatica University of Pisa Motivations (SOM-SD): Unsupervised learning for SD; a new model for SD based on the SOM; visualization properties of the SOM output feature map (2D) (e.g. for SAR analysis). SOM: learn a map from the input space to a lattice of neural units. τ_NN realized by a SOM: discrete output display space (2D coordinates) = discrete code space! The SOM is applied recursively to each node of the graph. A vertex is represented by its label + the subgraph code value (coordinates of the winning unit for the subgraph) Alessio Micheli 73

68 SOM-SD Bottom-Up Processing Dip. Informatica University of Pisa (Figure, step 1: a leaf vertex is presented to the map as its label followed by void child coordinates, e.g. 1.3, -1, -1, -1, -1, and its winning unit is found) Alessio Micheli 74

69 SOM-SD Bottom-Up Processing Dip. Informatica University of Pisa (Figure, step 2: an internal vertex is presented as its label, e.g. 0.27, followed by the winner coordinates of its already-processed subtree, e.g. 2, 1, and void coordinates -1, -1 for the missing child) Alessio Micheli 75

70 SOM-SD Bottom-Up Processing Dip. Informatica University of Pisa (Figure, step 3: the root is presented as its label, e.g. 0.51, followed by the winner coordinates of its two subtrees, e.g. 0, 2 and 2, 1) Alessio Micheli 76
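A minimal sketch of how SOM-SD builds the input presented to the map during the bottom-up pass illustrated above: the node label is concatenated with the map coordinates of the winners already found for its subtrees, with a fixed "void" coordinate (here (−1, −1), matching the slides) for missing children. The helper name is hypothetical, for illustration only.

```python
import numpy as np

VOID = (-1.0, -1.0)                                     # coordinates used for a missing child

def somsd_input(label, child_winner_coords, k):
    """Concatenate the node label with the winner coordinates of its k (possibly missing) subtrees."""
    coords = list(child_winner_coords) + [VOID] * (k - len(child_winner_coords))
    return np.concatenate([np.atleast_1d(label)] + [np.asarray(c, dtype=float) for c in coords])

# Leaf (no children): the label padded with void coordinates, as in the first step of the figure.
leaf_vec = somsd_input(label=[1.3], child_winner_coords=[], k=2)                 # [1.3, -1, -1, -1, -1]
# Internal node: the label plus the winning-unit coordinates of its two already-processed subtrees.
node_vec = somsd_input(label=[0.51], child_winner_coords=[(0, 2), (2, 1)], k=2)  # [0.51, 0, 2, 2, 1]
print(leaf_vec, node_vec)
```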

71 A General Framework for Unsupervised Processing of SD Dip. Informatica University of Pisa Motivations of the General Framework: Exploiting the recursive approach for unsupervised learning in SD; specific aspects (the internal representation) make the concept of similarity measure explicit in the formulation. Combining different existing SOM-based approaches: a set of unrelated models sharing common characteristics. Basis for new implementations, new models and learning methods (e.g. extension to Neural Gas and VQ). Uniform evaluation of general properties (prior to learning), e.g. w.r.t. noise tolerance and in-principle capacity of representation. Alessio Micheli 77

72 A General Framework for Unsupervised Processing of SD Dip. Informatica University of Pisa Aims: Uniform formulation within the recursive dynamics of Models Learning methods (Hebbian / based on energy function) Theoretical properties: investigate fundamental mathematical properties Key point: formalize how structures are internally represented Alessio Micheli 78

73 GF for Various Models Dip. Informatica University of Pisa Models: RSOM: Recursive SOM - originally for sequences (extended); TKM: Temporal Kohonen Map - originally for sequences; SOM-SD: sequences and trees. All models share the basic recursive dynamics; the models differ in the internal representation (R) of structures. Explicitly define a mapping rep: IR^N → R, from map activity profiles to the formal representation of structures. Alessio Micheli 79

74 Internal Representation Dip. Informatica University of Pisa SOM-SD: (i, j) coordinate vector of the winner. RSOM: all unit activations (exp). TKM: single unit activation (selecting for each unit its previous activation). Alessio Micheli 80

75 GF for Unsupervised Processing of SD (III) Dip. Informatica University of Pisa Objective: encode a tree a(t_1, t_2) with root label a and subtrees t_1, t_2. Method: compare each unit, with weights (W_L, W_1, W_2), with the label a and with rep(t_1), rep(t_2). Alessio Micheli 81

76 Def. General Framework (GSOMSD) Dip. Informatica University of Pisa Each approach is obtained by specifying: the similarity measure on labels (d_L); the similarity measure on the formal representation of trees (d_R); the set of units with weights W : N → (W_L, W_1, W_2), N being the number of units; the function rep: IR^N → R. Activation of unit n for a(t_1, t_2) (i.e. its distance from the input):
d̃(a(t_1, t_2), n) = d_L(a, W_L(n)) + d_R(R_1, W_1(n)) + d_R(R_2, W_2(n))
R_j = rep(d̃(t_j, n_1), ..., d̃(t_j, n_N)) if t_j ≠ nil, the null representation otherwise
(activity on the map for the subtree t_j) Alessio Micheli 82
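A minimal sketch of the GSOMSD activation above: the distance of a unit from the input a(t1, t2) combines the label distance d_L with the representation distance d_R on the two subtree representations, and the encoding of a tree is obtained bottom-up through rep. The measures d_L, d_R, rep and the (label, left, right) tree encoding are left as parameters/assumptions, since specifying them recovers the individual models (SOM-SD, RSOM, TKM).

```python
def gsomsd_distance(a, R1, R2, unit_weights, d_L, d_R):
    """d~(a(t1, t2), n) = d_L(a, W_L(n)) + d_R(R1, W_1(n)) + d_R(R2, W_2(n))."""
    W_L, W_1, W_2 = unit_weights                 # label weights and the two representation weights
    return d_L(a, W_L) + d_R(R1, W_1) + d_R(R2, W_2)

def encode_tree(tree, units, rep, d_L, d_R, null_rep):
    """R for a binary tree: rep of the vector of unit activations, computed bottom-up."""
    if tree is None:
        return null_rep                          # representation of the empty subtree (nil)
    a, t1, t2 = tree                             # assumed (label, left, right) encoding
    R1 = encode_tree(t1, units, rep, d_L, d_R, null_rep)
    R2 = encode_tree(t2, units, rep, d_L, d_R, null_rep)
    activations = [gsomsd_distance(a, R1, R2, w, d_L, d_R) for w in units]
    return rep(activations)
```

Choosing rep to return the coordinates of the winning unit recovers SOM-SD; applying an exponential to all the activations recovers RSOM (next slide).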

77 GF for Unsupervised Processing of SD (IV) Dip. Informatica University of Pisa SOM-SD: rep(x_1, ..., x_N) outputs the index of the winner. RSOM: rep(x_1, ..., x_N) = (exp(−x_1), ..., exp(−x_N)). TKM: rep is the identity, and W_1 for unit n_i is the i-th unit vector. Graph encoding: τ_E(t) = rep(d̃(t, n_1), ..., d̃(t, n_N)) Alessio Micheli 83

78 Hebbian Learning Dip. Informatica University of Pisa A synaptic weight is increased with the simultaneous occurrence of presynaptic and postsynaptic activities: iteratively making the weights of the neuron that fires in response to an input x more similar to x. + Simple. - Investigation of the learning behavior is often difficult. Objectives of learning can be expressed in terms of a cost function: for a flat domain (input vectors), most unsupervised Hebbian learning can be derived as gradient descent of an appropriate cost function. Kohonen used a simple geometric computation for the Hebb-like rule: W(t+1) = W(t) + η(t) h(t) [x − W(t)] Alessio Micheli 84

79 General Framework for Unsupervised Learning Dip. Informatica University of Pisa Learning (in all the models): Hebbian learning: the weights W(winner) are iteratively moved toward the current inputs (a, R_1, R_2); soft-max adaptive rule via the neighborhood function on the SOM lattice. Learning based on an energy function E (the objectives of learning can be expressed in terms of a cost function): derive the rule as stochastic gradient descent. General form, where changing f we obtain different learning algorithms, e.g. VQ for SD and the extension to Neural Gas:
E = Σ_t f( d̃(t, N) )
(for VQ, f picks the squared winner distance d̃(t, n_{j(t)})²) Alessio Micheli 85

80 Learning: main general result Dip. Informatica University of Pisa For a flat domain (input vectors), most unsupervised Hebbian training rules can be derived as gradient descent of an appropriate cost function. Computing the gradient for VQ, SOM, NG in the SD case, we found that Hebbian learning is only an (efficient) approximation to the precise gradient descent of the cost function E = Σ_t f( d̃(t, N) ) Alessio Micheli 86

81 Theoretical Investigations (examples) Dip. Informatica University of Pisa Properties of rep for SD: efficiency compactness noise tolerance (rep(x) on d R ) (nice for SOM-SD) representational capability (cardinality of R) preservation of similarity (topological map) locality (TKM) / globality (SOM-SD, RSOM) of the computation: locality limits the recognition of structures Key issue: choice of the internal representation rep Uniform evaluation of methods prior to learning Alessio Micheli 88

82 Alessio Micheli Learning in Structured Domain Plan in 2 lectures 1. Recurrent and Recursive Neural Networks Extensions of models for supervised and unsupervised learning in structured domains Motivation and examples The structured data RNN and RecNN Recursive Cascade Correlation and other recursive approaches Unsupervised recursive models Toward next lecture 2. Moving to DPAG and graphs: the role of causality 89

83 Alessio Micheli Analysis (I) Computational properties: approximation capability. Suitable solutions for: A structured form of computation: compositionality tackles variability. Parsimony: weight sharing reduces the number of parameters. Adaptive transduction: BPTS, RTRL, ... allowing adaptive representation of SD, handling of variability, a distributed code of the structure (X), a state (code) space of constant size (independent of structure size/depth) 90

84 Analysis (II): Assumptions and Open Problems Alessio Micheli Inherited from time sequence processing: Stationarity: an effective solution for parsimony without reducing the expressive power relevant for SD. Causality: affects the computational power! Learning task: supervised and unsupervised. Is a uniform approach possible, allowing the investigation of fundamental mathematical properties? Other approaches. Assessments (applications): proof-of-the-principle. New methods and results 91

85 Alessio Micheli Graphs by NN? For Graphs by NN: see next lecture! Following a journey through the causality assumption! d b c a a How to deal with cycles and causality? 92

86 MODELS panorama for SD (examples) Alessio Micheli Flat data: standard ML models. Sequences: Recurrent NN/ESN, HMM, kernels for strings. Trees: Recursive NN, TreeESN, HTMM, tree kernels. DPAGs: CRCC. Graphs: GNN/GraphESN, NN4G, graph kernels, SRL. See references for the models in the bibliography slides (later) 93

87 Alessio Micheli Bibliography: aims Different parts in the following: Basic/Fundamentals; * Possible topics for seminars. May be useful also for future studies. Many topics can be the subject of study and development. Many, many works in the literature (new ones arrive continuously)! Many possible topics on demand and possible theses. More bibliography on demand: micheli@di.unipi.it 94

88 95 Bibliography (Basic, origins of RecNN) Alessio Micheli RNN A. Sperduti, A. Starita. Supervised Neural Networks for the Classification of Structures, IEEE Transactions on Neural Networks. Vol. 8, n. 3, 1997. P. Frasconi, M. Gori, and A. Sperduti, A General Framework for Adaptive Processing of Data Structures, IEEE Transactions on Neural Networks. Vol. 9, No. 5, 1998. A.M. Bianucci, A. Micheli, A. Sperduti, A. Starita. Application of Cascade Correlation Networks for Structures to Chemistry, Applied Intelligence Journal (Kluwer Academic Publishers), Special Issue on "Neural Networks and Structured Knowledge", Vol. 12 (1/2), 2000. A. Micheli, A. Sperduti, A. Starita, A.M. Bianucci. A Novel Approach to QSPR/QSAR Based on Neural Networks for Structures, Chapter in Book: "Soft Computing Approaches in Chemistry", H. Cartwright, L. M. Sztandera, Eds., Springer-Verlag, Heidelberg, March 2003.

89 96 Bibliography: NN approaches-2 Alessio Micheli * UNSUPERVISED RecursiveNN B. Hammer, A. Micheli, M. Strickert, A. Sperduti. A General Framework for Unsupervised Processing of Structured Data, Neurocomputing (Elsevier Science), Volume 57, Pages 3-35, March. B. Hammer, A. Micheli, A. Sperduti, M. Strickert. Recursive Self-organizing Network Models. Neural Networks, Elsevier Science, Volume 17, Issues 8-9, October-November 2004. * TreeESN: efficient RecNN C. Gallicchio, A. Micheli. Tree Echo State Networks, Neurocomputing, volume 101. * HTMM: further developments (generative) D. Bacciu, A. Micheli and A. Sperduti. Compositional Generative Mapping for Tree-Structured Data - Part I: Bottom-Up Probabilistic Modeling of Trees, IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 12, 2012

90 97 Bibliography: RecNN applications (example) Alessio Micheli * NLP applications (that you can extend with recent instances, and relate them to the general RecNN framework present in this lecture and the basic RecNN bibliography references ) R. Socher, C.C. Lin, C. Manning, A.Y. Ng, Parsing natural scenes and natural language with recursive neural networks, Proceedings of the 28th international conference on machine learning (ICML-11) R. Socher, A. Perelygin, J.Y. Wu, J. Chuang, C.D. Manning, A.Y. Ng, C.P. Potts, Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages , Seattle, Washington, USA, October 2013

91 98 Bibliography: Next lecture Alessio Micheli * RecNN for DPAGs : how to extend the domain I A. Micheli, D. Sona, A. Sperduti. Contextual Processing of Structured Data by Recursive Cascade Correlation. IEEE Transactions on Neural Networks. Vol. 15, n. 6, Pages , November Hammer, A. Micheli, and A. Sperduti. Universal Approximation Capability of Cascade Correlation for Structures. Neural Computation. Vol. 17, No. 5, Pages , (C) 2005 MIT press. * NN for GRAPH DATA: how to extend the domain II * A. Micheli. Neural network for graphs: a contextual constructive approach, IEEE Transactions on Neural Networks, volume 20 (3), pag , doi: /TNN , C. Gallicchio, A. Micheli. Graph Echo State Networks, Proceedings of the International Joint Conference on Neural Networks (IJCNN), pages 1 8, F. Scarselli, M. Gori, A.C.Tsoi, M. Hagenbuchner, G. Monfardini. The graph neural network model, IEEE Transactions on Neural Networks, 20(1), pag , 2009.

92 Alessio Micheli DRAFT, please do not circulate! For information Alessio Micheli Dipartimento di Informatica Università di Pisa - Italy Computational Intelligence & Machine Learning Group


More information

D-Separation. b) the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, are in the set C.

D-Separation. b) the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, are in the set C. D-Separation Say: A, B, and C are non-intersecting subsets of nodes in a directed graph. A path from A to B is blocked by C if it contains a node such that either a) the arrows on the path meet either

More information

Subgraph Matching Using Graph Neural Network

Subgraph Matching Using Graph Neural Network Journal of Intelligent Learning Systems and Applications, 0,, -8 http://dxdoiorg/0/jilsa008 Published Online November 0 (http://wwwscirporg/journal/jilsa) GnanaJothi Raja Baskararaja, MeenaRani Sundaramoorthy

More information

Sparsity issues in self-organizing-maps for structures

Sparsity issues in self-organizing-maps for structures University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2011 Sparsity issues in self-organizing-maps for structures Markus Hagenbuchner

More information

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant

More information

This Talk. 1) Node embeddings. Map nodes to low-dimensional embeddings. 2) Graph neural networks. Deep learning architectures for graphstructured

This Talk. 1) Node embeddings. Map nodes to low-dimensional embeddings. 2) Graph neural networks. Deep learning architectures for graphstructured Representation Learning on Networks, snap.stanford.edu/proj/embeddings-www, WWW 2018 1 This Talk 1) Node embeddings Map nodes to low-dimensional embeddings. 2) Graph neural networks Deep learning architectures

More information

Context Encoding LSTM CS224N Course Project

Context Encoding LSTM CS224N Course Project Context Encoding LSTM CS224N Course Project Abhinav Rastogi arastogi@stanford.edu Supervised by - Samuel R. Bowman December 7, 2015 Abstract This project uses ideas from greedy transition based parsing

More information

18 October, 2013 MVA ENS Cachan. Lecture 6: Introduction to graphical models Iasonas Kokkinos

18 October, 2013 MVA ENS Cachan. Lecture 6: Introduction to graphical models Iasonas Kokkinos Machine Learning for Computer Vision 1 18 October, 2013 MVA ENS Cachan Lecture 6: Introduction to graphical models Iasonas Kokkinos Iasonas.kokkinos@ecp.fr Center for Visual Computing Ecole Centrale Paris

More information

Methods for Intelligent Systems

Methods for Intelligent Systems Methods for Intelligent Systems Lecture Notes on Clustering (II) Davide Eynard eynard@elet.polimi.it Department of Electronics and Information Politecnico di Milano Davide Eynard - Lecture Notes on Clustering

More information

LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS

LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS Neural Networks Classifier Introduction INPUT: classification data, i.e. it contains an classification (class) attribute. WE also say that the class

More information

CS229 Final Project: Predicting Expected Response Times

CS229 Final Project: Predicting Expected  Response Times CS229 Final Project: Predicting Expected Email Response Times Laura Cruz-Albrecht (lcruzalb), Kevin Khieu (kkhieu) December 15, 2017 1 Introduction Each day, countless emails are sent out, yet the time

More information

The Un-normalized Graph p-laplacian based Semi-supervised Learning Method and Speech Recognition Problem

The Un-normalized Graph p-laplacian based Semi-supervised Learning Method and Speech Recognition Problem Int. J. Advance Soft Compu. Appl, Vol. 9, No. 1, March 2017 ISSN 2074-8523 The Un-normalized Graph p-laplacian based Semi-supervised Learning Method and Speech Recognition Problem Loc Tran 1 and Linh Tran

More information

ESANN'1999 proceedings - European Symposium on Artificial Neural Networks

ESANN'1999 proceedings - European Symposium on Artificial Neural Networks Computation of Tree-Recursive Information for Structures Gradient of Neural Information Processing, University of Ulm, 89069 Ulm, Dept. Email: kuechler@neuro.informatik.uni-ulm.de Germany, Recently, so-called

More information

Clustering with Reinforcement Learning

Clustering with Reinforcement Learning Clustering with Reinforcement Learning Wesam Barbakh and Colin Fyfe, The University of Paisley, Scotland. email:wesam.barbakh,colin.fyfe@paisley.ac.uk Abstract We show how a previously derived method of

More information

Classification Lecture Notes cse352. Neural Networks. Professor Anita Wasilewska

Classification Lecture Notes cse352. Neural Networks. Professor Anita Wasilewska Classification Lecture Notes cse352 Neural Networks Professor Anita Wasilewska Neural Networks Classification Introduction INPUT: classification data, i.e. it contains an classification (class) attribute

More information

1. Why Study Trees? Trees and Graphs. 2. Binary Trees. CITS2200 Data Structures and Algorithms. Wood... Topic 10. Trees are ubiquitous. Examples...

1. Why Study Trees? Trees and Graphs. 2. Binary Trees. CITS2200 Data Structures and Algorithms. Wood... Topic 10. Trees are ubiquitous. Examples... . Why Study Trees? CITS00 Data Structures and Algorithms Topic 0 Trees and Graphs Trees and Graphs Binary trees definitions: size, height, levels, skinny, complete Trees, forests and orchards Wood... Examples...

More information

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used. 1 4.12 Generalization In back-propagation learning, as many training examples as possible are typically used. It is hoped that the network so designed generalizes well. A network generalizes well when

More information

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS Puyang Xu, Ruhi Sarikaya Microsoft Corporation ABSTRACT We describe a joint model for intent detection and slot filling based

More information

Bilevel Sparse Coding

Bilevel Sparse Coding Adobe Research 345 Park Ave, San Jose, CA Mar 15, 2013 Outline 1 2 The learning model The learning algorithm 3 4 Sparse Modeling Many types of sensory data, e.g., images and audio, are in high-dimensional

More information

Lecture 13 Segmentation and Scene Understanding Chris Choy, Ph.D. candidate Stanford Vision and Learning Lab (SVL)

Lecture 13 Segmentation and Scene Understanding Chris Choy, Ph.D. candidate Stanford Vision and Learning Lab (SVL) Lecture 13 Segmentation and Scene Understanding Chris Choy, Ph.D. candidate Stanford Vision and Learning Lab (SVL) http://chrischoy.org Stanford CS231A 1 Understanding a Scene Objects Chairs, Cups, Tables,

More information

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu FMA901F: Machine Learning Lecture 3: Linear Models for Regression Cristian Sminchisescu Machine Learning: Frequentist vs. Bayesian In the frequentist setting, we seek a fixed parameter (vector), with value(s)

More information

Conditional Random Fields - A probabilistic graphical model. Yen-Chin Lee 指導老師 : 鮑興國

Conditional Random Fields - A probabilistic graphical model. Yen-Chin Lee 指導老師 : 鮑興國 Conditional Random Fields - A probabilistic graphical model Yen-Chin Lee 指導老師 : 鮑興國 Outline Labeling sequence data problem Introduction conditional random field (CRF) Different views on building a conditional

More information

ECE 6504: Advanced Topics in Machine Learning Probabilistic Graphical Models and Large-Scale Learning

ECE 6504: Advanced Topics in Machine Learning Probabilistic Graphical Models and Large-Scale Learning ECE 6504: Advanced Topics in Machine Learning Probabilistic Graphical Models and Large-Scale Learning Topics Bayes Nets: Inference (Finish) Variable Elimination Graph-view of VE: Fill-edges, induced width

More information

Week 3: Perceptron and Multi-layer Perceptron

Week 3: Perceptron and Multi-layer Perceptron Week 3: Perceptron and Multi-layer Perceptron Phong Le, Willem Zuidema November 12, 2013 Last week we studied two famous biological neuron models, Fitzhugh-Nagumo model and Izhikevich model. This week,

More information

SEMANTIC COMPUTING. Lecture 8: Introduction to Deep Learning. TU Dresden, 7 December Dagmar Gromann International Center For Computational Logic

SEMANTIC COMPUTING. Lecture 8: Introduction to Deep Learning. TU Dresden, 7 December Dagmar Gromann International Center For Computational Logic SEMANTIC COMPUTING Lecture 8: Introduction to Deep Learning Dagmar Gromann International Center For Computational Logic TU Dresden, 7 December 2018 Overview Introduction Deep Learning General Neural Networks

More information

Machine Learning. Sourangshu Bhattacharya

Machine Learning. Sourangshu Bhattacharya Machine Learning Sourangshu Bhattacharya Bayesian Networks Directed Acyclic Graph (DAG) Bayesian Networks General Factorization Curve Fitting Re-visited Maximum Likelihood Determine by minimizing sum-of-squares

More information

Figure 4.1: The evolution of a rooted tree.

Figure 4.1: The evolution of a rooted tree. 106 CHAPTER 4. INDUCTION, RECURSION AND RECURRENCES 4.6 Rooted Trees 4.6.1 The idea of a rooted tree We talked about how a tree diagram helps us visualize merge sort or other divide and conquer algorithms.

More information

DOMINATION IN SOME CLASSES OF DITREES

DOMINATION IN SOME CLASSES OF DITREES BULLETIN OF THE INTERNATIONAL MATHEMATICAL VIRTUAL INSTITUTE ISSN (p) 20-4874, ISSN (o) 20-4955 www.imvibl.org /JOURNALS / BULLETIN Vol. 6(2016), 157-167 Former BULLETIN OF THE SOCIETY OF MATHEMATICIANS

More information

Representations of Weighted Graphs (as Matrices) Algorithms and Data Structures: Minimum Spanning Trees. Weighted Graphs

Representations of Weighted Graphs (as Matrices) Algorithms and Data Structures: Minimum Spanning Trees. Weighted Graphs Representations of Weighted Graphs (as Matrices) A B Algorithms and Data Structures: Minimum Spanning Trees 9.0 F 1.0 6.0 5.0 6.0 G 5.0 I H 3.0 1.0 C 5.0 E 1.0 D 28th Oct, 1st & 4th Nov, 2011 ADS: lects

More information

Lecture 11: Clustering Introduction and Projects Machine Learning

Lecture 11: Clustering Introduction and Projects Machine Learning Lecture 11: Clustering Introduction and Projects Machine Learning Andrew Rosenberg March 12, 2010 1/1 Last Time Junction Tree Algorithm Efficient Marginals in Graphical Models 2/1 Today Clustering Project

More information

Lecture 5: Exact inference. Queries. Complexity of inference. Queries (continued) Bayesian networks can answer questions about the underlying

Lecture 5: Exact inference. Queries. Complexity of inference. Queries (continued) Bayesian networks can answer questions about the underlying given that Maximum a posteriori (MAP query: given evidence 2 which has the highest probability: instantiation of all other variables in the network,, Most probable evidence (MPE: given evidence, find an

More information