Impact sensitive ranking of structured documents


University of Wollongong Research Online, University of Wollongong Thesis Collections, 2011

Impact sensitive ranking of structured documents
Shujia Zhang, University of Wollongong

Recommended Citation: Zhang, Shujia, Impact sensitive ranking of structured documents, Doctor of Philosophy thesis, School of Computer Science and Software Engineering, University of Wollongong, 2011.

Research Online is the open access institutional repository for the University of Wollongong. For further information contact the UOW Library.


UNIVERSITY OF WOLLONGONG

Impact Sensitive Ranking of Structured Documents

A thesis submitted in fulfillment of the requirements for the award of the degree Doctor of Philosophy from the UNIVERSITY OF WOLLONGONG by ShuJia Zhang, School of Computer Science and Software Engineering, September 2011

© Copyright 2011 by ShuJia Zhang. All Rights Reserved.

Dedicated to my mother and my father

Declaration

This is to certify that the work reported in this thesis was done by the author, unless specified otherwise, and that no part of it has been submitted in a thesis to any other university or similar institution.

ShuJia Zhang, September 22, 2011

Abbreviation

AAM: AutoAssociative Memory
AUC: Area Under Curve
BoW: Bag of Words
BPTS: Back Propagation Through Structure
CLG(s): Concept Link Graph(s)
CSOM-SD: Contextual Self Organizing Map for Structured Data
DAG: Directed Acyclic Graph
DOAG: Directed Ordered Acyclic Graph
ERA: Excellence in Research for Australia
ERUS: Ensemble Random Under-Sampling
GHMM: Graph Hidden Markov Model
GNN: Graph Neural Network
GNN 2: Graph Neural Network Version 2
GoG(s): Graph of Graph(s)
GraphSOM: Graph Self Organizing Map
GRG: Generalized Random Graphs
HMM: Hidden Markov Model
INEX: INitiative for the Evaluation of XML Retrieval
IR: Information Retrieval
MAP: Mean Average Precision
MSE: Mean Square Error
PMGraphSOM: Probability Mapping Graph Self Organizing Map
PDF: Probability Density Function
RBF: Radial Basis Function
RMLP: Recursive Multi-Layer Perceptron network
SOM: Self Organizing Map
SOM-SD: Self Organizing Map for Structured Data
SSE: Sum Square Error
SVD: Singular Value Decomposition
SVM: Support Vector Machine
WME: Weighted Mean Error
WWW: World Wide Web

Abstract

Ranking is an algorithm that defines an ordering among objects. This thesis addresses the challenge of ranking structured documents. Traditional ranking methods involve link analysis, relevancy measures, or manual assessment; the aim is to rank important documents higher than less important ones. The meaning of the term "important" needs to be defined. For example, one may rank documents according to their popularity, by considering the number of documents which link to them. One may consider the quality of the documents, and rank the documents according to this quality criterion. Or one may consider the relevancy of the documents, and rank documents higher the more relevant they are to a given situation. However, most of these methods do not take into consideration the contents of the documents. The popularity and the contents of a document are actually inseparable in the study of document ranking according to an importance criterion. The content of a document, as well as its structural properties, can be extracted, and the structural properties can be expressed as a graph representation involving nodes (representing word tokens) and links (links among the word tokens). Note that in this representation the word tokens are merely symbols; they are not endowed with meanings, thus avoiding the tedious task of providing word tokens with meanings. The contextual information is inferred from the statistical properties of words which co-occur in the same order, and in similar context, within the given text corpus. This graph representation is called a concept link graph. In this thesis, we propose a novel representation called a graph of graphs (GoGs) representation to describe structured documents, which may be connected by hyperlinks or citations. A graph of graphs is a hierarchy of graphs: the top level graph is the concept link graph, and the nodes in this top level concept link graph can be described by other concept link graphs which represent the documents that link into the top level document, and so on.

Such a graph of graphs representation can be very useful for modelling the inter-linking of documents by encoding the structural information at different levels, such as the relationships among documents, the relationships among paragraphs within a document, the relationships among words within a paragraph of a document, etc. This representation allows us to include document properties and contents to different depths, so that it can be used in different machine learning applications. We then propose a novel machine learning method, graph neural networks for graphs of graphs (GNN 2), which is extended from a supervised learning method, viz., the graph neural network (GNN) [82]. This model is capable of encoding GoGs without the need to first pre-process the graph representation into a set of vectorial inputs, and is applicable to both classification and regression problems. This thesis considers the ranking of a large scale corpus (hundreds of thousands) of structured documents using the popularity criterion, but with consideration of content information, by deploying the GNN 2 method. The main questions which are studied include: what features and structures can be extracted from the documents; how to combine the extracted features to encode the structures; what are the processing methods for such structured documents; and how these methods perform on different classification and clustering problems. Through investigating these questions, we conclude that GNN 2 and the GoGs representation are a good combination for solving the ranking problem on a set of inter-linked documents. This thesis presents a number of findings:
- A novel unsupervised machine learning method called the probability mapping graph self organizing map (PMGraphSOM); this is capable of taking into consideration the structure of the documents, in a very similar manner to the graph self organizing map technique, and this method is applied to a clustering task. This produced state-of-the-art results on a benchmark document clustering problem.
- The development of a novel approach based on Hidden Markov Models for the encoding of sequences of graph data structures. The approach allows the encoding of a time series of structured objects.
- Application of the GNN 2 algorithm to a classification task involving a large set of documents represented as GoGs.

- A combined system containing the PMGraphSOM, an MLP and a GNN, applied to a Web spam detection problem. This produced state-of-the-art results on a benchmark problem in Web spam detection.
- The application of GNN 2 to encoding temporal spatial graphs, with the aim of producing a ranking of the documents in the original CiteSeer database. The temporal spatial graph is an instance of GoGs which encodes three levels of relationships:
  - Time-sequence: the status of the domain over time.
  - Citation links: the referencing relationship between documents.
  - Concept link: the contextual relationship among different concepts extracted from the documents.
This is a very useful description of the documents in the structured domain, and hence a GNN 2 can be trained on it to infer the underlying ranking function and to predict the ranking of documents with yet unknown ranks. This methodology is applied to the ranking of the documents in the original CiteSeer database. As there are no prior results in this area, it is not possible to assess if our results are good or bad. In order to show that the proposed methodology is useful for the modelling of temporal spatial problems involving graphs, we simulate a policemen database, in which we know the temporal spatial variations. We applied the proposed methodology to this simulated dataset, and found that the results are highly accurate. Thus, this shows that the proposed methodology can be applied to model temporal spatial problems involving structures.
This thesis also provides a number of recommendations:
- The PMGraphSOM is a state-of-the-art unsupervised learning algorithm for structured domains. It can be recommended for problems which require the classification of structured representations of objects, e.g., documents, images, videos.
- The graph of graphs representation is a general temporal spatial representation of structured domains. It can encode inter-linked documents, a set of images, or videos.

- The combined unsupervised and supervised approach can be recommended as a strategy for handling problems of deep learning, e.g., documents, handwritten character recognition, face recognition, human activity recognition, etc. In our case, we recommend a particular combination, viz., the PMGraphSOM and GNN combination. This combination has proven effective in the case of Web spam detection problems.
- The GNN 2 algorithm can be applied for supervised learning involving graph of graphs situations. If the problem is relatively noise free, in other words, if the noise contamination of the inputs is not high, this approach will yield good results.
This thesis also provides some potential directions for future research:
- Refine the PMGraphSOM for dealing with sparse graphs.
- Extend the PMGraphSOM learning algorithm for clustering GoGs.
- Redevelop the code of GNN 2 based on a clear software design specification, and provide an associated user manual.
- Apply the proposed model to benchmark problems which provide ground truth information for the evaluation of impact sensitive ranking in GoGs situations.
- Apply the proposed methodology to a wider range of problems, e.g. video information retrieval problems, image retrieval problems, chemical informatics, etc.
- Develop different training algorithms for GNN 2, e.g. mapping graphs to sequences, mapping graphs to graphs, etc.

Acknowledgement

I gratefully acknowledge the financial support provided via an ARC Discovery Project grant (DP0774168) to Dr. Markus Hagenbuchner and Prof. Ah Chung Tsoi to allow the project to be started. I especially appreciate the support received from the Australian Centre for Advanced Computing and Communications (AC3) and the University of Wollongong's Information Technology Services (ITS) for providing access to high performance computer clusters to accommodate the experimental needs of this research. I am also indebted to staff at the University of Siena for providing technical support when I was offered an opportunity to participate in a three-month research collaboration program. The successful execution of the experiments in the thesis would not have been possible without all this support and assistance. My deepest gratitude goes first and foremost to my supervisor, Dr. Markus Hagenbuchner, whose encouragement, guidance and support from the initial to the final stage enabled me to develop an understanding of the research. He walked me through all stages of the research project and the writing of this thesis. Without his consistent and illuminating instruction, this thesis would not have reached its present form. I wish to express my special appreciation to my co-supervisor Prof. Ah Chung Tsoi, who provided continuous encouragement and massive help during my PhD study. I also would like to acknowledge with deep gratitude the friendship with and assistance given by other researchers: Prof. Edmondo Trentin, Prof. Franco Scarselli, Prof. Alessandro Sperduti, Dr. Milly Kc and many other researchers I had the pleasure to collaborate with. I also would like to take this opportunity to thank Dr. Zhiquan Zhou, who was the supervisor of my Honours thesis and led me onto the path of research. I also owe a special debt of gratitude to all my family and friends. I would like to thank my parents for providing me the opportunity to study overseas and for their continued encouragement during my PhD study.

Grateful acknowledgement is made of the kind understanding and support from my fiancé Ralph. Last but not least, my gratitude also extends to my adorable friends who have been assisting, supporting and caring for me.

Contributions of the Thesis

One of the aims of this thesis is to explore an alternative ordering of structured or semi-structured documents by employing machine learning methods. Towards that end, a suitable document representation and associated machine learning models need to be developed. We examine the features describing the structured documents and their associated topological structures, and construct a novel document representation which can better describe the documents in a structured domain. In order to encode the structured data, machine learning models are built based on some existing powerful methods. As a result, the contributions of this thesis are briefly summarized below:
- We proposed an unsupervised machine learning method called the Probability Mapping Graph Self Organizing Map (PMGraphSOM), which is an extension of the Graph Self Organizing Map (GraphSOM) unsupervised machine learning technique, obtained by considering the underlying probability mapping between the activation of the neurons in the self organizing map and the inputs at each node of the graph representing the structured documents. GraphSOM is an unsupervised learning model for processing graph structures, based on the standard Self Organizing Map (SOM) [49]. PMGraphSOM is capable of processing large scale document collections represented as undirected and non-positional graphs. It has been demonstrated, through applying it to practical examples, that PMGraphSOM can produce more stable and accurate results than its predecessors, e.g., GraphSOM and the Self Organizing Map for Structured Domains (SOM-SD).
- We developed an original representation called the Graph of Graphs (GoGs) for describing linked documents in a structured domain. Normally a document has a flat structure. However, it is possible to extract contextual relationships among words, or phrases, in the document, and these can be represented in terms of a graph, with nodes representing passages, phrases, or words occurring in the document, and the links representing their contextual relationships with other nodes (other words, phrases, or passages in the document).

Moreover, in most documents there are references, either in the form of a hyperlink or an actual reference to other works. Thus, it would be natural to represent a hyperlink, or cited reference, occurring in the document as a link to another document (web document or reference). As the words, or phrases, are represented as nodes in the document graph, it would be natural to view the relations to other documents as links connecting the nodes to other documents. In turn, those documents may refer to further documents. Thus, this will form a number of closely connected cliques of nodes, with some connections among the cliques representing the references made to other documents. Note that this graph of graphs, at least in the form indicated, consists of undirected cliques representing the documents, and directed links between the cliques, indicating the direction in which the documents are referenced.
- Traditional machine learning methods, e.g., multilayer perceptrons, support vector machines and self organizing maps, are limited to processing vectorial input data. Some newly proposed algorithms, e.g., SOM-SD [48], PMGraphSOM [99] and graph neural networks (GNNs) [82], allow the processing of richer structured data such as trees and graphs. In order to encode GoGs without having first to pre-process the data into vectorial forms, we proposed a supervised machine learning model, GNN 2, which is an extended version of the Graph Neural Network [82]. GNN 2 can encode any type of graph, including GoGs.
- We propose a probability model called the Graph Hidden Markov Model (GHMM) for the learning and classification of sequences of trees. The architecture relies on an underlying HMM structure, capable of dealing with long-term dependencies in sequential data of arbitrary length. Emission probability density functions are estimated by means of a combination of recursive encoding neural networks and constrained radial basis function-like networks. A global optimization algorithm, aimed at the maximization of the likelihood of the model given the training observation sequences, has been developed. Preliminary results confirm that the architecture and the algorithms are effective, both in terms of learning and generalization capabilities.

- The novel representation using GoGs and the associated encoding method GNN 2 are deployed in the ranking task for structured documents. We represented a set of scientific documents (being the documents contained in the CiteSeer database) which are connected via references as a temporal spatial graph (a sequence of graphs of graphs) in which the constituent graphs, representing documents at a particular instance of time, are connected through references with other documents at other instances of time. This representation models the dynamics of the reference (citation) relationships among the documents over time. We applied the GNN 2 algorithm to the sequential GoGs and learned the change of the ranking in the set of documents according to both the citation links and the document contents.
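To make the graph of graphs idea concrete, the following is a minimal sketch of a data structure in which a node of a graph may carry, as part of its description, another graph, e.g. a citation-level node whose nested graph is the concept link graph of the cited document. The class and field names (Node, Graph, subgraph) are illustrative assumptions and do not correspond to the implementation developed in this thesis.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class Node:
    # numeric label attached to the node (e.g. word or cluster features)
    label: List[float] = field(default_factory=list)
    # optional nested graph describing this node in more detail,
    # e.g. the concept link graph of a cited document
    subgraph: Optional["Graph"] = None

@dataclass
class Graph:
    nodes: Dict[int, Node] = field(default_factory=dict)
    # edges carry a strength/weight; direction is given by the key order
    edges: Dict[Tuple[int, int], float] = field(default_factory=dict)

    def add_edge(self, u: int, v: int, weight: float = 1.0) -> None:
        self.edges[(u, v)] = weight

# Illustrative use: a two-document citation graph where each document node
# is itself described by a (tiny) concept link graph.
clg_a = Graph(nodes={0: Node([1.0]), 1: Node([0.5])})
clg_a.add_edge(0, 1, 0.8)
clg_b = Graph(nodes={0: Node([0.2])})
citation_graph = Graph(nodes={0: Node(subgraph=clg_a), 1: Node(subgraph=clg_b)})
citation_graph.add_edge(0, 1, 1.0)   # document 0 cites document 1
```

A temporal spatial graph of the kind described above would then be a sequence of such citation graphs, one per time instance, with links connecting nodes across time steps.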

Publications

1. A.C. Tsoi, M. Hagenbuchner, M. Kc, and S. Zhang. Handbook on Neural Information Processing, chapter Learning Structural Representations of Text Documents in Large Document Collections. Springer (accepted for publication on 23/11/2010).
2. S. Zhang, M. Hagenbuchner, F. Scarselli, and A. Tsoi. Supervised encoding of graph-of-graphs for classification and regression problems. In S. Geva, J. Kamps, and A. Trotman, editors, Focused Retrieval and Evaluation, volume 6203 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2010.
3. E. Trentin, S. Zhang, and M. Hagenbuchner. Recognition of sequences of graphical patterns. In F. Schwenker and N. El Gayar, editors, Artificial Neural Networks in Pattern Recognition, volume 5998 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2010.
4. Z.Q. Zhou, S. Zhang, M. Hagenbuchner, T.H. Tse, F.-C. Kuo, and T.Y. Chen. Automated functional testing of online search services. Software Testing, Verification and Reliability, n/a. doi: 10.1002/stvr.437. August.
5. S. Zhang, M. Hagenbuchner, A.C. Tsoi, and A. Sperduti. Advances in Focused Retrieval, volume 5631 of Lecture Notes in Computer Science, chapter Self Organizing Maps for the clustering of large sets of labeled graphs. Springer-Verlag, Berlin Heidelberg, 2009.
6. M. Hagenbuchner, S. Zhang, A.C. Tsoi, and A. Sperduti. Projection of undirected and non-positional graphs using self organizing maps. In European Symposium on Artificial Neural Networks - Advances in Computational Intelligence and Learning, Bruges, Belgium, April 2009.

18 Contents Abbreviation Abstract Acknowledgement Contributions of the Thesis Publications v vi x xii xv 1 Introduction Introduction Research Motivation Research Objectives Research Design Research Scope Thesis Outline Background and Literature Review Introduction Document Search on the Web Document Ranking Alexa PageRank TrustRank HillTop Topic-Sensitive Ranking xvi

19 2.3.6 Journal Impact factor ERA rank Machine Learning Methods for Encoding Structure Domains Unsupervised methods Supervised methods Modelling Ranking Methods Conclusion Document Feature and Structure Introduction Document Features Bag of Words approach Content-based Feature Link-based Feature Document Structure XML Tag Tree/Graph Concept Link Graph Hyperlinked Graphs Citations Graph Temporal Sequences Graph of Graphs Conclusion Encoding Structured Documents by Machine Learning Methods Introduction Unsupervised Machine Learning Self-Organizing Map Self-Organizing Map for structured data Contextual Self-Organizing Map Graph Self-Organizing Map Probability Mapping GraphSOM Supervised Machine Learning The basics xvii

20 4.3.2 Simple Auto-associative Memory Graph Hidden Markov Model Back Propagation Through Structure Graph Neural Network Graph Neural Network version Conclusion Benchmark datasets Introduction Policeman Benchmark INEX28 XML Mining Dataset INEX29 XML Mining Dataset WebSpam Dataset - UK WebSpam Dataset - UK Conclusion Clustering Introduction Image Clustering XML Documents Clustering Effects of input weighting Effects of map size and grouping Effects of training radius Effects of node label Best Results and Discussion Role of Clustering Conclusion Classification Introduction Recognition of Sequences of Graphical Patterns GHMM to encode sequences of graphs Use GNN 2 to encode sequences of graphs xviii

21 7.3 Categorization of XML documents Network size Initial condition Balancing labelled data Stability control Label-to-out approach All-to-out approach Results and comparison for the INEX 29 competition Encode labelled edge Importance of the structure Pre-processing by using PMGraphSOM Improving the GNN 2 results Effects of long term dependencies on GNN Web Spam Detection Application to the UK26 dataset Application to the UK27 dataset Conclusion Impact Sensitive Ranking Introduction The CiteSeer dataset Data Preparation Document Contents Extraction Document Target Computing Temporal and Spacial Graphs Experiments Dataset Training and Prediction Truncate the sequence Use PMGraphSOM Smoothing the PageRank targets Using larger network architectures Use of ERA as targets xix

22 8.5.8 Discussion Conclusion Discussion and Conclusion Introduction Findings and implications Contributions Limitations Future research Bibliography 215 A Analysis of results and feature selection for WEBSPAM-UK B Experimental plots for Web spam detection tasks 231 C Result plots of applying ERUS on WEBSPAM-UK D Mapping of training PMGraphSOM on CLGs of CiteSeer documents 24 E Confusion matrices when training GNN 2 on ERA ranks 244 xx

23 List of Tables 2.1 An extract of a Web index table The notation for Jacobian control computation Main properties of the Policeman benchmark dataset The number of documents belonging to 39 different categories A comparison of performances between PMGraphSOM and the Graph- SOM given a network size of Performance of PMGraphSOM for INEX 8 by using differentµvalues Performance of PMGraphSOM for INEX 8 by using different radius Performance of PMGraphSOM for INEX 8 by using different labels Summary of results of the INEX 8 XML clustering task. The clustering and classification performances were not available for all participants Confusion Matrix for training documents generated by PMGraphSOM trained using mapsize=16x12, iteration=5, grouping=2x2, σ()=1, µ=.95,α()=.9, label: text=5 + template=4 + tag= Cluster purity when using mapsize=16x12, iteration=5, grouping=2x2, σ()=1,µ=.95,α()=.9, label: text=5 + template=4 + tag= Confusion Matrix for training documents generated by PM-GraphSOM trained using mapsize=12x1, iteration=5, grouping=1x1, σ()=1, µ= ,α()=.9, label: template= Cluster purity when using mapsize=12x1, iteration=5, grouping=1x1, σ()=1,µ=.99995,α()=.9, label: template= Cluster purity using mapsize=4x3, iteration=6, grouping=4x6, σ()=2,µ=.99,α()=.9, label: text=47 (BoW-based approach) xxi

24 7.1 Recognition accuracy of GHMM on sequences from the Policemen dataset Experiments using GNN 2 to encode sequences of graphs List of all input data files used for training on INEX 9 dataset Results of GNN 2 for INEX 9 by using different training configuration Comparison of all submissions for INEX 9 XML classification task A list of training data used by PMGraphSOM for INEX Performance of the PMGraphSOM for the INEX 9 dataset with different settings for PM Improved results of the GNN 2 for the INEX 9 learning problem Confusion matrix: True positive (TP), False negative (FN), False positive (FP), and True negative (TN) Performance of training PMGraphSOMs. TrainMode: 2-train on both; 1-train on spam hosts only; -train on normal host only Comparison of detection performance of MLP-Link, MLP-Content and GNN(link+content+topology) Comparison on features of hosts detected by GNN and MLP-Link Comparison on features of hosts detected by GNN and MLP-Content Comparison on features of hosts detected by GNN(MLP-Link+MLP- Content) and GNN(MLP-Link+MLP+Content+4d) List of all combinations of inputs for GNN training. Topology: A-random truncation; B-sorted truncation Performance of PMGraphSOM on UK27 train set (second row) and test set (third row). Train mode: 2, train on both; 1, train on spam;, train on normal Performance of training MLP with architecture for UK Different µ values for training PMGraphSOM on UK Some inaccurate results from Tesseract Statistic results of parsing text to XML. The percentage is computed out of385,951 documents Prediction error of a trained GNN 2 on full/reduced temporal spacial graphs with PageRank/ImpactRate as targets xxii

25 8.4 Prediction Error of training GNN 2 on citation graphs labelled with PM- GraphSOM outputs Confusion matrix of connections among documents from different rank levels Classification results of training GNN 2 on ERA venue rank and venue ID Confusion Matrix of the classification produced by training GNN 2 on citation graphs where nodes in the graph are labelled by CLG. Use ERA rank as targets. Hidden=8, State=1, EncodeDim=6. With balancing Confusion Matrix of the classification produced by training GNN 2 on citation graphs. Use ERA rank as targets. Hidden=8, State=1, EncodeDim=6. With balancing Confusion Matrix of the best classification results produced by GNN 2 for dataset Confusion Matrix of the best classification results produced by GNN 2 for dataset Confusion Matrix of the best classification results produced by GNN 2 for dataset Confusion Matrix of the best classification results produced by GNN 2 for dataset Example of GNN 2 training results on CiteSeer dataset Example of GNN 2 training results on CiteSeer dataset A.1 Comparison on features of hosts detected by MLP-Content and MLP- Link A.2 Comparison on features of hosts detected by GNN (MLP-Link+MLP+Content) and GNN (MLP-Link+MLP-Content+3d) A.3 Comparison on features of hosts detected by GNN (MLP-Link+MLP- Content) and GNN (MLP-Link+MLP+Content+4d) A.4 MLP(Link) vs GNN(MLP(Link)) A.5 MLP(Content) vs GNN(MLP(Content)) xxiii

26 E.1 Confusion Matrix. GNN trained on citation graphs where nodes are labeled by CLG. Use ERA rank as targets. Hidden=8, State=1, EncodeDim=6. No balancing E.2 Confusion Matrix. Train GNN on one level of citation graphs. Use ERA rank as targets. Hidden=8, State=1, EncodeDim=6. Without balancing xxiv

27 List of Figures 2.1 The architecture of a basic Web search engine system An example of an XML parsing tree and graph An example of a graph of graphs A 2-dimensional map of size5 2 (left), and an undirected graph (right). Each hexagon is a neuron. ID, codebook, and coordinate value for each neuron is shown. For each node, the node number, and coordinate of the best matching codebook is shown The general architecture of the GHMM A schematic of an RMLP network architecture consisting of 4 hidden units and 3 state units A simplified schematic illustration of the GNN network architecture A schematic view of the GNN 2 network The encoding architecture of the GNN 2 network for a given graph Sample sequence for policeman rotating clockwise Sample sequence for policeman rotating anti-clockwise Resulting mappings when training PMGraphSOM (left column) and the GraphSOM (right column). Training parameters used where: iterations = 5,α() =.9, σ() = 4, µ =.28274, and grouping = Cluster purity vs. Map size relative to the size of the dataset Cluster purity vs. Training iteration Mapping of all labelled nodes on a PMGraphSOM performing best in classifying the INEX28 documents xxv

28 6.5 Mapping of all labelled nodes on a PMGraphSOM performing best in clustering the INEX28 documents Mapping of all training data on a PMGraphSOM performing best in classifying the nodes The curves correspond to the labels as defined in Table 7.2. Shown are the MSE (left) for each of the 6 experiments, and the recognition rate (right). The horizontal scale indicates the number of training iterations SSE curves of training GNN 2 for INEX 9 with different network sizes. Left: trained on graph 3; Right: trained on graph SSE curves of training GNN 2 for INEX 9 with different initial conditions. Left: trained on graph 4; Right: trained on graph SSE curves of training GNN 2 for INEX 9 with balancing 1 (left) and balancing 2 (right). Trained on graph MAP and F1 score of training GNN 2 for INEX 9 without balancing ( and 1) and with balancing (2) SSE curves of train GNN 2 for INEX 9 with state reduction. Trained on graph SSE curves of training GNN 2 for INEX 9 with label-to-out approach on graph Comparison between label-to-out and all-to-out approaches SSE curves of training GNN 2 with labelled links. Trained on graph 8 and Category-Link Matrix. Different symbols represent different source categories; x-axis: destination category; y-axis: normalized counts of the links between documents from two categories The mapping results of training PMGraphSOMs for INEX 9 on dataset #1. Left: mapsize=8x6, grouping=1. Right: mapsize=4x32, grouping= The mapping results of training PMGraphSOM for INEX 9: the plots for two categories of training dataset # The random initialization schemes for PMGraphSOM: Box-muller gaussian vs. Bin-method xxvi

29 7.14 The mapping results of training PMGraphSOM for INEX 9: trained on dataset #4. Left: init1 initialization method; Right: init2 initialization method; positive and negative samples are distinguished by different symbols and colors Classification performance of the PMGraphSOM when trained on dataset #4. Left:old initialization; Right: new initialization Classification performance of the PMGraphSOM when trained on dataset #3. Left:old initialization; Right: new initialization Varying the size of the four internal network layers (horizontal scale), and the limitation of sequence length the network can encode (vertical scale). Shown is the average performance (left), and the maximum performance (right) Maintaining causality between dependent internal layers. Horizontal scale: Number of neurons in a layer, vertical scale: limit to the length of a sequence of graphs a network can classify The proposed combined machine learning system for solving large scale Web spam Detection Problem Results of MLP for UK26 by using different network architectures. Left: hidden=2; Right: two hidden layers, layer1=2, layer2= Comparison of MSE curves and performance of training MLPs for UK26 by using different input features Performance of training GNN with different network architecture. Left: AUC on train; Right: AUC on test Errorbars of GNN training results with different inputs Results of MLP training for UK26 by using ERUS strategy. k= 3.5. Left: Single results sorted by AUC on train. Right: Results on top n best train results Train PMGraphSOM on UK27 with: map=12x1, group=1, maxout=16. Top: Train on normal hosts only; Bottom: Train on spam hosts only. Left: Train Performance; Right: Test Performance xxvii

30 7.26 Feature Analaysis for UK27. X-axis: ID of different features; Y-axis: maximum or average of the feature values. Left: Content-based feature; Right: Link-based feature; Percentage of spam and normal hosts from different second-level domains for UK27. The horizontal axis denotes second-level domain and vertical axis show the percentage of spam and normal hosts belong to that domain respectively. All the second-level domains end with.uk, so omit here Results of training MLP for UK27 by using noisy patterns. Left: Noise added to 47th content-based feature. Right: Noise added to the 124th link-based feature The results of training MLP for UK27 by using ERUS strategy. k= 4. Left: Single results sorted by AUC on train. Right: Average results on topnbest train results The results of training MLP for UK27 by using ERUS strategy with extended evaluation method. k= 4. Left: Single results sorted by AUC on train. Right: Results on top n best train results The ranking scores over time for some documents in the CiteSeer Dataset using ImpactRate (left) and PageRank (right) An example of GoGs for a document in the CiteSeer dataset The SSE (left) and the value range (left) produced by GNN 2 during training with ImpactRate as targets. The size of the output network varies: first row (hidden=6, state=8, encodedim=4); second row (hidden=8, state=1, encodedim=6) The SSE (left) and the value range (right) produced by GNN 2 during training when using PageRank as targets. The size of the output network varies: first row (hidden=6, state=8, encodedim=4); second row (hidden=8, state=1, encodedim=6) Samples where GNN 2 failed to produce good results on modelling PageRank, and corresponding results for the experiments training on ImpactRate. 177 xxviii

31 8.6 A GNN 2 trained on reduced length temporal spatial graphs using ImpactRate as targets. The size of the output network varies: first row (hidden=6, state=8, encodedim=4); second row (hidden=8, state=1, encodedim=6) Train GNN 2 on reduced length temporal spatial graphs using PageRank as targets. The size of the output network varies: first row (hidden=6, state=8, encodedim=4); second row (hidden=8, state=1, encodedim=6). Left: SSE; Right: Errorbar on the network outputs The mappings produced by PMGraphSOM of size 8 6 when trained on the CLGs extracted from the CiteSeer dataset The distribution of PageRank value before and after normalization Plots of training GNN 2 on temporal spatial graphs using normalized PageRank as targets. Left: SSE curves; Right: Errorbars of the network output values Plots of training GNN 2 on a smaller CiteSeer dataset using different network architecture. Top: hidden=4, state=6, encodedim=3; Middle: hidden=8, state=1, encodedim=6; Bottom: hidden=15, state=2, encodedim= Distribution of documents from different ERA venues and rank levels. Left: number of documents from different rank levels; Right: number of documents from different venues Distribution of documents from different ERA venues and rank levels. Top: Train set; Bottom: Test set; Left: number of documents from different rank levels; Right: number of documents from different venues SSE curves when training GNN 2 on GoGs defined for CiteSeer documents. Top: Use venue rank as targets; Bottom: Use venue ID as targets; Left: with CLGs; Right: without CLGs A sample of Policeman temporal spatial graphs SSE curves of training GNN 2 on Policeman temporal spatial graphs. Top: Dataset 1; Middle: Dataset 2; Bottom: Dataset SSE curves of training GNN 2 on Policeman temporal spatial graphs. Top: Dataset 4; Bottom: Dataset xxix

32 8.18 The changes of network parameters during training GNN 2 on Policeman temporal spatial graphs Impact of noise on training GNN 2 on Policeman temporal spatial graphs. Left: SSE curves; Right: Classification performance B.1 The mapping results of PMGraphSOM for UK27. Training configuration: map=2x18, group=1, use at most 16 out links. Top: Train on non spam hosts only; Bottom: Train on spam hosts only. Left: Train Performance; Right: Test Performance B.2 The mapping results of PMGraphSOM for UK27. Training configuration: map=12x1, group=1, use at most 1 out links. Top: Train on non spam hosts only; Bottom: Train on spam hosts only. Left: Train Performance; Right: Test Performance C.1 The results of MLP for UK27 by using ERUS strategy. k=1. Left: Single results sorted by AUC on train. Right: Average results on top n best train results C.2 The results of MLP for UK27 by using ERUS strategy. k=2. Left: Single results sorted by AUC on train. Right: Average results on top n best train results C.3 The results of MLP for UK27 by using ERUS strategy. k=2.5. Left: Single results sorted by AUC on train. Right: Results on top n best train results C.4 The results of MLP for UK27 by using ERUS strategy. k=3. Left: Single results sorted by AUC on train. Right: Results on top n best train results C.5 The results of MLP for UK27 by using ERUS strategy. k=3.5. Left: Single results sorted by AUC on train. Right: Results on top n best train results C.6 The results of MLP for UK27 by using ERUS strategy. k=4.5. Left: Single results sorted by AUC on train. Right: Results on top n best train results xxx

33 C.7 The results of MLP for UK27 by using ERUS strategy. k=5. Left: Single results sorted by AUC on train. Right: Results on top n best train results C.8 The results of MLP for UK27 by using ERUS strategy. k=6. Left: Single results sorted by AUC on train. Right: Results on top n best train results C.9 The results of training MLP for UK27 by using ERUS strategy with extended evaluation method. k=1. Left: Single results sorted by AUC on train. Right: Results on top n best train results C.1 The results of training MLP for UK27 by using ERUS strategy with extended evaluation method. k=2. Left: Single results sorted by AUC on train. Right: Results on top n best train results C.11 The results of training MLP for UK27 by using ERUS strategy with extended evaluation method. k=2.5. Left: Single results sorted by AUC on train. Right: Results on top n best train results C.12 The results of training MLP for UK27 by using ERUS strategy with extended evaluation method. k=3. Left: Single results sorted by AUC on train. Right: Results on top n best train results C.13 The results of training MLP for UK27 by using ERUS strategy with extended evaluation method. k=3.5. Left: Single results sorted by AUC on train. Right: Results on top n best train results C.14 The results of training MLP for UK27 by using ERUS strategy with extended evaluation method. k=4.5. Left: Single results sorted by AUC on train. Right: Results on top n best train results C.15 The results of training MLP for UK27 by using ERUS strategy with extended evaluation method. k=5. Left: Single results sorted by AUC on train. Right: Results on top n best train results C.16 The results of training MLP for UK27 by using ERUS strategy with extended evaluation method. k=6. Left: Single results sorted by AUC on train. Right: Results on top n best train results D.1 The mappings produced by PMGraphSOM of size 1 8 when trained on the CLGs extracted from the CiteSeer dataset xxxi

34 D.2 The mappings produced by PMGraphSOM of size 2 16 when trained on the CLGs extracted from the CiteSeer dataset D.3 The mappings produced by PMGraphSOM of size 12 1 when trained on the CLGs extracted from the CiteSeer dataset. With node label. µ= D.4 The mappings produced by PMGraphSOM of size 12 1 when trained on the CLGs extracted from the CiteSeer dataset. With node label. µ= D.5 The mappings produced by a PMGraphSOM of size 12 1 when trained on the CLGs extracted from the CiteSeer dataset. Using node label, andµ = xxxii

Chapter 1
Introduction

1.1 Introduction

When the Internet was first created (many credit Sir Tim Berners-Lee with the creation of the World Wide Web when he, together with a number of others, successfully connected a hypertext transfer protocol (http) client to a server via a network in late 1990), its size was small, so that easy and fast retrieval of information was viable. With the continuous development of the Internet, the World Wide Web has become the most commonly used platform for information communication and knowledge sharing. The size of the Web is growing at a tremendous speed 1, which makes searching for desired information on the Web an increasingly difficult task. Search engines arose for efficient information retrieval from the huge amount of data on the Web. The success of a search engine depends on providing fast information retrieval and returning the desired search results to the users. A search engine usually works in three steps [15, 93]:
1. Crawl pages from the Web, downloading the web documents encountered;
2. Build an indexed database and an inverse indexed database of the downloaded web documents;
3. In response to user queries, search in the inverse indexed database.
The first step can be achieved by using a robot program called a spider or a crawler. The spider program accesses the Web starting from a number of seed pages, follows the hyperlinks contained in those seed pages, downloads the web documents encountered, and continues recursively until there are no more hyperlinks to follow or until a maximum number of crawls has been reached.

1 The size of the Web is estimated to be between a low of 11 billion web pages and a high of 40 billion web pages [77, 92]. The discrepancy is due to the order in which the indexes of four commonly used commercial search engines (Google, Bing, Yahoo Search and Ask) are used in the estimation [77, 92].
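To make steps 2 and 3 above concrete, the following is a minimal sketch of an inverse (inverted) index: a mapping from each keyword to the documents that contain it, which can then be intersected to answer a keyword query. The whitespace tokenisation and the toy document collection are simplifying assumptions; real search engines use far more elaborate indexing and ranking.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query keyword."""
    words = query.lower().split()
    if not words:
        return set()
    result = index.get(words[0], set()).copy()
    for w in words[1:]:
        result &= index.get(w, set())
    return result

# Toy illustration only; not any particular engine's implementation.
docs = {1: "graph neural networks for ranking",
        2: "ranking structured documents",
        3: "self organizing maps"}
index = build_inverted_index(docs)
print(search(index, "ranking documents"))   # {2}
```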

The motivation of using a number of seed pages is to prevent the untimely stop of a particular crawler due to the lack of outgoing hyperlinks, and to continue the crawl when a particular isolated web page is reached. The contents of the collected pages are analysed and indexed via the extraction of related information such as keywords, modification time, size of the file, links, etc. An inverse index is also built, such that, in response to keywords issued by a user, the particular pages containing those keywords can be retrieved. As there are often many web pages containing the same keywords which can be retrieved, some ranking of these retrieved web pages needs to be performed so that the user is not overwhelmed. The user will be presented with a ranked list of URLs (Uniform Resource Locators), from the most relevant to the least relevant. In such a situation, the ranking of pages becomes crucial in the information search and retrieval process and depends on the relevance measure used. Current ranking approaches include link based analysis [1, 15, 34, 42, 6, 72], click-rate measures [4], topic-based analysis [2], document quality analysis [58], or other manual assessments. However, these available ranking methods have some limitations, and it is challenging for many of them to scale up to handle the size and underlying dynamics of the Internet, e.g., some web pages are updated more regularly than others, some are created dynamically in response to queries, some are deleted by their creators and replaced with other web pages, etc. [93]. At the end of the 1990s, various ranking strategies based on hyperlink analysis were proposed. The so-called PageRank method, which was developed by Sergey Brin and Lawrence Page [15, 72], is one of the typical ranking algorithms. The principle of PageRank is to provide an estimation of the importance of documents based on an analysis of the web link structure. The importance of a page depends on the number of pages which link to it and also on the importance of those pages. Some researchers have worked on improving the PageRank notion. In [51], Haveliwala proposed to compute a set of PageRank scores with respect to different topics. By considering the context around where the query appeared, the proposed method is capable of producing more accurate ranking results than the traditional PageRank. In [2], Agrawal et al. analysed the contextual preferences accumulated from various sources and used them to create a prior ordering of two items in a particular context in an online preprocessing step, and used these orders and associated contexts at query time to provide quick ranked answers.
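To illustrate the PageRank principle described above, here is a minimal power-iteration sketch on a toy link graph. The damping factor of 0.85, the dangling-page handling, and the four-page toy web are illustrative assumptions; this is not the exact formulation of [15, 72] nor the algorithm of any production search engine.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links[p] = list of pages that p links to (outgoing links)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outgoing in links.items():
            if not outgoing:            # dangling page: spread its rank evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:                       # otherwise split rank among out-links
                for q in outgoing:
                    new_rank[q] += damping * rank[p] / len(outgoing)
        rank = new_rank
    return rank

toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(sorted(pagerank(toy_web).items(), key=lambda kv: -kv[1]))
```

The iteration converges towards the dominant eigenvector of the (damped) link matrix, which is the ranked order referred to in the text.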

In [91], Tsoi et al. explored the possibility of customizing PageRank to the various interest areas of Internet users. The approach allowed the altering of the rank of a document depending on its content area. However, how to balance the efficiency of the method and the precision of the outcome was still an obstacle. These approaches are based on link analysis. Consider the situation where we have just created a new document which has no in-links: the PageRank score for such a document would be at the default value (the lowest value possible under the PageRank equation). It would be an interesting question what the in-links to this web page might be, given that the web page contains interesting and useful information. This indicates that link based ranking methods have some limitations, especially for newly introduced documents. In such a situation, the contents of a document may play a role in the ranking of the document. Hence, it could be useful to rank documents by incorporating document contents together with the link structure. This thesis focuses on investigating means to model a set of given text documents with a view to ranking them in terms of their impact. This can be implemented by using machine learning models which are capable of encoding text documents represented in the form of graph structures. In this research, an unsupervised machine learning method called the Graph Self Organizing Map (GraphSOM) [49] is extended and enhanced for processing more generic graph structures. This extended version of GraphSOM is called the Probability Mapping Graph Self Organizing Map (PMGraphSOM) [99]. The structured domain is modelled using what we call a graph of graphs (GoG) model [98]; in such a model, the textual document is represented in the form of a graph, where the nodes in the graph represent word tokens in the text document, and the links between the nodes represent the strength of connections between the two word tokens. There may be references from a particular word token to another document. If this referred-to document is represented in terms of a graph, then this graph is linked to the word token in the document. If we further assume that there might be links from this child document to other documents, and that these grandchild text documents are represented in terms of graphs, then we will have a graph of graphs situation, in which there is a root graph representing the primary document, the hyperlinks from this parent document are represented as links to a set of child documents, and these in turn are linked to their own child documents through hyperlinks.

We extended a supervised method, viz., the graph neural network (GNN) [82], which we call for convenience the graph neural network for graphs of graphs (GNN 2) [98], to meet the requirements of encoding a GoG model.

1.2 Research Motivation

Text documents, e.g., XML documents, HTML web pages, scientific papers, are often formatted in a semi-structured or structured format. These documents often make reference to other documents, which in turn make reference to further documents. The problem we wish to investigate in this thesis is: can we rank such a set of inter-connected semi-structured or structured documents? This is an interesting question to investigate, as there are more and more documents which are born digital, in that they are created directly in a digital format. With the popularity of the Microsoft Office package among users, and the adoption of XML (eXtensible Markup Language) as a format to store Word documents from the 2009 version of Word onwards, there are more and more documents stored in XML format. Moreover, web documents are often stored in HTML (hypertext markup language). Both HTML and XML are semi-structured in nature in that they encode some structure of the document, in addition to encoding the text component of the document. With these semi-structured documents, a question to ask is: how do we retrieve them? Moreover, in the retrieval process, as there will be many documents which match particular queries from the users, the question that arises is how to rank these documents in order of importance and present them to the users. For the ranking of documents, there is the classic PageRank algorithm, which considers the network structure among the documents. In other words, it considers only the way in which documents are linked with one another, without considering the content of the documents, and provides a ranked order of the documents. This ranked order is known as the PageRank. There are many variants of PageRank, e.g., TrustRank, which is applied to studying whether a set of given web pages can be trusted, TextRank, which is applied to the ranking of unstructured text documents, etc. Most of these do not consider the content of the documents.

However, as the documents are semi-structured, one would ask the question: would the knowledge of the document structure (as such structures are implicitly provided through the XML or HTML format), and the knowledge of the document contents, help in ranking the documents? It is intuitively clear that a knowledge of the document structure, together with a knowledge of the content, must help in ranking the documents, as this is the basic knowledge upon which we humans rank documents. This question raises a number of issues:
- How do we represent the knowledge of the content of documents?
- How do we rank documents with a knowledge of their contents and their structures?
- Specific to linked documents, how do we rank a set of linked documents?
This thesis will address these issues. In particular, it will consider ways in which the knowledge of content information can be used in the representation of a document. This is a relatively mature field of study, as most people use a bag of words approach. In this approach, the list of words used in a text corpus is first extracted. This list of words needs to be de-stemmed (to reduce the number of vocabulary terms used in representing the documents), and a set of common words eliminated (as common words are non-discriminating in nature). This set of de-stemmed words forms the vocabulary of the set of documents. Each document can then be represented as a^T w, where w is the vector representing the set of vocabulary terms used, and a is a vector of corresponding dimension, the elements of which represent the number of occurrences of the corresponding word in the document. This method has been very popular in representing text documents and is widely used. In this bag of words representation the contextual connections among the words are not encoded. Hence an intuitive question is: can we represent the contextual information among the words as well? The answer to such a question is affirmative. In this thesis we have chosen to represent the set of documents using what we call a concept link graph (CLG) technique, in which each document is represented in the form of a graph [2]; the nodes of the graph represent the words used in the document, while the links between the nodes represent the strengths of connection between the two words. However, a follow-up question is: now that the contextual information is encoded, can we make use of such information in the ranking of documents?
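As a simple illustration of the count vector a over a vocabulary w described above, and of the general idea of links between word tokens weighted by a strength of connection, consider the following sketch. The window-based co-occurrence count is only a crude stand-in used for illustration; it is not the concept link graph construction used in this thesis, and the vocabulary and window size are arbitrary assumptions.

```python
from collections import Counter

vocabulary = ["graph", "ranking", "document", "neural"]   # the vector w

def bow_vector(text, vocab):
    """Count vector a: a[i] = number of occurrences of vocab[i] in text."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cooccurrence_weights(text, vocab, window=3):
    """Crude link strengths: count vocabulary word pairs within a window."""
    tokens = [t for t in text.lower().split() if t in vocab]
    weights = Counter()
    for i, t in enumerate(tokens):
        for u in tokens[i + 1:i + window]:
            if u != t:
                weights[tuple(sorted((t, u)))] += 1
    return weights

text = "graph ranking of document collections using graph neural models"
print(bow_vector(text, vocabulary))          # e.g. [2, 1, 1, 1]
print(cooccurrence_weights(text, vocabulary))
```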

Most machine learning techniques can be used for the ranking of documents without an encoded structure. For example, the PageRank method is such an algorithm, based on the Power method for the determination of the first few dominant eigenvalues and eigenvectors of a given matrix. To process a document represented as a graph structure, most machine learning algorithms, e.g., support vector machines, multilayer perceptrons, the self organizing map method, locally linear embeddings, ISOMAP, etc., first need to flatten the structure of the document into a vectorial format (by ignoring the contextual structures among the words) and then process these vectorial inputs. It seems a wasted effort to extract the structural representation of a document only to throw it away, since such structures are not used in the machine learning part of the methodology. Hence, in this thesis we will use a supervised machine learning method, called the graph neural network (GNN), which takes into account the structure of the documents [82]. In other words, the graph neural network can accept and encode graphs (as well as vectors or sequences). Once we accept such a framework for the representation of the set of documents, and use a graph neural network to encode the resulting graphs, it is a relatively small step to ask the question: how do we handle the graph of graphs situation, in which the set of documents is inter-connected? An obvious approach is to extend the graph neural network model to encode such a situation as well.

1.3 Research Objectives

The aim of this thesis is to introduce machine learning approaches capable of ordering structured documents such that the documents which are most influential are listed first, while those which are of lesser influence are ranked lower. The significance of this thesis is to provide a method for ranking structured objects in a structured domain. This thesis will analyse the impact of a (new) document on the web. For example, given a new scientific paper, we wish to investigate how this paper is related to other published scientific papers, and what its links with these already published papers might be. If we are given a new web page, with no linkage to other web pages on the web yet, the question to ask is: what might its linkage with existing web pages be? Which pages should link to it, and which pages should it link to? This question is interesting as a new web page does not have any PageRank (as there are no web pages linked to it).

This thesis will introduce machine learning approaches that are suitable for such tasks. Thus, this thesis considers three challenging tasks which require the development of a number of novel approaches and methods.
1. Find a suitable methodology for representing both the intra-structure of documents (e.g., XML, HTML, or plain text documents) and the inter-structure between documents (e.g. the hyperlink structure or citation structure).
2. Find a suitable machine learning methodology for modelling such structured information.
3. Apply the methods to structured documents with the aim of ranking these documents by their importance.

1.4 Research Design

The design of the approaches to handling the three tasks is described as follows:
1. Find a suitable methodology for representing both the intra-structure and inter-structure of the documents: Plain text documents are flat in structure (i.e. a sequence of characters and words, without any explicit structure except the common punctuation and separated paragraphs). However, it is important to note that the characters and words in a text document do not normally occur as a random sequence, and hence the words and word phrases in a document are not independent of one another. Contextual relationships exist among the words in a document. The extraction of such context, which is embedded within a document, allows the discovery of meaning or the recognition of the category to which the document belongs by a suitable machine learning method. We plan to approach this task in various ways:
- using semi-structured documents; this refers to documents which are described by a markup language such as XML, HTML, PDF, etc. The markup language defines a tree-like structure over the document, and hence can be used to extract structures from a set of documents. The main problem with this approach is that the structure refers to the formatting of a document rather than to its actual contents.

- using a concept link graph approach; there are various flavors of this approach. For example, a graphical representation can be extracted by grouping words and word phrases using a Self-Organizing Map approach. The connectedness between the clusters defines the relationships between the words and word phrases, and hence a fully connected undirected graph representation is obtained for each document in a collection.
- documents (such as web documents and scientific documents) often share relations with other documents in a collection. These relations are given by the citations or the hyperlinks embedded within a document. Hence, there is an inter-document structure which can be obtained quite easily.
2. Find a suitable methodology for modeling structured information: We propose to develop machine learning approaches to encode graph structured data. There is a body of work available in this area which provides methods for encoding various types of graphs, such as labelled ordered graphs, cyclic undirected graphs and others. There are supervised and unsupervised methods available. We plan to extend some of these methodologies so as to allow the encoding of more generic and hybrid graphs. This will be necessary given that the graphical representation of a collection of documents contains several (at least two) types of graphs: the intra-document graph, which gives a structural description of a given document; and the inter-document graph, which describes the relationship among different graphs in a collection. We plan to approach this task through an extension of existing methods (i.e. through the introduction of a recursive element).
3. Apply the methods to encode structures with the aim of ranking documents by importance: We plan to analyse a time-series snapshot of the data collection in order to identify documents which have had the greatest influence on other documents. For example, we will analyse the development of the link structure in the neighborhood of a newly added document over time. This will allow us to train models which extract important features for the identification of the potential impact of a given document. We also plan to label documents by using available ranking techniques, such as PageRank and Impact Rate.

such as PageRank and Impact Rate. We will assess whether a document introduces new and important concepts, as well as whether the document contains reliable information. Hence, our ranking method will give a high rank to documents which introduce verifiable new concepts and have a high probability of attracting more in-links in the future.

1.5 Research Scope

This research aims to develop a new ranking scheme for structured documents. A set of machine learning tools will be developed by extending some available methods which are suitable for the tasks at hand. A simulation of these extensions is then coded in C, and driven by command lines for the experiments on some artificial data and some real world learning problems. Hence, the user interface is non-existent, as this is intended to be research-only software rather than software built with a wide user audience in mind. The code is available on-line via the Web site. The software is developed to allow experiments to be run over a reasonably long time (over weeks) and where the memory usage is limited to 16GB for a single run. The proposed methods will be evaluated on benchmark datasets which are publicly available, and the experimental results obtained are compared with those obtained by other researchers. The experiments in regard to the impact ranking of documents are conducted on a snapshot of the CiteSeer dataset which is not publicly available. Since this is research work on applying machine learning methods to rank documents in structured domains, it is hard to make a direct comparison with other researchers' results. The labelled data in the dataset will be split into training and testing datasets for evaluation purposes.

1.6 Thesis Outline

The thesis is organised as follows:

Chapter 1: Overview of the research, which includes the background of the research topic, the project objectives, the main tasks and the design of the research tasks.

Chapter 2: Literature review of the related topic areas, with emphasis on the contributions and limitations of some available ranking algorithms and machine learning methods for modelling documents represented as graph structures.

Chapter 3: Exploration of various representations for structured documents, including vectorial features and structural features, which can contribute to the research task.

Chapter 4: Investigation of possible machine learning approaches to solve the research problem, including unsupervised and supervised methods which can be used to model structured documents.

Chapter 5: Detailed description of all the benchmark datasets involved in this research, including the necessary analysis and pre-processing of the data.

Chapter 6: Application of the unsupervised machine learning models to solve clustering tasks for the purpose of evaluating and identifying suitable methods and useful features, and a discussion on the role of clustering.

Chapter 7: Application of the supervised and combined machine learning models to solve classification tasks with the aim of evaluating and identifying suitable methods and useful features for the document ranking experiments.

Chapter 8: Implementation of the ranking of structured documents from the aspect of content by employing suitable machine learning methods which encode documents represented as temporal graphs.

Chapter 9: Summary of the findings of the research and some suggestions for future work.

Chapter 2

Background and Literature Review

2.1 Introduction

Research on efficient information retrieval methods has been a main area of concern since the emergence of the digital age. Together with the dramatic increase of human knowledge, new ways for information storage and information retrieval are introduced into the digital world. Scalable techniques for information indexing and retrieval from very large databases are imperatively needed. The World Wide Web is currently the most important platform for the storage of information and the sharing of knowledge. The World Wide Web follows a number of established standards such as the Hypertext Transport Protocol (HTTP), which standardizes communication between entities in the World Wide Web, and the Hypertext Markup Language (HTML) and the eXtensible Markup Language (XML), which standardize information representation in the World Wide Web. However, there is no existing standard which specifies how information can be found on the World Wide Web. Search engines such as Google, Excite, Yahoo, Baidu, Bing, etc., have filled this gap. These search engines fulfill a very important role in the World Wide Web. The World Wide Web would be rendered largely useless without having some means of search functionality. Due to the known exponential rate by which the size of the World Wide Web increases, efficient and effective document search methods on the Web have become essential for people to obtain the information they need. The development of search engines started with the basic catalog indexing principle [6]. Early search systems worked by collecting the resources available on the Web and then by grouping the information into hierarchical categories [6]. With the increase of the size of the Web, more scalable search engines emerged [6]. Modern search engines crawl large amounts

1 In the following, we will use the term Web to refer to the World Wide Web.

of Web documents recursively and store the crawled data in very large databases. The information stored in those databases is indexed in order to facilitate faster access to the requested information. In response to a user's query, search engine systems are capable of efficiently finding relevant information within the indexed database and providing a sorted set of results to the user. Ranking algorithms take a central role in search engine design since their performance has a direct influence on user acceptance and user satisfaction with a search engine [58]. The improvement of ranking techniques is a very active area of research. Approaches to the ranking of Web documents include numerical approaches [15, 34, 6, 72], methods employing cognitive processes [7, 69, 84], and machine learning methods capable of modelling ranking algorithms [45, 78, 82, 98]. This thesis concerns the ranking of documents based on document impact factors. Two application domains are considered, viz. the ranking of Web documents and the ranking of scientific documents. This Chapter will provide an overview of some of the more important existing document ranking algorithms and some search engines. Some existing machine learning approaches for modelling ranking algorithms are also reviewed. Since the learning problems will involve documents that may be represented structurally (i.e., by document structure, hyperlink structure, or both), we will also provide an overview of machine learning models for structured data.

2.2 Document Search on the Web

We already identified in Chapter 1 that a general Web search system consists of three main elements:

1. Crawling: Retrieves a copy of documents found on the Web.
2. Indexing: Creates a dictionary of words and phrases found in the retrieved documents, then associates to each dictionary word the documents that contain the word.
3. Querying and Ranking [97]: Accepts a query term from a user, then orders the matching documents according to a given ranking criterion.

These three elements dictate the general architecture of a Web search engine. The basic architecture of a Web search engine is illustrated in Figure 2.1. During the crawling phase, a robot program called a crawler or spider collects Web pages by following the

link structure of the Web. This is done in an automated fashion, and executed either at periodic intervals or as a continuous process. The pages collected are indexed and then stored in a database.

Figure 2.1: The architecture of a basic Web search engine system (components: WWW, Crawling, Indexed Database, Querying, Ranking, User Interface, Users, and the returned results).

Web indexing is an analog of book-style A-Z indexing defined on the collective of Web documents, and is a procedure that plays a central role in achieving the scalability of the search functionality of a Web search engine. An extract of a Web index table is shown in Table 2.1.

TERM       URLs
abalone    ...
Andromeda  ...
...        ...
zircon     ...

Table 2.1: An extract of a Web index table.

Indexing is the process of extracting meaningful items on the page (mainly from the text contents of the page) and systematically sorting and storing the information in the database. Once the indexed database is built, a user can issue queries, and the search system will search for the query term within the indexed database and return the matching records to the user. For example, a user may issue a search request for the term abalone. By considering the index in Table 2.1, a search engine would immediately be able to determine the complete set of Web documents that contain the requested search term. Note that the indexing process normally ignores words which do not carry information of value. For example, articles such as the, a, an are normally found in any document containing text, and hence, are not normally indexed.

2 Terms are normally hashed. Thus, a query word can be found in the index table within one time step, independent of the size of the index table. Since the access occurs in constant computational time, such accesses can be referred to as being immediate.
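To make the indexing step concrete, the following is a minimal sketch of how an inverted index such as the one in Table 2.1 could be built and queried. The tokenisation rule, the stop-word list and the toy corpus are illustrative assumptions and not part of any particular search engine described here.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an"}  # articles are not indexed, as noted above

def tokenize(text):
    # Lower-case the text and keep alphanumeric word tokens only.
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]

def build_index(documents):
    # Map each term to the set of document identifiers (here: URLs) containing it.
    index = defaultdict(set)
    for url, text in documents.items():
        for term in tokenize(text):
            index[term].add(url)
    return index

# Illustrative toy corpus (invented URLs and contents).
docs = {
    "http://example.org/a": "Abalone is a common name for a group of sea snails.",
    "http://example.org/b": "The Andromeda galaxy is a spiral galaxy.",
    "http://example.org/c": "Zircon is a mineral; abalone shells also contain minerals.",
}

index = build_index(docs)
print(sorted(index["abalone"]))  # all URLs whose text contains the query term
```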

The third of the three core elements of a search engine is the ranker. The set of returned results is pre-sorted according to a given ranking/scoring mechanism. This is done at the time of the search. Depending on the ranking algorithm, the document ordering may be rearranged according to the search query. Ranking of documents is done in recognition of the fact that a large number of documents may match a given user query. For example, there are over 4.15 billion documents currently indexed by the Google search engine which contain the term world. A user would not be able to take all matching documents into consideration, nor would it be feasible for a search engine to return the complete list of matching documents to the user. Thus, most search engines return a limited number (such as 10) of matching documents to the user at a time. The main question is which 10 out of the list of matching documents should be returned to the user. In order to achieve user acceptance and user satisfaction, it is paramount to ensure that the limited number of documents returned to a user are the ones most likely to meet the user's expectations. As a result, Web search engines adopt ranking algorithms in order to define an order on the matching documents. For example, Yahoo! utilizes a relevancy based ranking scheme. In such a ranking scheme, documents are considered more relevant the more often they contain the query term, and the more they contain terms that are related to the query term. In contrast, Google adopts a popularity ranking scheme. In such a ranking scheme, a document is considered more popular the more hyperlinks point to it. More recent approaches take cognitive processes into account when ranking documents [58]. There are numerous other ranking methods. All of the existing ranking methods depend on either document content, document reference or domain reference structure, or on cognitive or random processes. Once ranked, documents of the highest rank are returned first in response to a user query. It can be stated that the commercial success of a search engine largely depends on the ranking method adopted by the search engine. Section 2.3 will introduce some of the more well known ranking methods, and will offer a discussion on their limitations.

3 Checked on 1/May/2011.

2.3 Document Ranking

The development of effective ranking algorithms has been a challenging problem since the introduction of search engines. Many of the currently most widely adopted ranking techniques are based on link analysis and relevance measures. The ranking results commonly depend on the document content, the links, and the quality of the links. Some methods also require an analysis of the quality of the source document of a link [58]. However, link-based approaches are not always reliable since they can easily be subjected to link farming and other forms of Web spam. For example, Web spam misleads the search engine by inserting duplicate keywords and spam links into a page to increase its ranking score. While this may be easily identified through manual inspection, the sheer size of the Web does demand some automatic and intelligent detection of spam on the Web [27]. This section summarizes some of the common ranking algorithms, including Alexa, PageRank, TrustRank, HillTop, topic-sensitive rank, impact rate, and ERA rank.

2.3.1 Alexa

Alexa is a web service founded in 1996 which publishes the ranking of websites throughout the world. As of February 2011, it had collected 4.5 billion URLs from over 16 million web sites. The Alexa crawler collects approximately 1.6TB of web data per day, and a snapshot of the Web is created regularly at a rate of every two months [4]. Alexa collects the following information for the purpose of building the site rank:

1. Site Info (such as traffic ranks, search analytics and demographics). The traffic rank is computed mainly based on monitoring the traffic data from Alexa Toolbar users and other traffic data sources. The traffic rank can be derived from the average value of the quantities of reach and number of page views for the sites on the Web over time. Here, reach is the percentage of users who access a given site, and page views measure the number of pages on the site visited by the users. The results of these measures are compared with previous records every three months, which shows the change in the rank over time.

2. Related Links (sites containing similar or relevant information to the one currently being viewed) [4].

There are two types of ranking schemes in Alexa: absolute ranking and categorized ranking. The absolute ranking defines an ordering among all websites available, while the categorized ranking provides a rank for websites within a particular categorized domain. As mentioned, the ranks are computed according to a combination of two measures: user reach and page views [4]. This is implemented by monitoring user accesses to the websites via the Alexa toolbar. This means that Alexa ignores the visits from users who do not make use of the toolbar. This may introduce biases into the ranking results since it only covers the category of users who agree to having their Web surfing behaviour monitored.

2.3.2 PageRank

Google is the World's most popularly used search engine [76, 83]. Its popularity arose out of outperforming other search engines with a remarkable ranking algorithm called PageRank [72]. PageRank performs link analysis on the hyperlink graph of the Web. The principle of PageRank is to provide an estimation of the importance of documents based on a recursive analysis of the links among pages on the Web. If a page X links to page Y, then the importance of page Y increases. If a page X is linked by many other pages, then page X is considered to be an important page. If page Y is not linked by a large number of other pages, but is linked by page X which is an important page, then page Y is also given a high rank [72]. The core of the PageRank formula is shown in Equation 2.1.

PR(X) = Σ_{i=1}^{n} PR(T_i) / C(T_i)    (2.1)

Here, PR(T_i) is the PageRank of page T_i which links to page X, and C(T_i) is the total number of outgoing links from page T_i. It can be seen from the formula that the more links a page receives from other pages, and the smaller the number of outgoing links from those pages, the higher the rank of that page will be. However, given the size of the Web, the number of highly ranked pages in the Web should be kept small. Moreover, there is a large number of pages without in-links and without out-links, so that in this case the equation returns a zero rank value for those pages. In order to avoid problems with such link sinks, a damping factor is introduced, resulting in the actual PageRank equation as shown in Equation 2.2.

PR(X) = (1 - d) / N + d Σ_{i=1}^{n} PR(T_i) / C(T_i)    (2.2)

where d is called the residual probability (damping factor) and is often set to a value close to 1 (in this thesis we will use d = 0.7 for the experiments), and N is the number of documents in the collection. The PageRank algorithm can be interpreted as a random web surfer who arrives at a page and continues browsing other pages via the hyperlinks without backtracking. If a page does not have any outgoing links, the surfer switches to another URL with a probability controlled by d. Thus, d is a factor that approximates the probability that a surfer (re-)starts from another random page. In other words, the higher the PageRank of a page, the more frequently the surfer would discover this page.
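A minimal power-iteration sketch of Equation (2.2) is given below. The toy link structure, the convergence tolerance and the iteration limit are illustrative assumptions; the damping factor follows the value d = 0.7 used in this thesis, and dangling pages are handled in the simplest possible way.

```python
def pagerank(out_links, d=0.7, tol=1.0e-9, max_iter=100):
    """Iterate Equation (2.2) until the rank vector stops changing."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # uniform initialisation
    in_links = {p: [q for q in pages if p in out_links[q]] for p in pages}
    for _ in range(max_iter):
        new_rank = {}
        for p in pages:
            # Sum of PR(T_i) / C(T_i) over all pages T_i linking to p.
            s = sum(rank[q] / len(out_links[q]) for q in in_links[p])
            new_rank[p] = (1.0 - d) / n + d * s
        if max(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            return new_rank
        rank = new_rank
    return rank

# Illustrative toy Web graph: page -> set of pages it links to.
toy_web = {"X": {"Y"}, "Y": {"X", "Z"}, "Z": {"X"}}
print(pagerank(toy_web))
```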

PageRank is different from Alexa in a number of ways. Firstly, Alexa treats all pages from the same website the same, whereas PageRank computes a rank value for each individual page available on the Web. Simply speaking, PageRank assigns page importance according to the recursive link structure of the Web, and the importance of a page also depends on the importance of other pages that have a link relationship with it. A link is usually built from one page to another page when the latter contributes useful, relevant and high quality information. Secondly, PageRank ranks pages according to the analysis of existing links, and hence, can provide a quantitative assessment of the importance of a page. The larger the PageRank value, the earlier the page is listed in the result set. Thirdly, this procedure does not require any human assessment, so that it can produce more representative ranking results than those approaches which involve human factors (as is the case with Alexa). However, one drawback of PageRank is that it only gives well-linked documents a good rank. For a new page that contains useful content, link analysis does not take effect if the page has not yet established in-links, and hence, it will not be given a high rank until the linking environment around the page has been established. Since new pages are ranked low, they are less likely to be discovered by users. This in turn causes those pages to be less likely to attract new links. This is a well known problem with PageRank and is called the richer-gets-richer problem (or the poor-stays-poor problem).

PageRank is a celebrated ranking algorithm and has attracted many researchers to study its properties [12, 14, 64] or to propose enhancements [29, 51, 91, 95]. In [29], Eiron et al. improved PageRank by considering several features of the web and proposed the HostRank and DirRank methods. These methods have been shown to have an advantage in computational cost and to be less affected by manipulation when compared with PageRank. Boldi et al. discussed the damping factor d used in the PageRank formula in depth [14]. Different values for d were tried, and it was acknowledged that a value of d closer to 1 does not show an advantage in providing meaningful ranking results. They also proposed a closed-form formula for PageRank derivatives of any order and built an extension of the Power method by approximating them with convergence O(t^k d^t) for the k-th derivative, which is helpful for determining new forms of ranking. By taking into account the sensitivity of the hyperlinks, a so-called Link-sensitive PageRank was proposed in [95]. This approach addressed the problem that the original PageRank algorithm does not consider the search keywords requested by users. It is shown that the enhanced algorithm outperforms PageRank and effectively addresses topic drift. In another attempt at improving PageRank [95], Qiao et al. proposed an extension of PageRank based on a similarity measure from the vector space model, called SimRank. In addition to the standard link analysis of PageRank, the similarity between the page and the user's query is also involved. The results were compared with other approaches and are promising. Attempts to provide personalized ranks were made in [51, 91]. This was done by using a customized damping vector rather than a damping value, and by using a machine learning approach, respectively. Those approaches are successful in controlling the ranking of documents depending on the content of a document, or depending on the preferences of individual users.

2.3.3 TrustRank

TrustRank is a link-based document ranking algorithm designed especially for combating Web spam [42]. The TrustRank algorithm uses both human analysis and machine automation. A selected small subset of Websites is first evaluated by experts. Then a focused crawler collects Web pages that exhibit similar reliability and trustworthiness to the ones that have been identified in the subset. As with Alexa, TrustRank is computed for each host or domain, which may contain a set of pages, rather than for individual pages. A main disadvantage of TrustRank is that it is not scalable since it requires human participation. Moreover, TrustRank values are not continuous and can be subject to contradictions. The former has been shown to be an inhibiting factor when attempting to model TrustRank using machine learning approaches [95], whereas the latter is due to variations in perception amongst several human assessors.

2.3.4 HillTop

The HillTop algorithm follows the core concept of PageRank by analysing the links and the quality of the links. The main difference is that HillTop only uses documents which are relevant to a particular keyword topic. In HillTop, a page which links to many relevant documents is called an Expert, and an Authority is a document that has a lot of in-links from Expert pages [1]. Authority pages are generally ranked higher. The results returned in response to a user's query will be sorted by examining the relevancy between the query and the descriptive text of the hyperlinks on expert pages linking to a given result page. The HillTop algorithm mainly consists of two steps: 1. Expert lookup; 2. Target ranking. Here, an expert page is defined as a page about a particular topic which has a lot of links to non-affiliated pages on the topic. The set of expert pages is crawled and selected by removing non-affiliated pages and pages which have fewer links than a given threshold value. Such a subset of expert pages is then indexed. The HillTop algorithm considers a page to be an authority on the query topic if and only if some of the best experts on the query topic point to it [1]. However, the HillTop algorithm depends largely on the search for, and decision on, expert pages. The selection of the expert pages involves the participation of human assessors. The procedure is thus unable to guarantee impartiality. Moreover, HillTop also neglects the impact of non-expert pages. The subset containing expert pages may not be able to provide a good coverage of the pages in the Web. A possible situation is that for a particular topic, fewer than two expert pages can be identified, and thus, the HillTop algorithm may return unsatisfactory responses.

2.3.5 Topic-Sensitive Ranking

The contents of the Web vary greatly and can usually be categorized into different areas or topics. However, some Web content is affected by ambiguity, such as Websites referring to the Amazon. Those Websites may address the Amazon river, the Amazon region, the mythological Amazon female warriors, or Amazon the online Web service. Moreover, the category of Web content is affected by context as well as by the motivations of a Web user. For example, a user searching for the term house may possibly be interested in

4 This is an incomplete list of meanings of the term amazon. There are several other meanings associated with this term.

buying a house. But the person could also be interested in the price and the location of the house, in the architecture of the house, in the definition of the term house, or in the history of a house. There could be another user searching for the term house but for a different purpose, say for holiday accommodation or a house swap. Hence, the ordering of the search results should ideally take into account the various topics that the users may likely be interested in. Topic-sensitive ranking is an important extension to PageRank that uses the topic of the context around where the query appeared. In [2], Agrawal et al. analysed the contextual preferences accumulated from various sources and used them to create a prior ordering of two items in a particular context in an online preprocessing step, and then used these orderings and the associated contexts at query time to provide quick ranked answers. However, how to balance the efficiency of the method and the precision of the outcome is still an unanswered question.

2.3.6 Journal Impact Factor

We have already established that some of the most successful ranking algorithms engage link analysis methods [72]. Thus, the documents on the Web can be ranked according to an analysis of the hyperlink structure. Similar possibilities exist with scientific documents when accessing the citation link structure [38]. Scientific documents are published at different venues, such as conferences, workshops, journals, books, etc. Such documents are rarely independent from other documents since there is normally a reference and citation structure that defines dependencies and the relatedness of the content of one document to the content of some other documents. It is generally assumed that high quality and important papers are more likely to be cited by subsequent papers. A main difference from the Web document case is that once a link (a citation) is created in a published work it will remain indefinitely, whereas for Web documents, a link can be created and deleted at any time. In other words, the topology of a citation graph is more stable. An additional consequence is that the citation graph resembles a multi-rooted tree structure, whereas the Web graph is a more general graph exhibiting many cyclic dependencies. Thus, the finding of purpose-built methods to evaluate and rank the impact of a paper on a certain scientific domain is a necessary and useful exercise. In [34], Garfield proposed the impact factor, which is used to measure the average

number of citations to the articles published in scientific journals. The calculation of the impact rate for a journal j consists of two components: a numerator and a denominator. Given a year y, use A to denote the number of citations to any papers published in journal j during the two preceding years before y; use B to denote the number of papers published in the same two years. The impact factor is equal to A/B. It can be seen that the impact factor can reveal the impact of the journal over time. A higher rank is given to those journals with a more dynamic status. For example, the number of citations for a particular journal may grow quickly during a short period. This will boost the rank of the affected journal. However, even if a journal received many citations within a short time at some time in the past, the rank will decay as time passes if no further citations are introduced. However, this approach considers the quantity of citations only and has a number of additional limitations. Firstly, it is possible that a paper receives a large number of citations from papers which may not themselves be important papers. Since the computation of the impact factor is only based on the number of citations, this can result in a biased result when compared to papers that were cited by fewer papers but which were of higher significance. The lack of quality control on the sources of a link also renders the algorithm prone to spam. Furthermore, there could be a case where a paper A is cited by another paper B which is cited by an even later paper C. Such an indirect relationship also implies the importance of paper A and the non-ignorable contribution from paper A to paper C. However, the journal rank method does not capture such indirect citations.
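A minimal sketch of the impact factor computation described above is shown below; the citation and publication counts used in the example are invented purely for illustration.

```python
def impact_factor(citations_to_recent_papers, papers_published):
    """Impact factor A / B for a journal in a given year y:
    A = number of citations to papers the journal published in the two years preceding y,
    B = number of papers the journal published in those two years."""
    if papers_published == 0:
        raise ValueError("journal published no papers in the two preceding years")
    return citations_to_recent_papers / papers_published

# Illustrative example: 200 citations to the 100 papers published in the
# two preceding years gives an impact factor of 2.0.
print(impact_factor(200, 100))
```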

2.3.7 ERA rank

The Excellence in Research for Australia (ERA) initiative offers a ranking method based on cognitive processes [7]. While the ERA rank is not normally applied to ranking Web documents, it is an important index of the importance of scientific papers. The ERA indexes all major journals and conferences. These journals and conferences are assessed by international experts, and are ranked into one of five possible categories: A*, A, B, C, and unranked. The ranking is then published via a government Website [7]. This ranking scheme is least likely to be subjected to spam but is slowest to adapt to changes. The ERA ranking scheme is updated irregularly at a rate of less than once a year. Moreover, the ERA rank is with respect to venues rather than with respect to individual papers. Hence, all papers published at the same venue are ranked equally.

2.4 Machine Learning Methods for Encoding Structure Domains

Machine learning aims to develop algorithms that allow computers to learn useful information from data [55]. There are a large number of successful approaches to machine learning [17, 3, 31, 54, 61, 8]. The algorithms have been applied successfully to numerous learning problems such as natural language processing, computer vision, bioinformatics, pattern recognition, document mining, and many others [94]. However, the majority of these algorithms are limited to processing objects which are represented in terms of fixed dimensional feature vectors. Such a simple representation is far from adequate for most learning problems that involve causal or contextual information. For example, in a document classification task, vector-based features may overlook the contextual information that is delivered by the content or the organization of a document. Structural representations, such as sequences, trees and graphs, show higher representational power than plain vectors. This requires machine learning systems that are capable of encoding structure domains. The following sections provide a review of some existing machine learning methods for encoding structured data.

2.4.1 Unsupervised methods

Numerous learning problems provide data for which no meaning is defined. In other words, there is either no knowledge or very limited knowledge about the meaning of the data. Such data is said to be unlabelled. Machine learning algorithms that can deal with unlabelled data are said to be unsupervised machine learning algorithms. In unsupervised learning, the data is assumed to be unlabelled and the learning process aims to discover intrinsic features of the data [25]. Kohonen's Self-Organizing Map (SOM) [61] is one of the most well-known and most widely applied unsupervised machine learning methods. The SOM allows the topology preserving projection of high dimensional data onto a low dimensional display space. The application of SOMs is mainly for the dimension reduction of data and the clustering of data.
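A single SOM training step (best matching unit search followed by a neighbourhood-weighted update) can be sketched as follows. The map size, the learning rate and the neighbourhood width are illustrative assumptions and not parameters prescribed by the methods reviewed in this Chapter.

```python
import numpy as np

def som_train_step(weights, grid, x, lr=0.1, sigma=1.0):
    """One SOM update: find the best matching unit (BMU) for input x and
    pull all map units towards x, weighted by a Gaussian neighbourhood."""
    # weights: (n_units, dim) codebook vectors; grid: (n_units, 2) unit coordinates.
    dists = np.linalg.norm(weights - x, axis=1)
    bmu = np.argmin(dists)                           # winning unit
    grid_dist2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
    h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))     # neighbourhood function
    weights += lr * h[:, None] * (x - weights)
    return bmu

# Illustrative 5x5 map trained on random 3-dimensional data.
rng = np.random.default_rng(0)
grid = np.array([(i, j) for i in range(5) for j in range(5)], dtype=float)
weights = rng.random((25, 3))
for x in rng.random((200, 3)):
    som_train_step(weights, grid, x)
```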

However, most real world learning problems involve data which is more complex in structure. Strings and sequences are a more powerful representation for describing objects. For example, in speech recognition, the data is generally made available as a temporal sequence. The meaning of the sequences may not be given; unsupervised machine learning methods group the data together according to some similarity criterion. There exist SOMs capable of encoding data sequences [61]. However, sequences are only a simple case of richer data structures such as the tree data structure and graph data structures. General graph data structures encompass data trees, data sequences, and data vectors as special cases, and are more suited to represent learning problems in molecular chemistry, image and video processing, document mining, and in many other domains. In fact, any machine learning method capable of encoding graphs will also be able to be applied to learning problems involving vectors, sequences, or trees. In other words, these algorithms allow a projection from the domain of graphs to a fixed-dimensional display space. Among these models, the SOM for Structured Data (SOM-SD) [48] and the Contextual SOM-SD (CSOM-SD) [59] can deal with directed acyclic graphs. The model proposed in [41], which is a standard SOM using a specific graph edit distance, is able to deal with generic graphs. However, the latter approach does not scale beyond toy problems. A much more powerful and much more efficient algorithm, the GraphSOM model proposed in [49], is able to deal with undirected and non-positional graphs by representing each graph vertex through the activation pattern generated on the SOM by its neighbors. There exist a number of other unsupervised machine learning algorithms for the encoding of graphs [75, 86]. However, these are unrelated to the purpose of this thesis and hence, a discussion of these will be omitted here.

2.4.2 Supervised methods

Some learning problems provide ground truth information. Thus, some knowledge about the meaning of the data is available with such data. Such knowledge is commonly used as a target for a machine learning algorithm, and the algorithm itself defines a target error function that is to be minimized. Thus, in contrast to unsupervised learning, teaching signals are available in supervised machine learning. Given the supervision from target values, a machine learning method is expected to extract rules or infer the underlying function from the data and generalize the results to unlabelled data. Some supervised neural networks are able to encode structured information, which opened the door to solving learning problems involving graphs without the need for pre-processing. Recursive Multi-Layer Perceptron networks (RMLP) trained via a Back-Propagation Through Structure (BPTS)

algorithm [32] are an MLP-based approach [52] capable of encoding tree structured data. Their practical abilities were demonstrated on image classification tasks by processing a set of images represented as a set of directed trees [28, 32]. Similarly, Recursive Cascade Correlation (RCC) [13] was shown to solve logo classification problems by processing company logos represented by a region adjacency graph. In [11], a methodology is proposed to allow recursive neural networks to encode cyclic and unordered graph structures. This is implemented by converting the cyclic unordered graph into a recursively equivalent tree. The currently most powerful supervised machine learning method is the Graph Neural Network (GNN) [81]. The GNN was shown to be able to solve any practically useful learning problem involving graphs. However, these methods are still limited to dealing with graph structures where the nodes and links are labelled by a real valued vector of fixed dimension. In other words, none of the aforementioned methods is able to deal with graphs which are labelled by other graphs, trees, or sequences.
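To illustrate the recursive flavour of these models, the sketch below iterates a simple contraction-style state update in which each node's state is computed from its own label and the states of its neighbours, yielding one fixed-dimensional state per node. The linear transition function, the parameter initialisation and the toy graph are illustrative assumptions only and are not the actual GNN formulation of [81].

```python
import numpy as np

def node_states(adj, labels, state_dim=4, iters=50, seed=0):
    """Iterate x_v <- tanh(A * mean(x_neighbours) + B * label_v) until roughly stable."""
    rng = np.random.default_rng(seed)
    n, label_dim = labels.shape
    A = 0.3 * rng.standard_normal((state_dim, state_dim))  # small weights, contraction-like
    B = rng.standard_normal((state_dim, label_dim))
    x = np.zeros((n, state_dim))
    for _ in range(iters):
        new_x = np.zeros_like(x)
        for v in range(n):
            nbrs = np.nonzero(adj[v])[0]
            agg = x[nbrs].mean(axis=0) if len(nbrs) else np.zeros(state_dim)
            new_x[v] = np.tanh(A @ agg + B @ labels[v])
        x = new_x
    return x  # per-node states that an output network could map to targets

# Illustrative 4-node undirected graph with 2-dimensional node labels.
adj = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]])
labels = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
print(node_states(adj, labels).shape)
```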

Kernel methods are comparatively effective approaches for handling structured data. The core of kernel methods is the kernel function, which can be used for measuring the similarity between data in the feature space. The SVM is one element of the family of kernel methods and is commonly applied for data analysis and pattern recognition. The SVM is a concept for supervised methods that perform classification and regression tasks. For a given set of input data, an SVM can project the data into a higher dimensional space, so that examples can be separated by clearer gaps [23]. When using neural networks, there are always issues of network architecture selection, overfitting, local minima, etc. SVMs can outperform neural networks in solving non-linear, high-dimensional recognition tasks and learning tasks that involve a smaller number of examples. However, it is always difficult to determine an appropriate set of kernel function parameters for the SVM [23]. In [3], Fabio et al. proposed a novel tree kernel method for non-discrete domains called the Activation Mask Kernel. Such kernels are generated by using a compressed representation of structured data returned by the unsupervised machine learning method SOM-SD. This approach helps produce less sparse kernels and meanwhile preserves the structural properties of the data. The experimental results on INEX 2005 (a large set of formatted XML documents) showed an obvious advantage over the results from using an SVM with ST and SST kernels. However, the method is limited to the processing of tree data structures, which are only a very special case of generic graphs. This confirms again the need to develop a scalable machine learning approach for encoding GoGs.

2.4.3 Modelling Ranking Methods

Ranking defines an ordering on the results returned to the user. The underlying ranking function can be modelled through machine learning approaches. As a simple example, a machine learning system such as a neural network can be fed with input data consisting of a set of queries and a set of result pages. These query-result pairs are labelled with a relevance level which indicates how relevant the ranking results are in response to the query [16]. The training is performed on such data and aims to generalize the ranked results to unlabelled queries. Document ranking can be considered a regression problem [5], and it has been shown that the rank boundaries are significant for the training of some algorithms such as those described in [22] and [5]. In [24], Dekel et al. proposed a framework to model ranking by using directed graphs. A boosting-based learning algorithm is embedded into the framework and shows the applicability of the approach to learning ranking functions and dealing with large-scale text corpora. RankBoost [33] is another efficient ranking algorithm implemented through training on preferences, such as the results from different search engines. The results show that RankBoost has advantages in processing data that vary in size and in its ability to combine different approaches for ranking purposes. However, RankBoost uses a binary feedback function which may lack predictive power when compared to approaches capable of modelling regression ranking problems. In [53], a combined approach of the unsupervised SOM and a Bayesian probability model is proposed for dynamic Web document ranking. The approach first utilizes a SOM to cluster documents according to an entropy measure calculated between the query and the document, and then a Bayesian probability model is used to produce real-time classification results. The results, evaluated on a benchmark dataset, show that the proposed combined system can produce results with higher precision and stable recall. In [95], the GNN is applied to learn different Web page ranking algorithms such as PageRank [72], TrustRank [42], HITS [6] and OPIC [1]. The results show that the GNN is capable of learning the underlying ranking functions and also imply the potential of the GNN to learn ranking schemes without knowledge of the underlying algorithms.

2.5 Conclusion

Ranking is a crucial component in document retrieval and search engine systems. Ranking algorithms assist the system in producing results in a more logical and useful order. Commonly used ranking algorithms are usually based on link analysis, relevance measures, or manual assessment. Ongoing research on document ranking continues to explore possible improvements to the existing ranking algorithms, or pioneers solutions to emerging document ranking problems. Ranking schemes increasingly address the issue of Web spam in order to avoid manual assessment of a subset of documents. It is desirable that a ranking algorithm be an automatic and scalable procedure. Machine learning methods are often employed in data mining tasks due to their generalization ability and their insensitivity to noise. Current approaches train machine learning methods for the purpose of inferring the underlying ranking function. However, none of the existing machine learning methods consider link analysis and the document content together. This is a crucial requirement if the task is to rank documents by impact value. New concepts and novel ideas may be picked up by other researchers who in turn may publish works that cite the original paper. This implies that the evaluation of the impact value of a document requires the encoding of the link (citation) structure along with the actual content of both the source and destination documents. Since document contents were shown to be more appropriately described by structured representations [2], a suitable ranking algorithm would have to exhibit an ability to encode both the link structure as well as the content structure of a document. The aim of this thesis is to develop a machine learning approach capable of modelling the rank value of documents by their impact factor. This Chapter identified the need to develop an alternative and novel machine learning method for the task at hand. We will gradually approach the development of a suitable machine learning approach in the following Chapters by firstly describing the problem domain in some detail, and then by exploring possibilities of extending or modifying existing methods for this task. We will propose two solutions, one based on a Hidden Markov Model and another based on the Graph Neural Network method. It will be found that the Graph Neural Network approach is best suited for solving real world problems, and it will be shown that our proposal is successful in achieving its given aims.

Chapter 3

Document Feature and Structure

3.1 Introduction

A main aim of this research is to find a methodology for the representation of document features and domain structures. Instead of handling the entire document as unstructured text, numerous features can be extracted to characterize some of the document's properties. Features, e.g., the frequency of occurrence of words, summarize the contents and describe the properties of the document, and are usually represented in vectorial form. Such vectors are normally independent of one another, which implies that any relationship(s) among the vectors will be lost. On the other hand, some structural properties are available for documents such as Web pages, scientific papers, etc. There are two types of structural properties, viz., intra-document structural properties and inter-document structural properties. It is obvious that there are intra-document structural properties which relate words together so that the semantics and pragmatics of the text become clear. The semantic and pragmatic mechanisms are the relationships among the words such that the meaning of the text can be understood. It is the structural relationships among the words which make the content of the document clear. Documents can also have connections with other documents which represent the relationship between documents (hyperlinks, citations, etc.). These are the inter-document structural properties. Theoretically, it is possible to decompose a document into various components, down to the level of word tokens. For example, the text contents of a scientific document are usually organised into paragraphs or sections which are formed by sentences. A sentence can be parsed into words and their parts of speech according to the syntactic structure. Such structural information can be extracted for the entire document. Moreover, throughout the document there will be

references to other scientific papers which can also be analysed to obtain their structures. The linkage between the documents can also be effected through a directed link from one document to another. Hence, an integrated representation can be built based on both the document features and its structure. This reduces the complexity of the document and allows the building of models using machine learning algorithms. This chapter will explore various types of document features and structures, and investigate how these features can contribute to document mining tasks.

3.2 Document Features

A common method to process text documents is the bag of words approach, which will be described in Section 3.2.1 below. Then, we will provide descriptions of both content-based features and link-based features in Section 3.2.2 and Section 3.2.3, respectively.

3.2.1 Bag of Words approach

The bag of words (BoW) model is built on the assumption that a document is a container for a set of words. This approach ignores the ordering, grammar and syntax, in other words, any structural relationships among the words used in the document. Such a model treats all the unique words in the document independently; a word retrieved from a particular location is assumed to be independent of the other words in the document. To construct a BoW model of a document, we first extract all unique words in the document, with stemmed words and commonly occurring words not taken into account, as they are not discriminating; this will result in a dictionary for the particular document. Call this dictionary vector w. Next, count the occurrence of each word. The document can be represented as a vector a, of the same dimension as the w vector, where the i-th element of the a vector is the number of occurrences of the i-th dictionary word in the document [65]. In other words, the document is represented as a^T w. If we have a set of documents in the same corpus, then the dictionary vector is a merge of the unique words in all the individual dictionary vectors of the individual documents. Then, each document in the corpus is represented by a vector a_j, where the i-th element of the vector represents the number of occurrences of the i-th word of the dictionary in document j. This representation does not assume any meaning of the words in the dictionary.
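A minimal sketch of the bag-of-words construction described above is given below; the toy corpus and the (tiny) stop-word list are illustrative assumptions, and no stemming is performed.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "over"}  # illustrative list of non-discriminating words

def words(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def bag_of_words(corpus):
    # Dictionary vector w: all unique words in the corpus; count vector a_j per document.
    dictionary = sorted({w for doc in corpus for w in words(doc)})
    vectors = []
    for doc in corpus:
        counts = Counter(words(doc))
        vectors.append([counts[w] for w in dictionary])
    return dictionary, vectors

corpus = ["The dog jumps over the cat.",
          "The cat jumps over the mouse.",
          "The mouse jumps over the dog."]
w, a = bag_of_words(corpus)
print(w)  # ['cat', 'dog', 'jumps', 'mouse']
print(a)  # one count vector per document
```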

While BoW is popularly applied due to its simplicity, the document representation produced by BoW is inadequate for some applications. In applications where the context in which words occur is important, the BoW approach would not be a suitable method. Consider, for example, sentences like "The dog jumps over the cat", "The cat jumps over the mouse", and "The mouse jumps over the dog", in which the meanings of the sentences depend on the relationships among the words. In this case, a BoW would not work too well, as it will merely record two occurrences each of the words dog, cat, and mouse. The BoW does not differentiate the context in which the words are used. In such circumstances, it would be useful if the context within which the words occurred could be included in the representation of the document. An obvious method to include the context in which the words occurred would be to use some semantic type of representation in which the words are provided with meanings, or what is commonly called ontologies. If the words are provided with meanings or ontologies, and the links among words are included, then this will allow the machine to form a better understanding of the sentences. However, English words often have a number of meanings, dependent on the context. Hence, to encode the correct meaning of a word, one is required to indicate a particular meaning out of a group of meanings related to the word, or to disambiguate the meaning of the word. This is quite labour-intensive work, as one needs to understand the context in which a word is being used before deciding on a particular meaning of the word. Machine based disambiguation techniques also have not advanced to a stage which would permit the automatic disambiguation of words in context. In this thesis we have adopted an intermediate approach, known as the concept link graph (CLG) approach, which is an extension of the BoW method. This approach does not require one to encode the meaning of words, except for some inclusion of the context in which groups of words occurred in the text corpus. Thus, this method allows deployment to large scale problems, as it is largely data driven with little human intervention. The details of this approach will be provided in a later section.

3.2.2 Content-based Feature

Content-based features describe the properties of the documents which are extracted based on analyzing the document contents. In [19], Castillo et al. reported a set of

content-based features which can comprehensively describe the document contents. Some obvious features include: the number of words in the (web) page, the number of words used in the title, the average word length, the fraction of anchor text, the fraction of visible text (in the web page), etc. Among these possible features, only the words consisting of alphanumeric characters in the visible text are considered. Some features are meant to capture the compressibility of the (web) page, for example, the compression rate and the entropy of trigrams (three words occurring in the same phrase). The compression rate is the ratio of the size of the compressed text, using an application such as the bzip utility, to the size of the uncompressed version of the text. The entropy is an alternative measure designed to capture the compressibility of the page which is more macroscopic than the compressibility ratio and is computed based on the distribution of trigrams. They also proposed a set of features that measure the precision and recall of the words: corpus precision, corpus recall, query precision and query recall. The corpus measure builds a term list of the most frequently occurring words and computes the fraction of words in the page that exist in such a popular term list. The query measure is analogous to the corpus measure but uses a set of the most popular terms in a query log instead.
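A hedged sketch of two of the content-based features discussed above (the compression rate and the trigram entropy) is shown below. Python's bz2 module is used in place of the bzip utility, and the tokenisation and example text are illustrative assumptions; the exact feature definitions used in [19] may differ in detail.

```python
import bz2
import math
import re
from collections import Counter

def compression_rate(text):
    # Ratio of the compressed size to the uncompressed size of the visible text.
    raw = text.encode("utf-8")
    return len(bz2.compress(raw)) / len(raw)

def trigram_entropy(text):
    # Entropy (in bits) of the distribution of word trigrams in the text.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    total = sum(trigrams.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in trigrams.values())

page_text = "impact sensitive ranking of structured documents " * 20
print(compression_rate(page_text), trigram_entropy(page_text))
```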

3.2.3 Link-based Feature

Link-based features depict the inter-structural relationships of the documents. These features are computed over the links among the documents. There are mainly four types of link-based features:

- Degree-related measures: Degree-related features include the in-degree and out-degree of the page and its neighboring pages. Moreover, the edge-reciprocity (e.g., the fraction of out-links which are also in-links to the document) and the assortativity (e.g., the degree of a page divided by the average degree of its neighbors) of the links are considered.

- PageRank: PageRank is a well known link-based ranking algorithm that computes a score for a page to indicate the popularity of the page [72].

- TrustRank: TrustRank is another well known link-based ranking algorithm for computing a score for documents. It starts with a set of seed web pages which have been manually classified as trustworthy, in other words, non-spam pages. Then, the algorithm searches the web for pages which are similar to the set of seed pages and provides a score for the pages encountered. Obviously, the further a web page is, logically, from the set of seed pages, the less reliable the computed TrustRank will be [42].

- Estimation of supporters [19]: This is a method for estimating the number of supporters (web pages which link into the current page) of all web pages in the vicinity of the current page. This feature is mainly used for web spam detection.

These features characterize Web pages by numerical values, e.g., numerical scores.

3.3 Document Structure

The features extracted as described in the previous section are mainly statistical features. They do not encode any structural information, e.g., the contextual information among the words. The results of learning tasks, e.g., classification or regression, might be improved if richer representation models which encode contextual information are used. There are two types of structural relationships among documents: intra-document and inter-document structural relationships. The intra-document structure reveals how the elements of a document, e.g., word tokens, are related. For documents formatted in markup languages such as XML, HTML, GeoML, etc., structures can be defined according to the structural markers (e.g., XML/HTML tags). For documents which are flat in structure (containing only a sequence of words), such a structure can be extracted by inferring the relationships among words or concepts within the documents. The intra-document structure is defined by considering each document independently and separately; the inter-document structure, in contrast, models the inter-relationships between a particular document and the other documents in the same text corpus. Such relationships can be recognized via the hyperlinks of web pages, the citations of scientific documents, etc. In the following sections, some typical intra- and inter-document structures are described in more detail.

3.3.1 XML Tag Tree/Graph

XML is designed to carry data information enclosed in what is known as a tag, denoted in XML documents as <information>. In XML documents, contents are separated

by tags, and different tags summarize the information within the tag block. An XML document is normally represented by an XML parsing tree. The tree consists of nodes which represent the XML tags, and links which represent the nesting of the tags. The nodes can be labelled by an identifier which uniquely identifies the associated tag. With the parsing tree, it is easier to retrieve information from the XML document. The top level tag in the XML document is parsed to be the root node of the tree, and each tag belongs to one level of the tree. Similar to tag trees, each node in an XML tag graph represents a unique tag within the XML document, and edges represent the relationship between the tags. Since XML supports referencing other XML elements within the same document, such a graph may contain cycles or self-connections. Figure 3.1 shows an example of an XML document and the corresponding XML parsing tree and tag graph:

<Article>
  <Title>...</Title>
  <Body>
    <Section>
      <Weblink>...</Weblink>
      <Name>...</Name>
    </Section>
    <Weblink>...</Weblink>
    <Name>...</Name>
  </Body>
</Article>

Figure 3.1: An example of an XML parsing tree and graph (the XML document above, its tag tree rooted at Article, and the corresponding tag graph over the tags Article, Title, Body, Section, Weblink and Name).

Moreover, XML tags can be augmented by attribute-value pairs. This means that the nodes in the XML parsing tree may be labelled by the attribute-value pairs associated with the corresponding XML tag.
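A minimal sketch of extracting the tag tree of Figure 3.1 with Python's standard xml.etree.ElementTree module is shown below; the depth-first traversal and the printed indentation are illustrative choices only.

```python
import xml.etree.ElementTree as ET

XML_DOC = """
<Article>
  <Title>...</Title>
  <Body>
    <Section><Weblink>...</Weblink><Name>...</Name></Section>
    <Weblink>...</Weblink>
    <Name>...</Name>
  </Body>
</Article>
"""

def tag_tree(element, depth=0):
    # Depth-first traversal: each XML tag becomes a node, nesting becomes an edge.
    print("  " * depth + element.tag)
    for child in element:
        tag_tree(child, depth + 1)

root = ET.fromstring(XML_DOC)
tag_tree(root)  # prints Article, then Title, Body, Section, Weblink, Name, ...
```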

3.3.2 Concept Link Graph

A Concept Link Graph (CLG) is a fully connected undirected graph which describes the document contents. A CLG is generated in three main steps [2]:

1. Discover a set of concepts by clustering related nouns extracted from the training corpus of documents using a self-organizing map method. Related nouns used in a similar context in the training corpus are clustered. This is referred to as a term clustering step.

2. Given the term clustering results, for each paragraph in a document, map every existing noun to the concept to which it belongs. Then count the number of occurrences of every concept, paragraph by paragraph. Given the paragraph-based occurrence statistics of the concepts, each document is represented using a concept-paragraph matrix.

3. A singular value decomposition (SVD) is applied to the concept-paragraph matrix to compute a concept-to-concept association matrix. Formally speaking, given an n x m matrix A, n >= m, as a concept-paragraph matrix of m concepts and n paragraphs, decomposing A using an SVD returns A = UΣV^T, where U and V are unitary matrices, i.e., U^T U = V^T V = I, and Σ is a diagonal matrix with diagonal elements σ_1 >= σ_2 >= ... >= σ_m. Given the SVD result, Σ is interpreted as the theme matrix, where each of its diagonal elements represents the strength of its corresponding theme, U is the concept-to-theme relevance matrix, and V is the paragraph-to-theme relevance matrix. The concept link graph is thus obtained by computing the concept-to-concept association matrix AA^T = UΣ^2 U^T.

The elements in the concept-to-concept association matrix can be interpreted as the strength of the correlation between any two concepts. A CLG representation is then obtained by representing every concept as a node, and the elements in the concept-to-concept association matrix are represented by the (weighted) links in the CLG.
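The SVD step can be sketched with numpy as follows. The small concept-paragraph matrix is invented purely for illustration and is arranged here with one row per concept and one column per paragraph, so that AA^T (equivalently UΣ^2 U^T) is the concept-to-concept association matrix.

```python
import numpy as np

# Illustrative concept-paragraph matrix A: rows are concepts, columns are paragraphs,
# entries are the paragraph-wise occurrence counts of each concept.
A = np.array([[3., 0., 1.],
              [1., 2., 0.],
              [0., 1., 4.],
              [2., 2., 1.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(s) Vt
assoc = U @ np.diag(s ** 2) @ U.T                 # concept-to-concept association, equals A @ A.T

# Build the weighted, fully connected concept link graph from the association matrix.
edges = {(i, j): assoc[i, j]
         for i in range(assoc.shape[0]) for j in range(i + 1, assoc.shape[0])}
print(np.allclose(assoc, A @ A.T), len(edges))
```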

Citations Graph

A scientific document can attract citations (in-links) from subsequent papers from the moment it is published. Unlike hyperlinks, which can be added or removed at will, such a relationship between two documents exists forever once it is created: once a paper is committed to print, the citation link remains in place and is seldom repudiated afterwards. The references (out-links) from a paper also remain static (permanent) once the paper is published. The number of citations to a document from other papers can therefore only remain unchanged or increase over time. This can be represented as a citation graph where nodes are papers and edges represent the citation relationships [5]. While self-citations and cyclic references are possible, these are rarely observed in practice. The citation graph is hence most closely related to a multi-rooted directed tree structure.

Temporal Sequences

A web document is not static; it changes over time after it has been posted or published on the Internet. Such dynamics can involve both changes to its contents and the development of new links around it. If we represent a particular web document at a certain time t as a node, we can build a temporal sequence which models the change of the document over time, both in its contents and in its linking structure [63]. The nodes in the sequence can be labelled with information that indicates the state of the document at that particular time t. For example, a web page can change over time in terms of its contents and of the hyperlinks that the page issues. Such changes may in turn attract more links from other web pages. Thus, the state of such a document at time t + 1 is an updated version of the one at time t. Another example could be a set of scientific documents connected via citations. As time goes by, some papers tend to become popular, attracting a large number of citations from other papers. This allows us to build a temporal sequence which contains nodes representing the state of the citation relationships at a certain time t. The state of the document at time t + 1 is an extension of the one at time t.

Graph of Graphs

There are learning problems for which the nature of a node is itself of a more complex structure, such as a tree or a graph. Such learning problems require the encoding

of a graph structure, the nodes of which can also be graphs. This results in data featuring a graph of graphs (GoG) which contains different graphical elements [98]. The situation is best explained using an example (see Figure 3.2): the World Wide Web consists of a collection of documents which feature a referencing method known as hyperlinks. A hyperlink defines a (directed) relationship between two documents; any two web documents may or may not be connected by such a link. The collective combination of all hyperlinks produces a Web graph whose nodes represent the web documents and whose links represent the hyperlinks. The nodes in such a graph can be labelled to describe properties and content of the associated document. A typical Web document can itself be described as a graph, with nodes representing sections of text encapsulated by HTML formatting elements, and edges representing the links between these encapsulating HTML elements [96]. The nodes of this document graph may in turn be labelled by a word graph, consisting of nodes which may represent word tokens, or sets of word tokens, and of links among the nodes indicating the relationships between two sets of word tokens [68]. Now suppose that a particular paragraph links to another document via a hyperlink. The hyperlinked document can itself be represented by a graph, and this graph may be embedded within a node of the parent document's graph. This is a graph of graphs (GoG) structure. Obviously, the hyperlinked document may in turn contain hyperlinks to other documents, and these documents, in turn, can be represented as graphs within the nodes of the hyperlinked document's graph. We note that a GoG is generally of a hybrid nature since the graphs at different levels encode different relationships and describe different atomic elements. For example, a document may be represented as a concept link graph, which is an undirected graph, while the link between two documents is directed. Hence, the GoG is an embedded hybrid graph whose components may differ significantly in their properties. For example, the hybrid graph depicted in Figure 3.2 features a Web graph which is a directed cyclic graph, XML graphs which are tree structured, and a word graph which may be labelled and undirected. Such a graph of graphs structure can be further extended to any finite depth (e.g. in the case of Figure 3.2 the depth of the GoG structure is 3). In the context of this thesis, we will use the GoG representation for a given domain. More explicitly, we will work towards the development of machine learning methods capable of encoding GoG data structures. We will eventually demonstrate our work through

an application to a document ranking problem which is represented by a GoG of depth 3, i.e. a time series (level 1) of hyperlink graphs (level 2) representing CLGs (level 3).

[Figure 3.2: An example of a graph of graphs, consisting of a Web graph, document graphs, and text graphs.]

3.4 Conclusion

This chapter analysed various document features and document structures in order to identify a suitable representation for text documents in a structured domain. A well-defined feature representation can be very useful for learning tasks. Many potential features and structures can be extracted from text documents, including features expressed in terms of vectors, inter-document structure and intra-document structure. Motivated by the limitations of the simple bag-of-words representation of documents, we propose to represent documents using a hybrid graph, called a graph of graphs. Documents in a text corpus can be represented as a GoG, which consists of several levels of graphs nested together. The topmost graph is a graph where the nodes are documents and the edges are hyperlinks among the documents. Each node in this link graph is then labelled with a document graph which describes the document contents. The document graph can be expanded further by labelling its nodes with a graph which summarizes the contextual relationships among the word tokens in a paragraph. The significance of the new representation is that various document features and structures from different levels can be represented without abstracting away the intrinsic connections among them. This provides a motivation to develop a machine learning method which is capable of encoding all elements in a GoG as a whole.
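To make the nested graph of graphs representation concrete, the following minimal sketch outlines one possible way of holding a GoG in memory. It is an illustration of the idea only, not the data structures used in the remainder of this thesis; all class and field names are hypothetical.

# A minimal, illustrative sketch of a graph-of-graphs container (hypothetical
# names; not the implementation used in this thesis).
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class Graph:
    """A labelled graph; each node label may itself be another Graph."""
    directed: bool
    nodes: Dict[int, Optional["Graph"]] = field(default_factory=dict)   # node id -> nested graph (or None)
    edges: List[Tuple[int, int, float]] = field(default_factory=list)   # (source, target, weight)

# Level 3: a concept link graph (undirected, weighted, leaf level).
clg = Graph(directed=False, nodes={0: None, 1: None, 2: None},
            edges=[(0, 1, 0.8), (0, 2, 0.3), (1, 2, 0.5)])

# Level 2: a hyperlink graph whose nodes (documents) are labelled by CLGs.
web_graph = Graph(directed=True, nodes={10: clg, 11: clg},
                  edges=[(10, 11, 1.0)])

# Level 1: a temporal sequence of hyperlink graphs, one snapshot per time step.
temporal_sequence = Graph(directed=True, nodes={0: web_graph, 1: web_graph},
                          edges=[(0, 1, 1.0)])   # state at t feeds state at t+1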

Chapter 4

Encoding Structured Documents by Machine Learning Methods

4.1 Introduction

Depending on the given learning problem, a suitable representation of electronically stored documents is either as data vectors, data sequences, tree data structures, graph data structures, or graph of graphs data structures. While the vast majority of machine learning methods are limited to the encoding of data vectors or data sequences, some machine learning methods capable of encoding tree structured data and graph structured data have been proposed in recent years [13, 32, 46, 48, 49, 81, 99]. A main task of this research project is to develop and apply scalable machine learning tools for the encoding of structured documents. This chapter provides an overview of some existing machine learning methods which are capable of dealing with structures. These machine learning methods are grouped into two categories: unsupervised machine learning methods and supervised machine learning methods. The contributions and limitations of existing approaches are discussed. We will also propose some extensions which allow the processing of more generic types of graphs, or which address some known limitations of existing algorithms.

4.2 Unsupervised Machine Learning

Kohonen's Self-Organizing Map (SOM) is one of the best known and most commonly used unsupervised learning algorithms [61]. A SOM is capable of providing low-dimensional visualizations of high-dimensional data. However, the traditional SOM is limited to input data in vectorial form. The SOM for structured data (SOM-SD) extends the

basic SOM towards an ability to cluster tree structured data [48]. The algorithm recursively processes the nodes in a tree one at a time, from the leaf nodes to the root node. The encoding process takes into account a numeric data label that may be attached to a node and the causality of a given node within a graph. The SOM-SD dynamically shapes the input vectors according to the changes in the encoding of a node's dependencies during training. The Contextual SOM-SD (CSOM-SD) further extends this concept to allow the encoding of contextual information [46]. The CSOM-SD takes, in addition to the encoding of offsprings, also the encoding of parent nodes into account. Nevertheless, both the SOM-SD and the CSOM-SD are limited to processing tree data structures, which must be acyclic and ordered. The GraphSOM is a subsequent extension which allows for the processing of cyclic or undirected graphs [49]. Instead of using concatenated state vectors of children/neighbors, a fixed dimensional state vector is used which reflects the activation of all of a node's dependencies on the map. The GraphSOM can suffer from significant stability issues [99]. These were addressed with the introduction of the Probability Mapping Graph SOM (PMGraphSOM). The PMGraphSOM is part of the latest generation of SOMs for graphs, and introduces a likelihood factor to produce more stable mappings [99].¹ The PMGraphSOM has been developed as part of this thesis. This section provides a more detailed description of some of the existing unsupervised methods in order to introduce the development of the PMGraphSOM.

Self-Organizing Map

The Self-Organizing Map (SOM) is a popular unsupervised machine learning method which allows a topology preserving projection of high dimensional data onto a low dimensional display space for the purpose of clustering or visualization [61]. The popularity of the SOM is mainly due to the following reasons: (1) the computational complexity of the underlying algorithms is linear, and hence SOMs can be applied to data mining tasks; (2) high dimensional data can be projected onto a low-dimensional (e.g. 2-dimensional) space, and hence the data can be readily displayed on paper, a screen, etc.; and (3) SOMs perform a topology preserving projection, in that the topology of the data in the data space is largely preserved on the display space.

¹ During the writing of this thesis, a further improvement has been proposed that is capable of dealing with sparse graphs more effectively. This SOM has been coined the compact GraphSOM [43].

In general, the display space can be of any dimension, but due to the benefits of direct visualization, the SOM is most commonly a two-dimensional map. Without loss of generality, this thesis assumes that the display space is 2-dimensional. The display space is formed by a set of prototype units which are arranged on a regular grid. There is one prototype unit associated with each element on the lattice. The input to a SOM is expected to be a set of k-dimensional vectors, and the prototype units must be of the same dimension. The elements of the prototype units are adjusted during training by two training steps, which are referred to as the competitive step and the cooperative step respectively. An algorithmic description is given as follows:

Competitive step: One sample input vector u is randomly drawn from the input data set and its similarity to the codebook vectors is computed. When using the Euclidean distance measure, the winning neuron is obtained through Equation 4.1.

r = arg min_i ‖(u − m_i)^T Λ‖,   (4.1)

where m_i refers to the i-th prototype unit, the superscript T denotes the transpose of a vector, and Λ is a k × k dimensional diagonal matrix. For the standard SOM, all diagonal elements of Λ are equal to 1.

Cooperative step: m_r itself, as well as its topological neighbors, are moved closer to the input vector in the input space. The magnitude of the attraction is governed by the learning rate α and by a neighborhood function f(Δ_ir), where Δ_ir is the topological distance between m_r and m_i. The most commonly used updating algorithm is given by Equation 4.2.

Δm_i = α(t) f(Δ_ir)(u − m_i),   (4.2)

where α is the learning rate, decreasing to 0 with time t, and f(·) is a neighborhood function which controls the amount by which the codebooks are updated. The most widely used neighborhood function is the Gaussian function:

f(Δ_ir) = exp( −‖l_i − l_r‖² / (2σ(t)²) ),   (4.3)

where the spread σ is called the neighborhood radius, which decreases with time t, and l_r and l_i are the coordinates of the winning neuron and the i-th neuron in the lattice, respectively.
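As a concrete illustration of the two training steps above, the following minimal sketch implements one competitive/cooperative update for a standard SOM (i.e. with Λ equal to the identity). It assumes NumPy; the map size, learning rate and radius schedules are illustrative choices only, not the settings used in the experiments reported later in this thesis.

# Minimal sketch of one SOM training step (standard SOM, Lambda = identity).
# Assumes NumPy; all hyper-parameter values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

k = 3                      # input dimension
grid_w, grid_h = 5, 4      # 2-dimensional display space
codebooks = rng.random((grid_w * grid_h, k))                     # prototype units m_i
coords = np.array([(x, y) for x in range(grid_w)
                          for y in range(grid_h)], dtype=float)  # lattice coordinates l_i

def train_step(u, t, t_max, alpha0=0.5, sigma0=2.0):
    """One competitive + cooperative step for input vector u at iteration t."""
    alpha = alpha0 * (1.0 - t / t_max)                # learning rate decreasing to 0
    sigma = max(sigma0 * (1.0 - t / t_max), 1e-3)     # shrinking neighborhood radius

    # Competitive step (Eq. 4.1 with Lambda = identity): find the winning neuron r.
    r = int(np.argmin(np.linalg.norm(u - codebooks, axis=1)))

    # Cooperative step (Eqs. 4.2 and 4.3): move the winner and its topological
    # neighbors towards u, weighted by the Gaussian neighborhood function.
    d2 = np.sum((coords - coords[r]) ** 2, axis=1)
    f = np.exp(-d2 / (2.0 * sigma ** 2))
    codebooks += alpha * f[:, None] * (u - codebooks)
    return r

# Example: a few training iterations over random k-dimensional inputs.
for t in range(100):
    train_step(rng.random(k), t, t_max=100)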

These two steps together constitute a single training step, and they are repeated for a given number of iterations. The number of iterations must be fixed prior to the commencement of the training process so that the rate of convergence of the neighborhood function, and of the learning rate, can be calculated accordingly. However, the traditional SOM deals with vectorial information, and hence the processing of structured data such as sequences and graphs requires a pre-processing step which squashes a graph structure into a vectorial form such that the data can be processed in a conventional way.

Self-Organizing Map for Structured Data

The pioneering work presented in [48] introduced a first Self-Organizing Map approach for the encoding of structured data (SOM-SD). The SOM-SD allows the processing of labelled directed ordered acyclic graphs (DOAGs) by individually processing the atomic components of a graph structure (the nodes, the labels and the links). The key idea of the SOM-SD is to dynamically form the network input by describing each node's properties at each step during network training. This results in a dynamically formed input vector. Each input vector is a representation of a node from a given set of graphs; these vectors are then presented to a SOM. An input vector is a composition of a numeric data label that may be attached to a node, and of information about the mappings of the offsprings of that node. Hence, the input vector is dynamic during training, in contrast to the traditional SOM. This is because the mapping of a node's offsprings can change when the codebook vectors of the SOM are updated, and hence the input vector of the corresponding node changes accordingly. In practice, it has been shown that such dynamics in the input space do not impose a stability problem on the training algorithm [48]. The SOM-SD forms an input for each node in a set of graphs. The network input for the j-th node is a vector x_j = (u_j, y_ch[j]), where u_j is a numerical data vector associated with the j-th node, and y_ch[j] is the concatenated list of coordinates of the winning neurons of all the children of the j-th node. Since the size of the vector y_ch[j] depends on the number of offsprings, and since the SOM training algorithm requires constant sized input vectors, padding with a default value is used for nodes with less than the maximum out-degree of

any graph in the training set. The competitive step of the training algorithm changes to:

r = arg min_i ‖(x − m_i)^T Λ‖,   (4.4)

where x is an n-dimensional vector representation of a given node, m_i refers to the i-th prototype unit, the superscript T denotes the transpose of a vector, and Λ is an n × n dimensional diagonal matrix. The vector x contains hybrid information consisting of a k-dimensional numerical label and an l-dimensional coordinate vector called the state vector. The nomenclature of the state vector is derived from the observation that it represents the network's state for one or more nodes. It follows that n = k + l. Since the dimension of the label component in x and that of the state vector component in x can differ, and since the magnitudes of their elements can differ, Λ is used to control the balance of the two input vector components by setting the first k diagonal elements to µ ∈ (0...1), and all remaining diagonal elements to 1 − µ. The cooperative step of the SOM-SD training algorithm remains as in Equation 4.2, except for the substitution of u with x.

Contextual Self-Organizing Map

The SOM-SD only allows for the inclusion of information about the descendants of a given node. This restricts the SOM-SD to encoding causal dependencies only. Hence, the SOM-SD is unable to discriminate identical sub-graphs even if those sub-graphs appear in a different context, or are part of a different graph. The Contextual SOM-SD permits the encoding of contextual information about nodes in a directed graph [46]. This is achieved by altering the way in which input vectors are formed. With the CSOM-SD, the input vector representing the j-th node is x_j = (u_j, y_ch[j], y_p[j]), where u_j is a numerical data vector associated with the j-th node, y_ch[j] is the concatenated list of coordinates of the winning neurons of all the children of the j-th node, and y_p[j] is the list of coordinates of the parent nodes of the j-th node. Thus, the input vector is a combination of three vector components (the label, the states of the offsprings, and the states of the parent nodes). Assuming that the dimension of the numeric data label is k, the dimension of the state vector of the child nodes is l, and the dimension of the state vector of the parents is o, the dimension of the input vectors becomes n = k + l + o. The influence of these three components during training is controlled by setting the first k diagonal elements of Λ to

µ_1 ∈ (0...1), the next l diagonal elements to µ_2 ∈ (0...1), and all remaining diagonal elements to 1 − µ_1 − µ_2. Other than that, the training algorithm for the CSOM-SD remains unchanged with respect to Equation 4.4 and Equation 4.2. A problem of the CSOM-SD is that the improved ability to discriminate identical substructures can create a substantially increased demand on mapping space, and hence the computational demand can be prohibitive for large scale learning problems. The method also remains limited to the processing of ordered acyclic graphs.

Graph Self-Organizing Map

The GraphSOM allows for the processing of cyclic or undirected graphs, and was shown to be computationally more efficient than a CSOM-SD for large scale learning problems [49]. The SOM-SD and the CSOM-SD are both limited to the training of tree structures and are not workable if the maximum number of children or neighbors is unknown. The GraphSOM is capable of dealing with more general graphs, which may be cyclic or contain undirected links. Instead of building state vectors by concatenating the encoding of each child node or neighboring node, the GraphSOM takes the activation of the SOM, containing the sum of activations of all of a node's neighbors, as additional information that is concatenated to the node label [49]. In this case, the input vector is formed through x_j = (u_j, M_ne[j]), where u_j is defined as before, and M_ne[j] is an m-dimensional vector (m is the size of the map) containing the activation of the map M when presented with the neighbors of node j. An element M_i of the map is zero if none of the neighbors is mapped at the i-th neuron location; otherwise, it is the number of neighbors that were mapped at that location. This approach produces fixed sized input vectors which do not require padding. Note also that this approach requires knowledge of the mappings of all of a node's neighbors. The availability of these mappings cannot be assured when dealing with undirected or cyclic graphs. This is overcome in [49] by utilizing the mappings from a previous time step. The approximation is valid since convergence is guaranteed. Hence, the GraphSOM can process undirected or cyclic graphs. An example is shown in Figure 4.1, which depicts a SOM of size 5 × 2 and an undirected graph containing 5 nodes. For simplicity, we assume that no data label is associated with any node in the graph. For such a small graph, the dimension of the input vector for the GraphSOM is 10 (equal to the number of neurons on the map) while the dimension for the CSOM-SD is 6 (related to the maximum out-degree of the graph). This indicates that

the computational cost of the GraphSOM and the CSOM-SD for small graphs could be similar. However, since the input dimension for the GraphSOM is independent of the out-degree, the GraphSOM becomes increasingly advantageous as the out-degree of the graphs increases. A large map is required for large scale clustering tasks; this could be a potential limitation of the GraphSOM. An approximation approach is proposed to overcome this by consolidating the mappings of neighboring nodes into groups. In the example shown in Figure 4.1, if a 1 × 2 grouping is used, the dimension of the input vector for the GraphSOM can be reduced to 5. Even larger groupings can further boost the training speed, but may lead to inaccuracies.

[Figure 4.1: A 2-dimensional map of size 5 × 2 (left), and an undirected graph (right). Each hexagon is a neuron. The ID, codebook, and coordinate value of each neuron are shown. For each node, the node number and the coordinate of the best matching codebook are shown.]

It can be observed that the inclusion of state information in the input vector provides a local view of the graph structure. The iterative nature of the training algorithm ensures that local views are propagated through the nodes in a graph, and hence structural information about the graph is passed on to all reachable nodes. It can also be observed that the concatenation of data label and state produces hybrid input vectors. The diagonal matrix Λ is again used to control the influence of these two components on the mapping. The diagonal elements λ_11 . . . λ_pp are set to µ ∈ (0; 1), and all remaining diagonal elements are set to 1 − µ, where p = |u|. Thus, the constant µ influences the contributions of the data label and of the state component to the Euclidean distance. Note that if u = x and µ = 1, then the algorithm reduces to Kohonen's basic SOM training algorithm. After training a GraphSOM (or any of the other SOMs) on a set of training data, it becomes possible to produce a mapping for input data from the same problem domain

but which may not necessarily be contained in the training dataset. The ability of a trained SOM to properly map unseen data (data which is not part of the training set) is commonly referred to as the generalization performance. The generalization performance is one of the most important performance measures. Rather than computing the generalization performance of the SOM, the performance can also be evaluated on the basis of micro purity and macro purity.

However, we discovered a stability problem with the GraphSOM, which can be described by using the example shown in Figure 4.1. When processing node w = 3 with a GraphSOM, the network input is the vector x_3 = (0, 0, 2, 0, 0, 1, 0, 0, 0, 0). This is because two of the neighbors of node 3 are mapped at the coordinate (1,3), which refers to the 2-nd neuron, and the third neighbor of node 3 is mapped at (2,1), which refers to the 5-th neuron (neuron IDs are counted from 0). The algorithm proceeds with the execution of Equation 4.1 and Equation 4.2. Due to the weight changes in Equation 4.2, it is possible that the mappings of the neighbors w = 0, w = 1, and w = 4 change. Assume that there is a minor change in the mapping of node w = 0, for example, to the nearby location (1,4). The Euclidean distance measure in Equation 4.1, however, does not make a distinction as to whether a mapping changed to a nearby location or to a far away location; the contribution to the Euclidean distance remains the same. This defeats the very purpose of achieving topology preserving properties, and can cause alternating states to occur [99]. To counter this behavior it is necessary to either reduce the learning rate to a very small value (causing long training times due to an increased demand on the number of iterations), or to use a large value for µ (causing a neglect of structural information). This issue has been addressed through the introduction of a probability mapping graph SOM.

Probability Mapping GraphSOM

A weakness of the GraphSOM is an inability to maintain a topology preserving mapping due to the Euclidean distance measure used in Equation 4.1. To counter this behavior it is necessary to either significantly reduce the learning rate (causing long training times), or to use a large value for µ (causing a neglect of structural information). To overcome this problem, we propose, as part of the work carried out for the purpose of this thesis, to soft-code the mappings of neighbors. The GraphSOM hard-codes the mapping of a node to be either 1 if there is a mapping at a given location, or 0 if there is no mapping at a given

location. Instead, we propose to code the likelihood of a mapping in a subsequent iteration with a probability value. We note that, due to the effects of Equation 4.2, it is most likely that the mapping of a node will be unchanged at the next iteration. But since all neurons are updated, and since neurons which are close to a winner neuron (as measured by Euclidean distance) are updated more strongly (controlled by the Gaussian function), it is more likely that any change of a mapping will be to a nearby location rather than to a far away location. These likelihoods are directly influenced by the neighborhood function and its spread. Hence, we propose to incorporate the likelihood of a mapping in subsequent iterations as follows in Equation 4.5:

M_i = (1 / (σ(t)√(2π))) exp( −‖l_i − l_r‖² / (2σ(t)²) ),   (4.5)

where σ(t) decreases towards zero with time t, and all other quantities are as defined before. The computation is accumulative over all of the node's neighbors. Note that the term 1/(σ(t)√(2π)) normalizes the states such that 0 ≤ Σ_i M_i ≤ 1. It can be observed that this approach accounts for the fact that during the early stages of the training process it is likely that mappings can change significantly, whereas towards the end of the training process, as σ(t) → 0, the state vectors become more and more similar to those of the hard coding method.

Normally, there are three phases in the training process [61]. During the first phase, the GraphSOM organises the mapping of nodes on the map based on the initial training radius. The second phase aims to adjust the locations of the activations within a smaller area and to improve the accuracy. The third phase focuses on updating the codebooks of the winning neurons in order to shorten the distance between input and codebook. This leads to a problem: at the very beginning of training, it is possible that the location of the winning neuron for a node jumps across the map. With a random initialization, such large scale adjustments are reasonably required in order to form a rough projection at the beginning of the training. Hence, the PM approach may restrict this natural arrangement, since the activations of the nodes would have only a tiny chance of being changed. It is necessary to allow the GraphSOM to obtain a first impression of the inputs without any interference. Based on observations from a number of experiments, it usually takes about 80% of the total number of training iterations to complete the first phase. Accordingly, we enable the PM approach only after the first phase is completed; for example, the PM approach is disabled for the first 80% of the iterations of a training task, and these initial iterations are executed as with the GraphSOM approach. In other words, the PMGraphSOM is initialized through the execution of the

GraphSOM training algorithm for a limited number of iterations. The proposed approach produces state vector elements which are non-zero, as compared to the GraphSOM where the state vector can be sparse. This creates the opportunity for an optimization of the competitive step. Since the Euclidean distance is computed through element-wise computations on two vectors, and since we are only interested in finding the best matching codebook, the computation of the Euclidean distance can be interrupted as soon as the partial sum exceeds a previously found minimum distance. This was found to be very effective in practice in reducing the compute time. The PMGraphSOM has been applied to a benchmark dataset involving structured data and has shown its effectiveness for solving both clustering and classification learning problems. This will be presented and discussed later, in Chapter 6 and subsequent chapters.

4.3 Supervised Machine Learning

Supervised machine learning methods are used to infer functions from training samples with supervisory signals [62]. Unlike unsupervised methods, supervised learning is usually employed for classification and regression learning tasks. The training patterns involved in supervised learning contain paired information: an input and a desired target. In the context of supervised learning based on neural networks, the training system takes inputs and aims at minimizing the distance between the outputs produced by the system and the targets provided by the dataset. This is done by adjusting internal network parameters called weights. A well-trained system can predict the correct output corresponding to any valid input. Perceptron-based networks are basic supervised learning models [52]. If the learning problem is linearly separable, then there exist proofs which show that a single layer perceptron system is competent in solving the problem [52]. With additional hidden layers, a multi-layer perceptron (MLP) is able to distinguish data that are not linearly separable (e.g. the XOR problem) [8], and to approximate any continuously differentiable function [52]. However, the MLP is limited to vectorial inputs, which can be unsuitable for describing some types of data. The recursive MLP (RMLP) and the associated back propagation through structure (BPTS) learning algorithm are an extension of the MLP. This extension allows the encoding of tree data structures [32]. A further, more recent extension was proposed with the introduction of the graph neural network (GNN) [82]. The

GNN was introduced to solve learning tasks on graphs which may be unordered, and which may contain undirected links and cycles [82]. This thesis will show that the GNN can be further extended for processing more generic graphs. For example, the nodes of a graph can be described by another graph, whose nodes in turn may be described by yet other graphs. Such data structures are referred to as graph-of-graphs (GoGs) data structures. Besides neural networks, statistical models such as those based on Hidden Markov Models (HMMs) are useful for pattern recognition and classification tasks. The HMM is the simplest type of Bayesian model and is popularly applied to time series modelling, such as in speech recognition problems [78]. The purpose of this section is to summarize some of the available supervised machine learning methods and to describe two new approaches:

1. The Graph Hidden Markov Model (GHMM), which is a combined system of HMM and RMLP capable of processing sequences of graphical patterns;

2. The GNN², which is an extension of the GNN and is able to encode any common type of graph, including GoGs [98].

The basics

A traditional MLP is a basic supervised machine learning model. It is a feedforward neural network that takes a set of known data as inputs for training and is then able to produce outputs for unlabelled data [8]. An MLP consists of an input layer, one or more hidden layers, and an output layer. Each layer contains a number of neurons, and each layer is fully connected to the next layer via network weights. Such a model is trained by a gradient descent method based on a given error function. The training algorithm of the MLP can be described as follows. Denote an n-dimensional input to the MLP as x = (l_1, l_2, l_3, ..., l_n). Between any two adjacent layers there is a set of weights connecting the neurons of these two layers; each weight is denoted by w_ij and connects the i-th neuron of one layer to the j-th neuron of the next layer. The hidden neurons usually use a non-linear activation function φ(·) (e.g. the sigmoid or the hyperbolic tangent). The output of each neuron j is computed through c_j = φ(Σ_i w_ij a_i), where a_i refers to the output of the i-th node in the previous (i.e. input or hidden) layer. The network error is then computed at the output layer by E(x) = (1/2) Σ_i (o_i(x) − t_i(x))², where o_i(x) is the output of the i-th output neuron for the x-th pattern in the dataset and t_i(x) is the corresponding target. The error is propagated back through a gradient descent method. Using gradient descent, the amount of change for each weight is computed as Δw_ji(x) = −η (∂E(x)/∂v_j(x)) y_i(x), where η ∈ (0...1) is a learning rate,

v_j = Σ_i w_ij a_i, and y_i is the output of a neuron in the previous layer. The MLP has been widely used in the artificial intelligence field for pattern recognition tasks. However, its use is limited to learning problems in which the data can be represented as vectors.

Simple Auto-associative Memory

It can be a matter of debate whether or not auto-associative memories (AAMs) are a supervised machine learning method. This is due to the fact that the teacher signal used by the AAM training algorithm is not actually an external signal, or a signal added to the input data. However, the AAM is often realized by an MLP network, which is generally considered to be a supervised machine learning method. The purpose of an AAM is to memorize an input pattern, and this is achieved by using the input pattern also as the desired output pattern during training. A particular power of the AAM is that it is capable, once trained, of retrieving a piece of data from only partial information. There exist two main categories of training algorithms for the AAM. Firstly, the Hopfield network [54] realizes autoassociation through bitwise operations on input data. The Hopfield network has been studied in great depth, and it was shown that the optimal storage capacity is reached when the input pattern is sparse [73]. Secondly, an AAM can be realized as a multilayer MLP which has an identical number of neurons in both the input and the output layer. Such an MLP can be trained on a set of input data through back propagation as normal. The MLP approach works best when the input pattern is non-sparse. The MLP realization is able to compress the input data through the hidden layer, and is able to reconstruct the data through its output layer. Hence, the MLP approach is useful for data dimension reduction tasks, as the compression of the input data through the hidden layer can be considered a lower dimensional projection of the input patterns.

Graph Hidden Markov Model

The Graph Hidden Markov Model (GHMM) is a novel machine that fits the framework of sequential graphical pattern recognition [9]. The general architecture of the GHMM is illustrated in Figure 4.2. The hidden part of a hidden Markov model (HMM) is taken, along with its intrinsic capability of modeling long-term time dependencies and of performing automatic segmentation of long sequences into sub-sequences. The emission

probabilities associated with the states of the HMM are probability density functions (PDFs) defined over labelled graphs. Emission PDFs are modelled using a combined artificial neural network (ANN) architecture. A suitable encoding ANN for graphs, such as the RMLP or the GNN, is combined with a radial basis function (RBF)-like network that realizes the estimation of the PDF. This probabilistic interpretation is made possible by (i) a description of individual graphs as the random outcomes of a generalized random graph; and by (ii) a constrained RBF which actually realizes a PDF model that satisfies the probability axioms. We stress the fact that the resulting hybrid machine is not just the aggregation of separate, cooperating architectures. Rather, it shall be thought of as a whole, since a global, joint optimization algorithm is developed which trains all the model parameters (RBF parameters, encoding network weights, HMM initial and transition probabilities) simultaneously in order to increase a shared, overall criterion function, namely the maximum likelihood (ML) of the model given a sample of training observation sequences involving graphs. Training takes place within the popular forward-backward (or Baum-Welch) procedure for HMMs. Once training is accomplished, the popular Viterbi algorithm can be applied in order to carry out segmentation and classification of sequences of graphical patterns.

[Figure 4.2: The general architecture of the GHMM. The figure shows an HMM whose n states are each associated with an ANN (ANN-1 ... ANN-n), each composed of an encoding network (an RMLP or a GNN) and an RBF network (RBF-1 ... RBF-n).]

The GHMM has been developed as part of this thesis and, hence, the GHMM is described here in some detail. We will use the notion of a generalized random graph [89] for a formal definition of the probabilistic quantities over sequences of graphs that underlie the given framework. Let 𝒱 be a given discrete or continuous-valued set (the vertex universe), and let Ω be any given sample space. We define a Generalized Random Graph (GRG) over 𝒱 and Ω as a function G : Ω → {(V, E) | V ⊆ 𝒱, E ⊆ V × V}. Let then 𝒢 = {(V, E) | V ⊆ 𝒱, E ⊆ V × V}. A probability density function (PDF) for GRGs over 𝒱 is a function p : 𝒢 → ℝ such that: (1) p(g) ≥ 0 for all g ∈ 𝒢, and (2) ∫_𝒢 p(g) dg = 1 (refer to [89]

for a discussion on the measurability of GRG spaces, i.e. the meaning of this integral). Loosely speaking, any function G(·) = ξ(t) which maps time t (either discrete or continuous) onto a GRG is then defined to be a stochastic graph process. In turn, a hidden Markov model over graphs (GHMM) is a pair of stochastic processes: a hidden Markov chain (a traditional, discrete time random process) and an observable stochastic graph process which is a probabilistic function of the states of the former. More precisely, a GHMM is defined as a traditional HMM [78], except for the notion of emission probability, namely as:

1. A set S of Q states, S = {S_1, ..., S_Q}, which are the distinct values that the discrete, hidden stochastic process can take.

2. An initial state probability distribution, i.e. π = {Pr(S_i at time t = 0), S_i ∈ S}, where t is a discrete time index.

3. A probability distribution that characterizes the allowed transitions between states, that is, a_ij = {Pr(S_j at time t + 1 | S_i at time t), S_i ∈ S, S_j ∈ S}, where the transition probabilities a_ij are assumed to be independent of time t. Note that Pr(S_j at time t + 1 | S_i at time t, S_k at time t − 1, ...) = a_ij due to the Markov assumption.

4. A set of PDFs over GRGs (referred to as emission probabilities) that describes the statistical properties of the GRGs for each state of the model: b_𝒢 = {b_i(g) = p(g | S_i), S_i ∈ S, g ∈ 𝒢}.

Let us assume that a certain sequence Y = g_1, ..., g_T of graphs generated by a (hidden) stochastic graph process has been observed, and that it is the expression (outcome) of a certain sequence W = ω_1, ..., ω_L of states of nature (i.e., classes). Recognition (classification) of the correct class(es) W relying on the observations Y can be accomplished according to the posterior probability of the class(es) given Y, yielded by Bayes' theorem: Pr(W | Y) = p(Y | W) Pr(W) / p(Y). The quantity Pr(W) is referred to as the prior probability of W. It can be estimated from the relative frequencies of the classes, as in statistical pattern recognition. We propose GHMMs for modeling the class-conditional density p(Y | W). Note that this approach deals with continuous recognition tasks, that is, a sequence of classes ω_1, ..., ω_L is hidden behind the observations Y, and no

prior segmentation of Y into subsequences Y_1, ..., Y_L corresponding to the individual classes is known in advance. The proposed machine relies on a connectionist non-parametric model of the emission probabilities of a GHMM, with gradient-ascent global training techniques over the ML criterion. An ANN is introduced for each state of the GHMM. The output unit of a generic ANN provides an estimate of the corresponding emission probability (that is, a PDF over GRGs) given the current graphical observation in the input graph space. Training of the other probabilistic quantities in the underlying Markov chain, i.e. the initial and transition probabilities, still relies on likelihood maximization via the forward-backward algorithm [78]. The Viterbi algorithm is then applied in the recognition step [78]. In the following, a standard notation is used to refer to the quantities involved in HMM training (e.g. [78]); note that the Greek letter ι (iota) is used to denote the index of a generic state q_ι of the GHMM, and shall not be confused with the index i that will be introduced later. The global criterion function to be maximized during training, namely the likelihood L of a graphical observation sequence given the model, is defined as L = Σ_{ι∈F} α_{ι,T}. The sum is extended over the set F of all possible final states [8] within the GHMM corresponding to the current training sequence. The GHMM is supposed to involve Q states, and T is the length of the current observation sequence Y = g_1, ..., g_T. The forward terms α_{ι,t} = Pr(q_{ι,t}, g_1, ..., g_t) and the backward terms β_{ι,t} = Pr(g_{t+1}, ..., g_T | q_{ι,t}) for the ι-th state at time t can be computed recursively as follows [78]:

α_{ι,t} = b_{ι,t} Σ_j a_{jι} α_{j,t−1}   (4.6)

and

β_{ι,t} = Σ_j b_{j,t+1} a_{ιj} β_{j,t+1},   (4.7)

where a_{ιj} denotes the transition probability from the ι-th state to the j-th state, b_{ι,t} denotes the emission probability associated with the ι-th state over the t-th graph g_t, and the sums are extended over all possible states within the GHMM. The initialization of the forward probabilities is accomplished as in HMMs [78], whereas the backward terms at time T are initialized in a slightly different manner, namely:

β_{ι,T} = 1 if ι ∈ F, and β_{ι,T} = 0 otherwise.   (4.8)

Given a generic parameter θ of an ANN, hill-climbing gradient ascent over L prescribes

a learning rule of the following well-known kind:

Δθ = η ∂L/∂θ,   (4.9)

where η ∈ ℝ⁺, and η is commonly known as the learning rate. It can be observed (from [8]) that the following property can easily be shown to hold true by taking the partial derivatives of the left- and right-hand sides of Equation (4.6) with respect to b_{ι,t}:

∂α_{ι,t}/∂b_{ι,t} = α_{ι,t}/b_{ι,t}.   (4.10)

In addition, by borrowing the scheme proposed by [8], the following theorem can be proved to hold true: ∂L/∂α_{ι,t} = β_{ι,t}, for each ι = 1, ..., Q and for each t = 1, ..., T. Given this theorem and Equation (4.10), repeatedly applying the chain rule we can expand ∂L/∂θ by writing:

∂L/∂θ = Σ_q Σ_t (∂L/∂b_{q,t}) (∂b_{q,t}/∂θ)
      = Σ_q Σ_t (∂L/∂α_{q,t}) (∂α_{q,t}/∂b_{q,t}) (∂b_{q,t}/∂θ)
      = Σ_q Σ_t β_{q,t} (α_{q,t}/b_{q,t}) (∂b_{q,t}/∂θ),   (4.11)

where the sums are extended over all states q of the GHMM involved in the current training sequence (i.e., all the rows in the current trellis [78]) and over all t = 1, ..., T, respectively. It is seen that all the quantities in the right-hand side of Equation (4.11) are available upon recursive processing on the standard HMM trellis, except for ∂b_{q,t}/∂θ. From now on, attention is thus focused on the calculation of ∂b_{q,t}/∂θ, where b_{q,t} is the output from the corresponding ANN at time t. Now, for each ι = 1, ..., Q let us assume the existence of an integer d and of two functions, φ_ι : 𝒢 → ℝ^d and p_ι : ℝ^d → ℝ, such that b_{ι,t} can be decomposed as:

b_{ι,t} = p_ι(φ_ι(g_t)).   (4.12)

We call φ_ι(·) the encoding for the ι-th state of the GHMM, while p_ι(·) is simply referred to as the emission associated with that state. (It is seen that there exist infinitely many choices for φ(·) and p̂(·) that satisfy Equation (4.12), the most trivial being φ(g) = p(g | θ), p̂(x) = x.) Again, we assume parametric forms φ_ι(g_t | θ_φι) and p_ι(x | θ_pι) for the encoding and for the emission, respectively, and we set

θ_ι = (θ_φι, θ_pι). Bearing in mind Equation (4.12), we propose a state-specific, two-block connectionist/statistical model for b_{ι,t} as follows. The function φ_ι(g_t | θ_φι) is realized via an encoding network, suitable for mapping graphs g_t into real vectors x_t for t = 1, ..., T, as described in [87] for the supervised training of RNNs over structured domains. The weights of the encoding network are the parameters θ_φι. A radial basis function (RBF)-like neural net is then used to model the emission p_ι(x_t | θ_pι), where θ_pι are the parameters of the RBF. Basically, for each state ι in the GHMM, a state-specific RBF is expected to define a mixture of Normal densities over the state-specific encoding φ_ι(g_t | θ_φι) of the t-th input graph. From now on, since (i) HMMs assume that the emission probabilities associated with different states are independent of each other [78], (ii) a separate connectionist model is adopted for each one of the states of the GHMM, and (iii) also the individual observations (i.e., graphs) within the input sequence are assumed to be independent of each other given the state [78], we simplify the (cumbersome) notation by dropping the state index ι and the time index t, and we focus on the generic quantity p(φ(g | θ_φ) | θ_p). Note that, for notational convenience, in the following this quantity may be written as p(g). In order to ensure that a PDF is obtained this way, a standard RBF cannot be used straightforwardly: specific constraints have to be placed on the nature of the Gaussian kernels, as well as on the hidden-to-output connection weights. The training algorithm shall provide us with a likelihood maximization scheme that undergoes such constraints. Three distinct families of adaptive parameters θ of the ANNs have to be considered:

(1) Mixing parameters c_1, ..., c_n, i.e. the hidden-to-output weights of the RBF network. Constraints have to be placed on these parameters during the ML estimation process, in order to ensure that they are in the range (0, 1) and that they sum to one. A simple way to satisfy the requirements is to introduce n hidden parameters γ_1, ..., γ_n, which are unconstrained, and to set

c_i = ς(γ_i) / Σ_{j=1}^{n} ς(γ_j),   i = 1, ..., n,   (4.13)

where ς(x) = 1/(1 + e^{−x}). Each γ_i is then treated as an unknown parameter θ to be estimated via ML.

(2) The d-dimensional mean vector µ_i and the d × d covariance matrix Σ_i for each one of the Gaussian kernels K_i(x) = N(x; µ_i, Σ_i), i = 1, ..., n, of the RBF-like network, where N(x; µ_i, Σ_i) denotes a multivariate Normal PDF having mean vector µ_i and covariance matrix Σ_i, evaluated over the random vector x. A common (yet effective) simplification

is to consider diagonal covariance matrices, i.e. independence among the components of the input vector x. This assumption leads to the following three major consequences: (i) the modeling properties are not affected, according to [67]; (ii) the generalization capabilities of the overall model may turn out to be improved, since the number of free parameters is reduced to a significant extent; (iii) the i-th multivariate kernel K_i may be expressed in the form of a product of d univariate Normal densities as:

K_i(x) = Π_{j=1}^{d} (1/√(2πσ_ij²)) exp{ −(1/2) ((x_j − µ_ij)/σ_ij)² },   (4.14)

i.e., the free parameters to be estimated are the means µ_ij and the standard deviations σ_ij, for each kernel i = 1, ..., n and for each component j = 1, ..., d of the input space.

(3) The weights u of the encoding network.

In the following, we will derive explicit formulations of ∂p(φ(g | θ_φ) | θ_p)/∂θ for each of the three families of free parameters θ above. These derivatives are then put in place of ∂b_{q,t}/∂θ in Equation (4.11), obtaining the quantity ∂L/∂θ and, in turn, the overall learning rule Δθ = η ∂L/∂θ for the generic parameter θ.

As regards a generic mixing parameter c_i, i = 1, ..., n, from Equations (4.12) and (4.13), and since p(g) = Σ_{k=1}^{n} c_k K_k(x), we obtain Equation (4.15):

∂p(φ(g | θ_φ) | θ_p)/∂γ_i = Σ_{j=1}^{n} (∂p(g)/∂c_j) (∂c_j/∂γ_i)
  = Σ_{j=1}^{n} K_j(x) ∂/∂γ_i ( ς(γ_j) / Σ_{k=1}^{n} ς(γ_k) )
  = K_i(x) { ς′(γ_i) Σ_k ς(γ_k) − ς(γ_i) ς′(γ_i) } / [Σ_k ς(γ_k)]² + Σ_{j≠i} K_j(x) { −ς(γ_j) ς′(γ_i) / [Σ_k ς(γ_k)]² }
  = K_i(x) ς′(γ_i) / Σ_k ς(γ_k) − Σ_j K_j(x) ς(γ_j) ς′(γ_i) / [Σ_k ς(γ_k)]²
  = ( ς′(γ_i) / Σ_k ς(γ_k) ) { K_i(x) − Σ_j c_j K_j(x) }
  = ( ς′(γ_i) / Σ_k ς(γ_k) ) { K_i(x) − p(g) }.   (4.15)

Bearing in mind that the calculations were carried out for a connectionist model of the emission probability b_{ι,t} associated with the generic ι-th state of the GHMM and evaluated over the t-th graph in the sequence, and using the symbol Q(ι) to denote the subset of

the states involved in the current trellis that are instances of the state ι, we can reintroduce the state index and the time index t in the notation, and rewrite Equation (4.15) as follows:

∂b_{q,t}/∂γ_i^{(ι)} = ( ς′(γ_i^{(ι)}) / Σ_k ς(γ_k^{(ι)}) ) { K_i^{(ι)}(x_t^{(ι)}) − b_{ι,t} }  if q ∈ Q(ι),  and 0 otherwise,   (4.16)

where the writings of the form γ_i^{(ι)}, x_t^{(ι)} and K_i^{(ι)}(·) (i.e., those having the superscript (ι) for any value of ι = 1, ..., Q) denote the corresponding quantities γ_i, x_t and K_i(·) within the RBF associated with the ι-th state, respectively, according to the previous notation. Substituting Equation (4.16) into Equation (4.11) and the latter, in turn, into Equation (4.9), we obtain the following learning rule for the i-th mixing parameter γ_i^{(ι)} within the ι-th emission model:

Δγ_i^{(ι)} = η Σ_{q∈Q(ι)} Σ_t (β_{q,t} α_{q,t} / b_{ι,t}) ( ς′(γ_i^{(ι)}) / Σ_k ς(γ_k^{(ι)}) ) { K_i^{(ι)}(x_t^{(ι)}) − b_{ι,t} },   (4.17)

where we implicitly exploited the (obvious) fact that b_{q,t} = b_{ι,t} for all q ∈ Q(ι).

For the means µ_ij and the standard deviations σ_ij we proceed as follows. Let θ_ij denote the free parameter, i.e. µ_ij or σ_ij, to be estimated. We can write:

∂p(φ(g | θ_φ) | θ_p)/∂θ_ij = Σ_{k=1}^{n} c_k ∂K_k(x)/∂θ_ij = c_i ∂K_i(x)/∂θ_ij,   (4.18)

where the calculation of ∂K_i(x)/∂θ_ij can be accomplished as follows. First of all, let us observe that for any real-valued, differentiable function f(·) the following property holds true: ∂f(·)/∂x = f(·) ∂ log[f(·)]/∂x. As a consequence, from Equation (4.14) we can write

∂K_i(x)/∂θ_ij = K_i(x) ∂/∂θ_ij { −Σ_{k=1}^{d} [ (1/2) log(2πσ_ik²) + (1/2) ((x_k − µ_ik)/σ_ik)² ] }.   (4.19)

For the means, i.e. θ_ij = µ_ij, Equation (4.19) yields

∂K_i(x)/∂µ_ij = K_i(x) ∂/∂µ_ij { −(1/2) ((x_j − µ_ij)/σ_ij)² } = K_i(x) (x_j − µ_ij)/σ_ij²,   (4.20)

which can be substituted into Equation (4.18), obtaining

∂p(φ(g | θ_φ) | θ_p)/∂µ_ij = c_i K_i(x) (x_j − µ_ij)/σ_ij².   (4.21)

Now we can reintroduce the state index ι and the time index t in the notation, and rewrite the equation as ∂b_{ι,t}/∂µ_ij^{(ι)} = c_i^{(ι)} K_i^{(ι)}(x_t^{(ι)}) (x_tj^{(ι)} − µ_ij^{(ι)}) / σ_ij^{2(ι)}, where x_tj^{(ι)} denotes the j-th component of the vector x_t^{(ι)}, which represents the ι-th encoding of the t-th input graph g_t within the training sequence, σ_ij^{2(ι)} is the j-th component of the diagonal of the covariance matrix associated with the i-th kernel of the ι-th emission PDF, while the other symbols have the same meaning as above. Again, since ∂b_{q,t}/∂µ_ij^{(ι)} = 0 when q ∉ Q(ι), the expression can be substituted into Equation (4.11) and the latter, in turn, into Equation (4.9), obtaining the following learning rule for the j-th component µ_ij^{(ι)} of the mean vector associated with the i-th kernel function within the ι-th emission model:

Δµ_ij^{(ι)} = η Σ_{q∈Q(ι)} Σ_t (β_{q,t} α_{q,t} / b_{ι,t}) c_i^{(ι)} K_i^{(ι)}(x_t^{(ι)}) (x_tj^{(ι)} − µ_ij^{(ι)}) / σ_ij^{2(ι)},   (4.22)

where, again, we exploited the fact that b_{q,t} = b_{ι,t} for all q ∈ Q(ι).

For the covariances, i.e. θ_ij = σ_ij, Equation (4.19) yields:

∂K_i(x)/∂σ_ij = ( K_i(x) / σ_ij ) { ((x_j − µ_ij)/σ_ij)² − 1 },   (4.23)

which can be substituted into Equation (4.18), obtaining

∂p(φ(g | θ_φ) | θ_p)/∂σ_ij = c_i ( K_i(x) / σ_ij ) { ((x_j − µ_ij)/σ_ij)² − 1 },   (4.24)

which, adopting the notation above for expressing the dependence on the generic ι-th state of the GHMM and on the time index t, can be substituted into Equation (4.11) and the latter, in turn, into Equation (4.9), obtaining the following learning rule:

Δσ_ij^{(ι)} = η Σ_{q∈Q(ι)} Σ_t (α_{q,t} β_{q,t} / b_{ι,t}) c_i^{(ι)} ( K_i^{(ι)}(x_t^{(ι)}) / σ_ij^{(ι)} ) { ((x_tj^{(ι)} − µ_ij^{(ι)}) / σ_ij^{(ι)})² − 1 }.   (4.25)

Finally, let us consider the connection weights u = {v_1, ..., v_s} within the encoding network. The term ∂p(φ(g | θ_φ) | θ_p)/∂v in Equation (4.11) can be computed as follows. Applying the chain rule yields:

∂p(φ(g | θ_φ) | θ_p)/∂v = ( ∂p(φ(g | θ_φ) | θ_p)/∂y ) ( ∂y/∂v ),   (4.26)

where y is the output from the unit (in the encoding net) which is fed from connection v. The quantity ∂y/∂v can easily be computed by taking the partial derivative of the activation function associated with the unit itself, as usual. In particular, if v = v_lm is the connection

weight between the generic m-th unit in a given layer and the l-th unit in the following layer, such that the corresponding outputs are y_m and y_l, respectively, we have

∂y_l/∂v_lm = f_l′(a_l) y_m,   (4.27)

where y_l = f_l(a_l) and y_m = f_m(a_m) are the activation functions associated with the l-th unit and the m-th unit, respectively, a_l and a_m are the corresponding activations (i.e., inputs), and f_l′(a_l) denotes the derivative of the given activation function. As regards the quantity ∂p(φ(g | θ_φ) | θ_p)/∂y, we proceed as follows. First of all, let us assume that v feeds the output layer, i.e. it connects a certain hidden unit with the j-th output unit of the encoding net. In this case, we have y = x_j, and:

∂p(φ(g | θ_φ) | θ_p)/∂x_j = Σ_{i=1}^{n} c_i ∂K_i(x)/∂x_j   (4.28)
  = Σ_{i=1}^{n} c_i K_i(x) ∂/∂x_j { −Σ_{k=1}^{d} [ (1/2) log(2πσ_ik²) + (1/2) ((x_k w_ik − µ_ik)/σ_ik)² ] }
  = Σ_{i=1}^{n} c_i K_i(x) ∂/∂x_j { −(1/2) ((x_j w_ij − µ_ij)/σ_ij)² }
  = −Σ_{i=1}^{n} c_i ( K_i(x)/σ_ij² ) (x_j w_ij − µ_ij) w_ij.

Equations (4.27) and (4.28) can be substituted into Equation (4.26), obtaining:

∂p(φ(g | θ_φ) | θ_p)/∂v_jm = −Σ_{i=1}^{n} c_i ( K_i(x)/σ_ij² ) (x_j w_ij − µ_ij) w_ij f_j′(a_j) f_m(a_m).   (4.29)

By defining the quantity

δ_j = −Σ_{i=1}^{n} c_i ( K_i(x)/σ_ij² ) (x_j w_ij − µ_ij) w_ij f_j′(a_j)   (4.30)

for the generic j-th output unit in the encoding network, we can rewrite Equation (4.29) in the following, compact form:

∂p(φ(g | θ_φ) | θ_p)/∂v_jm = δ_j f_m(a_m).   (4.31)

When v is a hidden weight (say, v = v_ml, where l and m are the indexes of generic hidden units connected via v), the quantity ∂p(φ(g | θ_φ) | θ_p)/∂v_ml can be obtained by applying the usual backpropagation through structure (BPTS) algorithm [87], once the deltas to be backpropagated have been initialized at the output layer via Equation (4.30). In so doing, a quantity δ_m can be defined for each hidden unit m such that

∂p(φ(g | θ_φ) | θ_p)/∂v_ml = δ_m f_l(a_l).   (4.32)

Substituting Equation (4.31) or Equation (4.32), respectively, into Equation (4.11) yields an overall learning rule for any given weight v_ij^{(ι)} within the encoding network associated with the ι-th state of the GHMM in the following, common form:

Δv_ij^{(ι)} = η Σ_{q∈Q(ι)} Σ_t (β_{q,t} α_{q,t} / b_{ι,t}) δ_i^{(ι)} f_j^{(ι)}(a_j^{(ι)}),   (4.33)

where the superscript (ι) has the usual meaning.

Back Propagation Through Structure

Back Propagation Through Structure (BPTS) is a supervised machine learning algorithm for the encoding of tree structured data in recursive multi-layer perceptron (RMLP) networks [32]. The RMLP network extends the MLP by adding a state layer between the hidden layer and the output layer. The input data are processed node by node, from the leaf nodes to the root node, where the root node usually has a target attached. For each node in the tree, the RMLP takes a combined input that consists of a label vector and a set of state vectors of the node's children, and this combined vector is forwarded through the hidden layers and a state layer. When processing a particular node, the network feeds the outputs of the state layer back to the input layer for that node's parent. Dependencies exist among the nodes in the tree and, hence, the input for a particular node can be dynamic. This requires the nodes in the tree to be processed in a particular order so that for each node a known state vector can be obtained. This is achieved by first processing the nodes without any offsprings (the leaf nodes), and then working through the intermediate nodes towards the nodes that have no parent nodes (the root nodes). The state vector of each node is updated accordingly. After one round of processing of all the nodes in the tree, the stable outputs from the state layer are passed down to the output layer, and the errors are propagated back through the structure in the reverse direction: from the root node to the leaf nodes. Figure 4.3 shows an example of an RMLP network which has four hidden neurons within a single hidden layer and three state neurons. The layers are fully connected by the network weights. As with the MLP, the internal nodes (the hidden and state nodes) are non-linear units, whereas the output nodes are often linear units. Also shown in Figure 4.3 is that the output produced by the state layer for each of a node's offsprings is used as an additional input to the network when processing a node. The states of a node's neighbors must be ordered in

Figure 4.3: A schematic of an RMLP network architecture consisting of 4 hidden units and 3 state units.

Graph Neural Network

An ability to encode tree structured data is sufficient for many real-world learning problems. Some problems, however, require data representations based on more complex data structures such as general graphs. A general graph contains nodes connected via any type of edge: directed, undirected, labelled, etc. Thus, the dependencies between the nodes in the graph can form cycles or self connections, which cannot be handled by RMLP networks. In [81], Scarselli et al. proposed a neural network model called the Graph Neural Network (GNN) for the processing of data represented as graphs. The most basic GNN architecture has similarities to the RMLP network architecture in that both feature four layers: input, hidden, state and output. One main difference is that the GNN sums the states of all of a node's neighbors, and that the training algorithm guarantees the convergence of the state nodes to a stable and unique solution. Figure 4.4 illustrates the architecture of a simple GNN network consisting of four hidden neurons in a single hidden layer and three state neurons. The figure shows that the GNN takes as input the sum of the states of all of a given node's neighbors rather than a concatenated list of states as in the RMLP shown in Figure 4.3. A main problem of using the sum of states is that the GNN cannot uniquely identify the contribution of an individual neighbor. Hence, a critical component of the GNN training algorithm is to ensure that the sum of states is unique for a given context. This is achieved through an iterative contraction procedure which computes a stable point for the sum of states [81].

Figure 4.4: A simplified schematic illustration of the GNN network architecture.

The GNN is used to process a set of labelled or semi-labelled graphs. The training algorithm can be summarized as follows:

Initialization. The parameters of the GNN network are initialized using small random values.

For each node and each graph in the dataset, perform forwarding and error backpropagation for a number of iterations. In the forwarding phase, a graph is processed node by node in random order. Each node $i$ is labelled by a feature vector $L_i$ and a state vector $S_i$ (containing some default values at the beginning of training). When processing node $i$ in the graph, the set of neighboring nodes $N = (n_1, n_2, n_3, \dots, n_m)$ is considered. In addition to the node label vector $L_i$, the sum of the state vectors of the neighboring nodes $N$ is computed and concatenated with the label. This sum-of-states approach resolves a limitation of the RMLP, which required knowledge of the maximum out-degree in advance in order to decide on the dimension of the state inputs. The inputs $[L_i, S_N]$ are fed to the hidden layer and then to the state layer. The state vector for node $i$ is updated with the outputs of the state layer: $S_i = \mathrm{FORWARD}(L_i, \sum_{1 \le j \le m} S_j, W)$, where $W$ is the matrix of weights connecting the input to the hidden layer and the hidden layer to the state layer. All nodes in a graph are forwarded through the hidden and state layers, and the state vectors of the nodes are updated accordingly. In each iteration, the state vectors from the previous iteration are used. It is noticed that during such a procedure the dynamics of the state vectors may produce unstable outputs. This is avoided by ensuring that the procedure realizes a contraction mapping and by iterating the

procedure until convergence is observed [81]. Convergence is checked by computing the distance between the states at the current iteration $t$, $S(t)$, and the states from the previous iteration $t-1$, $S(t-1)$. If $\|S(t) - S(t-1)\| \le \varepsilon_f$, the iteration terminates. The outputs of the network are computed after the stable states have been obtained. The correction of the outputs is based on the error function defined for the task. The errors are propagated back through the output layer to the state layer. Similarly, a repeated procedure is required to reach convergence of the deltas for the states. The gradients are accumulated during these iterations and, at the end, applied to the weight parameters of the network. We omit the formal description of the GNN training algorithm here in order to avoid repetition: an extension of the GNN is formally described in the following section, the GNN is a special case of this extension, and hence the training algorithm of the GNN is fully covered by the description given there. A minimal sketch of the iterative state relaxation is given below.
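The sketch below assumes zero-initialized states and a transition function that realizes a contraction mapping, as required by the GNN training algorithm; the function names and the convergence threshold are illustrative, not part of an existing implementation.

```python
import numpy as np

def gnn_relax_states(nodes, neighbors, labels, transition, state_dim, eps=1e-4, max_iters=200):
    """Iterate the GNN state transition until the states stop changing.

    neighbors:  dict node -> list of neighboring node ids.
    transition: callable (label, sum_of_neighbor_states) -> new state vector;
                assumed to be a contraction mapping so that the fixed point is unique.
    """
    states = {n: np.zeros(state_dim) for n in nodes}        # default initial states
    for _ in range(max_iters):
        new_states = {}
        for n in nodes:
            s_sum = sum((states[m] for m in neighbors[n]), np.zeros(state_dim))
            new_states[n] = transition(labels[n], s_sum)    # input is [L_i, sum of neighbor states]
        # convergence test: ||S(t) - S(t-1)|| <= eps_f terminates the iteration
        diff = max(np.linalg.norm(new_states[n] - states[n]) for n in nodes)
        states = new_states
        if diff <= eps:
            break
    return states   # stable states; the output layer is applied to these
```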

Graph Neural Network version 2

There are learning problems that require the encoding of graph structures in which a node itself represents another structured object. This results in a graph whose nodes may be labelled by another graph [98]. Consider the situation illustrated in Figure 3.2: the World Wide Web can be represented as a hyperlink graph in which the nodes are the Web pages, and the links represent the hyperlinks. The Web pages are semi-structured objects which can be represented by a content structure. Hence, the nodes in the hyperlink graph can be described (labelled) by a document structure. These two levels of graphs are collections of graphs from different domains, and hence we propose that the encoding of these different sets of graphs is best done using distinct subsets of network weights. This can be described formally as follows. Assume that we have a node $i$ in a graph with a neighborhood $N^0[i]$, which consists of a number of other nodes, where the superscript $0$ denotes the topmost level, i.e., the XML graph or CLG graph level. The nodes in the neighborhood $N^0[i]$ all have connections via links to node $i$. Each node is described by a set of features, and each link is described by yet another set of features. Each node can have an external input. If we assume that each node is described by an entity called a state, then node $i$ is described by a state $\mathbf{x}^0_i$, an $n_0$-dimensional vector. The state of node $i$ can be described by Equation (4.34):

$$\mathbf{x}^0_i = F^0_i(\mathbf{u}^0_i,\, C^0(\mathbf{x}^0_{N^0[i]}),\, \mathbf{x}^1_i) \qquad (4.34)$$

where $\mathbf{u}^0_i$ is the input to node $i$, $F^0_i(\cdot,\cdot,\cdot)$ is a nonlinear vector function (which may be realized, e.g., through a hyperbolic tangent function or a sigmoid function), and $C^0$ denotes the connections from the neighborhood $N^0[i]$ and the vector $\mathbf{x}^1_i$ into the $i$-th node via connecting links. $\mathbf{x}^1_i$ is an $n_1$-dimensional vector denoting the state⁵ of the graph which is attached to the $i$-th node. In other words, here we assume an additive model, in which the (state of the) graph attached to a node $i$ is assumed to be an additional input to node $i$. The state $\mathbf{x}^1_i$ is described similarly as follows:

$$\mathbf{x}^1_i = F^1_i(\mathbf{u}^1_i,\, C^1(\mathbf{x}^1_{N^1[i]}),\, \mathbf{x}^2_i) \qquad (4.35)$$

⁵ Here we abuse the notion of state $\mathbf{x}^1$ for simplicity's sake. The state here can be the output of the graph, or the concatenation of the individual states of the nodes in the child graph.

where $\mathbf{x}^1_i$ denotes the level-1 state⁶, that is, the state of the graph (or the output of the graph) representing the child document of the $i$-th node; $\mathbf{u}^1_i$ denotes the input into the level-1 node; $F^1_i$ denotes a nonlinear vector function; $C^1$ denotes the connections into node $i$ in the child graph; and $\mathbf{x}^2_i$ denotes the state of the child graph associated with node $i$ in level 1. From Equation (4.34) and Equation (4.35) the recursive nature of the approach becomes clear. The recursion stops at the maximum level of encapsulation of graphs. If we assume that there are $k$ levels, then the $k$-th level is described by the following equation:

$$\mathbf{x}^k_i = F^k_i(\mathbf{u}^k_i,\, C^k(\mathbf{x}^k_{N^k[i]})) \qquad (4.36)$$

Thus, the methodology allows the processing of graphs whose nodes are labelled by other graphs, whose nodes are labelled by yet other graphs, and so on. Each set of graphs defines a level in the GoG. If a GoG consists of $k$ levels, then the GoG is said to be of depth $k$. If $k = 1$, then this describes the case of the standard GNN. At the $k$-th level, by assumption, there are no further inputs from a graph within node $i$. Thus $k$ denotes the terminal level: the depth of the GoG structure. Then, a mapping from the state space to the output space takes place as follows:

$$\mathbf{o}_i = G_i(\mathbf{u}^0_i,\, B(\mathbf{x}^0_i)), \qquad (4.37)$$

where $B$ denotes the configuration relating the state vector $\mathbf{x}^0_i$ to the output vector $\mathbf{o}_i$. The output can then be compared to an associated target value, and a gradient descent method can be applied to update the system with the aim of minimizing the squared difference between the network output and the target values. It is noted that equations similar to Equation (4.37) can be written to provide an output at any of the $k$ levels. Note that Equation (4.37) is suitable for node focused applications. With node focused applications, a model is required to produce an output for any node in a graph. In contrast, graph focused applications require one output for each graph. In the literature, a graph focused behavior of such systems is achieved either by selecting one node (i.e., the root node) to be representative of the graph as a whole [32, 4], or by producing a consolidated mapping from all states [81]. The same principles can be applied here. It is important to note that the model at any level other than level 0 is a graph focused one, since the state of the graph (as a whole) which is associated with a given node $i$ is forwarded as an input to node $i$.

⁶ Again we abuse the notion of state here, as it can represent either the output of the child graph associated with node $i$ in level 0, or the concatenation of the individual states of the nodes in the child graph.

In contrast, the model at level 0 can be either a graph focused one or a node focused one, depending on the requirements of the underlying learning problem. It is trivial to observe that the GoG model accepts the common graph model as a special case. Indeed, if $k = 1$, the model collapses to the standard graph model considered in [81]. Since the graph model in [81] contains the time series as a special case (a tree), one may observe that the GoG model may be considered the most general graph model formulated to represent objects so far. An MLP can be used to compute the states at each level. Hence, the model consists of $k$ MLPs which have forward (and backward) connections to the MLP at the next level. Once the GoG model is expressed in recursive form, as shown in Equations (4.34) to (4.36), it is quite clear how the unknown parameters⁷ in the model can be trained using the standard back propagation algorithm. The training algorithm can be stated quite simply as follows:

Step 0: Initialization. The parameters in the model are initialized randomly.

Step 1: Starting from the deepest level $k$, compute the states, and progressively compute the states at levels $k-1, k-2, \dots, 0$.

Step 2: Compare the outputs at the topmost level with the desired ones, and form an error function.

Step 3: Back propagate the error from the topmost level through the levels until the innermost level $k$ is reached, and update the parameters of the model in the process.

Steps 1 through 3 are repeated for a limited number of times, or until the sum of squared errors falls below a prescribed threshold; then the algorithm stops.

⁷ Here we assume that the function $F^k_i$ is characterized by a set of parameters. For example, if the encoding mechanism is a multilayer perceptron, then the unknown parameters will be the strengths of the connections from the inputs to the hidden layer neurons. In the case where we use the outputs, the set of parameters will also include the strengths of the connections from the hidden layer neurons to the output neurons. For simplicity, we assume a shared weight model, i.e., all $F^k_i$ in the same level $k$ share the same set of weights.

A minimal sketch of this training loop is given below.
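In the sketch, the per-level model objects and their methods (forward_states, output, backward) are hypothetical placeholders standing in for the level-wise encoding networks; they are not part of an existing API, and the stopping criterion is only one possible choice.

```python
import numpy as np

def train_gog(levels, targets, epochs=100, tol=1e-3):
    """Sketch of the GoG training loop (Steps 0-3).

    levels: list of per-level models; levels[0] is the topmost level and levels[-1]
            the deepest level k. Step 0 (random initialization of the parameters)
            is assumed to happen inside each level model.
    """
    for epoch in range(epochs):
        # Step 1: compute the states from the deepest level k up to level 0.
        states = None
        for model in reversed(levels):            # levels[k], ..., levels[0]
            states = model.forward_states(states)
        outputs = levels[0].output(states)        # outputs at the topmost level
        # Step 2: compare with the desired outputs and form an error function.
        error = outputs - targets
        sse = float(np.sum(error ** 2))
        if sse < tol:
            break                                 # stop once the SSE is small enough
        # Step 3: back propagate from the topmost level down to level k,
        # updating the parameters of each level in the process.
        delta = error
        for model in levels:
            delta = model.backward(delta)
```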

Figure 4.5 shows the architecture of a sample GNN² for encoding GoGs of up to two levels. There are two network components in the model: the encoding network and the output network. The encoding network deals with the graph which describes the nodes of another graph, and its outputs are attached to the label vector of the corresponding node.

Figure 4.5: A schematic view of the GNN² network.

The errors are propagated back through the connections between the label vector and the hidden layer in the output network, and then passed on to the encoding networks. This can be viewed more clearly in Figure 4.6, which shows how such a GNN² deals with a simple two-level GoG. In this example, a graph $g_1$ contains four nodes. One node, for instance, is associated with a label vector $l_0$ and a descriptive graph $g_0$ which consists of three nodes and three edges. The encoding network encodes graph $g_0$ by forwarding each node through $f^0_w$, which takes a combined input consisting of the node's label vector and the state vectors of its neighboring nodes. This procedure is repeated until the convergence of the states is observed and the final outputs $o^0(t)$ for the node can be computed. The output network encodes the graph $g_1$ through $f^1_w$. For the node labelled $l_0$, it iteratively takes $l_0$, $o^0(t)$ and the states of its neighboring nodes as inputs and produces the state $x(t)$. Once all the states have converged, the outputs for the nodes in $g_1$ can be computed. It should be noted that, from an implementation point of view, the GNN² is not simply a stacking of GNN components. In particular, in the recursive backward phase, the errors computed at the output layer of the output network are propagated back through the layers of the output network and then through the layers of the encoding networks in an integrative step.

Figure 4.6: The encoding architecture of the GNN² network for a given graph.

In Figure 4.6, the node $l_0$ of the level-1 graph is attached to a target $T_0$, and once a stable output for $l_0$ is available an error can be computed. This error is propagated recursively through the structure of the output network, and delta values become available for all the label inputs to the output network. These deltas are then set at the output layer of the corresponding encoding network (indicated in Figure 4.6 by the red arrow-headed line, but in the reversed direction) and propagated back through the network architecture and also through the structure of $g_0$, the graph describing the corresponding node in $g_1$. The above procedure handles only a single node of the top-most graph (the graph whose nodes are given a supervised signal, viz. $g_1$), and needs to be repeated for each node in that graph. It can also be seen that the computational cost rises exponentially with the size of the network and the complexity of the structured data. We found that the algorithm is rather sensitive to the initial network conditions and observed that some network weights can reach saturation during training. This negatively affects the rate of convergence. The problem could be addressed by using very small initial network weights and a small learning rate. However, this prolongs the training time very significantly. A much more efficient method to overcome the posed problem is to engage a Jacobian control mechanism on the network weights.

Jacobian control

Jacobian control is a way to restrict the weights of the GNN network to within a limited range during updating. The basic idea is to keep the norm of the Jacobian associated with the weights connecting the input layer to the state layer no larger than 1. The notation used in this section is summarized in Table 4.1. The algorithm can be described as follows:

1. For each node $i$ and each pair of state components $(k, h)$, consider a parent $j$ and a child $j'$ of $i$ and compute $\partial x_{i,k}/\partial x_{j,h}$ and $\partial x_{i,k}/\partial x_{j',h}$ (the Jacobian computation is described in detail later). The derivatives $\partial x_{i,k}/\partial x_{g,h}$ for the other parents or children of $i$ can then be deduced from Equation (4.39).

2. For each $i, k$, compute $\Delta_{i,k}$ as defined in Equation (4.43) and select those $i, k$ such that $\Delta_{i,k} > t$.

3. For the selected $i, k$, compute the additional gradients defined by Equations (4.46), (4.47), (4.48), (4.49) and (4.51).

4. Multiply the computed gradient by a weighting constant $\beta$ and add it to the original squared-error gradient.

The Jacobian is composed of the derivatives $\partial x_{i,k}/\partial x_{j,h}$ of a component $k$ of the state of node $i$ with respect to a component $h$ of the state of another node $j$. The derivative $\partial x_{i,k}/\partial x_{j,h}$ equals

$$\frac{\partial x_{i,k}}{\partial x_{j,h}} = \begin{cases} 0 & \text{if there is no link between } j \text{ and } i \\ \sum_r v_{k,r}\, \sigma'_1(\alpha_{i,r})\, w^1_{r,h} & \text{if } j \text{ is a parent of } i \\ \sum_r v_{k,r}\, \sigma'_1(\alpha_{i,r})\, w^2_{r,h} & \text{if } j \text{ is a child of } i \\ \sum_r v_{k,r}\, \sigma'_1(\alpha_{i,r})\, (w^1_{r,h} + w^2_{r,h}) & \text{if } j \text{ is both a child and a parent of } i \end{cases}$$

where

$$\alpha_{i,r} = \sum_h w^1_{r,h}\, x^1_{i,h} + \sum_h w^2_{r,h}\, x^2_{i,h} + \sum_h w^3_{r,h}\, l_{i,h} + a_r \qquad (4.38)$$

Moreover,

$$\frac{\partial x_{i,k}}{\partial x_{j,h}} = \frac{\partial x_{i,k}}{\partial x_{f,h}} \qquad (4.39)$$

for each pair of nodes $j, f$ that are connected in the same way to $i$, i.e., $j, f$ are both parents of $i$, or they are both children of $i$, and so on. Thus, $\partial x_{i,k}/\partial x_{j,h}$ has to be computed only once for each $i, k, h$. A compact sketch of this procedure, expressed in code, is given below.
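In the sketch, the helpers jacobian_row and jacobian_row_grad are hypothetical placeholders for the closed-form derivatives given in the remainder of this section (or for the simulated back propagation described below); the threshold t and the weighting constant beta correspond to the quantities of the same name in the text, and the default values are only examples.

```python
import numpy as np

def jacobian_penalty_gradient(nodes, parents, children, state_dim,
                              jacobian_row, jacobian_row_grad, t=0.9, beta=0.1):
    """Accumulate the Jacobian-control penalty gradient for one generic weight w.

    jacobian_row(i, k, rel):      vector over h of d x_{i,k} / d x_{j,h} for a
                                  representative parent (rel='pa') or child (rel='ch').
    jacobian_row_grad(i, k, rel): matching vector of d(d x_{i,k}/d x_{j,h}) / dw.
    """
    grad = 0.0
    for i in nodes:
        n_pa, n_ch = len(parents[i]), len(children[i])
        for k in range(state_dim):
            row_pa = jacobian_row(i, k, 'pa') if n_pa else np.zeros(state_dim)
            row_ch = jacobian_row(i, k, 'ch') if n_ch else np.zeros(state_dim)
            # step 2: Delta_{i,k} = sum over (j, h) of |d x_{i,k} / d x_{j,h}|
            delta = n_pa * np.abs(row_pa).sum() + n_ch * np.abs(row_ch).sum()
            if delta <= t:
                continue                         # no penalty contribution from this (i, k)
            # step 3: d p_{i,k}/dw = 2 (Delta_{i,k} - t) * sum_{j,h} sign(.) * d(.)/dw
            g = 0.0
            if n_pa:
                g += n_pa * float(np.dot(np.sign(row_pa), jacobian_row_grad(i, k, 'pa')))
            if n_ch:
                g += n_ch * float(np.dot(np.sign(row_ch), jacobian_row_grad(i, k, 'ch')))
            grad += 2.0 * (delta - t) * g
    # step 4: the caller adds beta * grad to the squared-error gradient of w
    return beta * grad
```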

Table 4.1: The notation for the Jacobian control computation.

$x_{i,k}$: The $k$-th element of the state of node $i$.
$w^1_{r,h}$: Input-to-hidden layer weight of the transition network; the weight connecting the $h$-th component of the (summed) parent states to the $r$-th hidden neuron.
$w^2_{r,h}$: Input-to-hidden layer weight of the transition network; the weight connecting the $h$-th component of the (summed) child states to the $r$-th hidden neuron.
$w^3_{r,h}$: Input-to-hidden layer weight of the transition network; the weight connecting the $h$-th component of the label to the $r$-th hidden neuron.
$a_r$: Hidden layer weight of the transition network; the bias of the $r$-th hidden neuron.
$v_{k,r}$: Hidden-to-state layer weight of the transition network; the weight connecting the $r$-th hidden neuron to the $k$-th neuron of the state layer.
$b_k$: State layer weight of the transition network; the bias of the $k$-th state neuron.
$\sigma_1, \sigma_2$: Activation functions of the hidden and state layers of the transition network, respectively.
$\sigma'_1, \sigma'_2$: The first order derivatives of the activation functions of the hidden and state layers, respectively.
$\sigma''_1, \sigma''_2$: The second order derivatives of the activation functions of the hidden and state layers, respectively.
$\alpha_{i,r}$: Activation level of the $r$-th hidden neuron for the $i$-th node (see Equation (4.38)).
$x^1_{i,h}, x^2_{i,h}$: For a node $i$, the sums of the states of its parents and of its children, respectively (see Equations (4.40), (4.41)).
$l_{i,h}$: The $h$-th component of the label of node $i$.

The $k$-th element of the state of node $i$ is computed as

$$x_{i,k} = \sigma_2\!\left( \sum_r v_{k,r}\, \sigma_1\!\left( \sum_h w^1_{r,h}\, x^1_{i,h} + \sum_h w^2_{r,h}\, x^2_{i,h} + \sum_h w^3_{r,h}\, l_{i,h} + a_r \right) + b_k \right)$$

where

$$x^1_{i,h} = \sum_{j \in \mathrm{pa}[i]} x_{j,h} \qquad (4.40)$$
$$x^2_{i,h} = \sum_{j \in \mathrm{ch}[i]} x_{j,h} \qquad (4.41)$$

The back propagation procedure is simulated by setting the delta error for the state layer to a vector of all zeros except for the $k$-th component, which is set to 1. Applying the back propagation procedure then automatically produces $\partial x_{i,k}/\partial m$ for each input node $m$ of the encoding net. The derivative $\partial x_{i,k}/\partial x_{j,h}$ is then computed by accumulating the $\partial x_{i,k}/\partial m$ corresponding to node $j$ and component $h$:

$$\frac{\partial x_{i,k}}{\partial x_{j,h}} = \sum_m \frac{\partial x_{i,k}}{\partial m}$$

where the above sum ranges over the $j$ having a link to $i$ and over the components $h$ that are given as input to $m$.

The core of Jacobian control is to keep the norm of the Jacobian,

$$J = \max_{i,k} \sum_{j,h} \left| \frac{\partial x_{i,k}}{\partial x_{j,h}} \right|,$$

smaller than 1. In order to achieve this, a penalty function is used: $p = \sum_{i,k} p_{i,k}$, where

$$p_{i,k} = \begin{cases} 0 & \text{if } \sum_{j,h} \left| \frac{\partial x_{i,k}}{\partial x_{j,h}} \right| < t \\ \left( \sum_{j,h} \left| \frac{\partial x_{i,k}}{\partial x_{j,h}} \right| - t \right)^2 & \text{if } \sum_{j,h} \left| \frac{\partial x_{i,k}}{\partial x_{j,h}} \right| > t \end{cases} \qquad (4.42)$$

where $t < 1$ is a threshold. Notice that, in order to compute $p_{i,k}$, it is useful to compute:

$$\Delta_{i,k} = \sum_{j,h} \left| \frac{\partial x_{i,k}}{\partial x_{j,h}} \right| \qquad (4.43)$$
$$= |\mathrm{pa}[i]| \sum_h \left| \sum_r v_{k,r}\, \sigma'_1(\alpha_{i,r})\, w^1_{r,h} \right| + |\mathrm{ch}[i]| \sum_h \left| \sum_r v_{k,r}\, \sigma'_1(\alpha_{i,r})\, w^2_{r,h} \right| \qquad (4.44)$$

Thus, Equation (4.42) becomes

$$p_{i,k} = \begin{cases} 0 & \text{if } \Delta_{i,k} < t \\ (\Delta_{i,k} - t)^2 & \text{if } \Delta_{i,k} > t \end{cases} \qquad (4.45)$$

In order to implement the learning, the gradients $\partial p_{i,k}/\partial w$ need to be defined, where $w$ denotes a generic weight:

$$\frac{\partial p_{i,k}}{\partial w} = \begin{cases} 0 & \text{if } \Delta_{i,k} < t \\ 2(\Delta_{i,k} - t) \sum_{j,h} \mathrm{sign}\!\left(\frac{\partial x_{i,k}}{\partial x_{j,h}}\right) \frac{\partial}{\partial w}\!\left(\frac{\partial x_{i,k}}{\partial x_{j,h}}\right) & \text{if } \Delta_{i,k} > t \end{cases}$$

Thus, for the hidden-to-state layer weights $v_{k',r}$:

$$\frac{\partial}{\partial v_{k',r}}\!\left(\frac{\partial x_{i,k}}{\partial x_{j,h}}\right) = \begin{cases} 0 & \text{if there is no link between } j \text{ and } i, \text{ or } k' \neq k \\ \sigma'_1(\alpha_{i,r})\, w^1_{r,h} & \text{if } j \text{ is a parent of } i \text{ and } k' = k \\ \sigma'_1(\alpha_{i,r})\, w^2_{r,h} & \text{if } j \text{ is a child of } i \text{ and } k' = k \\ \sigma'_1(\alpha_{i,r})\, (w^1_{r,h} + w^2_{r,h}) & \text{if } j \text{ is both a child and a parent of } i \text{ and } k' = k \end{cases}$$

The contributions from the different $i$ are accumulated. Thus,

$$\frac{\partial p}{\partial v_{k,r}} = \sum_{i:\, \Delta_{i,k} > t} 2(\Delta_{i,k} - t) \left[ \sum_{j \in \mathrm{pa}[i]} \sum_h \mathrm{sign}\!\left(\frac{\partial x_{i,k}}{\partial x_{j,h}}\right) \sigma'_1(\alpha_{i,r})\, w^1_{r,h} + \sum_{j \in \mathrm{ch}[i]} \sum_h \mathrm{sign}\!\left(\frac{\partial x_{i,k}}{\partial x_{j,h}}\right) \sigma'_1(\alpha_{i,r})\, w^2_{r,h} \right]$$
$$= \sum_{i:\, \Delta_{i,k} > t} 2(\Delta_{i,k} - t) \left[ |\mathrm{pa}[i]| \sum_h \mathrm{sign}\!\left(\frac{\partial x_{i,k}}{\partial x_{j_1,h}}\right) \sigma'_1(\alpha_{i,r})\, w^1_{r,h} + |\mathrm{ch}[i]| \sum_h \mathrm{sign}\!\left(\frac{\partial x_{i,k}}{\partial x_{j_2,h}}\right) \sigma'_1(\alpha_{i,r})\, w^2_{r,h} \right] \qquad (4.46)$$

where $j_1, j_2$ are any parent and any child of $i$, respectively. For the bias $b_k$, we have

$$\frac{\partial}{\partial b_k}\!\left(\frac{\partial x_{i,k}}{\partial x_{j,h}}\right) = 0.$$

For the input-to-hidden weight $w^1_{r,h'}$, we have

$$\frac{\partial}{\partial w^1_{r,h'}}\!\left(\frac{\partial x_{i,k}}{\partial x_{j,h}}\right) = \begin{cases} 0 & \text{no link between } j \text{ and } i \\ v_{k,r}\, \sigma''_1(\alpha_{i,r})\, w^1_{r,h}\, x^1_{i,h'} & j \text{ is a parent of } i \text{ and } h \neq h' \\ v_{k,r}\, \sigma'_1(\alpha_{i,r}) + v_{k,r}\, \sigma''_1(\alpha_{i,r})\, w^1_{r,h}\, x^1_{i,h'} & j \text{ is a parent of } i \text{ and } h = h' \\ v_{k,r}\, \sigma''_1(\alpha_{i,r})\, w^2_{r,h}\, x^1_{i,h'} & j \text{ is a child of } i \\ v_{k,r}\, \sigma''_1(\alpha_{i,r})\, (w^1_{r,h} + w^2_{r,h})\, x^1_{i,h'} & j \text{ is both a parent and a child of } i \text{ and } h \neq h' \\ v_{k,r}\, \sigma'_1(\alpha_{i,r}) + v_{k,r}\, \sigma''_1(\alpha_{i,r})\, (w^1_{r,h} + w^2_{r,h})\, x^1_{i,h'} & j \text{ is both a parent and a child of } i \text{ and } h = h' \end{cases}$$

Thus,

$$\frac{\partial p}{\partial w^1_{r,h'}} = \sum_{i,k:\, \Delta_{i,k} > t} 2(\Delta_{i,k} - t) \left[ \sum_{j \in \mathrm{pa}[i]} \left( \mathrm{sign}\!\left(\frac{\partial x_{i,k}}{\partial x_{j,h'}}\right) v_{k,r}\, \sigma'_1(\alpha_{i,r}) + \sum_h \mathrm{sign}\!\left(\frac{\partial x_{i,k}}{\partial x_{j,h}}\right) v_{k,r}\, \sigma''_1(\alpha_{i,r})\, w^1_{r,h}\, x^1_{i,h'} \right) + \sum_{j \in \mathrm{ch}[i]} \sum_h \mathrm{sign}\!\left(\frac{\partial x_{i,k}}{\partial x_{j,h}}\right) v_{k,r}\, \sigma''_1(\alpha_{i,r})\, w^2_{r,h}\, x^1_{i,h'} \right] \qquad (4.47)$$

For the input-to-hidden weight $w^2_{r,h'}$, we have

$$\frac{\partial}{\partial w^2_{r,h'}}\!\left(\frac{\partial x_{i,k}}{\partial x_{j,h}}\right) = \begin{cases} 0 & \text{no link between } j \text{ and } i \\ v_{k,r}\, \sigma''_1(\alpha_{i,r})\, w^1_{r,h}\, x^2_{i,h'} & j \text{ is a parent of } i \\ v_{k,r}\, \sigma''_1(\alpha_{i,r})\, w^2_{r,h}\, x^2_{i,h'} & j \text{ is a child of } i \text{ and } h \neq h' \\ v_{k,r}\, \sigma'_1(\alpha_{i,r}) + v_{k,r}\, \sigma''_1(\alpha_{i,r})\, w^2_{r,h}\, x^2_{i,h'} & j \text{ is a child of } i \text{ and } h = h' \\ v_{k,r}\, \sigma''_1(\alpha_{i,r})\, (w^1_{r,h} + w^2_{r,h})\, x^2_{i,h'} & j \text{ is both a parent and a child of } i \text{ and } h \neq h' \\ v_{k,r}\, \sigma'_1(\alpha_{i,r}) + v_{k,r}\, \sigma''_1(\alpha_{i,r})\, (w^1_{r,h} + w^2_{r,h})\, x^2_{i,h'} & j \text{ is both a parent and a child of } i \text{ and } h = h' \end{cases}$$

Thus,

$$\frac{\partial p}{\partial w^2_{r,h'}} = \sum_{i,k:\, \Delta_{i,k} > t} 2(\Delta_{i,k} - t) \left[ \sum_{j \in \mathrm{pa}[i]} \sum_h \mathrm{sign}\!\left(\frac{\partial x_{i,k}}{\partial x_{j,h}}\right) v_{k,r}\, \sigma''_1(\alpha_{i,r})\, w^1_{r,h}\, x^2_{i,h'} + \sum_{j \in \mathrm{ch}[i]} \left( \mathrm{sign}\!\left(\frac{\partial x_{i,k}}{\partial x_{j,h'}}\right) v_{k,r}\, \sigma'_1(\alpha_{i,r}) + \sum_h \mathrm{sign}\!\left(\frac{\partial x_{i,k}}{\partial x_{j,h}}\right) v_{k,r}\, \sigma''_1(\alpha_{i,r})\, w^2_{r,h}\, x^2_{i,h'} \right) \right] \qquad (4.48)$$

For the input-to-hidden weight $w^3_{r,h'}$, we have

$$\frac{\partial}{\partial w^3_{r,h'}}\!\left(\frac{\partial x_{i,k}}{\partial x_{j,h}}\right) = \begin{cases} 0 & \text{no link between } j \text{ and } i \\ v_{k,r}\, \sigma''_1(\alpha_{i,r})\, w^1_{r,h}\, l_{i,h'} & j \text{ is a parent of } i \\ v_{k,r}\, \sigma''_1(\alpha_{i,r})\, w^2_{r,h}\, l_{i,h'} & j \text{ is a child of } i \\ v_{k,r}\, \sigma''_1(\alpha_{i,r})\, (w^1_{r,h} + w^2_{r,h})\, l_{i,h'} & j \text{ is both a parent and a child of } i \end{cases}$$

Thus,

$$\frac{\partial p}{\partial w^3_{r,h'}} = \sum_{i,k:\, \Delta_{i,k} > t} 2(\Delta_{i,k} - t) \left[ \sum_{j \in \mathrm{pa}[i]} \sum_h \mathrm{sign}\!\left(\frac{\partial x_{i,k}}{\partial x_{j,h}}\right) v_{k,r}\, \sigma''_1(\alpha_{i,r})\, w^1_{r,h}\, l_{i,h'} + \sum_{j \in \mathrm{ch}[i]} \sum_h \mathrm{sign}\!\left(\frac{\partial x_{i,k}}{\partial x_{j,h}}\right) v_{k,r}\, \sigma''_1(\alpha_{i,r})\, w^2_{r,h}\, l_{i,h'} \right] \qquad (4.49)$$

And finally, for the bias $a_r$,

$$\frac{\partial}{\partial a_r}\!\left(\frac{\partial x_{i,k}}{\partial x_{j,h}}\right) = \begin{cases} 0 & \text{if there is no link between } j \text{ and } i \\ v_{k,r}\, \sigma''_1(\alpha_{i,r})\, w^1_{r,h} & \text{if } j \text{ is a parent of } i \\ v_{k,r}\, \sigma''_1(\alpha_{i,r})\, w^2_{r,h} & \text{if } j \text{ is a child of } i \\ v_{k,r}\, \sigma''_1(\alpha_{i,r})\, (w^1_{r,h} + w^2_{r,h}) & \text{if } j \text{ is both a parent and a child of } i \end{cases} \qquad (4.50)$$

Thus,

$$\frac{\partial p}{\partial a_r} = \sum_{i,k:\, \Delta_{i,k} > t} 2(\Delta_{i,k} - t) \left[ \sum_{j \in \mathrm{pa}[i]} \sum_h \mathrm{sign}\!\left(\frac{\partial x_{i,k}}{\partial x_{j,h}}\right) v_{k,r}\, \sigma''_1(\alpha_{i,r})\, w^1_{r,h} + \sum_{j \in \mathrm{ch}[i]} \sum_h \mathrm{sign}\!\left(\frac{\partial x_{i,k}}{\partial x_{j,h}}\right) v_{k,r}\, \sigma''_1(\alpha_{i,r})\, w^2_{r,h} \right] \qquad (4.51)$$

Jacobian control has been implemented in the GNN² as a plug-in for solving the stability issue of the output network under certain initial conditions. The experiments presented in this thesis confirm the effectiveness of the approach.

4.4 Conclusion

Machine learning methods, including supervised and unsupervised models, were examined in some detail in this chapter. Unsupervised methods based on the SOM were extended in succession towards the current state of the art in unsupervised machine learning on graphs (viz. the PMGraphSOM). The proposed extension is capable of processing data featuring cyclic and undirected graphs. Taking the likelihood of a mapping into consideration helps to produce more consistent and stable results. Supervised machine learning methods were also reviewed, from the basic feedforward neural network model to models that can solve learning tasks on graphs (the GNN). Responding to the need to encode GoGs, we proposed an extension of the GNN model to a hybrid model that contains multiple GNNs as components. The method is called GNN². While the GNN² is also capable of encoding a simple case of GoG, viz. a sequence of trees, the proposed statistical model, the GHMM, is specifically designed to encode such data structures. The GHMM is built

as a combination of an HMM, an RMLP, and RBFs, and is known to solve problems involving the recognition of sequential graphic patterns. In subsequent chapters, the focus will shift to evaluating the newly introduced methods, viz. the PMGraphSOM, the GHMM, and the GNN², on benchmark problems and on several real-world problems.

108 Chapter 5 Benchmark datasets 5.1 Introduction Learning structures from data is a topical research area. The data usually can have structural representations, such as trees, sequences, graphs, etc. One of the research goals of this thesis is to find suitable machine learning algorithms for encoding structured documents. To allow for comparisons and practical evaluations it is beneficial to develop testbeds or to utilize existing benchmark problems. There are a number of existing benchmark collections available for learning tasks involving structure. These include an artificial benchmark problem called the Policeman benchmark [44, 47], and some real world benchmark problems such as the INEX XML dataset [26], and the WebSpam dataset [19]. The policeman benchmark provides a collection of images represented as trees [39]. This is an artificially generated dataset and hence its properties can be controlled. This is ideal as an initial dataset used to evaluate algorithms which are proposed for processing structural domains. A real world learning problem is provided by the INitiative for the Evaluation of XML Retrieval (INEX). INEX holds annual competitions on XML document mining tasks and provides XML documents drawn from sources such as from the Wikipedia database [36, 37]. Another benchmark dataset involves Web spam data from a subset of the Web and is provided by the Laboratory of Web Algorithmics, Università degli Studi di Milano for advanced research on Web spam detection [18, 19, 79]. We conduct a series of statistical analysis on these datasets and perform the necessary pre-processing steps for conducting the experiments. These datasets will be described in more details in the following sections. 74

109 5.2. Policeman Benchmark Policeman Benchmark The policeman benchmark problem (available via consists of 375 trees which were extracted from images featuring policemen, houses, and ships [39]. The nodes of the tree correspond to regions of uniform color in the original image, the links between the nodes show how the regions are connected. A label is associated with each node indicating the coordinates of the center of gravity of the region. The main properties of the dataset are listed in Table 5.1. It is observed that the trees representing the class policeman are of deep and narrow structure while ships and houses produce wider and shallower graphs. The patterns in such a dataset are distinguishable through both structure and features embedded in the node label, or only through the node label. Table 5.1: Main properties of the Policeman benchmark dataset Type Max. outdegree Max. num. Min. num. Max. Min. Num. of of nodes of nodes depth depth subclasses Policeman Ship House The trees of the three classes in the dataset are then categorized into 12 sub-classes: 1. Policeman with a raised left arm 2. Policeman with a lowered left arm 3. Ships featuring two masts 4. Ships with three masts 5. Houses without window 6. Houses with one window in lower left corner 7. Houses with one window in upper right corner 8. Houses with one window in upper left corner 9. Houses with two windows in lower left and upper left 1. Houses with two windows in upper left and upper right 11. Houses with two windows in lower left and upper right 12. Houses with all three windows

110 5.3. INEX28 XML Mining Dataset 76 Figure 5.1: Sample sequence for policeman rotating clockwise Figure 5.2: Sample sequence for policeman rotating anti-clockwise The trees featuring policeman can also be represented as a sequence of trees. Such a sequence can be used to simulate some movements of the policeman. For example, the policeman raises arms, the policeman rotates, etc. Figure 5.2 shows two sample sequences that simulate the policeman rotating clockwise and anti-clockwise respectively. The policeman dataset requires machine learning models capable of processing tree structures and an ability to deal with sequences of trees. 5.3 INEX28 XML Mining Dataset The dataset consists of 114,366 Wikipedia XML documents with given hyperlink information,1% documents are labelled with one of15 classes [36]. There are totally 636, 187 directed links between documents in the dataset. The maximum number of out links (out-degree) from a document is 1,527 and a document on an average has 11 neighboring documents (either linked by or link to). The XML documents are provided in full; it allows us to extract XML parsing trees, BoW vectors and other features from the documents: XML tag tree: For each document, all existing XML tags were extracted. Find all unique XML tags contained in the dataset and count the numbern. Initialize an N-dimension tag vector for each document and each element in the vector associates to a particular tag.

111 5.3. INEX28 XML Mining Dataset 77 For each document we update the tag vector by counting the occurrences of the tag within the document and assign the value of counts to the corresponding element. We divide the documents into 16 groups, the first 15 groups corresponding to the 15 given classes, and all unlabelled documents are covered in the last group. For each unique tag, we count the number of documents from each class and build a 16 N matrix. For each row (each tag), we also compute the standard deviation of percentages among different groups. This helps to discover the correlation between the tags and the given classes. In order to reduce the dimension of the tag vector, we filter out some tags according to following rules: 1. Remove tags which are not contained in any labelled documents. 2. Remove tags which are not contained in any unlabelled documents. 3. Remove tags where the standard deviation is less than a given threshold. Use PCA (Principal Component Analysis) [56] to reduce the dimension N to n by using the firstnprincipal components. Attach an n-dimension tag-vector as the labels to each document. There are totally626 unique tags and83 of them exist in labelled documents. After filtering, there are 14 tags remaining in the tag vector. After PCA, the first three principal components are used in the label vector. Document text vector: By following the same steps for the XML tag analysis, we extract text contents of the documents and build text vector for each document. There are 567, 34 unique words, and remaining 432, 94 unique words after stemming. Since the dimension of text vector is too large to be analysed by the PCA software, we filter out some words based on the rules similar with the ones defined for reducing the dimension of the tag vector. After PCA, the five most important principal components are included in the label vector for the document. Template vector: Documents in the dataset are using different templates which are defined within the tag < template >. We extract names of templates appearing in the documents and conduct similar statistical analysis. By investigating the

matrix built from templates and target categories, it is observed that the template information is associated with the document classes. Some documents within the same class share similar sets of templates. We keep the first four principal components of the template vector after applying the PCA.

Others: Apart from the tags, there are some tag-related features such as the out-degree and the depth of the tag parsing tree. In addition to the links between documents, there are also some unknown links which connect documents with other web resources. In order to analyse the importance of these features for the clustering task, we conducted a statistical analysis, the results of which show no obvious correlation between these features and the document classes.

For more complete feature vectors, we combine the 5-dimensional text information, the 4-dimensional template information and the 3-dimensional tag information for the experiments (12 dimensions in total). The documents are interlinked with each other, and this allows us to build a link graph where the nodes represent documents and the edges represent links. Each node is labelled by a 12-dimensional feature vector.

5.4 INEX 2009 XML Mining Dataset

The dataset provided by INEX 2009 is a collection of XML formatted documents from the Wikipedia domain [37]. The documents are interlinked via an xref or hyperlink structure. INEX has provided a subset of the Wikipedia articles for the classification task. The dataset consists of 54,889 documents. A target label is available for 11,28 of these documents; hence, these documents provide a supervising signal to the training algorithm. All remaining 43,861 documents are documents in the test dataset. The task is to classify the 43,861 documents for which no target is available. The documents are interlinked. However, we found that only 54,121 contain outgoing links. The maximum out-degree (the maximum number of outgoing links for any one document) for this dataset is 2,382, and the maximum in-degree is 27,518. The dataset contains a total of 4,554,23 links. For simplicity, we remove the redundant links (if a document contains several links to the same document, then we count it only once). The removal of redundant links reduces the maximum out-degree quite significantly to 969, the maximum in-degree to 15,27, and the total number of links to 3,368,54. A small sketch of this pre-processing step is given below.
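The sketch assumes the links are available as an edge list of (source, destination) document-id pairs; this input format is an assumption made for illustration.

```python
from collections import defaultdict

def dedupe_links(edges):
    """Count multiple links between the same ordered pair of documents only once,
    and report the resulting maximum out-degree and in-degree."""
    unique = {(src, dst) for src, dst in edges}
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for src, dst in unique:
        out_deg[src] += 1
        in_deg[dst] += 1
    max_out = max(out_deg.values(), default=0)
    max_in = max(in_deg.values(), default=0)
    return unique, max_out, max_in
```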

113 5.4. INEX29 XML Mining Dataset 79 The result is a graph whose nodes represent the XML documents, and the links represent the hyperlinks. The nodes can be labelled so as to describe each of the XML documents. It is quite obvious that XML formatted documents are suitably represented by a graph structure, and hence, the node labels in the hyperlink graph can be labelled by a set of graphs describing each of the documents. For this dataset, we consider a variety of possibilities for labeling the nodes: XML tag tree: Each document is provided in an XML format, and can be naturally represented by an XML parsing tree. The tree consists of nodes which represent the XML tags, and links which represent the nesting of the tags. The nodes are labelled by an identifier which uniquely identifies the associated tag. The way to extract the XML parsing tree from XML documents is described in [45]. XML tag graph: Each node in this graph represents a unique tag within the document, and edges represent the relationship between tags. We then apply the well-known rainbow software package (a package which implements the bag of words approach) on tags and compute the information gain for each with respect to different categories. The top 1 tags with highest information gain were selected and attached to the nodes in the tag graph as node labels. Concept Link graph: Concept link graph is a novel text representation scheme which encodes the contextual information of a document using a graph of concepts. Specifically, for a document d, it is represented as a weighted, undirected graph d = {N,E} wheren is the set of nodes representing the concepts, ande is the set of edges representing the strengths of association among concepts. The Concept Link graph extraction method was described in Chapter 3 and in [2]. TF-IDF(term frequency-inverse document frequency) vector: INEX provided a Term-Frequency vector (presumably obtained by the well-known Bag-of-Words algorithm) [57]. The i-th element of the vector lists the number of occurrences of the i-th dictionary word. Hence, the vector encodes the textual content of a given document. Rainbow classification results: The rainbow software was used to classify the test documents over all categories [66]. That is, for each category, we split the

training dataset into two classes: one containing the documents belonging to the category, and the other containing the documents not belonging to it. In this way, rainbow can produce a probability vector for each test document against all categories.

The nodes in the hyperlink graph are labelled by either graphs describing the document of the associated node, a vector describing the content of the same document, or both. We extracted the XML structure for each of the documents, then attached the XML structure as a label to the associated node in the hyperlink graph. Some of the documents are listed for the task but are not provided with the original file, so that the XML structure is not available for those unknown documents. Thus, there are a total of 54,575 XML structures. The maximum out-degree of any of the XML structures is 1,6, and the total number of nodes is 42,668,59. The latter is equivalent to the total number of XML tags in the dataset. We removed redundant tags by consolidating successively repeated XML tags and obtained somewhat reduced XML graphs with a maximum out-degree of 533. We realized that only few XML trees have an out-degree larger than 6. Hence, we truncated the out-degree to 6, since any algorithm driven by back-propagation discards sparse information anyway. Alternatively, we also computed the concept link graph for each document in the dataset, and used this as a second level of graphs in a GoG representation of the domain. The (given) term frequency vector is of dimension 186,723, which corresponds to the 186,723 dictionary words found in the dataset. There is one term frequency vector for each document, and hence this can also be used to label the nodes in the hyperlink graph. The term frequency vector is very sparse because not every document contains all dictionary words. We were able to reduce the dimensionality of the term frequency vector to 439 by building a matrix of categories and features (the terms), then counting the number of documents belonging to a category that contain the feature. We retained only those features which exhibited a standard deviation larger than 1, and hence exhibited the greatest discriminative power. We also used another approach to reduce the dimension of the term frequency vector: computing the information gain of each word in the dataset with respect to the different classes, then selecting the five words with the highest information gain per class. We ended up with 133 unique words, thus producing a 133-dimensional feature vector. A sketch of this information-gain based selection is given below.
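In the sketch, binary word-presence features and the standard entropy-based definition of information gain are assumptions about details the text leaves open, and the helper names are ours.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def information_gain(presence, labels):
    """Information gain of a binary word-presence feature w.r.t. a binary class labelling."""
    labels = np.asarray(labels, dtype=bool)
    presence = np.asarray(presence, dtype=bool)
    base = entropy(np.bincount(labels.astype(int), minlength=2) / len(labels))
    ig = base
    for value in (True, False):
        mask = presence == value
        if mask.any():
            cond = np.bincount(labels[mask].astype(int), minlength=2) / mask.sum()
            ig -= mask.mean() * entropy(cond)
    return ig

def select_words(doc_word_presence, class_memberships, per_class=5):
    """doc_word_presence: (docs x words) 0/1 matrix; class_memberships: dict mapping
    class name -> boolean vector over docs. Returns the union of the top words per class."""
    selected = set()
    for labels in class_memberships.values():
        gains = [information_gain(doc_word_presence[:, w], labels)
                 for w in range(doc_word_presence.shape[1])]
        selected.update(int(w) for w in np.argsort(gains)[-per_class:])
    return sorted(selected)
```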

115 5.4. INEX29 XML Mining Dataset 81 Table 5.2: The number of documents belonging to 39 different categories. History Space WorldWarI Weather Philosophy Video Games Saints Tropical Cyclones Politics Cricket Chess Chemistry Medicine War Literature Astronomy Horror Pornography Geography Nautical Science Anarchism Architecture Bible Business Comics Formula One American Civil War Pharmacy Physics Christianity Music Biography Religion Food Baseball Catholicism Aviation Train The classification problem is defined by 39 classes. The classes are known to represent categories to which a document belongs. The total number of categories in the given dataset is 39. Thus, this produces a 39 dimensional target vector containing binary elements. Note that the target vector may contain several non-zero elements. This indicates that the corresponding document can belong to several categories. We note also that the distribution of the different categories is not balanced (see Table 5.2). The largest category contains 1, 337 documents, while the smallest category contains 191 documents, and average number of documents per category is 414. As a result, this dataset is described by a GoG which is similar as was depicted in Figure 3.2, and is consisting of two levels: Level : One graph per document, describe the structure, contents or other properties within the document. Level 1: One graph where documents are nodes, connected via links represent the hyperlinks between documents.

116 5.5. WebSpam Dataset - UK WebSpam Dataset - UK26 The UK26 dataset was provided for advanced research on web spam detection and was obtained using a large set of Web pages from the.uk domain downloaded in May 26 [18]. The pages were manually assessed by humans according to guidelines available on-line. Some statistics of the spam dataset are listed below: Number of hosts: 11,42 Number of labelled hosts: 1,9 Hosts in the training set: 8,239. 7,472 non-spam hosts,767 spam hosts Hosts in the test set: 1, non-spam hosts,1,25 spam hosts Number of hosts given content-based features: 8,944 Number of hosts given link-based features: 11,42 Number of links among hosts: 73,774 Maximum out-degree: 5, 994 Average number of out links: 64 Dimension of content-based features: 96 Dimension of link-based features: 41 There are 8, 944 hosts in the dataset provided with 96 dimensional content-based features and all hosts were provided with 41 dimensional link-based features. These features can be preprocessed by using models such as unsupervised methods (PMGraphSOM) and supervised methods (MLP and SVM). Hence, the dimension of the feature vectors could be reduced via a mapping between the original feature space and the classification targets, so that the problem is simplified for further training. Some pre-computed content-based, link-based and host graph are provided. By considering the memory usage and training efficiency, the features are truncated to a reasonable size such that they can fit into the memory capacities of a modern desktop computing machine.

117 5.5. WebSpam Dataset - UK26 83 Content-based features. There are two types of web pages defined for each host: home page (the index page of the host) and the main page (the page with highest PageRank in the host). There are 12 different types of features included (a total of 24-dimensional feature vector): number of words in the page, number of words in the title, average word length, fraction of anchor text, fraction of visible text, compression rate of the page, corpus precision, corpus recall, query precision, query recall, entropy and independent likelihood. Here the corpus recall is defined as the fraction of popular terms that appear in the page by considering the k most frequent words in the dataset. The corpus precision is defined as the fraction of words in a page that appear in the set of popular terms. For both corpus precision and corpus recall, there are four features extracted fork = 1,2,5 and 1 respectively. As with the corpus precision and recall, a total of eight features are extracted for query precision and recall by considering the set of q (q = 1, 2, 5 and 1 respectively) the most popular terms in a query log. Four sets of 24 dimensional features are included in the final content-based feature vector in use: (1) the features of the home page, (2) the features of the main page, (3) the average values for all the pages in the host and (4) standard deviation for all the pages in the host. We only used 21 dimensional features (without the number of words in the page, the number of words in the title, the average word length) for the home pages in the experiments. This was done mainly in order to reduce the turn-around time of the experiments. Link-based features. There are a total of 41 dimensional link-based features including: 16 dimensional degree-related features (in-degree, out-degree, number of neighbours, etc.), 11 dimensional PageRank-related features, 11 dimensional truncated PageRank features and 2 dimensional TrustRank-related features. The complete feature vectors will be used for the experiments. Links. Outgoing links of a host have been truncated to be at most 64 which is the average number of out-degree for all the hosts in the dataset. Multiple links between any two hosts were counted only once.
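As an illustration of the corpus precision and corpus recall features described above, the following sketch computes the two values for a single page. The tokenisation, the treatment of repeated words, and the placeholder K_VALUES are assumptions made for this example rather than the exact procedure used to build the dataset.

```python
def corpus_precision_recall(page_tokens, popular_terms):
    """Corpus precision: fraction of the words in the page that are popular terms.
    Corpus recall: fraction of the popular terms that appear in the page."""
    popular = set(popular_terms)
    if not page_tokens or not popular:
        return 0.0, 0.0
    precision = sum(1 for tok in page_tokens if tok in popular) / len(page_tokens)
    recall = len(set(page_tokens) & popular) / len(popular)
    return precision, recall

# One precision/recall pair is produced per choice of k (the text uses four values of k):
# features = [corpus_precision_recall(tokens, top_terms[:k]) for k in K_VALUES]
```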

118 5.6. WebSpam Dataset - UK27 84 In summary, for the UK26 dataset we construct two types of input for training: 1. Vectors. Each pattern represents a document described by a vector. This can be used for MLP training and support vector machine (SVM) training. Such a vector could contain selected content-based features, link-based features or a combination of both. 2. Hyperlink graph. The nodes are documents and edges are hyperlinks, each node is labelled by a feature vector. Such representation is accepted by PMGraphSOM and GNN. This dataset allows for a direct comparison between MLP, SVM, PMGraphSOM, and the GNN 5.6 WebSpam Dataset - UK27 The WEBSPAM-UK27 is a newer collection for research of Web Spam Detection. This is based on a crawl of the.uk domain done on May 27, and labelled by a group of volunteers [79]. Some statistics are as follows: Number of hosts: 114,529 Number of hosts given targets (spam/non-spam/undecided): 4, 275 in the training dataset, 2,24 in the testing dataset. 6,53 spam and non-spam hosts Number of hosts in the training dataset: 3, 776 non-spam, 222 spam and 277 undecided. 3, 998 spam and non-spam hosts Number of hosts in the testing dataset: 1,933 non-spam, 122 spam and 149 undecided. 2, 55 spam and non-spam hosts Maximum out-degree: 51, 692 Total number of out links: 1,885,82 Average number of links: 16 Number of hosts which have content-based feature: 19,361

119 5.7. Conclusion 85 Compared to UK26 dataset, this newer dataset is considerably larger but contains fewer labelled hosts. The nodes are more sparsely connected and the out-degree varies over a much larger range. Hence, these pose as additional challenges for this dataset. For experimental purpose, we also construct both vector-based and structure-based representations for this dataset. 5.7 Conclusion In this chapter, some existing benchmark problems used for the experiments in this thesis are described. Moreover, some necessary pre-processing on the dataset is performed in order to fit the available features in the computer memory of a modern day desktop computing machine for performing the experiments. The policeman dataset has been expanded by combining a set of images to form sequential graph patterns. This is considered as a dataset to evaluate the efficacies of an instance of GoG. We also constructed more complex GoGs on the INEX29 dataset by considering a Web graph containing nodes described by CLGs. Reasonable truncations on the dataset have been performed in order to reduce the computational cost. The selected benchmark problems cover the learning tasks which include clustering, classification and regression tasks. Such a selection is a useful testbed for the evaluation of the ability of machine learning approaches proposed in this thesis on encoding structured data (trees, graphs, graph of graphs, etc).

120 Chapter 6 Clustering 6.1 Introduction Clustering algorithms are most commonly designed to work on unsupervised learning problems. A cluster is defined as a collection of similar objects [41]. Thus, the objective of a clustering algorithm is to group objects such that objects within the same cluster are sharing similar features. In contrast, objects separated into different clusters are more distinguishable. This can help discover the characteristics and relationships of data (images, structured documents, etc). In this Chapter, we investigate several unsupervised machine learning methods on benchmark problems involving the clustering of graph structured objects. Applied are the GraphSOM and our proposed extension, the PMGraphSOM, on an image clustering task. We then provide a comparison of the results. It is shown in Section 6.2 that the PMGraphSOM outperforms GraphSOM from aspect of performance and training efficiency. We then apply the PMGraphSOM for the identification of similarities on document features and document structure. The work was carried out as part of an international competition on document categorization (the INEX 28 competition). A set of experiments are carried out with an aim of examining how different training parameters contribute to the system s performance. Some typical results are illustrated and discussed in Section 6.3. It is shown that the PMGraphSOM is capable of producing the state-of-the-art performance on this benchmark problem. 6.2 Image Clustering The unsupervised machine learning methods GraphSOM and PMGraphSOM are applied on an image clustering task involving Policeman dataset. The Policemen benchmark 86

dataset is an artificial learning problem designed to validate the ability of a learning method to encode both structure and feature vectors (see Section 5.2). This task aims to evaluate the PMGraphSOM; it will be found that the PMGraphSOM outperforms its predecessor, the GraphSOM. The task is to segment the graphs in the dataset into 12 pattern classes. While the class memberships are available with this dataset, this information is exclusively used to evaluate the quality of the clustering once the training has concluded; the class membership information is not used during training. A variety of map sizes and training parameters were tried. The performance was evaluated based on two measures: classification performance and clustering performance. The PMGraphSOM creates clusters on the display area according to inputs consisting of the node label and the structure. The classification performance is computed by assigning to each activated neuron the class of the majority of the patterns mapped onto it. The clustering performance is obtained by examining groups of nearby neurons and evaluating the performance with respect to the given class information. A typical observation is illustrated in Table 6.1, which presents the results of maps of size 12 8 trained using identical training parameters for both the PMGraphSOM and the GraphSOM.

Table 6.1: A comparison of performances between the PMGraphSOM and the GraphSOM given a network size of 12 8.
                               GraphSOM    PMGraphSOM
Classification Performance
Clustering Performance
Training Time                  1mins       8mins

Supported by the results depicted in Table 6.1, we found that the PMGraphSOM generally outperformed the GraphSOM in terms of clustering performance and classification performance. Moreover, it was observed that the PMGraphSOM actually trains faster than the GraphSOM, despite the fact that additional computations (the probabilities) need to be carried out for the PMGraphSOM. We attribute the observed speedup to the possibility of optimizing the PMGraphSOM training algorithm (as was discussed in Section 4.2.5). We found that in all experiments conducted, the PMGraphSOM always outperformed the GraphSOM by a margin. It would have been possible to improve the performance of the GraphSOM by substantially reducing the learning rate and increasing the number of training iterations; however, this generally resulted in training times exceeding a multiple of the ones shown in Table 6.1. Moreover, due to

excessive time requirements, we were not able to train a GraphSOM sufficiently long to reach the performance level obtained by the PMGraphSOM. Once trained, the resulting mappings produced by the two SOMs are shown in Figure 6.1.

Figure 6.1: Resulting mappings when training the PMGraphSOM (left column) and the GraphSOM (right column). Training parameters used were: iterations = 5, α(0) = 0.9, σ(0) = 4, µ = 0.28274, and grouping = 4×4.

The upper two plots in Figure 6.1 show the mapping of the (labelled) root nodes; the lower two plots show the mapping of all nodes. From Figure 6.1 it can be observed that the performance improvement can be attributed to a better condensation of the pattern classes, and to a better utilization of the mapping space, which became possible through the probability mappings, which are richer in information.

6.3 XML Documents Clustering

The previous experiment produced evidence that the PMGraphSOM does indeed improve on the ability to encode structured objects when compared to the GraphSOM. However, the policemen dataset is an artificial learning problem which does not contain noisy signals, and may not exhibit properties encountered in the real world. In this section we

explore the capabilities of the PMGraphSOM when applied to an internationally recognized benchmark problem involving real world data. Towards this end, we participated in the international competition on document categorization made available by the Initiative for the Evaluation of XML Retrieval (INEX) in 2008. The INEX 2008 competition required us to solve a practical data mining problem involving XML formatted documents drawn from the Wikipedia database. The dataset was described in Section 5.3. The dataset is represented as a Web graph in which the nodes represent the XML documents and the edges represent the hyperlinks from one document to another. Each node in the graph is labelled by a feature vector (described in Section 5.3). We used the combined features extracted from texts, tags and templates to form a 12-dimensional feature vector labelling the nodes in the Web graph. The performance was again evaluated by two measures on the mappings: classification and clustering. The evaluation steps for these two measures are described as follows; a code sketch of the clustering measure follows the descriptions.

Classification Performance: After the PMGraphSOM is fully trained, we select the neurons which are activated by at least one labelled document. For each such neuron we determine the dominating class (the class with the most activations on that neuron) and associate this class label with the neuron. For all nodes in the graph, we re-compute the Euclidean distance between the node label and the codebooks of these activated neurons, re-map the node to the winning neuron, and assign the label of the winning neuron to this node. For each of the labelled documents, if its originally attached label matches the new label given by the network, it is counted as a positive result. The classification performance is then measured as the percentage of positive results out of the number of all labelled documents.

Clustering Performance: This measure evaluates pure clustering performance. For all nodes in the graph, we compute the Euclidean distance between the node label vector and the codebooks of all neurons on the map, and obtain the coordinates of the winning neurons. We then apply K-means to these coordinates to perform clustering. By using different values of K, we can define different numbers of clusters. Within each cluster, we search for the dominating class and associate that class label with the cluster. For each labelled document, if its originally attached label matches the label of the cluster to which the document belongs, it is counted as a positive result. The clustering performance is the percentage of positive results out of the number of all labelled documents.
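The sketch below maps every document to the grid coordinate of its winning neuron, runs K-means on these coordinates, and scores each cluster by its dominating class. The use of scikit-learn's KMeans and the array-based interfaces are implementation choices made for this illustration, not a description of the software used for the reported experiments.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def best_matching_coords(features, codebooks, grid_coords):
    """Map every document to the grid coordinate of its winning neuron, i.e. the
    codebook with the smallest Euclidean distance to the document's label vector.
    (For large collections this distance matrix should be computed in batches.)"""
    dists = np.linalg.norm(features[:, None, :] - codebooks[None, :, :], axis=2)
    return grid_coords[np.argmin(dists, axis=1)]

def clustering_performance(coords, labels, k):
    """K-means on the winner coordinates; each cluster votes with its dominating class.

    coords: (num_docs, 2) winner coordinates; labels: array of class labels."""
    clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(coords)
    labels = np.asarray(labels)
    correct = 0
    for c in np.unique(clusters):
        members = labels[clusters == c]
        correct += Counter(members).most_common(1)[0][1]   # size of the dominating class
    return correct / len(labels)
```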

124 6.3. XML Documents Clustering 9 Table 6.2: Performance of PMGraphSOM for INEX 8 by using differentµvalues µ Classification (%) Clustering (%) The training of the PM-GraphSOM allows the adjustment of a variety of training parameters such as the learning rate, network size, number of training iterations, the weight µ, and neighborhood radius σ(). The optimal value of these parameters are problem dependent, and are not known a-priory. Hence, trial and error was used to identify a suitable set of training parameters. The results are presented and discussed in following sections: Effects of input weighting The input of a PMGraphSOM is a hybrid vector consisting of a node label vector and a state vector. These two components could vary in dimension and magnitude. A weighting factorµis used to control the contribution from the label and the state components. More specifically, the label is weighted by a valueµ...1, and the state vector component is weighted by the value 1 µ. Table 6.2 summarizes experiment results by using different µ values. For a set of experiments, we vary the µ values and set same map size and similar training parameters for a fair comparison. The common configuration for the experiments is listed below: Map size: 12x1 Grouping: 2x2 Training iteration: 5 Initial training radius: 4 Initial learning rate:.9 The results are illustrated in Table 6.2 in ascending order of the value for µ. When using µ =.55, the contribution of the feature vector is perfectly in balance with the

contribution of the state vector. This means that for µ < 0.55 the state vector dominates over the feature vector, and for µ > 0.55 the feature vector dominates over the impact of the structural information. The results shown in Table 6.2 indicate that the classification performance can be improved quite significantly when placing a greater focus on the node label. It is also observed that the worst clustering result was produced when using the balancing µ value. This indicates that for this particular clustering task the node label contains more useful information than the structure.

Effects of map size and grouping

The INEX 2008 dataset contains documents from 15 different categories. The document classes vary in size, and hence the documents in the dataset are not evenly distributed among the different target classes. Some categories may require a much larger display area than other, smaller categories. In order to counter the unbalanced nature of the data, one possible approach is to increase the size of the PMGraphSOM. A large display space may be required to help distinguish documents from different classes, especially in cases where a document class is small with respect to the size of the dataset. A variety of map size and grouping configurations were tried with the aim of investigating the effects of increasing the size of the SOM. To allow an equal basis for the comparisons, the same dataset and parameters are used when training the maps of different sizes. This is not generally recommended since some training parameters (most notably µ) are dependent on the network size and should be adjusted accordingly. Nevertheless, we keep these parameters constant to exclude the influence of a parameter change from the analysis of the effects of a change in network size. Seven map sizes were tried and the results are summarized in Figure 6.2, where the x-axis indicates the ratio of the number of neurons on the map to the number of documents and the y-axis indicates the cluster purity. The figure shows that the clustering performance of the map can be significantly improved by increasing the training map size. However, according to the tendency of the curve, it can be predicted that it will become more and more difficult to increase the cluster purity even if the map size is increased beyond the size of the dataset. Another interesting investigation is on the ability of SOMs to learn from the information provided with the dataset. This question arises out of the following fact: if a network

Figure 6.2: Cluster purity vs. map size relative to the size of the dataset (macro and micro F1).

is trained that has substantially more neurons than data inputs (provided with the training set), then the network is able to provide a unique mapping for each data input. This means that the network would appear to be producing a perfect classification result (since there are no confusions in the mappings). Thus, a randomly initialized very large map may produce a perfect classification result despite the initial random mappings. In order to investigate the SOM's ability to improve the mappings during training, we compute the cluster purity for each of the training iterations. Figure 6.3 shows the evolution of the cluster purity during training; it includes the results for 8 experiments ordered by map size in ascending order. For each iteration, the macro purity is plotted. We ignore the micro purity here since it is consistent with the macro purity in terms of the trend. We observe that the larger maps present an obvious advantage in cluster purity even at the very beginning of the training. The curves from bottom to top correspond to the experiments using map sizes in ascending order. However, the performance of the different map sizes increases at different speeds. The smallest map in use has 12,000 neurons while the number of neurons on the largest map is ten times larger. However, the performance of the smallest map increases by more than 20% whereas the improvement of the largest map was tiny at just 3%. This experiment has shown the upper limit of the clustering performance that the PMGraphSOM can achieve by using a particular feature set, and provided evidence that the best map size for this given learning task is about 55% of the number of nodes in the dataset. The latter conclusion arises out of the observation that a network of such size is able to learn useful information while producing a

Figure 6.3: Cluster purity vs. training iteration (Exp. 1 to Exp. 8).

relatively good final overall performance. This is indicated by the curve marked Exp. 4 in Figure 6.3. An adaptive learning rate was used for the training. The learning rate decreases sigmoidally to zero with the number of iterations. Most curves in Figure 6.3 show that the performance did not improve gradually during training, but increased rapidly for a short period. This fast increase corresponds to the steep part of the sigmoidal function, so it is possible to make the increase last longer by stretching the sigmoidal curve. This can be done by increasing the number of iterations. In Figure 6.3, the results of experiments 4 and 5 use the same map settings and parameters except for the number of training iterations, and are shown as hollow squares and filled squares respectively. In experiment 5, we doubled the number of iterations that was set for experiment 4 (6 iterations). The curve of experiment 5 is generated by plotting the macro purity every two iterations. It can be seen that the two curves of experiments 4 and 5 are nearly identical; this shows that the selection of the initial learning rate and the number of training iterations for these experiments has been appropriate.

Effects of training radius

The updating of the SOM is based on a Gaussian neighborhood function centered at the winning neuron. The nearby neurons are updated according to the spread (also called the radius) of the Gaussian. With the PMGraphSOM, the radius is not only used to control the updating area, but also to restrict the degree of probability mapping in the

input vector.

Table 6.3: Performance of the PMGraphSOM for INEX 2008 using different radius values (columns: Radius, Classification (%), Clustering (%)).

Table 6.3 shows the results of experiments using different initial training radii. The best result was produced by using a large initial radius. This observation can be explained as follows: a large initial radius usually helps to organise the activations on the map during the beginning phase of the training [61]. As defined in the algorithm, the radius shrinks during training. Thus, as the training radius decays, large scale rearrangements of the mapping within the display area become less likely. Hence, a larger initial radius can lead to a better organization of the mappings from a random initialization of the map.

Effects of node label

It has been observed that the experimental results are more sensitive to the node label. We therefore conducted comparable experiments on different features extracted from the documents, such as text, tags and template names, with the aim of identifying the more influential features. These features were used individually or concatenated together and attached to the nodes as node labels. The results of the experiments using different labels reveal the significance of the various features for distinguishing the document classes. The features considered were explained in Section 5.3 of the previous chapter. Table 6.4 lists the experimental results of using different types of labels. The results show that using higher dimensional and combined features improves the classification performance significantly, but is less effective at enhancing the pure clustering performance. This is an expected result given that an increase of document features helps to diversify the mappings, and hence leads to a better network utilization which in turn allows for improvements in the classification performance.
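The two ingredients examined in the preceding subsections, the µ-weighted hybrid input and the Gaussian neighbourhood whose radius shrinks during training, can be summarised in a minimal sketch. The function below is a simplified single-node update step under assumed names (codebooks, map_coords, label, state); it is not the actual PMGraphSOM code, and the linear decay shown stands in for whatever schedule a given experiment uses (the learning rate in the experiments above follows a sigmoidal schedule).

import numpy as np

def som_update_step(codebooks, map_coords, label, state, mu, lr, sigma):
    # Hybrid input: node label weighted by mu, state component weighted by (1 - mu).
    x = np.concatenate([mu * label, (1.0 - mu) * state])
    # Winning neuron: smallest Euclidean distance between the input and a codebook.
    winner = int(np.linalg.norm(codebooks - x, axis=1).argmin())
    # Gaussian neighbourhood centred on the winner's position on the map grid.
    d2 = np.sum((map_coords - map_coords[winner]) ** 2, axis=1)
    h = np.exp(-d2 / (2.0 * sigma ** 2))
    # Pull every codebook towards the input, scaled by learning rate and neighbourhood.
    codebooks += lr * h[:, None] * (x - codebooks)
    return winner

def linear_decay(initial_value, iteration, total_iterations):
    # Simple decay, used here for both the radius and the learning rate.
    return initial_value * (1.0 - float(iteration) / total_iterations)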

Table 6.4: Performance of the PMGraphSOM for INEX 2008 using different labels (columns: Label, Classification (%), Clustering (%)); the labels compared are text, template, and combined feature vectors of various dimensions.

Figure 6.4: Mapping of all labelled nodes on a PMGraphSOM performing best in classifying the INEX 2008 documents.

Best Results and Discussion

The INEX 2008 competition had strict time limitations for the submission of results. The experiments were ongoing at the time submissions were due. Hence, we submitted the best available results at the time and then continued with our experiments. In this section, we first present our best results obtained by the due time of the INEX 2008 mining challenge and a comparison to the competitors' results. Later in this section we present the final results which were obtained after the competition. A general observation was that the training of PMGraphSOMs of a useful size required anywhere between 13 hours and 27 hours of training time on a 3GHz Intel CPU, whereas the training of a GraphSOM of similar size and with similar parameters required about 4 days (approx. one iteration per day). This shows again the evident advantage of the PMGraphSOM for a large scale learning problem.

Figure 6.5: Mapping of all labelled nodes on a PMGraphSOM performing best in clustering the INEX 2008 documents.

The organizer of INEX 2008 evaluated the results in different ways. They compute a micro average purity (MicroF1) and a macro average purity (MacroF1) in order to summarize the performance of the different models over all the documents and all the clusters. No defining formulas were provided, only a pre-built software package. Thus, the internal calculation of MicroF1 and MacroF1 was not shown clearly. Here we first present the maps which performed best in clustering or classification by the due time of the competition. Figure 6.4 shows the mapping of the documents in the training set for a trained map which performed best in classifying the data, whereas Figure 6.5 shows the mapping of the nodes for a map that performed best in clustering the same data. It can be seen that the map shown in Figure 6.4 utilizes the mapping space considerably better when compared to the map shown in Figure 6.5. However, less clear clusters are formed in Figure 6.4. In contrast, the map shown in Figure 6.5 visibly produced clusters of data belonging to the same pattern class. To extract clusters from a trained SOM, we apply K-means clustering to the mapped data of a trained PMGraphSOM. By setting K to a constant value, we can extract exactly K clusters from a SOM. The overall clustering and classification performance of these two maps when setting K to either 15 or 512 is shown in Table 6.5. The choice of K has been somewhat arbitrary due to a lack of guidance by INEX, as is evidenced by

Table 6.5: Summary of results of the INEX 2008 XML clustering task (columns: Name, #Clusters, MacroF1, MicroF1, Classification, Clustering). The clustering and classification performances were not available for all participants; the listed submissions comprise the three hagenbuchner entries, the QUT entries (Freq struct, collection and LSK variants) and the Vries entries.

the fact that other competitors also used many different numbers of clusters when submitting their results. In Table 6.5, the result named hagenbuchner-1 refers to the SOM shown in Figure 6.4, whereas the results named hagenbuchner-2 and hagenbuchner-3 refer to the SOM shown in Figure 6.5. Result hagenbuchner-3 is obtained by combining the classification and clustering evaluations: we first evaluate the classification performance on the mappings so that all nodes can be associated with a label predicted by the SOM, and then perform the clustering evaluation on these classification results, which produced an even better clustering performance. All other results in the table refer to competitors' approaches. It can be seen that the best of our models performs reasonably well in comparison. We note that the classification performance and the clustering performance were not available from the competitors. We later found that the main reason which held us back from producing better results was the unbalanced nature of the dataset. This is illustrated by Table 6.6 and Table 6.7, which present the confusion matrices of the training data set and

the testing data set respectively, produced by these SOMs. The first row in each table lists all the category IDs. Table 6.6 refers to the SOM of a given size which produced the best classification performance, whereas Table 6.7 refers to the performance on the training and test sets respectively of the same SOM. In Table 6.6 the values on the diagonal are the numbers of documents correctly classified. A total of 8,84 out of 11,437 training patterns are classified correctly (the micro purity is 68.84%). However, the results show that the generalization ability of the trained map is considerably low; the micro purity value dropped to 18.8% for the testing data set. In comparison, Table 6.8 and Table 6.9, which refer to a SOM of a smaller size which produced the best clustering performance, show a comparatively more consistent performance for both the training data set and the testing data set; the proportions of correctly classified training and testing documents are much closer to each other. It can be observed that for the above two experiments, the worst performing classes are the smallest classes, and hence these constitute a main contribution to the observed overall performance levels. In the former experiment, we used 12-dimensional labels which combined the text, template name and tag information of the documents, while in the latter experiment we only attached 1-dimensional labels of template information. Even though a larger map was used for the former experiment, it cannot perform consistently on the training and test data. The rich set of information provided with the concatenated features may result in an overfitting on the training documents, and hence this produces a lower generalization performance than the one using the much lower dimensional template features. It indicates that the training can be ineffective without a good set of features regardless of the size of a map. In order to evaluate the training performance using different features, we conducted an experiment after the INEX 2008 competition using the 47-dimensional text vector generated via the Bag of Words model. Table 6.10 presents a summary of the confusion matrices for the set of training documents and the set of testing documents respectively, produced by the best trained SOM for the task so far. It can be seen that the PMGraphSOM is able to significantly outperform any competitor's method in both MacroF1 and MicroF1 if the right features are used in the training set. The corresponding mapping is shown in Figure 6.6. A reasonable segmentation of the 15 pattern classes can be observed. There exists some overlap between the pattern classes. However, given that this has been an unsupervised learning task during which the network was not informed about the desired

Table 6.6: Confusion matrix for the training documents generated by a PMGraphSOM trained using mapsize=160x120, iteration=5, grouping=2x2, σ(0)=1, µ=0.95, α(0)=0.9, label: text=5 + template=4 + tag=3 (MacroF1 and MicroF1 on the training set).

Table 6.7: Cluster purity when using mapsize=160x120, iteration=5, grouping=2x2, σ(0)=1, µ=0.95, α(0)=0.9, label: text=5 + template=4 + tag=3 (per-class Macro F1, Micro F1, Train, Test).

clustering result, and given that no competitor has been able to come close to this quality of clustering, the clustering result observed in Figure 6.6 represents a respectable state of the art in clustering XML documents from the Wikipedia domain. The results indicate that the Bag of Words based feature selection approach can provide more useful data labels which benefit the clustering task. Apart from selecting useful features for the training task, the size of the network can be increased to overcome issues related to imbalance in a data set. Note that the dataset consists of 114,366 nodes which were mapped to a network of size 160 × 120 = 19,200 neurons. In other words, the mapping of nodes in the dataset is compressed by at least 83%. Such a high compression ratio forces the SOM to neglect less significant features and data clusters, and hence can contribute to a poor clustering performance on small classes. The study of map sizes conducted earlier in this section has already shown that a map of an appropriate size can help produce good performance.
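A sketch of how Bag-of-Words based node labels of the kind discussed above could be produced is given below. The vocabulary size, the number of retained components, and the use of scikit-learn are illustrative assumptions rather than the exact pipeline used for the experiments in this chapter.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

def bow_node_labels(documents, vocabulary_size=5000, n_components=47):
    # Bag-of-Words term counts over a fixed-size vocabulary.
    vectorizer = CountVectorizer(max_features=vocabulary_size, stop_words="english")
    counts = vectorizer.fit_transform(documents).toarray()
    # Compress the sparse, high dimensional counts into a compact label vector
    # via Principal Component Analysis.
    pca = PCA(n_components=n_components)
    return pca.fit_transform(counts.astype(float))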

Table 6.8: Confusion matrix for the training documents generated by a PMGraphSOM trained using mapsize=120x100, iteration=5, grouping=1x1, σ(0)=1, µ=0.99995, α(0)=0.9, label: template=1 (MacroF1 and MicroF1 on the training set).

Table 6.9: Cluster purity when using mapsize=120x100, iteration=5, grouping=1x1, σ(0)=1, µ=0.99995, α(0)=0.9, label: template=1 (per-class Macro F1, Micro F1, Train, Test).

Table 6.10: Cluster purity using mapsize=400x300, iteration=6, grouping=4x6, σ(0)=2, µ=0.99, α(0)=0.9, label: text=47 (BoW-based approach) (per-class Macro F1, Micro F1, Train, Test).

6.4 Role of Clustering

Clustering is an unsupervised learning problem which is defined as the process of categorizing similar data into groups [41]. Here, a cluster contains a collection of data that share a certain similarity. Clustering aims to detect an inherent grouping of a set of unlabelled data. However, the meaning of the grouping is usually unknown. In a document clustering task, for instance, the documents in a given cluster are similar to each other and distinct from the documents in other clusters. However, the nature of the similarity that the documents in the same cluster are sharing is not visible. It should be noted that such clustering results cannot provide a classification of the data domain, but they can act as a basis or prediction for a

Figure 6.6: Mapping of all training data on a PMGraphSOM performing best in classifying the nodes.

classification task. The mapping results are usually a set of coordinates within the range of the display map for all input patterns involved in the learning problem. The coordinates can be concatenated to describe the pattern in addition to the available feature vector. This is especially significant for learning on structural information since this unsupervised approach does not use a gradient based learning algorithm, so that long term dependency is not a problem in this case.

6.5 Conclusion

The work presented in this chapter demonstrated a way to suitably map graph structured data onto a fixed dimensional display space by using the PMGraphSOM. A successful application of the PMGraphSOM to an image clustering task has shown that the proposed probability mapping during the training of a GraphSOM helps to significantly improve its performance. The approach helped to substantially reduce the computational demand so that it becomes possible to apply the PMGraphSOM to data mining tasks involving large numbers of generic graphs. This was demonstrated through an application to a large scale learning task involving XML documents from the Wikipedia dataset. The experiments produced very respectable performances, setting a state-of-the-art performance on an international benchmark problem. The experiments also revealed that

the unbalanced nature of the training set can be a main inhibiting factor for the task. It has been identified that the encoding of the textual information embedded within the Wikipedia documents can also be a challenge. This was resolved by using a Bag-of-Words approach in combination with Principal Component Analysis to extract and compress this type of information. We note that the results presented in this chapter may not be the optimal results reachable by the approach. A different parameter set may be able to further improve on the results already presented here.

Chapter 7

Classification

7.1 Introduction

In contrast to clustering, classification learning tasks require the availability of some ground truth information. In other words, a given learning problem is required to have a set of input-target pairs available. The task is then to learn a mapping from the input domain to the desired target domain, and then to be able to predict the target for data items for which the target is unknown. In classification learning problems a set of classified data is provided as samples for the machine to learn from. The underlying rules for the classification are inferred by the machine learning system so that unlabelled data can be classified accordingly. This chapter will first look into a classification learning problem involving sequential graphical patterns (sequences of graphs) by utilizing the proposed probability-based model GHMM and the GNN 2 (see Section 7.2). Both the GHMM and the GNN 2 are supervised machine learning methods. This is pioneering work in that no-one has previously been able to learn on data involving the encoding of GoG representations (to the best of our knowledge). We then evaluate the GNN 2 further through a deployment of the method to an international data mining competition involving the classification of a large number of XML documents (see Section 7.3 for a description of the dataset). The INEX 2009 document classification task requires a machine learning model to be capable of encoding both the document contents and the hyperlink structure, and hence the task is suitable to verify the capabilities of the proposed GNN 2 learning method. The benchmark problem also allows us to assess the performance of the GNN 2 in comparison to approaches taken by other researchers. It will be found that the GNN 2 is among the methods producing the best results on this benchmark problem. A third investigation is made with respect

to possible long-term dependency problems that may afflict the training algorithm of the GNN 2. Learning long term dependencies has been reported to afflict recurrent neural networks using gradient-based algorithms [9]. The problem is that the gradient becomes smaller and smaller with the depth of a network architecture, and hence this implies that a recursive network architecture (such as the GNN 2) may not be able to effectively learn very deep structures. In Section 7.4, we conduct a pilot study with the aim of examining the ability of the GNN 2 to learn deep GoG structures. An alternative machine learning approach that is known to be unaffected by long-term dependency problems is also considered. Towards this end, we will study the PMGraphSOM as a tool to reduce the effects of long term dependencies for a given learning problem. We propose a combined system consisting of unsupervised and supervised components to solve a large scale Web spam detection learning task. The experiments are described and discussed in Section 7.5. It is shown that the proposed approach is a generic approach that can produce performances close to the state of the art. To the best of our knowledge, there is no other approach that can be applied to a variety of spam detection problems while performing consistently at the state-of-the-art level.

7.2 Recognition of Sequences of Graphical Patterns

The task is to encode sequences of graphical patterns. Sequences of graphs are special cases of GoGs. In a sequence, the nodes are connected one after another, and each of the nodes in the sequence is labelled by a graph data structure. A sequence of graphs defines a linear time dependency between a set of graphs. This is a very common situation in the real world. For example, a document may be represented as an XML tree at a time instance t. Such a document may undergo several revisions over time (i.e. the document is being developed into a final product), and hence the structure of the same document may change over time. Another example is a time series of events, for example in gesture recognition, where a sequence of images features a hand in various positions and the sequence of these positions defines the meaning of a gesture. The nodes of such a sequence are representations of an image, which can be obtained via a region adjacency graph or any other image processing technique. The policeman dataset is expanded for this experimental purpose. As introduced in Section 5.2, images featuring a policeman are represented as data trees. By combining a set of trees for a policeman in

different postures, this simulates a movement of the policeman, for example a rotating policeman, a policeman raising his arms, etc. For this task, graphical sequences have been created from the Policemen dataset. This dataset features images of synthetically generated policemen, having different colors, orientations, positions of the arms, etc. Directed ordered acyclic graphs (DOAGs) were used for representing the individual images, as explained in [44]. Two-dimensional real-valued labels are associated with the nodes in this DOAG representation. Sequences were generated by taking individual images and creating concatenations of images such that a coherent movement of the policeman emerged. For instance, the images in a given sequence may represent the policeman gradually raising and then lowering his left arm, followed by an analogous movement of the right arm. The task involves 158 sequences overall, belonging to 4 disjoint classes. In turn, each class is further divided into subclasses as follows:

Class 1: rotation; subclasses: (1.1) clockwise and (1.2) counter-clockwise.
Class 2: shift; subclasses: (2.1) right-left and (2.2) top-down.
Class 3: zoom; subclasses: (3.1) zoom-in and (3.2) zoom-out.
Class 4: arms movement; subclasses: (4.1) both arms up, (4.2) both arms down, (4.3) right arm up, (4.4) right arm down, (4.5) left arm up, and (4.6) left arm down.

Hence, there are 12 classes in total. The length of the individual sequences ranged from 1 to 17 graphs. The dataset was described in Chapter 5, and Figures 5.1 and 5.2 showed two sample sequences from Class 1. We note that the starting and ending frames of some sequences from different classes were identical, and that for sequences from other classes, some frames within a sequence were identical. Hence, a good classification can only be obtained if the method can encode the given sequences as a whole. We proposed a Graph Hidden Markov Model earlier in this thesis for the encoding of sequential graphs. For a comparison, we also applied the supervised machine learning approach GNN 2 to the same learning problem.

GHMM to encode sequences of graphs

The GHMM is a novel machine learning system for modeling sequential graphs. It is a hybrid system consisting of an HMM, RBFs and RMLPs as components. For the training of the GHMM, the dataset was split into a training set (72 sequences, chosen by drawing 6 sequences from each subclass at random) and a test set (all the remaining sequences). Separate left-to-right Markov chains were used for each class, each of them having 4 states. Emission probabilities were modelled with 2 multivariate Gaussian kernel RBFs,

Table 7.1: Recognition accuracy of the GHMM on sequences from the Policemen dataset. After training, the accuracy is 87.5% on the training set and 86.5% on the test set.

and the RMLP networks having 8 sigmoid hidden neurons, 1 state neuron, an encoding dimension of 2, and a maximum out-degree of 6 [87]. The system parameters were initialized according to a segmental k-means-like procedure [78]. The proposed training algorithm was applied for 4 iterations (i.e., epochs) using different learning rates (obtained via cross-validation) for the specific parameters involved in the optimization process. The results reported in Table 7.1 give the recognition accuracies. Although no direct comparison is possible w.r.t. any other benchmark approach (since, to the best of our knowledge, the present model is the first attempt at dealing with the classification of sequences of graphs), it is seen from Table 7.1 that: (i) the architecture is indeed suitable for graphical sequence modeling; (ii) the training algorithm, whilst focusing on the maximization of the likelihood (of the model given the training sample) criterion, also results in a significant accuracy in terms of the sequence recognition rate criterion; and (iii) the comparison between the accuracy on the training and test sets confirms that the learning capability of the machine does not prevent the emergence of an appreciable generalization capability. We found that the residual classification error is attributed to two of the subclasses whose properties are such that a correct classification requires the encoding of the associated data labels. Since the labels provide features which are furthest from the output layer, the proposed approach prioritizes the encoding of the structure over the data labels. In general, it can be expected that the classification result will improve further when training is carried out for more iterations, or by adding direct forward links from the labels to the output layer of the ANN component. However, we did not carry out these tasks due to time restrictions.

Use GNN 2 to encode sequences of graphs

For a comparison, we also apply the GNN 2 to this sub-instance of the GoG domains. A GNN 2 able to learn sequences of trees consists of a composition of two GNNs, one for encoding the graphs at level k = 1, responsible for encoding the sequence, and one for encoding the graphs at level k = 0, responsible for encoding the trees. For simplicity

we will refer to the latter as the encoding network and the former as the output network. We have implemented the system as follows: each of the two networks has two internal layers, referred to as a hidden layer and a state layer. Hence, there is a layer of internal nodes dedicated to representing the internal state, as was shown in the equations of Chapter 4. We also implemented Equation 4.37 in the encoding network so that the input to the output network uses the output of the encoding network rather than its state. While not strictly required, the latter was done to allow us to observe more independently the influence of the various network parameters on the performance of the system. In the following, we will refer to the output dimension of the encoding network as the encoding dimension, and to its output simply as the encoding. Thus, we can now control independently a number of network parameters such as the dimensions of the hidden layers, the state layers, and the encoding dimension. In the following, these parameters will be denoted as EN state, EN hidden, ON state, ON hidden, and EN out respectively. We used the back propagation through structure algorithm as proposed earlier to update the network parameters. Different seeds were used for the random initialization for comparison purposes. As mentioned earlier, the classification problem involves 12 classes and is a challenging problem since the classes are overlapping. The machine learning approach must be capable of learning to distinguish the different causal arrangements of the images within a sequence in order to be able to differentiate the 12 classes. We list 6 representative experimental results obtained from 6 different network configurations. The training parameters used for each of these experiments are summarized in Table 7.2. The table presents the configurations in descending order of the performance achieved from training the corresponding network. Performance is measured in terms of the cost (the sum of squared errors at the output layer of the output network) and also the recognition rate in percent. The results are illustrated in Figure 7.1. It is shown that larger networks and a higher encoding dimension enhance the training performance. It is also evident that for this learning problem, the system is more sensitive to the choice of architecture of the output network than of the encoding network. Comparing the results to what we obtained by using the GHMM, the GNN 2 is able to produce a perfect recognition rate on the training samples while the best result obtained by the GHMM is worse by 1%. The important observation here is that the method is capable of creating a model

Table 7.2: Experiments using the GNN 2 to encode sequences of graphs (columns: Exp ID, EN state, EN hidden, ON state, ON hidden, EN out, for Exp 1 to Exp 6).

Figure 7.1: The curves correspond to the labels as defined in Table 7.2. Shown are the MSE (left) for each of the 6 experiments, and the recognition rate (right). The horizontal scale indicates the number of training iterations.

for a GoG structure which can perfectly discriminate the 12 pattern classes. The results demonstrate the ability to encode GoG structural information in a single model. This is the first time that a machine learning approach has exhibited such an ability. Training times were a very reasonable 5-10 minutes when executed on current standard PC hardware.

7.3 Categorization of XML documents

It was shown that the GNN 2 is effective for modeling sequences of trees derived from a small scale artificial learning problem. In this section we deploy the GNN 2 to a real world XML document classification task involving more generic GoGs. The dataset was provided by an international competition on XML document mining (the INEX 2009 competition). The dataset contains a collection of XML documents and was described in Section 5.4. Thus, the dataset can be described by a GoG of depth two:

Level 0: one graph per document, describing the structure, contents or other properties within the document.

Level 1: one graph in which the documents are the nodes, connected via links that represent the hyperlinks between the documents.

We trained the GNN 2 on the resulting GoG, and varied the labeling mechanisms as illustrated in Table 7.3 to identify the impact of the different features on the classification performance. In the following sections, the graphs will for short be referred to by the graph IDs shown in the table. We also varied the number of state neurons and hidden neurons to identify the impact of the number of internal network parameters on the classification performance (see Section 7.3.1). Different initial conditions were tried to avoid locally optimal solutions and to obtain representative results (see Section 7.3.2). In order to improve the results, we also allow the model to use some of the mechanisms proposed in the subsequent sections. In Section 7.3.7, we present our best results obtained during the competition and discuss the findings. After the competition, we endeavoured to enhance the GNN 2 training by using the approaches proposed in the later sections, which allowed us to further improve the classification performance. The performance of the training is evaluated according to several measures: Precision (PRE), Recall (REC), Accuracy (ACC), F-measure score (PRF) and Mean Average Precision (MAP) with respect to the document classes. The classification results returned by the model allow us to count four quantities: true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). These evaluation measures are then computed as follows:

PRE: The precision of the classification results is computed as:

PRE = \frac{TP}{TP + FP}    (7.1)

The micro PRE is computed by using the total counts of TP and FP over all the categories. The macro PRE is the average value of PRE computed for each individual category.

REC: The recall of the classification results is computed as:

REC = \frac{TP}{TP + FN}    (7.2)

The micro REC is computed by using the total counts of TP and FN over all the categories. The macro REC is the average value of REC computed for each individual category.

Table 7.3: List of all input data files used for training on the INEX 2009 dataset.

ID 1: document-link graph; maxout=2382, maxin=27518, nodelabel=439 (reduced tfidfn vector).
ID 2: document-link graph; maxout=2382, maxin=27518, nodelabel=133 (word counts).
ID 3: document-link graph; maxout=969, maxin=1527, nodelabel=133.
ID 4: GoGs. Level 0: tag graphs, maxout=195, nodelabel=1 (tag id); level 1: document-link graph, maxout=969, maxin=1527, nodelabel=encoding of the level 0 tag graphs.
ID 5: GoGs. Level 0: tag graphs, maxout=52, nodelabel=39 (tag information gain); level 1: document-link graph, maxout=969, maxin=1527, nodelabel=encoding of the level 0 tag graphs.
ID 6: GoGs. Level 0: tag graphs, maxout=52, nodelabel=39 (tag information gain); level 1: document-link graph, maxout=969, maxin=1527, nodelabel=encoding of the level 0 tag graphs + 133 (word counts).
ID 7: GoGs. Level 0: concept link graph, maxout=41, nodelabel=51; level 1: document-link graph, maxout=969, maxin=1527, nodelabel=encoding of the level 0 CLGs + 39-dimensional label from the classification results of rainbow for each category respectively (NaiveBayes).
ID 8: GoGs. Level 0: concept link graph, maxout=41, nodelabel=51; level 1: document-link graph, maxout=969, maxin=1527, nodelabel=encoding of the level 0 CLGs + 39-dimensional label from the classification results of rainbow for each category respectively (NaiveBayes with logarithm normalization).
ID 9: GoGs. Level 0: concept link graph, maxout=41, nodelabel=51, edgelabel=1; level 1: document-link graph, maxout=969, maxin=1527, nodelabel=encoding of the level 0 CLGs + 39-dimensional label from the classification results of rainbow for each category respectively (NaiveBayes with logarithm normalization).
ID 10: GoGs. Level 0: concept link graph, maxout=41, nodelabel=51, edgelabel=1; level 1: document-link graph, maxout=969, maxin=1527, nodelabel=PMGraphSOM batch mode training, 4x32 map.
ID 11: GoGs. Level 0: concept link graph, maxout=41, nodelabel=51, edgelabel=1; level 1: document-link graph, maxout=969, maxin=1527, nodelabel=node label of the graph 9 level 1 graph + 78d PMGraphSOM results obtained in separate train mode (old initialization).
ID 12: GoGs. Level 0: concept link graph, maxout=41, nodelabel=51, edgelabel=1; level 1: document-link graph, maxout=969, maxin=1527, nodelabel=node label of the graph 9 level 1 graph + 78d PMGraphSOM results obtained in separate train mode (new initialization).
ID 13: GoGs. Level 0: concept link graph, maxout=41, nodelabel=51, edgelabel=1; level 1: document-link graph, maxout=969, maxin=1527, nodelabel=node label of the graph 9 level 1 graph + 78d PMGraphSOM results trained on structure only in separate train mode (new initialization).
ID 14: GoGs. Level 0: concept link graph, maxout=41, nodelabel=51, edgelabel=1; level 1: document-link graph, maxout=969, maxin=1527, nodelabel=78d PMGraphSOM results obtained in separate train mode.
ID 15: GoGs. Level 0: concept link graph, maxout=41, nodelabel=51, edgelabel=1; level 1: document-link graph, maxout=969, maxin=1527, nodelabel=node label of the graph 9 level 1 graph + 78d PMGraphSOM results trained on structure only in separate train mode (new initialization and controlled PM).

ACC: The accuracy of the classification results is computed as:

ACC = \frac{TP + TN}{TP + FP + TN + FN}    (7.3)

The micro ACC is computed by using the total counts of TP, FP, TN and FN over all the categories. The macro ACC is the average value of ACC computed for each individual category.

F-score: This measure evenly weights Precision and Recall.
PRF = \frac{2 \cdot PRE \cdot REC}{PRE + REC}    (7.4)

We averaged the F-measure scores over all categories to obtain the macro PRF.

Figure 7.2: SSE curves of training the GNN 2 for INEX 2009 with different network sizes. Left: trained on graph 3; right: trained on graph 5.

MAP: The mean of the average precision with respect to each document. This measure is used to evaluate whether the system is able to retrieve highly relevant categories first. For each document i, we get a list of relevant categories by sorting the scores in ascending order, and then compute:

AvgP_i = \frac{\sum_{r=1}^{N} PRE(r) \cdot rel(r)}{N_{relevant}}    (7.5)

MAP can in turn be computed as:

MAP = \frac{\sum_i AvgP_i}{n}    (7.6)

where n is the total number of documents evaluated.

Network size

We vary the size of the hidden layer and the state layer of the output network and carried out a set of experiments on graph 3, which contains only one level of graphs. The SSE curves are shown in Figure 7.2 (the left plot). It can be observed that the larger networks are able to produce a lower cost than the smaller networks. However, this is a first set of experiments for identifying the impact of the network size on the training performance. Hence, only a small number of iterations were trained to obtain indicative results. The longer training conducted on graph 5 is shown in the right plot of Figure 7.2. Some findings are:

Generally, more training epochs are required to train larger networks to convergence of the SSE. This is shown by the different speeds of decline of the SSE curves for the different networks.

The SSE curves end at similarly low levels for the different networks. It is generally known that the size of the network can limit the ability of the learning system to model the underlying function. However, this experimental result does not show any advantage of using larger networks. This could be caused by insufficient training of the larger networks. The training graph used here is a GoG featuring two levels, and hence two network components are required. The computational cost of the training increases exponentially since there are (1) more nodes to be encoded in the higher level graphs, and thus (2) more network parameters that need to be trained. Thus, for the later experiments we will use a larger number of training epochs, especially when the network is of a large size and the GoG is of a deep structure.

Initial condition

The set of network weights is initialized with random values before training and is then adjusted by the training algorithm over a number of epochs towards a set of values that can produce the desired network outputs. The gradient descent procedure is known to lead to a local minimum on the error surface [52]. Depending on the initial condition, it is possible that a local minimum is not close to a global minimum of the error function surface. It is known that for some learning problems a local minimum can be much worse than a global minimum. The randomization of the initial weights sets a starting point for the training. Hence, different random initializations may lead to different local minimum solutions. A possible case is that the set of initial network weights may not be far from the global solution, so that the learning will be faster and lead to better results. For this reason, we investigated the sensitivity of the network to a number of different initial conditions. We illustrate some typical results in Figure 7.3. From the experiments depicted in Figure 7.3, it appears that the training algorithm is not sensitive to the different initial conditions for this given learning problem. It is observed that all the SSE curves eventually converge to very similar low levels.

Balancing labelled data

The labelled data is not balanced amongst the different category classes. This is a well-known problem in machine learning due to the noise tolerant abilities of such systems.

Figure 7.3: SSE curves of training the GNN 2 for INEX 2009 with different initial conditions. Left: trained on graph 4; right: trained on graph 5.

If a learning problem features classes which are much smaller than the other classes, then counter measures need to be taken so as to avoid the network dismissing the small classes as noise. In order to avoid this, it is useful to balance the distribution of the training samples. Since this is a multi-category task, the traditional methods for balancing data are not applicable. Instead of complementing the number of samples from the smaller classes, we modified the error back-propagation algorithm by altering the error which is propagated back during training as follows: for each training sample, the network produces an output vector which has the same dimension as the number of category classes available, and each element in the vector represents the output for the corresponding category. Originally, for each element i in the vector, the error ε_i = o_i − t_i is computed. In order to balance the training samples for category i, ε_i can be revised by using the number of negative samples N_n and the number of positive samples N_p for category i. There are two alternative ways which have the same effect on the training algorithm:

1. if t_i = 1, then ε_i = (o_i − t_i) · N_n; if t_i = 0, then ε_i = (o_i − t_i) · N_p
2. if t_i = 1, then ε_i = (o_i − t_i) · N_p; if t_i = 0, then ε_i = (o_i − t_i) · N_n

Figure 7.4 shows the results when using the two different balancing methods. The classification performance is evaluated based on the MAP measure and the F1 score. This is illustrated in Figure 7.5. It can be observed that balancing helps to produce a higher MAP and F1 score.
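A minimal sketch of the two balancing variants is shown below; the per-category counts of positive and negative training samples are assumed to be precomputed, and the rest of the back-propagation machinery is omitted.

import numpy as np

def balanced_output_error(outputs, targets, n_pos, n_neg, method=1):
    # outputs, targets: vectors with one element per category class;
    # n_pos, n_neg: per-category counts of positive and negative training samples.
    eps = outputs - targets
    if method == 1:
        # Variant 1: errors on positive targets scaled by N_n, on negative targets by N_p.
        scale = np.where(targets == 1, n_neg, n_pos)
    else:
        # Variant 2: the opposite assignment, which has the same balancing effect
        # on the training algorithm.
        scale = np.where(targets == 1, n_pos, n_neg)
    return eps * scale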

Figure 7.4: SSE curves of training the GNN 2 for INEX 2009 with balancing method 1 (left) and balancing method 2 (right). Trained on graph 3.

Figure 7.5: MAP and F1 score of training the GNN 2 for INEX 2009 without balancing (0 and 1) and with balancing (2).

Stability control

The graphs involved in this task are of a higher complexity than those of the artificial policemen benchmark (i.e. the nodes in the graph can have a much larger out-degree), and hence this contributes to a large sum-of-states input to the network. The transition function of a neuron is usually set to be a sigmoid or hyperbolic tangent. Both functions limit the outputs to a fixed range. A large input value may result in the saturation of the outputs. Saturation can be a major problem for gradient descent methods since the sigmoid curve is very flat in the saturated region. Hence, the gradient would be very small and the network would not be able to produce accurate and stable responses. In order to reduce the effects of large states, we apply stability control mechanisms in the following two ways:

State reduction

We start by using a simple approach, which we call state reduction, for solving the stability issue. Instead of passing the sum of states to the hidden layer, we modify the algorithm by taking the mean value over all contributors to the sum of the states. This limits the states to a reasonable range during the iterative forwarding phase. It also makes the magnitude of the state independent of the out-degree of a node. Figure 7.6 shows some results when training with this state reduction mechanism on graph 3. The plot does not show any evidence of an improvement in the results. Hence, a more sophisticated approach is considered, as described in the following.

Figure 7.6: SSE curves of training the GNN 2 for INEX 2009 with and without state reduction. Trained on graph 3.

Jacobian control

We then consider a more sophisticated approach called Jacobian control. The formalism of this approach was described earlier. The GNN 2 software implementation uses the Jacobian control as a plugin mechanism. The Jacobian control restricts the updates of the weights so that the state outputs do not exceed a given range. We anticipate that the Jacobian control could be useful to avoid unstable results. In addition to using the Jacobian control, we also used a smaller value range for the initial network condition. This is done to avoid saturation already occurring at the initialization phase. We enabled the Jacobian control for a set of experiments. Typical results will be presented and discussed in Section 7.3.7; this is done in order to avoid repetition in the presentation of the experimental results.
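The state-reduction idea can be illustrated with a few lines of Python; the representation of the neighbour states as a list of equally sized vectors is an assumption made for the example, and the Jacobian control plugin mentioned above is not sketched here.

import numpy as np

def aggregate_neighbour_states(neighbour_states, state_reduction=True):
    # neighbour_states: one state vector per neighbouring node, shape (out_degree, state_dim).
    states = np.asarray(neighbour_states, dtype=float)
    if state_reduction:
        # The mean over the contributors keeps the magnitude of the aggregated state
        # independent of the node's out-degree and helps avoid saturating the
        # sigmoid or hyperbolic tangent transfer functions.
        return states.mean(axis=0)
    # Original behaviour: plain sum of the neighbour states.
    return states.sum(axis=0)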

Label-to-out approach

We further considered a slight modification of the proposed algorithm, namely the addition of direct network links between the input layer (taking the node label) of the encoding network and the output layer of the output network. This additional set of network weights allows the treatment of the labels as independent vectors. The effect is a bias input to the output layer, one for each node in the graph. We call this the label-to-out approach. By using this label-to-out approach, we anticipated that the network would be helped to encode the given domain more efficiently. The anticipation was confirmed by the experimental results. Figure 7.7 (left) shows the plots of the SSE curves during training for graph 7 using the label-to-out approach. A confirmation can be seen in Figure 7.7 (right), which shows the results of training for graph 7 with different initial network conditions. It can be seen that the training performance, in terms of the SSE after training, is significantly improved by using the label-to-out approach. Later, it will be shown that the combination of the label-to-out approach and the GoGs of concept link graphs also allowed us to obtain the best results for the INEX 2009 competition: MAP=0.68, F1=0.57 (as shown in Table 7.4).

Figure 7.7: SSE curves of training the GNN 2 for INEX 2009 with the label-to-out approach on graph 7.

All-to-out approach

Motivated by the success of the label-to-out approach, we considered additionally creating connections between the state inputs of the encoding network and the output layer of the output network. These weights are not involved in the iterative forwarding; the output layer will take the stable states from the last forwarding iteration as its inputs. This

will help propagate information that exists in the depth of the structure directly to the output layer using a linear projection. We first tested this approach on the Policeman dataset and obtained some positive results. This is shown in Figure 7.8 (left). It can be seen that the all-to-out approach enhanced the training efficiency. However, the results of the attempt on the INEX 2009 dataset, illustrated in Figure 7.8, do not provide evidence that the all-to-out approach can improve the performance beyond that of the label-to-out approach. It is possible that the linear projection of the state information is ineffective for learning problems which are rich in structural information. This assumption is supported by the fact that the nodes in the policemen benchmark problem are only causally dependent on each other, whereas the nodes in the INEX dataset can have cyclic dependencies. Thus, a linear projection may be insufficient to support learning problems with complex dependencies between nodes.

Figure 7.8: Comparison between the label-to-out and all-to-out approaches.

Results and comparison for the INEX 2009 competition

In this section, we present the best results obtained at the time of the evaluation round of the INEX 2009 competition. A range of network architectures and training parameters were tried on this training task. A selection of the training parameters used, and the results obtained, is given in Table 7.4. The main observations can be summarized as follows:

The use of the label-to-out approach has improved the performance significantly. This indicates that the additional bias is effective in simplifying the given learning task. The simplification arises out of the fact that some features can influence

Table 7.4: Results of the GNN 2 for INEX 2009 using different training configurations (columns: Training Configuration; MAP mean; PRF macro/micro; ROC macro/micro; ACC macro/micro). The configurations compared are:
1: graph 6; state=2, hidden=15, outhidden=2, output=1; rprop; seed=45
2: graph 6; state=2, hidden=15, outhidden=2, output=1; weight control; rprop; seed=45
3: graph 6; state=15, hidden=1, outhidden=15, output=5; weight control; seed=3
4: graph 3; state=1, hidden=8, outhidden=6, output=39; balance method 1; seed=
5: graph 3; state=3, hidden=2, outhidden=15, output=39; balance method 1; seed=1
6: graph 3; state=1, hidden=8, outhidden=6, output=39; balance method 2; seed=9
7: graph 8; state=6, hidden=8, outhidden=1, output=15; jacobian control; seed=91; label-to-out

the network output directly without having to travel through the relatively deep network architecture (the unfolded iteration network).

Larger networks produce better results. The more neurons a network features, the more parameters are available to encode a given learning problem. The need for a relatively large number of parameters indicates that the given learning problem is non-trivial.

Weight control through the Jacobian control helped to produce better results. This weight control mechanism can aid the training procedure by restricting the weight changes such that the size of the weight adjustments remains within a limited range. This aids the stability of the weights during the training, and hence can result in a general improvement in the quality of the training procedure.

A comparison with other approaches to this classification problem is given in Table 7.5. In this comparison, it can be seen that the proposed GoGs approach produced a favorable performance. We are in fact very pleased with these results since they are preliminary results from a system which was still under development at the time. We continued the development of the system after the competition and were able to significantly improve on the results. This is described in the following.

Table 7.5: Comparison of all submissions for the INEX 2009 XML classification task (columns: Submission; MAP mean; PRF macro/micro; REC macro/micro; ACC macro/micro). The submissions compared are from the University of Wollongong, the University of Peking, the Xerox Research Center, the University of Saint Etienne and the University of Granada.

Encode labelled edge

The first attempt is to include more complete information in the CLGs attached to the nodes in the hyperlink graph by including the label vectors for the edges in the CLG. The GoGs defined for INEX 2009 can contain CLGs as level 0 graphs to describe the contents of the documents. A CLG contains a set of nodes representing the concepts and edges indicating the relationships among the concepts extracted from the document contents. Each concept is labelled by a binary vector where the location of a positive value indicates which concept it is representing. Similarly, an edge of the CLG is labelled with a one-dimensional numerical vector indicating the strength of the link between two concepts. The GNN 2 software implementation is extended to allow for the encoding of GoGs featuring labelled edges. This was realized by adding another GNN component on top of the GNN 2 network architecture (the base network). This added network contains an input layer, a hidden layer and an output layer. The dimension of the output layer is equal to the number of state neurons in the base network. For the base network, instead of taking the states of the neighboring nodes as the input, the sum of the edge states is used. Here the sum of the edge states is with respect to the node which is currently processed, and is computed as follows: for each edge that connects the current node to a neighboring node, we use the added network component and take as input the vector [neighboring node label, neighboring node state, edge label]. This vector is then forwarded through the layers of the new network. The output produced is said to be the state for the given edge. After processing all edges, we sum the states and pass this to the base network. The base network takes the node label and the sum of the edge states as input, and forwards them through all layers as described earlier. We carried out a set of experiments on graph 9 by enabling the encoding of edge labels. The results show that the training performance is generally better than that without using the edge labels. Figure 7.9 shows a comparison

between the SSE curves obtained when training with and without edge labels. Both converge to a low SSE, and the use of edge labels allows a quicker and steeper decline of the SSE curve.

Figure 7.9: SSE curves of training the GNN2 with labelled links. Trained on graphs 8 and 9.

Importance of the structure

According to the experiments conducted so far on the INEX 2009 benchmark problem, it can be observed that the major contribution to a good result comes from the node label vectors. Thus, one can question the importance of the structural information for this classification task. Some possibilities that may explain this observation are:

This may be due to the effects of long-term dependencies, which can be significant when encoding deep structures.

The structural information itself may be of limited use for the given classification task.

In order to examine whether and how the link structure contributes to the classification task, a matrix can be built for document categories vs. links. For each category, we count the number of links from the documents belonging to this category to the documents in all other categories. Such a matrix shows whether category and linking structure are related; e.g. one category may contain documents which mostly connect to documents in another category, which reveals a strong relationship between these two categories. The matrix is shown as a correlation graph in Figure 7.10. The figure shows the strength of correlation

between any two categories. The higher a data point in this figure, the stronger the relationship between the given category and another category. The main observation is that some categories have mostly self-connections, which indicates that documents from the same category are connected with each other. This property of the structure may help classify the documents. For each category, if we sort the counts of links to other categories, we find that some categories are connected, but not through direct links. For example, category A links to B, and B links to C; this means that indirect links exist between A and C. Thus, learning from indirect context is also required from the GNN2. However, such deep dependencies may require additional training epochs and network parameters for the GNN2 to learn.

Figure 7.10: Category-Link Matrix. Different symbols represent different source categories; x-axis: destination category; y-axis: normalized counts of the links between documents from two categories.

Pre-processing by using the PMGraphSOM

According to the analysis of the correlation between structure and classification targets, we have identified that structural information can be of relevance for this learning task. The learning of structure requires an unfolding of the GNN network (this is an effect of

the training algorithm). Hence, the deeper the structure of the data, the deeper the unfolded network architecture needs to be. As described earlier, the GNN learning algorithm is gradient-based, and this may make it difficult to learn deep structures due to long-term dependency issues [9]. In contrast, the PMGraphSOM training algorithm is an unsupervised machine learning method based on Euclidean distance measures and Gaussian updating, and it is not affected by any gradients. It is thus anticipated that the PMGraphSOM can provide an effective representation of structured objects on a 2D display area, and that such a representation is unaffected by long-term dependencies. We therefore consider reducing long-term dependencies (with respect to the GNN2) by reducing the depth of the GoG. This can be done through a projection of the highest level graph onto a 2D surface as obtained by a PMGraphSOM. This projection is then used to label the nodes in the second level of the GoG. The procedure effectively reduces the depth of a GoG by 1.

We carried out two sets of experiments using the PMGraphSOM on the hyperlink graphs: either by training the PMGraphSOM on all nodes in the hyperlink graph, or by training a set of PMGraphSOMs where each PMGraphSOM is trained only on the nodes belonging to a particular category. Firstly, a PMGraphSOM was trained on a single hyperlink graph that contains all documents in the dataset. The SOM is updated for all the nodes in the graph. Such a result shows how documents are clustered according to the features provided. Given that the task is multi-category classification, we also propose to train an individual PMGraphSOM for each category. This means that for each category c, the same hyperlink graph is used but the SOM is only updated when processing the nodes that belong to category c. The outputs from all trained PMGraphSOMs are then combined into the final result, in which each pair of coordinates corresponds to a particular category.

A node in the graph may contain structural information only, or may in addition be labelled by a feature vector that describes the contents or properties of the document. This allows us to create the training data sets shown in Table 7.6. In the following sections we will refer to the training data by using the IDs defined in Table 7.6.

Table 7.6: A list of training data used by the PMGraphSOM for INEX 2009. Training with one PMGraphSOM: #1 (no node label, structure only), #2 (Rainbow classification results as node labels, plus structure). Training with separate PMGraphSOMs: #3 (no node label, structure only), #4 (Rainbow classification results as node labels, plus structure).

Note that the node

label used here is a classification result from Rainbow (software based on the BoW vectors of the documents).

Figure 7.11: The mapping results of training PMGraphSOMs for INEX 2009 on dataset #1 (legend: categories c0 to c38). Left: mapsize=8x6, grouping=1. Right: mapsize=4x32, grouping=2.

Using one PMGraphSOM to map the hyperlink graph

We first trained PMGraphSOMs on dataset #1 using a number of different training parameters. This is generally required since a number of training parameters need to be set for which an optimal setting is not a-priori clear. We then identified a suitable set of training parameters that led to a clustering which largely agreed with the class membership of the nodes. The resulting mapping is shown in Figure 7.11, which presents the final mappings of two PMGraphSOMs that differ in map size. Different colors and symbols represent the documents from different categories. From the plots, it can be seen that the PMGraphSOM is able to produce a clustering of documents that agrees with the category membership of the documents. This is quite remarkable given that the PMGraphSOM was trained on structural information only (there was no label attached to the nodes in this graph). It can also be observed that the larger map significantly improves the network utilization of the mapping area. The latter can be attributed to the relatively smaller grouping of neurons used in this experiment.

Training separate PMGraphSOMs on the nodes belonging to different categories

In this experiment we trained one PMGraphSOM for each of the 39 categories. To carry out this task, we create a training dataset that contains a single (hyperlink-)graph where nodes represent documents and links are the hyperlinks. The node in the graph is either

labelled by a symbolic class label or is without a class label. For each category, if a document belongs to that category, the associated node is marked as a positive sample; otherwise it is a negative sample. The result is 39 hyperlink graphs containing complementary positive and negative samples. Note that, despite this pre-processing step, the class label is not used to provide supervision during training; rather, it is used to segment the learning problem into 39 smaller sub-problems. The approach is valid since the underlying learning problem is a supervised one, and hence we are allowed to use the target vectors in a pre-processing step. During training, the PMGraphSOM only updates the codebooks on the map if the node was marked as a positive sample. This approach can help produce a better separation between positive and negative samples for each category. It eventually yields 39 pairs of mapping coordinates for each document in the dataset. This information can be used as a prediction of the clustering, and concatenated to the node label vectors used by the GNN2. The performance of the PMGraphSOM on the classification task can be evaluated on two measures, PRE and ACC:

PRE: The precision of the classification. For each trained PMGraphSOM we have a mapping result for the corresponding category. Each activated neuron on the map is associated with either the positive or the negative class according to the majority of samples mapped onto it. The classification can then be determined for each mapped document. PRE is computed as the number of correctly classified positive samples divided by the total number of positive samples.

ACC: The accuracy of the classification. This measure also considers the negative samples and is computed as the number of correctly classified samples divided by the total number of samples in the dataset.

Such an evaluation is only available in the current case, since these measures can only be applied to situations in which there are at least two classes to be classified. Figure 7.12 shows the mappings produced by a selection of two PMGraphSOMs for two (out of the 39) categories. The red plus symbols represent negative samples and the green crosses positive samples. The plots show that pure clusters can be observed for both negative and positive samples, but there also exist overlaps between the two classes. The main problem identified here is the relatively low utilization rate of the mapping area.
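The two measures defined above (PRE and ACC) can be computed directly from the final mapping. The following minimal sketch is an illustration only (not the thesis code; the data layout is assumed): it associates each activated neuron with the majority class of the samples mapped onto it and then derives PRE and ACC.

from collections import Counter, defaultdict

def evaluate_mapping(mapped_neuron, labels):
    """mapped_neuron: dict doc_id -> index of the best matching neuron.
    labels: dict doc_id -> True (positive) / False (negative)."""
    # Majority class per activated neuron
    per_neuron = defaultdict(Counter)
    for doc, neuron in mapped_neuron.items():
        per_neuron[neuron][labels[doc]] += 1
    majority = {n: c.most_common(1)[0][0] for n, c in per_neuron.items()}

    # Classify each document by the majority class of its neuron
    correct_pos = sum(1 for d, n in mapped_neuron.items()
                      if labels[d] and majority[n])
    correct_all = sum(1 for d, n in mapped_neuron.items()
                      if majority[n] == labels[d])
    total_pos = sum(1 for v in labels.values() if v)

    pre = correct_pos / total_pos    # correctly classified positives / all positives
    acc = correct_all / len(labels)  # correctly classified samples / all samples
    return pre, acc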

Figure 7.12: The mapping results of training the PMGraphSOM for INEX 2009: the plots for two categories of training dataset #4; labeled and unlabeled nodes are marked by different symbols.

At the beginning of the training, the SOM is initialized with random codebooks which contain values within the value range observed over all inputs. This allows random activations of the winning neurons at the starting phase. However, since the PMGraphSOMs are updated for positive samples only, which correspond to different categories, and given that the positive samples are always the minority, the distribution of the input value range covered by the positive samples can be significantly different from that of the rest of the dataset. This was found to be indeed the case, and hence showed us that a different approach to the initialization of the network is required.

Initialization of the SOM when training on specific pattern classes

Under normal circumstances, a PMGraphSOM is initialized with random values from within the value range covered by the input dataset. A problem here is that the network is only trained on one category of nodes (the positive samples); the number of positive samples is commonly much smaller than the number of negative samples, and the positives may cover a value range that differs from the negatives. Given the unbalanced distribution of the two classes in the dataset, it is possible that only a few codebooks are similar to the inputs of positive samples, and hence this can result in a low network utilization from which the network cannot recover. To counter this issue, some modified random initialization approaches are proposed.

Random initialization. This is used to initialize the label component of the codebook vector. After loading the data, we know the maximum, minimum and mean of the values from the node label vectors for the whole

dataset. The random initialization can then be done by generating random values for the codebooks based on the range obtained. The value range is obtained only from nodes which are actually used to train the network.

Box-Muller Gaussian distribution randomization. In this case, we assume that the distribution of the node state follows a Gaussian distribution. For any one node, the mappings of its neighbors are more likely to be on different locations than on a single neuron. In order to better initialize the state vector of the codebook, we took the following steps:

Get the maximum, minimum, mean and standard deviation (sigma) of the number of neighbors for all labelled documents.

Compute the Gaussian based on the mean and sigma.

For each codebook, generate a random index k over the Gaussian.

Select k elements in the codebook for assigning non-zero values.

For each of the k elements, randomly generate real values over the defined Gaussian.

Bin-method. The Bin-method creates a vector holding the possible numbers of neighbors, and returns a randomly selected value from this vector. Processing every labelled node n_i that has c neighbors, we concatenate c elements to the vector and set their values to i.

Both the Box-Muller and the Bin-method produce values in each codebook that follow a Gaussian distribution. Figure 7.13 shows the sample distributions generated by the two methods. The Box-Muller approach is adopted for further experiments since it produces a Gaussian with a better shape, and one that is more consistent with the sigma and mean values obtained from the inputs.

In order to investigate the impact of using two of the three improved initialization methods, we carried out a number of experiments for comparison. Considering the values from positive samples only, we initialize the codebooks by applying random initialization on the label component of the codebooks (init1) and the Box-Muller approach (init2) on the state component. The comparison of the resulting mappings is illustrated in Figure 7.14. These are results for the same category, trained starting from

the init1 initialization method and the init2 initialization method, respectively. It can be observed that the proposed random initialization based on the Box-Muller algorithm addresses the limitation of the previous approach by producing a considerably higher map utilization, and it also improves the clustering performance.

Figure 7.13: The random initialization schemes for the PMGraphSOM: Box-Muller Gaussian vs. Bin-method.

Figure 7.14: The mapping results of training the PMGraphSOM for INEX 2009: trained on dataset #4. Left: init1 initialization method; Right: init2 initialization method; positive and negative samples are distinguished by different symbols and colors.

Degree of Probability Mapping

The probability mapping approach on the state vector considers the probability of any changes in the mapping of nodes, and allows the likelihood of the mapping in the subsequent iteration to be encoded. This leads to a non-sparse state vector, which benefits the map utilization. However, the degree of probability mapping (PM) is reduced during training as the neighborhood radius shrinks. Towards the end of the training, the standard deviation

becomes very small, so that there is no effective PM at the end of the training. This means that the state vector becomes sparse again during the last few training iterations. This reduces the discriminative power of the input data and hence leads to a reduction in the utilization of the map space. We therefore propose to force the PM to remain at least 1% of its initial value throughout training and until the end. In Table 7.7, we compare the results of using the default PM and the controlled PM. It can be seen that the PRE is significantly improved and the ACC also shows a slight boost on both train and test sets. This is simply the effect of maintaining a good network utilization.

Table 7.7: Performance of the PMGraphSOM for the INEX 2009 dataset with different settings for PM. Rows: default PM, keep 1% PM; columns: PRE and ACC for train, test and all.

Effects of feature vectors when training the PMGraphSOM

As was shown in Table 7.6, we also varied the representation of the training data by either including or not including the feature vectors that describe the document contents. This refers to dataset #3 and dataset #4 in Table 7.6. Figure 7.15 presents the average and maximum performance curves over categories during training for both train and test sets. The results show that the proposed new initialization mechanism can significantly improve the performance on the training documents.

Figure 7.15: Classification performance of the PMGraphSOM when trained on dataset #4. Left: old initialization; Right: new initialization.

In comparison, we remove the feature vectors (dataset #3) for some experiments and

present typical results in Figure 7.16. These again confirm that the proposed modification of the initialization helps the PMGraphSOM to produce a generally better result on both training and test sets. Without the feature vectors, a drop of about 2% in performance can be observed (compared to the results shown in Figure 7.15), which shows the significant contribution of the features to the classification task.

Figure 7.16: Classification performance of the PMGraphSOM when trained on dataset #3. Left: old initialization; Right: new initialization.

Improving the GNN2 results

The experimental results presented in the previous sections show evidence that the classification performance of the GNN2 can be improved through:

The use of edge labels in the CLGs.

The use of the PMGraphSOM as a pre-processor to obtain additional features.

The use of larger networks for encoding the structures.

Accordingly, we created graphs for training sets that include the edge labels (as was shown in Table 7.3) and the mappings produced by the PMGraphSOM; the experiments were carried out with a variety of network architectures and initial conditions. A selection of the results is given in Table 7.8. It is observed that the approach resulted in a significant improvement to MAP=70.64 from our previous best of MAP=68.1. The best results were obtained by using a combined input of the Rainbow classification results and the mappings produced by the 39 PMGraphSOMs. This shows an advantage of the PMGraphSOM in encoding the structural information. By comparing the results obtained by attaching the 39 PMGraphSOM mappings, the main findings are:

The results of training the PMGraphSOM on structure already help the GNN2 to produce better results. This can also be attributed to the label-to-out approach, which reduces the effective depth of the network architecture of the GNN2.

The control applied to the degree of PM helps the PMGraphSOM produce more supportive mappings. This is evidenced by an improvement of the MAP for both the training and the test set.

Table 7.8: Improved results of the GNN2 for the INEX 2009 learning problem. Columns: GoG, network architecture, and MAP, PRE, ACC, PRF for train and test; the six network architectures use hidden layers of size 11, 2, 5, 4, 5 and 1 with varying numbers of state neurons.

7.4 Effects of long term dependencies on GNN2

It is known that learning methods such as MLP, BPTS and GNN, which use a gradient-based algorithm, can suffer from the long-term dependency problem [9]. In this section, we investigate the effects of long-term dependencies on the ability of the GNN2 to learn deep structures. For this purpose, an artificial dataset consisting of just two classes is created. Each class is represented by a structure of varying depth. This is expressed as a sequence of trees of varying length. The sequences are identical except for the node that is furthest away from the end of the sequence. Hence, the feature which discriminates the two classes is located deep in the structure. Varying the length of the sequences varies the depth of the discriminating feature. A GNN2 is trained for each sequence length, and the time needed to observe convergence is measured. We then record the longest sequence that a model was able to discriminate. For example, when attempting to learn a sequence of length 9, the two sequences were as follows:

Class 1: g0 g0 g0 g0 g0 g0 g0 g0 g1
Class 2: g0 g0 g0 g0 g0 g0 g0 g0 g2,

where g0, g1, and g2 are different trees (extracted from three randomly chosen policeman images). Thus, the two sequences can be discriminated only by a model capable of

propagating (forward and backward) information about the graph that is furthest from the root of the sequence. Each of the following experiments is repeated with 17 different initial conditions, and the mean, maximum and minimum performances are recorded. This is to avoid false recordings arising from accidental correct classifications (we may be lucky with some initial conditions). To be counted as a successful attempt, all 17 initial conditions need to lead to a successful classification of the pattern classes after training the networks for at most 2 iterations.

It was observed that the network configuration can have a significant effect on the ability to classify a given learning problem. Hence, we varied each of the parameters EN_state, EN_hidden, ON_state, ON_hidden and recorded the limit of the sequence length the network can encode. When varying one of the parameters, the other parameters were left at the default values EN_state = 15, EN_hidden = 14, ON_state = 15, ON_hidden = 14. The parameters were varied within the range [1 : 4]. The result is shown in Figure 7.17. The figure shows that sequences of up to a length of 26 can be encoded successfully if the system is sufficiently large.

Figure 7.17: Varying the size of the four internal network layers (horizontal scale), and the limit of the sequence length the network can encode (vertical scale). Shown are the average performance (left) and the maximum performance (right).

It is also observed that larger systems generally increase the depth of the structures that can be encoded, and that the system is more sensitive to the architecture of the output network. The network's ability to encode long dependencies increases significantly with the number of internal parameters in the output network. The increase levels off at about 20 hidden neurons and 15 state neurons. In contrast, an increase of parameters in the encoding network has little influence on the overall ability to encode long sequences. It is interesting to note that the network's performance can

vary significantly with the initial condition. Figure 7.17 (right) shows that some initial conditions allowed the network to encode sequences of length 89.

The previous experiment neglects potential causality between the sizes of the various network layers. It is well known that the optimal size of an internal layer in an MLP depends on the choice of size for the preceding and following layers in the network. As a consequence, we repeated the experiment by concurrently varying the sizes of related layers such as EN_state and EN_hidden as follows:

EN_state = EN_hidden + 1
ON_state = EN_state + 1
ON_hidden = EN_state

The mean, maximum, and minimum performances are shown in Figure 7.18. It can be seen that the network's capability to encode deep structures increases significantly to 32 when compared to the result obtained earlier. This was achieved by simply respecting the causality between the sizes of the internal layers. The experiment also confirmed that 15 to 20 neurons in each layer are a good choice when an encoding of deep structures is required.

Figure 7.18: Maintaining causality between dependent internal layers. Horizontal scale: number of neurons in a layer; vertical scale: limit on the length of a sequence of graphs a network can classify. Shown are the mean, minimum and maximum performances.
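The construction of the artificial two-class sequence benchmark used in this section can be illustrated with a small sketch. This is a simplified illustration, not the thesis code; the three trees are represented by placeholder symbols rather than actual policeman-image graphs.

def make_sequence_dataset(max_len, g0="g0", g1="g1", g2="g2"):
    """Build pairs of sequences of increasing length that differ only in the
    element placed furthest from the root, as in the benchmark described above.
    g0, g1 and g2 stand in for the three tree-structured graphs."""
    dataset = []
    for length in range(2, max_len + 1):
        class1 = [g0] * (length - 1) + [g1]  # e.g. length 9: eight g0 followed by g1
        class2 = [g0] * (length - 1) + [g2]  # identical except for the final element
        dataset.append((length, class1, class2))
    return dataset

# Example: sequence pairs up to length 9, as in the text
pairs = make_sequence_dataset(9)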

7.5 Web Spam Detection

With the rapidly increasing size of the Web, and its unregulated nature, spam detection on the Web remains a challenging problem. Spam pages are designed to adversely affect the rank of a web document so that the order of Web pages returned by a search engine is altered. Hence, spam generally reduces the perceived quality of a web search engine. Spam detection is known to be a moving-target problem: new spam detection methods need to be constantly developed and improved in order to account for ever-changing spamming techniques. A good platform for the development of spam detection methods is provided by the international competition on spam detection (WEBSPAM), which is held at annual intervals. It is interesting to note that one of the machine learning methods developed by us (the PMGraphSOM) has been applied with considerable success to one of these benchmark datasets [27]. However, we found that the approach taken in [27] did not utilize the enhancements proposed earlier in this chapter. We therefore repeated the experiments reported in [27] with our enhancements enabled. We found that this indeed resulted in a noticeable improvement of the results, placing our method nearly on par with the state-of-the-art in this application domain. We then explored the possibility of applying the same methodology to a second spam detection dataset and were again able to produce very good results. This is a significant finding since there is no other known method that has been applied successfully to two different spam detection problems. Thus, we claim that our methodology is the most generic approach capable of solving different spam detection problems. Our work on spam detection is reported in this section.

Di Noi et al. proposed an approach that combines several state-of-the-art machine learning methods which are trained on link-based features, content-based features, as well as on the topology of the Web as defined by the hyperlink graph [27]. The task is a classification task in which Web pages or Web domains are to be classified as either spam or non-spam. They combined our PMGraphSOM machine learning method with a supervised method known as the classic multi-layer perceptron (MLP). This approach is then extended by the application of the Graph Neural Network (GNN), which is trained on the outputs of the MLP as well as the topological information of the Web hyperlink connection graph. The architecture of the combined machine learning system is illustrated in Figure 7.19. There are three main components: PMGraphSOM, MLP and GNN. The first

step is to use the PMGraphSOM to produce mappings for all the data (i.e. the web hosts in this task) on a two-dimensional map by considering link-based features and topology. The MLP is then trained on combined vectorial inputs consisting of the mapping results from the PMGraphSOM, the link-based features and the content-based features. This allows the MLP to produce classification outputs according to a combined knowledge of links and contents. This is extended by training the GNN on the outputs from the MLP, producing the classification results for all hosts. It was reported in [27] that the addition of a second GNN, trained on the outputs of the previous GNN, enhanced the classification performance.

Figure 7.19: The proposed combined machine learning system for solving the large-scale Web spam detection problem.

The significance of the work is the combination of state-of-the-art machine learning methods for solving a classification problem involving graph-structured information as well as link- and content-based vectorial features. The combination is designed with the aim of maximizing the performance produced by the GNN, which is the main component of the combined system. The GNN is an extension of the recurrent neural network and is theoretically capable of dealing with any type of graph: graphs with directed or undirected links, and cyclic graphs. However, it has been shown that recurrent neural networks using gradient-descent backpropagation algorithms can suffer from the long-term dependency problem. This classification task requires the processing of a complex Web link graph, so that long-term dependencies may limit the quality of the classifications produced by the system. Hence, the PMGraphSOM is used to encode the link-based features and the

topology of the domain. We reported in an earlier section that such a strategy is effective in reducing the depth of a GoG, and hence reduces the long-term dependency problem. The training of an MLP prior to the GNN is a realization of the stage-based learning approach proposed in [71]. The GNN network has two sets of weights connecting the hidden layer to the node labels and to the node states respectively. The node labels remain static in the recurrent process while the node states are dynamic. This may result in different convergence speeds for these two sets of parameters. The MLP is employed to obtain a mapping between the vectorial features and the classification targets. Thus, the MLP models the vectorial information available with this dataset, and hence can be seen as a pre-processing step. The GNN is then engaged to reduce the residual error (of the MLP) by including topological features of the domain. The approach is taken so that the training task for the GNN is simplified. This is because: 1. the dimension of the node labels is reduced; 2. the load of encoding node labels for the GNN is reduced so that it can focus on the topological information.

We trained the method by engaging the optimizations suggested earlier (the initialization of the PMGraphSOM, and the improved probability mappings). This was then applied to two international benchmark problems and compared with the results obtained by competitors. In Section 7.5.1, we demonstrate the procedure by applying the proposed approach to the WEBSPAM-UK2006 dataset (described in Section 5.5). A comparison between our results and the competitors' shows that our method outperformed all others on the crucial performance measure, the area under the receiver operating characteristic (ROC) curve (referred to as AUC in later sections). We will present how we obtained a performance of AUC = 0.95 compared to the previous best of AUC = 0.93 produced by others. Then, in Section 7.5.2, we apply our methods to the WEBSPAM-UK2007 dataset (described in Section 5.6). It will be shown that we are able to obtain results which approach the best performances reported at the competition.

7.5.1 Application to the UK2006 dataset

In this section, we conduct training by following the procedure documented in [27]. A number of PMGraphSOMs are trained first on the link-based features; we then train MLPs on the vectorial features and identify that training on a combination of

content-based features, link-based features, and PMGraphSOM mappings produces the best result. These are reported in the following sections. We then analyse the different features and models with the aim of finding an appropriate combination for solving the Web spam detection task. The analysis provides guidance for the design of the combined system, where each component is trained on a good set of features so that the performance can be maximized. We then consider the unbalanced nature of the dataset and employ a sampling approach which addresses the unbalanced classification problem. The outcome shows that our combined training system can beat all other competitors on this Web spam detection task.

Train PMGraphSOMs

The PMGraphSOM is provided with link-related information during training. This includes the link-based feature vectors and the topology of the hosts. The result is a projection from the domain of graphs to a two-dimensional map. The coordinates of the mapping in turn are used as node labels for further encoding. Given that the task is to classify nodes into two classes (spam or normal), we consider three possibilities for training the PMGraphSOM: 1.) update codebooks when processing nodes representing spam hosts only; 2.) update codebooks when processing nodes representing normal hosts only; 3.) update codebooks when processing all nodes in the dataset. The motivation is to reduce any effects from the unbalanced nature of the dataset: since the majority of the training samples are normal hosts, the useful information for identifying spam samples may be dismissed as noise. By training spam and normal hosts on two separate PMGraphSOMs, each class can be learned without interference from the other class, which helps to produce more precise mappings undisturbed by the unbalanced nature of the dataset. This approach had already proved to be effective in Section 7.3.1. Similarly, we modify the initialization of the PMGraphSOM codebooks according to the different updating schemes, i.e. we initialize the SOM by considering the inputs of the labelled samples.

The mapping of the hosts at the end of the training allows us to build a confusion matrix as shown in Table 7.9. The performance can be evaluated on the following measures:

Classification rate (C): (TP+TN)/(TP+TN+FN+FP)

Classification rate on Spam (C-POS): TP/(TP+FN)

Classification rate on Non-Spam (C-NEG): TN/(TN+FP)

Recall (R): TP/(TP+FN)

Precision (P): TP/(TP+FP)

False positive rate (FPR): FP/(FP+TP)

F-score (F): 2(R*P)/(R+P)

Table 7.9: Confusion matrix: true positive (TP), false negative (FN), false positive (FP), and true negative (TN). Rows give the actual class and columns the predicted class; Spam row: TP, FN; Non-Spam row: FP, TN.

A variety of training configurations were tried for the training of the PMGraphSOM. Table 7.10 lists both the training performance and the generalization performance for a number of these experiments, with the best results highlighted. It can be observed that, using a small map of size 8x5, training on spam hosts only produces better results on all measures for both the training and the testing set. However, training on positive samples only did not benefit the C-POS on the training set when using larger maps, whereas the results on the test set are consistent across all map sizes. This may be due to the fact that the distribution of spam and normal hosts in the training set is the opposite of what is found in the test set: the training set contains more normal hosts than spam hosts, while the majority of the data in the testing set are spam hosts. Thus, a focus on spam only helped produce better clusters for spam hosts.

Training the MLP component

After training the PMGraphSOM, the MLP was used to further encode all available vectorial inputs. The classification targets are provided in the form of binary vectors. A well-trained MLP is anticipated to identify the correlation between the input feature vectors and the classification targets. Experiments were carried out using different network architectures and different combinations of input vectors. The dimension of the inputs ranged from 21 to 64 by judiciously selecting a subset of the available features, and we consider using 15 hidden neurons as a starting point for the experiments, then gradually increase the size of the hidden layer up to a size of 60. The experimental results show that

a moderate number of hidden neurons is sufficient, because an even larger hidden layer does not improve the performance further. We also tried using additional hidden layers, which speeds up the convergence of the training, but overfitting can then be observed after training for more than 4 epochs. This is shown in Figure 7.20. Figure 7.21 shows a comparison of AUC and PRF when using different input features. The curves with pluses and crosses show that link-based features contribute more to the classification task than content-based features. By additionally adding the mappings produced by the PMGraphSOM (to serve as an additional input for the MLP), it is observed that the generalization performance can be improved slightly.

Table 7.10: Performance of training PMGraphSOMs. TrainMode: 2 - train on both; 1 - train on spam hosts only; 0 - train on normal hosts only. Columns: MapSize, TrainMode, C, C-POS, C-NEG, R, P, FPR, F.

Comparison of MLP and GNN detection results

We conducted a study to identify the contribution from the different features. This question is of interest given that the dataset provides a vectorial description for each host in

a subset of the Web. This feature vector is of dimension 20,346 (8,944 content-based features plus 11,402 link-based features).

Figure 7.20: Results of the MLP for UK2006 using different network architectures (MSE and AUC over training epochs for train and test sets). Left: hidden=2; Right: two hidden layers, layer1=2, layer2=1.

Without any further pre-processing, these feature vectors would be attached to the nodes in the hyperlink graph. However, it is important to understand and to identify how the various features can help the given classification task. Such understanding leads to domain knowledge that can support the development of an optimized learning method. The analysis is conducted by comparing the results obtained when using the MLP and the GNN on different combinations of input features. The detection results of the two models can then be compared by identifying the number of spam hosts correctly detected by both models, and the number of spam hosts correctly detected by only one model. We first train MLPs on content-based and link-based features respectively (referred to as MLP-Content and MLP-Link below). The total number of spam hosts detected by MLP-Link is 1,262, while MLP-Content classified 1,137 hosts as spam; many of these were classified incorrectly as spam. Nevertheless, the experiment reveals that both types of features contribute to the classification task and that the link-based features may play the more important role. The hosts were assessed to be spam or normal based on either link-based or content-based criteria, or a mix of both. Thus, it is anticipated that detection rates will differ when different feature

sets are selected.

Figure 7.21: Comparison of MSE curves and performance of training MLPs for UK2006 using different input features (Content, Link, Content+PM, Link+PM, Content+Link+PM, Content+Link; AUC and MSE on train and test sets over training epochs).

The experiments confirm that the spam hosts detected correctly by these two models are very different. There are only 443 spam hosts detected correctly by the MLP-Link, and 318 spam hosts detected correctly by the MLP-Content. This indicates the necessity of combining both types of features for better detection results. Both MLPs produce one-dimensional outputs for all hosts in the dataset. The outputs are then concatenated to describe the nodes in the link graph. This information is then used to train a GNN. The results show that the trained GNN significantly boosts the classification performance when compared to the MLPs by detecting more spam hosts (1,485 nodes were correctly classified as spam). This is attributed to the fact that the GNN takes topological information as additional input, and indicates that structural information significantly supports the classification task. In order to investigate how the GNN outperforms the MLPs, we analyse the commonalities and differences of the detection results produced by the MLPs and the GNN. The results are summarized in Table 7.11. Here, Diff1 is the number of spam hosts detected only by the former model, Diff2 is the number detected only by the latter model, and Common refers to correct detections made by both (the intersection). We found that the GNN can generally detect more spam hosts than any of the MLPs.
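The Common/Diff1/Diff2 counts amount to simple set operations over the sets of correctly detected spam hosts. A minimal sketch (an illustration only; the variable names are assumptions):

def compare_detections(detected_a, detected_b):
    """detected_a, detected_b: sets of host ids correctly classified as spam
    by model A and model B respectively."""
    common = detected_a & detected_b   # detected correctly by both models
    diff1 = detected_a - detected_b    # detected only by the former model (A)
    diff2 = detected_b - detected_a    # detected only by the latter model (B)
    return len(common), len(diff1), len(diff2)

# Example: compare the GNN against MLP-Link
# common, diff1, diff2 = compare_detections(gnn_correct_spam, mlp_link_correct_spam)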

Table 7.11: Comparison of the detection performance of MLP-Link, MLP-Content and GNN (link+content+topology). Columns: Model, Common, Diff1, Diff2; rows: MLP-Link vs MLP-Content, GNN vs MLP-Link, GNN vs MLP-Content.

This shows the advantage of using the combined inputs. However, there are also spam hosts which were detected correctly by the MLPs but not by the GNN. The GNN is known to have a bias towards structural information (since it is trained on both link-based features and the topology of the graph); this explains the observation that most spam hosts detected by MLP-Link can also be detected by the GNN, while the situation is worse when comparing with MLP-Content. This implies that the MLP may perform better when encoding certain types of vector-based features. A further comparison has been carried out by checking the features of the spam hosts detected by the different models. This may provide an indication of which features a model can encode better. The subset of features used for the comparison is listed below 2:

hp: Home page.
mp: Page with maximum PageRank.
IN: In-degree.
OUT: Out-degree.
n(c): Number of neighbors at distance c.
sn(c): Number of different hosts pointing to hp, supporting the hp at distance c.
aio: Average in-degree of out-neighbors.
aoi: Average out-degree of in-neighbors.
fat: Fraction of anchor text.
fvt: Fraction of visible text.
cr: Compression rate (the visible text compressed using bzip).
cp-(n): Top n corpus precision.
cr-(n): Top n corpus recall.
qp-(n): Top n queries precision.
qr-(n): Top n queries recall.
entropy: Entropy of trigrams.
LH: Independent likelihood.

2 This subset was selected manually by choosing features which we think may contribute most strongly to the classification task.

We then trained the MLPs on each of those features, and trained the GNN as before. We then computed the classification performance achievable by using any one of these features and models. The results are shown in Table 7.12 and Table 7.13. Table 7.12 shows the comparison between the GNN and MLP-Link. There are four main columns in the table, from left to right listing the feature names, and then the features of spam hosts in three sets: Common (features of spam hosts detected by both the GNN and the MLP); Diff1 (features of spam hosts detected by the GNN only); Diff2 (features of spam hosts detected by the MLP only). For each set of features, we record the maximum, minimum and average values. By comparing the mean values of Diff1 and Diff2, it can be found that the spam hosts detected by the GNN have comparatively lower IN, OUT, n(c) and aio, but higher sn and aoi. In contrast, Table 7.13 shows the comparison between the GNN and MLP-Content. It can be seen that the spam hosts detected by MLP-Content show a much lower performance when using IN and OUT, but a higher performance on sn, than the spam hosts detected by the GNN. These observations could be a consequence of the truncation of links (the maximum out-degree was truncated to 64, as reported in Section 5.5). Truncation was done only once, but it may have affected the training performance given that the truncation from the original maximum out-degree of 5994 to 64 is huge.

In order to find out whether and how the truncation may affect the training results, in addition to using the outputs from the MLPs, 4 link-based features were selected to be concatenated to the node label vectors. These four features are IN-hp, IN-mp, OUT-hp and OUT-mp. It is expected that this approach can compensate for the impact of the truncation of the links by providing information about the actual number of links via a feature vector. We then trained a set of MLPs and a GNN as before. The result is shown in Table 7.14. By using the additional link-based features, the GNN detects 83 more spam hosts but incorrectly classifies 48 spam hosts that were classified correctly previously. By comparing the link-based features of the spam hosts detected by these two GNNs, it is found that, without using the additional features, the GNN detected spam hosts with much larger IN and OUT. This indicates that, even when provided with IN and OUT, the GNN was not able to correctly classify the spam hosts with high in-degree and out-degree. This may be explained by the existence of contradictions between these features and the classification targets. For example, some spam hosts have high IN and OUT while some normal hosts also

share the same property. We ran further experiments to evaluate the impact of different selections of additional link-based features when training a GNN. The results are shown in Appendix A. From these results, we did not observe any other feature that allowed for a significant improvement in the classification results. We then repeated some of the above experiments with slightly changed training configurations (network size, training iterations). The results produced were consistent with the observations made earlier and in Appendix A. Hence, we are confident that the conclusions drawn from these experiments are correct.

Table 7.12: Comparison of the features of hosts detected by the GNN and MLP-Link. Columns: Feature, and max/min/avg for the Common, Diff1 and Diff2 sets; rows: IN-hp, IN-mp, OUT-hp, OUT-mp, n2-hp, n3-hp, n4-hp, sn1-hp, sn2-hp, sn3-hp, sn4-hp, aio-hp, aio-mp, aoi-hp, aoi-mp, fat, fvt, cr, cp-(n), cr-(n), qp-(n), qr-(n), entropy, LH.
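The node-labelling step described above (MLP outputs plus the four selected link-based features) is essentially a per-host concatenation. A minimal sketch, with assumed dictionary layouts for the per-host data:

import numpy as np

def build_node_labels(hosts, mlp_link_out, mlp_content_out, link_features):
    """Concatenate, for each host, the two MLP outputs with the four
    selected link-based features (IN-hp, IN-mp, OUT-hp, OUT-mp)."""
    selected = ["IN-hp", "IN-mp", "OUT-hp", "OUT-mp"]
    labels = {}
    for h in hosts:
        extra = [link_features[h][f] for f in selected]
        labels[h] = np.array([mlp_link_out[h], mlp_content_out[h]] + extra)
    return labels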

Table 7.13: Comparison of the features of hosts detected by the GNN and MLP-Content. Columns: Feature, and max/min/avg for the Common, Diff1 and Diff2 sets; rows as in Table 7.12.

The Training of the GNNs

A variety of network architectures, initial conditions and inputs were tried in order to optimize the classification performance of the GNN. The basis for this exercise has been the classification performance on the training set. Hence, in the following, we aim at maximising the classification performance of the GNN on the training set. The expectation is that the generalization ability of the GNN will then be reflected in the observed test set performance. Figure 7.22 shows a comparison of the AUC performance on both the training and the test set for different network architectures. The experimental results indicate that a network with 15 hidden neurons and 8 state neurons could be sufficient to produce good results on both the training and the testing set.

Table 7.14: Comparison of the features of hosts detected by GNN(MLP-Link+MLP-Content) and GNN(MLP-Link+MLP-Content+4d). Columns: Feature, and max/min/avg for the Common, Diff1 and Diff2 sets; rows as in Table 7.12.

The nodes in the graph can be labelled using different label vectors. Experiments were carried out using different combinations of the best PMGraphSOM mappings, MLP outputs, link-based features, content-based features and the topology. The topology is defined as a link graph where nodes represent web hosts and edges represent hyperlinks. In order to reduce the computational cost of the experiments, we truncate the links in two ways (a sketch of the sorted variant follows this list):

(A) Random: randomly select at most 64 out-links for each host.

(B) Sorted: for each host, sort the links to other hosts by their counts in descending order, and select the links pointing to the top 64 targets.
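A minimal sketch of the sorted truncation, under the assumption that the out-links are available as per-host counts of links to each target host:

def truncate_sorted(outlink_counts, max_out=64):
    """outlink_counts: dict host -> dict target_host -> number of hyperlinks.
    Keep, for each host, only the links to the max_out most frequent targets."""
    truncated = {}
    for host, counts in outlink_counts.items():
        top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:max_out]
        truncated[host] = [target for target, _ in top]
    return truncated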

Figure 7.22: Performance of training the GNN with different network architectures (legend: combinations of hidden and state layer sizes). Left: AUC on train; Right: AUC on test.

All combinations tried are listed in Table 7.15. For each combination, a set of experiments is conducted using different training configurations, such as network initialization, network architecture, BP algorithm, output function, and balancing techniques. The six best results for each set of experiments are selected to build an error bar illustration (using maximum, minimum and mean values) for the AUC and PRF performance measures. An overall comparison is then given in Figure 7.23. The main findings from this figure are as follows:

The GNNs which take the combination of link-based features, content-based features, PMGraphSOM mappings, and the topology as inputs produced the best overall results. This refers to input configurations 6 and 13 as shown in Table 7.15.

Balancing the data when training the MLP benefits the performance on both the train and the test set. This is observed via the results for input 6, which are consistently better than for input configuration 4.

It is beneficial to use the outputs from the MLP for the unlabelled hosts instead of using zero vectors. This is observed because the results for input 13 are slightly better than for input configuration 6.

The sorted truncation of the topology generally shows fewer negative impacts on the classification performance (see 9 vs. 16, 13 vs. 14, 3 vs. 19 and 1 vs. 18). Hence, truncation is better done using a ranking of the links rather than through a random selection process.

Table 7.15: List of all combinations of inputs for GNN training. Topology: A - random truncation; B - sorted truncation.
ID 1: Link+Content; topology B.
ID 2: Link+Content+MLP(Link+Content+PMGraphSOMres); topology B.
ID 3: Link+Content+PMGraphSOMres; topology B.
ID 4: MLP(Link+Content+PMGraphSOMres); topology B.
ID 5: PMGraphSOMres; topology B.
ID 6: MLP(Link+Content+PMGraphSOMres), using balanced labelled inputs; topology B.
ID 7: MLP(Link+Content); topology B.
ID 8: Link+Content+MLP(Link+Content); topology B.
ID 9: Content; topology B.
ID 10: MLP(Link)+MLP(Content); topology B.
ID 11: MLP(Link)+MLP(Content)+OUT+IN; topology B.
ID 12: MLP(Link)+MLP(Content)+OUT-mp+aoi-mp+aio-mp; topology B.
ID 13: MLP(Link+Content+PMGraphSOMres), including MLP outputs for hosts that are not given targets; topology B.
ID 14: MLP(Link+Content+PMGraphSOMres), including MLP outputs for hosts that are not given targets; topology A.
ID 15: Link+Content; topology A.
ID 16: Content; topology A.
ID 17: PMGraphSOMres; topology A.
ID 18: Link+Content; topology A.
ID 19: Link+Content+PMGraphSOMres; topology A.

Training on content-based features (input 15) produced better results on AUC than using link-based features (input 16) or PMGraphSOM mappings (input 17).

Ensemble random under-sampling classification strategy - ERUS

The experiments presented in the previous sections have shown that the training results can benefit from balancing the pattern classes. However, the balancing used there is a simple approach that creates duplicates of the smaller class so that the numbers of samples are balanced. In this section, we consider a better approach called the ensemble random under-sampling classification strategy (ERUS). ERUS is widely used for classification tasks on unbalanced datasets [35]. ERUS uses subsets of the majority class to train the classifier: it independently re-samples a number of subsets from the majority class and trains a set of classifiers accordingly. The results from the sub-classifiers are combined and averaged to give the final classification results. This strategy was proposed in [35] and was able to produce the best result on AUC in the competition. The procedure is described in the following. We denote the number of re-samplings by n, and k is the ratio

of the number of normal hosts to the number of spam hosts in the sampling.

Figure 7.23: Error bars of GNN training results with different inputs (AUC and PRF on train and test sets for the different node label configurations).

The approach is implemented in the following steps:

Step 1: Randomly select non-spam samples from the training set until the number of non-spam samples is k times the number of spam samples. The non-spam samples are all unique (no duplicates). This is run n times to obtain n training sets.

Step 2: Run a classifier for each of the n training sets and produce outputs for both the original training set and the test set.

Step 3: Sort the n classification results by AUC on the training set.

Step 4: Merge the top i results (1 <= i <= n). The results are sorted in descending order, so the top result produces the best AUC on the training set. For each host x in the dataset, compute the probability that it is a spam host: PS_i(x) = P_spam(x) / (P_spam(x) + P_normal(x)). For each pattern, sum up PS_i(x) over all i results: spamicity(x) = spamicity(x) + PS_i(x).

Step 5: For each pattern x, compute spamicity(x) = spamicity(x)/i. This gives the final result.
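The ERUS procedure can be summarized in a short sketch. This is an illustration only, not the thesis implementation; train_classifier and compute_auc are assumed helper functions passed in by the caller, and it is assumed that i <= n.

import random

def erus(spam, normal, all_hosts, train_classifier, compute_auc,
         n=10, k=3.5, top_i=5):
    """Ensemble random under-sampling following Steps 1-5 above.
    spam, normal: labelled training hosts; all_hosts: every host to score.
    train_classifier(pos, neg) -> dict host -> (p_spam, p_normal)  [assumed helper]
    compute_auc(scores, pos, neg) -> AUC on the training set       [assumed helper]"""
    runs = []
    for _ in range(n):
        subset = random.sample(normal, int(k * len(spam)))  # Step 1: unique non-spam subset
        scores = train_classifier(spam, subset)             # Step 2: train and score all hosts
        runs.append((compute_auc(scores, spam, normal), scores))
    runs.sort(key=lambda r: r[0], reverse=True)              # Step 3: rank by training AUC

    spamicity = {x: 0.0 for x in all_hosts}
    for _, scores in runs[:top_i]:                           # Step 4: merge the top i results
        for x in all_hosts:
            p_spam, p_normal = scores[x]
            spamicity[x] += p_spam / (p_spam + p_normal)     # PS_i(x)
    return {x: s / top_i for x, s in spamicity.items()}      # Step 5: average over the i results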

Figure 7.24: Results of MLP training for UK2006 using the ERUS strategy, k = 3.5. Left: single results sorted by AUC on train; Right: results of merging the top n best train results.

We anticipate that incorporating the ERUS approach into our system can help to further improve the classification results on the UK2006 dataset. The input for the MLP is the previously identified best combination of content-based features, link-based features and PMGraphSOM mappings. A set of experiments is carried out with different values for k, and the results are evaluated on AUC and PRF. The best result is obtained by using k = 3.5 and when merging the 2 top results in Step 4. The performance curves of this experiment are shown in Figure 7.24. The outputs obtained are then used to label the nodes in the hyperlink graph, and a GNN is then trained on this graph. The GNN trained on this dataset produced our best AUC result, which also outperforms all others in the competition.

7.5.2 Application to the UK2007 dataset

The learning system developed in the previous section is then applied to a later Web spam detection dataset. We attempt to replicate what was done for UK2006 by applying our methods to the newer UK2007 dataset. We note that the labelling of UK2007 differs very significantly from the UK2006 dataset. Hence, the two datasets pose very different challenges. The results of this exercise are shown later in this section. We observed that our best results cannot reach the best performance recorded in the competition. We therefore consider various mechanisms for improving the results.

Hybrid System

The PMGraphSOM is trained on link-based features and on the topology of the UK2007 domain graph. Table 7.16 lists all the results that we obtained when training the PMGraphSOMs with different map sizes. The C-POS results are generally low for both the

training and testing sets respectively. Training on both classes produced worse results than training two PMGraphSOMs on each class separately. Training on positive samples helps to detect more spam hosts, but produces a worse FPR. The generalization performance on the testing set is much worse than the performance on the training dataset. This indicates that the samples in the training set may not provide a sufficient coverage of the feature space describing spam hosts.

Table 7.16: Performance of the PMGraphSOM on the UK2007 train set (second row) and test set (third row). Train mode: 2, train on both; 1, train on spam; 0, train on normal. Columns: Map, Train mode, C, C-POS, C-NEG, R, P, FPR, F.

Next, we train MLPs with a variety of different configurations on the combined vectorial features. A typical result is shown in Table 7.17, using an MLP with 62 input sensors, 4 neurons in the first hidden layer, 2 neurons in the second hidden layer, and 1 output neuron, trained on a combination of link-based and content-based features.

Table 7.17: Performance of training the MLP with this architecture for UK2007. Columns: Epoch, Set, AUC, ACC, PRE, REC, PRF; results for train and test sets at several epochs.

The results show that, after training, the classification performance on the training set approaches a perfect result while the generalization

performance on the testing set remains comparatively poor. Further training cannot improve the performance on either the training or the testing set. Based on the performance on the train set, we take the outputs of a trained MLP as node labels for the hyperlink graph, on which we then train a GNN. The first representative result obtained from the GNN is AUC = 77.8, which is 6% worse than the best performance reported by others.

Balancing the dataset

Strategies for balancing the dataset were utilized in order to reduce the effects of the unbalanced distribution of spam and non-spam hosts in the train set. In this task, the training data was balanced in two ways:

1. Create duplicates of the minority class (spam hosts). There are 3998 train patterns, 3776 non-spam and 222 spam. In addition to using all 222 available spam hosts, we randomly select 3554 (= 3776 - 222) duplicates from the 222 spam hosts. Eventually, the training dataset includes the same quantity of spam hosts as normal hosts.

2. Select patterns from the PMGraphSOM mapping. Based on the mapping of a trained PMGraphSOM, for each activated neuron, randomly select one spam host and one normal host (if there are any) and combine these samples into a new train set. The new train set can be further balanced by using the first approach. This approach ensures that the selected subset evenly covers the feature space of the original training set.

The results obtained show that both of these balancing approaches help to improve the training performance, but not the generalization performance.

Different µ for the PMGraphSOM

The µ value controls the contributions of the label vector and the state vector when training the PMGraphSOM. We anticipate that different µ values may improve the quality of the mappings of the PMGraphSOM. The values of µ tried are listed in Table 7.18, and the corresponding performance curves are shown in Figure 7.25. Experimental results using alternative PMGraphSOM configurations are shown in Appendix B. We found that, for µ larger than 1.0e-6, the classification performance on both training and test sets remains stable. The PMGraphSOM software suggests a µ value that balances

186 7.5. Web Spam Detection 152 Table 7.18: Different µ values for training PMGraphSOM on UK27. ID µ ID µ ID µ e e e e e e e e the influence of state information with the node label. The computation is based on the dimension and value range of the node label and dimension and (estimation) value range of the state vector. However, the estimation of the state vector values assumes a random mapping of the nodes. In practice, the nodes are not mapped at random, and hence, the proposedµvalues may be incorrect. Normalization It is well known that MLP training generally works better when the inputs and targets are normalized. The large inputs can send dominating signal to the training system. Three normalization methods are considered here: Globally linear normalization Element-wise linear normalization Non-linear normalization The most common normalization technique is linear normalization. This is done by searching for the global maximum value over the range of the feature vectors and using it as denominator to modify other values. However, some of the features such as out-degree and in-degree of link-based feature could have extreme and unbalanced values over the whole dataset. Different features also can range differently. Hence in this case, globally linear normalization will have side effects. Given the larger values are minority in the feature vector, most normalized values will approach to zero. Based on this consideration, we tried also element-wise linear normalization. The weakness of linear normalization is lack of consideration on the real distribution of the values. Non-linear normalization is able to capture the distribution by using a particular function. The logarithm form of the

187 7.5. Web Spam Detection Perf Perf ID of Mu1 (ascending order) 1 C on Train CPOS on Train CNEG on Train P on Train FPR on Train F on Train ID of Mu1 (ascending order) 1 C on Test CPOS on Test CNEG on Test P on Test FPR on Test F on Test Perf Perf ID of Mu1 (ascending order) C on Train CPOS on Train CNEG on Train P on Train FPR on Train F on Train ID of Mu1 (ascending order) C on Test CPOS on Test CNEG on Test P on Test FPR on Test F on Test Figure 7.25: Train PMGraphSOM on UK27 with: map=12x1, group=1, maxout=16. Top: Train on normal hosts only; Bottom: Train on spam hosts only. Left: Train Performance; Right: Test Performance link-based feature are provided and then used for training. We attempted to use all these approaches. However, we observed that none of them can improve on the results. Feature Analysis According to the assessment guidelines provided with the WEBSPAM-UK27 dataset which describe the assessment criteria for marking a host as spam or normal, it became evident that not all of the available features may be effective in contributing to the classification task. A study is conducted to identify which features are useful for distinguishing the pattern node classes. The analysis is performed on content-based and link-based features respectively. The values were element-wise linearly normalized before analysis. The statistics is computed as follows, and the result is plotted in Figure For each feature, compute the maximum, minimum and mean value for spam hosts and for the normal hosts respectively. 2. For each feature, compute the maximum, minimum and mean values of spam hosts

for the training set and the test set respectively.

3. For each feature, plot the distribution of the values for spam hosts and for normal hosts respectively.

Figure 7.26: Feature analysis for UK2007. X-axis: ID of the feature; Y-axis: maximum or average of the feature values. Left: content-based features; Right: link-based features.

Figure 7.26 shows a comparison of the maximum values of spam and normal hosts, and of the average values of four groups of hosts: train spam, train normal, test spam and test normal. The plots of the average values show that spam and normal hosts share similar values on most of the features. Some differences between spam and normal hosts can be observed in the maximum value plot; however, in comparison with the average value plot, such differences could be caused by a few outliers. It can also be seen that the values are distributed unevenly among the different link-based features. The features whose average values have a large magnitude show little difference between Train Spam and Test Normal (compare the plus and square markers). This could be the problem that is holding us back from producing a good generalization performance.
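For illustration, the per-feature statistics used in this analysis can be computed along the following lines. This is a minimal numpy sketch; the array layout, function name and the final comparison of class means are assumptions for illustration only, not the code used in the thesis experiments.

import numpy as np

def per_class_feature_stats(features, labels):
    """Compute max, min and mean of each feature for spam and normal hosts.

    features: (n_hosts, n_features) array of element-wise normalized values.
    labels:   (n_hosts,) array with 1 for spam and 0 for normal.
    Returns a dict keyed by class name holding per-feature statistics.
    """
    stats = {}
    for name, mask in (("spam", labels == 1), ("normal", labels == 0)):
        cls = features[mask]
        stats[name] = {
            "max": cls.max(axis=0),
            "min": cls.min(axis=0),
            "mean": cls.mean(axis=0),
        }
    return stats

# Hypothetical usage: flag features whose class means barely differ,
# i.e. candidates that contribute little to separating the two classes.
# stats = per_class_feature_stats(features, labels)
# weak = np.where(np.abs(stats["spam"]["mean"] - stats["normal"]["mean"]) < 1e-3)[0]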

Figure 7.27: Percentage of spam and normal hosts from different second-level domains for UK2007. The horizontal axis denotes the second-level domain and the vertical axis shows the percentage of spam and of normal hosts belonging to that domain. All second-level domains end with .uk, which is omitted here.

Domain name analysis

All the hosts in this dataset are from within the UK top-level domain. The given vectorial feature set does not contain any information about the domain names. In order to investigate whether some particular second-level domains were more likely to contain spam than others, we analysed the percentage of spam and normal hosts under each available domain. For each domain, two quantities are computed: the percentage of spam and the percentage of normal hosts. The results are plotted in Figure 7.27. The figure does not give any obvious evidence that a particular domain contains more spam than others. We validated this assessment by concatenating these two quantities to the original feature vector before training the MLP and the GNN. No improvement is observed in any of the experiments conducted.

Noisy pattern

We investigated whether a local minima problem could cause the observed performances. This is addressed by introducing some random noisy training patterns, which was done by slightly modifying the value of existing features by a percentage. This was implemented in the following steps:

Step 1: Select a single element in the feature vector for injecting noise. Here, the elements with the highest mean values are selected. For the link-based feature

190 7.5. Web Spam Detection 156 vector we identified the 124th feature, and for the content-based feature vector we identified the 47th feature respectively. Step 2: For randomly selected train patterns, modify the element selected in previous step by a percentage (we tried a range from.5% to 5%). Step 3: Train on the noisy train set, and test on the original train set and test set. The results are illustrated in Figure 7.28, where the left plot is for modifying contentbased feature and right plot is for modifying link-based feature. Both plots show that generalization performance did not improve as a result of this exercise. Hence, indications are that local minima may not be the cause of the poor generalization performance Perf Perf Noise factor AUC on Train PRF on Train AUC on Test PRF on Test Noise factor AUC on Train PRF on Train AUC on Test PRF on Test Figure 7.28: Results of training MLP for UK27 by using noisy patterns. Left: Noise added to 47th content-based feature. Right: Noise added to the 124th link-based feature. ERUS We applied the ERUS approach as described in Section on UK27. A set of MLPs are trained on different train samples represented as vectors containing normalized linkbased features and content-based features. The output produced is one dimensional. Each of the MLP works as a classifier. Here, the targets are set to be either for normal or 1 for spam, so we assume the output from MLP is the prediction on the probability of a host to be spam. Another assumption is that the probability values for spam and normal nodes are summed to 1, so that PS i (x) in Step 4 is simply equal to the output of MLP. The first set of experiments were carried out by using k = (1,2,2.5,3,3.5,4,4.5,5,6) and n = 1. The performance is first evaluated on the original complete training set,

191 7.5. Web Spam Detection Perf Perf Different resampling AUC on Train PRF on Train AUC on Test PRF on Test Different resampling AUC on Train PRF on Train AUC on Test PRF on Test Figure 7.29: The results of training MLP for UK27 by using ERUS strategy. k= 4. Left: Single results sorted by AUC on train. Right: Average results on top n best train results. and the best threshold that maximize the PRF on this training set is computed. Such a threshold is then used to evaluate the performance on the test set. However, for the evaluation on the top n results, we assume the threshold is.5. Figure 7.29 shows the performance evaluated on single results and top n results with k = 4 which produced best results according to AUC. The plots for experiments using other values for k are attached in Appendix C. The left plots show the curves of performance evaluated on single result returned by each sub classifier and sorted by AUC on train in ascending order. It can be observed that by using larger values for k, the classifiers produce more consistent results for different subsets of the training samples. If a smaller sampling size is used then the results returned by the classifiers could vary significantly. For example this can be seen from the red-plus curves (AUC on the train) which become more and more uneven when k is decreasing. The plots on the right show the performance curves for evaluating average results from the top n classifiers. Note that the results from the classifiers have been sorted by AUC on the training set. Thus the decline of the AUC on train is expected. In contrast, we found the most stable results are on PRF for the training set (the green-cross curve). Such outputs are then taken to label the nodes in the hyperlink graph. Then, the GNN is trained on this hyperlink graph. This produced a results of AUC=.812. In order to better simulate the probability prediction, a two dimensional binary output layer could be used for the MLP. The target vector is then(,1) for normal and(1,) for the spam cases. In this case, MLPs are allowed to produce a probability for each class. The results from a single classifier are evaluated in this manner. The experiments are

192 7.6. Conclusion 158 executed using k = 3.5,4 and n = 1. However, no improvement on the performance is observed. This indicates that the size of the MLP (number of network parameters within the MLP) is sufficient for this learning task. Extended evaluation on ERUS results The experimental results from previous section were re-evaluated. As mentioned earlier, for each k in use, we have 1 results returned from the 1 models trained on different subsets of the original training set. For each set of experiments, and for each k, the new evaluation is as follow: Step 1: For each of the 1 experiment result, evaluate the performance on the training samples involved in the training procedure (the subset only), and search for the best threshold that maximize the PRF. Step 2: Adjust all outputs to be centered around.5. Step 3: Use.5 as threshold to evaluate the test set. Step 4: Sort the results by AUC on train. Step 5: For1 <= n <= 1, merge the top n results by searching for the median value among all results returned by topnmodels. Step 6: Use.5 as threshold to evaluate merged results. The best results are again obtained by using k= 4. This is shown in 7.3 (the plots for other values of k are located in Appendix C). The GNN is then trained as before producing a final results of AUC=.82. Further attempts using different initializations for the GNN did not improve the results any further. 7.6 Conclusion In this chapter, we investigated supervised machine learning methods and a probability model on classification problems that require the encoding of structured data. The novel machine learning approach GNN 2 and the probability model GHMM have been applied

193 7.6. Conclusion Perf Perf Different resampling AUC on Train PRF on Train AUC on Test PRF on Test Different resampling AUC on Train PRF on Train AUC on Test PRF on Test Figure 7.3: The results of training MLP for UK27 by using ERUS strategy with extended evaluation method. k= 4. Left: Single results sorted by AUC on train. Right: Results on top n best train results. to learning and classification over a special case of GoG: sequences of graphs. The architecture of the GHMM relies on an underlying HMM structure, and is capable of dealing with long-term dependencies in sequential data of arbitrary length. Emission PDFs over the GoGs are estimated by means of a combination of recursive encoding nets and constrained RBF-like nets. Results returned by the GHMM confirm that the architecture and the algorithms are effective, both in terms of learning and generalization capabilities. In comparison, the GNN 2 is the first machine learning model that is able to deal with such learning problems directly and as a unity. The GNN 2 is effective in encoding embedded structural information, and hence, is able to suitably discriminate GoGs as desired. It is also shown that long term dependency may not be a limiting issue for most practical learning problems by using GNN 2. We then considered a large scale application of GNN 2 on a challenging XML document classification task. The results are very encouraging when compared to attempts made by others on the same classification problem using very different approaches. Moreover, the PMGraphSOM, MLP and GNN 2 are integrated to a hybrid system for solving a large scale Web spam detection task and produced very promising results. The significance of the work conducted in this chapter is that proposed machine learning methods are evaluated on practical learning problems, and hence, an elaborate preparation is made for the core research task of this thesis on ranking structured documents by impact. This will be addressed in Chapter 8.

194 Chapter 8 Impact Sensitive Ranking 8.1 Introduction This Chapter investigates the possibility of addressing impact sensitive ranking tasks by using the proposed machine learning methods. This will be done through the application of the GNN 2 to a set of scientific documents. We have access to the entire collection of scientific documents hosted on CiteSeer in 25. The collection was made available to us in the form of a set of PDF documents 1. The documents will be represented as temporal and spacial graphs, and the aim is to model ranking algorithms such as PageRank [72], ImpactRate [34], and ERA ranking [7]. The contents of the documents are parsed and arranged so that CLGs can be built on the texts extracted. The inter-structure of documents are represented by a citation graph according to the references structure of the given set of documents. The citation graph consists of nodes that represent the documents in the domain, and edges that represent the citation links between the documents. It is well known that a document can attract an increasing number of citations after it was published. Thus, the citation graph grows dynamically over time. This allows us to build a time series of citation graphs, where each citation graph describes the references relationship among documents at a particular time. Then this citation graph is connected to another citation graph at the next time instance. The node in the citation graphs represent documents which can be further described by CLGs built on the document contents. This is called a temporal and spacial graph (a sequence of graph of graphs). Some pre-processing on the dataset is required in order to extract the text from the PDF documents. This procedure is demonstrated in Section 8.3. To the best of 1 We wish to acknowledge Dr. Lee Giles who has kindly provided us with the CiteSeer dataset. The CiteSeer system (discontinued) is a brainchild of Dr. Lee Giles. 16

195 8.2. The CiteSeer dataset 161 our knowledge, the GNN 2 is the only machine learning method capable to encode such hybrid structures. By using targets computed based on existing ranking schemes as a supervising signal, it is expected that the GNN 2 can infer the underlying ranking function based on the observable development of the citation graph over time. A successfully trained system is anticipated to discover the correlation between document contents and the ranking results. This especially benefits the ranking of new documents which may be semi-isolated (i.e. new documents which have not yet any citations from other documents) in the domain. A number of experiments (see Section 8.5) have been conducted and the results indicate that GNN 2 is able to encode such semi-supervised sequences of graph of graphs and produce prediction on the ranking of documents. However, the prediction performance is not as good as expected. We then attempted to improve the results by using some mechanisms such as those proposed in Section 8.5.3, Section and Section It is found that this still does not help to reach a desired level of quality for the mappings. We then analysed the situation and discovered that the possible causes are: 1. the noise in the dataset, and 2. insufficient training iterations and insufficient network parameters. We furthermore found that some of the residual error is a desired property. In order to verify the correctness of our approach, we then apply the GNN 2 on a simpler analog to the CiteSeer dataset. This is addressed in Section The results show that the GNN 2 is not only capable to deal with the temporal and spatial graphs, but also to produce perfect predictions by generalizing over unseen structures. This indicates that the GNN 2 is effective for such type of learning problems if the training data contains correct and useful features. An observation is that in real world learning problems there is usually some degree of noise. Hence, we conducted a study to investigating the impact of noise on the ability of the GNN 2 to model a given learning task. This is addressed in Section 8.5.8, and the results show that the GNN 2 can be sensitive to noise in the dataset. This explains one of the reasons that caused the relative poor performance of the GNN 2 on the CiteSeer document collection. 8.2 The CiteSeer dataset The dataset is a snapshot of the complete CiteSeer document database as of 25 containing a relatively large number of scientific documents. There are a total of 385, 951 valid

documents in the dataset (documents that were incompletely downloaded or that contain internal formatting errors cannot be accessed and are not counted as valid). Only documents published before the year 2005 are captured by this dataset. The documents are in the Portable Document Format (PDF), and the majority of the PDF files were generated from scanned images. This requires some preprocessing so as to convert each PDF file to plain text, which then allows us to more easily extract document features and the domain structure, and to build a dataset for training. In the following sections, we describe the work carried out for obtaining the temporal and spacial graph representation.

8.3 Data Preparation

Most of the PDF documents contained scanned versions of scientific documents. Thus, most PDF documents contained binary information (bitmap images) rather than markup text. It is necessary to first extract the textual information, and then to analyse the reference section of each document for links to other documents in the collection. The following steps are taken for this purpose:

Step 1: Convert each PDF file to a bitmap image. The command line tool ImageMagick [88] is used for this task. The result is a collection of Tagged Image File Format (TIFF) images, where one TIFF image is generated per page of the original PDF file. The main reason for this step is that the OCR software used in Step 2 can only deal with bitmap images.

Step 2: Use optical character recognition (OCR) software to extract text from the images. The open source OCR software Tesseract [85] is used since it is freely available and was found to produce the highest text recognition rate amongst the free OCR software tools. Tesseract returns one file containing all textual content for each of the documents.

Step 3: Parse the text to XML and categorize the contents. Parscit [21] is used for this task. The software categorizes the content into title, authors, citations, etc. for any well-formatted, high quality scientific document.

Step 4: Parse the XML documents and create a database. There are two sub-steps: 1.

store all available documents in a database (title, author, year, etc.); 2. parse the citations of each document and record the citation links.

Step 5: Build temporal and spatial graphs for each document in the database. This includes the generation of time sequences, citation graphs and CLGs.

It turned out that the main challenges with these steps were the extraction of plain text from the PDF sources and the building of the citation graph. The five steps are explained in detail as follows:

Document Contents Extraction

The first step is to extract the textual contents from the CiteSeer documents. The documents are in PDF format, and the majority of the documents are scanned files (images). The following steps are taken for the extraction of text from the images:

Obtain the number of pages in the PDF file. We used the Linux command line tool pdfinfo for this purpose.

For each page:

Convert the page to a TIFF image. For this task, we use ImageMagick and set the option -density to a reasonably large value of 300 dpi to obtain high quality results. This option controls the dots per inch (DPI) of the image; the higher the DPI of an image, the better its quality. However, the fonts on the title page are usually larger than the body text, and given that the scanned version of a document is of comparatively low quality, a higher DPI may expose disconnected character symbols to the OCR software. Thus, for those low quality documents, a smaller density is used so that incomplete information (missing or added pixels) is smoothed out. This strategy helped to improve the text recognition rate of the OCR software.

Use Tesseract to extract the text from each of the TIFF images.

Concatenate the text of all pages and re-construct the document in one text file.

Some problems are unavoidable given that the quality of the documents varies from document to document, and even among pages within a single document. In such cases, we have to ignore those non-processable documents and pages.
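As an illustration, the per-document conversion pipeline described above can be sketched as follows. This assumes that pdfinfo, ImageMagick's convert and tesseract are available on the system path; the file naming, the fixed 300 dpi setting and the error handling are simplifications and do not reproduce the actual scripts used for the thesis.

import subprocess
from pathlib import Path

def pdf_page_count(pdf: Path) -> int:
    """Read the number of pages from the output of `pdfinfo`."""
    out = subprocess.run(["pdfinfo", str(pdf)], capture_output=True, text=True).stdout
    for line in out.splitlines():
        if line.startswith("Pages:"):
            return int(line.split()[-1])
    raise ValueError(f"could not determine page count for {pdf}")

def extract_text(pdf: Path, workdir: Path, density: int = 300) -> Path:
    """Convert each page to TIFF, OCR it with Tesseract, concatenate the text."""
    workdir.mkdir(parents=True, exist_ok=True)
    pages = []
    for page in range(pdf_page_count(pdf)):
        tiff = workdir / f"page_{page:04d}.tif"
        # ImageMagick: render one page of the PDF at the requested DPI.
        subprocess.run(["convert", "-density", str(density),
                        f"{pdf}[{page}]", str(tiff)], check=True)
        # Tesseract writes <base>.txt next to the given output base name.
        base = workdir / f"page_{page:04d}"
        subprocess.run(["tesseract", str(tiff), str(base)], check=True)
        pages.append(base.with_suffix(".txt").read_text(errors="ignore"))
    txt = workdir / (pdf.stem + ".txt")
    txt.write_text("\n".join(pages))
    return txt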

The conversion from PDF to TIFF failed on a number of occasions. Most of these unsuccessful attempts are due to file truncation or internal file format errors. The errors discovered when processing the files are:

PDF file is damaged or truncated. (3,888 files)
Postscript delegate failed. (35,483 pages, and about 2,5 files are affected)
File has an invalid xref entry. (327 files)
File has insufficient data for an image (a blank file). (53 files)

Some inaccuracies can also be observed in the text returned by Tesseract. Table 8.1 shows samples of errors made by the OCR which occurred quite frequently. These errors were mostly due to the low scanning quality of some of the files, so that Tesseract is not able to produce reliable results given the incomplete pixel information of a character. Another reason could be that the internal module of Tesseract is not sufficiently trained for this data domain, e.g. some fonts are not covered by Tesseract. In order to reduce the effects of such problems, we tried to detect errors by scanning for patterns such as those shown in the right column of Table 8.1. This helped to improve the OCR results. However, we cannot guarantee that we captured all cases of incorrect character recognition.

Table 8.1: Some inaccurate results from Tesseract.
Actual character | Recognized by OCR as
M   | l\/l, or l\/i, or i\\i
W   | vv, or v/\i
im  | irn
off | oh

Once the textual contents were extracted, the next task is to parse the contents into XML so that the document content is categorized. This allows an easier extraction of document features and citations. Parscit is a powerful tool for this task, and its main function is to parse a text string in the bibliography or reference section of a work [21]. In addition, it can also be used to identify a hierarchy of logical components, for example, titles, authors, affiliations, abstracts, sections, etc. [21] of a scientific document. This perfectly fits our need for identifying header information (title, author, abstract, etc.), citations and body contents respectively from the available texts of the CiteSeer documents.
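As a sketch, the pattern-based cleanup mentioned above can be implemented as a simple substitution pass. The mapping below only reflects the few confusions listed in Table 8.1 and is a fragment of what would be needed in practice; applying the substitutions blindly, without the context checks used in the actual cleanup, is a simplification.

# Frequent OCR confusions (right column of Table 8.1 -> intended text).
# Illustrative only; the full pattern set used for cleanup was larger.
OCR_FIXES = [
    ("l\\/l", "M"), ("l\\/i", "M"), ("i\\\\i", "M"),
    ("vv", "W"), ("v/\\i", "W"),
    ("irn", "im"),
]

def clean_ocr_text(text: str) -> str:
    """Replace known OCR confusion patterns with the intended characters."""
    for wrong, right in OCR_FIXES:
        text = text.replace(wrong, right)
    return text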

199 8.3. Data Preparation 165 However, in our attempt, Parscit rejects to process some files due to formatting errors of the file. A typical case is when a document contains non-standard referencing strings. In such cases, Parscit will return an error message and terminate the parsing. Thus, an XML tree representation is only obtained for a portion of the dataset. The statistic of successful Parscit tasks is shown in Table 8.2. It can be seen that the parsing of header information is successful for nearly the entire document collection, while there were 25% of documents from which no citation information could be extracted. The lower successful rate on citation extraction is mostly due to the internal limitation of Parscit mentioned earlier or the indirect impact from the problems experienced in previous steps, e.g. Parscit may not be able to successfully process an incomplete result from Tesseract. The body of the documents usually contains most of the text so that there is a higher risk for this to be rejected by Parscit due to various formatting problems within some documents. Table 8.2: Statistic results of parsing text to XML. The percentage is computed out of 385, 951 documents. Successful Success rate Header 382, % Body 245, % Citations 291, % In order to build temporal sequences and citation graphs, the time of when each citation was created should be known. References are created when a paper is published. We are interested in the year of the publication date of a document. The publication date is not normally provided with the document itself. The publication year of a document can be extracted in a number of different ways: References within the reference section of a document normally provide information on the publication year of a cited document. Thus, during the parsing of references (by Parscit), if the referenced paper matches a paper in the database, then the information about this paper s publication year is added to the database. If the previous approach failed then scan the first page of a document for indications on the publication year. Some documents feature the publication year as part of a title page, or give the publication year as a footnote on the first page. If this information is available, then this can be identified by Parscit.

200 8.3. Data Preparation 166 If the previous two approaches failed for a document, then estimate the publication year based on the information of the references that a document has in its reference section. We analyse all references within the reference section of the document then find the reference to the most recent paper. We then assume that the paper was published one year after the year of the most recent cited paper. The approach ensures that a publication year will become known for any document with a well-formatted reference list in the collection. The extraction results show that the publication year is available for 26, 855 documents, and that the publication years range from 19 to 25. About 1% of documents cannot be associated with a year due to errors occurred when processing the reference list of the documents Document Target Computing The CiteSeer dataset did not contain any ground truth information. In order to demonstrate the abilities of the GNN 2 to learn ranking algorithms based on the structure of a domain as well as based on the contents of the documents, a set of targets indicating the ranking scores for the document at different time stamps is computed. Several ranking schemes are considered. The ranking schemes considered are PageRank [72], ImpactRate [34], and Excellence in Research for Australia (ERA) ranking [7]. The underlying algorithms were provided in Chapter 2. PageRank is a ranking algorithm which solely relies on the link structure of a given domain. In contrast, ImpactRate takes time evolution of citations into account. PageRank is usually applied to rank Web pages rather than scientific documents, while ImpactRate is especially designed for the scientific domain. In addition to the number of citations, ImpactRate also captures the impact of a paper by investigating how the number of citations received increases over a certain time duration. Neither PageRank nor ImpactRate considers the document contents as criteria for computing the ranking scores. This gap is filled by the ERA ranking scheme. The ERA rank is largely based on the quality and impact of papers within any one venue. Collectively these three ranking methods cover the range of features which the GNN 2 need to be able to encode. While one may argue that document content could be of no use when considering PageRank and ImpactRank as target values. However, it is important to notice that there exist indirect relationships between the contents and the citation

201 8.3. Data Preparation 167 structure. A citation indicates a topical relationship between two documents and normally does not exist between completely unrelated documents. The common situation for the creation of a citation is that an author has read other papers in the past, and hence, is aware of the content of some other papers. The author may subsequently decide that a reference to those papers provides a suitable support for an own paper. Moreover, it is common that an author is inspired by previously published work, and hence, may publish an own paper that is based on previous work. As a consequence, a citation link is created from the latter paper to the former one. This indicates that the document contents can contribute to the development of the citation link structure. The PageRank algorithm solely relies on the correctness of a link structure. However, it is possible that some citations are inappropriate. It is not unusual that papers published at lowly rated venues exhibit a lack in care of suitable referencing. Thus, PageRank will be influenced by such noise. However, by including information about document content when modelling the PageRank using GNN 2, an interesting question to ask is whether GNN 2 is able to identify supportable citations or whether it is able to identify unsupported citations. In other words, it would be expected that the GNN 2 would deviate from a given PageRank target value if a document is cited by an unsupported reference (a reference from an unrelated or weakly related document). Unsupported links can be considered a type of noise in the dataset, and hence, it can be anticipated that the GNN 2 may be able to detect unsupported citations. If the GNN 2 exhibits such ability, then this means that we expect that the output of the GNN 2 deviates from a given target value for some documents. Hence, a certain level of error would be expected. We will find later in this chapter that the GNN 2 is indeed able to identify weakly related citations and unrelated citations. The ERA ranking scheme assesses the venues by using a combination of indicators. ERA is created by a manual assessment procedure carried out by experienced and internationally-recognised experts. In this case, the document contents form a direct part in the determination of the rank of a venue. The ERA ranking is used as a main performance indicator by the Australian government. The introduce of automatic means to infer the underlying ERA ranking mechanism can be seen as a significant contribution by this thesis. ImpactRate: The traditional ImpactRate is computed for different venues (journals, conferences, books, etc.). The algorithm has been described in Section 2.3. We modified

the algorithm slightly since a rank is required for each individual document. This is done as follows: for a given document $i$ and a given year $y$, let $c_1$ denote the number of in-links document $i$ received during the last five years up to year $y$, and let $c_2$ denote the total number of in-links document $i$ received until year $y$. Then the impact rate for document $i$ in year $y$ is computed by Eq. 8.1:

$$\mathrm{ImpactRate}_i = \frac{c_1}{c_2} \qquad (8.1)$$

As a result, each document is given an impact rating at single-year intervals. As an example, Figure 8.1 (left) shows the ImpactRate for some documents in the dataset. It can be seen that most of the curves follow a similar trend, starting from low values at the time of publishing. The impact rate reaches its peak after a couple of years, and then gradually decreases with time. This matches the fact that a paper is not cited by any other papers when it has just been published; it then becomes well known over time so that a great number of citations from newer papers are attracted; finally, the novelty of the paper wears off, and thus authors may not cite this paper as frequently.

Figure 8.1: The ranking scores over time for some documents in the CiteSeer dataset using ImpactRate (left) and PageRank (right).

PageRank: Similarly, we build a PageRank matrix for document vs. year of publication by using the PageRank formula described in Chapter 2. With PageRank, the rank of a document i depends on the rank of all documents u linking to i, and the rank of each document u in turn depends on the documents pointing to them. Thus, the rank of a document is accumulative: the rank increases with the addition of more citation links. Figure 8.1 (right) shows the time evolution of PageRank for a set of 7 documents. It can be seen that all curves exhibit an increasing trend from the year of publication, and that the increase stalls when no more references to these documents are created.
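For illustration, Eq. 8.1 can be computed per document as sketched below. The sketch assumes that the publication years of all citing documents are known, and the exact inclusive/exclusive boundaries of the five-year window are an assumption.

def impact_rate(citation_years, year, window=5):
    """Eq. 8.1: in-links received in the last `window` years up to `year` (c1)
    divided by all in-links received up to `year` (c2).

    citation_years: iterable of publication years of the documents that cite
                    the document under consideration.
    """
    c2 = sum(1 for y in citation_years if y <= year)
    if c2 == 0:
        return 0.0  # default ImpactRate for documents without in-links
    c1 = sum(1 for y in citation_years if year - window < y <= year)
    return c1 / c2

# Example: a paper cited in 1999, 2001, 2001 and 2004 has
# impact_rate([1999, 2001, 2001, 2004], 2005) == 3/4
# since three of its four citations fall within 2001-2005.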

203 8.4. Temporal and Spacial Graphs 169 ERA rank The ERA rank for journals and conferences is provided by ARC and the ranking results are publicly accessible online [7]. ERA evaluates the journals and conferences according to a range of indicators and expert reviews. With the participation of human assessors, some noise is inevitably introduced to the ranking results. This differs from PageRank or ImpactRate which strictly follow given formulas. The ERA-21 was the first full evaluation of the ERA initiative. Given the CiteSeer dataset in this research is a snapshot for 25, there may exist some inconsistencies between the ranking results of 21. However, the experiments carried out on ERA rank aim to provide indicative results by assume that ranking of most venues remain unchanged during these five years. ERA21 ranks a total of2,715 unique journals and1,953 unique conferences. These venues are ranked with one of four ranking levels: A* (the highest rank), A, B, and C (the lowest rank). There were 279 journals and 5 conferences which were listed but not ranked. These will be ignored when training a GNN 2 on this target set. We therefore associate the venues at which the CiteSeer were published with the ERA rank. These will serve as targets for the GNN 2 training algorithm. 8.4 Temporal and Spacial Graphs The documents in the dataset are represented as temporal and spacial graphs representing for each document a time sequence of citation graphs whose nodes are labelled by CLGs. Based on the assumption that the header information (the abstract and title) of a document should contain the most essential concepts, and hence, the CLGs will be build based on the header information (by following the procedure described in Section 3.3.2). For each document, we build a set of citation graphs, one for each year since the year of publication. This results in a time sequence of citation graphs. Each node in the time sequence except the last one can be assigned a target which is either the PageRank, ImpactRate, or ERA rank of the document at the corresponding year. The unlabelled node is used for testing purpose. The citation graph develops over time, and for each document. Starting from the year of publication, we build a citation graph by recursively considering all references to the documents at a certain year. In this way, it allows us to obtain a citation graph for a specific document that contains all direct and indirect references.
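For illustration, the recursive collection of direct and indirect citers, and the resulting yearly sequence, can be sketched as follows. The dictionaries cited_by and pub_year are assumed to have been extracted from the database described in Section 8.3; only the node sets of the citation graphs are built here, since the edges follow from restricting cited_by to those nodes.

def citation_graph(doc, cited_by, pub_year, year):
    """Collect all documents citing `doc` directly or indirectly up to `year`.

    cited_by: dict mapping a document id to the ids of documents citing it.
    pub_year: dict mapping a document id to its publication year.
    Returns the set of nodes of the citation graph rooted at `doc`.
    """
    nodes, frontier = {doc}, [doc]
    while frontier:
        current = frontier.pop()
        for citer in cited_by.get(current, ()):
            # Only include citers already published by the sampling year.
            if pub_year.get(citer, year + 1) <= year and citer not in nodes:
                nodes.add(citer)
                frontier.append(citer)
    return nodes

def temporal_sequence(doc, cited_by, pub_year, last_year=2005, interval=1):
    """One citation graph per sampling year, from publication to `last_year`."""
    return {y: citation_graph(doc, cited_by, pub_year, y)
            for y in range(pub_year[doc], last_year + 1, interval)}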

204 8.4. Temporal and Spacial Graphs Level2:TemporalSequence A A A A C B Level1:C itationgraphs C B C B D c1 c2 c7 c1 c9 c4 c5 c1 c8 c3 c4 c2 Level:ConceptLinkgraphs c2 c7 c4 c3 c6 c5 c6 c9 Figure 8.2: An example of GoGs for a document in the CiteSeer dataset. In each citation graph, a node represents a document in the dataset and it is labelled by the corresponding CLG representation of that document. For each document, we build a time series of citation graph from the year it was published to the latest year available in the dataset (25). A time sequences of citation graph is built for each document whose publication year is known. The creation of such temporal and spacial graphs can be visualized as in the following example: Assume the CLGs have been created for all the documents in the dataset and that PageRank has been computed. Given a document A published in 22, we first build a time sequence for it which contains 4 nodes (there are 4 years from 22 to 25). Each node in the sequence is described by a citation graph for the corresponding year. The first node in such a sequence is labelled by a citation graph contains only one node which represents paper A, since in 22 paper A was just introduced and not cited by other documents yet. This node is assigned a target with.3 since this is the default PageRank value for nodes without in-links. Assume that paper A is cited by paper B and paper C in 23. Then the second node in the sequence will be labelled by a citation graph containing three nodes representing papers A, B and C respectively. In this citation graph, A receives two incoming links from B and C. Such a node will be attached with a target equal to.72 as computed by PageRank. The citation graphs for the next node in the sequence is extended by adding documents which added references to any one of the previous documents (the documents A, B, and C). If in 24, there was no new paper citing either A, B or C, then the citation graph for 24 will remain unchanged to the one

205 8.5. Experiments 171 of 23. Assume that in 25, a paper D adds a reference to paper B. Then the citation graph for 25 will be built by adding the node and a link from D to B. In this case, the PageRank of B becomes.51, and the PageRank for node A becomes.867. The rank of all other nodes remain unchanged at the default value.3. The procedure is visualized in Figure 8.2. Since the last node in the sequence will be used for testing purposes, and hence it is not assigned the target during training. To complete the CLG representation of the document, we now associate the CLGs with all the nodes in the citation graphs. Each CLG contains a set of connected concepts extracted from the content of associated documents. This is illustrated in Figure 8.2. The length of sequences can be controlled by setting a time gap, the example temporal and spacial graph shown in Figure 8.2 is a full-length sequence using a time interval of 1 year. By increasing the time interval, a shorter sequences can be created for other experimental needs. The result is one sequence of citation graphs for each document in the dataset. We discarded sequences of length one. Such sequences refer to documents which do not attract any citation, and hence, not particularly interesting cases for our purposes. This left us with17,267 sequences corresponding to17,267 unique documents. 8.5 Experiments Dataset For the first set of experiments, we used the ImpactRate or the PageRank as target values for the GNN 2 to train. ImpactRate as target: For each document in the dataset, we compute the ImpactRate starting from the year a document was published to year 25. There are a total of 11, 37, 49 entries of ImpactRate computed for all the documents in the dataset, each entry is the ImpactRate for a particular pair of documentiand yeary. 185,836 entries have non-zero values and the number of unique values is2,118. The default value for ImpactRate is zero and the maximum value is 1.. The average value is indicating that most values are closed to zero. PageRank as target: For each document in the dataset, we compute the PageRank starting from the year of publication to year 25. A total of 11,37,49 entries were computed, where 243, 886 out of them have non-default values and the number of

206 8.5. Experiments 172 unique values is just 27. The number of unique PageRank values is surprisingly small but they have been verified to be correct. When compared with ImpactRate, the variation on PageRank values is considerably smaller. This is mainly due to: (1) The link structure within cited scientific documents is considerably less dense when compared to the World Wide Web (PageRank has been designed to work with Web pages). The connections among Web pages may contain cycles and self-connections, while the link structure between scientific documents resembles a directed tree. (2) In the case that the link structure of a document is not changed over a duration, the PageRank will remain unchanged, while the ImpactRate could produce changing values due to the numerator in the formula which changes while the time window is shifting. The maximum PageRank value of any document in the dataset is , the average PageRank value is The large value range indicates that the PageRank target values are extremely unbalanced. It is expected that this will cause some problems during training unless some balancing mechanism is used. The temporal and spacial graphs in the training set are described as follows: Level : CLGs. There are a total of 17,267 CLGs. The nodes in the CLG do not contain a node label (feature) vector. Each edge in the CLG is associated with a weight value (will be referred as edgelabel in later sections). The maximum outdegree and the maximum in-degree are equally 3. The maximum number of nodes of any CLG in the dataset is 3. Level 1: Citation graphs. For each document which is associated with a publication year y, a set of citation graphs is built starting fromy to 25. Each citation graph contains nodes representing the document itself and all documents which directly or indirectly refer to it. 96, 159 documents out of 17, 267 have a publication year available. This produced a total of887,89 graphs at this level. The graphs have a maximum out-degree of 3, maximum in-degree of 19, and the maximum number of nodes is 29. In other words, the largest citation graph in the dataset describes the citation connection among 29 documents. Level 2: Temporal sequences. Since the publication year is available for 9% of documents, 96, 159 sequences are generated. The nodes in the sequences are attached with a target except for the very last node in the sequence. The target value

207 8.5. Experiments 173 used is either PageRank or ImpactRate. In total, 791, 65 nodes are supervised and the remaining nodes are used for testing purposes. The sequences by default is set to full-length by using a time interval of 1 year. As indicated earlier, we also considered building sequences with a larger time interval of up to 5 years. In these cases the maximum length of the sequence decreases from 16 to 22, and the number of supervised nodes to 197, Training and Prediction Experiments are carried out by using PageRank and ImpactRate as targets respectively. We evaluate the results by examining the training performance and prediction error. The summed square error (SSE) is recorded during training, and the maximum, minimum and mean values of the network outputs are used as indicators of the training performance. The prediction error is computed as follow: Retrieve the target value for the unsupervised nodes in the sequences (the last node in every sequence) from the database. Compute the difference between the network outputs and the actual targets. Compute the SSE and mean squared error (MSE). Based on a set of trial and error experiments, we have identified 1 as a reasonable number of training iterations which seemed to be sufficient for the training to reach a convergence point (by looking at the decline of the SSE curve). All the experiments present in this section are trained for 1 epochs. We tried a variety of architecture of GNN 2 s. However, given the large scale of the dataset, we limit the size of the networks in the range so that the experimental computational cost is affordable under the research scope. Figure 8.3 displays two typical results when using ImpactRate as targets. The two plots at the top show the results of training a GNN 2 with an output network containing 6 hidden neurons, 8 state neurons and 4 dimensional encoding inputs starting from different random initializations. The two plots at the bottom give the result of using larger output network which has 8 hidden neurons, 1 state neurons and 6 dimensional encoding inputs. The plots on the left show the SSE during training plotted every 1 iterations and the plots on the right illustrates the errorbars (min, max, and average values) of the network outputs during training. The errorbar plot can help monitor the value

distribution of the outputs during training, and provide an alternative indicator of training performance.

Figure 8.3: The SSE (left) and the value range of the outputs (right) produced by the GNN 2 during training with ImpactRate as targets. The size of the output network varies: first row (hidden=6, state=8, encodedim=4); second row (hidden=8, state=1, encodedim=6).

The same two network architectures are also tried in experiments using PageRank as targets, and the training performance is illustrated in Figure 8.4. It can be seen that the GNN 2 produced a significantly larger SSE whereas the range of output values remains constrained. This is due to: (a) the PageRank values are typically much larger than the ImpactRate values, and (b) the PageRank values are extremely unbalanced. The results of the above four experiments are summarized in Table 8.3 (the first two rows). Here, we list the SSE and MSE on both the supervised (train) and unsupervised (test) nodes, and also the Mean Weighted Error (MWE) on the test data. These measures are computed as follows:

$$\mathrm{SSE} = \sum_{i=0}^{n} (o_i - t_i)^2,$$ where $o_i$ is the output for node $i$ and $t_i$ is the target for node $i$.

$$\mathrm{MSE} = \frac{\mathrm{SSE}}{n}$$

$$\mathrm{MWE} = \frac{1}{n}\sum_{i=0}^{n} \frac{|o_i - t_i|}{t_i}$$ (if $t_i = 0$, it is set to 1)
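For illustration, the three error measures defined above can be computed as follows. This is a small numpy sketch; the function and variable names are illustrative only.

import numpy as np

def prediction_errors(outputs, targets):
    """Compute SSE, MSE and the Mean Weighted Error (MWE) as defined above."""
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    diff = outputs - targets
    sse = float(np.sum(diff ** 2))
    mse = sse / len(targets)
    # For the MWE the difference is weighted by the target value; a zero
    # target is replaced by 1 so the ratio stays defined.
    denom = np.where(targets == 0, 1.0, targets)
    mwe = float(np.mean(np.abs(diff) / denom))
    return sse, mse, mwe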

209 8.5. Experiments 175 8e e e+6 25 Unweighted SSE 6.5e+6 6e+6 Errorbar e+6 5 5e+6 4.5e Epoch Seed 3 Seed 15 Seed 37 Seed 59 Seed Epoch Seed 3 Seed 15 Seed 37 Seed 59 Seed 71 Target Avg Target Max Target Min 7.5e+6 4 7e e+6 25 Unweighted SSE 6e+6 5.5e+6 Errorbar e e+6 4e Epoch Seed 3 Seed 15 Seed 37 Seed 59 Seed Epoch Seed 3 Seed 15 Seed 37 Seed 59 Seed 71 Target Avg Target Max Target Min Figure 8.4: The SSE (left) and the value range (right) produced by GNN 2 during training when using PageRank as targets. The size of the output network varies: first row (hidden=6, state=8, encodedim=4); second row (hidden=8, state=1, encodedim=6). For the training performance, we anticipated to observe a good convergency of the training error, i.e. ideally the SSE would approach zero during training. However, the best results recorded for both sets of experiments are still a distance from the desired outputs. This is possibly due to the long term dependency problem since this learning problem requires encoding of deep structure (GoGs up to three levels, and sequences longer than 1 nodes). Moreover, when using ImpactRate as targets, we observed that the training appears sensitive to the initial condition of the network. The speed of decrease of the network error and the convergence points of the SSE vary with the initial condition of the network. This shows the possibility that the training may be caught in a local minima. The results by using PageRank as targets seem to be even more sensitive to the initial conditions as most SSE curves are not smooth. One possible reason could be the unbalanced distribution of the target values given the value range of PageRank is almost 4 times larger than with ImpactRate. The results for ImpactRate are generally better than PageRank, however, even the best one for ImpactRate is showing an 21% MWE on the test set which still requires further optimization. We identify several

210 8.5. Experiments 176 Table 8.3: Prediction error of a trained GNN 2 on full/reduced temporal spacial graphs with PageRank/ImpactRate as targets. Sequence length Target Network Size Train SSE Train MSE Test SSE Test MSE Test MWE Full Impact h6s8ed Rate h8s1ed Full Page h6s8ed4 4.71e Rank h8s1ed6 5.13e Reduced Impact h6s8ed Rate h8s1ed Reduced Page h6s8ed4 1.14e Rank h8s1ed6 1.3e potential problems which could cause this result, and present some possible solutions: Long term dependency problem: The PMGraphSOM has been applied successfully in Section 7.5 to reduce the long term dependencies in a Web spam detection problem. The task involved the encoding of deep structure and evidence was produced that PMGraphSOM is effective in reducing the effects when used as a pre-processor for GNN 2. Another possible solution is to reduce the depth of the structure within the data by truncating the length of the temporal sequences. This can be done by increasing the time interval as which the citation graphs are sampled. Local minima: Local minima problems are often addressed by running a number of experiments using different initial conditions and then to select the ones that produced the lowest error. Another approach is to use back-propagation with a momentum term, or to use Rprop. Rprop only uses the sign of the gradient rather than the size of the gradient. Hence it can be effective in overcoming local minima or local plateaus on the error surface. Unbalanced distribution of target vales: The experimental results when training on the PageRank targets were considerably worse than the one obtained when trained on the ImpactRate targets. This is caused by the different nature of the target function. Firstly, the PageRank function is non-linear function while ImpactRate function is a linear function. Moreover, ImpactRate does not consider the influence of indirect links. Thus the learning task associated with ImpactRate is significantly easier. Secondly, the value range of PageRank is much larger than for ImpactRate, and the

211 8.5. Experiments 177 PageRank targets GNN outputs ImpactRate targets GNN outputs Figure 8.5: Samples where GNN 2 failed to produce good results on modelling PageRank, and corresponding results for the experiments training on ImpactRate. distribution of documents with different PageRank values is unbalanced (only few documents have a high PageRank). Machine learning methods are well known for dismissing infrequent patterns (this is a wanted property which enables a machine learning method to cope with noise). Thus, we need to reinforce the signal of the infrequent target patterns in order to get GNN 2 to encode the entire value range. Balancing techniques can be adopted for this purpose. We examined the results further by selecting the training samples for which the GNN 2 performed worst. The aims is to find the causes of the poor performance. The plots on the left in Figure 8.5 show a comparison between the network outputs and targets on the worst results of experiments training on PageRank. The plots on the right are the results for the same documents when using ImpactRate as targets. It can be observed that the increasing trend of the PageRank values is indeed learnt by the GNN 2, however, the value range seems to be capped. This is an indication of the lack of network parameters (the network may be too small). The capping on the output range reveals that the modelling of the target function may be out of the ability of such network configuration. Larger networks are trained on a subset of the CiteSeer dataset for obtaining some indicative results. The three phases of the SSE decline are not observed. This indicates that the number of epochs required to train the network on the PageRank target values could increase exponentially with the size of the network.

212 8.5. Experiments Unweighted SSE 2 15 Errorbar Epoch Seed 3 Seed 15 Seed 37 Seed 59 Seed Epoch Seed 3 Seed 15 Seed 37 Seed 59 Seed 71 Target Avg Target Max Target Min Unweighted SSE 2 15 Errorbar Epoch Seed 3 Seed 15 Seed 37 Seed 59 Seed Epoch Seed 3 Seed 15 Seed 37 Seed 59 Seed 71 Target Avg Target Max Target Min Figure 8.6: A GNN 2 trained on reduced length temporal spatial graphs using ImpactRate as targets. The size of the output network varies: first row (hidden=6, state=8, encodedim=4); second row (hidden=8, state=1, encodedim=6) Truncate the sequence In order to improve the results, an approach is tried by first reducing the complexity of the structural information. The length of the temporal sequences is controlled by setting a time interval between the sampling of the citation graph. The longer a sequence, the deeper the dependencies between nodes. We shorten the sequence by increasing the time interval from 1 year to 5 years. We then re-trained the GNN 2 s. The results are shown in Figure 8.6 (for ImpactRate) and Figure 8.7 (when using PageRank as target). The prediction performance is listed in Table 8.3 (the last two rows). By comparing the MSE, it can be observed that the training on reduced sequences can produce better results on the training set but the generalization performance on the test set becomes worse. This indicates that truncation on the sequences reduces the long term dependencies among nodes so that the learning on supervised nodes is simplified. However, it takes risk that some useful information could be lost through truncation, so that good generalization cannot be guaranteed. This is a typical situation in which overfitting has occurred.

213 8.5. Experiments e e e+6 25 Unweighted SSE 1.1e+6 1e+6 Errorbar Epoch Seed 3 Seed 15 Seed 37 Seed 59 Seed Epoch Seed 3 Seed 15 Seed 37 Seed 59 Seed 71 Target Avg Target Max Target Min 1.4e e e+6 25 Unweighted SSE 1.1e+6 1e+6 Errorbar Epoch Seed 3 Seed 15 Seed 37 Seed 59 Seed Epoch Seed 3 Seed 15 Seed 37 Seed 59 Seed 71 Target Avg Target Max Target Min Figure 8.7: Train GNN 2 on reduced length temporal spatial graphs using PageRank as targets. The size of the output network varies: first row (hidden=6, state=8, encodedim=4); second row (hidden=8, state=1, encodedim=6). Left: SSE; Right: Errorbar on the network outputs Use PMGraphSOM If long term dependency is a problem the PMGraphSOM can be engaged to reduce its effects in a supervised learning problem. This was demonstrated in Chapter 7. Hence, we apply PMGraphSOM by training a PMGraphSOM on the set of CLGs. This creates a mapping which is then used to label the nodes in the citation graphs. Since the mappings produced by a PMGraphSOM can be represented by 2 dimensional coordinates for each CLG, these coordinates are an encoding of the content of the associated document. The mapping can thus be used to label the nodes in the citation graphs. In this way, it is allowed to obtain an prediction of the intrinsic similarity of documents in terms of the contents (CLGs) prior to the training of the GNN 2. It is anticipated that the clustering results can assist the training of the GNN 2 particularly when encoding the information that is located deeply in the structure. We train PMGraphSOM on 17,267 CLGs where each of them is representing the contents of one document in the CiteSeer dataset. When training the PMGraphSOMs, a number of parameters need to be set. It is well known

214 8.5. Experiments Map-Y Map-X Figure 8.8: The mappings produced by PMGraphSOM of size 8 6 when trained on the CLGs extracted from the CiteSeer dataset. that the optimal configuration of training parameters for SOMs can only be determined through trial and error, which is because a good set of parameters is always problem dependent. The computational complexity of PMGraphSOM raises with the number of nodes and connections between nodes. Given that the dataset consists of 1, 78, 797 nodes and 28, 23, 984 connections, the training is time-consuming. The experiment using a reasonable size of a map requires 3 hours per iteration. The configuration of such an experiment is as follow: map size is 8x6, 1x1 grouping,σ() = 4,α =.9,µ =.1. The fully trained PMGraphSOM produced mapping as illustrated in Figure 8.8. The distribution of the activations on the map is reflected by a color meter ranged from black color to yellow. The darker the color the lower the number of mappings at a given location. There were no mappings in the White areas of the map. The plot shows a good utilization of the display space and a number of clearly defined dense areas (the yellow area) indicate that the CLGs were mapped into clusters. This indicates that the documents are clustered together according to certain similarity of the input patterns. We then investigate the the clusters by comparing the CLGs mapped on the same neuron. It is found that those CLGs have similar structures or at least are sharing similar substructures. For example, most CLGs mapped on location [6, 2] contain five concepts fully connected. Since the node label vectors were used in the training, some identical concepts can be found from different CLGs. This indicates that PMGraphSOM is able

We tried a variety of training parameters; the resulting mappings are shown in Appendix D. It is found that larger maps do not show significant benefits in clustering performance since the utilization of the display space is reduced (see for example Fig. D.2). This suggests that the configuration using an 8x6 map is suitable for this learning task. We also observed that the number of clusters is significantly reduced when training the PMGraphSOM with a larger value for µ (see Fig. D.4). Hence, µ = 0.1 appears to be a reasonable choice.

The mapping results returned by the PMGraphSOM are then used to label the nodes in the citation graphs, and the GNN 2 is re-trained on this dataset. The experimental results are summarized in Table 8.4. We varied the initializations of the GNN 2; this is indicated by the different seed values (for the random number generator) in Table 8.4. We also made attempts to balance the dataset. The results show that the training remains sensitive to different initial conditions and that the best result on the training set is obtained without using the balancing technique. With balancing, the prediction error on the test set improves for both the MSE and the MWE. However, by comparing the results with those obtained prior to using the PMGraphSOM, it appears that no improvement has been achieved. This indicates that the problem may be caused by reasons other than long-term dependencies.

Table 8.4: Prediction error of training GNN 2 on citation graphs labelled with PMGraphSOM outputs (columns: Seed, Balance, Train SSE, Train MSE, Test SSE, Test MSE, Test MWE; each seed is trained both with and without balancing).
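The way the trained map is used to label the citation graphs can be illustrated with a generic SOM: the feature representation of each document's CLG is mapped to its best-matching unit, and that unit's grid coordinates become the node label in the citation graph. A minimal sketch, using a plain vector-based SOM codebook as a stand-in for the PMGraphSOM (the variable names and the use of a fixed-size feature vector per CLG are assumptions made for illustration):

import numpy as np

def best_matching_unit(codebook, x):
    """Return the (row, col) coordinates of the map unit closest to x.

    codebook : array of shape (rows, cols, dim) holding trained weight vectors.
    x        : feature vector summarizing one CLG.
    """
    dist = np.linalg.norm(codebook - x, axis=2)        # distance to every unit
    return np.unravel_index(np.argmin(dist), dist.shape)

def label_citation_nodes(codebook, clg_features):
    """Map every document id to the 2-D map coordinates of its CLG."""
    return {doc_id: best_matching_unit(codebook, vec)
            for doc_id, vec in clg_features.items()}

# Tiny example with a random 8 x 6 codebook and three documents.
rng = np.random.default_rng(0)
codebook = rng.random((8, 6, 16))
clg_features = {doc: rng.random(16) for doc in ("doc_a", "doc_b", "doc_c")}
print(label_citation_nodes(codebook, clg_features))

In the thesis the map is a PMGraphSOM trained on graph-structured CLGs rather than on plain feature vectors, so the sketch only captures the labelling step, not the training of the map itself.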

Figure 8.9: The distribution of PageRank values before and after normalization (histogram of counts over the original and the normalized PageRank values).

Smoothing the PageRank targets

MLP-based systems realize a parametric function, and it is well known that MLPs can approximate any continuously differentiable function through the superposition of transition functions. Thus, a learning task is simpler the more closely the target function resembles a mixture of such transition functions; in general, target functions which exhibit extremes or abrupt changes are harder to approximate. The GNN 2 consists of MLP-like substructures, and the transition function is usually set to the sigmoid. Given that the distribution of PageRank values is extremely unbalanced (as can be seen in Figure 8.9), this may make it harder for the GNN 2 to learn. Hence, we attempt to smooth the target function through a normalization mechanism. Observing that the distribution of PageRank values follows the function log_2(1/n), we smooth the PageRank values by using log_2(v), where v refers to the original PageRank value. The main motivation is that the logarithm smooths the function curve without destroying the order of the values. We plot the original PageRank values and the normalized values in Figure 8.9. It can be seen that the normalized curve is much smoother and that the value range shrinks to within 1% of the original range. We then re-train the GNN 2 on the normalized targets using similar configurations as in the previous experiments. The training performance is illustrated in Figure 8.10.
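The smoothing step described above is a simple monotone transform, so the ordering induced by the targets is unchanged. A minimal sketch (the small epsilon guarding against zero-valued entries is an added assumption, not part of the thesis):

import numpy as np

def smooth_pagerank(values, eps=1e-12):
    """Apply a log2 transform to PageRank targets.

    log2 is monotonically increasing, so it preserves the ordering of the
    values while compressing the long tail of the distribution.
    """
    v = np.asarray(values, dtype=float)
    return np.log2(v + eps)

raw = np.array([0.85, 0.12, 0.012, 0.0009])
print(smooth_pagerank(raw))                                # smoothed targets
print(np.argsort(raw), np.argsort(smooth_pagerank(raw)))   # identical orderings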

It can be observed that the SSE curve is not as jagged as the one shown in Figure 8.4; this is a positive side effect of smoothing the target values. However, the residual SSE is still at a high level after 1 training epochs. The SSE and the expansion of the output value range continue to improve gradually until 1 training iterations, and it appears that convergence has not yet occurred. While this could easily be addressed by increasing the number of training iterations, this was not done due to excessive training times (more than one week of training time).

Figure 8.10: Plots of training the GNN 2 on temporal spatial graphs using normalized PageRank as targets. Left: SSE curves; Right: errorbars of the network output values.

Using larger network architectures

The previous experimental results indicated that the quality of the mappings may improve with an increase in the size of the GNN 2. Increasing the number of neurons in the GNN 2 increases the number of parameters within the system and allows more complex tasks to be encoded. In order to carry out experiments on larger networks we reduced the size of the training set. This was done to achieve a reasonable turn-around time for the experiments and to keep training times within 7 days. The training dataset contains 89, 61 randomly selected CLGs (documents) and their associated time sequences. The experimental results of using different network architectures on this dataset are illustrated in Figure 8.11, where the plots on the left show the SSE curves plotted every 1 iterations and the plots on the right show the errorbars (min, max, and average values) of the network output values. The main findings are:

- The larger the network, the better it covers the value range of the targets.
- The larger the network, the smaller the residual error after training.
- Convergence of the SSE appears to occur earlier for larger networks.
- Balancing is not effective in this task.

Figure 8.11: Plots of training the GNN 2 on a smaller CiteSeer dataset using different network architectures. Top: hidden=4, state=6, encodedim=3; Middle: hidden=8, state=1, encodedim=6; Bottom: hidden=15, state=2, encodedim=14. (Left: unweighted SSE; Right: errorbars of the network outputs; curves compare Update All with Balance-Update All.)

As a consequence of this experiment, we find that the network size was a factor that limited the ability to encode the PageRank values in the previous experiments.

Use of ERA as targets

The CiteSeer dataset contains 17, 267 documents, and publication venue information is available for 16, 57 documents through reference analysis. We were only able to associate a venue with a minority of the documents in the dataset due to the difficulty of parsing venue information from the citations with Parscit. As mentioned earlier, Parscit strictly requires the citation string to be in a standard format so that it can automatically and correctly categorize the information.

We then tried to associate the venues covered by the CiteSeer dataset with the venues listed in the ERA venue database. We found that ERA recognizes the venues of 5, 43 documents in the dataset; out of these, 879 were unique venues. The matching of venues in the CiteSeer dataset with venues listed in the ERA database is performed using a similarity voting approach, described below:

- Get the list of all unique words from the ERA venue list and count their occurrences.
- For each CiteSeer document, for each piece of venue information available, and for each word in the string, try to match the word with the unique word list obtained in the previous step. This matching is performed in combination with spell checking and fuzzy matching using wildcards. If multiple matches are found, pick the word with the highest frequency; if no match is found, use the original word. This results in a refined list of venues for the CiteSeer documents.
- Match each refined venue entry against the ERA venues and compute a similarity based on string Soundex values. If the venue contains only one word, we compare the Soundex [7] values character by character. If the venue contains multiple words, we compare the Soundex values word by word, and the returned similarity is the percentage of correct matches found.
- For one document we may find several matches in the ERA venue list. The best match is selected according to the similarity voting results: the venue that has the highest similarity with the extracted document venue is assigned to the document (a minimal sketch of this Soundex-based voting is given below).

Figure 8.12 shows the distribution of documents over the 4 possible ranks (left) and over the 879 different venues sorted by frequency (right). From the plots it can be observed that the higher the rank level, the more documents are available. This observation is expected since the extraction of venue information for CiteSeer documents is based on reference parsing: highly ranked documents are more likely to be cited by other documents, so that their venue information is more likely to be discovered. The distribution of documents over the different venues shows the unbalanced nature of the dataset: 8% of papers are found in the 5% most popular venues. This indicates that many venues contain only a few papers.
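The Soundex-based similarity voting outlined above can be sketched as follows. The Soundex variant, the word-level similarity and the helper names are illustrative assumptions; the spell checking and wildcard matching used in the thesis pipeline are omitted:

def soundex(word):
    """A simplified 4-character Soundex code (assumed variant of the algorithm)."""
    groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
    def code(ch):
        for letters, digit in groups.items():
            if ch in letters:
                return digit
        return ""                      # vowels and h/w/y carry no code here
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return "0000"
    out, prev = word[0].upper(), code(word[0])
    for ch in word[1:]:
        digit = code(ch)
        if digit and digit != prev:
            out += digit
        prev = digit
    return (out + "000")[:4]

def venue_similarity(candidate, era_venue):
    """Fraction of matching Soundex codes (character-wise for single-word venues)."""
    cand, era = candidate.split(), era_venue.split()
    if len(cand) == 1 and len(era) == 1:
        a, b = soundex(cand[0]), soundex(era[0])
        return sum(x == y for x, y in zip(a, b)) / 4.0
    matches = sum(soundex(c) == soundex(e) for c, e in zip(cand, era))
    return matches / max(len(cand), len(era))

def best_era_match(document_venue, era_venues):
    """Return the ERA venue with the highest similarity to the extracted venue."""
    return max(era_venues, key=lambda v: venue_similarity(document_venue, v))

era = ["Neural Computation", "Machine Learning", "Neural Networks"]
print(best_era_match("Nueral Computation", era))   # -> "Neural Computation"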

This means that during training there are only a few samples available for most venues, so that the learning results may be biased towards the larger classes.

Figure 8.12: Distribution of documents from different ERA venues and rank levels (A*, A, B, C, Not ranked). Left: number of documents from different rank levels; Right: number of documents from different venues.

The dataset is then split into a training set and a test set. The training set contains 3, 149 documents and the test set contains the remaining 1, 879 documents. The generation of these subsets is based on a random selection procedure. Figure 8.13 shows the distributions for the train and test sets; both sets share a similar distribution, and the train set has good coverage of the available rank levels and venues.

We then analysed the citation structure with respect to the document categories. We were interested in knowing whether there exists a bias in which class of documents a given class refers to. For example, a document rated A* (highly rated) may link to other documents rated A*. The question is whether it is more likely for A* documents to refer to other A* documents than to lower rated documents; the same question arises for each of the other document classes. We build a matrix to represent the number of citations from one document class to another document class; this resulted in the confusion matrix shown in Table 8.5 (a small illustrative sketch of how such a matrix can be assembled is given below). The rows list the number of in-links to the journals ranked at the corresponding level and the columns list the number of out-links. The column Total lists the total number of in-links from both labelled and unlabelled documents; similarly, the row Total lists the total number of out-links from the journals ranked at a certain level to all the available documents. Several findings follow:

- More than half of the A*-ranked papers cite papers that are also ranked A*. In this dataset, A* papers receive 97 citations in total, and over 8% of them are from papers ranked equal to or higher than A.
- A-ranked papers have almost 7% of their in-links from A* and A ranked papers.
- About 6% of the references of B-ranked papers go to A*-ranked and A-ranked journals; B-ranked papers are mostly cited by A*-ranked papers, followed by C, A and B ranked papers.
- C-ranked journals have only 6% of their references to journals also ranked C, and a similar number of references to journals at each of the higher rank levels.
- Journals with the highest rank, A*, have the most in-links and also the most out-links in total. The higher the rank, the more citations received.

Figure 8.13: Distribution of documents from different ERA venues and rank levels. Top: train set; Bottom: test set; Left: number of documents from different rank levels; Right: number of documents from different venues.

It is interesting to observe that A* papers are most likely to refer to other A* papers, while A papers are most likely to refer to other A rated papers. This is the more remarkable given that the dataset contains more A* rated papers than A rated papers. Also interesting is that C rated papers refer to the other paper classes with a likelihood that follows the size of the pattern classes (and thus resembles a random reference strategy). In contrast, B rated papers refer to C rated papers at a much higher rate than is justified by the size of class C. Thus, the citations within a document allow the document to be associated with a document class. In other words, it can be expected that a machine learning method will be able to encode the target set to a certain degree.

Table 8.5: Confusion matrix of connections among documents from different rank levels. Rows and columns cover the ranks A*, A, B and C, together with In-link, AvgInlink and Out-link totals.
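The class-to-class citation matrix of Table 8.5 can be assembled with a single counting pass over the citation edges. A minimal sketch, assuming a dictionary that maps document ids to a rank label (or None when unlabelled) and a list of (citing, cited) pairs; all names are illustrative:

from collections import defaultdict

RANKS = ["A*", "A", "B", "C"]

def citation_class_matrix(rank_of, citations):
    """Count citations between rank classes.

    rank_of   : dict mapping document id -> rank label or None (unlabelled).
    citations : iterable of (citing_doc, cited_doc) pairs.
    Returns matrix[out_rank][in_rank] plus the total in-links per rank,
    where the totals also count citations from unlabelled documents.
    """
    matrix = {r: defaultdict(int) for r in RANKS}
    total_inlinks = defaultdict(int)
    for citing, cited in citations:
        out_rank, in_rank = rank_of.get(citing), rank_of.get(cited)
        if in_rank in RANKS:
            total_inlinks[in_rank] += 1
        if out_rank in RANKS and in_rank in RANKS:
            matrix[out_rank][in_rank] += 1
    return matrix, total_inlinks

ranks = {"d1": "A*", "d2": "A*", "d3": "B", "d4": None}
cites = [("d1", "d2"), ("d3", "d2"), ("d4", "d1"), ("d2", "d3")]
m, totals = citation_class_matrix(ranks, cites)
print(m["A*"]["A*"], totals["A*"])   # 1 citation A* -> A*, 3 in-links to A* papers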

There are two types of targets that can be given to the nodes in the citation graph: the venue rank and the venue ID. The CLGs can be used as additional features for the GNN 2 training. As a result, we scheduled four sets of experiments using the following feature and target combinations:

- 4-dimensional venue rank target vectors, using CLGs to label the nodes in the citation graph. This results in GoGs of depth 2.
- 4-dimensional venue rank target vectors, without using CLGs. The GoG in this case is of depth 1.
- 223-dimensional venue ID target vectors, using CLGs to label the nodes in the citation graph (GoGs of depth 2).
- 223-dimensional venue ID target vectors, without the CLGs (GoGs of depth 1).

For all the experiments we used the same architecture for the output network: Hidden=8, State=1, EncodeDim=6. The number of training epochs is set to 1. Figure 8.14 shows the SSE curves during training for all four experiments and Table 8.6 lists the classification performance. It is found that the inclusion of CLGs allows a slightly better performance on the training set but yields worse prediction results on the test set.

Figure 8.14: SSE curves when training the GNN 2 on GoGs defined for the CiteSeer documents. Top: venue rank as targets; Bottom: venue ID as targets; Left: with CLGs; Right: without CLGs.

Table 8.6: Classification results of training the GNN 2 on ERA venue rank and venue ID (columns: Target, CLG, Train, Test; rows: venue rank with and without CLGs, venue ID with and without CLGs).

Confusion matrices are built in order to visualize the classification performance; these are shown in Table 8.7 and Table 8.8. Each column of a matrix represents the documents classified into the corresponding rank level, while each row represents the documents given the actual rank. The numbers on the diagonal show the correct classifications among the different rank levels. Table 8.7 presents the classification result when using CLGs, and Table 8.8 gives the classification result when trained on data not featuring the CLG representation of the documents. The results for the experiments not using balancing techniques are located in Appendix E. It can be observed that:

- Without balancing, all documents from categories B and C are classified as A or A*. This is possibly caused by the unbalanced distribution of the document classes.
- With balancing, some documents from the smaller classes (lower rank levels) can be correctly classified. However, the performance on higher ranked documents becomes worse when compared to the unbalanced training runs.
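The per-class view used in Tables 8.7 and 8.8 is a standard confusion matrix over the four rank levels, with the scalar summary for each run being the number of correctly classified documents together with the corresponding fraction. A minimal sketch with illustrative data, not the thesis results:

import numpy as np

RANKS = ["A*", "A", "B", "C"]

def confusion_matrix(actual, predicted, labels=RANKS):
    """Rows: actual rank; columns: predicted rank."""
    index = {lab: i for i, lab in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for a, p in zip(actual, predicted):
        cm[index[a], index[p]] += 1
    return cm

actual    = ["A*", "A*", "A", "B", "C", "A"]
predicted = ["A*", "A",  "A", "A*", "C", "A"]
cm = confusion_matrix(actual, predicted)
correct = np.trace(cm)                            # diagonal = correct classifications
print(cm)
print(correct, round(correct / len(actual), 4))   # 4 correct out of 6 -> 0.6667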

Table 8.7: Confusion matrix of the classification produced by training the GNN 2 on citation graphs where the nodes are labelled by CLGs. ERA rank is used as target; Hidden=8, State=1, EncodeDim=6; with balancing. Train: 974 (.3216); Test: 492 (.2641).

Table 8.8: Confusion matrix of the classification produced by training the GNN 2 on citation graphs. ERA rank is used as target; Hidden=8, State=1, EncodeDim=6; with balancing. Train: 969 (.3199); Test: 459 (.2464).

Discussion

The GNN 2 was not able to exhibit a convincingly good performance on the CiteSeer document ranking experiments. Nevertheless, it was shown that the proposed machine learning tool, the GNN 2, is capable of encoding such complex data representations. The poor performance on the training set and on the prediction task indicates that difficulties exist. These difficulties could be caused by a dataset containing too much noise, or by limitations of the machine learning method. As mentioned earlier, the CiteSeer dataset is incomplete with respect to both document contents (due to the low quality of the PDF versions, so that the OCR cannot correctly recognize some text) and citation links (varied reference standards make the automatic matching task difficult).

It is not reasonable to expect the GNN 2 to produce perfect results on an incomplete and noisy dataset. The experiments provided some evidence that the GNN 2 has the potential to solve the ranking problem, but the CiteSeer document ranking problem may not be of sufficiently high quality to demonstrate the full capability of the GNN 2. To eliminate doubts, we then crafted an artificial learning problem designed to test the abilities of the GNN 2. Towards this end, we propose to extend the Policeman dataset [39] to sequences of graphs of graphs, which provide a noise-free analog to the temporal and spatial graphs defined for the CiteSeer dataset. This is an artificial dataset whose degree of difficulty can be controlled; hence, it is a better application for evaluating the methods.

In Section 5.2, we defined sequences of trees for the Policeman dataset which simulate certain movements of the policeman. Such sequential graphical patterns already contain two levels of graphs. A further step is taken by describing the nodes in the Policeman trees. This adds a third layer to the GNN and hence creates a noise-free analogy to the CiteSeer problem. The idea is illustrated on an example in Figure 8.15: a policeman is raising both arms over time, and meanwhile his facial expression changes from unhappy to very happy, and then from very happy back to unhappy. The change of facial expression is purposely out of sync with the movement of the arms or body of the policeman; thus, it is not possible to draw a conclusion about the facial expression by considering a policeman tree alone. The learning task is to model the state of happiness of the policeman. The challenge with this task is that the information about the state of happiness is embedded at the deepest level of the graph, and that a prediction of the state at the next time instance requires the modelling of the time series of policemen. Thus, if successfully trained, the GNN 2 would demonstrate the ability to encode time series information amongst elements located deep in the structure, and would demonstrate the ability to ignore irrelevant features that may be embedded in the sequence (such as the raising of the arms). The task is therefore to predict the last facial expression that the policeman will have in a sequence; this is indicated in Figure 8.15 by the assumed unknown state of the last node in the sequence. In summary, the benchmark problem is a GoG containing three levels of graphs: Level 0: face graphs; Level 1: policeman tree; Level 2: time sequence. All nodes except the last one in each sequence are attached with one of the four targets defined for this task: unhappy, neutral, happy and very happy. These targets are described by four face graphs located at level 0, and the node in the level 1 graphs representing the head of the policeman can be labelled by one of these four graphs to indicate the current facial expression.

Figure 8.15: A sample of the Policeman temporal spatial graphs (Level 0: face graphs; Level 1: policeman tree; Level 2: movement sequence; the facial expression at the last time step is unknown).

We build several datasets of different sizes for the experiments. The sequences vary in length, and depict policemen raising one or both arms, lowering one or both arms, rotating clockwise or anti-clockwise, and moving from left to right or from right to left, or from front to back, or from back to front (similar to the cases defined in Section 5.2). A total of five datasets are considered:

- Dataset 1: 12 sequences of graphs of graphs; 15 supervised and 12 unsupervised nodes at level 2.
- Dataset 2: 57 sequences of graphs of graphs; 717 supervised and 57 unsupervised nodes at level 2.
- Dataset 3: 72 sequences of graphs of graphs; 9 supervised and 72 unsupervised nodes at level 2.
- Dataset 4: 15 sequences of graphs of graphs; supervised and 15 unsupervised nodes at level 2.
- Dataset 5: 57 sequences of graphs of graphs; 546 supervised and 228 unsupervised nodes at level 2.
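The three-level structure can be made concrete with a small nested data type in which a node label may itself be a graph. A minimal sketch of how one such benchmark instance could be represented; the class and field names are illustrative and not the thesis implementation:

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Graph:
    """A generic graph; a node label may be a plain value or another Graph."""
    nodes: Dict[str, object] = field(default_factory=dict)   # node id -> label
    edges: List[tuple] = field(default_factory=list)          # (source id, target id)

# Level 0: a face graph describing one of the four expressions.
happy_face = Graph(nodes={"mouth": "smile", "eye_l": "open", "eye_r": "open"},
                   edges=[("eye_l", "mouth"), ("eye_r", "mouth")])

# Level 1: policeman trees; the head node is labelled by a face graph,
# except at the last time step where the expression is the unknown target.
policeman_t0 = Graph(nodes={"body": "upright", "head": happy_face,
                            "arm_l": "raised", "arm_r": "lowered"},
                     edges=[("body", "head"), ("body", "arm_l"), ("body", "arm_r")])
policeman_t2 = Graph(nodes={"body": "upright", "head": None,       # face unknown
                            "arm_l": "raised", "arm_r": "raised"},
                     edges=[("body", "head"), ("body", "arm_l"), ("body", "arm_r")])

# Level 2: the movement sequence linking the policeman trees over time.
sequence = Graph(nodes={"t0": policeman_t0, "t1": policeman_t0, "t2": policeman_t2},
                 edges=[("t0", "t1"), ("t1", "t2")])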
