Graph Matching: Filtering Databases of Graphs Using Machine Learning Techniques



Graph Matching: Filtering Databases of Graphs Using Machine Learning Techniques

Inaugural dissertation of the Philosophisch-naturwissenschaftliche Fakultät of the Universität Bern, submitted by Christophe-André Mario Irniger of Niederrohrdorf, AG.

Thesis supervisor: Prof. Dr. H. Bunke, Institut für Informatik und angewandte Mathematik, Universität Bern.

Accepted by the Philosophisch-naturwissenschaftliche Fakultät. Bern, 13 June 2005. The Dean: Prof. Dr. Paul Messerli.


Abstract

Graphs are a powerful concept useful for various tasks in science and engineering. In applications such as pattern recognition and information retrieval, object similarity is an important issue. If graphs are used for object representation, then the problem of determining the similarity of objects turns into the problem of graph matching. Some of the most common graph matching paradigms include graph and subgraph isomorphism detection, maximum common subgraph extraction and error-tolerant graph matching. A number of solutions for all of these tasks have been proposed in the literature, but they all suffer from the high computational complexity inherent to graph matching. An additional problem arises in applications where an input graph is to be matched not only to another single graph, but to an entire database of graphs under a given matching paradigm. If the database is large, sequential comparison of the input graph with each graph from the database using conventional approaches becomes infeasible.

In this thesis the comparison of input graphs with databases of graphs is studied. Different retrieval paradigms, namely graph isomorphism, subgraph isomorphism, and error-tolerant matching, are considered. The approach pursued is based on comparing feature vectors which have been extracted from the graphs. The idea is to use features that can be quickly computed from a graph on the one hand, but are, on the other hand, effective in discriminating between the various graphs in the database. Given a potentially large number of such features, the most powerful ones for discriminating the graphs in the database are determined by means of decision tree induction algorithms as known from machine learning. Under the proposed procedure, given an input, i.e. a query graph, a (preferably small) subset of possible candidates will be retrieved from the database. Only the graphs contained in this subset are then subject to full-fledged, expensive graph matching. Significant savings in computation time can be expected, as the time complexity of graph feature extraction and decision tree traversal is small compared to full graph matching.

In this work, database filters for three different matching paradigms (graph isomorphism, subgraph isomorphism and error-tolerant matching) have been developed. They have been successfully tested on synthetically generated graphs with various characteristics as well as on publicly available real-world graph databases.


Acknowledgments

This thesis was written at the Institute of Computer Science and Applied Mathematics of the University of Bern under the guidance of Prof. Dr. Horst Bunke. I would like to thank him for directing me through this work, offering valuable advice while also giving me room to include my own ideas. Many thanks as well to Prof. Dr. Hanspeter Bieri for supervising the defense examination and to Prof. Dr. Xiaoyi Jiang for his willingness to act as the co-referee of my work.

Many thanks to the graph-guys at the University of Bern, namely Pascal Habegger and Michel Neuhaus, for interesting discussions regarding and valuable contributions to my work, and for proofreading this manuscript. I would also like to thank the graduate, project and image-analysis students for contributing valuable parts to this work. Thank you to Pascal Habegger for providing the graph framework used in this thesis. Further thanks to Adam Schenker for providing the document graphs, to Michel Neuhaus for the fingerprint graphs, and to the Developmental Therapeutics Program NCI/NIH for collecting the graph data on chemical compounds.

I would like to thank my friends from the AI- and CG-group, Cyril Marti, Andreas Schlapbach, Simon Günter, Roman Bertolami, Tamas Varga, Marcus Liwicki, Stefan Fischer, Thomas Wenger, Lorenz Ammon, Philippe Robert and Thomas Bebie, for interesting discussions regarding university work as well as social issues. Many thanks as well to the KaGi - as well as the dunkus (b) im bolles? - crew Pascal Etienne Habegger, Thomas Karel Buchberger, Sani Mehmed Tetik and Sonja Schär for valuable social discussion (let's call it that) on a wide variety of issues. Thank you to my friends Daniel Schlüssel and Cyrill Salzmann, who have always been there for the fun times. Very special thanks to my parents Elvire & Walter Irniger, to my brother Patrik, to my host family Pat, Randy, Geneva, Kent, and to all my friends not mentioned by name. All of you have been a great support for me during this time.

This research was supported by the Swiss National Science Foundation (Nr. -667). I would like to thank the Foundation for the support.


Contents

1 Introduction
  1.1 Graphs and Graph Matching
  1.2 Decision Trees and Graph Database Filtering
  1.3 Outline

Part I  Fundamentals

2 Graph Theory
  2.1 Definitions and Notation
  2.2 Graph Matching

3 Decision Trees
  3.1 Decision Tree Classification Introduction
  3.2 Decision Tree Classifier
  3.3 Split Criteria
  3.4 RainForest - Extension To Large Datasets

Part II  Decision Trees for Graph Database Indexing

4 Graph Database Filtering - A Performance Analysis
  4.1 Theoretical Examination
  4.2 Concluding Remarks

5 Graph Database Filtering Using Feature Vectors
  5.1 Graph Features
  5.2 Graph Isomorphism
  5.3 Subgraph Isomorphism
  5.4 Error-Tolerant Graph Matching

6 Decision Trees for Graph Database Filtering
  6.1 Decision Tree Filtering
  6.2 Graph Isomorphism Decision Trees
  6.3 Subgraph Isomorphism Decision Trees
  6.4 Error-Tolerant Decision Trees
  6.5 Conclusions

Part III  Experiments and Results

7 Graph Datasets
  7.1 Generated Graphs
  7.2 Extracted Graphs

8 Graph Database Filtering Performance Study
  8.1 Experimental Setup
  8.2 Experimental Results
  8.3 Conclusions

9 Feature Vector Filtering
  9.1 Experimental Evaluation
  9.2 Feature Type Evaluation
  9.3 Feature Vector Filtering
  9.4 Feature Vector Filtering Conclusions

10 Decision Tree Filtering
  10.1 Graph Isomorphism Filtering
  10.2 Subgraph Isomorphism Filtering
  10.3 Error-Tolerant Filtering
  10.4 Decision Tree Filtering Conclusions

11 Conclusions and Future Work
  11.1 Conclusions
  11.2 Future Work

Part IV  Appendix

A Fingerprint Classification
  A.1 Graph Representation
  A.2 Error-Tolerant Classification
  A.3 Experimental Results
  A.4 Conclusions and Future Work

B Decision Tree Framework
  B.1 Utility Subsystem
  B.2 Process Management
  B.3 Graph Package
  B.4 Decision Tree Framework

List of Tables
List of Figures
Bibliography


Chapter 1  Introduction

1.1 Graphs and Graph Matching

Graphs are a powerful and universal tool widely used in information processing. Numerous methods for graph analysis have been developed and become important in computer science and engineering. Examples include the detection of Hamiltonian cycles, shortest paths, vertex coloring and many more [1, 2, 3, 4, 5]. In applications such as pattern recognition or information retrieval, object similarity is an important issue. Given a database of objects and a query, the task is to retrieve one or several objects from the database matching the query. If graphs are used for object representation, this problem turns into determining the similarity of graphs, which is generally referred to as graph matching. Standard concepts in graph matching include graph isomorphism, subgraph isomorphism and maximum common subgraph. Two graphs are called isomorphic if they have identical structure. More formally, an isomorphism between two graphs g_1 and g_2 is a bijective mapping between the nodes of g_1 and g_2 that preserves the structure of the edges. Graph representations of objects are often invariant under a number of transformations, for example, spatial transformations such as translation, rotation, and scaling. Hence, graph isomorphism is a useful concept to examine whether one object is a transformed version of another. Subgraph isomorphism is another popular concept in graph comparison. Given two graphs,

there exists a subgraph isomorphism if one graph contains a subgraph that is isomorphic to the other. Subgraph isomorphism is useful to find out if a given object is part of another object or even of a collection of several objects. The maximum common subgraph of two graphs g_1 and g_2 is the largest graph that is isomorphic to a subgraph of both g_1 and g_2. The maximum common subgraph is useful to measure the similarity of two objects. Clearly, the larger the maximum common subgraph of g_1 and g_2 is, the more similar the two graphs are. Algorithms for graph isomorphism, subgraph isomorphism and maximum common subgraph detection have been reported in [6, 7, 8, 9]. A more general method to measure the similarity of two graphs is graph edit distance. It is a generalization of string edit distance, also known as Levenshtein distance [10]. In graph edit distance one introduces a set of graph edit operations. These edit operations are used to model distortions that transform a noisy pattern into an ideal object representation. Common sets of graph edit operations include the deletion, insertion and substitution of nodes and edges. Given a set of edit operations, graph edit distance is defined as the minimum number of operations needed to transform one graph into the other. Often a cost is assigned to each edit operation. The costs are application dependent and are generally used to model the likelihood of the corresponding distortions. Typically, the more likely a certain distortion is to occur, the lower is its cost. If a cost is assigned to each edit operation, then the edit distance of two graphs g_1 and g_2 is defined as the minimum cost taken over all sequences of edit operations that transform g_1 into g_2. Graph edit distance and related similarity measures have been discussed in [11, 12, 13, 14]. Another approach to measuring the similarity of two graphs is a distance measure based on the maximum common subgraph of g_1 and g_2. With increasing work being done in the field of maximum common subgraph detection, these measures are growing in popularity. In [15], a graph distance measure based on the maximum common subgraph of two graphs is introduced. This measure has two interesting properties that make it potentially attractive for various applications. First, it is a metric and thus is suitable for similarity evaluation involving several objects. Secondly, it doesn't need any edit costs specified by the user. In [16] relations between the edit distance of two graphs and their maximum common subgraph are presented. It is shown that the well-known concept of maximum common subgraph distance is a special case of graph edit distance under particular edit costs. Consequently, algorithms originally developed for maximum common subgraph detection can be used for edit distance computation and

vice versa for the considered edit costs. A more rigorous study of the influence of edit costs on the optimal matching of two graphs is presented in [17]. One of the main results is that any node mapping of two graphs that is optimal under a particular cost function is optimal under infinitely many other, equivalent cost functions. The minimum common supergraph of two graphs, a new concept that is dual to the maximum common subgraph, is introduced in [18]. It is shown that this concept has a number of properties that are potentially interesting for the definition of new graph similarity measures. Furthermore, in [19] the concepts of maximum common subgraph and minimum common supergraph are combined to derive a graph distance measure, and in [20] graph distances based on the minimum common supergraph, denoted as the graph union, are discussed. Graph matching can be applied in a great number of applications. One of the earliest was in the field of chemical structure analysis [21]. Later, graph matching has been applied in case-based reasoning [22, 23], machine learning [24, 25], planning [26], semantic networks [27], conceptual graphs [28], information retrieval from video databases [29], monitoring of computer networks [30], biomedical applications [31], data mining [32], and in the context of visual languages and programming by graph transformations [33, 34]. Numerous applications from the areas of pattern recognition and machine vision have been reported. They include the recognition of graphical symbols [35, 36], character recognition [37, 38], shape analysis [39, 40], three-dimensional object recognition [41, 42], image database indexing and retrieval [43], biometric person identification by means of facial images [44, 45] and fingerprints [46, 47], diatom identification [48], and others. Recently, a special class of graphs has gained interest in the graph community [49, 50]. Graphs of this class are characterized by the existence of unique node labels. This class is interesting because common matching tasks, such as isomorphism, subgraph isomorphism, maximum common subgraph, graph edit distance and median graph computation, have a complexity that is linear in the number of involved data items, i.e. nodes and edges. Although the condition of unique node labels is a strong constraint, this class of graphs seems useful for practical applications, for example computer network monitoring [51, 52, 53] and web document analysis [54, 55]. A number of graph matching algorithms are known from the literature [1, 13, 56, 57]. All of these methods are guaranteed to find the optimal solution, but require exponential time and space. Suboptimal or approximate methods, on the other hand, are polynomially bounded in the number of computation steps, but may fail to find the optimal solution.

For example, in [58, 59] probabilistic relaxation schemes are described. Other approaches are based on neural networks such as the Hopfield network [60] or the Kohonen map [61]. Also, genetic algorithms have been proposed [62, 63]. In [64] an approximate method based on maximum flow is introduced. Recently, spectral methods have become popular approaches in graph matching [65]. Based on the adjacency matrix and Laplacian matrix representation of a graph, its eigensystem (i.e. its eigenvectors and eigenvalues) is extracted and analyzed. This approach is referred to as the spectral decomposition of a graph, and the resulting graph representation is called its spectral representation. The spectral representation can be used in a variety of ways to perform graph matching. In [66], inexact graph matching is performed by calculating the Levenshtein distance on the eigenvectors of the graphs. Another approach, illustrated in [67], converts the adjacency matrix into a string, then uses the leading eigenvector to impose a serial ordering on the string. Graphs are then matched by applying string matching techniques to their string representations. A different idea is pursued in [65, 68], where eigen(sub)space projections and vertex clustering methods are explored. In both approaches the eigenspace of the graph matrices is renormalized using eigenspace renormalization projection clustering (EPC) in order for the approach to be able to match graphs with different numbers of vertices. Whereas in [68] the objective of the method is to work in the eigenspace of the graphs, in [65] similar subgraphs are matched based on their vertex connectivities defined in the common subspace. With the increasing popularity of kernel methods, a number of structural kernels for graphs have been derived. In [69, 70] a kernel on graph data that essentially describes the graph by means of substructures has been proposed. Another class of kernel functions, derived from random walks on graphs, is described in [71, 72]. The above methods can be characterized as explicitly addressing the matching problem via a kernel function. In [73] the authors propose a graph matching system for unconstrained large graphs based on functional interpolation theory. Finally, graph heat kernels [74, 75] have recently been explored to derive graph representations that are stable under structural error when used for graph matching.

1.2 Decision Trees and Graph Database Filtering

In many applications, particularly in pattern recognition and information retrieval, the computational complexity of graph matching is further increased by the fact that not just a pair of graphs is to be compared with

each other, but some input graph is to be matched against an entire database of graphs. Hence an additional factor proportional to the database's size is introduced in the matching process. A variety of mechanisms have been proposed to reduce the complexity of graph matching when databases are involved [76, 77, 78, 79, 80, 81]. Messmer and Bunke propose a decomposition approach in [76]. The database of model graphs is decomposed into subgraphs, and common parts are connected in a hierarchical network structure. In [77] they propose an approach based on a tree structure and the adjacency matrices of the graphs. Both approaches suffer from intensive memory usage, making them unsuitable for large databases of graphs. In [78], Shapiro and Haralick suggest organizing the graphs in clusters according to their mutual similarity and indexing each cluster by means of a graph representative. Organizing the graphs in a hierarchy is the idea of the approach proposed in [79] by Sengupta and Boyer. Lopresti and Wilfong pursue the idea of graph probing [80]. Although suitable for comparing a large number of graphs pairwise, their approach is based on assumptions not suitable for graph database retrieval. In [81], Giugno and Shasha propose a method for querying a database of graphs for graph isomorphism. However, their approach is yet to be extended to subgraph isomorphism and error-tolerant paradigms. Overall, the problem of accessing large databases of graphs is still largely unsolved. In this thesis an approach based on machine learning techniques is applied to address that problem. The aim is to develop, implement and test these techniques when used for the retrieval of graphs from large graph databases. Large databases in this context consist of at least thousands of graphs, with a typical graph size ranging from a few tens up to a hundred nodes or more. Such a scenario can't be addressed by the methods developed previously because of the high computational complexity of subgraph isomorphism and error-tolerant graph matching. The algorithms developed in this thesis are of potential interest to any application domain where graphs are a suitable representation formalism. Particular application examples include pattern recognition and computer vision, CAD/CAM applications with databases of mechanical, electrical or electronic components, graphical information systems, and others. The retrieval/matching paradigms considered include graph and subgraph isomorphism, both from the input graph to the database and from the database graphs to the input, as well as error-tolerant matching (i.e. retrieval of graphs with an edit distance to the input that is smaller than some given threshold value). The approach pursued in this work is to filter the graph database with respect to a given input sample.

[Figure 1.1: An illustration of the graph database filtering scheme. The input sample and the graph database are passed to a filtering procedure; the resulting match candidates are then processed by the matching algorithm to produce the final result.]

The idea of filtering a database is to reduce the size of the database by first ruling out as many graphs as possible using a few simple and fast tests. After the filtering phase, an ordinary exact matching algorithm is applied to the remaining database graphs. Hence, the filtering approach simply reduces the database size by as many graphs as possible. An illustration is given in Figure 1.1. More specifically, in a very simple example, consider an input graph g and a set of graphs G = {g_1, ..., g_n} in the database, and assume the goal is to find out if there is a graph in G that is isomorphic to g. In database filtering, the database graphs are first compared to the input graph by a few fast tests. Using decision tree techniques, these tests are based on feature vectors. Suitable features to rule out some graphs from any further consideration are the number of nodes, the number of edges, the number of nodes or edges with a particular label, the number of nodes with a particular degree, and so on. Using these features, the set G can be filtered, leaving only a subset G' ⊆ G of graphs that are potentially isomorphic to g. On this subset more powerful features can perhaps be applied, eventually leaving only a small number of candidate graphs from G on which a full graph isomorphism test needs to be conducted. Clearly this scenario becomes more complicated if extended from graph to subgraph isomorphism, or even to error-tolerant graph matching. The main problems addressed in this thesis can be grouped into three categories:

1. Exploration of useful graph features.
2. Database organization and retrieval for various matching paradigms.
3. Experimental evaluation.
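
To make the example above concrete, the following small Python sketch illustrates feature-based candidate filtering for graph isomorphism. It is an illustration only, not the framework developed in this thesis: the graph representation and the function names (features, isomorphism_candidates) are hypothetical, and the decision tree component that selects the most discriminative features is omitted here.

    from collections import Counter

    def features(graph):
        # graph = (nodes, edges): nodes maps node id -> label, edges maps (u, v) -> label
        nodes, edges = graph
        degree = Counter({v: 0 for v in nodes})
        for (u, v) in edges:
            degree[u] += 1
            degree[v] += 1
        return (
            len(nodes),                                        # number of nodes
            len(edges),                                        # number of edges
            tuple(sorted(Counter(nodes.values()).items())),    # nodes per label
            tuple(sorted(Counter(edges.values()).items())),    # edges per label
            tuple(sorted(Counter(degree.values()).items())),   # nodes per degree
        )

    def isomorphism_candidates(query, database):
        # Identical feature vectors are a necessary, but not sufficient, condition
        # for graph isomorphism; only the surviving candidates are passed on to a
        # full-fledged isomorphism algorithm (cf. Figure 1.1).
        q = features(query)
        return [g for g in database if features(g) == q]

Only the graphs returned by such a filter are subject to exact matching; the decision tree techniques introduced in this thesis serve to select and order the most discriminative of these features for a given database.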

The first problem is the exploration of graph features useful for the proposed filtering approach. The number of features that can potentially be used to characterize graphs is quite large. However, features suited for filtering databases of graphs have to be extractable at low computational cost while guaranteeing high discrimination between as many graphs as possible in the database. It has been found that, in general, simple features such as the number of nodes with a particular label l, or the number of nodes of degree n with a particular label l, describe the underlying graph database sufficiently well for decision tree filtering. In this work, database retrieval for various graph matching paradigms, namely graph isomorphism, subgraph isomorphism and error-tolerant graph matching, is addressed. Common to all approaches is the representation of the graphs as feature vectors. Naturally, feature vector representations of graphs can be of very high dimensionality. The main proposition of this thesis is to reduce the dimensionality of such feature vectors by applying machine learning techniques. The thesis proposes to analyze ("data mine") the feature vectors by a decision tree induction algorithm to identify the most favorable features for a given database. For graph isomorphism, it is clear that a necessary condition for g_1 and g_2 being isomorphic is that they have identical feature values. Hence, given the feature vector f_1 extracted from g_1 and f_2 extracted from g_2, g_1 and g_2 can immediately be ruled out as isomorphic if f_1 and f_2 differ in any feature value. Given a sample graph g_s, the decision tree induced during preprocessing can be used to test the most significant features identified on the database, ruling out a great number of graphs to be tested by a full-fledged isomorphism algorithm. The approach used for graph isomorphism filtering can be extended to subgraph isomorphism filtering in a straightforward way. The basic idea is that the feature values of the subgraph g_s occur at most as many times as they do in the designated supergraph g. For this matching paradigm, there exist two filtering methods. Decision trees induced for graph isomorphism filtering can be used to identify subgraph isomorphism candidates if the traversal algorithm is modified. The other approach is to alter the decision tree structure; in that case, the same traversal algorithm can be used for both tree types, graph isomorphism as well as subgraph isomorphism trees. Motivated by multiple classifier systems, a third database retrieval approach for subgraph isomorphism can be developed by running both approaches concurrently; the resulting graph candidates are then determined by intersecting the candidate sets of both approaches. The error-tolerant approach is based on the concept of imposing a lower limit on the size of a possible maximum common subgraph between database

graphs and input sample. It has been shown in [16] that there exists a direct relation between the maximum common subgraph of two graphs g_1 and g_2 and the associated edit costs of transforming g_2 into g_1 (assuming a certain class of cost functions). Thus, by bounding the required minimum size of the maximum common subgraph between input sample graph and database graphs, one can specify the minimum similarity of the input graph and the graphs retrieved from the database. The error-tolerant filter assumes a limited class of feature types extracted from the graphs. Based on these features, it is possible to make an estimate of the size of the maximum common subgraph of the input sample and the graphs in the database. If the estimated size is below the given input threshold, it is certain that the distance between the graphs is larger than required by the user, and the graph can be ruled out from the candidate set. This algorithm has been found to be very effective for database filtering. All methods described above have been developed and experimentally evaluated in this project. During this thesis, a wide variety of synthetic as well as real-world graph data has been collected. The approaches developed have then been extensively tested on synthetically generated graphs as well as real-world graphs. During the experimental evaluation, the suitability of all proposed filtering approaches for pattern indexing and retrieval has been thoroughly demonstrated.

1.3 Outline

This thesis is organized into four parts. Part I deals with the theoretical background to the developed approaches. It is organized into two chapters. The first chapter (Chapter 2) gives a basic introduction to graph theory and the principles needed in the context of this thesis. Following that, Chapter 3 gives an introduction to machine learning techniques, specifically decision tree methods. Part II addresses the problem of graph database filtering based on feature vectors. First, in Chapter 4, a theoretical study on graph matching performance and database filtering is presented. Parameters influencing the performance of a filtering approach are identified and formally described. Following that, Chapter 5 introduces the features used in this work and how they can be used to establish relations between graphs. Basic filtering methods for the three considered matching paradigms are introduced. In Chapter 6, the main contribution of this thesis is introduced. In this chapter it is shown how data mining can be performed in order to reduce

the dimensionality of given feature vectors. The filtering paradigms introduced are applied to the reduced feature vectors returned by the decision tree induction algorithm. Part III presents experimental results for all studies conducted during the thesis. During these experiments, numerous graph databases have been used to evaluate the proposed methods. The databases consist of graphs from different generators (artificial graphs such as random graphs, meshes, bounded valence graphs) and several other extraction tools (graph representations of fingerprints, chemical compounds and web documents), representing a wide variety of graph data in use today. Chapter 7 gives an overview of the databases used in this thesis. The following chapters then illustrate the various experiments made (Chapters 9 & 10). At first, an experiment evaluating state-of-the-art graph matching algorithms and their performance is presented. Following this introductory study, experiments evaluating general feature suitability are documented. Based on these experiments, the feature vector comparison is evaluated for all matching paradigms studied. Then, the work done on decision trees in combination with feature vector comparison is illustrated. Finally, conclusions are drawn and future work based on this thesis is presented in Chapter 11. In Part IV the appendices of this work are listed. Appendix A contains a study where the error-tolerant filtering approach is used as a classifier on graph representations of fingerprints. This study illustrates that, although developed for database filtering, the approaches are also of potential interest for graph classification purposes. Appendix B contains an overview of the framework developed in this thesis. It is not meant as detailed code documentation but rather as a general overview of the implemented framework.


Part I  Fundamentals


Introduction

This thesis combines concepts from structural pattern recognition (graph theory) with statistical methods used in data analysis (decision trees). Before presenting the approaches developed in this work, a solid theoretical framework needs to be established for both fields. This part deals with the theoretical background of the applied approaches. Chapter 2 gives a brief introduction to graph theory and the principles needed in the context of this thesis. Chapter 3 gives an introduction to machine learning techniques, specifically decision tree methods.


Chapter 2  Graph Theory

Graphs are a powerful and universal data structure used in various fields of science and engineering, e.g. structural pattern recognition. When graphs are used for the representation of objects, the problem of comparing different objects to each other can be formulated as the search for correspondences between attributed graphs representing the objects. Thus, using graphs, the pattern recognition problem becomes a graph matching problem. Structural pattern recognition by means of graph matching is attractive because graphs are a universal representation formalism. Graph matching is, however, expensive from the computational complexity point of view. The task of matching two graphs usually requires time and memory resources exponential in the number of nodes or edges involved. Correspondences between graphs can be established by a variety of graph relation paradigms. Popular paradigms found today include graph and subgraph isomorphism detection for exact matching approaches. However, in real-world applications, patterns are often affected by noise, so it is necessary to incorporate error correction into the matching process. Although subgraph isomorphism includes error correction to some extent, it is often too restrictive for real-world applications. As a consequence, error correction in graph matching is usually tackled by other concepts such as graph edit distance (error-correcting graph isomorphism) or maximum common subgraph comparison. This chapter introduces graph theory as used in this thesis. In the next section, basic terms from graph theory as needed in this work will be introduced.

Following that, the concepts of graph matching (exact as well as inexact) and graph distance will be specified.

2.1 Definitions and Notation

Graphs are assumed to be attributed, that is, their elements (nodes and edges) are assigned attributes (or labels). Such graphs are generally called attributed graphs. A formal definition is given below:

Definition 1 (Graph) A graph is a six-tuple g = (V, E, α, β, L_V, L_E), where V denotes a finite set of nodes, E ⊆ V × V is a finite set of edges, α : V → L_V is a node labelling function, β : E → L_E is an edge labelling function, L_V is a set of node labels, and L_E is a set of edge labels.

Graphs as defined above are usually referred to as attributed graphs. In this thesis, the terms graph and attributed graph will be used synonymously. In the above definition, L_V and L_E denote finite or infinite sets of node and edge labels, respectively. Nodes are often referred to as vertices of a graph. The terms node and vertex will also be used synonymously in this work. The graphs considered are defined to be directed graphs, i.e. there is an edge from v_1 to v_2 if (v_1, v_2) ∈ E. However, undirected graphs can easily be modelled if for each edge (v_2, v_1) ∈ E an edge (v_1, v_2) ∈ E pointing in the other direction is required; consequently, all concepts presented in this work can be applied to undirected graphs as well. A simple example of a graph is given in Figure 2.1. The graph in this example can be defined as follows:

[Figure 2.1: An example of an attributed directed graph.]

V = {1, 2, 3, 4, 5}
E = {(2, 1), (4, 2), (5, 3), (3, 4), (4, 3)}
α: 1 → C, 2 → A, 3 → D, 4 → B, 5 → D
β: (2, 1) → d, (4, 2) → c, (5, 3) → c, (3, 4) → b, (4, 3) → a
L_V = {A, B, C, D}
L_E = {a, b, c, d}

Graphs can be assigned various attributes or properties. For example, the size |g| of a graph g is defined as the number of nodes in the graph. Considering a node v in a graph, one can define the in- and out-degree of v. The in-degree in(v) of a node v is defined as the number of incoming edges. Similarly, the out-degree out(v) of a node v is the number of edges starting from that node. The degree deg(v) of a node v is defined as the sum of in- and out-degree, deg(v) = in(v) + out(v). Applied to the example shown in Figure 2.1, the properties graph size and node degree(s) are as follows:

|g| = 5
in(v_1) = 1, in(v_2) = 1, in(v_3) = 2, in(v_4) = 1, in(v_5) = 0
out(v_1) = 0, out(v_2) = 1, out(v_3) = 1, out(v_4) = 2, out(v_5) = 1
deg(v_1) = 1, deg(v_2) = 2, deg(v_3) = 3, deg(v_4) = 3, deg(v_5) = 1
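
The following short Python sketch encodes the graph of Figure 2.1 directly in the notation of Definition 1 and recomputes the size and degree values listed above. The class and method names are illustrative assumptions, not part of the framework developed in this thesis.

    from dataclasses import dataclass

    @dataclass
    class Graph:
        V: set          # finite set of node identifiers
        E: set          # directed edges (u, v), a subset of V x V
        alpha: dict     # node labelling function V -> L_V
        beta: dict      # edge labelling function E -> L_E

        def size(self):
            return len(self.V)

        def in_degree(self, v):
            return sum(1 for (_, w) in self.E if w == v)

        def out_degree(self, v):
            return sum(1 for (u, _) in self.E if u == v)

        def degree(self, v):
            return self.in_degree(v) + self.out_degree(v)

    # the example graph of Figure 2.1
    g = Graph(
        V={1, 2, 3, 4, 5},
        E={(2, 1), (4, 2), (5, 3), (3, 4), (4, 3)},
        alpha={1: 'C', 2: 'A', 3: 'D', 4: 'B', 5: 'D'},
        beta={(2, 1): 'd', (4, 2): 'c', (5, 3): 'c', (3, 4): 'b', (4, 3): 'a'},
    )
    assert g.size() == 5 and g.degree(3) == 3 and g.degree(5) == 1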

It is often necessary to formally describe parts of a graph. To do this, the concept of a subgraph can be defined:

Definition 2 (Subgraph) A subgraph g_s = (V_s, E_s, α_s, β_s, L_V, L_E) of a graph g, g_s ⊆ g, is a six-tuple where V_s ⊆ V, E_s ⊆ E ∩ (V_s × V_s), α_s(v) = α(v) for all v ∈ V_s, and β_s(e) = β(e) for all e ∈ E_s.

From the above definition it is easy to see that, given a graph g, any subset of its vertices uniquely defines a subgraph g_s of g. Frequently the condition on the edge set, E_s ⊆ E ∩ (V_s × V_s), is replaced by the more rigorous condition E_s = E ∩ (V_s × V_s). In that case, the subgraph is called an induced subgraph. In this work, unless otherwise stated, the term subgraph always refers to a subgraph as described in Definition 2. An example of a subgraph is given in Figure 2.2.

[Figure 2.2: A graph g and an induced subgraph g_s.]
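
Continuing the illustrative sketch from above, a vertex subset determines an induced subgraph by keeping exactly the edges with both endpoints in the subset (E_s = E ∩ (V_s × V_s)) and restricting the labelling functions. The function name is again a hypothetical choice for this example.

    def induced_subgraph(g, Vs):
        # keep exactly the edges of g with both endpoints in Vs and
        # restrict the labelling functions alpha and beta accordingly
        Es = {(u, v) for (u, v) in g.E if u in Vs and v in Vs}
        return Graph(V=set(Vs),
                     E=Es,
                     alpha={v: g.alpha[v] for v in Vs},
                     beta={e: g.beta[e] for e in Es})

    # the induced subgraph g_s of Figure 2.2, spanned by the nodes {3, 4, 5}
    gs = induced_subgraph(g, {3, 4, 5})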

2.2 Graph Matching

Graphs as defined in the previous section are useful concepts for various tasks. In this work it is assumed that graphs represent objects or, more generally speaking, patterns. In order to determine relations between such objects, it is therefore necessary to compare their graph structures. Such comparison approaches are usually referred to as graph matching methods. In a general context, two graphs g and g' can be compared by determining a mapping function u which associates nodes and edges of g with nodes and edges of g' and vice versa. Some of the more frequently used mapping functions are graph isomorphism, subgraph isomorphism, maximum common subgraph, and graph edit distance. One of the simplest mapping functions is graph isomorphism. A graph isomorphism between two graphs g and g' is given if there exists a bijective mapping u from the nodes of g to the nodes of g' such that the structure of the edges as well as all node and edge labels are preserved under u. More formally:

Definition 3 (Graph Isomorphism) A bijective function u : V → V' is a graph isomorphism from a graph g = (V, E, α, β, L_V, L_E) to a graph g' = (V', E', α', β', L_V, L_E) if:

- α(v) = α'(u(v)) for all v ∈ V;
- for any edge e = (v_1, v_2) ∈ E there exists an edge e' = (u(v_1), u(v_2)) ∈ E' such that β(e) = β'(e');
- for any edge e' = (v'_1, v'_2) ∈ E' there exists an edge e = (u⁻¹(v'_1), u⁻¹(v'_2)) ∈ E such that β(e) = β'(e').

Intuitively speaking, two graphs are isomorphic to each other if they are equal. Naturally, for most applications this mapping is much too strict. In practical scenarios, more relaxed concepts are introduced, one of which is subgraph isomorphism. Similarly to mapping one graph to another, a graph can be mapped onto parts of another graph. Such a mapping can be defined as a subgraph isomorphism.

Definition 4 (Subgraph Isomorphism) An injective function u : V → V' is a subgraph isomorphism from g to g' if there exists a subgraph g_s ⊆ g' such that u is a graph isomorphism from g to g_s.
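
Definition 3 can be turned directly into a test, albeit a very naive one of exponential complexity: enumerate all bijections between the node sets and check label and edge preservation. The sketch below, based on the illustrative Graph class introduced earlier, is given only to make the definition concrete; it is far from the optimized matching algorithms cited in Chapter 1.

    from itertools import permutations

    def is_isomorphic(g1, g2):
        # brute-force check of Definition 3
        if len(g1.V) != len(g2.V) or len(g1.E) != len(g2.E):
            return False
        nodes1 = sorted(g1.V)
        for perm in permutations(sorted(g2.V)):
            u = dict(zip(nodes1, perm))            # candidate bijection u: V -> V'
            if any(g1.alpha[v] != g2.alpha[u[v]] for v in nodes1):
                continue                           # node labels not preserved
            mapped = {(u[a], u[b]) for (a, b) in g1.E}
            if mapped == g2.E and all(g1.beta[(a, b)] == g2.beta[(u[a], u[b])]
                                      for (a, b) in g1.E):
                return True                        # edge structure and labels preserved
        return False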

[Figure 2.3: Two graphs g_1 and g_2 and their maximum common subgraph mcs(g_1, g_2).]

Subgraph isomorphism is a very useful concept, allowing identification of well-known/well-defined patterns in larger graphs. Based on subgraph isomorphism, another mapping, the maximum common subgraph of two graphs, can be defined.

Definition 5 (Maximum Common Subgraph) Given two graphs g_1 = (V_1, E_1, α_1, β_1, L_V, L_E) and g_2 = (V_2, E_2, α_2, β_2, L_V, L_E), a common subgraph is defined as a graph g = (V, E, α, β, L_V, L_E) such that a subgraph isomorphism exists from g to g_1 as well as from g to g_2. A common subgraph g of g_1 and g_2 is called a maximum common subgraph, mcs(g_1, g_2), if no other common subgraph of g_1 and g_2 exists with more nodes than g.

Obviously, the maximum common subgraph of two graphs defines the parts of both graphs which are identical to one another. An intuitive description of the maximum common subgraph is that it describes the intersection between the two graphs. Note that the maximum common subgraph mcs(g_1, g_2) of two given graphs g_1, g_2 does not necessarily need to be unique. In the remainder of this work, the abbreviations mcs(g_1, g_2) or even mcs (where the context is clear) are used for the maximum common subgraph of two graphs. An example of a maximum common subgraph is given in Figure 2.3. Similarly to the maximum common subgraph being the intersection between two graphs, one can introduce the concept of a set union in terms

of graphs. The minimum common supergraph describes such a concept.

Definition 6 (Minimum Common Supergraph) Given two graphs g_1 = (V_1, E_1, α_1, β_1, L_V, L_E) and g_2 = (V_2, E_2, α_2, β_2, L_V, L_E), a common supergraph is defined as a graph g = (V, E, α, β, L_V, L_E) such that a subgraph isomorphism exists from g_1 to g as well as from g_2 to g. A common supergraph g of g_1 and g_2 is called a minimum common supergraph MCS(g_1, g_2) if no other common supergraph of g_1 and g_2 exists with fewer nodes than g.

The minimum common supergraph, similar to the maximum common subgraph, does not need to be uniquely defined for two given graphs. In graph matching, it is often required to specify the distance δ between two given graphs. There exists a wide variety of distance functions on graphs (see [1, 15, 55, 8]). A commonly used approach is the well-known graph edit distance [1, 76]. In this approach, the distance δ(g_1, g_2) of two graphs is measured by transforming graph g_1 into g_2, applying a sequence of edit operations to g_1 (i.e. node deletion/insertion/substitution and edge deletion/insertion/substitution). Each edit operation is assigned a cost (under a cost function defined by the user), and the cost of a sequence of operations is the sum of the individual edit operation costs. The distance between the two graphs is then in general determined by the minimum cost necessary to transform g_1 into g_2. For the sake of completeness, graph edit distance will be formally introduced in the remainder of this section.

Definition 7 (Error-Correcting Graph Matching) Let g_1 = (V_1, E_1, α_1, β_1, L_V, L_E) and g_2 = (V_2, E_2, α_2, β_2, L_V, L_E) be graphs. An error-correcting graph matching (ecgm) from g_1 to g_2 is a bijective function f : V̂_1 → V̂_2, where V̂_1 ⊆ V_1 and V̂_2 ⊆ V_2. A node u ∈ V̂_1 is said to be substituted by node v ∈ V̂_2 if f(u) = v. If α_1(u) = α_2(f(u)) then the substitution is called an identical substitution; otherwise it is termed a non-identical substitution. Furthermore, any node from V_1 - V̂_1 is deleted from g_1, and any node from V_2 - V̂_2 is inserted in g_2 under f. In the following, ĝ_1 and ĝ_2 denote the subgraphs of g_1 and g_2 that are induced by the sets V̂_1 and V̂_2, respectively.

The mapping f directly implies an edit operation on each node in g_1 and g_2, i.e. nodes are substituted, deleted, or inserted, as described above. Additionally, the mapping f indirectly implies edit operations on the edges

of g_1 and g_2. If f(u_1) = v_1 and f(u_2) = v_2 and there exist edges (u_1, u_2) ∈ E_1 and (v_1, v_2) ∈ E_2, then edge (u_1, u_2) is substituted by (v_1, v_2) under f. Otherwise, if there exists no edge (u_1, u_2) ∈ E_1, but an edge (v_1, v_2) ∈ E_2, then edge (v_1, v_2) is inserted. Similarly, if (u_1, u_2) ∈ E_1 exists but no edge (v_1, v_2) ∈ E_2, then (u_1, u_2) is deleted under f. If a node u is deleted from g_1, then any edge incident to u is deleted, too. Similarly, if a node u is inserted in g_2, then any edge incident to u is inserted, too. Obviously, any ecgm f can be understood as a set of edit operations (substitutions, deletions, and insertions of both nodes and edges) that transform a given graph g_1 into another graph g_2 under f.

[Figure 2.4: An example of an error-correcting graph matching of two graphs g_1 and g_2.]

In Figure 2.4, an example of an error-correcting graph matching is given. These graphs are defined as:

V_1 = {1, 2, 3}; V_2 = {4, 5, 6}; L_V = {A, B, C, D}.
E_1 = {(1, 2), (1, 3), (2, 3)}; E_2 = {(4, 5), (4, 6), (5, 6)}; L_E = {a, b, c, d}.
α_1: 1 → D, 2 → C, 3 → B.   α_2: 4 → D, 5 → C, 6 → A.
β_1: (1, 2) → a, (1, 3) → b, (2, 3) → b.   β_2: (4, 5) → a, (4, 6) → c, (5, 6) → c.

A possible ecgm is f: 1 → 4, 2 → 5 with V̂_1 = {1, 2} and V̂_2 = {4, 5}. Under this ecgm, nodes 1 and 2 are substituted by 4 and 5, respectively.

Consequently, edge (1, 2) is substituted by edge (4, 5). Note that all these substitutions are identical substitutions. Under f, node 3 and edges (1, 3) and (2, 3) are deleted, and node 6 together with its incident edges (4, 6) and (5, 6) is inserted. There are of course many other ecgms from g_1 to g_2. Obviously, there exist exponentially many error-correcting graph matchings between two graphs g_1, g_2. To measure the quality of an ecgm, the cost of an ecgm is defined:

Definition 8 (Cost of an Error-Correcting Graph Matching) The cost of an ecgm f : V̂_1 → V̂_2 from a graph g_1 = (V_1, E_1, α_1, β_1, L_V, L_E) to a graph g_2 = (V_2, E_2, α_2, β_2, L_V, L_E) is given by

c(f) = Σ_{u ∈ V̂_1} c_ns(u) + Σ_{u ∈ V_1 - V̂_1} c_nd(u) + Σ_{u ∈ V_2 - V̂_2} c_ni(u) + Σ_{e ∈ E_s} c_es(e) + Σ_{e ∈ E_d} c_ed(e) + Σ_{e ∈ E_i} c_ei(e),

where c_ns(u) is the cost of substituting node u ∈ V̂_1 by f(u) ∈ V̂_2, c_nd(u) is the cost of deleting node u ∈ V_1 - V̂_1 from g_1, c_ni(u) is the cost of inserting node u ∈ V_2 - V̂_2 in g_2, c_es(e) is the cost of substituting edge e, c_ed(e) is the cost of deleting edge e, c_ei(e) is the cost of inserting edge e, and E_s, E_d, and E_i are the sets of edges that are substituted, deleted, and inserted, respectively. All costs are non-negative real numbers. Notice that the sets E_s, E_d, and E_i are implied by the mapping f. A particular set of costs c_ns, c_nd, ..., c_ei according to the above definition is called a cost function.

Definition 9 (Optimal Error-Correcting Graph Matching / Edit Distance) Let f be an ecgm from a graph g_1 to a graph g_2 under a particular cost function. We call f an optimal ecgm if there exists no other ecgm f' from g_1 to g_2 with c(f') < c(f).

The cost of an optimal ecgm from a graph g_1 to a graph g_2 is also called the edit distance of g_1 and g_2, and is denoted by d(g_1, g_2). Note that each ecgm implies a sequence of edit operations, i.e. insertions, deletions and substitutions. In practical applications the costs c_ns, ..., c_ei introduced in Definition 8 are used to model the likelihood of errors or distortions that may corrupt ideal graphs of the underlying problem domain. The more likely a certain distortion is to occur, the smaller is its cost. Concrete values for c_ns, ..., c_ei have to be chosen depending on the particular application. Graph edit distance as defined above is a very popular concept in error-correcting graph matching. In recent developments, efforts have been made to relate other distance functions to graph edit distance. Another popular graph distance is the following measure based on the maximum common subgraph of two graphs. This distance is defined as

δ(g_1, g_2) = 1 - |mcs(g_1, g_2)| / max(|g_1|, |g_2|).   (2.1)

Intuitively speaking, using the mcs, the distance relates the size of the common parts of g_1 and g_2 to the size of the graphs. As an example, the distance δ(g_1, g_2) between the graphs shown in Figure 2.3 would be

δ(g_1, g_2) = 1 - |mcs(g_1, g_2)| / max(|g_1|, |g_2|) = 1 - 2/3 = 1/3.

Recently [15] it has been shown that this graph similarity measure satisfies, for any three graphs g_1, g_2 and g_3, the following relations:

0 ≤ δ(g_1, g_2) ≤ 1
δ(g_1, g_2) = 0 if and only if g_1 and g_2 are isomorphic
δ(g_1, g_2) = δ(g_2, g_1)
δ(g_1, g_3) ≤ δ(g_1, g_2) + δ(g_2, g_3)

In particular, this means that δ(g_1, g_2) is a metric. Furthermore, assuming a certain class of cost functions for graph edit distance, it can be shown that the distance δ is equivalent to the graph edit distance (see [16, 17]). Hence, all concepts based on this distance function can be directly related to graph edit distance as well.
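
As a small numeric illustration of Definition 8 and Equation (2.1), the sketch below computes the cost of an ecgm under a hypothetical unit cost function (identical substitutions are free, every other edit operation costs 1) and the mcs-based distance. It reuses the illustrative Graph class introduced earlier; the function names are assumptions made for this example only.

    def ecgm_cost(g1, g2, f, cost=1):
        # f maps the substituted nodes of g1 to nodes of g2 (the sets V1-hat, V2-hat)
        inv = {v: u for u, v in f.items()}
        c = sum(cost for u in f if g1.alpha[u] != g2.alpha[f[u]])    # non-identical node subst.
        c += cost * (len(g1.V) - len(f))                             # node deletions
        c += cost * (len(g2.V) - len(f))                             # node insertions
        for (a, b) in g1.E:
            if a in f and b in f and (f[a], f[b]) in g2.E:
                if g1.beta[(a, b)] != g2.beta[(f[a], f[b])]:
                    c += cost                                        # non-identical edge subst.
            else:
                c += cost                                            # edge deletion
        for (a, b) in g2.E:
            if not (a in inv and b in inv and (inv[a], inv[b]) in g1.E):
                c += cost                                            # edge insertion
        return c

    def mcs_distance(size_mcs, size_g1, size_g2):
        # Equation (2.1)
        return 1 - size_mcs / max(size_g1, size_g2)

    print(mcs_distance(2, 3, 3))   # the Figure 2.3 example: 1 - 2/3 = 1/3

For the graphs of Figure 2.4 and the ecgm f: 1 → 4, 2 → 5, this cost function yields 6: one node deletion, one node insertion, two edge deletions and two edge insertions, with all substitutions being identical and therefore free.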

Chapter 3  Decision Trees

The main topic of this thesis is to filter databases of graphs based on feature vectors, and to data mine the feature vectors using decision tree methods. Originally, decision tree algorithms are classification methods useful for machine-based analysis of large datasets of training samples. (Note that the training samples are expected to be labelled, i.e. they consist of a feature vector and a class label.) In this chapter, decision tree theory as used in this work is briefly reviewed. In the next section, an introduction to decision tree classification is given. In Section 3.2 the fundamental principles of decision tree induction and classification are outlined. Then, in Section 3.3, various tree induction strategies defined by split criteria are described. Finally, in Section 3.4, extensions of the basic tree induction schemes to larger datasets are discussed.

3.1 Decision Tree Classification Introduction

Popular decision tree induction algorithms such as C4.5 [83] or CART [84] have their origins in Hunt's Concept Learning Systems [85]. The idea of decision tree algorithms is to analyze a set of so-called training samples, each assigned a class label. The decision tree system splits the training samples into subsets so that the data in each of the descendant subsets is purer than the data in the parent superset. (The meaning of purer is that in the ideal case the final subsets consist only of samples belonging to the same class.) As a classifier, the decision tree can then be used to classify an unknown sample, i.e. a sample with no class information associated, according to the previously analyzed training set. From a more formal point of view, a decision tree based classification method is a supervised learning technique that builds a decision tree from a set of training samples during the machine learning process. This process is also often referred to as data mining. During the data mining step, it is important to see that the tree induction algorithm chooses the most suitable feature from the large set of possible feature values automatically, solely based on the distribution of the feature values extracted from the training sample set (hence the term machine learning). The result of the learning procedure is a tree in which each leaf is labelled by a class name, and each interior node specifies a test on a particular feature, with one branch corresponding to each possible value or range of that feature. Generally, decision tree methods consist of two basic steps:

1. data mining / tree induction
2. classification / tree traversal

In the preprocessing step, a set of training samples is analyzed. Each training sample consists of an instance (the sample itself), a feature vector describing the sample, and a class membership. Based on the set of training samples, a tree structure is induced which later serves to classify unknown samples. During classification, the tree structure previously induced is used to assign a class label to an input sample. The input sample is similar to a training sample, except that the class label information is missing. The decision tree induced during preprocessing is traversed using the feature vector of the input sample. After successful traversal of the tree, the class membership of the input sample can be decided based on the information given in the corresponding leaf of the tree. In Figure 3.1 an example of a decision tree is shown. In the root node of the example tree the current season is evaluated. In this example, the season fundamentally changes the possible activities: winter activities are considered in the left branch, summer sports in the right subtree. Subsequently, other features such as overcast or countryside aspects are used as test criteria until a leaf node is reached (a decision is made). In this example a leaf node's class is associated with a recreational activity proposal.

[Figure 3.1: Example of a simplified decision tree for recreational activity classification. The root tests the season; the winter branch tests overcast (sunny, cloudy) leading to ski or hockey, the summer branch tests location (country, sea) leading to golf or swim.]

The system used in this thesis is based on the well-known C4.5 algorithm [83]. In order to make it suitable for larger datasets, it was customized using the RainForest Framework [86].

3.2 Decision Tree Classifier

The functioning of common decision tree classifiers can usually be divided into two steps, namely tree induction (data analysis) and classification of unknown samples. During preprocessing, a tree structure is induced according to a set of training samples. Based on this tree structure, unknown samples are then classified. This section gives an overview of both steps of decision tree schemes, tree induction as well as classification of unknown samples.

3.2.1 Decision Tree Induction

The aim of decision tree induction is to derive a tree structure which can later be traversed, allowing the classification of samples whose class labels are unknown. The induction algorithm is given a labelled set of samples described by feature vectors. The basic idea is then to split this dataset into subsets in such a way that in the end each subset holds only samples belonging to exactly one class. Based on the training set, the induction algorithm extracts the knowledge (data mining) in the data to automatically

generate a decision tree. The knowledge is extracted by analyzing the feature vector, identifying the features suitable to split the training set into purer subsets. The induction algorithm grows the decision tree top-down in the following way: At the beginning, the entire dataset is assigned to the root node. Then, the feature vectors of the training samples are evaluated, searching for the feature which best splits the training set into purer subsets. The purity of the subsets, in other words the quality of a feature split, is measured using a split criterion. The split criterion evaluates the quality of a split using a feature value, i.e. it measures the gain in purity. There exist a variety of different split criteria. C4.5, for example, uses a criterion measuring the decrease in entropy comparing the original set to the subsets defined by the feature. Split criteria will be discussed more thoroughly in Section 3.3. Once the best feature has been chosen and subsets according to it created, a son node is created for each specific feature value. Each feature value is assigned to an edge connecting the son to the root node, and the subset is assigned to the son node. This procedure is then recursively continued to construct the subsequent layers of the tree. The recursion is stopped once a certain purity in the subsets is achieved or if there are no features left to divide the subset. An illustration of the general decision tree induction scheme is shown in Table 3.1. Note that most decision tree algorithms proceed according to this general scheme. Each node is recursively processed by the induction function induce(Treenode v, Dataset T). The currently processed tree node v as well as the dataset T assigned to v are passed as input parameters. At the beginning of tree induction, induce(v, T) is called with the tree's root node r and the entire database as input dataset T. Each sample t ∈ T is described by a vector f of features f_i and is assigned a predefined class label c. The goal of the induction procedure is to create a tree structure, assigning each leaf node a subset of the sample set T. Based on this subset, the leaf node is assigned a class label. The induction procedure splits the input dataset T into subsets, according to a predefined split criterion s. The split criterion s evaluates the quality of the available features, and then the most suitable feature f_best is selected to divide T into subsets. In general, decision tree induction algorithms only differ in the selection of the best feature through the split criterion s. An example of a decision tree induction process is shown in Figure 3.2. During initialization, the complete set of samples T is assigned to the root node. For all features f_i, i = 1, 2, 3, the split criterion s(T, f_i) is evaluated. Let's assume that the best split at the root node is done based on feature

f_1, then T is split into two subsets T_1 and T_2 according to the two feature values v_11 and v_12. All samples with f_1 = v_11 are assigned to subset T_1 and all samples with f_1 = v_12 to subset T_2. Both subsets are regarded as new datasets T and the procedure is continued until pure subsets for the classes c_1, ..., c_4 are obtained.

Generic Decision Tree Induction Scheme

    function induce(Treenode v, Dataset T) {
        # in T, every valid feature f_i has n_fi feature values,
        # hence T can be split into n_fi subsets according to f_i
        find best split feature f_best based on T and split criterion s

        # T is partitioned into n_fbest subsets according to the
        # best feature f_best and induction is recursively continued
        if (n_fbest > 0) then
            create n_fbest children c_1, ..., c_nfbest of v
            use f_best to partition T into T_1, ..., T_nfbest
            for i = 1 to n_fbest do
                induce(c_i, T_i)
            done
        fi
    }

Table 3.1: Illustration of a generic decision tree induction scheme.

3.2.2 Decision Tree Traversal / Classification

As has been said before, decision trees are used to classify samples whose class labels are unknown. These samples are characterized by a feature vector analogously to the samples in the training set. During the classification step, the decision tree previously induced is traversed, following the tree branches corresponding to the feature values of the input sample vector. After successful traversal, the input sample is assigned the class label occurring with the highest frequency in the leaf node reached. An example traversal can be given looking at Figure 3.2. Assume a sample t_in is given, represented by a feature vector f_tin = (v_12, v_21, v_32). In order to

classify this sample using the given decision tree, the traversal algorithm would first compare the value of f_1 of the input sample with the values assigned to the edges on the first level of the tree, following the right branch and reaching node T_2. Then, the same procedure would be repeated for feature f_3, finally reaching node T_22 and consequently assigning class c_4 to the input sample t_in. (Note that feature f_2 was not used during traversal.)

[Figure 3.2: Schematic decision tree. The training set T is split at the root on feature f_1 into T_1 and T_2, which are split further on f_2 and f_3 into the pure subsets T_11, T_12, T_21, T_22 with classes c_1, ..., c_4. Legend: training set T; features F = {f_1, f_2, f_3}; feature values V = {{v_11, v_12}, {v_21, v_22}, {v_31, v_32}}; classes C = {c_1, c_2, c_3, c_4}.]

It is easy to see that the structure of a decision tree induced as explained before has a significant impact on its classification performance. Furthermore, runtime performance is also affected, since fewer or more tests need to be made depending on the balance of the tree. The tree structure itself, on the other hand, is highly dependent on the split criterion s used. Hence, in the next section, this topic will be discussed in more detail.
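
The generic scheme of Table 3.1 and the traversal just described can be condensed into a short runnable sketch. It is a simplification under several assumptions (categorical feature values, the entropy-based gain criterion of the next section as split criterion, no pruning or stopping heuristics) and is not the C4.5/RainForest implementation used in this thesis.

    import math
    from collections import Counter

    def entropy(samples):
        # samples: list of (feature_vector, class_label) pairs
        counts = Counter(label for _, label in samples)
        n = len(samples)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    def induce(samples, features):
        classes = {label for _, label in samples}
        if len(classes) == 1 or not features:
            return Counter(l for _, l in samples).most_common(1)[0][0]   # leaf: majority class
        def weighted_entropy(f):
            # B(T, f): weighted entropy of the split induced by feature f
            parts = Counter(vec[f] for vec, _ in samples)
            return sum(cnt / len(samples) *
                       entropy([s for s in samples if s[0][f] == val])
                       for val, cnt in parts.items())
        best = min(features, key=weighted_entropy)      # maximizes G(T, f) = E(T) - B(T, f)
        children = {}
        for val in {vec[best] for vec, _ in samples}:
            subset = [s for s in samples if s[0][best] == val]
            children[val] = induce(subset, [f for f in features if f != best])
        return (best, children)                         # interior node: (feature, branches)

    def classify(tree, vec):
        while isinstance(tree, tuple):                  # interior node
            feature, children = tree
            tree = children[vec[feature]]               # follow the branch of the sample's value
        return tree                                     # leaf: class label

Called on a training set such as the one sketched in Figure 3.2, induce builds the tree top-down, and classify reproduces the traversal described above by following one branch per tested feature until a leaf is reached.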

3.3 Split Criteria

Depending on the tree induction framework, there exists a wide variety of split criteria. Popular ones include the gini-index used by CART, as well as the gain and gain-ratio criteria used by the C4.5 algorithm. As the tree induction framework in this study is based on C4.5, the focus has been on split criteria successfully used in C4.5. Namely, the following criteria have been evaluated:

- gain and weighted entropy criterion
- gain-ratio criterion

These measures evaluate the quality of a split in terms of the reduction of the entropy between son nodes and their respective father nodes. Furthermore, by quantifying the information contained in the set of samples of a node, they allow an estimate of the number of tests to be made in order to isolate the classes in the set into pure subsets.

Gain Criterion

In C4.5 subsets are built based on an information theoretical gain criterion. This means that, at each node in the decision tree, the dataset T is split into subsets T_i choosing the feature that maximizes the information gain. For a sample of the dataset T, the probability that the sample belongs to class c can be estimated as follows:

P(c) = (number of samples in T belonging to c) / (number of samples in T).    (3.1)

Using the class probabilities of all classes c_1, ..., c_k the information of the dataset is measured by the entropy

E(T) = − Σ_{i=1}^{k} P(c_i) · ld P(c_i).    (3.2)

If a feature f is selected and the dataset is split into subsets according to all possible values of f, the entropy of the split can be expressed as the weighted sum of the entropies of the subsets T_1, ..., T_n as

B(T, f) = Σ_{i=1}^{n} (|T_i| / |T|) · E(T_i).    (3.3)

In Equation (3.3) f indicates the selected feature and n is the number of possible values of that feature. The entropy E(T_i) of a subset T_i is weighted according to the number of samples in T_i (written as |T_i|) in relation to the number of samples in T. The entropy B of a split is large if the subsets T_i hold samples from different classes. Based on this observation, for each feature f the information gained by splitting T according to the possible outcomes is measured by

G(T, f) = E(T) − B(T, f).    (3.4)

The gain criterion was designed bearing in mind that tree induction could be stopped once the information gain falls below a certain threshold value. However, for database filtering as considered in this thesis, the goal is not to induce a tree which serves to generalize data, but to induce a tree which can be used to identify instances of the samples themselves (see Chapter 6). Hence, the decision tree needs to be specifically overfitted. The gain criterion can be simplified to suit filtering requirements. Motivated by that, another split criterion, the weighted entropy, which is based on gain and gain-ratio yet optimized for the task of filtering graph databases, has been employed. Assume the dataset T consists of feature vectors describing objects. In database filtering, the goal is not to determine the class membership of an input sample but to retrieve samples from the database related to the input sample. Furthermore, the runtime of the algorithm needs to be minimized in order to achieve maximum retrieval performance. Runtime itself crucially depends on the number of tests made to reach a leaf node in the tree. Based on the concept of database retrieval, the following two modifications are applied to the induction paradigm:

- No explicit class label is assigned to the individual objects. The object instance is considered its own class. (Hence, there exist as many classes as there are training samples in T.)
- Generally, there exists no information about the distribution of the object instances. Hence, for simplicity reasons, all objects in the dataset are assumed to be equally probable.

The idea of the weighted entropy criterion is to measure the uncertainty of a given set of objects. Since the goal is to create decision trees as shallow as possible, the entropy needs to be minimized in every internal node. Therefore, when dividing a set of objects assigned to a father node into subsets

assigned to son nodes, the (weighted) summed entropy of the object sets in the son nodes needs to be minimized over all possible features. This leads to the following split criterion. The entropy of a set of graphs with given probabilities is defined as (cf. Equation (3.2)):

E(T) = − Σ_{i=1}^{n} p_i · ld(p_i)

where p_i is the probability of object c_i, hence p_i = P(c_i). Applying the simplification introduced before and assuming all graphs to be equally likely, E can be rewritten as:

E(T) = − Σ_{i=1}^{n} (1/n) · ld(1/n) = − ld(1/n) = ld(n).

Intuitively speaking, the entropy gives a measure of how many decisions have to be taken until a single graph is identified. The greater the entropy, the more decisions have to be taken and therefore the deeper the decision tree becomes. From a father node's point of view, the entropy of every son node needs to be minimized. Hence, the split criterion is the weighted sum of the entropies in the set of son nodes. For the datasets T_i in the son nodes s_i, this quantity is given by:

E_sum(T_1...M) = Σ_{i=1}^{M} (|T_i| / |T|) · E(T_i)

where |T_i| is the cardinality of the object set at son node s_i, E(T_i) is the entropy of this set, and M is the number of son nodes of the considered father node. It is clear that the gain and the weighted entropy criterion will have a strong bias towards features producing many subsets T_i of the initial dataset T. A simple example would be a dataset of samples containing a feature f_id which is a unique identifier of each sample in T. Using feature f_id as the feature to be tested, dataset T would be split into n subsets where n = |T|, thus producing one subset per training sample. Gain and weighted entropy would consider such a split to be optimal, since each subset would maximize the information gain by generating the purest possible successor sets. (Recall that in this simplified context an object is considered equivalent to a class.)
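The quantities defined above can be computed with a few lines of code. The following sketch is a toy illustration under the simplifying assumptions of this section (every sample is its own class, all samples equally probable); the feature names and the small dataset are assumptions chosen only to show the bias discussed above.

from collections import Counter
from math import log2

def entropy(labels):
    """E(T) = -sum_i P(c_i) * ld P(c_i), cf. Equation (3.2)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def split_entropy(samples, labels, feature):
    """B(T, f): weighted entropy of the subsets induced by feature f, Equation (3.3).
    With one class label per sample this equals the weighted entropy criterion E_sum."""
    total = len(samples)
    subsets = {}
    for sample, label in zip(samples, labels):
        subsets.setdefault(sample[feature], []).append(label)
    return sum(len(sub) / total * entropy(sub) for sub in subsets.values())

def gain(samples, labels, feature):
    """G(T, f) = E(T) - B(T, f), Equation (3.4)."""
    return entropy(labels) - split_entropy(samples, labels, feature)

# One class per training instance, as in the filtering setting.
samples = [{"f1": 0, "f_id": 1}, {"f1": 0, "f_id": 2}, {"f1": 1, "f_id": 3}]
labels = ["s1", "s2", "s3"]
for f in ("f1", "f_id"):
    print(f, gain(samples, labels, f), split_entropy(samples, labels, f))
# The unique identifier f_id yields the maximal gain and zero weighted entropy,
# illustrating the bias towards features with many distinct values.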

In general, such a bias towards many-valued features is undesired in a classification context, since it would overfit the classifier structure to the training data, thus preventing proper classification of new, unknown samples. Simply speaking, classification of unknown samples would become all but impossible. To overcome this deficiency, the gain ratio criterion has been introduced.

Gain Ratio Criterion

The bias inherent to the gain criterion can be corrected by a kind of normalization, adjusting the apparent gain of many outcomes by the size of these outcomes. The shortcoming of the gain criterion G(T, f) is that it simply ignores the sizes of the individual subsets T_i of T. Hence, unfavorable splits can be avoided if the entropy of the sizes of the subsets is integrated into the split criterion. The information of the split is computed similarly to the class information (see Equation (3.2)) as

S(T, f) = − Σ_{i=1}^{n} (|T_i| / |T|) · ld(|T_i| / |T|).    (3.5)

Based on S(T, f) one can formulate the gain-ratio criterion. Gain-ratio is composed of the gain criterion G(T, f) previously introduced, normalized by the information inherent to the split of the dataset, S(T, f):

G'(T, f) = G(T, f) / S(T, f).    (3.6)

Using gain-ratio, the bias inherent to the gain criterion can be avoided. As a consequence, decision trees derived by applying gain-ratio are much more balanced and have been shown to be much more stable classifiers.

3.4 RainForest - Extension To Large Datasets

Most algorithms used for decision tree induction are so-called main memory algorithms, i.e. the entire dataset is held in main memory during tree induction. Even though main memory size is continually being increased, today's datasets are in general much larger than main memory. There exist several approaches dealing with such large datasets. Common techniques are partitioning the data such that each subset fits into main memory [87], discretization of the feature vectors [88], or custom sampling of the dataset

at each node of the decision tree [89]. Partitioning the dataset into subsets before starting the induction usually results in classifiers producing worse classification results. Discretization methods, on the other hand, still expect that the entire dataset fits into main memory at one point in time. Custom sampling of the dataset at each node in the tree also assumes sampling of the entire dataset at the root node, and it produces a significant sampling overhead during tree induction. A formalism analyzing these problems and proposing solutions for various scenarios is presented by Gehrke et al. in the RainForest framework [86]. RainForest describes a framework applicable to most decision tree induction algorithms used today, ensuring scalability for large datasets. RainForest assumes the top-down tree induction scheme shown in Table 3.1. At the root node, the database is examined and a split criterion is computed. According to the split criterion the best feature is chosen and the dataset is split into subsets assigned to son nodes of the root node. This procedure is recursively continued until a certain termination criterion is fulfilled.

Dataset:

  sample   season   overcast   location   class-label
  s_1      winter   cloudy     country    hockey
  s_2      winter   sunny      country    ski
  s_3      winter   cloudy     sea        hockey

AVC-Sets:

  attribute   value     sample(s)        class(es)      AVC-Set
  season      winter    s_1, s_2, s_3    hockey, ski    {{hockey, ski}}
  overcast    cloudy    s_1, s_3         hockey         {{hockey}, {ski}}
              sunny     s_2              ski
  location    country   s_1, s_2         hockey, ski    {{hockey, ski}, {hockey}}
              sea       s_3              hockey

Table 3.2 Example of AVC-Sets for various feature values of a given dataset.

The RainForest framework makes use of the observation that at

any internal node t in the tree, the evaluation of a feature f through the split criterion is independent of other feature values. Based on this observation, the AVC-Set of a feature f is defined (AVC-Set = Attribute-Value-Classlabel-Set). The AVC-Set of a feature f is defined to be the projection of the given dataset onto the sets of classes, based on the various values of f. (In decision tree induction a dataset is usually dependent upon a tree node v; hence, a given AVC-Set is indirectly dependent upon a tree node v.) An example is given in Table 3.2. In this example, the dataset consists of three training samples, {s_1, s_2, s_3}. For each sample, a feature vector has been extracted consisting of three features, namely season, overcast and location. Each sample is also assigned a class label denoting a recreational activity. Based on this training set, three AVC-Sets corresponding to the three features can be extracted. As an example, consider the AVC-Set defined by feature overcast. The feature value of samples s_1 and s_3 for feature overcast is cloudy. Both samples belong to class hockey. Hence, hockey is added to the AVC-Set of feature overcast. Sample s_2 belongs to class ski, hence class ski is added to the AVC-Set. The resulting AVC-Set consists of two sets, {hockey} for attribute value overcast=cloudy and {ski} for attribute value overcast=sunny. It is important to see that the size of the AVC-Set of a feature f and a given dataset is solely dependent on the number of distinct values of f and the number of class labels in the dataset. The set of all AVC-Sets at a given tree node v is called the AVC-group of node v. Note that AVC-group and AVC-Set are not merely another representation of the database T. It is not possible to reproduce T from its AVC-group, since the AVC-group does not contain information on the individual samples in the dataset. Rather, this information has been summarized using class label information. Using the AVC-Set of a given feature f, RainForest extends the general induction scheme presented in Table 3.1 as shown in Table 3.3. As can be seen, RainForest isolates the process of best-feature selection from the given decision tree algorithm. Furthermore, it limits the input data of the selection method to only portions of the current dataset T: only the AVC-Sets are given to the selection procedure, and only one AVC-Set is used at any given time. Depending on the size of the database and the size of the AVC-Sets and AVC-groups, various tuning techniques can be used. For small databases, it is most likely that for every node v the entire AVC-group will fit into main memory. Hence, in that case, no specialization needs to be implemented. On the other hand, if the database is too large to fit an AVC-group in main memory, other tuning techniques become necessary. These techniques include simple approaches like storing parts of the database as physical files on disk (e.g. text files), or exporting them to database systems (e.g. SQL databases such as MySQL or PostgreSQL).

Generic Decision Tree Induction Scheme

function induce ( Treenode v, Dataset T ) {
  # in T, every valid feature f_i has n_fi feature values,
  # hence T can be split into n_fi subsets according to f_i
  foreach feature f_i valid in T do
    evaluatesplit ( AVC-Set of f_i, T )
    if ( isbetterfeature ( f_i ) ) then f_best = f_i
  done

  # T is partitioned into n_fbest subsets according to f_best
  # and induction is recursively continued
  if ( n_fbest > 1 ) then
    create n_fbest children c_1, ..., c_nfbest of v
    use f_best to partition T into T_1, ..., T_nfbest
    for i = 1 to n_fbest do
      induce ( c_i, T_i )
    done
  fi
}

Table 3.3 Illustration of the decision tree induction scheme as proposed by the RainForest framework.
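The construction of AVC-Sets can be sketched in a few lines of code. The following toy example follows Table 3.2 and illustrates only the projection described above (per-value class sets; the per-class counts kept by a full RainForest implementation are omitted for brevity); the dictionary-based dataset encoding is an assumption made for the example.

from collections import defaultdict

def avc_set(dataset, attribute):
    """Project the dataset onto the class labels, per value of the attribute.

    Returns a mapping attribute value -> set of class labels, i.e. the AVC-Set
    of the attribute as described in Section 3.4 (counts omitted for brevity).
    """
    projection = defaultdict(set)
    for sample in dataset:
        projection[sample[attribute]].add(sample["class"])
    return dict(projection)

# Toy dataset of Table 3.2.
dataset = [
    {"sample": "s1", "season": "winter", "overcast": "cloudy", "location": "country", "class": "hockey"},
    {"sample": "s2", "season": "winter", "overcast": "sunny",  "location": "country", "class": "ski"},
    {"sample": "s3", "season": "winter", "overcast": "cloudy", "location": "sea",     "class": "hockey"},
]

# The AVC-group of a node is the collection of the AVC-Sets of all attributes.
for attribute in ("season", "overcast", "location"):
    print(attribute, avc_set(dataset, attribute))
# overcast -> {'cloudy': {'hockey'}, 'sunny': {'ski'}}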


Part II

Decision Trees for Graph Database Filtering


Introduction

As has been shown in previous chapters, graph matching is attractive for many pattern recognition problems since graphs are a universal representation formalism. However, graph matching is also a computationally expensive approach. Besides the pairwise comparison of graphs, it is often required to match an input graph against a database of graphs, retrieving suitable model graphs. There exist a variety of approaches, based on different assumptions and constraints, dealing with the retrieval of graphs from databases (see [11, 1, 76, 8, 81]). In this thesis, the concept of filtering a database is pursued. Compared to other techniques, in database filtering there is no explicit need to provide an exact solution to the matching paradigm. For example, if all database graphs isomorphic to the input graph are requested, a filtering approach does not need to provide an exact solution. It is required that the filter identifies graphs possibly isomorphic to the input graph and returns these graphs as a result. It is also a necessity that the filter eliminates as many graphs as possible that are not isomorphic to the input graph. It is, however, not requested that the filter exactly identifies all graphs isomorphic to the input sample. The filter is only required not to rule out graphs isomorphic to the input graph. The calculation of the exact solution is passed on to external tools. Under a given matching paradigm (e.g. graph isomorphism), filtering algorithms merely need to rule out as many candidates as possible not fulfilling the matching paradigm. This part of the thesis introduces various aspects to be considered when using database filtering. In Chapter 4, the performance of database filtering is analyzed from the theoretical point of view. Then, Chapter 5 presents a very simple filtering approach based on feature vectors extracted from graphs. Finally, Chapter 6 describes the main approach of this thesis, an improved method using decision trees to filter databases of graphs.


Chapter 4

Graph Database Filtering - A Performance Analysis

This thesis analyzes the problem of efficiently filtering databases of graphs based on given matching paradigms. The solution proposed in this work is based on the filtering concept previously introduced. It can easily be seen that the effectiveness of any filtering approach crucially depends on the underlying matching algorithm as well as on the graphs stored in the database. This leads to the conjecture that the performance of a proposed graph filtering approach can be increased if it is used in combination with suitable matching algorithms. In this chapter, the matching as well as non-matching performance of the various graph matching paradigms considered in this thesis is evaluated. Furthermore, their influence on database filtering is examined. The problem of measuring the performance of different graph matching algorithms has been studied numerous times [90, 91, 92]. However, the main objective in these studies was measuring and comparing the time needed to compare two matching graphs (i.e. isomorphic or subgraph isomorphic graphs). Little attention was given to the performance of the algorithms when comparing non-matching graphs. This is, however, a crucial factor in graph database filtering methods (as will be shown below). The objective of the study presented in the following sections is to examine the influence of graph matching algorithms when used in combination with graph database filtering methods, providing a solid theoretical analysis of the problem.

In the following sections, a theoretical framework for the problem of database filtering will be derived in Section 4.1, and conclusions will be drawn in Section 4.2. In Chapter 8, an experimental study is presented in which the developed theory is evaluated for various graph types and matching algorithms.

4.1 Theoretical Examination

In this section, the theoretical foundations for a performance evaluation of filtering procedures as considered in this thesis will be established. As has been explained in Section 1.2, a filtering procedure is considered to be a method which allows the reduction of a graph database with respect to a given input graph. The concept of such a procedure has been shown in Figure 1.1. After the filtering procedure has been applied, there are in general still some candidates remaining that need to undergo a full matching procedure with the chosen matching algorithm, for example Ullmann's algorithm [7] or VF [93]. To measure the effectiveness of a filter it is straightforward to compare it to a brute force approach. That is, filtering in combination with the chosen matcher is compared against sequential testing of the entire database with the same matcher (brute force approach). To formally describe the problem, the following variables are introduced:

- T^i_filter, the total time needed to extract all matching graphs for a given input graph from the database using the filtering approach in conjunction with some graph matching algorithm i (for example, i = Ullmann's algorithm).
- T^i_brute_force, the total time needed to extract all matching graphs for a given input graph from the database applying the chosen graph matching algorithm i to each graph in the database in a brute force fashion.
- t^i_match, the average time needed by the matching algorithm i to perform a comparison of two matching graphs.
- t^i_non_match, the average time needed by the matching algorithm i to perform a comparison of two non-matching graphs.
- t_filter, the average time needed to perform the filtering procedure. Notice that this variable does not depend on the underlying graph matching algorithm.

- m_db, the size of the graph database (the number of graphs stored in the database).
- m_filter, the number of candidates left for testing after the filtering procedure has been applied to the database (reduced database size). The smaller this value, the more effective the filter.
- s, the number of matching graphs for a given input sample. In the case of isomorphism (subgraph isomorphism), this number is equal to the number of graphs in the database that are isomorphic (subgraph isomorphic) to the input. This means that the matching algorithm will have to be executed s times for a successful match.

Note that t^i_match and t^i_non_match themselves depend on other parameters, such as the number of nodes and edges in a graph and the size of the alphabet of labels. Similarly, m_filter is highly dependent on the given input sample, the database as well as the filtering method. Note that the value of m_filter is bounded by s and m_db, such that s ≤ m_filter ≤ m_db.

Brute Force Matching

One can now formally state the time needed to retrieve all candidate graphs of a given sample from the database. Assuming there is no filtering involved, the graphs in the database must be sequentially tested against the input sample using the matching algorithm. The total time needed to do this is composed of two terms, the time needed to process the matching graphs and the time needed to process the non-matching graphs. Hence

T^i_brute_force = (s · t^i_match) + ((m_db − s) · t^i_non_match).    (4.1)

Assuming we are given a matching algorithm and a graph database, the behavior of brute force matching can be examined for various values of s. There are two cases to be considered (note that s is limited to the interval s ∈ [0, m_db]):

1. s large, with a maximum of s = m_db: In this case, the time needed for comparing matching graphs, t^i_match, is weighted more, whereas the weight of the time needed for comparing non-matching graphs is reduced. Hence, the brute force performance will highly depend on t^i_match. (In the case of s = m_db, t^i_non_match has no influence at all.)

2. s small, with a minimum of s = 0: Here, the weighting of the times is inverse to the case discussed before. Brute force performance will therefore mostly depend on t^i_non_match, and the effect of t^i_match will be minimal.

Considering two graph matching algorithms {A, B}, using Equation (4.1) one can state

T^A_brute_force = (s · t^A_match) + ((m_db − s) · t^A_non_match)
T^B_brute_force = (s · t^B_match) + ((m_db − s) · t^B_non_match).

Defining δ_match and δ_non_match as

δ_match = t^A_match − t^B_match
δ_non_match = t^A_non_match − t^B_non_match,

then T^B_brute_force can be rewritten as

T^B_brute_force = (s · (t^A_match − δ_match)) + ((m_db − s) · (t^A_non_match − δ_non_match)).    (4.2)

Expressing T^B_brute_force as shown in Equation (4.2), the performance of two different graph matching algorithms {A, B} can be compared under brute force matching. Assuming δ_match ≠ 0 and δ_non_match ≠ 0 (the case where these values are 0 is trivial), T^A_brute_force and T^B_brute_force will be equal if

(s · t^A_match) + ((m_db − s) · t^A_non_match) − (s · (t^A_match − δ_match)) − ((m_db − s) · (t^A_non_match − δ_non_match)) = 0.

This is equivalent to

δ_non_match · m_db − δ_non_match · s + δ_match · s = 0

which can be simplified to:

s / m_db = 1 / (1 − δ_match / δ_non_match).    (4.3)

Since s ≤ m_db, the value of the expression on the left-hand side of Equation (4.3) is obviously limited to the interval [0, 1]. Therefore, the value of the denominator on the right-hand side must be larger than or equal to 1. It follows that the value of δ_match / δ_non_match is limited to the interval ]−∞, 0]. This is particularly interesting because it shows that δ_match and δ_non_match are of inverse sign. Hence, if brute force matching with two different matching algorithms, A and B, takes equal time to match a sample against a given database and δ_match ≠ 0 as well as δ_non_match ≠ 0, then if algorithm A's match performance is better than algorithm B's match performance, B's non-match performance must be better than A's non-match performance. In that case there must be an inverse relation between matching and non-matching performance for algorithms A and B. Furthermore, given a fixed value for δ_non_match and assuming T^A_brute_force = T^B_brute_force, if the value of δ_match is increased, the ratio s/m_db must decrease, which means that either s becomes smaller or m_db grows larger. This means that if the performance of one matching algorithm drops when comparing matching graphs, either the database size must be increased (more non-matching graphs must be introduced) or the number of matching graphs must be reduced in order for the two brute force algorithms, A and B, to still perform equally. On the other hand, if we assume that t^A_match ≥ t^B_match and t^A_non_match ≥ t^B_non_match, then δ_match / δ_non_match is always positive and it follows that the right-hand side of Equation (4.3) will never be within the interval [0, 1] unless δ_match / δ_non_match = 0, which is the case if t^A_match = t^B_match and t^A_non_match = t^B_non_match (which simply means that the two graph matching algorithms perform the same).

Filtered Matching

Introducing a filtering method, its performance can be formulated similarly to Equation (4.1) as:

T^i_filter = s · t^i_match + (m_filter − s) · t^i_non_match + t_filter.    (4.4)

Note that due to the filter, the database size has been reduced from m_db to m_filter. On the other hand, an additional term, t_filter, has been added to the

total time. For the filtering to be effective, the relation T^i_filter < T^i_brute_force must hold. This is equivalent to

(s · t^i_match) + ((m_filter − s) · t^i_non_match) + t_filter < (s · t^i_match) + ((m_db − s) · t^i_non_match)

and can be simplified to

t_filter < t^i_non_match · (m_db − m_filter).    (4.5)

When analyzing Equation (4.5) one can see that, comparing the brute force approach with filtering, the time t^i_match needed for comparing two matching graphs is irrelevant. The inequality shows that the main factors for the performance of a filter approach are the time needed for comparing two non-matching graphs, t^i_non_match, and the term m_filter, which reflects the ability of the filter to reduce the database size m_db. If the matching algorithm performs well when comparing non-matching graphs (i.e. if t^i_non_match is small), then the filter approach needs to be able to significantly reduce the database size in order to outperform the brute force approach (i.e. m_db − m_filter needs to be large, see Equation (4.5)). On the other hand, if the matching algorithm performs poorly when comparing two non-matching graphs (i.e. if t^i_non_match is large), then the reduction factor of the filter approach need not be quite as large in order for the filter to be faster than the brute force procedure (i.e. m_db − m_filter can be rather small).

4.2 Concluding Remarks

One particular aspect that has been widely neglected until now is the observation that the non-matching time of an algorithm has a crucial impact on the overall performance of a graph matching scheme. For large databases, a small number of expected matches, and no or only moderate filtering, the non-matching performance of the underlying graph matching algorithm is the dominant factor in the overall efficiency of a graph matching scheme. In addition to matching and non-matching time, a number of parameters affecting filtering efficiency can be identified, e.g. database size, size of the database after filtering, number of matches in the database, and time needed for filtering. These parameters have a crucial influence on the behavior of a graph matching scheme. In particular, the number of matches in the database and the database size can be used to predict the performance of given graph matching algorithms under both brute force matching and a filtering approach.
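The relations derived in this chapter can be evaluated directly. The following sketch simply computes Equations (4.1) and (4.4) and checks the filtering condition of Equation (4.5); all parameter values are hypothetical numbers chosen only for illustration and do not stem from the experiments reported in this thesis.

def t_brute_force(s, m_db, t_match, t_non_match):
    """Equation (4.1): sequential matching against the whole database."""
    return s * t_match + (m_db - s) * t_non_match

def t_filtered(s, m_filter, t_match, t_non_match, t_filter):
    """Equation (4.4): matching restricted to the m_filter remaining candidates."""
    return s * t_match + (m_filter - s) * t_non_match + t_filter

# Hypothetical values: a database of 10,000 graphs, 5 matches, and a filter
# that reduces the database to 50 candidates in 20 ms.
m_db, s, m_filter = 10_000, 5, 50
t_match, t_non_match, t_filter = 4.0, 0.5, 20.0      # milliseconds

brute = t_brute_force(s, m_db, t_match, t_non_match)
filt = t_filtered(s, m_filter, t_match, t_non_match, t_filter)
print(brute, filt)                                   # 5017.5 ms vs 62.5 ms

# Equation (4.5): filtering pays off iff t_filter < t_non_match * (m_db - m_filter).
assert t_filter < t_non_match * (m_db - m_filter)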

Chapter 5

Graph Database Filtering Using Feature Vectors

Graph database filtering as proposed in this thesis is designed with respect to a given matching paradigm (graph isomorphism, subgraph isomorphism or error-tolerant graph matching). It is expected that the developed filters rule out as many candidates as possible not fulfilling the matching paradigm when compared to a given input graph. Naturally, there exists a wide variety of such filtering approaches. The most straightforward way to implement a filter would be to sequentially compare the input graph to each graph in the database using the given graph matching algorithm (brute-force filtering). However, since graph matching in general is computationally very expensive (the task of matching two graphs usually requires time and memory resources exponential in the number of nodes or edges involved), such a brute-force filter is unsuitable for larger databases. The approach proposed in this thesis is based on feature vectors f_i extracted from the database graphs and the input samples. The idea is to avoid expensive sequential full-fledged matching by comparing the feature vectors extracted from the graphs. Based on the outcome of the comparison, graphs are added to or ruled out from the result set of the filtering algorithm. This chapter introduces the concept of graph features and how graph feature vectors can be used to compare graphs for various matching paradigms.

Based on feature vector comparison, a very simple sequential filter is introduced. (This simple filter structure will be improved using decision tree techniques in Chapter 6.) In Section 5.1, a basic definition of the graph features used in this thesis will be given. Section 5.2 outlines how feature vectors can be used for graph isomorphism filtering. In Section 5.3, an extension to subgraph isomorphism filtering is presented and, finally, in Section 5.4 the approach is extended to error-tolerant feature vector evaluation.

5.1 Graph Features

The aim of this thesis is to evaluate graph database filtering systems based on data mining techniques, namely decision tree methods. In particular, the intent was to develop and evaluate systems for three graph matching paradigms:

1. graph isomorphism
2. subgraph isomorphism
3. error-tolerant graph matching

Decision tree methods classify objects (in our case graphs) based on a feature vector representation of the objects. Hence, in this study, graphs need to be represented as feature vectors. Since the vectors are used for the purpose of database filtering, the features must meet the following two requirements:

- fast extraction from sample graphs
- high saliency

It is expected that the features can be extracted fast. Rather than matching an input graph against all prototypes in the database, the database is first filtered and only the graphs that survive the filtering are matched against the input graph. However, a substantial gain in efficiency is achieved only if the filtering process itself, including feature extraction, is fast when compared to graph matching. Furthermore, it is expected that the features have

a high degree of saliency. That is, the features must have the ability to discriminate between as many graphs in the database as possible. The more graphs a given feature can distinguish, the fewer graphs remain as candidates after filtering and potentially have to undergo an expensive matching procedure. The requirements mentioned above impose restrictions on the types of features suitable for filtering. Naturally, there exist a great number of features which can be extracted from a graph. A very simple feature, for example, is the number of nodes in the graph. Other, more complicated features would be the average node degree of the graph, the diameter of a graph, the connectivity of a graph, the eigenvalues of a graph and so on [1, ]. However, in this work the types of features considered are features of at most complexity O(n^2), where n specifies the number of nodes in the graph. As has been said before, the selection of graph features is heavily restricted by the limitations imposed from the computational complexity point of view. Based on these restrictions, the following types of features (or meta-features) are defined in this thesis:

- f_1(g): number of vertices in graph g
- f_2(a, g): number of vertices in graph g per label a
- f_3(a, g): number of incoming edges per vertex label a in graph g
- f_4(a, g): number of outgoing edges per vertex label a in graph g
- f_5(n, g): number of vertices per in-degree n in graph g
- f_6(n, g): number of vertices per out-degree n in graph g
- f_7(n, a, g): number of vertices per label a and in-degree n
- f_8(n, a, g): number of vertices per label a and out-degree n

[Figure 5.1: A graph g_1 and a selection of extracted features (instances of the meta-features f_1, f_2, f_3, f_5 and f_7 evaluated on g_1, a small graph with node labels A and B).]

Note that in the above list solely meta-features are defined. The term meta-feature was chosen because each of the above features defines a class of features. Based on a meta-feature, an instance of a feature is extracted by applying a parameter setting to the meta-feature. If a meta-feature is evaluated for a given parameter setting, it is termed a feature. (An example of meta-features and features will be given below.) Furthermore, the above list of meta-features is merely a simple selection based on no other criteria except the computational complexity restrictions. Naturally, there exists a great variety of possible meta-features to be extracted. An example of the above features for a given graph g_1 is shown in Figure 5.1. In this figure, the feature types or meta-features are f_1(g), f_2(a, g), f_3(a, g), f_5(n, g), and f_7(n, a, g). An example of a feature is f_2(B, g_1), counting the number of nodes with label B. In this example, the parameter a ∈ L_V and the graph g of meta-feature f_2(a, g) have explicitly been set to a graph instance g = g_1 and a label value a = B, resulting in the feature f_2(B, g_1). Another example of a feature would be f_7(n, A, g_1) for a fixed in-degree n, evaluating the number of nodes with label A and in-degree n. The selection of features does not include any features considering edge labelling. Edge information is merely included in the degree of a node. However, all concepts presented here can be extended to also include edge labelling features. Considering only node labels, the number of features to be evaluated is significantly reduced. Clearly, the above selection and representation of graph features is merely one particular approach. There exists a wide variety of models on how to classify features and an even wider variety of features which could be considered. However, the chosen features are quite suitable for discrimination between different graphs in the application domains considered later.
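A possible extraction of some of these features is sketched below. The graph representation (a node-label dictionary plus a list of directed edges) and the concrete toy graph are assumptions made for the example and do not correspond to the graph framework used in this thesis.

from collections import Counter

# A toy directed graph: node id -> label, plus a list of (source, target) edges.
nodes = {1: "A", 2: "B", 3: "B", 4: "B"}
edges = [(1, 2), (1, 3), (4, 3)]

def f1(nodes):
    """f_1(g): number of vertices in the graph."""
    return len(nodes)

def f2(nodes, label):
    """f_2(a, g): number of vertices with label a."""
    return sum(1 for l in nodes.values() if l == label)

def f5(nodes, edges, n):
    """f_5(n, g): number of vertices with in-degree n."""
    in_deg = Counter(t for _, t in edges)
    return sum(1 for v in nodes if in_deg.get(v, 0) == n)

def f7(nodes, edges, n, label):
    """f_7(n, a, g): number of vertices with label a and in-degree n."""
    in_deg = Counter(t for _, t in edges)
    return sum(1 for v, l in nodes.items() if l == label and in_deg.get(v, 0) == n)

print(f1(nodes), f2(nodes, "B"), f5(nodes, edges, 0), f7(nodes, edges, 1, "B"))
# -> 4 3 2 1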

The features introduced in Section 5.1 are not equally well suited for all considered matching paradigms. In fact, depending on the matching task, different sets of features need to be used. The next sections describe which feature sets are suitable for a given matching paradigm and how feature vectors f_i extracted from graphs g_i can be compared to one another in order to obtain information on the relation between the graphs g_i.

5.2 Graph Isomorphism

Graph isomorphism is the most restrictive matching paradigm considered in this work. Intuitively speaking, two graphs g_1 and g_2 are isomorphic if they are equal, i.e. identical up to a renaming of their nodes. This essentially means that, apart from complexity issues, there are really no other restrictions on the features extracted from the graphs. Hence, it follows that if two given graphs g_1 and g_2 are isomorphic, then any feature defined as above will yield the same value for both graphs, meaning f_i(g_1) = f_i(g_2). Note that the opposite is not necessarily true: if g_1 and g_2 are not isomorphic to one another, then it does not follow that f_i(g_1) ≠ f_i(g_2). However, given the feature vector f_1 = (f_1^1, ..., f_1^8) extracted from g_1 and f_2 = (f_2^1, ..., f_2^8) extracted from g_2, one can immediately rule out g_2 as potentially isomorphic to g_1 if there exists a feature j such that f_1^j ≠ f_2^j. Based on this observation a simple sequential filtering approach can be derived (see Table 5.1). In a preprocessing step, the feature vectors are extracted from the graphs in the database. Then, at runtime, all features are extracted from the given input graph. The extracted feature vector is then sequentially tested against all feature vectors in the database. Per feature vector comparison, there are two outcomes:

a) successful match: the feature vectors of the two graphs are identical.
b) failed match: the feature vectors are non-identical.

The resulting candidate set is then the union of all successfully matched feature vectors.
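A runnable version of this sequential procedure can be sketched as follows; the dictionary-based feature vectors and the graph identifiers are assumptions made for the illustration, and Table 5.1 below summarizes the same scheme in pseudocode.

def iso_filter(f_s, database):
    """Return all database graphs whose feature vector equals the input's.

    f_s: feature vector of the input graph, e.g. {"f1": 3, ("f2", "A"): 1}.
    database: mapping graph id -> precomputed feature vector.
    """
    return [g for g, f_i in database.items() if f_i == f_s]

# Toy database of precomputed feature vectors (preprocessing step).
database = {
    "g1": {"f1": 3, ("f2", "A"): 1, ("f2", "B"): 2},
    "g2": {"f1": 3, ("f2", "A"): 2, ("f2", "B"): 1},
    "g3": {"f1": 3, ("f2", "A"): 1, ("f2", "B"): 2},
}

# At runtime the input graph's features are extracted and compared.
f_input = {"f1": 3, ("f2", "A"): 1, ("f2", "B"): 2}
print(iso_filter(f_input, database))   # -> ['g1', 'g3']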

Simple Graph Isomorphism Filter

function filter ( input vector f_s, Database vectors F ) {
  foreach database vector f_i in F do
    if ( f_s = f_i ) then
      add f_i to result
    fi
  done
}

Table 5.1 Scheme of a sequential graph isomorphism filter.

5.3 Subgraph Isomorphism

Feature comparison for subgraph isomorphism is similar to the approach used for graph isomorphism. Basically, for the features introduced in Section 5.1 one can assume that if graph g_s is a subgraph of graph g, then for each feature f_i the relation f_i(g_s) ≤ f_i(g) must hold. However, the relation is not that simple. Looking at the list of features presented above, it can be seen that there are two basic types of features:

- features not containing information on the degree of the nodes in the graph: {f_1, f_2, f_3, f_4}
- features containing information on the degree of the nodes in the graph: {f_5, f_6, f_7, f_8}

Extending the graph isomorphism feature vector comparison to subgraph isomorphism is straightforward for features not containing node degree information. In order for graph g_s to be isomorphic to a subgraph of graph g, the value of a feature f_s^j must be smaller than, or equal to, the feature value f^j in g. Thus, the isomorphism relation f_s^j = f^j needs to be replaced by f_s^j ≤ f^j. (Naturally, this is only a necessary but not a sufficient condition for g_s being a subgraph of g.) When looking at features containing node degree information, however, the approach must be refined. The difficulty is that in subgraph isomorphism,

nodes of lower vertex degrees in the subgraph may be mapped onto nodes with a higher degree. A simple example would be a star graph as the graph and the same graph with its center node removed as the subgraph (illustrated in Figure 5.2).

[Figure 5.2: An original graph g (a star graph) and the pruned subgraph g_s obtained by removing its center node, together with the values of feature f_5(n, g) for n ∈ {0, 1, 7}: f_5(7, g) = 1, f_5(1, g) = 7, f_5(0, g) = 0 and f_5(7, g_s) = 0, f_5(1, g_s) = 0, f_5(0, g_s) = 7.]

In the original graph, all nodes except the center node are of degree 1. In the subgraph, since the center node has been removed, these nodes are now of degree 0. Thus, the features referring to degree will, in general, not fulfill the subgraph condition defined above. Therefore f_5(0, g_s) can be compared to f_5(0, g) as well as to f_5(1, g). In order to properly match these feature values, one must not only consider the value for the current node's degree but also include the values for the nodes of a lower degree. To account for this case, the features f_5, f_6, f_7 and f_8 are redefined in the following way:

- f_5(n, g): number of vertices with in-degree less than or equal to n in graph g
- f_6(n, g): number of vertices with out-degree less than or equal to n in graph g
- f_7(n, a, g): number of vertices per label a and in-degree less than or equal to n
- f_8(n, a, g): number of vertices per label a and out-degree less than or equal to n

Intuitively speaking, the feature vector's values are summed up in the order of the node degrees. It is easy to see that if the feature is extracted for all occurring in-/out-degrees of the nodes in the graph, there exists a bijective function transforming each original feature into the above new representation. Hence, through this transformation no information in the graph description is lost. Applying this technique allows us to use the same features for both graph isomorphism and subgraph isomorphism. Consequently, the same sequential filter algorithm as shown in Table 5.1 can be used for subgraph isomorphism filtering if the isomorphism condition f_s = f_i is replaced by the subgraph isomorphism condition f_s ≤ f_i.

5.4 Error-Tolerant Graph Matching

Feature vector comparison for error-tolerant graph matching differs fundamentally from the approaches presented before. Whereas for graph isomorphism the aim is simply to determine whether or not two graphs g_1 and g_2 are isomorphic, the goal in error-tolerant graph matching is to make a statement on how much the two given graphs differ. Motivated by the graph distance function introduced in Section 2.2, which is based on the maximum common subgraph of two graphs, the features considered for error-tolerant graph matching need to be features giving explicit information on the number of nodes involved. Furthermore, all features including edge information are disregarded since the approach should not be restricted to the case of induced subgraphs. Therefore, the evaluation is limited to feature f_2(a, g). This means that the number of vertices in a graph with a given label is analyzed:

f_2(a, g): number of vertices in graph g with label a

It is clear that the number of possible values of feature f_2(a, g) depends on the size of the label alphabet L_V and the distribution of the node labels of graph g. Based on this feature, an estimate of the maximum possible size of the mcs of two graphs g_1, g_2 can be given. The idea is straightforward. Assume that for two graphs g_1, g_2 feature f_2(a, g) has been extracted for all labels a ∈ L_V. Further assume, without loss of generality, that all nodes in the graphs are labelled. The number of nodes |mcs(g_1, g_2)| in the maximum common subgraph of g_1, g_2 is given by

|mcs(g_1, g_2)| = Σ_{a ∈ L_V} f_2(a, mcs(g_1, g_2)).

To compute this quantity, one simply iterates over all label values occurring in the mcs, counting the number of nodes assigned to each label. Note that by definition a node is only allowed to carry one label. The sum of the values of feature f_2(a, g) in the mcs is bounded by:

Σ_{a ∈ L_V} f_2(a, mcs(g_1, g_2)) ≤ Σ_{a ∈ L_V} min( f_2(a, g_1), f_2(a, g_2) ).

Intuitively speaking, this means that per node label a the value of feature f_2(a, mcs) cannot be larger than the smaller value of that feature in graphs g_1 and g_2. Hence, using feature f_2(a, g) an estimate mcs_max of the maximum size of the mcs of two graphs (g_1, g_2) can be given:

|mcs(g_1, g_2)| ≤ Σ_{a ∈ L_V} min( f_2(a, g_1), f_2(a, g_2) ) = mcs_max(g_1, g_2).    (5.1)

A lower bound on the distance δ, as defined in Equation (2.1), is therefore given by:

δ_min(g_1, g_2) = 1 − mcs_max(g_1, g_2) / max(|g_1|, |g_2|).    (5.2)

Using this relation, one can make an approximation of the similarity of two graphs based on the feature f_2(a, g) introduced before. Assume we need to determine whether or not the distance δ(g_1, g_2) between a pair of given graphs g_1, g_2 is larger than a specified threshold distance δ_t. If that is the case, the following relation will hold:

δ(g_1, g_2) > δ_t

where δ_t is defined as

δ_t = 1 − c / max(|g_1|, |g_2|).

Obviously, the threshold distance δ_t can also be specified in terms of a required minimum size of the mcs. This size is given by the variable c. Applying Equations (5.1) and (5.2), the following relation can be derived:

δ_min(g_1, g_2) > δ_t ⟹ δ(g_1, g_2) > δ_t    (5.3)

which simply means that if the distance δ_min(g_1, g_2) between the graphs g_1, g_2 is larger than the threshold distance δ_t, then δ(g_1, g_2) will also certainly be larger than the threshold distance. In that case, it is unnecessary to calculate the computationally expensive exact mcs-distance δ(g_1, g_2). Moreover, by substituting and simplifying Equation (5.3) we obtain:

c > mcs_max(g_1, g_2).    (5.4)

This means that if the number of nodes of the upper bound estimate mcs_max(g_1, g_2) is smaller than the threshold number of nodes c, then the two graphs g_1, g_2 will have a distance value larger than the specified threshold distance δ_t. As a consequence, by calculating the estimate of the upper bound of the mcs size it is possible to decide whether the distance between two graphs is larger than a given threshold distance. (Note that Equation (5.4) is a necessary as well as a sufficient condition to determine whether the distance between two given graphs is larger than a given threshold; it is, however, not a sufficient condition to determine whether the distance between the two graphs is smaller.) Obviously, the threshold distance can either be specified via the distance δ_t itself or indirectly by specifying the number of nodes c required as a minimum size of the mcs. From the above observations, a filter performing a quick estimate of the minimum distance between two graphs can be designed.

Simple Error-Tolerant Filter

function filter ( input vector f_s, Database vectors F, number of common nodes c ) {
  foreach database vector f_i in F do
    if ( Σ_{a ∈ L_V} min( f_s(a), f_i(a) ) ≥ c ) then
      add f_i to result
    fi
  done
}

Table 5.2 Scheme of a sequential error-tolerant filter.

Table 5.2 shows

the principle of such a filter. Again, the feature vector f_s extracted from the input sample is compared to each feature vector f_i of the database graphs. During the comparison, an estimate of the maximum possible size of the mcs is made. Based on this estimate it is possible to rule out all candidates whose distance is definitely larger than the specified threshold distance. (The threshold distance is specified by the number of common nodes c required in the mcs.)
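A minimal runnable sketch of this error-tolerant filter is given below; it evaluates the bound mcs_max of Equation (5.1) and keeps the candidates that may still share at least c nodes with the input, as in Table 5.2. The label-count vectors and the threshold value are assumptions made for the example.

def mcs_max(f1, f2):
    """Upper bound on the mcs size, Equation (5.1): per label, take the smaller count."""
    labels = set(f1) | set(f2)
    return sum(min(f1.get(a, 0), f2.get(a, 0)) for a in labels)

def error_tolerant_filter(f_s, database, c):
    """Keep the graphs that may share at least c nodes with the input (cf. Table 5.2)."""
    return [g for g, f_i in database.items() if mcs_max(f_s, f_i) >= c]

# Label-count vectors f_2(a, g) for the input graph and three database graphs.
f_input = {"A": 2, "B": 1, "C": 3}
database = {
    "g1": {"A": 1, "C": 4},          # mcs_max with the input = 1 + 3 = 4
    "g2": {"B": 2, "D": 5},          # mcs_max with the input = 1
    "g3": {"A": 2, "B": 1, "C": 1},  # mcs_max with the input = 4
}

# Require at least c = 3 common nodes: g2 is ruled out, g1 and g3 remain.
print(error_tolerant_filter(f_input, database, c=3))   # -> ['g1', 'g3']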


Chapter 6

Decision Trees for Graph Database Filtering

The aim of this work is to reduce the size of a graph database with respect to a given sample graph. Various indexing mechanisms have been proposed to reduce the complexity of graph matching in the case of large databases [76, 77, 78, 79]. In this thesis, a filtering algorithm is proposed. The goal is to rule out a large number of graphs in the database by a few fast tests using only simple features. A potential problem with this approach is that there may exist a large number of simple features to be tested. For the meta-features introduced in the previous chapter, the size of the feature vector is directly dependent upon various parameters, such as the size of the label alphabet or the edge distribution of the graphs in the database. Hence the question arises which of these features are most suitable to rule out as many candidates from the database as quickly as possible. As its main contribution, this thesis proposes to use data mining techniques, namely decision tree induction methods, to deal with this problem. The method developed is based on the well known C4.5 algorithm extended by the RainForest framework [83, 86]. In this chapter, the methods developed for the various matching paradigms will be presented. In the next section, differences between decision trees used for classification and decision trees used for database filtering are outlined. In Section 6.2, a graph isomorphism filtering scheme is proposed.

Then, the method is extended to subgraph isomorphism. Following that, a scheme allowing filtering of the graph database in an error-tolerant way is presented. Finally, in Section 6.5 conclusions on the usage of decision tree filtering methods are drawn.

6.1 Decision Tree Filtering

As has been said in Chapter 3, decision trees are a supervised learning technique, i.e. each training sample has a class label assigned. For classification, an input sample whose class is unknown is assigned a class label by traversing the decision tree. Hence, for decision tree methods it is important that there exists a class label for each training element. Applying decision tree methods to graph database filtering, the graph database is considered the training set for the decision tree algorithm. In the database, the graphs themselves are represented by feature vectors. To successfully induce decision trees on these feature vectors, a class label needs to be assigned to the feature vector representations of the graphs. In this work, each training sample also represents a class label. The class of a training sample is therefore identical to the sample instance itself. Hence, for a database of n training samples there exist n classes in that training set. The aim of this choice of class label assignment is specifically to overfit the decision tree. The idea of overfitting the tree is to minimize the set of graphs retrieved by traversing the decision tree; hence, in database retrieval a smaller number of graphs remains to be fully tested under the given matching paradigm. Whereas ordinary decision tree methods try to generalize from a training set of objects, the approach presented here tries to overfit the data in the sense that, in the ideal case, all leaf nodes in the tree include just a single graph. In general, the smaller the number of graphs in a leaf node, the smaller the number of full-fledged graph matchings to be computed. Other differences can be seen in the nature of the feature vectors to be classified. In this work the feature vector representation is expected to be complete, i.e. there exist no unknown feature values. In classification scenarios, specific feature values may be missing. There are a variety of reasons for missing feature values, such as noise in the original pattern, or the extraction method simply not being able to extract that particular feature. In general, decision tree classifiers need to be able to cope with missing feature values. In graph database filtering, however, a missing feature indicates differences between graphs. For example, consider the feature number of nodes with label A with respect to graph isomorphism filtering. If in the

input sample no such node exists, then this feature is not introduced into the feature vector, and consequently all database graphs represented by this feature can be ruled out from the resulting graph set. Hence, for database filtering, one can assume that the feature vector representation is complete. Another difference lies in the nature of the feature values. Ordinary decision tree classifiers need to be able to cope with continuous as well as discrete feature values. Feature values ranging over continuous intervals are usually dealt with by the decision tree induction algorithm splitting the interval into subintervals according to the quality of the split. However, in terms of database filtering such splits make little sense. Consider the following example regarding graph isomorphism filtering. Assume the input graph is represented by the feature number of nodes = 10. Further assume that in the database there exist graphs of varying size, particularly of 5, 10 and 20 nodes. If the feature number of nodes is regarded as being defined on a continuous interval, then the decision tree induction algorithm might define subintervals ranging over ]0, 10] and ]10, ∞[. Hence, during traversal using the input sample, all database graphs of size 5 would still be included in the result set. If, however, the feature number of nodes is regarded as being defined on a discrete interval, then the induction algorithm will introduce three subsets, representing graphs of size 5, graphs of size 10 and graphs of size 20. In that case, during traversal, all graphs not consisting of 10 nodes would immediately be ruled out as valid isomorphism candidates. For the feature types previously introduced it is clear that (considering the proposed matching paradigms) all feature value intervals are of a discrete nature. Finally, it is assumed that with respect to database filtering, the time needed to traverse from any father node to a suitable son node according to a feature value is constant and does not depend on the position of the node in the tree. Such behavior can easily be achieved using specialized data structures linking father nodes to son nodes. Summarizing, it can be seen that there exist a variety of differences between decision trees used for classification and decision trees used for database filtering:

- complete feature vector representation, no unknown values
- limitation to discrete feature value intervals
- no ordinary class label, one class label per training instance
- specific overfitting of the tree
- constant cost when traversing from father nodes to son nodes

As a consequence, the decision tree approach in general can be significantly simplified. In the remaining sections, the main contribution of this thesis will be presented. In the next section, an approach using decision tree filtering for graph isomorphism is shown. Then, the approach will be refined to subgraph isomorphism. Finally, an error-tolerant retrieval method will be presented in Section 6.4 and conclusions will then be drawn in Section 6.5.

6.2 Graph Isomorphism Decision Trees

In Chapter 5 a simple approach comparing feature vectors with respect to graph isomorphism has been presented. The main problem with this approach was the large number of features to be tested. The method presented in this section tries to reduce the number of features to be tested to only a small subset. Based on decision tree techniques, the graph database is analyzed, identifying the most powerful features extracted from the database objects. The presented framework can be divided into a preprocessing (data mining) and a runtime phase. During preprocessing the graph database is analyzed. At first, feature vectors are extracted from the graphs. Then, based on the feature vectors, a decision tree structure is induced, reducing the initial feature space by identifying the most powerful features in the database. The decision tree significantly reduces the number of features to be tested. At runtime, an input sample to be matched is given as an input parameter to the decision tree traversal procedure. The traversal algorithm first extracts the features needed to traverse the tree, then the decision tree is traversed and the retrieved candidate set is returned as a result. This section first introduces how the decision tree structure is created from the database. Then, it is shown how the resulting tree structure can be traversed in order to reduce the size of the graph database.

6.2.1 Decision Tree Induction

In Section 5.2 it has been shown that for graph isomorphism, a necessary condition for g_1 and g_2 being isomorphic is that they have identical feature values. Hence, given the feature vector f_1 = (f_1^1, ..., f_1^8) extracted from g_1 and f_2 = (f_2^1, ..., f_2^8) extracted from g_2, g_2 can immediately be ruled out from the set of graphs potentially isomorphic to g_1 if a feature

j is discovered such that f_1^j ≠ f_2^j. Based on this observation, it is clear that the tree should be induced in such a way that at every internal node the most salient feature is chosen as the splitting feature. Using the most salient feature guarantees minimal successor subsets and thus a maximal chance of ruling out as many non-conforming candidates as possible. Therefore, the decision tree must be grown trying to maximize the split factor at every internal node in the tree. An increased split factor, and consequently a smaller tree depth resulting in less decision tree traversal time, is achieved by using C4.5's gain or the weighted entropy split criterion instead of the standard gain-ratio criterion (see Section 3.3). The decision tree induction algorithm classifies the graphs in the database with respect to identical feature values. Note that the aim is to retrieve single graph candidates from the database, not the classification of the input graph. Hence, in the presented approach, there are no class labels as introduced in Chapter 3. Rather, the class labels correspond to the sample instances (meaning sample instance and class label are equal). As a consequence, there exist as many classes as there are samples in the graph database. In the initializing step, the tree's root node is constructed and it is assigned the entire graph set of the database. Then, each available feature is tested and its suitability is evaluated according to the given split criterion. Amongst all features, the best one is chosen and the current root's graph set is split into subsets according to the best feature. For each feature value, a son node is created and the node is assigned the subset of graphs that corresponds to that feature value. The induction procedure is recursively continued with the son nodes until one of the following termination conditions holds:

a) the graph set in a node contains only one graph,
b) no features are left to divide a subset,
c) the features left cannot distinguish the remaining graphs in the set.

Cases b) and c) correspond to the situation where, later in the decision tree traversal phase, multiple graphs are returned by the filtering procedure, while case a) reflects the ideal situation where only one candidate graph remains to undergo the full-fledged graph matching procedure. An example of a graph database G, extracted features of type f_2 and the corresponding graph isomorphism tree is given in Figure 6.1. A decision tree induced as described above can be used to filter a graph database for graph isomorphism candidates. The next section describes how this can be done.
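The induction procedure just described can be sketched in a few lines. The following code is a simplified illustration only, not the C4.5/RainForest-based implementation evaluated in this thesis; the feature-vector encoding, the feature names and the split selection by the weighted entropy criterion are assumptions made for the example.

from math import log2

def weighted_entropy(partition, total):
    """Weighted sum of subset entropies; with one class per graph this is
    sum_i |T_i|/|T| * ld(|T_i|) (cf. Section 3.3)."""
    return sum(len(s) / total * log2(len(s)) for s in partition.values())

def induce(graphs, features):
    """Induce an (overfitted) graph isomorphism decision tree.

    graphs: mapping graph id -> feature vector (dict feature -> value).
    Internal nodes carry a feature and one child per feature value,
    leaves carry the remaining candidate graphs.
    """
    ids = list(graphs)
    usable = [f for f in features if len({graphs[g][f] for g in ids}) > 1]
    if len(ids) <= 1 or not usable:        # termination conditions a)-c)
        return {"graphs": ids}

    def split(f):
        part = {}
        for g in ids:
            part.setdefault(graphs[g][f], []).append(g)
        return part

    # Choose the feature whose split minimises the weighted entropy.
    best = min(usable, key=lambda f: weighted_entropy(split(f), len(ids)))
    children = {v: induce({g: graphs[g] for g in sub}, features)
                for v, sub in split(best).items()}
    return {"feature": best, "children": children}

# Feature vectors of the database of Figure 6.1 (f_2(C, g_i) and f_2(A, g_i)).
db = {"g1": {"fC": 0, "fA": 0}, "g2": {"fC": 0, "fA": 0},
      "g3": {"fC": 1, "fA": 1}, "g4": {"fC": 1, "fA": 0},
      "g5": {"fC": 2, "fA": 0}}
tree = induce(db, ["fC", "fA"])

Traversing the resulting tree with the feature values extracted from an input graph then proceeds as in Section 3.2.2 and yields the candidate set that still has to be verified by a full graph isomorphism test.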

Figure 6.1 Graph database G, extracted features of type f(a, g_i) and the corresponding induced graph isomorphism decision tree.

6.2.2 Decision Tree Traversal

Once the decision tree representing the database of model graphs G has been constructed, a necessary condition for an input graph g being isomorphic to a model graph g_i is that all features extracted from g have values identical to the corresponding features extracted from g_i. The processing is as follows. First, the same features that were extracted from the graphs in the database are extracted from the input graph. Then the values of the features extracted from the input graph are used to traverse the decision tree.

There are only two possible outcomes. The first outcome is that a leaf node is reached. In this case the graphs associated with the leaf node are possible matches to the input graph. Each of these graphs is tested against the input graph for graph isomorphism using a conventional algorithm [7]. The second outcome is that no leaf node is reached. In this case there are no graphs in the database that can be isomorphic to the input graph. An example of both traversal outcomes for the decision tree depicted in Figure 6.1 is given in Figure 6.2.

Figure 6.2 Successful and unsuccessful traversal examples for various input graphs.

6.3 Subgraph Isomorphism Decision Trees

Looking at graph isomorphism, a necessary condition for two graphs g_1 and g_2 being isomorphic is that they have identical feature values. For subgraph isomorphism, however, the relation is not that simple (see Section 5.3). Looking at the list of features presented in Section 5.3, it can be seen that there are two basic types of features:

features not containing information on the degree of the nodes in the graph: {f_1, f_2, f_3, f_4}

features containing information on the degree of the nodes in the graph: {f_5, f_6, f_7, f_8}

Recall that extending the graph isomorphism feature vector comparison to subgraph isomorphism is straightforward for features not containing node degree information. For features containing node degree information, however, the approach must be refined. The difficulty is that in subgraph isomorphism, nodes of lower degree in the subgraph may be mapped onto nodes of higher degree. An example is given in Section 5.3. However, this problem is solved by adapting the representation of the feature vector. Thus, only minor modifications need to be made to the decision tree scheme. Besides feature comparison, there are other issues to be considered. First of all, the filtering paradigm needs to be more clearly defined. There are two possible search scenarios:

supergraph search: the input sample is considered to be isomorphic to subgraphs of the graphs in the database.

subgraph search: the database contains graphs possibly isomorphic to a subgraph of the input sample.

In the following explanations the focus is set on the second scenario, where the input sample is a supergraph and the database graphs are subgraphs (subgraph search). However, supergraph search works analogously. Once the retrieval scenario has been determined, the database can be filtered in one of the following two ways:

a) by using the graph isomorphism decision tree as described in Section 6.2, altering the traversal algorithm,

b) by introducing a new decision tree structure, using the same traversal algorithm as described in the previous section.

Naturally, both approaches have advantages and disadvantages, as will be shown later.
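Before turning to these approaches, the feature-vector relation that underlies both of them can be illustrated for the label-count features f(a, g), the number of nodes in g carrying label a. The sketch below is hypothetical and only expresses the necessary (not sufficient) condition used for filtering in the subgraph-search scenario; all function names are placeholders.

```python
from collections import Counter

def label_counts(node_labels):
    """Feature of type f(a, g): number of nodes in g with label a."""
    return Counter(node_labels)

def may_contain_as_subgraph(sample_labels, db_graph_labels):
    """Necessary condition for the subgraph-search scenario: every label must
    occur in the input sample at least as often as in the database graph,
    otherwise the database graph cannot be a subgraph of the sample."""
    sample = label_counts(sample_labels)
    db = label_counts(db_graph_labels)
    return all(sample[a] >= n for a, n in db.items())

# The database graph has two A-nodes and one C-node; the sample has three
# A-nodes, one B-node and one C-node, so it passes the filter.
print(may_contain_as_subgraph(["A", "A", "A", "B", "C"], ["A", "A", "C"]))  # True
print(may_contain_as_subgraph(["A", "C"], ["A", "A", "C"]))                 # False
```

For supergraph search the inequality is simply reversed. Graphs passing this check are still only candidates and must be verified with a full subgraph isomorphism test.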

Furthermore, based on the concepts of multiple classifier systems, a third approach can be obtained by combining both subgraph filtering methods, taking the intersection of both result sets. In the next section, the extension of the graph isomorphism framework to subgraph isomorphism filtering by changing the traversal algorithm (thus reusing the graph isomorphism tree structure) is described. Then, the altered tree structure is introduced and the combined filtering method is presented. All approaches are discussed in detail for subgraph retrieval. However, supergraph retrieval from the database works analogously and is therefore only briefly described where necessary.

6.3.1 Graph Isomorphism Tree - Induction

This filter is based on the decision tree structure described in Section 6.2.1. As explained in Section 5.3, it is required that the feature vectors are in the proper format. Hence, no adaption needs to be made to the induction algorithm itself, the goal being to reuse the graph isomorphism decision tree. The idea of reusing the graph isomorphism tree is to adapt the traversal algorithm so that it concurrently follows all valid successor nodes of an internal node. Compared to graph isomorphism traversal, traversal no longer follows one single path to reach a leaf node. Instead, several branches of the tree are explored, possibly leading to several leaf nodes being part of the result. Consequently, maximizing the split factor does not necessarily yield the globally minimal number of tests made to traverse the tree for a given sample. Rather, it can be expected that if the tree is better balanced, overall fewer nodes need to be visited. Better balanced trees can be achieved by inducing the tree with the gain-ratio criterion. Whether or not gain ratio produces superior results over weighted entropy naturally depends on the features extracted as well as on the underlying graph database. Both criteria are given the same information and are consequently able to produce equally sized sets assigned to the leaf nodes of the tree. The difference lies merely in the order in which this information is processed, meaning the priority in which features are chosen for splitting an internal node of the decision tree. Due to this possible difference, clusters of different size may be retrieved when filtering is applied.

6.3.2 Graph Isomorphism Tree - Traversal

Extending the isomorphism traversal to the problem of finding subgraphs g_i in the database imposes the following difficulties on the traversal process: a) there are in general several successor nodes to be followed, b) feature values not occurring in the database may occur in the sample graph and must therefore be considered while traversing the tree. Hence, for a single traversal step, two cases can occur. If there exists a decision tree node for the input graph's feature value, then the traversal algorithm must move ahead to this node and to all nodes representing a smaller feature value. On the other hand, if there exists no such node, the algorithm must follow all nodes representing smaller feature values. If no such nodes exist, traversal must be stopped. The algorithm either reaches one or several leaf nodes, in which case all graphs associated with the union of the leaf nodes need to undergo full-fledged graph matching. The other scenario is that the algorithm does not reach any leaf node, which means that there are no graphs in the database possibly subgraph isomorphic to the input graph. An example of subgraph traversal on the graph isomorphism tree of Figure 6.1 is shown in Figure 6.3.

Figure 6.3 Subgraph traversal on a graph isomorphism filtering decision tree (retrieved candidates: {g_1, g_2, g_3, g_4}).
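The following sketch illustrates the adapted traversal on a tree built by the induce function from the earlier sketch (the same hypothetical Node structure is assumed); it only shows the branching logic for the subgraph-search scenario, where the database graphs are candidate subgraphs of the input sample.

```python
def traverse_subgraph(root, sample_features):
    """At every internal node, follow the branch for the sample's own feature
    value (if it exists) together with all branches for smaller values."""
    candidates, frontier = [], [root]
    while frontier:
        node = frontier.pop()
        if node.feature is None:                 # leaf reached
            candidates.extend(node.graphs)
            continue
        value = sample_features.get(node.feature, 0)
        followed = [child for v, child in node.children.items() if v <= value]
        frontier.extend(followed)                # empty list: branch is pruned
    # Union of all leaves reached; duplicates removed before verification.
    return list({g.gid: g for g in candidates}.values())
```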

6.3.3 Subgraph Isomorphism Tree - Induction

It is clear that it is possible to include the additional logic of the described traversal algorithm into the decision tree structure itself, leaving the traversal algorithm unchanged. To do this, the decision tree structure needs two major modifications before it can be used for subgraph filtering purposes. The first adaption concerns the assignment of graph subsets to son nodes. In the isomorphism case, the father node's graph set is split into disjoint subsets according to the best feature. For subgraph isomorphism trees, however, these subsets are no longer disjoint. Consider the case where a feature occurs n times in the input sample. In that case, all graphs in the database where the same feature occurs n' < n times are possible subgraph isomorphism candidates and need to be assigned to the son node representing feature value n. Hence, the sets assigned to the son nodes of an internal node are overlapping subsets, where the candidates of the son representing the smallest feature value are included in all other sons' sets as well. Similarly, the candidate set of the son with the largest feature value contains the union of all the graphs in its siblings' sets, or more precisely, it contains the same set of graphs as its father node (see Figure 6.4). Since the son node with the maximal feature value at any level in the tree is assigned the entire graph set of its father, the depth of the decision tree is only limited by the number of features evaluated. Depending on the graph type, the number of possible features is in general quite large. In case a son node containing its father's graph set is unlikely to be reached during traversal, it should be induced after other nodes that are more likely to be reached. Otherwise, the induction algorithm would focus on inducing possibly very deep tree branches that are hardly ever reached in the traversal process. Hence, while constructing the tree, the nodes ready for expansion must be ordered so that candidates unlikely to be reached are not (or only very late in the process) expanded. From the application-oriented point of view, nodes likely to be reached should be examined much more closely. Intuitively, the probability of a node being reached is given by the number of candidates appearing only in that node compared to the number of candidates in the father node.

The second adaption is due to the fact that feature values not occurring in the graph database may occur in the input sample and must therefore be considered during tree construction.

Figure 6.4 Example of a decision tree split.

This can easily be handled by introducing additional edges from the father node to the appropriate son node (the node representing the next smaller feature value) for each non-occurring value of the considered feature. Figure 6.4 illustrates the described modifications, namely overlapping graph sets in the son nodes as well as feature values not occurring in the database graphs. In the father node, there are two graphs with two and four nodes with label A, respectively. Hence, the decision tree consists of two son nodes (and their corresponding edges), one for each feature value f(A, g_i) = 2 and f(A, g_i) = 4. Furthermore, since a sample graph may contain 3 nodes with label A, an additional edge with label f(A, g_i) = 3 must be introduced, also pointing to the node for graphs with f(A, g_i) = 2. Naturally, all samples where f(A, g_i) > 4 must be directed to the node where f(A, g_i) = 4 (the branch to the very right), and all samples satisfying f(A, g_i) < 2 are not possible supergraphs of the graphs in the database (therefore no branch is provided in the tree for this case). Consider the case where, at runtime, the decision tree is traversed for a sample graph with f(A, g_i) = 5. While traversing the tree, the leaf with f(A, g_i) > 4 is reached. Hence, the input sample is a possible supergraph of both graphs in the illustration, which makes sense since the sample consists of 5 nodes with label A.
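A sketch of how such a split could be constructed is given below. It reuses the hypothetical Graph container from the earlier sketches and is only meant to illustrate the overlapping candidate sets and the redirection of non-occurring feature values; it is not the full induction procedure with its node-ordering heuristic.

```python
def split_for_subgraph_tree(graphs, feature):
    """Build the (overlapping) children of one internal node of a subgraph
    isomorphism tree for the given feature: a graph with feature value v is
    assigned to every child whose value is >= v, and values missing in the
    database are redirected to the child of the next smaller occurring value."""
    occurring = sorted({g.features.get(feature, 0) for g in graphs})
    children = {v: [g for g in graphs if g.features.get(feature, 0) <= v]
                for v in occurring}
    # Redirect values that may occur in a sample but not in the database.
    for v in range(occurring[0] + 1, occurring[-1]):
        if v not in children:
            next_smaller = max(u for u in occurring if u < v)
            children[v] = children[next_smaller]
    # Samples with a value above the maximum follow the child of the largest
    # value; samples below the minimum are rejected (no branch exists).
    return children
```

For the example of Figure 6.4 (graphs with two and four A-nodes), this yields children for the values 2, 3 and 4, where the child for value 3 shares its candidate set with the child for value 2 and the child for value 4 holds the father's entire graph set.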

Figure 6.5 Traversal on a subgraph isomorphism filtering decision tree.

6.3.4 Subgraph Isomorphism Tree - Traversal

Decision trees induced as described above can be used to retrieve possible subgraph isomorphism candidates of a given sample graph in the following way. First, the same features that were extracted from the database graphs and used to induce the decision tree are extracted from the input graph. Then, the traversal algorithm follows the tree branch whose feature values are equal to the values extracted from the sample graph. There are only two possible outcomes of the decision tree traversal procedure. The first outcome is that a leaf node is reached. In this case, the graphs associated with the leaf node are possible matches to the input graph. Each of these graphs must then be tested against the input graph for subgraph isomorphism using a conventional algorithm (e.g. as described in [6, 7, 93]). The second outcome is that no leaf node is reached, in which case there are no graphs in the database that can be subgraph isomorphic to the input graph. Figure 6.5 illustrates the parts of the adapted decision tree structure and the nodes traversed for a given input sample. Note that the branches on the left and on the right are only partially depicted. Only the branch being followed by the traversal algorithm is fully shown.

Combined Filter

Motivated by concepts used in multiple classifier combination, a third approach for subgraph filtering can be derived. The key idea in multiple classifier combination is that if classifier C_1 classifies certain cases very accurately whereas others perform poorly on these cases, C_1 may level out the other classifiers' weaknesses. Vice versa, the other classifiers may correct C_1's weaknesses. Adopting this principle, one can combine the two subgraph filtering methods previously introduced. In order for the combined approach to be successful, however, the two decision trees need to test different features. At the beginning of tree induction, due to the mostly similar graph sets in the tree nodes, both presented approaches are likely to identify equal or at least similar features as the most powerful features to split the given nodes. However, with increasing tree depth, the graph sets in the nodes will vary considerably, and hence it can be expected that the induction algorithms choose different features for different internal tree nodes. It is therefore a straightforward approach to combine both traversal algorithms, taking the intersection of the result sets as the combined result, expecting this combined result to be significantly smaller than the individual results. An illustration of such a combined scheme is shown in Figure 6.6. Naturally, whether or not the combination of the filtering methods works heavily depends on the underlying graph database and thus on the features extracted. Furthermore, actual hardware limitations may apply, severely limiting the depth of the trees induced and consequently reducing the variety of the splitting features chosen.

6.4 Error-Tolerant Decision Trees

In this section it is explained how the concept of feature-vector comparison for maximum common subgraph estimation can be used in combination with decision trees. The decision tree induction procedure itself is analogous to the previous decision tree induction methods for graph isomorphism trees. Decision tree traversal, however, needs to be modified such that several branches are followed concurrently while keeping track of the assignable as well as the non-assignable feature values. Based on these assignment counters and Equation 5.4 in Section 5.4, an estimate of the maximum size of the maximum common subgraph is possible.

Figure 6.6 Illustration of a combined retrieval scheme.

Using Equation 5.2 (Section 5.4) one can then build a decision tree traversal algorithm which filters out all graphs in the database whose distance to the input sample is larger than a given threshold distance δ_t. In Section 6.4.1 a brief explanation is given on how to induce decision trees useful for error-tolerant graph matching. A detailed description of how the tree structure can be used for error-tolerant filtering follows in Section 6.4.2.

6.4.1 Error-Tolerant Decision Tree Induction

The tree induction procedure is analogous to the induction procedure for graph isomorphism trees. There are only two minor differences to be considered when inducing error-tolerant decision trees. The first difference is that for error-tolerant trees, it is important that all graphs in a specific tree branch are equal in size. This is to make it easier for the traversal

algorithm to keep track of the nodes remaining to be assigned for the database graphs. Hence, the first feature tested in an error-tolerant decision tree structure is feature f_1, the number of nodes in a graph. Second, whereas for graph isomorphism trees no restriction is made on the features to be used, for error-tolerant decision trees only features of type f(a, g) can be applied, where f(a, g) denotes the number of vertices in graph g with label a (this restriction is made based on the observations derived in Section 5.4). The decision tree structure obtained in this way will be referred to as the simple tree structure from now on.

During traversal, an input sample may contain additional information in graph features not available in the graph database. Hence, features not capable of distinguishing database graphs may still be useful for filtering unknown sample graphs. To account for this case, an extended tree structure is introduced. After the induction of the simple tree structure as described above, the remaining features are induced even though they do not contain any split information for the database graphs. The goal is that these features are tested during traversal so that the maximum information given in database and sample graphs is evaluated. This tree structure will be referred to as the extended tree structure. An illustration of the simple and the extended tree structure is given in Figure 6.7. Note that for both tree structures feature f_1 is tested first to guarantee graphs of homogeneous size in each tree branch. Also note that for the extended tree structure, additional features are tested, resulting in additional tree nodes (marked red).

6.4.2 Decision Tree Traversal

Decision trees induced as described above can be used to retrieve possible graph candidates differing by no more than a specified threshold distance δ_t from the given input sample graph. Assume all graphs in the database are equal in size (this is achieved by the initial test of feature f_1 in the decision tree). Furthermore, assume the size c of the requested mcs as described in Section 5.4 is also known. The general idea is to keep track of the number of nodes assigned to a possible mcs. During traversal, each visited tree node represents a number of nodes assigned to a possible mcs. Similarly, each branch followed represents a hypothetical mcs between the database graphs and the input sample. In order to do this, three counters are needed:

Figure 6.7 Simple and extended tree structure (additional nodes marked red).

n_mcs: The number of nodes assigned to a possible mcs in the current branch. This counter is initialized with zero (i.e. at the beginning of tree traversal, no nodes are assigned to the mcs).

n_dr: The number of nodes remaining in the database graphs of the current branch. This counter is initialized with the size of the graphs in the database (i.e. all nodes remain to be assigned to a possible mcs).

n_sr: The number of nodes remaining in the sample graph in the current branch. This counter is initialized with the size of the sample graph, analogously to n_dr.

First, all features of the specified type are extracted from the input graph. The traversal algorithm follows all possible tree branches concurrently.

Figure 6.8 Illustration of a traversal step and corresponding counter updates.

During the traversal of a branch, feature values are compared between database and input sample. Suppose the feature tested at a specific branch is denoted by f(a, g), the input sample is denoted by g_s and, similarly, the branch's graph set by G_db. (Hence, f(a, g_s) represents the sample's feature value whereas f(a, G_db) represents the database graphs' value.) Then the counters are updated in the following way:

n_mcs = n_mcs + min(f(a, g_s), f(a, G_db))
n_dr = n_dr - f(a, G_db)
n_sr = n_sr - f(a, g_s)

Figure 6.8 illustrates a traversal step and the corresponding counter assignments. The feature tested on the illustrated branch is the number of nodes assigned the label A. The input sample contains two such nodes whereas the database graphs contain three. It is clear that two nodes may be assigned to a possible mcs, hence n_mcs can be incremented by two. Also, three nodes have been evaluated for the database graphs and can be removed from the nodes remaining to be assigned. Similarly, two nodes can be removed from the remaining nodes of the sample graph.
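The counter update for a single traversal step can be written as a small helper function. The sketch below assumes counters are kept per explored branch and that feature values are the label counts f(a, g) introduced above; all names are placeholders for illustration only.

```python
from dataclasses import dataclass

@dataclass
class BranchCounters:
    n_mcs: int      # nodes assigned to a possible mcs so far
    n_dr: int       # nodes of the database graphs not yet evaluated
    n_sr: int       # nodes of the sample graph not yet evaluated

def update(counters, f_sample, f_db):
    """Apply the update rules for one tested feature f(a, .)."""
    return BranchCounters(
        n_mcs=counters.n_mcs + min(f_sample, f_db),
        n_dr=counters.n_dr - f_db,
        n_sr=counters.n_sr - f_sample,
    )

# Example corresponding to Figure 6.8: the sample has two A-nodes, the
# database graphs of this branch have three; the graph sizes are assumed.
c = BranchCounters(n_mcs=0, n_dr=10, n_sr=8)
c = update(c, f_sample=2, f_db=3)
print(c)  # BranchCounters(n_mcs=2, n_dr=7, n_sr=6)
```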

This procedure is recursively continued until one of the following terminating conditions is met:

1. failure: (n_mcs + n_dr < c) or (n_mcs + n_sr < c) (i.e. not enough nodes can be assigned to a possible mcs).
2. success: n_mcs ≥ c (i.e. enough nodes have been assigned).
3. success: a leaf node has been reached and none of the other terminating conditions has occurred.

All nodes reached with a successful termination condition are assigned to the overall result set. The graphs contained in these nodes are possible candidates and consequently need to be tested with a standard graph distance measure.

6.5 Conclusions

In this chapter, the main contribution of this thesis, applying decision tree methods to feature vector filtering, has been introduced. It has been shown that there exist some fundamental differences to be considered when using decision tree methods for database filtering instead of classification (Section 6.1). Based on the necessary refinements, the basic concept of database filtering based on feature vectors has been adapted to enable the application of machine learning techniques. Various decision tree filtering approaches for three graph matching paradigms have been introduced:

1. graph isomorphism (Section 6.2)
2. subgraph isomorphism (Section 6.3)
3. error-tolerant graph matching (Section 6.4)

It has been shown that for graph isomorphism the decision tree extension is straightforward. For subgraph isomorphism, on the other hand, several issues have been identified and addressed, resulting in several approaches to solve the problem. For error-tolerant decision tree filtering, two approaches have been derived based on different tree representations. In the next part, the derived methods are experimentally evaluated. First, the unrefined feature vector filtering approaches are evaluated (Chapter 9). Then, the performance gain achieved by the application of decision tree methods is illustrated (Chapter 10).


Part III
Experiments and Results


Introduction

The efficiency of the proposed approaches was evaluated in a series of experiments. This part presents the experimental data used in the context of this thesis. The experiments conducted focused on the following topics:

the relation between the performance of graph database filtering approaches and the underlying graph matching algorithm (Chapter 8),

evaluation of the feature vector representation of graphs (Chapter 9), namely feature type evaluation (Section 9.1) and sequential feature vector filtering performance (Section 9.2),

decision tree filtering performance (Chapter 10).

This part is organized as follows. In Chapter 7 the datasets the experiments were conducted on are introduced. Following that, a study concerning the relation of graph matching algorithm performance and graph database filtering, based on the work presented in Chapter 4, is shown (Chapter 8). Then, the suitability of the proposed feature types for database filtering is examined in Chapter 9. Finally, the performance of various filtering approaches (depending on the given graph matching paradigm) based on decision tree methods is analyzed (Chapter 10).


97 Chapter 7 Graph Datasets In order to test the filtering methods, a variety of databases of different graph types have been created or collected within the course of this work. In general, the graph types studied can be categorized into two different classes: generated graphs (graphs created artificially, i.e. by a graph generator), extracted graphs (graphs extracted from real-world structural data, e. g. region adjacency graphs). The class of generated graphs consists of graphs created by graph generators. The graphs in this class are not based on other structural data. Rather, they are artificially derived based on given user parameters. Examples of these types of graphs are random graphs or bounded valence graphs. Extracted graphs on the other hand are graphs extracted from existing structural data, such as digital images or documents. Note that the underlying structural data itself can either be created by some sort of a generator (an image derived by a tool based on grammars such as the Pol- Gen images used in this thesis) or it can be a snapshot of the real-world (e. g. an image captured by a camera). Naturally, depending on the underlying structural data and application requirements, various types of graphs can be extracted. 85

98 Generated Graphs graph database generated graphs extracted graphs random graphs bounded valence graphs mesh graphs molecules rag graphs fingerprints documents Figure 7.1 The collected graph data. An overview of the graph data used in this thesis is given in Figure 7.1. In the next section, a brief introduction to the generated graph types is given. Then, in Section 7., the various graphs based on real-world data are presented. 7.1 Generated Graphs In order to test the approaches developed in this thesis on a wide variety of possible graphs, it was necessary to create a large collection of different graph types. These graphs were generated by carefully varying parameters such as structure, size, type, and label alphabet. The goal was to test the filtering approaches for a broad scale of possible parameter settings. Therefore, an entire collection of graph databases was derived for each parameter setup. There are basically two aspects to be considered when describing generated graphs. These are: 1. graph structure (including size, edge distribution and type). label distribution From the structural point of view, different types of graphs have been examined, including random graphs, bounded valence graphs (regular/irregular) and meshes (regular/irregular). These graph types have been applied

99 Chapter 7. Graph Datasets 87 in the research community before (see [9, 91, 94]). The generators to create these types were either provided by third-party tools such as VFLib [9] or implemented from scratch. Label alphabets of varying size were used, generally ranging from 5 to labels, based on a uniform random distribution Random Graphs Random graphs as used in this thesis are a straightforward way to artificially generate a graph structure. In general a random graph is created by first adding a certain number of nodes to the graph and then arbitrarily connecting the nodes through edges. Finally, labels are randomly assigned to the nodes and edges of the graph. In order to control the randomness of the created graphs the random variables were limited in various ways: 1. The graph size was constant per database constructing different databases for different graph sizes.. The initial number of edges introduced was constant with respect to the number of nodes in a graph. 3. The graphs were assumed to be connected. 4. The labels were assigned from a label alphabet of fixed size. The above parameters were varied within certain limits and for each setup a database of graphs was derived. The size of the graphs (i.e. the number of nodes and as a consequence the number of edges in the graphs) was varied from 5 to nodes, generating graphs of size 5, 1, 3, 5, 7 and nodes. Similarly, the size of the label alphabet ranges from 5 to labels (5, 1, 3, 5, 7 and labels). The labels were uniformly assigned to the nodes and edges of the graphs. For each setup, a database of 1, random graphs was created. During creation it was made sure that the graphs in the database were not isomorphic to each other. Consequently, the entire random graph database consisted of 36 databases of graphs with varying size and label distribution. An example of a random graph from the database of graphs with 5 nodes assigned labels from an alphabet 1 of 5 labels is given in Figure Note that node and edge label alphabet are identical.
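As an illustration of this kind of generator, a minimal sketch is given below. It is not the generator used in this thesis; the connectivity strategy (a random spanning tree plus additional random edges), the edge-to-node ratio and all names are assumptions made for the example.

```python
import random

def random_graph(n_nodes, n_extra_edges, labels, seed=None):
    """Generate a connected, node- and edge-labelled random graph.

    Connectivity is ensured by first building a random spanning tree;
    afterwards additional edges are inserted between random node pairs."""
    rng = random.Random(seed)
    nodes = {v: rng.choice(labels) for v in range(n_nodes)}
    edges = {}
    order = list(range(n_nodes))
    rng.shuffle(order)
    for i in range(1, n_nodes):                       # random spanning tree
        u = order[rng.randrange(i)]
        edges[(u, order[i])] = rng.choice(labels)
    while len(edges) < n_nodes - 1 + n_extra_edges:   # extra random edges
        u, v = rng.sample(range(n_nodes), 2)
        if (u, v) not in edges and (v, u) not in edges:
            edges[(u, v)] = rng.choice(labels)
    return nodes, edges

# Example: a graph with 50 nodes, 25 additional edges and a 5-symbol alphabet.
nodes, edges = random_graph(50, 25, labels=list("ABCDE"), seed=0)
```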

100 Generated Graphs Figure 7. Random graph example Mesh / Hypercuboid Graphs Mesh or hypercuboid graphs used in this work are mesh structures of varying dimension n. In this thesis, the dimension was varied from n =,..., 5, generating mesh graphs (n = ), cuboids (n = 3) and hypercuboid graphs (n = 4, 5). For simplicity reasons, the terms mesh and hypercuboid are used synonymously in the context of this thesis. The size of the graphs itself was not controlled directly, but rather indirectly by specifying the size of the intervals per mesh dimension. The number of nodes e i along a specific dimension i was then chosen at random. Thus, per database, graphs of varying size were produced. The boundaries for the axis-intervals were set as follows: d-mesh: e i [,..., 1], e. g. a mesh of size 3 9 3d-cuboid: e i [,..., 8], e. g. a cuboid of size 4 6 4d-hypercuboid: e i [,..., 6], e. g. a hypercuboid of size d-hypercuboid: e i [,..., 4], e. g. a hypercuboid of size Additionally, the regularity of the meshes was controlled by randomly adding edges to nodes in the meshes according to a given probability. This probability was varied from.,.,.4 to.6. The assignment of labels was handled analogously to random graphs. The label alphabet was varied

101 Chapter 7. Graph Datasets dimensional mesh 3-dimensional cuboid Figure 7.3 Regular mesh (left) and cuboid (right) graph examples. from 5 to labels (5, 1, 3, 5, 7 and labels), nodes and edges being randomly assigned elements from the alphabet based on a uniform distribution of the labels. Similar to random graphs, for each parameter setup a database of 1, mesh graphs was generated ensuring that no two graphs in the database were isomorphic to each other. Hence, the mesh graph database derived consisted of 96 labelled databases (4 databases per dimension, 4 per (ir)regularity degree, 6 by label alphabet size). Examples of mesh graph structures derived are shown in Figure Bounded Valence Graphs Bounded valence graphs are graphs where the number of edges incident to each node of the graph is limited by a valence value. The number of nodes per graph and the size of the label alphabet for bounded valence graphs can be controlled analogously to random graphs. The main difference lies in the assignment of the edges to the graphs. Each node in the graph is assigned a constant number of edges, based on the valence value specified as input parameter. For the bounded valence graphs used in this thesis, the number of nodes in the graphs was varied from 5 to and the valence was set to 3% of the number of nodes in the graph. The size of the label alphabet was varied from 5 to labels, uniformly assigning the labels to the nodes and edges in the graphs.

102 9 7.. Extracted Graphs Figure 7.4 Regular bounded valence graph example. Based on regular bounded valence graphs as described above one can generate irregular bounded valence graphs by arbitrarily moving a specified percentage of the edges in the graph. Irregular bounded valence graphs are therefore less restrictive than regular bounded valence graphs, ensuring constant valence only over the entire graph instead of every vertex in the graph. Again, for each parameter setup a database consisting of 1, graphs was generated. Therefore, 36 regular bounded valence graphs (6 different graph sizes and label alphabets) as well as 36 irregular bounded valence graphs databases were generated. (The irregularity was introduced by moving 1% of the nodes in the graphs.) An example of a regular bounded valence graph of size 1 (3 edges per node) is given Figure Extracted Graphs The wide variety of generated graph databases enables the evaluation of the approaches on a large scale of possible input data. However, the results found on generated graphs only illustrate possible outcomes on real world data and are therefore no substitute for tests on extracted graphs. The efficiency of the approaches therefore needs to be tested on real world graphs as well. During this thesis, a variety of graph databases was collected. These include:

103 Chapter 7. Graph Datasets 91 region adjacency graphs [95, 96] fingerprint graphs [97] chemical compound graph representations [98] document graphs [54] In the remainder of this section, each of the above graph types as well as the underlying data will be more thoroughly described Region Adjacency Graphs Region adjacency graphs are a popular approach to represent segmented image data. The graphs used in this thesis are extracted from images created using a plex-grammar generator named PolGen (Policemen Generator, see [95, 96]). PolGen allows to create simplified images resembling scenes from the real world. The plex grammar controls the desired outcome of the images by means of certain parameters. An examples of a segmented image created using the PolGen parser is shown in the left column of Figure 7.5. In the context of this thesis the parser has been extended to generate images with the following content: policemen houses ships landscapes containing policemen, churches, trees, skyline (mountains and hills) and sky (clouds, sun) Based on these images region adjacency graphs have been extracted [99, ]. Each region in a PolGen image is assigned a node in the graph. If two regions share a common boundary, an edge is introduced between these nodes. The background region is ignored in this process, hence the graphs created need not be connected. Various features are extracted from the regions in the images and assigned to the nodes in the graphs. Within the context of this thesis only the color attribute has been evaluated, the reason being that it was the most straightforward value to be controlled in the PolGen parser. In a more sophisticated graph representation other

104 9 7.. Extracted Graphs Figure 7.5 Segmented PolGen image and extracted region adjacency graph. attributes can definitely be included. shown in Figure 7.5. An example of a scenery graph is A database of, images was created for each image type (policemen, houses and ships), resulting in 6, basic images. In addition, 1, more complex scenery images were derived. For each image a graph as described above was extracted resulting in a database containing 7, graphs. In order to gain information on the structure of the database it has been analyzed with respect to the node, edge, and label distribution. In Figure 7.6 the results of this analysis are depicted. The distributions of the nodes and edges are shown in the histogram in the top row (left and right column respectively). It can be seen that most graphs are around 1 nodes in size. Intuitively these graphs correspond to images depicting simple objects, such as policemen or houses. A second cluster can be observed for graphs consisting of about 5 nodes. Obviously, these larger graphs correspond to images of sceneries containing more regions in an image. Considering edges similar tendencies can be seen indicating that the number of edges correlates with the number of nodes in a graph. This behavior is certainly expected. Regarding node label distribution it can be seen that the distribution is not uniformly balanced. There are some labels significantly more frequent than the rest of the values. It can be assumed that this increase is also partially due to the unbalanced color values assigned to each image type. Scenery images contain a much broader color spectrum whereas simpler images are relatively limited. Hence, the color distribution is biased towards the colors used in simple images. In order to compare the amount of information stored in node, edge and label distribution one can measure the entropy of the respective histograms.

105 Chapter 7. Graph Datasets 93 Number of Graphs in the Database Number of Occurrences Region Adjacency Graph Database Number of Nodes in a Graph Region Adjacency Graph Database Labels Number of Graphs in the Database Region Adjacency Graph Database Number of Edges in a Graph number of graphs 7, avg. # nodes/graph 15.4 standard deviation entropy 3.87 avg. # edges/graph 15.8 standard deviation entropy 4.3 label alphabet size 4 label entropy 4. Figure 7.6 Characteristics of the graphs in the RAG database. It can be seen in Figure 7.6 that overall the information stored in the various histograms differs only slightly from one another. The highest amount of entropy is found in the edge distribution of the graphs in the database. Little less entropy is given via the node label distribution. Finally, the size of the graphs contains the least entropy of the values considered. However, as will be shown in Chapter 1 the information available suffices for the filtering approaches to perform sufficiently well. 7.. Fingerprint Graphs Another database of graphs that was used in the experiments is based on a graph representation of fingerprint images. These graphs are designed to be used in fingerprint classification [97]. The problem in fingerprint classification is to assign a given input fingerprint to one of the five Galton- Henry classes (left loop (L), right loop (R), whorl (W), arch (A) and tented arch (T)). This problem is often addressed by extracting global characteristics such as the ridge flow and singular points [11, 1].

106 Extracted Graphs Graph-Quadrant: Figure 7.7 Fingerprint of class whorl (W) and extracted graph. The fingerprint graphs used within the context of this work were extracted using an image filter based on a new definition of directional variance. The graphs are extracted in the following way. First, a directional variance value is measured at every position of the ridge orientation field of the fingerprint. This measure is defined such that high variance regions correspond to relevant regions for the fingerprint classification task (including singular points). From the results of the region extraction process based on the modified directional variance filter an attributed graph is extracted. The idea is to generate a structural skeleton of the fingerprint. The region image is post-processed by applying binarization and thinning methods. Then, ending and bifurcation points of the resulting skeleton are represented by graph nodes. Additional nodes are inserted along the skeleton at regular intervals. An attribute giving the position of the corresponding pixel is attached to each node. Edges containing an angle attribute are used to connect nodes that are directly connected through a ridge in the skeleton. Furthermore, the average direction of the ridge-lines of a node is also added as a node attribute. To make the graph representation more suitable for the approaches presented in this work the graph structure is post-processed in the following way. The average direction stored in the nodes is discretized into 8 major directions a ridge line can follow. Also, as a graph attribute, the barycenter of all node-coordinates is calculated and its location within the nine major regions of the image determined. The region is then assigned as an additional attribute to the graph. An example of a fingerprint representation as described above is given in Figure 7.7. As can be seen in Figure 7.8 the entire database consists of 3,17 graphs. Analyzing the database it can be seen that on average a graph consists of

107 Chapter 7. Graph Datasets 95 Number of Occurrences Number of Graphs in the Database Fingerprint Database Number of Nodes in a Graph Fingerprint Database Labels Number of Graphs in the Database Fingerprint Database Number of Edges in a Graph number of graphs 3,17 avg. # nodes/graph 7.69 standard deviation 5. entropy 3.98 avg. # edges/graph 1.94 standard deviation 1.9 entropy 3.96 label alphabet size 78 label entropy 4.15 Figure 7.8 Characteristics of the graphs in the fingerprint database. about 8 nodes connected through 13 edges. Comparing the distribution of the size of the graphs with the previously described region adjacency graphs (see Figure 7.6) it can be seen that they are much more evenly distributed. As expected, the number of edges in the graphs is closely correlated to the number of nodes in the graphs. Looking at the distribution of the node labels it can be seen that due to the rough discretization approach node labels are biased towards a small range of possible label values, concentrating the distribution within 1 label values occurring in the graphs. In total there exist 8 labels concerning ridge direction and 9 labels on the barycenter of the graph nodes. In practice these two separate label alphabets have been combined into one by constructing the Cartesian product of both alphabets. As a result, the overall label alphabet consists of 78 labels effectively occurring in the graphs. Measuring the information contained in the graphs it can be seen that despite the local concentration of the node labels around 1 the entropy on the label distribution is still comparable to the case of region adjacency Note that the graphs are directed graphs with two edges connecting two nodes in graph.

108 Extracted Graphs bondorder: atom: C bondorder: 1 atom: C atom: C bondorder: 1 atom: C bondorder: bondorder: atom: C bondorder: 1 atom: C atom: C bondorder: 1 atom: C bondorder: 1 bondorder: 1 atom: S bondorder: bondorder: atom: N bondorder: 1 bondorder: atom: C atom: C atom: C bondorder: 1 bondorder: 1 bondorder: 1 atom: N bondorder: 1 bondorder: bondorder: atom: C atom: C atom: S atom: C bondorder: 1 bondorder: 1 atom: S bondorder: 1 atom: S bondorder: 1 Figure 7.9 Chemical compound graph NSC. graphs. The entropy for graph size and edge distribution is not quite that high, but comparable to the values obtained for the region adjacency graphs as well. Therefore, it can be expected that the systems developed perform equally well on region adjacency as well as fingerprint graphs Chemical Compound Graph Representations Graphs are a very popular approach to model the structure of chemical compounds. The US Developmental Therapeutics Program DTP of the US National Cancer Institute NCI has collected a large amount of chemical compounds in graph data format [98]. At the time of this writing 16,75 chemical compounds are freely available as graph data. The graphs contain the following information on the compound they represent. The nodes are assigned an atomic symbol as well as 3-dimensional coordinates of the position of the atom. Edges are assigned an attribute denoting the bonds and the bond order of the connection between the atoms. Furthermore, the

109 Chapter 7. Graph Datasets 97 Number of Graphs in the Database Chemical Compound Database Number of Nodes in a Graph Number of Graphs in the Database Chemical Compound Database Number of Edges in a Graph Number of Occurrences Chemical Compound Database 1.8e+6 1.6e+6 1.4e+6 1.e+6 1e Labels number of graphs 16,75 avg. # nodes/graph standard deviation 8.6 entropy 4.89 avg. # edges/graph standard deviation 9.65 entropy 5.8 label alphabet size 78 label entropy 1.34 Figure 7.1 Characteristics of the graphs in the compound database. chemical compound is classified according to NSC (the NCI s internal ID) as well as the CAS registry number. An example of such a graph is given in Figure 7.9. Analyzing the chemical compound database it can be seen that containing 16,75 graphs makes it the largest database in the testbed (see Figure 7.1). On average, the graphs contain about 19 nodes and edges. Considering the histogram curves it can be seen that both graph size as well as number of edges approximate normal distributions. Looking at the distribution of the node labels it can be seen that there is a minority of labels (chemical elements) dominating the distribution. This is expected since some few elements are much more common than other elements, e. g. carbon C versus terbium Tb. Looking at the information contained in graph size and number of edges it can be seen that both entropy values are about equally large, comparable to the values obtained on other databases (see Figure 7.1). However, significantly less information is stored in the distribution of the node labels of the graphs. It can be expected that this lack of information will have some

110 Extracted Graphs ESTATE REAL VENTURE INVESTORS FUNDING PRIVATE SMALL SBA CAPITAL BUSINESS Figure 7.11 Graph representation of a web document. impact on the approaches which mainly rely on node label information, such as the error-tolerant filtering approaches Document Graphs The document database is derived from two different collections of web documents. It is based on a graph representation successfully applied in document classification [54]. There are in total 78 documents as ground truth data. The graph representation chosen is based on the n most frequent keywords extracted from a document. Each word in a section of the document is assigned a node in the graph, ignoring very frequent terms such as the, of etc. Thus each vertex in the graph represents a unique word in a section and is labelled with a unique term not used to label any other node. Directed edges are added between two words a and b if a precedes b within a section in the document. The sections regarded are title, links and text which can easily be extracted from web documents. Then, a stemming is performed conflating words to the most frequent occurring case, thus updating the edge structure of the graph. Finally, all vertices except the n most frequently occurring terms are pruned from the graph, where n is a user-specified input parameter. An example of graph of a web document is given in Figure In the experimental evaluation of the filtering method n has been set to 1 nodes, thus creating a graph database with graphs consisting of 1 nodes.
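A rough sketch of this construction is given below. It follows the description above (one node per distinct term, a directed edge when one term precedes another within a section, pruning to the n most frequent terms), but the tokenization, the minimal stop-word list and the absence of stemming are simplifying assumptions for illustration.

```python
from collections import Counter

STOP_WORDS = {"the", "of", "and", "a", "to", "in"}   # assumed, minimal list

def document_graph(section_texts, n):
    """Build a directed word-adjacency graph over the n most frequent terms."""
    sections = [[w.lower() for w in text.split() if w.lower() not in STOP_WORDS]
                for text in section_texts]
    counts = Counter(w for words in sections for w in words)
    kept = {w for w, _ in counts.most_common(n)}
    edges = set()
    for words in sections:
        for a, b in zip(words, words[1:]):
            if a in kept and b in kept and a != b:
                edges.add((a, b))       # term a precedes term b in this section
    return kept, edges

# Example with two sections (e.g. title and text) and the 10 most frequent terms.
nodes, edges = document_graph(
    ["Small business funding",
     "venture capital investors provide funding for small business"], 10)
```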

111 Chapter 7. Graph Datasets 99 Number of Graphs in the Database Number of Occurrences Document Graphs Database 1 Nodes Number of Nodes in a Graph Document Graphs Database 1 Nodes Labels Number of Graphs in the Database Document Graphs Database 1 Nodes Number of Edges in a Graph number of graphs 78 avg. # nodes/graph 1 standard deviation entropy avg. # edges/graph 8.15 standard deviation 5.36 entropy 4.1 label alphabet size 1, label entropy 9. Figure 7.1 Characteristics of the graphs the in the document database (1 terms extracted per document). As can be seen in Figure 7.1, this creates a label alphabet much larger than the alphabets introduced so far. It can be seen that virtually no information is contained in the size of the graphs. Naturally, this is expected since the number of terms n to be extracted is constant, hence all graphs are of size n = 1. More information however is contained in the structure of the graphs, denoted by the edges connecting the nodes. The amount of information contained in the structure is comparable to the databases previously described. Looking at the size of the label alphabets on the database it can be seen that it is much larger than the alphabets of the databases introduced so far. Consequently, the most information is contained in the distribution of the labels in the graphs.


113 Chapter 8 Graph Database Filtering Performance Study 8.1 Experimental Setup In order to study theoretical results found in Chapter 4, match and nonmatch performance of graph matching algorithms on different graph types need to be evaluated. While matching performance has been studied in other papers before [9, 91, 9], in this chapter an investigation of the nonmatching performance is included. The aim is to gain some understanding of how matching and non-matching performance of different graph matching algorithms depends on the graph type. The following parameters are varied in the experimental setup: matching paradigm: graph isomorphism and subgraph isomorphism graph type: random graphs, meshes, bounded valence graphs graph matcher: Ullmann [7] and VF [93] graph matching algorithms In this experiment, the types of graphs examined are random graphs, bounded valence graphs (regular/irregular) and meshes (regular/irregular) 11

114 1 8.. Experimental Results.1.1 Ullmann Algorithm, Label Alphabet Size=1 Legend Non-Match Time Match Time.1.1 VF Algorithm, Label Alphabet Size=1 Legend Non-Match Time Match Time Runtime [s].1.1 Runtime [s].1.1 1e-5 1e-5 1e Number of Nodes 1e Number of Nodes Figure 8.1 Match times vs. non-match times for the Ullmann and the VF algorithm on random graphs. from the previously introduced datasets (see Chapter 7). To measure nonmatch performance, each graph of the database is matched against the remaining 999 graphs. To measure match-performance, each graph is matched against itself 1, times. Furthermore, to ensure high runtime resolution, each measurement is conducted several times in a loop, the runtime summed up, and then divided by the total number of loops. Hence, for each graph, there are at least 999 non-match measurements and at least 1, match measurements. To minimize the effect of outliers, 5% of the measured runtime values are discarded (5% of the smallest and 5% of the largest values). In this study the VFlib library (described in [9]) is used to provide implementations of several isomorphism and subgraph isomorphism algorithms. Only Ullmann s and VF are compared against each other since these are the only fundamentally different algorithms contained in VFlib providing both graph isomorphism as well as subgraph isomorphism functionality. 8. Experimental Results The data is analyzed with the objective of comparing matching performance versus non-matching performance for each of the selected matching algorithms. Figure 8.1 (left) shows match and non-match time of Ullmann s algorithm on random graphs as a function of graph size. The figure indicates very clearly that for Ullmann s algorithm the average time needed

115 Chapter 8. Graph Database Filtering Performance Study Ullmann Algorithm, Label Alphabet Size=1 Legend Non-Match Time Match Time 1.1 VF Algorithm, Label Alphabet Size=1 Legend Non-Match Time Match Time Runtime [s] Runtime [s] e-5 1e-5 1e Number of Nodes 1e Number of Nodes Figure 8. Match times vs. non-match times for the Ullmann and the VF algorithm on bounded valence graphs. to match two graphs which are isomorphic to each other is much longer than the time needed to identify two graphs as being non-isomorphic. This is not just a phenomenon specific to Ullmann s algorithm, but can be observed for VF as well, see Figure 8.1 (right). However, comparing the performance of Ullmann s algorithm and VF it can also be seen that the gap between match-/non-match-performance is much smaller for the VF algorithm than for Ullmann s method. Obviously VF is faster than Ullmann s algorithm for matching graphs, but the inverse relation holds for non-matching graphs. Because of the logarithmic scale in Figure 8.1, the difference between the two algorithms is significantly larger for the nonmatching case than for the matching graphs. Hence for the case of random graphs, Ullmann s method and VF are instances of algorithms where parameters δ match and δ non_match, introduced in Section 4.1, have different sign. Therefore, which of the two algorithms is more efficient if used for database filtering eventually depends on the ratio s/m db (see Equation (4.3)). If there are many matching graphs, then VF will outperform Ullmann s method, while for the case where there are many non-matching graphs in the database, Ullmann s method will be faster. The behavior of the two algorithms observed in Figure 8.1 can also be verified for other graph types. Figure 8. shows the performance for bounded valence graphs. In general, both algorithms are slower on graphs of this kind, but it can be seen again that Ullmann s algorithm is faster for the recognition of non-matching graphs whereas VF performs better on matching graphs.

Figure 8.3 Match times vs. non-match times for the Ullmann and the VF algorithm on mesh graphs of varying dimension.

On mesh graphs there is again a large difference in runtime between the matching and the non-matching case (Figure 8.3). Again, VF performs better than Ullmann in matching isomorphic graphs. However, the inverse relation between Ullmann and VF concerning non-match performance that was observed on the bounded valence and random graphs no longer holds. It can be observed that, for the case of non-matching graphs, both methods have about the same performance. The results obtained for meshes, bounded valence and random graphs are confirmed when looking at subgraph isomorphism. To evaluate the performance of subgraph isomorphism detection, the graphs in the database are assumed to be supergraphs of the given input sample. The sample graph is initially defined to be one of the database graphs, and then 3% of its nodes are cut. In Figure 8.4 a behavior very similar to Figure 8.1 can be observed. That is, Ullmann's approach is again very efficient compared to the VF algorithm on non-matching subgraph samples, yet it performs poorly on identifying subgraphs. Figures 8.5 and 8.6 confirm these observations for bounded valence and mesh graphs.
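The cost comparison that drives the worked example below can be summarized in a small sketch. It assumes a simple linear cost model in the spirit of the equations in Chapter 4 (every database graph is either matched or rejected at its average match/non-match cost, and a filter is only useful if it is cheaper than the comparisons it saves); the function names and the exact form of the model are assumptions made for this illustration, not the thesis equations themselves.

```python
def brute_force_time(m_db, s, t_match, t_non_match):
    """Assumed cost model: all m_db database graphs are compared with the
    sample; s of them are matches, the remaining m_db - s are non-matches."""
    return s * t_match + (m_db - s) * t_non_match

def filtered_time(m_db, s, m_filter, t_filter, t_match, t_non_match):
    """Filtering leaves m_filter candidates (including the s true matches),
    which are then verified by full-fledged matching."""
    return t_filter + s * t_match + (m_filter - s) * t_non_match

def max_useful_filter_time(m_db, m_filter, t_non_match):
    """A filter can only pay off if it costs less than the comparisons it saves."""
    return (m_db - m_filter) * t_non_match
```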

Figure 8.4 Subgraph isomorphism match times vs. non-match times for the Ullmann and the VF algorithm on random graphs.

Figure 8.5 Subgraph isomorphism match times vs. non-match times for the Ullmann and the VF algorithm on bounded valence graphs.

In the following, an example of how the results from Sections 4.1 and 8.2 can be combined is given. Look at Figure 8.1 and assume the graphs in the database are of size 5. Then

δ_match = t^Ullmann_match - t^VF_match and δ_non_match = t^Ullmann_non_match - t^VF_non_match,

where the individual runtimes are the average match and non-match times measured for graphs of that size in Figure 8.1. Using Equation (4.3), the ratio s/m_db at which both algorithms perform equally can then be computed.

Figure 8.6 Subgraph isomorphism match times vs. non-match times for the Ullmann and the VF algorithm on mesh graphs (runtime [s] vs. dimension of mesh).

This means that for a database size of 176 graphs and 1 matching graph in the database, brute force matching using either Ullmann's algorithm or VF will perform approximately the same. If the database is increased (for example to 1,000 graphs) with constant value s = 1, then brute force matching using Ullmann's algorithm will be considerably faster than brute force matching using VF. If, on the other hand, there are more matching samples (s is increased) yet the database size stays constant, brute force matching will perform better in combination with VF.

The results derived in Section 4.1 also show that graph database filtering only improves the overall performance of database retrieval if the time needed to apply the filtering method is smaller than the time needed to compare all non-matching graphs that are filtered out with the input sample. For a given database of graphs and a graph matching algorithm, Equation (4.5) shows that filtering can be effective if either the filter method significantly reduces the number of candidates remaining for full-fledged graph matching or the non-match performance of the graph matching algorithm is very poor. Consider again Figure 8.1 for graphs consisting of 5 nodes. The average non-match runtime for Ullmann's algorithm is t_non_match^Ullmann ≈ .63s, and the database itself consists of m_db = 1,000 graphs. Assuming the filtering method reduces the database size to m_filter graphs, one obtains t_filter < t_non_match^Ullmann · (m_db - m_filter) ≈ .6s, which means that a filtering method needs to be faster than .6s in order

for it to be more effective than a brute-force approach using Ullmann's algorithm. Hence the maximum performance gain to be expected from a filtering method is at most .6s. With the average matching time for that database being t_match^Ullmann ≈ .15s, the time spent on final matching is s · t_match^Ullmann ≈ .3s, so the filtering effect for this database and graph matching algorithm is minimal. Considering VF's match and non-match performance on the other hand (t_match^VF ≈ .83s and t_non_match^VF ≈ .87s, respectively), the maximum filter gain to be achieved, t_filter < t_non_match^VF · (m_db - m_filter), is considerably larger. Comparing this to the time needed for final matching, s · t_match^VF ≈ .17s, the effect of the filter will be significant. Note that if the database size m_db is increased further with constant matching sample value s and filter value m_filter, the effect of the filter becomes even stronger.

If one is given a graph database, a graph matching algorithm and a filtering method, then the performance gain to be achieved by the filter can be predicted very precisely. Considering the databases analyzed in this chapter, filtering methods applied to them should be used in combination with VF rather than with Ullmann's algorithm, since VF's matching performance is superior.

8.3 Conclusions

In this chapter, results of a theoretical and experimental study on graph database filtering and graph matching performance have been presented. In particular, two popular graph matching algorithms and the relation between their matching and non-matching performance on different graph types have been tested and compared. Furthermore, the algorithms' suitability for databases of various graph types has been evaluated. The results of the study may be useful for the design of optimal graph matching schemes that include a possibly large database of model graphs. One particular aspect that has been widely neglected until now is the observation that the non-matching time of an algorithm has a crucial impact on the overall performance of a graph matching scheme. For large databases, a small number of expected matches, and no or only moderate filtering, the non-matching performance of the underlying graph matching algorithm is

the dominant factor in the overall efficiency of a graph matching scheme. In addition to matching and non-matching time, a number of parameters such as total database size, size of the database after filtering, number of matches in the database, and time needed for filtering have been identified. These parameters have a crucial influence on the behavior of a graph matching scheme. In particular, the number of matches in the database and the database size can be used to predict the performance of given graph matching algorithms under both brute force matching and a filtering approach. Also, it has been shown that if two different matching algorithms perform equally on a given database, there exists an inverse relation between their matching and non-matching performance. The framework presented in the current study can be extended, in a straightforward way, to other types of graphs, other (subgraph) isomorphism algorithms and other graph matching tasks.
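Such a prediction can be sketched in a few lines. The following snippet follows the kind of reasoning used in Section 4.1 and compares a brute-force scheme with a filtered scheme for a single matching algorithm; all numeric values are invented placeholders, not measurements from this study.

    def brute_force(m_db, s, t_match, t_non_match):
        # Match the query against every graph in the database.
        return s * t_match + (m_db - s) * t_non_match

    def filtered(m_db, s, m_filter, t_filter, t_match, t_non_match):
        # Filter first (cost t_filter), then match only the m_filter remaining
        # candidates, of which s are assumed to be the true matches.
        return t_filter + s * t_match + (m_filter - s) * t_non_match

    # Placeholder values for illustration only.
    m_db, s, m_filter = 1000, 1, 20
    t_match, t_non_match, t_filter = 0.15, 0.087, 0.05
    print("brute force:", brute_force(m_db, s, t_match, t_non_match))
    print("filtered:   ", filtered(m_db, s, m_filter, t_filter, t_match, t_non_match))

Filtering pays off exactly when the filtered estimate is smaller, i.e. when t_filter is smaller than (m_db - m_filter) times the non-match time, which is the condition discussed above.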

Chapter 9
Feature Vector Filtering Experimental Evaluation

In this chapter the filtering approaches based on feature vector comparison, as introduced in Chapter 5, are evaluated. The aim is to present results of experiments demonstrating the suitability of the suggested feature vector representation of graphs. In detail, experimental results on the following topics are discussed:

- feature type evaluation
- sequential feature vector filtering performance

This chapter first presents results obtained by evaluating the suitability of the proposed feature types for database filtering (Section 9.1). Section 9.2 then discusses experimental results obtained by measuring the performance of a simple vector based filtering approach. Finally, concluding remarks are drawn in Section 9.3.

9.1 Feature Type Evaluation

9.1.1 Experimental Setup

In this experiment the goal is to evaluate the suitability of the features presented in Section 5.1 for database filtering. Particularly, the aim is to measure the saliency of the various feature types when used alone or in combination with other features. The proposed features are tested on databases of randomly created graphs. In order to test the feature vector representation very thoroughly, the size of the random graph databases previously shown is increased to 10,000 instead of 1,000 graphs per database. (Again it is made sure by the random graph generation procedure that there are no isomorphic graphs in the database.) The following parameters are varied in the graph creation process:

- number of vertices per graph (5 to 5)
- number of vertex labels in the graph population (5 to )

The average number of edges per vertex is kept constant, i.e. the number of edges is increased proportionally to the number of vertices. During the experiment, the following features are evaluated:

- f_2(a, g): number of vertices in graph g per label a
- f_3(a, g): number of incoming edges per vertex label a in graph g
- f_4(a, g): number of outgoing edges per vertex label a in graph g
- f_5(n, g): number of vertices per in-degree n in graph g
- f_6(n, g): number of vertices per out-degree n in graph g
- f_7(n, a, g): number of vertices per label a and in-degree n

- f_8(n, a, g): number of vertices per label a and out-degree n

Note that since each database generated is homogeneous in the number of vertices per graph, feature f_1(g) is not evaluated. However, this feature should perform similarly to feature f_2(a, g) if the number of labels per graph is small compared to the number of nodes in a graph (that is, if |L| ≪ |V|).

In order to measure the saliency of the features proposed, feature vectors are extracted from the graphs in the database and classified with the decision tree induction algorithm developed in this thesis. During tree induction, it is made sure that the tree is specifically overfitted, hence no pruning techniques are applied. The split criterion applied is the entropy criterion previously introduced (see Section 3.3). The saliency can then be described in terms of leaf set size and split factor of the decision tree. The leaf set size is the number of graph candidates assigned to a leaf node in the decision tree. The smaller the leaf set size, the better; in the ideal case, only one graph from the database is assigned to a leaf node in the decision tree. The split factor of a decision tree node is the number of son nodes assigned to a father node. The larger the average split factor in the decision tree, the shallower the decision tree and hence the smaller the number of tests necessary to identify a graph in the database.

Three sets of experiments are conducted in this setup. In each set, the number of features considered for distinguishing the various graphs is different. In the first set, the tree construction algorithm is only given one feature. In the second set, combinations of two features are used, and in the third set, all seven features listed are used.

9.1.2 Experimental Results

The motivation for the first set of experiments is to identify the power of individual features in separating the database graphs from one another. In these experiments, the tree induction procedure is applied to all generated graph databases, inducing one tree per graph database. Then, the trees are analyzed. Figure 9.1 shows the average leaf set size as well as the average split factor of the trees grown using feature f_5(n, g) (number of vertices per in-degree n in graph g). (The results for feature f_6 were similar and are therefore not depicted.) Since this feature does not extract any information about

Figure 9.1 Leaf set size and split factor of decision trees induced on random graph databases using feature f_5.

Figure 9.2 Leaf set size and split factor of decision trees induced on random graph databases using feature f_3.

the labels of the vertices in the graph, its behavior is obviously independent of variations of the label alphabet size. The feature only measures information available in the node and edge distribution of the graphs. It is easy to see that for random graphs the entropy of a graph set is increased if the graphs increase in size (simply because more variation is possible). Therefore, the leaf set size decreases with an increasing size of the graphs in the database. Similarly, the split factors grow larger with increasing size of the graphs.

A feature comparable to feature f_5(n, g), yet accounting for information stored in vertex labels, is feature f_3(a, g) (number of incoming edges per

Figure 9.3 Leaf set size and split factor of decision trees induced on random graph databases using feature f_7.

vertex label a in graph g). Its performance is depicted in Figure 9.2. (Again, similar results were obtained for feature f_4 and as a consequence these results are not presented.) Naturally, due to the random distribution of the node labels in the graphs, the information provided by this feature grows with an increasing node label alphabet size. Furthermore, with larger graphs there is a wider variety of edge degrees per node label. Hence, the leaf set size decreases (and the split factor increases) with an increasing size of the label alphabet and an increasing size of the graphs in the database.

Feature f_7(n, a, g) (number of vertices per label a and in-degree n) extracts information contained in both feature f_3(a, g) and feature f_5(n, g). In Figure 9.3 it is easy to see that its performance is superior to the performance of its predecessor features. Clearly, this feature yields the smallest worst case leaf set size, since it is capable of processing information stored in the number of nodes, the node degrees as well as the node labels assigned. Figure 9.4 finally shows the values obtained for feature f_2(a, g) (number of vertices in graph g per label a), i.e. when no node degree information is included. Similar to the features presented above, the information contained in the number of nodes and the label alphabet is enough to entirely isolate the graphs based solely on this feature. However, the decline in leaf set size is not as fast as for feature f_7, where additional information on the degree of a node is evaluated. This indicates that for database filtering, on average, more tests need to be made in order to successfully return a reduced candidate set.

Summarizing, it can be said that all features proposed perform very well

Figure 9.4 Leaf set size and split factor of decision trees induced on random graph databases using feature f_2.

Figure 9.5 Leaf set size and tree depth of decision trees induced using the combination of features f_2 and f_3.

in distinguishing the graphs examined and are therefore highly suitable for database filtering. However, the results also lead to the conjecture that small leaf set sizes can be reached with fewer tests if combinations of features are used. This assumption is evaluated in the next set of experiments. There, the attention is focused on features which produced good decision trees during the first set of experiments. Consequently, since the average leaf set size of features f_5 and f_6 was shown to be quite large, these features are not considered in this set.

Figure 9.5 shows leaf set size and split factor obtained for a tree induced by combining features f_2 and f_3. (Similar results are obtained for any other

Figure 9.6 Leaf set size and tree depth of decision trees induced using the combination of features f_2 and f_7.

combination.) As can be seen, leaf set size as well as split factor are now comparable to the values obtained by feature f_7. Naturally, this is expected since in combination these features extract the same information as feature f_7 does. Hence some confirmation is given to the assumption that the more features are used, the higher the saliency of the overall feature vector will be. Figure 9.6 shows an example of the statistics obtained when using a combination of two other features, namely f_2 and f_7. As can be seen, in contrast to the previous case, since feature f_7 already contains all the information evaluated by feature f_2, no gain is made either in leaf set size or in split factor. Overall, however, one can expect a higher saliency rate if features are combined with one another.

This conjecture is confirmed by the third set of experiments, which focused on the use of all features (including features using no label information) to construct the decision tree. As one can see from Figure 9.7, as soon as enough node label information becomes available or the number of nodes is increased, the algorithm consistently produces leaf sets of size 1. Hence, the feature vector achieves the maximum saliency possible. Also, when using very few nodes with a large number of node labels, the split factor decreases. This is due to the fact that many nodes in the database graphs have different labels. Hence the algorithm is only able to factor out a few of them in a given node of the decision tree. However, because of the many different labels the algorithm is also more likely to produce leaf sets of size 1 in that case.
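To make the feature types concrete, the following sketch extracts the label- and degree-based features discussed above (f_2, f_3, f_5 and f_7) from a small labeled, directed graph. The graph representation and function name are illustrative assumptions, not the data structures used in this thesis.

    from collections import Counter

    def extract_features(nodes, edges):
        # nodes: dict node_id -> label; edges: list of (source, target) pairs.
        # Returns dictionaries corresponding to features f2, f3, f5 and f7.
        in_deg = Counter(t for (_, t) in edges)               # in-degree per node
        f2 = Counter(nodes.values())                          # f2: vertices per label
        f3 = Counter(nodes[t] for (_, t) in edges)            # f3: incoming edges per label
        f5 = Counter(in_deg[v] for v in nodes)                # f5: vertices per in-degree
        f7 = Counter((in_deg[v], nodes[v]) for v in nodes)    # f7: vertices per (in-degree, label)
        return {"f2": f2, "f3": f3, "f5": f5, "f7": f7}

    # A small example graph with labels 'a' and 'b'.
    nodes = {1: "a", 2: "b", 3: "a"}
    edges = [(1, 2), (3, 2), (2, 1)]
    print(extract_features(nodes, edges))

The outgoing-edge counterparts f_4, f_6 and f_8 are obtained analogously by counting edge sources instead of edge targets.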

Figure 9.7 Leaf set size and split factor of decision trees induced using the combination of all available features.

9.1.3 Conclusions

In the previous set of experiments the quality of the proposed features has been evaluated. Three sets of experiments have been conducted, all on artificially generated graphs of varying graph size and node label alphabet size. It has been shown that, when used alone, each feature produces satisfactory, although in some cases not optimal, results. However, if the feature types are combined into a single vector, the resulting vector performs very well, achieving a high degree of saliency. Therefore, in the following experiments the feature vectors used are always combinations of all features introduced (unless restrictions apply based on the graph matching paradigm). Based on the insights gained in the previous experiments one can assume that the proposed features, despite their simplicity, should perform very well when used for database filtering.

9.2 Feature Vector Filtering

In Chapter 5 various feature vector filtering approaches for different matching paradigms have been introduced. The feature vectors used were assumed to be combinations of all features previously analyzed. In this experiment, feature vector filtering as proposed is evaluated. The feature

vectors are composed of all feature types, namely f_1 to f_8. Furthermore, within a feature vector the feature types are ordered according to the complexity of the feature. Thus, f_1 is positioned at the beginning, followed by f_2, etc. Based on this representation, the performance of the feature vector filtering methods for the matching paradigms considered (graph isomorphism, subgraph isomorphism and error-tolerant matching) is analyzed. The efficiency of feature vector evaluation is tested on several different types of graphs as presented before, namely random graphs, bounded valence graphs (regular, irregular), and mesh graphs (regular, irregular). Note that in this section, feature vector comparison is evaluated by examining the number of candidates returned as well as the number of tests made by the given evaluation paradigm. In the context of this work the number of candidates returned by a filtering method is also called the cluster size of the filtering method. In the following, the terms candidate set size and cluster size are used synonymously.

9.2.1 Graph Isomorphism Evaluation

In a first experiment, graph isomorphism feature vector comparison is evaluated. The experimental setup is as follows. For each database examined, an input graph is chosen at random. Then its feature vector is compared with all the feature vectors from the database graphs. The comparison of two vectors is stopped as soon as it is certain that the graphs are not isomorphic. Since all graphs in the database are isomorphic only to themselves, only one candidate is a possible candidate for graph isomorphism with the input graph. For each database, 1,000 input graphs are chosen at random, allowing for an average estimate of the size of the candidate set returned and the number of tests made.

Figure 9.8 shows the average number of isomorphism candidates returned by graph isomorphism feature vector evaluation. It is easy to see that the method behaves very well for both graph types illustrated, mesh graphs (left) as well as random graphs (right), returning only one graph as a possible filtering candidate. This is in fact the optimal filtering performance to be reached on the examined databases. (The results for bounded valence graphs are similar and therefore not depicted.) Thus it can be said that the proposed features are very well suited for database filtering considering the cluster sizes returned. Another interesting property is the number of tests to be made during the filtering procedure. In Figure 9.9 the average number of tests made during

Figure 9.8 Average cluster size obtained for mesh and random graphs using graph isomorphism feature vector evaluation.

Figure 9.9 Average number of tests made for mesh and random graphs during graph isomorphism feature vector evaluation.

retrieval of the result sets given in Figure 9.8 is illustrated. Several things can be observed looking at the number of tests made. First, the number of tests made for mesh graphs is lower than the number of tests made for random graphs. This is due to the specific ordering of the feature types in the feature vector. Since feature f_1, measuring the size of the graphs, is evaluated first, there is an overhead of an additional 1,000 tests that do not reduce the result set for the random graph database. The graphs in the mesh database, on the other hand, are not homogeneous in size. Therefore, testing f_1 already significantly reduces the candidate set. Looking at mesh graphs, the number of tests gradually increases with an increasing

Figure 9.10 Maximum number of tests made for mesh as well as random graphs during graph isomorphism feature vector evaluation.

number of labels in the node label alphabet. This is due to the very homogeneous structure of the graphs. After passing the test for graph size in feature f_1, all mesh graphs are equal in size and also similar in edge structure. Therefore, the only features containing significant information to further reduce the result set are those regarding the node label alphabet. As the size of the alphabet increases, so does the size of the feature vector. Consequently, with an increasing size of the label alphabet and a resulting increasing length of the feature vector, the probability of testing more positions in the vector also rises.

The results obtained for random graphs are more difficult to interpret. Generally it can be stated that for smaller random graphs, there are fewer structural variations and, as a consequence, more tests need to be made. Similarly, with a smaller label alphabet fewer variations in the label distribution are possible, resulting in more tests to be made. On the other hand, an increased size of the label alphabet also increases the information stored in the database and therefore reduces the number of tests to be made. The worst case for both graph databases is illustrated in Figure 9.10. It can be seen that for both databases the average number of tests (Figure 9.9) does not significantly increase in the worst case (Figure 9.10). Therefore, it can be assumed that with respect to computational complexity there is significant potential to increase the performance of a feature vector based system (e.g. through data mining techniques).

(Note that the f_i merely denote meta-features. The actual features are given by the graph instance and the label alphabet; the larger the alphabet, the more features are produced.)

Overall, for graph isomorphism, since the feature vectors are not ordered, a high number of tests need to be made. This number is also influenced

Figure 9.11 Average cluster size obtained for mesh and random graphs using subgraph isomorphism feature vector evaluation.

by an unsuitable ordering of the meta-features in the feature vector representation. It is easy to see that the extracted features are well capable of distinguishing among graphs; however, there is a significant need for data mining these features.

9.2.2 Subgraph Isomorphism Evaluation

In the second experiment, subgraph isomorphism feature vector comparison is evaluated. Particularly, subgraph retrieval from a given database of graphs is considered. Similar to graph isomorphism evaluation, an input graph is chosen at random from the given database. Then, a certain percentage of nodes with a label not occurring in the database is added to the graph before the graph's feature vector is compared to all the feature vectors from the database graphs. In the experiments presented, the sample graphs were enlarged by 3% of their original size. By assigning a non-occurring label to the extra nodes, it is made sure that the subgraph isomorphism relation between the graphs is not influenced by the additional nodes (i.e. no additional subgraph isomorphisms are introduced). Again, the comparison of two vectors is stopped as soon as it is certain that the graphs are not subgraph isomorphic. For each database examined, the above procedure is repeated 1,000 times in order to give an average estimate of the performance of the system.

Figure 9.11 shows the cluster size returned by the feature comparison. For random graphs (right diagram), since the graphs in the database have not

Figure 9.12 Average number of tests made for mesh and random graphs during subgraph isomorphism feature vector evaluation.

been altered, the clusters are still very small once enough information is introduced in the graphs. For mesh graphs (left side), however, there is a significant increase in cluster size compared to graph isomorphism. Recall that the graphs in the mesh graph database need not be homogeneous in size. Obviously there are several smaller graphs in the database which are potentially subgraph isomorphic to other graphs in the database. These graphs then result in an increased cluster size if chosen as input graphs.

Considering the number of tests made during filtering, an increase can be observed (Figure 9.12). This is due to the relaxed feature comparison condition, where the individual positions in the feature vector do not need to be equal anymore. Furthermore, for mesh graphs there is a significant increase in the number of tests made for meshes of dimension 3 and 4. The explanation is simple considering the size of the candidate sets returned: an increased result set implies an increased number of tests to be made. For random graphs the increase in tests made is not as high since the result sets are not as large as for mesh graphs. Again it can be stated that, regarding cluster size, the features and filtering approach chosen behave very well for the subgraph isomorphism paradigm. However, considering the number of tests made there is undoubtedly a need to increase the performance of the approach, reducing the number of features that need to be tested.
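The sequential comparison described above can be summarized in a few lines of code. The sketch below is illustrative only; it assumes feature vectors are aligned lists over a common feature index, which differs from the exact representation used in this thesis. Comparison stops as soon as a position rules the database graph out: for graph isomorphism the values must be equal, for subgraph isomorphism the sample's value may not exceed that of the potential supergraph.

    def passes_filter(sample_vec, db_vec, mode="iso"):
        # Compare two aligned feature vectors position by position.
        # Returns (keep_candidate, number_of_tests_made).
        tests = 0
        for s, d in zip(sample_vec, db_vec):
            tests += 1
            if mode == "iso" and s != d:       # graph isomorphism: values must match
                return False, tests
            if mode == "subiso" and s > d:     # subgraph isomorphism: sample <= supergraph
                return False, tests
        return True, tests

    def filter_database(sample_vec, db_vectors, mode="iso"):
        # Return indices of database graphs remaining after filtering (the "cluster").
        return [i for i, v in enumerate(db_vectors)
                if passes_filter(sample_vec, v, mode)[0]]

Every graph remaining in the returned cluster still has to undergo a full-fledged (sub)graph isomorphism test.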

Figure 9.13 Average cluster size for mesh and random graphs using error-tolerant feature vector evaluation (δ_t = 0).

9.2.3 Error-Tolerant Evaluation

The third matching paradigm evaluated is error-tolerant feature evaluation. In this experiment, an input sample graph's similarity to the graphs in the database is evaluated. The procedure is as follows. An input graph is chosen at random from the given database. Then, its feature vector is compared to the feature vectors of the graphs in the database. The graphs are allowed a given maximum threshold distance δ_t (see Section 5.4), varied from 0 (graph isomorphism) to 0.2. Again, the comparison of two vectors is stopped once it is certain that the distance between the two graphs is larger than the specified threshold distance. In order to get an estimate of the performance of the approach, the above procedure is repeated 1,000 times for each database.

Figure 9.13 shows the cluster sizes obtained for mesh as well as random graphs with no error allowed. Essentially, this is the same as retrieving graph isomorphism candidates and hence clusters of size 1 should be retrieved (equal to the clusters obtained for graph isomorphism). This assumption is confirmed by the results depicted in Figure 9.13. Figure 9.14 illustrates the size of the clusters obtained if the threshold distance is varied. It is easy to see that with an increasing threshold distance (top row), larger candidate sets are returned by the filtering method. Furthermore, as expected, the cluster size grows more rapidly if the graphs contain only a small selection of node labels. The more labels a graph can contain, the more non-overlapping parts there are between the graphs, resulting in smaller maximum common subgraphs. However, if fewer labels

Figure 9.14 Average cluster size for mesh (left) and random graphs (right) using error-tolerant feature vector evaluation, allowing a distance of δ_t = 0.1 (top) and δ_t = 0.2 (bottom).

exist, the size of the possible maximum common subgraph also increases.

The number of tests made for a specified threshold distance δ_t = 0 is illustrated in Figure 9.15. It can be seen that for both graph types there are in general fewer tests than for graph isomorphism. This is due to the fact that the feature vectors are smaller than for graph isomorphism testing, because not all features are suitable for error-tolerant traversal. This underlines the fact that for graph isomorphism, too many features are unnecessarily tested. Apart from that, similar tendencies apply for graph isomorphism as well as for error-tolerant comparison with δ_t = 0. The more labels are introduced, the larger the feature vectors and, generally, the more features need to be tested. For mesh graphs, again, a significant number of graphs can be ruled out as lying within the specified threshold distance after the initial size comparison. For random graphs, this does not apply since per

Figure 9.15 Average number of tests made for mesh and random graphs using error-tolerant feature vector evaluation (δ_t = 0).

database all graphs are equal in size. For random graphs, the larger the ratio of the number of labels to the number of nodes, the fewer tests need to be made. For a large number of nodes assigned a small label alphabet, the graphs are generally very similar, whereas for small graphs assigned a large label alphabet, the dissimilarity is detected after fewer feature tests. This effect can be expected to decrease with an increasing distance allowed between the graphs. Figure 9.16 illustrates the number of tests to be made with an increasing threshold distance. As expected, it can be seen that the number of tests made is proportional to the distance increase. Generally it can be observed that there are more tests to be made for the random graph database, which is again due to the fact that these graphs are all homogeneous in size whereas the graphs in the mesh database are of different size.

9.3 Feature Vector Filtering Conclusions

In this chapter feature vector filtering has been studied. First, the suitability of the selected features for database filtering has been thoroughly evaluated. It has been shown that all features achieve a high degree of saliency, thus being able to identify or separate large numbers of graphs. Furthermore, it could be seen that when used in combination, the degree of saliency could even be increased. Besides the suitability of the feature types, feature vector filtering for all

Figure 9.16 Average number of tests made for mesh and random graphs using error-tolerant feature vector evaluation, allowing a distance of δ_t = 0.1 (top) and δ_t = 0.2 (bottom).

proposed matching paradigms has been thoroughly studied. Summarizing, it can be seen that for all considered graph matching paradigms straightforward feature vector evaluation works well with respect to the reduction factor of the initial database size. However, there is also strong indication that the performance of the approach (namely the number of tests made) can be significantly improved if the feature vectors are analyzed by applying data mining techniques. In the next chapter, the main topic of this thesis, the application of data mining techniques in combination with feature vector evaluation, is presented.


Chapter 10
Decision Tree Filtering

It has been shown in the previous chapter that the basic idea of feature vector comparison is very promising for database filtering. However, there has also been indication that the performance could be significantly improved if the feature vectors were analyzed using conventional data mining techniques. In this chapter, applying decision tree induction methods to the extracted feature vectors, which is the main approach of this thesis, is studied. In the next section, the problem is analyzed considering the graph isomorphism matching paradigm. Then, results obtained for subgraph isomorphism retrieval are presented. The approach has also been applied to error-tolerant vector comparison; these results are presented in Section 10.3. Finally, in Section 10.4, concluding remarks are drawn.

10.1 Graph Isomorphism Filtering

The decision tree filtering approach is tested on the same graph types as the feature vector filtering methods. Two aspects of the decision tree filter are analyzed during the experiments, each based upon a different set of graph data. The first dataset consists of the same graph databases previously introduced for feature vector evaluation: random graphs, bounded valence graphs, and mesh graphs. Each database contains 1,000 graphs, every

graph being isomorphic only to itself. This dataset is used to analyze the cluster size and the number of tests made during traversal for the two split criteria previously introduced, gain ratio and weighted entropy. The second dataset consists of the graphs described in Section 9.1, where the suitability of the feature types has been evaluated. This dataset is used to evaluate the approach's capability of dealing with larger datasets while still performing well with respect to the tests made and the size of the candidate sets returned.

In the first set of experiments the cluster size and the number of tests made are measured depending upon the graph database and the induction strategy chosen (split criterion). In order to measure the quality of the approach, decision trees are induced for all graph databases and each type of split criterion previously described (gain ratio as well as weighted entropy). Then, graphs are chosen at random from the database, their feature vectors extracted and the corresponding decision tree traversed. The values measured are the size of the resulting candidate set (denoted as cluster size) as well as the number of tests made during traversal.

The results are depicted in Figure 10.1, which illustrates the cluster sizes obtained for mesh graphs and random graphs. Clearly, the approach behaves very well in terms of cluster size, reducing the initial database size of 1,000 graphs to only one graph remaining for full-fledged graph isomorphism matching after traversal. Compared to feature vector evaluation, the decision tree approach does not decrease in reduction efficiency. As can be seen in Figure 10.1, this result is independent of the graph type or split criterion evaluated.

The quality of a graph isomorphism filter not only depends on the size of the result sets returned, but also on the number of tests made to retrieve the result. In Figure 10.2, the left column illustrates the number of tests made during traversal of decision trees induced with the weighted entropy criterion. The right column, on the other hand, shows the same value obtained from traversing trees induced with the gain ratio criterion. Although both approaches perform very well compared to ordinary feature vector evaluation (see Figure 9.9), it can also be seen that when traversing weighted entropy trees fewer tests need to be made to retrieve a candidate set. This is expected, since the goal of the weighted entropy criterion is to prefer features having a bias towards creating many successor nodes in a tree, whereas the gain ratio criterion tries to normalize that effect (in order to maximize the generalization performance needed for classification purposes). The effect becomes more obvious when analyzing worst case behavior in the number

Figure 10.1 Cluster size for trees induced using gain ratio or weighted entropy as the split criterion (top: random graphs, bottom: mesh graphs).

of tests made, as shown in Figure 10.3. However, from a database filtering point of view the difference in the number of tests made is negligible.

In a second experiment the approach's capability of dealing with large databases is analyzed. Based on the results obtained in the previous experiment, decision trees are induced using the weighted entropy criterion. Figure 10.4 illustrates the cluster size and average tree depth for graph isomorphism decision trees induced on databases consisting of 10,000 graphs. It is easy to see that as soon as enough node label information becomes available or the number of nodes is increased, the filter is capable of reducing the number of remaining candidates to 1. Analyzing the tree depth, it can be seen that, generally speaking, once there is a large amount of information available to the decision tree, only very few tests need to be

Figure 10.2 Number of tests made during traversal before a result set is returned (left: weighted entropy, right: gain ratio).

made to rule out a large number of candidates. (For graph isomorphism trees the tree depth is equal to the number of tests made, since only one tree branch is followed per retrieval.) This is despite the fact that the length of the original feature vectors increases with an increasing number of nodes and labels in the graphs. Therefore, the data mining approach works very well at identifying the important features, thus reducing the feature space. Also, when using very few nodes with a large number of node labels, the tree depth increases. This is due to the fact that many nodes in the database graphs have different labels. Hence the algorithm is only able to factor out a few of them in a given node of the decision tree. However, due to the many different labels the algorithm is also more likely to produce clusters of size 1 in that case.

Summarizing, it can be said that data mining the feature vectors presented in Section 9.2 works very well under all considered circumstances, regarding cluster

Figure 10.3 Maximum number of tests made during traversal before a result set is returned.

size as well as the number of tests made. For all generated graph databases evaluated, the cluster size was maximally reduced, leaving only single graphs for full-fledged matching. Hence, no compromise had to be made compared with full feature vector evaluation. Furthermore, using the decision tree approach the number of features to be tested was significantly reduced. For mesh graphs, for example, using the decision tree the number of tests to be made is reduced from more than 1,000 tests (Figure 9.9) to fewer than 5 tests (Figure 10.2). Also, it has been shown that the approach behaves well if the database size is increased, with only very little (if any) additional testing necessary while keeping the performance with respect to cluster size (Figure 10.4).

The approach is also evaluated on the real-world graph databases introduced in Chapter 7. The experimental setup is analogous to that used for the generated

Figure 10.4 Cluster size and tree depth of decision trees created using the combination of all available features.

Table 10.1 Cluster size statistics for the region adjacency (RAG), fingerprint, chemical compound, and document graph databases: database size, minimum cluster, maximum cluster, average cluster, cluster standard deviation, and average reduction [%].

graph databases. First, a decision tree is induced on each database. Then, in order to measure the cluster size and number of tests made, graph samples are chosen at random from the database, their feature vectors extracted and the decision tree traversed. This procedure is repeated for 1,000 randomly chosen graph samples (… samples for the document database). Table 10.1 shows the measurements made on the region adjacency, fingerprint, chemical compound and document graph databases (first, second, third and fourth column, respectively). It can be seen that in the best case the approach is able to eliminate the entire database except for the sample chosen (minimum cluster size). In the worst case, for all four databases, the approach still filters out a significant number of graphs (more than 96%). Also, the average reduction factor is more than 99.5%. Consequently, for the document graph

Table 10.2 Statistics concerning the number of tests made during traversal for the region adjacency, fingerprint, chemical compound, and document graph databases: minimum, maximum, average, and standard deviation of the number of tests.

database, on average only 1 graph has to undergo a full-fledged graph isomorphism matching. For the region adjacency and chemical compound graph databases, only approximately 4 graphs have to undergo a final full-fledged graph isomorphism test, whereas for the fingerprint database still only 13 graphs out of more than 3,000 graphs need to be tested.

Table 10.2 illustrates the number of tests made in order to achieve the reduction rates depicted in Table 10.1. It is easy to see that the computational cost is negligible. On average, only 1 test for the document database and 5 tests for the remaining databases need to be made to identify the matching candidates. In the worst case, 55 tests need to be made (fingerprint database). However, this number is still negligible considering the computational cost of sequentially matching every graph in the database with the input graph.

Summarizing the performance on real-world databases, it has to be noted that graph isomorphism is a very restrictive concept. Still, the results obtained on the generated datasets have been confirmed on real-world graph databases. The approach performs very well on the databases tested. Furthermore, it is easy to see that the filtering performance of the approach is relatively independent of the size of the database. It could also be seen that the approach behaves very well, meaning that the increase in computational cost is very small, if the size of the database is increased (fingerprint versus chemical compound database, for example).

10.2 Subgraph Isomorphism Filtering

To evaluate the performance of the subgraph isomorphism filters, the first dataset presented in Section 10.1, consisting of random graphs, bounded

valence and mesh graphs, is used. In the experiments, all three approaches (adapted traversal, adapted tree type and combined approach) are evaluated. The experimental setup is as follows. An input graph is chosen at random from the database. Then, its size is increased by 3% of its original size. The additionally introduced nodes are assigned non-occurring labels to make sure that no additional subgraph isomorphisms are introduced during the procedure (analogous to feature vector evaluation). All three subgraph isomorphism filtering approaches described in Section 6.3 are evaluated with regard to the cluster sizes obtained as well as the number of tests made during traversal. In the next section, subgraph isomorphism traversal on graph isomorphism trees is evaluated, before the results obtained using the specific subgraph isomorphism tree structure are presented. Finally, results found when using the combination of both approaches are shown.

10.2.1 Graph Isomorphism Tree Traversal

The primary objective of this set of experiments is to measure the cluster size as well as the number of tests made when using graph isomorphism trees for subgraph isomorphism traversal. A second objective is to determine how decision trees induced using the weighted entropy criterion perform compared to trees induced using the gain ratio criterion. Due to the concurrent traversal of several branches in the tree, it is not clear whether trees induced using gain ratio perform better or worse than trees induced using weighted entropy with respect to cluster size or number of tests made. Therefore, the cluster size as well as the number of tests made are evaluated for both types of trees.

Figure 10.5 shows the cluster sizes measured for both tree types induced (entropy as well as gain ratio trees). The average cluster size was measured by randomly picking a graph from the database, extracting its features and traversing the tree. For graph isomorphism filtering it has been shown in the previous section that as soon as enough node label information becomes available or the number of nodes is increased, the isomorphism traversal algorithm consistently produces clusters of size 1. This means that, after filtering, the input sample only has to be matched against one graph in the database. It is surely not a surprise that in the case of subgraph isomorphism the cluster size is increased. Figure 10.5 illustrates that for random graphs the cluster size typically increases to a few hundred graphs. This means that the filtering approach is still quite effective, eliminating about 70-90% of all graphs in the database as potential supergraphs of the input graph.

Figure 10.5 Average cluster size obtained on decision trees induced using gain ratio (left column) and weighted entropy (right column); top: random graphs, bottom: mesh graphs.

Figure 10.5 also shows the cluster size for mesh graphs. Similarly to random graphs, one can observe that once the graphs contain enough information, the filtering becomes quite efficient, with a typical cluster size of approximately 300 graphs. Comparing these numbers to complete feature vector evaluation, it can be seen that there is in fact higher potential to be achieved: for random graphs, feature vector evaluation reduces the cluster size to only 1 graph remaining, and for mesh graphs it can be reduced to roughly 150 graphs remaining. However, as will be shown below, these cluster sizes come at a much higher computational cost in the number of tests made, namely a few thousand tests for random graphs (as opposed to a few hundred, as shown below) and more than 15,000 for mesh graphs (as opposed to approximately 30 tests using the decision tree).
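A minimal sketch of one way such a relaxed traversal can be realized is given below. It assumes count-valued features that can only grow when nodes are added to a graph, so a supergraph's feature value is never smaller than the sample's; node layout and names are illustrative and do not reproduce the implementation used in this thesis.

    class TreeNode:
        # Decision tree node: tests one feature; children keyed by feature value.
        # Leaves carry the set of database graph ids assigned to them.
        def __init__(self, feature=None, children=None, graphs=None):
            self.feature = feature            # index into the feature vector, None for leaves
            self.children = children or {}    # feature value -> TreeNode
            self.graphs = graphs or set()     # only meaningful for leaves

    def subgraph_candidates(node, sample_vec):
        # Collect candidates for subgraph isomorphism: all branches whose value is
        # at least the sample's value are followed, since a supergraph's count can
        # only be greater than or equal to the sample's count.
        if node.feature is None:
            return set(node.graphs)
        candidates = set()
        sample_value = sample_vec[node.feature]
        for value, child in node.children.items():
            if value >= sample_value:         # graph isomorphism would require equality
                candidates |= subgraph_candidates(child, sample_vec)
        return candidates

For graph isomorphism traversal the condition "value >= sample_value" is replaced by equality, so at most one branch is followed per node, which explains the difference in the number of tests discussed next.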

Figure 10.6 Average number of tests made using decision trees induced with the gain ratio (left column) or the weighted entropy (right column) split criterion.

Furthermore, as can be seen in Figure 10.5, trees based on gain ratio generally produce smaller clusters than trees based on weighted entropy. This performance increase is not coupled with a significant increase in the number of tests made (see Figure 10.6). For subgraph isomorphism filtering, tree structures obtained using gain ratio select more suitable features to be tested than weighted entropy trees. This is most likely a consequence of the generalization goal that gain ratio tries to achieve. Comparing the number of tests made for subgraph isomorphism traversal with isomorphism traversal as previously shown (Figure 10.2), a significant increase in the number of tests made can be observed. (A rough estimate would be that for mesh graphs and isomorphism traversal only 5 tests are made, whereas for subgraph isomorphism roughly 30 tests need to be conducted.) This increase is expected, because for subgraph isomorphism traversal there are in general several successor nodes to be followed, whereas for graph isomorphism

Table 10.3 Cluster size statistics for the region adjacency, fingerprint, chemical compound, and document graph databases under subgraph isomorphism filtering: database size, minimum cluster, maximum cluster, average cluster, cluster standard deviation, and average reduction [%].

traversal at most one successor node needs to be followed. However, comparing it with the full feature vector comparison as shown in Figure 9.12, there is still a great reduction in the number of tests made. (Again, as a rough estimate, for mesh graph filtering using decision trees about 30 tests have to be made, whereas at least 15,000 are necessary if the feature vectors are compared as a whole.)

The adapted traversal procedure is also tested on the real-world graph databases. For all four databases the experimental setup is as follows. For the region adjacency, fingerprint and document graph databases it is assumed that the goal is to identify known patterns stored in the database in an error-prone input sample. Hence, the setup is that the database consists of error-free ground truth data, whereas the input sample is assumed to contain errors in its extracted graph. In order to simulate this setup, a tree is induced on the considered database. Samples are then chosen from the database and their size increased, again by 3% of their original size. The added nodes are assigned labels not occurring in the respective database, denoting errors in the measurement or graph extraction process. For the chemical database it is assumed that the database consists of well-known small compounds to be identified in larger structures. Hence, for that database no error is supposed to be in the input graphs. Again, the decision tree is induced on the entire database. However, only graphs consisting of more than 5 nodes are considered as input samples (offering a selection of 1,33 possible input graphs).

The results are illustrated in Table 10.3. As can be seen, the obtained clusters are in general larger than for graph isomorphism. Similar to the generated graph data, this result is also expected. For the setup where the input sample is expected to be error-prone, the reduction rate is nevertheless quite

Table 10.4 Statistics concerning the number of tests made during traversal for the region adjacency, fingerprint, chemical compound, and document graph databases: minimum, maximum, average, and standard deviation of the number of tests.

high (more than 90% for all three databases). Evaluating the reduction rate on the chemical database, it can be seen that with approximately 48% the performance is significantly worse than for the other databases. This is due to the fact that the more complex compounds used as input samples are composed of the (numerous) simple compounds stored in the database. Hence, this result is expected. Also, looking at the absolute numbers, it can be seen that still more than 6,000 graphs are eliminated from the database. Overall, the results obtained on generated graph data are confirmed and, in the case of the region adjacency, fingerprint and document graphs, surpassed.

Considering the number of tests that have to be made in order to achieve the reduction rates illustrated before, it can be seen that there is a general increase in computational cost for all four databases (see Table 10.4) compared to graph isomorphism filtering. However, comparing the maximum number of tests to be made with the number of graphs stored in the database, for all databases considered fewer than one test per graph is needed to eliminate more than 90% of the graphs (48% for the chemical compound database).

10.2.2 Subgraph Isomorphism Tree Traversal

The experimental setup to evaluate the adapted tree structure is analogous to the setup for the adapted traversal method. The same databases are used and for each database an adapted tree is induced using the gain ratio or weighted entropy criterion. Decision trees suitable for graph isomorphism retrieval can be fully induced without problems. Decision trees suitable for subgraph isomorphism, on the other hand, can grow quite large due to their special structure and therefore need to be limited in size. To control the subgraph isomorphism decision tree growth and to compare the approach's performance with the adapted traversal method shown in the

Figure 10.7 Average cluster size obtained on decision trees specifically induced for subgraph isomorphism traversal using gain ratio (left column) and weighted entropy (right column).

previous section, the trees are limited in size to the same number of nodes as the corresponding graph isomorphism trees (from hundreds up to several thousand nodes). In order to measure the average cluster size, graphs are randomly picked from the database and then used as input samples. Before extracting the features, the sample graph's size is increased and the additional nodes are assigned labels not occurring in the database graphs (this ensures that no additional subgraph isomorphisms are introduced). Then the feature vector is extracted and the decision tree traversed. This procedure is repeated 1,000 times to get an average value for each database concerning cluster size and number of tests made.

Figure 10.7 shows the cluster size for random and mesh graphs, respectively. The results for bounded valence graphs are similar to the values

obtained for random graphs and are therefore not depicted. As can be seen, trees grown using the weighted entropy criterion clearly outperform trees grown using gain ratio. This is due to the fact that with the current subgraph tree type only one successor node is traversed, hence a higher split factor is desirable. Furthermore, it can be observed that the reduction factor is not as high as when traversing graph isomorphism trees using the modified traversal procedure. The reason is that too few features are tested during traversal. Naturally, trees of this type can easily grow very large and hence need to be heavily pruned. As a consequence, the performance is not as high as the performance of graph isomorphism trees. However, the performance with regard to cluster size can easily be improved if the tree is allowed to grow larger. Looking at mesh graphs, it can be seen that a significant number of graphs can still be eliminated from the database. Figure 10.8 illustrates that in fact only very few (less than five) tests are made to achieve the reduction shown. This leads to the conjecture that the reduction rate can be significantly improved if the trees are grown larger.

To support the assumption that the reduction rate improves with larger trees, another experiment is conducted, allowing the subgraph isomorphism tree to grow as many internal nodes as the total number of nodes in the corresponding graph isomorphism tree. (Internal nodes reflect the effective number of feature tests evaluated. Hence, the performance of the adapted tree structure presented here is no longer directly comparable to the adapted traversal algorithm used with the corresponding graph isomorphism trees.) The results for this set are depicted in Figures 10.9 and 10.10. Figure 10.9 shows the cluster size for random and mesh graphs, respectively. It can be seen that the larger tree on average reduces the database size to about 300 graphs for random graphs (formerly 700). This means that the initial database size is reduced by about 70%. Due to the much more regular structure of mesh graphs, the filtering effect is not quite that high. However, still approximately 500 graphs can be eliminated from the database, which is equal to a reduction factor of 50%. Considering the number of tests made, it can be seen in Figure 10.10 that the number of tests has increased by about 2 to 3 tests per database.

In order to compare the adapted tree structure's performance on real-world graph data with the adapted traversal on graph isomorphism trees, the same experimental setup is applied. Again, the tree is allowed to contain as many nodes as the corresponding graph isomorphism tree. Note that due to its size the chemical compound database is not processed using this approach. For the three remaining databases, the same scenario as

described in Section 10.2.1 is used as the experimental setup.

Figure 10.8 Average number of tests made on decision trees specifically induced for subgraph isomorphism traversal using gain ratio (left column) and weighted entropy (right column); top row: random graphs, bottom row: mesh graphs.

Again, the results obtained on the generated graph data are confirmed on real-world graphs. The performance is generally worse than the filtering performance of the adapted traversal (see Table 10.5). Whereas for the fingerprint graph database roughly % of the graphs are eliminated by the filter, for the region adjacency database only 8% and for the document graph database only 15% of the graphs are eliminated. The drop in performance can be explained by Table 10.6, where it can be seen that only very few tests are made, regardless of the database considered. On average, little more than one test is made for the region adjacency and fingerprint databases; for the document database, only two tests are made. Hence, the computational cost is very low considering the cluster sizes obtained. It can be expected that, similarly to the generated graph databases, the performance regarding

cluster size will improve if the trees are grown larger. However, due to hardware constraints no such tests have been made in the context of this work.

Figure 10.9 Cluster size for mesh and random graphs if trees are grown larger.

Figure 10.10 Average number of tests made for mesh and random graphs if trees are grown larger.

10.2.3 Combined Subgraph Filter

Similar to the studies conducted in the previous sections, the combination of the two subgraph filtering methods is also evaluated. Recall that the motivation for combining the two approaches is that, due to the differing tree structures, the resulting candidate sets differ as well. Therefore, combining both methods eventually leads to smaller result sets being returned.

                         RAG       Fingerprints   Documents
  database size          7,        3,17           78
  minimum cluster
  maximum cluster        6,998     3,17           78
  average cluster        4,        ,
  cluster std. deviation 1,
  average reduction [%]

Table 10.5 Cluster size statistics for the region adjacency, fingerprint, and document graph databases.

                         RAG       Fingerprints   Documents
  minimum tests          1         1
  maximum tests          3         4
  average tests
  tests std. deviation

Table 10.6 Statistics concerning the number of tests made during traversal for the region adjacency, fingerprint, and document graph databases.

However, in order for the combination to be effective, different features need to be tested during traversal of the two trees; thus both trees need to be deep enough for this behavior to become effective. To evaluate whether or not these conditions hold, the following experiment is conducted. The databases and trees are reused from the previous experiments. Again, samples are picked from the database, their size is increased and the additional nodes are assigned labels not occurring in the database. Then both tree structures are traversed and their results combined: the resulting candidate set is the set intersection of the candidates returned by the adapted traversal and the candidates returned by the adapted tree structure. This procedure is repeated 1, times for each database in order to get an estimate of the reduction gained on the database. Figure 10.11 shows the results obtained for random (top row) and mesh graphs (bottom row). As can be seen, the results do not improve significantly (if at all) using the trees induced in the previous experiments (compare with Figure 10.5).
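The combination step itself is straightforward. A minimal sketch (the two traversal callables are hypothetical placeholders standing in for the adapted traversal and the adapted tree structure):

    def combined_subgraph_filter(sample_features, adapted_traversal, adapted_tree):
        # Each callable maps a feature vector to a set of candidate graph ids.
        candidates_a = adapted_traversal(sample_features)
        candidates_b = adapted_tree(sample_features)
        # Only graphs accepted by both filters remain candidates.
        return candidates_a & candidates_b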

Figure 10.11 Average cluster size obtained applying the combination of both approaches, using gain ratio (left column) and weighted entropy (right column); top row: random graphs, bottom row: mesh graphs.

Considering the average number of tests made for the adapted tree structure (see Figure 10.8), this result is not surprising, as only very few features are tested in the adapted tree structure. As a consequence, the graph sets in the internal nodes do not differ very much, and hence similar (if not the same) features are selected to split a given node. Since both approaches test similar features, closely related information is evaluated, and consequently no substantial reduction of the result sets can be achieved. Nevertheless, it can be assumed that if both trees are induced deep enough, the feature values tested will become increasingly different and the result sets will differ more strongly. However, at the time of this writing hardware constraints do not allow this presumption to be confirmed. The combined approach has not been tested on the real-world databases. Since the adapted traversal generally performs better on real-world data than

on generated data, and since, conversely, the adapted tree structure performs worse on the real-world graph databases examined than on the generated graph databases, it can be expected that the performance gain of combining both approaches is even lower than observed for the generated databases. Consequently, the combination of the two subgraph filtering methods has not been evaluated on the real-world databases.

10.3 Error-Tolerant Filtering

To demonstrate the efficiency and feasibility of error-tolerant filtering, it was tested on several different types of graphs as introduced before. The experiments are conducted on two sets of databases. The first set is a collection of small random graph databases ( graphs per database) varying in size and label alphabet. This dataset is used to determine the general behavior of the method, i.e. cluster size and tests made depending on the maximum distance allowed. The second set is the collection of graph databases previously used, namely random and mesh graphs.

In the first set of experiments, the general behavior of the method is evaluated on the smaller dataset. In order to measure the average cluster size, graphs are arbitrarily picked from the database and used as input samples for the decision tree filter. This procedure is repeated for various allowed distances, ranging from 0.0 to 0.2, which is sufficient to determine the general behavior of the proposed method. In order to get an average value for the cluster size, the procedure is repeated times for each database and distance setting. In the next paragraphs, results obtained on the smaller databases will be discussed before considering the larger ones.

Figure 10.12 shows the cluster size for random graphs ( nodes in size) obtained with the simple as well as the extensive tree structure. (The results for the other graph sizes are similar to the values illustrated here and are therefore not depicted.) For both approaches it can be seen that, as expected, the cluster size increases with an increasing threshold distance. For a maximum allowed distance of 0.0, the approach constantly returns just one graph from the database. (Since in each database every graph is only isomorphic to itself, this suggests that the method is very well suited for graph isomorphism filtering.) For the simple tree structure, it can be seen that the cluster size increases with the number of labels in the label alphabet. This is also expected. With a larger label alphabet and the labels being uniformly distributed over the graphs in the database, there are overall fewer

nodes with equal labels between the graphs. Since the trees induced have a low depth, only a limited number of features can be tested in a branch. Consequently, although n_mcs is low, there are enough nodes remaining in the database graphs as well as in the sample to reach a leaf node. This means that none of the explicit failure or success conditions (conditions 1 and 2) are met during tree traversal. Success is merely achieved by fulfilling condition 3, reaching a leaf node before anything else happens. Therefore, the graphs in the leaf reached are valid candidates and are added to the result set.

Figure 10.12 Cluster size for simple and extensive tree (random graphs; curves for threshold distances 0.0, 0.1 and 0.2).

To minimize the cluster size, the tree structure implementing extensive feature testing has been introduced. The cluster sizes for this traversal type are illustrated on the right of Figure 10.12. It is clear to see that the nodes in the graphs not accounted for during regular tree traversal are now being consulted and serve to further decrease the cluster size. Also, with an increasing label alphabet size the graphs contain more information and are therefore much better distinguishable. Looking at the curve denoting a threshold δ_t = 0.1, one can observe that the cluster size becomes 1 once the label alphabet is of size 1. This means that finally only one single full-fledged graph distance computation needs to be performed. Similarly, for δ_t = 0.2 and a label alphabet of size 1, on average only 4.7 graph distance computations need to be performed. If the size of the label alphabet is increased further, even fewer distance evaluations need to be done.

In addition to the cluster size, the number of tests made during traversal also influences filtering performance. Since in the proposed approach each test is assigned a node in the tree, this number is reflected in the number of tests made during traversal. An illustration for the same databases considered before is given in Figure 10.13.
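As an aside, the role of these traversal conditions can be made more concrete with the following schematic sketch. It is an illustrative simplification, not the implementation used in this thesis: feature values are assumed to be counts, the mismatch budget is assumed to be derived from the threshold distance δ_t and the size of the sample, only branch pruning (failure) and leaf acceptance are modeled explicitly, and the tree node interface (is_leaf, feature, children, graphs) is likewise assumed.

    def tolerant_traverse(node, sample_features, budget):
        # Collect the candidate graphs reachable without exceeding the remaining
        # mismatch budget (a simplified stand-in for conditions 1-3).
        if node.is_leaf():
            return set(node.graphs)            # leaf reached: its graphs are candidates
        candidates = set()
        for value, child in node.children.items():
            mismatch = abs(sample_features.get(node.feature, 0) - value)
            if mismatch <= budget:             # otherwise the branch is pruned (failure)
                candidates |= tolerant_traverse(child, sample_features, budget - mismatch)
        return candidates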

Figure 10.13 Average number of tests made for simple and extensive tree (random graphs; curves for threshold distances 0.0, 0.1 and 0.2).

On the left of Figure 10.13 the simple approach is illustrated. It can be seen that with an increasing alphabet size the number of features tested (slowly) increases. This can be explained by the fact that with more non-overlapping labels between the graphs (see above), more features are needed to split the graph set. Hence, the depth of the tree increases, which results in more features being tested during traversal. The right side illustrates the number of tests made during extensive traversal. Clearly, more tests need to be made when the extensive tree structure is adopted. However, considering the significant reduction in cluster size, this additional effort is easily compensated.

In the second set of experiments, the results found in the first set are verified on a larger variety of graph types and on larger databases. Figures 10.14 and 10.15 show the results concerning cluster size on mesh graphs for various allowed distances (the results for the random graph database are similar and are therefore not depicted). It is clear to see that with an increasing allowed threshold distance, the clusters returned by the filter increase in size as well. Concurrently with the cluster size, the number of tests made also increases, since with a larger allowed threshold distance more branches can be followed deeper into the tree.

The role of the split criterion can be analyzed by looking at Figures 10.14 and 10.16. (Again, the results for the random graph database are similar and are therefore not depicted.) It can be seen that the tree structures derived do not have a significant influence on the resulting cluster size (Figure 10.14) nor on the number of tests made (Figure 10.16). This can be explained by examining the number of tests made during traversal, where one can see that the number of tests made during traversal of a weighted entropy tree

is similar to the number of tests made during traversal of a tree induced using gain ratio. Hence, about the same amount of information is evaluated, yielding similar results in cluster size.

Figure 10.14 Cluster size for simple/extensive tree induced using weighted entropy (left column) or gain ratio (right column) on mesh graphs of varying dimension for a threshold distance of 0.1.

Comparing the reduction performance of the simple versus the extensive tree structure, it is clear to see that, independent of the threshold distance δ_t, the extensive tree structure significantly outperforms the simple approach (Figures 10.14 and 10.15). The simple tree structure obviously induces fewer features. Consequently, fewer tests are made, resulting in its average cluster size being larger than the cluster sizes obtained by the extensive tree structure. The extensive method behaves especially well if the size of the label alphabet is increased and the graphs thus contain more information. Note that the results obtained on the random graph database are similar.

Figure 10.15 Cluster size for simple/extensive tree induced using weighted entropy (left column) or gain ratio (right column) on mesh graphs of varying dimension for a threshold distance of 0.2.

Due to the increased diversity of the individual graphs in the random graph database, tree depth is generally lower, which results in a larger cluster size when the simple tree structure is traversed. The extensive tree structure, on the other hand, benefits from the increased diversity, which means that the cluster sizes returned are similar to the cluster sizes for mesh graphs. In Figures 10.16 and 10.17 it can be seen that the performance gain of the extensive method in cluster size comes at the price of an increased number of features tested during traversal. The additional number of features tested during extensive tree structure traversal seems quite significant at first sight. However, considering the high computational cost of graph distance computation, there is still a considerable performance gain from further ruling out database graphs (right column of Figures 10.14 and 10.15).

Figure 10.16 Average number of features tested for simple (left) and extensive (right) tree on mesh graphs of varying dimension for a threshold distance of 0.1.

The proposed approach is also tested on the available real-world datasets. The experimental setup is similar to the generated graph database experiment. For all real-world databases a simple and an extensive decision tree are induced. Then, 1, graph samples are chosen at random from the database and the corresponding trees are traversed with various allowed distances. After traversal, the resulting cluster size as well as the number of tests made are measured. For these experiments, instead of the cluster size, the reduction rate is evaluated. The reduction rate is defined as the ratio between the number of non-candidate graphs and the size of the entire database. Therefore, the higher the reduction rate, the better the filter performs with respect to cluster size.
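Written out, for a database DB and a candidate set C returned by the filter, the reduction rate used in the following is

    reduction rate = (|DB| - |C|) / |DB|

so a rate close to 100% means that almost all graphs are ruled out before any full-fledged distance computation has to be performed.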

Figure 10.17 Average number of features tested for simple (left) and extensive (right) tree on mesh graphs of varying dimension for a threshold distance of 0.2.

Figure 10.18 shows the reduction rates obtained for the various databases and allowed distances. As expected, the extended tree structure generally outperforms the simple tree structure with respect to cluster size. For region adjacency, fingerprint and document graphs, the additional reduction is apparent. For the chemical compound database, however, the performance does not improve significantly. An explanation for this can be found by looking at the distribution of the information in the chemical database (see Table 7.1, Chapter 7). As can be seen in Table 7.1, only very little information is contained in the distribution of the labels on the graph database compared to the other databases (Tables 7.6, 7.8 and 7.1). However, decision trees for error-tolerant filtering mostly evaluate features extracting information about the labels of the graphs in the database. It is therefore very likely that most of the information is already evaluated in the features tested by the simple tree structure and only very little additional information is contained in the other features. This assumption is confirmed by the performance of the approach on the document graph database. Compared

to the other databases, document graphs contain significantly more information in their label distribution (Table 7.1, Chapter 7). Consequently, the benefit of using the extended tree structure is maximal on document graphs. (E.g., for a threshold distance δ_t = 0.2 the reduction rate of the simple tree structure is increased by approximately 5%, resulting in a reduction rate of just about % on the document graph database.)

Figure 10.18 Average reduction rates for all tested real-world databases (region adjacency, fingerprint, compound and document graphs) using the simple and extensive tree structure.

Figure 10.19 shows the number of tests made during traversal. As can be seen, the computational cost of the simple tree structure is minimal compared to applying sequential full-fledged error-tolerant matching to the entire database. For fingerprint graphs, for example, more than 98% of the graphs can be eliminated from a database of 3,17 graphs by applying approximately tests, even if a distance of 0.2 (a 20% difference in the node structure of the graphs) is allowed. It is clear to see that this is a significant performance gain over straightforward sequential matching. Considering the benefit of using the extended tree structure, the results on

the generated graph databases are only partially confirmed. On real-world graphs the extended tree structure generally decreases the size of the candidate set returned by the filter. However, there is also a considerable increase in the computational cost of the procedure. Considering the fingerprint database example mentioned previously, the number of tests made increases from to more than 1, tests. While this is still little compared to sequentially applying a full-fledged error-correcting matching algorithm, it is questionable whether the increase in computational complexity is justifiable when considering that for the simple tree structure 47 graphs remain to be tested and for the extended tree structure 16 graphs remain. This ratio becomes worse when the chemical compound database is evaluated. Whereas for an allowed distance of 0.2 the simple tree structure returns 1,69 graphs using ,65 tests, the extended tree structure returns 1,497 graphs using more than 8, tests. Hence, an additional 78, tests only reduce the result set by 11 graphs, less than a .9% increase in reduction efficiency.

Figure 10.19 Number of tests made for all tested real-world databases (region adjacency, fingerprint, compound and document graphs) using the simple and extensive tree structure.
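Whether the extra tests of the extended structure pay off can be estimated with a simple back-of-the-envelope cost model. The sketch below is illustrative only; t_test and t_match (the average cost of a single feature test and of a single full-fledged error-tolerant matching) are assumed to be measured for the database at hand.

    def extended_tree_pays_off(extra_tests, avoided_matchings, t_test, t_match):
        # The extended structure is worthwhile only if the additional feature
        # tests cost less than the full-fledged matchings they help to avoid.
        return extra_tests * t_test < avoided_matchings * t_match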

Whether or not the use of the extended tree structure pays off is therefore highly dependent on the underlying database. Generally speaking, it can be assumed that the simple tree structure performs well enough on real-world data to provide satisfactory filtering performance.

10.4 Decision Tree Filtering Conclusions

In this chapter, the decision tree filtering methods proposed in this thesis have been evaluated. In Chapter 9 it was found that using feature vectors for database filtering is a very suitable approach with respect to the size of the candidate sets retrieved. However, it could also be seen that there is a significant need to reduce the number of features tested. The approach proposed in this thesis reduces the number of features tested by applying machine learning techniques to the feature vectors.

In this chapter, the decision tree filtering methods derived in Chapter 6 have been experimentally evaluated. It could be seen that, compared with straightforward feature vector filtering, all approaches significantly decrease the number of features tested. For graph isomorphism, decision tree filtering was shown to be highly suitable, achieving the same filtering efficiency with respect to cluster size while significantly reducing the number of tests. Similarly, for subgraph isomorphism, the decision tree approach was shown to be very effective; although it does not achieve reduction rates comparable to graph isomorphism, the decision tree filter still significantly reduces the number of tests to be made compared with feature vector filtering. Finally, considering error-tolerant retrieval, it could be seen that the use of decision tree filtering again very significantly increases filtering performance. Overall it can be said that the proposed method, database filtering based on feature vectors in combination with decision tree induction, proves to be a very suitable approach for handling large databases of graphs.

Chapter 11

Conclusions and Future Work

11.1 Conclusions

This thesis addresses the problem of how to process databases of graphs representing structural data. Graphs and graph matching are a popular approach in pattern recognition. However, in general, graph matching is also a computationally expensive approach. Besides pairwise comparison of graphs, matching of an input sample to an entire database of graphs has become more interesting in recent years. When dealing with databases of graphs, the size of the database further increases the computational cost of the overall matching scheme. Several approaches have been proposed in the literature to deal with this factor, yet most of these approaches are unsuitable for large databases. In this thesis, a database filtering architecture capable of dealing with large databases of graphs has been proposed. In database filtering it is assumed that database retrieval is composed of two independent steps:

1. Database filtering: In a first step, the entire database is filtered based on a given matching paradigm and a given input graph. The filtering step reduces the initial size of the database, leaving only a subset of the entire database to be processed in the exact matching step.

2. Exact matching on the filter result: The subset returned by the filtering step is then processed using a graph matching algorithm for the given matching scheme. Every graph in the reduced subset is matched against the given input graph using a conventional matching algorithm. Matching graphs are added to the overall result of the retrieval method; non-matching graphs are eliminated.

This thesis proposes to apply machine learning techniques to implement the filtering step of the database retrieval method. The proposed architecture has been evaluated focussing on several aspects of database filtering:

1. the relation between the database filtering method and the matching algorithm
2. the suitability of a feature representation for graphs
3. graph feature vector filtering
4. decision tree filtering

In the following, the obtained results will be briefly summarized and conclusions drawn for each of the above topics.

Database Filtering Scheme

In order to gain general insight into the proposed graph database retrieval architecture (filtering followed by full-fledged sequential matching), a study has been conducted evaluating the relation between filter performance and exact matching algorithm performance. This study has shown that the benefit to be expected from filtering can be estimated by comparing the filtering performance with the non-matching performance of the applied matching algorithm. It has been seen that, when comparing the efficiency of graph database filtering, it is, contrary to most research efforts made today, not the matching performance of a graph matching algorithm that is the dominant factor, but its non-matching performance. In the course of this study, a number of parameters affecting the behavior of a graph matching scheme have been identified, such as the total database size, the size of the database after filtering, the number of matches in the database, and the time needed for filtering. Several scenarios have been proposed in which popular graph matching algorithms would outperform others, depending on the database retrieval scenario.
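The two-step retrieval scheme summarized at the beginning of this chapter can be condensed into a few lines. The sketch below is purely illustrative; the filter and the exact matcher are passed in as callables, and their names are placeholders rather than the interfaces of the actual system.

    def retrieve(input_graph, database, filter_candidates, matches):
        # filter_candidates(g, db): candidate subset of db for input graph g
        # matches(g1, g2): full-fledged matching under the chosen paradigm
        candidates = filter_candidates(input_graph, database)      # step 1: filtering
        return [g for g in candidates if matches(input_graph, g)]  # step 2: exact matching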

Within this thesis, apart from the general examination of the relation between database filtering and graph matching algorithms, various graph database filtering approaches have been examined. For these approaches, three matching paradigms have been considered for database filtering:

- graph isomorphism
- subgraph isomorphism
- error-tolerant graph matching

Based on the above matching paradigms, database filtering using feature vectors and decision trees has been evaluated.

Graph Features

In a first phase of this work, suitable graph features have been defined with respect to the various constraints given in filtering. Suitable graph features need to be easily extractable while maintaining a high degree of saliency. The features used in the context of this work have been selected based on their low computational complexity, guaranteeing fast extraction. In experiments it has been shown that they also possess a high degree of saliency (Section 9.1) and are therefore very suitable for database filtering. Furthermore, it has also been shown that combinations of the features improve the degree of saliency while keeping the extraction cost constant.

Feature Vector Filtering

Based on the derived feature types, simple vector filtering approaches have been proposed. In order to evaluate the efficiency of database filtering based on feature vectors, sequential vector comparison methods for all considered matching schemes have been developed. For graph isomorphism, a straightforward approach comparing the feature vectors position-wise for equality can be applied. It has been shown that the same comparison method can be used for subgraph isomorphism if the feature representation is properly modified. For error-correcting graph matching, a more sophisticated method working on a subset of the presented features has been developed. Based on this reduced subset, the method returns an estimate of the maximum possible size of the maximum common subgraph of the two graphs compared. From this estimate, a lower bound on the distance between the two graphs can be derived.
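A minimal sketch of these vector comparisons is given below. It assumes count-type features in a fixed order (e.g. node counts per label) and, for the error-tolerant case, an MCS-based distance of the form d(g1, g2) = 1 - |mcs| / max(|g1|, |g2|), which is used here purely for illustration; the actual features and distance measure are those defined in the earlier chapters. The subgraph test assumes the paradigm of retrieving database graphs contained in the input graph; for the opposite containment, the inequality is simply reversed.

    def iso_candidate(fv_input, fv_db):
        # Graph isomorphism: every feature value must match exactly.
        return fv_input == fv_db

    def subgraph_candidate(fv_input, fv_db):
        # Subgraph isomorphism: with count features, a database graph can only be
        # contained in the input if none of its counts exceeds the input's counts.
        return all(d <= i for d, i in zip(fv_db, fv_input))

    def distance_lower_bound(mcs_upper_bound, n_input, n_db):
        # Error-tolerant matching: an upper bound on the size of the maximum
        # common subgraph yields a lower bound on the (illustrative) distance.
        return 1.0 - mcs_upper_bound / max(n_input, n_db)

A database graph is kept as an error-tolerant candidate only if this lower bound does not exceed the allowed threshold distance.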

It has been shown that a feature vector filter significantly reduces the number of graphs to be tested for all considered matching paradigms. Hence, only a very small portion of the original graph database needs to undergo a full-fledged matching algorithm (Section 9.2). It can therefore be said that database filtering based on feature vectors is a very suitable approach with respect to reduction efficiency. Besides the filtering efficiency, the computational cost of feature vector filtering has also been measured. It was observed that, depending on the graph type, relatively large feature vectors are produced, yielding a high computational cost for the comparison of the feature vectors. To reduce these costs while maintaining the reduction efficiency, data mining techniques have been applied.

Decision Tree Filtering

In order to reduce the computational complexity of the vector filtering methods, decision tree methods have been applied. As an additional preprocessing step, besides feature vector extraction on the database graphs, the feature vectors are analyzed using decision tree techniques known from data mining. The result of this preprocessing step is a tree structure which sorts the features to be tested according to their descriptive power on the underlying graph set. Similar to feature vector filtering, different tree structures need to be induced for each graph matching scheme. At runtime, the features needed are extracted from the given input sample graph and the decision tree is traversed. The traversal procedure returns a graph set to be tested against the input graph with the full-fledged matching algorithm.

Generally, it has been shown that applying decision tree methods significantly reduces the number of tests to be made while maintaining the reduction efficiency of the vector filtering methods. For graph isomorphism and the considered databases, it has been shown that the filter generally eliminates more than 99% of the graphs in the database (Section 10.1). This result has been achieved on both graph database types evaluated, generated graphs as well as real-world databases. For subgraph isomorphism, two filtering approaches have been proposed. Based on the decision trees used for graph isomorphism filtering, a special traversal procedure has been developed allowing filtering for subgraph

isomorphism candidates by traversing a graph isomorphism tree. The modification of the traversal procedure yields a small computational overhead during traversal. In order to reduce this overhead, a second approach has been developed, adapting the decision tree structure while using the same traversal algorithm as for graph isomorphism filtering. The trade-off for the reduced computational overhead is an increased memory requirement of the tree structure. Naturally, for both subgraph isomorphism filters the efficiency is not as high as the graph isomorphism filtering performance (Section 10.2). It has been shown that the increase in memory requirement of the adapted tree structure is much too high to provide a satisfactory filtering result (Section 10.2.2): the trees induced simply do not test enough features to thoroughly compare the database graphs with the input graph. Again, these results have been confirmed on generated as well as real-world graph databases. For the adapted traversal algorithm, on the other hand, the performance with respect to the reduction of the database is much better (Section 10.2.1). On generated as well as real-world datasets, the resulting candidate sets were significantly smaller than the original database size.

The decision tree filtering approach has also been applied to the third matching paradigm, error-tolerant graph matching. Within this thesis, two decision tree structures enabling the filtering of graph databases for candidates within a given threshold distance have been developed. Both tree structures, the simple as well as the extended one, are based on the same traversal principle. The difference lies in the assumption about the completeness of the features found in the underlying database. For the simple tree structure, the assumption is that most features occurring in the input samples also occur in the database graphs. Hence, a very simple tree structure is developed where tree induction is stopped once singular leaf nodes are reached. The extended tree structure assumes that a large portion of the features of the potential input samples does not occur in the database graphs. Consequently, induction is continued on singular leaf nodes until no features are left to be induced, allowing for a more thorough examination of input and database graphs during tree traversal. Both approaches have been thoroughly tested on generated as well as real-world graph databases (Section 10.3). In general, it has been shown that (as expected) with respect to cluster size the extended tree structure significantly outperforms the simple tree. On the other hand, it has also been found that the extended tree structure incurs a significant computational overhead compared to the simple tree structure. Whereas for the generated graph datasets this overhead was found to be generally justified considering the increase in filtering efficiency, for real-world databases this could not be confirmed.
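The essential difference between the two tree structures lies in the stopping criterion of the induction. A minimal sketch of that criterion is given below (the surrounding induction loop, the split selection by gain ratio or weighted entropy, and the node representation are omitted):

    def stop_induction(graphs, remaining_features, extended):
        # Simple tree structure: stop as soon as the node holds a single graph
        # (a singular leaf node) or no features are left to be tested.
        # Extended tree structure: keep inducing on singular nodes as well, and
        # stop only when no features are left, so that traversal can examine
        # input graphs more thoroughly.
        if not remaining_features:
            return True
        if extended:
            return False
        return len(graphs) <= 1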

In fact, for the real-world databases considered, the size and distribution of the label alphabet, and therefore the information contained in it, have a significant impact on the relative performance of the simple versus the extended tree structure.

Summarizing, it can be seen that processing large databases of graphs is an important topic in pattern recognition, and that a high computational cost is inherent to this problem. However, the approach developed in this thesis, applying feature vector filtering in combination with decision tree methods, has been shown to be highly suitable for this task.

11.2 Future Work

Based on the work done in this thesis, there are a number of issues to be considered for future work. The current status of the work has progressed enough for three strategies to be pursued:

- development of specific indexing applications (document search, plan storage, etc.)
- extending the data mining approach for graph data structures
- development of a graph database application

The system developed in this thesis is at a stage where it now needs to be specialized if it is to be used for indexing patterns. So far, only few databases of patterns are available in graph format. However, many data collections are very well suited to being represented as graphs. Document data (web or paper documents) has successfully been represented and classified using graphs as the underlying data structure. Possible indexing scenarios include more powerful document search engines (web search, desktop search, etc.). Increasing interest has also evolved in digitizing plan data, such as street maps, buildings, etc. Vast amounts of such data need to be analyzed and stored in some way. Graph-based methods as proposed in this thesis are a natural, suitable and powerful way to do this.

With the increasing popularity of graphs in pattern recognition, the use of graphs for classification tasks has also increased considerably. Large collections of data are being digitized and represented as graphs. Examples of

such data include web documents, paper documents, building plans, and biometric data. More recently, graphs have also been successfully applied to the examination of software systems and to network analysis. The framework provided in this thesis can be used to analyze the collected graph data in order to retrieve or identify relations or knowledge not apparent at first sight. Furthermore, through the provided framework, the data can be analyzed with respect to a specific graph matching paradigm. First steps in this direction have been made in the development of classifiers for fingerprint representations of the NIST 4 database (see Appendix A), but others could easily be included as well. Numerous datasets could be analyzed in this way, even though the system is not primarily targeted at this type of application.

The system developed in this thesis for filtering databases of graphs can be considered a prototype. An interesting challenge would be the development of an application based on this prototype. Several topics would need to be addressed in this endeavor. One important task is the automatic discretization of labels assigned to the nodes of graphs. There has recently been increased interest in the machine learning community regarding the discretization of continuous attributes. Many data mining techniques require discrete feature values; others build more accurate models when applied to discrete attributes. Consequently, research in this field has become more popular. As the system proposed in this thesis relies on discrete labels on the nodes of the graphs processed, a discretization module for the system would be of great importance. Another important topic when developing an application from the prototype would be its integration with a common database system. Until now, the system is capable of data mining feature vectors, distinguishing the more powerful features from the rest. Only little research effort has been made on how the remaining data could be used with traditional database systems. However, for a graph database application to be successful, considerable effort would need to be spent on integrating (and possibly promoting) it with respect to database systems as known from industry. Furthermore, as the prototype system is solely an inexact estimation mechanism, popular exact matching tools would need to be integrated into a full-fledged application. The above steps are minimal requirements for a full-fledged graph database retrieval system. Starting from this, a wide variety of tasks could be addressed. With the increasing computational power of today's hardware, one can start to exploit the power of such systems, resulting in more flexible alternatives to other existing techniques used today.


Part IV

Appendix


Appendix A

Fingerprint Classification

In fingerprint recognition one generally distinguishes between fingerprint identification and fingerprint classification. In fingerprint identification, fingerprints are matched against a large database of known samples. In fingerprint classification, on the other hand, the aim is to assign classes to different types of fingerprints. (Fingerprint classification is usually done in order to reduce the complexity of fingerprint identification.) The fingerprint classification problem is considered to be difficult because of the large within-class variability and the small between-class separation. There exists a wide variety of approaches based on different techniques to solve the classification problem (e.g. neural network based approaches [11, 1]). Recently, structural pattern recognition approaches have become more popular for addressing the classification problem [47, 13, 14]. Even though these approaches often fall behind in terms of classification performance, they have successfully been applied to improve the performance of multiple classifier systems [14, 15].

In fingerprint identification the aim is to match and identify identical fingerprints. This problem is usually addressed by focussing on local characteristics such as minutiae points. Conversely, in fingerprint classification, the problem is to assign a given input fingerprint to one of the five Galton-Henry classes (left loop (L), right loop (R), whorl (W), arch (A) and tented arch (T)). This problem is often addressed by extracting and representing global characteristics such as the ridge flow or singular points [11, 1]. In [97],

the authors propose a structural approach to solve the classification problem. In this thesis, the graph structure developed in [97] has successfully been applied to database filtering. In this appendix, two new approaches based on data mining graphs using decision trees are proposed to solve the classification problem. In Section A.1, a brief introduction to the extraction of the graph structure is given. Then, in Section A.2, the two classifiers developed are outlined, before experiments are presented in Section A.3. Finally, in Section A.4, conclusions are drawn and future work is suggested.

Figure A.1 Extracted skeletons for all classes: left loop (L), right loop (R), whorl (W), arch (A) and tented arch (T).

A.1 Graph Representation

The fingerprint graphs used within the context of this work are extracted using an image filter based on a measure of directional variance.

Figure A.2 Fingerprint of class W and extracted graph.

The extraction of the graphs is composed of two steps. First, a region image is extracted using a directional variance value. The directional variance is measured at every position of the ridge orientation field of the fingerprint. The variance measure itself is defined such that high-variance regions correspond to regions relevant for the fingerprint classification task (including singular points). Based on the variance, a region image is generated. In the second step, the results of the region extraction process are used to extract an attributed graph structure. The idea is to generate a structural skeleton of the characteristic variance regions. The region image is post-processed by applying binarization and thinning methods. Then, ending and bifurcation points of the resulting skeleton are represented by graph nodes. Additional nodes are inserted along the skeleton at regular intervals. An attribute giving the position of the corresponding pixel is attached to each node. Edges containing an angle attribute are used to connect nodes that are directly connected through a ridge in the skeleton. Furthermore, the average direction of the ridge lines at a node is added as a node attribute.

To make the graph representation more suitable for the approaches presented in this work, the graph structure is post-processed in the following way. The average direction stored in the nodes is discretized into 8 (respectively 16, see Section A.3) major directions. Also, as a graph attribute, the barycenter of all node coordinates is calculated and its location within the nine major regions of the image is determined. The region is then assigned as an additional attribute to the graph. Examples of extracted skeletons for each class are given in Figure A.1. An example of a fingerprint representation as described above is given in Figure A.2.
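The post-processing step just described, discretizing the node directions and attaching the barycenter region, can be sketched as follows. This is an illustrative sketch only; the exact binning convention and coordinate system are assumptions, and the real extraction operates on the attributed graphs produced by the skeleton extraction.

    import math

    def discretize_direction(angle_rad, bins=8):
        # Map an average ridge direction to one of `bins` (8 or 16) major directions.
        step = 2 * math.pi / bins
        return int((angle_rad % (2 * math.pi)) / step)

    def barycenter_region(points, width, height):
        # Compute the barycenter of all node coordinates and map it to one of the
        # nine major regions (a 3x3 grid) of the image.
        bx = sum(x for x, _ in points) / len(points)
        by = sum(y for _, y in points) / len(points)
        col = min(int(3 * bx / width), 2)
        row = min(int(3 * by / height), 2)
        return 3 * row + col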


More information

The Structure and Properties of Clique Graphs of Regular Graphs

The Structure and Properties of Clique Graphs of Regular Graphs The University of Southern Mississippi The Aquila Digital Community Master's Theses 1-014 The Structure and Properties of Clique Graphs of Regular Graphs Jan Burmeister University of Southern Mississippi

More information

Image Processing. Image Features

Image Processing. Image Features Image Processing Image Features Preliminaries 2 What are Image Features? Anything. What they are used for? Some statements about image fragments (patches) recognition Search for similar patches matching

More information

c 2004 Society for Industrial and Applied Mathematics

c 2004 Society for Industrial and Applied Mathematics SIAM J. MATRIX ANAL. APPL. Vol. 26, No. 2, pp. 390 399 c 2004 Society for Industrial and Applied Mathematics HERMITIAN MATRICES, EIGENVALUE MULTIPLICITIES, AND EIGENVECTOR COMPONENTS CHARLES R. JOHNSON

More information

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra Pattern Recall Analysis of the Hopfield Neural Network with a Genetic Algorithm Susmita Mohapatra Department of Computer Science, Utkal University, India Abstract: This paper is focused on the implementation

More information

THE GROUP OF SYMMETRIES OF THE TOWER OF HANOI GRAPH

THE GROUP OF SYMMETRIES OF THE TOWER OF HANOI GRAPH THE GROUP OF SYMMETRIES OF THE TOWER OF HANOI GRAPH SOEUN PARK arxiv:0809.1179v1 [math.co] 7 Sep 2008 Abstract. The Tower of Hanoi problem with k pegs and n disks has been much studied via its associated

More information

Höllische Programmiersprachen Hauptseminar im Wintersemester 2014/2015 Determinism and reliability in the context of parallel programming

Höllische Programmiersprachen Hauptseminar im Wintersemester 2014/2015 Determinism and reliability in the context of parallel programming Höllische Programmiersprachen Hauptseminar im Wintersemester 2014/2015 Determinism and reliability in the context of parallel programming Raphael Arias Technische Universität München 19.1.2015 Abstract

More information

Product constructions for transitive decompositions of graphs

Product constructions for transitive decompositions of graphs 116 Product constructions for transitive decompositions of graphs Geoffrey Pearce Abstract A decomposition of a graph is a partition of the edge set, giving a set of subgraphs. A transitive decomposition

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

A Connection between Network Coding and. Convolutional Codes

A Connection between Network Coding and. Convolutional Codes A Connection between Network Coding and 1 Convolutional Codes Christina Fragouli, Emina Soljanin christina.fragouli@epfl.ch, emina@lucent.com Abstract The min-cut, max-flow theorem states that a source

More information

Graphs and Network Flows IE411. Lecture 21. Dr. Ted Ralphs

Graphs and Network Flows IE411. Lecture 21. Dr. Ted Ralphs Graphs and Network Flows IE411 Lecture 21 Dr. Ted Ralphs IE411 Lecture 21 1 Combinatorial Optimization and Network Flows In general, most combinatorial optimization and integer programming problems are

More information

Chapter 15 Introduction to Linear Programming

Chapter 15 Introduction to Linear Programming Chapter 15 Introduction to Linear Programming An Introduction to Optimization Spring, 2015 Wei-Ta Chu 1 Brief History of Linear Programming The goal of linear programming is to determine the values of

More information

Algorithms for Grid Graphs in the MapReduce Model

Algorithms for Grid Graphs in the MapReduce Model University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Computer Science and Engineering: Theses, Dissertations, and Student Research Computer Science and Engineering, Department

More information

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents E-Companion: On Styles in Product Design: An Analysis of US Design Patents 1 PART A: FORMALIZING THE DEFINITION OF STYLES A.1 Styles as categories of designs of similar form Our task involves categorizing

More information

Solutions to In-Class Problems Week 4, Fri

Solutions to In-Class Problems Week 4, Fri Massachusetts Institute of Technology 6.042J/18.062J, Fall 02: Mathematics for Computer Science Professor Albert Meyer and Dr. Radhika Nagpal Solutions to In-Class Problems Week 4, Fri Definition: The

More information

On the null space of a Colin de Verdière matrix

On the null space of a Colin de Verdière matrix On the null space of a Colin de Verdière matrix László Lovász 1 and Alexander Schrijver 2 Dedicated to the memory of François Jaeger Abstract. Let G = (V, E) be a 3-connected planar graph, with V = {1,...,

More information

Chapter S:II. II. Search Space Representation

Chapter S:II. II. Search Space Representation Chapter S:II II. Search Space Representation Systematic Search Encoding of Problems State-Space Representation Problem-Reduction Representation Choosing a Representation S:II-1 Search Space Representation

More information

Discrete mathematics

Discrete mathematics Discrete mathematics Petr Kovář petr.kovar@vsb.cz VŠB Technical University of Ostrava DiM 470-2301/02, Winter term 2018/2019 About this file This file is meant to be a guideline for the lecturer. Many

More information

1. a graph G = (V (G), E(G)) consists of a set V (G) of vertices, and a set E(G) of edges (edges are pairs of elements of V (G))

1. a graph G = (V (G), E(G)) consists of a set V (G) of vertices, and a set E(G) of edges (edges are pairs of elements of V (G)) 10 Graphs 10.1 Graphs and Graph Models 1. a graph G = (V (G), E(G)) consists of a set V (G) of vertices, and a set E(G) of edges (edges are pairs of elements of V (G)) 2. an edge is present, say e = {u,

More information

Community Detection. Community

Community Detection. Community Community Detection Community In social sciences: Community is formed by individuals such that those within a group interact with each other more frequently than with those outside the group a.k.a. group,

More information

4. Simplicial Complexes and Simplicial Homology

4. Simplicial Complexes and Simplicial Homology MATH41071/MATH61071 Algebraic topology Autumn Semester 2017 2018 4. Simplicial Complexes and Simplicial Homology Geometric simplicial complexes 4.1 Definition. A finite subset { v 0, v 1,..., v r } R n

More information

A Construction of Cospectral Graphs

A Construction of Cospectral Graphs Project Number: MA-PRC-3271 A Construction of Cospectral Graphs A Major Qualifying Project Report submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUTE in partial fulfillment of the requirements

More information

Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong)

Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong) Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong) References: [1] http://homepages.inf.ed.ac.uk/rbf/hipr2/index.htm [2] http://www.cs.wisc.edu/~dyer/cs540/notes/vision.html

More information

Assignment 4 Solutions of graph problems

Assignment 4 Solutions of graph problems Assignment 4 Solutions of graph problems 1. Let us assume that G is not a cycle. Consider the maximal path in the graph. Let the end points of the path be denoted as v 1, v k respectively. If either of

More information

A Formalization of Transition P Systems

A Formalization of Transition P Systems Fundamenta Informaticae 49 (2002) 261 272 261 IOS Press A Formalization of Transition P Systems Mario J. Pérez-Jiménez and Fernando Sancho-Caparrini Dpto. Ciencias de la Computación e Inteligencia Artificial

More information

Image Sampling & Quantisation

Image Sampling & Quantisation Image Sampling & Quantisation Biomedical Image Analysis Prof. Dr. Philippe Cattin MIAC, University of Basel Contents 1 Motivation 2 Sampling Introduction and Motivation Sampling Example Quantisation Example

More information

Discrete Mathematics

Discrete Mathematics Discrete Mathematics Lecturer: Mgr. Tereza Kovářová, Ph.D. tereza.kovarova@vsb.cz Guarantor: doc. Mgr. Petr Kovář, Ph.D. Department of Applied Mathematics, VŠB Technical University of Ostrava About this

More information

Bipartite Roots of Graphs

Bipartite Roots of Graphs Bipartite Roots of Graphs Lap Chi Lau Department of Computer Science University of Toronto Graph H is a root of graph G if there exists a positive integer k such that x and y are adjacent in G if and only

More information

Lecture : Topological Space

Lecture : Topological Space Example of Lecture : Dr. Department of Mathematics Lovely Professional University Punjab, India October 18, 2014 Outline Example of 1 2 3 Example of 4 5 6 Example of I Topological spaces and continuous

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting

More information

The Curse of Dimensionality

The Curse of Dimensionality The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more

More information

Course Introduction / Review of Fundamentals of Graph Theory

Course Introduction / Review of Fundamentals of Graph Theory Course Introduction / Review of Fundamentals of Graph Theory Hiroki Sayama sayama@binghamton.edu Rise of Network Science (From Barabasi 2010) 2 Network models Many discrete parts involved Classic mean-field

More information

CMSC Honors Discrete Mathematics

CMSC Honors Discrete Mathematics CMSC 27130 Honors Discrete Mathematics Lectures by Alexander Razborov Notes by Justin Lubin The University of Chicago, Autumn 2017 1 Contents I Number Theory 4 1 The Euclidean Algorithm 4 2 Mathematical

More information

Elements of Graph Theory

Elements of Graph Theory Elements of Graph Theory Quick review of Chapters 9.1 9.5, 9.7 (studied in Mt1348/2008) = all basic concepts must be known New topics we will mostly skip shortest paths (Chapter 9.6), as that was covered

More information

Introduction to Combinatorial Algorithms

Introduction to Combinatorial Algorithms Fall 2009 Intro Introduction to the course What are : Combinatorial Structures? Combinatorial Algorithms? Combinatorial Problems? Combinatorial Structures Combinatorial Structures Combinatorial structures

More information

Link Prediction for Social Network

Link Prediction for Social Network Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

Lecture 5: Graphs. Rajat Mittal. IIT Kanpur

Lecture 5: Graphs. Rajat Mittal. IIT Kanpur Lecture : Graphs Rajat Mittal IIT Kanpur Combinatorial graphs provide a natural way to model connections between different objects. They are very useful in depicting communication networks, social networks

More information

Mining Social Network Graphs

Mining Social Network Graphs Mining Social Network Graphs Analysis of Large Graphs: Community Detection Rafael Ferreira da Silva rafsilva@isi.edu http://rafaelsilva.com Note to other teachers and users of these slides: We would be

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Image Sampling and Quantisation

Image Sampling and Quantisation Image Sampling and Quantisation Introduction to Signal and Image Processing Prof. Dr. Philippe Cattin MIAC, University of Basel 1 of 46 22.02.2016 09:17 Contents Contents 1 Motivation 2 Sampling Introduction

More information

Adjacent: Two distinct vertices u, v are adjacent if there is an edge with ends u, v. In this case we let uv denote such an edge.

Adjacent: Two distinct vertices u, v are adjacent if there is an edge with ends u, v. In this case we let uv denote such an edge. 1 Graph Basics What is a graph? Graph: a graph G consists of a set of vertices, denoted V (G), a set of edges, denoted E(G), and a relation called incidence so that each edge is incident with either one

More information

BIL694-Lecture 1: Introduction to Graphs

BIL694-Lecture 1: Introduction to Graphs BIL694-Lecture 1: Introduction to Graphs Lecturer: Lale Özkahya Resources for the presentation: http://www.math.ucsd.edu/ gptesler/184a/calendar.html http://www.inf.ed.ac.uk/teaching/courses/dmmr/ Outline

More information

College of Sciences. College of Sciences. Master s of Science in Computer Sciences Master s of Science in Biotechnology

College of Sciences. College of Sciences. Master s of Science in Computer Sciences Master s of Science in Biotechnology Master s of Science in Computer Sciences Master s of Science in Biotechnology Department of Computer Sciences 1. Introduction\Program Mission The Program mission is to prepare students to be fully abreast

More information

Supervised Learning with Neural Networks. We now look at how an agent might learn to solve a general problem by seeing examples.

Supervised Learning with Neural Networks. We now look at how an agent might learn to solve a general problem by seeing examples. Supervised Learning with Neural Networks We now look at how an agent might learn to solve a general problem by seeing examples. Aims: to present an outline of supervised learning as part of AI; to introduce

More information

Math 776 Graph Theory Lecture Note 1 Basic concepts

Math 776 Graph Theory Lecture Note 1 Basic concepts Math 776 Graph Theory Lecture Note 1 Basic concepts Lectured by Lincoln Lu Transcribed by Lincoln Lu Graph theory was founded by the great Swiss mathematician Leonhard Euler (1707-178) after he solved

More information

A graph is finite if its vertex set and edge set are finite. We call a graph with just one vertex trivial and all other graphs nontrivial.

A graph is finite if its vertex set and edge set are finite. We call a graph with just one vertex trivial and all other graphs nontrivial. 2301-670 Graph theory 1.1 What is a graph? 1 st semester 2550 1 1.1. What is a graph? 1.1.2. Definition. A graph G is a triple (V(G), E(G), ψ G ) consisting of V(G) of vertices, a set E(G), disjoint from

More information

Chapter 2 Graphs. 2.1 Definition of Graphs

Chapter 2 Graphs. 2.1 Definition of Graphs Chapter 2 Graphs Abstract Graphs are discrete structures that consist of vertices and edges connecting some of these vertices. Graphs have many applications in Mathematics, Computer Science, Engineering,

More information

Character Recognition Using Convolutional Neural Networks

Character Recognition Using Convolutional Neural Networks Character Recognition Using Convolutional Neural Networks David Bouchain Seminar Statistical Learning Theory University of Ulm, Germany Institute for Neural Information Processing Winter 2006/2007 Abstract

More information

Data mining, 4 cu Lecture 8:

Data mining, 4 cu Lecture 8: 582364 Data mining, 4 cu Lecture 8: Graph mining Spring 2010 Lecturer: Juho Rousu Teaching assistant: Taru Itäpelto Frequent Subgraph Mining Extend association rule mining to finding frequent subgraphs

More information

One-mode Additive Clustering of Multiway Data

One-mode Additive Clustering of Multiway Data One-mode Additive Clustering of Multiway Data Dirk Depril and Iven Van Mechelen KULeuven Tiensestraat 103 3000 Leuven, Belgium (e-mail: dirk.depril@psy.kuleuven.ac.be iven.vanmechelen@psy.kuleuven.ac.be)

More information