Aggregation and Selection in Relational Data Mining

1 Aggregation and Selection in Relational Data Mining
Celine Vens, Anneleen Van Assche, Hendrik Blockeel, Sašo Džeroski
Department of Computer Science, K.U.Leuven
Department of Knowledge Technologies, Jozef Stefan Institute, Slovenia
C. Vens, A. Van Assche, H. Blockeel, S. Džeroski: Aggregation and Selection in Relational Data Mining

2 Outline
Introduction

3 Relational Data Mining
Data Mining: searching for patterns in (large) databases.
Propositional (classical) data mining:
    data is stored in a single table
    patterns involve intra-tuple relations
Relational data mining:
    data is stored in multiple tables (relational database)
    patterns involve inter-tuple or inter-table relations
    how to deal with 1-n or m-n relations (sets)?

4 Working Example
Current relational learners: two approaches to dealing with sets

6 First approach: Aggregation
Use SQL-like aggregation to summarize sets in one big table
Apply a classical data mining technique (e.g. a decision tree inducer)
Optimized for highly non-determinate domains
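
A minimal sketch of this first approach, assuming a toy person/book database: each person's set of books is collapsed into a fixed number of aggregate attributes, so that a classical learner could work on the resulting single table. The data, the `price` and `genre` attributes and the name `propositionalize` are illustrative assumptions, not part of the slides.

```python
from collections import defaultdict

# Hypothetical toy database: one row per person, many rows per book.
persons = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}, {"id": 3, "name": "cat"}]
books = [
    {"owner": 1, "genre": "computer", "price": 40.0},
    {"owner": 1, "genre": "novel", "price": 10.0},
    {"owner": 2, "genre": "computer", "price": 30.0},
]

def propositionalize(persons, books):
    """Summarize each person's set of books into fixed-length attributes."""
    by_owner = defaultdict(list)
    for book in books:
        by_owner[book["owner"]].append(book)
    table = []
    for person in persons:
        owned = by_owner[person["id"]]
        prices = [b["price"] for b in owned]
        table.append({
            "id": person["id"],
            "book_count": len(owned),                                    # COUNT(*)
            "max_price": max(prices) if prices else None,                # MAX(price)
            "avg_price": sum(prices) / len(prices) if prices else None,  # AVG(price)
        })
    return table

flat_table = propositionalize(persons, books)
```

Each row of `flat_table` describes one person, and the individual books are no longer visible. That is why a condition on specific elements of the set cannot be expressed after this step.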

7 Second approach: Selection
Apply a relational data mining technique (e.g. a relational decision tree inducer)
Test for the existence of specific elements in the set
Optimized for structurally complex domains
e.g. ILP (Inductive Logic Programming):
    database and patterns in Prolog
    possibility to add background knowledge

8 Example concepts
1. Persons that have two books.
2. Persons that have a computer book.
3. Persons that have two computer books.
How to express concept 3?
    Selective methods need an aggregate function in the background knowledge.
    Aggregating methods need a separate relation for each genre.
Solution: combine aggregation and selection in the context of relational data mining
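
Written as predicates over a person's set of books, the three concepts make the gap visible: concept 1 is a pure aggregate, concept 2 a pure existence test, and concept 3 an aggregate over a selected subset. The sketch below is illustrative only. The book records and function names are assumptions, and "two" is read as "at least two", which the slide does not specify.

```python
def has_two_books(books):
    # Concept 1: aggregation only (count over the whole set).
    return len(books) >= 2

def has_computer_book(books):
    # Concept 2: selection only (existence of a specific element).
    return any(b["genre"] == "computer" for b in books)

def has_two_computer_books(books):
    # Concept 3: aggregation applied to a selected subset.
    return sum(1 for b in books if b["genre"] == "computer") >= 2

# Hypothetical sets of books for two persons.
shelf_a = [{"genre": "computer"}, {"genre": "novel"}, {"genre": "computer"}]
shelf_b = [{"genre": "computer"}, {"genre": "novel"}]
```

Here `shelf_b` satisfies concepts 1 and 2 but not concept 3, so concept 3 cannot be obtained by simply conjoining the other two.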

9 Outline
Introduction
Decision Trees
Combining selection and aggregation

10 Decision Trees
One of the most widely used and practical data mining methods
Each internal node contains a test on some attribute
Each leaf contains a prediction
Classification of a new instance

11 Decision Trees: learning them
Divide & conquer algorithm
Pseudocode:
    grow_node(Node, Examples):
        IF stop criterion holds:
            assign majority class from Examples to Node
        ELSE
            generate all possible tests for Node
            associate best test with Node
            grow two child nodes Left and Right
            split Examples into ExamplesPass and ExamplesFail
            grow_node(Left, ExamplesPass)
            grow_node(Right, ExamplesFail)
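
The pseudocode above can be sketched in Python for the propositional case. Several choices here are assumptions that the slides do not fix: tests have the form `attribute == value`, the best test is the one with the lowest weighted Gini impurity, and growth stops when no test reduces impurity. The toy data is hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def candidate_tests(examples):
    """All tests of the form attribute == value seen in the examples."""
    return sorted({(a, v) for x, _ in examples for a, v in x.items()}, key=repr)

def split(examples, test):
    attr, value = test
    passed = [e for e in examples if e[0].get(attr) == value]
    failed = [e for e in examples if e[0].get(attr) != value]
    return passed, failed

def grow_node(examples):
    labels = [y for _, y in examples]
    majority = Counter(labels).most_common(1)[0][0]
    best, best_score = None, gini(labels)
    for test in candidate_tests(examples):
        passed, failed = split(examples, test)
        if not passed or not failed:
            continue  # the test does not separate the examples
        score = (len(passed) * gini([y for _, y in passed])
                 + len(failed) * gini([y for _, y in failed])) / len(examples)
        if score < best_score:
            best, best_score = test, score
    if best is None:  # stop criterion (assumed): no test reduces impurity
        return {"leaf": majority}
    passed, failed = split(examples, best)
    return {"test": best, "left": grow_node(passed), "right": grow_node(failed)}

def classify(tree, x):
    """Follow the tests from the root down to a leaf prediction."""
    while "leaf" not in tree:
        attr, value = tree["test"]
        tree = tree["left"] if x.get(attr) == value else tree["right"]
    return tree["leaf"]

# Hypothetical training data: (attribute dictionary, class label).
data = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "no"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
]
tree = grow_node(data)
```

The left child receives the examples that pass the test and the right child those that fail it, matching ExamplesPass and ExamplesFail in the pseudocode.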

12 Relational decision trees: learning them
Upgrade of classical algorithm: Tilde [Blockeel and De Raedt 98]
Trees are relational: they contain first-order logic literals in the tests of internal nodes
Selective approach (ILP)
Tests can introduce variables: the possible tests may differ at each node

13 Adding aggregation
User specifies the basic components: aggregate functions, sets to be aggregated, query to generate the set to be aggregated
Aggregate conditions are created, using discretization
Aggregate conditions are added to the set of possible tests
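
One way to picture how aggregate conditions could be generated from these components is sketched below. It assumes conditions of the form `aggregate(set) >= threshold` and a simple discretization that places thresholds midway between consecutive observed aggregate values. The slides do not say which discretization is used, so this choice, the function names and the data are all assumptions.

```python
def aggregate_conditions(example_sets, aggregates):
    """Build candidate tests (aggregate name, threshold) from observed values."""
    conditions = []
    for name, func in aggregates.items():
        # Distinct aggregate values over all non-empty sets, in ascending order.
        values = sorted({func(items) for items in example_sets if items})
        # Assumed discretization: midpoints between consecutive values.
        cuts = [(low + high) / 2 for low, high in zip(values, values[1:])]
        conditions.extend((name, cut) for cut in cuts)
    return conditions

def holds(condition, items, aggregates):
    """Evaluate aggregate(items) >= threshold; empty sets fail the test."""
    name, threshold = condition
    return bool(items) and aggregates[name](items) >= threshold

# Hypothetical input: book prices per person and two aggregate functions.
aggregates = {"count": len, "max": max}
price_sets = [[40.0, 10.0], [30.0], [5.0, 5.0, 5.0]]
conditions = aggregate_conditions(price_sets, aggregates)
```

The resulting conditions would then be added to the pool of candidate tests considered at a node.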

14 Adding selections to aggregation: first manner
If a node contains an aggregation, any node in its left subtree can add a selection within that aggregate condition
Local search within the aggregate condition
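
A hedged sketch of what refining an aggregate condition could look like: the condition keeps its aggregate function and gains one extra selection on the elements being aggregated. The class `AggregateCondition`, the restriction to `count`, and keeping the threshold fixed during refinement are simplifying assumptions, not details given in the slides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AggregateCondition:
    function: str      # only "count" is supported in this sketch
    selections: tuple  # (attribute, value) pairs restricting the aggregated set
    threshold: float

    def holds(self, items):
        selected = [i for i in items
                    if all(i.get(a) == v for a, v in self.selections)]
        if self.function == "count":
            return len(selected) >= self.threshold
        raise ValueError(f"unsupported aggregate: {self.function}")

def refinements(condition, candidate_selections):
    """Conditions obtained by adding one selection inside the aggregate."""
    return [
        AggregateCondition(condition.function,
                           condition.selections + (selection,),
                           condition.threshold)
        for selection in candidate_selections
        if selection not in condition.selections
    ]

# "At least two books", refined to "at least two computer books".
base = AggregateCondition("count", (), 2)
refined = refinements(base, [("genre", "computer")])[0]
shelf = [{"genre": "computer"}, {"genre": "novel"}]
```

With this toy `shelf`, `base.holds(shelf)` is true while `refined.holds(shelf)` is false, which is the distinction concept 3 of the example concepts requires.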

15 Adding selections to aggregation: second manner
Lookahead technique:
    look ahead in the refinement lattice
    add several literals at once
    computationally expensive

16 Relational decision trees with aggregation and selection: learning them
Pseudocode:
    grow_node(Node, Examples):
        IF stop criterion holds:
            assign majority class from Examples to Node
        ELSE
            generate all possible first-order tests for Node:
                usual tests
                aggregate functions
                refinements of an aggregate function higher in the tree
            associate best test with Node
            grow two child nodes Left and Right
            split Examples into ExamplesPass and ExamplesFail
            grow_node(Left, ExamplesPass)
            grow_node(Right, ExamplesFail)

17 Relational decision trees with aggregation and selection: problem
The number of tests at each node in the tree grows very fast
Need some way to deal with it
Make use of a technique from classical data mining: Random Forests [Breiman 01]

18 Outline
Introduction
Random Forests

19 Random Forests

20 Random Decision Tree Algorithm
At each node, only a random subset of the candidate tests $T$ is considered; its size is $f(|T|)$, with e.g. $f(x) = 0.1x$ or $f(x) = \sqrt{x}$

21 Random forests with our relational decision tree algorithm
Pseudocode:
    grow_node(Node, Examples, Probability):
        IF stop criterion holds:
            assign majority class from Examples to Node
        ELSE
            generate all possible first-order tests for Node:
                usual tests
                aggregate functions
                refinements of an aggregate function higher in the tree
            select random subset from possible tests using Probability
            associate best test out of random subset with Node
            grow two child nodes Left and Right
            split Examples into ExamplesPass and ExamplesFail
            grow_node(Left, ExamplesPass, Probability)
            grow_node(Right, ExamplesFail, Probability)
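
The step that selects a random subset of the possible tests could be sketched as follows. This version draws a fixed-size sample whose size is a function of the number of candidate tests. The pseudocode speaks of a Probability, which may instead mean keeping each test independently with that probability, so the fixed-size reading is an assumption, as are the names `sample_tests`, `tenth` and the placeholder test names.

```python
import math
import random

def sample_tests(tests, size_function, rng):
    """Return a random subset of `tests` with about size_function(len(tests)) elements."""
    if not tests:
        return []
    wanted = round(size_function(len(tests)))
    size = max(1, min(len(tests), wanted))  # keep at least one test
    return rng.sample(tests, size)

def tenth(x):
    """Keep 10% of the candidate tests."""
    return 0.1 * x

rng = random.Random(0)  # fixed seed so the sketch is reproducible
tests = [f"test_{i}" for i in range(400)]  # placeholder candidate tests
sqrt_subset = sample_tests(tests, math.sqrt, rng)
tenth_subset = sample_tests(tests, tenth, rng)
```

Only the sampled tests are scored at the node, so the cost per node drops roughly in proportion to the sample size.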

22 Outline
Introduction
Real world data
Artificial data

23 Experimental Setup: Real world data
Average over 5 times 5-fold cross-validation
Different parameters:
    number of trees: 3, 11, 33
    proportion of feature sample: 100%, 75%, 50%, 25%, 10%, sqrt
    level of aggregates: No Aggregates (NA), Simple Aggregates (SA), Refined Aggregates (RA), Lookahead Aggregates (LA)
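
For readers unfamiliar with the evaluation protocol, "5 times 5-fold cross-validation" can be sketched as below: the examples are shuffled and split into five folds, each fold serves once as the test set, and the whole procedure is repeated five times with a new shuffle. The function name, the seed, and the lack of class stratification are assumptions; the slides do not say how the folds were drawn.

```python
import random

def repeated_kfold(n_examples, k=5, repeats=5, seed=0):
    """Yield (train_indices, test_indices) for `repeats` runs of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_examples))
        rng.shuffle(indices)
        # Fold i takes every k-th shuffled index, starting at position i.
        folds = [indices[i::k] for i in range(k)]
        for i, test_fold in enumerate(folds):
            train = [j for m, fold in enumerate(folds) if m != i for j in fold]
            yield train, test_fold

splits = list(repeated_kfold(20))
```

The reported figure would then be the mean of the 25 per-split scores.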

24 Real world data
The effect of aggregates and the number of trees (P = 0.25)

25 Real world data
The effect of the number of features (e.g. Mutagenesis)
[Table: FORF (33 trees) for each proportion P (including sqrt) and each aggregate level LA, RA, SA, NA, with Tilde NA for comparison]

26 Real world data
Compared to other systems (FORF-SA uses 33 trees and 25% of the features)
Financial
    FORF-SA    DINUS-C    RELAGGS    PROGOL
    (0.005)    (0.103)    (0.065)    (0.071)
Diterpenes
    FORF-SA    FOIL       IBL-matchings    ICL
    (0.006)    (0.011)    (0.006)          (0.009)

27 Artificial data
Summary of experimental results so far:
    Positive effect of random forest
    Positive effect of adding (simple) aggregates
Effect of the combination of aggregates and selection?
    Artificial dataset

28 Artificial data
Data generator for eastbound/westbound trains.
800 trains, 400 in each direction
Target concept:

29 Artificial data
Results (P = 0.25, number of trees = 33)
[Three panels comparing LA, RA, SA and NA: accuracy, average number of nodes in a tree, average induction time]

31 First-order random forest induction algorithm based on Tilde
Feature space enlarged by including aggregates
Refinement operator adjusted to include selection conditions within the aggregates
Strength was experimentally shown

32 Acknowledgements and References
Acknowledgements: Maurice Bruynooghe
Full Paper: C. Vens, A. Van Assche, H. Blockeel, and S. Džeroski, First Order Random Forests with Complex Aggregates, Proceedings of the 14th International Conference on Inductive Logic Programming (ILP-2004), Porto, Portugal, 2004
