Where we are. Exploratory Graph Analysis (40 min) Focused Graph Mining (40 min) Refinement of Query Results (40 min)

Osborne Simpson
5 years ago
1 Where we are. Background (15 min): graph models, subgraph isomorphism, subgraph mining, graph clustering; Exploratory Graph Analysis (40 min); Focused Graph Mining (40 min); Refinement of Query Results (40 min); Machine Learning and Visualization (40 min); Challenges and discussion. D. MOTTIN, E. MÜLLER 121

2 Interactivity. An interactive approach allows the user to modify the state of the system. Interactivity can be coupled with personalization: learning user preferences in an interactive way. How do we learn on the way? Assume that preferences follow a certain distribution and are smooth across neighboring nodes. Try to minimize the number of interactions with the user.

3 Online learning (dataset exploration). Main idea: learn the items to show online as more points are acquired. Two ways of learning: passive and active. [Figure: passive learning loop (items, learn) versus active learning loop (items, ask the user about an item, learn)]

4 Online learning taxonomy. [Diagram] Online learning, with roots in statistical learning, convex optimization, and game theory. Branches: Passive, Active. Node labels: Traditional, Bayesian, Selective, Pool-based, Kernel-based, Forests and trees, First/second-order learning, Prediction with experts, Multi-armed bandits, Sampling, Uncertainty sampling, Query by Committee, Variance sampling, Density-based, Cluster-based, Multi-armed bandits. S. C. H. Hoi, D. Sahoo, J. Lu, and P. Zhao, Online Learning: A Comprehensive Survey, arXiv.

5 Passive learning application: MindReader. Main idea: learn an implicit query from user examples and optional scores. Example: searching for mildly overweight patients. The doctor selects examples by browsing the patient database. The examples have an oblique correlation, so we can guess the implied query q. [Figure: examples plotted by height and weight, marked "good" or "very good", with the implied query q] Ishikawa, Y., Subramanya, R., & Faloutsos, C. MindReader: Querying Databases Through Multiple Examples. VLDB.

6 Learning an ellipsoid distance. Three distances around the query point q: Euclidean, weighted Euclidean, and generalized ellipsoid distance. Generalized ellipsoid distance: $D(x, q) = (x - q)^T M (x - q)$, where q is the implicit query and M is the weighted distance matrix. Written out: $D(x, q) = \sum_j \sum_k m_{jk} (x_j - q_j)(x_k - q_k)$. Learn the query by minimizing the penalty, i.e. the weighted sum of distances between the query point and the sample vectors: minimize $\sum_i v_i (x_i - q)^T M (x_i - q)$ subject to $\det(M) = 1$, where $v_i$ is the score of example $x_i$. Ishikawa, Y., Subramanya, R., & Faloutsos, C. MindReader: Querying Databases Through Multiple Examples. VLDB.

7 Learning the distance. The query point is moved towards the good examples (the Rocchio formula in IR). [Figure: Q_0, the query point; the retrieved data with relevance judgments; Q_1, the new query point] Learning can be done online! Ishikawa, Y., Subramanya, R., & Faloutsos, C. MindReader: Querying Databases Through Multiple Examples. VLDB.
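A minimal sketch of the two MindReader ideas above, under stated assumptions: the query point is taken to be the score-weighted mean of the examples, and the distance matrix M is supplied by hand rather than learned. The function names, the example data, and the matrix values are hypothetical and are not taken from the slides.

```python
def weighted_query_point(examples, scores):
    """Score-weighted mean of the examples: pulls q towards highly scored examples."""
    total = sum(scores)
    dim = len(examples[0])
    return [sum(s * x[d] for x, s in zip(examples, scores)) / total
            for d in range(dim)]


def ellipsoid_distance(x, q, M):
    """Generalized ellipsoid distance (x - q)^T M (x - q)."""
    diff = [xi - qi for xi, qi in zip(x, q)]
    n = len(diff)
    return sum(M[j][k] * diff[j] * diff[k] for j in range(n) for k in range(n))


# Hypothetical (height, weight) examples with user scores.
examples = [[170.0, 80.0], [180.0, 90.0], [160.0, 70.0]]
scores = [1.0, 2.0, 1.0]
q = weighted_query_point(examples, scores)

# Hand-picked symmetric matrix with det(M) = 1 and negative off-diagonal
# entries, so points along the height/weight correlation count as closer.
M = [[2.0, -1.0], [-1.0, 1.0]]
along = ellipsoid_distance([174.5, 84.5], q, M)    # taller and heavier
across = ellipsoid_distance([174.5, 80.5], q, M)   # taller but lighter
```

With this M, a patient who deviates from q along the correlation direction is closer than one who deviates across it, which is the effect the oblique ellipsoid is meant to capture.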

8 Let's go back to SLQ. Query graph: Prof., ~70 yrs?; UT; Google. Candidate match: Geoffrey Hinton (Professor, 1947); University of Toronto; Google; DNNresearch. Features. Node matching features: $F_V(v, \varphi(v)) = \sum_i \alpha_i f_i(v, \varphi(v))$. Edge matching features: $F_E(e, \varphi(e)) = \sum_j \beta_j g_j(e, \varphi(e))$. Matching score: $P(\varphi(Q) \mid Q) \propto \exp\big(\sum_{v \in V_Q} F_V(v, \varphi(v)) + \sum_{e \in E_Q} F_E(e, \varphi(e))\big)$.

9 Let's go back to SLQ. Query graph: Prof., ~70 yrs?; UT; Google. Candidate match: Geoffrey Hinton (Professor, 1947); University of Toronto; Google; DNNresearch. Features. Node matching features: $F_V(v, \varphi(v)) = \sum_i \alpha_i f_i(v, \varphi(v))$. Edge matching features: $F_E(e, \varphi(e)) = \sum_j \beta_j g_j(e, \varphi(e))$. Matching score: $P(\varphi(Q) \mid Q) \propto \exp\big(\sum_{v \in V_Q} F_V(v, \varphi(v)) + \sum_{e \in E_Q} F_E(e, \varphi(e))\big)$. What if some answer is irrelevant?
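A minimal sketch of the matching score, assuming each query node and edge comes with a vector of feature values and that the weights are given. The name `match_score`, the feature vectors, and the weights are hypothetical; the score is left unnormalized, so it is only meaningful for ranking candidates.

```python
import math


def match_score(node_features, edge_features, alpha, beta):
    """Unnormalized score exp(sum of weighted node features + weighted edge features)."""
    node_term = sum(sum(a * f for a, f in zip(alpha, feats))
                    for feats in node_features)
    edge_term = sum(sum(b * g for b, g in zip(beta, feats))
                    for feats in edge_features)
    return math.exp(node_term + edge_term)


alpha = [1.0, 0.5]  # hypothetical node-feature weights
beta = [2.0]        # hypothetical edge-feature weight

# One feature vector per query node / query edge, for two candidate matches.
good = match_score([[1.0, 1.0], [1.0, 0.0]], [[0.5]], alpha, beta)
poor = match_score([[0.0, 1.0], [0.0, 0.0]], [[0.0]], alpha, beta)
```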

10 Query-specific Tuning. Find query-specific parameters θ* that are better aligned with the query, using user feedback. [Equation: objective over θ* that combines a user-feedback term, computed over the positive and the negative feedback matches, with a regularization term, balanced by a trade-off weight in [0, 1]]

11 Type Inference. Infer the implicit type of each query node. The types of the positive entities constitute a composite type for each query node. [Figure: query, positive feedback, candidate nodes]

12 Multi-armed bandits. Maximize the reward by successively playing gambling machines (the arms of the bandit). Introduced in the early 1950s by Robbins for decision making under uncertainty when the environment is unknown. The lotteries are unknown ahead of time. [Figure: gambling machines, each with its own reward] Credits: Pietro Lovato.

13 Multi-armed bandits. Reward = random variable $X_{i,n}$, with $1 \le i \le K$ and $n \ge 1$. i = index of the gambling machine; n = number of plays; $\mu_i$ = expected reward of machine i. A policy, or allocation strategy, is an algorithm that chooses the next machine to play based on the sequence of past plays and obtained rewards.

14 A greedy algorithm - Naïve. Choose the machine with the current best expected reward. Exploitation vs. exploration dilemma: should you exploit the information you've learned, or explore new options in the hope of a greater payoff? In the greedy case, the balance is completely towards exploitation.
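A small sketch of why pure exploitation can fail, assuming each arm is tried once and the greedy rule then always plays the arm with the best empirical mean. The two deterministic arms are hypothetical and chosen to make the failure visible: the second arm pays nothing on its first pull and 1.0 afterwards.

```python
def greedy(reward_fns, horizon):
    """Play each arm once, then always the arm with the best empirical mean."""
    k = len(reward_fns)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(horizon):
        if t < k:
            arm = t  # initial pull of every arm
        else:
            arm = max(range(k), key=lambda j: sums[j] / counts[j])
        # Each reward function receives the number of earlier pulls of that arm.
        reward = reward_fns[arm](counts[arm])
        counts[arm] += 1
        sums[arm] += reward
    return counts


arms = [
    lambda n: 0.5,                       # steady but mediocre arm
    lambda n: 0.0 if n == 0 else 1.0,    # unlucky first pull, better afterwards
]
greedy_counts = greedy(arms, 100)
```

After one unlucky pull the greedy rule never returns to the better arm, so all remaining plays go to the mediocre one.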

15 Quality measure - Regret. Total expected regret (after T plays): $R_T = \mu^* T - \sum_{i=1}^{K} \mu_i \, E[T_i(T)]$, where $\mu^*$ is the highest expected reward and $E[T_i(T)]$ is the expected number of times machine i is played. An algorithm solves the multi-armed bandit problem if it matches the lower bound $R_T = \Omega(\log T)$. [Think about binary search]
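A worked instance of the regret formula, as a sketch: the expected pull counts are replaced by the pull counts of one hypothetical run, and the arm means are made up for illustration.

```python
def regret(arm_means, pulls):
    """mu* T minus sum_i mu_i T_i, with observed pull counts standing in for E[T_i(T)]."""
    total_plays = sum(pulls)
    best_mean = max(arm_means)
    return best_mean * total_plays - sum(m * n for m, n in zip(arm_means, pulls))


arm_means = [0.5, 0.75, 0.25]                      # hypothetical expected rewards
mixed_regret = regret(arm_means, [10, 80, 10])     # some plays wasted on weaker arms
oracle_regret = regret(arm_means, [0, 100, 0])     # always plays the best arm
```

Every play of a suboptimal arm adds the gap between the best mean and that arm's mean; a policy that always picks the best arm has zero regret.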

16 Upper confidence bound (UCB) algorithm. 1. At each time step, pull the arm with the maximum probability of being the best, i.e. the arm j with the highest upper confidence bound $\frac{1}{n_j} \sum_{t=1}^{n_j} X_{j,t} + \sqrt{\frac{2 \log n}{n_j}}$. 2. Repeat until the budget (number of steps T) is depleted. $n_j$: number of times arm j has been pulled; n: total number of plays so far. Balances exploration and exploitation: the uncertainty diminishes as time passes.
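A minimal sketch of the two steps above, assuming the standard UCB1 index (empirical mean plus sqrt(2 log n / n_j)) and Bernoulli rewards. The arm means, the horizon, and the seed are hypothetical.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms and return how often each arm was pulled."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    for n in range(1, horizon + 1):
        if n <= k:
            arm = n - 1  # pull every arm once so that each n_j > 0
        else:
            arm = max(
                range(k),
                key=lambda j: sums[j] / counts[j]
                + math.sqrt(2 * math.log(n) / counts[j]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


ucb_counts = ucb1([0.2, 0.5, 0.8], horizon=5000)
```

The exploration bonus shrinks as an arm is pulled more often, so weak arms are still tried occasionally while most of the budget goes to the arm with the highest mean.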

17 Gaussian processes [Bishop et al., 2006]. Model the reward as a Gaussian process. A Gaussian process (GP) is an infinite set of random variables, any finite subset of which is jointly Gaussian: $P(f \mid \Sigma, \mu) = |2\pi\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2} (f - \mu)^T \Sigma^{-1} (f - \mu)\right)$. The Gaussian prior is specified only by its mean and covariance. Given observations $(x_i, y_i)$ of an unknown function f drawn from a Gaussian prior, the posterior over f is also Gaussian.
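For reference, the posterior can be written in closed form. The expressions below are the standard GP regression result, not taken from the slide, and rest on assumptions: a zero-mean prior with kernel $k$, observations $y_i = f(x_i) + \varepsilon_i$ for $i = 1, \dots, t$, and Gaussian noise $\varepsilon_i$ with variance $\sigma^2$.

```latex
\begin{aligned}
\mu_t(x) &= \mathbf{k}_t(x)^\top \left(K_t + \sigma^2 I\right)^{-1} \mathbf{y}_t, \\
\sigma_t^2(x) &= k(x, x) - \mathbf{k}_t(x)^\top \left(K_t + \sigma^2 I\right)^{-1} \mathbf{k}_t(x),
\end{aligned}
\qquad
K_t = \big[k(x_i, x_j)\big]_{i,j=1}^{t}, \quad
\mathbf{k}_t(x) = \big[k(x_1, x), \dots, k(x_t, x)\big]^\top .
```

The posterior mean $\mu_t(x)$ serves as the reward estimate and the posterior variance $\sigma_t^2(x)$ as the remaining uncertainty at $x$.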

18 Putting it all together: active search. Main idea: the system queries the user to understand her preferences. [Figure: loop between the system and the user: get item, ask user preference] Learn the unknown preferences and minimize the number of questions to the user. Vanchinathan, H. P., et al. Discovering Valuable Items from Massive Data. KDD.

19 Learning unknown preferences. Problem: find a set S that maximizes the user preference within a budget (e.g., number of interactions): $S^* = \arg\max_{S \subseteq V} \sum_{v \in S} \mathrm{pref}(v)$ subject to $\mathrm{cost}(S) \le \mathrm{budget}$, where $S^*$ is the intended user set, $\mathrm{pref}(v)$ are the user preferences, and $\mathrm{cost}(S)$ is the cost of the set S (each item has a cost). Vanchinathan, H. P., et al. Discovering Valuable Items from Massive Data. KDD.

20 GP-Select. Idea: model the user preferences as a Gaussian process. Learn the posterior, ask for user feedback, and trade off exploration and exploitation. Exploration: select items with high variance. Exploitation: select items with high value. Vanchinathan, H. P., et al. Discovering Valuable Items from Massive Data. KDD.
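A minimal sketch of the exploration/exploitation trade-off, assuming a UCB-style rule that scores each item by posterior mean plus a scaled posterior standard deviation. The posterior values are hypothetical numbers rather than the output of a fitted Gaussian process, and `select_item` and `beta` are illustrative names.

```python
import math


def select_item(means, variances, beta):
    """Index of the item with the largest mean + sqrt(beta) * standard deviation."""
    scores = [m + math.sqrt(beta) * math.sqrt(v) for m, v in zip(means, variances)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical posterior over three items.
means = [0.9, 0.5, 0.1]       # item 0 looks valuable
variances = [0.0, 0.01, 1.0]  # item 2 is still very uncertain

exploit_choice = select_item(means, variances, beta=0.0)  # value only
explore_choice = select_item(means, variances, beta=4.0)  # variance dominates
```

With `beta = 0` the rule is pure exploitation and picks the high-value item; a larger `beta` shifts the choice towards the high-variance item.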

21 GP-Select bound. Similar to multi-armed bandits: trades off exploration and exploitation. Ψ: maximum information gain. [Equation: upper bound on the regret in terms of Ψ and the number of plays n] UCB strategy: halves the search space and the uncertainty. Srinivas et al. Gaussian process optimization in the bandit setting: No regret and experimental design. ICML 2009

22 Active learning on graphs: which prior? Idea: use the graph structure to infer the node classes. Use the graph Laplacian as prior: $L = D - A$, where A is the adjacency matrix and D is the diagonal degree matrix. Laplacian: two connected nodes have a higher probability of belonging to the same class. Ma et al. Active Search and Bandits on Graphs using Sigma-Optimality. UAI.
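A minimal sketch of why the Laplacian encodes "connected nodes share a class": for an unweighted graph, the quadratic form f^T L f equals the sum of (f_i - f_j)^2 over the edges, so label assignments that disagree across edges are penalized. The three-node path graph and the label vectors are hypothetical.

```python
def graph_laplacian(adj):
    """L = D - A for a symmetric adjacency matrix without self-loops."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
            for i in range(n)]


def smoothness(lap, labels):
    """Quadratic form f^T L f: small when connected nodes carry equal labels."""
    n = len(labels)
    return sum(labels[i] * lap[i][j] * labels[j]
               for i in range(n) for j in range(n))


# Path graph 0 - 1 - 2.
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
laplacian = graph_laplacian(adjacency)

all_same = smoothness(laplacian, [1, 1, 1])      # no edge disagrees
one_cut = smoothness(laplacian, [1, 1, 0])       # edge (1, 2) disagrees
alternating = smoothness(laplacian, [1, 0, 1])   # both edges disagree
```

A prior that decreases with f^T L f therefore gives higher probability to labelings in which neighbors agree.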

23 Explore-by-Example: AIDE. [Diagram: interactive loop with the components Relevance Feedback (relevant samples, irrelevant samples), Data Classification (user model), Query Formulation (data extraction query), and Space Exploration (sampling queries, samples)] Dimitriadou et al. Explore-by-example: an automatic query steering framework for interactive data exploration. SIGMOD.

24 The AIDE algorithm. 1. Divide the space into d-dimensional cubes. 2. Find the sample points in the cubes (medoids). 3. Train the classifier. 4. Refine the training set by sampling from the neighbors of misclassified points. 5. Boundary refinement. Dimitriadou et al. Explore-by-example: an automatic query steering framework for interactive data exploration. SIGMOD.
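Steps 1 and 2 can be sketched as follows, assuming equally sized grid cells and taking as medoid the point in a cell with the smallest total distance to the other points of that cell. The function name, the points, and the cell size are hypothetical, and the remaining steps (classifier training and refinement) are not covered.

```python
import math
from collections import defaultdict


def grid_medoids(points, cell_size):
    """Group points into grid cells and return one medoid per non-empty cell."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(c // cell_size) for c in p)  # index of the cell holding p
        cells[key].append(p)
    medoids = {}
    for key, members in cells.items():
        # Medoid: member with the smallest summed distance to all cell members.
        medoids[key] = min(
            members, key=lambda p: sum(math.dist(p, o) for o in members)
        )
    return medoids


points = [(0.1, 0.1), (0.2, 0.2), (0.9, 0.9), (1.5, 0.5)]
samples = grid_medoids(points, cell_size=1.0)
```

Each medoid is an actual data point, so it can be shown to the user for labeling and then used as a training sample.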

25 Classification & Query Formulation.
Samples:
Sample     Red     Green   Relevant
Object A   13.67   12.34   Yes
Object B   15.32   14.50   No
...
Object X   14.…    …       Yes
Decision tree classifier: red > 14.82: Irrelevant; red <= 14.82 and red < 13.55: Irrelevant; red <= 14.82, red >= 13.55 and green > 13.74: Irrelevant; red <= 14.82, red >= 13.55 and green <= 13.74: Relevant.
Query: SELECT * FROM galaxy WHERE red <= 14.82 AND red >= 13.5 AND green <= 13.74
Dimitriadou et al. Explore-by-example: an automatic query steering framework for interactive data exploration. SIGMOD.
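The query formulation step can be sketched as a walk over the decision tree that collects the conditions on every path ending in a Relevant leaf. The nested-list encoding of the tree and the function names are assumptions made for illustration; the thresholds are the ones on the tree branches of the slide.

```python
def relevant_paths(node, path=()):
    """Return the condition tuples of all root-to-leaf paths ending in 'Relevant'."""
    if isinstance(node, str):  # leaf: "Relevant" or "Irrelevant"
        return [path] if node == "Relevant" else []
    paths = []
    for condition, child in node:
        paths += relevant_paths(child, path + (condition,))
    return paths


def to_sql(tree, table):
    """One AND-clause per relevant path, OR-ed together (assumes at least one path)."""
    clauses = ["(" + " AND ".join(p) + ")" for p in relevant_paths(tree)]
    return f"SELECT * FROM {table} WHERE " + " OR ".join(clauses)


# Each internal node is a list of (branch condition, subtree) pairs.
tree = [
    ("red > 14.82", "Irrelevant"),
    ("red <= 14.82", [
        ("red < 13.55", "Irrelevant"),
        ("red >= 13.55", [
            ("green > 13.74", "Irrelevant"),
            ("green <= 13.74", "Relevant"),
        ]),
    ]),
]
query = to_sql(tree, "galaxy")
```

A tree with several Relevant leaves yields one parenthesized clause per leaf, so the resulting query describes a union of rectangular regions.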

26 Misclassified Sample Exploitation. [Figure: sampling areas around misclassified samples; axes: green wavelength and red wavelength] Dimitriadou et al. Explore-by-example: an automatic query steering framework for interactive data exploration. SIGMOD.

27 Clustering-based Sampling. Idea: use a k-medoid approach to find the sampling areas. [Figure: clusters and their sampling areas; axes: green wavelength and red wavelength] Dimitriadou et al. Explore-by-example: an automatic query steering framework for interactive data exploration. SIGMOD.
