Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs

Similar documents
Similarity Ranking in Large- Scale Bipartite Graphs

COMP5331: Knowledge Discovery and Data Mining

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

COMP 4601 Hubs and Authorities

Local higher-order graph clustering

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Using PageRank in Feature Selection

Fast Nearest Neighbor Search on Large Time-Evolving Graphs

Information Retrieval and Organisation

Ego-net Community Mining Applied to Friend Suggestion

Using PageRank in Feature Selection

Predictive Indexing for Fast Search

Mining Web Data. Lijun Zhang

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

Mining Web Data. Lijun Zhang

Information Networks: PageRank

Matrix-Vector Multiplication by MapReduce. From Rajaraman / Ullman- Ch.2 Part 1

Extracting Rankings for Spatial Keyword Queries from GPS Data

G(B)enchmark GraphBench: Towards a Universal Graph Benchmark. Khaled Ammar M. Tamer Özsu

Part 1: Link Analysis & Page Rank

Collaborative filtering based on a random walk model on a graph

Pregel. Ali Shah

You are Who You Know and How You Behave: Attribute Inference Attacks via Users Social Friends and Behaviors

Clustering Results. Result List Example. Clustering Results. Information Retrieval

Lecture 8: Linkage algorithms and web search

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

Mathematical Methods and Computational Algorithms for Complex Networks. Benard Abola

Link Structure Analysis

CS6200 Information Retreival. The WebGraph. July 13, 2015

Big Data Analytics CSCI 4030

Administrative. Machine learning code. Supervised learning (e.g. classification) Machine learning: Unsupervised learning" BANANAS APPLES

Information Retrieval. Lecture 11 - Link analysis

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing

Big Data Analytics CSCI 4030

Big Data Analytics Influx of data pertaining to the 4Vs, i.e. Volume, Veracity, Velocity and Variety

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

Data Informatics. Seon Ho Kim, Ph.D.

CS 6604: Data Mining Large Networks and Time-Series

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Holistic Analysis of Multi-Source, Multi- Feature Data: Modeling and Computation Challenges

Graphs / Networks CSE 6242/ CX Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

Lecture 8: Linkage algorithms and web search

The PageRank Citation Ranking

Mining di Dati Web. Lezione 3 - Clustering and Classification

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

Machine Learning using MapReduce

Graphs III. CSE 6242 A / CS 4803 DVA Feb 26, Duen Horng (Polo) Chau Georgia Tech. (Interactive) Applications

Information Retrieval and Web Search Engines

TPA: Fast, Scalable, and Accurate Method for Approximate Random Walk with Restart on Billion Scale Graphs

Apache Giraph. for applications in Machine Learning & Recommendation Systems. Maria Novartis

PREGEL: A SYSTEM FOR LARGE- SCALE GRAPH PROCESSING

Gene Clustering & Classification

Mind the Gap: Large-Scale Frequent Sequence Mining

Youtube Graph Network Model and Analysis Yonghyun Ro, Han Lee, Dennis Won

Spatio-temporal Range Searching Over Compressed Kinetic Sensor Data. Sorelle A. Friedler Google Joint work with David M. Mount

Recommender Systems: Practical Aspects, Case Studies. Radek Pelánek

Link Analysis. Hongning Wang

Advanced Marketing Lab

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem

AMAZON.COM RECOMMENDATIONS ITEM-TO-ITEM COLLABORATIVE FILTERING PAPER BY GREG LINDEN, BRENT SMITH, AND JEREMY YORK

Embedded Technosolutions

LINK GRAPH ANALYSIS FOR ADULT IMAGES CLASSIFICATION

Algorithms for Evolving Data Sets

Randomized Composable Core-sets for Distributed Optimization Vahab Mirrokni

Collaborative Filtering using Euclidean Distance in Recommendation Engine

B490 Mining the Big Data. 5. Models for Big Data


A Tutorial on the Probabilistic Framework for Centrality Analysis, Community Detection and Clustering

Lec 8: Adaptive Information Retrieval 2

Information Retrieval and Web Search

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

arxiv: v1 [cs.si] 18 Oct 2017

GIVEN a large graph and a query node, finding its k-

Graphs / Networks CSE 6242/ CX Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

One Trillion Edges. Graph processing at Facebook scale

Lecture 27: Learning from relational data

CS435 Introduction to Big Data Spring 2018 Colorado State University. 3/21/2018 Week 10-B Sangmi Lee Pallickara. FAQs. Collaborative filtering

Introduction to Data Mining

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Link Analysis and Web Search

PAGE RANK ON MAP- REDUCE PARADIGM

An Exploratory Journey Into Network Analysis A Gentle Introduction to Network Science and Graph Visualization

Web Spam. Seminar: Future Of Web Search. Know Your Neighbors: Web Spam Detection using the Web Topology

Matching Algorithms. Proof. If a bipartite graph has a perfect matching, then it is easy to see that the right hand side is a necessary condition.

Analysis of Biological Networks. 1. Clustering 2. Random Walks 3. Finding paths

PoP Level Mapping And Peering Deals

Information Retrieval and Web Search Engines

Basics of Data Management

A Class of Submodular Functions for Document Summarization

COLLABORATIVE LOCATION AND ACTIVITY RECOMMENDATIONS WITH GPS HISTORY DATA

WebSci and Learning to Rank for IR

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Infinite data. Filtering data streams

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Efficient Mining Algorithms for Large-scale Graphs

1 ihypr: Prominence Ranking in Networks of Collaborations with Hyperedges 1

Local Higher-Order Graph Clustering

Bruno Martins. 1 st Semester 2012/2013

TODAY S LECTURE HYPERTEXT AND

Transcription:

Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs Alessandro Epasto J. Feldman*, S. Lattanzi*, S. Leonardi, V. Mirrokni*. *Google Research Sapienza U. Rome

Motivation Recommendation Systems: Bipartite graphs with Users and Items. Identify similar users and suggest relevant items. Concrete example: The AdWords case. Two key observations: Items belong to different categories. Graphs are often lopsided.

Modeling the Data as a Bipartite Graph 2$ Retailers 3$ 4$ 1$ 5$ 2$ Nike Store New York Soccer Shoes Apparel Sport Equipment Hundreds of Labels Soccer Ball Millions of Advertisers Billions of Queries

Personalized PageRank For a node v (the seed) and a probability alpha v u The stationary distribution assigns a similarity score to each node in the graph w.r.t. node v.

The Problem 2$ Retailers 3$ 4$ 1$ 5$ 2$ Nike Store New York Soccer Shoes Apparel Sport Equipment Hundreds of Labels Soccer Ball Millions of Advertisers Billions of Queries

Other Applications General approach applicable to several contexts: User, Movies, Genres: find similar users and suggest movies. Authors, Papers, Conferences: find related authors and suggest papers to read.

Semi-Formal Problem Definition Advertisers Queries

Semi-Formal Problem Definition A Advertisers Queries

Semi-Formal Problem Definition A Advertisers Labels: Queries

Semi-Formal Problem Definition A Advertisers Labels: Queries Goal: Find the nodes most similar to A.

How to Define Similarity? We address the computation of several node similarity measures: Neighborhood based: Common neighbors, Jaccard Coefficient, Adamic-Adar. Paths based: Katz. Random Walk based: Personalized PageRank. Experimental question: which measure is useful? Algorithmic questions: Can it scale to huge graphs? Can we compute it in real-time?

Our Contribution Reduce and Aggregate: general approach to induce real-time similarity rankings in multicategorical bipartite graphs, that we apply to several similarity measures. Theoretical guarantees for the precision of the algorithms. Experimental evaluation with real world data.

Personalized PageRank For a node v (the seed) and a probability alpha v u The stationary distribution assigns a similarity score to each node in the graph w.r.t. node v.

Challenges Our graphs are too big (billions of nodes) even for very large-scale MapReduce systems. MapReduce is not real-time. We cannot pre-compute the rankings for each subset of labels.

Reduce and Aggregate a a b 1) a b a c 2) c c b c 3) b Reduce: Given the bipartite and a category construct a graph with only A nodes that preserves the ranking on the entire graph. Aggregate: Given a node v in A and the reduced graphs of the subset of categories interested determine the ranking for v.

Reduce (Precomputation) Advertisers Queries

Reduce (Precomputation) Advertisers Queries Precomputed Rankings

Reduce (Precomputation) Advertisers Queries Precomputed Rankings Precomputed Rankings

Reduce (Precomputation) Advertisers Queries Precomputed Rankings Precomputed Rankings Precomputed Rankings

Aggregate (Run Time) A Precomputed Rankings Precomputed Rankings Ranking of Red + Yellow

Reduce for Personalized PageRank X Side A Y X Y Side B Markov Chain state aggregation theory (Simon and Ado, 61; Meyer 89, etc.). 750x reduction in the number of node while preserving correctly the PPR distribution on the entire graph. Side A

Run-time Aggregation

Koury et al. Aggregation-Disaggregation Algorithm A B Step 1: Partition the Markov chain into DISJOINT subsets

Koury et al. Aggregation-Disaggregation Algorithm A A B B Step 2: Approximate the stationary distribution on each subset independently.

Koury et al. Aggregation-Disaggregation Algorithm P AA A P B AB B A P BA Step 3: Consider the transition between subsets. P BB

Koury et al. Aggregation-Disaggregation Algorithm P AA A B 0 A P AB 0 B P BA Step 4: Aggregate the distributions. Repeat until convergence. P BB

Aggregation in PPR X A Y A Precompute the stationary distributions individually

Aggregation in PPR X B Y B Precompute the stationary distributions individually

Aggregation in PPR A B The two subsets are not disjoint!

Our Approach A B X Y X Y The algorithm is based only on the reduced graphs with Advertiser-Side nodes. The aggregation algorithm is scalable and converges to the correct distribution.

Experimental Evaluation We experimented with publicly available and proprietary datasets: Query-Ads graph from Google AdWords > 1.5 billions nodes, > 5 billions edges. DBLP Author-Papers and Patent Inventor- Inventions graphs. Ground-Truth clusters of competitors in Google AdWords.

Patent Graph 1 0.9 0.8 Precision vs Recall Inter Jaccard Adamic-Adar Katz PPR 0.7 0.6 Precision Precision 0.5 0.4 0.3 0.2 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Recall Recall

Google AdWords Precision Recall

Conclusions and Future Work It is possible to compute several similarity scores on very large bipartite graphs in real-time with good accuracy. Future work could focus on the case where categories are not disjoint is relevant.

Thank you for your attention

Reduction to the Query Side X Y A B

Reduction to the Query Side X Y A B This is the larger side of the graph.

Convergence after One Iteration Kendall-Tau Correlation 1 DBLP Patent Query-Ads (cost) Kendall-Tau 0.8 0.6 0.4 0.2 0 10 20 30 40 50 All Position (k)

Convergence Approximation Error vs # Iterations DBLP (1 - Cosine) Patent (1 - Cosine) 1-Cosine Similarity 1-Cosine 0.001 0.0001 1e-05 1e-06 0 2 4 6 8 10 12 14 16 18 20 Iterations Iterations