Managing and Mining Billion Node Graphs. Haixun Wang, Microsoft Research Asia

Outline
- Overview
- Storage
- Online query processing
- Offline graph analytics
- Advanced applications

Is it hard to manage graphs?
- Good systems for processing graphs: PBGL, Neo4j
- Good systems for processing large data: MapReduce, Hadoop
- Good systems for processing large graph data: specialized systems for PageRank, etc.

This is hard. (Diagram: graph system, big data system, general system.)

Existing systems

System         Native Graph*  Online Query  Index              Transaction Support  Parallel Graph Processing  Topology in Memory  Distributed
Neo4j          YES            YES           Lucene             YES                  NO                         NO                  NO
HyperGraphDB   NO             YES           Built-in           YES                  NO                         NO                  NO
InfiniteGraph  NO             YES           Built-in + Lucene  NO                   NO                         NO                  YES
Pregel         NO             NO            NO                 NO                   YES                        NO                  YES

An Example: Online Query Processing in Graphs

People Search Demo

Very efficient graph exploration: visiting 2.2 million nodes on 8 machines in 100 ms.
Graph models: relational, matrix, native (Trinity).
The relational and matrix models require join operations for graph exploration, which is costly and produces large intermediate results.

Joins vs. graph exploration for subgraph matching
Join-based subgraph matching:
- index sub-structures
- decompose a query into a set of sub-structures
- query each sub-structure
- join the intermediate results
- index size and construction time are super-linear for any non-trivial sub-structure: infeasible for billion-node graphs
Graph exploration:
- no structural index is needed
- very few join operations
- small intermediate results
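To make the exploration side concrete, the following is a minimal sketch of k-hop neighborhood expansion over in-memory adjacency lists; the dictionary-based store and method names are illustrative stand-ins, not Trinity's actual API.

using System.Collections.Generic;

static class Exploration
{
    // Hypothetical in-memory store: node id -> adjacency list.
    static readonly Dictionary<long, List<long>> Adjacency = new Dictionary<long, List<long>>();

    // Collect all nodes within k hops of a start node by breadth-first expansion.
    // No joins and no intermediate relations: each step just follows pointers.
    public static HashSet<long> KHopNeighborhood(long start, int k)
    {
        var visited = new HashSet<long> { start };
        var frontier = new List<long> { start };

        for (int hop = 0; hop < k; hop++)
        {
            var next = new List<long>();
            foreach (long node in frontier)
            {
                if (!Adjacency.TryGetValue(node, out var links)) continue;
                foreach (long neighbor in links)
                {
                    if (visited.Add(neighbor))   // Add returns false if already visited
                        next.Add(neighbor);
                }
            }
            frontier = next;
        }
        return visited;
    }
}

Exploration-based subgraph matching interleaves such expansions from candidate nodes and checks edge constraints on the fly, which is why it materializes only small intermediate results.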

SPARQL queries on Satori, Microsoft's knowledge system, one of the world's largest repositories of knowledge.
Growth: 0.8 billion (Jun. 2011) → 4 billion (Dec. 2011, 5x) → 12 billion expected (Jun. 2012, 3x).

Comparison with RDF-3x, a state-of-the-art single-node RDF engine.
Dataset: LUBM-10240, a public RDF benchmark; 1.36 billion triples; 0.3 billion subjects/objects. Queries are chosen from the benchmark queries published with LUBM.

Query ID    1      2      3      4   5   6   7
Our System  15640  9849   11184  5   4   9   37666
RDF-3x      39m2s  14194  30585  14  11  65  69560

Subgraph matching on a billion-node graph
- No feasible existing solution for billion-node graphs: a super-linear index is not feasible
- Desideratum: require no index

Subgraph Match Query

One size fits all? (Diagram: systems positioned by scale to size vs. scale to complexity: disk-based key-value store, column store, document store, typical RDBMS, in-memory key-value store, graph DB; SQL marks the comfort zone.)

Memory-based Systems

Trinity
- A distributed, in-memory key/value store
- A graph database for online query processing
- A parallel platform for offline graph analytics

Graph scale (# of nodes vs. # of edges):
- the Web: 1 trillion edges
- Facebook: 104 billion edges (800 million nodes)
- Linked Data: 31 billion edges
- Satori: 5.6 billion edges
- US Road Map: 58 million edges (24 million nodes)
(Node counts on the chart range from 24 million to 50 billion.)

How many machines?
- the Web (1 trillion edges): 275 Trinity machines
- Facebook (104 billion edges): 25 Trinity machines
- Linked Data (31 billion edges): 16 Trinity machines
- Satori (5.6 billion edges): 4-10 Trinity machines
- US Road Map (58 million edges): 1 Trinity machine

Performance highlight
- Graph DB (online query processing): visiting 2.2 million users (the 3-hop neighborhood) on Facebook in <= 100 ms; the foundation for graph-based services, e.g., entity search
- Computing platform (offline graph analytics): one iteration on a 1-billion-node graph in <= 60 s; the foundation for analytics, e.g., social analytics

Performance highlight

Typical Job                             State-of-the-Art                                          Trinity
N-hop people search                     2-hop search in 100 ms; 3-hop not doable online           2-hop search in 10 ms; 3-hop search in 100 ms
Subgraph matching                       For billion-node graphs, indexing takes months to build   No structural index required; response time 500 ms
(Approximate) shortest distances/paths  No efficient solution for large graphs                    10 ms for discovering shortest paths within 6 hops
PageRank of the Web graph               Cosmos/MapReduce: 2 days using thousands of machines      1 iteration on a 1B-node graph in 60 s using 10 machines
Billion-node graph partitioning         < 22M nodes                                               Billions of nodes
Graph generation                        For small graphs only                                     Generates a 1-billion-node power-law graph in 15 minutes

(Architecture diagram) Applications and algorithms sit on top of Trinity, which comprises programming models, a computation platform, a query processing engine, and storage infrastructure.

Storage and Communication

Modeling a graph
- Basic data structure: the cell
- A graph node is a cell
- A graph edge may or may not be a cell (see the sketch below):
  - the edge has no info
  - the edge has very simple info (e.g., label, weight)
  - the edge has rich info: an independent cell
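A sketch of the three edge-modeling options; the class and field names below are hypothetical, not Trinity's schema language.

using System.Collections.Generic;

// Options 1 and 2: edges live inside the node cell, either as bare target ids
// (no edge info) or as (target, weight) pairs (very simple edge info).
class NodeCell
{
    public long CellId;
    public List<long> OutLinks;                                // edge has no info
    public List<(long Target, float Weight)> WeightedOutLinks; // edge has simple info
}

// Option 3: an edge with rich info becomes an independent cell;
// the node cell then stores the ids of its edge cells instead.
class EdgeCell
{
    public long CellId;
    public long Source;
    public long Target;
    public string Label;
    public Dictionary<string, string> Properties;
}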

Cell schema example (subgraph matching cell):

public class mycell : ISubgraphMatchCell, IVisualCellParser
{
    public byte[] label;
    public List<long> inlinks;
    public List<long> outlinks;
    internal long cell_id;

    public unsafe byte[] ToBinary();
    public unsafe ISubgraphMatchCell FromBinary(byte[] bytes);

    public void VisualParseCell(long nodeid, byte[] cell_bytes,
                                VisualizerCallbackToolkit toolkit);
}

Cell Transformation and Cell Parsing

Cell transformation (example: SubgraphMatch cell to DFS cell)

Partitioning
- Default: random partitioning by hashing
- The hash function can be customized (see the sketch below)
- A graph can be re-partitioned (as an analytical job)
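A minimal sketch of the default hash-based placement and a customized variant; the IPartitioner interface and the mapping-based partitioner are assumptions for illustration, not Trinity's actual interface.

using System.Collections.Generic;

interface IPartitioner
{
    int GetMachineId(long cellId);
}

// Default: random placement by hashing the 64-bit cell id.
class HashPartitioner : IPartitioner
{
    private readonly int machineCount;
    public HashPartitioner(int machineCount) { this.machineCount = machineCount; }

    public int GetMachineId(long cellId)
    {
        // Multiply by a large odd constant to mix the bits,
        // so that sequential ids do not all land on one machine.
        ulong h = (ulong)cellId * 0x9E3779B97F4A7C15UL;
        return (int)(h % (ulong)machineCount);
    }
}

// Customized: route cells of the same community (e.g., computed by an
// offline re-partitioning job) to the same machine to cut cross-machine traffic.
class CommunityPartitioner : IPartitioner
{
    private readonly Dictionary<long, int> cellToMachine;
    public CommunityPartitioner(Dictionary<long, int> mapping) { cellToMachine = mapping; }

    public int GetMachineId(long cellId) => cellToMachine[cellId];
}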

Cell Addressing

Big cells. A warning to graph system builders: Lady Gaga has 40,000,000 fans; at 8 bytes per fan ID, her cell alone takes 320 MB.

Cells as memory objects vs. blobs
- A memory object: a runtime object on the heap
- A blob: a binary stream
- Benefit: for 50,000,000 cells of 35 bytes each, CLR heap objects take 3,934 MB while flat memory streams take 1,668 MB
- Challenge: a blob must be parsed before use

Data Model

In-place blob update: operations on a cell can be translated into direct memory manipulations on the blob by a cell parser, e.g., updating a field becomes a write at a known offset (see the sketch below).
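A toy illustration of the idea, assuming a made-up fixed cell layout (the real layout and parser are Trinity internals): the parser knows each field's offset, so an update is a direct write into the byte array.

using System;

static class BlobUpdate
{
    // Assumed toy layout: bytes [0..7] hold the cell id, followed by the payload.
    const int IdOffset = 0;

    // Overwrite the cell id directly inside the blob:
    // no deserialize/serialize round trip, no new heap object.
    public static void SetCellId(byte[] blob, long newId)
    {
        byte[] bytes = BitConverter.GetBytes(newId);
        Buffer.BlockCopy(bytes, 0, blob, IdOffset, bytes.Length);
    }

    public static long GetCellId(byte[] blob)
    {
        return BitConverter.ToInt64(blob, IdOffset);
    }
}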

Cell placement in memory
- Cells are of different sizes
- Two types of applications:
  - mostly read-only, with few cell-expansion operations (e.g., a knowledge taxonomy, a semantic network)
  - many cell-expansion operations (e.g., synthetic graph generation)

Cell placement in memory
- Sequential memory storage: cells are stored contiguously in memory; compact
- Sparse hash storage: a cell may be stored as non-contiguous blocks; dynamic (supports streaming updates)

Concurrent sequential memory allocation
- A concurrent memory allocation request atomically increments the memory pointer (a single compare-and-swap, CAS, instruction)
- Each requester then writes concurrently starting from its allocated buffer header
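A minimal sketch of this bump-pointer allocation using a CAS loop; the class shape, overflow handling, and names are assumptions for illustration.

using System.Threading;

class SequentialAllocator
{
    private readonly byte[] buffer;
    private long head;   // offset of the next free byte

    public SequentialAllocator(int capacity) { buffer = new byte[capacity]; }

    // Atomically reserve `size` bytes; returns the start offset, or -1 when full.
    public long Allocate(int size)
    {
        while (true)
        {
            long old = Volatile.Read(ref head);
            long next = old + size;
            if (next > buffer.Length) return -1;

            // Only one concurrent request wins the [old, next) range;
            // the losers simply retry against the updated pointer.
            if (Interlocked.CompareExchange(ref head, next, old) == old)
                return old;
        }
    }
}

Once a thread owns its range, it can fill those bytes without further synchronization, which corresponds to the concurrent write from the allocated buffer header.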

Concurrent operations on a cell (using a spin cell lock):

// Basic version: spin on the CAS itself
while (CAS(ref lockarray[index], 1, 0) != 0) ;

// Optimized version (test-and-test-and-set): if the first CAS fails, spin on a
// cheap volatile read and only retry the CAS when the lock looks free.
// spin_lock denotes the per-cell lock word (lockarray[index] above).
if (CAS(ref spin_lock, 1, 0) != 0)
{
    while (true)
    {
        if (volatile_read(ref spin_lock) == 0 &&
            CAS(ref spin_lock, 1, 0) == 0)
            return;
    }
}

Hash storage: the memory address space is a hash space; a hash function is used to allocate memory; supports lock-free streaming cell updates.

Hash storage. (Diagram: each cell is stored as one or more blocks of the form [head | id | links | ...]; a block either terminates with -1 or ends with a jump pointer chaining to the cell's next block.)

Conflict resolution in hash storage
- Conflicts: two node IDs hash to the same slot, or one node ID hashes into the middle of the other
- Resolution: rehashing

Concurrency control
- To lock: atomically change the HEAD flag to OCCUPY
- To unlock: change OCCUPY back to HEAD
- The flag change must be atomic (e.g., an interlocked CAS); see the sketch below
(Diagram: a cell block [HEAD | id | links | -1] becomes [OCCUPY | id | links | -1] while locked.)
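A hedged sketch of such flag-based locking with an interlocked CAS; representing the flag as an int array indexed by cell slot is an assumption for illustration.

using System.Threading;

static class CellLock
{
    const int HEAD = 0;    // cell header flag: free
    const int OCCUPY = 1;  // cell header flag: locked

    // flags[index] stands in for the flag word in the cell's header.
    public static void Lock(int[] flags, int index)
    {
        // Atomically flip HEAD -> OCCUPY; on failure, spin on a cheap read
        // (test-and-test-and-set) before retrying the CAS.
        while (Interlocked.CompareExchange(ref flags[index], OCCUPY, HEAD) != HEAD)
        {
            while (Volatile.Read(ref flags[index]) != HEAD) { }
        }
    }

    public static void Unlock(int[] flags, int index)
    {
        Volatile.Write(ref flags[index], HEAD);
    }
}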

Fast billion-node graph generator
- Challenges: (i) massive random access; (ii) cell size changes
- Existing graph generators are slow and handle only small graphs
- The Trinity graph generator produces a 1-billion-node power-law graph in 15 minutes

Trinity graph generator (screenshot): parallel execution across the entire cluster; parameters include node # (1 B), edge # (10 B), and graph type.

Offline Graph Analytics

Vertex-Based Graph Computation

PageRank script (vertex-based computation):

public struct Message
{
    public float PageRankScore;
}

public class Cell : BaseCell
{
    // Local variables
    public int OutDegree;
    public Message message;

    public override void Initialize()
    {
        OutDegree = outlinks.Count;
        message.PageRankScore = 1.0f / GlobalNodeNumber;
    }

    // Vertex-based computation
    public override void Computation()
    {
        float PageRankScore = 0;
        if (CurrentIterationRoundNumber > 100)
            VoteToStop();

        // Fetch messages from in-neighbors
        foreach (long msg_id in inlinks)
        {
            Message msg = (Message)CacheManager.GetMessage(msg_id);
            PageRankScore += msg.PageRankScore;
        }

        // Generate the new outgoing message
        message.PageRankScore = PageRankScore / OutDegree;
    }
}

Vertex-based computation
- Treat each graph vertex as a processor
- Synchronous communication model (BSP)
- Used in Pregel and its open-source variants, such as Giraph and GoldenOrb

Restricted vertex-based computation
- In each superstep, the sender and receiver sets are known beforehand
- Messages can be categorized by their hotness
- Messages can be pre-fetched and cached

A bipartite view: local vertices vs. remote vertices.

How many machines do we need?
- Facebook (800 million users, 130 friends per user): 30 Trinity machines
- Web (25 billion pages, 40 links per page): 150 Trinity machines → 30 Trinity machines

Distribution-aware computing supports message sending through two APIs:
- Local message sending: send messages to neighboring nodes on the same machine
- Message sending: send messages to neighboring nodes anywhere

Distribution-aware computing
- Nodes are distributed over N machines (randomly); each machine holds roughly 1/N of the nodes and edges
- Can we perform the computation on 1 machine and estimate the result for the entire graph? Example: graph density estimation (see the sketch below)
- Can we perform the computation on each machine locally and aggregate the results in a final step? Example: finding connected components
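For the first pattern (compute on one machine, extrapolate), a toy sketch under the assumption of uniform random partitioning; the counters and method are purely illustrative.

static class DistributionAwareEstimate
{
    // With random (hash) partitioning, each machine holds about 1/N of the
    // nodes, and every out-edge is stored with its source node, so local
    // counts scale up to estimates of the global totals.
    public static (double Nodes, double Edges, double AvgDegree)
        EstimateGlobal(long localNodes, long localOutEdges, int machineCount)
    {
        double nodes = (double)localNodes * machineCount;
        double edges = (double)localOutEdges * machineCount;
        return (nodes, edges, edges / nodes);
    }
}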

Distributed betweenness
- Betweenness is the most effective guideline for selecting landmarks
- Exact betweenness costs O(nm) time, unacceptable on large graphs; a fast approximation is needed
- Approximations: count shortest paths rooted at sampled vertices; count shortest paths with limited depth
- On a distributed platform, we count the shortest paths within each machine as an approximation

How many machines do we need?
- Facebook (800 million users, 130 friends per user): 30 Trinity machines → 1 Trinity machine
- Web (25 billion pages, 40 links per page): 150 Trinity machines → 1 Trinity machine

Billion-node graph partitioning

Why partition? Divide a graph into k (almost) equal-size partitions such that the number of edges between partitions is minimized. A better partitioning helps:
- load balance
- reduced communication
Example: BFS on a 12-node example graph (nodes a-l); the best partitioning needs 3 cross-partition communications, the worst needs 17.

State-of-the-art method: Metis
- A multi-level framework with three phases:
  - coarsening by maximal matching until the graph is small enough
  - partitioning the coarsest graph with the KL algorithm [KL 1972]
  - uncoarsening
Ref: [Metis 1995]

Weaknesses of Metis
- Not semantics-aware: coarsening is ineffective on real graphs
  - Principle of coarsening: an optimal partitioning of the coarser graph should imply a good partitioning of the finer graph
  - Coarsening by maximal matching guarantees this only when node degree is bounded (e.g., 2D and 3D meshes); real networks have skewed degree distributions
- Inefficient: to support uncoarsening, many mappings and intermediate graphs must be stored, which hurts performance (e.g., a 4M-node graph consumes 10 GB of memory)
- Not general enough: designed exclusively for partitioning

Suggested solution: multi-level label propagation
- Coarsening by label propagation
  - Lightweight: easily implemented by message passing
  - Semantics-aware: discovers the inherent community structure of real graphs
- Label propagation: in each iteration, a vertex takes the label that is most prevalent in its neighborhood as its own label (see the sketch below)
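A minimal single-machine sketch of one synchronous label-propagation pass (the distributed, size-constrained variant used for coarsening is more involved):

using System.Collections.Generic;
using System.Linq;

static class LabelPropagation
{
    // One pass: every vertex adopts the most frequent label among its neighbors.
    public static Dictionary<long, long> Step(
        Dictionary<long, List<long>> adjacency,
        Dictionary<long, long> labels)
    {
        var next = new Dictionary<long, long>();
        foreach (var kv in adjacency)
        {
            long v = kv.Key;
            List<long> neighbors = kv.Value;
            if (neighbors.Count == 0) { next[v] = labels[v]; continue; }

            next[v] = neighbors
                .GroupBy(n => labels[n])              // count labels in the neighborhood
                .OrderByDescending(g => g.Count())    // pick the most prevalent one
                .First().Key;
        }
        return next;
    }
}

Iterating this step makes densely connected vertices converge to the same label, and each label group then becomes one super-node of the coarser graph.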

Handling imbalance
- Imbalance: too many small clusters, and some extremely large clusters (monster clusters)
- Our approach: first, set a size constraint on each label to limit monster clusters; second, merge small clusters using one of the following models: multiprocessor scheduling (MS) or weighted graph partitioning (WGP)

Results: quality and performance. Quality is comparable to Metis; performance is significantly better than Metis.

Results: scalability. Using 8 machines, each with 48 GB of memory, our solution takes 4 hours on a graph with 512 million nodes and 6.5 billion edges.

Community search on large graphs

Distance Oracle

Why an approximate distance oracle?
- What is your Erdős number? It is the shortest distance from you to the mathematician Paul Erdős in the scientific collaboration network. Shortest distances can also be used to compute centrality and betweenness.
- Exact solutions:
  - online computation (Dijkstra-like algorithms, BFS): at least linear complexity, computationally prohibitive on large graphs
  - pre-computing all pairs of shortest paths: quadratic space complexity
- Approximate distance oracle: report the approximate distance between any two vertices in constant time after pre-computation. When the graph is huge, approximation is acceptable and pre-computation is necessary.

Current solutions
- Two possible approaches: coordinate-based and sketch-based (see the landmark sketch below)
- Problem solved: approximate distance in (almost) constant time by pre-computation; currently for undirected, unweighted graphs only, with potential consideration for dynamic graphs
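For the sketch-based (landmark) approach, here is a minimal illustration of the standard estimate: precompute BFS distances from a few landmarks, then bound d(u, v) by routing through a landmark. This shows the general technique under assumed data structures, not necessarily Trinity's exact scheme.

using System;
using System.Collections.Generic;

static class DistanceOracle
{
    // distToLandmark[i][v] = precomputed BFS distance from landmark i to vertex v
    // (undirected, unweighted graph, as in the talk).
    public static int Estimate(List<Dictionary<long, int>> distToLandmark, long u, long v)
    {
        int best = int.MaxValue;
        foreach (var dist in distToLandmark)
        {
            if (dist.TryGetValue(u, out int du) && dist.TryGetValue(v, out int dv))
            {
                // Triangle inequality: d(u, v) <= d(u, landmark) + d(landmark, v).
                best = Math.Min(best, du + dv);
            }
        }
        return best;   // an upper bound on the true shortest distance
    }
}

The quality of the bound depends on landmark selection, which is where the betweenness criterion on the next slide comes in.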

Distributed betweenness
- Betweenness is the most effective guideline for selecting landmarks
- Exact betweenness costs O(nm) time, unacceptable on large graphs; a fast approximation is needed
- Approximations: count shortest paths rooted at sampled vertices; count shortest paths with limited depth
- On a distributed platform, we count the shortest paths within each machine as an approximation

Information about Trinity http://research.microsoft.com/trinity

Thanks!

Buffered asynchronous message sender (lock-free)