On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage


On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage
Arijit Khan, Nanyang Technological University (NTU), Singapore
Gustavo Segovia, ETH Zurich, Switzerland
Donald Kossmann, Microsoft Research, Redmond, USA

Big Graphs
- Web Graph: Google indexes > 1 trillion pages
- Social Network: Facebook has > 800 million active users
- Information Network: 31 billion RDF triples in 2011
- Biological Network: De Bruijn graphs with 4^k nodes (k = 20, ..., 40)
- Graphs in Machine Learning: 100M ratings, 480K users, 17K movies
1/32

Background: Distributed Graph Querying Systems
First, partition the graph, and then place each partition on a separate server, where query answering over that partition takes place.
State-of-the-art distributed graph querying systems: e.g., SEDGE [SIGMOD'12], Trinity [SIGMOD'13], Horton [PVLDB'13].
2/32

Background: Distributed Graph Querying Systems
Disadvantages:
- Fixed routing (less flexible)
- Balanced graph partitioning and re-partitioning
3/32

Background: Distributed Graph Querying Systems
Disadvantages:
- Fixed routing (less flexible): only the server that contains the query node can handle that request, so the router maintains a fixed routing table (or a fixed routing strategy, e.g., modulo hashing, as sketched below). This is less flexible with respect to query routing and fault tolerance; e.g., adding more machines requires updating the data partition and/or the routing table.
- Balanced graph partitioning and re-partitioning
4/32
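For illustration, a minimal sketch of such a fixed modulo-hash routing rule (Python; the function name is ours, not from the paper):

def fixed_route(node_id, num_servers):
    """Fixed routing via modulo hashing: a query on node_id can only be
    served by the one server holding its partition; changing num_servers
    (e.g., adding a machine) remaps almost every node."""
    return hash(node_id) % num_servers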

Background: Distributed Graph Querying Systems
Disadvantages:
- Fixed routing (less flexible)
- Balanced graph partitioning and re-partitioning: the partitioner must achieve (1) workload balancing to maximize parallelism and (2) locality of data access to minimize network communication. This is NP-hard and especially difficult in power-law graphs. Later updates to the graph structure or variations in the query workload require graph re-partitioning and/or replication; online monitoring of workload changes, re-partitioning of the graph topology, and migration of graph data across servers are expensive.
5/32

Roadmap
- Distributed graph querying and graph partitioning
- Decoupled graph querying system
- Related work
- Smart graph query routing
- Experimental results
- Conclusions
6/32

Decoupled Graph Querying System
We decouple query processing and graph storage into two separate tiers. This decoupling happens at a logical level.
[Figure: decoupled architecture for graph querying]
7/32

Decoupled Graph Querying System
Benefits:
- Flexible routing
- Less reliant on good partitioning across storage servers
(Both follow from our smart query routing strategy, discussed soon.)
[Figure: decoupled architecture for graph querying]
8/32

Decoupled Graph Querying System
Benefits:
- Flexible routing: a query processor is no longer assigned a fixed part of the graph, so every processor is equally capable of handling any request, which facilitates load balancing and fault tolerance. The query router can send a request to any of the query processors, enabling more flexible query routing; e.g., more query processors can be added (or a query processor that is down can be replaced) without affecting the routing strategy.
[Figure: decoupled architecture for graph querying]
9/32

Decoupled Graph Querying System
Benefits:
- Each tier can be scaled up independently: if a workload is processing-intensive, allocate more servers to the processing tier; as the graph size increases over time, add more servers to the storage tier.
- The decoupled architecture, being generic, can be employed in many existing graph querying systems.
[Figure: decoupled architecture for graph querying]
10/32

Roadmap recap; next up: Related work. 11/32

Related Work: Decoupling Storage and Query Processors
- Facebook's Memcached [NSDI'13]
- Google's F1 [PVLDB'13]
- ScaleDB [http://scaledb.com/pdfs/technicaloverview.pdf]
- Loesing et al., On the Design and Scalability of Distributed Shared-Data Databases [SIGMOD'15]
- Binnig et al., The End of Slow Networks: It's Time for a Redesign [PVLDB'16]
- Shalita et al., Social Hash: An Assignment Framework for Optimizing Distributed Systems Operations on Social Networks [NSDI'16]
12/32

Decoupled Graph Querying System
Disadvantages:
- Query processors may need to communicate with the storage tier via the network, adding a penalty to the response time for answering a query.
- May cause high contention rates on the network, the storage tier, or both.
[Figure: decoupled architecture for graph querying]
13/32

Our Contribution: Smart Query Routing
We design a smart query routing logic to utilize the cache of query processors over such a decoupled architecture.
- More cache hits → reduced communication between query processors and storage servers.
- More cache hits → less reliance on good partitioning across storage servers.
[Figure: decoupled architecture for graph querying]
14/32

Roadmap recap; next up: Smart graph query routing. 15/32

h-hop Traversal Queries
Online h-hop queries explore a small region of the entire graph and require fast response times. They start with a query node and traverse its neighboring nodes up to a certain number of hops (e.g., h = 2, 3).
Examples:
- h-hop neighbor aggregation
- h-step random walk with restart
- h-hop reachability
These underlie more complex queries, e.g., node labeling and classification, expert finding, ranking, and discovering functional modules, complexes, and pathways. A minimal traversal sketch follows this list.
16/32
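A minimal sketch of an h-hop traversal (Python); store.get_neighbors is a hypothetical call that returns a node's 1-hop adjacency list, served from the processor's cache or, on a miss, from the storage tier:

from collections import deque

def h_hop_neighbors(store, query_node, h):
    """Collect all nodes within h hops of query_node via BFS."""
    visited = {query_node}
    frontier = deque([(query_node, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == h:
            continue  # do not expand beyond h hops
        for nbr in store.get_neighbors(node):
            if nbr not in visited:
                visited.add(nbr)
                frontier.append((nbr, depth + 1))
    return visited - {query_node}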

Objectives for Smart Query Routing
- Leverage each processor's cached data.
- Balance workload even if it is skewed or contains hotspots.
- Make fast routing decisions [in a small constant time, << O(n)].
- Keep storage overhead in the router low [a small fraction of the input graph size].
[Figure: decoupled architecture for graph querying]
17/32

Challenges in Smart Query Routing
The objectives are conflicting:
- For maximum cache locality, the router could send all queries to the same processor (assuming no cache eviction), but that imbalances the processors' workload and lowers throughput.
- The router could inspect the cache of each processor before making a routing decision, but that incurs network delay.
Hence, the router must infer what is likely to be in each processor's cache.
(Smart routing objectives, recap: leverage each processor's cached data; balance workload even if skewed or with hotspots; make fast routing decisions; keep router storage overhead low.)
18/32

Challenges in Smart Query Routing
Topology-aware locality:
- Successive queries on nearby nodes should be sent to the same processor, since the h-hop neighborhoods of such nodes likely overlap significantly (e.g., the 2-hop neighborhoods of u and v overlap significantly).
- How does the router learn about nearby nodes without storing the entire graph topology? Use landmarks or a graph embedding.
19/32

Challenges in Smart Query Routing
Query stealing:
- Always routing queries to the processors that have the most useful cached data leads to workload imbalance under skew or query hotspots, and thus to lower throughput.
- We therefore perform query stealing at the router: whenever a processor is idle and ready to handle a new query, and no other request is assigned to it, the router may steal a request that was intended for another processor and send it to the idle one.
- Query stealing maintains topology-aware locality as much as possible; a sketch of this rule follows below.
20/32
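A sketch of this stealing rule at the router (Python); the per-processor queue layout, the .node field, and router.distance are our assumptions, not the paper's API:

def next_query_for(router, idle_proc):
    """Serve idle_proc from its own queue if possible; otherwise steal
    from the most loaded processor the pending query whose node is
    closest to idle_proc, to preserve topology-aware locality."""
    if router.queues[idle_proc]:
        return router.queues[idle_proc].pop(0)
    victim = max(router.queues, key=lambda p: len(router.queues[p]))
    if not router.queues[victim]:
        return None  # nothing pending anywhere
    q = min(router.queues[victim],
            key=lambda q: router.distance(q.node, idle_proc))
    router.queues[victim].remove(q)
    return q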

Smart Routing 1: Landmark
Intuition: if two nodes are close to a given landmark, they are likely to be close to each other.
21/32

Smart Routing 1: Landmark
Pre-processing:
- Select a small set of L nodes as landmarks and compute the distance of every node to each landmark.
- Assign landmarks to query processors: every processor is assigned a pivot landmark, chosen so that pivot landmarks are as far from each other as possible. Each remaining landmark is assigned to the processor containing its closest pivot landmark.
- The distance of a node u to a processor p is defined as the minimum distance of u to any landmark assigned to p. This distance information is stored in the router, which requires O(nP) space and O(nL) time to compute.
(Intuition, again: if two nodes are close to a given landmark, they are likely to be close to each other.)
A pre-processing sketch follows.
22/32
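A sketch of this pre-processing (Python), assuming BFS hop distances, a farthest-point heuristic for choosing pivots, and an adjacency dict adj; all names are ours:

from collections import deque

def bfs_distances(adj, src):
    """Hop distance from src to every reachable node."""
    dist = {src: 0}
    dq = deque([src])
    while dq:
        u = dq.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                dq.append(v)
    return dist

def landmark_preprocess(adj, landmarks, num_procs):
    INF = float('inf')
    # Distance of every node to every landmark: O(nL) total BFS work.
    d = {l: bfs_distances(adj, l) for l in landmarks}
    # One pivot landmark per processor, pivots mutually far apart.
    pivots = [landmarks[0]]
    while len(pivots) < num_procs:
        pivots.append(max((l for l in landmarks if l not in pivots),
                          key=lambda l: min(d[p].get(l, INF) for p in pivots)))
    # Assign each remaining landmark to the processor of its closest pivot.
    owner = {l: min(range(num_procs), key=lambda i: d[pivots[i]].get(l, INF))
             for l in landmarks}
    # d(u, p) = min distance of u to any landmark on p: O(nP) router state.
    return {u: [min((d[l].get(u, INF) for l in landmarks if owner[l] == i),
                    default=INF) for i in range(num_procs)]
            for u in adj}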

Smart Routing 1: Landmark
Online routing: for a query on node u, the router looks up the pre-computed distance d(u, p) for every processor p and selects the processor with the smallest d(u, p) value. Routing decision time: O(P).
Load balancing via query stealing: route to the smallest load-balanced distance. Nearby nodes are routed the same way, maintaining topology-aware locality. A sketch of the decision follows.
23/32
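The corresponding online decision (Python), with a simple load cap standing in for the full query-stealing logic; max_load is an illustrative knob, not the paper's:

def landmark_route(node_dist, loads, u, max_load=8):
    """O(P) scan: prefer the processor with the smallest pre-computed
    d(u, p) among those that are not currently overloaded."""
    P = len(loads)
    ok = [p for p in range(P) if loads[p] < max_load] or list(range(P))
    return min(ok, key=lambda p: node_dist[u][p])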

Smart Routing 2: Embed
Pre-processing: embed the graph in a lower D-dimensional Euclidean space such that the hop-count distance between graph nodes is approximately preserved by their Euclidean distance. Storage: O(nD); time: O(L²D + nLD).
[Figure: graph embedding in a 2D Euclidean plane]
A benefit of embed routing is that the pre-processing is independent of the system topology, allowing more processors to be added easily at a later time. A landmark-based embedding sketch follows.
24/32
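A landmark-based embedding sketch consistent with the stated O(L²D + nLD) cost (Python/NumPy); the paper's exact embedding procedure may differ, and this spring-relaxation variant is only illustrative:

import numpy as np
from collections import deque

def embed_graph(adj, landmarks, D, iters=30, lr=0.05, seed=0):
    """First relax landmark coordinates against their pairwise hop
    distances (O(L^2 D)), then place every other node against its hop
    distances to the landmarks (O(nLD))."""
    def bfs(src):
        dist = {src: 0}
        dq = deque([src])
        while dq:
            u = dq.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    dq.append(v)
        return dist
    hop = {l: bfs(l) for l in landmarks}
    rng = np.random.default_rng(seed)
    pos = {u: rng.standard_normal(D) for u in adj}
    def relax(u, targets):
        for _ in range(iters):
            for v, d_uv in targets:
                delta = pos[u] - pos[v]
                norm = np.linalg.norm(delta) + 1e-9
                # Nudge u so its Euclidean distance to v approaches d_uv.
                pos[u] += lr * (d_uv - norm) * delta / norm
    for l in landmarks:
        relax(l, [(m, hop[m][l]) for m in landmarks if m != l and l in hop[m]])
    for u in adj:
        if u not in landmarks:
            relax(u, [(l, hop[l][u]) for l in landmarks if u in hop[l]])
    return pos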

Smart Routing 2: Embed
Online routing: the router maintains, per processor, an exponential moving average of the coordinates of the nodes that processor has served, as an estimate of the mean of its cache contents. For a query node u, the router computes d(u, p), the distance from u's coordinates to the historical mean of processor p's cache contents, and routes the query to the processor with minimum d(u, p). Routing decision time: O(PD).
Load balancing via query stealing, as before.
[Figure: graph embedding in a 2D Euclidean plane]
A sketch of this decision follows.
25/32
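A sketch of the embed routing decision (Python/NumPy); alpha and max_load are illustrative values we chose, not the paper's settings:

import numpy as np

def embed_route(pos, cache_mean, loads, u, alpha=0.3, max_load=8):
    """O(PD) decision: route u to the non-overloaded processor whose
    cache-content mean (an exponential moving average of the embedded
    coordinates of nodes it recently served) is closest to pos[u],
    then fold u's coordinates into that processor's running mean."""
    P = len(cache_mean)
    ok = [p for p in range(P) if loads[p] < max_load] or list(range(P))
    p = min(ok, key=lambda q: np.linalg.norm(pos[u] - cache_mean[q]))
    cache_mean[p] = (1 - alpha) * cache_mean[p] + alpha * pos[u]
    return p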

Roadmap recap; next up: Experimental results. 26/32

Experimental Setup
Graph datasets: [table not reproduced in the transcription]
Cluster configuration:
- 12 servers, each having 2.4 GHz Intel Xeon processors and a 4 GB cache.
- Interconnected by 40 Gbps InfiniBand, and also by 10 Gbps Ethernet.
- We use a single core of each server with the following configuration: 1 server as the router, 7 servers in the processing tier, and 4 servers in the storage tier; communication goes over InfiniBand with remote direct memory access (RDMA).
- RAMCloud serves as the storage tier.
- The graph is stored as adjacency lists: every node-id is a key, and the corresponding value is an array of the node's 1-hop neighbors (a generic key-value sketch follows).
- The graph is partitioned across storage servers via RAMCloud's default and inexpensive hash partitioning scheme (MurmurHash3 over graph nodes).
27/32
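The adjacency-list layout in a key-value store, as a generic sketch (Python); RAMCloud's actual client API differs, and kv_put stands in for the storage-tier write call:

def load_graph(kv_put, adj):
    """Write each node's 1-hop neighbor list under its node-id; the
    storage servers then hash-partition the keys (MurmurHash3 in
    RAMCloud's default scheme), so no smart partitioning is needed."""
    for node, neighbors in adj.items():
        key = str(node).encode()
        value = b",".join(str(n).encode() for n in neighbors)
        kv_put(key, value)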

List of Experiments
- Comparison with distributed graph systems (SEDGE [SIGMOD'12] with Giraph [SIGMOD'10], GraphLab [VLDB'12]) that use smart graph partitioning and re-partitioning: our method achieves up to an order of magnitude higher throughput even with inexpensive hash partitioning of the graph.
- Scalability with the number of processors and storage servers.
- Impact of cache size.
- Impact of graph updates.
- Sensitivity w.r.t. different parameters: query locality and hotspots, h-hop queries, load factor, smoothing parameter, embedding dimensionality, number of landmarks, and minimum distance between a pair of landmarks.
Performance metrics: query efficiency, query throughput, cache hit rates.
Baseline routing methods: next ready, no cache, modulo hash with query stealing.
28/32

Performance with Varying Number of Query Processors
Embed routing sustains almost the same cache hit rate as query processors are added; hence, its throughput scales linearly with the number of query processors.
29/32

Impact of Cache Sizes
Both smart routing schemes, Embed and Landmark, utilize the cache well; for the same amount of cache, they achieve lower response times than the baseline routings.
30/32

Roadmap recap; next up: Conclusions. 31/32

Conclusions
- Decoupled graph querying system.
- Smart query routing achieves higher cache hit rates for h-hop traversal queries:
  - relies less on expensive graph partitioning and re-partitioning across the storage tier;
  - provides linear scalability in throughput with the number of query processors;
  - works well in the presence of query hotspots;
  - is adaptive to workload changes and graph updates.
[Figure: decoupled architecture for graph querying]
32/32