Memory-Optimized Distributed Graph Processing. through Novel Compression Techniques

Similar documents
Big Graph Processing. Fenggang Wu Nov. 6, 2016

PREGEL AND GIRAPH. Why Pregel? Processing large graph problems is challenging Options

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. XX, NO. YY, XXX. ZZ 1. Realizing Memory-Optimized Distributed Graph Processing

PAGE: A Partition Aware Graph Computation Engine

PREGEL: A SYSTEM FOR LARGE-SCALE GRAPH PROCESSING

I ++ Mapreduce: Incremental Mapreduce for Mining the Big Data

Fine-grained Data Partitioning Framework for Distributed Database Systems

AN INTRODUCTION TO GRAPH COMPRESSION TECHNIQUES FOR IN-MEMORY GRAPH COMPUTATION

Pregel: A System for Large-Scale Graph Proces sing

PREGEL: A SYSTEM FOR LARGE- SCALE GRAPH PROCESSING

FPGP: Graph Processing Framework on FPGA

GraphLab: A New Framework for Parallel Machine Learning

arxiv: v1 [cs.dc] 20 Apr 2018

igiraph: A Cost-efficient Framework for Processing Large-scale Graphs on Public Clouds

IJSRD - International Journal for Scientific Research & Development Vol. 4, Issue 01, 2016 ISSN (online):

Handling limits of high degree vertices in graph processing using MapReduce and Pregel

Optimizing CPU Cache Performance for Pregel-Like Graph Computation

Report. X-Stream Edge-centric Graph processing

DISTINGER: A Distributed Graph Data Structure for Massive Dynamic Graph Processing

Authors: Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, L., Leiser, N., Czjkowski, G.

Puffin: Graph Processing System on Multi-GPUs

CIM/E Oriented Graph Database Model Architecture and Parallel Network Topology Processing

Digree: Building A Distributed Graph Processing Engine out of Single-node Graph Database Installations

MMap: Fast Billion-Scale Graph Computation on a PC via Memory Mapping

GoFFish: A Sub-Graph Centric Framework for Large-Scale Graph Analytics

DFA-G: A Unified Programming Model for Vertex-centric Parallel Graph Processing

Giraphx: Parallel Yet Serializable Large-Scale Graph Processing

USING DYNAMOGRAPH: APPLICATION SCENARIOS FOR LARGE-SCALE TEMPORAL GRAPH PROCESSING

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

Towards Efficient Query Processing on Massive Time-Evolving Graphs

Fast Graph Mining with HBase

Bipartite-oriented Distributed Graph Partitioning for Big Learning

Scale-up Graph Processing: A Storage-centric View

From Think Like a Vertex to Think Like a Graph. Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, John McPherson

FINE-GRAIN INCREMENTAL PROCESSING OF MAPREDUCE AND MINING IN BIG DATA ENVIRONMENT

INCREMENTAL STREAMING DATA FOR MAPREDUCE IN BIG DATA USING HADOOP

Graph-Parallel Entity Resolution using LSH & IMM

GraphChi: Large-Scale Graph Computation on Just a PC

ZHT+ : Design and Implementation of a Graph Database Using ZHT

Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA*

Incremental and Iterative Map reduce For Mining Evolving the Big Data in Banking System

arxiv: v1 [cs.dc] 5 Nov 2018

One Trillion Edges. Graph processing at Facebook scale

COSC 6339 Big Data Analytics. Graph Algorithms and Apache Giraph

Apache Giraph: Facebook-scale graph processing infrastructure. 3/31/2014 Avery Ching, Facebook GDM

Optimistic Recovery for Iterative Dataflows in Action

MOTIVATION 1/18/18. Software tools for Complex Networks Analysis. Graphs are everywhere. Why do we need tools?

Early Experiences in Using a Domain-Specific Language for Large-Scale Graph Analysis

ANG: a combination of Apriori and graph computing techniques for frequent itemsets mining

MPGM: A Mixed Parallel Big Graph Mining Tool

Lecture 8, 4/22/2015. Scribed by Orren Karniol-Tambour, Hao Yi Ong, Swaroop Ramaswamy, and William Song.

Cost Model for Pregel on GraphX

Embedded domain specific language for GPUaccelerated graph operations with automatic transformation and fusion

CS 347 Parallel and Distributed Data Processing

IMPROVING MAPREDUCE FOR MINING EVOLVING BIG DATA USING TOP K RULES

CS 347 Parallel and Distributed Data Processing

[CoolName++]: A Graph Processing Framework for Charm++

King Abdullah University of Science and Technology. CS348: Cloud Computing. Large-Scale Graph Processing

WGB: Towards a Universal Graph Benchmark

HUS-Graph: I/O-Efficient Out-of-Core Graph Processing with Hybrid Update Strategy

A Hierarchical Synchronous Parallel Model for Wide-Area Graph Analytics

Pregel. Ali Shah

Acolyte: An In-Memory Social Network Query System

Chapter 5 Parallel Processing of Graphs

DFEP: Distributed Funding-based Edge Partitioning

BlitzG: Exploiting High-Bandwidth Networks for Fast Graph Processing

A. Papadopoulos, G. Pallis, M. D. Dikaiakos. Identifying Clusters with Attribute Homogeneity and Similar Connectivity in Information Networks

arxiv: v1 [cs.db] 22 Feb 2013

Performance of Alternating Least Squares in a distributed approach using GraphLab and MapReduce

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem

Scalable and Robust Management of Dynamic Graph Data

Full-Text and Structural XML Indexing on B + -Tree

Balanced Graph Partitioning with Apache Spark

A Distributed Framework for Large-Scale Time-Dependent Graph Analysis

modern database systems lecture 10 : large-scale graph processing

Optimizing Memory Performance for FPGA Implementation of PageRank

Enabling Scalable Social Group Analytics via Hypergraph Analysis Systems

Billion node graph inference: iterative processing on The Machine

Facilitating Real-Time Graph Mining

Mining Frequent Itemsets for data streams over Weighted Sliding Windows

Maiter: An Asynchronous Graph Processing Framework for Delta-based Accumulative Iterative Computation

AN EFFICIENT MONTE CARLO APPROACH TO COMPUTE PAGERANK FOR LARGE GRAPHS ON A SINGLE PC

GraphMP: An Efficient Semi-External-Memory Big Graph Processing System on a Single Machine

Frameworks for Graph-Based Problems

Drakkar: a graph based All-Nearest Neighbour search algorithm for bibliographic coupling

ForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA Architecture

GPS: A Graph Processing System

Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets

Proxy-Guided Load Balancing of Graph Processing Workloads on Heterogeneous Clusters

Data Analytics: From Conceptual Modelling to Logical Representation

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Web page recommendation using a stochastic process model

1 Optimizing parallel iterative graph computation

Compressing and Decoding Term Statistics Time Series

PARALLEL SEED SELECTION METHOD FOR OVERLAPPING COMMUNITY DETECTION IN SOCIAL NETWORK

Graph Data Management

Giraph Unchained: Barrierless Asynchronous Parallel Execution in Pregel-like Graph Processing Systems

WOOster: A Map-Reduce based Platform for Graph Mining

arxiv: v5 [cs.dc] 7 Aug 2017

CS 5220: Parallel Graph Algorithms. David Bindel

Transcription:

Memory-Optimized Distributed Graph Processing through Novel Compression Techniques Katia Papakonstantinopoulou Joint work with Panagiotis Liakos and Alex Delis University of Athens Athens Colloquium in Algorithms and Complexity UoA, August 26 th, 2016

Motivation UoA Katia Papakonstantinopoulou Distributed Graph Compression- Intro Motivation 2/19 graph data whose size continuously grows distributed graph processing systems (e.g., Pregel & Apache Giraph) scale of real-world graphs hardens graph processing even in distributed environments we need efficient distributed representations of such graphs We address this problem by exploiting empirically-observed properties demonstrated by behavior graphs

Distributed graph-processing approaches and systems The proposed frameworks [1, 6, 4, 7] fail to handle the huge scale of real-world graphs, as a result of ineffective memory usage [2]. The partitioning hardens the task of compression. UoA Katia Papakonstantinopoulou Distributed Graph Compression- Intro Motivation 3/19 Most of these approaches adopt the think like a vertex programming paradigm introduced with Pregel [5] ( intuitive parallelizable algorithms). A graph partitioned on a vertex basis in a distributed environment: Worker 1 Worker 2 5 2 2 3,4,5 3 6 1, 8 1 2, 3, 4 4 1, 7 7 8 8 7 Worker 3

Indicative memory optimization approaches UoA Katia Papakonstantinopoulou Distributed Graph Compression- Intro Motivation 4/19 The distributed graph-processing systems face significant memory-related issues. Some memory optimization approaches are: Apache Giraph [1] (graph processing system that follows Pregel) with contributions by Facebook focused entirely on a more careful implementation for the representation of the out-edges of a vertex, without exploiting the redunduncy in real-world graphs Gbase[3] (a number of alternative compression techniques to reduce storage and hence network traffic and query execution time) does not follow the vertex-centric model requires decompression

Contribution UoA Katia Papakonstantinopoulou Distributed Graph Compression- Intro Motivation 5/19 We follow the Pregel paradigm and partition the graph vertices among the nodes of a distributed computing environment. In this context, we present a number of novel techniques that: 1 offer space efficient-representations of the out-edges of vertices, 2 allow fast mining (in-situ) of the graph elements without the need of decompression, 3 enable the execution of graph algorithms in memory-constrained settings, and 4 ease the task of memory management, thus allowing faster execution. Our work lies in the intersection of distributed graph processing systems and compressed graph representations.

Distributed vs non-distributed settings UoA Katia Papakonstantinopoulou Distributed Graph Compression- Intro Motivation 6/19 In non-distributed settings, we can exploit the fact that vertices tend to exhibit similarities (copy property). In order to achieve memory optimization, we need representations that allow mining of the graph s elements without decompression.

Giraph structures for out-neighbors UoA Katia Papakonstantinopoulou Distributed Graph Compression- Intro Motivation 7/19 Giraph s adjacency-list representations: ByteArrayEdges vertex id 1 weight 1 vertex id 2 weight 2... size 1 size 2 size 1 size 2 The bytes required per out-neighbor are determined by the data type used for its id and weight; for integer numbers 4+4=8 bytes are required.

Representations based on consecutive out-edges UoA Katia Papakonstantinopoulou Distributed Graph Compression- Our Approach 8/19 Consider the following sequence of neighbors to be represented: (2, 9, 10, 11, 12, 14, 17, 18, 20, 127). (1) 2 (9) 2 BVedges γ(0) 4 bytes 4 bytes }{{}}{{} number interval of intervals ζ(13) ζ(11) ζ(2) ζ(0) ζ(1) ζ(106) } {{ } residuals In the context of graph compression, Elias γ coding is preferred for the representation of rather small values of x, whereas ζ coding is more proper for potentially large values.

Representations based on consecutive out-edges UoA Katia Papakonstantinopoulou Distributed Graph Compression- Our Approach 9/19 Consider again the sequence of neighbors: (2, 9, 10, 11, 12, 14, 17, 18, 20, 127). IntervalResidualEdges (2) 2 (9) 2 (4) 2 (17) 2 (2) 2 4 bytes 4+1 bytes 4+1 bytes } {{ }} {{ } {{ } number of 1st interval 2nd interval intervals (2) 2 (14) 2 (20) 2 (127) 2 4 bytes 4 bytes 4 bytes 4 bytes } {{ } residuals

Representation based on concentration of out-edges UoA Katia Papakonstantinopoulou Distributed Graph Compression- Our Approach 10/19 Again using the sequence: (2, 9, 10, 11, 12, 14, 17, 18, 20, 127). IndexedBitArrayEdges... (0) 2 (1) 2 (2) 2 (15) 2 4 bytes 1 byte

Experimental Evaluation UoA Katia Papakonstantinopoulou Distributed Graph Compression- Experimental Evaluation 11/19 How much more space-efficient is each of our three compressed out-edge representations compared to ByteArrayEdges? Are our techniques competitive speed-wise when memory is not a concern? How much more efficient are our compressed representations when the available memory is constrained? Can we execute algorithms for large graphs in settings where it was not possible before?

Space Efficiency Comparison UoA Katia Papakonstantinopoulou Distributed Graph Compression- Experimental Evaluation 12/19 Memory requirements of Giraph s ByteArrayEdges and our three representations for the small and large-scale graphs of our dataset: ByteArray- BVEdges IntervalRe- IndexedBitgraph Edges sidualedges ArrayEdges uk-2007-05@100000 22.61MB 6.41MB 7.92MB 8.91MB uk-2007-05@1000000 279.16MB 67.36MB 82.7MB 97.79MB indochina-2004 1,511.67MB 442.34MB 646.03MB 554.23MB hollywood-2011 1,381.91MB 287.53MB 613.52MB 676.88MB uk-2002 2,733.6MB 1,092.82MB 1,224.07MB 1,255.67MB arabic-2005 4,820.09MB 1,428.97MB 1,674.75MB 1,849.83MB uk-2005 7,401.88MB 2,383.54MB 2,728.74MB 2,928.81MB sk-2005 14,829.64MB 4,889.85MB 5,657.79MB 6,354.17MB

Execution Time Comparison (small-scale graphs) UoA Katia Papakonstantinopoulou Distributed Graph Compression- Experimental Evaluation 13/19 Execution time of PageRank algorithm for graph indochina-2004 using a setup of 2, 4, and 8 workers: Execution time (in minutes) 45 40 35 30 25 20 15 10 5 ByteArrayEdges BVEdges IntervalResidualEdges IndexedBitArrayEdges 0 2 workers 4 workers 8 workers

Execution Time Comparison (large-scale graphs) Execution time for each superstep of PageRank algorithm for graph uk-2005 using 5 workers: 10 Execution time (in minutes) 8 6 4 2 ByteArrayEdges BVEdges IntervalResidualEdges IndexedBitArrayEdges 0 0 5 10 15 20 25 30 Supersteps of PageRank execution ByteArrayEdges performance fluctuates due to extensive garbage collection. UoA Katia Papakonstantinopoulou Distributed Graph Compression- Experimental Evaluation 14/19

Execution Time Comparison (large-scale graphs) UoA Katia Papakonstantinopoulou Distributed Graph Compression- Experimental Evaluation 15/19 Execution time of PageRank algorithm for graph uk-2005: Execution time (in minutes) 200 150 100 50 0 ByteArrayEdges BVEdges IntervalResidualEdges IndexedBitArrayEdges uk-2005 (5 workers) FAILED uk-2005 (4 workers) IntervalResidualEdges and IndexedBitArrayEdges outperform ByteArrayEdges.

Conclusion UoA Katia Papakonstantinopoulou Distributed Graph Compression- Experimental Evaluation 16/19 Our experimental results indicate significant improvements on space-efficiency for all proposed techniques. We reduced memory requirements up to 5 times in comparison with currently applied techniques. In settings where earlier approaches were able to execute graph algorithms, we achieve significant performance improvements. We reduced execution time up to 31% due to memory optimization. These findings establish our structures as the preferable option for web graphs, or any other type of behavior graphs.

Future Directions UoA Katia Papakonstantinopoulou Distributed Graph Compression- Future Directions 17/19 Design representation methods that favor mutations of the graph. Examine the compression of edge weights.

References UoA Katia Papakonstantinopoulou Distributed Graph Compression- References 18/19 [1] Apache Giraph. http://giraph.apache.org/. [2] Minyang Han, Khuzaima Daudjee, Khaled Ammar, M. Tamer Özsu, Xingfang Wang, and Tianqi Jin. An Experimental Comparison of Pregel-like Graph Processing Systems. Proc. of the VLDB Endowment, 7(12):1047 1058, 2014. [3] U. Kang, Hanghang Tong, Jimeng Sun, Ching-Yung Lin, and Christos Faloutsos. GBASE: an efficient analysis platform for large graphs. VLDB J., 21(5):637 650, 2012. [4] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. Distributed GraphLab: A Framework for Machine Learning in the Cloud. Proc. of the VLDB Endowment, 5(8):716 727, 2012. [5] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: A System for Large-Scale Graph Processing. In ACM SIGMOD, 2010. [6] Semih Salihoglu and Jennifer Widom. GPS: a graph processing system. In SSDBM, 2013. [7] Da Yan, James Cheng, Yi Lu, and Wilfred Ng. Effective Techniques for Message Reduction and Load Balancing in Distributed Graph Computation. In WWW, 2015.

UoA Katia Papakonstantinopoulou Distributed Graph Compression- Contact 19/19 Thank you for your attention! for further details visit: http://hive.di.uoa.gr/network-analysis/ or email me at: katia@di.uoa.gr