[CoolName++]: A Graph Processing Framework for Charm++


[CoolName++]: A Graph Processing Framework for Charm++
Hassan Eslami, Erin Molloy, August Shi, Prakalp Srivastava, Laxmikant V. Kale
University of Illinois at Urbana-Champaign
{eslami2,emolloy2,awshi2,psrivas2,kale}@illinois.edu
Charm++ Workshop, May 8, 2015

Graphs and networks

A graph is a set of vertices and a set of edges, which describe relationships between pairs of vertices. Data analysts wish to gain insights into characteristics of increasingly large networks, such as:
- roads
- utility grids
- the internet
- social networks
- protein-protein interaction networks
- gene regulatory processes [1]

[1] X. Zhu, M. Gerstein, and M. Snyder. "Getting connected: analysis and principles of biological networks." Genes and Development 21 (2007), pp. 1010-1024. doi: 10.1101/gad.1528707.

Why large-scale graph processing?

Large social networks [2]:
- 1 billion vertices, 100 billion edges
- 111 PB adjacency matrix
- 2.92 TB adjacency list
- 2.92 TB edge list

[2] Paul Burkhardt and Chris Waring. "An NSA Big Graph Experiment." Technical Report NSA-RD-2013-056002v1. May 2013.
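For intuition (not on the original slide), the adjacency-matrix figure follows from a back-of-the-envelope calculation, assuming one bit per matrix entry and binary prefixes:

    (10^9 vertices)^2 = 10^18 entries = 10^18 bits ≈ 1.25 × 10^17 bytes ≈ 111 PiB

The adjacency-list and edge-list sizes depend on the per-edge encoding, so they are not reconstructed here.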

Why large-scale graph processing?

Large web graphs [3]:
- 50 billion vertices, 1 trillion edges
- 271 PB adjacency matrix
- 29.5 TB adjacency list
- 29.1 TB edge list

[3] Paul Burkhardt and Chris Waring. "An NSA Big Graph Experiment." Technical Report NSA-RD-2013-056002v1. May 2013.

Why large-scale graph processing?

Large brain networks [4]:
- 100 billion vertices, 100 trillion edges
- 2.08 mN_A bytes (molar bytes, with N_A Avogadro's number) adjacency matrix
- 2.84 PB adjacency list
- 2.84 PB edge list

[4] Paul Burkhardt and Chris Waring. "An NSA Big Graph Experiment." Technical Report NSA-RD-2013-056002v1. May 2013.

Challenges of parallel graph processing

Many graph algorithms result in [5]:
- a large volume of fine-grain messages,
- little computation per vertex,
- irregular data access, and
- load imbalances due to highly connected communities and high-degree vertices.

[5] A. Lumsdaine et al. "Challenges in parallel graph processing." Parallel Processing Letters 17.1 (2007), pp. 5-20.

Vertex-centric graph computation

Introduced in Google's graph processing framework, Pregel [6], and based on the Bulk Synchronous Parallel (BSP) model. A series of global supersteps is performed, where each active vertex in the graph:
1. processes incoming messages from the previous superstep,
2. does some computation, and
3. sends messages to other vertices.

The algorithm terminates when all vertices are inactive (i.e., they vote to halt the computation) and there are no messages in transit. Note that supersteps are synchronized via a global barrier, which is costly, but the model is simple and versatile.

[6] G. Malewicz et al. "Pregel: a system for large-scale graph processing." In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 2010, pp. 135-146.
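As a rough, sequential illustration of the model (not this framework's code; Vertex, Message, and the mailbox layout are placeholder names):

// Minimal sketch of one BSP superstep: deliver last superstep's mail to each
// vertex that is active or has messages, and run its compute() callback.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Message { double value; };

struct Vertex {
    bool active = true;
    virtual void compute(const std::vector<Message>& inbox) = 0;
    virtual ~Vertex() = default;
};

bool superstep(std::vector<Vertex*>& vertices,
               std::unordered_map<uint32_t, std::vector<Message>>& mailboxes) {
    bool anyWork = false;
    for (std::size_t v = 0; v < vertices.size(); ++v) {
        auto it = mailboxes.find(static_cast<uint32_t>(v));
        const bool hasMail = (it != mailboxes.end()) && !it->second.empty();
        if (vertices[v]->active || hasMail) {
            static const std::vector<Message> kEmpty;
            vertices[v]->compute(hasMail ? it->second : kEmpty);
            anyWork = true;
        }
    }
    // A real framework would also collect the messages produced during
    // compute() and deliver them in the next superstep, after a global barrier:
    //   while (superstep(vertices, mailboxes)) { /* barrier; swap mailboxes */ }
    return anyWork;
}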

Our contributions
- Implement and optimize a vertex-centric graph processing framework on top of Charm++
- Evaluate performance for several graph applications:
  - Single Source Shortest Path
  - Approximate Graph Diameter
  - Vertex Betweenness Centrality
- Compare our framework to GraphLab [7]

[7] Yucheng Low et al. "Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud." Proc. VLDB Endow. 5.8 (Apr. 2012), pp. 716-727. issn: 2150-8097. doi: 10.14778/2212351.2212354. url: http://dx.doi.org/10.14778/2212351.2212354.

CoolName++ framework overview
- Vertices are divided amongst parallel objects (chares), called Shards.
- Shards handle the receiving and sending of messages between vertices.
- The Main chare coordinates the flow of computation by initiating supersteps.

User API

Implementation of a graph algorithm requires defining a vertex class with a compute member function. In addition, users may also define functions for:
- graph I/O,
- mapping vertices to Shards, and
- combining messages being sent to and received by the same vertex.
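The slide does not show the framework's actual class names; as a hedged sketch, the user-facing API might have roughly this shape (all names here are assumptions, and the framework-provided bodies are omitted):

// Illustrative only: a possible shape of the user-facing API implied above.
#include <cstdint>
#include <vector>

struct Message { double value; double getValue() const { return value; } };

class VertexBase {
public:
    // Required: called once per superstep on active vertices (and on inactive
    // vertices that received messages).
    virtual void compute(const std::vector<Message>& messages) = 0;
    virtual ~VertexBase() = default;

protected:
    // Provided by the framework (declarations only in this sketch):
    void setActive();                            // schedule vertex for the next superstep
    void voteToHalt();                           // deactivate until new messages arrive
    void sendMessageToNeighbors(double value);   // buffered and routed by the owning Shard
};

// Optional user hooks:
uint32_t mapVertexToShard(uint32_t vertexId, int numShards);   // vertex-to-Shard mapping
Message combine(const Message& a, const Message& b);            // message combiner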

Example vertex constructor

Algorithm 1: Constructor for SSSP
1: if vertex is the source vertex then
2:   setActive()
3:   distance = 0
4: else
5:   distance = ∞
6: end if

Example vertex compute function

Algorithm 2: Compute function for SSSP
1: min_dist = isSource() ? 0 : ∞
2: for each of your messages do
3:   if message.getValue() < min_dist then
4:     min_dist = message.getValue()
5:   end if
6: end for
7: if min_dist < distance then
8:   distance = min_dist
9:   sendMessageToNeighbors(distance + 1)
10: end if
11: voteToHalt()
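Translated into C++, a sketch of the SSSP vertex corresponding to Algorithms 1 and 2 might look as follows (not the presenters' exact code; the framework calls are left as comments so the sketch stands alone, and unit edge weights are assumed as in Algorithm 2):

// Hedged C++ translation of the SSSP vertex constructor and compute function.
#include <limits>
#include <vector>

struct Message { double value; double getValue() const { return value; } };

class SSSPVertex {
public:
    explicit SSSPVertex(bool isSource) : source_(isSource) {
        if (source_) {
            // setActive();                       // framework call (assumed)
            distance_ = 0.0;
        } else {
            distance_ = std::numeric_limits<double>::infinity();
        }
    }

    void compute(const std::vector<Message>& messages) {
        double minDist = source_ ? 0.0 : std::numeric_limits<double>::infinity();
        for (const Message& m : messages)
            if (m.getValue() < minDist) minDist = m.getValue();
        if (minDist < distance_) {
            distance_ = minDist;
            // sendMessageToNeighbors(distance_ + 1);  // framework call (assumed)
        }
        // voteToHalt();                               // framework call (assumed)
    }

private:
    bool source_;
    double distance_;
};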

Implementation - the .ci file

mainchare Main {
  entry Main(CkArgMsg *m);
  entry [reductiontarget] void start();
  entry [reductiontarget] void checkin(int n, int counts[n]);
};

group ShardCommManager {
  entry ShardCommManager();
}

array [1D] Shard {
  entry Shard(void);
  entry void processMessage(int superstepId, int length,
                            std::pair<uint32_t, MessageType> msg[length]);
  entry void run(int mcount);
};

Implementation - run() function

void Shard::run(int messageCount) {
  // Start a new superstep
  superstep = commManagerProxy.ckLocalBranch()->getSuperstep();
  ...
  if (messageCount == expectedNumberOfMessages) {
    startCompute();
  } else {
    // Continue to wait for messages in transit
  }
}

void Shard::startCompute() {
  for (vertex in activeVertices) {
    vertex.compute(messages[vertex]);
  }
  for (vertex in inactiveVertices with incoming messages) {
    vertex.compute(messages[vertex]);
  }
  managerProxy.ckLocalBranch()->done();
}

Optimizations

Messages between vertices tend to be small but still incur overhead, so:
- Shards buffer messages
- Users may define a message combine function (applied on send and on receive)

Example message combiner

Algorithm 3: Combine function for SSSP
1: if message1.getValue() < message2.getValue() then
2:   return message1
3: else
4:   return message2
5: end if
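As a hedged C++ sketch (type names and the flush threshold are assumptions, not the framework's actual API), the SSSP min-combiner together with the kind of send-side buffer a Shard might keep could look like:

// Illustrative only: Algorithm 3 as a min-combiner, plus a per-destination
// send buffer that combines duplicates and flushes once it grows large enough.
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Message { double value; double getValue() const { return value; } };

// Algorithm 3: keep the smaller of two messages headed for the same vertex.
inline Message combine(const Message& m1, const Message& m2) {
    return (m1.getValue() < m2.getValue()) ? m1 : m2;
}

class SendBuffer {
public:
    explicit SendBuffer(std::size_t flushThreshold) : threshold_(flushThreshold) {}

    // Combine with any pending message for the same destination vertex.
    void add(uint32_t destVertex, const Message& m) {
        auto it = pending_.find(destVertex);
        if (it == pending_.end()) pending_.emplace(destVertex, m);
        else it->second = combine(it->second, m);
        if (pending_.size() >= threshold_) flush();
    }

    void flush() {
        std::vector<std::pair<uint32_t, Message>> batch(pending_.begin(), pending_.end());
        // In the real framework this batch would be shipped to the owning Shard
        // via a Charm++ entry method (e.g., processMessage in the .ci file above).
        pending_.clear();
        (void)batch;
    }

private:
    std::size_t threshold_;
    std::unordered_map<uint32_t, Message> pending_;
};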

Applications

We consider three applications for the preliminary evaluation of our framework.
- Single Source Shortest Path (SSSP)
- Graph Diameter: the longest shortest path between any two vertices. We implement the approximate diameter with Flajolet-Martin (FM) bitmasks [8] (a small sketch of the bitmask primitive follows below).
- Betweenness Centrality of a Vertex: the number of shortest paths between every two vertices that pass through the vertex, divided by the total number of shortest paths between every two vertices. We implement Brandes' algorithm [9].

[8] P. Flajolet and G. N. Martin. "Probabilistic Counting Algorithms for Data Base Applications." Journal of Computer and System Sciences 31.2 (1985), pp. 182-209.
[9] U. Brandes. "A faster algorithm for betweenness centrality." Journal of Mathematical Sociology 25.2 (2001), pp. 163-177.
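For context, a minimal sketch of the Flajolet-Martin bitmask primitive used in ANF/HADI-style approximate-diameter computations: each vertex seeds a mask from its own id, then in every superstep ORs in its neighbors' masks; the superstep at which masks stop changing estimates the diameter. The hash function and the single 32-bit mask per vertex are simplifying assumptions, not the presenters' implementation.

// Illustrative FM bitmask primitive for approximate diameter.
#include <cstdint>

// Any decent 32-bit integer mixer works; this one is an assumption for the sketch.
inline uint32_t hash32(uint32_t x) {
    x ^= x >> 16; x *= 0x7feb352dU;
    x ^= x >> 15; x *= 0x846ca68bU;
    x ^= x >> 16;
    return x;
}

// FM seed: set the bit at the position of the lowest set bit of the hash,
// which lands on bit i with probability 2^-(i+1).
inline uint32_t fmSeed(uint32_t vertexId) {
    uint32_t h = hash32(vertexId);
    int bit = (h == 0) ? 31 : __builtin_ctz(h);   // GCC/Clang intrinsic
    return 1u << bit;
}

// Per-superstep update: a vertex merges the mask received from a neighbor,
// i.e., takes the union of the reachable sets the two masks summarize.
inline uint32_t fmMerge(uint32_t myMask, uint32_t neighborMask) {
    return myMask | neighborMask;
}

// The estimated neighborhood size is roughly 2^r / 0.77351, where r is the
// position of the lowest zero bit of the mask (standard FM estimator).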

Tuning experiments

We want to tune two parameters:
- the number of Shards per PE, and
- the size of the message buffer (i.e., the number of messages in the buffer).

Number of Shards per PE

[Figure: runtime (s) vs. number of Shards per PE. Approximate diameter on a graph of sheet metal forming (0.5M vertices, 8.5M edges).]

All subsequent experiments use one Shard per PE.

Size of message buffer

[Figure: runtime (s) vs. message buffer size for Single Source Shortest Path, Approximate Diameter, and Betweenness Centrality, on a graph of sheet metal forming (0.5M vertices, 8.5M edges).]

In the following experiments, we use a buffer size of 64 for SSSP, 128 for Approximate Diameter, and 32 for Betweenness Centrality.

Preliminary data for strong scalability

We examine three undirected graphs from the Stanford Large Network Dataset Collection (SNAP) [10]:
- as-skitter: Internet topology graph from traceroutes run daily in 2005; 1.7M vertices and 11M edges
- roadnet-pa: road network of Pennsylvania; 1.1M vertices and 1.5M edges
- com-youtube: Youtube online social network; 1.1M vertices and 3M edges

We compare our framework to GraphLab [11], a state-of-the-art graph processing framework originally developed at CMU.

[10] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data. June 2014.
[11] Yucheng Low et al. "Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud." Proc. VLDB Endow. 5.8 (Apr. 2012), pp. 716-727. issn: 2150-8097. doi: 10.14778/2212351.2212354.

Strong scalability of single source shortest path (SSSP)

[Figure: runtime (s) vs. number of cores for CoolName++ and GraphLab on as-skitter, roadnet-pa, and com-youtube.]

Strong scalability of approximate diameter

[Figure: runtime (s) vs. number of cores for CoolName++ and GraphLab on as-skitter, roadnet-pa, and com-youtube.]

Strong scalability of betweenness centrality

[Figure: runtime (s) vs. number of cores on as-skitter, roadnet-pa, and com-youtube.]

Conclusions

We...
- implemented a scalable vertex-centric framework on Charm++.
- implemented three applications using our framework.
- get promising preliminary results in comparison to GraphLab.
- hope to test on larger graphs and a greater number of compute cores.

Future work
- Parallel I/O
- Vectorization of the compute function
- Aggregators (e.g., global variables computed across vertices)
- Graph mutability
  - Vertex addition/deletion
  - Edge addition/deletion
  - Edge contraction (message redirection)