USING DYNAMOGRAPH: APPLICATION SCENARIOS FOR LARGE-SCALE TEMPORAL GRAPH PROCESSING

Similar documents
PREGEL: A SYSTEM FOR LARGE- SCALE GRAPH PROCESSING

PREGEL: A SYSTEM FOR LARGE-SCALE GRAPH PROCESSING

Authors: Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, L., Leiser, N., Czjkowski, G.

PREGEL AND GIRAPH. Why Pregel? Processing large graph problems is challenging Options

I ++ Mapreduce: Incremental Mapreduce for Mining the Big Data

Pregel: A System for Large-Scale Graph Proces sing

Memory-Optimized Distributed Graph Processing. through Novel Compression Techniques

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

[CoolName++]: A Graph Processing Framework for Charm++

COSC 6339 Big Data Analytics. Graph Algorithms and Apache Giraph

CS 347 Parallel and Distributed Data Processing

One Trillion Edges. Graph processing at Facebook scale

CS 347 Parallel and Distributed Data Processing

Scalable, Secure Analysis of Social Sciences Data on the Azure Platform

Pregel: A System for Large- Scale Graph Processing. Written by G. Malewicz et al. at SIGMOD 2010 Presented by Chris Bunch Tuesday, October 12, 2010

Big Graph Processing. Fenggang Wu Nov. 6, 2016

Biology, Physics, Mathematics, Sociology, Engineering, Computer Science, Etc

CSE 701: LARGE-SCALE GRAPH MINING. A. Erdem Sariyuce

Large-Scale Graph Processing 1: Pregel & Apache Hama Shiow-yang Wu ( 吳秀陽 ) CSIE, NDHU, Taiwan, ROC

745: Advanced Database Systems

Scaling Without Sharding. Baron Schwartz Percona Inc Surge 2010

An Exploratory Journey Into Network Analysis A Gentle Introduction to Network Science and Graph Visualization

King Abdullah University of Science and Technology. CS348: Cloud Computing. Large-Scale Graph Processing

CE4031 and CZ4031 Database System Principles

Digree: Building A Distributed Graph Processing Engine out of Single-node Graph Database Installations

Data Analytics on RAMCloud

Pregel. Ali Shah

Lecture 8, 4/22/2015. Scribed by Orren Karniol-Tambour, Hao Yi Ong, Swaroop Ramaswamy, and William Song.

Axibase Time-Series Database. Non-relational database for storing and analyzing large volumes of metrics collected at high-frequency

Overview of Web Mining Techniques and its Application towards Web

Graph Exploitation Testbed

DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li

Socrates: A System for Scalable Graph Analytics C. Savkli, R. Carr, M. Chapman, B. Chee, D. Minch

Acolyte: An In-Memory Social Network Query System

Fine-grained Data Partitioning Framework for Distributed Database Systems

arxiv: v1 [cs.db] 22 Feb 2013

Data Infrastructure at LinkedIn. Shirshanka Das XLDB 2011

August 23, 2017 Revision 0.3. Building IoT Applications with GridDB

SCALABLE LOCAL COMMUNITY DETECTION WITH MAPREDUCE FOR LARGE NETWORKS

Report. X-Stream Edge-centric Graph processing

REAL-TIME ANALYTICS WITH APACHE STORM

Automatic Scaling Iterative Computations. Aug. 7 th, 2012

Big Data Analytics Influx of data pertaining to the 4Vs, i.e. Volume, Veracity, Velocity and Variety

G(B)enchmark GraphBench: Towards a Universal Graph Benchmark. Khaled Ammar M. Tamer Özsu

DFA-G: A Unified Programming Model for Vertex-centric Parallel Graph Processing

Cloud platforms. T Mobile Systems Programming

Optimizing Memory Performance for FPGA Implementation of PageRank

Multimedia Streaming. Mike Zink

FPGP: Graph Processing Framework on FPGA

Research challenges in data-intensive computing The Stratosphere Project Apache Flink

Massive Data Analysis

Detecting and Analyzing Communities in Social Network Graphs for Targeted Marketing

Holistic Analysis of Multi-Source, Multi- Feature Data: Modeling and Computation Challenges

Behavioral Simulations in MapReduce

Graph Database and Analytics in a GPU- Accelerated Cloud Offering

Security analytics: From data to action Visual and analytical approaches to detecting modern adversaries

SCALABLE KNOWLEDGE BASED AGGREGATION OF COLLECTIVE BEHAVIOR

CE4031 and CZ4031 Database System Principles

AN INTRODUCTION TO GRAPH COMPRESSION TECHNIQUES FOR IN-MEMORY GRAPH COMPUTATION

Epilog: Further Topics

Scalable Streaming Analytics

Accessing Web Archives

Europeana DSI 2 Access to Digital Resources of European Heritage

Distributed Graph Storage. Veronika Molnár, UZH

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

WOOster: A Map-Reduce based Platform for Graph Mining

Modeling Dynamic Behavior in Large Evolving Graphs

A Relaxed Consistency based DSM for Asynchronous Parallelism. Keval Vora, Sai Charan Koduru, Rajiv Gupta

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Jure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah

Introduction to Parallel Computing

Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets

Exploring the Structure of Data at Scale. Rudy Agovic, PhD CEO & Chief Data Scientist at Reliancy January 16, 2019

A Parallel Community Detection Algorithm for Big Social Networks

COSC 6339 Big Data Analytics. Introduction to Spark. Edgar Gabriel Fall What is SPARK?

Massive Scalability With InterSystems IRIS Data Platform

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Achieving Horizontal Scalability. Alain Houf Sales Engineer

ANNUAL REPORT Visit us at project.eu Supported by. Mission

Demystifying the Cloud With a Look at Hybrid Hosting and OpenStack

Lecture 12 DATA ANALYTICS ON WEB SCALE

Apache Giraph: Facebook-scale graph processing infrastructure. 3/31/2014 Avery Ching, Facebook GDM

LazyBase: Trading freshness and performance in a scalable database

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

BUBBLE RAP: Social-Based Forwarding in Delay-Tolerant Networks

AT&T Flow Designer. Current Environment

An overview of Graph Categories and Graph Primitives

Qlik Sense Performance Benchmark

Scalable and Robust Management of Dynamic Graph Data

Towards Large-Scale Graph Stream Processing Platform

modern database systems lecture 10 : large-scale graph processing

Giraph Unchained: Barrierless Asynchronous Parallel Execution in Pregel-like Graph Processing Systems

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

Introduction to Google Cloud Platform

Druid Power Interactive Applications at Scale. Jonathan Wei Software Engineer

Giraph: Large-scale graph processing infrastructure on Hadoop. Qu Zhi

User Control Mechanisms for Privacy Protection Should Go Hand in Hand with Privacy-Consequence Information: The Case of Smartphone Apps

Implementing Graph Transformations in the Bulk Synchronous Parallel Model

ArcLink: Optimization techniques to build and retrieve the Temporal Web Graph

Transcription:

USING DYNAMOGRAPH: APPLICATION SCENARIOS FOR LARGE-SCALE TEMPORAL GRAPH PROCESSING Matthias Steinbauer, Gabriele Anderst-Kotsis Institute of Telecooperation

TALK OUTLINE Introduction and Motivation Preliminaries on temporal graphs DynamoGraph a platform for large-scale temporal graph processing How users visit the web Real-world Social Networks The Global Social Learning Network Conclusions

INTRODUCTION / MOTIVATION Graphs serve as models for real world structures in many different disciplines Social sciences > Social networks Biology > Protein-Protein Interactions Cartography > Digital road maps Web > The web graph Graph and network models are very well studied in mathematics and related disciplines, and computer science So why is there need for new research in this area?

REAL WORLD GRAPHS OFTEN GROW TO LARGE SCALES Exceed memory size of single computer require new tactics for visualisation It is not feasible to process large graphs with traditional tools and algorithms

THE DIMENSION OF TIME CANNOT BE NEGLECTED Static views on graphs show often show blurred or too dense data Biological processes are time dependent Static reachability measures in social networks do not hold for dynamic networks

TEMPORAL GRAPHS Graph G is a pair (V, E) where V denotes the set of vertices and E denotes the set of edges between any v, e V A temporal graph T can be given as a set of graphs T = {G 1, G2, G3,, Gt} where each Gx = (Vx, Ex) Gx is called a static snapshot at time x And Gtm..tn as a selection of multiple Gx from T is a static snapshot for a timespan F. Harary and G. Gupta. Dynamic Graph Models. Mathl. Comput. Modelling, 25(7):79 87, 1997.

VERTEX CENTRIC EMBODIMENT { } id: 39827736, name: 'Rob Henderson', description: '', inedges: [ { weight: 7.3, edgetype: 'PHONE', source: 39761932, target: 39827736, } ], outedges: [ { weight: 10.0, edgetype: 'EMAIL', source: 39827736, target: 39761932, } ],

VERTEX CENTRIC EMBODIMENT WITH TEMPORAL DOCUMENT { id: 39827736, resolution: 'MONTHS', '1420070400': { name: 'Rob Henderson', description: '', inedges: [ { weight: 3.3, edgetype: 'PHONE', source: 39761932, target: 39827736, } ], outedges: [ { weight: 4.0, edgetype: 'EMAIL', source: 39827736, target: 39761932, } ], }, '1422748800': { inedges: [ { weight: 4.0, edgetype: 'PHONE',

HORIZONTAL SCALABILITY 5 2 host1 host2 host3

PREGEL / COMPUTE AGGREGATE BROADCAST Master Worker 1 Worker 2 Worker 3 Worker 4 Initialisation Local Computation Message Routing Syncronisation Local Computation Message Routing Syncronisation Local Computation Message Routing Halting Step 1 Step 2 Step 3 G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A System for Large- Scale Graph Processing. In ACM SIGMOD International Conference on Management of Data, 2010.

DYNAMOGRAPH ARCHITECTURE Client Host 0 (Master) Host 1 (Worker) Host n (Worker) ClientApp MasterProcess WorkerProcess / Slots WorkerProcess / Slots Client API CodeManager StepExecutor StepExecutor SuperStepManager MessageQueue MessageQueue PartitionManager Partition Partition ZooKeeper Cassandra Node Cassandra Node Public Communication Network Private Communication Network

HOW USERS VISIT THE WEB Center for Complex Networks and Systems Research at Indiana University Bloomington collected the Click Dataset 53.5 billion HTTP headers logged from Sept. 2006 to May 2010 Available as a ~2.8 TB compressed (~12 TB uncompressed) Imported to DynamoGraph in monthly resolution retained only human generated traffic created vertices based on domain names (3.4 million after cleansing) http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset/

HOW USERS VISIT THE WEB Click temporal graph was successfully loaded into DynamoGraph 10 compute nodes with 8 threads (graph partitions) each On top of an OpenStack cloud Currently experiments with distributed PageRank are conducted PageRank is computed in sliding windows of 6 to 3 months It is expected that popularity trends in web-sites, ad-networks, and social networks are visible in the data First look into the data clearly shows the drop of popularity for the online social network MySpace.com

REAL WORLD SOCIAL NETWORKS Users interacting with each other (online or in real-life) form a social network Much research from diverse disciplines (sociology, mathematics, computer science, reality mining, ) was already conducted in this area Social networks can be modelled as temporal graphs Many social interactions are now happening online > easy to track and record In collaboration with Ecker (iiwas 2015) data from Internet Relay Chat (IRC) was logged, annotated and imported to DynamoGraph

full time-span

selected time-span 1 st June 2008 15 th June 2008

REAL WORLD SOCIAL NETWORKS Data is available as online resource Users can filter and load data into DynamoGraph Algorithms for automatic layout of the visualisation are available (ForceAtlas2) In visualisation it is already clear that reachability measures computed on the full time-span will often not hold on shorter timespans How is information dissemination influenced by that fact? Can we perform cluster analysis on users and identify topics of interest?

THE GLOBAL SOCIAL LEARNING NETWORK

THE GLOBAL SOCIAL LEARNING NETWORK Scrapers LMS xapi Proxy xapi Repository DynamoGraph Experience API (xapi) used to store learning logs independent of a certain Learning Management System (LMS) Scrapers available that convert data from other systems An xapi proxy service is available that can ingest xapi statements into DynamoGraph Setting up experiments with real students

CONCLUSIONS AND FUTURE WORK DynamoGraph as a platform for large-scale temporal graph processing has matured enough to be evaluated in scientific scenarios Three scenarios were discussed that motivate the need for temporal graph analytics and layout the paths for our future work More temporal graph algorithms are to be implemented (reachability, clustering, ) to provide more interesting metrics to our users

Matthias Steinbauer matthias.steinbauer@jku.at slides available at http://steinbauer.org/