A Hierarchical Synchronous Parallel Model for Wide-Area Graph Analytics
Shuhao Liu*, Li Chen, Baochun Li, Aiden Carnegie
University of Toronto
April 17, 2018
Graph Analytics
- What is graph analytics?
- Examples: Shortest Path, PageRank
Graph Analytics
- What kind of graph are we interested in?
- Large
- Geo-distributed
Analyze Large Data
- Process it in parallel (Spark, Hadoop)
- Parallel graph analytics frameworks
  - State of the art: Gemini [OSDI '16], etc.
  - Assume: data is accessible via a fast network; a cluster of well-connected workers
Analyze Geo-Distributed Data
- Reduce WAN traffic
- Wide-area data analytics
  - Placement: Iridium [SIGCOMM '15], etc.
  - Workload: Clarinet [OSDI '16], Gaia [NSDI '17]
Analyze Geo-Distributed Graph
- Reduce WAN traffic
- Wide-area graph analytics
  - Placement: Mayer et al. [ICDCS '16], Zhou et al. [ICDCS '17]
  - Workload: ?
Can we optimize the workload with awareness of WAN transfers?
Motivating Example
- Connected Components: BSP in two regions
- WAN transfer: 3, 2, [remaining values missing]
Motivating Example
- Connected Components: sync locally -> globally
- WAN transfer: 1
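The contrast in the motivating example can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the six-vertex path graph, the region assignment, the function names `bsp_cc` and `local_first_cc`, and the way WAN messages are counted (one per label sent over a cross-region edge) are all assumptions made for the sketch. Connected components are computed by minimum-label propagation.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def bsp_cc(vertices, edges, region):
    """Plain BSP label propagation; every superstep may cross the WAN."""
    adj = build_adjacency(edges)
    label = {v: v for v in vertices}
    active = set(vertices)
    wan = 0
    while active:
        inbox = defaultdict(list)
        for u in active:
            for v in adj[u]:
                inbox[v].append(label[u])
                if region[u] != region[v]:
                    wan += 1  # message crosses regions
        active = set()
        for v, msgs in inbox.items():
            smallest = min(msgs)
            if smallest < label[v]:
                label[v] = smallest
                active.add(v)
    return label, wan


def local_first_cc(vertices, edges, region):
    """Hypothetical local-then-global schedule (assumed, HSP-like)."""
    label = {v: v for v in vertices}
    last_sent = {}  # (sender, receiver) -> label last sent over the WAN
    wan = 0
    while True:
        # Local mode: propagate along intra-region edges until nothing changes.
        changed = True
        while changed:
            changed = False
            for u, v in edges:
                if region[u] == region[v] and label[u] != label[v]:
                    label[u] = label[v] = min(label[u], label[v])
                    changed = True
        # Global mode: one exchange over cross-region edges,
        # sending a label only if it differs from the one sent before.
        inbox = defaultdict(list)
        for u, v in edges:
            if region[u] != region[v]:
                for a, b in ((u, v), (v, u)):
                    if last_sent.get((a, b)) != label[a]:
                        last_sent[(a, b)] = label[a]
                        inbox[b].append(label[a])
                        wan += 1
        updated = False
        for v, msgs in inbox.items():
            smallest = min(msgs)
            if smallest < label[v]:
                label[v] = smallest
                updated = True
        if not updated:
            return label, wan


# Assumed toy input: a path 0-1-2-3-4-5 split across regions A and B.
vertices = list(range(6))
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
region = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

bsp_labels, bsp_wan = bsp_cc(vertices, edges, region)
hsp_labels, hsp_wan = local_first_cc(vertices, edges, region)
print("BSP WAN messages:", bsp_wan)
print("Local-first WAN messages:", hsp_wan)
```

On this toy path both schedules produce the same labels, and the local-first schedule sends fewer cross-region messages because each region settles its own labels before anything is exchanged.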
HSP: Design
- Two modes of synchronization: local + global
- Local synchronization: keep mirrored vertices static and compute locally, without WAN communication
- Global synchronization
HSP: Analysis
- Convergence: we have proven that HSP has the same convergence guarantee as BSP
- WAN traffic / # global synchronizations: we have proven that HSP has a much higher rate of convergence with the same amount of WAN traffic (if BSP converges linearly or super-linearly)
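For reference, the standard textbook meaning of "linearly" and "super-linearly" can be written as below. The notation is assumed here and does not come from the slides: $x^{(k)}$ is the iterate after $k$ synchronizations, $x^*$ is the fixed point, and $\mu$ is the asymptotic contraction ratio.

```latex
\lim_{k \to \infty}
  \frac{\lVert x^{(k+1)} - x^* \rVert}{\lVert x^{(k)} - x^* \rVert} = \mu,
\qquad
\begin{cases}
  0 < \mu < 1 & \text{linear convergence} \\
  \mu = 0     & \text{super-linear convergence}
\end{cases}
```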
Synchronization Models
[Figure: synchronization models placed on two axes, Convergence and Volume of WAN Traffic. Models shown: BSP, GraphLab, Graph++, GraphUC, and HSP.]
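Proof-of-Concept
- PageRank example across two datacenters, DC A and DC B
[Figure: convergence of the PageRank vector $x^{(k)}$ (a norm, on the y-axis) versus # Global Synchronizations (x-axis) for BSP and HSP, with local updates and global updates marked.]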
Implementation in GraphX
[Flowchart. Main flow: Input Graph, Graph Partition, Launch n DC Manager Threads, Join n Threads, Output Graph, with a shared Accumulator c = 0. Inside each DC Manager Thread: Mode Select (Local / Global); 1 Local Update; 1 Global Update; decisions "# updates = diameter?", "Converged?", and "c > n"; accumulator updates c = c + 1 and c = c + n.]
Experimental Results
- Monetary cost
HSP: Takeaways
- HSP = BSP + local mode
- WAN efficiency: faster && cheaper
- Correctness: strong convergence guarantee
- Transparency: independent of applications
Thanks! Q & A
HSP: Design
- Switch local -> global:
  - when all local partitions have run d iterations
  - when a local partition has already converged
- Run one global iteration, then switch back to local mode
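The switching rule on this slide can be written as a small predicate. This is only a sketch under stated assumptions: the names `PartitionState` and `should_switch_to_global` are hypothetical, the two conditions are combined with a logical "or", and the second condition is read as "at least one local partition has converged"; the slide does not spell out either choice.

```python
from dataclasses import dataclass


@dataclass
class PartitionState:
    local_iterations: int  # local iterations run since the last global sync
    converged: bool        # True once local updates no longer change anything


def should_switch_to_global(partitions, d):
    """Return True when the (assumed) local -> global conditions hold."""
    # Condition 1: every local partition has run d iterations.
    all_ran_d = all(p.local_iterations >= d for p in partitions)
    # Condition 2 (assumed reading): some local partition has converged.
    some_converged = any(p.converged for p in partitions)
    return all_ran_d or some_converged


# Example: two datacenters, d = 3 (values chosen for illustration only).
still_working = [PartitionState(2, False), PartitionState(3, False)]
budget_spent = [PartitionState(3, False), PartitionState(3, False)]
one_converged = [PartitionState(1, True), PartitionState(2, False)]

print(should_switch_to_global(still_working, d=3))
print(should_switch_to_global(budget_spent, d=3))
print(should_switch_to_global(one_converged, d=3))
```

After the predicate fires, the slide's rule is to run a single global iteration and return to local mode, so a driver loop would reset each partition's `local_iterations` counter at that point.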
Runtime
Experimental Results: Rate of convergence
[Figure: $\lVert x^{(k)} - x^{(k-1)} \rVert$ for BSP and HSP, plotted against (a) # Global Synchronizations and (b) PageRank Execution Time (s).]
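The quantity on the y-axis, the change between consecutive PageRank iterates, can be computed as in the sketch below. It is a plain single-machine power iteration, not the GraphX implementation from the talk; the function name `pagerank_residuals`, the damping factor 0.85, the choice of the L1 norm, and the four-vertex example graph are assumptions made for illustration.

```python
def pagerank_residuals(out_links, damping=0.85, iterations=20):
    """Power-iteration PageRank that records ||x(k) - x(k-1)||_1 per step."""
    nodes = sorted(out_links)
    n = len(nodes)
    x = {v: 1.0 / n for v in nodes}
    residuals = []
    for _ in range(iterations):
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            # A vertex without out-links spreads its rank uniformly.
            targets = out_links[u] or nodes
            share = damping * x[u] / len(targets)
            for v in targets:
                nxt[v] += share
        # L1 distance between consecutive iterates (norm choice is assumed).
        residuals.append(sum(abs(nxt[v] - x[v]) for v in nodes))
        x = nxt
    return x, residuals


# Assumed toy graph: vertex -> list of out-neighbours.
graph = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
ranks, residuals = pagerank_residuals(graph)
for k, r in enumerate(residuals, start=1):
    print(k, r)
```

Plotting `residuals` against the iteration index gives the single-machine analogue of panel (a); in the talk's setting the x-axis counts global synchronizations rather than iterations.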