GraphX. Graph Analytics in Spark. Ankur Dave Graduate Student, UC Berkeley AMPLab. Joint work with Joseph Gonzalez, Reynold Xin, Daniel
|
|
- Barnard Small
- 6 years ago
- Views:
Transcription
1 GraphX Graph nalytics in Spark nkur ave Graduate Student, U Berkeley MPLab Joint work with Joseph Gonzalez, Reynold Xin, aniel U BRKLY rankshaw, Michael Franklin, and Ion Stoica
2 Graphs
3 Social Networks
4 Web Graphs
5 User-Item Graphs
6 Graph lgorithms
7 PageRank
8 Triangle ounting
9 ollaborative Filtering Users Ratings Users x Products f(j) f(i) Products
10 ollaborative Filtering f(3) r 13 f(1) User Factors f(2) r 14 r 24 f(4) Product Factors r 25 f(5) X f[i] = arg min w2r d r ij w T f[j] 2 + w 2 2 j2nbrs(i)
11 The Graph-Parallel Pattern
12 The Graph-Parallel Pattern
13 The Graph-Parallel Pattern
14 Many Graph-Parallel lgorithms ollaborative Filtering» lternating Least Squares» Stochastic Gradient escent» Tensor Factorization ommunity etection» Triangle-ounting» K-core ecomposition» K-Truss Structured Prediction» Loopy Belief Propagation» Max-Product Linear Programs» Gibbs Sampling Graph nalytics» PageRank» Personalized PageRank» Shortest Path» Graph oloring Semi-supervised ML» Graph SSL lassification» Neural Networks» om
15 High-egree Vertices hallenges: 1. Storage: How to store a graph where one vertex s edges don t fit on a machine? 2. PI: How to expose parallelism within vertex neighborhoods?
16 omplex Pipelines Link Table Hyperlinks PageRank Top 20 Pages Title Link Title PR Raw Wikipedia < / > < / > < / > Top ommunities om. PR.. XML ditor ommunity User Table ditor Graph etection ommunity ditor Title User om.
17 omplex Pipelines Solution: mbed graph processing within a tableoriented system (Spark) hallenges: 1. Storage: How to store graphs as tables? 2. omputation: How to express graph ops as table ops (map, reduce, join, etc.)? 3. PI: How to present the two views to the user?
18 The GraphX PI
19 Property Graphs Vertex Property: User Profile urrent PageRank Value dge Property: Weights Relationships Timestamps
20 reating a Graph (Scala) type VertexId = Long Graph val vertices: R[(VertexId, String)] = sc.parallelize(list( (1L, lice ), (2L, Bob ), (3L, harlie ))) 1 coworker lice class dge[]( val srcid: VertexId, val dstid: VertexId, val attr: ) val edges: R[dge[String]] = sc.parallelize(list( dge(1l, 2L, coworker ), dge(2l, 3L, friend ))) 2 friend 3 Bob harlie val graph = Graph(vertices, edges)
21 Graph Operations (Scala) class Graph[V, ] { // Table Views def vertices: R[(VertexId, V)] def edges: R[dge[]] def triplets: R[dgeTriplet[V, ]] // Transformations def mapvertices[v2](f: (VertexId, V) => V2): Graph[V2, ] def mapdges[2](f: dge[] => 2): Graph[V2, ] def reverse: Graph[V, ] def subgraph(epred: dgetriplet[v, ] => Boolean, vpred: (VertexId, V) => Boolean): Graph[V, ] // Joins def outerjoinvertices[u, V2] (tbl: R[(VertexId, U)]) (f: (VertexId, V, Option[U]) => V2): Graph[V2, ] // omputation def aggregatemessages[]( sendmsg: dgeontext[v,, ] => Unit, mergemsg: (, ) => ): R[(VertexId, )]
22 Built-in lgorithms (Scala) } // ontinued from previous slide def pagerank(tol: ouble): Graph[ouble, ouble] def triangleount(): Graph[Int, ] def connectedomponents(): Graph[VertexId, ] //...and more: org.apache.spark.graphx.lib PageRank Triangle ount onnected omponents
23 The triplets view class Graph[V, ] { } def triplets: R[dgeTriplet[V, ]] class dgetriplet[v, ]( val srcid: VertexId, val dstid: VertexId, val attr:, val srcttr: V, val dstttr: V) Graph 1 lice R coworker 2 friend Bob triplets srcttr dstttr attr lice coworker Bob Bob friend harlie 3 harlie
24 The subgraph transformation class Graph[V, ] { def subgraph(epred: dgetriplet[v, ] => Boolean, vpred: (VertexId, V) => Boolean): Graph[V, ] } graph.subgraph(epred = (edge) => edge.attr!= relative ) Graph Graph lice coworker Bob lice coworker Bob relative friend subgraph friend harlie relative avid harlie avid
25 The subgraph transformation class Graph[V, ] { def subgraph(epred: dgetriplet[v, ] => Boolean, vpred: (VertexId, V) => Boolean): Graph[V, ] } graph.subgraph(vpred = (id, name) => name!= Bob ) Graph Graph lice coworker Bob lice relative friend subgraph relative harlie relative avid harlie relative avid
26 omputation with aggregatemessages class Graph[V, ] { def aggregatemessages[]( sendmsg: dgeontext[v,, ] => Unit, mergemsg: (, ) => ): R[(VertexId, )] } class dgeontext[v,, ]( val srcid: VertexId, val dstid: VertexId, val attr:, val srcttr: V, val dstttr: V) { def sendtosrc(msg: ) def sendtost(msg: ) } graph.aggregatemessages( ctx => { ctx.sendtosrc(1) ctx.sendtost(1) }, _ + _)
27 omputation with aggregatemessages R Graph vertex id degree lice relative harlie coworker friend relative Bob avid aggregatemessages lice 2 Bob 2 harlie 3 avid 1
28 xample: Graph oarsening Web Pages Intra-omain Links Pages by omain subgraph onnected omponents Graph constructor omains omain Graph
29 How GraphX Works
30 Storing Graphs as Tables Property Graph Vertex Table dge Table (R) (R) B B F Machine 1 Machine 2 B B F F F
31 Simple Operations Reuse vertices or edges across multiple graphs Input Graph Transformed Graph Vertex dge Vertex dge Table Table Table Table Transform Vertex Properties
32 Implementing triplets Vertex Table Routing Table dge Table (R) (R) (R) B 1 2 Machine 1 Machine 2 B F B F B F F
33 Implementing triplets dge Table (R) B B F F Mirror ache B Mirror ache F Vertex Table (R) B F B F Routing Table (R) B F
34 Reduction in ommunication ue to ached Updates onnected omponents on Twitter Graph (1.4B edges) Network omm. (MB) Most vertices are within 8 hops of all vertices in their component Iteration
35 Implementing aggregatemessages dge Table (R) B B F F Mirror ache B Mirror ache F Scan Result Vertex Table B F B F
36 Speedup ue To ccess Method Selection 30 onnected omponents on Twitter (1.4B edges) Runtime (Seconds) Scan ll dges Scan Indexed 10 5 Index of ctive dges Iteration
37 PageRank Benchmark 2 luster of 16 x m2.4xlarge (8 cores) + 1Gig Twitter Graph (42M Vertices,1.5B dges) UK-Graph (106M Vertices, 3.7B dges) Runtime (Seconds) x x GraphX GraphLab Giraph Naïve GraphX GraphLab Giraph Naïve Spark Spark GraphX performs comparably to state-of-the-art graph processing systems.
38 Open Source lpha release with Spark in Feb 2014 Stable release with Spark in ec 2014
39 Future of GraphX 1. Language support a) Java PI: PR #3234 b) Python PI: collaborating with Intel, SPRK More algorithms a) L (topic modeling): PR #2388 b) orrelation clustering 3. Research a) Local graphs b) Streaming/time-varying graphs c) Graph database like queries
40 IndexedR Support efficient updates to immutable Rs using purely functional data structures
41 Thanks!
GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
GraphX: Unifying ata-parallel and Graph-Parallel nalytics Presented by aniel rankshaw Joint work with Reynold Xin, Joseph Gonzalez, nkur ave, Michael Franklin, and Ion Stoica Spark Meetup 3/25/2014 U BERKELEY
More informationGraphX: Graph Processing in a Distributed Dataflow Framework
GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley MPLab Co-founder, GraphLab Inc. Joint work with Reynold Xin, nkur Dave, Daniel Crankshaw, Michael Franklin,
More informationWhy do we need graph processing?
Why do we need graph processing? Community detection: suggest followers? Determine what products people will like Count how many people are in different communities (polling?) Graphs are Everywhere Group
More informationAnalyzing Flight Data
IBM Analytics Analyzing Flight Data Jeff Carlson Rich Tarro July 21, 2016 2016 IBM Corporation Agenda Spark Overview a quick review Introduction to Graph Processing and Spark GraphX GraphX Overview Demo
More informationTegra Time-evolving Graph Processing on Commodity Clusters
Tegra Time-evolving Graph Processing on ommodity lusters SIM SE 17 2 March 2017 nand Iyer Qifan Pu Joseph Gonzalez Ion Stoica 1 Graphs are everywhere Social Networks 2 Graphs are everywhere Gnutella network
More informationGraphX: Graph Processing in a Distributed Dataflow Framework
GraphX: Graph Processing in a Distributed Dataflow Framework Joseph E. Gonzalez, University of California, Berkeley; Reynold S. Xin, University of California, Berkeley, and Databricks; Ankur Dave, Daniel
More informationSpark Streaming and GraphX
Spark Streaming and GraphX Amir H. Payberah amir@sics.se SICS Swedish ICT Amir H. Payberah (SICS) Spark Streaming and GraphX June 30, 2016 1 / 1 Spark Streaming Amir H. Payberah (SICS) Spark Streaming
More informationThe GraphX Graph Processing System
The GraphX Graph Processing System Daniel Crankshaw Ankur Dave Reynold S. Xin Joseph E. Gonzalez Michael J. Franklin Ion Stoica UC Berkeley AMPLab {crankshaw, ankurd, rxin, jegonzal, franklin, istoica@cs.berkeley.edu
More informationLecture 8, 4/22/2015. Scribed by Orren Karniol-Tambour, Hao Yi Ong, Swaroop Ramaswamy, and William Song.
CME 323: Distributed Algorithms and Optimization, Spring 2015 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Databricks and Stanford. Lecture 8, 4/22/2015. Scribed by Orren Karniol-Tambour, Hao
More informationCloud, Big Data & Linear Algebra
Cloud, Big Data & Linear Algebra Shelly Garion IBM Research -- Haifa 2014 IBM Corporation What is Big Data? 2 Global Data Volume in Exabytes What is Big Data? 2005 2012 2017 3 Global Data Volume in Exabytes
More informationAnalytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation
Analytic Cloud with Shelly Garion IBM Research -- Haifa 2014 IBM Corporation Why Spark? Apache Spark is a fast and general open-source cluster computing engine for big data processing Speed: Spark is capable
More informationSociaLite: A Datalog-based Language for
SociaLite: A Datalog-based Language for Large-Scale Graph Analysis Jiwon Seo M OBIS OCIAL RESEARCH GROUP Overview Overview! SociaLite: language for large-scale graph analysis! Extensions to Datalog! Compiler
More informationFast, Interactive, Language-Integrated Cluster Computing
Spark Fast, Interactive, Language-Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica www.spark-project.org
More informationBig Graph Processing. Fenggang Wu Nov. 6, 2016
Big Graph Processing Fenggang Wu Nov. 6, 2016 Agenda Project Publication Organization Pregel SIGMOD 10 Google PowerGraph OSDI 12 CMU GraphX OSDI 14 UC Berkeley AMPLab PowerLyra EuroSys 15 Shanghai Jiao
More informationLarge Scale Graph Processing Pregel, GraphLab and GraphX
Large Scale Graph Processing Pregel, GraphLab and GraphX Amir H. Payberah amir@sics.se KTH Royal Institute of Technology Amir H. Payberah (KTH) Large Scale Graph Processing 2016/10/03 1 / 76 Amir H. Payberah
More informationAnalysis of Algorithms Prof. Karen Daniels
UMass Lowell omputer Science 91.404 nalysis of lgorithms Prof. Karen aniels Spring, 2013 hapter 22: raph lgorithms & rief Introduction to Shortest Paths [Source: ormen et al. textbook except where noted]
More informationSpark. In- Memory Cluster Computing for Iterative and Interactive Applications
Spark In- Memory Cluster Computing for Iterative and Interactive Applications Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker,
More informationAn Introduction to Apache Spark
An Introduction to Apache Spark 1 History Developed in 2009 at UC Berkeley AMPLab. Open sourced in 2010. Spark becomes one of the largest big-data projects with more 400 contributors in 50+ organizations
More informationWhat is Routing? EE 122: Shortest Path Routing. Example. Internet Routing. Ion Stoica TAs: Junda Liu, DK Moon, David Zats
What is Routing? Routing implements the core function of a network: : Shortest Path Routing Ion Stoica Ts: Junda Liu, K Moon, avid Zats http://inst.eecs.berkeley.edu/~ee/fa9 (Materials with thanks to Vern
More informationGraph-Based Parallel Computing. William Cohen
Graph-Based Parallel Computing William Cohen 1 Announcements Next Tuesday 12/8: Presentations for 10-805 projects. 15 minutes per project. Final written reports due Tues 12/15 2 Graph-Based Parallel Computing
More informationApache Giraph. for applications in Machine Learning & Recommendation Systems. Maria Novartis
Apache Giraph for applications in Machine Learning & Recommendation Systems Maria Stylianou @marsty5 Novartis Züri Machine Learning Meetup #5 June 16, 2014 Apache Giraph for applications in Machine Learning
More informationGraphX: Unifying Data-Parallel and Graph-Parallel Analytics
GraphX: Unifying Data-Parallel and Graph-Parallel Analytics Reynold S. Xin Daniel Crankshaw Ankur Dave Joseph E. Gonzalez Michael J. Franklin Ion Stoica UC Berkeley AMPLab {rxin, crankshaw, ankurd, jegonzal,
More informationGraph Data Management
Graph Data Management Analysis and Optimization of Graph Data Frameworks presented by Fynn Leitow Overview 1) Introduction a) Motivation b) Application for big data 2) Choice of algorithms 3) Choice of
More informationJure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah
Jure Leskovec (@jure) Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah 2 My research group at Stanford: Mining and modeling large social and information networks
More informationBatch & Stream Graph Processing with Apache Flink. Vasia
Batch & Stream Graph Processing with Apache Flink Vasia Kalavri vasia@apache.org @vkalavri Outline Distributed Graph Processing Gelly: Batch Graph Processing with Flink Gelly-Stream: Continuous Graph
More informationProgramming Models and Tools for Distributed Graph Processing
Programming Models and Tools for Distributed Graph Processing Vasia Kalavri kalavriv@inf.ethz.ch 31st British International Conference on Databases 10 July 2017, London, UK ABOUT ME Postdoctoral Fellow
More informationIP Forwarding Computer Networking. Routes from Node A. Graph Model. Lecture 10: Intra-Domain Routing
IP orwarding - omputer Networking Lecture : Intra-omain Routing RIP (Routing Information Protocol) & OSP (Open Shortest Path irst) The Story So ar IP addresses are structure to reflect Internet structure
More informationRESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING
RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin,
More informationApache Spark 2.0. Matei
Apache Spark 2.0 Matei Zaharia @matei_zaharia What is Apache Spark? Open source data processing engine for clusters Generalizes MapReduce model Rich set of APIs and libraries In Scala, Java, Python and
More informationUnifying Big Data Workloads in Apache Spark
Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache
More informationApache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context
1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes
More informationMatrix Computations and " Neural Networks in Spark
Matrix Computations and " Neural Networks in Spark Reza Zadeh Paper: http://arxiv.org/abs/1509.02256 Joint work with many folks on paper. @Reza_Zadeh http://reza-zadeh.com Training Neural Networks Datasets
More informationSpark. In- Memory Cluster Computing for Iterative and Interactive Applications
Spark In- Memory Cluster Computing for Iterative and Interactive Applications Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker,
More informationGraphLab: A New Framework for Parallel Machine Learning
GraphLab: A New Framework for Parallel Machine Learning Yucheng Low, Aapo Kyrola, Carlos Guestrin, Joseph Gonzalez, Danny Bickson, Joe Hellerstein Presented by Guozhang Wang DB Lunch, Nov.8, 2010 Overview
More informationSpark GraphX INTRODUCTION AND OVERVIEW WITH FOLLOW-ALONG GRAPH ANALYSIS EXAMPLES
Spark GraphX INTRODUCTION AND OVERVIEW WITH FOLLOW-ALONG GRAPH ANALYSIS EXAMPLES Notes Example data and scripts available at: All examples will be executed with the Spark-Shell A basic knowledge of Spark
More informationDistributed Machine Learning" on Spark
Distributed Machine Learning" on Spark Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Outline Data flow vs. traditional network programming Spark computing engine Optimization Example Matrix Computations
More informationGraphFrames: An Integrated API for Mixing Graph and Relational Queries
GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur ave, Alekh Jindal, Li rran Li, Reynold Xin, Joseph Gonzalez, Matei Zaharia University of California, erkeley MIT Uber Technologies
More informationMLlib. Distributed Machine Learning on. Evan Sparks. UC Berkeley
MLlib & ML base Distributed Machine Learning on Evan Sparks UC Berkeley January 31st, 2014 Collaborators: Ameet Talwalkar, Xiangrui Meng, Virginia Smith, Xinghao Pan, Shivaram Venkataraman, Matei Zaharia,
More informationGraph-Parallel Problems. ML in the Context of Parallel Architectures
Case Study 4: Collaborative Filtering Graph-Parallel Problems Synchronous v. Asynchronous Computation Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 20 th, 2014
More informationThrill : High-Performance Algorithmic
Thrill : High-Performance Algorithmic Distributed Batch Data Processing in C++ Timo Bingmann, Michael Axtmann, Peter Sanders, Sebastian Schlag, and 6 Students 2016-12-06 INSTITUTE OF THEORETICAL INFORMATICS
More informationShark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker
Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha
More informationMosaic: Processing a Trillion-Edge Graph on a Single Machine
Mosaic: Processing a Trillion-Edge Graph on a Single Machine Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Kumar, Taesoo Kim Georgia Institute of Technology Best Student Paper @ EuroSys
More informationThird Generation Routers
IP orwarding 5-5- omputer Networking 5- Lecture : Routing Peter Steenkiste all www.cs.cmu.edu/~prs/5-- The Story So ar IP addresses are structured to reflect Internet structure IP packet headers carry
More informationDistributed Computing with Spark
Distributed Computing with Spark Reza Zadeh Thanks to Matei Zaharia Outline Data flow vs. traditional network programming Limitations of MapReduce Spark computing engine Numerical computing on Spark Ongoing
More informationBig Data Infrastructures & Technologies
Big Data Infrastructures & Technologies Spark and MLLIB OVERVIEW OF SPARK What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop Improves efficiency through: In-memory
More informationDistributed Graph Storage. Veronika Molnár, UZH
Distributed Graph Storage Veronika Molnár, UZH Overview Graphs and Social Networks Criteria for Graph Processing Systems Current Systems Storage Computation Large scale systems Comparison / Best systems
More informationL10 Graphs. Alice E. Fischer. April Alice E. Fischer L10 Graphs... 1/37 April / 37
L10 Graphs lice. Fischer pril 2016 lice. Fischer L10 Graphs... 1/37 pril 2016 1 / 37 Outline 1 Graphs efinition Graph pplications Graph Representations 2 Graph Implementation 3 Graph lgorithms Sorting
More informationBig Data Infrastructures & Technologies Hadoop Streaming Revisit.
Big Data Infrastructures & Technologies Hadoop Streaming Revisit ENRON Mapper ENRON Mapper Output (Excerpt) acomnes@enron.com blake.walker@enron.com edward.snowden@cia.gov alex.berenson@nyt.com ENRON Reducer
More informationCase Study 4: Collaborative Filtering. GraphLab
Case Study 4: Collaborative Filtering GraphLab Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin March 14 th, 2013 Carlos Guestrin 2013 1 Social Media
More informationCluster Computing Architecture. Intel Labs
Intel Labs Legal Notices INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED
More informationResilient Distributed Datasets
Resilient Distributed Datasets A Fault- Tolerant Abstraction for In- Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin,
More information3.1 Basic Definitions and Applications
Graphs hapter hapter Graphs. Basic efinitions and Applications Graph. G = (V, ) n V = nodes. n = edges between pairs of nodes. n aptures pairwise relationship between objects: Undirected graph represents
More informationOrder or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations
Order or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations George M. Slota 1 Sivasankaran Rajamanickam 2 Kamesh Madduri 3 1 Rensselaer Polytechnic Institute, 2 Sandia National
More informationGridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. Xiaowei ZHU Tsinghua University
GridGraph: Large-Scale Graph Processing on a Single Machine Using -Level Hierarchical Partitioning Xiaowei ZHU Tsinghua University Widely-Used Graph Processing Shared memory Single-node & in-memory Ligra,
More informationIntegration of Machine Learning Library in Apache Apex
Integration of Machine Learning Library in Apache Apex Anurag Wagh, Krushika Tapedia, Harsh Pathak Vishwakarma Institute of Information Technology, Pune, India Abstract- Machine Learning is a type of artificial
More informationLightning Fast Cluster Computing. Michael Armbrust Reflections Projections 2015 Michast
Lightning Fast Cluster Computing Michael Armbrust - @michaelarmbrust Reflections Projections 2015 Michast What is Apache? 2 What is Apache? Fast and general computing engine for clusters created by students
More informationRouting. Effect of Routing in Flow Control. Relevant Graph Terms. Effect of Routing Path on Flow Control. Effect of Routing Path on Flow Control
Routing Third Topic of the course Read chapter of the text Read chapter of the reference Main functions of routing system Selection of routes between the origin/source-destination pairs nsure that the
More informationExecution Tracing and Profiling
lass 8 Questions/comments iscussion of academic honesty, GT Honor ode fficient path profiling inal project presentations: ec, 3; 4:35-6:45 ssign (see Schedule for links) Problem Set 7 discuss Readings
More informationResearch challenges in data-intensive computing The Stratosphere Project Apache Flink
Research challenges in data-intensive computing The Stratosphere Project Apache Flink Seif Haridi KTH/SICS haridi@kth.se e2e-clouds.org Presented by: Seif Haridi May 2014 Research Areas Data-intensive
More informationDistributed Computing with Spark and MapReduce
Distributed Computing with Spark and MapReduce Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Traditional Network Programming Message-passing between nodes (e.g. MPI) Very difficult to do at scale:» How
More informationTwitter data Analytics using Distributed Computing
Twitter data Analytics using Distributed Computing Uma Narayanan Athrira Unnikrishnan Dr. Varghese Paul Dr. Shelbi Joseph Research Scholar M.tech Student Professor Assistant Professor Dept. of IT, SOE
More informationA Hierarchical Synchronous Parallel Model for Wide-Area Graph Analytics
A Hierarchical Synchronous Parallel Model for Wide-Area Graph Analytics Shuhao Liu*, Li Chen, Baochun Li, Aiden Carnegie University of Toronto April 17, 2018 Graph Analytics What is Graph Analytics? 2
More informationCS 347 Parallel and Distributed Data Processing
CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 14: Distributed Graph Processing Motivation Many applications require graph processing E.g., PageRank Some graph data sets are very large
More informationGraph Theory. Network Science: Graph theory. Graph theory Terminology and notation. Graph theory Graph visualization
Network Science: Graph Theory Ozalp abaoglu ipartimento di Informatica Scienza e Ingegneria Università di ologna www.cs.unibo.it/babaoglu/ ranch of mathematics for the study of structures called graphs
More informationCS 347 Parallel and Distributed Data Processing
CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 14: Distributed Graph Processing Motivation Many applications require graph processing E.g., PageRank Some graph data sets are very large
More informationGraphs / Networks. CSE 6242/ CX 4242 Feb 18, Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech
CSE 6242/ CX 4242 Feb 18, 2014 Graphs / Networks Centrality measures, algorithms, interactive applications Duen Horng (Polo) Chau Georgia Tech Partly based on materials by Professors Guy Lebanon, Jeffrey
More informationASAP: Fast, Approximate Graph Pattern Mining at Scale
ASAP: Fast, Approximate Graph Pattern Mining at Scale Anand Padmanabha Iyer, UC Berkeley; Zaoxing Liu and Xin Jin, Johns Hopkins University; Shivaram Venkataraman, Microsoft Research / University of Wisconsin;
More informationCS435 Introduction to Big Data FALL 2018 Colorado State University. 9/24/2018 Week 6-A Sangmi Lee Pallickara. Topics. This material is built based on,
FLL 218 olorado State University 9/24/218 Week 6-9/24/218 S435 Introduction to ig ata - FLL 218 W6... PRT 1. LRGE SLE T NLYTIS WE-SLE LINK N SOIL NETWORK NLYSIS omputer Science, olorado State University
More informationStream Processing on IoT Devices using Calvin Framework
Stream Processing on IoT Devices using Calvin Framework by Ameya Nayak A Project Report Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science Supervised
More informationBig data systems 12/8/17
Big data systems 12/8/17 Today Basic architecture Two levels of scheduling Spark overview Basic architecture Cluster Manager Cluster Cluster Manager 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores
More informationCS Spark. Slides from Matei Zaharia and Databricks
CS 5450 Spark Slides from Matei Zaharia and Databricks Goals uextend the MapReduce model to better support two common classes of analytics apps Iterative algorithms (machine learning, graphs) Interactive
More information2/26/2017. Originally developed at the University of California - Berkeley's AMPLab
Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second
More informationJoseph hgonzalez. A New Parallel Framework for Machine Learning. Joint work with. Yucheng Low. Aapo Kyrola. Carlos Guestrin.
A New Parallel Framework for Machine Learning Joseph hgonzalez Joint work with Yucheng Low Aapo Kyrola Danny Bickson Carlos Guestrin Alex Smola Guy Blelloch Joe Hellerstein David O Hallaron Carnegie Mellon
More informationToday s content. Resilient Distributed Datasets(RDDs) Spark and its data model
Today s content Resilient Distributed Datasets(RDDs) ------ Spark and its data model Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing -- Spark By Matei Zaharia,
More informationLet s focus on clarifying questions. More Routing. Logic Refresher. Warning. Short Summary of Course. 10 Years from Now.
Let s focus on clarifying questions I love the degree of interaction in this year s class More Routing all Scott Shenker http://inst.eecs.berkeley.edu/~ee/ Materials with thanks to Jennifer Rexford, Ion
More informationHarp-DAAL for High Performance Big Data Computing
Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big
More informationGraph Analytics using Vertica Relational Database
Graph Analytics using ertica Relational Database Alekh Jindal* Samuel Madden Malú Castellanos Meichun Hsu Microsoft MIT ertica ertica * work done while at MIT Motivation for graphs on DB Data anyways in
More informationIntroduction. Example
OMS0 Introduction Priority queues and ijkstra s algorithm In this lecture we will discuss ijkstra s algorithm, a more efficient way of solving the single-source shortest path problem. epartment of omputer
More information15-388/688 - Practical Data Science: Big data and MapReduce. J. Zico Kolter Carnegie Mellon University Spring 2018
15-388/688 - Practical Data Science: Big data and MapReduce J. Zico Kolter Carnegie Mellon University Spring 2018 1 Outline Big data Some context in distributed computing map + reduce MapReduce MapReduce
More informationUndirected graph is a special case of a directed graph, with symmetric edges
S-6S- ijkstra s lgorithm -: omputing iven a directed weighted graph (all weights non-negative) and two vertices x and y, find the least-cost path from x to y in. Undirected graph is a special case of a
More informationCS535 Big Data Fall 2017 Colorado State University 9/5/2017. Week 3 - A. FAQs. This material is built based on,
S535 ig ata Fall 217 olorado State University 9/5/217 Week 3-9/5/217 S535 ig ata - Fall 217 Week 3--1 S535 IG T FQs Programming ssignment 1 We will discuss link analysis in week3 Installation/configuration
More informationHomework 4 Solutions
CS3510 Design & Analysis of Algorithms Section A Homework 4 Solutions Uploaded 4:00pm on Dec 6, 2017 Due: Monday Dec 4, 2017 This homework has a total of 3 problems on 4 pages. Solutions should be submitted
More informationDistributed Graph Algorithms
Distributed Graph Algorithms Alessio Guerrieri University of Trento, Italy 2016/04/26 This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Contents 1 Introduction
More informationECE 158A: Lecture 5. Fall 2015
8: Lecture Fall 0 Routing ()! Location-ased ddressing Recall from Lecture that routers maintain routing tables to forward packets based on their IP addresses To allow scalability, IP addresses are assigned
More informationQuery Optimization of Distributed Pattern Matching
Query Optimization of istributed Pattern Matching Jiewen Huang, Kartik Venkatraman, aniel J. badi Yale University jiewen.huang@yale.edu, kartik.venkatraman@yale.edu, dna@cs.yale.edu bstract Greedy algorithms
More informationCSE 373: Data Structures and Algorithms. Graphs. Autumn Shrirang (Shri) Mare
SE 373: Data Structures and lgorithms Graphs utumn 2018 Shrirang (Shri) Mare shri@cs.washington.edu Thanks to Kasey hampion, en Jones, dam lank, Michael Lee, Evan Mcarty, Robbie Weber, Whitaker rand, Zora
More informationWeek 9 Student Responsibilities. Mat Example: Minimal Spanning Tree. 3.3 Spanning Trees. Prim s Minimal Spanning Tree.
Week 9 Student Responsibilities Reading: hapter 3.3 3. (Tucker),..5 (Rosen) Mat 3770 Spring 01 Homework Due date Tucker Rosen 3/1 3..3 3/1 DS & S Worksheets 3/6 3.3.,.5 3/8 Heapify worksheet ttendance
More informationDATA SCIENCE USING SPARK: AN INTRODUCTION
DATA SCIENCE USING SPARK: AN INTRODUCTION TOPICS COVERED Introduction to Spark Getting Started with Spark Programming in Spark Data Science with Spark What next? 2 DATA SCIENCE PROCESS Exploratory Data
More informationDell In-Memory Appliance for Cloudera Enterprise
Dell In-Memory Appliance for Cloudera Enterprise Spark Technology Overview and Streaming Workload Use Cases Author: Armando Acosta Hadoop Product Manager/Subject Matter Expert Armando_Acosta@Dell.com/
More informationAdjacency Matrix Undirected Graph. » Matrix [ p, q ] is 1 (True) if < p, q > is an edge in the graph. Graphs Representation & Paths
djacency Matrix Undirected Graph Row and column indices are vertex numbers Graphs Representation & Paths Mixture of Chapters 9 & 0» Matrix [ p, q ] is (True) if < p, q > is an edge in the graph if < p,
More informationBSP, Pregel and the need for Graph Processing
BSP, Pregel and the need for Graph Processing Patrizio Dazzi, HPC Lab ISTI - CNR mail: patrizio.dazzi@isti.cnr.it web: http://hpc.isti.cnr.it/~dazzi/ National Research Council of Italy A need for Graph
More informationCPL 2016, week 10. Clojure functional core. Oleg Batrashev. April 11, Institute of Computer Science, Tartu, Estonia
CPL 2016, week 10 Clojure functional core Oleg Batrashev Institute of Computer Science, Tartu, Estonia April 11, 2016 Overview Today Clojure language core Next weeks Immutable data structures Clojure simple
More informationIP Forwarding Computer Networking. Graph Model. Routes from Node A. Lecture 11: Intra-Domain Routing
IP Forwarding 5-44 omputer Networking Lecture : Intra-omain Routing RIP (Routing Information Protocol) & OSPF (Open Shortest Path First) The Story So Far IP addresses are structured to reflect Internet
More informationScaled Machine Learning at Matroid
Scaled Machine Learning at Matroid Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Machine Learning Pipeline Learning Algorithm Replicate model Data Trained Model Serve Model Repeat entire pipeline Scaling
More informationALGORITHMS FOR DECISION SUPPORT
GORIHM FOR IION UOR ayesian networks Guest ecturer: ilja Renooij hanks to hor Whalen or kindly contributing to these slides robabilistic Independence onditional Independence onditional Independence hain
More informationMapReduce review. Spark and distributed data processing. Who am I? Today s Talk. Reynold Xin
Who am I? Reynold Xin Stanford CS347 Guest Lecture Spark and distributed data processing PMC member, Apache Spark Cofounder & Chief Architect, Databricks PhD on leave (ABD), UC Berkeley AMPLab Reynold
More informationDigraphs ( 12.4) Directed Graphs. Digraph Application. Digraph Properties. A digraph is a graph whose edges are all directed.
igraphs ( 12.4) irected Graphs OR OS digraph is a graph whose edges are all directed Short for directed graph pplications one- way streets flights task scheduling irected Graphs 1 irected Graphs 2 igraph
More informationarxiv: v1 [cs.db] 11 Jan 2017
SPARQL over GraphX Besat Kassaie Cheriton School of Computer Science, University of Waterloo bkassaie@uwaterloo.ca arxiv:1701.03091v1 [cs.db] 11 Jan 2017 ABSTRACT The ability of the RDF data model to link
More informationSociaLite: A Python-Integrated Query Language for
SociaLite: A Python-Integrated Query Language for Big Data Analysis Jiwon Seo * Jongsoo Park Jaeho Shin Stephen Guo Monica Lam STANFORD UNIVERSITY M OBIS OCIAL RESEARCH GROUP * Intel Parallel Research
More informationMLI - An API for Distributed Machine Learning. Sarang Dev
MLI - An API for Distributed Machine Learning Sarang Dev MLI - API Simplify the development of high-performance, scalable, distributed algorithms. Targets common ML problems related to data loading, feature
More information