
Automatic Scaling Iterative Computations
Guozhang Wang, Cornell University
Aug. 7th, 2012

What are Non-Iterative Computations?
- Non-iterative computation flow: a directed acyclic graph of operators
- Examples: batch-style analytics such as aggregation, sorting, text parsing, inverted index construction, etc.
[Figure: Input Data -> Operator 1 -> Operator 2 -> Operator 3 -> Output Data]

What are Iterative Computations?
- Iterative computation flow: a directed cyclic graph of operators
- Examples:
  - Scientific computation: linear/differential systems, least squares, eigenvalues
  - Machine learning: SVM, EM algorithms, boosting, K-means
  - Computer vision, web search, etc.
[Figure: Input Data -> Operator 1 -> Operator 2 -> Can Stop? -> Output Data, looping back until the stopping condition holds]
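The cyclic flow above boils down to a loop with a convergence test. A minimal sketch, assuming hypothetical update_step and converged callbacks that are not part of the talk:

    def iterate(data, update_step, converged, max_iters=100):
        # Apply the operator pipeline repeatedly; the "Can Stop?" check
        # decides when to emit the output data.
        for _ in range(max_iters):
            new_data = update_step(data)
            if converged(data, new_data):
                return new_data
            data = new_data
        return data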

Massive Datasets are Ubiquitous
- Traffic behavioral simulations: a micro-simulator cannot scale to NYC with millions of vehicles
- Social network analysis: even computing the graph radius on a single machine takes a long time
- Similar scenarios in predictive analysis, anomaly detection, etc.

Why Is Hadoop Not Good Enough?
- Re-shuffles/materializes data between operators
  - Increased overhead at each iteration
  - Results in poor performance
- Batch-processes records within operators
  - Not every record needs to be updated
  - Results in slow convergence

Talk Outline
- Motivation
- Fast Iterations: BRACE for Behavioral Simulations
- Fewer Iterations: GRACE for Graph Processing
- Future Work

Challenges of Behavioral Simulations
- Easy to program, but not scalable
  - Examples: Swarm, Mason
  - Typically one thread per agent, lots of contention
- Scalable, but hard to program
  - Examples: TRANSIMS, DynaMIT (traffic), GPU implementation of fish simulation (ecology)
  - Hard-coded models, compromised level of detail

What Do People Really Want?
A new simulation platform that combines:
- Ease of programming: a scripting language for domain scientists
- Scalability: an efficient parallel execution runtime

A Running Example: Fish Schools (adapted from Couzin et al., Nature 2005)
Fish behavior:
- Avoidance: if too close (within radius α), repel other fish
- Attraction: if seen within range (radius ρ), attract other fish
- Both behaviors exhibit spatial locality

State-Effect Pattern
- Programming pattern to deal with concurrency
- Follows a time-stepped model
- Core idea: make all actions inside a tick order-independent

States and Effects
- States: snapshot of the agents at the beginning of the tick (e.g., position, velocity vector)
- Effects: intermediate results from interactions, used to calculate the new states (e.g., sets of forces from other fish)

Two Phases of a Tick
- Query: capture agent interaction
  - Read states, write effects
  - Each effect set is associated with a combinator function
  - Effect writes are order-independent
- Update: refresh the world for the next tick
  - Read effects, write states
  - Reads and writes are totally local
  - State writes are order-independent

A Tick in State-Effect
- Query:
  - For each fish f in visibility α: write repulsion to f's effects
  - For each fish f in visibility ρ: write attraction to f's effects
- Update:
  - new velocity = combined repulsion + combined attraction + old velocity
  - new position = old position + old velocity

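A minimal Python sketch of the tick above in the state-effect style: the query phase only reads states and accumulates effects through a combinator (vector sum here), and the update phase only reads effects and writes states locally. The class and function names are illustrative assumptions, not the BRASIL/BRACE API.

    import math

    class Fish:
        def __init__(self, pos, vel):
            self.pos, self.vel = pos, vel      # state: read-only during query
            self.repulsion = (0.0, 0.0)        # effect set, combined by vector sum
            self.attraction = (0.0, 0.0)       # effect set, combined by vector sum

    def dist(a, b):
        return math.hypot(a.pos[0] - b.pos[0], a.pos[1] - b.pos[1])

    def query(fish, school, alpha, rho):
        # Query phase: read states, write effects. Writes are order-independent
        # because each effect set only accumulates through its combinator (+).
        for f in school:
            if f is fish:
                continue
            d = dist(fish, f)
            if d < alpha:                      # too close: repel f
                f.repulsion = (f.repulsion[0] + (f.pos[0] - fish.pos[0]),
                               f.repulsion[1] + (f.pos[1] - fish.pos[1]))
            elif d < rho:                      # visible: attract f
                f.attraction = (f.attraction[0] + (fish.pos[0] - f.pos[0]),
                                f.attraction[1] + (fish.pos[1] - f.pos[1]))

    def update(fish):
        # Update phase: read effects, write states; purely local to each fish.
        fish.pos = (fish.pos[0] + fish.vel[0], fish.pos[1] + fish.vel[1])
        fish.vel = (fish.vel[0] + fish.repulsion[0] + fish.attraction[0],
                    fish.vel[1] + fish.repulsion[1] + fish.attraction[1])
        fish.repulsion = fish.attraction = (0.0, 0.0)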

From State-Effect to MapReduce
- Tick = Query (states -> effects), communicate effects, Update (effects -> new states), communicate new states
[Figure: the tick pipelined across MapReduce stages Map 1_t, Reduce 1_t, Map 2_t, Reduce 2_t, Map 1_t+1, covering the steps: distribute data, assign (partial) effects, forward data, aggregate effects, update, redistribute data]
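A schematic sketch of how one tick could map onto two map/reduce passes, roughly following the pipeline above. The helper callbacks (assign_effects, aggregate_effects, update_agent, partition_of) and the assumption that agents carry an .id attribute are illustrative; this is not the BRACE runtime.

    from collections import defaultdict

    def run_tick(partitions, assign_effects, aggregate_effects, update_agent, partition_of):
        # Pass 1: query. The map step assigns partial effects within each
        # partition; the reduce step aggregates the effect sets per agent.
        partial = defaultdict(list)
        for pid, agents in partitions.items():
            for agent_id, effects in assign_effects(agents):          # map 1
                partial[agent_id].append(effects)
        combined = {aid: aggregate_effects(effs)                       # reduce 1
                    for aid, effs in partial.items()}

        # Pass 2: update. The map step applies the combined effects to each
        # agent's state; the reduce step redistributes agents to their
        # (possibly new) partitions for the next tick.
        new_partitions = defaultdict(list)
        for pid, agents in partitions.items():
            for agent in agents:
                new_agent = update_agent(agent, combined.get(agent.id))  # map 2
                new_partitions[partition_of(new_agent)].append(new_agent)  # reduce 2
        return new_partitions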

BRACE (Big Red Agent Computation Engine)
- BRASIL: a high-level scripting language for domain scientists; compiles to an iterative MapReduce workflow
- Special-purpose MapReduce runtime for behavioral simulations
  - Basic optimizations
  - Optimizations based on spatial locality

Spatial Partitioning
- Partition the simulation space into regions, each handled by a separate node

Communication Between Partitions
- Owned region: agents in it are owned by the node
- Visible region: agents in it are not owned, but need to be seen by the node
- Only need to communicate with neighbors to refresh visible states and forward assigned effects
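A small sketch of the owned/visible distinction above, assuming a uniform grid where each node owns one cell; the helper names are illustrative, not BRACE's actual interface.

    def owner_cell(pos, cell_size):
        # The cell that owns an agent at position pos.
        return (int(pos[0] // cell_size), int(pos[1] // cell_size))

    def visible_to(pos, cell_size, visibility):
        # Neighboring cells whose visible region contains this agent, i.e. cells
        # (other than the owner) within `visibility` of the agent's position.
        cx, cy = owner_cell(pos, cell_size)
        cells = set()
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                nx, ny = cx + dx, cy + dy
                if (nx, ny) == (cx, cy):
                    continue
                # Closest point of the neighboring cell's rectangle to the agent.
                px = min(max(pos[0], nx * cell_size), (nx + 1) * cell_size)
                py = min(max(pos[1], ny * cell_size), (ny + 1) * cell_size)
                if (pos[0] - px) ** 2 + (pos[1] - py) ** 2 <= visibility ** 2:
                    cells.add((nx, ny))
        return cells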

Experimental Setup
- BRACE prototype: grid partitioning, KD-tree spatial indexing, basic load balancing
- Hardware: Cornell WebLab cluster (60 nodes, 2x quad-core Xeon 2.66 GHz, 4 MB cache, 16 GB RAM)

Scalability: Traffic
- Scale up the size of the highway with the number of nodes
- The notch in the curve is a consequence of the multi-switch architecture

Talk Outline
- Motivation
- Fast Iterations: BRACE for Behavioral Simulations
- Fewer Iterations: GRACE for Graph Processing
- Conclusion

Large-scale Graph Processing
- Graph representations are everywhere: web search, text analysis, image analysis, etc.
- Today's graphs have scaled to millions of edges/vertices
- Data parallelism of graph applications
  - Graph data is updated independently (i.e., on a per-vertex basis)
  - Individual vertex updates only depend on connected neighbors

Synchronous vs. Asynchronous
- Synchronous graph processing
  - Proceeds in batch-style ticks
  - Easy to program and scale, but slow convergence
  - Pregel, PEGASUS, PrIter, etc.
- Asynchronous processing
  - Updates with the most recent data
  - Fast convergence, but hard to program and scale
  - GraphLab, Galois, etc.

What Do People Really Want?
- A synchronous implementation at first: easy to think about, program, and debug
- Asynchronous execution for better performance, without re-implementing everything

GRACE (GRAph Computation Engine)
- Iterative synchronous programming model
  - Update logic for an individual vertex
  - Data dependencies encoded in message passing
- Customizable bulk synchronous runtime
  - Enables various asynchronous features by relaxing data dependencies

Running Example: Belief Propagation
- Core procedure for many inference tasks in graphical models
- Upon update, each vertex first computes its new belief distribution from its incoming messages (Eq. 1), then propagates its new belief to its outgoing messages (Eq. 2)
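The equations on the original slides are not reproduced in the transcript; for reference, the standard sum-product belief propagation updates take the following form (notation assumed: \phi_i is the node potential, \psi_{ij} the edge potential, N(i) the neighbors of vertex i):

    b_i(x_i) \propto \phi_i(x_i) \prod_{k \in N(i)} m_{k \to i}(x_i)                                            (Eq. 1)

    m_{i \to j}(x_j) \propto \sum_{x_i} \phi_i(x_i)\, \psi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus \{j\}} m_{k \to i}(x_i)   (Eq. 2)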

Sync. vs. Async. Algorithms
- The update logic is actually the same: Eq. 1 and 2
- They only differ in when and how the update logic is applied

Vertex Update Logic
- Read in one message from each incoming edge
- Update the vertex value
- Generate one message on each outgoing edge
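A minimal sketch of this per-vertex update for discrete belief propagation, written as a Proceed-style function. The data layout (a vertex dict, message lists, edge-potential matrices) and the assumption that an out-edge shares its id with the reverse in-edge are illustrative, not GRACE's actual API.

    EPSILON = 1e-4

    def normalize(dist):
        s = sum(dist)
        return [p / s for p in dist]

    def proceed(vertex, in_msgs):
        # in_msgs: {edge_id: message}, each message a distribution over labels.
        # Eq. 1: new belief = prior * product of all incoming messages.
        belief = list(vertex["prior"])
        for msg in in_msgs.values():
            belief = [b * m for b, m in zip(belief, msg)]
        belief = normalize(belief)
        converged = max(abs(b - o) for b, o in zip(belief, vertex["belief"])) < EPSILON
        vertex["belief"] = belief

        # Eq. 2: one outgoing message per out-edge, excluding that edge's own
        # incoming message (assumes the reverse in-edge shares the edge id).
        out_msgs = {}
        for edge_id, psi in vertex["out_edges"].items():   # psi: edge potential matrix
            excl = list(vertex["prior"])
            for eid, msg in in_msgs.items():
                if eid != edge_id:
                    excl = [b * m for b, m in zip(excl, msg)]
            out_msgs[edge_id] = normalize(
                [sum(excl[i] * psi[i][j] for i in range(len(excl)))
                 for j in range(len(psi[0]))])
        return out_msgs, converged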

Belief Propagation in Proceed
- Consider the fixed point reached when the new belief distribution no longer changes much

Customizable Execution Interface
- Each vertex is associated with a scheduling priority value
- Users can specify the logic for:
  - Updating a vertex's priority upon receiving a message
  - Deciding which vertices are processed in each tick
  - Selecting the messages to be used for Proceed
- We have implemented 4 different execution policies for users to choose from directly

Original Belief Propagation
- Use the last received message upon calling Proceed, and schedule all vertices to be processed in each tick

Residual Belief Propagation
- Use the message residual as its contribution to the vertex's priority, and only update the vertex with the highest priority
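To make the customizable execution interface concrete, here is a minimal sketch of the two policies above expressed through the three user hooks (priority update, vertex selection, message selection). Class and method names are illustrative assumptions, not GRACE's actual interface.

    class OriginalBP:
        def on_receive(self, vertex, old_msg, new_msg):
            pass                                   # no priorities needed

        def select_vertices(self, vertices):
            return list(vertices)                  # process every vertex each tick

        def select_messages(self, mailbox):
            # Use the last received message on each in-edge.
            return {e: msgs[-1] for e, msgs in mailbox.items()}

    class ResidualBP:
        def on_receive(self, vertex, old_msg, new_msg):
            # The message residual contributes to the vertex's scheduling priority.
            residual = max(abs(a - b) for a, b in zip(old_msg, new_msg))
            vertex["priority"] = max(vertex.get("priority", 0.0), residual)

        def select_vertices(self, vertices):
            # Only update the highest-priority vertex this tick.
            return sorted(vertices, key=lambda v: v.get("priority", 0.0),
                          reverse=True)[:1]

        def select_messages(self, mailbox):
            return {e: msgs[-1] for e, msgs in mailbox.items()}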

Experimental Setup
- GRACE prototype: shared-memory
- Policies: Jacobi, Gauss-Seidel, Eager, Prior
- Hardware: 32-core machine with 8 quad-core processors and 128 GB of quad-channel RAM

Results: Image Restoration with BP
- GRACE's prioritized policy achieves convergence comparable to GraphLab's asynchronous scheduling, while achieving near-linear speedup

Conclusions
- Iterative computations are common patterns in many applications
  - They require programming simplicity and automatic scalability
  - They need special care for performance
- Main-memory approach with various optimization techniques
  - Leverage data locality to minimize communication
  - Relax data dependencies for fast convergence
Thank you!

Acknowledgements