Designing and Evaluating a Distributed Computing Language Runtime. Christopher Meiklejohn (@cmeik), Université catholique de Louvain, Belgium

Diagram: two replicas, R_A and R_B, of a shared register. R_A applies set(1) and then set(2); R_B concurrently applies set(3). Without synchronization the replicas end up with different values (2 at R_A, 3 at R_B), and it is unclear which value should win.

Synchronization. To enforce an order: it makes programming easier, eliminates accidental nondeterminism, and prevents race conditions. Techniques: locks, mutexes, semaphores, monitors, etc.

Difficult Cases. Internet of Things: low power, limited memory and connectivity. Mobile gaming: offline operation with replicated, shared state.

Weak Synchronization. Can we achieve anything without synchronization? Not really. Strong Eventual Consistency (SEC): replicas that deliver the same updates have equivalent state. Primary requirement: eventual replica-to-replica communication. Order insensitive! (Commutativity) Duplicate insensitive! (Idempotent)

Diagram: the same two replicas, now merging with max. R_A applies set(1) then set(2); R_B applies set(3). When the replicas exchange state, each computes max(2, 3), and both converge to 3.
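The convergence in the diagram relies only on the merge function being commutative, associative, and idempotent. A minimal sketch in Erlang (illustrative only, not the Lasp implementation; the module and function names are invented for this example): a register whose merge takes the maximum of the two values, so delivery order and duplicate delivery cannot affect the final state.

%% Illustrative max-merge register; not the Lasp implementation.
%% merge/2 is commutative, associative and idempotent, so replicas
%% converge regardless of delivery order or duplicates.
-module(max_register).
-export([new/0, set/2, merge/2, demo/0]).

new() -> 0.

set(_State, Value) -> Value.

merge(A, B) -> max(A, B).

demo() ->
    RA = set(set(new(), 1), 2),       %% replica A: set(1), then set(2)
    RB = set(new(), 3),               %% replica B: set(3)
    3 = merge(RA, RB),                %% A merges B's state
    3 = merge(RB, RA),                %% order does not matter
    3 = merge(merge(RA, RB), RB),     %% duplicate delivery is harmless
    ok.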

How can we succeed with Strong Eventual Consistency?

Programming SEC
1. Eliminate accidental nondeterminism (e.g. deterministic execution, modeling non-monotonic operations monotonically)
2. Retain the properties of functional programming (e.g. confluence, referential transparency over composition)
3. Provide a distributed, fault-tolerant runtime (e.g. replication, membership, dissemination)

Convergent Objects: Conflict-Free Replicated Data Types [SSS 2011]

Conflict-Free Replicated Data Types. Many types exist with different properties: sets, counters, registers, flags, maps, graphs. Strong Eventual Consistency: instances satisfy the SEC property per object.

Diagram: three replicas, R_A, R_B and R_C, of an observed-remove set, with state shown as (element, add tags, remove tags). R_A applies add(1), yielding (1, {a}, {}). R_C applies add(1), yielding (1, {c}, {}), and then remove(1), yielding (1, {c}, {c}). After the replicas merge, all three hold (1, {a, c}, {c}): the element is still present, because R_A's add was never observed by the remove.
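A sketch of the add-wins behaviour in the diagram, using the (element, add tags, remove tags) representation shown there. This is illustrative only, not the production data type; the module name and representation are invented for the example.

%% Illustrative observed-remove set in the style of the diagram above.
-module(or_set_sketch).
-export([new/0, add/3, remove/2, merge/2, value/1, demo/0]).

new() -> #{}.

%% Adding tags the element with the replica that performed the add.
add(State, Element, Replica) ->
    {Adds, Removes} = maps:get(Element, State, {[], []}),
    State#{Element => {lists:usort([Replica | Adds]), Removes}}.

%% Removing covers only the add tags observed locally.
remove(State, Element) ->
    case maps:find(Element, State) of
        {ok, {Adds, Removes}} ->
            State#{Element => {Adds, lists:usort(Adds ++ Removes)}};
        error ->
            State
    end.

%% Merge takes the union of add and remove tags per element.
merge(A, B) ->
    maps:fold(fun(Element, {AddsB, RemovesB}, Acc) ->
        {AddsA, RemovesA} = maps:get(Element, Acc, {[], []}),
        Acc#{Element => {lists:usort(AddsA ++ AddsB),
                         lists:usort(RemovesA ++ RemovesB)}}
    end, A, B).

%% An element is present if some add tag is not covered by a remove tag.
value(State) ->
    [E || {E, {Adds, Removes}} <- maps:to_list(State), Adds -- Removes =/= []].

demo() ->
    RA = add(new(), 1, a),                 %% R_A: add(1)
    RC = remove(add(new(), 1, c), 1),      %% R_C: add(1) then remove(1)
    [1] = value(merge(RA, RC)),            %% the unobserved add survives
    ok.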

Programming SEC
1. Eliminate accidental nondeterminism (e.g. deterministic execution, modeling non-monotonic operations monotonically)
2. Retain the properties of functional programming (e.g. confluence, referential transparency over composition)
3. Provide a distributed, fault-tolerant runtime (e.g. replication, membership, dissemination)

Convergent Programs: Lattice Processing [PPDP 2015]

Lattice Processing (Lasp). Distributed dataflow: a declarative, functional programming model. Convergent data structures: the primary data abstraction is the CRDT. Enables composition: provides functional composition of CRDTs that preserves the SEC property.

%% Create initial set.
S = declare(set),
%% Add elements to the initial set and update it.
update(S, {add, [1, 2, 3]}),
%% Create a second set.
S2 = declare(set),
%% Apply a map operation from S into S2, doubling each element.
map(S, fun(X) -> X * 2 end, S2).
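A usage note on the snippet above. The slides use a simplified API and do not show how values are read back; the read function below is hypothetical and only illustrates the expected converged result of the dataflow.

%% Hypothetical read-back, for illustration only:
%% read(S2).               %% => {2, 4, 6} once the update to S has propagated
%% update(S, {add, [4]}),  %% later additions flow through the same map edge
%% read(S2).               %% => {2, 4, 6, 8} after convergence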

Programming SEC
1. Eliminate accidental nondeterminism (e.g. deterministic execution, modeling non-monotonic operations monotonically)
2. Retain the properties of functional programming (e.g. confluence, referential transparency over composition)
3. Provide a distributed, fault-tolerant runtime (e.g. replication, membership, dissemination)

Distributed Runtime: Selective Hearing [W-PSDS 2015]

Selective Hearing: an epidemic-broadcast-based runtime system. The goal is a runtime that scales to large numbers of nodes, is resilient to failures, and provides efficient execution. It is well matched to Lattice Processing (Lasp): epidemic broadcast mechanisms provide only weak ordering but are resilient and efficient, and Lasp's programming model is tolerant to message re-ordering, disconnections, and node failures. Selective receive: nodes selectively receive and process messages based on interest.
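A minimal sketch of the state-based anti-entropy pattern this kind of runtime relies on (illustrative, not the Selective Hearing implementation; the module name, the Send fun, and the Merge fun are assumptions): each node periodically picks a few peers at random and ships its CRDT state, and receivers simply merge, so re-ordered or duplicated messages are harmless.

%% Illustrative fanout-based anti-entropy loop; Merge is any CRDT merge
%% function (commutative, associative, idempotent). Not the real runtime.
-module(anti_entropy_sketch).
-export([gossip_round/4, on_receive/3]).

%% Pick Fanout random peers and ship the local state to each.
gossip_round(LocalState, Peers, Fanout, Send) ->
    Shuffled = [P || {_, P} <- lists:sort([{rand:uniform(), Peer} || Peer <- Peers])],
    Selected = lists:sublist(Shuffled, Fanout),
    [Send(Peer, LocalState) || Peer <- Selected],
    ok.

%% Merging incoming state is order- and duplicate-insensitive.
on_receive(LocalState, RemoteState, Merge) ->
    Merge(LocalState, RemoteState).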

Layered Approach.
Membership: a configurable membership protocol which can operate in client-server or peer-to-peer mode.
Broadcast (via gossip, tree, etc.): efficient dissemination of both program state and application state via gossip, a broadcast tree, or a hybrid mode.
Auto-discovery: integration with Mesos, with auto-discovery of Lasp nodes for ease of configuration.

Diagram: the layered runtime. A membership overlay sits at the bottom; a broadcast overlay is built on top of it; mobile phone clients, a distributed hash table, and Lasp execution sit above.

Programming SEC
1. Eliminate accidental nondeterminism (e.g. deterministic execution, modeling non-monotonic operations monotonically)
2. Retain the properties of functional programming (e.g. confluence, referential transparency over composition)
3. Provide a distributed, fault-tolerant runtime (e.g. replication, membership, dissemination)

What can we build? Advertisement.

Advertisement. A mobile game platform sells advertisement space; advertisements are paid according to a minimum number of impressions. Clients will go offline: clients have limited connectivity, and the system still needs to make progress while they are offline.

Diagram (Client Side, Single Copy at Client): the advertisement-counter dataflow. User-maintained CRDTs hold the Rovio ads, the Riot ads, and the contracts. Lasp operations compute the union of the two ad sets, the product of ads and contracts, and a filter yielding the Lasp-maintained "ads with contracts" set. Each client keeps a single local copy of its ad counters and increments them as ads are displayed; when a counter is read at 50,000 impressions, the corresponding ad is removed from the ads set, and the removal flows through the dataflow to every client.

Evaluation: Initial Evaluation.

Background: Distributed Erlang.
Transparent distribution: built in, provided by Erlang/BEAM, with cross-node message passing.
Known scalability limitations: analyzed in various academic publications.
Single connection: head-of-line blocking.
Full membership: all-to-all failure detection with heartbeats and timeouts.

Background: Erlang Port Mapper Daemon (EPMD).
Operates on a known port: similar to Solaris sunrpc-style portmap, a known port that maps to dynamically allocated, port-based services.
Bridged networking: problematic for clustering under bridged networking with dynamic port allocation.

Experiment Design.
Single application: the advertisement counter example from Rovio Entertainment.
Runtime configuration: the application is controlled through runtime environment variables.
Membership: full membership with Distributed Erlang via EPMD.
Dissemination: state-based object dissemination through an anti-entropy protocol (fanout-based, PARC-style).

Experiment Orchestration.
Docker and Mesos with Marathon: used for deployment of both EPMD and the Lasp application.
Single EPMD instance per slave: controlled through the use of host networking and HOSTNAME:UNIQUE constraints in Mesos.
Lasp: local execution using host networking; connects to the local EPMD.
Service discovery: facilitated by clustering EPMD instances through Sprinter.

Ideal Experiment. Local deployment: high thread concurrency when operating with a lower node count. Cloud deployment: low thread concurrency when operating with a higher node count.

Results: Initial Evaluation.

Initial Evaluation.
Moved to DC/OS exclusively: the environments were too different; too much had to be adapted for things to work correctly.
Single orchestration task: dispatched events, controlled when to start and stop the evaluation, and performed log aggregation.
Bottleneck: events were dispatched immediately, which would require blocking for processing acknowledgment.
Unrealistic: events do not queue up all at once for processing by the client.

Lasp Difficulties.
Too expensive: 2.0 CPU and 2048 MiB of memory.
Weeks spent adding instrumentation: process-level, VM-level, and Erlang Observer instrumentation to identify CPU- and memory-heavy processes.
Dissemination too expensive: 1000 threads talking to a single dissemination process (one Mesos task) leads to backed-up message queues and memory leaks.
Unrealistic: two different dissemination mechanisms, thread-to-thread and node-to-node, one of which is synthetic.

EPMD Difficulties.
Nodes become unregistered: nodes randomly unregistered with EPMD during execution.
Lost connections: EPMD loses connections with nodes for arbitrary reasons.
EPMD task restarted by Mesos: restarted for an unknown reason, which leads the Lasp instances to restart in their own containers.

Overhead Difficulties.
Too much state: a client would ship around 5 GiB of state within 90 seconds.
Delta dissemination: delta dissemination only provides around a 30% decrease in state transmission.
Unbounded queues: message buffers would lead to VMs crashing because of large memory consumption.
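A sketch of the difference between full-state and delta shipping for a grow-only set (illustrative only; real delta CRDTs track delta groups and acknowledgements rather than the single buffer assumed here): instead of sending the whole set every round, a node sends only the elements added since the last exchange, and the receiver merges the delta exactly as it would merge full state.

%% Illustrative delta shipping for a grow-only set (ordsets); the real
%% implementation is more involved (delta groups, acknowledgements).
-module(delta_sketch).
-export([add/2, full_message/1, delta_message/1, merge/2]).

%% State: {Elements, DeltaBuffer}; adds go to both.
add({Elements, Delta}, X) ->
    {ordsets:add_element(X, Elements), ordsets:add_element(X, Delta)}.

%% Full-state dissemination ships every element, every time.
full_message({Elements, _Delta}) -> Elements.

%% Delta dissemination ships only what changed, then clears the buffer.
delta_message({Elements, Delta}) -> {Delta, {Elements, ordsets:new()}}.

%% Receiver side: merging a delta is the same union as merging full state.
merge({Elements, Delta}, Incoming) ->
    {ordsets:union(Elements, Incoming), Delta}.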

Evaluation: Rearchitecture.

Ditch Distributed Erlang.
Pluggable membership service: build a pluggable membership service with an abstract interface, initially on top of EPMD, and migrate later once tested.
Adapt Lasp and the broadcast layer: integrate the pluggable membership service throughout the stack and liberate the existing libraries from Distributed Erlang.
Build a service discovery mechanism: mechanize node discovery outside of EPMD, based on the new membership service.

Partisan (Membership Layer).
Pluggable protocol membership layer: allows runtime configuration of the protocol used for cluster membership.
Several protocol implementations:
Full membership via EPMD.
Full membership via TCP.
Client-server membership via TCP.
Peer-to-peer membership via TCP (with HyParView).
Visualization: a force-directed, graph-based visualization engine for real-time cluster debugging.
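A sketch of what "pluggable" means here. The behaviour below is illustrative; Partisan's actual callback modules, configuration keys, and signatures may differ, and my_app, membership_backend, and full_membership_sketch are invented names.

%% Illustrative membership behaviour; not Partisan's actual interface.
-module(membership_sketch).
-export([backend/0, members/0]).

-callback join(Node :: node()) -> ok | {error, term()}.
-callback leave(Node :: node()) -> ok.
-callback members() -> [node()].

%% The concrete protocol module is chosen at runtime via configuration,
%% e.g. full membership, client-server, or HyParView-based partial views.
backend() ->
    application:get_env(my_app, membership_backend, full_membership_sketch).

members() ->
    Backend = backend(),
    Backend:members().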

Partisan (Full via EPMD or TCP).
Full membership: nodes have full visibility into the entire graph.
Failure detection: performed by peer-to-peer heartbeat messages with a timeout.
Limited scalability: the heartbeat interval grows as the node count increases, leading to false or delayed detection.
Testing: used to create the initial test suite for Partisan.

Partisan (Client-Server Model).
Client-server membership: the server has all peers in the system as peers; a client has only the server as a peer.
Failure detection: nodes heartbeat, with a timeout, all the peers they are aware of.
Limited scalability: the server is a single point of failure, and visibility is limited.
Testing: used as the reference architecture for baseline evaluations.

Partisan (HyParView, the default).
Partial-view protocol: two views, active (fixed size) and passive (log n); the passive view is used to replace failed members of the active view.
Failure detection: performed by monitoring active TCP connections to peers, with keep-alive enabled.
Very scalable (10k+ nodes during academic evaluation): however, the protocol is probabilistic and can leave nodes isolated during churn.
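A sketch of the failure-replacement step described above (illustrative; the real HyParView protocol also performs shuffles, join walks, and priority handling, none of which appear here): when an active peer goes down, the node promotes a random passive peer into the active view.

%% Illustrative active/passive view replacement; not the full HyParView
%% protocol. Views is a map with lists of node names.
-module(hyparview_sketch).
-export([handle_peer_down/2]).

handle_peer_down(Down, #{active := Active, passive := Passive} = Views) ->
    Active1 = lists:delete(Down, Active),
    case Passive of
        [] ->
            %% Nothing to promote; the active view simply shrinks.
            Views#{active := Active1};
        _ ->
            %% Promote a random passive peer into the active view.
            Candidate = lists:nth(rand:uniform(length(Passive)), Passive),
            Views#{active := [Candidate | Active1],
                   passive := lists:delete(Candidate, Passive)}
    end.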

Sprinter (Service Discovery).
Responsible for clustering tasks: uses Partisan to cluster all nodes and ensure a connected overlay network; reads membership information from Marathon.
Node-local: runs on each node and is responsible for taking actions to keep the graph connected, which is required for the probabilistic protocols.
Membership-mode specific: knows, based on the membership mode, how to properly cluster nodes, and enforces proper join behaviour.

Debugging Sprinter.
S3 archival: nodes periodically snapshot their membership view for analysis.
An elected node (or group) periodically analyses the information in S3 for the following:
Isolated node detection: identifies isolated nodes and takes corrective measures to repair the overlay.
Symmetric relationship verification: ensures that if a node knows about another node, the relationship is symmetric, preventing "I know you, but you don't know me."
Periodic alerting: alerts on disconnected graphs so that external measures can be taken, if necessary.
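A sketch of the two checks described above, run over archived membership snapshots. It is illustrative only and assumes the snapshots have already been fetched from S3 into a map of node -> list of peers; the module and function names are invented.

%% Illustrative overlay checks over snapshots: Views is a map of
%% node() -> [node()] (each node's view of its peers).
-module(overlay_checks).
-export([isolated_nodes/1, asymmetric_edges/1]).

%% A node is isolated if no other node lists it as a peer.
isolated_nodes(Views) ->
    Referenced = lists:usort(lists:append(maps:values(Views))),
    [N || N <- maps:keys(Views), not lists:member(N, Referenced)].

%% An edge N -> P is asymmetric if P does not also list N as a peer.
asymmetric_edges(Views) ->
    [{N, P} || {N, Peers} <- maps:to_list(Views),
               P <- Peers,
               not lists:member(N, maps:get(P, Views, []))].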

Evaluation: Next Evaluation.

Evaluation Strategy.
Deployment and runtime configuration: the ability to deploy a cluster of nodes and configure simulations at runtime.
Each simulation:
Different application scenario: executes a different application scenario based on its runtime configuration.
Result aggregation: aggregates results at the end of the execution and archives them.
Plot generation: automatically generates plots for the execution and aggregates the results of multiple executions.
Minimal coordination: work must be performed with minimal coordination, as a single orchestrator is a scalability bottleneck for large applications.

Completion Detection.
Convergence structure: an uninstrumented CRDT of grow-only sets containing counters that each node manipulates.
Simulates a workflow: nodes use this structure to simulate a lock-step workflow for the experiment.
Event generation: when event generation finishes, the node toggles a boolean to signal completion.
Log aggregation: completion triggers log aggregation.
Shutdown: once log aggregation completes, nodes shut down.
External monitoring: when the events complete execution, nodes automatically begin the next experiment.
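A sketch of the completion-detection idea (illustrative; the real implementation uses an uninstrumented CRDT shared through the runtime rather than the local set assumed here, and the module name is invented): each node marks itself done by adding its name to a grow-only set, and a phase completes once the merged set contains every expected node.

%% Illustrative completion detection with a grow-only set of node names;
%% merge is plain set union, so it converges under any delivery order.
-module(completion_sketch).
-export([mark_done/2, merge/2, phase_complete/2]).

mark_done(Done, Node) ->
    ordsets:add_element(Node, Done).

merge(DoneA, DoneB) ->
    ordsets:union(DoneA, DoneB).

%% The next experiment phase starts once every expected node reports done.
phase_complete(Done, ExpectedNodes) ->
    ordsets:is_subset(ordsets:from_list(ExpectedNodes), Done).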

Results: Next Evaluation.

Results: Lasp.
Single-node orchestration is bad: not possible once you exceed a few nodes, because of message queues, memory, and delays.
Partial views are required: rely on transitive dissemination of information and partial network knowledge.
Results: reduced the Lasp memory footprint to 75 MB; larger in practice for debugging.

Results: Partisan.
Fast churn isolates nodes: a repair mechanism is needed (random promotion of isolated nodes); mainly issues of symmetry.
FIFO across connections: FIFO is only guaranteed per connection, but the protocol assumes it across all connections, leading to false disconnects.
Unrealistic system model: per-message acknowledgements are needed for safety.
Pluggable protocols help debugging: being able to switch to full membership or client-server helps separate protocol problems from application problems.

Latest Results.
Reproducibility at 300 nodes for full applications: connectivity, but with transient partitions and isolated nodes, at 500-1000 nodes (across 40 instances).
Limited financially and by Amazon: larger evaluations are harder to run because we are limited financially (as a university) and by Amazon instance limits.
Mean state reduction per client: around a 100x improvement over our initial PaPoC 2016 evaluation results.

Takeaways.
Visualizations are important! Graph performance and visualize your cluster: all of these things lead to easier debugging.
Control changes: no Lasp PR is accepted without divergence, state transmission, and overhead graphs.
Automation: developers use graphs when they are easy to make; lower the difficulty of generating them and understand how changes alter system behaviour.
Make work easily testable: when you test locally and deploy globally, you need to make things easy to test, deploy, and evaluate (for good science, I say!)

Thanks! Christopher Meiklejohn (@cmeik) http://www.lasp-lang.org http://github.com/lasp-lang