Scalable Control Plane Substrate. Sachin Ka3, John Ousterhout, Guru Parulkar, Marcos Aguilera, Curt Kolovson ONRC + ON.Lab + RAMCloud, VMWare

Similar documents
ONOS OVERVIEW. Architecture, Abstractions & Application

ONOS: TOWARDS AN OPEN, DISTRIBUTED SDN OS. Chun Yuan Cheng

ICONA Inter Cluster ONOS Network Application. CREATE-NET, CNIT/University of Rome Tor Vergata, Consortium GARR

ONOS and the importance of deployments

Data Driven Networks

Data Driven Networks. Sachin Katti

Self Programming Networks

ONOS YANG Tools. Thomas Vachuska Open Networking Foundation

RAMCloud: A Low-Latency Datacenter Storage System Ankita Kejriwal Stanford University

lecture 18: network virtualization platform (NVP) 5590: software defined networking anduo wang, Temple University TTLMAN 401B, R 17:30-20:00

Raj Jain (Washington University in Saint Louis) Mohammed Samaka (Qatar University)

RAMCloud. Scalable High-Performance Storage Entirely in DRAM. by John Ousterhout et al. Stanford University. presented by Slavik Derevyanko

Distributed Data Infrastructures, Fall 2017, Chapter 2. Jussi Kangasharju

CPS 512 midterm exam #1, 10/7/2016

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University

Open Network Operating System

Cisco APIC-EM Components and Architecture, page 3. About the Cisco Application Policy Infrastructure Controller Enterprise Module (APIC-EM), page 1

Cisco Extensible Network Controller

Low-Latency Datacenters. John Ousterhout Platform Lab Retreat May 29, 2015

A Global Operating System «from the Things to the Clouds»

Copyright 2014 Splunk Inc. Splunk for VMware. Architecture & Design. Michael Donnelly, Sr. Sales Engineer

China Unicom SDN Practice in WAN. Lv Chengjin/Ma Jichun, China Unicom

TECHNICAL WHITE PAPER

Building Consistent Transactions with Inconsistent Replication

Thomas Lin, Naif Tarafdar, Byungchul Park, Paul Chow, and Alberto Leon-Garcia

Mobile-CORD Enable 5G. ONOS/CORD Collaboration

Efficient Geographic Replication & Disaster Recovery. Tom Pantelis Brian Freeman Colin Dixon

CGAR: Strong Consistency without Synchronous Replication. Seo Jin Park Advised by: John Ousterhout

Memory-Based Cloud Architectures

Fault Tolerant Distributed Main Memory Systems

UNIVERSITY OF CAGLIARI

LTE-FDD and APTS support over Existing Cable Networks

ONOS. Open Network Operating System. Ali Al-Shabibi and Andrea Campanella. ON.Lab 13/09/2016 TIM Labs, Turin. #ONOSProject

ETSI FUTURE Network SDN and NFV for Carriers MP Odini HP CMS CT Office April 2013

Network Layer: The Control Plane

BROCADE CLOUD-OPTIMIZED NETWORKING: THE BLUEPRINT FOR THE SOFTWARE-DEFINED NETWORK

Software-Defined Networking (SDN) Overview

7/27/2010 LTE-WIMAX BLOG HARISHVADADA.WORDPRESS.COM. QOS over 4G networks Harish Vadada

What is ONOS? ONOS Framework (ONOSFW) is the OPNFV project focused on ONOS integration. It is targeted for inclusion in the Brahmaputra release.

ONOS Mini-Summit, Beijing, China

NetChain: Scale-Free Sub-RTT Coordination

Five Trends Leading to Opportunities in Multi-Cloud Global Application Delivery

Exploiting Commutativity For Practical Fast Replication. Seo Jin Park and John Ousterhout

New trends in IT. Network Functions Virtualization (NFV) & Software Defined-WAN

Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX

Wireless SDN 기술. Seungwon Shin KAIST

Cisco Wide Area Bonjour Solution Overview

No compromises: distributed transactions with consistency, availability, and performance

Chapter 4 Communication

Campus Design Principals

Workspace & Storage Infrastructure for Service Providers

THE MOBILE ACCESS NETWORK, BEYOND CONNECTIVITY. xran.org White Paper October 2016

Low-Latency Datacenters. John Ousterhout

ALCATEL-LUCENT ENTERPRISE DATA CENTER SWITCHING SOLUTION Automation for the next-generation data center

TROPIC: Transactional Resource Orchestration Platform In the Cloud

Professor Yashar Ganjali Department of Computer Science University of Toronto

Chapter 5 Network Layer: The Control Plane

Chapter 5 Network Layer: The Control Plane

Scalability of web applications

Performance and Scalability with Griddable.io

RAMCube: Exploiting Network Proximity for RAM-Based Key-Value Store

Accelerating SDN and NFV Deployments. Malathi Malla Spirent Communications

NFV and SDN: The enablers for elastic networks

IEEE NetSoft 2016 Keynote. June 7, 2016

Performance Assurance Solution Components

vepc-based Wireless Broadband Access

Beyond 1001 Dedicated Data Service Instances

The Edge of LTE. Deploying a MEC platform in today s networks. MEC Congress Munich, DE September SpiderCloud Wireless, Inc.

Accelerating 4G Network Performance

RAMCloud: Scalable High-Performance Storage Entirely in DRAM John Ousterhout Stanford University

Data Centers. Tom Anderson

Dr Markus Hagenbuchner CSCI319 SIM. Distributed Systems Chapter 4 - Communication

LTE Access Controller (LAC) for Small Cell Tool Design and Technology

Networking Seminar Stanford University. Madan Jampani 3/12/2015

Over-The-Top (OTT) Aggregation Solutions

Questions about LAA deployment scenarios

5G Network Architecture

Protection Schemes for 4G Multihop wireless Networks

Modern hyperconverged infrastructure. Karel Rudišar Systems Engineer, Vmware Inc.

Inventing Future Access and Services

Intersection of 5G & Open Reference Platforms Tom TOFIGH, PMTS, AT&T

Data-Driven DevOps: Bringing Visibility to Any Cloud, Any App, & Any Device. Erik Giesa SVP of Marketing and Business Development, ExtraHop Networks

Exploring Cloud Security, Operational Visibility & Elastic Datacenters. Kiran Mohandas Consulting Engineer

An Implementation of the Homa Transport Protocol in RAMCloud. Yilong Li, Behnam Montazeri, John Ousterhout

The Living Network: Leading the Path to 5G. Robert Olesen Director, InterDigital Inc InterDigital, Inc. All rights reserved.

A Hierarquical MEC Architecture: Experimenting the RAVEN Use-Case

<Insert Picture Here> MySQL Web Reference Architectures Building Massively Scalable Web Infrastructure

Streaming Log Analytics with Kafka

SDN controller: Intent-based Northbound Interface realization for extended applications

Taxonomy of SDN. Vara Varavithya 17 January 2018

Features. HDX WAN optimization. QoS

Oracle IaaS, a modern felhő infrastruktúra

Transformation Through Innovation

Making Network Functions Software-Defined

DAL: A Locality-Optimizing Distributed Shared Memory System

Cisco CloudCenter Solution with Cisco ACI: Common Use Cases

VEXATA FOR ORACLE. Digital Business Demands Performance and Scale. Solution Brief

No Tradeoff Low Latency + High Efficiency

Zero to Microservices in 5 minutes using Docker Containers. Mathew Lodge Weaveworks

Distributed Network Function Virtualization

Transcription:

Scalable Control Plane Substrate Sachin Ka3, John Ousterhout, Guru Parulkar, Marcos Aguilera, Curt Kolovson ONRC + ON.Lab + RAMCloud, VMWare

MoEvaEon SeparaEon of control plane is a common trend: networks/systems Challenging to design, hard requirements on throughput, latency, consistency We have been building control planes for our specific systems ONOS, SoORAN, and RAMCloud Coordinator Can we design a generalized scalable control plane with A common foundaeon that can be customized for different contexts

Examples of Control Planes SDN Controllers Data center, enterprise, service provider wired and cellular networks IoT VM manager (e.g. Nova in OpenStack) Data Centers Name servers ( name node ) distributed file system Scheduler Many distributed systems SoOware defined storage

New Control Planes Drones, Cars Controlling fleets of drones and cars Collision avoidance, sharing the airspace Air traffic control for drones LogisEcs and delivery

SeparaEon of Control Plane Open Interface(s) App 1 App2 App3 Global View Control Plane Open Interface SeparaEon of Control Plane Switches enbs Servers Storage Server Server Server Storage

Logically Centralized Control Plane Provides global view Makes it easy to program control, management, config apps Enables new apps

High Volume of State: ~500GB-2TB Key Performance Requirements Apps Apps High Throughput: ~500K-20M ops / second ~100M state ops / second High throughput Low latency Consistency High availability Control Global Global Network View View / State / State Low Latency to Events: 1-10s ms Difficult challenge! Server Server Server Storage

Control Plane Challenges Scalability, High Availability and Performance Northbound and Southbound AbstracEons Modularity

Example Scalable Control Planes ONOS SoORAN RAMCloud Coordinator

Scale-Out Design w/ MulEple ONOS Instances Each instance is identical Apps One can add and remove instances seamlessly Each instance is a master for a subset of switches ONOS Global Network View / State It works like a single system for apps and network devices ONOS Instance 1 ONOS Instance 2 ONOS Instance 3

ONOS Architecture Tiers Northbound Abstraction: - network graph - application intents Core: - distributed - protocol independent Southbound Abstraction: - generalized OpenFlow - pluggable & extensible Apps Northbound - Application Intent Framework (policy enforcement, conflict resolution) Distributed Core (scalability, availability, performance, persistence) Southbound (discover, observe, program, configure) OpenFlow NetConf...

ONOS ApplicaEon Intent Framework I want to define what I need without worrying about how Provision 10G path from Datacenter 1 to Datacenter2 optimized for cost ApplicaEon Intent Framework: AbstracEons, APIs, Policy Enforcement, Conflict resolueon Intents translated and Compiled into specific instructions for network devices. Distributed Core Southbound Southbound Core API OpenFlow NETCONF Southbound Interface

Distributed Core: State Management Topology Flows Intents Switch to controller mapping Resource allocaeons Network ConfiguraEon And a plethora of applicaeon generated state Different states have different size and R/W pakerns They can use different consistency, replicaeon, sharding

Application Intents - immutable - durable & replicated Global Network View - eventually consistent - fully replicated Key to Performance/Scalability: Distributed State Management Optimistic Replication - gossip based - anti-entropy Optimistic Replication - gossip based - anti-entropy - partial ordering Flow Table Entries - strongly consistent - partitioned 172.16.0.0 172.16.0.0 172.16.0.0 172.16.0.0 172.16.0.0 172.16.0.0 172.16.0.0 172.16.0.0 172.16.0.0 172.16.0.0 172.16.0.0 172.16.0.0 Master/Backup Replication Switch to Controller - strongly consistent - replicated for durability Resource Assignment - strongly consistent - pareeoned for scale Switch è Master Switch è Master Switch è Master Switch è Master Switch è Master Switch è Master Consensus for strong consistency Consensus for strong consistency

ONOS Performance Metrics and Results Device & link sensing latency o o Less than 100ms ONOS processing less than 10ms Flow rule operaeons throughput o 500K to 3M ops/sec Intent operaeons throughput o 200k ops/sec Intent operaeons latency o Less than 50ms

SoORAN

So#RAN: So#ware Defined Radio Access Network LTE-A LTE-IoT So#RAN OS MVNO Goal: Global Network View: efficiency enbs +Maximize small cell enbs + signal of precious spectrum strengths from devices to neighboring enbs Create personalized usermove, experience As devices need to handover Minimize capex/opex Signal strengths constantly Microcell Macrocell Picocell Microcell Macrocell Picocell change with interference with rapidly 50 Macro enbs, 600 Micro enbs increasing devices and their bandwidth Up to 100K mobile devices; requirements few flows per device

Control Plane Requirements for So#RAN LTE-A LTE-IoT So#RAN OS MVNO Global Network View enbs + small cell enbs + signal strengths from devices to neighboring enbs CriEcal decisions: Handover Device to enb assignment Power assignment Microcell Macrocell Picocell Microcell Macrocell Picocell AllocaEon of 50 resource blocks per enbs to mobile devices every ms 50 Macro enbs, 600 Micro enbs Up to 100K mobile devices; few flows per device

SoORAN Key Performance Requirements High Volume of State: ~500GB-2TB High Throughput: ~500K-5M ops / second ~20M state ops / second Low Latency to Events: 1-10s ms Control Apps Apps Global Global Network View View / State / State Difficult challenge! Server Order of magnitude higher throughput and lower latency requirement Server Server Storage

Control Plane Challenges for So#RAN: SimplificaCon and OpCmizaCon LTE-A LTE-IoT So#RAN OS MVNO Global Network View MulEple TBs Updated every few ms Opportunity for Parallelism State management/processing can be done independently not as Eghtly coupled Only devices in transicon require immediate processing 50 Macro enbs, 600 Micro enbs Up to 100K mobile devices; few flows per device

RAMCloud Coordinator

RAMCloud Architecture: Control Plane 1000 100,000 ApplicaCon Servers Appl. Library Appl. Library Appl. Library Appl. Library Master Backup Master Backup Datacenter Network Master Backup Master Backup Coordinator Coordinator Standby External Storage (ZooKeeper) RAMCloud Control Plane 1000 10,000 Storage Servers Meta-state: mapping of key space to tables on servers, list of aceve servers, list of aceve client-leases App severs cache the state and so coordinator not in the criecal path

RAMCloud Coordinator: Requirements High Volume of State: ~500GB Apps Control Apps Medium Throughput: ~M updates per second Low Latency to Response: usec to msec Global Network System View // State Appl. Library Appl. Library Appl. Library Master Backup Master Backup Master Backup 100,000 App Servers 10,000 Servers

Generalized and Customizable Scalable Control Plane

What is domain specific and what isnt? Domain independent State management with configurable consistency, availability Streaming event processing Graph abstraceons Domain specific Northbound and southbound interfaces NoEficaEon APIs

Generalized and Customizable Scalable Control Plane Apps Northbound AbstracEon: - Provide different APIs - Customize for the context Core: - distributed - context independent Southbound AbstracEon: - Plug-ins for different contexts Northbound Absrtractions/APIs (C/C++, Declective Programming, REST) Strongly consistent, trasaction semantics? Distributed Core Cluster of Servers, 10-100Gbps, low latency RPC Distributed State Management Primitives Southbound Plug-in for different contexts OpenFlow/NetConf RAN X2 Interface RPC Switches enbs Servers Storage Server Server Server Storage

Time Varying Streaming Graph AbstracEon Common state abstraceon for many control planes à Graph E.g: In networks, graphs of nodes (routers, basestaeons, phones) with state changing frequently (link load, cell connecevity) O(1 million) nodes in graph, relaevely small compared to web graphs Updates to graphs in streaming fashion & highly dynamic E.g: In cellular networks, every edge updated every 3-5ms à O(2-5 million) updates every second

Complex Event Processing Compute abstraceon: Updates to graphs (events) trigger complex processing that s domain specific E.g: link down event triggering a route update to maintain connecevity Triggered event processing can have Eght latency constraints E.g: link down event can cause disconneceon, need fast aceon (~10-100ms) to keep the network connected Ideally, one should be able to write the event logic in a high level declaraeve fashion

Summary SeparaEon of control plane is a common trend: networks/systems We have been building control planes for our specific systems ONOS, SoORAN, and RAMCloud Coordinator Can we design a generalized scalable control plane with A common foundaeon that can be customized for different contexts

Backup

Generalized and Customizable Scalable Control Plane A common foundaeon for all control planes A cluster of commodity servers High performance network: 10-100Gbps, speed of light latency Low latency RPC: ~5usec end to end latency Customized distributed state management Different consistency, replicaeon, and sharding models for different state Customizable for specific contexts Allow the specific system to pick distributed state managements App programming abstraceons/models/apis C/C++, DeclaraEve Programming, REST APIs Support different consistency models Can we get away with everything being strongly consistent?

Outline Example control planes: requirements and designs ONOS for service provider networks SoORAN for cellular RANs Coordinator for RAMCloud Other examples Opportunity for a unified scalable control plane What makes us believe this is possible What are the challenges High level architecture of a unified scalable control plane Common building blocks CustomizaEon Next steps

So#RAN: Data Plane AbstracCon Flow-Graph F with throughput and latency requirement Match: If Pkt à Flow 1 AcEon: Execute Flow-Graph F A 1 A 2 A 3 A 4 A 5 A n Flow-graph: the pipeline/graph of packet processing atoms that need to be executed on all data/packets matching the flow Flow is flexibly defined (e.g. IP flow, UE bearer, QoS class) Atoms are simple packet processing funceons (e.g. header compression, segmentaeon, FFT) Flow-graphs can have latency and compute throughput requirements associated with them

So#RAN: Control Plane AbstracCon Update(Node State) Update(Node State) Update(Node State) N 1 N 2 N 3 Distributed Message Passing N 4 N 5 N 6 Update(Node State) Update(Node State) Update(Node State) R Reduce(Global State) Update/Reduce on control-graphs that represent network/applicaeon state Control-graph is defined flexibly (e.g network graph with enbs, UEs and edges annotated with link strength, flow records, CDN graph with content nodes) Local update computaeon at each node in the graph, message passing on edges and global reduce funceons Simple abstraceon is sufficient to express a large range of control funceons

So#RAN: Expressing network services using flow graphs & control graphs A 1 A 2 Data/Control Interface Update(Node State) Update(Node State) Update(Node State) N 1 N 2 N 3 Distributed Message Passing A 3 A 4 A 5 A n N 4 N 5 Update(Node State) Update(Node State) Update(Node State) R N 6 Reduce(Global State) SoORAN OS (Compiler, scheduler & resource orchestrator) So#RAN OS: Plasorm to execute any combinaeon of flow and control graphs expressed in high level programs on COTS distributed hardware while meeeng latency & throughput requirements Macrocell Microcell Picocell Microcell Macrocell Picocell

Control Plane for Service Provider Networks ONOS Network state 1TB Changes more slowly Control ops 1-3M ops/sec Latency for processing 10 ms High availability So#RAN Network state MulEple TBs Changes every few ms Control ops 20 M ops/sec Latency for processing 1 ms High availability SoORAN offers opportuniees for simplificaeon and opemizaeons

RAMCloud Architecture 1000 100,000 ApplicaCon Servers Appl. Appl. Appl. Library Library Library Appl. Library High-speed networking: 5 µs round-trip Full biseceon bandwidth Commodity Servers Master Backup Master Backup Datacenter Network Master Backup 1000 10,000 Storage Servers Master Backup Coordinator Coordinator Standby External Storage (ZooKeeper) RAMCloud Control Plane 64-256 GB per server

Scalable Control Plane: Breakout Session Summary Heard posieve comments about the overall goals and project with a reminder that we are taking on something challenging general and customizable and high performance re obviously at odds with each other Scalable control planes for networking easier to see from IoT to RAN to other networks Are they in a different category compared to scalable control plane for distributed systems Examples of scalable control planes for distributed systems Schedulers: clusters and muleple clusters Scheduling of micro-services or associated workload

What is domain or context specific and what is domain/context independent Distributed state management is domain independent ApplicaEon abstraceons and APIs and interfaces to devices being controlled are domain specific What is the right model or paradigm for the domain independent part Distributed state management but there are other models Eme varying network graph as a data base and standing queries streaming database with ability to customize: consistency, replicaeon, sharding Complex event processing on streaming database ApplicaEon abstraceons and APIs DeclaraEve/easy to program