Developing cloud infrastructure from scratch: the tale of an ISP


Andrey Korolyov, Performix LLC, 22 October 2013

Architecture planning
- Scalable to infinity and beyond
- Fault-tolerant (not just words for a salesman)
- Easy to build... or not so easy
- Cheap commodity hardware
- Avoid modular design, because we are not building OpenStack
- Things that are tolerable in a private cloud are completely unacceptable in a public one

The beginning: no expertise but lots of enthusiasm
- Ceph for storage, simply because every other distributed FS was far worse for our needs
- Flat networking with L2 filtering via the nwfilter mechanism: keep all logic inside libvirt (sketched below)
- KVM virtualization because of its rich feature set
- Monitoring via collectd, because many components already have support for it
So it seems we just need to write a UI to build our cloud...
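A minimal sketch of the nwfilter approach with the libvirt Python bindings (the filter name is hypothetical; $MAC is a built-in nwfilter variable). A filter like this, referenced from a domain's interface definition via filterref, keeps the L2 anti-spoofing logic inside libvirt:

    # Sketch: define an nwfilter that drops outbound frames not coming
    # from the VM's own MAC address.
    import libvirt

    FILTER_XML = """
    <filter name='no-mac-spoofing-demo' chain='mac'>
      <rule action='return' direction='out'>
        <mac srcmacaddr='$MAC'/>
      </rule>
      <rule action='drop' direction='out'/>
    </filter>
    """

    conn = libvirt.open('qemu:///system')
    conn.nwfilterDefineXML(FILTER_XML)  # register the filter with libvirtd
    conn.close()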

The beginning: after testing it all together (mid 2012)
- Ceph (Argonaut) had very promising benchmarks but was unusable due to I/O freezes during rebalancing
- The RBD driver had a lot of leaks
- Libvirt had no OVS support and a lot of hard-to-catch leaks
- Eventually the first version of the orchestrator was launched

VM management
- All operations on VMs pass through a single local agent instance to ease the task of synchronizing different kinds of requests
- VMs live in cgroups: pinning and accounting (see the sketch below)
- A mechanism to eliminate leak effects via delayed OOM, similar to [a]
- Statistics go through preprocessing and compression on the node agent
a. https://lwn.net/articles/317814/
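A sketch of the pinning part using the libvirt Python bindings (domain name and host CPU count are hypothetical); libvirt already places every QEMU process in its own cgroup, and pinVcpu() adjusts the affinity inside it:

    # Sketch: pin each vCPU of a running guest to a fixed host core.
    import libvirt

    HOST_CPUS = 4  # hypothetical host core count

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('guest01')  # hypothetical domain

    for vcpu in range(dom.maxVcpus()):
        # One boolean per host CPU: vCPU i goes to host core i (mod count).
        cpumap = tuple(i == vcpu % HOST_CPUS for i in range(HOST_CPUS))
        dom.pinVcpu(vcpu, cpumap)
    conn.close()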

VM management, part 2
- A guest agent is necessary for roughly half of the tasks
- Emulator upgrades can be done via migration (sketched below)
- Automatic migration to meet guaranteed shareable-resource needs
- Rack-based management granularity
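An emulator upgrade via migration might look like this sketch with the libvirt Python bindings (host and domain names are hypothetical): the guest is live-migrated to a node that already runs the upgraded QEMU, which takes over the machine state.

    # Sketch: live-migrate a guest to a node with an upgraded emulator.
    import libvirt

    src = libvirt.open('qemu:///system')
    dst = libvirt.open('qemu+ssh://node02.example/system')  # upgraded node

    dom = src.lookupByName('guest01')
    # VIR_MIGRATE_LIVE keeps the guest running during the move.
    dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)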

A lot of features: implement, take a look, and throw them out
- Pinning with an exact match to the host topology
- A custom TCP congestion protocol
- Automatic memory shrinking/extending based on statistics
- Online CPU hot-add/hot-remove (sketched below)
- NUMA topology passthrough
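Online CPU hot-add, for instance, is a one-call operation in the libvirt Python bindings (domain name hypothetical), provided the guest was defined with spare vCPU slots:

    # Sketch: hot-add one vCPU to a running guest (requires maxvcpus to
    # be larger than the current vcpu count in the domain definition).
    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('guest01')

    current = dom.vcpusFlags(libvirt.VIR_DOMAIN_AFFECT_LIVE)
    dom.setVcpusFlags(current + 1, libvirt.VIR_DOMAIN_AFFECT_LIVE)
    conn.close()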

Performance versus needs
- Most kinds of regular workloads do not require strict NUMA tuning to get large performance improvements
- The same goes for THP: the boost is quite low
- The IDE driver supports discard operations, but its bandwidth is limited to about 70 Mb/s, so virtio is the only option (see the sketch below)
- BTRFS shows awesome benchmark results as a Ceph backend (zero-cost snapshot operations, workloads with tons of small operations) but is ultimately unstable
So none of these things went into our production.
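For illustration, a disk attached through virtio-scsi with discard enabled might be defined as in this sketch (volume path and domain name are hypothetical; the discard attribute needs a reasonably recent libvirt/QEMU and a virtio-scsi controller in the domain):

    # Sketch: hot-attach a virtio-scsi disk with discard enabled,
    # avoiding the slow IDE path.
    import libvirt

    DISK_XML = """
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' discard='unmap'/>
      <source dev='/dev/vg0/guest01-data'/>
      <target dev='sdb' bus='scsi'/>
    </disk>
    """

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('guest01')
    dom.attachDeviceFlags(DISK_XML, libvirt.VIR_DOMAIN_AFFECT_LIVE)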

Building on top of instabilities (end of 2012)
- XFS was quite unstable under CPU-intensive workloads
- Ceph daemons leak in many ways and have to be restarted eventually (see the watchdog sketch below)
- Nested per-thread cgroups for QEMU may hang the system
- Open vSwitch can be brought down by a relatively simple attack
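A hedged sketch of the kind of watchdog this forces you to run (the RSS threshold and restart command are hypothetical, and psutil is assumed to be available on the node): restart any ceph-osd whose resident set has grown past a limit.

    # Sketch: restart leaking Ceph OSD daemons before they exhaust the node.
    import subprocess
    import psutil  # assumed available

    RSS_LIMIT = 4 << 30  # 4 GiB, hypothetical threshold

    for proc in psutil.process_iter(['name', 'cmdline']):
        if proc.info['name'] != 'ceph-osd':
            continue
        if proc.memory_info().rss > RSS_LIMIT:
            args = proc.info['cmdline']
            osd_id = args[args.index('-i') + 1]  # e.g. "-i 12"
            subprocess.call(['service', 'ceph', 'restart', 'osd.%s' % osd_id])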

Memory hot-add
- ACPI hotplug is buggy but works
- Reserved kernel memory takes a lot when the high and low limits differ by an order of magnitude, so ballooning can be used only together with samepage merging
- Regular KSM does not have sufficiently fine tuning, and UKSM has memory allocation issues
- None of us was able to fix the memory compaction issues, so we switched to the ACPI hotplug driver
- Out of tree, requires some additional work to play with [a]
a. https://github.com/vliaskov/qemu-kvm/

    <qemu:commandline>
      <qemu:arg value='-dimm'/>
      <qemu:arg value='id=dimm0,size=256m,populated=on'/>
      ...
      <qemu:arg value='id=dimm62,size=256m,populated=off'/>
    </qemu:commandline>

Ceph
- Every release brings in a lot of awesome improvements, as with Linux itself
- And as with Linux, some work is required to make it rocket fast:
- Caches at every level
- Fast, low-latency links and precise time sync
- Rebalancing should be avoided as much as possible (see the sketch below)
- Combine storage and compute roles, but avoid resource exhaustion
- Fighting client latencies all the time
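Planned maintenance is a common source of avoidable rebalancing; this sketch sets the cluster-wide noout flag through the rados Python bindings before taking a node down (the conffile path is the stock default):

    # Sketch: set "noout" so a planned node reboot does not trigger
    # a rebalance of its placement groups.
    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ret, outbuf, outs = cluster.mon_command(
        json.dumps({'prefix': 'osd set', 'key': 'noout'}), b'')
    cluster.shutdown()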

Ceph, part 2
- Rotating media by itself is not enough to give proper latency ceilings for lots of read requests, so a large cache is required
- DM-Cache is the only solution that is pluggable, reliable, and present in the Linux tree all at once (sketched below)
- The cache helped us to put 99% of read latencies under 10 ms, down dramatically from 100 ms with only the HW controller cache
- Operation priorities are the most important thing
- With a 10x change in scale, the scope of the priority problem changed completely
- A lot of tunables have to be changed
- The less data a single node holds, the shorter the downtime (on peering) in case of its failure
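A hedged sketch of wiring an SSD cache in front of an OSD data device with dm-cache (all device paths are hypothetical; the table layout follows the kernel's device-mapper cache documentation):

    # Sketch: put a writeback dm-cache in front of a rotating OSD device.
    import subprocess

    ORIGIN = '/dev/sdb'             # hypothetical rotating data device
    CACHE = '/dev/ssd/cacheblocks'  # hypothetical SSD cache LV
    META = '/dev/ssd/cachemeta'     # hypothetical SSD metadata LV

    sectors = int(subprocess.check_output(['blockdev', '--getsz', ORIGIN]))

    # start length cache <meta> <cache> <origin> <block size in sectors>
    # <#feature args> <features> <policy> <#policy args>
    table = '0 %d cache %s %s %s 512 1 writeback default 0' % (
        sectors, META, CACHE, ORIGIN)
    subprocess.check_call(['dmsetup', 'create', 'osd0-cached',
                           '--table', table])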

VM scalability
- CPU and RAM cannot be scaled beyond one node without RDMA, and then only for very specific NUMA-aware tasks
- Passing a variable NUMA topology to a VM is impossible to date, so Very Large VMs are not in place
- A VM can easily be cloned, and a new instance with rewritten network settings can be spun up (under a balancer); see the sketch below
- Rebalancing unshareable or exhausted resources can be done via migration, but even live migration is visible to the end user; waiting for RDMA migration to land in mainline
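Cloning is cheap at the storage layer thanks to RBD copy-on-write clones; a sketch with the rbd Python bindings (pool and image names are hypothetical):

    # Sketch: snapshot a VM image and spin up a COW clone from it.
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('vmpool')  # hypothetical pool

    img = rbd.Image(ioctx, 'guest01-disk')
    img.create_snap('base')
    img.protect_snap('base')  # clones require a protected snapshot
    img.close()

    # The clone shares all unmodified blocks with the parent snapshot;
    # features=1 enables layering, which clones need.
    rbd.RBD().clone(ioctx, 'guest01-disk', 'base', ioctx, 'guest02-disk',
                    features=1)

    ioctx.close()
    cluster.shutdown()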

Selecting a networking platform
- Putting effort into rsockets for Ceph
- The storage network may remain flat
- Switch hardware should run open-source firmware
- There are a lot of options on the market today: Arista, HP, NEC, Cisco, Juniper, even Mellanox... yet none of them offers an open-source platform
- We chose the Pica8 platform because it is managed by OVS and can be modified easily

Networking gets an upgrade
- With standalone OVS we would need to build a GRE mesh (sketched below) or put a VLAN on each interface to make private networking work properly
- Lots of broadcast traffic in the public network segment
- Traffic pattern analysis for preventing abuse/spam can hardly be applied to regular networking
- OpenFlow 1.0+ fulfills all those needs, so we made the necessary preparations and switched at once
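The GRE mesh alternative means one tunnel port per pair of hypervisors; a sketch for a single host (bridge name and peer addresses are hypothetical):

    # Sketch: build this host's side of a full GRE mesh by adding one
    # tunnel port per remote hypervisor to the integration bridge.
    import subprocess

    BRIDGE = 'br-int'                             # hypothetical bridge
    PEERS = ['10.0.0.2', '10.0.0.3', '10.0.0.4']  # other hypervisors

    for i, ip in enumerate(PEERS):
        port = 'gre%d' % i
        subprocess.check_call([
            'ovs-vsctl', 'add-port', BRIDGE, port, '--',
            'set', 'interface', port, 'type=gre',
            'options:remote_ip=%s' % ip,
        ])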

OpenFlow
- FloodLight was selected to minimize integration complexity
- All OpenFlow logic derives from the existing VM-VM and MAC-VM relations (see the sketch below)
- Future extensions like overlay networking can be built with very little effort
- Synchronization between multiple controllers is still not finished
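As an illustration of how thin that layer can be, pushing a MAC-based forwarding rule derived from a MAC-VM mapping through Floodlight's Static Flow Pusher REST API might look like the sketch below (controller address, DPID, and port are hypothetical; the endpoint and field names follow the 2013-era module and should be verified against the Floodlight version in use; the requests library is assumed):

    # Sketch: install a static flow forwarding a VM's MAC out of a port.
    import json
    import requests

    FLOW = {
        'switch': '00:00:00:00:00:00:00:01',  # hypothetical DPID
        'name': 'vm-guest01-out',
        'priority': '100',
        'src-mac': '52:54:00:12:34:56',       # the VM's MAC
        'active': 'true',
        'actions': 'output=2',                # port toward the VM
    }

    requests.post('http://controller:8080/wm/staticflowentrypusher/json',
                  data=json.dumps(FLOW))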

Software storage and software networking
- Due to the nature of Ceph, VMs can be migrated and placed anywhere using snapshot+flatten (and hopefully something easier in the future); see the sketch below
- Any VMs may be grouped easily
- The number of overlay networking rules that need to be built drops dramatically, up to IDC scale
- Only filtering and forwarding tasks remain at rack scale for OF
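The snapshot+flatten trick, sketched with the rbd Python bindings (names are hypothetical): flattening copies the remaining shared blocks, after which the clone no longer depends on its parent and can be placed anywhere.

    # Sketch: detach a cloned VM image from its parent.
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('vmpool')  # hypothetical pool

    img = rbd.Image(ioctx, 'guest02-disk')
    img.flatten()  # copy all blocks still shared with the parent snapshot
    img.close()

    ioctx.close()
    cluster.shutdown()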

Current limitations in Linux
- No automatic ballooning (see the sketch below)
- No reliable online memory shrinking mechanism outside of ballooning
- THP does not work with virtual memory overcommit
- Online partition resizing arrived only in 2012, so almost no distro supports it out of the box
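So ballooning remains a manual, orchestrator-driven affair; a sketch with the libvirt Python bindings (domain name and target are hypothetical):

    # Sketch: balloon a running guest down to a new memory target by hand,
    # since no automatic ballooning exists.
    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('guest01')

    target_kib = 2 * 1024 * 1024  # hypothetical 2 GiB target
    dom.setMemoryFlags(target_kib, libvirt.VIR_DOMAIN_AFFECT_LIVE)
    conn.close()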

Future work
- Advanced traffic engineering on top of a fat-tree Ethernet segment, using the Pica8 3295 as ToR and the 3922 at the second level
- Extending the Ceph experience (rsockets and driver policy improvements)
- Implementation of a zero-failure autoballooning mechanism

Any questions?