RDMA Migration and RDMA Fault Tolerance for QEMU


RDMA Migration and RDMA Fault Tolerance for QEMU
Michael R. Hines, IBM (mrhines@us.ibm.com)
http://wiki.qemu.org/features/rdmalivemigration
http://wiki.qemu.org/features/microcheckpointing

Migration design / policy problems
- Admins want to evacuate VMs of any size: tens to hundreds of GB
- Arbitrary storage configurations
- Arbitrary processor configurations
- Management software is still pretty blind: is the VM idle? Busy? Full of zeros? Mission critical?
- System software wants to have its cake and eat it too

High Availability design / policy problems
- Customers don't (yet) want to re-architect: "just make my really big JVM or my honking DBMS run"
- "Don't ask me any questions about the workload" (but they are willing to tell you high-level workload characteristics)
- "Co-workers keep telling me security is important, so I probably don't want any extreme visibility anyway"
- Customers do *want* high availability, but they don't really trust us (much):
  - They think we're really good at *running* workloads, but are not so sure we're good at *moving* workloads
  - They also don't want to re-architect, and don't want to put all their eggs in one basket
  - They are still very willing to spend money on mainframes
- PaaS is not a panacea, but it has the potential to be

State of the art
Migration:
- 40 Gbps *Ethernet* hardware is already on the market; faster hardware is not far away
- Can we do TCP at 40 Gbps? Sure, at 100% CPU, and that's not good
Fault Tolerance (a.k.a. micro-checkpointing / continuous replication):
- Multiple hypervisors are introducing it: Remus on Xen, VMware lockstep, Marathon / Stratus everRun
- Where is KVM in all this HA activity?

Unsolved Needs
- A smaller active set of physical servers: evacuate low-utilization servers
- Higher utilization for active servers: lower headroom requires rapid rebalancing when workload spikes
- A decrease in energy use: rapidly adjust the set of online servers to load variation while preserving SLAs
- Customers buy/install servers for peak loads: fast VM migration allows dynamic sizing to the actual load

[Figure: 35 IBM proprietary VMs (WebSphere, DB2) sampled every 5 minutes over 24 hours]

RDMA Migration Challenges
RDMA usage:
- Memory must be registered on both sides
- Small RDMA transfers are kind of useless
- IB programming is like IPv6: a necessary evil, but proven technology
- RDMA over Ethernet products are available from multiple vendors
QEMU:
- Avoid changing the QEMUFile abstraction (it has many users)
- Avoid re-architecting savevm.c / arch_init.c
- Avoid mixing TCP and RDMA protocol code: they are really mutually exclusive programming paradigms (one gives you a bytestream and one does not)
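As a reminder of what "memory must be registered" means in practice, here is a minimal libibverbs sketch. This is not QEMU's actual code; register_ramblock(), ramblock_host_addr, and ramblock_len are placeholder names for one guest RAMBlock's host mapping.

/* Sketch only: register one RAMBlock-sized region with the HCA. */
#include <infiniband/verbs.h>
#include <stdio.h>

static struct ibv_mr *register_ramblock(struct ibv_pd *pd,
                                        void *ramblock_host_addr,
                                        size_t ramblock_len)
{
    /* Registration pins the pages and gives the HCA a translation for
     * them. It is expensive, which is why tiny RDMA transfers rarely pay
     * off and why registration is best done per-RAMBlock (or on demand),
     * not per-page. */
    struct ibv_mr *mr = ibv_reg_mr(pd, ramblock_host_addr, ramblock_len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr) {
        perror("ibv_reg_mr");
    }
    return mr;   /* mr->lkey / mr->rkey are what the transfer verbs use */
}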

Results

RDMA Fault Tolerance Status
Slow going, but steady: asynchronous micro-checkpointing + I/O buffering + RDMA.
[Diagram: Node 1 hosts the protected app server VM and protected DB server VM; Node 2 hosts the failover app server VM and failover DB server VM; checkpointed TCP/IP traffic flows from Node 1 to Node 2]
Challenge                                   Status
#1 Memory / CPU                             Done, ~10-20% penalty
#2 Network buffering                        Done, ~50% penalty
#3 Storage buffering                        In progress
#4 RDMA acceleration for I/O and memory     In plan
#5 Application awareness                    Not yet in plan

RDMA Fault Tolerance Challenges
Sub-millisecond FT requires consistent I/O:
- Outbound network must be buffered
- Outbound storage must be mirrored
Expensive identification of dirty memory:
- Multiple KVM log_dirty calls => QEMU bitmap
- Multiple QEMU bitmaps => QEMU bytemap (the bytemap is also used by VGA, PCI, and others)
- Bytemap => single bitmap
- Streaming through the bitmap takes 60 milliseconds just to identify dirty memory on a 16 GB VM! Ahhhhh!
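To make that cost concrete, here is a small illustrative C sketch of folding a per-page bytemap into a bitmap. The names (dirty_bytemap, dirty_bitmap, PAGE_SIZE_4K) are hypothetical, not QEMU's real dirty-tracking structures, but the per-page walk is exactly what makes a 16 GB guest (about 4 million 4 KiB pages) cost tens of milliseconds per checkpoint.

#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE_4K 4096

/* Fold a per-page bytemap (one byte per page, non-zero == dirty) into a
 * packed bitmap (one bit per page) that the transmit path can stream. */
static size_t bytemap_to_bitmap(const uint8_t *dirty_bytemap,
                                unsigned long *dirty_bitmap,
                                size_t nr_pages)
{
    size_t dirty = 0;

    for (size_t page = 0; page < nr_pages; page++) {
        if (dirty_bytemap[page]) {
            dirty_bitmap[page / (8 * sizeof(unsigned long))] |=
                1UL << (page % (8 * sizeof(unsigned long)));
            dirty++;
        }
    }
    return dirty;   /* pages that must go into this checkpoint */
}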

Network Barriers: how does it work with Micro-Checkpointing (MC)?
[Diagram: inside QEMU, the protected VM's virtio frontend talks to the virtio backend (optionally vhost) in Linux/KVM; outbound TX traffic from the tap device is redirected into an IFB device, where a qdisc buffers it under control signals from the MC thread and MC buffer; checkpoints travel to the backup over TCP or RDMA, and released outbound traffic leaves via the software bridge]

Micro-Checkpointing barometer: where do the costs come from?
Average checkpoint size < 5 MB.
[Chart, 0-25 ms scale: per-checkpoint time broken down into GET_LOG_DIRTY + bytemap (VM stopped), prepare bytemap => bitmap (VM stopped), local copy to staging (VM stopped), and transmit (VM running); shown for normal bytemap prep, an idle VM (parallel), and a projected "skip bytemap" case]

Support for Overcommitment
Very sensitive topic =)
- Myth: sub-second, sub-working-set overcommitment
- Reality: sub-day, sub-hour working-set growth and contraction
RDMA Migration:
- Overcommit on both sides of the connection?
- Again: does your management software know anything? Lots of zero pages? Is the VM idle? Critical?
- Why can't our management software have APIs that libvirt, or maybe OpenStack, can share?
- Does the policy engine know when to use RDMA?
RDMA Fault Tolerance:
- Checkpoints are arbitrary in size: double the VM in the worst case
- Checkpoints also need to be compatible with RDMA

When *should* you use RDMA?
- You (or your policy management) are dealing with a *known* memory-bound workload that does not converge with TCP: migration time = O(dirty rate), where dirty rate > TCP bandwidth
- You have sufficient free memory available on the destination host
- Your source host cannot tolerate stealing CPU cycles from VMs for TCP traffic
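A back-of-the-envelope way to state the convergence condition, with made-up numbers rather than measured ones:

#include <stdbool.h>

/* Pre-copy migration only converges if the link drains dirty pages
 * faster than the guest produces them. */
static bool precopy_can_converge(double dirty_rate_gbps,
                                 double effective_link_gbps)
{
    return dirty_rate_gbps < effective_link_gbps;
}

/* Example: a busy DBMS dirtying ~30 Gbit/s of pages over a TCP path that
 * sustains ~20 Gbit/s at acceptable CPU cost never converges (30 > 20);
 * the same guest over a 40 Gbit/s RDMA path does. */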

When *not* to use RDMA Migration?
1. Your VM is idle: migration time = O(bulk transfer)
2. Your VM is young (full of zeros): migration time = O(almost nothing)
3. Your known-memory-bound application does not converge AT ALL: migration time = O(infinity); it is so heavy that you need the auto-converge capability (new in QEMU 1.6)

When should you use FT?
1. You have no idea what the workload is, but someone told you (i.e., paid) that it is important
2. Or you know what the workload is, but its memory content is mission-critical
3. I/O slowdown is not a big deal
4. You must have high availability
5. ...without re-architecting your workload

When should you *not* use FT?
1. Your workload can already survive a restart
2. Your crashes are caused by *guest* bugs, not host bugs or hardware failures
3. Your workload is stateless
4. Your workload cannot tolerate poor I/O
5. Your customer is willing to tightly integrate with the family of HA tools: HACMP / Pacemaker / Linux-HA / Corosync / ZooKeeper... the list goes on

Technical Summary of QEMU Changes for RDMA (merged, part 1)
- RDMA knows what a RAMBlock is
- New QEMUFileOps pointers:
  - save_page(): allows the source to override the transmission of a page
  - before_ram_iterate(): allows the source to perform additional initialization steps *before* each migration iteration
  - after_ram_iterate(): a similar source override, but at the end of each iteration
  - hook_ram_load(): allows the destination to override the *receipt* of memory during each iteration

Technical Summary of QEMU Changes for RDMA (merged, part 2)
- QEMUFile operations: translated into IB Send/Recv messages
- qemu_savevm_state_{begin,iterate,complete}(): invoke the hooks above (before_ram / after_ram); the hooks instruct the destination to handle memory registration on demand
- ram_save_block(): invokes save_page(); if RDMA is not compiled in or not enabled, we fall back to TCP
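A paraphrased sketch of those hook points and the TCP fallback; the types, names, and the SAVE_PAGE_NOT_SUPPORTED sentinel are illustrative, and the exact QEMU signatures differ.

#include <stddef.h>
#include <stdint.h>

typedef struct QEMUFile QEMUFile;

typedef struct MigrationHooks {
    /* transmit one page; return <0 on error, 0 if handled,
     * or a "not supported" code to request TCP fallback */
    int (*save_page)(QEMUFile *f, void *opaque,
                     uint64_t block_offset, uint64_t offset, size_t size);
    int (*before_ram_iterate)(QEMUFile *f, void *opaque);
    int (*after_ram_iterate)(QEMUFile *f, void *opaque);
    int (*hook_ram_load)(QEMUFile *f, void *opaque, uint64_t flags);
} MigrationHooks;

#define SAVE_PAGE_NOT_SUPPORTED 1   /* hypothetical sentinel */

static int save_one_page(MigrationHooks *hooks, QEMUFile *f, void *opaque,
                         uint64_t block_offset, uint64_t offset, size_t size)
{
    if (hooks && hooks->save_page) {
        int ret = hooks->save_page(f, opaque, block_offset, offset, size);
        if (ret != SAVE_PAGE_NOT_SUPPORTED) {
            return ret;             /* RDMA path handled it (or failed) */
        }
    }
    /* no hook, or hook declined: fall back to the ordinary
     * TCP/bytestream page transmit here */
    return 0;
}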

(Implemented) FT Technical Summary (2)
- Checkpoint size = O(2 * RAM) in the worst case
- We cannot allow QEMU to hog that much RAM, and cannot register that much with RDMA anyway
- Solution: split the checkpoint into slabs (~5 MB each), grow and evict slabs as needed, and register slabs only as they are needed
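A rough sketch of the slab scheme under the stated assumptions (~5 MB slabs, lazy allocation, eviction); names and structure are illustrative, not the actual MC implementation.

#include <stdlib.h>
#include <stdint.h>

#define SLAB_SIZE (5 * 1024 * 1024)   /* ~5 MB per slab, as on the slide */

typedef struct Slab {
    uint8_t *buf;                     /* staging memory for checkpoint data */
    struct Slab *next;
} Slab;

/* Grow the slab list until it can hold `needed` bytes of checkpoint data. */
static int ensure_capacity(Slab **head, size_t *capacity, size_t needed)
{
    while (*capacity < needed) {
        Slab *s = calloc(1, sizeof(*s));
        if (!s || !(s->buf = malloc(SLAB_SIZE))) {
            free(s);
            return -1;
        }
        s->next = *head;
        *head = s;
        *capacity += SLAB_SIZE;       /* real code would also RDMA-register s->buf here */
    }
    return 0;
}

/* Shrink back down when a run of checkpoints stays small (eviction). */
static void evict_excess(Slab **head, size_t *capacity, size_t needed)
{
    while (*head && *capacity - SLAB_SIZE >= needed) {
        Slab *s = *head;
        *head = s->next;
        free(s->buf);
        free(s);
        *capacity -= SLAB_SIZE;
    }
}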

(Implemented) FT Technical Summary (2)
Network output commit problem: the Xen community has released its network buffering solution to the netlink community.
- Output packets leave the VM and end up in the tap device
- The tap device dumps packets into an IFB device, which turns inbound packets into outbound packets
- A plug qdisc is inserted on the IFB device
- Checkpoints are coordinated with the plug and unplug operations

Not implemented: FT storage mirroring
- The plan is to activate QEMU drive-mirror (drive-mirror would also need to be RDMA-capable)
- Local I/O blocks go to disk unfettered
- Mirrored blocks are held until the checkpoint completes, then released
- This was chosen over shared storage, for performance reasons and because of ordering constraints
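A sketch of the hold-and-release idea for mirrored blocks, with hypothetical types; it only illustrates the output-commit rule that mirrored writes may not reach the backup's disk before the checkpoint covering them is acknowledged, and is not QEMU's drive-mirror code.

#include <stddef.h>
#include <stdint.h>

typedef struct MirroredWrite {
    uint64_t offset;
    size_t   len;
    void    *data;
    struct MirroredWrite *next;
} MirroredWrite;

static MirroredWrite *pending;        /* writes held for the current epoch */

/* Called for every guest write after the local disk has accepted it. */
static void queue_mirrored_write(MirroredWrite *w)
{
    w->next = pending;
    pending = w;
}

/* Called only after the backup acknowledges the matching checkpoint. */
static void release_mirrored_writes(void (*send_to_backup)(MirroredWrite *))
{
    while (pending) {
        MirroredWrite *w = pending;
        pending = w->next;
        send_to_backup(w);
    }
}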

Drive-mirror + MC
[Diagram: on the protected side, the VM's virtio frontend/backend in QEMU issues direct I/Os through drive-mirror to local storage, while the MC thread, MC buffer, and control signals send mirrored writes and checkpoints (over TCP or RDMA) to the backup QEMU; on the backup side, a second drive-mirror instance applies the buffered I/Os to its own local storage]

Thanks!

Backup Slides

OpenStack Cloud: Possible Management Workflow
Challenges:
- The REST command does not return after FT is initiated, so a permanent thread must get kicked off
- OpenStack has no concept (yet) of VM lifecycle management or failure
- OpenStack would also need storage-level compatibility with our FT storage approach
- Who owns the storage after a failure? How are cluster resources recalculated? How is MySQL kept consistent?
[Diagram: a controller node ("$ > ./cb vmprotect", "$ > nova migrate") drives, through the messaging service, MySQL, nova-compute, and libvirt on Compute Node 1 and Compute Node 2, a protected QEMU on node 1 and a failover QEMU on node 2 that takes over on failure]

Sequence of Events (sender and receiver, time flowing downward):
1-a) Receiver: ACK / ready for next checkpoint            (sender VM running)
1-b) Sender:   insert I/O barrier A
2)   Sender:   stop virtual machine
3)   Sender:   capture checkpoint                         (VM not running)
4)   Sender:   insert I/O barrier B
5)   Sender:   resume virtual machine                     (VM running)
6)   Sender:   transmit checkpoint
7)   Receiver: receive checkpoint
8)   Receiver: transmit ACK of checkpoint
9-a) Receiver: apply checkpoint
10-b) Sender:  receive ACK
11)  Sender:   release I/O barrier A (I/O barrier B will release on the next checkpoint)
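A minimal C sketch of the sender-side loop implied by this sequence; every helper below (io_barrier_plug, vm_stop, capture_checkpoint, and so on) is a hypothetical stand-in for the real QEMU/MC primitives, and only the ordering is the point.

#include <stdbool.h>

extern void io_barrier_plug(int epoch);      /* steps 1-b and 4 */
extern void io_barrier_release(int epoch);   /* step 11         */
extern void vm_stop(void);                   /* step 2          */
extern void capture_checkpoint(void);        /* step 3          */
extern void vm_start(void);                  /* step 5          */
extern void transmit_checkpoint(void);       /* step 6          */
extern void wait_for_receiver_ack(void);     /* steps 8 / 10-b  */

static void mc_sender_loop(volatile bool *running)
{
    int epoch = 0;                      /* barrier A = epoch, B = epoch + 1 */

    io_barrier_plug(epoch);             /* 1-b: insert I/O barrier A        */
    while (*running) {
        vm_stop();                      /* 2                                 */
        capture_checkpoint();           /* 3: copy dirty state while stopped */
        io_barrier_plug(epoch + 1);     /* 4: insert I/O barrier B           */
        vm_start();                     /* 5: VM runs during transmission    */
        transmit_checkpoint();          /* 6                                 */
        wait_for_receiver_ack();        /* 10-b: receiver has the checkpoint */
        io_barrier_release(epoch);      /* 11: barrier A's output may flow   */
        epoch++;                        /* barrier B releases next round     */
    }
}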
