Xenwatch Multithreading

Xenwatch Multithreading
Dongli Zhang, Principal Member of Technical Staff, Oracle Linux
http://donglizhang.org

domu creation failure: problem

# xl create hvm.cfg
Parsing config from hvm.cfg
libxl: error: libxl_device.c:1080:device_backend_callback: Domain 2:unable to add device with path /local/domain/0/backend/vbd/2/51712
libxl: error: libxl_create.c:1290:domcreate_launch_dm: Domain 2:unable to add disk devices
libxl: error: libxl_device.c:1080:device_backend_callback: Domain 2:unable to remove device with path /local/domain/0/backend/vbd/2/51712
libxl: error: libxl_domain.c:1097:devices_destroy_cb: Domain 2:libxl devices_destroy failed
libxl: error: libxl_domain.c:1000:libxl__destroy_domid: Domain 2:Non-existant domain
libxl: error: libxl_domain.c:959:domain_destroy_callback: Domain 2:Unable to destroy guest
libxl: error: libxl_domain.c:886:domain_destroy_cb: Domain 2:Destruction of domain failed

Reported by: https://lists.xenproject.org/archives/html/xen-devel/2016-06/msg00195.html
Reproduced by: http://donglizhang.org/xenwatch-stall-vif.patch

domu creation failure: observation

An incomplete prior domu destroy stalled the xenwatch thread in D state; xenwatch hangs at kthread_stop().

dom0# xl list
Name        ID  Mem  VCPUs  State   Time(s)
Domain-0     0  799      4  r-----     50.2
(null)       2    0      2  --p--d     24.8

dom0# ps 38
PID  TTY  STAT  TIME  COMMAND
38    ?   D     0:00  [xenwatch]

dom0# cat /proc/38/stack
[<0>] kthread_stop
[<0>] xenvif_disconnect_data
[<0>] set_backend_state
[<0>] frontend_changed
[<0>] xenwatch_thread
[<0>] kthread
[<0>] ret_from_fork
[<0>] 0xffffffff

domu creation failure: cause

The vif1.0-q0-dealloc thread cannot stop while inflight packets remain on the netback vif. The vif1.0 statistics show sent > success + fail: an sk_buff is still held by some other kernel component!

static bool xenvif_dealloc_kthread_should_stop(struct xenvif_queue *queue)
{
	/* Dealloc thread must remain running until all inflight
	 * packets complete.
	 */
	return kthread_should_stop() &&
	       !atomic_read(&queue->inflight_packets);
}

# ethtool -S vif1.0
NIC statistics:
  rx_gso_checksum_fixup: 0
  tx_zerocopy_sent: 72518
  tx_zerocopy_success: 0
  tx_zerocopy_fail: 72517
  tx_frag_overflow: 0

xen-netback zerocopy (normal path)

1. Data is grant-mapped from the DomU (xen-netfront) into Dom0 as an sk_buff.
2. xen-netback increments the inflight packet count and forwards the sk_buff to the NIC driver.

xen-netback zerocopy (failure path)

1. Data is grant-mapped from the DomU (xen-netfront) into Dom0 as an sk_buff.
2. xen-netback increments the inflight packet count and forwards the sk_buff to the NIC driver.
3. The NIC driver does not release the grant mapping correctly!
4. xenwatch stalls on the remaining inflight packet (unmapped grant) when removing the xen-netback vif interface.

domu creation failure: workaround?

Workaround mentioned on xen-devel: https://lists.xenproject.org/archives/html/xen-devel/2016-06/msg00195.html

dom0# ifconfig ethx down
dom0# ifconfig ethx up

This resets the DMA buffer and unmaps the inflight memory pages granted by the domu netfront.

xenwatch stall extra cases: prerequisite

The xvda-0 kthread in Dom0 (xen-blkback):
1. Maps data from the DomU application via xen-blkfront.
2. Encapsulates the request as a new bio.
3. Submits the bio to a Dom0 block device: loop block (on nfs, iscsi, glusterfs or more), file system / device mapper, iscsi, nvme, ...

xenwatch stall extra case 1

xenwatch:
[<0>] kthread_stop
[<0>] xen_blkif_disconnect
[<0>] xen_blkbk_remove
[<0>] xenbus_dev_remove
[<0>] device_release_driver
[<0>] device_release_driver
[<0>] bus_remove_device
[<0>] device_del
[<0>] device_unregister
[<0>] frontend_changed
[<0>] xenbus_otherend_changed
[<0>] frontend_changed
[<0>] xenwatch_thread
[<0>] kthread
[<0>] ret_from_fork

xvda-0 hangs waiting for an idle block mq tag:
[<0>] bt_get
[<0>] blk_mq_get_tag
[<0>] blk_mq_alloc_request
[<0>] blk_mq_map_request
[<0>] blk_sq_make_request
[<0>] generic_make_request
[<0>] submit_bio
[<0>] dispatch_rw_block_io
[<0>] do_block_io_op
[<0>] xen_blkif_schedule
[<0>] kthread
[<0>] ret_from_fork

Lack of free mq tags due to: loop device, nfs, iscsi, ocfs2, or more block/fs/storage issues...

xenwatch stall extra case 2

xenwatch:
[<0>] gnttab_unmap_refs_sync
[<0>] free_persistent_gnts
[<0>] xen_blkbk_free_caches
[<0>] xen_blkif_disconnect
[<0>] xen_blkbk_remove
[<0>] xenbus_dev_remove
[<0>] device_release_driver
[<0>] bus_remove_device
[<0>] device_unregister
[<0>] frontend_changed
[<0>] xenbus_otherend_changed
[<0>] frontend_changed
[<0>] xenwatch_thread
[<0>] kthread
[<0>] ret_from_fork

When disconnecting a xen-blkback device, xenwatch waits until all inflight persistent grant pages are reclaimed:

static void gnttab_unmap_refs_async(...)
{
	...
	for (pc = 0; pc < item->count; pc++) {
		if (page_count(item->pages[pc]) > 1) {
			/* delay grant unmap operation */
			...
		}
	}
	...
}

The page_count is invalid because the page is erroneously held by the iscsi or storage driver.

xenwatch stall symptoms

- (null) domu left in xl list
- xenwatch stalls at a xenstore update callback
- DomU creation/destroy failure
- Device hotplug failure
- Incomplete live migration on the source dom0
- Rebooting dom0 is the only option (if no workaround is available)

More Impacts

- DomU = application: many more domu running concurrently
- NFV: network functions are set up and torn down quickly
- In these scenarios the problem is much more severe... "Let's give up Xen! Xen developers are fired!"

xen paravirtual driver framework

DomU: Guest Application -> Networking Stack -> xen-netfront driver
Dom0: xen-netback driver -> Bridging/Routing -> Networking Stack -> Physical NIC Driver -> Physical NIC
Xen Hypervisor: Grant Table, Event Channel, Xenbus/Xenstore

Paravirtual vs. PCI

                      PCI Driver              Xen Paravirtual Driver
device discovery      pci bus                 xenstore
device abstraction    pci_dev / pci_driver    xenbus_device / xenbus_driver
device configuration  pci bar/capability      xenstore
shared memory         N/A or IOMMU            grant table
notification          interrupt               event channel

device init and config

A motherboard is hardware with many slots: the Physical NIC, Physical Disk, Physical CPU and Physical DIMM plug into slots and appear on the pci bus (struct pci_dev / struct pci_driver).

Xenstore is the analogue for paravirtual devices: a Dom0 software daemon and database for all guests. The Virtual NIC, Virtual Disk, Virtual CPU and Virtual Memory insert/update entries in xenstore and appear on the xenbus bus (struct xenbus_device / struct xenbus_driver).

dom0# xenstore-ls
local = ""
 domain = ""
  0 = ""
   name = "Domain-0"
   device-model = ""
    0 = ""
     state = "running"
   memory = ""
    target = "524288"
    static-max = "524288"
    freemem-slack = "1254331"
   libxl = ""
    disable_udev = "1"
vm = ""
libxl = ""

xenstore and xenwatch

- Watch a xenstore node with a callback
- The callback is triggered when the xenstore node is updated
- Both the dom0/domu kernel and the toolstack can watch/update xenstore

1. Dom0 watches /local/domain/0/backend/; DomU watches /local/domain/7/device.
2. The toolstack inserts entries into xenstore:
   /local/domain/0/backend/<device>/<domid>/
   /local/domain/7/device/<device>/<domid>/
   ...
3. Notification: Dom0 creates the backend device; DomU creates the frontend device.
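The watch semantics above can be modeled as a toy userspace sketch (register_watch, fire_watch, toy_watch are made-up names, not the kernel API): a watch registered on a node fires for an update of the node itself and of any path beneath it.

```c
#include <string.h>

#define MAX_WATCHES 8

/* A registered watch: fires for the node and all of its descendants. */
struct toy_watch {
    const char *node;
    void (*callback)(const char *path);
};

static struct toy_watch watches[MAX_WATCHES];
static int nr_watches;

static int toy_hits;  /* counts callback invocations, for demonstration */
static void toy_count_cb(const char *path) { (void)path; toy_hits++; }

static void register_watch(const char *node, void (*cb)(const char *))
{
    watches[nr_watches].node = node;
    watches[nr_watches].callback = cb;
    nr_watches++;
}

/* Returns 1 if path is the node itself or lives underneath it. */
static int watch_matches(const char *node, const char *path)
{
    size_t n = strlen(node);
    return strncmp(path, node, n) == 0 &&
           (path[n] == '\0' || path[n] == '/');
}

/* Called when a xenstore node is updated: run every matching callback. */
static int fire_watch(const char *path)
{
    int fired = 0;
    for (int i = 0; i < nr_watches; i++) {
        if (watch_matches(watches[i].node, path)) {
            watches[i].callback(path);
            fired++;
        }
    }
    return fired;
}
```

With a watch on /local/domain/0/backend, an update of /local/domain/0/backend/vif/7/0/state fires the callback, while an update under /local/domain/7/device does not.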

xenwatch with single thread

xenbus_thread appends a new watch event to the list; xenwatch_thread processes watch events from the list:

1. The xenstore ring buffer / xenstore channel wakes up the xenbus kthread.
2. The xenbus kthread reads the watch details and appends the event to a global list.
3. The xenwatch kthread processes the events: backend_changed(), frontend_changed(), handle_vcpu_hotplug_(), ...

struct xenbus_watch be_watch = {
	.node     = "backend",
	.callback = backend_changed,
};
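The single-consumer design above can be sketched as a toy userspace producer/consumer (made-up names, pthreads instead of kthreads): one xenbus-like producer appends events to a global list and a single xenwatch-like thread handles them strictly one at a time, which is why one blocking callback stalls every later event.

```c
#include <pthread.h>
#include <stdlib.h>

/* A queued watch event, analogous to an entry on the xenwatch list. */
struct watch_event {
    int id;
    struct watch_event *next;
};

static struct watch_event *head;  /* pending events (LIFO for brevity;
                                     the kernel uses a FIFO list) */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int processed;             /* events handled so far */
static int done;                  /* producer finished flag */

/* Producer side: like xenbus_thread appending a watch event. */
static void queue_event(int id)
{
    struct watch_event *ev = malloc(sizeof(*ev));
    ev->id = id;
    pthread_mutex_lock(&lock);
    ev->next = head;
    head = ev;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

/* Consumer side: like xenwatch_thread, events are handled serially. */
static void *xenwatch_like_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (!head && !done)
            pthread_cond_wait(&cond, &lock);
        if (!head && done) {
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        struct watch_event *ev = head;
        head = ev->next;
        pthread_mutex_unlock(&lock);

        /* The callback runs outside the lock, but still serially:
         * if it blocks (e.g. in kthread_stop()), all later events wait. */
        processed++;
        free(ev);
    }
}
```

Because there is exactly one consumer, progress of every watch event depends on the previous callback returning, which is the root cause of the stalls described earlier.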

Xenwatch Multithreading Solution: create a per-domu xenwatch kernel thread on dom0 for each domid.

solution: challenges

- When to create/destroy the per-domu xenwatch thread?
- How to calculate the domid given a xenstore path?
- Split global locks into per-thread locks

xenwatch path                        watched node
/local/domain/1/device/vif/0/state   /local/domain/1/device/vif/0/state
backend/vif/1/0/hotplug-status       backend
backend/vif/1/0/state                backend

solution: domu create/destroy 1/2

xl create vm.cfg (dom0# xenstore-watch /, abridged):
/local/domain/7
/vm/612c6d38-fd87-4bb3-a3f5-53c546e83674
/libxl/7
@introducedomain
/libxl/7/dm-version
/libxl/7/device/vbd/51712
/libxl/7/device/vbd/51712/frontend
/libxl/7/device/vbd/51712/backend
/local/domain/0/device-model/7
/local/domain/7/device/vbd/51712
/local/domain/0/backend/vif/7/0/frontend-id
/local/domain/0/backend/vif/7/0/online
/local/domain/0/backend/vif/7/0/state
/local/domain/0/backend/vif/7/0/script
/local/domain/0/backend/vif/7/0/mac
...

xl destroy 7 (dom0# xenstore-watch /, abridged):
/local/domain/0/backend/vkbd
/vm/612c6d38-fd87-4bb3-a3f5-53c546e83674
/local/domain/7
/libxl/7
@releasedomain
...

solution: domu create/destroy 2/2

- creation: watch @introducedomain
- destroy: watch @releasedomain
- list /local/domain via XS_DIRECTORY to identify which domid was created or removed

(Suggested by Juergen Gross)

solution: domid calculation

- The xenwatch subscriber knows the pattern of its node path
- New callback for struct xenbus_watch: get_domid()
- The xenwatch subscriber should implement the callback

struct xenbus_watch {
	struct list_head list;
	const char *node;

	void (*callback)(struct xenbus_watch *,
			 const char *path, const char *token);

	domid_t (*get_domid)(struct xenbus_watch *watch,
			     const char *path, const char *token);
};

/* be_watch callback; path: backend/<pvdev>/<domid>/... */
static domid_t be_get_domid(struct xenbus_watch *watch,
			    const char *path, const char *token)
{
	const char *p = path;

	if (char_count(path, '/') < 2)
		return 0;

	p = strchr(p, '/') + 1;
	p = strchr(p, '/') + 1;

	return path_to_domid(p);
}
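The helpers char_count() and path_to_domid() above are kernel-side. A rough userspace sketch of the same parsing (function names borrowed from the slide, bodies reconstructed here as assumptions) walks a backend/<pvdev>/<domid>/... path with strchr:

```c
#include <stdlib.h>
#include <string.h>

/* Count occurrences of c in s (mirrors the char_count() helper). */
static int char_count(const char *s, char c)
{
    int n = 0;
    for (; *s; s++)
        if (*s == c)
            n++;
    return n;
}

/* Parse a decimal domid from the front of p, e.g. "7/0/state" -> 7. */
static int path_to_domid(const char *p)
{
    return atoi(p);
}

/*
 * Sketch of be_get_domid(): for a path of the form
 * backend/<pvdev>/<domid>/..., skip two '/' and parse the domid.
 * Returns 0 (handled by the default xenwatch thread) when the
 * path is too short to contain a domid.
 */
static int be_get_domid(const char *path)
{
    const char *p = path;

    if (char_count(path, '/') < 2)
        return 0;

    p = strchr(p, '/') + 1;   /* skip "backend/" */
    p = strchr(p, '/') + 1;   /* skip "<pvdev>/" */

    return path_to_domid(p);
}
```

For example, be_get_domid("backend/vif/7/0/state") yields 7, while a bare "backend" yields 0 and stays on the default thread.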

Xenwatch Multithreading Framework

1. Use the .get_domid() callback to calculate the domid of a watch event.
2. Run the callback on the default xenwatch kthread if domid == 0.
3. Otherwise, submit the event to the per-domu list (domid=2 xenwatch kthread, domid=3 xenwatch kthread, ...).
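The dispatch rule above reduces to a small decision function. This is a sketch under assumptions: mtwatch_dispatch, mtwatch_domain and DEFAULT_THREAD are made-up names, and a flat worker table stands in for the kernel's per-domU thread list.

```c
#include <stddef.h>

#define DEFAULT_THREAD  -1   /* sentinel: run on the default xenwatch thread */

/* Hypothetical per-domU worker table: slot i handles workers[i].domid. */
struct mtwatch_domain {
    int domid;   /* 0 means the slot is unused */
};

/*
 * Dispatch decision: domid 0 (or a domid with no per-domU worker yet)
 * falls back to the default xenwatch thread; otherwise the event is
 * queued to the matching per-domU thread.  Returns the worker index,
 * or DEFAULT_THREAD.
 */
static int mtwatch_dispatch(const struct mtwatch_domain *workers, size_t n,
                            int domid)
{
    if (domid == 0)
        return DEFAULT_THREAD;
    for (size_t i = 0; i < n; i++)
        if (workers[i].domid == domid)
            return (int)i;
    return DEFAULT_THREAD;   /* no worker for this domid yet */
}
```

Events for different domains thus land on different threads, so a stalled callback for one DomU no longer blocks watch processing for every other DomU.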

xenbus_watch unregistration optimization

- By default, traverse ALL lists (default xenwatch, domid=1 xenwatch, domid=9 xenwatch, domid=11 xenwatch, ...) to remove pending xenwatch events
- .get_owner() is implemented when a xenbus_watch belongs to a specific domu
- Then only a single list needs to be traversed for a per-domu xenwatch

struct xenbus_watch {
	struct list_head list;
	const char *node;

	void (*callback)(struct xenbus_watch *,
			 const char *path, const char *token);

	domid_t (*get_domid)(struct xenbus_watch *watch,
			     const char *path, const char *token);

	domid_t (*get_owner)(struct xenbus_watch *watch);
};

Switch to xenwatch multithreading

Step 1: implement .get_domid()
Step 2: implement .get_owner() for a per-domu xenbus_watch

// e.g., /local/domain/1/device/vbd/51712/state
static int watch_otherend(struct xenbus_device *dev)
{
	struct xen_bus_type *bus =
		container_of(dev->dev.bus, struct xen_bus_type, bus);

+	dev->otherend_watch.get_domid = otherend_get_domid;
+	dev->otherend_watch.get_owner = otherend_get_owner;
+
	return xenbus_watch_pathfmt(dev, &dev->otherend_watch,
				    bus->otherend_changed,
				    "%s/%s", dev->otherend, "state");
}

+static domid_t otherend_get_domid(struct xenbus_watch *watch,
+				  const char *path,
+				  const char *token)
+{
+	struct xenbus_device *xendev =
+		container_of(watch, struct xenbus_device, otherend_watch);
+
+	return xendev->otherend_id;
+}
+
+static domid_t otherend_get_owner(struct xenbus_watch *watch)
+{
+	struct xenbus_device *xendev =
+		container_of(watch, struct xenbus_device, otherend_watch);
+
+	return xendev->otherend_id;
+}

Test Setup

Patch for implementation: http://donglizhang.org/xenwatch-multithreading.patch
Patch to reproduce: http://donglizhang.org/xenwatch-stall-vif.patch

- Intercept an sk_buff (with fragments) sent out from vifx.y
- Control when the intercepted sk_buff is reclaimed

Test Result

1) An sk_buff from vifx.y is intercepted by xenwatch-stall-vif.patch
2) [xen-mtwatch-2] is stalled during VM shutdown
3) [xen-mtwatch-2] goes back to normal once the sk_buff is released

dom0# xl list
Name        ID  Mem  VCPUs  State   Time(s)
Domain-0     0  799      4  r-----     50.2
(null)       2    0      2  --p--d     29.9

dom0# ps -x | egrep "mtwatch|xenwatch"
PID   TTY  STAT  TIME  COMMAND
39     ?   S     0:00  [xenwatch]
2196   ?   D     0:00  [xen-mtwatch-2]

dom0# cat /proc/2196/stack
[<0>] kthread_stop
[<0>] xenvif_disconnect_data
[<0>] set_backend_state
[<0>] frontend_changed
[<0>] xenwatch_thread
[<0>] kthread
[<0>] ret_from_fork
[<0>] 0xffffffff

Current Status

- Total LOC: ~600
- The feature can be enabled only on dom0
- Xenwatch Multithreading is enabled only when:
  - the xen_mtwatch kernel param is set
  - xen_initial_domain() is true
- Feedback for [Patch RFC ] received from xen-devel

Future work

- Extend XS_DIRECTORY to XS_DIRECTORY_PART, to list 1000+ domu from xenstore (port d4016288ab from Xen to Linux)
- Watch a parent node only (excluding descendants), so that only the parent node's update is notified: watch /local/domain for thread create/destroy

Take-Home Message

- The single-threaded xenwatch has a real limitation
- It is imperative to address that limitation
- Xenwatch Multithreading can solve the problem
- Only the OS kernel is modified, with ~600 LOC
- Easy to apply to an existing xenbus_watch

Questions?