Experimental Study of Virtual Machine Migration in Support of Reservation of Cluster Resources Ming Zhao, Renato J. Figueiredo Advanced Computing and Information Systems (ACIS) Electrical and Computer Engineering University of Florida {ming, renato}@acis.ufl.edu Advanced Computing and Information Systems laboratory
Motivating Application Dynamic Data-driven Brain-machine Interface Resource demanding parallel application Time-critical during online experiments Must finish each closed loop within 1ms No stringent time requirement for offline study Online Experimenting Offline Study Brain institute ACIS In Vivo Data Acquisition 1 BMI Model BMI Model K BMI Model Closed-Loop Control Parallel Computing on Virtualized Resources Data Storage Dynamic Data-driven Brain-machine Interfaces (http://www.acis.ufl.edu/dddasbmi)
Motivating Application Motivation for -based resource reservation Online experiments Dedicate cluster resources for s running BMI models Without reservation, deadline is often violated Offline study Share cluster resources with s running other workloads Cluster reservation Migrating exiting s to other resources Online Experimenting Offline Study Data polling Data transfer Computing Robot control Latencies 1ms In Vivo Data Acquisition 1 BMI Model BMI Model K BMI Model Closed-Loop Control Parallel Computing on Virtualized Resources Dynamic Data-driven Brain-machine Interfaces (http://www.acis.ufl.edu/dddasbmi) 3 Data Storage
Overview Goal: Efficient -based cluster resource reservation Challenge: Vacate resources in time (to meet schedule) but not too early (to fully utilize resources) Solution: Build a migration overhead model through experimental analysis
Outline Architecture Experimental analysis Methodology Migrating a single Migrating a sequence of s Migrating s in parallel Summary 5
-based Resource Reservation Virtual resource manager Global management of virtualized resources Presents an abstracted interface for resource requests Hides the complexity of using virtualized resources Coordinates with schedulers to reserve, allocate resources with s 1 Scheduler P1 3 ) instantiation 3) Reservation: Create a with... by 1am Virtual machine scheduler Per-host management of s Presents a unified interface for managing s Hides the complexity of using different software Job Manager 1) Resource request: Need 1GHz CPU, 1GB RAM at 1am ) Admission control: OK ) Resource handlers: Machine IP: 1.5.15.3 Virtual Resource Manager
Migration based Cluster Vacating Example P1 1 3 5 7 9 P1 P1 P P 1 3 P 1 3 5 5 P3 P3 P3 7 7 9 9 (I) P1 is shared by s running various workloads (II) P1 is reserved by migrating the existing s to P and P3 7 (III) P1 is dedicated to the s created for a parallel application
Experimental Setup Physical cluster Each node has four CPUs (.33GHz), GB memory Gigabit Ethernet 1 Node 1 Node Virtual machines ware Server based disk Shared read-only image on NFS (.vmdk) Independent changes (redo logs) on local disks (.REDO) memory Mapped to file (.vmem) Stored on local disks vmx REDO vmem Local vmx REDO vmem NFS vmdk Local
Experimental Setup Migration strategy Suspend Synchronizes memory state to file Copy Uses FTP to transfer memory state file, disk redo files Maximum throughput is about 1MB/s Resume Restores memory state from file in foreground Not based on ware solutions (otion) 1 vmx REDO vmem Node 1 Local vmx REDO vmem NFS vmdk Node Local 9
Migration Overhead of a Single Different memory sizes Suspend and resume phases are fast Copy time grows as memory size increases Different methods to store the migrated states RAMFS removes the impact of disk I/Os to the migration process time (s) 1 1 1 1 1 1MB 5MB 3MB 51MB MB 7MB 9MB 1MB time (s) 1 1 1 1 1 1MB 5MB 3MB 51MB MB 7MB 9MB 1MB suspend copy resume suspend copy resume Using disks on the destination host to store migrated state files Using RAMFS on the destination host to store migrated state files 1
Modeling Copy Phase Using regression methods Polynomial for using disk, linear for using RAMFS RAMFS setup is used for all the following experiments 1 1 1 Using disks y = 1E-5x +.3x + 1.7 R =.997 1 time (s) 1 Using RAMFS y =.9x +.75 R = 1 1 1 memory size (MB) Using regression to model the copy phase 11
Modeling Suspend and Resume Phases Resume Typically short: memory state is buffered after the copy phase Suspend Typically short: memory state is frequently synchronized to file 1 1 1 baseline 5MB 51MB 7MB 1MB time (s) 1 1 suspend resume 1
Modeling the Suspend and Resume Phases Running memory-intensive workload in the migrated A rogue program that continuously modifies memory Suspend time increases as more synchronization is needed 1 1 1 baseline 5MB 51MB 7MB 1MB time (s) 1 1 suspend resume 13
Migrating a Sequence of s Per- migration time is consistent regardless of the number of sequentially migrated s 1 1 9 single two s four s eight s 9 single two s four s 7 7 time (s) 5 time (s) 5 3 3 1 1 suspend copy resume total suspend copy resume total Per- migration time when migrating a sequence of 5MB-RAM s Per- migration time when migrating a sequence of 51MB-RAM s 1
Migrating s with CPU-intensive Workload CPU-intensive workload Runs iteratively, consumes 1% of CPU Performance degradation = Impacted iteration time Regular iteration time Takes longer for applications in s to recover to full performance 1 1 Performance degradation migration time 1 1 1 3 1 1 Migration delay (s) 1 1 Iteration Time (s) 1 1 1 3 Time (s) Migrating four s in sequence; each has 51MB-RAM and runs a CPU-intensive benchmark 15
Migrating s with Memory-intensive Workload Memory-intensive workload Runs iteratively, touches 1% of memory Performance degradation = Time of impacted iterations Regular iteration time * Migration is a more memory intensive process Migration delay (s) 3 3 1 1 1 1 1 Performance degradation migration time 1 3 Iteration time (s) 3 3 1 1 1 1 1 1 3 c Time (s) Migrating four s in sequence; each has 51MB-RAM and runs a memory-intensive benchmark 1
Migrating s with Web Workloads Apache Web server Serves constant-rate requests from outside clients (1 request/s each) Takes longer for the throughputs to recover 5 1 5 Reply rate 15 1 Reply rate 15 1 5 5 5 3 5 Reply rate 15 1 Reply rate 15 1 5 5 Reply rate 55 5 5 35 3 5 Aggregate 1 3 5 7 Time (s) Migrating four s in sequence; each has 51MB-RAM and runs a Web server 17
Migrating s in Parallel Parallel migration has shorter total migration time Faster vacating of cluster resources Sequential migration has shorter per- migration time Less impact on the applications in the s 5 5M-P 5M-P 5M-S 5M-S 51M-P 51M-S 3 5 5M-P 5M-S 51M-P 51M-S Migration time (s) 3 Migration time (s) 15 1 1 5 1 Number of s 1 Number of s Total migration time Per- migration time 1
Related Work -based resource management In-VIGO Virtual workspaces Virtuoso Shirako migration based resource reallocation VIOLIN Sandpiper migration optimizations ware otion Xen live migration 19
Summary Conclusions: It is feasible to predict the migration time of s with different configurations and different numbers It takes longer for a s application to recover than the migration time Sequential migration has less impact on s applications Parallel migration is faster for migrating multiple s Future work: Generalizes and validates the model across different physical platforms, with different technologies Considers modeling of live migrations
Acknowledgement DDDASBMI project team http://www.acis.ufl.edu/dddasbmi Sponsorship from NSF Publication approval from ware Questions? ming@acis.ufl.edu http://www.acis.ufl.edu/~ming 1