Coriolis: Scalable VM Clustering in Clouds

Similar documents
Optimizing Main Memory Usage in Modern Computing Systems to Improve Overall System Performance

Lecture 6. Register Allocation. I. Introduction. II. Abstraction and the Problem III. Algorithm

Acceleration of Virtual Machine Live Migration on QEMU/KVM by Reusing VM Memory

ECLT 5810 Clustering

Efficient, Scalable, and Provenance-Aware Management of Linked Data

BUYING SERVER HARDWARE FOR A SCALABLE VIRTUAL INFRASTRUCTURE

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT

CHAPTER 7 CONCLUSION AND FUTURE WORK

ECLT 5810 Clustering

Efficient in-network evaluation of multiple queries

Clustering. Chapter 10 in Introduction to statistical learning

Converged Infrastructure Matures And Proves Its Value

Jinho Hwang (IBM Research) Wei Zhang, Timothy Wood, H. Howie Huang (George Washington Univ.) K.K. Ramakrishnan (Rutgers University)

MATE-EC2: A Middleware for Processing Data with Amazon Web Services

Searching the Deep Web

PLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters

LazyBase: Trading freshness and performance in a scalable database

The MOSIX Algorithms for Managing Cluster, Multi-Clusters, GPU Clusters and Clouds

Virtual Machine Placement in Cloud Computing

Lecture 25: Register Allocation

CompTIA CV CompTIA Cloud+ Certification. Download Full Version :

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation

An Optimized Virtual Machine Migration Algorithm for Energy Efficient Data Centers

An Experimental Study of Load Balancing of OpenNebula Open-Source Cloud Computing Platform

Bi-Objective Optimization for Scheduling in Heterogeneous Computing Systems

Accelerating Restore and Garbage Collection in Deduplication-based Backup Systems via Exploiting Historical Information

Event Coverage in Theme Parks Using Wireless Sensor Networks with Mobile Sinks

AMAZON.COM RECOMMENDATIONS ITEM-TO-ITEM COLLABORATIVE FILTERING PAPER BY GREG LINDEN, BRENT SMITH, AND JEREMY YORK

LRC: Dependency-Aware Cache Management for Data Analytics Clusters. Yinghao Yu, Wei Wang, Jun Zhang, and Khaled B. Letaief IEEE INFOCOM 2017

Informatica Enterprise Information Catalog

Clustering in Data Mining

Exploiting Parallelism to Support Scalable Hierarchical Clustering

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample

How to Troubleshoot Databases and Exadata Using Oracle Log Analytics

Hyperbolic Traffic Load Centrality for Large-Scale Complex Communications Networks

Data center interconnect for the enterprise hybrid cloud

Register Allocation. Stanford University CS243 Winter 2006 Wei Li 1

Hierarchical Intelligent Cuttings: A Dynamic Multi-dimensional Packet Classification Algorithm

Prediction-Based Admission Control for IaaS Clouds with Multiple Service Classes

Lecture 15 Register Allocation & Spilling

Network-Aware Resource Allocation in Distributed Clouds

Searching the Deep Web

CloudView : Describe and Maintain Resource Views in Cloud

Data Matching and Deduplication Over Big Data Using Hadoop Framework

Falcon: Scaling IO Performance in Multi-SSD Volumes. The George Washington University

IBM Spectrum Protect Version Introduction to Data Protection Solutions IBM

1 Case study of SVM (Rob)

Introduction to Cloud Computing and Virtual Resource Management. Jian Tang Syracuse University

Clustering COMS 4771

GET CLOUD EMPOWERED. SEE HOW THE CLOUD CAN TRANSFORM YOUR BUSINESS.

1/10/2011. Topics. What is the Cloud? Cloud Computing

Technology Insight Series

IBM Tivoli Storage Manager Version Introduction to Data Protection Solutions IBM

ELTMaestro for Spark: Data integration on clusters

CA ERwin Data Profiler

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

Quality of Service Aspects and Metrics in Grid Computing

VMware and Xen Hypervisor Performance Comparisons in Thick and Thin Provisioned Environments

CQNCR: Optimal VM Migration Planning in Cloud Data Centers

CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE

Constrained K-means Clustering with Background Knowledge. Clustering! Background Knowledge. Using Background Knowledge. The K-means Algorithm

VERITAS Storage Foundation 4.0 TM for Databases

WORT: Write Optimal Radix Tree for Persistent Memory Storage Systems

Price Performance Analysis of NxtGen Vs. Amazon EC2 and Rackspace Cloud.

SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience

IBM dashdb Local. Using a software-defined environment in a private cloud to enable hybrid data warehousing. Evolving the data warehouse

Storage Solutions for VMware: InfiniBox. White Paper

Oasis: An Active Storage Framework for Object Storage Platform

The Future of High Performance Computing

HPDedup: A Hybrid Prioritized Data Deduplication Mechanism for Primary Storage in the Cloud

webmethods Task Engine 9.9 on Red Hat Operating System

Storage Infrastructure Optimization

Lightweight Streaming-based Runtime for Cloud Computing. Shrideep Pallickara. Community Grids Lab, Indiana University

Performance Analysis of Virtual Machines on NxtGen ECS and Competitive IaaS Offerings An Examination of Web Server and Database Workloads

PSOACI Why ACI: An overview and a customer (BBVA) perspective. Technology Officer DC EMEAR Cisco

The CORD reference architecture addresses the needs of various communications access networks with a wide array of use cases including:

Dynamo. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Motivation System Architecture Evaluation

Graph-Based Synopses for Relational Data. Alkis Polyzotis (UC Santa Cruz)

Symbolic Execution. Wei Le April

Association Pattern Mining. Lijun Zhang

Skimmer: Rapid Scrolling of Relational Query Results. Manish Singh, Arnab Nandi and H.V. Jagadish

Benefits of IBM Power Systems in the Cloud 2012 IBM Corporation

Deep Dive on SimpliVity s OmniStack A Technical Whitepaper

Information Retrieval

Data Management Glossary

Dynamic Grouping Strategy in Cloud Computing

The Desired State. Solving the Data Center s N-Dimensional Challenge

How Architecture Design Can Lower Hyperconverged Infrastructure (HCI) Total Cost of Ownership (TCO)

Register Allocation. Lecture 16

Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink

<Insert Picture Here> Introducing Oracle WebLogic Server on Oracle Database Appliance

CSE 530A. B+ Trees. Washington University Fall 2013

arxiv: v3 [cs.db] 30 Apr 2016

Transparent Throughput Elas0city for IaaS Cloud Storage Using Guest- Side Block- Level Caching

GPU Consolidation for Cloud Games: Are We There Yet?

YCSB++ benchmarking tool Performance debugging advanced features of scalable table stores

Online Optimization of VM Deployment in IaaS Cloud

FIVE REASONS YOU SHOULD RUN CONTAINERS ON BARE METAL, NOT VMS

VMware vsphere: Taking Virtualization to the Next Level

Joint Entity Resolution

Transcription:

1 / 21 Coriolis: Scalable VM Clustering in Clouds Daniel Campello 1 <dcamp020@fiu.edu> Carlos Crespo 1 Akshat Verma 2 RajuRangaswami 1 Praveen Jayachandran 2 1 School of Computing and Information Sciences Florida International University 2 IBM Research - India

2 / 21 The Benefits of Virtualization

3 / 21 Virtualization in Data Centers

3 / 21 Virtualization in Data Centers

3 / 21 Virtualization in Data Centers Virtual Machine Sprawl

4 / 21 Cloud Computing Age of the Cloud IT is no longer capital-intensive Commodity acquired on-demand Paid as per usage

4 / 21 Cloud Computing Age of the Cloud IT is no longer capital-intensive Commodity acquired on-demand Paid as per usage Emerging problem Virtual machine sprawl

4 / 21 Cloud Computing Age of the Cloud IT is no longer capital-intensive Commodity acquired on-demand Paid as per usage Emerging problem Virtual machine sprawl Standardization is the key Allow services on-demand Reduce system management costs in the software stack

5 / 21 Motivating Virtual Machine Clustering Classifying (possibly) diverse virtualized servers in a cloud into clusters of similar virtual machines (VMs) can improve the planning of many system management activities

6 / 21 Outline 1 Introduction 2 Virtual Machine Similarity Content Semantic Use Cases 3 Clustering Techniques k-means and k-medoids Coriolis tree-based 4 Evaluation 5 Related Work 6 Summary

7 / 21 Content Similarity Refers to data similarity in the raw files Subset of bytes contained within images are identical

7 / 21 Content Similarity Refers to data similarity in the raw files Subset of bytes contained within images are identical Extensively studied in the context of data deduplication

7 / 21 Content Similarity Refers to data similarity in the raw files Subset of bytes contained within images are identical Extensively studied in the context of data deduplication Recent Findings Large-scale study of VM in a production IaaS cloud:

7 / 21 Content Similarity Refers to data similarity in the raw files Subset of bytes contained within images are identical Extensively studied in the context of data deduplication Recent Findings Large-scale study of VM in a production IaaS cloud: Images tend to be similar to a small subset of collection.

7 / 21 Content Similarity Refers to data similarity in the raw files Subset of bytes contained within images are identical Extensively studied in the context of data deduplication Recent Findings Large-scale study of VM in a production IaaS cloud: Images tend to be similar to a small subset of collection. Computing pair-wise similarity is very expensive

8 / 21 Semantic Similarity Characterizes the similarity of the software functionality.

8 / 21 Semantic Similarity Characterizes the similarity of the software functionality. Some examples: Instances of same application Different versions of the same application Different applications with same goal (i.e MySQL and DB2)

9 / 21 Harnessing Image Similarity System Management Scenarios Allocation of servers to system administrators Administrators can manage up to 80% more servers

9 / 21 Harnessing Image Similarity System Management Scenarios Allocation of servers to system administrators Administrators can manage up to 80% more servers Troubleshooting Identify servers with similar software stack that respond differently to an update to find and fix possible issues

9 / 21 Harnessing Image Similarity System Management Scenarios Allocation of servers to system administrators Administrators can manage up to 80% more servers Troubleshooting Identify servers with similar software stack that respond differently to an update to find and fix possible issues Placement of virtual machines to hosts In-memory and storage deduplication

9 / 21 Harnessing Image Similarity System Management Scenarios Allocation of servers to system administrators Administrators can manage up to 80% more servers Troubleshooting Identify servers with similar software stack that respond differently to an update to find and fix possible issues Placement of virtual machines to hosts In-memory and storage deduplication Migration of enterprise applications across data centers Migration performed in batches or waves Minimize network transfer and re-configuration costs

10 / 21 Virtual Machine Clustering Clustering is NP-hard. Various heuristic exist.

Virtual Machine Clustering Clustering is NP-hard. Various heuristic exist. k-means Popular technique employed in the real world 10 / 21

Virtual Machine Clustering Clustering is NP-hard. Various heuristic exist. k-means Popular technique employed in the real world Each iteration: Assignment Step - Distance operation (kn) Update Step - Merge operation (N 1) 10 / 21

Virtual Machine Clustering Clustering is NP-hard. Various heuristic exist. k-means Popular technique employed in the real world Each iteration: Assignment Step - Distance operation (kn) Update Step - Merge operation (N 1) In practice Distance and Merge operations are usually very small 10 / 21

Virtual Machine Clustering Clustering is NP-hard. Various heuristic exist. k-means Popular technique employed in the real world Each iteration: Assignment Step - Distance operation (kn) Update Step - Merge operation (N 1) In practice Distance and Merge operations are usually very small Problems with 100 dimensions require only about 100 addition and division operations 10 / 21

11 / 21 Virtual Machine Clustering Distance can be calculated as 1 SIM(I i,i j ) SIM(I i,i j ) = wt(i i I j ) wt(i i I j )

11 / 21 Virtual Machine Clustering Distance can be calculated as 1 SIM(I i,i j ) SIM(I i,i j ) = wt(i i I j ) wt(i i I j ) Merge can be calculated as (I i I j )

11 / 21 Virtual Machine Clustering Distance can be calculated as 1 SIM(I i,i j ) SIM(I i,i j ) = wt(i i I j ) wt(i i I j ) Merge can be calculated as (I i I j ) Image Size Similarity (Content) Merge 8.8 GB 45.5 sec 14.7 sec 12.3 GB 75.2 sec 24.1 sec 13.6 GB 98.5 sec 31.2 sec 16.3 GB 142.3 sec 44.2 sec 19.7 GB 172.2 sec 53.5 sec 22.1 GB 232.7 sec 64.9 sec

11 / 21 Virtual Machine Clustering Distance can be calculated as 1 SIM(I i,i j ) SIM(I i,i j ) = wt(i i I j ) wt(i i I j ) Merge can be calculated as (I i I j ) Image Size Similarity (Content) Merge 8.8 GB 45.5 sec 14.7 sec 12.3 GB 75.2 sec 24.1 sec 13.6 GB 98.5 sec 31.2 sec 16.3 GB 142.3 sec 44.2 sec 19.7 GB 172.2 sec 53.5 sec 22.1 GB 232.7 sec 64.9 sec A data center with 1000 images would have to perform 1000 3 similarity operations, about 2000 years By using in-memory data structures, about 40 years

Approximate Clustering k-medoids k-medoids is a variant of k-means Restricts the cluster center to be one of the existing points (images) Pair-wise similarity can be computed in advance Similarity computation required for all images (N 2 ) 12 / 21

Approximate Clustering k-medoids k-medoids is a variant of k-means Restricts the cluster center to be one of the existing points (images) Pair-wise similarity can be computed in advance Similarity computation required for all images (N 2 ) A data center with 1000 images would have to perform 1000 2 similarity operations, about 2 years By using in-memory data structures, about 15 days 12 / 21

13 / 21 Solution Idea: Asymmetric Clustering Coriolis tree-based clustering Coriolis clustering approach involves constructing a tree

13 / 21 Solution Idea: Asymmetric Clustering Coriolis tree-based clustering Coriolis clustering approach involves constructing a tree The tree is constructed by adding images to it one by one

13 / 21 Solution Idea: Asymmetric Clustering Coriolis tree-based clustering Coriolis clustering approach involves constructing a tree The tree is constructed by adding images to it one by one Each node of the tree is either a cluster of images or a single image

13 / 21 Solution Idea: Asymmetric Clustering Coriolis tree-based clustering Coriolis clustering approach involves constructing a tree The tree is constructed by adding images to it one by one Each node of the tree is either a cluster of images or a single image Each level in the tree represents a minimum extent of similarity within a node

14 / 21 Coriolis Tree-based Clustering S >= 0 A,B,C D,E S > 0.5 A,B C,D,E S > 0.9 C,E A B D S = 1.0 C E

15 / 21 Coriolis tree-based clustering Two key ideas Speed up similarity computation

15 / 21 Coriolis tree-based clustering Two key ideas Speed up similarity computation An asymmetric similarity function S: Coverage offered by a larger node B (typically a cluster) to a new node A

15 / 21 Coriolis tree-based clustering Two key ideas Speed up similarity computation An asymmetric similarity function S: Coverage offered by a larger node B (typically a cluster) to a new node A S = wt(a B) min(wt(a), wt(b))

15 / 21 Coriolis tree-based clustering Two key ideas Speed up similarity computation An asymmetric similarity function S: Coverage offered by a larger node B (typically a cluster) to a new node A S = wt(a B) min(wt(a), wt(b)) Reuse similarity computations

15 / 21 Coriolis tree-based clustering Two key ideas Speed up similarity computation An asymmetric similarity function S: Coverage offered by a larger node B (typically a cluster) to a new node A S = wt(a B) min(wt(a), wt(b)) Reuse similarity computations We only compute the similarity of the new image to each children of the nodes where the image has been merged

15 / 21 Coriolis tree-based clustering Two key ideas Speed up similarity computation An asymmetric similarity function S: Coverage offered by a larger node B (typically a cluster) to a new node A S = wt(a B) min(wt(a), wt(b)) Reuse similarity computations We only compute the similarity of the new image to each children of the nodes where the image has been merged Similarity and Merge operations are proportional to the depth of the tree

16 / 21 Clustering a new Image F S >= 0 A,B,C D,E,F S > 0.5 A,B,F C,D,E S > 0.9 C,E A B,F D S = 1.0 B F C E

16 / 21 Clustering a new Image F S >= 0 A,B,C D,E,F S > 0.5 A,B,F C,D,E S > 0.9 C,E A B,F D S = 1.0 B F C E Which clusters are formed with similarity greater than 0.5?

17 / 21 Scalability Evaluation Experimental Setup We used VM images from 2 production data centers

17 / 21 Scalability Evaluation Experimental Setup We used VM images from 2 production data centers 9 images from a large-scale enterprise data center at IBM

17 / 21 Scalability Evaluation Experimental Setup We used VM images from 2 production data centers 9 images from a large-scale enterprise data center at IBM 12 images from the Computer Science department s small scale data center at FIU

17 / 21 Scalability Evaluation Experimental Setup We used VM images from 2 production data centers 9 images from a large-scale enterprise data center at IBM 12 images from the Computer Science department s small scale data center at FIU We randomly sampled files contained in 3 of the 21 images and generated new synthetic images

18 / 21 Scalability of k-medoids and Tree-based Clustering Clustering Time (minutes) 18 4500 4000 3500 3000 2500 2000 1500 1000 500 0 24 30 36 k-medoids Tree-based 42 Number of Images 48 0 5 10 15 20 25 30 35 Total Files (in Millions) 78 99

19 / 21 Related Work Finding Similar Clusters VMFlock: Virtual Machine Co-migration for the Cloud[IEEE/ACM HPDC 11] Applies standard de-duplication techniques for images Eliminate raw data duplicates across a given set of VM images It does not tackle identifying images with high redundancy or leveraging semantic similarity

20 / 21 Conclusions and Future Work Conclusions We described different types of similarity metrics for VMs and their use to aid administrators in their management activities

20 / 21 Conclusions and Future Work Conclusions We described different types of similarity metrics for VMs and their use to aid administrators in their management activities We argued that state-of-the-art k-medoids clustering algorithm incurs quadratic complexity infeasible for cloud scale data centers

20 / 21 Conclusions and Future Work Conclusions We described different types of similarity metrics for VMs and their use to aid administrators in their management activities We argued that state-of-the-art k-medoids clustering algorithm incurs quadratic complexity infeasible for cloud scale data centers We described the Coriolis framework and system specifically designed for scalable clustering of VM images while supporting arbitrary similarity metrics

20 / 21 Conclusions and Future Work Conclusions We described different types of similarity metrics for VMs and their use to aid administrators in their management activities We argued that state-of-the-art k-medoids clustering algorithm incurs quadratic complexity infeasible for cloud scale data centers We described the Coriolis framework and system specifically designed for scalable clustering of VM images while supporting arbitrary similarity metrics Future Work Our future work will explore the utility of Coriolis for data center administrator allocation, troubleshooting, and large-scale VM migration

21 / 21 Thank you! (I ll be happy to take questions)