Falling Out of the Clouds: When Your Big Data Needs a New Home

Executive Summary

Today's public cloud computing infrastructures are not architected to support truly large Big Data applications. While it makes sense to start some big data projects in the cloud, this whitepaper tells you when it is time to leave the cloud and come home. DriveScale can help you get out of the cloud: we offer a software-defined Big Data infrastructure that gives you control, predictability, and efficiency without losing the flexibility that cloud solutions provide.

The Problem With Clouds

The phenomenon of cloud computing cannot be ignored. Every application could be a candidate for running in the cloud, and almost every new application could potentially be incubated there. While some organizations have regulatory or privacy issues that prevent a wholesale embrace of the cloud, running in the cloud will make sense for many of their applications.

Cloud computing brings capabilities that a traditional fixed infrastructure either lacks entirely or delivers far less efficiently. For bursty workloads, the elastic, on-demand nature of cloud resource allocation can be far cheaper and more convenient than maintaining fixed private resources. But beware of applications that require a lot of storage capacity. Storage is seldom used elastically: storage requirements just grow and never shrink, giving the cloud far less cost advantage once storage reaches a significant level. Predictable performance can also be very difficult to achieve with cloud-based workloads. Many organizations have jobs that must run with predictable performance in particular time windows: overnight, end of month, end of quarter, and so on. This requirement is in opposition to the elasticity the cloud is built for, so there can be no guarantee of repeatable performance.

Not every workload is better in the cloud. Cloud technology mirrors that of mainstream data centers optimized for virtual machines and centralized storage. In the last decade a whole new class of applications has emerged that demands the efficiency and scalability that can be achieved only with bare metal clusters and commodity storage. Examples include Hadoop, NoSQL databases such as Cassandra, and streaming systems such as Kafka. Even though these applications are often deployed in the cloud, they show dramatically better performance and predictability on bare metal. This class of Big Data applications does not fit well in either traditional private infrastructure or the cloud, and demands purpose-built infrastructure.

Example: Hadoop on AWS

Let's take a detailed look at why Hadoop suffers in the cloud. By Hadoop, we mean here all of the numerous frameworks that use the Hadoop Distributed File System (HDFS) API, including Spark, Pig, and Hive. Hadoop and HDFS have revolutionized data-intensive parallel computing by setting and adhering to a basic set of principles:

- Use of bare metal commodity hardware for both compute and storage.
- Colocation of compute and storage to avoid network bottlenecks.
- The ability to move sub-tasks of jobs to the compute nodes closest to their data.
- Use of very large block sizes to optimize sequential I/O streaming bandwidth.
- Distribution of large files among nodes to enable parallel processing of the blocks of a file.

There are various ways to run Hadoop on Amazon Web Services (AWS): user-installed, through AWS's Elastic MapReduce (EMR), or through service providers like Qubole. All rely on the same underlying EC2, S3, and EBS compute and storage offerings. Unfortunately, none of these offerings is built to meet the Hadoop principles stated above.
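To make the block-size and distribution principles concrete before turning to the AWS options, here is a minimal sketch, assuming the common 128 MB HDFS default block size and the usual one-map-task-per-block split. The function name is ours, introduced purely for illustration, not part of any Hadoop API.

```python
def map_tasks_for_file(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    """One HDFS block typically becomes one map task, so the block size sets
    both the degree of parallelism and the length of each sequential read."""
    return -(-file_size_bytes // block_size_bytes)  # ceiling division

# A 1 TB file split into 128 MB blocks yields 8192 blocks, hence 8192 map
# tasks, each of which Hadoop tries to schedule on a node already holding
# that block (the colocation and data-locality principles above).
print(map_tasks_for_file(1 * 1024**4))  # 8192
```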

EC2 Only

With EC2 and only locally attached disk storage, the performance of Hadoop compares well to bare metal. However, that means you must pay full time for the instance, with no elasticity benefit. EC2 also reboots instances whenever AWS deems it necessary, so node reliability is much worse than with bare metal. In addition, many of the newer EC2 instance types do not offer local disk options at all, so this is not a viable approach going forward.

EC2 With S3

AWS EMR uses S3 object storage by default. Although S3 provides good bandwidth for individual files, S3 bandwidth usage is measured against an instance's network bandwidth cap, so the ability to stream multiple files to a single instance can be severely restricted. There are also no guarantees for network bandwidth, so a noisy neighbor can lead to unpredictable performance.

EC2 With EBS

Similarly, Hadoop on EBS storage can be severely limited by network bandwidth. Even most of the EBS-optimized instance types have storage bandwidth caps that are ludicrously low for Hadoop workloads.

Data Too Big, Compute Too Little

A more subtle problem with Hadoop on AWS has to do with how compute resources are allocated. As of this writing, the EMR instances with the greatest compute capability are of the 8xlarge class, offering 36 vCPUs. Amazon defines a vCPU as the compute power provided by a Xeon Hyper-Thread. Hyper-Threading is a CPU feature that doubles the number of apparent CPU cores but typically adds only about 10% performance. This means that a vCPU delivers only about 55% of the performance of a single CPU core, so a 36-vCPU instance has about the same compute power as 20 real cores. For comparison, typical compute-optimized Xeon E5 v4 processor systems today (2017) have 28 to 32 cores. Sadly, the high-compute instances are often not usable because the network caps in force are too low to be practical. This forces deployment of more, smaller instances, which in turn leads to more dependence on the network and greater variability in performance. In contrast, a bare metal node can have many more cores and much higher I/O bandwidth, resulting in far fewer nodes needed in the cluster.

When to Leave the Cloud

If you are already frustrated by unpredictable job run times, or if you expect your organization's spending on Big Data in the cloud to reach $1M per year, it is time to seriously consider the cost and control benefits of operating your own bare metal clusters. The combination of widely available colocation data centers and low interest rates for vendor equipment financing means that the up-front capital costs of running your own clusters can be small. The operating savings you realize relative to the cloud can then pay for the personnel and other costs of getting out of the cloud.
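To make the vCPU arithmetic in the Data Too Big, Compute Too Little section concrete, here is a minimal sketch. The 10% Hyper-Threading uplift is the rule-of-thumb figure used above, not a measured value, and the function name is ours.

```python
def effective_cores(vcpus, ht_speedup=1.10):
    """Approximate physical-core equivalent of a vCPU count.

    Two hyper-threads (two vCPUs) share one physical core, and together the
    pair delivers roughly ht_speedup times the throughput of that core alone.
    """
    return (vcpus / 2) * ht_speedup

# A 36-vCPU 8xlarge instance works out to roughly 20 core-equivalents,
# versus the 28 to 32 real cores of a typical bare metal Xeon E5 v4 server.
print(effective_cores(36))  # 19.8
```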

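The break-even reasoning in When to Leave the Cloud can be sketched the same way. All figures below are placeholder assumptions for illustration only, not vendor quotes; substitute your own cloud bill, hardware pricing, and operating costs.

```python
def payback_years(cloud_annual_cost, cluster_capex, onprem_annual_opex):
    """Naive payback period: years until cumulative cloud spend exceeds the
    up-front cluster purchase plus cumulative on-premises operating cost."""
    annual_savings = cloud_annual_cost - onprem_annual_opex
    if annual_savings <= 0:
        return float("inf")  # on-premises never pays back at these rates
    return cluster_capex / annual_savings

# Hypothetical inputs only:
print(payback_years(cloud_annual_cost=1_000_000,   # the $1M/year threshold above
                    cluster_capex=750_000,          # assumed hardware purchase
                    onprem_annual_opex=400_000))    # assumed colo, power, staff
# -> 1.25 years
```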
Building Big Data Infrastructure with DriveScale

The solution you choose for your big data infrastructure should not be driven entirely by cost. Without the responsiveness that comes from the flexibility offered by the cloud, you and your users will be frustrated by the time it takes to deploy new clusters or reconfigure existing ones. With the rapid pace of evolution of big data frameworks and applications, a fixed infrastructure can become a serious handicap.

DriveScale's solution offers a purpose-built system architecture for big data that also features the flexibility of the cloud. Sophisticated tools for cloud deployment, such as Cloudera Director or Hortonworks Cloudbreak, can be used with DriveScale for rapid deployment and reconfiguration. By conforming to the basic Hadoop principles, DriveScale clusters deliver the full performance of bare metal without any of the cloud limitations and caps mentioned earlier. DriveScale brings the benefits of commodity hardware to the customer without dictating the choice of any particular servers, switches, or storage. While DriveScale does offer expertise in hardware selection, we encourage customers to make their own tradeoffs around cost, convenience, and vendor relationships.

Details of the DriveScale Solution

The central principle of the DriveScale solution is the separation of compute and storage resources at the rack level. This enables the DriveScale management software to define and re-define server roles based on the storage to which servers attach. These server roles are in turn the elements that make up the cluster-level management the customer sees. With the DriveScale solution, multiple clusters can co-exist in a single hardware pool, even though the compute-to-storage ratio may differ for nodes in each cluster. Because they are in the same pool, resources may be moved among clusters in an elastic fashion, as the sketch following the list below illustrates.

The DriveScale solution adheres to the Hadoop principles, namely:

Use of commodity hardware:
- The DriveScale solution allows a broader choice of servers than other big data solutions, while promoting commodity storage through the use of industry-standard dumb JBODs.

Colocation of compute and storage:
- The DriveScale solution keeps the storage near the server, within the rack, and uses any excess bandwidth in top-of-rack Ethernet switches. As the number of racks grows, the storage bandwidth grows linearly.

Ability to move compute close to the data:
- The DriveScale solution preserves the normal Hadoop scheduling and placement of tasks.

Optimization of I/O for streaming bandwidth:
- The DriveScale Adapter that connects JBODs to Ethernet is optimized for low cost, high bandwidth, and complete redundancy.

Distribution of files among nodes (and racks):
- The DriveScale solution does not affect the Hadoop/HDFS data block distribution.
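The rack-level disaggregation described above can be pictured with a toy model: a shared pool of servers and JBOD drives from which logical nodes with different compute-to-storage ratios are composed and later released. This is a conceptual sketch only; the class and method names are ours and do not represent DriveScale's management software or API.

```python
class RackPool:
    """Toy model of rack-level disaggregation (illustrative only)."""

    def __init__(self, servers, drives):
        self.free_servers = list(servers)
        self.free_drives = list(drives)

    def compose_node(self, drives_per_node):
        """Bind one server to N drives drawn from the shared pool."""
        if not self.free_servers or len(self.free_drives) < drives_per_node:
            raise RuntimeError("pool exhausted")
        server = self.free_servers.pop()
        drives = [self.free_drives.pop() for _ in range(drives_per_node)]
        return {"server": server, "drives": drives}

    def release_node(self, node):
        """Return a node's resources to the pool for use by another cluster."""
        self.free_servers.append(node["server"])
        self.free_drives.extend(node["drives"])

# Two clusters with different compute-to-storage ratios share one pool:
pool = RackPool(servers=[f"srv{i}" for i in range(8)],
                drives=[f"hdd{i}" for i in range(96)])
analytics = [pool.compose_node(drives_per_node=12) for _ in range(4)]  # storage-heavy
streaming = [pool.compose_node(drives_per_node=4) for _ in range(2)]   # compute-heavy
```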

Business Benefits

The business benefits of coming home from the cloud with DriveScale are simple:

Cost Control
Costs of big data in the cloud can rapidly spiral out of control due to the non-elastic nature of storage and the lack of an architecture optimized for big data workloads. Operating your own Big Data infrastructure lowers costs by leveraging commodity hardware with the efficiency and high utilization rates achieved with DriveScale.

Predictability
Because clusters in the DriveScale solution all use their own bare metal resources, critical workloads run predictably, allowing the customer to meet their business SLA requirements.

Flexibility
As workloads, applications, frameworks, and data grow and evolve, DriveScale allows the customer to re-shape and optimize clusters to meet new requirements without expensive fork-lift upgrades.

Contact DriveScale today at www.drivescale.com

DriveScale, Inc.
1230 Midas Way, Suite 210
Sunnyvale, CA 94085
Main: +1 (408) 849-4651
www.drivescale.com
WP.201703.01.01