Falling Out of the Clouds: When Your Big Data Needs a New Home

Executive Summary

Today's public cloud computing infrastructures are not architected to support truly large Big Data applications. While it makes sense to start some big data projects in the cloud, this whitepaper tells you when it is time to leave the cloud and come home. DriveScale can help you get out of the cloud: we offer a software-defined Big Data infrastructure that gives you control, predictability, and efficiency without losing the flexibility that cloud solutions provide.
The Problem With Clouds

The phenomenon of cloud computing cannot be ignored. Every application could be a candidate for running in the cloud, and almost every new application could potentially be incubated there. While some organizations may have regulatory or privacy issues that prevent a wholesale embrace of the cloud, running in the cloud will make sense for many of their applications.

Cloud computing brings capabilities that a traditional fixed infrastructure either lacks or provides far less efficiently. For bursty workloads, the elastic, on-demand nature of cloud resource allocation can be far cheaper and more convenient than maintaining fixed private resources.

But beware of applications that require a lot of storage capacity. Storage is seldom used elastically: storage requirements just grow and never shrink, giving the cloud far less cost advantage once storage requirements reach a significant level.

Also, predictable performance with cloud-based workloads can be very difficult to achieve. Many organizations have jobs that must run with predictable performance in particular time windows: overnight, end of month, end of quarter, and so on. This requirement is at odds with the elasticity the cloud is built for, so there can be no guarantee of repeatable performance.

But not every workload is better in the cloud. Cloud technology mirrors that of mainstream data centers, which are optimized for virtual machines and centralized storage. In the last decade a whole new class of applications has emerged that demands the efficiencies and scalability that can be achieved only with bare metal clusters and commodity storage. Examples include Hadoop, NoSQL databases such as Cassandra, and streaming systems such as Kafka.
Even though these applications are often deployed in the cloud, they show dramatically better performance and predictability on bare metal. This class of Big Data applications fits well neither in traditional private infrastructures nor in the cloud, and demands purpose-built infrastructure.

Example: Hadoop on AWS

Let's take a detailed look at why Hadoop suffers in the cloud. By Hadoop, we mean all of the numerous frameworks that use the Hadoop Distributed File System (HDFS) API, including Spark, Pig, Hive, and others. Hadoop and HDFS have revolutionized data-intensive parallel computing by setting and adhering to a basic set of principles:

- Use of bare metal commodity hardware for both compute and storage.
- Colocation of compute and storage to avoid network bottlenecks.
- The ability to move sub-tasks of jobs to the compute nodes closest to their data.
- Use of very large block sizes to optimize sequential I/O streaming bandwidth.
- Distribution of large files among nodes to enable parallel processing of the blocks of a file.

There are various ways to run Hadoop on Amazon Web Services (AWS): user-installed, through AWS's Elastic Map-Reduce (EMR), or through service providers like Qubole. All rely on the same underlying EC2, S3, and EBS compute and storage offerings. Unfortunately, none of these offerings is built to meet the Hadoop principles stated above.
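The large-block-size principle above is worth a quick illustration. A minimal sketch, using HDFS's well-known 128 MB default block size and an assumed 1 TB input file, of why coarse blocks keep I/O sequential and the number of parallel tasks manageable:

```python
# Sketch: why HDFS's very large block size matters.
# The 128 MB figure is HDFS's default; the 1 TB file is an assumed example.

def split_into_blocks(file_bytes, block_bytes):
    """Number of blocks (and hence parallel map tasks) a file produces."""
    return -(-file_bytes // block_bytes)  # ceiling division

ONE_TB = 1 << 40
MB = 1 << 20

# A 1 TB file with 128 MB HDFS blocks: thousands of long sequential streams.
print(split_into_blocks(ONE_TB, 128 * MB))  # 8192 blocks

# The same file chopped at a traditional 4 KB filesystem block size would
# fragment into hundreds of millions of pieces, destroying streaming I/O.
print(split_into_blocks(ONE_TB, 4 * 1024))  # 268435456 blocks
```

Each 128 MB block becomes one unit of work that can be scheduled on the node holding it, which is what makes the "move compute to the data" principle practical.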
EC2 Only

With EC2 and only locally attached disk storage, the performance of Hadoop compares well to bare metal. However, that means you must pay full time for the instance, with no elasticity benefit. EC2 also reboots instances when Amazon deems it necessary, so node reliability is much worse than with bare metal. Also, many of the newer EC2 instance types do not offer local disk options at all, so this is not a viable approach going forward.

EC2 With S3

AWS EMR uses S3 object storage by default. Although S3 provides good bandwidth for individual files, S3 bandwidth usage is counted against an instance's network bandwidth cap, so the ability to stream multiple files to a single instance can be severely restricted. Also, there are no guarantees for network bandwidth, so a noisy neighbor can lead to unpredictable performance.

EC2 With EBS

Similarly, Hadoop on EBS storage can be severely limited by network bandwidth. Even most of the EBS-optimized instance types have storage bandwidth caps that are ludicrously low for Hadoop workloads.

Data Too Big, Compute Too Little

A more subtle problem with Hadoop on AWS has to do with how compute resources are allocated. As of this writing, the EMR instances with the greatest compute capability are of the 8xlarge class, offering 36 vCPUs. Amazon defines a vCPU as the compute power provided by a Xeon Hyper-Thread. Hyper-Threading is a CPU feature that doubles the number of apparent CPU cores but typically adds only about 10% performance. This means a vCPU delivers only about 55% of the performance of a single CPU core, so a 36-vCPU instance has about the same compute power as 20 real cores. For comparison, typical compute-optimized Xeon E5 v4 processor systems today (2017) have 28 to 32 cores. Sadly, the high-compute instances are often not usable because the network caps in force are too low to be practical.
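The vCPU arithmetic above can be sketched directly. The 10% Hyper-Threading speedup is the estimate used in this paper; real gains vary by workload:

```python
# Sketch of the vCPU-to-real-core arithmetic. The 10% Hyper-Threading
# speedup is this paper's working estimate, not a measured constant.

def effective_cores(vcpus, ht_speedup=0.10):
    """Approximate physical-core equivalents for a vCPU (Hyper-Thread) count.

    Two Hyper-Threads on one core deliver about (1 + ht_speedup) times the
    work of that core alone, so each vCPU is worth (1 + ht_speedup) / 2
    of a real core -- about 55% at a 10% speedup.
    """
    return vcpus * (1 + ht_speedup) / 2

print(round(effective_cores(36)))  # 36 vCPUs ~ 20 real cores
```

At roughly 20 effective cores per 36-vCPU instance, matching the 28 to 32 real cores of a single modern bare-metal node already takes more than one cloud instance.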
This forces deploying more, smaller CPU instances, which in turn leads to more dependence on the network and greater variability in performance. In contrast, a bare metal node can have many more cores and much higher I/O bandwidth, resulting in far fewer nodes needed in the cluster.

When to Leave the Cloud

If you are already frustrated by unpredictable job run times, or if you expect your organization's Big Data spending in the cloud to reach $1M per year, then it is time to seriously consider the cost and control benefits of operating your own bare metal clusters. The combination of the widespread availability of colocation data centers and low interest rates for vendor equipment financing means that the up-front capital costs of running your own clusters can be small. The operating savings you realize relative to the cloud can then pay for the personnel and other costs of getting out of the cloud.
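A back-of-the-envelope payback calculation makes the "when to leave" question concrete. Every figure below is an illustrative assumption for the sketch, not a DriveScale or AWS price:

```python
# Back-of-the-envelope payback sketch. The capex, colo/ops, and cloud
# figures are assumptions chosen for illustration only.

def months_to_payback(capex, cloud_monthly, owned_monthly_opex):
    """Months until cumulative cloud spend exceeds capex plus owned opex.

    Returns None if owning costs more per month than the cloud did,
    i.e. the capital outlay is never recovered at these rates.
    """
    monthly_savings = cloud_monthly - owned_monthly_opex
    if monthly_savings <= 0:
        return None
    return capex / monthly_savings

# $1M/year cloud spend vs. an assumed $1.5M cluster purchase plus
# $30k/month for colocation space, power, and operations staff:
print(round(months_to_payback(1_500_000, 1_000_000 / 12, 30_000), 1))
```

Under these assumed numbers the cluster pays for itself in roughly two and a half years; vendor financing spreads the capex so the up-front outlay is smaller still.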
Building Big Data Infrastructure with DriveScale

The solution you choose for your big data infrastructure should not be driven entirely by cost. Without the responsiveness that comes from the flexibility of the cloud, you and your users will be frustrated by the time it takes to deploy new clusters or reconfigure existing ones. With the rapid pace of evolution of big data frameworks and applications, a fixed infrastructure can become a serious handicap.

DriveScale's solution offers a purpose-built system architecture for big data that also features the flexibility of the cloud. Sophisticated tools for cloud deployment, such as Cloudera Director or Hortonworks CloudBreak, can be used with DriveScale for rapid deployment and reconfiguration. By conforming to the basic Hadoop principles, DriveScale clusters deliver the full performance of bare metal without any of the cloud limitations and caps mentioned earlier.

DriveScale brings the benefits of commodity hardware to the customer without dictating the choice of any particular servers, switches, or storage. While DriveScale does offer expertise in hardware choice, we encourage customers to make their own tradeoffs around cost, convenience, and vendor relationships.

Details of the DriveScale Solution

The central principle of the DriveScale solution is the separation of compute and storage resources at the rack level. This enables the DriveScale management software to define and re-define server roles based on the storage to which the servers attach. These server roles are in turn the elements that make up the cluster-level management the customer sees. With the DriveScale solution, multiple clusters can coexist in a single hardware pool, even though the compute-to-storage ratio may differ for nodes in each cluster. Because they are in the same pool, resources may be moved among clusters in an elastic fashion.
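The pooling idea above can be sketched with a toy model. This is an illustrative abstraction only, not DriveScale's actual software or API: a shared rack-level pool of JBOD drives from which clusters with different compute-to-storage ratios borrow, and to which they return drives when torn down:

```python
# Illustrative toy model (not DriveScale's real software) of a rack-level
# drive pool where server roles are defined by the storage attached to them.

class DrivePool:
    """A shared pool of JBOD drives that clusters draw from and return to."""

    def __init__(self, total_drives):
        self.free = total_drives
        self.clusters = {}  # cluster name -> drives allocated

    def create_cluster(self, name, nodes, drives_per_node):
        """Bind compute servers to storage at the ratio this cluster needs."""
        need = nodes * drives_per_node
        if need > self.free:
            raise RuntimeError("not enough free drives in the pool")
        self.free -= need
        self.clusters[name] = need

    def destroy_cluster(self, name):
        """Return a cluster's drives to the pool for reuse elsewhere."""
        self.free += self.clusters.pop(name)

pool = DrivePool(total_drives=60)
pool.create_cluster("hadoop", nodes=4, drives_per_node=10)  # storage-heavy
pool.create_cluster("kafka", nodes=4, drives_per_node=4)    # lighter ratio
print(pool.free)  # 4 drives still unallocated
pool.destroy_cluster("kafka")
print(pool.free)  # 20 drives free for the next cluster
```

The point of the model is the elasticity: two clusters with different ratios share one hardware pool, and tearing one down frees its storage for redeployment without touching any physical cabling.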
The DriveScale solution adheres to the Hadoop principles, namely:

Use of commodity hardware:
- The DriveScale solution allows a broader choice of servers than other big data solutions, while promoting commodity storage through the use of industry-standard "dumb" JBODs.

Co-location of compute and storage:
- The DriveScale solution keeps the storage near the server, within the rack, and uses any excess bandwidth in top-of-rack Ethernet switches. As the number of racks grows, the storage bandwidth grows linearly.

Ability to move compute close to the data:
- The DriveScale solution preserves the normal Hadoop scheduling and placement for tasks.

Optimization of I/O for streaming bandwidth:
- The DriveScale Adapter that connects JBODs to Ethernet is optimized for low cost, high bandwidth, and complete redundancy.

Distribution of files among nodes (and racks):
- The DriveScale solution does not impact the Hadoop/HDFS data block distribution.
Business Benefits

The business benefits of coming home from the cloud with DriveScale are simple:

Cost Control
Costs of big data in the cloud can rapidly spiral out of control due to the non-elastic nature of storage and the lack of an architecture optimized for big data workloads. Operating your own Big Data infrastructure lowers costs by leveraging commodity hardware with the efficiency and high utilization rates achieved with DriveScale.

Predictability
Because clusters in the DriveScale solution all use their own bare metal resources, critical workloads run predictably, allowing the customer to meet their business SLA requirements.

Flexibility
As workloads, applications, frameworks, and data grow and evolve, DriveScale allows the customer to re-shape and optimize clusters to meet new requirements without expensive fork-lift upgrades.

Contact DriveScale today at www.drivescale.com

DriveScale, Inc.
1230 Midas Way, Suite 210
Sunnyvale, CA 94085
Main: +1 (408) 849-4651
www.drivescale.com
WP.201703.01.01