Exploiting Heterogeneity in the Public Cloud for Cost-Effective Data Analytics


Gunho Lee, Byung-Gon Chun, Randy H. Katz
University of California, Berkeley, Intel Labs Berkeley

Abstract

Data analytics are key applications running in the cloud. Clouds provide dynamic provisioning that enables users to scale their clusters in response to demand. These clouds are comprised of heterogeneous server platforms, but current data analytics applications do not take this heterogeneity into account. In this paper, we rethink resource allocation and scheduling for data analytics in the cloud to take advantage of the heterogeneity of the platforms.

1 Introduction

As cloud computing services become more popular, more services are run in a public cloud. For example, a number of social game developers use a cloud to serve their game applications, as cloud computing provides fast deployment and easy scaling [4]. Such applications generate a huge amount of data about in-game activities and events, and analyzing the data is important to monitor user behavior and improve game play through data-driven iterative development processes [3]. Hence, it is desirable to store data and perform data analytics in the same cloud where the service runs, rather than moving the data to a separate data analytics cluster, to minimize the response time and cost of queries.

When we have a data analytics cluster in the cloud, our primary interest might be to maintain the cluster at minimum cost while meeting the performance requirements. To do so, there are two key questions we need to address. One is how to allocate resources to the data analytics cluster. As cloud computing provides fast provisioning, we can add more machines to the cluster in minutes when the demand surges. Similarly, we can remove and terminate machines when the cluster is idle to reduce the cost of maintaining the cluster.
As in other applications such as web services and storage systems [6], a data analytics cluster in the cloud should also take advantage of the dynamic nature of cloud computing to be maintained in a cost-effective manner. The cost associativity [5] of cloud computing (e.g., using 1,000 EC2 machines for 1 hour costs the same as using 1 machine for 1,000 hours) is the key to realizing a cost-effective data analytics cluster in the cloud.

While allocating resources, heterogeneity should be considered to exploit the full potential of cloud services. Typically, cloud services provide various machine types. For example, Amazon EC2 offers 11 instance types with different specs and prices. One of them even allows access to a GPU that can be used to accelerate a particular application significantly [9]. Hence, we should allocate the type of resources that fits the workload characteristics to achieve high performance per cost. This will lead to a heterogeneous cluster, which raises another interesting question.

Given the heterogeneous nature of the data analytics cluster in the cloud, the other question we need to address is how to schedule jobs and tasks in the cluster. Especially if there are multiple jobs to run concurrently, the data analytics system should be aware of the heterogeneity of resources and jobs, and then make scheduling decisions so that each job is assigned to the resources on which it performs best. It is also important to provide fairness among jobs so that each job receives a fair amount of resources and does not starve in the heterogeneous environment.

In this paper, we present an architecture and design considerations for building a cost-effective data analytics system in the cloud. We first present background on Hadoop, Amazon EC2, and the Data Analytic Cloud in Section 2. We present resource allocation and scheduling issues in heterogeneous environments, and our system architecture that addresses these issues, in Section 3.
We illustrate the potential benefits of our approach with case studies in Section 4 and conclude in Section 5.

2 Background

This paper focuses on a data analytic cloud, which operates as a MapReduce cluster that relies on providers to host large data sets on a distributed storage system and performs data processing on an analytic engine in public clouds, as described in [7]. In particular, we consider the Hadoop Distributed File System (HDFS) and Hadoop MapReduce as the storage layer and the analytic engine in this paper. We also use Amazon EC2 as an example of a public cloud service.

HDFS is comprised of multiple DataNodes, which store blocks, and a NameNode, which maintains metadata such as filenames and block locations. A file in HDFS is divided into blocks. Replicas of each block are stored on nodes that run DataNodes in the cluster. By default, the block size is 64MB, and each block has three replicas.

On top of HDFS, Hadoop MapReduce is used to process data. Hadoop MapReduce has multiple TaskTrackers that perform the actual computations on participating nodes and a JobTracker that manages jobs and TaskTrackers. Users submit jobs that consist of a map function and a reduce function to the JobTracker, which then launches map and reduce tasks on TaskTrackers. Each map task reads and processes one block from HDFS to generate intermediate results, which are in turn fed into reduce tasks to produce the final result. Each TaskTracker holds a fixed number of slots to host map tasks and reduce tasks.

Amazon EC2 is a public cloud service that enables users to lease virtual machines. Amazon charges per hour and per machine for this service without requiring any long-term commitment. Users can choose the spec of their virtual machines from 11 instance types offered at different prices. Users can also launch virtual machines in multiple locations across the world. EC2 charges for data transfer in and out of machines, but all transfers are free between machines that are in the same location.
Hence, there is a strong economic incentive for users to keep network communication within the same location. There are many other cloud providers with different pricing schemes, but their basic structures are similar.

3 Architecture

There are two kinds of queries that enter a data analytics system. One is a production query, which is predefined and routinely processed. Usually, it runs against data appended after the previous run and may have a deadline. The other is an ad-hoc query, which is entered once to answer a certain question as the need arises. A data analytics system in the cloud should handle both types of queries in a cost-effective manner. To that end, we propose an architecture for building a data analytics system in the cloud, as seen in Figure 1.

Figure 1: Data Analytic Cloud Architecture (components shown: Cloud Driver, Job, Analytics Engine (MapReduce), Storage (HDFS), Accelerator)

In this architecture, participating nodes are grouped into one of two pools: long-lived core nodes that host both data and computations, and accelerator nodes that are added to the cluster temporarily when additional computing power is needed. An analytic engine (e.g., Hadoop MapReduce) runs on nodes in both pools, while a storage system (e.g., HDFS) is deployed only on core nodes. The cloud driver manages the nodes allocated to the analytic cloud and decides when to add or remove nodes, of what type, to or from which pool, and how many.

Users submit a job to the cloud driver with a few hints about the job characteristics, including its memory requirement, its ability to use a GPU, and its deadline, if available. The cloud driver keeps the history of query executions to estimate the submission rate of production queries and their resource consumption rates. It also monitors the storage system to determine the incoming data rate.

3.1 Resource Allocation

The cloud driver is responsible for allocating resources to the cluster. The number of core nodes is determined primarily based on the storage size requirement.
In addition, the cloud driver refers to the history to see whether more nodes should be added to the core pool to accommodate the production queries. When many production queries with tight deadlines are anticipated, or a large ad-hoc query is submitted, the cloud driver adds nodes to the accelerator pool temporarily to handle them, rather than allocating too many core nodes that would be underutilized.
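The allocation reasoning above can be made concrete with a small calculation. The following is a minimal sketch, not the cloud driver itself: the function names are hypothetical, the data size, disk size, and hourly price are made-up illustrative values, and the only facts taken from the text are the three-replica HDFS default and per-hour, per-machine billing.

```python
import math

def core_nodes_needed(data_gb, replicas, disk_gb_per_node):
    """Smallest number of core nodes whose disks can hold every replica."""
    return math.ceil(data_gb * replicas / disk_gb_per_node)

def billed_cost(nodes, hours, price_per_hour):
    """Cost of a group of nodes when each started hour is billed in full."""
    return nodes * math.ceil(hours) * price_per_hour

# Hypothetical inputs: 1000 GB of data, 800 GB of disk per node, $0.50 per hour.
cores = core_nodes_needed(data_gb=1000, replicas=3, disk_gb_per_node=800)

# Four extra nodes kept as core nodes for a whole 4-hour query round,
# versus the same four nodes leased as accelerators for a 1.5-hour burst.
always_on = billed_cost(nodes=4, hours=4, price_per_hour=0.5)
temporary = billed_cost(nodes=4, hours=1.5, price_per_hour=0.5)

print(cores, always_on, temporary)  # 4 8.0 4.0
```

Under these assumed numbers, leasing the extra capacity only for the burst halves its cost, which is the motivation for keeping a separate accelerator pool.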

Table 1: Example of EC2 Instances. CU represents Compute Unit.
Columns: Instance Type, $/Hour, Disk(GB), $/GB/mon, CU, $/CU, Mem(GB), GB/, I/O

Instance Type    I/O
m1.small         moderate
m1.large         high
m2.2xlarge       high
c1.medium        moderate
cc1.4xlarge      very high
cg1.4xlarge      very high

When adding nodes, the cloud driver also decides which instance type to use. If we only consider the cost of storage, using m1.large instances will be the cheapest, as seen in Table 1. However, as they have the least computing power among the instance types, we might need more nodes to accommodate the production queries. In this case, using other instances such as c1.medium can be more efficient in terms of the total expense. Moreover, some jobs may run significantly faster on nodes of a particular instance type, such as cg1.4xlarge instances with GPUs. Hence, it is important to know the relationship between jobs and instance types, which we call job affinity, to find a good mix of different instances for a cost-effective cluster. To determine job affinity, a newly submitted job goes through a calibration phase in which at least one task of the job is assigned to a node of each available instance type. The cloud driver then knows the computing rate of the job on each instance type.

As a result, the cluster is likely to become heterogeneous. To utilize the full potential of a heterogeneous cluster, the analytics engine should be aware of node specifications and job affinity so that it can assign the tasks of a job to nodes appropriately. It is also responsible for putting each job through the calibration phase so that the cloud driver and the analytics engine can initialize and update the job affinity.

3.2 Scheduling

Once the cloud driver allocates a set of resource units such as virtual machines, the analytics engine uses the resources. When the engine runs multiple jobs at the same time, it needs to consider heterogeneity for both performance and fairness.
In this section, we describe how the analytics engine estimates the computing rate of a job on heterogeneous nodes and how it uses that estimate to make smarter scheduling decisions for performance and fairness. To that end, we use Hadoop MapReduce as an example analytics engine.

3.2.1 Share and Fairness

When a slot becomes available and there are multiple jobs to run, the Hadoop scheduler picks a job to assign to the slot. The Hadoop fair scheduler [2] realizes fair scheduling by having each job occupy an equal number of slots. However, the number of slots does not reflect the true share of a job, as computation slots are not equal. Moreover, even on the same slot, the computation speed varies from job to job. Hence, we need to redefine share in a heterogeneous cluster. Ghodsi et al. [8] introduced a fairness policy called Dominant Resource Fairness (DRF) for use in a heterogeneous cluster, based on the user's report of resource requirements. To account for the performance variance on different resources, we introduce Progress Share, which captures the computation speed of a job on given slots.

The share of a job is the portion of computing resources to which the job is assigned. In a heterogeneous cluster, the contribution of a particular resource to the share depends on the job's workload characteristics and the specification of the resource. Suppose that there are two jobs, J1 and J2, each running on one of two different nodes, N1 and N2. If J1 runs x times faster on N1 than on N2, we can say that N1 gives x times more share, or computing power, than N2 to J1. On the other hand, if J2 runs on N1 as fast as on N2, both nodes give the same share to J2. Specifically, we can say that J1 has an x/(x + 1) share of the cluster and J2 has a 1/2 share when J1 runs on N1 and J2 runs on N2. If the jobs switch places, the share of J1 changes to 1/(x + 1), whereas the share of J2 stays the same.
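The two-job example can be checked numerically. The sketch below is only an illustration: it assumes one slot per node, takes x = 4 as an arbitrary value, and uses a hypothetical function name and data layout. A job's share is computed as the computing rate of the slots it occupies divided by its computing rate summed over all slots.

```python
def progress_share(rates, assigned):
    """Share of a job: rate on the slots it holds over its rate on all slots.

    rates    -- dict mapping each slot to the job's computing rate on it
    assigned -- slots currently running tasks of the job
    """
    return sum(rates[s] for s in assigned) / sum(rates.values())

x = 4.0                            # assumed speedup of J1 on N1 (illustrative)
j1_rates = {"N1": x, "N2": 1.0}    # J1 runs x times faster on N1
j2_rates = {"N1": 1.0, "N2": 1.0}  # J2 runs equally fast on both nodes

# J1 on N1, J2 on N2: shares are x/(x + 1) and 1/2.
good = (progress_share(j1_rates, ["N1"]), progress_share(j2_rates, ["N2"]))

# Jobs switch places: shares are 1/(x + 1) and 1/2.
swapped = (progress_share(j1_rates, ["N2"]), progress_share(j2_rates, ["N1"]))

print(good, sum(good))
print(swapped, sum(swapped))
```

With these assumed rates, the first placement gives a total share above 1 and the swapped placement gives a total below 1, so the sum distinguishes the better assignment from the worse one.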
As seen in the example above, once we determine per-job slot computing rates, they can be used to calculate the actual share of each job, which we call Progress Share. In general, given that CR(s, j) is the computing rate of slot s for job j, which is determined during the calibration phase, the progress share PS(j) of job j is as follows:

\[ PS(j) = \frac{\sum_{s \in S_j} CR(s, j)}{\sum_{s \in S} CR(s, j)} \tag{1} \]

where S_j is the set of slots running tasks of job j and S is the set of all slots. It is worth noting that the sum of the progress shares of all jobs indicates the effectiveness of the resource assignment. If it is below 1, there is an alternative assignment that makes the sum of the progress shares above, or at least equal to, 1.

3.2.2 Scheduling

The analytics engine scheduler should make sure that each job goes through the calibration phase to determine the per-slot computing rate, CR, which is needed to calculate the progress share of each job. It should use the progress share to make appropriate scheduling decisions so that the cluster is better utilized (i.e., the sum of progress shares ≥ 1) while each job receives a fair amount of computing resources.

To do so, the core nodes include at least one node per group of instance types with similar specs. When a job is submitted, it starts the calibration phase, in which at least one map task is launched on a node in each group. By measuring the completion time of these tasks, the scheduler can determine the per-job slot computing rate. This information is also fed into the cloud driver so that the driver can determine the job's affinity to a particular instance type.

During the normal phase, the scheduler works similarly to the Hadoop fair scheduler. When a slot becomes available, a job that has less than its minimum or fair share is chosen. It is also noteworthy that there is a mechanism called delay scheduling [10], in which a job with no data-local task yields to another job for a short period of time in order to increase data locality. Similarly, if another job has a significantly higher computing rate on the slot, the slot can be yielded to that job.

Figure 2: Job executions with various configurations (horizontal axis: Time, up to 3:00:00; configurations include 9 m1, 5 m1 + 4 m1, 5 m1 + 4 c1, 5 m1 + 6 c1, and 5 m1 + 8 c1)

Figure 3: Cost/performance trade-off (axes: Cost ($ per 4 hours) and Speed (normalized); configurations include 5 m1, 9 m1, 5 m1 + 4 m1, 5 m1 + 4 c1, 5 m1 + 6 c1, and 5 m1 + 8 c1)

4 Case Study

In this section, we examine two case studies to illustrate the potential benefits of our system.
First, we consider the case in which a number of production queries are issued every four hours. We chose five jobs from the gridmix2 benchmark [1] as production queries and used the FIFO scheduler in Hadoop. If we use 5 m1.large instances for core nodes, the cluster is overloaded by the production queries and, thus, the response time of a query keeps increasing over time. So, we need to add more nodes to the core pool or to the accelerator pool.

Figure 2 shows the number of pending jobs in the queue during one round of the production queries on clusters with various configurations. 9 m1 means using 9 m1.large instances as core nodes, 5 m1 + 4 c1 represents a cluster with 5 m1.large core nodes and 4 c1.medium accelerator nodes, and so on. As we can see, the jobs finished earlier as more accelerator nodes were added. In other words, if the queries have tighter deadlines, it is possible to meet them at additional expense.

Figure 3 shows the cost-performance trade-off. Note that c1.medium instances provide higher performance per cost, as they have faster CPUs at lower prices than m1.large, as seen in Table 1. However, as they are equipped with less memory, they may be of no use for jobs that require a large amount of memory. The graph also shows that using more accelerators can cost less while allowing jobs to finish faster. Because the cluster with 6 c1.medium accelerator nodes can finish all jobs within 2 hours, we need to pay for only 2 hours of accelerator usage. On the other hand, it takes more than 2 hours with 4 c1.medium accelerators, which leads to charges for 3 hours. In this case study, we can observe that there are a number of ways to allocate resources, each with a different cost-performance trade-off. Making the right decision keeps the cluster cost-effective.

The second case is when there is significant job affinity, such as a job that utilizes a GPU. We used a GPU-accelerable dictionary-based cryptographic attack job (Crypto) and a sort job (Sort) for this case. We prepared a small cluster consisting of two nodes and used the fair scheduler.

Figure 4: Effect of accelerable jobs and the progress share (completion time in minutes of the Crypto and Sort jobs for each cluster configuration, including 1 cc1 + 1 cg1 and 1 cc1 + 1 cg1 with progress share)

The first bars in Figure 4 show the completion time of each job when both nodes are cc1.4xlarge instances. For the second bars, we replaced one node with a cg1.4xlarge instance, which is identical to cc1.4xlarge except for the GPUs with which it is equipped. Here, we can observe a significant performance gain from using GPUs. This suggests the benefit of having a heterogeneous cluster. However, just having a heterogeneous cluster does not lead to optimal usage, as the scheduler assigns tasks randomly without being aware of hardware and job affinity.

The last bars show the results with a hardware-aware scheduler that adopts progress share. By giving the Crypto job priority on the node with GPUs, the job was completed much earlier. It is worth noting that even the Sort job finished earlier than in the other cases. This is because Sort receives more than half of the slots at any moment. Assigning a fraction of a cg1.4xlarge node to Crypto contributes a significant amount towards its progress share, which in turn makes the Sort job receive more resources to match its progress share to Crypto's. Even if Sort receives more slots than Crypto, we argue that this is fairer than giving the same number of slots to each job (as the current Hadoop does), as Crypto already receives a significantly preferred resource, which makes it progress faster.
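The effect described above, in which Sort ends up with more slots while Crypto keeps the GPU slots, can be reproduced with a toy allocator. The following is a minimal sketch under stated assumptions, not the scheduler used in the experiment: the greedy rule, the function name, the four slots per node, and the 10x GPU speedup for Crypto are all hypothetical choices made for illustration.

```python
def assign_slots(rates, slots):
    """Greedy stand-in for a progress-share-aware scheduler (illustrative).

    rates -- dict: job -> {slot type -> computing rate of the job on it}
    slots -- list of slot types, one entry per slot in the cluster
    Repeatedly gives the job with the lowest progress share the free slot on
    which its rate is highest relative to the other jobs' rates.
    """
    free = list(slots)
    total = {j: sum(r[t] for t in slots) for j, r in rates.items()}
    share = {j: 0.0 for j in rates}
    got = {j: [] for j in rates}
    while free:
        job = min(share, key=share.get)  # job furthest behind in progress share

        def advantage(slot_type):
            others = [rates[o][slot_type] for o in rates if o != job]
            return rates[job][slot_type] / max(others, default=1.0)

        slot = max(free, key=advantage)
        free.remove(slot)
        got[job].append(slot)
        share[job] += rates[job][slot] / total[job]
    return got, share

# Hypothetical rates: Crypto is 10x faster on a GPU slot, Sort is indifferent.
rates = {"Crypto": {"cg1": 10.0, "cc1": 1.0},
         "Sort":   {"cg1": 1.0,  "cc1": 1.0}}
slots = ["cg1"] * 4 + ["cc1"] * 4  # assumed: two nodes with four slots each

got, share = assign_slots(rates, slots)
print(got)
print(share)
```

Under these assumed rates, Crypto receives three GPU slots and Sort receives the remaining five slots, yet the two progress shares end up close to each other (about 0.68 and 0.63), and their sum exceeds 1.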
In this case study, we can see that the scheduling algorithm of the analytics engine plays an important role in improving performance as well as in scheduling jobs fairly.

5 Conclusion

As the cloud environment provides dynamic provisioning of heterogeneous servers, it is important to exploit these features to make a data analytics cluster in the cloud cost-effective. In this paper, we presented a system architecture to build such a cluster and discussed resource allocation and scheduling algorithms that take advantage of the heterogeneity of the cloud.

References

[1] Gridmix. mapreduce/docs/current/gridmix.html.
[2] Hadoop fair scheduler. apache.org/common/docs/r0.20.1/fair_scheduler.html.
[3] Playmetrix launches beta of its game analytics service. press-releases/69065/playmetrix-Game-Analytics.
[4] Social gaming applications. rightscale.com/products/cloudcomputing-uses/social-gamingapplications.php.
[5] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, and M. Zaharia. Above the clouds: A Berkeley view of cloud computing. Technical Report UCB/EECS, EECS Department, University of California, Berkeley, Feb.
[6] M. Armbrust, A. Fox, D. Patterson, N. Lanham, H. Oh, B. Trushkowsky, and J. Trutna. SCADS: Scale-independent storage for social computing applications. In Conference on Innovative Data Systems Research (CIDR).
[7] N. Chohan, C. Castillo, M. Spreitzer, M. Steinder, A. Tantawi, and C. Krintz. See spot run: Using spot instances for MapReduce workflows. In HotCloud.
[8] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource fairness: Fair allocation of heterogeneous resources in datacenters. Technical Report UCB/EECS, EECS Department, University of California, Berkeley, May.
[9] J. Polo, D. Carrera, Y. Becerra, V. Beltran, J. Torres, and E. Ayguadé. Performance management of accelerated MapReduce workloads in heterogeneous clusters. In 39th International Conference on Parallel Processing (ICPP 2010).
[10] M. Zaharia et al. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In EuroSys.
