CS Project Report


Kshitij Sudan
kshitij@cs.utah.edu

1 Introduction

With the growth in services provided over the Internet, the amount of data processing required has grown tremendously. To satisfy the computing requirements of large web applications, an underlying distributed platform is typically used. These platforms are usually clusters of commodity computers and can consist of thousands of machines. The Map-Reduce software framework and its open-source implementation, Hadoop, are typically used to program these large distributed systems. Because these clusters operate at a large scale, the cost of operating a cluster over its lifetime becomes a dominant cost. Thus the energy costs of operating a cluster should be included when designing such systems.

Several approaches are taken to lower the energy consumption of commodity clusters. For example, Google and Facebook chose to re-design the underlying commodity hardware to make it more power efficient. A more recent approach is to virtualize the I/O sub-system and simplify each node to just compute and memory. The advantage of virtualizing I/O is that most components of a traditional motherboard can be removed or reduced, saving energy and physical space. The downside, especially for disk virtualization, is that the traditional 1-to-1 mapping between disk and compute is altered, because one can now pack more compute than disk into a given power and space envelope.

However, the traditional assumption for scale-out web applications like Map-Reduce is that there is a 1-to-1 disk-to-compute mapping. This notion was formed using traditional servers, and there is no clear evidence that it is still useful. Many Map-Reduce computations are not as disk I/O intensive as previously assumed, and if there is sufficient network bandwidth to shuttle data around, then it is better to virtualize disk I/O for energy efficiency reasons.
As an example, Figure 1 shows how 64 disks are shared among 512 nodes of a Hadoop cluster. With such a configuration, 8 CPU cores share one physical disk, i.e. a ratio of 1 disk per 8 cores. For such a system, we analyzed the performance of a collection of Hadoop benchmarks and concluded that a 1-to-1 mapping of compute and disk bandwidth is not necessarily beneficial when optimizing for energy efficiency.

Figure 1. Virtualized Disk I/O for Hadoop Framework.

The disk virtualization layer shown in Figure 1 is implemented using a combination of hardware and software techniques. Disk virtualization is transparent to the operating system, and virtualized disks appear as standard devices within the OS. This obviates any need for system software modification. With disk virtualization, the OS running on each node is presented an independent disk, which under the hood is an offset range on a shared physical disk. As an example, if 16 CPU cores are configured to share a 1 TB physical disk, then each CPU is presented a disk of 64 GB. CPU-0 accesses the disk from offset 0 through 64 GB, CPU-1 accesses the same disk from offset 64 GB through 128 GB, and so on. With such an implementation of disk virtualization, the OS behaves as if it has an independent local disk attached to the system.

2 Motivation

Figure 2. Disk bandwidth utilization by Core i7 and Atom based clusters while executing the TeraSort benchmark: (a) executing on a Core i7 based cluster; (b) executing on an Atom based cluster. Both panels plot aggregate disk bandwidth (MB/sec) for reads and writes. The overlaid lines are 1 minute averages of individual datapoints.

A typical recommendation for a Hadoop cluster configuration is to use one physical disk per CPU core. This ensures that maximum disk bandwidth is available to the core without interference from requests originating at other cores. Framework overheads, however, constrain the maximum disk bandwidth usable by the application, and Hadoop jobs typically use far less than the full disk bandwidth. Each disk access request made by the application has to propagate through multiple layers of abstraction (Hadoop, JVM, TCP/IP stack, and the OS), and each layer has an associated overhead. Due to these overheads, even machines with heavyweight CPUs are unable to utilize the full disk bandwidth.

Figure 2 shows the disk bandwidth usage for Core i7 and Atom CPU based clusters while executing TeraSort. Bandwidth usage is plotted on the Y-axis and execution time on the X-axis. Bandwidth was measured using the dstat utility, which reported bandwidth usage every 1 sec. The 60 sec moving average of these 1 sec measurements is overlaid in the figure. It can be seen that the sustained average bandwidth usage for both Core i7 and Atom clusters varies between MB/sec.
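The 60 sec moving average overlaid on the 1 sec bandwidth samples can be computed with a trailing window. The following is a minimal sketch, not the script used for Figure 2: the function name and the plain list of MB/sec samples are assumptions made for illustration, and no dstat output parsing is shown.

```python
from collections import deque


def moving_average(samples, window=60):
    """Trailing moving average over per-second bandwidth samples (MB/sec).

    Hypothetical helper: returns one averaged value for every position at
    which a full window of samples is available.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    buf = deque(maxlen=window)  # holds the most recent `window` samples
    total = 0.0
    averages = []
    for sample in samples:
        if len(buf) == window:
            total -= buf[0]  # oldest sample is about to be evicted
        buf.append(sample)
        total += sample
        if len(buf) == window:
            averages.append(total / window)
    return averages
```

With `window=60` and one sample per second, each output value is the mean bandwidth over the preceding minute.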
These values are considerably lower than the raw disk bandwidth each CPU can drive from the disk. Core i7 based systems can drive a sustained average raw disk read bandwidth of 112 MB/sec, while the Atom CPUs in our proposed system can drive up to 80 MB/sec of read bandwidth and 40 MB/sec of write bandwidth. Note that these bandwidth limits are also a function of the disk controller and the disk internals, not just the CPU (these results were collected using the dd utility on Unix systems).

When using low-power CPUs like the Intel Atom, not only is the disk bandwidth greatly over-provisioned, but power consumption is also disproportionately distributed across the cluster. Typical disks consume 7-25 Watts [1], which is 0.8x-3x the CPU power (8.5 W). Thus, 1 disk per Atom CPU leads to the disk consuming a large fraction of total node power and energy, while the resource itself (disk bandwidth) is under-utilized.

3 Problem Statement

For a system architecture that uses disk virtualization for improved energy efficiency, one of the major configuration parameters is the appropriate number of cores-per-disk (CPD). This project develops two analytical models that take the MapReduce application characteristics into account and suggest the appropriate CPD configuration value. Note that these are intended as best-effort suggestions, and further tuning of the system might be required.

4 Hadoop Analytical Models for Systems with Virtualized I/O

In this section, we discuss two models that describe the relation between cores-per-disk and the execution time of the application. The first model assumes a fixed number of CPU cores in the system and a fixed input dataset size, and aims to determine the CPD value that yields the least execution time. The second model assumes that the power budget for the system is fixed and, for a fixed dataset size, attempts to determine the number of CPU cores and disks that should be used for the least execution time.

4.1 Constant CPU Cores

For this model, we assume that the number of CPU cores and the dataset size ($n$) are fixed. Let $t_C$ be the time the application spends executing on the CPU, $t_D$ the time spent accessing disks, and $t_{misc}$ the time spent in miscellaneous operations like network communication latency, job setup and tear-down time, etc. To keep the model simple, we assume that the miscellaneous costs are constant. Assuming the application completes in only one MapReduce round, the lower bound on the execution time $t_E$ of the MapReduce application can then be represented, similar to the model proposed by Goodrich et al. [2], as:

$$t_E = \Omega(t_C + t_D + t_{misc})$$

Since the number of cores is fixed at $N$, the time taken to perform the necessary compute operations for the application is also fixed. Thus $t_C$ is fixed and can be represented as $t_C = f(\text{cores}, n)$. However, the time it takes the application to access the disk is variable and depends on the CPD value and the dataset size, i.e. $t_D = f(\text{cpd}, n)$. Since $t_E$ only has a lower bound expressed by the equation above, we now wish to tighten this bound a little.
To do so, note that if the application is CPU bound, then assuming $t_{misc}$ to be negligible, the upper bound on execution time can be represented as $t_E = O(t_C)$. Similarly, if the application is purely I/O bound, then the upper bound is $t_E = O(t_D)$. If the application is CPU bound, the system can trivially be configured with the maximum CPD value so as to use as few disks as possible. If the application is I/O bound, the system should be configured with the least CPD value possible, since fewer cores sharing a disk means fewer disk seeks and hence higher available disk bandwidth. Apart from these two corner cases, the CPD ratio has to be chosen such that the compute and disk I/O times are comparable. This leads to the least possible application execution time for a given number of CPU cores. Note that since the CPU core count is fixed, arbitrarily reducing the disk I/O time will not improve the overall application execution time. This occurs because Hadoop MapReduce applications cannot exploit asynchrony in resource utilization at the system level as much as other applications can, as explained next.

It can be argued that at the system level many operations are asynchronous; for example, the CPU might issue a disk access request and then context switch to do some other useful work. However, for Hadoop MapReduce applications, much of this asynchrony cannot be exploited due to the way the systems are configured. Hadoop clusters are typically configured with a single Map and a single Reduce task per core. This is usually done to avoid over-subscribing the compute resources. Since there is an implicit global barrier between the Map and Reduce phases, these two phases cannot overlap. As a result, the computation for any phase can start only when all the data for the phase has been fetched from the disk into main memory. Due to this limitation, the model aims to achieve comparable compute and disk I/O times.
Since the compute time is fixed for an application with a fixed dataset size and number of cores, we attempt to define the disk I/O time as a function of the CPD value. Disk accesses are very sensitive to seek latency, as a disk head seek takes significantly longer than the actual data read. When multiple cores share a single disk, the disk head activity increases significantly due to the mixing of access streams from different CPUs. Thus, to first order, disk access time can be represented in terms of CPD as follows:

$$t_D = \text{disk seek latency} \times \text{CPD} + \text{const}$$

Here the constant denotes the cost of the actual data transfer. Typically this value is much smaller than the seek latency term and is usually dropped. Thus $t_D$ can be approximated in terms of the CPD value alone. To achieve the least execution time with a fixed number of CPU cores and a fixed dataset size, the values of $t_D$ and $t_C$ have to be nearly equal. This leads to:

$$\frac{t_C}{t_D} \approx 1$$

This can be re-written as:

$$\text{CPD} \approx \frac{t_C}{\text{disk seek latency}} \tag{1}$$

This result can be intuitively understood as computing for nearly as long as it takes to read data from the disk.

4.2 Iso-Power System

To improve energy efficiency, not only does the execution time have to be minimized, but power consumption also has to be taken into account. In Section 4.1 we assumed that the dataset size and the number of CPU cores are fixed. We relax this constraint by now assuming that the number of CPU cores can also be varied. With this relaxation, the compute time $t_C$ becomes a variable. As noted above, $t_C = f(\text{cores}, n)$ and $\text{CPD} = \text{cores}/\text{disks}$; thus, using Equation 1 above, we get:

$$\frac{\text{cores}}{\text{disks}} \approx \frac{f(\text{cores}, n)}{\text{disk seek latency}} \tag{2}$$

A simple relation between the number of disks, the number of CPU cores, and the power budget can be written as:

$$\text{power budget} = a \times \text{disks} + b \times \text{cores} + \text{const} \tag{3}$$

where $a$ and $b$ are constants that characterize the power consumed by disks and cores, respectively. The const term accounts for the fixed overheads that cannot be amortized over disks and cores, like cooling fans and power supplies. Since MapReduce workloads are scale-out applications, i.e. a larger number of compute cores allows more parallelization and thus a lower execution time, as a first order approximation $t_C = \text{const} \times N / \text{cores}$.
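Under the first-order approximation for $t_C$, Equations 2 and 3 reduce to a quadratic in the number of cores. The sketch below works through that substitution under stated assumptions: `work` stands for the product const × N in the approximation (in core-seconds), `seek_coeff` stands for the disk seek latency coefficient of the $t_D$ model, and `fixed` is the const term of Equation 3. All function and parameter names are hypothetical, and the results are real-valued, so rounding to whole cores and disks is left to the reader.

```python
import math


def balanced_cpd(t_c, seek_coeff):
    """Equation 1: CPD ~ t_C / disk seek latency."""
    return t_c / seek_coeff


def iso_power_config(power_budget, a, b, fixed, work, seek_coeff):
    """Solve Equations 2 and 3 assuming t_C = work / cores.

    a, b       : power per disk and per core (Watts), as in Equation 3
    fixed      : non-amortizable power overhead (Watts)
    work       : assumed total compute work, const * N (core-seconds)
    seek_coeff : assumed seek latency coefficient of the t_D model
    Returns (cores, disks) as real numbers.
    """
    if min(a, b, work, seek_coeff) <= 0:
        raise ValueError("a, b, work and seek_coeff must be positive")
    headroom = power_budget - fixed
    if headroom <= 0:
        raise ValueError("power budget does not cover the fixed overhead")
    # Equation 2 with t_C = work / cores gives
    #   disks = seek_coeff * cores**2 / work.
    # Substituting into Equation 3 yields
    #   q * cores**2 + b * cores - headroom = 0, with q = a * seek_coeff / work.
    q = a * seek_coeff / work
    cores = (-b + math.sqrt(b * b + 4.0 * q * headroom)) / (2.0 * q)
    disks = seek_coeff * cores ** 2 / work
    return cores, disks


# Illustrative numbers only (not measurements): 10 W per disk, 8.5 W per core.
cores, disks = iso_power_config(
    power_budget=1000.0, a=10.0, b=8.5, fixed=50.0, work=100.0, seek_coeff=1.0
)
```

Only the positive root of the quadratic is kept, since a negative core count has no physical meaning. The resulting pair satisfies the power budget exactly and gives a cores-per-disk ratio equal to `balanced_cpd(work / cores, seek_coeff)`.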
Thus, using Equations 2 and 3 above, one can derive an appropriate number of cores and disks to be used for an application, for a given power budget.

5 Conclusions and Future Work

This project explored two basic models to understand the impact of various system configuration parameters on MapReduce applications. The presented models show that if the power consumed is not a constraint, then the CPD value is simply related to the CPU time the application consumes. If, however, the power budget is fixed, then using the iso-power model one can derive the appropriate number of CPU cores and disks.

The current model contains many system-specific parameters that need to be empirically measured for a given system. The most critical measurement is the time the application spends on the CPU. My current attempts to characterize this parameter were not very successful, since it is hard to break up the execution time of an application into time for compute and time for disk accesses. In future, I plan to develop a methodology to accurately account for the distribution of application time among compute and disk accesses. I also plan to leverage these models to present guidelines so that algorithms can be developed for MapReduce applications that are aware of the underlying effects of disk virtualization. This can be leveraged to develop energy efficient algorithms for the system architecture described here.

References

[1] Internet-Scale Datacenter Economics: Costs and Opportunities. In High Performance Transaction Systems,
[2] M. T. Goodrich, N. Sitchinava, and Q. Zhang. Sorting, Searching, and Simulation in the MapReduce Framework. CoRR, abs/ ,
