Improving MapReduce Energy Efficiency for Computation Intensive Workloads
Improving MapReduce Energy Efficiency for Computation Intensive Workloads

Thomas Wirtz and Rong Ge
Department of Mathematics, Statistics and Computer Science
Marquette University, Milwaukee, WI
{thomas.wirtz,

Abstract — MapReduce is a programming model for data intensive computing on large-scale distributed systems. With its wide acceptance and deployment, improving the energy efficiency of MapReduce will lead to significant energy savings for data centers and computational grids. In this paper, we study the performance and energy efficiency of the Hadoop implementation of MapReduce in the context of energy-proportional computing. We consider how MapReduce efficiency varies with two runtime configurations: resource allocation, which changes the number of available concurrent workers, and DVFS (Dynamic Voltage and Frequency Scaling), which adjusts the processor frequency to the workload's computational needs. Our experimental results indicate that significant energy savings can be achieved through judicious resource allocation and intelligent DVFS scheduling for computation intensive applications, though the level of improvement depends on both the workload characteristics of the MapReduce application and the resource and DVFS scheduling policies.

I. INTRODUCTION

MapReduce [1] is a programming model for data intensive computing on large-scale distributed systems that supports automatic parallel processing of large data sets. With the MapReduce framework, programmers can focus on application algorithm design without dealing with low-level workload distribution and management. Today, MapReduce-based applications are widely deployed in many business and educational data centers. With data volume doubling every three years [2], MapReduce will potentially become a major computing paradigm in future data centers. Energy efficient MapReduce is critical for green data centers. It is estimated that data centers account for 1.5% of the overall U.S. electricity use [3].
Electricity costs are already the second highest expense after labor costs in data centers [4]. Nevertheless, efficiency is not among the top MapReduce design constraints. In a MapReduce-based application, computations are broken into many short-lived map and reduce tasks. Map tasks communicate with reduce tasks via intermediate results stored on distributed storage. Process management and local and remote disk I/O accesses are likely to cause both performance and energy inefficiencies for MapReduce applications. There is a large body of work on applications of the MapReduce programming model [5, 6], library support for various programming languages [7], and debugging and tracing tools for the MapReduce framework [8, 9]. Nevertheless, little work has studied MapReduce energy efficiency in depth. GreenHDFS [10] separated cluster servers into cold and hot zones and placed data in these zones according to data classification. GreenHDFS conserved energy by transitioning the servers in the cold zone to high energy-saving power states. Chen et al. [11, 12] and Leverich et al. [13] studied MapReduce energy efficiency with a varying number of worker nodes and found energy-saving potential for MapReduce applications. Our focus is on MapReduce energy efficiency for data- and computation-intensive applications. This work is motivated by two trends: (1) a vast number of scientific applications are becoming data intensive as available data grow explosively; and (2) many of these applications and their supporting software are being ported to the MapReduce framework [6, 14, 15]. Unlike traditional MapReduce applications, such applications have a larger number of operations per byte and a higher demand for computational power. As with other parallel applications, the performance and energy efficiency of MapReduce applications are affected by the degree of parallelism and the computational intensity (i.e., the ratio of on-chip CPU computation to off-chip memory and I/O access).
Given an application, optimal efficiency is achieved when resource allocation matches application characteristics: the number of allocated processing cores matches the degree of parallelism of the application, and the processor performance state matches the application's computational intensity. In this work, we use an experimental approach to validate this concept. We choose three MapReduce benchmark applications, Matrix Multiplication, CloudBurst, and Integer Sort, in our study. For each benchmark, we vary the number of concurrent workers and the processor frequency and investigate how performance and energy efficiency scale. We evaluate the energy and efficiency results within the context of energy-proportional computing. Energy-proportional computing [16] is an ideal computing environment in which server power is directly proportional to the level of server utilization and is zero at idle (i.e., with no active user workload). Energy-proportional computing has been promoted and used to guide hardware and architecture design. To emulate an energy-proportional computing system, we use work-induced power instead of total system power in our analysis. That is, we exclude the
system idle power and consider it zero. By excluding the idle power, we are able to better capture the direct cost of workload execution and the trend of energy change with runtime configurations.

The paper makes the following main contributions. First, we experimentally demonstrate that performance and energy inefficiency is not uncommon in the MapReduce framework for computation intensive applications. Such inefficiency is due to the overhead of automatic parallelization and I/O accesses and cannot simply be reduced by application developers. Second, the degree of parallelism of an application significantly affects the performance and energy efficiency of MapReduce applications. To achieve higher overall efficiency on the MapReduce framework, we need to tailor the resource allocation (i.e., the number of processing cores) to the application's degree of parallelism. Third, we compare three DVFS scheduling policies and the resulting energy efficiency for the MapReduce framework. Overall, DVFS is effective for energy savings. In general, a low-power DVFS scheduling policy is optimal for systems with small idle power, while a performance-constrained DVFS scheduling policy is optimal for systems with dominating idle power.

The remainder of this paper is organized as follows. Related work is discussed in the next section. We discuss the variables of energy efficiency in Section 3 and present our methodology in Section 4. The experimental results are presented in Section 5. Section 6 concludes the paper.

II. BACKGROUND AND RELATED WORK

A. Background

MapReduce is a programming model introduced by Google for processing large data sets in parallel [1]. In the MapReduce framework, large data files are stored across distributed storage devices in small, workable chunks.
With this model, programmers specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. During a MapReduce job execution, the map tasks, as well as the subsequent reduce tasks, execute in parallel on different key/value pairs. The initial file data chunks are input to the map tasks. At the end of the job, the output is compiled into one or more files.

Hadoop MapReduce [17] is an open-source MapReduce framework implementation in Java. The Hadoop framework comprises the MapReduce libraries, the Hadoop distributed file system (HDFS), and supporting system services. A Hadoop MapReduce system normally consists of one JobTracker node and multiple TaskTracker nodes. The JobTracker initializes jobs, adds tasks to a queue, and holds job and task status. A TaskTracker fetches tasks from the JobTracker. HDFS contains one primary NameNode, which holds the file system metadata and keeps track of the placement of file chunks, and multiple instances of DataNodes, which store the chunks of file data. The automated parallel computing in MapReduce boosts software development productivity. However, it may also compromise performance and energy efficiency because of a large amount of extra disk and network I/O accesses, short-lived processes, and load imbalance.

B. Related Work

Driven by ever-increasing operating costs and awareness of energy conservation, researchers have been actively developing technology to understand and improve energy efficiency in data centers. Existing work includes fine-grained power profiling [18], analytical power modeling [19], and power management [20, 21, 22]. The DVFS technology available on modern processors has been widely used for data center power management [20].
Other technologies, such as PowerNap [21] and varying the number of active servers [22], have also been investigated for systems where idle power dominates and the benefit from DVFS is thus limited. Though significant research has been done on MapReduce systems and applications [5, 6], only a few studies have addressed the problem of MapReduce energy (in)efficiency. GreenHDFS [10] separated cluster servers into cold and hot zones and transitioned the servers in the cold zone to high energy-saving power states. It was shown that running a Hadoop cluster with a subset of the system nodes could save energy with some performance tradeoffs for applications [13]. Instead of using a covering set of nodes, another independent study indicated that using all available nodes for workload execution and powering them off after job completion was favorable in terms of energy cost [23]. More recently, Chen et al. analyzed how MapReduce operating parameters affect energy efficiency [11]. Power management for MapReduce systems has also been explored through data placement [24], virtual machine placement [25], and data compression [26].

Fig. 1. A typical deployment of the Hadoop framework. The JobTracker and the HDFS NameNode may reside on the same physical node, while the TaskTrackers and DataNodes are distributed across the other nodes.
While our work is close to [13], the two differ in at least two major aspects. First, our work studies how the energy efficiency of MapReduce varies with both system scale and DVFS scheduling. Second, it targets computation intensive applications, compared to the traditional MapReduce applications in [13]. With the increasing application of MapReduce to a wide range of high performance and data intensive problems, improving the energy efficiency of MapReduce for such applications is a necessity.

III. MAPREDUCE ENERGY EFFICIENCY

As one type of parallel program, MapReduce applications can be described by general parallel performance models. For simplicity, we abstract a computer cluster as a power-aware system characterized by three parameters (N, C, f), where N is the total number of compute nodes in the system, C is the number of processor cores per node, and f is the operating frequency of the processor cores. By this abstraction we confine our work to a homogeneous environment. Changing the processor operating frequency also changes its voltage, as V ∝ f holds for DVFS processors. For computation intensive applications, we assume C is the number of physical cores (not virtual cores) on a node and that there is at most one worker per core at any time instance.

Let T(1) and T(N) be the execution times of a MapReduce application when running with 1 and N worker nodes respectively, and let α be the fraction of the workload that is parallelizable. T(N) can be calculated as

T(N) = (1 − α)·T(1) + (α/N)·T(1)    (1)

Eq. (1) describes the execution time of an ideal parallel algorithm with no parallel overhead. In the MapReduce framework, parallel overhead T_o(N) results from initial and intermediate data distribution and possible load imbalance. Considering the parallel overhead, Eq. (1) can be rewritten in speedup form as follows:

S(N) = T(1)/T(N) = 1 / ((1 − α) + α/N + T_o(N)/T(1))    (2)

Denoting by P(1) and P(N) the average node power when the application runs with 1 and N worker nodes, and by E(1) and E(N) the corresponding energy consumption, we can combine Eq.
(1) with the energy equation E(N) = P(N)·N·T(N) and obtain

E(N)/E(1) = (P(N)/P(1)) · ((1 − α)·N + α)    (3)

For a perfectly parallelizable case where α = 1 and T_o = 0, increasing the number of worker nodes leads to proportionally improved performance, consistent average node power, and constant energy. For most applications, where 0 < α < 1 and T_o > 0 hold, increasing the number of worker nodes improves performance but also increases the energy cost. The performance improvement diminishes due to the sequential part of the application and various overheads, while the energy cost always increases. Given these trends of performance and energy cost as the number of worker nodes increases, we expect there exists an optimal number of worker nodes that delivers a maximum performance-to-energy ratio for a given MapReduce application. Eqs. (1) and (3) also indicate that the energy cost of a MapReduce application increases more slowly with the number of worker nodes for larger α or smaller T_o. While increasing the problem size will normally result in a larger α, our focus is on scheduling an optimal number of nodes for a fixed-size problem with a fixed α. We also confine our effort in this work to improving energy efficiency without modifying the MapReduce framework, though optimizing the MapReduce framework implementation has the potential to effectively reduce T_o and thus improve energy efficiency.

In addition to resource allocation and MapReduce optimization, dynamic voltage and frequency scaling (DVFS) provides a further opportunity to improve energy efficiency. Previous studies have shown that workloads involve both on-chip and off-chip accesses, and changing the CPU frequency only affects the performance of on-chip accesses. For a given workload, if the on-chip access portion accounts for a fraction β of the total execution time at base frequency f_base, then the total execution time at frequency f will be:

T(f) = (β·(f_base/f) + (1 − β))·T(f_base)    (4)

For computation intensive workloads where β ≈ 1, decreasing the frequency increases the total execution time.
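To make these models concrete, the following Python sketch evaluates Eqs. (1)-(4) for a toy workload; all parameter values (α, the overhead term, β, and the base times) are illustrative assumptions rather than measured numbers:

```python
# Toy evaluation of the models in Eqs. (1)-(4). All parameter values
# (alpha, overhead, beta, base times) are illustrative assumptions.

def exec_time(n, t1=100.0, alpha=0.95, overhead=1.0):
    """Eq. (1) extended with a linear parallel overhead term T_o(N)."""
    return (1 - alpha) * t1 + alpha * t1 / n + overhead * (n - 1)

def perf_per_joule(n, t1=100.0):
    """Relative performance per Joule vs. 1 node, assuming constant node
    power: speedup (Eq. (2)) divided by the energy ratio of Eq. (3)."""
    s = t1 / exec_time(n)                   # speedup S(N)
    energy_ratio = n * exec_time(n) / t1    # E(N)/E(1) when P(N) == P(1)
    return s / energy_ratio                 # equals S(N)^2 / N

def time_at_freq(f, f_base=2.5, beta=0.8, t_base=100.0):
    """Eq. (4): only the on-chip fraction beta scales with frequency."""
    return (beta * (f_base / f) + (1 - beta)) * t_base

best_n = max(range(1, 8), key=perf_per_joule)
print("node count with best performance per Joule:", best_n)
print("slowdown at 1.3 GHz:", round(time_at_freq(1.3) / time_at_freq(2.5), 2))
```

Under these assumed parameters the model reproduces the qualitative trends discussed above: performance per Joule peaks at an intermediate node count rather than at the largest one, and lowering the frequency substantially slows a mostly on-chip workload.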
However, for communication- and I/O-intensive workloads where β ≈ 0, varying the processor frequency has only a small impact on the total execution time. On the other hand, the dominant dynamic power of microprocessors is proportional to V²·f. Thus, halving f reduces the dynamic power to one-eighth for DVFS processors, as V ∝ f. Therefore, DVFS is effective for reducing energy cost with minimal performance impact for workloads that are not computation intensive. Nonetheless, applying DVFS to MapReduce applications that are computation intensive is nontrivial, given the mixture of CPU execution phases in the map and reduce functions and I/O phases due to the data distribution across the network.

IV. METHODOLOGY

In this work, we use an experimental approach to study how resource allocation and DVFS scheduling affect energy efficiency for MapReduce applications.

A. MapReduce Workloads

We include three MapReduce benchmark applications in our experiments. The Matrix Multiplication and CloudBurst benchmarks represent MapReduce applications that are both computation-intensive and data-intensive. To reveal the system behavior of the shuffle phase present in many MapReduce applications, we also include the Sort benchmark from the Hadoop distribution.

Matrix Multiplication. Matrix Multiplication calculates the product C = A × B, where A and B are two matrices.
The implementation used in this study is the blocking MapReduce algorithm in [27]. Blocking is a common performance optimization technique that takes advantage of memory cache locality. With blocking, the factor matrices A and B are split into smaller sub-matrices such that the latter fit into low-latency memory caches. This MapReduce implementation consists of two jobs: the first job performs the block multiplications and the second job sums up the results. In job 1, the map tasks route a copy of each A or B sub-matrix to all the reduce tasks, and the reduce tasks perform the sub-matrix multiplications. Depending on the number of reduce tasks and the number of sub-matrices, a reduce task may calculate one or more product sub-matrices. This strategy makes good use of parallelism at the expense of network traffic. In job 2, an identity map task reads from an input split, which is the output of the reduce tasks in job 1, and a reduce task sums up the items for the same sub-matrix. To reduce the network traffic during the sort and shuffle phase, a Combiner is used in the implementation.

CloudBurst. CloudBurst [28] implements a MapReduce-based parallel BLAST sequence alignment algorithm. It allows efficient mapping of reads to reference genomes with a small number of differences. The input of the program comprises two multi-fasta binary files in Hadoop SequenceFile format: one containing the reads and the other containing one or more reference sequences. The output is all alignments for each read with up to a user-specified number of differences, including both mismatches and indels. The program has three phases: map, shuffle, and reduce. The map task emits k-mers as keys for every k-mer in the reference and all non-overlapping k-mers in the reads. During the shuffle phase, the k-mers shared by the reads and the references are grouped. The reduce task extends the seeds into end-to-end alignments, allowing for a fixed number of mismatches or indels.

Sort.
The MapReduce Sort program performs a partial sort of its input data. This program simply uses the map/reduce framework to sort the input directory into the output directory. Each map task is the predefined IdentityMapper and each reduce task is the predefined IdentityReducer, both of which pass their inputs directly to the output. The full input dataset is transferred and sorted during the shuffle phase between the map and reduce tasks. Sort is a very useful benchmark for studying the shuffle phase, which exists in many MapReduce applications.

B. Energy Management Parameter Space

As discussed in Section III, the performance and energy of MapReduce applications are affected by two major factors: n, the number of concurrent workers (i.e., the number of worker nodes times the number of workers per node), and f, the processor frequency on each worker node.

1) The number of concurrent workers

In this work, we execute each benchmark with multiple settings, where each setting is identified by a unique number of concurrent workers. The concurrency is determined by the number of worker nodes allocated and the number of concurrent tasks on each node. To maximize performance and efficiency, we use all 8 processor cores (i.e., C = 8) on each node during benchmark runs. The concurrency ranges from 8 with 1 worker node to 56 with 7 worker nodes. We use hadoop-daemon.sh to control the TaskTracker on each compute node and allow a delay of 15 minutes for Hadoop to recognize the active/inactive nodes. We repeat the experiments 5 times in each setting and use the average performance and energy in the analysis. To ensure that no extra disk and network I/O is introduced by the varying number of concurrent workers, the data replication factor is set to 8 on our 8-node cluster. With this replica setting, each node has a copy of the required data on its local storage disk and accesses the data locally. For CloudBurst and Sort, the data is replicated prior to the job execution.
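A replication setting like the one described above would be expressed in hdfs-site.xml roughly as follows (a sketch; dfs.replication is the standard HDFS property name, and the value simply mirrors the 8-node cluster used here):

```xml
<configuration>
  <!-- Replicate each block to all 8 nodes so every worker reads locally -->
  <property>
    <name>dfs.replication</name>
    <value>8</value>
  </property>
</configuration>
```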
For Matrix Multiplication, the data is generated on the fly.

2) The processor frequency

The key to DVFS scheduling is to identify the workload phases and then adapt the processor frequency to match the computational demand of each phase. In this work, we analyze and identify the workload phases and the corresponding performance and energy use by tracing system activities. Specifically, we trace CPU utilization, memory accesses, disk I/O bandwidth, and network bandwidth on the worker nodes. We consider three DVFS scheduling policies:

Fixed policy: a single processor frequency is used for all cores across the worker nodes during the entire execution.

Adaptive I policy: based on workload phase heuristics observed from MapReduce application performance traces, we insert DVFS scheduling code into the MapReduce programs to adjust the processor frequency during execution. Specifically, this policy uses the maximum processor frequency inside the map and reduce functions, and the minimum processor frequency otherwise. Thus, the computations in the map and reduce tasks run on faster cores while I/O accesses run on slower cores for power reduction. The actual deployment of this policy on the Hadoop system is at the job level, because the Java-based MapReduce framework lacks the capability to identify the specific physical core associated with a map/reduce task. In particular, we set the affinity of the TaskTracker daemons to core 0 on each node and fix its frequency at the maximum speed. We then apply DVFS scaling to the remaining seven cores on each worker node.

Adaptive II policy: this policy is performance-constrained and bounds the performance loss within a user-specified value. The performance loss is relative to the performance at the highest fixed processor frequency. In this work, we set the allowable performance loss to 5%.
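The selection logic behind such a performance-constrained policy can be sketched with the Eq. (4) model (an illustration only, not the actual CPUMiser algorithm; in a real system the on-chip fraction β would be estimated from hardware performance counters):

```python
# Sketch of performance-constrained frequency selection in the spirit of the
# Adaptive II policy: pick the lowest frequency whose predicted slowdown
# under the Eq. (4) model stays within the allowed loss. Illustrative only;
# beta would come from hardware performance counters in a real scheduler.

FREQS_GHZ = [0.8, 1.3, 1.8, 2.5]   # steps available on the test cluster
F_BASE = 2.5                        # highest frequency (baseline)

def predicted_slowdown(f, beta):
    """Eq. (4) execution-time ratio relative to the base frequency."""
    return beta * (F_BASE / f) + (1 - beta)

def pick_frequency(beta, max_loss=0.05):
    """Lowest frequency keeping the predicted performance loss <= max_loss."""
    for f in FREQS_GHZ:             # ascending: prefer the lowest frequency
        if predicted_slowdown(f, beta) <= 1 + max_loss:
            return f
    return F_BASE

print(pick_frequency(beta=0.95))    # CPU-bound phase: stays at 2.5 GHz
print(pick_frequency(beta=0.02))    # I/O-bound phase: drops to 0.8 GHz
```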
With this constraint, a low processor frequency might not be scheduled for an execution phase even if the resulting power reduction far exceeds the performance loss. CPUMiser [29] implements this
policy. CPUMiser uses hardware performance counters to collect fine-grained CPU activity information, and uses this information to predict performance and identify the target processor speed periodically at runtime. CPUMiser runs on each node in the cluster and adapts the processor frequency of each core to the application's demand.

C. Evaluation Metrics

We use execution time T as the performance metric and total system energy E_total as the energy metric. We also introduce two other metrics in our analysis. The first is work-induced energy E_work, defined as:

E_work = E_total − P_idle · T    (5)

The rationale for using work-induced energy in addition to total system energy lies in the fact that in today's data centers, idle power dominates system power consumption, accounting for up to 60% of the system power under load. Meanwhile, motivated by the concept of energy-proportional computing [16], which essentially assumes zero idle power, many techniques are being developed to significantly reduce idle power. Thus, we believe work-induced energy provides a direct indication of the energy demanded by the applications and workloads. The second metric is energy-performance efficiency (EPE), defined as the ratio of performance per Joule relative to the baseline configuration:

EPE(N) = (Perf(N)/E(N)) / (Perf(1)/E(1))    (6)

This metric measures how performance per Joule scales with the number of processor cores within the context of energy-proportional computing. EPE = 1 indicates constant performance per Joule, i.e., performance grows with the number of worker nodes at the same speed as energy consumption. EPE > 1 indicates performance grows faster than energy consumption.

V. EXPERIMENTAL RESULTS

A. Experimental Setup

The experiments are conducted on an 8-node power-aware cluster with Gigabit Ethernet interconnect. Each node has dual quad-core AMD Opteron 2380 processors running Fedora Core 10 Linux. Each core has a 64KB L1 instruction cache, a 64KB L1 data cache, and a unified 512KB L2 cache. The four cores on the same chip share one 6MB L3 cache.
The cluster supports DVFS with four frequencies: 0.8 GHz, 1.3 GHz, 1.8 GHz, and 2.5 GHz. Each node has one WD1600AYPS Raid Edition 7200 rpm SATA hard drive. Hadoop runs on the cluster. One of the nodes runs the NameNode and JobTracker, and the other seven nodes serve as DataNodes and perform the map and reduce tasks. Unless explicitly stated, the number of concurrent workers on each node is eight. For Matrix Multiplication, the input matrices A and B are 2560 by 2560, and the sub-matrix size is 512 by 512. This matrix size provides sufficient load for all cores with the configured 4MB Hadoop file block size (dfs.block.size in hdfs-site.xml). For CloudBurst, the input is the 7.9 million sequencing reads publicly available from the 1000 Genomes Project (accession SRR NA12878) and the chromosome 1 human genome (NCBI Build 36.1). For Sort, we use the randomwriter method in hadoop-*-examples.jar to create seven random 10GB files.

Fig. 2. The variations of performance and energy with the number of concurrent workers for Matrix Multiplication: (a) the normalized performance, energy, and efficiency against 8 workers; (b) the I/O traces; and (c) the power traces and CPU utilization when n = 48 and f = 2.5 GHz.

Fig. 3. The variations of performance and energy with the number of workers for CloudBurst: (a) the normalized performance, energy, and efficiency against 8 workers; (b) the I/O traces; and (c) the power traces and CPU utilization when n = 48 and f = 2.5 GHz.

We use the PowerPack toolkit [18] to profile power and
energy. We attach three Watts Up? Pro USB power meters to three worker nodes and measure the total power of each node. We report the energy consumption of all worker nodes by averaging the measured energy and multiplying by 8. We exclude the energy of the NameNode and leave that investigation to a future study. From the recorded power profiles, the work-induced energy is calculated with Eq. (5).

B. The Effects of the Number of Concurrent Workers

Matrix Multiplication: As shown in Fig. 2(a), the execution time decreases as the number of concurrent workers increases. Due to the overhead terms in Eq. (2), a maximum relative speedup of 3.3, instead of the ideal speedup of 7, is achieved when n = 56. While the total system energy increases significantly when more worker nodes are used, due to system idle power, the work-induced energy increases only slightly. The energy-performance efficiency increases with n and achieves its maximum when n = 48. By allocating 48 concurrent workers on 6 nodes, we can achieve a 3X speedup with 6.6% extra work-induced energy, or 2.8X efficiency using the metric defined in Eq. (6). To explain these observations, we trace the CPU utilization and the network and disk accesses during execution. Fig. 2(c) shows two apparent low-CPU-utilization phases during the execution. The first matches the distribution of input data for the first MapReduce job, and the second corresponds to the finishing of the first MapReduce job and the setup for the second. There is also a short period of low CPU utilization during the first job when the map tasks finish and the shuffle occurs. As the reduce task is computation intensive, high CPU utilization is sustained during the second MapReduce job. Complementing the CPU utilization, three I/O-intensive phases are observed in Fig. 2(b). The first phase corresponds to the job initialization, and the last two correspond to the first and second MapReduce jobs respectively. The power trace in Fig.
2(c) highlights how the total power and idle power of a single node vary during the execution. The work-induced power is the difference between the total power and the idle power. The idle power is about 160 Watts and dominates the total power even when the CPU utilization is close to 100%; it is about twice the maximum work-induced power while the matrix multiplication program executes. This observation indicates that effective power reduction technologies should treat reducing system idle power as a top priority. The work-induced power curve follows the same trend as the CPU utilization. The figure also implies that within this experimental environment, the majority of the work-induced power comes from CPU activity, and the memory and I/O activity only slightly change the total node power.

CloudBurst: As shown in Fig. 3(a), CloudBurst achieves super-linear speedup with the number of concurrent workers because, with a larger number of workers, more data can be accessed in memory rather than from disk. With 48 concurrent workers, CloudBurst achieves a maximum speedup of 12X and a minimum work-induced energy of 0.7X, resulting in the optimal efficiency value. In contrast to Matrix Multiplication, CloudBurst has better scalability in both performance and energy, so allocating more resources to CloudBurst is preferred. The system activity traces provided in Fig. 3(b)-(c) and the MapReduce log files indicate there are two MapReduce jobs in this benchmark, each consisting of a map, a shuffle, and a reduce phase. The first job accounts for 90% of the total execution time, and the CPU utilization is high during most of the map and reduce phases, except in the middle and at the end of the map tasks, where CPU utilization oscillates around 20%. The I/O traces further reveal that network traffic and disk I/O accesses are high within the map and reduce phases. In addition, there are short periods with low CPU and I/O activity between the two MapReduce jobs and between different phases.
These traces indicate that even though CloudBurst is computation intensive, its MapReduce implementation involves significant disk and network accesses and warrants energy efficiency optimization.

Sort: Unlike the above two benchmarks, Sort does not scale well with the number of cores. As shown in Fig. 4(a), while the execution time gradually decreases when more cores are used, the maximum speedup is still less than 2. On the other hand, the work-induced energy gradually increases with the number of concurrent workers. Sort also delivers its best efficiency at n = 48. The system activity traces in Fig. 4(b)-(c) reveal that disk and network accesses are very active during most of the execution period.

Fig. 4. The variations of performance and energy with the number of workers for Sort: (a) the normalized performance, energy, and efficiency against 8 workers; (b) the I/O traces; and (c) the power traces and CPU utilization when n = 48 and f = 2.5 GHz.

These heavy I/O activities are responsible
for a lower CPU utilization than in the previous two benchmarks.

C. The Effects of Processor Frequency

While the analysis in the previous section demonstrates that resource allocation is an effective approach for improving both performance and efficiency, it also points out that there are significant I/O activities within MapReduce applications. Given that DVFS is a practical energy saving technology for non-CPU-bound applications, in this section we discuss how the different DVFS scheduling policies presented in Section IV perform for MapReduce applications. Fig. 5 shows the performance, energy, and efficiency when the three DVFS scheduling policies are applied to the benchmarks running with 56 concurrent workers. The first four groups correspond to the fixed policy at 4 different frequencies: {2.5 GHz, 1.8 GHz, 1.3 GHz, and 0.8 GHz}. Adaptive I inserts DVFS control into the benchmark source code. Adaptive II uses CPUMiser to schedule the core frequencies.

Fixed policy: Overall, for all three benchmarks, the best efficiency is observed when running the benchmarks at a fixed frequency, though the optimal frequency differs from code to code. For Matrix Multiplication, the optimal frequency is 1.8 GHz, at which there is a 35% work-induced energy saving at the cost of 15% performance degradation, resulting in an improved efficiency number. For CloudBurst, 1.8 GHz also yields the best efficiency, 1.18, with 32% savings of work-induced energy at the cost of 24% performance loss. A more interesting result occurs for Sort. At 1.3 GHz, it achieves an efficiency number of 1.33 with a 35% work-induced energy saving and a 4% performance gain. A performance gain from a lower processor frequency has also been observed for the NPB sorting benchmark IS in our earlier work [30]. We believe this is a result of better matching between the processor and system bus speeds. However, this explanation is not yet confirmed and we are still investigating it.
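Such efficiency numbers follow from the relative performance and energy: performance per Joule at the reduced frequency divided by performance per Joule at the 2.5 GHz baseline. A quick sketch, using the reported CloudBurst figures as inputs (small rounding differences against the reported value are expected):

```python
# Efficiency relative to the 2.5 GHz baseline, computed from a relative
# performance loss and a relative work-induced energy saving (cf. Eq. (6)).

def relative_efficiency(perf_loss, energy_saving):
    """(perf per Joule at low frequency) / (perf per Joule at baseline)."""
    perf_ratio = 1.0 / (1.0 + perf_loss)   # e.g. 24% slower -> 1/1.24
    energy_ratio = 1.0 - energy_saving     # e.g. 32% saved  -> 0.68
    return perf_ratio / energy_ratio

# CloudBurst at fixed 1.8 GHz: 32% energy saving, 24% performance loss.
print(round(relative_efficiency(0.24, 0.32), 2))  # close to the reported 1.18
```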
While the results of the fixed policy are promising, there are two major issues with it. First, it requires extensive performance and energy profiling. Second, the performance decrease is usually significant, except in some rare cases such as the Sort benchmark.

Adaptive I policy: With sufficient internal information about the workload, we expect the Adaptive I policy to yield better efficiency improvements. However, the experiments show mixed results. For Matrix Multiplication, this policy reduces the work-induced energy by 19% at the expense of 17% performance degradation. For CloudBurst, it delivers performance similar to 2.5 GHz and reduces the work-induced energy by 5%, which is equivalent to a 3% total system energy saving. For Sort, the resulting performance and energy are similar to those achieved at 1.3 GHz.

Adaptive II policy: Unlike the Adaptive I policy, CPUMiser is implemented as system software and adapts the processor frequency automatically, without requiring code changes or performance profiling. Another unique feature of CPUMiser is that its performance control prevents unacceptable cases such as a large energy saving at the cost of a significant performance slowdown. The experimental results match our expectations. For Matrix Multiplication, the Adaptive II policy reduces the work-induced energy by 23% with a 5% performance loss, improving the efficiency number by 23%. CPUMiser does not save energy for CloudBurst because lowering the processor frequency would adversely degrade performance.

Fig. 5. The effects of various DVFS policies for Matrix Multiplication (a), CloudBurst (b), and Sort (c).

Fig. 6. The power traces under the fixed 2.5 GHz and Adaptive II DVFS scheduling policies for Matrix Multiplication (a), CloudBurst (b), and Sort (c).

For Sort, CPUMiser delivers the same performance as
8 2.5GHZ fixed policy with 4% induced energy reduction. Fig. 6 presents power traces of the three benchmarks with fixed 2.5 GHz and Adaptive II policies. The power traces with Adaptive II policy are identical to those at 2.5 GHz for Matrix Multiplication and CloudBurst, except some shift due to lower processor frequency and lower power consumption for idle or non-cpu intensive phases. For Sort, CPUMiser schedules processor frequency to lower values to save energy. The traces also reveal that as CPUMiser seeks performance oriented energy savings, it works best for current systems with large idle power but might not the best for future energy-proportional computing systems. VI. SUMMARY In this work, we use an experimental approach to study the scalability of performance, energy, and efficiency of MapReduce for computation intensive workloads. Various system activity traces indicate that MapReduce involves significant I/O accesses and CPU underutilization is not uncommon for MapReduce applications, due to the demand of intensive disk and network I/O accesses, as well as the separation of map and reduce tasks. By analyzing how efficiency changes with the number of concurrent MapReduce workers and DVFS scheduling policies, we found that judicious resource allocation (i.e., node counts) and DVFS scheduling could effectively improve efficiency. During our studies, we also observed that performance constrained DVFS scheduling strategies work well on systems with dominating idle power. Nevertheless, they need to be re-evaluated on energy-proportional computing systems where performance and power are treated equally. REFERENCES [1] J. Dean, and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, in Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation (OSDI 2004), San Francisco, CA, 2004, pp [2] P. 
Dubey, "Recognition, Mining and Synthesis Moves Computers to the Era of Tera," Technology@Intel, [3] EPA, Report to Congress on Server and Data Center Energy Efficiency, Public Law , U.S., [4] Intel, "Increasing Data Center Density While Driving Down Power and Cooling Costs," ftp://download.intel.com/design/servers/technologies/thermal.pdf, 2006]. [5] J. Pan, Y. L. Biannic, and F. Magoulès, Parallelizing Multiple Group-by Query in Share-Nothing Environment: a MapReduce Study Case, in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, Chicago, Illinois, 2010, pp [6] S. Chen, and S. Schlosser, Map-Reduce Meets Wider Varieties of Applications, Intel, [7] S. Leo, and G. Zanetti, Pydoop: a Python MapReduce and HDFS API for Hadoop, in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, Chicago, Illinois, 2010, pp [8] D. Huang, X. Shi, S. Ibrahim et al., MR-Scope: a Real-Time Tracing Tool for MapReduce, in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, Chicago, Illinois, 2010, pp [9] J. Ekanayake, H. Li, B. Zhang et al., Twister: a Runtime for Iterative MapReduce, in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, Chicago, Illinois, 2010, pp [10] R. T. Kaushik, and M. Bhandarkar, GreenHDFS: towards an Energy- Conserving, Storage-Efficient, Hybrid Hadoop Compute Cluster, in Proceedings of the 2010 international conference on Power Aware Computing and Systems, Vancouver, BC, Canada, 2010, pp [11] Y. Chen, L. Keys, and R. H. Katz, Towards Energy Efficient MapReduce, UCB/EECS , EECS Department, University of California, Berkeley, [12] Y. Chen, A. Ganapathi, A. Fox et al., Statistical Workloads for Energy Efficiency MapReduce, UCB/EECS , University of California, Berkeley, [13] J. Leverich, and C. Kozyrakis, On the Energy (in)efficiency of Hadoop Clusters, SIGOPS Oper. Syst. Rev., vol. 44, no. 
1, pp , [14] T. Hoefler, A. Lumsdaine, and J. Dongarra, Towards Efficient MapReduce Using MPI, in Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, Espoo, Finland, 2009, pp [15] C. Ranger, R. Raghuraman, A. Penmetsa et al., Evaluating MapReduce for Multi-Core and Multiprocessor Systems, in Proceedings of the 13th IEEE International Symposium on High Performance Computer Architecture, USA, 2007, pp. 12. [16] L. A. Barroso, and U. Hölzle, The Case for Energy-Proportional Computing, Computer, vol. 40, no. 12, pp , [17] "Hadoop webpage," [18] R. Ge, X. Feng, S. Song et al., PowerPack: Energy Profiling and Analysis of High-Performance Systems and Applications, IEEE Transactions on Parallel and Distributed Systems, vol. 21, no. 5, pp , [19] D. Economou, S. Rivoire, C. Kozyrakis et al., "Full-System Power Analysis and Modeling for Server Environments," Workshop on Modeling, Benchmarking, and Simulation (MoBS), [20] P. Bohrer, E. N. Elnozahy, T. Keller et al., "The Case for Power Management in Web Servers," Power aware computing, pp : Kluwer Academic Publishers, [21] D. Meisner, B. T. Gold, and T. F. Wenisch, PowerNap: Eliminating Server Idle Power, in Proceeding of the 14th international conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2009), Washington, DC, USA, 2009, pp [22] B. M. Oppenheim, "Reducing Cluster Power Consumption by Dynamically Suspending Idle Nodes," DigitalCommons@CalPoly, [23] W. Lang, and J. M. Patel, Energy Management for MapReduce Clusters, Proc. VLDB Endow., vol. 3, no. 1-2, pp , [24] J. Xie, S. Yin, X. Ruan et al., Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters, in Proceedings of the 19th International Heterogeneity in Computing Workshop, Atlanta, Georgia, [25] M. Cardosa, A. Singh, H. 
Pucha et al., "Exploiting Spatio-Temporal Tradeoffs for Energy Efficient MapReduce in the Cloud," Department of Computer Science and Engineering, University of Minnesota, [26] Y. Chen, A. Ganapathi, and R. H. Katz, To Compress or Not to Compress - Compute vs. IO Tradeoffs for Mapreduce Energy Efficiency, in Proceedings of the first ACM SIGCOMM workshop on Green networking, New Delhi, India, 2010, pp [27] J. Norstad. "A MapReduce Algorithm for Matrix Multiplication," 2010; [28] M. C. Schatz, CloudBurst: Highly Sensitive Read Mapping with MapReduce, Bioinformatics, vol. 25, no. 11, pp , [29] R. Ge, X. Feng, W.-c. Feng et al., CPU MISER: A Performance- Directed, Run-Time System for Power-Aware Clusters, in Proceedings of International Conference on Parallel Processing (ICPP 2007), 2007, pp [30] R. Ge, X. Feng, and K. W. Cameron, Performance-constrained Distributed DVS Scheduling for Scientific Applications on Poweraware Clusters, in Proceedings of the 2005 ACM/IEEE conference on Supercomputing, 2005, pp
Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Krishna Kandalla, Emilio P. Mancini, Sayantan Sur, and Dhabaleswar. K. Panda Department of Computer Science & Engineering,
More informationPRELIMINARY RESULTS: MODELING RELATION BETWEEN TOTAL EXECUTION TIME OF MAPREDUCE APPLICATIONS AND NUMBER OF MAPPERS/REDUCERS
SCHOOL OF INFORMATION TECHNOLOGIES PRELIMINARY RESULTS: MODELING RELATION BETWEEN TOTAL EXECUTION TIME OF MAPREDUCE APPLICATIONS AND NUMBER OF MAPPERS/REDUCERS TECHNICAL REPORT 679 NIKZAD BABAII RIZVANDI,
More informationA priority based dynamic bandwidth scheduling in SDN networks 1
Acta Technica 62 No. 2A/2017, 445 454 c 2017 Institute of Thermomechanics CAS, v.v.i. A priority based dynamic bandwidth scheduling in SDN networks 1 Zun Wang 2 Abstract. In order to solve the problems
More informationModification and Evaluation of Linux I/O Schedulers
Modification and Evaluation of Linux I/O Schedulers 1 Asad Naweed, Joe Di Natale, and Sarah J Andrabi University of North Carolina at Chapel Hill Abstract In this paper we present three different Linux
More informationThe Use of Cloud Computing Resources in an HPC Environment
The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes
More informationMap Reduce Group Meeting
Map Reduce Group Meeting Yasmine Badr 10/07/2014 A lot of material in this presenta0on has been adopted from the original MapReduce paper in OSDI 2004 What is Map Reduce? Programming paradigm/model for
More information