Aggregation on the Fly: Reducing Traffic for Big Data in the Cloud


Huan Ke, Peng Li, Song Guo, and Ivan Stojmenovic

Huan Ke, Peng Li, and Song Guo are with the University of Aizu. Ivan Stojmenovic is with the School of Information Technology and Engineering, University of Ottawa.

Abstract

As a leading framework for processing and analyzing big data, MapReduce is leveraged by many enterprises to parallelize their data processing on distributed computing systems. Unfortunately, the all-to-all data forwarding from map tasks to reduce tasks in the traditional MapReduce framework generates a large amount of network traffic. In many applications, the intermediate data generated by map tasks can be combined with a significant traffic reduction, which motivates us to propose a data aggregation scheme for MapReduce jobs in the cloud. Specifically, we design an aggregation architecture under the existing MapReduce framework with the objective of minimizing the data traffic during the shuffle phase, in which aggregators can reside anywhere in the cloud. Experimental results also show that our proposal outperforms existing work by reducing the network traffic significantly.

Big data [1] has become increasingly popular, with defining characteristics of volume, variety, value, and velocity. Many large companies like Facebook, Google, Yahoo!, and Amazon generate large amounts of data every day. Gartner [2] predicts that 4.4 million jobs will be created around big data by 2014. Technologies are needed to tap into the growing quantities of data to help businesses make better, more informed decisions. As a promising framework for parallel big data processing in distributed computing systems, implemented by the open source Hadoop [3], MapReduce [4] has been widely adopted to effectively and quickly analyze data ranging from terabytes to petabytes in size. Typically, a MapReduce job consists of a number of parallel map tasks, followed by reduce tasks that merge all intermediate results, in the form of key-value pairs generated by map tasks, to produce final results.

The large volume of intermediate data delivered from map tasks to reduce tasks occupies excessive network bandwidth, leading to network congestion that can seriously degrade the performance of MapReduce jobs. Data aggregation has been shown to be effective in reducing intermediate data. Its basic idea is to aggregate the key-value pairs sharing the same keys before forwarding them to reduce tasks. For example, in the WordCount application that counts the number of words in a block of text, a map task will generate 100 key-value pairs of <the, 1> if "the" shows up 100 times in the given text. In the traditional MapReduce framework, all these key-value pairs are directly sent to the reduce task. When data aggregation is applied, a single key-value pair, <the, 100>, is created by summing up the counts and then sent to the reduce task, occupying only one percent of the bandwidth of the traditional scheme. Note that data aggregation can be applied only when the operation on the intermediate results is commutative (i.e., a + b = b + a) and associative (i.e., a + (b + c) = (a + b) + c).

The promise of data aggregation was preliminarily exploited by the combiner function [4], which merges the intermediate data generated by a map task. Later, it was extended to aggregate the results of multiple map tasks within the same machine or rack [5]. However, these works ignore the data redundancy among parallel map-reduce flows of the same job. In this article, we propose a novel scheme that fully exploits data aggregation opportunities to further reduce data traffic within MapReduce jobs. Specifically, we devise a new module to be incorporated into the existing Hadoop architecture, called the aggregator, which can merge the intermediate results not only from the same machine, but also from different ones.
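The WordCount case above can be sketched in a few lines of Python. This is only an illustrative sketch, not the combiner code of Hadoop; the function names map_wordcount and aggregate are assumptions introduced here. It shows that summing the values per key collapses the 100 pairs into one, and that the result does not depend on the order in which pairs arrive, which is what commutativity and associativity guarantee.

```python
from collections import Counter

def map_wordcount(text):
    """Emit one (word, 1) pair per word, as a WordCount map task would."""
    return [(word, 1) for word in text.split()]

def aggregate(pairs):
    """Sum the values of all pairs that share a key.

    Summation is commutative and associative, so the pairs may be
    combined in any order and in any grouping.
    """
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

# Hypothetical input: the word "the" repeated 100 times.
pairs = map_wordcount("the " * 100)
aggregated = aggregate(pairs)

# Aggregating the same pairs in reverse order gives the same result.
aggregated_reversed = aggregate(reversed(pairs))

print(len(pairs), "pairs before aggregation,", len(aggregated), "after")
```

An operation such as taking a median would not qualify, because it cannot be computed from partial results combined in arbitrary order.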
To achieve efficient data aggregation, we deal with the challenges of aggregator placement and data routing between map and reduce tasks.

Background of MapReduce

MapReduce is a software framework for big data processing on large clusters consisting of hundreds or thousands of machines. Users submit a data processing request, referred to as a job in MapReduce, by specifying a map and a reduce function. When a job is executed, two types of tasks, map and reduce, are created. The input data are divided into independent splits that are processed by map tasks in parallel. The generated intermediate results, in the form of key-value pairs, may be shuffled and sorted by the framework, and are then fetched by reduce tasks to produce final results.

For a better understanding, we use the WordCount example to show the process of MapReduce. As shown in Fig. 1, the input file is divided into three splits that are processed by three map tasks, respectively. For example, the first map task will extract four key-value pairs from the first data split: <cat, 1>, <dog, 1>, <fish, 1>, <cat, 1>. There are two reduce tasks in our example, each of which is responsible for processing two keys. After all key-value pairs are sent to the corresponding reduce tasks, they produce the final results by calculating the total count of each word.

Figure 1. MapReduce framework.

IEEE Network, September/October 2015

Concept of Data Aggregation

In real implementations of MapReduce, like Hadoop, map and reduce tasks usually reside on different machines, as shown in Fig. 1. Since large amounts of intermediate data may be delivered from map tasks to reduce tasks, the network can be heavily burdened. By carefully examining the intermediate results, we discover significant data redundancy in key-value pairs. In the WordCount example shown in Fig. 1, the first map task, Mapper 1, generates two identical pairs of <cat, 1>. This observation motivates us to aggregate the key-value pairs sharing the same keys before forwarding them to reduce tasks. To fully exploit the data aggregation opportunities, we study two kinds of data aggregation schemes, intra-machine and inter-machine, which are elaborated in the following.

Intra-Machine Data Aggregation

The most straightforward way to reduce data traffic is to aggregate the key-value pairs with the same key generated by map tasks within the same machine before they are sent over the network. This is referred to as intra-machine data aggregation in this article. The WordCount example with intra-machine data aggregation is shown in Fig. 2, where an aggregator is created to merge the intermediate results generated by each map task. For example, the number of key-value pairs sent out by the first machine is reduced to 3 by aggregating the two pairs of <cat, 1> into a single pair <cat, 2>. Compared to the traditional scheme, where 12 key-value pairs are sent from map tasks to reduce tasks, data aggregation reduces the number to 8.
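The effect of intra-machine aggregation on the number of transmitted pairs can be reproduced with a minimal sketch. Only the first split (cat, dog, fish, cat) comes from the text; the contents of the second and third splits are hypothetical and were chosen here purely for illustration, as are the names map_task and intra_machine_aggregate.

```python
from collections import Counter

def map_task(split):
    """Emit one (word, 1) pair per word in the split."""
    return [(word, 1) for word in split.split()]

def intra_machine_aggregate(pairs):
    """Merge pairs with the same key produced on one machine."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# The first split follows the article; the other two are assumed.
splits = {
    "machine1": "cat dog fish cat",
    "machine2": "dog bird dog cat",
    "machine3": "fish fish bird bird",
}

raw = {machine: map_task(split) for machine, split in splits.items()}
aggregated = {machine: intra_machine_aggregate(p) for machine, p in raw.items()}

# Number of key-value pairs each scheme sends over the network.
pairs_without_aggregation = sum(len(p) for p in raw.values())
pairs_with_aggregation = sum(len(a) for a in aggregated.values())

print(pairs_without_aggregation, "->", pairs_with_aggregation)
```

With these assumed splits the first machine sends three pairs instead of four, because its two <cat, 1> pairs are merged into <cat, 2>.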
Figure 2. Intra-machine data aggregation.

Inter-Machine Data Aggregation

In addition to intra-machine data aggregation, we can further reduce the data traffic by aggregating the intermediate results from different machines, referred to as inter-machine data aggregation. We again use the example in Fig. 2 to explain the idea. Consider the links from map tasks to reduce tasks as the bottleneck of the network. To further reduce the number of key-value pairs sent over these links, we send the aggregated results (i.e., three key-value pairs) in machine 2 to the aggregator in another machine, which merges the received data from the two machines, as shown in Fig. 3a. As a result, only four key-value pairs are sent to reduce tasks. An alternative solution is to conduct inter-machine data aggregation at machine 2, as shown in Fig. 3b. Although the number of key-value pairs sent over the bottleneck links is still four, only two key-value pairs are delivered to machine 2 from the other machine, leading to a reduced total traffic cost. This example reveals that the selection of nodes conducting inter-machine data aggregation can affect performance, and thus becomes an additional challenge we need to handle.

Architecture

We enhance the existing Hadoop architecture by integrating new modules, the aggregator and the aggregator manager, to facilitate efficient aggregation in a virtual cloud data center.

Overview

A MapReduce cloud service enables enterprises to perform cost-effective big data analytics without building large infrastructures of their own. Using virtual machines and storage hosted by the cloud, enterprises can simply create a virtual MapReduce cluster to analyze big data. In the virtual cluster, the intermediate data forwarded from map tasks to reduce tasks generates a large amount of data traffic in the shuffle phase. This motivates us to propose an architecture that aggregates the intermediate results on the fly with the objective of minimizing the network traffic of MapReduce jobs.

As shown in Fig. 4, Hadoop consists of a JobTracker as a master node, and multiple TaskTrackers located on the remaining slave nodes. The JobTracker is responsible for handling all submitted jobs, making scheduling decisions, and parallelizing the application across the cluster. The TaskTrackers are responsible for running the parallel tasks by following instructions from the JobTracker. To implement data aggregation, we incorporate two modules, the aggregator and the aggregator manager, into the existing Hadoop architecture. The aggregation operations are conducted by aggregators, while the aggregator manager, residing in the JobTracker, collaborates with other components to determine the set of virtual machines that should accommodate aggregators for each MapReduce job. This architecture aggregates the intermediate data in the shuffle phase such that network traffic can be significantly reduced.

In our enhanced programming framework, aggregators are located between the map and reduce phases. Each aggregator accepts as input the intermediate results generated by several map tasks, which are specified by the aggregator manager. Note that a mapper can still send its intermediate results directly to reducers without passing through an aggregator, just as it does in the traditional MapReduce framework. After obtaining the intermediate results from map tasks, each aggregator performs a reduce-like operation to combine the key-value pairs with the same key, such that each key is included in a single pair with an aggregated value instead of multiple pairs. After that, all aggregated results with the same key are sent to a single reducer. In the system architecture shown in Fig. 4, the execution of aggregators is managed by the TaskTracker in each virtual machine of the virtual cluster.
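The reduce-like operation of an aggregator, followed by the rule that all results with the same key go to a single reducer, might look as follows. This is a sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, the per-machine inputs are made-up values, and the CRC32-based partitioner is an assumption standing in for whatever partition function a real job uses.

```python
import zlib
from collections import Counter, defaultdict

def aggregator(mapper_outputs):
    """Combine the outputs of several map tasks so each key appears once."""
    merged = Counter()
    for output in mapper_outputs:
        for key, value in output.items():
            merged[key] += value
    return dict(merged)

def partition(key, num_reducers):
    """Deterministically map a key to one reducer index (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def route(aggregated, num_reducers):
    """Group aggregated pairs by the reducer responsible for each key."""
    per_reducer = defaultdict(dict)
    for key, value in aggregated.items():
        per_reducer[partition(key, num_reducers)][key] = value
    return dict(per_reducer)

# Hypothetical intra-machine results of two machines (3 pairs each).
machine_a = {"cat": 2, "dog": 1, "fish": 1}
machine_b = {"dog": 2, "bird": 1, "cat": 1}

merged = aggregator([machine_a, machine_b])
routed = route(merged, num_reducers=2)

print(len(machine_a) + len(machine_b), "pairs in,", len(merged), "pairs out")
```

A stable hash is used instead of Python's built-in hash() because the latter is randomized per process for strings, which would send the same key to different reducers on different machines.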
When the TaskTracker receives a request to create an aggregator from the aggregator manager residing in the JobTracker, it immediately initializes an aggregator instance and specifies its associated map and reduce tasks using the information attached to the request. Finally, once the aggregation is completed, the TaskTracker destroys the aggregator and sends a notification message to the aggregator manager.

Aggregator Manager

The aggregator manager mainly deals with the aggregator placement and bandwidth assignment problems, as described in the following.

Aggregator Placement

An intuitive method is to create an aggregator on each machine along the path from a map task to a reduce task. However, this would occupy too many computational resources in the virtual cluster, leading to low resource utilization. In our proposal, we allow users to specify a maximum number of aggregators, just as they specify the numbers of map tasks and reduce tasks. The aggregator manager aims to minimize the network traffic during the shuffle phase by answering the following two key questions:

- Placement: which machine should create an aggregator?
- Routing: to which aggregator should the intermediate data of each mapper be forwarded?

To determine aggregator placement, the aggregator manager needs information about the map and reduce tasks, including their locations and the estimated intermediate data volume. Moreover, it also needs to know the remaining resources at each TaskTracker, that is, whether it is able to accommodate an aggregator. Such information is attached to the periodic heartbeat message from each TaskTracker to the JobTracker, which reports the availability of resources for running a new task. After obtaining this information, the aggregator manager executes an in-cloud aggregation algorithm to determine the aggregator placement and routing strategies to be sent to each TaskTracker.
Bandwidth Assignment

With a pay-as-you-go charging model, tenants can create a virtual cluster by renting a set of virtual machines with performance isolation on CPU and memory resources. It has recently been recognized that the bandwidth between virtual machines also plays a critical role for applications in the cloud, because network bandwidth may fluctuate significantly due to competition among network-intensive applications. Nowadays, cloud service providers allow tenants to reserve the network bandwidth between virtual machines for a fee. Such a scheme helps tenants be aware of network traffic and reduce cost by reserving bandwidth according to their requirements.

Figure 3. Inter-machine data aggregation.

Without aggregation, a simple fair-sharing bandwidth assignment scheme is enough because the output of map tasks is in general uniformly distributed among reduce tasks. When aggregation is applied, the communication paths of traditional MapReduce change because of the existence of aggregators. Moreover, aggregators may reduce the amount of data going through them. Therefore, the bandwidth should be assigned according to the data traffic on each link after aggregation. Specifically, the aggregator manager estimates the data traffic according to the results of aggregator placement, and then reserves the bandwidth with the cloud service provider.

In-Cloud Aggregation

The objective of in-cloud aggregation is to minimize the total routing traffic needed to complete the given MapReduce job under a certain budget of aggregators. This section presents a greedy algorithm for the aggregator placement problem.

Consider a given MapReduce job with a number of map tasks and a single reduce task, which have already been deployed into the cloud. The corresponding aggregator placement problem is essentially to construct an overlay multicast tree in which the root, leaf, and intermediate nodes represent the reducer, mappers, and aggregators, respectively. If no aggregators are introduced, all leaf nodes have to connect to the root node directly. To reduce the communication cost, a limited number of aggregators are provided such that all incoming traffic to any aggregator is aggregated into a condensed volume before being delivered to the next hop. Notice that an intermediate node can function as both aggregator and mapper at the same time.
The communication cost of a tree link is defined as the traffic volume over the link times the hop number of the path connecting the tree nodes of the link. The traffic volume from an aggregator is the aggregation coefficient a, which is smaller than 1 in general, times the overall intermediate results from all its associated mappers, including itself if the aggregator is a mapper as well. The cost of a tree is the summation of the costs on all links of the tree. The objective of our algorithm is to construct an overlay multicast tree such that its cost is minimized and the number of intermediate nodes does not exceed a given budget.

Figure 4. The architecture of in-cloud aggregation.

Figure 5. An example: a) the hop-distance matrix between virtual machines; b) the execution result in the first round; c) the execution result in the second round.

Initially, the tree is constructed with the only reducer as the root node and all mappers as leaf nodes connected to the root by direct links, with no aggregators involved. The basic idea of our algorithm is to iteratively reconstruct the multicast tree by disconnecting the tree link with the largest cost and then reconnecting the detached component to:

- An existing aggregator
- A new aggregator that is temporarily assigned in the cloud, if under budget

such that the resulting tree achieves the lowest cost over all possible trials belonging to the above two cases. If no such update is possible, an alternative tree link is considered in decreasing order of cost, that is, the link with the second largest cost is checked, and so on. The iteration proceeds until the tree cannot be further improved, and the resulting tree is our desired solution.
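The cost model and the iterative reconstruction described above can be sketched as follows. This is a simplified reading of the algorithm, not the authors' code: it assumes a single level of aggregators that connect directly to the reducer, it only detaches mapper links (never aggregator-to-reducer links), and it lets a mapper co-located with a newly created aggregator join that aggregator. The names tree_cost and greedy_placement, and all hop distances below, are hypothetical and are not the values of Fig. 5.

```python
def tree_cost(assign, m, h, root, a):
    """Cost of the overlay tree; assign[i] is the node mapper i sends to."""
    cost = 0.0
    load = {}
    for i, target in assign.items():
        cost += m[i] * h[i][target]           # mapper link (0 hops if co-located)
        if target != root:
            load[target] = load.get(target, 0.0) + m[i]
    for g, volume in load.items():
        cost += a * volume * h[g][root]       # condensed aggregator -> reducer link
    return cost

def greedy_placement(m, h, root, a, budget, candidates):
    """Iteratively reattach the costliest mapper link to an aggregator."""
    assign = {i: root for i in m}             # start: every mapper -> reducer
    best = tree_cost(assign, m, h, root, a)
    improved = True
    while improved:
        improved = False
        links = sorted(((m[i] * h[i][t], i) for i, t in assign.items()), reverse=True)
        for _, i in links:                    # decreasing order of link cost
            aggregators = set(assign.values()) - {root}
            best_trial = None
            for g in (set(candidates) | aggregators) - {assign[i]}:
                trial = dict(assign)
                trial[i] = g
                # A mapper hosted on a new aggregator node joins it (assumption).
                if g in m and g not in aggregators and trial[g] == root:
                    trial[g] = g
                if len(set(trial.values()) - {root}) > budget:
                    continue                  # would exceed the aggregator budget
                c = tree_cost(trial, m, h, root, a)
                if c < best and (best_trial is None or c < best_trial[0]):
                    best_trial = (c, trial)
            if best_trial is not None:
                best, assign = best_trial
                improved = True
                break                         # rebuild the link ordering
    return assign, best

# Hypothetical 5-node cluster: reducer on node 3, one mapper on every other node.
nodes = [1, 2, 3, 4, 5]
hops = {(1, 2): 1, (1, 3): 6, (1, 4): 7, (1, 5): 7, (2, 3): 4,
        (2, 4): 5, (2, 5): 6, (3, 4): 5, (3, 5): 3, (4, 5): 1}
h = {u: {v: 0 for v in nodes} for u in nodes}
for (u, v), d in hops.items():
    h[u][v] = h[v][u] = d
m = {1: 1, 2: 1, 4: 1, 5: 1}

initial = tree_cost({i: 3 for i in m}, m, h, 3, 0.5)
assign, final = greedy_placement(m, h, root=3, a=0.5, budget=2, candidates=[1, 2, 4, 5])
print(initial, "->", final, assign)
```

Each accepted update strictly lowers the tree cost, so the loop terminates; like any greedy procedure it may stop at a local optimum.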

The above algorithm can be extended to the general case with multiple reducers in a straightforward manner. The difference is to create a virtual root node representing all given reducers. The cost of any link associated with this root node is calculated by averaging the costs of the corresponding links to all real reducers.

For a better understanding, we use an example to show the execution process of our proposed algorithm. We consider a virtual cluster consisting of five virtual machines, where node 3 accommodates the only reducer, and each other node contains a mapper and is able to accommodate an aggregator. The hop distance h(i, j) between any two virtual machines i and j is expressed by the matrix shown in Fig. 5a, and the size of intermediate data generated by the mapper in node i is denoted by m_i. The settings are given as follows: m_i = for i 5, a = 0.5, and the maximum number of aggregators is 2.

In the first iteration of the algorithm, the initial overlay tree is shown in Fig. 5b-1, where the number beside a tree link shows the corresponding cost, and the link between nodes 1 and 3 (the root node) has the largest cost. To reduce this traffic, all possible placements of a new aggregator (e.g., on nodes 2, 4, and 5) are shown in Fig. 5b-2. The updated tree with the minimum cost is given in Fig. 5b-3.

In the second iteration, the bottleneck link connecting nodes 4 and 3 is identified. Two possible updates are considered, as shown in Fig. 5c-2:

1. The mapper on node 4 connects to the existing aggregator on node 2. Accordingly, the cost on the link connecting nodes 2 and 3 is updated as a(m_1 + m_2 + m_4) h(2, 3).
2. The mapper on node 4 connects to a new aggregator on node 5. Accordingly, the cost on the link connecting nodes 5 and 3 is updated as a(m_4 + m_5) h(5, 3) = 8.

The second option is chosen, as shown in Fig. 5c-3, which is also the final tree constructed by our algorithm since no further improvement can be made.
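The virtual root used for the multi-reducer case can be illustrated with a short sketch. Because the cost of a link is the traffic volume times the hop number, averaging the link costs over all real reducers is the same as multiplying the volume by the average hop distance. The function names and the small hop matrix below are assumptions made for illustration only.

```python
def virtual_root_hops(h, reducers):
    """Average hop distance from every node to the real reducers."""
    return {node: sum(row[r] for r in reducers) / len(reducers)
            for node, row in h.items()}

def link_cost_to_virtual_root(volume, node, h, reducers):
    """Average of the costs of the links from `node` to each real reducer."""
    per_reducer = [volume * h[node][r] for r in reducers]
    return sum(per_reducer) / len(per_reducer)

# Hypothetical 3-node cluster with reducers on nodes 2 and 3.
h = {1: {1: 0, 2: 4, 3: 6},
     2: {1: 4, 2: 0, 3: 2},
     3: {1: 6, 2: 2, 3: 0}}
reducers = [2, 3]

avg = virtual_root_hops(h, reducers)
cost = link_cost_to_virtual_root(2.0, 1, h, reducers)
print(avg, cost)
```

With the averaged distances in hand, a single-reducer placement routine can be reused unchanged by treating the virtual root as the only reducer.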
Compared to the original tree with an overall communication cost of 34, the new tree reduces the cost to 18, a reduction of 47 percent.

Performance Evaluation

To evaluate our proposed heuristic algorithm, both prototype and simulation tests are conducted in this section. The performance baseline is provided by the original scheme (i.e., no aggregation is provided). For comparison, we also implement a random placement algorithm that places a given number of aggregators on physical nodes randomly.

Prototype-Based Experiment

Our prototype has been implemented on Hadoop and a VMWare workstation. The WordCount job is tested on a set of source files. To validate our proposed algorithm, we use the example illustrated in Fig. 5. The output sizes of the mappers on nodes 1-5 are 206.M, .6M, 84.7M, 6.0M, and 87.2M, respectively. From measurement, the aggregation coefficient a is approximately equal to 0.5.

Our experimental results show that, without aggregation, the original overall communication cost is 664. On the other hand, when mappers 1 and 2 are chosen to aggregate on node 2, the input and output sizes are 402.M and 20.M, respectively. Similarly, after mapper 4 and mapper 5 are aggregated on node 5, the input and output sizes are 7.M and 8.7M, respectively. The total cost becomes 546., a significant reduction of 46.7 percent.

Simulation-Based Experiment

In our simulations, the numbers of map tasks, reduce tasks, and virtual machines are set to 00, 2, and 00, respectively. The hop number between any pair of tasks is generated randomly within the range (, 20). The results in all figures are averaged over 00 instances.

Figure 6. The traffic cost vs. different values of a (curves: Aggregation, Random, Original).

Figure 7. The traffic cost vs. maximum number of aggregators (curves: Aggregation, Random, Original).

We first evaluate the effect of the aggregation coefficient a by fixing the maximum number of aggregators to 50.
From Fig. 6, we find that the average traffic costs of the aggregation and random algorithms both increase as the value of a grows from 0. to 1. When a = 1.0, aggregation cannot affect the traffic cost; therefore, all algorithms perform the same. In all other cases, our heuristic algorithm always outperforms the other two schemes, especially when a is small.

Then we investigate the influence of the number of aggregators by changing its value from 0 to 00 and fixing a to 0.5. As shown in Fig. 7, the traffic cost of our algorithm decreases quickly at the beginning compared to the others, showing that substantial gain can be achieved by introducing more aggregators and optimizing their placement. When a sufficient number of aggregators is allowed, its performance gradually converges to that of the random algorithm. Our simulation results demonstrate that our proposed aggregation algorithm can significantly reduce in-cloud traffic in most cases.

Conclusion

In this article, we discuss the importance of aggregation for in-cloud traffic reduction. To verify our idea, we propose an aggregation architecture that can easily be incorporated into the existing MapReduce framework. We also investigate the aggregator placement problem and design an aggregation algorithm to minimize the overall network traffic among the map and reduce tasks of a big data job. Both prototype and simulation-based tests have been conducted, and the experimental results validate the efficiency of our proposal in reducing the network traffic.

Acknowledgment

This research was partially supported by the Strategic International Collaborative Research Program (SICORP) Japanese (JST) - U.S. (NSF) Joint Research "Big Data and Disaster Research" (BDD).

References

[1] D. Howe et al., "Big Data: The Future of Biocuration," Nature, vol. 455, no. 7209, 2008, pp. 47-50.
[2] M. Barlow, The Culture of Big Data, O'Reilly Media, Inc., 2013.
[3] Hadoop.
[4] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proc. OSDI, San Francisco, CA, 2004, pp. 137-150.
[5] Y. Yu, P. K. Gunda, and M. Isard, "Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations," Proc. ACM Symp. Op. Sys. Principles, 2009, pp. 247-260.

Biographies

HUAN KE received her B.S. degree from Huazhong University of Science and Technology, China, in 20. She is currently a Master's student at the University of Aizu, Japan. Her research interests include cloud computing and big data.

PENG LI received his B.S. degree from Huazhong University of Science and Technology in 2007, and his M.S. and Ph.D. degrees from the University of Aizu in 2009 and 2012, respectively. He is currently an associate professor at the University of Aizu. His research interests include networking modeling, cross-layer optimization, network coding, cooperative communications, cloud computing, smart grid, and performance evaluation of wireless and mobile networks.

SONG GUO [M'02, SM'11] received a Ph.D. degree in computer science from the University of Ottawa, Canada. He is a full professor at the School of Computer Science and Engineering, University of Aizu. His research interests are mainly in the areas of wireless communication and mobile computing, cloud computing and networking, and cyber-physical systems. He serves as Associate Editor of IEEE TPDS and IEEE TETC. He is a Senior Member of ACM.

IVAN STOJMENOVIC [F'08] received his Ph.D. degree in mathematics. He is a full professor at the University of Ottawa. He has published over 300 papers, and edited seven books on wireless, ad hoc, sensor, and actuator networks and applied algorithms with Wiley. He has been a Fellow of the Canadian Academy of Engineering since 2012 and a member of the Academia Europaea (the Academy of Europe) since 2012 (section: Informatics).
