Aggregation on the Fly: Reducing Traffic for Big Data in the Cloud


Huan Ke, Peng Li, Song Guo, and Ivan Stojmenovic

Abstract

As a leading framework for processing and analyzing big data, MapReduce is used by many enterprises to parallelize their data processing on distributed computing systems. Unfortunately, the all-to-all forwarding of data from map tasks to reduce tasks in the traditional MapReduce framework generates a large amount of network traffic. In many applications, the intermediate data generated by map tasks can be combined with a significant reduction in traffic, which motivates us to propose a data aggregation scheme for MapReduce jobs in the cloud. Specifically, we design an aggregation architecture under the existing MapReduce framework, in which aggregators can reside anywhere in the cloud, with the objective of minimizing the data traffic during the shuffle phase. Experimental results show that our proposal significantly reduces network traffic compared with existing work.

Big data [1] has become increasingly popular, with volume, variety, value, and velocity as its defining characteristics. Many large companies, such as Facebook, Google, Yahoo!, and Amazon, generate large amounts of data every day. Gartner [2] predicts that 4.4 million jobs will be created around big data by 2014. Technologies are needed to tap into the growing quantities of data to help businesses make better, more informed decisions. As a promising framework for parallel big data processing in distributed computing systems, with an open source implementation in Hadoop [3], MapReduce [4] has been widely adopted to analyze data ranging from terabytes to petabytes in size quickly and effectively. Typically, a MapReduce job consists of a number of parallel map tasks, followed by reduce tasks that merge all the intermediate results, which the map tasks generate in the form of key-value pairs, to produce the final results.
These large volumes of intermediate data delivered from map tasks to reduce tasks consume excessive network bandwidth, leading to network congestion that can seriously degrade the performance of MapReduce jobs.

Data aggregation has been shown to be effective in reducing intermediate data. Its basic idea is to aggregate the key-value pairs sharing the same key before forwarding them to reduce tasks. For example, in the WordCount application, which counts the occurrences of each word in a block of text, a map task generates 100 key-value pairs of <the, 1> if "the" shows up 100 times in the given text. In the traditional MapReduce framework, all of these key-value pairs are sent directly to the reduce task. When data aggregation is applied, a single key-value pair, <the, 100>, is created by summing the counts and is then sent to the reduce task, which occupies only one percent of the bandwidth used by the traditional scheme. Note that data aggregation can be applied only when the operation on the intermediate results is commutative (i.e., a + b = b + a) and associative (i.e., a + (b + c) = (a + b) + c).

The promise of data aggregation was first exploited by the combiner function [4], which merges the intermediate data generated by a map task. Later, it was extended to aggregate the results of multiple map tasks within the same machine or rack [5]. However, these works ignore the data redundancy among parallel map-reduce flows of the same job. In this article, we propose a novel scheme that fully exploits data aggregation opportunities to further reduce data traffic within MapReduce jobs. Specifically, we devise a new module, called the aggregator, to be incorporated into the existing Hadoop architecture; it can merge intermediate results not only from the same machine but also from different ones.

Huan Ke, Peng Li, and Song Guo are with the University of Aizu. Ivan Stojmenovic is with the School of Information Technology and Engineering, University of Ottawa.
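The requirement that the aggregated operation be commutative and associative can be illustrated with a minimal Python sketch. The helper name `combine` and the sample data are assumptions made for illustration; they are not part of Hadoop or of the authors' implementation. The sketch merges 100 pairs for the word "the" into one pair, checks that combining in two stages gives the same result, and shows that a plain average does not survive two-stage combining.

```python
import operator

def combine(pairs, op=operator.add):
    """Merge key-value pairs that share a key using a binary operator."""
    merged = {}
    for key, value in pairs:
        merged[key] = op(merged[key], value) if key in merged else value
    return merged

# 100 occurrences of "the" collapse into a single pair.
pairs = [("the", 1)] * 100
one_stage = combine(pairs)

# Combining partial results a second time gives the same answer,
# because addition is commutative and associative.
partials = list(combine(pairs[:30]).items()) + list(combine(pairs[30:]).items())
two_stage = combine(partials)

# Counter-example: a plain mean is not associative, so averaging
# partial averages changes the result.
values = [1, 2, 3, 4, 5, 6]
true_mean = sum(values) / len(values)
mean_of_means = (sum(values[:1]) / 1 + sum(values[1:]) / 5) / 2

print(one_stage, two_stage, true_mean, mean_of_means)
```

A mean can still be aggregated if each pair carries a (sum, count) value, since both components add associatively.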
To achieve efficient data aggregation, we address the challenges of aggregator placement and data routing between map and reduce tasks.

Background of MapReduce

MapReduce is a software framework for big data processing on large clusters consisting of hundreds or thousands of machines. Users submit a data processing request, referred to as a job in MapReduce, by specifying a map function and a reduce function. When a job is executed, two types of tasks, map and reduce, are created. The input data are divided into independent splits that are processed by map tasks in parallel. The generated intermediate results, in the form of key-value pairs, are shuffled and sorted by the framework and then fetched by reduce tasks to produce the final results.

For a better understanding, we use the WordCount example to show the MapReduce process. As shown in Fig. 1, the input file is divided into three splits that are processed by three map tasks, respectively. For example, the first map task extracts four key-value pairs from the first data split: <cat, 1>, <dog, 1>, <fish, 1>, <cat, 1>. There are two reduce tasks in our example, each of which is responsible for processing two keys. After all key-value pairs are sent to the corresponding reduce tasks, the reduce tasks produce the final results by calculating the total count of each word.

Figure 1. MapReduce framework.

IEEE Network, September/October 2015

Concept of Data Aggregation

In real implementations of MapReduce, such as Hadoop, map and reduce tasks usually reside on different machines, as shown in Fig. 1. Since large amounts of intermediate data may be delivered from map tasks to reduce tasks, this places a heavy traffic burden on the network. By carefully examining the intermediate results, we find significant data redundancy in the key-value pairs. In the WordCount example shown in Fig. 1, the first map task, Mapper1, generates two identical pairs of <cat, 1>. This observation motivates us to aggregate the key-value pairs sharing the same key before forwarding them to reduce tasks. To fully exploit the data aggregation opportunities, we study two kinds of data aggregation schemes, intra-machine and inter-machine, which are elaborated in the following.

Intra-Machine Data Aggregation

The most straightforward way to reduce data traffic is to aggregate the key-value pairs with the same key that are generated by map tasks within the same machine before they are sent over the network. This is referred to as intra-machine data aggregation in this article. The WordCount example with intra-machine data aggregation is shown in Fig. 2, where an aggregator is created to merge the intermediate results generated by each map task. For example, the number of key-value pairs sent out by the first machine is reduced to three by aggregating the two pairs of <cat, 1> into a single pair, <cat, 2>. Compared to the traditional scheme, in which 12 key-value pairs are sent from map tasks to reduce tasks, data aggregation reduces the number to 8.
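As a rough illustration of intra-machine aggregation, the Python sketch below merges the output of every map task on one machine before anything is sent over the network. The function name `aggregate_on_machine` is hypothetical, and the input is limited to the four pairs that the text attributes to the first map task, each with the usual WordCount value of 1.

```python
from collections import Counter

def aggregate_on_machine(map_outputs):
    """Sum the values of identical keys across all map tasks on one machine."""
    totals = Counter()
    for pairs in map_outputs:
        for key, count in pairs:
            totals[key] += count
    return sorted(totals.items())

# Output of the first map task in the WordCount example.
mapper1 = [("cat", 1), ("dog", 1), ("fish", 1), ("cat", 1)]
sent = aggregate_on_machine([mapper1])
print(len(mapper1), "pairs before,", len(sent), "pairs after:", sent)
```

Passing the outputs of several map tasks in the same list models a machine that hosts more than one mapper.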
Inter-Machine Data Aggregation

In addition to intra-machine data aggregation, we can further reduce data traffic by aggregating the intermediate results from different machines, which is referred to as inter-machine data aggregation. We continue with the example in Fig. 2 to explain the idea. Consider the links from map tasks to reduce tasks as the bottleneck of the network. To further reduce the number of key-value pairs sent over these links, we send the aggregated results in machine 2 (i.e., three key-value pairs) to the aggregator in machine 3, which merges the data from the two machines, as shown in Fig. 3a. As a result, only four key-value pairs are sent to reduce tasks. An alternative solution is to conduct inter-machine data aggregation at machine 2, as shown in Fig. 3b. Although the number of key-value pairs sent over the bottleneck links is still four, only two key-value pairs are delivered from machine 3 to machine 2, leading to a lower total traffic cost. This example reveals that the selection of the nodes that conduct inter-machine data aggregation affects performance, and thus becomes an additional challenge we need to handle.

Figure 2. Intra-machine data aggregation.

Architecture

We enhance the existing Hadoop architecture by integrating two new modules, the aggregator and the aggregator manager, to facilitate efficient aggregation in a virtual cloud data center.

Overview

A MapReduce cloud service enables enterprises to perform cost-effective big data analytics without building large infrastructures of their own. Using virtual machines and storage hosted by the cloud, enterprises can simply create a virtual MapReduce cluster to analyze big data. In the virtual cluster, the intermediate data forwarded from map tasks to reduce tasks generate a large amount of data traffic in the shuffle phase. This motivates us to propose an architecture that aggregates the intermediate results on the fly with the objective of minimizing the network traffic of MapReduce jobs.

As shown in Fig. 4, Hadoop consists of a JobTracker as the master node and multiple TaskTrackers located on the remaining slave nodes. The JobTracker is responsible for handling all submitted jobs, making scheduling decisions, and parallelizing the application across the cluster. The TaskTrackers are responsible for running the parallel tasks by following instructions from the JobTracker. To implement data aggregation, we incorporate two modules, the aggregator and the aggregator manager, into the existing Hadoop architecture. The aggregation operations are conducted by aggregators, while the aggregator manager, residing in the JobTracker, collaborates with other components to determine the set of virtual machines that should accommodate aggregators for each MapReduce job. This architecture aggregates the intermediate data in the shuffle phase so that network traffic can be significantly reduced.

In our enhanced programming framework, aggregators are located between the map and reduce phases. Each aggregator accepts as input the intermediate results generated by several map tasks, which are specified by the aggregator manager. Note that a mapper can still send its intermediate results directly to reducers without passing through an aggregator, just as it does in the traditional MapReduce framework. After obtaining the intermediate results from map tasks, each aggregator performs a reduce-like operation to combine the key-value pairs with the same key, so that each key appears in a single pair with an aggregated value instead of in multiple pairs. After that, all aggregated results with the same key are sent to a single reducer.

In the system architecture shown in Fig. 4, the execution of aggregators is managed by the TaskTracker in each virtual machine of the virtual cluster.
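The aggregator behavior described in this overview can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' Hadoop module: the function names, the sample pairs, and the use of a CRC32 hash to pick a reducer are all hypothetical. The sketch merges the streams of several map tasks with a reduce-like sum and then routes every key to exactly one reducer, so that aggregators on different machines agree on the destination of a key.

```python
import zlib
from collections import defaultdict

def aggregate(streams):
    """Reduce-like merge: one pair per key across all incoming map streams."""
    merged = defaultdict(int)
    for stream in streams:
        for key, value in stream:
            merged[key] += value
    return dict(merged)

def reducer_for(key, num_reducers):
    """Deterministic partitioner, identical on every machine (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def route(aggregated, num_reducers):
    """Group aggregated pairs by the reducer that owns each key."""
    outgoing = defaultdict(dict)
    for key, value in aggregated.items():
        outgoing[reducer_for(key, num_reducers)][key] = value
    return dict(outgoing)

# Two aggregators on different machines, each fed by its own map tasks.
agg_a = aggregate([[("dog", 1), ("cat", 1)], [("dog", 1), ("fish", 1)]])
agg_b = aggregate([[("dog", 1), ("bird", 1)]])
routes_a = route(agg_a, num_reducers=2)
routes_b = route(agg_b, num_reducers=2)
print(routes_a, routes_b)
```

A CRC32 checksum is used here instead of Python's built-in `hash` because string hashing is salted per process, which would let two machines disagree about the reducer for the same key.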
When the TaskTracker receives a request to create an aggregator from the aggregator manager residing in the JobTracker, it immediately initializes an aggregator instance and specifies its associated map and reduce tasks using the information attached to the request. Finally, once the aggregation is completed, the TaskTracker destroys the aggregator and sends a notification message to the aggregator manager.

Aggregator Manager

The aggregator manager mainly deals with the aggregator placement and bandwidth assignment problems, as described in the following.

Aggregator Placement

An intuitive method is to create an aggregator on each machine along the path from a map task to a reduce task. However, this would occupy too many computational resources in the virtual cluster, leading to low resource utilization. In our proposal, we allow users to specify a maximum number of aggregators, just as they specify the numbers of map and reduce tasks. The aggregator manager aims to minimize the network traffic during the shuffle phase by answering the following two key questions:

Aggregator placement: which machines should create an aggregator?
Routing: to which aggregator should the intermediate data of each mapper be forwarded?

To determine aggregator placement, the aggregator manager needs information about the map and reduce tasks, including their locations and the estimated intermediate data volume. Moreover, it also needs to know the remaining resources at each TaskTracker, that is, whether the TaskTracker is able to accommodate an aggregator. Such information is attached to the periodic heartbeat message from each TaskTracker to the JobTracker, which reports the availability of resources for running a new task. After obtaining this information, the aggregator manager executes an in-cloud aggregation algorithm to determine the aggregator placement and routing strategies, which are then sent to each TaskTracker.
Bandwidth Assignment

With a pay-as-you-go charging model, tenants can create a virtual cluster by renting a set of virtual machines with performance isolation of CPU and memory resources. It has recently been recognized that the bandwidth between virtual machines also plays a critical role for applications in the cloud, because network bandwidth may fluctuate significantly due to competition among network-intensive applications. Nowadays, cloud service providers allow tenants to reserve network bandwidth between virtual machines for a fee. Such a scheme helps tenants stay aware of their network traffic and reduce cost by reserving bandwidth appropriately according to their requirements.

Without aggregation, a simple fair-sharing bandwidth assignment scheme is enough because the output of map tasks is, in general, uniformly distributed among reduce tasks. When aggregation is applied, the communication paths of traditional MapReduce change because of the aggregators. Moreover, aggregators may reduce the amount of data going through them. Therefore, bandwidth should be assigned according to the data traffic on each link after aggregation. Specifically, the aggregator manager estimates the data traffic according to the results of aggregator placement, and then reserves the bandwidth with the cloud service provider.

Figure 3. Inter-machine data aggregation.

In-Cloud Aggregation

The objective of in-cloud aggregation is to minimize the total routing traffic needed to complete a given MapReduce job under a given budget of aggregators. This section presents a greedy algorithm for the aggregator placement problem.

Consider a MapReduce job with a number of map tasks and a single reduce task, which have already been deployed in the cloud. The corresponding aggregator placement problem is essentially to construct an overlay multicast tree in which the root, leaf, and intermediate nodes represent the reducer, mappers, and aggregators, respectively. If no aggregators are introduced, all leaf nodes have to connect to the root node directly. To reduce the communication cost, a limited number of aggregators are provided such that all incoming traffic to any aggregator is aggregated into a condensed volume before being delivered to the next hop. Notice that an intermediate node can function as both an aggregator and a mapper at the same time.
The communication cost of a tree link is defined as the traffic volume over the link times the hop count of the path connecting the two tree nodes of the link. The traffic volume from an aggregator is the aggregation coefficient a, which is smaller than 1 in general, times the overall intermediate results from all its associated mappers, including itself if the aggregator is a mapper as well. The cost of a tree is the sum of the costs on all links of the tree. The objective of our algorithm is to construct an overlay multicast tree such that its cost is minimized and the number of intermediate nodes does not exceed a given budget.

Figure 4. The architecture of in-cloud aggregation.

Figure 5. An example: a) the hop-distance matrix between virtual machines; b) the execution result in the first round; c) the execution result in the second round.

Initially, the tree is constructed with the only reducer as the root node and all mappers as leaf nodes connected to the root by direct links, with no aggregators involved. The basic idea of our algorithm is to iteratively reconstruct the multicast tree by disconnecting the tree link with the largest cost and then reconnecting the detached component to one of the following:

An existing aggregator
A new aggregator that is temporarily assigned in the cloud, if the budget allows

The reconnection is chosen such that the resulting tree achieves the lowest cost among all possible trials belonging to the above two cases. If no such update is possible, alternative tree links are considered in decreasing order of cost; that is, the link with the second largest cost is checked, and so on. The iteration proceeds until the tree cannot be further improved, and the resulting tree is our desired solution.
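The cost definition and the greedy reconstruction described above can be sketched in Python under some simplifying assumptions that are not taken from the article: aggregators form a single level (each sends straight to the reducer), only mapper links are disconnected and reconnected, and a mapper that shares a node with a newly chosen aggregator is routed through it at zero hops. The function names and the hop-distance matrix are invented for illustration; the matrix is not the one in Fig. 5a.

```python
def tree_cost(assign, m, h, root, a):
    """Sum of link costs: volume times hop count, with aggregator output scaled by a."""
    cost = 0.0
    load = {}  # aggregator node -> total volume received from its mappers
    for mapper, parent in assign.items():
        cost += m[mapper] * h[mapper][parent]
        if parent != root:
            load[parent] = load.get(parent, 0.0) + m[mapper]
    for agg, volume in load.items():
        cost += a * volume * h[agg][root]
    return cost

def greedy_placement(m, h, root, a, budget):
    """Repeatedly re-attach the mapper behind the costliest link (simplified variant)."""
    assign = {i: root for i in m}  # initial star tree: every mapper -> reducer
    current = tree_cost(assign, m, h, root, a)
    hosts = [n for n in h if n != root]
    improved = True
    while improved:
        improved = False
        # Examine links in decreasing order of cost.
        by_cost = sorted(assign, key=lambda i: m[i] * h[i][assign[i]], reverse=True)
        for i in by_cost:
            best = None
            for g in hosts:  # g is an existing or a new aggregator location
                trial = dict(assign)
                trial[i] = g
                if g in m:
                    trial[g] = g  # co-located mapper feeds the aggregator locally
                if len({p for p in trial.values() if p != root}) > budget:
                    continue  # too many aggregators for the budget
                c = tree_cost(trial, m, h, root, a)
                if best is None or c < best[0]:
                    best = (c, trial)
            if best is not None and best[0] < current - 1e-9:
                current, assign = best
                improved = True
                break  # restart from the new costliest link
    return assign, current

# Hypothetical five-node cluster: reducer on node 3, mappers elsewhere.
edges = {(1, 2): 2, (1, 3): 12, (1, 4): 9, (1, 5): 9, (2, 3): 6,
         (2, 4): 9, (2, 5): 9, (3, 4): 10, (3, 5): 8, (4, 5): 3}
nodes = range(1, 6)
h = {i: {j: 0 for j in nodes} for i in nodes}
for (i, j), d in edges.items():
    h[i][j] = h[j][i] = d
m = {1: 1, 2: 1, 4: 1, 5: 1}

star_cost = tree_cost({i: 3 for i in m}, m, h, root=3, a=0.5)
assign, cost = greedy_placement(m, h, root=3, a=0.5, budget=2)
print(star_cost, cost, assign)
```

With this made-up matrix the sketch places aggregators on nodes 2 and 5. A fuller implementation would also allow an aggregator to forward to another aggregator, as the tree formulation in the text permits.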

The above algorithm can be extended to the general case with multiple reducers in a straightforward manner. The difference is that a virtual root node is created to represent all given reducers. The cost of any link associated with this root node is calculated by averaging the costs of the corresponding links to all real reducers.

For a better understanding, we use an example to show the execution process of our proposed algorithm. We consider a virtual cluster consisting of five virtual machines, where node 3 accommodates the only reducer, and each other node contains a mapper and is able to accommodate an aggregator. The hop distance $h(i, j)$ between any two virtual machines $i$ and $j$ is given by the matrix shown in Fig. 5a, and the size of the intermediate data generated by the mapper in node $i$ is denoted by $m_i$. The settings are as follows: $m_i = 1$ for $1 \le i \le 5$, $a = 0.5$, and the maximum number of aggregators is 2.

In the first iteration of the algorithm, the initial overlay tree is shown in Fig. 5b-1, where the number beside a tree link shows the corresponding cost, and the link between nodes 1 and 3 (the root node) has the largest cost. To reduce this traffic, all possible placements of a new aggregator (e.g., on nodes 2, 4, and 5) are shown in Fig. 5b-2. The updated tree with the minimum cost is given in Fig. 5b-3.

In the second iteration, the bottleneck link connecting nodes 4 and 3 is identified. Two possible updates are considered, as shown in Fig. 5c-2:

1. The mapper on node 4 connects to the existing aggregator on node 2. Accordingly, the cost on the link connecting nodes 2 and 3 is updated as $a\,(m_1 + m_2 + m_4)\,h(2, 3)$.
2. The mapper on node 4 connects to a new aggregator on node 5. Accordingly, the cost on the link connecting nodes 5 and 3 is updated as $a\,(m_4 + m_5)\,h(5, 3) = 8$.

The second option is chosen, as shown in Fig. 5c-3, which is also the final tree constructed by our algorithm since no further improvement can be made.
Compared to the original tree with an overall communication cost of 34, the new tree reduces the cost to 18, a reduction of 47 percent.

Performance Evaluation

To evaluate our proposed heuristic algorithm, both prototype and simulation tests are conducted in this section. The performance baseline is provided by the original scheme (i.e., no aggregation is applied). For comparison, we also implement a random placement algorithm that places a given number of aggregators on physical nodes randomly.

Prototype-Based Experiment

Our prototype has been implemented on Hadoop and VMware Workstation, and the WordCount job is used for testing. To validate our proposed algorithm, we use the example illustrated in Fig. 5. The output sizes of the mappers on nodes 1 to 5 are 206.M, .6M, 84.7M, 6.0M, and 87.2M, respectively. From measurement, the aggregation coefficient a is approximately equal to 0.5.

Our experimental results show that the original overall communication cost is 664. if aggregation is not incorporated. On the other hand, when mappers 1 and 2 are chosen to aggregate on node 2, the input and output sizes are 402.M and 20.M, respectively. Similarly, after mapper 4 and mapper 5 are aggregated on node 5, the input and output sizes are 7.M and 8.7M, respectively. The total cost becomes 546., a significant reduction of 46.7 percent.

Figure 6. The traffic cost vs. different values of a.

Figure 7. The traffic cost vs. maximum number of aggregators.

Simulation-Based Experiment

In our simulations, the numbers of map tasks, reduce tasks, and virtual machines are set to 00, 2, and 00, respectively. The hop number between any pair of tasks is generated randomly within the range (, 20). The results in all figures are averaged over 00 instances.

We first evaluate the effect of the aggregation coefficient a by fixing the maximum number of aggregators to 50.
From Fig. 6, we find that the average traffic costs of both the aggregation and random algorithms increase as the value of a grows from 0. to 1. When a = 1.0, aggregation cannot affect the traffic cost; therefore, all algorithms perform the same. In all other cases, our heuristic algorithm always outperforms the other two schemes, especially when a is small.

Then we investigate the influence of the number of aggregators by changing its value from 0 to 00 and fixing a to 0.5. As shown in Fig. 7, the traffic cost of our algorithm decreases quickly at the beginning compared to the others, showing that substantial gains can be achieved by introducing more aggregators and optimizing their placement. When a sufficient number of aggregators is allowed, its performance gradually converges to that of the random algorithm. Our simulation results demonstrate that our proposed aggregation algorithm can significantly reduce in-cloud traffic in most cases.

Conclusion

In this article, we discuss the importance of aggregation for in-cloud traffic reduction. To verify our idea, we propose an aggregation architecture that can easily be incorporated into the existing MapReduce framework. We also investigate the aggregator placement problem and design an aggregation algorithm to minimize the overall network traffic among the map and reduce tasks of a big data job. Both prototype and simulation-based tests have been conducted, and the experimental results validate the efficiency of our proposal in reducing network traffic.

Acknowledgment

This research was partially supported by the Strategic International Collaborative Research Program (SICORP) Japanese (JST)-U.S. (NSF) Joint Research "Big Data and Disaster Research" (BDD).

References

[1] D. Howe et al., "Big Data: The Future of Biocuration," Nature, vol. 455, no. 720, 2008, pp
[2] M. Barlow, The Culture of Big Data, O'Reilly Media, Inc., 20.
[3] Hadoop:
[4] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proc. OSDI, San Francisco, CA, 2004, pp. 0.
[5] Y. Yu, P. K. Gunda, and M. Isard, "Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations," Proc. ACM Symp. Op. Sys. Principles, 200, pp

Biographies

HUAN KE received her B.S. degree from Huazhong University of Science and Technology, China, in 20. She is currently a Master's student at the University of Aizu, Japan. Her research interests include cloud computing and big data.

PENG LI received his B.S. degree from Huazhong University of Science and Technology in 2007, and his M.S. and Ph.D. degrees from the University of Aizu in 200 and 202, respectively. He is currently an associate professor at the University of Aizu. His research interests include networking modeling, cross-layer optimization, network coding, cooperative communications, cloud computing, smart grid, and performance evaluation of wireless and mobile networks.

SONG GUO [M'02, SM] received a Ph.D. degree in computer science from the University of Ottawa, Canada. He is a full professor at the School of Computer Science and Engineering, University of Aizu. His research interests are mainly in the areas of wireless communication and mobile computing, cloud computing and networking, and cyber-physical systems. He serves as Associate Editor of IEEE TPDS and IEEE TETC. He is a Senior Member of ACM.

IVAN STOJMENOVIC [F'08] received his Ph.D. degree in mathematics. He is a full professor at the University of Ottawa. He has published over 00 papers, and edited seven books on wireless, ad hoc, sensor, and actuator networks and applied algorithms with Wiley. He has been a Fellow of the Canadian Academy of Engineering since 202 and a member of the Academia Europaea (the Academy of Europe) since 202 (section: Informatics).


More information

An Adaptive Scheduling Technique for Improving the Efficiency of Hadoop

An Adaptive Scheduling Technique for Improving the Efficiency of Hadoop An Adaptive Scheduling Technique for Improving the Efficiency of Hadoop Ms Punitha R Computer Science Engineering M.S Engineering College, Bangalore, Karnataka, India. Mr Malatesh S H Computer Science

More information

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 6367(Print) ISSN 0976 6375(Online)

More information

Hadoop/MapReduce Computing Paradigm

Hadoop/MapReduce Computing Paradigm Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

Application-Aware SDN Routing for Big-Data Processing

Application-Aware SDN Routing for Big-Data Processing Application-Aware SDN Routing for Big-Data Processing Evaluation by EstiNet OpenFlow Network Emulator Director/Prof. Shie-Yuan Wang Institute of Network Engineering National ChiaoTung University Taiwan

More information

CS 61C: Great Ideas in Computer Architecture. MapReduce

CS 61C: Great Ideas in Computer Architecture. MapReduce CS 61C: Great Ideas in Computer Architecture MapReduce Guest Lecturer: Justin Hsia 3/06/2013 Spring 2013 Lecture #18 1 Review of Last Lecture Performance latency and throughput Warehouse Scale Computing

More information

Introduction to MapReduce

Introduction to MapReduce 732A54 Big Data Analytics Introduction to MapReduce Christoph Kessler IDA, Linköping University Towards Parallel Processing of Big-Data Big Data too large to be read+processed in reasonable time by 1 server

More information

Performance Evaluation of Cloud Centers with High Degree of Virtualization to provide MapReduce as Service

Performance Evaluation of Cloud Centers with High Degree of Virtualization to provide MapReduce as Service Int. J. Advance Soft Compu. Appl, Vol. 8, No. 3, December 2016 ISSN 2074-8523 Performance Evaluation of Cloud Centers with High Degree of Virtualization to provide MapReduce as Service C. N. Sahoo 1, Veena

More information

Programming Models MapReduce

Programming Models MapReduce Programming Models MapReduce Majd Sakr, Garth Gibson, Greg Ganger, Raja Sambasivan 15-719/18-847b Advanced Cloud Computing Fall 2013 Sep 23, 2013 1 MapReduce In a Nutshell MapReduce incorporates two phases

More information

Improved MapReduce k-means Clustering Algorithm with Combiner

Improved MapReduce k-means Clustering Algorithm with Combiner 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation Improved MapReduce k-means Clustering Algorithm with Combiner Prajesh P Anchalia Department Of Computer Science and Engineering

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

MapReduce Design Patterns

MapReduce Design Patterns MapReduce Design Patterns MapReduce Restrictions Any algorithm that needs to be implemented using MapReduce must be expressed in terms of a small number of rigidly defined components that must fit together

More information

MapReduce: Simplified Data Processing on Large Clusters

MapReduce: Simplified Data Processing on Large Clusters MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat OSDI 2004 Presented by Zachary Bischof Winter '10 EECS 345 Distributed Systems 1 Motivation Summary Example Implementation

More information

LEEN: Locality/Fairness- Aware Key Partitioning for MapReduce in the Cloud

LEEN: Locality/Fairness- Aware Key Partitioning for MapReduce in the Cloud LEEN: Locality/Fairness- Aware Key Partitioning for MapReduce in the Cloud Shadi Ibrahim, Hai Jin, Lu Lu, Song Wu, Bingsheng He*, Qi Li # Huazhong University of Science and Technology *Nanyang Technological

More information

HADOOP FRAMEWORK FOR BIG DATA

HADOOP FRAMEWORK FOR BIG DATA HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further

More information

MATE-EC2: A Middleware for Processing Data with Amazon Web Services

MATE-EC2: A Middleware for Processing Data with Amazon Web Services MATE-EC2: A Middleware for Processing Data with Amazon Web Services Tekin Bicer David Chiu* and Gagan Agrawal Department of Compute Science and Engineering Ohio State University * School of Engineering

More information

PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP

PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP ISSN: 0976-2876 (Print) ISSN: 2250-0138 (Online) PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP T. S. NISHA a1 AND K. SATYANARAYAN REDDY b a Department of CSE, Cambridge

More information

MapReduce & Resilient Distributed Datasets. Yiqing Hua, Mengqi(Mandy) Xia

MapReduce & Resilient Distributed Datasets. Yiqing Hua, Mengqi(Mandy) Xia MapReduce & Resilient Distributed Datasets Yiqing Hua, Mengqi(Mandy) Xia Outline - MapReduce: - - Resilient Distributed Datasets (RDD) - - Motivation Examples The Design and How it Works Performance Motivation

More information

Hadoop and Map-reduce computing

Hadoop and Map-reduce computing Hadoop and Map-reduce computing 1 Introduction This activity contains a great deal of background information and detailed instructions so that you can refer to it later for further activities and homework.

More information

Clustering Documents. Case Study 2: Document Retrieval

Clustering Documents. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 21 th, 2015 Sham Kakade 2016 1 Document Retrieval Goal: Retrieve

More information

/ Cloud Computing. Recitation 3 Sep 13 & 15, 2016

/ Cloud Computing. Recitation 3 Sep 13 & 15, 2016 15-319 / 15-619 Cloud Computing Recitation 3 Sep 13 & 15, 2016 1 Overview Administrative Issues Last Week s Reflection Project 1.1, OLI Unit 1, Quiz 1 This Week s Schedule Project1.2, OLI Unit 2, Module

More information

A New Combinatorial Design of Coded Distributed Computing

A New Combinatorial Design of Coded Distributed Computing A New Combinatorial Design of Coded Distributed Computing Nicholas Woolsey, Rong-Rong Chen, and Mingyue Ji Department of Electrical and Computer Engineering, University of Utah Salt Lake City, UT, USA

More information

Optimal Algorithms for Cross-Rack Communication Optimization in MapReduce Framework

Optimal Algorithms for Cross-Rack Communication Optimization in MapReduce Framework Optimal Algorithms for Cross-Rack Communication Optimization in MapReduce Framework Li-Yung Ho Institute of Information Science Academia Sinica, Department of Computer Science and Information Engineering

More information

Research and Realization of AP Clustering Algorithm Based on Cloud Computing Yue Qiang1, a *, Hu Zhongyu2, b, Lei Xinhua1, c, Li Xiaoming3, d

Research and Realization of AP Clustering Algorithm Based on Cloud Computing Yue Qiang1, a *, Hu Zhongyu2, b, Lei Xinhua1, c, Li Xiaoming3, d 4th International Conference on Machinery, Materials and Computing Technology (ICMMCT 2016) Research and Realization of AP Clustering Algorithm Based on Cloud Computing Yue Qiang1, a *, Hu Zhongyu2, b,

More information

Co-operative Scheduled Energy Aware Load-Balancing technique for an Efficient Computational Cloud

Co-operative Scheduled Energy Aware Load-Balancing technique for an Efficient Computational Cloud 571 Co-operative Scheduled Energy Aware Load-Balancing technique for an Efficient Computational Cloud T.R.V. Anandharajan 1, Dr. M.A. Bhagyaveni 2 1 Research Scholar, Department of Electronics and Communication,

More information

Hop Onset Network: Adequate Stationing Schema for Large Scale Cloud- Applications

Hop Onset Network: Adequate Stationing Schema for Large Scale Cloud- Applications www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 6 Issue 10 October 2017, Page No. 22616-22626 Index Copernicus value (2015): 58.10 DOI: 10.18535/ijecs/v6i10.11

More information

Optimizing Hadoop Block Placement Policy & Cluster Blocks Distribution

Optimizing Hadoop Block Placement Policy & Cluster Blocks Distribution Vol:6, No:1, 212 Optimizing Hadoop Block Placement Policy & Cluster Blocks Distribution Nchimbi Edward Pius, Liu Qin, Fion Yang, Zhu Hong Ming International Science Index, Computer and Information Engineering

More information

Global Journal of Engineering Science and Research Management

Global Journal of Engineering Science and Research Management A FUNDAMENTAL CONCEPT OF MAPREDUCE WITH MASSIVE FILES DATASET IN BIG DATA USING HADOOP PSEUDO-DISTRIBUTION MODE K. Srikanth*, P. Venkateswarlu, Ashok Suragala * Department of Information Technology, JNTUK-UCEV

More information

ABSTRACT I. INTRODUCTION

ABSTRACT I. INTRODUCTION International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISS: 2456-3307 Hadoop Periodic Jobs Using Data Blocks to Achieve

More information

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA) Introduction to MapReduce Adapted from Jimmy Lin (U. Maryland, USA) Motivation Overview Need for handling big data New programming paradigm Review of functional programming mapreduce uses this abstraction

More information

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April, 2017 Sham Kakade 2017 1 Document Retrieval n Goal: Retrieve

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

Collaborative Next Generation Networking

Collaborative Next Generation Networking CALL-FOR-PAPERS ACM/Springer Mobile Networks & Applications (MONET) http://link.springer.com/journal/11036 SPECIAL ISSUE ON Collaborative Next Generation Networking Overview: To catch up with the ever-increasing

More information

Dynamic Data Placement Strategy in MapReduce-styled Data Processing Platform Hua-Ci WANG 1,a,*, Cai CHEN 2,b,*, Yi LIANG 3,c

Dynamic Data Placement Strategy in MapReduce-styled Data Processing Platform Hua-Ci WANG 1,a,*, Cai CHEN 2,b,*, Yi LIANG 3,c 2016 Joint International Conference on Service Science, Management and Engineering (SSME 2016) and International Conference on Information Science and Technology (IST 2016) ISBN: 978-1-60595-379-3 Dynamic

More information

D DAVID PUBLISHING. Big Data; Definition and Challenges. 1. Introduction. Shirin Abbasi

D DAVID PUBLISHING. Big Data; Definition and Challenges. 1. Introduction. Shirin Abbasi Journal of Energy and Power Engineering 10 (2016) 405-410 doi: 10.17265/1934-8975/2016.07.004 D DAVID PUBLISHING Shirin Abbasi Computer Department, Islamic Azad University-Tehran Center Branch, Tehran

More information

Risk-Aware Rapid Data Evacuation for Large- Scale Disasters in Optical Cloud Networks

Risk-Aware Rapid Data Evacuation for Large- Scale Disasters in Optical Cloud Networks Risk-Aware Rapid Data Evacuation for Large- Scale Disasters in Optical Cloud Networks Presenter: Yongcheng (Jeremy) Li PhD student, School of Electronic and Information Engineering, Soochow University,

More information

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System HU WEI, CHEN TIANZHOU, SHI QINGSONG, JIANG NING College of Computer Science Zhejiang University College of Computer

More information

Multi-Method Data Delivery for Green Sensor-Cloud

Multi-Method Data Delivery for Green Sensor-Cloud Green Communications and Computing Multi-Method Data Delivery for Green Sensor-Cloud Chunsheng Zhu, Victor C. M. Leung, Kun Wang, Laurence T. Yang, and Yan Zhang The authors discuss the potential applications

More information

NaaS Network-as-a-Service in the Cloud

NaaS Network-as-a-Service in the Cloud NaaS Network-as-a-Service in the Cloud joint work with Matteo Migliavacca, Peter Pietzuch, and Alexander L. Wolf costa@imperial.ac.uk Motivation Mismatch between app. abstractions & network How the programmers

More information

Parallel Computing: MapReduce Jin, Hai

Parallel Computing: MapReduce Jin, Hai Parallel Computing: MapReduce Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology ! MapReduce is a distributed/parallel computing framework introduced by Google

More information

An improved MapReduce Design of Kmeans for clustering very large datasets

An improved MapReduce Design of Kmeans for clustering very large datasets An improved MapReduce Design of Kmeans for clustering very large datasets Amira Boukhdhir Laboratoire SOlE Higher Institute of management Tunis Tunis, Tunisia Boukhdhir _ amira@yahoo.fr Oussama Lachiheb

More information

Vehicular Cloud Computing: A Survey. Lin Gu, Deze Zeng and Song Guo School of Computer Science and Engineering, The University of Aizu, Japan

Vehicular Cloud Computing: A Survey. Lin Gu, Deze Zeng and Song Guo School of Computer Science and Engineering, The University of Aizu, Japan Vehicular Cloud Computing: A Survey Lin Gu, Deze Zeng and Song Guo School of Computer Science and Engineering, The University of Aizu, Japan OUTLINE OF TOPICS INTRODUCETION AND MOTIVATION TWO-TIER VEHICULAR

More information

An Improved Performance Evaluation on Large-Scale Data using MapReduce Technique

An Improved Performance Evaluation on Large-Scale Data using MapReduce Technique Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 6.017 IJCSMC,

More information

Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R

Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R Table of Contents Introduction... 3 Topology Awareness in Hadoop... 3 Virtual Hadoop... 4 HVE Solution... 5 Architecture...

More information

Implementation of Aggregation of Map and Reduce Function for Performance Improvisation

Implementation of Aggregation of Map and Reduce Function for Performance Improvisation 2016 IJSRSET Volume 2 Issue 5 Print ISSN: 2395-1990 Online ISSN : 2394-4099 Themed Section: Engineering and Technology Implementation of Aggregation of Map and Reduce Function for Performance Improvisation

More information

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System HU WEI CHEN TIANZHOU SHI QINGSONG JIANG NING College of Computer Science Zhejiang University College of Computer Science

More information

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic WHITE PAPER Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive

More information

Map Reduce Group Meeting

Map Reduce Group Meeting Map Reduce Group Meeting Yasmine Badr 10/07/2014 A lot of material in this presenta0on has been adopted from the original MapReduce paper in OSDI 2004 What is Map Reduce? Programming paradigm/model for

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

Cascaded Coded Distributed Computing on Heterogeneous Networks

Cascaded Coded Distributed Computing on Heterogeneous Networks Cascaded Coded Distributed Computing on Heterogeneous Networks Nicholas Woolsey, Rong-Rong Chen, and Mingyue Ji Department of Electrical and Computer Engineering, University of Utah Salt Lake City, UT,

More information

HPMR: Prefetching and Pre-shuffling in Shared MapReduce Computation Environment

HPMR: Prefetching and Pre-shuffling in Shared MapReduce Computation Environment HPMR: Prefetching and Pre-shuffling in Shared MapReduce Computation Environment Sangwon Seo 1, Ingook Jang 1, 1 Computer Science Department Korea Advanced Institute of Science and Technology (KAIST), South

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK DISTRIBUTED FRAMEWORK FOR DATA MINING AS A SERVICE ON PRIVATE CLOUD RUCHA V. JAMNEKAR

More information

AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA K-MEANS CLUSTERING ON HADOOP SYSTEM. Mengzhao Yang, Haibin Mei and Dongmei Huang

AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA K-MEANS CLUSTERING ON HADOOP SYSTEM. Mengzhao Yang, Haibin Mei and Dongmei Huang International Journal of Innovative Computing, Information and Control ICIC International c 2017 ISSN 1349-4198 Volume 13, Number 3, June 2017 pp. 1037 1046 AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA

More information

An Introduction to Big Data Formats

An Introduction to Big Data Formats Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION

More information

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014 CS15-319 / 15-619 Cloud Computing Recitation 3 September 9 th & 11 th, 2014 Overview Last Week s Reflection --Project 1.1, Quiz 1, Unit 1 This Week s Schedule --Unit2 (module 3 & 4), Project 1.2 Questions

More information

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA CSE 3302 Lecture 11: Map/Reduce 7 October 2010 Nate Nystrom UTA 378,000 results in 0.17 seconds including images and video communicates with 1000s of machines web server index servers document servers

More information

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ. Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop

More information

Performance Analysis of Storage-Based Routing for Circuit-Switched Networks [1]

Performance Analysis of Storage-Based Routing for Circuit-Switched Networks [1] Performance Analysis of Storage-Based Routing for Circuit-Switched Networks [1] Presenter: Yongcheng (Jeremy) Li PhD student, School of Electronic and Information Engineering, Soochow University, China

More information

Data Analysis Using MapReduce in Hadoop Environment

Data Analysis Using MapReduce in Hadoop Environment Data Analysis Using MapReduce in Hadoop Environment Muhammad Khairul Rijal Muhammad*, Saiful Adli Ismail, Mohd Nazri Kama, Othman Mohd Yusop, Azri Azmi Advanced Informatics School (UTM AIS), Universiti

More information

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

A Comparative study of Clustering Algorithms using MapReduce in Hadoop A Comparative study of Clustering Algorithms using MapReduce in Hadoop Dweepna Garg 1, Khushboo Trivedi 2, B.B.Panchal 3 1 Department of Computer Science and Engineering, Parul Institute of Engineering

More information

Hadoop An Overview. - Socrates CCDH

Hadoop An Overview. - Socrates CCDH Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected

More information

Introduction to MapReduce

Introduction to MapReduce Introduction to MapReduce April 19, 2012 Jinoh Kim, Ph.D. Computer Science Department Lock Haven University of Pennsylvania Research Areas Datacenter Energy Management Exa-scale Computing Network Performance

More information

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Huayu Wu Institute for Infocomm Research, A*STAR, Singapore huwu@i2r.a-star.edu.sg Abstract. Processing XML queries over

More information

Improving the MapReduce Big Data Processing Framework

Improving the MapReduce Big Data Processing Framework Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM

More information

Efficient Mining Algorithms for Large-scale Graphs

Efficient Mining Algorithms for Large-scale Graphs Efficient Mining Algorithms for Large-scale Graphs Yasunari Kishimoto, Hiroaki Shiokawa, Yasuhiro Fujiwara, and Makoto Onizuka Abstract This article describes efficient graph mining algorithms designed

More information

Volume 2, Issue 4, April 2014 International Journal of Advance Research in Computer Science and Management Studies

Volume 2, Issue 4, April 2014 International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 4, April 2014 International Journal of Advance Research in Computer Science and Management Studies Research Article / Paper / Case Study Available online at: www.ijarcsms.com Efficient

More information

A REVIEW PAPER ON BIG DATA ANALYTICS

A REVIEW PAPER ON BIG DATA ANALYTICS A REVIEW PAPER ON BIG DATA ANALYTICS Kirti Bhatia 1, Lalit 2 1 HOD, Department of Computer Science, SKITM Bahadurgarh Haryana, India bhatia.kirti.it@gmail.com 2 M Tech 4th sem SKITM Bahadurgarh, Haryana,

More information

A computational model for MapReduce job flow

A computational model for MapReduce job flow A computational model for MapReduce job flow Tommaso Di Noia, Marina Mongiello, Eugenio Di Sciascio Dipartimento di Ingegneria Elettrica e Dell informazione Politecnico di Bari Via E. Orabona, 4 70125

More information

Strategic Briefing Paper Big Data

Strategic Briefing Paper Big Data Strategic Briefing Paper Big Data The promise of Big Data is improved competitiveness, reduced cost and minimized risk by taking better decisions. This requires affordable solution architectures which

More information

An Exploration of Designing a Hybrid Scale-Up/Out Hadoop Architecture Based on Performance Measurements

An Exploration of Designing a Hybrid Scale-Up/Out Hadoop Architecture Based on Performance Measurements This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI.9/TPDS.6.7, IEEE

More information

Call for Papers for Communication QoS, Reliability and Modeling Symposium

Call for Papers for Communication QoS, Reliability and Modeling Symposium Call for Papers for Communication QoS, Reliability and Modeling Symposium Scope and Motivation: In modern communication networks, different technologies need to cooperate with each other for end-to-end

More information

Exploiting Efficient and Scalable Shuffle Transfers in Future Data Center Networks

Exploiting Efficient and Scalable Shuffle Transfers in Future Data Center Networks Exploiting Efficient and Scalable Shuffle Transfers in Future Data Center Networks Deke Guo, Member, IEEE, Junjie Xie, Xiaolei Zhou, Student Member, IEEE, Xiaomin Zhu, Member, IEEE, Wei Wei, Member, IEEE,

More information

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja

More information

Hierarchy of knowledge BIG DATA 9/7/2017. Architecture

Hierarchy of knowledge BIG DATA 9/7/2017. Architecture BIG DATA Architecture Hierarchy of knowledge Data: Element (fact, figure, etc.) which is basic information that can be to be based on decisions, reasoning, research and which is treated by the human or

More information