Introducing iSCSI Protocol on Online Based MapReduce Mechanism*


Introducing iSCSI Protocol on Online Based MapReduce Mechanism*

Shaikh Muhammad Allayear 1, Md. Salahuddin 1, Fayshal Ahmed 1, and Sung Soon Park 2

1 Department of Computer Science and Engineering, East West University, Bangladesh
2 Anyang University, Korea

allayear@ewubd.edu, { , }@ewu.edu.bd, sspark@anyang.ac.kr

Abstract. Hadoop MapReduce is a popular framework in large internet enterprises and data management systems, and it has proved its value for data-intensive batch jobs. Its main objectives are to build consistent, highly available (HA) and scalable data management systems that serve petabytes of data to massive numbers of users. MapReduce is a programming model that enables easy development of scalable parallel applications that process vast amounts of data on large clusters. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of real-world tasks such as data processing for search engines and machine learning. Earlier versions of Hadoop MapReduce have several performance problems, such as the connection between map and reduce tasks, data overload and time consumption. In this paper we propose a modified MapReduce architecture, MRA (MapReduce Agent), which is a fusion of the iSCSI protocol and the downloaded reference code of Hadoop*. Our MRA can reduce completion time, improve system utilization and give better performance.

1 Introduction

Nowadays, dealing with datasets on the order of terabytes or even petabytes is a reality. Therefore, processing such big datasets efficiently is a clear need for many users. In this context, Hadoop MapReduce is a big data processing framework that has rapidly become an important factor in both industry and academia. MapReduce was proposed by Google. The MapReduce framework simplifies the development of large-scale distributed applications on clusters of commodity machines. It has become widely popular; e.g., Google uses it internally to process more than 20 PB per day.
Yahoo!, Facebook and others use Hadoop, an open-source implementation of MapReduce. MapReduce has emerged as a popular way to harness the power of large clusters of computers. The programmer only needs to write the logic of a map function and a reduce function; this eliminates the need to implement fault tolerance and low-level memory management in the program. A key benefit of MapReduce is that it automatically handles failures, hiding the complexity of fault tolerance from the programmer: if a node crashes, MapReduce reruns its tasks on a different machine. MapReduce is typically applied to large batch-oriented computations that are concerned primarily with time to job completion. The Google MapReduce framework [1] and the open-source Hadoop system reinforce this usage model through a batch-processing implementation strategy: the entire output of each map and reduce task is materialized to a local file before it can be consumed by the next stage. Materialization allows for a simple and elegant checkpoint/restart fault tolerance mechanism that is critical in large deployments, which have a high probability of slowdowns or failures at worker nodes.

To solve the problems discussed above we propose a modified MapReduce architecture, the MapReduce Agent (MRA). MRA provides several important advantages to a MapReduce framework. We highlight the potential benefits first: in the MapReduce framework, data is transmitted from the map to the reduce stage, so connection problems may arise. To solve this problem MRA uses the iSCSI [2] Multi-Connection and Error Recovery Method [3] to avoid the drastic reduction of the transmission rate caused by the TCP congestion control mechanism and to guarantee fast retransmission of corrupted packets without re-establishing the TCP connection. For fault tolerance and workload management, MRA creates a Q-chained cluster. A Q-chained cluster [3] is able to balance the workload fully among data connections in the event of packet losses due to bad channel characteristics.

* This research (Grants NO ) was supported by the 2013 Industrial Technology Innovation Project funded by the Ministry of Science, ICT and Future Planning. The source code for HOP can be downloaded from 

B. Murgante et al. (Eds.): ICCSA 2014, Part V, LNCS 8583, pp. 691-706, Springer International Publishing Switzerland 2014
Basically, Hadoop performs its I/O operations by dividing data into blocks, just as the iSCSI protocol does, which motivated us to investigate whether iSCSI can provide better performance in the Hadoop architecture.

1.1 Structure of the Paper

The rest of this paper is organized as follows. Section 2 gives an overview of the iSCSI protocol, the Hadoop MapReduce architecture and the pipelining mechanism [5]. Section 3 describes our research motivations. Section 4 briefly describes our proposed model, the MapReduce Agent (MRA). Section 5 evaluates the performance and results. Finally, Section 6 concludes the paper.

2 Background

In this section, besides the iSCSI protocol, we review the MapReduce programming model and describe the salient features of Hadoop, a popular open-source implementation of MapReduce.

2.1 iSCSI Protocol

iSCSI [2] (Internet Small Computer System Interface) is a transport protocol that works on top of TCP: it transports SCSI packets over TCP/IP. The iSCSI client-server model describes the client as the iSCSI initiator, and the data transfer direction is defined with regard to the initiator: outbound or outgoing transfers are transfers from the initiator to the target. iSCSI read/write operation parameter values [6] are determined during the login phase and the full feature phase by mutual negotiation of the initiator's and target's data transfer capabilities. During the login phase the iSCSI target authenticates the iSCSI initiator and then allows it to enter the full feature phase, in which iSCSI commands and data are exchanged. Depending on the iSCSI operation, two classes of parameters need to be negotiated during the login phase. For iSCSI write operations, the MaxBurstLength and MaxRecvDataSegmentLength (MRDSL) parameters are negotiated between initiator and target [7]; the values are set according to the target's capability. For iSCSI read operations, the initiator requests the desired data from the target, and the read operation parameters (number of sectors per command, MaxRecvDataSegmentLength and phase collapse) are set based on the initiator's capability.

2.2 Programming Model

To use MapReduce, the programmer [4] expresses their desired computation as a series of jobs. The input to a job is an input specification that will yield key-value pairs. Each job consists of two stages: first, a user-defined map function is applied to each input record to produce a list of intermediate key-value pairs. Second, a user-defined reduce function is called once for each distinct key in the map output and passed the list of intermediate values associated with that key. The MapReduce framework automatically parallelizes the execution of these functions and ensures fault tolerance. Optionally, the user can supply a combiner function [1].
Combiners are similar to reduce functions, except that they are not passed all the values for a given key: instead, a combiner emits an output value that summarizes the input values it was passed. Combiners are typically used to perform map-side pre-aggregation, which reduces the amount of network traffic required between the map and reduce steps.

Fig. 1. Map function interface
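To illustrate the model described above, the following is a minimal, Hadoop-independent sketch in Python of a word count job with a map-side combiner (the real interfaces are Java classes in Hadoop; here the combiner runs once over a task's whole output rather than per spill):

```python
from collections import defaultdict

def map_fn(_key, line):
    # Emit an intermediate (word, 1) pair for every word in the input record.
    for word in line.split():
        yield word, 1

def combine_fn(word, counts):
    # Map-side pre-aggregation: summarize the values seen so far for one key,
    # reducing the data shipped between the map and reduce steps.
    yield word, sum(counts)

def reduce_fn(word, counts):
    # Called once per distinct key with the list of intermediate values.
    yield word, sum(counts)

def run_job(records):
    # Map phase, followed by the per-task combiner.
    intermediate = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            intermediate[k].append(v)
    combined = defaultdict(list)
    for k, vs in intermediate.items():
        for ck, cv in combine_fn(k, vs):
            combined[ck].append(cv)
    # Reduce phase over the combined output.
    return {k: next(reduce_fn(k, vs))[1] for k, vs in combined.items()}

print(run_job([(0, "a b a"), (1, "b c")]))  # {'a': 2, 'b': 2, 'c': 1}
```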

2.3 Hadoop Architecture

Hadoop [4] is composed of Hadoop MapReduce, an implementation of MapReduce designed for large clusters, and the Hadoop Distributed File System (HDFS), a file system optimized for batch-oriented workloads such as MapReduce. In most Hadoop jobs, HDFS is used to store both the input to the map step and the output of the reduce step. Note that HDFS is not used to store intermediate results (e.g., the output of the map step): these are kept on each node's local file system. A Hadoop installation consists of a single master node and many worker nodes. The master, called the JobTracker, is responsible for accepting jobs from clients, dividing those jobs into tasks, and assigning those tasks to be executed by worker nodes. Each worker runs a TaskTracker process that manages the execution of the tasks currently assigned to that node. Each TaskTracker has a fixed number of slots for executing tasks (two maps and two reduces by default).

2.4 Map Task Execution

Each map task is assigned a portion of the input file called a split. By default, a split contains a single HDFS block (64 MB by default) [4], so the total number of file blocks determines the number of map tasks. The execution of a map task is divided into two phases. The map phase reads the task's split from HDFS, parses it into records (key/value pairs), and applies the map function to each record. After the map function has been applied to each input record, the commit phase registers the final output with the TaskTracker, which then informs the JobTracker that the task has finished executing. Figure 1 contains the interface that must be implemented by a user-defined map function. After the map function has been applied to each record in the split, the close method is invoked.

Fig. 2. Map task index and data file format (2 partition/reduce case)
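Since each split is one HDFS block by default, the number of map tasks for a job follows directly from the input size. A quick sketch (the file sizes are illustrative):

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # default HDFS block size discussed above (64 MB)

def num_map_tasks(file_sizes_bytes, block_size=BLOCK_SIZE):
    # One split, and hence one map task, per HDFS block of each input file.
    return sum(math.ceil(size / block_size) for size in file_sizes_bytes)

# A 200 MB file occupies 4 blocks (64+64+64+8 MB); a 10 MB file occupies 1.
print(num_map_tasks([200 * 1024 * 1024, 10 * 1024 * 1024]))  # 5
```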

The third argument to the map method specifies an OutputCollector instance, which accumulates the output records produced by the map function. The output of the map step is consumed by the reduce step, so the OutputCollector stores map output in a format that is easy for reduce tasks to consume. Intermediate keys are assigned to reducers by applying a partitioning function, so the OutputCollector applies that function to each key produced by the map function and stores each record and partition number in an in-memory buffer. The OutputCollector spills this buffer to disk when it reaches capacity. A spill of the in-memory buffer involves first sorting the records in the buffer by partition number and then by key. The buffer content is written to the local file system as an index file and a data file (Figure 2). The index file points to the offset of each partition in the data file. The data file contains only the records, which are sorted by key within each partition segment. During the commit phase, the final output of the map task is generated by merging all the spill files produced by this task into a single pair of data and index files. These files are registered with the TaskTracker before the task completes. The TaskTracker will read these files when servicing requests from reduce tasks.

2.5 Reduce Task Execution

The execution of a reduce task is divided into three phases. The shuffle phase fetches the reduce task's input data. Each reduce task is assigned a partition of the key range produced by the map step, so the reduce task must fetch the content of this partition from every map task's output. The sort phase groups records with the same key together.

Fig. 3. Reduce function interface

The reduce phase applies the user-defined reduce function to each key and the corresponding list of values.
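The index/data spill layout of Section 2.4 (Figure 2) can be sketched as follows; this is a toy Python analogue in which pickled records and in-memory byte buffers stand in for Hadoop's serialized key/value format and local files:

```python
import io
import pickle

def spill(buffer, num_partitions, partition_fn):
    # Sort buffered (key, value) records by partition number, then by key.
    records = sorted(buffer, key=lambda kv: (partition_fn(kv[0]), kv[0]))
    data = io.BytesIO()
    index = []  # offset of each partition segment in the data "file"
    for p in range(num_partitions):
        index.append(data.tell())
        for key, value in records:
            if partition_fn(key) == p:
                pickle.dump((key, value), data)
    return index, data.getvalue()

def read_partition(index, data, p):
    # A reduce task seeks to its partition's offset and reads only that segment.
    start = index[p]
    end = index[p + 1] if p + 1 < len(index) else len(data)
    buf, out = io.BytesIO(data[start:end]), []
    while buf.tell() < end - start:
        out.append(pickle.load(buf))
    return out

# Two partitions: even-length keys to partition 0, odd-length keys to partition 1.
part = lambda key: len(key) % 2
index, data = spill([("bb", 1), ("a", 2), ("ccc", 3)], 2, part)
print(read_partition(index, data, 1))  # [('a', 2), ('ccc', 3)]
```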
In the shuffle phase, a reduce task fetches data from each map task by issuing HTTP requests to a configurable number of TaskTrackers at once (5 by default). The JobTracker relays the location of every TaskTracker that hosts map output to every TaskTracker that is executing a reduce task. Note that a reduce task cannot fetch the output of a map task until the map has finished executing and committed its final output to disk.
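The bounded parallelism of the shuffle can be sketched as follows; fetch_map_output is a hypothetical stand-in for the real HTTP request to a TaskTracker, and the limit of 5 mirrors the default mentioned above:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_FETCHES = 5  # default: at most 5 TaskTrackers contacted at once

def fetch_map_output(map_task_id, partition):
    # Hypothetical stand-in for fetching one map task's output partition
    # over HTTP from the TaskTracker that hosts it.
    return ("data", map_task_id, partition)

def shuffle(map_task_ids, partition):
    # Fetch this reduce task's partition from every map task,
    # at most MAX_PARALLEL_FETCHES requests in flight at a time.
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_FETCHES) as pool:
        return list(pool.map(lambda m: fetch_map_output(m, partition),
                             map_task_ids))

outputs = shuffle(range(8), partition=3)
print(len(outputs))  # 8
```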

After receiving its partition from all map outputs, the reduce task enters the sort phase. The map output for each partition is already sorted by the reduce key. The reduce task merges these runs together to produce a single run that is sorted by key. The task then enters the reduce phase, in which it invokes the user-defined reduce function for each distinct key in sorted order, passing it the associated list of values. The output of the reduce function is written to a temporary location on HDFS. After the reduce function has been applied to each key in the reduce task's partition, the task's HDFS output file is atomically renamed from its temporary location to its final location. In this design, the output of both map and reduce tasks is written to disk before it can be consumed. This is particularly expensive for reduce tasks, because their output is written to HDFS. Output materialization simplifies fault tolerance, because it reduces the amount of state that must be restored to consistency after a node failure. If any task (either a map or a reduce) fails, the TaskTracker simply schedules a new task to perform the same work as the failed task. Since a task never exports any data other than its final answer, no further recovery steps are needed.

2.6 Pipelining Mechanism

In the pipelining version [5] of Hadoop, the authors developed the Hadoop Online Prototype (HOP), which can be used to support continuous queries: MapReduce jobs that run continuously. They also proposed a technique known as online aggregation, which can provide initial estimates of results several orders of magnitude faster than the final results. Finally, pipelining can reduce job completion time by up to 25% in some scenarios.

3 Motivations

The pipelining mechanism [5] uses a naïve implementation to send data directly from map to reduce tasks using TCP.
When a client submits a new job to Hadoop, the JobTracker assigns the map and reduce tasks associated with the job to the available TaskTracker slots. The HOP authors modified Hadoop so that each reduce task contacts every map task upon initiation of the job and opens a TCP socket which will be used to send the output of the map function. However, drawbacks may occur with TCP connections and TCP congestion during data transmission: the TCP connection may be disconnected, after which the data must be retransmitted, which takes a long time. We therefore propose MRA, which can send data without retransmission using iSCSI multi-connection, and which also manages load balancing of data, because the iSCSI protocol works over TCP. Another motivation is that the iSCSI protocol is block-I/O based, and Hadoop's map task likewise takes an HDFS block as its input.

4 Proposed Model: MapReduce Agent (MRA)

Traditional MapReduce implementations provide a poor interface for interactive data analysis, because they do not emit any output until the map task has been executed to completion. After producing the output of the map function, our proposed MRA rapidly creates multiple connections to the reducer. If one connection fails or a data overload problem occurs, the rest of the job is distributed to the other connections. Our Q-chained cluster [3] load balancer manages this job, so that the reducer can continue its work, which reduces job completion time.

Fig. 4. MapReduce Agent (MRA) Architecture

Fig. 5. Overview of Multi-Connection and Error Recovery Method of iSCSI [3]

4.1 Multi-Connection and Error Recovery Method of iSCSI

In order to alleviate the degradation of iSCSI-based remote transfer service caused by TCP congestion control, we propose the MRA Multi-Connection and Error Recovery method, which uses multiple connections for each session. As mentioned in [8], in a single TCP network connection, when congestion occurs through a timeout or the reception of duplicate ACKs (acknowledgements), one half of the current window size is saved in ssthresh (the slow-start threshold); additionally, if the congestion is indicated by a timeout, cwnd (the congestion window) is set to one segment. This may cause a significant degradation in online MapReduce performance. In the Multi-Connection case, on the other hand, if TCP congestion occurs within a connection, the takeover mechanism selects another TCP connection. The general overview of the proposed Multi-Connection and Error Recovery scheme, based on the iSCSI protocol, is designed for an iSCSI-based transfer system. When the mapper (worker) is in active or connected mode for a reduce job, a session is started. This session is a collection of multiple TCP connections.
If packet losses occur due to bad channel characteristics in any connection, our proposed scheme redistributes the data, balanced by the Q-Chained Cluster, over the other active connections.

Error Recovery Procedure in iSCSI Protocol. Error recovery is strongly required for the iSCSI protocol. The following two considerations prompted the design of much of the error recovery functionality in iSCSI [9].

A PDU may fail the digest check and be dropped, despite being received by the TCP layer. The iSCSI layer must optionally be allowed to recover such dropped PDUs.

A TCP connection may fail at any time during the data transfer. All the active tasks must optionally be allowed to continue on a different TCP connection within the same session.

Many kinds of errors can occur (e.g. bit errors, packet loss, etc.). However, iSCSI error recovery considers errors at the iSCSI protocol layer. The iSCSI error recovery module considers the following two errors:

Sequence Number Error: During transmission of data PDUs, which carry sequence numbers, some PDUs can be lost, so that the receiver does not get a valid PDU. We call this situation a "sequence number error".

Connection Failure: If the iSCSI target and initiator cannot communicate with each other via a TCP connection, we call this situation a "connection failure".

The role of the error recovery module is to detect the listed errors and to guarantee reliable transport of the data at the iSCSI protocol layer.

Error Recovery Procedure. The iSCSI protocol with error recovery checks the sequence number of every iSCSI PDU. If the iSCSI target or initiator receives an iSCSI PDU with an out-of-order sequence number, it requests the expected sequence number PDU again. In the connection failure case, when a connection has had no data communication during the engaged time, the iSCSI protocol with error recovery checks the connection status with the NOP command [9]. We assume multiple connections.

Sequence Number Error. When an initiator receives an iSCSI status PDU with an out-of-order sequence number, or a SCSI response PDU with an expected data sequence number (ExpDataSN) that implies missing data PDU(s), it means that the initiator has detected a header or payload digest error in one or more earlier Ready To Transfer (R2T) PDUs or data PDUs.
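The sequence number check can be sketched as follows (Python, with PDUs reduced to their bare DataSN field; detecting a gap corresponds to requesting retransmission of the missing PDUs):

```python
def check_sequence(pdus_datasn):
    # Walk the received data PDUs in arrival order. On an out-of-order
    # DataSN, record a retransmission request (a recovery R2T, in iSCSI
    # terms) for every missing sequence number in the gap.
    expected, retransmit = 0, []
    for datasn in pdus_datasn:
        if datasn != expected:
            retransmit.extend(range(expected, datasn))  # request the gap again
        expected = datasn + 1
    return retransmit

print(check_sequence([0, 1, 4, 5]))  # [2, 3]
```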
When a target receives a data PDU with an out-of-order data sequence number (DataSN), it means that the target must have hit a header or payload digest error on at least one of the earlier data PDUs. The target must discard the PDU and request retransmission with a recovery R2T.

At an initiator, the following cases lend themselves to connection recovery:

TCP connection failure: The initiator must close the connection. It then must either implicitly or explicitly log out the failed connection with the reason code "remove the connection for recovery" and reassign connection allegiance for all commands still in progress associated with the failed connection to one or more other connections. For an initiator, a command is in progress as long as it has not received a response or a Data-In PDU including status.

Receiving an Asynchronous Message [9] that indicates one or all connections in a session have been dropped: The initiator must handle it as a TCP connection failure for the connection(s) referred to in the message.

At an iSCSI target, the following cases lend themselves to connection recovery:

TCP connection failure: The target must close the connection and, if more than one connection is available, the target should send an Asynchronous Message that indicates it has dropped the connection. Then, the target will wait for the initiator to continue recovery.

4.2 Q-Chained Cluster Load Balancer

A Q-chained cluster is able to balance the workload fully among data connections in the event of packet losses due to bad channel characteristics. When congestion occurs in a data connection, this module can do a better job of balancing the workload: the load originated by the congested connection is distributed among the N-1 remaining connections instead of being dumped on a single data connection. Without such a scheme, when congestion occurs in a specific data connection, balancing the workload among the remaining connections is difficult, as one connection must pick up the entire workload of the component where congestion takes place, unless the data placement scheme allows the workload originated by the congested connection to be distributed among the remaining operational connections. Figure 6 illustrates how the workload is balanced in the event of congestion in a data connection (data connection 1 in this example) with a Q-chained cluster. With the congestion of data connection 1, primary data Q1 is no longer transmitted on the congested connection, so that its TCP input rate is throttled, and its recovery data q1 is passed to data connection 2. However, instead of requiring data connection 2 to process all of both Q2 and q1, the Q-chained cluster offloads 4/5ths of the transmission of Q2 by redirecting it to q2 in data connection 3. In turn, 3/5ths of the transmission of Q3 in data connection 3 is sent to q3, and so on.
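The redistribution in this example follows a simple chained pattern: with M connections and one congested, the survivor at position k in the chain keeps k/(M-1) of its own primary data and carries (M-k)/(M-1) of its predecessor's, so every survivor ends up with M/(M-1) of a normal load. A sketch of this arithmetic (for M = 6, matching the Figure 6 example):

```python
from fractions import Fraction

def q_chained_loads(m, failed):
    # Survivors listed in chain order after the congested connection.
    survivors = [(failed + i) % m for i in range(1, m)]
    loads = {}
    for k, conn in enumerate(survivors, start=1):
        primary = Fraction(k, m - 1)       # fraction of own primary data kept
        recovery = Fraction(m - k, m - 1)  # fraction of predecessor's data carried
        loads[conn] = primary + recovery
    return loads

loads = q_chained_loads(6, failed=0)
print(loads[1], loads[5])  # 6/5 6/5 -- each survivor carries 1/5 extra
```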
This dynamic reassignment of the workload results in an increase of only 1/5th in the workload of each remaining data connection.

Fig. 6. Q-Chained Load Balancer

4.3 MRA between Jobs

Although MapReduce was originally designed as a batch-oriented system [5], it is often used for interactive data analysis: a user submits a job to extract information from a data set. Traditional MapReduce implementations provide a poor interface for interactive data analysis, because they do not emit any output until the map task has been executed to completion. In MRA, by contrast, the data records produced by map tasks are sent to reduce tasks shortly after each record is generated (see Figure 8, the flowchart of MRA). As a result we can produce output more quickly.

Fig. 7. Flow Chart of Hadoop

Fig. 8. Flow Chart of MRA

4.4 Continuous MapReduce Jobs

A bare-bones implementation of continuous MapReduce jobs is easy to implement using MRA. No changes are needed to implement continuous map tasks: map output is already delivered to the appropriate reduce task shortly after it is generated. Our MRA implementation allows map functions to force their current output to reduce tasks. When a reduce task is unable to accept such data, the mapper framework stores it locally and sends it a short time later. With proper scheduling of reducers, MRA allows a map task to ensure that an output record is promptly sent to the appropriate reducer. To support continuous reduce tasks, the user-defined reduce function must be periodically invoked on the map output available at that reducer. Applications will have different requirements for how frequently the reduce function should be invoked; possible choices include periods based on wall-clock time, logical time (e.g., the value of a field in the map task output), and the number of input rows delivered to the reducer. The output of the reduce function can be written to HDFS.
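One of the invocation policies mentioned above, flushing after a fixed number of input rows, can be sketched as follows (a toy Python sketch; the class and method names are illustrative, not MRA's actual interfaces):

```python
class ContinuousReducer:
    """Invoke the user-defined reduce function every `every_n` map-output rows."""

    def __init__(self, reduce_fn, every_n):
        self.reduce_fn, self.every_n = reduce_fn, every_n
        self.pending, self.results = [], []

    def accept(self, key, value):
        # Called for each map-output record delivered to this reducer.
        self.pending.append((key, value))
        if len(self.pending) >= self.every_n:
            self.flush()

    def flush(self):
        # Group the pending rows by key and run reduce on the current window,
        # producing an early estimate of the final result.
        groups = {}
        for k, v in self.pending:
            groups.setdefault(k, []).append(v)
        self.results.append({k: self.reduce_fn(vs) for k, vs in groups.items()})
        self.pending = []

r = ContinuousReducer(sum, every_n=3)
for kv in [("a", 1), ("b", 2), ("a", 3), ("b", 4)]:
    r.accept(*kv)
print(r.results)  # [{'a': 4, 'b': 2}] -- one window flushed after 3 rows
```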

In our current implementation, the number of map and reduce tasks is fixed and must be configured by the user. To maintain the workload in remote transfers, MRA creates a Q-chained cluster.

4.5 Fault Tolerance

Our MRA Hadoop implementation is robust to the failure of both map and reduce tasks. To recover from map task failures, we added bookkeeping to the reduce task to record which map task produced each MRA spill file. To simplify fault tolerance, the reducer treats the output of an MRA map task as tentative until the JobTracker informs the reducer that the map task has committed successfully. The reducer can merge together spill files generated by the same uncommitted mapper, but will not combine those spill files with the output of other map tasks until it has been notified that the map task has committed. Thus, if a map task fails, each reduce task can ignore any tentative spill files produced by the failed map attempt. The JobTracker will take care of scheduling a new map task attempt, as in stock Hadoop. If a reduce task fails and a new copy of the task is started, the new reduce instance must be sent all the input data that was sent to the failed reduce attempt. To reduce transmission failures we use iSCSI multi-connection, so that the mapper's output data can be transmitted quickly, and to avoid load imbalance the Q-chained cluster provides better performance.

5 Performance Evaluation

As in [5], to evaluate the effectiveness of online aggregation we performed two experiments on Amazon EC2 using different data sets and query workloads. In their first experiment [5], the HOP authors wrote a Top-K query using two MapReduce jobs: the first job counts the frequency of each word and the second job selects the K most frequent words. We ran this workload on 5.5GB of Wikipedia article text stored in HDFS, using a 128MB block size.
We used a 60-node EC2 cluster; each node was a high-CPU medium EC2 instance with 1.7GB of RAM and 2 virtual cores. A virtual core is the equivalent of a 2007-era 2.5GHz Intel Xeon processor. A single EC2 node executed the Hadoop JobTracker and the HDFS NameNode, while the remaining nodes served as slaves running the TaskTrackers and HDFS DataNodes. A thorough performance comparison between pipelining, blocking and MRA is beyond the scope of this paper. In this section, we instead demonstrate that MRA can reduce job completion times in some configurations. We report performance using both large (512MB) and small (32MB) HDFS block sizes using a single workload (a word count job over randomly-generated text). Since the words were generated using a uniform distribution, map-side combiners were ineffective for this workload. We performed all experiments using relatively small clusters of Amazon EC2 nodes. We also did not consider performance in an environment where multiple concurrent jobs are executing simultaneously.

5.1 Performance Results of iSCSI Protocol

The throughput of our proposed scheme at different RTTs is measured for each number of connections in Figure 9. We see the rising rate of throughput slow between 8 and 9 connections. This shows that reconstructing the data in turn influences throughput, and that the packet drop rate increases when the number of TCP connections reaches 9, the maximum number of concurrent connections used between initiator and target.

Fig. 9. Throughput of Multi-Connection iSCSI System. The Y axis is throughput in Mbps and the X axis is the number of connections. RTTs of 50, 100, 250 and 500 ms are measured.

Fig. 10. Throughput of Multi-Connection iSCSI vs iSCSI at different error rates. The Y axis is throughput and the X axis is the bit error rate.

Fig. 11. Q-Chained Cluster Load Balancer vs No Load Balancer. MC: Multi-Connection, Q-CC: Q-Chained Cluster, NLB: No Load Balancer.

Therefore, 8 is the optimal number of connections from a performance point of view. The Multi-Connection iSCSI mechanism also works effectively because the data transfer throughput increases linearly when the round trip time is larger than 250ms. Figure 10 shows the performance comparison of Multi-Connection iSCSI and iSCSI at different bit-error rates. We see that at higher bit-error rates Multi-Connection iSCSI (2 connections) performs significantly better than iSCSI (1 connection), achieving a throughput improvement of about 24% in SCSI read. Moreover, as bit-error rates go up, the figure shows the throughput improvement growing further: 33%, 39.3% and 44% at progressively higher bit-error rates. Multi-Connection iSCSI can efficiently avoid the forceful reduction of transmission rate imposed by TCP congestion control by using another TCP connection opened during the service session, while iSCSI cannot make any progress. At low bit-error rates we see little difference between Multi-Connection iSCSI and iSCSI; iSCSI is quite robust at handling such low error rates. In Figure 11, Multi-Connection iSCSI (8 connections) with the Q-Chained cluster shows better average performance, by about 11.5%. It can distribute the workload among all remaining connections when packet losses occur in any connection. To recall an example given earlier, with M = 6, when congestion occurs in a specific connection, the workload of each remaining connection increases by only 1/5. By contrast, if Multi-Connection iSCSI (the proposed scheme) is run as a baseline without load balancing, whichever single connection is randomly selected by the takeover mechanism is overwhelmed.

5.2 Performance Results and Comparison on MapReduce

In the Hadoop MapReduce architecture [4, 5], the map task first generates output, and the reduce task then consumes it. This makes the process lengthy, because the reduce task has to wait for the output of the map task. The pipelining mechanism [5] sends the output of the map task to the reduce task immediately after each output record is generated, so it takes less time than Hadoop MapReduce [4]. However, if any problem occurs during the (TCP) transmission, the data is retransmitted, which takes more time and drastically reduces the performance of the MapReduce mechanism.

Fig. 12.
CDF of map and reduce task completion times for a 10GB wordcount job using 20 map tasks and 20 reduce tasks (512MB block size). The total job runtime was 361 seconds for blocking.

Fig. 13. CDF of map and reduce task completion times for a 10GB wordcount job using 20 map tasks and 20 reduce tasks (512MB block size). The total job runtime was 290 seconds for pipelining.

Fig. 14. CDF of map and reduce task completion times for a 10GB wordcount job using 20 map tasks and 20 reduce tasks (512MB block size). The total job runtime was 240 seconds for MRA.

Fig. 15. CDF of map and reduce task completion times for a 10GB wordcount job using 20 map tasks and 1 reduce task (512MB block size). The total job runtime was 29 minutes for blocking.

Fig. 16. CDF of map and reduce task completion times for a 10GB wordcount job using 20 map tasks and 1 reduce task (512MB block size). The total job runtime was 34 minutes for pipelining.

Fig. 17. CDF of map and reduce task completion times for a 10GB wordcount job using 20 map tasks and 1 reduce task (512MB block size). The total job runtime was 36 minutes for MRA.

Fig. 18. CDF of map and reduce task completion times for a 100GB wordcount job using 240 map tasks and 60 reduce tasks (512MB block size). The total job runtime was 48 minutes for blocking.

Fig. 19. CDF of map and reduce task completion times for a 100GB wordcount job using 240 map tasks and 60 reduce tasks (512MB block size). The total job runtime was 36 minutes for pipelining.

Fig. 20. CDF of map and reduce task completion times for a 100GB wordcount job using 240 map tasks and 60 reduce tasks (512MB block size). The total job runtime was 32 minutes for MRA.

On the other hand, our proposed mechanism (MRA) overcomes this drawback by using the multi-connection and Q-chained load balancer methods. In these circumstances MRA may prove its better completion time.

6 Conclusion

MapReduce has added a new dimension to large-scale parallel programming. Our paper demonstrates that MapReduce can be made more useful with MRA. We attribute this success to several reasons. First, the model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and load balancing. Second, we have developed an implementation of MapReduce that scales to large clusters comprising thousands of machines. The implementation makes efficient use of these machine resources and is therefore suitable for many large computational problems. Third, MRA can reduce the time to job completion.

References

[1] Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: OSDI (2004)
[2] SAM-3 Information Technology: SCSI Architecture Model 3. Working Draft, T10 Project 1561-D, Revision 7 (2003)
[3] Allayear, S.M., Park, S.S.: iSCSI Multi-connection and Error Recovery Method for Remote Storage System in Mobile Appliance. In: Gavrilova, M.L., Gervasi, O., Kumar, V., Tan, C.J.K., Taniar, D., Laganá, A., Mun, Y., Choo, H. (eds.) ICCSA 2006. LNCS, vol. 3981. Springer, Heidelberg (2006)
[4] Hadoop
[5] Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M. (UC Berkeley), Elmeleegy, K., Sears, R. (Yahoo! Research): MapReduce Online
[6] Allayear, S.M., Park, S.S., No, J.: iSCSI Protocol Adaptation with 2-way TCP Hand Shake Mechanism for an Embedded Multi-Agent Based Health Care Service. In: Proceedings of the 10th WSEAS International Conference on Mathematical Methods, Computational Techniques and Intelligent Systems, Corfu, Greece (2008)
[7] Allayear, S.M., Park, S.S.: iSCSI Protocol Adaptation with NAS System via Wireless Environment. In: International Conference on Consumer Electronics (ICCE), Las Vegas, USA (2008)
[8] Caceres, R., Iftode, L.: Improving the Performance of Reliable Transport Protocols in Mobile Computing Environments. IEEE JSAC
[9] RFC 3270
[10] Verma, A., Zea, N., Cho, B., Gupta, I., Campbell, R.H.: Breaking the MapReduce Stage Barrier
[11] Yang, H., Dasdan, A., Hsiao, R., Parker, D.: Map-Reduce-Merge: Simplified relational data processing on large clusters. In: Proc. of the 2007 ACM SIGMOD International Conference on Management of Data (2007)
[12] Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In: SIGMOD (1997)
[13] Shah, M.A., Hellerstein, J.M., Brewer, E.A.: Highly-available, fault-tolerant, parallel dataflows. In: SIGMOD (2004)
[14] Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive: a warehousing solution over a Map-Reduce framework. In: VLDB (2009)
[15] Wu, S., Jiang, S., Ooi, B.C., Tan, K.-L.: Distributed online aggregation. In: VLDB (2009)
[16] Yang, C., Yen, C., Tan, C., Madden, S.: Osprey: Implementing MapReduce-style fault tolerance in a shared-nothing distributed database. In: ICDE (2010)
[17] Chan, J.O.: An Architecture for Big Data Analytics
[18] Daneshyar, S., Razmjoo, M.: Large-Scale Data Processing Using MapReduce in Cloud Computing Environment
[19] Ji, C., Li, Y., Qiu, W., Awada, U., Li, K.: Big Data Processing in Cloud Computing Environments
[20] Padhy, R.P.: Big Data Processing with Hadoop-MapReduce in Cloud Systems
[21] Stokely, M.: Histogram tools for distributions of large data sets


FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc Fuxi Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc {jiamang.wang, yongjun.wyj, hua.caihua, zhipeng.tzp, zhiqiang.lv,

More information

2/26/2017. For instance, consider running Word Count across 20 splits

2/26/2017. For instance, consider running Word Count across 20 splits Based on the slides of prof. Pietro Michiardi Hadoop Internals https://github.com/michiard/disc-cloud-course/raw/master/hadoop/hadoop.pdf Job: execution of a MapReduce application across a data set Task:

More information

Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data

Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data Yojna Arora, Dinesh Goyal Abstract: Big Data refers to that huge amount of data which cannot be analyzed by using traditional analytics

More information

MapReduce: A Programming Model for Large-Scale Distributed Computation

MapReduce: A Programming Model for Large-Scale Distributed Computation CSC 258/458 MapReduce: A Programming Model for Large-Scale Distributed Computation University of Rochester Department of Computer Science Shantonu Hossain April 18, 2011 Outline Motivation MapReduce Overview

More information

Big Data for Engineers Spring Resource Management

Big Data for Engineers Spring Resource Management Ghislain Fourny Big Data for Engineers Spring 2018 7. Resource Management artjazz / 123RF Stock Photo Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models

More information

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since

More information

Global Journal of Engineering Science and Research Management

Global Journal of Engineering Science and Research Management A FUNDAMENTAL CONCEPT OF MAPREDUCE WITH MASSIVE FILES DATASET IN BIG DATA USING HADOOP PSEUDO-DISTRIBUTION MODE K. Srikanth*, P. Venkateswarlu, Ashok Suragala * Department of Information Technology, JNTUK-UCEV

More information

Outline 9.2. TCP for 2.5G/3G wireless

Outline 9.2. TCP for 2.5G/3G wireless Transport layer 9.1 Outline Motivation, TCP-mechanisms Classical approaches (Indirect TCP, Snooping TCP, Mobile TCP) PEPs in general Additional optimizations (Fast retransmit/recovery, Transmission freezing,

More information

The MapReduce Abstraction

The MapReduce Abstraction The MapReduce Abstraction Parallel Computing at Google Leverages multiple technologies to simplify large-scale parallel computations Proprietary computing clusters Map/Reduce software library Lots of other

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information