HEBR: A High Efficiency Block Reporting Scheme for HDFS

Sumukhi Chandrashekar and Lihao Xu

Abstract: The Hadoop platform is widely used for managing, analyzing and transforming large data sets in various systems. Hadoop has two basic components: 1) a distributed file system (HDFS) and 2) a computation framework (MapReduce). HDFS stores data on simple commodity machines that run DataNode processes (DataNodes). A commodity machine running the NameNode process (the NameNode) maintains the meta data of the file system. Every DataNode periodically sends the NameNode a list of all blocks currently stored on it, known as a block report. The NameNode processes block reports to build a mapping between files and their locations on the various DataNodes. Block reports form a heavy internal load on a Hadoop cluster, as they consume computation resources of the DataNodes and the NameNode as well as network bandwidth of the cluster. Supported by extensive experimental results, this paper proposes a new block report protocol, HEBR, for Hadoop that significantly reduces both computational and communication overhead, and thus greatly improves overall Hadoop system performance.

Index Terms: Hadoop, Hadoop Distributed File System, Efficient distributed file systems, Block reporting scheme.

1 INTRODUCTION

Big Data (the massive generation of content) has been growing at an average rate of 40% per year, 80% of which is unstructured. Big Data is characterized by three Vs: Variety, Velocity, and Volume [3]. Data with Variety can be challenging to categorize, while Velocity, which refers to the speed at which data arrives, can be challenging to process. The Volume, or size, of Big Data tends to fluctuate depending on the problem being assessed [4]. Together, the three Vs make it increasingly challenging to process Big Data, and researchers are therefore on the lookout for efficient solutions. One solution for dealing with Big Data is Apache's Hadoop, an open-source system. In 2006, Yahoo! began investing in its development and used Hadoop as its distributed data processing platform. Since then, Hadoop installations have grown to thousands of nodes in a cluster. For example, the biggest Hadoop clusters at Yahoo! [7], [8] consist of 4000 nodes and have a total storage capacity of 14PB each. eBay is a heavy user of the MapReduce paradigm, Apache Pig, Apache HBase, etc. for search optimization and has a Hadoop cluster comprising 532 nodes maintaining 5.3PB of data [22]. Facebook owns two Hadoop clusters for maintaining internal log and dimension data sources: a 1100-machine cluster with 8800 cores storing 12PB of data, and a 300-machine cluster with 2400 cores storing 3PB of data [22].

Hadoop uses its own distributed file system, the Hadoop Distributed File System (HDFS), which runs on commodity hardware to store files. Although HDFS shares many attributes with other distributed file systems [5], it is designed to be significantly more fault tolerant compared with some dedicated hardware solutions such as Redundant Array of Inexpensive Disks (RAID) [6] or data replication. An HDFS cluster comprises a commodity machine running the NameNode process and many others running DataNode processes.

S. Chandrashekar is with the Department of Computer Science, Wayne State University, Detroit MI, USA. E-mail: sumukhic@wayne.edu. L. Xu is with the Department of Computer Science, Wayne State University, Detroit MI, USA, where he is an associate professor. E-mail: lihao@wayne.edu.
The NameNode manages the name space of the file system, and the DataNodes store the physical data files, which are partitioned into blocks. Each block is replicated (3 replicas by default) and stored on different DataNodes; the number of replicas of a block can be adjusted by a parameter in the HDFS configuration files. A DataNode identifies the block replicas on its disk and sends periodic reports, termed block reports, to the NameNode. With the help of these block reports, the NameNode builds a map, referred to as the BlockMap, that maps blocks of files to their physical locations on the DataNodes. Block reports are essential meta data that let HDFS know the physical locations of files on the DataNodes, and they form a significant portion of the internal load of a cluster [1]. The block report load depends on the number of DataNodes in a cluster; when the load is too high, the cluster may become dysfunctional and process fewer reads and writes. Although block reports are randomized so that they do not accumulate at the NameNode, a normal sized cluster (of about one thousand DataNodes) typically still has to process more than 10 reports per second, each consisting of 60,000 blocks [1]. In such a typical configuration, approximately 30% of the NameNode's total computation capacity is used to handle block reports. Files are uploaded onto HDFS in intervals, and subsequent block reports are identical when no new files have been uploaded between them. When this happens, the NameNode still updates the BlockMap even though the reports carry no new information, wasting a great deal of computational and network resources of the cluster.
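As an illustration of the mapping just described, the following minimal Java sketch shows a block-to-locations map being built from block reports. It is purely illustrative: it is not the actual HDFS implementation of the BlockMap, and the class and method names are our own.

import java.util.*;

// Illustrative only: a simplified stand-in for the NameNode's BlockMap,
// mapping a block id to the set of DataNodes currently hosting a replica.
public class SimpleBlockMap {
    private final Map<Long, Set<String>> blockToDataNodes = new HashMap<>();

    // Process one block report: record that every reported block
    // has a replica on the reporting DataNode.
    public void processBlockReport(String dataNodeId, List<Long> reportedBlockIds) {
        for (long blockId : reportedBlockIds) {
            blockToDataNodes
                .computeIfAbsent(blockId, id -> new HashSet<>())
                .add(dataNodeId);
        }
    }

    // Locations a client would be given for a read request on this block.
    public Set<String> locationsOf(long blockId) {
        return blockToDataNodes.getOrDefault(blockId, Collections.emptySet());
    }

    public static void main(String[] args) {
        SimpleBlockMap map = new SimpleBlockMap();
        map.processBlockReport("datanode-1", Arrays.asList(1001L, 1002L));
        map.processBlockReport("datanode-2", Arrays.asList(1001L));
        System.out.println("Block 1001 replicas: " + map.locationsOf(1001L));
    }
}

Every incoming report touches only this map; the cost the paper is concerned with comes from how many blocks each report carries and how often reports arrive.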

We developed a new block reporting scheme, HEBR (High Efficiency Block Reporting Scheme for HDFS), that sends much smaller lists of blocks more frequently. Throughout this paper, we refer to our scheme as HEBR and to the block reporting scheme implemented in the latest version of Hadoop as CBR. This paper presents HEBR and results from different sets of experiments that evaluate and demonstrate its superior performance compared to CBR. Even in its worst case, on a 4-node cluster, HEBR reduces the number of blocks sent to the NameNode by 59.78%, greatly reducing processing resources and network bandwidth usage and thus improving overall HDFS performance significantly.

The paper is organized as follows: Section 2 briefly describes the architecture of HDFS; Section 3 describes the current approaches to sending and processing block reports and other related research; Section 4 discusses the proposed block reporting scheme, HEBR. The experimental setup and the test benchmarks on which performance evaluations of HEBR were conducted are discussed in Section 5. In Sections 6, 7 and 8, the results from some of the experiments conducted on the benchmarks are presented in detail. Section 9 concludes the paper with the finding that HEBR is a significantly more efficient block reporting scheme than CBR and is no worse than CBR under any experimental setting.

2 ARCHITECTURE OF HDFS

HDFS follows a master-slave architecture, with a single machine running the NameNode software component acting as the master server and many machines running DataNode processes acting as slaves. The DataNodes are typically arranged in racks and connected to a switch, usually with one or two bonded Gigabit Ethernet links. The switch in turn has up-link connections to another tier of switches, thus connecting all DataNodes and forming a real-time HDFS cluster. The NameNode manages the entire file system name space and the list of blocks belonging to each file in the FsImage meta data, which is persisted by the NameNode. Each physical data block of a file written by a client application is stored in the local file system of a DataNode. A block of a file is represented by two physical files: one that holds the content of the block and another that holds meta data, including the checksum for the block and its generation stamp.

A client application contacts the NameNode to get the information needed to serve read/write requests. To serve a read request, the NameNode returns a list of DataNodes that host blocks of the requested file, and the client contacts the closest DataNode for the data blocks. To serve a write request, the NameNode selects a set of DataNodes, based on a proximity algorithm, to host blocks of the file and returns this list to the client. The client then writes the blocks onto those DataNodes. The DataNodes may then write the blocks onto other DataNodes in the same or different racks in order to maintain the minimum number of replicas of each block.
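For context, the read and write paths described above are hidden from applications behind the standard Hadoop FileSystem API. The short Java sketch below simply writes and reads back a small file; the cluster address (hdfs://namenode:9000) and the path are placeholders, and the example is illustrative rather than part of HEBR.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Illustrative client: the NameNode supplies block locations, the client
// streams data to/from DataNodes; none of that is visible at this level.
public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/tmp/hebr-demo.txt");

            // Write: the NameNode chooses DataNodes for each block,
            // and the write pipeline replicates blocks (3 copies by default).
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the NameNode returns block locations and the client
            // reads from the closest DataNode.
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}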
3 RELATED WORK

CBR comprises Full Block Reports (full-block) and Intermediate Block Reports (intermediate-block). This section describes the details of CBR, research that has evaluated the performance of HDFS (evaluation strategy, metrics and experiment setup), and research results on improving the performance of Hadoop.

3.1 CBR

A full-block report reports all valid blocks on a disk of a DataNode, while an intermediate-block report reports one block that is being received by, or deleted from, the disk of a DataNode.

3.1.1 Full Block Reports

Upon registration to a cluster, a process scans the disk of a DataNode to report all the stored blocks for the first time (the first-time block report). Subsequently, whenever a configurable timer expires (set by default to 6 hours), the DataNode runs the same background scan process. This time, the scanner collects all the currently valid blocks on its disk and reconciles the differences between the blocks that reside on the disk and a map, the volumeMap, that lists the blocks in every storage volume of the disk; the differences are used to update the volumeMap. A storage volume in the volumeMap is associated with a DataStorage id, and a block is associated with a block id, a generation stamp and its length. A block report is a map, represented as a HashMap, from the blocks to the DataStorage id of the volume on which they reside. The BlockMap, which maps blocks to the DataNodes they reside on, is not stored persistently on the disk of the NameNode [20]; the NameNode learns the locations of blocks only when it processes first-time, full-block and intermediate-block reports.

An interface in the NameNode processes block reports based on the implementation in the method reportDiff(). When a block reported by a DataNode has the same generation stamp and length as recorded in FsImage, the physical location of the block from the block report is updated in the BlockMap. When a reported block is not in FsImage, the associated DataNode is notified to invalidate it. If a reported block is invalid, the NameNode triggers the replication process and commands the DataNode to delete it. Every block hosted on a DataNode, even those that are unchanged, is processed every time a block report is processed, and no reads or writes can be served during this processing.
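To make the processing rules above concrete, the following simplified Java sketch mirrors the per-block decisions of full-block report processing. It is our own illustrative rendering, not the actual reportDiff() code; the Block class and the method names are invented for this sketch.

import java.util.*;

// Simplified, illustrative rendering of CBR full-block report processing.
// Every reported block is examined, even if nothing about it has changed.
public class FullBlockReportProcessor {

    // Minimal stand-in for a block entry in a report or in FsImage.
    public static class Block {
        final long id;
        final long generationStamp;
        final long length;
        Block(long id, long generationStamp, long length) {
            this.id = id; this.generationStamp = generationStamp; this.length = length;
        }
    }

    private final Map<Long, Block> fsImage;          // blocks known to the namespace
    private final Map<Long, Set<String>> blockMap;   // block id -> DataNode locations

    public FullBlockReportProcessor(Map<Long, Block> fsImage, Map<Long, Set<String>> blockMap) {
        this.fsImage = fsImage;
        this.blockMap = blockMap;
    }

    // Returns block ids the reporting DataNode should invalidate/delete.
    public List<Long> process(String dataNodeId, List<Block> report) {
        List<Long> toInvalidate = new ArrayList<>();
        for (Block reported : report) {
            Block known = fsImage.get(reported.id);
            if (known != null
                    && known.generationStamp == reported.generationStamp
                    && known.length == reported.length) {
                // Matches the namespace record: record/refresh its location.
                blockMap.computeIfAbsent(reported.id, id -> new HashSet<>()).add(dataNodeId);
            } else {
                // Unknown or stale replica: tell the DataNode to remove it.
                toInvalidate.add(reported.id);
            }
        }
        // In CBR, blocks expected on this DataNode but absent from the report
        // would additionally be treated as lost and queued for re-replication
        // (omitted here for brevity).
        return toInvalidate;
    }
}

Note that every reported block is touched on every report, regardless of whether anything changed; that is precisely the overhead HEBR aims to avoid.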

3.1.2 Intermediate Block Reports

In practice, files may be written onto the DataNodes before a scheduled full-block report is sent to the NameNode. With only full-block reports to convey the physical locations of blocks, a client read request may not be fulfilled if the blocks of the files to be read have not yet been processed at the NameNode. One solution to this problem is the intermediate-block reporting scheme discussed here. Any change (addition or deletion of a data block) on the disk of a DataNode triggers an intermediate block report: a map from the added or deleted block to the DataStorage id of the volume it was uploaded to or deleted from is sent to the NameNode [21]. The physical locations of blocks that are being received, or that have already been received and validated at the DataNodes, are updated in the BlockMap. The NameNode processes blocks that were deleted at a DataNode differently: it verifies whether such blocks are valid and triggers replication tasks.

3.2 Performance Evaluation of HDFS

Researchers have explored evaluating the performance of Hadoop and other file systems [11], [12], [13] using standard test benchmarks. Some of these are [11]: 1) TeraGen, which generates a file of a desired size, usually ranging between 500GB and 3TB; 2) TeraSort, which sorts an input file across a Hadoop cluster; 3) TeraValidate, which verifies the sorted data for accuracy; and 4) RandomWriter, which generates files of different sizes based on a size parameter and uploads them onto HDFS. Several research papers have focused on evaluating read/write performance with small and big data sets [10], [14], [15], [16], computing statistics such as data throughput and the average and standard deviation of the I/O rate. The authors of [10] integrated the parallel virtual file system (PVFS) into Hadoop and compared its performance with HDFS using a set of data-intensive computing benchmarks; they observed that the consistency, durability and persistence trade-offs made by HDFS affect application performance. Another paper analyzed the performance of HDFS in terms of the throughput achieved [15]. Yet another work [17] evaluated cluster configurations using Hadoop in order to check parallelism performance and scalability; this evaluation established capabilities from the perspective of storage, indexing techniques, query distribution, parallelism, scalability and performance in heterogeneous environments on HDFS.

3.3 Performance Improvement in HDFS

In the current implementation of Hadoop, data locality is not taken into account when launching map tasks, as it is assumed that most maps are data-local. One paper [17] addresses the problem of placing data across DataNodes; its approach ensures that every DataNode in the cluster has a balanced data processing load. Yet another group [18] shows that a simple data placement scheme that considers several aspects of the computing platform and the nature of submitted jobs can increase the throughput of completed jobs by several orders of magnitude in the map-reduce paradigm. They also conducted a performance study of MapReduce on a 100-node Amazon EC2 cluster with various levels of parallelism, identified five design factors that affect the performance of Hadoop, investigated alternative methods for each of them, and claimed that careful tuning of these factors improves the overall performance of Hadoop.

4 HEBR: HIGH EFFICIENCY BLOCK REPORTING SCHEME

To the best of our knowledge, no previous research effort has focused on improving the performance of HDFS through a more efficient block reporting scheme. In this section, we present HEBR, our new block reporting scheme, which is significantly more efficient in both computation and communication than CBR. Processing full-block reports that are similar or, worse still, identical to each other places a high computational load on the NameNode as well as a high communication overhead on the cluster. Since intermediate-block reports carry only one block each and the NameNode processes every report independently, the processing load on the NameNode increases further. We therefore designed a highly efficient block reporting scheme that reduces network traffic between the DataNodes and the NameNode and also reduces the block report load on the NameNode. We call it HEBR (High Efficiency Block Reporting Scheme).
It should be noted that in many applications, additions or deletions of blocks are usually correlated, e.g., blocks of the same file. The idea of HEBR is therefore to send smaller block reports, each of which may contain the blocks associated with one file, to the NameNode. The efficiency gains of HEBR result from sending: 1) fewer block reports than the many intermediate-block reports, and 2) smaller block reports than a full-block report. HEBR sends a first-time block report, which is essentially a full-block report, at the time a DataNode registers with the cluster. This is the only time a DataNode running HEBR sends a full-block report; subsequent reports contain only the blocks newly uploaded onto the disk of the DataNode. CBR and HEBR are compared in Table 1.

TABLE 1
Comparison of the Two Block Reporting Schemes

Sends                                       CBR   HEBR
A first-time full-block report              Yes   Yes
Full-block reports periodically             Yes   No
Intermediate-block reports with one block   Yes   No
Groups of newly uploaded blocks             No    Yes
Deleted blocks                              No    Yes

A background disk scanning process is scheduled when a timer expires; this process scans the disks of the DataNodes. HEBR introduces a configurable time interval for scheduling block reports: the current time minus this interval marks the time at which the previous block report was sent. When the scanner identifies blocks that were uploaded onto the disk after the previous block report was sent, based on the time stamps of the blocks, it adds them to the next block report. Although the number of blocks written onto the disk depends on the rate at which they are written and on the interval between two reports, usually more than one block is written onto HDFS between two block reports; thus, each HEBR block report contains more blocks than an intermediate-block report. But, as HEBR reports only the most recently uploaded blocks to the NameNode, every block report except the first-time report carries fewer blocks than a full-block report. Since the difference in processing and communication time between a block report with a single block and one with a few blocks is negligible, the reduction in the number of block reports sent to the NameNode increases reporting efficiency.
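A minimal sketch of the DataNode-side collection step described above is given below: blocks whose on-disk timestamps are newer than the previous report time are gathered, and deleted blocks are flagged with a sentinel value standing in for the 000 time stamp. This is an illustration under the assumptions of this section, not the actual HEBR implementation; the block-file naming and directory layout are hypothetical.

import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Illustrative DataNode-side scan for an HEBR-style incremental report.
public class IncrementalBlockScanner {

    public static final long DELETED_MARKER = 0L; // stands in for the "000" time stamp

    // Collect block files written after the previous report was sent.
    // 'volumeDir' is a hypothetical directory holding block files named "blk_<id>".
    public List<Long> collectNewBlocks(File volumeDir, long lastReportTimeMillis) {
        List<Long> newBlocks = new ArrayList<>();
        File[] files = volumeDir.listFiles((dir, name) ->
                name.startsWith("blk_") && !name.endsWith(".meta"));
        if (files == null) {
            return newBlocks;
        }
        for (File f : files) {
            if (f.lastModified() > lastReportTimeMillis) {
                newBlocks.add(Long.parseLong(f.getName().substring("blk_".length())));
            }
        }
        return newBlocks;
    }

    // The report pairs each block id with a time stamp; deleted blocks are
    // flagged with DELETED_MARKER so the NameNode can treat them separately.
    public List<long[]> buildReport(List<Long> newBlocks, List<Long> deletedBlocks) {
        List<long[]> report = new ArrayList<>();
        for (long id : newBlocks) {
            report.add(new long[] { id, System.currentTimeMillis() });
        }
        for (long id : deletedBlocks) {
            report.add(new long[] { id, DELETED_MARKER });
        }
        return report;
    }
}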

Since HEBR accumulates a few blocks (rather than just one) before a DataNode reports to the NameNode, the number of block reports communicated over the network is much smaller than the number of intermediate-block reports; the fewer block reports the NameNode needs to process, the lower its block reporting load.

In CBR, all the blocks on the disks of the DataNodes are reported. If the reported blocks belong to files in FsImage, their physical locations are updated in the BlockMap when the block reports are processed. If blocks that belong to a file are not reported in a block report, the NameNode assumes those blocks have been deleted and issues replication commands. HEBR does not report all the blocks on the disks of the DataNodes, so the strategy for processing block reports has been slightly modified. Along with the recently uploaded blocks, blocks that have been deleted from a DataNode are also reported; to distinguish them from the blocks still on disk, their time stamps are marked as 000. Under the assumption that the number of blocks deleted from a DataNode is much smaller than the number of all blocks stored on it, HEBR reports far fewer, and only the most recently uploaded, blocks to the NameNode compared to CBR. The recovery mechanism of CBR has been retained: the locations of blocks that are not mentioned in a recent report, but whose locations are already recorded in the BlockMap, do not change. Since such blocks are not assumed to have been deleted from the disks of the DataNodes, no replication process is triggered by the NameNode. For new blocks whose locations are not yet in the BlockMap, the physical locations are added to the map. Only in rare cases does a DataNode have more deleted blocks to report than stored blocks; in such cases a full block report is sent instead (see Section 9), which ensures that an HEBR report is never larger than a CBR report.

The implementation details of HEBR are as follows. The architecture of HDFS (version 2.7.0) has been retained. We: 1) modified the configuration file to add an interval that triggers HEBR reports; 2) updated the background scanning process of CBR so that it collects blocks that were recently uploaded to, or deleted from, the disks of the DataNodes; 3) altered the time stamps of blocks that are deleted from the disks of DataNodes; 4) modified the code that adds blocks into a block report, so that the blocks collected by the scanner make up the block report; and 5) modified the block report processing mechanism on the NameNode (a sketch of this step follows below). Overall, we added or altered 300 LOC (lines of Java code) in HDFS. These changes do not break any running cluster and can be easily adopted.
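The sketch below illustrates, in simplified form, the modified NameNode-side processing of item 5: newly reported blocks are added to the block map, blocks flagged as deleted are verified and queued for re-replication, and blocks merely absent from a report are left untouched. The types and method names are our own and do not reflect the actual HDFS code.

import java.util.*;

// Illustrative NameNode-side handling of an HEBR-style report.
// Unlike full-block report processing, blocks not mentioned in the report
// are NOT assumed deleted; only explicitly flagged deletions are acted on.
public class HebrReportProcessor {

    public static final long DELETED_MARKER = 0L; // stands in for the "000" time stamp

    private final Map<Long, Set<String>> blockMap;   // block id -> DataNode locations
    private final Set<Long> validBlocks;             // block ids known to FsImage

    public HebrReportProcessor(Map<Long, Set<String>> blockMap, Set<Long> validBlocks) {
        this.blockMap = blockMap;
        this.validBlocks = validBlocks;
    }

    // Each entry is {blockId, timestamp}; a DELETED_MARKER timestamp flags a deletion.
    // Returns the block ids that need re-replication elsewhere.
    public List<Long> process(String dataNodeId, List<long[]> report) {
        List<Long> needReplication = new ArrayList<>();
        for (long[] entry : report) {
            long blockId = entry[0];
            long timestamp = entry[1];
            if (timestamp == DELETED_MARKER) {
                // A replica was removed on this DataNode: drop the location and,
                // if the block still belongs to a file, trigger re-replication.
                Set<String> locations = blockMap.get(blockId);
                if (locations != null) {
                    locations.remove(dataNodeId);
                }
                if (validBlocks.contains(blockId)) {
                    needReplication.add(blockId);
                }
            } else {
                // Newly uploaded replica: just record its location.
                blockMap.computeIfAbsent(blockId, id -> new HashSet<>()).add(dataNodeId);
            }
        }
        return needReplication;
    }
}

Because only the blocks named in the report are examined, the per-report processing cost scales with the number of recent changes rather than with the total number of blocks on the DataNode.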
5 EVALUATION

The evaluation strategy for HEBR and the corresponding results are discussed in the following sections. To assess the performance of HDFS using HEBR, we chose two easily accessible benchmarks that were developed to evaluate the performance of HDFS: Word-Count [23] and Random-Writer [24]. These benchmarks have often been chosen by researchers to study read/write throughput in HDFS. They consist of programs that upload only one file onto HDFS per run [2], [12], whereas a map-reduce program analyzing Big Data files uploads multiple files per run. To evaluate HEBR on workloads closer to a real map-reduce scenario, we modified the benchmark programs to upload a specified number of copies of the same file (50 by default) onto HDFS for every run. In addition, we built our own benchmark, APIRandomWrite, with some new features to further evaluate HEBR. The experiments using the benchmark programs were conducted on both CBR and HEBR to compare their performance under various workloads.

5.1 Word-Count [23]

In this benchmark, the program uses a map function that counts the number of words in each input file (at most four standard files) and a reduce function that accumulates the counts of words from the individual files (a minimal sketch of these functions is given after Section 5.3). The cumulative count of every word in the input files is written to a file and uploaded onto HDFS. We use a modified version that uploads multiple replicas of the output file during an experiment set, resembling a real workload.

5.2 Random-Writer [24]

This benchmark consists of a program that uses a map function to create and upload a file onto HDFS; the reduce function is not implemented. The file size is a required parameter of the program. We use a modified version that uploads multiple copies of the same file during one experiment set, a scenario similar to a real-world workload.

5.3 APIRandomWrite

Since HEBR reports blocks deleted from DataNodes, APIRandomWrite consists of a program with the following features not found in the other standard benchmarks: 1) it can upload multiple files and blocks of specified sizes onto HDFS, and 2) it can delete a block and all its replicas in HDFS according to the user's specification (the number of blocks and the file from which the blocks are to be deleted).
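For reference, the map and reduce functions underlying the Word-Count benchmark (Section 5.1) follow the standard Hadoop WordCount pattern. The listing below is a minimal version of that pattern, not the exact benchmark program used in our experiments (which additionally uploads multiple copies of the output file).

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word; the result is written to HDFS.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}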

5.4 Experiment Setup

The experimental setup for the three benchmarks is discussed in this section. Each experiment has a list of parameters whose default values are given in Table 2.

TABLE 2
Experimental Parameters and Default Values

Parameter    Role                                               Default Value
Interval     Triggers the generation of block reports           1 minute
Block Size   Determines the number of blocks per file           128MB
File Size    Determines the size of the file to be uploaded     Varied

Experiments are performed across the benchmarks by keeping two of the above parameters fixed and varying the third. The experiments are labeled Different-File-Sizes, Different-Block-Sizes and Different-Intervals to indicate the varied parameter. The block sizes and the intervals are adjusted to three different values from their defaults (block size: 128MB; interval: 1 minute) in the configuration files of both versions of HDFS, running CBR and HEBR. Thus, we conducted many subsets of trials with variations of block sizes and intervals during each experiment set. The adjusted block sizes depend on the file sizes uploaded onto HDFS during a trial, while the interval variations (3, 5 and 10 minutes) are standard across all trials. At the start of each experiment, HDFS is reformatted and all previous input files are deleted. The number of blocks in the block reports sent from each DataNode running HEBR or CBR to the NameNode during every experiment was recorded and saved to a file. We collect these files from all the DataNodes that report to the NameNode in order to analyze the performance of the two block reporting schemes.

HEBR was examined with the programs of the three benchmarks executed on single-node, 3-node and 4-node clusters. The configurations of the machines in the 3-node cluster are given in Table 3. The master node, which runs both the NameNode and a DataNode process, is labeled Master; one of the DataNodes is labeled Slave1 and the other Slave2. The commodity machines that constitute the single-node and 3-node clusters were old and identical. Since a real-world HDFS cluster may comprise commodity machines of mixed configurations, a 4-node cluster was also set up with machines of varying configurations. The node that runs both the NameNode and a DataNode process is labeled Master, and the other three DataNodes are labeled Slave1, Slave2 and Slave3; their configurations are listed in Table 4.

TABLE 3: Configurations of Machines in the 3-node Cluster (Architecture, CPU Modes, CPU MHz and Total Available Memory for Master, Slave1 and Slave2).

TABLE 4: Configurations of Machines in the 4-node Cluster (Architecture, CPU Modes, CPU MHz and Total Available Memory for Master, Slave1, Slave2 and Slave3).

With the data collected from the DataNodes, simple statistics such as the average and maximum number of blocks sent to the NameNode were computed and analyzed to compare the performance of CBR and HEBR. Fewer blocks in a block report reduce the processing time and network bandwidth usage in the cluster; thus, the lower the average, the more efficient the block reporting scheme. An experiment is terminated when four consecutive block reports contain no blocks. In the following sections, we present and discuss the results of experiments running the benchmark programs; the sketch below shows how these statistics are computed from the recorded data.
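The statistics themselves are simple to compute; the following small Java helper shows how the averages, maxima, and percentage reductions reported in the tables can be derived from the per-report block counts recorded at each DataNode. The numbers in main() are made-up example values, not measured data.

import java.util.Arrays;
import java.util.List;

// Illustrative computation of the per-DataNode statistics used in the tables.
public class BlockReportStats {

    public static double average(List<Integer> blocksPerReport) {
        long sum = 0;
        for (int n : blocksPerReport) {
            sum += n;
        }
        return blocksPerReport.isEmpty() ? 0.0 : (double) sum / blocksPerReport.size();
    }

    public static int maximum(List<Integer> blocksPerReport) {
        int max = 0;
        for (int n : blocksPerReport) {
            max = Math.max(max, n);
        }
        return max;
    }

    // Percentage reduction of HEBR relative to CBR, based on average blocks per report.
    public static double reductionPercent(double cbrAverage, double hebrAverage) {
        return cbrAverage == 0.0 ? 0.0 : 100.0 * (cbrAverage - hebrAverage) / cbrAverage;
    }

    public static void main(String[] args) {
        List<Integer> cbr = Arrays.asList(60, 60, 62, 61);   // example counts, not measured data
        List<Integer> hebr = Arrays.asList(8, 5, 0, 3);      // example counts, not measured data
        System.out.printf("CBR avg=%.2f max=%d%n", average(cbr), maximum(cbr));
        System.out.printf("HEBR avg=%.2f max=%d%n", average(hebr), maximum(hebr));
        System.out.printf("Reduction=%.2f%%%n",
                reductionPercent(average(cbr), average(hebr)));
    }
}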
6 BENCHMARK 1: WORD-COUNT

Four files of sizes approximately 700KB, 1.5MB, 650KB and 1.5MB were provided as inputs to the word-count program, though not all at the same time. The number of input files and their sizes determine the size of the generated output file; the largest output file, 926KB, was produced when all four files were input simultaneously. The experiments were run on single-node and 3-node clusters. The experimental results are very rich, and we discuss only the results of one experiment (Different-Intervals), which is a superset of the other two, running on the 3-node cluster.

6.1 Different-Intervals on 3-node Cluster

The block sizes were set to 128KB, 256KB and 512KB, and the intervals were set to 1, 3 and 5 minutes for each block size. Under these settings, 50 files each of sizes 708KB, 825KB, 860KB and 926KB were uploaded onto the file system. Since the experiments uploaded small files, the block sizes were set in units of KB. In total, 36 experiment trials were conducted; a small portion of the results (one block size and interval for the smallest file, 708KB, and the largest file, 926KB) is discussed in this section, while detailed results are presented in Appendix A. Table 5 shows the average and maximum number of blocks sent to the NameNode by all the DataNodes. Since the rates of writing blocks onto the disks of the DataNodes are roughly the same, the maximum number of blocks sent by the DataNodes may be identical. The results show that HEBR outperforms CBR consistently: it sends significantly fewer blocks in every report whether the interval is set to 1, 3 or 5 minutes. Figure 1 shows partial results from the experiment (one block size for every interval and two file sizes).

6.2 Observation and Remarks

Table 6 shows the average percentage reduction in the number of blocks sent to the NameNode in block reports across all the experiments conducted with the word-count program on single-node and 3-node clusters. The average number of blocks sent to the NameNode from the DataNodes running HEBR is lower than with CBR in every experimental setting. The results of the various single-node and 3-node experiments suggest that HEBR reduces the number of blocks reported to the NameNode by up to 90%, and by as much as 62% even in the most conservative results. This follows from the fact that HEBR collects into its block reports only the most recent uploads from the disk, not all the blocks. The meta data that represents a block in a block report takes 3 bytes [21]; thus, when the average number of blocks in a block report is smaller, the block report sent to the NameNode is smaller in terms of bytes, which reduces the bandwidth consumption and communication overhead between the DataNodes and the NameNode.

TABLE 5: Average and Maximum Number of Blocks Sent to the NameNode in Different-Intervals, Running the Word-Count Benchmark on the 3-node Cluster (per file size, block size, interval and node, with CBR/HEBR averages, maxima and percentage reduction).

TABLE 6: Percentage Reduction in the Number of Blocks Sent from DataNodes Running HEBR, Word-Count Benchmark (per cluster size and experiment).

When the files are larger, the blocks are written over a longer period of time, so HEBR sends fewer blocks per block report when large files are uploaded. The larger the block size in the configuration file, the fewer blocks each file has; when the blocks are large, only a few of them are written within a given period, and hence the average number of blocks sent to the NameNode is reduced. We also notice that the average number of blocks per report is lower when blocks are reported more frequently: if the timer between reports exceeds 5 minutes, more blocks are reported per report, but when block reports are sent very frequently (every 1 minute), the processing resources of the NameNode may be overloaded. Based on these results, the optimal interval for the best performance of HEBR is 3 minutes. A similar trend was observed in the experiments run on the single-node cluster.

7 BENCHMARK 2: RANDOM-WRITER

Similar experiments were conducted using the Random-Writer program on the single-node and 3-node clusters; the results are discussed in this section. In real-life scenarios, files of varying sizes are stored on the file system, so we upload files of different sizes ranging from KBs to MBs. The results of one experiment (Different-Block-Sizes) conducted on the 3-node cluster are presented here. Although Different-Block-Sizes is not a superset of the other two experiments, the results were impressive.

7.1 Different-Block-Sizes on 3-node Cluster

The program uploads large files (100MB and 120MB) and small files (150KB). The pre-processing time associated with the map-reduce paradigm causes delays in writing blocks. As large files were being uploaded, we set the block sizes to 50, 128 and 256 MB in the configuration files of both HDFS versions, running CBR and HEBR; a 100MB or 120MB file is thus partitioned into 2 or 3 blocks. The interval was set to 1 minute. DataNodes running HEBR collect and send significantly fewer blocks on average. The maximum number of blocks sent in a single block report to the NameNode by the DataNodes running HEBR was also much lower than for the DataNodes running CBR, as shown in Table 7. Figure 2 shows results from the experiment with a randomly chosen block size for each file size.

7.2 Observation and Remarks

Table 10 shows the percentage reduction in the number of blocks sent from DataNodes running HEBR in the experiments conducted with the random-writer program. These experiments produced results similar to those of the word-count benchmark: the average number of blocks sent to the NameNode from DataNodes running HEBR is lower than with CBR in every experimental setting. The results suggest that HEBR reduces the number of blocks reported to the NameNode by up to 98%, and by as much as 52% even in the most conservative results.

8 BENCHMARK 3: APIRANDOMWRITE

Similar experiments were conducted using the APIRandomWrite program. Since this program does not use the map-reduce paradigm to accomplish its task, the processing time for writing files onto the disks of the DataNodes is negligible. Experiments were run on single-node, 3-node and 4-node clusters, varying the parameters in the same order as in the previous benchmarks. We evaluated HEBR when both large files (120MB) and small files (926KB) were uploaded onto HDFS. We present the results from one of the experiments (Different-Intervals).

Fig. 1. Average and Maximum Number of Blocks Sent to the NameNode in Different-Intervals, Running Word-Count on the 3-node Cluster.

TABLE 7: Average and Maximum Number of Blocks Sent to the NameNode in Different-Block-Sizes, Running the Random-Writer Benchmark on the 3-node Cluster (file sizes 150KB, 926KB, 100MB and 120MB; block sizes 50MB, 128MB and 256MB; per node, with CBR/HEBR averages, maxima and percentage reduction).

This experiment uploaded files of different sizes, with both block sizes and intervals varied in the configuration files, on the 3-node and 4-node clusters. It is the most significant of all the experiments, as it is a superset of the other two.

8.1 Different-Intervals on 3-node Cluster

To allow a valid comparison, the same file size (926KB) was uploaded for both word-count and APIRandomWrite; as an additional validation, larger files (in MB) were uploaded using APIRandomWrite. A batch file with instructions to upload 50 files each of 150KB, 926KB, 100MB and 120MB was used as input to the benchmark program. For each file size, the block size was varied from the default (128MB) to 256MB and 50MB, and for every block size the interval was varied from 1 minute to 3 or 5 minutes. The experiment has rich results; however, we present only a small portion of them (for the two larger files, one interval for each block size), with the entire results presented in Appendix B. Table 8 and Figure 3 show the results.

8.2 Different-Intervals on 4-node Cluster

We explored the advantages of HEBR on a larger cluster of four nodes by repeating the experiments that were run on the single-node and 3-node clusters. We upload files of the same sizes as in the previous experiment, partitioned into blocks based on three different block sizes: 128MB, 256MB and 512MB. Since 3 minutes was determined to be the optimal interval, the 5-minute interval was not tested on the 4-node cluster; thus, for each experiment set, the intervals were varied between 1 and 3 minutes. Only a portion of the results of the different trials is presented here, with details in Appendix C. Results for a randomly chosen interval for each pair of file size and block size are shown in Table 9. This experiment again illustrates that HEBR sends a much smaller number of blocks on average and saves processing time at the NameNode. The files are written at different rates on every DataNode, and thus the maximum number of blocks sent by each of them differs. The results conclusively prove the advantages of using HEBR. We present complete results from the experiment in which 50 files of 180MB were uploaded in Figure 4.

8.3 Observation and Remarks

We consolidate the results from all the experiments conducted using APIRandomWrite, as percentages indicating the decrease in the number of blocks sent to the NameNode when HEBR is used, in Table 11. These experiments also produced results similar to those of the word-count and random-writer benchmarks: the average number of blocks sent to the NameNode from DataNodes running HEBR is lower than with CBR in every experimental setting. The results suggest that HEBR reduces the number of blocks reported to the NameNode by up to 95%, and by as much as 62% even in the most conservative results.

TABLE 10: Percentage Reduction in the Number of Blocks Sent from DataNodes Running HEBR, Random-Writer Benchmark (per cluster size and experiment).

TABLE 11: Percentage Reduction in the Number of Blocks Sent from DataNodes Running HEBR, APIRandomWrite Benchmark (per cluster size and experiment).

Fig. 2. Average and Maximum Number of Blocks Sent to the NameNode in Different-Block-Sizes, Running the Random-Writer Benchmark on the 3-node Cluster.

TABLE 8: Average and Maximum Number of Blocks Sent to the NameNode in Different-Intervals, Running the APIRandomWrite Benchmark on the 3-node Cluster (file sizes 100MB and 120MB; block sizes 50MB, 128MB and 256MB; per interval and node, with CBR/HEBR averages, maxima and percentage reduction).

Fig. 3. Average and Maximum Number of Blocks Sent to the NameNode in Different-Intervals, Running the APIRandomWrite Benchmark on the 3-node Cluster.

TABLE 9: Average and Maximum Number of Blocks Sent to the NameNode in Different-Intervals, Running the APIRandomWrite Benchmark on the 4-node Cluster (file sizes 100MB, 120MB and 180MB; per block size, interval and node, with CBR/HEBR averages, maxima and percentage reduction).

Fig. 4. Average and Maximum Number of Blocks Sent to the NameNode in Different-Intervals, Running the APIRandomWrite Benchmark on the 4-node Cluster, Uploading 50 Files of 180MB.

9 CONCLUSION

To obtain meta data information, the current implementation of HDFS scans the disks of the DataNodes, collecting all the blocks irrespective of when they were written to disk, and dumps them into a list known as a block report. This block reporting scheme increases the computation load on the NameNode and the network bandwidth usage in a Hadoop cluster, as many unnecessary, identical block reports are processed. This paper presents a new and much more efficient block reporting scheme called HEBR (High Efficiency Block Reporting). In HEBR, a full block report containing all the blocks hosted on a DataNode is sent only once, at the very first registration; subsequent reports contain only the blocks that were written between two consecutive reports. The current implementation does not report blocks that are deleted from the DataNodes, but HEBR does. When a large number of blocks are deleted from the disks, it is possible for an HEBR report to contain more blocks than a full block report; in such cases, a full block report is sent, ensuring that HEBR always performs better. HEBR has been evaluated using three workload benchmarks on various cluster configurations. All our experiments show that HEBR achieves much better performance than the current block reporting scheme: at its best, HEBR reduces the number of blocks sent to the NameNode by about 97.52%, and at its worst by about 59.78%, compared to the current implementation. Since HEBR does not change the overall architecture of HDFS, it can easily be integrated into existing Hadoop systems. The experiments conducted so far give us confidence that, when these changes are deployed in large-scale clusters, they will lead to even more significant improvements in efficiency and performance, directly impacting the bottom line of business enterprises using Hadoop. We anticipate its adoption by the Hadoop community.

APPENDIX A
STATISTICS FROM DIFFERENT-INTERVALS RUNNING THE WORD-COUNT BENCHMARK ON THE 3-NODE CLUSTER

We present the results from experiments in which four files of different sizes were partitioned into blocks and uploaded onto the file system. We re-adjust the intervals from 1 minute to 3 and 5 minutes for every experiment set in the configuration files of HDFS running both block reporting schemes (CBR and HEBR). The average and maximum number of blocks sent from all three DataNodes in every experiment set are presented in Tables 12 and 13.

APPENDIX B
STATISTICS FROM DIFFERENT-INTERVALS RUNNING THE APIRANDOMWRITE BENCHMARK ON THE 3-NODE CLUSTER

We present the entire results from experiments in which files of four different sizes, partitioned into blocks based on three different block sizes, were uploaded onto the file system. We adjust the intervals from 1 minute to 3 and 5 minutes for every experiment set in the configuration files of HDFS running both block reporting schemes (CBR and HEBR). The average and maximum number of blocks sent from all three DataNodes in every experiment set are presented in Table 14.

APPENDIX C
STATISTICS FROM DIFFERENT-INTERVALS RUNNING THE APIRANDOMWRITE BENCHMARK ON THE 4-NODE CLUSTER

We present the entire results from experiments in which four files of different sizes, partitioned into blocks based on three different block sizes, were uploaded onto the file system. We adjust the intervals from 1 minute to 3 and 5 minutes for every experiment set in the configuration files of HDFS running CBR and HEBR. The average and maximum number of blocks sent from all four DataNodes in every experiment set are presented in Table 15.

TABLE 12: Average and Maximum Number of Blocks Sent to the NameNode in Different-Intervals, Running the Word-Count Benchmark on the 3-node Cluster, Part 1 (file sizes 708KB and 825KB; per block size, interval and node, with CBR/HEBR averages, maxima and percentage reduction).

REFERENCES
[1] H. Weatherspoon and J. D. Kubiatowicz, Erasure Coding vs. Replication: A Quantitative Comparison, IPTPS.
[2] D. Ismail and S. Harris, Performance Comparison of Big Data Analysis using Hadoop in Physical and Virtual Servers.
[3] T. Ivanov, N. Korfiatis and R. V. Zicari, On the Inequality of the 3V's of Big Data Architectural Paradigms: A Case for Heterogeneity, CoRR.
[4] P. Zikopoulos, C. Eaton, D. DeRoos, T. Deutsch and G. Lapis, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill.
[5] B. Dhruba, The Hadoop Distributed File System: Architecture and Design, Apache Hadoop Documentation.
[6] C. Fay et al., Bigtable: A Distributed Storage System for Structured Data, TOCS.
[7] K. V. Shvachko and A. C. Murthy, Scaling Hadoop to 4000 Nodes at Yahoo!, Yahoo! Developer Network Blog.
[8] O. O'Malley and A. C. Murthy, Hadoop Sorts a Petabyte in Hours and a Terabyte in 62 Seconds, Yahoo! Developer Network Blog.
[9] K. V. Shvachko, HDFS Scalability: The Limits to Growth, Hadoop Wiki.
[10] J. Shafer, A Storage Architecture for Data-Intensive Computing, Ph.D. thesis, Rice University.
[11] Performance Measurement of a Hadoop Cluster.
[12] Performance Evaluation of Read and Write Operations in Hadoop Distributed File System.
[13] Hadoop Performance Evaluation.
[14] W. Tantisiriroj, S. Patil, G. Gibson, S. W. Son, S. J. Lang, and R. B. Ross, On the Duality of Data-Intensive File System Design: Reconciling HDFS and PVFS, SC11.
[15] B. Nicolae, D. Moise, G. Antoniu, L. Bougé, and M. Dorier, BlobSeer: Bringing High Throughput under Heavy Concurrency to Hadoop Map/Reduce Applications, IPDPS.
[16] Hadoop Scalability and Performance Testing in Heterogeneous Clusters.
[17] Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters.
[18] A Framework for Performance Analysis and Tuning in Hadoop Based Clusters.
[19] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker, A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD.
[20] T. White, HDFS Reliability, 2009.
[21] M. Foley, Consider Redesign of Block Report Processing.
[22] K. V. Shvachko, Apache Hadoop: The Scalability Updates, Hadoop Wiki.
[23] M. B. Alam, M. Hasan and Md. K. Uddin, A New HDFS Structure Model to Evaluate the Performance of Word Count Application on Different File Size.
[24]

TABLE 13: Average and Maximum Number of Blocks Sent to the NameNode in Different-Intervals, Running the Word-Count Benchmark on the 3-node Cluster, Part 2 (file sizes 860KB and 926KB; per block size, interval and node, with CBR/HEBR averages, maxima and percentage reduction).

TABLE 14: Average and Maximum Number of Blocks Sent to the NameNode in Different-Intervals, Running the APIRandomWrite Benchmark on the 3-node Cluster (per file size, block size, interval and node, with CBR/HEBR averages, maxima and percentage reduction).


More information

2/26/2017. The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012): The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to

More information

CSE 124: Networked Services Fall 2009 Lecture-19

CSE 124: Networked Services Fall 2009 Lecture-19 CSE 124: Networked Services Fall 2009 Lecture-19 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa09/cse124 Some of these slides are adapted from various sources/individuals including but

More information

Sinbad. Leveraging Endpoint Flexibility in Data-Intensive Clusters. Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica. UC Berkeley

Sinbad. Leveraging Endpoint Flexibility in Data-Intensive Clusters. Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica. UC Berkeley Sinbad Leveraging Endpoint Flexibility in Data-Intensive Clusters Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica UC Berkeley Communication is Crucial for Analytics at Scale Performance Facebook analytics

More information

Staggeringly Large File Systems. Presented by Haoyan Geng

Staggeringly Large File Systems. Presented by Haoyan Geng Staggeringly Large File Systems Presented by Haoyan Geng Large-scale File Systems How Large? Google s file system in 2009 (Jeff Dean, LADIS 09) - 200+ clusters - Thousands of machines per cluster - Pools

More information

Hadoop An Overview. - Socrates CCDH

Hadoop An Overview. - Socrates CCDH Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected

More information

Data Analysis Using MapReduce in Hadoop Environment

Data Analysis Using MapReduce in Hadoop Environment Data Analysis Using MapReduce in Hadoop Environment Muhammad Khairul Rijal Muhammad*, Saiful Adli Ismail, Mohd Nazri Kama, Othman Mohd Yusop, Azri Azmi Advanced Informatics School (UTM AIS), Universiti

More information

Top 25 Big Data Interview Questions And Answers

Top 25 Big Data Interview Questions And Answers Top 25 Big Data Interview Questions And Answers By: Neeru Jain - Big Data The era of big data has just begun. With more companies inclined towards big data to run their operations, the demand for talent

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

MapReduce, Hadoop and Spark. Bompotas Agorakis

MapReduce, Hadoop and Spark. Bompotas Agorakis MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)

More information

HADOOP 3.0 is here! Dr. Sandeep Deshmukh Sadepach Labs Pvt. Ltd. - Let us grow together!

HADOOP 3.0 is here! Dr. Sandeep Deshmukh Sadepach Labs Pvt. Ltd. - Let us grow together! HADOOP 3.0 is here! Dr. Sandeep Deshmukh sandeep@sadepach.com Sadepach Labs Pvt. Ltd. - Let us grow together! About me BE from VNIT Nagpur, MTech+PhD from IIT Bombay Worked with Persistent Systems - Life

More information

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja

More information

Spark Over RDMA: Accelerate Big Data SC Asia 2018 Ido Shamay Mellanox Technologies

Spark Over RDMA: Accelerate Big Data SC Asia 2018 Ido Shamay Mellanox Technologies Spark Over RDMA: Accelerate Big Data SC Asia 2018 Ido Shamay 1 Apache Spark - Intro Spark within the Big Data ecosystem Data Sources Data Acquisition / ETL Data Storage Data Analysis / ML Serving 3 Apache

More information

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09 Presented by: Daniel Isaacs It all starts with cluster computing. MapReduce Why

More information

Decentralized Distributed Storage System for Big Data

Decentralized Distributed Storage System for Big Data Decentralized Distributed Storage System for Big Presenter: Wei Xie -Intensive Scalable Computing Laboratory(DISCL) Computer Science Department Texas Tech University Outline Trends in Big and Cloud Storage

More information

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2017

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2017 Distributed Systems 15. Distributed File Systems Paul Krzyzanowski Rutgers University Fall 2017 1 Google Chubby ( Apache Zookeeper) 2 Chubby Distributed lock service + simple fault-tolerant file system

More information

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? Simple to start What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? What is the maximum download speed you get? Simple computation

More information

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic WHITE PAPER Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

The Google File System. Alexandru Costan

The Google File System. Alexandru Costan 1 The Google File System Alexandru Costan Actions on Big Data 2 Storage Analysis Acquisition Handling the data stream Data structured unstructured semi-structured Results Transactions Outline File systems

More information

A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS

A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS A Plugin-based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS Adithya Bhat, Nusrat Islam, Xiaoyi Lu, Md. Wasi- ur- Rahman, Dip: Shankar, and Dhabaleswar K. (DK) Panda Network- Based Compu2ng

More information

Improved MapReduce k-means Clustering Algorithm with Combiner

Improved MapReduce k-means Clustering Algorithm with Combiner 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation Improved MapReduce k-means Clustering Algorithm with Combiner Prajesh P Anchalia Department Of Computer Science and Engineering

More information

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ. Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing ΕΠΛ 602:Foundations of Internet Technologies Cloud Computing 1 Outline Bigtable(data component of cloud) Web search basedonch13of thewebdatabook 2 What is Cloud Computing? ACloudis an infrastructure, transparent

More information

BIG DATA TESTING: A UNIFIED VIEW

BIG DATA TESTING: A UNIFIED VIEW http://core.ecu.edu/strg BIG DATA TESTING: A UNIFIED VIEW BY NAM THAI ECU, Computer Science Department, March 16, 2016 2/30 PRESENTATION CONTENT 1. Overview of Big Data A. 5 V s of Big Data B. Data generation

More information

High Performance Computing on MapReduce Programming Framework

High Performance Computing on MapReduce Programming Framework International Journal of Private Cloud Computing Environment and Management Vol. 2, No. 1, (2015), pp. 27-32 http://dx.doi.org/10.21742/ijpccem.2015.2.1.04 High Performance Computing on MapReduce Programming

More information

CSE 124: Networked Services Lecture-17

CSE 124: Networked Services Lecture-17 Fall 2010 CSE 124: Networked Services Lecture-17 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa10/cse124 11/30/2010 CSE 124 Networked Services Fall 2010 1 Updates PlanetLab experiments

More information

WHITEPAPER. Improve Hadoop Performance with Memblaze PBlaze SSD

WHITEPAPER. Improve Hadoop Performance with Memblaze PBlaze SSD Improve Hadoop Performance with Memblaze PBlaze SSD Improve Hadoop Performance with Memblaze PBlaze SSD Exclusive Summary We live in the data age. It s not easy to measure the total volume of data stored

More information

CS /30/17. Paul Krzyzanowski 1. Google Chubby ( Apache Zookeeper) Distributed Systems. Chubby. Chubby Deployment.

CS /30/17. Paul Krzyzanowski 1. Google Chubby ( Apache Zookeeper) Distributed Systems. Chubby. Chubby Deployment. Distributed Systems 15. Distributed File Systems Google ( Apache Zookeeper) Paul Krzyzanowski Rutgers University Fall 2017 1 2 Distributed lock service + simple fault-tolerant file system Deployment Client

More information

CS60021: Scalable Data Mining. Sourangshu Bhattacharya

CS60021: Scalable Data Mining. Sourangshu Bhattacharya CS60021: Scalable Data Mining Sourangshu Bhattacharya In this Lecture: Outline: HDFS Motivation HDFS User commands HDFS System architecture HDFS Implementation details Sourangshu Bhattacharya Computer

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423

More information

PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP

PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP ISSN: 0976-2876 (Print) ISSN: 2250-0138 (Online) PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP T. S. NISHA a1 AND K. SATYANARAYAN REDDY b a Department of CSE, Cambridge

More information

Performance Enhancement of Data Processing using Multiple Intelligent Cache in Hadoop

Performance Enhancement of Data Processing using Multiple Intelligent Cache in Hadoop Performance Enhancement of Data Processing using Multiple Intelligent Cache in Hadoop K. Senthilkumar PG Scholar Department of Computer Science and Engineering SRM University, Chennai, Tamilnadu, India

More information

Enhanced Hadoop with Search and MapReduce Concurrency Optimization

Enhanced Hadoop with Search and MapReduce Concurrency Optimization Volume 114 No. 12 2017, 323-331 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu Enhanced Hadoop with Search and MapReduce Concurrency Optimization

More information

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2016

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2016 Distributed Systems 15. Distributed File Systems Paul Krzyzanowski Rutgers University Fall 2016 1 Google Chubby 2 Chubby Distributed lock service + simple fault-tolerant file system Interfaces File access

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2013/14 MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Cluster File

More information

Dynamic Data Placement Strategy in MapReduce-styled Data Processing Platform Hua-Ci WANG 1,a,*, Cai CHEN 2,b,*, Yi LIANG 3,c

Dynamic Data Placement Strategy in MapReduce-styled Data Processing Platform Hua-Ci WANG 1,a,*, Cai CHEN 2,b,*, Yi LIANG 3,c 2016 Joint International Conference on Service Science, Management and Engineering (SSME 2016) and International Conference on Information Science and Technology (IST 2016) ISBN: 978-1-60595-379-3 Dynamic

More information

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud HiTune Dataflow-Based Performance Analysis for Big Data Cloud Jinquan (Jason) Dai, Jie Huang, Shengsheng Huang, Bo Huang, Yan Liu Intel Asia-Pacific Research and Development Ltd Shanghai, China, 200241

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela

More information

CISC 7610 Lecture 2b The beginnings of NoSQL

CISC 7610 Lecture 2b The beginnings of NoSQL CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone

More information

Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R

Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R Table of Contents Introduction... 3 Topology Awareness in Hadoop... 3 Virtual Hadoop... 4 HVE Solution... 5 Architecture...

More information

10 Million Smart Meter Data with Apache HBase

10 Million Smart Meter Data with Apache HBase 10 Million Smart Meter Data with Apache HBase 5/31/2017 OSS Solution Center Hitachi, Ltd. Masahiro Ito OSS Summit Japan 2017 Who am I? Masahiro Ito ( 伊藤雅博 ) Software Engineer at Hitachi, Ltd. Focus on

More information

Google File System and BigTable. and tiny bits of HDFS (Hadoop File System) and Chubby. Not in textbook; additional information

Google File System and BigTable. and tiny bits of HDFS (Hadoop File System) and Chubby. Not in textbook; additional information Subject 10 Fall 2015 Google File System and BigTable and tiny bits of HDFS (Hadoop File System) and Chubby Not in textbook; additional information Disclaimer: These abbreviated notes DO NOT substitute

More information

A Survey on Big Data

A Survey on Big Data A Survey on Big Data D.Prudhvi 1, D.Jaswitha 2, B. Mounika 3, Monika Bagal 4 1 2 3 4 B.Tech Final Year, CSE, Dadi Institute of Engineering & Technology,Andhra Pradesh,INDIA ---------------------------------------------------------------------***---------------------------------------------------------------------

More information

Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads

Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads WHITE PAPER Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads December 2014 Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents

More information

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc Fuxi Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc {jiamang.wang, yongjun.wyj, hua.caihua, zhipeng.tzp, zhiqiang.lv,

More information

Cloud Computing CS

Cloud Computing CS Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung SOSP 2003 presented by Kun Suo Outline GFS Background, Concepts and Key words Example of GFS Operations Some optimizations in

More information

DiskReduce: Making Room for More Data on DISCs. Wittawat Tantisiriroj

DiskReduce: Making Room for More Data on DISCs. Wittawat Tantisiriroj DiskReduce: Making Room for More Data on DISCs Wittawat Tantisiriroj Lin Xiao, Bin Fan, and Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University GFS/HDFS Triplication GFS & HDFS triplicate

More information

Jumbo: Beyond MapReduce for Workload Balancing

Jumbo: Beyond MapReduce for Workload Balancing Jumbo: Beyond Reduce for Workload Balancing Sven Groot Supervised by Masaru Kitsuregawa Institute of Industrial Science, The University of Tokyo 4-6-1 Komaba Meguro-ku, Tokyo 153-8505, Japan sgroot@tkl.iis.u-tokyo.ac.jp

More information

MapReduce & HyperDex. Kathleen Durant PhD Lecture 21 CS 3200 Northeastern University

MapReduce & HyperDex. Kathleen Durant PhD Lecture 21 CS 3200 Northeastern University MapReduce & HyperDex Kathleen Durant PhD Lecture 21 CS 3200 Northeastern University 1 Distributing Processing Mantra Scale out, not up. Assume failures are common. Move processing to the data. Process

More information

Introduction to MapReduce. Instructor: Dr. Weikuan Yu Computer Sci. & Software Eng.

Introduction to MapReduce. Instructor: Dr. Weikuan Yu Computer Sci. & Software Eng. Introduction to MapReduce Instructor: Dr. Weikuan Yu Computer Sci. & Software Eng. Before MapReduce Large scale data processing was difficult! Managing hundreds or thousands of processors Managing parallelization

More information

Apache Flink: Distributed Stream Data Processing

Apache Flink: Distributed Stream Data Processing Apache Flink: Distributed Stream Data Processing K.M.J. Jacobs CERN, Geneva, Switzerland 1 Introduction The amount of data is growing significantly over the past few years. Therefore, the need for distributed

More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information

Apache Hadoop 3. Balazs Gaspar Sales Engineer CEE & CIS Cloudera, Inc. All rights reserved.

Apache Hadoop 3. Balazs Gaspar Sales Engineer CEE & CIS Cloudera, Inc. All rights reserved. Apache Hadoop 3 Balazs Gaspar Sales Engineer CEE & CIS balazs@cloudera.com 1 We believe data can make what is impossible today, possible tomorrow 2 We empower people to transform complex data into clear

More information

The Google File System

The Google File System The Google File System By Ghemawat, Gobioff and Leung Outline Overview Assumption Design of GFS System Interactions Master Operations Fault Tolerance Measurements Overview GFS: Scalable distributed file

More information