Improving Small File Management in Hadoop
VOLUME 5 NO 4, 2017

O. Achandair, M. Elmahouti, S. Khoulji, M.L. Kerkeb
Information System Engineering Research Group, Abdelmalek Essaadi University, National School of Applied Sciences, Tetouan, Morocco.
o.achandair@gmail.com, mouradelmahouti@gmail.com, khouljisamira@gmail.com, kerkebml@gmail.com, acmlis.conference@gmail.com

ABSTRACT

Hadoop, considered nowadays the de facto platform for managing big data, has revolutionized the way customers manage their data. As an open source implementation of MapReduce, it was designed to offer high scalability and availability across clusters of thousands of machines. Through its two principal components, HDFS for distributed storage and MapReduce as the distributed processing engine, companies and research projects derive great benefit from its capabilities. However, Hadoop was designed to handle large files, so when it comes to a large number of small files, performance can be heavily degraded. The small file problem has been well defined by researchers and the Hadoop community, but most of the proposed approaches only deal with the pressure placed on the NameNode memory. Certainly, grouping small files into the various formats supported by current Hadoop distributions reduces metadata entries and alleviates the memory limitation, but that is only part of the equation. The real impact organizations need to address when dealing with many small files is cluster performance when those files are processed in Hadoop clusters. In this paper, we propose a strategy to use one of the common solutions, grouping files in the MapFile format, more efficiently. The core idea is to organize small files by specific attributes in MapFile output files, and to use prefetching and caching mechanisms during read access.
This leads to fewer metadata requests to the NameNode and better I/O performance during MapReduce jobs. The experimental results show that this approach helps achieve better access times when the cluster contains a massive number of small files.

Keywords: Cloud, Hadoop, HDFS, Small Files, SequenceFile, MapFile

Transactions on Machine Learning and Artificial Intelligence, Vol 5 No 4, August 2017. DOI: /tmlai. Publication Date: 15th August, 2017. Society for Science and Education, United Kingdom.

1 Introduction

Apache Hadoop is an open source framework, initiated by Yahoo!, developed for storing and processing data at large scale on clusters of thousands of commodity machines. It offers a reliable, scalable, and fault tolerant platform, deployed by many big companies such as Facebook, Google, IBM, Twitter and others to manage terabytes to petabytes of data. Hadoop was designed to handle large files, especially as traditional systems face limitations in analyzing the new data dimension caused by the present data explosion. However, large files are not the
only kind of files that Hadoop deployments need to manage; diversity of data kinds is becoming the standard in most big data platforms. Many fields continuously produce tremendous numbers of small files, such as multimedia data mining [6], astronomy [7], meteorology [8], signal recognition [9], climatology [10,11], energy, and e-learning [12], where the numbers of small files range from millions to billions. For instance, Facebook has stored more than 260 billion images [13]. In biology, the human genome generates up to 30 million files averaging 190 KB [14].

1.1 HDFS Storage Layer

HDFS, the Hadoop Distributed File System, provides high reliability, scalability and fault tolerance. It is designed to be deployed on big clusters of commodity hardware, and is based on a master-slave architecture, with the NameNode as master and the DataNodes as slaves. The NameNode is responsible for managing the file system namespace; it keeps track of files during creation, deletion and replication [3], and manages all the related metadata [4] in the server's memory. The NameNode splits files into blocks and sends the write requests to be performed locally by DataNodes. To ensure fault tolerance, block replicas are pipelined across a list of DataNodes. This architecture, shown in Fig. 1, with only one single NameNode, simplifies the HDFS model, but it can cause memory overhead and reduce file access efficiency when dealing with a high rate of small files.

Figure 1: HDFS Architecture

1.2 MapReduce Processing Engine

In the current version of Hadoop, the processing engine was re-architected to be more suitable for the needs of most big data applications.
The major improvement was the introduction of a resource management module, called YARN, independent of the processing layer. This brought significant performance improvements, offered the ability to support additional processing models, and provided a more flexible execution engine. Because resource management is decoupled from processing, existing MapReduce applications can run on the YARN infrastructure without any changes. The MapReduce program execution on YARN can be described as follows:

i. A user defines an application by submitting its configuration to the resource manager.
ii. The resource manager allocates a container for the application master.
iii. The resource manager submits the request to the concerned node manager.
iv. The node manager launches the application master container.
v. The application master is updated continuously by the node managers and monitors the progress of tasks.

When all the tasks are done, the application master unregisters from the resource manager, so that the container can be allocated again (see Fig. 2).
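The lifecycle above can be mimicked with a toy Java model. This is emphatically not the YARN API; every class and method name here is our own simplification of steps i-v.

```java
// Toy model of the YARN submission flow (steps i-v); NOT the real YARN API.
class ResourceManager {
    private int freeContainers;
    private int nextId = 0;
    ResourceManager(int capacity) { freeContainers = capacity; }
    // ii. allocate a container (for the application master or a task)
    synchronized Integer allocateContainer() {
        if (freeContainers == 0) return null; // cluster is full
        freeContainers--;
        return nextId++;
    }
    // once the application master unregisters, its container is reusable
    synchronized void unregister() { freeContainers++; }
    synchronized int available() { return freeContainers; }
}

class NodeManager {
    // iv. launch the container locally
    String launch(int containerId, String what) {
        return "container " + containerId + " running " + what;
    }
}

public class YarnFlowToy {
    public static void main(String[] args) {
        ResourceManager rm = new ResourceManager(2);
        NodeManager nm = new NodeManager();
        // i. the user submits an application to the resource manager
        String app = "wordcount";
        // ii-iv. a container is allocated and the application master launched in it
        Integer amContainer = rm.allocateContainer();
        System.out.println(nm.launch(amContainer, app + " application master"));
        // v. the application master monitors tasks ... when done, it unregisters:
        rm.unregister();
        System.out.println("free containers: " + rm.available());
    }
}
```

The point of the toy is the container lifecycle: capacity is consumed at allocation and only returned when the application master unregisters.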
Figure 2: YARN (Yet Another Resource Negotiator)

The rest of this paper is organized as follows: Section 2 lists the existing solutions in the related literature; Section 3 presents the proposed approach; Section 4 covers our experimental work and results; finally, Section 5 concludes and outlines future work.

2 Related Work

To deal with the small file problem, researchers have proposed numerous approaches. Some of these efforts have been adopted by Hadoop and are available natively, namely Hadoop Archives (HAR files) and SequenceFile.

Figure 3: HAR File Layout

Figure 4: SequenceFile File Layout

A Hadoop Archive packs small files into a large file in such a way that the original files can still be accessed transparently (see Fig. 3). This technique improves storage efficiency, as only the metadata of the archive is recorded in the namespace of the NameNode, but it does not resolve other constraints in terms of reading performance. Also, an archive cannot be appended to when adding more small files. The SequenceFile technique merges a group of small files into a flat file as key-value pairs, where the key is the file's metadata and the value is its content (see Fig. 4).
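To make the key-value layout concrete, the following self-contained Java sketch (ours, not Hadoop's SequenceFile API) packs small files into a single container of length-prefixed (name, content) records. Lookups must scan sequentially, which is precisely the random-read weakness of plain SequenceFiles that MapFile's index addresses.

```java
// Illustrative sketch (not the Hadoop API): merge many small files into one
// container of (filename, content) records, the idea behind SequenceFile.
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class TinySequenceFile {
    // Append each small file as a length-prefixed (key, value) record.
    static void merge(List<Path> smallFiles, Path out) throws IOException {
        try (DataOutputStream dos = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(out)))) {
            for (Path p : smallFiles) {
                byte[] key = p.getFileName().toString().getBytes(StandardCharsets.UTF_8);
                byte[] value = Files.readAllBytes(p);
                dos.writeInt(key.length);
                dos.write(key);
                dos.writeInt(value.length);
                dos.write(value);
            }
        }
    }

    // Sequential scan, as in a plain SequenceFile: no per-file index, O(n) lookup.
    static byte[] read(Path container, String name) throws IOException {
        try (DataInputStream dis = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(container)))) {
            while (dis.available() > 0) {
                byte[] key = new byte[dis.readInt()];
                dis.readFully(key);
                byte[] value = new byte[dis.readInt()];
                dis.readFully(value);
                if (new String(key, StandardCharsets.UTF_8).equals(name)) return value;
            }
        }
        return null; // file name not present in the container
    }
}
```

Whatever the container format, the NameNode now tracks one merged file instead of thousands of entries, which is the common thread of all the approaches surveyed below.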
Unlike HAR files, SequenceFiles support compression, and they are more suitable for MapReduce tasks as they are splittable [15], so mappers can operate on chunks independently. However, converting into a SequenceFile can be time consuming, and it has poor performance during random read access. To improve metadata management, G. Mackey et al. [16] merged small files into a single larger file, using HAR through a MapReduce task. The small files are referenced with an added index layer (master index, index) delivered with the archive, to retain the separation and keep the original structure of files. C. Vorapongkitipun et al. [17] proposed an improved approach to the HAR technique, introducing a single index instead of the two-level indexes. Their new indexing mechanism aims to improve metadata management as well as file access performance without changing the implemented HDFS architecture. Patel A. et al. [18] proposed combining files using the SequenceFile method. Their approach reduces memory consumption on the NameNode, but it did not show how much the read and write performances are impacted. Y. Zhang et al. [19] proposed merging related small files for a WebGIS application, which improved storage efficiency and HDFS metadata management; however, the results are limited to that scenario. D. Dev et al. [20] proposed a modification of the existing HAR. They used a hashing concept based on SHA-256 as a key. This can improve the reliability and scalability of metadata management, and reading access time is greatly reduced, but it takes more time to create the NHAR archives compared to the HAR mechanism. P. Gohil et al.
[21] proposed a scheme for combining, merging and prefetching related small files, which improves the storage and access efficiency of small files, but it does not give an appropriate solution for independent small files.

3 The Proposed Approach for Small File Management

The core idea behind our approach is to combine different clients' files that contain sets of small files, when relevant, and merge them through a merger process, to be stored in an optimal way before closing the current SFA connection. This was achieved in our previous work (see Fig. 5) as the main task of the Small File Analyzer (SFA) server. In the current research, we improved the proposed SFA server with additional modules that let us consider other parameters during the merger process. We introduced a sorter process that can operate as an independent module in the SFA server. The compressor is another added module, which allows us to compress merged files that are no longer used, or rarely called, so that we gain considerable storage capacity. We also used a prefetching and caching technique to boost performance when reading the same small files.
Figure 5: SFA Architecture

Figure 6: SFA improved Architecture

For simplicity, the clients' files are stored on the local file system of the SFA server; connecting client streams directly to the merger is not covered in the scope of the current study. However, the same concept is intended to be maintained when extending this implementation. The clients' files are scanned through the Java "BasicFileAttributes" interface.
3.1 Merger

This process groups sets of files together based on the collected file system attributes, since for many applications' logic this makes sense when files are consumed for processing. For example, in a weather application, it may be interesting to combine files of the same period of each year. In e-learning, files are grouped by discipline, and so on. The SFA can also combine files based on a substring of the file name, which in many production cases is a hidden attribute of other properties like location, gender, or simply the type of source (station, sensor, satellite); see Fig. 7.

Figure 7: AWS dataset example - continuous record of Earth's land surface as seen from space

The predefined strategy of the merger is defined in the master configuration file. Common attributes such as date, size, owner, type and name are assigned coefficients according to the administrator's strategy. The coefficients are set per group of combined files (see Fig. 8).

Figure 8: SFA coefficient of attributes per combined queue

To maintain a flexible implementation, one can define a new attribute in a specified section of the master configuration file (e.g., sfa_cust1_att in Fig. 8); it is then possible to use it like the common predefined attributes. The assigned priority runs from 0 to 9, where 0 denotes an ignorable attribute. Finally, the list of files is identified in the corresponding queue of the SFA memory, and each file is assigned a value based on the priority-of-attributes strategy.

3.2 Sorter

This process is tightly coupled to the merger, consuming the list of files once it is ready. The sorter uses the values assigned to each file by the merger, and applies a sorting algorithm based on quicksort to generate the final ordered list.
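To illustrate how attribute coefficients can turn file metadata into the values the sorter consumes, the following Java sketch reads attributes through the standard BasicFileAttributes interface and combines bucketed attributes with weights. The chosen attributes, the bucketing, and the coefficient values are illustrative assumptions, not the exact configuration of the SFA.

```java
// Sketch of the merger's scoring step: scan file system attributes with
// java.nio's BasicFileAttributes and combine them with administrator-defined
// coefficients (0-9, 0 = ignored). Attributes and weights are illustrative.
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

public class MergerScore {
    // Coefficients as they might come from the master configuration file.
    static final int SIZE_COEF = 3, DATE_COEF = 5, NAME_COEF = 2;

    static long score(Path p) throws IOException {
        BasicFileAttributes attrs = Files.readAttributes(p, BasicFileAttributes.class);
        long sizeBucket = attrs.size() / 1024;                               // KB bucket
        long dayBucket  = attrs.lastModifiedTime().toMillis() / 86_400_000L; // day bucket
        String name = p.getFileName().toString();
        long nameBucket = name.isEmpty() ? 0 : name.charAt(0);               // e.g. source prefix
        // Files with close scores end up adjacent in the same queue.
        return SIZE_COEF * sizeBucket + DATE_COEF * dayBucket + NAME_COEF * nameBucket;
    }
}
```

Files written on the same day, with the same size bucket and name prefix, receive equal scores and are therefore grouped together in the corresponding queue.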
Quicksort is a well-known sorting algorithm [19] that performs very well in practice. It is based on divide-and-conquer logic, through the three principal phases described below. Quicksort sorts a random array of numbers, which in our case are the values generated for each file by the merger.
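Under one reading of our priority-based selection, the pivot of each range is simply its largest merger-assigned value. A compact, runnable Java sketch along these lines follows (method names are our own; note that a maximum pivot yields an unbalanced split, so we show it purely to mirror the described strategy):

```java
// Sketch of the sorter's quicksort over merger-assigned values. Instead of a
// random pivot, the largest value in the range is chosen, mirroring the
// priority-based selection described for the sorter.
import java.util.*;

public class PrioritySorter {
    static void quicksort(long[] s) { quicksort(s, 0, s.length - 1); }

    static void quicksort(long[] s, int lo, int hi) {
        if (lo >= hi) return;
        // Priority-based pivot: pick the highest value in s[lo..hi].
        int maxIdx = lo;
        for (int i = lo + 1; i <= hi; i++) if (s[i] > s[maxIdx]) maxIdx = i;
        swap(s, maxIdx, hi);                 // park the pivot at the end
        long pivot = s[hi];
        // Divide: elements smaller than the pivot move to the left part.
        int m = lo;
        for (int i = lo; i < hi; i++) if (s[i] < pivot) swap(s, i, m++);
        swap(s, m, hi);                      // pivot lands in its final position
        // Conquer: sort both subarrays recursively.
        quicksort(s, lo, m - 1);
        quicksort(s, m + 1, hi);
    }

    static void swap(long[] s, int i, int j) { long t = s[i]; s[i] = s[j]; s[j] = t; }
}
```

With a random or median-of-three pivot the same partition logic applies; only the pivot-selection loop changes.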
Consider an array S[1..k].

Phase I: Divide. The input array S[1..k] is divided into two subarrays, S[1..m-1] and S[m+1..k]. The subarrays are constructed so that every element of S[1..m-1] is smaller than or equal to S[m], and every element of S[m+1..k] is bigger than S[m]. This phase is about choosing the index m, whose element is called the pivot.

Phase II: Conquer. Each subarray, S[1..m-1] and S[m+1..k], is sorted through recursive calls of quicksort.

Phase III: Combine. Once the subarrays are sorted, they are combined to form the sorted array S[1..k].

Sorter algorithm, over a list of numbers S[1..k]:

    quicksort(S) =
      if |S| = 0 then S
      else
        let p   = pick a pivot from S
            S1  = { s in S | s < p }
            S2  = { s in S | s = p }
            S3  = { s in S | s > p }
            (S1', S3') = (quicksort(S1), quicksort(S3))
        in append(S1', append(S2, S3'))

For our implementation, the values assigned by the merger permit a priority-based selection technique: rather than picking the pivot randomly at the start, the element with the highest value is selected. This helps reduce the time spent choosing a good pivot. We should emphasize that the pivot selected in each created subarray is used only for dividing the elements into two subarrays; its position is not part of a comparison, but results from the last move of that comparison (see Fig. 9).

Figure 9: Run example of quicksort

The sorted list of files generated in this phase is stored on HDFS in the MapFile format, one of the formats supported by HDFS, introduced to reduce entries in the NameNode memory when dealing with many small files. Moreover, a MapFile is an indexed SequenceFile, which reduces access time compared to the sequential reads of a plain SequenceFile (see Fig. 10).

Figure 10: MapFile File Layout

3.3 Compressor

This process is not activated by default. To analyze how small files affect the Hadoop cluster, we import records from the FSimage of the NameNode, which contains a complete state of the HDFS namespace; we then aggregate the data into a reference handled by the SFA database (see Fig. 11).
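As an illustration of this reference-building step, the sketch below aggregates a delimited FSimage dump (such as Hadoop's offline image viewer can emit) into one of the reports described next, files smaller than 512 KB grouped by owner. The three-column layout (path, size in bytes, owner) is an assumption made for the sketch.

```java
// Sketch of building part of the SFA reference from a tab-delimited FSimage
// dump. The assumed columns are: path <TAB> size-in-bytes <TAB> owner.
import java.util.*;

public class SmallFileReport {
    static final long SMALL = 512 * 1024; // "small file" threshold: 512 KB

    // Count files below the threshold, grouped by owner.
    static Map<String, Integer> smallFilesByOwner(List<String> dumpLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : dumpLines) {
            String[] cols = line.split("\t");
            long size = Long.parseLong(cols[1]);
            if (size < SMALL) counts.merge(cols[2], 1, Integer::sum);
        }
        return counts;
    }
}
```

Grouping by path or by date is the same aggregation keyed on a different column.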
Figure 11: SFA reference steps

This reference is completely independent of the SFA server's other functionality, and can be built on a separate server if performance is impacted. It allows us to obtain very useful information and reports about the small files in our cluster, such as:

- Files < 512 KB
- Files < 512 KB, grouped by path
- Files < 512 KB, grouped by owner
- Files < 512 KB, grouped by date
- Number of blocks with used space < 60%
- Most called files per MapFile (hot small files)
- Files still being called after more than one month (per user, per path, per MapFile)

The compressor process compresses MapFile values when relevant. Scheduling this feature is not recommended, but it remains possible in this implementation, based on MapFile age, if needed.

3.4 File Prefetching and Caching

A prefetching and caching technique has been implemented in this study to improve MapReduce job performance when small files are processed. Prefetching reduces disk I/O overhead and improves access time, as data can be brought into the cache before being requested. When a small file is requested, the metadata for its MapFile is pulled from the NameNode, and the DataNodes that hold the related blocks are identified. However, if the metadata of that MapFile already exists in the local memory of the SFA server, no request is sent to the NameNode. It therefore becomes efficient to read small files with fewer requests sent to the NameNode. Also, at the DataNode level, a small file request is usually resolved through the MapFile that contains it; that format provides an index to determine the location of each small file through its key, and then its offset and length. Once the small file is located, the MapFile index is cached in the SFA memory, and the specific requested small files are cached in DataNode memory.
This helps access the contained files directly in subsequent calls. To avoid memory overhead on the DataNodes, an LRU algorithm is used to keep the most frequently called objects and manage future allocations efficiently [20]. Caching the selected objects on the DataNodes is achieved through mlock() and mmap(), because storing data off-heap does not affect garbage collection when the amount of cached data is large.

4 Experimental Evaluation

The experimental platform in this paper is a cluster of one NameNode, four DataNodes, and the SFA server. The NameNode is a server with a 3.10 GHz clock speed, 16 GB of RAM and a gigabit Ethernet NIC. Each DataNode offers a 500 GB hard disk, and they are deployed on Ubuntu. The SFA server has a 3.10 GHz clock speed and 16 GB of RAM, and is also deployed on Ubuntu. The replication factor is kept at the default value of 3, and the HDFS block size is 64 MB. The experimental datasets are standard auto-generated files. We use Hadoop with Java version 1.8.0_65.
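The replication factor and block size above correspond to the standard hdfs-site.xml properties shown below; this is a sketch of the relevant settings, not the authors' actual configuration file.

```xml
<configuration>
  <!-- Replication factor kept at the HDFS default of 3 -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- 64 MB block size used in the experiments (value in bytes) -->
  <property>
    <name>dfs.blocksize</name>
    <value>67108864</value>
  </property>
</configuration>
```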
To evaluate our approach, the same datasets are stored in three different ways. Initially, the datasets are uploaded directly to HDFS without grouping in any format. Then the datasets are uploaded into HDFS using Hadoop Archives (HAR format). Finally, we store files through the SFA server to compare the results of the previous strategies with our approach. The evaluation is based on memory usage and on the time cost of writing and reading small files. In the HAR case, original files are deleted at the end of HAR merging.

4.1 Comparison of the NameNode Memory Usage

Figure 12: NameNode memory usage

According to Fig. 12, when we store small files through the SFA or in HAR format, the NameNode memory consumption is much lower due to the reduced metadata entries, as each list of files in a MapFile or HAR file requires only the metadata of the whole merged file.

4.2 Comparison of the Required Time for Storage

To compare the required time for storage, we uploaded datasets of increasing numbers of small files, starting from 1000, in the three ways described for the memory usage evaluation.

Figure 13: Comparison of the required time for storage
According to Fig. 13, the SFA greatly outperforms native HDFS and HAR. This is because combining clients' files and aggregating attributes allows better block allocations and reduces the NameNode requests needed for each client's write request.

4.3 Comparison of the Random and Sequential Reading Time of Small Files

To evaluate reading time, we randomly selected different numbers of small files from each previous upload case (HDFS, HAR and SFA). We downloaded 200, 500, 1000, 1500, 2000, 4000, 6000 and 8000 files at random from the total amount of uploaded files in each case, and measured the time required for each download operation.

Figure 14: Comparison of the reading time of random small files

Figure 15: Comparison of the reading time of sequential small files
Figure 16: Comparison of the reading time of random small files, repeated 3 times

According to Fig. 14 and Fig. 15, when reading randomly selected files, the SFA reduces access time by 24% and 13% compared to HDFS and HAR respectively. When reading sequential lists of files, the SFA outperforms HDFS but remains a bit slower than HAR. The reason is that a larger number of files can increase the number of seek operations on DataNodes through the index structure of MapFiles, which can result in higher reading latency. For the SFA, reading random files yields better access efficiency than reading sequential lists of files. Moreover, according to Fig. 16, random access time improves after the second and third calls of the same sets of small files, which clearly shows the efficiency of the prefetching and caching mechanism on the DataNodes and the SFA server.

5 Conclusion and Future Works

In this paper, we focus on improving cluster performance while processing small files. The SFA server is used to combine clients' small files into MapFiles. This technique is improved in the global approach, which consists of grouping elements in a specific order. Sorting small files by categories, according to the application logic, is a promising concept that can improve read access time. Memory usage has been addressed as the main problem in much previous research, while in this study we consider it only one part of the small file problem equation. Our approach addresses how the distribution of small files within a specific format influences cluster performance.
Different strategies are now adopted in Hadoop to solve the small file problem, but there is a lack of standardization, as most solutions remain useful in specific environments but not in others. Offering a system to analyze different aspects of the small file problem can help organizations better understand the real factors that control the impact of their datasets.

REFERENCES

[1] B. White, T. Yeh, J. Lin, and L. Davis, Web scale computer vision using mapreduce for multimedia data mining, in Proceedings of the Tenth International Workshop on Multimedia Data Mining. ACM, 2010, p. 9.
[2] K. Wiley, A. Connolly, J. Gardner, S. Krughoff, M. Balazinska, B. Howe, Y. Kwon, and Y. Bu, Astronomy in the cloud: using mapreduce for image co-addition, Astronomy, vol. 123, no. 901.
[3] W. Fang, V. Sheng, X. Wen, and W. Pan, Meteorological data analysis using mapreduce, The Scientific World Journal, vol. 2014.
[4] F. Wang and M. Liao, A map-reduce based fast speaker recognition, in Information, Communications and Signal Processing (ICICS), International Conference on. IEEE, 2013.
[5] K. P. Ajay, K. C. Gouda, H. R. Nagesh, A Study for Handling of High Performance Climate Data using Hadoop, Proceedings of the International Conference, April.
[6] D. Q. Duffy, J. L. Schnase, J. H. Thompson, S. M. Freeman, and T. L. Clune, Preliminary evaluation of mapreduce for high performance climate data analysis.
[7] C. Shen, W. Lu, J. Wu, and B. Wei, A digital library architecture supporting massive small files and efficient replica maintenance, in Proceedings of the 10th Annual Joint Conference on Digital Libraries (JCDL '10), June 2010.
[8] A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, and H. Liu, Data warehousing and analytics infrastructure at facebook, in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 2010.
[9] J. K. Bonfield and R. Staden, ZTR: A new format for DNA sequence trace data, Bioinformatics, vol. 18, no. 1, 2002.
[10] T. Nukarapu, B. Tang, L. Wang, and S. Lu, Data replication in data intensive scientific applications with performance guarantee, IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 8.
[11] Y. Gao and S. Zheng, A Metadata Access Strategy of Learning Resources Based on HDFS, in Proceedings of the International Conference on Image Analysis and Signal Processing (IASP).
[12] J. Xie, S. Yin, et al.
Improving MapReduce performance through data placement in heterogeneous Hadoop clusters, in 2010 IEEE International Symposium on Parallel & Distributed Processing.
[13] G. Mackey, S. Sehrish, J. Wang, Improving metadata management for small files in HDFS, IEEE International Conference on Cluster Computing and Workshops (CLUSTER).
[14] C. Vorapongkitipun, N. Nupairoj, Improving performance of small file accessing in Hadoop, IEEE International Conference on Computer Science and Software Engineering (JCSSE).
[15] A. Patel, M. A. Mehta, A novel approach for efficient handling of small files in HDFS, 2015 IEEE International Advance Computing Conference (IACC).
[16] Y. Zhang, D. Liu, Improving the Efficiency of Storing for Small Files in HDFS, International Conference on Computer Science & Service System (CSSS).
[17] D. Dev, R. Patgiri, HAR+: Archive and metadata distribution! Why not both?, IEEE International Conference on Computer Communication and Informatics (ICCCI).
[18] P. Gohil, B. Panchal, J. S. Dhobi, A novel approach to improve the performance of Hadoop in handling of small files, International Conference on Electrical, Computer and Communication Technologies (ICECCT).
[19] C. A. R. Hoare, Quicksort, The Computer Journal, 1962, 5(1).
[20] E. J. O'Neil, P. E. O'Neil, and G. Weikum, An Optimality Proof of the LRU-K Page Replacement Algorithm, J. ACM, vol. 46, no. 1, 1999.
More informationCS60021: Scalable Data Mining. Sourangshu Bhattacharya
CS60021: Scalable Data Mining Sourangshu Bhattacharya In this Lecture: Outline: HDFS Motivation HDFS User commands HDFS System architecture HDFS Implementation details Sourangshu Bhattacharya Computer
More informationTITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP
TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop
More informationDistributed File Systems II
Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation
More informationMap Reduce & Hadoop Recommended Text:
Map Reduce & Hadoop Recommended Text: Hadoop: The Definitive Guide Tom White O Reilly 2010 VMware Inc. All rights reserved Big Data! Large datasets are becoming more common The New York Stock Exchange
More informationOpen Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments
Send Orders for Reprints to reprints@benthamscience.ae 368 The Open Automation and Control Systems Journal, 2014, 6, 368-373 Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing
More informationMOHA: Many-Task Computing Framework on Hadoop
Apache: Big Data North America 2017 @ Miami MOHA: Many-Task Computing Framework on Hadoop Soonwook Hwang Korea Institute of Science and Technology Information May 18, 2017 Table of Contents Introduction
More informationIntroduction to MapReduce
Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed
More informationData Analysis Using MapReduce in Hadoop Environment
Data Analysis Using MapReduce in Hadoop Environment Muhammad Khairul Rijal Muhammad*, Saiful Adli Ismail, Mohd Nazri Kama, Othman Mohd Yusop, Azri Azmi Advanced Informatics School (UTM AIS), Universiti
More informationCPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University
CPSC 426/526 Cloud Computing Ennan Zhai Computer Science Department Yale University Recall: Lec-7 In the lec-7, I talked about: - P2P vs Enterprise control - Firewall - NATs - Software defined network
More informationInternational Journal of Advance Engineering and Research Development. A Study: Hadoop Framework
Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja
More informationCLOUD-SCALE FILE SYSTEMS
Data Management in the Cloud CLOUD-SCALE FILE SYSTEMS 92 Google File System (GFS) Designing a file system for the Cloud design assumptions design choices Architecture GFS Master GFS Chunkservers GFS Clients
More informationDynamic Data Placement Strategy in MapReduce-styled Data Processing Platform Hua-Ci WANG 1,a,*, Cai CHEN 2,b,*, Yi LIANG 3,c
2016 Joint International Conference on Service Science, Management and Engineering (SSME 2016) and International Conference on Information Science and Technology (IST 2016) ISBN: 978-1-60595-379-3 Dynamic
More informationA Novel Architecture to Efficient utilization of Hadoop Distributed File Systems for Small Files
A Novel Architecture to Efficient utilization of Hadoop Distributed File Systems for Small Files Vaishali 1, Prem Sagar Sharma 2 1 M. Tech Scholar, Dept. of CSE., BSAITM Faridabad, (HR), India 2 Assistant
More informationThe Design of Distributed File System Based on HDFS Yannan Wang 1, a, Shudong Zhang 2, b, Hui Liu 3, c
Applied Mechanics and Materials Online: 2013-09-27 ISSN: 1662-7482, Vols. 423-426, pp 2733-2736 doi:10.4028/www.scientific.net/amm.423-426.2733 2013 Trans Tech Publications, Switzerland The Design of Distributed
More informationMap-Reduce. Marco Mura 2010 March, 31th
Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of
More informationDistributed Systems 16. Distributed File Systems II
Distributed Systems 16. Distributed File Systems II Paul Krzyzanowski pxk@cs.rutgers.edu 1 Review NFS RPC-based access AFS Long-term caching CODA Read/write replication & disconnected operation DFS AFS
More informationCSE 124: Networked Services Lecture-16
Fall 2010 CSE 124: Networked Services Lecture-16 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa10/cse124 11/23/2010 CSE 124 Networked Services Fall 2010 1 Updates PlanetLab experiments
More informationHadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017
Hadoop File System 1 S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y Moving Computation is Cheaper than Moving Data Motivation: Big Data! What is BigData? - Google
More informationThe Google File System. Alexandru Costan
1 The Google File System Alexandru Costan Actions on Big Data 2 Storage Analysis Acquisition Handling the data stream Data structured unstructured semi-structured Results Transactions Outline File systems
More informationDecision analysis of the weather log by Hadoop
Advances in Engineering Research (AER), volume 116 International Conference on Communication and Electronic Information Engineering (CEIE 2016) Decision analysis of the weather log by Hadoop Hao Wu Department
More informationCA485 Ray Walshe Google File System
Google File System Overview Google File System is scalable, distributed file system on inexpensive commodity hardware that provides: Fault Tolerance File system runs on hundreds or thousands of storage
More informationGoogle File System (GFS) and Hadoop Distributed File System (HDFS)
Google File System (GFS) and Hadoop Distributed File System (HDFS) 1 Hadoop: Architectural Design Principles Linear scalability More nodes can do more work within the same time Linear on data size, linear
More informationCLIENT DATA NODE NAME NODE
Volume 6, Issue 12, December 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Efficiency
More informationCorrelation based File Prefetching Approach for Hadoop
IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie
More informationCSE 124: Networked Services Fall 2009 Lecture-19
CSE 124: Networked Services Fall 2009 Lecture-19 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa09/cse124 Some of these slides are adapted from various sources/individuals including but
More informationHADOOP FRAMEWORK FOR BIG DATA
HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further
More informationHadoop محبوبه دادخواه کارگاه ساالنه آزمایشگاه فناوری وب زمستان 1391
Hadoop محبوبه دادخواه کارگاه ساالنه آزمایشگاه فناوری وب زمستان 1391 Outline Big Data Big Data Examples Challenges with traditional storage NoSQL Hadoop HDFS MapReduce Architecture 2 Big Data In information
More informationSouth Asian Journal of Engineering and Technology Vol.2, No.50 (2016) 5 10
ISSN Number (online): 2454-9614 Weather Data Analytics using Hadoop Components like MapReduce, Pig and Hive Sireesha. M 1, Tirumala Rao. S. N 2 Department of CSE, Narasaraopeta Engineering College, Narasaraopet,
More informationDistributed Face Recognition Using Hadoop
Distributed Face Recognition Using Hadoop A. Thorat, V. Malhotra, S. Narvekar and A. Joshi Dept. of Computer Engineering and IT College of Engineering, Pune {abhishekthorat02@gmail.com, vinayak.malhotra20@gmail.com,
More informationGlobal Journal of Engineering Science and Research Management
A FUNDAMENTAL CONCEPT OF MAPREDUCE WITH MASSIVE FILES DATASET IN BIG DATA USING HADOOP PSEUDO-DISTRIBUTION MODE K. Srikanth*, P. Venkateswarlu, Ashok Suragala * Department of Information Technology, JNTUK-UCEV
More informationHDFS Architecture. Gregory Kesden, CSE-291 (Storage Systems) Fall 2017
HDFS Architecture Gregory Kesden, CSE-291 (Storage Systems) Fall 2017 Based Upon: http://hadoop.apache.org/docs/r3.0.0-alpha1/hadoopproject-dist/hadoop-hdfs/hdfsdesign.html Assumptions At scale, hardware
More informationBig Data Management and NoSQL Databases
NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model
More informationCloud Computing. Hwajung Lee. Key Reference: Prof. Jong-Moon Chung s Lecture Notes at Yonsei University
Cloud Computing Hwajung Lee Key Reference: Prof. Jong-Moon Chung s Lecture Notes at Yonsei University Cloud Computing Cloud Introduction Cloud Service Model Big Data Hadoop MapReduce HDFS (Hadoop Distributed
More informationINDEX-BASED JOIN IN MAPREDUCE USING HADOOP MAPFILES
Al-Badarneh et al. Special Issue Volume 2 Issue 1, pp. 200-213 Date of Publication: 19 th December, 2016 DOI-https://dx.doi.org/10.20319/mijst.2016.s21.200213 INDEX-BASED JOIN IN MAPREDUCE USING HADOOP
More informationEmbedded Technosolutions
Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication
More informationSurvey Paper on Traditional Hadoop and Pipelined Map Reduce
International Journal of Computational Engineering Research Vol, 03 Issue, 12 Survey Paper on Traditional Hadoop and Pipelined Map Reduce Dhole Poonam B 1, Gunjal Baisa L 2 1 M.E.ComputerAVCOE, Sangamner,
More informationResearch Article Mobile Storage and Search Engine of Information Oriented to Food Cloud
Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 DOI:10.19026/ajfst.5.3106 ISSN: 2042-4868; e-issn: 2042-4876 2013 Maxwell Scientific Publication Corp. Submitted: May 29, 2013 Accepted:
More informationBig Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.
Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop
More informationHDFS Federation. Sanjay Radia Founder and Hortonworks. Page 1
HDFS Federation Sanjay Radia Founder and Architect @ Hortonworks Page 1 About Me Apache Hadoop Committer and Member of Hadoop PMC Architect of core-hadoop @ Yahoo - Focusing on HDFS, MapReduce scheduler,
More informationGoogle File System. Arun Sundaram Operating Systems
Arun Sundaram Operating Systems 1 Assumptions GFS built with commodity hardware GFS stores a modest number of large files A few million files, each typically 100MB or larger (Multi-GB files are common)
More informationThe amount of data increases every day Some numbers ( 2012):
1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect
More information2/26/2017. The amount of data increases every day Some numbers ( 2012):
The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to
More informationAugust Li Qiang, Huang Qiulan, Sun Gongxing IHEP-CC. Supported by the National Natural Science Fund
August 15 2016 Li Qiang, Huang Qiulan, Sun Gongxing IHEP-CC Supported by the National Natural Science Fund The Current Computing System What is Hadoop? Why Hadoop? The New Computing System with Hadoop
More informationBig Data and Object Storage
Big Data and Object Storage or where to store the cold and small data? Sven Bauernfeind Computacenter AG & Co. ohg, Consultancy Germany 28.02.2018 Munich Volume, Variety & Velocity + Analytics Velocity
More informationA New Model of Search Engine based on Cloud Computing
A New Model of Search Engine based on Cloud Computing DING Jian-li 1,2, YANG Bo 1 1. College of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China 2. Tianjin Key
More informationTOOLS FOR INTEGRATING BIG DATA IN CLOUD COMPUTING: A STATE OF ART SURVEY
Journal of Analysis and Computation (JAC) (An International Peer Reviewed Journal), www.ijaconline.com, ISSN 0973-2861 International Conference on Emerging Trends in IOT & Machine Learning, 2018 TOOLS
More informationBatch Inherence of Map Reduce Framework
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.287
More informationPerformance Enhancement of Data Processing using Multiple Intelligent Cache in Hadoop
Performance Enhancement of Data Processing using Multiple Intelligent Cache in Hadoop K. Senthilkumar PG Scholar Department of Computer Science and Engineering SRM University, Chennai, Tamilnadu, India
More informationChapter 5. The MapReduce Programming Model and Implementation
Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing
More informationThe Google File System
The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung SOSP 2003 presented by Kun Suo Outline GFS Background, Concepts and Key words Example of GFS Operations Some optimizations in
More informationIndexing Strategies of MapReduce for Information Retrieval in Big Data
International Journal of Advances in Computer Science and Technology (IJACST), Vol.5, No.3, Pages : 01-06 (2016) Indexing Strategies of MapReduce for Information Retrieval in Big Data Mazen Farid, Rohaya
More informationCATEGORIZATION OF THE DOCUMENTS BY USING MACHINE LEARNING
CATEGORIZATION OF THE DOCUMENTS BY USING MACHINE LEARNING Amol Jagtap ME Computer Engineering, AISSMS COE Pune, India Email: 1 amol.jagtap55@gmail.com Abstract Machine learning is a scientific discipline
More informationSurvey on MapReduce Scheduling Algorithms
Survey on MapReduce Scheduling Algorithms Liya Thomas, Mtech Student, Department of CSE, SCTCE,TVM Syama R, Assistant Professor Department of CSE, SCTCE,TVM ABSTRACT MapReduce is a programming model used
More informationClustering Lecture 8: MapReduce
Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data
More informationCSE-E5430 Scalable Cloud Computing Lecture 9
CSE-E5430 Scalable Cloud Computing Lecture 9 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 15.11-2015 1/24 BigTable Described in the paper: Fay
More informationAn Improved Performance Evaluation on Large-Scale Data using MapReduce Technique
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 6.017 IJCSMC,
More informationA REVIEW PAPER ON BIG DATA ANALYTICS
A REVIEW PAPER ON BIG DATA ANALYTICS Kirti Bhatia 1, Lalit 2 1 HOD, Department of Computer Science, SKITM Bahadurgarh Haryana, India bhatia.kirti.it@gmail.com 2 M Tech 4th sem SKITM Bahadurgarh, Haryana,
More informationCrossing the Chasm: Sneaking a parallel file system into Hadoop
Crossing the Chasm: Sneaking a parallel file system into Hadoop Wittawat Tantisiriroj Swapnil Patil, Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University In this work Compare and contrast large
More informationP-Codec: Parallel Compressed File Decompression Algorithm for Hadoop
P-Codec: Parallel Compressed File Decompression Algorithm for Hadoop ABSTRACT Idris Hanafi and Amal Abdel-Raouf Computer Science Department, Southern Connecticut State University, USA Computers and Systems
More informationEXTRACT DATA IN LARGE DATABASE WITH HADOOP
International Journal of Advances in Engineering & Scientific Research (IJAESR) ISSN: 2349 3607 (Online), ISSN: 2349 4824 (Print) Download Full paper from : http://www.arseam.com/content/volume-1-issue-7-nov-2014-0
More information50 Must Read Hadoop Interview Questions & Answers
50 Must Read Hadoop Interview Questions & Answers Whizlabs Dec 29th, 2017 Big Data Are you planning to land a job with big data and data analytics? Are you worried about cracking the Hadoop job interview?
More informationOptimization Scheme for Storing and Accessing Huge Number of Small Files on HADOOP Distributed File System
Optimization Scheme for Storing and Accessing Huge Number of Small Files on HADOOP Distributed File System L. Prasanna Kumar 1, 1 Assoc. Prof, Department of Computer Science & Engineering., Dadi Institute
More informationCS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab
CS6030 Cloud Computing Ajay Gupta B239, CEAS Computer Science Department Western Michigan University ajay.gupta@wmich.edu 276-3104 1 Acknowledgements I have liberally borrowed these slides and material
More informationMitigating Data Skew Using Map Reduce Application
Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,
More informationQADR with Energy Consumption for DIA in Cloud
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,
More informationHDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung
HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per
More informationModeling and evaluation on Ad hoc query processing with Adaptive Index in Map Reduce Environment
DEIM Forum 213 F2-1 Adaptive indexing 153 855 4-6-1 E-mail: {okudera,yokoyama,miyuki,kitsure}@tkl.iis.u-tokyo.ac.jp MapReduce MapReduce MapReduce Modeling and evaluation on Ad hoc query processing with
More informationIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large
More informationThe Google File System
The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system
More informationIntroduction to MapReduce
Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed
More informationHDFS Architecture Guide
by Dhruba Borthakur Table of contents 1 Introduction...3 2 Assumptions and Goals...3 2.1 Hardware Failure... 3 2.2 Streaming Data Access...3 2.3 Large Data Sets...3 2.4 Simple Coherency Model... 4 2.5
More informationPage 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24
Goals for Today" CS162 Operating Systems and Systems Programming Lecture 24 Capstone: Cloud Computing" Distributed systems Cloud Computing programming paradigms Cloud Computing OS December 2, 2013 Anthony
More informationSMCCSE: PaaS Platform for processing large amounts of social media
KSII The first International Conference on Internet (ICONI) 2011, December 2011 1 Copyright c 2011 KSII SMCCSE: PaaS Platform for processing large amounts of social media Myoungjin Kim 1, Hanku Lee 2 and
More informationCrossing the Chasm: Sneaking a parallel file system into Hadoop
Crossing the Chasm: Sneaking a parallel file system into Hadoop Wittawat Tantisiriroj Swapnil Patil, Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University In this work Compare and contrast large
More informationPerformance Analysis of Hadoop Application For Heterogeneous Systems
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 3, Ver. I (May-Jun. 2016), PP 30-34 www.iosrjournals.org Performance Analysis of Hadoop Application
More informationForget about the Clouds, Shoot for the MOON
Forget about the Clouds, Shoot for the MOON Wu FENG feng@cs.vt.edu Dept. of Computer Science Dept. of Electrical & Computer Engineering Virginia Bioinformatics Institute September 2012, W. Feng Motivation
More informationCooperation between Data Modeling and Simulation Modeling for Performance Analysis of Hadoop
Cooperation between Data ing and Simulation ing for Performance Analysis of Hadoop Byeong Soo Kim and Tag Gon Kim Department of Electrical Engineering Korea Advanced Institute of Science and Technology
More information