An Improved Hadoop Performance Model For Job Estimation And Resource Provisioning In Public Cloud
Volume 118, url: ijpam.eu

An Improved Hadoop Performance Model for Job Estimation and Resource Provisioning in Public Cloud

D. M. Kalai Selvi (1) and Dr. P. Ezhumalai (2)
(1) M.E. (CSE), R.M.D Engineering College, Kavaraipettai, Chennai, Tamil Nadu, dmkalai@gmail.com
(2) HOD, Department of CSE, R.M.D Engineering College, Kavaraipettai, Chennai, Tamil Nadu

January 4, 2018

Abstract

Hadoop is an open-source implementation of the MapReduce framework that is widely used for the parallel processing workloads of big data analytics. Hadoop users lease the resources offered by Cloud Service Providers to process their Hadoop jobs and pay per use. However, there is no efficient resource provisioning mechanism that can complete a job within a specified deadline; it is the user who must manually adjust the resources allocated to each Hadoop job. We therefore propose an improved HP (Hadoop Performance) model that uses Hive bucketing and partitioning to shorten the long shuffle-sort stage and complete jobs within the desired deadline. Furthermore, efficient resource provisioning is obtained by using a feedback system and a user profile to determine each user's exact needs, and we suggest an optimal plan for their resource requirements.
Key Words: Hadoop; MapReduce; Cloud Service Providers; Hive; Resource Provisioning; Bucket.

1 Introduction

Raw data is evolving rapidly from sources all around us, such as sensors, blogs, social networking sites, and aviation. This has led to the phenomenon called Big Data, which has moved IT solutions into a new dimension where not only computer science but also fields such as electronics, civil engineering, and transport need IT solutions for their business. For example, in the U.S., police departments install traffic cameras that read the license plates of automobiles to identify stolen vehicles in real time. The Hadoop framework can handle both structured and unstructured data; across all big data sets, about 80% of the data is unstructured and only 20% is structured. [1] Based on the challenges it poses, Big Data can be characterized along four dimensions:

VOLUME: managing large volumes of data.
VELOCITY: managing rapidly arriving data.
VARIETY: managing both structured and unstructured data.
VERACITY: validating the correctness of large amounts of arriving data.

Cloud Service Providers such as Amazon EC2 let users configure the resources needed for their applications, but in its current form EC2 does not support Hadoop jobs with deadline constraints. It is therefore the sole responsibility of the user to assign the resources needed to complete a job within a specified deadline, which is a highly challenging task, and this makes Hadoop performance modeling critical. A Hadoop job has multiple processing phases, with three core phases: the map phase, the shuffle phase, and the reduce phase. The first wave of the shuffle phase is processed in parallel with the map phase and is called the overlapping stage. The remaining waves of the shuffle phase are processed
after the map phase completes; this is called the non-overlapping stage. The existing Hadoop performance model cannot automate resource provisioning or complete a Hadoop job within the specified constraints. Furthermore, the number of reduce slots is strictly constant. We therefore propose an improved Hadoop performance model that automatically configures the resources required by the end user and completes the job within the deadline. The rest of this paper is organized as follows. Section 2 reviews the related literature, Section 3 presents the problem analysis, Section 4 describes the proposed work, and Sections 5 and 6 present the experimental results and conclusion.

2 Literature Review

[2] Starfish is a MADDER (Magnetism, Agility, Depth, Data-lifecycle-awareness, Elasticity, Robustness) self-tuning system for big data analytics. The main goal of Starfish is to enable Hadoop users and applications to obtain good performance automatically throughout the data lifecycle, without requiring the user to understand and tune the many configuration knobs manually. However, Starfish collects detailed information about a Hadoop job at very fine granularity to automate job estimation, which is a burden.

[3] The general problem in forming a cluster is to determine and configure the resources for a MapReduce job so that the desired constraints, such as cost, deadline, and execution time, are met for a given big data analytics workload; this is referred to as the cluster sizing problem. In this paper, Elastisizer, a system added on top of the Hadoop stack, automates the allocation of resources to meet the requirements of massive data analytics. Elastisizer provides reliable answers to cluster sizing queries in an automated fashion by using detailed Hadoop job profiles, but this need for detailed profile information adds the overhead of high execution time.
[4] In this paper, the proposed Hadoop performance model considers both the overlapping and non-overlapping stages. It uses a scaling factor to automatically scale the resource allocation up or down within the specified timeframe. Although this model increases the possibility of automatic resource allocation, it relies on simple linear regression, and the HP model is restricted to a constant number of reduce tasks.

[5][6] CRESP provides automatic estimation of job execution time and optimal resource provisioning. In general, multiple waves of the reduce phase give better performance than a single wave, but in CRESP the number of reduce tasks must equal the number of reduce slots, so only a single wave of the reduce phase is considered.

Traditional Hadoop cluster environments are assumed to be homogeneous. Now, however, heterogeneous Hadoop clusters are proliferating because many companies run increasingly parallel, data-intensive applications. Heterogeneity of resources in a Hadoop cluster degrades job completion time and creates resource bottlenecks. In [9], the authors propose bound-based performance modeling of MapReduce job completion times in heterogeneous Hadoop clusters, but they neither explore the properties of Hadoop job profiles nor analyze the breakdown of Hadoop job execution across the different types of heterogeneous nodes in the cluster.

In [10], the authors propose a framework called ARIA (Automatic Resource Inference and Allocation) for MapReduce environments to solve the problem of automatic resource allocation. It contains three inter-related components. First, a job profile summarizes the performance characteristics of the underlying application during the map and reduce stages. Second, a MapReduce performance model uses the job profile and its SLO (Service Level Objective) to estimate the amount of resources required to complete the job within the deadline. Finally, an SLO scheduler based on EDF (Earliest Deadline First) determines job ordering and allocates resources. The major disadvantage of this framework is that it does not consider node failures.
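ARIA's SLO scheduler orders jobs by Earliest Deadline First (EDF). A minimal sketch of EDF ordering, with hypothetical job names and deadline values (not taken from the ARIA paper):

```python
# Minimal Earliest-Deadline-First (EDF) ordering, the policy behind ARIA's
# SLO scheduler: jobs are served in ascending order of their deadlines.
# Job names and deadline values below are hypothetical.
def edf_order(jobs):
    """Return jobs sorted by deadline (seconds from now)."""
    return sorted(jobs, key=lambda job: job["deadline"])

jobs = [
    {"name": "log-aggregation", "deadline": 3600},
    {"name": "fraud-report",    "deadline": 600},
    {"name": "weekly-rollup",   "deadline": 86400},
]
for job in edf_order(jobs):
    print(job["name"], job["deadline"])  # tightest deadline first
```

Resources would then be granted to the head of this ordering first, so the job closest to its deadline is never starved by less urgent work.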
In [11], the authors propose a framework called AROMA (Automated Resource Allocation and Configuration of MapReduce Environments). It automates resource provisioning in heterogeneous clouds and tunes Hadoop configuration parameters to achieve performance goals while minimizing the incurred cost. Its major disadvantages are ineffective configuration and a lack of support for multi-stage job workflows.

[12] introduces the Parallax progress estimator, which estimates performance by predicting the remaining time of MapReduce pipelines from the time already elapsed. The key strategy is to divide a MapReduce job into five key pipelines and estimate the remaining time of each from its elapsed time. However, this approach does not provide an explicit cost function that can be used in optimization problems. [13] also proposes optimal resource provisioning with minimal cost, but it suffers from some prediction error.

To improve the efficiency and reliability of Hadoop performance modeling, to provide a user-friendly automated resource provisioning scheme, and to complete jobs effectively within predefined constraints, an improved Hadoop performance model is proposed. The major contributions of this paper are:

The improved Hadoop performance model covers all the core phases of the MapReduce framework, i.e., the map, shuffle, partitioner, combiner, and reduce phases.

The improved HP model uses Hive bucketing and partitioning to shorten the long shuffle-sort stage and complete jobs within the desired deadline. Furthermore, efficient resource provisioning is obtained by using a feedback system and a user profile to determine each user's exact needs, and we suggest an optimal plan for their resource requirements.

The performance of the improved HP model is evaluated on the Amazon EC2 cloud, and the evaluation results show that it outperforms both the HP and Starfish models.

3 Problem Analysis

3.1 Modeling Hadoop Job Phases

In Hadoop, any job that requires parallel processing and high performance enters the MapReduce framework. The submitted job is split into three core phases: the map, shuffle, and reduce phases. In the existing system, fine-granularity information about the job is collected [4], which helps to complete
the job within the deadline. Map tasks are executed in map slots and reduce tasks in reduce slots; each slot runs one task at a time, and slots are assigned in terms of CPU and RAM. The map and reduce phases can each be executed in a single wave or in multiple waves. [7]

Map Phase. The map phase accepts an input dataset from the Hadoop Distributed File System (HDFS) in key-value pair format. Datasets intended for parallel processing are usually kept in NoSQL stores such as key-value databases, column-family databases, graph databases, and document databases. The map phase first splits the input dataset into blocks (64 MB by default); each block becomes a map task and is assigned a map slot. After a map task is processed, the intermediate key-value results are stored in a buffer rather than on disk, to avoid the complications of replication. The total execution time of the map phase is

T_mtotal = (T_mavg * N_m) / N_mslot   (1)

Shuffle Phase. Transferring the map output to the reduce phase in sorted order, to minimize the work of the reducers, is called the shuffle. In this phase, the Hadoop job fetches the intermediate map output and copies it to one or more reducers. If N_r <= N_rslot, the shuffle phase completes in a single wave, and its total execution time is

T_stotal = (T_savg * N_r) / N_rslot   (2)

Otherwise, the shuffle phase completes in multiple waves, and its total execution time is

T_stotal = ((T_W1avg * N_W1) + ... + (T_Wnavg * N_Wn)) / N_rslot   (3)

Fig. 1 describes the job execution flow of the existing Hadoop performance model.
Fig. 1. Hadoop job execution flow. Table 1.1 defines the variables used in the model.
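Read literally, the map- and shuffle-phase formulas above are simple arithmetic over profile statistics. A minimal Python sketch, with variable names and profile numbers of our own choosing rather than the paper's Table 1.1:

```python
# Sketch of the HP-model phase-time estimates: average task durations and
# slot counts are assumed inputs taken from a measured job profile.
def map_time(t_m_avg, n_map_tasks, n_map_slots):
    # T_mtotal = T_mavg * N_m / N_mslot
    return t_m_avg * n_map_tasks / n_map_slots

def shuffle_time(waves, n_reduce_slots):
    # One (T_avg, N) pair per shuffle wave; a single pair with
    # N <= n_reduce_slots is the single-wave case, several pairs the
    # multi-wave case summed over waves.
    return sum(t_avg * n for t_avg, n in waves) / n_reduce_slots

# Hypothetical profile: 64 map tasks averaging 20 s each on 16 map slots,
# and one shuffle wave of 16 tasks averaging 30 s on 16 reduce slots.
print(map_time(20.0, 64, 16))          # 80.0 seconds
print(shuffle_time([(30.0, 16)], 16))  # 30.0 seconds
```

The reduce-phase estimate has the same shape, so a whole-job estimate is just the sum of the three phase times under the non-overlapping assumption.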
3.1.3 Reduce Phase

The reduce phase is a user-defined function in which the user customizes the processing logic. The customized reduce function processes the intermediate map output and produces the final output, which is usually stored in HDFS. The Hadoop performance model supports both overlapping and non-overlapping stages. The total execution time of the reduce phase is

T_rtotal = (T_ravg * N_r) / N_rslot   (4)

4 Proposed Work

4.1 Improved Hadoop Performance Model: Job Estimation Model

The improved Hadoop performance model handles both overlapping and non-overlapping stages. When processing in multiple waves, the first wave of the shuffle phase starts immediately after the first wave of the map phase completes, but it can finish only after all waves of the map phase have started, which leads to a long shuffle-phase execution time. This can be overcome using Hive. When an existing data infrastructure based on relational databases moves to Hadoop, Hive can be used: its HQL (Hive Query Language), similar to SQL (Structured Query Language), fetches and processes data from structured data infrastructures. [7][8] The two most important Hive concepts used here to overcome the latency of the shuffle phase are bucketing and partitioning. When the input datasets are bucketed on a column using a hash function, together with partitioning, the overhead of large datasets is reduced; the bucketing and partitioning commands are issued in HQL through the command-line interface or a web user interface according to user needs. Because the datasets are already bucketed, i.e., sorted on the chosen columns, the shuffling work is reduced, which improves performance and helps to complete the job within the deadline. The estimated
execution time of a Hadoop job is finalized from the user profile, which collects detailed information about past Hadoop jobs, such as data size, map tasks, reduce tasks, and their durations. When a job processes a growing input dataset, the number of map tasks increases proportionally, while the number of reduce tasks is specified by the user in the configuration file and can vary with the user's configuration. When the number of reduce tasks is kept constant, the execution durations of both the shuffle tasks and the reduce tasks increase linearly with the size of the input dataset, as assumed in the HP model. This is because the volume of an intermediate data block equals the total volume of generated intermediate data divided by the number of reduce tasks; consequently, the volume of an intermediate data block also increases linearly with the input size. However, when the number of reduce tasks varies, the execution durations of the shuffle and reduce tasks are no longer linear in the input size. In the improved HP model, we therefore consider a varied number of reduce slots based on the user profile, whose history of past input data sizes, execution times, and map and reduce slot counts helps to estimate the appropriate number of reduce slots.

Resource Provisioning Model

Starfish and the other Hadoop performance models [2][3][4][5] support only manual or self-tuned allocation of the resources needed for job execution. Resource provisioning plays a key role, because a job can be completed within a specified deadline t only when sufficient resources have been provided for its processing. The improved HP model supports automatic resource provisioning, without manipulating knob parameters, through a user feedback system.
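The linear relationship described above can be exploited directly: fitting a line to past (input size, duration) points from the user profile predicts the duration for a new input size, as long as the number of reduce tasks stays fixed. A minimal sketch with hypothetical profile points:

```python
# Sketch: with a fixed number of reduce tasks, shuffle/reduce durations grow
# linearly with input size, so past job-profile points can be fitted with a
# straight line and extrapolated to a new input size.
# The profile points below are hypothetical.
def fit_line(points):
    """Ordinary least squares for y = a*x + b over (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    a = cov / var
    return a, mean_y - a * mean_x

# (input size in GB, shuffle duration in seconds) from past runs
profile = [(10, 120), (20, 230), (40, 450)]
a, b = fit_line(profile)
print(round(a * 80 + b, 1))  # predicted duration at 80 GB -> 890.0
```

When the number of reduce tasks varies, this single line no longer holds, which is exactly why the improved model keeps per-configuration history in the user profile.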
The user feedback system collects fine-granularity information about the history of already-processed jobs and assigns a weight, or priority, to each job based on the previous logs, so that highly critical tasks are provided resources immediately and automatically. Because even highly critical tasks are provisioned proactively, Hadoop jobs can be completed within the specified constraints.
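One way to read this feedback system is as weight-based ordering of job classes. A sketch with a hypothetical weighting rule (fraction of past runs that missed their deadline) and hypothetical job history; the paper does not specify the exact weight formula:

```python
# Sketch of the feedback-based provisioning idea: each job class gets a
# weight from its history, and resources are granted to the highest-weight
# (most critical) classes first. The weighting rule and records below are
# hypothetical assumptions, not taken from the paper.
def prioritize(history):
    """Weight = fraction of past runs that missed their deadline."""
    weights = {}
    for name, runs in history.items():
        misses = sum(1 for met_deadline in runs if not met_deadline)
        weights[name] = misses / len(runs)
    return sorted(weights, key=weights.get, reverse=True)

history = {
    "etl-nightly":   [True, True, False, False],   # missed twice
    "ad-hoc-query":  [True, True, True, True],     # never missed
    "billing-batch": [True, False, False, False],  # missed three times
}
print(prioritize(history))  # most critical job class first
```

A provisioner would walk this list and satisfy each class's estimated slot demand before moving to the next, so chronically late jobs stop missing deadlines.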
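The Hive bucketing used in Section 4.1 routes each record to a bucket by hashing the bucketing column, so equal keys are co-located before the shuffle. A minimal sketch of that hash-partitioning idea; the hash function and records here are our own illustration, not Hive's actual hash:

```python
# Sketch of Hive-style bucketing: each record is routed to one of B buckets
# by hashing the bucketing column, so records with equal keys land in the
# same bucket and the shuffle has less sorting work left to do.
# The hash function and records are hypothetical stand-ins.
def bucket_of(key, num_buckets):
    """Mimic bucket = hash(column) mod num_buckets with a stable,
    reproducible sum-of-bytes hash."""
    return sum(key.encode("utf-8")) % num_buckets

records = ["Bengali", "Kannada", "Bengali", "Kannada"]
buckets = [bucket_of(lang, 4) for lang in records]
print(buckets)  # equal keys always fall in the same bucket
```

Because the assignment depends only on the key, two runs over the same data produce the same layout, which is what lets a pre-bucketed table skip repeated shuffle-side sorting.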
5 Performance Evaluation

5.1 Experimental Setup

The performance of the improved Hadoop performance model was evaluated in two experimental setups: the Amazon EC2 cloud and the Google cloud. The experimental Hadoop cluster was set up on the Amazon EC2 cloud using 20 m1.large instances; the specifications of the m1.large instances are listed in the accompanying table. In this cluster, we used Hadoop and configured one instance as the Name Node and the other 19 instances as Data Nodes; the Name Node was also used as a Data Node. The HDFS data block size was set to 64 MB and the replication level of each data block was set to 3. Each instance was configured with one map slot and one reduce slot. We ran the WordCount application on the Hadoop cluster and employed Starfish to collect the job profiles. For each application running on each cluster, we conducted 10 tests; each test was run 5 times and the average phase durations were taken.
Fig. 4. The performance of the WordCount application with a varied number of reduce tasks.

The following graph compares the job estimation time of the improved HP model with that of the HP model.
The following graph shows the estimated resource provisioning and compares the resource provisioning performance of the HP model and the improved HP model.

The improved Hadoop performance model was also evaluated on the Google cloud; the project scenario used for that evaluation is described here. In the Google cloud, a file directory containing a list of files with the details of famous songs was stored as the big data. A list of plans describing the configuration of map and reduce slots was stored in a separate Hive table. The contents of the file were stored in a Hive table in comma-separated format, and the user profile details, covering the required resource configuration and user feedback, were stored in a Hive database. The file details were analyzed, the big data was partitioned into buckets with a hash function on the major columns, and the list of optimal resource plans was displayed along with the optimal resource configuration. Using Compute Engine, a virtual machine was initialized, and Hadoop was installed on that VM along with Hive 1.2.x and Pig for manipulating the big data. The file directory was then transferred from the local file system to the Hadoop Distributed File System (HDFS).

The file is stored in .csv format, with records such as:

1  9,98993,3,Bengali,Contemporary celtic,3,anjan Dutt,paid,album
2  9,105156,3,Bengali,Contemporary celtic,3,anjan Dutt,paid,album
3  9,113452,3,Bengali,Contemporary celtic,3,anjan Dutt,paid,album
4  9,119398,2,Bengali,Contemporary celtic,3,anjan Dutt,free,movie
5  9,130766,3,Bengali,Contemporary celtic,2,anjan Dutt,free,movie
6  9,133420,3,Kannada,Carnival songs,2,anu Malik,free,movie
7  10,11091,4,Kannada,Carnival songs,2,anu Malik,free,movie
8  10,19275,3,Kannada,Carnival songs,2,anu Malik,free,movie
9  10,21140,4,Kannada,Carnival songs,2,anu Malik,free,album
10 10,44487,3,Kannada,Carnival songs,1,anu Malik,free,album
11 10,49147,1,Kannada,Carnival songs,1,anu Malik,free,album
12 10,54923,5,Kannada,Contemporary celtic,1,anu Malik,free,album
13 10,63451,3,Kannada,Contemporary celtic,1,anu Malik,free,album
14 10,75206,3,Kannada,Contemporary celtic,1,anu Malik,paid,movie
15 10,104641,1,Kannada,Contemporary celtic,5,anu Malik,paid,movie
16 10,110054,3,Kannada,Contemporary celtic,5,anu Malik,paid,movie

Command to load file.csv:

LOGS = LOAD '/user/root/songs.csv';
LOGS_GROUP = GROUP LOGS ALL;
LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT(LOGS);
STORE LOG_COUNT INTO '/user/hive/warehouse/resource.db/filedetails/songs.txt';

use resource;
drop table if exists filedetails;
CREATE EXTERNAL TABLE filedetails (totalrow string)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  LOCATION 'hdfs://localhost:9000/user/hive/warehouse/resource.db/filedetails/songs.txt';

Command to analyze the plan:

set hive.cli.print.header=true;
use resource;
drop table if exists analysis;
create table analysis as
  select a.id, a.memory, a.processor, a.ssd, a.transfer, a.price, a.size
  from plans a JOIN filedetails b
  where a.size >= cast(b.totalrow as int);

select a.id, a.memory, a.processor, a.ssd, a.transfer, a.price, a.size, b.feedback, b.name
  from analysis a JOIN feedback b
  where a.id = b.planid;

Sample contents stored in the Hive tables:

Analysis (id, memory, processor, ssd, transfer, price):
3 | 2GB  | 2 Core  | 40GB  | 3TB | 20$
4 | 4GB  | 2 Core  | 60GB  | 4TB | 40$
5 | 8GB  | 4 Core  | 80GB  | 5TB | 80$
6 | 16GB | 8 Core  | 160GB | 6TB | 160$
7 | 32GB | 12 Core | 320GB | 7TB | 320$
8 | 48GB | 16 Core | 480GB | 8TB | 480$
9 | 64GB | 20 Core | 640GB | 9TB | 640$

Feedback (name, email, feedback):
vaishali.m | mvaishali1996@gmail.com | It would be even more useful and clear if we had the option of working it live... and practical
yukti sharma | dana.pink@yahoo.co.in | it was good. need more time to elaborate. next time try for longer instances
subhash | subhashthaya@gmail.com | it is very nice and i have learnt many things about google design cloud during this session
akhil vemuri | micheal.akhil.king@gmail.com | good but expecting more
Devang P Patel | patell.deva@gmail.com | the cloud was good
H.VIGNESH | vickynarrayanan@gmail.com | awesome experience we had but time is not enough!!!
vidya varsshini | vidyavars1995@gmail.com | its very nice but can make it even more better
subathra.r | subaramachandran7@gmail.com | good...but not so effective
sruthi r | sruthi.rdsk.1996@gmail.com | nice..but little bit lag
SUSHMITHA.U | sushmiudhay@gmail.com | could have been more effective if it was more storage
vijayashankar | vijayashankarenginer@outlook.com | good to use this pack. Great eperience
venkatraman.d | vengatvenkie@gmail.com | its nice cloud any very good one for improving our knowledge

Plans (id, memory, processor, ssd, transfer, price):
1 | 512MB | 1 Core  | 20GB  | 1TB | 5$
2 | 1GB   | 1 Core  | 30GB  | 2TB | 10$
3 | 2GB   | 2 Core  | 40GB  | 3TB | 20$
4 | 4GB   | 2 Core  | 60GB  | 4TB | 40$
5 | 8GB   | 4 Core  | 80GB  | 5TB | 80$
6 | 16GB  | 8 Core  | 160GB | 6TB | 160$
7 | 32GB  | 12 Core | 320GB | 7TB | 320$
8 | 48GB  | 16 Core | 480GB | 8TB | 480$
9 | 64GB  | 20 Core | 640GB | 9TB | 640$

Output
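The Hive join above narrows the plans table down to plans that can hold the measured data, and the final choice can be read as "cheapest feasible plan". A sketch with an abridged version of the plans table and a hypothetical storage requirement; the feasibility criterion (SSD capacity) is our assumption:

```python
# Sketch of the plan-selection step behind the Hive analysis query: keep the
# plans whose capacity meets the job's requirement, then take the cheapest.
# The feasibility criterion (minimum SSD) is an assumption for illustration.
plans = [
    {"id": 1, "memory_gb": 0.5, "cores": 1, "ssd_gb": 20, "price": 5},
    {"id": 2, "memory_gb": 1,   "cores": 1, "ssd_gb": 30, "price": 10},
    {"id": 3, "memory_gb": 2,   "cores": 2, "ssd_gb": 40, "price": 20},
    {"id": 4, "memory_gb": 4,   "cores": 2, "ssd_gb": 60, "price": 40},
]

def optimal_plan(plans, min_ssd_gb):
    """Cheapest plan with enough SSD storage for the dataset."""
    feasible = [p for p in plans if p["ssd_gb"] >= min_ssd_gb]
    return min(feasible, key=lambda p: p["price"])

print(optimal_plan(plans, 35)["id"])  # cheapest plan with >= 35 GB SSD -> 3
```

The same selection could use any column of the plans table (memory, cores, transfer) as the constraint, with price as the objective.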
6 CONCLUSION

Running a MapReduce Hadoop job on a public cloud such as Amazon EC2 necessitates a performance model to estimate the job execution time and, further, to provision a certain amount of resources for the job to complete within a given deadline. This paper has presented an improved HP model that achieves this goal while taking into account multiple waves of the shuffle phase of a Hadoop job. The experimental results showed that the improved HP model outperforms both Starfish and the HP model in job execution estimation and resource provisioning. One future direction is to consider the dynamic overhead of the VMs running the user jobs, in order to minimize resource overprovisioning.

References

[1]

[2] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu, "Starfish: A Self-tuning System for Big Data Analytics," in CIDR, 2011.

[3] H. Herodotou, F. Dong, and S. Babu, "No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics," in Proceedings of the 2nd ACM Symposium on Cloud Computing (SOCC '11), 2011.

[4] A. Verma, L. Cherkasova, and R. H. Campbell, "Resource provisioning framework for MapReduce jobs with performance goals," in Proceedings of the 12th ACM/IFIP/USENIX International Conference on Middleware, 2011.

[5] K. Chen, J. Powers, S. Guo, and F. Tian, "CRESP: Towards Optimal Resource Provisioning for MapReduce Computing in Public Clouds," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 6.

[6] H. Herodotou, "Hadoop Performance Models," [Online]. [Accessed: 22-Oct-2013].

[7] T. White, Hadoop: The Definitive Guide, O'Reilly.
[8] E. Capriolo, D. Wampler, and J. Rutherglen, Programming Hive, O'Reilly, 2012.

[9] Z. Zhang, L. Cherkasova, and B. T. Loo, "Performance Modeling of MapReduce Jobs in Heterogeneous Cloud Environments," in Proceedings of the 2013 IEEE Sixth International Conference on Cloud Computing.
More informationNo One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics
No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics Technical Report CS-2011-06, Computer Science, Duke University Herodotos Herodotou Duke University hero@cs.duke.edu
More informationECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective
ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models RCFile: A Fast and Space-efficient Data
More informationLITERATURE SURVEY (BIG DATA ANALYTICS)!
LITERATURE SURVEY (BIG DATA ANALYTICS) Applications frequently require more resources than are available on an inexpensive machine. Many organizations find themselves with business processes that no longer
More informationLecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)!
Big Data Processing, 2014/15 Lecture 7: MapReduce design patterns!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm
More informationMitigating Data Skew Using Map Reduce Application
Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,
More informationHadoop Online Training
Hadoop Online Training IQ training facility offers Hadoop Online Training. Our Hadoop trainers come with vast work experience and teaching skills. Our Hadoop training online is regarded as the one of the
More informationPLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
More informationBest Practices and Performance Tuning on Amazon Elastic MapReduce
Best Practices and Performance Tuning on Amazon Elastic MapReduce Michael Hanisch Solutions Architect Amo Abeyaratne Big Data and Analytics Consultant ANZ 12.04.2016 2016, Amazon Web Services, Inc. or
More informationWhere We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344
Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)
More informationMC 2 : Map Concurrency Characterization for MapReduce on the Cloud
MC 2 : Map Concurrency Characterization for Map on the Cloud Mohammad Hammoud and Majd F. Sakr Carnegie Mellon University in Qatar Education City, Doha, State of Qatar Emails: {mhhammou, msakr}@qatar.cmu.edu
More informationReal-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments
Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Nikos Zacheilas, Vana Kalogeraki Department of Informatics Athens University of Economics and Business 1 Big Data era has arrived!
More informationPerformance Modeling and Optimization of Deadline-Driven Pig Programs
Performance Modeling and Optimization of Deadline-Driven Pig Programs ZHUOYAO ZHANG, HP Labs and University of Pennsylvania LUDMILA CHERKASOVA, Hewlett-Packard Labs ABHISHEK VERMA, HP Labs and University
More informationStatistics Driven Workload Modeling for the Cloud
UC Berkeley Statistics Driven Workload Modeling for the Cloud Archana Ganapathi, Yanpei Chen Armando Fox, Randy Katz, David Patterson SMDB 2010 Data analytics are moving to the cloud Cloud computing economy
More informationCSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark
CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel
More informationThe Stratosphere Platform for Big Data Analytics
The Stratosphere Platform for Big Data Analytics Hongyao Ma Franco Solleza April 20, 2015 Stratosphere Stratosphere Stratosphere Big Data Analytics BIG Data Heterogeneous datasets: structured / unstructured
More informationIntroduction to Hadoop. Owen O Malley Yahoo!, Grid Team
Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since
More informationA Comparison of MapReduce Join Algorithms for RDF
A Comparison of MapReduce Join Algorithms for RDF Albert Haque 1 and David Alves 2 Research in Bioinformatics and Semantic Web Lab, University of Texas at Austin 1 Department of Computer Science, 2 Department
More informationPerformance Comparison of Hive, Pig & Map Reduce over Variety of Big Data
Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data Yojna Arora, Dinesh Goyal Abstract: Big Data refers to that huge amount of data which cannot be analyzed by using traditional analytics
More informationHadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R
Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R Table of Contents Introduction... 3 Topology Awareness in Hadoop... 3 Virtual Hadoop... 4 HVE Solution... 5 Architecture...
More informationResource and Performance Distribution Prediction for Large Scale Analytics Queries
Resource and Performance Distribution Prediction for Large Scale Analytics Queries Prof. Rajiv Ranjan, SMIEEE School of Computing Science, Newcastle University, UK Visiting Scientist, Data61, CSIRO, Australia
More informationSoftNAS Cloud Performance Evaluation on AWS
SoftNAS Cloud Performance Evaluation on AWS October 25, 2016 Contents SoftNAS Cloud Overview... 3 Introduction... 3 Executive Summary... 4 Key Findings for AWS:... 5 Test Methodology... 6 Performance Summary
More informationBig Data Using Hadoop
IEEE 2016-17 PROJECT LIST(JAVA) Big Data Using Hadoop 17ANSP-BD-001 17ANSP-BD-002 Hadoop Performance Modeling for JobEstimation and Resource Provisioning MapReduce has become a major computing model for
More informationMixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp
MixApart: Decoupled Analytics for Shared Storage Systems Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp Hadoop Pig, Hive Hadoop + Enterprise storage?! Shared storage
More informationChapter 5. The MapReduce Programming Model and Implementation
Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing
More informationCertified Big Data and Hadoop Course Curriculum
Certified Big Data and Hadoop Course Curriculum The Certified Big Data and Hadoop course by DataFlair is a perfect blend of in-depth theoretical knowledge and strong practical skills via implementation
More informationA SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING
Journal homepage: www.mjret.in ISSN:2348-6953 A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING Bhavsar Nikhil, Bhavsar Riddhikesh,Patil Balu,Tad Mukesh Department of Computer Engineering JSPM s
More informationDeveloping MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
More informationToday s content. Resilient Distributed Datasets(RDDs) Spark and its data model
Today s content Resilient Distributed Datasets(RDDs) ------ Spark and its data model Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing -- Spark By Matei Zaharia,
More informationData Partitioning and MapReduce
Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,
More informationLecture 11 Hadoop & Spark
Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem
More informationChallenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Data Centric Systems and Networking Emergence of Big Data Shift of Communication Paradigm From end-to-end to data
More informationA Non-Relational Storage Analysis
A Non-Relational Storage Analysis Cassandra & Couchbase Alexandre Fonseca, Anh Thu Vu, Peter Grman Cloud Computing - 2nd semester 2012/2013 Universitat Politècnica de Catalunya Microblogging - big data?
More informationTOOLS FOR INTEGRATING BIG DATA IN CLOUD COMPUTING: A STATE OF ART SURVEY
Journal of Analysis and Computation (JAC) (An International Peer Reviewed Journal), www.ijaconline.com, ISSN 0973-2861 International Conference on Emerging Trends in IOT & Machine Learning, 2018 TOOLS
More informationHadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved
Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop
More informationMapReduce. Stony Brook University CSE545, Fall 2016
MapReduce Stony Brook University CSE545, Fall 2016 Classical Data Mining CPU Memory Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining
More informationEmbedded Technosolutions
Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication
More informationEnhanced Hadoop with Search and MapReduce Concurrency Optimization
Volume 114 No. 12 2017, 323-331 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu Enhanced Hadoop with Search and MapReduce Concurrency Optimization
More informationResearch Article Apriori Association Rule Algorithms using VMware Environment
Research Journal of Applied Sciences, Engineering and Technology 8(2): 16-166, 214 DOI:1.1926/rjaset.8.955 ISSN: 24-7459; e-issn: 24-7467 214 Maxwell Scientific Publication Corp. Submitted: January 2,
More informationCloud Computing CS
Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part
More informationWelcome to the New Era of Cloud Computing
Welcome to the New Era of Cloud Computing Aaron Kimball The web is replacing the desktop 1 SDKs & toolkits are there What about the backend? Image: Wikipedia user Calyponte 2 Two key concepts Processing
More informationCloud Computing & Visualization
Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International
More informationD DAVID PUBLISHING. Big Data; Definition and Challenges. 1. Introduction. Shirin Abbasi
Journal of Energy and Power Engineering 10 (2016) 405-410 doi: 10.17265/1934-8975/2016.07.004 D DAVID PUBLISHING Shirin Abbasi Computer Department, Islamic Azad University-Tehran Center Branch, Tehran
More informationJuxtaposition of Apache Tez and Hadoop MapReduce on Hadoop Cluster - Applying Compression Algorithms
, pp.289-295 http://dx.doi.org/10.14257/astl.2017.147.40 Juxtaposition of Apache Tez and Hadoop MapReduce on Hadoop Cluster - Applying Compression Algorithms Dr. E. Laxmi Lydia 1 Associate Professor, Department
More informationKey aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling
Key aspects of cloud computing Cluster Scheduling 1. Illusion of infinite computing resources available on demand, eliminating need for up-front provisioning. The elimination of an up-front commitment
More informationCAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters
2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing : Exploring Memory Locality for Big Data Analytics in Virtualized Clusters Eunji Hwang, Hyungoo Kim, Beomseok Nam and Young-ri
More informationUsing Alluxio to Improve the Performance and Consistency of HDFS Clusters
ARTICLE Using Alluxio to Improve the Performance and Consistency of HDFS Clusters Calvin Jia Software Engineer at Alluxio Learn how Alluxio is used in clusters with co-located compute and storage to improve
More informationIntroduction to Big-Data
Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,
More informationHadoop/MapReduce Computing Paradigm
Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications
More informationmicrosoft
70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series
More informationClash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics
Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Presented by: Dishant Mittal Authors: Juwei Shi, Yunjie Qiu, Umar Firooq Minhas, Lemei Jiao, Chen Wang, Berthold Reinwald and Fatma
More informationWearable Technology Orientation Using Big Data Analytics for Improving Quality of Human Life
Wearable Technology Orientation Using Big Data Analytics for Improving Quality of Human Life Ch.Srilakshmi Asst Professor,Department of Information Technology R.M.D Engineering College, Kavaraipettai,
More informationNext-Generation Cloud Platform
Next-Generation Cloud Platform Jangwoo Kim Jun 24, 2013 E-mail: jangwoo@postech.ac.kr High Performance Computing Lab Department of Computer Science & Engineering Pohang University of Science and Technology
More informationNext-generation IT Platforms Delivering New Value through Accumulation and Utilization of Big Data
Next-generation IT Platforms Delivering New Value through Accumulation and Utilization of Big Data 46 Next-generation IT Platforms Delivering New Value through Accumulation and Utilization of Big Data
More informationAutomated Control for Elastic Storage Harold Lim, Shivnath Babu, Jeff Chase Duke University
D u k e S y s t e m s Automated Control for Elastic Storage Harold Lim, Shivnath Babu, Jeff Chase Duke University Motivation We address challenges for controlling elastic applications, specifically storage.
More informationMOHA: Many-Task Computing Framework on Hadoop
Apache: Big Data North America 2017 @ Miami MOHA: Many-Task Computing Framework on Hadoop Soonwook Hwang Korea Institute of Science and Technology Information May 18, 2017 Table of Contents Introduction
More informationA Fast and High Throughput SQL Query System for Big Data
A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190
More informationCloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018
Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster
More informationComparative Analysis of Range Aggregate Queries In Big Data Environment
Comparative Analysis of Range Aggregate Queries In Big Data Environment Ranjanee S PG Scholar, Dept. of Computer Science and Engineering, Institute of Road and Transport Technology, Erode, TamilNadu, India.
More informationV Conclusions. V.1 Related work
V Conclusions V.1 Related work Even though MapReduce appears to be constructed specifically for performing group-by aggregations, there are also many interesting research work being done on studying critical
More informationResearch Works to Cope with Big Data Volume and Variety. Jiaheng Lu University of Helsinki, Finland
Research Works to Cope with Big Data Volume and Variety Jiaheng Lu University of Helsinki, Finland Big Data: 4Vs Photo downloaded from: https://blog.infodiagram.com/2014/04/visualizing-big-data-concepts-strong.html
More informationIntroduction to BigData, Hadoop:-
Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,
More informationHadoop. copyright 2011 Trainologic LTD
Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides
More informationSurvey on MapReduce Scheduling Algorithms
Survey on MapReduce Scheduling Algorithms Liya Thomas, Mtech Student, Department of CSE, SCTCE,TVM Syama R, Assistant Professor Department of CSE, SCTCE,TVM ABSTRACT MapReduce is a programming model used
More information