An Improved Hadoop Performance Model For Job Estimation And Resource Provisioning In Public Cloud


Volume 118 No. , ISSN: (printed version); ISSN: (on-line version); url: ijpam.eu

An Improved Hadoop Performance Model For Job Estimation And Resource Provisioning In Public Cloud

D. M. Kalai Selvi 1 and Dr. P. Ezhumalai 2
1 M.E. (CSE), R.M.D Engineering College, Kavaraipettai, Chennai, Tamil Nadu, dmkalai@gmail.com
2 HOD of CSE Department, R.M.D Engineering College, Kavaraipettai, Chennai, Tamil Nadu

January 4, 2018

Abstract

Hadoop is an open-source implementation of the MapReduce framework that is widely used for parallel processing in big data analytics. Hadoop users accept the resource plans and offers provided by cloud service providers to lease the resources needed to process their Hadoop jobs and pay per use. However, there is no efficient resource provisioning mechanism that can complete a job within a specified deadline; it is the user's responsibility to manually adjust the resources required to process their Hadoop jobs. We therefore propose an improved HP (Hadoop Performance) model that uses the Hive bucketing and partitioning concepts to reduce the long duration of the shuffle sort and complete the job efficiently within the desired deadline. Furthermore, efficient resource provisioning is obtained by using a feedback system and a user profile to determine each user's exact needs, and we suggest an optimal plan for their resource requirements.

Key Words: Hadoop; MapReduce; Cloud Service Providers; Hive; Resource Provisioning; Bucket.

1 Introduction

Raw data is evolving rapidly from sources all around us, such as sensors, blogs, social networking sites, and aviation. This has led to the phenomenon called Big Data, which has moved IT solutions into another dimension, where not only computer science but also fields such as electronics, civil engineering, and transport need IT solutions for their business. For example, police departments in the U.S. install traffic cameras that read the license plates of automobiles to identify stolen vehicles in real time. The Hadoop framework can handle both structured and unstructured data; across whole big data sets, about 80% of the data is unstructured and only 20% is structured. [1] Based on the challenges it poses, Big Data can be characterized by four dimensions:

VOLUME: to manage large volumes of data.
VELOCITY: to manage rapidly arriving data.
VARIETY: to manage both structured and unstructured data.
VERACITY: to validate the correctness of large amounts of arriving data.

Cloud service providers such as Amazon EC2 enable users to configure the resources needed for their applications, but in its current form the EC2 cloud does not support Hadoop jobs with deadline constraints. It is therefore the sole responsibility of the user to assign the resources necessary to complete the job within the specified deadline, which is a highly challenging task, so it is necessary to focus on Hadoop performance modeling, which is itself a critical task. A Hadoop job has multiple processing phases, with three core phases: the map phase, the shuffle phase, and the reduce phase. The first wave of the shuffle phase is processed in parallel with the map phase and is called the overlapping stage. The other waves of the shuffle phase are processed

after the map phase completes; this is called the non-overlapping stage. The existing Hadoop performance model does not have the capability to automate resource provisioning and complete a Hadoop job within the specified constraints. Furthermore, the number of reduce slots is strictly constant. We therefore propose an improved Hadoop performance model that automatically configures the resources required by the end user and completes the job within the deadline. The contents of this paper are organized as follows. In Section 2 we review the related literature. Section 3 presents the problem analysis, Section 4 our proposed work, and Sections 5 and 6 our experimental results and conclusion.

2 Literature Review

[2] Starfish is a MADDER (MAD: Magnetism, Agility, Depth) self-tuning system for big data analytics. The main goal of Starfish is to enable Hadoop users and applications to get good performance automatically throughout the data lifecycle, without having to understand the users' needs manually and allocate resources by tuning various knobs. Starfish collects detailed information about a Hadoop job at a very fine granularity to automate job estimation, which is a burden. [3] The general problem in forming a cluster is to determine the resources and configure them for a MapReduce job so as to meet the desired constraints, such as cost, deadline, and execution time, for a given big data analytics workload; this is referred to as the cluster sizing problem. In this paper, Elastisizer, a system added on top of the Hadoop stack, automates the allocation of resources to meet the requirements of massive data analytics; Elastisizer provides reliable answers to cluster sizing queries in an automated fashion by using detailed Hadoop job profiles. However, Elastisizer needs detailed job profile information, which increases the overhead and leads to high execution times.
[4] In this paper, the proposed Hadoop performance model considers both overlapping and non-overlapping stages. It uses a scaling factor to automatically scale the resource allocation up or down within the specified timeframe. Although this Hadoop performance model increases the possibility of automatic resource allocation, it relies on simple linear regression, and the HP model is restricted to a constant number of reduce tasks. [5][6] CRESP provides automatic estimation of job execution time and resource provisioning in an optimal manner. In general, multiple waves give better performance than a single wave of the reduce phase, but in CRESP the number of reduce tasks must equal the number of reduce slots, and it considers only a single wave of the reduce phase. Traditionally, a Hadoop cluster environment is assumed to be homogeneous, but heterogeneous Hadoop clusters are now proliferating because many companies run ever more parallel data-intensive applications. Due to the heterogeneity of resources in such clusters, Hadoop job completion times become inefficient and resource bottlenecks arise. In [9] the authors propose bound-based performance modeling of MapReduce job completion times in heterogeneous Hadoop clusters, but they neither explore the properties of Hadoop job profiles nor analyze the breakdown of Hadoop job execution on the different types of heterogeneous nodes in the cluster. In [10], the authors propose a framework called ARIA (Automatic Resource Inference and Allocation) for MapReduce environments to solve the problem of automatic resource allocation. It contains three inter-related components: first, a job profile that summarizes the performance characteristics of the underlying application during the map and reduce stages; second, a MapReduce performance model for the given job and its SLO (Service Level Objective) that estimates the amount of resources required to complete the job within the deadline; and finally an SLO scheduler based on EDF (Earliest Deadline First) that determines job ordering and allocates resources. The major disadvantage of this framework is that it does not consider node failures.
In [11] the authors propose a framework called AROMA (Automated Resource Allocation and Configuration of MapReduce Environments). It was developed to automate resource provisioning in heterogeneous clouds and to tune Hadoop configuration parameters so as to achieve performance goals while minimizing the incurred cost. The major disadvantages of this system are its ineffective configuration and its lack of support for multi-stage job workflows. [12] introduces the Parallax progress estimator, which estimates performance by predicting the remaining time of MapReduce pipelines from the time already elapsed. The key strategy the authors use is to divide the MapReduce process into five key pipelines and estimate the remaining time of each from the time elapsed so far. However, this approach does not provide an explicit cost function that can be used in optimization problems. [13] also proposes optimal resource provisioning with minimal cost, but with some prediction error. To improve the efficiency and reliability of the Hadoop performance model, to provide a user-friendly automated resource provisioning scheme, and to complete jobs effectively within the predefined constraints, an improved Hadoop performance model is proposed. The major contributions of this paper are:

The improved Hadoop performance model covers all the core phases of the MapReduce framework, i.e., the map, shuffle, partitioner, combiner, and reduce phases.

The improved HP model uses the Hive bucketing and partitioning concepts to reduce the long duration of the shuffle sort and complete the job efficiently within the desired deadline.

Furthermore, efficient resource provisioning is obtained using a feedback system and a user profile to determine each user's exact needs, and an optimal plan for their resource requirements is suggested.

The performance of the improved HP model is evaluated on the Amazon EC2 cloud, and the evaluation results show that the improved HP model outperforms both the HP and Starfish models.

3 Problem Analysis

3.1 Modeling Hadoop Job Phases

In Hadoop, any job that needs parallel processing and high performance enters the MapReduce framework. The entered job passes through three core phases: the map, shuffle, and reduce phases. In the existing system, fine-grained information about the job is collected [4], which is useful to complete

the job within the deadline. Map tasks are executed in map slots and reduce tasks in reduce slots; each slot runs one task at a time, and slots are assigned in terms of CPU and RAM. The map and reduce phases can be executed in a single wave or in multiple waves. [7]

3.1.1 Map Phase

The map phase accepts an input dataset, in key-value pair format, from the Hadoop Distributed File System (HDFS). Datasets intended for parallel processing are mostly stored in NoSQL databases: key-value stores, column-family stores, graph databases, or document databases. The map phase first splits the input dataset into blocks (64 MB by default), where each block is treated as a map task and assigned a map slot. After the map phase is processed, the intermediate key-value results are stored in a buffer rather than on disk, to avoid the complications of replication. The total execution time for the map phase is

T_mtotal = (T_mavg * N_m) / N_mslot    (1)

3.1.2 Shuffle Phase

The process of transferring the map output, in sorted order, to the reduce phase in order to minimize the reducer's work is called the shuffle. In this phase, the Hadoop job fetches the intermediate map output and copies it to one or more reducers. If N_r <= N_rslot, the shuffle phase completes in a single wave, and its total execution time is

T_stotal = (T_savg * N_r) / N_rslot    (2)

Otherwise, the shuffle phase completes in multiple waves, and its total execution time is

T_stotal = ((T_W1avg * N_W1) + ... + (T_Wnavg * N_Wn)) / N_rslot    (3)

Fig. 1 describes the job execution flow of the existing Hadoop performance model.
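As an illustration of the phase equations above, the following is a minimal Python sketch that evaluates Eqs. (1)-(3). It is our illustration, not part of the paper's implementation, and all variable names and the sample numbers are assumptions.

```python
# Illustrative evaluation of the HP-model phase-time equations (1)-(3).
# Variable names mirror the paper's notation (T_mavg, N_m, N_mslot, ...).

def map_phase_time(t_m_avg, n_m, n_m_slot):
    """Eq. (1): average map-task time * number of map tasks / map slots."""
    return t_m_avg * n_m / n_m_slot

def shuffle_phase_time(wave_avgs, wave_counts, n_r_slot):
    """Eq. (2) when a single (avg, count) wave is given; Eq. (3) when
    several waves are given: sum over waves, divided by reduce slots."""
    return sum(t * n for t, n in zip(wave_avgs, wave_counts)) / n_r_slot

# Example: 64 map tasks averaging 12 s on 8 map slots -> 96 s.
print(map_phase_time(12.0, 64, 8))                   # 96.0
# Two shuffle waves (30 s x 8 tasks, 25 s x 4 tasks) on 4 reduce slots -> 85 s.
print(shuffle_phase_time([30.0, 25.0], [8, 4], 4))   # 85.0
```

A single-wave shuffle is just the special case of one (average, count) pair, so Eq. (2) needs no separate code path.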

Fig. 1. Hadoop job execution flow. Table 1.1 defines the variables used.

3.1.3 Reduce Phase

The reduce phase is a user-defined function in which users can customize their processing logic. The customized reduce function processes the intermediate map output and produces the final output, which is usually stored in HDFS. The Hadoop performance model supports both overlapping and non-overlapping stages. The total execution time for the reduce phase is

T_rtotal = (T_ravg * N_r) / N_rslot    (4)

4 Proposed Work

4.1 Improved Hadoop Performance Model: Job Estimation Model

The improved Hadoop performance model handles both overlapping and non-overlapping stages. Generally, when processing in multiple waves, the first wave of the shuffle phase starts immediately after the first wave of the map phase completes, but the shuffle can finish only after all waves of the map phase have run, which creates a long execution time for the shuffle phase; this can be overcome with Hive. When an existing data infrastructure based on a relational database is moved into Hadoop, this can be achieved with Hive, whose HQL (Hive Query Language), similar to SQL (Structured Query Language), is used to fetch and process data from a structured data infrastructure. [7][8] The two most important Hive concepts used to overcome the shuffle-phase latency are bucketing and partitioning. When the input datasets are bucketed on a column using a hash function, along with partitioning, the overhead of large datasets is reduced; the bucketing and partitioning commands are issued in HQL through the command-line interface or a web user interface according to user needs. Because the datasets are already bucketed, i.e., sorted on columns, the shuffling work is reduced, which improves performance and further helps to complete the job within the deadline. The estimated

execution time for a Hadoop job is finalized from the user profile, which collects detailed information about Hadoop jobs such as data size, map tasks, reduce tasks, and their durations. When a job processes an input dataset of increasing size, the number of map tasks increases proportionally, while the number of reduce tasks is specified by the user in the configuration file and can vary depending on the user's configuration. When the number of reduce tasks is kept constant, the execution durations of both the shuffle tasks and the reduce tasks increase linearly with the size of the input dataset, as assumed in the HP model. This is because the volume of an intermediate data block equals the total volume of the generated intermediate data divided by the number of reduce tasks; as a result, the volume of an intermediate data block also increases linearly with the size of the input dataset. However, when the number of reduce tasks varies, the execution durations of the shuffle and reduce tasks are no longer linear in the size of the input dataset. In the improved HP model, we consider a varied number of reduce slots based on the user profile, which records the input data sizes, execution times, and numbers of map and reduce slots from past runs and thus helps to estimate the varied number of reduce slots.

4.2 Resource Provisioning Model

Starfish and other Hadoop performance models [2][3][4][5] support manual or self-tuning of the resources needed for job execution. Resource provisioning plays a key role, since a job can be completed within a specified deadline, say t, only when sufficient resources have been provided for its processing. The improved HP model supports automatic resource provisioning, without manipulating knob parameters, by using a user feedback system.
The user feedback system collects fine-grained information about the history of already processed jobs and assigns a weight or priority to each job based on previous logs, so that highly critical tasks are provided resources immediately and automatically. Since even highly critical tasks are provisioned proactively, this helps complete Hadoop jobs within the specified constraints.
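To make the provisioning idea concrete, here is a hedged Python sketch, our illustration rather than the paper's code, that combines the phase estimates of Eqs. (1), (2), and (4) into a total job time and then picks the cheapest resource plan whose estimate meets the deadline. The plan and job figures are invented for the example.

```python
# Deadline-aware plan selection sketch based on the HP-model phase equations.

def total_job_time(t_m_avg, n_m, t_s_avg, t_r_avg, n_r, n_m_slot, n_r_slot):
    """Eqs. (1), (2), (4): map + shuffle + reduce time (single shuffle wave)."""
    return (t_m_avg * n_m / n_m_slot
            + t_s_avg * n_r / n_r_slot
            + t_r_avg * n_r / n_r_slot)

def pick_plan(plans, job, deadline):
    """Return the cheapest plan whose estimated completion time <= deadline,
    or None when no plan is feasible."""
    feasible = [p for p in plans
                if total_job_time(job["t_m_avg"], job["n_m"], job["t_s_avg"],
                                  job["t_r_avg"], job["n_r"],
                                  p["m_slots"], p["r_slots"]) <= deadline]
    return min(feasible, key=lambda p: p["price"]) if feasible else None

# Hypothetical job profile: 80 map tasks at 10 s, 8 reduce tasks,
# 20 s average shuffle and 15 s average reduce per task.
job = {"t_m_avg": 10.0, "n_m": 80, "t_s_avg": 20.0, "t_r_avg": 15.0, "n_r": 8}
plans = [{"m_slots": 4,  "r_slots": 2, "price": 20},
         {"m_slots": 8,  "r_slots": 4, "price": 40},
         {"m_slots": 16, "r_slots": 8, "price": 80}]
print(pick_plan(plans, job, deadline=200.0)["price"])  # 40
```

With a 200 s deadline the 4-slot plan (estimate 340 s) is infeasible, so the model settles on the 8-slot plan at price 40 rather than overprovisioning the 16-slot plan, which is the behavior the feedback-driven provisioning above aims for.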


5 Performance Evaluation

5.1 Experimental Setup

The performance of the improved Hadoop performance model was evaluated on two experimental setups, the Amazon EC2 cloud and the Google cloud. The experimental Hadoop cluster was set up on the Amazon EC2 cloud using 20 m1.large instances. The specifications of the m1.large. In this cluster, we used Hadoop and configured one instance as the NameNode and the other 19 instances as DataNodes; the NameNode was also used as a DataNode. The HDFS data block size was set to 64 MB and the replication level of each data block was set to 3. Each instance was configured with one map slot and one reduce slot. We ran the WordCount application on the Hadoop cluster and employed Starfish to collect the job profiles. For each application running on each cluster, we conducted 10 tests; each test was run 5 times and the average phase durations were taken.

Fig. 4. The performance of the WordCount application with a varied number of reduce tasks.

The following graph describes the performance of the improved HP model versus the HP model for job estimation time.

The following graph describes the estimated resource provisioning and compares the resource provisioning performance of the HP model versus the improved HP model.

The improved Hadoop performance model was also evaluated in the Google cloud; the project scenario used for that evaluation is described here. In the Google cloud, a file directory holding a list of files representing famous song details (the big data) was stored. A list of plans specifying the configuration of map and reduce slots was stored in a separate Hive table, the contents of the file were stored in a Hive table in comma-separated format, and the details of the user profile regarding the required resource configuration, together with the user feedback, were stored in a Hive database. The file details were analyzed, the big data was partitioned into buckets using a hash function on the major columns, and the list of optimal resource plans was displayed along with the optimal resource configuration. Using Compute Engine, a virtual machine was initialized, and Hadoop was installed in that VM along with Hive 1.2.x and Pig for manipulating the big data. The file directory was transferred from the local file system to the Hadoop Distributed File System (HDFS).

The file, stored in .csv format, contains rows such as:

9,98993,3,Bengali,Contemporary celtic,3,anjan Dutt,paid,album
9,105156,3,Bengali,Contemporary celtic,3,anjan Dutt,paid,album

9,113452,3,Bengali,Contemporary celtic,3,anjan Dutt,paid,album
9,119398,2,Bengali,Contemporary celtic,3,anjan Dutt,free,movie
9,130766,3,Bengali,Contemporary celtic,2,anjan Dutt,free,movie
9,133420,3,Kannada,Carnival songs,2,anu Malik,free,movie
10,11091,4,Kannada,Carnival songs,2,anu Malik,free,movie
10,19275,3,Kannada,Carnival songs,2,anu Malik,free,movie
10,21140,4,Kannada,Carnival songs,2,anu Malik,free,album
10,44487,3,Kannada,Carnival songs,1,anu Malik,free,album
10,49147,1,Kannada,Carnival songs,1,anu Malik,free,album
10,54923,5,Kannada,Contemporary celtic,1,anu Malik,free,album
10,63451,3,Kannada,Contemporary celtic,1,anu Malik,free,album
10,75206,3,Kannada,Contemporary celtic,1,anu Malik,paid,movie
10,104641,1,Kannada,Contemporary celtic,5,anu Malik,paid,movie
10,110054,3,Kannada,Contemporary celtic,5,anu Malik,paid,movie

Command to load file.csv (Pig) and count its rows:

LOGS = LOAD '/user/root/songs.csv';
LOGS_GROUP = GROUP LOGS ALL;
LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT(LOGS);
STORE LOG_COUNT INTO '/user/hive/warehouse/resource.db/filedetails/songs.txt';

use resource;
drop table if exists filedetails;
CREATE EXTERNAL TABLE filedetails (totalrow string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 'hdfs://localhost:9000/user/hive/warehouse/resource.db/filedetails/songs.txt';

Command to analyze the plan:

set hive.cli.print.header=true;
use resource;
drop table if exists analysis;

create table analysis as
select a.id, a.memory, a.processor, a.ssd, a.transfer, a.price, a.size
from plans a JOIN filedetails b
where a.size >= cast(b.totalrow as int);

select a.id, a.memory, a.processor, a.ssd, a.transfer, a.price, a.size, b.feedback, b.name
from analysis a JOIN feedback b
where a.id = b.planid;

Sample contents stored in Hive table "analysis" (id | memory | processor | ssd | transfer | price):
3 | 2GB | 2 Core | 40 GB | 3 TB | 20$
4 | GB | 2 Core | 60 GB | 4 TB | 40$
5 | GB | 4 Core | 80 GB | 5 TB | 80$
6 | GB | 8 Core | 160 GB | 6 TB | 160$
7 | GB | 12 Core | 320 GB | 7 TB | 320$
8 | GB | 16 Core | 480 GB | 8 TB | 480$
9 | GB | 20 Core | 640 GB | 9 TB | 640$

Feedback (Timestamp | Name | Email | Feedback):
:00 | vaishali.m | mvaishali1996@gmail.com | It would be even more useful and clear if we had the option of working it live... and practical
:00 | yukti sharma | dana.pink@yahoo.co.in | it was good. need more time to elaborate. next time try for longer instances
:00 | subhash | subhashthaya@gmail.com | it is very nice and i have learnt many things about google design cloud during this session
:00 | akhil vemuri | micheal.akhil.king@gmail.com | good but expecting more
:01 | Devang P Patel | patell.deva@gmail.com | the cloud was good
:01 | H.VIGNESH | vickynarrayanan@gmail.com | awesome experience we had but time is not enough!!!
:01 | vidya varsshini | vidyavars1995@gmail.com | its very nice but can make it even more better
:01 | subathra.r | subaramachandran7@gmail.com | good... but not so effective

:01 | sruthi r | sruthi.rdsk.1996@gmail.com | nice.. but little bit lag
:01 | SUSHMITHA.U | sushmiudhay@gmail.com | could have been more effective if it was more storage
:01 | vijayashankar | vijayashankarenginer@outlook.com | good to use this pack. Great experience
:01 | venkatraman.d | vengatvenkie@gmail.com | its nice cloud and very good one for improving our knowledge

Plans (id | memory | processor | ssd | transfer | price):
1 | 512MB | 1 Core | 20 GB | 1 TB | 5$
2 | GB | 1 Core | 30 GB | 2 TB | 10$
3 | 2GB | 2 Core | 40 GB | 3 TB | 20$
4 | GB | 2 Core | 60 GB | 4 TB | 40$
5 | GB | 4 Core | 80 GB | 5 TB | 80$
6 | GB | 8 Core | 160 GB | 6 TB | 160$
7 | GB | 12 Core | 320 GB | 7 TB | 320$
8 | GB | 16 Core | 480 GB | 8 TB | 480$
9 | GB | 20 Core | 640 GB | 9 TB | 640$

Output
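The plan-analysis step above can be mimicked in plain Python. This is an illustrative sketch only: the sample plan values are invented, and the comparison between a plan's size and the counted row total (here >=) is our assumption, since the operator is garbled in the source query.

```python
# Illustrative Python analogue of the Hive 'analysis' query: keep the plans
# whose 'size' field can accommodate the row count produced by the Pig job.

plans = [
    {"id": 1, "price": 5,   "size": 10},   # too small for this file
    {"id": 3, "price": 20,  "size": 50},   # large enough
    {"id": 9, "price": 640, "size": 5},    # too small
]
filedetails = {"totalrow": "16"}  # row count written by the Pig script

def analyze(plans, filedetails):
    """Filter plans, mirroring: a.size >= cast(b.totalrow as int)."""
    rows = int(filedetails["totalrow"])
    return [p for p in plans if p["size"] >= rows]

print([p["id"] for p in analyze(plans, filedetails)])  # [3]
```

Joining the surviving plans with the feedback table, as the second HQL query does, would then just be a second filter on matching plan ids.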

6 Conclusion

Running a MapReduce Hadoop job on a public cloud such as Amazon EC2 necessitates a performance model to estimate the job execution time and, further, to provision a certain amount of resources for the job to complete within a given deadline. This paper has presented an improved HP model to achieve this goal, taking into account multiple waves of the shuffle phase of a Hadoop job. The experimental results showed that the improved HP model outperforms both Starfish and the HP model in job execution estimation and resource provisioning. One piece of future work would be to consider the dynamic overhead of the VMs involved in running the user jobs, to minimize resource overprovisioning.

References

[1]
[2] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu, "Starfish: A Self-tuning System for Big Data Analytics," in CIDR, 2011.
[3] H. Herodotou, F. Dong, and S. Babu, "No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics," in Proceedings of the 2nd ACM Symposium on Cloud Computing (SOCC '11), 2011.
[4] A. Verma, L. Cherkasova, and R. H. Campbell, "Resource provisioning framework for MapReduce jobs with performance goals," in Proceedings of the 12th ACM/IFIP/USENIX International Conference on Middleware, 2011.
[5] K. Chen, J. Powers, S. Guo, and F. Tian, "CRESP: Towards Optimal Resource Provisioning for MapReduce Computing in Public Clouds," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 6.
[6] H. Herodotou, "Hadoop Performance Models." [Online]. Available: [Accessed: 22-Oct-2013].
[7] T. White, Hadoop: The Definitive Guide, O'Reilly Press.

[8] E. Capriolo, D. Wampler, and J. Rutherglen, Programming Hive, O'Reilly, 2012.
[9] Z. Zhang, L. Cherkasova, and B. T. Loo, "Performance Modeling of MapReduce Jobs in Heterogeneous Cloud Environments," in Proceedings of the 2013 IEEE Sixth



Survey Paper on Traditional Hadoop and Pipelined Map Reduce International Journal of Computational Engineering Research Vol, 03 Issue, 12 Survey Paper on Traditional Hadoop and Pipelined Map Reduce Dhole Poonam B 1, Gunjal Baisa L 2 1 M.E.ComputerAVCOE, Sangamner,

More information

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

Copyright 2012, Oracle and/or its affiliates. All rights reserved. 1 Big Data Connectors: High Performance Integration for Hadoop and Oracle Database Melli Annamalai Sue Mavris Rob Abbott 2 Program Agenda Big Data Connectors: Brief Overview Connecting Hadoop with Oracle

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

TI2736-B Big Data Processing. Claudia Hauff

TI2736-B Big Data Processing. Claudia Hauff TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl Intro Streams Streams Map Reduce HDFS Pig Pig Design Patterns Hadoop Ctd. Graphs Giraph Spark Zoo Keeper Spark Learning objectives Implement

More information

Batch Inherence of Map Reduce Framework

Batch Inherence of Map Reduce Framework Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.287

More information

Available online at ScienceDirect. Procedia Computer Science 98 (2016 )

Available online at   ScienceDirect. Procedia Computer Science 98 (2016 ) Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 98 (2016 ) 554 559 The 3rd International Symposium on Emerging Information, Communication and Networks Integration of Big

More information

Join Processing for Flash SSDs: Remembering Past Lessons

Join Processing for Flash SSDs: Remembering Past Lessons Join Processing for Flash SSDs: Remembering Past Lessons Jaeyoung Do, Jignesh M. Patel Department of Computer Sciences University of Wisconsin-Madison $/MB GB Flash Solid State Drives (SSDs) Benefits of

More information

Data Analysis Using MapReduce in Hadoop Environment

Data Analysis Using MapReduce in Hadoop Environment Data Analysis Using MapReduce in Hadoop Environment Muhammad Khairul Rijal Muhammad*, Saiful Adli Ismail, Mohd Nazri Kama, Othman Mohd Yusop, Azri Azmi Advanced Informatics School (UTM AIS), Universiti

More information

No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics

No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics Technical Report CS-2011-06, Computer Science, Duke University Herodotos Herodotou Duke University hero@cs.duke.edu

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models RCFile: A Fast and Space-efficient Data

More information

LITERATURE SURVEY (BIG DATA ANALYTICS)!

LITERATURE SURVEY (BIG DATA ANALYTICS)! LITERATURE SURVEY (BIG DATA ANALYTICS) Applications frequently require more resources than are available on an inexpensive machine. Many organizations find themselves with business processes that no longer

More information

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)!

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)! Big Data Processing, 2014/15 Lecture 7: MapReduce design patterns!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm

More information

Mitigating Data Skew Using Map Reduce Application

Mitigating Data Skew Using Map Reduce Application Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,

More information

Hadoop Online Training

Hadoop Online Training Hadoop Online Training IQ training facility offers Hadoop Online Training. Our Hadoop trainers come with vast work experience and teaching skills. Our Hadoop training online is regarded as the one of the

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Best Practices and Performance Tuning on Amazon Elastic MapReduce

Best Practices and Performance Tuning on Amazon Elastic MapReduce Best Practices and Performance Tuning on Amazon Elastic MapReduce Michael Hanisch Solutions Architect Amo Abeyaratne Big Data and Analytics Consultant ANZ 12.04.2016 2016, Amazon Web Services, Inc. or

More information

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344 Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)

More information

MC 2 : Map Concurrency Characterization for MapReduce on the Cloud

MC 2 : Map Concurrency Characterization for MapReduce on the Cloud MC 2 : Map Concurrency Characterization for Map on the Cloud Mohammad Hammoud and Majd F. Sakr Carnegie Mellon University in Qatar Education City, Doha, State of Qatar Emails: {mhhammou, msakr}@qatar.cmu.edu

More information

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Nikos Zacheilas, Vana Kalogeraki Department of Informatics Athens University of Economics and Business 1 Big Data era has arrived!

More information

Performance Modeling and Optimization of Deadline-Driven Pig Programs

Performance Modeling and Optimization of Deadline-Driven Pig Programs Performance Modeling and Optimization of Deadline-Driven Pig Programs ZHUOYAO ZHANG, HP Labs and University of Pennsylvania LUDMILA CHERKASOVA, Hewlett-Packard Labs ABHISHEK VERMA, HP Labs and University

More information

Statistics Driven Workload Modeling for the Cloud

Statistics Driven Workload Modeling for the Cloud UC Berkeley Statistics Driven Workload Modeling for the Cloud Archana Ganapathi, Yanpei Chen Armando Fox, Randy Katz, David Patterson SMDB 2010 Data analytics are moving to the cloud Cloud computing economy

More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel

More information

The Stratosphere Platform for Big Data Analytics

The Stratosphere Platform for Big Data Analytics The Stratosphere Platform for Big Data Analytics Hongyao Ma Franco Solleza April 20, 2015 Stratosphere Stratosphere Stratosphere Big Data Analytics BIG Data Heterogeneous datasets: structured / unstructured

More information

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since

More information

A Comparison of MapReduce Join Algorithms for RDF

A Comparison of MapReduce Join Algorithms for RDF A Comparison of MapReduce Join Algorithms for RDF Albert Haque 1 and David Alves 2 Research in Bioinformatics and Semantic Web Lab, University of Texas at Austin 1 Department of Computer Science, 2 Department

More information

Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data

Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data Yojna Arora, Dinesh Goyal Abstract: Big Data refers to that huge amount of data which cannot be analyzed by using traditional analytics

More information

Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R

Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R Table of Contents Introduction... 3 Topology Awareness in Hadoop... 3 Virtual Hadoop... 4 HVE Solution... 5 Architecture...

More information

Resource and Performance Distribution Prediction for Large Scale Analytics Queries

Resource and Performance Distribution Prediction for Large Scale Analytics Queries Resource and Performance Distribution Prediction for Large Scale Analytics Queries Prof. Rajiv Ranjan, SMIEEE School of Computing Science, Newcastle University, UK Visiting Scientist, Data61, CSIRO, Australia

More information

SoftNAS Cloud Performance Evaluation on AWS

SoftNAS Cloud Performance Evaluation on AWS SoftNAS Cloud Performance Evaluation on AWS October 25, 2016 Contents SoftNAS Cloud Overview... 3 Introduction... 3 Executive Summary... 4 Key Findings for AWS:... 5 Test Methodology... 6 Performance Summary

More information

Big Data Using Hadoop

Big Data Using Hadoop IEEE 2016-17 PROJECT LIST(JAVA) Big Data Using Hadoop 17ANSP-BD-001 17ANSP-BD-002 Hadoop Performance Modeling for JobEstimation and Resource Provisioning MapReduce has become a major computing model for

More information

MixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp

MixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp MixApart: Decoupled Analytics for Shared Storage Systems Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp Hadoop Pig, Hive Hadoop + Enterprise storage?! Shared storage

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

Certified Big Data and Hadoop Course Curriculum

Certified Big Data and Hadoop Course Curriculum Certified Big Data and Hadoop Course Curriculum The Certified Big Data and Hadoop course by DataFlair is a perfect blend of in-depth theoretical knowledge and strong practical skills via implementation

More information

A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING

A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING Journal homepage: www.mjret.in ISSN:2348-6953 A SURVEY ON SCHEDULING IN HADOOP FOR BIGDATA PROCESSING Bhavsar Nikhil, Bhavsar Riddhikesh,Patil Balu,Tad Mukesh Department of Computer Engineering JSPM s

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Today s content. Resilient Distributed Datasets(RDDs) Spark and its data model

Today s content. Resilient Distributed Datasets(RDDs) Spark and its data model Today s content Resilient Distributed Datasets(RDDs) ------ Spark and its data model Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing -- Spark By Matei Zaharia,

More information

Data Partitioning and MapReduce

Data Partitioning and MapReduce Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Challenges for Data Driven Systems

Challenges for Data Driven Systems Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Data Centric Systems and Networking Emergence of Big Data Shift of Communication Paradigm From end-to-end to data

More information

A Non-Relational Storage Analysis

A Non-Relational Storage Analysis A Non-Relational Storage Analysis Cassandra & Couchbase Alexandre Fonseca, Anh Thu Vu, Peter Grman Cloud Computing - 2nd semester 2012/2013 Universitat Politècnica de Catalunya Microblogging - big data?

More information

TOOLS FOR INTEGRATING BIG DATA IN CLOUD COMPUTING: A STATE OF ART SURVEY

TOOLS FOR INTEGRATING BIG DATA IN CLOUD COMPUTING: A STATE OF ART SURVEY Journal of Analysis and Computation (JAC) (An International Peer Reviewed Journal), www.ijaconline.com, ISSN 0973-2861 International Conference on Emerging Trends in IOT & Machine Learning, 2018 TOOLS

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

MapReduce. Stony Brook University CSE545, Fall 2016

MapReduce. Stony Brook University CSE545, Fall 2016 MapReduce Stony Brook University CSE545, Fall 2016 Classical Data Mining CPU Memory Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

Enhanced Hadoop with Search and MapReduce Concurrency Optimization

Enhanced Hadoop with Search and MapReduce Concurrency Optimization Volume 114 No. 12 2017, 323-331 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu Enhanced Hadoop with Search and MapReduce Concurrency Optimization

More information

Research Article Apriori Association Rule Algorithms using VMware Environment

Research Article Apriori Association Rule Algorithms using VMware Environment Research Journal of Applied Sciences, Engineering and Technology 8(2): 16-166, 214 DOI:1.1926/rjaset.8.955 ISSN: 24-7459; e-issn: 24-7467 214 Maxwell Scientific Publication Corp. Submitted: January 2,

More information

Cloud Computing CS

Cloud Computing CS Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part

More information

Welcome to the New Era of Cloud Computing

Welcome to the New Era of Cloud Computing Welcome to the New Era of Cloud Computing Aaron Kimball The web is replacing the desktop 1 SDKs & toolkits are there What about the backend? Image: Wikipedia user Calyponte 2 Two key concepts Processing

More information

Cloud Computing & Visualization

Cloud Computing & Visualization Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International

More information

D DAVID PUBLISHING. Big Data; Definition and Challenges. 1. Introduction. Shirin Abbasi

D DAVID PUBLISHING. Big Data; Definition and Challenges. 1. Introduction. Shirin Abbasi Journal of Energy and Power Engineering 10 (2016) 405-410 doi: 10.17265/1934-8975/2016.07.004 D DAVID PUBLISHING Shirin Abbasi Computer Department, Islamic Azad University-Tehran Center Branch, Tehran

More information

Juxtaposition of Apache Tez and Hadoop MapReduce on Hadoop Cluster - Applying Compression Algorithms

Juxtaposition of Apache Tez and Hadoop MapReduce on Hadoop Cluster - Applying Compression Algorithms , pp.289-295 http://dx.doi.org/10.14257/astl.2017.147.40 Juxtaposition of Apache Tez and Hadoop MapReduce on Hadoop Cluster - Applying Compression Algorithms Dr. E. Laxmi Lydia 1 Associate Professor, Department

More information

Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling

Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling Key aspects of cloud computing Cluster Scheduling 1. Illusion of infinite computing resources available on demand, eliminating need for up-front provisioning. The elimination of an up-front commitment

More information

CAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters

CAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing : Exploring Memory Locality for Big Data Analytics in Virtualized Clusters Eunji Hwang, Hyungoo Kim, Beomseok Nam and Young-ri

More information

Using Alluxio to Improve the Performance and Consistency of HDFS Clusters

Using Alluxio to Improve the Performance and Consistency of HDFS Clusters ARTICLE Using Alluxio to Improve the Performance and Consistency of HDFS Clusters Calvin Jia Software Engineer at Alluxio Learn how Alluxio is used in clusters with co-located compute and storage to improve

More information

Introduction to Big-Data

Introduction to Big-Data Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,

More information

Hadoop/MapReduce Computing Paradigm

Hadoop/MapReduce Computing Paradigm Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications

More information

microsoft

microsoft 70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series

More information

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Presented by: Dishant Mittal Authors: Juwei Shi, Yunjie Qiu, Umar Firooq Minhas, Lemei Jiao, Chen Wang, Berthold Reinwald and Fatma

More information

Wearable Technology Orientation Using Big Data Analytics for Improving Quality of Human Life

Wearable Technology Orientation Using Big Data Analytics for Improving Quality of Human Life Wearable Technology Orientation Using Big Data Analytics for Improving Quality of Human Life Ch.Srilakshmi Asst Professor,Department of Information Technology R.M.D Engineering College, Kavaraipettai,

More information

Next-Generation Cloud Platform

Next-Generation Cloud Platform Next-Generation Cloud Platform Jangwoo Kim Jun 24, 2013 E-mail: jangwoo@postech.ac.kr High Performance Computing Lab Department of Computer Science & Engineering Pohang University of Science and Technology

More information

Next-generation IT Platforms Delivering New Value through Accumulation and Utilization of Big Data

Next-generation IT Platforms Delivering New Value through Accumulation and Utilization of Big Data Next-generation IT Platforms Delivering New Value through Accumulation and Utilization of Big Data 46 Next-generation IT Platforms Delivering New Value through Accumulation and Utilization of Big Data

More information

Automated Control for Elastic Storage Harold Lim, Shivnath Babu, Jeff Chase Duke University

Automated Control for Elastic Storage Harold Lim, Shivnath Babu, Jeff Chase Duke University D u k e S y s t e m s Automated Control for Elastic Storage Harold Lim, Shivnath Babu, Jeff Chase Duke University Motivation We address challenges for controlling elastic applications, specifically storage.

More information

MOHA: Many-Task Computing Framework on Hadoop

MOHA: Many-Task Computing Framework on Hadoop Apache: Big Data North America 2017 @ Miami MOHA: Many-Task Computing Framework on Hadoop Soonwook Hwang Korea Institute of Science and Technology Information May 18, 2017 Table of Contents Introduction

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

Comparative Analysis of Range Aggregate Queries In Big Data Environment

Comparative Analysis of Range Aggregate Queries In Big Data Environment Comparative Analysis of Range Aggregate Queries In Big Data Environment Ranjanee S PG Scholar, Dept. of Computer Science and Engineering, Institute of Road and Transport Technology, Erode, TamilNadu, India.

More information

V Conclusions. V.1 Related work

V Conclusions. V.1 Related work V Conclusions V.1 Related work Even though MapReduce appears to be constructed specifically for performing group-by aggregations, there are also many interesting research work being done on studying critical

More information

Research Works to Cope with Big Data Volume and Variety. Jiaheng Lu University of Helsinki, Finland

Research Works to Cope with Big Data Volume and Variety. Jiaheng Lu University of Helsinki, Finland Research Works to Cope with Big Data Volume and Variety Jiaheng Lu University of Helsinki, Finland Big Data: 4Vs Photo downloaded from: https://blog.infodiagram.com/2014/04/visualizing-big-data-concepts-strong.html

More information

Introduction to BigData, Hadoop:-

Introduction to BigData, Hadoop:- Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,

More information

Hadoop. copyright 2011 Trainologic LTD

Hadoop. copyright 2011 Trainologic LTD Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides

More information

Survey on MapReduce Scheduling Algorithms

Survey on MapReduce Scheduling Algorithms Survey on MapReduce Scheduling Algorithms Liya Thomas, Mtech Student, Department of CSE, SCTCE,TVM Syama R, Assistant Professor Department of CSE, SCTCE,TVM ABSTRACT MapReduce is a programming model used

More information