
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 5, Issue 7, July (2014), pp. 11-16
IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2014): 8.5328 (Calculated by GISI), www.jifactor.com

NEPHELE: EFFICIENT DATA PROCESSING USING HADOOP

Gandhali Upadhye, P.G. Student, Dept. of Computer Engg., RMD Sinhgad School of Engg., Warje, University of Pune, Pune, India.
Asst. Prof. Trupti Dange, Dept. of Computer Engineering, RMD Sinhgad School of Engg., Warje, University of Pune, Pune, India.

ABSTRACT

Today, Infrastructure-as-a-Service (IaaS) cloud providers have incorporated parallel data processing frameworks into their clouds for running many-task computing (MTC) applications. A parallel data processing framework reduces the time and cost of processing large amounts of user data. Nephele is a dynamically resource-allocating parallel data processing framework designed for dynamic and heterogeneous cluster environments. Existing frameworks cannot efficiently monitor resource overload or underutilization during job execution; consequently, the allocated compute resources may be inadequate for large parts of the submitted job and unnecessarily increase processing time and cost. Nephele's architecture enables efficient parallel data processing in clouds: it is the first data processing framework to exploit the dynamic resource allocation offered by today's IaaS clouds for both task scheduling and execution. Particular tasks of a processing job can be assigned to different types of virtual machines, which are automatically instantiated and terminated during job execution.

Keywords: Cloud Computing, Resource Allocation, IaaS, Nephele, Job Scheduling.

I. INTRODUCTION

Cloud computing is a concept that involves a number of technologies, including server virtualization, multitenant application hosting, and a variety of Internet and systems management services. The computing power resides on the Internet, and people access it as they need it via the Internet. Growing organizations use the cloud computing model to process immense amounts of data in a cost-effective manner. Cloud providers such as Google, Yahoo, Microsoft, and Amazon are available for processing these data. Instead of creating large, expensive data centers, these providers have moved to an architectural paradigm built on commodity servers [3]. Cloud computing is the delivery of computing and storage capacity as a service to a community of end recipients. The name comes from the use of a cloud-shaped symbol as an abstraction for the complex infrastructure it represents in system diagrams. Cloud computing entrusts services with a user's data, software, and computation over a network. Consider an organization for which maintaining all software and data on local servers is costly: its business is constantly running out of storage space and fixing broken servers. Instead of maintaining all data and software on local machines, it can depend on a provider that offers all the required features, accessible through the Web. This is cloud computing.

1.1 Infrastructure as a Service (IaaS)

With Infrastructure as a Service (IaaS), users control and manage their systems in terms of bandwidth, response time, resource expenses, and network connectivity, but they need not control the underlying cloud infrastructure. Some key features of IaaS, such as cloud bursting and resource gathering, differ according to the cloud environment. The greatest value of IaaS comes from a key element known as cloud bursting: the process of off-loading tasks to the cloud during times when the most compute resources are needed. For businesses, IaaS offers an advantage in capacity: IT companies [1], [2] are able to develop their own software that handles the re-scheduling of resources in an IaaS cloud. Elasticity is the first critical facet of IaaS. IaaS is easy to spot because it is typically platform-independent, and it consists of a combination of internal and external resources.
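The cloud-bursting idea described above amounts to a threshold decision: run work locally while capacity remains, and rent pay-as-you-go cloud capacity once local utilization gets too high. A minimal sketch of such a policy follows; the names, the 0.8 threshold, and the single-metric model are hypothetical and purely illustrative, not part of any specific IaaS API:

```python
# Hypothetical sketch of a cloud-bursting decision: when local utilization
# would exceed a threshold, new tasks are off-loaded to rented cloud capacity.
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    cpu_demand: float  # fraction of one local server's capacity


def place_task(task: Task, local_utilization: float, burst_threshold: float = 0.8) -> str:
    """Return 'local' or 'cloud' for a task, given current local load."""
    if local_utilization + task.cpu_demand <= burst_threshold:
        return "local"
    return "cloud"  # burst: rent pay-as-you-go capacity instead of queueing


# Example: a nearly saturated local cluster pushes new work to the cloud.
print(place_task(Task("report", 0.1), local_utilization=0.5))     # local
print(place_task(Task("batch-job", 0.3), local_utilization=0.7))  # cloud
```

Real burst policies would of course weigh memory, network, and cost as well, but the elasticity benefit claimed for IaaS reduces to decisions of this shape.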
At its base, IaaS relies on a low-level layer that runs independently of any operating system, called a hypervisor, which is responsible for renting out hardware resources on a pay-as-you-go basis. This process is referred to as resource gathering [2]. Resource gathering by the hypervisor makes virtualization possible, and virtualization enables multiprocessing, leading to an infrastructure shared by several users with similar resource requirements.

II. PROBLEM STATEMENT

Fig. 1: Cloud Environment Infrastructure Architecture

IaaS cloud providers integrate a processing framework to reduce processing time and provide simplicity to users. Reducing processing time leads to a reduction in cost, which attracts users to their cloud services. The goal is to improve and optimize parallel data processing. To achieve utmost client satisfaction, the host server needs to be upgraded with the latest technology so that all of a user's requirements are fulfilled in optimal time. This will improve overall resource utilization and, consequently, reduce processing time.

III. RELATED WORK

In this section we discuss some existing frameworks that have already been implemented for efficient parallel data processing.

SCOPE (Structured Computations Optimized for Parallel Execution): Companies providing cloud-scale services have an increasing need to store and analyze massive data sets such as search logs and click streams. For cost and performance reasons, processing is typically done on large clusters of shared-nothing commodity machines. It is imperative to develop a programming model that hides the complexity of the underlying system but provides flexibility by allowing users to extend functionality to meet a variety of requirements. SCOPE enables easy and efficient parallel processing of massive data sets.

SWIFT: Swift is a system that combines a novel scripting language called SwiftScript with a powerful runtime system based on Falkon and Globus to allow the concise specification, and reliable and efficient execution, of large loosely coupled computations. Swift adopts and adapts ideas first explored in the virtual data system, improving on that system in many regards. However, because SwiftScript has seen little adoption among current developers, this framework has not become popular.

FALKON (Fast and Light-Weight Task Execution Framework): Falkon was developed to enable the rapid execution of many tasks on compute clusters. Falkon integrates multi-level scheduling to separate resource acquisition (via, e.g., requests to batch schedulers) from task dispatch, and a streamlined dispatcher.
Falkon's integration of multi-level scheduling and streamlined dispatchers delivers performance not provided by any other system. Its authors describe Falkon's architecture and implementation and present performance results for both micro-benchmarks and applications.

HADOOP: The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage in a static mode. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure. Because this static nature is not well suited to dynamic parallel processing, a new framework, Nephele, was proposed.

IV. NEPHELE FRAMEWORK

Nephele is a massively parallel data flow engine dealing with resource management, work scheduling, communication, and fault tolerance. Nephele can run on top of a cluster and govern the resources itself, or directly connect to an IaaS cloud service to allocate computing resources on demand. Before submitting a Nephele compute job, a user must start an instance inside the cloud which runs the so-called Job Manager. The Job Manager receives the client's jobs, is responsible for scheduling them, and coordinates their execution. It can allocate or deallocate virtual machines according to the current job execution phase. The actual execution of tasks is carried out by a set of instances. Each instance runs a local component of the Nephele framework (the Task Manager). A Task Manager receives one or more tasks from the Job Manager at a time, executes them, and informs the Job Manager about their completion or possible errors. Unless a job is submitted to the Job Manager, the set of instances (and hence the set of Task Managers) is expected to be empty.

Fig. 2: Structural overview of Nephele running in an Infrastructure-as-a-Service (IaaS) cloud

V. MODULE DESCRIPTION

Client Module: The client sends a request to the Job Manager for the execution of a task; the Job Manager schedules the process, coordinates the task, and waits for the completion response. The client is the one who initiates the request to the Job Manager.

Job Manager Module: The Job Manager waits for a task from the client, coordinates the process, and checks the availability of a server. If a server is available for the task, it allocates the resource for execution and waits for the completion response.

Cloud Controller Module: This module acts as an interface between the Job Manager and the Task Managers and provides the control and initiation of Task Managers. It is also responsible for coordinating and managing execution and for dispatching tasks. It checks the availability of Task Managers and allocates the resource for the task to be executed.

Task Manager Module: The Task Manager waits for a task to be executed; it then executes it and sends the completion response to the Job Manager and, in turn, to the client. The actual execution of the task is done in the Task Manager.
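The control flow among the four modules above can be sketched as follows. This is an illustrative mock-up of the described flow, not Nephele's actual API; all class and method names are invented for exposition, and the on-demand allocation and deallocation of Task Managers stands in for real virtual machine provisioning:

```python
# Illustrative sketch of the Client -> Job Manager -> Cloud Controller ->
# Task Manager flow described above (all names are hypothetical).

class TaskManager:
    def execute(self, task: str) -> str:
        # The actual work happens here; the result is reported back upward.
        return f"{task}: done"


class CloudController:
    """Interface between the Job Manager and Task Managers: it
    instantiates Task Managers on demand and dispatches tasks."""

    def __init__(self) -> None:
        self.task_managers: list[TaskManager] = []  # empty until a job arrives

    def dispatch(self, task: str) -> str:
        if not self.task_managers:              # allocate an instance on demand
            self.task_managers.append(TaskManager())
        return self.task_managers[0].execute(task)

    def deallocate_all(self) -> None:           # release instances after the job
        self.task_managers.clear()


class JobManager:
    def __init__(self, controller: CloudController) -> None:
        self.controller = controller

    def submit(self, tasks: list[str]) -> list[str]:
        # Schedule each task and collect its completion response.
        results = [self.controller.dispatch(t) for t in tasks]
        self.controller.deallocate_all()
        return results


# A client initiates the request to the Job Manager:
jm = JobManager(CloudController())
print(jm.submit(["map-step", "reduce-step"]))
# -> ['map-step: done', 'reduce-step: done']
```

Note that, as in the description above, the set of Task Managers is empty until a job is submitted and is emptied again afterward; in the real framework these would be virtual machines started and stopped during the job's execution phases.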

VI. RESULTS

This section presents the results of job scheduling for resource allocation under Nephele's architecture, which processes the submitted tasks and allocates resources for their execution. The results are compared with Hadoop.

Fig. 3: Comparison graph for CPU utilization
Fig. 4: Comparison graph for job completion
Fig. 5: Comparison graph for processing time
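For context, the Hadoop baseline in the comparison follows the MapReduce programming model [7]. A minimal word-count in plain Python illustrates the map, shuffle, and reduce phases of that model; this is an illustration of the paradigm only, not Hadoop's actual Java API:

```python
# Minimal illustration of the MapReduce model (not Hadoop's actual API):
# map emits (key, 1) pairs, a shuffle groups them by key, reduce sums them.
from collections import defaultdict


def map_phase(document: str) -> list[tuple[str, int]]:
    return [(word, 1) for word in document.split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: dict[str, list[int]]) -> dict[str, int]:
    return {key: sum(values) for key, values in groups.items()}


docs = ["cloud data processing", "parallel data processing"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'cloud': 1, 'data': 2, 'processing': 2, 'parallel': 1}
```

In Hadoop, the map and reduce phases run in parallel across the cluster on statically provisioned nodes, which is precisely the static behavior Nephele's dynamic allocation is compared against.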

VII. CONCLUSION AND FUTURE WORK

In this project we presented Nephele, the first data processing framework to exploit the dynamic resource provisioning offered by today's IaaS clouds. We described Nephele's basic architecture and presented a performance comparison with the established data processing framework Hadoop. The performance evaluation gives a first impression of how the ability to assign specific virtual machine types to specific tasks of a processing job, as well as the possibility to automatically allocate and deallocate virtual machines in the course of a job's execution, can help to improve overall resource utilization and, consequently, reduce processing cost. Future work will focus in particular on improving Nephele's ability to adapt automatically to resource overload or underutilization during job execution. IT practitioners are leading this challenge, while academia has been slower to react. Several groups have recently been formed, such as the Cloud Security Alliance and the Open Cloud Consortium, with the goal of exploring the possibilities offered by cloud computing and establishing a common language among different providers. Our current profiling approach builds a valuable basis for this; however, at the moment the system still requires a reasonable amount of user annotations. In general, this work represents an important contribution to the growing field of cloud computing services and points out exciting new opportunities in the field of parallel data processing.

VIII. REFERENCES

1. W. Zhao, K. Ramamritham, and J. A. Stankovic, "Preemptive Scheduling Under Time and Resource Constraints," IEEE Transactions on Computers, C-36(8), 1987, pp. 949-960.
2. Alex King Yeung Cheung and Hans-Arno Jacobsen, "Green Resource Allocation Algorithms for Publish/Subscribe Systems," Proc. 31st IEEE International Conference on Distributed Computing Systems (ICDCS), 2011.
3. R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou, "SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets," Proc. Very Large Data Bases (VLDB).
4. The Apache Software Foundation, "Welcome to Hadoop!" http://hadoop.apache.org/, 2012.
5. D. Warneke and O. Kao, "Nephele: Efficient Parallel Data Processing in the Cloud," Proc. Second Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS '09), 2009, pp. 1-10.
6. I. Raicu, Y. Zhao, C. Dumitrescu, I. Foster, and M. Wilde, "Falkon: A Fast and Light-weight Task Execution Framework," Proc. SC '07: the 2007 ACM/IEEE Conference on Supercomputing, New York, NY, USA, 2007, pp. 1-12. ACM.
7. J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proc. OSDI '04: the 6th Symposium on Operating Systems Design and Implementation, Berkeley, CA, USA, 2004. USENIX Association.
8. D. Warneke and O. Kao, "Exploiting Dynamic Resource Allocation for Efficient Parallel Data Processing in the Cloud," IEEE Transactions on Parallel and Distributed Systems, 22(6), June 2011.
9. T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2009.
10. Abhijit Adhikari and Gandhali Upadhye, "Efficient Dynamic Resource Allocation in Parallel Data Processing for Cloud using PACT," IJARCSSE, 3(11), Nov. 2013.
11. Gandhali Upadhye and Trupti Dange, "Cloud Resource Allocation as Non-Preemptive Approach," International Conference on Current Trends in Engineering and Technology (ICCTET), 2014.
12. Priya Deshpande, Brijesh Khundhawala, and Prasanna Joeg, "Dynamic Data Replication and Job Scheduling Based on Popularity and Category," International Journal of Computer Engineering & Technology (IJCET), 4(5), 2013, pp. 109-114, ISSN Print: 0976-6367, ISSN Online: 0976-6375.