TOOLS FOR INTEGRATING BIG DATA IN CLOUD COMPUTING: A STATE OF ART SURVEY

Journal of Analysis and Computation (JAC) (An International Peer Reviewed Journal), www.ijaconline.com, ISSN 0973-2861
International Conference on Emerging Trends in IOT & Machine Learning, 2018

Mrs. R. UMA, Head of the Department, Department of Computer Application
S. HEMASANTHOSHINI, M.Phil Scholar, Department of CS & IT
Nadar Saraswathi College of Arts and Science, Theni

ABSTRACT: Big data and cloud computing are emerging technologies whose rate of adoption by businesses has risen rapidly over the past decade. Managing and analyzing big data effectively is a time-consuming and challenging task, while cloud computing is a paradigm that provides the infrastructure for computing and processing all types of data resources. The relationship between the two is one of integration: the cloud is the storehouse, and big data is the product stored in it. Integrating big data into a cloud environment gives users enhanced data-processing techniques. This paper presents a brief survey of the tools used to integrate big data and cloud computing, two fields that have gained tremendous momentum in recent years and attracted the attention of many researchers.

Keywords - Big data, cloud environment, big data management tools.

I. INTRODUCTION

Big data refers to collections of data sets of such size and variety that they are difficult to process with traditional data-processing algorithms and platforms. The number of data sources has grown rapidly in recent years: social networks, sensor networks, high-throughput instruments, satellites and streaming machines all produce huge volumes of data, and the amount of data being generated, stored and shared keeps rising. Data warehouses, web pages, blogs and audio/video streams are all sources of massive amounts of data. Cloud computing refers to on-demand computing resources and systems, available across the network, that provide a number of integrated computing services without requiring local resources.

Many organizations now migrate their big data to cloud platforms such as Amazon AWS, IBM SmartCloud and Windows Azure to take advantage of the resources these platforms offer, including data storage capacity, backup and self-synchronization. The programming tools and frameworks built around these platforms have given birth to the concept of big data processing and analytics.

A. BIG DATA

Big data is the phrase commonly used to describe volumes of structured and unstructured data so massive that they are difficult to process. The data arrives at high speed and in no fixed order from multiple sources such as social media, transactions and interactions with web pages. The main characteristics of big data, known as the 'five Vs', are as follows:

1. Volume: the amount of data produced from multiple sources, now commonly measured in zettabytes. Volume is the most evident dimension of big data.
2. Variety: the range of data types. With the growing number of Internet, smartphone and social-network users, the familiar form of data has shifted from structured data in databases to unstructured data in many formats, such as images, audio and video clips, SMS messages and GPS data.
3. Velocity: the speed at which data arrives from different sources, i.e. the rate of data production on platforms such as Twitter and Facebook. The huge increase in data volume and frequency calls for systems that can analyze data at very high speed.
4. Veracity: the quality of the data, i.e. its accuracy and the confidence that can be placed in its content. The quality of captured data can vary greatly, which affects the accuracy of the analysis.
5. Value: the worth of big data, i.e. the importance of the data after analysis. The value lies in careful analysis of the right data and in the information and ideas it yields; it is the final stage, reached once the other dimensions have been addressed.

B. CLOUD COMPUTING

The cloud is a computing service that charges only for the computing resources actually used. It is the practice of using a network of remote servers hosted on the Internet to store, manage and process data, rather than a local server or a personal computer.

Cloud Computing Services:

Software as a Service (SaaS) - End Users: an application that can be accessed from anywhere in the world, provided the user has a computer with an Internet connection. The cloud-hosted application is used without any additional hardware or software.

Platform as a Service (PaaS) - Application Developers: in the PaaS model, cloud providers deliver a computing platform and/or solution stack, typically including an operating system, a programming-language execution environment, a database and a web server.

Infrastructure as a Service (IaaS) - Network Architects: also known as hardware as a service, IaaS allows existing applications to run on a cloud supplier's hardware. Providers offer computers, as physical or (more often) virtual machines, along with raw (block) storage, firewalls, load balancers and networks.

Modes of Clouds:

Public Cloud: the computing infrastructure is hosted by the cloud vendor at the vendor's premises and can be shared by multiple organizations, e.g. Amazon, Google, Microsoft, Salesforce.

Private Cloud: the computing infrastructure is dedicated to a particular organization and not shared with others; it is more expensive and more secure than a public cloud, e.g. HP data centers, IBM, Sun, Oracle.

Hybrid Cloud: organizations may host critical applications on a private cloud and applications with relatively lower security concerns on a public cloud; using public and private clouds together is called a hybrid cloud.

II. INTEGRATION OF BIG DATA AND CLOUD COMPUTING

In today's computing world, most software companies no longer ship complete installation packages; instead they provide cloud services from which data is fetched over the Internet. This scenario is possible only through cloud computing. The huge volumes of data that constitute big data reside on clouds and can be accessed via programming methods that are hidden from the naive user; with Hadoop, one can easily access and use the various resources of such an integrated environment. Storing big data in a cloud system has long-term benefits both for the insights yielded and for the performance of the IT sector. Big data requires advanced analytic techniques to deal with these extensive amounts of data.

Cloud systems are typically based on remote servers that can handle extensive amounts of data with the rapid response times required by real-time processing. The main benefits of the integration include cost reduction and lower overhead, rapid provisioning and faster time to market, and flexibility and scalability.

III. TOOLS FOR INTEGRATING BIG DATA AND CLOUD COMPUTING

Big data poses the challenge of handling massive amounts of structured and unstructured data, and cloud computing offers scalable solutions for managing such data, so it pays to take advantage of both technologies. To incorporate and manage big data effectively in a cloud environment, it is important to understand the tools and services on offer. Vendors such as Amazon Web Services (AWS), Google, Microsoft and IBM provide cloud-based Hadoop and NoSQL database platforms that support big data applications, and many cloud providers offer Hadoop frameworks that scale automatically with customer demand for data processing.

A. HADOOP: Hadoop is an open-source software framework for distributed storage and processing of very large datasets. It is a Java-based programming framework that uses a master/slave structure, and the platform includes higher-level declarative languages for writing queries and data-analysis pipelines. Hadoop is composed of many components, but for big data the two most important are the Hadoop Distributed File System (HDFS) and MapReduce; the other components provide complementary services and higher levels of abstraction.

i. MapReduce: MapReduce is the main processing component of the Hadoop framework, used for processing and generating large datasets on a cluster with a distributed, parallel algorithm. It is a programming paradigm that processes large volumes of data by dividing the work among independent nodes. A MapReduce program corresponds to two jobs: a Map() task, which obtains, filters and sorts the data, and a Reduce() task, which summarizes the intermediate results and generates the final output (a minimal word-count sketch in Java is given after the HDFS description below). The MapReduce system arranges the distributed servers, manages all communication and parallel data transfer, and provides redundancy and fault tolerance.

ii. HADOOP DISTRIBUTED FILE SYSTEM (HDFS): HDFS stores data files, typically gigabytes to terabytes in size, that are too large for a single machine. It is a distributed, scalable and portable file system written in Java for the Hadoop framework. It splits each file into blocks stored across multiple machines and maintains reliability by replicating those blocks across multiple hosts, which also facilitates parallel processing. An HDFS cluster has a master/slave structure with a single namenode and multiple datanodes.
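To make the division of work between the Map() and Reduce() jobs concrete, the following sketch shows the classic word-count program written against Hadoop's Java MapReduce API. It is illustrative rather than taken from the paper; the class names are arbitrary, and the input and output paths are passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: split each input line into tokens and emit (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts received for each word and emit (word, total).
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}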

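HDFS itself is usually reached through Hadoop's FileSystem client API. The fragment below is a minimal sketch, assuming a namenode reachable at hdfs://namenode:9000 and write permission on the chosen path (both the address and the path are assumptions, not values from the paper); it writes a small file into the distributed file system and reads it back, with block placement and replication handled transparently by HDFS.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000");   // assumed namenode address

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/sample.txt");       // hypothetical HDFS path

    // Write: the file is split into blocks and replicated across datanodes by HDFS.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the data back from the cluster.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }

    fs.close();
  }
}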
B. CASSANDRA AND HBASE: Cassandra and HBase are open-source, non-relational, distributed database management systems written in Java that support data storage for very large tables; HBase runs on top of HDFS. Both use a columnar data model with features such as compression and in-memory operations, and they provide a fault-tolerant way of storing large quantities of sparse data (a minimal HBase client sketch in Java is given at the end of this section).

C. HIVE: Hive is a data-warehouse infrastructure, originally developed by Facebook, that provides data summarization, ad-hoc querying and analysis. It offers an SQL-like language (HiveQL) for writing powerful queries and obtaining results quickly.

D. PIG: Pig provides a high-level data-flow language (Pig Latin) and an execution framework for parallel computation.

E. ZOOKEEPER: ZooKeeper is a high-performance coordination service for distributed applications; it stores configuration information and follows a master/slave node structure.

F. APACHE SPARK: Apache Spark is an open-source distributed cluster-computing system designed to speed up data analytics. It is based on a general execution model that lets user programs load data into a cluster's memory, enabling in-memory computing and optimization (a short Java example is given at the end of this section).

G. HPCC: The High Performance Computing Cluster (HPCC) framework is an open-source, massively parallel processing platform with two different processing clusters. The Thor processing cluster is a data refinery that processes large volumes of heterogeneous data; it is responsible for extracting, transforming and loading raw data, and its environment and file system are similar to the Hadoop MapReduce platform. The Roxie processing cluster is a parallel data-processing system that works as a rapid data-delivery engine.
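As a brief illustration of the columnar, sparse-row storage model described for HBase above, the following sketch uses the standard HBase Java client to write and read a single row. The table name "events", the column family "d" and the row key are assumptions made for the example; the table and column family must already exist in the cluster, and connection settings are read from hbase-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRowExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // picks up hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("events"))) {  // assumed, pre-created table

      // Write one sparse row: only the columns actually set are stored.
      Put put = new Put(Bytes.toBytes("row-001"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("source"), Bytes.toBytes("sensor-17"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes("42"));
      table.put(put);

      // Read the row back by its key and extract one column value.
      Result result = table.get(new Get(Bytes.toBytes("row-001")));
      String source = Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("source")));
      System.out.println("source = " + source);
    }
  }
}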

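The next sketch illustrates the in-memory computing idea behind Apache Spark using its Java API: a dataset is loaded from a file, filtered, cached in cluster memory and then reused by two actions without being re-read from storage. The input path, the "ERROR"/"timeout" filters and the local master setting are purely illustrative assumptions for the example.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCacheExample {
  public static void main(String[] args) {
    // local[*] runs Spark in-process for testing; on a cluster this would point at the cluster manager.
    SparkConf conf = new SparkConf().setAppName("cache-example").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> lines = sc.textFile(args[0]);                    // input path supplied by the user
    JavaRDD<String> errors = lines.filter(l -> l.contains("ERROR")); // keep only error lines
    errors.cache();                                                  // keep the filtered RDD in cluster memory

    long total = errors.count();                                        // first action materializes and caches the RDD
    long timeouts = errors.filter(l -> l.contains("timeout")).count();  // reuses the cached data, no re-read

    System.out.println(total + " errors, " + timeouts + " timeouts");
    sc.stop();
  }
}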
IV. CONCLUSION

In this paper we have discussed tools that support the integration of big data with cloud computing. Cloud computing gives enterprises cost-effective, flexible access to the enormous volumes of information that big data represents, while big data workloads on the cloud draw on vast amounts of on-demand computing resources that enable best-practice analytics. Cloud computing provides an environment of flexible, distributed resources that applies advanced techniques to the processing and management of data while reducing cost. All of these characteristics show that cloud computing has an integrated relationship with big data, and both are progressing rapidly to keep pace with advances in technology and the requirements of users.