Journal of Analysis and Computation (JAC) (An International Peer Reviewed Journal), www.ijaconline.com, ISSN 0973-2861
International Conference on Emerging Trends in IOT & Machine Learning, 2018

TOOLS FOR INTEGRATING BIG DATA IN CLOUD COMPUTING: A STATE OF THE ART SURVEY

Mrs. R. UMA, Head of the Department, Department of Computer Application
S. HEMASANTHOSHINI, M.Phil Scholar, Department of CS & IT
Nadar Saraswathi College of Arts and Science, Theni

ABSTRACT: Big data and cloud computing are both emerging technologies whose rate of adoption by businesses has increased rapidly over the past decade. Effectively managing and analyzing big data is a time-consuming and challenging task. Cloud computing is a paradigm that provides infrastructure for computing and processing all types of data resources. The relationship between big data and cloud computing is one of integration: the cloud represents the storehouse, and big data represents the product that will be stored in the storehouse. Integrating big data in a cloud environment provides users with enhanced data processing techniques. This paper presents a brief survey of the tools used for integrating big data and cloud computing. These two fields have gained tremendous momentum in recent years and have attracted the attention of several researchers.

Keywords - Big data, cloud environment, big data management tools.

I. INTRODUCTION

Big data is defined as a collection of huge data sets of different types that are difficult to process using traditional data processing algorithms and platforms. The number of data sources has increased recently, including social networks, sensor networks, high-throughput instruments, satellites and streaming machines, and these environments produce huge volumes of data. The amount of data being generated, stored and shared has been on the rise.
From data warehouses, web pages and blogs to audio/video streams, all of these are sources of massive amounts of data. Cloud computing refers to on-demand computing resources and systems, available across the network, that can provide a number of integrated computing services to facilitate user access without local resources. Many organizations like
Amazon AWS, IBM SmartCloud and Windows Azure are now migrating their big data to clouds to take advantage of them. These resources include data storage capacity, backup and self-synchronization. These programming tools and frameworks have given birth to the concept of big data processing and analytics.

A. BIG DATA

Big data is the phrase commonly used to describe a massive volume of both structured and unstructured data that is so large it is difficult to process. This data arrives at high speed, in a random fashion, from multiple sources such as social media, transactions and interactions with web pages. The main characteristics of big data, known as the 'five Vs', are as follows:

1. Volume: It represents the amount of data produced from multiple sources, measured in huge figures such as zettabytes. Volume is the most evident dimension where big data is concerned.

2. Variety: It represents data types. With the increasing number of Internet, smartphone and social network users everywhere, the familiar form of data has changed from structured data in databases to unstructured data in a large number of formats, such as images, audio and video clips, SMS and GPS data.

3. Velocity: It represents the speed (frequency) at which data arrives from different sources, that is, the speed of data production in services such as Twitter and Facebook. The huge increase in data volume and frequency dictates the need for systems that ensure super-fast data analysis.

4. Veracity: It represents the quality of the data; it reflects the accuracy of the data and the confidence in its content. The quality of captured data can vary greatly, which affects the accuracy of analysis.

5. Value: It represents the value of big data, i.e. the importance of the data after analysis. The value lies in careful analysis of the exact data and the information and ideas it provides.
The value is the final stage, reached after processing volume, velocity, variety, variability, validity and visualization.

B. CLOUD COMPUTING

The cloud is a computing service that charges only for the amount of computing resources actually used. It is the practice of using a network of remote servers hosted on the Internet to store, manage, and process data, rather than a local server or a personal computer.

Cloud Computing Services:
Software as a Service (SaaS) - End Users: A SaaS application can be accessed from anywhere in the world as long as you have a computer with an Internet connection. This cloud-hosted application can be accessed without any additional hardware or software.

Platform as a Service (PaaS) - Application Developers: In the PaaS model, cloud providers deliver a computing platform and/or solution stack, typically including an operating system, a programming language execution environment, a database, and a web server.

Infrastructure as a Service (IaaS) - Network Architects: Also known as hardware as a service, IaaS allows existing applications to be run on a cloud supplier's hardware. Cloud providers offer computers, as physical or (more often) virtual machines, along with raw (block) storage, firewalls, load balancers, and networks.

Modes of Clouds:

Public Cloud: The computing infrastructure is hosted by a cloud vendor at the vendor's premises and can be shared by various organizations. E.g.: Amazon, Google, Microsoft, Salesforce.

Private Cloud: The computing infrastructure is dedicated to a particular organization and not shared with other organizations. It is more expensive and more secure when compared to a public cloud. E.g.: HP data center, IBM, Sun, Oracle.

Hybrid Cloud: Organizations may host critical applications on private clouds, whereas applications with relatively fewer security concerns are hosted on the public cloud. The usage of public and private clouds together is called a hybrid cloud.

II. INTEGRATION OF BIG DATA AND CLOUD COMPUTING

In today's computing world, most software companies do not provide complete setup files for their software; instead they provide a cloud from which to fetch data over the Internet. This type of scenario is possible only through the concept of cloud computing.
The huge volume of data, or big data, present on clouds can be accessed via programming methods that are hidden from a naïve user. With Hadoop, one can easily access and make use of the various resources in the integrated environment. Utilizing a cloud system to store big data has long-term benefits for both the insights yielded and the performance of the IT sector. Big data requires advanced analytic techniques to deal with the extensive amounts of
data. Cloud systems are typically based on remote servers, which are able to handle extensive amounts of data with rapid response times for real-time processes. The benefits of this integration include cost reduction, reduced overhead, rapid provisioning/time to market, and flexibility/scalability.

III. TOOLS FOR INTEGRATING BIG DATA AND CLOUD COMPUTING

Big data poses the big challenge of managing massive amounts of structured and unstructured data. Cloud computing offers scalable solutions for managing such large amounts of data in a cloud environment, taking advantage of both technologies. To effectively incorporate and manage big data in a cloud environment, it is important to understand the tools and services on offer. Vendors such as Amazon Web Services (AWS), Google, Microsoft and IBM offer cloud-based Hadoop and NoSQL database platforms that support big data applications; in addition, many cloud providers offer Hadoop frameworks that scale automatically on customer demand for data processing.

A. HADOOP

Hadoop provides an open-source software framework for distributed storage and processing applications on very large datasets. It is a Java-based programming framework that uses a master/slave structure. The Hadoop platform includes higher-level declarative languages for writing queries and data analysis pipelines. Hadoop is composed of many components, but for big data usage the two most important are the Hadoop Distributed File System (HDFS) and MapReduce. The other components provide complementary services and higher levels of abstraction.

i. MapReduce: The MapReduce system is the main part of the Hadoop framework used for processing and generating large datasets on a cluster with a distributed, parallel algorithm. It is a programming paradigm that processes large volumes of data by dividing the work among various independent nodes. A MapReduce program corresponds to two jobs.
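The paradigm can be mimicked with a toy word-count sketch in plain Python (no Hadoop cluster involved; the function names map_phase, shuffle and reduce_phase are purely illustrative, not part of the Hadoop API):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each input document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: summarize each group into a final result."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data on the cloud", "big data tools"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])  # prints 2
```

In a real Hadoop deployment, each phase would run on independent nodes, with the framework handling the shuffle, data transfer and fault tolerance.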
A Map() method obtains, filters and sorts datasets, and a Reduce() method computes summaries and generates the final result. The MapReduce system arranges the distributed servers, manages all communication and parallel data transfer, and also provides redundancy and fault tolerance.

ii. HADOOP DISTRIBUTED FILE SYSTEM (HDFS): HDFS is used to store large data files, too big for a single machine, typically gigabytes to terabytes in size. HDFS is a
distributed, scalable and portable file system written in Java for the Hadoop framework. It maintains reliability by replicating data across multiple hosts to facilitate parallel processing; to do so, it splits a file into blocks that are stored across multiple machines. An HDFS cluster has a master-slave relationship, with a single NameNode and multiple DataNodes.

B. CASSANDRA AND HBASE

Both are open-source, non-relational, distributed DBMSs written in Java that support data storage for large tables; HBase runs on top of HDFS. Both use a columnar data model with features such as compression and in-memory operations, and provide a fault-tolerant way of storing large quantities of sparse data.

C. HIVE

Hive is a data warehouse infrastructure, originally developed by Facebook, that provides data summarization, ad-hoc querying and analysis. It offers an SQL-like language (HiveQL) for expressing powerful queries and obtaining results.

D. PIG

Pig is a high-level data flow language (Pig Latin) and execution framework for parallel computation.

E. ZOOKEEPER

ZooKeeper is a high-performance coordination service for distributed applications that can store configuration information and has master and slave nodes.

F. APACHE SPARK

Apache Spark is an open-source distributed cluster computing system that speeds up data analytics. It is based on a general execution model that allows user programs to load data into a cluster's memory, thereby enabling in-memory computing and optimization.

G. HPCC

The High Performance Computing Cluster framework is a massively parallel-processing computing platform, and it is open source as well. It has two different processing clusters. The Thor processing cluster is a data refinery that processes large volumes of heterogeneous data. It is responsible for extracting, transforming and loading raw data.
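Such an extract-transform-load pipeline can be sketched, in spirit, with Python's standard library (a toy illustration only, not HPCC's actual ECL language; the sample data is invented):

```python
import csv
import io
import sqlite3

# Extract: read raw, possibly dirty rows (here from an in-memory CSV).
raw = io.StringIO("name,clicks\nalice,3\nbob,\ncarol,5\n")
rows = list(csv.DictReader(raw))

# Transform: clean the data, dropping rows with missing values.
cleaned = [(r["name"], int(r["clicks"])) for r in rows if r["clicks"]]

# Load: store the refined records for downstream querying.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clicks (name TEXT, n INTEGER)")
db.executemany("INSERT INTO clicks VALUES (?, ?)", cleaned)
total = db.execute("SELECT SUM(n) FROM clicks").fetchone()[0]
print(total)  # prints 8
```

A real data refinery would run these stages in parallel across a cluster over heterogeneous sources, but the extract/transform/load division of labor is the same.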
The Thor cluster is similar to the Hadoop MapReduce platform in its environment and file system. The Roxie processing cluster is a parallel data processing system that works as a rapid data delivery engine.

IV. CONCLUSION

In this paper we discussed the tools that support the integration of big data with cloud computing. Cloud computing provides enterprises cost-effective, flexible access to big data's enormous volumes of information. Big data on the cloud draws on vast amounts of on-demand computing resources that enable best-practice analytics. Cloud computing represents an environment of flexible distributed resources that uses advanced techniques
in the processing and management of data, and yet reduces cost. All these characteristics show that cloud computing has an integrated relationship with big data. Both are progressing rapidly to keep pace with advances in technology requirements and user needs.