A SURVEY ON BIG DATA TECHNIQUES


K. Anusha 1, K. UshaRani 2, C. Lakshmi 3

ABSTRACT: Big Data is the term for data sets so large and complex that they become difficult to process with traditional data management tools. Big Data work explores novel techniques and methods to capture, store, distribute, manage and analyze petabyte-scale datasets that arrive at high velocity and in different structures. Data from sensors, electronic devices such as mobile phones, social websites, scientific sources and enterprises all contribute to the sudden increase in data. These large collections of data, generally called Big Data, have become one of the latest research trends. Big Data is data whose size, variety and complexity require new designs, techniques, algorithms and analytics to manage it and to mine value and hidden knowledge from it. The Map Reduce environment, now available as open source in Hadoop, is one of the emerging solutions to the Big Data problem. Hadoop enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale from a single server to thousands of machines, with a very high degree of fault tolerance. Hadoop's distributed processing, built on the Hadoop Distributed File System, Map Reduce algorithms and its overall design, is a key step towards realizing the benefits of Big Data. In this paper, a review of the technical challenges of Big Data and of the Hadoop and Map Reduce architectures is presented.

Keywords: Big Data, Hadoop, Map Reduce, Hadoop Distributed File System.

I. INTRODUCTION

A. Big Data : Definition

Big Data is among the most recent trends in IT and business. Big data is a term that refers to combinations of data sets whose size, variability, and velocity make them difficult to capture, manage, process or analyze with standard technologies and tools, such as relational databases and desktop statistics packages, within the time necessary to make them useful [12].
These large chunks of data, generally called Big Data, have redefined the current state of data processing. Most analysts currently refer to data sets ranging from terabytes (one terabyte = 10^12 bytes, or 1000 gigabytes) to multiple petabytes (one petabyte = 10^15 bytes, or 1000 terabytes) as big data. A Big Data system can be divided into three layers: the Infrastructure Layer, the Computing Layer and the Application Layer, listed from bottom to top [13].

1 Research Scholar, 2,3 Professor, Department of Computer Science, Sri Padmavathi Mahila Visvavidyalayam, Tirupati

Fig. 1 Layered Architecture of Big Data System

B. Big Data Characteristics

Volume: the potential data capacity, from terabytes to petabytes.
Velocity: how rapidly data enters the system.
Variety: all types of data, both structured and unstructured.

C. Evolution of Big Data

From customers to companies, everyone has an insatiable appetite for data and for all that can be done with it. All depend on data for new ways to identify fraud, to keep track of consumer behavior, and for many other purposes [12]. In the past, enterprise systems were the major sources of data, but nowadays many additional sources, such as sensors and social networking sites, contribute to the data pool.

D. Technical Challenges with Big Data Processing

i. Fault Tolerance: With the arrival of new technologies such as Cloud computing and Big Data, the expectation is that whenever a failure occurs, the damage done should stay within an acceptable threshold. Thus the major task is to limit the probability of failure to an acceptable level. But it is very expensive to reduce the probability of failure [13].

ii. Heterogeneous data: Unstructured data represents nearly every kind of data being produced, from social media communications to recorded meetings to PDF documents and more. Working with unstructured data is difficult and costly. Structured data, by contrast, is organized in a way that is easy to automate and manage.

iii. Scale: The first thing anyone thinks of with Big Data is its size. Managing large and quickly increasing volumes of data has been a challenging issue for many decades. In the past, this challenge was mitigated by processors getting faster, which provided the resources needed to deal with increasing volumes of data. But there is a fundamental shift underway now: data volume is scaling faster than compute resources, and CPU speeds are static.

iv. Privacy: The privacy of data is another major concern, and one that grows in the context of Big Data. There is fear regarding the inappropriate use of personal data, particularly through the linking of data from multiple sources. Managing privacy effectively is both a technical and a sociological problem, which must be addressed jointly from both perspectives to realize the promise of big data.

II. LITERATURE REVIEW

Jimmy Lin et al. used Hadoop, which is currently the large-scale data analysis "hammer" of choice, but there exist classes of algorithms that aren't "nails", in the sense that they are not particularly amenable to the Map Reduce programming model [7]. He focuses on the simple solution of finding alternative non-iterative algorithms that solve the same problem. The standard Map Reduce model is well known and described in many places. Each iteration of PageRank corresponds to one Map Reduce job. The author notes that iterative graph algorithms, gradient descent and EM iterations are typically implemented as Hadoop jobs, with a driver that sets up each iteration and checks for convergence.
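The driver pattern described above can be sketched in a few lines. This is a minimal in-memory illustration, not Hadoop code: the function names, the damping factor 0.85, the tolerance and the toy graph are all assumptions made for the example, and the graph is assumed to have no pages without outgoing links.

```python
from collections import defaultdict

DAMPING = 0.85  # conventional PageRank damping factor (assumed, not from the paper)


def map_phase(graph, ranks):
    """Map: each page emits an equal share of its rank to every page it links to."""
    for page, links in graph.items():
        yield page, 0.0  # ensures pages with no in-links still reach the reducer
        for target in links:
            yield target, ranks[page] / len(links)


def reduce_phase(pairs, n_pages):
    """Reduce: sum the contributions per page and apply the damping factor."""
    totals = defaultdict(float)
    for page, contribution in pairs:
        totals[page] += contribution
    return {p: (1 - DAMPING) / n_pages + DAMPING * s for p, s in totals.items()}


def pagerank_driver(graph, tol=1e-6, max_iters=100):
    """Driver: run one map/reduce 'job' per iteration until the ranks stop changing."""
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}
    iteration = 0
    for iteration in range(1, max_iters + 1):
        new_ranks = reduce_phase(map_phase(graph, ranks), n)
        delta = sum(abs(new_ranks[p] - ranks[p]) for p in graph)
        ranks = new_ranks
        if delta < tol:  # the convergence check lives in the driver, outside the job
            break
    return ranks, iteration


if __name__ == "__main__":
    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # hypothetical link graph
    final_ranks, iterations = pagerank_driver(toy_graph)
    print(iterations, final_ranks)
```

Every pass through the loop re-reads the whole graph, which is the overhead that makes iterative algorithms an awkward fit for the plain Map Reduce model.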
The author suggests that "if all you have is a hammer, throw away everything that's not a nail" [7].

S. Vikram Phaneendra & E. Madhusudhan Reddy et al. explained that in earlier days data volumes were small and easily handled by RDBMS, but that it has recently become difficult to handle huge data with RDBMS tools; such data is referred to as big data. They state that big data differs from other data in five dimensions: volume, velocity, variety, value and complexity. They illustrated the Hadoop architecture, consisting of name node, data node, edge node and HDFS, for handling big data systems. The Hadoop architecture handles large data sets, and scalable algorithms perform log management; applications of big data can be found in the financial, retail, health-care, mobility and insurance industries. The authors also focused on the challenges that enterprises face when handling big data, such as data privacy and search analysis [6].

Albert Bifet et al. stated that streaming data analysis in real time is becoming the fastest and most efficient way to obtain useful knowledge, allowing organizations to react quickly when problems appear or to detect them in order to improve performance. The huge amount of data created every day is termed big data. The tools used for mining big data include Apache Hadoop, Apache Pig, Cascading, Scribe, Storm, Apache HBase, Apache Mahout, MOA, R, etc. [8]. Thus, he indicated that our ability to handle many exabytes of data depends mainly on the existence of a rich variety of datasets, techniques and software frameworks.

Aditya B. Patel et al., in addressing the Big Data problem using Hadoop and Map Reduce, report experimental research on Big Data problems in various domains. They describe optimal and efficient solutions using a Hadoop cluster, the Hadoop Distributed File System (HDFS) for data storage, and the Map Reduce framework for parallel processing of massive data sets and records [9].
Suman Arora, Dr. Madhu Goel et al. described many techniques for making the Map Reduce scheduler efficient so as to speed up the system or data retrieval. Techniques such as Quincy, Asynchronous Processing, Speculative Execution, Job Awareness, Delay Scheduling and Copy Compute Splitting have made the scheduler effective for faster processing [3].

Poonam S. Patil, Rajesh. N. Phursule et al. illustrated that the Map Reduce programming model has been successfully used at Google for many different purposes. First, the model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and load balancing. Second, a large variety of problems are easily expressible as Map Reduce computations. Map Reduce makes it easy to parallelize and distribute computations and to make such computations fault tolerant. There is also an extensive list of products and projects that either extend Hadoop's functionality or expose some existing capability in new ways [5].

Vibhavari Chavan, Prof. Rajesh. N. Phursule et al. stated that Hadoop Map Reduce is a large-scale, open source software framework dedicated to scalable, distributed, data-intensive computing. The framework breaks up large data into smaller parallelizable chunks and handles scheduling; maps each piece to an intermediate value; reduces intermediate values to a solution; offers user-specified partition and combiner options; and is fault tolerant, reliable, and able to support thousands of nodes and petabytes of data. If you can rewrite algorithms into Maps and Reduces, and your problem can be broken up into small pieces solvable in parallel, then Hadoop's Map Reduce is the way to go for a distributed problem-solving approach to large datasets. It is tried and tested in production and has many implementation options. The authors present the design and evaluation of a data-aware cache framework that requires minimum change to the original Map Reduce programming model for provisioning incremental processing for Big Data applications using the Map Reduce model [4].

Amogh Pramod Kulkarni, Mahesh Khandewal et al. stated the importance of some of the technologies that handle Big Data, such as Hadoop, HDFS and Map Reduce. The authors discussed the various schedulers used in Hadoop and the technical aspects of Hadoop. They also focus on the importance of YARN, which overcomes the limitations of Map Reduce [2].

Dhole Poonam B, Gunjal Baisa L et al. focus on the Hadoop data flow and the Pipelined Map Reduce data flow. The authors suggest that Pipelined Map Reduce performs much better than the traditional model, stating that it reduces the completion time of tasks. This means that an implementation of Pipelined Map Reduce can process large datasets effectively [10].

Sabia and Love Arora et al. mainly focus on various Big Data handling techniques that handle massive amounts of data from different sources and improve the overall performance of systems [11].

Mrigank Mridul, Akashdeep Khajuria, Snehasish Dutta, Kumar N. et al. stated that Map Reduce is the best tool available for processing data, and that its distributed, column-oriented database, HBase, which uses HDFS for its underlying storage, provides more efficiency to the system [1].

III. HADOOP & MAP REDUCE

A. Hadoop & Map Reduce

These are the commonly used models for Big Data processing. Hadoop is a programming framework that supports the processing of very large data sets in a distributed computing environment. Hadoop was derived from Google's Map Reduce, a software framework in which an application is broken down into various parts. The Apache Hadoop project consists of the Hadoop Distributed File System module and Hadoop Map Reduce, in addition to other modules. The software is designed to harness the processing power of clustered computing while managing failures at the node level.

Fig. 2 Hadoop Architecture

The current Apache Hadoop ecosystem consists of the Hadoop Kernel, Map Reduce, HDFS and a number of other components.

B. Hadoop Distributed File System

HDFS is a clustered file management system that holds huge amounts of data and provides high-throughput, high-speed access to it. HDFS stores massive amounts of information, scales up incrementally and survives the failure of considerable parts of the storage infrastructure without losing data. The system stores files redundantly across a number of machines to make sure that they are fault tolerant and available to highly parallel applications [12]. Hadoop creates clusters of machines and coordinates work among them. Clusters can be built with low-cost computers. If one fails, Hadoop continues to run the cluster without losing data or disrupting work, by shifting work to the remaining machines in the group.
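The idea of shifting work from a failed machine to the remaining ones can be illustrated with a toy scheduler. This is a sketch under stated assumptions, not Hadoop's actual scheduling logic: the round-robin assignment, the least-loaded takeover policy and the names assign_tasks and handle_failure are hypothetical.

```python
def assign_tasks(tasks, nodes):
    """Spread tasks over the nodes round-robin (a deliberately simple policy)."""
    assignment = {node: [] for node in nodes}
    for i, task in enumerate(tasks):
        assignment[nodes[i % len(nodes)]].append(task)
    return assignment


def handle_failure(assignment, failed_node):
    """Return a new assignment in which the failed node's tasks move to survivors."""
    survivors = {n: list(t) for n, t in assignment.items() if n != failed_node}
    if not survivors:
        raise RuntimeError("no surviving nodes to take over the work")
    for task in assignment.get(failed_node, []):
        # Give each orphaned task to whichever surviving node currently has the least work.
        target = min(survivors, key=lambda n: len(survivors[n]))
        survivors[target].append(task)
    return survivors


if __name__ == "__main__":
    plan = assign_tasks(list(range(7)), ["node1", "node2", "node3"])
    print(handle_failure(plan, "node1"))
```

No task is lost when a node disappears; the cost is that the surviving nodes each carry a larger share of the work.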
HDFS manages storage on the cluster by breaking incoming files into pieces, called blocks, and storing each block redundantly across the pool of servers. In the common case, HDFS stores three full copies of each file by copying each piece to three different servers [12].
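A small simulation makes the block-and-replica scheme concrete. It is only a sketch: the 4-byte block size is chosen for readability (real HDFS blocks are far larger), replicas are placed uniformly at random rather than by any real placement policy, and the function names are hypothetical.

```python
import random

BLOCK_SIZE = 4   # bytes per block; tiny on purpose so the example is easy to follow
REPLICATION = 3  # three copies of every block, as described in the text


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Break a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(n_blocks, nodes, replication=REPLICATION, seed=0):
    """Pick `replication` distinct nodes for each block (random placement, assumed)."""
    rng = random.Random(seed)
    return {block: rng.sample(nodes, replication) for block in range(n_blocks)}


def file_available(placement, live_nodes):
    """A file is readable if every block still has at least one replica on a live node."""
    live = set(live_nodes)
    return all(live.intersection(replicas) for replicas in placement.values())


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
    blocks = split_into_blocks(b"hello hadoop world")
    placement = place_replicas(len(blocks), nodes)
    print(placement)
    print(file_available(placement, [n for n in nodes if n != "dn1"]))
```

With three distinct replicas per block, losing any one or any two of the five nodes still leaves every block readable.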

Fig. 3 HDFS or the Hadoop Distributed File System (Name Node: stores metadata only, e.g. /user/aaron/foo: blocks 1, 2, 4; /user/aaron/bar: blocks 3, 5. Data Nodes: store blocks from files.)

The computing systems in each cluster are called Data Nodes. A file consists of multiple blocks, and the blocks need not be stored on the same machine, since the location of each block is chosen at random. Consequently, accessing a particular file requires the cooperation of multiple machines. If multiple machines are needed to serve a file, the file could become unavailable when even one of those machines is lost. HDFS handles this problem by replicating each block across several systems; the replication factor is 3 by default. The file system must also store its metadata reliably. The whole process is controlled by a single system called the Name Node, which holds the metadata of the entire file system. As the metadata of each file is relatively small, all of this information is kept in the main memory of the Name Node machine, allowing fast access.

Fig. 4 HDFS Architecture

C. Map Reduce

Map Reduce was introduced by Google in order to process and store large datasets on commodity hardware. Map Reduce is a model for processing large-scale data records in clusters. The processing pillar of the Hadoop environment is the Map Reduce framework. The framework allows a procedure to be specified for a massive data set, divides the problem and the data, and runs it in parallel [12]. For example, a very large dataset can be reduced to a smaller subset to which analytics can be applied. In a conventional data warehousing scenario, this might involve applying an ETL operation to the data to produce something usable by the analyst. In Hadoop, these kinds of operations are written as Map Reduce jobs in Java. The outputs of these jobs can be written back to HDFS or placed in a conventional data warehouse.

There are two key functions in Map Reduce:

map: the function that takes key/value pairs as input and produces an intermediate set of key/value pairs.
reduce: the function that merges all the intermediate values associated with the same intermediate key.

Fig. 5 Mapping

Fig. 6 Reducing
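The two functions can be illustrated with the usual word-count example. The sketch below runs in a single Python process purely to show the data flow; real Hadoop jobs are written in Java against the Hadoop API, and the names map_fn, reduce_fn and run_job are hypothetical.

```python
from collections import defaultdict


def map_fn(_key, line):
    """map: takes one (key, value) record and emits intermediate (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """reduce: merges all intermediate values that share the same intermediate key."""
    return word, sum(counts)


def run_job(records):
    """Run map over every record, group the output by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # grouping step between map and reduce
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


if __name__ == "__main__":
    lines = [(0, "big data needs big tools"), (1, "Hadoop handles big data")]
    print(run_job(lines))
```

In a cluster, the map calls would run in parallel on different blocks of the input and the grouping step would move data between machines, but the contract of the two functions stays the same.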

Fig. 7 Map Reduce Architecture

IV. CONCLUSION

The paper describes the concept of Big Data along with its characteristics: Volume, Velocity and Variety. The paper also focuses on the technical challenges of Big Data processing. These challenges must be addressed for efficient and rapid processing of Big Data. The paper explores Hadoop, an open source software framework used for processing Big Data. Hadoop, with its efficient distributed file system and its programming framework based on Map Reduce, is a powerful tool for managing large data sets. With its Map Reduce programming paradigm, overall architecture, ecosystem, fault-tolerance techniques and distributed processing, Hadoop offers a complete infrastructure for handling Big Data. Users can realize the benefits of Big Data by adopting Hadoop infrastructure for data processing.

REFERENCES

[1] Mrigank Mridul, Akashdeep Khajuria, Snehasish Dutta, Kumar N, Analysis of Big Data using Apache Hadoop and Map Reduce, Volume 4, Issue 5, May.
[2] Amogh Pramod Kulkarni, Mahesh Khandewal, Survey on Hadoop and Introduction to YARN, International Journal of Emerging Technology and Advanced Engineering (ISO 9001:2008 Certified Journal), Volume 4, Issue 5, May 2014.
[3] Suman Arora, Dr. Madhu Goel, Survey Paper on Scheduling in Hadoop, International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 5, May 2014.
[4] Ms. Vibhavari Chavan, Prof. Rajesh. N. Phursule, Survey Paper On Big Data, International Journal of Computer Science and Information Technologies, Vol. 5 (6).
[5] Poonam S. Patil, Rajesh. N. Phursule, Survey Paper on Big Data Processing and Hadoop Components, International Journal of Science and Research (IJSR), Volume 3, Issue 10, October 2014.
[6] S. Vikram Phaneendra & E. Madhusudhan Reddy, Big Data - solutions for RDBMS problems - A survey, in 12th IEEE/IFIP Network Operations & Management Symposium (NOMS 2010) (Osaka, Japan, Apr).
[7] Jimmy Lin, Map Reduce Is Good Enough? The control project, IEEE Computer 32 (2013).
[8] Albert Bifet, Mining Big Data In Real Time, Informatica 2013, Dec 2012.
[9] Aditya B. Patel, Manashvi Birla and Ushma Nair, "Addressing Big Data Problem Using Hadoop and Map Reduce," in Proc. Nirma University International Conference On Engineering.
[10] Dhole Poonam B, Gunjal Baisa L, Survey Paper on Traditional Hadoop and Pipelined Map Reduce, International Journal of Computational Engineering Research, Vol. 03, Issue 12.
[11] Sabia and Love Arora, Technologies to Handle Big Data: A Survey.
[12] Praveen Kumar, Dr. Vijay Singh Rathore, Efficient Capabilities of Processing of Big Data using Hadoop Map Reduce, Vol. 3, Issue 6, June 2014.
[13] Harshawardhan S. Bhosale, Prof. Devendra P. Gadekar, A Review Paper on Big Data and Hadoop, Volume 4, Issue 10, Oct.
