Volume 6, Issue 3, March 2016    ISSN: 2277 128X
International Journal of Advanced Research in Computer Science and Software Engineering
Research Paper. Available online at: www.ijarcsse.com
Special Issue on 5th National Conference on Recent Trends in Information Technology 2016
Conference held at P.V.P. Siddhartha Institute of Technology, Kanuru, Vijayawada, India

Efficient Big Data Processing in Hadoop MapReduce

Kvn Krishna Mohan, K Prem Sai Reddy, K Geetha Sri, A Prabhu Deva, M. Sundarababu (Asst. Professor)
Department of IT, PVP Siddhartha Institute of Technology, Andhra Pradesh, India

Abstract: In this work, we describe how Big Data can be processed efficiently in Hadoop using the MapReduce technique. Big Data raises challenges of analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, and information privacy. The volume of content in the world keeps growing, and everyone faces the challenges of Big Data. Hadoop is an open-source software framework, written in Java, used for processing vast amounts of unstructured data. Hadoop, which typically runs on Linux machines, scales to clusters on the order of 10,000 cores. The framework splits the data into blocks and distributes them among the cluster nodes. MapReduce implements the processing of the data set with a parallel, distributed algorithm on a cluster.

Keywords: Hadoop, MapReduce, K-Means, Data Analysis, Storage, Clusters.

I. INTRODUCTION
Hadoop is an open-source framework used to store large amounts of data in a distributed environment. It is a Java-based programming framework for processing data across different clusters, designed to scale up from single servers to thousands of machines. Hadoop consists of two core layers: storage (HDFS) and processing (MapReduce). Hadoop is flexible, economical, and scalable, and it provides fault tolerance. MapReduce is the technique used for splitting large amounts of data into manageable pieces and processing them in parallel.
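The split-map-shuffle-reduce flow just introduced can be illustrated with a minimal word-count sketch. This is plain Python rather than the Hadoop Java API, and the function names and sample documents are purely illustrative:

```python
from collections import defaultdict

def map_phase(documents):
    """Mapper: emit a <word, 1> pair for every word in every input chunk."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: combine the grouped values for each key into a final result."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data needs big clusters", "hadoop processes big data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])   # "big" occurs three times across the two documents
```

In a real Hadoop job the shuffle is performed automatically by the framework; the dictionary-based grouping here only stands in for it.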
MapReduce consists of two tasks, Map and Reduce. The mapper turns the given input data into small chunks of intermediate records; after these pieces are generated, the data is shuffled, and the mapper output is reduced and stored in the database. The MapReduce framework operates on <key, value> pairs. K-Means is a clustering technique that offers an easy way to assign data points to groups of clusters, partitioning the data into a small number of clusters. The technique is fast, robust, and easy to understand.

II. RELATED WORK
Clustering has a long history comprising a wide variety of studies. When clustering is performed in a parallel fashion, as with MapReduce on Hadoop, data accuracy and reliability improve compared to other software. For example, consider Facebook, one of the largest tech giants in the world: it deals almost entirely with unstructured data, and millions of records must be retrieved within a few seconds. Unless data retrieval happens within seconds, nobody spares the time to use the website. Data usage has grown vastly over the past years, and to stand first in this competitive world, everyone needs a solution to this problem. Doug Cutting and Mike Cafarella, inspired by the white papers published by Google, created an open-source software framework to support distribution for the Nutch search-engine project. These steps led to the main answer to big-data analytical problems. Many IT companies have suffered huge downfalls because of inconsistent maintenance of data. Hadoop makes it easier to process and store bulk amounts of data.

III. THEORETICAL ANALYSIS: K-MEANS AND MAPREDUCE
Clustering is a function that groups a set of objects so that similar objects are combined into a single cluster. The main purpose of this clustering is to turn unstructured data objects into structured data objects.
Dissimilar objects form other clusters. Clustering should cope with noise and outliers, and it deals with different types of attributes. Clustering is used in various fields such as marketing and biology. The K-means algorithm partitions a data set into clusters: it clusters n objects, based on their attributes, into k partitions where k < n. It assumes that the object attributes form a vector space, and it classifies or groups the objects based on those attributes, where k is a positive integer. The K-means algorithm works as follows: begin with a decision on the value of k, the number of clusters. An initial partition classifies the data into k clusters. Each sample is then taken in sequence, and its distance from the centroid of each cluster is computed. If the sample is not in the cluster whose centroid is closest, switch the sample to that cluster and update the centroids of both clusters. Repeat the process until convergence is achieved. Some practical advice: use MapReduce only if you have enormous data, use plenty of defensive checks, and remember that testing can save a lot of time. The MapReduce technique is best when processing time must be kept low. MapReduce is a
functional programming model for analyzing individual records. MapReduce consists of a map and a reduce: the map processes the given input data, that data is shuffled, and the reduce condenses it into the small required data set, which is stored in the database.

K-Means Algorithms

Algorithm 1: k-means++ initialization
Input: k, the number of clusters; X = {x1, x2, ..., xn}, a set of data points.
Output: C = {c1, c2, ..., ck}.
1. C ← ∅.
2. Choose one center x uniformly at random from X; C ← C ∪ {x}.
3. Repeat:
4.   Choose x ∈ X with probability D(x)^2 / Σ_{x∈X} D(x)^2, where D(x) is the distance from x to its nearest already-chosen center.
5.   C ← C ∪ {x}.
6. Until k centers are chosen.
7. Proceed as with the standard k-means algorithm.

Algorithm 2: Mapper phase of k-means++ initialization
Input: k, the number of clusters; X = {x1, x2, ..., xn}, the mapper's subset of data points.
Output: (num[i], ci), i = 1, 2, ..., k, where num[i] denotes the number of points that center ci represents.
1. C ← ∅.
2. Choose one center x uniformly at random from X; C ← C ∪ {x}.
3. For i = 1 to k: num[i] ← 0.
4. While |C| < k:
5.   Compute D^2(x) between each x ∈ X and its nearest already-chosen center.
6.   Choose x with probability D^2(x) / Σ_{x∈X} D^2(x).
7.   C ← C ∪ {x}.
8. For i = 1 to n:
9.   Find the nearest center cj ∈ C for xi and increment num[j].
10. Return (num[i], ci) for i = 1, 2, ..., k.

Algorithm 3: Reducer phase of k-means++ initialization
Input: k, the number of clusters; X, the set of (num, c) pairs emitted by the mappers.
Output: C = {c1, c2, ..., ck}.
1. C ← ∅.
2. Choose one center c uniformly at random from X; C ← C ∪ {c}.
3. While |C| < k:
4.   Compute D^2(c) between each candidate c and its nearest already-chosen center.
5.   Choose c with probability num_c · D^2(c) / Σ_c num_c · D^2(c).
6.   C ← C ∪ {c}.
7. Return C.
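The k-means++ initialization and the subsequent standard k-means iterations can be sketched in plain Python. This is a single-machine sketch, not the mapper/reducer split of Algorithms 2 and 3; the function names, the fixed seed, and the six sample points are all illustrative assumptions:

```python
import random

def nearest_sqdist(point, centers):
    """Squared distance D^2(x) from a point to its nearest chosen center."""
    return min((point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2 for c in centers)

def kmeans_pp_init(points, k, rng):
    """Algorithm 1: first center uniform at random, each subsequent center
    drawn with probability proportional to D^2(x)."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [nearest_sqdist(p, centers) for p in points]
        centers.append(rng.choices(points, weights=weights)[0])
    return centers

def kmeans(points, k, iterations=20, seed=7):
    """Standard k-means starting from the k-means++ centers."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each sample to its closest centroid
            i = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                  + (p[1] - centers[i][1]) ** 2)
            clusters[i].append(p)
        for i, members in enumerate(clusters):  # update each centroid
            if members:
                centers[i] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centers

data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers = kmeans(data, k=2)
print(sorted(centers))  # two centroids, near (1.33, 1.33) and (8.33, 8.33)
```

On this toy data the two tight groups are recovered regardless of which points the seeded initialization picks first.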
IV. DATA ANALYSIS
A medical data set is analyzed using MapReduce together with K-Means, one of the best-known techniques in data mining. Data mining is straightforward, and analyzing data with it is easier than with neural networks or artificial intelligence. Data mining offers several functionalities, i.e., clustering, classification, regression or prediction, and association. Clustering is used to distill large amounts of data into the required quantities of evidence. We therefore chose clustering to explain the different types of hospital data, which consist of large data sets, and we chose the K-means algorithm because its speed on large data sets is incomparable to that of other techniques. Patients with different diseases are evaluated into different clusters, where a single cluster consists of patients with the same disease. The output is either unformatted or in naming format; since it is not structured, we used the Gephi tool to obtain a graph structure. The evaluated output is very accurate. To assess the data we used two different clusters, processed with the MapReduce method to get the desired outcome.

V. EXPERIMENTAL RESULTS
After the analysis of the data, Hadoop is installed on a Linux machine and the sample data is inserted into the repository. Later, the MapReduce code is executed along with a cluster initialization. All the files can be browsed at the default URL http://localhost:50070/. To run the MapReduce job, we need to specify all the paths, such as the input directory, the output directory, and the cluster-initialization directory, along with the algorithm used. Hadoop provides reliable output when mapping and analyzing the data, and this data can be employed for further representation.

Figure 1: Running Hadoop on localhost
Figure 2: Browse to the user directory
Figure 3: All the inserted data is present in the user directory
Figure 4: The output folder consists of two chunks of data
Figure 5: Output data
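The processing of the two clusters with the MapReduce method, as described in Section IV, can be sketched as a single k-means iteration expressed as map, shuffle, and reduce phases. The patient records and initial centroids below are illustrative assumptions, not the paper's actual medical data set:

```python
from collections import defaultdict

# Hypothetical patient records as (age, test score) vectors.
patients = [(25, 1.0), (30, 1.2), (28, 0.9), (70, 6.0), (65, 5.5), (72, 6.2)]
centroids = {0: (20.0, 1.0), 1: (80.0, 6.0)}  # assumed initial centroids, k = 2

def mapper(record):
    """Emit <nearest-centroid-id, record> for one input record."""
    cid = min(centroids, key=lambda i: (record[0] - centroids[i][0]) ** 2
                                       + (record[1] - centroids[i][1]) ** 2)
    return cid, record

def reducer(cid, records):
    """Average the records assigned to one centroid to get its new position."""
    n = len(records)
    return cid, (sum(r[0] for r in records) / n,
                 sum(r[1] for r in records) / n)

# Shuffle: group the mapper output by key, as Hadoop does between the phases.
groups = defaultdict(list)
for cid, rec in map(mapper, patients):
    groups[cid].append(rec)

new_centroids = dict(reducer(cid, recs) for cid, recs in groups.items())
print(new_centroids)
```

Repeating this map-shuffle-reduce round with the updated centroids until they stop moving yields the converged clusters; in the real job, the per-cluster averaging runs in parallel across reducer nodes.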
VI. CONCLUSION
This paper has outlined the importance of Hadoop and its different techniques for Big Data. Though much scalable software is available in the market, Hadoop is the choice of more than half of the Fortune 50 companies, and its usage has grown considerably over the last decade. Almost all the leading companies, such as Yahoo, Google, Amazon, Facebook, IBM, and EMC, prefer Hadoop over other software, while many companies are failing because they cannot adapt to big-data techniques. Hadoop's ability to process unstructured data is also a key feature, and it is the main reason for preferring it over the alternatives.

2016, IJARCSSE All Rights Reserved