Research and Realization of AP Clustering Algorithm Based on Cloud Computing
Yue Qiang, Hu Zhongyu, Lei Xinhua, Li Xiaoming


4th International Conference on Machinery, Materials and Computing Technology (ICMMCT 2016)

Research and Realization of AP Clustering Algorithm Based on Cloud Computing

Yue Qiang 1,a,*, Hu Zhongyu 2,b, Lei Xinhua 1,c, Li Xiaoming 3,d

1 School of Information Technology, Kunming University, Kunming 650214, China
2 School of Auto-control and Mechanical Engineering, Kunming University, Kunming 650214, China
3 Office of Science Research, Kunming University, Kunming 650214, China

a wallay@126.com, b poundblue@126.com, c lkxmh@sina.com, d lxm@sina.com

Keywords: cloud computing; Hadoop; MapReduce; AP clustering algorithm; community structure

Abstract. With the extensive application of networks, the scale of data handled by cloud computing has grown massively. Cloud computing is a powerful technology for performing complex computation, and the number of applications running on cloud platforms with the Hadoop architecture is increasing. In this paper, after studying the AP clustering algorithm, we propose an AP clustering algorithm based on the MapReduce model and realize the parallelized AP clustering algorithm on the Hadoop cloud computing platform. We tested the parallelized AP clustering algorithm on two real-world networks. The experimental results show that our algorithm can successfully detect community structure in complex networks.

1. Introduction

Cloud computing is a significant technology for performing massive-scale complex computation and has become a powerful architecture. It can significantly reduce the cost of hardware, services, and software [1]. Analyzing huge amounts of data to extract meaningful information requires deploying data-intensive applications. The advantages of cloud computing include parallel processing, virtualized resources, security, and data-service integration with scalable data storage.
Cloud computing can not only minimize the cost of automation and computerization for individuals and enterprises, but also reduce the cost of infrastructure maintenance, efficient management, and user access. Hadoop is an open-source Apache software project based on Java that enables the distributed processing of large datasets across clusters; it is a reliable and scalable software framework for parallel and distributed computing. Hadoop has two primary components, HDFS and MapReduce. HDFS is a distributed file system designed to run on top of the local file systems of the cluster nodes and to store extremely large files suitable for streaming data access. MapReduce is a simplified programming model, pioneered by Google for data-intensive applications, for processing large datasets. Consequently, the storage system is not physically separated from the processing system. Traditional and existing tools and applications are inadequate for processing such large amounts of data. Hadoop is able to handle and process large-scale data of terabyte and petabyte size, and many enterprises, companies, and universities deploy Hadoop clusters in highly scalable and elastic computing environments.

The rest of this paper is organized as follows. Section 2 introduces cloud computing with the Hadoop architecture. Section 3 introduces the MapReduce programming model. Section 4 describes the AP clustering algorithm. Section 5 realizes the AP clustering algorithm based on MapReduce. Section 6 presents and analyzes the experimental results. Section 7 concludes this paper.

Copyright 2016 the authors. Published by Atlantis Press.

2. Cloud computing with Hadoop architecture

Hadoop is a preferential choice in the open-source cloud computing community for providing an efficient Big Data platform, and it offers an extensive selection of effective cloud services to help users achieve their goals [2]. Hadoop is not only a distributed file system with a storage function, but also a framework for running distributed applications on large clusters built from general-purpose computing devices. Hadoop consists of the following projects, as shown in Fig. 1.

[Fig. 1: Hadoop architecture. The stack comprises Pig, Chukwa, Hive, and HBase on top of MapReduce and HDFS, with ZooKeeper, Core, and Avro alongside.]

Pig provides a high-level scripting language and offers a run-time platform that allows users to execute MapReduce on Hadoop. Hive offers a data-warehouse structure on HDFS. HBase offers a scalable distributed database that supports structured data storage for large tables. Chukwa is a data collection and analysis framework built on HDFS. ZooKeeper is a high-performance service for coordinating the processes of distributed applications. Avro is a data serialization system; the tasks performed by Avro include data serialization, remote procedure calls, and passing data from one program or language to another. The most significant feature of Hadoop is that HDFS and MapReduce are closely related to each other.

3. MapReduce model

MapReduce is a programming model, or software framework, used in Apache Hadoop; it helps programmers with little parallel-programming experience to write parallel programs and to create programs capable of using the computers in a cloud. Hadoop MapReduce is provided for writing applications that process and analyze large data sets in parallel on large multi-node clusters of commodity hardware in a scalable, reliable, and fault-tolerant manner [3]. Data analysis and processing is specified by two functions: the map function (mapper) and the reduce function (reducer). The mapper takes a key/value pair as input and generates intermediate key/value pairs. The reducer merges all the pairs associated with the same key and then generates an output. The map function is applied to each input (key1, value1) and produces an intermediate list of pairs (key2, value2), whose domain differs from that of the input; the reduce function is applied to each [key2, list(value2)] to generate a final result list (key3, value3).
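The map/shuffle/reduce flow just described can be illustrated with a toy, single-process Python simulation. This is a sketch only: it is not the Hadoop Java API, and the word-count mapper/reducer is a conventional example rather than anything from this paper.

```python
from collections import defaultdict

def mapper(key, value):
    """(key1, value1) -> intermediate (key2, value2) pairs."""
    for word in value.split():
        yield (word, 1)

def reducer(key, values):
    """[key2, list(value2)] -> one final (key3, value3) pair."""
    return (key, sum(values))

def map_reduce(inputs):
    # Map phase: apply the mapper to every input record.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in mapper(key, value):
            intermediate[k2].append(v2)  # shuffle: group values by key2
    # Reduce phase: one reducer call per distinct intermediate key.
    return dict(reducer(k, vs) for k, vs in intermediate.items())

counts = map_reduce([(0, "map reduce map"), (1, "reduce")])
print(counts)  # {'map': 2, 'reduce': 2}
```

In Hadoop the shuffle step (grouping by intermediate key) is performed by the framework between the map and reduce phases; here it is simulated by the `defaultdict` accumulation.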
The MapReduce programming model is shown in Fig. 2.

[Fig. 2: The MapReduce programming model. Input files are divided into splits (split 0 to split 3); the map phase writes intermediate files, and the reduce phase produces the output files (output file 0 and output file 1).]

MapReduce uses a master/slave architecture, in which a JobTracker daemon runs on the master node and a TaskTracker daemon runs on each slave node. The JobTracker and TaskTrackers are known as the MapReduce engine [4].

(1) JobTracker. The JobTracker is in charge of scheduling all jobs, and it is the core of the task-assigning system. It runs on the master node and monitors the MapReduce tasks executed by the TaskTrackers on the slave nodes. The JobTracker is unique in the master/slave architecture, and it alone is responsible for the control of MapReduce in a Hadoop system. A job is submitted to the JobTracker on the master node, and the JobTracker then asks the NameNode for the actual location in HDFS of the data to be processed. The JobTracker locates the TaskTrackers on the slave nodes and submits the jobs to them.

(2) TaskTracker. A TaskTracker is in charge of executing user-defined operations and runs on a slave node. It accepts jobs from the JobTracker and executes MapReduce operations. While executing operations, a TaskTracker sends heartbeat messages to the JobTracker reporting the execution status of each task. The heartbeat messages let the JobTracker know how many task slots are available on each TaskTracker's node. In this way the TaskTrackers help the JobTracker track the overall state of job execution and provide an important basis for distributing the next tasks.

4. AP clustering algorithm

The affinity propagation (AP) clustering algorithm is a highly efficient clustering method proposed by Frey and Dueck in 2007 [5]; it has been shown to create clusters in much less time, and with much less error, than traditional clustering algorithms such as k-means. Unlike the k-means algorithm, which chooses cluster centers from a subset of the data points and requires the number of clusters to be specified, the AP algorithm simultaneously considers all data points as candidate cluster centers and is thus independent of the quality of the initial set of cluster centers. Two types of messages are passed between data points: responsibility and availability. The responsibility R(i, k) is sent from data point i to candidate exemplar point k and reflects how well suited point k would be as the exemplar of point i. The availability A(i, k) is sent from candidate exemplar point k to data point i and reflects how appropriate it would be for data point i to choose candidate exemplar k as its exemplar. The similarity between data points is set to the negative squared Euclidean distance between points i and j:

    S(i, j) = -Σ_{k=1}^{n} (X_ik - X_jk)^2    (1)

where X_ik and X_jk are the kth attributes of points X_i and X_j. The algorithm proceeds as follows:

(1) Initialization.
The availabilities are initialized to zero.

(2) Calculate the responsibilities according to Eq. (2):

    R(i, k) ← S(i, k) - max_{k'≠k} {A(i, k') + S(i, k')}    (2)

where S(i, k) is the similarity between points i and k.

(3) Calculate the availabilities according to Eqs. (3) and (4):

    A(i, k) ← min{0, R(k, k) + Σ_{j∉{i,k}} max{0, R(j, k)}},  for i ≠ k    (3)

    A(k, k) ← Σ_{j≠k} max{0, R(j, k)}    (4)

(4) If the cluster centers no longer change, or the maximum number of iterations is reached, terminate; otherwise return to step (2).

5. Realization of AP clustering algorithm based on MapReduce

(1) Calculating the similarity. The mapper computes the values (X_ik - X_jk)^2; the reducer then adds these values and negates the sum to obtain the squared similarities. In the input (key, value) pairs, the key is the row and column index of the input matrix, and the value is the matrix entry at the corresponding location.

(2) The distributed processing of the AP clustering algorithm. The mapper converts each row of the similarity and availability matrices from the input text into the format required by Eq. (2). After the MapReduce shuffle phase, the reducer merges the similarity and availability values belonging to the same row and calculates the values for that row according to Eq. (2). The process of calculating the values is shown in Fig. 3.

[Fig. 3: The MapReduce processing of the calculation. The input rows of similarity and availability are converted by the mappers; the reducers then calculate the values for each row and produce the output.]

(3) Determining the cluster centers. A cluster center is determined by the responsibility R(k, k) and the availability A(k, k) of data point k. When R(k, k) + A(k, k) > 0, the responsibility and availability of data point k for itself are large enough, and data point k is suitable as a cluster center.

(4) Dividing the data points. After the cluster centers are determined, every data point must be assigned to a cluster. The cluster center to which data point i belongs is determined by Eq. (5):

    max_{1≤k≤n} {A(i, C_k) + R(i, C_k)}    (5)

where n is the total number of cluster centers and C_k is the kth cluster center.

6. Experimental results and analysis

6.1. Experiment settings

Community structure detection in complex networks has been intensively investigated in recent years [6]. We used three interconnected computers to construct the cloud computing environment. It has a master/slave structure: one computer is the master and the other two are slaves. The cloud computing environment is shown in Table 1.

Table 1. Configuration of the cloud computing environment

    Host     IP Address    Hardware Configuration                 Software Configuration
    Master   192.168.1.1   CPU: Intel Xeon E5405; Memory: 4 GB
    Slaver1  192.168.1.2   CPU: Intel Core2 Q9400; Memory: 2 GB
    Slaver2  192.168.1.3   CPU: Intel Core2 Q9400; Memory: 2 GB

6.2. Performance evaluation

To evaluate the performance of the AP clustering algorithm, the EQ function is used as the criterion:

    EQ = (1/2m) Σ_i Σ_{v∈C_i, u∈C_i} (1/(O_v O_u)) [A_vu - k_v k_u / (2m)]    (6)

where m is the number of edges, O_v and O_u are the numbers of communities containing node v and node u, A is the adjacency matrix of the network, and k_v and k_u are the degrees of nodes v and u, respectively. A higher EQ value means a better overlapping community structure.

6.3. Experimental results and analysis

To test the AP clustering algorithm, we performed experiments on two widely used real-world networks with a known community structure: the well-known Karate Club network and the Dolphin network. The Karate Club network is a friendship network whose 34 nodes are the members of a karate club and whose 78 edges represent friendships between members. Due to a leadership dispute, the club split into two distinct groups.
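Eq. (6) is straightforward to compute directly from an adjacency matrix and a set of (possibly overlapping) communities. The sketch below is a small plain-Python version; the 5-node graph and its two communities are toy assumptions for illustration, not the Karate or Dolphin data.

```python
def extended_modularity(adj, communities):
    """EQ of Eq. (6) for an undirected graph given as a 0/1 adjacency matrix."""
    n = len(adj)
    m = sum(adj[v][u] for v in range(n) for u in range(v + 1, n))  # edge count
    k = [sum(row) for row in adj]                                  # node degrees
    O = [sum(v in c for c in communities) for v in range(n)]       # memberships per node
    eq = 0.0
    for c in communities:
        for v in c:
            for u in c:
                eq += (adj[v][u] - k[v] * k[u] / (2 * m)) / (O[v] * O[u])
    return eq / (2 * m)

# Two triangles sharing node 2 (an overlapping node).
adj = [[0, 1, 1, 0, 0],
       [1, 0, 1, 0, 0],
       [1, 1, 0, 1, 1],
       [0, 0, 1, 0, 1],
       [0, 0, 1, 1, 0]]
print(round(extended_modularity(adj, [{0, 1, 2}, {2, 3, 4}]), 3))  # 0.167
```

Because node 2 belongs to both communities, its terms are divided by O_2 = 2, which is exactly how EQ rewards overlapping community structure without double-counting shared nodes.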

The Dolphin network is a community of dolphins living in New Zealand. Its 62 nodes are dolphins, and its 159 edges connect members that are seen together more often than expected by chance; the network splits naturally into two large groups.

We compared the AP clustering (APC) algorithm with the COPRA algorithm, using run time and EQ to evaluate performance. The results are shown in Table 2.

Table 2. Comparison of performance measures

                 Karate network          Dolphin network
    Algorithm    run time (ms)   EQ      run time (ms)   EQ
    COPRA        164             0.158   224             0.303
    APC          32              0.168   92              0.306

The APC algorithm outperforms the COPRA algorithm in our experiments, with shorter run times and higher EQ values. In particular, as the network scale grows, the APC algorithm can accomplish the clustering task in less time than the COPRA algorithm. The distributed APC algorithm can therefore deal with large-scale data within a limited time.

7. Conclusions

In this paper, we have described cloud computing with the Hadoop architecture and the working mechanism of the MapReduce model, the most popular distributed computing framework. We then studied the AP clustering algorithm and realized it on MapReduce. Compared with traditional clustering algorithms such as k-means, the APC algorithm does not need the number of clusters to be specified and does not choose cluster centers from a subset of the data points. We constructed a cloud computing environment to realize the parallelized APC algorithm for identifying communities in complex networks, and tested its performance on two real-world networks. The experimental results confirm the validity and advantages of this approach. As future work, we aim to combine our method with other methods to improve the quality of the results.

Acknowledgement

This research was sponsored by the Science Research Projects of Kunming University (Project Nos. XJL15013 and XJL14006).

References

[1] L. Chang, R. Ranjan, Z. Xuyun, Y. Chi, D. Georgakopoulos, C. Jinjun, Public auditing for big data storage in cloud computing: a survey, in: 2013 IEEE 16th International Conference on Computational Science and Engineering (CSE), 2013, pp. 1128-1135.
[2] P.D. Londhe, S.S. Kumbhar, R.S. Sul, A.J. Khadse, Processing big data using Hadoop framework, in: Proceedings of the 4th SARC-IRF International Conference, New Delhi, India, Apr. 27, 2014, pp. 72-75.
[3] J. Ekanayake, S. Pallickara, G. Fox, MapReduce for data intensive scientific analyses, in: IEEE 4th International Conference on eScience, Indianapolis, Indiana, USA, Dec. 7-12, 2008, pp. 277-284.
[4] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM 51 (1) (2008) 107-113.
[5] B.J. Frey, D. Dueck, Clustering by passing messages between data points, Science 315 (5814) (2007) 972-976.
[6] S. Sadi, Ş. Öğüdücü, A.Ş. Uyar, An efficient community detection method using parallel clique-finding ants, in: The 2010 IEEE World Congress on Computational Intelligence, Barcelona, Spain, 2010, pp. 1-7.