Parallel Implementation of Classification Algorithms Based on Cloud Computing Environment

TELKOMNIKA, Vol. 10, No. 5, September 2012, pp. 1087~1092
e-ISSN: 2087-278X, accredited by DGHE (DIKTI), Decree No: 51/Dikti/Kep/2010

Lijuan Zhou, Hui Wang, Wenbo Wang
Capital Normal University, Information Engineering College, Beijing, China, 100048
e-mail: wanghu861218@163.com

Abstract

As an important task of data mining, classification has received considerable attention in many applications, such as information retrieval, web searching, etc. The growing volume of information produced by advancing technology, together with the growing individual needs of data mining, makes classifying very large datasets a challenging task. To deal with this problem, many researchers have tried to design efficient parallel classification algorithms. This paper briefly introduces classification algorithms and cloud computing, analyses the weak points of present parallel classification algorithms, and then proposes a new model for parallel classification algorithms. It mainly introduces a parallel Naïve Bayes classification algorithm based on MapReduce, a simple yet powerful parallel programming technique. The experimental results demonstrate that the proposed algorithm improves on the performance of the original algorithm and can process large datasets efficiently on commodity hardware.

Keywords: Naïve Bayes, Classification, MapReduce, Hadoop

Copyright © 2012 Universitas Ahmad Dahlan. All rights reserved.

1. Introduction

The rapid growth of the Internet and World Wide Web has led to vast amounts of information available online, now considered Big Data. Storing, managing, accessing, and processing this vast amount of data represents a fundamental need and an immense challenge in order to satisfy needs to search, analyse, mine, and visualize the data as information. Efficient parallel classification algorithms and implementation techniques are the key to meeting the scalability and performance requirements entailed in such scientific data analyses.
So far, several researchers have proposed parallel classification algorithms. These parallel classification algorithms share the following flaws [1]: (a) they assume that all objects can reside in memory simultaneously; (b) the parallel systems offer restricted programming models and use those restrictions to parallelize the computation automatically. Both assumptions are prohibitive for datasets composed of millions of objects. Therefore, dataset-oriented parallel classification algorithms should be developed, and the parallel algorithms should run on tens, hundreds, or even thousands of servers. With the emergence of cloud computing, parallel techniques are able to solve more challenging problems, such as heterogeneity and frequent failures. Cloud computing architectures that can support data-parallel applications are a potential solution to the terabyte- and petabyte-scale data processing requirements of Big Data computing [2]. Several solutions have emerged, including the MapReduce architecture pioneered by Google and now available in an open-source implementation called Hadoop, used by Yahoo, Facebook, and others.

In this paper, we adapt classification algorithms to the MapReduce framework, implemented by Hadoop, to make the classification method applicable to large-scale data. We conduct comprehensive experiments to evaluate the proposed algorithm on actual datasets. The results demonstrate that the efficiency of the proposed algorithm is higher than that of the initial algorithm.

The rest of the paper is organized as follows. Section 2 introduces MapReduce. Section 3 presents the parallel Naïve Bayes algorithm based on the MapReduce framework. Section 4 shows experimental results and evaluations. Finally, conclusions and future work are presented in Section 5.

Received June 7, 2012; Revised September 2, 2012; Accepted September 11, 2012

2. MapReduce Overview

MapReduce is a software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers. The MapReduce programming model is designed to compute large volumes of data in a parallel fashion [3]. The model divides the workload across the cluster by dividing the input into input splits. When a client submits a job to the framework, each input split is processed by a single map task. Each split is divided into records, and the map processes each record in turn. The client does not need to deal with InputSplits directly, because they are created by an InputFormat, which is responsible for creating the input splits and dividing them into records. The framework assigns one split to each map function. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible with the help of the rack-aware file system. The TaskTracker processes the records in turn.

The MapReduce framework guarantees that the input to every reducer is sorted by key. The process that performs the sort, and transfers the map outputs to the reducers as inputs, is known as the shuffle. The map function does not simply write its output to disk; for efficiency, it buffers writes in memory and does some pre-sorting. Figure 1 shows what happens.

[Figure 1. The framework of MapReduce: input splits read from HDFS are processed by map tasks, whose output is buffered in memory, partitioned, sorted, and spilled to disk, then merged and shuffled to the reduce tasks, which merge their inputs and write the output.]

3. Parallel Naïve Bayes Algorithm Based on MapReduce

In this section we present the main design of parallel Naïve Bayes based on MapReduce. First, we give a brief overview of the Naïve Bayes algorithm and analyse the parallel and serial parts of the algorithm. Then we explain in detail how the necessary computations can be formalized as map and reduce operations.

3.1. Naïve Bayes Algorithm

Naïve Bayes is a statistical classification method.
It is a well-studied probabilistic algorithm which is often used in classification. It uses the knowledge of probability and statistics for classification. Studies comparing classification algorithms have found Naïve Bayes comparable in performance with decision tree and selected neural network classifiers. Naïve Bayes classifiers have also exhibited high accuracy and speed when applied to large databases. The Naïve Bayes classifier assumes that the presence of a particular feature of a class is unrelated
to the presence of any other feature, given the class variable. This assumption is called class conditional independence.

To demonstrate the concept of Naïve Bayes classification, consider the following. Let Y be the classification attribute and X = {x1, x2, ..., xn} be the vector of input attribute values; the classification problem then simplifies to estimating the conditional probability P(Y | X) from a set of training patterns. P(Y | X) is the posterior probability and P(Y) is the prior probability. Suppose that there are m classes Y1, Y2, ..., Ym. Given a tuple X, the classifier predicts that X belongs to the class having the highest posterior probability; that is, the Naïve Bayes classifier predicts that tuple X belongs to class Yi if and only if

    P(Yi | X) > P(Yj | X) for 1 <= j <= m, j != i    (1)

Bayes' rule states that this probability can be expressed as

    P(Yi | X) = P(X | Yi) P(Yi) / P(X)    (2)

As P(X) is constant for all classes, only P(X | Yi) P(Yi) needs to be maximized. The prior probabilities are estimated by the frequency of each Yi in the training set. To reduce the computation needed to evaluate P(X | Yi), the Naïve Bayes assumption of class conditional independence is made, so the likelihood can be written as

    P(X | Yi) = ∏_{k=1..n} P(xk | Yi)    (3)

and the probabilities P(x1 | Yi), P(x2 | Yi), ..., P(xn | Yi) are easily estimated from the training tuples. The predicted class label is the class Yi for which P(X | Yi) P(Yi) is the maximum.

3.2. Naïve Bayes Based on MapReduce

Cloud computing can be defined as the provision of all computing services through the Internet. It is the most advanced version of the client-server architecture and takes the system to a very high level of resource sharing and scaling. The resource pools are composed of a large number of computing resources, which are used to create highly virtualized resources for users dynamically. But for the analysis of massive data, the cloud platform lacks parallel implementations of massive data mining and analysis algorithms [4].
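Stepping back to Section 3.1, the decision rule in equations (1)-(3) can be sketched numerically. The code below is a minimal illustration with made-up data, assuming nominal attributes and plain frequency estimates for the priors and conditionals:

```python
from collections import Counter

# Tiny illustrative training set: each tuple is (x1, x2, class label).
train = [("sunny", "hot", "no"), ("sunny", "mild", "no"),
         ("rainy", "mild", "yes"), ("rainy", "hot", "yes"),
         ("rainy", "mild", "yes")]

# Prior counts for each class Yi.
labels = Counter(y for *_, y in train)
# Conditional counts for each (attribute index, value, class) triple.
cond = Counter((k, xk, y) for *xs, y in train for k, xk in enumerate(xs))

def posterior_score(xs, y):
    # P(Yi) * prod_k P(xk | Yi), per equations (2) and (3);
    # P(X) is dropped because it is constant across classes.
    score = labels[y] / len(train)
    for k, xk in enumerate(xs):
        score *= cond[(k, xk, y)] / labels[y]
    return score

def classify(xs):
    # Equation (1): choose the class with the highest posterior score.
    return max(labels, key=lambda y: posterior_score(xs, y))

print(classify(("rainy", "mild")))  # yes
```

Note that a zero frequency for any (value, class) pair zeroes out the whole product; practical implementations usually add Laplace smoothing, which the paper does not discuss.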
Therefore, a new cloud computing model for massive data mining includes pre-processing for huge amounts of data, cloud computing for massively parallel data mining algorithms, new massive data mining methods, and so on [5]. The critical problem of massive data mining is the parallelization of the data mining algorithms. Cloud computing uses the new computing model known as MapReduce, which means that existing data mining algorithms and parallel strategies cannot be applied directly to the cloud computing platform for massive data mining; some transformation must be done. On this basis, and for the characteristics of massive data mining algorithms, the cloud computing model has been optimized and extended to make it more suitable for massive data mining [6]. Therefore, this paper adopts the Hadoop distributed system infrastructure, which provides the storage capacity of HDFS and the computing capability of MapReduce, to implement parallel classification algorithms. The implementation of parallel Naïve Bayes in the MapReduce model is divided into training and prediction stages.

3.2.1. Training Stage

The distributed computing of Hadoop is divided into two phases, called Map and Reduce. First, the InputFormat belonging to the Hadoop framework loads the input data as small data blocks known as data fragments. The size of each data
fragment is 5M; all fragments are equal in length, and each split is divided into records. Each map processes a single split; the map task passes the split to the getRecordReader() method on the InputFormat to obtain a RecordReader for that split. The RecordReader is an iterator over the records. The map task then uses the RecordReader to generate record key-value pairs, which it passes to the map function. Secondly, the map function counts the categories and attributes of the input data, including the values of the categories and attributes. The attributes and categories of the input records are separated by commas, and the final attribute is the class label. Finally, the reduce function aggregates the counts of each attribute and category value, producing results of the form (category, Index1:count1, Index2:count2, Index3:count3, ..., Indexn:countn), and then outputs the training model. Its implementation is described as follows.

Algorithm Produce Training: map(key, value)
Input: the training dataset
Output: <key, value> pair, where key is the category and value is the frequency of an attribute value
1  FOR each sample DO BEGIN
2    Parse the category and the value of each attribute
3    Count the frequency of the attribute values
4    FOR each attribute value DO BEGIN
5      Take the label as key, and "attribute index : frequency of the attribute value" as value
6      Output <key, value>
7    END
8  END

Algorithm Produce Training: reduce(key, value)
Input: the key and value output by the map function
Output: <key, value> pair, where key is the label and value is the aggregated frequency of attribute values
1  sum <- 0
2  FOR each attribute value DO BEGIN
3    sum += value.next.get()
4  END
5  Take key as key, and sum as value
6  Output <key, value>

3.2.2. Prediction Stage

Predict each data record with the output of the training model. The implementation of the algorithm is as follows: first, use the statistical values of the attribute values and category values to classify the unlabeled records.
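The training-stage map and reduce of Section 3.2.1 can be mimicked in a few lines. This is an in-memory simulation, not Hadoop code; the shuffle is modeled with a dictionary, and the record format follows the paper's convention of comma-separated attributes with the class label last:

```python
from collections import defaultdict

def train_map(record):
    # Parse one comma-separated sample; the last field is the class label.
    *attrs, label = record.split(",")
    # Emit ((label, attribute index, value), 1) for every attribute value.
    return [((label, i, v), 1) for i, v in enumerate(attrs)]

def train_reduce(key, counts):
    # Aggregate the frequency of each (label, index, value) triple.
    return key, sum(counts)

records = ["sunny,hot,no", "sunny,mild,no", "rainy,hot,yes"]
shuffled = defaultdict(list)          # stands in for the shuffle phase
for r in records:
    for k, v in train_map(r):
        shuffled[k].append(v)
model = dict(train_reduce(k, vs) for k, vs in shuffled.items())
print(model[("no", 0, "sunny")])  # 2
```

In the real job, each aggregated count would be serialized in the (category, Index1:count1, ...) form described above and written out as the training model.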
In addition, the distributed cache is used to improve the efficiency of the algorithm during its execution. The implementation is described as follows.

Algorithm Produce Testing: map(key, value)
Input: the test dataset and the Naïve Bayes model
Output: the labels of the samples
1   modeltype <- new modeltype()
2   categories <- modeltype.getcategorys()
3   FOR each attribute value not NULL DO BEGIN
4     Obtain one category from categories
5   END FOR
6   FOR each attribute value DO BEGIN
7     FOR each category value DO BEGIN
8       pct <- counter(attribute, category) / counter(category)
9       result <- result * pct
10    END FOR
11  END FOR
12  Take the category of the max result as key, and the max result as value
13  Output <key, value>
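The prediction-stage map above multiplies, for each candidate category, the ratios counter(attribute, category) / counter(category) and keeps the arg-max category. A self-contained sketch of that loop (the model counts here are illustrative, not the output of a real run):

```python
# Trained model, shaped as the training stage would emit it:
# category -> {"index:value" -> count}, plus per-category sample counts.
model = {
    "yes": {"0:rainy": 3, "1:mild": 2, "1:hot": 1},
    "no":  {"0:sunny": 2, "1:hot": 1, "1:mild": 1},
}
category_count = {"yes": 3, "no": 2}

def predict_map(record):
    # Mirrors Algorithm Produce Testing: for each category, multiply the
    # ratios counter(attribute, category) / counter(category), then emit
    # the category with the maximum result and that result as the value.
    best_cat, best_score = None, -1.0
    for cat, counts in model.items():
        result = 1.0
        for i, value in enumerate(record):
            result *= counts.get(f"{i}:{value}", 0) / category_count[cat]
        if result > best_score:
            best_cat, best_score = cat, result
    return best_cat, best_score

print(predict_map(("rainy", "mild"))[0])  # yes
```

In the real job each map task would load this model from the distributed cache, as the paper notes, rather than hold it in a module-level dictionary.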

4. Experimental Results

In this section, we perform some preparatory experiments to test the efficiency and scalability of the parallel Naïve Bayes algorithm proposed in this paper. We build a small cluster of 3 business machines (1 master and 2 slaves) on Linux; each machine has two 3.10 GHz cores, 4 GB of memory, and a 500 GB disk. We use Hadoop version 0.20.2 and Java version 1.6.0_26. We use UCI data sets to verify the results. The experimental data sets are shown in Table 1.

Table 1. The experimental data sets

    Data set     Number of samples    Dimension    Number of categories
1   Wine         178                  13           3
2   Vertebral    310                  6            2
3   Bank-data    600                  11           2
4   Car          1728                 6            4
5   Abalone      4177                 8            28
6   Adult        32561                14           2
7   PokerHand    1000000              11           10

First, pre-processing of the above data sets must be done: all attribute types are normalized to nominal attributes. Then the Naïve Bayes classifier implemented with MapReduce is trained on the training data sets to generate the classification model, and the model is then used to classify the samples with their category labels removed. The experiment is run on the cluster composed of three machines, and the results, compared with the test results of the conventional method, are shown in Figure 2.

[Figure 2. Executing time with different sizes]

The comparison experiment shows that the performance of the improved algorithm is higher than that of the conventional method on large data sets, which verifies that the Bayesian algorithm running in the cloud environment is more efficient than the traditional Bayesian algorithm. However, because the data sets differ in size, attributes, and number of categories, the time the algorithm spends does not show a linear relationship. Since running a Hadoop job requires starting the cluster first, which takes a little time, the data processing time is relatively longer when the data set is small. This also verifies that Hadoop is well suited to processing huge amounts of data.

5. Conclusions

As data classification has attracted a significant amount of research attention, many classification algorithms have been proposed in the past decades. However, the growing volume of data in applications makes classifying very large datasets a challenging task. In this paper, we propose a fast parallel Naïve Bayes algorithm based on MapReduce, which has been widely embraced by both academia and industry. Preparatory experiments show that the parallel algorithm can not only process large datasets but also enhance the efficiency of the algorithm. In future work, we will implement other classification algorithms, conduct further experiments, and refine the parallel algorithms to improve the usage efficiency of computing resources.

Acknowledgements

This research was supported by the China National Key Technology R&D Program (2012BAH20B03), the National Nature Science Foundation (31101078), the Beijing Nature Science Foundation (4122016 and 4112013), the Beijing Educational Committee science and technology development plan project (KM201110028018), and "The computer application technology" Beijing municipal key construction discipline.

References
[1] Weizhong Zhao, Huifang Ma, Qing He. Parallel K-Means Clustering Based on MapReduce. Lecture Notes in Computer Science. 2009; 5931: 674-679.
[2] A Pavlo, E Paulson. A Comparison of Approaches to Large-Scale Data Analysis. Proc. ACM SIGMOD. 2009: 165-178.
[3] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI. 2004: 137-150.
[4] Jaliya Ekanayake, Shrideep Pallickara. MapReduce for Data Intensive Scientific Analyses. IEEE eScience. 2008: 277-284.
[5] C Chu, S Kim, et al. Map-Reduce for Machine Learning on Multicore. In NIPS '07: Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems.
[6] Qing He, Fuzhen Zhuang, Jincheng Li, Zhongzhi Shi. Parallel Implementation of Classification Algorithms Based on MapReduce. Lecture Notes in Computer Science. 2010; 6401: 655-662.