Application of Clustering Algorithm in Big Data Sample Set Optimization


Yutang Liu 1, Qin Zhang 2
1 Department of Basic Subjects, Henan Institute of Technology, Xinxiang 453002, China
2 School of Mathematics and Information Science, Xinxiang University, Xinxiang 453000, China

Abstract

To address the poor clustering accuracy and slow convergence of the K-means algorithm in big data environments, a K-means algorithm based on optimized sampling clustering is proposed. The algorithm not only ensures reasonable initial values for K-means but also clusters within smaller sample sets to improve efficiency. Finally, the clustering centers of the original samples are obtained by re-clustering with a bottom-up hierarchical clustering method. The algorithm combines the advantages of hierarchical, partitioning and density-based methods. Theoretical analysis and experimental results show that the sampling clustering algorithm achieves better clustering accuracy than the compared algorithms and has strong robustness and scalability.

Keywords: Big data, K-means, Probability sampling, Clustering accuracy, Multi-cluster, Evidence theory.

1. INTRODUCTION

As information technology continues to evolve, many large enterprises, institutions and organizations accumulate vast arrays of diverse, heterogeneous data, together with the technical challenge of efficiently storing, processing and analyzing such valuable data. Mining useful information efficiently from big data sets is therefore of great importance (Jang, 2006). Clustering methods can reveal the intrinsic relationships within data without prior knowledge and can group valuable data sharing the same attributes into a single category, which makes it possible to extract valuable information from campus network big data for storage and further analysis. However, the time and space complexity of traditional methods is inadequate for campus network big data (Bruno and Fiori, 2013).
The K-means algorithm maintains a linear relationship with the data size n and is therefore suited to large-scale data processing; however, the algorithm contains an NP-hard subproblem (Marchetti and Zhou, 2014). It follows that the size of the sample set, and the degree to which it covers the categories of the original large dataset, play a decisive role in improving the application of the K-means algorithm to large-dataset clustering (Almaksour and Anquetil, 2014).

2. BIG DATA SAMPLING BASED ON LEADERS ALGORITHM

2.1 Leaders algorithm

The Leaders method is based on energy density and completes clustering by selecting a Leader point for each class. The Leader algorithm does not require the number of categories to be specified a priori, and it needs only a single scan of a large data set to cluster it (Chen and Liu, 2016). Its advantages in handling large data sets are obvious, but the algorithm is sensitive to the input order of the data points, and it requires within-class similarity to be greater than between-class similarity.
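The single-scan Leader procedure described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: it assumes Euclidean distance and a user-chosen distance threshold tau.

```python
import math

def leader_clustering(points, tau):
    """Single-scan Leader clustering: each point joins the first
    leader within distance tau, otherwise becomes a new leader."""
    leaders = []   # one representative point per cluster
    clusters = []  # members of each leader's cluster
    for p in points:
        for i, leader in enumerate(leaders):
            if math.dist(p, leader) <= tau:
                clusters[i].append(p)
                break
        else:  # no leader close enough: start a new cluster
            leaders.append(p)
            clusters.append([p])
    return leaders, clusters

leaders, clusters = leader_clustering(
    [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0), (5.1, 4.9)], tau=1.0)
print(len(leaders))  # 2
```

Because each point is compared only against the current leaders in one pass, the method scans the data once, but the resulting clusters depend on the input order, which is exactly the sensitivity noted above.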

Figure 1. Big data set local incremental clustering method: clustering process. (The figure shows clustering results T1 ... Tk over the big data set and an improved collection, with successfully clustered data deleted, a new collection formed, and local clustering results produced.)

2.2 Sampling

While maintaining the correlation between samples, the data points at the class intersections in Figure 1 that are not correctly classified are distributed by random sampling into different single sample sets, so that clustering each single sample set re-clusters this part of the data by category (Chen et al., 2014; Peddinti and Saxena, 2014). The amount of data for a single sample set is

s = f*n + (n/n_1)*log(1/δ) + (n/n_1)*sqrt(log^2(1/δ) + 2*f*n_1*log(1/δ))   (1)

where n is the size of the original data set, n_1 is the size of the smallest class, f is the fraction of that class the sample should capture, and δ is the allowed failure probability.

According to formula (1), the size of each sample set is calculated. In this paper, the large-scale data set of size n is randomly sampled m times, under the following conditions:

C_i ∩ C_j = ∅   (2)
n_i = n_j   (3)
∑_{i=1}^{m} n_i ≤ n   (4)

where i = 1, 2, ..., m, j = 1, 2, ..., m, m ∈ Z and i ≠ j; that is, the sample sets are pairwise disjoint.

3. SAMPLE SET CLUSTERING CENTER CALCULATION METHOD

At each sampling, the data in each initial cluster center area is randomly sampled. In this way, the union of the sample sets reaches the maximum possible coverage of all categories in the original data set, so that the categories of the sampled union and its centers are close to those of the original big data set. The expectation is that each single sample set contains all the classes of the original big data set.

3.1 Determination of single-sample cluster centers

Because the amount of data in a single sample set is small, a variety of classical clustering methods could be used to cluster each single sample set. In this paper, the K-means algorithm is used, with the initial cluster centers chosen from those determined in the Leaders preprocessing stage; this addresses the problem that the number of iterations of the algorithm is easily affected by the initial cluster center setting (Langone et al., 2016).
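The disjoint, equal-size sampling required by conditions (2)-(4) can be sketched as a shuffle-and-slice. This is an illustrative partition, not the paper's exact sampler; the seed parameter is an assumption added for reproducibility.

```python
import random

def disjoint_samples(data, m, sample_size, seed=0):
    """Draw m pairwise-disjoint random sample sets of equal size:
    C_i ∩ C_j = ∅, n_i = n_j, and m * n_i <= n (conditions (2)-(4))."""
    assert m * sample_size <= len(data), "samples must fit inside the data set"
    rng = random.Random(seed)
    shuffled = data[:]          # copy so the original order is preserved
    rng.shuffle(shuffled)
    return [shuffled[i * sample_size:(i + 1) * sample_size] for i in range(m)]

samples = disjoint_samples(list(range(100)), m=4, sample_size=20)
print(len(samples), len(samples[0]))  # 4 20
```

Slicing a single shuffled copy guarantees disjointness and equal sizes by construction, so the three conditions hold without any explicit checking per element.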
At the same time, fast sample clustering can be achieved because of the small amount of data in a single sample set.

Since each sample set has the same size and the single-sample-set clustering processes are independent, parallel processing is possible, further reducing the running time of the algorithm. At the same time, to combine the advantages of different clustering methods, other algorithms such as K-medoids can be applied to different sample sets to enhance the clustering quality.

3.2 Multi-sample cluster fusion

Suppose the number of categories in a big data set is k, and the number of categories covered by the i-th sample set is k_i (1 ≤ k_i ≤ k). After clustering the sample sets, in the situation shown in Figure 1, some natural clusters will inevitably be subdivided. At the same time, because the Leaders clustering centers are inaccurate, the number of clusters in some sample sets will exceed the number of initial cluster centers, so the clustering results of the single sample sets must be merged in order to pinpoint the cluster centers of the original big data set.

Figure 2. Sample clustering K-means process. (Map: generate samples randomly; Reduce: write data to the corresponding sample; Map: each sample obtains its cluster centers; Reduce: consolidate all sample clusters. The pipeline, probability sampling followed by sample clustering and result integration, is contrasted with traditional parallel K-means clustering of the big data.)

When the distance between the means of two classes is less than a preset threshold, the two classes are merged into a single class and the mean of the new class is recalculated.

4. EXPERIMENTAL RESULTS AND ANALYSIS

4.1 Experimental environment

The experimental cluster contains 6 nodes in total: 1 master node and 5 slave nodes, connected via a 100 Mb/s Fast Ethernet switch. The cluster consists of 6 ordinary PCs, each running Ubuntu 14.04, with 2 GB of memory and a 350 GB hard drive.

4.2 Data set

Two types of data sets are used in the experiments. The first is a generated Gaussian distribution data set of 1 GB.
The second comes from a standard test set: the real-world UC Irvine machine learning repository Bag of Words (BoW) data set, which consists of 3-D data points representing three feature fields (doc ID, word ID, count).

4.3 Performance evaluation

To illustrate the robustness, scalability and clustering quality of the algorithm, experiments were carried out on the two data sets. In this experiment, parallel K-means, K-means on MapReduce, sample-clustering K-means and optimized-sample-clustering K-means (OSCK) were compared. The clustering category count was k = 3, the number of samples was s = 24, and the number of data points per sample was taken, according to the size of the original data set, from the set V = {30000, 20000, 15000, 10000, 8000, 5000, 3000}.
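The threshold-based merging of sample cluster centers described in Section 3.2 can be sketched as a bottom-up loop: while any two class means are closer than the threshold, combine them and recompute the mean. This is an illustrative sketch; the weighting of each center by its point count is an assumption, since the paper does not specify how the new mean is formed.

```python
import math

def merge_centers(centers, threshold):
    """Bottom-up fusion of per-sample cluster centers: merge any two
    means closer than the threshold and recompute the combined mean."""
    # each entry is (mean, weight); weight counts the merged centers
    merged = [(list(c), 1) for c in centers]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                (ci, wi), (cj, wj) = merged[i], merged[j]
                if math.dist(ci, cj) < threshold:
                    w = wi + wj
                    # weighted mean of the two merged classes
                    new = [(a * wi + b * wj) / w for a, b in zip(ci, cj)]
                    merged[i] = (new, w)
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return [c for c, _ in merged]

print(merge_centers([[0.0, 0.0], [0.4, 0.0], [5.0, 5.0]], threshold=1.0))
# [[0.2, 0.0], [5.0, 5.0]]
```

Restarting the scan after each merge ensures that a newly formed mean can itself be merged again, which is how subdivided natural clusters from different sample sets are fused back into one class.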

Revista de la Facultad de Ingeniería U.C.V., Vol. 32, N 14, pp. 520-525, 2017

When the amount of data is small, a stand-alone environment operates entirely in memory and computes very quickly, while in a cluster environment a small amount of data spends a large share of its time on import and export between compute nodes and disks. With large amounts of data, stand-alone memory struggles, and the cluster shows the advantages of distributed computing: data transfer between nodes and disk import/export take only a small percentage of the total task time. In sampling-clustering K-means, obtaining approximate clustering centers for a large dataset from representative samples dramatically reduces the number of iterations required, reducing the running time.

Figure 3. Time performance curve (time/10^3 s versus data size/MB).

The clustering quality of each algorithm is then compared, as shown in Figure 4. As can be seen from Figure 4, the OSCK algorithm has the best clustering quality and is clearly superior to sampling-clustering K-means. Compared with sampling-clustering K-means, the optimized parallel K-means algorithm clusters better, but still worse than the OSCK algorithm. By removing the sub-optimal clustering centers of the samples, the OSCK algorithm makes the final clustering centers more representative, so it improves clustering accuracy to a certain extent.

Figure 4. Clustering performance curve (error/10^9 versus data size/MB).

Finally, the algorithms are compared with respect to the influence of the number of cluster nodes on performance. As can be seen from Figure 5, the sample-clustering K-means algorithm and the OSCK algorithm remain optimal in runtime as the number of nodes changes. In addition, as the number of compute nodes increases, the running time of each algorithm gradually decreases; however, adding nodes improves performance only to a certain extent. If T denotes the single-node computation time and TP the multi-node computation time, the speedup ratio is Sp = T / TP.

Figure 5. Algorithm running-time comparison (time/10^3 s versus number of nodes).

To verify the robustness of the OSCK algorithm in processing large data sets, each algorithm was run 10 times on the Gaussian distribution data set and the clustering results of each run were recorded. The sample-clustering K-means and OSCK algorithms show good robustness compared with the other two algorithms.

Figure 6. Algorithm speedup comparison (speedup ratio versus number of nodes).

5. CONCLUSIONS

With the rapid development of network applications, computer networks have penetrated every area of social life. While bringing strong impetus to social development, the issue of network information security has

become an important issue affecting the development of the Internet. Owing to the diversity of connection forms, the uneven distribution of terminals, and the openness and interconnectivity of networks, the network is vulnerable to attacks by hackers and other malicious actors. While the Internet brings great convenience and speed, it also brings huge risks. This article has focused on the information security model, security mechanisms, virtual private networks and intrusion detection systems; among these, the intrusion detection system is the hot point of future network security research. Future technology will develop toward hierarchical and intelligent systems, and future IDSs should be able to detect and alert on intrusions at different levels of the network protocol stack.

REFERENCES

Almaksour A., Anquetil E. (2014). ILClass: error-driven antecedent learning for evolving Takagi-Sugeno classification systems. Applied Soft Computing, 19(2), 419-429.

Bruno G., Fiori A. (2013). MicroClAn: microarray clustering analysis. Journal of Parallel & Distributed Computing, 73(3), 360-370.

Chen M.C., Kong X.S., Chen K. (2014). Application of statistical analysis software in food scientific modeling. Advance Journal of Food Science and Technology, 6(10), 1143-1146.

Chen M.C., Liu Q.L. (2016). Blow-up criteria of smooth solutions to a 3D model of electro-kinetic fluids in a bounded domain. Electronic Journal of Differential Equations, 128, 1-8.

Jang W. (2006). Nonparametric density estimation and clustering in astronomical sky surveys. Computational Statistics & Data Analysis, 50(3), 760-774.

Langone R., Van Barel M., Suykens J. (2016). Entropy-based incomplete Cholesky decomposition for a scalable spectral clustering algorithm: computational studies and sensitivity analysis. Entropy, 18(5), 182.

Marchetti Y., Zhou Q. (2014). Iterative subsampling in solution path clustering of noisy big data. Statistics, 9(4), 1-9.

Peddinti S.T., Saxena N. (2014). Web search query privacy: evaluating query obfuscation and anonymizing networks. Journal of Computer Security, 22(1), 155-199.