Clustering Analysis based on Data Mining Applications Xuedong Fan

Applied Mechanics and Materials Online: 2013-02-13 ISSN: 1662-7482, Vols. 303-306, pp 1026-1029
doi:10.4028/www.scientific.net/amm.303-306.1026
© 2013 Trans Tech Publications, Switzerland

Xi'an International University, China
Email: ffff0729@63.com

Keywords: Data mining, Pattern recognition, Clustering analysis

Abstract. This paper applies a clustering algorithm based on data mining technology. Characteristic quantities are extracted from the noise of three types of targets and selected with pattern recognition algorithms. A data mining clustering analysis carried out under the same working conditions ultimately gives satisfactory recognition.

Introduction

Data mining is the careful analysis of large amounts of data to uncover meaningful new relationships, patterns and trends. It is the core of knowledge discovery and uses pattern recognition, statistical and mathematical techniques to discover implicit, previously unknown and potentially useful knowledge. The extracted knowledge can be expressed as concepts, rules, laws and patterns. Data mining has attracted exciting research prospects in recent years because it can be applied widely and has achieved important practical value. Its applications are very extensive: any industry that holds a data warehouse or database with analytical value can use mining tools for purposeful analysis. Common cases occur in retail, manufacturing, finance, insurance, communications and medical services. The large volume of information data has created great demand for data mining applications and has driven rapid technological development and improvement.
This paper makes a tentative, preliminary study of data mining technology applied to river target recognition; the results have theoretical and practical reference value. At present, the basic tasks of data mining fall into five areas: classification and regression, clustering, association rules, sequential patterns, and deviation detection.

(1) Classification and regression: classification maps data to pre-defined groups or classes. Because the categories are identified before the test data are analyzed, classification is usually called supervised learning. A classification algorithm requires the classes to be defined, usually by describing the characteristics of data of known category in terms of attribute values. Regression uses historical attribute data to predict future trends. It first assumes that some known type of function (e.g., a linear function) can fit the target data, and then uses error analysis to determine the function that best fits the target data.

(2) Clustering: a group of individuals is divided into several categories according to similarity. The purpose is to make the distance between individuals of the same category as small as possible and the distance between individuals of different categories as large as possible.

(3) Association rules: these reveal relationships between data that are not directly represented in the data. The task of association analysis is to find the association rules or relevance between things. The strength of a link between things is measured by support and confidence; a meaningful association rule must satisfy two given thresholds, the minimum support and the minimum confidence.

(4) Sequential patterns: these describe and model patterns or trends that occur frequently over time or another sequence.
Like regression, sequential pattern mining uses known data to predict future values, but the data differ in the time at which the variables are observed. Sequential patterns combine association patterns with time series models and focus on the correlation between data along the time dimension. They comprise time series analysis and sequence discovery.

(5) Deviation detection: deviations cover differences and extreme special cases, such as anomalous instances in classification, outliers that fall outside the clusters, and special cases that do not satisfy the rules. Most data mining methods discard such differences as noise; in some applications, however, the rare data may be more useful than the normal data. Deviation detection is used to discover abnormalities and changes relative to normal circumstances, and to analyze further whether a change is intentional fraud or normal variation. If the behavior is abnormal, preventive measures must be taken as soon as possible.

For the time-domain noise samples of three types of river targets (passenger ships, cargo ships, clippers), this paper selects clustering rule mining. Cluster analysis groups information by similarity when no categories are given in advance, so clustering is also called unsupervised learning. It is the process of dividing data into groups that may or may not intersect, and the task can be done by determining the similarity between data on pre-specified attributes. The input of clustering is a set of unlabeled data, which is divided according to the distance or similarity of the data themselves. The principle of division is to keep the similarity within a group as large as possible and the similarity between groups as small as possible, so that data in different clusters are as different as possible and data in the same cluster are as similar as possible. Clustering the samples can also accomplish outlier mining.
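The idea of grouping unlabeled data by distance, and of treating isolated points as outlier candidates, can be illustrated with a minimal sketch. It uses a simple threshold-based "leader" clustering, which is an illustrative stand-in and not the method used in this paper; the function names, the sample points and the threshold value are all assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def leader_cluster(points, threshold):
    """Group unlabeled points by distance.

    A point joins the nearest cluster whose centroid lies within
    `threshold`; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of points
    for p in points:
        best, best_dist = None, threshold
        for members in clusters:
            centroid = [sum(c) / len(members) for c in zip(*members)]
            d = euclidean(p, centroid)
            if d <= best_dist:
                best, best_dist = members, d
        if best is None:
            clusters.append([p])
        else:
            best.append(p)
    return clusters


def outlier_candidates(clusters):
    """Points that ended up alone in a cluster are outlier candidates."""
    return [members[0] for members in clusters if len(members) == 1]


# Hypothetical two-dimensional feature vectors (illustrative values only).
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (0.1, 0.2), (10.0, 0.0)]
clusters = leader_cluster(points, threshold=1.0)
print([len(c) for c in clusters])
print(outlier_candidates(clusters))
```

In this toy data the two dense groups are recovered and the isolated point forms a cluster of its own. The result depends on the threshold and on the order in which the points are presented, which is one reason such a scheme is only a sketch.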
The similarities and differences between clustering and classification should be made clear: the two are complementary and interdependent. Clustering divides a group of individuals into several categories according to similarity, in the sense of "birds of a feather flock together". The data objects to be classified usually carry many attributes, and the clustering algorithm performs the division automatically: by identifying the characteristics of the data, it cuts the data into a number of categories. The author believes that a clustering rule mining algorithm can be used to find the clustering basis of the three types of targets, and that this basis can then be used for classification and identification. The key problem that follows is to find a clustering mining algorithm that yields this clustering basis. Cluster analysis compares the nature of the samples directly: samples of similar nature are assigned to one class, and samples whose nature differs considerably are assigned to different classes. This paper therefore uses feature extraction based on the Euclidean distance to obtain effective clustering features, and a pattern recognition algorithm to recognize river targets.

1 Formation of the original features of river target noise

The radiated noise of a river target is an essential source of information for a passive detection device in the river when it detects, identifies and classifies targets. Frequency-domain noise information in particular is not easily interfered with and is easy to separate, so it can be used to detect and identify weak target signals. The frequency-domain characteristics of river target noise are very rich and are of great value for target identification, classification and parameter estimation.
This paper studies noise samples of the three target types in both the time domain and the frequency domain. For ten groups of radiated noise samples of the three target types, the following analyses are performed: time-domain waveform, CTZ transform, power spectrum, power cepstrum, Fourier transform, analytic power cepstrum, envelope spectrum and short-time Fourier transform. These eight transforms are used to study the original features of the target samples and to extract and select effective features.

2 Feature extraction and selection

2.1 Pattern recognition

Pattern recognition is the process of processing and analyzing the various forms of information (numerical, textual and logical) that characterize an object or phenomenon, in order to describe, identify, classify and interpret it. Pattern information is distributed over time or space, and a single pattern often involves a large amount of information. A pattern recognition system therefore needs a preprocessing stage after digitization to remove mixed-in interference and to reduce deformation and distortion. Feature extraction follows: a group of features is extracted from the digitized and preprocessed input pattern. A feature is a selected measurement that remains unchanged or almost unchanged under ordinary deformation and distortion and that carries as little redundant information as possible. The feature extraction process maps the input pattern from the object space into the feature space, where the pattern can be represented by a point or a feature vector. This mapping not only compresses the amount of information but also makes classification easier.

The role and purpose of pattern recognition is to assign a specific thing correctly to a category. There are two basic methods of pattern recognition: statistical pattern recognition and structural (syntactic) pattern recognition. This article discusses only the design process based on statistical pattern recognition. A pattern recognition system based on statistical methods consists mainly of four parts: data acquisition, preprocessing, feature extraction and selection, and classification decision. The transforms applied earlier to the raw radiated noise data of the river targets constitute the preprocessing, so the pattern recognition work here focuses on feature extraction and selection.

2.2 Feature extraction

Features can be extracted from the original features with a method based on the Euclidean distance metric. The distance between two feature vectors is a good measure of their similarity. If samples of the same class gather together in the feature space while samples of different classes lie far from each other, classification is relatively easy to implement. Therefore, from the given D-dimensional feature space, we choose d features that separate the classes from each other as far as possible.
Let $\delta(X_k^{(i)}, X_l^{(j)})$ denote the distance between the k-th sample of class $\omega_i$ and the l-th sample of class $\omega_j$. The feature group $X^*$ should be selected so that the average distance $J(X)$ between the samples of the c classes is maximal, namely:

$$J(X^*) = \max_{X} J(X)$$

where

$$J(X) = \frac{1}{2} \sum_{i=1}^{c} P_i \sum_{j=1}^{c} P_j \frac{1}{n_i n_j} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} \delta\left(X_k^{(i)}, X_l^{(j)}\right)$$

Here $n_i$ is the number of training samples of class $\omega_i$ in the design set S, and $P_i$ is the a priori probability of class i. When the a priori probabilities are unknown, they can be estimated from the numbers of training samples as $P_i = n_i / n$, where n is the total number of samples in the design set. The Euclidean distance is used:

$$\delta(X_k, X_l) = \left[ \sum_{j=1}^{d} (X_{kj} - X_{lj})^2 \right]^{1/2} = \left[ (X_k - X_l)^T (X_k - X_l) \right]^{1/2}$$

The subscripts of X have the following meaning: a single subscript is the sample number; with two subscripts, the first is the sample number and the second is the sequence number of the feature within the sample. J(X) is calculated with the Euclidean distance algorithm, programmed in MATLAB.

2.3 Feature selection

The purpose of feature selection is to extract the optimal feature set. To keep the amount of calculation small, this article considers suboptimal search algorithms: the optimal combination of individual features, sequential forward selection, sequential backward selection, and the plus-l minus-r method (l-r method). Their complexity and reliability increase in that order, and the l-r method is selected. Because only eight sample features are involved, programming the l-r method is relatively simple, and the result is an ordering of the feature combinations by J.

For sequential backward selection, let $R_k$ denote a feature group that remains after k features have been removed from $x_1, x_2, \ldots, x_D$. The optimal feature group $\bar{R}_k$ at step k should satisfy

$$J(\bar{R}_k) = \max_{R_k} J(R_k)$$
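As an illustration of the criterion J(X) from Section 2.2, the following minimal Python sketch computes the prior-weighted average Euclidean distance between the samples of all class pairs. The paper's own implementation is in MATLAB and is not shown; the function names, the data layout (one list of feature vectors per class) and the toy data are assumptions made for this example, and the priors are estimated as $n_i / n$.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def criterion_j(classes, feature_idx):
    """Average between-sample distance J(X) for one candidate feature group.

    classes:     list of classes, each a list of feature vectors
    feature_idx: indices of the features that make up the group X
    """
    n = sum(len(samples) for samples in classes)
    priors = [len(samples) / n for samples in classes]  # P_i estimated as n_i / n
    # Keep only the selected features of every sample.
    projected = [[[s[f] for f in feature_idx] for s in samples] for samples in classes]
    total = 0.0
    for p_i, class_i in zip(priors, projected):
        for p_j, class_j in zip(priors, projected):
            pair_sum = sum(euclidean(a, b) for a in class_i for b in class_j)
            total += p_i * p_j * pair_sum / (len(class_i) * len(class_j))
    return 0.5 * total


# Toy data (illustrative only): feature 0 separates the classes, feature 1 does not.
toy_classes = [
    [(0.0, 1.0), (0.0, 1.0)],  # class 1
    [(2.0, 1.0), (2.0, 1.0)],  # class 2
]
j_feature0 = criterion_j(toy_classes, [0])
j_feature1 = criterion_j(toy_classes, [1])
print(j_feature0, j_feature1)
```

On this toy data the separating feature receives the larger J, which is the property the selection step relies on when it ranks feature groups.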

Sequential backward selection starts from the full set of D features, $\bar{R}_0$, and repeats the step for k = 1, 2, ... until k = D - d. The resulting feature group is $\bar{R}_{D-d}$. At step k, the plus-l minus-r method (l-r method) first adds l features one at a time and then deletes r features one at a time.

Analysis of experimental results

Eight kinds of features are extracted from the river target noise, and eight J(X) values are calculated with the Euclidean distance. Ranking by J(X) with the l-r method selects the two characteristic quantities that cluster and classify best: those obtained from power cepstrum analysis and envelope spectrum analysis. Next, J(X) is calculated between the sample characteristic quantities of each target class. According to the clustering rule, the mean J(X) of each target class is taken and a classification threshold is set from it. The test set is then cluster-analyzed, and the two characteristic parameters are considered together for identification and classification. Ten groups of noise samples of the three types of river targets are used for feature extraction and selection, covering 25 the 80 river target data.
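The plus-l minus-r search of Section 2.3 can be sketched as follows. This is a minimal illustration rather than the paper's MATLAB program: the function name, the default values l = 2 and r = 1, the rule for trimming to exactly d features in the last round, and the toy criterion are all assumptions. Any criterion function that scores a list of feature indices, such as J(X), could be passed in.

```python
def plus_l_minus_r(n_features, d, criterion, l=2, r=1):
    """Select d of n_features feature indices with a plus-l minus-r search.

    criterion: function mapping a list of feature indices to a score.
    Requires l > r >= 0 and 1 <= d <= n_features.
    """
    if not (l > r >= 0) or not (1 <= d <= n_features):
        raise ValueError("need l > r >= 0 and 1 <= d <= n_features")
    selected = []
    while True:
        # Plus-l phase: add, one at a time, the feature that raises the score most.
        for _ in range(l):
            remaining = [f for f in range(n_features) if f not in selected]
            if not remaining:
                break
            best = max(remaining, key=lambda f: criterion(selected + [f]))
            selected.append(best)
        # Minus-r phase: normally r removals. Assumption of this sketch: once
        # the target size is within reach, or every feature has been added,
        # trim directly to exactly d features.
        removals = r
        if len(selected) - r >= d or len(selected) == n_features:
            removals = len(selected) - d
        for _ in range(removals):
            # Drop the feature whose removal leaves the best-scoring group.
            worst = max(selected, key=lambda f: criterion([g for g in selected if g != f]))
            selected.remove(worst)
        if len(selected) == d:
            return sorted(selected)


# Toy criterion (illustrative only): individual scores, with a penalty when the
# redundant features 0 and 1 are selected together.
SCORES = [9, 8, 5, 4]


def toy_criterion(subset):
    score = sum(SCORES[f] for f in subset)
    if 0 in subset and 1 in subset:
        score -= 7
    return score


chosen = plus_l_minus_r(n_features=4, d=2, criterion=toy_criterion)
print(chosen, toy_criterion(chosen))
```

With this toy criterion, picking the two individually best features (0 and 1) scores 10, whereas the search returns features 0 and 2 with a score of 14. This shows why a search over combinations can outperform ranking features one by one.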
The experiment is divided into data acquisition, preprocessing, feature extraction, feature selection and identification. The samples of each target type are sorted randomly; 25, 25 and 35 samples are drawn from the three classes in turn as the training set, and the remaining 45, 45 and 60 samples form the identification test set. The experimental results are shown in Table 1.

Table 1: Number of target samples and correct recognition rate

                                            First class   Second class   Third class   Total average
Number of training samples                  25            25             35
Number of test samples                      45            45             60
Correct recognition rate of test samples    96.8%         79.6%          72.5%         83.%

The experimental results show that the classification results are satisfactory. The original features still need further improvement; if genetic algorithms, despite their computational complexity, were used for feature extraction and selection, the effect might be better. In summary, this study applies classification and clustering analysis techniques to river target recognition and has practical reference value and significance.

Sensors, Measurement and Intelligent Materials: doi:10.4028/www.scientific.net/AMM.303-306
Clustering Analysis Based on Data Mining Applications: doi:10.4028/www.scientific.net/AMM.303-306.1026