Clustering Analysis based on Data Mining Applications Xuedong Fan


Applied Mechanics and Materials Online: 2013-02-13
ISSN: 1662-7482, Vols. 303-306, pp 1026-1029
doi:10.4028/www.scientific.net/amm.303-306.1026
© 2013 Trans Tech Publications, Switzerland

Xuedong Fan
Xi'an International University, China
Email:ffff0729@63.com

Keywords: Data mining, Pattern recognition, Clustering analysis

Abstract. This paper applies a clustering algorithm based on data mining technology. Characteristic quantities of the noise patterns are extracted and selected for three types of patterns by means of pattern recognition algorithms, and data mining cluster analysis is carried out on data acquired under the same working conditions, ultimately giving satisfactory recognition.

Introduction
Data mining is the careful analysis of large amounts of data to uncover meaningful new relationships, patterns and trends. It uses pattern recognition, statistical and mathematical techniques and is the core of knowledge discovery: it finds implicit, previously unknown and potentially useful knowledge, which can be expressed as concepts, rules, laws or patterns. Data mining has attracted strong research interest in recent years because it has found a wide range of applications and achieved important practical value. Its applications span many fields: any industry that has a data warehouse or database worth analyzing can use mining tools for purposeful analysis. Common cases occur in retail, manufacturing, finance, insurance, communications and medical services. The large volume of information data has created a great demand for data mining applications, and the technology has developed and improved rapidly.
This paper makes a tentative, preliminary study of data mining technology applied to river target recognition; the results have theoretical and practical reference value. At present, the basic tasks of data mining fall mainly into five areas: classification and regression, clustering, association rules, sequential (timing) patterns, and deviation detection:
(1) Classification and regression. Classification maps data to predefined groups or classes. Because the classes are determined before the data are examined, classification is usually called supervised learning. A classification algorithm requires the classes to be defined, usually in terms of the attribute values of data whose class is already known. Regression uses historical attribute data to predict future trends. It first assumes that some known type of function (e.g., a linear function) can be fitted to the target data, and then uses error analysis to determine the function that best fits the target data.
(2) Clustering groups individuals into several categories according to similarity, so that the distance between individuals in the same category is as small as possible while the distance between individuals in different categories is as large as possible.
(3) Association rules reveal relationships between data items that are not directly represented in the data. The task of association analysis is to find association rules or correlations between things. Each association has a support and a confidence; a meaningful association rule must satisfy two given thresholds, the minimum support and the minimum confidence.
(4) Sequential (timing) patterns describe and model patterns or trends that occur frequently over time or in other sequences.
Like regression, sequential patterns use known data to predict future values; the difference is that the variables are located at different times. Sequential patterns combine association patterns with time-series models and focus on the correlation between data in the time dimension. They include time series analysis and sequence discovery.

(5) Deviation detection deals with differences and extreme special cases, such as anomalous instances in classification, outliers that fall outside clusters, and special cases that do not satisfy the rules. Most data mining methods discard such differences as noise; in some applications, however, the rare data may be more useful than the normal data. Deviation detection is used to discover abnormalities and changes relative to normal circumstances, and to analyze further whether a change is intentional fraud or normal variation. If the behavior is abnormal, preventive measures need to be taken as soon as possible.

For the time-domain noise samples of the three types of river targets (passenger ships, cargo ships, clippers), this paper selects clustering rule mining. Cluster analysis groups data by similarity when no categories are specified in advance, so clustering is also called unsupervised learning. The data are divided into groups that may or may not intersect, and the clustering task is accomplished by determining the similarity between data on pre-specified attributes. The input to clustering is a set of unlabeled data, which is divided according to the distances or similarities among the data themselves. The division principle is to maximize similarity within a group and minimize similarity between groups, that is, data in different clusters should be as different as possible and data in the same cluster as similar as possible. Clustering can also be used for outlier mining.
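The division principle described above can be illustrated with a short sketch. The paper does not say which clustering algorithm it uses, so the example below substitutes plain k-means, a common distance-based method, purely for illustration. The sample points, the function names `euclidean` and `kmeans`, and the choice of initial centroids are all assumptions made for this example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, centroids, max_iter=100):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its points, until assignments stop changing."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical unlabeled 2-D data forming two visibly separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Points that are close to each other end up with the same label, and no class labels are supplied at any stage, which is what makes the procedure unsupervised.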
The similarities and differences between clustering and classification should be made clear: the two are complementary and interdependent. Clustering groups individuals into several categories according to similarity, in the sense of "birds of a feather flock together". In the clustering process, data objects with many attributes are partitioned automatically by the clustering algorithm, which identifies the characteristics of the data and cuts the data into a number of categories. The author believes that a clustering rule mining algorithm can be used to find a clustering basis for the three types of targets, and that this basis can then be used for classification. The key problem is therefore to find a clustering mining algorithm that yields this clustering basis. Cluster analysis compares the properties of samples directly: samples with similar properties are assigned to one class, and samples with relatively large differences are assigned to different classes. Therefore, this paper extracts effective clustering features on the basis of Euclidean distance and uses the selected features with a pattern recognition algorithm to identify river targets.

1 Formation of the original features of river target noise
The radiated noise of river targets is the essential source of information by which passive detection equipment in the river detects, identifies and classifies targets. The frequency-domain information of the noise in particular is resistant to interference and easy to separate, so it can be used to detect and identify weak target signals. The frequency-domain features of river target noise are very rich and are valuable for target identification, classification and parameter estimation.
This paper studies noise samples from the three target types, analyzed in both the time domain and the frequency domain. For ten groups of radiated-noise samples from the three target types, the following analyses are carried out: time-domain waveform, CTZ transform, power spectrum, power cepstrum, Fourier transform, analytic power down spectrum, envelope spectrum, and short-time Fourier transform. These eight transforms are used to study the original features of the target samples and to extract and select effective features.

2 Feature extraction and selection
2.1 pattern recognition
Pattern recognition is the process of handling and analyzing the various forms of information (numerical, textual and logical) that characterize an object or phenomenon, in order to describe, identify, classify and interpret it. A pattern is information distributed over time or space, and a single pattern often carries a great amount of information. After digitization, a pattern recognition system therefore needs preprocessing to remove mixed-in interference and to reduce deformation and distortion. Feature extraction follows: a group of features is extracted from the digitized or preprocessed input pattern. A feature is a selected measurement that stays unchanged, or almost unchanged, under typical deformation and distortion and that contains as little redundant information as possible. Feature extraction maps the input pattern from the object space into the feature space, where the pattern is represented by a point or a feature vector. This mapping not only compresses the amount of information but also makes classification easier. The purpose of pattern recognition is to assign a specific object correctly to a category.

There are two basic approaches to pattern recognition: statistical pattern recognition and structural (syntactic) pattern recognition. This paper discusses only the design process based on the statistical approach. A pattern recognition system based on statistical methods consists mainly of four parts: data acquisition, preprocessing, feature extraction and selection, and classification decision. The raw radiated-noise data of the river targets were preprocessed before the transforms were applied, so the pattern recognition work here focuses on feature extraction and selection.

2.2 feature extraction
Features are extracted from the original features using a Euclidean distance metric. The distance between two feature vectors is a good measure of their similarity. If samples of the same class gather together in the feature space while samples of different classes lie far apart, classification is relatively easy to implement. Therefore, from the given D-dimensional feature space we choose d features that separate the classes from each other as far as possible.
Let $\delta(X_k^{(i)}, X_l^{(j)})$ denote the distance between the k-th sample of class $\omega_i$ and the l-th sample of class $\omega_j$. The feature set $X^*$ should be chosen so that the average distance $J(X)$ between the samples of the c classes is maximal, namely:

$$J(X^*) = \max_X J(X)$$

and

$$J(X) = \frac{1}{2}\sum_{i=1}^{c} P_i \sum_{j=1}^{c} P_j \frac{1}{n_i n_j}\sum_{k=1}^{n_i}\sum_{l=1}^{n_j}\delta\left(X_k^{(i)}, X_l^{(j)}\right)$$

Here $n_i$ is the number of training samples of class $\omega_i$ in the design set S, and $P_i$ is the a priori probability of class i. When the a priori probabilities are unknown, they can be estimated from the numbers of training samples, $P_i \approx n_i/n$, where n is the total number of samples in the design set. The Euclidean distance is used:

$$\delta(X_k, X_l) = \left[\sum_{j=1}^{d}(X_{kj} - X_{lj})^2\right]^{1/2} = \left[(X_k - X_l)^T (X_k - X_l)\right]^{1/2}$$

The subscripts of X have the following meaning: a single subscript is the sample number; with two subscripts, the first is the sample number and the second is the index of the feature within the sample. J(X) is calculated according to the Euclidean distance using a MATLAB program.

2.3 feature selection
The purpose of feature selection is to pick the optimal feature subset. This paper considers suboptimal search algorithms, which require little computation: individually best features, sequential forward selection, sequential backward selection, and the plus-l-minus-r method. These increase in complexity and in reliability in that order, so the plus-l-minus-r method is chosen. Because only eight sample features are involved, programming the plus-l-minus-r method is relatively simple, and the result is a ranking of feature combinations by J. Let $R_k'$ denote any feature combination that remains after removing k features from $x_1, x_2, \ldots, x_D$. The optimal feature group at step k, $R_k$, should satisfy

$$J(R_k) = \max_{R_k'} J(R_k')$$
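The separability criterion J(X) of Section 2.2 can be sketched in Python as follows. The paper's own implementation is in MATLAB and is not shown, so this is only an illustration: the function names, the toy samples, and the estimate of the priors as n_i / n are assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def separability(classes, features):
    """Prior-weighted average distance J(X) between samples of all class pairs.

    classes  -- list of classes, each a list of sample vectors
    features -- indices of the features that make up the candidate set X
    Priors are estimated as P_i = n_i / n (an assumption when they are unknown).
    """
    n = sum(len(c) for c in classes)
    # Keep only the chosen features of every sample.
    proj = [[[s[f] for f in features] for s in c] for c in classes]
    total = 0.0
    for ci in proj:
        for cj in proj:
            pair_sum = sum(euclidean(a, b) for a in ci for b in cj)
            total += (len(ci) / n) * (len(cj) / n) * pair_sum / (len(ci) * len(cj))
    return 0.5 * total

# Hypothetical samples: two classes, two features per sample.
class_a = [(0.0, 1.0), (0.2, 3.0)]
class_b = [(4.0, 1.1), (4.2, 2.9)]
classes = [class_a, class_b]

j_feature0 = separability(classes, [0])  # feature 0 separates the two classes
j_feature1 = separability(classes, [1])  # feature 1 overlaps between the classes
print(j_feature0 > j_feature1)           # True
```

In this toy data feature 0 gives the larger J, so a search that maximizes J(X) would prefer it over feature 1.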

The sequential backward search starts from $R_0 = R_D$ and proceeds for k = 1, 2, ... until k = D − d; the resulting feature group is $R = R_{D-d}$. The plus-l-minus-r method (l-r method) adds l features one at a time and then deletes r features one at a time.

Analysis of experimental results
Eight kinds of features are extracted from the river target noise, and eight J(X) values are calculated using the Euclidean distance. Ranking by the size of J(X) with the plus-l-minus-r algorithm selects the two best characteristic quantities for clustering and classification: those obtained from power cepstrum and envelope spectrum analysis. Then, for each class of target, J(X) between samples is calculated on the selected characteristic quantities. According to the clustering rule, the mean J(X) of each target class is taken and used to set its classification threshold. Cluster analysis is then performed on the test set, taking both characteristic parameters into account for identification and classification. Ten groups of noise samples from the three types of river targets are used for feature extraction and selection; these include 25 ... 80 river target data.
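The plus-l-minus-r search used in this section can be sketched as follows. This is a generic greedy version written under stated assumptions, not the author's program: the paper does not give its values of l and r, so l = 2 and r = 1 are placeholders, the search is assumed to start from the empty set, and `toy_criterion` with its `SCORES` table is a hypothetical stand-in for J(X) that has no connection to the experimental data.

```python
def plus_l_minus_r(n_features, d, criterion, l=2, r=1):
    """Greedy plus-l-minus-r search for d out of n_features features.

    Each round adds, one at a time, the l features that raise `criterion`
    most, then removes, one at a time, the r features whose removal costs least.
    """
    if not (l > r >= 0 and 0 < d <= n_features - r):
        raise ValueError("need l > r >= 0 and 0 < d <= n_features - r")
    selected = []
    while len(selected) < d:
        for _ in range(l):  # forward steps
            candidates = [f for f in range(n_features) if f not in selected]
            if not candidates:
                break
            selected.append(max(candidates, key=lambda f: criterion(selected + [f])))
        for _ in range(r):  # backward steps
            drop = max(selected, key=lambda f: criterion([g for g in selected if g != f]))
            selected.remove(drop)
    while len(selected) > d:  # trim any overshoot with extra backward steps
        drop = max(selected, key=lambda f: criterion([g for g in selected if g != f]))
        selected.remove(drop)
    return sorted(selected)

# Hypothetical per-feature merits for eight features (not from the paper).
SCORES = [0.5, 0.1, 0.3, 0.9, 0.2, 0.4, 0.8, 0.6]

def toy_criterion(subset):
    """Stand-in for J: sum of merits, with a penalty when the (assumed)
    redundant features 3 and 6 are selected together."""
    value = sum(SCORES[f] for f in subset)
    if 3 in subset and 6 in subset:
        value -= 0.5
    return value

best = plus_l_minus_r(n_features=8, d=2, criterion=toy_criterion)
print(best)  # [3, 7]
```

Features 3 and 6 have the two highest individual merits, yet the search returns features 3 and 7 because the criterion is evaluated on whole subsets. This is the reason for preferring a subset search over picking individually best features.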
The experiment consists of data acquisition, preprocessing, feature extraction, feature selection and identification. The samples of each target type are sorted randomly; 25, 25 and 35 samples are drawn from the three classes in turn as the training (learning) set, and the remaining 45, 45 and 60 samples form the identification test set. The experimental results are shown in Table 1.

Table 1: Number of target samples and correct recognition rate

                                          First class   Second class   Third class   Total average
Number of training samples                    25             25             35
Number of samples tested                      45             45             60
Correct recognition rate of test samples     96.8%          79.6%          72.5%          83.%

The experimental results show that the classification results are satisfactory. The original features need further improvement; if genetic algorithms, despite their computational complexity, were used for feature extraction and selection, the effect might be better. In summary, this study of classification and cluster analysis techniques applied to river target recognition has practical reference value and practical significance.

Sensors, Measurement and Intelligent Materials
10.4028/www.scientific.net/AMM.303-306

Clustering Analysis Based on Data Mining Applications
10.4028/www.scientific.net/AMM.303-306.1026