A Similarity Measure Method for Symbolization Time Series


Research Journal of Applied Sciences, Engineering and Technology 5(5): 1726-1730, 2013
ISSN: 2040-7459; e-ISSN: 2040-7467
Maxwell Scientific Organization, 2013
Submitted: July 27, 2012 Accepted: September 03, 2012 Published: February 11, 2013

A Similarity Measure Method for Symbolization Time Series

Qiang Niu and Zhigang Li
Department of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China

Abstract: Similarity measurement is the basic task underlying time series data mining. The LCSS measure has obvious limitations when comparing two time series of different lengths and in the selection of a linear function. The ELCS measure is proposed, which normalizes the sequences and introduces a scale factor to limit the search path in the similarity matrix. Experiments with a hierarchical clustering algorithm show that the improved measure makes up for the shortcomings of LCSS, improves the efficiency and accuracy of clustering and improves time complexity.

Keywords: Hierarchical cluster, LCSS, similarity measure, time series

INTRODUCTION

Time series have always been an important and interesting research field because they appear frequently in different applications. The time series similarity measure proposed by Agrawal et al. (1993) has become a hot research topic because of its wide use in data mining tasks such as time series classification, clustering and abnormality detection. Many methods have been developed for searching time series in large data sets, and the similarity measure of time series is a very important task in the data mining process.

Several similarity measures for time series exist. Faloutsos et al. (1994) proposed a fast subsequence matching method based on the Euclidean distance metric, in which the two time series are treated as two points of the same dimension and a threshold is set to judge whether they are similar. Euclidean distance requires two sequences of equal length and ignores the temporal characteristics of time series, which limits its application to time series similarity measurement. Chung et al. (2004) use a weighting method within the Euclidean distance and eliminate the transform offset, but some parameters must be set manually. Berndt and Clifford (1994) introduced the Dynamic Time Warping distance (DTW) to the time series similarity measure. It performs well when comparing the local characteristics of two sequences of unequal length, but the algorithm is too time-consuming. In addition, the DTW algorithm cannot find the correspondence between feature points of two time series, such as peaks, low points and inflection points, so its accuracy is low. Some researchers (Yi et al., 1998; Kim et al., 2001) improved DTW by introducing index techniques, which reduced its time complexity. Kim et al. (2001) proposed the Segment-wise Time Warping distance (STW), which greatly decreases the time complexity of DTW but also reduces the accuracy of the similarity measure. Latecki et al. (2005) put forward a minimum variance matching method to obtain flexible similarity matching.

In 1994, the Longest Common Subsequence (LCS) (Paterson and Dancik, 1994) was introduced to the time series similarity measure. Bollobas et al. (1997) put forward LCSS on the basis of LCS, giving a better similarity measure for time series that exhibit amplitude translation, timeline stretching and bending deformation. Other researchers have proposed slope-based, model-based and event-based similarity measures.

This research studies the similarity measure problem of symbolic time series. First, this study introduces the definitions and the classical similarity measures. Then, we propose a new similarity measure algorithm based on the LCSS algorithm. Unlike the LCSS algorithm, the new algorithm avoids the selection of a linear function, improves the accuracy of measurement and greatly improves time efficiency compared to the DTW measure. Finally, experiments verify the proposed algorithm.
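The two baseline measures discussed in the introduction can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names are illustrative, the DTW variant is the textbook dynamic program with the absolute difference as local cost and no warping window, and the cited works may differ in these details.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance; defined only for sequences of equal length."""
    if len(x) != len(y):
        raise ValueError("Euclidean distance needs sequences of equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dtw_distance(x, y):
    """Textbook dynamic time warping distance (assumed local cost: |a - b|)."""
    m, n = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning x[:i] with y[:j]
    cost = [[inf] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            local = abs(x[i - 1] - y[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # only x advances
                                     cost[i][j - 1],      # only y advances
                                     cost[i - 1][j - 1])  # both advance
    return cost[m][n]
```

The sketch shows the trade-off described above: `euclidean_distance` rejects sequences of unequal length, while `dtw_distance` accepts them but fills an m-by-n table, so its cost grows with the product of the two lengths.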
Corresponding Author: Qiang Niu, Department of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China

LCS AND LCSS SIMILARITY MEASURE

LCS measure: Let $X, Y \in A$ be time series samples with vector forms $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$. Their longest common subsequences are $X' = \{x_{i_1}, x_{i_2}, \ldots, x_{i_l}\}$ and $Y' = \{y_{j_1}, y_{j_2}, \ldots, y_{j_l}\}$, where $l$ is the length of the common subsequence, satisfying the following conditions:

For $1 \le k \le l-1$: $i_k < i_{k+1}$ and $j_k < j_{k+1}$
For $1 \le k \le l$: $x_{i_k} = y_{j_k}$

The similarity between time series $X$ and $Y$ is defined as:

\[ Sim(X, Y) = \frac{l}{n} \]

LCSS measure: The LCS measure avoids the similarity problems caused by short-term mutations or intermittent gaps in the time series. However, it does not give good similarity results for time series with amplitude translation, timeline stretching and bending deformation. The LCSS measure is designed to address these problems.

Let $\delta > 0$ be an integer constant, $0 < \varepsilon < 1$ a real constant and $f \in L$, where $L$ is a set of linear functions. Given two sequences $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$, let $X' = \{x_{i_1}, x_{i_2}, \ldots, x_{i_l}\}$ and $Y' = \{y_{j_1}, y_{j_2}, \ldots, y_{j_l}\}$ be the longest subsequences in $X$ and $Y$ respectively such that:

For $1 \le k \le l-1$: $i_k < i_{k+1}$ and $j_k < j_{k+1}$
For $1 \le k \le l$: $|i_k - j_k| \le \delta$
For $1 \le k \le l$: $y_{j_k}/(1+\varepsilon) \le f(x_{i_k}) \le y_{j_k}(1+\varepsilon)$

Let $S_{f,\delta,\varepsilon}(X, Y) = l/n$. Then the similarity between the time series is defined as formula (1):

\[ Sim_{\delta,\varepsilon}(X, Y) = \max_{f \in L} S_{f,\delta,\varepsilon}(X, Y) \qquad (1) \]

EXTENDED LONGEST COMMON SUBSEQUENCE MEASURE (ELCS)

Although the LCSS measure has some advantages, the following issues remain:

The LCSS measure is derived from a solution set: for different time series data sets, the selected linear function $f$ will differ. In other words, the corresponding linear function must be obtained in advance from a training data set before the similarity of the sequences can be measured more accurately. The training set and the test set always differ, so the result is less than ideal.

The LCSS measure can be applied to compare two sequences of different lengths, but because of $\delta$, the length difference of the time series $X = \{x_1, x_2, \ldots, x_m\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$ must satisfy $|m - n| \le \delta$. Otherwise, the candidate series will go undetected (Keogh and Pazzani, 2001). Thus, the support of the LCSS algorithm for timeline stretching is very limited.

To address these problems of the LCSS measure, this study presents an Extended Longest Common Subsequence (ELCS) measure. Let $\delta \ge 1$ and $\theta > 0$ be real constants. Given two sequences $X = \{x_1, x_2, \ldots, x_m\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$, normalize them so that all values lie in $[0, 1]$, obtaining $X_{normal} = \{x'_1, x'_2, \ldots, x'_m\}$ and $Y_{normal} = \{y'_1, y'_2, \ldots, y'_n\}$. Let $X' = \{x'_{i_1}, x'_{i_2}, \ldots, x'_{i_l}\}$ and $Y' = \{y'_{j_1}, y'_{j_2}, \ldots, y'_{j_l}\}$ be the longest subsequences in $X_{normal}$ and $Y_{normal}$ respectively such that:

For $1 \le k \le l-1$: $i_k < i_{k+1}$ and $j_k < j_{k+1}$
For $1 \le k \le l$: $\frac{1}{\delta} \le \frac{i_k}{j_k} \le \delta$ and $\frac{1}{\delta} \le \frac{m - i_k}{n - j_k} \le \delta$
For $1 \le k \le l$: $|x'_{i_k} - y'_{j_k}| \le \theta$

Let $S_{\delta,\theta}(X_{normal}, Y_{normal}) = \frac{2l}{m+n}$. Then the similarity between the time series is defined as formula (2):

\[ Sim_{\delta,\theta}(X, Y) = \max\{ S_{\delta,\theta}(X_{normal}, Y_{normal}) \} \qquad (2) \]

In the definition above, the parameter $\delta$ concentrates the search path of the similarity matrix in a diamond-shaped area, which both prevents over-matching of the sequences and reduces the time complexity. The search area depends closely on the length of each sequence, so no sequence goes undetected and the measure adapts well to timeline stretching and deformation. After normalization, the parameter $\theta$ gives the similarity measure further flexibility in the matching space. The sequences are normalized according to formula (3):

\[ X_{normal} = \frac{X - \min_{i \in [1,m]} x_i}{\max_{i \in [1,m]} x_i - \min_{i \in [1,m]} x_i} \qquad (3) \]

This avoids the difficulty of selecting the linear function $f$ while retaining the numerical trend information of the sequence.
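The ELCS definition and the normalization of formula (3) can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation. It assumes that the search region is given by the two ratio bounds $1/\delta \le i/j \le \delta$ and $1/\delta \le (m-i)/(n-j) \le \delta$ (cross-multiplied to avoid division by zero), that value matching uses $|x' - y'| \le \theta$, and that a constant sequence normalizes to zeros. The function names and default parameter values are hypothetical.

```python
def min_max_normalize(seq):
    """Formula (3): rescale a sequence into [0, 1]."""
    lo, hi = min(seq), max(seq)
    if hi == lo:
        # Assumption: a constant sequence maps to zeros (not covered by the source).
        return [0.0 for _ in seq]
    return [(v - lo) / (hi - lo) for v in seq]


def in_search_region(i, j, m, n, delta):
    """Assumed diamond-shaped region for 1-based positions i in X and j in Y.

    The ratio bounds are cross-multiplied so that (i, j) = (m, n) is handled.
    """
    head_ok = i <= delta * j and j <= delta * i
    tail_ok = (m - i) <= delta * (n - j) and (n - j) <= delta * (m - i)
    return head_ok and tail_ok


def elcs_similarity(x, y, delta=2.0, theta=0.1):
    """ELCS similarity 2l / (m + n) of two numeric sequences."""
    xn, yn = min_max_normalize(x), min_max_normalize(y)
    m, n = len(xn), len(yn)
    # table[i][j] = longest admissible common subsequence of xn[:i] and yn[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = max(table[i - 1][j], table[i][j - 1])
            matches = abs(xn[i - 1] - yn[j - 1]) <= theta
            if matches and in_search_region(i, j, m, n, delta):
                best = max(best, table[i - 1][j - 1] + 1)
            table[i][j] = best
    return 2.0 * table[m][n] / (m + n)
```

For example, under these assumptions `elcs_similarity([0, 1, 2, 3], [0, 0, 1, 1, 2, 2, 3, 3], delta=2.0)` matches all four points of the shorter series and gives 8/12, whereas `delta=1.0` leaves no admissible pair for sequences of lengths 4 and 8 and gives 0. This shows how the scale factor controls the tolerated timeline stretching.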

EXPERIMENT

The similarity measure is the foundation of other data mining processes, and its accuracy directly affects their results. Conversely, clustering results can be used to estimate the accuracy of different similarity measures.

Experimental environment and the data: The experimental environment is a 2.20 GHz E4500 CPU with 1024 MB of memory running Windows XP Professional. The experimental data sets are the Synthetic Control Chart Time Series (SCC) from the UCI KDD Archive and the CBF dataset. The SCC dataset contains 600 time series, each of length 60, divided into six categories. The CBF dataset contains three classes, Cylinder (C), Bell (B) and Funnel (F), and is a typical synthetic data set.

Experiment process: In cluster analysis, time series in the same group resemble each other, while time series in different groups are not similar. This study uses bottom-up hierarchical clustering. With the initial data $C_1, C_2, \ldots, C_n$, the algorithm steps are:

Step 1: Treat each time series as a class $C_i$
Step 2: Calculate the similarity between every two classes to obtain a similarity matrix
Step 3: Merge the two most similar classes, then return to Step 2 and repeat until the number of classes equals the predetermined number of clusters

The distance between clusters is computed with the ELCS similarity measure. Let the standard clustering be $C = C_1, C_2, \ldots, C_k$ and the clustering produced by each measure be $C' = C'_1, C'_2, \ldots, C'_k$. The clustering accuracy is computed by formulas (4) and (5):

\[ Sim(C_i, C'_j) = \frac{2\,|C_i \cap C'_j|}{|C_i| + |C'_j|} \qquad (4) \]

\[ Sim(C, C') = \frac{1}{k} \sum_{i} \max_{j} Sim(C_i, C'_j) \qquad (5) \]

$Sim(C', C)$ is calculated in the same way as $Sim(C, C')$. Because $Sim(C, C')$ and $Sim(C', C)$ are not symmetric, $\frac{Sim(C, C') + Sim(C', C)}{2}$ is used as the final evaluation criterion.

EXPERIMENTAL RESULTS AND ANALYSIS

Parameter determination: The experiment on the SCC dataset analyses the influence of the parameters on the algorithm. The ELCS measure contains the parameters $\delta$ and $\theta$, and $\delta$ has a very significant effect on the performance of the algorithm. The clustering accuracy rate as $\delta$ changes is shown in Fig. 1, and the average internal class distance and average among class distance are shown in Fig. 2 and 3. As $\delta$ increases, the clustering accuracy rate changes from low to high. When $\delta = 2.2$, the clustering accuracy rate is the highest, the average internal class distance is the smallest and the average among class distance is the largest. This means that the matched elements of the ELCS measure satisfy the length constraint at this value of $\delta$. When $\delta$ is too large, the positional correspondence of the test sequence is not constrained well and meaningless similar sequence segments are obtained. When $\delta$ is too small, the search range of the similarity matrix is very limited and a lot of data is discarded or missed. As $\delta$ decreases, the classification accuracy drops sharply.

Fig. 1: Successful classification rate
Fig. 2: Average internal class distance
Fig. 3: Average among class distance

Three kinds of measure-based clustering comparison: To compare the time consumption of the DTW, LCSS and ELCS distances, each was used as the similarity metric of hierarchical clustering. Appropriate parameters were selected for the LCSS and ELCS algorithms so that their final classification accuracy was at its best. The results are shown in Fig. 4. The DTW algorithm is significantly more time-consuming than the others. Because the position constraint of the ELCS measure (two ratio bounds involving $m$ and $n$) is more complex than that of the LCSS measure, ELCS takes more time than LCSS.

Fig. 4: Time-consuming comparisons of three distance

Figure 5 compares the clustering accuracy rate of the DTW, LCSS and ELCS measures. Each measure gives good results on the SCC dataset, because the data in the SCC dataset have obvious characteristics and little noise. The CBF dataset is randomly generated, and each time series has many glitches that increase the difficulty of clustering. On both datasets, however, ELCS shows good results: its clustering accuracy rate is the highest.

Fig. 5: Clustering accuracy rate for the three distance

The differences between the datasets mentioned above can be seen easily in Fig. 6 and 7. For the CBF dataset, the average internal class distance of the clustering results is greater than for the SCC dataset, while the average among class distance is smaller. Because LCSS and ELCS are based on the LCS algorithm, they do not suffer from the DTW problem of one point corresponding to multiple points, and local noise can be ignored.

Fig. 6: Average internal class distance for the three distance
Fig. 7: Average among class distance for the three distance

CONCLUSION

Based on the LCS measure, this study introduces parameters that standardize the search path in the similarity matrix. It improves the accuracy of the similarity measure and overcomes the inability of the traditional similarity measure based on Euclidean distance to deal with noise interference. In experiments on two different types of data sets, the ELCS measure achieves higher clustering correctness than the existing similarity measures, but the time expense is higher. In short, the measure can be applied effectively to a variety of time series similarity measurement tasks.

ACKNOWLEDGMENT

This study was supported by the Doctoral Program Foundation of the Ministry of Education of China (20100095110003) and the Fundamental Research Funds for the Central Universities (2011QNB23).

REFERENCES

Agrawal, R., C. Faloutsos and A. Swami, 1993. Efficient similarity search in sequence databases. Proceedings of 4th International Conference on Foundations of Data Organization and Algorithms. Springer, Berlin, pp: 69-84.

Berndt, D. and J. Clifford, 1994. Using Dynamic Time Warping to Find Patterns in Time Series. AAAI-94 Workshop on Knowledge Discovery in Databases, AAAI Press, Seattle, Washington.

Bollobas, B., G. Das, D. Gunopulos and H. Mannila, 1997. Time-series similarity problems and well-separated geometric sets. Proceedings of the 13th Annual Symposium on Computational Geometry. ACM Press, New York, pp: 454-456.

Chung, L., T.C. Fu and R. Luk, 2004. An evolutionary approach to pattern-based time series segmentation. IEEE T. Evolut. Comput., 8(5): 471-489.

Faloutsos, C., M. Ranganathan and Y. Manolopoulos, 1994. Fast subsequence matching in time-series databases. SIGMOD Rec., 23(2): 417-429.

Keogh, E.J. and M.J. Pazzani, 2001. Derivative Dynamic Time Warping. Retrieved from: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.23.3383&rep=rep1&type=pdf.

Kim, S.W., S. Park and W. Chu, 2001. An index-based approach for similarity search supporting time warping in large sequence databases. Proceedings of the International Conference on Data Engineering. IEEE Computer Society, Heidelberg, pp: 607-614.

Latecki, L.J., V. Megalooikonomou, Q. Wang, R. Lakaemper, C.A. Ratanamahatana, et al., 2005. Partial elastic matching of time series. 5th IEEE International Conference on Data Mining. Nov. 27-30, Philadelphia.

Paterson, M. and V. Dancik, 1994. Longest common subsequences. Lect. Notes Comput. Sci., 841: 127-142.

Yi, B.K., H.V. Jagadish and C. Faloutsos, 1998. Efficient retrieval of similar time sequences under time warping. Proceedings of the International Conference on Data Engineering, IEEE Computer Society, Orlando, pp: 201-208.
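As a supplement to the EXPERIMENT section, the bottom-up clustering steps and the accuracy criterion of formulas (4) and (5) can be sketched in Python. This is only an illustration under stated assumptions. The paper does not say how the similarity between two multi-member classes is computed, so average pairwise similarity is assumed here. The division by the number of clusters in formula (5) is also an assumption, and all function names are hypothetical. Any pairwise measure, such as an ELCS implementation, can be passed in as `similarity`.

```python
from itertools import combinations


def agglomerative(items, similarity, k):
    """Bottom-up clustering: merge the most similar classes until k remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[idx] for idx in range(len(items))]  # Step 1: one class each

    def linkage(a, b):
        # Assumption: average pairwise similarity (linkage not given in the paper).
        total = sum(similarity(items[p], items[q]) for p in a for q in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Step 2: compare every pair of current classes
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]))
        # Step 3: merge the most similar pair (b > a, so index a stays valid)
        clusters[a] = clusters[a] + clusters.pop(b)
    return [set(c) for c in clusters]


def pair_sim(c, c_prime):
    """Formula (4): 2 |C_i and C'_j in common| / (|C_i| + |C'_j|)."""
    return 2.0 * len(c & c_prime) / (len(c) + len(c_prime))


def directed_sim(ref, found):
    """Formula (5): mean over classes in ref of their best match in found."""
    return sum(max(pair_sim(c, d) for d in found) for c in ref) / len(ref)


def clustering_accuracy(ref, found):
    """Final criterion: (Sim(C, C') + Sim(C', C)) / 2."""
    return (directed_sim(ref, found) + directed_sim(found, ref)) / 2.0
```

Clusters are represented as sets of item indices, so the output of `agglomerative` can be passed directly to `clustering_accuracy` together with a reference partition of the same indices.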