
Available online at www.sciencedirect.com

Procedia Engineering 15 (2011) 1642-1646
www.elsevier.com/locate/procedia

Advanced in Control Engineering and Information Science

The Clustering Algorithm of Query Result based on Maximal Frequent

WEI Yu-wei*

Faculty of Electromechanical Engineering, Guangdong University of Technology, Guangzhou 510006, China

Abstract

Most existing web page clustering algorithms are based on short and uneven snippets of web pages, which often causes poor clustering performance. On the other hand, the classical clustering algorithms for full-text web pages are too complex to provide good cluster labels and are incapable of on-line clustering. To address these problems, this article presents an on-line web page clustering algorithm based on maximal frequent itemsets (MFIC). First, the maximal frequent itemsets are mined, and the web pages are clustered based on the frequent itemsets they share. Second, clusters are labelled based on the frequent items. Experimental results show that MFIC can effectively reduce clustering time, improve clustering accuracy by 15%, and generate understandable labels.

© 2010 Published by Elsevier Ltd. Open access under CC BY-NC-ND license. Selection and/or peer review under responsibility of [name organizer]

Keywords: Search Engine; Frequent Itemsets; Page Clustering

* Corresponding author. Tel.: +8613602446699. E-mail address: weiyuwei@gdut.edu.cn
1877-7058 © 2011 Published by Elsevier Ltd. doi:10.1016/j.proeng.2011.08.306

1. Introduction

With the growth of information on the internet, the SE (Search Engine) has become an indispensable tool. Most general SEs sort webpages by their degree of correlation to the user query and return the results to the user as a list. Users then have to judge, webpage by webpage, whether each result satisfies their demand. Research shows that most users issue short and uncertain search strings, yet 85% of users only view the first page of results, and 78% of users never change their search terms. In addition, because users have different backgrounds, the results they desire are different. Therefore, to meet users' increasing requirements on query quality, the usability of query results needs to be improved.

To solve this problem, this article puts forward an on-line clustering algorithm for query results based on the maximal frequent itemsets of webpages. The mining algorithm for maximal frequent itemsets is improved so that it can be used for on-line clustering of SE query results. The new algorithm clusters by the sharing relation between webpage sets and frequent itemsets, and at the same time describes every category with clear and definite tags. The experimental results show that clustering based on maximal frequent itemsets reduces the clustering time for full text, while the clustering accuracy improves by about 15%.

2. Maximal Frequent Itemset Mining

2.1. The Basic Concept of Frequent Itemsets

The on-line clustering algorithm for full-text webpages adopts the maximal frequent itemset as its basic feature. This section first introduces the basic concepts of frequent itemsets; for a detailed introduction to frequent itemsets, refer to the literature.

Definition 1: Assume $I = \{I_1, I_2, \ldots, I_n\}$ is a set of $n$ different items. A set $X$ with $X \subseteq I$ and $X \neq \emptyset$ is called an itemset; the length of $X$ is the number of items it contains.

Definition 2: $D = \{T_1, T_2, \ldots, T_m\}$ is a set of $m$ different transactions, where each $T_i \subseteq I$. For the given transaction set $D$, the support of $X$ is defined as the number of transactions in which $X$ occurs, denoted $\mathrm{Sup}(X)$. The user may define a minimal support count, min_supp, which is either an absolute count or a relative count.
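As a small illustration of Definitions 1 and 2, the sketch below counts the support of an itemset over a handful of pages. The page contents, the helper names `support` and `is_frequent`, and the choice to treat "support at least min_supp" as frequent are assumptions made for this example only; they are not taken from the paper.

```python
def support(itemset, transactions):
    """Absolute support Sup(X): number of transactions containing every item of X."""
    x = frozenset(itemset)
    return sum(1 for t in transactions if x <= frozenset(t))


def is_frequent(itemset, transactions, min_supp):
    """Check X against min_supp.

    Assumption: an int is read as an absolute count, a float as a relative
    fraction of all transactions, and equality counts as frequent.
    """
    sup = support(itemset, transactions)
    if isinstance(min_supp, float):
        return sup / len(transactions) >= min_supp
    return sup >= min_supp


# Hypothetical query-result pages, each reduced to its set of terms.
pages = [
    {"car", "Geely", "trademark"},
    {"car", "Geely", "trademark"},
    {"car", "Geely", "price"},
    {"car", "stock"},
]

print(support({"car", "Geely", "trademark"}, pages))
print(is_frequent({"car", "Geely"}, pages, 2))
print(is_frequent({"price"}, pages, 0.5))
```

Here every page is a transaction and every term an item, which matches the way the article later maps webpages onto transactions.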
Definition 3: Given the transaction set $D$ and the minimal support count min_supp, for an itemset $X \subseteq I$, if $\mathrm{Sup}(X) > \text{min\_supp}$ and $\forall Y\,(Y \subseteq I \wedge X \subset Y)$, $\mathrm{Sup}(Y) < \text{min\_supp}$, then $X$ is called a maximal frequent itemset in the transaction set $D$.

In this article, the transaction set is the set of webpages in the query results, and every webpage is a transaction. An itemset is a set of terms contained in a webpage, and each term of a webpage is an item of the transaction.

2.2. Maximal Frequent Itemsets Mining Algorithm

The common algorithm for frequent itemset mining is the FP-Growth algorithm. It first constructs an FP-Tree, a threaded tree structure that stores the transaction set [3]. To construct the FP-Tree, the support count of every item is first computed, and the items whose support exceeds the minimal support count are arranged in the header table of the FP-Tree in decreasing order of support. Then the transactions are read in one at a time, and each is mapped onto a path of the FP-Tree. Fig. 1 is an example of an FP-Tree (min_supp = 2), where (a) shows the transaction set and (b) the constructed FP-Tree. In the figure, the solid lines represent the paths onto which the transactions are mapped, the dotted lines point from the header table to the positions in the tree, and the count of a node is the support of the itemset formed by the path from the root node to the current node. For example, the node "trademark: 2" indicates that the support of the itemset {car, Geely, trademark} is 2.
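The FP-Tree construction described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's implementation: the class `FPNode`, the functions `build_fp_tree` and `path_count`, the alphabetical tie-breaking, the "support at least min_supp" test, and the four sample pages are all hypothetical. The sample pages are not the transaction set of Fig. 1; they were only chosen so that the path car, Geely, trademark ends in a node with count 2.

```python
from collections import Counter


class FPNode:
    """One tree node: an item, a count, and links to its parent and children."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_supp):
    # Pass 1: support count of every single item.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_supp}
    # Header order: decreasing support (ties broken alphabetically, an assumption).
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}

    root = FPNode(None, None)
    header = {item: [] for item in order}  # item -> nodes holding it (the dotted links)

    # Pass 2: map each transaction onto a path, reading one transaction at a time.
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header


def path_count(root, path):
    """Count stored at the node reached by following `path` from the root (0 if absent)."""
    node = root
    for item in path:
        node = node.children.get(item)
        if node is None:
            return 0
    return node.count


# Hypothetical pages, each reduced to its set of terms.
pages = [
    {"car", "Geely", "trademark"},
    {"car", "Geely", "trademark"},
    {"car", "Geely", "price"},
    {"car", "stock"},
]
root, header = build_fp_tree(pages, min_supp=2)
print(list(header))
print(path_count(root, ["car", "Geely", "trademark"]))
```

Storing shared prefixes once is what keeps the tree compact: the three pages that contain both car and Geely all pass through the same two nodes.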

Fig. 1. An example of FP-Tree (min_supp = 2)

3. Query Result Cluster Algorithm Based on Maximal Frequent Itemsets

After mining the frequent itemsets, there are two ways to cluster. The first uses the words of the frequent itemsets to create a feature vector for each webpage and applies a traditional clustering algorithm based on the vector space model. The second clusters by the relation between frequent itemsets and the webpage sets they cover [4]. The former has been shown to have a time complexity that cannot satisfy the demand of on-line clustering; at the same time, the cluster [...] The clustering algorithm introduced in this article clusters by the relation of webpages sharing maximal frequent itemsets.

For the rest of this article, the following definitions are used:

$D = \{T_1, T_2, \ldots, T_m\}$ is the set of all transactions; in this article it is the set of webpages returned by the query.
$I = \{I_1, I_2, \ldots, I_n\}$ is the set of all items; it is the set of terms contained in the webpage set.
$S_M = \{M_1, M_2, \ldots, M_n\}$ is the set of all maximal frequent itemsets mined from the webpages. The webpage set covered by a maximal frequent itemset $M$ is denoted $P$, $P \subseteq D$.
Clustering divides the set of webpages into clusters; $C = \{C_1, C_2, \ldots, C_l\}$ is the set of clusters.
The webpage set included in a cluster $C$ is denoted $CP$, $CP \subseteq D$; the set of maximal frequent itemsets it includes is denoted $CM$, $CM \subseteq S_M$; and the set of frequent items it includes is denoted $CI$, $CI \subseteq I$.
$D_c = \{T_1, T_2, \ldots\}$ is the webpage set covered by the clusters.

The core steps of the clustering algorithm are introduced below.

Step 1: The generation of clusters. The longer a frequent itemset is, the more terms it includes and the better it expresses a detailed topic, so clusters generated from long frequent itemsets are given priority. The frequent itemsets in $S_M$ are sorted by length, and in that order the longest frequent itemset $M$ is selected to generate a cluster $C$. $CP$, the webpage set included in $C$, is the webpage set $P$ covered by $M$, and the webpage set covered by the clusters is recorded as $D_c = D_c \cup P$. To speed up cluster generation and to reduce the transmission effects in the subsequent merging procedure, the frequent itemsets of $S_M$ are filtered further: if a frequent itemset $M$ covers a webpage set $P \subseteq D_c$, all webpages of $P$ have already been covered by clusters, and no cluster $C$ corresponding to $M$ is generated.

Step 2: The merging of clusters. The clusters originally generated are numerous and overlap heavily, so they need to be merged to produce the final clusters. Merging means that clusters with high similarity are merged into one cluster; usually the similarity of clusters is judged by the similarity of the webpage sets they include. For a clustering algorithm based on frequent itemsets, the frequent itemsets included in a cluster are an important feature of the cluster, so the similarity of clusters is also computed from the similarity of the frequent itemsets they include [5]. To improve accuracy, formula (1) is adopted to compute the similarity of clusters. The similarity of clusters $C_i$ and $C_j$ is denoted $\mathrm{Sim}(C_i, C_j)$, the similarity of the webpages they include $\mathrm{SimP}$, and the similarity of the frequent items they include $\mathrm{SimI}$:

$$\mathrm{Sim}(C_i, C_j) = \frac{|CP_i \cap CP_j|}{\min(|CP_i|, |CP_j|)} + \frac{|CI_i \cap CI_j|}{\min(|CI_i|, |CI_j|)} \qquad (1)$$

The larger $\mathrm{Sim}(C_i, C_j)$ is, the more similar the clusters $C_i$ and $C_j$ are, and the more they tend to be merged.

Step 3: The cluster purification. Clustering is divided into hard clustering and soft clustering. The former demands that a webpage belongs to only one category; the latter allows a webpage to belong to multiple categories, so the latter reflects reality better. Because of the transmission effects of cluster merging, clusters sometimes include non-correlated webpages. It is therefore vital to recognize whether a webpage in a cluster is a non-correlated webpage or a multi-category webpage. In this article, non-correlated webpages are recognized by the support of the webpage relative to the cluster. The support of a webpage $P$ relative to a cluster $C$ is defined as follows:

$$\mathrm{Supp}(P, C) = \sum_{M \in CM} |M| \, f(P, M) \qquad (3)$$

An empirical threshold can be set according to experiment. When $\mathrm{Supp}(P, C)$ is less than this value, $P$ is a non-correlated webpage and is deleted from the cluster.

4. Experimental Results and Analysis

4.1. Experimental Condition and Experimental Data

The experimental data is the data set corresponding to 8 ambiguous query terms. The query results are gathered from SE and their union set is taken. The webpages are then word-segmented and part-of-speech tagged, the index is constructed, and the segmentation results are kept for the later algorithm. The above work is completed off-line and prepares the data for on-line query clustering.
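Returning to the algorithm of Section 3, Step 1 (cluster generation) and Step 2 (merging by formula (1)) can be sketched as below. This is an illustration under assumptions, not the author's code: the function names, the dictionary layout of a cluster, the greedy pairwise merge loop and its `threshold` parameter are all hypothetical, since the article does not state how the merge is scheduled or which threshold is used. The maximal frequent itemsets are assumed to be mined already, and the pages are invented sample data.

```python
def covered_pages(itemset, pages):
    """P: indices of the pages that contain every term of the itemset M."""
    m = set(itemset)
    return {i for i, terms in enumerate(pages) if m <= terms}


def generate_clusters(maximal_itemsets, pages):
    """Step 1: longest itemsets first; skip M when its pages are already covered."""
    clusters, d_c = [], set()
    for m in sorted(maximal_itemsets, key=lambda s: (-len(s), sorted(s))):
        p = covered_pages(m, pages)
        if p <= d_c:  # P is a subset of D_c (this also skips an empty P)
            continue
        clusters.append({"CP": set(p), "CM": [frozenset(m)], "CI": set(m)})
        d_c |= p  # D_c = D_c united with P
    return clusters


def similarity(a, b):
    """Formula (1): overlap of page sets plus overlap of frequent items."""
    sim_p = len(a["CP"] & b["CP"]) / min(len(a["CP"]), len(b["CP"]))
    sim_i = len(a["CI"] & b["CI"]) / min(len(a["CI"]), len(b["CI"]))
    return sim_p + sim_i


def merge_clusters(clusters, threshold):
    """Step 2 (assumed greedy scheme): merge any pair whose similarity reaches threshold."""
    clusters = [dict(c) for c in clusters]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if similarity(clusters[i], clusters[j]) >= threshold:
                    a, b = clusters[i], clusters.pop(j)
                    clusters[i] = {
                        "CP": a["CP"] | b["CP"],
                        "CM": a["CM"] + b["CM"],
                        "CI": a["CI"] | b["CI"],
                    }
                    merged = True
                    break
            if merged:
                break
    return clusters


# Hypothetical pages and their (assumed already mined) maximal frequent itemsets.
pages = [
    {"car", "Geely", "trademark"},
    {"car", "Geely", "trademark", "news"},
    {"car", "price", "dealer", "news"},
    {"car", "price", "dealer"},
]
maximal = [{"car", "news"}, {"car", "Geely", "trademark"}, {"car", "price", "dealer"}]

clusters = generate_clusters(maximal, pages)
print(len(clusters))
print(similarity(clusters[0], clusters[1]))
```

In this sample the short itemset {car, news} produces no cluster of its own, because both of its pages are already covered by the two longer itemsets, which is the filtering rule of Step 1.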
The webpage sets are labelled with categories manually; the webpage set of every query term is labelled with several categories. Because the K-Means algorithm requires the value of K to be set, four K values (5, 6, 7, 8) are tried separately, and for every query the highest F value among the four experimental results is taken as the final result. STC, Lingo and MFIC can automatically generate an arbitrary number of categories, and some of the resulting clusters include only 2~3 webpages. In practical applications, however, usually only the clusters that include the major webpages are shown, so in the actual experiment the categories with fewer than 5 webpages are grouped into an "other" category. By changing the parameters, the number of categories of the 3 algorithms ranges from 5 to 10.

4.2. Experimental Condition and Experimental Data

In this article, the experiment compares MFIC based on full text with K-Means, and at the same time compares the clustering time with STC based on abstracts [6]. For the full text of webpages, the clustering time is too long to apply to on-line clustering. The experimental data show that the time is more than 10 seconds.

In addition, Lingo uses the open Java implementation and the other algorithms are implemented in C++, so they are compared with it. The graph suggests that the clustering time of MFIC is superior to that of K-Means. Because MFIC clustering is based on the full text of webpages, it is a foregone conclusion that its clustering time is longer than that of STC, which is based on abstracts. The experimental result shows that MFIC can satisfy the demand of on-line clustering, its clustering time being about 2 seconds. To improve system response, a cache of clustering results can be set up to reduce the user waiting time in practical applications.

Fig. 3. The cluster algorithm time comparison

Conclusion

This article proposes a clustering algorithm for the results returned by an SE, based on the maximal frequent itemsets of the full text. First, the mining algorithm for frequent itemsets is studied, and the mining of maximal frequent itemsets is improved by combining it with the FPMax algorithm, which also increases the mining speed. Then the MFIC algorithm based on maximal frequent itemsets is proposed. The MFIC algorithm mainly includes three steps. First, clusters are generated from the mined maximal frequent itemsets. Second, the clusters are merged and judged by combining the similarity of the frequent itemsets with the similarity of the document sets included in the clusters. Finally, a label generation algorithm that combines frequent itemsets with term sequences is proposed.

References

[1] Liping Jing, Michael K. Ng. An Entropy Weighting k-Means Algorithm for Subspace Clustering of High-Dimensional Sparse Data. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(8), p. 1026-1040.
[2] Wei Song, Soon Cheol Park. Genetic Algorithm based text clustering technique: Automatic evolution of cluster with high efficiency. 7th International Conference on Web-Age Information Management Workshops. Hong Kong, 2006, p. 17-18.
[3] Daniel Crabtree, Xiaoying Gao. Improving Web Clustering by Cluster Selection. The 2005 IEEE/WIC/ACM International Conference on Web Intelligence. 2005, p. 172-178.
[4] Hung Chim, Xiaotie Deng. A New Suffix Tree Similarity Measure for Document Clustering. World Wide Web Conference Committee. 2007, p. 121-129.
[5] Daniel Crabtree, Peter Andreae. Query Directed Web Page Clustering. Proceeding of the IEEE/WIC/ACM International Conference on Web Intelligence. 2006, p. 202-210.