Web Text Feature Extraction with Particle Swarm Optimization

Similar documents
Performance Comparisons of PSO based Clustering

3D Model Retrieval Method Based on Sample Prediction

An Improved Shuffled Frog-Leaping Algorithm for Knapsack Problem

Study on effective detection method for specific data of large database LI Jin-feng

HADOOP: A NEW APPROACH FOR DOCUMENT CLUSTERING

New HSL Distance Based Colour Clustering Algorithm

Dynamic Programming and Curve Fitting Based Road Boundary Detection

Sectio 4, a prototype project of settig field weight with AHP method is developed ad the experimetal results are aalyzed. Fially, we coclude our work

Journal of Chemical and Pharmaceutical Research, 2013, 5(12): Research Article

Pattern Recognition Systems Lab 1 Least Mean Squares

Image Segmentation EEE 508

Analysis of Server Resource Consumption of Meteorological Satellite Application System Based on Contour Curve

Harris Corner Detection Algorithm at Sub-pixel Level and Its Application Yuanfeng Han a, Peijiang Chen b * and Tian Meng c

ISSN (Print) Research Article. *Corresponding author Nengfa Hu

Neuro Fuzzy Model for Human Face Expression Recognition

Optimization for framework design of new product introduction management system Ma Ying, Wu Hongcui

Text Feature Selection based on Feature Dispersion Degree and Feature Concentration Degree

Pruning and Summarizing the Discovered Time Series Association Rules from Mechanical Sensor Data Qing YANG1,a,*, Shao-Yu WANG1,b, Ting-Ting ZHANG2,c

Ones Assignment Method for Solving Traveling Salesman Problem

COSC 1P03. Ch 7 Recursion. Introduction to Data Structures 8.1

A Novel Feature Extraction Algorithm for Haar Local Binary Pattern Texture Based on Human Vision System

Stone Images Retrieval Based on Color Histogram

Mobile terminal 3D image reconstruction program development based on Android Lin Qinhua

An improved support vector machine based on particle swarm optimization in laser ultrasonic defect detection

Cubic Polynomial Curves with a Shape Parameter

New Fuzzy Color Clustering Algorithm Based on hsl Similarity

Analysis of Documents Clustering Using Sampled Agglomerative Technique

An Image Retrieval Method Based on Hu Invariant Moment and Improved Annular Histogram

A Note on Least-norm Solution of Global WireWarping

Probabilistic Fuzzy Time Series Method Based on Artificial Neural Network

Administrative UNSUPERVISED LEARNING. Unsupervised learning. Supervised learning 11/25/13. Final project. No office hours today

A New Morphological 3D Shape Decomposition: Grayscale Interframe Interpolation Method

are two specific neighboring points, F( x, y)

Chapter 3 Classification of FFT Processor Algorithms

Accuracy Improvement in Camera Calibration

SOFTWARE usually does not work alone. It must have

A SOFTWARE MODEL FOR THE MULTILAYER PERCEPTRON

Introduction. Nature-Inspired Computing. Terminology. Problem Types. Constraint Satisfaction Problems - CSP. Free Optimization Problem - FOP

Lecture Notes 6 Introduction to algorithm analysis CSS 501 Data Structures and Object-Oriented Programming

Adaptive Resource Allocation for Electric Environmental Pollution through the Control Network

Symbolization of Mobile Object Trajectories with the Support to Motion Data Mining

Criterion in selecting the clustering algorithm in Radial Basis Functional Link Nets

Bayesian Network Structure Learning from Attribute Uncertain Data

FREQUENCY ESTIMATION OF INTERNET PACKET STREAMS WITH LIMITED SPACE: UPPER AND LOWER BOUNDS

Evaluation scheme for Tracking in AMI

Using a Dynamic Interval Type-2 Fuzzy Interpolation Method to Improve Modeless Robots Calibrations

arxiv: v2 [cs.ds] 24 Mar 2018

DATA MINING II - 1DL460

We are IntechOpen, the first native scientific publisher of Open Access books. International authors and editors. Our authors are among the TOP 1%

Heuristic Approaches for Solving the Multidimensional Knapsack Problem (MKP)

Our second algorithm. Comp 135 Machine Learning Computer Science Tufts University. Decision Trees. Decision Trees. Decision Trees.

Python Programming: An Introduction to Computer Science

A Method of Malicious Application Detection

Outline. Research Definition. Motivation. Foundation of Reverse Engineering. Dynamic Analysis and Design Pattern Detection in Java Programs

RESEARCH ON AUTOMATIC INSPECTION TECHNIQUE OF REAL-TIME RADIOGRAPHY FOR TURBINE-BLADE

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

LU Decomposition Method

Improving Template Based Spike Detection

Research on Interest Model of User Behavior

On-line cursive letter recognition using sequences of local minima/maxima. Robert Powalka

Algorithms for Disk Covering Problems with the Most Points

Mining from Quantitative Data with Linguistic Minimum Supports and Confidences

Performance Plus Software Parameter Definitions

CSCI 5090/7090- Machine Learning. Spring Mehdi Allahyari Georgia Southern University

GPUMP: a Multiple-Precision Integer Library for GPUs

Python Programming: An Introduction to Computer Science

A Semi- Non-Negative Matrix Factorization and Principal Component Analysis Unified Framework for Data Clustering

Auto-recognition Method for Pointer-type Meter Based on Binocular Vision

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Bayesian approach to reliability modelling for a probability of failure on demand parameter

The identification of key quality characteristics based on FAHP

Automatic Collecting of Text Data for Cantonese Language Modeling

Research on K-Means Algorithm Based on Parallel Improving and Applying

Using Markov Model and Popularity and Similarity-based Page Rank Algorithm for Web Page Access Prediction

Euclidean Distance Based Feature Selection for Fault Detection Prediction Model in Semiconductor Manufacturing Process

APPLICATION NOTE PACE1750AE BUILT-IN FUNCTIONS

Automated Extraction of Urban Trees from Mobile LiDAR Point Clouds

EFFECT OF QUERY FORMATION ON WEB SEARCH ENGINE RESULTS

Text Summarization using Neural Network Theory

Improving Information Retrieval System Security via an Optimal Maximal Coding Scheme

Optimization of Multiple Input Single Output Fuzzy Membership Functions Using Clonal Selection Algorithm

A Kernel Density Based Approach for Large Scale Image Retrieval

Feature classification for multi-focus image fusion

Research Article An Improved Metric Learning Approach for Degraded Face Recognition

A Novel Hybrid Algorithm for Software Cost Estimation Based on Cuckoo Optimization and K-Nearest Neighbors Algorithms

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

An Efficient Algorithm for Graph Bisection of Triangularizations

A New Bit Wise Technique for 3-Partitioning Algorithm

Evaluation of Support Vector Machine Kernels for Detecting Network Anomalies

Lecture 13: Validation

CSC 220: Computer Organization Unit 11 Basic Computer Organization and Design

The Impact of Feature Selection on Web Spam Detection

IMP: Superposer Integrated Morphometrics Package Superposition Tool

Eigenimages. Digital Image Processing: Bernd Girod, Stanford University -- Eigenimages 1

Outline. A Probabilistic Deduplication, Record Linkage and Geocoding System. Data cleaning and standardisation (2)

Low Complexity H.265/HEVC Coding Unit Size Decision for a Videoconferencing System

New Clustering Algorithm for Multidimensional Data

Evaluation of Different Fitness Functions for the Evolutionary Testing of an Autonomous Parking System

Effect of control points distribution on the orthorectification accuracy of an Ikonos II image through rational polynomial functions

What are Information Systems?

Transcription:

32 IJCSNS Iteratioal Joural of Computer Sciece ad Network Security, VOL.7 No.6, Jue 2007 Web Text Feature Extractio with Particle Swarm Optimizatio Sog Liagtu,, Zhag Xiaomig Istitute of Itelliget Machies, Chiese Academy of Scieces,Hefei,23003 Chia Automatio Departmet, Uiversity of Sciece ad Techology of Chia,Hefei,230036 Chia Summary The Iteret cotiues to grow at a pheomeal rate ad the amout of iformatio o the web is overwhelmig. It provides us a great deal of iformatio resource. Due to its wide distributio, its opeess ad high dyamics, the resources o the web are greatly scattered ad they have o uified maagemet ad structure. This greatly reduces the efficiecy i usig web iformatio.web text feature extractio is cosidered as the mai problem i text miig. We use Vector Space Model (VSM) as the descriptio of web text ad preset a ovel feature extractio algorithm which is based o the improved particle swarm optimizatio with reverse thig particles (PSORTP). This algorithm will greatly improve the efficiecy of web texts processig. Key words: Web miig, VSM, Particle swarm optimizatio, Text feature extractio. Itroductio The web is growig at a tremedous rate. It cotais a huge amout of ustructured, distributed data. This cotet provides a great potetial source for iformatio extractio that eeds to be filtered, orgaized, ad maitaied i order to permit a efficiet use[]. However, Due to its wide distributio, its opeess ad high dyamics, the resources o the web are greatly scattered ad they have o uified maagemet ad structure. This greatly reduces the efficiecy i usig web iformatio. How to search for the iformatio from the Massive web iformatio speedily ad accurately has become a major problem[2]. There was a urget eed of tools that ca quickly ad effectively fid resources ad kowledge from the Web. The mai form of iformatio o the web is web text. So how to process these Web Texts becomes the key questio. Data miig ca help discoverig potetial kowledge ad iformatio from mass raw data ad effectively solve the problem that iformatio is abudat but kowledge is lackig. However, most classical Data Miig tools require structured iformatio. Therefore, Web-Based Text Miig has become a ew theme of Data Miig. I this paper, a review of Web miig is preseted outliig the techiques of Web miig ad data processig. A ew Web text feature Extractio method is proposed that a modified particle swarm optimizatio algorithm is used i it. 2. Web Miig ad Descriptio of Web Text 2. Defiitio of Web Miig Web miig ca be broadly defied as the discovery ad aalysis of useful iformatio from the World Wide Web[3]. This broad defiitio o the oe had describes the automatic search ad retrieval of iformatio ad resources available from millios of sites ad o-lie databases, i.e., Web cotet miig, ad o the other had, the discovery ad aalysis of user access patters from oe or more Web servers or o-lie services, i.e., Web usage miig. Web Miig targeted large, heterogeeous, distributed data of the Web. The Web is a semi-structured or ustructured. Therefore sematic uderstadable absece of machiery, atural laguage uderstadig ad text processig are the mai methods used i the fields of Web miig. 2.2 Descriptio of Web Text Most of the iformatio i the Iteret is i the form of Web texts. How to express this semi-structured ad ustructured iformatio of Web texts is the basic preparatory work of Web Miig. Vector space model is oe of good method which is applied more i recet years. It is used to represet each documet as a vector of certai weighted word frequecies. I a documet vector, the value o each term axis represets the importace of the term i the documet. Therefore the documet vector is merely a set of all terms with a value represetig the importace of each term i the documet, le V( d) = ( t, w( d);...; ti, wi( d);...; t, w( d)). t i is the term i ad wi ( d ) is the value represetig the importace of term i i documet d. wi( d) = ψ ( tfi( d )) is always defied as a fuctio of ti s frequecy- tfi ( d ) i documet d. TF-IDF is a commo method of determiig the weights of the term i VSM.The weight of the term is proportioal to Term Frequecy(TF) ad with the Documet Frequecy(DF) i iverse proportio[4]. The followig is a commoly used formula for calculatig Term Weight: Mauscript received Jue 5, 2007. Mauscript revised Jue 20, 2007.

IJCSNS Iteratioal Joural of Computer Sciece ad Network Security, VOL.7 No.6, May 2007 33 W = tf N log( + 0.0) k 2 2 N ( tf ) log ( + 0.0) k = k () tf is the frequecy of T k i documet D i. N is the total umber of sample documets. k is the umber of docmets which have the term T k. The terms which appear i the Headig, subheadig ad keywords always should be emphasized. 2.3 Feature Extractio Words ad phrases is the basic elemets of a ormal documet ad there is certai regularity with the frequecy of each term i differet documets. So we ca use it to distiguish documets which have differet cotets. Text feature vectors are obtaied through algorithms of words segmetatio ad statistical approach of term frequecy. The dimesio of the vector is immese by usig this approach.if o disposal is doe to the origial text vector, the vector calculatig expeses will be tremedous ad the efficiecy of the whole process will be extraordiary iefficiet. Therefore, we eed to process the text vector for further refiemet o the basis of that the origial meaig is esured ad the feature vector is rather compact. Extractig text features is a NP-complete problem. Curretly, may ovel approaches o the study of feature extractio is proposed. I the study of Liu li, a ovel term selectio ad weightig approach based o key words is preseted[5]. ZOU Jua proposed a method of Chiese text eigevalue extractio usig multiple heuristic rules i their paper[6]. We believe that the Web texts ca be see as a multi-dimesioal iformatio space which is made up of the feature terms. The the text features selectio is just the optimizatio process i the multi-dimesioal iformatio space. Therefore efficiet optimizatio algorithm is eeded i fidig the text features. As a geeral itelliget search algorithm, Particle Swarm Optimizatio(PSO) is discovered through simulatio of a simplified social model ad it ca search the multidimesioal complex space efficietly. I this paper, a ovel approach of Web text feature Extractio with particle swarm optimizatio algorithm is preseted. 3. Web Text Feature Extractio with Improved PSO Algorithm Particle Swarm Optimizatio (PSO) was first itroduced by James Keedy ad Russel C.Eberrhart i 995 ad it was discovered through simulatio of a simplified social model. It ca search the multidimesioal complex space efficietly through cooperatio ad competitio amog the idividuals i a populatio of particles[7]. We th that the text features selectio ca be trasformed ito the optimizatio process i the multi-dimesioal iformatio space. Firstly, the Web text should be coded ad the text vector is viewed as the positio vector of a particle. Because the umber of the text features is ukow, we proposed a ovel particle swarm optimizatio with alterable dimesio. Reverse thig particles are also added i the algorithm i order to improve the global search ability of the PSO. The mai process of the approach is preseted i the followigs: 3. The Selectio of Text Features Before extractig text feature, we eed to pretreat the text to get the feature terms. That extractig good features is a very importat meas ad a basic require for text processig. From the poit of view of atural laguage uderstadig, the ous ad verbs form the mai cotet of a text. Curretly, the mai methods used i text pretreatmet iclude stem-mig of Eglish documets ad segmetatio of Chiese documets. I this paper, our approach is the positive maximal match segmetatio based o dictioary. 3.2 Code The Feature The weight of a feature term of the web documets is oe dimesio of a particle s positio vector. Because the umber of the documet features is ukow, we proposed a dyamic PSO. The dimesio of particles positio i the algorithm is dyamic i order to adapt to the ucertai umber of the documet features. The particles i the swarm is iitialized by the radom selected feature terms from the term list after word segmetig. The particles positio vector is maily formed from the stadardized web documet vector model. 3.3 Optimizatio Procedure Whe the PSO starts, a populatio of particles is geerated first with radom positios ad a velocity is radom assiged to each particle[8]. The fitess of each particle is the evaluated accordig to the objective fuctio. The the particles begi to search for the best solutio. Each particle s trajectory is adjusted by dyamically alterig the velocity of each particle, accordig to its ow search experiece ad other particles experiece. The positio vector ad the velocity vector of particle j i the d-dimesioal search space ca be represeted as X j =

34 IJCSNS Iteratioal Joural of Computer Sciece ad Network Security, VOL.7 No.6, May 2007 (x j,x j2,x j3,,x jd) ad V j = (v j,v j2,v j3,,v jd ) respectively. Accordig to the objective fuctio, let the best positio of each particle be P j = (p j,p j2,p j3,,p jd ) ad the best positio of all the particles be G = (g,g 2,g 3, g d ). The formulas which the particles use to adjust its positio ad velocity is: similar( P, Q) = i= p q i 2 2 pi qi i= i= i (5) V = V+ c* rad()*( P X) + c2* rad2()*( G X) (2) j j j j j X = X + V (3) j+ j j+ where c ad c 2 are acceleratio coefficiets ad are always made 2, rad() ad rad2() are radom umbers i [0,][9]. The first part of (2) represets the previous velocity. The secod part represets the persoal experiece. The third part represets the collaborative effect of the particles ad it always pulls the particles toward the global best solutio which particles foud so far. At each iteratio of PSO, the velocity of each particle is calculated accordig to (2) ad the positio is updated accordig to (3). Geerally, a maximum velocity vector is defied i order to cotrol the V j. Wherever a v jd exceeds the defied limit, its velocity will be set to be v max d. If a particle fids a better positio tha the previously foud best positio, it will be stored i memory. The algorithm goes o util the satisfactory solutio is foud or the predefied umber of iteratios is met. 3.4 Evaluatio Fuctio Evaluatio fuctio is a importat criterio for judgig the merits of particle locatio. The locatio of each particle is composed of the feature weight of the same documets whe the terms of the web documet are acquired. So the best particle should be able to reflect the web documet well ad it also should cotai other particles distributio iformatio. Therefore, the idividual's fitess should be measured by all particles which preset the same documet. The more similar they are, the bigger their fitess should be. The fitess fuctio should be defied as followig: swarm_ size d d d i i j j= fitess ( P ) = similar( P, P )/ swarm _ size (4) The similarity of vectors ca be measured by the agle betwee two vectors or by the distace betwee two particles. It is easy to use the agle betwee two vectors ad the formula is le followig: p, q is the elemets of vector P,Q. I additio, the dimesio of a vector is also a importat criterio. If several particles have the same fitess, smaller the dimesio is, better the particle is. Therefore, we eed to icrease the pealty fuctio p(r) i the evaluatio fuctio i order to lead the particles to the best regio. I this paper, we set pealty fuctio p(r) as followig: d + 2 pd ( ) = (6) d + 2 Where d is the dimesio of vectors. The fial evaluatio fuctio is formed by addig the pealty fuctio to the fitess formula: value( d) = p( d) fitess( d) (7) 4. Feasibility Aalysis ad Desig Theoretically, the dimesio of particles' locatio should be dyamic. But i fact,it is t. We use empty elemets to fill the dimiished dimesio. Whe the weight of oe feature is less tha 0.05, we also replace it with empty elemet. The structure of the particles locatio is le followig: W W2 Wm B B2 B Fig.. the structure of a particle After the particle desig is fiished,we use the algorithm of particle swarm optimizatio with reverse thig particles[0]. The trait of the reverse thig particles is that they do ot update their positio accordig to (3). They move to a opposite directio. You ca look at the Fig. 2. They are selected radomly i the particle swarm ad their ew formula which will update their positio is : X '( t+ ) = X V (8) j j j

IJCSNS Iteratioal Joural of Computer Sciece ad Network Security, VOL.7 No.6, May 2007 35 Fig.2. trait of reverse thig particles Whe the algorithm starts, ormal particles ru followig the formula (2) ad (3) ad the reverse thig particles ru followig the formula (2) ad (8). If oe of the reverse thig particles fids a better solutio tha G(the best positio of all the particles), the reverse thig particle will chage to be a ormal particle ad oe particle will be selected radomly from the ormal particles to become a reverse thig particle. If the G is acquired by the ormal particles, the reverse thig particle will still be opposite. Whe the reverse thig particles positio reach the border of the problem space, they will be set to be the ormal particles which has the best positio G. The flow chart of PSORTP is depicted graphically i Fig 3 5. Experimets The mai purpose of the experimets is to verify the effectiveess of the ovel web text feature extractio algorithm. The mai experimetal materials is the web pages(html) which is Parsed by HTMLParser. We use two criterios: accuracy rate(ar) ad reductio rate(rr). hl ( ao ( t)) AR( t) = 00% ao (9) fv RR( t) = 00% ov (0) () ao t : the umber of feature terms of text t after optimizatio; hl ( ao ( t)) : the umber of high weight feature terms of HLSegmet[]; fv ov Fig.3. The flow chart of PSORTP : the umber of feature terms of text t; : the umber of origial terms of text t; We tested four web pages. The result is the followig: Table : The test results of the effectiveess of the ovel web text feature extractio algorithm. No. Text () ov t ao() t AR() t RR() t size 996 396 87 88.9% 47.2% 2 7 33 56 93.2 47.% 3 56 293 98 95.8 33.4% 4 78 88 4 96. 46.6%

36 IJCSNS Iteratioal Joural of Computer Sciece ad Network Security, VOL.7 No.6, May 2007 From the results, we ca see that by usig this method, the text features are optimized. Based o the text feature extractio algorithm, we ca greatly improve the efficiecy of web documets processig. 6. Coclusio I this paper, a ovel Web text feature Extractio algorithm is proposed. The algorithm is based o the particle swarm optimizatio with reverse thig particles ad the structure of the particles is also improved. The algorithm ca search the multidimesioal complex space efficietly. This method preset a good idea i feature dimesio reductio ad improvig the efficiecy of web documets processig. Now we are cotiuig our experimets about the algorithm ad we will release more of our results of the experimets i times to come. I the future we will cocetrate o applyig the method to the fields le Web documets filterig, classificatio ad clusterig. Refereces []. Maya Rupert,etc. The Web ad Complex Adaptive Systems Proceedigs of the 20th Iteratioal Coferece o Advaced Iformatio Networkig ad Applicatios (AINA 06), pp200-204,2006 [2]. Wag Shi, etc.. Web miig J. Computer Sciece, 27 (4) : 28~ 3, 2000 [3]. http://maya.cs.depaul.edu/~mobasher/webmier/survey/surv ey.html [4]. SHI Zhogzhi Kowledge discovery, Tsighua Uiversity Press, 2002 [5]. LIU Li etc. Term selectio ad weightig approach based o key words i text categorizatio Computer Egieerig ad Desig 2006.6 [6]. ZOU Jua etc. A Eigevalue Extractio Met hod for Chiese Texts Usig Multiple Heuristic Rules COMPUTER ENGINEERING & SCIENCE 2006.8 [7]. Keedy J., Eberhart R.C.: Particle swarm optimisatio. Proceedigs of the IEEE Iteratioal Coferece o Neural Networks. (995) 942-948. [8]. Rataweera, etc.:self-orgaizig Hierarchical Particle Swarm Optimizer With Time-Varyig Acceleratio Coefficiets. IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION. (2004) 240-255 [9]. Shi Y.H., Eberhart R.C.: Parameter Selectio i Particle Swarm Optimizatio. Evolutioary Programmig VII: Proc.EP 98. (998) 59-600 [0]. ZHANG Xiaomig, Wag Rujig. Particle Swarm Optimizatio Algorithm with Reverse Thig Particles. Computer Sciece, 2006, 33 (0) :56~59 []. www.hylada.com Sog Liagtu received the B.E. ad M.E. degrees, from Ahui Agri. Uiv. i 987 ad 990, respectively. he is workig toward the Ph.D, i Automatio departmet,uiversity of Sciece ad Techology of Chia. he has bee a associate professor i the Istitute of Itelliget Machies, Chiese Academy of Scieces,sice 2000. His research iterest icludes data acquisitio,patter recogitio ad itelliget systems. Zhag Xiaomig received the B.E. from Shadog Uiversity of techology i 2002. He received M.E. from Graduate Uiversity of Chiese Academy of Scieces i 2006. Now he is a Research Assistat of Istitute of Itelliget Machies, Chiese Academy of Scieces. His research iterest icludes computatioal itelligece, web miig, machie learig.