(IJCSIS) International Journal of Computer Science and Information Security

Hierarchical Web Page Classification Based on a Topic Model and Neighboring Pages Integration

Wongkot Sriurai, Department of Information Technology, King Mongkut's University of Technology North Bangkok, Bangkok, Thailand
Phayung Meesad, Department of Teacher Training in Electrical Engineering, King Mongkut's University of Technology North Bangkok, Bangkok, Thailand
Choochart Haruechayasak, Human Language Technology Laboratory, National Electronics and Computer Technology Center (NECTEC), Bangkok, Thailand

Abstract — Most Web page classification models typically apply the bag of words (BOW) model to represent the feature space. The original BOW representation, however, is unable to recognize semantic relationships between terms. One possible solution is to apply the topic model approach based on the Latent Dirichlet Allocation algorithm to cluster the term features into a set of latent topics. Terms assigned to the same topic are semantically related. In this paper, we propose a novel hierarchical classification method based on a topic model that also integrates additional term features from neighboring pages. Our hierarchical classification method consists of two phases: (1) feature representation by using a topic model and integrating neighboring pages, and (2) a hierarchical Support Vector Machines (SVM) classification model constructed from a confusion matrix. From the experimental results, the approach of using the proposed hierarchical SVM model and integrating the current page with neighboring pages via the topic model yielded the best performance, with an accuracy of 90.33% and an F1 measure of 90.14%; an improvement of 5.12% and 5.13%, respectively, over the original SVM model.

Keywords - Web page classification; bag of words model; topic model; hierarchical classification; Support Vector Machines

I. INTRODUCTION

Due to the rapid growth of Web documents (e.g., Web pages, blogs, emails) on the World Wide Web (WWW), Web page classification has become one of the key techniques for managing and organizing those documents, e.g., for document filtering in information retrieval. Generally, Web page classification applies the techniques of text categorization, which use supervised machine learning algorithms to learn the classification model [1, 2]. Most previous works on Web page classification typically applied the bag of words (BOW) model to represent the feature space. Under the BOW model, a Web page is represented by a vector in which each dimension contains a weight value (e.g., frequency) of a word (or term) occurring in the page. The original BOW representation, however, is unable to recognize synonyms from a given word set. As a result, the performance of a classification model using the BOW model can deteriorate.

In this paper, we apply a topic model to represent the feature space for learning the Web page classification model. Under the topic model concept, words (or terms) that are statistically dependent are clustered into the same topics. Given a set of documents D consisting of a set of terms (or words) W, a topic model generates a set of latent topics T based on a statistical inference on the term set W. In this paper, we apply the Latent Dirichlet Allocation (LDA) [3] algorithm to generate a probabilistic topic model from a Web page collection. A topic model can help capture the hypernyms, hyponyms and synonyms of a given word. For example, the words vehicle (hypernym) and automobile (hyponym) would be clustered into the same topic. Likewise, the synonyms film and movie would be clustered into the same topic.
The topic model helps improve the performance of a classification model by (1) reducing the number of feature dimensions and (2) mapping semantically related terms into the same feature dimension. In addition to the topic model, our proposed method integrates additional term features from neighboring pages (i.e., parent, child and sibling pages). Using additional terms from neighboring pages provides more evidence for learning the classification model [4, 5].

We used Support Vector Machines (SVM) [6, 7] as the classification algorithm. SVM has been successfully applied to text categorization tasks [6, 7, 8, 9]. SVM is based on the structural risk minimization principle from computational learning theory. The algorithm addresses the general problem of learning to discriminate between positive and negative members of a given class of n-dimensional vectors. The SVM classifier is, however, designed to solve only the binary classification problem [7]. To manage the multi-class classification problem, many researchers have proposed hierarchical classification methods. For example, Dumais and Chen proposed a hierarchical method using the SVM classifier for classifying a large, heterogeneous collection of Web content; their study showed that the hierarchical method performs better than the flat method [10]. Cai and Hofmann proposed a hierarchical classification method that generalizes SVM based on discriminant functions structured in a way that mirrors the class hierarchy; their study showed that the hierarchical SVM method performs better than the flat SVM method [11].

Most of the related work presented hierarchical classification methods using different approaches. In these previous works, however, the bag of words (BOW) model is used to represent the feature space. In this paper, we propose a new hierarchical classification method using a topic model and integrating neighboring pages. Our hierarchical classification method consists of two phases: (1) feature representation and (2) learning the classification model. We evaluated three different feature representations: (1) applying the simple BOW model on the current page, (2) applying the topic model on the current page, and (3) integrating the neighboring pages via the topic model. To construct a hierarchical classification model, we use the class relationships obtained from a confusion matrix of the flat SVM classification model. The experimental results showed that by integrating the additional neighboring information via a topic model, the classification performance under the F1 measure was significantly improved over the simple BOW model. In addition, our proposed hierarchical classification method yielded a better performance than the flat SVM classification method.

The rest of this paper is organized as follows. In the next section we provide a brief review of Latent Dirichlet Allocation (LDA). Section 3 presents the proposed framework of hierarchical classification via the topic model and neighboring pages integration. Section 4 presents the experiments with a discussion of the results. In Section 5, we conclude the paper.

II. A REVIEW OF LATENT DIRICHLET ALLOCATION

Latent Dirichlet Allocation (LDA) has been introduced as a generative probabilistic model for a set of documents [3, 12]. The basic idea behind this approach is that documents are represented as random mixtures over latent topics. Each topic is represented by a probability distribution over the terms, and each article is represented by a probability distribution over the topics. LDA has also been applied for identification of topics in a number of different areas such as classification, collaborative filtering [3] and content-based filtering [13].

Generally, an LDA model can be represented as a probabilistic graphical model as shown in Figure 1 [3]. There are three levels to the LDA representation. The variables α and β are corpus-level parameters, which are assumed to be sampled during the process of generating a corpus: α is the parameter of the uniform Dirichlet prior on the per-document topic distributions, and β is the parameter of the uniform Dirichlet prior on the per-topic word distribution. θ is a document-level variable, sampled once per document. Finally, the variables z and w are word-level variables, sampled once for each word in each document. The variable N is the number of word tokens in a document and M is the number of documents.

Figure 1. The Latent Dirichlet Allocation (LDA) model

The LDA model [3] introduces a set of K latent variables, called topics. Each word in the document is assumed to be generated by one of the topics. The generative process for each document w can be described as follows:

1. Choose θ ~ Dir(α): choose a latent topic mixture vector θ from the Dirichlet distribution.
2. For each word w_n in W:
   (a) Choose a topic z_n ~ Multinomial(θ): choose a latent topic z_n from the multinomial distribution.
   (b) Choose a word w_n from P(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
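To make the generative process concrete, the following is a minimal numpy sketch of steps 1 and 2 above; the corpus sizes, document lengths and random seed are toy assumptions, not values from our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, M = 3, 8, 5           # topics, vocabulary size, documents (toy values)
alpha = np.full(K, 0.1)     # Dirichlet prior on per-document topic mixtures
beta = np.full(V, 0.01)     # Dirichlet prior on per-topic word distributions

phi = rng.dirichlet(beta, size=K)        # K x V: one word distribution per topic

for d in range(M):
    theta = rng.dirichlet(alpha)         # step 1: theta ~ Dir(alpha)
    n_words = 1 + rng.poisson(8)         # document length (toy assumption)
    doc = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)       # step 2(a): z_n ~ Multinomial(theta)
        w = rng.choice(V, p=phi[z])      # step 2(b): w_n ~ P(w_n | z_n, beta)
        doc.append(w)
    print(f"document {d}: theta={np.round(theta, 2)}, word ids={doc}")
```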
III. THE PROPOSED HIERARCHICAL CLASSIFICATION FRAMEWORK

Figure 2 illustrates the proposed hierarchical classification framework, which consists of two phases: (1) feature representation for learning the Web page classification models, and (2) learning classification models based on Support Vector Machines (SVM). In our proposed framework, we evaluated three different feature representations: (1) applying the simple BOW model on the current page, (2) applying the topic model on the current page, and (3) integrating the neighboring pages via the topic model. After the feature representation process, we use the class relationships obtained from a confusion matrix of the flat SVM classification model to build a new hierarchical classification method.

Figure 2. The proposed hierarchical classification framework

A. Feature Representation

The process of feature representation can be explained in detail as follows.

Approach 1 (BOW): A Web page collection consists of an article collection, which is a set of m documents denoted by D = {D_0, ..., D_{m-1}}. Text processing is applied to extract terms, yielding a set of terms W = {W_0, ..., W_{k-1}}, where k is the total number of terms. Each term is given a weight w, assigned as the term frequency. The set of terms is then filtered by using a feature selection technique, information gain (IG) [1]. Once the term features are obtained, we apply Support Vector Machines (SVM) to learn the classification model. The model is then used to evaluate the performance of category prediction.
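As an illustration of Approach 1, the sketch below builds term-frequency features, selects terms by an information-gain-style score, and trains a polynomial-kernel SVM. Our experiments used WEKA; scikit-learn is substituted here as an assumption, mutual information stands in for information gain, and the toy pages and labels are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy pages and labels, standing in for D = {D_0, ..., D_m-1}.
docs = [
    "the film won a major award at the festival",
    "the museum opened a new painting exhibition",
    "stock markets fell as the company reported losses",
    "the bank raised interest rates for small business loans",
]
labels = ["art", "art", "business", "business"]

model = make_pipeline(
    CountVectorizer(),                          # term-frequency (BOW) weights
    SelectKBest(mutual_info_classif, k=10),     # keep the top-k terms by score
    SVC(kernel="poly", degree=2),               # polynomial-kernel SVM
)
model.fit(docs, labels)
print(model.predict(["a new exhibition of film posters"]))
```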

Approach 2 (TOPIC_CURRENT): Given the same Web page collection of m documents denoted by D = {D_0, ..., D_{m-1}}, text processing is applied to extract terms. A topic model is then generated from the term set by using the LDA algorithm. The LDA algorithm generates a set of n topics denoted by T = {T_0, ..., T_{n-1}}. Each topic T_i is a probability distribution over p words, T_i = [w_{i,0}, ..., w_{i,p-1}], where w_{i,j} is the probabilistic value of word j assigned to topic i. Based on this topic model, each document D_i can be represented as a probability distribution over the topic set T, i.e., D_i = [t_{i,0}, ..., t_{i,n-1}], where t_{i,j} is the probabilistic value of topic j assigned to document i. The output of this step is the topic probability representation of each article. Support Vector Machines (SVM) are again used to learn the classification model.

Approach 3 (TOPIC_INTEGRATED): The main difference of this approach from Approach 2 is that we integrate additional term features obtained from the neighboring pages to improve the performance of Web page classification. The process of integrating the neighboring pages is as follows. Figure 3 shows the three types of neighboring pages: parent, child and sibling pages. Given a Web page (i.e., the current page), there are typically incoming links from parent pages, outgoing links to child pages, and links from its parent pages to sibling pages. Parent, child and sibling pages are collectively referred to as the neighboring pages. Using the additional terms from the neighboring pages provides more evidence for learning the classification model. In this paper, we vary the weight value of the neighboring pages from zero to one; a weight value equal to zero means the neighboring pages are not included in the feature representation. Under this approach, terms from the different page types (i.e., current, parent, child and sibling) are first transformed into a set of n topics (denoted by T = {T_0, ..., T_{n-1}}) by using the LDA algorithm. The weight values from 0 to 1 are then multiplied into the topic dimensions T of the parent, child and sibling pages. The combined topic feature vector, integrating the neighboring topic vectors with adjusted weight values, can be computed by using the algorithm listed in Table I.
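Before turning to the integration algorithm, the following sketch illustrates the Approach 2 pipeline: documents are mapped to topic-probability vectors, which then feed the SVM. scikit-learn's LatentDirichletAllocation is an illustrative stand-in for the LingPipe LDA used in our experiments, and the toy data and reduced topic count are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC

docs = [
    "the film won a major award at the festival",
    "the museum opened a new painting exhibition",
    "stock markets fell as the company reported losses",
    "the bank raised interest rates for small business loans",
]
labels = ["art", "art", "business", "business"]

X = CountVectorizer().fit_transform(docs)            # term counts
lda = LatentDirichletAllocation(n_components=4,      # 200 topics in the paper
                                random_state=0)
D = lda.fit_transform(X)                             # rows are D_i = [t_i0, ..., t_in-1]
clf = SVC(kernel="poly", degree=2).fit(D, labels)    # SVM on topic probabilities
```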

Figure 3. A current Web page with three types of neighboring pages

The INP algorithm that we present in this paper incorporates term features obtained from the neighboring pages (i.e., parent, child and sibling pages) into the classification model. Using additional terms from the neighboring pages provides more evidence for learning the classification model. In this algorithm, we propose a function for varying the weight values of terms from parent pages (PDT), child pages (CDT) and sibling pages (SDT). The probability values from all neighboring pages are integrated with the current page (CurDT) to form a new integrated matrix (IDT). The algorithm begins with the results from the LDA model, that is, the document-topic matrices of all page types. It first gathers data from the document-topic matrices (CurDT, PDT, CDT, SDT) using the getPValue function. All P-values of the document-topic matrices are then multiplied by the weight value of each matrix, except for the current page matrix. Finally, the P-values from the four matrices are summed and written to IDT using the setPValue function. After the integrating process, we use the IDT matrix for learning the classification model.

TABLE I. THE INTEGRATING NEIGHBORING PAGES (INP) ALGORITHM

Algorithm: INP
Input: CurDT, PDT, CDT, SDT, W_p, W_c, W_s
for all d_i in CurDT do
    for all t_j in CurDT do
        CurDT <- getPValue(CurDT, i, j)
        PP <- getPValue(PDT, i, j) * W_p
        PC <- getPValue(CDT, i, j) * W_c
        PS <- getPValue(SDT, i, j) * W_s
        setPValue(IDT, CurDT + PP + PC + PS, i, j)
    end for
end for
return IDT

Parameters and variables:
CurDT : document-topic matrix of the current page
PDT : document-topic matrix of the parent pages
CDT : document-topic matrix of the child pages
SDT : document-topic matrix of the sibling pages
IDT : integrated document-topic matrix
PP : P-value from PDT at a specific index
PC : P-value from CDT at a specific index
PS : P-value from SDT at a specific index
W_p : weight value for parent pages, 0.0 <= W_p <= 1.0
W_c : weight value for child pages, 0.0 <= W_c <= 1.0
W_s : weight value for sibling pages, 0.0 <= W_s <= 1.0
P-value : probability value
getPValue(M, r, c) : returns the P-value at row r and column c of matrix M
setPValue(M, p, r, c) : sets the P-value at row r, column c of matrix M to value p
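Since the four document-topic matrices share the same documents-by-topics shape, the two nested loops of Table I collapse to one weighted matrix sum. Below is a minimal numpy rendering of the INP algorithm; the matrix values are toy assumptions, while the weights follow the best setting reported in Section IV (W_p = 0.4, W_c = 0.0, W_s = 0.3).

```python
import numpy as np

def integrate_neighboring_pages(cur_dt, p_dt, c_dt, s_dt, w_p, w_c, w_s):
    """Return the integrated document-topic matrix IDT."""
    for w in (w_p, w_c, w_s):
        assert 0.0 <= w <= 1.0, "weights must lie in [0.0, 1.0]"
    # Element-wise form of the two nested loops over documents i and topics j:
    #   IDT[i, j] = CurDT[i, j] + W_p*PDT[i, j] + W_c*CDT[i, j] + W_s*SDT[i, j]
    return cur_dt + w_p * p_dt + w_c * c_dt + w_s * s_dt

# Toy example: 2 documents, 3 topics.
cur = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]])   # CurDT
par = np.array([[0.5, 0.4, 0.1], [0.2, 0.5, 0.3]])   # PDT
chi = np.zeros_like(cur)                             # CDT
sib = np.array([[0.6, 0.3, 0.1], [0.1, 0.7, 0.2]])   # SDT
idt = integrate_neighboring_pages(cur, par, chi, sib, w_p=0.4, w_c=0.0, w_s=0.3)
print(idt)
```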
B. Classification Model

The three feature representation approaches are used as input to the classifiers. In this paper, we propose two methods for building the classification models: (1) Model 1, in which we adopt the SVM to classify the features, and (2) Model 2, a new hierarchical classification method that uses the class relationships obtained from a confusion matrix to learn the classification model. Each method is described in detail as follows.

Model 1 (SVM): We used the SVM for learning a classification model. The SVM is the machine learning algorithm proposed by Vapnik [7]. The algorithm constructs a maximum margin hyperplane which separates a set of positive examples from a set of negative examples. For examples that are not linearly separable, SVM uses a kernel function to map the examples from the input space into a high dimensional feature space, which makes the non-linear problem solvable. In our experiments, we used a polynomial kernel. We implemented the SVM classifier by using the WEKA library (http://www.cs.waikato.ac.nz/ml/weka/).

Model 2 (HSVM): The proposed method is based on the SVM classifier and uses the class relationships obtained from a confusion matrix to build a hierarchical SVM (HSVM). A confusion matrix shows the numbers of correct and incorrect predictions made by the model compared with the actual classifications in the test data. The size of the confusion matrix is m-by-m, where m is the number of classes. Figure 4 shows an example of a confusion matrix from Approach 3, built on a collection of articles obtained from the Wikipedia Selection for Schools. In a confusion matrix, the rows correspond to the actual classes and the columns correspond to the predicted classes. In this example, for the class art, the model makes correct predictions for 49 instances, and incorrect predictions into class citizenship (c) for 1 instance and into class design and technology (e) for 5 instances.

Figure 4. A confusion matrix of the Wikipedia Selection for Schools
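A sketch of how such a confusion matrix can be obtained from the flat SVM model under 10-fold cross validation; scikit-learn's cross_val_predict and confusion_matrix are stand-ins for the WEKA workflow, and the features and labels are toy assumptions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X = np.random.default_rng(0).random((60, 5))   # toy topic-feature vectors
y = np.arange(60) % 3                          # three toy classes, 20 documents each

pred = cross_val_predict(SVC(kernel="poly"), X, y, cv=10)
cm = confusion_matrix(y, pred)   # rows: actual classes; columns: predicted classes
print(cm)
```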

We used the confusion matrix to construct a hierarchical structure. First, we transform the confusion matrix into a new symmetric matrix, called the average pairwise confusion matrix (APCM), by computing the average values of the pairwise relationships between classes in the confusion matrix (CM). The process of transforming CM into APCM can be explained as follows. Given a confusion matrix CM = [v_{a,p}], where each row a corresponds to an actual class and each column p corresponds to a predicted class: for a correct prediction, i.e., a equal to p in CM, we set the value in APCM to 0; if a is not equal to p, i.e., an incorrect prediction, we compute the average of v_{a,p} and v_{p,a} as the pairwise confusion value at that position. This calculation is applied to every row and column. For example, in Figure 4, v_{0,0} = 49; since a is equal to p (a correct prediction), w_{0,0} is set to 0 in APCM. For v_{0,2} = 1, where a = 0 and p = 2 (a is not equal to p), the average pairwise confusion value of v_{0,2} and v_{2,0} is equal to 1. The final result of the average pairwise confusion matrix computation is shown in Figure 5. The computation of an average pairwise value is summarized by the following equations:

w_{a,p} = (v_{a,p} + v_{p,a}) / 2, if a ≠ p    (1)
w_{a,p} = 0, if a = p    (2)

where w_{a,p} is the value of the average pairwise confusion matrix (APCM) at row a and column p, and v_{a,p} is the value of the confusion matrix (CM) at row a and column p.

Figure 5. An average pairwise confusion matrix of the Wikipedia Selection for Schools

Once the average pairwise confusion matrix (APCM) is obtained, we construct a dendrogram based on the single link algorithm of hierarchical agglomerative clustering (HAC) [14, 15]. Single link clustering merges the two clusters with the smallest minimum pairwise distance, and is known to be confused by nearby overlapping clusters [14]. To construct our hierarchical classification structure, we adapt the single link algorithm to merge two clusters by selecting the maximum average pairwise value in the confusion matrix. We first select the pair of classes with the maximum average pairwise value in APCM into the dendrogram, then select the next highest average pairwise value, and continue this process until all classes have been selected into the dendrogram. The final dendrogram is shown in Figure 6. For example, the average pairwise value between classes f and o is 21.5, the highest value in APCM, so classes f and o are selected as the first pair in the dendrogram. The second highest value is 19, the average pairwise value between classes h and m, so classes h and m are selected as the second pair. The third highest value is 15, between classes g and o. However, class o is already paired with class f; therefore, we take only class g and combine it with the cluster formed by classes f and o. We perform this process for all remaining classes and finally obtain a complete dendrogram for constructing the hierarchical classification model. The hierarchical classification models are constructed bottom-up. With this hierarchical classification structure, classes with lower confusion values are classified before classes with higher confusion. The hierarchical classification model can thereby improve the performance of the multi-class classification method.

Figure 6. A hierarchy of the Wikipedia Selection for Schools
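The following numpy sketch implements equations (1)-(2) and lists class pairs in decreasing order of average pairwise confusion, the order in which the dendrogram is grown. The toy confusion matrix is an assumption, and the bookkeeping that attaches an already-paired class's partner to its existing cluster (as with class g above) is left out for brevity.

```python
import numpy as np

def apcm(cm):
    """Average pairwise confusion: w[a,p] = (v[a,p] + v[p,a]) / 2 for a != p."""
    w = (cm + cm.T) / 2.0
    np.fill_diagonal(w, 0.0)      # w[a,p] = 0 when a == p
    return w

def merge_order(w):
    """Class pairs from the highest to the lowest average pairwise confusion."""
    w = w.copy()
    pairs = []
    while w.max() > 0:
        a, p = np.unravel_index(np.argmax(w), w.shape)
        pairs.append((int(min(a, p)), int(max(a, p)), w[a, p]))
        w[a, p] = w[p, a] = 0.0   # consume this pair and look for the next one
    return pairs

cm = np.array([[49, 0, 1, 0, 5],   # toy 5-class confusion matrix
               [0, 40, 2, 1, 0],
               [3, 2, 38, 0, 0],
               [0, 1, 0, 45, 2],
               [4, 0, 0, 2, 50]])
for a, p, value in merge_order(apcm(cm)):
    print(f"merge classes {a} and {p} (average pairwise confusion {value})")
```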

IV. EXPERIMENTS AND DISCUSSION

A. Web page collection

In our experiments, we used a collection of articles obtained from the Wikipedia Selection for Schools, which is available from the SOS Children's Villages Web site (http://www.soschildrensvillages.org.uk/charity-news/wikipedia-for-schools.htm). There are 15 categories: art, business studies, citizenship, countries, design and technology, everyday life, geography, history, IT, language and literature, mathematics, music, people, religion and science. The total number of articles is 4,625. Table II lists the first-level subject categories available in the collection. Organizing articles into the subject category set provides users a convenient way to access articles on the same subject. Each article contains many hypertext links to other related articles.

TABLE II. THE SUBJECT CATEGORIES UNDER THE WIKIPEDIA SELECTION FOR SCHOOLS

Category                  No. of Articles
Art                       74
Business Studies          88
Citizenship               224
Countries                 220
Design and Technology     250
Everyday life             380
Geography                 650
History                   400
IT                        64
Language and literature   196
Mathematics               45
Music                     140
People                    680
Religion                  146
Science                   1068

B. Experiments

We used the LDA algorithm provided by the linguistic analysis tool LingPipe (http://alias-i.com/lingpipe) to run our experiments. LingPipe is a suite of Java tools designed to perform linguistic analysis on natural language data. In this experiment, we applied the LDA algorithm provided under the LingPipe API and set the number of topics to 200 and the number of epochs to 2,000. For the text classification process, we used WEKA, an open-source machine learning tool, to perform the experiments.

C. Evaluation Metrics

The standard performance metrics used in the experiments for evaluating text classification are accuracy, precision, recall and the F1 measure [16]. We tested all algorithms by using 10-fold cross validation. Accuracy, precision, recall and the F1 measure are defined as:

Accuracy = (number of correctly classified test documents) / (total number of test documents)    (3)

Precision = (number of correct positive predictions) / (number of positive predictions)    (4)

Recall = (number of correct positive predictions) / (number of positive data)    (5)

F1 = 2 × (precision × recall) / (precision + recall)    (6)

Accuracy represents the percentage of correct predictions among all predictions. Precision (P) is the percentage of the documents predicted for a given category that are classified correctly. Recall (R) is the percentage of the documents of a given category that are classified correctly. The F1 measure is a single measure that combines precision and recall; it ranges from 0 to 1, and higher is better.
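For concreteness, here is a small self-contained rendering of equations (3)-(6); the labels are hypothetical and serve only to exercise the formulas.

```python
def precision_recall_f1(y_true, y_pred, cls):
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))  # correct positive predictions
    predicted = sum(p == cls for p in y_pred)                        # positive predictions
    actual = sum(t == cls for t in y_true)                           # positive data
    precision = tp / predicted if predicted else 0.0                 # eq. (4)
    recall = tp / actual if actual else 0.0                          # eq. (5)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0  # eq. (6)
    return precision, recall, f1

y_true = ["art", "music", "art", "science", "music", "art"]
y_pred = ["art", "art",   "art", "science", "music", "music"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # eq. (3)
print(f"accuracy = {accuracy:.2f}")
for cls in sorted(set(y_true)):
    p, r, f1 = precision_recall_f1(y_true, y_pred, cls)
    print(f"{cls}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```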
D. Experimental results

We started by evaluating the weight values of the neighboring pages under Approach 3. Table III shows the classification results of our algorithm with the best combinations of neighboring-page weight values. The SVM model achieved its best combination, with an accuracy of 85.21% and an F1 measure of 0.8501, at parent, child and sibling weights of 0.4, 0.0 and 0.3, respectively; the HSVM model achieved its best combination, with an accuracy of 90.33% and an F1 measure of 0.9014, at the same weights as the SVM model. The results show that information from parent pages and sibling pages is more effective than that from child pages for improving the performance of a classification model.

TABLE III. CLASSIFICATION RESULTS BY INTEGRATING NEIGHBORING PAGES

Model   W_p   W_c   W_s   P        R        F1       Accuracy (%)
SVM     0.4   0.0   0.3   0.8583   0.8337   0.8501   85.21
HSVM    0.4   0.0   0.3   0.8984   0.9046   0.9014   90.33

From Table IV, comparing the two classification models, the SVM model and the hierarchical SVM (HSVM), the approach of integrating the current page with the neighboring pages via the topic model (TOPIC_INTEGRATED) yielded a higher accuracy than applying the topic model on the current page (TOPIC_CURRENT) and than applying the BOW model. For the SVM model, the TOPIC_INTEGRATED approach reached the highest accuracy of 85.21%, an improvement of 23.96% over the BOW model. For the HSVM model, the TOPIC_INTEGRATED approach reached the highest accuracy of 90.33%, an improvement of 4.64% over the BOW model.

TABLE IV. EVALUATION RESULTS OF THE CLASSIFICATION MODELS USING THE THREE FEATURE REPRESENTATION APPROACHES

                                                  Classification Models
Feature Representation Approach                   SVM Accuracy (%)   HSVM Accuracy (%)
1. BOW                                            61.25              85.69
2. TOPIC_CURRENT                                  78.54              88.97
3. TOPIC_INTEGRATED (W_p=0.4, W_c=0.0, W_s=0.3)   85.21              90.33

TABLE V. CLASSIFICATION RESULTS BASED ON THE THREE FEATURE REPRESENTATION APPROACHES

                                                  SVM                          HSVM
Feature Representation Approach                   P        R        F1         P        R        F1
1. BOW                                            0.6000   0.6610   0.6120     0.8485   0.8541   0.8503
2. TOPIC_CURRENT                                  0.7960   0.7710   0.7840     0.8886   0.8908   0.8891
3. TOPIC_INTEGRATED (W_p=0.4, W_c=0.0, W_s=0.3)   0.8583   0.8337   0.8501     0.8984   0.9046   0.9014

Table V shows the experimental results of the three feature representation approaches under the two classification models, the SVM model and the hierarchical SVM (HSVM) model. From this table, the approach of integrating the current page with the neighboring pages via the topic model (TOPIC_INTEGRATED) yielded a higher performance than applying the topic model on the current page (TOPIC_CURRENT) and than applying the BOW model. The HSVM classification model yielded a higher performance than the SVM classification model under all three feature representation approaches. For the SVM model, applying the TOPIC_CURRENT approach improved the performance over the BOW model by 17.2% based on the F1 measure, and applying the TOPIC_INTEGRATED approach yielded the best performance with an F1 measure of 85.01%, an improvement of 23.81% over the BOW model. For the HSVM model, applying the TOPIC_CURRENT approach improved the performance over the BOW model by 3.88% based on the F1 measure, and applying the TOPIC_INTEGRATED approach yielded the best performance with an F1 measure of 90.14%, an improvement of 5.11% over the BOW model. Moreover, the TOPIC_INTEGRATED approach with the HSVM model yielded the best overall performance, with an F1 measure of 90.14%, an improvement of 5.13% over the TOPIC_INTEGRATED approach with the SVM model. Thus, integrating the additional neighboring information, especially from the parent and sibling pages, via a topic model can significantly improve the performance of a classification model. The reason is that parent pages often provide terms, such as those in anchor texts, which give additional descriptive information about the current page.

V. CONCLUSIONS

To improve the performance of Web page classification, we proposed a new hierarchical classification method based on a topic model, integrating additional term features obtained from the neighboring pages. We applied the topic model approach based on the Latent Dirichlet Allocation algorithm to cluster the term features into a set of latent topics; terms assigned to the same topic are semantically related. Our hierarchical classification method consists of two phases: (1) feature representation by using a topic model and integrating neighboring pages, and (2) a hierarchical Support Vector Machines (SVM) classification model constructed from a confusion matrix. From the experimental results, the approach of integrating the current page with the neighboring pages via the topic model yielded a higher performance than applying the topic model on the current page alone and than applying the BOW model.
In learning the classification model, the hierarchical SVM classification model yielded a higher performance than the flat SVM classification model under all three feature representation approaches, and integrating the current page with the neighboring pages via the topic model yielded the best performance, with an F1 measure of 90.14%, an improvement of 5.11% over the BOW model. Overall, the approach of integrating the current page with the neighboring pages via the topic model, combined with the hierarchical SVM classification model, yielded the best performance, with an accuracy of 90.33% and an F1 measure of 90.14%; an improvement of 5.12% and 5.13%, respectively, over the original SVM model.

REFERENCES

[1] Y. Yang and J.P. Pedersen, "A Comparative Study on Feature Selection in Text Categorization", Proceedings of the 14th International Conference on Machine Learning, pp. 412-420, 1997.
[2] S.T. Dumais, J. Platt, D. Heckerman, and M. Sahami, "Inductive Learning Algorithms and Representations for Text Categorization", Proceedings of CIKM 1998, pp. 148-155, 1998.
[3] D.M. Blei, A.Y. Ng, and M.I. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[4] X. Qi and B.D. Davison, "Classifiers Without Borders: Incorporating Fielded Text From Neighboring Web Pages", Proceedings of the 31st Annual International ACM SIGIR Conference on Research & Development in Information Retrieval, Singapore, pp. 643-650, 2008.

[5] G. Chen and B. Choi, "Web page genre classification", Proceedings of the 2008 ACM Symposium on Applied Computing, pp. 2353-2357, 2008.
[6] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", Proceedings of the European Conference on Machine Learning (ECML), Berlin, pp. 137-142, 1998.
[7] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[8] A. Sun, E.-P. Lim, and W.-K. Ng, "Web classification using support vector machine", Proceedings of the 4th International Workshop on Web Information and Data Management (WIDM), ACM Press, pp. 96-99, 2002.
[9] W. Sriurai, P. Meesad, and C. Haruechayasak, "A Topic-Model Based Feature Processing for Text Categorization", Proceedings of the 5th National Conference on Computer and Information Technology, pp. 146-151, 2009.
[10] S. Dumais and H. Chen, "Hierarchical classification of Web content", Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval, ACM Press, New York, pp. 256-263, 2000.
[11] L. Cai and T. Hofmann, "Hierarchical document categorization with support vector machines", in CIKM, pp. 78-87, 2004.
[12] M. Steyvers and T.L. Griffiths, "Probabilistic topic models", in T. Landauer, D. McNamara, S. Dennis, and W. Kintsch (eds.), Latent Semantic Analysis: A Road to Meaning, Laurence Erlbaum, 2006.
[13] C. Haruechayasak and C. Damrongrat, "Article Recommendation Based on a Topic Model for Wikipedia Selection for Schools", Proceedings of the 11th International Conference on Asian Digital Libraries, pp. 339-342, 2008.
[14] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
[15] G. Karypis, E. Han, and V. Kumar, "Chameleon: A hierarchical clustering algorithm using dynamic modeling", IEEE Computer, 32(8):68-75, 1999.
[16] H. Yu, J. Han, and K. Chen-Chuan Chang, "PEBL: Web Page Classification without Negative Examples", IEEE Transactions on Knowledge and Data Engineering, 16(1):70-81, 2004.

AUTHORS PROFILE

Wongkot Sriurai received her B.Sc. degree in Computer Science and M.S. degree in Information Technology from Ubon Ratchathani University. She is currently a Ph.D. candidate in the Department of Information Technology at King Mongkut's University of Technology North Bangkok. Her current research interests include Web mining, information filtering and recommender systems.

Phayung Meesad received his B.S. from King Mongkut's University of Technology North Bangkok, and his M.S. and Ph.D. degrees in Electrical Engineering from Oklahoma State University. His current research interests include fuzzy systems and neural networks, evolutionary computation and discrete control systems. He is currently an Assistant Professor in the Department of Teacher Training in Electrical Engineering at King Mongkut's University of Technology North Bangkok, Thailand.

Choochart Haruechayasak received his B.S. from the University of Rochester, M.S. from the University of Southern California and Ph.D. in Computer Engineering from the University of Miami. His current research interests include search technology, data/text/Web mining, information filtering and recommender systems. He is currently chief of the Intelligent Information Infrastructure Section under the Human Language Technology Laboratory (HLT) at the National Electronics and Computer Technology Center (NECTEC), Thailand.