Experiments in Text Categorization Using Term Selection by Distance to Transition Point

Edgar Moyotl-Hernández, Héctor Jiménez-Salazar
Facultad de Ciencias de la Computación, B. Universidad Autónoma de Puebla,
14 Sur y Av. San Claudio, Edif. 135, Ciudad Universitaria, Puebla, Pue. 72570, México.
Tel. (01222) 229 55 00 ext. 7212, Fax (01222) 229 56 72
emoyotl@mail.cs.buap.mx, hjimenez@fcfm.buap.mx

Abstract. This paper presents a novel term selection method called distance to transition point (DTP), which is equally effective for unsupervised and supervised term selection. DTP computes the distance between the frequency of a term and the transition point (TP) and then, using this distance as a criterion, selects the terms closest to TP. Experimental results on Spanish texts show that feature selection by DTP achieves performance superior to document frequency and comparable to information gain and the chi-statistic. Moreover, when DTP is used to select terms under an unsupervised policy, it improves the performance of traditional classification algorithms such as k-NN and Rocchio.

Keywords: distance to transition point, term selection, text categorization.

1 Introduction

The rapid growth in the volume of text documents available electronically has led to increased interest in developing tools that help organize textual information. Text categorization (TC), the classification of text documents into a set of predefined categories, is an important task for handling and organizing textual information. Since building text classifiers manually is difficult and time-consuming, the dominant approach to TC is based on machine learning techniques [10]. Within this approach, a classification learning algorithm automatically builds a text classifier from a set of preclassified documents, a training set.

In TC a document d_j is usually represented as a vector of term weights d_j = (w_{1j}, ..., w_{Vj}), where V is the number of terms (the vocabulary size) that occur in the training set, and w_{ij} measures the importance of term t_i for the characterization of document d_j. However, many classification algorithms are computationally hard, and their computational cost is a function of V [2]. Hence, feature selection (FS) techniques are used to select a subset of the original term set in order to improve categorization effectiveness and reduce computational complexity.

In [12] five FS methods were tested: document frequency, information gain, the chi-statistic, mutual information and term strength. The first three were found to be the most effective, and for that reason they are the ones tested in this paper. A widely used approach to FS is filtering, which consists in selecting the terms that score highest according to a criterion that measures the importance of the term for the TC task [4]. There are two main policies for performing term selection: an unsupervised policy, where term scores are determined without using any category information, and a supervised policy, where information on the category membership of training documents is used to determine term scores [5].

In this paper we present a new term selection method called distance to transition point (DTP), which can be used for both unsupervised and supervised term selection. DTP computes the distance between the frequency of a term and the transition point (TP), i.e., the frequency that splits the terms of a text (or a set of texts) into low-frequency terms and high-frequency terms. Under the unsupervised policy, DTP calculates TP using all training documents, whereas under the supervised policy, DTP calculates TP using the training documents belonging to a specific category. We report experimental results obtained on Spanish texts with two classification algorithms, k-NN and Rocchio; three term selection techniques, document frequency (DF), information gain (IG) and the chi-statistic (CHI); and both unsupervised and supervised term selection by DTP.

The paper is organized as follows. Section 2 briefly introduces the baseline term selection methods (DF, IG and CHI). Section 3 presents the details of the DTP term selection method for both unsupervised and supervised policies. Section 4 describes the classifiers and data used in the experiments. Section 5 presents our experiments and results. Section 6 concludes.

2 Term Selection Methods

In this section we give a brief introduction to three effective FS techniques: one unsupervised method (document frequency) and two supervised methods (information gain and the chi-statistic). These methods assign a score to each term and then select the terms that score highest. In the following, let D be the training set, N the number of documents in D, V the number of terms in D, and C = {c_1, ..., c_M} the set of categories.

Document Frequency (DF). The document frequency of a term t_i is the number of documents in which the term occurs [9]. DF is a traditional term selection method that does not need category information. It is the simplest technique and easily scales to large data sets, with a computational complexity approximately linear in N [12].
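As a concrete illustration of the filtering approach with the DF criterion, the following is a minimal Python sketch; the function name and the toy corpus are ours, not part of the paper.

```python
from collections import Counter

def select_by_df(docs, keep_fraction=0.10):
    """Rank the vocabulary by document frequency (the number of documents
    a term occurs in) and keep the top-scoring fraction of terms."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))              # each term counted once per document
    ranked = [t for t, _ in df.most_common()]
    return ranked[:max(1, int(len(ranked) * keep_fraction))]

# Toy usage: documents as lists of preprocessed (stemmed, stopword-free) terms.
train = [["peso", "mercad"], ["futbol", "mercad"], ["peso", "gobiern"]]
print(select_by_df(train, keep_fraction=0.5))   # the two terms with DF = 2
```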

Information Gain (IG). The information gain of a term t_i measures the number of bits of information obtained by knowing the presence or absence of t_i in a document. If t_i occurs equally frequently in all categories, its IG is 0. The information gain of term t_i is defined as

IG(t_i) = -\sum_{j=1}^{M} P(c_j) \log P(c_j)
          + P(t_i) \sum_{j=1}^{M} P(c_j \mid t_i) \log P(c_j \mid t_i)
          + P(\bar{t}_i) \sum_{j=1}^{M} P(c_j \mid \bar{t}_i) \log P(c_j \mid \bar{t}_i)        (1)

where P(c_j) is the number of documents belonging to category c_j divided by N, P(t_i) is the number of documents containing term t_i divided by N, and P(c_j | t_i) is the number of documents belonging to c_j that contain t_i divided by the total number of documents containing t_i. The computation includes the estimation of the conditional probabilities of a category given a term, and the entropy computations in the definition. The probability estimation has a time complexity of O(N) and the entropy computation a time complexity of O(VM) [12].

Chi-Statistic (CHI). The chi-statistic measures the lack of independence between a term and a category. If term t_i and category c_j are independent, CHI is 0. In TC, given a two-way contingency table for each term t_i and category c_j (as represented in Table 1), CHI is calculated as follows:

CHI(t_i, c_j) = \frac{N (ad - cb)^2}{(a + c)(b + d)(a + b)(c + d)}        (2)

where a, b, c and d are the numbers of documents for each combination of c_j, \bar{c}_j and t_i, \bar{t}_i. In order to get a global score CHI(t_i) from the CHI(t_i, c_j) scores relative to the M individual categories, the maximum score CHI_max(t_i) = \max_{j=1}^{M} CHI(t_i, c_j) is used. The computation of CHI scores has a quadratic complexity, similar to IG [12].

Table 1. Two-way contingency table

Category / Term    t_i    \bar{t}_i
c_j                 a         b
\bar{c}_j           c         d

Yang and Pedersen [12] have shown that IG and CHI are the most effective FS methods for the k-NN and LLSF classification algorithms. Term selection based on DF had performance similar to the IG and CHI methods. The latter result suggests that the most important terms for categorization are those that occur more frequently in the training set.
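To make Eq. (2) concrete, here is a small sketch that computes CHI from the Table 1 counts together with the CHI_max global score; the function names and the example counts are ours.

```python
def chi_statistic(a, b, c, d):
    """Eq. (2): CHI(t_i, c_j) from the Table 1 contingency counts, where
    a = docs of c_j containing t_i, b = docs of c_j without t_i,
    c = docs outside c_j containing t_i, d = docs outside c_j without t_i."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def chi_max(tables_per_category):
    """Global score CHI_max(t_i): the maximum CHI(t_i, c_j) over the M
    categories; tables_per_category holds one (a, b, c, d) tuple per category."""
    return max(chi_statistic(*t) for t in tables_per_category)

# A term appearing in 40 of 50 documents of one category but in only
# 10 of the 150 remaining documents: a large score signals dependence.
print(chi_max([(40, 10, 10, 140)]))
```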

3 Distance to Transition Point

Our term selection method, DTP, is based on TP. TP is derived from the Law of Zipf [1],[11],[14], and is the frequency that splits the terms of a text (or a set of texts) into low-frequency terms and high-frequency terms. In [11] it was observed that TP indicates the frequency around which the key words of a text are found. In our previous experiments [7] we found that categorization performance can be slightly increased if terms that occur more often than TP are disregarded. In this paper TP is used to measure the importance of a term for the categorization task. This measure is an inverse function of the distance between the frequency of a term and TP: when the frequency of a term is identical to TP, the distance is zero, producing a maximum closeness score. Throughout the rest of this section we describe the computation of TP and the details of DTP for both unsupervised and supervised policies.

The computation of TP is performed as follows. Let T be a text (or a set of texts), and let I_1 be the number of terms with frequency 1. Then, according to [11], the transition point of T is defined as

TP = \frac{\sqrt{1 + 8 I_1} - 1}{2}        (3)

As we can see, computing TP only requires scanning the vocabulary of T in order to find I_1 (for more details on TP see [11] and [8]).

DTP unsupervised. Under the unsupervised policy, DTP computes the distance to TP as follows:

DTP(t_i) = |TP - frq(t_i)|        (4)

where frq(t_i) is the frequency of t_i in the training set D and TP is computed on D. The computation has a time complexity of O(V).

DTP supervised. In the case of supervised term selection, DTP uses the category information:

DTP(t_i, c_j) = |TP_j - frq_j(t_i)|        (5)

where frq_j(t_i) is the frequency of t_i in D_j, the set of training documents belonging to category c_j, and TP_j is computed on D_j. As the globalization technique we have chosen DTP_max because, in preliminary experiments [8], it consistently outperformed other globalization techniques. The computation includes the calculation of TP for each category and has a time complexity of O(VM).

DTP (whose use as an FS function was first proposed in [8]) selects the terms closest to TP. In FS we measure how close the frequency of a term is to TP. Thus the terms with the highest DTP values are the most distant from TP; since we are interested in the least distant terms, we select the terms for which DTP is lowest.
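The following sketch implements Eqs. (3) to (5). Tokenization is assumed to be done already, and the DTP_max globalization shown here scores a term only in the categories where it occurs; that handling of absent terms is our assumption, since the paper does not spell it out.

```python
from collections import Counter
from math import sqrt

def transition_point(freqs):
    """Eq. (3): TP = (sqrt(1 + 8*I1) - 1) / 2, with I1 the number of
    terms that occur exactly once in the collection."""
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (sqrt(1 + 8 * i1) - 1) / 2

def dtp_unsupervised(docs, keep_fraction=0.10):
    """Eq. (4): score terms by |TP - frq(t)| over the whole training set D
    and keep the terms closest to TP (lowest distance first)."""
    freqs = Counter(t for doc in docs for t in doc)
    tp = transition_point(freqs)
    ranked = sorted(freqs, key=lambda t: abs(tp - freqs[t]))
    return ranked[:max(1, int(len(ranked) * keep_fraction))]

def dtp_max_supervised(docs_by_category, keep_fraction=0.10):
    """Eq. (5) with DTP_max globalization: for each category c_j compute
    |TP_j - frq_j(t)|, take the maximum over the categories where the
    term occurs (our assumption), then keep the lowest-scoring terms."""
    score = {}
    for docs in docs_by_category.values():
        freqs = Counter(t for doc in docs for t in doc)
        tp = transition_point(freqs)
        for t, f in freqs.items():
            d = abs(tp - f)
            score[t] = max(score.get(t, d), d)
    ranked = sorted(score, key=score.get)
    return ranked[:max(1, int(len(ranked) * keep_fraction))]
```

Both selectors follow the filtering scheme of Section 2, except that the ranking is ascending: a low DTP score means high importance.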

Our experiments presented in Section 5 show that term selection with DTP improves the performance of traditional classification algorithms such as k-NN and Rocchio.

4 Classifiers and Data

In order to assess the effectiveness of the FS methods we used two classifiers frequently employed as baselines in TC, k-NN [13] and Rocchio [3]; both treat documents as term vectors.

k-NN is based on the categories assigned to the k training documents nearest to the new document. The categories of these neighbors are weighted using the similarity of each neighbor to the new document, where similarity is measured by the cosine between the two document vectors. If a category belongs to multiple neighbors, the sum of the similarity scores of those neighbors is the weight of the category [2],[10],[13].

Rocchio is based on the relevance feedback algorithm originally proposed for information retrieval. The basic idea is to construct a prototype vector for each category using a training set of documents. Given a category, the vectors of documents belonging to the category are given a positive weight, and the vectors of the remaining documents are given a negative weight. Adding these positively and negatively weighted vectors yields the prototype vector of the category. To classify a new document, the cosine between the new document and each prototype vector is computed [6],[10],[13].
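As a sketch of the two baseline classifiers just described, assuming dense tf·idf document vectors in NumPy: the code is ours, and mapping β to the positive weight and α to the negative weight (with the values reported in Section 5) is our reading of the setup in [6].

```python
import numpy as np

def cosine(u, v):
    n = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / n if n else 0.0

def knn_classify(doc_vec, X_train, y_train, k=30):
    """k-NN: cosine-weighted votes of the k most similar training documents."""
    sims = np.array([cosine(doc_vec, x) for x in X_train])
    votes = {}
    for i in np.argsort(sims)[::-1][:k]:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + sims[i]
    return max(votes, key=votes.get)

def rocchio_prototypes(X_train, y_train, categories, beta=16.0, alpha=4.0):
    """One prototype per category: a positively weighted mean of its documents
    minus a negatively weighted mean of all remaining documents."""
    y = np.asarray(y_train)
    return {c: beta * X_train[y == c].mean(axis=0)
               - alpha * X_train[y != c].mean(axis=0)
            for c in categories}

def rocchio_classify(doc_vec, prototypes):
    """Assign the category whose prototype is most cosine-similar."""
    return max(prototypes, key=lambda c: cosine(doc_vec, prototypes[c]))
```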

The texts used in our experiments are Spanish news articles downloaded from the Mexican newspaper La Jornada. We preprocess the texts by removing stopwords, punctuation and numbers, and by stemming the remaining words with a Porter stemmer adapted to Spanish. Term weighting was done by means of the standard tf·idf function [9]. We used a total of 1,449 documents belonging to six different categories (C: Culture, S: Sports, E: Economy, W: World, P: Politics, J: Society & Justice) for training and two test sets (see Table 2). We used a single-label setting, i.e., each document was assigned to exactly one category.

Table 2. Training and testing data

Category               C      S      E      W      P      J
Training data
  No. of documents    104    114    107    127     93     91
  No. of terms      7,205  4,747  3,855  5,922  4,857  4,458
Test set 1
  No. of documents     58     57     69     78     89     56
  No. of terms      5,301  3,333  3,286  4,659  4,708  3,411
Test set 2
  No. of documents     83     65     61     51     90     56
  No. of terms      6,420  3,855  2,831  3,661  4,946  3,822

To evaluate the effectiveness of document classification by each classifier, the standard precision, recall and F1 measures were used. Precision is the number of documents correctly classified divided by the total number of documents classified. Recall is the number of documents correctly classified divided by the total number of documents that should have been classified. The F1 measure combines precision (P) and recall (R) as F1 = 2RP/(R + P). These values can be computed for each individual category first and then averaged over all categories, or they can be computed globally over all categories; these strategies are called macro-averaging and micro-averaging, respectively. As in [10], we evaluate micro-averaged F1.
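A small sketch of the micro-averaged F1 computation just described (names ours): the counts are pooled over all categories before precision and recall are formed.

```python
def micro_f1(counts):
    """counts: per-category triples (correct, classified, should_be_classified).
    Micro-averaging pools the counts over categories, then computes
    P = correct/classified, R = correct/should, F1 = 2RP/(R + P)."""
    correct = sum(c for c, _, _ in counts)
    classified = sum(p for _, p, _ in counts)
    should = sum(s for _, _, s in counts)
    p = correct / classified if classified else 0.0
    r = correct / should if should else 0.0
    return 2 * r * p / (r + p) if (r + p) else 0.0

# Two categories: 50 of 60 and 30 of 40 predictions correct,
# against 70 and 35 gold-standard documents respectively.
print(round(micro_f1([(50, 60, 70), (30, 40, 35)]), 3))   # 0.78
```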

5 Experiments

We performed our FS experiments with both a k-NN classifier (using k = 30) and a Rocchio classifier (with β = 16 and α = 4, as used in [6]). In these experiments we compared three baseline term selection techniques, DF, IG and CHI_max, and two variants of our DTP technique, DTP and DTP_max. Table 3 lists the F1 values obtained for k-NN and Rocchio with the evaluated FS techniques at different percentages of terms (the vocabulary size of the training set is 14,272).

Table 3. Micro-averaged F1 values for k-NN and Rocchio on the test sets

                               k-NN                                  Rocchio
Percent of terms   DF    IG   CHI_max  DTP   DTP_max     DF    IG   CHI_max  DTP   DTP_max
 1                .627  .716   .720   .676    .667      .611  .723   .712   .681    .668
 3                .697  .769   .758   .759    .756      .701  .756   .749   .742    .739
 5                .754  .780   .779   .760    .786      .750  .767   .760   .774    .780
10                .782  .806   .803   .791    .797      .775  .783   .787   .807    .788
15                .802  .811   .801   .807    .799      .782  .801   .793   .811    .791
20                .807  .811   .804   .811    .804      .799  .806   .806   .820    .803
25                .804  .824   .813   .815    .806      .799  .806   .811   .815    .804
50                .809  .813   .803   .814    .806      .807  .807   .815   .829    .811

As seen in Table 3, on both the k-NN and Rocchio tests DTP is superior to DF, and comparable to IG and CHI_max up to term percentages of about 5% and 3% respectively, becoming superior beyond those. These results, obtained under both DTP variants, show that the unsupervised policy performs better than its supervised counterpart. Results published in [12] showed that common terms are often informative, and vice versa. Our results with DTP do not contradict this, for only the terms with extremely low or high frequency are removed, while the terms with medium frequency score highest and are preserved. Another interesting result is that unsupervised DTP, while not using category information from the training set, performs similarly to supervised IG and CHI. In addition, DTP is much easier to compute than IG and CHI.

6 Conclusions

In this paper we have presented a novel term selection method for TC, distance to transition point (DTP), which is based on proximity to the frequency that splits the terms of a text into low- and high-frequency terms, i.e., the transition point (TP). Experiments performed on Spanish texts with two classifiers (k-NN and Rocchio) showed that feature selection by DTP achieves performance superior to document frequency and comparable to information gain and the chi-statistic, three well-known and effective techniques. Remarkably, DTP is a simple and easy-to-compute method. The degree of enhancement our method brings to TC, and its relationship to other methods in the literature, is the subject of future investigation by the authors.

References

1. Booth, A.: A Law of Occurrences for Words of Low Frequency, Information and Control, (1967) 10(4) 386-393.
2. Galavotti, L., Sebastiani, F., Simi, M.: Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization, Proc. of ECDL-00, 4th European Conference on Research and Advanced Technology for Digital Libraries, (2000) 59-68.
3. Joachims, T.: A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, Proc. of ICML-97, 14th Int. Conf. on Machine Learning, (1997) 143-151.
4. John, G.H., Kohavi, R., Pfleger, K.: Irrelevant Features and the Subset Selection Problem, Proc. of ICML-94, 11th Int. Conf. on Machine Learning, (1994) 121-129.
5. Karypis, G., Han, E.H.: Concept Indexing: A Fast Dimensionality Reduction Algorithm with Applications to Document Retrieval & Categorization, Technical Report TR-00-0016, University of Minnesota, (2000).
6. Lewis, D.D., Schapire, R.E., Callan, J.P., Papka, R.: Training Algorithms for Linear Text Classifiers, Proc. of SIGIR-96, 19th ACM Int. Conf. on Research and Development in Information Retrieval, (1996) 298-306.
7. Moyotl, E., Jiménez, H.: An Analysis on Frequency of Terms for Text Categorization, Proc. of SEPLN-04, (2004) 141-146.
8. Moyotl, E., Jiménez, H.: Distancia al Punto de Transición: Un Nuevo Método de Selección de Términos para Categorización de Textos, Tesis de Licenciatura, Facultad de Ciencias de la Computación, BUAP, Puebla, México, (2004).
9. Salton, G., Wong, A., Yang, C.: A Vector Space Model for Automatic Indexing, Communications of the ACM, (1975) 18(11) 613-620.

10. Sebastiani, F.: Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34(1), (2002) 1-47.
11. Urbizagástegui-Alvarado, R.: Las posibilidades de la ley de Zipf en la indización automática, Reporte de la Universidad de California Riverside, (1999).
12. Yang, Y., Pedersen, J.O.: A Comparative Study on Feature Selection in Text Categorization, Proc. of ICML-97, 14th Int. Conf. on Machine Learning, (1997) 412-420.
13. Yang, Y., Liu, X.: A Re-examination of Text Categorization Methods, Proc. of SIGIR-99, 22nd ACM Int. Conf. on Research and Development in Information Retrieval, (1999) 42-49.
14. Zipf, G.K.: Human Behavior and the Principle of Least Effort, Addison-Wesley, (1949).