A Weighted Method to Improve the Centroid-based Classifier

Similar documents
Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Cluster Analysis of Electrical Behavior

The Research of Support Vector Machine in Agricultural Data Classification

Classifier Selection Based on Data Complexity Measures *

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET

Hierarchical clustering for gene expression data analysis

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Parallelism for Nested Loops with Non-uniform and Flow Dependences

THE CONDENSED FUZZY K-NEAREST NEIGHBOR RULE BASED ON SAMPLE FUZZY ENTROPY

Support Vector Machines

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

Incremental Learning with Support Vector Machines and Fuzzy Set Theory

Pruning Training Corpus to Speedup Text Classification 1

Human Face Recognition Using Generalized. Kernel Fisher Discriminant

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

An Image Fusion Approach Based on Segmentation Region

Module Management Tool in Software Development Organizations

Load Balancing for Hex-Cell Interconnection Network

A Binarization Algorithm specialized on Document Images and Photos

Network Intrusion Detection Based on PSO-SVM

A New Approach For the Ranking of Fuzzy Sets With Different Heights

Machine Learning: Algorithms and Applications

Journal of Chemical and Pharmaceutical Research, 2014, 6(6): Research Article. A selective ensemble classification method on microarray data

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION

Edge Detection in Noisy Images Using the Support Vector Machines

Machine Learning. Topic 6: Clustering

Clustering Algorithm of Similarity Segmentation based on Point Sorting

CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A new segmentation algorithm for medical volume image based on K-means clustering

Support Vector Machines

Arabic Text Classification Using N-Gram Frequency Statistics A Comparative Study

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

A Modified Median Filter for the Removal of Impulse Noise Based on the Support Vector Machines

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1)

A fast algorithm for color image segmentation

Machine Learning 9. week

A Clustering Algorithm for Chinese Adjectives and Nouns 1

CSCI 5417 Information Retrieval Systems Jim Martin!

An Anti-Noise Text Categorization Method based on Support Vector Machines *

Query Clustering Using a Hybrid Query Similarity Measure

Fast Feature Value Searching for Face Detection

Three supervised learning methods on pen digits character recognition dataset

Available online at Advanced in Control Engineering and Information Science

Discriminative Dictionary Learning with Pairwise Constraints

Recommended Items Rating Prediction based on RBF Neural Network Optimized by PSO Algorithm

Research of Neural Network Classifier Based on FCM and PSO for Breast Cancer Classification

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

A Deflected Grid-based Algorithm for Clustering Analysis

Data Mining: Model Evaluation

Experiments in Text Categorization Using Term Selection by Distance to Transition Point

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

Efficient Text Classification by Weighted Proximal SVM *

Finite Element Analysis of Rubber Sealing Ring Resilience Behavior Qu Jia 1,a, Chen Geng 1,b and Yang Yuwei 2,c

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Yan et al. / J Zhejiang Univ-Sci C (Comput & Electron) in press 1. Improving Naive Bayes classifier by dividing its decision regions *

CLASSIFICATION OF ULTRASONIC SIGNALS

Research Article A High-Order CFS Algorithm for Clustering Big Data

Performance Evaluation of Information Retrieval Systems

Face Recognition Based on SVM and 2DPCA

Clustering of Words Based on Relative Contribution for Text Categorization

Keywords - Wep page classification; bag of words model; topic model; hierarchical classification; Support Vector Machines

Deep Classification in Large-scale Text Hierarchies

Intelligent Information Acquisition for Improved Clustering

A Novel Term_Class Relevance Measure for Text Categorization

Hybridization of Expectation-Maximization and K-Means Algorithms for Better Clustering Performance

High-Boost Mesh Filtering for 3-D Shape Enhancement

A MODIFIED K-NEAREST NEIGHBOR CLASSIFIER TO DEAL WITH UNBALANCED CLASSES

Design of Structure Optimization with APDL

Optimal Design of Nonlinear Fuzzy Model by Means of Independent Fuzzy Scatter Partition

A Notable Swarm Approach to Evolve Neural Network for Classification in Data Mining

Implementation Naïve Bayes Algorithm for Student Classification Based on Graduation Status

Document Representation and Clustering with WordNet Based Similarity Rough Set Model

Optimizing Document Scoring for Query Retrieval

Fast Computation of Shortest Path for Visiting Segments in the Plane

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach

Querying by sketch geographical databases. Yu Han 1, a *

Japanese Dependency Analysis Based on Improved SVM and KNN

Fuzzy Rough Neural Network and Its Application to Feature Selection

Unsupervised Learning

Meta-heuristics for Multidimensional Knapsack Problems

Fuzzy C-Means Initialized by Fixed Threshold Clustering for Improving Image Retrieval

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

BRDPHHC: A Balance RDF Data Partitioning Algorithm based on Hybrid Hierarchical Clustering

Using Ambiguity Measure Feature Selection Algorithm for Support Vector Machine Classifier

Spam Filtering Based on Support Vector Machines with Taguchi Method for Parameter Selection

Parallel matrix-vector multiplication

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields

Under-Sampling Approaches for Improving Prediction of the Minority Class in an Imbalanced Dataset

Issues and Empirical Results for Improving Text Classification

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 1. SSDH: Semi-supervised Deep Hashing for Large Scale Image Retrieval

GA-Based Learning Algorithms to Identify Fuzzy Rules for Fuzzy Neural Networks

Improving Web Image Search using Meta Re-rankers

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK


2016 International Conference on Electrical Engineering and Automation (ICEEA 2016) ISBN: 978-1-60595-407-3

A Weighted Method to Improve the Centroid-based Classifier

Chuan LIU, Wen-yong WANG*, Guang-hui TU, Nan-nan LIU and Yu XING
Department of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
*Corresponding author

Keywords: Text categorization, Centroid-based classifier, Machine learning, Gravitation model.

Abstract. The Centroid-Based Classifier (CBC) is one of the most widely used text classification methods due to its theoretical simplicity and computational efficiency. However, the accuracy of CBC is not satisfactory when it deals with skewed data distributions. In this paper, we propose a new classification model, named the Gravitation Model (GM), to solve the model misfit of CBC. In the proposed model, we give each category a mass factor to indicate its distribution in the vector space, and this factor can be learned from the training data. We provide performance comparisons with CBC and its improved variants based on experiments conducted on twelve real datasets, which show that the proposed gravitation model consistently outperforms CBC. Furthermore, it reaches the same performance as the best centroid-based classifier and is more stable.

Introduction

The Centroid-Based Classifier (CBC) [1,2,3,4,5,6] is one of the most popular text categorization (TC) methods. The basic idea of CBC is that an unlabeled sample should be assigned to a particular class if the similarity of this sample to the centroid of that class is the largest. Compared with other TC methods, CBC is more efficient since its computational complexity is linear in both the training and testing phases; this merit is important for online text classification tasks. Although it has been shown that CBC consistently outperforms other methods such as k-nearest-neighbors, Naive Bayes, and Decision Tree on a wide range of datasets [7], CBC often suffers from model misfit when the data is not well distributed. As is well known, model misfit leads to poor classification performance.
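Before turning to the gravitation model, the plain centroid-based classifier can be summarized in a short sketch. This is a minimal illustration, not the implementation evaluated in the paper: it assumes raw term-count vectors and cosine similarity, and all names are illustrative.

```python
import math
from collections import Counter, defaultdict

def centroid(vectors):
    """Mean of a list of sparse term-weight vectors (dicts)."""
    acc = defaultdict(float)
    for v in vectors:
        for term, w in v.items():
            acc[term] += w
    n = len(vectors)
    return {t: w / n for t, w in acc.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def train(labeled_docs):
    """labeled_docs: list of (label, text); returns {label: centroid vector}."""
    by_class = defaultdict(list)
    for label, text in labeled_docs:
        by_class[label].append(Counter(text.lower().split()))
    return {label: centroid(vs) for label, vs in by_class.items()}

def classify(centroids, text):
    """Assign the class whose centroid is most similar to the document."""
    v = Counter(text.lower().split())
    return max(centroids, key=lambda c: cosine(v, centroids[c]))

# Tiny illustrative corpus.
centroids = train([("sport", "football match goal"), ("sport", "goal score team"),
                   ("tech", "computer code software"), ("tech", "software code bug")])
print(classify(centroids, "goal match team"))  # -> sport
```

Training reduces to one mean per class and prediction to one similarity per class, which is the linear-time behavior the paper highlights.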
To solve the model misfit of CBC, numerous approaches have been proposed, such as the Class-Feature-Centroid classifier (CFC) [3], the Generalized Cluster Centroid based Classifier (GCC) [8], DragPushing (DP) [6], and Large Margin DragPushing (LMDP) [5]. However, the existing variants of CBC focus on methods that aim to obtain good centroids in the construction or training phase. Therefore, they cannot remove the inherent disadvantages of the CBC model. In this paper, a new centroid-based classification model is introduced to overcome the inherent shortcomings (or bias) of CBC on class-imbalanced datasets. In the proposed model, each class is given a mass factor to indicate the data distribution of the corresponding class, and the value of the mass factor is learned from the training set. A new document is then assigned to the class exerting the maximum gravitational force on it. The proposed method is empirically evaluated by comparing it with three frequently used centroid-based methods (i.e., CBC, CFC, and DP) on twelve real datasets in the field of text categorization. The experimental results demonstrate that the proposed method works well on the real datasets and clearly outperforms most of the state-of-the-art centroid-based approaches (e.g., CBC and CFC).

The remainder of this paper is organized as follows: the proposed model is introduced in Section 2, the performance of our model is evaluated in Section 3, and the paper closes with conclusions and future work in Section 4.

The Proposed Gravitation Model

The proposed Gravitation Model (GM) concentrates on the adjustment of the classification hyperplane to reduce the bias inherent in CBC, and it is entirely different from the previous works [3,4,5,6], which obtain the centroids with good initial term weights in the construction phase or modify the position of the centroids during the training phase.

The Motivation of the Gravitation Model

The gravitation model is motivated by Newton's law of universal gravitation, which states that every particle attracts every other particle in the universe with a force. The magnitude of the attractive force F between two objects is

F = G * m_1 * m_2 / r^2,  (1)

where G denotes the gravitational constant, m_1 is the mass of the first object, m_2 is the mass of the second object, and r is the distance between the two objects. This formula indicates that the force is proportional to the product of the masses of the two objects and inversely proportional to the square of the distance between them.

Figure 1. Gravitation equilibrium point S0 between object A and object B.

The Definition of the Gravitation Model

By analogy with universal gravitation, as shown in Fig. 1, object A (or B) can be seen as class A (or B), and object S can be seen as the unknown sample d in the vector space. Hence, the label of sample d is determined by the attractive forces from class A and class B. For example, if the sample d is located to the left of the Lagrange point S0, where the gravitational attraction of A on object S counteracts B's gravitational attraction on object S, then the sample d is classified into category A because F_A > F_B. Conversely, if F_B > F_A, the sample d is classified into category B. Therefore, the classification hyperplane in the gravitation model is no longer the hyperplane on which classes A and B have the same similarity to the unknown sample d, as in CBC, but the Lagrange hyperplane on which classes A and B exert the same attractive force. In the vector space, the Lagrange hyperplane between two classes is defined as follows:

m_A / r_A^2 = m_B / r_B^2,  (2)

where r_A (or r_B) is the distance from sample d to the centroid of class A (or class B). In fact, 1/r_A^2 (or 1/r_B^2) can be regarded as a measure of the similarity between sample d and the centroid of class A (or class B).
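The equilibrium-point idea can be made concrete with a small numeric sketch. The one-dimensional positions and masses below are hypothetical, and the constant G is folded into the masses: samples on one side of S0 feel a stronger pull from class A, samples on the other side from class B.

```python
import math

def force(mass, r):
    """Attractive force of Eq. (1), with G folded into the mass factor."""
    return mass / r ** 2

# Hypothetical 1-D setup: centroid of class A at x = 0, centroid of class B at x = 10.
m_a, m_b = 4.0, 1.0

# The equilibrium point S0 solves m_a / x^2 = m_b / (10 - x)^2, i.e.
# sqrt(m_a) * (10 - x) = sqrt(m_b) * x  =>  x = 10 * sqrt(m_a) / (sqrt(m_a) + sqrt(m_b))
s0 = 10 * math.sqrt(m_a) / (math.sqrt(m_a) + math.sqrt(m_b))
print(round(s0, 2))  # -> 6.67

# A sample to the left of S0 is attracted more strongly by class A, so GM labels it A.
d = 5.0
assert force(m_a, d) > force(m_b, 10 - d)
```

Note that with equal masses S0 sits at the midpoint (the CBC decision boundary); a larger mass factor pushes the boundary away from that class, which is exactly how GM shifts the hyperplane for skewed classes.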
Therefore, the formula above can be rewritten as

m_A * sim(d, c_A) = m_B * sim(d, c_B),  (3)

where sim(d, c_A) (or sim(d, c_B)) is the measure of similarity between sample d and the centroid of class A (or B), and m_A (or m_B) is the unknown mass factor, which can be learned from the training set. The objective of learning the mass factors is to obtain the best Lagrange hyperplane, so that the classification error (i.e., 1 - accuracy) is lowest. In the testing phase, the gravitation model first calculates the attractive force between the unidentified sample d and each class using the following formula:

F(d, c_i) = m_i * sim(d, c_i),  i = 1, 2, ...,  (4)

where m_i denotes the mass factor of class i. Then the class of document d is determined directly by assigning it to the class with the maximum attractive force, that is,

Class(d) = arg max_i F(d, c_i).  (5)

Obviously, the mass factors introduced by the gravitation model are used to indicate the class distribution in the training phase, and they are then applied to determine the category of unidentified samples in the testing phase. In this framework, learning the mass factors leads to a Voronoi partition that directly alleviates the class imbalance problem.

Learning the Mass Factor

The key to the gravitation model is to determine the inherent mass factor of each class so that the model fits the training data well. The idea of learning the mass factors is to use the misclassified samples to guide the update of the factors until the number of misclassified samples falls to zero or a small constant. The factors are learned by updating the corresponding two mass factors for each misclassified sample. For instance, suppose document d, which belongs to class A, is misclassified into class B. To classify d correctly, it is necessary to increase the attractive force of class A on sample d and correspondingly decrease the force between class B and sample d. Thus, the mass factor m_A should be enlarged while m_B should be reduced. The concrete update formulas for m_A and m_B are:

m_A := m_A + α,  (6)
m_B := m_B - α,  (7)

where α is a constant value given by the user, which is used to control the update strength. The corresponding mass factors are updated for each misclassified sample until the difference in classification error between two adjacent iterations is less than a given threshold (denoted by ε), or the number of iterations reaches the maximum.

Experiments

In this section, the performance of the gravitation model is evaluated across a range of text collections by comparing it with CBC, CFC, and DP.

Datasets

The benchmark collections used in the experiments are 20-Newsgroups, Reuters-21578, and ten other datasets from the Karypis Lab.
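The mass-factor learning procedure described above can be sketched as a perceptron-style loop. This is a simplified illustration, not the authors' code: it assumes precomputed per-class similarities, an additive update as in Eqs. (6)-(7), and it stops when the training error reaches zero or the iteration cap is hit (the paper additionally stops once the change in error falls below ε).

```python
def learn_mass_factors(samples, classes, alpha=0.01, max_iter=50):
    """Error-driven learning of per-class mass factors (Eqs. 6-7).

    samples: list of (true_label, sims), where sims maps each class label
    to sim(d, centroid of that class), precomputed for document d.
    """
    mass = {c: 1.0 for c in classes}
    for _ in range(max_iter):
        errors = 0
        for true_label, sims in samples:
            # Eqs. (4)-(5): predict the class with the largest m_i * sim(d, c_i).
            pred = max(classes, key=lambda c: mass[c] * sims[c])
            if pred != true_label:
                errors += 1
                mass[true_label] += alpha  # Eq. (6): strengthen the true class
                mass[pred] -= alpha        # Eq. (7): weaken the wrongly chosen class
        if errors == 0:
            break
    return mass

# Toy skewed example: with equal masses the first sample is misclassified as "b".
classes = ["a", "b"]
samples = [("a", {"a": 0.50, "b": 0.55}),
           ("b", {"a": 0.20, "b": 0.60})]
mass = learn_mass_factors(samples, classes)
pred = max(classes, key=lambda c: mass[c] * samples[0][1][c])
print(pred)  # -> a
```

After a few passes the mass of class "a" grows enough to move the Lagrange hyperplane past the misclassified sample, while the correctly classified sample stays on the right side.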
They are concisely introduced as follows:

20-Newsgroups --- a benchmark corpus typically used in research on text categorization (and text clustering). This corpus consists of 19,997 articles organized into 20 different categories, and it is highly balanced since each category has nearly 1,000 texts.

Reuters --- the ModApte split of Reuters-21578 with 90 categories, which contains 11,406 texts, is used in the experiments. The instance distribution over the 90 categories is highly imbalanced.

Datasets from the Karypis Lab --- ten datasets, including cacmcisi, classic, fbis, hitech, new3, ohscal, re0, re1, tr12, and tr23, are picked out from the Karypis datasets. A detailed description of each collection can be found in [9].

Experiment Design

A five-fold cross-validation scheme is applied to evaluate the performance of the algorithms. With respect to CFC, the denormalized prototype vector is used for prediction, and the parameter b is fixed to 1.18. Note that the cross-validation scheme was not applied in the evaluation of CFC in [3] (CFC was trained on the whole dataset and tested on three quarters of all samples). Thus, in these experiments, CFC is investigated in two ways, i.e., with five-fold cross-validation and with the method of [3], to give more reasonable results. For conciseness of reference, "CFC" is used to denote the original CFC as in [3], while "CFC-CV" denotes CFC evaluated with five-fold cross-validation. Regarding DP, the learning rate is set to 0.01, and the maximum number of iterations equals 50. For the gravitation model, the parameters α and ε are set to 0.0001, and the maximum number of iterations is also set to 50 on all corpora.

Experimental Results and Analyses

The overall performance comparisons in microF1 and macroF1 are listed in Tables 1 and 2. The maximum value in each row is highlighted in bold, without considering the performance of CFC, since the evaluation method of CFC differs from that of the other algorithms. From Tables 1 and 2, the following conclusions can be drawn.

GM consistently outperforms CBC in microF1 and macroF1. The microF1 of GM is respectively 9.0%, 5.2%, 4.6%, 4.3%, and 3.8% higher than CBC on Reuters, cacmcisi, classic, re0, and re1, and is slightly better than CBC on the remaining datasets. Meanwhile, the macroF1 of GM beats CBC by 6.48%, 3.9%, and 2.8% on Reuters, cacmcisi, and classic, respectively. On the rest of the datasets, the macroF1 of GM also obtains a small increase compared with CBC.

Table 1. The comparison of different classifiers in microF1.

Dataset     GM     CBC    CFC-CV  DP     CFC
Newsgroups  0.887  0.886  0.773   0.899  0.968
Reuters     0.874  0.783  0.705   0.893  0.95
cacmcisi    0.927  0.875  0.910   0.944  0.998
classic     0.930  0.884  0.811   0.948  0.887
fbis        0.80   0.794  0.719   0.800  0.901
hitech      0.710  0.704  0.53    0.690  0.969
new3        0.799  0.779  0.570   0.81   0.984
ohscal      0.773  0.761  0.549   0.773  0.934
re0         0.805  0.762  0.61    0.815  0.919
re1         0.853  0.815  0.680   0.849  0.978
tr12        0.908  0.899  0.674   0.908  1.000
tr23        0.814  0.790  0.637   0.859  0.987

Table 2. The comparison of different classifiers in macroF1.
Dataset     GM     CBC    CFC-CV  DP     CFC
Newsgroups  0.883  0.883  0.765   0.899  0.97
Reuters     0.780  0.716  0.571   0.764  0.963
cacmcisi    0.921  0.882  0.91    0.938  0.998
classic     0.936  0.908  0.856   0.951  0.91
fbis        0.780  0.764  0.654   0.78   0.907
hitech      0.677  0.670  0.473   0.657  0.971
new3        0.809  0.797  0.603   0.83   0.988
ohscal      0.763  0.758  0.535   0.763  0.933
re0         0.794  0.78   0.440   0.800  0.93
re1         0.787  0.787  0.539   0.784  0.978
tr12        0.908  0.900  0.658   0.906  1.000
tr23        0.807  0.793  0.581   0.847  0.99

CFC has the best performance, while CFC-CV has the worst performance compared with the other algorithms on most of the datasets. The difference between CFC and CFC-CV is due to the fact that CFC uses the whole dataset to calculate the prototype vectors while CFC-CV only uses the training set. The microF1 and macroF1 of CFC-CV are lower than those of CBC on all datasets except cacmcisi. On the contrary, the microF1 and macroF1 of CFC have an overwhelming advantage over the others on eleven datasets. This is therefore also an indication that CFC has low generalization ability (i.e., it overfits) on the training data.

DP is the best centroid-based classifier, and GM achieves approximately the same performance as DP on most of the datasets. As shown in Table 1, GM and DP obtain the same microF1, which is the best performance, on ohscal and tr12. Except for CFC, GM obtains the best performance on five out of twelve datasets, while DP achieves it on nine out of twelve. A similar trend can be seen in Table 2, where GM remains the leader on five datasets, and DP takes the crown on eight. From Tables 1 and 2, it should be appreciated that GM is always a very close runner-up when DP is the winner. However, it is also worth noting that DP performs worse than CBC on some of the datasets, such as hitech and re1, which does not happen with GM. Thus, DP cannot reduce the inherent bias of CBC in some particular situations. GM undoubtedly proves itself to be more stable than DP.

Conclusions and Future Works

In this paper, we proposed a novel centroid classification model motivated by Newton's law of universal gravitation. To calculate the gravitational force, GM gives each category a mass factor to indicate the sample distribution of this category in the vector space. In particular, GM can fit the dataset well in the case of a skewed distribution. An unlabeled sample is assigned to the class with the largest gravitational force on it. The experimental results show that the proposed gravitation model is effective and efficient due to its linear training and testing time.

Acknowledgement

This work has been supported by the MoE-CMCC (Ministry of Education of China - China Mobile Communications Corporation) Joint Science Fund under grant 20130661.

References

[1] Takçı H., Güngör T. A high performance centroid-based classification approach for language identification [J]. Pattern Recognition Letters, 2012, 33(16): 2077-2084.

[2] Dandan W., Qingcai C., Xiaolong W. A Framework of Centroid-Based Methods for Text Categorization [J]. IEICE Transactions on Information and Systems, 2014, 97(2): 245-254.
[3] Guan H., Zhou J., Guo M. A class-feature-centroid classifier for text categorization [C]//Proceedings of the 18th International Conference on World Wide Web, 2009: 201-210.

[4] Lertnattee V., Theeramunkong T. Class normalization in centroid-based text categorization [J]. Information Sciences, 2006, 176(12): 1712-1738.

[5] Tan S. Large margin DragPushing strategy for centroid text categorization [J]. Expert Systems with Applications, 2007, 33(1): 215-220.

[6] Tan S. An improved centroid classifier for text categorization [J]. Expert Systems with Applications, 2008, 35(1): 279-285.

[7] Han E.H.S., Karypis G. Centroid-based document classification: Analysis and experimental results [C]//European Conference on Principles of Data Mining and Knowledge Discovery. Springer Berlin Heidelberg, 2000: 424-431.

[8] Pang G., Jiang S. A generalized cluster centroid based classifier for text categorization [J]. Information Processing & Management, 2013, 49(2): 576-586.

[9] Zhao Y., Karypis G., Du D.Z. Criterion functions for document clustering [R]. Technical Report, 2005.