Large Scale Chinese News Categorization. Peng Wang. Joint work with H. Zhang, B. Xu, H.W. Hao

Similar documents
International Journal of Advanced Research in Computer Science and Software Engineering

Information-Theoretic Feature Selection Algorithms for Text Classification

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

String Vector based KNN for Text Categorization

An Efficient Hash-based Association Rule Mining Approach for Document Clustering

Natural Language Processing

Automated Online News Classification with Personalization

A Bagging Method using Decision Trees in the Role of Base Classifiers

Information Retrieval. (M&S Ch 15)

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

CS229 Final Project: Predicting Expected Response Times

International ejournals

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

Encoding Words into String Vectors for Word Categorization

A hybrid method to categorize HTML documents

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

GRAPHICAL REPRESENTATION OF TEXTUAL DATA USING TEXT CATEGORIZATION SYSTEM

Supervised classification of law area in the legal domain

A Novel Feature Selection Framework for Automatic Web Page Classification

Mining Web Data. Lijun Zhang

arxiv: v1 [cs.mm] 12 Jan 2016

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

Facial Expression Classification with Random Filters Feature Extraction

1 Document Classification [60 points]

Machine Learning. Decision Trees. Manfred Huber

Keyword Extraction by KNN considering Similarity among Features

CS 6320 Natural Language Processing

HANDWRITTEN/PRINTED TEXT SEPARATION USING PSEUDO-LINES FOR CONTEXTUAL RE-LABELING

Feature selection. LING 572 Fei Xia

Feature Selection Methods for an Improved SVM Classifier

Studying the Impact of Text Summarization on Contextual Advertising

Web Information Retrieval using WordNet

Automatic Domain Partitioning for Multi-Domain Learning

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Application of Support Vector Machine Algorithm in Spam Filtering

Multi-label classification using rule-based classifier systems

Performance Evaluation of Various Classification Algorithms

A Survey on Postive and Unlabelled Learning

Feature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News

A FUZZY NAIVE BAYESCLASSIFICATION USING CLASS SPECIFIC FEATURES FOR TEXT CATEGORIZATION

Text classification II CE-324: Modern Information Retrieval Sharif University of Technology

Mining Web Data. Lijun Zhang

Developing Focused Crawlers for Genre Specific Search Engines

A NEW VARIABLES SELECTION AND DIMENSIONALITY REDUCTION TECHNIQUE COUPLED WITH SIMCA METHOD FOR THE CLASSIFICATION OF TEXT DOCUMENTS

Research Article An Ant Colony Optimization Based Feature Selection for Web Page Classification

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

Subject. Dataset. Copy paste feature of the diagram. Importing the dataset. Copy paste feature into the diagram.

IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM

An Empirical Performance Comparison of Machine Learning Methods for Spam Categorization

BUAA AUDR at ImageCLEF 2012 Photo Annotation Task

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

Recognition of Animal Skin Texture Attributes in the Wild. Amey Dharwadker (aap2174) Kai Zhang (kz2213)

Bug or Not? Bug Report Classification using N-Gram IDF

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Region-based Segmentation and Object Detection

Combining Text Embedding and Knowledge Graph Embedding Techniques for Academic Search Engines

What Causes My Test Alarm? Automatic Cause Analysis for Test Alarms in System and Integration Testing

Data preprocessing Functional Programming and Intelligent Algorithms

A Framework for Web Host Quality Detection

Unsupervised Feature Selection for Sparse Data

A Novel PAT-Tree Approach to Chinese Document Clustering

Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach

K236: Basis of Data Science

Contents. Preface to the Second Edition

Multi-Class Segmentation with Relative Location Prior

A Feature Selection Method to Handle Imbalanced Data in Text Classification

Automatic detection of books based on Faster R-CNN

Collaborative Ranking between Supervised and Unsupervised Approaches for Keyphrase Extraction

Data Preprocessing. Slides by: Shree Jaswal

Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge

Chapter 6: Information Retrieval and Web Search. An introduction

1 Introduction. 2 Document classification process. Text mining. Document classification (text categorization) in Python using the scikitlearn

Saudi Journal of Engineering and Technology (SJEAT)

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation

Concept-Based Document Similarity Based on Suffix Tree Document

Reading group on Ontologies and NLP:

TA Section: Problem Set 4

Knowledge Discovery and Data Mining 1 (VO) ( )

Applying Supervised Learning

Application of rough ensemble classifier to web services categorization and focused crawling

Plan for today. CS276B Text Retrieval and Mining Winter Vector spaces and XML. Text-centric XML retrieval. Vector spaces and XML

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai

A Semantic Model for Concept Based Clustering

Using Machine Learning for Classification of Cancer Cells

Impact of Term Weighting Schemes on Document Clustering A Review

Robot localization method based on visual features and their geometric relationship

Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search

Research Article Simple-Random-Sampling-Based Multiclass Text Classification Algorithm

Discovering Advertisement Links by Using URL Text

Machine Learning Techniques for Data Mining

Data Mining Practical Machine Learning Tools and Techniques

International Journal of Advance Engineering and Research Development. A Facebook Profile Based TV Shows and Movies Recommendation System

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Final Project Report CS 5604 Information Storage and Retrieval CLA Team, Fall 2016

Recognition. Topics that we will try to cover:

Homework 3: Wikipedia Clustering Cliff Engle & Antonio Lupher CS 294-1

Rich feature hierarchies for accurate object detection and semantic segmentation

Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach

Transcription:

Large Scale Chinese News Categorization --based on Improved Feature Selection Method Peng Wang Joint work with H. Zhang, B. Xu, H.W. Hao Computational-Brain Research Center Institute of Automation, Chinese Academy of Sciences

Outline Introduction Our Framework Preprocessing Feature selection Machine learning methods Measurements for evaluation Experiments Conclusions

Outline Introduction Our Framework Preprocessing Feature selection Machine learning methods Measurements for evaluation Experiments Conclusions

Introduction Task Definition given a news document and a predefined hierarchy of categories with a depth of 2. the Classification and Code of News in Chinese (CCNC) as the predefined hierarchy of categories. Some samples from CCNC: We are required to provide the IDs of the categories which this document belongs to.

About categories, Introduction (contd.) This hierarchy of categories consists of at most 2 levels of subdivisions, specifically, which includes 24 main entries and 367 entries in the first and the second levels. Text corpus includes about 30,000 news articles. provided by courtesy of the Xinhua News Agency. category annotation in XML format is: may have more than one category ID; with up to 2 category IDs; Required to sort multiple IDs in descending order with respect to their confidence scores.

Outline Introduction Our Framework Preprocessing Feature selection Machine learning methods Measurements for evaluation Experiments Conclusions

Our Framework Our framework based on Feature selection,

Our Framework Our framework based on Feature selection,

Our Framework Preprocessing, word segmentation or stemming; Removing stop-words, prepositions, conjunctions and pronouns; occur in many documents and hold very high DF scores; Contain little useful information for feature representation. Feature selection, Bag of words (BOW) leads to a high dimensional feature space; selects a specific subset of the terms from original feature; remove these irrelevant and redundant words; CHI statistic is employed.

Our Framework Feature selection using CHI, each term is assigned with a score according to CHI function; with higher scores are selected; measures the lack of independence between term and the class, defined as Equation (1),

Our Framework From equ.(1), if term t and class c are independent, the value of it is zero. Otherwise, the larger indicate that the term t is more related to category c. From Table 1, shortcoming of the CHI is that they just count whether a term and a special category co-occurrence in each document, instead of the frequency. the native score may magnify the contributions of terms with lowfrequency in feature representation; propose a measurement of term-goodness for feature selection in Equ. (2),

Our Framework For construct feature dictionary, how many terms reserved? Where L is total number of terms, reserving ratio. Advantages reduce the dimensionality; removing noisy features; avoid over-fitting

Our Framework Feature weight, Tf-idf Machine learning methods, In this task, each text may have more than one category, but the concrete number of category is indeterminate. In this evaluation, we choose softmax regression model to predict a confidence score. Generalized version of logistic regression for probabilistic multiclass problems. (4)

Our Framework Estimate the parameter, the cost function, (5)

Measurements, Measurements for evaluation

Experiments Experimental data The Chinese News articles; 20Newsgroup. Experimental results, The definitions of hierarchical category indicate that the second level category information can deduce that of first level. For concision, we classify the test samples directly at the second level using our framework in this evaluation.

Evaluation Experimental Results on 20NGs, From Table 3, the terms weighting method tf *idf is more robust than tf *Chi. However, the softmax when the category number is small have little merits compared with SVMs.

Thanks For Your Time!