A Modular k-nearest Neighbor Classification Method for Massively Parallel Text Categorization

Hai Zhao and Bao-Liang Lu
Department of Computer Science and Engineering, Shanghai Jiao Tong University,
1954 Huashan Rd., Shanghai 200030, China
{zhaohai, blu}@cs.sjtu.edu.cn

Abstract. This paper presents a Min-Max modular k-nearest neighbor (M3-k-NN) classification method for massively parallel text categorization. The basic idea behind the method is to decompose a large-scale text categorization problem into a number of smaller two-class subproblems and to combine the individual modular k-NN classifiers trained on these subproblems into an M3-k-NN classifier. Our text categorization experiments demonstrate that M3-k-NN is much faster than conventional k-NN, while its classification accuracy is slightly better than that of conventional k-NN. In addition, M3-k-NN is closely related to a higher-order k-NN algorithm, which lends some theoretical support to its reliability.

1 Introduction

Among categorization algorithms based on the vector space model, the k-nearest neighbor (k-NN) method is widely regarded as one of the best performers. The method is concise and easy to use: it simply assigns a test sample to the class that receives the most support among its k nearest training samples. Because of these advantages, k-NN has been widely used in text categorization, and many researchers have worked on improving it ([2], [5], [7]), usually by exploiting additional information to improve the final classification result. Parallel techniques, however, have so far been developed mainly for traditional categorization in high-end applications. Our research goal is to develop faster and more efficient methods for large-scale text categorization through direct modular classification, without reducing classification accuracy. We therefore introduce a modular k-NN algorithm, M3-k-NN. Many efficient methods for task decomposition and the combination of modular classifiers have been proposed in the past decade ([1], [3], [6]), and their effectiveness has been studied and proved. Thus far, however, no divide-and-conquer method has been put forward for the k-NN algorithm.
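As a point of reference for what follows, here is a minimal sketch of the plain k-NN rule described above. It is our own illustration, not the authors' implementation, and the function name is ours:

    import numpy as np

    def knn_predict(X_train, y_train, x, k):
        """Assign x to the class most supported among its k nearest training samples."""
        dists = np.linalg.norm(X_train - x, axis=1)     # distance from x to every training sample
        nearest = y_train[np.argsort(dists)[:k]]        # labels of the k nearest samples
        labels, counts = np.unique(nearest, return_counts=True)
        return labels[np.argmax(counts)]                # majority vote among the k neighbors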

The basic idea of our method is to apply the k-NN algorithm within the combined framework of the Min-Max Modular (M3) neural network model ([6]) and the Round Robin (R3) learning algorithm ([3]). Our experiments demonstrate that, in a parallel setting, R3-k-NN or R3-M3-k-NN achieves recognition precision similar to that of traditional k-NN, and sometimes even better. The flexibility of this modular combination method also makes it well suited to large-scale parallel processing on grids or large cluster systems.

2 Algorithm Description

We introduce the modular algorithm M3-k-NN and one of its revised versions, R3-M3-k-NN. The algorithm consists of two phases.

In the first phase, we simply use one-against-one task decomposition. Suppose the original classification problem has m classes; we then obtain m(m-1)/2 independent base k-NN classifiers, where the training samples of each base classifier come from two different classes. To combine their results, we consider the following two module combination policies:

P1 (M3-k-NN voting): if the outputs of m-1 base k-NN classifiers support the same class, that class is taken as the classification result.
P2 (R3-M3-k-NN voting): the result is the class supported by the largest number of base k-NN classifiers.

Policies P1 and P2 are exactly the voting procedures used in [6] and [3], respectively, and policy P2 is also one of the voting strategies used for multi-class support vector machines.

In the second phase, we further divide each two-class subproblem from the first phase into a number of smaller two-class subproblems. The decomposition partitions the two training sample sets of a two-class problem into m1 and m2 smaller sets, respectively, which yields m1 × m2 smaller two-class subproblems. Let the output codes of the two class IDs be 1 and 0. We define all base classifiers learning from the same training subset of class 1 as a group. The M3 combination of all base classifiers then proceeds in two stages: first, the Min rule is applied within each group to produce a group output; second, the Max rule is applied to the outputs of all groups to produce the final output.

3 Experiments

We use the Yomiuri News Corpus for this study. The full collection contains 2,190,512 documents from the years 1987 to 2001. Two subsets of the corpus, covering 10 classes, are used in the experiments. The first subset has 5,865 training samples and 2,420 test samples; the second has 37,938 training samples and 16,234 test samples.
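Before turning to the results, the first-phase decomposition and the two voting policies of Section 2 can be sketched as follows. This is our own minimal illustration under assumed helper names (build_pairwise_sets, vote_p1, vote_p2), not the authors' code; the pairwise predictions would come from base k-NN classifiers trained on the corresponding two-class sets:

    from itertools import combinations
    import numpy as np

    def build_pairwise_sets(X, y):
        """Phase 1: one-against-one decomposition into m(m-1)/2 two-class training sets."""
        subsets = {}
        for i, j in combinations(np.unique(y), 2):
            mask = (y == i) | (y == j)
            subsets[(i, j)] = (X[mask], y[mask])
        return subsets

    def vote_p1(pair_outputs, classes):
        """P1 (M3-k-NN voting): a class wins only if all m-1 of its base classifiers support it."""
        counts = {c: list(pair_outputs.values()).count(c) for c in classes}
        winners = [c for c in classes if counts[c] == len(classes) - 1]
        return winners[0] if winners else None   # the paper does not spell out the no-winner case

    def vote_p2(pair_outputs, classes):
        """P2 (R3-M3-k-NN voting): the class supported by the most base classifiers."""
        return max(classes, key=lambda c: list(pair_outputs.values()).count(c))

    # Hypothetical pairwise predictions for a 3-class problem:
    outputs = {(0, 1): 0, (0, 2): 0, (1, 2): 1}
    print(vote_p1(outputs, [0, 1, 2]))   # -> 0 (class 0 wins both of its pairwise problems)
    print(vote_p2(outputs, [0, 1, 2]))   # -> 0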

A χ² statistical method ([7]) was used to preprocess the documents after morphological analysis with ChaSen. The number of features is 500. The goal of the experiments is to compare the recognition accuracy of k-NN and M3-k-NN as the value of k varies, so we do not apply any optimization techniques to increase absolute accuracy. All simulations were performed on a PC with a 2.4 GHz Pentium 4 CPU and 512 MB RAM.

3.1 M3-k-NN, R3-M3-k-NN and k-NN: Accuracy on Training Set 1

We divide the training samples of each class into modules of size 2,000; if the remaining samples of a class number fewer than 2,000, they are put into an additional module. The value of k varies from 3 to 20, and the experimental results are shown in Fig. 1. As k increases, the accuracy of both M3-k-NN and R3-M3-k-NN first rises and then falls, and it changes more rapidly than that of k-NN; in fact, both M3-k-NN and R3-M3-k-NN reach their peak accuracy earlier than k-NN does. Over the whole range of k, however, M3-k-NN, R3-M3-k-NN and k-NN behave in a broadly similar manner.

[Fig. 1. M3-k-NN, R3-M3-k-NN and k-NN: classification accuracy (%) on Training Set 1 as a function of k.]

3.2 M3-k-NN and k-NN: Accuracy on Training Set 2

The following experiment uses training set 2; here we compare only k-NN with M3-k-NN. The result is shown in Fig. 2, from which we can see that M3-k-NN again reaches its peak accuracy earlier than k-NN does.

[Fig. 2. M3-k-NN and k-NN: classification accuracy (%) on Training Set 2 as a function of k.]
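To make the second phase concrete: the training samples of each class are cut into modules (at most 2,000 per module, as in Section 3.1), each pair of modules from the two classes trains one base k-NN classifier, and the base outputs are combined with the Min and Max rules of Section 2. The sketch below is our own illustration; the function names and the use of 0/1 outputs are assumptions:

    def split_into_modules(X, chunk=2000):
        """Cut one class's training samples into modules of at most `chunk` samples."""
        return [X[i:i + chunk] for i in range(0, len(X), chunk)]

    def min_max_combine(outputs):
        """Min-Max combination for one two-class subproblem.

        outputs[i][j] is the output (1 for class 1, 0 for class 0) of the base classifier
        trained on module i of class 1 and module j of class 0; classifiers that share
        module i of class 1 form one group."""
        group_outputs = [min(row) for row in outputs]   # Min rule within each group
        return max(group_outputs)                       # Max rule across all groups

With real-valued base outputs (for example, the fraction of the k neighbors falling in class 1), the same Min and Max rules apply unchanged.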

3.3 Comparison of Running Time

Table 1 compares the classification time of the two algorithms at each value of k. All experiments were run on a PC under Windows XP, so we estimate the parallel performance of M3-k-NN by dividing its total running time by the number of modules. It is easy to see that the classification time of M3-k-NN in the parallel model is much less than that of traditional k-NN.

Table 1. Comparison of classification time (ms) between k-NN and M3-k-NN

             Training Set 1                            Training Set 2
 k     k-NN    M3-k-NN    M3-k-NN             k-NN      M3-k-NN    M3-k-NN
               (single)   (per module)                  (single)   (per module)
 3     13672   68569      1524                564219    2781059    61801
 5     14406   72428      1610                559344    2816960    62599
 7     15094   75881      1686                573891    2873521    63856
 9     14797   74217      1649                553125    2912828    64730
11     14828   74507      1656                573625    3348766    74417
13     15438   78597      1747                570875    3207375    71275
15     15407   77900      1731                584172    3737203    83049
17     15188   75951      1688                598421    3776235    83916
19     15766   79188      1760                610797    3824140    84981

4 Theoretical Analysis of the Algorithm

Parts of our algorithm have been studied in [1] and [3], which show that in the first phase described in Section 2 the load of a modular algorithm is lighter than that of the same algorithm without modularization. This is an outstanding merit of all one-against-one modularization methods. For an m-class problem, suppose the total number of training samples is T. If one visit to a training sample in the k-NN algorithm is taken as the basic unit of time complexity, then the complexity of the k-NN algorithm is about kT. In the first phase of M3-k-NN, the complexity of each base k-NN classifier is close to 2kT/m. Although the total complexity under serial processing is (m-1)kT, the complexity under parallel processing is only 2kT/m. This means that the processing time of M3-k-NN decreases as long as the number of classes and the number of processing units grow together. If the second phase of the algorithm is also considered, the parallel time complexity decreases further; in the extreme case it can be 2kT/(m·m1·m2).

We now give a simple explanation of why M3-k-NN and k-NN achieve similar accuracy under the first task decomposition. For an original m-class classification problem, M3-k-NN has m(m-1)/2 subtasks; the base k-NN classifier trained on samples of classes i and j (1 ≤ i, j ≤ m) has its output denoted by X_ij. Consider the composition of the k nearest neighbors of a test sample when the original k-NN algorithm is applied to all m-class training samples. We claim that these k nearest neighbors must come from the union of the k-nearest-neighbor sets produced for the same test sample by the base k-NN classifiers of M3-k-NN. Suppose, to the contrary, that at least one of them, say a sample of class j, lies outside this union. If the search for the k nearest neighbors is restricted to classes i and j for some i, the candidates come only from these two classes, a subset of the full training set, so this sample must still be among the k nearest neighbors of that base classifier, which contradicts our assumption. Thus the k nearest neighbors over the m classes form a subset of the union of the k-nearest-neighbor sets of the base classifiers in M3-k-NN. This result suggests a fixed containment relationship between the voting set of k-NN and the voting sets of M3-k-NN. In particular, let k = 1: the nearest neighbor over all m classes is then also the nearest neighbor in every pairwise problem involving its class, so the corresponding base classifiers all output that class and M3-1-NN is equivalent to 1-NN. The reason M3-k-NN reaches its peak accuracy faster than k-NN is that fewer samples are processed in each module of M3-k-NN, so increasing the value of k has a more pronounced effect on M3-k-NN than on k-NN.
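The containment property claimed above is easy to check empirically. The following sketch, our own illustration on synthetic data, verifies that the k nearest neighbors drawn from the full m-class training set always lie in the union of the pairwise k-nearest-neighbor sets:

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(0)
    m, n_per_class, d, k = 10, 200, 20, 5
    X = rng.normal(size=(m * n_per_class, d))
    y = np.repeat(np.arange(m), n_per_class)
    x = rng.normal(size=d)                                  # a test sample

    def knn_indices(idx, k):
        """Indices (into the full set) of the k training samples in `idx` nearest to x."""
        dists = np.linalg.norm(X[idx] - x, axis=1)
        return set(np.asarray(idx)[np.argsort(dists)[:k]])

    global_nn = knn_indices(np.arange(len(X)), k)           # k-NN over all m classes
    pairwise_union = set()
    for i, j in combinations(range(m), 2):                  # k-NN of each base (i, j) classifier
        pairwise_union |= knn_indices(np.where((y == i) | (y == j))[0], k)

    assert global_nn <= pairwise_union                      # the containment argued in Section 4
    print("global k-NN set is contained in the union of pairwise k-NN sets")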

In [4], another kind of k-NN classifier combination has also been considered. The main idea of that paper is to combine k-NN classifiers built on different feature combinations, with every classifier applied to the whole training text set, whereas our method cuts the whole training set into different parts and then performs individual k-NN classification on each part.

5 Conclusions

We have presented a modular k-NN method, M3-k-NN, and its revised version R3-M3-k-NN, and applied them to large-scale text categorization. Our experiments show that M3-k-NN admits a flexible modularization of k-NN classifiers that enables fully parallel processing without reducing classification accuracy. Moreover, as the value of k increases, M3-k-NN reaches its peak classification accuracy faster than k-NN does. In addition, our analysis relates M3-k-NN to a higher-order k-NN algorithm, which provides some theoretical support for the reliability of M3-k-NN.

Acknowledgements

The authors would like to thank Junjie Zhang for his programming. This research is partially supported by the National Natural Science Foundation of China under Grant No. 60375022.

References

1. Allwein, E.L., Schapire, R.E., Singer, Y.: Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers. Journal of Machine Learning Research, Vol. 1 (2000) 113-141
2. Chiang, J.-H., Chen, Y.-C.: Hierarchical Fuzzy-KNN Networks for News Documents Categorization. The 10th IEEE International Conference on Fuzzy Systems, Vol. 2 (2001) 720-723
3. Fürnkranz, J.: Round Robin Classification. Journal of Machine Learning Research, Vol. 2 (2002) 721-747
4. Han, H., Yang, J.Y., Hu, Z.S.: Combination of KNN Classifiers Based on Multiple Correspondence Analysis. Information and Control, Vol. 28 (1999) 350-356
5. Lim, H.S.: An Improved KNN Learning Based Korean Text Classifier with Heuristic Information. ICONIP '02, Vol. 2 (2002) 731-735
6. Lu, B.L., Ito, M.: Task Decomposition and Module Combination Based on Class Relations: A Modular Neural Network for Pattern Classification. IEEE Transactions on Neural Networks, Vol. 10 (1999) 1244-1256
7. Yang, Y., Pedersen, J.O.: A Comparative Study on Feature Selection in Text Categorization. ICML '97 (1997) 412-420