Using Gini-index for Feature Weighting in Text Categorization
Journal of Computational Information Systems 9: 14 (2013)

Using Gini-index for Feature Weighting in Text Categorization

Weidong ZHU 1, Yongmin LIN 2

1 School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
2 College of Economics and Management, Hebei Polytechnic University, Tangshan, China

Abstract

With the rapid development of the World Wide Web, text categorization has come to play an important role in organizing and processing large amounts of text data. As a simple, straightforward, and fast feature weighting method, TF-IDF is widely used in document classification. However, this method simply assumes that low-frequency words are important and high-frequency words are unimportant, which may not reflect the actual usefulness of the words and decreases the precision of classification. In this paper, a Gini-index based feature weighting method is presented which solves this problem. Experiments show that the TF-GINI method has better classification performance.

Keywords: Text Categorization; Feature Selection; Gini-index; Feature Weighting; VSM

1 Introduction

Automatic text classification is a supervised learning task: it learns from a training document set with predefined class labels and assigns a class label to each new document. With the rapid development of network technology and digital libraries, the volume of online documents grows quickly, and automatic text classification has become a key technique for processing and organizing high-volume document data. Existing text classification methods are based on statistical theory and machine learning. Well-known methods include Support Vector Machines (SVM), Naive Bayes, k-Nearest Neighbor (kNN), Neural Networks (NNet), Boosting, and Linear Least Squares Fitting (LLSF). Most classifiers use the vector space model to describe a document: the document is treated as a vector in the feature space.
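As a concrete sketch of this vector-space representation (our own illustration, not the authors' code), a document can be mapped to a vector of term frequencies over a fixed set of selected feature terms:

```python
from collections import Counter

def to_vector(tokens, features):
    # Represent a document as a vector over the selected feature terms.
    # Here the coordinate values are raw term frequencies; weighting
    # schemes such as TF-IDF or TF-GINI refine these values.
    counts = Counter(tokens)
    return [counts[t] for t in features]

features = ["price", "market", "trade"]   # hypothetical selected features
doc = "the market price rose as trade in the market grew".split()
print(to_vector(doc, features))           # [1, 2, 1]
```

The rest of the paper is concerned with how those coordinate values should be computed.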
The measurement in its coordinate system uses the TF-IDF value proposed by Salton. Text frequency (TF) is the number of times a word appears in the document. The inverse document frequency is IDF = log(N/Nt), where t is the word, N is the total number of documents in the training set, and Nt is the number of documents containing t. Using the product of text frequency (TF) and inverse document frequency (IDF) as the value in the vector coordinate system has advantages such as simplicity, intuitiveness, and fast processing speed; thus it has been widely adopted in document categorization.

(Corresponding author. Email address: wdzhu@bjtu.edu.cn (Weidong ZHU). Copyright 2013 Binary Information Press. DOI: /jcis6629. July 15, 2013.)

However, the TF-IDF [1-3]
method simply assumes that low-frequency words are important and high-frequency words are unimportant, which may not reflect the actual usefulness of the words and decreases the accuracy of classification. This paper presents a different feature weighting solution (TF-GINI) based on the Gini-index: according to the distribution of classes among the documents containing a feature, the Gini-index of the feature is calculated, and the feature is then weighted by the product of its text frequency (TF) and its Gini-index. This fully considers the feature's capability to distinguish between classes without increasing the computational complexity. Comparative experiments on the Reuters-21578 corpus and on the Chinese corpus from the International Database Center, Department of Computer Information and Technology, Fudan University, show that the TF-GINI method has better classification performance than the TF-IDF method without increasing the time complexity. Section 2 of this paper discusses the defect of the TF-IDF weighting solution. Section 3 addresses the Gini-index based TF-GINI feature weighting solution. Section 4 presents the experimental results and analysis. The last section gives the conclusion and prospects for further study.

2 Analysis of the TF-IDF Weighting Solution

The vector space model is currently one of the simplest and most efficient text representation models. Its basic scheme is as follows: a natural language document is represented as D = D(t1, w1; t2, w2; ...; tN, wN), where ti is a feature selected from the document D and wi is the weight of that feature, 1 <= i <= N. To simplify the analysis, the order of the tk in the document is usually not considered, and the tk are assumed to be distinct, i.e. there is no repetition.
Now t1, t2, ..., tN form an N-dimensional coordinate system, and w1, w2, ..., wN are the corresponding coordinate values; therefore D(w1, w2, ..., wN) is a vector in this N-dimensional coordinate system. The measurement of the coordinate system adopts the TF-IDF scheme presented by Salton and described above. LU Yuchang analyzed the two basic assumptions of the TF-IDF scheme in the literature [3]:

(1) If a word shows up many times in a document, it also appears many times in other documents of the same class, and vice versa. Thus, using the text frequency TF as part of the measurement to reflect same-class features is acceptable.

(2) If a word shows up less frequently, its capability to distinguish between classes is stronger. Thus, the concept of inverse document frequency IDF is introduced, and the product of TF and IDF is used as the measurement of the feature space coordinate system.

In the literature [3], from the angle of word weighting and vector rotation, it was shown that the simple construction of IDF cannot effectively reflect the usefulness of a word. That work proposed removing P(W) from the formulas of Information Gain and Weight of Evidence for Text, weighting the words accordingly, and proved the effectiveness of the improvement experimentally. In the literature [2], Thorsten, using probability theory, analyzed the product of TF and IDF as the measurement of the feature space coordinate system, pointed out that it may not achieve high classification accuracy, and proposed a classification model between the traditional TF-IDF and Naive Bayes models. From the angle of the usefulness of features for classification, we found that the weighting of
TF-IDF may assign a bigger weight to a rare word without considering its class distribution characteristics. Such rare words may lead to invalid classification. A simple case study reveals this defect of TF-IDF. Consider the following conditions: the total number of documents in the training set is 300, of which 100 belong to class A and the remaining 200 belong to class B. Words t1 and t2 show up only in class-B documents, with Nt1 = 200 and Nt2 = 100. Document D belongs to class B, both t1 and t2 show up in D, and TF(t1) = TF(t2). Weighting these two words with the TF-IDF method gives TF(t1) * log(N/Nt1) < TF(t2) * log(N/Nt2): word t2 receives the higher TF-IDF value due to its rareness. However, under these conditions t1 obviously has the stronger classification capability, since it appears in every class-B document, and contributes more to the classification. TF-IDF simply uses the inverse document frequency to weight the feature without considering its class distribution. This is the main reason for low classification accuracy after weighting.

3 Gini-index Based Text Feature Weighting Method

The Gini-index is an impurity-based splitting criterion. It is applicable to categorical, binary, and continuous numeric fields. It was proposed by Breiman in 1984 and has been widely used in algorithms such as CART, SLIQ, SPRINT, and the decision tree of Intelligent Miner (IBM's data mining tool), achieving fairly good classification accuracy.

3.1 Gini-index principle

The specific formulation is as follows. Suppose that S is a collection of s data samples whose class label attribute has m distinct values defining m classes Ci (i = 1, 2, ..., m). According to the class label attribute value, S can be divided into m subsets Si (i = 1, 2, ..., m).
If Si is the subset of samples belonging to class Ci and si is the number of samples in Si, then the Gini-index of the set S is

    Gini(S) = 1 - Σ_{i=1}^{m} Pi^2    (1)

where Pi is the probability that an arbitrary sample belongs to Ci, estimated by si/s. Gini(S) reaches its minimum of 0 when all records in the collection belong to the same class, indicating that the maximum useful information can be obtained. Gini(S) reaches its maximum when the samples are uniformly distributed over the classes, indicating that the minimum useful information is obtained. This original form of the Gini-index measures the impurity of an attribute with respect to categorization: the smaller the value, i.e. the lesser the impurity, the better the attribute. Conversely,

    Gini'(S) = Σ_{i=1}^{m} Pi^2    (2)

measures the purity of an attribute with respect to categorization: the bigger the value, the better the purity and the better the attribute.
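The case study of Section 2 and the two forms of the Gini-index can be checked numerically. The sketch below is our own illustration (natural logarithm is assumed for IDF; the paper does not fix a base):

```python
import math

def tf_idf(tf, n, nt):
    # TF-IDF weight: TF(t) * log(N / Nt).
    return tf * math.log(n / nt)

def gini_impurity(probs):
    # Formula (1): Gini(S) = 1 - sum_i Pi^2.
    return 1.0 - sum(p ** 2 for p in probs)

def gini_purity(probs):
    # Formula (2): sum_i Pi^2; larger means purer.
    return sum(p ** 2 for p in probs)

# Case study from Section 2: N = 300, Nt1 = 200, Nt2 = 100, equal TF.
w1, w2 = tf_idf(1, 300, 200), tf_idf(1, 300, 100)
assert w1 < w2   # TF-IDF gives the rarer t2 the larger weight,

# yet both t1 and t2 occur only in class-B documents, so both are
# maximally pure for classification purposes:
assert gini_purity([0.0, 1.0]) == 1.0
# A uniform class distribution gives maximum impurity (0.5 for 2 classes):
assert abs(gini_impurity([0.5, 0.5]) - 0.5) < 1e-12
```

This makes the defect concrete: the inverse document frequency ranks t1 and t2 differently even though their class purity is identical.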
3.2 Gini-index based text feature weighting solution

The Gini-index is an excellent purity measure for a set. The usefulness of a feature for classification can be measured by the feature's purity, i.e. the feature should be as pure as possible: if all the documents containing a feature belong to the same class, the feature is pure [4]. Thus, we adopt the purity of the feature in place of the inverse document frequency and present the TF-GINI weighting scheme. Specifically, after text feature selection, we compute the probability of occurrence of each class within the set of documents containing the feature t, then calculate the Gini-index according to the formula below:

    Gini(t) = Σ_{i=1}^{m} P(Ci|t)^2    (3)

The formula for weighting feature tk with TF-GINI, including normalization, is

    wik = tfik * Gini(tk) / sqrt( Σ_{j=1}^{N} [tfij * Gini(tj)]^2 )    (4)

where wik is the weight of term tk in document Di and tfik is the frequency of tk in Di. In the literature [5], Shankar discussed text document feature selection and weight adjustment using the Gini-index principle: first, according to TF-IDF, generate the class central vectors from all the words in the original feature space; then calculate the Gini-index of all features based on the class central vectors; finally, select the features with the largest Gini-index up to a predefined number. That discussion is also limited to centroid-based classification. The method we present in this paper is quite different: we emphasize feature weighting after feature selection, and the weighting solution benefits not only centroid-based classification but also other existing text classifiers.
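Formulas (3) and (4) can be sketched as follows. This is a minimal illustration with hypothetical numbers, assuming cosine normalization over the features of one document as in formula (4):

```python
import math

def gini_t(class_probs_given_t):
    # Formula (3): Gini(t) = sum_i P(Ci|t)^2
    return sum(p ** 2 for p in class_probs_given_t)

def tf_gini_weights(tfs, ginis):
    # Formula (4): w_ik = tf_ik * Gini(t_k), cosine-normalized over the
    # features of a single document.
    raw = [tf * g for tf, g in zip(tfs, ginis)]
    norm = math.sqrt(sum(r ** 2 for r in raw))
    return [r / norm for r in raw]

# One document with three features: term frequencies and the features'
# Gini purities (illustrative values, not taken from the paper).
ginis = [gini_t([0.0, 1.0]), gini_t([0.5, 0.5]), gini_t([0.2, 0.8])]
w = tf_gini_weights([3, 1, 2], ginis)
print([round(x, 3) for x in w])
```

After normalization the document vector has unit length, so the weights are comparable across documents of different sizes.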
4 Experiment Result and Analysis

To further investigate the effect of the algorithm, we implemented it in VC++ 6.0; part of the source code comes from the text classifier source code provided by Li Ronglu of the Department of Computer and Information Technology, Fudan University.

4.1 Data sets

Two corpora are used in our experiments. One is the recognized standard English classification corpus Reuters-21578. The other is the Chinese corpus from the International Database Center, Department of Computer Information and Technology, Fudan University. The Reuters news corpus is the most widely used corpus in text classification research; the 1987 Reuters-21578 collection contains 21,578 documents in total. We use the 10 most common classes, with a training set of 7951 documents and a testing set of 2726 documents. After word
root recovery and removal of unused words, the remaining words form the feature space. Within the experiment set, the class distribution is uneven: 2875 documents belong to the largest class, while the smallest class has only 170 documents, 2.41% of the total training documents.

The second corpus is the Chinese corpus from the International Database Center, Department of Computer Information and Technology, Fudan University, which is divided into 20 classes. We use 10 of these classes, with a training set of 1882 documents and a testing set of 900 documents. After word root recovery and removal of unused words, the remaining words form the feature space. The class distribution of the training set is also uneven: 338 documents belong to the politics class, 17.96% of the training set, while the environment class has only 134 documents, just 7.12% of the total.

4.2 Classifier

We use the fkNN classifier; the discriminant function is the FSWF rule presented in the literature [6, 9]:

    μj(X) = [ Σ_{i=1}^{k} μj(Xi) * sim(X, Xi) / (1 - sim(X, Xi))^{2/(b-1)} ] / [ Σ_{i=1}^{k} 1 / (1 - sim(X, Xi))^{2/(b-1)} ]    (5)

where j = 1, 2, ..., c, sim(X, Xi) is the similarity between X and the known sample Xi, and μj(Xi) is the membership value of Xi in class j: if Xi belongs to class j then μj(Xi) is 1, otherwise 0. In fkNN, k is determined by parameter training and optimization: for Reuters-21578, k is 40; for the Chinese document set, k is 10. Parameter b is set to 2.

4.3 Experiment result and analysis

First, text features are selected using Information Gain, Expected Cross Entropy, Weight of Evidence for Text, and the χ2 statistic [7]; then the features are weighted by the two different weighting solutions, TF-IDF and TF-GINI.
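The fkNN decision rule of formula (5) in Section 4.2 can be sketched as below. This is our own illustration with crisp memberships μj(Xi) in {0, 1}; the eps guard against division by zero is our addition:

```python
def fknn_memberships(neighbors, n_classes, b=2.0, eps=1e-12):
    # neighbors: list of (sim, label) pairs for the k nearest training
    # samples, where sim is the similarity of X to Xi and label is Xi's
    # class. Implements formula (5) with crisp memberships.
    num = [0.0] * n_classes
    den = 0.0
    for sim, label in neighbors:
        # (1 - sim) acts as a distance; exponent 2/(b-1) as in (5).
        w = 1.0 / max(1.0 - sim, eps) ** (2.0 / (b - 1.0))
        num[label] += sim * w
        den += w
    return [x / den for x in num]

# Three neighbors (k = 3), two classes; X is assigned the class with
# the largest membership value.
mu = fknn_memberships([(0.9, 0), (0.8, 0), (0.7, 1)], n_classes=2)
pred = max(range(2), key=lambda j: mu[j])
```

With b = 2, as in the experiments, the exponent 2/(b-1) reduces to 2, so closer neighbors dominate quadratically.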
To evaluate the feature weighting methods, we study performance from the three aspects below: (1) classification accuracy, using the Macro-F1 and Micro-F1 performance indices [8], which are widely adopted internationally; (2) the capability to handle uneven class distributions; (3) computational complexity. Table 1 and Table 2 summarize the classification performance under the following conditions: features are selected via Information Gain and weighted by TF-IDF and TF-GINI respectively, on the 10 most common classes of the Reuters-21578 corpus and on the Chinese document set, with fkNN used for classification after picking different feature dimensions. From Tables 1 and 2 we notice that, for both the Chinese corpus and the English corpus Reuters-21578, the TF-GINI weighting method outperforms TF-IDF in classification accuracy, on both Macro-F1 and Micro-F1, across different numbers of selected features. For the Chinese corpus at 2000 dimensions, the best Micro-F1 is 1.178% higher than TF-IDF, while Macro-F1 is 1.248% higher. For Reuters-21578, averaged over different dimension numbers, Micro-F1 is 1.465% higher than TF-IDF and Macro-F1 is 0.903% higher. The variance of the classification accuracy on both corpora is also much smaller than with the TF-IDF method.
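For reference, the Macro-F1 and Micro-F1 measures used above can be computed from a confusion matrix as follows (a generic sketch, not the authors' code):

```python
def f1_scores(confusion):
    # confusion[i][j]: number of class-i documents predicted as class j.
    m = len(confusion)
    tp = [confusion[i][i] for i in range(m)]
    pred = [sum(confusion[i][j] for i in range(m)) for j in range(m)]
    true = [sum(row) for row in confusion]
    # Macro-F1: unweighted average of per-class F1 values.
    per_class = []
    for i in range(m):
        p = tp[i] / pred[i] if pred[i] else 0.0
        r = tp[i] / true[i] if true[i] else 0.0
        per_class.append(2 * p * r / (p + r) if p + r else 0.0)
    macro = sum(per_class) / m
    # Micro-F1: pool the counts first; for single-label classification
    # this equals overall accuracy.
    micro = sum(tp) / sum(true)
    return macro, micro

macro, micro = f1_scores([[8, 2], [1, 9]])
```

Macro-F1 treats every class equally and so is sensitive to performance on small classes, which is why the paper reports both measures on the unevenly distributed corpora.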
This fully demonstrates the effectiveness of our improvement. The TF-GINI method, without increasing the computational complexity, not only increases classification accuracy under uneven class distributions in the English and Chinese corpora, but also reduces the sensitivity to the feature dimension to a certain degree.

Table 1: Classification performance of two weighting methods on the Chinese training set

              TF-IDF            TF-GINI
  number    maF1    miF1     maF1    miF1
  ...
  X̄
  S

Note: X̄ and S show the mean and standard deviation of Macro-F1 and Micro-F1 on the Chinese corpus with different feature numbers.

Tables 3 and 4 summarize the classification performance of the TF-IDF and TF-GINI weighting solutions after the three other major feature selection schemes: Expected Cross Entropy, Weight of Evidence for Text, and the χ2 statistic. After parameter optimization training, 2000 feature dimensions are used on the Chinese corpus and 1000 feature dimensions on Reuters-21578. Combining the information of Tables 3 and 4, we find that no matter which feature selection method is used, the TF-GINI weighting method always outperforms TF-IDF on both Macro-F1 and Micro-F1.

5 Conclusions

In this paper we study text document feature weighting based on the feature's Gini-index. The study includes experiments and analysis comparing the Gini-index based weighting method TF-GINI with the TF-IDF method from three aspects: classification accuracy, handling of unevenly distributed class document sets, and computational complexity. The results show that the Gini-index based text feature weighting method is a very promising feature weighting method.
Table 2: Classification performance of two weighting methods on the Reuters-21578 set

              TF-IDF            TF-GINI
  number    maF1    miF1     maF1    miF1
  ...
  X̄
  S

Note: X̄ and S show the mean and standard deviation of Macro-F1 and Micro-F1 on the 10 common classes of Reuters-21578 with different feature numbers.

Table 3: Performance of two weighting methods with three feature selection methods on the Chinese corpus

              TF-IDF            TF-GINI
  method     maF1    miF1     maF1    miF1
  CroEntTxt   ...
  χ2          ...
  WeiEviTxt   ...

Table 4: Performance of two weighting methods with three feature selection methods on the Reuters-21578 corpus

              TF-IDF            TF-GINI
  method     maF1    miF1     maF1    miF1
  CroEntTxt   ...
  χ2          ...
  WeiEviTxt   ...
References

[1] Roelleke Thomas, Wang Jun. TF-IDF uncovered: A study of theories and probabilities. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2008.
[2] Thorsten Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In: Proceedings of the 14th International Conference on Machine Learning (ICML '97), 1997.
[3] Zhang Xiaoyan, Wang Ting, Liang Xiaobo, Ao Feng, Li Yan. A class-based feature weighting method for text classification. Journal of Computational Information Systems, v 8, n 3, March 2012.
[4] Shang Wenqian, Huang Houkuan, Liu Yuling, Lin Yongmin. Research on the algorithm of feature selection based on Gini index for text categorization. Computer Research and Development, v 43, n 10, October 2006.
[5] Shrikanth Shankar, George Karypis. A feature weight adjustment algorithm for document categorization. In: KDD 2000, Boston, 2000.
[6] Lin Yongmin, Zhu Weidong, Shang Wenqian. Improvement of the decision rule of kNN text categorization. Computer Research and Development, v 42, 2005.
[7] Li Kairong, Chen Guixiang, Cheng Jilin. Research on hidden Markov model-based text categorization process. International Journal of Digital Content Technology and its Applications, v 5, n 6, June 2011.
[8] Li Kunlun, Xie Jing, Sun Xue. Multi-class text categorization based on LDA and SVM. Procedia Engineering, v 15, 2011.
[9] Shang Wenqian, Huang Houkuan, Zhu Haibin, Lin Yongmin. A novel feature selection algorithm for text categorization. Expert Systems with Applications, v 33, n 1, p 1-5, July 2007.
More informationSupervised vs unsupervised clustering
Classification Supervised vs unsupervised clustering Cluster analysis: Classes are not known a- priori. Classification: Classes are defined a-priori Sometimes called supervised clustering Extract useful
More informationText Clustering Incremental Algorithm in Sensitive Topic Detection
International Journal of Information and Communication Sciences 2018; 3(3): 88-95 http://www.sciencepublishinggroup.com/j/ijics doi: 10.11648/j.ijics.20180303.12 ISSN: 2575-1700 (Print); ISSN: 2575-1719
More informationCHAPTER 3 ASSOCIATON RULE BASED CLUSTERING
41 CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING 3.1 INTRODUCTION This chapter describes the clustering process based on association rule mining. As discussed in the introduction, clustering algorithms have
More informationAutomatic Domain Partitioning for Multi-Domain Learning
Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels
More informationA Recommender System Based on Improvised K- Means Clustering Algorithm
A Recommender System Based on Improvised K- Means Clustering Algorithm Shivani Sharma Department of Computer Science and Applications, Kurukshetra University, Kurukshetra Shivanigaur83@yahoo.com Abstract:
More informationClustering Documents in Large Text Corpora
Clustering Documents in Large Text Corpora Bin He Faculty of Computer Science Dalhousie University Halifax, Canada B3H 1W5 bhe@cs.dal.ca http://www.cs.dal.ca/ bhe Yongzheng Zhang Faculty of Computer Science
More informationIncorporating Hyperlink Analysis in Web Page Clustering
Incorporating Hyperlink Analysis in Web Page Clustering Michael Chau School of Business The University of Hong Kong Pokfulam, Hong Kong +852 2859-1014 mchau@business.hku.hk Patrick Y. K. Chau School of
More informationLocation-Aware Web Service Recommendation Using Personalized Collaborative Filtering
ISSN 2395-1621 Location-Aware Web Service Recommendation Using Personalized Collaborative Filtering #1 Shweta A. Bhalerao, #2 Prof. R. N. Phursule 1 Shweta.bhalerao75@gmail.com 2 rphursule@gmail.com #12
More information1) Give decision trees to represent the following Boolean functions:
1) Give decision trees to represent the following Boolean functions: 1) A B 2) A [B C] 3) A XOR B 4) [A B] [C Dl Answer: 1) A B 2) A [B C] 1 3) A XOR B = (A B) ( A B) 4) [A B] [C D] 2 2) Consider the following
More informationNonparametric Classification Methods
Nonparametric Classification Methods We now examine some modern, computationally intensive methods for regression and classification. Recall that the LDA approach constructs a line (or plane or hyperplane)
More informationTemperature Calculation of Pellet Rotary Kiln Based on Texture
Intelligent Control and Automation, 2017, 8, 67-74 http://www.scirp.org/journal/ica ISSN Online: 2153-0661 ISSN Print: 2153-0653 Temperature Calculation of Pellet Rotary Kiln Based on Texture Chunli Lin,
More informationA New Technique to Optimize User s Browsing Session using Data Mining
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,
More informationAn Ensemble Approach to Enhance Performance of Webpage Classification
An Ensemble Approach to Enhance Performance of Webpage Classification Roshani Choudhary 1, Jagdish Raikwal 2 1, 2 Dept. of Information Technology 1, 2 Institute of Engineering & Technology 1, 2 DAVV Indore,
More informationA Survey on Postive and Unlabelled Learning
A Survey on Postive and Unlabelled Learning Gang Li Computer & Information Sciences University of Delaware ligang@udel.edu Abstract In this paper we survey the main algorithms used in positive and unlabeled
More informationAn Adaptive Threshold LBP Algorithm for Face Recognition
An Adaptive Threshold LBP Algorithm for Face Recognition Xiaoping Jiang 1, Chuyu Guo 1,*, Hua Zhang 1, and Chenghua Li 1 1 College of Electronics and Information Engineering, Hubei Key Laboratory of Intelligent
More informationAN IMPROVED TAIPEI BUS ESTIMATION-TIME-OF-ARRIVAL (ETA) MODEL BASED ON INTEGRATED ANALYSIS ON HISTORICAL AND REAL-TIME BUS POSITION
AN IMPROVED TAIPEI BUS ESTIMATION-TIME-OF-ARRIVAL (ETA) MODEL BASED ON INTEGRATED ANALYSIS ON HISTORICAL AND REAL-TIME BUS POSITION Xue-Min Lu 1,3, Sendo Wang 2 1 Master Student, 2 Associate Professor
More informationChinese text clustering algorithm based k-means
Available online at www.sciencedirect.com Physics Procedia 33 (2012 ) 301 307 2012 International Conference on Medical Physics and Biomedical Engineering Chinese text clustering algorithm based k-means
More informationClustering Web Documents using Hierarchical Method for Efficient Cluster Formation
Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation I.Ceema *1, M.Kavitha *2, G.Renukadevi *3, G.sripriya *4, S. RajeshKumar #5 * Assistant Professor, Bon Secourse College
More informationKeywords APSE: Advanced Preferred Search Engine, Google Android Platform, Search Engine, Click-through data, Location and Content Concepts.
Volume 5, Issue 3, March 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Advanced Preferred
More informationAn Integrated Face Recognition Algorithm Based on Wavelet Subspace
, pp.20-25 http://dx.doi.org/0.4257/astl.204.48.20 An Integrated Face Recognition Algorithm Based on Wavelet Subspace Wenhui Li, Ning Ma, Zhiyan Wang College of computer science and technology, Jilin University,
More informationText Categorization (I)
CS473 CS-473 Text Categorization (I) Luo Si Department of Computer Science Purdue University Text Categorization (I) Outline Introduction to the task of text categorization Manual v.s. automatic text categorization
More informationWeb Information Retrieval using WordNet
Web Information Retrieval using WordNet Jyotsna Gharat Asst. Professor, Xavier Institute of Engineering, Mumbai, India Jayant Gadge Asst. Professor, Thadomal Shahani Engineering College Mumbai, India ABSTRACT
More informationSupervised Learning Classification Algorithms Comparison
Supervised Learning Classification Algorithms Comparison Aditya Singh Rathore B.Tech, J.K. Lakshmipat University -------------------------------------------------------------***---------------------------------------------------------
More informationTree-based methods for classification and regression
Tree-based methods for classification and regression Ryan Tibshirani Data Mining: 36-462/36-662 April 11 2013 Optional reading: ISL 8.1, ESL 9.2 1 Tree-based methods Tree-based based methods for predicting
More informationIteration Reduction K Means Clustering Algorithm
Iteration Reduction K Means Clustering Algorithm Kedar Sawant 1 and Snehal Bhogan 2 1 Department of Computer Engineering, Agnel Institute of Technology and Design, Assagao, Goa 403507, India 2 Department
More information7. Decision or classification trees
7. Decision or classification trees Next we are going to consider a rather different approach from those presented so far to machine learning that use one of the most common and important data structure,
More informationLRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier
LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier Wang Ding, Songnian Yu, Shanqing Yu, Wei Wei, and Qianfeng Wang School of Computer Engineering and Science, Shanghai University, 200072
More informationDecision Trees Dr. G. Bharadwaja Kumar VIT Chennai
Decision Trees Decision Tree Decision Trees (DTs) are a nonparametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target
More informationIntroduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering
Introduction to Pattern Recognition Part II Selim Aksoy Bilkent University Department of Computer Engineering saksoy@cs.bilkent.edu.tr RETINA Pattern Recognition Tutorial, Summer 2005 Overview Statistical
More informationVideo Inter-frame Forgery Identification Based on Optical Flow Consistency
Sensors & Transducers 24 by IFSA Publishing, S. L. http://www.sensorsportal.com Video Inter-frame Forgery Identification Based on Optical Flow Consistency Qi Wang, Zhaohong Li, Zhenzhen Zhang, Qinglong
More informationA MINING TECHNIQUE FOR WEB DATA USING CLUSTERING
A MINING TECHNIQUE FOR WEB DATA USING CLUSTERING Ms. Chhaya M.Meshram 1, Prof. Rahila Sheikh 2 1 B.D.C.O.E. Sevagram, 2 R.G.C.E.R.T. Chandrapur Abstract- Web text mining is an important branch in the data
More informationPerformance Analysis of Data Mining Classification Techniques
Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal
More informationDiscovering Advertisement Links by Using URL Text
017 3rd International Conference on Computational Systems and Communications (ICCSC 017) Discovering Advertisement Links by Using URL Text Jing-Shan Xu1, a, Peng Chang, b,* and Yong-Zheng Zhang, c 1 School
More informationMIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018
MIT 801 [Presented by Anna Bosman] 16 February 2018 Machine Learning What is machine learning? Artificial Intelligence? Yes as we know it. What is intelligence? The ability to acquire and apply knowledge
More informationUnknown Malicious Code Detection Based on Bayesian
Available online at www.sciencedirect.com Procedia Engineering 15 (2011) 3836 3842 Advanced in Control Engineering and Information Science Unknown Malicious Code Detection Based on Bayesian Yingxu Lai
More informationImproving Latent Fingerprint Matching Performance by Orientation Field Estimation using Localized Dictionaries
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 11, November 2014,
More informationCSE4334/5334 DATA MINING
CSE4334/5334 DATA MINING Lecture 4: Classification (1) CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai Li (Slides courtesy
More informationSimilarity search in multimedia databases
Similarity search in multimedia databases Performance evaluation for similarity calculations in multimedia databases JO TRYTI AND JOHAN CARLSSON Bachelor s Thesis at CSC Supervisor: Michael Minock Examiner:
More informationContents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation
Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation Learning 4 Supervised Learning 4 Unsupervised Learning 4
More informationA high performance Hybrid Algorithm for Text Classification
A high performance Hybrid Algorithm for Text Classification Prema Nedungadi, Haripriya Harikumar, Maneesha Ramesh Amrita CREATE, Amrita University Abstract The high computational complexity of text classification
More informationFeature weighting classification algorithm in the application of text data processing research
, pp.41-47 http://dx.doi.org/10.14257/astl.2016.134.07 Feature weighting classification algorithm in the application of text data research Zhou Chengyi University of Science and Technology Liaoning, Anshan,
More informationOutlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data
Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University
More informationA Study on Data mining Classification Algorithms in Heart Disease Prediction
A Study on Data mining Classification Algorithms in Heart Disease Prediction Dr. T. Karthikeyan 1, Dr. B. Ragavan 2, V.A.Kanimozhi 3 Abstract: Data mining (sometimes called knowledge discovery) is the
More informationEncoding Words into String Vectors for Word Categorization
Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,
More informationSubspace Clustering. Weiwei Feng. December 11, 2015
Subspace Clustering Weiwei Feng December 11, 2015 Abstract Data structure analysis is an important basis of machine learning and data science, which is now widely used in computational visualization problems,
More informationData Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3
Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3 January 25, 2007 CSE-4412: Data Mining 1 Chapter 6 Classification and Prediction 1. What is classification? What is prediction?
More informationConcept-Based Document Similarity Based on Suffix Tree Document
Concept-Based Document Similarity Based on Suffix Tree Document *P.Perumal Sri Ramakrishna Engineering College Associate Professor Department of CSE, Coimbatore perumalsrec@gmail.com R. Nedunchezhian Sri
More informationINTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK REVIEW ON TEXT CLASSIFICATION USING DIFFERENT CLASSIFICATION TECHNIQUES PRADIP
More informationCLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS
CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of
More information