Using Gini-index for Feature Weighting in Text Categorization


Journal of Computational Information Systems 9: 14 (2013)

Weidong ZHU 1,*, Yongmin LIN 2

1 School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
2 College of Economics and Management, Hebei Polytechnic University, Tangshan, China

* Corresponding author. E-mail: wdzhu@bjtu.edu.cn (Weidong ZHU)

Abstract

With the rapid development of the World Wide Web, text categorization has come to play an important role in organizing and processing large amounts of text data. As a simple, straightforward and fast feature weighting method, TF·IDF is widely used in document classification. However, it simply assumes that low-frequency words are important and high-frequency words are unimportant, which may not reflect the actual usefulness of a word and lowers classification precision. In this paper, a Gini-index based feature weighting method is presented that addresses this problem. Experiments show that the TF·GINI method achieves better classification performance.

Keywords: Text Categorization; Feature Selection; Gini-index; Feature Weighting; VSM

1 Introduction

Automatic text classification is a supervised learning task: a classifier learns from a training document set with predefined class labels and assigns class labels to new documents. With the rapid development of network technology and digital libraries, the volume of online documents grows quickly, and automatic text classification has become a key technique for processing and organizing such large document collections. Existing approaches are based on statistical theory and machine learning; well-known methods include Support Vector Machines (SVM), Naive Bayes, k-Nearest Neighbor (kNN), Neural Networks (NNet), Boosting and Linear Least Squares Fitting (LLSF). Most classifiers use the vector space model to describe a document: the document is treated as a vector in the feature space, and the coordinate values are the TF·IDF weights proposed by Salton. Text frequency (TF) is the number of times a word appears in the document, and inverse document frequency is IDF = log(N/N_t), where t is the word, N is the total number of documents in the training set, and N_t is the number of documents containing t. Using the product of text frequency (TF) and inverse document frequency (IDF) as the coordinate value is simple, intuitive and fast, so it has been widely adopted in document categorization. However, the TF·IDF method [1-3] simply assumes that low-frequency words are important and high-frequency words are unimportant, which may not reflect the usefulness of the words and decreases classification accuracy.
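To make the TF·IDF weighting concrete, the following is a minimal sketch (illustrative only; the toy corpus, tokenization and function names are not from the paper) that computes IDF = log(N/N_t) from document frequencies and returns the TF·IDF weight of each word per document.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Compute TF*IDF weights for each tokenized document in `docs`."""
    N = len(docs)                      # total number of documents in the training set
    df = Counter()                     # N_t: number of documents containing term t
    for doc in docs:
        df.update(set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)              # text frequency of each term in this document
        weighted.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weighted

# Toy usage with three tiny "documents"
docs = [["stock", "price", "rise"], ["stock", "fall"], ["weather", "rain"]]
for i, w in enumerate(tf_idf_weights(docs)):
    print(f"doc {i}: {w}")
```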

This paper presents a different feature weighting solution, TF·GINI, based on the Gini-index. The Gini-index of a feature is calculated from the class distribution of the documents containing that feature, and the feature is then weighted by the product of its text frequency (TF) and its Gini-index. This fully considers the feature's ability to distinguish between classes without increasing the computational complexity. Comparative experiments on the Reuters-21578 corpus and on the Chinese corpus from the International Database Center of the Department of Computer Information and Technology, Fudan University, show that the TF·GINI method achieves better classification performance than the TF·IDF method without increasing the time complexity. The rest of this paper is organized as follows: Section 2 discusses the defect of the TF·IDF weighting solution; Section 3 presents the Gini-index based TF·GINI feature weighting solution; Section 4 reports the experimental results and analysis; Section 5 concludes and outlines future work.

2 Analysis of the TF·IDF Weighting Solution

The vector space model is currently one of the simplest and most efficient models for describing text. Its basic scheme is as follows: a natural language document is represented as D = D(t_1, w_1; t_2, w_2; ...; t_N, w_N), where t_i is a feature selected from document D and w_i is the weight of that feature, 1 ≤ i ≤ N. To simplify the analysis, the order of the t_k in the document is usually ignored and the t_k are assumed to be distinct (no repetition). Then t_1, t_2, ..., t_N form an N-dimensional coordinate system with coordinate values w_1, w_2, ..., w_N, and D(w_1, w_2, ..., w_N) is a vector in this coordinate system. The measurement of the coordinate system adopts the TF·IDF value proposed by Salton: text frequency (TF) is the number of times a word appears in the document, and inverse document frequency is IDF = log(N/N_t), where t is the word, N is the total number of documents in the training set, and N_t is the number of documents containing t.

LU Yuchang analyzed the two basic assumptions of the TF·IDF scheme in [3]: (1) if a word appears many times in one document, it also appears many times in other documents of the same class, and vice versa, so text frequency (TF) is acceptable as the part of the weight that reflects same-class behavior; (2) the less frequently a word appears, the stronger its ability to distinguish between classes, which motivates the inverse document frequency (IDF), and the product of TF and IDF is used as the measurement of the feature space coordinate system. From the perspective of word weighting and vector rotation, [3] showed that the simple construction of IDF cannot effectively reflect the usefulness of a word; it proposed removing P(W) from the Information Gain and Weight of Evidence for Text formulas, weighting words with the result, and demonstrated the improvement experimentally. In [2], Thorsten analyzed, from a probabilistic point of view, the use of the product of TF and IDF as the measurement of the feature space coordinate system and pointed out that it may not yield high classification accuracy. He proposed a classification model that lies between the traditional TF·IDF and Naive Bayes models.

From the perspective of a feature's usefulness for classification, we find that TF·IDF weighting may assign a larger weight to a rare word without considering its class distribution characteristics, and such rare words may lead to incorrect classification. A simple case study reveals this defect of TF·IDF. Consider the following situation: the training set contains 300 documents, of which 100 belong to class A and the remaining 200 belong to class B. Words t_1 and t_2 appear only in class B documents, with N_t1 = 200 and N_t2 = 100. Document D belongs to class B, both t_1 and t_2 appear in D, and TF(t_1) = TF(t_2). Weighting these two words with TF·IDF gives TF(t_1)·log(N/N_t1) < TF(t_2)·log(N/N_t2), so t_2 receives the higher TF·IDF value purely because of its rarity. However, under these conditions t_1 obviously has the stronger classification capability and contributes more to the classification. TF·IDF simply uses the inverse document frequency to weight a feature without considering its class distribution, and this is the main reason for low classification accuracy after weighting.

3 Gini-index Based Text Feature Weighting Method

The Gini-index is an impurity-based splitting measure. It is applicable to categorical, binary and continuous numeric fields. It was proposed by Breiman in 1984 and has been widely used in decision-tree algorithms such as CART, SLIQ, SPRINT and IBM's Intelligent Miner data mining tool, achieving fairly good classification accuracy.

3.1 Gini-index principle

Suppose that S is a set of s data samples whose class label attribute has m distinct values, defining m classes C_i (i = 1, 2, ..., m). According to the class label attribute, S can be divided into m subsets S_i (i = 1, 2, ..., m), where S_i is the subset of samples belonging to class C_i and s_i is the number of samples in S_i. The Gini-index of the set S is

    Gini(S) = 1 - \sum_{i=1}^{m} P_i^2    (1)

where P_i is the probability that a sample belongs to C_i, estimated by s_i / s. Gini(S) reaches its minimum of 0 when all records in the set belong to the same class, indicating that the maximum useful information is obtained; when the samples are uniformly distributed over the classes, Gini(S) reaches its maximum, indicating that the minimum useful information is obtained. This original form of the Gini-index measures the impurity of an attribute for categorization: the smaller its value, i.e. the lower the impurity, the better the attribute. Conversely,

    Gini'(S) = \sum_{i=1}^{m} P_i^2    (2)

measures the purity of an attribute for categorization: the larger its value, the higher the purity and the better the attribute.
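To illustrate both points, the following minimal sketch (not from the paper; the values and helper names are illustrative) reproduces the numbers of the case study above and evaluates the impurity and purity forms of the Gini-index from Eqs. (1) and (2).

```python
import math

# Case study from Section 2: 300 training documents, 100 in class A, 200 in class B.
# Words t1 and t2 occur only in class B documents; N_t1 = 200, N_t2 = 100.
N, N_t1, N_t2, tf = 300, 200, 100, 1            # assume TF(t1) = TF(t2) = 1 in document D
tfidf_t1 = tf * math.log(N / N_t1)              # log(1.5) -- the smaller weight
tfidf_t2 = tf * math.log(N / N_t2)              # log(3)   -- larger, only because t2 is rarer
print(tfidf_t1 < tfidf_t2)                      # True: TF*IDF prefers the rarer, less useful word

def gini_impurity(probs):
    """Eq. (1): Gini(S) = 1 - sum(P_i^2); equals 0 when the set is pure."""
    return 1.0 - sum(p * p for p in probs)

def gini_purity(probs):
    """Eq. (2): sum(P_i^2); equals 1 when the set is pure."""
    return sum(p * p for p in probs)

# Among the documents containing t1 (and likewise t2), all belong to class B,
# so both words are perfectly pure even though their document frequencies differ.
print(gini_purity([0.0, 1.0]))                  # 1.0
print(gini_impurity([100 / 300, 200 / 300]))    # impurity of the whole training set (~0.444)
```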

3.2 Gini-index based text feature weighting solution

The Gini-index is an excellent purity measure for a set. The usefulness of a feature for classification can be measured by the feature's purity: the feature should be as pure as possible, and if all documents containing a feature belong to the same class, the feature is pure [4]. We therefore adopt the purity of the feature in place of the inverse document frequency and present the TF·GINI weighting scheme. Specifically, after text feature selection, we estimate the probability of each class among the documents containing feature t and then calculate the Gini-index of t as

    Gini(t) = \sum_{i=1}^{m} P(C_i | t)^2    (3)

The TF·GINI weight of feature t_k in document D_i, with normalization, is

    w_{ik} = \frac{tf_{ik} \cdot gini(t_k)}{\sqrt{\sum_{j=1}^{N} [tf_{ij} \cdot gini(t_j)]^2}}    (4)

where w_{ik} is the weight of feature t_k in document D_i, tf_{ik} is the frequency of t_k in D_i, and the sum in the denominator runs over the N features.

In [5], Shankar discussed text document feature selection and weight adjustment using the Gini-index principle. First, class centroid vectors are generated from all words in the original feature space according to TF-IDF; then the Gini-index of every feature is calculated based on the class centroid vectors; finally, a predefined number of features with the largest Gini-index values are selected. That discussion, however, is limited to centroid-based classification. The method presented in this paper is quite different: we emphasize feature weighting after feature selection, and the weighting solution is useful not only for centroid-based classification but also for other existing text classifiers.
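As a concrete illustration of Eqs. (3) and (4), the sketch below (illustrative only; the data structures, normalization details and names are assumptions, not the paper's code) estimates Gini(t) from the class distribution of the documents containing each term and builds a normalized TF·GINI document vector.

```python
import math
from collections import Counter, defaultdict

def gini_of_terms(docs, labels):
    """Eq. (3): Gini(t) = sum_i P(C_i | t)^2, estimated over the documents containing t."""
    class_counts = defaultdict(Counter)          # term -> class label -> document count
    for doc, label in zip(docs, labels):
        for t in set(doc):
            class_counts[t][label] += 1
    gini = {}
    for t, counts in class_counts.items():
        n_t = sum(counts.values())
        gini[t] = sum((c / n_t) ** 2 for c in counts.values())
    return gini

def tf_gini_vector(doc, gini):
    """Eq. (4): w_ik = tf_ik * gini(t_k), cosine-normalized over the document's terms."""
    tf = Counter(doc)
    raw = {t: tf[t] * gini.get(t, 0.0) for t in tf}
    norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
    return {t: v / norm for t, v in raw.items()}

# Toy usage
train_docs = [["stock", "rise"], ["stock", "fall"], ["rain", "wind"]]
train_labels = ["finance", "finance", "weather"]
g = gini_of_terms(train_docs, train_labels)
print(tf_gini_vector(["stock", "stock", "rain"], g))
```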

4 Experiment Result and Analysis

To further investigate the effect of the algorithm, we implemented it in VC++ 6.0; part of the source code comes from the text classifier provided by Li Ronglu of the Department of Computer and Information Technology, Fudan University.

4.1 Data sets

Two corpora are used in our experiments. One is the widely recognized English standard classification corpus Reuters-21578; the other is the Chinese corpus from the International Database Center of the Department of Computer Information and Technology, Fudan University. The Reuters news corpus is the most widely used corpus in text classification research; the 1987 Reuters-21578 collection contains 21,578 documents in total. We use the 10 most common classes, with a training set of 7951 documents and a testing set of 2726 documents. After word stemming and removal of unused words, the remaining words form the feature set. Within this experimental set the class distribution is uneven: the largest class contains 2875 documents, while the smallest class contains only 170 documents, just 2.41% of the training documents. The Chinese corpus is divided into 20 classes; we use 10 of them, with a training set of 1882 documents and a testing set of 900 documents. After word segmentation and removal of unused words, the remaining words form the feature set. The class distribution of the training set is also uneven: the politics class has 338 documents, 17.96% of the training set, while the environment class has only 134 documents, just 7.12% of the total.

4.2 Classifier

We use an fKNN classifier whose discriminant function is the FSWF rule presented in [6, 9]:

    \mu_j(X) = \frac{\sum_{i=1}^{k} \mu_j(X_i) \cdot sim(X, X_i) \cdot (1 - sim(X, X_i))^{-2/(b-1)}}{\sum_{i=1}^{k} (1 - sim(X, X_i))^{-2/(b-1)}}    (5)

where j = 1, 2, ..., c; \mu_j(X_i) is the membership value of the known sample X_i in class j (1 if X_i belongs to class j, otherwise 0); and sim(X, X_i) is the similarity between X and X_i. In fKNN, k is determined by parameter optimization on the training set: for Reuters-21578, k = 40; for the Chinese document set, k = 10. Parameter b is set to 2.
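A minimal sketch of the FSWF decision rule as written in Eq. (5) above (illustrative; the similarity values, parameter handling and function name are assumptions, not the paper's implementation):

```python
def fknn_membership(sims, memberships, b=2.0, eps=1e-9):
    """Membership of a test document in one class, following Eq. (5) as given above.

    sims        -- similarities sim(X, X_i) to the k nearest training documents
    memberships -- mu_j(X_i) for the same neighbours: 1 if X_i belongs to class j, else 0
    b           -- fuzziness parameter (the paper uses b = 2)
    """
    num = den = 0.0
    for s, m in zip(sims, memberships):
        w = (1.0 - s + eps) ** (-2.0 / (b - 1.0))   # closer neighbours get larger weights
        num += m * s * w
        den += w
    return num / den if den else 0.0

# Toy usage: 3 neighbours, two of which belong to the class of interest
print(fknn_membership(sims=[0.9, 0.7, 0.2], memberships=[1, 1, 0]))
```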

4.3 Experiment result and analysis

First, text features are selected using Information Gain, Expected Cross Entropy, the Weight of Evidence for Text and the χ² statistic [7]; then the features are weighted with the two weighting solutions, TF·IDF and TF·GINI. To evaluate the feature weighting methods, we study performance from three aspects: (1) classification accuracy, measured by the internationally adopted Macro-F1 and Micro-F1 indices [8]; (2) the ability to handle uneven class distributions; (3) computational complexity.

Tables 1 and 2 summarize the classification performance obtained when features are selected with Information Gain, weighted with TF·IDF and TF·GINI respectively, and classified with fKNN at different feature dimensions, on the 10 most common classes of the Reuters-21578 corpus and on the Chinese document set. From Tables 1 and 2 we observe that, for both the Chinese corpus and the English Reuters-21578 corpus, the TF·GINI weighting method outperforms TF·IDF in classification accuracy, on both Macro-F1 and Micro-F1, across the different numbers of selected features. For the Chinese corpus at 2000 dimensions, the best Micro-F1 is 1.178% higher than with TF·IDF and the Macro-F1 is 1.248% higher. For Reuters-21578, averaged over the different dimension numbers, Micro-F1 is 1.465% higher than with TF·IDF and Macro-F1 is 0.903% higher. Moreover, the variance of the classification accuracy on both corpora is much smaller than with the TF·IDF method. This fully demonstrates the effectiveness of our improvement: without increasing the computational complexity, the TF·GINI method not only increases the classification accuracy under uneven class distributions in both the English and Chinese corpora, but also reduces the sensitivity to the number of feature dimensions to a certain degree.

Table 1: Classification performance of the two weighting methods on the Chinese training set (columns: number of features; Macro-F1 and Micro-F1 for TF·IDF; Macro-F1 and Micro-F1 for TF·GINI; the last rows X̄ and S give the mean and standard deviation of Macro-F1 and Micro-F1 over the different feature numbers).

Tables 3 and 4 summarize the classification performance of the TF·IDF and TF·GINI weighting solutions under three other major feature selection schemes: Expected Cross Entropy, the Weight of Evidence for Text and the χ² statistic. After parameter optimization, 2000 feature dimensions are used on the Chinese corpus and 1000 feature dimensions on Reuters-21578. Combining the information in Tables 3 and 4, we find that no matter which feature selection method is used, the TF·GINI weighting method always outperforms TF·IDF on both Macro-F1 and Micro-F1.

5 Conclusions

In this paper we studied text feature weighting based on the feature's Gini-index. The study includes experiments and analysis comparing the Gini-index based weighting method TF·GINI with the TF·IDF method from three aspects: classification accuracy, handling of unevenly distributed class document sets, and computational complexity. The results show that the Gini-index based text feature weighting method is a very promising feature weighting method.

Table 2: Classification performance of the two weighting methods on the Reuters-21578 set (columns: number of features; Macro-F1 and Micro-F1 for TF·IDF; Macro-F1 and Micro-F1 for TF·GINI; the last rows X̄ and S give the mean and standard deviation of Macro-F1 and Micro-F1 over the different feature numbers).

Table 3: Performance of the two weighting methods with three feature selection methods on the Chinese corpus (rows: Expected Cross Entropy (CroEntTxt), χ² statistic, Weight of Evidence for Text (WeiEviTxt); columns: Macro-F1 and Micro-F1 for TF·IDF and for TF·GINI).

Table 4: Performance of the two weighting methods with three feature selection methods on the Reuters-21578 corpus (rows: Expected Cross Entropy (CroEntTxt), χ² statistic, Weight of Evidence for Text (WeiEviTxt); columns: Macro-F1 and Micro-F1 for TF·IDF and for TF·GINI).

References

[1] Thomas Roelleke, Jun Wang. TF-IDF uncovered: A study of theories and probabilities. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2008), 2008.
[2] T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In: Proceedings of the 14th International Conference on Machine Learning (ICML '97), 1997.
[3] Zhang Xiaoyan, Wang Ting, Liang Xiaobo, Ao Feng, Li Yan. A class-based feature weighting method for text classification. Journal of Computational Information Systems, 8(3), March 2012.
[4] Shang Wenqian, Huang Houkuan, Liu Yuling, Lin Yongmin. Research on the algorithm of feature selection based on Gini index for text categorization. Computer Research and Development, 43(10), October 2006.
[5] Shrikanth Shankar, George Karypis. A feature weight adjustment algorithm for document categorization. In: KDD 2000, Boston, 2000.
[6] Lin Yongmin, Zhu Weidong, Shang Wenqian. Improvement of the decision rule of kNN text categorization. Computer Research and Development, 42, 2005.
[7] Li Kairong, Chen Guixiang, Cheng Jilin. Research on hidden Markov model-based text categorization process. International Journal of Digital Content Technology and its Applications, 5(6), June 2011.
[8] Li Kunlun, Xie Jing, Sun Xue. Multi-class text categorization based on LDA and SVM. Procedia Engineering, 15, 2011.
[9] Shang Wenqian, Huang Houkuan, Zhu Haibin, Lin Yongmin. A novel feature selection algorithm for text categorization. Expert Systems with Applications, 33(1), 1-5, July 2007.
