Arabic Text Classification Using N-Gram Frequency Statistics A Comparative Study
Laila Khreisat
Dept. of Computer Science, Math and Physics
Fairleigh Dickinson University
285 Madison Ave, Madison NJ
Khreisat@fdu.edu

Abstract- This paper presents the results of classifying Arabic text documents using the N-gram frequency statistics technique employing a dissimilarity measure called the Manhattan distance, and Dice's measure of similarity. The Dice measure was used for comparison purposes. Results show that N-gram text classification using the Dice measure outperforms classification using the Manhattan measure.

Keywords: N-gram, classification, categorization, Arabic.

I. INTRODUCTION

The rapid growth of the Internet has increased the number of online documents available. This has led to the development of automated text and document classification systems that are capable of automatically organizing and classifying documents. Text classification (or categorization) is the process of structuring a set of documents according to a group structure that is known in advance. There are several different methods for text classification, including statistical-based algorithms, Bayesian classification, distance-based algorithms, k-nearest neighbors, and decision tree-based methods [4], to name a few. Text classification techniques are used in many applications, including e-mail filtering, mail routing, spam filtering, news monitoring, sorting through digitized paper archives, automated indexing of scientific articles, classification of news stories and searching for interesting information on the WWW.

The majority of these systems are designed to handle documents written in the English language, and therefore are not applicable to documents written in the Arabic language. Developing text classification systems for Arabic documents is a challenging task due to the complex and rich nature of the Arabic language. The Arabic language consists of 28 letters. The language is written from right to left. It has very complex morphology, and the majority of words have a tri-letter root.
The rest have either a quad-letter root, penta-letter root or hexa-letter root.

Previous work on Arabic text classification has used distance-based algorithms [5], learning algorithms [10], and Bayesian classification methods [6] in developing automated text classification systems. Specifically, [8] used N-grams for searching Arabic text documents. They investigated di-grams and tri-grams. No stemming was performed. They concluded that the N-gram technique is not an efficient approach to corpus-based Arabic word conflation. [9] used tri-grams for indexing Arabic documents without any prior stemming. The work of [11] uses N-grams with and without stemming for text searching. Their results indicate that the use of tri-grams combined with stemming improved the performance of search retrieval; however, the improvement was not statistically significant.

In this paper the behavior of the N-gram frequency statistics technique for classifying Arabic text documents is studied. The technique employs a dissimilarity measure called the Manhattan distance, and Dice's measure of similarity, for the purposes of classification. The Dice measure was used for comparison purposes. Results show that N-gram text classification using the Dice measure gives better classification results compared to the Manhattan measure.

A corpus of Arabic text documents was collected from online Arabic newspapers. 40% of the corpus was used as training classes and the remaining 60% of the corpus was used for classification. All documents, whether training documents or documents to be classified, went through a preprocessing phase removing punctuation marks, stop words, diacritics, and non-letters. For the training documents, the N-gram (N=3) frequency profile was generated for each document and saved in text files. Then for each document to be classified, the N-gram frequency profile was generated and compared against the N-gram frequency profiles of all the training classes. The Manhattan and Dice measures were computed.
Using the Manhattan measure, the category to which a document belongs is the one with the smallest Manhattan distance, and using the Dice measure, the category is the one with the largest Dice measure. The classification results using these two measures were compared in terms of recall and precision.

The rest of the paper is organized as follows: in section 2 the concept of N-grams is presented, section 3 describes the text preprocessing phase and section 4 gives a detailed description of the classification procedure. Section 5 presents the classification results.

II. N-GRAMS

An N-gram [3] is an N-character slice of a string. The N-gram method is language independent and works well in the case of noisy text (text that contains typographical errors). We used tri-grams for text classification. The tri-grams of a string or token are the set of contiguous 3-letter slices of the string. For example, the tri-grams for the word المودعين are: الم, لمو, مود, ودع, دعي, عين. In general, a word of length w has w-2 tri-grams.

According to Zipf's law [12]: "The nth most common word in a human language text occurs with a frequency inversely proportional to n." This has the implication that documents belonging to the same class or category will have similar N-gram frequency distributions. Figure 1 shows the tri-gram frequency distribution for a text document belonging to the sports category from our corpus. It clearly shows that the frequencies of the most common tri-grams are inversely proportional to their rank.

Figure 1. Frequency distribution of tri-grams (frequency versus tri-gram rank)

III. TEXT PREPROCESSING

All text documents went through a preprocessing stage. This was necessary due to the variations in the way text can be represented in Arabic. The preprocessing was performed for the documents to be classified and the training classes themselves. Preprocessing consisted of the following steps:
1) Convert text files to UTF-8 encoding.
2) Remove punctuation marks, diacritics, non-letters, stop words. The definitions of these were obtained from the Khoja stemmer [7].
3) Replace initial إ or أ with ا.
4) Replace final ى followed by ء with ئ.
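The preprocessing steps and the tri-gram slicing from section 2 can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the Unicode ranges used to define "diacritics" and "letters", and the `stop_words` parameter are all hypothetical, since the actual definitions came from the Khoja stemmer [7] and are not given in the text. Step 1 (conversion to UTF-8) is assumed to have been done when the file was read.

```python
import re

# Assumption: diacritics are the combining marks U+064B..U+0652.
DIACRITICS = re.compile(r"[\u064B-\u0652]")
# Assumption: "letters" are U+0621..U+063A and U+0641..U+064A;
# everything else except whitespace is treated as a non-letter.
NON_LETTERS = re.compile(r"[^\u0621-\u063A\u0641-\u064A\s]")


def normalize_token(token):
    """Apply steps 3 and 4 of the preprocessing list to one token."""
    # Step 3: initial alef with hamza below/above becomes bare alef.
    if token[:1] in ("\u0625", "\u0623"):
        token = "\u0627" + token[1:]
    # Step 4: final alef maqsura followed by hamza becomes yeh with hamza.
    if token.endswith("\u0649\u0621"):
        token = token[:-2] + "\u0626"
    return token


def preprocess(text, stop_words=frozenset()):
    """Return the normalized tokens of `text` (step 2, then steps 3 and 4)."""
    text = DIACRITICS.sub("", text)    # drop diacritics without splitting words
    text = NON_LETTERS.sub(" ", text)  # punctuation, digits, etc. become spaces
    tokens = [normalize_token(t) for t in text.split()]
    # Hypothetical stop-word list; assumed to be normalized the same way.
    return [t for t in tokens if t not in stop_words]


def trigrams(token):
    """All contiguous 3-letter slices; a word of length w yields w-2 of them."""
    return [token[i:i + 3] for i in range(len(token) - 2)]
```

Calling `trigrams("المودعين")` yields the six slices listed in section 2, matching the w-2 count for an eight-letter word.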
IV. N-GRAM BASED TEXT CLASSIFICATION

A corpus of Arabic text documents was built using Arabic news articles collected from online websites of several Arabic newspapers. The corpus consisted of text documents covering 4 categories: sports, economy, technology and weather. The technology and weather documents were very small in size, ranging from 1 KB to 4 KB. Sports and economy documents were much larger, ranging from 2 KB to 15 KB for sports documents and 2 KB to 18 KB for economy documents. The smaller documents constituted about 2% of the total number of documents in the sports and economy category. All these documents went through the text preprocessing step outlined above in section 3. 40% of the corpus was selected as training classes, and the remaining 60% was used for testing the classification procedure.

The documents used for training went through the same procedures as did the documents to be classified. Specifically, each document selected to be part of the training classes was preprocessed as outlined above in section 3. Then the N-gram profile was generated. Generating the N-gram profile consisted of the following steps:
1) Split the text into tokens consisting only of letters. All digits are removed.
2) Compute all possible N-grams, for N=3 (tri-grams).
3) Compute the frequency of occurrence of each N-gram.
4) Sort the N-grams according to their frequencies from most frequent to least frequent. Discard the frequencies.
5) This gives us the N-gram profile for a document.
For training class documents, the N-gram profiles were saved in text files.

Each document to be classified went through the text preprocessing phase, then the N-gram profile was generated as described above. The N-gram profile of each text document (document profile) was compared against the profiles of all documents in the training classes (class profile) in terms of similarity. Specifically, two measures were used. The first measure is a distance or dissimilarity measure, called the Manhattan distance [1].
It calculates a rank-order statistic for two profiles by measuring the difference in the positions of an N-gram in two different profiles. For each N-gram in the document profile, search for the N-gram in the class profile and calculate the difference between their positions. For N-grams not found in the class profile, a maximum value is assigned. After all N-grams in the document profile have been exhausted, the sum of the distance measures is computed.

\[ \mathrm{Manhattan}(P_i, P_j) = \sum_{h=1}^{k} \left| P_{ih} - P_{jh} \right| \]

where $P_i$, $P_j$ represent two N-gram profiles. The class that has the smallest Manhattan distance is chosen as the class for the document being classified.

The second measure used is the Dice measure [1] of similarity:

\[ \mathrm{Dice}(P_i, P_j) = \frac{2\,|P_i \cap P_j|}{|P_i| + |P_j|} \]

where $|P_i|$ is the number of elements (N-grams) in profile $P_i$. Using the Dice measure, the class with the largest measure is chosen as the class for the text document being classified.

The results obtained using the Manhattan measure and the Dice measure were compared in terms of precision and recall. Precision and recall are defined in [1] as follows:

\[ \mathrm{Precision} = \frac{CC}{TCF}, \qquad \mathrm{Recall} = \frac{CC}{TC} \]

where
CC: number of correct categories (classes) found
TCF: total number of categories found
TC: total number of correct categories
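The profile-generation steps and the two measures can be sketched as follows. This is an illustrative reading of the description, not the paper's code. The function names are hypothetical, ties in frequency are broken alphabetically only to make the ordering deterministic, and the "maximum value" assigned to missing N-grams is assumed here to be the length of the longer profile, because the text does not state it. For simplicity, `class_profiles` maps one label to one profile.

```python
from collections import Counter


def ngram_profile(tokens, n=3):
    """N-grams of all tokens, sorted from most to least frequent."""
    counts = Counter(
        tok[i:i + n] for tok in tokens for i in range(len(tok) - n + 1)
    )
    # Frequencies are discarded after sorting; only the rank order is kept.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked]


def manhattan_distance(doc_profile, class_profile, max_value=None):
    """Sum of rank differences; N-grams absent from the class get max_value."""
    if max_value is None:
        # Assumption: the penalty is the length of the longer profile.
        max_value = max(len(doc_profile), len(class_profile))
    class_rank = {gram: rank for rank, gram in enumerate(class_profile)}
    total = 0
    for rank, gram in enumerate(doc_profile):
        if gram in class_rank:
            total += abs(rank - class_rank[gram])
        else:
            total += max_value
    return total


def dice_similarity(doc_profile, class_profile):
    """2 * |A intersect B| / (|A| + |B|) over the two sets of N-grams."""
    a, b = set(doc_profile), set(class_profile)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def classify(doc_profile, class_profiles, measure="dice"):
    """Pick the label with the largest Dice or smallest Manhattan value."""
    if measure == "dice":
        return max(
            class_profiles,
            key=lambda c: dice_similarity(doc_profile, class_profiles[c]),
        )
    return min(
        class_profiles,
        key=lambda c: manhattan_distance(doc_profile, class_profiles[c]),
    )


def precision_recall(cc, tcf, tc):
    """Precision = CC / TCF and Recall = CC / TC, as defined above."""
    return cc / tcf, cc / tc
```

A profile built this way is a plain ranked list, so saving it to a text file (one N-gram per line) preserves everything both measures need.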
V. RESULTS

To compare the performance of the tri-gram technique using the Manhattan measure and the Dice measure, the recall and precision values were computed. These values are shown in Tables 1 and 2 respectively. The best result for the tri-gram method using the Manhattan measure was achieved for the sports category with a recall value of 0.88, and the worst result was for the economy category.

Table 1. Recall and precision using the Manhattan measure (columns: Category, Recall, Precision; rows: Sports, Economy, Technology, Weather)

The results for the tri-gram method using the Dice measure exceed those for the Manhattan measure, reaching the highest recall value of 1 for the weather category, followed by 0.98 for the sports category, and 0.89 for the economy category. The two measures produced equally low recall values for the technology category. The reason for this is attributed to the nature of the newspaper articles covering technological issues. They tend to be very diverse, covering a vast range of topics. As a result, the training classes for the technology category did not provide full coverage of all the different topics in the category.

Table 2. Recall and precision using Dice's measure (columns: Category, Recall, Precision; rows: Sports, Economy, Technology, Weather; the Weather row reads 1, 1)

Overall, classification using the Dice measure outperformed classification using the Manhattan measure. The Manhattan measure has provided good classification results for English text documents [2]. The poor performance of the measure for Arabic in this study can be attributed to the nature of the Manhattan measure and to the complex morphological structure of Arabic, which is quite different from the structure of English. Stemming text documents before generating the N-grams may give us comparable results for the two measures.

VI. CONCLUSION

This paper presented the results of classifying Arabic text documents using the N-gram frequency statistics technique employing a dissimilarity measure called the Manhattan distance, and Dice's measure of similarity. The Dice measure was used for comparison purposes. Results showed that N-gram text classification using the Dice measure outperforms classification using the Manhattan measure.

REFERENCES

[1] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley.
[2] W. B. Cavnar and J. M. Trenkle, N-Gram Based Text Categorization, Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval.
[3] M. Damashek, Gauging Similarity with n-grams: Language-Independent Categorization of Text, Science 267, 10 February.
[4] M. H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2003.
[5] R. M. Duwairi, A Distance-based Classifier for Arabic Text Categorization, in Proceedings of the 2005 International Conference on Data Mining, Las Vegas, USA.
[6] M. El-Kourdi, A. Bensaid, and T. Rachidi, Automatic Arabic document categorization based on the Naïve-Bayes Algorithm, Workshop on Computational Approaches to Arabic Script-based Languages, COLING-2004, University of Geneva, Geneva, Switzerland, August.
[7] S. Khoja, Personal communication.
[8] H. S. Mustafa and Q. Al-Radaideh, Using N-Grams for Arabic Text Searching, Journal of the American Society for Information Science and Technology, 55(11).
[9] J. Savoy and Y. Rasolofo, Report on the TREC-11 Experiment: Arabic, Named Page and Topic Distillation Searches, TREC.
[10] H. Sawaf, J. Zaplo, and H. Ney, Statistical classification methods for Arabic news articles, Arabic Natural Language Processing in ACL2001, Toulouse, France, July.
[11] J. Xu, A. Fraser, and R. Weischedel, Empirical Studies in Strategies for Arabic Retrieval, SIGIR 02, Tampere, Finland.
[12] G. K. Zipf, Human Behavior and the Principle of Least Effort, an Introduction to Human Ecology, Addison-Wesley, Reading, Mass., 1949.
More informationFast Feature Value Searching for Face Detection
Vol., No. 2 Computer and Informaton Scence Fast Feature Value Searchng for Face Detecton Yunyang Yan Department of Computer Engneerng Huayn Insttute of Technology Hua an 22300, Chna E-mal: areyyyke@63.com
More informationIssues and Empirical Results for Improving Text Classification
Issues and Emprcal Results for Improvng Text Classfcaton Youngoong Ko 1 and Jungyun Seo 2 1 Dept. of Computer Engneerng, Dong-A Unversty, 840 Hadan 2-dong, Saha-gu, Busan, 604-714, Korea yko@dau.ac.kr
More informationNeural Networks in Statistical Anomaly Intrusion Detection
Neural Networks n Statstcal Anomaly Intruson Detecton ZHENG ZHANG, JUN LI, C. N. MANIKOPOULOS, JAY JORGENSON and JOSE UCLES ECE Department, New Jersey Inst. of Tech., Unversty Heghts, Newark, NJ 72, USA
More informationA New Approach For the Ranking of Fuzzy Sets With Different Heights
New pproach For the ankng of Fuzzy Sets Wth Dfferent Heghts Pushpnder Sngh School of Mathematcs Computer pplcatons Thapar Unversty, Patala-7 00 Inda pushpndersnl@gmalcom STCT ankng of fuzzy sets plays
More informationOutline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like:
Self-Organzng Maps (SOM) Turgay İBRİKÇİ, PhD. Outlne Introducton Structures of SOM SOM Archtecture Neghborhoods SOM Algorthm Examples Summary 1 2 Unsupervsed Hebban Learnng US Hebban Learnng, Cntd 3 A
More informationFeature Selection as an Improving Step for Decision Tree Construction
2009 Internatonal Conference on Machne Learnng and Computng IPCSIT vol.3 (2011) (2011) IACSIT Press, Sngapore Feature Selecton as an Improvng Step for Decson Tree Constructon Mahd Esmael 1, Fazekas Gabor
More informationIncremental Learning with Support Vector Machines and Fuzzy Set Theory
The 25th Workshop on Combnatoral Mathematcs and Computaton Theory Incremental Learnng wth Support Vector Machnes and Fuzzy Set Theory Yu-Mng Chuang 1 and Cha-Hwa Ln 2* 1 Department of Computer Scence and
More informationSupport Vector Machines
/9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.
More informationSpam Filtering Based on Support Vector Machines with Taguchi Method for Parameter Selection
E-mal Spam Flterng Based on Support Vector Machnes wth Taguch Method for Parameter Selecton We-Chh Hsu, Tsan-Yng Yu E-mal Spam Flterng Based on Support Vector Machnes wth Taguch Method for Parameter Selecton
More informationCorner-Based Image Alignment using Pyramid Structure with Gradient Vector Similarity
Journal of Sgnal and Informaton Processng, 013, 4, 114-119 do:10.436/jsp.013.43b00 Publshed Onlne August 013 (http://www.scrp.org/journal/jsp) Corner-Based Image Algnment usng Pyramd Structure wth Gradent
More informationMining Image Features in an Automatic Two- Dimensional Shape Recognition System
Internatonal Journal of Appled Mathematcs and Computer Scences Volume 2 Number 1 Mnng Image Features n an Automatc Two- Dmensonal Shape Recognton System R. A. Salam, M.A. Rodrgues Abstract The number of
More informationIntelligent Information Acquisition for Improved Clustering
Intellgent Informaton Acquston for Improved Clusterng Duy Vu Unversty of Texas at Austn duyvu@cs.utexas.edu Mkhal Blenko Mcrosoft Research mblenko@mcrosoft.com Prem Melvlle IBM T.J. Watson Research Center
More informationAutomatic Text Categorization of Mathematical Word Problems
Automatc Text Categorzaton of Mathematcal Word Problems Suleyman Cetntas 1, Luo S 2, Yan Png Xn 3, Dake Zhang 3, Joo Young Park 3 1,2 Department of Computer Scence, 2 Department of Statstcs, 3 Department
More informationCordial and 3-Equitable Labeling for Some Star Related Graphs
Internatonal Mathematcal Forum, 4, 009, no. 31, 1543-1553 Cordal and 3-Equtable Labelng for Some Star Related Graphs S. K. Vadya Department of Mathematcs, Saurashtra Unversty Rajkot - 360005, Gujarat,
More informationClustering of Words Based on Relative Contribution for Text Categorization
Clusterng of Words Based on Relatve Contrbuton for Text Categorzaton Je-Mng Yang, Zh-Yng Lu, Zhao-Yang Qu Abstract Term clusterng tres to group words based on the smlarty crteron between words, so that
More informationCourse Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms
Course Introducton Course Topcs Exams, abs, Proects A quc loo at a few algorthms 1 Advanced Data Structures and Algorthms Descrpton: We are gong to dscuss algorthm complexty analyss, algorthm desgn technques
More informationA Clustering Algorithm for Key Frame Extraction Based on Density Peak
Journal of Computer and Communcatons, 2018, 6, 118-128 http://www.scrp.org/ournal/cc ISSN Onlne: 2327-5227 ISSN Prnt: 2327-5219 A Clusterng Algorthm for Key Frame Extracton Based on Densty Peak Hong Zhao
More informationA Novel Adaptive Descriptor Algorithm for Ternary Pattern Textures
A Novel Adaptve Descrptor Algorthm for Ternary Pattern Textures Fahuan Hu 1,2, Guopng Lu 1 *, Zengwen Dong 1 1.School of Mechancal & Electrcal Engneerng, Nanchang Unversty, Nanchang, 330031, Chna; 2. School
More informationEYE CENTER LOCALIZATION ON A FACIAL IMAGE BASED ON MULTI-BLOCK LOCAL BINARY PATTERNS
P.G. Demdov Yaroslavl State Unversty Anatoly Ntn, Vladmr Khryashchev, Olga Stepanova, Igor Kostern EYE CENTER LOCALIZATION ON A FACIAL IMAGE BASED ON MULTI-BLOCK LOCAL BINARY PATTERNS Yaroslavl, 2015 Eye
More informationAn Entropy-Based Approach to Integrated Information Needs Assessment
Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology
More informationStudy of Data Stream Clustering Based on Bio-inspired Model
, pp.412-418 http://dx.do.org/10.14257/astl.2014.53.86 Study of Data Stream lusterng Based on Bo-nspred Model Yngme L, Mn L, Jngbo Shao, Gaoyang Wang ollege of omputer Scence and Informaton Engneerng,
More informationCLASSIFICATION OF ULTRASONIC SIGNALS
The 8 th Internatonal Conference of the Slovenan Socety for Non-Destructve Testng»Applcaton of Contemporary Non-Destructve Testng n Engneerng«September -3, 5, Portorož, Slovena, pp. 7-33 CLASSIFICATION
More informationLobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide
Lobachevsky State Unversty of Nzhn Novgorod Polyhedron Quck Start Gude Nzhn Novgorod 2016 Contents Specfcaton of Polyhedron software... 3 Theoretcal background... 4 1. Interface of Polyhedron... 6 1.1.
More informationNovel Pattern-based Fingerprint Recognition Technique Using 2D Wavelet Decomposition
Mathematcal Methods for Informaton Scence and Economcs Novel Pattern-based Fngerprnt Recognton Technque Usng D Wavelet Decomposton TUDOR BARBU Insttute of Computer Scence of the Romanan Academy T. Codrescu,,
More informationSURFACE PROFILE EVALUATION BY FRACTAL DIMENSION AND STATISTIC TOOLS USING MATLAB
SURFACE PROFILE EVALUATION BY FRACTAL DIMENSION AND STATISTIC TOOLS USING MATLAB V. Hotař, A. Hotař Techncal Unversty of Lberec, Department of Glass Producng Machnes and Robotcs, Department of Materal
More informationEXTENDED BIC CRITERION FOR MODEL SELECTION
IDIAP RESEARCH REPORT EXTEDED BIC CRITERIO FOR ODEL SELECTIO Itshak Lapdot Andrew orrs IDIAP-RR-0-4 Dalle olle Insttute for Perceptual Artfcal Intellgence P.O.Box 59 artgny Valas Swtzerland phone +4 7
More informationA mathematical programming approach to the analysis, design and scheduling of offshore oilfields
17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 A mathematcal programmng approach to the analyss, desgn and
More informationText Similarity Computing Based on LDA Topic Model and Word Co-occurrence
2nd Internatonal Conference on Software Engneerng, Knowledge Engneerng and Informaton Engneerng (SEKEIE 204) Text Smlarty Computng Based on LDA Topc Model and Word Co-occurrence Mngla Shao School of Computer,
More information