Probabilistic Anti-Spam Filtering with Dimensionality Reduction
ABSTRACT

One of the biggest problems of e-mail communication is the massive delivery of spam messages. Every day, billions of unwanted messages are sent by spammers, and this number does not stop growing. Fortunately, there are different approaches able to automatically detect and remove most of these messages, and the best-known ones are based on Bayesian decision theory. However, many machine learning techniques applied to text categorization share the same difficulty: the high dimensionality of the feature space. Many term selection methods have been proposed in the literature. Nevertheless, it is still unclear how the performance of naive Bayes anti-spam filters depends on the methods applied for reducing the dimensionality of the feature space. In this paper, we compare the performance of the most popular term selection techniques combined with several variations of the original naive Bayes anti-spam filter.

Categories and Subject Descriptors: I.5 [Pattern Recognition]: Applications; I.2.7 [Artificial Intelligence]: Natural Language Processing - Text analysis

General Terms: Anti-spam filtering

Keywords: Term selection techniques, naive Bayes anti-spam filter, dimensionality reduction

1 INTRODUCTION

The term spam is generally used to denote an unsolicited commercial e-mail. Spam messages are annoying to most users because they clutter their mailboxes. The problem can be quantified in economic terms, since many hours are wasted every day by workers: not just the time they waste reading the spam, but also the time they spend deleting those messages. According to a report published by McAfee¹, the cost in lost productivity is approximately $0.50 per user per day, based on users spending 30 seconds dealing with two spam messages each day and a spam filter working at 95 percent accuracy (the average accuracy achieved by the majority of available anti-spam filters). The productivity loss per employee per year due to spam is therefore approximately $182.50. A company with 1,000 workers earning $30 per hour would thus suffer $182,500 per year in lost productivity. This works out to more than $41,000 per 1 percent of spam allowed into a company.
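These figures follow from simple arithmetic. As a quick check (our illustration, not part of the original paper; the constants are the report's assumptions as quoted above):

```python
# Back-of-the-envelope check of the productivity figures quoted above;
# all constants are the report's stated assumptions, not new data.
wage_per_hour = 30.0        # dollars per hour
seconds_per_message = 30    # time spent on each spam message that gets through
messages_per_day = 2        # spam messages reaching the user per day
workers = 1000              # company size in the quoted example

cost_per_day = wage_per_hour / 3600 * seconds_per_message * messages_per_day
cost_per_year = cost_per_day * 365
print(f"${cost_per_day:.2f} per user per day")                       # $0.50
print(f"${cost_per_year:.2f} per user per year")                     # $182.50
print(f"${cost_per_year * workers:,.2f} per year for the company")   # $182,500.00
```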
Many methods have been proposed to automatically classify messages as spam or legitimate [3]. Among all proposed techniques, machine learning algorithms have achieved the most success [1, 3]. These methods include approaches considered top performers in text categorization, such as Boosting, Support Vector Machines (SVM), and naive Bayes classifiers [1, 5]. The latter currently appear to be very popular in commercial and open-source spam filters. This is probably due to their simplicity, which makes them easy to implement, their linear computational complexity, and their accuracy, which in spam filtering is comparable to that of more elaborate learning algorithms [5].

A major difficulty in dealing with text categorization problems using approaches based on Bayesian probability is the high dimensionality of the feature space [1]. The native feature space consists of the unique terms (characters or words) that occur in messages, which can amount to tens or hundreds of thousands of terms even for a moderate-sized e-mail collection. This is prohibitively high for most learning algorithms (exceptions are k-nearest neighbors and SVM). Hence, it is highly desirable to reduce the native space without sacrificing categorization accuracy, and it is also desirable to achieve such a goal automatically [4, 8].

In this paper, we present a comparative study of the five most used term selection techniques combined with four variants of the original naive Bayes algorithm for anti-spam filtering, in order to examine how term selection techniques affect the categorization accuracy of different anti-spam filters based on Bayesian decision theory.

The remainder of this paper is organized as follows: Section 2 presents details of the term selection techniques. The naive Bayes anti-spam filters are described in Section 3. Section 4 presents the performance measurements used for comparing the achieved results. Section 5 describes the methodology we employ in our experiments. Experimental results are shown in Section 6. Finally, Section 7 offers conclusions and future work.

¹ See reports/mar_spam_report.pdf
2 DIMENSIONALITY REDUCTION

In text categorization, the high dimensionality of the term space T may be problematic. In fact, many algorithms used for classification cannot scale to high values of |T|. As a consequence, a pass of dimensionality reduction is often applied before classification, whose effect is to reduce the size of the vector space from |T| to |T'|, with |T'| << |T|; the set T' is called the reduced term set [7]. Term selection techniques attempt to select, from the original set T, the subset T' of terms (with |T'| << |T|) that, when used for document indexing, yields the highest effectiveness. A computationally easy way of doing this is to keep the |T'| << |T| terms that receive the highest score according to a function that measures the importance of the term for the text categorization task.

2.1 Representation

Assuming that each message m is composed of a set of terms m = {t_1, ..., t_n}, where each term t_k corresponds to a word ("adult", for example), a set of words ("to be removed"), or a single character ("$"), we can represent each message by a vector x = (x_1, ..., x_n), where x_1, ..., x_n are the values of the attributes X_1, ..., X_n associated with the terms t_1, ..., t_n. In the simplest case, each term represents a single word and all attributes are Boolean: X_i = 1 if the message contains t_i, or X_i = 0 otherwise. Alternatively, attributes may be integer values computed as term frequencies (TF), representing how many times each term occurs in the message. A third alternative is to associate each attribute X_i with a normalized TF, x_i = t_i(m) / |m|, where t_i(m) is the number of occurrences of the term represented by X_i in m, and |m| is the length of m measured in term occurrences. The normalized TF takes into account term repetition versus the size of the message [1].
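To make the three encodings concrete, here is a minimal sketch (ours, not the paper's), assuming messages arrive already tokenized into lists of terms; the vocabulary and message below are toy examples:

```python
from collections import Counter

def boolean_vector(message_terms, vocabulary):
    """X_i = 1 if the message contains term t_i, else 0."""
    present = set(message_terms)
    return [1 if t in present else 0 for t in vocabulary]

def tf_vector(message_terms, vocabulary):
    """X_i = number of occurrences of t_i in the message."""
    counts = Counter(message_terms)
    return [counts[t] for t in vocabulary]

def normalized_tf_vector(message_terms, vocabulary):
    """X_i = t_i(m) / |m|: occurrences of t_i divided by message length."""
    counts = Counter(message_terms)
    length = len(message_terms) or 1   # guard against empty messages
    return [counts[t] / length for t in vocabulary]

vocab = ["adult", "free", "$"]
msg = ["free", "adult", "content", "free"]
print(boolean_vector(msg, vocab))        # [1, 1, 0]
print(tf_vector(msg, vocab))             # [1, 2, 0]
print(normalized_tf_vector(msg, vocab))  # [0.25, 0.5, 0.0]
```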
2.2 Term selection techniques

In the following, we describe the five most used Term Space Reduction (TSR) techniques. Probabilities are interpreted on an event space of documents (for example, P(¬t_k, c_i) denotes the probability that, for a random message m, the term t_k does not occur in m and m belongs to category c_i) and are estimated by counting occurrences in the training set Tr. Since there are only two categories, spam (c_s) and legitimate (c_l), some functions are specified locally to a specific category; in order to assess the value of a term t_k in a global, category-independent sense, either the sum

f_sum(t_k) = Σ_{i=1}^{|C|} f(t_k, c_i),

the weighted sum

f_wsum(t_k) = Σ_{i=1}^{|C|} P(c_i) f(t_k, c_i),

or the maximum

f_max(t_k) = max_{i=1}^{|C|} f(t_k, c_i)

of their category-specific values f(t_k, c_i) is computed.

2.2.1 Document frequency (DF)

DF is given by the frequency of messages containing the term t_k in the training set Tr:

DF(t_k) = |Tr_{t_k}| / |Tr|,

where |Tr_{t_k}| is the number of messages containing the term t_k in the training set Tr, and |Tr| is the number of available messages [8].

2.2.2 Information gain (IG)

IG is frequently employed as a term-goodness criterion in the field of machine learning [6]. It measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a message [4]. The IG of a term t_k is computed as

IG(t_k) = Σ_{c ∈ {c_i, ¬c_i}} Σ_{t ∈ {t_k, ¬t_k}} P(t, c) log [ P(t, c) / (P(t) P(c)) ].

2.2.3 Mutual information (MI)

MI is a criterion commonly used in statistical language modeling of word associations and related applications [8]. The mutual information between t_k and c_i is defined as

MI(t_k, c_i) = log [ P(t_k, c_i) / (P(t_k) P(c_i)) ].

MI(t_k, c_i) has a natural value of zero if t_k and c_i are independent. To measure the goodness of a term in global feature selection, we can combine the category-specific scores of a term in the three alternative ways: f_sum, f_wsum, or f_max.

2.2.4 χ² statistic

The χ² statistic measures the lack of independence between the term t_k and the class c_i. It can be compared to the χ² distribution with one degree of freedom to judge extremeness. The χ² statistic has a natural value of zero if t_k and c_i are independent. We can calculate the χ² statistic for the term t_k in the class c_i by

χ²(t_k, c_i) = |Tr| [ P(t_k, c_i) P(¬t_k, ¬c_i) − P(t_k, ¬c_i) P(¬t_k, c_i) ]² / [ P(t_k) P(¬t_k) P(c_i) P(¬c_i) ].

2.2.5 Odds ratio (OR)

OR measures the ratio between the odds of the term appearing in a relevant document and the odds of it appearing in a non-relevant one. The odds ratio between t_k and c_i is given by

OR(t_k, c_i) = [ P(t_k|c_i) (1 − P(t_k|¬c_i)) ] / [ (1 − P(t_k|c_i)) P(t_k|¬c_i) ].

An OR of 1 indicates that the term t_k is equally likely in both classes c_i and ¬c_i. An OR greater than 1 indicates that t_k is more likely in c_i, whereas an OR less than 1 indicates that t_k is less likely in c_i. As with MI, to measure the goodness of a term in global feature selection, we can combine the category-specific scores using the functions f_sum, f_wsum, or f_max.
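The five scores can be computed for a single (term, category) pair from the four document counts observed in Tr (term present/absent crossed with category/complement). The sketch below is our illustration, not the paper's code; it assumes all four counts are positive and omits the zero-count smoothing a real filter would need:

```python
import math

def tsr_scores(n_tc, n_tnc, n_ntc, n_ntnc):
    """Score one term against one category from the four training-set document
    counts: (t_k, c_i), (t_k, not c_i), (not t_k, c_i), (not t_k, not c_i)."""
    n = n_tc + n_tnc + n_ntc + n_ntnc
    p_tc, p_tnc = n_tc / n, n_tnc / n
    p_ntc, p_ntnc = n_ntc / n, n_ntnc / n
    p_t, p_c = p_tc + p_tnc, p_tc + p_ntc

    df = p_t                                          # document frequency
    ig = sum(p * math.log(p / (pt * pc))              # information gain
             for p, pt, pc in [(p_tc, p_t, p_c), (p_tnc, p_t, 1 - p_c),
                               (p_ntc, 1 - p_t, p_c), (p_ntnc, 1 - p_t, 1 - p_c)])
    mi = math.log(p_tc / (p_t * p_c))                 # mutual information
    chi2 = (n * (p_tc * p_ntnc - p_tnc * p_ntc) ** 2
            / (p_t * (1 - p_t) * p_c * (1 - p_c)))    # chi-squared statistic
    p_t_c = n_tc / (n_tc + n_ntc)                     # P(t_k | c_i)
    p_t_nc = n_tnc / (n_tnc + n_ntnc)                 # P(t_k | not c_i)
    odds = p_t_c * (1 - p_t_nc) / ((1 - p_t_c) * p_t_nc)  # odds ratio
    return {"DF": df, "IG": ig, "MI": mi, "CHI2": chi2, "OR": odds}

print(tsr_scores(40, 5, 10, 45))  # a term strongly associated with c_i
```

Per-category scores from such a function would then be combined with f_sum, f_wsum, or f_max as described above to obtain a single global ranking of terms.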
3 NAIVE BAYES ANTI-SPAM FILTERS

Naive Bayes (NB) anti-spam filtering has become the most popular mechanism to distinguish spam messages from legitimate ones [5]. From Bayes' theorem and the theorem of total probability, the probability that a message with vector x = (x_1, ..., x_n) belongs to a category c_i ∈ {c_s, c_l} is

P(c_i | x) = P(c_i) P(x | c_i) / P(x).

Since the denominator does not depend on the category, the filter classifies each message into the category that maximizes P(c_i) P(x | c_i). In spam filtering, this is equivalent to classifying a message as spam (c_s) whenever

P(c_s) P(x | c_s) / [ P(c_s) P(x | c_s) + P(c_l) P(x | c_l) ] > T,

with T = 0.5. By varying T, we can opt for more true negatives at the expense of fewer true positives, or vice versa. The a priori probabilities P(c_i) can be estimated as the frequency of occurrence of documents belonging to category c_i in the training set Tr. On the other hand, P(x | c_i) is practically impossible to estimate directly, because we would need the training set Tr to contain messages identical to the one we want to classify. However, the NB classifier makes the simplifying assumption that the terms in a message are conditionally independent and that the order in which they appear is irrelevant. The probabilities P(x | c_i) are estimated differently in each NB version [5]. Several studies have found the NB classifier to be surprisingly effective in the spam filtering task [1, 5], despite the fact that its independence assumption is usually oversimplistic. In the following, we describe four well-known versions of the NB anti-spam filter available in the literature.

3.1 Multinomial term frequency naive Bayes

The multinomial term frequency NB (MN TF NB) represents each message as a set of terms m = {t_1, ..., t_n}, counting how many times each t_k appears in m. In this sense, m can be represented by a vector x = (x_1, ..., x_n), where each x_k corresponds to the number of occurrences of t_k in m. Moreover, each message m of category c_i can be interpreted as the result of independently picking |m| terms from T with replacement, with probability P(t_k | c_i) for each t_k. Hence, P(x | c_i) follows the multinomial distribution:

P(x | c_i) = P(|m|) |m|! Π_{k=1}^{n} [ P(t_k | c_i)^{x_k} / x_k! ].

The criterion for classifying a message as spam becomes

P(c_s) Π_{k=1}^{n} P(t_k | c_s)^{x_k} / [ Σ_{c_i ∈ {c_s, c_l}} P(c_i) Π_{k=1}^{n} P(t_k | c_i)^{x_k} ] > T,

and the probabilities P(t_k | c_i) are estimated with a Laplacian prior:

P(t_k | c_i) = (1 + N_{t_k, c_i}) / (n + N_{c_i}),

where N_{t_k, c_i} is the number of occurrences of term t_k in the training messages of category c_i, and N_{c_i} = Σ_{k=1}^{n} N_{t_k, c_i}.

3.2 Multinomial Boolean naive Bayes

The multinomial Boolean NB (MN Boolean NB) is similar to the MN TF NB, including the estimates of P(t_k | c_i), except that each attribute x_k is Boolean.

3.3 Multivariate Bernoulli naive Bayes

Let T = {t_1, ..., t_n} be the resulting set of terms after term selection. The multivariate Bernoulli NB (MV Bernoulli NB) represents each message m only by the presence or absence of each term. Therefore, m can be represented as a binary vector x = (x_1, ..., x_n), where each x_k indicates whether or not t_k occurs in m. The probabilities P(x | c_i) are computed by [5]

P(x | c_i) = Π_{k=1}^{n} P(t_k | c_i)^{x_k} (1 − P(t_k | c_i))^{(1 − x_k)}.

The criterion for classifying a message as spam becomes

P(c_s) Π_{k=1}^{n} P(t_k | c_s)^{x_k} (1 − P(t_k | c_s))^{(1 − x_k)} / [ Σ_{c_i ∈ {c_s, c_l}} P(c_i) Π_{k=1}^{n} P(t_k | c_i)^{x_k} (1 − P(t_k | c_i))^{(1 − x_k)} ] > T,

where the probabilities P(t_k | c_i) are estimated with a Laplacian prior:

P(t_k | c_i) = (1 + |Tr_{t_k, c_i}|) / (2 + |Tr_{c_i}|),

where |Tr_{t_k, c_i}| is the number of training messages of category c_i that contain the term t_k, and |Tr_{c_i}| is the total number of training messages of category c_i.

3.4 Flexible Bayes

Flexible Bayes (FB) represents the probabilities P(x_k | c_i) as the average of L_{k,c_i} normal distributions with different values for μ_{k,c_i} but the same σ_{c_i}:

P(x_k | c_i) = (1 / L_{k,c_i}) Σ_{l=1}^{L_{k,c_i}} g(x_k; μ_{k,c_i,l}, σ_{c_i}),

where L_{k,c_i} is the number of different values that the attribute X_k takes in the training set Tr of category c_i. Each of these values is used as the mean μ_{k,c_i,l} of a normal distribution of category c_i. All distributions of a category c_i, however, are taken to have the same standard deviation, σ_{c_i} = 1 / sqrt(|Tr_{c_i}|) [1].
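As a minimal sketch of how such a filter fits together (ours, not the authors' implementation), the class below implements the MN TF NB criterion with the Laplacian prior; the labels "spam" and "ham" and the toy training data are assumptions for the example:

```python
import math
from collections import Counter

class MultinomialTFNB:
    """Minimal MN TF NB sketch (Section 3.1); the Boolean variant of
    Section 3.2 is obtained by clipping each count x_k to 0/1."""

    def fit(self, messages, labels):
        n_docs = len(messages)
        self.vocab = {t for m in messages for t in m}
        self.priors, self.counts, self.totals = {}, {}, {}
        for c in set(labels):
            docs = [m for m, y in zip(messages, labels) if y == c]
            self.priors[c] = len(docs) / n_docs                    # P(c_i)
            self.counts[c] = Counter(t for m in docs for t in m)   # N_tk,ci
            self.totals[c] = sum(self.counts[c].values())          # N_ci
        return self

    def _log_joint(self, message, c):
        # log P(c) + sum_k x_k log P(t_k|c); the multinomial coefficient is
        # dropped because it cancels in the classification criterion.
        n = len(self.vocab)
        score = math.log(self.priors[c])
        for t, x in Counter(message).items():
            p = (1 + self.counts[c].get(t, 0)) / (n + self.totals[c])  # Laplacian prior
            score += x * math.log(p)
        return score

    def spam_score(self, message):
        # P(c_s)P(x|c_s) / sum_ci P(c_i)P(x|c_i), computed in log space
        logs = {c: self._log_joint(message, c) for c in self.priors}
        mx = max(logs.values())
        return math.exp(logs["spam"] - mx) / sum(math.exp(v - mx) for v in logs.values())

clf = MultinomialTFNB().fit(
    [["cheap", "pills", "cheap"], ["meeting", "today"], ["cheap", "meds"]],
    ["spam", "ham", "spam"])
print(clf.spam_score(["cheap", "meds"]) > 0.5)  # True: classified as spam (T = 0.5)
```

The MV Bernoulli variant of Section 3.3 would instead score binary vectors over the whole vocabulary, multiplying in (1 − P(t_k | c_i)) for every term absent from the message and estimating P(t_k | c_i) as (1 + |Tr_{t_k, c_i}|) / (2 + |Tr_{c_i}|).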
4 PERFORMANCE MEASUREMENTS

Let S and L be the sets of spam and legitimate messages, respectively. The possible prediction results are: true positives (TP), the set of spam messages correctly classified; true negatives (TN), the set of legitimate messages correctly classified; false negatives (FN), the set of spam messages incorrectly classified as legitimate; and false positives (FP), the set of legitimate messages incorrectly classified as spam. Some well-known evaluation measurements are: true positive rate (Tpr), true negative rate (Tnr), spam precision (Spr), legitimate precision (Lpr), ROC curves [3], precision, recall [5], accuracy rate (Acc), and Total Cost Ratio (TCR) [1]. TCR was first proposed by [1] and offers an indication of the improvement provided by the filter. A greater TCR indicates better performance, and for TCR < 1, not using the filter is better. However, we opt to use the Matthews Correlation Coefficient [2],

MCC = ( |TP| |TN| − |FP| |FN| ) / sqrt( (|TP| + |FP|) (|TP| + |FN|) (|TN| + |FP|) (|TN| + |FN|) ),

since it provides more information than TCR. It returns a real value between −1 and +1. A coefficient equal to +1 indicates a perfect prediction; 0, an average random prediction; and −1, an inverse prediction. According to [2], MCC is one of the best measures, as it often provides a more balanced evaluation of the prediction than other measures, such as the proportion of correct predictions (accuracy), especially if the two classes are of very different sizes.
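MCC is straightforward to compute from the four outcome counts. The sketch below (ours) also illustrates the point about class imbalance: a degenerate filter that labels every message as spam scores well on accuracy but zero on MCC:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; returns 0.0 when any marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# 90 spam and 10 legitimate messages, everything labeled "spam":
tp, tn, fp, fn = 90, 0, 10, 0
print((tp + tn) / (tp + tn + fp + fn))  # accuracy = 0.9, looks good
print(mcc(tp, tn, fp, fn))              # 0.0: no better than random
```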
5 METHODOLOGY

We use the six well-known Enron corpora (E1-E6) [5] in our experiments. In order to provide an aggressive dimensionality reduction, we sort the messages according to their arrival date and perform the training stage using the first 90% of the received messages. The remaining ones are set apart for classification. After collecting all terms of the training messages, we remove the irrelevant ones; in this case, we consider as irrelevant all terms that appear in fewer than five training messages (less than 0.1% of each corpus). Once the training stage has finished, we apply a term selection technique to reduce the term space dimensionality. In order to provide a complete evaluation, we vary the number of terms from 10% to 100% of all terms retained in the preprocessing stage. Next, we classify the testing messages using the naive Bayes anti-spam filters presented in Section 3. We evaluate all possible combinations of NB anti-spam filters and term selection methods.

The first studies on naive Bayes anti-spam filters employed an unbalanced classification using cost-sensitive learning, varying the value of the threshold T depending on the usage scenario. However, recent works have shown that the choice of T is very difficult, because it can vary for each user. The same works indicate that setting T = 0.5 seems reasonable and generally offers a good average prediction [5].
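A sketch of the two preprocessing steps (ours; the "date" and "terms" record fields are assumed for illustration, not taken from the paper):

```python
from collections import Counter

def chronological_split(messages, train_fraction=0.9):
    """Sort by arrival date and train on the earliest 90% (Section 5)."""
    ordered = sorted(messages, key=lambda m: m["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

def prune_rare_terms(train_messages, min_docs=5):
    """Keep only terms that occur in at least `min_docs` training messages."""
    doc_freq = Counter(t for m in train_messages for t in set(m["terms"]))
    return {t for t, df in doc_freq.items() if df >= min_docs}
```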
6 EXPERIMENTAL RESULTS

In the following, we present the results achieved for each corpus. In the remainder of this section, consider the following abbreviations: MN Boolean NB as MN Bool, MN term frequency NB as MN TF, MV Bernoulli NB as Bern, and flexible Bayes as FB. Due to space limitations, we present the best results and the best average prediction achieved by each NB classifier using the different term selection techniques (TSTs). We define the best result as the maximum MCC attained by the combination of one NB classifier and one TST using a specific number of terms². Tables 1, 3, 5, 7, 9, and 11 present the best performance achieved by each classifier for each Enron dataset; Tables 2, 4, 6, 8, 10, and 12 detail the results attained by the two classifiers that accomplished the best performance. The combinations reported in each table are:

Table 1 (E1, best result per filter): Bern with OR_wsum; MN Bool with OR_wsum; MN TF with OR_wsum; FB with IG.
Table 2 (E1, two best individual performances): Bern & OR_wsum and MN Bool & OR_wsum, reporting |T'| (% of |T|), Tpr, Spr, Tnr, Lpr, Acc, TCR, and MCC.
Table 3 (E2, best result per filter): Bern with OR_sum; MN TF with OR_sum; MN Bool with OR_sum; FB with IG.
Table 4 (E2, two best individual performances): Bern & OR_sum and MN TF & OR_sum.
Table 5 (E3, best result per filter): Bern with IG; MN Bool with IG; MN TF with IG; FB with MI_max.
Table 6 (E3, two best individual performances): Bern & IG and MN Bool & IG.
Table 7 (E4, best result per filter): Bern with IG; MN Bool with OR_wsum; MN TF with OR_wsum; FB with OR_wsum.
Table 8 (E4, two best individual performances): Bern & IG and MN Bool & OR_wsum.
Table 9 (E5, best result per filter): Bern with OR_sum; MN Bool with OR_wsum; MN TF with OR_sum; FB with χ².
Table 10 (E5, two best individual performances): Bern & OR_sum and MN Bool & OR_wsum.
Table 11 (E6, best result per filter): Bern with OR_sum; MN Bool with OR_sum; FB with IG; MN TF with OR_sum.
Table 12 (E6, two best individual performances): Bern & OR_sum and MN Bool & OR_sum.

Note that in Table 2, the combination of MV Bernoulli NB with OR_wsum attained the same TCR as MN Boolean NB with OR_wsum, yet their MCCs differ. This happens because the MCC offers a more balanced evaluation of the prediction, particularly when the classes are of different sizes, as discussed in Section 4. The E1 dataset has more legitimate messages than spam; hence, false negatives affect the classifiers' performance more than false positives.

Table 8 shows another drawback of the TCR. MV Bernoulli NB with IG using 20% of |T| achieved a perfect prediction (MCC = 1.000 and TCR = +∞) for Enron 4. On the other hand, MN Boolean NB with OR_wsum using 40% of |T| classified only one spam message incorrectly as legitimate (|FP| = 0, |FN| = 1) and accomplished MCC = 0.996 and TCR = 450. If we analyzed only the TCR, we would wrongly conclude that the first combination was much better than the second one.

The results indicate that the classifiers generally worsen their performance when the complete set of terms T is used for classification. However, we found a trade-off between 40% and 60% of |T|, which usually achieves the best performance. Even a set of selected terms composed of only 10% to 30% of |T| offers better results than the set with all terms of T. Collectively, regarding the term selection techniques, the reported experiments indicate that {OR, IG} > {χ², DF} >> MI, where ">" means "performs better than". We found, however, that IG and χ² are less sensitive to the variation of |T'|, while MI presented the worst individual and average performance. We also verified that the performance of the NB classifiers is highly sensitive to the quality of the terms selected by the TSTs and to the number of selected terms |T'|.

With respect to the classifiers, the results indicate that {MV Bernoulli NB} > {MN Boolean NB, MN term frequency NB} > {flexible Bayes}. MV Bernoulli NB attained the best individual performance for the majority of the datasets used in our experiments. It should be emphasized that MV Bernoulli NB is the only approach that takes into account the absence of terms in the messages. Our experiments also show that filters using Boolean attributes achieve better results than those using term frequencies.

² A comprehensive set of results, including all tables and figures, is available online.

7 CONCLUSIONS AND FURTHER WORK

In this paper, we presented an evaluation of feature selection methods for dimensionality reduction in the anti-spam filtering domain. We compared the performance achieved by four NB anti-spam filters with different kinds of representation, applied to classify messages as legitimate or spam on six public, large, and real datasets, after a pass of dimensionality reduction performed by five term selection techniques with a varying number of selected terms.

Regarding the TSTs, we found OR and IG to be the most effective in aggressive term removal without losing categorization accuracy. On the other hand, the employment of MI generally offers poor results that frequently worsen the classifier's performance. The results also indicate that IG and χ² are less sensitive to the variation of |T'|. Among all classifiers, MV Bernoulli NB achieved the best performance. The results also verify that Boolean attributes perform better than term frequency ones. We further confirm that the performance of NB classifiers is highly sensitive to the selected attributes and to the number of terms selected by the TSTs in the training stage: the better the term selection technique, the better the classifier's prediction.

Future work should take into consideration that spam filtering is a coevolutionary problem: while the filter tries to evolve its prediction capacity, spammers try to evolve their spam messages in order to outwit the classifiers. An efficient approach should have an effective way to adjust its rules in order to detect changes in spam features. In this way, collaborative filters can be used to assist the classifier, accelerating the adaptation of the rules and increasing the classifier's performance. Moreover, spammers generally insert a large amount of noise in spam messages in order to hinder probability estimation. Thus, filters should have a flexible way to compare terms in the classification task; approaches based on fuzzy logic can be employed to make this comparison more flexible.
8 REFERENCES

[1] I. Androutsopoulos, G. Paliouras, and E. Michelakis. Learning to filter unsolicited commercial e-mail. Technical Report 2004/2, National Centre for Scientific Research "Demokritos", Athens, Greece, March 2004.
[2] P. Baldi, S. Brunak, Y. Chauvin, C. Andersen, and H. Nielsen. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16(5), May 2000.
[3] G. Cormack. Email spam filtering: A systematic review. Foundations and Trends in Information Retrieval, 1(4).
[4] G. Forman and E. Kirshenbaum. Extremely fast text feature extraction for classification and indexing. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, Napa Valley, CA, USA, November 2008.
[5] V. Metsis, I. Androutsopoulos, and G. Paliouras. Spam filtering with naive Bayes - which naive Bayes? In Proceedings of the 3rd International Conference on Email and Anti-Spam, pages 1-5, Mountain View, CA, USA, July 2006.
[6] T. Mitchell. Machine Learning. McGraw-Hill, 1997.
[7] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47, March 2002.
[8] Y. Yang and J. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning, Nashville, TN, USA, July 1997.